Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

7056 Skip unknown types in a dataset which cause the whole dataset to… #7057

Merged
merged 3 commits into from Jul 13, 2020
Merged

7056 Skip unknown types in a dataset which cause the whole dataset to… #7057

merged 3 commits into from Jul 13, 2020

Conversation

JingMa87
Copy link
Contributor

@JingMa87 JingMa87 commented Jul 6, 2020

… fail from being harvested

What this PR does / why we need it: Makes datasets with unknown types harvestable using dataverse_json

Which issue(s) this PR closes: 7056

Closes #7056

Special notes for your reviewer: Good fix

Suggestions on how to test this: Harvest the Princeton_Authored_Datasets OAI set from the harvard repo using dataverse_json. Without this fix you will get 267 successful harvest and 2 fails. With the fix you will harvest all 269 datasets. The controversial datasets are https://dataverse.harvard.edu/api/datasets/export?exporter=dataverse_json&persistentId=doi:10.7910/DVN/B6OJKG and https://dataverse.harvard.edu/api/datasets/export?exporter=dataverse_json&persistentId=doi:10.7910/DVN/I5O6OS.

Does this PR introduce a user interface change? If mockups are available, please link/include them here: No

Is there a release notes update needed for this change?: I think it's an important change for harvesting from another dataverse.

Additional documentation: N/A

@coveralls
Copy link

coveralls commented Jul 6, 2020

Coverage Status

Coverage increased (+0.003%) to 19.656% when pulling 057a8af on JingMa87:7056-skip-unknown-field-type into 6daf219 on IQSS:develop.

IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) automation moved this from Code Review 🦁 to QA 🔎✅ Jul 9, 2020
Copy link
Member

@pdurbin pdurbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fine so I'm moving it to QA. If other developers want to talk a look, please go ahead. We've never promoted using the dataverse_json format for harvesting (I believe DDI is preferred) but this should make it more possible to use it.

(I don't think it's worth sending back to the contributor over the wildcard imports but if someone feels strongly about this we certainly can.)

import edu.harvard.iq.dataverse.FileMetadata;
import edu.harvard.iq.dataverse.MetadataBlockServiceBean;
import edu.harvard.iq.dataverse.TermsOfUseAndAccess;
import edu.harvard.iq.dataverse.*;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mentioned this in a review on another pull request but we favor using the full import statement instead of wildcards.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll reverse the change, my Intellij automatically does this. Super nice feature though otherwise.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I've heard that about Intellij. Thanks for reversing the change.

@@ -566,7 +553,8 @@ public DatasetField parseField(JsonObject json, Boolean testType) throws JsonPar


if (type == null) {
throw new JsonParseException("Can't find type '" + json.getString("typeName", "") + "'");
logger.info("Can't find type '" + json.getString("typeName", "") + "'");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this going to happen a lot? If so, should it be reduced from "info" to "fine" so it doesn't clutter the logs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pdurbin This'll happen potentially multiple thousands of times when doing a harvesting run on a big OAI set for the first time. So I'll just change it to "fine".

@jggautier
Copy link
Contributor

We've never promoted using the dataverse_json format for harvesting (I believe DDI is preferred) but this should make it more possible to use it.

Maybe not in the guides but I've promoted it to reduce information loss and most of the harvesting Harvard Dataverse does from other Dataverse installations is using dataverse_json. I would think that DDI Codebook would be preferred if all metadata could be mapped to it, but that's probably going to be less and less true with increasing support for other standards.

@kcondon kcondon self-assigned this Jul 10, 2020
@kcondon
Copy link
Contributor

kcondon commented Jul 10, 2020

@JingMa87 I am still seeing the two failures.

<record>
  <date>2020-07-10T16:28:10</date>
  <millis>1594398490946</millis>
  <sequence>8513</sequence>
  <logger>edu.harvard.iq.dataverse.harvest.client.HarvesterServiceBean.kc0605202020-07-10T16-27-22</logger>
  <level>SEVERE</level>
  <class>edu.harvard.iq.dataverse.harvest.client.HarvesterServiceBean</class>
  <method>logGetRecordException</method>
  <thread>187</thread>
  <message>Exception processing getRecord(), oaiUrl=https://dataverse.harvard.edu/oai, identifier=doi:10.7910/DVN/B6OJKG, edu.harvard.iq.dataverse.api.imports.ImportException, Failed to import harvested dataset: class edu.harvard.iq.dataverse.engine.command.exception.CommandException (Command [DatasetCreate dataset:20721] failed: Exception thrown from bean: javax.ejb.EJBTransactionRolledbackException: Exception thrown from bean: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/collection1: ERROR: [doc=dataset_20721] unknown field 'mraCollection')</message>
</record>

@JingMa87
Copy link
Contributor Author

@JingMa87 I am still seeing the two failures.

<record>
  <date>2020-07-10T16:28:10</date>
  <millis>1594398490946</millis>
  <sequence>8513</sequence>
  <logger>edu.harvard.iq.dataverse.harvest.client.HarvesterServiceBean.kc0605202020-07-10T16-27-22</logger>
  <level>SEVERE</level>
  <class>edu.harvard.iq.dataverse.harvest.client.HarvesterServiceBean</class>
  <method>logGetRecordException</method>
  <thread>187</thread>
  <message>Exception processing getRecord(), oaiUrl=https://dataverse.harvard.edu/oai, identifier=doi:10.7910/DVN/B6OJKG, edu.harvard.iq.dataverse.api.imports.ImportException, Failed to import harvested dataset: class edu.harvard.iq.dataverse.engine.command.exception.CommandException (Command [DatasetCreate dataset:20721] failed: Exception thrown from bean: javax.ejb.EJBTransactionRolledbackException: Exception thrown from bean: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/collection1: ERROR: [doc=dataset_20721] unknown field 'mraCollection')</message>
</record>

@kcondon I can't reproduce the error. I've looked into the DatasetCreate Command but without the actual error happening it's like looking for a needle in a haystack. Are you on the latest version with the latest changes? What happens if you re-run the harvest, does it work the second time?

@kcondon
Copy link
Contributor

kcondon commented Jul 11, 2020

@JingMa87 I have run it again and still see the failures. Running it again just results in 0 harvested, 0 failed. Looks like failures do are not harvested again.

Is it possible you have some unchecked in code on your local build env?

@JingMa87
Copy link
Contributor Author

JingMa87 commented Jul 11, 2020

@kcondon I pushed all my changes to the branch already. What if you delete the harvesting client, add it again and re-run, do you still get the issue? Sometimes I get weird intermittent errors that disappear if I retry.

@kcondon
Copy link
Contributor

kcondon commented Jul 11, 2020 via email

@JingMa87
Copy link
Contributor Author

Does this query in the database return something for you? If so, this row is not supposed to be there and stops the field type mraCollection from being filtered out by my code change. Then, it gets passed to Solr and the error happens.
SELECT * FROM public.datasetfieldtype where name = 'mraCollection';

I've looked into adding debugging code, but the class where the error happens (CreateHarvestedDatasetCommand) is a whole big mess of polymorphism, generics and interfaces gone wrong. When I use the debugger, I can't inspect the execute() function of that Command where the error happens so I'm relegating to reading the code. Hopefully this helps.

@kcondon
Copy link
Contributor

kcondon commented Jul 13, 2020

@JingMa87 Yes, that sql query returns a record. Why do you say it should not be there? Also, wouldn't you have seen the same record in your own test? I understand about the code complexity. I was thinking about a lower tech debugging of adding debug logging if that helps.

Ok, I think I understand what you were getting at: the purpose of the fix is so that harvest clients who do not have this field in their db can skip a field that is present on the harvest server/source. I did not consider this when I looked at the test case. However, it should actually work normally in the case where it is present? I will test with and without it on the client so at least we have information and compare to the code currently on develop branch.

A quick test using another branch shows json format harvest also fails when the field is present on both client and server. Although this should work, I believe it is outside the scope of this ticket. Will confirm the original intent, masking the fields that are not present on the client.

@JingMa87
Copy link
Contributor Author

JingMa87 commented Jul 14, 2020

@kcondon I made a new ticket for clients that do have the field in the DB #7076 and will try to fix that in the coming days.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

Unknown types in a dataset causes the whole dataset to fail from being harvested
6 participants