Filter dataset/file pids with special characters #4811

Closed
matthew-a-dunlap opened this issue Jul 3, 2018 · 4 comments

@matthew-a-dunlap
Contributor

matthew-a-dunlap commented Jul 3, 2018

As part of #4761 we have been discussing PID structure and the challenges external systems face when using these values. There is concern that we will start importing PIDs with many unusual characters, for example this valid DOI:

https://doi.org/10.1002/(sici)1099-1409(199908/10)3:6/7<672::aid-jpp192>3.0.co;2-8

These edge-case PIDs could break systems, enable injection attacks, and so on. This is a greater concern now that we are trying to have all external systems use PIDs to reference files and datasets. Those systems are unlikely to expect such a variety of characters, especially legacy systems that only expect the alphanumeric PIDs generated in Dataverse.

One solution we discussed was to enforce limits on the characters we import. The general idea is to be strict but simple at first, then add edge cases as they are requested.

I did not look into Handle or EZID, though my limited understanding is that they allow a different and unusual assortment of characters. For more on DOI structure, see https://support.datacite.org/docs/doi-basics (DataCite) and https://www.crossref.org/blog/dois-and-matching-regular-expressions/ (Crossref).

matthew-a-dunlap changed the title from "filter file pids with special characters" to "Filter file pids with special characters" on Jul 3, 2018
@matthew-a-dunlap
Contributor Author

matthew-a-dunlap commented Jul 3, 2018

This was originally discussed in #4606 (review), but we decided to break it out into its own story. We really shouldn't release #3083 without it, though.

@djbrooke
Contributor

djbrooke commented Jul 9, 2018

Thanks @scolapasta and @matthew-a-dunlap for the post-standup discussion. Moved to backlog for discussion. Before we pick this up, we should determine the criteria for filtering.

matthew-a-dunlap changed the title from "Filter file pids with special characters" to "Filter dataset/file pids with special characters" on Jul 9, 2018
pdurbin self-assigned this on Jul 16, 2018
pdurbin removed their assignment on Jul 16, 2018
@pdurbin
Member

pdurbin commented Jul 16, 2018

I just made pull request #4852 and moved this issue to code review.

In addition to the JUnit tests I added, I've been testing the running system with this:

curl -H "X-Dataverse-key: $API_TOKEN" -X POST "$SERVER_URL/api/dataverses/$DV_ALIAS/datasets/:import?pid=$PERSISTENT_IDENTIFIER&release=yes" --upload-file scripts/api/data/dataset-package-files.json

@scolapasta
Contributor

This is a good start, but it only blocks certain characters and would need to be updated for others. Let's take the opposite approach: define what we allow and make sure PIDs parse that way. When some group says they need us to allow something else, we can consider adding it to what is allowable.
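
To make the allowlist idea concrete, here is a minimal shell sketch of a pre-import check. It is not the code from #4852. The function name `is_allowed_pid` is hypothetical, and the permitted set (ASCII letters, digits, and `.`, `_`, `/`, `:`, `-`) is only an assumption for illustration; the thread has not yet settled the actual criteria.

```shell
# Hypothetical helper, not part of Dataverse: succeeds only if the PID is
# non-empty and every character belongs to an ASSUMED allowlist.
# The body runs in a subshell, so the locale change and `exit` stay local.
is_allowed_pid() (
    LC_ALL=C   # keep the A-Z, a-z and 0-9 ranges limited to ASCII
    case "$1" in
        # Reject an empty value, or any value containing a character
        # outside the allowlist ([!...] is a negated bracket expression).
        "" | *[!A-Za-z0-9._/:-]*) exit 1 ;;
    esac
    exit 0
)
```

A test script could run `is_allowed_pid "$PERSISTENT_IDENTIFIER" || exit 1` before the curl import call above. Under this assumed set, a PID such as `doi:10.5072/FK2/ABC123` passes, while the example DOI from the issue description is rejected because it contains `(`, `<`, `>` and `;`. Allowing something new later means adding a character to one bracket expression, which fits the "add it when a group asks" approach.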
