Filter dataset/file pids with special characters #4811

matthew-a-dunlap · 2018-07-03T22:43:48Z

As part of #4761 we have been discussing pid structure and the challenges of external systems using these values. There is concern that we will start importing pids with a lot of weird characters, for example this valid DOI:

https://doi.org/10.1002/(sici)1099-1409(199908/10)3:6/7<672::aid-jpp192>3.0.co;2-8

These edge-case pids could break systems, open the door to injection attacks, etc. This is of more concern because we are trying to have all external systems use pids to reference files and datasets, and those systems are unlikely to expect such a variety of characters, especially legacy systems that only expect the alphanumeric pids generated by Dataverse.

A solution that was discussed was to enforce limits on the characters we import. The general idea is to be pretty strict but simple and then add in edge cases as requested.

I did not look into Handle or EZID, tho my limited understanding is that they allow a different/weird assortment of characters. For more info on DataCite's pids and on matching DOIs with regular expressions, see https://support.datacite.org/docs/doi-basics and https://www.crossref.org/blog/dois-and-matching-regular-expressions/

matthew-a-dunlap · 2018-07-03T22:45:00Z

This was originally discussed in #4606 (review) but we decided to break it out into its own story. We really shouldn't release #3083 without it tho.

djbrooke · 2018-07-09T15:39:51Z

Thanks @scolapasta and @matthew-a-dunlap for the discussion post standup. Moved to backlog for discussion. Before we pick this up, we should determine the criteria for filtering.

pdurbin · 2018-07-16T18:50:55Z

I just made pull request #4852 and moved this issue to code review.

In addition to the JUnit tests I added, I've been testing the running system with this:

curl -H "X-Dataverse-key: $API_TOKEN" -X POST "$SERVER_URL/api/dataverses/$DV_ALIAS/datasets/:import?pid=$PERSISTENT_IDENTIFIER&release=yes" --upload-file scripts/api/data/dataset-package-files.json
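When testing with pids like the DOI in the issue description, the value has to be percent-encoded before it goes into the `pid` query parameter, otherwise characters such as `<`, `>`, and `;` will not survive the URL. A minimal sketch of building the same import URL in Python, assuming the endpoint shape from the curl command above; the function name and the example server, alias, and `doi:` prefix are made up for illustration:

```python
from urllib.parse import quote


def build_import_url(server_url: str, dv_alias: str, pid: str) -> str:
    """Build the dataset import URL used in the curl command above.

    Hypothetical helper: only the URL layout comes from the thread.
    """
    # safe="" makes quote() encode "/" and ":" too, so the whole pid
    # travels as a single query-parameter value.
    encoded_pid = quote(pid, safe="")
    return (
        f"{server_url}/api/dataverses/{dv_alias}"
        f"/datasets/:import?pid={encoded_pid}&release=yes"
    )


# Example values (assumptions, not taken from a real installation).
weird_pid = "doi:10.1002/(sici)1099-1409(199908/10)3:6/7<672::aid-jpp192>3.0.co;2-8"
url = build_import_url("http://localhost:8080", "root", weird_pid)
print(url)
```

Encoding only gets the pid to the server intact; whether the server accepts it is a separate question, which is what this issue is about.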

scolapasta · 2018-07-17T15:42:49Z

This is a good start, but only blocks certain characters and would need to be updated for others. Let's take the opposite approach and define what we allow and make sure PIDs parse that way. (and when some group says they need us to allow something else, we can consider adding it to what is allowable).
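To make the "define what we allow" idea concrete, here is a rough sketch in Python rather than the Java used in the pull request. The character set, the names, and the example pids are assumptions chosen for illustration; the actual allowed set is whatever the team settles on.

```python
import re

# Hypothetical allowlist: ASCII letters, digits, and a handful of
# punctuation marks. Anything not listed here is rejected.
ALLOWED_PID_CHARS = re.compile(r"[A-Za-z0-9._/:-]+")


def is_allowed_pid(pid: str) -> bool:
    """Return True only if every character of pid is on the allowlist."""
    return ALLOWED_PID_CHARS.fullmatch(pid) is not None


def rejected_characters(pid: str) -> list:
    """List the distinct characters that caused a rejection, sorted."""
    return sorted({ch for ch in pid if not ALLOWED_PID_CHARS.fullmatch(ch)})


simple_pid = "doi:10.5072/FK2/ABC123"
weird_pid = "doi:10.1002/(sici)1099-1409(199908/10)3:6/7<672::aid-jpp192>3.0.co;2-8"

print(is_allowed_pid(simple_pid))
print(is_allowed_pid(weird_pid), rejected_characters(weird_pid))
```

With an allowlist, supporting a new character later means adding it to one pattern, and anything nobody thought about stays blocked by default.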

matthew-a-dunlap changed the title ~~filter file pids with special characters~~ Filter file pids with special characters Jul 3, 2018

djbrooke added the Status: Backlog label Jul 9, 2018

matthew-a-dunlap changed the title ~~Filter file pids with special characters~~ Filter dataset/file pids with special characters Jul 9, 2018

djbrooke added Status: This/Next Sprint and removed Status: Backlog labels Jul 10, 2018

pdurbin self-assigned this Jul 16, 2018

pdurbin added Status: Development and removed Status: This/Next Sprint labels Jul 16, 2018

pdurbin added a commit that referenced this issue Jul 16, 2018

disallow some special characters in dataset DOI import #4811

30785b8

pdurbin mentioned this issue Jul 16, 2018

disallow some special characters in dataset DOI import #4811 #4852

Merged

pdurbin removed their assignment Jul 16, 2018

pdurbin added Status: Code Review and removed Status: Development labels Jul 16, 2018

scolapasta added Status: Development and removed Status: Code Review labels Jul 17, 2018

pdurbin added Status: This/Next Sprint and removed Status: Development labels Jul 18, 2018

benjamin-martinez self-assigned this Jul 25, 2018

djbrooke added Status: Development and removed Status: This/Next Sprint labels Jul 25, 2018

benjamin-martinez added a commit that referenced this issue Jul 25, 2018

Changed the "blacklist" for allowed PID characters to a "whitelist" #4811

f5391a5

benjamin-martinez added a commit that referenced this issue Jul 26, 2018

Minor clean-up and reformatting #4811

d365996

djbrooke added Status: Code Review and removed Status: Development labels Jul 26, 2018

djbrooke assigned matthew-a-dunlap and unassigned benjamin-martinez Jul 26, 2018

benjamin-martinez added a commit that referenced this issue Jul 26, 2018

Merge branch 'develop' into 4811-block-fancy-pid #4811

bb8c795

matthew-a-dunlap added Status: Development and removed Status: Code Review labels Jul 26, 2018

matthew-a-dunlap removed their assignment Jul 26, 2018

benjamin-martinez added a commit that referenced this issue Jul 26, 2018

Changed special PID case error message and removed disallowed characters regex. #4811

55a1295

benjamin-martinez added Status: Code Review and removed Status: Development labels Jul 26, 2018

matthew-a-dunlap added Status: QA and removed Status: Code Review labels Jul 26, 2018

kcondon self-assigned this Jul 26, 2018

djbrooke assigned matthew-a-dunlap Jul 30, 2018

benjamin-martinez added a commit that referenced this issue Jul 30, 2018

Negation symbol added to verification logic in Dataverses.java #4811

0883ca3

kcondon added a commit that referenced this issue Jul 30, 2018

Merge pull request #4852 from IQSS/4811-block-fancy-pid

ac3e8ad

disallow some special characters in dataset DOI import #4811

kcondon closed this as completed Jul 30, 2018

kcondon removed the Status: QA label Jul 30, 2018

djbrooke added this to the 4.9.2 - Stata Upgrades, etc. milestone Jul 30, 2018

pdurbin mentioned this issue Aug 20, 2018

Import API: Specifying the PID in json does not work. #4839

Closed
