Filter dataset/file pids with special characters #4811
Comments
This was originally discussed in #4606 (review), but we decided to break it out into its own story. We really shouldn't release #3083 without it, though.
Thanks @scolapasta and @matthew-a-dunlap for the discussion after standup. Moved to backlog for discussion. Before we pick this up, we should determine the criteria for filtering.
I just made pull request #4852 and moved this issue to code review. In addition to the JUnit tests I added, I've been testing the running system with this:
This is a good start, but it only blocks certain characters and would need to be updated whenever another problematic character turns up. Let's take the opposite approach: define what we allow and make sure PIDs parse that way. When some group says they need us to allow something else, we can consider adding it to what is allowable.
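To make the allowlist idea concrete, here is a minimal sketch in Java. The class name, the method name, and the exact character set (ASCII letters, digits, and `.`, `_`, `/`, `:`, `-`) are assumptions chosen for illustration; they are not the rule implemented in #4852 or anywhere else in Dataverse.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Dataverse: accepts an identifier only if
// every character is in an explicitly allowed set.
class PidAllowlist {

    // Assumed allowlist: ASCII letters, digits, and a few separators.
    // Extending it later means adding characters to this one pattern.
    private static final Pattern ALLOWED = Pattern.compile("^[A-Za-z0-9._/:-]+$");

    public static boolean isAllowed(String identifier) {
        return identifier != null && ALLOWED.matcher(identifier).matches();
    }

    public static void main(String[] args) {
        // An identifier made only of allowed characters is accepted.
        System.out.println(isAllowed("10.5072/FK2/ABC123"));
        // Parentheses, angle brackets, and semicolons are outside the set.
        System.out.println(isAllowed(
            "10.1002/(sici)1099-1409(199908/10)3:6/7<672::aid-jpp192>3.0.co;2-8"));
    }
}
```

Compared with a blocklist, nothing new gets through by default: a character nobody anticipated is rejected until someone decides to allow it.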
disallow some special characters in dataset DOI import #4811
As part of #4761 we have been discussing PID structure and the challenges of external systems using these values. There is concern that we will start importing PIDs with a lot of unusual characters, for example this valid DOI:
https://doi.org/10.1002/(sici)1099-1409(199908/10)3:6/7<672::aid-jpp192>3.0.co;2-8
These edge-case PIDs could lead to systems breaking, injection attacks, etc. This is of more concern now that we are trying to have all external systems use PIDs to reference files and datasets, since those systems are unlikely to expect such a variety of characters. That is especially true of legacy systems that only expect the alphanumeric PIDs generated in Dataverse.
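As a rough illustration of how far the example DOI strays from purely alphanumeric identifiers, the sketch below collects every character that is not an ASCII letter or digit. The class and method names are hypothetical, and "ASCII alphanumeric" is only an assumed baseline for comparison, not a rule taken from Dataverse.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical diagnostic helper, not part of Dataverse.
class PidCharacterReport {

    // Returns the distinct characters in the PID that are not ASCII letters
    // or digits, in the order they first appear.
    public static Set<Character> nonAlphanumeric(String pid) {
        Set<Character> found = new LinkedHashSet<>();
        for (char c : pid.toCharArray()) {
            boolean asciiAlnum = (c >= 'a' && c <= 'z')
                    || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9');
            if (!asciiAlnum) {
                found.add(c);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String doi = "10.1002/(sici)1099-1409(199908/10)3:6/7<672::aid-jpp192>3.0.co;2-8";
        // Prints the set of non-alphanumeric characters found in the DOI.
        System.out.println(nonAlphanumeric(doi));
    }
}
```

Several of the characters it reports, such as `<`, `>`, and `;`, have special meaning in HTML, shells, or SQL, which is where the injection concern comes from.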
A solution that was discussed was to enforce limits on the characters we import. The general idea is to be pretty strict but simple and then add in edge cases as requested.
I did not look into Handles or EZID, though my limited understanding is that they allow a different and equally unusual assortment of characters. For more information on DataCite's PIDs, see https://support.datacite.org/docs/doi-basics and https://www.crossref.org/blog/dois-and-matching-regular-expressions/