Improve validation for HTAN IDs #268
Comments
Suggest modifying rule to
@inodb found that spaces in parent IDs were causing files to be missing from Release 4. Another use case for why we should plan for this in the September sprint @aclayton555 @elv-sb
Consider: if we implement this, which IDs will then fail, and should those pass or be fixed? Work through this during the Sept sprint (@clarisse-lau, appreciate your help on this). Will bring to an ops call mid-sprint to ensure alignment across the team, include checks on BQ and dashboard, etc.
Close out 23-09 sprint with this in staging for testing purposes. If okay, HTAN Parent ID also needs to be updated. But we need to see how this works and what breaks.
Testing for HTAN Data File ID with the following query
More than 21k HTAPP files fail this validation
SRRS issues as follows
Here are the first 5 rows of the HTAPP errors. These do not match as they contain a 4th @clarisse-lau do you think these should be valid?
@clarisse-lau coming back to this issue. Do we think the HTAPP IDs above should be considered valid?
@elv-sb and I discussed today. As we currently will only generate a
Apologies @adamjtaylor, missed this the first time around! I agree with the proposed approach (leaving as is). The center ID and participant components are the most crucial, and having an extra "_ group" should not cause issues downstream. |
It sounds like we will be supporting an additional TNP soon that will use |
In light of the revised HTAN identifier SOP, we might want to improve our validation rule for HTAN ID fields: https://github.com/ncihtan/data-models/blob/main/HTAN.model.csv#L11-L16
Currently, these are not validated. We used to validate with "regex match HTA_*", but this was removed in Jan 2022 as "the regex would need to be updated". Propose we update this to the rules below:
HTAN Data File / Biospecimen ID:
regex match ^(HTA([1-9]|1[0-5]))_((EXT)?([1-9]\d*|0000))_([1-9]\d*|0000)$ warning
HTAN Parent Data File / Biospecimen ID:
list like :: regex match ^(HTA([1-9]|1[0-5]))_((EXT)?([1-9]\d*|0000))_([1-9]\d*|0000)$ warning
HTAN Participant ID:
regex match ^(HTA([1-9]|1[0-5]))_((EXT)?([1-9]\d*|0000))$ warning
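To get a feel for what the proposed patterns accept and reject, here is a minimal Python sketch. The two regexes are copied from the rules above; everything else is an assumption for illustration. The sample IDs are made up, the helper names are hypothetical, and the parent ID check assumes the list-like rule splits on commas without trimming whitespace, which is not confirmed here.

```python
import re

# Patterns copied verbatim from the proposed rules above.
DATA_FILE_OR_BIOSPECIMEN_ID = re.compile(
    r"^(HTA([1-9]|1[0-5]))_((EXT)?([1-9]\d*|0000))_([1-9]\d*|0000)$"
)
PARTICIPANT_ID = re.compile(r"^(HTA([1-9]|1[0-5]))_((EXT)?([1-9]\d*|0000))$")


def is_valid_id(value: str) -> bool:
    """True if value matches the proposed Data File / Biospecimen ID pattern."""
    return DATA_FILE_OR_BIOSPECIMEN_ID.match(value) is not None


def is_valid_participant_id(value: str) -> bool:
    """True if value matches the proposed Participant ID pattern."""
    return PARTICIPANT_ID.match(value) is not None


def invalid_parent_ids(cell: str) -> list:
    """Return the entries of a comma-separated parent ID cell that fail.

    Assumption: entries are split on ',' and NOT stripped, so an entry
    with a leading or trailing space is reported as invalid.
    """
    return [entry for entry in cell.split(",") if not is_valid_id(entry)]


if __name__ == "__main__":
    # Hypothetical sample IDs, not taken from any real HTAN manifest.
    samples = [
        "HTA1_1_1",         # three components, no leading zeros
        "HTA12_EXT3_0000",  # EXT participant, 0000 placeholder
        "HTA16_1_1",        # center number outside 1-15
        "HTA1_01_1",        # leading zero in participant component
        "HTA1_1_1_2",       # extra fourth component
    ]
    for sample in samples:
        print(sample, is_valid_id(sample))
    print(invalid_parent_ids("HTA1_1_1, HTA1_1_2"))
```

Under these assumptions, an ID with a fourth component fails the Data File / Biospecimen pattern, and a parent ID list written with a space after the comma fails on its second entry.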