Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve validation for HTAN IDs #268

Closed
adamjtaylor opened this issue Jul 17, 2023 · 11 comments · Fixed by #304
Closed

Improve validation for HTAN IDs #268

adamjtaylor opened this issue Jul 17, 2023 · 11 comments · Fixed by #304
Assignees
Labels
effort-high This one's hard

Comments

@adamjtaylor
Copy link
Contributor

adamjtaylor commented Jul 17, 2023

In light of the revised HTAN identifier SOP, we might want to improve our validation rule for HTAN ID fields: https://github.com/ncihtan/data-models/blob/main/HTAN.model.csv#L11-L16

Currently, these are not validated. We used to validate with regex match HTA_*, but this was removed in Jan 2022 as "the regex would need to be updated"

Propose we update this to below
HTAN Data File / Biospecimen ID: regex match ^(HTA([1-9]|1[0-5]))_((EXT)?([1-9]\d*|0000))_([1-9]\d*|0000)$ warning
HTAN Parent Data File / Biospecimen ID: list like :: regex match ^(HTA([1-9]|1[0-5]))_((EXT)?([1-9]\d*|0000))_([1-9]\d*|0000)$ warning

HTAN Participant ID: regex match ^(HTA([1-9]|1[0-5]))_((EXT)?([1-9]\d*|0000))$ warning

@clarisse-lau
Copy link
Contributor

Suggest modifying rule to ^(HTA([1-9]|1[0-5]))_((EXT)?([0-9]\d*|0000))_([0-9]\d*|0000) to capture IDs such as HTA10_01_00491395202451674766778335220084 submitted by Stanford

@adamjtaylor
Copy link
Contributor Author

@inodb found that the case of spaces in parent IDs was causing files to be missing from Release 4. Another use case why we should plan for this in the September spring @aclayton555 @elv-sb

@aclayton555
Copy link
Contributor

aclayton555 commented Aug 30, 2023

Consider: if we implement this, which IDs will this then fail; should these pass or be fixed?

Work through this through Sept sprint ( @clarisse-lau appreciate you help on this). Will bring to an ops call mid sprint to ensure alignment on team, include checks on BQ and dashboard, etc.

@aclayton555
Copy link
Contributor

Close out 23-09 sprint with this in staging for testing purposes. If okay, HTAN Parent ID also needs to be updated. But we need to see how this works and what breaks

@adamjtaylor
Copy link
Contributor Author

Testing for HTAN Data File ID with the following query

SELECT 
  HTAN_Center, 
  REGEXP_CONTAINS(HTAN_Data_File_ID, '^(HTA([1-9]|1[0-5]))_((EXT)?([0-9]\\d*|0000))_([0-9]\\d*|0000)$') AS regex,
  COUNT(REGEXP_CONTAINS(HTAN_Data_File_ID, '^(HTA([1-9]|1[0-5]))_((EXT)?([0-9]\\d*|0000))_([0-9]\\d*|0000)$'))
FROM `htan-dcc.combined_assays.*` 
WHERE (Component LIKE '%Level%') OR (Component LIKE '%Level%')
GROUP BY HTAN_Center, REGEXP_CONTAINS(HTAN_Data_File_ID, '^(HTA([1-9]|1[0-5]))_((EXT)?([0-9]\\d*|0000))_([0-9]\\d*|0000)$') 

More than 21k HTAPP files fail this validation
32 SRRS files fail

HTAN_Center regex count
HTAN HTAPP false 21087
HTAN SRRS false 32

@adamjtaylor
Copy link
Contributor Author

SRRS issues as follows

Row HTAN_Center Filename HTAN_Data_File_ID regex  
1 HTAN SRRS DFCI_imaging_level_2/HTA5_1664_1003-7_Scan1.ome.tiff HTA5_1664_1003-7 false  
2 HTAN SRRS DFCI_imaging_level_2/HTA5_1664_1003-6_Scan1.ome.tiff HTA5_1664_1003-6 false  
3 HTAN SRRS DFCI_imaging_level_2/HTA5_1664_1003-5_Scan1.ome.tiff HTA5_1664_1003-5 false  
4 HTAN SRRS DFCI_imaging_level_2/HTA5_1664_1003-4_Scan1.ome.tiff HTA5_1664_1003-4 false  
5 HTAN SRRS DFCI_imaging_level_2/HTA5_1664_1003-3_Scan1.ome.tiff HTA5_1664_1003-3 false  
6 HTAN SRRS DFCI_imaging_level_2/HTA5_1664_1003-2_Scan1.ome.tiff HTA5_1664_1003-2 false  
7 HTAN SRRS DFCI_imaging_level_2/HTA5_1664_1003-1_Scan1.ome.tiff HTA5_1664_1003-1 false  
8 HTAN SRRS DFCI_imaging_level_2/HTA5_1348_1003-5_Scan1.ome.tiff HTA5_1348_1003-5 false  
9 HTAN SRRS DFCI_imaging_level_2/HTA5_1348_1003-4_Scan1.ome.tiff HTA5_1348_1003-4 false  
10 HTAN SRRS DFCI_imaging_level_2/HTA5_1348_1003-3_Scan1.ome.tiff HTA5_1348_1003-3 false  
11 HTAN SRRS DFCI_imaging_level_2/HTA5_1348_1003-2_Scan1.ome.tiff HTA5_1348_1003-2 false  
12 HTAN SRRS DFCI_imaging_level_2/HTA5_039_40XScan1.ome.tiff HTA5_039_40XScan1.ome.tiff false  
13 HTAN SRRS DFCI_imaging_level_2/HTA5_039_A3_40X_Scan1.ome.tiff HTA5_039_A3 false  
14 HTAN SRRS DFCI_imaging_level_2/BPC-21-NC-1390-19.ome.tiff BPC-21-NC-1390-19.ome.tiff false  
15 HTAN SRRS DFCI_imaging_level_2/BPC-21-NC-1390-18.ome.tiff BPC-21-NC-1390-18.ome.tiff false  
16 HTAN SRRS DFCI_imaging_level_2/BPC-21-NC-1390-17.ome.tiff BPC-21-NC-1390-17.ome.tiff false  
17 HTAN SRRS DFCI_imaging_level_2/BPC-21-NC-1390-16.ome.tiff BPC-21-NC-1390-16.ome.tiff false  
18 HTAN SRRS DFCI_imaging_level_2/BPC-21-NC-1390-15.ome.tiff BPC-21-NC-1390-15.ome.tiff false  
19 HTAN SRRS DFCI_imaging_level_2/BPC-21-NC-1390-14.ome.tiff BPC-21-NC-1390-14.ome.tiff false  
20 HTAN SRRS DFCI_imaging_level_2/BPC-21-NC-1390-13.ome.tiff BPC-21-NC-1390-13.ome.tiff false  
21 HTAN SRRS DFCI_imaging_level_2/BPC-21-NC-1390-12.ome.tiff BPC-21-NC-1390-12.ome.tiff false  
22 HTAN SRRS DFCI_imaging_level_2/BPC-21-NC-1390-11.ome.tiff BPC-21-NC-1390-11.ome.tiff false  
23 HTAN SRRS DFCI_imaging_level_2/BPC-21-NC-1390-10.ome.tiff BPC-21-NC-1390-10.ome.tiff false  
24 HTAN SRRS DFCI_imaging_level_2/BPC-21-NC-1390-9.ome.tiff BPC-21-NC-1390-9.ome.tiff false  
25 HTAN SRRS DFCI_imaging_level_2/BPC-21-NC-1390-8.ome.tiff BPC-21-NC-1390-8.ome.tiff false  
26 HTAN SRRS DFCI_imaging_level_2/BPC-21-NC-1390-7.ome.tiff BPC-21-NC-1390-7.ome.tiff false  
27 HTAN SRRS DFCI_imaging_level_2/BPC-21-NC-1390-6.ome.tiff BPC-21-NC-1390-6.ome.tiff false  
28 HTAN SRRS DFCI_imaging_level_2/BPC-21-NC-1390-5.ome.tiff BPC-21-NC-1390-5.ome.tiff false  
29 HTAN SRRS DFCI_imaging_level_2/BPC-21-NC-1390-4.ome.tiff BPC-21-NC-1390-4.ome.tiff false  
30 HTAN SRRS DFCI_imaging_level_2/BPC-21-NC-1390-3.ome.tiff BPC-21-NC-1390-3.ome.tiff false  
31 HTAN SRRS DFCI_imaging_level_2/BPC-21-NC-1390-2.ome.tiff BPC-21-NC-1390-2.ome.tiff false  

@adamjtaylor
Copy link
Contributor Author

adamjtaylor commented Oct 24, 2023

Here are the first 5 rows of the HTAPP errors

These do not match as they contain a 4th _ separated group

@clarisse-lau do you think these should be valid?

Row HTAN_Center Filename HTAN_Data_File_ID regex  
1 HTAN HTAPP exseq_level_1/htapp_982/htapp982-F077_round007_ch03SHIFT_affine.h5 HTA1_982_7629305_02131 false  
2 HTAN HTAPP exseq_level_1/htapp_982/htapp982-F077_round007_ch02SHIFT_affine.h5 HTA1_982_7629305_02130 false  
3 HTAN HTAPP exseq_level_1/htapp_982/htapp982-F077_round007_ch01SHIFT_affine.h5 HTA1_982_7629305_02129 false  
4 HTAN HTAPP exseq_level_1/htapp_982/htapp982-F077_round007_ch00_affine.h5 HTA1_982_7629305_02128 false  
5 HTAN HTAPP exseq_level_1/htapp_982/htapp982-F077_round006_ch03SHIFT_affine.h5 HTA1_982_7629305_02127 false

@adamjtaylor
Copy link
Contributor Author

@clarisse-lau coming back to this issue. Do we think the HTAPP ids above should be considered valid?

@adamjtaylor
Copy link
Contributor Author

@elv-sb and I discussed today. As we currently will only generate a warning we suggest leaving as is and allowing failures for the above noted HTAN IDs

@clarisse-lau
Copy link
Contributor

Apologies @adamjtaylor, missed this the first time around!

I agree with the proposed approach (leaving as is). The center ID and participant components are the most crucial, and having an extra "_ group" should not cause issues downstream.

@clarisse-lau
Copy link
Contributor

It sounds like we will be supporting an additional TNP soon that will use HTA16. Suggest modifying the rule for first group to accommodate this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
effort-high This one's hard
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants