You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
A few of the PRs I still have open and am working on have all ran into this issue, and I need some input from you guys to resolve it.
In tests/validators/test_validate_test_data.py there are tests like:
def test_validate_testdata_bii_s_3_isatab(self):
test_case = 'BII-S-3'
with open(os.path.join(utils.TAB_DATA_DIR, test_case, 'i_gilbert.txt')) as test_case_fp:
report = isatab.validate(fp=test_case_fp, config_dir=utils.DEFAULT2015_XML_CONFIGS_DATA_DIR,
log_level=self._reporting_level)
if len(report['errors']) > 0:
self.fail("Error found when validating ISA tab: {}".format(report['errors']))
def test_validate_testdata_bii_s_3_isatab(self):
test_case = 'BII-S-3'
with open(os.path.join(utils.TAB_DATA_DIR, test_case, 'i_gilbert.txt')) as test_case_fp:
report = isatab.validate(fp=test_case_fp, config_dir=utils.DEFAULT2015_XML_CONFIGS_DATA_DIR,
log_level=self._reporting_level)
if len(report['errors']) > 0:
# self.fail("Error found when validating ISA tab: {}".format(report['errors']))
self.assertTrue("Error found when validating ISA tab: {}".format(report['errors']))
I don't think this was a wise change. The tests changed from testing that the data had no errors, to testing that it does have errors, and doesn't specify the errors. I think a better way to handle this is either to fix the errors in that data or explicitly look for the errors in that data, so you can know if changes are adding or subtracting errors in the future. The reason this is causing me issues for my PRs is because I am trying to fix some of the validations so they are inline with other parts of the code and examples, but the code and examples don't align with each other.
If we could hash out a few things about what is correct I can change the validation and test data to be correct and restore these tests to how they used to be.
Here is an example, tests/data/tab/BII-I-1/a_proteome.txt:
Sample Name
Protocol REF
Extract Name
Protocol REF
Labeled Extract Name
Label
Term Source REF
Term Accession Number
MS Assay Name
Comment[PRIDE Accession]
Comment[PRIDE Processed Data Accession]
Raw Spectral Data File
Normalization Name
Protein Assignment File
Peptide Assignment File
Post Translational Modification Assignment File
Data Transformation Name
Derived Spectral Data File
Factor Value[limiting nutrient]
Term Source REF
Term Accession Number
Factor Value[rate]
Unit
Term Source REF
Term Accession Number
S-0.1-aliquot11
protein extraction
S-0.1
ITRAQ labeling
JC_S-0.1
iTRAQ reagent 117
8761
8761
8761
spectrum.mzdata
norm1
proteins.csv
peptides.csv
ptms.csv
datatransformation1
PRIDE_Exp_Complete_Ac_8761.xml
sulphur
0.1
l/hr
C-0.1-aliquot11
protein extraction
C-0.1
ITRAQ labeling
JC_C-0.1
iTRAQ reagent 116
8761
8761
8761
spectrum.mzdata
norm1
proteins.csv
peptides.csv
ptms.csv
datatransformation1
PRIDE_Exp_Complete_Ac_8761.xml
carbon
0.1
l/hr
N-0.1-aliquot11
protein extraction
N-0.1
ITRAQ labeling
JC_N-0.1
iTRAQ reagent 115
8761
8761
8761
spectrum.mzdata
norm1
proteins.csv
peptides.csv
ptms.csv
datatransformation1
PRIDE_Exp_Complete_Ac_8761.xml
nitrogen
0.1
l/hr
S-0.1-aliquot11
protein extraction
S-0.1
ITRAQ labeling
Pool1
iTRAQ reagent 114
8761
8761
8761
spectrum.mzdata
norm1
proteins.csv
peptides.csv
ptms.csv
datatransformation1
PRIDE_Exp_Complete_Ac_8761.xml
l/hr
C-0.1-aliquot11
protein extraction
C-0.1
ITRAQ labeling
Pool1
iTRAQ reagent 114
8761
8761
8761
spectrum.mzdata
norm1
proteins.csv
peptides.csv
ptms.csv
datatransformation1
PRIDE_Exp_Complete_Ac_8761.xml
l/hr
N-0.1-aliquot11
protein extraction
N-0.1
ITRAQ labeling
Pool1
iTRAQ reagent 114
8761
8761
8761
spectrum.mzdata
norm1
proteins.csv
peptides.csv
ptms.csv
datatransformation1
PRIDE_Exp_Complete_Ac_8761.xml
l/hr
C-0.2-aliquot11
protein extraction
C-0.2
ITRAQ labeling
JC_C-0.2
iTRAQ reagent 117
8762
8762
8762
spectrum.mzdata
norm2
proteins.csv
peptides.csv
ptms.csv
datatransformation2
PRIDE_Exp_Complete_Ac_8762.xml
carbon
0.2
l/hr
N-0.2-aliquot11
protein extraction
N-0.2
ITRAQ labeling
JC_N-0.2
iTRAQ reagent 116
8762
8762
8762
spectrum.mzdata
norm2
proteins.csv
peptides.csv
ptms.csv
datatransformation2
PRIDE_Exp_Complete_Ac_8762.xml
nitrogen
0.2
l/hr
P-0.1-aliquot11
protein extraction
P-0.1
ITRAQ labeling
JC_P-0.1
iTRAQ reagent 115
8762
8762
8762
spectrum.mzdata
norm2
proteins.csv
peptides.csv
ptms.csv
datatransformation2
PRIDE_Exp_Complete_Ac_8762.xml
phosphorus
0.1
l/hr
C-0.2-aliquot11
protein extraction
C-0.2
ITRAQ labeling
Pool2
iTRAQ reagent 114
8762
8762
8762
spectrum.mzdata
norm2
proteins.csv
peptides.csv
ptms.csv
datatransformation2
PRIDE_Exp_Complete_Ac_8762.xml
l/hr
N-0.2-aliquot11
protein extraction
N-0.2
ITRAQ labeling
Pool2
iTRAQ reagent 114
8762
8762
8762
spectrum.mzdata
norm2
proteins.csv
peptides.csv
ptms.csv
datatransformation2
PRIDE_Exp_Complete_Ac_8762.xml
l/hr
P-0.1-aliquot11
protein extraction
P-0.1
ITRAQ labeling
Pool2
iTRAQ reagent 114
8762
8762
8762
spectrum.mzdata
norm2
proteins.csv
peptides.csv
ptms.csv
datatransformation2
PRIDE_Exp_Complete_Ac_8762.xml
l/hr
P-0.2-aliquot11
protein extraction
P-0.2
ITRAQ labeling
JC_P-0.2
iTRAQ reagent 116
8763
8763
8763
spectrum.mzdata
norm3
proteins.csv
peptides.csv
ptms.csv
datatransformation3
PRIDE_Exp_Complete_Ac_8763.xml
phosphorus
0.2
l/hr
S-0.2-aliquot11
protein extraction
S-0.2
ITRAQ labeling
JC_S-0.2
iTRAQ reagent 115
8763
8763
8763
spectrum.mzdata
norm3
proteins.csv
peptides.csv
ptms.csv
datatransformation3
PRIDE_Exp_Complete_Ac_8763.xml
sulphur
0.2
l/hr
P-0.2-aliquot11
protein extraction
P-0.2
ITRAQ labeling
Pool3
iTRAQ reagent 117
8763
8763
8763
spectrum.mzdata
norm3
proteins.csv
peptides.csv
ptms.csv
datatransformation3
PRIDE_Exp_Complete_Ac_8763.xml
l/hr
S-0.2-aliquot11
protein extraction
S-0.2
ITRAQ labeling
Pool3
iTRAQ reagent 117
8763
8763
8763
spectrum.mzdata
norm3
proteins.csv
peptides.csv
ptms.csv
datatransformation3
PRIDE_Exp_Complete_Ac_8763.xml
l/hr
P-0.2-aliquot11
protein extraction
P-0.2
ITRAQ labeling
Pool3
iTRAQ reagent 114
8763
8763
8763
spectrum.mzdata
norm3
proteins.csv
peptides.csv
ptms.csv
datatransformation3
PRIDE_Exp_Complete_Ac_8763.xml
l/hr
S-0.2-aliquot11
protein extraction
S-0.2
ITRAQ labeling
Pool3
iTRAQ reagent 114
8763
8763
8763
spectrum.mzdata
norm3
proteins.csv
peptides.csv
ptms.csv
datatransformation3
PRIDE_Exp_Complete_Ac_8763.xml
l/hr
This has a few issues:
"Protocol REF" columns appear to be missing in 3 places, before each of the "Name" columns. The preprocess function in ProcessSequenceFactory.py warns about this.
The "Factor Value" columns are at the end instead of being after "Sample Name".
There are completely empty "Term Source REF" and "Term Accession Number" columns.
Questions about these issues:
Although missing "Protocol REF" columns will not cause the ProcessSequenceFactory to error, it adds the columns with "unknown" for the value, I think this should still be at least a warning in validation. I don't fully understand what is going on here. Was there a time where the "Name" columns was all that was necessary and you didn't need to specify the protocol? It would be nice to have some consensus about this and make the code and examples align with that.
The ProcessSequenceFactory handles "Factor Values" in a unique way. It splits columns into object groups and then loops over the object groups to parse them into model objects. For things like "Characteristics" and "Parameters" it only looks for them within the object group, but for "Factor Values" it will look for them within the entire data frame. This behavior assumes or suggests that there can only be 1 "Sample Name" column since it adds all factors in the entire file every time it finds a "Sample Name" column. This is correct for assays, but can studies only have 1 "Sample Name" column? This ProcessSequenceFactory also suggests that "Factor Value" columns can be anywhere, but this makes validation more difficult. I would recommend requiring "Factor Values" to be after the "Sample Name" column just like "Characteristics" to make things more consistent.
Completely empty "Term Source REF" and "Term Accession Number" columns doesn't cause any errors, but it kind of suggests that at one time they were required maybe? If we end up editing the data I would like to know whether we could just remove these or not.
I have also come across some other issues while investigating this that aren't illustrated in the example.
Can there be multiple "Name" columns, such as "Assay Name"? If so then they need to be matched a little differently in some of the code. Currently, they are matched using == instead of startswith like "Protocol REF" columns are.
Is order supposed to matter for "Term Source REF" and "Term Accession Number"? As well as "Unit", "Term Source REF", "Term Accession Number"? They do in the code, get_value used by ProcessSequenceFactory requires a specific order, but is it supposed to be that way? Also all of the columns must be present, you can't just have a "Unit" column for instance. I have written some preliminary code that changes both of these conditions. It would make it so if just "Unit" is present it still gets added and if they are in any order it still gets created.
If it is easier to meet and discuss this I am available.
The text was updated successfully, but these errors were encountered:
TODO:
change the test from:
self.assertTrue("Error found when validating ISA tab: {}".format(report['errors']))
to:
self.fail("Error found when validating ISA tab: {}".format(report['errors']))
and fix the test files in response to errors detected.
A few of the PRs I still have open and am working on have all ran into this issue, and I need some input from you guys to resolve it.
In tests/validators/test_validate_test_data.py there are tests like:
In #551 though many of them changed to:
I don't think this was a wise change. The tests changed from testing that the data had no errors, to testing that it does have errors, and doesn't specify the errors. I think a better way to handle this is either to fix the errors in that data or explicitly look for the errors in that data, so you can know if changes are adding or subtracting errors in the future. The reason this is causing me issues for my PRs is because I am trying to fix some of the validations so they are inline with other parts of the code and examples, but the code and examples don't align with each other.
If we could hash out a few things about what is correct I can change the validation and test data to be correct and restore these tests to how they used to be.
Here is an example, tests/data/tab/BII-I-1/a_proteome.txt:
This has a few issues:
preprocess
function in ProcessSequenceFactory.py warns about this.Questions about these issues:
ProcessSequenceFactory
to error, it adds the columns with "unknown" for the value, I think this should still be at least a warning in validation. I don't fully understand what is going on here. Was there a time where the "Name" columns was all that was necessary and you didn't need to specify the protocol? It would be nice to have some consensus about this and make the code and examples align with that.ProcessSequenceFactory
handles "Factor Values" in a unique way. It splits columns into object groups and then loops over the object groups to parse them into model objects. For things like "Characteristics" and "Parameters" it only looks for them within the object group, but for "Factor Values" it will look for them within the entire data frame. This behavior assumes or suggests that there can only be 1 "Sample Name" column since it adds all factors in the entire file every time it finds a "Sample Name" column. This is correct for assays, but can studies only have 1 "Sample Name" column? ThisProcessSequenceFactory
also suggests that "Factor Value" columns can be anywhere, but this makes validation more difficult. I would recommend requiring "Factor Values" to be after the "Sample Name" column just like "Characteristics" to make things more consistent.I have also come across some other issues while investigating this that aren't illustrated in the example.
==
instead ofstartswith
like "Protocol REF" columns are.get_value
used byProcessSequenceFactory
requires a specific order, but is it supposed to be that way? Also all of the columns must be present, you can't just have a "Unit" column for instance. I have written some preliminary code that changes both of these conditions. It would make it so if just "Unit" is present it still gets added and if they are in any order it still gets created.If it is easier to meet and discuss this I am available.
The text was updated successfully, but these errors were encountered: