Output "NA" as subtype for samples that fail QC with no subtype result or no targets found #112

Closed
glabbe opened this issue Sep 9, 2019 · 11 comments

@glabbe
Collaborator

glabbe commented Sep 9, 2019

@dankein noticed that it is not possible to link metadata to the results using the biohansel metadata option if the "subtype" field is empty, as is the case for a QC FAIL due to "NO TARGETS FOUND" or "NO SUBTYPE RESULT". A possible solution would be to output "NA" in the subtype column in these cases, so that a metadata row whose subtype is "NA" can be returned with the results.

@peterk87
Contributor

peterk87 commented Sep 9, 2019

Hi @glabbe
If it's desirable to have subtype metadata attached to a null result, you could add a row to the metadata table with an empty cell under the subtype field and whatever metadata values are appropriate in the other fields, e.g.

| subtype | subtype_metadata |
| --- | --- |
| 1.1 | metadata for 1.1 |
| | metadata for null |

Could you or @dankein give an example of subtype result metadata that would be returned with a null subtype result?

@dankein

dankein commented Sep 9, 2019

Hi @peterk87, I tried doing what you suggested with the metadata file before mentioning this issue to @glabbe, but it doesn't appear to work in the command-line or Galaxy versions.

We're using the metadata to reformat the "tech results" into a format that can be pasted line by line into our LIMS for reporting. This includes version numbers of the scheme, metadata, and Galaxy tool, as well as custom comments for reports and/or instructions not to report certain species without repeats, that sort of thing. Below is a partial example of what we're using.

In the case of "No subtype result" we still would like to attach the version metadata and include an instruction to not report the test.

| subtype | Species | differentiation status | scheme version | metadata version | galaxy tool version | comments |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | M. tuberculosis | differentiation complete | 2.1 | 2.3 | 2.2 | |
| 1.1 | M. bovis / M. bovis BCG | partial differentiation | 2.1 | 2.3 | 2.2 | identification incomplete - repeat sequencing |
| 1.1.1 | M. bovis BCG | differentiation complete | 2.1 | 2.3 | 2.2 | M. bovis BCG is a vaccine strain. |
| NA | no subtype | differentiation Failed | 2.1 | 2.3 | 2.2 | Do not report - no subtype found |

Thanks for your help!
Dan
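
A side note for anyone reproducing this with pandas: the sketch below shows one pitfall with a literal "NA" row in a metadata table. It assumes the table is tab-delimited and read with `pandas.read_csv`; whether biohansel loads its metadata this way is an assumption, and the column subset, sample names, and the `read_metadata` helper are hypothetical. With default settings pandas parses "NA" (and "#N/A") as a missing value, so the row can no longer be matched by a string key unless `keep_default_na=False` is passed.

```python
import io

import pandas as pd

# Hypothetical, shortened metadata table (tab-delimited) with a literal "NA" row.
METADATA_TSV = (
    "subtype\tspecies\tcomments\n"
    "1\tM. tuberculosis\t\n"
    "1.1\tM. bovis / M. bovis BCG\tidentification incomplete - repeat sequencing\n"
    "NA\tno subtype\tDo not report - no subtype found\n"
)


def read_metadata(tsv_text, keep_literal_na):
    """Read the table as strings; optionally keep "NA" as text instead of NaN."""
    return pd.read_csv(
        io.StringIO(tsv_text),
        sep="\t",
        dtype=str,  # keeps subtypes such as "1" and "1.1" as text, not floats
        keep_default_na=not keep_literal_na,
    )


default_parse = read_metadata(METADATA_TSV, keep_literal_na=False)
literal_parse = read_metadata(METADATA_TSV, keep_literal_na=True)

# Hypothetical results: one typed sample and one failed sample reported as "NA".
results = pd.DataFrame({"sample": ["s1", "s2"], "subtype": ["1.1", "NA"]})

# With the default parse the "NA" key became NaN, so the failed sample gets no metadata.
unmatched = results.merge(default_parse, on="subtype", how="left")

# With the literal parse the "NA" row is matched like any other subtype.
merged = results.merge(literal_parse, on="subtype", how="left")
print(merged[["sample", "subtype", "comments"]])
```

The same consideration applies to whichever placeholder is chosen, since "#N/A" is also on the default list of strings that pandas reads as missing.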

@peterk87
Contributor

peterk87 commented Sep 9, 2019

FYI the development branch version of biohansel (v2.3.0) outputs #N/A for null subtype results (added in PR #81). I'm not sure if this is the version in Galaxy at the moment.

@glabbe
Collaborator Author

glabbe commented Sep 9, 2019

Thanks for the heads up @peterk87. The Galaxy version of biohansel is still v2.2.0, so it will need to be updated.

@glabbe
Collaborator Author

glabbe commented Sep 9, 2019

Darian has started a pull request (#152) to update biohansel in Galaxy:
phac-nml/galaxy_tools#152 (comment)

@glabbe
Collaborator Author

glabbe commented Sep 9, 2019

@dankein I will talk with @Takadonet in the coming days about how to update the biohansel version in Galaxy to fix this issue

@glabbe glabbe closed this as completed Sep 9, 2019
@glabbe
Collaborator Author

glabbe commented Sep 9, 2019

@peterk87 Actually I just found that the fix implemented in PR #81 only outputs '#N/A' if there is no k-mer match found. If there are only negative k-mers found, and therefore no subtype found, the subtype field is still left blank. See output file attached that I got when using a truncated MTB sequence.
MTB_truncated_test.txt
Truncated sequence used with the tb_lineage scheme (changed extension to .txt as .fasta is not supported in GitHub):
truncated_H37Rv_reference.txt

@glabbe glabbe reopened this Sep 9, 2019
@DarianHole
Member

DarianHole commented Sep 9, 2019

I used dfsummary['subtype'].fillna(value='#N/A', inplace=True) as a way to add #N/A to columns if there was no subtype found. This is in the code after the creation of the dataframe and from what I understand, should fill the column if the cell is blank.

So, I'm going to hazard a guess that when a kmer is found but no subtype is given, the column is filled with a ' ' or something similar that prevents fillna from working in those cases. But I'm not 100% sure on that, just a guess that I must have missed when testing the earlier change (#81).
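
The guess above can be checked in isolation. This is a minimal sketch with a made-up summary table (the sample names and values are hypothetical, not taken from biohansel output): `fillna` only replaces values that pandas considers missing, so an empty or whitespace-only string passes through untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical summary table: one missing value, one empty string,
# one whitespace-only string, and one real subtype.
dfsummary = pd.DataFrame(
    {
        "sample": ["no_kmers", "negative_only", "blank_space", "typed"],
        "subtype": [np.nan, "", " ", "1.1"],
    }
)

# Only NaN counts as missing; "" and " " are ordinary strings.
is_missing = dfsummary["subtype"].isna().tolist()

# fillna returns a new Series here and leaves the non-missing strings as they are.
filled = dfsummary["subtype"].fillna("#N/A").tolist()

print(is_missing)
print(filled)
```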

@glabbe
Collaborator Author

glabbe commented Sep 10, 2019

@peterk87 @schonfju Justin found a fix: pandas treats an empty string differently from a missing value. @DarianHole Darian's fix handles the case where the subtype is a missing value; we also need to handle the case where it is an empty string. I will open a pull request that adds the following line under Darian's line in Main.py:

dfsummary['subtype'].fillna(value='#N/A', inplace=True)
dfsummary['subtype'].replace('','#N/A', inplace=True)
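
For reference, the two lines above can also be written as a single step that assigns the result back instead of relying on `inplace=True` on a selected column, a pattern that newer pandas versions warn about. This is a sketch under stated assumptions, not the code that was merged: the helper name `fill_null_subtype` and the example table are hypothetical, and treating whitespace-only strings as null goes slightly beyond the two-line fix.

```python
import numpy as np
import pandas as pd

NULL_SUBTYPE = "#N/A"  # placeholder used in this thread


def fill_null_subtype(df, column="subtype", placeholder=NULL_SUBTYPE):
    """Return a copy of df where missing, empty, or whitespace-only
    values in `column` are replaced by `placeholder`."""
    out = df.copy()
    col = out[column]
    # True for NaN/None, and for strings that are empty after stripping whitespace.
    is_blank = col.isna() | (col.astype(str).str.strip() == "")
    out[column] = col.mask(is_blank, placeholder)
    return out


# Hypothetical summary table covering the cases discussed in this issue.
example = pd.DataFrame(
    {
        "sample": ["no_kmers", "negative_only", "blank_space", "typed"],
        "subtype": [np.nan, "", " ", "1.1"],
    }
)
normalized = fill_null_subtype(example)
print(normalized)
```

Because the helper works on a copy and covers both representations of "no subtype", it can be applied once at the end, regardless of whether the results were merged with metadata.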

@glabbe
Collaborator Author

glabbe commented Sep 10, 2019

You were right @DarianHole, the field changes after merging the results with metadata (which happens for the tb_lineage and Typhi schemes by default). If there are no kmer matches found, there is a bypass in subtyper.py, so the results are not merged with the metadata and the dataframe ends up being different than when kmers are found.

@DarianHole
Member

DarianHole commented Oct 2, 2019

Merged in #120. The fixes worked for all the cases I could think of and that were tested. If something else comes up, reopen this issue or create a new one.

4 participants