Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix inconsistent subtype function #10

Closed
peterk87 opened this issue Sep 27, 2017 · 3 comments
Closed

Fix inconsistent subtype function #10

peterk87 opened this issue Sep 27, 2017 · 3 comments

Comments

@peterk87
Copy link
Contributor

peterk87 commented Sep 27, 2017

Sample SRR1958005 returns multiple subtype calls 2.1; 2.2 with all subtype calls being 2; 2.1; 2.2. Subtypes 2.1 and 2.2 are not consistent with each other.

Testing bio_hansel -s enteritids on the Enterobase genome assembly of SRR1958005 (46121.fasta) produced the following results:

sample scheme subtype all_subtypes tiles_matching_subtype are_subtypes_consistent inconsistent_subtypes n_tiles_matching_all n_tiles_matching_all_total n_tiles_matching_positive n_tiles_matching_positive_total n_tiles_matching_subtype n_tiles_matching_subtype_total file_path
46121 enteritidis 2.1; 2.2 2; 2.1; 2.2 134979-2.1; 1402592-2.1; 2483597-2.2 TRUE   652 652;652 14 13;36 3 2;25 /home/peter/Downloads/46121.fasta

Other samples which should have been are_subtypes_consistent == False include:

  • SRR3049064; all_subtypes == "2; 2.1; 2.2.2.1.1"
  • SRR1968431; all_subtypes == "2; 2.1; 2.2.2.1.1"
  • ERR1069299; all_subtypes == "2; 2.1; 2.2.2"
  • SRR1968272; all_subtypes == "2; 2.1; 2.2.2"

See function here:
https://github.com/phac-nml/bio_hansel/blob/master/bio_hansel/utils.py#L80

Also, would be a good idea to add a test for that function.

@mgopez
Copy link
Member

mgopez commented Sep 28, 2017

When you mean inconsistent, is that because in 2.1 and 2.2 the ._ value is different?

@peterk87
Copy link
Contributor Author

Yes, so the inconsistent subtypes would be 2.1 and 2.2. I originally had the function try to find the highest resolution subtype (i.e. longest string of numbers) and use that as the reference to compare against, but in cases where the highest resolution subtypes are equal length, the function fails to report inconsistencies.

@mgopez
Copy link
Member

mgopez commented Sep 29, 2017

I think I found an issue.
https://github.com/phac-nml/bio_hansel/blob/ddee5c948b40722e2fe11282809fa4b05faf3292/bio_hansel/utils.py#L98
I believe it's supposed to freq >= 1.

If both strings are same length, then >= 1 would add both to the incon_subtypes list.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants