ENH: provide output optionally as csv/tsv for automated analysis/sharing #19
I'm still working through the code, but I'd be happy to implement this in my fork |
Nice proposal @ktmeaton. I like how you solved the problem that we may not know the position of the breakpoint exactly, and have only lower and upper bounds. If I understand your proposal correctly, the regions would be the areas that we are sure of, with the unsure region not being part of any region. In this case, the breakpoint would be somewhere between 21618 and 21762:
I'm wondering whether there's any extra information that could make the output file more useful. It could be interesting to include private mutations: mutations that are present relative to the reference but found in none of the parents. That would help with creating covSpectrum queries and thus with finding other samples of the same recombinant. |
Yes, that's how I would interpret the regions (breakpoint between 21618 and 21762). Would reporting private mutations generate unique output that |
Yes, it would provide more comprehensive private mutations, because Nextclade will attach to the bigger donor and hence mask some interesting mutations that may occur in the nearest neighbour. For example, if we have an AY.39.1 / BA.2 recombinant, Nextclade would show private mutations with respect to AY.39.1, and that would exclude all the AY.39.1 mutations. sc2rf would report all mutations that are in addition to 21J and BA.2, including the defining AY.39.1 mutations. |
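To make the definition concrete, here is a minimal sketch of "private mutations" as a set difference: everything the sample carries relative to the reference, minus anything found in any parent. The function name `private_mutations` and the toy mutation sets are assumptions made up for illustration; this is not how sc2rf computes them internally.

```python
def private_mutations(sample, parents):
    """Return mutations in `sample` that none of the `parents` carry.

    `sample` is a set of substitution strings such as "C3241T", and
    `parents` is a list of such sets, all relative to the same reference.
    """
    shared = set().union(*parents)
    # Sort by genome position (the digits between the two bases).
    return sorted(sample - shared, key=lambda m: int(m[1:-1]))


# Toy data, invented for illustration only.
parent_a = {"C241T", "C3037T", "A23403G"}
parent_b = {"C241T", "C10029T", "A23403G"}
sample = {"C241T", "C3037T", "C3241T", "G5924A", "A23403G"}

print(private_mutations(sample, [parent_a, parent_b]))
# ['C3241T', 'G5924A']
```

A list like this could be pasted into a covSpectrum query to look for other samples sharing the same private mutations.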
Thanks for the explanation, that seems very useful to report as well! |
Would it be better to have an option that switches between screen and CSV file outputs, or a flag that specifies a path to optionally write CSV output? |
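The second option could look roughly like the sketch below: the screen output stays as it is, and a CSV file is written only when a path is supplied. The flag name `--csvfile` matches the one used later in this thread; the helper names `build_parser` and `write_csv` and the column list are assumptions for illustration, not sc2rf's actual code.

```python
import argparse
import csv


def build_parser():
    parser = argparse.ArgumentParser(description="Sketch of an optional CSV output flag")
    parser.add_argument("input", help="aligned FASTA file")
    # No default path: CSV is written only if the user asks for it.
    parser.add_argument("--csvfile", default=None, metavar="PATH",
                        help="optional path for CSV output")
    return parser


def write_csv(rows, path):
    """Write (sample, breakpoints) rows to `path` with a header line."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["sample", "breakpoints"])
        writer.writerows(rows)


# Parse an explicit argument list so the sketch does not depend on sys.argv.
args = build_parser().parse_args(["aln.fa", "--csvfile", "out.csv"])
print(args.input, args.csvfile)
# A real script would call write_csv(rows, args.csvfile) only when
# args.csvfile is not None.
```

Keeping the flag as a path rather than a mode switch means the screen output can still be silenced with ordinary shell redirection.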
|
Okay, I've got the beginnings of CSV output:
art@Wernstrom sc2rf % python3 sc2rf.py data/BA1-BA2-Finland.aln.fa --csvfile temp.csv
Potential recombinants between ['Omicron / BA.1 / 21K', 'Omicron / BA.2 / 21L']:
coordinates 1111111111111222222222222222222222222222222222222222222222222222222
223445899990000123455789011112222222222222333333333333344445566666677777888889
26780133334580144581427419067892566666778899000000245568914450502557823338238885
47933828942362944389041165516480777788781899144567002905432600867370558880718881
10027416344469879705804035582670834968563225308535235944804930400079892347111230
genes 1a 1b S 3aEM 6 7b9N
ref CTCACGCTGCACCCCGCACTCCCCACACCCGTGTCTCAGAGTGCAAGAATCACTCCGCATCCCCCACGCAGATCACGGGA
Omicron / BA.1 / 21K T••GT••GA••••T••AG•CTT••G•••TT••ACTCT•••••AACGAGTCAGTGAATATATTT•TGGA•C•••TTTAAC•
Omicron / BA.2 / 21L TGT•TAT••TGTTTTAA•T•T•TTGTGT••AGA•TCTGACTGAACG•GTC•GTGAAT•TA•TTTT•GATCCTCTTTAACC
2022-05483 T••GT••GA••••T••AG•CT•TTGTGT••AGA•TCTGACTGAACG•GTC•GTGAAT•TA•TTTT•GATCCTCTTTAACC 1 BP
2022-05506 T••GT••GA••••T••AG•CT•TTGTGT••AGANNNNNACTGAACG•GTC•GTGAAT•TA•TTTT•GATCCTCTTTAACC 1 BP
2022-05508 T••GT••GA••••T••AG•CT•TTGNNT••AGN•TCTGACTGAACG•GTC•GTGAAT•TA•TTTT•GATCCTCTTTAACC 1 BP
2022-05526 T••GT••GA••••T••AG•CT•TTGTGT••AGA•TCTGACTGAACG•GTC•GTGAAT•TA•TTTT•GATCCTCTTTAACC 1 BP
2022-05535 T••GT••GA••••T••AG•CT•TTGTGT••AGA•TCTGACTGAACG•GTC•GTGAAT•TA•TTTT•GATCCTCTTTAACC 1 BP
2022-05565 T••GT••GA••••T••AG•CT•TTGTGT••AGA•TCTGACTGAACG•GTC•GTGAATNTA•TTTTNNATCCTCTTTAACC 1 BP
2022-05568 T••GT••GA••••T••AG•CT•TTGTGT••AGA•TCTGACTGAACG•GTC•GTGAAT•TA•TTTT•GATCCTCTTTAACC 1 BP
2022-05602 T••GT••GA••••T••AG•CT•TTGTGT••AGA•TCTGACTGAACG•GTC•GTGAAT•TA•TTTT•GATCCTCTTTAACC 1 BP
made with Sc2rf - available at https://github.com/lenaschimmel/sc2rf
art@Wernstrom sc2rf % head temp.csv
sample,examples,intermissions,breakpoints,regions
2022-05483,"Omicron / BA.1 / 21K,Omicron / BA.2 / 21L",0,1,
2022-05506,"Omicron / BA.1 / 21K,Omicron / BA.2 / 21L",0,1,
2022-05508,"Omicron / BA.1 / 21K,Omicron / BA.2 / 21L",0,1,
2022-05526,"Omicron / BA.1 / 21K,Omicron / BA.2 / 21L",0,1,
2022-05535,"Omicron / BA.1 / 21K,Omicron / BA.2 / 21L",0,1,
2022-05565,"Omicron / BA.1 / 21K,Omicron / BA.2 / 21L",0,1,
2022-05568,"Omicron / BA.1 / 21K,Omicron / BA.2 / 21L",0,1,
2022-05602,"Omicron / BA.1 / 21K,Omicron / BA.2 / 21L",0,1,
obviously haven't implemented |
Now reporting regions and private mutations in CSV:
|
issued PR #25 |
Thank you a lot for your PR and for putting up with my uncommented, chaotic code! I just tested it and noticed that the format of the regions is different from the one @ktmeaton and @corneliusroemer discussed above. Is this a deliberate choice? I think I would prefer their version. I was just about to implement that change myself, but realized it might be better to discuss it before. |
I would prefer their version too, but I reached the end of the day and I'm supposed to be reading someone's thesis :-) If you can change the format, that would be great! |
Ok, I can try to change the format, but I can't promise that I'll find time for it before Tuesday evening (even though it's probably only about 20 minutes of work). |
Ah it's ok, I'll work on it - I just got the impression from your earlier comment that you were already about to implement a format change! |
@ktmeaton - your example indicates that the last column should report regions that match more than one example: Incorporating multiple matches in a given region is going to take a bit more work because we'll need to record this information under the condition |
Actually cases where a region matches multiple examples are not even registered as breakpoints in the current code. If the list Lines 671 to 673 in c96954b
Is this an edge case that we can ignore, or should this also be considered a breakpoint that is not marked in the screen output, but recorded in the CSV? |
In my view, recombinants of more than 2 parents are very rare and this is an edge case that could be ignored for now. Could always be extended later on? Triple recombinants are rather unlikely for now I think. |
I've almost got it, though :-) |
art@Wernstrom sc2rf % python3 sc2rf.py data/BA1-BA2-Finland.aln.fa --csvfile temp.csv > /dev/null
art@Wernstrom sc2rf % cat temp.csv
sample,examples,intermissions,breakpoints,regions
2022-05483,"21K,21L",0,1,"241:14408|21K,15240:29510|21L"
2022-05506,"21K,21L",0,1,"241:14408|21K,15240:29510|21L"
2022-05508,"21K,21L",0,1,"241:14408|21K,15240:29510|21L"
2022-05526,"21K,21L",0,1,"241:14408|21K,15240:29510|21L"
2022-05535,"21K,21L",0,1,"241:14408|21K,15240:29510|21L"
2022-05565,"21K,21L",0,1,"241:14408|21K,15240:29510|21L"
2022-05568,"21K,21L",0,1,"241:14408|21K,15240:29510|21L"
2022-05602,"21K,21L",0,1,"241:14408|21K,15240:29510|21L"
art@Wernstrom sc2rf % python3 sc2rf.py data/BA1-BA2-Finland.aln.fa --csvfile temp.csv --show-private-mutations > /dev/null
art@Wernstrom sc2rf % cat temp.csv
sample,examples,intermissions,breakpoints,regions,privates
2022-05483,"21K,21L",0,1,"21:14408|21K,15240:29759|21L","C3241T,G5924A,C22792T,C27945T"
2022-05506,"21K,21L",0,1,"21:14408|21K,15240:29759|21L","C3241T,G5924A,C22792T,C27945T"
2022-05508,"21K,21L",0,1,"21:14408|21K,15240:29759|21L","C3241T,G5924A,A22628C,A22634C,C22792T,C27945T"
2022-05526,"21K,21L",0,1,"21:14408|21K,15240:29759|21L","C21T,C3241T,G5924A,C22792T,C27945T,G29759T"
2022-05535,"21K,21L",0,1,"21:14408|21K,15240:29759|21L","C21T,C3241T,G5924A,C22792T,C27945T"
2022-05565,"21K,21L",0,1,"21:14408|21K,15240:29759|21L","T2886G,C3241T,G5924A,C22792T,C27945T"
2022-05568,"21K,21L",0,1,"21:14408|21K,15240:29759|21L","C3241T,G5924A,C22792T,C27945T"
2022-05602,"21K,21L",0,1,"21:14408|21K,15240:29759|21L","C3241T,G5924A,T19857A,C22792T,C27945T,G29759C"
just need a test case to confirm that all examples are reported when two or more match a given region |
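For anyone consuming this file downstream, here is a small sketch of reading the `regions` and `privates` columns back in. It assumes the `start:end|label` layout seen in the output above, with one label per region; how a region matching several examples would be encoded is not settled at this point, so that case is not handled. The helper names are made up for illustration.

```python
import csv
import io


def parse_regions(field):
    """Split a regions field such as "21:14408|21K,15240:29759|21L"."""
    regions = []
    for chunk in field.split(","):
        span, label = chunk.split("|")
        start, end = span.split(":")
        regions.append((int(start), int(end), label))
    return regions


def breakpoint_intervals(regions):
    """Return the (lower, upper) bounds between consecutive regions."""
    return [(left[1], right[0]) for left, right in zip(regions, regions[1:])]


# Header and first row copied from the output above.
text = (
    "sample,examples,intermissions,breakpoints,regions,privates\n"
    '2022-05483,"21K,21L",0,1,"21:14408|21K,15240:29759|21L",'
    '"C3241T,G5924A,C22792T,C27945T"\n'
)

for row in csv.DictReader(io.StringIO(text)):
    regions = parse_regions(row["regions"])
    print(row["sample"], breakpoint_intervals(regions))
    print(row["privates"].split(","))
# 2022-05483 [(14408, 15240)]
# ['C3241T', 'G5924A', 'C22792T', 'C27945T']
```

The gap between the end of one region and the start of the next is the uncertain interval that contains the breakpoint, as discussed earlier in the thread.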
@ktmeaton @corneliusroemer - while generating test cases, I realized that not all examples have a |
@ArtPoon can you explain what you mean by "not all examples have a NextstrainClade or PangoLineage", I don't quite understand. Can you give an example of a full name? |
|
@corneliusroemer wrote:
Yes, I believe that triple recombinants will occur very rarely in reality - even though we might encounter recombinants where one of the two parents is already a recombinant, and maybe one that has not already been assigned an
@ArtPoon wrote:
I think it is okay to ignore this edge case for now, and not generate the proper output for the affected samples.
The case for results with many parents
The final result of analyzing a recombinant sample will most likely show two parents and a small number of breakpoints. Situations with more than two parents are not very relevant as final results, but might become important as intermediate results. Sometimes, Sc2rf will propose three or more potential parents when it can't be sure which two parents are the actual ones. I think it can be useful to show this to the user, so that they can decide manually or with some other tool.
In the mid-term future, I want to add another output mode to Sc2rf, so that the visual display of three or more potential parents is much more useful than it currently is. Once this is working, we might want to reduce the filtering that Sc2rf currently applies to avoid three-parent hypotheses during analysis. Perhaps Sc2rf might even do something like "These are all potential parents for this sample with more than 5% likelihood" and then list 7 parents, some of which are very similar to each other.
On the one hand, it would be useful to save those unclear intermediate results to a file so that some other tool can try to make a better decision. On the other hand, I'm unsure whether the current file format is a good approach to capture these unclear results, or whether we would need another, more complex format anyway. But let's not worry too much about that future now. |
@lenaschimmel - thanks for the detailed response! I agree that it would be useful to indicate to the user when matches are ambiguous (with multiple parents). |
Absolutely! |
Ok, pushed the above change. CSV now looks like this:
|
Proposed fix for #19 - output CSV
Right now the output is good for interactive human analysis, but there's no machine-readable csv/tsv output for sharing or further analysis.
From my experience with Nextclade, the main difficulty here is designing the file spec: which columns to include, which separator to use when a column needs an intra-column separator, and so on.
It's probably best to discuss the format on this issue before implementing anything, as one tends to get locked into the format.
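As one illustration of the separator question, the sketch below uses Python's standard `csv` module, which quotes any field containing the column delimiter. That lets a comma serve as the intra-column separator too, with `:` and `|` structuring each region inside the field. The function name `format_row`, the sample name, and the choice of these particular separators are assumptions for illustration, not a fixed spec.

```python
import csv
import io


def format_row(sample, examples, regions):
    """Format one CSV line; `regions` is a list of (start, end, label) tuples."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([
        sample,
        ",".join(examples),  # quoted automatically because it contains a comma
        ",".join(f"{start}:{end}|{label}" for start, end, label in regions),
    ])
    return buffer.getvalue()


print(format_row("sampleA", ["21K", "21L"],
                 [(241, 14408, "21K"), (15240, 29510, "21L")]), end="")
# sampleA,"21K,21L","241:14408|21K,15240:29510|21L"
```

A TSV variant would only need `delimiter="\t"` on the writer, in which case the inner commas would no longer require quoting.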