Add script to pull out clade-specific indels and supernodes #3
Comments
How to do the second bit (b)? Make a filelist, FLIST, listing all the cleaned sample graph binaries in a cluster.
Then list this in a file, COLLIST, and run:
cortex_var_31_c1 --colour_list COLLIST --mem_height 21 --mem_width 100 --sample_id cluster_name --output_supernodes contigs.fa --max_var_len 100000
This gives us a fasta file, contigs.fa.
Then make a filelist for every sample.
Then, say there are 21 samples in our cluster.
This will produce a matrix: for each contig (row), it will say what % of the kmers in that contig are in each sample (column).
Then we just look for contigs present only in our cluster (set of columns). |
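The last step above (picking out contigs present only in our cluster) could be sketched roughly as below. This is only an illustration: it assumes the matrix has been saved as tab-separated text with a header row of sample names and one row per contig, which may not match the actual Cortex output layout. The function name and the `present_pct` / `absent_pct` thresholds are hypothetical choices, not anything Cortex defines.

```python
import csv
import io


def cluster_specific_contigs(matrix_text, cluster_samples,
                             present_pct=90.0, absent_pct=10.0):
    """Return contigs seen in every cluster sample and in no other sample.

    ASSUMED input layout (hypothetical, not confirmed Cortex output):
    tab-separated, header = 'contig' + sample names, then one row per
    contig holding the % of its kmers found in each sample.
    """
    reader = csv.reader(io.StringIO(matrix_text), delimiter="\t")
    header = next(reader)
    samples = header[1:]
    cluster = set(cluster_samples)

    missing = cluster - set(samples)
    if missing:
        raise ValueError("samples not in matrix: %s" % sorted(missing))

    hits = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        contig = row[0]
        pct = dict(zip(samples, (float(x) for x in row[1:])))
        # Present (above threshold) in every sample of the cluster...
        in_cluster = all(pct[s] >= present_pct for s in cluster)
        # ...and absent (below threshold) in every sample outside it.
        outside = all(pct[s] <= absent_pct
                      for s in samples if s not in cluster)
        if in_cluster and outside:
            hits.append(contig)
    return hits


# Made-up toy matrix, purely to show the call.
example = (
    "contig\ts1\ts2\ts3\n"
    "c1\t100\t98\t0\n"
    "c2\t100\t100\t100\n"
    "c3\t5\t0\t99\n"
)
print(cluster_specific_contigs(example, ["s1", "s2"]))
```

The thresholds are deliberately not 100 and 0, so that a few kmers lost to sequencing errors or low coverage do not hide a contig; the right values would need checking on real data.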
Rachel, can I give this to you to try out on Henk's Listeria dataset please? Could even test on Jen's paper I suppose, but I think we have plenty to be going on with. |
OK. :) |
A) Are you suggesting that after making a tree using phyml based on SNPs in samples, we take these clusters and try to further divide them by indels? Or that we build a tree to separate clusters based just on indels? Or should I just choose whichever of these works? |
So this all hinges on how you interpret a tree. Usually, people want to interpret a tree in an evolutionary context, and they want branch lengths to be proportional to time. On that basis, they use just SNPs, making the approximation that there is a fixed mutation rate for SNPs. Indels occur by different mechanisms to SNPs and the rate will be different, so they don't include them. On that basis, I don't really want to make a global tree based on both SNPs and indels. |
We could easily experiment by taking the Minnesota clusters and Listeria testset clusters, and looking at (high confidence) indels in those clusters, and phages. Does this all make sense? |
But I would re-call variants within a cluster with a local reference... |
I'm still confused. Here is what I think you are suggesting:
|
OK, you are close. Here is my corrected version. Once we have a phyml-constructed tree (from SNPs), we 'extract' clusters. Within clusters we want more resolution. This can be done with 1) indels 2) phages. So for each cluster: (all right so far)
A reference has to be a fully assembled reference genome, so we can't just use one of them - we'd need to assemble it fully, which I want to avoid. At worst we can use the same one we used before, but a closer one might be an improvement. However, we could choose our favourite reference (note we only have a big set of references for Salmonella at the moment) by looking at core SNPs (I know I've not explained how to do this for now)
This really is calling again from scratch using run_calls
yes, well I guess first I would look at SNPs called by this method and build a tree just of these samples, and then look at indels on top
yes
Well, I don't think we can automate that in advance |
OK, thanks! |
Hold on - why won't MASH work? What have you found so far? |
Not much… I still have to try it Henk's way, sketching the references first and 'grooming'(?), but if you brute force it like I did yesterday morning:
for ref in /data3/projects/outbreak_challenge/salm/ref_info/Salm_refs/*.fasta; do
done
it takes forever! It will be quicker done properly… I had got the impression you could only compare sample(s) against one reference at a time, but rereading Henk's guidelines, they suggest otherwise. My misunderstanding. |
I believe Henk's suggestion that MASH will work best on trimmed reads, so I think it will be worthwhile using Trimmomatic to clean the reads first. Pretty sure that if you do it right it is extremely fast. |
Lecture starts in 5 min; but yes, groomed/trimmed reads are essential. Otherwise it performs pretty badly. Takes < 30 s for one search on my Mac.
|
Rachel, I put 56 potential other reference sequences in the species/Listeria/refdata folder on the outbryk repository. You can check with Mash which one is the closest match. With a smaller number of references, I found it was L2624.
|
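For picking the closest reference from a Mash run, something like the sketch below might help. It assumes the output of `mash dist` has been captured as text, with its usual five tab-separated columns (reference ID, query ID, Mash distance, p-value, matching hashes). The function name and the toy values are made up for illustration; they are not results from this dataset.

```python
def closest_reference(mash_dist_text):
    """Return (reference_id, distance) for the smallest Mash distance.

    Expects captured `mash dist` output: one tab-separated line per
    reference/query pair, with columns
    reference-ID, query-ID, distance, p-value, matching-hashes.
    Returns None if there are no lines.
    """
    best = None
    for line in mash_dist_text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        ref, _query, dist, _pvalue, _shared = line.split("\t")
        d = float(dist)
        if best is None or d < best[1]:
            best = (ref, d)
    return best


# Invented example lines, only to show the expected shape of the input.
example = (
    "refA.fasta\tsample_trimmed.fastq\t0.0123\t0\t600/1000\n"
    "refB.fasta\tsample_trimmed.fastq\t0.0005\t0\t980/1000\n"
)
print(closest_reference(example))
```

This only handles one query sample at a time; with several samples in the same output, the lines would need grouping by the query column first.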
Henk has warned that it can be hard to separate outbreak samples from other environmental samples in Listeria monocytogenes, as the mutation rate is low, so there may be very few SNPs.
But Jen Gardy pointed out this very interesting paper
http://jcm.asm.org/content/early/2015/08/20/JCM.00202-15.full.pdf
which says that you can use insertions/deletions and mobile elements to improve resolution.
Basically, we might be able to
a) look at the indels in the VCF and see which samples have them, to pull apart clusters
b) use --pan_genome_matrix on all supernodes in a cluster, and see if there are big contigs that split the cluster
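As a rough illustration of (a), the sketch below reads a multi-sample VCF as text and reports, for each indel site, which samples carry the indel allele. It is a minimal sketch under stated assumptions: a standard VCF with a `#CHROM` header line and a `GT` field in FORMAT, and "indel" taken to mean an ALT allele whose length differs from REF. The function name is hypothetical, and no filtering for high-confidence calls is done here.

```python
def indel_carriers(vcf_text):
    """Map each indel site to the list of samples carrying an indel allele.

    Keys are (chrom, pos, ref, alt); values are sample names in header order.
    ASSUMES a standard multi-sample VCF with a GT entry in FORMAT.
    """
    samples = []
    carriers = {}
    for line in vcf_text.splitlines():
        if line.startswith("##") or not line.strip():
            continue  # meta-information or blank line
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        chrom, pos, _id, ref, alt = fields[:5]
        alts = alt.split(",")
        # Genotype indices (1-based) of ALT alleles that change the length.
        indel_idx = {str(i + 1) for i, a in enumerate(alts)
                     if len(a) != len(ref)}
        if not indel_idx:
            continue  # SNP/MNP site, not an indel
        gt_pos = fields[8].split(":").index("GT")
        with_indel = []
        for sample, call in zip(samples, fields[9:]):
            gt = call.split(":")[gt_pos]
            alleles = gt.replace("|", "/").split("/")
            if any(a in indel_idx for a in alleles):
                with_indel.append(sample)
        carriers[(chrom, int(pos), ref, alt)] = with_indel
    return carriers


# Tiny invented VCF: one SNP (ignored) and one 1 bp deletion.
example = (
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\ts1\ts2\ts3\n"
    "chr\t10\t.\tA\tG\t.\tPASS\t.\tGT\t1/1\t0/0\t0/0\n"
    "chr\t50\t.\tAT\tA\t.\tPASS\t.\tGT:COV\t1/1:5\t1/1:7\t0/0:9\n"
)
for site, who in indel_carriers(example).items():
    print(site, who)
```

An indel whose carrier list matches only part of a cluster is a candidate for pulling that cluster apart.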