Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add script to pull out clade-specific indels and supernodes #3

Open
iqbal-lab opened this issue Sep 9, 2015 · 17 comments
Open

Add script to pull out clade-specific indels and supernodes #3

iqbal-lab opened this issue Sep 9, 2015 · 17 comments
Assignees

Comments

@iqbal-lab
Copy link
Owner

Henk has warned that it can be hard to separate outbreak samples from other environmental samples in Listeria monocytogenes, as the mutation rate is low, so there may be very few SNPs.
But Jen Gardy pointed out this very interesting paper
http://jcm.asm.org/content/early/2015/08/20/JCM.00202-15.full.pdf
which says that you can use insertions/deletions and mobile elements to improve resolution.

Basically, we might be able to
a) look at the indels in the VCF and see which samples have them, to pull apart clusters
b) use --pan_genome_matrix on all supernodes in a cluster, and see if there are big contigs that split the cluster

@iqbal-lab
Copy link
Owner Author

How to do the second bit (b) ?

make a filelist, listing all the cleaned sample graph binaries in a cluster, FLIST

cat FLIST
sample1.ctx
sample2.ctx
etc

Then list this in a file
ls FLIST > COLLIST

cortex_var_31_c1 --colour_list COLLIST --mem_height 21 --mem_width 100 --sample_id cluster_name --output_supernodes contigs.fa --max_var_len 100000

This gives us a fasta file contigs.fa
Go through this and discard any contig shorter than 500bp, and calculate the longest read - say this length is X.

Then make a filelist for every sample
ls sample1.ctx > sample1.ctx.list
...
etc
and list these in a file COL2
ls sample*ctx.list > COL2

Then, say there are 21 samples in our clyster
cortex_var_31_c21 --colour_list COL2 --mem_height 21 --mem_width 100 --pan_genome_matrix contigs.fa --max_read_length X

This will produce a matrix - for each contig (row), it will say what % of the kmers in that contig are in each sample (column)

Then we just look for contigs present only in our cluster (set of columns)

@iqbal-lab
Copy link
Owner Author

Rachel -can I give this to you to try out on Henk's listeria dataset please? Could even test on Jen;s paper I suppose, but I think we have plenty to be going with

@rmcolq
Copy link
Collaborator

rmcolq commented Sep 9, 2015

OK. :)

@rmcolq
Copy link
Collaborator

rmcolq commented Sep 10, 2015

A) Are you suggesting that after making a tree using phyml based on snps in samples, we take these clusters and try to further divide them by indels? Or that we build a tree to separate clusters based just on indels?
Or were you thinking of building a tree using both snps and indels (currently using snp-sites to get phylip file to input to phyml. Snp-sites does snps only.)

Or should I just choose one of these that works?

@iqbal-lab
Copy link
Owner Author

So this all hinges on how you interpret a tree. Usually, people want to interpret a tree in an evolutionary context, and they want branch lengths to be proportional to time. On that basis, they use just SNPs, making the approximation that there is a fixed mutation rate for SNPs. Indels occur by different mechanisms to SNPs and the rate will be different, so they don't include them. On that basis, i don't really want to make a global tree based on both SNPs and indels.
However when looking at clusters, these should differ by very few SNPs (~4), and honestly I was not completely clear in my head how I would compare them. I was initially thinking of seeing whether samples that seem identical (or nearly) can be distinguished by indels or phages. Making a tree of things works for vertically inherited indels, but not for horizontally transferring phages.

@iqbal-lab
Copy link
Owner Author

We could easily experiment by taking the Minnesota clusters and Listeria testset clusters, and looking at (high confidence) indels in those clusters, and phages. Does this all make sense?

@iqbal-lab
Copy link
Owner Author

But I would re-call variants within a cluster with a local reference...

@rmcolq
Copy link
Collaborator

rmcolq commented Sep 10, 2015

I'm still confused. Here is what I think you are suggesting:
Once we have a phyml constructed tree (from SNPs), we 'extract' clusters. Within clusters we want more resolution. This can be done with 1) indels 2) phages. So for each cluster:

  • pick a local reference (my feeling with MASH is that it isn't going to be ideal for this - would there be a problem with just picking one of the samples randomly/the first?)
  • re-call variants within that cluster with that reference (is this the combine_vcfs.pl script in cortex, or an actual run_calls.pl again?)
  • using the new vcf(?) subset just indel type variation and try to spot sub-clusters? (indels)
  • separately do what you suggest above with finding large contigs and making matrix with cortex. Try to spot contigs that appear in some groups in the cluster and not others. (phages?)
  • combine this information?

@iqbal-lab
Copy link
Owner Author

OK, you are close. Here is my corrected version

Once we have a phyml constructed tree (from SNPs), we 'extract' clusters. Within clusters we want more resolution. This can be done with 1) indels 2) phages. So for each cluster: (all right so far)

  1. pick a local reference (my feeling with MASH is that it isn't going to be ideal for this - would there be a problem with just picking one of the samples randomly/the first?)

A reference has to be a fully assembled reference genome, so we can't just use one of them - we'd need to assemble it fully, which I want to avoid. At worst we can use the same one we used before, but a closer one might be an improvement. However, we could choose our favourite reference (note we only have a big set of references for Salmonella at the moment) by looking at core SNPs (I know I've not explained how to do this for now)

  1. re-call variants within that cluster with that reference (is this the combine_vcfs.pl script in cortex, or an actual run_calls.pl again?)

This really is call again from scratch using run_calls

using the new vcf subset just indel type variation and try to spot sub-clusters? (indels)

yes, well I guess first I would look at SNPs called by this method and build a tree just of these samples, and then look at indels on top

separately do what you suggest above with finding large contigs and making matrix with cortex.

yes

Try to spot contigs that appear in some groups in the cluster and not others.
yes

(phages?)
I would take Henk's list of phages, as a fasta, and maybe remove reduncancy using CD-HIT, and then use pan_genome_matrix to see which samples do/not have them

combine this information?

Well, I don't think we can automate that in advance
Z

@rmcolq
Copy link
Collaborator

rmcolq commented Sep 10, 2015

OK, thanks!

@iqbal-lab
Copy link
Owner Author

Hold on - why won't MASH work? What have you found so far?

@rmcolq
Copy link
Collaborator

rmcolq commented Sep 10, 2015

Not much…I still have to try it Henks way with sketching the references first and ‘grooming’(?), but if you brute force it like I did yesterday morning:

for ref in /data3/projects/outbreak_challenge/salm/ref_info/Salm_refs/*.fasta; do

    ref_id=${ref%.fasta*}  # get the part before the .fasta

    ref_id=${ref_id##*/}

    echo $ref_id

    mash dist $ref ../pseudo_samples/sample_* &> mash_summary_$ref_id.txt

Done

It takes forever!

It will be quicker done properly…

I had got the impression you could only compare sample(s) against one reference at a time, but rereading Henks guidelines, they suggest otherwise. My misunderstanding.

@iqbal-lab
Copy link
Owner Author

I believe henk's suggestion, that MASH will work best on trimmed reads, so I think it will be worthwhile uysing Trimmomatic to clean the reads first. Pretty sure if you do it right it is extremely fast

@hcdenbakker
Copy link
Collaborator

Lecture starts in 5 min; but yes, groomed/trimmed reads is essential. Otherwise it performs pretty badly. Takes < 30 s for one search on my Mac.

Sent from my iPhone

On Sep 10, 2015, at 9:16 AM, Zamin Iqbal notifications@github.com wrote:

I believe henk's suggestion, that MASH will work best on trimmed reads, so I think it will be worthwhile uysing Trimmomatic to clean the reads first. Pretty sure if you do it right it is extremely fast


Reply to this email directly or view it on GitHub.

@rmcolq
Copy link
Collaborator

rmcolq commented Sep 11, 2015

  1. Do we have alternative references for Listeria? Or just the one?
  2. Do we only need to run run_calls again if we are using a new reference?
  3. If not rebuilding the graphs, is there still a way to do the combining and genotyping and merging into one vcf?

@iqbal-lab
Copy link
Owner Author

  1. Yes here
    https://github.com/iqbal-lab/outbryk/tree/master/species/Listeria/refdata
  2. Yes
  3. The joint workflow will do this

@hcdenbakker
Copy link
Collaborator

Rachel,

I put 56 potential other reference sequences in the species/Listeria/refdata folder on the outbryk repository. You can check with Mash which one is the closest match. I found it was L2624 with a smaller number of references.

Sent from my iPhone

On Sep 11, 2015, at 8:14 AM, rmnorris notifications@github.com wrote:

Do we have alternative references for Listeria? Or just the one?
Do we only need to run run_calls again if we are using a new reference?
If not rebuilding the graphs, is there still a way to do the combining and genotyping and merging into one vcf?

Reply to this email directly or view it on GitHub.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants