Using Peroba data.

# Setup

First enter the VM:

In [None]:
! ssh madeline-01

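Next, assign the variables.  These are all the settings you can change before running the notebook.  Shell variables only persist within a single shell session, so run all of the commands below in the same session on the VM.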

In [None]:
#paths and folders
! results=/home/ubuntu/20220225 #today's date: this will be the name of the folder to keep things in
! dbpath=/home/ubuntu/ongaeshi-mnt/databases/ #folder with Peroba data
! metadata=peroba_meta.220209_173253.tsv.xz #Peroba metadata
! alignment=peroba_align.220209_173253.aln.xz #Peroba alignment file
! scripts=/home/ubuntu/scripts
! wuhan_hu_1=/home/ubuntu/Wuhan-Wu-1.fasta #Wuhan-Hu-1 reference genome (shell variable names cannot contain hyphens)

#analysis parameters
! start_date=2021-07-01
! end_date=2021-07-31

Set up a folder to keep your results in:

In [None]:
! mkdir -p ${results} #directory for today's data; -p avoids an error if it already exists

# Preprocessing for England between `start_date` and `end_date`

Subset the metadata by extracting only English records, decompressing it along the way:

In [None]:
! xzgrep 'England/' ${dbpath}${metadata} > ${results}/${metadata/.tsv.xz/.England.tsv}

Subset the English metadata further by keeping only records between two dates:

In [None]:
! s1=${results}/${metadata/.tsv.xz/.England.tsv}_${start_date}_${end_date}
! s2=${s1//-/}
! s3=${s2/.tsv_/_}

! seqnames_file=${s3}.seqnames.txt #output file for sequence names from England between dates
! meta_subset_file=${s3}.tsv #output file for metadata slice (England between two dates)

! python3 ${scripts}/subset_peroba_metadata_by_date.py --metadata_tsv ${results}/${metadata/.tsv.xz/.England.tsv} --start_date ${start_date} --end_date ${end_date} --output_metadata ${meta_subset_file} --output_names ${seqnames_file}
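
The date filter can be pictured with a small Python sketch.  This is not the code in `subset_peroba_metadata_by_date.py`; the column names `sequence_name` and `sample_date`, the inclusive date range, and the skipping of incomplete dates are all assumptions made for illustration.

```python
import csv
import io
from datetime import datetime


def subset_by_date(tsv_text, start, end,
                   date_column="sample_date", name_column="sequence_name"):
    """Return (rows, names) for records dated within [start, end], inclusive.

    The column names are hypothetical; adjust them to the real metadata header.
    """
    first = datetime.strptime(start, "%Y-%m-%d")
    last = datetime.strptime(end, "%Y-%m-%d")
    kept = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        try:
            sampled = datetime.strptime(row[date_column], "%Y-%m-%d")
        except (ValueError, TypeError):
            continue  # skip records whose date is missing or incomplete
        if first <= sampled <= last:
            kept.append(row)
    return kept, [row[name_column] for row in kept]


# Made-up records, for illustration only
example = (
    "sequence_name\tsample_date\n"
    "England/AAAA-1/2021\t2021-06-30\n"
    "England/BBBB-2/2021\t2021-07-15\n"
    "England/CCCC-3/2021\t2021-07\n"
    "England/DDDD-4/2021\t2021-07-31\n"
)
rows, names = subset_by_date(example, "2021-07-01", "2021-07-31")
print(names)
```

The returned `names` list plays the role of the sequence names file, and `rows` the role of the metadata slice.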

Now, use this to get the matching alignments:

In [None]:
! xzgrep --no-group-separator -w -A 1 -f ${seqnames_file} ${dbpath}${alignment} | xz > ${results}/England_alignment.aln.xz
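
Note that `-A 1` only returns the full sequence when each record occupies exactly two lines (header, then sequence).  As a rough illustration of the same selection, here is a Python sketch that also copes with wrapped sequences.  It matches whole header names exactly, which is stricter than `grep -w`; the function names and the example records are made up.

```python
def read_fasta(text):
    """Yield (name, sequence) pairs; a sequence may span several lines."""
    name, chunks = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line.strip())
    if name is not None:
        yield name, "".join(chunks)


def select_records(fasta_text, wanted_names):
    """Keep only the records whose header is in wanted_names, in file order."""
    wanted = set(wanted_names)
    return [(n, s) for n, s in read_fasta(fasta_text) if n in wanted]


# Made-up alignment, with the first sequence wrapped over two lines
fasta = (
    ">England/AAAA-1/2021\nACGT\nACGA\n"
    ">England/BBBB-2/2021\nACGT\n"
    ">England/CCCC-3/2021\nTTTT\n"
)
selected = select_records(fasta, ["England/AAAA-1/2021", "England/CCCC-3/2021"])
print(selected)
```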

From this dataset, remove duplicate sequences (based on the sequence content):

In [None]:
! xzcat ${results}/England_alignment.aln.xz | seqkit rmdup -s -w 0 | xz > ${results}/England_alignment.dedup.aln.xz
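
Removing duplicates by sequence content amounts to keeping the first record seen for each distinct sequence.  A minimal Python sketch of that idea follows; it is an illustration and not `seqkit`'s implementation, and the records are made up.

```python
def remove_duplicate_sequences(records):
    """Keep the first (name, sequence) record for each distinct sequence."""
    seen = set()
    unique = []
    for name, seq in records:
        if seq not in seen:
            seen.add(seq)
            unique.append((name, seq))
    return unique


# Made-up records: "b" carries the same sequence as "a" and is dropped
records = [("a", "ACGT"), ("b", "ACGT"), ("c", "ACGA")]
deduplicated = remove_duplicate_sequences(records)
print(deduplicated)
```

Only the name of the first copy survives, so a NORW sequence that is identical to an earlier non-NORW sequence would disappear under its own name.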

Add the Wuhan-Hu-1 reference for masking:

In [None]:
! xzcat ${results}/England_alignment.dedup.aln.xz | cat - ${wuhan_hu_1:-/home/ubuntu/Wuhan-Wu-1.fasta} | xz > ${results}/England_alignment.dedup.ref.aln.xz

And mask any problematic sites:

In [None]:
! python3 /home/ubuntu/ProblematicSites_SARS-CoV2/src/mask_alignment_using_vcf.py -v ProblematicSites_SARS-CoV2/problematic_sites_sarsCov2.vcf -i ${results}/England_alignment.dedup.ref.aln.xz -o ${results}/England_alignment.dedup.ref.masked.aln.xz -r "Wuhan/WH01/2019"

# Separating into NORW and England-other

Now that you have the English metadata and alignments all ready, extract the NORW sequence data from this to use as your query dataset.

In [None]:
! xzgrep --no-group-separator -A 1 "NORW" ${results}/England_alignment.dedup.ref.masked.aln.xz > ${results}/NORW_${start_date}_${end_date}.aln

Likewise, get rid of all the "NORW" sequences in the England dataset:

In [None]:
! xzcat ${results}/England_alignment.dedup.ref.masked.aln.xz | seqkit grep -v -n -r -p "NORW" -w 0 > ${results}/England_alignment.dedup.ref.masked.noNORW.aln
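
The two steps above partition the alignment by whether the header contains `NORW`.  The Python sketch below shows that partition on made-up records; the function name is hypothetical.

```python
def split_by_label(records, label="NORW"):
    """Split (name, sequence) records into those whose name contains label and the rest."""
    query = [(n, s) for n, s in records if label in n]
    rest = [(n, s) for n, s in records if label not in n]
    return query, rest


# Made-up records for illustration
records = [
    ("England/NORW-AAAA1/2021", "ACGT"),
    ("England/BBBB-2/2021", "ACGA"),
    ("England/NORW-CCCC3/2021", "ACGG"),
]
norw, england_other = split_by_label(records)
print(len(norw), len(england_other))
```

Every record lands in exactly one of the two sets, so the query sequences cannot turn up as their own nearest neighbours.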

# Nearest neighbours to NORW

Then find the 4 nearest neighbours of each NORW sequence in the England file.  This will take a long time (>6h the first time) and should be left to run overnight.  The output is a `.gz` file.

In [None]:
! nohup uvaia -n 4 -r ${results}/England_alignment.dedup.ref.masked.noNORW.aln -o ${results}/NORW_${start_date}_${end_date}.4nn.aln.gz -t 12 ${results}/NORW_${start_date}_${end_date}.aln
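
To give a feel for what a nearest-neighbour search over an alignment does, here is a brute-force Python sketch.  It is not `uvaia`'s algorithm or distance measure: it counts mismatching columns, and ignoring columns where either sequence has `N` or a gap is an assumption made for illustration.

```python
import heapq


def mismatches(a, b):
    """Count differing columns, ignoring those where either sequence has N or a gap."""
    return sum(
        1 for x, y in zip(a, b)
        if x != y and x not in "N-" and y not in "N-"
    )


def nearest_neighbours(query_seq, references, k=4):
    """Return the k (name, sequence) references with the fewest mismatches to the query."""
    return heapq.nsmallest(k, references, key=lambda rec: mismatches(query_seq, rec[1]))


# Made-up aligned sequences of equal length
query = "ACGTACGT"
references = [
    ("r3", "ACGTAAAA"),  # 3 mismatches
    ("r2", "ACGTACGA"),  # 1 mismatch
    ("r4", "ACNTTTGT"),  # 2 mismatches (the N column is ignored)
    ("r1", "ACGTACGT"),  # 0 mismatches
]
closest = [name for name, _ in nearest_neighbours(query, references, k=2)]
print(closest)
```

Comparing every query against every reference like this scales with the product of the two dataset sizes, which helps explain why this step is slow.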

# Reducing to the most diverse neighbours

Remove the redundant nearest neighbours by creating a rapid neighbour-joining tree and then choosing only the sequences that are the most diverse.

NJ tree:

In [None]:
! gunzip -c ${results}/NORW_${start_date}_${end_date}.4nn.aln.gz > ${results}/NORW_${start_date}_${end_date}.4nn.aln
! nohup rapidnj ${results}/NORW_${start_date}_${end_date}.4nn.aln -i fa -t d -n -c 8 -o t -x ${results}/NORW_${start_date}_${end_date}.4nn.aln.tre

Picking the 100 most diverse leaves:

In [None]:
! nohup iqtree -k 100 ${results}/NORW_${start_date}_${end_date}.4nn.aln.tre

The results are saved as `${results}/NORW_${start_date}_${end_date}.4nn.aln.tre.pda`.  Extract the list of sequence names to keep, being careful to reformat the names to match the alignment file and to remove duplicate names.

In [None]:
! grep -A 100 "The optimal PD set has 100 taxa:" ${results}/NORW_${start_date}_${end_date}.4nn.aln.tre.pda | sed '1d'  | sed 's/^.//;s/.$//;s/_/\//;s/_/\//' | sort | uniq > ${results}/nn_seqs_to_keep.txt
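
The `sed` expression strips the first and last character of each line and turns the first two underscores back into slashes.  The Python sketch below does the same on made-up lines; the quoted format of the example lines is an assumption, not copied from a real `.pda` file.

```python
def reformat_name(line):
    """Drop the first and last character, then replace the first two underscores with slashes."""
    return line[1:-1].replace("_", "/", 2)


# Hypothetical taxon lines, including one duplicate
pda_lines = [
    "'England_NORW-AAAA1_2021'",
    "'England_BBBB-2_2021'",
    "'England_NORW-AAAA1_2021'",
]
names_to_keep = sorted(set(reformat_name(line) for line in pda_lines))
print(names_to_keep)
```

Only the first two underscores are replaced, so a sample identifier that itself contains an underscore would end up with a slash in the wrong place.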

Prepare a fasta file of the most diverse neighbours:

In [None]:
! grep -w -A 1 -f ${results}/nn_seqs_to_keep.txt ${results}/NORW_${start_date}_${end_date}.4nn.aln --no-group-separator > ${results}/reduced_refs.aln

Then, redraw the tree using just the diverse neighbours.  This is another slow step.

In [None]:
! iqtree -s ${results}/reduced_refs.aln

# Combining neighbours and queries into an ML tree

Concatenate the query and neighbour sequences together:

In [None]:
! cat ${results}/NORW_${start_date}_${end_date}.aln ${results}/reduced_refs.aln > ${results}/allseqs.aln

Now make a tree out of everything, using the neighbour tree you just made as the backbone:

In [None]:
! iqtree -s ${results}/allseqs.aln -m HKY+G -g ${results}/reduced_refs.aln.treefile -t PARS

It is saved as `${results}/allseqs.aln.treefile`.

# Getting leaftips and relevant metadata

Now for things you might need later...

A list of leaftips in the final tree, formatted to match the tree:

In [None]:
! grep "England" ${results}/allseqs.aln | sed 's/>//g' > ${results}/leaftips.txt

The metadata that belongs to the samples in your tree:

In [None]:
# straighten out the metadata formatting in-place

! sed -i 's/ \/ /\//g;s/location_cogukage/location_coguk    age/;1s/\s\+/\t/g' ${meta_subset_file}

# subset just the records in your tree.  The names do not yet match those in the tree; the next section deals with this.

! grep -f ${results}/leaftips.txt ${meta_subset_file} --no-group-separator > ${results}/leaftips_metadata.tsv

Make the fasta file names match too:

In [None]:
! sed -i 's/\//_/g' ${results}/allseqs.aln

# Run `treetime`

Get the names, dates, regions and countries in a `.tsv` file that `treetime` accepts:

In [None]:
! python3 ${scripts}/get_treetime_states.py --metadata_tsv ${results}/leaftips_metadata.tsv --leaftips ${results}/leaftips.txt --out ${results}/leaftips_treetime_states.tsv

Run a mugration model using `region`:

In [None]:
! treetime mugration --tree ${results}/allseqs.aln.treefile --states ${results}/leaftips_treetime_states.tsv --attribute 'region' --outdir ${results}/timetree-mugration-region