In [1]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

In [1]:
from ete3 import Tree
from Bio import AlignIO
from cyvcf2 import VCF

###### Purpose:
Prepare input data for SNPPar according to the requirements described here: https://github.com/d-j-e/SNPPar

20210508NC

### 1. [Get full-length recombination-free alignment](#1)
### 2. [Get list of SNP positions](#2)
### 3. [Use snp-sites to get SNP alignments](#3)
### 4. [Convert residual Ns to -s](#4)

<a id="1"></a>
### 1. Get a full-length, recombination-free alignment

Currently I only have the standard output of Gubbins, which is a recombination-free SNP alignment. Because I want to know the position of each SNP in the alignment, I need to start from a full-length alignment, but I also want to exclude recombination events. Here I use a script from Nick Croucher to mask the full-length alignment using the Gubbins output (the GFF of predicted recombinant regions). The script can be found here: https://github.com/sanger-pathogens/gubbins/tree/masking_aln




In [5]:
!python3 ../scripts/mask_gubbins_aln.py -h

usage: mask_gubbins_aln [-h] --aln ALN --gff GFF --out OUT [--out-fmt OUT_FMT]
                        [--missing-char MISSING_CHAR]

Mask recombinant regions detected by Gubbins from the input alignment

optional arguments:
  -h, --help            show this help message and exit
  --aln ALN             Input alignment (FASTA format)
  --gff GFF             GFF of recombinant regions detected by Gubbins
  --out OUT             Output file name
  --out-fmt OUT_FMT     Format of output alignment
  --missing-char MISSING_CHAR
                        Character used to replace recombinant sequence


In [1]:
input_alignment='/n/data1/hms/dbmi/farhat/nikki/abscessus/fasta_for_gubbins/mab_MSA_for_Gubbins_w_outgroup_ref.fasta'
gubbins_gff='/n/data1/hms/dbmi/farhat/nikki/abscessus/gubbins/mab/raxml/Gubbins_run1/mab_raxml.recombination_predictions.gff'
full_masked_aln_for_snpPar='../vars/mab_fullLengthAln_gubbinsMasked_for_snpPar_20210507.fasta'

#!python3 ../scripts/mask_gubbins_aln.py --aln $input_alignment --gff $gubbins_gff --out $full_masked_aln_for_snpPar --missing-char '-' --out-fmt 'fasta'

**Note:** In this masked alignment, `N` marks sites I masked previously because of quality issues, and `-` marks predicted recombination. This is the opposite of how Gubbins outputs its alignment, so be sure to account for that later when tabulating the percentage of the genome that falls in predicted recombination regions.
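When it comes time to tabulate this, the two characters can be counted per sequence. The following is only a minimal sketch, assuming the convention described in the note (`-` for predicted recombination, `N` for quality-masked sites). The function name `masking_fractions` is hypothetical and is not part of any script used in this notebook. Any true alignment gaps would also be counted as `-`.

```python
from collections import Counter


def masking_fractions(sequence):
    """Return the fraction of sites that are '-' (recombination) and 'N' (quality-masked).

    Assumes the convention of the masked alignment in this notebook:
    '-' = predicted recombination, 'N' = previously masked for quality.
    """
    length = len(sequence)
    if length == 0:
        return {"recombination": 0.0, "quality_masked": 0.0}
    counts = Counter(sequence.upper())
    return {
        "recombination": counts["-"] / length,
        "quality_masked": counts["N"] / length,
    }


# Toy sequence: 10 sites, 2 masked as recombination and 1 masked for quality
print(masking_fractions("ACGT--NACG"))
```

With Biopython, the same function could be applied to `str(record.seq)` for each record in the masked alignment.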

<a id="2"></a>
### 2. Use snp-sites to generate VCFs from the alignments and get the SNP positions

#### First make sure the isolates in the tree match the isolates in the fasta file:

I now have two alignments: the unmasked input alignment (`input_alignment`) and the recombination-masked alignment (`full_masked_aln_for_snpPar`).

In [21]:
tree_path='/n/data1/hms/dbmi/farhat/nikki/abscessus/0_NOTEBOOKS/working_trees/mab/mab_upid_dropped_outgroup_and_outlier_distance_rooted.tree'

In [22]:
# read in the tree
tree=Tree(tree_path, format=0)

In [23]:
# get a list of all the isolates in the tree
isolates_in_tree=[l.name for l in tree.get_leaves()]

In [24]:
len(isolates_in_tree)

356

In [25]:
# write isolate list to a text file
with open('../vars/isolates_in_mab_tree.txt', 'w') as filehandle:
    for isolate in isolates_in_tree:
        filehandle.write('%s\n' % isolate)

In [2]:
input_aln_filtered='/n/data1/hms/dbmi/farhat/nikki/abscessus/0_NOTEBOOKS/010_homoplasy/vars/mab_unmasked_msa_treeIsolatesFiltered.fasta'
masked_aln_filtered='/n/data1/hms/dbmi/farhat/nikki/abscessus/0_NOTEBOOKS/010_homoplasy/vars/mab_masked_msa_treeIsolatesFiltered.fasta'

A. Filter the unmasked alignment:

In [3]:
#use seqkit to subset the fasta file
!seqkit grep -f ../vars/isolates_in_mab_tree.txt $input_alignment > $input_aln_filtered

In [28]:
# double check I have the right number of sequences in the fasta file
!grep ">" $input_aln_filtered | wc -l

356


B. Filter the recombination-free alignment:

In [29]:
#use seqkit to subset the fasta file
!seqkit grep -f ../vars/isolates_in_mab_tree.txt $full_masked_aln_for_snpPar > $masked_aln_filtered

In [30]:
# double check I have the right number of sequences in the fasta file
!grep ">" $masked_aln_filtered | wc -l

356
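Matching counts do not by themselves guarantee that the names match, so a stricter check could compare the two sets of names directly. This is a minimal sketch under the assumption that a sequence ID is the header text up to the first whitespace. The helper names `fasta_ids` and `compare_isolates` are hypothetical.

```python
def fasta_ids(path):
    """Collect sequence IDs (header text up to the first whitespace) from a FASTA file."""
    ids = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                header = line[1:].strip()
                ids.append(header.split()[0] if header else "")
    return ids


def compare_isolates(tree_names, fasta_names):
    """Return (names only in the tree, names only in the FASTA), each sorted."""
    tree_set, fasta_set = set(tree_names), set(fasta_names)
    return sorted(tree_set - fasta_set), sorted(fasta_set - tree_set)


# Hypothetical usage with the variables defined in the cells above:
# only_tree, only_fasta = compare_isolates(isolates_in_tree, fasta_ids(masked_aln_filtered))
# assert not only_tree and not only_fasta
```

Both returned lists should be empty if the tree and the filtered alignment contain exactly the same isolates.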


#### Use snp-sites to get an output VCF:

A. Unmasked alignment:

In [31]:
unmasked_vcf="/n/data1/hms/dbmi/farhat/nikki/abscessus/0_NOTEBOOKS/010_homoplasy/vars/mab_snpSites_unmasked.vcf"

In [33]:
!snp-sites -v -o $unmasked_vcf $input_aln_filtered

B. Masked alignment:

In [35]:
masked_vcf="/n/data1/hms/dbmi/farhat/nikki/abscessus/0_NOTEBOOKS/010_homoplasy/vars/mab_snpSites_masked.vcf"

In [36]:
!snp-sites -v -o $masked_vcf $masked_aln_filtered 

#### parse VCF for all the SNP positions

A. get snp positions from unmasked alignment:

In [38]:
snp_pos=[]
v1=VCF(unmasked_vcf)
for v in v1:
    snp_pos.append(v.POS)

In [39]:
len(snp_pos)

189066

In [40]:
# convert into a position file 
with open('../vars/mab_unmasked_snp_positions.txt', 'w') as filehandle:
    for pos in snp_pos:
        filehandle.write('%s\n' % pos)

B. get snp positions from masked alignment:

In [43]:
snp_pos_masked=[]
v2=VCF(masked_vcf)
for v in v2:
    snp_pos_masked.append(v.POS)

In [45]:
# convert into a position file 
with open('../vars/mab_masked_snp_positions.txt', 'w') as filehandle:
    for pos in snp_pos_masked:
        filehandle.write('%s\n' % pos)
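As a cross-check that does not depend on cyvcf2, the positions can also be read straight from the VCF text, since `POS` is the second tab-separated column of every record line. The sketch below assumes an uncompressed, tab-delimited VCF. The function name `vcf_positions` is hypothetical.

```python
def vcf_positions(lines):
    """Yield the 1-based POS field of each VCF record line, skipping header lines."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # blank lines, '##' meta lines, and the '#CHROM' column header
        fields = line.rstrip("\n").split("\t")
        yield int(fields[1])


# Toy VCF content (not real data from this analysis)
toy_vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT",
    "1\t12\t.\tA\tG",
    "1\t345\t.\tC\tT,*",
]
print(list(vcf_positions(toy_vcf)))
```

Passing an open file handle instead of a list works the same way, and the resulting list can be compared against `snp_pos` or `snp_pos_masked`.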

<a id="3"></a>
### 3. Use snp-sites to get SNP alignments

A. get snps from the unmasked alignment:

In [10]:
unmasked_snp_alignment="/n/data1/hms/dbmi/farhat/nikki/abscessus/0_NOTEBOOKS/010_homoplasy/vars/mab_unmasked_snpAln.fasta"

In [48]:
!snp-sites -o $unmasked_snp_alignment $input_aln_filtered

In [49]:
aln_unmasked=AlignIO.read(unmasked_snp_alignment, "fasta")

In [51]:
len(aln_unmasked[0])

189066

B. get snps from the masked alignment

In [2]:
masked_snp_alignment="/n/data1/hms/dbmi/farhat/nikki/abscessus/0_NOTEBOOKS/010_homoplasy/vars/mab_masked_snpAln.fasta"

In [53]:
#!snp-sites -o $masked_snp_alignment $masked_aln_filtered 

In [6]:
aln_masked=AlignIO.read(masked_snp_alignment, "fasta")

In [80]:
len(aln_masked[0])

65231

<a id="4"></a>
### 4. Convert remaining N to -

SNPPar expects an input FASTA in which all missing or ambiguous sites are `-`, but my alignments still contain some `N`s. Here I convert each `N` to `-` using the script convert_aln_char.py.

In [27]:
!python3 ../scripts/convert_aln_char.py -h

usage: convert_aln_char [-h] --in_aln IN_ALN --out_aln OUT_ALN --old_char
                        OLD_CHAR --new_char NEW_CHAR

Remove all instances of one characterfrom an alignment and replace them with a
new character

optional arguments:
  -h, --help           show this help message and exit
  --in_aln IN_ALN      Input alignment (FASTA format)
  --out_aln OUT_ALN    Output file name (FASTA format)
  --old_char OLD_CHAR  character we want to replace
  --new_char NEW_CHAR  character we want to insert instead of --from_char


In [12]:
unmasked_snp_aln_forSnpPar="/n/data1/hms/dbmi/farhat/nikki/abscessus/0_NOTEBOOKS/010_homoplasy/vars/mab_unmasked_snpAln_forSnpPar.fasta"

In [45]:
!python3 ../scripts/convert_aln_char.py --in_aln $unmasked_snp_alignment --out_aln $unmasked_snp_aln_forSnpPar --old_char 'N' --new_char '-' 

In [13]:
masked_snp_aln_forSnpPar="/n/data1/hms/dbmi/farhat/nikki/abscessus/0_NOTEBOOKS/010_homoplasy/vars/mab_masked_snpAln_forSnpPar.fasta"

In [47]:
!python3 ../scripts/convert_aln_char.py --in_aln $masked_snp_alignment --out_aln $masked_snp_aln_forSnpPar --old_char 'N' --new_char '-' 
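For reference, the kind of replacement applied here can be sketched in a few lines. This is not the code of convert_aln_char.py, which is not shown in this notebook. It is only an illustration that assumes the replacement should touch sequence lines and leave header lines alone, so that an `N` in an isolate name is preserved. The function name `replace_alignment_char` is hypothetical.

```python
def replace_alignment_char(lines, old_char, new_char):
    """Replace old_char with new_char in sequence lines, leaving '>' header lines untouched."""
    converted = []
    for line in lines:
        if line.startswith(">"):
            converted.append(line)  # headers may legitimately contain old_char
        else:
            converted.append(line.replace(old_char, new_char))
    return converted


# Toy FASTA lines (not real data): the 'N' in the header must survive
toy_fasta = [">isolate_N1", "ACNNGT", "N-AC"]
print(replace_alignment_char(toy_fasta, "N", "-"))
```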

##### Convert to MFASTA format (one record per two lines: header, then unwrapped sequence)

In [14]:
masked_snpAln_unwrapped="/n/data1/hms/dbmi/farhat/nikki/abscessus/0_NOTEBOOKS/010_homoplasy/vars/mab_masked_snpAln_unwrapped.fasta"
unmasked_snpAln_unwrapped="/n/data1/hms/dbmi/farhat/nikki/abscessus/0_NOTEBOOKS/010_homoplasy/vars/mab_unmasked_snpAln_unwrapped.fasta"

In [15]:
!seqkit seq -w 0 $masked_snp_aln_forSnpPar > $masked_snpAln_unwrapped

In [16]:
!seqkit seq -w 0 $unmasked_snp_aln_forSnpPar > $unmasked_snpAln_unwrapped
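The unwrapping step joins each record's wrapped sequence lines into a single line. A minimal pure-Python sketch of that idea follows, useful for checking the result on a small example. It is not how seqkit is implemented, and the function name `unwrap_fasta` is hypothetical.

```python
def unwrap_fasta(lines):
    """Join wrapped sequence lines so each record is one header line plus one sequence line."""
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.extend([header, "".join(chunks)])
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        records.extend([header, "".join(chunks)])  # flush the last record
    return records


# Toy wrapped FASTA (not real data)
toy_wrapped = [">a", "AC", "GT", ">b", "TT"]
print(unwrap_fasta(toy_wrapped))
```

After unwrapping, the file should contain exactly two lines per isolate, so the line count of each unwrapped file should be twice the number of sequences.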