`mutyper` demo
==

This notebook demonstrates how to use the Python package [`mutyper`](https://github.com/harrispopgen/mushi/tree/master/mutyper): a utility for... 


## Computing mutation type data for VCF/BCF data

Studying mutation spectra usually begins with computing the mutation type for SNPs in a VCF/BCF file. To polarize SNPs from REF/ALT to ancestral/derived and determine their local context, we need an estimate of the ancestral sequence, which usually takes the form of a FASTA file.

`mutyper` is a Python package for doing this. It also provides a command-line interface for easy integration into more complex bioinformatics pipelines.

### Python API demo
Path to an ancestral FASTA file for human chromosome 1:

In [1]:
fasta = 'data/human_ancestor_GRCh37_e59/human_ancestor_1.fa'

The `mutyper` package provides a module `ancestor` containing a class `Ancestor`. We now import it and instantiate an `Ancestor` object from our FASTA file, requesting 3-mer (triplet) context with `k=3`, which is also the default:

In [2]:
from mutyper.ancestor import Ancestor

ancestor = Ancestor(fasta, k=3)

We can access the `fasta` attribute to inspect FASTA record ids and slice sequences Pythonically, with fast random access that avoids loading the whole file into memory (this uses [`pyfaidx`](https://pythonhosted.org/pyfaidx/) under the hood):

In [3]:
ancestor.fasta.keys()

odict_keys(['ANCESTOR_for_chromosome:GRCh37:1:1:249250621:1'])

In [4]:
chrom = 'ANCESTOR_for_chromosome:GRCh37:1:1:249250621:1'

In [5]:
ancestor.fasta[chrom]

FastaRecord("ANCESTOR_for_chromosome:GRCh37:1:1:249250621:1")

In [6]:
start = 20000000
end = 20000100
ancestor.fasta[chrom][start:end]

>ANCESTOR_for_chromosome:GRCh37:1:1:249250621:1:20000001-20000100
AGACTAACATGGAGAAACCCCATCTCTACTAAGGTACAAAATTAGATGGGCATGGTGGTGCACACCTGTAGTCCCAGCTACTTGGGAGGCTGAGGCAGGA

We can also access the sequence of such a slice as a plain string via its `seq` attribute:

In [7]:
ancestor.fasta[chrom][start:end].seq

'AGACTAACATGGAGAAACCCCATCTCTACTAAGGTACAAAATTAGATGGGCATGGTGGTGCACACCTGTAGTCCCAGCTACTTGGGAGGCTGAGGCAGGA'

**The `mutation_type` method** allows us to specify a SNP by the usual CHROM, POS, REF, ALT, and returns the correctly polarized mutation type as a tuple of ancestral kmer and derived kmer.

In [8]:
ancestor.mutation_type(chrom, 20000000, 'A', 'T')

('GAG', 'GTG')

In [9]:
ancestor.mutation_type(chrom, 20000000, 'T', 'A')

('GAG', 'GTG')

An infinite sites violation (neither REF nor ALT matches the ancestral allele, i.e. the site is not biallelic) returns `(None, None)`:

In [10]:
ancestor.mutation_type(chrom, 20000000, 'T', 'C')

(None, None)
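
The polarization logic illustrated by the three calls above can be sketched in a few lines. This is a minimal illustration, not `mutyper`'s actual implementation: the helper `polarize` and its arguments are hypothetical, it takes the ancestral k-mer directly instead of looking it up in a FASTA file, and it does not attempt any strand collapsing.

```python
def polarize(anc_kmer, ref, alt):
    """Hypothetical helper: polarize a SNP given the ancestral k-mer centered on it.

    Returns (ancestral k-mer, derived k-mer), or (None, None) if the context
    is unavailable or neither REF nor ALT matches the ancestral allele.
    """
    if anc_kmer is None:
        return None, None
    mid = len(anc_kmer) // 2
    anc = anc_kmer[mid]
    if anc == ref:
        der = alt
    elif anc == alt:
        der = ref
    else:
        # neither allele is ancestral: infinite sites violation
        return None, None
    return anc_kmer, anc_kmer[:mid] + der + anc_kmer[mid + 1:]


print(polarize('GAG', 'A', 'T'))  # REF is ancestral
print(polarize('GAG', 'T', 'A'))  # ALT is ancestral
print(polarize('GAG', 'T', 'C'))  # neither is ancestral
```

With an ancestral context of `GAG`, the first two calls give the same polarized type regardless of which allele is labeled REF, and the third gives `(None, None)`, mirroring the outputs shown above.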

**The `region_context` method** returns a generator of ancestral contexts, one per position, for a region specified as in BED format (CHROM, START, END):

In [11]:
start = 20000100
end = 20000120
for context in ancestor.region_context(chrom, start, end):
    print(context)

TCT
GAA
AAT
GAT
TCG
GCG
GCT
AAG
CAA
TCA
GAA
AAC
ACC
CCC
None
None
None
TCC
GAG
CCT


Note that sites in the FASTA that are not capital ACGT are assumed ambiguous, so the context is returned as `None`. 
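
To make the masking behavior concrete, here is a small sketch of a sliding-window context generator over a plain string. It is not `mutyper`'s code: the helper `window_contexts` is hypothetical, and it assumes that a context is `None` whenever any base in its window is not a capital A, C, G, or T, which is consistent with the run of three `None` values in the 3-mer output above. It also skips any strand collapsing.

```python
def window_contexts(seq, k=3):
    """Hypothetical helper: yield the k-mer centered on each interior position
    of seq, or None if the window contains anything other than capital ACGT."""
    flank = k // 2
    for i in range(flank, len(seq) - flank):
        kmer = seq[i - flank:i + flank + 1]
        yield kmer if set(kmer) <= set('ACGT') else None


# a single ambiguous base (N) masks every window that overlaps it
print(list(window_contexts('ACGNTAC')))
# → ['ACG', None, None, None, 'TAC']
```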

We can use this generator method to lazily compute the masked genomic target size for each context, which may be useful for normalizing spectra in different regions or for calibrating mutation rates. As an example, here's the path to a BED mask file (based on the 1000 Genomes strict mask):

In [12]:
bed = '1KG/scons_output/mask.bed.gz'

Loop over BED file entries and update a `Counter` object with the triplets in each entry (we'll stop after roughly the first 1000 entries):

In [13]:
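from collections import Counter
import gzip

target_sizes = Counter()
# NOTE: we pass the FASTA record id `chrom` rather than the BED chromosome name,
# which assumes the entries read here all lie on chromosome 1
with gzip.open(bed, 'rt') as f:
    for ct, line in enumerate(f):
        # BED columns: chromosome, start, end
        this_chrom, start, end = line.split('\t')
        target_sizes.update(ancestor.region_context(chrom, int(start), int(end)))
        # stop after roughly the first 1000 entries
        if ct > 1000:
            break
target_sizes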

Counter({None: 102186,
         'GAA': 5841,
         'AAC': 4009,
         'ACT': 4808,
         'CAG': 17463,
         'ACA': 7791,
         'CAC': 11748,
         'TCA': 6794,
         'ACG': 3653,
         'CCG': 6784,
         'ACC': 8539,
         'TAC': 2245,
         'TAA': 2513,
         'AAA': 5847,
         'AAT': 3018,
         'GAG': 11975,
         'CAT': 4778,
         'GAT': 3496,
         'TCC': 11111,
         'CCT': 14828,
         'GCT': 10598,
         'GCC': 16467,
         'CCA': 14039,
         'TAT': 1882,
         'GCA': 10462,
         'GAC': 7134,
         'TCT': 7907,
         'TAG': 2461,
         'GCG': 4844,
         'CCC': 20202,
         'AAG': 6256,
         'CAA': 4950,
         'TCG': 2484})
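
As a rough sketch of how such target sizes might be used for normalization, the snippet below converts counts to fractions of the unambiguous target. The variable names are illustrative, and only a few entries from the output above are copied in to keep it self-contained.

```python
from collections import Counter

# a few entries copied from the output above, for illustration only
target_sizes = Counter({None: 102186, 'GAA': 5841, 'AAC': 4009, 'ACT': 4808})

# drop ambiguous contexts (None), then convert counts to fractions
usable = {kmer: n for kmer, n in target_sizes.items() if kmer is not None}
total = sum(usable.values())
target_fractions = {kmer: n / total for kmer, n in usable.items()}

for kmer, frac in sorted(target_fractions.items()):
    print(f'{kmer}\t{frac:.3f}')
```

Dividing mutation counts for each context by the corresponding target size (or fraction) is one simple way to compare spectra between regions with different sequence composition.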

### Command line interface demo

Path to a BCF file for chromosome 1 from the 1000 Genomes Project:

In [14]:
bcf = 'data/phase3_1000genomes/bcfs/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.bcf'

Here's a shell command (not Python, as indicated by the leading `!`) that pipes SNP data filtered with [`bcftools view`](http://samtools.github.io/bcftools/bcftools.html#view) into **the [`mutyper.ancestor`](https://github.com/harrispopgen/mushi/blob/master/mutyper/ancestor.py) program**, which adds `mutation_type` to the INFO field. For speed, we look only at a 1 Mb region. This program takes an optional argument `k` to define the k-mer size (3 by default). We then pipe to [`bcftools query`](http://samtools.github.io/bcftools/bcftools.html#query) to show that the INFO field now contains `mutation_type`:

In [15]:
!bcftools view -T {bed} -r1:100000000-101000000 -m2 -M2 -v snps -c 1:minor -Ou -f PASS -U {bcf} \
 | python mutyper/ancestor.py {fasta} \
 | bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%INFO/mutation_type\n' | head

1	100000012	G	T	CCA>CAA
1	100000081	C	T	GCT>GTT
1	100000185	C	T	CCG>CTG
1	100000268	G	T	CCT>CAT
1	100000320	A	G	CAA>CGA
1	100000421	G	T	ACC>AAC
1	100000507	C	A	CCC>CAC
1	100000625	C	T	CCA>CTA
1	100000659	A	G	CAT>CGT
1	100000675	A	G	CAC>CGC
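
Downstream Python code can parse this tabular output directly. The sketch below assumes the five-column, tab-separated layout produced by the `bcftools query` format string above and the `ANCESTRAL>DERIVED` encoding seen in the last column; the three input lines are copied from that output.

```python
from collections import Counter

# three lines copied from the `bcftools query` output above
query_output = (
    "1\t100000012\tG\tT\tCCA>CAA\n"
    "1\t100000081\tC\tT\tGCT>GTT\n"
    "1\t100000185\tC\tT\tCCG>CTG\n"
)

spectrum = Counter()
for line in query_output.splitlines():
    fields = line.split('\t')
    # last column holds the mutation type as ANCESTRAL>DERIVED
    ancestral, derived = fields[4].split('>')
    spectrum[(ancestral, derived)] += 1

print(spectrum)
```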


We can compute ancestral sequence content in a pipelined fashion too. We use [`tabix`](http://www.htslib.org/doc/tabix.html) to filter the gzipped BED mask file to a region of interest (the same 1 Mb region of chromosome 1 used above), and pipe that to **the [`mutyper.masked_size`](https://github.com/harrispopgen/mushi/blob/master/mutyper/masked_size.py) program** to print the triplet-wise target sizes in tabular format. This program also takes an optional argument `k`.

In [16]:
!tabix -p bed {bed} 1:100000000-101000000 | python mutyper/masked_size.py {fasta} | head

AAA	67863
AAC	24350
AAG	33414
AAT	45656
ACA	31404
ACC	15653
ACG	2300
ACT	26517
CAA	30469
CAC	18888


Finally, we can pipe BCF/VCF data containing mutation_type data in the INFO field (see above) to **the program [`mutyper.ksfs`](https://github.com/harrispopgen/mushi/blob/master/mutyper/ksfs.py)** to compute a $k$-SFS in tabular form. The $k$-SFS is a matrix of the count of each mutation type (columns) in the input SNP data at each allele frequency (rows). This can also be used to compute mutation spectra of individuals.
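
To illustrate the definition, here is a toy construction of a $k$-SFS from (sample frequency, mutation type) pairs. The input pairs are made up for illustration and the code is only a sketch of the tabulation idea, not the `mutyper.ksfs` program itself.

```python
from collections import Counter, defaultdict

# hypothetical (sample frequency, mutation type) pairs, one per SNP
snps = [
    (1, 'AAA>ACA'),
    (1, 'AAA>ACA'),
    (1, 'AAC>ATC'),
    (2, 'AAA>ACA'),
    (3, 'AAC>ATC'),
]

# rows: sample frequency; columns: mutation type
ksfs = defaultdict(Counter)
for frequency, mutation_type in snps:
    ksfs[frequency][mutation_type] += 1

mutation_types = sorted({m for _, m in snps})
print('sample_frequency', *mutation_types, sep='\t')
for frequency in sorted(ksfs):
    print(frequency, *(ksfs[frequency][m] for m in mutation_types), sep='\t')
```

Summing each column over all rows collapses the matrix to an overall mutation spectrum.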

**A single individual's mutation spectrum** (filter the BCF to sample HG00096):

In [17]:
!bcftools view -T {bed} -r1:100000000-101000000 -m2 -M2 -v snps -c 1:minor -Ou -f PASS -U {bcf} \
 | bcftools view -s HG00096 -c 1:minor \
 | python mutyper/ancestor.py {fasta} | python mutyper/ksfs.py \
 | cut -f-10

AAA>ACA	AAA>AGA	AAA>ATA	AAC>ACC	AAC>AGC	AAC>ATC	AAG>ACG	AAG>AGG	AAG>ATG	AAT>ACT
8	15	5	2	8	3	3	13	2	5



**$k$-SFS of all individuals in the BCF**

In [18]:
!bcftools view -T {bed} -r1:100000000-101000000 -m2 -M2 -v snps -c 1:minor -Ou -f PASS -U {bcf} \
 | python mutyper/ancestor.py {fasta} | python mutyper/ksfs.py \
 | head -10 | cut -f-10

sample_frequency	AAA>ACA	AAA>AGA	AAA>ATA	AAC>ACC	AAC>AGC	AAC>ATC	AAG>ACG	AAG>AGG	AAG>ATG
1	109	207	65	40	117	23	57	152	31
2	17	34	15	12	19	10	9	34	3
3	8	22	10	1	13	2	7	9	4
4	9	10	10	4	10	1	5	9	2
5	6	7	5	3	6	1	4	6	3
6	5	5	3	2	4	1	1	5	2
7	4	3	3	0	5	1	0	1	2
8	6	5	2	2	4	1	7	3	0
9	2	3	0	1	5	1	1	3	1
