# Germline variants

Germline rare variants have been analysed for two species:

- [H. sapiens](#human)
- [A. thaliana](#thali)


Before running this notebook, run the notebooks in the following folders: nucleosomes, rotational and increase. Some external data also needs to be downloaded; each section gives further details.
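
As a convenience, the prerequisites can be checked before starting. The following is a minimal sketch, not part of the original analysis: the paths are the ones referenced by the cells of this notebook, while the `REQUIRED` list and the `missing_inputs` helper are hypothetical names introduced here for illustration.

```python
from os import path

# Inputs referenced by the cells of this notebook, relative to its folder.
# REQUIRED and missing_inputs are illustrative names, not part of the pipeline.
REQUIRED = [
    path.join('..', 'increase', 'scripts', 'increase.sh'),
    path.join('..', 'increase', 'sapiens', 'hg19_filtered_5mer_counts.json.gz'),
    path.join('..', 'nucleosomes', 'sapiens', 'dyads.bed.gz'),
    path.join('..', 'rotational', 'sapiens', 'high_rotational_dyads.bed.gz'),
    path.join('..', 'rotational', 'sapiens', 'low_rotational_dyads.bed.gz'),
    path.join('..', 'increase', 'thaliana', 'tair10_filtered_5mer_counts.json.gz'),
    path.join('..', 'nucleosomes', 'thaliana', 'dyads.bed.gz'),
]


def missing_inputs(required=REQUIRED, base='.'):
    """Return the required files that do not exist under the base folder."""
    return [f for f in required if not path.exists(path.join(base, f))]


for f in missing_inputs():
    print('missing:', f)
```

If nothing is printed, the outputs of the previous notebooks are in place.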

## H. sapiens <a id="human"></a>

Create a folder named **sapiens** and download into it the data from the 1000 Genomes Project phase 3 release: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/

Classify the data into 3 different categories according to the allele frequency:
- very low: allele freq. $< 0.01$
- low: $0.01 \leq$ allele freq. $< 0.05$
- high: $0.05 \leq$ allele freq. $< 0.5$
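
The three categories can be expressed as a small function. This is only a sketch of the thresholds listed above; the name `classify_af` and the returned labels are assumptions made for illustration.

```python
def classify_af(af):
    """Return the frequency class of an alternative allele.

    Returns None when af >= 0.5, which none of the three categories covers.
    """
    if af < 0.01:
        return 'very_low'
    if af < 0.05:
        return 'low'
    if af < 0.5:
        return 'high'
    return None


for af in (0.004, 0.01, 0.2, 0.7):
    print(af, classify_af(af))
```

Note that the lower bound of each class is inclusive, so an allele frequency of exactly 0.01 falls in the low class.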

In [None]:
import glob
import gzip
from os import path

ws = 'sapiens'

files = [f for f in glob.iglob(path.join(ws, 'ALL.chr*.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz'))]
files.append(path.join(ws, 'ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz'))
files.append(path.join(ws, 'ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz'))

output_file_low = path.join(ws, 'polymorphisms_very_low.tsv.gz')
output_file_medium = path.join(ws, 'polymorphisms_low.tsv.gz')
output_file_high = path.join(ws, 'polymorphisms_high.tsv.gz')

with gzip.open(output_file_low, 'wt') as outfile_low, \
        gzip.open(output_file_medium, 'wt') as outfile_medium, gzip.open(output_file_high, 'wt') as outfile_high:
    for file in files:
        with gzip.open(file, 'rt') as infile:
            for line in infile:
                # skip the header
                if not line.startswith('#'):
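                    # split only the fixed columns; the genotype columns are not needed
                    line_spl = line.rstrip().split('\t', 8)
                    # select only PASS variants (FILTER is the 7th VCF column)
                    if line_spl[6] == 'PASS':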
                        chrom = line_spl[0]
                        pos = line_spl[1]
                        ref = line_spl[3]
                        alts = line_spl[4]
                        info = line_spl[7]

                        # get the dictionary with the info
                        dic_info = {x.split('=')[0]: x.split('=')[1] for x in info.split(';') if "=" in x}
                        list_AF = dic_info['AF'].split(',')

                        # check if the reference is one base length
                        if len(ref) == 1:
                            # with this we check whether there are different alts for the base
                            for ix, alt in enumerate(alts.split(',')):
                                # we only get SNPs
                                if len(alt) == 1:
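                                    # AF of this alternative allele (not necessarily the minor allele)
                                    af = float(list_AF[ix])
                                    out = 'chr{}\t{}\t{}\t{}\t-\n'.format(chrom, pos, ref, alt)
                                    if af < 0.01:
                                        # very low frequency
                                        outfile_low.write(out)
                                    elif af < 0.05:
                                        # low frequency
                                        outfile_medium.write(out)
                                    elif af < 0.5:
                                        # high frequency; alleles with AF >= 0.5 are discarded
                                        outfile_high.write(out)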
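
In a VCF record the `AF` entry of the INFO column holds one value per alternative allele, in the same order as the ALT column, which is why the cell above indexes the AF list with the position of each alternative allele. A minimal sketch of that pairing follows; the function name `alt_frequencies` and the example record are made up for illustration.

```python
def alt_frequencies(alts, info):
    """Pair each ALT allele with its AF value from a VCF INFO string."""
    # flags without '=' (for example MULTI_ALLELIC) are ignored
    info_dict = dict(item.split('=', 1) for item in info.split(';') if '=' in item)
    afs = [float(x) for x in info_dict['AF'].split(',')]
    return list(zip(alts.split(','), afs))


# made-up multi-allelic record: two alternative alleles, two AF values
print(alt_frequencies('A,T', 'AC=3,120;AF=0.0006,0.024;NS=2504;MULTI_ALLELIC'))
```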

Compute the relative increase in mutation rate for the very low frequency polymorphisms:
- ``increase``: analysis using all nucleosomes
- ``increase_rot_high``: analysis using high rotational nucleosomes
- ``increase_rot_low``: analysis using low rotational nucleosomes

In [None]:
%%bash 

source activate env_nucperiod

increase_scripts=${PWD}/../increase/scripts
increase=${PWD}/../increase/sapiens
mapping=${PWD}/../nucleosomes/sapiens
rotational=${PWD}/../rotational/sapiens

cd sapiens

bash ${increase_scripts}/increase.sh polymorphisms_very_low.tsv.gz zoomin hg19 5 ${mapping}/dyads.bed.gz \
    ${increase}/hg19_filtered_5mer_counts.json.gz increase
    
bash ${increase_scripts}/increase.sh polymorphisms_very_low.tsv.gz zoomin hg19 5 ${rotational}/high_rotational_dyads.bed.gz \
    ${increase}/hg19_filtered_5mer_counts.json.gz increase_rot_high

bash ${increase_scripts}/increase.sh polymorphisms_very_low.tsv.gz zoomin hg19 5 ${rotational}/low_rotational_dyads.bed.gz \
    ${increase}/hg19_filtered_5mer_counts.json.gz increase_rot_low

Compute the relative increase in mutation rate for the low and high frequency polymorphisms:

In [None]:
%%bash 

source activate env_nucperiod

increase_scripts=${PWD}/../increase/scripts
increase=${PWD}/../increase/sapiens
mapping=${PWD}/../nucleosomes/sapiens
rotational=${PWD}/../rotational/sapiens

cd sapiens

bash ${increase_scripts}/increase.sh polymorphisms_low.tsv.gz zoomin hg19 5 ${mapping}/dyads.bed.gz \
    ${increase}/hg19_filtered_5mer_counts.json.gz increase_low
    
bash ${increase_scripts}/increase.sh polymorphisms_high.tsv.gz zoomin hg19 5 ${mapping}/dyads.bed.gz \
    ${increase}/hg19_filtered_5mer_counts.json.gz increase_high

## A. thaliana <a id="thali"></a>

Create a folder named **thaliana** and download into it the data from the 1001 Genomes Project: http://1001genomes.org/data/GMI-MPI/releases/v3.1/1001genomes_snp-short-indel_only_ACGTN.vcf.gz

Filter the data to keep only SNVs that pass the filters and compute their allele frequencies with vcftools.

In [None]:
%%bash 

source activate env_nucperiod
cd thaliana

# vcftools writes the allele frequencies to out.frq in the current folder ("out" is its default output prefix)
vcftools --gzvcf 1001genomes_snp-short-indel_only_ACGTN.vcf.gz --freq --remove-filtered-geno-all --remove-indels

Filter by allele frequency < 0.01 and format the output.
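
Each data line of the `.frq` file written by `vcftools --freq` contains the chromosome, the position, the number of alleles, the number of chromosomes and then one `ALLELE:FREQ` item per allele, with the reference allele first. The sketch below parses one such line and applies the same filter; the helper `parse_frq_line` and the example record are made up for illustration.

```python
def parse_frq_line(line):
    """Split a vcftools .frq data line into chrom, pos, reference and alternative alleles."""
    fields = line.rstrip().split('\t')
    chrom, pos = fields[0], int(fields[1])
    alleles = []
    for item in fields[4:]:
        allele, freq = item.split(':')
        alleles.append((allele, float(freq)))
    # the first ALLELE:FREQ item is the reference allele
    return chrom, pos, alleles[0], alleles[1:]


example = '1\t55\t2\t2270\tC:0.995\tT:0.005'  # made-up record
chrom, pos, (ref, ref_af), alts = parse_frq_line(example)

# keep alternative alleles rarer than the reference and below 1% frequency
rare = [(alt, af) for alt, af in alts if af < ref_af and af < 0.01]
print(chrom, pos, ref, rare)
```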

In [None]:
import gzip
import os

ws = 'thaliana'
in_file = os.path.join(ws, 'out.frq')
out_file = os.path.join(ws, 'polymorphisms.tsv.gz')

with open(in_file) as infile, gzip.open(out_file, 'wt') as outfile:
    next(infile)  # skip the header line
    for line in infile:

        line_spl = line.rstrip().split('\t')
        n_alleles = int(line_spl[2])
        ref = line_spl[4].split(':')[0]

        # af stands for allele frequency
        ref_af = float(line_spl[4].split(':')[1])

        for al in line_spl[5:]:

            alt = al.split(':')[0]
            alt_af = float(al.split(':')[1])

            # keep the alt allele if it is rarer than the reference and it is a low frequency polymorphism
            if ref_af > alt_af and alt_af < 0.01:
                out = 'chr{}\t{}\t-\t-\t-\n'.format(line_spl[0], line_spl[1])
                outfile.write(out)

Compute the relative increase in mutation rate:

In [None]:
%%bash 

source activate env_nucperiod

increase_scripts=${PWD}/../increase/scripts
increase=${PWD}/../increase/thaliana
mapping=${PWD}/../nucleosomes/thaliana

cd thaliana

bash ${increase_scripts}/increase.sh polymorphisms.tsv.gz zoomin tair10 5 ${mapping}/dyads.bed.gz \
    ${increase}/tair10_filtered_5mer_counts.json.gz increase