# Periodicity of the relative increase of mutation rate

This notebook computes some of the data required for the relative increase of mutation rate analysis. That analysis has been performed for multiple data sources and for three different species:

- [H. sapiens](#human)
- [A. thaliana](#thali)
- [S. cerevisiae](#yeast)

Before running this notebook, first run the notebooks in the following folder: nucleosomes.

Please note that the relative increase of mutation rate analysis itself is performed in other notebooks.

## H. sapiens <a id="human"></a>

Compute the k-mer counts in the human genome:
- hg19_5mer_counts.json.gz: 5-mer counts in the whole genome
- hg19_filtered_5mer_counts.json.gz: 5-mer counts in the mappable non-genic regions
- hg19_exons_5mer_counts.json.gz: 5-mer counts in the exonic regions
- hg19_filtered_nodyads_5mer_counts.json.gz: 5-mer counts in mappable non-genic regions that do not belong to any nucleosomes
- hg19_3mer_counts.json.gz: 3-mer counts in the whole genome
- hg19_filtered_3mer_counts.json.gz: 3-mer counts in the mappable non-genic regions
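
As an illustration of what a k-mer count is, here is a minimal Python sketch. It is not the code of `genome_content_extended.py`; the function name `count_kmers` and the choice to skip windows containing non-ACGT bases (such as `N`) are assumptions made for this example, and the real script may treat ambiguous bases or strands differently.

```python
from collections import Counter


def count_kmers(sequence, k=5):
    """Count every overlapping k-mer in a DNA sequence (hypothetical helper).

    Windows containing anything other than A, C, G or T are skipped;
    this is an assumption of the sketch, not a documented behaviour
    of the notebook's scripts.
    """
    sequence = sequence.upper()
    valid = set("ACGT")
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if set(kmer) <= valid:
            counts[kmer] += 1
    return counts


# Toy sequence with one ambiguous base in the middle
counts = count_kmers("ACGTACGTNACGTA", k=5)
print(dict(counts))
```

In the notebook, counts of this kind are computed over whole genomes or over the regions of a BED file and stored in the `*_counts.json.gz` files listed above.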

In [None]:
%%bash --out output1 --err error1
# TODO remove

source activate env_nucperiod

genome="hg19"
kmer=5
cores=6
mapping=${PWD}/../nucleosomes/sapiens
scripts=${PWD}/scripts

mkdir -p sapiens
cd sapiens

# Remove chrM from coverage file
zcat ${mapping}/coverage.bed.gz | \
    awk '{OFS="\t"}{if ($1 != "chrM") {print $0}}' | \
    gzip > coverage.bed.gz

# Create a file with the covered regions that do not overlap any nucleosome (dyad +/- 73 bp)
zcat ${mapping}/dyads.bed.gz | \
    awk '{OFS="\t";}{print $1, $2-73, $3+73}' | \
    subtractBed -a coverage.bed.gz -b stdin | \
    gzip > nodyads.bed.gz


# Whole genome 5-mer counts (used for the zoomout analysis)
python ${scripts}/genome_content_extended.py ${genome} ${genome}_${kmer}mer_counts.json.gz \
    --cores ${cores} --kmer ${kmer}
        
# Filtered genome 5-mer counts (used for the zoomin analysis)
python ${scripts}/genome_content_extended.py ${genome} ${genome}_filtered_${kmer}mer_counts.json.gz \
    --regions coverage.bed.gz --cores ${cores} --kmer ${kmer}

# Exons 5-mer counts (used for analysing PanCanAtlas data)
python ${scripts}/genome_content_extended.py ${genome} ${genome}_exons_${kmer}mer_counts.json.gz \
    --regions ${mapping}/exons.merged.bed.gz --cores ${cores} --kmer ${kmer}


# Other files used for comparison

# Filtered genome without nucleosomes 5-mer counts  (used for the no nucleosomes in context analysis)
python ${scripts}/genome_content_extended.py ${genome} ${genome}_filtered_nodyads_${kmer}mer_counts.json.gz \
    --regions nodyads.bed.gz --cores ${cores} --kmer ${kmer} 

kmer=3
# Whole genome 3-mer counts (used for the zoomout analysis)
python ${scripts}/genome_content_extended.py ${genome} ${genome}_${kmer}mer_counts.json.gz \
    --cores ${cores} --kmer ${kmer}
        
# Filtered genome 3-mer counts (used for the zoomin analysis)
python ${scripts}/genome_content_extended.py ${genome} ${genome}_filtered_${kmer}mer_counts.json.gz \
    --regions coverage.bed.gz --cores ${cores} --kmer ${kmer}

Compute, for each dyad position, the number of dyads that are nearby, within 1000 bp on either side (this will be used for the zoomout analysis).
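
The idea can be sketched in plain Python for a single chromosome. This is only an illustration of the window lookup. It does not reproduce `closer_dyads_list.py`, and the function name `nearby_dyad_offsets` and the toy coordinates are made up for this example. Each dyad is treated as a single position, and every dyad (itself included, at offset 0) within the window is reported as an offset relative to it.

```python
from bisect import bisect_left, bisect_right


def nearby_dyad_offsets(positions, window=1000):
    """Map each dyad position to the offsets of the dyads within +/- window bp.

    Hypothetical helper for illustration: positions are dyad coordinates
    on one chromosome.
    """
    positions = sorted(positions)
    result = {}
    for pos in positions:
        lo = bisect_left(positions, pos - window)   # first dyad >= pos - window
        hi = bisect_right(positions, pos + window)  # one past last dyad <= pos + window
        result[pos] = [other - pos for other in positions[lo:hi]]
    return result


# Toy dyad coordinates on a single chromosome
dyads = [100, 285, 470, 1500]
offsets = nearby_dyad_offsets(dyads)
for pos, rel in offsets.items():
    print(pos, rel)
```

The number of nearby dyads for a given dyad is then the length of its offset list.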

In [None]:
%%bash --out output2 --err error2

source activate env_nucperiod

nucleosomes=${PWD}/../nucleosomes/sapiens/dyads.bed.gz
scripts=${PWD}/scripts

cd sapiens

zcat ${nucleosomes} | \
    awk '{OFS="\t"}{print $1, $2-1000, $3+1000, $1 "_" $2 "_" $3 }' | \
    intersectBed -a stdin -b  ${nucleosomes}  -sorted  -wo | \
    awk '{OFS="\t"}{print $4, $6-($2+1000)}' | gzip > closer_dyads.tsv.gz

python ${scripts}/closer_dyads_list.py closer_dyads.tsv.gz closer_dyads.npy

## A. thaliana <a id="thali"></a>

Compute 5-mer counts in the TAIR10 genome excluding genic regions (the CDS annotation is subtracted from the chromosomes, and chrM and chrC are dropped).
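
The region filtering amounts to an interval subtraction, which the cell below delegates to `subtractBed`. As a rough sketch of that operation for one chromosome (the function name `subtract_intervals` and the coordinates are hypothetical, and intervals are assumed to be half-open as in BED files):

```python
def subtract_intervals(start, end, blocked):
    """Return the parts of [start, end) not covered by any blocked interval.

    Hypothetical helper for illustration; blocked intervals may overlap
    each other and may extend past the region.
    """
    kept = []
    cursor = start
    for b_start, b_end in sorted(blocked):
        if b_end <= cursor:
            continue  # already behind the cursor
        if b_start >= end:
            break  # past the end of the region
        if b_start > cursor:
            kept.append((cursor, b_start))
        cursor = max(cursor, b_end)
    if cursor < end:
        kept.append((cursor, end))
    return kept


# A 1000 bp toy chromosome with three annotated intervals to exclude
non_coding = subtract_intervals(0, 1000, [(100, 200), (150, 300), (900, 1200)])
print(non_coding)
```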

In [None]:
%%bash --out output3 --err error3

source activate env_nucperiod

genome="tair10"
kmer=5
cores=6
mapping=${PWD}/../nucleosomes/thaliana
scripts=${PWD}/scripts

mkdir -p thaliana
cd thaliana

awk '{OFS="\t"}{if ($1 != "chrM" && $1 != "chrC") {print $1, 0, $2}}' ${mapping}/tair10.chrom.sizes | \
    subtractBed -a stdin -b ${mapping}/TAIR10_CDS.bed.gz | gzip > coverage.bed.gz

python ${scripts}/genome_content_extended.py ${genome} ${genome}_filtered_${kmer}mer_counts.json.gz \
    --regions coverage.bed.gz --cores ${cores} --kmer ${kmer}

## S. cerevisiae <a id="yeast"></a>

Compute the counts of the 5-mers in the sacCer3 genome.

In [None]:
%%bash --out output4 --err error4

source activate env_nucperiod

genome="saccer3"
kmer=5
cores=6
scripts=${PWD}/scripts

mkdir -p cerevisiae
cd cerevisiae

python ${scripts}/genome_content_extended.py ${genome} ${genome}_${kmer}mer_counts.json.gz \
    --cores ${cores} --kmer ${kmer}