# Accessibility

Reads obtained by sequencing DNase-digested DNA fragments have been analysed for two species:
- [H. sapiens](#human)
- [S. cerevisiae](#yeast)

Before running this notebook, you must first run the notebooks in the following folders: nucleosomes, rotational and increase. In addition, some external data needs to be downloaded. Each section gives further details.

## H. sapiens <a id="human"></a>

Create a folder named **sapiens** and place inside:

- The data from http://eqtl.uchicago.edu/dsQTL_data/MAPPED_READS/all.mapped.reads.tar.gz  
Please decompress and extract the archive before running the notebook.

- The hg18 to hg19 chain file downloaded from http://hgdownload.soe.ucsc.edu/goldenPath/hg18/liftOver/hg18ToHg19.over.chain.gz
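
Before running the cells below, it can help to confirm that the required inputs are in place. The following is a minimal sketch and not part of the original pipeline: `missing_inputs` and `HUMAN_INPUTS` are hypothetical names, and the paths are the ones referenced by the H. sapiens cells in this section, taken relative to the notebook folder.

```python
from pathlib import Path

# Paths used by the H. sapiens cells in this notebook, relative to the
# notebook folder. Adjust them if your layout differs.
HUMAN_INPUTS = [
    "sapiens/hg18ToHg19.over.chain.gz",
    "../nucleosomes/sapiens/dyads.bed.gz",
    "../rotational/sapiens/high_rotational_dyads.bed.gz",
    "../rotational/sapiens/low_rotational_dyads.bed.gz",
    "../increase/sapiens/hg19_filtered_5mer_counts.json.gz",
]


def missing_inputs(paths, base="."):
    """Return the subset of `paths` that does not exist under `base`."""
    base = Path(base)
    return [p for p in paths if not (base / p).exists()]
```

Calling `missing_inputs(HUMAN_INPUTS)` from the notebook folder returns an empty list when every file is present.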

Parse the input data.

In [None]:
%%bash --out output1 --err error1

cd sapiens

data_folder=data/share/DNaseQTLsV2  # folder where the extracted files are

for individual in NA18507 NA18508 NA18516 NA18522 NA19193 NA19238 NA19239
do
    zcat ${data_folder}/${individual}* | \
        awk -v sample=${individual} '{OFS="\t"}{print $1, $2-1, $2, $3, sample}'
done | gzip > filtered_dnase.bed.gz

source activate env_crossmap  # CrossMap needs to run in a different environment
CrossMap.py bed hg18ToHg19.over.chain.gz filtered_dnase.bed.gz filtered_dnase_hg19.bed
gzip -f filtered_dnase_hg19.bed

zcat filtered_dnase_hg19.bed.gz | \
    awk '{OFS="\t"}{print $1, $3, "-", "-", $5}' | \
    gzip > dnase.tsv.gz
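
The two `awk` steps in the cell above can be read as a pair of small transformations: a single position becomes a BED interval with `start = position - 1`, and after the liftover only the chromosome, the end position and the sample are kept. The Python sketch below mirrors that logic for illustration only. The function names are hypothetical, the example input line is made up, and the liftover step in between is skipped.

```python
def read_to_bed(line, sample):
    """Mimic the first awk step: position -> BED interval (start = pos - 1)."""
    # The third column is carried through unchanged; its meaning is not
    # stated in this notebook, so "value" is only a placeholder name.
    chrom, pos, value = line.split()[:3]
    pos = int(pos)
    return (chrom, pos - 1, pos, value, sample)


def bed_to_tsv(bed_record):
    """Mimic the last awk step: keep chromosome, end position and sample."""
    chrom, _start, end, _value, sample = bed_record
    return "\t".join([chrom, str(end), "-", "-", sample])


# Made-up input line, used only to show the shape of the records.
record = read_to_bed("chr1 1000 +", "NA18507")
print(record)
print(bed_to_tsv(record))
```

The BED interval is needed because the liftover operates on BED records, which use a 0-based start and an exclusive end.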

Compute the relative increase in mutation rate for the zoomin using all dyads (``increase`` folder), high rotational dyads (``increase_rot_high`` folder) and low rotational dyads (``increase_rot_low`` folder).

In [None]:
%%bash --out output2 --err error2

source activate env_nucperiod

increase_scripts=${PWD}/../increase/scripts
increase=${PWD}/../increase/sapiens
mapping=${PWD}/../nucleosomes/sapiens
rotational=${PWD}/../rotational/sapiens

cd sapiens

bash ${increase_scripts}/increase.sh dnase.tsv.gz zoomin hg19 5 ${mapping}/dyads.bed.gz \
    ${increase}/hg19_filtered_5mer_counts.json.gz increase
    
bash ${increase_scripts}/increase.sh dnase.tsv.gz zoomin hg19 5 ${rotational}/high_rotational_dyads.bed.gz \
    ${increase}/hg19_filtered_5mer_counts.json.gz increase_rot_high
    
bash ${increase_scripts}/increase.sh dnase.tsv.gz zoomin hg19 5 ${rotational}/low_rotational_dyads.bed.gz \
    ${increase}/hg19_filtered_5mer_counts.json.gz increase_rot_low

## S. cerevisiae <a id="yeast"></a>

Create a folder named **cerevisiae** and place inside:

- The data from https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE69651&format=file  
Please extract the files before running the notebook.

- The sacCer2 to sacCer3 chain file (`sacCer2ToSacCer3.over.chain.gz`) downloaded from http://hgdownload.soe.ucsc.edu/goldenPath/sacCer2/liftOver/

Concatenate all the files into a single one, keeping separate sample identifiers.

In [None]:
import glob
from os import path

import pandas as pd

from nucperiod.utils import int2roman

ws = 'cerevisiae'
output = path.join(ws, 'DNase-seq_saccer2.bed.gz')

data = []
for file in glob.iglob(path.join(ws, '*.csv.gz')):
    sample_id = path.basename(file).replace('GSM1705337_DNase-seq_W303_S_cerevisiae_', '').replace('.csv.gz', '')
    df_saccer = pd.read_csv(file, low_memory=False)
    # Drop chromosome M (mitochondrial); copy so the new column is not set on a view
    df_saccer = df_saccer[df_saccer['chr']!='M'].copy()
    df_saccer['sample'] = sample_id
    data.append(df_saccer)
    

total_saccer = pd.concat(data)
total_saccer = total_saccer[total_saccer['total_count']>0]
total_saccer['fixed_chr'] = total_saccer['chr'].apply(lambda x : 'chr{}'.format(int2roman(int(x))))
total_saccer['pos-1'] = total_saccer['pos']-1

total_saccer.to_csv(output, columns=['fixed_chr', 'pos-1', 'pos', 'sample'], 
             sep ='\t', header = False, index=False, compression='gzip')
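
The `fixed_chr` column above turns numeric chromosome identifiers into names with Roman numerals, such as `chrIV`. The sketch below illustrates that mapping with a stand-in function. It is an assumption about what `nucperiod.utils.int2roman` does, not its actual implementation, and it only covers values below 40, which is enough for the 16 nuclear chromosomes of yeast.

```python
def int_to_roman(number):
    """Convert a positive integer below 40 to a Roman numeral.

    Hypothetical stand-in for nucperiod.utils.int2roman.
    """
    numerals = [(10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
    result = ""
    for value, symbol in numerals:
        # Subtract the largest numeral that still fits, as often as possible.
        while number >= value:
            result += symbol
            number -= value
    return result


def fix_chromosome(chrom):
    """Build a chromosome name with a Roman numeral, e.g. 4 -> 'chrIV'."""
    return "chr{}".format(int_to_roman(int(chrom)))


print(fix_chromosome("4"))
print(fix_chromosome(16))
```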

Convert to sacCer3 and format the output.

In [None]:
%%bash --out output3 --err error3

cd cerevisiae

source activate env_crossmap
CrossMap.py bed sacCer2ToSacCer3.over.chain.gz DNase-seq_saccer2.bed.gz DNase-seq_saccer3.bed
gzip -f DNase-seq_saccer3.bed

zcat DNase-seq_saccer3.bed.gz | awk '{OFS="\t"}{print $1, $3, "-", "-", $4}' | gzip > dnase.tsv.gz
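
As an optional sanity check after the conversion, one can count how many records each sample contributes to `dnase.tsv.gz`. This is a minimal sketch that is not part of the original notebook. `count_per_sample` is a hypothetical helper, and it assumes the five-column, tab-separated layout written by the `awk` command above, with the sample identifier in the fifth column.

```python
import gzip
from collections import Counter


def count_per_sample(tsv_path):
    """Count records per sample (fifth column) in a gzipped TSV file."""
    counts = Counter()
    with gzip.open(tsv_path, "rt") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            counts[fields[4]] += 1
    return counts
```

For example, `count_per_sample("cerevisiae/dnase.tsv.gz")` returns a `Counter` keyed by sample identifier. A sample that is missing from the result lost all of its records before this step.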

Compute the relative increase in mutation rate (the results are in the ``increase`` folder).

In [None]:
%%bash --out output4 --err error4

source activate env_nucperiod

increase_scripts=${PWD}/../increase/scripts
increase=${PWD}/../increase/cerevisiae
mapping=${PWD}/../nucleosomes/cerevisiae

cd cerevisiae

bash ${increase_scripts}/increase.sh dnase.tsv.gz zoomin saccer3 5 ${mapping}/dyads.bed.gz \
    ${increase}/saccer3_5mer_counts.json.gz increase