# Damage data


We analysed damage data for two species:

- [UV damage](#uv) for H. sapiens
- [NMP damage](#nmp) for S. cerevisiae


Before running this notebook, run the notebooks in the following folders: nucleosomes, rotational and increase. Some external data must also be downloaded; each section gives further details.

## UV damage <a id="uv"></a>

Create a folder named **uv** and place the data from https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE98025&format=file inside it.  
Extract the tar file before running the notebook.

Keep only the files of interest and combine those that come from the same experiment.

In [None]:
import os
import gzip

# Description from the original paper:
# Bed files contain genomic locations of damages of the most common two dinucleotides at
# the damage sites for (6-4)PP and CPD.
# Each interval length is 10 nt, and the pyrimidine dimer is located at the 4-5th positions.
# Data obtained from wget https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE98025&format=file
equivalent_ID = {
                 'GSM2585687': 'NHF1_CPD_10J_0h_A',
                 'GSM2585688': 'NHF1_CPD_10J_0h_B',
                 'GSM2585693': 'NHF1_CPD_10J_24h_A',
                 'GSM2585694': 'NHF1_CPD_10J_24h_B',
                 'GSM2585715': 'GM12878_CPD_20J_nakedDNA_A',
                 'GSM2585716': 'GM12878_CPD_20J_nakedDNA_C',
                 'GSM2585701': 'NHF1_6-4_20J_0h_A',
                 'GSM2585702': 'NHF1_6-4_20J_0h_B',
                 'GSM2585705': 'NHF1_6-4_20J_1h_A',
                 'GSM2585706': 'NHF1_6-4_20J_1h_B',
                 'GSM2585711': 'GM12878_6-4_20J_nakedDNA_A',
                 'GSM2585712': 'GM12878_6-4_20J_nakedDNA_C',
                 }

ws = 'uv'
infiles = [f for f in os.listdir(ws) if f.endswith('.bed.gz')]

outfiles = set()
for file in infiles:
    name_original = file.split('_')[0]
    name_equivalent = equivalent_ID.get(name_original, None)
    if name_equivalent is None:
        continue
    else:
        name = name_equivalent.rsplit('_', 1)[0]

    out_file = os.path.join(ws, name + '.tsv.gz')
    
    if out_file in outfiles:
        mode = 'at'
    else:
        mode = 'wt'  # ensure the file is created from scratch
    outfiles.add(out_file)

    with gzip.open(os.path.join(ws, file), 'rt') as infile, gzip.open(out_file, mode) as outfile:
        for line in infile:
            line_spl = line.rstrip().split('\t')
            strand = line_spl[3]
            chrom = line_spl[0]

            # this is in theory where the damage is located
            real_pos = int(line_spl[1]) + 5

            # make sure the damage is always assigned to the first dipyrimidine in the middle of the read
            if strand == '-':
                real_pos = real_pos + 1

            out = '{}\t{}\t-\t-\t-\n'.format(chrom, real_pos)
            outfile.write(out)
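
As a quick check of the logic in the cell above, the sketch below applies the same replicate merging and coordinate arithmetic to made-up inputs. The helper names `merged_name` and `damage_position`, the file names and the coordinates are illustrative assumptions and are not part of the original pipeline.

```python
# Minimal sketch of the per-file and per-line logic used in the cell above.
# Helper names, file names and coordinates are hypothetical.

def merged_name(file_name, equivalent_id):
    """Return the experiment name without the replicate suffix,
    or None if the file is not one of the files of interest."""
    gsm = file_name.split('_')[0]
    label = equivalent_id.get(gsm)
    if label is None:
        return None
    # drop the trailing replicate letter (e.g. '_A')
    return label.rsplit('_', 1)[0]


def damage_position(start, strand):
    """Same arithmetic as the cell above: start + 5, plus one on the minus strand."""
    pos = start + 5
    if strand == '-':
        pos += 1
    return pos


ids = {'GSM2585687': 'NHF1_CPD_10J_0h_A', 'GSM2585688': 'NHF1_CPD_10J_0h_B'}
name_a = merged_name('GSM2585687_example.bed.gz', ids)
name_b = merged_name('GSM2585688_example.bed.gz', ids)
skipped = merged_name('GSM0000000_example.bed.gz', ids)

plus_pos = damage_position(1000, '+')
minus_pos = damage_position(1000, '-')
print(name_a, name_b, skipped, plus_pos, minus_pos)
```

Both replicates map to the same name, which is why the second file of each experiment is opened in append mode.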

Compute the relative increase of the mutation rate:
- ``increase_CPD_0h_high`` contains the zoomin analysis of *NHF1_CPD_10J_0h* using the high rotational dyads
- ``increase_CPD_0h_low`` contains the zoomin analysis of *NHF1_CPD_10J_0h* using the low rotational dyads
- ``increase_CPD_24h_high`` contains the zoomin analysis of *NHF1_CPD_10J_24h* using the high rotational dyads
- ``increase_CPD_24h_low`` contains the zoomin analysis of *NHF1_CPD_10J_24h* using the low rotational dyads

- ``increase_PP-6-4_0h`` contains the zoomin analysis of *NHF1_6-4_20J_0h* using all dyads
- ``increase_PP-6-4_1h`` contains the zoomin analysis of *NHF1_6-4_20J_1h* using all dyads
- ``increase_CPD_naked`` contains the zoomin analysis of *GM12878_CPD_20J_nakedDNA* using all dyads
- ``increase_PP-6-4_naked`` contains the zoomin analysis of *GM12878_6-4_20J_nakedDNA* using all dyads

In [None]:
%%bash --out output1 --err error1
# TODO remove

source activate env_nucperiod

increase_scripts=${PWD}/../increase/scripts
increase=${PWD}/../increase/sapiens
mapping=${PWD}/../nucleosomes/sapiens
rotational=${PWD}/../rotational/sapiens

cd uv

# NHF1_CPD_10J_0h
bash ${increase_scripts}/increase.sh NHF1_CPD_10J_0h.tsv.gz zoomin hg19 5 ${rotational}/high_rotational_dyads.bed.gz \
    ${increase}/hg19_filtered_5mer_counts.json.gz increase_CPD_0h_high
    
bash ${increase_scripts}/increase.sh NHF1_CPD_10J_0h.tsv.gz zoomin hg19 5 ${rotational}/low_rotational_dyads.bed.gz \
    ${increase}/hg19_filtered_5mer_counts.json.gz increase_CPD_0h_low
    
# NHF1_CPD_10J_24h
bash ${increase_scripts}/increase.sh NHF1_CPD_10J_24h.tsv.gz zoomin hg19 5 ${rotational}/high_rotational_dyads.bed.gz \
    ${increase}/hg19_filtered_5mer_counts.json.gz increase_CPD_24h_high
    
bash ${increase_scripts}/increase.sh NHF1_CPD_10J_24h.tsv.gz zoomin hg19 5 ${rotational}/low_rotational_dyads.bed.gz \
    ${increase}/hg19_filtered_5mer_counts.json.gz increase_CPD_24h_low

# NHF1_6-4_20J_0h
bash ${increase_scripts}/increase.sh NHF1_6-4_20J_0h.tsv.gz zoomin hg19 5 ${mapping}/dyads.bed.gz \
    ${increase}/hg19_filtered_5mer_counts.json.gz increase_PP-6-4_0h
    
# NHF1_6-4_20J_1h
bash ${increase_scripts}/increase.sh NHF1_6-4_20J_1h.tsv.gz zoomin hg19 5 ${mapping}/dyads.bed.gz \
    ${increase}/hg19_filtered_5mer_counts.json.gz increase_PP-6-4_1h

# GM12878_CPD_20J_nakedDNA
bash ${increase_scripts}/increase.sh GM12878_CPD_20J_nakedDNA.tsv.gz zoomin hg19 5 ${mapping}/dyads.bed.gz \
    ${increase}/hg19_filtered_5mer_counts.json.gz increase_CPD_naked

# GM12878_6-4_20J_nakedDNA
bash ${increase_scripts}/increase.sh GM12878_6-4_20J_nakedDNA.tsv.gz zoomin hg19 5 ${mapping}/dyads.bed.gz \
    ${increase}/hg19_filtered_5mer_counts.json.gz increase_PP-6-4_naked
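
The eight calls above differ only in the input file, the dyad set and the output name. A minimal Python sketch of that mapping is shown below; it only builds and prints the command lines and runs nothing. The variable names and the relative placeholder paths are assumptions for illustration, and the argument order simply mirrors the calls in the cell above.

```python
# Sketch: list the increase.sh calls of the cell above as a table.
# Paths are placeholders; nothing is executed.
import shlex

increase_scripts = '../increase/scripts'
increase = '../increase/sapiens'
mapping = '../nucleosomes/sapiens'
rotational = '../rotational/sapiens'

high = f'{rotational}/high_rotational_dyads.bed.gz'
low = f'{rotational}/low_rotational_dyads.bed.gz'
all_dyads = f'{mapping}/dyads.bed.gz'
counts = f'{increase}/hg19_filtered_5mer_counts.json.gz'

# (merged sample, dyads file, output name)
runs = [
    ('NHF1_CPD_10J_0h', high, 'increase_CPD_0h_high'),
    ('NHF1_CPD_10J_0h', low, 'increase_CPD_0h_low'),
    ('NHF1_CPD_10J_24h', high, 'increase_CPD_24h_high'),
    ('NHF1_CPD_10J_24h', low, 'increase_CPD_24h_low'),
    ('NHF1_6-4_20J_0h', all_dyads, 'increase_PP-6-4_0h'),
    ('NHF1_6-4_20J_1h', all_dyads, 'increase_PP-6-4_1h'),
    ('GM12878_CPD_20J_nakedDNA', all_dyads, 'increase_CPD_naked'),
    ('GM12878_6-4_20J_nakedDNA', all_dyads, 'increase_PP-6-4_naked'),
]

commands = []
for sample, dyads, out in runs:
    args = ['bash', f'{increase_scripts}/increase.sh', f'{sample}.tsv.gz',
            'zoomin', 'hg19', '5', dyads, counts, out]
    commands.append(' '.join(shlex.quote(a) for a in args))

for cmd in commands:
    print(cmd)
```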

## NMP damage <a id="nmp"></a>

Create a folder named **nmp** and place the data from https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE98031&format=file inside it.

Extract the tar file and uncompress all gzipped files before running the notebook.


Convert the wig files into bed files.

In [None]:
%%bash --out output2 --err error2
# TODO remove

source activate env_nucperiod

cd nmp

for file in *.wig
do
    # replace the .wig extension with .bed.gz; quote names to be safe with unusual characters
    out_file="${file%.wig}.bed.gz"
    wig2bed < "${file}" | gzip > "${out_file}"
done

Combine the corresponding files.

In [None]:
import glob
import gzip
import os

ws = 'nmp'
to_merge = {v.split('_bk')[0] for v in os.listdir(ws) if v.endswith('.bed.gz')}

for merge in to_merge:
    out_f = os.path.join(ws, '{}.tsv.gz'.format(merge))
    with gzip.open(out_f, 'wt') as outfile:
        for file in glob.iglob(os.path.join(ws, merge + '*.bed.gz')):
            with gzip.open(file, 'rt') as infile:
                for line in infile:
                    line_spl = line.rstrip().split('\t')
                    # the fifth column holds the signal; write one row per unit of signal
                    count = int(float(line_spl[4]))
                    if count > 0:
                        out = '{}\t{}\t-\t-\t-\n'.format(line_spl[0], line_spl[2])
                        for i in range(count):
                            outfile.write(out)
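
To make the merging and expansion steps concrete, the sketch below applies the same rules to a few made-up inputs: files are grouped by the part of their name before `_bk`, and each position with a positive score is written once per unit of signal, using the end coordinate as the position. The helper names, file names and BED lines are hypothetical, and the five-column layout (chromosome, start, end, id, score) is an assumption about the `wig2bed` output.

```python
# Minimal sketch of the grouping and expansion logic in the cell above.
# Helper names, file names and BED lines are hypothetical.

def sample_name(file_name):
    """Group key: everything before the first '_bk' in the file name."""
    return file_name.split('_bk')[0]


def expand_line(line):
    """Return one output row per unit of signal in the fifth column."""
    fields = line.rstrip().split('\t')
    count = int(float(fields[4]))
    if count <= 0:
        return []
    row = '{}\t{}\t-\t-\t-\n'.format(fields[0], fields[2])
    return [row] * count


files = ['sampleA_bk_minus.bed.gz', 'sampleA_bk_plus.bed.gz', 'sampleB_bk_plus.bed.gz']
groups = sorted({sample_name(f) for f in files})

bed_lines = [
    'chrI\t99\t100\tid-1\t3.0\n',   # written three times
    'chrI\t100\t101\tid-2\t0.0\n',  # skipped
    'chrII\t9\t10\tid-3\t1.0\n',    # written once
]
rows = []
for line in bed_lines:
    rows.extend(expand_line(line))

print(groups)
print(''.join(rows), end='')
```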

Compute the relative increase of the mutation rate (only for the samples that will be plotted):

- ``increase_newmag`` contains the zoomin analysis of *GSM2585804_newmag1_0hr_A2_1bp_Greads*
- ``increase_0h`` contains the zoomin analysis of *GSM2585801_0hr_mag1_A5_1bp_Greads*
- ``increase_2h`` contains the zoomin analysis of *GSM2585802_2hr_wt_A4_1bp_Greads*

In [None]:
%%bash --out output3 --err error3
# TODO remove

source activate env_nucperiod

increase_scripts=${PWD}/../increase/scripts
increase=${PWD}/../increase/cerevisiae
mapping=${PWD}/../nucleosomes/cerevisiae

cd nmp

# GSM2585804_newmag1_0hr_A2_1bp_Greads
bash ${increase_scripts}/increase.sh ${PWD}/GSM2585804_newmag1_0hr_A2_1bp_Greads.tsv.gz zoomin saccer3 5 ${mapping}/dyads.bed.gz \
        ${increase}/saccer3_5mer_counts.json.gz increase_newmag
        
# GSM2585801_0hr_mag1_A5_1bp_Greads
bash ${increase_scripts}/increase.sh ${PWD}/GSM2585801_0hr_mag1_A5_1bp_Greads.tsv.gz zoomin saccer3 5 ${mapping}/dyads.bed.gz \
        ${increase}/saccer3_5mer_counts.json.gz increase_0h
        
# GSM2585802_2hr_wt_A4_1bp_Greads
bash ${increase_scripts}/increase.sh ${PWD}/GSM2585802_2hr_wt_A4_1bp_Greads.tsv.gz zoomin saccer3 5 ${mapping}/dyads.bed.gz \
        ${increase}/saccer3_5mer_counts.json.gz increase_2h