# Somatic mutations

We have analysed mutations in H. sapiens from 5 different sources:

- [ICGC](#icgc)
- [TCGA](#tcga)
- [TCGA PanCanAtlas](#pancanatlas)
- [Normal eyelid skin](#eyelid)
- [XPC wildtype and XPC mutant](#xpc)

Before running this notebook, run the notebooks in the following folders: nucleosomes, rotational and increase. Some external data must also be downloaded; each section gives further details.

## ICGC <a id="icgc"></a>

Create a folder named **icgc** and place the whole-genome mutation files inside it.
This work used ICGC release 26.

The names of the downloaded files follow this nomenclature: ``simple_somatic_mutation.open.[COHORT]-[COUNTRY].tsv.gz``

Keep only the mutations of interest and save the results in the ``cohorts`` folder. The files are then split into one mutation file per sample and saved in the ``samples`` directory.

In [None]:
%%bash --out output1 --err error1

source activate env_nucperiod
scripts=${PWD}/scripts

cd icgc

cohort=cohorts
samples=samples

mkdir -p ${cohort}
mkdir -p ${samples}

# Filter files to get only mutations from whole genome sequencing and mapped to GRCh37
for file in simple_somatic_mutation.open.*.tsv.gz
do
    name=$(basename ${file})
    name="${name/simple_somatic_mutation.open./}"
    zcat ${file} | grep -w WGS | grep -w GRCh37 | \
        awk -F "\t" 'BEGIN{OFS="\t";}{if (length($16)==1 && length($17)==1 && $16!="-" && $17!="-")  print "chr"$9,$10,$16,$17,$2}' | \
        sort | uniq | gzip > ${cohort}/${name}
    
done
# remove empty files
for file in ${cohort}/*.tsv.gz
do
  x=`zcat ${file} | wc -l`
  if [ $x == 0 ]
  then
      rm ${file}
  fi
done

# Create a mutation file for each sample
python ${scripts}/samples.py ${cohort} ${samples}
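
As a reading aid, the `awk` filter in the cell above can be sketched in Python. This is an approximation and not the pipeline's code: `filter_snvs` and `make_row` are hypothetical helpers, the column positions (2, 9, 10, 16 and 17) are copied from the `awk` command, and `grep -w` is approximated by an exact match on whole fields.

```python
def filter_snvs(lines):
    """Return sorted, unique (chrom, pos, ref, alt, donor) tuples for SNVs."""
    kept = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 17:
            continue
        # Approximates `grep -w WGS | grep -w GRCh37` with whole-field matches
        if "WGS" not in fields or "GRCh37" not in fields:
            continue
        ref, alt = fields[15], fields[16]          # awk $16 and $17
        # Single-base substitutions only: no indels, no "-" placeholders
        if len(ref) == 1 and len(alt) == 1 and ref != "-" and alt != "-":
            kept.add(("chr" + fields[8], fields[9], ref, alt, fields[1]))
    return sorted(kept)


def make_row(donor, chrom, pos, ref, alt, assembly="GRCh37", strategy="WGS"):
    """Build a toy row; where assembly and strategy sit is arbitrary here."""
    row = ["."] * 19
    row[1], row[8], row[9], row[15], row[16] = donor, chrom, pos, ref, alt
    row[12], row[18] = assembly, strategy
    return "\t".join(row)


rows = [
    make_row("DO1", "1", "100", "A", "T"),
    make_row("DO1", "1", "100", "A", "T"),                  # duplicate, removed
    make_row("DO2", "2", "200", "-", "G"),                  # insertion, dropped
    make_row("DO3", "3", "300", "C", "G", strategy="WXS"),  # not WGS, dropped
]
snvs = filter_snvs(rows)
print(snvs)
```

Only the first toy row survives: the duplicate is collapsed, the insertion fails the single-base test and the last row is not tagged as WGS.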

Compute the relative increase of mutation rate of the cohorts:

- zoomin: zoomin analysis using all nucleosomes
- zoomin_rot_high: zoomin analysis using high rotational nucleosomes
- zoomin_rot_low: zoomin analysis using low rotational nucleosomes
- zoomin_no_nuc: zoomin analysis # TODO
- zoomin_3mer: zoomin analysis using all nucleosomes and the 3-mer context

- zoomout: zoomout analysis using all nucleosomes
- zoomout_rot_high: zoomout analysis using high rotational nucleosomes
- zoomout_rot_low: zoomout analysis using low rotational nucleosomes
- zoomout_no_nuc: zoomout analysis # TODO
- zoomout_3mer: zoomout analysis using all nucleosomes and the 3-mer context

In [None]:
%%bash --out output2 --err error2
# TODO remove

source activate env_nucperiod

increase_scripts=${PWD}/../increase/scripts
increase=${PWD}/../increase/sapiens
mapping=${PWD}/../nucleosomes/sapiens
rotational=${PWD}/../rotational/sapiens

cd icgc/cohorts

for file in *.tsv.gz
do

    f=$(basename $file)
    name=${f/.tsv.gz/}

    # Zoomin
    bash ${increase_scripts}/increase.sh ${file} zoomin hg19 5 ${mapping}/dyads.bed.gz \
        ${increase}/hg19_filtered_5mer_counts.json.gz increase_zoomin/${name}
    # Rotational high zoomin
    bash ${increase_scripts}/increase.sh ${file} zoomin hg19 5 ${rotational}/high_rotational_dyads.bed.gz \
        ${increase}/hg19_filtered_5mer_counts.json.gz increase_zoomin_rot_high/${name}
    # Rotational low zoomin
    bash ${increase_scripts}/increase.sh ${file} zoomin hg19 5 ${rotational}/low_rotational_dyads.bed.gz \
        ${increase}/hg19_filtered_5mer_counts.json.gz increase_zoomin_rot_low/${name}
    # No nucleosomes context zoomin
    bash ${increase_scripts}/increase_no_context.sh ${file} zoomin hg19 5 ${mapping}/dyads.bed.gz \
        ${increase}/hg19_filtered_nodyads_5mer_counts.json.gz ${increase}/nodyads.bed.gz \
        increase_zoomin_linker/${name}
    # Zoomin 3-mer
    bash ${increase_scripts}/increase.sh ${file} zoomin hg19 3 ${mapping}/dyads.bed.gz \
        ${increase}/hg19_filtered_3mer_counts.json.gz increase_zoomin_3mer/${name}
        
    # Zoomout
    bash ${increase_scripts}/increase.sh ${file} zoomout hg19 5 ${mapping}/dyads.bed.gz \
        ${increase}/hg19_5mer_counts.json.gz increase_zoomout/${name} ${increase}/closer_dyads.npy
    # No nucleosomes context zoomout
    bash ${increase_scripts}/increase_no_context.sh ${file} zoomout hg19 5 ${mapping}/dyads.bed.gz \
        ${increase}/hg19_filtered_nodyads_5mer_counts.json.gz ${increase}/nodyads.bed.gz \
        increase_zoomout_linker/${name} ${increase}/closer_dyads.npy
    # Zoomout 3-mer
    bash ${increase_scripts}/increase.sh ${file} zoomout hg19 3 ${mapping}/dyads.bed.gz \
        ${increase}/hg19_3mer_counts.json.gz increase_zoomout_3mer/${name} ${increase}/closer_dyads.npy
        
done
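
The calls in the cell above differ only in the script, the mode, the k-mer size, the dyads file, the counts file and a few extra arguments. One way to see this at a glance is to hold them in a table. The sketch below only builds and prints the command lines; it does not run anything. `RUNS` and `build_command` are illustrative names, `COHORT-XX` is a made-up cohort label, and the directory variables are kept as shell placeholders.

```python
M, I, R = "${mapping}", "${increase}", "${rotational}"   # shell placeholders

# (output folder, script, mode, k, dyads, counts, args before output, args after)
RUNS = [
    ("increase_zoomin", "increase.sh", "zoomin", 5,
     f"{M}/dyads.bed.gz", f"{I}/hg19_filtered_5mer_counts.json.gz", [], []),
    ("increase_zoomin_rot_high", "increase.sh", "zoomin", 5,
     f"{R}/high_rotational_dyads.bed.gz",
     f"{I}/hg19_filtered_5mer_counts.json.gz", [], []),
    ("increase_zoomin_rot_low", "increase.sh", "zoomin", 5,
     f"{R}/low_rotational_dyads.bed.gz",
     f"{I}/hg19_filtered_5mer_counts.json.gz", [], []),
    ("increase_zoomin_linker", "increase_no_context.sh", "zoomin", 5,
     f"{M}/dyads.bed.gz", f"{I}/hg19_filtered_nodyads_5mer_counts.json.gz",
     [f"{I}/nodyads.bed.gz"], []),
    ("increase_zoomin_3mer", "increase.sh", "zoomin", 3,
     f"{M}/dyads.bed.gz", f"{I}/hg19_filtered_3mer_counts.json.gz", [], []),
    ("increase_zoomout", "increase.sh", "zoomout", 5,
     f"{M}/dyads.bed.gz", f"{I}/hg19_5mer_counts.json.gz",
     [], [f"{I}/closer_dyads.npy"]),
    ("increase_zoomout_linker", "increase_no_context.sh", "zoomout", 5,
     f"{M}/dyads.bed.gz", f"{I}/hg19_filtered_nodyads_5mer_counts.json.gz",
     [f"{I}/nodyads.bed.gz"], [f"{I}/closer_dyads.npy"]),
    ("increase_zoomout_3mer", "increase.sh", "zoomout", 3,
     f"{M}/dyads.bed.gz", f"{I}/hg19_3mer_counts.json.gz",
     [], [f"{I}/closer_dyads.npy"]),
]


def build_command(run, mutations_file, name):
    """Assemble one command line in the argument order used by the cell."""
    out, script, mode, k, dyads, counts, before, after = run
    parts = ["bash", "${increase_scripts}/" + script, mutations_file, mode,
             "hg19", str(k), dyads, counts, *before, f"{out}/{name}", *after]
    return " ".join(parts)


for run in RUNS:
    print(build_command(run, "COHORT-XX.tsv.gz", "COHORT-XX"))
```

Laid out this way, the table has no zoomout entry for the high or low rotational dyads, because the cell contains no such call.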

Compute the relative increase of mutation rate of the samples.

In [None]:
%%bash --out output3 --err error3

source activate env_nucperiod

increase_scripts=${PWD}/../increase/scripts
increase=${PWD}/../increase/sapiens
mapping=${PWD}/../nucleosomes/sapiens

cd icgc/samples

for ctype in $(find . -maxdepth 1 -mindepth 1 -type d)
do
    
    for file in ${ctype}/*.tsv.gz
    do
    
        if [ ! -f ${file} ]
        then
            continue
        fi

        f=$(basename $file)
        name=${f/.tsv.gz/}

        # Zoomin
        bash ${increase_scripts}/increase_samples.sh ${file} zoomin hg19 5 ${mapping}/dyads.bed.gz \
            ${increase}/hg19_filtered_5mer_counts.json.gz increase/${ctype}/${name}
      
      done  
done
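
The nested loop above expects one sub-folder per cohort inside `samples`, each holding one `.tsv.gz` file per sample, and mirrors that layout under `increase/`. The same traversal can be sketched in Python on a throwaway directory. `sample_jobs` is a hypothetical helper and the cohort and sample names are made up.

```python
import tempfile
from pathlib import Path


def sample_jobs(samples_dir):
    """Yield (mutation file, output folder) pairs, one per sample file."""
    root = Path(samples_dir)
    for cohort_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for path in sorted(cohort_dir.glob("*.tsv.gz")):
            if not path.is_file():
                continue
            name = path.name[: -len(".tsv.gz")]      # strip the extension
            yield path, Path("increase") / cohort_dir.name / name


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical layout: two cohorts, the second one without samples
    (Path(tmp) / "COHORT-A").mkdir()
    (Path(tmp) / "COHORT-B").mkdir()
    (Path(tmp) / "COHORT-A" / "sample1.tsv.gz").touch()
    jobs = [(f.relative_to(tmp).as_posix(), out.as_posix())
            for f, out in sample_jobs(tmp)]

print(jobs)
```

A cohort folder without sample files simply produces no jobs, which is what the `[ ! -f ${file} ]` check achieves in the bash loop when the glob does not match.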

## TCGA <a id="tcga"></a>

Create a folder named **tcga** and place the 505 samples (version 3) file inside it, named ``mutations.tsv.gz``.

Keep only the mutations of interest and save the results in the ``cohorts`` folder. The files are then split into one mutation file per sample and saved in the ``samples`` directory.

In [None]:
%%bash --out output4 --err error4

source activate env_nucperiod

scripts=${PWD}/scripts

cd tcga
cohort=cohorts
samples=samples

mkdir -p ${cohort}
mkdir -p ${samples}


for ctype in $(zcat mutations.tsv.gz | cut -f2 | sort -u | grep -v cancer)
do
    zcat mutations.tsv.gz | \
        awk -v ctype=${ctype} 'BEGIN{OFS="\t";}{if($2==ctype && length($5)==1 && length($6)==1 && ($6 !~ /[+-]/)){print "chr"$3,$4,$5,$6,$1}}' | \
        gzip -c > ${cohort}/${ctype}.tsv.gz
done

# Create a mutation file for each sample
python ${scripts}/samples.py ${cohort} ${samples}
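
The `samples.py` script is not shown in this notebook. Judging only from how its output is read later (one sub-folder per cohort, one `.tsv.gz` file per sample), the per-sample split could look roughly like the sketch below. `split_by_sample` is a hypothetical function, and taking the sample identifier from the fifth column is an assumption based on the column order written by the filtering step.

```python
import gzip
import tempfile
from collections import defaultdict
from pathlib import Path


def split_by_sample(cohort_file, samples_dir):
    """Write one gzip-compressed file per sample found in column 5."""
    cohort_file = Path(cohort_file)
    cohort = cohort_file.name[: -len(".tsv.gz")]
    rows = defaultdict(list)
    with gzip.open(cohort_file, "rt") as handle:
        for line in handle:
            sample = line.rstrip("\n").split("\t")[4]   # assumed sample column
            rows[sample].append(line)
    out_dir = Path(samples_dir) / cohort
    out_dir.mkdir(parents=True, exist_ok=True)
    for sample, lines in rows.items():
        with gzip.open(out_dir / f"{sample}.tsv.gz", "wt") as out:
            out.writelines(lines)
    return sorted(rows)


with tempfile.TemporaryDirectory() as tmp:
    cohort_file = Path(tmp) / "cohorts" / "COHORT-A.tsv.gz"   # made-up name
    cohort_file.parent.mkdir()
    with gzip.open(cohort_file, "wt") as handle:
        handle.write("chr1\t100\tA\tT\ts1\n"
                     "chr2\t200\tC\tG\ts2\n"
                     "chr3\t300\tG\tA\ts1\n")
    found = split_by_sample(cohort_file, Path(tmp) / "samples")
    s1_path = Path(tmp) / "samples" / "COHORT-A" / "s1.tsv.gz"
    with gzip.open(s1_path, "rt") as handle:
        s1_rows = handle.read().splitlines()

print(found, len(s1_rows))
```

The sketch holds a whole cohort in memory, which is acceptable for a toy file; a real implementation might stream rows to per-sample handles instead.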

Compute the relative increase of mutation rate for the cohorts:

- zoomin: zoomin analysis using all nucleosomes
- zoomin_rot_high: zoomin analysis using high rotational nucleosomes
- zoomin_rot_low: zoomin analysis using low rotational nucleosomes
- zoomin_no_nuc: zoomin analysis # TODO
- zoomin_3mer: zoomin analysis using all nucleosomes and the 3-mer context

- zoomout: zoomout analysis using all nucleosomes
- zoomout_rot_high: zoomout analysis using high rotational nucleosomes
- zoomout_rot_low: zoomout analysis using low rotational nucleosomes
- zoomout_no_nuc: zoomout analysis # TODO
- zoomout_3mer: zoomout analysis using all nucleosomes and the 3-mer context

In [None]:
%%bash --out output5 --err error5

source activate env_nucperiod

increase_scripts=${PWD}/../increase/scripts
increase=${PWD}/../increase/sapiens
mapping=${PWD}/../nucleosomes/sapiens
rotational=${PWD}/../rotational/sapiens

cd tcga/cohorts

for file in *.tsv.gz
do

    f=$(basename $file)
    name=${f/.tsv.gz/}
    
    # Zoomin
    bash ${increase_scripts}/increase.sh ${file} zoomin hg19 5 ${mapping}/dyads.bed.gz \
        ${increase}/hg19_filtered_5mer_counts.json.gz increase_zoomin/${name}
    # Rotational high zoomin
    bash ${increase_scripts}/increase.sh ${file} zoomin hg19 5 ${rotational}/high_rotational_dyads.bed.gz \
        ${increase}/hg19_filtered_5mer_counts.json.gz increase_zoomin_rot_high/${name}
    # Rotational low zoomin
    bash ${increase_scripts}/increase.sh ${file} zoomin hg19 5 ${rotational}/low_rotational_dyads.bed.gz \
        ${increase}/hg19_filtered_5mer_counts.json.gz increase_zoomin_rot_low/${name}
    # No nucleosomes context zoomin
    bash ${increase_scripts}/increase_no_context.sh ${file} zoomin hg19 5 ${mapping}/dyads.bed.gz \
        ${increase}/hg19_filtered_nodyads_5mer_counts.json.gz ${increase}/nodyads.bed.gz \
        increase_zoomin_linker/${name}
    # Zoomin 3-mer
    bash ${increase_scripts}/increase.sh ${file} zoomin hg19 3 ${mapping}/dyads.bed.gz \
        ${increase}/hg19_filtered_3mer_counts.json.gz increase_zoomin_3mer/${name}
        
    # Zoomout
    bash ${increase_scripts}/increase.sh ${file} zoomout hg19 5 ${mapping}/dyads.bed.gz \
        ${increase}/hg19_5mer_counts.json.gz increase_zoomout/${name} ${increase}/closer_dyads.npy
    # No nucleosomes context zoomout
    bash ${increase_scripts}/increase_no_context.sh ${file} zoomout hg19 5 ${mapping}/dyads.bed.gz \
        ${increase}/hg19_filtered_nodyads_5mer_counts.json.gz ${increase}/nodyads.bed.gz \
        increase_zoomout_linker/${name} ${increase}/closer_dyads.npy
    # Zoomout 3-mer
    bash ${increase_scripts}/increase.sh ${file} zoomout hg19 3 ${mapping}/dyads.bed.gz \
        ${increase}/hg19_3mer_counts.json.gz increase_zoomout_3mer/${name} ${increase}/closer_dyads.npy
        
done

Compute the relative increase of mutation rate of the samples.

In [None]:
%%bash --out output6 --err error6

source activate env_nucperiod

increase_scripts=${PWD}/../increase/scripts
increase=${PWD}/../increase/sapiens
mapping=${PWD}/../nucleosomes/sapiens

cd tcga/samples

for ctype in $(find . -maxdepth 1 -mindepth 1 -type d)
do
    
    for file in ${ctype}/*.tsv.gz
    do
    
        if [ ! -f ${file} ]
        then
            continue
        fi

        f=$(basename $file)
        name=${f/.tsv.gz/}

        # Zoomin
        bash ${increase_scripts}/increase_samples.sh ${file} zoomin hg19 5 ${mapping}/dyads.bed.gz \
            ${increase}/hg19_filtered_5mer_counts.json.gz increase/${ctype}/${name}
      
      done  
done

## PanCanAtlas <a id="pancanatlas"></a>

Create a folder named **pancanatlas** and place the following files inside it:

- [Mutations](https://api.gdc.cancer.gov/data/1c8cfe5f-e52d-41ba-94da-f15ea1337efc)  from
  https://gdc.cancer.gov/about-data/publications/pancanatlas

- [Clinical data](https://api.gdc.cancer.gov/data/1b5f413e-a8d1-4d10-92eb-7c4ae739ed81) from
  https://gdc.cancer.gov/about-data/publications/pancanatlas

Split the mutations file into one file per cancer type.

In [None]:
import os
import pandas as pd

ws = 'pancanatlas'

clinical_data = pd.read_excel(os.path.join(ws, 'TCGA-CDR-SupplementalTableS1.xlsx'), 
                              sheet_name='TCGA-CDR',
                              usecols=[1,2])
mutations = pd.read_csv(os.path.join(ws, 'mc3.v0.2.8.PUBLIC.maf.gz'), 
                        sep='\t', 
                        usecols=['Chromosome','Start_Position', 'FILTER', 'Variant_Type', 'Reference_Allele', 'Tumor_Seq_Allele1', 'Tumor_Seq_Allele2', 'Tumor_Sample_Barcode'],
                        dtype={'Chromosome': str})
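# Keep only single-nucleotide variants that passed all filters
pass_snp = mutations[(mutations['FILTER'] == 'PASS') & (mutations['Variant_Type']=='SNP')].copy()
# Reduce the sample barcode to its first three fields; '{:.0}' prints each of the remaining four fields as an empty string
pass_snp['SAMPLE'] = pass_snp['Tumor_Sample_Barcode'].str.split('-').apply(lambda x: '{}-{}-{}{:.0}{:.0}{:.0}{:.0}'.format(*x))
# Attach the cancer type by matching the shortened barcode to the patient barcode
data = pass_snp.merge(clinical_data, left_on='SAMPLE', right_on='bcr_patient_barcode')
# Use Tumor_Seq_Allele2 as the alternate allele unless it equals the reference
data['Alt'] = data.apply(lambda x: x['Tumor_Seq_Allele2'] if x['Tumor_Seq_Allele2'] != x['Reference_Allele'] else x['Tumor_Seq_Allele1'], axis=1)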
data['Chromosome'] = "chr" + data['Chromosome']
for name, group in data.groupby('type'):
    print(name)
    group.to_csv(os.path.join(ws, '{}.tsv.gz'.format(name)), compression='gzip', index=False, header=False, sep='\t',
                columns=['Chromosome', 'Start_Position', 'Reference_Allele', 'Alt', 'SAMPLE'])
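
Two steps in the cell above are compact enough to be easy to misread: shortening the sample barcode so that it matches `bcr_patient_barcode`, and choosing the alternate allele. The sketch below isolates both on plain strings. `patient_barcode` and `alt_allele` are illustrative helpers and the barcode is made up.

```python
def patient_barcode(tumor_sample_barcode):
    """Keep the first three dash-separated fields of a barcode."""
    return "-".join(tumor_sample_barcode.split("-")[:3])


def alt_allele(reference, allele1, allele2):
    """Pick the tumor allele that differs from the reference, preferring allele 2."""
    return allele2 if allele2 != reference else allele1


barcode = "TCGA-AA-0001-01A-11D-0000-09"   # made-up barcode with seven fields

# The format string used in the notebook gives the same result, because
# '{:.0}' truncates a string to zero characters.
trick = "{}-{}-{}{:.0}{:.0}{:.0}{:.0}".format(*barcode.split("-"))

print(patient_barcode(barcode), trick)
print(alt_allele("C", "C", "T"), alt_allele("C", "T", "C"))
```

The format string needs exactly seven fields' worth of arguments to avoid an `IndexError`, whereas the slice-and-join version accepts barcodes with any number of fields.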

## Eyelid <a id="eyelid"></a>

Create a folder named **eyelid** and place the *eyelid.txt.gz* file inside.

Parse and format the file.

In [None]:
%%bash --out output7 --err error7

source activate env_nucperiod
cd eyelid

zcat eyelid.txt.gz | awk '{OFS="\t";}{print $1,$3,$4,$5,$6}' |\
    gzip > eyelid.tsv.gz

Compute the relative increase of mutation rate.

In [None]:
%%bash --out output8 --err error8

source activate env_nucperiod

increase_scripts=${PWD}/../increase/scripts
increase=${PWD}/../increase/sapiens
mapping=${PWD}/../nucleosomes/sapiens

cd eyelid

bash ${increase_scripts}/increase.sh eyelid.tsv.gz zoomin hg19 5 ${mapping}/dyads.bed.gz \
    ${increase}/hg19_filtered_5mer_counts.json.gz increase

## XPC <a id="xpc"></a>

Create a folder named **xpc** with two subfolders, *mutant* and *wild*, and place in each the corresponding *.snp* files for the XPC-mutant and XPC wild-type tumors.

Parse the files.

In [None]:
%%bash

cd xpc

# Read the VarScan output and extract somatic mutations (allele frequency >= 0.2 and p-value < 0.01, as used in Fredriksson et al.)
for i in $(ls wild/*.snp)
do 
    name=`echo $i | awk -F"/" '{print $NF}' | cut -d "." -f1`
    grep Somatic $i | \
    awk -v var=$name 'BEGIN{OFS="\t";}{vaf=$10/($9+$10);if(vaf>=0.2 && $15<0.01){print "chr"$1,$2,$3,$4,var}}'
done | sort -k1,1 -k2,2n | gzip > XPC_wt.tsv.gz

for i in $(ls mutant/*.snp)
do 
    name=`echo $i | awk -F"/" '{print $NF}' | cut -d "." -f1`
    grep Somatic $i | \
    awk -v var=$name 'BEGIN{OFS="\t";}{vaf=$10/($9+$10);if(vaf>=0.2 && $15<0.01){print "chr"$1,$2,$3,$4,var}}'
done |  sort -k1,1 -k2,2n | gzip > XPC_mutant.tsv.gz
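
The filter applied to each *.snp* file can be restated in Python as a sketch. Column positions 9, 10 and 15 are copied from the `awk` command; reading them as the two tumor read counts and a somatic p-value is an assumption about the VarScan layout. `sample_name` and `keep_somatic` are hypothetical helpers and the file name is made up.

```python
from pathlib import Path


def sample_name(path):
    """File name up to the first dot, as `cut -d "." -f1` does."""
    return Path(path).name.split(".")[0]


def keep_somatic(line, min_vaf=0.2, max_pvalue=0.01):
    """Return (chrom, pos, ref, alt) if the call passes the filters, else None."""
    if "Somatic" not in line:            # same substring test as `grep Somatic`
        return None
    f = line.split()                     # whitespace split, like awk's default
    reads1, reads2 = float(f[8]), float(f[9])       # awk $9 and $10
    depth = reads1 + reads2
    if depth == 0:                       # guard added in this sketch only
        return None
    vaf = reads2 / depth                 # variant allele frequency
    if vaf >= min_vaf and float(f[14]) < max_pvalue:   # awk $15
        return ("chr" + f[0], f[1], f[2], f[3])
    return None


def toy_call(reads1, reads2, status, pvalue):
    """Build a 15-column toy line; unused columns hold placeholders."""
    cols = ["1", "1000", "C", "T", ".", ".", ".", ".",
            str(reads1), str(reads2), ".", ".", status, ".", str(pvalue)]
    return "\t".join(cols)


kept = keep_somatic(toy_call(20, 10, "Somatic", 0.001))     # VAF 0.33, kept
low_vaf = keep_somatic(toy_call(38, 2, "Somatic", 0.001))   # VAF 0.05, dropped
germline = keep_somatic(toy_call(20, 10, "Germline", 0.001))
print(sample_name("wild/tumor01.varscan.snp"), kept, low_vaf, germline)
```

Unlike the `awk` one-liner, the sketch skips calls with zero total depth instead of dividing by zero.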

Compute the relative increase of mutation rate.

In [None]:
%%bash --out output9 --err error9

source activate env_nucperiod

increase_scripts=${PWD}/../increase/scripts
increase=${PWD}/../increase/sapiens
mapping=${PWD}/../nucleosomes/sapiens

cd xpc

bash ${increase_scripts}/increase.sh XPC_wt.tsv.gz zoomin hg19 5 ${mapping}/dyads.bed.gz \
    ${increase}/hg19_filtered_5mer_counts.json.gz increase_wt
    
bash ${increase_scripts}/increase.sh XPC_mutant.tsv.gz zoomin hg19 5 ${mapping}/dyads.bed.gz \
    ${increase}/hg19_filtered_5mer_counts.json.gz increase_mutant
    
bash ${increase_scripts}/increase.sh XPC_wt.tsv.gz zoomout hg19 5 ${mapping}/dyads.bed.gz \
    ${increase}/hg19_5mer_counts.json.gz increase_wt_zoomout ${increase}/closer_dyads.npy
    
bash ${increase_scripts}/increase.sh XPC_mutant.tsv.gz zoomout hg19 5 ${mapping}/dyads.bed.gz \
    ${increase}/hg19_5mer_counts.json.gz increase_mutant_zoomout ${increase}/closer_dyads.npy