# Mutational signatures fitting and assignment


The goal is to assess which mutational signatures are active in each tumor sample of the [TCGA 505](#tcga), [ICGC](#icgcd) and [PanCanAtlas](#pcad) cohorts.

The workflow is:
- preprocess the mutational data to generate the format required for the analysis
- compute the weight of each signature in each sample
- compute the most probable signature contributing to each mutation
- if possible, concatenate all the mutations that come from the same signature
- perform the relative increase of mutation rate analysis
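
The third step can be pictured with a minimal sketch. It assumes that the probability of a signature given a mutation is proportional to the sample's weight for that signature times the probability that the signature produces that mutation type. The function name, the dictionary layout and the toy numbers are hypothetical and are not taken from the scripts in this folder.

```python
def most_probable_signature(weights, signature_probs, context):
    """Return (signature, posterior) for a mutation of type `context`.

    weights: {signature: weight of that signature in the sample}
    signature_probs: {signature: {context: P(context | signature)}}
    """
    # Unnormalised score: sample weight times the probability that the
    # signature generates this mutation type.
    scores = {
        sig: w * signature_probs[sig].get(context, 0.0)
        for sig, w in weights.items()
    }
    total = sum(scores.values())
    if total == 0:
        # No signature can explain this mutation type.
        return None, 0.0
    best = max(scores, key=scores.get)
    return best, scores[best] / total


# Toy numbers for illustration only, not values from the signature table.
weights = {"Signature_1": 0.7, "Signature_7": 0.3}
signature_probs = {
    "Signature_1": {"A[C>T]G": 0.20, "T[C>T]C": 0.01},
    "Signature_7": {"A[C>T]G": 0.01, "T[C>T]C": 0.30},
}

print(most_probable_signature(weights, signature_probs, "A[C>T]G"))
print(most_probable_signature(weights, signature_probs, "T[C>T]C"))
```

In this toy sample the first mutation type is assigned to `Signature_1` and the second to `Signature_7`, even though `Signature_1` has the larger weight overall.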

Before running this notebook, run the notebooks in the following folders: nucleosomes, rotational, mutations and increase. In addition, you need to download the probabilities of the signatures from https://cancer.sanger.ac.uk/cancergenome/assets/signatures_probabilities.txt

Please note that samples with fewer than 50 mutations are discarded from this analysis.
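
As an illustration of this filter, the sketch below drops every sample with fewer than 50 mutations. The tuple layout `(sample_id, chrom, pos, ref, alt)` and the function name are assumptions made for the example; the actual filtering is done by the scripts called in the cells below.

```python
from collections import Counter

MIN_MUTATIONS = 50  # threshold stated in the text


def keep_samples(mutations, min_mutations=MIN_MUTATIONS):
    """Drop the mutations of samples with fewer than `min_mutations` mutations.

    mutations: list of (sample_id, chrom, pos, ref, alt) tuples
    (hypothetical layout used only for this example).
    """
    counts = Counter(m[0] for m in mutations)
    return [m for m in mutations if counts[m[0]] >= min_mutations]


# Toy data: S1 has 60 mutations, S2 has only 10.
toy = [("S1", "1", 1000 + i, "C", "T") for i in range(60)]
toy += [("S2", "1", 5000 + i, "G", "A") for i in range(10)]

kept = keep_samples(toy)
print(sorted({m[0] for m in kept}), len(kept))
```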

## TCGA <a id="tcga"></a>

TCGA 505 data has been analysed using the [deconstructsigs](#tcgad) and [sigfit](#tcgas) R packages.

### DeconstructSigs <a id="tcgad"></a>

The results can be found inside the ``tcga`` directory. In addition, the mutations assigned to the same signature have been concatenated across cohorts (find the output in the ``tcga_joined`` folder).

Run *deconstructsigs* for each cohort and then concatenate the mutations assigned to the same signature across cohorts.
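
The joining step can be sketched as follows. This is not the code of `scripts/join.py`; it assumes a layout of one subfolder per cohort holding headerless `Signature_*.tsv.gz` files, and the function name `join_signatures` is hypothetical.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def join_signatures(in_dir, out_dir):
    """Concatenate <in_dir>/<cohort>/Signature_*.tsv.gz per signature.

    Assumes headerless files. Returns the sorted names of the joined files.
    """
    in_dir, out_dir = Path(in_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = set()
    for src in sorted(in_dir.glob("*/Signature_*.tsv.gz")):
        dst = out_dir / src.name
        # Start a fresh file the first time a signature is seen, append after.
        mode = "ab" if src.name in written else "wb"
        with gzip.open(src, "rb") as fin, gzip.open(dst, mode) as fout:
            shutil.copyfileobj(fin, fout)
        written.add(src.name)
    return sorted(written)


# Toy run in a temporary folder with two cohorts sharing one signature.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for cohort, line in [("BRCA", "1\t100\tC\tT\n"), ("SKCM", "2\t200\tC\tT\n")]:
        (root / "in" / cohort).mkdir(parents=True)
        with gzip.open(root / "in" / cohort / "Signature_1.tsv.gz", "wt") as fh:
            fh.write(line)
    names = join_signatures(root / "in", root / "out")
    with gzip.open(root / "out" / "Signature_1.tsv.gz", "rt") as fh:
        joined_lines = fh.read().splitlines()

print(names, len(joined_lines))
```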

In [None]:
%%bash --out output1 --err error1

scripts=${PWD}/scripts
mutations=${PWD}/../mutations

mkdir -p tcga
bash ${scripts}/deconstructsigs.sh ${mutations}/tcga/cohorts tcga wgs

# Concatenate all mutations coming from the same signature
mkdir -p tcga_joined
python ${scripts}/join.py tcga tcga_joined

Compute the relative increase of mutation rate for the joined signatures:
- zoomin analysis using all dyads
- zoomin analysis using high rotational dyads
- zoomin analysis using low rotational dyads
- zoomout analysis
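
For orientation, the quantity behind these analyses can be sketched under the assumption that the relative increase is the usual `(observed - expected) / expected` ratio, computed per position around the dyad. The real computation, including how the expected counts are derived from the 5-mer counts, lives in the `increase` folder; the names `observed` and `expected` and the toy numbers below are hypothetical.

```python
import math


def relative_increase(observed, expected):
    """Per-position (observed - expected) / expected.

    Positions with an expected count of zero yield NaN.
    """
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return [
        (o - e) / e if e != 0 else math.nan
        for o, e in zip(observed, expected)
    ]


# Toy counts at three positions relative to the dyad.
observed = [12, 8, 10]
expected = [10.0, 10.0, 10.0]
result = relative_increase(observed, expected)
print(result)
```

A positive value means more mutations than expected at that position, a negative value fewer.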

In [None]:
%%bash --out output2 --err error2

source activate env_nucperiod

increase_scripts=${PWD}/../increase/scripts
increase=${PWD}/../increase/sapiens
mapping=${PWD}/../nucleosomes/sapiens
rotational=${PWD}/../rotational/sapiens

cd tcga_joined

for file in Signature_*.tsv.gz
do
    
    f=$(basename ${file})
    name=${f/.tsv.gz/}
    
    # Zoomin
    bash ${increase_scripts}/increase.sh ${file} zoomin hg19 5 ${mapping}/dyads.bed.gz \
        ${increase}/hg19_filtered_5mer_counts.json.gz increase/${name}
        
    # Rotational high
    bash ${increase_scripts}/increase.sh ${file} zoomin hg19 5 ${rotational}/high_rotational_dyads.bed.gz \
        ${increase}/hg19_filtered_5mer_counts.json.gz increase_rot_high/${name}
        
    # Rotational low
    bash ${increase_scripts}/increase.sh ${file} zoomin hg19 5 ${rotational}/low_rotational_dyads.bed.gz \
        ${increase}/hg19_filtered_5mer_counts.json.gz increase_rot_low/${name}
        
    # Zoomout
    bash ${increase_scripts}/increase.sh ${file} zoomout hg19 5 ${mapping}/dyads.bed.gz \
        ${increase}/hg19_5mer_counts.json.gz increase_zoomout/${name} ${increase}/closer_dyads.npy
        
done

Compute the relative increase of mutation rate for each signature in each tumor type (using all dyads).
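
To preview which files such a per-tumor-type loop would process, a dry run can be sketched in Python. It assumes one subfolder per tumor type containing `Signature_*.tsv.gz` files; the function name `plan_runs` is hypothetical and the sketch only lists the work without calling `increase.sh`.

```python
import tempfile
from pathlib import Path


def plan_runs(results_dir):
    """List (input file, output prefix) pairs for every tumor type subfolder."""
    runs = []
    for cohort in sorted(p for p in Path(results_dir).iterdir() if p.is_dir()):
        # Folders without signature files simply contribute nothing.
        for path in sorted(cohort.glob("Signature_*.tsv.gz")):
            name = path.name[: -len(".tsv.gz")]
            runs.append((path, f"increase/{cohort.name}_{name}"))
    return runs


# Toy layout: two tumor types plus an output folder left by a previous run.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for cohort, sig in [("BRCA", "Signature_1"), ("SKCM", "Signature_7")]:
        (root / cohort).mkdir()
        (root / cohort / f"{sig}.tsv.gz").touch()
    (root / "increase").mkdir()
    outputs = [prefix for _, prefix in plan_runs(root)]

print(outputs)
```

Unlike an unmatched shell glob, `Path.glob` returns nothing for a folder without matching files, so no explicit guard is needed in this sketch.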

In [None]:
%%bash --out output3 --err error3

source activate env_nucperiod

increase_scripts=${PWD}/../increase/scripts
increase=${PWD}/../increase/sapiens
mapping=${PWD}/../nucleosomes/sapiens

cd tcga

for ctype in $(find . -maxdepth 1 -mindepth 1 -type d)
do
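    for file in ${ctype}/Signature_*.tsv.gz
    do

        # Skip folders without signature files (an unmatched glob stays literal)
        if [ "${file}" == "${ctype}/Signature_*.tsv.gz" ]
        then
            continue
        fi

        f=$(basename ${file})
        name=${f/.tsv.gz/}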
    
        # Zoomin
        bash ${increase_scripts}/increase.sh ${file} zoomin hg19 5 ${mapping}/dyads.bed.gz \
            ${increase}/hg19_filtered_5mer_counts.json.gz increase/${ctype}_${name}
    
    done        
done

### Sigfit <a id="tcgas"></a>

The results can be found inside the ``tcga_sigfit`` directory. In addition, the mutations assigned to the same signature have been concatenated across cohorts (find the output in the ``tcga_joined_sigfit`` folder).

Run *sigfit* for each cohort and then concatenate the mutations assigned to the same signature across cohorts.

In [None]:
%%bash --out output4 --err error4

scripts=${PWD}/scripts
mutations=${PWD}/../mutations

mkdir -p tcga_sigfit
bash ${scripts}/sigfit.sh ${mutations}/tcga/cohorts tcga_sigfit

# Concatenate all mutations coming from the same signature
mkdir -p tcga_joined_sigfit
python ${scripts}/join.py tcga_sigfit tcga_joined_sigfit

Compute the relative increase of mutation rate for the joined signatures.

In [None]:
%%bash --out output5 --err error5

source activate env_nucperiod

increase_scripts=${PWD}/../increase/scripts
increase=${PWD}/../increase/sapiens
mapping=${PWD}/../nucleosomes/sapiens

cd tcga_joined_sigfit

for file in Signature*.tsv.gz
do

    f=$(basename ${file})
    name=${f/.tsv.gz/}

    # Zoomin
    bash ${increase_scripts}/increase.sh ${file} zoomin hg19 5 ${mapping}/dyads.bed.gz \
        ${increase}/hg19_filtered_5mer_counts.json.gz increase/${name}

done

## ICGC <a id="icgcd"></a>

ICGC data has been analysed using the deconstructsigs R package.

Run *deconstructsigs* for each cohort.

In [None]:
%%bash --out output6 --err error6

scripts=${PWD}/scripts
mutations=${PWD}/../mutations

mkdir -p icgc
bash ${scripts}/deconstructsigs.sh ${mutations}/icgc/cohorts icgc wgs

Compute the relative increase of mutation rate for each signature in each tumor type.

In [None]:
%%bash --out output7 --err error7

source activate env_nucperiod

increase_scripts=${PWD}/../increase/scripts
increase=${PWD}/../increase/sapiens
mapping=${PWD}/../nucleosomes/sapiens

cd icgc

for ctype in $(find . -maxdepth 1 -mindepth 1 -type d)
do
    for file in ${ctype}/Signature_*.tsv.gz
    do
    
        if [ "${file}" == "${ctype}/Signature_*.tsv.gz" ]
        then
            continue
        fi
    
        f=$(basename ${file})
        name=${f/.tsv.gz/}

        # Zoomin
        bash ${increase_scripts}/increase.sh ${file} zoomin hg19 5 ${mapping}/dyads.bed.gz \
            ${increase}/hg19_filtered_5mer_counts.json.gz increase/${ctype}_${name}
    
    done        
done

## PanCanAtlas <a id="pcad"></a>

Run *deconstructsigs* for each cohort and then concatenate the mutations assigned to the same signature across cohorts.

In [None]:
%%bash --out output8 --err error8

scripts=${PWD}/scripts
mutations=${PWD}/../mutations

mkdir -p pancanatlas
bash ${scripts}/deconstructsigs.sh ${mutations}/pancanatlas pancanatlas wes

# Concatenate all mutations coming from the same signature
mkdir -p pancanatlas_joined
python ${scripts}/join.py pancanatlas pancanatlas_joined

Compute the relative increase of mutation rate for the joined signatures.

In [None]:
%%bash --out output9 --err error9

source activate env_nucperiod

increase_scripts=${PWD}/../increase/scripts
increase=${PWD}/../increase/sapiens
mapping=${PWD}/../nucleosomes/sapiens

cd pancanatlas_joined

for file in Signature_*.tsv.gz
do

    f=$(basename ${file})
    name=${f/.tsv.gz/}
    
    # Zoomin
    bash ${increase_scripts}/increase.sh ${file} zoomin hg19 5 ${mapping}/dyads_genic.bed.gz \
        ${increase}/hg19_exons_5mer_counts.json.gz increase/${name}

done

## Eyelid

In [None]:
%%bash --out output10 --err error10

scripts=${PWD}/scripts
mutations=${PWD}/../mutations

mkdir -p eyelid
bash ${scripts}/deconstructsigs.sh ${mutations}/eyelid eyelid wgs