# Mutation Enrichment Analysis on Interface Residues

### Background

The results shown in this notebook were generated by mapping missense mutations to the residues at the interface between two regions within a protein of interest. 

**The datasets**
- Protein Data Bank structures: The Protein Data Bank (PDB) is a repository of 3D structures of proteins, nucleic acids, and complex biomolecular assemblies.
- COSMIC: A database of mutations in cancer. Here we use the Cancer Mutation Census, a collection of cancer-associated mutations annotated with their potential role in driving cancer.

**The pipeline performed the following steps:**
1. PDB identifiers for the provided proteins were determined
2. PDB files for the provided proteins were downloaded 
3. The best PDB files were selected as those containing more than 80% of the residues in the regions of interest
4. The residues at the interface were determined and the structures with the largest number of interface residues were selected for the mutation enrichment analysis
5. Potential driver and passenger missense mutations were mapped to the interface residues
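
The coverage filter in step 3 can be illustrated with a minimal sketch. The function names, the residue numbering, and the choice to pool both regions into one coverage value are assumptions made for illustration; the pipeline's own implementation is not shown in this notebook and may differ.

```python
def region_coverage(resolved_residues, regions):
    """Fraction of region-of-interest residues present in a structure.

    resolved_residues: iterable of residue numbers observed in the PDB chain.
    regions: list of (start, end) tuples, inclusive, in the same numbering.
    Both regions are pooled into one coverage value (an assumption).
    """
    wanted = set()
    for start, end in regions:
        wanted.update(range(start, end + 1))
    if not wanted:
        return 0.0
    return len(wanted & set(resolved_residues)) / len(wanted)


def passes_coverage(resolved_residues, regions, threshold=0.8):
    # "More than 80%" is read here as a strict inequality.
    return region_coverage(resolved_residues, regions) > threshold


# Hypothetical structure resolving residues 5-95 of a protein whose
# regions of interest span residues 1-50 and 61-110.
resolved = range(5, 96)
regions = [(1, 50), (61, 110)]
coverage = region_coverage(resolved, regions)
print(f"coverage = {coverage:.2f}, kept = {passes_coverage(resolved, regions)}")
```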

### Summary of the results

Two analyses are carried out below:

1. A global mutation enrichment analysis
    - Here we determine whether mutated residues are enriched at the interface across all the provided proteins
    - The analysis is performed separately for potential cancer-driver missense mutations and potential passenger missense mutations
    - The percentage of residues inside and outside the interface that are hit by at least one missense mutation is plotted below. The hypergeometric test is used to determine the enrichment of mutated residues
    
    
2. A protein-specific mutation enrichment analysis
    - Here we determine the enrichment of mutations for each protein individually
    - The analysis is performed separately for potential cancer-driver missense mutations and potential passenger missense mutations
    - The FDR-adjusted enrichment p-values are plotted below for each protein. The enrichment p-values were obtained using the binomial test. Multiple testing correction was performed with the Benjamini-Hochberg procedure, using a false discovery rate of 0.05.
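
As a rough illustration of the global analysis, the sketch below computes the two plotted percentages and a one-sided hypergeometric p-value from four pooled counts. The mapping of counts to test parameters (population = all residues, successes = interface residues, draws = mutated residues) is an assumption, the counts are made up, and `mut.enrichment_analysis` may be implemented differently.

```python
from math import comb


def hypergeom_enrichment_pval(mut_in, mut_out, interface_len, outside_len):
    """One-sided P(X >= mut_in) under a hypergeometric null.

    Assumed setup: mutated residues are drawn without replacement from
    all residues, and we ask how often at least `mut_in` of them would
    land in the interface by chance.
    """
    population = interface_len + outside_len
    draws = mut_in + mut_out
    upper = min(draws, interface_len)
    tail = sum(
        comb(interface_len, k) * comb(outside_len, draws - k)
        for k in range(mut_in, upper + 1)
    )
    return tail / comb(population, draws)


# Hypothetical pooled counts (not taken from the pipeline's data).
mut_in, mut_out = 8, 12            # mutated residues inside / outside
interface_len, outside_len = 20, 80

pct_in = 100 * mut_in / interface_len    # % of interface residues hit
pct_out = 100 * mut_out / outside_len    # % of non-interface residues hit
pval = hypergeom_enrichment_pval(mut_in, mut_out, interface_len, outside_len)
print(f"interface: {pct_in:.0f}%  outside: {pct_out:.0f}%  p = {pval:.3g}")
```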

In [None]:
import pandas as pd
import numpy as np
import mutation_enrichment as mut

## Global mutated residues enrichment analysis

In [None]:
# Per-residue tables (mutated residues counted once, "norep") supplied by Snakemake
df_driver = pd.read_csv(snakemake.input[0], sep='\t')
df_passenger = pd.read_csv(snakemake.input[2], sep='\t')

# To run outside Snakemake, load the files directly instead:
# df_driver = pd.read_csv('../data/enrichment_analysis/proteins_interface_drivers_norep.tsv', sep = '\t')
# df_passenger = pd.read_csv('../data/enrichment_analysis/proteins_interface_passengers_norep.tsv', sep = '\t')

In [None]:
dfs = [df_driver, df_passenger]
cols = ['mut_in_interface', 'mut_not_in_interface', 'interface_len', 'outside_len']
labels = ['Potential driver\nmissense mutations\n', 'Potential passenger\nmissense mutations\n']
legend = ['Interface residue hit', 'Non-interface residue hit']

hypergeom_results = mut.enrichment_analysis(dfs, cols, repetition = False)
mut.percent_bar(dfs, cols, labels, hypergeom_results, legend, verbose_labels=True, figure_size = (5, 6), save_path = snakemake.output[0])

## Protein-specific mutation enrichment analysis

In [None]:
# Per-protein tables (mutations counted with repetition, "rep") supplied by Snakemake
df_driver = pd.read_csv(snakemake.input[1], sep='\t')
df_passenger = pd.read_csv(snakemake.input[3], sep='\t')

# To run outside Snakemake, load the files directly instead:
# df_driver = pd.read_csv('../data/enrichment_analysis/proteins_interface_drivers_rep.tsv', sep = '\t')
# df_passenger = pd.read_csv('../data/enrichment_analysis/proteins_interface_passengers_rep.tsv', sep = '\t')

### Potential driver missense mutations

In [None]:
# The plotted values are FDR-adjusted p-values on a -log10 scale (larger = more significant)
df_driver['-Log10 (enrichment p-value)'] = -np.log10(df_driver['adjusted_enrichment_pval'])
mut.enrichment_bar(df_driver, 'Gene_name', '-Log10 (enrichment p-value)', (3,6), save_path = snakemake.output[1])

### Potential passenger missense mutations

In [None]:
# The plotted values are FDR-adjusted p-values on a -log10 scale (larger = more significant)
df_passenger['-Log10 (enrichment p-value)'] = -np.log10(df_passenger['adjusted_enrichment_pval'])
mut.enrichment_bar(df_passenger, 'Gene_name', '-Log10 (enrichment p-value)', (3,6), save_path = snakemake.output[2])
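
The `adjusted_enrichment_pval` column used above is produced upstream by the pipeline. For readers who want to see how such values could arise, here is a minimal sketch of a one-sided binomial test per protein followed by Benjamini-Hochberg adjustment. The gene names and counts are hypothetical, and the choice of success probability (interface length divided by the total number of residues considered) is an assumption rather than a description of the pipeline's code.

```python
from math import comb, log10


def binom_enrichment_pval(k, n, p):
    """One-sided P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-protein counts:
# (gene, mutations in interface, total mutations, interface_len, outside_len)
proteins = [
    ("GENE_A", 12, 30, 20, 180),
    ("GENE_B", 3, 25, 30, 170),
    ("GENE_C", 7, 40, 25, 275),
]

raw = [
    binom_enrichment_pval(k, n, ilen / (ilen + olen))
    for _, k, n, ilen, olen in proteins
]
adj = benjamini_hochberg(raw)

for (gene, *_), p_adj in zip(proteins, adj):
    flag = "significant" if p_adj < 0.05 else "not significant"
    print(f"{gene}: adjusted p = {p_adj:.3g}, -log10 = {-log10(p_adj):.2f} ({flag})")
```

On the -log10 scale used in the plots, the 0.05 threshold corresponds to a bar height of about 1.3.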