# Mutation Enrichment Analysis on Interface Residues

### Background

Intramolecular interactions between different parts of a protein are important for its function and can regulate it either positively or negatively. The amino acid residues at the interface between two interacting parts of a protein establish these interactions. In some cases, mutations at interface residues deregulate the activity of the protein and, as a consequence, lead to disease. The results shown in this notebook were generated by mapping cancer missense mutations to the residues at the interface between two regions of a protein of interest, to determine whether mutations are enriched at this interface in cancer.

**The datasets**
- Protein Data Bank structures: The Protein Data Bank (PDB) is a repository of 3D structures of proteins, nucleic acids, and complex biomolecular assemblies.
- COSMIC: A database of mutations in cancer. Here we use the Cancer Mutation Census, a collection of cancer-associated mutations annotated with their potential role in driving cancer.

**The pipeline performed the following steps:**
1. PDB identifiers for the provided proteins were determined
2. PDB files for the provided proteins were downloaded
3. The best PDB files were selected as those covering more than 80% of the residues in the regions of interest
4. The residues at the interface were determined and the structures with the largest number of interface residues were selected for the mutation enrichment analysis
5. Potential driver missense mutations from the COSMIC Cancer Mutation Census were mapped to the interface residues
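
The coverage filter in step 3 can be sketched as follows. This is a minimal illustration, not the pipeline's code: the function names `region_coverage` and `passes_coverage` are hypothetical, regions are assumed to be given as inclusive residue-number ranges, and the 80% threshold is assumed to apply to each region separately (the pipeline may instead apply it to the regions combined).

```python
def region_coverage(resolved_residues, region_start, region_end):
    """Fraction of the residues in [region_start, region_end] present in a structure."""
    region = set(range(region_start, region_end + 1))
    return len(region & set(resolved_residues)) / len(region)


def passes_coverage(resolved_residues, regions, threshold=0.8):
    """True if every region of interest is covered by more than `threshold`.

    Assumption: the threshold is checked per region, not on the pooled regions.
    """
    return all(
        region_coverage(resolved_residues, start, end) > threshold
        for start, end in regions
    )


# Hypothetical structure that resolves residues 1-90 of the protein
resolved = range(1, 91)
print(passes_coverage(resolved, [(1, 100)]))              # region 1-100 is 90% covered
print(passes_coverage(resolved, [(1, 100), (81, 100)]))   # region 81-100 is only 50% covered
```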

### Summary of the results

Two analyses are carried out below:

1. A global mutation enrichment analysis
    - Here we determine whether mutated residues are enriched at interface residues for all the provided proteins
    - The analysis is performed using potential cancer-driver missense mutations
    - The percentage of residues inside and outside the interface that are hit by at least one missense mutation is plotted below. The hypergeometric test is used to determine whether mutated residues are enriched at the interface
    - Only the proteins with at least one mutation in their entire sequence are considered
    
    
2. A protein-specific mutation enrichment analysis
    - Here we determine the enrichment of mutations for each protein individually
    - The analysis is performed using potential cancer-driver missense mutations
    - The FDR-adjusted enrichment p-values are plotted below for each protein. The enrichment p-values were obtained using the binomial test. Multiple testing correction was performed with the Benjamini-Hochberg procedure at a false discovery rate of 0.05
    - Only the proteins with a significant enrichment are plotted

In [1]:
import pandas as pd
import numpy as np
import mutation_enrichment as mut

## Global mutated residues enrichment analysis

In [None]:
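# Load the per-protein counts used for the global analysis
df_driver = pd.read_csv(snakemake.input[0], sep='\t')

# Keep only proteins with at least one mutation, inside or outside the interface
df_driver = df_driver.loc[(df_driver['mut_in_interface'] > 0) | (df_driver['mut_not_in_interface'] > 0)]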

In [None]:
dfs = [df_driver]
cols = ['mut_in_interface', 'mut_not_in_interface', 'interface_len', 'outside_len']
labels = ['Potential driver\nmissense mutations\n']
legend = ['Interface residue hit', 'Non Interface residue hit']

hypergeom_results = mut.enrichment_analysis(dfs, cols, repetition=False)
mut.percent_bar(dfs, cols, labels, hypergeom_results, legend, verbose_labels=True,
                figure_size=(2.5, 6), save_path=snakemake.output[0])

**Interpretation**

If the interface residues found are important in cancer, we expect the residues hit by cancer-driver missense mutations to be significantly enriched at the interface.
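
The global test can be illustrated with a one-sided hypergeometric test from SciPy. This is a sketch under stated assumptions, not the implementation of `mut.enrichment_analysis`, whose internals are not shown here: the function name `interface_enrichment_pval` is hypothetical, and the counts are assumed to be numbers of distinct mutated residues pooled over all proteins.

```python
from scipy.stats import hypergeom


def interface_enrichment_pval(mut_in_interface, mut_not_in_interface,
                              interface_len, outside_len):
    """P(at least `mut_in_interface` mutated residues fall in the interface).

    Assumed model: the interface residues are a sample drawn without
    replacement from all residues, some of which are mutated.
    """
    total_residues = interface_len + outside_len
    total_mutated = mut_in_interface + mut_not_in_interface
    # sf(k - 1) gives P(X >= k); SciPy's argument order is
    # (k, population size, successes in population, number of draws)
    return hypergeom.sf(mut_in_interface - 1, total_residues,
                        total_mutated, interface_len)


# Made-up counts for illustration only: 8 of 20 interface residues mutated
# versus 2 of 180 residues outside the interface
print(interface_enrichment_pval(8, 2, 20, 180))
```

A small p-value here means that more mutated residues lie in the interface than expected if mutated residues were spread uniformly over the sequence.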

## Protein-specific mutation enrichment analysis

In [None]:
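# Load the per-protein enrichment results
df_driver = pd.read_csv(snakemake.input[1], sep='\t')

# Keep only proteins with at least one mutation, inside or outside the interface
df_driver = df_driver.loc[(df_driver['mut_in_interface'] > 0) | (df_driver['mut_not_in_interface'] > 0)]
# Keep only significant proteins; copy so a column can be added later without a pandas warning
df_driver = df_driver.loc[df_driver['reject_null_hypothesis'] == True].copy()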

### Proteins enriched for potential driver missense mutations within the interface residues

In [None]:
df_driver['-Log10 (FDR adjusted\nenrichment p-value)'] = -np.log10(df_driver['adjusted_enrichment_pval'])
mut.enrichment_bar(df_driver, 'Gene_name', '-Log10 (FDR adjusted\nenrichment p-value)', save_path = snakemake.output[1])

**Interpretation**

As in the analysis above, if the mutation of interface residues is important in cancer, we expect a significant enrichment of potential driver missense mutations at the interface. Here we can see exactly which proteins show a significant enrichment.
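
For reference, the per-protein test and the multiple testing correction described in the summary can be sketched as below. This is an assumed formulation, not the pipeline's code: the helper names `binomial_enrichment_pval` and `benjamini_hochberg` are hypothetical, the null probability of a mutation landing in the interface is assumed to be the interface's share of the analysed residues, and the p-values in the usage example are made up.

```python
import numpy as np
from scipy.stats import binom


def binomial_enrichment_pval(mut_in_interface, mut_not_in_interface,
                             interface_len, outside_len):
    """One-sided binomial test: P(X >= observed interface mutations)."""
    n_mutations = mut_in_interface + mut_not_in_interface
    # Assumed null: each mutation hits the interface with probability
    # proportional to the interface length
    p_interface = interface_len / (interface_len + outside_len)
    return binom.sf(mut_in_interface - 1, n_mutations, p_interface)


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    pvals = np.asarray(pvals, dtype=float)
    m = len(pvals)
    order = np.argsort(pvals)
    scaled = pvals[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, starting from the largest p-value
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(adjusted_sorted, 0.0, 1.0)
    return adjusted


# Made-up p-values for four proteins
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
reject = adjusted < 0.05  # false discovery rate of 0.05
print(adjusted, reject)
```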