# Filtering IntAct Dataset

**Version 3**


<b><i class="fa fa-folder-o" area-hidden="true" style="color:#1976D2"> </i>&nbsp; File Location</b><br>
<p style="background:#F5F5F5; text-indent: 1em;">
<code style="background:#F5F5F5; color:#404040; font-weight:bold; font-size:12px">C:\Users\ibrah\Documents\GitHub\Predicting-Mutation-Effects\src\helpers\helpers_training_data</code>
</p>

<b><i class="far fa-file" area-hidden="true" style="color:#1976D2"> </i>&nbsp; File Name</b>
<p style="background:#F5F5F5; text-indent: 1em;">
<code style="background:#F5F5F5; color:#404040; font-weight:bold; font-size:12px">PredatorTrainingDataPreparer.ipynb</code>
</p>

<b><i class="far fa-calendar-alt" area-hidden="true" style="color:#1976D2"> </i>&nbsp; Last Edited</b>
<p style="background:#F5F5F5; text-indent: 1em;">
<code style="background:#F5F5F5; color:#404040; font-weight:bold; font-size:12px">October 06th, 2021</code>
</p>


## Goal

Extract `protein` and `mutation` information from the modified IntAct file (see the cells below for details). 
These **protein.mutation** pairs are then used in ELASPIC to retrieve feature columns.

<div class="alert alert-block" style="background-color: #F5F5F5; border: 1px solid; padding: 10px; border-color: #E0E0E0">
    <b><i class="fa fa-info-circle" aria-hidden="true" style="color:#404040"></i></b>&nbsp; <b style="color: #404040">Info</b> <br>
<div>

The following operations are performed:
    
* [Filtered for "9606 - Homo sapiens"](#Filter-for-"9606---Homo-sapiens")
* [Filtered for 1-letter seq. chars](#Filter-for-"One---Letter-Sequence-Character")
* [Filtered for Labels: "mutation(MI:0118)" is removed](#Filter-for-Mutation-Label:-Remove-"mutation(MI:0118)")
* [Filtered for "Interaction Participants": non-human interactors are removed](#Filter-for-"Interaction-Participants":-Remove-Non-homosapien-Interactor)
    
Number of entries in this dataframe: **36170** 

## Setup

In [26]:
# Imports
import pandas as pd
import numpy as np
import os.path as op

# Original IntAct File
DATA_DIR = op.join("..", "..", "..", "data")  # relative path built portably (no hard-coded backslashes)
INTACT_FILE_PATH = op.join(DATA_DIR, "intact_mutations", "intact_mutations_2020-07-06.tsv")

## Pre-processing IntAct Dataset

### Original IntAct

In [27]:
# Read Original Intact Mutation Data
mutations_data = pd.read_table(INTACT_FILE_PATH, delimiter="\t")

# Size of dataframe
print(mutations_data.shape)

# First 3 entries
mutations_data.head(3)

(59422, 15)


Unnamed: 0,#Feature AC,Feature short label,Feature range(s),Original sequence,Resulting sequence,Feature type,Feature annotation,Affected protein AC,Affected protein symbol,Affected protein full name,Affected protein organism,Interaction participants,PubMedID,Figure legend,Interaction AC
0,EBI-464941,O75940:p.Glu133Lys,133-133,E,K,mutation(MI:0118),,uniprotkb:O75940,SMNDC1,,9606 - Homo sapiens,"uniprotkb:O75940(protein(MI:0326), 9606 - Homo...",15494309,,EBI-464937
1,EBI-489661,P15153:p.Gln61Leu,61-61,Q,L,mutation(MI:0118),,uniprotkb:P15153,RAC2,,9606 - Homo sapiens,"uniprotkb:P15153(protein(MI:0326), 9606 - Homo...",11090627,,EBI-489644
2,EBI-495357,Q99640:p.Asn238Ala,238-238,N,A,mutation(MI:0118),,uniprotkb:Q99640,PKMYT1,,9606 - Homo sapiens,"uniprotkb:Q99640(protein(MI:0326), 9606 - Homo...",10373560,Fig. 1B,EBI-495348


### Filter for "9606 - Homo sapiens"

The "*Affected protein organism*" column is filtered for "**9606 - Homo sapiens**"; rows for any other organism are dropped.

In [8]:
# Filtering the data where "Affected protein organism" is "9606 - Homo sapiens".
mutations_homo_sapiens_data = mutations_data[mutations_data['Affected protein organism'] == "9606 - Homo sapiens"]
print(mutations_homo_sapiens_data.shape)
mutations_homo_sapiens_data.head(3)

(43495, 15)


Unnamed: 0,#Feature AC,Feature short label,Feature range(s),Original sequence,Resulting sequence,Feature type,Feature annotation,Affected protein AC,Affected protein symbol,Affected protein full name,Affected protein organism,Interaction participants,PubMedID,Figure legend,Interaction AC
0,EBI-464941,O75940:p.Glu133Lys,133-133,E,K,mutation(MI:0118),,uniprotkb:O75940,SMNDC1,,9606 - Homo sapiens,"uniprotkb:O75940(protein(MI:0326), 9606 - Homo...",15494309,,EBI-464937
1,EBI-489661,P15153:p.Gln61Leu,61-61,Q,L,mutation(MI:0118),,uniprotkb:P15153,RAC2,,9606 - Homo sapiens,"uniprotkb:P15153(protein(MI:0326), 9606 - Homo...",11090627,,EBI-489644
2,EBI-495357,Q99640:p.Asn238Ala,238-238,N,A,mutation(MI:0118),,uniprotkb:Q99640,PKMYT1,,9606 - Homo sapiens,"uniprotkb:Q99640(protein(MI:0326), 9606 - Homo...",10373560,Fig. 1B,EBI-495348


### Filter for "One - Letter Sequence Character"

The "*Original sequence*" and "*Resulting sequence*" columns must each contain exactly **one letter**. Entries that do not satisfy this condition are removed.

In [9]:
mutations_homo_sapiens_oneletter_data = mutations_homo_sapiens_data[
    (mutations_homo_sapiens_data["Original sequence"].str.len() == 1) & 
    (mutations_homo_sapiens_data["Original sequence"].str.isalpha()) &
    (mutations_homo_sapiens_data["Resulting sequence"].str.len() == 1) &
    (mutations_homo_sapiens_data["Resulting sequence"].str.isalpha())
]  

# Reset index of the dataframe to avoid any possible errors
mutations_homo_sapiens_oneletter_data.reset_index(drop=True, inplace=True)

print(mutations_homo_sapiens_oneletter_data.shape)
mutations_homo_sapiens_oneletter_data.head(3)

(41833, 15)


Unnamed: 0,#Feature AC,Feature short label,Feature range(s),Original sequence,Resulting sequence,Feature type,Feature annotation,Affected protein AC,Affected protein symbol,Affected protein full name,Affected protein organism,Interaction participants,PubMedID,Figure legend,Interaction AC
0,EBI-464941,O75940:p.Glu133Lys,133-133,E,K,mutation(MI:0118),,uniprotkb:O75940,SMNDC1,,9606 - Homo sapiens,"uniprotkb:O75940(protein(MI:0326), 9606 - Homo...",15494309,,EBI-464937
1,EBI-489661,P15153:p.Gln61Leu,61-61,Q,L,mutation(MI:0118),,uniprotkb:P15153,RAC2,,9606 - Homo sapiens,"uniprotkb:P15153(protein(MI:0326), 9606 - Homo...",11090627,,EBI-489644
2,EBI-495357,Q99640:p.Asn238Ala,238-238,N,A,mutation(MI:0118),,uniprotkb:Q99640,PKMYT1,,9606 - Homo sapiens,"uniprotkb:Q99640(protein(MI:0326), 9606 - Homo...",10373560,Fig. 1B,EBI-495348
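
The single-letter rule can also be written as one regular-expression test per column. The sketch below is an illustration and not the notebook's own code: `keep_single_letter_rows` and the toy dataframe are hypothetical, and it assumes that only ASCII letters should pass (the `str.isalpha()` check in the cell above also accepts non-ASCII letters). Passing `na=False` makes missing values fail the test explicitly.

```python
import pandas as pd

def keep_single_letter_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Keep rows whose original and resulting sequences are each one ASCII letter."""
    single_letter = r"[A-Za-z]"
    mask = (
        df["Original sequence"].str.fullmatch(single_letter, na=False)
        & df["Resulting sequence"].str.fullmatch(single_letter, na=False)
    )
    # Return a new, re-indexed frame instead of modifying a slice in place
    return df.loc[mask].reset_index(drop=True)

# Toy input (made up for illustration): only the first and last rows should survive
toy = pd.DataFrame({
    "Original sequence": ["E", "EK", "Q", None, "N"],
    "Resulting sequence": ["K", "A", "-", "L", "A"],
})
kept = keep_single_letter_rows(toy)
print(kept)
```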


### Filter for Mutation Label: Remove "mutation(MI:0118)"

Entries whose mutation effect (encoded as `Feature type` in IntAct) is labeled **"mutation(MI:0118)"** are removed.

In [10]:
mutations_label_filtered_data = mutations_homo_sapiens_oneletter_data[
    (mutations_homo_sapiens_oneletter_data["Feature type"] != "mutation(MI:0118)")]

# Reset index of the dataframe to avoid any possible errors
mutations_label_filtered_data.reset_index(drop=True, inplace=True)

print(mutations_label_filtered_data.shape)
mutations_label_filtered_data.head(3)

(39204, 15)


Unnamed: 0,#Feature AC,Feature short label,Feature range(s),Original sequence,Resulting sequence,Feature type,Feature annotation,Affected protein AC,Affected protein symbol,Affected protein full name,Affected protein organism,Interaction participants,PubMedID,Figure legend,Interaction AC
0,EBI-11702293,O43557:p.[Ala138Thr;Ser160Gly;Asp221_Glu222del...,138-138,A,T,mutation causing(MI:2227),,uniprotkb:O43557,TNFSF14,,9606 - Homo sapiens,"uniprotkb:Q71F55(protein(MI:0326), 10090 - Mus...",26977880,2 Am,EBI-11702290
1,EBI-11702293,O43557:p.[Ala138Thr;Ser160Gly;Asp221_Glu222del...,160-160,S,G,mutation causing(MI:2227),,uniprotkb:O43557,TNFSF14,,9606 - Homo sapiens,"uniprotkb:Q71F55(protein(MI:0326), 10090 - Mus...",26977880,2 Am,EBI-11702290
2,EBI-25425844,Q06124-2:p.[Tyr279Cys;Asp425Ala;Cys459Ser],425-425,D,A,mutation causing(MI:2227),,uniprotkb:Q06124-2,PTPN11,,9606 - Homo sapiens,"uniprotkb:Q6P1J9(protein(MI:0326), 9606 - Homo...",26742426,1d,EBI-25425831
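
Each filtering step so far follows the same pattern: apply a boolean mask, reset the index, and print the new shape. A small helper can express that pattern once. This is only a sketch; `apply_filter` and the toy data are hypothetical and not part of the original notebook. It returns a new dataframe rather than calling `reset_index(..., inplace=True)` on a slice.

```python
import pandas as pd

def apply_filter(df: pd.DataFrame, mask: pd.Series, step_name: str) -> pd.DataFrame:
    """Return the rows of df selected by mask, re-indexed from 0, and report the row counts."""
    filtered = df.loc[mask].reset_index(drop=True)
    removed = len(df) - len(filtered)
    print(f"{step_name}: {len(df)} -> {len(filtered)} rows ({removed} removed)")
    return filtered

# Toy input (made up for illustration; the non-human organism label is a placeholder)
toy = pd.DataFrame({
    "Affected protein organism": [
        "9606 - Homo sapiens",
        "0000 - toy non-human organism",
        "9606 - Homo sapiens",
    ],
    "Feature type": [
        "mutation(MI:0118)",
        "mutation causing(MI:2227)",
        "mutation causing(MI:2227)",
    ],
})

human = apply_filter(toy, toy["Affected protein organism"] == "9606 - Homo sapiens", "Homo sapiens")
labeled = apply_filter(human, human["Feature type"] != "mutation(MI:0118)", "Drop MI:0118")
```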


### Filter for "Interaction Participants": Remove Non-homosapien Interactor

The "*Interaction participants*" column may list interactors from other organisms. These non-human interactors are removed. A participant that is the affected protein itself is skipped as well.

<div class="alert alert-block" style="background-color: white; border: 2px solid; padding: 10px; border-color: #F57C00">
    <b style="color: #F57C00"><i class="fa fa-warning" aria-hidden="true"></i>&nbsp; Warning</b><br>
<div>
    
The data will be altered and reduced significantly:
* Unnecessary columns will be dropped, e.g. *#Feature AC* and *Feature short label*; they can easily be retrieved from the original data.
* "*Feature range(s)*", "*Original sequence*", "*Resulting sequence*" columns will be combined and turned into **mutation**.
* Interaction participants will be expanded: 
    * For each participant, rows are repeated
    * If the affected protein appears among its own participants, that entry is removed
    * $A : A, B, C \ \rightarrow \ \begin{matrix} A : B \\ A : C \end{matrix}$

   

    
**Resulting data will be as follows:**
    
| IntAct Line | Labels (Mutation effect) | Affected Protein AC | Mutation | (**single**) Interaction Participant |
| --- | --- | --- | --- | --- |
| Line 1 | mutation causing(MI:2227) | uniprotkb:Q06124-2 | D425A | uniprotkb:Q6P1J9(protein(MI:0326), 9606 - Homo sapiens)|
| Line 2 | mutation causing(MI:2227) | uniprotkb:Q06124-2 | C459S | uniprotkb:Q6P1J9(protein(MI:0326), 9606 - Homo sapiens)|
| Line 3 | mutation causing(MI:2227) | uniprotkb:Q06124-2 | Y279C | uniprotkb:Q6P1J9(protein(MI:0326), 9606 - Homo sapiens)|
| Line 4 | mutation causing(MI:2227) | uniprotkb:P04264 | L161P | uniprotkb:P37198(protein(MI:0326), 9606 - Homo sapiens)|
| Line 5 | mutation causing(MI:2227) | uniprotkb:P45381 | E285A | uniprotkb:Q14145(protein(MI:0326), 9606 - Homo sapiens)|
| **Line 6** | mutation causing(MI:2227) | uniprotkb:Q99697-3 | L54Q | uniprotkb:P11142(protein(MI:0326), 9606 - Homo sapiens)|
| **Line 6** | mutation causing(MI:2227) | uniprotkb:Q99697-3 | L54Q | uniprotkb:P08238(protein(MI:0326), 9606 - Homo sapiens)|

In [19]:
def reduce_data(dataframe_param):
    """
    Take the filtered IntAct dataframe and convert it into the processed form described above.
    """
    
    # Work on a re-indexed copy so the caller's dataframe is left untouched
    dataframe_param = dataframe_param.reset_index(drop=True)
    
    # Create list that will store the entries
    data_entries = []
    
    # Column names of the new data
    column_names = ["Mutation Effect Label", "Affected Protein AC", "Mutation", "Interaction Participant"] 

    for index, row in dataframe_param.iterrows():
        mutation_effect_label = row["Feature type"]
        affected_protein_ac = row["Affected protein AC"]
        mutation = row["Original sequence"] + row["Feature range(s)"].split('-')[0] + row["Resulting sequence"]  #  W + 45-45 + A → W45A
        interaction_participants = row["Interaction participants"].split(';')
        
        for participant in interaction_participants:
            # Skip participants that are not Homo sapiens or that match the affected protein itself.
            if "9606 - Homo sapiens" not in participant or affected_protein_ac in participant:
                continue
               
            # Skip also if affected_protein_ac does not have "uniprotkb:" tag, or single participant does not have "uniprotkb:" tag.
            if not (affected_protein_ac.startswith("uniprotkb:") and (participant.startswith("uniprotkb:"))):
                continue
            
            # Append to data entries.
            data_entries.append([mutation_effect_label, affected_protein_ac, mutation, participant])
            
            # [optional] printing 
            # print(mutation_effect_label, affected_protein_ac, mutation, participant, sep='\t')
        
        
    # Construct dataframe from entries  
    processed_data = pd.DataFrame(data_entries, columns=column_names)
    
    return processed_data
    

In [20]:
processed_data = reduce_data(mutations_label_filtered_data)

In [21]:
print(processed_data.shape)
processed_data.head(10)

(36170, 4)


Unnamed: 0,Mutation Effect Label,Affected Protein AC,Mutation,Interaction Participant
0,mutation causing(MI:2227),uniprotkb:Q06124-2,D425A,"uniprotkb:Q6P1J9(protein(MI:0326), 9606 - Homo..."
1,mutation causing(MI:2227),uniprotkb:Q06124-2,C459S,"uniprotkb:Q6P1J9(protein(MI:0326), 9606 - Homo..."
2,mutation causing(MI:2227),uniprotkb:Q06124-2,Y279C,"uniprotkb:Q6P1J9(protein(MI:0326), 9606 - Homo..."
3,mutation causing(MI:2227),uniprotkb:P04264,L161P,"uniprotkb:P37198(protein(MI:0326), 9606 - Homo..."
4,mutation causing(MI:2227),uniprotkb:P45381,E285A,"uniprotkb:Q14145(protein(MI:0326), 9606 - Homo..."
5,mutation causing(MI:2227),uniprotkb:Q99697-3,L54Q,"uniprotkb:P11142(protein(MI:0326), 9606 - Homo..."
6,mutation causing(MI:2227),uniprotkb:Q99697-3,L54Q,"uniprotkb:P08238(protein(MI:0326), 9606 - Homo..."
7,mutation causing(MI:2227),uniprotkb:P51795,K725E,"uniprotkb:O43889-2(protein(MI:0326), 9606 - Ho..."
8,mutation causing(MI:2227),uniprotkb:P09871,G630E,"uniprotkb:O43889-2(protein(MI:0326), 9606 - Ho..."
9,mutation causing(MI:2227),uniprotkb:P46527,I119T,"uniprotkb:Q8N9N5-2(protein(MI:0326), 9606 - Ho..."
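
The expansion performed by `reduce_data` can also be expressed with pandas' own `str.split` and `DataFrame.explode`. The block below is a sketch under that assumption, not the notebook's implementation: `expand_participants`, the toy accessions `A` to `D`, and the placeholder organism label are all hypothetical. It applies the same three conditions as the loop (human partner, not the affected protein itself, both sides tagged `uniprotkb:`).

```python
import pandas as pd

HUMAN = "9606 - Homo sapiens"

def expand_participants(df: pd.DataFrame) -> pd.DataFrame:
    """Vectorized sketch: one output row per (affected protein, human partner)."""
    expanded = df.assign(
        # "L" + "54-54" + "Q" -> "L54Q" (the start position of the range is used)
        Mutation=df["Original sequence"]
        + df["Feature range(s)"].str.split("-").str[0]
        + df["Resulting sequence"],
        # Turn "x;y;z" into ["x", "y", "z"] so explode() repeats the row per participant
        Participant=df["Interaction participants"].str.split(";"),
    ).explode("Participant").reset_index(drop=True)

    affected = expanded["Affected protein AC"]
    partner = expanded["Participant"]

    is_human = partner.str.contains(HUMAN, regex=False)
    # Same substring test as the loop version: the partner entry mentions the affected AC
    is_self = pd.Series(
        [ac in p for ac, p in zip(affected, partner)],
        index=expanded.index,
        dtype=bool,
    )
    tagged = affected.str.startswith("uniprotkb:") & partner.str.startswith("uniprotkb:")

    kept = expanded.loc[is_human & ~is_self & tagged].rename(columns={
        "Feature type": "Mutation Effect Label",
        "Affected protein AC": "Affected Protein AC",
        "Participant": "Interaction Participant",
    })
    columns = ["Mutation Effect Label", "Affected Protein AC", "Mutation", "Interaction Participant"]
    return kept[columns].reset_index(drop=True)

# Toy input (made up): A interacts with itself, two human partners, and one non-human partner
toy = pd.DataFrame({
    "Feature type": ["mutation causing(MI:2227)"],
    "Affected protein AC": ["uniprotkb:A"],
    "Feature range(s)": ["54-54"],
    "Original sequence": ["L"],
    "Resulting sequence": ["Q"],
    "Interaction participants": [
        "uniprotkb:A(protein(MI:0326), 9606 - Homo sapiens);"
        "uniprotkb:B(protein(MI:0326), 9606 - Homo sapiens);"
        "uniprotkb:C(protein(MI:0326), 9606 - Homo sapiens);"
        "uniprotkb:D(protein(MI:0326), 0000 - toy non-human organism)"
    ],
})
result = expand_participants(toy)
print(result)
```

Because the self-interaction check is a substring test, an affected accession such as `uniprotkb:P12345` also matches a participant entry for an isoform like `uniprotkb:P12345-2`; this holds for the loop version and for the sketch alike.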


#### Resulting data

<div class="alert alert-block" style="background-color: white; border: 2px solid; padding: 10px; border-color: #0097A7">
    <b style="color: #0097A7"><i class="fa fa-info-circle" aria-hidden="true"></i>&nbsp; Info</b><br>
<div>

**Dataframe:** `processed_data`
    
Number of Entries: **36170**

## Value Counts of **Labels**

The distribution of *labels* (the `Mutation Effect Label` column, taken from `Feature type` in IntAct) is given below. Note that "mutation(MI:0118)" no longer appears.

In [14]:
# Value counts of "LABELS"
processed_data["Mutation Effect Label"].value_counts()

mutation with no effect(MI:2226)         18603
mutation disrupting strength(MI:1128)     5808
mutation disrupting(MI:0573)              4029
mutation decreasing(MI:0119)              2962
mutation decreasing strength(MI:1133)     2228
mutation increasing(MI:0382)               759
mutation increasing strength(MI:1132)      697
mutation causing(MI:2227)                  352
mutation disrupting rate(MI:1129)          330
mutation decreasing rate(MI:1130)          289
mutation increasing rate(MI:1131)          113
Name: Mutation Effect Label, dtype: int64
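
As a quick consistency check, the counts above can be summed and converted into relative shares. The sketch below copies the numbers by hand from the output above; `label_counts` and `shares` are illustrative names that do not appear in the notebook.

```python
# Counts copied from the value_counts() output above
label_counts = {
    "mutation with no effect(MI:2226)": 18603,
    "mutation disrupting strength(MI:1128)": 5808,
    "mutation disrupting(MI:0573)": 4029,
    "mutation decreasing(MI:0119)": 2962,
    "mutation decreasing strength(MI:1133)": 2228,
    "mutation increasing(MI:0382)": 759,
    "mutation increasing strength(MI:1132)": 697,
    "mutation causing(MI:2227)": 352,
    "mutation disrupting rate(MI:1129)": 330,
    "mutation decreasing rate(MI:1130)": 289,
    "mutation increasing rate(MI:1131)": 113,
}

# The total should equal the number of rows in processed_data
total = sum(label_counts.values())
shares = {label: count / total for label, count in label_counts.items()}

print(f"total entries: {total}")
for label, share in shares.items():
    print(f"{label:<40}{share:7.1%}")
```

On the dataframe itself, `processed_data["Mutation Effect Label"].value_counts(normalize=True)` gives the same proportions directly.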

## Exporting `processed_data`

Export `processed_data` as "**processed_data_v3_rs.csv**".

In [13]:
processed_data.shape

(36170, 4)

In [14]:
# processed_data.to_csv("processed_data_v3_rs.csv", index=False)
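
The Goal section describes the output as **protein.mutation** pairs. A minimal sketch of building such strings from the processed columns is given below. The exact input format ELASPIC expects is an assumption here (accession and mutation joined by a dot, with the `uniprotkb:` prefix removed); `to_protein_mutation` and the toy frame are hypothetical.

```python
import pandas as pd

def to_protein_mutation(df: pd.DataFrame) -> pd.Series:
    """Build unique 'ACCESSION.MUTATION' strings (assumed format, see the note above)."""
    accession = df["Affected Protein AC"].str.replace("uniprotkb:", "", regex=False)
    pairs = accession + "." + df["Mutation"]
    # The same protein/mutation pair repeats once per interaction partner, so deduplicate
    return pairs.drop_duplicates().reset_index(drop=True)

# Toy frame using three rows shaped like the processed output shown earlier
toy = pd.DataFrame({
    "Affected Protein AC": ["uniprotkb:Q99697-3", "uniprotkb:Q99697-3", "uniprotkb:P04264"],
    "Mutation": ["L54Q", "L54Q", "L161P"],
})
pairs = to_protein_mutation(toy)
print(pairs.tolist())
```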

----