# Convert CMAP CSV file to a matrix

This notebook takes as input a CSV file with two columns, target and indication, taken from the original raw CMAP data Anna provided me with. The output is a matrix with genes/targets as rows and indications as columns; an entry is 1 if there is a link between that gene/target and that indication, and 0 otherwise.

At almost every step I need to remove the duplicates that get generated.

In [51]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

In [2]:
# reading in the csv file with the raw data
df = pd.read_csv('cmap_reduced.csv')

In [3]:
df.describe()

Unnamed: 0,target,indication
count,4508,2095
unique,2351,997
top,PTGS1|PTGS2,hypertension
freq,45,66


In [4]:
# Only interested in rows with both a target and an indication: these are the links to compare with sentence co-occurrence

df_reduced = df[df['target'].notnull() & df['indication'].notnull()]
# drop=True discards the old index instead of keeping it as an extra column
df_reduced = df_reduced.reset_index(drop=True)

In [5]:
# checking whether there are duplicates
sum(df_reduced.duplicated())

178

Dropping duplicates. I need to do this again further along, because separating out the targets and indications creates more duplicates.

In [6]:
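# resetting the index keeps the row labels contiguous (0..n-1) after rows are dropped,
# which the positional loops in the split functions below rely on
df_clean = df_reduced.drop_duplicates().reset_index(drop=True)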

### Functions for separating out multiple targets and multiple indications within the same cell

A single row of the original data can hold several targets and several indications, e.g. t1|t2 i1|i2. I want to separate that out into 4 rows instead: t1 i1, t1 i2, t2 i1, t2 i2.
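
The expansion described above can be pictured on a toy dataframe. This is only a sketch: the values `t1`, `t2`, `i1`, `i2` are placeholders, and `DataFrame.explode` is just one way to produce the four rows.

```python
import pandas as pd

# Toy row with two targets and two indications (placeholder values)
toy = pd.DataFrame({'target': ['t1|t2'], 'indication': ['i1|i2']})

# Turn each '|'-delimited string into a list of its parts
toy['target'] = toy['target'].str.split('|')
toy['indication'] = toy['indication'].str.split('|')

# Exploding one column after the other yields every target/indication pair
expanded = toy.explode('target').reset_index(drop=True)
expanded = expanded.explode('indication').reset_index(drop=True)
print(expanded)
```

Exploding the two columns one after the other is what gives the full cross product of targets and indications for the row.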

In [7]:
# this function separates out the targets and creates a new row for each target with the corresponding indication
def split_target(df):
    # work on a copy with a fresh index so the caller's dataframe is left untouched
    df = df.reset_index(drop=True)
    # 't1|t2' becomes ['t1', 't2']; explode() then gives each list element its own row
    df['target'] = df['target'].str.split('|')
    df = df.explode('target')

    return df.sort_values(['indication', 'target'])

In [8]:
# this function separates out the indications and creates a new row for each one with the corresponding target
def split_indication(df):
    # work on a copy with a fresh index so the caller's dataframe is left untouched
    df = df.reset_index(drop=True)
    # 'i1|i2' becomes ['i1', 'i2']; explode() then gives each list element its own row
    df['indication'] = df['indication'].str.split('|')
    df = df.explode('indication')

    return df.sort_values(['target', 'indication'])

In [9]:
# target transformation
df1 = split_target(df_clean).reset_index(drop=True)

In [11]:
# indication transformation
df2 = split_indication(df1).reset_index(drop=True)

I also need to drop duplicates here, because separating out both the target and the indication creates more duplicates.
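
A small illustration of why splitting creates new duplicates, using made-up values (`A`, `B`, `x` are placeholders, not rows from the CMAP file): two raw rows that are distinct as strings can overlap once the targets are separated.

```python
import pandas as pd

# Two distinct raw rows (placeholder values) that overlap once split
toy = pd.DataFrame({'target': ['A|B', 'A'], 'indication': ['x', 'x']})
dupes_before = int(toy.duplicated().sum())

# Separate the '|'-delimited targets into one row each
toy['target'] = toy['target'].str.split('|')
split = toy.explode('target').reset_index(drop=True)

# ('A', 'x') now appears twice, so drop_duplicates() is needed again
dupes_after = int(split.duplicated().sum())
deduped = split.drop_duplicates()
print(dupes_before, dupes_after, len(deduped))
```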

In [60]:
df3 = df2.drop_duplicates()
df3.head() # some rows are malformed because of parsing errors in the original csv

Unnamed: 0,target,indication
0,10-tetrahydroazepino[2,Preclinical
1,2,4-tetrahydro
2,3-beta-D-glucan synthase inhibitor,infectious disease
4,4,6-hexabromocyclohexane
5,4,TRPC1


### Mapping disease names to MeSH id

While the data is still in this two-column format, I do a left merge with the mesh_id/indication data from the termite_tag_indication notebook. I then convert the result to an incidence matrix.
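
To see what the left merge does with indications that have no MeSH id, here is a minimal sketch on placeholder tables. The gene names, indication names and ids (`G1`, `disease a`, `M001`, ...) are invented for illustration; `indicator=True` is an optional `pd.merge` argument that is not used in the cells below.

```python
import pandas as pd

# Placeholder link and mapping tables; all names and ids are illustrative only
links = pd.DataFrame({'target': ['G1', 'G2', 'G3'],
                      'indication': ['disease a', 'disease b', 'not a disease']})
mapping = pd.DataFrame({'indication': ['disease a', 'disease b'],
                        'mesh_id': ['M001', 'M002']})

# how='left' keeps every link row; indicator=True adds a '_merge' column
merged = pd.merge(links, mapping, how='left', on='indication', indicator=True)

# Rows flagged 'left_only' found no match and carry NaN in mesh_id
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched[['target', 'indication']])
```

Inspecting the unmatched rows before dropping them is a cheap way to check whether real indications are being lost or only malformed entries.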

In [61]:
# import indication_to_mesh csv file
indication_to_mesh = pd.read_csv('indication_to_mesh.csv')
indication_to_mesh.drop(columns=['Unnamed: 0'], inplace=True)

In [62]:
indication_to_mesh.head()

Unnamed: 0,indication,mesh_id
0,diabetes mellitus,D003920
1,Addison's disease,D000224
2,African trypanosomiasis,D014353
3,Alzheimer's disease,D000544
4,Buerger's disease,D013919


In [63]:
# left merge
df4 = pd.merge(df3,indication_to_mesh,how='left',on='indication')
df4.head()

In [65]:
# keeping only rows whose indication mapped to a MeSH id, because those are the relationships we are interested in
df5 = df4.dropna().reset_index(drop=True)
df5.head()

### Incidence Matrix

I will now create a matrix in which element (i, j) is equal to 1 if gene i and disease j share a row in the previous dataframe.
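
As a cross-check on the idea, the same kind of matrix can be sketched with `pd.crosstab` on a toy table. The values are placeholders, and `crosstab` is an alternative to the approach in the cells below, not the method they use.

```python
import pandas as pd

# Placeholder target/mesh_id pairs with no duplicate rows
pairs = pd.DataFrame({'target': ['G1', 'G1', 'G2'],
                      'mesh_id': ['M001', 'M002', 'M002']})

# crosstab counts occurrences of each pair; with duplicates removed,
# every count is either 0 (no link) or 1 (link)
matrix = pd.crosstab(pairs['target'], pairs['mesh_id'])
print(matrix)
```

Because `crosstab` counts rows, it only yields a 0/1 matrix when duplicate pairs have been removed first.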

In [82]:
# there are some indications which have the same mesh_id as they are different names for the same thing
# here I eliminate these duplicates

df6 = df5[['target', 'mesh_id']].drop_duplicates()
df6.describe()

Unnamed: 0,target,mesh_id
count,5835,5835
unique,1137,426
top,CHRM1,D006973
freq,58,205


In [83]:
# define column which will map the relationship
df6['relationship'] = 1


In [84]:
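# create matrix; pivot leaves NaN where a target/mesh_id pair has no row, so fill those with 0
df7 = df6.pivot(index='target', columns='mesh_id', values='relationship').fillna(0).astype(int)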

Saving as a CSV for comparison with the sentence co-occurrence

In [91]:
df7.to_csv('target_indication_matrix.csv')
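
When the matrix is read back for the comparison, the target labels need to be restored as the index. A minimal sketch, using an in-memory buffer and a placeholder 2x2 matrix in place of the real file:

```python
import io
import pandas as pd

# Placeholder matrix standing in for the real target/indication matrix
matrix = pd.DataFrame([[1, 0], [0, 1]],
                      index=pd.Index(['G1', 'G2'], name='target'),
                      columns=['M001', 'M002'])

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
matrix.to_csv(buffer)
buffer.seek(0)

# index_col=0 turns the first CSV column back into the row index
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```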