# Simple Matrix Factorization Collaborative Filtering for Drug Repositioning on Cell Lines

The discovery of new biological interactions, such as interactions between drugs and cell lines, can improve the way drugs are developed. Recently, there has been considerable interest in predicting interactions between drugs and targets using recommender systems and, more specifically, in using recommender systems to predict drug activity on cell lines. In this work, we present a simple and straightforward approach to discovering interactions between drugs and cell lines using collaborative filtering. We represent cell lines by their drug affinity profiles and, correspondingly, drugs by their cell line affinity profiles in a single interaction matrix. Using simple matrix factorization, we predict previously unknown values by minimizing the regularized squared error. We build a comprehensive dataset with information from the ChEMBL database; it comprises 300,000+ molecules, 1,200+ cell lines, and 3,000,000+ reported activities. We successfully predict drug activity and evaluate the performance of our model via utility, achieving an Area Under the ROC Curve (AUROC) of nearly 0.9.

First, we check that compounds and cell lines keep the same IDs across dataset versions (24 to 27).

We load the data.

In [1]:
import pandas as pd
cell_df = pd.read_csv(filepath_or_buffer='data/cell_summary.csv', index_col=0)
comp_df = pd.read_csv(filepath_or_buffer='data/comp_summary.csv', index_col=0)

We check whether the IDs are the same across versions, first for cell lines and then for compounds.

In [2]:
print(cell_df[cell_df['id_24'] != cell_df['id_25']])
print(cell_df[cell_df['id_24'] != cell_df['id_26']])
print(cell_df[cell_df['id_24'] != cell_df['id_27']])
# IDs for Cellular lines are all the same

print(comp_df[comp_df['id_24'] != comp_df['id_27']])
# IDs for Compounds differ
# So, we create a new column scomp_id
comp_df['scomp_id'] = comp_df.index

Empty DataFrame
Columns: [cell_name, id_24, id_25, id_26, id_27]
Index: []
Empty DataFrame
Columns: [cell_name, id_24, id_25, id_26, id_27]
Index: []
Empty DataFrame
Columns: [cell_name, id_24, id_25, id_26, id_27]
Index: []
                                    smiles         pref_name    id_24  \
35                     COC(=O)C1=CCCN(C)C1         ARECOLINE     1022   
37         COc1cc(Cc2cnc(N)nc2N)cc(OC)c1OC      TRIMETHOPRIM     1216   
39     Nc1nc(N)c2cc(Sc3ccc4ccccc4c3)ccc2n1               NaN     1312   
62                      CCCC(CCC)C(=O)[O-]  VALPROATE SODIUM     1991   
74                  NCCc1c[nH]c2ccc(O)cc12         SEROTONIN     2214   
...                                    ...               ...      ...   
37798                  CCCCCCn1cc[n+](C)c1               NaN  1672762   
38361           NCCCNCCCNC1CCCCCCCCCCCCCC1               NaN  1745509   
46873   CNCc1cc(OCc2ccc3ccc(N)nc3c2)ccc1Cl               NaN  2051958   
47483               CSc1c(C)c(SC)n2ccncc12   
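The pairwise comparisons above can be generalized to all version columns at once. The following is a minimal sketch on made-up data: the helper name `inconsistent_ids` and the toy frames are hypothetical, and only the column names `id_24` to `id_27` come from the notebook.

```python
import pandas as pd

def inconsistent_ids(df, id_cols):
    """Return the rows whose identifier is not the same in every version column."""
    # nunique(axis=1) counts distinct values per row; dropna=False makes a
    # missing ID next to a present one count as a mismatch.
    return df[df[id_cols].nunique(axis=1, dropna=False) > 1]

id_cols = ['id_24', 'id_25', 'id_26', 'id_27']

# Toy cell lines: IDs are stable across versions
toy_cells = pd.DataFrame({'cell_name': ['cell_a', 'cell_b'],
                          'id_24': [1, 2], 'id_25': [1, 2],
                          'id_26': [1, 2], 'id_27': [1, 2]})

# Toy compounds: the second compound changes ID from version 26 onwards
toy_comps = pd.DataFrame({'pref_name': ['comp_x', 'comp_y'],
                          'id_24': [10, 20], 'id_25': [10, 20],
                          'id_26': [10, 21], 'id_27': [10, 21]})

bad_cells = inconsistent_ids(toy_cells, id_cols)
bad_comps = inconsistent_ids(toy_comps, id_cols)
print(len(bad_cells), 'inconsistent cell lines;', len(bad_comps), 'inconsistent compounds')
```

An empty result plays the same role as the `Empty DataFrame` outputs above, while any returned row signals that a version-independent key such as `scomp_id` is needed.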

We have to update the activity dataframes to standardize the compound IDs.

In [3]:
activity_24 = pd.read_csv(filepath_or_buffer='data/summary_24.csv', index_col=0)
activity_25 = pd.read_csv(filepath_or_buffer='data/summary_25.csv', index_col=0)
activity_26 = pd.read_csv(filepath_or_buffer='data/summary_26.csv', index_col=0)
activity_27 = pd.read_csv(filepath_or_buffer='data/summary_27.csv', index_col=0)

activity_24 = pd.merge(left=activity_24, right=comp_df, left_on='comp_id', right_on='id_24')[['cell_id', 'scomp_id',
                                                                                              'activity', 'comp_id']]
activity_25 = pd.merge(left=activity_25, right=comp_df, left_on='comp_id', right_on='id_25')[['cell_id', 'scomp_id',
                                                                                              'activity', 'comp_id']]
activity_26 = pd.merge(left=activity_26, right=comp_df, left_on='comp_id', right_on='id_26')[['cell_id', 'scomp_id',
                                                                                              'activity', 'comp_id']]
activity_27 = pd.merge(left=activity_27, right=comp_df, left_on='comp_id', right_on='id_27')[['cell_id', 'scomp_id',
                                                                                              'activity', 'comp_id']]

activity_24.to_csv(path_or_buf='data/activity_24.csv', header=True, index=True)
activity_25.to_csv(path_or_buf='data/activity_25.csv', header=True, index=True)
activity_26.to_csv(path_or_buf='data/activity_26.csv', header=True, index=True)
activity_27.to_csv(path_or_buf='data/activity_27.csv', header=True, index=True)
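The four merges follow the same pattern, so they can also be written as a loop over versions. The sketch below illustrates the idea on small in-memory frames rather than the CSV files; the frames `comp_toy` and `raw` and all their values are hypothetical.

```python
import pandas as pd

# Hypothetical compound table: the index plays the role of the stable scomp_id
comp_toy = pd.DataFrame({'id_24': [100, 200], 'id_27': [100, 250]})
comp_toy['scomp_id'] = comp_toy.index

# Hypothetical per-version activity tables, keyed by the version-specific comp_id
raw = {
    '24': pd.DataFrame({'cell_id': [1, 2], 'comp_id': [100, 200], 'activity': [1, 0]}),
    '27': pd.DataFrame({'cell_id': [1, 2], 'comp_id': [100, 250], 'activity': [1, 0]}),
}

standardized = {}
for version, frame in raw.items():
    # Match each activity row to its compound through that version's ID column
    merged = pd.merge(left=frame, right=comp_toy,
                      left_on='comp_id', right_on=f'id_{version}')
    standardized[version] = merged[['cell_id', 'scomp_id', 'activity', 'comp_id']]

print(standardized['27'])
```

In this toy case the second compound has a different `comp_id` in each version (200 and 250) but maps to the same `scomp_id`, which is what makes records comparable across versions.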


We obtain the differences between datasets

In [4]:
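# The difference dataset diff should be calculated for the train-test pair (here v24 and v27)
activity_24 = activity_24.drop_duplicates(keep="first")
activity_27 = activity_27.drop_duplicates(keep="first")

# diff holds the v27 records that are absent from v24. An outer merge without an
# indicator would return the union of both versions instead of their difference.
# comp_id is left out of the key because it is version-specific.
cols = ['cell_id', 'scomp_id', 'activity']
diff = activity_27[cols].merge(activity_24[cols].drop_duplicates(), how='left', on=cols, indicator=True)
diff = diff[diff['_merge'] == 'left_only'].drop(columns='_merge').reset_index(drop=True)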

Now we can start with the Recommender System.
We are using `scikit-surprise`, and following documentation from:
https://surprise.readthedocs.io/en/stable/getting_started.html
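To make the factorization step less of a black box, here is a minimal NumPy sketch of biased matrix factorization trained by stochastic gradient descent, which is the kind of model that Surprise's `SVD` fits. It is an illustration under assumptions, not the library's implementation: the function name `fit_biased_mf`, the toy triples, and the hyperparameter values are all hypothetical. Each activity is predicted as a global mean plus a cell-line bias, a compound bias, and the dot product of two latent vectors, and every update nudges these parameters to reduce the regularized squared error.

```python
import numpy as np

def fit_biased_mf(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit r_hat(u, i) = mu + b_u + b_i + p_u . q_i by SGD on (u, i, r) triples."""
    rng = np.random.default_rng(seed)
    n_users = 1 + max(u for u, _, _ in ratings)
    n_items = 1 + max(i for _, i, _ in ratings)
    mu = float(np.mean([r for _, _, r in ratings]))   # global mean activity
    b_u = np.zeros(n_users)                            # cell-line biases
    b_i = np.zeros(n_items)                            # compound biases
    P = rng.normal(0.0, 0.1, size=(n_users, n_factors))  # cell-line factors
    Q = rng.normal(0.0, 0.1, size=(n_items, n_factors))  # compound factors

    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - (mu + b_u[u] + b_i[i] + P[u] @ Q[i])
            # Gradient step on the squared error with an L2 penalty on each parameter
            b_u[u] += lr * (err - reg * b_u[u])
            b_i[i] += lr * (err - reg * b_i[i])
            p_old = P[u].copy()
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_old - reg * Q[i])

    def predict(u, i):
        # Clip the estimate to the (-1, 1) activity scale
        return float(np.clip(mu + b_u[u] + b_i[i] + P[u] @ Q[i], -1.0, 1.0))

    return predict

# Toy (cell line, compound, activity) triples; activity is -1 or 1
toy = [(0, 0, 1), (0, 1, -1), (1, 0, 1), (1, 1, -1), (2, 0, 1), (2, 2, 1), (0, 2, 1)]
predict = fit_biased_mf(toy)

train_rmse = float(np.sqrt(np.mean([(r - predict(u, i)) ** 2 for u, i, r in toy])))
print('training RMSE:', round(train_rmse, 3))
# Cell line 2 was never tested with compound 1, which is inactive everywhere else
print('estimate for unseen pair (2, 1):', round(predict(2, 1), 3))
```

The unseen pair is the interesting part: the model fills in a value for a cell line and compound that were never measured together, which is the repositioning use case described above.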

In [5]:
import pandas as pd

cell_df = pd.read_csv(filepath_or_buffer='data/cell_summary.csv', index_col=0)
comp_df = pd.read_csv(filepath_or_buffer='data/comp_summary_nan.csv', index_col=0)

comp_df['scomp_id'] = comp_df.index

activity_24 = pd.read_csv(filepath_or_buffer='data/summary_24.csv', index_col=0)
activity_25 = pd.read_csv(filepath_or_buffer='data/summary_25.csv', index_col=0)
activity_26 = pd.read_csv(filepath_or_buffer='data/summary_26.csv', index_col=0)
activity_27 = pd.read_csv(filepath_or_buffer='data/summary_27.csv', index_col=0)

activity_24 = pd.merge(left=activity_24, right=comp_df, left_on='comp_id', right_on='id_24', how='left')[['cell_id',
                                                                                            'scomp_id', 'activity']]
activity_25 = pd.merge(left=activity_25, right=comp_df, left_on='comp_id', right_on='id_25')[['cell_id', 'scomp_id',
                                                                                              'activity']]
activity_26 = pd.merge(left=activity_26, right=comp_df, left_on='comp_id', right_on='id_26')[['cell_id', 'scomp_id',
                                                                                              'activity']]
activity_27 = pd.merge(left=activity_27, right=comp_df, left_on='comp_id', right_on='id_27')[['cell_id', 'scomp_id',
                                                                                              'activity']]

activity_24.loc[activity_24['activity'] == 0, 'activity'] = -1
activity_25.loc[activity_25['activity'] == 0, 'activity'] = -1
activity_26.loc[activity_26['activity'] == 0, 'activity'] = -1
activity_27.loc[activity_27['activity'] == 0, 'activity'] = -1

Next, we run a grid search to find the best hyperparameters for the SVD algorithm, and then compare the training performance with the performance on the test dataset (v27).

In [None]:
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import GridSearchCV
from surprise import SVD
from surprise import accuracy

reader = Reader(rating_scale=(-1, 1))
data = Dataset.load_from_df(activity_24[['cell_id', 'scomp_id', 'activity']], reader)

# Select your best algo with grid search.
print('Grid Search...')

param_grid = {'n_factors': [10, 50, 100, 200, 300, 400, 500, 1000, 2000, 5000], #Best is: 10
              'n_epochs': [10, 50, 100, 200, 300, 400, 500], #Best is: 300
              'lr_all': [0.001, 0.002, 0.005, 0.01, 0.02, 0.05], #Best is .002
              'reg_all': [0.01, 0.05, 0.1, 0.2, 0.4, 0.6, 0.8, 1.0], #Best is .1
              'biased' : [True, False] #Best is True
              }

grid_search = GridSearchCV(SVD, param_grid=param_grid, measures=['rmse', 'mae'], cv=10, n_jobs=-1, refit=True)
grid_search.fit(data)

algo = grid_search.best_estimator['rmse']
print(algo)

Grid Search...
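A quick count puts the cost of this search in perspective. The sketch below only does the arithmetic for the grid defined above and does not call Surprise; the variable names `n_combinations` and `n_fits` are introduced here for illustration.

```python
import math

# Same grid as in the cell above
param_grid = {'n_factors': [10, 50, 100, 200, 300, 400, 500, 1000, 2000, 5000],
              'n_epochs': [10, 50, 100, 200, 300, 400, 500],
              'lr_all': [0.001, 0.002, 0.005, 0.01, 0.02, 0.05],
              'reg_all': [0.01, 0.05, 0.1, 0.2, 0.4, 0.6, 0.8, 1.0],
              'biased': [True, False]}
cv = 10

# An exhaustive grid search tries every combination of values
n_combinations = math.prod(len(values) for values in param_grid.values())
# Each combination is trained once per cross-validation fold
n_fits = n_combinations * cv

print(n_combinations, 'combinations,', n_fits, 'model fits')
```

With 6,720 combinations and 10 folds, the search trains 67,200 models before the final refit, which is why it can run for a long time even with `n_jobs=-1`. A coarser grid, refined around the best values, is one way to shorten it.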


Now, we perform predictions

In [None]:
trainset = data.build_full_trainset()
algo.fit(trainset)
predictions = algo.test(trainset.build_testset())
print('Biased accuracy on v24,', end='   ')

accuracy.rmse(predictions)

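# Training labels were remapped from 0 to -1, so apply the same mapping to diff
# (a no-op if diff already uses -1) before scoring on the (-1, 1) scale.
diff.loc[diff['activity'] == 0, 'activity'] = -1
data_diff = Dataset.load_from_df(diff[['cell_id', 'scomp_id', 'activity']], reader)
testset = data_diff.construct_testset(data_diff.raw_ratings)  # testset is diff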
predictions = algo.test(testset)
print('Unbiased accuracy on diff,', end=' ')
accuracy.rmse(predictions)

# v24v27 Unbiased accuracy on diff, RMSE: 0.7838
# v24v25 Unbiased accuracy on diff, RMSE: 0.7751
# v24v25 Unbiased accuracy on diff, RMSE: 0.7398
# v26v27 Unbiased accuracy on diff, RMSE: 0.6843
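For reference, the RMSE reported above is the square root of the mean squared difference between the true activity and the estimate. The following is a minimal pure-Python sketch of that formula on made-up pairs; the helper `rmse` and the toy values are hypothetical and independent of Surprise.

```python
import math

def rmse(pairs):
    """Root-mean-square error over (true_value, estimate) pairs."""
    if not pairs:
        raise ValueError('no predictions to score')
    return math.sqrt(sum((true - est) ** 2 for true, est in pairs) / len(pairs))

# Toy (true activity, estimate) pairs on the (-1, 1) scale
toy_pairs = [(1, 0.8), (-1, -0.5), (1, -0.2), (-1, -1.0)]
score = rmse(toy_pairs)

# Baseline: always predicting 0 on labels of -1 and 1
baseline = rmse([(true, 0.0) for true, _ in toy_pairs])

print('toy RMSE:', round(score, 4), '| constant-zero baseline:', baseline)
```

When the labels are -1 and 1, a predictor that always outputs 0 scores exactly 1.0, which gives a rough yardstick for reading the values listed above.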

Finally, we compute the AUC and plot the ROC curve.

In [None]:
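# Collect the true activities and the model estimates from the predictions
r = [p.r_ui for p in predictions]
est = [p.est for p in predictions]

import matplotlib.pyplot as plt
plt.scatter(r, est)
# Save before showing: after plt.show() the saved figure can come out empty
plt.savefig('img/surprise_gridcv_v24v27.png')
plt.show()
plt.close()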

#ROC Curve
from sklearn import metrics
y = pd.DataFrame({'r':r, 'est':est})

y.loc[y['r'] == -1, 'r'] = 0

fpr, tpr, thresholds = metrics.roc_curve(y_true=y['r'], y_score=y['est'])
auc = metrics.roc_auc_score(y_true=y['r'], y_score=y['est'])
print('AUC: %.9f' % auc)
# v24v27 AUC: 0.869446782
# v24v26 AUC: 0.874650992
# v24v25 AUC: 0.896068396


plt.plot(fpr, tpr, color='orange', label='ROC')
# Diagonal reference line: the expected curve of a random classifier
plt.plot([0, 1], [0, 1], color='darkblue', linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend()
# Save before showing: after plt.show() the saved figure can come out empty
plt.savefig('img/surprise_roc_v24v27.png')
plt.show()
plt.close()
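One way to read the AUC values above is as the probability that a randomly chosen active pair receives a higher estimate than a randomly chosen inactive pair. The sketch below computes the AUC directly from that definition on made-up data; the helper `auc_by_pairs` and the toy labels and scores are hypothetical, and the quadratic loop is only meant for small examples.

```python
def auc_by_pairs(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError('both classes are needed to compute the AUC')
    # A correctly ordered pair counts 1, a tie counts 0.5
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels (1 = active, 0 = inactive) and model estimates
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.1, 0.4, 0.6, 0.8]

toy_auc = auc_by_pairs(labels, scores)
print('toy AUC: %.4f' % toy_auc)
```

In this toy case five of the six positive-negative pairs are ordered correctly, so the AUC is 5/6. A value of 0.5 corresponds to random ranking, the dashed diagonal in the plot.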