In [2]:
import pandas as pd
import numpy as np
import pickle
import csv
import matplotlib.pyplot as plt
import sklearn.linear_model as sk
from sklearn.preprocessing import PolynomialFeatures
from sklearn.multiclass import OneVsRestClassifier
from sklearn.feature_selection import SelectFromModel

The input data is a dataframe produced by running "targets_by_drugs.R".

Each row is a drug. The first five columns indicate whether the drug is predicted to cause Nausea, Fever, Dyspnoea, Rash or Vomiting. The remaining columns are protein targets. I focused on these five side effects because they were the most commonly reported adverse events, which gave me the most training data. Both the features (protein targets) and the targets (side effects) are binary: a drug either binds to a particular protein or it doesn't, and a drug either causes a side effect or it doesn't. The feature matrix is quite sparse; see "targets_by_drugs.R" for more details. Because a drug can cause several of these side effects at once, this is a multi-label classification problem.

This data was compiled from Drugbank.ca and from Health Canada's Adverse Events Report database. 

In [3]:
Multi = pd.read_csv("Multi.txt", index_col=0, sep="\t")

First, we split the data into training and test sets. I held out 20% of the rows as the test set, which is a common default.

In [4]:
from sklearn.model_selection import train_test_split
Multi_train, Multi_test = train_test_split(Multi, test_size = 0.2, random_state = 41)
print("We have %i training examples and %i test examples." % (len(Multi_train), len(Multi_test)))

We have 969 training examples and 243 test examples.
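
The split sizes printed above can be checked by hand. The sketch below assumes that scikit-learn rounds the test-set size up and the training-set size down, which matches the counts shown here. The total of 1212 rows is inferred from 969 + 243 rather than stated anywhere in the notebook.

```python
import math

n_total = 969 + 243   # total rows, inferred from the printed output
test_size = 0.2

n_test = math.ceil(test_size * n_total)          # 0.2 * 1212 = 242.4, rounded up
n_train = math.floor((1 - test_size) * n_total)  # 0.8 * 1212 = 969.6, rounded down

print(n_total, n_train, n_test)
```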


Split each set into features (X, the protein targets) and targets (y, the side effects).

In [5]:
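#Split into features and targets
#Columns 0-4 are the five side effects.
#Note that 6: starts at the seventh column, so the column at position 5
#is in neither X nor y.
y = Multi_train.iloc[:,0:5]
X = Multi_train.iloc[:,6:]
#Split test dataset too
y_test = Multi_test.iloc[:,0:5]
X_test = Multi_test.iloc[:,6:]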
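
To see exactly which columns each slice picks up, here is a small sketch using a plain list of column names instead of the real dataframe. The protein names (`P1` to `P4`) are placeholders made up for illustration; only the five side-effect names come from the notebook.

```python
# Hypothetical column layout: five side effects followed by protein targets.
columns = ["Nausea", "Fever", "Dyspnoea", "Rash", "Vomiting",
           "P1", "P2", "P3", "P4"]  # P1..P4 are placeholder names

target_cols = columns[0:5]   # positions 0-4
feature_cols = columns[6:]   # positions 6 onward
skipped = columns[5:6]       # the single column at position 5

print(target_cols)
print(feature_cols)
print(skipped)
```

If the column at position 5 of `Multi` turns out to be a protein target rather than some other kind of column, slicing with `5:` instead of `6:` would include it in the features.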

Now, we split the training data into multiple folds. I chose 5 folds, which is somewhat arbitrary. Because the data is so sparse, I didn't want to use more than 5 folds, since each additional fold makes the validation sets smaller.

In [6]:
from sklearn.model_selection import KFold
kf = KFold(n_splits=5, random_state = 17, shuffle = True)
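
As a rough check of what this produces, the sketch below computes the fold sizes for the 969 training rows. It assumes that `KFold` gives the first `n_samples % n_splits` folds one extra row each, which is how the scikit-learn documentation describes it.

```python
n_samples = 969  # training rows, from the split output earlier
n_splits = 5

base, extra = divmod(n_samples, n_splits)
fold_sizes = [base + 1 if i < extra else base for i in range(n_splits)]

print(fold_sizes)                           # validation rows per fold
print([n_samples - s for s in fold_sizes])  # training rows per fold
```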

Now we are going to fit a model to each of our train/cv splits using a for loop. 

First, we will add interaction terms, since I found that this improves the accuracy of the model. Adding second-order interaction terms results in tens of thousands of features, so we then reduce the feature count: we fit a model and remove the features whose coefficients have an absolute value below 4. We then re-fit the multi-label logistic regression classifier on the reduced set of features.
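
To get a feel for how quickly the feature count grows: with `degree = 2` and `interaction_only=True`, the expanded matrix has one bias column, the n original columns, and one column per unordered pair of original columns. The sketch below counts these. The value n = 200 is an arbitrary example, not the actual number of protein targets in this dataset.

```python
from itertools import combinations
from math import comb

def n_interaction_features(n):
    """Columns produced for n inputs: bias + originals + pairwise products."""
    return 1 + n + comb(n, 2)

# Cross-check the formula by enumerating the pairs for a small case.
names = ["A", "B", "C", "D"]
pairs = list(combinations(names, 2))
assert n_interaction_features(len(names)) == 1 + len(names) + len(pairs)

print(n_interaction_features(4))    # 1 + 4 + 6
print(n_interaction_features(200))  # 1 + 200 + 19900
```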

Finally, we will export our model and the information we need to transform our feature matrix to feed into this model.
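
The feature-reduction step described above can be sketched without scikit-learn. Because there is one logistic regression per side effect, each feature has five coefficients. The sketch assumes that `SelectFromModel` scores each feature by summing the absolute values of its coefficients across the five classifiers (its default `norm_order=1`) and keeps the features whose score is at least the threshold. The coefficient values are invented for illustration.

```python
# Rows: one classifier per side effect; columns: features (values are made up).
coef = [
    [ 2.0, -0.1,  0.5],
    [-1.5,  0.2,  0.5],
    [ 0.5,  0.0, -1.0],
    [ 0.0, -0.3,  1.0],
    [ 1.0,  0.1, -0.5],
]
threshold = 4

# Score each feature by the sum of absolute coefficients down its column.
scores = [sum(abs(row[j]) for row in coef) for j in range(len(coef[0]))]
keep = [s >= threshold for s in scores]

print(scores)
print(keep)
```

Under this assumption the threshold applies to the combined score, so a feature can be kept even when none of its individual coefficients reaches 4, as with the first feature here.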

In [7]:
#Initialize index
m_number = 0

#Loop through each train/cv split and fit a new model.
for train_index, test_index in kf.split(X):
    X_train, X_cv = X.iloc[train_index,:], X.iloc[test_index,:]
    y_train, y_cv = y.iloc[train_index,:], y.iloc[test_index,:]
    
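    #Add interaction terms. Fit the transformer on the training fold only,
    #then apply the same fitted transformer to the cv and test sets.
    poly = PolynomialFeatures(degree = 2, interaction_only=True)
    X_new = pd.DataFrame(poly.fit_transform(X_train))
    X_cv_new = pd.DataFrame(poly.transform(X_cv))
    X_test_new = pd.DataFrame(poly.transform(X_test))
    #On scikit-learn >= 1.0, use poly.get_feature_names_out(X.columns) instead;
    #get_feature_names was removed in 1.2.
    feature_names = poly.get_feature_names(X.columns)
    X_new.columns = feature_names
    X_cv_new.columns = feature_names
    X_test_new.columns = feature_names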
    
    #Fit model
    model = OneVsRestClassifier(sk.LogisticRegression(class_weight='balanced',  C=100))
    model = model.fit(X_new, y_train)
    
    #Feature reduction
    #Threshold is a bit arbitrary but I found 4 to be reasonable for not adversely impacting
    #predictive power while still avoiding too much over-fitting.
    sfm = SelectFromModel(model, threshold=4)
    sfm.fit(X_new, y_train)
    X_transform = pd.DataFrame(sfm.transform(X_new))
    X_transform.columns = X_new.columns[sfm.get_support()]
    X_cv_transform = pd.DataFrame(sfm.transform(X_cv_new))
    X_cv_transform.columns = X_cv_new.columns[sfm.get_support()]
    X_test_transform = pd.DataFrame(sfm.transform(X_test_new))
    X_test_transform.columns = X_test_new.columns[sfm.get_support()]
    
    #Fit model to reduced features
    model = OneVsRestClassifier(sk.LogisticRegression(class_weight='balanced',  C=100))
    model = model.fit(X_transform, y_train)

    
    #Pickle the models & feature reduction parameters to store
    with open('LogisticRegression_%s.pickle' % m_number, 'wb') as f:
        pickle.dump(model, f)
    with open('poly_%s.pickle' % m_number, 'wb') as f:
        pickle.dump(poly, f)
    with open('sfm_%s.pickle' % m_number, 'wb') as f:
        pickle.dump(sfm, f)
    m_number = m_number + 1
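
To use one of the saved folds later, the three pickled objects need to be loaded and applied in the same order as during training: interaction terms, then feature selection, then the classifier. The helper below is only a sketch. `predict_with_fold` and its `directory` argument are hypothetical names, and the file-name pattern is assumed to match the one used in the loop above. So that the sketch runs without the fitted models, it is exercised with stand-in objects built from Python built-ins that merely expose `transform` and `predict`; they are not scikit-learn objects.

```python
import os
import pickle
import tempfile
from types import SimpleNamespace

def predict_with_fold(X, m_number, directory="."):
    """Load the objects saved for fold m_number and apply them in training order."""
    def load(prefix):
        path = os.path.join(directory, "%s_%s.pickle" % (prefix, m_number))
        with open(path, "rb") as f:
            return pickle.load(f)
    poly = load("poly")
    sfm = load("sfm")
    model = load("LogisticRegression")
    return model.predict(sfm.transform(poly.transform(X)))

# Stand-ins for demonstration only: each wraps a built-in function.
stand_ins = {
    "poly": SimpleNamespace(transform=sorted),
    "sfm": SimpleNamespace(transform=tuple),
    "LogisticRegression": SimpleNamespace(predict=sum),
}

with tempfile.TemporaryDirectory() as tmp:
    for prefix, obj in stand_ins.items():
        with open(os.path.join(tmp, "%s_0.pickle" % prefix), "wb") as f:
            pickle.dump(obj, f)
    result = predict_with_fold([3, 1, 2], 0, directory=tmp)

print(result)
```

With the real files, the call would presumably look like `predict_with_fold(X_test, 0)`, where `X_test` has the same columns, in the same order, as the data used for training.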
