These imports are required for the training and evaluation of the model

In [9]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import average_precision_score
from sklearn.metrics import precision_recall_curve
from sklearn import linear_model
from sklearn.preprocessing import Normalizer
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import Imputer
from sklearn.metrics import f1_score

The next cell load and clean data for a model used to predict whether or not a subject will survive until the end of the study. This was a prelimenary step used by the developers of this application to identify features to use in our final model. 

In [6]:
csv = pd.read_csv('../data/PER_PATIENT.csv')
data = csv[['D_PT_gender', 'D_PT_age', 'CREATININE', 'D_PT_therclassn', 'sctflag', 'D_PT_iss', 'D_PT_PRIMARYREASON']]
data['D_PT_PRIMARYREASON'] = data['D_PT_PRIMARYREASON'].fillna(value=0)
data = data.dropna()

data = data.replace({'Death': 1})

data = data[(data['D_PT_PRIMARYREASON'] == 1) | (data['D_PT_PRIMARYREASON'] == 0)]
X = data[['D_PT_gender', 'D_PT_age', 'CREATININE', 'D_PT_therclassn', 'sctflag', 'D_PT_iss']]
y = data['D_PT_PRIMARYREASON'].astype(int)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.


Here we present a table of several patient visits used for the training of this model. Refer to the documentation for explanation on feature interpretation. All categorical features were convereted to numeric values for the purpose of model training. 

In [11]:
data[:10]

Unnamed: 0,D_PT_gender,D_PT_age,CREATININE,D_PT_therclassn,sctflag,D_PT_iss,D_PT_PRIMARYREASON
1,1,75.0,79.56,3.0,0.0,1.0,1
2,2,79.0,123.76,1.0,0.0,2.0,0
3,1,69.0,97.24,3.0,0.0,3.0,0
4,1,64.0,79.56,3.0,0.0,1.0,0
5,1,78.0,176.8,1.0,0.0,3.0,1
6,1,74.0,97.24,1.0,0.0,2.0,1
7,1,47.0,159.12,1.0,1.0,2.0,1
8,2,59.0,88.4,1.0,1.0,2.0,0
9,2,66.0,53.04,3.0,0.0,1.0,0
10,1,81.0,70.72,3.0,0.0,1.0,1


Finally, we train the model using a support vector machine with a linear kernel and balanced class weights and achieve an accuracy score of 0.71 on the test set.

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

clf = SVC(C=0.01, kernel='linear', class_weight='balanced')
clf.fit(X_train, y_train) 

y_pred = clf.predict(X_test)

accuracy_score(y_test, y_pred)

0.7102473498233216

Now we present the more complex pipeline for disease progression prediction. First, we load and clean data.

In [None]:
test = pd.read_csv("../data/PER_PATIENT_VISIT.csv")
work = test[ [
'PUBLIC_ID',
'D_LAB_chem_totprot',
'D_LAB_chem_creatinine',
'D_LAB_chem_calcium',
'D_LAB_chem_bun',
'D_LAB_chem_albumin',
'D_LAB_cbc_abs_neut',
'D_LAB_serum_iga',
'D_LAB_chem_glucose',
'D_LAB_cbc_hemoglobin',
'D_LAB_cbc_platelet',
'D_LAB_cbc_wbc',
'D_TRI_CF_WASCYTOGENICS',

'AT_TREATMENTRESP'] ]

work = work[pd.notnull(work['AT_TREATMENTRESP'])]

work = work.replace({'Stable Disease': 0, 'Partial Response': 0, 'Very Good Partial Response (VGPR)': 0, 'Progressive Disease': 1, 'Complete Response': 0, 'Stringent Complete Response (sCR)': 0})
work = work.replace({'No': 0, 'Yes': 1})

#work = work.dropna()

work.PUBLIC_ID = work.PUBLIC_ID.apply(lambda x: int(x[-4:]))
X = work[[
'PUBLIC_ID',
'D_LAB_chem_totprot',
'D_LAB_chem_creatinine',
'D_LAB_chem_calcium',
'D_LAB_chem_bun',
'D_LAB_chem_albumin',
'D_LAB_cbc_abs_neut',
'D_LAB_serum_iga',
'D_LAB_chem_glucose',
'D_LAB_cbc_hemoglobin',
'D_LAB_cbc_platelet',
'D_LAB_cbc_wbc',
'D_TRI_CF_WASCYTOGENICS'
]]

Y = work['AT_TREATMENTRESP']

imputer = Imputer()
X = imputer.fit_transform(X)

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.20)#, random_state=42)

for k in ['linear']:
    for c in [0.01]:#[0.0001, 0.001, 0.01, 0.1, 1, 10]:
        clf = SVC(C=c, kernel=k, class_weight='balanced')
        clf.fit(X_train, y_train) 

        y_pred = clf.predict(X_test)
        f1 = f1_score(y_test, y_pred)
        cm = confusion_matrix(y_test, y_pred)
        
        print k, c, "Acc:", accuracy_score(y_test, y_pred)
        print 'F1 score: {0:0.2f}'.format(f1)
        
        print 'True stable:', cm[0][0]
        print 'False stable:', cm[1][0]
        print 'True Progression:', cm[1][1]
        print 'False Progression:', cm[1][0]
        print ''


linear 0.0001 Acc: 0.921411387329591
F1 score: 0.00
True stable: 1149
False stable: 98
True Progression: 0
False Progression: 98

linear 0.001 Acc: 0.921411387329591
F1 score: 0.00
True stable: 1149
False stable: 98
True Progression: 0
False Progression: 98



KeyboardInterrupt: 

In [None]:
cm[0][1]