We are trying to predict, given a set of conditions, whether a patient with pancreatic cancer is currently still alive or has already passed away.

A prediction model with good accuracy would help doctors in two ways: they could choose a treatment by checking which one is likely to give a higher survival rate for a given patient's data, and they could take better care of patients who are flagged as unlikely to survive. Using machine learning to predict medical outcomes could therefore improve the actual survival rate of patients. Our goal is to predict patient survival with our model and (hopefully) keep more patients alive.

The dataset contains information about each patient and their state of health, including (but not limited to) which therapies and treatments they have undergone, where their cancer started, and how far it has progressed (on the TNM scale). It also records which patients had died and which were still alive. We set the patient's survival as the class label and preprocess the data to drop irrelevant columns and put the rest into a form that the models can use efficiently.

We exclude several irrelevant or redundant features in order to improve the performance of the model. ‘Patient ID’ and ‘Year of diagnosis’ are irrelevant (the first because it is arbitrarily generated, and the second because all diagnoses are within 5 years of each other and so are unlikely to have a large impact), and ‘Derived AJCC M, 7th ed (2010-2015)’, ‘Radiation sequence with surgery’, and ‘Derived AJCC Stage Group, 7th ed (IB0IA0-IB0IA5)’ were dropped because other features already reflect them. ‘Survival months’ is also dropped in the code below.
'Sex', 'Primary Site - labeled', 'CS tumor size (<5<5-<5<5)', and 'Race recode (white, black, Other)' were one-hot encoded as they are categorical variables.
‘Insurance Recode (nonnonnon7+)’, ‘Derived AJCC N, 7th ed (2010-2015)’, ‘Derived AJCC T, 7th ed (2010-2015)’, and ‘RX Summ--Surg Prim Site (local or partial excision998+)’ were converted from categorical to numerical values, for example by transforming “T1” to “1”. For ‘Age recode with <1 year olds’, we take the lower bound of each age bracket and divide it by 85, the highest lower bound in the data, in order to normalize the age values.
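
As a minimal sketch of the age normalization described above, the helper below maps an age bracket to its lower bound divided by 85. The function name `normalize_age_bracket` is hypothetical and not part of the notebook, and the sketch assumes every bracket starts with a two-digit lower bound.

```python
def normalize_age_bracket(bracket: str, max_age: int = 85) -> float:
    """Map a bracket such as '75-79 years' to its lower bound divided by max_age."""
    # Assumption: the first two characters are the lower bound of the bracket.
    lower_bound = int(bracket[:2])
    return lower_bound / max_age


for bracket in ["50-54 years", "75-79 years", "85+ years"]:
    print(bracket, round(normalize_age_bracket(bracket), 6))
```

The values for these three brackets agree with the normalized age column printed after preprocessing.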

In [None]:
import pandas as pd
from google.colab import files
uploaded = files.upload()
print(uploaded)
df = pd.read_excel('data.xlsx')
my_data = df

Saving data.xlsx to data (6).xlsx

In [None]:
my_data.head()

Unnamed: 0,Patient ID,Age recode with <1 year olds,"Race recode (white, black, Other)",Sex,Year of diagnosis,Primary Site - labeled,"Derived AJCC Stage Group, 7th ed (IB0IA0-IB0IA5)","Derived AJCC T, 7th ed (2010-2015)","Derived AJCC N, 7th ed (2010-2015)","Derived AJCC M, 7th ed (2010-2015)",RX Summ--Surg Prim Site (local or partial excision998+),Radiation sequence with surgery,Radiation recode,"Chemotherapy recode (1, no/unk)",Survival months,Vital status recode (study cutoff used),CS tumor size (<5<5-<5<5),Insurance Recode (nonnonnon7+)
0,59942,75-79 years,other,female,2013,head,IA,T1,1,M0,local or partial excision,non-PORT,0,1,13,dead,<5,insurance
1,634167,70-74 years,white,male,2013,body,IIB,T3,2,M0,"surgery/pancreatectomy,NOS",non-PORT,0,1,45,alive,<5,insurance
2,649827,50-54 years,other,female,2010,head,IIB,T1,2,M0,local or partial excision,PORT,1,1,82,alive,<5,insurance
3,666597,75-79 years,white,female,2012,head,IIA,T3,1,M0,local or partial excision,non-PORT,0,1,52,alive,<5,insurance
4,726094,85+ years,white,male,2012,head,IIB,T2,2,M0,local or partial excision,non-PORT,0,0,17,dead,<5,insurance


In [None]:
# Data Prep - dropping any missing value
my_data = my_data.dropna()
# The data had NOS values (Not otherwise specified) in a treatment column, and we dropped them.
my_data = my_data[my_data["RX Summ--Surg Prim Site (local or partial excision998+)"] != "surgery/pancreatectomy,NOS"]

In [None]:
# Feature Engineering & Data Prep
my_data = my_data.drop(columns=["Patient ID"])
my_data = my_data.drop(columns=["Survival months"])
my_data = my_data.drop(columns=["Year of diagnosis"])
my_data = my_data.drop(columns=["Derived AJCC M, 7th ed (2010-2015)"])
my_data = my_data.drop(columns=["Radiation sequence with surgery"])
my_data = my_data.drop(columns=["Derived AJCC Stage Group, 7th ed (IB0IA0-IB0IA5)"])

my_data['Derived AJCC N, 7th ed (2010-2015)'] = my_data['Derived AJCC N, 7th ed (2010-2015)'].replace(1, 0)
my_data['Derived AJCC N, 7th ed (2010-2015)'] = my_data['Derived AJCC N, 7th ed (2010-2015)'].replace(2, 1)

my_data['Derived AJCC T, 7th ed (2010-2015)'] = my_data['Derived AJCC T, 7th ed (2010-2015)'].replace("T1", 1)
my_data['Derived AJCC T, 7th ed (2010-2015)'] = my_data['Derived AJCC T, 7th ed (2010-2015)'].replace("T2", 2)
my_data['Derived AJCC T, 7th ed (2010-2015)'] = my_data['Derived AJCC T, 7th ed (2010-2015)'].replace("T3", 3)

for age in my_data['Age recode with <1 year olds'].unique():
  my_data['Age recode with <1 year olds'] = my_data['Age recode with <1 year olds'].replace(age, int(age[0:2]) / 85)

my_data = pd.get_dummies(my_data, columns = ['Sex', 'Primary Site - labeled','CS tumor size (<5<5-<5<5)','Race recode (white, black, Other)'])

my_data['RX Summ--Surg Prim Site (local or partial excision998+)'] = my_data['RX Summ--Surg Prim Site (local or partial excision998+)'].replace("extended pancreatoduodenectomy", 2)
my_data['RX Summ--Surg Prim Site (local or partial excision998+)'] = my_data['RX Summ--Surg Prim Site (local or partial excision998+)'].replace("total pancreatectomy", 1)
my_data['RX Summ--Surg Prim Site (local or partial excision998+)'] = my_data['RX Summ--Surg Prim Site (local or partial excision998+)'].replace("local or partial excision", 0)

my_data['Insurance Recode (nonnonnon7+)'] = my_data['Insurance Recode (nonnonnon7+)'].replace("insurance", 1)
my_data['Insurance Recode (nonnonnon7+)'] = my_data['Insurance Recode (nonnonnon7+)'].replace("any medicaid", 1)
my_data['Insurance Recode (nonnonnon7+)'] = my_data['Insurance Recode (nonnonnon7+)'].replace("non", 0)

my_data = my_data.drop(columns=['Sex_female', 'CS tumor size (<5<5-<5<5)_<5'])



#get the labels
label = my_data['Vital status recode (study cutoff used)']
my_data = my_data.drop(columns=["Vital status recode (study cutoff used)"])


my_data.head()

Unnamed: 0,Age recode with <1 year olds,"Derived AJCC T, 7th ed (2010-2015)","Derived AJCC N, 7th ed (2010-2015)",RX Summ--Surg Prim Site (local or partial excision998+),Radiation recode,"Chemotherapy recode (1, no/unk)",Insurance Recode (nonnonnon7+),Sex_male,Primary Site - labeled_Other specified parts,Primary Site - labeled_body,Primary Site - labeled_duct,Primary Site - labeled_head,Primary Site - labeled_overlapping,Primary Site - labeled_tail,CS tumor size (<5<5-<5<5)_≥5,"Race recode (white, black, Other)_black","Race recode (white, black, Other)_other","Race recode (white, black, Other)_white"
0,0.882353,1,0,0,0,1,1,0,0,0,0,1,0,0,0,0,1,0
2,0.588235,1,1,0,1,1,1,0,0,0,0,1,0,0,0,0,1,0
3,0.882353,3,0,0,0,1,1,0,0,0,0,1,0,0,0,0,0,1
4,1.0,2,1,0,0,0,1,1,0,0,0,1,0,0,0,0,0,1
5,0.823529,3,0,0,0,1,1,1,0,0,0,1,0,0,0,0,0,1


Our dataset has a lot of categorical data: the site of the surgery, the stage of the cancer, and so on. This kind of data works well with a decision tree, so we wanted to try one first. A tree is also not heavily affected by the few non-categorical features that we have. In addition, we thought that starting with a decision tree would give us an idea of which features are most relevant for separating the data. We have not done any outlier removal yet, because we wanted to go through a few models and see which to pursue before dealing with outliers. A decision tree therefore gives us a good measure of what to expect before we start that work.
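
One way to read feature relevance off a fitted tree is its `feature_importances_` attribute. The sketch below uses synthetic stand-in data, not the patient dataset, and the feature names are placeholders; it only illustrates the idea and is not the analysis we ran.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 500
# Synthetic stand-in data (NOT the patient dataset): three binary features.
X = rng.integers(0, 2, size=(n, 3))
# The label follows feature 0, with roughly 10% of the labels flipped as noise.
noise = rng.random(n) < 0.1
y = np.where(noise, 1 - X[:, 0], X[:, 0])

tree_model = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

feature_names = ["feature_0", "feature_1", "feature_2"]  # placeholder names
importances = dict(zip(feature_names, tree_model.feature_importances_))
for name, score in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

On the real data, the same attribute could be paired with the column names of `my_data` to rank the features.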

In [None]:
#Decision Tree
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn import tree

features = my_data
X_train, X_test, y_train, y_test = train_test_split(features, label, test_size=0.2, random_state=42)
print("Training set:", len(X_train))
print("Test set:", len(X_test))

decision_tree = DecisionTreeClassifier(criterion = 'entropy')
decision_tree = decision_tree.fit(X_train,y_train)
# Uncomment this line to print the tree
# tree.plot_tree(decision_tree)
y_pred = decision_tree.predict(X_test)
accuracy_score(y_test, y_pred)

Training set: 2536
Test set: 635


0.6094488188976378


Like decision trees, Naive Bayes works fairly well with categorical data. For each point, it calculates the probability that the point belongs to a class given the features it has. We wanted to try this as well; the difference is that it operates on probabilities instead of entropy, which gives us a different way to see what is happening. Granted, we are aware that many of the features are not independent: the cancer stage is based on much of the information in the other columns, so many columns are correlated, which is one reason why the model did not work as well as we had hoped. Ultimately, though, like the decision tree, it helped us understand conceptually how each feature affects the overall spread of the data.
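
To make the probability calculation concrete, here is a small hand-rolled sketch for binary features. It is a count-based variant with add-one smoothing, not the `GaussianNB` used in the cell below, and the toy records are hypothetical rather than taken from the dataset.

```python
from collections import Counter, defaultdict

# Hypothetical toy records (NOT from the dataset): (chemo, radiation) -> outcome
records = [
    ((1, 0), "alive"), ((1, 1), "alive"), ((0, 1), "alive"),
    ((0, 0), "dead"), ((0, 0), "dead"), ((1, 0), "dead"), ((0, 1), "dead"),
]

class_counts = Counter(lbl for _, lbl in records)
# feature_counts[label][feature_index][value] = number of matching records
feature_counts = defaultdict(lambda: defaultdict(Counter))
for features, lbl in records:
    for i, value in enumerate(features):
        feature_counts[lbl][i][value] += 1


def posterior(features):
    """Return P(class | features) under the naive independence assumption,
    using add-one (Laplace) smoothing for binary features."""
    scores = {}
    for lbl, n_lbl in class_counts.items():
        score = n_lbl / len(records)  # prior P(class)
        for i, value in enumerate(features):
            # smoothed likelihood P(feature_i = value | class)
            score *= (feature_counts[lbl][i][value] + 1) / (n_lbl + 2)
        scores[lbl] = score
    total = sum(scores.values())
    return {lbl: s / total for lbl, s in scores.items()}


result = posterior((1, 1))
print(result)
```

The product over features is exactly where the independence assumption enters, which is why correlated columns such as the staging features can distort the result.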

In [None]:
#Naive-Bayes
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score
clf = GaussianNB()
cross_score = cross_val_score(clf, my_data, label, cv=10)
print(sum(cross_score)/len(cross_score))

0.6546961490387477


A random forest is a logical next step from the decision tree. Instead of living and dying by one model, we rely on an ensemble of models to diversify how we predict the data. The decision tree gave us a baseline understanding of how our data would be split, which made us comfortable using a random forest as essentially a step up from the previous model.

In [None]:
#Random forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
param_grid = {
    'n_estimators': [50,100,150],
}

grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5, scoring='accuracy')

y_pred = cross_val_predict(grid_search, my_data, label, cv=5)
print(classification_report(label, y_pred))

              precision    recall  f1-score   support

       alive       0.42      0.24      0.31       994
        dead       0.71      0.85      0.77      2177

    accuracy                           0.66      3171
   macro avg       0.56      0.54      0.54      3171
weighted avg       0.62      0.66      0.63      3171



We wanted to use an SVM because the one-hot encoding we had to do increased the dimensionality of our dataset, and SVMs tend to cope well with high-dimensional data. Another benefit is that the SVM is more deterministic: instead of a greedy run-through (as a decision tree employs), SVM training solves a convex optimization problem, so it finds the global minimum of its objective, and we can tell when a change is actually an improvement. We thought that the SVM would therefore give us a good basis for the later models, because an improvement in the SVM likely means that we are doing something better overall that can then be applied to the other models. At this point the SVM was our most accurate model. It performed only marginally better than declaring every record as one class, but at least it was a good sanity check to know that we had a baseline.

In [None]:
#SVM
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_predict
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

scaler = StandardScaler()
pca = PCA()
svc = SVC()

features = my_data

pipe = Pipeline([('scaler', scaler), ('pca', pca), ('svc', svc)])
param_grid = {
    'pca__n_components': list(range(5, 16)),
    'svc__kernel': ['rbf', 'poly', 'linear']
}

clf = GridSearchCV(pipe, param_grid, cv = 5)
y_pred = cross_val_predict(clf, features, label, cv=5)
svm_accuracy = accuracy_score(label, y_pred)

print(svm_accuracy)
print(classification_report(label, y_pred))

0.6937874487543362
              precision    recall  f1-score   support

       alive       0.60      0.07      0.12       994
        dead       0.70      0.98      0.81      2177

    accuracy                           0.69      3171
   macro avg       0.65      0.52      0.47      3171
weighted avg       0.67      0.69      0.60      3171



We have a substantial class imbalance in favor of the 'dead' label (about 69% of the data, 2177 of 3171 records). This is probably because the survival rate of pancreatic cancer tends to be extremely low due to the lack of symptoms in the early stages. The five-year survival rate, even as of now (2023), is only around 12% overall.
This class imbalance is why we are seeing around 98% recall for that class: our model is rewarded for confidently claiming that most patients belong to the dead class. To combat this, we will use the SMOTE technique to balance the classes with synthetic data. We will also use balanced accuracy as the scoring function, which is the average of the recall on each class. The idea is to incentivize our SVM not to look strictly at one class but to actually try to find a division by weighing the classes equally. This worked to a degree: recall for the 'alive' class increased drastically (from 0.07 to 0.50), but accuracy fell (from 0.69 to 0.60) because the model can no longer just declare everything to be one class.
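
Because balanced accuracy is just the mean of the per-class recalls, it can be checked by hand. The sketch below plugs in the (rounded) recalls printed in the two SVM classification reports; the helper name `balanced_accuracy` is ours and is not the scikit-learn function.

```python
def balanced_accuracy(recalls):
    """Average of the per-class recalls."""
    return sum(recalls) / len(recalls)


# Per-class recalls (alive, dead), rounded as printed in the classification reports
svm_before = balanced_accuracy([0.07, 0.98])      # first SVM
svm_with_smote = balanced_accuracy([0.50, 0.65])  # SVM with SMOTE and balanced scoring
print(f"{svm_before:.3f} {svm_with_smote:.3f}")
```

The two values correspond to the "macro avg" recall rows of the reports, which is why that row is the one to watch when comparing models under class imbalance.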

In [None]:
#optimizing SVM
from sklearn.svm import SVC
from sklearn import preprocessing, decomposition, neighbors
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
data_x = my_data
data_y = label
# StandardScaler learns the mean and variance during fit, so they are not set by hand
scaler = preprocessing.StandardScaler(with_mean=True, with_std=True)
pca = decomposition.PCA()
svm = SVC()
pipe = Pipeline(steps = [("scaler",scaler),("pca", pca),('over', SMOTE()), ("svm", svm)])
param_grid = {
    'pca__n_components': list(range(5, 16)),
    'svm__kernel': ["linear", "rbf", "poly"],
    'over__sampling_strategy':['minority']
}

gcv = GridSearchCV(pipe,param_grid, scoring='balanced_accuracy')
predictions = cross_val_predict(gcv, data_x, data_y, cv=5)
print(classification_report(data_y, predictions))

              precision    recall  f1-score   support

       alive       0.39      0.50      0.44       994
        dead       0.74      0.65      0.69      2177

    accuracy                           0.60      3171
   macro avg       0.57      0.57      0.56      3171
weighted avg       0.63      0.60      0.61      3171



KNN was chosen less because we expected a great model and more to test how the data is dispersed. After seeing that the SVM had such a low recall, we wanted to see how difficult it really is to differentiate the data. A higher accuracy would imply that the data is clustered together by class, but as it turns out, that isn't the case. Instead, the classes seem to be mixed together, so it is hard for a neighbor-based method to group our data by class. This is also backed up by how the first SVM declared nearly everything as one class instead of trying to separate the data, and by the class imbalance. The SMOTE-balanced SVM also shows that, when we try to draw a separation, many points still fall on the wrong side of the hyperplane.

In [None]:
# KNN
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

from warnings import simplefilter
simplefilter(action='ignore', category=FutureWarning)

scaler = StandardScaler()
pca = PCA()
neigh = KNeighborsClassifier(n_neighbors=7)
pipe = Pipeline([('scaler', scaler), ('pca', pca), ('knn', neigh)])
KN_accuracy = cross_val_score(pipe, features, label, cv=5)
print(sum(KN_accuracy) / len(KN_accuracy))

param_grid = {
    'pca__n_components': list(range(5, 19)),
    'knn__n_neighbors': list(range(5, 10))
}

clf = GridSearchCV(pipe, param_grid, cv = 5)
clf.fit(features, label)
print(clf.best_params_)
print(clf.best_score_)

0.6546849151742468
{'knn__n_neighbors': 9, 'pca__n_components': 7}
0.6625693633721651



The idea behind using neural nets was that they work really well with many dimensions, so we could use one without cutting out dimensions and therefore without losing a lot of data. Along with the SVM, this ended up being one of our best-performing models, which makes sense as both work really well at high dimensionality.

The other advantage is that we can easily vary the parameters of the neural network to see what works best, and thus optimize it more than the other models. Because the other models gave us an understanding of the data, we were now comfortable using something closer to a black box like a neural net. Notably, the neural net performed about as well as the SVM, which is why we decided to continue with both. Its recall for the 'alive' class is also not as low as the SVM's (0.18 versus 0.07), which is promising and means that we might be able to take similar steps to improve both models.

In [None]:
#Neural Nets
from sklearn.neural_network import MLPClassifier
from sklearn.utils._testing import ignore_warnings
from sklearn.exceptions import ConvergenceWarning
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score

@ignore_warnings(category=ConvergenceWarning)
def neural_net():
  pipe = Pipeline(steps=[ ("scaler",scaler),('mlpc', MLPClassifier())])
  param_grid = {
      'mlpc__hidden_layer_sizes': [30,40,50,60],
      'mlpc__activation':['logistic', 'tanh', 'relu']
  }

  grid_search = GridSearchCV(pipe, param_grid, cv=5, scoring='balanced_accuracy')
  y_pred = cross_val_predict(grid_search, my_data, label, cv=5)
  mlp_accuracy = accuracy_score(label, y_pred)
  print(mlp_accuracy)
  print(classification_report(label, y_pred))

neural_net()

0.6827499211605171
              precision    recall  f1-score   support

       alive       0.48      0.18      0.27       994
        dead       0.71      0.91      0.80      2177

    accuracy                           0.68      3171
   macro avg       0.60      0.55      0.53      3171
weighted avg       0.64      0.68      0.63      3171



The issue that we are running into, as discussed earlier, is that about 69% of our patients (2177 of 3171) are labeled as "dead". This class imbalance means that our models draw too heavily on these records when predicting and thus perform on par with simply predicting everyone as dead. Our best-performing model was the SVM with an accuracy of 69.4%, which only barely outperforms the 68.7% of records that belong to the majority class; the neural net, at 68.3%, falls just short of it.
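
As a quick check of this comparison, the majority-class baseline can be computed from the class support printed in the classification reports. The sketch below uses only numbers already shown in the outputs above; the variable names are ours.

```python
# Class support as printed in the classification reports above
support = {"alive": 994, "dead": 2177}
total = sum(support.values())
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / total
print(majority_class, round(baseline_accuracy, 4))

# Cross-validated accuracies printed earlier in the notebook
reported = {"SVM": 0.6937874487543362, "neural net": 0.6827499211605171}
for name, acc in reported.items():
    print(f"{name}: {acc - baseline_accuracy:+.4f} vs. baseline")
```

A positive difference means the model beats always predicting the majority class; a negative one means it does not.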

To solve this, we realistically had two approaches: we could use SMOTE to generate new synthetic 'alive' patients, or we could drop some outliers among the 'dead' patients. We are going to do both: we drop the 'dead' patients that a one-class SVM flags as outliers, and we also use SMOTE to add a few new points. We rely on the SVM here because its accuracy is close to that of the neural net.

In [None]:
# Here we are using a one class svm to predict outliers
from sklearn.svm import OneClassSVM
ad = OneClassSVM(kernel="rbf", nu=.25).fit(my_data)
result = ad.predict(my_data)
print(len(result[result == -1]))

796


In [None]:
#We drop 'dead' records which are outliers
import numpy as np
print(len(result[result == 1]))
new_labels = label
new_data = my_data
r = np.array(result == -1)
l = np.array(label == 'dead')

new_labels = new_labels.drop(new_labels[r&l ].index)
new_data = new_data.drop(new_data[r&l ].index)
print(len(new_labels[new_labels == 'dead']))

2375
1665


Now that we have cleaned up some outliers, we can rerun our previous models and see whether they benefit from not having the added bias that these patients would have introduced.


The recall for alive and dead was .51 and .78 respectively, which is slightly more disparate than for the SMOTE-balanced SVM used earlier in the project, but we do want to emphasize the recall on the 'dead' samples in this particular scenario, so this is fine. More important to note is that by dropping outliers in the 'dead' class and applying SMOTE, accuracy rises by 8 percentage points over the previous SMOTE-balanced SVM, from 60% to 68%.

In [None]:
# We re-run the highest performing models, the SVM and Neural Net

from sklearn.svm import SVC
from sklearn import preprocessing, decomposition, neighbors
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix, classification_report
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
data_x = new_data
data_y = new_labels
# StandardScaler learns the mean and variance during fit, so they are not set by hand
scaler = preprocessing.StandardScaler(with_mean=True, with_std=True)
pca = decomposition.PCA()
svm = SVC()
pipe = Pipeline(steps = [("scaler",scaler),("pca", pca),('over', SMOTE()), ("svm", svm)])
param_grid = {
    'pca__n_components': list(range(5, 16)),
    'svm__kernel': ["linear", "rbf", "poly"],
    'over__sampling_strategy':['minority']
}

gcv = GridSearchCV(pipe,param_grid, scoring='balanced_accuracy')
predictions = cross_val_predict(gcv, data_x, data_y, cv=5)
print(classification_report(data_y, predictions))

              precision    recall  f1-score   support

       alive       0.58      0.51      0.54       994
        dead       0.73      0.78      0.75      1665

    accuracy                           0.68      2659
   macro avg       0.65      0.65      0.65      2659
weighted avg       0.67      0.68      0.68      2659



The recalls were .38 and .91 for alive and dead respectively, which is better than the earlier neural net (.18 and .91). The accuracy is also slightly higher, at .71. The improvement is smaller for the neural net than for the SVM. Perhaps that is because the SVM has a much more difficult time dealing with such a class imbalance, whereas the neural net, when overwhelmed, can tone down certain parameters. However, the change is still noticeable and helps our models a lot.

In [None]:
from sklearn.neural_network import MLPClassifier
from sklearn.utils._testing import ignore_warnings
from sklearn.exceptions import ConvergenceWarning
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
@ignore_warnings(category=ConvergenceWarning)
def neural_net():
  scaler = StandardScaler()
  pipe = Pipeline(steps=[("scaler",scaler), ('mlpc', MLPClassifier())])
  param_grid = {
      'mlpc__hidden_layer_sizes': [30,40,50,60],
      'mlpc__activation':['logistic', 'tanh', 'relu']
  }

  grid_search = GridSearchCV(pipe, param_grid, cv=5, scoring='balanced_accuracy')
  y_pred = cross_val_predict(grid_search, new_data, new_labels, cv=5)

  print(classification_report(new_labels, y_pred))

neural_net()

              precision    recall  f1-score   support

       alive       0.72      0.38      0.50       994
        dead       0.71      0.91      0.80      1665

    accuracy                           0.71      2659
   macro avg       0.72      0.65      0.65      2659
weighted avg       0.72      0.71      0.69      2659



If we had to choose only one model, we would choose the neural net, because it has a higher accuracy while also having better recall and precision in most cases. Using it, we can correctly identify the eventual fate of 71% of patients, which is a nontrivial number.

The key use case we see for this, though, isn't telling you whether a patient is dead (hopefully that is rather obvious), but rather how close they are to that point. Instead of looking strictly at the output of the neural net, by looking at its confidence we can judge how much effort we would have to put into keeping a specific patient on a certain medication, or whether we can wean them off because they are in a good spot.
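
One way to obtain such a confidence is the classifier's `predict_proba` output. The sketch below trains a small `MLPClassifier` on synthetic stand-in data (not the patient dataset) purely to show the call; the layer size and the data are assumptions, and these probabilities are not necessarily well calibrated.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in data (NOT the patient dataset): four numeric features.
X = rng.normal(size=(300, 4))
y = np.where(X[:, 0] + 0.5 * rng.normal(size=300) > 0, "dead", "alive")

model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(30,), max_iter=1000, random_state=0),
)
model.fit(X, y)

# One row per patient, one column per class (column order follows model.classes_)
proba = model.predict_proba(X[:5])
dead_column = list(model.classes_).index("dead")
for row in proba:
    print(f"P(dead) = {row[dead_column]:.2f}")
```

Ranking patients by this probability, rather than by the hard label, is what would allow attention to be prioritized.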

Most importantly, we have a good recall for "dead", meaning that the model finds most of the patients who actually died and is less likely to give a false negative for that class. The recall for "alive" is lower, however, meaning that we miss many of the patients who are actually alive. While this may be the safer approach, it still means that we may cause attention to be diverted from patients that need it more.

In most of our runs the precision for "alive" was also lower (for example .58 versus .73 in the final SVM), which points to more false positives for this class: we predict people to be "alive" when they were recorded as "dead". In the final neural net the precision is similar for both classes (.72 and .71), but combined with the lower recall for "alive", the predictions for "alive" are still less trustworthy than those for "dead".



In retrospect, too much of our data was categorical. The lack of numerical data, together with a large class imbalance, led to a model that struggled to draw clean, definite conclusions. One possible remedy would be to use, for each dead patient, records taken before their death instead of their final record, as this would add more data for living patients and tell us how close someone is to that point.