Notebook for our final project!

Team:
Nolan Jimmo
Nicole Donahue
Frederick Carlson
Xinyu Liu

In [31]:
#Imports, function def and some file reading

import numpy as np
import pandas as pd
import glob
import csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC
from sklearn import linear_model
from sklearn.ensemble import RandomForestClassifier

def conf_matrix_to_df(conf_matrix, target_names):
    return pd.DataFrame(conf_matrix, columns=target_names, index=target_names)


#reading in EDSS Score data
EDSS_FILENAME = "data/EDSS_Scores.csv"
EDSS_scores = pd.read_csv(EDSS_FILENAME)

Find the subject ids that have valid EDSS scores to be able to just train model on these subjects data. Storing the valid subject id and scores in a dictionary with the structure: {Subject ID: (baseline score, 6mo score)}

In [32]:
valid_sids = {}
for i, row in EDSS_scores.iterrows():
    if type(row["Subject ID "]) == float:
        break
    if row["EDSS Baseline (Score out of 10) "] != np.NaN and row["EDSS 6mo (Score out of 10) "] != np.NaN:
        valid_sids[(row["Subject ID "])] = (str(row["EDSS Baseline (Score out of 10) "]), str(row["EDSS 6mo (Score out of 10) "]))
#print(valid_sids)

converting regular EDSS scores to the binary 0, or 1, for low vs. moderate/severe EDSS score. Everything up to 4 will be 0, everything 4 and above will be moderate/severe score

In [33]:
valid_sids_generalized = {}
for key, value in valid_sids.items():
    if float(value[0]) < 4:
        v1 = 0
    else:
        v1 = 1
    if float(value[1]) < 4:
        v2 = 0
    else:
        v2 = 0
    valid_sids_generalized[key] = (v1, v2)

Get filenames for the valid subject data files out of the data folder, for both the baseline and 6mo data

NOTES: This is all pretty much just data preprocessing, getting the filenames that correspond to the subjects that we know we have EDSS scores for, then going and getting all of the data for each of those valid subjects. For each row of data per subject I add column (feature) that is the target feature, which is just their EDSS score for this time period. I then store that data in a list (called calid_subject_data) in order to facilitate creating the dataframe that I will use in the training/testing of our model

In [34]:
# here, it is the baseline of the gait data
gait_baseline_filenames = glob.glob("data/Processed Data - MS +/Sway/MS1 Session 1/*")
#print((gait_baseline_filenames))
removal = []
for g in gait_baseline_filenames:
    if g[-9:-4] not in valid_sids_generalized.keys():
        removal.append(g)

gait_b_filenames = [l for l in gait_baseline_filenames if l not in removal]

###NOTE: In this test below, sometimes the two lists are not the same length
# HOWEVER, the valid EDSS subject ids list is always longer, so we will always have a
# "target" for each feature set, so we should be good to go
#print(len(gait_b_filenames), len(valid_sids.keys()))


# here, it is the 6mo of the gait data
gait_6mo_filenames = glob.glob("data/Processed Data - MS +/Sway/MS1 Session 2/*")
#print((gait_baseline_filenames))
removal = []
for g in gait_baseline_filenames:
    if g[-9:-4] not in valid_sids_generalized.keys():
        removal.append(g)

gait_6_filenames = [l for l in gait_baseline_filenames if l not in removal]

# Now, loop through the valid files, get the features from each valid subject and assign
MAX_ROWS_PER_SUBJECT = 20
# their EDSS score as the "target"
valid_subject_data = []
cols = []
for g in gait_b_filenames:
    with open(g, 'r') as file:
        reader = csv.reader(file)
        if cols == []:
            cols = next(file).strip().split(',')
            cols.append('target')
        count = 0
        for row in reader:
            if row[0] != 'timestamp_start':
                if valid_sids_generalized[g[-9:-4]][0] == 0 and count < MAX_ROWS_PER_SUBJECT:
                    row.append(valid_sids_generalized[g[-9:-4]][0])
                    valid_subject_data.append(row)
                    count += 1
                elif valid_sids_generalized[g[-9:-4]][0] == 1:
                    row.append(valid_sids_generalized[g[-9:-4]][0])
                    valid_subject_data.append(row)
                else:
                    break

# doing the exact some thing as before, just with the 6 month data
# We can just add this data straight to the valid_subject_data list because it is all going
# to be training data
# We do have to separate the for loops though because we have to add the proper EDSS value
# from the valid_sids dictionary
for g6 in gait_6mo_filenames:
    with open(g6, 'r') as file:
        reader = csv.reader(file)
        if cols == []:
            cols = next(file).strip().split(',')
            cols.append('target')
        count = 0
        for row in reader:
            if row[0] != 'timestamp_start':
                if valid_sids_generalized[g[-9:-4]][1] == 0 and count < MAX_ROWS_PER_SUBJECT:
                    row.append(valid_sids_generalized[g[-9:-4]][1])
                    valid_subject_data.append(row)
                    count += 1
                elif valid_sids_generalized[g[-9:-4]][1] == 1:
                    row.append(valid_sids_generalized[g[-9:-4]][1])
                    valid_subject_data.append(row)
                else:
                    break
#print(cols)
#print(valid_subject_data[:5])
num_observations = len(valid_subject_data)

Final setup for the features dataframe and then training/testing the SVM model!

NOTES:
As you can see from the models that are commented out, I tried a number of different models, and it looks like the random forest classifier is going to be the one that works the best. Basically, here, I drop all of the non-important features colums, break the data in to testing and training partitions, train the model and then test it.

In [35]:
df = pd.DataFrame(valid_subject_data, columns=cols)
#print(df)
#get rid of the non-important or NaN valued "features"
df.drop(df.columns[[0,1,2,6,16]], axis=1, inplace=True)
#print(df)
df.fillna(0)

#Train the model and see what happens!
x_train, x_test, y_train, y_test = train_test_split(df.iloc[:, df.columns != 'target'], np.array(df.iloc[:, df.columns == 'target']).reshape(num_observations,), test_size = 0.2, random_state = 0)
#svm = SVC(kernel="poly")
#lin_model = linear_model.LogisticRegression()
rfc = RandomForestClassifier()
print('training')
rfc.fit(x_train, y_train)
print("predicting")
svm_y_predict = rfc.predict(x_test)

conf_matrix_svm = confusion_matrix(y_test, svm_y_predict)
print("\nPrinting confusion matrix")
conf_matrix_to_df(conf_matrix_svm, [0,1])
#print(conf_matrix_svm)

training
predicting

Printing confusion matrix


Unnamed: 0,0,1
0,150,33
1,43,130


Notes moving forward to try and improve performance:

1. Use a regression model rather than an SVM
2. Do a better job of equalizing how much data we have from low EDSS scores (healthier people) vs high EDSS scores (not as healthy people)
    - Currently, there is significantly more data from the healthier people, and not as much data from the not as healthy people, so all of the test data gets predicted as low EDSS (0). We can either omit a proportional amount of the low EDSS score training data, or we can add a bunch of mean-wise approximated data for high EDSS patients
    - This second approach is not as scalable as the first because we can only add data based on data that we already have, so this approach would really only help us for the binary, low/high edss scores, any not the ultimate classification of individual EDSS score (we would then have a high density of data/scores for the small domain of high EDSS scores that we have recorded)

Things done to address the problems/solutions above (3/22/21):
1. Tried a regression model, worked worse than the SVM. Ended up with a RandomForestClassifier() that has proven to work pretty well, certainly much, much better than the SVM or the regression models (even though still not awesome)
2. While it is not a perfect way of dealing with a disproportionate amount of data per target, I just limited the amount of data that there is in the processed dataset based on the target value. I limited healthier scores (target value 0) to 50 rows of data per subject, and did not limit the amount of data per target value 1 subject