# `Computing Challenge 2021`

1. Reducing the dataset 

Using our judgement, we decided that some attributes, for example ID and residence type, are unlikely to influence a patient's probability of having a stroke. Removing them reduced the amount of noise created by the additional data and improved the predictive power of the model to be chosen.

Important stroke factors include:
age, high cholesterol and obesity (bmi), diabetes (average glucose level), smoking (smoking status) and hypertension.

Factors that may have an effect on the likelihood of a stroke were still included to evaluate their potential impact on the predictive power of the chosen model (marital status, work type, intense stress, hypertension, and employment type).

In [1]:
#import the necessary libraries 
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt 
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
import sklearn.metrics
import statistics
import csv

data = pd.read_csv('healthcare-dataset-stroke-data.csv') #importing the data

2. Data processing

2.1 Removing NaN values

NaN values were replaced using the mean of the associated attribute, which leaves the mean of that attribute unchanged (although it slightly reduces its variance).
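
As a minimal sketch of mean imputation, the snippet below fills the gaps in a small made-up bmi column. The values are illustrative assumptions, not rows from the challenge dataset. It suggests that the mean is preserved while the spread shrinks a little.

```python
import numpy as np
import pandas as pd

# Toy bmi column (made-up values) with two missing entries
bmi = pd.Series([22.0, np.nan, 31.0, 27.0, np.nan, 40.0], name="bmi")

# Replace every NaN with the mean of the observed values
bmi_filled = bmi.fillna(bmi.mean())

print("mean before:", bmi.mean(), "after:", bmi_filled.mean())
print("std before:", bmi.std(), "after:", bmi_filled.std())
```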

2.2 One hot encoder 

One hot encoding gives numerical binary values to multi-class categories (non-numerical entries) such as smoking status, work type and gender; the binary attribute ever married was simply mapped to 1 (Yes) and 0 (No). One hot encoding was preferred over label encoding to prevent any hierarchical ordering of the values. One additional column is created per category. Columns for categories that are not representative of the whole dataset, in particular Other (gender), were then removed.
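
The difference between the two encodings can be sketched on a toy column. This example uses `pd.get_dummies`, a pandas helper swapped in here for brevity; the notebook itself uses its own `onehotencode` function. The rows are made up.

```python
import pandas as pd

# Toy smoking_status column (made-up rows, category names as in the dataset)
toy = pd.DataFrame(
    {"smoking_status": ["smokes", "never smoked", "formerly smoked", "smokes"]}
)

# Label encoding: one integer per category, which implies an order that does not exist
label_encoded = toy["smoking_status"].astype("category").cat.codes

# One hot encoding: one binary column per category, no implied order
one_hot = pd.get_dummies(toy["smoking_status"], dtype=int)

print(label_encoded.tolist())
print(one_hot)
```

Each row of `one_hot` contains exactly one 1, so no category is ranked above another.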

2.3 Feature scaling 

Values were standardised to reduce the effect of skewed data and to minimise the influence of outliers. A robust scaler scales features using statistics that are robust to outliers: it removes the median and scales the data by the interquartile range, the range between the 25th and 75th percentiles (Q1 and Q3). Note that the implementation in the cell below centres on Q1 rather than on the median. This proved to be particularly effective on the bmi attribute, which exhibits a distribution skewed towards 0 (bmi values near 0 and 100 are unlikely) and a noticeable difference between the mean and the median.
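
A minimal sketch of median and interquartile range scaling, x' = (x - median) / (Q3 - Q1), compared against scikit-learn's `RobustScaler` with its default settings. The bmi values are made up and include one deliberate outlier.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy bmi values (made-up) with one large outlier
bmi = np.array([18.0, 22.0, 25.0, 28.0, 31.0, 35.0, 90.0])

# Manual robust scaling: subtract the median, divide by the interquartile range
median = np.median(bmi)
q1, q3 = np.quantile(bmi, [0.25, 0.75])
manual = (bmi - median) / (q3 - q1)

# RobustScaler with default settings centres on the median and scales by Q3 - Q1
scaled = RobustScaler().fit_transform(bmi.reshape(-1, 1)).ravel()

print(manual)
print(np.allclose(manual, scaled))
```

The outlier moves neither the median nor the quartiles by much, so the bulk of the values stays on a comparable scale.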



In [2]:
#######Encoding and cleaning#######

#making a copy to avoid damaging original data file
data_replaced = data.copy()
data_replaced = data_replaced.drop(columns ='id')
    
#calculates the means to replace NaN
mean_values = {
    'age': data_replaced['age'].mean(),
    'avg_glucose_level': data_replaced['avg_glucose_level'].mean(),
    'bmi': data_replaced['bmi'].mean(),
}

#creating the new corrected database as a copy
data_replaced_mean = data_replaced.copy().fillna(value = mean_values) 

#Encoding the data with Onehotencoder: gender, work_type, Residence_type and smoking_status

data_onehotencoded = data_replaced_mean.copy()

def onehotencode(label, data):
    """takes column name and data as inputs returns the one hot encoded data"""
    hotencode = set(data[label]) #finds each individual value in the column (the order of a set is arbitrary)
    for cls in hotencode:
        data[cls] = (data[label] == cls).astype(int) #creates one new binary column per value
    del data[label] #eliminates original column
    return data

#executes one hot encoding over the non numerical attributes

for i in ['gender','work_type','Residence_type','smoking_status']:
    data_onehotencoded = onehotencode(i,data_onehotencoded)

#replaces values in ever married by 1 (Yes) and 0 (No)
data_onehotencoded['ever_married'] = data_onehotencoded['ever_married'].apply(lambda x: 1 if x == 'Yes' else 0)

#deleting the residence type, gender and unknown smoking status columns, which are not relevant to the problem
data_onehotencoded = data_onehotencoded.drop(columns=['Rural', 'Urban', 'Other', 'Male', 'Female', 'Unknown'])

#######Scaling#######

data_robust = data_onehotencoded.copy()

def robust(label: str, data):
    """Scales a column by its interquartile range (robust scaling centred on the first quartile)"""
    q1_value = np.quantile(data[label], 0.25) #calculates first quartile
    q3_value = np.quantile(data[label], 0.75) #calculates third quartile
    diff = q3_value - q1_value #subtracts the two quartiles (interquartile range)
    #assigns the robust scaled values to the column
    #note: Q1 is subtracted here, whereas a standard robust scaler subtracts the median
    data[label] = (data[label] - q1_value) / diff
    return data

#executes feature scaling over the non binary numerical data 

for i in ['age','bmi','avg_glucose_level']:
    data_robust = robust(i,data_robust)

#the non stroke proportion in this dataset is not representative of the UK population where 1/6 of the inhabitants suffer...
#...a stroke in their lifetime. 
#See https://www.gov.uk/government/news/new-figures-show-larger-proportion-of-strokes-in-the-middle-aged
#To compensate for the effect of the data bias, 92% of the non stroke rows are removed. This...
#...corresponds to a random undersampling approach. This was chosen over over-sampling as it required less data processing...
#...and improved computational efficiency (see more justification details in Notebook 2). 
data_final = data_robust.drop(data_robust[data_robust['stroke'] == 0].sample(frac=0.92).index)

data_final

Unnamed: 0,age,hypertension,heart_disease,ever_married,avg_glucose_level,bmi,stroke,Never_worked,Private,Self-employed,Govt_job,children,formerly smoked,smokes,never smoked
0,1.166667,0,1,1,4.110327,1.422222,1,0,1,0,0,0,1,0,0
1,1.000000,0,0,1,3.391641,0.565915,1,0,0,1,0,0,0,0,1
2,1.527778,0,1,1,0.778260,0.966667,1,0,1,0,0,0,0,0,1
3,0.666667,0,0,1,2.550821,1.177778,1,0,1,0,0,0,0,1,0
4,1.500000,1,0,1,2.629258,0.022222,1,0,0,1,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5075,1.250000,0,0,1,0.685439,1.555556,0,0,1,0,0,0,0,0,0
5078,-0.472222,0,0,0,-0.025377,-0.922222,0,0,0,0,0,1,0,0,0
5088,1.083333,1,0,1,-0.009635,0.711111,0,0,0,1,0,0,0,0,0
5090,0.027778,0,0,0,0.640657,-0.311111,0,0,0,0,1,0,0,1,0
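
The random undersampling step in the cell above can be sketched on a toy frame. The row counts are made up, and `random_state=0` is an assumption added only to make the sketch repeatable; the notebook does not set one.

```python
import pandas as pd

# Toy frame (made-up): 100 non-stroke rows and 5 stroke rows
toy = pd.DataFrame({"stroke": [0] * 100 + [1] * 5})

# Randomly pick 92% of the majority (non-stroke) rows and drop them
majority = toy[toy["stroke"] == 0]
to_drop = majority.sample(frac=0.92, random_state=0).index  # random_state: assumption
balanced = toy.drop(to_drop)

print(balanced["stroke"].value_counts())
```

All stroke rows are kept, so only the class ratio changes, at the cost of discarding most of the non-stroke information.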


3. Random Forest Classifier 

We chose Random Forest because it is one of the non-parametric classifiers that gives the highest accuracy (over 90%). The model works well because of the low correlation between some of the attributes (bmi and age, for example): the trees protect each other from their individual errors. While some trees may be wrong, many other trees will be right, so as a group the trees are able to move in the correct direction. Some advantages of Random Forest over other classifiers include:
- It is relatively insensitive to noise and overfitting, which helps when dealing with our unbalanced data.
- Trees can split the data directly on categorical variables (unlike linear regression).
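
The idea that trees protect each other from their individual errors can be illustrated with a small simulation. This is a sketch under a strong assumption: every tree is right 60% of the time and the trees err independently, which real forests only approximate. All numbers are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_trees, p_correct = 10_000, 101, 0.6

# Each row is one sample, each column one tree; True means that tree is right.
# Assumption: the trees err independently of each other.
correct = rng.random((n_samples, n_trees)) < p_correct

single_tree_accuracy = correct[:, 0].mean()
# The forest is right whenever more than half of its trees are right
majority_accuracy = (correct.sum(axis=1) > n_trees / 2).mean()

print(single_tree_accuracy, majority_accuracy)
```

The more correlated the trees are, the smaller this gain becomes, which is why low correlation between trees matters.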
	
With the selected feature scaling, this model proved to be a good candidate compared to other potential modelling techniques such as Logistic Regression or KNN. 

To evaluate the importance of each attribute, the recall_micro cross validation scores were studied over each iteration. Recall_micro counts the total number of true positives, false negatives and false positives across both classes. Recall was judged a good metric to use, as the most dangerous prediction would be to falsely diagnose that a patient will not have a stroke (false negative).
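
When reading these scores, it helps to know how micro-averaged recall behaves on a two-class problem. The sketch below uses made-up labels: because every sample has exactly one label, micro-averaged recall comes out equal to accuracy, while the binary recall of the stroke class is the quantity that reacts only to false negatives.

```python
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels and predictions: 1 = stroke, 0 = no stroke
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

accuracy = accuracy_score(y_true, y_pred)
recall_micro = recall_score(y_true, y_pred, average="micro")
recall_stroke = recall_score(y_true, y_pred)  # binary recall of the stroke class

print(accuracy, recall_micro, recall_stroke)
```

If missed strokes are the main concern, `scoring='recall'` may be worth comparing against `scoring='recall_micro'`.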

10 cross validation folds were used to limit overfitting to a single split and to obtain enough datapoints for the final interactive graph. Furthermore, k (the number of cross validation folds) should not be increased too much, as the computation becomes slower and the number of datapoints in each validation fold becomes too low.
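
The trade-off in the choice of k can be sketched by counting rows per fold. The labels are made up (500 rows, 40% positives). `StratifiedKFold` is used because `cross_validate` applies stratified folds when given a classifier and an integer `cv`.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up labels: 500 rows, 40% positives
y = np.array([1] * 200 + [0] * 300)
X = np.zeros((len(y), 1))  # features are irrelevant for counting fold sizes

sizes = {}
for k in (5, 10, 50):
    folds = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    train_idx, val_idx = next(iter(folds.split(X, y)))
    sizes[k] = (len(train_idx), len(val_idx))
    print(f"k={k}: {len(train_idx)} training rows, {len(val_idx)} validation rows")
```

A larger k means more model fits and fewer validation rows behind each score, so each individual score becomes noisier.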

The training fraction was chosen to be 80% in order to obtain the best possible model that fits our data while limiting the variation of the obtained score. In this scenario, a greater quantity of training data gives a better estimate of the stroke output; however, it must not be increased above this level (i.e. 90-10), because a 10% test set is too small to evaluate the final performance of the model.
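
A minimal sketch of an 80/20 split with `train_test_split`, on made-up data. The `stratify=y` argument is an optional extra that the notebook does not use; it is shown here as one way to keep the class ratio the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 30 positives
X = np.arange(100).reshape(-1, 1)
y = np.array([1] * 30 + [0] * 70)

# 80% training / 20% test; shuffling is on by default and random_state makes it repeatable.
# stratify=y (an assumption, not in the notebook) keeps the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=1000, stratify=y
)

print(len(X_train), len(X_test), y_test.mean())
```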



In [3]:
#Classifier class that includes functions to return predicted data and its associated cross validation recall_micro...
#...score when the input column is removed from the dataset. If the score decreases a lot, the attribute has an...
#...important impact on the accuracy of our classifier predictions. 


class RandomForest:
    """final class, takes a dataframe as an input, the classifier (with its chosen hyperparameters) and the columns to be removed"""
    def __init__(self, data: pd.DataFrame, hyperparameters, removed: list = None, training_fraction: float = 0.8, y_column: str = 'stroke'):

        #Initialisation
        self.data_final = data.copy() #copy the data to prevent modifying the original set
        self.hyperparameters = hyperparameters #the scikit-learn classifier to be fitted

        #select the feature columns: everything except the target and the removed attributes
        removed = [] if removed is None else removed
        X_columns = [c for c in self.data_final.columns if c != y_column and c not in removed]

        #allocate X and y to self.X and self.y and transform the sets into numpy arrays
        self.X = self.data_final[X_columns].to_numpy()
        self.y = self.data_final[y_column].to_numpy()

        #Shuffle and split the data. Shuffling prevents the absence of positive stroke cases in the test set, which would...
        #...inflate the obtained results. The same random_state (1000) is used for every model so that the split is consistent.
        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(
            self.X, self.y, train_size=training_fraction, random_state=1000)

    def get_predicted(self) -> np.ndarray:
        classifier = self.hyperparameters.fit(self.X_train, self.y_train)

        #get predicted data and allocate to self.y_pred
        self.y_pred = classifier.predict(self.X_test)
        return self.y_pred

    def cross_validate(self) -> np.ndarray:
        #the module level cross_validate function clones and refits the classifier on each of the 10 folds
        cv_results = cross_validate(self.hyperparameters, self.X_train, self.y_train, cv=10, scoring='recall_micro')
        return cv_results['test_score']

In [4]:
#loops the class, removing a different attribute each time, and gives access to the associated predicted values of the...
#...stroke attribute; the associated cross validation scores are computed in the next cell
def make_classifier():
    """returns a new classifier with the chosen hyperparameters"""
    return RandomForestClassifier(n_estimators=1800, class_weight="balanced", min_samples_split=10,
                                  min_samples_leaf=2, max_features='sqrt', max_depth=90)

model = {}
columns = list(data_final.columns)
columns.remove('stroke')
model['everything'] = RandomForest(data_final, make_classifier())
Prediction_list = []
for i in columns:
    model[i] = RandomForest(data_final, make_classifier(), [i])

for i in model:
    Prediction = model[i].get_predicted() #fits the model and returns the predictions for each removed column
    Prediction_list.append(Prediction)
    print(i, Prediction)

everything [0 1 0 1 0 0 0 1 0 0 0 1 1 1 1 0 1 0 1 0 1 0 1 0 0 1 0 0 0 0 0 0 1 1 1 1 0
 0 1 0 0 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 1 1 0 0 1 1 1 0 0
 1 1 1 1 1 0 1 0 1 1 1 1 0 0 0 0 1 1 1 1 0 1 1 1 0 1 1 0 1 1 0 0 1 1 0 1 1
 0 0 0 1 0 1 1 0 0 0 0 1 0 0 1 0 1]
age [0 1 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 1 0 0 0 1 0 1 1 1 0 0 0
 1 1 0 0 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 0 0 1 0 1 0 0 0 0 1 1 0 0 1 1 0 1 0
 0 1 0 1 1 0 1 0 0 1 1 0 0 0 1 0 1 1 1 1 0 1 1 1 0 1 1 0 1 1 0 0 1 1 0 0 0
 0 0 0 1 0 1 1 0 0 0 0 1 0 0 0 0 1]
hypertension [0 1 0 1 0 0 0 1 0 1 0 1 1 1 1 0 1 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 1 1 1 1 0
 0 1 0 0 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 0 1 1 1 0 0 0 0 1 1 0 0 1 1 1 0 0
 1 1 1 1 1 0 1 0 1 1 1 1 0 0 0 0 1 1 1 1 0 1 1 1 0 1 1 0 1 1 0 0 1 1 0 1 1
 0 0 0 1 0 1 1 0 0 0 0 1 0 0 1 0 1]
heart_disease [0 1 0 1 0 0 0 1 0 0 0 1 1 1 1 0 1 0 1 0 1 0 1 0 0 1 0 0 0 0 0 0 1 1 1 1 0
 0 1 0 0 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 1 1 0 0 1 1 1 0 0
 1 1 1 1 1 0 1 0 1 1 1 1 

In [5]:
#this time loops the class to access the associated Recall scores when each attribute is removed
Recallscore_list = [] #converts the result into a list that will then be uploaded into a csv file 
for i in model:
    Recallscore = model[i].cross_validate()
    Recallscore_list.append(Recallscore)
    print (i,Recallscore)
    
    


everything [0.74509804 0.66666667 0.78431373 0.7254902  0.64705882 0.7254902
 0.84313725 0.80392157 0.76470588 0.80392157]
age [0.70588235 0.68627451 0.68627451 0.62745098 0.56862745 0.7254902
 0.70588235 0.7254902  0.68627451 0.7254902 ]
hypertension [0.78431373 0.66666667 0.78431373 0.70588235 0.66666667 0.70588235
 0.82352941 0.74509804 0.76470588 0.82352941]
heart_disease [0.76470588 0.64705882 0.78431373 0.70588235 0.68627451 0.7254902
 0.82352941 0.80392157 0.76470588 0.78431373]
ever_married [0.74509804 0.66666667 0.78431373 0.7254902  0.66666667 0.7254902
 0.84313725 0.78431373 0.76470588 0.80392157]
avg_glucose_level [0.80392157 0.70588235 0.76470588 0.64705882 0.66666667 0.7254902
 0.84313725 0.74509804 0.74509804 0.8627451 ]
bmi [0.74509804 0.66666667 0.80392157 0.70588235 0.64705882 0.70588235
 0.8627451  0.74509804 0.74509804 0.82352941]
Never_worked [0.74509804 0.64705882 0.78431373 0.74509804 0.64705882 0.7254902
 0.84313725 0.78431373 0.76470588 0.82352941]
Private [0.7
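
As a sketch of how these scores can be read, the snippet below takes the fold scores printed above for three of the attributes and computes how far the mean score falls when an attribute is removed. Only the values shown in the output are used; the `drops` dictionary is a name introduced for this illustration.

```python
import statistics

# Fold scores copied from the output above
scores = {
    "everything": [0.74509804, 0.66666667, 0.78431373, 0.7254902, 0.64705882,
                   0.7254902, 0.84313725, 0.80392157, 0.76470588, 0.80392157],
    "age": [0.70588235, 0.68627451, 0.68627451, 0.62745098, 0.56862745,
            0.7254902, 0.70588235, 0.7254902, 0.68627451, 0.7254902],
    "hypertension": [0.78431373, 0.66666667, 0.78431373, 0.70588235, 0.66666667,
                     0.70588235, 0.82352941, 0.74509804, 0.76470588, 0.82352941],
}

baseline = statistics.mean(scores["everything"])
# Positive drop: the mean score fell when the attribute was removed
drops = {name: baseline - statistics.mean(vals)
         for name, vals in scores.items() if name != "everything"}

for name, drop in sorted(drops.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {drop:+.4f}")
```

On these folds, removing age lowers the mean score far more than removing hypertension does. The spread between folds is of a similar size to the smaller differences, so those should be read with care.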

In [6]:
#Writes the obtained scores to a csv file that will then be used for plotting the graph.
#Scores are looked up by attribute name rather than by list position, because the order of the one hot encoded...
#...columns comes from a set and is not guaranteed to be the same between runs.
scores = dict(zip(model, Recallscore_list))

#attribute name in the model -> label written to the csv file, in the order the rows are written
csv_labels = {
    'everything': 'everything',
    'age': 'age',
    'hypertension': 'Hypertension',
    'heart_disease': 'heart_disease',
    'ever_married': 'ever_married',
    'avg_glucose_level': 'avg_glucose_level',
    'bmi': 'bmi',
    'Govt_job': 'Govt job',
    'children': 'children',
    'Self-employed': 'Self employed',
    'Never_worked': 'Never_worked',
    'Private': 'private',
    'smokes': 'smokes',
    'never smoked': 'never smoked',
    'formerly smoked': 'formerly smoked',
}

with open('data_graph.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(["Score", "Fold", "Attributes", "mean"])
    for i in range(10):
        for name, label in csv_labels.items():
            writer.writerow([scores[name][i], i+1, label, statistics.mean(scores[name])])
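
The file written above is in long format (one row per fold and attribute). As a hedged sketch of how it could be prepared for plotting, the snippet below reads a few made-up rows with the same four columns from an in-memory string and pivots them so that each attribute becomes one column. The scores are illustrative assumptions, not results from the notebook.

```python
import io
import pandas as pd

# Two folds of made-up scores in the same four-column layout as data_graph.csv
csv_text = """Score,Fold,Attributes,mean
0.75,1,everything,0.76
0.70,1,age,0.69
0.77,2,everything,0.76
0.68,2,age,0.69
"""

long_format = pd.read_csv(io.StringIO(csv_text))

# One row per fold, one column per attribute: convenient for one line per attribute
wide = long_format.pivot(index="Fold", columns="Attributes", values="Score")
print(wide)
```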
        
        
        

        