# Machine-Learning TD
By Corentin Meyer, PhD Student @ CSTB - iCube, 04/10 ESBS


# PART 1: Import, format and split the data

## The Data that we will use
## [Stroke Prediction Dataset](https://www.kaggle.com/fedesoriano/stroke-prediction-dataset)
### **Context**
According to the World Health Organization (WHO) stroke is the 2nd leading cause of death globally, responsible for approximately 11% of total deaths.
This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about the patient.
### **Attribute information**
1. **id:** unique identifier
2. **gender:** "Male", "Female" or "Other"
3. **age**: age of the patient
4. **hypertension**: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
5. **heart_disease**: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
6. **ever_married**: "No" or "Yes"
7. **work_type**: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
8. **Residence_type**: "Rural" or "Urban"
9. **avg_glucose_level**: average glucose level in blood
10. **bmi**: body mass index
11. **smoking_status**: "formerly smoked", "never smoked", "smokes" or "Unknown"
12. **stroke**: 1 if the patient had a stroke or 0 if not  

**Note:** "Unknown" in smoking_status means that the information is unavailable for this patient

## **Tasks:**
1. Import the data
2. Print the shape of the dataset and the first 5 lines
3. Calculate the ratio stroke/non-stroke
4. Plot histograms of the columns, separated by stroke vs non-stroke status.
5. Print the type of each column (number, object...)

## **Questions:**
1. How many entries (patients) are in the dataset?
2. How many columns (features) are there?
3. Plot the histogram of the age feature. Plot separate histograms for stroke vs non-stroke patients.
4. What percentage of the patients had a stroke?
5. Show the type of data in each column. What type of processing will each type require?

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
# PART 1: import the data and explore the data.
import pandas as pd
df = pd.read_csv("healthcare-dataset-stroke-data.csv")
df.set_index("id", inplace=True)
print(df.shape)
df.head()

In [None]:
print(df["stroke"].value_counts(normalize=True))
df.hist("age", by="stroke")

In [None]:
df.dtypes

## **Tasks**
You will now pre-process the data.
Each type of column has to be converted into a format that an ML algorithm can use.

## **Questions**
1. Which columns are categorical and which are numeric?
2. Which columns are already ready to be used and need no change?
3. What type of processing do you need to apply to categorical data, and why?
4. What type of processing do you need to apply to numeric data, and why?
5. Which columns contain missing data? What type of processing do you need to apply in this case?
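As a starting point for these questions, the sketch below shows one possible way to list missing values and to separate categorical from numeric columns with pandas. The `toy` frame and its values are hypothetical and only reuse three column names from the dataset; on the real data you would run the same calls on `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: column names mirror the dataset, values are made up
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],
    "age": [67.0, 61.0, 80.0, 49.0],
    "bmi": [36.6, np.nan, 32.5, 34.4],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Numeric columns can be scaled; everything else is treated as categorical here
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()
print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```

Note that binary 0/1 columns such as `hypertension` would show up as numeric with this approach, so the final grouping still needs a manual check.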

In [None]:
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
columns_nothing = ["hypertension", "heart_disease", "stroke"]
columns_categorical = ["gender","ever_married", "work_type", "Residence_type", "smoking_status"]
columns_numeric = ["age", "avg_glucose_level","bmi"]

X_nothing_to_do = df[columns_nothing]
X_nothing_to_do = X_nothing_to_do.to_numpy()

# Convert categorical data to numeric (one-hot encoding)
enc = OneHotEncoder()
X_cat = df[columns_categorical]
X_cat_onehot = enc.fit_transform(X_cat).toarray()
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
X_cat_columns_onehot = enc.get_feature_names_out()

# Handle Numeric data (scaling)
X_num = df[columns_numeric]
X_num_scaled = StandardScaler().fit_transform(X_num)
X_num_columns_scaled = X_num.columns

array_data = np.concatenate((X_cat_onehot, X_num_scaled, X_nothing_to_do), axis=1)

# Handle NaN values (Multivariate imputer that estimates each feature from all the others)
imp_mean = IterativeImputer(random_state=777)
imp_mean.fit(array_data)
array_data = imp_mean.transform(array_data)

# Recreate the dataframe from all processed-data
df_data  = pd.DataFrame(data=array_data, columns=list(X_cat_columns_onehot) + list(X_num_columns_scaled) + columns_nothing)

## **Tasks**
Now you can split the data into a training set and a testing set.

## **Questions**
1. What train/test ratio should you use?
2. How many entries are in your train dataset and in your test dataset?
3. Verify that the stroke / no-stroke ratio is the same in the train and test datasets.
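If the ratios differ more than you would like, one option is the `stratify` argument of `train_test_split`, which keeps the class proportions identical in both subsets. The sketch below illustrates this on synthetic data; the arrays `X_toy` and `y_toy` and the 5% positive rate are assumptions made for the example, not values from the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced toy data: 50 positives out of 1000 samples (5%)
rng = np.random.default_rng(777)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 50 + [0] * 950)

# stratify=y_toy asks scikit-learn to preserve the class ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=777
)

# Fraction of positives in each subset
print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```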

In [None]:
from sklearn.model_selection import train_test_split

X = array_data[:,:-1]
Y = array_data[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=777)
print(X_train.shape)
print(X_test.shape)
print(np.unique(y_train, return_counts=True)[1])
print(np.unique(y_test, return_counts=True)[1])

# PART 2: Create your machine-learning model

## **Tasks:**
Select a model from scikit-learn and train it (fit) on the train data. Then calculate the accuracy of the model on the test data. Plot the confusion matrix of the test data classification.

## **Questions:**
1. Which model did you choose and why? Have you set any particular (hyper)parameters?
2. What accuracy score do you get and what conclusion can you draw?
3. What do you observe on the confusion matrix and what conclusion can you draw?
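When interpreting the accuracy score, it can help to compare it with a trivial baseline. The sketch below uses scikit-learn's `DummyClassifier`, which is not part of the original exercise, on hypothetical labels with 5% positives: a model that always predicts the majority class reaches a high accuracy while detecting no positive case at all.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 5 positives and 95 negatives; the features are irrelevant here
y_true = np.array([1] * 5 + [0] * 95)
X_dummy = np.zeros((100, 1))

# Baseline that always predicts the most frequent class (0)
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
y_pred = baseline.predict(X_dummy)

baseline_accuracy = accuracy_score(y_true, y_pred)
baseline_recall = recall_score(y_true, y_pred)
print("accuracy:", baseline_accuracy)
print("recall:", baseline_recall)
```

A real model is only informative on imbalanced data if it clearly beats this kind of baseline on metrics other than accuracy.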

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# plot_confusion_matrix was removed in scikit-learn 1.2.
# Keep the old name as an alias of its replacement so later cells still work.
plot_confusion_matrix = ConfusionMatrixDisplay.from_estimator

clf = RandomForestClassifier(class_weight='balanced', random_state=777).fit(X_train, y_train)
print(clf.score(X_test, y_test))
plot_confusion_matrix(clf, X_test, y_test)
plt.show()

## **Tasks:**
You will now downsample the majority class to the size of the minority class to obtain a 50/50 ratio, then redo the train/test split.  
Then you will train a new model on the ratio-corrected data and compute its accuracy and confusion matrix plot.

## **Questions**
1. What accuracy score do you get with the new model and what conclusion can you draw?
2. What do you observe on the confusion matrix and what conclusion can you draw?

In [None]:
# Indices of each class's observations
# Source: https://chrisalbon.com/code/machine_learning/preprocessing_structured_data/handling_imbalanced_classes_with_downsampling/
np.random.seed(777)
i_class0 = np.where(Y == 0)[0]
i_class1 = np.where(Y == 1)[0]
# Number of observations in each class
n_class0 = len(i_class0)
n_class1 = len(i_class1)
# Randomly sample, without replacement, as many class 0 (majority) observations
# as there are class 1 (minority) observations
i_class0_downsampled = np.random.choice(i_class0, size=n_class1, replace=False)
# Join the downsampled class 0 observations with all class 1 observations
Y_down = np.hstack((Y[i_class0_downsampled], Y[i_class1]))
X_down = np.vstack((X[i_class0_downsampled], X[i_class1]))

# Retrain now with sampled-down data
X_train_down, X_test_down, y_train_down, y_test_down = train_test_split(X_down, Y_down, test_size=0.4, random_state=777)
clf_down = RandomForestClassifier(class_weight='balanced', random_state=777).fit(X_train_down, y_train_down)
print(clf_down.score(X_test_down, y_test_down))
plot_confusion_matrix(clf_down, X_test_down, y_test_down)  
plt.show()  

# PART 3: Evaluate your model

## **Tasks:**
In this last part you will calculate all relevant metrics for a binary classification in order to compare your two models.  
Make a table containing the results for both models in terms of: accuracy, balanced accuracy, F1 score, sensitivity (recall), specificity, precision and confusion matrix data (True Pos., True Neg., False Pos., False Neg.)

## **Questions:**
1. Which model has the best accuracy?
2. Which model has the best area under the curve (AUC) for the ROC curve?
3. Which model has the best F1 score and sensitivity?
4. In the end, which model is better in your opinion?
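All the metrics in the table can be derived from the four confusion matrix counts. The sketch below works through the arithmetic with hypothetical counts (they are not results from the dataset), which may help when checking the values returned by scikit-learn.

```python
# Hypothetical confusion matrix counts, chosen only to illustrate the formulas
tp, tn, fp, fn = 40, 900, 50, 10

# Sensitivity (recall): share of actual positives that were detected
sensitivity = tp / (tp + fn)
# Specificity: share of actual negatives that were correctly rejected
specificity = tn / (tn + fp)
# Precision: share of predicted positives that are actual positives
precision = tp / (tp + fp)
# Accuracy: share of all predictions that are correct
accuracy = (tp + tn) / (tp + tn + fp + fn)
# Balanced accuracy: mean of sensitivity and specificity
balanced_accuracy = (sensitivity + specificity) / 2
# F1 score: harmonic mean of precision and sensitivity
f1 = 2 * precision * sensitivity / (precision + sensitivity)

for name, value in [
    ("sensitivity", sensitivity),
    ("specificity", specificity),
    ("precision", precision),
    ("accuracy", accuracy),
    ("balanced accuracy", balanced_accuracy),
    ("F1", f1),
]:
    print(f"{name}: {value:.3f}")
```

With these counts the accuracy is high while the precision is below 0.5, which shows why a single metric is rarely enough on imbalanced data.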

In [None]:
# Calculate all relevant metrics for a binary classification
# Accuracy, Balanced Accuracy, F1 Score, Sensitivity (Recall), Specificity, Precision, TP TN FP FN
# Also plot the ROC Curve and the Precision-Recall curve and AUC.
from sklearn.metrics import accuracy_score
from sklearn.metrics import balanced_accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score
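from sklearn.metrics import precision_recall_curve, average_precision_score, PrecisionRecallDisplay
# plot_precision_recall_curve was removed in scikit-learn 1.2.
# Keep the old name as an alias of its replacement so later cells still work.
plot_precision_recall_curve = PrecisionRecallDisplay.from_estimator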
from sklearn.metrics import roc_curve, auc

def roc_plot_single(clf, X_test, y_test):
	#Plot the ROC Curve and include AUC in figure.
	probas = clf.predict_proba(X_test)
	fpr, tpr, thresholds = roc_curve(y_test, probas[:, 1])
	roc_auc = auc(fpr, tpr)
	plt.figure()
	lw = 2
	plt.plot(fpr, tpr, color='darkorange',lw=lw, label='ROC curve (area = %0.2f)' % roc_auc)
	plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
	plt.xlim([0.0, 1.0])
	plt.ylim([0.0, 1.05])
	plt.xlabel('False Positive Rate')
	plt.ylabel('True Positive Rate')
	plt.title('ROC Curve')
	plt.legend(loc="lower right")
	plt.show()

def get_all_metrics(clf, X_test, y_test):
	y_pred = clf.predict(X_test)

	#calculate and store evaluation metrics
	tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

	ac = accuracy_score(y_test, y_pred)
	bac = balanced_accuracy_score(y_test, y_pred)
	re = recall_score(y_test, y_pred)
	pr = precision_score(y_test, y_pred)
	f1 = f1_score(y_test, y_pred)

	#calculate specificity
	if tn == 0 and fp == 0:
		sp = 0
	else:
		sp = tn/float(tn+fp)

	return [bac, ac, f1, re, sp, pr, tp, tn, fp, fn]

def make_table_clf(clf, clf_down, X_test, y_test, X_test_down, y_test_down):
	results = get_all_metrics(clf, X_test, y_test)
	results_down = get_all_metrics(clf_down, X_test_down, y_test_down)
	df = pd.DataFrame([results, results_down], columns=["Balanced-Accuracy", "Accuracy", "F1-Score", "Sensitivity (Recall)", "Specificity", "Precision", "TP", "TN", "FP", "FN"], index=["CLF","CLF Down"])
	return df


In [None]:
make_table_clf(clf, clf_down, X_test, y_test, X_test_down, y_test_down)

In [None]:
plot_precision_recall_curve(clf, X_test, y_test)
roc_plot_single(clf, X_test, y_test)

In [None]:
plot_precision_recall_curve(clf_down, X_test_down, y_test_down)
roc_plot_single(clf_down, X_test_down, y_test_down)

# BONUS 1: Do Cross-validation instead of simple test/train split.

## **Tasks:**  
Instead of a simple train/test split, we will do cross-validation. This means that we will train multiple models on different splits, so that every sample is used for training in some folds and for testing in another. Then we will average the results of all models.  
For example, an 80/20% train/test split corresponds to a 5-fold cross-validation, in which each sample ends up in the test set exactly once.  
Find a way to do "stratified k fold", calculate "cross val score" with scikit-learn and print the mean and standard deviation of the score.
## **Questions:**  
1. What is the point of cross-validation? Did it increase performance? If not, what is it useful for?
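To make the splitting behaviour concrete, the sketch below runs `StratifiedKFold` on hypothetical labels (10 positives, 40 negatives) and checks two properties: every sample lands in a test fold exactly once, and each fold keeps the same number of positives. The toy arrays are assumptions made for the illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels: 10 positives and 40 negatives; the features are irrelevant here
y_toy = np.array([1] * 10 + [0] * 40)
X_toy = np.zeros((50, 1))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=777)

# Count how many times each sample is used for testing, and positives per test fold
test_counts = np.zeros(len(y_toy), dtype=int)
positives_per_fold = []
for train_idx, test_idx in cv.split(X_toy, y_toy):
    test_counts[test_idx] += 1
    positives_per_fold.append(int(y_toy[test_idx].sum()))

print("times each sample was tested:", np.unique(test_counts))
print("positives per test fold:", positives_per_fold)
```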

In [None]:
from sklearn.model_selection import StratifiedKFold, cross_val_score

cv = StratifiedKFold(n_splits=5, shuffle=True)
cross_scores = cross_val_score(
    clf_down,
    X_train_down,
    y_train_down,
    cv=cv,
    scoring="f1",
    error_score="raise",
)
print(cross_scores)
print(np.mean(cross_scores))
print(np.std(cross_scores))

# BONUS 2:  Automatic hyper-parameters tuning

## **Tasks:**
We will use Optuna to automatically find the best parameters for the chosen ML algorithm (Random Forest here). We define a number of trials (50), a scoring metric to maximize (F1 score) and a parameter space to explore (param_grid).  
In the end we will have a model with the best parameters found, and we will compare its metrics to those of our previous classifier (clf_down), which was not optimised.
## **Questions:**  
1. What are the best parameters found? Are your best parameters different from those of other students? Why?
2. Did the metrics improve? Was the optimisation useful?

In [None]:
import optuna
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold, cross_val_score

def make_table_clf_opti(clf, clf_down, X_test, y_test, X_test_down, y_test_down):
	results = get_all_metrics(clf, X_test, y_test)
	results_down = get_all_metrics(clf_down, X_test_down, y_test_down)
	df = pd.DataFrame([results, results_down], columns=["Balanced-Accuracy", "Accuracy", "F1-Score", "Sensitivity (Recall)", "Specificity", "Precision", "TP", "TN", "FP", "FN"], index=["CLF Down","CLF Down Optimized"])
	return df

def objective_RF(
    trial, est, x_train, y_train, param_grid, scoring_metric):
    """
    A function that is used by Optuna to get the scoring of a model given the parameters of a specific trial.
    """
    params = {
        "n_estimators": trial.suggest_int(
            "n_estimators", param_grid["n_estimators"][0], param_grid["n_estimators"][1]
        ),
        "criterion": trial.suggest_categorical("criterion", param_grid["criterion"]),
        "max_depth": trial.suggest_int(
            "max_depth", param_grid["max_depth"][0], param_grid["max_depth"][1]
        ),
        "min_samples_split": trial.suggest_int(
            "min_samples_split",
            param_grid["min_samples_split"][0],
            param_grid["min_samples_split"][1],
        ),
        "min_samples_leaf": trial.suggest_int(
            "min_samples_leaf",
            param_grid["min_samples_leaf"][0],
            param_grid["min_samples_leaf"][1],
        ),
        "max_features": trial.suggest_categorical(
            "max_features", param_grid["max_features"]
        ),
        "bootstrap": trial.suggest_categorical("bootstrap", param_grid["bootstrap"]),
        "oob_score": trial.suggest_categorical("oob_score", param_grid["oob_score"]),
        "n_jobs": trial.suggest_categorical("n_jobs", param_grid["n_jobs"]),
        "class_weight": trial.suggest_categorical(
            "class_weight", param_grid["class_weight"]
        ),
    }
    cv = StratifiedKFold(n_splits=5, shuffle=True)
    model = clone(est).set_params(**params)
    performance = np.mean(
        cross_val_score(
            model,
            x_train,
            y_train,
            cv=cv,
            scoring=scoring_metric,
            error_score="raise",
        )
    )
    return performance

#Random Forest parameters space
param_grid = {'n_estimators': [10,400],
                'criterion' : ['gini', 'entropy'],
                'max_depth' : [1, 15],
                'min_samples_split' : [2, 50], 
                'min_samples_leaf' : [1, 50],
                'max_features' : [None, 'sqrt','log2'],
                'bootstrap' : [True],
                'oob_score' : [False, True],
                'n_jobs' : [-1],
                'class_weight' : [None, 'balanced']}

# Run Hyperparameter sweep: first we set up some settings.
n_trials = 50
scoring_metric = "f1"
timeout = 300 
est = RandomForestClassifier()
sampler = optuna.samplers.TPESampler()
study = optuna.create_study(direction="maximize", sampler=sampler)
optuna.logging.set_verbosity(optuna.logging.CRITICAL)

# We launch the optimization using the previous function as score-returning.
study.optimize(
    lambda trial: objective_RF(
        trial, est, X_train_down, y_train_down, param_grid, scoring_metric),
    n_trials=n_trials,
    timeout=timeout,
    catch=(ValueError,),
)
# We retrieve the best results now that the optimization is finished.
print("Best trial:")
best_trial = study.best_trial
print("  Score: ", best_trial.value)
print("  Params: ")
for key, value in best_trial.params.items():
    print("    {}: {}".format(key, value))

# Train a new model using the best hyperparameters found.
# A separate name (clf_opti) avoids overwriting the first classifier (clf).
est = RandomForestClassifier()
clf_opti = clone(est).set_params(**best_trial.params)

model = clf_opti.fit(X_train_down, y_train_down)

plot_precision_recall_curve(model, X_test_down, y_test_down)
roc_plot_single(model, X_test_down, y_test_down)
make_table_clf_opti(clf_down, model, X_test_down, y_test_down, X_test_down, y_test_down)

# Additional Resources

If you want to become an AI Expert: AI Roadmap [https://i.am.ai/roadmap](https://i.am.ai/roadmap)  
Visualisation in Python [https://www.python-graph-gallery.com/](https://www.python-graph-gallery.com/)  
Public Dataset to create projects [https://github.com/awesomedata/awesome-public-datasets](https://github.com/awesomedata/awesome-public-datasets)  
Machine Learning Competitions with Prize [https://www.kaggle.com/](https://www.kaggle.com/)