<a href="https://colab.research.google.com/github/mmassonn/heart_attack_prediction/blob/main/heart_attack_prediction_tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Project: Heart attack prediction


##I. Define the objective

Objective: predict the risk of developing heart disease from clinical data.

Metric: F1 score
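
As a reminder of what the metric measures, here is a minimal sketch that computes the F1 score of the positive class by hand, as the harmonic mean of precision and recall. The helper name `f1_from_labels` and the toy labels are made up for illustration; the notebook itself relies on `sklearn.metrics.f1_score`.

```python
def f1_from_labels(y_true, y_pred):
    """Compute the F1 score of the positive class (label 1) by hand."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# toy labels, invented for this example only
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]
toy_f1 = f1_from_labels(y_true, y_pred)
print(toy_f1)
```

With 3 true positives, 1 false positive and 1 false negative, precision and recall are both 3/4, so the F1 score is 0.75.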

##II. Import the libraries/frameworks

In [3]:
#import pre-processing packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

In [4]:
#import evaluation packages
from sklearn.metrics import f1_score
from sklearn.model_selection import learning_curve

In [5]:
#Connect drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


##III. Load data

In [6]:
#load data
df = pd.read_csv('drive/MyDrive/Projet_2022/heart_attack_prediction/dataset.csv')

In [7]:
df.head()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
2,37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
3,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
4,54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0


#IV. Pre-processing

In [8]:
#remove Cholesterol column
df = df.drop('Cholesterol', axis=1) 

##1.Split the dataset

In [9]:
#split Train and Test set  
train_set, test_set = train_test_split(df, test_size=0.2, random_state=0)

##2.Distribution of the variables

###HeartDisease

In [10]:
train_set['HeartDisease'].value_counts(normalize = True)

1    0.546322
0    0.453678
Name: HeartDisease, dtype: float64

In [None]:
test_set['HeartDisease'].value_counts(normalize = True)

1    0.581522
0    0.418478
Name: HeartDisease, dtype: float64

###Sex

In [None]:
train_set['Sex'].value_counts(normalize = True)

M    0.788828
F    0.211172
Name: Sex, dtype: float64

In [None]:
test_set['Sex'].value_counts(normalize = True)

M    0.793478
F    0.206522
Name: Sex, dtype: float64

###ChestPainType

In [None]:
train_set['ChestPainType'].value_counts(normalize = True)

ASY    0.532698
NAP    0.220708
ATA    0.197548
TA     0.049046
Name: ChestPainType, dtype: float64

In [None]:
test_set['ChestPainType'].value_counts(normalize = True)

ASY    0.570652
NAP    0.222826
ATA    0.152174
TA     0.054348
Name: ChestPainType, dtype: float64

###FastingBS

In [None]:
train_set['FastingBS'].value_counts(normalize = True)

0    0.782016
1    0.217984
Name: FastingBS, dtype: float64

In [None]:
test_set['FastingBS'].value_counts(normalize = True)

0    0.706522
1    0.293478
Name: FastingBS, dtype: float64

###RestingECG

In [None]:
train_set['RestingECG'].value_counts(normalize = True)

Normal    0.606267
ST        0.197548
LVH       0.196185
Name: RestingECG, dtype: float64

In [None]:
test_set['RestingECG'].value_counts(normalize = True)

Normal    0.581522
LVH       0.239130
ST        0.179348
Name: RestingECG, dtype: float64

###ExerciseAngina

In [None]:
train_set['ExerciseAngina'].value_counts(normalize = True)

N    0.589918
Y    0.410082
Name: ExerciseAngina, dtype: float64

In [None]:
test_set['ExerciseAngina'].value_counts(normalize = True)

N    0.619565
Y    0.380435
Name: ExerciseAngina, dtype: float64

###ST_Slope

In [None]:
train_set['ST_Slope'].value_counts(normalize = True)

Flat    0.495913
Up      0.433243
Down    0.070845
Name: ST_Slope, dtype: float64

In [None]:
test_set['ST_Slope'].value_counts(normalize = True)

Flat    0.521739
Up      0.418478
Down    0.059783
Name: ST_Slope, dtype: float64

**Conclusion:** The variables are distributed similarly in the training and test sets; the largest difference in proportions is about 0.075, for FastingBS.
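
The cell-by-cell comparison above could also be condensed into one helper. The following is only a sketch: `max_proportion_gap` is a hypothetical name, and the toy DataFrames stand in for `train_set` and `test_set`.

```python
import pandas as pd

def max_proportion_gap(train, test, columns):
    """For each column, return the largest absolute difference between
    the category proportions of the training and test sets."""
    gaps = {}
    for col in columns:
        p_train = train[col].value_counts(normalize=True)
        p_test = test[col].value_counts(normalize=True)
        # align on the union of categories; a category missing on one side counts as 0
        diff = p_train.sub(p_test, fill_value=0).abs()
        gaps[col] = diff.max()
    return pd.Series(gaps).sort_values(ascending=False)

# toy data, invented for this example only
toy_train = pd.DataFrame({'Sex': ['M', 'M', 'M', 'F'],
                          'ST_Slope': ['Up', 'Flat', 'Flat', 'Down']})
toy_test = pd.DataFrame({'Sex': ['M', 'F'],
                         'ST_Slope': ['Up', 'Flat']})
toy_gaps = max_proportion_gap(toy_train, toy_test, ['Sex', 'ST_Slope'])
print(toy_gaps)
```

On the real data, the call would presumably look like `max_proportion_gap(train_set, test_set, ['HeartDisease', 'Sex', 'ChestPainType', 'FastingBS', 'RestingECG', 'ExerciseAngina', 'ST_Slope'])`.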

##3.Normalizing the variables

###a.Standardization of the quantitative variables

In [11]:
#standardization
def make_standard_normal(df):
    df = df[['Age', 'RestingBP', 'MaxHR', 'Oldpeak']]
    #calculate the mean and standard deviation of the DataFrame passed in
    #(note: each set, including the test set, is scaled with its own statistics)
    mean = df.mean(axis = 0)
    stdev = df.std(axis = 0)
    #standardize each column to zero mean and unit standard deviation
    df = (df-mean)/stdev
    return df
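
Because the function above recomputes the mean and standard deviation on whatever DataFrame it receives, the test set ends up scaled with its own statistics. A common alternative is to compute the statistics on the training set only and reuse them on the test set. The sketch below illustrates that variant; `fit_standardizer` and `apply_standardizer` are hypothetical names, the toy data is invented, and this is not what the notebook runs.

```python
import pandas as pd

QUANT_COLS = ['Age', 'RestingBP', 'MaxHR', 'Oldpeak']

def fit_standardizer(train_df, cols=QUANT_COLS):
    """Return the per-column mean and standard deviation of the training set."""
    return train_df[cols].mean(axis=0), train_df[cols].std(axis=0)

def apply_standardizer(df, mean, stdev, cols=QUANT_COLS):
    """Standardize df with statistics computed elsewhere (here: on the training set)."""
    return (df[cols] - mean) / stdev

# toy data, invented for this example only
toy_train = pd.DataFrame({'Age': [40, 50, 60],
                          'RestingBP': [120, 130, 140],
                          'MaxHR': [150, 160, 170],
                          'Oldpeak': [0.0, 1.0, 2.0]})
toy_test = pd.DataFrame({'Age': [50], 'RestingBP': [140],
                         'MaxHR': [150], 'Oldpeak': [1.0]})

mean, stdev = fit_standardizer(toy_train)
toy_train_std = apply_standardizer(toy_train, mean, stdev)
toy_test_std = apply_standardizer(toy_test, mean, stdev)  # reuses the training statistics
print(toy_test_std)
```

scikit-learn's `StandardScaler` follows the same fit-on-train, transform-on-test pattern, although it uses the population standard deviation rather than the sample one used by `DataFrame.std`.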

###b.Encoding of the qualitative variables

In [12]:
#encoding function: one-hot encode the qualitative variables
def encodage_function(df):
  qualitative_cols = ['Sex', 'ChestPainType', 'FastingBS', 'RestingECG', 'ExerciseAngina', 'ST_Slope']
  df = pd.get_dummies(df[qualitative_cols], columns=qualitative_cols)
  return df
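
`pd.get_dummies` only creates columns for the categories it actually sees, so encoding the training and test sets separately yields the same columns only if every category appears in both. That holds for the distributions shown earlier, but it is not guaranteed in general. The sketch below, on invented toy data, shows one way to force the test matrix onto the training columns with `DataFrame.reindex`.

```python
import pandas as pd

# toy data, invented for this example: 'ATA' and 'TA' are absent from the test set
toy_train = pd.DataFrame({'ChestPainType': ['ASY', 'NAP', 'ATA', 'TA']})
toy_test = pd.DataFrame({'ChestPainType': ['ASY', 'NAP']})

train_dummies = pd.get_dummies(toy_train, columns=['ChestPainType'])
test_dummies = pd.get_dummies(toy_test, columns=['ChestPainType'])

# give the test matrix exactly the training columns, in the same order;
# categories absent from the test set become all-zero columns
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(list(test_aligned.columns))
```

A model fitted on `train_dummies` can then be applied to `test_aligned` without a column mismatch.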

###c.Pre-processing function

In [13]:
#define pre-processing
def preprocessing(df):
  df1 = make_standard_normal(df)
  df2 = encodage_function(df)
  Y = df[['HeartDisease']] 
  X = pd.concat([df1, df2], axis=1)
  return X,Y

In [14]:
#apply pre-processing
X_train, y_train = preprocessing(train_set)
X_test, y_test = preprocessing(test_set)

#V. Modelling

###1.Bagging

In [15]:
from sklearn.ensemble import RandomForestClassifier

In [16]:
model = RandomForestClassifier(n_estimators=100)

model.fit(X_train, y_train.values.ravel())
y_preds = model.predict(X_test)

print('F1 Score - RandomForestClassifier:', f1_score(y_test, y_preds)) 

F1 Score - RandomForestClassifier: 0.8755760368663594


###2.Tuning the hyperparameters

A random forest can be difficult to tune because it has many **hyperparameters**.

A systematic search is usually a sensible approach. Popular search procedures include random search and grid search; here I use scikit-learn's **GridSearchCV**.

In my case, I grid search three hyperparameters of the random forest: the number of trees in the ensemble (`n_estimators`), the number of features considered at each split (`max_features`), and the split criterion (`criterion`). I use a range of commonly used values for each hyperparameter.

Each configuration is evaluated with repeated stratified k-fold cross-validation, and the configurations are compared by their mean score, here the F1 score.

**Search for the best hyperparameters**

In [17]:
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import GridSearchCV

In [42]:
# define the model with default hyperparameters
model = RandomForestClassifier()

In [43]:
# define the grid of values to search
grid = dict()
grid['n_estimators'] = [10, 50, 100, 500, 1000]
# note: 'auto' behaves like 'sqrt' for classifiers and is no longer accepted by scikit-learn 1.3+
grid['max_features'] = ['auto', 'sqrt', 'log2']
grid['criterion'] = ['gini', 'entropy']

In [44]:
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

In [45]:
# define the grid search procedure
grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='f1', verbose=2)

In [46]:
# execute the grid search
grid_result = grid_search.fit(X_train, y_train.values.ravel())

Fitting 30 folds for each of 30 candidates, totalling 900 fits
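
The numbers in this log line follow directly from the grid and the cross-validation settings defined above. As a quick check, this sketch recomputes them with the standard library only; the variable names are chosen for this example.

```python
from itertools import product

# same grid and cross-validation settings as in the cells above
grid = {
    'n_estimators': [10, 50, 100, 500, 1000],
    'max_features': ['auto', 'sqrt', 'log2'],
    'criterion': ['gini', 'entropy'],
}
n_splits, n_repeats = 10, 3

# every combination of one value per hyperparameter is one candidate
candidates = list(product(*grid.values()))
n_candidates = len(candidates)          # 5 * 3 * 2
n_folds = n_splits * n_repeats          # each candidate is fitted once per fold
total_fits = n_candidates * n_folds
print(n_candidates, n_folds, total_fits)
```

This gives 30 candidates, 30 folds per candidate and 900 fits in total, which matches the message printed by `GridSearchCV`.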


In [47]:
# summarize the best score and configuration
print("Best F1 score: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

Best F1 score: 0.876555 using {'criterion': 'entropy', 'max_features': 'sqrt', 'n_estimators': 100}


**Training + Prediction**

In [48]:
model = RandomForestClassifier(criterion='entropy', max_features='sqrt', n_estimators=100)

model.fit(X_train, y_train.values.ravel())
y_preds = model.predict(X_test)

print('F1 Score - RandomForestClassifier:', f1_score(y_test, y_preds)) 

F1 Score - RandomForestClassifier: 0.890909090909091
