# Healthcare prediction

Goal: predict the risk of stroke from existing data.
The data comes from: https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset.

Attribute Information

1. id: unique identifier
2. gender: "Male", "Female" or "Other"
3. age: age of the patient
4. hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
5. heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
6. ever_married: "No" or "Yes"
7. work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
8. Residence_type: "Rural" or "Urban"
9. avg_glucose_level: average glucose level in blood
10. bmi: body mass index
11. smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*
12. stroke: 1 if the patient had a stroke or 0 if not

**Note:** "Unknown" in smoking_status means that the information is unavailable for this patient

In [47]:
import pandas as pd
from sklearn.model_selection import train_test_split
from autogluon.tabular import TabularDataset, TabularPredictor

## Loading the data

In [48]:
df = pd.read_csv('/data/healthcare-dataset-stroke-data.csv')
print(df.head(5))
print(df.describe())

      id  gender   age  ...   bmi   smoking_status stroke
0   9046    Male  67.0  ...  36.6  formerly smoked      1
1  51676  Female  61.0  ...   NaN     never smoked      1
2  31112    Male  80.0  ...  32.5     never smoked      1
3  60182  Female  49.0  ...  34.4           smokes      1
4   1665  Female  79.0  ...  24.0     never smoked      1

[5 rows x 12 columns]
                 id          age  ...          bmi       stroke
count   5110.000000  5110.000000  ...  4909.000000  5110.000000
mean   36517.829354    43.226614  ...    28.893237     0.048728
std    21161.721625    22.612647  ...     7.854067     0.215320
min       67.000000     0.080000  ...    10.300000     0.000000
25%    17741.250000    25.000000  ...    23.500000     0.000000
50%    36932.000000    45.000000  ...    28.100000     0.000000
75%    54682.000000    61.000000  ...    33.100000     0.000000
max    72940.000000    82.000000  ...    97.600000     1.000000

[8 rows x 7 columns]
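
The `describe()` output above already hints at two things worth checking before training: `bmi` has 4909 values for 5110 rows, and the mean of `stroke` is about 0.049. A minimal sketch of how both could be checked directly, assuming pandas; the helper name `summarize_quality` and the toy frame are illustrative and not part of the original notebook:

```python
import pandas as pd


def summarize_quality(frame: pd.DataFrame, label: str = "stroke") -> dict:
    """Return missing-value counts per column and the share of positive labels."""
    missing = frame.isna().sum()
    return {
        "missing": missing[missing > 0].to_dict(),
        "positive_rate": float(frame[label].mean()),
    }


# Toy frame standing in for df (made-up values, same column names).
toy = pd.DataFrame({
    "bmi": [36.6, None, 32.5, 34.4],
    "stroke": [1, 0, 0, 0],
})
print(summarize_quality(toy))
```

On the real data, `summarize_quality(df)` would be expected to report at least the 201 missing `bmi` values implied by the counts above.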


## Split dataset

The dataset is split into two parts:

- a training set
- a test set

In [49]:
df_train, df_test = train_test_split(df, test_size=0.33, random_state=1)
print(df_train.shape)
print(df_test.shape)

(3423, 12)
(1687, 12)
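
With fewer than 5% of rows labelled as stroke, a purely random split can leave the two sets with slightly different proportions of positive cases. One possible refinement, shown here as a sketch on made-up data rather than as the notebook's method, is the `stratify` argument of `train_test_split`:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 1000 rows with 5% positives, standing in for df.
toy = pd.DataFrame({"age": range(1000), "stroke": [1] * 50 + [0] * 950})

toy_train, toy_test = train_test_split(
    toy,
    test_size=0.33,
    random_state=1,
    stratify=toy["stroke"],  # keep the stroke rate similar in both parts
)
print(toy_train["stroke"].mean(), toy_test["stroke"].mean())
```

Both printed rates should stay close to the overall 5% of the toy data.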


## Modifying the test set

The "stroke" column must be removed from the test set, since it is the value the model has to predict.

In [50]:
test_data = df_test.drop(['stroke'], axis=1)
print(test_data.shape)
print(test_data.head(5))

(1687, 11)
         id  gender   age  ...  avg_glucose_level   bmi   smoking_status
4673  49833  Female  42.0  ...             112.98  37.2  formerly smoked
3232  20375  Female  78.0  ...              78.29  30.1  formerly smoked
3694  39834    Male  28.0  ...              73.27  25.4           smokes
1070  42550  Female  81.0  ...             246.34  21.1     never smoked
4163  19907  Female  52.0  ...              97.05  28.0          Unknown

[5 rows x 11 columns]


## Prediction

We need to train a model that classifies whether an individual with these characteristics is at risk of having a stroke.

To do this, we use the **TabularPredictor** class, specifying that the output must be the 'stroke' column.
AutoGluon then trains several algorithms automatically and keeps the one that scores best on validation data.

Optional arguments:

- 'verbosity=2' prints each step taken during training on the way to the best model
- 'presets="best_quality"' aims for the best predictive accuracy, with little consideration for inference time or disk usage, and the best of the trained models is selected.

Additional arguments, described in the official documentation, can be used to fine-tune the model.
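
As an illustration of such additional arguments, the sketch below sets a time budget with `time_limit` and a different selection metric with `eval_metric`. The helper name `fit_stroke_predictor` and the default values are assumptions chosen for the example, not settings used in this notebook; AutoGluon is imported inside the function so that the definition can be loaded without it.

```python
def fit_stroke_predictor(train_df, time_limit_s=600, metric="roc_auc"):
    """Hypothetical helper: train a predictor with a time budget and a chosen metric.

    train_df is expected to be a pandas DataFrame containing a 'stroke' column.
    """
    # Imported lazily so this sketch can be defined without AutoGluon installed.
    from autogluon.tabular import TabularPredictor

    # eval_metric controls which score is used to rank and select models.
    predictor = TabularPredictor(label="stroke", eval_metric=metric)

    # time_limit is the overall training budget in seconds.
    return predictor.fit(
        train_data=train_df,
        presets="best_quality",
        time_limit=time_limit_s,
    )
```

A call such as `fit_stroke_predictor(df_train, time_limit_s=300)` would then return a fitted predictor, in the same way as the cell below.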

In [51]:
predictor = TabularPredictor(label='stroke').fit(train_data=df_train, verbosity=2, presets='best_quality')

No path specified. Models will be saved in: "AutogluonModels/ag-20230621_123542/"
Presets specified: ['best_quality']
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20230621_123542/"
AutoGluon Version:  0.3.1
Train Data Rows:    3423
Train Data Columns: 11
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  [0, 1]
	If 'binary' is not the correct problem_type, please manually specify the problem_type argument in fit() (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping:  class 1 = 1, class 0 = 0
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    16538.71 MB
	Train Data (Original)  Memory Usage: 1.25 MB (0.0% of available memory)
	Inferring data type of each feature based on column values. Set feature_metadata_in 

Training can take quite a long time.

As the logs scroll by, we can see that **AutoGluon** infers a binary classification problem automatically, without having to be told.
The result is therefore a single label, "0" or "1", in the output column.

We also note that it selects the best-suited model based on accuracy.
Once training is finished, a summary of all the models and their scores can be displayed with **predictor.fit_summary()**

In [52]:
predictor.fit_summary()

*** Summary of fit() ***
Estimated performance of each model:
                      model  score_val  pred_time_val    fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0           LightGBM_BAG_L2   0.960561       3.293508  495.618971                0.096124          16.790256            2       True         15
1       WeightedEnsemble_L3   0.960561       3.305000  497.805954                0.011492           2.186983            3       True         24
2           CatBoost_BAG_L2   0.958808       3.268074  488.427226                0.070690           9.598511            2       True         18
3            XGBoost_BAG_L2   0.958516       3.308910  500.660260                0.111526          21.831545            2       True         21
4         LightGBMXT_BAG_L1   0.958224       0.093506   11.665509                0.093506          11.665509            1       True          3
5       WeightedEnsemble_L2   0.958224       0.105568   14.285967         



{'model_types': {'KNeighborsUnif_BAG_L1': 'StackerEnsembleModel_KNN',
  'KNeighborsDist_BAG_L1': 'StackerEnsembleModel_KNN',
  'LightGBMXT_BAG_L1': 'StackerEnsembleModel_LGB',
  'LightGBM_BAG_L1': 'StackerEnsembleModel_LGB',
  'RandomForestGini_BAG_L1': 'StackerEnsembleModel_RF',
  'RandomForestEntr_BAG_L1': 'StackerEnsembleModel_RF',
  'CatBoost_BAG_L1': 'StackerEnsembleModel_CatBoost',
  'ExtraTreesGini_BAG_L1': 'StackerEnsembleModel_XT',
  'ExtraTreesEntr_BAG_L1': 'StackerEnsembleModel_XT',
  'XGBoost_BAG_L1': 'StackerEnsembleModel_XGBoost',
  'NeuralNetMXNet_BAG_L1': 'StackerEnsembleModel_TabularNeuralNet',
  'LightGBMLarge_BAG_L1': 'StackerEnsembleModel_LGB',
  'WeightedEnsemble_L2': 'WeightedEnsembleModel',
  'LightGBMXT_BAG_L2': 'StackerEnsembleModel_LGB',
  'LightGBM_BAG_L2': 'StackerEnsembleModel_LGB',
  'RandomForestGini_BAG_L2': 'StackerEnsembleModel_RF',
  'RandomForestEntr_BAG_L2': 'StackerEnsembleModel_RF',
  'CatBoost_BAG_L2': 'StackerEnsembleModel_CatBoost',
  'ExtraTre

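**Note:** the models are not listed in the order in which they were created; that order is given by the fit_order column.
The method **predictor.leaderboard(df_train, silent=True)** shows it alongside each model's score on the data passed in (score_test). Because df_train is the training data here, score_test is measured on rows the models have already seen.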

In [53]:
predictor.leaderboard(df_train, silent=True)

| | model | score_test | score_val | pred_time_test | pred_time_val | fit_time | pred_time_test_marginal | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | KNeighborsDist_BAG_L1 | 1.0 | 0.951797 | 0.120929 | 0.10537 | 0.0132 | 0.120929 | 0.10537 | 0.0132 | 1 | True | 2 |
| 1 | RandomForestEntr_BAG_L1 | 1.0 | 0.956471 | 0.440889 | 0.208901 | 1.496504 | 0.440889 | 0.208901 | 1.496504 | 1 | True | 6 |
| 2 | RandomForestGini_BAG_L1 | 1.0 | 0.955887 | 0.446638 | 0.213377 | 1.622811 | 0.446638 | 0.213377 | 1.622811 | 1 | True | 5 |
| 3 | ExtraTreesGini_BAG_L1 | 1.0 | 0.95501 | 0.516817 | 0.217516 | 1.706477 | 0.516817 | 0.217516 | 1.706477 | 1 | True | 8 |
| 4 | ExtraTreesEntr_BAG_L1 | 1.0 | 0.954426 | 0.524461 | 0.214009 | 1.709065 | 0.524461 | 0.214009 | 1.709065 | 1 | True | 9 |
| 5 | ExtraTreesGini_BAG_L2 | 0.996494 | 0.957639 | 17.197584 | 3.503412 | 480.656945 | 0.326973 | 0.306028 | 1.82823 | 2 | True | 19 |
| 6 | ExtraTreesEntr_BAG_L2 | 0.989775 | 0.957055 | 17.294025 | 3.422231 | 480.473625 | 0.423414 | 0.224847 | 1.64491 | 2 | True | 20 |
| 7 | NeuralNetMXNet_BAG_L2 | 0.981011 | 0.957932 | 30.051834 | 4.930431 | 804.688992 | 13.181224 | 1.733047 | 325.860277 | 2 | True | 22 |
| 8 | RandomForestEntr_BAG_L2 | 0.972539 | 0.957347 | 17.150672 | 3.405339 | 480.339574 | 0.280061 | 0.207955 | 1.510859 | 2 | True | 17 |
| 9 | RandomForestGini_BAG_L2 | 0.972247 | 0.957932 | 17.153476 | 3.428154 | 480.436205 | 0.282866 | 0.23077 | 1.60749 | 2 | True | 16 |


It is also possible to display the features that contribute most to our score, in order of importance, with the method **predictor.feature_importance(data=df_train)**

In [54]:
predictor.feature_importance(data=df_train)

Computing feature importance via permutation shuffling for 11 features using 1000 rows with 3 shuffle sets...
	260.97s	= Expected runtime (86.99s per shuffle set)
	155.86s	= Actual runtime (Completed 3 of 3 shuffle sets)


| | importance | stddev | p_value | n | p99_high | p99_low |
|---|---|---|---|---|---|---|
| bmi | 0.010667 | 0.002887 | 0.011777 | 3 | 0.027208 | -0.005875 |
| age | 0.010333 | 0.004619 | 0.030303 | 3 | 0.0368 | -0.016133 |
| id | 0.005 | 0.002646 | 0.041007 | 3 | 0.02016 | -0.01016 |
| avg_glucose_level | 0.004333 | 0.000577 | 0.002933 | 3 | 0.007642 | 0.001025 |
| heart_disease | 0.002333 | 0.000577 | 0.009902 | 3 | 0.005642 | -0.000975 |
| work_type | 0.001333 | 0.000577 | 0.028595 | 3 | 0.004642 | -0.001975 |
| ever_married | 0.001333 | 0.002517 | 0.227834 | 3 | 0.015754 | -0.013087 |
| hypertension | 0.001333 | 0.000577 | 0.028595 | 3 | 0.004642 | -0.001975 |
| gender | 0.001333 | 0.000577 | 0.028595 | 3 | 0.004642 | -0.001975 |
| Residence_type | 0.000333 | 0.000577 | 0.211325 | 3 | 0.003642 | -0.002975 |
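
In the table above, `id` ranks third even though it is described as a unique identifier, so any importance it receives is unlikely to carry over to new patients. One option, sketched here on a toy frame and not done in the original notebook, is to drop the column before training (the helper name `drop_identifier` is illustrative):

```python
import pandas as pd


def drop_identifier(frame: pd.DataFrame, id_column: str = "id") -> pd.DataFrame:
    """Return a copy of the frame without the identifier column, if present."""
    return frame.drop(columns=[id_column], errors="ignore")


# Toy frame standing in for df_train (made-up subset of columns).
toy = pd.DataFrame({"id": [9046, 51676], "age": [67.0, 61.0], "stroke": [1, 1]})
features = drop_identifier(toy)
print(list(features.columns))
```

The same call would be applied to both the training and the test data so that training and prediction see the same columns.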


## Testing the predictive model

Once training is finished and the model is ready, it can be used on a dataset to predict the risk of stroke.

To do so, we simply pass a dataset to our predictor and store the result in a dataframe.

In [55]:
y_pred = predictor.predict(test_data)
df_pred = pd.DataFrame(y_pred, columns=['stroke'])
df_pred

| | stroke |
|---|---|
| 4673 | 0 |
| 3232 | 0 |
| 3694 | 0 |
| 1070 | 0 |
| 4163 | 0 |
| ... | ... |
| 386 | 0 |
| 3961 | 0 |
| 1608 | 0 |
| 1459 | 0 |
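
The truncated display only shows zeros, so it does not tell how many patients were actually flagged. A small sketch, assuming pandas and using a made-up frame in place of `df_pred`, of how the predicted classes could be counted:

```python
import pandas as pd

# Made-up predictions standing in for df_pred.
toy_pred = pd.DataFrame(
    {"stroke": [0, 0, 1, 0, 0, 0]},
    index=[10, 11, 12, 13, 14, 15],
)

# Number of rows predicted in each class, ordered by class label.
counts = toy_pred["stroke"].value_counts().sort_index()
print(counts.to_dict())
```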


To see how well these predictions perform, the method **predictor.evaluate(df_test)** is used; it compares them with the true 'stroke' values in df_test.

In [56]:
predictor.evaluate(df_test)

Evaluation: accuracy on test data: 0.945465323058684
Evaluations on test data:
{
    "accuracy": 0.945465323058684,
    "balanced_accuracy": 0.5046162944409657,
    "mcc": 0.042660422905315104,
    "roc_auc": 0.8366590134279552,
    "f1": 0.02127659574468085,
    "precision": 0.25,
    "recall": 0.011111111111111112
}


{'accuracy': 0.945465323058684,
 'balanced_accuracy': 0.5046162944409657,
 'mcc': 0.042660422905315104,
 'roc_auc': 0.8366590134279552,
 'f1': 0.02127659574468085,
 'precision': 0.25,
 'recall': 0.011111111111111112}
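
To help read these metrics side by side, the sketch below recomputes them from a confusion matrix. The four counts are not printed by the notebook: they are inferred here from the reported precision (0.25), recall (1/90) and accuracy on the 1687 test rows, and should be treated as an assumption.

```python
# Confusion counts inferred from the reported metrics (an assumption, see above).
tp, fp, fn, tn = 1, 3, 89, 1594
total = tp + fp + fn + tn  # 1687 test rows

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)
balanced_accuracy = (recall + specificity) / 2
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of a trivial model that always predicts "no stroke".
always_zero_accuracy = (tn + fp) / total

print(f"accuracy={accuracy:.4f} balanced_accuracy={balanced_accuracy:.4f}")
print(f"precision={precision:.2f} recall={recall:.4f} f1={f1:.4f}")
print(f"always-zero baseline accuracy={always_zero_accuracy:.4f}")
```

Under these inferred counts, a model that always answers 0 would reach about 0.947 accuracy, which is why accuracy alone says little about how well stroke cases are detected; recall and balanced accuracy are more informative here.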

## Result

Data preprocessing and feature engineering were carried out by AutoGluon.
The trained model also includes cross-validation.

We thus obtained a trained classifier with an accuracy of about 95% (0.945 on the test set) with only two lines of code (one to train the classifier and one to predict).

That is impressive!

With a traditional ML model, we would spend a lot of time completing the whole process, including exploratory data analysis, data cleaning, and the code needed to set up several models.

AutoGluon made this fairly simple for us by automating all the steps.