# Machine-Learning TD
By Corentin Meyer, PhD Student @ CSTB - iCube, 04/10 ESBS


# PART 1: Import, format and split the data

## The Data that we will use
## [Stroke Prediction Dataset](https://www.kaggle.com/fedesoriano/stroke-prediction-dataset)
### **Context**
According to the World Health Organization (WHO), stroke is the 2nd leading cause of death globally, responsible for approximately 11% of total deaths.
This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about one patient.
### **Attribute Information**
The 12 columns listed below comprise an identifier, 10 clinical features for predicting stroke events, and the `stroke` target.
1. **id**: unique identifier
2. **gender**: "Male", "Female" or "Other"
3. **age**: age of the patient
4. **hypertension**: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
5. **heart_disease**: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
6. **ever_married**: "No" or "Yes"
7. **work_type**: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
8. **Residence_type**: "Rural" or "Urban"
9. **avg_glucose_level**: average glucose level in blood
10. **bmi**: body mass index
11. **smoking_status**: "formerly smoked", "never smoked", "smokes" or "Unknown"\*
12. **stroke**: 1 if the patient had a stroke or 0 if not

\*Note: "Unknown" in `smoking_status` means that the information is unavailable for this patient.

In [None]:
# Step 1: import the data
# How many entries (patients) are in the dataset? How many columns (features) ?
import pandas as pd
df = pd.read_csv("healthcare-dataset-stroke-data.csv")
df.set_index("id", inplace=True)
df.head()

In [None]:
# What is the percentage of the patients that were stroke patients?
df["stroke"].value_counts(normalize=True)
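The two questions above (dataset size and share of stroke patients) can be answered with `DataFrame.shape` and the mean of a 0/1 column. Here is a minimal sketch on a hypothetical four-patient table; the values are invented for illustration and are not taken from the stroke dataset.

```python
import pandas as pd

# Hypothetical toy table standing in for the real dataset (values are made up)
toy = pd.DataFrame(
    {"age": [67, 45, 80, 33], "stroke": [1, 0, 0, 0]},
    index=pd.Index([101, 102, 103, 104], name="id"),
)

# shape returns (number of rows, number of columns); the index is not counted
n_patients, n_features = toy.shape

# The mean of a 0/1 column is the fraction of ones, so multiply by 100 for a percentage
stroke_pct = toy["stroke"].mean() * 100

print(f"{n_patients} patients, {n_features} columns, {stroke_pct:.1f}% stroke")
```

On the real data, the same two expressions applied to `df` would give the answers.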

In [None]:
df.dtypes

In [None]:
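# Columns already encoded as 0/1 that need no transformation (note that "stroke" is the target)
columns_nothing = ["hypertension", "heart_disease", "stroke"]
# Text columns that will be one-hot encoded
columns_categorical = ["gender", "ever_married", "work_type", "Residence_type", "smoking_status"]
# Continuous columns that will be standardized
columns_numeric = ["age", "avg_glucose_level", "bmi"]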

In [None]:
X_nothing_to_do = df[columns_nothing]
X_nothing_to_do = X_nothing_to_do.to_numpy()

In [None]:
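from sklearn.preprocessing import OneHotEncoder, StandardScaler
enc = OneHotEncoder()
X_cat = df[columns_categorical]
# One-hot encode the categorical columns and convert the sparse result to a dense array
X_cat_onehot = enc.fit_transform(X_cat).toarray()
# get_feature_names_out() needs scikit-learn >= 1.0 (it replaces the removed get_feature_names())
X_cat_columns_onehot = enc.get_feature_names_out(columns_categorical)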

In [None]:
X_num = df[columns_numeric]
X_num_scaled = StandardScaler().fit_transform(X_num)
X_num_columns_scaled = X_num.columns
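Standardization rescales each numeric column to zero mean and unit variance. The sketch below, on three hypothetical ages, checks that `StandardScaler` matches the manual formula `(x - mean) / std` with the population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical ages, shaped as one column with three rows
ages = np.array([[20.0], [40.0], [60.0]])

scaler = StandardScaler()
ages_scaled = scaler.fit_transform(ages)

# Manual standardization; np.std uses the population formula (ddof=0) by default,
# which is also what StandardScaler uses
ages_manual = (ages - ages.mean(axis=0)) / ages.std(axis=0)

print(ages_scaled.ravel())
print(np.allclose(ages_scaled, ages_manual))
```

Note that `StandardScaler` ignores NaN values when fitting and leaves them as NaN in its output, so any missing values in a numeric column would still need handling before most models can be trained.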

In [None]:
import numpy as np
array_data = np.concatenate((X_cat_onehot, X_num_scaled, X_nothing_to_do), axis=1)
pd.DataFrame(data=array_data, columns=list(X_cat_columns_onehot) + list(X_num_columns_scaled) + columns_nothing)
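Part 1 also announces a split of the data. Here is a minimal sketch, assuming a stratified 80/20 hold-out split with `train_test_split`; the split ratio, the `random_state`, and the toy table are assumptions, not choices made in this TD. Because the assembled table still contains the `stroke` column, the target is separated from the features first.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy table: 10 stroke patients out of 100 (values are made up)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.uniform(20, 80, size=100),
    "bmi": rng.uniform(18, 35, size=100),
    "stroke": [1] * 10 + [0] * 90,
})

# Separate the target from the features before splitting
X = toy.drop(columns="stroke")
y = toy["stroke"]

# stratify=y keeps the same proportion of stroke patients in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(len(X_train), len(X_test))
print(y_train.mean(), y_test.mean())
```

Stratifying matters when the positive class is rare, since a purely random split could leave the test set with very few stroke patients.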

# PART 2: Create your machine-learning model

# PART 3: Evaluate your model

# BONUS:  Automatic hyper-parameters tuning

# Additional Resources

If you want to become an AI Expert: AI Roadmap [https://i.am.ai/roadmap](https://i.am.ai/roadmap)  
Visualisation in Python [https://www.python-graph-gallery.com/](https://www.python-graph-gallery.com/)  
Public datasets to create projects [https://github.com/awesomedata/awesome-public-datasets](https://github.com/awesomedata/awesome-public-datasets)  
Machine Learning competitions with prizes [https://www.kaggle.com/](https://www.kaggle.com/)