# Stroke prediction using Random Forest Classifier

Import all required libraries.

In [1]:
import pandas as pd
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

I use the stroke dataset from https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset

In [2]:
df = pd.read_csv(r"D:\Python Files\machine_learning\supervised\healthcare-dataset-stroke-data.csv")
df.head()

|   | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9046 | Male | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
| 1 | 51676 | Female | 61.0 | 0 | 0 | Yes | Self-employed | Rural | 202.21 | NaN | never smoked | 1 |
| 2 | 31112 | Male | 80.0 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
| 3 | 60182 | Female | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
| 4 | 1665 | Female | 79.0 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |


The __stroke__ column is the target feature.

In [3]:
df['stroke'].value_counts()

0    4861
1     249
Name: stroke, dtype: int64
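
With 249 positives out of 5110 rows, the classes are heavily imbalanced, which matters when reading any accuracy score. A small sketch, using only the counts printed above, of the accuracy a trivial model would reach by always predicting class 0:

```python
# Class counts copied from the value_counts() output above.
counts = {0: 4861, 1: 249}

total = sum(counts.values())
# Accuracy of a model that always predicts the majority class.
majority_baseline = max(counts.values()) / total
# Share of rows that are stroke cases.
positive_rate = counts[1] / total

print(f"rows: {total}")
print(f"majority-class baseline accuracy: {majority_baseline:.4f}")
print(f"share of stroke cases: {positive_rate:.4f}")
```

The baseline comes out at roughly 0.95, so a classifier needs to beat that figure before its accuracy says anything about detecting strokes.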

Drop columns that have no impact on the prediction, such as the __id__ attribute.

In [4]:
df.drop(labels=['id'], axis=1, inplace=True)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   gender             5110 non-null   object 
 1   age                5110 non-null   float64
 2   hypertension       5110 non-null   int64  
 3   heart_disease      5110 non-null   int64  
 4   ever_married       5110 non-null   object 
 5   work_type          5110 non-null   object 
 6   Residence_type     5110 non-null   object 
 7   avg_glucose_level  5110 non-null   float64
 8   bmi                4909 non-null   float64
 9   smoking_status     5110 non-null   object 
 10  stroke             5110 non-null   int64  
dtypes: float64(3), int64(3), object(5)
memory usage: 439.3+ KB


There are 201 null values in the __bmi__ column.
<br>Check the correlation between __bmi__ and the target feature to see whether
__bmi__ has a significant impact on it.

In [7]:
df['bmi'].corr(df['stroke'])

0.0423736611492336
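
`Series.corr` computes the Pearson coefficient by default and ignores pairs where either value is missing, so the null __bmi__ entries do not break the calculation. A minimal sketch on made-up numbers (the toy values are assumptions, not taken from the dataset) showing that the default call matches an explicit drop of the incomplete rows:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
bmi = pd.Series([36.6, np.nan, 32.5, 34.4, 24.0, 29.0])
stroke = pd.Series([1, 1, 1, 0, 0, 0])

# Default behaviour: pairs with a missing value are ignored.
corr_default = bmi.corr(stroke)

# Same calculation after dropping the incomplete rows explicitly.
mask = bmi.notna()
corr_explicit = bmi[mask].corr(stroke[mask])

print(corr_default, corr_explicit)
```

Pearson correlation only captures linear association, so a low value is a rough screening signal rather than proof that a feature is useless to a tree-based model.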

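Since __bmi__ is only weakly correlated with the target feature (about 0.04), I will drop it. The rows with a missing __bmi__ appear to have been removed in a step not shown here: the output below skips index 1, and the encoded frame later has 4909 rows.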

In [8]:
df.drop(labels=['bmi'], axis=1, inplace=True)
df.head()

|   | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Male | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | formerly smoked | 1 |
| 2 | Male | 80.0 | 0 | 1 | Yes | Private | Rural | 105.92 | never smoked | 1 |
| 3 | Female | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | smokes | 1 |
| 4 | Female | 79.0 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | never smoked | 1 |
| 5 | Male | 81.0 | 0 | 0 | Yes | Private | Urban | 186.21 | formerly smoked | 1 |


One-hot-encode the categorical values.

In [16]:
encoded_df = pd.get_dummies(df)
encoded_df.shape

(4909, 21)
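
A point worth checking after `pd.get_dummies` is the column order: the columns that are not encoded keep their relative order and come first, and the new dummy columns are appended after them. A minimal sketch on a toy frame (the values are assumptions made for illustration) of where the target ends up:

```python
import pandas as pd

# Toy frame, invented for illustration; the target is the last column here.
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Male"],
    "age": [67.0, 61.0, 80.0],
    "stroke": [1, 1, 0],
})

encoded = pd.get_dummies(toy)

# Numeric columns come first; dummy columns are appended after them.
print(list(encoded.columns))
# ['age', 'stroke', 'gender_Female', 'gender_Male']

# Selecting by name avoids depending on the column position.
X_toy = encoded.drop(columns=["stroke"])
y_toy = encoded["stroke"]
```

After encoding, `stroke` is no longer the last column, so picking the features and the target by name is safer than slicing by index.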

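Split the dataset into training and test sets. `pd.get_dummies` keeps the numeric columns (including __stroke__) ahead of the new dummy columns, so the target is selected by name rather than by position.

In [22]:
# Column 20 of encoded_df is the last dummy column, not the target,
# so pick the features and the target by column name.
X = encoded_df.drop(columns=['stroke']).values
y = encoded_df['stroke'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

Train and test the model.

In [25]:
model = RandomForestClassifier()

model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)

The score originally recorded for this cell, 0.9989816700610998, was computed with the positional split `encoded_df.values[:, 20]`, whose target was the last dummy column (a `smoking_status` indicator) rather than __stroke__, so it is not a stroke-prediction accuracy.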
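
Because stroke cases are rare, accuracy alone can look high even for a model that almost never predicts a stroke. The sketch below is a hedged illustration on synthetic data from `make_classification` (an assumption standing in for the stroke dataset, with invented names such as `X_demo` and `clf`). It shows a stratified split, metrics that account for the minority class, and the `cross_val_score` helper that the notebook imports but does not use:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic, imbalanced stand-in for the stroke data (about 5% positives).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)

# stratify keeps the class ratio the same in the training and test splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)

acc = accuracy_score(y_te, pred)
bal_acc = balanced_accuracy_score(y_te, pred)
rec = recall_score(y_te, pred)
print("accuracy:", acc)
print("balanced accuracy:", bal_acc)
print("recall on positives:", rec)

# 5-fold cross-validation on the training split, scored by balanced accuracy.
cv_scores = cross_val_score(clf, X_tr, y_tr, cv=5, scoring="balanced_accuracy")
print("cv balanced accuracy:", cv_scores.mean())
```

Comparing accuracy with recall and balanced accuracy shows how much of the score comes from the majority class.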

Thank you.