## Brain stroke predictions
I will analyze the brain stroke data set from Kaggle (https://www.kaggle.com/datasets/jillanisofttech/brain-stroke-dataset). I will compare the performance of Logistic Regression and Random Forest.

In [41]:
import pandas as pd
df = pd.read_csv("brain_stroke.csv")

# Data analysis
The data consists of 10 features and one label column (stroke). The shape of the data is (4981, 11). The features are:
1) gender: "Male" or "Female" (the dataset description also lists "Other", but it does not occur in this file)
2) age: age of the patient
3) hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
4) heart_disease: 0 if the patient doesn't have any heart disease, 1 if the patient has a heart disease
5) ever_married: "No" or "Yes"
6) work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed" ("Never_worked" does not occur in this file)
7) Residence_type: "Rural" or "Urban"
8) avg_glucose_level: average glucose level in blood
9) bmi: body mass index
10) smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*

*Note: "Unknown" in smoking_status means that the information is unavailable for this patient

In [42]:
print(df.info())
print(df.isnull().sum())
print(df.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4981 entries, 0 to 4980
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   gender             4981 non-null   object 
 1   age                4981 non-null   float64
 2   hypertension       4981 non-null   int64  
 3   heart_disease      4981 non-null   int64  
 4   ever_married       4981 non-null   object 
 5   work_type          4981 non-null   object 
 6   Residence_type     4981 non-null   object 
 7   avg_glucose_level  4981 non-null   float64
 8   bmi                4981 non-null   float64
 9   smoking_status     4981 non-null   object 
 10  stroke             4981 non-null   int64  
dtypes: float64(3), int64(3), object(5)
memory usage: 428.2+ KB
None
gender               0
age                  0
hypertension         0
heart_disease        0
ever_married         0
work_type            0
Residence_type       0
avg_glucose_level    0
bmi             

There is no missing data in the dataset.

# Data balance
The data is very unbalanced: only 248 of the 4981 patients (about 5%) had a stroke. This has to be taken into account when training the model. To deal with this problem I will use an oversampling technique.

In [43]:
df.groupby(["stroke"]).size()

stroke
0    4733
1     248
dtype: int64
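
As a quick check of how strong the imbalance is, the class counts printed above can be turned into shares and a majority-to-minority ratio. This is only a small sketch: the counts are copied by hand from the output rather than read from the CSV, and the names shares and imbalance_ratio are introduced here for illustration.

```python
import pandas as pd

# Class counts copied from the groupby output above (0 = no stroke, 1 = stroke)
counts = pd.Series({0: 4733, 1: 248}, name="stroke")

# Share of each class in the whole dataset
shares = counts / counts.sum()

# How many non-stroke patients there are per stroke patient
imbalance_ratio = counts[0] / counts[1]

print(shares.round(3))
print(round(imbalance_ratio, 1))
```

On the real DataFrame the same shares should be obtainable with df["stroke"].value_counts(normalize=True).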

# Data preparation
The Residence_type, ever_married and gender columns each take only two values, so they can easily be encoded as 0 and 1. The work_type column can be encoded with one-hot encoding. The smoking_status column looks unreliable, because about 30% of its values (1500 of 4981) are "Unknown", so I decided not to include it in my model. Because logistic regression is sensitive to feature scale, I decided to scale the age, avg_glucose_level and bmi columns to the [0, 1] range.


In [44]:
print(df.groupby(["Residence_type"]).size())
print(df.groupby(["ever_married"]).size())
print(df.groupby(["gender"]).size())
print(df.groupby(["work_type"]).size())
print(df.groupby(["smoking_status", "stroke"]).size())

Residence_type
Rural    2449
Urban    2532
dtype: int64
ever_married
No     1701
Yes    3280
dtype: int64
gender
Female    2907
Male      2074
dtype: int64
work_type
Govt_job          644
Private          2860
Self-employed     804
children          673
dtype: int64
smoking_status   stroke
Unknown          0         1453
                 1           47
formerly smoked  0          797
                 1           70
never smoked     0         1749
                 1           89
smokes           0          734
                 1           42
dtype: int64


In [45]:
from sklearn.preprocessing import minmax_scale, OrdinalEncoder
enc = OrdinalEncoder()
df[["Residence_type", "ever_married", "gender"]] = enc.fit_transform(df[["Residence_type", "ever_married", "gender"]])
df = pd.get_dummies(df, columns=["work_type"])
df = df.drop(columns=["smoking_status"])
df[['age','avg_glucose_level','bmi']] = minmax_scale(df[['age','avg_glucose_level','bmi']])
print(df.head())

   gender       age  hypertension  heart_disease  ever_married  \
0     1.0  0.816895             0              1           1.0   
1     1.0  0.975586             0              1           1.0   
2     0.0  0.597168             0              0           1.0   
3     0.0  0.963379             1              0           1.0   
4     1.0  0.987793             0              0           1.0   

   Residence_type  avg_glucose_level       bmi  stroke  work_type_Govt_job  \
0             1.0           0.801265  0.647564       1                   0   
1             0.0           0.234512  0.530086       1                   0   
2             1.0           0.536008  0.584527       1                   0   
3             0.0           0.549349  0.286533       1                   0   
4             1.0           0.605161  0.429799       1                   0   

   work_type_Private  work_type_Self-employed  work_type_children  
0                  1                        0                   0 
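
To make the encoded values in the output above easier to read, here is a minimal sketch of the same two encoding steps on a toy frame. The rows are made up (they only reuse category names from the dataset), and the mapping dictionary is a helper introduced here. OrdinalEncoder sorts the categories, so the alphabetically first label becomes 0.0.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy rows that reuse category names from the dataset (not real patients)
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female"],
    "ever_married": ["Yes", "No", "Yes"],
    "work_type": ["Private", "children", "Govt_job"],
})

# Two-valued columns: ordinal encoding gives 0.0 / 1.0
binary_cols = ["gender", "ever_married"]
enc = OrdinalEncoder()
toy[binary_cols] = enc.fit_transform(toy[binary_cols])

# categories_ lists the labels in the order they map to 0.0, 1.0, ...
mapping = {col: list(cats) for col, cats in zip(binary_cols, enc.categories_)}
print(mapping)

# Multi-valued column: one indicator column per category
toy = pd.get_dummies(toy, columns=["work_type"])
print(toy.columns.tolist())
```

With this ordering, gender 1.0 stands for "Male", ever_married 1.0 for "Yes", and (by the same rule) Residence_type 1.0 for "Urban".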

# Train test split
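I will use train_test_split from sklearn.model_selection to split the data into 75% for training and 25% for testing. I will use random_state=5 to make the split reproducible. SMOTE oversampling is then applied to the training set only, so the test set keeps the original class distribution. Note that SMOTE is created without a random_state, so the resampled training set can differ slightly between runs.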

In [46]:
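from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Separate the features from the label column
X = df.drop("stroke", axis=1)
y = df["stroke"]

# Default test_size is 0.25, i.e. a 75% / 25% split
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True, random_state=5)

# Oversample the minority class in the training set only
oversample = SMOTE()
X_train, y_train = oversample.fit_resample(X_train, y_train)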
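
For intuition about what the oversampling step does, here is a rough NumPy sketch of the idea behind SMOTE: each synthetic row is placed on the line segment between a minority row and one of its nearest minority neighbours. This is not the imblearn implementation. The function name smote_like_oversample, its parameters and the random toy data are all assumptions made for illustration.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between a random
    minority row and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances between minority rows
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a row is not its own neighbour

    k = min(k, len(X_min) - 1)
    neighbours = np.argsort(dist, axis=1)[:, :k]

    # Pick a base row and one of its neighbours for every new sample
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]

    # Random position on the segment between the two rows
    lam = rng.random((n_new, 1))
    return X_min[base] + lam * (X_min[picked] - X_min[base])

# Toy minority class: 20 rows with 3 features already scaled to [0, 1]
rng = np.random.default_rng(1)
X_minority = rng.random((20, 3))
X_new = smote_like_oversample(X_minority, n_new=80)
print(X_new.shape)
```

Because every synthetic row is a convex combination of two existing rows, the new rows stay inside the range of the original minority data. This is also why the oversampling is applied after the split: synthetic rows built from test patients would otherwise leak into training.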

# Logistic regression
Unfortunately, logistic regression is not very precise for patients who had a stroke (precision is only 14% on the test set).
However, recall (the fraction of actual stroke cases that the model correctly identifies) is at a good level (75%), which is satisfactory for this type of classification. In this case it is better to have more false positives (poor precision) than many false negatives (missed strokes).

In [47]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
y_pred_train = lr.predict(X_train)
print(classification_report(y_test, y_pred))
print(classification_report(y_train, y_pred_train))

              precision    recall  f1-score   support

           0       0.98      0.75      0.85      1182
           1       0.14      0.75      0.23        64

    accuracy                           0.75      1246
   macro avg       0.56      0.75      0.54      1246
weighted avg       0.94      0.75      0.82      1246

              precision    recall  f1-score   support

           0       0.81      0.74      0.77      3551
           1       0.76      0.83      0.79      3551

    accuracy                           0.78      7102
   macro avg       0.79      0.78      0.78      7102
weighted avg       0.79      0.78      0.78      7102
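
To show how the stroke-class numbers in the test report relate to each other, the sketch below recomputes precision, recall and F1 from confusion-matrix counts. A recall of 0.75 with a support of 64 implies 48 true positives and 16 false negatives. The false-positive count of 300 is an assumed value that is consistent with the rounded report, not a figure taken from the notebook.

```python
# Counts for the positive (stroke) class on the test set
tp = 48   # true positives: implied by recall 0.75 with support 64
fn = 16   # false negatives: 64 - 48
fp = 300  # false positives: ASSUMED, chosen to match the rounded report

precision = tp / (tp + fp)  # share of predicted strokes that are real
recall = tp / (tp + fn)     # share of real strokes that are found
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), round(f1, 2))
```

The low precision comes from the large number of false positives relative to the 64 real stroke cases, while recall depends only on how many of those 64 cases are found.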



# Random Forest Classifier
The max_depth of the trees had to be significantly reduced to avoid overfitting. On the stroke class, the random forest performs slightly better than logistic regression: precision is similar (13% vs. 14%), but recall increased from 75% to 81%. Overall test accuracy is slightly lower (71% vs. 75%).

In [48]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(max_depth=4, random_state=0)
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)
y_pred_train = rfc.predict(X_train)
print(classification_report(y_test, y_pred))
print(classification_report(y_train, y_pred_train))

              precision    recall  f1-score   support

           0       0.99      0.70      0.82      1182
           1       0.13      0.81      0.22        64

    accuracy                           0.71      1246
   macro avg       0.56      0.76      0.52      1246
weighted avg       0.94      0.71      0.79      1246

              precision    recall  f1-score   support

           0       0.89      0.69      0.78      3551
           1       0.75      0.92      0.82      3551

    accuracy                           0.80      7102
   macro avg       0.82      0.80      0.80      7102
weighted avg       0.82      0.80      0.80      7102
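
The effect of max_depth on overfitting can be illustrated by comparing training and test accuracy for a shallow and an unrestricted forest. The sketch below uses synthetic data from make_classification as a stand-in, so its numbers say nothing about the stroke dataset. The sample size, class weights, n_estimators and the scores dictionary are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (about 10% positives), not the stroke data
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1],
    flip_y=0.05, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=5)

scores = {}
for depth in (4, None):  # None lets each tree grow until its leaves are pure
    model = RandomForestClassifier(n_estimators=50, max_depth=depth, random_state=0)
    model.fit(X_tr, y_tr)
    scores[depth] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(depth, round(scores[depth][0], 3), round(scores[depth][1], 3))
```

A large gap between training and test accuracy for the unrestricted forest is the usual sign of overfitting that a small max_depth is meant to reduce.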



# Summary
In this notebook, I analyzed the brain stroke dataset. I created a logistic regression model and a random forest model. I used an oversampling method (SMOTE) to deal with the unbalanced data. To prevent overfitting in the random forest classifier, I limited the depth of the trees. I used the default classification_report from sklearn.metrics, with the precision, recall, F1 score and support metrics, to evaluate the models.