### Stroke Prediction Dataset
 * https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset?select=healthcare-dataset-stroke-data.csv
 * 11 clinical features for predicting stroke events

Attribute Information

1) id: unique identifier
2) gender: "Male", "Female" or "Other"
3) age: age of the patient
4) hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
5) heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
6) ever_married: "No" or "Yes"
7) work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
8) Residence_type: "Rural" or "Urban"
9) avg_glucose_level: average glucose level in blood
10) bmi: body mass index
11) smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*
12) stroke: 1 if the patient had a stroke or 0 if not
*Note: "Unknown" in smoking_status means that the information is unavailable for this patient

In [37]:
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import lightgbm as lgbm
from lightgbm import early_stopping
from sklearn.model_selection import cross_val_score
import numpy as np

In [7]:
df = pd.read_csv('./stroke_data/healthcare-dataset-stroke-data.csv')

In [8]:
df.head()

|   | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9046 | Male | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
| 1 | 51676 | Female | 61.0 | 0 | 0 | Yes | Self-employed | Rural | 202.21 | NaN | never smoked | 1 |
| 2 | 31112 | Male | 80.0 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
| 3 | 60182 | Female | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
| 4 | 1665 | Female | 79.0 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB


In [90]:
df['age'].describe()

count    5110.000000
mean       43.226614
std        22.612647
min         0.080000
25%        25.000000
50%        45.000000
75%        61.000000
max        82.000000
Name: age, dtype: float64

In [91]:
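# Mean BMI per 10-year age band (0-9, 10-19, ..., 80-89); missing BMI values are ignored.
# Note: the strict lower bound (age > start) combined with end = start + 9 leaves out
# ages in (9, 10], (19, 20], and so on. The output below was produced with these bounds.
bmi_list = []
for i in range(0, 9):
    start = i * 10
    end = start + 9
    bmi_mean = df.loc[(df['age'] > start) & (df['age'] <= end), 'bmi'].dropna().mean()
    bmi_list.append(bmi_mean)
print(bmi_list)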

[18.869934640522878, 25.30288248337029, 28.371014492753623, 31.32439446366782, 31.562519685039373, 31.90828729281768, 30.91513409961686, 29.191295116772825, 27.988695652173917]
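
The same per-band mean can be expressed with `pd.cut`, which makes the bin edges explicit so that no age falls between two bands. This is a minimal sketch, assuming a small hypothetical frame `toy` in place of `df`; the ages and BMI values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df[['age', 'bmi']]; values are illustrative only.
toy = pd.DataFrame({
    "age": [0.5, 9.0, 10.0, 19.0, 20.0, 45.0, 82.0],
    "bmi": [17.0, 19.0, 21.0, 25.0, np.nan, 31.0, 28.0],
})

# Left-closed 10-year bands: [0, 10), [10, 20), ..., [80, 90).
edges = list(range(0, 100, 10))
toy["age_band"] = pd.cut(toy["age"], bins=edges, right=False)

# mean() skips NaN by default, so no explicit dropna() is needed.
# observed=False keeps bands that contain no rows (their mean is NaN).
bmi_by_band = toy.groupby("age_band", observed=False)["bmi"].mean()
print(bmi_by_band)
```

With `right=False`, an age of exactly 10 or 20 lands in the band that starts at that value instead of being dropped.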


In [11]:
X_stroke = df.drop(['stroke'],axis=1)
y_stroke = df['stroke']

In [12]:
select_list = ['age', 'hypertension', 'heart_disease', 'avg_glucose_level']

In [13]:
X_selected = X_stroke[select_list]

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X_selected, y_stroke, stratify=y_stroke, random_state=33)

In [49]:
logreg = LogisticRegression().fit(X_train, y_train)
pred_logreg = logreg.predict(X_test)

print("Training score: {:.6f}".format(logreg.score(X_train, y_train)))
print("Test score: {:.6f}".format(logreg.score(X_test, y_test)))
scores = cross_val_score(logreg, X_selected, y_stroke, cv=5, scoring="roc_auc")
print("ROC-AUC Score : {}\n".format(np.mean(scores)))


Training score: 0.951200
Test score: 0.951487
ROC-AUC Score : 0.8425792050950536
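
Train and test accuracy are nearly identical at about 0.951. Before reading much into that figure, it is worth comparing it against a trivial baseline, since accuracy can look high on an imbalanced target even when a model rarely predicts the positive class. Below is a minimal sketch of that check, assuming synthetic data from `make_classification` with a hypothetical 5% positive rate instead of the stroke data itself.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for (X_selected, y_stroke); the 95/5 split is an assumption.
X, y = make_classification(n_samples=2000, n_features=4, n_informative=3, n_redundant=0,
                           weights=[0.95, 0.05], random_state=33)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=33)

# Baseline that always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
model = LogisticRegression().fit(X_tr, y_tr)

print("baseline accuracy:", baseline.score(X_te, y_te))
print("model accuracy:   ", model.score(X_te, y_te))
# Recall on the positive class shows how many true positives are actually found.
print("model recall:     ", recall_score(y_te, model.predict(X_te)))
```

If the model's accuracy is close to the baseline's, a ranking metric such as ROC-AUC or the positive-class recall is the more informative number.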



In [46]:
tree = RandomForestClassifier(n_estimators=100,max_depth=5).fit(X_train, y_train)
pred_tree = tree.predict(X_test)

print("Training accuracy: {:.6f}".format(tree.score(X_train, y_train)))
print("Test accuracy: {:.6f}".format(tree.score(X_test, y_test)))
scores = cross_val_score(tree, X_selected, y_stroke, cv=5, scoring="roc_auc")
print("ROC-AUC Score : {}\n".format(np.mean(scores)))

Training accuracy: 0.951722
Test accuracy: 0.951487
ROC-AUC Score : 0.8384138913251216



In [47]:
knn = KNeighborsClassifier().fit(X_train, y_train)
pred_knn = knn.predict(X_test)

print("Training accuracy: {:.6f}".format(knn.score(X_train, y_train)))
print("Test accuracy: {:.6f}".format(knn.score(X_test, y_test)))
scores = cross_val_score(knn, X_selected, y_stroke, cv=5, scoring="roc_auc")
print("ROC-AUC Score : {}\n".format(np.mean(scores)))

Training accuracy: 0.952244
Test accuracy: 0.948357
ROC-AUC Score : 0.6589302994921847
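
KNN's ROC-AUC (about 0.66) is well below that of the other two models. One plausible contributor, not verified here, is that unscaled features let large-range columns such as `avg_glucose_level` and `age` dominate the distance computation. The sketch below shows how scaling could be added inside cross-validation with a pipeline, assuming synthetic data with one deliberately inflated column.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the stroke features; not the real data.
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3, n_redundant=0,
                           random_state=33)
X = X.copy()
X[:, 3] *= 1000.0  # inflate one column so it dominates Euclidean distances

# The scaler is refit on each training fold, so no test-fold statistics leak in.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())

raw_auc = cross_val_score(KNeighborsClassifier(), X, y, cv=5, scoring="roc_auc").mean()
scaled_auc = cross_val_score(scaled_knn, X, y, cv=5, scoring="roc_auc").mean()
print("unscaled ROC-AUC:", raw_auc)
print("scaled ROC-AUC:  ", scaled_auc)
```

Putting the scaler in the pipeline, and not scaling the full data once up front, keeps each fold's test rows out of the scaling statistics.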



In [18]:
lbl_enc = LabelEncoder()
labeled_df = df.copy()
labeled_df['gender'] = lbl_enc.fit_transform(df['gender'])
labeled_df['ever_married'] = lbl_enc.fit_transform(df['ever_married'])
labeled_df['work_type'] = lbl_enc.fit_transform(df['work_type'])
labeled_df['Residence_type'] = lbl_enc.fit_transform(df['Residence_type'])
labeled_df['smoking_status'] = lbl_enc.fit_transform(df['smoking_status'])
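
Refitting one `LabelEncoder` per column works, but each `fit_transform` call overwrites the previous mapping, and the resulting integer codes imply an ordering that nominal categories such as `work_type` do not have. One alternative, shown here as a sketch on a hypothetical three-row frame, is one-hot encoding with `pd.get_dummies`.

```python
import pandas as pd

# Hypothetical miniature of the categorical columns; values are illustrative only.
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female"],
    "work_type": ["Private", "Self-employed", "Govt_job"],
    "age": [67.0, 61.0, 49.0],
})

# One indicator column per category; numeric columns pass through unchanged.
encoded = pd.get_dummies(toy, columns=["gender", "work_type"], dtype=int)
print(encoded.columns.tolist())
```

Tree-based models often tolerate integer codes, while distance-based and linear models such as KNN and logistic regression are usually more sensitive to the artificial ordering.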

In [19]:
labeled_df.head()

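|   | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9046 | 1 | 67.0 | 0 | 1 | 1 | 2 | 1 | 228.69 | 36.6 | 1 | 1 |
| 1 | 51676 | 0 | 61.0 | 0 | 0 | 1 | 3 | 0 | 202.21 | NaN | 2 | 1 |
| 2 | 31112 | 1 | 80.0 | 0 | 1 | 1 | 2 | 0 | 105.92 | 32.5 | 2 | 1 |
| 3 | 60182 | 0 | 49.0 | 0 | 0 | 1 | 2 | 1 | 171.23 | 34.4 | 3 | 1 |
| 4 | 1665 | 0 | 79.0 | 1 | 0 | 1 | 3 | 0 | 174.12 | 24.0 | 2 | 1 |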


In [20]:
labeled_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   int32  
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   int32  
 6   work_type          5110 non-null   int32  
 7   Residence_type     5110 non-null   int32  
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   int32  
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int32(5), int64(4)
memory usage: 379.4 KB


In [21]:
new_select_list = labeled_df.columns.drop(['bmi','stroke'])
print(new_select_list)

Index(['id', 'gender', 'age', 'hypertension', 'heart_disease', 'ever_married',
       'work_type', 'Residence_type', 'avg_glucose_level', 'smoking_status'],
      dtype='object')


In [29]:
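# Note: this selects select_list (the four numeric features), not new_select_list,
# so the label-encoded columns are not used; the logistic regression and KNN
# scores below are identical to the earlier run.
X_selected = labeled_df[select_list]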
X_train, X_test, y_train, y_test = train_test_split(X_selected, y_stroke, stratify=y_stroke, random_state=33)

In [50]:
logreg = LogisticRegression().fit(X_train, y_train)
pred_logreg = logreg.predict(X_test)

print("logreg train score: {:.6f}".format(logreg.score(X_train, y_train)))
print("logreg test score: {:.6f}".format(logreg.score(X_test, y_test)))
scores = cross_val_score(logreg, X_selected, y_stroke, cv=5, scoring="roc_auc")
print("ROC-AUC Score : {}\n".format(np.mean(scores)))



logreg train score: 0.951200
logreg test score: 0.951487
ROC-AUC Score : 0.8425792050950536



In [51]:
tree = RandomForestClassifier(n_estimators=100,max_depth=5).fit(X_train, y_train)
pred_tree = tree.predict(X_test)

print("Training accuracy: {:.6f}".format(tree.score(X_train, y_train)))
print("Test accuracy: {:.6f}".format(tree.score(X_test, y_test)))
scores = cross_val_score(tree, X_selected, y_stroke, cv=5, scoring="roc_auc")
print("ROC-AUC Score : {}\n".format(np.mean(scores)))


Training accuracy: 0.951461
Test accuracy: 0.951487
ROC-AUC Score : 0.8350797873740745



In [52]:
knn = KNeighborsClassifier().fit(X_train, y_train)
pred_knn = knn.predict(X_test)

print("Training accuracy: {:.6f}".format(knn.score(X_train, y_train)))
print("Test accuracy: {:.6f}".format(knn.score(X_test, y_test)))
scores = cross_val_score(knn, X_selected, y_stroke, cv=5, scoring="roc_auc")
print("ROC-AUC Score : {}\n".format(np.mean(scores)))


Training accuracy: 0.952244
Test accuracy: 0.948357
ROC-AUC Score : 0.6589302994921847



In [33]:
params = {
        "num_iterations":10000,
        'learning_rate': 0.05,
    }

In [34]:
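lgbm_model = lgbm.LGBMClassifier(**params).fit(
        X_train, y_train,
        # the test split doubles as the early-stopping validation set here
        eval_set=[(X_test, y_test), (X_train, y_train)],
        callbacks=[
            early_stopping(100),       # stop after 100 rounds without improvement
            # log every 100 rounds; replaces fit(verbose=100), which LightGBM 4.0 removed
            lgbm.log_evaluation(100),
        ],
        #categorical_feature=cat_col
    )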

Training until validation scores don't improve for 100 rounds
[100]	training's binary_logloss: 0.100256	valid_0's binary_logloss: 0.171749
Early stopping, best iteration is:
[41]	training's binary_logloss: 0.122369	valid_0's binary_logloss: 0.163856




In [40]:
scores = cross_val_score(lgbm_model, X_selected, y_stroke, cv= 5, scoring="roc_auc")
print("ROC-AUC Score : {}\n".format(np.mean(scores)))



ROC-AUC Score : 0.7294081014639752
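
The cross-validated ROC-AUC (about 0.73) is noticeably lower than for the simpler models. A possible reason is that `cross_val_score` clones the estimator and refits it on each fold without the `eval_set` and `early_stopping` callback that were passed to `fit` earlier, so each fold may train for the full `num_iterations=10000` rounds. The sketch below illustrates keeping early stopping inside each fold. It swaps in scikit-learn's `GradientBoostingClassifier`, whose `n_iter_no_change` parameter belongs to the estimator itself, in place of LightGBM, and it uses synthetic data, so its numbers are not comparable to the results above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced stand-in for (X_selected, y_stroke).
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3, n_redundant=0,
                           weights=[0.9, 0.1], random_state=33)

# Early stopping is configured on the estimator, so every cloned fold model uses it:
# 10% of each training fold is held out, and boosting stops after 20 rounds
# without improvement on that held-out part.
gbc = GradientBoostingClassifier(
    n_estimators=500,
    learning_rate=0.05,
    validation_fraction=0.1,
    n_iter_no_change=20,
    random_state=33,
)

scores = cross_val_score(gbc, X, y, cv=5, scoring="roc_auc")
print("ROC-AUC per fold:", scores)

# n_estimators_ reports how many boosting rounds were actually fitted.
print("rounds used on full data:", gbc.fit(X, y).n_estimators_)
```

Because the stopping rule travels with the estimator's parameters, it survives cloning, and the validation rows come from the training fold, not from the fold being scored.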

