<a id='0'></a>
# Playground Series - Season 3, Episode 2 (+EDA)

# Easy Navigation

- [1- Problem Statement & Dataset Description](#1)
- [2- Data Exploration](#2)
- [3- Exploratory Data Analysis (EDA)](#3)
    - [3.1- Looking at features individually](#3-1)
    - [3.2- See the relationships among features [and the target variable]](#3-2)
- [4- Data Preprocessing](#4)
- [5- Modeling](#5)
    - [5.1- Model Construction & Prediction](#5-1)
    - [5.2- Result Submission](#5-2)

In [None]:
# import required libraries/dependencies
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

from warnings import simplefilter
simplefilter('ignore')

<a id='1'></a>
# 1- Problem Statement & Dataset Description

The following is copied directly from the description tab of the [original dataset](https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset), by [@fedesoriano](https://www.kaggle.com/fedesoriano), from which the dataset for this competition was created.<br>

"""start of quote<br>
**Context:**<br>
According to the World Health Organization (WHO) stroke is the 2nd leading cause of death globally, responsible for approximately 11% of total deaths.<br>
This dataset is used to predict whether a patient is likely to get stroke based on the input parameters like gender, age, various diseases, and smoking status. Each row in the data provides relavant information about the patient.<br>

**Attribute Information**:<br>
1) id: unique identifier<br>
2) gender: "Male", "Female" or "Other"<br>
3) age: age of the patient<br>
4) hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension<br>
5) heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease<br>
6) ever_married: "No" or "Yes"<br>
7) work_type: "children", "Govt_jov", "Never_worked", "Private" or "Self-employed"<br>
8) Residence_type: "Rural" or "Urban"<br>
9) avg_glucose_level: average glucose level in blood<br>
10) bmi: body mass index<br>
11) smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*<br>
12) stroke: 1 if the patient had a stroke or 0 if not<br>

*Note: "Unknown" in smoking_status means that the information is unavailable for this patient<br>
"""end of quote<br>

**Our objective** is to build an ML model that predicts the probability of a person having a stroke.

<a id='2'></a>
# 2- Data Exploration

In [None]:
# load datasets
df_train = pd.read_csv('/kaggle/input/playground-series-s3e2/train.csv', index_col=0)
df_test = pd.read_csv('/kaggle/input/playground-series-s3e2/test.csv', index_col=0)

In [None]:
df_train.head()

In [None]:
print(f'The training set contains {df_train.shape[0]} rows and {df_train.shape[1]} columns.')

In [None]:
# let's display an overview of the data types to see if they are as expected
df_train.dtypes

In [None]:
# extract numeric and categorical features
numeric_features = df_train.select_dtypes(exclude='object').columns
categorical_features = df_train.select_dtypes(include='object').columns

In [None]:
# visualize the ratio of numeric and categorical features
plt.figure(figsize=(8, 6))
ax = sns.barplot(x=['Numeric', 'Categorical'], y=[numeric_features.shape[0], categorical_features.shape[0]])
ax.set_title('Numeric vs Categorical', fontdict={'fontsize': 16})
plt.show()

There are **6** numeric columns (including the target, *stroke*) and **5** categorical columns.

In [None]:
# Missing values
df_train.isnull().sum()

There are no missing values.

In [None]:
# let's have a statistical view over the dataset, and look for outliers
df_train.describe()

There do not seem to be any extreme values in the dataset.
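
Reading `describe()` is a quick visual check. A minimal sketch of a more systematic check, assuming the common 1.5 × IQR rule (an assumption, the notebook does not use it) and a hypothetical helper `iqr_outlier_counts` run on made-up toy values:

```python
import pandas as pd

def iqr_outlier_counts(df, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = df.select_dtypes(include='number')
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # comparisons align the per-column bounds with the DataFrame columns
    mask = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return mask.sum()

# toy data (made up for illustration, not taken from the competition files)
toy = pd.DataFrame({
    'bmi': [22.0, 24.5, 26.0, 27.5, 29.0, 31.0, 80.0],
    'age': [30.0, 40.0, 45.0, 50.0, 55.0, 60.0, 65.0],
})
counts = iqr_outlier_counts(toy)
print(counts)
```

On the real data the same helper could be called as `iqr_outlier_counts(df_train)`. Binary columns such as *hypertension* will flag every positive case, so their counts should be read with care.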

---

<a id='3'></a>
# 3- Exploratory Data Analysis (EDA)

In this section, we will first go through the features one by one, and then look at the relationships among the features and between each feature and the target variable.

<a id='3-1'></a>
## 3.1- Looking at features individually

In [None]:
# function to draw a pie chart of a feature's value counts
def draw_count_pie(df, feature):
    counts = df[feature].value_counts()
    # pull the most frequent slice out slightly to emphasize it
    explode = [0] * counts.shape[0]
    explode[0] = 0.1
    plt.pie(
        x=counts,
        labels=counts.index,
        autopct='%1.1f%%',
        explode=explode,
        shadow=True,
        startangle=0
    )
    plt.title(f'{feature.title()} Counts', fontdict={'fontsize': 16})

**Let's look at *Gender* feature**

In [None]:
# gender
df_train['gender'].unique()

There are **3** values for gender: *Male*, *Female* and *Other*.

In [None]:
# gender count
pd.crosstab(
    index=df_train['gender'],
    columns='Counts'
).T

In [None]:
# gender count visualization
plt.figure(figsize=(12, 10))

plt.subplot(2, 2, 1)
ax = sns.countplot(x=df_train['gender'])
ax.set_title('Gender Count', fontdict={'fontsize': 16})

plt.subplot(2, 2, 2)

draw_count_pie(df_train, 'gender')

plt.show()

Most observations are Female, followed by Male, and there is only one observation with a gender of *Other*.

**Let's look at *Age* feature**

In [None]:
# age
df_train['age'].describe()

The minimum age is **0.08**, and the maximum age is **82** years.

In [None]:
# Age distribution visualization
plt.figure(figsize=(8, 6))

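# distplot is deprecated in recent seaborn versions; histplot with a KDE overlay is the replacement
ax = sns.histplot(x=df_train['age'], kde=True, stat='density')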
ax.set_title('Age distribution', fontdict={'fontsize': 16})
plt.show()

A large share of the observations are in their 40's to 60's. Also, the *Age* distribution does not show much skew.

**Let's look at *hypertension* feature**

In [None]:
df_train['hypertension'].unique()

There are **2** values corresponding to *hypertension*, 0 & 1, with 0 being negative and 1 being a positive case.

In [None]:
# hypertension count
pd.crosstab(
    index=df_train['hypertension'],
    columns='Counts'
).T

In [None]:
# hypertension count visualization
plt.figure(figsize=(12, 10))

plt.subplot(2, 2, 1)
ax = sns.countplot(x=df_train['hypertension'])
ax.set_title('Hypertension Count', fontdict={'fontsize': 16})

plt.subplot(2, 2, 2)

draw_count_pie(df_train, 'hypertension')

plt.show()

The number of observations without hypertension (14543) is much higher than the number of observations with hypertension (761).

**Let's look at *heart_disease* feature**

In [None]:
df_train['heart_disease'].unique()

There are 2 values corresponding to *heart_disease*, 0 & 1, with 0 being negative and 1 being a positive case.

In [None]:
# heart_disease count
pd.crosstab(
    index=df_train['heart_disease'],
    columns='Counts'
).T

In [None]:
# heart_disease count visualization
plt.figure(figsize=(12, 10))

plt.subplot(2, 2, 1)
ax = sns.countplot(x=df_train['heart_disease'])
ax.set_title('heart_disease Count', fontdict={'fontsize': 16})

plt.subplot(2, 2, 2)
draw_count_pie(df_train, 'heart_disease')

plt.show()

The number of observations without heart disease (14947) is much higher than the number of observations with heart disease (357).

**Let's look at *ever_married* feature**

In [None]:
df_train['ever_married'].unique()

There are 2 values corresponding to *ever_married*, *Yes* & *No*, with *No* being negative and *Yes* being positive.

In [None]:
# ever_married count
pd.crosstab(
    index=df_train['ever_married'],
    columns='Counts'
).T


In [None]:
# ever_married count visualization
plt.figure(figsize=(12, 10))

plt.subplot(2, 2, 1)
ax = sns.countplot(x=df_train['ever_married'])
ax.set_title('ever_married Count', fontdict={'fontsize': 16})

plt.subplot(2, 2, 2)
draw_count_pie(df_train, 'ever_married')

plt.show()

The number of observations who have ever married (10385) is much higher than the number of observations who have never married (4919).

**Let's look at *work_type* feature**

In [None]:
df_train['work_type'].unique()

There are 5 values corresponding to *work_type* (*Private*, *Self-employed*, *Govt_job*, *children* and *Never_worked*), and they are pretty self-explanatory.

In [None]:
# work_type count
pd.crosstab(
    index=df_train['work_type'],
    columns='Counts'
).T

In [None]:
# work_type count visualization
plt.figure(figsize=(12, 10))

sns.set_palette('Set2')
plt.subplot(2, 2, 1)
ax = sns.countplot(x=df_train['work_type'])
ax.set_title('work_type Count', fontdict={'fontsize': 16})

plt.subplot(2, 2, 2)
draw_count_pie(df_train, 'work_type')

plt.show()

The above charts and figures are pretty self-explanatory.

**Let's look at *Residence_type* feature**

In [None]:
df_train['Residence_type'].unique()

There are 2 values corresponding to *Residence_type*, *Urban* & *Rural*

In [None]:
# Residence_type count
pd.crosstab(
    index=df_train['Residence_type'],
    columns='Counts'
).T

In [None]:
# Residence_type count visualization
plt.figure(figsize=(12, 10))

plt.subplot(2, 2, 1)
ax = sns.countplot(x=df_train['Residence_type'])
ax.set_title('Residence_type Count', fontdict={'fontsize': 16})

plt.subplot(2, 2, 2)
draw_count_pie(df_train, 'Residence_type')


plt.show()

The *Urban* and *Rural* values of *Residence_type* are distributed almost equally, so this feature is well balanced.

**Let's look at *avg_glucose_level* feature**

In [None]:
# avg_glucose_level
df_train['avg_glucose_level'].describe()

In [None]:
# avg_glucose_level distribution visualization
plt.figure(figsize=(8, 6))

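# distplot is deprecated in recent seaborn versions; histplot with a KDE overlay is the replacement
ax = sns.histplot(x=df_train['avg_glucose_level'], kde=True, stat='density')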
ax.set_title('Avg Glucose Level distribution', fontdict={'fontsize': 16})
plt.show()

The *avg_glucose_level* distribution is right-skewed: the majority of observations have an average glucose level of 60-120, and far fewer fall in the 120-250 range.
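
The skew can also be put into a number instead of being read off the plot. The following is a small sketch on synthetic log-normal data (a stand-in chosen for illustration, not the real column) that measures the sample skewness with `Series.skew()` before and after a `log1p` transform. The notebook itself does not apply such a transform.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# synthetic, right-skewed stand-in for avg_glucose_level (assumption, not the real data)
glucose = pd.Series(rng.lognormal(mean=4.5, sigma=0.4, size=5000))

skew_raw = glucose.skew()            # clearly positive for a long right tail
skew_log = np.log1p(glucose).skew()  # much closer to 0 after the log transform
print(f'skew before: {skew_raw:.2f}, after log1p: {skew_log:.2f}')
```

With the real data, `df_train['avg_glucose_level'].skew()` would give the corresponding figure.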

**Let's look at *bmi* feature**

In [None]:
# bmi
df_train['bmi'].describe()

In [None]:
# bmi distribution visualization
plt.figure(figsize=(8, 6))

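# distplot is deprecated in recent seaborn versions; histplot with a KDE overlay is the replacement
ax = sns.histplot(x=df_train['bmi'], kde=True, stat='density')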
ax.set_title('bmi distribution', fontdict={'fontsize': 16})
plt.show()

The *bmi* distribution is also right-skewed: the majority of observations have a *bmi* of 25-35, and very few have a *bmi* of 50-80, which may be troublesome.

**Let's look at *smoking_status* feature**

In [None]:
df_train['smoking_status'].unique()

There are 4 values corresponding to *smoking_status*, 'never smoked', 'formerly smoked', 'Unknown' and 'smokes'.

In [None]:
# smoking_status count
pd.crosstab(
    index=df_train['smoking_status'],
    columns='Counts'
).T

In [None]:
# smoking_status count visualization
plt.figure(figsize=(12, 10))
sns.set_palette('Set2')

plt.subplot(2, 2, 1)
ax = sns.countplot(x=df_train['smoking_status'])
ax.set_title('Smoking Status Count', fontdict={'fontsize': 16})

plt.subplot(2, 2, 2)
draw_count_pie(df_train, 'smoking_status')

plt.grid(True)
plt.show()

**Let's have a look at the target variable**

In [None]:
df_train['stroke'].unique()

1 = Positive case, 0 = Negative case

In [None]:
# target variable count
pd.crosstab(
    index=df_train['stroke'],
    columns='Counts'
).T

In [None]:
# Target variable count visualization
plt.figure(figsize=(12, 10))

sns.set_palette('Set2')
plt.subplot(2, 2, 1)
ax = sns.countplot(x=df_train['stroke'])
ax.set_title('The target variable count', fontdict={'fontsize': 16})

plt.subplot(2, 2, 2)
draw_count_pie(df_train, 'stroke')

plt.grid(True)
plt.show()

The target variable is extremely **imbalanced**, which may affect an ML model negatively. The number of **1's (632)** is much lower than the number of **0's (14672).**
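
To make the imbalance concrete, here is a short calculation using only the two counts quoted above. The negative-to-positive ratio is the kind of value often passed to XGBoost's `scale_pos_weight` parameter; that is mentioned only as context, since this notebook does not set it.

```python
n_neg = 14672  # observations with stroke == 0
n_pos = 632    # observations with stroke == 1

pos_rate = n_pos / (n_pos + n_neg)  # share of positive cases
neg_to_pos = n_neg / n_pos          # negatives per positive case

print(f'positive rate: {pos_rate:.2%}')
print(f'negatives per positive: {neg_to_pos:.1f}')
```

Roughly 4% of the rows are positive, so a model that always predicts 0 would already be right about 96% of the time, which is why accuracy alone says little here.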

---

<a id='3-2'></a>
## 3.2- See the relationships among features [and the target variable]

In this section we will see how each feature affects the target variable. This will help us understand the importance of the features, and also see which specific values in a column are more influential.

**We will first examine the effects of the categorical features on the target variable, and then we will go through the numeric features.**

**1- Gender**

In [None]:
# How gender affects the target variable
plt.figure(figsize=(8, 6))

sns.set_palette('Set2')

ax = sns.countplot(x=df_train['gender'], hue=df_train['stroke'])
ax.set_title('Gender vs Stroke counts')
plt.show()

In [None]:
# gender vs stroke
pd.crosstab(
    index=df_train['gender'],
    columns=df_train['stroke'],
    margins=True,
    normalize='index',
)

As we can see from the above chart and table, gender alone does not have much effect on the stroke value. **That is, 96% of females and 95% of males have a stroke value of 0.**
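
The percentages come from `normalize='index'`, which divides each row of the crosstab by its row total, so every row sums to 1 and the column for stroke value 1 is the stroke rate per group. A tiny sketch on made-up values (not the competition data) shows the effect:

```python
import pandas as pd

# toy data, made up for illustration
toy = pd.DataFrame({
    'gender': ['Female', 'Female', 'Female', 'Female', 'Male', 'Male'],
    'stroke': [0, 0, 0, 1, 0, 1],
})

rates = pd.crosstab(
    index=toy['gender'],
    columns=toy['stroke'],
    normalize='index',  # each row is divided by its own total
)
print(rates)
```

Here one of four females and one of two males have a stroke, so the column for stroke value 1 reads 0.25 and 0.5.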

**2- Ever Married**

In [None]:
# How ever_married feature affects the target variable
plt.figure(figsize=(8, 6))

sns.set_palette('Set2')

ax = sns.countplot(x=df_train['ever_married'], hue=df_train['stroke'])
ax.set_title('ever_married vs Stroke counts')
plt.show()

In [None]:
# Ever Married vs stroke
pd.crosstab(
    index=df_train['ever_married'],
    columns=df_train['stroke'],
    margins=True,
    normalize='index',
)

Those who have ever married have a higher chance of having a stroke than those who have never married.

**3- Work Type**

In [None]:
# How work_type feature affects the target variable
plt.figure(figsize=(8, 6))

sns.set_palette('Set2')

ax = sns.countplot(x=df_train['work_type'], hue=df_train['stroke'])
ax.set_title('Work_type vs Stroke counts')
plt.show()

In [None]:
# Work Type vs stroke
pd.crosstab(
    index=df_train['work_type'],
    columns=df_train['stroke'],
    margins=True,
    normalize='index',
)

Those who have never worked have the lowest rate of stroke, whereas those who are self-employed have the highest. This feature has a stronger effect on the target variable than the previous ones.

**4- Residence_type**

In [None]:
# How Residence_type feature affects the target variable
plt.figure(figsize=(8, 6))

sns.set_palette('Set2')

ax = sns.countplot(x=df_train['Residence_type'], hue=df_train['stroke'])
ax.set_title('Residence_type vs Stroke counts')
plt.show()

In [None]:
# Residence_type vs stroke
pd.crosstab(
    index=df_train['Residence_type'],
    columns=df_train['stroke'],
    margins=True,
    normalize='index',
)

We may consider removing this feature, *Residence_type*, since both of its values have the same relationship with the target variable: 95.9% of the observations in each group have a stroke value of 0.

**5- Smoking Status**

In [None]:
# How smoking_status feature affects the target variable
plt.figure(figsize=(8, 6))

sns.set_palette('Set2')

ax = sns.countplot(x=df_train['smoking_status'], hue=df_train['stroke'])
ax.set_title('Smoking_status vs Stroke counts')
plt.show()

In [None]:
# smoking_status vs stroke
pd.crosstab(
    index=df_train['smoking_status'],
    columns=df_train['stroke'],
    margins=True,
    normalize='index',
)

The *Unknown* and *never smoked* groups have a lower chance of having a stroke, whereas the *formerly smoked* and *smokes* groups have a higher chance. Those who formerly smoked have the highest chance of having a stroke.

**6- Hypertension**

In [None]:
# How hypertension feature affects the target variable
plt.figure(figsize=(8, 6))

sns.set_palette('Set2')

ax = sns.countplot(x=df_train['hypertension'], hue=df_train['stroke'])
ax.set_title('Hypertension vs Stroke counts')
plt.show()

In [None]:
# Hypertension vs stroke
pd.crosstab(
    index=df_train['hypertension'],
    columns=df_train['stroke'],
    margins=True,
    normalize='index',
)

Those who suffer from *hypertension* have a higher chance of having a stroke than those who don't.

**7- Heart Disease**

In [None]:
# How heart_disease feature affects the target variable
plt.figure(figsize=(8, 6))

sns.set_palette('Set2')

ax = sns.countplot(x=df_train['heart_disease'], hue=df_train['stroke'])
ax.set_title('Heart_disease vs Stroke counts')
plt.show()

In [None]:
# heart_disease vs stroke
pd.crosstab(
    index=df_train['heart_disease'],
    columns=df_train['stroke'],
    margins=True,
    normalize='index',
)

Those who suffer from *heart disease* have a higher chance of having a stroke than those who don't.

---

**Now let's see the effects of different numeric features on the target variable**

**1- Age**

In [None]:
# How age feature affects the target variable
plt.figure(figsize=(10, 6))

sns.set_palette('Set2')

ax = sns.boxenplot(data=df_train, x='stroke', y='age')
ax.set_title('Age vs Stroke', fontdict={'fontsize': 16})
plt.show()

We can clearly see that most observations with a stroke are above 60, whereas most observations without a stroke are under 60. This feature can be a good predictor.
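
The same pattern can be expressed as a stroke rate per age band. A minimal sketch, assuming hypothetical band edges of 40 and 60 and made-up toy rows rather than the competition data:

```python
import pandas as pd

# toy data, made up for illustration
toy = pd.DataFrame({
    'age':    [10, 25, 45, 50, 65, 70, 80, 82],
    'stroke': [0,  0,  0,  0,  0,  1,  1,  1],
})

# pd.cut uses right-inclusive intervals by default: (0, 40], (40, 60], (60, 100]
toy['age_band'] = pd.cut(toy['age'], bins=[0, 40, 60, 100],
                         labels=['<=40', '41-60', '>60'])

# the mean of a 0/1 column is the share of positive cases in each band
rate_by_band = toy.groupby('age_band', observed=True)['stroke'].mean()
print(rate_by_band)
```

Applied to `df_train`, a table like this would complement the boxen plot by showing how quickly the rate rises with age.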

**2- Avg Glucose Level**

In [None]:
# How avg_glucose_level feature affects the target variable
plt.figure(figsize=(10, 6))

sns.set_palette('Set2')

ax = sns.boxenplot(data=df_train, x='stroke', y='avg_glucose_level')
ax.set_title('Avg glucose level vs Stroke', fontdict={'fontsize': 16})
plt.show()

Again, the higher the glucose level, the higher the chance of having a stroke.

**3- BMI**

In [None]:
# How bmi feature affects the target variable
plt.figure(figsize=(10, 6))

sns.set_palette('Set2')

ax = sns.boxenplot(data=df_train, x='stroke', y='bmi')
ax.set_title('BMI vs Stroke', fontdict={'fontsize': 16})
plt.show()

Though many observations without a stroke have a lower bmi, bmi does not seem to have a large impact on the target variable, because some observations with a very high bmi still have no stroke.

---

<a id='4'></a>
# 4- Data Preprocessing

In this section we will preprocess the dataset to get it ready for modeling. We will first combine the training set with the positive cases from the **original dataset** and with the test set, and then do the following:
1. We will remove the *Residence_type* and *bmi* features since they have almost no impact on the target variable, and removing this noise may help the model's performance.
2. We will encode all categorical variables as numbers so that the algorithms we use can consume them.
3. We will normalize/scale the features whose values are not on the same scale as the others. Here we will use min-max scaling to map the values into the range 0 to 1.
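
Min-max scaling maps each value with x' = (x - min) / (max - min), so the smallest value becomes 0 and the largest becomes 1. A small NumPy sketch of that formula, using a hypothetical helper `min_max_scale` and three illustrative ages:

```python
import numpy as np

def min_max_scale(x):
    """Rescale a 1-D array to [0, 1] with (x - min) / (max - min)."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

# illustrative ages: the observed minimum, a middle value and the observed maximum
ages = [0.08, 41.0, 82.0]
scaled = min_max_scale(ages)
print(scaled)
```

scikit-learn's `MinMaxScaler` with its default `feature_range=(0, 1)` applies the same formula column by column.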

---

In [None]:
# load the original dataset 
df_orig = pd.read_csv('/kaggle/input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv', index_col=0)


Because the dataset contains many observations with a stroke value of 0 but only a few with a stroke value of 1, we will add only those rows from the original data that have a stroke value of 1.

In [None]:
# combine it with the training set (DataFrame.append was removed in pandas 2.0, so use pd.concat)
df_orig_stroke_1 = df_orig[df_orig['stroke'] == 1]
df_train = pd.concat([df_train, df_orig_stroke_1])
df_train.shape

In [None]:
# let's combine the training and the test sets for consistency purposes
# ignore_index=True resets the index so that every row gets a unique position
df = pd.concat([df_train, df_test], ignore_index=True)

df.shape

---

In [None]:
# 1- remove 'Residence_type' and 'bmi' features
df.drop(['Residence_type', 'bmi'], axis=1, inplace=True)

In [None]:
# 2- encode categorical variables to numbers

# gender
df['gender'] = df['gender'].map({
    'Male': 0,
    'Female': 1,
    'Other': 2
}).astype('int')

# ever_married
df['ever_married'] = df['ever_married'].map({
    'Yes': 1,
    'No': 0
}).astype('int')

# work_type
df['work_type'] = df['work_type'].map({
    'Private':0,
    'Self-employed': 1,
    'Govt_job': 2,
    'children': 3,
    'Never_worked': 4
}).astype('int')

# smoking_status
df['smoking_status'] = df['smoking_status'].map({
    'never smoked' : 0,
    'formerly smoked': 1,
    'smokes': 2,
    'Unknown': 3
}).astype('int')

df.dtypes
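
One thing to keep in mind with this approach: `Series.map` with a dict returns `NaN` for any value that is not a key of the dict, and the following `.astype('int')` then raises an error. A short sketch of how unmapped labels could be found first, using a made-up series with a deliberate typo:

```python
import pandas as pd

mapping = {'Male': 0, 'Female': 1, 'Other': 2}

# toy series; 'male' is a deliberate typo that the mapping does not cover
gender = pd.Series(['Male', 'Female', 'male'])

encoded = gender.map(mapping)               # unmapped values become NaN
unmapped = gender[encoded.isna()].unique()  # labels missing from the mapping
print(unmapped)
```

If `unmapped` is empty, the cast to `int` is safe.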

In [None]:
# 3- data normalization on numeric features (age & avg_glucose_level) using MinMaxScaler
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df[['age', 'avg_glucose_level']] = scaler.fit_transform(df[['age', 'avg_glucose_level']])
df.head(5)

In [None]:
# separate the training set and the test set from each other
n_train = df_train.shape[0]
df_train = df.iloc[:n_train]
# the test rows have no target, so drop the (empty) stroke column
df_test = df.iloc[n_train:].drop(columns='stroke')

---

<a id='5'></a>
# 5- Modeling

- We will train the models and predict the test set using **K-fold** cross-validation. Here, we will use the **XGBRFClassifier** and **CatBoostClassifier** models, and then blend their predictions.
- Submit the result

<a id='5-1'></a>
## 5.1- Model Construction & Prediction

In [None]:
# separate the features from the target variable
X = df_train.drop('stroke', axis=1)
y = df_train['stroke'].astype('int')

In [None]:
# helper that averages test set predictions over the K folds
from sklearn.model_selection import KFold

NFOLDS = 10
kf = KFold(n_splits=NFOLDS, shuffle=True, random_state=10)

def get_oof_test_preds(clf, X, y, test_df):
    '''
        Fits clf on the training part of each fold and returns the
        test set probabilities of class 1, averaged over all folds.
        X, y and test_df are expected to be NumPy arrays.
    '''
    fold_test_preds = np.empty((NFOLDS, test_df.shape[0]))

    for i, (train_index, _) in enumerate(kf.split(X)):
        # the held-out part of each fold is not used here
        clf.fit(X[train_index], y[train_index])
        fold_test_preds[i, :] = clf.predict_proba(test_df)[:, 1]

    return fold_test_preds.mean(axis=0).reshape(-1, 1)
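
Because the target is heavily imbalanced, a plain `KFold` split can leave some folds with noticeably fewer positive cases than others. As a hedged alternative that this notebook does not use, scikit-learn's `StratifiedKFold` keeps the class proportions roughly equal across folds. A sketch on synthetic labels (95 negatives, 5 positives):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# synthetic, imbalanced labels made up for illustration
y = np.array([0] * 95 + [1] * 5)
X = np.zeros((100, 1))  # the features do not matter for the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=10)

# number of positive cases that land in each validation fold
pos_per_fold = [int(y[val_idx].sum()) for _, val_idx in skf.split(X, y)]
print(pos_per_fold)
```

Note that `split` needs the labels as a second argument, unlike `KFold.split(X)`.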

In [None]:
# training models
from xgboost import XGBRFClassifier
from catboost import CatBoostClassifier

xgbrf_model = XGBRFClassifier(n_estimators=1000)
cb_model = CatBoostClassifier(n_estimators=1000, verbose=0)

# let's get the fold-averaged test set predictions using the function defined above
xgbrf_oof_preds = get_oof_test_preds(xgbrf_model, X.values, y.values, df_test.values)
cb_oof_preds = get_oof_test_preds(cb_model, X.values, y.values, df_test.values)

<a id='5-2'></a>
## 5.2- Result Submission

In [None]:
# blend them
preds = (cb_oof_preds * 0.1) + (0.9 * xgbrf_oof_preds) # 10% CatBoost, 90% XGBRF
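
The blend is a weighted average of the two probability vectors, with weights that sum to 1. A minimal sketch with a hypothetical helper `blend` and made-up probabilities shows the arithmetic:

```python
import numpy as np

def blend(preds_a, preds_b, weight_a):
    """Weighted average of two prediction arrays; the weights sum to 1."""
    a = np.asarray(preds_a, dtype=float)
    b = np.asarray(preds_b, dtype=float)
    return weight_a * a + (1.0 - weight_a) * b

# made-up probabilities for two test rows
cb = np.array([0.2, 0.8])
xgbrf = np.array([0.1, 0.5])

blended = blend(cb, xgbrf, weight_a=0.1)  # 10% first model, 90% second model
print(blended)
```

Since the weights sum to 1, every blended value stays between the two input probabilities. The 0.1/0.9 split mirrors the cell above; other weights could be chosen from validation scores.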

In [None]:
# create a dataframe
submission = pd.DataFrame({
    'id': np.arange(15304, 25508),
    'stroke': preds.reshape(-1, )
})
submission.head()

In [None]:
# write to a file
submission.to_csv('./submission.csv', index=False)
print('Done...')

# Thank you :)
By: [Hikmatullah Mohammadi](https://www.kaggle.com/hikmatullahmohammadi) <br>

[Go to top](#0)