# Stroke Prediction

## Introduction

The competition is based on a dataset generated from a deep learning model trained on the [Stroke prediction Dataset](https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset). Its feature distributions are close to, but not exactly the same as, those of the original. Submissions are evaluated on the area under the **ROC curve** between the predicted probability and the observed target.
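To make the metric concrete, here is a minimal sketch of what the area under the ROC curve measures: the fraction of (positive, negative) pairs in which the positive case receives the higher predicted probability, with ties counted as half. The helper `roc_auc` and the toy labels and scores are illustrative assumptions, not competition data or the official scoring code.

```python
def roc_auc(y_true, y_score):
    """Pairwise (rank-based) AUC: share of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined unless both classes are present")
    # A pair counts 1 if the positive scores higher, 0.5 on a tie, 0 otherwise
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical example: two non-stroke (0) and two stroke (1) patients
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(y_true, y_score)
print(auc)
```

Here three of the four positive/negative pairs are ordered correctly, so the score is 0.75. The pairwise loop is quadratic and only meant for illustration; on the full dataset a library routine such as `sklearn.metrics.roc_auc_score` is the usual choice.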

## Data Description

According to the World Health Organization (WHO), stroke is the 2nd leading cause of death globally, responsible for approximately 11% of total deaths. This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about one patient.

**Attribute Information:**
- `id`: unique identifier
- `gender`: "Male", "Female" or "Other"
- `age`: age of the patient
- `hypertension`: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
- `heart_disease`: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
- `ever_married`: "No" or "Yes"
- `work_type`: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
- `Residence_type`: "Rural" or "Urban"
- `avg_glucose_level`: average glucose level in blood
- `bmi`: body mass index
- `smoking_status`: "formerly smoked", "never smoked", "smokes" or "Unknown"*
- `stroke`: 1 if the patient had a stroke or 0 if not

*Note: "Unknown" in `smoking_status` means that the information is unavailable for this patient.

## Loading the required libraries

We load the commonly used libraries here; other required libraries will be loaded later.

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as mpl_colors
import seaborn as sns

## Loading the data

In [2]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/playground-series-s3e2/sample_submission.csv
/kaggle/input/playground-series-s3e2/train.csv
/kaggle/input/playground-series-s3e2/test.csv


In [3]:
train = pd.read_csv('/kaggle/input/playground-series-s3e2/train.csv')
test = pd.read_csv('/kaggle/input/playground-series-s3e2/test.csv')
sample_submission = pd.read_csv('/kaggle/input/playground-series-s3e2/sample_submission.csv')

## Exploratory Data Analysis (EDA)

In [4]:
# Display the first few rows of the data
print('The first few rows of the training dataset:\n')
display(train.head())
print()

# Display the shape of the data
print(f'The shape of the training dataset is: {train.shape}')
print()

The first few rows of the training dataset:



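|   | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|----|--------|-----|--------------|---------------|--------------|-----------|----------------|-------------------|-----|----------------|--------|
| 0 | 0 | Male | 28.0 | 0 | 0 | Yes | Private | Urban | 79.53 | 31.1 | never smoked | 0 |
| 1 | 1 | Male | 33.0 | 0 | 0 | Yes | Private | Rural | 78.44 | 23.9 | formerly smoked | 0 |
| 2 | 2 | Female | 42.0 | 0 | 0 | Yes | Private | Rural | 103.0 | 40.3 | Unknown | 0 |
| 3 | 3 | Male | 56.0 | 0 | 0 | Yes | Private | Urban | 64.87 | 28.8 | never smoked | 0 |
| 4 | 4 | Female | 24.0 | 0 | 0 | No | Private | Rural | 73.36 | 28.8 | never smoked | 0 |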



The shape of the training dataset is: (15304, 12)



In [5]:
# Checking datatypes
print(train.dtypes)

id                     int64
gender                object
age                  float64
hypertension           int64
heart_disease          int64
ever_married          object
work_type             object
Residence_type        object
avg_glucose_level    float64
bmi                  float64
smoking_status        object
stroke                 int64
dtype: object


Every column has an appropriate data type: the numeric features are stored as `int64` or `float64`, and the categorical features are stored as `object`.
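The `object` columns hold text labels, so one optional follow-up is to cast them to pandas' `category` dtype, which makes the set of levels explicit. The sketch below is an assumption about a possible next step, not something the notebook does; the small frame `toy` stands in for `train` and reuses a few values from the preview so that it runs without the Kaggle files.

```python
import pandas as pd

# Toy stand-in for `train` (hypothetical; three rows taken from the preview)
toy = pd.DataFrame({
    "gender": ["Male", "Male", "Female"],
    "age": [28.0, 33.0, 42.0],
    "hypertension": [0, 0, 0],
    "smoking_status": ["never smoked", "formerly smoked", "Unknown"],
})

# Text columns are the ones that are not numeric
text_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Cast each text column to the categorical dtype
for col in text_cols:
    toy[col] = toy[col].astype("category")

print(toy.dtypes)
print(list(toy["smoking_status"].cat.categories))
```

Whether this conversion helps depends on the model used later; some libraries accept `category` columns directly, while others still need an explicit encoding step.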

### Checking for Missing Values

In [6]:
# Check for missing data
print('Missing values per column in the training dataset:')
display(train.isnull().sum())

Missing values per column in the training dataset:


id                   0
gender               0
age                  0
hypertension         0
heart_disease        0
ever_married         0
work_type            0
Residence_type       0
avg_glucose_level    0
bmi                  0
smoking_status       0
stroke               0
dtype: int64

This is good news because it means we can skip the data imputation step, which may introduce bias or errors into our analysis. Although no column contains null values, we still need to handle the "Unknown" level of `smoking_status`, which means that the information is unavailable for that patient and is therefore a missing value in disguise.
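Because `isnull()` cannot see the "Unknown" label, it helps to count it explicitly. The following is a minimal sketch under stated assumptions: `toy` is a hypothetical stand-in for `train` built from the five `smoking_status` values in the preview, and converting the label to `NaN` is shown as one possible treatment rather than the approach this notebook takes.

```python
import pandas as pd

# Toy stand-in for `train` (hypothetical; values from the five preview rows)
toy = pd.DataFrame({
    "smoking_status": [
        "never smoked", "formerly smoked", "Unknown",
        "never smoked", "never smoked",
    ]
})

# Flag rows whose smoking status was not recorded
unknown_mask = toy["smoking_status"].eq("Unknown")
n_unknown = int(unknown_mask.sum())
share_unknown = float(unknown_mask.mean())
print(f"Unknown smoking_status: {n_unknown} rows ({share_unknown:.1%})")

# One option: turn the label into a real missing value so isnull() reports it
as_missing = toy["smoking_status"].mask(unknown_mask)
print(as_missing.isnull().sum())
```

The alternative is to keep "Unknown" as its own category, which avoids imputation and lets a model learn whether the absence of the information is itself predictive. Which option works better is an empirical question for the modelling stage.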