# Stroke Prediction Dataset (Data Preparation and Data Cleansing)
According to the World Health Organization (WHO), stroke is the second leading cause of death globally, responsible for approximately 11% of total deaths. This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about one patient.
____

**Contents:**
- Import Library
- Data Preparation
    - Uploading the CSV dataset
    - Reading the dataset
    - Checking for null values
    - Checking data types
- Data Cleansing
    - Handling null values
    - Checking for and deleting duplicate values

## Import Library

In [1]:
#import lib
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

## Data Preparation

### Uploading CSV Dataset

In [2]:
#uploading dataset
from google.colab import files
uploaded = files.upload()

Saving healthcare-dataset-stroke-data.csv to healthcare-dataset-stroke-data.csv
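
The `google.colab` module is only importable inside Google Colab, so the upload cell fails when the notebook runs elsewhere. The following is a minimal sketch of a guard, assuming the CSV sits in the working directory when running outside Colab; the names `CSV_NAME`, `in_colab`, and `csv_path` are introduced here for illustration only.

```python
from pathlib import Path

# File name taken from the upload output above
CSV_NAME = "healthcare-dataset-stroke-data.csv"

try:
    from google.colab import files  # only importable inside Google Colab
    in_colab = True
except ImportError:
    in_colab = False

# Inside Colab, ask for an upload only if the file is not already present
if in_colab and not Path(CSV_NAME).exists():
    files.upload()

# Assumption: outside Colab the file is already in the working directory
csv_path = Path(CSV_NAME)
print(in_colab, csv_path)
```

The resulting `csv_path` can then be passed to `pd.read_csv`.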


### Reading Dataset

In [3]:
#reading csv dataset
data = pd.read_csv('healthcare-dataset-stroke-data.csv')

In [6]:
data.head()

| | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9046 | Male | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
| 1 | 51676 | Female | 61.0 | 0 | 0 | Yes | Self-employed | Rural | 202.21 | NaN | never smoked | 1 |
| 2 | 31112 | Male | 80.0 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
| 3 | 60182 | Female | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
| 4 | 1665 | Female | 79.0 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |


In [5]:
data.tail()

| | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5105 | 18234 | Female | 80.0 | 1 | 0 | Yes | Private | Urban | 83.75 | NaN | never smoked | 0 |
| 5106 | 44873 | Female | 81.0 | 0 | 0 | Yes | Self-employed | Urban | 125.2 | 40.0 | never smoked | 0 |
| 5107 | 19723 | Female | 35.0 | 0 | 0 | Yes | Self-employed | Rural | 82.99 | 30.6 | never smoked | 0 |
| 5108 | 37544 | Male | 51.0 | 0 | 0 | Yes | Private | Rural | 166.29 | 25.6 | formerly smoked | 0 |
| 5109 | 44679 | Female | 44.0 | 0 | 0 | Yes | Govt_job | Urban | 85.28 | 26.2 | Unknown | 0 |


### Check for null values in the dataset

In [17]:
#checking null values
data.isna().sum()

id                     0
gender                 0
age                    0
hypertension           0
heart_disease          0
ever_married           0
work_type              0
Residence_type         0
avg_glucose_level      0
bmi                  201
smoking_status         0
stroke                 0
dtype: int64

In [18]:
data.shape

(5110, 12)

There are 201 null values in the 'bmi' column; every other column is complete.

During data cleansing we can handle these null values in one of two ways: drop the rows that contain them, or fill them according to a chosen rule.
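
The two options can be sketched on a small toy frame. The frame `toy` and its values are made up for illustration and are not taken from the stroke dataset; filling with the column median is one possible rule, shown here as an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the stroke dataset
toy = pd.DataFrame({
    "age": [67.0, 61.0, 80.0, 49.0],
    "bmi": [36.6, np.nan, 32.5, 34.4],
})

# Share of missing values per column, in percent
missing_pct = toy.isna().mean() * 100

# Option 1: drop every row that has a null in 'bmi'
dropped = toy.dropna(subset=["bmi"])

# Option 2: keep all rows and fill the nulls with the column median
filled = toy.fillna({"bmi": toy["bmi"].median()})

print(missing_pct)
print(dropped.shape, filled.shape)
```

Dropping loses rows (201 of 5110 in this dataset, about 3.9%), while filling keeps every row at the cost of introducing estimated values.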

### Check the data type of each column

In [12]:
#checking data type
data.dtypes

id                     int64
gender                object
age                  float64
hypertension           int64
heart_disease          int64
ever_married          object
work_type             object
Residence_type        object
avg_glucose_level    float64
bmi                  float64
smoking_status        object
stroke                 int64
dtype: object

In [13]:
#checking dataset info
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB


## Data Cleansing

### Handling null values

In [45]:
#check null values in bmi column
data['bmi'].head(15)

0     36.6
1      NaN
2     32.5
3     34.4
4     24.0
5     29.0
6     27.4
7     22.8
8      NaN
9     24.2
10    29.7
11    36.8
12    27.3
13     NaN
14    28.2
Name: bmi, dtype: float64

In [49]:
#check total null values in bmi column
data['bmi'].isnull().sum()

201

In [56]:
#checking median values in bmi column
data['bmi'].median()

28.1

In [57]:
#replace null values in bmi column using median, assigning the result back so the change persists
data['bmi'] = data['bmi'].fillna(data['bmi'].median())
data['bmi'].head(15)

0     36.6
1     28.1
2     32.5
3     34.4
4     24.0
5     29.0
6     27.4
7     22.8
8     28.1
9     24.2
10    29.7
11    36.8
12    27.3
13    28.1
14    28.2
Name: bmi, dtype: float64
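
A fill operation in pandas returns a new Series, so the imputed values are only kept if the result is assigned back to the column. The sketch below shows this on toy data and confirms that no nulls remain afterwards; the frame `toy` and the variable names are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column, not taken from the stroke dataset
toy = pd.DataFrame({"bmi": [36.6, np.nan, 32.5, 34.4, np.nan]})

# Nulls are skipped when the median is computed
median_bmi = toy["bmi"].median()

# fillna returns a new Series; 'toy' is untouched until the result is assigned
preview = toy["bmi"].fillna(median_bmi)
nulls_before = int(toy["bmi"].isna().sum())

# Assign back so the imputed values are kept
toy["bmi"] = preview
nulls_after = int(toy["bmi"].isna().sum())

print(median_bmi, nulls_before, nulls_after)
```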

### Checking and Deleting Duplicate Values

In [51]:
#original dataset shape
data.shape

(5110, 12)

In [43]:
#drop duplicate rows and compare the shape with the original
data_v1 = data.drop_duplicates()
data_v1.shape

(5110, 12)

**There are no duplicate values in the dataset**: the shape is unchanged after dropping duplicates.
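
Comparing shapes before and after `drop_duplicates` works, but duplicates can also be counted directly with `DataFrame.duplicated`. The sketch below uses a made-up toy frame with one repeated row; it is not taken from the stroke dataset.

```python
import pandas as pd

# Hypothetical toy data with one exact repeat (the last row)
toy = pd.DataFrame({
    "id": [1, 2, 3, 3],
    "gender": ["Male", "Female", "Male", "Male"],
    "age": [67.0, 61.0, 80.0, 80.0],
})

# Number of rows that exactly repeat an earlier row
n_duplicates = int(toy.duplicated().sum())

# Drop the repeats, keeping the first occurrence of each row
deduplicated = toy.drop_duplicates()

print(n_duplicates, deduplicated.shape)
```

A count of zero from `duplicated().sum()` leads to the same conclusion as an unchanged shape.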