<a href="https://colab.research.google.com/github/ntr262003/Infosys_Stroke_Patient_Healthcare/blob/main/Milestone1_Tulasiram.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Data Exploration on Stroke Patient Healthcare data set**






**Dataset Description**

The dataset contains healthcare data related to strokes. It consists of 5110 records and 12 columns. The columns include patient demographic information, health metrics, and whether they have experienced a stroke.

**Dataset**
The dataset lists patients together with their healthcare-related data:



*   id: Unique identifier for each patient
*   gender: Gender of the patient
*   age: Age of the patient
*   hypertension: 0 = No, 1 = Yes
*   heart_disease: 0 = No, 1 = Yes
*   ever_married: Marital status
*   work_type: Type of employment
*   Residence_type: Urban or Rural
*   avg_glucose_level: Average glucose level in blood
*   bmi: Body Mass Index
*   smoking_status: Smoking habits (never smoked, formerly smoked, etc.)
*   stroke: 1 if the patient had a stroke, 0 otherwise


### **(1) Defining Problem Statement and Analyzing Basic Metrics**

The goal is to develop a predictive model to determine whether a patient is at risk of having a stroke based on their demographic details, medical history, and lifestyle factors. By analyzing features such as age, hypertension, heart disease, BMI, glucose levels, and smoking status, we aim to identify patterns that indicate the likelihood of stroke occurrence. Accurate prediction can help in early intervention and medical decision-making to reduce stroke risks.

###  **(2) Import libraries and Loading the dataset**

In [46]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Suppress warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

In [47]:
# Load healthcare-dataset-stroke-data.csv into a DataFrame (local Colab path)
url = "/content/healthcare-dataset-stroke-data.csv"
df = pd.read_csv(url)

In [48]:
#show the top 5 records of dataset
df.head()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1


# **Data Exploration and Pre-Processing**


### **Check basic metrics and data types**

Understanding the structure of the dataset, including the number of rows and columns and the data type of each attribute, is a crucial step in **data exploration.**

In [49]:
df.shape

(5110, 12)

In [50]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB


**Observations:**



*   The dataset contains 5110 rows and 12 columns
*   Columns like "gender", "ever_married", "work_type", "Residence_type" and "smoking_status" contain string values, which are represented using the "object" datatype in this dataframe
*   The columns "age", "avg_glucose_level" and "bmi" are of float datatype
*   The columns "id", "heart_disease", "hypertension" and "stroke" are of int datatype







In [51]:
# Describing the statistical summary of the numerical type data
df.describe()

Unnamed: 0,id,age,hypertension,heart_disease,avg_glucose_level,bmi,stroke
count,5110.0,5110.0,5110.0,5110.0,5110.0,4909.0,5110.0
mean,36517.829354,43.226614,0.097456,0.054012,106.147677,28.893237,0.048728
std,21161.721625,22.612647,0.296607,0.226063,45.28356,7.854067,0.21532
min,67.0,0.08,0.0,0.0,55.12,10.3,0.0
25%,17741.25,25.0,0.0,0.0,77.245,23.5,0.0
50%,36932.0,45.0,0.0,0.0,91.885,28.1,0.0
75%,54682.0,61.0,0.0,0.0,114.09,33.1,0.0
max,72940.0,82.0,1.0,1.0,271.74,97.6,1.0


**Observations:**


*   Demographics: The mean age of participants is about 43 years (median 45), with ages ranging from 0.08 to 82 years, so the dataset covers infants through the elderly.
*   Health Conditions: Approximately 10% of participants have hypertension and around 5% have heart disease, so both conditions are relatively uncommon in this dataset.
*   Stroke Incidence: Only about 4.9% of participants have experienced a stroke, so the target classes are heavily imbalanced, which any predictive model built on this data will need to account for.
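
Because `hypertension`, `heart_disease` and `stroke` are coded as 0/1, the `mean` row of `describe()` is simply the share of positive cases. The sketch below illustrates this on a small hypothetical column (`toy` is invented for the example and is not the notebook's `df`); the same two calls could be applied to `df['stroke']`.

```python
import pandas as pd

# Hypothetical toy target column standing in for df['stroke']:
# 19 negative cases and 1 positive case.
toy = pd.DataFrame({"stroke": [0] * 19 + [1]})

# Share of each class; normalize=True returns proportions instead of counts.
class_share = toy["stroke"].value_counts(normalize=True)
print(class_share)

# For a 0/1 column, the mean equals the share of positive cases.
positive_rate = toy["stroke"].mean()
print(f"Positive rate: {positive_rate:.1%}")
```

Checking the class shares early makes the imbalance explicit before any modelling decisions are made.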






In [52]:
# Statistical summary of categorical type data
df.describe(include='object')

Unnamed: 0,gender,ever_married,work_type,Residence_type,smoking_status
count,5110,5110,5110,5110,5110
unique,3,2,5,2,4
top,Female,Yes,Private,Urban,never smoked
freq,2994,3353,2925,2596,1892




*   Gender Distribution: Females form the majority with 2,994 records; the remaining 2,116 records are split between the other two gender categories.
*   Marital Status: Most participants (3,353, approximately 65.6%) have been married at some point.
*   Work Type: The most common work type is Private employment, with 2,925 individuals (approximately 57.2%).
*   Smoking Status: The most frequent category is "never smoked", reported by 1,892 participants (approximately 37%).
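
`describe(include='object')` reports only the most frequent category per column. To see the share of every category, `value_counts(normalize=True)` can be used, as in this minimal sketch on hypothetical data (the `toy` frame and its values are invented for illustration; in the notebook the loop would run over the categorical columns of `df`).

```python
import pandas as pd

# Hypothetical toy frame with two of the categorical columns.
toy = pd.DataFrame({
    "gender": ["Female", "Female", "Male", "Other", "Female"],
    "smoking_status": ["never smoked", "Unknown", "smokes",
                       "never smoked", "Unknown"],
})

# Percentage of rows per category, for each listed column.
shares = {
    col: toy[col].value_counts(normalize=True).mul(100).round(1)
    for col in ["gender", "smoking_status"]
}

for col, pct in shares.items():
    print(f"\n{col} (% of rows)")
    print(pct)
```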



### **Finding Unique Values**



In [57]:
unique_values_example = {col: df[col].nunique() for col in df.columns}
print(unique_values_example)

{'id': 5110, 'gender': 3, 'age': 104, 'hypertension': 2, 'heart_disease': 2, 'ever_married': 2, 'work_type': 5, 'Residence_type': 2, 'avg_glucose_level': 3979, 'bmi': 418, 'smoking_status': 4, 'stroke': 2}






*   We count the number of unique values in each column by iterating over the columns








In [58]:
unique_values = {col: df[col].unique() for col in df.columns}
print(unique_values)

{'id': array([ 9046, 51676, 31112, ..., 19723, 37544, 44679]), 'gender': array(['Male', 'Female', 'Other'], dtype=object), 'age': array([6.70e+01, 6.10e+01, 8.00e+01, 4.90e+01, 7.90e+01, 8.10e+01,
       7.40e+01, 6.90e+01, 5.90e+01, 7.80e+01, 5.40e+01, 5.00e+01,
       6.40e+01, 7.50e+01, 6.00e+01, 5.70e+01, 7.10e+01, 5.20e+01,
       8.20e+01, 6.50e+01, 5.80e+01, 4.20e+01, 4.80e+01, 7.20e+01,
       6.30e+01, 7.60e+01, 3.90e+01, 7.70e+01, 7.30e+01, 5.60e+01,
       4.50e+01, 7.00e+01, 6.60e+01, 5.10e+01, 4.30e+01, 6.80e+01,
       4.70e+01, 5.30e+01, 3.80e+01, 5.50e+01, 1.32e+00, 4.60e+01,
       3.20e+01, 1.40e+01, 3.00e+00, 8.00e+00, 3.70e+01, 4.00e+01,
       3.50e+01, 2.00e+01, 4.40e+01, 2.50e+01, 2.70e+01, 2.30e+01,
       1.70e+01, 1.30e+01, 4.00e+00, 1.60e+01, 2.20e+01, 3.00e+01,
       2.90e+01, 1.10e+01, 2.10e+01, 1.80e+01, 3.30e+01, 2.40e+01,
       3.40e+01, 3.60e+01, 6.40e-01, 4.10e+01, 8.80e-01, 5.00e+00,
       2.60e+01, 3.10e+01, 7.00e+00, 1.20e+01, 6.20e+01, 2.00e+00,



*   We list all the unique values present in each column by iterating over the columns



#### **Finding Unique Values for columns having categorical type of data**

In [53]:
df.gender.unique()

array(['Male', 'Female', 'Other'], dtype=object)

In [54]:
df.ever_married.unique()

array(['Yes', 'No'], dtype=object)

In [55]:
df.work_type.unique()

array(['Private', 'Self-employed', 'Govt_job', 'children', 'Never_worked'],
      dtype=object)

In [56]:
df.Residence_type.unique()

array(['Urban', 'Rural'], dtype=object)

In [59]:
df.smoking_status.unique()

array(['formerly smoked', 'never smoked', 'smokes', 'Unknown'],
      dtype=object)
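
The `smoking_status` output includes the label 'Unknown'. A placeholder string like this is not counted by `isnull()`, so it may be worth counting separately. The sketch below shows the difference on a hypothetical column (the `toy` values are invented; treating 'Unknown' as a kind of missing value is an assumption, not something the dataset documents here).

```python
import pandas as pd

# Hypothetical toy column containing the 'Unknown' placeholder label.
toy = pd.DataFrame({
    "smoking_status": ["never smoked", "Unknown", "smokes",
                       "Unknown", "formerly smoked"],
})

# isnull() only detects real NaN/None values, not placeholder strings.
null_count = int(toy["smoking_status"].isnull().sum())

# Count the placeholder explicitly by comparing against the label.
unknown_count = int((toy["smoking_status"] == "Unknown").sum())

print(f"NaN values: {null_count}, 'Unknown' labels: {unknown_count}")
```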

### **Check for missing values**

This is both a **data cleaning** and **data preprocessing** step. Identifying and handling missing values is considered **data cleaning** since it involves addressing the issue of incomplete data. Depending on the extent of missing data, you may need to decide how to handle it, either by imputing values or removing the affected rows/columns. Additionally, it is also a **data preprocessing** step since having missing values can impact the effectiveness of subsequent analyses, and addressing them helps ensure the data is in a suitable form for analysis.

In [60]:
# Display the count of missing values for each column
df.isnull().sum()

Unnamed: 0,0
id,0
gender,0
age,0
hypertension,0
heart_disease,0
ever_married,0
work_type,0
Residence_type,0
avg_glucose_level,0
bmi,201


In [61]:
# Calculate the percentage of missing values for each column, rounded to two decimal places
missing_value_percentage=(df.isnull().mean()*100).round(2)
# Display the missing values percentage for each column
print("Missing Values Percentage:\n")
print(missing_value_percentage)

Missing Values Percentage:

id                   0.00
gender               0.00
age                  0.00
hypertension         0.00
heart_disease        0.00
ever_married         0.00
work_type            0.00
Residence_type       0.00
avg_glucose_level    0.00
bmi                  3.93
smoking_status       0.00
stroke               0.00
dtype: float64


**Observation:**
Only the "bmi" column has missing values, around **3.93 %** of its entries, while the rest of the columns do **not** have any **null** values.
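
The count and the percentage of missing values can also be shown side by side, restricted to the columns that actually have gaps. This is a minimal sketch on a hypothetical frame (`toy` and the name `missing_summary` are invented for the example).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing 'bmi' value.
toy = pd.DataFrame({
    "age": [67.0, 61.0, 80.0, 49.0],
    "bmi": [36.6, np.nan, 32.5, 34.4],
})

# Combine the count and percentage of missing values per column.
missing_summary = pd.DataFrame({
    "missing_count": toy.isnull().sum(),
    "missing_pct": (toy.isnull().mean() * 100).round(2),
})

# Keep only the columns that have at least one missing value.
missing_summary = missing_summary[missing_summary["missing_count"] > 0]
print(missing_summary)
```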

### **Handling null values**

In [68]:
# Replace missing 'bmi' values with 0; assign back instead of using a chained inplace fill, which recent pandas versions warn about
df['bmi'] = df['bmi'].fillna(0)
df

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,0.0,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1
...,...,...,...,...,...,...,...,...,...,...,...,...
5105,18234,Female,80.0,1,0,Yes,Private,Urban,83.75,0.0,never smoked,0
5106,44873,Female,81.0,0,0,Yes,Self-employed,Urban,125.20,40.0,never smoked,0
5107,19723,Female,35.0,0,0,Yes,Self-employed,Rural,82.99,30.6,never smoked,0
5108,37544,Male,51.0,0,0,Yes,Private,Rural,166.29,25.6,formerly smoked,0



**Observation:**
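* Missing 'bmi' values are replaced with zero (0). Zero is not a valid BMI, so this fill lowers the column's mean and should be read as a placeholder rather than a measurement.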
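
To see how a zero fill affects summary statistics, the sketch below compares the mean of a small hypothetical series before and after the fill (the values are invented for illustration and are not taken from the dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical BMI values with one missing entry.
bmi = pd.Series([20.0, 30.0, np.nan, 40.0], name="bmi")

# pandas skips NaN by default: (20 + 30 + 40) / 3
mean_ignoring_nan = bmi.mean()

# After a zero fill the missing entry counts as 0: (20 + 30 + 0 + 40) / 4
mean_after_zero = bmi.fillna(0).mean()

print(mean_ignoring_nan, mean_after_zero)
```

The filled zeros also become the new minimum of the column, which can distort later plots and scaling steps.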



In [62]:
# Mean of 'bmi'; missing values are skipped, so no fill is needed before computing it
np.mean(df.bmi)

28.893236911794666

In [63]:
# Preview mean imputation for 'bmi' only (returns a new DataFrame; df is unchanged)
df.fillna({'bmi': df['bmi'].mean()})

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.600000,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,28.893237,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.500000,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.400000,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.000000,never smoked,1
...,...,...,...,...,...,...,...,...,...,...,...,...
5105,18234,Female,80.0,1,0,Yes,Private,Urban,83.75,28.893237,never smoked,0
5106,44873,Female,81.0,0,0,Yes,Self-employed,Urban,125.20,40.000000,never smoked,0
5107,19723,Female,35.0,0,0,Yes,Self-employed,Rural,82.99,30.600000,never smoked,0
5108,37544,Male,51.0,0,0,Yes,Private,Rural,166.29,25.600000,formerly smoked,0


**Observation:**


*   Another way to handle missing values is to compute the **mean** of the 'bmi' column and replace the null values with that mean
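
Mean imputation keeps the column mean unchanged, but the mean itself is pulled by extreme values (the `describe()` output shows a maximum 'bmi' of 97.6 against a median of 28.1). The median is a common alternative. The sketch below compares both on a hypothetical series; the values are invented, and which fill suits this dataset better is left as a judgment call.

```python
import numpy as np
import pandas as pd

# Hypothetical BMI values with one missing entry and one large value.
bmi = pd.Series([20.0, 30.0, np.nan, 70.0], name="bmi")

# Mean imputation: the gap is filled with (20 + 30 + 70) / 3 = 40.
mean_filled = bmi.fillna(bmi.mean())

# Median imputation: the gap is filled with the middle value, 30.
median_filled = bmi.fillna(bmi.median())

print(mean_filled.tolist())
print(median_filled.tolist())
```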



In [65]:
round(100*df.bmi.isnull().mean(),2)

3.93

**Observation**


*   In the given dataset, 3.93 % of the 'bmi' values are null (201 of 5110 rows), matching the percentage computed earlier.
Since this is a small share of the rows, we can drop them to keep processing simple.



In [66]:
# Preview the frame with incomplete rows removed (df itself is unchanged)
df.dropna()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1
5,56669,Male,81.0,0,0,Yes,Private,Urban,186.21,29.0,formerly smoked,1
...,...,...,...,...,...,...,...,...,...,...,...,...
5104,14180,Female,13.0,0,0,No,children,Rural,103.08,18.6,Unknown,0
5106,44873,Female,81.0,0,0,Yes,Self-employed,Urban,125.20,40.0,never smoked,0
5107,19723,Female,35.0,0,0,Yes,Self-employed,Rural,82.99,30.6,never smoked,0
5108,37544,Male,51.0,0,0,Yes,Private,Rural,166.29,25.6,formerly smoked,0
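
Dropping rows changes more than the row count: if missing values are more common in one target class, the class balance shifts as well. The sketch below checks this on a hypothetical frame (`toy` and its values are invented; whether this happens in the real dataset would need to be checked on `df`).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: both missing 'bmi' values belong to stroke cases.
toy = pd.DataFrame({
    "bmi": [np.nan, np.nan, 32.5, 34.4, 24.0, 28.1],
    "stroke": [1, 1, 1, 0, 0, 0],
})

# Drop only the rows where 'bmi' is missing.
dropped = toy.dropna(subset=["bmi"])

rows_lost = len(toy) - len(dropped)
rate_before = toy["stroke"].mean()     # 3 of 6 rows are positive
rate_after = dropped["stroke"].mean()  # 1 of 4 remaining rows is positive

print(f"Rows lost: {rows_lost}")
print(f"Stroke rate before: {rate_before:.2f}, after: {rate_after:.2f}")
```

Using `subset=["bmi"]` limits the drop to gaps in that column, so rows are not removed because of missing values elsewhere.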
