<a href="https://colab.research.google.com/github/pravallika2580/Stroke_Prediction/blob/main/Prediction_of_Stroke_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**Dataset Description: Healthcare Stroke Data**
The dataset contains medical and demographic information for 5,110 patients, aimed at identifying factors related to the occurrence of strokes. The dataset consists of 12 columns, each representing a feature or characteristic related to the patient’s health or lifestyle.

**Dataset:**
1.   **id:** Unique identifier for each patient (integer).
2.   **gender:** Gender of the patient (categorical: Male, Female, Other).
3.   **age:** Age of the patient (numeric).
4.   **hypertension:** Whether the patient has hypertension (binary: 0 for no, 1 for yes).
5.   **heart_disease:** Whether the patient has heart disease (binary: 0 for no, 1 for yes).
6.   **ever_married:** Marital status of the patient (categorical: Yes, No).
7.   **work_type:** Type of employment (categorical: Private, Self-employed, Govt_job, children, Never_worked).
8.   **Residence_type:** Urban or rural residence (categorical: Urban, Rural).
9.   **avg_glucose_level:** Average glucose level in the blood (numeric).
10.  **bmi:** Body Mass Index (BMI) of the patient (numeric; 201 missing values).
11.  **smoking_status:** Smoking habits (categorical: formerly smoked, never smoked, smokes, Unknown).
12.  **stroke:** Whether the patient has had a stroke (binary: 0 for no, 1 for yes).


There are three continuous numeric columns (age, avg_glucose_level, bmi), one integer identifier (id), three binary columns (hypertension, heart_disease, stroke), and five categorical columns (gender, ever_married, work_type, Residence_type, smoking_status). Only the bmi column has missing values.
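
The grouping of columns by type can be checked programmatically. The sketch below is a minimal illustration on a tiny made-up sample that only borrows some column names from the dataset; the variable names `sample`, `numeric_cols`, `categorical_cols`, and `binary_cols` are assumptions, not part of the notebook.

```python
import pandas as pd

# Tiny made-up sample that borrows a few column names from the stroke dataset.
sample = pd.DataFrame({
    "id": [1, 2, 3],
    "gender": ["Male", "Female", "Other"],
    "age": [67.0, 61.0, 80.0],
    "hypertension": [0, 0, 1],
    "bmi": [36.6, None, 32.5],
    "smoking_status": ["smokes", "Unknown", "never smoked"],
    "stroke": [1, 1, 0],
})

# Columns stored as numbers (this includes the identifier and the 0/1 flags).
numeric_cols = sample.select_dtypes(include="number").columns.tolist()

# Columns stored as text categories.
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

# Numeric columns whose observed values are all 0 or 1 are treated as binary flags.
binary_cols = [c for c in numeric_cols if set(sample[c].dropna().unique()) <= {0, 1}]

print(numeric_cols)
print(categorical_cols)
print(binary_cols)
```

Note that a dtype check alone counts `id` and the binary flags as numeric, which is why the binary columns are separated out in an extra step.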

This dataset can be used for predictive modeling to assess the likelihood of stroke occurrence based on various health and lifestyle factors.

#**(1) Defining The Problem Statement**

---

**Building a Linear Regression Model to Classify Patient Data and Predict the Possibility of Stroke**

The goal of this project is to build a predictive model using Linear Regression to classify patient data and predict the possibility of a stroke. Given a dataset with features such as age, hypertension status, heart disease, glucose levels, BMI, and lifestyle habits, the model should estimate the likelihood of a patient having a stroke. This will help in identifying high-risk patients and in taking preventive measures.
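
One way to read this goal, sketched here on made-up toy numbers rather than the stroke dataset, is to fit an ordinary least-squares line and turn its continuous output into a 0/1 label with a cutoff. The 0.5 cutoff, the single `age` feature, and the use of `np.linalg.lstsq` are assumptions for illustration only.

```python
import numpy as np

# Made-up toy data: one feature (age) and a 0/1 stroke label.
age = np.array([20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0])
stroke = np.array([0, 0, 0, 0, 1, 1, 1])

# Design matrix with an intercept column, then an ordinary least-squares fit.
X = np.column_stack([np.ones_like(age), age])
coef, *_ = np.linalg.lstsq(X, stroke, rcond=None)

# The fitted line gives a continuous score, not a class label.
scores = X @ coef

# Assumed cutoff: scores of 0.5 or more are labelled as stroke (1).
predicted = (scores >= 0.5).astype(int)

print(np.round(scores, 3))
print(predicted)
```

In this toy fit the raw scores are not confined to the range 0 to 1, so the cutoff, rather than the score itself, decides the predicted class.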

#**(2) Importing libraries**



---



In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')



---



#**(3) Data loading and Preprocessing**

In [2]:
# Load the stroke dataset from a CSV file in the working directory
data = pd.read_csv('healthcare-dataset-stroke-data.csv')
data

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1
...,...,...,...,...,...,...,...,...,...,...,...,...
5105,18234,Female,80.0,1,0,Yes,Private,Urban,83.75,,never smoked,0
5106,44873,Female,81.0,0,0,Yes,Self-employed,Urban,125.20,40.0,never smoked,0
5107,19723,Female,35.0,0,0,Yes,Self-employed,Rural,82.99,30.6,never smoked,0
5108,37544,Male,51.0,0,0,Yes,Private,Rural,166.29,25.6,formerly smoked,0


In [3]:
data.head()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1




> The head() method returns the first 5 rows of the DataFrame by default, giving a quick overview of the data structure.



In [4]:
# Get the total number of rows in the DataFrame
len(data)

5110



> There are 5,110 rows in the dataset.



In [5]:
# Check the basic structure and data types
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB


In [6]:
#Get summary statistics for numerical columns
data.describe()

Unnamed: 0,id,age,hypertension,heart_disease,avg_glucose_level,bmi,stroke
count,5110.0,5110.0,5110.0,5110.0,5110.0,4909.0,5110.0
mean,36517.829354,43.226614,0.097456,0.054012,106.147677,28.893237,0.048728
std,21161.721625,22.612647,0.296607,0.226063,45.28356,7.854067,0.21532
min,67.0,0.08,0.0,0.0,55.12,10.3,0.0
25%,17741.25,25.0,0.0,0.0,77.245,23.5,0.0
50%,36932.0,45.0,0.0,0.0,91.885,28.1,0.0
75%,54682.0,61.0,0.0,0.0,114.09,33.1,0.0
max,72940.0,82.0,1.0,1.0,271.74,97.6,1.0


In [7]:
#Generate summary statistics for object (categorical) columns
data.describe(include=object)

Unnamed: 0,gender,ever_married,work_type,Residence_type,smoking_status
count,5110,5110,5110,5110,5110
unique,3,2,5,2,4
top,Female,Yes,Private,Urban,never smoked
freq,2994,3353,2925,2596,1892


In [8]:
data.shape

(5110, 12)


> The shape attribute returns the number of rows and columns as a tuple; this dataset has 5,110 rows and 12 columns.
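
Because `shape` is a plain tuple, it can be unpacked directly. The sketch below uses a small made-up DataFrame; `sample`, `n_rows`, and `n_cols` are illustrative names.

```python
import pandas as pd

# Small made-up DataFrame standing in for the real data.
sample = pd.DataFrame({"age": [67.0, 61.0, 80.0], "stroke": [1, 1, 0]})

# shape is (number of rows, number of columns).
n_rows, n_cols = sample.shape

# len() on a DataFrame gives the same row count as shape[0].
print(f"{n_rows} rows x {n_cols} columns, len = {len(sample)}")
```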

In [9]:
data.columns

Index(['id', 'gender', 'age', 'hypertension', 'heart_disease', 'ever_married',
       'work_type', 'Residence_type', 'avg_glucose_level', 'bmi',
       'smoking_status', 'stroke'],
      dtype='object')

In [10]:
# Check for unique values in categorical columns
categorical_columns = data.select_dtypes(include=['object']).columns

for col in categorical_columns:
    print(f"Unique values in '{col}': {data[col].unique()}")


Unique values in 'gender': ['Male' 'Female' 'Other']
Unique values in 'ever_married': ['Yes' 'No']
Unique values in 'work_type': ['Private' 'Self-employed' 'Govt_job' 'children' 'Never_worked']
Unique values in 'Residence_type': ['Urban' 'Rural']
Unique values in 'smoking_status': ['formerly smoked' 'never smoked' 'smokes' 'Unknown']
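
The output above lists which categories exist, including `Other` in gender and `Unknown` in smoking_status, but not how often each occurs. A possible follow-up, sketched here on made-up values with the hypothetical names `sample` and `category_counts`, is `value_counts()`:

```python
import pandas as pd

# Made-up values that reuse category labels seen in the dataset.
sample = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Other", "Female"],
    "smoking_status": ["smokes", "Unknown", "never smoked", "Unknown", "never smoked"],
})

# Count how many rows fall into each category, per column.
category_counts = {col: sample[col].value_counts().to_dict() for col in sample.columns}

for col, counts in category_counts.items():
    print(f"{col}: {counts}")
```

Seeing the frequency of rare labels helps when deciding how to treat them during encoding.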


#**(4) Missing Values**

---



In [12]:
print('Missing values in dataset are: ')
data.isnull().sum()

Missing values in dataset are: 


id                     0
gender                 0
age                    0
hypertension           0
heart_disease          0
ever_married           0
work_type              0
Residence_type         0
avg_glucose_level      0
bmi                  201
smoking_status         0
stroke                 0
dtype: int64




> Only the bmi column has missing values: 201 of them.
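
When a DataFrame has many columns, it can be easier to show only the columns that actually contain missing values. A minimal sketch on made-up data, with `sample`, `missing_counts`, and `only_missing` as illustrative names:

```python
import numpy as np
import pandas as pd

# Made-up data with gaps in one column only.
sample = pd.DataFrame({
    "age": [67.0, 61.0, 80.0, 49.0],
    "bmi": [36.6, np.nan, 32.5, np.nan],
    "stroke": [1, 1, 1, 0],
})

# Count missing values per column, then keep only the non-zero counts.
missing_counts = sample.isnull().sum()
only_missing = missing_counts[missing_counts > 0]

print(only_missing)
```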



In [13]:
# Calculate the percentage of missing values for each column
missing_percentage = data.isnull().mean() * 100
# Display the missing percentage
print(missing_percentage)

id                   0.000000
gender               0.000000
age                  0.000000
hypertension         0.000000
heart_disease        0.000000
ever_married         0.000000
work_type            0.000000
Residence_type       0.000000
avg_glucose_level    0.000000
bmi                  3.933464
smoking_status       0.000000
stroke               0.000000
dtype: float64




> About 3.93% of the bmi values are missing (201 of 5,110 rows).
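
The percentage follows directly from the two counts reported earlier. As a quick arithmetic check (the variable names are illustrative):

```python
n_rows = 5110      # total rows, as reported by len(data)
n_missing = 201    # missing bmi values, as reported by data.isnull().sum()

# Share of rows with a missing bmi, expressed as a percentage.
pct_missing = n_missing / n_rows * 100

print(round(pct_missing, 6))
```

This is the same calculation that `data.isnull().mean() * 100` performs for each column.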



In [14]:
# Fill the missing bmi values with the column mean.
# Assigning the result back avoids calling fillna(inplace=True) on a column selection.
data['bmi'] = data['bmi'].fillna(data['bmi'].mean())
data.isnull().sum()

id                   0
gender               0
age                  0
hypertension         0
heart_disease        0
ever_married         0
work_type            0
Residence_type       0
avg_glucose_level    0
bmi                  0
smoking_status       0
stroke               0
dtype: int64




> The missing bmi values have been replaced with the column mean (28.893237), so no column has missing values anymore.
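
The effect of mean imputation can be seen on a small made-up series. In this sketch (the names `bmi`, `mean_bmi`, and `filled` are illustrative), the mean is computed from the observed values only and then used to fill the gap:

```python
import numpy as np
import pandas as pd

# Made-up bmi values with one gap.
bmi = pd.Series([36.6, np.nan, 32.5, 34.4, 24.0], name="bmi")

# mean() skips NaN by default, so this is the mean of the four observed values.
mean_bmi = bmi.mean()

# Replace the gap with that mean; the result is a new Series.
filled = bmi.fillna(mean_bmi)

print(mean_bmi)
print(filled.tolist())
print(bmi.std(), filled.std())
```

Filling with the mean leaves the column mean unchanged but reduces its spread, since every imputed value sits exactly at the center.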



In [15]:
# Display the DataFrame after filling the missing bmi values
data

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.600000,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,28.893237,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.500000,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.400000,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.000000,never smoked,1
...,...,...,...,...,...,...,...,...,...,...,...,...
5105,18234,Female,80.0,1,0,Yes,Private,Urban,83.75,28.893237,never smoked,0
5106,44873,Female,81.0,0,0,Yes,Self-employed,Urban,125.20,40.000000,never smoked,0
5107,19723,Female,35.0,0,0,Yes,Self-employed,Rural,82.99,30.600000,never smoked,0
5108,37544,Male,51.0,0,0,Yes,Private,Rural,166.29,25.600000,formerly smoked,0
