### ***PREDICTING STROKE RISK USING PATIENT HEALTH DATA - DATA WRANGLING***

In [9]:
# Let's start by loading the dataset into a pandas DataFrame and inspecting it.

import pandas as pd

# Load the dataset
file_path = 'C:/Users/hecsa/Springboard/Springboard Github/Springboard/Data Science Capstone Two/dataset/healthcare-dataset-stroke-data.csv'
stroke_data = pd.read_csv(file_path)

# Show the first few rows of the dataset to inspect its structure
stroke_data.head()

|   | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9046 | Male | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
| 1 | 51676 | Female | 61.0 | 0 | 0 | Yes | Self-employed | Rural | 202.21 | NaN | never smoked | 1 |
| 2 | 31112 | Male | 80.0 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
| 3 | 60182 | Female | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
| 4 | 1665 | Female | 79.0 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |


**The dataset has been successfully loaded. Here is a quick overview of the columns:**

    id: Unique patient identifier
    gender: Patient's gender (Male, Female, Other)
    age: Age of the patient
    hypertension: 0 if no hypertension, 1 if the patient has hypertension
    heart_disease: 0 if no heart disease, 1 if the patient has heart disease
    ever_married: Marital status (Yes or No)
    work_type: Type of employment (Private, Self-employed, Govt job, etc.)
    Residence_type: Rural or Urban residence
    avg_glucose_level: Average glucose level in blood
    bmi: Body mass index
    smoking_status: Smoking status (formerly smoked, never smoked, smokes, Unknown)
    stroke: Target variable (1 = stroke, 0 = no stroke)

**Next Steps for Data Wrangling:**

    Data Collection: The data loaded successfully. The dataset is self-contained, so there is nothing to join.
    Data Definition: Next, we summarize the data types and column statistics.
    Data Cleaning: We then address missing values and duplicates, and check that the data types are appropriate.

In [13]:
# Summarize the dataset by checking data types, missing values, and basic statistics.
# Note: info() prints its report directly and returns None, so data_info is None.
data_info = stroke_data.info()
summary_statistics = stroke_data.describe(include='all')

# Count missing values per column
missing_values = stroke_data.isnull().sum()

data_info, summary_statistics, missing_values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB


(None,
                   id  gender          age  hypertension  heart_disease  \
 count    5110.000000    5110  5110.000000   5110.000000    5110.000000   
 unique           NaN       3          NaN           NaN            NaN   
 top              NaN  Female          NaN           NaN            NaN   
 freq             NaN    2994          NaN           NaN            NaN   
 mean    36517.829354     NaN    43.226614      0.097456       0.054012   
 std     21161.721625     NaN    22.612647      0.296607       0.226063   
 min        67.000000     NaN     0.080000      0.000000       0.000000   
 25%     17741.250000     NaN    25.000000      0.000000       0.000000   
 50%     36932.000000     NaN    45.000000      0.000000       0.000000   
 75%     54682.000000     NaN    61.000000      0.000000       0.000000   
 max     72940.000000     NaN    82.000000      1.000000       1.000000   
 
        ever_married work_type Residence_type  avg_glucose_level          bmi  \
 count    

**Data Types:**

    Most columns have appropriate data types (e.g., age, hypertension, and heart_disease are numeric).
    Categorical columns such as gender, ever_married, work_type, Residence_type, and smoking_status are stored as object.
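
Columns stored as object can be cast to the pandas `category` dtype, which makes the set of allowed levels explicit. The following is only a sketch on hypothetical toy rows, not a step this notebook performs; the `toy` and `typed` names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real data
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female"],
    "ever_married": ["Yes", "Yes", "No"],
    "age": [67.0, 61.0, 30.0],
})

# Cast only the named text columns; numeric columns are left untouched
categorical_cols = ["gender", "ever_married"]
typed = toy.astype({col: "category" for col in categorical_cols})

print(typed.dtypes)
print(list(typed["gender"].cat.categories))
```

Listing the columns by name, rather than selecting them by dtype, keeps the sketch independent of how a given pandas version labels text columns.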

**Missing Values:**

    The bmi column has 201 missing values, which we will need to address during data cleaning.
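
To put the missing count in context, it can help to report it as a share of rows as well. The sketch below uses hypothetical toy data (the `toy` frame and `missing_report` name are assumptions) and then computes the share implied by the figures reported here, 201 of 5110 rows.

```python
import pandas as pd

# Hypothetical toy data with one missing BMI value out of four rows
toy = pd.DataFrame({
    "age": [67.0, 61.0, 80.0, 49.0],
    "bmi": [36.6, None, 32.5, 34.4],
})

# Count and percentage of missing values per column
missing_report = pd.DataFrame({
    "missing": toy.isnull().sum(),
    "percent": toy.isnull().mean() * 100,
})
print(missing_report)

# Share implied by the counts reported in this notebook
bmi_missing_share = 201 / 5110 * 100
print(round(bmi_missing_share, 1))
```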

**Basic Statistics:**

    Age: Ranges from 0.08 to 82 years old.
    Hypertension: About 9.7% of patients have hypertension (the mean of the 0/1 column).
    Heart Disease: About 5.4% of patients have heart disease (the mean of the 0/1 column).
    Stroke: The target variable, with 4.9% of patients having had a stroke.
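
The percentages above can be read off the mean because the mean of a 0/1 indicator equals the share of ones. A minimal sketch with a hypothetical four-row indicator (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical 0/1 indicator: one positive case out of four
hypertension = pd.Series([0, 1, 0, 0], name="hypertension")

# The mean of a binary column is the proportion of rows equal to 1
share_via_mean = hypertension.mean()

# value_counts(normalize=True) gives the share of every class
class_shares = hypertension.value_counts(normalize=True)

print(share_via_mean)
print(class_shares)
```

The same `value_counts(normalize=True)` call applied to a target column is a quick way to see how imbalanced the classes are.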

**Next Steps:**

    Handling missing values: We’ll need to decide how to handle the missing BMI data (e.g., imputation or removal).
    Data Cleaning: We’ll check for any duplicate records and ensure all categorical columns are correctly encoded for analysis.
    Data Exploration: Gain insights by visualizing distributions and relationships.
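
The two options named for the missing BMI values differ mainly in how many rows survive. The sketch below contrasts them on hypothetical toy data; the `toy`, `dropped`, and `imputed` names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data: two of five BMI values are missing
toy = pd.DataFrame({
    "bmi": [36.6, None, 32.5, None, 24.0],
    "stroke": [1, 1, 1, 0, 0],
})

# Option 1: removal drops every row with a missing BMI
dropped = toy.dropna(subset=["bmi"])

# Option 2: median imputation keeps all rows and fills the gaps
bmi_median = toy["bmi"].median()  # median() skips missing values by default
imputed = toy.assign(bmi=toy["bmi"].fillna(bmi_median))

print(len(dropped), len(imputed))
print(bmi_median)
```

Removal discards the other columns of the affected rows, including the target, while imputation keeps them at the cost of flattening the BMI distribution around its median.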

In [22]:
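# Handle missing values in the 'bmi' column by filling them with the column's median.
# Assign the result back rather than using inplace=True on a column selection,
# which newer pandas versions warn about and may not apply to the original DataFrame.
bmi_median = stroke_data['bmi'].median()
stroke_data['bmi'] = stroke_data['bmi'].fillna(bmi_median)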

# Checking for duplicates in the dataset
duplicate_rows = stroke_data.duplicated().sum()

# Drop any duplicates if found
stroke_data_cleaned = stroke_data.drop_duplicates()

# Verify missing values are handled and no duplicates remain
missing_values_after_cleaning = stroke_data_cleaned.isnull().sum()
total_rows_after_cleaning = stroke_data_cleaned.shape[0]

bmi_median, duplicate_rows, missing_values_after_cleaning, total_rows_after_cleaning

(28.1,
 0,
 id                   0
 gender               0
 age                  0
 hypertension         0
 heart_disease        0
 ever_married         0
 work_type            0
 Residence_type       0
 avg_glucose_level    0
 bmi                  0
 smoking_status       0
 stroke               0
 dtype: int64,
 5110)

**Here are the results after handling missing values and checking for duplicates:**

    Missing Values: All missing values in the bmi column have been filled with the median value of 28.1, and no other columns have missing values.

    Duplicates: There were no duplicate rows in the dataset, so no rows were removed.

    Total Rows: The dataset still has 5,110 rows after cleaning.

In [26]:
# Save the cleaned dataset as a CSV file
stroke_data_cleaned.to_csv('stroke_data_cleaned.csv', index=False)
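
A quick way to confirm that a saved file is usable is to read it back and compare its shape and columns with the DataFrame that was written. The sketch below is an assumption-laden illustration: it writes a hypothetical two-row stand-in to a temporary directory rather than touching the real output file.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the cleaned DataFrame
cleaned = pd.DataFrame({
    "id": [9046, 51676],
    "bmi": [36.6, 28.1],
    "stroke": [1, 1],
})

# Write to a temporary directory and read the file back
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "stroke_data_cleaned.csv")
    cleaned.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# With index=False, no extra index column appears on reload
same_columns = list(reloaded.columns) == list(cleaned.columns)
same_shape = reloaded.shape == cleaned.shape
print(same_columns, same_shape)
```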