# Data Preparation and Cleaning for Health Indicators Dataset

This notebook covers the data preparation and cleaning steps for the health indicators dataset.
The main objectives are to handle missing values, identify outliers, convert data types as needed,
and perform necessary transformations to make the data ready for analysis or model building.


## 1. Data Overview

We'll start by loading the data and performing an initial examination to understand its structure.


In [None]:
import pandas as pd

# Load the dataset
data = pd.read_csv('diabetes_binary_health_indicators_BRFSS2021.csv')

# Display basic information and statistics
data.info()
data.describe()

## 2. Handling Missing Values

We'll check for any missing values in the dataset and decide on appropriate strategies to handle them.


In [None]:
# Check for missing values
missing_values = data.isnull().sum()
missing_values[missing_values > 0]
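
If the check above reports missing values, imputation is one common way to handle them. The sketch below is a minimal illustration on a small made-up frame: the `toy` data, its values, and the `Smoker` column are assumptions for demonstration, not rows from the dataset. It fills a continuous column with its median and a binary column with its most frequent value. Whether imputing or dropping rows is the better choice depends on how much data is missing and why.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values (not taken from the real dataset)
toy = pd.DataFrame({
    'BMI': [22.0, np.nan, 31.0, 27.0],
    'Smoker': [0.0, 1.0, np.nan, 1.0],  # hypothetical binary column
})

# Continuous column: fill gaps with the median, which is robust to extreme values
toy['BMI'] = toy['BMI'].fillna(toy['BMI'].median())

# Binary column: fill gaps with the most frequent value (the mode)
toy['Smoker'] = toy['Smoker'].fillna(toy['Smoker'].mode().iloc[0])

# Confirm that no missing values remain
toy.isnull().sum()
```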

## 3. Data Type Conversion

After examining the dataset, we'll ensure each column has an appropriate data type (e.g., converting binary columns to integers if needed).


In [None]:
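# Convert data types if necessary
# Example: store whole-number float columns (e.g., binary flags) as integers.
# Skip columns with NaN or fractional values, which the cast would break or truncate.
float_cols = data.select_dtypes(include='float64').columns
int_like = [c for c in float_cols if data[c].notna().all() and (data[c] % 1 == 0).all()]
data = data.astype({c: 'int64' for c in int_like})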
data.dtypes
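
A related option is to store 0/1 indicator columns in a compact integer type. The following is a sketch under assumptions: the `toy` frame and its `HighBP` column are hypothetical, and `int8` is only one reasonable choice for saving memory.

```python
import pandas as pd

# Toy frame with made-up values (not taken from the real dataset)
toy = pd.DataFrame({
    'HighBP': [1.0, 0.0, 1.0],   # hypothetical binary indicator stored as float
    'BMI': [22.5, 31.0, 27.0],   # continuous, should stay float
})

# A column counts as binary when it has no NaN and holds only the values 0 and 1
binary_cols = [
    col for col in toy.columns
    if toy[col].notna().all() and toy[col].isin([0, 1]).all()
]

# int8 is enough for 0/1 flags and uses less memory than int64 or float64
toy = toy.astype({col: 'int8' for col in binary_cols})
toy.dtypes
```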

## 4. Outlier Detection and Handling

We will perform basic outlier detection for numerical variables and decide whether any action is necessary.


In [None]:
# Outlier detection using the interquartile range (IQR)

# Example: Outliers in BMI
data.boxplot(column=['BMI'])
# Flag BMI values more than 1.5 * IQR below the first or above the third quartile
q1 = data['BMI'].quantile(0.25)
q3 = data['BMI'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data['BMI'] < lower) | (data['BMI'] > upper)]
outliers
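
If action does seem necessary, one option is to cap extreme values at the IQR fences rather than drop the rows. The sketch below uses a made-up `bmi` series, not values from the dataset. It is only one possible treatment: very high BMI values can be genuine measurements, so capping may not be appropriate for every analysis.

```python
import pandas as pd

# Made-up BMI values with one extreme entry
bmi = pd.Series([22, 24, 25, 27, 29, 31, 95], name='BMI')

# Same IQR fences as in the detection step
q1 = bmi.quantile(0.25)
q3 = bmi.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Cap values at the fences instead of dropping the rows
bmi_capped = bmi.clip(lower=lower, upper=upper)
bmi_capped
```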

## 5. Feature Engineering

Based on the dataset and context, we may create new features from existing ones to enhance the information available for analysis.


In [None]:
# Feature engineering example: create a binary 'Obesity' flag from BMI

# A BMI of 30 or higher is the conventional adult obesity cutoff
obesity_threshold = 30
data['Obesity'] = (data['BMI'] >= obesity_threshold).astype(int)
data[['BMI', 'Obesity']].head()
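
The same idea extends from a single threshold to several categories. The sketch below bins BMI with `pd.cut`, using the commonly cited adult cutoffs of 18.5, 25, and 30. The `toy` frame, its values, and the `BMI_category` column name are assumptions made for illustration.

```python
import pandas as pd

toy = pd.DataFrame({'BMI': [17, 22, 25, 29, 30, 41]})  # made-up values

# Bin edges follow the commonly used adult BMI cutoffs (18.5, 25, 30).
# right=False makes each bin include its lower edge, so a BMI of 30 is 'Obese'.
toy['BMI_category'] = pd.cut(
    toy['BMI'],
    bins=[0, 18.5, 25, 30, float('inf')],
    labels=['Underweight', 'Normal', 'Overweight', 'Obese'],
    right=False,
)
toy
```

With `right=False`, the category boundary agrees with the `>=` comparison used for the binary `Obesity` flag.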

## 6. Normalization and Scaling

For numerical features, we may apply normalization or scaling to standardize the range of values.


In [None]:
# Rescale BMI to the [0, 1] range with min-max scaling
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
data['BMI_normalized'] = scaler.fit_transform(data[['BMI']])
data[['BMI', 'BMI_normalized']].head()
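
With its default settings, `MinMaxScaler` maps each value to `(x - min) / (max - min)`. The sketch below reproduces that arithmetic by hand on a made-up `toy` frame, and shows that the original values can be recovered by inverting the formula. The `BMI_restored` column name is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({'BMI': [20.0, 25.0, 30.0, 40.0]})  # made-up values

# Min-max scaling by hand: (x - min) / (max - min)
bmi_min, bmi_max = toy['BMI'].min(), toy['BMI'].max()
toy['BMI_normalized'] = (toy['BMI'] - bmi_min) / (bmi_max - bmi_min)

# Inverting the formula recovers the original scale
toy['BMI_restored'] = toy['BMI_normalized'] * (bmi_max - bmi_min) + bmi_min
toy
```

Because the minimum and maximum are learned from the data, a scaler used for modeling is usually fitted on the training split only and then applied to the other splits.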