# Project Data Description (Insurance)
## Foundations of Machine Learning

Purpose: We aim to build predictive models that estimate a person’s annual medical insurance charges (charges) and to identify which factors are most strongly associated with higher costs.

Data source: Kaggle – Medical Cost Personal Dataset. This is a cross-sectional dataset of U.S. insurance policy holders with basic demographics and health-related behaviors. The raw file contains 1,338 rows and 7 columns. After standardization and de-duplication, the analysis table contains 1,337 rows and 10 columns (including three derived features listed below).

The data include:
  
- age: Age in years (numeric).
- sex: Biological sex of the policy holder (categorical: male, female).
- bmi: Body Mass Index (kg/m², numeric).
- children: Number of dependent children (non-negative integer).
- smoker: Smoking status at time of record (categorical: yes, no).
- region: U.S. region of residence (categorical: northeast, northwest, southeast, southwest).
- charges: Annual medical insurance charges in dollars (numeric); the target variable.


Cleaning and preparation:

- Standardized column names to lower_snake_case, trimmed whitespace, and harmonized category labels (e.g., M/F→male/female, Y/N→yes/no, NE/NW/SE/SW→full region names).
- Removed exact duplicate rows (n = 1).
- Coerced numeric fields to numeric types and enforced plausible bounds (age 0–100, BMI 10–80, children ≥ 0, charges > 0). No rows failed the bounds checks.
- Assessed missingness; the cleaned dataset contains no missing values.
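
The cleaning steps above could be sketched in pandas roughly as follows. This is a minimal illustration rather than the exact script used for the project: the function name `clean_insurance`, the `LABEL_MAPS` dictionary, and the choice to keep rows with unparseable numbers (so that missingness can be inspected separately) are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical label maps; the abbreviations mirror the examples in the text.
LABEL_MAPS = {
    "sex": {"m": "male", "f": "female"},
    "smoker": {"y": "yes", "n": "no"},
    "region": {"ne": "northeast", "nw": "northwest",
               "se": "southeast", "sw": "southwest"},
}
NUMERIC_COLS = ["age", "bmi", "children", "charges"]


def clean_insurance(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps in order and return a new table."""
    df = raw.copy()

    # 1. Column names to lower_snake_case.
    df.columns = (
        df.columns.str.strip()
        .str.lower()
        .str.replace(r"[^0-9a-z]+", "_", regex=True)
        .str.strip("_")
    )

    # 2. Trim whitespace, lowercase, and expand abbreviated labels.
    #    Assumes these three columns hold strings.
    for col, mapping in LABEL_MAPS.items():
        df[col] = df[col].str.strip().str.lower().replace(mapping)

    # 3. Coerce numeric fields; unparseable entries become NaN.
    for col in NUMERIC_COLS:
        df[col] = pd.to_numeric(df[col], errors="coerce")

    # 4. Drop exact duplicates after harmonization, so that rows differing
    #    only in label spelling are also treated as duplicates.
    df = df.drop_duplicates()

    # 5. Drop rows outside the stated bounds. Comparisons with NaN are
    #    False, so rows with missing values are kept for the missingness check.
    out_of_bounds = (
        (df["age"] < 0) | (df["age"] > 100)
        | (df["bmi"] < 10) | (df["bmi"] > 80)
        | (df["children"] < 0)
        | (df["charges"] <= 0)
    )
    return df[~out_of_bounds].reset_index(drop=True)


# Tiny made-up table, only to exercise the function.
raw = pd.DataFrame({
    " Age ": ["19", "19", "45"],
    "Sex": ["F", "female ", "M"],
    "BMI": [27.9, 27.9, 31.2],
    "Children": [0, 0, 2],
    "Smoker": ["Y", "yes", "N"],
    "Region": ["SW", "southwest", "NE"],
    "Charges": [16884.92, 16884.92, 8000.0],
})
cleaned = clean_insurance(raw)
print(len(raw), "->", len(cleaned))   # row counts before and after
print(cleaned.isna().sum())           # per-column missingness
```

De-duplicating after harmonization is a design choice in this sketch; running it before would miss duplicates that differ only in how a category is spelled.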


Engineered helper features for analysis/interpretation:

- bmi_obese = 1 if BMI ≥ 30, else 0.
- is_senior = 1 if age ≥ 65, else 0.
- smoker_yes = 1 if smoker == "yes", else 0.
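
As a sketch, the three indicator features can be derived directly from the cleaned columns. The function name `add_helper_features` is hypothetical; the thresholds are the ones stated above.

```python
import pandas as pd


def add_helper_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the three 0/1 helper columns added."""
    out = df.copy()
    out["bmi_obese"] = (out["bmi"] >= 30).astype(int)
    out["is_senior"] = (out["age"] >= 65).astype(int)
    out["smoker_yes"] = (out["smoker"] == "yes").astype(int)
    return out


# Made-up rows chosen to sit on and around the thresholds.
demo = pd.DataFrame({
    "age": [30, 65, 70],
    "bmi": [29.9, 30.0, 35.0],
    "smoker": ["no", "yes", "no"],
})
features = add_helper_features(demo)
print(features[["bmi_obese", "is_senior", "smoker_yes"]].mean())
```

Printing the column means on the real table is a cheap way to spot a flag that is constant in the sample and therefore carries no information for modeling.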

Files and usage. We will work from a single tidy file (insurance_cleaned.csv). For modeling we will create an 80/20 train–test split (stratified on smoker) and fit all preprocessing steps on the training data only to avoid leakage.
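
One way to realize this split-then-fit workflow is with scikit-learn, sketched below. The choice of `LinearRegression`, the scaling and one-hot encoding steps, the function name `split_and_fit`, and the seed are illustrative assumptions, not the project's final modeling decisions. Wrapping the preprocessing in a `Pipeline` means its statistics (means, variances, category lists) are learned from the training rows only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

NUMERIC = ["age", "bmi", "children"]
CATEGORICAL = ["sex", "smoker", "region"]


def split_and_fit(df: pd.DataFrame, seed: int = 42):
    """80/20 split stratified on smoker, then fit preprocessing + model."""
    X = df.drop(columns="charges")
    y = df["charges"]

    # Stratifying keeps the smoker share similar in both partitions.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=X["smoker"], random_state=seed
    )

    # Columns not listed here (e.g. helper flags) are dropped by default.
    preprocess = ColumnTransformer([
        ("num", StandardScaler(), NUMERIC),
        ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
    ])
    model = Pipeline([
        ("preprocess", preprocess),
        ("regressor", LinearRegression()),  # placeholder model
    ])

    # fit() sees only the training rows, so nothing from the test
    # partition influences the scaler or the encoder.
    model.fit(X_train, y_train)
    return model, X_test, y_test


# Usage, assuming insurance_cleaned.csv is in the working directory:
# df = pd.read_csv("insurance_cleaned.csv")
# model, X_test, y_test = split_and_fit(df)
# print(model.score(X_test, y_test))  # R^2 on the held-out 20%
```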

Limitations and notes. This is observational, cross-sectional data and does not support causal claims. Important cost drivers (e.g., plan details, comorbidities, income) are not observed and may confound associations with charges. Categorical labels are coarse (e.g., four regions), so geographic effects are only approximate. We will report model performance with appropriate caveats and perform sensitivity checks for influential observations.