
# Statistical Coding Assignment: Regression Analysis of Medical Insurance Cost Across US Regions

Students: **Emmanuel Sadek, Mayur Khare, and Jinyuan He**

Date: **09/30/2025**

This notebook serves as the technical foundation for our statistical coding assignment, focusing on the [Medical Insurance Cost dataset](https://www.kaggle.com/datasets/mosapabdelghany/medical-insurance-cost-dataset/) from Kaggle. This project is a collaborative effort by the students listed above.

The analyses and code within this notebook directly support the sections of our technical report, which will include:

*   **Introduction:** Briefly introducing the problem and dataset.
*   **Data Cleaning/Preparation:** Detailing the steps taken to clean and prepare the data for analysis.
*   **Exploratory Data Analysis:** Presenting visualizations and summaries to understand the data's characteristics and relationships.
*   **Model Selection:** Explaining the rationale behind choosing specific models for predicting medical insurance costs.
*   **Model Analysis:** Evaluating the performance of the selected models.
*   **Conclusion and Recommendations:** Summarizing our findings and providing recommendations based on the analysis.

This notebook contains the executed code and outputs that will be included in the appendix of our technical report. We will use this notebook to perform the following steps:

*   **Load and Prepare the Data:** Import the dataset and perform any necessary cleaning or transformations.
*   **Conduct Exploratory Data Analysis:** Generate visualizations and descriptive statistics.
*   **Develop and Analyze Models:** Build and evaluate models for predicting medical insurance costs.

Through this notebook, we aim to systematically analyze the medical insurance cost dataset and generate the necessary outputs for our comprehensive technical report.

# Data Cleaning & Preparation

In [4]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# --------------------------------------------
# 1. Load the dataset
# --------------------------------------------

# Load into pandas
df = pd.read_csv("https://raw.githubusercontent.com/kmay9270/AIML-Projects-USD-MSAAI-Team7/refs/heads/main/insurance.csv")

# Quick look at the data
print("First 5 rows:")
display(df.head())

# --------------------------------------------
# 2. Check for missing values & duplicates
# --------------------------------------------
print("\nMissing values per column:")
print(df.isnull().sum())

print("\nDuplicate rows:", df.duplicated().sum())

# Drop duplicates if any
df = df.drop_duplicates()

# --------------------------------------------
# 3. Encode categorical data
# --------------------------------------------
# Map each category to an integer code (label encoding, not one-hot)
df['sex'] = df['sex'].map({'male': 0, 'female': 1})
df['smoker'] = df['smoker'].map({'no': 0, 'yes': 1})
df['region'] = df['region'].map({'southwest': 0, 'southeast': 1, 'northwest': 2, 'northeast':3})

# --------------------------------------------
# 4. Standardization
# --------------------------------------------
# Standardize the numeric features to zero mean and unit variance.
# StandardScaler scales each column independently, so one call covers all three.
scaler = StandardScaler()
num_cols = ['age', 'bmi', 'children']
df[num_cols] = scaler.fit_transform(df[num_cols])


# --------------------------------------------
# 5. Split into train and test sets
# --------------------------------------------
X = df.drop(columns=['charges'])
y = df['charges']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=15)

First 5 rows:


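|   | age | sex | bmi | children | smoker | region | charges |
|---|-----|-----|-----|----------|--------|--------|---------|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |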



Missing values per column:
age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

Duplicate rows: 1
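
The integer mapping applied to `region` in the cell above gives the four regions an arbitrary numeric order, which a linear model would read as a scale. One-hot encoding is a common alternative. The following is a minimal sketch, assuming a small hypothetical frame `toy` whose values are made up for illustration and only mimic the dataset's column names; it is not the encoding used in this notebook.

```python
import pandas as pd

# Hypothetical toy rows (not taken from the insurance data)
toy = pd.DataFrame({
    "sex": ["female", "male", "male", "female"],
    "smoker": ["yes", "no", "no", "no"],
    "region": ["southwest", "southeast", "northwest", "northeast"],
})

# Two-level columns can stay as simple 0/1 maps
toy["sex"] = toy["sex"].map({"male": 0, "female": 1})
toy["smoker"] = toy["smoker"].map({"no": 0, "yes": 1})

# One-hot encode region; drop_first=True removes one level so the
# remaining indicator columns are not collinear with an intercept
encoded = pd.get_dummies(toy, columns=["region"], drop_first=True, dtype=int)

print(encoded.columns.tolist())
```

With `drop_first=True`, the dropped level (the alphabetically first one) becomes the reference category, and each remaining indicator's regression coefficient is interpreted relative to it.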


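The cell above fits `StandardScaler` on the full dataset before the train-test split, so the test rows contribute to the scaling statistics. A possible alternative is to fit the scaler on the training rows only and reuse it on the test rows. The sketch below assumes synthetic data generated with NumPy (the frame `demo` and its values are hypothetical, not the insurance data) and reuses the split settings from the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data with the same numeric column names
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "age": rng.integers(18, 65, size=100).astype(float),
    "bmi": rng.normal(30, 6, size=100),
    "children": rng.integers(0, 5, size=100).astype(float),
    "charges": rng.gamma(2.0, 6000.0, size=100),
})

num_cols = ["age", "bmi", "children"]
X_demo = demo.drop(columns=["charges"])
y_demo = demo["charges"]

# Split first, so the test rows stay unseen during fitting
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=15
)
X_tr, X_te = X_tr.copy(), X_te.copy()

# Fit the scaler on the training rows only, then apply it to both sets
demo_scaler = StandardScaler()
X_tr[num_cols] = demo_scaler.fit_transform(X_tr[num_cols])
X_te[num_cols] = demo_scaler.transform(X_te[num_cols])

print(X_tr[num_cols].mean().round(4))
```

The training columns end up with zero mean and unit variance; the test columns generally do not, which is expected because they are scaled with the training statistics.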
In [9]:
import statsmodels.api as sm
from scipy import stats

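# --------------------------------------------
# 1. Target variable and features
# --------------------------------------------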
target = 'charges'
X = df.drop(columns=[target])
y = df[target]

# --------------------------------------------
# 2. Correlation with target
# --------------------------------------------
corr_with_target = {}
for col in X.columns:
    corr = np.corrcoef(X[col], y)[0,1]
    corr_with_target[col] = corr

# --------------------------------------------
# 3. Coefficients from a regression model
# --------------------------------------------
X_const = sm.add_constant(X)
model = sm.OLS(y, X_const).fit()
coefficients = model.params

# --------------------------------------------
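# 4. Univariate significance test for each feature
# --------------------------------------------
anova_p_values = {}
for col in df.columns:
    if col == target:
        continue
    if df[col].dtype == 'O':
        # String-typed (categorical) column: one-way ANOVA across its levels.
        # Note: sex, smoker, and region were mapped to integers during data
        # preparation, so this branch is not reached for the current df.
        groups = [df[df[col] == level][target] for level in df[col].unique()]
        f_stat, p_val = stats.f_oneway(*groups)
        anova_p_values[col] = p_val
    else:
        # Numeric column: p-value of the slope in a simple linear regression,
        # which matches the F-test p-value for a single predictor
        slope, intercept, r_value, p_val, std_err = stats.linregress(df[col], df[target])
        anova_p_values[col] = p_val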

# --------------------------------------------
# 5. Combine results: mean, correlation, coefficient, ANOVA p-value
# --------------------------------------------
summary = pd.DataFrame({
    "Mean": X.mean(),
    "Correlation_with_target": pd.Series(corr_with_target),
    "Regression_Coefficient": coefficients.drop('const', errors='ignore'),
    "ANOVA_p_value": pd.Series(anova_p_values)
})

# Sort rows alphabetically by feature name
summary = summary.reindex(sorted(summary.index))

# --------------------------------------------
# 6. Display
# --------------------------------------------
pd.set_option('display.float_format', lambda x: f'{x:,.4f}')
print("\nFeature Selection Summary:")
print(summary)


Feature Selection Summary:
            Mean  Correlation_with_target  Regression_Coefficient  \
age      -0.0000                   0.2983              3,610.8961   
bmi      -0.0000                   0.1984              2,028.2307   
children  0.0000                   0.0674                576.9775   
region    1.4839                   0.0065                354.0097   
sex       0.4951                  -0.0580                129.4009   
smoker    0.2049                   0.7872             23,819.1501   

          ANOVA_p_value  
age              0.0000  
bmi              0.0000  
children         0.0137  
region           0.8110  
sex              0.0338  
smoker           0.0000  
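
Because `sex`, `smoker`, and `region` were integer-encoded before this cell ran, the `dtype == 'O'` check never matches, and every value in the `ANOVA_p_value` column comes from `stats.linregress`. For `region`, that tests for a linear trend across the arbitrary codes 0 to 3 and not for a difference in group means. The sketch below shows one way to run a one-way ANOVA on a factor regardless of its dtype. It assumes synthetic data (the frame `demo` and the helper `one_way_anova_p` are hypothetical), so its p-values say nothing about the insurance dataset.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical synthetic data: charges depend on smoker but not on region
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "region": rng.integers(0, 4, size=200),
    "smoker": rng.integers(0, 2, size=200),
})
demo["charges"] = 8000 + 20000 * demo["smoker"] + rng.normal(0, 3000, size=200)


def one_way_anova_p(frame, factor, target):
    """Return the one-way ANOVA p-value of `target` across the levels of `factor`."""
    # groupby treats each distinct value as a group, whatever the dtype
    groups = [g[target].to_numpy() for _, g in frame.groupby(factor)]
    f_stat, p_val = stats.f_oneway(*groups)
    return p_val


anova_p = {col: one_way_anova_p(demo, col, "charges") for col in ["region", "smoker"]}
print(anova_p)
```

Grouping on the factor's distinct values keeps the test valid for integer-coded categories, since the codes are used only as labels and never as magnitudes.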
