## Hypothesis Testing

- Do smokers pay significantly more than non-smokers?

- Is there a significant difference in charges between males and females?

- Are insurance charges significantly different across regions?

- Does BMI category significantly affect charges?

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Forward slash avoids the invalid "\i" escape sequence in the path string
df = pd.read_csv("data/insurance.csv")
df.head()



Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


In [3]:
df['smoker'] = df['smoker'].str.lower().str.strip()
df['smoker'].value_counts()


smoker
no     1064
yes     274
Name: count, dtype: int64

In [5]:
smokers = df[df['smoker']=='yes']['charges']
non_smokers = df[df['smoker']=='no']['charges']

len(smokers), len(non_smokers)


(274, 1064)

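1. Do smokers pay significantly more than non-smokers?

H₀ (Null): Mean charges of smokers = mean charges of non-smokers

H₁ (Alt): Mean charges of smokers ≠ mean charges of non-smokers (two-sided, the default for `ttest_ind`)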

In [6]:
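from scipy.stats import ttest_ind

# Charges for each group (same split as in the previous cell)
smokers = df[df['smoker'] == 'yes']['charges']
non_smokers = df[df['smoker'] == 'no']['charges']

# Welch's t-test: equal_var=False does not assume equal group variances
t_stat, p_val = ttest_ind(smokers, non_smokers, equal_var=False)
t_stat, p_val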

(np.float64(32.751887766341824), np.float64(5.88946444671698e-103))

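Since p ≈ 5.9e-103 is far below α = 0.05, we reject the null hypothesis.
The t statistic is positive (t ≈ 32.75), so smokers have the higher mean: smokers pay significantly higher insurance charges than non-smokers.
`ttest_ind` is two-sided by default; because the difference is in the expected direction, the one-sided p-value is half the reported value and is also far below 0.05.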
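
To make the test statistic less of a black box, here is a minimal sketch of what Welch's unequal-variance t-test computes. The helper name `welch_t` and the toy numbers are assumptions for illustration only; they are not from the notebook or from `insurance.csv`, and the sketch returns the statistic and degrees of freedom, not a p-value.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance two-sample t-test."""
    na, nb = len(a), len(b)
    # Squared standard error contributed by each group: sample variance / n
    se2_a = variance(a) / na
    se2_b = variance(b) / nb
    t = (mean(a) - mean(b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (na - 1) + se2_b ** 2 / (nb - 1))
    return t, df

# Toy numbers for illustration only (hypothetical, not from insurance.csv)
group_a = [30.0, 32.0, 35.0, 40.0]
group_b = [10.0, 12.0, 11.0, 13.0, 14.0]
t, df = welch_t(group_a, group_b)
print(t, df)
```

With the notebook's data, `welch_t(smokers.tolist(), non_smokers.tolist())` should give a t value that matches the `ttest_ind(..., equal_var=False)` result up to floating-point error.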

---

2. Is there a significant difference in charges between males and females?

H₀ (Null): Mean charges of males = mean charges of females

H₁ (Alt): Mean charges of males ≠ mean charges of females

In [8]:
males = df[df['sex'] == 'male']['charges']
females = df[df['sex'] == 'female']['charges']

t_stat, p_val = ttest_ind(males, females, equal_var=False)
t_stat, p_val

(np.float64(2.1008878232359565), np.float64(0.03584101495601664))

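Since p ≈ 0.0358 < 0.05 (t ≈ 2.10), we reject the null hypothesis at the 5% level: mean insurance charges differ between males and females, and the positive t indicates that males have the higher mean. The evidence is far weaker than for smoking status (the result would not be significant at the 1% level), and the p-value alone does not measure how large the difference is.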
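
A p-value says whether a difference is detectable, not how big it is. One common way to quantify size is Cohen's d, the mean difference divided by the pooled standard deviation. The sketch below is an assumption-laden illustration: the helper name `cohens_d` and the toy numbers are hypothetical and not from the notebook.

```python
import math
from statistics import mean, variance

def cohens_d(a, b):
    """Standardized mean difference using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)

# Toy numbers for illustration only (hypothetical, not from insurance.csv)
group_a = [12.0, 14.0, 16.0, 18.0]
group_b = [10.0, 12.0, 14.0, 16.0]
d = cohens_d(group_a, group_b)
print(round(d, 4))
```

Applied to the notebook's data, this would be `cohens_d(males.tolist(), females.tolist())`. Cohen's conventional rule of thumb reads roughly 0.2 as small, 0.5 as medium, and 0.8 as large, though those cutoffs are only a guideline.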


---

3. Are insurance charges significantly different across regions?

H₀ (Null): All regions have the same mean charges

H₁ (Alt): At least one region has a different mean charge

In [None]:
# One-way ANOVA: compare mean charges across all regions at once
from scipy.stats import f_oneway

regions = df['region'].unique()
groups = [df[df['region'] == r]['charges'] for r in regions]

f_stat, p_val = f_oneway(*groups)
f_stat, p_val

(np.float64(2.96962669358912), np.float64(0.0308933560705201))

α = 0.05

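p ≈ 0.0309 < 0.05 (F ≈ 2.97)

→ Reject the null hypothesis.
There is a statistically significant difference in mean insurance charges across regions at the 5% level. ANOVA does not identify which regions differ; a post-hoc pairwise comparison would be needed for that.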
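
For readers who want to see where the F statistic comes from, the following is a minimal sketch of the one-way ANOVA calculation: the ratio of between-group to within-group mean squares. The helper name `one_way_f` and the toy groups are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
from statistics import mean

def one_way_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)

    ss_between = 0.0  # spread of group means around the grand mean
    ss_within = 0.0   # spread of observations around their own group mean
    for g in groups:
        g_mean = mean(g)
        ss_between += len(g) * (g_mean - grand_mean) ** 2
        ss_within += sum((x - g_mean) ** 2 for x in g)

    df_between, df_within = k - 1, n - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Toy groups for illustration only (hypothetical, not from insurance.csv)
toy_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f, df_b, df_w = one_way_f(toy_groups)
print(f, df_b, df_w)
```

A large F means the group means are far apart relative to the scatter inside each group. With the notebook's data, passing `[g.tolist() for g in groups]` should match the `f_oneway` statistic up to floating-point error.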

---

4. Does BMI category significantly affect insurance charges?

In [18]:
def bmi_category(bmi):
    if bmi < 18.5:
        return "Underweight"
    elif bmi < 25:
        return "Normal"
    elif bmi < 30:
        return "Overweight"
    else:
        return "Obese"

df['bmi_cat'] = df['bmi'].apply(bmi_category)
df['bmi_cat'].value_counts()

bmi_cat
Obese          707
Overweight     386
Normal         225
Underweight     20
Name: count, dtype: int64

H₀: Mean charges are the same for all BMI categories

H₁: At least one BMI category has a different mean charge

In [19]:
cats = df['bmi_cat'].unique()
groups = [df[df['bmi_cat'] == c]['charges'] for c in cats]

from scipy.stats import f_oneway
f_stat, p_val = f_oneway(*groups)
f_stat, p_val

(np.float64(18.73063972881849), np.float64(6.66279768069733e-12))

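Since p ≈ 6.7e-12 is far below 0.05 (F ≈ 18.73), we reject the null hypothesis: mean insurance charges differ significantly across BMI categories. The Underweight group has only 20 observations, so its mean is estimated less precisely than the others.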
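
Significance does not say how much of the variation in charges the grouping accounts for. One simple measure is eta squared, the between-group sum of squares divided by the total sum of squares. This is a sketch under stated assumptions: the helper name `eta_squared` and the toy groups are hypothetical and not from the notebook.

```python
from statistics import mean

def eta_squared(groups):
    """Share of total variation explained by group membership (0 to 1)."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_total = sum((x - grand_mean) ** 2 for x in all_values)
    return ss_between / ss_total

# Toy groups for illustration only (hypothetical, not from insurance.csv)
example_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
eta2 = eta_squared(example_groups)
print(eta2)
```

On the notebook's data this could be called as `eta_squared([g.tolist() for g in groups])`; a value near 0 would mean BMI category explains little of the spread in charges even though the test is significant.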