# Week 5: Hypothesis Testing & Modeling

This notebook tests whether NO₂ levels differ between Industrial and Residential areas, then fits a linear regression model that predicts NO₂ from other pollutant measurements.


In [None]:
import pandas as pd
from scipy.stats import ttest_ind

df = pd.read_csv("../data/cleaned_air_quality.csv")

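# Hypothesis: is mean NO₂ higher in Industrial areas than in Residential areas?
# Rows are assigned to a group by substring match on the 'type' column.
res = df[df['type'].str.contains('Residential', na=False)]
ind = df[df['type'].str.contains('Industrial', na=False)]

# Welch's t-test (equal_var=False) does not assume equal variances in the two groups.
# Industrial is passed first, so a positive t-statistic means the Industrial mean is higher.
# Missing NO₂ readings are dropped so the test does not return NaN.
# The reported p-value is two-sided.
t_stat, p_value = ttest_ind(ind['no2'].dropna(), res['no2'].dropna(), equal_var=False)
print(f"T-statistic: {t_stat:.2f}, P-value: {p_value:.4f}")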
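
The hypothesis above is directional ("higher"), while `ttest_ind` reports a two-sided p-value by default. A minimal sketch of a one-sided Welch test follows, assuming SciPy 1.6 or later for the `alternative` argument. The two arrays are synthetic stand-ins generated for illustration, not values from the dataset.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic NO2-like samples (illustrative assumption, not real measurements)
rng = np.random.default_rng(0)
industrial = rng.normal(loc=35.0, scale=8.0, size=200)
residential = rng.normal(loc=25.0, scale=6.0, size=300)

# Two-sided Welch test: are the means different in either direction?
t_stat, p_two_sided = ttest_ind(industrial, residential, equal_var=False)

# One-sided Welch test: is the industrial mean greater than the residential mean?
_, p_one_sided = ttest_ind(
    industrial, residential, equal_var=False, alternative="greater"
)

print(f"t = {t_stat:.2f}")
print(f"two-sided p = {p_two_sided:.3g}, one-sided p = {p_one_sided:.3g}")
```

When the t-statistic is positive, the one-sided p-value is half the two-sided one, so the choice of alternative should be made before looking at the data.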


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

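# Define features and target
features = ['so2', 'rspm', 'spm', 'pm2_5']
target = 'no2'
# Drop rows missing any feature or the target; LinearRegression cannot fit on NaN values
model_df = df.dropna(subset=features + [target])
X = model_df[features]
y = model_df[target]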

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit model
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

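# Evaluation on the held-out 20% test set
print("R²:", r2_score(y_test, y_pred))
# RMSE is the square root of MSE; the `squared=False` argument was deprecated
# in scikit-learn 1.4 and later removed, so the root is taken explicitly.
print("RMSE:", mean_squared_error(y_test, y_pred) ** 0.5)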
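
R² and RMSE are easier to interpret next to a trivial baseline. The sketch below compares a linear model against a predictor that always outputs the training mean. It runs on synthetic data with four made-up features (an assumption for illustration only), and the `rmse` helper is hypothetical, not part of scikit-learn.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: 4 features, linear signal plus noise (illustrative assumption)
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

def rmse(y_true, y_pred):
    """Hypothetical helper: root mean squared error via sqrt(MSE)."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

# Baseline that ignores the features and predicts the training mean
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
model = LinearRegression().fit(X_train, y_train)

rmse_baseline = rmse(y_test, baseline.predict(X_test))
rmse_model = rmse(y_test, model.predict(X_test))
r2_baseline = r2_score(y_test, baseline.predict(X_test))
r2_model = r2_score(y_test, model.predict(X_test))

print(f"baseline: R² = {r2_baseline:.3f}, RMSE = {rmse_baseline:.3f}")
print(f"linear:   R² = {r2_model:.3f}, RMSE = {rmse_model:.3f}")
```

A constant predictor scores an R² of at most zero on the test set, so the regression is only useful to the extent that it clearly beats that baseline on both metrics.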

## Summary

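- Welch's t-test compares mean NO₂ between Industrial and Residential areas; a small p-value (for example, below 0.05) indicates that the difference in means is unlikely to be due to chance alone.
- Linear regression models NO₂ as a linear function of SO₂, RSPM, SPM, and PM2.5.
- R² (the share of variance in NO₂ explained by the model) and RMSE (the typical prediction error, in the units of NO₂) are computed on the held-out test set to assess model performance.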
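
For reference, the quantities reported in this notebook follow the standard definitions below. The notation is introduced here for illustration: groups 1 and 2 are the first and second samples passed to `ttest_ind`, with means $\bar{x}_1, \bar{x}_2$, sample variances $s_1^2, s_2^2$, and sizes $n_1, n_2$; $y_i$ and $\hat{y}_i$ are the observed and predicted NO₂ values for the $n$ test rows, and $\bar{y}$ is the mean of the observed test values.

$$
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}
$$

$$
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
$$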
