# Week 6: Final Report

This notebook summarizes the full air quality analysis project.


## Project Overview

This project analyzes air quality in India using historical data from multiple cities and states.

### Goals:
- Understand pollution trends over time
- Compare pollution across seasons, locations, and area types
- Predict NO₂ levels from other pollutants using linear regression


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

df = pd.read_csv("../data/cleaned_air_quality.csv")
df.head()

In [None]:
sns.boxplot(x="type", y="no2", data=df)
plt.title("NO2 Levels by Area Type")
plt.xticks(rotation=45)
plt.show()

In [None]:
monthly_avg = df.groupby("month")[['so2', 'no2', 'rspm']].mean()
monthly_avg.plot(title="Monthly Average Pollution Levels")
plt.ylabel("Concentration")
plt.show()

## Hypothesis Testing Results

- Industrial areas have higher NO₂ than residential → ✅ Supported (p < 0.05)
- Winter has higher pollution than summer → ✅ Supported
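
The report does not say which test produced these results. As a minimal sketch, assuming a one-sided Welch's t-test was used for the first hypothesis, the comparison could look like the block below. The data here is synthetic, and the category labels `"Industrial"` and `"Residential"` are assumptions about how the `type` column is coded; in the notebook, `demo` would be replaced by `df`.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in data (NOT the project's measurements).
# The "type" labels below are assumed; check df["type"].unique() in the notebook.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "type": ["Industrial"] * 200 + ["Residential"] * 200,
    "no2": np.concatenate([
        rng.normal(35, 8, 200),  # hypothetical industrial readings
        rng.normal(25, 8, 200),  # hypothetical residential readings
    ]),
})

industrial = demo.loc[demo["type"] == "Industrial", "no2"].dropna()
residential = demo.loc[demo["type"] == "Residential", "no2"].dropna()

# One-sided Welch's t-test (unequal variances).
# H1: mean NO2 in industrial areas is greater than in residential areas.
t_stat, p_value = stats.ttest_ind(
    industrial, residential, equal_var=False, alternative="greater"
)

supported = p_value < 0.05
print(f"t = {t_stat:.2f}, p = {p_value:.3g}, supported = {supported}")
```

Welch's variant is chosen here because the two groups need not have equal variance or size. The same pattern would apply to the winter versus summer comparison once each row is assigned a season.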


In [None]:
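features = ['so2', 'rspm', 'spm', 'pm2_5']
# LinearRegression rejects NaN values, so keep only complete rows
# (a no-op if the cleaned file has no missing values).
model_df = df[features + ['no2']].dropna()
X = model_df[features]
y = model_df['no2']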

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("R2 Score:", r2_score(y_test, y_pred))
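# Take the square root explicitly; the `squared` argument was deprecated in scikit-learn 1.4 and removed in later releases.
print("RMSE:", mean_squared_error(y_test, y_pred) ** 0.5)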
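
R² and RMSE are easier to interpret next to a naive baseline. The sketch below is not part of the original analysis: it compares a linear regression with a `DummyRegressor` that always predicts the training mean. The data is synthetic and the helper `rmse` is a hypothetical convenience function; in the notebook, `X` and `y` would be the pollutant features and `no2`.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the four pollutant features and the NO2 target.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=2.0, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline: always predict the mean of the training target.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
model = LinearRegression().fit(X_train, y_train)


def rmse(y_true, y_pred):
    """Root mean squared error (hypothetical helper for this sketch)."""
    return mean_squared_error(y_true, y_pred) ** 0.5


baseline_rmse = rmse(y_test, baseline.predict(X_test))
model_rmse = rmse(y_test, model.predict(X_test))
model_r2 = r2_score(y_test, model.predict(X_test))

print(f"Baseline RMSE: {baseline_rmse:.2f}")
print(f"Model RMSE:    {model_rmse:.2f}")
print(f"Model R2:      {model_r2:.2f}")
```

If the model's RMSE is only slightly below the baseline's, the other pollutants add little predictive value, whatever the absolute numbers look like.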

## Conclusion

- Pollution levels vary by location, season, and area type.
- A linear regression on SO₂, RSPM, SPM, and PM2.5 predicts NO₂ moderately well (see the R² and RMSE reported above).
- This project provides insights for policy, urban planning, and public health.

### Further Work:
- Time-series forecasting
- Clustering for pollution typologies
- Geospatial visualizations
