# Week 2: Feature Exploration

This notebook explores how pollutant levels vary across time, region, and area type.


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style="whitegrid")

df = pd.read_csv("../data/data.csv", encoding="ISO-8859-1")
# Unparseable dates become NaT, so year and month are NaN for those rows
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month

df.head()

In [None]:
pollutants = ['so2', 'no2', 'rspm', 'spm', 'pm2_5']
# Percentage of missing values per pollutant, highest first
df[pollutants].isnull().mean().sort_values(ascending=False) * 100

In [None]:
for col in pollutants:
    sns.histplot(df[col].dropna(), kde=True)
    plt.title(f"Distribution of {col}")
    plt.xlabel(col)
    plt.show()

In [None]:
pollution_by_year = df.groupby('year')[pollutants].mean()
pollution_by_year.plot(figsize=(12, 6), title="Yearly Average Pollutant Levels")
plt.ylabel("Concentration")
plt.xlabel("Year")
plt.show()

In [None]:
sns.boxplot(x="type", y="no2", data=df)
plt.title("NO2 Levels by Area Type")
plt.xticks(rotation=45)
plt.show()

In [None]:
state_pollution = df.groupby('state')[pollutants].mean().sort_values(by='no2', ascending=False)
state_pollution[['no2']].plot(kind='bar', figsize=(12, 6), title="Average NO2 by State")
plt.ylabel("NO2 Level")
plt.show()

## Summary

- Plotted the distribution of each pollutant; NO2 and SO2 show clear variation.
- `pm2_5` is largely missing, while the other pollutants are more complete.
- NO2 levels differ across area types and across states.
- Yearly averages may reflect policy or industrial changes over time.
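
The observations above come from reading plots, so it may help to back them with numbers. The following is a minimal sketch, not part of the original analysis: `missing_share` and `group_spread` are hypothetical helper names, and the `demo` frame is small synthetic data that only borrows the `type`, `no2`, and `pm2_5` column names from this notebook.

```python
import numpy as np
import pandas as pd


def missing_share(frame, columns):
    """Return the percentage of missing values per column, highest first."""
    return (frame[columns].isnull().mean() * 100).sort_values(ascending=False)


def group_spread(frame, by, value):
    """Return mean, median, and non-null count of `value` per group, sorted by mean."""
    return (
        frame.groupby(by)[value]
        .agg(["mean", "median", "count"])
        .sort_values("mean", ascending=False)
    )


# Synthetic illustration only; these values are NOT taken from the real dataset.
demo = pd.DataFrame({
    "type": ["Industrial", "Industrial", "Residential", "Residential"],
    "no2": [40.0, 60.0, 10.0, 20.0],
    "pm2_5": [np.nan, np.nan, np.nan, 35.0],
})

print(missing_share(demo, ["no2", "pm2_5"]))
print(group_spread(demo, "type", "no2"))
```

Assuming the real frame has the columns used earlier in this notebook, the same helpers could be called as `missing_share(df, pollutants)` and `group_spread(df, "type", "no2")`. Reporting the count alongside the mean shows when a group average rests on very few measurements.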
