# 🌍 AQI Analysis – Hyderabad, Pakistan

## Project Overview

This project focuses on exploratory data analysis (EDA) of Air Quality Index (AQI) data for Hyderabad, Pakistan. The goal is to understand patterns in key pollutants (PM2.5, PM10, NO2, O3), explore temporal trends, and prepare insights for machine learning forecasting models. Additionally, we apply SHAP (SHapley Additive exPlanations) to interpret feature importance in a pre-trained model.

### Objectives:
- Analyze real-time AQI data from Hyderabad
- Understand patterns in PM2.5, PM10, NO2, O3
- Prepare data for ML forecasting
- Explore temporal trends (hour/day/month)
- Apply SHAP to explain feature importance

### Data Source:
- Engineered features stored in MongoDB Atlas

## 1. Import Libraries

We start by importing necessary libraries for data manipulation, visualization, and machine learning interpretation.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pymongo import MongoClient
import shap
import joblib
from config import MONGO_URI, DB_NAME

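# Set plotting style. sns.set_theme() applies seaborn's defaults to matplotlib;
# the bare "seaborn" style name is no longer available in recent matplotlib releases.
sns.set_theme()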

## 2. Data Loading

Connect to MongoDB and load the engineered features dataset.

In [None]:
# Connect to MongoDB Atlas Feature Store
client = MongoClient(MONGO_URI)
db = client[DB_NAME]

# Load engineered features
df = pd.DataFrame(list(db.engineered_features.find()))
df.drop(columns="_id", inplace=True)

print("Data loaded successfully!")
df.head()
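
The loading step can be made a little more defensive. The sketch below wraps the conversion in a small helper, `records_to_frame` (a hypothetical name, not part of the project), and demonstrates it on hand-written sample documents rather than a live MongoDB connection.

```python
import pandas as pd

def records_to_frame(records):
    """Build a DataFrame from an iterable of MongoDB-style documents.

    `records_to_frame` is an illustrative helper, not an existing project function.
    """
    frame = pd.DataFrame(list(records))
    # errors="ignore" avoids a KeyError when the collection is empty
    # or "_id" was already excluded by a projection.
    return frame.drop(columns="_id", errors="ignore")

# Hand-written sample documents standing in for the real collection
sample_docs = [
    {"_id": 1, "pm2_5": 42.0, "hour": 0},
    {"_id": 2, "pm2_5": 55.5, "hour": 1},
]
sample_frame = records_to_frame(sample_docs)
print(sample_frame)
```

Against the real collection this would be called as `records_to_frame(db.engineered_features.find({}, {"_id": 0}))`, where the second argument to `find` is a projection that asks MongoDB not to return `_id` at all.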

## 3. Data Overview

Get a high-level understanding of the dataset: shape, data types, summary statistics, and missing values.

In [None]:
# Dataset shape
print("Dataset shape:", df.shape)

# Data types and non-null counts
df.info()

# Summary statistics (printed explicitly, since a cell only auto-displays its last expression)
print(df.describe())

# Check for missing values
print("\nMissing values per column:")
print(df.isnull().sum())
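
Raw missing-value counts are easier to judge next to the share of rows they represent. The following is a minimal sketch, assuming a helper named `missing_summary` (hypothetical) and a small made-up frame in place of the real data.

```python
import pandas as pd

def missing_summary(frame):
    """Return missing counts and percentages for columns that have gaps."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns with at least one missing value, worst first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Made-up example data: pm2_5 has two gaps, pm10 has none
demo_frame = pd.DataFrame({
    "pm2_5": [12.0, None, 30.0, None],
    "pm10": [20.0, 25.0, 40.0, 35.0],
})
demo_missing = missing_summary(demo_frame)
print(demo_missing)
```

On the notebook's data the call would be `missing_summary(df)`.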

## 4. Univariate Analysis

Analyze the distribution of individual pollutants.

### PM2.5 Distribution

In [None]:
plt.figure(figsize=(10,5))
sns.histplot(df['pm2_5'], kde=True, bins=30, color='skyblue')
plt.title("Distribution of PM2.5 Levels")
plt.xlabel("PM2.5 (µg/m³)")
plt.ylabel("Frequency")
plt.show()

### PM10 Distribution

In [None]:
plt.figure(figsize=(10,5))
sns.histplot(df['pm10'], kde=True, bins=30, color='orange')
plt.title("Distribution of PM10 Levels")
plt.xlabel("PM10 (µg/m³)")
plt.ylabel("Frequency")
plt.show()

### Pollutant Concentration Comparison

In [None]:
plt.figure(figsize=(10,5))
sns.boxplot(data=df[['pm2_5', 'pm10', 'no2', 'o3']])
plt.title("Pollutant Concentration Comparison")
plt.ylabel("Concentration (µg/m³)")
plt.xticks(rotation=45)
plt.show()

## 5. Multivariate Analysis

Explore relationships between variables.

### Correlation Heatmap

In [None]:
plt.figure(figsize=(8,6))
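# numeric_only=True skips any non-numeric columns, which would otherwise raise in pandas 2.x
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm', fmt=".2f")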
plt.title("Correlation Between Features")
plt.show()
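
A heatmap with many features can be hard to read, so it may help to list the strongest pairwise correlations as a table. This is a sketch under stated assumptions: `top_correlations` is a hypothetical helper, and the demo frame is invented for illustration.

```python
import numpy as np
import pandas as pd

def top_correlations(frame, n=5):
    """Return the n feature pairs with the largest absolute correlation."""
    corr = frame.corr(numeric_only=True)
    # Keep the strict upper triangle so each pair appears once and the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(key=np.abs, ascending=False).head(n)

# Invented data: b is exactly 2 * a, c is only loosely related, label is non-numeric
demo_corr_frame = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [1.0, 3.0, 2.0, 5.0],
    "label": ["x", "y", "x", "y"],
})
demo_top = top_correlations(demo_corr_frame, n=3)
print(demo_top)
```

On the notebook's data, `top_correlations(df)` would list the pairs worth a closer look in the pair plot.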

### Pair Plot

In [None]:
sns.pairplot(df[['pm2_5', 'pm10', 'no2', 'o3']], diag_kind='kde')
plt.suptitle("Pair Plot of Key Pollutants", y=1.02)
plt.show()

## 6. Temporal Analysis

Examine how AQI varies over time.

### Hourly PM2.5 Distribution

In [None]:
plt.figure(figsize=(12,5))
sns.boxplot(x='hour', y='pm2_5', data=df)
plt.title("Hourly PM2.5 Distribution")
plt.xlabel("Hour of Day")
plt.ylabel("PM2.5 (µg/m³)")
plt.show()

### Daily PM2.5 Distribution

In [None]:
plt.figure(figsize=(12,5))
sns.boxplot(x='day', y='pm2_5', data=df)
plt.title("Daily PM2.5 Distribution")
plt.xlabel("Day of Month")
plt.ylabel("PM2.5 (µg/m³)")
plt.show()

### Monthly PM2.5 Distribution

In [None]:
plt.figure(figsize=(12,5))
sns.boxplot(x='month', y='pm2_5', data=df)
plt.title("Monthly PM2.5 Distribution")
plt.xlabel("Month")
plt.ylabel("PM2.5 (µg/m³)")
plt.show()

### Time Series Plot (if datetime available)
Assuming there's a datetime column, e.g., 'timestamp':

In [None]:
if 'timestamp' in df.columns:
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    df.set_index('timestamp', inplace=True)
    df.sort_index(inplace=True)  # keep readings in chronological order
    df['pm2_5'].plot(figsize=(15,5))
    plt.title("PM2.5 Time Series")
    plt.ylabel("PM2.5 (µg/m³)")
    plt.show()  # only render when there is something to plot
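
A raw time series of frequent readings tends to be noisy. One common way to expose the trend is to resample to daily means. The sketch below assumes a datetime-indexed series and uses synthetic hourly data; `daily_mean` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def daily_mean(series):
    """Average a datetime-indexed series into one value per calendar day."""
    return series.sort_index().resample("D").mean()

# Synthetic hourly readings: 24 hours at 10.0 followed by 24 hours at 30.0
demo_index = pd.Timestamp("2024-01-01") + pd.to_timedelta(np.arange(48), unit="h")
demo_series = pd.Series([10.0] * 24 + [30.0] * 24, index=demo_index)
demo_daily = daily_mean(demo_series)
print(demo_daily)
```

With `timestamp` set as the index, `daily_mean(df['pm2_5']).plot(figsize=(15,5))` would draw the smoothed version of the plot above.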

## 7. AQI Categorization

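Categorize each reading by its PM2.5 concentration using simplified thresholds (50, 100, and 150 µg/m³). These cut-offs are a coarse, project-specific bucketing rather than official AQI breakpoints (the US EPA scale, for example, defines six categories).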

In [None]:
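def aqi_category(pm25):
    # Missing readings stay missing instead of falling through to "Hazardous"
    if pd.isna(pm25):
        return np.nan
    if pm25 <= 50:
        return "Good"
    elif pm25 <= 100:
        return "Moderate"
    elif pm25 <= 150:
        return "Unhealthy"
    else:
        return "Hazardous"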

df['AQI_Category'] = df['pm2_5'].apply(aqi_category)

print("AQI Category Distribution:")
print(df['AQI_Category'].value_counts())

plt.figure(figsize=(8,5))
sns.countplot(x='AQI_Category', data=df, order=["Good","Moderate","Unhealthy","Hazardous"])
plt.title("AQI Category Distribution")
plt.ylabel("Count")
plt.show()
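
The same thresholds can also be applied to the whole column at once with `pd.cut`, which avoids a Python-level call per row and leaves missing readings as missing. This is a minimal sketch on made-up values; the names `AQI_BINS`, `AQI_LABELS`, and `categorize_pm25` are illustrative.

```python
import numpy as np
import pandas as pd

# Same cut-offs as aqi_category: <=50, <=100, <=150, above 150
AQI_BINS = [-np.inf, 50, 100, 150, np.inf]
AQI_LABELS = ["Good", "Moderate", "Unhealthy", "Hazardous"]

def categorize_pm25(values):
    """Vectorized bucketing; intervals are closed on the right by default."""
    return pd.cut(values, bins=AQI_BINS, labels=AQI_LABELS)

# Made-up readings, including boundary values and one missing value
demo_pm25 = pd.Series([12.0, 50.0, 50.1, 100.0, 150.0, 151.0, np.nan])
demo_categories = categorize_pm25(demo_pm25)
print(demo_categories)
```

Because the result is an ordered categorical, plots such as `sns.countplot` should follow the label order without an explicit `order` argument.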

## 8. Feature Engineering

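Create lag features for time series forecasting: the PM2.5 values from the previous one, two, and three rows. This assumes the rows are in chronological order and evenly spaced in time.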

In [None]:
df['lag1'] = df['pm2_5'].shift(1)
df['lag2'] = df['pm2_5'].shift(2)
df['lag3'] = df['pm2_5'].shift(3)

# Drop every row that contains a NaN: the first three rows (no lag history yet)
# and any row with a missing value in another column
df.dropna(inplace=True)

print("Feature engineering completed. New shape:", df.shape)
df.head()
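
The three `shift` calls generalize to any set of lags. The sketch below assumes a hypothetical helper `add_lags` that returns a new frame and drops only the rows lacking lag history, which differs from the blanket `dropna` above.

```python
import pandas as pd

def add_lags(frame, column="pm2_5", lags=(1, 2, 3)):
    """Return a copy of frame with lag<k> columns for each k in lags."""
    out = frame.copy()
    lag_columns = []
    for k in lags:
        name = f"lag{k}"
        out[name] = out[column].shift(k)  # value from k rows earlier
        lag_columns.append(name)
    # Drop only rows whose lag values are undefined (the first max(lags) rows)
    return out.dropna(subset=lag_columns)

# Made-up readings in chronological order
demo_readings = pd.DataFrame({"pm2_5": [10.0, 20.0, 30.0, 40.0, 50.0]})
demo_lagged = add_lags(demo_readings)
print(demo_lagged)
```

Returning a copy keeps the original frame untouched, which makes it easy to try different lag sets in the same notebook session.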

## 9. Model Explanation with SHAP

Load the pre-trained model and use SHAP to explain feature importance.

In [None]:
# Load pre-trained model
model = joblib.load("../models/best_model.pkl")

# Prepare features for SHAP
X = df.drop(columns=['pm2_5', 'AQI_Category'])

# Create SHAP explainer
explainer = shap.Explainer(model, X)
shap_values = explainer(X)

# Summary plot (show=False so the title is added before the figure is rendered)
shap.summary_plot(shap_values, X, show=False)
plt.title("SHAP Feature Importance Summary")
plt.show()

# Waterfall plot for a single prediction
shap.plots.waterfall(shap_values[0], show=False)
plt.title("SHAP Waterfall Plot for First Prediction")
plt.show()
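
Alongside the plots, a numeric ranking is often useful for reports. A common global importance measure is the mean absolute SHAP value per feature. The sketch below computes it with NumPy on an invented matrix; `rank_features` is a hypothetical helper, and it assumes a single-output model whose SHAP values form a samples-by-features array.

```python
import numpy as np
import pandas as pd

def rank_features(shap_matrix, feature_names):
    """Rank features by mean absolute SHAP value (rows = samples, columns = features)."""
    importance = np.abs(np.asarray(shap_matrix)).mean(axis=0)
    return pd.Series(importance, index=list(feature_names)).sort_values(ascending=False)

# Invented SHAP values for two samples and three features
demo_shap = np.array([
    [0.5, -2.0, 0.1],
    [-0.5, 1.0, 0.3],
])
demo_ranking = rank_features(demo_shap, ["hour", "lag1", "no2"])
print(demo_ranking)
```

With the objects from the cell above, the call would be `rank_features(shap_values.values, X.columns)`.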

## 10. Conclusion

This EDA examined AQI patterns in Hyderabad across:
- Key pollutants and their distributions
- Temporal variations (hourly, daily, monthly)
- Correlations between features
- AQI categorization
- Feature importance via SHAP

The final step below saves the cleaned dataset for further modeling.

## 11. Save Cleaned Data

In [None]:
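# Write the index only when it is named (e.g. 'timestamp' after the time series step),
# so timestamps are not lost; a default unnamed index is still omitted
df.to_csv("../data/hyderabad_aqi_eda_clean.csv", index=df.index.name is not None)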
print("Cleaned data saved to ../data/hyderabad_aqi_eda_clean.csv")