# 🌍 Karachi AQI Exploratory Data Analysis (EDA)
**Project:** Karachi Air Quality Intelligence System  
**Developer:** Karan Kumar  

This notebook explores the dataset fetched from the MongoDB Feature Store, examines how preprocessing changes it, and visualizes feature importance from our XGBoost model.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import pickle
import sys
import os

# Add src to path to import local modules
sys.path.append('../src')
from database import AQIDatabase
from preprocessing import preprocess_data

sns.set_theme(style="darkgrid")
print("Libraries loaded successfully!")

## 1. Data Retrieval
Connect to MongoDB Atlas and pull the hourly records.
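
Once the records are loaded, it can be worth confirming that they really are hourly and free of gaps. The sketch below runs on a small synthetic frame rather than the live data; the `timestamp` column name is an assumption, since the schema returned by `fetch_data()` is not shown in this notebook.

```python
import pandas as pd

# Synthetic stand-in for df_raw; the 'timestamp' column name is an assumption.
sample = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-01 00:00", "2024-01-01 01:00",
        "2024-01-01 03:00", "2024-01-01 04:00",  # 02:00 is deliberately missing
    ]),
    "us_aqi": [120, 135, 150, 142],
})

# Build the full hourly range between the first and last record,
# then list the hours that have no matching row.
expected = pd.date_range(sample["timestamp"].min(), sample["timestamp"].max(), freq="h")
missing_hours = expected.difference(pd.DatetimeIndex(sample["timestamp"]))

# Count repeated timestamps, which would distort hourly averages.
duplicate_count = int(sample["timestamp"].duplicated().sum())

print(f"Missing hours: {len(missing_hours)}, duplicate timestamps: {duplicate_count}")
```

The same check could be pointed at `df_raw` by swapping in its actual timestamp column.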

In [None]:
db = AQIDatabase()
df_raw = db.fetch_data()
print(f"Raw Data Loaded: {len(df_raw)} records")
df_raw.head()

## 2. Statistical Summary

In [None]:
df_raw.describe().T

## 3. Preprocessing & Feature Engineering
Analyze how our preprocessing pipeline cleans the data and generates new features like rolling averages and lags.
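
As a rough illustration of what lag and rolling-average features look like, here is a minimal sketch on synthetic data. The column names (`aqi_lag_1`, `aqi_roll_mean_3`) and the window size are hypothetical; the actual features produced by `preprocess_data` may differ.

```python
import pandas as pd

# Synthetic hourly series; names and window sizes are illustrative assumptions.
demo = pd.DataFrame(
    {"us_aqi": [100, 110, 120, 130, 140, 150]},
    index=pd.date_range("2024-01-01", periods=6, freq="h"),
)

# Lag feature: the AQI value one hour earlier.
demo["aqi_lag_1"] = demo["us_aqi"].shift(1)

# Rolling mean over the previous 3 hours. Shifting first keeps the current
# hour's value out of its own feature, which avoids target leakage.
demo["aqi_roll_mean_3"] = demo["us_aqi"].shift(1).rolling(window=3).mean()

# Rows without enough history contain NaN and are dropped here.
demo_clean = demo.dropna()
print(demo_clean)
```

If the pipeline drops incomplete rows in a similar way, that would explain a record count after preprocessing that is slightly lower than the raw count.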

In [None]:
df = preprocess_data(df_raw.copy())
print(f"Data after preprocessing: {len(df)} records")
df.head()

## 4. Visual Analysis
### 4.1 AQI Distribution in Karachi

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(df['us_aqi'], kde=True, color='teal')
plt.title('Karachi AQI Distribution')
plt.xlabel('US AQI Value')
plt.show()

### 4.2 Correlation Heatmap
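Identify which features correlate most strongly with the AQI. Correlation alone does not establish that a feature drives AQI, so treat strong pairs as candidates for closer inspection.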
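
A full heatmap can be hard to read when there are many engineered features, so a ranked list of correlations against the target is a useful companion. The sketch below uses synthetic data; the `pm2_5` and `wind_speed` columns are placeholders, not necessarily columns in the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data: us_aqi is built to depend on pm2_5 but not on wind_speed.
rng = np.random.default_rng(0)
n = 200
pm25 = rng.normal(60, 15, n)
demo = pd.DataFrame({
    "pm2_5": pm25,
    "wind_speed": rng.normal(10, 3, n),
    "us_aqi": 2.0 * pm25 + rng.normal(0, 5, n),
})

# Pearson correlation of every feature with the target, target itself removed.
corr_with_target = demo.corr()["us_aqi"].drop("us_aqi")

# Order by absolute strength while keeping the sign of each correlation.
order = corr_with_target.abs().sort_values(ascending=False).index
target_corr = corr_with_target.loc[order]
print(target_corr)
```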

In [None]:
plt.figure(figsize=(12, 10))
corr = df.select_dtypes(include=[np.number]).corr()
sns.heatmap(corr, annot=False, cmap='coolwarm', linewidths=0.5)
plt.title('Feature Correlation Matrix')
plt.show()

### 4.3 Time-Series Trends
Karachi's pollution often follows daily patterns (e.g., peak traffic hours).
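
The hourly plot relies on an `hour` column. A minimal sketch of how such a column can be derived and used to locate peak hours is shown below; the `timestamp` column name and the synthetic peaks at 08:00 and 18:00 are assumptions made purely for illustration.

```python
import pandas as pd

# Two synthetic days of hourly data; 'timestamp' is an assumed column name.
demo = pd.DataFrame({"timestamp": pd.date_range("2024-01-01", periods=48, freq="h")})

# Extract the hour of day (0-23) from the timestamp.
demo["hour"] = demo["timestamp"].dt.hour

# Synthetic AQI: a baseline of 100 with a bump of 50 at 08:00 and 18:00.
demo["us_aqi"] = 100 + demo["hour"].isin([8, 18]) * 50

# Average AQI per hour of day, then the two hours with the highest mean.
hourly_mean = demo.groupby("hour")["us_aqi"].mean()
peak_hours = sorted(hourly_mean.nlargest(2).index.tolist())
print(f"Peak hours: {peak_hours}")
```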

In [None]:
avg_hourly = df.groupby('hour')['us_aqi'].mean()
plt.figure(figsize=(10, 5))
avg_hourly.plot(kind='line', marker='o', color='red')
plt.title('Average Hourly AQI Trend in Karachi')
plt.xlabel('Hour of Day (24h)')
plt.ylabel('Mean AQI')
plt.xticks(range(0, 24))
plt.grid(True)
plt.show()

## 5. Model Feature Importance
Which features did the XGBoost model find most useful?
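
Beyond a top-N bar chart, it can help to ask how many features account for most of the total importance. The sketch below uses hypothetical scores and feature names; in this notebook the real values would come from `model.feature_importances_` and `features`.

```python
import pandas as pd

# Hypothetical importance scores, sorted from most to least important.
importance_demo = pd.Series(
    {"pm2_5": 0.50, "aqi_lag_1": 0.25, "pm10": 0.15, "hour": 0.07, "wind_speed": 0.03}
).sort_values(ascending=False)

# Normalize explicitly so the shares sum to 1 regardless of the score scale.
share = importance_demo / importance_demo.sum()
cumulative = share.cumsum()

# Smallest number of top features whose combined share reaches 80%.
threshold = 0.80
n_needed = int((cumulative < threshold).sum()) + 1
top_features = cumulative.index[:n_needed].tolist()
print(f"{n_needed} features cover at least {threshold:.0%} of importance: {top_features}")
```

A steep cumulative curve would suggest the model leans on a handful of features, which is worth knowing before pruning inputs.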

In [None]:
try:
    with open('../model.pkl', 'rb') as f:
        model = pickle.load(f)
    with open('../features.pkl', 'rb') as f:
        features = pickle.load(f)
    
    # Get feature importance
    importance = pd.Series(model.feature_importances_, index=features).sort_values(ascending=False)
    
    plt.figure(figsize=(10, 8))
    importance.head(15).plot(kind='barh', color='darkblue')
    plt.title('Top 15 Most Important Features - XGBoost')
    plt.xlabel('Importance Score')
    plt.gca().invert_yaxis()
    plt.show()
except Exception as e:
    print(f"Could not load model for feature analysis: {e}")