# 🧠 Model Feature Analysis & Performance Deep-Dive
**Project:** Karachi Air Quality Intelligence System  
**Developer:** Karan Kumar  

This notebook focuses specifically on the **intelligence** behind our system. We analyze:
1. **Model Performance**: How close are our predictions to reality?
2. **Feature Importance**: Which features drive the model's reported 99% accuracy?
3. **Error Analysis**: Where does the model struggle?
4. **Lag Correlation**: How strongly do past AQI values predict future ones?

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
import sys
import os
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Add src to path
sys.path.append('../src')
from database import AQIDatabase
from preprocessing import preprocess_data

sns.set_theme(style="whitegrid", palette="viridis")
print("Analysis environment ready!")

## 1. Data & Model Loading

In [None]:
# Load Model & Features
with open('../model.pkl', 'rb') as f:
    model = pickle.load(f)
with open('../features.pkl', 'rb') as f:
    features = pickle.load(f)

# Load Data
db = AQIDatabase()
df_raw = db.fetch_data()
df = preprocess_data(df_raw.copy())

print(f"Model Type: {type(model).__name__}")
print(f"Number of Features: {len(features)}")
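Before using the loaded feature list, it can help to confirm that every name in it exists as a column in the preprocessed frame. The sketch below is a minimal illustration on toy data: `check_feature_columns` is a hypothetical helper, and the `temperature` column is an assumed example, not a confirmed part of our schema.

```python
import pandas as pd

def check_feature_columns(frame, feature_names):
    """Return the feature names that are missing from frame (empty list if none)."""
    return [name for name in feature_names if name not in frame.columns]

# Toy stand-ins for the notebook's `df` and `features` (column names are assumptions)
toy_df = pd.DataFrame({"us_aqi": [120, 135], "temperature": [31.0, 30.5]})
toy_features = ["us_aqi", "temperature", "lag_24h_aqi"]

missing = check_feature_columns(toy_df, toy_features)
print(f"Missing feature columns: {missing}")
```

In the notebook itself, the same call would be `check_feature_columns(df, features)`; a non-empty result suggests the pickled feature list and the preprocessing step are out of sync.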

## 2. Global Feature Importance
This plot shows which weather and time features the XGBoost model relies on most.

In [None]:
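# feature_importances_ follows the model's importance_type setting;
# for XGBoost tree boosters the default is gain, not weight.
importance = pd.Series(model.feature_importances_, index=features).sort_values(ascending=True)

plt.figure(figsize=(10, 12))
importance.plot(kind='barh', color='skyblue')
plt.title('Feature Importance (XGBoost feature_importances_)', fontsize=15)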
plt.xlabel('Importance Score')
plt.show()
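Built-in tree importances can be cross-checked with permutation importance, which measures how much the score drops when one feature is shuffled. The sketch below is an assumption-laden illustration, not our pipeline: it uses synthetic data, a scikit-learn `GradientBoostingRegressor` as a stand-in for the real model, and hypothetical feature names (`lag_1h_aqi`, `humidity`, `noise`).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 400

# Synthetic features (names are hypothetical, chosen only for illustration)
X = pd.DataFrame({
    "lag_1h_aqi": rng.normal(150, 30, n),  # strong driver of the toy target
    "humidity": rng.normal(60, 10, n),     # weak driver
    "noise": rng.normal(0, 1, n),          # unrelated to the target
})
y = 0.9 * X["lag_1h_aqi"] + 0.2 * X["humidity"] + rng.normal(0, 2, n)

# Chronological-style split: first 300 rows to fit, last 100 to evaluate
X_train, X_test = X.iloc[:300], X.iloc[300:]
y_train, y_test = y.iloc[:300], y.iloc[300:]

toy_model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

# Shuffle each feature 10 times on held-out rows and record the mean score drop
result = permutation_importance(toy_model, X_test, y_test, n_repeats=10, random_state=0)
perm_importance = pd.Series(result.importances_mean, index=X.columns).sort_values(ascending=False)
print(perm_importance)
```

If the ranking from this kind of check disagrees sharply with the bar chart above, that is a hint that the built-in score may be overstating some features.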

## 3. Actual vs Predicted Analysis
How well does the model perform on the most recent data?

In [None]:
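# Prepare target: the next row's AQI (assumes hourly rows, so sort by date first)
df = df.sort_values('date').reset_index(drop=True)
df['target'] = df['us_aqi'].shift(-1)
df_test = df.dropna().tail(200)  # Evaluate on the most recent 200 hours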

X_test = df_test[features]
y_actual = df_test['target']
y_pred = model.predict(X_test)

plt.figure(figsize=(12, 6))
plt.plot(df_test['date'], y_actual, label='Actual AQI', marker='.', alpha=0.6)
plt.plot(df_test['date'], y_pred, label='Predicted AQI', marker='.', linestyle='dashed', color='red')
plt.title('Karachi AQI: Actual vs Predicted (Last 200 Hours)', fontsize=15)
plt.legend()
plt.show()
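The target construction above relies on `shift(-1)` pairing each row with the AQI of the following hour. A minimal sketch on made-up values shows the alignment and why the final row has to be dropped; the numbers here are invented for illustration only.

```python
import pandas as pd

# Six hourly readings with invented AQI values
toy = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=6, freq="h"),
    "us_aqi": [100, 110, 125, 120, 130, 140],
})

toy = toy.sort_values("date").reset_index(drop=True)
toy["target"] = toy["us_aqi"].shift(-1)  # next hour's AQI
toy = toy.dropna(subset=["target"])      # the last row has no next hour

holdout = toy.tail(3)                    # most recent rows, kept in time order
print(holdout)
```

Each row's `target` equals the `us_aqi` of the row after it, so the features at hour t are scored against the reading at hour t+1.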

## 4. Error (Residual) Analysis
This section checks whether the prediction errors are random or systematically biased.

In [None]:
residuals = y_actual - y_pred

plt.figure(figsize=(10, 6))
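# Plot the true residuals directly; sns.residplot would refit a regression
# and plot the residuals of that fit instead.
sns.regplot(x=y_pred, y=residuals, lowess=True, color="g", scatter_kws={'alpha': 0.5})
plt.axhline(0, color='gray', linestyle='--')
plt.title('Residual Plot (should look like random noise around zero)', fontsize=15)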
plt.xlabel('Predicted AQI')
plt.ylabel('Residual (Actual - Predicted)')
plt.show()
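The visual check can be backed by a few numbers: a mean residual far from zero suggests consistent over- or under-prediction, and a strong correlation between residuals and predictions suggests the error depends on the AQI level. The sketch below uses a hypothetical helper, `residual_summary`, on invented values.

```python
import numpy as np

def residual_summary(y_true, y_hat):
    """Summarize residuals (actual minus predicted) with simple bias indicators."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    residuals = y_true - y_hat
    return {
        "mean_residual": float(residuals.mean()),        # near 0 means no overall bias
        "residual_std": float(residuals.std(ddof=1)),    # spread of the errors
        "corr_with_prediction": float(np.corrcoef(y_hat, residuals)[0, 1]),
    }

# Invented values for illustration
toy_actual = [100, 120, 140, 160, 180]
toy_pred = [102, 118, 143, 159, 183]

summary = residual_summary(toy_actual, toy_pred)
print(summary)
```

In the notebook, `residual_summary(y_actual, y_pred)` would apply the same checks to the 200-hour window.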

## 5. Lag Strength Analysis
Does knowing the AQI 24 hours ago actually help our Karachi model?

In [None]:
plt.figure(figsize=(10, 6))
sns.regplot(x=df['lag_24h_aqi'], y=df['us_aqi'], scatter_kws={'alpha':0.3}, line_kws={'color':'red'})
plt.title('Correlation: Current AQI vs AQI 24 Hours Ago', fontsize=15)
plt.show()
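The scatter plot can be complemented with a correlation coefficient per lag, which makes different lags directly comparable. The sketch below is an illustration on a synthetic hourly series with an assumed 24-hour cycle; it is not our data, and the lags chosen (1, 12, 24) are examples.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
hours = np.arange(24 * 30)  # 30 days of hourly steps

# Synthetic AQI: a daily cycle plus noise (assumed shape, for illustration only)
aqi = pd.Series(150 + 30 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 5, hours.size))

# Pearson correlation between the series and its lagged copy;
# Series.corr ignores the NaN rows that shift() introduces
lag_corr = {lag: aqi.corr(aqi.shift(lag)) for lag in (1, 12, 24)}

for lag, value in lag_corr.items():
    print(f"lag {lag:>2}h: r = {value:.3f}")
```

On the real frame, `df['us_aqi'].corr(df['lag_24h_aqi'])` would give the single number behind the regression line above.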

## 6. Performance Metrics Summary
Final scores for the model, computed on the same 200-hour test window.

In [None]:
rmse = np.sqrt(mean_squared_error(y_actual, y_pred))
mae = mean_absolute_error(y_actual, y_pred)
r2 = r2_score(y_actual, y_pred)

print("--- KARACHI AQI MODEL PERFORMANCE ---")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")
print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"Coefficient of Determination (R2): {r2:.4f}")
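For reference, the three scores above follow directly from the residuals. The sketch below recomputes them with plain NumPy on invented values; `regression_metrics` is a hypothetical helper written only to make the formulas explicit.

```python
import numpy as np

def regression_metrics(y_true, y_hat):
    """Return (RMSE, MAE, R2) computed from first principles."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    err = y_true - y_hat

    rmse_value = float(np.sqrt(np.mean(err ** 2)))           # penalizes large errors more
    mae_value = float(np.mean(np.abs(err)))                  # average absolute miss
    ss_res = float(np.sum(err ** 2))                         # unexplained variation
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))    # total variation
    r2_value = 1.0 - ss_res / ss_tot
    return rmse_value, mae_value, r2_value

# Invented values: every prediction misses by exactly 10 AQI points
toy_rmse, toy_mae, toy_r2 = regression_metrics([100, 120, 140, 160], [110, 110, 150, 150])
print(f"RMSE={toy_rmse:.2f}  MAE={toy_mae:.2f}  R2={toy_r2:.2f}")
```

Because R2 is measured relative to the variance of the target, a window with large AQI swings can show a high R2 even when the absolute errors (RMSE, MAE) are not small, so the three numbers are best read together.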