<a href="https://colab.research.google.com/github/leulged/pm25-air-quality-prediction-zindi/blob/main/PM2_5_Air_Quality_Prediction_Zindi_Challenge.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🌍 Zindi Air Quality Challenge Overview

## 🎯 Objective:
The goal is to **predict PM2.5 (particulate matter with diameter < 2.5 micrometers)** for various cities and dates based on weather data and satellite readings. PM2.5 is one of the most harmful air pollutants, so accurate prediction can help monitor and manage air quality globally.

The model will be trained on **`Train.csv`**, which includes:
- PM2.5 target values (`target`)
- Ground-based sensor stats (min, max, variance, count)
- Weather features from GFS (e.g., humidity, temperature, wind)
- Satellite-based pollution measurements (e.g., NO₂, SO₂, CH₄)

We'll then use the trained model to predict PM2.5 for unseen cities/dates in **`Test.csv`** and submit our predictions in the required format.

## 📈 Evaluation Metric:
- The predictions are evaluated using **Root Mean Squared Error (RMSE)** between the predicted `target` and actual values on the private leaderboard.
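
For reference, here is a minimal sketch of how RMSE is computed. The `rmse` helper and the arrays are illustrative assumptions (made-up values, not competition data), not part of Zindi's scoring code.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up PM2.5 values, for illustration only
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 16.0, 30.0, 43.0]

# Errors are -2, 4, 0, -3, so the mean squared error is 29 / 4 = 7.25
print(rmse(y_true, y_pred))
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.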

## ⚙️ Tools & Constraints:
- We use only `Train.csv` and `Test.csv` (no external data unless Zindi allows it)
- We use only open-source libraries (no AutoML)
- No metadata or image-specific shortcuts
- No data leakage (test data must not be used in training!)

## 🔁 General Workflow:
1. **Data Loading & Initial Exploration** ✅
2. **Missing Value Analysis & Data Cleaning**
3. **Exploratory Data Analysis (EDA)**
4. **Feature Engineering & Selection**
5. **Model Building**
6. **Model Evaluation (on validation set)**
7. **Final Prediction & Submission**


# 📊 Step 1: Data Loading & Exploration

## 🔍 What we’ll do in this step:
- Import essential Python libraries (`pandas`, `numpy`, etc.)
- Load the training and test datasets from Google Drive
- Display dataset shapes to understand their size
- Preview the first few rows to get familiar with the structure


In [1]:
# 📦 Importing essential libraries
import pandas as pd
import numpy as np

# 📍 File paths on Google Drive
train_path = '/content/drive/MyDrive/zindi challenge/Train.csv'
test_path = '/content/drive/MyDrive/zindi challenge/Test.csv'

# 📥 Load datasets
train_df = pd.read_csv(train_path)
test_df = pd.read_csv(test_path)

# 🔢 Show shape of datasets
print("✅ Train shape:", train_df.shape)
print("✅ Test shape:", test_df.shape)

# 👀 Preview first few rows of Train and Test
print("\n🧪 Train Sample:")
display(train_df.head())

print("\n🧪 Test Sample:")
display(test_df.head())


✅ Train shape: (30557, 82)
✅ Test shape: (16136, 77)

🧪 Train Sample:


Unnamed: 0,Place_ID X Date,Date,Place_ID,target,target_min,target_max,target_variance,target_count,precipitable_water_entire_atmosphere,relative_humidity_2m_above_ground,...,L3_SO2_sensor_zenith_angle,L3_SO2_solar_azimuth_angle,L3_SO2_solar_zenith_angle,L3_CH4_CH4_column_volume_mixing_ratio_dry_air,L3_CH4_aerosol_height,L3_CH4_aerosol_optical_depth,L3_CH4_sensor_azimuth_angle,L3_CH4_sensor_zenith_angle,L3_CH4_solar_azimuth_angle,L3_CH4_solar_zenith_angle
0,010Q650 X 2020-01-02,2020-01-02,010Q650,38.0,23.0,53.0,769.5,92,11.0,60.200001,...,38.593017,-61.752587,22.363665,1793.793579,3227.855469,0.010579,74.481049,37.501499,-62.142639,22.545118
1,010Q650 X 2020-01-03,2020-01-03,010Q650,39.0,25.0,63.0,1319.85,91,14.6,48.799999,...,59.624912,-67.693509,28.614804,1789.960449,3384.226562,0.015104,75.630043,55.657486,-53.868134,19.293652
2,010Q650 X 2020-01-04,2020-01-04,010Q650,24.0,8.0,56.0,1181.96,96,16.4,33.400002,...,49.839714,-78.342701,34.296977,,,,,,,
3,010Q650 X 2020-01-05,2020-01-05,010Q650,49.0,10.0,55.0,1113.67,96,6.911948,21.300001,...,29.181258,-73.896588,30.545446,,,,,,,
4,010Q650 X 2020-01-06,2020-01-06,010Q650,21.0,9.0,52.0,1164.82,95,13.900001,44.700001,...,0.797294,-68.61248,26.899694,,,,,,,



🧪 Test Sample:


Unnamed: 0,Place_ID X Date,Date,Place_ID,precipitable_water_entire_atmosphere,relative_humidity_2m_above_ground,specific_humidity_2m_above_ground,temperature_2m_above_ground,u_component_of_wind_10m_above_ground,v_component_of_wind_10m_above_ground,L3_NO2_NO2_column_number_density,...,L3_SO2_sensor_zenith_angle,L3_SO2_solar_azimuth_angle,L3_SO2_solar_zenith_angle,L3_CH4_CH4_column_volume_mixing_ratio_dry_air,L3_CH4_aerosol_height,L3_CH4_aerosol_optical_depth,L3_CH4_sensor_azimuth_angle,L3_CH4_sensor_zenith_angle,L3_CH4_solar_azimuth_angle,L3_CH4_solar_zenith_angle
0,0OS9LVX X 2020-01-02,2020-01-02,0OS9LVX,11.6,30.200001,0.00409,14.656824,3.956377,0.712605,5.3e-05,...,1.445658,-95.984984,22.942019,,,,,,,
1,0OS9LVX X 2020-01-03,2020-01-03,0OS9LVX,18.300001,42.900002,0.00595,15.026544,4.23043,0.661892,5e-05,...,34.641758,-95.014908,18.539116,,,,,,,
2,0OS9LVX X 2020-01-04,2020-01-04,0OS9LVX,17.6,41.299999,0.0059,15.511041,5.245728,1.640559,5e-05,...,55.872276,-94.015418,14.14082,,,,,,,
3,0OS9LVX X 2020-01-05,2020-01-05,0OS9LVX,15.011948,53.100002,0.00709,14.441858,5.454001,-0.190532,5.5e-05,...,59.174188,-97.247602,32.730553,,,,,,,
4,0OS9LVX X 2020-01-06,2020-01-06,0OS9LVX,9.7,71.599998,0.00808,11.896295,3.511787,-0.279441,5.5e-05,...,40.925873,-96.057265,28.320527,1831.261597,3229.118652,0.031068,-100.278343,41.84708,-95.910744,28.498789


## Step 2.1: Drop Unnecessary Columns

We'll drop the following:

- `Place_ID X Date`, `Date`, `Place_ID`: identifiers, useful only for grouping or merging, not for training. We save `Place_ID X Date` from the test set separately for the submission file.
- `target_min`, `target_max`, `target_variance`, `target_count`: summary statistics derived from the target. They are a leakage risk and, as the test preview shows, they are absent from `Test.csv`, so we drop them as well.

In [2]:
# Save identifier for later submission
id_column = 'Place_ID X Date'
submission_ids = test_df[id_column].copy()

# Drop identifiers and leakage-related columns from train
leakage_columns = ['target_min', 'target_max', 'target_variance', 'target_count']
train_df.drop(columns=[id_column, 'Date', 'Place_ID'] + leakage_columns, inplace=True)
test_df.drop(columns=[id_column, 'Date', 'Place_ID'], inplace=True)

print("✅ Dropped ID and leakage columns.")
print("🔢 Updated Train shape:", train_df.shape)
print("🔢 Updated Test shape:", test_df.shape)


✅ Dropped ID and leakage columns.
🔢 Updated Train shape: (30557, 75)
🔢 Updated Test shape: (16136, 74)
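
After the drop, the training frame should contain exactly the test columns plus `target` (75 versus 74 columns above). A minimal sketch of that alignment check follows; `toy_train` and `toy_test` are hypothetical stand-ins for `train_df` and `test_df`, with made-up feature names.

```python
import pandas as pd

# Toy frames standing in for train_df and test_df after the drop step
toy_train = pd.DataFrame({"target": [38.0, 39.0], "feat_a": [1.0, 2.0], "feat_b": [0.1, 0.2]})
toy_test = pd.DataFrame({"feat_a": [3.0], "feat_b": [0.3]})

# Feature columns are everything in train except the target
train_features = toy_train.columns.drop("target")

# Columns present on one side only, and whether the order matches
missing_in_test = train_features.difference(toy_test.columns)
extra_in_test = toy_test.columns.difference(train_features)
same_order = list(train_features) == list(toy_test.columns)

print("Missing in test:", list(missing_in_test))
print("Extra in test:", list(extra_in_test))
print("Same column order:", same_order)
```

Matching order matters later because the imputer and scaler operate on column positions once the frames are converted to arrays.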


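## Step 2.2: Handling Missing Values with KNN Imputation

We apply KNN imputation to fill the missing values in the training and test sets. The imputer is fit on the training features only and then applied to the test set, so no test data is used during fitting.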

In [3]:
from sklearn.impute import KNNImputer
# Drop the target column only from the training data
train_df_no_target = train_df.drop(columns=['target'])

# For test data, no target column to drop, so we can just use it as it is
test_df_no_target = test_df.copy()

# Now perform KNN imputation on the datasets
knn_imputer = KNNImputer(n_neighbors=5)

# Apply imputation to both train and test sets
train_df_imputed = pd.DataFrame(knn_imputer.fit_transform(train_df_no_target), columns=train_df_no_target.columns)
test_df_imputed = pd.DataFrame(knn_imputer.transform(test_df_no_target), columns=test_df_no_target.columns)

print("✅ Missing values imputed using KNN.")


✅ Missing values imputed using KNN.
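
To make the fit-on-train, transform-on-test pattern concrete, here is a small sketch on toy arrays. The data and `n_neighbors=2` are assumptions chosen so the result is easy to follow by hand; the notebook itself uses `n_neighbors=5` on the real features.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy training matrix with two missing entries
toy_train = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])
# Toy test row with one missing entry
toy_test = np.array([[np.nan, 3.0, 3.0]])

imputer = KNNImputer(n_neighbors=2)

# Fit on train only; each gap is filled with the mean of that feature
# over the 2 nearest rows that have the feature observed
train_filled = imputer.fit_transform(toy_train)

# Test rows are filled using neighbours drawn from the training rows
test_filled = imputer.transform(toy_test)

print(train_filled)
print(test_filled)
```

Distances are computed on the raw feature values, so features with large numeric ranges influence the choice of neighbours more than small-range ones.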


## Step 2.3: Feature Scaling

We plan to use boosting models such as CatBoost, LightGBM, or XGBoost, so scaling matters for only part of the pipeline:

- **CatBoost**: does not require feature scaling and often performs well on raw data.
- **LightGBM & XGBoost**: also handle unscaled data well when using tree-based boosters.
- **Linear models** (e.g., Lasso, Ridge), used for feature selection or baselines: scaling is recommended.

Plan for feature scaling:

- Scale only for feature selection (Lasso), using `StandardScaler`.
- Keep unscaled data for the boosting models (CatBoost, LightGBM, XGBoost).



In [4]:
from sklearn.preprocessing import StandardScaler

# Unscaled features (the target was already removed before imputation) and the target
X_train_unscaled = train_df_imputed.copy()
y_train = train_df['target'].copy()

# Working copy of the features that will be scaled
X_train_scaled = X_train_unscaled.copy()

# Initialize the scaler
scaler = StandardScaler()

# Fit on training data and transform
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train_scaled), columns=X_train_scaled.columns)

print("✅ Features scaled using StandardScaler for Lasso feature selection.")


✅ Features scaled using StandardScaler for Lasso feature selection.
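
As a small illustration of what `StandardScaler` does, the sketch below standardizes a toy matrix and compares the result with the manual formula `(x - mean) / std`. The arrays are made up, and scaling a test row is shown only to illustrate reusing the training statistics; the notebook scales the training features alone because the scaled data feeds only the Lasso step.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy features on very different scales
toy_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])
toy_test = np.array([[2.5, 250.0]])

# Fit on train only, then reuse the learned mean and std for both sets
scaler = StandardScaler().fit(toy_train)
train_scaled = scaler.transform(toy_train)
test_scaled = scaler.transform(toy_test)

# Same result by hand (StandardScaler uses the population std, ddof=0)
manual = (toy_train - toy_train.mean(axis=0)) / toy_train.std(axis=0)

print(train_scaled)
print(test_scaled)
```

After scaling, both columns have mean 0 and unit variance, so an L1 penalty treats their coefficients on a comparable footing.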


## Step 2.4: Feature Selection using Lasso

Lasso (L1 regularization) helps select the most important features automatically by shrinking some coefficients to exactly zero. We'll use `LassoCV` with cross-validation to find the best regularization strength, then keep only the features with non-zero coefficients.

In [5]:
from sklearn.linear_model import LassoCV
import numpy as np

# Initialize LassoCV model
lasso = LassoCV(cv=5, random_state=42, max_iter=10000)

# Fit the model to scaled features
lasso.fit(X_train_scaled, y_train)

# Get selected features (non-zero coefficients)
selected_features = X_train_scaled.columns[(lasso.coef_ != 0)]

# Reduced feature sets based on Lasso selection
X_train_lasso = X_train_unscaled[selected_features]
test_df_lasso = test_df_imputed[selected_features]

print(f"✅ Lasso selected {len(selected_features)} features out of {X_train_scaled.shape[1]}.")


✅ Lasso selected 68 features out of 74.
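
The selection mechanism can be seen on synthetic data. In this sketch only the first of four features drives the target, and a fixed `alpha=0.5` is an assumption chosen for illustration (the notebook lets `LassoCV` pick the strength instead).

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Four synthetic features; only column 0 influences the target
X = rng.normal(size=(500, 4))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=500)

# Scale first so the L1 penalty treats all features alike
X_scaled = StandardScaler().fit_transform(X)

# Fixed penalty strength, chosen for illustration only
model = Lasso(alpha=0.5).fit(X_scaled, y)

# Indices of the features whose coefficients survived the penalty
selected = np.flatnonzero(model.coef_ != 0)

print("Coefficients:", model.coef_)
print("Selected feature indices:", selected)
```

The irrelevant features receive coefficients of exactly zero, while the informative one is kept with a coefficient shrunk below its true value of 3. Keeping 68 of 74 features, as above, suggests the penalty chosen by cross-validation was fairly mild.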


We have now:

- imputed missing values using `KNNImputer`, and
- scaled the features and selected the Lasso-based subset.

Before modeling, it's worth a quick sanity check that no missing values remain in the selected training and test sets. `KNNImputer` should already have filled every gap, so both counts are expected to be zero.

Let's verify this now:

In [6]:
# Check for any remaining missing values
print("🔍 Checking for NaNs...")

print("Train NaNs:", X_train_lasso.isnull().sum().sum())
print("Test NaNs:", test_df_lasso.isnull().sum().sum())


🔍 Checking for NaNs...
Train NaNs: 0
Test NaNs: 0


# Step 3: Model Training and Validation

Here's how we'll proceed:

- Start by training a `CatBoostRegressor`, since it handles feature interactions well and tends to perform strongly on tabular data.
- Use cross-validation to estimate its performance.
- Then try other models (such as LightGBM or XGBoost) and compare.

Here's what we'll do in this step:

1. Set up the training data with the selected features and target.
2. Initialize and configure the `CatBoostRegressor`.
3. Perform 5-fold cross-validation using `cross_val_score` with negative RMSE scoring.
4. Output the average RMSE to get a performance estimate.

In [7]:
!pip install catboost

Collecting catboost
  Downloading catboost-1.2.8-cp311-cp311-manylinux2014_x86_64.whl.metadata (1.2 kB)
Downloading catboost-1.2.8-cp311-cp311-manylinux2014_x86_64.whl (99.2 MB)
Installing collected packages: catboost
Successfully installed catboost-1.2.8


In [8]:
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error
import numpy as np
from catboost import CatBoostRegressor


In [9]:

# Initialize CatBoost model
catboost_model = CatBoostRegressor(verbose=0, random_state=42)

# Define X and y (features and target)
X = X_train_lasso  # Lasso-selected, unscaled features
y = y_train        # PM2.5 target

# Perform 5-fold cross-validation using RMSE directly
cv_scores = cross_val_score(catboost_model, X, y, cv=5, scoring='neg_root_mean_squared_error')

# Output the results
print("📊 CatBoost CV RMSE scores:", -cv_scores)  # Negate because 'neg_root_mean_squared_error' returns negative values
print("📉 Mean CV RMSE:", -np.mean(cv_scores))

📊 CatBoost CV RMSE scores: [34.34875696 31.24146358 30.90563658 34.23953501 32.974604  ]
📉 Mean CV RMSE: 32.74199922595029
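
The sign flip in the cell above comes from scikit-learn's convention that higher scores are better, so error metrics are reported negated. A minimal sketch on synthetic data follows; `LinearRegression` stands in for CatBoost here purely to keep the example light, and the data is made up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic regression problem with noise of standard deviation 0.3
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=200)

# Scores come back as negative RMSE, one value per fold
scores = cross_val_score(
    LinearRegression(), X, y, cv=5, scoring="neg_root_mean_squared_error"
)

# Flip the sign to recover ordinary RMSE values
rmse_per_fold = -scores

print("Raw scores:", scores)
print("RMSE per fold:", rmse_per_fold)
print("Mean RMSE:", rmse_per_fold.mean())
```

With an integer `cv` and a regressor, scikit-learn uses `KFold` without shuffling, so each fold is a contiguous block of rows in the order they appear in the frame.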


Next, we tune the CatBoost hyperparameters with `RandomizedSearchCV`.

Steps:

1. Define a grid of hyperparameters (`learning_rate`, `depth`, `iterations`, `l2_leaf_reg`, and `border_count`).
2. Use `RandomizedSearchCV` to randomly sample combinations from this grid and evaluate them.
3. Select the best hyperparameters and retrain the CatBoost model using them.
4. Run cross-validation again to evaluate the performance.
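
To get a feel for how much of the grid a random search covers, the sketch below counts all combinations and draws the same number of samples as `n_iter=10`. It uses scikit-learn's `ParameterGrid` and `ParameterSampler` directly, without training any model; the grid values are the ones used in this notebook.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same search space as in the tuning cell
param_dist = {
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'depth': [4, 6, 8, 10],
    'iterations': [100, 200, 500],
    'l2_leaf_reg': [1, 3, 5, 10],
    'border_count': [32, 64, 128],
}

# Total number of distinct combinations: 4 * 4 * 3 * 4 * 3
n_total = len(ParameterGrid(param_dist))

# Draw 10 random combinations, as a random search with n_iter=10 would
sampled = list(ParameterSampler(param_dist, n_iter=10, random_state=42))

print("Total combinations:", n_total)
print("Sampled combinations:", len(sampled))
print("Coverage: {:.1%}".format(len(sampled) / n_total))
for params in sampled[:3]:
    print(params)
```

Ten draws cover under 2% of the 576 combinations, so the "best" parameters are only the best among a small sample, not of the whole grid.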

In [10]:
from catboost import CatBoostRegressor
from sklearn.model_selection import RandomizedSearchCV
import numpy as np

# Define parameter grid for RandomizedSearchCV
param_dist = {
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'depth': [4, 6, 8, 10],
    'iterations': [100, 200, 500],
    'l2_leaf_reg': [1, 3, 5, 10],
    'border_count': [32, 64, 128],
}

# Initialize CatBoostRegressor
catboost_model = CatBoostRegressor(random_state=42, verbose=0)

# Setup RandomizedSearchCV
random_search = RandomizedSearchCV(
    catboost_model,
    param_distributions=param_dist,
    n_iter=10,  # Number of iterations for the search
    cv=5,       # 5-fold cross-validation
    scoring='neg_root_mean_squared_error', # Use RMSE as scoring metric
    random_state=42
)

# Perform RandomizedSearchCV
random_search.fit(X_train_lasso, y_train)

# Output the best parameters
print("Best Hyperparameters:", random_search.best_params_)

# Get the best model from the search
best_model = random_search.best_estimator_

# Perform cross-validation to see the results of the best model
cv_scores = cross_val_score(best_model, X_train_lasso, y_train, cv=5, scoring='neg_root_mean_squared_error')

# Output results (scores are negative RMSE values, so flip the sign)
print("📊 Best CatBoost CV RMSE scores after tuning:", -cv_scores)
print("📉 Mean CV RMSE:", np.mean(-cv_scores))


Best Hyperparameters: {'learning_rate': 0.1, 'l2_leaf_reg': 1, 'iterations': 500, 'depth': 6, 'border_count': 32}
📊 Best CatBoost CV RMSE scores after tuning: [34.4138635  31.65988555 31.33080721 34.79576275 33.29279567]
📉 Mean CV RMSE: 33.098622935905375
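
It is worth comparing these scores with the default-parameter run from `In [9]`. The short sketch below does so using only the per-fold numbers printed in this notebook; the variable names are illustrative.

```python
import numpy as np

# Per-fold RMSE values copied from the printed outputs above
default_rmse = np.array([34.34875696, 31.24146358, 30.90563658, 34.23953501, 32.974604])
tuned_rmse = np.array([34.4138635, 31.65988555, 31.33080721, 34.79576275, 33.29279567])

# Positive differences mean the tuned model has the higher (worse) error
diff = tuned_rmse - default_rmse

print("Default mean RMSE:", default_rmse.mean())
print("Tuned mean RMSE:  ", tuned_rmse.mean())
print("Per-fold difference (tuned - default):", diff)
```

On these numbers the tuned configuration is slightly worse than the defaults on every fold. One plausible reason, offered as an assumption, is that the search caps `iterations` at 500 while CatBoost's default is 1000, and only 10 combinations were sampled.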


In [15]:

# Train the final CatBoost model with the best hyperparameters
best_catboost_model = CatBoostRegressor(
    learning_rate=0.1,
    l2_leaf_reg=1,
    iterations=500,
    depth=6,
    border_count=32,
    random_state=42,
    verbose=0
)

# Train on the entire training data (X_train_lasso, y_train)
best_catboost_model.fit(X_train_lasso, y_train)

# Generate predictions on the Lasso-selected test features (test_df_lasso from In [5])
y_pred_test = best_catboost_model.predict(test_df_lasso)

# Build the submission frame from the saved test identifiers
submission_df = pd.DataFrame({
    'Place_ID X Date': submission_ids,
    'target': y_pred_test
})

# Save the submission to a CSV file
submission_df.to_csv('/content/sample_data/final_submission.csv', index=False)

print("✅ Final model trained and predictions made.")
print("📂 Submission file saved as 'final_submission.csv'.")


✅ Final model trained and predictions made.
📂 Submission file saved as 'final_submission.csv'.
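
Before uploading, a few cheap checks on the submission frame can catch common mistakes. The sketch below runs them on a toy frame; `toy_ids` and `toy_preds` are hypothetical stand-ins for `submission_ids` and `y_pred_test`, and the checks are a suggested assumption, not Zindi's official validation.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the saved test IDs and the model predictions
toy_ids = pd.Series(
    ["0OS9LVX X 2020-01-02", "0OS9LVX X 2020-01-03"], name="Place_ID X Date"
)
toy_preds = np.array([41.2, 37.9])  # made-up predictions

toy_submission = pd.DataFrame({"Place_ID X Date": toy_ids, "target": toy_preds})

# Basic sanity checks on the submission frame
checks = {
    "row_count_matches": len(toy_submission) == len(toy_ids),
    "no_missing_predictions": not toy_submission["target"].isna().any(),
    "ids_unique": bool(toy_submission["Place_ID X Date"].is_unique),
    "non_negative": bool((toy_submission["target"] >= 0).all()),
}

for name, passed in checks.items():
    print(name, "OK" if passed else "FAILED")
```

The non-negativity check reflects that PM2.5 is a concentration and cannot be below zero; a regression model can still output negative values, which could be clipped before saving.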
