## Introduction

This project uses machine learning to predict air quality metrics (PM2.5, PM10, and NO2) for three cities: **New York**, **Chicago**, and **Los Angeles**. Below is a step-by-step explanation of the code:

---

### 1. **Data Loading and Preprocessing**
   - The dataset is loaded from a CSV file (`updated_air_quality_dataset.csv`).
   - The `Date` column is parsed into a temporary `datetime` column.
   - The day of the week (`day_of_week`) is derived from it; the dataset already provides `Year`, `Month`, `Day`, and `Hour` columns, which serve as the remaining temporal features.
   - The raw `Date` column and the temporary `datetime` column are then dropped.
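
The preprocessing steps above can be sketched on a tiny synthetic frame. The two rows below are made up for illustration and only stand in for `updated_air_quality_dataset.csv`; the column names follow the ones used in the project code.

```python
import pandas as pd

# Made-up rows standing in for the real CSV (assumption: same column names)
df = pd.DataFrame({
    "Date": ["2023-01-02", "2023-01-07"],
    "Year": [2023, 2023],
    "Month": [1, 1],
    "Day": [2, 7],
    "Hour": [8, 17],
})

# Parse the date string, then derive the weekday (Monday=0 ... Sunday=6)
df["datetime"] = pd.to_datetime(df["Date"])
df["day_of_week"] = df["datetime"].dt.dayofweek

# Drop the raw and helper columns; Year, Month, Day, and Hour remain as features
df = df.drop(columns=["Date", "datetime"])
```

After this step the frame holds only numeric temporal columns, which is the form a tree-based regressor can consume directly.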

---

### 2. **Feature and Target Selection**
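   - **Features**: Temporal features (`Year`, `Month`, `Day`, `Hour`, `day_of_week`) and environmental features (temperature, relative humidity, precipitation, wind speed), plus average traffic density for New York only.
   - **Target Variables**: PM2.5, PM10, and NO2 concentrations (New York is modeled for PM2.5 only).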
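
As a rough illustration of per-city selection, the sketch below filters one county and returns its feature matrix and target. The helper `select_xy`, the shortened feature list, and the toy values are hypothetical and not part of the project code; only the column names are taken from it.

```python
import pandas as pd

# Shortened feature list for illustration (the project uses more columns)
BASE_FEATURES = ["Hour", "temperature_2m (°C)"]

def select_xy(df, county, feature_cols, target):
    """Hypothetical helper: return (X, y) for one county and one target column."""
    subset = df[df["County Name"] == county]
    return subset[feature_cols], subset[target]

# Made-up rows; only the New York rows carry a traffic density value
toy = pd.DataFrame({
    "County Name": ["New York", "Cook", "New York"],
    "Hour": [0, 1, 2],
    "temperature_2m (°C)": [3.5, 1.0, 4.2],
    "Average Traffic Density": [0.4, None, 0.7],
    "PM2.5": [8.1, 6.3, 9.0],
})

# New York extends the base features with traffic density
X_ny, y_ny = select_xy(
    toy, "New York", BASE_FEATURES + ["Average Traffic Density"], "PM2.5"
)
```

Keeping the base feature list separate from the city-specific extra makes it easy to see which inputs differ between cities.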

---

### 3. **Model Training and Evaluation**
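   - A **RandomForestRegressor** is used to predict air quality metrics, with hyperparameters tuned by `GridSearchCV` (3-fold cross-validation, scored by negative mean squared error).
   - Each city's data is split into training (80%) and testing (20%) sets with a fixed random seed.
   - Separate models are trained per city and target variable: PM2.5 for New York, and PM10, NO2, and PM2.5 for both Chicago and Los Angeles.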
   - Model performance is evaluated using **Mean Squared Error (MSE)** and **R² Score**.
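
To make the two metrics concrete, here is a small NumPy sketch that computes them by hand. The arrays are made-up values, not project results; the formulas match what `mean_squared_error` and `r2_score` compute with default arguments for a single target.

```python
import numpy as np

# Made-up observations and predictions, for illustration only
y_true = np.array([10.0, 12.0, 14.0, 16.0])
y_pred = np.array([11.0, 12.0, 13.0, 18.0])

# MSE: mean of the squared residuals
mse_manual = np.mean((y_true - y_pred) ** 2)

# R^2: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot
```

MSE depends on the scale of the target, so it is not directly comparable across pollutants, whereas R² is unitless and can be compared between models.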

## Code:

```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Load the dataset
final_df = pd.read_csv("updated_air_quality_dataset.csv")

# Parse Date into a datetime column and derive the day of the week (Monday=0)
final_df['datetime'] = pd.to_datetime(final_df['Date'])
final_df['day_of_week'] = final_df['datetime'].dt.dayofweek

# Drop the raw Date and helper datetime columns; Year, Month, Day, and Hour stay as features
final_df = final_df.drop(columns=['Date', 'datetime'])

features = ['Year', 'Month', 'Day', 'Hour', 'day_of_week',
            'temperature_2m (°C)', 'relative_humidity_2m (%)',
            'precipitation (mm)', 'wind_speed_100m (km/h)']

target_pm25 = 'PM2.5'
target_pm10 = 'PM10'
target_no2 = 'NO2'

# Function to perform hyperparameter tuning, training, and evaluation of a RandomForestRegressor
def train_and_evaluate_rf(X_train, X_test, y_train, y_test, target_name):
    # Define the parameter grid for hyperparameter tuning
    param_grid = {
        'n_estimators': [50, 100, 200],
        'max_depth': [None, 10, 20, 30],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4],
        'bootstrap': [True, False]
    }

    # Initialize the RandomForestRegressor
    rf = RandomForestRegressor(random_state=42)

    # Use GridSearchCV for hyperparameter tuning with 3-fold cross-validation
    grid_search = GridSearchCV(estimator=rf,
                               param_grid=param_grid,
                               cv=3,
                               n_jobs=-1,
                               scoring='neg_mean_squared_error')
    grid_search.fit(X_train, y_train)

    # Retrieve the best estimator (refit on the full training set)
    best_rf = grid_search.best_estimator_

    # Predict on the test set and calculate performance metrics
    y_pred = best_rf.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    print(f"{target_name} Model - MSE: {mse:.3f}, R2: {r2:.3f}")
    return best_rf

###################################################################################################
# New York (Using PM2.5 and Average Traffic Density)
ny_df = final_df[final_df['County Name'] == 'New York']
ny_features = features + ['Average Traffic Density']  # Include traffic density for New York
X_ny = ny_df[ny_features]
y_ny_pm25 = ny_df[target_pm25]

# Border
print("--------------------------------------------------")

# Split the New York dataset into training and testing sets
X_ny_train, X_ny_test, y_ny_train, y_ny_test = train_test_split(X_ny, y_ny_pm25, test_size=0.2, random_state=42)

# Train and evaluate the PM2.5 model for New York with hyperparameter tuning
rf_pm25_ny = train_and_evaluate_rf(X_ny_train, X_ny_test, y_ny_train, y_ny_test, "New York PM2.5")

###################################################################################################
# Chicago (Using PM10, NO2, and PM2.5); Chicago is in Cook County
chicago_df = final_df[final_df['County Name'] == 'Cook']
X_chicago = chicago_df[features]
y_chicago_pm10 = chicago_df[target_pm10]
y_chicago_no2 = chicago_df[target_no2]
y_chicago_pm25 = chicago_df[target_pm25]

# Split the Chicago dataset into training and testing sets (for three target variables)
X_chicago_train, X_chicago_test, \
y_chicago_pm10_train, y_chicago_pm10_test, \
y_chicago_no2_train, y_chicago_no2_test, \
y_chicago_pm25_train, y_chicago_pm25_test = train_test_split(
    X_chicago, y_chicago_pm10, y_chicago_no2, y_chicago_pm25, test_size=0.2, random_state=42)

# Border
print("--------------------------------------------------")

# Train and evaluate the PM10 model for Chicago
rf_pm10_chicago = train_and_evaluate_rf(X_chicago_train, X_chicago_test, y_chicago_pm10_train, y_chicago_pm10_test, "Chicago PM10")

# Train and evaluate the NO2 model for Chicago
rf_no2_chicago = train_and_evaluate_rf(X_chicago_train, X_chicago_test, y_chicago_no2_train, y_chicago_no2_test, "Chicago NO2")

# Train and evaluate the PM2.5 model for Chicago
rf_pm25_chicago = train_and_evaluate_rf(X_chicago_train, X_chicago_test, y_chicago_pm25_train, y_chicago_pm25_test, "Chicago PM2.5")

###################################################################################################
# Los Angeles (Using PM10, NO2, and PM2.5)
los_angeles_df = final_df[final_df['County Name'] == 'Los Angeles']
X_la = los_angeles_df[features]
y_la_pm10 = los_angeles_df[target_pm10]
y_la_no2 = los_angeles_df[target_no2]
y_la_pm25 = los_angeles_df[target_pm25]

# Split the Los Angeles dataset into training and testing sets (for three target variables)
X_la_train, X_la_test, \
y_la_pm10_train, y_la_pm10_test, \
y_la_no2_train, y_la_no2_test, \
y_la_pm25_train, y_la_pm25_test = train_test_split(
    X_la, y_la_pm10, y_la_no2, y_la_pm25, test_size=0.2, random_state=42)

# Border
print("--------------------------------------------------")

# Train and evaluate the PM10 model for Los Angeles
rf_pm10_la = train_and_evaluate_rf(X_la_train, X_la_test, y_la_pm10_train, y_la_pm10_test, "Los Angeles PM10")

# Train and evaluate the NO2 model for Los Angeles
rf_no2_la = train_and_evaluate_rf(X_la_train, X_la_test, y_la_no2_train, y_la_no2_test, "Los Angeles NO2")

# Train and evaluate the PM2.5 model for Los Angeles
rf_pm25_la = train_and_evaluate_rf(X_la_train, X_la_test, y_la_pm25_train, y_la_pm25_test, "Los Angeles PM2.5")
```

```text
--------------------------------------------------
New York PM2.5 Model - MSE: 15.547, R2: 0.521
--------------------------------------------------
Chicago PM10 Model - MSE: 54.348, R2: 0.754
Chicago NO2 Model - MSE: 13.416, R2: 0.783
Chicago PM2.5 Model - MSE: 12.740, R2: 0.615
--------------------------------------------------
Los Angeles PM10 Model - MSE: 83.752, R2: 0.538
Los Angeles NO2 Model - MSE: 24.819, R2: 0.811
Los Angeles PM2.5 Model - MSE: 43.068, R2: 0.869
```
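
One way to read the MSE values reported above is to take their square root, which gives the root mean squared error (RMSE) in the same units as each target. The sketch below does this for the printed figures; the dictionary name `reported_mse` is just an illustrative label, and RMSE is not computed in the original code.

```python
import math

# MSE values copied from the printed results above
reported_mse = {
    "New York PM2.5": 15.547,
    "Chicago PM10": 54.348,
    "Chicago NO2": 13.416,
    "Chicago PM2.5": 12.740,
    "Los Angeles PM10": 83.752,
    "Los Angeles NO2": 24.819,
    "Los Angeles PM2.5": 43.068,
}

# RMSE is the square root of MSE, expressed in the target's own units
rmse = {name: math.sqrt(mse) for name, mse in reported_mse.items()}

for name, value in rmse.items():
    print(f"{name} Model - RMSE: {value:.3f}")
```

RMSE can be compared against the typical range of each pollutant, which is often easier to interpret than the squared error.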
