# ML2 | Pollution Prediction Project
**MAI 2023/24 USC | Online Learning Project**

This Jupyter Notebook is part of the Online Learning Project for the Machine Learning 2 course at UPC School. The project uses data from the [Data Repository](https://link).


**Authors:**
- Brian
- Fernando Nunez Sanchez
- Marcin 
- Santiago Suárez Carrera


In [9]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import numpy as np


## 1. Problem Description (Max. 1 point)

### Informal Problem Description
Predicting pollution levels (PM2.5) is crucial for public health and environmental policy-making. This project aims to forecast pollution levels 24 hours ahead using weather forecasts and current pollution data.

### ML Problem Characteristics
- **What is the problem?** Time series forecasting of pollution levels 24 hours ahead.
- **Type of problem:** Regression.
- **Dataset imbalance:** Explore if certain periods exhibit high variance in pollution levels, potentially causing imbalance.
- **Potential for concept drift:** Consider how seasonal changes, urban development, or policy changes may affect pollution levels over time.
- **Evaluation metrics:** Use RMSE and MAE for model performance evaluation due to the regression nature of the problem.
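
As a minimal sketch of the two metrics named above, the following plain-Python functions compute MAE and RMSE. The function names and the sample values are illustrative assumptions, not part of the project code.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error: the average magnitude of the errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors more heavily than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Made-up PM2.5 values, used only to exercise the functions
y_true = [100.0, 150.0, 200.0]
y_pred = [110.0, 140.0, 230.0]
print(mae(y_true, y_pred), rmse(y_true, y_pred))
```

Because RMSE squares the errors, a single large miss (such as an unpredicted pollution spike) raises it more than it raises MAE, so reporting both gives a fuller picture.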

## 2. Dataset Selection (Max. 1 point)
- **Justification for Dataset Suitability:** The selected dataset is ideal for stream learning due to its temporal component and relevance to pollution level predictions.
- **Dataset Source:** Confirm the source of the dataset and its suitability for stream processing.
- **Dataset Preparation:** Describe any preprocessing to make the dataset stream-friendly or simulate real-time streaming.
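
One possible way to simulate real-time streaming is to read the CSV one row at a time and yield `(features, target)` pairs in file order. The sketch below assumes the column layout of the project's dataset; `stream_rows` is a hypothetical helper, and the two sample rows stand in for the real file.

```python
import csv
import io

def stream_rows(csv_file, target="pollution"):
    """Yield (features, target) pairs one row at a time, in file order."""
    for row in csv.DictReader(csv_file):
        y = float(row.pop(target))  # remove the target from the feature dict
        yield row, y

# Two sample rows in the dataset's format; in practice one would pass
# an open file handle for the CSV instead of this in-memory buffer.
sample = io.StringIO(
    "date,pollution,dew,temp,press,wnd_dir,wnd_spd,snow,rain\n"
    "2010-01-02 00:00:00,129.0,-16,-4.0,1020.0,SE,1.79,0,0\n"
    "2010-01-02 01:00:00,148.0,-15,-4.0,1020.0,SE,2.68,0,0\n"
)
stream = list(stream_rows(sample))
for x, y in stream:
    print(x["date"], y)
```

Note that `csv.DictReader` returns every feature as a string, so numeric columns would still need casting before they reach a model.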

### Dataset Overview

For this project, we have chosen a comprehensive dataset on air pollution, specifically focusing on PM2.5 particle measurements. The dataset comprises several weather-related features alongside the pollution readings, recorded hourly. Our primary goal is to predict future pollution levels (24 hours ahead) based on current weather conditions and pollution data. This predictive model could serve as a tool for early warning systems to mitigate the adverse effects of air pollution on health and the environment.

### Dataset Characteristics

The dataset includes the following features:

- `date`: Timestamp of the observation (hourly data)
- `pollution`: PM2.5 concentration
- `dew`: Dew point
- `temp`: Temperature
- `press`: Pressure
- `wnd_dir`: Wind direction
- `wnd_spd`: Wind speed
- `snow`: Snowfall
- `rain`: Rainfall

#### Initial Data Analysis


In [19]:
# Loading the dataset
dataset = pd.read_csv('data/air_pollution_dataset.csv')

# Display the first few rows of the dataset
dataset.head()


| | date | pollution | dew | temp | press | wnd_dir | wnd_spd | snow | rain |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2010-01-02 00:00:00 | 129.0 | -16 | -4.0 | 1020.0 | SE | 1.79 | 0 | 0 |
| 1 | 2010-01-02 01:00:00 | 148.0 | -15 | -4.0 | 1020.0 | SE | 2.68 | 0 | 0 |
| 2 | 2010-01-02 02:00:00 | 159.0 | -11 | -5.0 | 1021.0 | SE | 3.57 | 0 | 0 |
| 3 | 2010-01-02 03:00:00 | 181.0 | -7 | -5.0 | 1022.0 | SE | 5.36 | 1 | 0 |
| 4 | 2010-01-02 04:00:00 | 138.0 | -7 | -5.0 | 1022.0 | SE | 6.25 | 2 | 0 |


In [20]:
# Checking for null values
dataset.isnull().sum()

date         0
pollution    0
dew          0
temp         0
press        0
wnd_dir      0
wnd_spd      0
snow         0
rain         0
dtype: int64

In [21]:
# Statistical summary of the dataset
dataset.describe()

| | pollution | dew | temp | press | wnd_spd | snow | rain |
|---|---|---|---|---|---|---|---|
| count | 43800.0 | 43800.0 | 43800.0 | 43800.0 | 43800.0 | 43800.0 | 43800.0 |
| mean | 94.013516 | 1.828516 | 12.459041 | 1016.447306 | 23.894307 | 0.052763 | 0.195023 |
| std | 92.252276 | 14.429326 | 12.193384 | 10.271411 | 50.022729 | 0.760582 | 1.416247 |
| min | 0.0 | -40.0 | -19.0 | 991.0 | 0.45 | 0.0 | 0.0 |
| 25% | 24.0 | -10.0 | 2.0 | 1008.0 | 1.79 | 0.0 | 0.0 |
| 50% | 68.0 | 2.0 | 14.0 | 1016.0 | 5.37 | 0.0 | 0.0 |
| 75% | 132.25 | 15.0 | 23.0 | 1025.0 | 21.91 | 0.0 | 0.0 |
| max | 994.0 | 28.0 | 42.0 | 1046.0 | 585.6 | 27.0 | 36.0 |


In [22]:
# Unique values in 'wnd_dir' column
print(dataset['wnd_dir'].unique())


['SE' 'cv' 'NW' 'NE']


## 3. Data Preparation (Max. 1 point)
- **Data Type Conversions:** Convert categorical variables and cast numerical variables as needed for River processing.
- **Normalization/Standardization:** Discuss normalizing or standardizing features to improve model performance.
- **Feature Engineering:** Create new features and select relevant ones to enhance predictions.
- **Categorization:** Explain categorization of continuous variables, if applicable.


### Feature Engineering

To align with our objective, we will perform the following data preparation steps:

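1. **Shift the `pollution` column by 24 hours** to create the `current_pollution` feature (the reading from 24 hours earlier), and rename the original `pollution` column to `pred_pollution`, which serves as the target variable.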
2. **Extract `day of the week` and `month` from the `date` column** to capture temporal patterns in pollution levels.
3. **Forecast Weather Features**: For the purpose of this project, we will treat the current weather features as if they were forecasts for the next 24 hours. This simplification assumes the availability of accurate weather forecasts.
4. **Encode the `wnd_dir` categorical variable** using one-hot encoding to convert it into a format suitable for our machine learning models.
5. **Normalize or standardize the numerical features** as required to ensure consistent scale across all input features.

These preparation steps will transform the raw dataset into a structured format that our machine learning models can efficiently process to predict future pollution levels.
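
To make the alignment in steps 1 and 2 concrete, here is a small sketch on a synthetic hourly series (the values and the `toy` frame are assumptions made for illustration). Each remaining row pairs the target with the reading taken 24 hours earlier.

```python
import pandas as pd

# Synthetic hourly series: 48 hours starting 2010-01-02, pollution = hour index
toy = pd.DataFrame({
    "date": pd.date_range("2010-01-02", periods=48, freq=pd.Timedelta(hours=1)),
    "pollution": range(48),
})

# Feature: the reading from 24 hours earlier; target: the reading at the current row
toy["current_pollution"] = toy["pollution"].shift(24)
toy = toy.rename(columns={"pollution": "pred_pollution"})

# The first 24 rows have no reading 24 hours earlier, so they are dropped
toy = toy.dropna(subset=["current_pollution"]).reset_index(drop=True)

# Calendar features (Monday=0 ... Sunday=6)
toy["day_of_week"] = toy["date"].dt.dayofweek
toy["month"] = toy["date"].dt.month
print(toy.head())
```

Since the synthetic pollution value equals the hour index, every row should show a difference of exactly 24 between `pred_pollution` and `current_pollution`, which is a quick sanity check on the shift direction.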

In [33]:
# Loading the dataset
dataset = pd.read_csv('data/air_pollution_dataset.csv')

In [34]:
# Step 1: Create the lagged feature: the pollution reading from 24 hours earlier
dataset['current_pollution'] = dataset['pollution'].shift(24)

# Drop the first 24 rows, where current_pollution is NaN due to shifting
dataset = dataset.iloc[24:]

# Rename the pollution column to pred_pollution (the prediction target)
dataset = dataset.rename(columns={'pollution': 'pred_pollution'})

In [35]:
# Step 2: Extract day of the week and month from the date column
dataset['date'] = pd.to_datetime(dataset['date'])
dataset['day_of_week'] = dataset['date'].dt.dayofweek  # Monday=0 ... Sunday=6
dataset['month'] = dataset['date'].dt.month

In [36]:
dataset.head()

| | date | pred_pollution | dew | temp | press | wnd_dir | wnd_spd | snow | rain | current_pollution | day_of_week | month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 24 | 2010-01-03 00:00:00 | 90.0 | -7 | -6.0 | 1027.0 | SE | 58.56 | 4 | 0 | 129.0 | 6 | 1 |
| 25 | 2010-01-03 01:00:00 | 63.0 | -8 | -6.0 | 1026.0 | SE | 61.69 | 5 | 0 | 148.0 | 6 | 1 |
| 26 | 2010-01-03 02:00:00 | 65.0 | -8 | -7.0 | 1026.0 | SE | 65.71 | 6 | 0 | 159.0 | 6 | 1 |
| 27 | 2010-01-03 03:00:00 | 55.0 | -8 | -7.0 | 1025.0 | SE | 68.84 | 7 | 0 | 181.0 | 6 | 1 |
| 28 | 2010-01-03 04:00:00 | 65.0 | -8 | -7.0 | 1024.0 | SE | 72.86 | 8 | 0 | 138.0 | 6 | 1 |


In [37]:
# Steps 4 and 5: One-hot encode 'wnd_dir' and scale the numerical features
# The target 'pred_pollution' is separated out and left unscaled

# Define columns to scale and encode
columns_to_scale = ['current_pollution', 'dew', 'temp', 'press', 'wnd_spd', 'snow', 'rain', 'day_of_week', 'month']
columns_to_encode = ['wnd_dir']

# Separate the target variable and features
features = dataset.drop(columns=['pred_pollution'])
target = dataset['pred_pollution'].values

# Define transformer for scaling and encoding
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), columns_to_scale),
        ('cat', OneHotEncoder(), columns_to_encode)
    ])

# Apply transformations to the features
features_prepared = preprocessor.fit_transform(features)
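
The cell above fits `StandardScaler` on the whole dataset at once, which is a batch operation. In a streaming setting the mean and variance have to be updated one sample at a time. The sketch below illustrates that idea with Welford's algorithm; `RunningStandardizer` is a hypothetical hand-written class, not the River or scikit-learn implementation, and the temperature values are sample inputs.

```python
import math

class RunningStandardizer:
    """Standardize one feature using a running mean and variance (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def learn_one(self, x):
        """Update the running statistics with a single observation."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def transform_one(self, x):
        """Scale x with the statistics seen so far; return 0.0 if variance is undefined or zero."""
        if self.n < 2 or self.m2 == 0:
            return 0.0
        std = math.sqrt(self.m2 / self.n)  # population std, as in sklearn's StandardScaler
        return (x - self.mean) / std

scaler = RunningStandardizer()
for temp in [-4.0, -4.0, -5.0, -5.0, -5.0]:
    scaler.learn_one(temp)
z = scaler.transform_one(-4.0)
print(scaler.mean, z)
```

River provides its own incremental `preprocessing.StandardScaler`, which would normally be preferred over a hand-rolled class inside a River pipeline.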

## 4. Concept Drifts (Max. 1 point)
- **Implemented Detectors:** Use at least two concept drift detectors from River to monitor drifts in pollution levels.
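
River ships drift detectors such as `drift.ADWIN` and `drift.PageHinkley`. To show what such a detector does internally, here is a simplified, hand-written Page-Hinkley test for an upward shift in the mean. The class name, default parameters, and synthetic stream are assumptions for illustration; this is not River's implementation.

```python
class PageHinkleySketch:
    """Simplified Page-Hinkley test that flags an upward shift in the mean."""

    def __init__(self, delta=0.005, threshold=50.0, min_instances=30):
        self.delta = delta              # tolerated magnitude of change
        self.threshold = threshold      # alarm level for the test statistic
        self.min_instances = min_instances
        self.n = 0
        self.mean = 0.0
        self.cum_sum = 0.0              # cumulative deviation from the running mean
        self.min_cum_sum = 0.0          # smallest cumulative deviation seen so far

    def update(self, x):
        """Feed one value; return True when a drift is flagged."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum_sum += x - self.mean - self.delta
        self.min_cum_sum = min(self.min_cum_sum, self.cum_sum)
        if self.n < self.min_instances:
            return False
        return self.cum_sum - self.min_cum_sum > self.threshold

# Synthetic stream: the level jumps from 50 to 150 after 100 observations
stream = [50.0] * 100 + [150.0] * 100
detector = PageHinkleySketch()
first_drift = next((i for i, x in enumerate(stream) if detector.update(x)), None)
print(first_drift)
```

On this synthetic stream the detector raises its first alarm at the first shifted value. Real pollution data is far noisier, so `delta` and `threshold` would need tuning to balance detection delay against false alarms.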


## 5. Batch Learning with Base Model (Max. 1 point)
- **Data Splitting:** Ensure temporal integrity when splitting the dataset.
- **Model Training and Evaluation:** Train a base model, evaluate its performance, and establish a benchmark for stream learning models.
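
Preserving temporal integrity means the test set must lie strictly after the training set, with no shuffling. A minimal sketch, assuming the rows are already sorted by time (`chronological_split` and the 80/20 ratio are illustrative choices):

```python
def chronological_split(rows, train_fraction=0.8):
    """Split time-ordered rows into train and test sets without shuffling."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

hours = list(range(100))  # stand-in for 100 time-ordered observations
train, test = chronological_split(hours)
print(len(train), len(test))
```

A random split would leak future readings into training, which tends to make the batch benchmark look better than it would be in deployment.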


## 6. Stream Learning (Max. 2 points)
- **Stream Pipeline Implementation:** Develop a River stream pipeline for data preprocessing, model training, and evaluation.
- **Model Selection/Comparison:** Compare at least three machine learning models within River, including a Hoeffding Tree model.
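
Stream models are commonly evaluated with a test-then-train loop (progressive validation): each sample is first used for prediction and only afterwards for learning. The sketch below follows the `predict_one` / `learn_one` convention used by River models, but `RunningMeanRegressor` and `progressive_validation` are hypothetical stand-ins written for illustration, and the toy stream is sample data.

```python
import math

class RunningMeanRegressor:
    """Hypothetical baseline: predicts the mean of all targets seen so far."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def predict_one(self, x):
        return self.mean

    def learn_one(self, x, y):
        self.n += 1
        self.mean += (y - self.mean) / self.n

def progressive_validation(stream, model):
    """Test-then-train: predict each sample before the model learns from it."""
    abs_err = sq_err = 0.0
    n = 0
    for x, y in stream:
        y_pred = model.predict_one(x)   # 1. predict with the current model
        abs_err += abs(y - y_pred)
        sq_err += (y - y_pred) ** 2
        n += 1
        model.learn_one(x, y)           # 2. then update the model
    return {"MAE": abs_err / n, "RMSE": math.sqrt(sq_err / n)}

toy_stream = [
    ({"temp": -4.0}, 129.0),
    ({"temp": -4.0}, 148.0),
    ({"temp": -5.0}, 159.0),
]
scores = progressive_validation(toy_stream, RunningMeanRegressor())
print(scores)
```

Running the same loop over each candidate model, for example a River `tree.HoeffdingTreeRegressor`, would give directly comparable MAE and RMSE scores.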


## 7. Results and Conclusions (Max. 2 points, together with the notebook presentation)
- **Oral presentation:** 2 points.