# Task 3: Air Pollution Forecasting — Project Overview

This project builds and evaluates a recurrent neural network (RNN) on hourly air pollution (PM2.5) time series recorded in Beijing from March 2013 to February 2017.  
The goal is to forecast PM2.5 concentration from the meteorological and co-pollutant measurements of the Wanshouxigong monitoring station.

Project steps include:

- Loading, cleaning, and exploring the CSV dataset
- Preparing data for time series analysis (sequential split, standardization, TimeseriesGenerator)
- Building and training an LSTM (Long Short-Term Memory) model
- Evaluating prediction quality using regression metrics (MSE, R²)
- Visualizing results and interpreting the model
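
The data preparation step listed above (sequential split, standardization, windowing) can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the project's actual pipeline: the helper names `sequential_split`, `standardize`, and `make_windows`, the 80/20 split, and the 24-step window are all assumptions made for the example, and plain NumPy stands in for Keras' `TimeseriesGenerator`, which pairs each window with the target value that immediately follows it.

```python
import numpy as np

def sequential_split(values, train_fraction=0.8):
    """Split a time-ordered array into train/test parts without shuffling."""
    split_index = int(len(values) * train_fraction)
    return values[:split_index], values[split_index:]

def standardize(train, test):
    """Scale both parts with the mean and std computed on the training part only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # guard against constant columns
    return (train - mean) / std, (test - mean) / std

def make_windows(features, target, window_length):
    """Pair each run of `window_length` consecutive rows with the target that follows it."""
    X = np.stack([features[i:i + window_length]
                  for i in range(len(features) - window_length)])
    y = target[window_length:]
    return X, y

# Synthetic stand-in for the hourly data: 100 time steps, 3 features (assumption).
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 3))

train, test = sequential_split(data, train_fraction=0.8)
train_scaled, test_scaled = standardize(train, test)

# Use the first column as a stand-in target and 24 past steps per sample.
X_train, y_train = make_windows(train_scaled, train_scaled[:, 0], window_length=24)
print(X_train.shape, y_train.shape)  # (56, 24, 3) (56,)
```

Fitting the scaling statistics on the training part only keeps information from the test period out of the model inputs, which is the reason the split comes before standardization.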


In [1]:
## Import required libraries

In [2]:
import numpy as np
import pandas as pd

In [3]:
# Load data from CSV, dropping columns 'No', 'wd', 'station'
file_path = 'PRSA_Data_Wanshouxigong_20130301-20170228.csv'
columns_to_drop = ['No', 'wd', 'station']
try:
    df = pd.read_csv(file_path)
    print("File loaded successfully.")
    df.drop(columns=columns_to_drop, inplace=True)
    print("Columns 'No', 'wd', 'station' have been dropped.\n")
    print("First 5 rows of the original data:")
    print(df.head())
except FileNotFoundError:
    print(f"File {file_path} not found.")
    df = None

# Combine 'year', 'month', 'day', 'hour' into a single datetime column
if df is not None:
    df['Date'] = pd.to_datetime(df[['year', 'month', 'day', 'hour']])
    print("\nColumn 'Date' has been created.")

    # Drop the original 'year', 'month', 'day', 'hour' columns
    df.drop(columns=['year', 'month', 'day', 'hour'], inplace=True)
    
    # Set 'Date' as the DataFrame index
    df.set_index('Date', inplace=True)
    print("Set 'Date' as DataFrame index.\n")
    print("Transformed data structure:")
    df.info()  # info() prints its summary itself and returns None, so no print() wrapper

    # Show first 5 rows of transformed data
    print("\nFirst 5 rows of transformed data:")
    print(df.head())


File loaded successfully.
Columns 'No', 'wd', 'station' have been dropped.

First 5 rows of the original data:
   year  month  day  hour  PM2.5  PM10  SO2   NO2     CO    O3  TEMP    PRES  \
0  2013      3    1     0    9.0   9.0  6.0  17.0  200.0  62.0   0.3  1021.9   
1  2013      3    1     1   11.0  11.0  7.0  14.0  200.0  66.0  -0.1  1022.4   
2  2013      3    1     2    8.0   8.0  NaN  16.0  200.0  59.0  -0.6  1022.6   
3  2013      3    1     3    8.0   8.0  3.0  16.0    NaN   NaN  -0.7  1023.5   
4  2013      3    1     4    8.0   8.0  3.0   NaN  300.0  36.0  -0.9  1024.1   

   DEWP  RAIN  WSPM  
0 -19.0   0.0   2.0  
1 -19.3   0.0   4.4  
2 -19.7   0.0   4.7  
3 -20.9   0.0   2.6  
4 -21.7   0.0   2.5  

Column 'Date' has been created.
Set 'Date' as DataFrame index.

Transformed data structure:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 35064 entries, 2013-03-01 00:00:00 to 2017-02-28 23:00:00
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype  
--

### Interpretation of Remaining Columns

After preprocessing, our DataFrame `df` contains a time-based index and columns representing pollutant concentrations and meteorological conditions. These variables form the foundation for building a predictive model.

Below is a brief description of each variable.

#### Target Variable (Prediction Target)

* **`PM2.5`**: Concentration of fine particulate matter with a diameter of 2.5 microns or less (µg/m³). It is a key air quality indicator because these particles are harmful to health. **This is the value we aim to forecast.**

#### Explanatory Variables (Features)

The features can be divided into two logical groups: other pollutants (often correlated with PM2.5) and meteorological variables (which affect formation, dispersal, and removal of pollutants).

**1. Other Pollutants:**

* **`PM10`**: Concentration of particulate matter with a diameter of 10 microns or less (µg/m³). It includes the PM2.5 fraction, so the two are typically strongly correlated.
* **`SO2`**: Sulfur dioxide concentration (µg/m³).
* **`NO2`**: Nitrogen dioxide concentration (µg/m³).
* **`CO`**: Carbon monoxide concentration (µg/m³).
* **`O3`**: Ozone concentration (µg/m³).

Gases like `SO2` and `NO2` can lead to the formation of secondary particulate matter (including PM2.5) through atmospheric reactions.
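
One way to check how strongly the other pollutants track the target is to rank them by their Pearson correlation with `PM2.5`. The sketch below uses a small synthetic frame (the values and the `toy` name are illustrative assumptions, not measurements from the dataset); on the real data the same `corr()` call could be applied to `df`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: PM10 is built as PM2.5 plus a coarse fraction,
# while O3 is drawn independently of PM2.5 (assumptions for illustration).
rng = np.random.default_rng(42)
n = 500
pm25 = rng.gamma(shape=2.0, scale=40.0, size=n)
toy = pd.DataFrame({
    "PM2.5": pm25,
    "PM10": pm25 + rng.gamma(shape=2.0, scale=15.0, size=n),
    "O3": rng.normal(loc=60.0, scale=20.0, size=n),
})

# Pearson correlation of every feature with the target, strongest first.
correlations = toy.corr()["PM2.5"].drop("PM2.5").sort_values(ascending=False)
print(correlations)
```

`DataFrame.corr()` ignores missing values pairwise, so it can be run before the `NaN` entries are handled.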

**2. Meteorological Variables:**

* **`TEMP`**: Temperature (°C). Affects reaction rates and air mixing.
* **`PRES`**: Atmospheric pressure (hPa). High pressure is linked to stable conditions, often resulting in pollutant accumulation.
* **`DEWP`**: Dew point temperature (°C). Indicates air humidity; high humidity can promote condensation of pollutants and aerosol formation.
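* **`RAIN`**: Precipitation (mm). Rain cleanses the air by washing out particulates and pollutants.
* **`WSPM`**: Wind speed (m/s). Stronger wind disperses pollutants and lowers local concentrations.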

#### Observations & Next Steps

Inspection of the data structure (`df.info()`) shows that **missing values (`NaN`)** exist in several columns, including our target **`PM2.5`**. This is critical, as most machine learning algorithms cannot operate on incomplete data. In the next steps, we will need to handle this issue, either by removing the affected rows or by imputing the gaps, for example with interpolation. Note that removing rows breaks the regular hourly spacing of the series.
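
The two options mentioned above can be compared on a small example. The frame below is a hypothetical six-hour excerpt (the values and the `toy` name are assumptions made for illustration); it counts the gaps, then applies row removal and time-based linear interpolation. `limit_area="inside"` restricts filling to gaps that have valid values on both sides, so the trailing `NaN` is left untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly excerpt with gaps in both columns.
index = pd.date_range("2013-03-01 00:00", periods=6, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame(
    {
        "PM2.5": [9.0, 11.0, np.nan, np.nan, 8.0, np.nan],
        "SO2": [6.0, np.nan, 4.0, 3.0, 3.0, 3.0],
    },
    index=index,
)

# Count missing values per column.
missing_counts = toy.isna().sum()
print(missing_counts)

# Option 1: drop every row that contains at least one missing value.
dropped = toy.dropna()

# Option 2: linear interpolation weighted by the time index,
# filling only gaps that are surrounded by valid observations.
interpolated = toy.interpolate(method="time", limit_area="inside")
print(interpolated)
```

In this example row removal keeps only two of the six hours, while interpolation preserves the hourly grid and leaves a single unfilled value at the end of the series.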
