# Predictions of Particulate Matter (PM10)
### Authors: Ricky Yang, Serena Wang
### Date: 2024-05-28



# 1. Introduction and Data Pre-processing
**Requirements**

Make sure the whole dataset has the same temporal resolution (i.e. hourly measurements). Perform data exploration and identify missing data and outliers (values outside the expected range). For example, an air temperature of 40 °C in Auckland, relative humidity above 100%, and negative or unexplained high concentrations are outliers.
* Introduce the problem being addressed in this assignment.
* Provide attribute-specific information about outliers and missing data. How can these affect dataset quality?
* Based on this analysis, decide on and justify your approach to data cleaning. Once your dataset is cleaned, move to the next step for feature selection.
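
The checks described above can be sketched with pandas. This is a minimal illustration, not the cleaning procedure used in this report: the column names (`air_temp`, `rel_humidity`, `pm10`), the limits in `VALID_RANGES`, and the sample values are all assumptions made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical plausibility limits (inclusive). These are assumptions for
# illustration only; set them from instrument specs and local climate records.
VALID_RANGES = {
    "air_temp": (-5.0, 35.0),      # °C
    "rel_humidity": (0.0, 100.0),  # %
    "pm10": (0.0, 500.0),          # µg/m³
}


def to_hourly(df):
    """Put every column on one hourly index; hours without a reading become NaN."""
    return df.sort_index().resample(pd.Timedelta(hours=1)).mean()


def mask_outliers(df, valid_ranges):
    """Replace out-of-range values with NaN and count issues per column."""
    cleaned = df.copy()
    report = {}
    for column, (low, high) in valid_ranges.items():
        if column not in cleaned.columns:
            continue
        out_of_range = (cleaned[column] < low) | (cleaned[column] > high)
        report[column] = {
            "missing": int(cleaned[column].isna().sum()),
            "outliers": int(out_of_range.sum()),
        }
        cleaned.loc[out_of_range, column] = np.nan
    return cleaned, report


# Made-up sample: the 02:00 reading is absent and three values are implausible.
index = pd.to_datetime([
    "2019-05-01 00:00", "2019-05-01 01:00",
    "2019-05-01 03:00", "2019-05-01 04:00",
])
raw = pd.DataFrame(
    {
        "air_temp": [14.2, 40.0, 13.8, 13.5],
        "rel_humidity": [81.0, 79.0, 104.0, 80.0],
        "pm10": [12.0, -3.0, 15.0, 11.0],
    },
    index=index,
)

hourly = to_hourly(raw)
cleaned, report = mask_outliers(hourly, VALID_RANGES)
print(report)
```

Masking outliers as NaN, rather than dropping rows, keeps the hourly index intact, so the later choice between interpolation and removal can be made for missing and implausible values together.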

## 1.1 Introduction

The objective of this project is to build predictive models for PM10 concentrations using various machine learning techniques, including regression models, Multilayer Perceptron (MLP), and Long Short-Term Memory (LSTM) networks. Accurate predictions of PM10 concentrations can help in implementing timely interventions to mitigate the adverse health effects associated with air pollution.


**Problem Statement**

The problem addressed in this assignment is to accurately predict PM10 concentrations using machine learning models. PM10, or particulate matter with a diameter of 10 micrometers or less, poses significant health risks, particularly affecting respiratory and cardiovascular systems. By predicting PM10 levels, we can provide timely warnings and implement strategies to mitigate air pollution's adverse health effects. Using data from the Penrose Station in Auckland, which includes various environmental and meteorological factors, this assignment aims to develop regression, Multilayer Perceptron (MLP), and Long Short-Term Memory (LSTM) models to forecast PM10 concentrations and contribute to better air quality management.


**Data Source**

The dataset for this analysis is sourced from the **Penrose Station (ID:7)** in Auckland, covering the period from **May 1, 2019, to April 30, 2024**. It includes hourly measurements of various environmental and meteorological factors, which are potential predictors of PM10 concentrations. The variables included in the dataset are:

- Air Temperature (24-Hour Aggregate, °C)
- Relative Humidity (24-Hour Aggregate, %)
- Wind Speed (24-Hour Aggregate, m/s)
- Wind Direction (Hourly Aggregate, °)
- NO (24-Hour Aggregate, µg/m³)
- NO2 (24-Hour Aggregate, µg/m³)
- NOx (24-Hour Aggregate, µg/m³)
- SO2 (24-Hour Aggregate, µg/m³)
- PM10 (24-Hour Aggregate, µg/m³)

**Objective**

By analyzing and modeling these variables, we aim to develop robust predictive models that can provide accurate forecasts of PM10 concentrations, thereby contributing to better air quality management and public health protection.

**Workflow**

We will follow a systematic approach starting from data import and cleaning, followed by exploratory data analysis, feature selection, and model development. Each section will detail the steps and methodologies used, along with the corresponding results and interpretations.





## 1.2 Outliers and Missing Data


## 1.3 Data Cleaning Approach and Justification


# 2. Data Exploration and Feature Selection
**Requirements**

Choose the five attributes of your dataset that have the highest correlation with PM10 concentration, using Pearson correlation or any other feature selection method of your choice, with justification.
* Provide the correlation plot (or results of any other feature selection method of your choice) and elaborate on the rationale for your selection.
* Describe your chosen attributes and their influence on PM concentration.
* Provide a graphical visualisation of the variation in PM concentration.
* Provide summary statistics of the PM concentration.
* Provide summary statistics, in tabular format, of the chosen predictors that have the highest correlation.
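
One way to rank predictors by Pearson correlation and tabulate their summary statistics is sketched below. The column names and the synthetic data are assumptions for illustration only and say nothing about the Penrose measurements; `top_correlated` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def top_correlated(df, target="PM10", n=5):
    """Return the n predictors with the largest absolute Pearson r against target."""
    corr = df.corr(method="pearson")[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index[:n]
    return corr.loc[order]


# Synthetic stand-in data (not real measurements): PM10 is built to depend
# strongly on NOx and weakly, with opposite sign, on WindSpeed.
rng = np.random.default_rng(0)
n_rows = 500
demo = pd.DataFrame({
    name: rng.normal(size=n_rows)
    for name in ["NOx", "WindSpeed", "AirTemp", "RelHumidity", "SO2", "WindDir"]
})
demo["PM10"] = (
    3.0 * demo["NOx"]
    - 1.0 * demo["WindSpeed"]
    + rng.normal(scale=0.5, size=n_rows)
)

ranking = top_correlated(demo, target="PM10", n=5)
print(ranking)

# Summary statistics for the target and the selected predictors, one row each.
summary = demo[["PM10", *ranking.index]].describe().T
print(summary)
```

Ranking by absolute value keeps strongly negative predictors (such as wind speed in this synthetic case) in the selection, while the signed coefficient is still reported so the direction of the relationship can be discussed.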


## 2.1 Correlation Analysis and Feature Selection


### Result
The dataset information ....
### Analysis and Explanation
Ensure ...


# 3. Experimental Methods
**Requirements**

Use 70% of the data for training and the rest for testing the MLP and LSTM models. Use a Workflow diagram to illustrate the process of predicting PM concentrations using the MLP and LSTM models.
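
A 70/30 split can be sketched as follows. The sketch assumes the hourly records should stay in time order, with the earliest 70% used for training; the requirement only fixes the ratio, so the chronological cut is an assumption. The helper names are hypothetical.

```python
import numpy as np


def chronological_split(X, y, train_fraction=0.7):
    """Split without shuffling: the first part trains, the remainder tests."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of rows")
    cut = int(round(len(X) * train_fraction))
    return X[:cut], X[cut:], y[:cut], y[cut:]


def standardise(train, test):
    """Scale both sets using statistics computed from the training set only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # avoid dividing by zero for constant columns
    return (train - mean) / std, (test - mean) / std


# Placeholder arrays standing in for the selected predictors and PM10.
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10, dtype=float)

X_train, X_test, y_train, y_test = chronological_split(X, y)
X_train_scaled, X_test_scaled = standardise(X_train, X_test)
print(X_train.shape, X_test.shape)
```

Fitting the scaler on the training portion alone keeps information from the test period out of the training inputs.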


# 4. Multilayer Perceptron (MLP)
**Requirements**

1) In your own words, describe multilayer perceptron (MLP). You may use one diagram in your explanation (one page).

2) Use sklearn.neural_network.MLPRegressor with a single hidden layer of k = 25 neurons and default values for all other parameters. Experimentally determine the learning rate that gives the highest performance on the testing dataset. Use this as a baseline for comparison in later parts of this question.

3) Experiment with two hidden layers and experimentally determine the split of the number of neurons across the two layers that gives the highest accuracy. In part 2, we had all k neurons in a single layer; in this part we will transfer neurons from the first hidden layer to the second iteratively in steps of 1. Thus, for example, in the first iteration the first hidden layer will have k-1 neurons whilst the second layer will have 1, in the second iteration k-2 neurons will be in the first layer with 2 in the second, and so on.

4) From the results in part 3 of this question, you will observe a variation in the obtained performance metrics with the split of neurons across the two layers. Give explanations for some possible reasons for this variation and state which architecture gives the best performance.
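
Parts 2 and 3 above can be sketched as a learning-rate sweep followed by a sweep over the neuron splits. This is a sketch under stated assumptions: the data is synthetic, the candidate learning rates are arbitrary, `random_state` is fixed only for reproducibility, and the lowest test RMSE is taken to mean "highest performance". `layer_splits` and `holdout_rmse` are hypothetical helpers.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.metrics import mean_squared_error
from sklearn.neural_network import MLPRegressor

K = 25  # total number of hidden neurons
LEARNING_RATES = (0.0001, 0.001, 0.01, 0.1)  # assumed candidate values


def layer_splits(k=K):
    """(k-1, 1), (k-2, 2), ..., (1, k-1): move one neuron at a time to layer two."""
    return [(k - i, i) for i in range(1, k)]


def holdout_rmse(hidden_layer_sizes, learning_rate_init,
                 X_train, y_train, X_test, y_test, seed=0):
    """Fit an otherwise-default MLPRegressor and return its RMSE on the test set."""
    model = MLPRegressor(
        hidden_layer_sizes=hidden_layer_sizes,
        learning_rate_init=learning_rate_init,
        random_state=seed,
    )
    with warnings.catch_warnings():
        # The default max_iter may stop before convergence; silence that warning.
        warnings.simplefilter("ignore", category=ConvergenceWarning)
        model.fit(X_train, y_train)
    return float(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))


# Synthetic stand-in data, split 70/30 in order.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 5))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=150)
X_train, X_test, y_train, y_test = X[:105], X[105:], y[:105], y[105:]

# Part 2: single hidden layer with K neurons, sweep the learning rate.
lr_scores = {
    lr: holdout_rmse((K,), lr, X_train, y_train, X_test, y_test)
    for lr in LEARNING_RATES
}
best_lr = min(lr_scores, key=lr_scores.get)

# Part 3: two hidden layers, sweep the split at the chosen learning rate.
split_scores = {
    split: holdout_rmse(split, best_lr, X_train, y_train, X_test, y_test)
    for split in layer_splits()
}
best_split = min(split_scores, key=split_scores.get)
print(best_lr, best_split)
```

Holding the learning rate fixed during the split sweep keeps the comparison against the single-layer baseline like for like; repeating the sweep over several seeds would show how much of the variation is due to weight initialisation.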


# 5. Long Short-Term Memory (LSTM)
**Requirements**

1) Describe the LSTM architecture, including the gates and state functions. How does LSTM differ from MLP? Discuss how the number of neurons and the batch size affect the performance of the network.

2) To create the LSTM model and determine the optimal architecture, apply Adaptive Moment Estimation (ADAM) to train the networks. Identify an appropriate cost function to measure model performance based on training samples and the related prediction outputs. To find the best epoch, based on your cost function results, complete up to 30 runs keeping the learning rate and the batch size constant (e.g. at 0.01 and 4 respectively).
Provide a line plot of the test and train cost function scores for each epoch. Report the summary statistics (Mean, Standard Deviation, Minimum and Maximum) of the cost function as well as the run time for each epoch. Choose the best epoch with justification.

3) Investigate the impact of varying the batch size: complete 30 runs keeping the learning rate constant at 0.01 and using the best number of epochs obtained in step 2. Report the summary statistics (Mean, Standard Deviation, Minimum and Maximum) of the cost function as well as the run time for each batch size. Choose the best batch size with justification.

4) Investigate the impact of varying the number of neurons in the hidden layer while keeping the epoch (step 2) and batch size (step 3) constant for 30 runs. Report the summary statistics (Mean, Standard Deviation, Minimum and Maximum) of the cost function as well as the run time. Discuss how the number of neurons affects performance and what the optimal number of neurons is in your experiment.
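
Two pieces of the steps above can be sketched in NumPy: a single LSTM time step, showing how the gates update the cell and hidden states, and a helper that summarises a cost function across repeated runs. Both are illustrative assumptions rather than the implementation used in this report; the stacked gate order (forget, input, output, candidate) is an arbitrary convention, and the numbers passed to `summarise_costs` are placeholders, not results.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    Shapes: W is (4H, D), U is (4H, H), b is (4H,), where D is the input size
    and H the number of hidden units. Rows are stacked as f, i, o, g.
    """
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    f = sigmoid(z[0:H])            # forget gate: how much old cell state to keep
    i = sigmoid(z[H:2 * H])        # input gate: how much new candidate to admit
    o = sigmoid(z[2 * H:3 * H])    # output gate: how much cell state to expose
    g = np.tanh(z[3 * H:4 * H])    # candidate cell values
    c_t = f * c_prev + i * g       # cell state update
    h_t = o * np.tanh(c_t)         # hidden state (the layer's output)
    return h_t, c_t


def summarise_costs(costs):
    """Summarise costs of shape (n_runs, n_settings), one column per setting."""
    costs = np.asarray(costs, dtype=float)
    return {
        "mean": costs.mean(axis=0),
        "std": costs.std(axis=0, ddof=1),  # sample standard deviation over runs
        "min": costs.min(axis=0),
        "max": costs.max(axis=0),
    }


# Run five time steps through a randomly initialised cell (D = 3, H = 4).
rng = np.random.default_rng(0)
D, H = 3, 4
W = rng.normal(scale=0.1, size=(4 * H, D))
U = rng.normal(scale=0.1, size=(4 * H, H))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(5, D)):
    h, c = lstm_step(x_t, h, c, W, U, b)

# Placeholder costs: 2 runs (rows) by 3 settings (columns), e.g. epochs.
stats = summarise_costs([[4.0, 2.0, 3.0], [6.0, 2.0, 5.0]])
best_setting = int(np.argmin(stats["mean"]))
print(stats["mean"], best_setting)
```

The same summary helper applies whether the columns index epochs, batch sizes, or neuron counts; choosing by mean cost alone ignores spread, so the standard deviation and run time are worth weighing alongside it.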


# 6. Model Comparison
**Requirements**

1) Plot model-specific actual and predicted PM to visually compare the model performance. What is your observation?

2) Compare the performance of both MLP and LSTM using RMSE. Which model performed better? Justify your finding.
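
The RMSE comparison can be sketched as below. The arrays are placeholder values chosen for easy arithmetic, not model outputs, and the labels `model_a` and `model_b` are hypothetical stand-ins for the fitted MLP and LSTM predictions.

```python
import numpy as np


def rmse(actual, predicted):
    """Root mean squared error between two equally shaped arrays."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    if actual.shape != predicted.shape:
        raise ValueError("actual and predicted must have the same shape")
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


def compare_models(actual, predictions):
    """Return RMSE per model and the name of the model with the lowest RMSE."""
    scores = {name: rmse(actual, pred) for name, pred in predictions.items()}
    best = min(scores, key=scores.get)
    return scores, best


# Placeholder numbers only; substitute the test-set PM10 and model predictions.
actual = [10.0, 12.0, 15.0, 11.0]
predictions = {
    "model_a": [11.0, 12.0, 14.0, 13.0],
    "model_b": [10.0, 13.0, 15.0, 12.0],
}

scores, best = compare_models(actual, predictions)
print(scores, best)
```

Because RMSE is in the same units as PM10 (µg/m³), the scores can be read directly as a typical prediction error, provided both models are evaluated on the same test period.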


# 7. Report Presentation
Summarize the findings of the analysis and model comparison. Discuss the implications and potential next steps for further improvement.


# 8. References
List any references or sources used.