### Project Statement: Forecasting Air Pollution in Quito Using Time Series Analysis and Machine Learning

**Project Overview:**

This project aims to forecast PM2.5 air pollution levels in Quito over the next 5-10 years using time series analysis and machine learning models. The dataset contains daily measurements of meteorological and atmospheric chemistry variables collected from 2004 to 2017 at the following meteorological stations across Quito:

- BELISARIO
- CARAPUNGO
- CENTRO
- COTOCOLLAO
- EL CAMAL
- GUAMANI
- JIPIJAPA
- LOS CHILLOS
- SAN ANTONIO
- TUMBACO

**Data Variables:**

The dataset includes the following variables:
- CO (Carbon Monoxide)
- Humidity
- Precipitation
- NO2 (Nitrogen Dioxide)
- O3 (Ozone)
- Solar Radiation
- PM2.5 (Particulate Matter < 2.5 µm)
- PM10 (Particulate Matter < 10 µm)
- SO2 (Sulfur Dioxide)
- Temperature
- Wind Velocity
- Wind Direction

**Project Goals:**

1. **Data Preprocessing and Tidying:**
   - Handle missing data through appropriate imputation techniques for time series data.
   - Perform extensive data cleaning to ensure consistency and accuracy, considering that the data comes from multiple files and sources.
   - Consolidate and standardize the data from different meteorological stations into a unified dataset.
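
The preprocessing steps above can be sketched with pandas. This is a minimal illustration under stated assumptions, not a prescribed pipeline: the column names `date` and `PM2.5`, the helper name `tidy_station`, and the three-day gap limit are all hypothetical, and the toy values are made up for the example. It puts each station on a complete daily calendar, fills short gaps by time-weighted linear interpolation, and stacks the stations into one long table.

```python
import numpy as np
import pandas as pd


def tidy_station(df, station, value_col="PM2.5", max_gap_days=3):
    """Return one station's daily series on a complete calendar, short gaps filled."""
    series = (
        df.assign(date=pd.to_datetime(df["date"]))
        .set_index("date")[value_col]
        .sort_index()
    )
    # Reindex to every calendar day so missing days become explicit NaNs.
    calendar = pd.date_range(series.index.min(), series.index.max(), freq="D")
    series = series.reindex(calendar)
    # Time-weighted linear interpolation. At most `max_gap_days` consecutive
    # NaNs are filled in each gap (longer gaps are only partially filled),
    # and leading or trailing NaNs are left untouched.
    series = series.interpolate(method="time", limit=max_gap_days, limit_area="inside")
    out = series.rename(value_col).rename_axis("date").reset_index()
    out.insert(0, "station", station)
    return out


# Toy input standing in for the per-station files (values are invented).
raw = {
    "CENTRO": pd.DataFrame(
        {"date": ["2004-01-01", "2004-01-02", "2004-01-04"],
         "PM2.5": [10.0, np.nan, 16.0]}
    ),
    "TUMBACO": pd.DataFrame(
        {"date": ["2004-01-01", "2004-01-02"], "PM2.5": [8.0, 9.0]}
    ),
}

# One unified long-format table: one row per (station, date).
unified = pd.concat(
    [tidy_station(df, name) for name, df in raw.items()], ignore_index=True
)
```

Linear interpolation is only one option; long outages may be better handled by a dedicated imputation model, or left missing and flagged.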

2. **Exploratory Data Analysis (EDA):**
   - Visualize the data to understand temporal trends and seasonal patterns.
   - Perform dimensionality reduction to visualize relationships between different variables.

3. **Feature Engineering:**
   - Create new features from temporal data, such as lag features, rolling statistics, and time-based features (e.g., day of the week, month, year).
   - Research and integrate static data to develop a hybrid model that includes both temporal and static features for improved forecasting accuracy.
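
As a rough sketch of the temporal features listed above, the following pandas function adds lag, rolling-mean, and calendar features per station. The function name, the long-format layout (`station`, `date`, `PM2.5` columns), and the specific lags and window are assumptions chosen for illustration. The rolling window is shifted by one day so that each row only sees past values.

```python
import pandas as pd


def add_time_series_features(df, target="PM2.5", lags=(1, 7), window=7):
    """Add lag, rolling-mean, and calendar features, computed within each station."""
    out = df.sort_values(["station", "date"]).reset_index(drop=True)
    grouped = out.groupby("station")[target]

    # Lag features: the target value observed `lag` days earlier.
    for lag in lags:
        out[f"{target}_lag{lag}"] = grouped.shift(lag)

    # Rolling mean over the previous `window` days. Shifting first keeps the
    # current day's value out of its own feature (avoids target leakage).
    out[f"{target}_rollmean{window}"] = grouped.transform(
        lambda s: s.shift(1).rolling(window).mean()
    )

    # Calendar features derived from the timestamp.
    out["day_of_week"] = out["date"].dt.dayofweek  # Monday = 0
    out["month"] = out["date"].dt.month
    out["year"] = out["date"].dt.year
    return out


# Toy data: ten consecutive days for a single station (values are invented).
demo = pd.DataFrame(
    {
        "station": ["CENTRO"] * 10,
        "date": pd.date_range("2004-01-01", periods=10, freq="D"),
        "PM2.5": [float(v) for v in range(10)],
    }
)
features = add_time_series_features(demo, lags=(1, 7), window=3)
```

Rows at the start of each station's history contain NaN in the lag and rolling columns; they are usually dropped before model fitting.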

4. **Model Building and Testing:**
   - Build and test statistical and machine learning models, including but not limited to ARIMA, LSTM, Random Forest (RF), Gradient Boosting Machines (GBM), and other suitable time series forecasting models.
   - Evaluate models using appropriate time series metrics, such as Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Mean Absolute Percentage Error (MAPE).
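
For reference, the four metrics named above can be computed as follows. This is a small standard-library sketch; the function name `forecast_metrics` is hypothetical, and skipping zero-valued observations in MAPE is one possible convention, assumed here because MAPE is undefined when the actual value is zero.

```python
import math


def forecast_metrics(actual, predicted):
    """Compute MAE, MSE, RMSE, and MAPE (in percent) for two equal-length sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("actual and predicted must be non-empty and equal in length")

    errors = [a - p for a, p in zip(actual, predicted)]
    n = len(errors)

    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)

    # MAPE divides by the actual value, so observations equal to zero are skipped.
    pct_errors = [abs(e / a) for e, a in zip(errors, actual) if a != 0]
    mape = 100 * sum(pct_errors) / len(pct_errors) if pct_errors else float("nan")

    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "MAPE": mape}


# Toy example with invented values.
scores = forecast_metrics([10.0, 20.0, 30.0], [12.0, 18.0, 33.0])
```

MAE and RMSE are in the same units as PM2.5, while MAPE is unitless, which makes it easier to compare across stations with different pollution levels.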

5. **Model Evaluation and Comparison:**
   - Assess the models based on their forecasting accuracy using the aforementioned time series metrics.
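   - Perform time-ordered cross-validation (for example, rolling-origin or expanding-window splits) to ensure the robustness of the models; randomly shuffled folds would leak future observations into the training data.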
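
Because the observations are ordered in time, each validation fold should come strictly after the data used for training. One common scheme is an expanding-window (rolling-origin) split, sketched below in plain Python. The function name and parameters are hypothetical, and this is only one of several reasonable splitting schemes.

```python
def expanding_window_splits(n_samples, n_splits, test_size):
    """Yield (train_indices, test_indices) pairs where training always precedes testing."""
    first_train_end = n_samples - n_splits * test_size
    if first_train_end <= 0:
        raise ValueError("not enough samples for the requested splits")

    for k in range(n_splits):
        # The training window grows by one test block with every fold.
        train_end = first_train_end + k * test_size
        train_idx = list(range(train_end))
        test_idx = list(range(train_end, train_end + test_size))
        yield train_idx, test_idx


# Ten time-ordered observations, three folds of two test points each.
folds = list(expanding_window_splits(n_samples=10, n_splits=3, test_size=2))
```

With these arguments the folds train on indices 0-3, 0-5, and 0-7, and test on 4-5, 6-7, and 8-9 respectively. Metrics are then averaged across folds.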

6. **Forecasting:**
   - Use the best-performing model to forecast PM2.5 levels for the next 5-10 years for each meteorological station.
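
Models that rely on lag features predict one step at a time, so a multi-year horizon is often produced recursively: each prediction is appended to the history and used as input for the next step. The sketch below shows that loop in plain Python. Both `recursive_forecast` and the toy `mean_of_last_two` model are hypothetical stand-ins for whichever fitted model is selected.

```python
def recursive_forecast(history, predict_next, horizon):
    """Roll a one-step-ahead model forward `horizon` steps, feeding predictions back in."""
    window = list(history)  # copy so the caller's data is not modified
    forecasts = []
    for _ in range(horizon):
        y_hat = predict_next(window)
        forecasts.append(y_hat)
        # The prediction becomes part of the history for the next step.
        window.append(y_hat)
    return forecasts


def mean_of_last_two(values):
    """Toy one-step model: average of the two most recent values."""
    return (values[-1] + values[-2]) / 2


# Toy history with invented values, forecast three steps ahead.
preds = recursive_forecast([10.0, 20.0], mean_of_last_two, horizon=3)
```

Errors compound under this scheme, so the uncertainty of a daily forecast 5-10 years ahead grows with the horizon and should be reported alongside the point forecasts.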

7. **Reporting:**
   - Document the entire process in a comprehensive report that includes:
     - Introduction
     - Exploratory Data Analysis (EDA)
     - Feature Engineering
     - Model Building and Testing
     - Model Evaluation and Comparison
     - Results and Forecasting
     - Conclusions and Recommendations

**Additional Points:**

- Extra points will be awarded for incorporating static data into the model to create a hybrid model that leverages both temporal and static features.
- The report should also discuss the implications of the findings in the context of modeling air pollution and its impact on public health and policy-making.

**Expected Outcomes:**

By the end of this project, we aim to develop robust models that can accurately forecast PM2.5 levels in Quito, providing valuable insights for policymakers and stakeholders to make informed decisions regarding air quality management and public health interventions. The comprehensive report will detail the methodology, findings, and implications, contributing to the broader understanding of air pollution dynamics in urban environments.

**Recommended Libraries:**

- SAITS: https://github.com/WenjieDu/SAITS
- pmdarima `auto_arima`: https://alkaline-ml.com/pmdarima/modules/generated/pmdarima.arima.auto_arima.html
- MLForecast: https://nixtlaverse.nixtla.io/mlforecast/index.html
- Example of MLForecast usage: https://colab.research.google.com/drive/1xzM8s8r-O0tkd_MZCH_M_An8juqMLP9C?usp=sharing