# 01_data_collection.ipynb

**Project:** ds-air-pollution-prediction  
**Authors:** Iris Winkler, Carlos Duque, Johannes Gooth  
**Date:** April 29, 2024

---

## 📘 Introduction

In this notebook, we will focus on the **data collection** phase of our analysis for the **ds-air-pollution-prediction** project. The primary objective of this phase is to acquire and organize all necessary datasets from the Zindi platform, ensuring that we have comprehensive and relevant data to build an effective predictive model for **PM2.5 particulate matter** concentration.

### Project Objective

The objective of this project is to predict the daily concentration of **PM2.5 particulate matter** in the air for several locations across Africa, using a combination of ground-based sensor data, weather information, and satellite observations. This challenge provides a real-world scenario to apply data science techniques for environmental protection and public health improvement.  
PM2.5 particles, with diameters less than 2.5 micrometers, are among the most harmful air pollutants, posing significant health risks. Accurate predictions of PM2.5 levels are essential for public health initiatives and environmental monitoring.

### Data Overview

The dataset for this project spans three months in 2020 and includes data from multiple cities across Africa. The data is sourced from three main providers:

1. **Ground-Based Air Quality Sensors:**
   - **Description:** These sensors provide measurements of PM2.5 concentrations, including daily mean, minimum, and maximum readings, variance, and the total number of sensor readings used to compute the target value.
   - **Availability:** Data is available only for the training set (`Train.csv`). The test set (`Test.csv`) requires predictions of the target variable.

2. **Global Forecast System (GFS) Weather Data:**
   - **Description:** This dataset includes meteorological variables such as humidity, temperature, and wind speed, which serve as important predictors for PM2.5 levels.
   - **Source:** [NOAA GFS Dataset](https://developers.google.com/earth-engine/datasets/catalog/NOAA_GFS0P25)

3. **Sentinel 5P Satellite Data:**
   - **Description:** Satellite data provides information on various atmospheric pollutants, including NO₂ and CH₄ concentrations. Key measurements include `NO2_column_number_density` and `tropospheric_NO2_column_number_density`, along with metadata like satellite altitude.
   - **Source:** [Sentinel 5P Dataset](https://developers.google.com/earth-engine/datasets/catalog/sentinel-5p)
   - **Note:** The dataset contains gaps, particularly in CH₄ data, which will need to be addressed during preprocessing.

### Data Files

The data for the Zindi challenge is organized into the following files:

- **Train.csv:**  
  Contains the target variable (daily mean PM2.5 concentration) and supporting data for 349 locations. This dataset will be used to train the predictive model.

- **Test.csv:**  
  Similar to Train.csv but without the target-related columns, covering 179 different locations. This dataset is used to generate predictions for the competition.

- **SampleSubmission.csv:**  
  Demonstrates the required submission format: a `Place_ID X Date` column and a `target` column for predictions. The order of the rows does not matter, but the `Place_ID X Date` combinations must match those in Test.csv.

**Note:** Since we are not actively participating in the Zindi Challenge for which these data files were originally created, we will focus solely on using `Train.csv` for further analysis and modeling.

**Why Only the Training Dataset?**

- **Comprehensive Information:** The training dataset contains both the target variable (daily mean PM2.5 concentration) and the supporting features required for model development. This allows us to perform exploratory data analysis, feature engineering, and model training effectively.
  
- **Exclusion of Test Data:** The test dataset (`Test.csv`) lacks the target variable, making it unsuitable for initial model training and validation. Since our project does not involve submitting predictions to the competition, we can allocate our resources to optimizing the model using the available training data.

### Data Collection Steps

1. **Accessing Zindi Platform:**
   - Navigate to the [Zindi Air Pollution Challenge](https://zindi.africa/competitions/urban-air-pollution-challenge/data) page.
   - Review the competition details, objectives, and evaluation metrics to understand the requirements.

2. **Downloading Dataset:**
   - **Train.csv:**  
     Download the training dataset, which includes historical PM2.5 measurements and related features.

3. **Renaming CSV File:**
   - **Train.csv:**  
     Rename the CSV file containing our project's dataset from `Train.csv` to `air_pollution_data.csv`. This renaming ensures clarity and prevents confusion during the subsequent process of splitting the data into training and testing sets for modeling purposes.

4. **Exploring Data Sources:**
   - **Ground-Based Sensors:**  
     Understand the structure and content of the sensor data, focusing on how PM2.5 measurements are recorded and aggregated.
     
   - **[GFS Weather Data](https://developers.google.com/earth-engine/datasets/catalog/NOAA_GFS0P25):**  
     Familiarize ourselves with the weather variables provided by the GFS dataset and how they can influence PM2.5 levels.
     
   - **[Sentinel 5P Satellite Data](https://developers.google.com/earth-engine/datasets/catalog/sentinel-5p):**  
     Explore the satellite measurements of atmospheric pollutants and assess the completeness and relevance of the data.

5. **Data Storage:**
   - Organize the downloaded file into a structured directory within the project repository for easy access and management.

6. **Initial Data Inspection:**
   - Once the data is collected, we will perform a quick inspection to verify its structure and ensure it meets our requirements.
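
Steps 3 and 5 above (renaming and storing the file) can be scripted instead of done by hand. The following is a minimal sketch, not part of the original workflow: the helper name `store_raw_file` is hypothetical, while the target name `air_pollution_data.csv` comes from this notebook.

```python
from pathlib import Path
import shutil


def store_raw_file(downloaded: Path, data_dir: Path,
                   new_name: str = "air_pollution_data.csv") -> Path:
    """Copy the downloaded CSV into data_dir under a project-specific name.

    `store_raw_file` is a hypothetical helper used for illustration only.
    """
    if not downloaded.is_file():
        raise FileNotFoundError(f"Expected a downloaded file at {downloaded}")
    # Create the data folder (and any missing parents) if it does not exist yet.
    data_dir.mkdir(parents=True, exist_ok=True)
    target = data_dir / new_name
    # Copy rather than move, so the original download stays untouched.
    shutil.copy2(downloaded, target)
    return target
```

Assuming the file was saved to a local downloads folder (an assumption about the setup), a call such as `store_raw_file(Path.home() / "Downloads" / "Train.csv", Path("../data"))` would produce the `../data/air_pollution_data.csv` file that is loaded later in this notebook.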

### Key Steps for Data Collection

1. **Download and Organize Data Files:**
   - Access the Zindi competition page and download the necessary datasets.
   - Store the datasets in designated folders within the project repository.

2. **Understand Data Structure and Content:**
   - Review the columns and data types in `Train.csv`, the only file used in this project (stored as `air_pollution_data.csv`).
   - Explore the metadata and key measurements provided by the satellite and weather data sources.

3. **Verify Data Integrity:**
   - Check for file completeness and integrity to ensure that all necessary data is available for the subsequent analysis phases.

4. **Document Data Sources and Attributes:**
   - Maintain clear documentation of the data sources, variable definitions, and any relevant metadata to facilitate data understanding and preprocessing.
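
The integrity check in step 3 can be made concrete with a few standard-library calls. The sketch below is one possible approach, not the authors' procedure: the function name `check_csv_integrity` is hypothetical, and it assumes a UTF-8 encoded, comma-separated file. It confirms that the file exists and is non-empty, that every row has as many fields as the header, and it records a SHA-256 checksum that can be noted in the data documentation from step 4.

```python
import csv
import hashlib
from pathlib import Path


def check_csv_integrity(path: Path) -> dict:
    """Return basic integrity facts about a CSV file; raise if it is unusable.

    `check_csv_integrity` is a hypothetical helper used for illustration only.
    """
    if not path.is_file() or path.stat().st_size == 0:
        raise ValueError(f"{path} is missing or empty")

    # The checksum lets us detect later whether the stored file has changed.
    sha256 = hashlib.sha256(path.read_bytes()).hexdigest()

    with path.open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        n_rows = 0
        for row in reader:
            n_rows += 1
            # A truncated download typically shows up as a row with too few fields.
            if len(row) != len(header):
                raise ValueError(
                    f"Line {reader.line_num}: expected {len(header)} fields, "
                    f"found {len(row)}"
                )
    return {"sha256": sha256, "n_columns": len(header), "n_rows": n_rows}
```

Running `check_csv_integrity(Path("../data/air_pollution_data.csv"))` before loading the file with pandas would surface an incomplete download early, rather than as an unexplained parsing problem later.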

### Expected Outcome

By the end of this notebook, we will have successfully **collected and organized all necessary datasets** required for the project. This foundational step ensures that we have access to comprehensive and high-quality data, setting the stage for effective data cleaning, preprocessing, and subsequent modeling efforts.

---

## ⚙️ Setting Up the Working Environment

In [1]:
import warnings
warnings.filterwarnings("ignore")

# Automatically reload edited modules so the kernel does not need to be restarted
%load_ext autoreload
%autoreload 2

import pandas as pd

---

## 📥 Loading the Data

In this project, our focus is solely on the training dataset (`Train.csv`). This decision is based on our objective to develop and validate a predictive model for **PM2.5 particulate matter** concentrations without participating in the Zindi competition's submission process.

In [2]:
# Load the data into a Pandas DataFrame
df = pd.read_csv('../data/air_pollution_data.csv')

---

## 🔍 Initial Data Inspection

In [3]:
df.head()

| | id | site_id | site_latitude | site_longitude | city | country | date | hour | sulphurdioxide_so2_column_number_density | sulphurdioxide_so2_column_number_density_amf | ... | cloud_cloud_top_height | cloud_cloud_base_pressure | cloud_cloud_base_height | cloud_cloud_optical_depth | cloud_surface_albedo | cloud_sensor_azimuth_angle | cloud_sensor_zenith_angle | cloud_solar_azimuth_angle | cloud_solar_zenith_angle | pm2_5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | id_vjcx08sz91 | 6531a46a89b3300013914a36 | 6.53257 | 3.39936 | Lagos | Nigeria | 2023-10-25 | 13 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 12.015 |
| 1 | id_bkg215syli | 6531a46a89b3300013914a36 | 6.53257 | 3.39936 | Lagos | Nigeria | 2023-11-02 | 12 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 42.2672 |
| 2 | id_oui2pot3qd | 6531a46a89b3300013914a36 | 6.53257 | 3.39936 | Lagos | Nigeria | 2023-11-03 | 13 | NaN | NaN | ... | 6791.682888 | 51171.802486 | 5791.682829 | 11.816715 | 0.192757 | -96.41189 | 61.045123 | -121.307414 | 41.898269 | 39.450741 |
| 3 | id_9aandqzy4n | 6531a46a89b3300013914a36 | 6.53257 | 3.39936 | Lagos | Nigeria | 2023-11-08 | 14 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 10.5376 |
| 4 | id_ali5x2m4iw | 6531a46a89b3300013914a36 | 6.53257 | 3.39936 | Lagos | Nigeria | 2023-11-09 | 13 | 0.000267 | 0.774656 | ... | 1451.050659 | 96215.90625 | 451.050598 | 10.521009 | 0.153114 | -97.811241 | 49.513439 | -126.064453 | 40.167355 | 19.431731 |


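The file loads without errors, and the first rows show the identifier, location, date, satellite-derived, and `pm2_5` columns. Several satellite-derived columns contain missing values in these rows, which will need to be handled during preprocessing. The data is now ready for further analysis.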
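
Because the first rows already show missing values in several satellite-derived columns, it can help to quantify missingness per column before moving on. The following is a minimal sketch under the assumption that the DataFrame `df` loaded above is in memory; the helper name `summarize_missing` is hypothetical and not part of the original notebook.

```python
import pandas as pd


def summarize_missing(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and share of missing values per column, most incomplete first.

    `summarize_missing` is a hypothetical helper used for illustration only.
    """
    summary = pd.DataFrame({
        "n_missing": frame.isna().sum(),       # absolute number of missing cells
        "share_missing": frame.isna().mean(),  # fraction of rows that are missing
    })
    return summary.sort_values("share_missing", ascending=False)
```

Calling `summarize_missing(df).head(10)` would list the ten least complete columns, which gives a first indication of where imputation or column removal may be needed in the preprocessing notebook.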