# Urban Air Pollution Challenge


__Can you predict air quality in cities around the world using satellite data?__

For this challenge we’ll be digging deeper into air quality data in several African cities, finding ways to track air quality and how it is changing, even in places without ground-based sensors. The inputs combine weather data with daily observations from the _Sentinel 5P satellite_, which tracks various pollutants in the atmosphere.

Our goal is to use the information from such data to predict _PM2.5_ particulate matter concentration (a common measure of air quality that normally requires ground-based sensors to measure) every day for each city. The data covers the last three months, spanning hundreds of cities across the globe.

_This is a Zindi data challenge; for more details, see the [Urban Air Pollution challenge](https://zindi.africa/competitions/zindiweekendz-learning-urban-air-pollution-challenge)._


The objective of this challenge is to predict __PM2.5__ particulate matter concentration in the air __every day__ for __each city__.
- PM2.5 refers to atmospheric particulate matter with a diameter of __less than 2.5 micrometers__.
- It is one of the most harmful air pollutants.
- It is a common measure of air quality that normally requires ground-based sensors.

The data comes from three main sources:

1. __Ground-based air quality sensors__. These measure the __target__ variable (PM2.5 particle concentration). In addition to the `target` column (the daily mean concentration), there are columns for the minimum and maximum readings on that day (`target_min`, `target_max`), the variance of the readings (`target_variance`) and the total number of sensor readings used to compute the target value (`target_count`). _This data is only provided for the train set_; you must predict the target variable for the test set.

2. __The Global Forecast System (GFS)__ for _weather data_. It provides `humidity`, `temperature` and `wind speed`, which can be used as inputs for your model.

3. __The Sentinel 5P satellite__. This satellite monitors various _pollutants_ in the atmosphere. For each pollutant, we queried the `offline Level 3` (L3) datasets available in Google Earth Engine (you can read more about the individual products here: https://developers.google.com/earth-engine/datasets/catalog/sentinel-5p). For a given pollutant, for example NO2, we provide all data from the Sentinel 5P dataset for that pollutant. This includes the key measurements like `NO2_column_number_density` (a measure of NO2 concentration) as well as metadata like the `satellite altitude`. We recommend that you __focus on the key measurements__, either the `column_number_density` or the `tropospheric_X_column_number_density` (which measures density closer to Earth’s surface).
Unfortunately, this data is not 100% complete. Some locations have no sensor readings for a particular day, and so those rows have been excluded. There are also gaps in the input data, particularly the satellite data for CH4.
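
As a rough illustration of the advice above, the sketch below picks out the density columns by name and reports the share of missing values per column. It runs on a small toy frame with made-up values; only the column names are taken from `Train.csv`. Note that a plain suffix match would also pick up the `slant` and `stratospheric` density columns in the real data, so the selection may need narrowing.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; the column names are taken from Train.csv.
toy = pd.DataFrame({
    "Place_ID": ["A", "A", "B"],
    "L3_NO2_NO2_column_number_density": [7.0e-5, np.nan, 6.5e-5],
    "L3_NO2_tropospheric_NO2_column_number_density": [2.0e-5, 1.5e-5, np.nan],
    "L3_NO2_sensor_altitude": [840000.0, 841000.0, 839500.0],
    "L3_CH4_CH4_column_volume_mixing_ratio_dry_air": [1793.8, np.nan, np.nan],
})

# Keep the density measurements and leave out metadata such as sensor altitude.
key_cols = [c for c in toy.columns if c.endswith("column_number_density")]

# Fraction of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)

print(key_cols)
print(missing_share)
```

On the real data, the same two expressions can be applied to `df_train` once it is loaded.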




## Load the data

In [1]:
import sys
print(sys.executable)

/home/ilaria/Data Science/Bootcamp/.venv/bin/python


In [2]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# Load the training data
df_train = pd.read_csv('data/Train.csv')

# Load the test data
df_test = pd.read_csv('data/Test.csv')

In [5]:
df_train.columns

Index(['Place_ID X Date', 'Date', 'Place_ID', 'target', 'target_min',
       'target_max', 'target_variance', 'target_count',
       'precipitable_water_entire_atmosphere',
       'relative_humidity_2m_above_ground',
       'specific_humidity_2m_above_ground', 'temperature_2m_above_ground',
       'u_component_of_wind_10m_above_ground',
       'v_component_of_wind_10m_above_ground',
       'L3_NO2_NO2_column_number_density',
       'L3_NO2_NO2_slant_column_number_density',
       'L3_NO2_absorbing_aerosol_index', 'L3_NO2_cloud_fraction',
       'L3_NO2_sensor_altitude', 'L3_NO2_sensor_azimuth_angle',
       'L3_NO2_sensor_zenith_angle', 'L3_NO2_solar_azimuth_angle',
       'L3_NO2_solar_zenith_angle',
       'L3_NO2_stratospheric_NO2_column_number_density',
       'L3_NO2_tropopause_pressure',
       'L3_NO2_tropospheric_NO2_column_number_density',
       'L3_O3_O3_column_number_density', 'L3_O3_O3_effective_temperature',
       'L3_O3_cloud_fraction', 'L3_O3_sensor_azimuth_angle',
   

In [4]:
df_train.head(10)

Unnamed: 0,Place_ID X Date,Date,Place_ID,target,target_min,target_max,target_variance,target_count,precipitable_water_entire_atmosphere,relative_humidity_2m_above_ground,...,L3_SO2_sensor_zenith_angle,L3_SO2_solar_azimuth_angle,L3_SO2_solar_zenith_angle,L3_CH4_CH4_column_volume_mixing_ratio_dry_air,L3_CH4_aerosol_height,L3_CH4_aerosol_optical_depth,L3_CH4_sensor_azimuth_angle,L3_CH4_sensor_zenith_angle,L3_CH4_solar_azimuth_angle,L3_CH4_solar_zenith_angle
0,010Q650 X 2020-01-02,2020-01-02,010Q650,38.0,23.0,53.0,769.5,92,11.0,60.200001,...,38.593017,-61.752587,22.363665,1793.793579,3227.855469,0.010579,74.481049,37.501499,-62.142639,22.545118
1,010Q650 X 2020-01-03,2020-01-03,010Q650,39.0,25.0,63.0,1319.85,91,14.6,48.799999,...,59.624912,-67.693509,28.614804,1789.960449,3384.226562,0.015104,75.630043,55.657486,-53.868134,19.293652
2,010Q650 X 2020-01-04,2020-01-04,010Q650,24.0,8.0,56.0,1181.96,96,16.4,33.400002,...,49.839714,-78.342701,34.296977,,,,,,,
3,010Q650 X 2020-01-05,2020-01-05,010Q650,49.0,10.0,55.0,1113.67,96,6.911948,21.300001,...,29.181258,-73.896588,30.545446,,,,,,,
4,010Q650 X 2020-01-06,2020-01-06,010Q650,21.0,9.0,52.0,1164.82,95,13.900001,44.700001,...,0.797294,-68.61248,26.899694,,,,,,,
5,010Q650 X 2020-01-07,2020-01-07,010Q650,28.0,10.0,52.0,1053.22,94,14.6,42.200001,...,30.605176,-62.134264,23.419991,,,,,,,
6,010Q650 X 2020-01-08,2020-01-08,010Q650,21.0,6.0,51.0,1239.66,96,15.6,47.100002,...,60.866484,-71.908414,32.348835,,,,,,,
7,010Q650 X 2020-01-09,2020-01-09,010Q650,18.0,6.0,28.0,307.93,93,18.6,62.400002,...,59.674296,-60.765053,26.396956,,,,,,,
8,010Q650 X 2020-01-10,2020-01-10,010Q650,21.0,15.0,33.0,305.92,95,11.8,39.0,...,37.176703,-73.81275,31.707143,,,,,,,
9,010Q650 X 2020-01-11,2020-01-11,010Q650,24.0,16.0,32.0,279.19,85,10.396144,33.100002,...,10.016394,-68.586306,28.090359,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
# The target is the daily mean PM2.5 concentration
Y = df_train['target']

## Data cleaning and feature engineering

In [None]:
df_train.info()

In [None]:
# X = 
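
One possible way to fill in `X`, shown here as a sketch and not as the notebook's final choice: drop the ground-sensor columns (they are only provided for the train set) and the identifier columns, and keep everything else. The toy frame reuses the first two rows shown in `df_train.head(10)`, restricted to a few columns; the names `leak_cols` and `id_cols` are introduced here for illustration.

```python
import pandas as pd

# First two rows of df_train.head(10), restricted to a handful of columns.
toy = pd.DataFrame({
    "Place_ID X Date": ["010Q650 X 2020-01-02", "010Q650 X 2020-01-03"],
    "Date": ["2020-01-02", "2020-01-03"],
    "Place_ID": ["010Q650", "010Q650"],
    "target": [38.0, 39.0],
    "target_min": [23.0, 25.0],
    "target_max": [53.0, 63.0],
    "target_variance": [769.5, 1319.85],
    "target_count": [92, 91],
    "precipitable_water_entire_atmosphere": [11.0, 14.6],
    "relative_humidity_2m_above_ground": [60.200001, 48.799999],
})

# Ground-sensor columns are only provided for the train set, so they cannot be features.
leak_cols = ["target", "target_min", "target_max", "target_variance", "target_count"]
# Identifier columns are strings; a linear model cannot use them directly.
id_cols = ["Place_ID X Date", "Date", "Place_ID"]

Y = toy["target"]
X = toy.drop(columns=leak_cols + id_cols)

print(list(X.columns))
```

Replacing `toy` with `df_train` gives a feature matrix with all weather and satellite columns; whether to keep the sensor-geometry metadata is a separate modelling decision.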

## Splitting data for testing 

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.20, random_state=42)
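
A random row-wise split puts days from the same city into both parts. If the model is meant to be judged on cities it has not seen (an assumption, not something stated in the challenge text above), a group-wise split by `Place_ID` is one alternative. The sketch below uses scikit-learn's `GroupShuffleSplit` on synthetic data with hypothetical place IDs.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(42)

# Synthetic stand-in: 5 hypothetical places with 4 daily rows each.
groups = np.repeat(["A", "B", "C", "D", "E"], 4)
X_toy = pd.DataFrame({"feature": rng.normal(size=20)})
y_toy = pd.Series(rng.normal(size=20))

# Hold out roughly 20% of the places (not 20% of the rows).
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X_toy, y_toy, groups=groups))

X_tr, X_te = X_toy.iloc[train_idx], X_toy.iloc[test_idx]
y_tr, y_te = y_toy.iloc[train_idx], y_toy.iloc[test_idx]

# No place should appear on both sides of the split.
train_places = set(groups[train_idx])
test_places = set(groups[test_idx])
print(train_places & test_places)
```

With the real data, `groups` would be `df_train['Place_ID']`.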

In [None]:
X_train.info()

In [None]:
# fillna with mean.. 
# X_train[""] = X_train[""].fillna()
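
A minimal sketch of the mean imputation hinted at in the comment above. The means are computed on the training part only and then reused for the held-out part, so that no information from the held-out rows leaks into the fit. The frames are toy stand-ins: the first two values come from the `L3_CH4_aerosol_height` column shown earlier, and `3000.0` is made up.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for X_train and X_test with one gappy column.
X_tr = pd.DataFrame({"L3_CH4_aerosol_height": [3227.855469, 3384.226562, np.nan]})
X_te = pd.DataFrame({"L3_CH4_aerosol_height": [np.nan, 3000.0]})

# Column means from the training part only.
train_means = X_tr.mean(numeric_only=True)

# fillna with a Series fills each column with the value at the matching index label.
X_tr_filled = X_tr.fillna(train_means)
X_te_filled = X_te.fillna(train_means)

print(X_tr_filled)
print(X_te_filled)
```

Columns that are empty in most rows (CH4 is the likely candidate here) may be better dropped than imputed.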

## Training the model

In [None]:
# Save the held-out split (X_test, y_test) so that the prediction step can be demonstrated later
X_test.to_csv("data/X_test.csv")
y_test.to_csv("data/y_test.csv")

In [None]:
# Train the model
from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(X_train, y_train)

In [None]:
from sklearn.metrics import mean_squared_error

# MSE on the training data; this is an optimistic estimate of the model's error
y_train_pred = reg.predict(X_train)
mse = mean_squared_error(y_train, y_train_pred)
print(mse)
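
The training error alone says little about how the model generalises. As a sketch, the block below fits the same kind of model on synthetic data and reports the root mean squared error (RMSE) on both the training and the held-out part. RMSE is used here because it is in the same units as the target; the data, coefficients and `_toy` names are all made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem: a linear signal plus a little noise.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=42
)
reg_toy = LinearRegression().fit(X_tr, y_tr)

# RMSE is the square root of the MSE.
train_rmse = np.sqrt(mean_squared_error(y_tr, reg_toy.predict(X_tr)))
test_rmse = np.sqrt(mean_squared_error(y_te, reg_toy.predict(X_te)))

print(f"train RMSE: {train_rmse:.3f}, test RMSE: {test_rmse:.3f}")
```

In the notebook itself, the same two lines would use `reg`, `X_train`/`y_train` and `X_test`/`y_test`.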