# Warsaw Air Pollution

<span style="color: gray; font-size:1em;">Mateusz Zajac</span>
<br><span style="color: gray; font-size:1em;">Jul-2020</span>


## Table of Contents
- [Introduction](#intro)
- [Part I - Gathering Data](#gather)
- [Part II - Assessing Data](#assess)
- [Part III - Cleaning Data](#clean)

In [1]:
import pandas as pd
import numpy as np

from scipy.special import boxcox1p
from scipy.special import inv_boxcox1p

from datetime import datetime, timedelta
import calendar


#visualization
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
base_color = sns.color_palette()[0]


import warnings
warnings.filterwarnings('ignore')

#show all columns and rows
pd.options.display.max_rows = None
pd.options.display.max_columns = None


import requests
import json

<a id='gather'></a>
## Part I - Gathering Data

### Meteorological data

**darksky API Key**
```
key = ''  # new Dark Sky API keys are no longer issued, see https://blog.darksky.net/
```

**API Setup**
```
date = datetime.strptime('2015-01-01', '%Y-%m-%d')
timest = int(datetime.timestamp(date))
```
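
`datetime.timestamp` on a naive `datetime` interprets it in the local time zone of the machine, so `timest` depends on where the notebook runs. A minimal sketch of a machine-independent variant follows. The fixed UTC+1 offset (Warsaw winter time) and the names `WARSAW_WINTER` and `timest_aware` are assumptions for illustration, not part of the original setup.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed UTC+1 offset, valid for Warsaw on 1 January.
# A named zone such as zoneinfo.ZoneInfo("Europe/Warsaw") would also handle DST.
WARSAW_WINTER = timezone(timedelta(hours=1))

# Midnight local time on the first day of the download window
start = datetime(2015, 1, 1, tzinfo=WARSAW_WINTER)

# Unix timestamp in whole seconds, independent of the machine's time zone
timest_aware = int(start.timestamp())
print(timest_aware)
```

The result lines up with the first value of the `time` column shown later, which appears there rounded for display.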

**Download data**
```
data_darksky = []
# one request per day; each response carries that day's hourly records
for i in range(365):
    url = 'https://api.darksky.net/forecast/{}/52.2193,21.0047,{}'.format(key, timest)
    darksky = json.loads(requests.get(url).text)

    data_darksky.extend(darksky['hourly']['data'])

    # advance the timestamp by one day (in seconds)
    timest += 24*60*60

df_darksky = pd.DataFrame(data_darksky)
```
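
Since the API can no longer be called with a fresh key, it may help to separate URL construction from the network request so that the loop logic can be checked offline. The sketch below does only that. The helper `daily_urls` and the placeholder key `'YOUR_KEY'` are hypothetical, and the start timestamp assumes midnight in Warsaw (UTC+1) on 2015-01-01.

```python
BASE_URL = 'https://api.darksky.net/forecast/{}/52.2193,21.0047,{}'
SECONDS_PER_DAY = 24 * 60 * 60


def daily_urls(key, start_ts, days):
    """Yield one request URL per day, stepping the timestamp by 24 hours."""
    for offset in range(days):
        yield BASE_URL.format(key, start_ts + offset * SECONDS_PER_DAY)


# Hypothetical inputs: placeholder key and the assumed start timestamp
urls = list(daily_urls('YOUR_KEY', 1420066800, 365))
print(len(urls))
print(urls[0])
```

Stepping by a fixed 86400 seconds mirrors the original loop. On days with a daylight saving change the local day is 23 or 25 hours long, so the request time drifts by an hour for part of the year.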

### Pollution data from the monitoring stations
The data were downloaded manually from the [GIOŚ archives](http://powietrze.gios.gov.pl/pjp/archives).

In [2]:
# load both datasets

darksky = pd.read_hdf('data/darksky_data.h5')
gios = pd.read_hdf('data/gios_data.h5')

In [3]:
darksky.head()

| | date | apparentTemperature | cloudCover | dewPoint | humidity | icon | ozone | precipAccumulation | precipIntensity | precipProbability | precipType | pressure | summary | temperature | time | uvIndex | visibility | windBearing | windGust | windSpeed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-01-01 00:00:00 | 32.93 | 1.0 | 29.56 | 0.87 | | | 0.0 | 0.0 | 0.0 | | 1027.7 | | 32.93 | 1420067000.0 | 0.0 | 2.733 | 260.0 | 6.93 | 6.93 |
| 1 | 2015-01-01 01:00:00 | 33.9 | 1.0 | 31.11 | 0.89 | | | 0.0 | 0.0 | 0.0 | | 1027.7 | | 33.9 | 1420070000.0 | 0.0 | 2.733 | 260.0 | 6.93 | 6.93 |
| 2 | 2015-01-01 02:00:00 | 27.73 | 1.0 | 31.59 | 0.91 | cloudy | | 0.0 | 0.0 | 0.0 | | 1027.7 | Overcast | 33.81 | 1420074000.0 | 0.0 | 2.733 | 260.0 | 6.93 | 6.93 |
| 3 | 2015-01-01 03:00:00 | 27.73 | 1.0 | 32.76 | 0.96 | fog | | 0.0 | 0.0 | 0.0 | | 1027.7 | Foggy | 33.81 | 1420078000.0 | 0.0 | 1.244 | 251.0 | 6.93 | 6.93 |
| 4 | 2015-01-01 04:00:00 | 26.41 | 1.0 | 32.89 | 0.97 | fog | | 0.0 | 0.0 | 0.0 | | 1027.7 | Foggy | 33.73 | 1420081000.0 | 0.0 | 1.152 | 251.0 | 8.96 | 8.96 |


In [4]:
darksky.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 43853 entries, 0 to 43852
Data columns (total 20 columns):
date                   43853 non-null object
apparentTemperature    43853 non-null float64
cloudCover             43853 non-null float64
dewPoint               43853 non-null float64
humidity               43853 non-null float64
icon                   42219 non-null object
ozone                  13513 non-null float64
precipAccumulation     43853 non-null float64
precipIntensity        43853 non-null float64
precipProbability      43853 non-null float64
precipType             5442 non-null object
pressure               43853 non-null float64
summary                42219 non-null object
temperature            43853 non-null float64
time                   43853 non-null float64
uvIndex                43853 non-null float64
visibility             43853 non-null float64
windBearing            43853 non-null float64
windGust               43853 non-null float64
windSpeed              43853 non-null float64
dtypes: float64(16), object(4)
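
The `info()` output above lists four columns with fewer than 43853 non-null entries. As a quick sketch, the share of missing values can be derived directly from those counts. The names `non_null` and `missing_share` are illustrative, and the counts are copied from the output rather than recomputed from the data.

```python
import pandas as pd

total_rows = 43853

# Non-null counts copied from the info() output for the incomplete columns
non_null = pd.Series({
    'icon': 42219,
    'ozone': 13513,
    'precipType': 5442,
    'summary': 42219,
})

# Fraction of rows without a value, largest first
missing_share = (1 - non_null / total_rows).sort_values(ascending=False)
print(missing_share.round(3))
```

On the loaded frame itself, `darksky.isnull().mean()` should give the same fractions for every column at once.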

In [5]:
darksky.describe()

| | apparentTemperature | cloudCover | dewPoint | humidity | ozone | precipAccumulation | precipIntensity | precipProbability | pressure | temperature | time | uvIndex | visibility | windBearing | windGust | windSpeed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 43853.0 | 43853.0 | 43853.0 | 43853.0 | 13513.0 | 43853.0 | 43853.0 | 43853.0 | 43853.0 | 43853.0 | 43853.0 | 43853.0 | 43853.0 | 43853.0 | 43853.0 | 43853.0 |
| mean | 48.1553 | 0.575709 | 41.072846 | 0.725362 | 318.439369 | 0.000416 | 0.000896 | 0.021855 | 1016.543833 | 50.739092 | 1498956000.0 | 0.863476 | 6.891712 | 193.698265 | 10.333003 | 7.524209 |
| std | 18.897848 | 0.327405 | 12.890181 | 0.180586 | 40.287832 | 0.006123 | 0.005747 | 0.093905 | 8.845515 | 16.258938 | 45532790.0 | 1.591354 | 2.244549 | 93.865781 | 7.107354 | 3.951228 |
| min | -11.33 | 0.0 | -9.65 | 0.12 | 224.2 | 0.0 | 0.0 | 0.0 | 975.2 | -4.84 | 1420067000.0 | 0.0 | 0.062 | 0.0 | 0.0 | 0.0 |
| 25% | 32.67 | 0.25 | 32.02 | 0.6 | 289.2 | 0.0 | 0.0 | 0.0 | 1011.4 | 37.87 | 1459526000.0 | 0.0 | 6.215 | 115.0 | 5.14 | 4.68 |
| 50% | 47.63 | 0.72 | 41.08 | 0.76 | 315.4 | 0.0 | 0.0 | 0.0 | 1016.6 | 49.89 | 1498990000.0 | 0.0 | 6.216 | 206.0 | 8.48 | 6.93 |
| 75% | 63.41 | 0.87 | 51.42 | 0.87 | 341.4 | 0.0 | 0.0 | 0.0 | 1022.2 | 63.38 | 1538366000.0 | 1.0 | 10.0 | 270.0 | 13.49 | 9.83 |
| max | 96.81 | 1.0 | 70.48 | 1.0 | 484.6 | 0.3075 | 0.2721 | 0.97 | 1046.7 | 96.79 | 1577830000.0 | 8.0 | 10.0 | 359.0 | 51.64 | 38.78 |
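
The temperature statistics above (mean about 50.7, maximum 96.79) only make sense for Warsaw if they are in degrees Fahrenheit. This is consistent with the request URL carrying no `units` parameter, assuming Dark Sky then returns US units. The sketch below converts the summary values to Celsius. The helper `fahrenheit_to_celsius` is illustrative and the input values are copied from the `describe()` output.

```python
def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


# Values copied from the `temperature` column of the describe() output
temperature_f = {'min': -4.84, 'mean': 50.739092, 'max': 96.79}

# Same statistics in Celsius, rounded to one decimal place
temperature_c = {name: round(fahrenheit_to_celsius(value), 1)
                 for name, value in temperature_f.items()}
print(temperature_c)
```

The same function can be applied to the `temperature`, `apparentTemperature` and `dewPoint` columns of the frame, since it works element-wise on a pandas Series.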


In [6]:
gios.head()

| | date | pm25_nie | pm25_kon | pm25_wok |
|---|---|---|---|---|
| 0 | 2015-01-01 00:00:00 | 51.5034 | 78.085 | 51.32 |
| 1 | 2015-01-01 01:00:00 | 71.8204 | 78.085 | 68.982316 |
| 2 | 2015-01-01 02:00:00 | 42.6996 | 64.46 | 48.707108 |
| 3 | 2015-01-01 03:00:00 | 38.2824 | 36.21 | 37.986883 |
| 4 | 2015-01-01 04:00:00 | 35.4194 | 29.585 | 33.675489 |


In [7]:
gios.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 43853 entries, 0 to 43852
Data columns (total 4 columns):
date        43853 non-null object
pm25_nie    43853 non-null float64
pm25_kon    43853 non-null float64
pm25_wok    43853 non-null float64
dtypes: float64(3), object(1)
memory usage: 1.7+ MB


In [8]:
gios.describe()

| | pm25_nie | pm25_kon | pm25_wok |
|---|---|---|---|
| count | 43853.0 | 43853.0 | 43853.0 |
| mean | 25.296143 | 31.188994 | 19.677586 |
| std | 16.907685 | 26.346212 | 14.572871 |
| min | 0.877018 | 0.01 | 1.281766 |
| 25% | 13.535946 | 11.3 | 9.438459 |
| 50% | 20.661466 | 20.96 | 15.64 |
| 75% | 31.902 | 46.402488 | 25.22 |
| max | 187.930147 | 256.02897 | 155.365434 |


In [9]:
gios.isnull().sum()

date        0
pm25_nie    0
pm25_kon    0
pm25_wok    0
dtype: int64

In [10]:
darksky.date.min(), darksky.date.max()

('2015-01-01 00:00:00', '2019-12-31 23:00:00')

In [11]:
gios.date.min(), gios.date.max()

('2015-01-01 00:00:00', '2019-12-31 23:00:00')
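
Both frames cover the same range and both report 43853 rows. A quick sanity check is to compare that row count with the number of hourly steps in the range. The sketch below uses only the reported figures; `expected_hours` and `surplus` are illustrative names.

```python
import pandas as pd

# Every full hour between the reported minimum and maximum dates
expected_hours = pd.date_range(start='2015-01-01 00:00:00',
                               end='2019-12-31 23:00:00',
                               freq=pd.Timedelta(hours=1))

# Row count reported by info() for both frames
reported_rows = 43853

# A positive surplus means more rows than distinct hours in the range
surplus = reported_rows - len(expected_hours)
print(len(expected_hours), surplus)
```

A surplus would suggest that some timestamps occur more than once, possibly around daylight saving changes. This is not verified here; `gios['date'].duplicated().sum()` on the loaded frame would confirm or rule it out.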

In [12]:
# convert the date columns from strings to datetime
darksky['date'] = pd.to_datetime(darksky.date)
gios['date'] = pd.to_datetime(gios.date)
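
The `darksky` frame also carries a `time` column with Unix epoch seconds, which offers a cross-check for the parsed `date` column. The sketch below converts two epoch values to Warsaw local time. The inputs are assumed to be the unrounded counterparts of the first two displayed `time` values (the `head()` output shows them rounded), and the zone name `Europe/Warsaw` is an assumption based on the station location.

```python
import pandas as pd

# Assumed unrounded epoch seconds for the first two hourly records
epoch_seconds = pd.Series([1420066800, 1420070400])

# Interpret the values as UTC instants, then express them in local time
utc_times = pd.to_datetime(epoch_seconds, unit='s', utc=True)
local_times = utc_times.dt.tz_convert('Europe/Warsaw')
print(local_times)
```

If the local times match the `date` column, the dates were recorded in Warsaw local time rather than UTC, which matters when merging with other sources.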