# Beijing Air Quality

**Programmazione di Applicazioni Data Intensive**  
Laurea in Ingegneria e Scienze Informatiche  
DISI - Università di Bologna, Cesena

Andrea Dotti `andrea.dotti4@studio.unibo.it`

Giacomo Pierbattista `giacomo.pierbattista@studio.unibo.it`

Citations
 - Beijing Multi-Site Air-Quality Data (https://archive.ics.uci.edu/dataset/501/beijing+multi+site+air+quality+data)

## Part 1 - Problem description and exploratory analysis

The goal is to build a model that, using the data recorded by sensors located in several districts of Beijing, is able to do something useful.

We import the libraries needed to download the files, organize the data structures, and draw the plots.

In [19]:
%matplotlib inline
import os.path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [31]:
from urllib.request import urlretrieve
from zipfile import ZipFile

file_zip_url = "https://raw.githubusercontent.com/jackprb/beijing-air-quality/master/datasets.zip"
file_zip_name = "datasets.zip"

# Download and extract the archive only if it is not already present locally
if not os.path.exists(file_zip_name):
    urlretrieve(file_zip_url, file_zip_name)
    with ZipFile(file_zip_name) as f:
        f.extractall()
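
Because the cell above extracts the archive only when `datasets.zip` is absent, an archive that was downloaded earlier but never extracted would leave the CSV files missing. The following is a minimal sketch of a sanity check, assuming the files are extracted into the working directory and follow the `PRSA_Data_<station>_20130301-20170228.csv` naming pattern used by the loading cell further on. `missing_station_files` is a hypothetical helper, not part of the notebook.

```python
import os
import tempfile

def missing_station_files(stations, directory="."):
    """Return the expected CSV names that are not present in `directory`."""
    expected = [f"PRSA_Data_{s}_20130301-20170228.csv" for s in stations]
    return [name for name in expected
            if not os.path.exists(os.path.join(directory, name))]

# Demonstration in a temporary directory where only one of two files exists
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "PRSA_Data_Dongsi_20130301-20170228.csv"), "w").close()
    result = missing_station_files(["Dongsi", "Tiantan"], tmp)

print(result)  # ['PRSA_Data_Tiantan_20130301-20170228.csv']
```

In the notebook, the helper could be called with `dict_names.values()` once that dictionary is defined; an empty list would mean every expected file is in place.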

In [40]:
# Short names for Beijing districts, mapped to the station names used in the file names
dict_names = {
    "Ao": "Aotizhongxin",
    "Chang": "Changping",
    "Ding": "Dingling",
    "Dong": "Dongsi",
    "Guan": "Guanyuan",
    "Guc": "Gucheng",
    "Hua": "Huairou",
    "Nong": "Nongzhanguan",
    "Shu": "Shunyi",
    "Tia": "Tiantan",
    "Wanl": "Wanliu",
    "Wans": "Wanshouxigong",
}

# Load one CSV file per station
frames = []
for name in dict_names.values():
    data_raw = pd.read_csv("PRSA_Data_" + name + "_20130301-20170228.csv", sep=",")
    frames.append(data_raw)

# Stack all stations into a single frame (each file keeps its own row index)
data_all = pd.concat(frames)
# Drop the per-file row counter, which carries no information
data_all = data_all.drop('No', axis=1)

In [41]:
data_all

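| | year | month | day | hour | PM2.5 | PM10 | SO2 | NO2 | CO | O3 | TEMP | PRES | DEWP | RAIN | wd | WSPM | station |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | 3 | 1 | 0 | 4.0 | 4.0 | 4.0 | 7.0 | 300.0 | 77.0 | -0.7 | 1023.0 | -18.8 | 0.0 | NNW | 4.4 | Aotizhongxin |
| 1 | 2013 | 3 | 1 | 1 | 8.0 | 8.0 | 4.0 | 7.0 | 300.0 | 77.0 | -1.1 | 1023.2 | -18.2 | 0.0 | N | 4.7 | Aotizhongxin |
| 2 | 2013 | 3 | 1 | 2 | 7.0 | 7.0 | 5.0 | 10.0 | 300.0 | 73.0 | -1.1 | 1023.5 | -18.2 | 0.0 | NNW | 5.6 | Aotizhongxin |
| 3 | 2013 | 3 | 1 | 3 | 6.0 | 6.0 | 11.0 | 11.0 | 300.0 | 72.0 | -1.4 | 1024.5 | -19.4 | 0.0 | NW | 3.1 | Aotizhongxin |
| 4 | 2013 | 3 | 1 | 4 | 3.0 | 3.0 | 12.0 | 12.0 | 300.0 | 72.0 | -2.0 | 1025.2 | -19.5 | 0.0 | N | 2.0 | Aotizhongxin |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 35059 | 2017 | 2 | 28 | 19 | 11.0 | 32.0 | 3.0 | 24.0 | 400.0 | 72.0 | 12.5 | 1013.5 | -16.2 | 0.0 | NW | 2.4 | Wanshouxigong |
| 35060 | 2017 | 2 | 28 | 20 | 13.0 | 32.0 | 3.0 | 41.0 | 500.0 | 50.0 | 11.6 | 1013.6 | -15.1 | 0.0 | WNW | 0.9 | Wanshouxigong |
| 35061 | 2017 | 2 | 28 | 21 | 14.0 | 28.0 | 4.0 | 38.0 | 500.0 | 54.0 | 10.8 | 1014.2 | -13.3 | 0.0 | NW | 1.1 | Wanshouxigong |
| 35062 | 2017 | 2 | 28 | 22 | 12.0 | 23.0 | 4.0 | 30.0 | 400.0 | 59.0 | 10.5 | 1014.4 | -12.9 | 0.0 | NNW | 1.2 | Wanshouxigong |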
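
The row labels in the output above start at 0 and the last rows are still labelled around 35060, even though twelve files were concatenated: `pd.concat` keeps each frame's own index by default, so the labels repeat once per station. A minimal sketch on toy data (the two small frames and their values are made up for illustration, not taken from the dataset) shows this behaviour and how `ignore_index=True` would yield unique labels instead:

```python
import pandas as pd

# Toy stand-ins for two per-station files (values are illustrative only)
a = pd.DataFrame({"PM2.5": [4.0, 8.0], "station": ["A", "A"]})
b = pd.DataFrame({"PM2.5": [11.0, 13.0], "station": ["B", "B"]})

# Default behaviour: each frame keeps its own 0..n-1 labels, so labels repeat
kept = pd.concat([a, b])
print(kept.index.tolist())    # [0, 1, 0, 1]
print(kept.index.is_unique)   # False

# With ignore_index=True the result gets a fresh 0..N-1 index
fresh = pd.concat([a, b], ignore_index=True)
print(fresh.index.tolist())   # [0, 1, 2, 3]
```

Whether to reset the index is a design choice. The notebook keeps the per-file labels, so a label-based selection such as `data_all.loc[0]` returns one row per station rather than a single row.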


In [42]:
# Combine year, month, day and hour into a single datetime column placed first
data_all.insert(0, 'date', pd.to_datetime(data_all[['year', 'month', 'day', 'hour']]))
data_all = data_all.drop(['year', 'month', 'day', 'hour'], axis=1)

In [43]:
data_all

| | date | PM2.5 | PM10 | SO2 | NO2 | CO | O3 | TEMP | PRES | DEWP | RAIN | wd | WSPM | station |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013-03-01 00:00:00 | 4.0 | 4.0 | 4.0 | 7.0 | 300.0 | 77.0 | -0.7 | 1023.0 | -18.8 | 0.0 | NNW | 4.4 | Aotizhongxin |
| 1 | 2013-03-01 01:00:00 | 8.0 | 8.0 | 4.0 | 7.0 | 300.0 | 77.0 | -1.1 | 1023.2 | -18.2 | 0.0 | N | 4.7 | Aotizhongxin |
| 2 | 2013-03-01 02:00:00 | 7.0 | 7.0 | 5.0 | 10.0 | 300.0 | 73.0 | -1.1 | 1023.5 | -18.2 | 0.0 | NNW | 5.6 | Aotizhongxin |
| 3 | 2013-03-01 03:00:00 | 6.0 | 6.0 | 11.0 | 11.0 | 300.0 | 72.0 | -1.4 | 1024.5 | -19.4 | 0.0 | NW | 3.1 | Aotizhongxin |
| 4 | 2013-03-01 04:00:00 | 3.0 | 3.0 | 12.0 | 12.0 | 300.0 | 72.0 | -2.0 | 1025.2 | -19.5 | 0.0 | N | 2.0 | Aotizhongxin |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 35059 | 2017-02-28 19:00:00 | 11.0 | 32.0 | 3.0 | 24.0 | 400.0 | 72.0 | 12.5 | 1013.5 | -16.2 | 0.0 | NW | 2.4 | Wanshouxigong |
| 35060 | 2017-02-28 20:00:00 | 13.0 | 32.0 | 3.0 | 41.0 | 500.0 | 50.0 | 11.6 | 1013.6 | -15.1 | 0.0 | WNW | 0.9 | Wanshouxigong |
| 35061 | 2017-02-28 21:00:00 | 14.0 | 28.0 | 4.0 | 38.0 | 500.0 | 54.0 | 10.8 | 1014.2 | -13.3 | 0.0 | NW | 1.1 | Wanshouxigong |
| 35062 | 2017-02-28 22:00:00 | 12.0 | 23.0 | 4.0 | 30.0 | 400.0 | 59.0 | 10.5 | 1014.4 | -12.9 | 0.0 | NNW | 1.2 | Wanshouxigong |
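
The `date` column above comes from passing the `year`, `month`, `day` and `hour` columns to `pd.to_datetime`, which assembles one timestamp per row from columns with those names. A minimal sketch of the same step on a toy frame (the values are illustrative, not read from the dataset):

```python
import pandas as pd

# Toy frame with the same component columns as the dataset (values illustrative)
toy = pd.DataFrame({
    "year": [2013, 2017],
    "month": [3, 2],
    "day": [1, 28],
    "hour": [0, 23],
    "PM2.5": [4.0, 12.0],
})

# Assemble one timestamp per row from the component columns
stamps = pd.to_datetime(toy[["year", "month", "day", "hour"]])

# Put the new column first and drop the now-redundant components
toy.insert(0, "date", stamps)
toy = toy.drop(columns=["year", "month", "day", "hour"])

print(toy.columns.tolist())   # ['date', 'PM2.5']
print(toy["date"].iloc[1])    # 2017-02-28 23:00:00
```

A single datetime column makes later time-based operations, such as sorting or resampling by hour or day, simpler than working with four separate integer columns.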
