# ER190C Lecture 6 Notebook

**Data Cleaning and Exploratory Data Analysis**

Duncan Callaway

September 11 2018

Today we'll work with PurpleAir data to explore the concepts of Structure, Granularity, Scope, Temporality and Faithfulness.  Along the way we'll talk about data cleaning as well.  

[Here's PurpleAir's website](https://www.purpleair.com/map#1/25/-30) -- They have really cool maps!

The way I developed this lecture was by pulling the data down and exploring it.  You'll see my (edited) process of examining the data.

I began by visiting [this website](https://www.purpleair.com/sensorlist) to look for data.  I used the Chrome browser to download the data (other browsers didn't work).

The folks at PurpleAir also sent me [this pdf](https://github.com/ds-modules/ER-190C/blob/master/lecture/Lecture%206%20Sept%2011/Using%20PurpleAir%20Data.pdf) describing their data.  

In [8]:
import numpy as np
import pandas as pd
import os

from utils import line_count
from utils import head
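
`utils` is a course helper module whose source is not shown in this notebook. The sketch below is a guess at what `line_count` and `head` might look like, assuming they count the lines in a file and return its first few lines as a list of strings (which matches the outputs further down). It is not the actual course code.

```python
from itertools import islice

def line_count(path):
    """Count the lines in a text file without loading it all into memory."""
    with open(path, "r") as f:
        return sum(1 for _ in f)

def head(path, lines=5):
    """Return the first `lines` lines of a file as a list of raw strings."""
    with open(path, "r") as f:
        return list(islice(f, lines))
```

Peeking at a file this way before calling `pd.read_csv` shows the delimiter, the header and any oddities in the raw text at very little cost.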

In [9]:
os.listdir('data')

['Ecole Bilingue de Berkeley (37.854830799999995 -122.28937169999999) Primary 08_05_2018 09_04_2018.csv',
 'Ecole Bilingue de Berkeley (37.854830799999995 -122.28937169999999) Secondary 08_05_2018 09_04_2018.csv']

In [10]:
os.path.getsize('data/Ecole Bilingue de Berkeley (37.854830799999995 -122.28937169999999) Primary 08_05_2018 09_04_2018.csv')/1e6

2.41108

In [11]:
line_count('data/Ecole Bilingue de Berkeley (37.854830799999995 -122.28937169999999) Primary 08_05_2018 09_04_2018.csv')

29894

In [12]:
head('data/Ecole Bilingue de Berkeley (37.854830799999995 -122.28937169999999) Primary 08_05_2018 09_04_2018.csv')

['created_at,entry_id,PM1.0_CF_ATM_ug/m3,PM2.5_CF_ATM_ug/m3,PM10.0_CF_ATM_ug/m3,UptimeMinutes,RSSI_dbm,Temperature_F,Humidity_%,PM2.5_CF_1_ug/m3,\n',
 '2018-08-05 00:00:31 UTC,111170,1.96,4.34,4.96,135.00,-67.00,84.00,33.00,4.34\n',
 '2018-08-05 00:01:51 UTC,111171,2.13,3.89,6.83,136.00,-67.00,84.00,33.00,3.89\n',
 '2018-08-05 00:03:11 UTC,111172,3.04,4.93,6.18,137.00,-68.00,84.00,34.00,4.93\n',
 '2018-08-05 00:04:31 UTC,111173,2.17,4.26,6.83,139.00,-65.00,84.00,33.00,4.26\n']

In [13]:
EB_data = pd.read_csv('data/Ecole Bilingue de Berkeley (37.854830799999995 -122.28937169999999) Primary 08_05_2018 09_04_2018.csv')

In [15]:
EB_data.head()

Unnamed: 0,created_at,entry_id,PM1.0_CF_ATM_ug/m3,PM2.5_CF_ATM_ug/m3,PM10.0_CF_ATM_ug/m3,UptimeMinutes,RSSI_dbm,Temperature_F,Humidity_%,PM2.5_CF_1_ug/m3,Unnamed: 10
0,2018-08-05 00:00:31 UTC,111170,1.96,4.34,4.96,135.0,-67.0,84.0,33.0,4.34,
1,2018-08-05 00:01:51 UTC,111171,2.13,3.89,6.83,136.0,-67.0,84.0,33.0,3.89,
2,2018-08-05 00:03:11 UTC,111172,3.04,4.93,6.18,137.0,-68.0,84.0,34.0,4.93,
3,2018-08-05 00:04:31 UTC,111173,2.17,4.26,6.83,139.0,-65.0,84.0,33.0,4.26,
4,2018-08-05 00:05:51 UTC,111174,2.06,4.06,8.51,140.0,-67.0,84.0,33.0,4.06,


In [16]:
pd.unique(EB_data['Unnamed: 10'].isna())

array([ True])
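
Every value in `Unnamed: 10` is missing. The likely cause is the trailing comma at the end of the CSV header line shown earlier, which makes pandas create one extra, empty column. One way to drop such a column is sketched here on a small inline sample rather than the real file. The names `csv_text`, `sample` and `cleaned` are made up for illustration.

```python
import io
import pandas as pd

# Small stand-in for the PurpleAir file: note the trailing comma in the header.
csv_text = (
    "created_at,entry_id,Temperature_F,\n"
    "2018-08-05 00:00:31 UTC,111170,84.00\n"
    "2018-08-05 00:01:51 UTC,111171,84.00\n"
)
sample = pd.read_csv(io.StringIO(csv_text))

# Drop every column whose values are all missing.
cleaned = sample.dropna(axis="columns", how="all")
print(list(cleaned.columns))
```

Using `how="all"` only removes columns with no data at all, so a column with a few missing readings is left alone.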

In [18]:
EB_data.loc[0, 'created_at']

'2018-08-05 00:00:31 UTC'

In [25]:
EB_time = pd.to_datetime(EB_data['created_at'], utc = True)

In [26]:
EB_time.dtype

datetime64[ns, UTC]
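
The timestamps are now timezone-aware and in UTC. To look at daily patterns it may be more convenient to view them in local time. A short sketch using `tz_convert`, with two timestamps copied from the file head above. The zone `America/Los_Angeles` is my assumption based on the sensor's Berkeley location.

```python
import pandas as pd

# Two raw timestamp strings copied from the head of the file.
stamps = pd.Series(["2018-08-05 00:00:31 UTC", "2018-08-05 00:01:51 UTC"])

utc_time = pd.to_datetime(stamps, utc=True)

# Convert to local time; the instant is unchanged, only its display differs.
local_time = utc_time.dt.tz_convert("America/Los_Angeles")
print(local_time.iloc[0])
```

Midnight UTC on August 5 falls in the late afternoon of August 4 in California, which matters when asking questions like "what time of day is pollution highest?"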

In [28]:
EB_data['created_at'] = EB_time

EB_data.head()

Unnamed: 0,created_at,entry_id,PM1.0_CF_ATM_ug/m3,PM2.5_CF_ATM_ug/m3,PM10.0_CF_ATM_ug/m3,UptimeMinutes,RSSI_dbm,Temperature_F,Humidity_%,PM2.5_CF_1_ug/m3,Unnamed: 10
0,2018-08-05 00:00:31+00:00,111170,1.96,4.34,4.96,135.0,-67.0,84.0,33.0,4.34,
1,2018-08-05 00:01:51+00:00,111171,2.13,3.89,6.83,136.0,-67.0,84.0,33.0,3.89,
2,2018-08-05 00:03:11+00:00,111172,3.04,4.93,6.18,137.0,-68.0,84.0,34.0,4.93,
3,2018-08-05 00:04:31+00:00,111173,2.17,4.26,6.83,139.0,-65.0,84.0,33.0,4.26,
4,2018-08-05 00:05:51+00:00,111174,2.06,4.06,8.51,140.0,-67.0,84.0,33.0,4.06,
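
With `created_at` stored as datetimes, the temporal granularity can be checked directly by differencing consecutive timestamps. A minimal sketch on the five rows shown above:

```python
import pandas as pd

# The first five timestamps from the table above.
times = pd.to_datetime(pd.Series([
    "2018-08-05 00:00:31 UTC",
    "2018-08-05 00:01:51 UTC",
    "2018-08-05 00:03:11 UTC",
    "2018-08-05 00:04:31 UTC",
    "2018-08-05 00:05:51 UTC",
]), utc=True)

# Time elapsed between each reading and the previous one, in seconds.
gap_seconds = times.diff().dropna().dt.total_seconds()
print(gap_seconds.tolist())  # [80.0, 80.0, 80.0, 80.0]
```

On the full data, something like `EB_data['created_at'].diff().describe()` would show whether the 80 second spacing holds throughout or whether there are gaps in the record.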


In [30]:
EB_data.describe()

Unnamed: 0,entry_id,PM1.0_CF_ATM_ug/m3,PM2.5_CF_ATM_ug/m3,PM10.0_CF_ATM_ug/m3,UptimeMinutes,RSSI_dbm,Temperature_F,Humidity_%,PM2.5_CF_1_ug/m3,Unnamed: 10
count,29893.0,29893.0,29893.0,29893.0,29893.0,29893.0,29893.0,29893.0,29893.0,0.0
mean,126116.0,15.656506,23.983548,28.240713,810.441508,-64.379654,71.434048,49.988726,21.014233,
std,8629.510135,120.382762,121.160645,121.630947,1126.457388,10.010075,4.550603,6.149002,80.844029,
min,111170.0,0.33,1.22,1.31,1.0,-79.0,63.0,27.0,1.22,
25%,118643.0,5.04,7.98,10.43,183.0,-67.0,68.0,45.0,7.98,
50%,126116.0,9.51,14.79,18.54,459.0,-65.0,70.0,52.0,14.79,
75%,133589.0,17.28,28.59,32.89,960.0,-63.0,75.0,55.0,28.4,
max,141062.0,5003.89,5003.89,5003.89,6761.0,31.0,88.0,61.0,3335.44,


In [32]:
EB_data.loc[EB_data['PM2.5_CF_ATM_ug/m3']==5003.89, 'PM2.5_CF_ATM_ug/m3']

9633    5003.89
Name: PM2.5_CF_ATM_ug/m3, dtype: float64

In [36]:
EB_data.loc[EB_data['PM2.5_CF_ATM_ug/m3']>150, 'PM2.5_CF_ATM_ug/m3']

2569     797.96
2570     228.62
2571     256.98
9632    1497.30
9633    5003.89
9634    4999.74
9635    4998.48
9636    5000.70
9637    5000.00
9638    5000.00
9639    5000.00
9640    5000.00
9641    5000.00
9642    4998.30
9643    4996.82
9644    4998.23
9645    4992.67
9646    4998.00
9647    5000.00
9648    4996.89
9649    4996.04
9650    2187.27
9692     201.70
Name: PM2.5_CF_ATM_ug/m3, dtype: float64
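
Before overwriting anything, it can help to count how many readings a given cutoff would remove. The sketch below rebuilds the output above as a small Series and applies the cutoff of 1400 used in the next cell. Whether a reading like 797.96 is real or also an artifact is a judgment call this sketch does not settle.

```python
import pandas as pd

# Index labels and values copied from the output above.
high = pd.Series(
    [797.96, 228.62, 256.98, 1497.30, 5003.89, 4999.74, 4998.48, 5000.70,
     5000.00, 5000.00, 5000.00, 5000.00, 5000.00, 4998.30, 4996.82, 4998.23,
     4992.67, 4998.00, 5000.00, 4996.89, 4996.04, 2187.27, 201.70],
    index=[2569, 2570, 2571] + list(range(9632, 9651)) + [9692],
)

flagged = high > 1400
print(flagged.sum())            # number of readings above the cutoff
print(high[~flagged].tolist())  # readings above 150 that would be kept
```

All of the flagged readings sit in one consecutive run of rows (9632 to 9650), which is more consistent with a sensor fault than with a real pollution event.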

In [37]:
# Readings above 1400 ug/m3 look like sensor artifacts; mark them as missing.
EB_data.loc[EB_data['PM2.5_CF_ATM_ug/m3']>1400, 'PM2.5_CF_ATM_ug/m3'] = np.nan

In [38]:
EB_data.describe()

Unnamed: 0,entry_id,PM1.0_CF_ATM_ug/m3,PM2.5_CF_ATM_ug/m3,PM10.0_CF_ATM_ug/m3,UptimeMinutes,RSSI_dbm,Temperature_F,Humidity_%,PM2.5_CF_1_ug/m3,Unnamed: 10
count,29893.0,29893.0,29874.0,29893.0,29893.0,29893.0,29893.0,29893.0,29893.0,0.0
mean,126116.0,15.656506,21.030858,28.240713,810.441508,-64.379654,71.434048,49.988726,21.014233,
std,8629.510135,120.382762,19.162532,121.630947,1126.457388,10.010075,4.550603,6.149002,80.844029,
min,111170.0,0.33,1.22,1.31,1.0,-79.0,63.0,27.0,1.22,
25%,118643.0,5.04,7.98,10.43,183.0,-67.0,68.0,45.0,7.98,
50%,126116.0,9.51,14.775,18.54,459.0,-65.0,70.0,52.0,14.79,
75%,133589.0,17.28,28.56,32.89,960.0,-63.0,75.0,55.0,28.4,
max,141062.0,5003.89,797.96,5003.89,6761.0,31.0,88.0,61.0,3335.44,
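
After the cleaning step, `PM2.5_CF_ATM_ug/m3` tops out at 797.96 and its standard deviation falls from about 121 to about 19. The other particulate columns were not touched: `PM1.0_CF_ATM_ug/m3` and `PM10.0_CF_ATM_ug/m3` still report a maximum of 5003.89, and `PM2.5_CF_1_ug/m3` still reports 3335.44. A sketch of applying one threshold to several columns at once with `DataFrame.mask` follows. The toy values and the reuse of 1400 for every column are assumptions for illustration, and the right cutoff may differ by channel.

```python
import pandas as pd

# Toy frame with one spike row; column names follow the PurpleAir file.
toy = pd.DataFrame({
    "PM1.0_CF_ATM_ug/m3": [1.96, 5003.89, 2.13],
    "PM2.5_CF_ATM_ug/m3": [4.34, 5003.89, 3.89],
    "PM10.0_CF_ATM_ug/m3": [4.96, 5003.89, 6.83],
    "Temperature_F": [84.0, 84.0, 84.0],
})

pm_cols = [c for c in toy.columns if c.startswith("PM")]
threshold = 1400  # assumed cutoff, same value used above for PM2.5

# mask() replaces entries where the condition is True with NaN.
toy[pm_cols] = toy[pm_cols].mask(toy[pm_cols] > threshold)
print(toy[pm_cols].max())
```

Selecting columns by the `PM` prefix keeps temperature, humidity and signal strength out of the threshold test.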
