# ER190C Lecture 6 Notebook

**Data Cleaning and Exploratory Data Analysis**

Duncan Callaway

September 11 2018

Today we'll work with PurpleAir data to explore the concepts of Structure, Granularity, Scope, Temporality and Faithfulness.  Along the way we'll talk about data cleaning as well.  

[Here's PurpleAir's website](https://www.purpleair.com/map#1/25/-30) -- They have really cool maps!

The way I developed this lecture was by pulling the data down and exploring it.  You'll see my (edited) process of examining the data.

I began by visiting [this website](https://www.purpleair.com/sensorlist) to look for data.  I used the Chrome browser to pull data (other browsers didn't work).

The folks at PurpleAir also sent me [this pdf](https://github.com/duncancallaway/ER131_2019/blob/master/lecture/Lecture%2006%20Sept%2017/Using%20PurpleAir%20Data.pdf) describing their data.  

In [1]:
import numpy as np
import pandas as pd
import os

## Structure: how are the data stored?  
First let's look at what's in the data directory using `os.listdir` (remember that the `os` module provides command line-style functions that work across platforms, i.e. Mac, Linux, Windows)

In [2]:
os.listdir('data')

['.DS_Store',
 'Ecole Bilingue de Berkeley (37.854830799999995 -122.28937169999999) Primary 08_05_2018 09_04_2018.csv',
 'Ecole Bilingue de Berkeley (37.854830799999995 -122.28937169999999) Secondary 08_05_2018 09_04_2018.csv']

What can we learn from these file names?
* the sensor location is probably the French School in Berkeley.
* Looks like lat / lon coordinates in parens
* the date range is listed
* there is a secondary / primary distinction. 
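The observations above can be turned into a small parser. This is only a sketch: the regular expression is an assumption inferred from the two file names in the listing, and `parse_purpleair_filename` is a hypothetical helper, not something PurpleAir provides.

```python
import re

# Assumed layout, inferred from the two file names listed above:
# "<name> (<lat> <lon>) <Primary|Secondary> <start MM_DD_YYYY> <end MM_DD_YYYY>.csv"
FILENAME_PATTERN = re.compile(
    r"^(?P<name>.+?) \((?P<lat>-?\d+\.\d+) (?P<lon>-?\d+\.\d+)\) "
    r"(?P<kind>Primary|Secondary) "
    r"(?P<start>\d{2}_\d{2}_\d{4}) (?P<end>\d{2}_\d{2}_\d{4})\.csv$"
)

def parse_purpleair_filename(filename):
    """Split a file name into its parts, or return None if it does not match."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        return None
    parts = match.groupdict()
    # Coordinates arrive as text; convert them to numbers
    parts["lat"] = float(parts["lat"])
    parts["lon"] = float(parts["lon"])
    return parts

example = ("Ecole Bilingue de Berkeley (37.854830799999995 -122.28937169999999) "
           "Primary 08_05_2018 09_04_2018.csv")
info = parse_purpleair_filename(example)
print(info)

# Files that do not follow the pattern (like '.DS_Store') are skipped
print(parse_purpleair_filename(".DS_Store"))
```

Returning `None` for non-matching names makes it easy to filter out stray files such as `.DS_Store` before loading anything.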

Before proceeding let's find the size of some of these files:

In [3]:
os.path.getsize('data/Ecole Bilingue de Berkeley (37.854830799999995 -122.28937169999999) Primary 08_05_2018 09_04_2018.csv')

2381187

What are the units?  Let's use shift-tab on `getsize` to find out.

In [4]:
os.path.getsize

<function genericpath.getsize>

Not much information.  Google search reveals [this](https://docs.python.org/2/library/os.path.html) information page, which says the units are bytes.

In [5]:
os.path.getsize('data/Ecole Bilingue de Berkeley (37.854830799999995 -122.28937169999999) Primary 08_05_2018 09_04_2018.csv')/1e6

2.381187

So about 2.4 MB (megabytes). 

In [6]:
os.path.getsize('data/Ecole Bilingue de Berkeley (37.854830799999995 -122.28937169999999) Secondary 08_05_2018 09_04_2018.csv')/1e6

2.497975

Before we go further, what's the primary vs secondary data file?

Checking out the "Using Purple Air data" pdf, provided to me by them, it looks like the two files contain different data.  We'll focus on PM2.5, which is in the primary file.

In this directory there is a python file (`utils.py`) that has some useful utilities -- we'll pull some in over the course of the lecture.  First to use is `line_count`

In [7]:
from utils import line_count

In [8]:
help(line_count)

Help on function line_count in module utils:

line_count(file)
    Computes the number of lines in a file.
    
    file: the file in which to count the lines.
    return: The number of lines in the file



In [9]:
line_count('data/Ecole Bilingue de Berkeley (37.854830799999995 -122.28937169999999) Primary 08_05_2018 09_04_2018.csv')

29894
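The contents of `utils.py` are not shown in this notebook. As a rough idea of what helpers like `line_count` and `head` might do, here is a minimal sketch; the names `line_count_sketch` and `head_sketch` are hypothetical and the real implementations may differ. The demo runs on a small temporary file rather than the PurpleAir data.

```python
import os
import tempfile
from itertools import islice

def line_count_sketch(path):
    """Count lines by iterating over the file, without loading it all at once."""
    with open(path, "r") as f:
        return sum(1 for _ in f)

def head_sketch(path, lines=5):
    """Return the first `lines` lines of the file as a list of strings."""
    with open(path, "r") as f:
        return list(islice(f, lines))

# Small throwaway file so the example is self-contained
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("a,b\n1,2\n3,4\n")
    demo_path = tmp.name

demo_count = line_count_sketch(demo_path)
demo_head = head_sketch(demo_path, lines=2)
os.remove(demo_path)

print(demo_count)
print(demo_head)
```

Checking size and line count before calling `pd.read_csv` is a cheap way to avoid accidentally loading a file that is too large for memory.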

In [10]:
from utils import head

In [11]:
head('data/Ecole Bilingue de Berkeley (37.854830799999995 -122.28937169999999) Primary 08_05_2018 09_04_2018.csv')

['created_at,entry_id,PM1.0_CF_ATM_ug/m3,PM2.5_CF_ATM_ug/m3,PM10.0_CF_ATM_ug/m3,UptimeMinutes,RSSI_dbm,Temperature_F,Humidity_%,PM2.5_CF_1_ug/m3,\n',
 '2018-08-05 00:00:31 UTC,111170,1.96,4.34,4.96,135.00,-67.00,84.00,33.00,4.34\n',
 '2018-08-05 00:01:51 UTC,111171,2.13,3.89,6.83,136.00,-67.00,84.00,33.00,3.89\n',
 '2018-08-05 00:03:11 UTC,111172,3.04,4.93,6.18,137.00,-68.00,84.00,34.00,4.93\n',
 '2018-08-05 00:04:31 UTC,111173,2.17,4.26,6.83,139.00,-65.00,84.00,33.00,4.26\n']

This confirms the file type is .csv, so let's pull it in:

In [12]:
EB_primary = pd.read_csv('data/Ecole Bilingue de Berkeley (37.854830799999995 -122.28937169999999) Primary 08_05_2018 09_04_2018.csv')
EB_primary.head()

Unnamed: 0,created_at,entry_id,PM1.0_CF_ATM_ug/m3,PM2.5_CF_ATM_ug/m3,PM10.0_CF_ATM_ug/m3,UptimeMinutes,RSSI_dbm,Temperature_F,Humidity_%,PM2.5_CF_1_ug/m3,Unnamed: 10
0,2018-08-05 00:00:31 UTC,111170,1.96,4.34,4.96,135.0,-67.0,84.0,33.0,4.34,
1,2018-08-05 00:01:51 UTC,111171,2.13,3.89,6.83,136.0,-67.0,84.0,33.0,3.89,
2,2018-08-05 00:03:11 UTC,111172,3.04,4.93,6.18,137.0,-68.0,84.0,34.0,4.93,
3,2018-08-05 00:04:31 UTC,111173,2.17,4.26,6.83,139.0,-65.0,84.0,33.0,4.26,
4,2018-08-05 00:05:51 UTC,111174,2.06,4.06,8.51,140.0,-67.0,84.0,33.0,4.06,


Several things to note or ask about: 
1. Dates are UTC.
2. Each entry has a unique ID -- could be used to check for time stamp errors or gaps in data
3. Headers have 'CF_ATM' at the top -- what does that mean?
    1. There is one PM2.5 column without 'CF_ATM' (`PM2.5_CF_1_ug/m3`), what is its significance?
    2. From the PurpleAir documentation, in this directory, *"ATM is "atmospheric", meant to be used for outdoor applications. CF=1 is meant to be used for indoor or controlled environment applications. However, PurpleAir uses CF=1 values on the map. This value is lower than the ATM value in higher measured concentrations."*  
    3. The explanation is a little vague and suggests further exploration is required.  It has to do with how changing atmospheric pressure might change the measurements.  
4. The columns "UptimeMinutes" and "RSSI_dbm" are not immediately obvious
    1. Again from the documentation: "UptimeMinutes" is time since last restart, and "RSSI_dbm" is wifi signal strength for the device.  
5. The "Unnamed: 10" column seems useless, why is it there?
    1. Looking at the raw lines, the header ends with a trailing comma just before the "\n" (newline character), while the data lines do not.  That empty final header field appears to be what generates the extra column.
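To see how an empty, unnamed column can arise and one way to remove it, here is a small sketch. It uses a cut-down copy of the first lines of the file (fewer columns, so the extra column is numbered 4 rather than 10); `dropna(axis="columns", how="all")` is just one reasonable option, not necessarily what you would want if a column could be legitimately all-missing.

```python
import io
import pandas as pd

# Cut-down sample: the header ends with a trailing comma, the data rows do not
sample_csv = io.StringIO(
    "created_at,entry_id,PM2.5_CF_ATM_ug/m3,PM2.5_CF_1_ug/m3,\n"
    "2018-08-05 00:00:31 UTC,111170,4.34,4.34\n"
    "2018-08-05 00:01:51 UTC,111171,3.89,3.89\n"
)

df = pd.read_csv(sample_csv)
print(list(df.columns))  # the empty header field becomes an "Unnamed: ..." column

# Drop any column in which every value is missing
cleaned = df.dropna(axis="columns", how="all")
print(list(cleaned.columns))
```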

## Granularity: how are the data aggregated?

We'll talk a little more about Temporality in a moment, but time also matters for thinking about granularity.

First we need to pay attention to the fact that this is UTC.  Let's put it in datetime format to prevent mistakes.

In [13]:
EB_time = pd.to_datetime(EB_primary['created_at'], utc=True)

In [14]:
EB_primary['created_at']=EB_time

In [15]:
EB_primary['created_at'].dtype

datetime64[ns, UTC]

Yes, the `ns` in that response means pandas stores the times with nanosecond resolution, even though the sensor only reports them to the second.  

In [24]:
EB_primary.head()

Unnamed: 0,created_at,entry_id,PM1.0_CF_ATM_ug/m3,PM2.5_CF_ATM_ug/m3,PM10.0_CF_ATM_ug/m3,UptimeMinutes,RSSI_dbm,Temperature_F,Humidity_%,PM2.5_CF_1_ug/m3,Unnamed: 10
0,2018-08-05 00:00:31+00:00,111170,1.96,4.34,4.96,135.0,-67.0,84.0,33.0,4.34,
1,2018-08-05 00:01:51+00:00,111171,2.13,3.89,6.83,136.0,-67.0,84.0,33.0,3.89,
2,2018-08-05 00:03:11+00:00,111172,3.04,4.93,6.18,137.0,-68.0,84.0,34.0,4.93,
3,2018-08-05 00:04:31+00:00,111173,2.17,4.26,6.83,139.0,-65.0,84.0,33.0,4.26,
4,2018-08-05 00:05:51+00:00,111174,2.06,4.06,8.51,140.0,-67.0,84.0,33.0,4.06,


A nice thing about the datetime format is that you can easily get time information out of it.  For example let's look at the entry at row position 1000:

In [27]:
EB_primary.iloc[1000,0].tzinfo

<UTC>

Note, we could rename the columns to make things easier if we wished.  I'm not going to because we're not going to be working with this data set for long, but in other cases you might decide to.
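If you did want shorter names, `DataFrame.rename` with a mapping is one way to do it. The short names below are hypothetical choices for illustration, and the one-row frame is a stand-in for `EB_primary`.

```python
import pandas as pd

# One-row stand-in with a few of the original column names
df = pd.DataFrame({
    "created_at": ["2018-08-05 00:00:31 UTC"],
    "PM2.5_CF_ATM_ug/m3": [4.34],
    "PM2.5_CF_1_ug/m3": [4.34],
    "Temperature_F": [84.0],
    "Humidity_%": [33.0],
})

# Hypothetical short names; columns not listed here keep their original name
short_names = {
    "PM2.5_CF_ATM_ug/m3": "pm25_atm",
    "PM2.5_CF_1_ug/m3": "pm25_cf1",
    "Temperature_F": "temp_f",
    "Humidity_%": "humidity_pct",
}

renamed = df.rename(columns=short_names)
print(list(renamed.columns))
```

Keeping the units out of the column names makes them easier to type, so it is worth recording the units somewhere else (a comment or a small dictionary) when you rename.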

Can we figure out how frequent measurements are?

Unfortunately I found it difficult to take differences with datetime objects, so I had to write a for loop:

In [28]:
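# One slot per row; the loop below never fills the last slot, so it stays 0
diffs = np.zeros(len(EB_primary['created_at']))

# Seconds between each measurement and the one that follows it
for i in range(0, len(diffs)-1):
    diffs[i] = float((EB_primary['created_at'][i+1]- EB_primary['created_at'][i]).total_seconds())

# Sort ascending: smallest gaps first, largest gaps last
diffs = np.sort((diffs))

print('mins:', diffs[0:30])
print('maxes:', diffs[-1:-30:-1])
print('mean:', np.mean(diffs))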

mins: [ 0. 14. 69. 69. 70. 70. 70. 70. 71. 71. 72. 72. 72. 72. 73. 73. 73. 73.
 73. 74. 74. 74. 74. 74. 74. 74. 75. 75. 75. 75.]
maxes: [134161.    931.    486.    481.    403.    384.    338.    334.    325.
    322.    320.    320.    320.    320.    317.    315.    306.    299.
    298.    256.    252.    245.    241.    241.    240.    240.    240.
    240.    240.]
mean: 86.70581741544844


Looks like for the most part we're sampling every 1.5 minutes or so (the mean gap is about 87 seconds), with a few gaps in the data.  The largest gap is 134,161 seconds, or roughly 37 hours.  
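As an alternative to the loop, pandas can take differences of a datetime column directly with `Series.diff`. The sketch below uses the first five timestamps and entry IDs shown earlier rather than the full file, and applies the same idea to `entry_id` to look for skipped IDs.

```python
import pandas as pd

# First five timestamps and entry IDs from the head of the primary file
times = pd.Series(pd.to_datetime([
    "2018-08-05 00:00:31+00:00",
    "2018-08-05 00:01:51+00:00",
    "2018-08-05 00:03:11+00:00",
    "2018-08-05 00:04:31+00:00",
    "2018-08-05 00:05:51+00:00",
], utc=True))
entry_ids = pd.Series([111170, 111171, 111172, 111173, 111174])

# diff() gives Timedeltas (first one is NaT); convert to seconds and drop the NaT
gaps = times.diff().dt.total_seconds().dropna()
print(gaps.tolist())
print(gaps.describe())

# Consecutive IDs should differ by exactly 1; anything else hints at missing rows
id_steps = entry_ids.diff().dropna()
missing_ids = int((id_steps != 1).sum())
print(missing_ids)
```

Because `diff` produces one fewer valid value than there are rows, there is no leftover zero to distort the minimum or the mean.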

## Scope: how much time, how many people, what spatial area?
This is data from one location -- French School in Berkeley.  

From the file name it looks like the time is from early August to early September, let's confirm:

In [29]:
EB_primary['created_at'].min()

Timestamp('2018-08-05 00:00:31+0000', tz='UTC')

In [30]:
EB_primary['created_at'].max()

Timestamp('2018-09-03 23:58:48+0000', tz='UTC')

So it's about one month of data.  
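Subtracting the two timestamps gives the exact span. The sketch below also estimates how many readings that span would hold at an 80-second interval; the 80 seconds is an assumption based on the spacing of the first few rows, not a documented sampling rate.

```python
import pandas as pd

# Min and max timestamps reported above
first = pd.Timestamp("2018-08-05 00:00:31", tz="UTC")
last = pd.Timestamp("2018-09-03 23:58:48", tz="UTC")

# Subtracting two Timestamps yields a Timedelta
span = last - first
print(span)

# Assumed nominal interval of 80 seconds (taken from the first rows of the file)
nominal_interval = pd.Timedelta(seconds=80)
expected_readings = span / nominal_interval
print(round(expected_readings))
```

Comparing that estimate with the actual row count gives a rough sense of how much data is missing.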

Does the data cover the topic of interest?

In this case, we need to answer the question:  For the PurpleAir data, what topic of interest might the data cover?

--> class discussion on this.

## Temporality: How is time represented in the data?
We've already figured out that we're working with UTC dates.  UTC is "Coordinated Universal Time" and is essentially Greenwich Mean Time, the time on the prime meridian.
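Since the sensor is in Berkeley, it can help to view the timestamps in local time, for instance when looking for daily patterns. A minimal sketch using `tz_convert` on the first two timestamps from the file; choosing `America/Los_Angeles` as the zone is my assumption based on the sensor location.

```python
import pandas as pd

# First two timestamps from the file, already timezone-aware in UTC
utc_times = pd.Series(pd.to_datetime([
    "2018-08-05 00:00:31+00:00",
    "2018-08-05 00:01:51+00:00",
], utc=True))

# tz_convert changes how the instant is displayed, not the instant itself
local_times = utc_times.dt.tz_convert("America/Los_Angeles")
print(local_times)

# Just after midnight UTC on August 5 is still the afternoon of August 4 in Berkeley
print(local_times.iloc[0].day, local_times.iloc[0].hour)
```

Keeping the column timezone-aware means daylight saving time is handled by pandas rather than by a hand-written hour offset.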

## Faithfulness: are the data trustworthy?
This one's much harder to assess.  Let's have a look at some basic things we might care about

In [31]:
sum(EB_primary['PM2.5_CF_ATM_ug/m3'].isna())

0

That tells us there are no NaN values in the PM2.5 data.  Impressive!

In [32]:
EB_primary.describe()

Unnamed: 0,entry_id,PM1.0_CF_ATM_ug/m3,PM2.5_CF_ATM_ug/m3,PM10.0_CF_ATM_ug/m3,UptimeMinutes,RSSI_dbm,Temperature_F,Humidity_%,PM2.5_CF_1_ug/m3,Unnamed: 10
count,29893.0,29893.0,29893.0,29893.0,29893.0,29893.0,29893.0,29893.0,29893.0,0.0
mean,126116.0,15.656506,23.983548,28.240713,810.441508,-64.379654,71.434048,49.988726,21.014233,
std,8629.510135,120.382762,121.160645,121.630947,1126.457388,10.010075,4.550603,6.149002,80.844029,
min,111170.0,0.33,1.22,1.31,1.0,-79.0,63.0,27.0,1.22,
25%,118643.0,5.04,7.98,10.43,183.0,-67.0,68.0,45.0,7.98,
50%,126116.0,9.51,14.79,18.54,459.0,-65.0,70.0,52.0,14.79,
75%,133589.0,17.28,28.59,32.89,960.0,-63.0,75.0,55.0,28.4,
max,141062.0,5003.89,5003.89,5003.89,6761.0,31.0,88.0,61.0,3335.44,


That's a pretty high PM2.5 average.  And the max is very suspiciously high.  What's going on?

Options: 
1. Wildfire smoke really pumped up the PM2.5 values
2. We have a lot of missing data and only values during the wildfires
3. There are some erroneously high values.

Let's start by looking at how many values are big.  

In [33]:
log_ind = EB_primary.loc[:,'PM2.5_CF_ATM_ug/m3'] > 150
EB_primary.loc[log_ind,'PM2.5_CF_ATM_ug/m3']

2569     797.96
2570     228.62
2571     256.98
9632    1497.30
9633    5003.89
9634    4999.74
9635    4998.48
9636    5000.70
9637    5000.00
9638    5000.00
9639    5000.00
9640    5000.00
9641    5000.00
9642    4998.30
9643    4996.82
9644    4998.23
9645    4992.67
9646    4998.00
9647    5000.00
9648    4996.89
9649    4996.04
9650    2187.27
9692     201.70
Name: PM2.5_CF_ATM_ug/m3, dtype: float64

Looks like there was a stretch of time with really high values, somewhat suspiciously clustered around 5000.  If I were doing more work here I would look into the sensor more carefully to see if there is any significance to that number.

But for now -- let's just go ahead and replace them with NaN and see what happens:

In [34]:
EB_primary.loc[log_ind,'PM2.5_CF_ATM_ug/m3'] = np.nan
EB_primary.describe()

Unnamed: 0,entry_id,PM1.0_CF_ATM_ug/m3,PM2.5_CF_ATM_ug/m3,PM10.0_CF_ATM_ug/m3,UptimeMinutes,RSSI_dbm,Temperature_F,Humidity_%,PM2.5_CF_1_ug/m3,Unnamed: 10
count,29893.0,29893.0,29870.0,29893.0,29893.0,29893.0,29893.0,29893.0,29893.0,0.0
mean,126116.0,15.656506,20.98395,28.240713,810.441508,-64.379654,71.434048,49.988726,21.014233,
std,8629.510135,120.382762,18.510573,121.630947,1126.457388,10.010075,4.550603,6.149002,80.844029,
min,111170.0,0.33,1.22,1.31,1.0,-79.0,63.0,27.0,1.22,
25%,118643.0,5.04,7.98,10.43,183.0,-67.0,68.0,45.0,7.98,
50%,126116.0,9.51,14.765,18.54,459.0,-65.0,70.0,52.0,14.79,
75%,133589.0,17.28,28.55,32.89,960.0,-63.0,75.0,55.0,28.4,
max,141062.0,5003.89,115.95,5003.89,6761.0,31.0,88.0,61.0,3335.44,


You can see the average came down a little (from about 24.0 to 21.0), and the standard deviation came *really* far down (from about 121 to 18.5).  And as we'd hope the max is now below 150.  
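Note that the other PM columns in the table above still show maxima in the thousands, since only the PM2.5 ATM column was masked. One way to apply the same cutoff to every PM column at once is sketched below on a small synthetic frame; reusing the 150 threshold for PM1.0 and PM10 is an assumption that would need checking against what is plausible for each size class.

```python
import pandas as pd

# Synthetic stand-in: the middle row mimics the suspicious ~5000 readings
df = pd.DataFrame({
    "PM1.0_CF_ATM_ug/m3": [1.96, 5003.89, 2.13],
    "PM2.5_CF_ATM_ug/m3": [4.34, 5003.89, 3.89],
    "PM10.0_CF_ATM_ug/m3": [4.96, 5003.89, 6.83],
    "Temperature_F": [84.0, 84.0, 84.0],
})

THRESHOLD = 150  # same cutoff used for PM2.5 above; assumed here for all PM columns

# Select the particulate columns by name prefix
pm_cols = [c for c in df.columns if c.startswith("PM")]

# mask() replaces values where the condition is True with NaN
df[pm_cols] = df[pm_cols].mask(df[pm_cols] > THRESHOLD)
print(df.describe())
```

Setting values to NaN rather than deleting rows keeps the timestamps intact, so the gap in valid readings stays visible in later analysis.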