## Download and Pre-process Purple Air Data

Last updated on 07/06/2023


This notebook demonstrates how to download and pre-process PurpleAir real-time monitoring data (60-minute averages). It covers:

- Downloading PurpleAir data for a user-defined list of sensor locations through the API
- Organizing the data into a pandas DataFrame
- Converting the PM2.5 concentrations to AQI
- Saving the DataFrame to a CSV file for further analysis


User-defined variables

In [1]:
# A text file that contains a list of sensor URLs
sensorlist  = "drive/MyDrive/purpleair_chicago.txt"
# Output file
outfilename = 'drive/MyDrive/purpleair_chicago_20230626.csv'

# Time window
from datetime import datetime
t_start = datetime(year=2023, month=6, day=26, hour=13)
t_end   = datetime(year=2023, month=6, day=26, hour=14)

Each sensor URL can be found by clicking the sensor on map.purpleair.com. The list file should contain one URL per line.

## Google Colab Environment

In [2]:
pip install purpleair

Collecting purpleair
  Downloading purpleair-0.0.4-py3-none-any.whl (6.9 kB)
Installing collected packages: purpleair
Successfully installed purpleair-0.0.4


In [3]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


## Download the Purple Air Data

Access to the PurpleAir API:
- API basics: https://api.purpleair.com/
- API key usage: payment has to be loaded into your project account before API calls will succeed; see https://community.purpleair.com/t/paymentrequirederror-making-api-calls/3971

In [4]:
from purpleair import PurpleAir
from datetime import datetime
import pandas as pd

In [5]:
p = PurpleAir('5C0CE324-F8DD-11ED-BD21-42010A800008')

Let's test with a single sensor. Click a sensor on the map and read its ID from the `select=` parameter of the URL (177269 in the example below).

In [6]:
url = 'https://map.purpleair.com/1/mAQI/a10/p604800/cC0?select=177269#11.39/64.8175/-147.7986'
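
The sensor ID can also be read from the URL programmatically. The following is a minimal sketch using only the standard library; the helper name `sensor_id_from_url` is hypothetical, and it assumes the ID is always carried in the `select` query parameter, as in the example URL.

```python
from urllib.parse import urlparse, parse_qs


def sensor_id_from_url(url: str) -> str:
    """Return the value of the `select` query parameter of a map URL.

    Hypothetical helper: assumes the sensor ID is stored in `select=...`.
    """
    query = parse_qs(urlparse(url).query)
    if "select" not in query:
        raise ValueError(f"no 'select' parameter in URL: {url}")
    return query["select"][0]


example_url = (
    "https://map.purpleair.com/1/mAQI/a10/p604800/cC0"
    "?select=177269#11.39/64.8175/-147.7986"
)
sensor_id = sensor_id_from_url(example_url)
print(sensor_id)
```

Parsing the query string avoids depending on the exact path segment (`cC0`) that precedes `?select`.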

In [7]:
# Download the current values
d1=p.get_sensor_data('177269')

In [8]:
# Download the past data
d2=p.get_sensor_history(sensor_index=177269, fields=('pm2.5_cf_1'), start_timestamp=t_start,end_timestamp=t_end)
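
The time window is built from naive `datetime` objects, which carry no time zone, so the hour that is actually requested depends on how the timestamps are interpreted. A minimal sketch of making the window explicit, assuming the intended window is 13:00 to 14:00 Central Daylight Time (UTC-5); the offset and the variable names are assumptions, not something this notebook states.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the window is meant in Central Daylight Time (fixed UTC-5 offset)
CDT = timezone(timedelta(hours=-5), "CDT")

t_start_local = datetime(2023, 6, 26, 13, tzinfo=CDT)
t_end_local = datetime(2023, 6, 26, 14, tzinfo=CDT)

# Time-zone-aware datetimes convert unambiguously to UNIX timestamps (seconds)
start_ts = int(t_start_local.timestamp())
end_ts = int(t_end_local.timestamp())

print(t_start_local.astimezone(timezone.utc).isoformat())
print(end_ts - start_ts, "seconds in the window")
```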

In [9]:
import numpy as np
d = np.array(d2['data'])
# Column 1 holds the requested field; only the last expression is displayed
np.mean(d[:, 1])
d[:, 1]

array([0., 0., 0., 0., 0., 0.])

Read a text file that has the list of sensors

In [10]:
# Read the text file and extract the sensor ID from each URL
id_list = []
with open(sensorlist, "r") as f:
    for fline in f:
        # Skip blank or very short lines
        if len(fline) > 10:
            # Turn '=', '#' and the newline into '/' so the URL splits into fields
            fline = fline.replace('\n', '/')
            fline = fline.replace('=', '/')
            fline = fline.replace('#', '/')
            slist = fline.split('/')
            # The sensor ID is the field right after 'cC0?select'
            idx = slist.index('cC0?select')
            id_list.append(slist[idx + 1])

In [11]:
# Check the list of IDs (unique values only; id_list itself is unchanged)
df = pd.DataFrame({'id': id_list})
df = df.drop_duplicates()
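
Because the download loop iterates over `id_list` rather than `df`, repeated IDs in the sensor file (such as 95527 and 181673 in the final table) are fetched and stored twice. A minimal sketch of removing duplicates from a list while keeping the original order; the helper name `unique_in_order` and the sample IDs are illustrative only.

```python
def unique_in_order(items):
    """Drop repeated items, keeping the first occurrence of each."""
    # dict preserves insertion order, so its keys are the unique items in order
    return list(dict.fromkeys(items))


ids = ["95527", "181673", "95527", "181673", "97679"]
unique_ids = unique_in_order(ids)
print(unique_ids)
```

Assigning the result back to `id_list` before the download loop would avoid the duplicate API calls.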

In [12]:
# Containers for the mean PM2.5 values and the sensor coordinates
z_list = []
lat_list = []
lon_list = []

In [13]:
# Get data for all the sensors
import time
for ix in id_list:
    # Mean of pm2.5_cf_1 over the time window (NaN if no data is returned)
    d = p.get_sensor_history(sensor_index=ix, fields=('pm2.5_cf_1'), start_timestamp=t_start, end_timestamp=t_end)
    data = np.array(d['data'])
    if len(d['data']) > 0:
        z = np.mean(data[:, 1])
    else:
        z = np.nan
    z_list.append(z)
    # The current sensor record provides the coordinates
    d = p.get_sensor_data(ix)
    lat_list.append(d['sensor']['latitude'])
    lon_list.append(d['sensor']['longitude'])
    # Pause between sensors to limit the request rate
    time.sleep(1)
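
The averaging step in the loop can be isolated into a small function, which makes the empty-window case easy to check without calling the API. This is a sketch under the assumption, inferred from the `data[:,1]` indexing above, that each history row is `[timestamp, value]`; the name `window_mean` and the handling of `None` entries are hypothetical additions.

```python
import math
from statistics import fmean


def window_mean(rows, column=1):
    """Mean of one column over the returned rows, or NaN when nothing usable exists.

    Assumes each row is a sequence such as [timestamp, value].
    """
    # Skip missing entries (assumption: a missing reading may appear as None)
    values = [row[column] for row in rows if row[column] is not None]
    return fmean(values) if values else math.nan


print(window_mean([[0, 1.0], [600, 3.0]]))
print(window_mean([]))
```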

In [14]:
df = {'id':id_list, 'Latitude':lat_list, 'Longitude':lon_list,'pm2.5_60minute':z_list}
df = pd.DataFrame(df)

## Preprocess the Purple Air data

We convert the PM2.5 concentrations to AQI, drop invalid rows, and save the DataFrame to the output file specified above.

In [15]:
import numpy as np

In [16]:
# pm25 to AQI
# https://community.purpleair.com/t/how-to-calculate-the-us-epa-pm2-5-aqi/877

def aqiFromPM(pm):
    #                                   AQI         RAW PM2.5
    # Good                               0 - 50   |    0.0 - 12.0
    # Moderate                          51 - 100  |   12.1 - 35.4
    # Unhealthy for Sensitive Groups   101 - 150  |   35.5 - 55.4
    # Unhealthy                        151 - 200  |   55.5 - 150.4
    # Very Unhealthy                   201 - 300  |  150.5 - 250.4
    # Hazardous                        301 - 400  |  250.5 - 350.4
    # Hazardous                        401 - 500  |  350.5 - 500.4

    # NaN or negative input fails every comparison below and returns -99999
    if pm > 350.5:
        return calcAQI(pm, 500, 401, 500.4, 350.5) # Hazardous
    elif pm > 250.5:
        return calcAQI(pm, 400, 301, 350.4, 250.5) # Hazardous
    elif pm > 150.5:
        return calcAQI(pm, 300, 201, 250.4, 150.5) # Very Unhealthy
    elif pm > 55.5:
        return calcAQI(pm, 200, 151, 150.4, 55.5) # Unhealthy
    elif pm > 35.5:
        return calcAQI(pm, 150, 101, 55.4, 35.5) # Unhealthy for Sensitive Groups
    elif pm > 12.1:
        return calcAQI(pm, 100, 51, 35.4, 12.1) # Moderate
    elif pm >= 0:
        return calcAQI(pm, 50, 0, 12, 0) # Good
    else:
        return -99999

def calcAQI(Cp, Ih, Il, BPh, BPl):
    # Linear interpolation between the category breakpoints
    a = (Ih - Il)
    b = (BPh - BPl)
    c = (Cp - BPl)
    return np.round((a/b) * c + Il)
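
`calcAQI` is a linear interpolation between the breakpoints of the category that contains the concentration. Written out with the function's argument names, followed by a hand check for the 14.300833 reading of sensor 41993 (Moderate category), which reproduces the AQI of 56 shown in the table below:

```latex
I = \frac{I_h - I_l}{BP_h - BP_l}\,\left(C_p - BP_l\right) + I_l

I = \frac{100 - 51}{35.4 - 12.1}\,\left(14.300833 - 12.1\right) + 51
  \approx 55.6 \;\rightarrow\; 56
```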

In [17]:
# Apply the conversion to every row
df['AQI'] = df['pm2.5_60minute'].apply(aqiFromPM)
df

|     | id     | Latitude  | Longitude  | pm2.5_60minute | AQI      |
|-----|--------|-----------|------------|----------------|----------|
| 0   | 97679  | 43.310467 | -86.185905 | 2.839167       | 12.0     |
| 1   | 180239 | 43.223440 | -86.277800 | 11.041833      | 46.0     |
| 2   | 41993  | 43.078312 | -86.197550 | 14.300833      | 56.0     |
| 3   | 97713  | 42.964836 | -86.169370 | NaN            | -99999.0 |
| 4   | 53131  | 42.846397 | -86.132560 | 19.532667      | 67.0     |
| ... | ...    | ...       | ...        | ...            | ...      |
| 64  | 49069  | 43.083280 | -89.518390 | 34.825833      | 99.0     |
| 65  | 177179 | 43.050896 | -89.338020 | 44.281167      | 123.0    |
| 66  | 183047 | 41.370865 | -85.043106 | 17.863833      | 63.0     |
| 67  | 185213 | 41.666490 | -86.187790 | NaN            | -99999.0 |


In [18]:
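# Drop rows with missing data (AQI = -99999) and any AQI outside the range 1 to 400
df = df.drop(df[df.AQI < 1].index)
df = df.drop(df[df.AQI > 400].index)
# Save the filtered table
df.to_csv(outfilename)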
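
The same filter can be written as a single boolean mask, and passing `index=False` to `to_csv` keeps the row index out of the file, so that reading it back does not produce an extra `Unnamed: 0` column. A minimal sketch on a two-row stand-in table (the values are copied from the table above; writing to an in-memory buffer instead of `outfilename` is only for illustration):

```python
import io

import pandas as pd

# Stand-in for the real table: one valid row and one row with the -99999 sentinel
sample = pd.DataFrame({"id": ["97679", "97713"], "AQI": [12.0, -99999.0]})

# between() is inclusive on both ends, so this keeps 1 <= AQI <= 400
valid = sample[sample["AQI"].between(1, 400)]

# Write without the index column
buffer = io.StringIO()
valid.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```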

In [19]:
df

|    | id     | Latitude  | Longitude  | pm2.5_60minute | AQI  |
|----|--------|-----------|------------|----------------|------|
| 0  | 97679  | 43.310467 | -86.185905 | 2.839167       | 12.0 |
| 1  | 180239 | 43.22344  | -86.2778   | 11.041833      | 46.0 |
| 2  | 41993  | 43.078312 | -86.19755  | 14.300833      | 56.0 |
| 4  | 53131  | 42.846397 | -86.13256  | 19.532667      | 67.0 |
| 5  | 92021  | 42.848255 | -85.78783  | 23.410333      | 75.0 |
| 6  | 147200 | 42.644955 | -86.15989  | 14.321667      | 56.0 |
| 7  | 95527  | 42.394768 | -85.888084 | 13.654667      | 54.0 |
| 8  | 181673 | 41.893898 | -86.61754  | 19.048167      | 66.0 |
| 9  | 95527  | 42.394768 | -85.888084 | 13.654667      | 54.0 |
| 13 | 181673 | 41.893898 | -86.61754  | 19.048167      | 66.0 |
