# Analyzing Air Quality Data from Planet OS Datahub using Python Pandas and Plotly

In this example we will use Python 3 with [Pandas](https://pandas.pydata.org/) and [Plotly](https://plotly.com/python/) to analyze air quality data from datasets shared in Planet OS Datahub. We will:
- find out which observational air quality datasets are available in [Datahub using Search & Discovery Endpoints](http://docs.planetos.com/#search-amp-discovery-endpoints)
- see all the stations on the map from [EPA AirNow hourly](https://data.planetos.com/datasets/epa_airnow_hourly) and [EEA Air Quality](https://data.planetos.com/datasets/eea_airquality_europe) datasets
- find out which variables a dataset provides
- query data for a single station, compute daily values with Pandas and visualize them with Plotly
- query data for a larger area, compute daily and monthly values, and visualize them
- find maximum values and the stations where they occurred
- analyze the stations with maximum values over the last year

In [77]:
import pandas as pd
import requests
import numpy as np
import plotly.express as px
import matplotlib.pyplot as plt
import time
import matplotlib as mpl
import plotly.graph_objects as go
from pprint import pprint
from itertools import cycle
import datetime
mpl.__version__

'3.3.3'

In [78]:
API_key = open('APIKEY').readlines()[0].strip() #'<YOUR API KEY HERE>'

In [79]:
import plotly
print (plotly.__version__)

4.12.0


In [80]:
search_text = "air"
product_type = "In situ observation"
search_link = 'https://api.planetos.com/v1/search/text?q={0}+product_type:"{1}"&apikey={2}'.format(search_text,product_type,API_key)
search_response = requests.request("GET", search_link)
search_data = search_response.json()
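
The search link above is assembled by hand, so escaping the spaces and quotes in the query is left to the HTTP library. As a minimal sketch, assuming the endpoint accepts a standard URL-encoded query string, the same URL can be built with the standard library. `build_search_url` is a hypothetical helper written for this example, not part of the Datahub API.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

SEARCH_ENDPOINT = "https://api.planetos.com/v1/search/text"  # same endpoint as above

def build_search_url(search_text, product_type, api_key):
    """Return the search URL with the query and API key percent-encoded."""
    query = '{0} product_type:"{1}"'.format(search_text, product_type)
    return SEARCH_ENDPOINT + "?" + urlencode({"q": query, "apikey": api_key})

# "DEMO_KEY" is a placeholder, not a real key
demo_url = build_search_url("air", "In situ observation", "DEMO_KEY")

# Decoding the URL again recovers the original query text
decoded = parse_qs(urlsplit(demo_url).query)
print(decoded["q"][0])
```

Encoding the parameters explicitly keeps the request predictable if the search text later contains characters such as `&` or `#`.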

In [81]:
namespaces,dataset_names = zip(*[(i['key'], i['title']) for i in search_data['results']])

These are the in situ air quality datasets available in Planet OS Datahub. As the list below shows, there are two datasets from the EPA: one is specific to air quality, while the other has some air quality variables but also more general weather variables. The EPA Hourly dataset, however, has the longer history.

This time we will use the U.S. Environmental Protection Agency’s (EPA) Hourly Data and the European Environment Agency Air Quality Dataset.

In [82]:
pprint(dataset_names)

('European Environment Agency Air Quality Dataset',
 'The U.S. Environmental Protection Agency’s (EPA) Air Quality',
 'The U.S. Environmental Protection Agency’s (EPA) Hourly Data',
 'Weekly mean carbon dioxide measured at Mauna Loa Observatory, Hawaii')


Before getting to the real work, we will define some helper functions to make the rest easier: one to get station information, one to get the data, one to get the variables, and one to convert the data to a pandas DataFrame.

In [83]:
def get_stations(region_def,region,var,dataset,API_key):
    if region == 'all':
        url = 'https://api.planetos.com/v1/datasets/{0}/stations?apikey={1}'.format(dataset,API_key)
    else:
        coordinates = region_def[region]
        lon_w = coordinates['longitude_west']; lon_e = coordinates['longitude_east']; lat_n = coordinates['latitude_north']; lat_s = coordinates['latitude_south']
        url = 'https://api.planetos.com/v1/datasets/' + dataset + '/subdatasets?apikey=' + API_key + '&geometry={"type":"Polygon","coordinates":[' + str([[lon_w,lat_n],[lon_e,lat_n],[lon_e,lat_s],[lon_w,lat_s],[lon_w,lat_n]]) + ']}'
    stations_response = requests.request("GET", url)
    stations_json = stations_response.json()
    stations = {}
    if region == 'all':
        for station in stations_json['station']:
            stations[station] = {'longitude':stations_json['station'][station]['SpatialExtent']['coordinates'][0],'latitude':stations_json['station'][station]['SpatialExtent']['coordinates'][1]}
    else:   
        for subdataset in stations_json['subdatasets']:
            station = [n['attributeValue'] for n in subdataset['attributes'] if n['attributeKey'] == 'station'][0]
            stations[station] = {'longitude':subdataset['spatialCoverage']['coordinates'][0],'latitude':subdataset['spatialCoverage']['coordinates'][1]}
    
    stations_data = pd.DataFrame.from_dict(stations).transpose()
    stations_data = stations_data.reset_index().rename(columns={'index':'station'})

    return stations_data

def convert_json_to_some_pandas(injson):
    # Collect the per-entry 'axes' and 'data' dicts into two DataFrames
    axes = pd.DataFrame([i['axes'] for i in injson])
    data = pd.DataFrame([i['data'] for i in injson])
    pd_temp = pd.DataFrame(injson)
    # Keep 'indexAxes' only when the response provides it
    keep = ['context', 'axes', 'indexAxes'] if 'indexAxes' in pd_temp else ['context', 'axes']
    dev_frame = pd_temp[keep].join(pd.concat([axes, data], axis=1))
    if 'time' in dev_frame:
        dev_frame['time'] = pd.to_datetime(dev_frame['time'])
    return dev_frame

def get_data(years,stations, dataset,var,API_key):
    data = []
    for station in stations['station']:
        #print (station)
        st_data_by_year = []
        for year in years:
            start = str(year) + '-01-01T00:00:00'
            end = str(year+1) + '-01-01T00:00:00'
            if var == '':
                url = "https://api.planetos.com/v1/datasets/{0}/stations/{1}?apikey={2}&count=10000&time_start={3}&time_end={4}".format(dataset,station,API_key,start,end)
            else:
                url = "https://api.planetos.com/v1/datasets/{0}/stations/{1}?apikey={2}&var={3},latitude,longitude,lon,lat&count=10000&time_start={4}&time_end={5}".format(dataset,station,API_key,var,start,end)
            station_data_response = requests.request("GET", url)
            st_data = station_data_response.json()['entries']
            if st_data:
                pd_d = convert_json_to_some_pandas(st_data)
                pd_d=pd_d.drop(columns=['context','axes'])
                st_data_by_year.append(pd_d)
        if st_data_by_year:
            st_data_by_year_pd = pd.concat(st_data_by_year)
            st_data_by_year_pd['station'] = station
            data.append(st_data_by_year_pd)
    if data:
        data_pd = pd.concat(data,ignore_index=True)
    else:
        data_pd = pd.DataFrame()
    return data_pd

def get_variables(dataset,API_key):
    url = 'https://api.planetos.com/v1/datasets/{0}/variables?apikey={1}'.format(dataset,API_key)
    response = requests.request("GET", url)
    response_json = response.json()
    variables = [n['variableKey'] for n in response_json['variables'] if n['variableType'] == 'data']
    return variables
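
To make the conversion step easier to follow, here is a small sketch of the flattening that `convert_json_to_some_pandas` performs. The entries below are made up and only mimic the shape the helper relies on (an `axes` dict and a `data` dict per entry); a real Datahub response may contain more fields. `flatten_entries` is a hypothetical name used only for this illustration.

```python
import pandas as pd

# Made-up entries with the same 'axes' / 'data' layout the helpers above expect
sample_entries = [
    {"context": "demo",
     "axes": {"time": "2014-01-01T00:00:00", "latitude": 59.41, "longitude": 24.65},
     "data": {"PM2.5": 17.5, "SO2": 3.8}},
    {"context": "demo",
     "axes": {"time": "2014-01-01T01:00:00", "latitude": 59.41, "longitude": 24.65},
     "data": {"PM2.5": None, "SO2": 2.3}},
]

def flatten_entries(entries):
    """Put the axes and data values of every entry side by side in one row."""
    axes = pd.DataFrame([e["axes"] for e in entries])
    data = pd.DataFrame([e["data"] for e in entries])
    frame = pd.concat([axes, data], axis=1)
    if "time" in frame:
        frame["time"] = pd.to_datetime(frame["time"])
    return frame

flat = flatten_entries(sample_entries)
print(flat)
```

Each entry becomes one row, and a missing measurement simply shows up as NaN in its column.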

## Showing stations on map

Let's get all the stations from those two datasets and put them on a map. For mapping, we will use Plotly.
We can see that the US and Europe are well covered with stations. Keep in mind, however, that many stations do not cover the whole time period. Air quality stations do not have a common, internationally standardized network as weather stations do. Unlike weather stations, they are not meant to represent a larger area; each measurement describes only its point of interest.

In [84]:
all_epa_stations = get_stations(None,'all','PM2.5','epa_airnow_hourly',API_key)
all_eea_stations = get_stations(None,'all','PM2.5','eea_airquality_europe',API_key)
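
As a small illustration of the table that `get_stations` returns, the sketch below turns a dictionary of stations into a DataFrame with one row per station. The station names and coordinates are invented for the example, and `from_dict(..., orient="index")` is used here as an alternative to the transpose in the helper above.

```python
import pandas as pd

# Made-up station identifiers and coordinates, for illustration only
stations = {
    "demo-station-1": {"longitude": 24.65, "latitude": 59.41},
    "demo-station-2": {"longitude": 13.40, "latitude": 52.52},
}

# One row per station, with the dictionary key moved into a 'station' column
stations_table = (
    pd.DataFrame.from_dict(stations, orient="index")
    .rename_axis("station")
    .reset_index()
)
print(stations_table)
```

The resulting `station`, `longitude` and `latitude` columns are what the map traces below read from.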

In [85]:
fig = go.Figure()
fig.add_trace(go.Scattermapbox(
    lat=all_epa_stations.latitude, lon=all_epa_stations.longitude,
    mode='markers',
    marker=go.scattermapbox.Marker(color='#EC5840',size=4),
    text=all_epa_stations.station,
    hoverinfo = 'text'
    ))

fig.add_trace(go.Scattermapbox(
    lat=all_eea_stations.latitude, lon=all_eea_stations.longitude,
    mode='markers',
    marker=go.scattermapbox.Marker(color='#4E2F90',size=4),
    text=all_eea_stations.station,
    hoverinfo = 'text'
    ))

fig.update_layout(mapbox_style="open-street-map",autosize=True,showlegend=False,height=500, margin={"r":0,"t":0,"l":0,"b":0}) 
fig.show()

## Getting data from desired station
First, let's see how to work with data from a single station. For example, I am interested in a station located in Tallinn, Estonia. Zooming in on Tallinn, I found a station called STA-EE0018A, which is in the city center. It helps to know which variables the dataset has, so it is easier to decide what to query.

In [86]:
pprint (get_variables('eea_airquality_europe',API_key))

['lat',
 'lon',
 'SO2',
 'NO2',
 'PM10',
 'NO',
 'CO',
 'O3',
 'NOXasNO2',
 'C6H6',
 'PM2.5',
 'Validity_Pb',
 'Pb',
 'Verification_Pb',
 'Validity_Ni',
 'Ni',
 'Verification_Ni']


Below we will get data from the station STA-EE0018A for 2014-2020 and we are interested in some air quality variables - SO2, CO, PM10 and PM2.5. 

In [87]:
years = np.arange(2014,2021)
variables = ['SO2','CO','PM10','PM2.5']
station_data = get_data(years,{'station':['STA-EE0018A']}, 'eea_airquality_europe',','.join(variables),API_key)

We can see the pandas DataFrame below.

In [88]:
station_data

| | time | latitude | longitude | SO2 | PM10 | CO | PM2.5 | station |
|---|---|---|---|---|---|---|---|---|
| 0 | 2014-01-01 00:00:00 | 59.414169 | 24.649458 | 3.800000 | 21.90000 | 0.280000 | 17.5000 | STA-EE0018A |
| 1 | 2014-01-01 01:00:00 | 59.414169 | 24.649458 | 2.300000 | NaN | 0.290000 | NaN | STA-EE0018A |
| 2 | 2014-01-01 02:00:00 | 59.414169 | 24.649458 | 1.600000 | NaN | 0.300000 | NaN | STA-EE0018A |
| 3 | 2014-01-01 03:00:00 | 59.414169 | 24.649458 | 1.800000 | NaN | 0.280000 | NaN | STA-EE0018A |
| 4 | 2014-01-01 04:00:00 | 59.414169 | 24.649458 | 1.700000 | NaN | 0.260000 | NaN | STA-EE0018A |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 61159 | 2020-12-31 20:00:00 | 59.414169 | 24.649458 | 0.143733 | 10.33200 | 0.270945 | 8.6100 | STA-EE0018A |
| 61160 | 2020-12-31 21:00:00 | 59.414169 | 24.649458 | 0.151794 | 10.33200 | 0.252503 | 8.6100 | STA-EE0018A |
| 61161 | 2020-12-31 22:00:00 | 59.414169 | 24.649458 | 0.174630 | 9.90150 | 0.280111 | 8.1795 | STA-EE0018A |
| 61162 | 2020-12-31 23:00:00 | 59.414169 | 24.649458 | 0.306273 | 9.47099 | 0.225060 | 7.7490 | STA-EE0018A |


We will compute daily means to visualize the data. For that we will use pandas resample, which requires the time variable to be the index.

In [89]:
station_data = station_data.set_index('time')
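# Average only the numeric columns: pandas 2.0+ raises an error for the string 'station' column
daily_station = station_data.select_dtypes('number').resample('1D').mean()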
daily_station.reset_index(inplace=True)
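
To see what the daily resampling does, here is a minimal sketch on synthetic data: 48 hourly values, 10.0 on the first day and 20.0 on the second. The column names mirror the ones used above, but the numbers are invented.

```python
import pandas as pd

# Two days of made-up hourly values
hours = pd.date_range("2020-01-01", periods=48, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({
    "time": hours,
    "PM2.5": [10.0] * 24 + [20.0] * 24,
    "station": "demo",
})

# resample needs a time index; select the numeric column before averaging
daily = toy.set_index("time")[["PM2.5"]].resample("1D").mean().reset_index()
print(daily)
```

The 48 hourly rows collapse into two daily rows with means 10.0 and 20.0.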

As we want to use specific colors we define them before making plots. 

In [90]:
palette = cycle(['#0030A0',
                  '#F4B63F',
                  '#4779EC',
                  '#a3a7b0',
                  '#1B9AA0',
                  '#EC5840',
                  '#abd8f4',
                  '#4E2F90',
                  '#6ac3ec',
                  '#FFC0CB',
                  '#98FB98'])
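
A short sketch of how `cycle` behaves, using a shortened three-color list for illustration: `next` hands out the colors in order and starts over once the list is exhausted. Because the palette above is a single shared iterator, each later figure continues from wherever the previous one stopped.

```python
from itertools import cycle

# Shortened palette, only to demonstrate the wrap-around
demo_palette = cycle(["#0030A0", "#F4B63F", "#4779EC"])

# Asking for five colors from a three-color cycle repeats the first two
picked = [next(demo_palette) for _ in range(5)]
print(picked)
# → ['#0030A0', '#F4B63F', '#4779EC', '#0030A0', '#F4B63F']
```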

The figure below shows the daily mean values of SO2, CO, PM10 and PM2.5. To focus on specific variables, hide the others by clicking their names in the legend on the right. To take a closer look, zoom into the figure.

In [91]:
fig = go.Figure()
for variable in variables:
    fig.add_traces(go.Scatter(x=daily_station.time, y=daily_station[variable], mode='lines', name = variable,marker_color=next(palette)))
fig.show()

Next, we will look at how to get data for an area of interest. We first define the area, this time as a simple rectangle, although Datahub also supports more complex polygons.
We will define one area around New York in the US and one around Berlin in Europe, so we will use two different datasets: EPA AirNow Hourly and EEA Air Quality Europe.

In [92]:
populated_cities = {'New York':{'longitude_east':-73.8,'longitude_west':-74.3,'latitude_north':40.8,'latitude_south':40.5}, 'Berlin':{'longitude_east':13.7,'longitude_west':13.1,'latitude_north':52.7,'latitude_south':52.4}}
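
The rectangle is sent to the API as a polygon whose first and last corner coincide, which is what `get_stations` builds by string concatenation. A minimal sketch of the same geometry produced with `json.dumps` follows; `bbox_to_geojson` is a hypothetical helper name, and the corner order simply copies the one used in `get_stations`.

```python
import json

def bbox_to_geojson(box):
    """Turn a bounding-box dict into a closed GeoJSON polygon string."""
    west, east = box["longitude_west"], box["longitude_east"]
    north, south = box["latitude_north"], box["latitude_south"]
    # Corners as [longitude, latitude]; the first corner is repeated to close the ring
    ring = [[west, north], [east, north], [east, south], [west, south], [west, north]]
    return json.dumps({"type": "Polygon", "coordinates": [ring]})

berlin_box = {"longitude_east": 13.7, "longitude_west": 13.1,
              "latitude_north": 52.7, "latitude_south": 52.4}
berlin_geometry = bbox_to_geojson(berlin_box)
print(berlin_geometry)
```

A more complex area would only need a longer list of corners in the same structure.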

Below, we will get all the stations that are in the area of interest.

In [93]:
ny_stations = get_stations(populated_cities,'New York','PM2.5','epa_airnow_hourly',API_key)
berlin_stations = get_stations(populated_cities,'Berlin','PM2.5','eea_airquality_europe',API_key)

Now we will query the data. Keep in mind that querying data from several stations for seven years at high temporal resolution might take some time. If you don't feel like waiting, just change the years.

In [94]:
%%time
years = np.arange(2014,2021)
berlin_data = get_data(years,berlin_stations, 'eea_airquality_europe','PM2.5',API_key)

CPU times: user 15 s, sys: 738 ms, total: 15.8 s
Wall time: 11min 35s


In [95]:
%%time
ny_data = get_data(years,ny_stations, 'epa_airnow_hourly','PM2.5',API_key)

CPU times: user 9.49 s, sys: 485 ms, total: 9.97 s
Wall time: 4min 12s
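
Part of the waiting time comes from the number of requests: `get_data` sends one request per station and per year. The sketch below reproduces the yearly time windows it builds, so the request count is easy to estimate before starting a long query. `year_windows` is a hypothetical helper and the station count is an invented example value.

```python
def year_windows(years):
    """Return (time_start, time_end) strings covering each year, as built in get_data."""
    return [("{0}-01-01T00:00:00".format(y), "{0}-01-01T00:00:00".format(y + 1))
            for y in years]

windows = year_windows(range(2014, 2021))
print(windows[0])   # → ('2014-01-01T00:00:00', '2015-01-01T00:00:00')
print(len(windows)) # → 7

# Example estimate with an assumed number of stations
n_stations = 10
n_requests = n_stations * len(windows)
```

Shortening the list of years reduces the number of requests proportionally.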


Here we will drop NaN values and set time as the index, so we can resample the data later.

In [96]:
ny_data = ny_data.set_index('time')
ny_data.dropna(inplace=True)

berlin_data = berlin_data.set_index('time')
berlin_data.dropna(inplace=True)

Here, we will group the data by station. For this, [pandas groupby](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) is a good way to go.
After grouping, we can work with the data station by station. We will plot the stations on the map, with marker size indicating the overall mean value. The highest mean value in Berlin (19.05) is at station STA.DE_DEBE065. It is quite close to the Berlin city center; however, there are multiple stations in the region and we can't say why the mean value at this exact location is higher.

In [97]:
mean_data_berlin = pd.DataFrame({'mean_PM2.5' : berlin_data.groupby( ["station", "longitude","latitude"] )['PM2.5'].mean()}).reset_index()
fig = px.scatter_mapbox(mean_data_berlin, lat="latitude", lon="longitude",hover_data=["mean_PM2.5","station"],color_discrete_sequence=["#F4B63F"],size="mean_PM2.5")
fig.update_layout(mapbox_style="open-street-map") 
fig.show()
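
The per-station means used for the marker sizes can be illustrated on a tiny invented table. This sketch uses `reset_index(name=...)` to name the result column, which is an equivalent alternative to wrapping the grouped result in a new DataFrame as done above.

```python
import pandas as pd

# Invented measurements for two stations
toy = pd.DataFrame({
    "station": ["A", "A", "B"],
    "longitude": [13.3, 13.3, 13.5],
    "latitude": [52.5, 52.5, 52.6],
    "PM2.5": [10.0, 20.0, 30.0],
})

# One row per station with its coordinates and overall mean
means = (
    toy.groupby(["station", "longitude", "latitude"])["PM2.5"]
    .mean()
    .reset_index(name="mean_PM2.5")
)
print(means)
```

Grouping by the coordinates as well as the station keeps them available as columns for plotting.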

Below is the map of the New York stations. Station US:UNION:340390004 has the highest mean PM2.5 value (10.44). It is located quite close to the airport, which might explain the high mean values, but of course many other factors can play a role as well.

In [98]:
mean_data_ny = pd.DataFrame({'mean_PM2.5' : ny_data.groupby( ["station", "longitude","latitude"] )['PM2.5'].mean()}).reset_index()
fig = px.scatter_mapbox(mean_data_ny, lat="latitude", lon="longitude",hover_data=["mean_PM2.5","station"],color_discrete_sequence=["#F4B63F"],size="mean_PM2.5")
fig.update_layout(mapbox_style="open-street-map") 
fig.show()

To visualize the data, we will resample it to daily and monthly values. Most of the data has a 1-hour resolution.

In [99]:
ny_daily_station_means = ny_data.groupby('station')['PM2.5'].resample('1D').mean()
ny_monthly_station_means = ny_data.groupby('station')['PM2.5'].resample('1M').mean()

berlin_daily_station_means = berlin_data.groupby('station')['PM2.5'].resample('1D').mean()
berlin_monthly_station_means = berlin_data.groupby('station')['PM2.5'].resample('1M').mean()
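
Combining `groupby` with `resample` returns a Series indexed by station and time. The sketch below, on invented values for two stations, shows how one station is selected from that result and how `idxmax` identifies the station with the highest daily mean; both patterns are used in the plots that follow.

```python
import pandas as pd

# Four made-up readings per station, 12 hours apart (two per day)
times = pd.date_range("2020-01-01", periods=4, freq=pd.Timedelta(hours=12))
toy = pd.DataFrame({
    "time": list(times) * 2,
    "station": ["A"] * 4 + ["B"] * 4,
    "PM2.5": [1.0, 3.0, 5.0, 7.0, 10.0, 20.0, 30.0, 50.0],
}).set_index("time")

# Daily mean per station; the result is indexed by (station, time)
daily = toy.groupby("station")["PM2.5"].resample("1D").mean()

station_a = daily["A"]        # daily series of a single station
top_station = daily.idxmax()[0]  # station label of the overall maximum
print(station_a.tolist())     # → [2.0, 6.0]
print(top_station)            # → B
```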

Below we plot the monthly mean values.

In [100]:
fig = go.Figure()
for stat in np.unique(ny_data['station']):
    station_monthly_mean_data = pd.DataFrame({'PM2.5' : ny_monthly_station_means[stat].values, 'time':ny_monthly_station_means[stat].index})
    fig.add_traces(go.Scatter(x=station_monthly_mean_data.time, y=station_monthly_mean_data["PM2.5"], mode='lines', name = stat,marker_color=next(palette)))
fig.show()

Unfortunately, many stations in Berlin stopped working after 2019, and only two stations are active right now. Looking at the monthly means, 2020 seems to have had lower values than usual.

In [101]:
fig = go.Figure()
for stat in np.unique(berlin_data['station']):
    station_monthly_mean_data = pd.DataFrame({'PM2.5' : berlin_monthly_station_means[stat].values, 'time':berlin_monthly_station_means[stat].index})
    fig.add_traces(go.Scatter(x=station_monthly_mean_data.time, y=station_monthly_mean_data["PM2.5"], mode='lines', name = stat,marker_color=next(palette)))
fig.show()

Another way to visualize grouped historical data is a violin plot. For example, we can see that PM2.5 values vary the most at STA.DE_DEBE065. In general, the data is very similar across stations.

In [102]:
fig = go.Figure()
for stat in np.unique(berlin_data['station']):
    station_daily_mean_data = pd.DataFrame({'PM2.5' : berlin_daily_station_means[stat].values, 'time':berlin_daily_station_means[stat].index})
    station_daily_mean_data['year'] = pd.DatetimeIndex(station_daily_mean_data['time']).year
    fig.add_trace(go.Violin(x=station_daily_mean_data['year'],
                            y=station_daily_mean_data['PM2.5'],
                            legendgroup=stat, scalegroup=stat, name=stat,marker_color=next(palette)))
                  
fig.update_traces(box_visible=True, meanline_visible=True)
fig.update_layout(violinmode='group', width=900, height=450)
fig.show()

Using pandas, it is also easy to find maximum values in the data. The plot below shows, for New York and for Berlin, the station that recorded the highest daily mean of all.

Unfortunately, data for the Berlin station stops at the end of 2018.
We can clearly see how values peak on New Year's Eve: fireworks really pollute the air.

In [103]:
fig = go.Figure()
ny_max_station_mean_data = pd.DataFrame({'PM2.5' : ny_daily_station_means[ny_daily_station_means.idxmax()[0]].values, 'time':ny_daily_station_means[ny_daily_station_means.idxmax()[0]].index})
berlin_max_station_mean_data = pd.DataFrame({'PM2.5' : berlin_daily_station_means[berlin_daily_station_means.idxmax()[0]].values, 'time':berlin_daily_station_means[berlin_daily_station_means.idxmax()[0]].index})
fig.add_traces(go.Scatter(x=ny_max_station_mean_data.time, y=ny_max_station_mean_data["PM2.5"], mode='lines', name = ny_daily_station_means.idxmax()[0],marker_color=next(palette)))
fig.add_traces(go.Scatter(x=berlin_max_station_mean_data.time, y=berlin_max_station_mean_data["PM2.5"], mode='lines', name = berlin_daily_station_means.idxmax()[0],marker_color=next(palette)))
fig.show()

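It's always nice to know exactly when the overall maximum happened. In Berlin it was on New Year's Eve, 2018-12-31.
In New York, however, it was more recent: 2020-12-06. If you know the reason, definitely let us know. Note that calling max() on a DataFrame takes the maximum of each column separately, so the time printed below is the latest timestamp in each series, not necessarily the date of the PM2.5 peak.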

In [104]:
print (berlin_max_station_mean_data.max())
print ('\n')
print (ny_max_station_mean_data.max())


PM2.5                 186.96
time     2018-12-31 00:00:00
dtype: object


PM2.5                82.3714
time     2021-01-01 00:00:00
dtype: object
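
Because `DataFrame.max()` works column by column, the `time` printed above is the latest timestamp rather than the moment of the peak. A minimal sketch on invented values shows the difference and one way to read the date of the maximum directly, by looking up the row that `idxmax` points to.

```python
import pandas as pd

# Invented daily means; the peak is on the middle day
toy = pd.DataFrame({
    "time": pd.to_datetime(["2020-03-01", "2020-03-02", "2020-03-03"]),
    "PM2.5": [12.0, 90.0, 20.0],
})

# Column-wise maxima: largest value and latest date, taken independently
column_max = toy.max()

# Row holding the largest PM2.5 value, including its own timestamp
peak_row = toy.loc[toy["PM2.5"].idxmax()]

print(column_max["time"])  # → 2020-03-03 00:00:00
print(peak_row["time"])    # → 2020-03-02 00:00:00
```

Applied to `berlin_max_station_mean_data` or `ny_max_station_mean_data`, the same lookup would return the date of each station's highest daily mean.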


Now, let's see how to work with data from the last and the current year.

In [105]:
last_years_ny_data = ny_data[ny_data.index>datetime.datetime.today().replace(year = datetime.datetime.today().year-1,month=1,day=1,hour=0)]
last_years_berlin_data = berlin_data[berlin_data.index>datetime.datetime.today().replace(year = datetime.datetime.today().year-1,month=1,day=1,hour=0)]
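
The cutoff used above is the first of January of the previous year. As a sketch of one possible variant, the helper below builds that cutoff as a `pd.Timestamp` at exactly midnight and filters with `>=`; unlike the `replace` call above, it does not carry over the current minutes and seconds. `start_of_previous_year` is a hypothetical name and the series values are invented.

```python
import pandas as pd

def start_of_previous_year(today=None):
    """Return midnight on 1 January of the year before `today`."""
    today = pd.Timestamp.today() if today is None else pd.Timestamp(today)
    return pd.Timestamp(year=today.year - 1, month=1, day=1)

# Invented series with a time index
index = pd.to_datetime(["2019-12-31 23:00", "2020-01-01 00:00", "2021-01-15 12:00"])
toy = pd.Series([1.0, 2.0, 3.0], index=index)

# Pretend the notebook is run on 2021-02-15
cutoff = start_of_previous_year("2021-02-15")
recent = toy[toy.index >= cutoff]
print(recent.tolist())  # → [2.0, 3.0]
```

Passing the date explicitly makes the selection reproducible regardless of when the notebook is run.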

Here we will again group the data by station and then compute daily means using pandas resample.

In [106]:
ny_daily_station_means_current_year = last_years_ny_data.groupby('station')['PM2.5'].resample('1D').mean()
berlin_daily_station_means_current_year = last_years_berlin_data.groupby('station')['PM2.5'].resample('1D').mean()

Below, we compare daily mean PM2.5 values for the Berlin and New York stations that recorded the overall maximum values. Pollution at those stations is relatively similar. However, there are some interesting peaks. For example, on March 28, Berlin has a really high daily mean, and on November 9, both stations have relatively high means.

In [108]:
fig = go.Figure()
ny_max_station_mean_data = pd.DataFrame({'PM2.5' : ny_daily_station_means_current_year[ny_daily_station_means_current_year.idxmax()[0]].values, 'time':ny_daily_station_means_current_year[ny_daily_station_means_current_year.idxmax()[0]].index})
berlin_max_station_mean_data = pd.DataFrame({'PM2.5' : berlin_daily_station_means_current_year[berlin_daily_station_means_current_year.idxmax()[0]].values, 'time':berlin_daily_station_means_current_year[berlin_daily_station_means_current_year.idxmax()[0]].index})
fig.add_traces(go.Scatter(x=ny_max_station_mean_data.time, y=ny_max_station_mean_data["PM2.5"], mode='lines', name = ny_daily_station_means_current_year.idxmax()[0],marker_color=next(palette)))
fig.add_traces(go.Scatter(x=berlin_max_station_mean_data.time, y=berlin_max_station_mean_data["PM2.5"], mode='lines', name = berlin_daily_station_means_current_year.idxmax()[0],marker_color=next(palette)))
fig.show()

In conclusion, Pandas and Plotly are powerful tools for analyzing data. In this example, we mainly focused on visualizing time series from multiple stations, found some maximum and mean values, and resampled the data to daily and monthly means for clearer visualization. More advanced analyses are also possible. Let us know if there's something specific you would like to see.