# Dataset 

In this notebook we explore the [Historical Air Quality](https://www.kaggle.com/epa/epa-historical-air-quality) dataset from [Kaggle](https://www.kaggle.com). It collects air quality measurements from outdoor monitors across the US and contains more than **8 million** records.

| Field                            | Type    | Description                                    |
|----------------------------------|---------|------------------------------------------------|
| state_code                       | Integer | The FIPS code of the state in which the monitor resides.|
| county_code                      | String  | The FIPS code of the county in which the monitor resides. |
| site_num                         | String  | A unique number within the county identifying the site. |
| parameter_code                   | Integer | The AQS code corresponding to the parameter measured by the monitor. |
| poc                              | Integer | This is the “Parameter Occurrence Code” used to distinguish different instruments that measure the same parameter at the same site. |
| latitude                         | Float   | The monitoring site’s angular distance north of the equator measured in decimal degrees. |
| longitude                        | Float   | The monitoring site’s angular distance east of the prime meridian measured in decimal degrees. |
| datum                            | String  | The Datum associated with the Latitude and Longitude measures. |
| parameter_name                   | String  | The name or description assigned in AQS to the parameter measured by the monitor. Parameters may be pollutants or non-pollutants. |
| sample_duration                  | String  | The length of time that air passes through the monitoring device before it is analyzed (measured). So, it represents an averaging period in the atmosphere (for example, a 24-hour sample duration draws ambient air over a collection filter for 24 straight hours). For continuous monitors, it can represent an averaging time of many samples (for example, a 1-hour value may be the average of four one-minute samples collected during each quarter of the hour). |
| pollutant_standard               | String  | A description of the ambient air quality standard rules used to aggregate statistics.  |
| date_local                       | String  | The calendar date for the summary. All daily summaries are for the local standard day (midnight to midnight) at the monitor. |
| units_of_measure                 | String  | The unit of measure for the parameter. QAD always returns data in the standard units for the parameter. Submitters are allowed to report data in any unit and EPA converts to a standard unit so that we may use the data in calculations. |
| event_type                       | String  | Indicates whether data measured during exceptional events are included in the summary. A wildfire is an example of an exceptional event; it is something that affects air quality, but the local agency has no control over. No Events means no events occurred. Events Included means events occurred and the data from them is included in the summary. Events Excluded means that events occurred but data from them is excluded from the summary. Concurred Events Excluded means that events occurred but only EPA-concurred exclusions are removed from the summary. If an event occurred for the parameter in question, the data will have multiple records for each monitor. |
| observation_count                | Integer | The number of observations (samples) taken during the day. |
| arithmetic_mean                  | Float   | The average (arithmetic mean) value for the day. |
| first_max_value                  | Float   | The highest value for the day. |
| first_max_hour                   | Integer | The hour (on a 24-hour clock) when the highest value for the day (the previous field) was taken. |
| aqi                              | Integer   | The Air Quality Index for the day for the pollutant, if applicable. |
| method_code                      | Integer  | An internal system code indicating the method (processes, equipment, and protocols) used in gathering and measuring the sample. The method name is in the next column. |
| method_name                      | String  | A short description of the processes, equipment, and protocols used in gathering and measuring the sample. | 
| local_site_name                  | String  | The name of the site (if any) given by the State, local, or tribal air pollution control agency that operates it. |
| address                          | String  | The approximate street address of the monitoring site. |
| state_name                       | String  | The name of the state where the monitoring site is located. | 
| county_name                      | String  | The name of the county where the monitoring site is located. |
| city_name                        | String  | The name of the city where the monitoring site is located. This represents the legal incorporated boundaries of cities and not urban areas. |
| cbsa_name                        | String  | The name of the core-based statistical area (metropolitan area) where the monitoring site is located. |
| date_of_last_change              | Date    | The date on which any numeric values in this record were last updated in the AQS data system. |


Please note that, as highlighted in `README.md`, we have already performed an initial preprocessing step: we have introduced new helper/utility fields that are not available in the original dataset, and we have dropped at least one field (i.e. `Country`).
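
In the field table above, `state_code` is an Integer while `county_code` is a String, so county codes such as `001` keep their leading zeros. The sketch below assumes the usual FIPS convention of a two-digit state code followed by a three-digit county code, and shows how the two fields could be combined into a single county identifier. The helper name `county_fips` is hypothetical; it is not part of the dataset or of the preprocessing described in `README.md`.

```python
def county_fips(state_code: int, county_code: str) -> str:
    """Build a 5-character county identifier from the two code fields."""
    # Zero-pad the integer state code to 2 digits and the county code to 3.
    return f"{state_code:02d}{county_code.zfill(3)}"


print(county_fips(1, "001"))  # 01001
print(county_fips(26, "77"))  # 26077
```

Keeping the result as a string preserves the leading zero of single-digit state codes, which an integer identifier would lose.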



# Imports 

In [1]:
import sys
sys.path.append("../config")

import config

In [2]:
import pyspark.sql.functions as fn
import pyspark.sql.types as t

# Load data 

In [3]:
df_air_q = spark.read.parquet(f"{config.ARTIFACTS}/sample_air_quality/")

In [4]:
df = df_air_q

# Summary

In [5]:
df.describe().toPandas().transpose()

| summary        | count  | mean               | stddev             | min             | max             |
|----------------|--------|--------------------|--------------------|-----------------|-----------------|
| state_code     | 144148 | 26.19063046313511  | 18.11397205043684  | 1               | 80              |
| county_code    | 144148 | 77.8448885867303   | 105.63567064153942 | 001             | 810             |
| site_num       | 144148 | 803.8889266587119  | 1679.3616871204779 | 0001            | 9997            |
| parameter_code | 144148 | 42101.0            | 0.0                | 42101           | 42101           |
| poc            | 144148 | 1.053667064406027  | 0.3741866113244187 | 1               | 9               |
| latitude       | 144148 | 37.6448351149521   | 5.640754799192715  | 0.0             | 64.84569        |
| longitude      | 144148 | -97.77013400302471 | 19.26404162545803  | -159.36624      | 0.0             |
| datum          | 144148 |                    |                    | NAD83           | WGS84           |
| parameter_name | 144148 |                    |                    | Carbon monoxide | Carbon monoxide |


In [6]:
# Keep only the schema fields whose type is float, integer, or double.
num_fields = [col for col in df.schema if isinstance(col.dataType, (t.FloatType, t.IntegerType, t.DoubleType))]
num_fields_names = [col.name for col in num_fields]

In [7]:
# For each numeric column, compute the fraction of rows (0 to 1) that are NaN or null.
amount_missing_df = df.select([
    (fn.count(fn.when(fn.isnan(c) | fn.col(c).isNull(), c)) / fn.count(fn.lit(1))).alias(f"{c}_perc_missing")
    for c in num_fields_names
])

In [8]:
complete_cols = [k for k, v in amount_missing_df.collect()[0].asDict().items() if v == 0.0]

In [9]:
amount_missing_df = amount_missing_df.drop(*complete_cols)

In [10]:
amount_missing_df.toPandas().transpose()

|                          | 0        |
|--------------------------|----------|
| aqi_perc_missing         | 0.500104 |
| method_code_perc_missing | 0.499896 |
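
Despite the `_perc_missing` suffix, the values above are fractions between 0 and 1 rather than percentages: the expression in cell 7 counts the rows where a column is NaN or null and divides by the total number of rows. The same arithmetic can be sketched in plain Python; the list below is made up for illustration and is not taken from the dataset, and `fraction_missing` is a hypothetical helper name.

```python
import math


def fraction_missing(values):
    """Share of entries that are None or NaN, as a value in [0, 1]."""
    missing = sum(1 for v in values if v is None or math.isnan(v))
    return missing / len(values)


# Hypothetical column with 2 missing entries out of 4.
toy_aqi = [12.0, None, float("nan"), 30.0]
print(fraction_missing(toy_aqi))  # 0.5
```

Multiplying the result by 100 would give an actual percentage.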


# SQL 

In [11]:
df = df.dropna(how='all', subset=['aqi', 'county_name', 'date_local'])

In [12]:
df.count()

144148
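
The count matches the per-column counts in the summary, which is consistent with how `dropna` was called: with `how='all'`, a row is dropped only when every column listed in `subset` is null. Since roughly half of the `aqi` values are missing, `how='any'` would be expected to remove about half of the rows instead. The toy rows and the helper `keep_row` below are hypothetical and only mimic that rule in plain Python.

```python
def keep_row(row, subset, how):
    """Mimic the dropna rule: return True if the row survives."""
    nulls = [row.get(col) is None for col in subset]
    if how == "all":
        return not all(nulls)  # dropped only if every subset field is null
    return not any(nulls)      # how == "any": dropped if any subset field is null


subset = ["aqi", "county_name", "date_local"]
# Made-up rows, not taken from the dataset.
rows = [
    {"aqi": None, "county_name": "Jefferson", "date_local": "1990-01-01"},
    {"aqi": None, "county_name": None, "date_local": None},
]
print([keep_row(r, subset, "all") for r in rows])  # [True, False]
print([keep_row(r, subset, "any") for r in rows])  # [False, False]
```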

In [13]:
df.createOrReplaceTempView("airq")

In [14]:
q = spark.sql("""
    SELECT 
        YEAR(date_local) AS year,
        state_name,
        COUNT(*) AS COUNT
    FROM 
        airq
    GROUP BY
        state_name, year
    ORDER BY
        state_name ASC, year ASC
""")

In [15]:
q.show()

+----+----------+-----+
|year|state_name|COUNT|
+----+----------+-----+
|1990|   Alabama|   79|
|1991|   Alabama|   62|
|1992|   Alabama|   79|
|1993|   Alabama|   69|
|1994|   Alabama|   54|
|1995|   Alabama|   63|
|1996|   Alabama|   60|
|1997|   Alabama|   56|
|1998|   Alabama|   40|
|1999|   Alabama|   55|
|2000|   Alabama|   59|
|2001|   Alabama|   37|
|2002|   Alabama|   54|
|2003|   Alabama|   41|
|2004|   Alabama|   30|
|2005|   Alabama|   32|
|2006|   Alabama|   39|
|2007|   Alabama|   36|
|2008|   Alabama|   39|
|2009|   Alabama|   33|
+----+----------+-----+
only showing top 20 rows
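
The query above counts records per state and year, then sorts by state and year. As a rough illustration of that grouping logic outside Spark, the sketch below applies the same steps to a handful of made-up records using only the standard library; the record values are hypothetical.

```python
from collections import Counter

# Hypothetical (date_local, state_name) records, not taken from the dataset.
records = [
    ("1990-03-01", "Alabama"),
    ("1990-07-15", "Alabama"),
    ("1991-01-20", "Alabama"),
    ("1990-05-05", "Alaska"),
]

# Key each record by (state_name, year), mirroring GROUP BY state_name, year.
counts = Counter((state, int(date[:4])) for date, state in records)

# Sort by state, then year, mirroring ORDER BY state_name ASC, year ASC.
for (state, year), n in sorted(counts.items()):
    print(year, state, n)
```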



In [16]:
q = spark.sql("""
    SELECT 
        YEAR(date_local) AS year,
        state_name,
        arithmetic_mean
    FROM
        airq
    ORDER BY
        state_name ASC, year ASC
""")

In [17]:
q.show()

+----+----------+---------------+
|year|state_name|arithmetic_mean|
+----+----------+---------------+
|1990|   Alabama|       2.395833|
|1990|   Alabama|       3.679167|
|1990|   Alabama|          0.575|
|1990|   Alabama|       1.495833|
|1990|   Alabama|         0.4375|
|1990|   Alabama|         0.5875|
|1990|   Alabama|       0.691667|
|1990|   Alabama|       1.286957|
|1990|   Alabama|       0.733333|
|1990|   Alabama|       1.704167|
|1990|   Alabama|       1.133333|
|1990|   Alabama|       1.183333|
|1990|   Alabama|       1.495833|
|1990|   Alabama|       0.429167|
|1990|   Alabama|       0.604167|
|1990|   Alabama|       0.679167|
|1990|   Alabama|       1.945833|
|1990|   Alabama|       0.483333|
|1990|   Alabama|       1.183333|
|1990|   Alabama|       1.929167|
+----+----------+---------------+
only showing top 20 rows

