## Python SparkDF
The final output gives the number of monitors in each U.S. state, identified by its FIPS state code. The FIPS codes also cover islands and self-governing territories.

This notebook demonstrates running a PySpark program in Python using the SQL (DataFrame) interface. In this example, Spark runs locally on a single machine (master `local[*]`) and reads data from the local filesystem; in cluster mode, data is typically read from a distributed file system such as HDFS.

Spark documentation available at: https://spark.apache.org/docs/2.3.1/

### Download the dataset 
The dataset is downloaded from a Dropbox link. It is a small subset of the original dataset and contains information from air quality monitoring facilities across the U.S.A.

In [1]:
!wget -O epa_hap_daily_summary_small https://www.dropbox.com/s/4jxfdsgn2tdo7zo/epa_hap_daily_summary-small.csv?dl=0

--2021-12-21 18:12:16--  https://www.dropbox.com/s/4jxfdsgn2tdo7zo/epa_hap_daily_summary-small.csv?dl=0
Resolving www.dropbox.com (www.dropbox.com)... 162.125.68.18, 2620:100:6024:18::a27d:4412
Connecting to www.dropbox.com (www.dropbox.com)|162.125.68.18|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/raw/4jxfdsgn2tdo7zo/epa_hap_daily_summary-small.csv [following]
--2021-12-21 18:12:22--  https://www.dropbox.com/s/raw/4jxfdsgn2tdo7zo/epa_hap_daily_summary-small.csv
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://ucb9f10291b576dc53170c4903a7.dl.dropboxusercontent.com/cd/0/inline/BcT71uf70pqC1K8qG3mlYuZjyfFyGMG1bkiF9qaKguPZ6HZhFnC7ba3rJuqjveGvpIyht6IuaqtzEmQFb_UTYdkhcgUGnyZ2tggJjJFSpY6zoJuHAFjgiq7Q3obMKBmWSJ0eK5BMS0EBiQc4CdXrlE9A/file# [following]
--2021-12-21 18:12:23--  https://ucb9f10291b576dc53170c4903a7.dl.dropboxusercontent.com/cd/0/inline/BcT71uf70pqC1K8qG3mlYuZjy

In [60]:
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *

spark = SparkSession.builder.master('local[*]').appName('Q1').getOrCreate()
sc = spark.sparkContext


In [61]:
lines = sc.textFile('epa_hap_daily_summary_small')

In [62]:
df=spark.read.option("header","true").csv(lines)

In [63]:
df.describe()

DataFrame[summary: string, state_code: string, county_code: string, site_num: string, parameter_code: string, poc: string, latitude: string, longitude: string, datum: string, parameter_name: string, sample_duration: string, pollutant_standard: string, date_local: string, units_of_measure: string, event_type: string, observation_count: string, observation_percent: string, arithmetic_mean: string, first_max_value: string, first_max_hour: string, aqi: string, method_code: string, method_name: string, local_site_name: string, address: string, state_name: string, county_name: string, city_name: string, cbsa_name: string, date_of_last_change: string]

In [64]:
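# Re-read the data with inferSchema=True so that column types are inferred instead of all being strings
monitor_values = spark.read.csv(lines, inferSchema=True, header=True)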

In [65]:
# Keep only the two columns needed for the count (equivalent to dropping all the others)
monitor_values_small = monitor_values.select('state_code', 'address')


In [66]:
monitor_values_small.show(10)

+----------+--------------------+
|state_code|             address|
+----------+--------------------+
|        34|Interchange 13 Ne...|
|        35|     San Pedro Parks|
|        51|2401 HARTMAN STRE...|
|        48|  1902 WEST SCHUNIOR|
|        44|FRANCIS SCHOOL 64...|
|        36|POST OFFICE350 CA...|
|        36|274 S Pearl St Al...|
|        39|         1330 DUEBER|
|        48|800 S San Marcial...|
|        16|            Scoville|
+----------+--------------------+
only showing top 10 rows



In [67]:
drop_columns_duplicRows = monitor_values_small.dropDuplicates(['address'])

In [68]:
drop_columns_duplicRows.show(10)

+----------+--------------------+
|state_code|             address|
+----------+--------------------+
|        35|          San Andres|
|         6|5551 BETHEL ISLAN...|
|        41|        711 WELCH ST|
|        20|FIRE STA.#8; SHUN...|
|        54|1400 Main St / 10...|
|        22|U.S. Coast Guard ...|
|        72|FIRE DEPARTMENTCR...|
|        22|HEALTH UNIT BUILD...|
|        12|7200-22 AVENUE NORTH|
|        27|    649 FIFTH STREET|
+----------+--------------------+
only showing top 10 rows
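
One point worth checking at this step: `dropDuplicates(['address'])` keeps one row per distinct address string, regardless of state. If two states happened to share an identical address string, only one of those rows would survive. The sketch below illustrates the effect in plain Python on hypothetical rows (the tuples are invented, not taken from the dataset). The helper keeps the first row it sees, whereas Spark does not guarantee which duplicate is retained. In PySpark, the stricter key would be `dropDuplicates(['state_code', 'address'])`.

```python
# Hypothetical (state_code, address) rows, invented for illustration only.
rows = [
    (6, "100 MAIN ST"),
    (6, "100 MAIN ST"),   # repeated reading from the same monitor
    (48, "100 MAIN ST"),  # same address string, different state
    (48, "200 OAK AVE"),
]

def dedupe(rows, key):
    """Keep the first row seen for each distinct key value."""
    seen = set()
    kept = []
    for row in rows:
        k = key(row)
        if k not in seen:
            seen.add(k)
            kept.append(row)
    return kept

# Deduplicate on the address alone, then on the (state_code, address) pair.
by_address = dedupe(rows, key=lambda r: r[1])
by_state_and_address = dedupe(rows, key=lambda r: r)

print(len(by_address))            # 2
print(len(by_state_and_address))  # 3
```

Whether this matters for the real data depends on how unique the address strings are, so comparing the two row counts on the actual DataFrame is a cheap sanity check.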



The data is now cleaned, so all that remains is to count the occurrences of each state code and show the result as a table in descending order.
To do that, I group by the `state_code` column and apply `count()`, then sort on the resulting `count` column in descending order, and finally call `show()` to display the result.
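
For intuition, the same group, count, sort sequence can be sketched in plain Python with `collections.Counter`. The state codes below are hypothetical sample values, not the notebook's data.

```python
from collections import Counter

# Hypothetical state codes, one per de-duplicated monitor address.
state_codes = [6, 48, 6, 27, 48, 6]

# Counter groups equal values and counts them (the groupBy + count step).
counts = Counter(state_codes)

# most_common() returns (value, count) pairs sorted by count, descending.
for state_code, count in counts.most_common():
    print(state_code, count)
```

The Spark version below does the same thing, but distributes the grouping and counting across partitions.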

In [81]:
drop_columns_duplicRows.groupBy('state_code').count().sort(col("count").desc()).show()

+----------+-----+
|state_code|count|
+----------+-----+
|         6|  161|
|        48|  132|
|        27|   94|
|        39|   89|
|        26|   83|
|        36|   66|
|        45|   64|
|        30|   60|
|        42|   60|
|        12|   57|
|        18|   52|
|         8|   50|
|        17|   49|
|        37|   49|
|        53|   42|
|        22|   40|
|         4|   38|
|        20|   37|
|        13|   34|
|        41|   31|
+----------+-----+
only showing top 20 rows
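
To make the table easier to read, the numeric FIPS codes can be mapped to state names. The sketch below labels the first five rows of the output above. `FIPS_NAMES` is a hand-written, partial lookup introduced here for illustration; the dataset also carries a `state_name` column, which is not kept in the two-column DataFrame and could be retained instead.

```python
# Partial FIPS state-code lookup (hypothetical helper, covers only a few codes).
FIPS_NAMES = {
    6: "California",
    48: "Texas",
    27: "Minnesota",
    39: "Ohio",
    26: "Michigan",
    72: "Puerto Rico",
}

# First five (state_code, count) rows copied from the output above.
top_rows = [(6, 161), (48, 132), (27, 94), (39, 89), (26, 83)]

# Fall back to the raw code when a state is missing from the partial lookup.
labelled = [(FIPS_NAMES.get(code, f"FIPS {code}"), count) for code, count in top_rows]

for name, count in labelled:
    print(f"{name}: {count}")
```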



In [None]:
sc.stop()