# Getting Started with Spark

Welcome to this hands-on training where we will investigate cleaning a dataset using Python and Apache Spark! 
During this training, we will cover:

* Loading a dataset into a Spark DataFrame
* Defining a schema for the data
* Saving a file as a Parquet file
* Projection & Filtering 
* Modifying, Renaming, and Dropping Columns

In [1]:
# Import Libraries

from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from datetime import datetime
from pyspark.sql import functions as F

### **The Dataset**

The dataset used in this training is a CSV file named `sf-fire-calls.csv`. It contains information about calls that came into the San Francisco Fire Department over a number of years, with the following fields:

* `CallNumber`: The number of the call
* `UnitID`: The ID of the unit that responded to the call
* `IncidentNumber`: The number of the incident
* `CallType`: The type of call that was made
* `CallDate`: The date on which the call was made
* `WatchDate`: The watch (shift) date on which the call was received
* `CallFinalDisposition`: The final disposition of the call
* `AvailableDtTm`: The timestamp at which the unit was made available
* `Address`: The address where the incident occurred
* `City`: The city where the incident occurred
* `Zipcode`: The zip code where the incident occurred
* `Battalion`: The battalion where the incident occurred
* `StationArea`: The station area where the incident occurred
* `Box`: The box number where the incident occurred
* `OriginalPriority`: The original priority of the call
* `Priority`: The priority of the call
* `FinalPriority`: The final priority of the call
* `ALSUnit`: Whether an ALS unit was called
* `CallTypeGroup`: The group that the call type belongs to
* `NumAlarms`: The number of alarms
* `UnitType`: The unit type that responded
* `UnitSequenceInCallDispatch`: The sequence of the unit
* `FirePreventionDistrict`: The fire prevention district
* `SupervisorDistrict`: The supervisor district
* `Neighborhood`: The neighborhood
* `Location`: The location of the incident
* `RowID`: The row ID
* `Delay`: The response delay, in minutes


###  `SparkSession`

- `SparkSession` is the entry point to Spark SQL. It is one of the very first objects you create while developing a Spark SQL application.

- As a Spark developer, you create a `SparkSession` through `SparkSession.builder`, an attribute that gives you access to the Builder API used to configure the session.

- In order to work with Spark, we have to first set up a `SparkSession`.

- From this point forward, we can interact with Apache Spark using this `spark` object.

In [2]:
# Initiate Libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from datetime import datetime

# Create a SparkSession and set the extraClassPath configuration
spark = SparkSession.builder.master("local[1]") \
    .appName("LetSparkWorkForYou") \
    .config("spark.driver.extraClassPath", "/home/jovyan/work/jars/*") \
    .getOrCreate()

# Details of the Spark Session
spark

### Inspection 

- Inspect what the data looks like before defining a schema.

In [3]:
! head /home/jovyan/work/data/sf-fire-calls.csv

CallNumber,UnitID,IncidentNumber,CallType,CallDate,WatchDate,CallFinalDisposition,AvailableDtTm,Address,City,Zipcode,Battalion,StationArea,Box,OriginalPriority,Priority,FinalPriority,ALSUnit,CallTypeGroup,NumAlarms,UnitType,UnitSequenceInCallDispatch,FirePreventionDistrict,SupervisorDistrict,Neighborhood,Location,RowID,Delay
20110016,T13,2003235,Structure Fire,01/11/2002,01/10/2002,Other,01/11/2002 01:51:44 AM,2000 Block of CALIFORNIA ST,SF,94109,B04,38,3362,3,3,3,false,"",1,TRUCK,2,4,5,Pacific Heights,"(37.7895840679362, -122.428071912459)",020110016-T13,2.95
20110022,M17,2003241,Medical Incident,01/11/2002,01/10/2002,Other,01/11/2002 03:01:18 AM,0 Block of SILVERVIEW DR,SF,94124,B10,42,6495,3,3,3,true,"",1,MEDIC,1,10,10,Bayview Hunters Point,"(37.7337623673897, -122.396113802632)",020110022-M17,4.7
20110023,M41,2003242,Medical Incident,01/11/2002,01/10/2002,Other,01/11/2002 02:39:50 AM,MARKET ST/MCALLISTER ST,SF,94102,B03,01,1455,3,3,3,true,"",1,MEDIC,2,3,6,Tenderloin,"(37.7811772186

## Read Data 

- Read Data From File

### Schema Definition 

- Define the schema explicitly because the file is large. Inferring the schema requires an extra pass over the data, which is expensive for large files.

In [4]:
# Define the Schema
fire_schema = StructType([StructField('CallNumber', IntegerType(), True),
                    StructField('UnitID', StringType(), True),
                    StructField('IncidentNumber', IntegerType(), True),
                    StructField('CallType', StringType(), True),
                    StructField('CallDate', StringType(), True),
                    StructField('WatchDate', StringType(), True),
                    StructField('CallFinalDisposition', StringType(), True),
                    StructField('AvailableDtTm', StringType(), True),
                    StructField('Address', StringType(), True),
                    StructField('City', StringType(), True),
                    StructField('Zipcode', IntegerType(), True),
                    StructField('Battalion', StringType(), True),
                    StructField('StationArea', StringType(), True),
                    StructField('Box', StringType(), True),
                    StructField('OriginalPriority', StringType(), True),
                    StructField('Priority', StringType(), True),
                    StructField('FinalPriority', IntegerType(), True),
                    StructField('ALSUnit', BooleanType(), True),
                    StructField('CallTypeGroup', StringType(), True),
                    StructField('NumAlarms', IntegerType(), True),
                    StructField('UnitType', StringType(), True),
                    StructField('UnitSequenceInCallDispatch', IntegerType(), True),
                    StructField('FirePreventionDistrict', StringType(), True),
                    StructField('SupervisorDistrict', StringType(), True),
                    StructField('Neighborhood', StringType(), True),
                    StructField('Location', StringType(), True),
                    StructField('RowID', StringType(), True),
                    StructField('Delay', FloatType(), True)])

# file path 
sr_file = "/home/jovyan/work/data/sf-fire-calls.csv"

# read into spark 
fire_df = (spark.read.csv(sr_file, header=True, schema=fire_schema))
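
Spark's reader also accepts a schema written as a DDL-formatted string (for example `CallNumber INT, UnitID STRING`) in place of a `StructType`. As a minimal sketch, the helper below builds such a string in plain Python. The name `to_ddl` is hypothetical and not part of this notebook, and only the first four fields are listed; a usable schema would have to list all 28 fields in file order.

```python
# Hypothetical helper (not part of the notebook): build a DDL-formatted schema string.
# Only the first four columns are shown; a real schema must list all 28 in file order.
fields = [
    ("CallNumber", "INT"),
    ("UnitID", "STRING"),
    ("IncidentNumber", "INT"),
    ("CallType", "STRING"),
]

def to_ddl(pairs):
    """Join (column, type) pairs into a 'name TYPE, name TYPE' string."""
    return ", ".join(f"{name} {dtype}" for name, dtype in pairs)

ddl_schema = to_ddl(fields)
print(ddl_schema)
```

Once complete, a string like this could be passed as `schema=ddl_schema` to `spark.read.csv` instead of `fire_schema`. The `StructType` form is more verbose but makes nullability explicit.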

In [5]:
fire_df.cache()

DataFrame[CallNumber: int, UnitID: string, IncidentNumber: int, CallType: string, CallDate: string, WatchDate: string, CallFinalDisposition: string, AvailableDtTm: string, Address: string, City: string, Zipcode: int, Battalion: string, StationArea: string, Box: string, OriginalPriority: string, Priority: string, FinalPriority: int, ALSUnit: boolean, CallTypeGroup: string, NumAlarms: int, UnitType: string, UnitSequenceInCallDispatch: int, FirePreventionDistrict: string, SupervisorDistrict: string, Neighborhood: string, Location: string, RowID: string, Delay: float]

## View Data

In [6]:
fire_df.show(5, truncate=False)

+----------+------+--------------+----------------+----------+----------+--------------------+----------------------+---------------------------+----+-------+---------+-----------+----+----------------+--------+-------------+-------+-------------+---------+--------+--------------------------+----------------------+------------------+---------------------+-------------------------------------+-------------+---------+
|CallNumber|UnitID|IncidentNumber|CallType        |CallDate  |WatchDate |CallFinalDisposition|AvailableDtTm         |Address                    |City|Zipcode|Battalion|StationArea|Box |OriginalPriority|Priority|FinalPriority|ALSUnit|CallTypeGroup|NumAlarms|UnitType|UnitSequenceInCallDispatch|FirePreventionDistrict|SupervisorDistrict|Neighborhood         |Location                             |RowID        |Delay    |
+----------+------+--------------+----------------+----------+----------+--------------------+----------------------+---------------------------+----+-------+--

## Selection / Projection

- The process of selection is arguably the most fundamental means of reducing the footprint of the data you are working with.
- This concept will be familiar to anyone with working knowledge of SQL.
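- In a nutshell, selection reduces the set of rows returned by a query by way of a condition, while projection reduces the set of columns. Note that Spark's `select()` method performs a projection; row selection is done with `filter()` or `where()`.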

In [7]:
fire_df.select("CallNumber", "City", "Delay").show(5, truncate=False)

+----------+----+---------+
|CallNumber|City|Delay    |
+----------+----+---------+
|20110016  |SF  |2.95     |
|20110022  |SF  |4.7      |
|20110023  |SF  |2.4333334|
|20110032  |SF  |1.5      |
|20110043  |SF  |3.4833333|
+----------+----+---------+
only showing top 5 rows



In [8]:
fire_df.select("IncidentNumber", "AvailableDtTm", "IncidentNumber", "Address", "NumAlarms", "Battalion", "CallFinalDisposition").show(5, False)

+--------------+----------------------+--------------+---------------------------+---------+---------+--------------------+
|IncidentNumber|AvailableDtTm         |IncidentNumber|Address                    |NumAlarms|Battalion|CallFinalDisposition|
+--------------+----------------------+--------------+---------------------------+---------+---------+--------------------+
|2003235       |01/11/2002 01:51:44 AM|2003235       |2000 Block of CALIFORNIA ST|1        |B04      |Other               |
|2003241       |01/11/2002 03:01:18 AM|2003241       |0 Block of SILVERVIEW DR   |1        |B10      |Other               |
|2003242       |01/11/2002 02:39:50 AM|2003242       |MARKET ST/MCALLISTER ST    |1        |B03      |Other               |
|2003250       |01/11/2002 04:16:46 AM|2003250       |APPLETON AV/MISSION ST     |1        |B06      |Other               |
|2003259       |01/11/2002 06:01:58 AM|2003259       |1400 Block of SUTTER ST    |1        |B04      |Other               |
+-------

# Projection 

A projection in relational parlance returns only a chosen subset of columns, while a selection (a filter) returns only the rows matching a certain relational condition.

- In Spark, projections are done with the select() method, while filters can be expressed using the filter() or where() method.
- We can use this technique to examine specific aspects of our SF Fire Department data set:

In [10]:
# Project three columns, then keep only the non-medical calls
few_fire_df = (fire_df
      .select("IncidentNumber", "AvailableDtTm", "CallType")
      .where(col("CallType") != "Medical Incident"))
few_fire_df.show(5, truncate=False)

+--------------+----------------------+--------------+
|IncidentNumber|AvailableDtTm         |CallType      |
+--------------+----------------------+--------------+
|2003235       |01/11/2002 01:51:44 AM|Structure Fire|
|2003250       |01/11/2002 04:16:46 AM|Vehicle Fire  |
|2003259       |01/11/2002 06:01:58 AM|Alarms        |
|2003279       |01/11/2002 08:03:26 AM|Structure Fire|
|2003301       |01/11/2002 09:46:44 AM|Alarms        |
+--------------+----------------------+--------------+
only showing top 5 rows
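
To make the difference between projecting columns and filtering rows concrete without a Spark session, here is a plain-Python sketch over three rows taken from the sample output earlier in this notebook. The list-of-dicts representation is only an illustration and is not how Spark stores data.

```python
# Plain-Python illustration (no Spark needed) using three rows from the sample data.
rows = [
    {"IncidentNumber": 2003235, "CallType": "Structure Fire", "City": "SF"},
    {"IncidentNumber": 2003241, "CallType": "Medical Incident", "City": "SF"},
    {"IncidentNumber": 2003242, "CallType": "Medical Incident", "City": "SF"},
]

# Projection: keep every row but only some columns (the role of select()).
projected = [
    {"IncidentNumber": r["IncidentNumber"], "CallType": r["CallType"]}
    for r in rows
]

# Filtering: keep every column but only some rows (the role of where()/filter()).
filtered = [r for r in rows if r["CallType"] != "Medical Incident"]

print(len(projected), sorted(projected[0]))
print(len(filtered), filtered[0]["IncidentNumber"])
```

The projection keeps all three rows with two columns each, while the filter keeps all columns of the single non-medical row.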



## Create a SQL View 
- A SQL view is a virtual table based on a SQL SELECT query.
- A view contains rows and columns, just like a real table.
- In Spark, `createOrReplaceTempView` registers a DataFrame as a temporary view that can be queried with `spark.sql` for the lifetime of the `SparkSession`.

In [11]:
# Create a SQL View
fire_df.createOrReplaceTempView("fire_service")

In [12]:
# Query the View to find the distinct number of alarms 
query = """
SELECT DISTINCT (NumAlarms) AS Alarms
FROM fire_service
"""
spark.sql(query).show(truncate=False)


query1 = """
SELECT  DISTINCT (CallType) AS CallTypes
FROM fire_service
"""
spark.sql(query1).show(truncate=False)



+------+
|Alarms|
+------+
|1     |
|3     |
|5     |
|4     |
|2     |
+------+

+--------------------------------------------+
|CallTypes                                   |
+--------------------------------------------+
|Elevator / Escalator Rescue                 |
|Marine Fire                                 |
|Aircraft Emergency                          |
|Confined Space / Structure Collapse         |
|Administrative                              |
|Alarms                                      |
|Odor (Strange / Unknown)                    |
|Citizen Assist / Service Call               |
|HazMat                                      |
|Watercraft in Distress                      |
|Explosion                                   |
|Oil Spill                                   |
|Vehicle Fire                                |
|Suspicious Package                          |
|Extrication / Entrapped (Machinery, Vehicle)|
|Other                                       |
|Outside Fire            

In [13]:
# Query the view. The data stores the city as 'SF', so 'SF' rows pass this != 'San Francisco' filter
query = """
SELECT CallNumber, City, Delay
FROM fire_service
WHERE City != 'San Francisco'
"""
spark.sql(query).show(5, truncate=False)

+----------+----+---------+
|CallNumber|City|Delay    |
+----------+----+---------+
|20110016  |SF  |2.95     |
|20110022  |SF  |4.7      |
|20110023  |SF  |2.4333334|
|20110032  |SF  |1.5      |
|20110043  |SF  |3.4833333|
+----------+----+---------+
only showing top 5 rows



## Filters 

In [14]:
few_fire_df = (fire_df.select("IncidentNumber", "AvailableDtTm", "CallType") \
                             .where(col("CallType") != "Medical Incident"))
few_fire_df.show(5, truncate=False)

+--------------+----------------------+--------------+
|IncidentNumber|AvailableDtTm         |CallType      |
+--------------+----------------------+--------------+
|2003235       |01/11/2002 01:51:44 AM|Structure Fire|
|2003250       |01/11/2002 04:16:46 AM|Vehicle Fire  |
|2003259       |01/11/2002 06:01:58 AM|Alarms        |
|2003279       |01/11/2002 08:03:26 AM|Structure Fire|
|2003301       |01/11/2002 09:46:44 AM|Alarms        |
+--------------+----------------------+--------------+
only showing top 5 rows



# Let's do some ETL:
- Transform the string dates to the Spark timestamp data type so we can make time-based queries later
- Return the transformed DataFrame
- Cache the new DataFrame

In [18]:
## Change Data Type
# check schema and locate the time columns 
fire_df.printSchema()

root
 |-- CallNumber: integer (nullable = true)
 |-- UnitID: string (nullable = true)
 |-- IncidentNumber: integer (nullable = true)
 |-- CallType: string (nullable = true)
 |-- CallDate: string (nullable = true)
 |-- WatchDate: string (nullable = true)
 |-- CallFinalDisposition: string (nullable = true)
 |-- AvailableDtTm: string (nullable = true)
 |-- Address: string (nullable = true)
 |-- City: string (nullable = true)
 |-- Zipcode: integer (nullable = true)
 |-- Battalion: string (nullable = true)
 |-- StationArea: string (nullable = true)
 |-- Box: string (nullable = true)
 |-- OriginalPriority: string (nullable = true)
 |-- Priority: string (nullable = true)
 |-- FinalPriority: integer (nullable = true)
 |-- ALSUnit: boolean (nullable = true)
 |-- CallTypeGroup: string (nullable = true)
 |-- NumAlarms: integer (nullable = true)
 |-- UnitType: string (nullable = true)
 |-- UnitSequenceInCallDispatch: integer (nullable = true)
 |-- FirePreventionDistrict: string (nullable = true)
 

In [17]:
# List of time columns
time_columns = [
    'CallDate',
    'WatchDate',
    'AvailableDtTm'
]

# Select the time columns from the DataFrame
time_columns_df = fire_df.select(*time_columns)

# Show the result
time_columns_df.show(5, truncate=False)

# show the data types for the time columns
time_columns_df.printSchema()

+----------+----------+----------------------+
|CallDate  |WatchDate |AvailableDtTm         |
+----------+----------+----------------------+
|01/11/2002|01/10/2002|01/11/2002 01:51:44 AM|
|01/11/2002|01/10/2002|01/11/2002 03:01:18 AM|
|01/11/2002|01/10/2002|01/11/2002 02:39:50 AM|
|01/11/2002|01/10/2002|01/11/2002 04:16:46 AM|
|01/11/2002|01/10/2002|01/11/2002 06:01:58 AM|
+----------+----------+----------------------+
only showing top 5 rows

root
 |-- CallDate: string (nullable = true)
 |-- WatchDate: string (nullable = true)
 |-- AvailableDtTm: string (nullable = true)



### Convert the Time Columns

In [19]:
# convert the string to timestamp format
fire_ts_df = (fire_df.withColumn("IncidentDate", to_timestamp(col("CallDate"), "MM/dd/yyyy")) \
                    .drop("CallDate") \
                    .withColumn("OnWatchDate", to_timestamp(col("WatchDate"), "MM/dd/yyyy")) \
                    .drop("WatchDate") \
                    .withColumn("AvailableDtTS", to_timestamp(col("AvailableDtTm"), "MM/dd/yyyy hh:mm:ss a")) \
                    .drop("AvailableDtTm")
                    )

# Select the converted columns
fire_ts_df.select("IncidentDate", "OnWatchDate", "AvailableDtTS").show(5, truncate=False)

+-------------------+-------------------+-------------------+
|IncidentDate       |OnWatchDate        |AvailableDtTS      |
+-------------------+-------------------+-------------------+
|2002-01-11 00:00:00|2002-01-10 00:00:00|2002-01-11 01:51:44|
|2002-01-11 00:00:00|2002-01-10 00:00:00|2002-01-11 03:01:18|
|2002-01-11 00:00:00|2002-01-10 00:00:00|2002-01-11 02:39:50|
|2002-01-11 00:00:00|2002-01-10 00:00:00|2002-01-11 04:16:46|
|2002-01-11 00:00:00|2002-01-10 00:00:00|2002-01-11 06:01:58|
+-------------------+-------------------+-------------------+
only showing top 5 rows
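
The Spark pattern letters used above can be checked against Python's own `datetime.strptime`, which needs no Spark session. The sketch below maps `MM/dd/yyyy hh:mm:ss a` to what should be the equivalent `strptime` format. The afternoon value is a made-up example, not a row from the dataset.

```python
from datetime import datetime

# Spark pattern "MM/dd/yyyy hh:mm:ss a" maps roughly onto this strptime format:
# MM -> %m, dd -> %d, yyyy -> %Y, hh (12-hour clock) -> %I, mm -> %M, ss -> %S, a (AM/PM) -> %p
PY_FORMAT = "%m/%d/%Y %I:%M:%S %p"

# First AvailableDtTm value from the sample data
morning = datetime.strptime("01/11/2002 01:51:44 AM", PY_FORMAT)

# Hypothetical afternoon value (not from the dataset) to show the AM/PM marker at work
afternoon = datetime.strptime("01/11/2002 01:51:44 PM", PY_FORMAT)

print(morning)
print(afternoon)
```

The `hh` field is a 12-hour clock value, so it is the `a` marker that distinguishes 01:51 from 13:51.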



In [20]:
fire_ts_df.cache()
fire_ts_df.columns

['CallNumber',
 'UnitID',
 'IncidentNumber',
 'CallType',
 'CallFinalDisposition',
 'Address',
 'City',
 'Zipcode',
 'Battalion',
 'StationArea',
 'Box',
 'OriginalPriority',
 'Priority',
 'FinalPriority',
 'ALSUnit',
 'CallTypeGroup',
 'NumAlarms',
 'UnitType',
 'UnitSequenceInCallDispatch',
 'FirePreventionDistrict',
 'SupervisorDistrict',
 'Neighborhood',
 'Location',
 'RowID',
 'Delay',
 'IncidentDate',
 'OnWatchDate',
 'AvailableDtTS']

## SQL View

In [21]:
# Create a SQL View
fire_ts_df.createOrReplaceTempView("fire_services")

# Queries 

## Q-1) How many distinct types of calls were made to the Fire Department?

In [22]:
# Execute your SQL query and create a DataFrame
query = """
SELECT COUNT(DISTINCT CallType) AS DistinctCallTypes
FROM fire_services
WHERE CallType IS NOT NULL;
"""
result_df = spark.sql(query)
result_df.show(truncate=False)

# # Save the DataFrame as a table
# result_df.createOrReplaceTempView("distinct_fire_calls")

# # Export the table as a Parquet file
# result_df.write.mode("overwrite").parquet("/path/to/output/common_fire_calls.parquet")

+-----------------+
|DistinctCallTypes|
+-----------------+
|30               |
+-----------------+



                                                                                

## Q-2) What distinct types of calls were made to the Fire Department?

In [23]:
# Execute your SQL query and create a DataFrame
query = """
SELECT DISTINCT CallType AS DistinctCallTypes
FROM fire_services
WHERE CallType IS NOT NULL;
"""
result_df = spark.sql(query)
result_df.show(truncate=False)

+--------------------------------------------+
|DistinctCallTypes                           |
+--------------------------------------------+
|Elevator / Escalator Rescue                 |
|Marine Fire                                 |
|Aircraft Emergency                          |
|Confined Space / Structure Collapse         |
|Administrative                              |
|Alarms                                      |
|Odor (Strange / Unknown)                    |
|Citizen Assist / Service Call               |
|HazMat                                      |
|Watercraft in Distress                      |
|Explosion                                   |
|Oil Spill                                   |
|Vehicle Fire                                |
|Suspicious Package                          |
|Extrication / Entrapped (Machinery, Vehicle)|
|Other                                       |
|Outside Fire                                |
|Traffic Collision                           |
|Assist Polic

## Q-3) Which response delays were longer than 5 minutes?
- Rename the column `Delay` to `ResponseDelayedinMins`.
- This returns a new DataFrame.
- Find all calls where the response to the fire site was delayed by more than 5 minutes.

In [24]:
renamed_fire_df = fire_ts_df.withColumnRenamed("Delay", "ResponseDelayedinMins")
renamed_fire_df \
        .select("ResponseDelayedinMins") \
        .where(col("ResponseDelayedinMins") > 5) \
        .show(5, False)

+---------------------+
|ResponseDelayedinMins|
+---------------------+
|5.35                 |
|6.25                 |
|5.2                  |
|5.6                  |
|7.25                 |
+---------------------+
only showing top 5 rows



## Q-4) What were all the different types of fire calls in 2018?

In [28]:
query = """
SELECT DISTINCT CallType
FROM fire_services
WHERE EXTRACT(YEAR FROM IncidentDate) = 2018;
"""
spark.sql(query).show( truncate=False)

+-------------------------------+
|CallType                       |
+-------------------------------+
|Elevator / Escalator Rescue    |
|Alarms                         |
|Odor (Strange / Unknown)       |
|Citizen Assist / Service Call  |
|HazMat                         |
|Explosion                      |
|Vehicle Fire                   |
|Suspicious Package             |
|Other                          |
|Outside Fire                   |
|Traffic Collision              |
|Assist Police                  |
|Gas Leak (Natural and LP Gases)|
|Water Rescue                   |
|Electrical Hazard              |
|Structure Fire                 |
|Medical Incident               |
|Fuel Spill                     |
|Smoke Investigation (Outside)  |
|Train / Rail Incident          |
+-------------------------------+



## Q-5) What are the most common types of fire calls?

In [29]:
# Query the View
query = """

SELECT CallType, COUNT(*) AS CallCount
FROM fire_services
WHERE CallType IS NOT NULL
GROUP BY CallType
ORDER BY CallCount DESC;
"""
spark.sql(query).show(truncate=False)

+-------------------------------+---------+
|CallType                       |CallCount|
+-------------------------------+---------+
|Medical Incident               |113794   |
|Structure Fire                 |23319    |
|Alarms                         |19406    |
|Traffic Collision              |7013     |
|Citizen Assist / Service Call  |2524     |
|Other                          |2166     |
|Outside Fire                   |2094     |
|Vehicle Fire                   |854      |
|Gas Leak (Natural and LP Gases)|764      |
|Water Rescue                   |755      |
|Odor (Strange / Unknown)       |490      |
|Electrical Hazard              |482      |
|Elevator / Escalator Rescue    |453      |
|Smoke Investigation (Outside)  |391      |
|Fuel Spill                     |193      |
|HazMat                         |124      |
|Industrial Accidents           |94       |
|Explosion                      |89       |
|Train / Rail Incident          |57       |
|Aircraft Emergency             

## Q-6) What were the most common call types? (DataFrame API)

In [30]:
(fire_ts_df
 .select("CallType").where(col("CallType").isNotNull())
 .groupBy("CallType")
 .count()
 .orderBy("count", ascending=False)
 .show(n=10, truncate=False))

+-------------------------------+------+
|CallType                       |count |
+-------------------------------+------+
|Medical Incident               |113794|
|Structure Fire                 |23319 |
|Alarms                         |19406 |
|Traffic Collision              |7013  |
|Citizen Assist / Service Call  |2524  |
|Other                          |2166  |
|Outside Fire                   |2094  |
|Vehicle Fire                   |854   |
|Gas Leak (Natural and LP Gases)|764   |
|Water Rescue                   |755   |
+-------------------------------+------+
only showing top 10 rows



## Q-6A) Which zip codes accounted for the most common calls?

In [35]:
# Query the View
query = """
SELECT CallType, ZipCode, COUNT(*) AS CallCount
FROM fire_services
WHERE CallType IS NOT NULL
GROUP BY CallType, ZipCode
ORDER BY CallCount DESC;
"""
spark.sql(query).show(truncate=False)

+----------------+-------+---------+
|CallType        |ZipCode|CallCount|
+----------------+-------+---------+
|Medical Incident|94102  |16130    |
|Medical Incident|94103  |14775    |
|Medical Incident|94110  |9995     |
|Medical Incident|94109  |9479     |
|Medical Incident|94124  |5885     |
|Medical Incident|94112  |5630     |
|Medical Incident|94115  |4785     |
|Medical Incident|94122  |4323     |
|Medical Incident|94107  |4284     |
|Medical Incident|94133  |3977     |
|Medical Incident|94117  |3522     |
|Medical Incident|94134  |3437     |
|Medical Incident|94114  |3225     |
|Medical Incident|94118  |3104     |
|Medical Incident|94121  |2953     |
|Medical Incident|94116  |2738     |
|Medical Incident|94132  |2594     |
|Structure Fire  |94110  |2267     |
|Medical Incident|94105  |2258     |
|Structure Fire  |94102  |2229     |
+----------------+-------+---------+
only showing top 20 rows



## Q-6B) Which San Francisco neighborhoods are in the zip codes 94102 and 94103?

In [37]:
# Query the View
query = """
SELECT DISTINCT Neighborhood
FROM fire_services
WHERE ZipCode IN (94102, 94103);
"""
spark.sql(query).show(truncate=False)

+------------------------------+
|Neighborhood                  |
+------------------------------+
|Western Addition              |
|Mission Bay                   |
|Hayes Valley                  |
|Financial District/South Beach|
|Nob Hill                      |
|Mission                       |
|Tenderloin                    |
|Potrero Hill                  |
|Castro/Upper Market           |
|South of Market               |
+------------------------------+




## Q-6a) How many distinct years of data are in the CSV file?

In [43]:
# Query the View
query = """
SELECT COUNT(DISTINCT EXTRACT(YEAR FROM IncidentDate)) AS DistinctYears
FROM fire_services;
"""
spark.sql(query).show(truncate=False)

+-------------+
|DistinctYears|
+-------------+
|19           |
+-------------+



In [44]:
# Query the View
query = """
SELECT DISTINCT EXTRACT(YEAR FROM IncidentDate) AS DistinctYears
FROM fire_services
ORDER BY DistinctYears;
"""
spark.sql(query).show(truncate=False)

+-------------+
|DistinctYears|
+-------------+
|2000         |
|2001         |
|2002         |
|2003         |
|2004         |
|2005         |
|2006         |
|2007         |
|2008         |
|2009         |
|2010         |
|2011         |
|2012         |
|2013         |
|2014         |
|2015         |
|2016         |
|2017         |
|2018         |
+-------------+



## Q-6b) What week of the year in 2018 had the most fire calls?

In [50]:
(fire_ts_df
 .filter(year('IncidentDate') == 2018)
 .groupBy(weekofyear('IncidentDate'))
 .count()
 .orderBy('count', ascending=False)
 .show())

+------------------------+-----+
|weekofyear(IncidentDate)|count|
+------------------------+-----+
|                      22|  259|
|                      40|  255|
|                      43|  250|
|                      25|  249|
|                       1|  246|
|                      44|  244|
|                      13|  243|
|                      32|  243|
|                      11|  240|
|                       5|  236|
|                      18|  236|
|                      23|  235|
|                      42|  234|
|                       2|  234|
|                      31|  234|
|                      19|  233|
|                      10|  232|
|                      34|  232|
|                       8|  232|
|                      28|  231|
+------------------------+-----+
only showing top 20 rows
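
One caveat when reading these week numbers: Spark's documentation describes `weekofyear` as following ISO 8601, where weeks start on Monday and week 1 is the first week with more than 3 days. Python's `date.isocalendar()` follows the same standard, so, assuming the two agree, the plain-Python sketch below shows how the edges of 2018 are numbered.

```python
from datetime import date

# ISO 8601 weeks start on Monday; week 1 is the first week with more than 3 days in the year.
first_day = date(2018, 1, 1)
last_day = date(2018, 12, 31)

# isocalendar() returns (ISO year, ISO week number, ISO weekday)
iso_year_first, iso_week_first, _ = first_day.isocalendar()
iso_year_last, iso_week_last, _ = last_day.isocalendar()

print(iso_year_first, iso_week_first)
print(iso_year_last, iso_week_last)
```

31 December 2018 falls in ISO week 1 of 2019. After filtering on the calendar year 2018, the week `1` bucket may therefore also contain calls from the last day of the year.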



### Checking Missing Values

In [9]:
df = fire_df

In [10]:
from pyspark.sql.functions import col

null_entries_df = fire_df.filter(F.col('CallNumber').isNull() | \
                                 F.col('UnitID').isNull() | \
                                 F.col('IncidentNumber').isNull() | \
                                 F.col('CallType').isNull() | \
                                 F.col('CallDate').isNull() | \
                                 F.col('WatchDate').isNull() | \
                                 F.col('CallFinalDisposition').isNull() | \
                                 F.col('AvailableDtTm').isNull() | \
                                 F.col('Address').isNull() | \
                                 F.col('City').isNull() | \
                                 F.col('Zipcode').isNull() | \
                                 F.col('Battalion').isNull() | \
                                 F.col('StationArea').isNull() | \
                                 F.col('Box').isNull() | \
                                 F.col('OriginalPriority').isNull() | \
                                 F.col('Priority').isNull() | \
                                 F.col('FinalPriority').isNull() | \
                                 F.col('ALSUnit').isNull() | \
                                 F.col('CallTypeGroup').isNull() | \
                                 F.col('NumAlarms').isNull() | \
                                 F.col('UnitType').isNull() | \
                                 F.col('UnitSequenceInCallDispatch').isNull() | \
                                 F.col('FirePreventionDistrict').isNull() | \
                                 F.col('SupervisorDistrict').isNull() | \
                                 F.col('Neighborhood').isNull() | \
                                 F.col('Location').isNull() | \
                                 F.col('RowID').isNull() | \
                                 F.col('Delay').isNull())

In [11]:
null_entries_df.count()

                                                                                

100992

In [12]:
from pyspark.sql.functions import col

not_null_entries_df = fire_df.filter(
    ~F.col('CallNumber').isNull() &
    ~F.col('UnitID').isNull() &
    ~F.col('IncidentNumber').isNull() &
    ~F.col('CallType').isNull() &
    ~F.col('CallDate').isNull() &
    ~F.col('WatchDate').isNull() &
    ~F.col('CallFinalDisposition').isNull() &
    ~F.col('AvailableDtTm').isNull() &
    ~F.col('Address').isNull() &
    ~F.col('City').isNull() &
    ~F.col('Zipcode').isNull() &
    ~F.col('Battalion').isNull() &
    ~F.col('StationArea').isNull() &
    ~F.col('Box').isNull() &
    ~F.col('OriginalPriority').isNull() &
    ~F.col('Priority').isNull() &
    ~F.col('FinalPriority').isNull() &
    ~F.col('ALSUnit').isNull() &
    ~F.col('CallTypeGroup').isNull() &
    ~F.col('NumAlarms').isNull() &
    ~F.col('UnitType').isNull() &
    ~F.col('UnitSequenceInCallDispatch').isNull() &
    ~F.col('FirePreventionDistrict').isNull() &
    ~F.col('SupervisorDistrict').isNull() &
    ~F.col('Neighborhood').isNull() &
    ~F.col('Location').isNull() &
    ~F.col('RowID').isNull() &
    ~F.col('Delay').isNull()
)
not_null_entries_df.count()

                                                                                

74304

In [13]:
# fire_df.count() returns the number of rows, not columns
total_rows = fire_df.count()
count_nulls = null_entries_df.count()
count_not_nulls = not_null_entries_df.count()

if count_nulls + count_not_nulls == total_rows:
    print(f"Total Rows: {total_rows}")
    print(f"Rows with at least one null: {count_nulls}")
    print(f"Rows with no nulls: {count_not_nulls}")
    print("The row counts add up")
else:
    print("Counts do not add up to the total number of rows.")

Total Rows: 175296
Rows with at least one null: 100992
Rows with no nulls: 74304
The row counts add up
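
The two long filter conditions above can also be generated from a list of column names. `DataFrame.filter` accepts a SQL expression string as well as a `Column`, so one possible approach is to build that string in plain Python. The helper names `any_null_expr` and `no_null_expr` are hypothetical, and the column list is shortened for illustration.

```python
# Hypothetical helpers (not part of the notebook): build SQL filter expressions
# from a list of column names.
def any_null_expr(columns):
    """Expression that is true when at least one of the columns is NULL."""
    return " OR ".join(f"`{c}` IS NULL" for c in columns)

def no_null_expr(columns):
    """Expression that is true only when none of the columns is NULL."""
    return " AND ".join(f"`{c}` IS NOT NULL" for c in columns)

# Shortened column list for illustration; in the notebook this would be fire_df.columns.
sample_columns = ["CallNumber", "UnitID", "Delay"]

print(any_null_expr(sample_columns))
print(no_null_expr(sample_columns))
```

With a live session, the intended use would be `fire_df.filter(any_null_expr(fire_df.columns))` and `fire_df.filter(no_null_expr(fire_df.columns))`. For the second case, `fire_df.na.drop()` should give the same rows, since it drops rows containing any null by default.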


                                                                                

In [10]:
output_path = "/Users/oasis/Desktop/learn-spark/parq_data"

In [11]:
# write the data to parquet file
fire_df.write.format("parquet").mode("overwrite").save(output_path)

                                                                                

In [14]:
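# write a dated, snappy-compressed copy of the data to a Parquet file
today_date = datetime.now().strftime("%Y-%m-%d")
# use a separate variable so that re-running this cell does not keep nesting the path
curated_path = f"{output_path}/{today_date}_curated"

# mode("overwrite") lets the cell be re-run on the same day without a "path already exists" error
fire_df.write.mode("overwrite").option("compression", "snappy").parquet(curated_path)
print("Data written successfully.")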

Data written successfully.


                                                                                