# common DataFrame operations
performed on the San Francisco Fire Department calls dataset


# questions
1. What is the difference between spark.read and DataFrameReader, and similarly between df.write and DataFrameWriter?


# notes
Spark provides an interface, DataFrameReader, that enables you to read data into a DataFrame from myriad
data sources in formats such as JSON, CSV, Parquet, Text, Avro, ORC, etc. Likewise,
to write a DataFrame back to a data source in a particular format, Spark uses DataFrameWriter.

To write the DataFrame into an external data source in your format of choice, you
can use the DataFrameWriter interface. Like DataFrameReader, it supports multiple
data sources. Parquet, a popular columnar format, is the default format; it uses
snappy compression to compress the data. If the DataFrame is written as Parquet, the
schema is preserved as part of the Parquet metadata. In this case, subsequent reads
back into a DataFrame do not require you to manually supply a schema.


# errors faced and solutions
File-already-exists error when writing new files from a DataFrame: https://stackoverflow.com/questions/49102292/file-already-exists-error-writing-new-files-from-dataframe

In [1]:
# imports and initialisation

from pyspark.sql import SparkSession
from pyspark.sql import DataFrameReader, DataFrameWriter
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, BooleanType
from pyspark.sql.functions import col
import os

dataset_path = os.path.join(os.getcwd(), "datasets", "Fire_Department_Calls_for_Service.csv")

# spark = SparkSession.builder\
#                     .appName("SF_DF")\
#                     .getOrCreate()


# https://stackoverflow.com/questions/52133731/how-to-solve-cant-assign-requested-address-service-sparkdriver-failed-after
spark = SparkSession.builder\
                    .appName("SF_DF")\
                    .config("spark.driver.bindAddress", "127.0.0.1")\
                    .getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/03/26 19:38:33 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [2]:
# define schema

sf_firecall_schema = StructType([
    StructField('CallNumber', IntegerType(), True),
    StructField('UnitID', StringType(), True),
    StructField('IncidentNumber', IntegerType(), True),
    StructField('CallType', StringType(), True),
    StructField('CallDate', StringType(), True),
    StructField('WatchDate', StringType(), True),
    StructField('ReceivedDtTm', StringType(), True),
    StructField('EntryDtTm', StringType(), True),
    StructField('DispatchDtTm', StringType(), True),
    StructField('ResponseDtTm', StringType(), True),
    StructField('OnSceneDtTm', StringType(), True),
    StructField('TransportDtTm', StringType(), True),
    StructField('HospitalDtTm', StringType(), True),
    StructField('CallFinalDisposition', StringType(), True),
    StructField('AvailableDtTm', StringType(), True),
    StructField('Address', StringType(), True),
    StructField('City', StringType(), True),
    StructField('ZipcodeofIncident', IntegerType(), True),
    StructField('Battalion', StringType(), True),
    StructField('StationArea', StringType(), True),
    StructField('Box', StringType(), True),
    StructField('OriginalPriority', StringType(), True),
    StructField('Priority', StringType(), True),
    StructField('FinalPriority', StringType(), True),
    StructField('ALSUnit', BooleanType(), True),
    StructField('CallTypeGroup', StringType(), True),
    StructField('NumberofAlarms', IntegerType(), True),
    StructField('UnitType', StringType(), True),
    StructField('UnitSequenceInCallDispatch', IntegerType(), True),
    StructField('FirePreventionDistrict', StringType(), True),
    StructField('SupervisorDistrict', StringType(), True),
    StructField('NeighborhooodsAnalysisBoundaries', StringType(), True),
    StructField('Location', StringType(), True),
    StructField('RowID', IntegerType(), True)])

In [3]:
sf_firecall_df = spark.read.csv(dataset_path, sf_firecall_schema, header=True)

In [4]:
sf_firecall_df.select("CallNumber","CallDate","CallType","City").where(col("CallType") != "Medical Incident").show(n=10,truncate=False)

25/03/26 19:38:36 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: Call Number, Call Type, Call Date, City
 Schema: CallNumber, CallType, CallDate, City
Expected: CallNumber but found: Call Number
CSV file: file:///Users/pvasud669@apac.comcast.com/repos/learnings/spark/datasets/Fire_Department_Calls_for_Service.csv


+----------+----------+-----------------------------+----+
|CallNumber|CallDate  |CallType                     |City|
+----------+----------+-----------------------------+----+
|1030107   |04/12/2000|Alarms                       |SF  |
|1030112   |04/12/2000|Citizen Assist / Service Call|SF  |
|1030116   |04/12/2000|Electrical Hazard            |SF  |
|1030117   |04/12/2000|Odor (Strange / Unknown)     |SF  |
|1030120   |04/12/2000|Alarms                       |SF  |
|1030128   |04/12/2000|Alarms                       |SF  |
|1030128   |04/12/2000|Alarms                       |SF  |
|1030132   |04/12/2000|Other                        |SF  |
|1030136   |04/12/2000|Structure Fire               |SF  |
|1030143   |04/12/2000|Other                        |SF  |
+----------+----------+-----------------------------+----+
only showing top 10 rows



In [5]:
# sf_firecall_df.write.format("parquet").save(os.path.join(os.getcwd(), "dst","sf_firecall"))

In [6]:
# Alternatively, you can save it as a table, which registers metadata with the Hive metastore

# sf_firecall_df.write.format("parquet").saveAsTable("sf_firecall")

# Where is the table data stored? A "spark-warehouse" directory is created in the current working directory, with a subfolder named after the table.

In [7]:
# distinct call types count
from pyspark.sql.functions import countDistinct

distinct_call_types_count = sf_firecall_df.\
                        where(col("CallType").isNotNull()).\
                        agg(countDistinct(col("CallType")).alias("DistinctCallTypes_count"))

distinct_call_types_count.show()

25/03/26 19:38:37 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: Call Type
 Schema: CallType
Expected: CallType but found: Call Type
CSV file: file:///Users/pvasud669@apac.comcast.com/repos/learnings/spark/datasets/Fire_Department_Calls_for_Service.csv

+-----------------------+
|DistinctCallTypes_count|
+-----------------------+
|                     32|
+-----------------------+



                                                                                

In [8]:
# distinct call types list

distinct_call_types = sf_firecall_df.\
                        select("CallType").\
                        where(col("CallType").isNotNull()).\
                        distinct()

distinct_call_types.sort(col("CallType")).show(truncate=False)

25/03/26 19:38:39 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: Call Type
 Schema: CallType
Expected: CallType but found: Call Type
CSV file: file:///Users/pvasud669@apac.comcast.com/repos/learnings/spark/datasets/Fire_Department_Calls_for_Service.csv

+--------------------------------------------+
|CallType                                    |
+--------------------------------------------+
|Administrative                              |
|Aircraft Emergency                          |
|Alarms                                      |
|Assist Police                               |
|Citizen Assist / Service Call               |
|Confined Space / Structure Collapse         |
|Electrical Hazard                           |
|Elevator / Escalator Rescue                 |
|Explosion                                   |
|Extrication / Entrapped (Machinery, Vehicle)|
|Fuel Spill                                  |
|Gas Leak (Natural and LP Gases)             |
|HazMat                                      |
|High Angle Rescue                           |
|Industrial Accidents                        |
|Lightning Strike (Investigation)            |
|Marine Fire                                 |
|Medical Incident                            |
|Mutual Aid /

                                                                                

# Renaming, adding, and dropping columns.

In [9]:
from pyspark.sql.functions import length

In [10]:


sf_firecall_df_renamed_col = sf_firecall_df\
                                .withColumnRenamed("NeighborhooodsAnalysisBoundaries","NeighborhooodsBoundaries")
(sf_firecall_df_renamed_col.select("NeighborhooodsBoundaries").show(n=5))
(sf_firecall_df_renamed_col.select("NeighborhooodsBoundaries").where(length(col("NeighborhooodsBoundaries")) < 10).show(n=10))

# withColumnsRenamed (available from Spark 3.4) renames several columns at once from a dict of old-to-new names

+------------------------+
|NeighborhooodsBoundaries|
+------------------------+
|         Sunset/Parkside|
|         Sunset/Parkside|
|              Tenderloin|
|              Tenderloin|
|    Financial Distric...|
+------------------------+
only showing top 5 rows

+------------------------+
|NeighborhooodsBoundaries|
+------------------------+
|                Nob Hill|
|                Nob Hill|
|                    None|
|               Chinatown|
|                 Mission|
|                 Mission|
|                 Mission|
|                Nob Hill|
|                 Mission|
|                Nob Hill|
+------------------------+
only showing top 10 rows



25/03/26 19:38:41 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: Neighborhooods - Analysis Boundaries
 Schema: NeighborhooodsAnalysisBoundaries
Expected: NeighborhooodsAnalysisBoundaries but found: Neighborhooods - Analysis Boundaries
CSV file: file:///Users/pvasud669@apac.comcast.com/repos/learnings/spark/datasets/Fire_Department_Calls_for_Service.csv


In [11]:
from pyspark.sql.functions import lit, to_timestamp

In [12]:
# adding a new column
# https://stackoverflow.com/questions/32788322/how-to-add-a-constant-column-in-a-spark-dataframe

added_col = sf_firecall_df_renamed_col.withColumn("response_in_days", lit("dummy"))
added_col.describe  # not called: this is only a reference to the method; added_col.describe() would build a summary-statistics DataFrame
added_col.schema

# Jupyter echoes only the last expression in a cell, so only the schema appears below.
# To echo every expression, change the notebook setting:
# %config InteractiveShell.ast_node_interactivity = 'all'



StructType([StructField('CallNumber', IntegerType(), True), StructField('UnitID', StringType(), True), StructField('IncidentNumber', IntegerType(), True), StructField('CallType', StringType(), True), StructField('CallDate', StringType(), True), StructField('WatchDate', StringType(), True), StructField('ReceivedDtTm', StringType(), True), StructField('EntryDtTm', StringType(), True), StructField('DispatchDtTm', StringType(), True), StructField('ResponseDtTm', StringType(), True), StructField('OnSceneDtTm', StringType(), True), StructField('TransportDtTm', StringType(), True), StructField('HospitalDtTm', StringType(), True), StructField('CallFinalDisposition', StringType(), True), StructField('AvailableDtTm', StringType(), True), StructField('Address', StringType(), True), StructField('City', StringType(), True), StructField('ZipcodeofIncident', IntegerType(), True), StructField('Battalion', StringType(), True), StructField('StationArea', StringType(), True), StructField('Box', StringTyp

In [13]:
# preview two rows with the placeholder column (filtering for responses longer than 1 day is not implemented here)
added_col.show(n=2)

25/03/26 19:38:41 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


+----------+------+--------------+----------------+----------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----+-----------------+---------+-----------+----+----------------+--------+-------------+-------+-------------+--------------+--------+--------------------------+----------------------+------------------+------------------------+--------------------+-----+----------------+
|CallNumber|UnitID|IncidentNumber|        CallType|  CallDate| WatchDate|        ReceivedDtTm|           EntryDtTm|        DispatchDtTm|        ResponseDtTm|         OnSceneDtTm|       TransportDtTm|        HospitalDtTm|CallFinalDisposition|       AvailableDtTm|             Address|City|ZipcodeofIncident|Battalion|StationArea| Box|OriginalPriority|Priority|FinalPriority|ALSUnit|CallTypeGroup|NumberofAlarms|UnitType|UnitSequenceInCallDispa

25/03/26 19:38:41 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: Call Number, Unit ID, Incident Number, Call Type, Call Date, Watch Date, Received DtTm, Entry DtTm, Dispatch DtTm, Response DtTm, On Scene DtTm, Transport DtTm, Hospital DtTm, Call Final Disposition, Available DtTm, Address, City, Zipcode of Incident, Battalion, Station Area, Box, Original Priority, Priority, Final Priority, ALS Unit, Call Type Group, Number of Alarms, Unit Type, Unit sequence in call dispatch, Fire Prevention District, Supervisor District, Neighborhooods - Analysis Boundaries, Location, RowID
 Schema: CallNumber, UnitID, IncidentNumber, CallType, CallDate, WatchDate, ReceivedDtTm, EntryDtTm, DispatchDtTm, ResponseDtTm, OnSceneDtTm, TransportDtTm, HospitalDtTm, CallFinalDisposition, AvailableDtTm, Address, City, ZipcodeofIncident, Battalion, StationArea, Box, OriginalPriority, Priority, FinalPriority, ALSUnit, CallTypeGroup, NumberofAlarms, UnitType, UnitSequenceInCallDispatch,

In [14]:
# adding more than one new column
# Signature:
# sf_firecall_df_renamed_col.withColumns(
#     *colsMap: Dict[str, pyspark.sql.column.Column],
# ) -> 'DataFrame'
# Docstring:
# Returns a new :class:`DataFrame` by adding multiple columns or replacing the
# existing columns that have the same names.

# The colsMap is a map of column name and column, the column must only refer to attributes
# supplied by this Dataset. It is an error to add columns that refer to some other Dataset.


added_cols = sf_firecall_df_renamed_col.withColumns({"newcol1":lit("new col data 1"),"newcol2":lit("new col data 2")})

In [15]:
added_cols.show(n=2)

+----------+------+--------------+----------------+----------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----+-----------------+---------+-----------+----+----------------+--------+-------------+-------+-------------+--------------+--------+--------------------------+----------------------+------------------+------------------------+--------------------+-----+--------------+--------------+
|CallNumber|UnitID|IncidentNumber|        CallType|  CallDate| WatchDate|        ReceivedDtTm|           EntryDtTm|        DispatchDtTm|        ResponseDtTm|         OnSceneDtTm|       TransportDtTm|        HospitalDtTm|CallFinalDisposition|       AvailableDtTm|             Address|City|ZipcodeofIncident|Battalion|StationArea| Box|OriginalPriority|Priority|FinalPriority|ALSUnit|CallTypeGroup|NumberofAlarms|UnitType|UnitSequen

25/03/26 19:38:41 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: Call Number, Unit ID, Incident Number, Call Type, Call Date, Watch Date, Received DtTm, Entry DtTm, Dispatch DtTm, Response DtTm, On Scene DtTm, Transport DtTm, Hospital DtTm, Call Final Disposition, Available DtTm, Address, City, Zipcode of Incident, Battalion, Station Area, Box, Original Priority, Priority, Final Priority, ALS Unit, Call Type Group, Number of Alarms, Unit Type, Unit sequence in call dispatch, Fire Prevention District, Supervisor District, Neighborhooods - Analysis Boundaries, Location, RowID
 Schema: CallNumber, UnitID, IncidentNumber, CallType, CallDate, WatchDate, ReceivedDtTm, EntryDtTm, DispatchDtTm, ResponseDtTm, OnSceneDtTm, TransportDtTm, HospitalDtTm, CallFinalDisposition, AvailableDtTm, Address, City, ZipcodeofIncident, Battalion, StationArea, Box, OriginalPriority, Priority, FinalPriority, ALSUnit, CallTypeGroup, NumberofAlarms, UnitType, UnitSequenceInCallDispatch,

In [16]:
# dropping cols
# Signature: added_cols.drop(*cols: 'ColumnOrName') -> 'DataFrame'
# Docstring:
# Returns a new :class:`DataFrame` without specified columns.
# This is a no-op if the schema doesn't contain the given column name(s).

dropped_col_df = added_cols.drop("newcol1")
dropped_col_df.show(n=2)

+----------+------+--------------+----------------+----------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----+-----------------+---------+-----------+----+----------------+--------+-------------+-------+-------------+--------------+--------+--------------------------+----------------------+------------------+------------------------+--------------------+-----+--------------+
|CallNumber|UnitID|IncidentNumber|        CallType|  CallDate| WatchDate|        ReceivedDtTm|           EntryDtTm|        DispatchDtTm|        ResponseDtTm|         OnSceneDtTm|       TransportDtTm|        HospitalDtTm|CallFinalDisposition|       AvailableDtTm|             Address|City|ZipcodeofIncident|Battalion|StationArea| Box|OriginalPriority|Priority|FinalPriority|ALSUnit|CallTypeGroup|NumberofAlarms|UnitType|UnitSequenceInCallDispatc

25/03/26 19:38:41 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: Call Number, Unit ID, Incident Number, Call Type, Call Date, Watch Date, Received DtTm, Entry DtTm, Dispatch DtTm, Response DtTm, On Scene DtTm, Transport DtTm, Hospital DtTm, Call Final Disposition, Available DtTm, Address, City, Zipcode of Incident, Battalion, Station Area, Box, Original Priority, Priority, Final Priority, ALS Unit, Call Type Group, Number of Alarms, Unit Type, Unit sequence in call dispatch, Fire Prevention District, Supervisor District, Neighborhooods - Analysis Boundaries, Location, RowID
 Schema: CallNumber, UnitID, IncidentNumber, CallType, CallDate, WatchDate, ReceivedDtTm, EntryDtTm, DispatchDtTm, ResponseDtTm, OnSceneDtTm, TransportDtTm, HospitalDtTm, CallFinalDisposition, AvailableDtTm, Address, City, ZipcodeofIncident, Battalion, StationArea, Box, OriginalPriority, Priority, FinalPriority, ALSUnit, CallTypeGroup, NumberofAlarms, UnitType, UnitSequenceInCallDispatch,

In [17]:
from pyspark.sql.functions import unix_timestamp, when

In [18]:
# compute the time taken to respond, in seconds
# unix_timestamp converts each date string into seconds since the epoch
sf_firecall_df_new = sf_firecall_df.withColumns({
    "res_date": unix_timestamp(col("ResponseDtTm"), 'MM/dd/yyyy hh:mm:ss a'),
    "rec_date": unix_timestamp(col("ReceivedDtTm"), 'MM/dd/yyyy hh:mm:ss a')})
# a null timestamp on either side makes the difference null; such rows get 0
sf_firecall_df_new = sf_firecall_df_new.withColumn("res_time_taken_in_sec",
                                when((col("res_date")-col("rec_date")).isNull(), 0)\
                              .otherwise(col("res_date")-col("rec_date")))\
        .select("IncidentNumber","ResponseDtTm","ReceivedDtTm","res_date","rec_date","res_time_taken_in_sec")

                              

In [19]:

sf_firecall_df_new.orderBy(col("res_time_taken_in_sec"), ascending=False).show(truncate=False)

# extract year from the date
from pyspark.sql.functions import year
date_ops_df = sf_firecall_df_new.withColumns({"resdate":to_timestamp(col("ResponseDtTm"),'MM/dd/yyyy hh:mm:ss a'), "recdate":to_timestamp(col("ReceivedDtTm"),'MM/dd/yyyy hh:mm:ss a')})
date_ops_df.select(year("recdate").alias("extracted_year")).distinct().orderBy("extracted_year").show()


25/03/26 19:38:41 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: Incident Number, Received DtTm, Response DtTm
 Schema: IncidentNumber, ReceivedDtTm, ResponseDtTm
Expected: IncidentNumber but found: Incident Number
CSV file: file:///Users/pvasud669@apac.comcast.com/repos/learnings/spark/datasets/Fire_Department_Calls_for_Service.csv
25/03/26 19:38:44 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: Received DtTm
 Schema: ReceivedDtTm
Expected: ReceivedDtTm but found: Received DtTm
CSV file: file:///Users/pvasud669@apac.comcast.com/repos/learnings/spark/datasets/Fire_Department_Calls_for_Service.csv


+--------------+----------------------+----------------------+----------+----------+---------------------+
|IncidentNumber|ResponseDtTm          |ReceivedDtTm          |res_date  |rec_date  |res_time_taken_in_sec|
+--------------+----------------------+----------------------+----------+----------+---------------------+
|87458         |10/19/2000 12:45:59 PM|10/18/2000 05:26:22 AM|971939759 |971826982 |112777               |
|17066608      |06/08/2017 02:33:46 PM|06/07/2017 07:49:13 AM|1496912626|1496801953|110673               |
|9071789       |08/31/2009 03:45:00 PM|08/30/2009 10:45:34 AM|1251713700|1251609334|104366               |
|87458         |10/19/2000 09:21:44 AM|10/18/2000 05:26:22 AM|971927504 |971826982 |100522               |
|17119230      |10/11/2017 06:55:12 PM|10/10/2017 03:17:46 PM|1507728312|1507628866|99446                |
|7101115       |12/19/2007 08:46:56 AM|12/18/2007 06:51:08 AM|1198034216|1197940868|93348                |
|1052942       |06/21/2001 10:03:58 A



+--------------+
|extracted_year|
+--------------+
|          2000|
|          2001|
|          2002|
|          2003|
|          2004|
|          2005|
|          2006|
|          2007|
|          2008|
|          2009|
|          2010|
|          2011|
|          2012|
|          2013|
|          2014|
|          2015|
|          2016|
|          2017|
|          2018|
+--------------+
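
As a sanity check on `res_time_taken_in_sec`, the first row of the response-time table above (received 10/18/2000 05:26:22 AM, responded 10/19/2000 12:45:59 PM) can be recomputed with the Python standard library. This is a sketch that assumes both timestamps are in the same time zone with no daylight-saving change in between; `PY_FORMAT` is a name introduced here as the `strptime` counterpart of the Spark pattern `MM/dd/yyyy hh:mm:ss a`.

```python
from datetime import datetime

# strptime counterpart of the Spark pattern 'MM/dd/yyyy hh:mm:ss a'
PY_FORMAT = "%m/%d/%Y %I:%M:%S %p"

received = datetime.strptime("10/18/2000 05:26:22 AM", PY_FORMAT)
responded = datetime.strptime("10/19/2000 12:45:59 PM", PY_FORMAT)

elapsed = responded - received
elapsed_seconds = int(elapsed.total_seconds())

print(elapsed_seconds)  # 112777, matching the table
print(elapsed)          # 1 day, 7:19:37
```

So the largest values in the table correspond to responses that took a little over one day.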



                                                                                

# aggregations

In [20]:
# Let’s take our first question: what were the most common types of fire calls?

# to check cols on the df sf_firecall_df.columns

sf_firecall_df.where(col('CallType').isNotNull())\
                .groupBy('CallType').count()\
                .orderBy('count', ascending = False)\
                .show()

25/03/26 19:38:46 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: Call Type
 Schema: CallType
Expected: CallType but found: Call Type
CSV file: file:///Users/pvasud669@apac.comcast.com/repos/learnings/spark/datasets/Fire_Department_Calls_for_Service.csv

+--------------------+-------+
|            CallType|  count|
+--------------------+-------+
|    Medical Incident|3135026|
|      Structure Fire| 628420|
|              Alarms| 517243|
|   Traffic Collision| 197956|
|               Other|  77082|
|Citizen Assist / ...|  72529|
|        Outside Fire|  57213|
|        Vehicle Fire|  23178|
|        Water Rescue|  22991|
|Gas Leak (Natural...|  18393|
|   Electrical Hazard|  13580|
|Elevator / Escala...|  12728|
|Odor (Strange / U...|  12474|
|Smoke Investigati...|  10734|
|          Fuel Spill|   5593|
|              HazMat|   3931|
|Industrial Accidents|   2836|
|           Explosion|   2587|
|  Aircraft Emergency|   1511|
|       Assist Police|   1334|
+--------------------+-------+
only showing top 20 rows



                                                                                

In [21]:
# Along with all the others we’ve seen, the DataFrame API provides descriptive
# statistical methods like min(), max(), sum(), and avg(). Let’s take a look at some
# examples showing how to compute them with our SF Fire Department data set.

In [22]:
# What were all the different types of fire calls in 2018?

from pyspark.sql.functions import to_date

# firecalls_2018.select('CallDate','calldate_year').show()

# 'MM' is month-of-year; lowercase 'mm' means minute-of-hour and would misparse the month
firecalls_2018 = sf_firecall_df\
    .where(year(to_date('CallDate', 'MM/dd/yyyy')) == 2018)\
    .withColumn('calldate_year', year(to_date('CallDate', 'MM/dd/yyyy')))

firecalls_2018\
    .groupBy('CallType')\
    .count()\
    .orderBy('count', ascending = False)\
    .show()

25/03/26 19:38:48 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: Call Type, Call Date
 Schema: CallType, CallDate
Expected: CallType but found: Call Type
CSV file: file:///Users/pvasud669@apac.comcast.com/repos/learnings/spark/datasets/Fire_Department_Calls_for_Service.csv

+--------------------+------+
|            CallType| count|
+--------------------+------+
|    Medical Incident|198404|
|              Alarms| 32646|
|      Structure Fire| 24843|
|   Traffic Collision| 12367|
|        Outside Fire|  4362|
|               Other|  3858|
|Citizen Assist / ...|  3842|
|Gas Leak (Natural...|  1655|
|        Water Rescue|  1341|
|        Vehicle Fire|   962|
|   Electrical Hazard|   915|
|Elevator / Escala...|   867|
|Smoke Investigati...|   784|
|          Fuel Spill|   259|
|Odor (Strange / U...|   208|
|Train / Rail Inci...|   144|
|              HazMat|   132|
|           Explosion|    63|
|Industrial Accidents|    51|
|Extrication / Ent...|    49|
+--------------------+------+
only showing top 20 rows



                                                                                

In [23]:
# What months within the year 2018 saw the highest number of fire calls?

from pyspark.sql.functions import month
firecalls_2018_with_month = firecalls_2018.withColumns({'calldate_year': year(to_date('CallDate', 'MM/dd/yyyy')), 'calldate_month': month(to_date('CallDate', 'MM/dd/yyyy'))})
firecalls_2018_with_month\
    .groupBy('calldate_year','calldate_month')\
    .count()\
    .orderBy('count', ascending = False)\
    .show()


25/03/26 19:38:50 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: Call Date
 Schema: CallDate
Expected: CallDate but found: Call Date
CSV file: file:///Users/pvasud669@apac.comcast.com/repos/learnings/spark/datasets/Fire_Department_Calls_for_Service.csv

+-------------+--------------+-----+
|calldate_year|calldate_month|count|
+-------------+--------------+-----+
|         2018|             1|27027|
|         2018|             3|26606|
|         2018|            10|26536|
|         2018|            11|26307|
|         2018|             5|26297|
|         2018|             6|26189|
|         2018|             7|25964|
|         2018|             4|25565|
|         2018|             8|25341|
|         2018|             9|24602|
|         2018|             2|24252|
|         2018|            12| 3307|
+-------------+--------------+-----+



                                                                                

In [24]:
# Which neighborhood in San Francisco generated the most fire calls in 2018?

# firecalls_2018_with_month.columns

firecalls_2018_with_month\
    .groupBy('NeighborhooodsAnalysisBoundaries')\
    .count()\
    .orderBy('count', ascending = False)\
    .show(truncate = False)

25/03/26 19:38:52 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: Call Date, Neighborhooods - Analysis Boundaries
 Schema: CallDate, NeighborhooodsAnalysisBoundaries
Expected: CallDate but found: Call Date
CSV file: file:///Users/pvasud669@apac.comcast.com/repos/learnings/spark/datasets/Fire_Department_Calls_for_Service.csv

+--------------------------------+-----+
|NeighborhooodsAnalysisBoundaries|count|
+--------------------------------+-----+
|Tenderloin                      |40537|
|South of Market                 |30356|
|Mission                         |25210|
|Financial District/South Beach  |22228|
|Bayview Hunters Point           |14582|
|Sunset/Parkside                 |10056|
|Western Addition                |9874 |
|Nob Hill                        |9091 |
|Castro/Upper Market             |7652 |
|Hayes Valley                    |7246 |
|Outer Richmond                  |6724 |
|North Beach                     |6207 |
|West of Twin Peaks              |5940 |
|Excelsior                       |5708 |
|Marina                          |5662 |
|Pacific Heights                 |5662 |
|Chinatown                       |5586 |
|Potrero Hill                    |5086 |
|Bernal Heights                  |4605 |
|Mission Bay                     |4575 |
+--------------------------------+-----+
only showing top

                                                                                

In [25]:
# Which week in the year in 2018 had the most fire calls?

from pyspark.sql.functions import weekofyear

# firecalls_2018.columns
firecalls_2018_with_weeks = firecalls_2018.withColumn('week_of_year', weekofyear(to_date('CallDate', 'MM/dd/yyyy')))
firecalls_2018_with_weeks.groupBy('calldate_year','week_of_year')\
                            .count().withColumnRenamed('count','weekly_count')\
                            .orderBy('weekly_count', ascending = False)\
                            .show()

25/03/26 19:38:54 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: Call Date
 Schema: CallDate
Expected: CallDate but found: Call Date
CSV file: file:///Users/pvasud669@apac.comcast.com/repos/learnings/spark/datasets/Fire_Department_Calls_for_Service.csv

+-------------+------------+------------+
|calldate_year|week_of_year|weekly_count|
+-------------+------------+------------+
|         2018|           1|        6626|
|         2018|          25|        6425|
|         2018|          22|        6328|
|         2018|          13|        6321|
|         2018|          27|        6289|
|         2018|          40|        6252|
|         2018|          44|        6250|
|         2018|          16|        6217|
|         2018|          46|        6209|
|         2018|          43|        6200|
|         2018|           5|        6160|
|         2018|          18|        6152|
|         2018|          48|        6142|
|         2018|           2|        6109|
|         2018|           9|        6079|
|         2018|          21|        6073|
|         2018|          45|        6050|
|         2018|           6|        6025|
|         2018|           8|        6014|
|         2018|          23|        5997|
+-------------+------------+------

                                                                                

In [26]:
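# Is there a correlation between neighborhood, zip code, and number of fire calls?
# As a first step, rank total calls per neighborhood and per zip code separately;
# the two rankings can be compared by eye, but they are not a correlation measure.

# sf_firecall_df.columns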
neighborhood_total_calls = sf_firecall_df\
                            .groupby('NeighborhooodsAnalysisBoundaries')\
                            .count().withColumnRenamed('count', 'total_calls')\
                            .orderBy('total_calls', ascending = False)
neighborhood_total_calls.show(truncate = False)

zipcode_total_calls = sf_firecall_df\
                            .groupby('ZipcodeofIncident')\
                            .count().withColumnRenamed('count', 'total_calls')\
                            .orderBy('total_calls', ascending = False)
zipcode_total_calls.show(truncate = False)

25/03/26 19:38:56 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: Neighborhooods - Analysis Boundaries
 Schema: NeighborhooodsAnalysisBoundaries
Expected: NeighborhooodsAnalysisBoundaries but found: Neighborhooods - Analysis Boundaries
CSV file: file:///Users/pvasud669@apac.comcast.com/repos/learnings/spark/datasets/Fire_Department_Calls_for_Service.csv
25/03/26 19:38:58 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: Zipcode of Incident
 Schema: ZipcodeofIncident
Expected: ZipcodeofIncident but found: Zipcode of Incident
CSV file: file:///Users/pvasud669@apac.comcast.com/repos/learnings/spark/datasets/Fire_Department_Calls_for_Service.csv


+--------------------------------+-----------+
|NeighborhooodsAnalysisBoundaries|total_calls|
+--------------------------------+-----------+
|Tenderloin                      |634244     |
|South of Market                 |460433     |
|Mission                         |438958     |
|Financial District/South Beach  |329037     |
|Bayview Hunters Point           |260228     |
|Sunset/Parkside                 |189134     |
|Western Addition                |178051     |
|Nob Hill                        |159096     |
|Outer Richmond                  |129874     |
|Hayes Valley                    |118402     |
|Castro/Upper Market             |115631     |
|West of Twin Peaks              |107252     |
|North Beach                     |104359     |
|Chinatown                       |102116     |
|Pacific Heights                 |99150      |
|Excelsior                       |97356      |
|Bernal Heights                  |92728      |
|Marina                          |90921      |
|Potrero Hill



+-----------------+-----------+
|ZipcodeofIncident|total_calls|
+-----------------+-----------+
|94102            |605254     |
|94103            |578402     |
|94110            |410468     |
|94109            |401869     |
|94124            |250373     |
|94112            |227335     |
|94115            |214642     |
|94107            |191061     |
|94122            |172112     |
|94133            |171867     |
|94117            |162794     |
|94118            |143381     |
|94114            |143277     |
|94134            |133630     |
|94121            |128435     |
|94132            |116152     |
|94105            |116027     |
|94108            |112594     |
|94116            |103737     |
|94123            |100121     |
+-----------------+-----------+
only showing top 20 rows



                                                                                

In [27]:
# How can we use Parquet files or SQL tables to store this data and read it back?

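# the default save mode ("errorifexists") raises PATH_ALREADY_EXISTS when the target directory is left over from an earlier run
zipcode_total_calls.write.parquet('zipcode_total_calls')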

25/03/26 19:39:00 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: Zipcode of Incident
 Schema: ZipcodeofIncident
Expected: ZipcodeofIncident but found: Zipcode of Incident
CSV file: file:///Users/pvasud669@apac.comcast.com/repos/learnings/spark/datasets/Fire_Department_Calls_for_Service.csv
                                                                                

AnalysisException: [PATH_ALREADY_EXISTS] Path file:/Users/pvasud669@apac.comcast.com/repos/learnings/spark/zipcode_total_calls already exists. Set mode as "overwrite" to overwrite the existing path.

In [None]:
zipcode_parquet_df = spark.read.parquet('zipcode_total_calls')
zipcode_parquet_df.show()