What were the different types of fire calls in 2018?

In [1]:
from pyspark.sql.types import *
from pyspark.sql.functions import * 
from pyspark.sql import SparkSession

spark = (SparkSession 
        .builder
        .appName("projections_and_filters")
        .getOrCreate())

In [2]:
sf_fire_file = "./sf-fire-calls.csv"

In [3]:
fire_df = spark.read.csv(sf_fire_file, header=True)

In [4]:
(fire_df
.select("CallType")
.where(year(to_timestamp(col("CallDate"), "MM/dd/yyyy")) == 2018)
.distinct()
.show(10, False))

+-----------------------------+
|CallType                     |
+-----------------------------+
|Elevator / Escalator Rescue  |
|Alarms                       |
|Odor (Strange / Unknown)     |
|Citizen Assist / Service Call|
|HazMat                       |
|Vehicle Fire                 |
|Other                        |
|Outside Fire                 |
|Traffic Collision            |
|Assist Police                |
+-----------------------------+
only showing top 10 rows



Which months of 2018 had the highest number of fire calls?

In [10]:
(fire_df
.select(month(to_timestamp(col("CallDate"), "MM/dd/yyyy")).alias('mes'))
.where(year(to_timestamp(col("CallDate"), "MM/dd/yyyy")) == 2018)
.groupBy('mes')
.count()
.orderBy("count", ascending=False)
.show(n=10, truncate=False))

+---+-----+
|mes|count|
+---+-----+
|10 |1068 |
|5  |1047 |
|3  |1029 |
|8  |1021 |
|1  |1007 |
|7  |974  |
|6  |974  |
|9  |951  |
|4  |947  |
|2  |919  |
+---+-----+
only showing top 10 rows



Which San Francisco neighborhood generated the most fire calls in 2018?

In [7]:
(fire_df
.select('Neighborhood')
.where(year(to_timestamp(col("CallDate"), "MM/dd/yyyy")) == 2018)
.groupBy('Neighborhood')
.count()
.orderBy("count", ascending=False)
.take(1))

[Row(Neighborhood='Tenderloin', count=1393)]

Which neighborhoods had the worst response times to fire calls in 2018?

In [8]:
(fire_df
.where(year(to_timestamp(col("CallDate"), "MM/dd/yyyy")) == 2018)
.groupBy('Neighborhood')
.agg(avg("Delay").alias("promedio"))
.orderBy('promedio', ascending=False)
.show(10))

+--------------------+------------------+
|        Neighborhood|          promedio|
+--------------------+------------------+
|           Chinatown| 6.190314097905762|
|            Presidio|5.8292270414492755|
|     Treasure Island|      5.4537037125|
|        McLaren Park| 4.744047642857142|
|Bayview Hunters P...|4.6205619568773955|
|    Presidio Heights| 4.594131472394366|
|        Inner Sunset| 4.438095199935065|
|      Inner Richmond| 4.364728682713178|
|Financial Distric...| 4.344084618290156|
|      Haight Ashbury| 4.266428599285714|
+--------------------+------------------+
only showing top 10 rows



Which week of 2018 had the most fire calls?

In [11]:
(fire_df
.where(year(to_timestamp(col("CallDate"), "MM/dd/yyyy")) == 2018)
# Note: this filter restricts the count to 'Structure Fire' calls only
.where(col('CallType') == 'Structure Fire')
.groupBy(weekofyear(to_timestamp(col("CallDate"), "MM/dd/yyyy")).alias("semana del año"))
# DataFrame.alias() does not rename the count column, so it is dropped here
.count()
.orderBy('count', ascending=False)
.show(10))

+--------------+-----+
|semana del año|count|
+--------------+-----+
|            25|   31|
|             8|   30|
|             1|   29|
|            43|   29|
|            11|   28|
|            26|   27|
|            18|   27|
|            38|   27|
|            40|   27|
|            27|   25|
+--------------+-----+
only showing top 10 rows
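Spark documents `weekofyear` as following ISO 8601: weeks start on Monday, and week 1 is the first week with more than 3 days. One consequence is that 12/31/2018 is assigned to week 1, so the week-1 row above may include calls from the last day of the year. The stdlib sketch below illustrates the numbering without Spark. The helper `iso_week` is hypothetical.

```python
from datetime import date

def iso_week(d):
    """Return the ISO 8601 week number for a date."""
    return d.isocalendar()[1]

first_day = iso_week(date(2018, 1, 1))      # a Monday, start of ISO week 1
last_sunday = iso_week(date(2018, 12, 30))  # last day of ISO week 52
last_day = iso_week(date(2018, 12, 31))     # a Monday, ISO week 1 of 2019

print(first_day, last_sunday, last_day)  # prints: 1 52 1
```

Grouping by both ISO year and ISO week would be one way to keep such boundary days apart.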



Is there a correlation between neighborhood, zip code, and number of fire calls?

In [24]:
(fire_df
.groupBy("Neighborhood", "Zipcode")
.count()
.orderBy('count', ascending=False)
.show(10))

+--------------------+-------+-----+
|        Neighborhood|Zipcode|count|
+--------------------+-------+-----+
|          Tenderloin|  94102|17084|
|     South of Market|  94103|13762|
|             Mission|  94110|10444|
|Bayview Hunters P...|  94124| 9150|
|             Mission|  94103| 5445|
|          Tenderloin|  94109| 5377|
|Financial Distric...|  94105| 4235|
|      Outer Richmond|  94121| 4121|
|            Nob Hill|  94109| 3983|
| Castro/Upper Market|  94114| 3946|
+--------------------+-------+-----+
only showing top 10 rows



How can we use Parquet files or SQL tables to store this data and read it back?

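# Output location (left elided, as in the original)
parquet_path = ...
# Write the DataFrame as Parquet
fire_df.write.format("parquet").save(parquet_path)

# Read the same Parquet data back into a new DataFrame
parDF1 = spark.read.parquet(parquet_path)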