**RDDs**

RDDs are the most basic abstraction in Spark. Each RDD has three associated characteristics:  
	- Dependencies: tell Spark how an RDD is built and which inputs it requires.  
	- Partitions (with some locality information): let Spark split the work across executors.  
	- Compute function: `Partition => Iterator[T]`, which produces the data stored in a partition.
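The three characteristics can be made concrete with a toy, single-machine model. This is only a sketch to fix ideas: `ToyRDD` and its methods are hypothetical names, not Spark classes, and nothing here is distributed.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Iterator, List


@dataclass
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark's implementation)."""
    partitions: List[List[Any]]                      # how the data is split
    compute: Callable[[List[Any]], Iterator[Any]]    # Partition => Iterator[T]
    dependencies: List["ToyRDD"] = field(default_factory=list)  # parent RDDs

    def map(self, f: Callable[[Any], Any]) -> "ToyRDD":
        parent = self
        # The child keeps the parent's partitioning, records the parent as a
        # dependency, and computes a partition by transforming the parent's iterator.
        return ToyRDD(
            partitions=parent.partitions,
            compute=lambda part: (f(x) for x in parent.compute(part)),
            dependencies=[parent],
        )

    def collect(self) -> List[Any]:
        # Evaluate every partition and concatenate the results.
        return [x for part in self.partitions for x in self.compute(part)]


base = ToyRDD(partitions=[[1, 2], [3, 4]], compute=iter)
doubled = base.map(lambda x: x * 2)
print(doubled.collect())  # [2, 4, 6, 8]
```

Because `doubled` remembers `base` in `dependencies`, it could be recomputed from its parent at any time, which is the idea behind lineage-based recovery in real RDDs.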

**Example 1: low-level RDD: Python**

In [1]:
# Create an RDD of (name, age) tuples
dataRDD = sc.parallelize([("Brooke", 20), ("Denny", 31), ("Jules", 30), ("TD", 35), ("Brooke", 25)])

With this RDD, we use `reduceByKey` to combine the tuples that share a key (summing ages and counts) and then compute the average age per name:

In [2]:
agesRDD = dataRDD.map(lambda x: (x[0], (x[1], 1))).reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1])).map(lambda x: (x[0], x[1][0]/x[1][1]))

In [3]:
agesRDD.collect()

[('Brooke', 22.5), ('Denny', 31.0), ('TD', 35.0), ('Jules', 30.0)]

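The transformations in cell [2] are lazy: defining `agesRDD` runs without touching the data, so a mistake in any of the lambdas would only surface as an error once an action such as `collect()` is executed. Here the action succeeds and returns the average age per name.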
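To see what the three chained steps compute, the same pipeline can be sketched in plain Python, without Spark. `reduce_by_key` is a hypothetical local helper that mimics the behaviour of `reduceByKey` on a single machine; it is not part of PySpark.

```python
def reduce_by_key(pairs, func):
    """Combine the values of each key with func (local analogue of reduceByKey)."""
    acc = {}
    for key, value in pairs:
        acc[key] = func(acc[key], value) if key in acc else value
    return list(acc.items())


data = [("Brooke", 20), ("Denny", 31), ("Jules", 30), ("TD", 35), ("Brooke", 25)]

# Step 1 (map): attach a count of 1 to every age -> (name, (age, 1))
with_counts = [(name, (age, 1)) for name, age in data]

# Step 2 (reduceByKey): add up ages and counts per name -> (name, (sum, n))
sums = reduce_by_key(with_counts, lambda x, y: (x[0] + y[0], x[1] + y[1]))

# Step 3 (map): divide the sum by the count -> (name, average)
averages = [(name, total / n) for name, (total, n) in sums]
print(averages)
```

Carrying the pair `(sum, count)` through the reduction matters because averages cannot be merged directly, whereas sums and counts can be combined in any order across partitions.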

**Example 2: high-level DSL operators and the DataFrame: Python**

Import the Spark libraries:

In [4]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg
import sys
from pyspark.sql.functions import count

Create the SparkSession:

In [5]:
spark = (SparkSession
 .builder
 .appName("AuthorsAges")
 .getOrCreate())

Create a DataFrame with the data:

In [6]:
dataDF = spark.createDataFrame([("Broke",20), ("Denny",31), ("Jules",30), ("TD",35),
                         ("Broke",25)], ["name", "age"])

In [7]:
dataDF.show()

+-----+---+
| name|age|
+-----+---+
|Broke| 20|
|Denny| 31|
|Jules| 30|
|   TD| 35|
|Broke| 25|
+-----+---+



Group the data by name and compute the average age:

In [8]:
avg_df = (dataDF
          .groupBy("name")
          .agg(avg("age").alias("mean")))

In [9]:
avg_df.show(n=5,truncate=False)

+-----+----+
|name |mean|
+-----+----+
|Jules|30.0|
|Broke|22.5|
|TD   |35.0|
|Denny|31.0|
+-----+----+



**Example 2: high-level DSL operators and the DataFrame: Scala**

In [1]:
// Import the required libraries
import org.apache.spark.sql.functions.avg
import org.apache.spark.sql.SparkSession

Intitializing Scala interpreter ...

Spark Web UI available at http://EM2021002836.bosonit.local:4041
SparkContext available as 'sc' (version = 3.1.1, master = local[*], app id = local-1622799803219)
SparkSession available as 'spark'


import org.apache.spark.sql.functions.avg
import org.apache.spark.sql.SparkSession


In [2]:
// Create the SparkSession (or reuse the one that already exists)
val spark = SparkSession
            .builder
            .appName("AuthorAges")
            .getOrCreate()

spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@628b81be


In [3]:
// Create a DataFrame of names and ages
val dataDF = spark.createDataFrame(Seq(("Brooke", 20), ("Brooke", 25),
 ("Denny", 31), ("Jules", 30), ("TD", 35))).toDF("name", "age")

dataDF: org.apache.spark.sql.DataFrame = [name: string, age: int]


In [4]:
// Group by name and compute the average age of each group
val avgDF = dataDF.groupBy("name").agg(avg("age"))

avgDF: org.apache.spark.sql.DataFrame = [name: string, avg(age): double]


In [5]:
// Show the final result
avgDF.show()

+------+--------+
|  name|avg(age)|
+------+--------+
|Brooke|    22.5|
| Jules|    30.0|
|    TD|    35.0|
| Denny|    31.0|
+------+--------+



-------------------------------------------------------------------------------------------------------------------------------

## The DataFrame API

Spark DataFrames are like distributed in-memory tables with named columns and a schema that specifies the data type of each column.

**Example basic data types: Scala**

In [6]:
import org.apache.spark.sql.types._

import org.apache.spark.sql.types._


In [7]:
val nameTypes = StringType

nameTypes: org.apache.spark.sql.types.StringType.type = StringType


In [8]:
val firstName = nameTypes

firstName: org.apache.spark.sql.types.StringType.type = StringType


In [9]:
val lastName = nameTypes

lastName: org.apache.spark.sql.types.StringType.type = StringType


**Example two ways to define a schema: Scala**

*Spark DataFrame API*

In [10]:
import org.apache.spark.sql.types._

import org.apache.spark.sql.types._


In [11]:
val schema = StructType(Array(StructField("author", StringType, false),
 StructField("title", StringType, false),
 StructField("pages", IntegerType, false)))


schema: org.apache.spark.sql.types.StructType = StructType(StructField(author,StringType,false), StructField(title,StringType,false), StructField(pages,IntegerType,false))


*DDL: Data Definition Language*

In [12]:
val schema = "author STRING, title STRING, pages INT"

schema: String = author STRING, title STRING, pages INT


*Example*

In [13]:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._


In [14]:
val spark = SparkSession
 .builder
 .appName("Example-3_7")
 .getOrCreate()

spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@628b81be


In [15]:
 val schema = StructType(Array(StructField("Id", IntegerType, false),
     StructField("First", StringType, false),
     StructField("Last", StringType, false),
     StructField("Url", StringType, false),
     StructField("Published", StringType, false),
     StructField("Hits", IntegerType, false),
     StructField("Campaigns", ArrayType(StringType), false)))

schema: org.apache.spark.sql.types.StructType = StructType(StructField(Id,IntegerType,false), StructField(First,StringType,false), StructField(Last,StringType,false), StructField(Url,StringType,false), StructField(Published,StringType,false), StructField(Hits,IntegerType,false), StructField(Campaigns,ArrayType(StringType,true),false))


In [16]:
val jsonFile = "C:/Users/nerea.gomez/Documents/Documentacion/Learning Spark/Datasets/blogs.json"

jsonFile: String = C:/Users/nerea.gomez/Documents/Documentacion/Learning Spark/Datasets/blogs.json


In [17]:
val blogsDF = spark.read.schema(schema).json(jsonFile)

blogsDF: org.apache.spark.sql.DataFrame = [Id: int, First: string ... 5 more fields]


In [18]:
blogsDF.show(false)

+---+---------+-------+-----------------+---------+-----+----------------------------+
|Id |First    |Last   |Url              |Published|Hits |Campaigns                   |
+---+---------+-------+-----------------+---------+-----+----------------------------+
|1  |Jules    |Damji  |https://tinyurl.1|1/4/2016 |4535 |[twitter, LinkedIn]         |
|2  |Brooke   |Wenig  |https://tinyurl.2|5/5/2018 |8908 |[twitter, LinkedIn]         |
|3  |Denny    |Lee    |https://tinyurl.3|6/7/2019 |7659 |[web, twitter, FB, LinkedIn]|
|4  |Tathagata|Das    |https://tinyurl.4|5/12/2018|10568|[twitter, FB]               |
|5  |Matei    |Zaharia|https://tinyurl.5|5/14/2014|40578|[web, twitter, FB, LinkedIn]|
|6  |Reynold  |Xin    |https://tinyurl.6|3/2/2015 |25568|[twitter, LinkedIn]         |
+---+---------+-------+-----------------+---------+-----+----------------------------+



In [19]:
// printSchema() prints the tree itself and returns Unit, so it is not wrapped in println
blogsDF.printSchema()
println(blogsDF.schema)

root
 |-- Id: integer (nullable = true)
 |-- First: string (nullable = true)
 |-- Last: string (nullable = true)
 |-- Url: string (nullable = true)
 |-- Published: string (nullable = true)
 |-- Hits: integer (nullable = true)
 |-- Campaigns: array (nullable = true)
 |    |-- element: string (containsNull = true)

StructType(StructField(Id,IntegerType,true), StructField(First,StringType,true), StructField(Last,StringType,true), StructField(Url,StringType,true), StructField(Published,StringType,true), StructField(Hits,IntegerType,true), StructField(Campaigns,ArrayType(StringType,true),true))


**Example two ways to define a schema: Python**

*Spark DataFrame API*

In [10]:
from pyspark.sql.types import *

In [11]:
schema = StructType([StructField("author", StringType(), False),
 StructField("title", StringType(), False),
 StructField("pages", IntegerType(), False)])

*DDL*

In [12]:
schema = "author STRING, title STRING, pages INT"

*Example*

In [20]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

In [14]:
schema = "`Id` INT, `First` STRING, `Last` STRING, `Url` STRING, `Published` STRING, `Hits` INT, `Campaigns` ARRAY<STRING>"

In [15]:
data = [
    [1, "Jules", "Damji", "https://tinyurl.1", "1/4/2016", 4535, ["twitter", "LinkedIn"]],
    [2, "Brooke","Wenig", "https://tinyurl.2", "5/5/2018", 8908, ["twitter", "LinkedIn"]],
    [3, "Denny", "Lee", "https://tinyurl.3", "6/7/2019", 7659, ["web", "twitter", "FB", "LinkedIn"]],
    [4, "Tathagata", "Das", "https://tinyurl.4", "5/12/2018", 10568, ["twitter", "FB"]],
    [5, "Matei","Zaharia", "https://tinyurl.5", "5/14/2014", 40578, ["web", "twitter", "FB", "LinkedIn"]],
    [6, "Reynold", "Xin", "https://tinyurl.6", "3/2/2015", 25568, ["twitter", "LinkedIn"]]
 ]

In [16]:
spark = (SparkSession
 .builder
 .appName("Example-3_6")
 .getOrCreate())

In [17]:
blogs_df = spark.createDataFrame(data, schema)

In [18]:
blogs_df.show()

+---+---------+-------+-----------------+---------+-----+--------------------+
| Id|    First|   Last|              Url|Published| Hits|           Campaigns|
+---+---------+-------+-----------------+---------+-----+--------------------+
|  1|    Jules|  Damji|https://tinyurl.1| 1/4/2016| 4535| [twitter, LinkedIn]|
|  2|   Brooke|  Wenig|https://tinyurl.2| 5/5/2018| 8908| [twitter, LinkedIn]|
|  3|    Denny|    Lee|https://tinyurl.3| 6/7/2019| 7659|[web, twitter, FB...|
|  4|Tathagata|    Das|https://tinyurl.4|5/12/2018|10568|       [twitter, FB]|
|  5|    Matei|Zaharia|https://tinyurl.5|5/14/2014|40578|[web, twitter, FB...|
|  6|  Reynold|    Xin|https://tinyurl.6| 3/2/2015|25568| [twitter, LinkedIn]|
+---+---------+-------+-----------------+---------+-----+--------------------+



In [21]:
# Multiply a column by 2
(blogs_df
 .select(F.col("Hits")*2)
 .show())

+----------+
|(Hits * 2)|
+----------+
|      9070|
|     17816|
|     15318|
|     21136|
|     81156|
|     51136|
+----------+



In [22]:
# printSchema() prints the tree itself and returns None, so it is not wrapped in print()
blogs_df.printSchema()

root
 |-- Id: integer (nullable = true)
 |-- First: string (nullable = true)
 |-- Last: string (nullable = true)
 |-- Url: string (nullable = true)
 |-- Published: string (nullable = true)
 |-- Hits: integer (nullable = true)
 |-- Campaigns: array (nullable = true)
 |    |-- element: string (containsNull = true)


In [23]:
# Inspect the schema from anywhere in the code
blogs_df.schema

StructType(List(StructField(Id,IntegerType,true),StructField(First,StringType,true),StructField(Last,StringType,true),StructField(Url,StringType,true),StructField(Published,StringType,true),StructField(Hits,IntegerType,true),StructField(Campaigns,ArrayType(StringType,true),true)))

**Example columns: Scala**

In [20]:
import org.apache.spark.sql.functions._

import org.apache.spark.sql.functions._


In [21]:
blogsDF.columns // lists the column names of the DataFrame

res3: Array[String] = Array(Id, First, Last, Url, Published, Hits, Campaigns)


In [22]:
blogsDF.col("Id") // returns the Column object for an existing column name

res4: org.apache.spark.sql.Column = Id


In [23]:
blogsDF.select(expr("Hits * 2")).show(2) // multiply the Hits column by 2 and show the first two rows

+----------+
|(Hits * 2)|
+----------+
|      9070|
|     17816|
+----------+
only showing top 2 rows



In [24]:
blogsDF.select(col("Hits") * 2).show(2) // same as above, using the col function

+----------+
|(Hits * 2)|
+----------+
|      9070|
|     17816|
+----------+
only showing top 2 rows



In [25]:
// Add a new Boolean column derived from an existing one: true where Hits > 10000
blogsDF.withColumn("Big Hitters", (expr("Hits > 10000"))).show() 

+---+---------+-------+-----------------+---------+-----+--------------------+-----------+
| Id|    First|   Last|              Url|Published| Hits|           Campaigns|Big Hitters|
+---+---------+-------+-----------------+---------+-----+--------------------+-----------+
|  1|    Jules|  Damji|https://tinyurl.1| 1/4/2016| 4535| [twitter, LinkedIn]|      false|
|  2|   Brooke|  Wenig|https://tinyurl.2| 5/5/2018| 8908| [twitter, LinkedIn]|      false|
|  3|    Denny|    Lee|https://tinyurl.3| 6/7/2019| 7659|[web, twitter, FB...|      false|
|  4|Tathagata|    Das|https://tinyurl.4|5/12/2018|10568|       [twitter, FB]|       true|
|  5|    Matei|Zaharia|https://tinyurl.5|5/14/2014|40578|[web, twitter, FB...|       true|
|  6|  Reynold|    Xin|https://tinyurl.6| 3/2/2015|25568| [twitter, LinkedIn]|       true|
+---+---------+-------+-----------------+---------+-----+--------------------+-----------+



In [26]:
// Concatenate three existing columns to create a new one
blogsDF
 .withColumn("AuthorsId", (concat(expr("First"), expr("Last"), expr("Id"))))
 .select(col("AuthorsId"))
 .show(4)

+-------------+
|    AuthorsId|
+-------------+
|  JulesDamji1|
| BrookeWenig2|
|    DennyLee3|
|TathagataDas4|
+-------------+
only showing top 4 rows



Three different ways to select a column:

In [27]:
blogsDF.select(expr("Hits")).show(2)

+----+
|Hits|
+----+
|4535|
|8908|
+----+
only showing top 2 rows



In [28]:
blogsDF.select(col("Hits")).show(2)

+----+
|Hits|
+----+
|4535|
|8908|
+----+
only showing top 2 rows



In [29]:
blogsDF.select("Hits").show(2)

+----+
|Hits|
+----+
|4535|
|8908|
+----+
only showing top 2 rows



Two ways to sort the DataFrame by a column:

In [30]:
blogsDF.sort(col("Id").desc).show()

+---+---------+-------+-----------------+---------+-----+--------------------+
| Id|    First|   Last|              Url|Published| Hits|           Campaigns|
+---+---------+-------+-----------------+---------+-----+--------------------+
|  6|  Reynold|    Xin|https://tinyurl.6| 3/2/2015|25568| [twitter, LinkedIn]|
|  5|    Matei|Zaharia|https://tinyurl.5|5/14/2014|40578|[web, twitter, FB...|
|  4|Tathagata|    Das|https://tinyurl.4|5/12/2018|10568|       [twitter, FB]|
|  3|    Denny|    Lee|https://tinyurl.3| 6/7/2019| 7659|[web, twitter, FB...|
|  2|   Brooke|  Wenig|https://tinyurl.2| 5/5/2018| 8908| [twitter, LinkedIn]|
|  1|    Jules|  Damji|https://tinyurl.1| 1/4/2016| 4535| [twitter, LinkedIn]|
+---+---------+-------+-----------------+---------+-----+--------------------+



In [31]:
blogsDF.sort($"Id".desc).show()

+---+---------+-------+-----------------+---------+-----+--------------------+
| Id|    First|   Last|              Url|Published| Hits|           Campaigns|
+---+---------+-------+-----------------+---------+-----+--------------------+
|  6|  Reynold|    Xin|https://tinyurl.6| 3/2/2015|25568| [twitter, LinkedIn]|
|  5|    Matei|Zaharia|https://tinyurl.5|5/14/2014|40578|[web, twitter, FB...|
|  4|Tathagata|    Das|https://tinyurl.4|5/12/2018|10568|       [twitter, FB]|
|  3|    Denny|    Lee|https://tinyurl.3| 6/7/2019| 7659|[web, twitter, FB...|
|  2|   Brooke|  Wenig|https://tinyurl.2| 5/5/2018| 8908| [twitter, LinkedIn]|
|  1|    Jules|  Damji|https://tinyurl.1| 1/4/2016| 4535| [twitter, LinkedIn]|
+---+---------+-------+-----------------+---------+-----+--------------------+



**Example rows: Scala**

In [32]:
import org.apache.spark.sql.Row

import org.apache.spark.sql.Row


In [33]:
val blogRow = Row(6, "Reynold", "Xin", "https://tinyurl.6", 255568, "3/2/2015", Array("twitter", "LinkedIn"))

blogRow: org.apache.spark.sql.Row = [6,Reynold,Xin,https://tinyurl.6,255568,3/2/2015,[Ljava.lang.String;@712d3150]


In [34]:
// Access the second element of the row (indexing starts at 0)
blogRow(1)

res14: Any = Reynold


In [35]:
// Create a DataFrame from a sequence of rows
val rows = Seq(("Matei Zaharia", "CA"), ("Reynold Xin", "CA"))
val authorsDF = rows.toDF("Author", "State")
authorsDF.show()

+-------------+-----+
|       Author|State|
+-------------+-----+
|Matei Zaharia|   CA|
|  Reynold Xin|   CA|
+-------------+-----+



rows: Seq[(String, String)] = List((Matei Zaharia,CA), (Reynold Xin,CA))
authorsDF: org.apache.spark.sql.DataFrame = [Author: string, State: string]


**Example rows: Python**

In [24]:
from pyspark.sql import Row

In [25]:
blog_row = Row(6, "Reynold", "Xin", "https://tinyurl.6", 255568, "3/2/2015",
 ["twitter", "LinkedIn"])

In [26]:
# Access an element of the row by index
blog_row[1]

'Reynold'

In [27]:
# Create a DataFrame from a list of rows
rows = [Row("Matei Zaharia", "CA"), Row("Reynold Xin", "CA")]
authors_df = spark.createDataFrame(rows, ["Authors", "State"])
authors_df.show()

+-------------+-----+
|      Authors|State|
+-------------+-----+
|Matei Zaharia|   CA|
|  Reynold Xin|   CA|
+-------------+-----+



**spark.read.csv() --> reads a CSV file and returns a DataFrame whose rows and named columns follow a previously defined schema.**

-------------------------------------------------------------------------------------------------------------------------------

### Example fire-calls: Python

In [28]:
from pyspark.sql.types import *
from pyspark.sql import SparkSession
import sys 
from pyspark.sql import functions as F
import pandas as pd

In [29]:
# Define the schema of the DataFrame
fire_schema = StructType([StructField('CallNumber', IntegerType(), True),
  StructField('UnitID', StringType(), True),
  StructField('IncidentNumber', IntegerType(), True),
  StructField('CallType', StringType(), True), 
  StructField('CallDate', StringType(), True), 
  StructField('WatchDate', StringType(), True),
  StructField('CallFinalDisposition', StringType(), True),
  StructField('AvailableDtTm', StringType(), True),
  StructField('Address', StringType(), True), 
  StructField('City', StringType(), True), 
  StructField('Zipcode', IntegerType(), True), 
  StructField('Battalion', StringType(), True), 
  StructField('StationArea', StringType(), True), 
  StructField('Box', StringType(), True), 
  StructField('OriginalPriority', StringType(), True), 
  StructField('Priority', StringType(), True), 
  StructField('FinalPriority', IntegerType(), True), 
  StructField('ALSUnit', BooleanType(), True), 
  StructField('CallTypeGroup', StringType(), True),
  StructField('NumAlarms', IntegerType(), True),
  StructField('UnitType', StringType(), True),
  StructField('UnitSequenceInCallDispatch', IntegerType(), True),
  StructField('FirePreventionDistrict', StringType(), True),
  StructField('SupervisorDistrict', StringType(), True),
  StructField('Neighborhood', StringType(), True),
  StructField('Location', StringType(), True),
  StructField('RowID', StringType(), True),
  StructField('Delay', FloatType(), True)])

In [30]:
# Read the data and create the DataFrame
sf_fire_file = "C:/Users/nerea.gomez/Documents/Documentacion/Learning Spark/Datasets/sf-fire-calls.csv"
fire_df = spark.read.csv(sf_fire_file, header=True, schema=fire_schema)

The data can be saved in Parquet format, which preserves the schema of the data.

The data can also be saved as a SQL table.
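The notebook does not show code for these two save operations. A minimal sketch using the `DataFrameWriter` returned by `df.write` could look as follows; the helper names, the output path and the table name are assumptions made for illustration, and `df` stands for any Spark DataFrame such as `fire_df`.

```python
def save_as_parquet(df, path):
    """Write df as Parquet files under path; the schema is stored with the data."""
    df.write.format("parquet").save(path)


def save_as_table(df, table_name):
    """Save df as a SQL table backed by Parquet files."""
    df.write.format("parquet").saveAsTable(table_name)


# Usage with the DataFrame read above (path and table name are placeholders):
# save_as_parquet(fire_df, "/tmp/sf_fire_calls_parquet")
# save_as_table(fire_df, "sf_fire_calls")
```

By default the writer fails if the target already exists; calling `.mode("overwrite")` before `save` or `saveAsTable` is one way to replace it.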

**Example projections and filters**

In [32]:
# Select three columns, keeping only the rows whose call type is not a medical incident
few_fire_df = (fire_df
 .select("IncidentNumber", "AvailableDtTm", "CallType")
 .where(fire_df.CallType != "Medical Incident"))
few_fire_df.show(5, truncate=False)

+--------------+----------------------+--------------+
|IncidentNumber|AvailableDtTm         |CallType      |
+--------------+----------------------+--------------+
|2003235       |01/11/2002 01:51:44 AM|Structure Fire|
|2003250       |01/11/2002 04:16:46 AM|Vehicle Fire  |
|2003259       |01/11/2002 06:01:58 AM|Alarms        |
|2003279       |01/11/2002 08:03:26 AM|Structure Fire|
|2003301       |01/11/2002 09:46:44 AM|Alarms        |
+--------------+----------------------+--------------+
only showing top 5 rows



In [33]:
# Count the distinct non-null values in the CallType column
(fire_df
 .select("CallType")
 .where(fire_df.CallType.isNotNull())
 .agg(F.countDistinct("CallType").alias("DistinctCallTypes"))
 .show())

+-----------------+
|DistinctCallTypes|
+-----------------+
|               30|
+-----------------+



In [34]:
# Show the distinct values of the CallType column
(fire_df
 .select("CallType")
 .where(fire_df.CallType.isNotNull())
 .distinct()
 .show(10, False))

+-----------------------------------+
|CallType                           |
+-----------------------------------+
|Elevator / Escalator Rescue        |
|Marine Fire                        |
|Aircraft Emergency                 |
|Confined Space / Structure Collapse|
|Administrative                     |
|Alarms                             |
|Odor (Strange / Unknown)           |
|Citizen Assist / Service Call      |
|HazMat                             |
|Watercraft in Distress             |
+-----------------------------------+
only showing top 10 rows



**Example renaming, adding and dropping columns**

In [35]:
# Rename the Delay column; withColumnRenamed returns a new DataFrame and leaves fire_df unchanged
new_fire_df = fire_df.withColumnRenamed("Delay", "ResponseDelayedinMins")
(new_fire_df
 .select("ResponseDelayedinMins")
 .where(new_fire_df.ResponseDelayedinMins > 5)
 .show(5, False))

+---------------------+
|ResponseDelayedinMins|
+---------------------+
|5.35                 |
|6.25                 |
|5.2                  |
|5.6                  |
|7.25                 |
+---------------------+
only showing top 5 rows



In [36]:
# Create a new DataFrame with three timestamp columns built with to_timestamp, dropping the original string columns
fire_ts_df = (new_fire_df
 .withColumn("IncidentDate", F.to_timestamp(new_fire_df.CallDate, "MM/dd/yyyy")).drop("CallDate")
 .withColumn("OnWatchDate", F.to_timestamp(new_fire_df.WatchDate, "MM/dd/yyyy")).drop("WatchDate")
 .withColumn("AvailableDtTS", F.to_timestamp(new_fire_df.AvailableDtTm, "MM/dd/yyyy hh:mm:ss a")).drop("AvailableDtTm"))

In [37]:
# Show the first five rows of the three new columns
(fire_ts_df
 .select("IncidentDate", "OnWatchDate", "AvailableDtTS")
 .show(5, False))

+-------------------+-------------------+-------------------+
|IncidentDate       |OnWatchDate        |AvailableDtTS      |
+-------------------+-------------------+-------------------+
|2002-01-11 00:00:00|2002-01-10 00:00:00|2002-01-11 01:51:44|
|2002-01-11 00:00:00|2002-01-10 00:00:00|2002-01-11 03:01:18|
|2002-01-11 00:00:00|2002-01-10 00:00:00|2002-01-11 02:39:50|
|2002-01-11 00:00:00|2002-01-10 00:00:00|2002-01-11 04:16:46|
|2002-01-11 00:00:00|2002-01-10 00:00:00|2002-01-11 06:01:58|
+-------------------+-------------------+-------------------+
only showing top 5 rows
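To check what the two patterns passed to `to_timestamp` mean, the same strings can be parsed with Python's standard `datetime`. This is only an analogy: Spark uses its own pattern letters (`MM`, `dd`, `yyyy`, `hh` for the 12-hour clock, `a` for the AM/PM marker), and the `strptime` directives below are their closest equivalents. The PM string is an illustrative variation of a value from the dataset.

```python
from datetime import datetime

# Spark pattern "MM/dd/yyyy"            ~ strptime "%m/%d/%Y"
# Spark pattern "MM/dd/yyyy hh:mm:ss a" ~ strptime "%m/%d/%Y %I:%M:%S %p"
call_date = datetime.strptime("01/11/2002", "%m/%d/%Y")
available = datetime.strptime("01/11/2002 01:51:44 AM", "%m/%d/%Y %I:%M:%S %p")
afternoon = datetime.strptime("01/11/2002 01:51:44 PM", "%m/%d/%Y %I:%M:%S %p")

print(call_date)   # 2002-01-11 00:00:00
print(available)   # 2002-01-11 01:51:44
print(afternoon)   # 2002-01-11 13:51:44
```

A date-only pattern yields midnight, which is why `IncidentDate` and `OnWatchDate` show `00:00:00` in the output above.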



In [38]:
# List the distinct years that appear in the IncidentDate column
(fire_ts_df
 .select(F.year('IncidentDate'))
 .distinct()
 .orderBy(F.year('IncidentDate'))
 .show())

+------------------+
|year(IncidentDate)|
+------------------+
|              2000|
|              2001|
|              2002|
|              2003|
|              2004|
|              2005|
|              2006|
|              2007|
|              2008|
|              2009|
|              2010|
|              2011|
|              2012|
|              2013|
|              2014|
|              2015|
|              2016|
|              2017|
|              2018|
+------------------+



**Example aggregations**

In [39]:
# Keep the non-null CallType values, group by value, count the occurrences of each,
# and sort the result by that count in descending order
(fire_ts_df
 .select("CallType")
 .where(fire_ts_df.CallType.isNotNull())
 .groupBy("CallType")
 .count()
 .orderBy("count", ascending=False)
 .show(n=10, truncate=False))

+-------------------------------+------+
|CallType                       |count |
+-------------------------------+------+
|Medical Incident               |113794|
|Structure Fire                 |23319 |
|Alarms                         |19406 |
|Traffic Collision              |7013  |
|Citizen Assist / Service Call  |2524  |
|Other                          |2166  |
|Outside Fire                   |2094  |
|Vehicle Fire                   |854   |
|Gas Leak (Natural and LP Gases)|764   |
|Water Rescue                   |755   |
+-------------------------------+------+
only showing top 10 rows



**Example other common DF operations**

In [40]:
# Compute the sum, average, minimum and maximum of some columns and show the result
(fire_ts_df
 .select(F.sum("NumAlarms"), F.avg("ResponseDelayedinMins"),F.min("ResponseDelayedinMins"), F.max("ResponseDelayedinMins"))
 .show())

+--------------+--------------------------+--------------------------+--------------------------+
|sum(NumAlarms)|avg(ResponseDelayedinMins)|min(ResponseDelayedinMins)|max(ResponseDelayedinMins)|
+--------------+--------------------------+--------------------------+--------------------------+
|        176170|         3.892364154521585|               0.016666668|                   1844.55|
+--------------+--------------------------+--------------------------+--------------------------+



Other useful DataFrame operations include `describe()` and, through `stat`, `corr()`, `cov()`, `sampleBy()`, `approxQuantile()` and `freqItems()`.
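As a rough sketch of how a few of these calls might be combined on the fire-calls data, the hypothetical helper below (not part of the original notebook) requests summary statistics and an approximate median for the delay column. It expects a Spark DataFrame such as `fire_ts_df`; the 1% relative error is an arbitrary choice.

```python
def delay_summary(df, column="ResponseDelayedinMins"):
    """Return describe() statistics and an approximate median for one numeric column."""
    # describe() returns a DataFrame with count, mean, stddev, min and max
    summary = df.describe(column)
    # approxQuantile(col, probabilities, relativeError) returns a list of floats
    median = df.stat.approxQuantile(column, [0.5], 0.01)
    return summary, median


# Usage (assumes fire_ts_df from the cells above):
# summary, median = delay_summary(fire_ts_df)
# summary.show()
```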

-------------------------------------------------------------------------------------------------------------------------------

### Example fire-calls: Scala

In [36]:
// Import the required libraries
import org.apache.spark.sql.functions.avg
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

import org.apache.spark.sql.functions.avg
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._


In [37]:
// Read the file without an explicit schema; since inferSchema is not enabled, every column is read as a string
val sampleDF = spark
 .read
 .option("samplingRatio", 0.001)
 .option("header", true)
 .csv("C:/Users/nerea.gomez/Documents/Documentacion/Learning Spark/Datasets/sf-fire-calls.csv")

sampleDF: org.apache.spark.sql.DataFrame = [CallNumber: string, UnitID: string ... 26 more fields]


In [38]:
// Define the schema that our data will have
val fireSchema = StructType(Array(StructField("CallNumber", IntegerType, true),
                                  StructField("UnitID", StringType, true),
                                  StructField("IncidentNumber", IntegerType, true),
                                  StructField("CallType", StringType, true),
                                  StructField("CallDate", StringType, true),
                                  StructField("WatchDate", StringType, true),
                                  StructField("CallFinalDisposition", StringType, true),
                                  StructField("AvailableDtTm", StringType, true),
                                  StructField("Address", StringType, true),
                                  StructField("City", StringType, true),
                                  StructField("Zipcode", IntegerType, true),
                                  StructField("Battalion", StringType, true),
                                  StructField("StationArea", StringType, true),
                                  StructField("Box", StringType, true),
                                  StructField("OriginalPriority", StringType, true),
                                  StructField("Priority", StringType, true),
                                  StructField("FinalPriority", StringType, true),
                                  StructField("ALSUnit", BooleanType, true),
                                  StructField("CallTypeGroup", StringType, true),
                                  StructField("NumAlarms", IntegerType, true),
                                  StructField("UnitType", StringType, true),
                                  StructField("UnitSequenceInCallDispatch", IntegerType, true),
                                  StructField("FirePreventionDistrict", StringType, true),
                                  StructField("SupervisorDistrict", StringType, true),
                                  StructField("Neighborhood", StringType, true),       
                                  StructField("Location", StringType, true),
                                  StructField("RowID", StringType, true),
                                  StructField("Delay", FloatType, true)))

fireSchema: org.apache.spark.sql.types.StructType = StructType(StructField(CallNumber,IntegerType,true), StructField(UnitID,StringType,true), StructField(IncidentNumber,IntegerType,true), StructField(CallType,StringType,true), StructField(CallDate,StringType,true), StructField(WatchDate,StringType,true), StructField(CallFinalDisposition,StringType,true), StructField(AvailableDtTm,StringType,true), StructField(Address,StringType,true), StructField(City,StringType,true), StructField(Zipcode,IntegerType,true), StructField(Battalion,StringType,true), StructField(StationArea,StringType,true), StructField(Box,StringType,true), StructField(OriginalPriority,StringType,true), StructField(Priority,StringType,true), StructField(FinalPriority,StringType,true), StructField(ALSUnit,BooleanType,true),...


In [39]:
// Read the data and create the DataFrame
val sfFireFile="C:/Users/nerea.gomez/Documents/Documentacion/Learning Spark/Datasets/sf-fire-calls.csv"
val fireDF = spark.read.schema(fireSchema)
 .option("header", "true")
 .csv(sfFireFile)

sfFireFile: String = C:/Users/nerea.gomez/Documents/Documentacion/Learning Spark/Datasets/sf-fire-calls.csv
fireDF: org.apache.spark.sql.DataFrame = [CallNumber: int, UnitID: string ... 26 more fields]


The data can be saved in Parquet format, which preserves the schema of the data.

The data can also be saved as a SQL table.

**Example projections and filters**

In [40]:
// Select three columns, keeping only the rows whose CallType is not "Medical Incident"
val fewFireDF = fireDF
 .select("IncidentNumber", "AvailableDtTm", "CallType")
 .where($"CallType" =!= "Medical Incident") 

fewFireDF.show(5, false)

+--------------+----------------------+--------------+
|IncidentNumber|AvailableDtTm         |CallType      |
+--------------+----------------------+--------------+
|2003235       |01/11/2002 01:51:44 AM|Structure Fire|
|2003250       |01/11/2002 04:16:46 AM|Vehicle Fire  |
|2003259       |01/11/2002 06:01:58 AM|Alarms        |
|2003279       |01/11/2002 08:03:26 AM|Structure Fire|
|2003301       |01/11/2002 09:46:44 AM|Alarms        |
+--------------+----------------------+--------------+
only showing top 5 rows



fewFireDF: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [IncidentNumber: int, AvailableDtTm: string ... 1 more field]


In [41]:
// Count the distinct non-null values in the CallType column
fireDF
 .select("CallType")
 .where(col("CallType").isNotNull)
 .agg(countDistinct('CallType) as 'DistinctCallTypes)
 .show()

+-----------------+
|DistinctCallTypes|
+-----------------+
|               30|
+-----------------+



// Show the distinct non-null values of the CallType column
// (in Scala, isNotNull is declared without parentheses, so it is called without them)
fireDF
 .select("CallType")
 .where(col("CallType").isNotNull)
 .distinct()
 .show(10, false)

**Example renaming, adding and dropping columns**

In [42]:
// withColumnRenamed returns a new DataFrame, so we assign it to a new value; fireDF keeps the original column name
val newFireDF = fireDF.withColumnRenamed("Delay", "ResponseDelayedinMins")
newFireDF
 .select("ResponseDelayedinMins")
 .where($"ResponseDelayedinMins" > 5)
 .show(5, false)

+---------------------+
|ResponseDelayedinMins|
+---------------------+
|5.35                 |
|6.25                 |
|5.2                  |
|5.6                  |
|7.25                 |
+---------------------+
only showing top 5 rows



newFireDF: org.apache.spark.sql.DataFrame = [CallNumber: int, UnitID: string ... 26 more fields]


In [43]:
// Create three new timestamp columns from the existing string columns
val fireTsDF = newFireDF
 .withColumn("IncidentDate", to_timestamp(col("CallDate"), "MM/dd/yyyy"))
 .drop("CallDate")
 .withColumn("OnWatchDate", to_timestamp(col("WatchDate"), "MM/dd/yyyy"))
 .drop("WatchDate")
 .withColumn("AvailableDtTS", to_timestamp(col("AvailableDtTm"),
 "MM/dd/yyyy hh:mm:ss a"))
 .drop("AvailableDtTm")

fireTsDF: org.apache.spark.sql.DataFrame = [CallNumber: int, UnitID: string ... 26 more fields]
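
As a rough cross-check of the format patterns used in the cell above, the same strings can be parsed in plain Python. This is a sketch, not Spark code: it assumes that the Spark pattern `MM/dd/yyyy hh:mm:ss a` corresponds to the `strptime` directives `%m/%d/%Y %I:%M:%S %p`, and the helper names `parse_date` and `parse_available` are made up for illustration. The sample value is copied from the AvailableDtTm column shown earlier.

```python
from datetime import datetime

# Assumed strptime equivalents of the Spark patterns used above.
DATE_FORMAT = "%m/%d/%Y"                   # Spark: MM/dd/yyyy
TIMESTAMP_FORMAT = "%m/%d/%Y %I:%M:%S %p"  # Spark: MM/dd/yyyy hh:mm:ss a


def parse_date(value: str) -> datetime:
    """Parse a date-only string such as '01/11/2002' (time defaults to midnight)."""
    return datetime.strptime(value, DATE_FORMAT)


def parse_available(value: str) -> datetime:
    """Parse a 12-hour timestamp string such as '01/11/2002 01:51:44 AM'."""
    return datetime.strptime(value, TIMESTAMP_FORMAT)


available = parse_available("01/11/2002 01:51:44 AM")
print(available)  # 2002-01-11 01:51:44
```

The `hh` and `a` parts matter: with a 12-hour pattern, the AM/PM marker is what separates 01:51 from 13:51.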


In [44]:
//Show the first five rows of the new columns
fireTsDF
 .select("IncidentDate", "OnWatchDate", "AvailableDtTS")
 .show(5, false)

+-------------------+-------------------+-------------------+
|IncidentDate       |OnWatchDate        |AvailableDtTS      |
+-------------------+-------------------+-------------------+
|2002-01-11 00:00:00|2002-01-10 00:00:00|2002-01-11 01:51:44|
|2002-01-11 00:00:00|2002-01-10 00:00:00|2002-01-11 03:01:18|
|2002-01-11 00:00:00|2002-01-10 00:00:00|2002-01-11 02:39:50|
|2002-01-11 00:00:00|2002-01-10 00:00:00|2002-01-11 04:16:46|
|2002-01-11 00:00:00|2002-01-10 00:00:00|2002-01-11 06:01:58|
+-------------------+-------------------+-------------------+
only showing top 5 rows



In [45]:
//Show the distinct years that appear in IncidentDate, in ascending order
fireTsDF
 .select(year($"IncidentDate"))
 .distinct()
 .orderBy(year($"IncidentDate"))
 .show()

+------------------+
|year(IncidentDate)|
+------------------+
|              2000|
|              2001|
|              2002|
|              2003|
|              2004|
|              2005|
|              2006|
|              2007|
|              2008|
|              2009|
|              2010|
|              2011|
|              2012|
|              2013|
|              2014|
|              2015|
|              2016|
|              2017|
|              2018|
+------------------+



**Example aggregations**

In [46]:
//Count the calls for each CallType and show the ten most frequent
fireTsDF
 .select("CallType")
 .where(col("CallType").isNotNull)
 .groupBy("CallType")
 .count()
 .orderBy(desc("count"))
 .show(10, false)

+-------------------------------+------+
|CallType                       |count |
+-------------------------------+------+
|Medical Incident               |113794|
|Structure Fire                 |23319 |
|Alarms                         |19406 |
|Traffic Collision              |7013  |
|Citizen Assist / Service Call  |2524  |
|Other                          |2166  |
|Outside Fire                   |2094  |
|Vehicle Fire                   |854   |
|Gas Leak (Natural and LP Gases)|764   |
|Water Rescue                   |755   |
+-------------------------------+------+
only showing top 10 rows
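
The pipeline above (drop nulls, group, count, sort descending, take the top rows) can be pictured with a small plain-Python analogy. This is only a sketch of the logic, not Spark code: the function name `top_call_types` and the sample list are hypothetical and not taken from the fire-calls dataset.

```python
from collections import Counter
from typing import List, Optional, Tuple


def top_call_types(call_types: List[Optional[str]], n: int) -> List[Tuple[str, int]]:
    """Drop None values, count occurrences per value, return the n most frequent."""
    counts = Counter(ct for ct in call_types if ct is not None)
    return counts.most_common(n)


# Hypothetical sample values, made up for illustration.
sample = ["Alarms", "Structure Fire", None, "Alarms",
          "Medical Incident", "Alarms", "Structure Fire"]
print(top_call_types(sample, 2))  # [('Alarms', 3), ('Structure Fire', 2)]
```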



In [47]:
//Compute the sum of NumAlarms and the average, minimum, and maximum of ResponseDelayedinMins
fireTsDF
 .select(sum("NumAlarms"), avg("ResponseDelayedinMins"), min("ResponseDelayedinMins"), max("ResponseDelayedinMins"))
 .show()

+--------------+--------------------------+--------------------------+--------------------------+
|sum(NumAlarms)|avg(ResponseDelayedinMins)|min(ResponseDelayedinMins)|max(ResponseDelayedinMins)|
+--------------+--------------------------+--------------------------+--------------------------+
|        176170|         3.892364154521585|               0.016666668|                   1844.55|
+--------------+--------------------------+--------------------------+--------------------------+
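
For intuition, the same kind of summary can be computed over a handful of values in plain Python. This is a sketch, not Spark code: the function name `summarize` is made up, and the five input values are just the ResponseDelayedinMins rows printed earlier in the renaming exercise, not the full column.

```python
from statistics import mean
from typing import Dict, List


def summarize(values: List[float]) -> Dict[str, float]:
    """Return the average, minimum, and maximum of a non-empty list."""
    return {"avg": mean(values), "min": min(values), "max": max(values)}


# The five delays shown in the earlier output (all greater than 5 minutes).
delays = [5.35, 6.25, 5.2, 5.6, 7.25]
summary = summarize(delays)
print(summary["min"], summary["max"])  # 5.2 7.25
```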



-------------------------------------------------------------------------------------------------------------------------------

## The Dataset API

The Dataset API in Spark has an interface similar to the DataFrame API. It comes in two flavors:   
    - Typed: Dataset[T], in Scala and Java   
    - Untyped: DataFrame = Dataset[Row], an alias in Scala    
    
Datasets: Java and Scala   
DataFrames: R and Python

In [48]:
//Scala
import org.apache.spark.sql.Row
val row = Row(350, true, "Learning Spark 2E", null)

import org.apache.spark.sql.Row
row: org.apache.spark.sql.Row = [350,true,Learning Spark 2E,null]


In [49]:
row.getInt(0)

res23: Int = 350


In [50]:
row.getBoolean(1)

res24: Boolean = true


In [1]:
##Python
from pyspark.sql import Row
row = Row(350, True, "Learning Spark 2E", None)

In [2]:
row[0]

350

In [3]:
row[1]

True

        For small results, toPandas() gives easier-to-read output than show() (it collects all rows to the driver)

**Scala**

**Creating Datasets:** to create a Dataset you need to know the schema in advance, that is, the data types it will hold.

In [52]:
case class DeviceIoTData (battery_level: Long, c02_level: Long,
cca2: String, cca3: String, cn: String, device_id: Long,
device_name: String, humidity: Long, ip: String, latitude: Double,
lcd: String, longitude: Double, scale:String, temp: Long,
timestamp: Long)

defined class DeviceIoTData


In [53]:
val ds = spark.read
.json("C:/Users/nerea.gomez/Documents/Documentacion/Learning Spark/Datasets/iot_devices.json")
.as[DeviceIoTData]

ds: org.apache.spark.sql.Dataset[DeviceIoTData] = [battery_level: bigint, c02_level: bigint ... 13 more fields]


In [54]:
ds.show(5, false)

+-------------+---------+----+----+-------------+---------+---------------------+--------+-------------+--------+------+---------+-------+----+-------------+
|battery_level|c02_level|cca2|cca3|cn           |device_id|device_name          |humidity|ip           |latitude|lcd   |longitude|scale  |temp|timestamp    |
+-------------+---------+----+----+-------------+---------+---------------------+--------+-------------+--------+------+---------+-------+----+-------------+
|8            |868      |US  |USA |United States|1        |meter-gauge-1xbYRYcj |51      |68.161.225.1 |38.0    |green |-97.0    |Celsius|34  |1458444054093|
|7            |1473     |NO  |NOR |Norway       |2        |sensor-pad-2n2Pea    |70      |213.161.254.1|62.47   |red   |6.15     |Celsius|11  |1458444054119|
|2            |1556     |IT  |ITA |Italy        |3        |device-mac-36TWSKiT  |44      |88.36.5.1    |42.83   |red   |12.83    |Celsius|19  |1458444054120|
|6            |1080     |US  |USA |United States|4  

**Dataset operations:** the same as with DataFrames

In [55]:
val filterTempDS = ds.filter(d => d.temp > 30 && d.humidity > 70)

filterTempDS: org.apache.spark.sql.Dataset[DeviceIoTData] = [battery_level: bigint, c02_level: bigint ... 13 more fields]
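
The filter above takes a function over a typed object, so fields are accessed by name (`d.temp`, `d.humidity`). A plain-Python analogy, as a sketch only: the `Device` dataclass and `hot_and_humid` are hypothetical names, the first two records reuse values from the `ds.show()` output above, and the third record is made up so that something passes the filter.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Device:
    device_name: str
    temp: int
    humidity: int


def hot_and_humid(devices: List[Device]) -> List[Device]:
    """Keep devices with temp > 30 and humidity > 70, like the lambda above."""
    return [d for d in devices if d.temp > 30 and d.humidity > 70]


devices = [
    Device("meter-gauge-1xbYRYcj", 34, 51),  # values from the output above
    Device("sensor-pad-2n2Pea", 11, 70),     # values from the output above
    Device("hypothetical-device", 35, 80),   # made up for illustration
]
print([d.device_name for d in hot_and_humid(devices)])  # ['hypothetical-device']
```

Both conditions must hold: the first record is hot but not humid enough, so it is dropped.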


*Another example*

In [57]:
//Define the schema as a case class
case class DeviceTempByCountry(temp: Long, device_name: String, device_id: Long,
 cca3: String)

defined class DeviceTempByCountry


In [58]:
//Another way to write the previous query (note: device_id is selected twice, so it appears twice in the output)
val dsTemp2 = ds
 .select($"temp", $"device_name", $"device_id", $"device_id", $"cca3")
 .where("temp > 25")
 .as[DeviceTempByCountry]
dsTemp2.show(5, false)

+----+---------------------+---------+---------+----+
|temp|device_name          |device_id|device_id|cca3|
+----+---------------------+---------+---------+----+
|34  |meter-gauge-1xbYRYcj |1        |1        |USA |
|28  |sensor-pad-4mzWkz    |4        |4        |USA |
|27  |sensor-pad-6al7RTAobR|6        |6        |USA |
|27  |sensor-pad-8xUD6pzsQI|8        |8        |JPN |
|26  |sensor-pad-10BsywSYUF|10       |10       |USA |
+----+---------------------+---------+---------+----+
only showing top 5 rows



dsTemp2: org.apache.spark.sql.Dataset[DeviceTempByCountry] = [temp: bigint, device_name: string ... 3 more fields]


    Semantically, select() is like map(): both pick out fields and produce equivalent results
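
A plain-Python analogy of that equivalence, as a sketch only and not Spark code: `select` and `project_with_map` are hypothetical helpers, and the two sample rows reuse values printed in the `ds.show()` output earlier.

```python
from typing import Dict, List, Tuple

# Two rows copied from the ds.show() output (only some of the columns).
rows: List[Dict[str, object]] = [
    {"temp": 34, "device_name": "meter-gauge-1xbYRYcj", "device_id": 1,
     "cca3": "USA", "humidity": 51},
    {"temp": 19, "device_name": "device-mac-36TWSKiT", "device_id": 3,
     "cca3": "ITA", "humidity": 44},
]


def select(data: List[Dict[str, object]], columns: List[str]) -> List[Tuple]:
    """Column-oriented projection: keep only the named fields."""
    return [tuple(row[c] for c in columns) for row in data]


def project_with_map(data: List[Dict[str, object]]) -> List[Tuple]:
    """Function-oriented projection: apply a function to every row."""
    return list(map(
        lambda r: (r["temp"], r["device_name"], r["device_id"], r["cca3"]),
        data,
    ))


warm = [r for r in rows if r["temp"] > 25]
by_select = select(warm, ["temp", "device_name", "device_id", "cca3"])
by_map = project_with_map(warm)
print(by_select == by_map)  # True
print(by_select)            # [(34, 'meter-gauge-1xbYRYcj', 1, 'USA')]
```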

In [60]:
//Look at only the first row of the Dataset
val device = dsTemp2.first()
println(device)

DeviceTempByCountry(34,meter-gauge-1xbYRYcj,1,USA)


device: DeviceTempByCountry = DeviceTempByCountry(34,meter-gauge-1xbYRYcj,1,USA)


**DataFrames vs Datasets**   
    - If you want to tell Spark what to do, not how to do it -> DF or DS   
    - Rich semantics, high-level abstractions... -> DF or DS   
    - If you want compile-time type safety and don't mind creating multiple case classes -> DS   
    - High-level expressions, filters, aggregations, SQL queries... -> DF or DS   
    - Relational transformations similar to SQL queries -> DF   
    - Unification, optimization, and simplification of the APIs -> DF   
    - R or Python -> DF    
    - Space and speed efficiency -> DF   
