1. Create a database and select it with the USE command.
2. Create tables inside the database.
3. Create views from the tables in the database.
       Difference between a table and a view: a temporary view disappears when the SparkSession ends, whereas tables persist in the database.

**Difference between a temporary view and a global temporary view**    
A temporary view can only be used in the SparkSession that created it, whereas a global temporary view can be used from multiple SparkSessions within the same Spark application.

As with DataFrames, tables and views can be cached from SQL. By default `CACHE TABLE` is eager; with the optional `LAZY` keyword the table is cached only when an action first uses it:

    spark.sql("CACHE [LAZY] TABLE <table-name>")
    spark.sql("UNCACHE TABLE <table-name>")

To read a file into a DataFrame, use `spark.read` (read options on page 119).

To save a DataFrame to a file, use `df.write` (write options on page 120).

To save a DataFrame as a SQL table, use `df.write.saveAsTable`.

File types that can be read and written:          
  - Parquet: stores the data together with its metadata. It is the default file format.
  - JSON: an easy-to-read format with two modes: single-line (each line holds one JSON object) and multiline (a single JSON document spread over many lines). To save a DataFrame as compressed JSON, set `.format("json").option("compression", "snappy")`. More JSON options on page 125.
  - CSV: plain-text files whose fields are separated by a delimiter, usually a comma. More options on page 128.
  - Avro: when showing an Avro-backed SQL table, pass `false` to `show` so that column values are not truncated. More Avro options on page 130.
  - Images: page 132
  - Binary files: page 134

### Scala

In [1]:
import org.apache.spark.sql.SparkSession

Intitializing Scala interpreter ...

Spark Web UI available at http://EM2021002836.bosonit.local:4042
SparkContext available as 'sc' (version = 3.1.1, master = local[*], app id = local-1623741654944)
SparkSession available as 'spark'


import org.apache.spark.sql.SparkSession


In [2]:
// Create a SparkSession
val spark = SparkSession
    .builder
    .appName("SparkSQLExampleApp")
    .getOrCreate()

spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@628b81be


In [3]:
// Path to the data file
val csvFile= "C:/Users/nerea.gomez/Documents/Documentacion/Learning Spark/Datasets/departuredelays.csv"

csvFile: String = C:/Users/nerea.gomez/Documents/Documentacion/Learning Spark/Datasets/departuredelays.csv


In [4]:
// Read the file and infer the schema
val df = spark.read.format("csv")
    .option("inferSchema", "true")
    .option("header", "true")
    .load(csvFile)

df: org.apache.spark.sql.DataFrame = [date: int, delay: int ... 3 more fields]


In [5]:
// Create a temporary view
df.createOrReplaceTempView("us_delay_flights_tbl")

To define the DataFrame schema explicitly instead of inferring it:

    val schema = "date STRING, delay INT, distance INT, origin STRING, destination STRING"


In [6]:
df.select("*").show(4)

+-------+-----+--------+------+-----------+
|   date|delay|distance|origin|destination|
+-------+-----+--------+------+-----------+
|1011245|    6|     602|   ABE|        ATL|
|1020600|   -8|     369|   ABE|        DTW|
|1021245|   -2|     602|   ABE|        ATL|
|1020605|   -4|     602|   ABE|        ATL|
+-------+-----+--------+------+-----------+
only showing top 4 rows



Examples with DF vs SQL:

In [7]:
// DataFrame API
import org.apache.spark.sql.functions.{col, desc, when}
// Option 1:
df.select("distance", "origin", "destination")
 .where(col("distance") > 1000)
 .orderBy(desc("distance")).show(10)

// Option 2:
df.select("distance", "origin", "destination")
 .where("distance > 1000")
 .orderBy(desc("distance")).show(10)


//SQL
spark.sql("""SELECT distance, origin, destination 
FROM us_delay_flights_tbl WHERE distance > 1000 
ORDER BY distance DESC""").show(10)

+--------+------+-----------+
|distance|origin|destination|
+--------+------+-----------+
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
+--------+------+-----------+
only showing top 10 rows

+--------+------+-----------+
|distance|origin|destination|
+--------+------+-----------+
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
+--------+------+-----------+
only showing top 10 rows

+--------+------+-----------+
|distance|origin|destination|
+--------+------+-----------+
|    4330|   JFK| 

In [8]:
//DF
df.select("date", "delay", "origin", "destination")
    .where((col("delay")> 120) and (col("origin")==="SFO") and (col("destination")==="ORD"))
    .orderBy(desc("delay"))
    .show(10)

//SQL
spark.sql("""SELECT date, delay, origin, destination 
FROM us_delay_flights_tbl 
WHERE delay > 120 AND ORIGIN = 'SFO' AND DESTINATION = 'ORD' 
ORDER by delay DESC""").show(10)

+-------+-----+------+-----------+
|   date|delay|origin|destination|
+-------+-----+------+-----------+
|2190925| 1638|   SFO|        ORD|
|1031755|  396|   SFO|        ORD|
|1022330|  326|   SFO|        ORD|
|1051205|  320|   SFO|        ORD|
|1190925|  297|   SFO|        ORD|
|2171115|  296|   SFO|        ORD|
|1071040|  279|   SFO|        ORD|
|1051550|  274|   SFO|        ORD|
|3120730|  266|   SFO|        ORD|
|1261104|  258|   SFO|        ORD|
+-------+-----+------+-----------+
only showing top 10 rows

+-------+-----+------+-----------+
|   date|delay|origin|destination|
+-------+-----+------+-----------+
|2190925| 1638|   SFO|        ORD|
|1031755|  396|   SFO|        ORD|
|1022330|  326|   SFO|        ORD|
|1051205|  320|   SFO|        ORD|
|1190925|  297|   SFO|        ORD|
|2171115|  296|   SFO|        ORD|
|1071040|  279|   SFO|        ORD|
|1051550|  274|   SFO|        ORD|
|3120730|  266|   SFO|        ORD|
|1261104|  258|   SFO|        ORD|
+-------+-----+------+-----------+

In [9]:
//DF
// Upper bounds are inclusive so that delays of exactly 60, 120, or 360 minutes are not labeled "Early"
df.select("delay", "origin", "destination")
    .withColumn("Flights_Delays", when(col("delay") > 360, "Very Long Delays")
                                    .when((col("delay") > 120) and (col("delay") <= 360), "Long Delays")
                                    .when((col("delay") > 60) and (col("delay") <= 120), "Short Delays")
                                    .when((col("delay") > 0) and (col("delay") <= 60), "Tolerable Delays")
                                    .when(col("delay") === 0, "No Delays")
                                    .otherwise("Early"))
     .orderBy(col("origin"), desc("delay"))
     .show(10)


//SQL
spark.sql("""SELECT delay, origin, destination,
 CASE
 WHEN delay > 360 THEN 'Very Long Delays'
 WHEN delay > 120 AND delay <= 360 THEN 'Long Delays'
 WHEN delay > 60 AND delay <= 120 THEN 'Short Delays'
 WHEN delay > 0 AND delay <= 60 THEN 'Tolerable Delays'
 WHEN delay = 0 THEN 'No Delays'
 ELSE 'Early'
 END AS Flight_Delays
 FROM us_delay_flights_tbl
 ORDER BY origin, delay DESC""").show(10)

+-----+------+-----------+--------------+
|delay|origin|destination|Flights_Delays|
+-----+------+-----------+--------------+
|  333|   ABE|        ATL|   Long Delays|
|  305|   ABE|        ATL|   Long Delays|
|  275|   ABE|        ATL|   Long Delays|
|  257|   ABE|        ATL|   Long Delays|
|  247|   ABE|        DTW|   Long Delays|
|  247|   ABE|        ATL|   Long Delays|
|  219|   ABE|        ORD|   Long Delays|
|  211|   ABE|        ATL|   Long Delays|
|  197|   ABE|        DTW|   Long Delays|
|  192|   ABE|        ORD|   Long Delays|
+-----+------+-----------+--------------+
only showing top 10 rows

+-----+------+-----------+-------------+
|delay|origin|destination|Flight_Delays|
+-----+------+-----------+-------------+
|  333|   ABE|        ATL|  Long Delays|
|  305|   ABE|        ATL|  Long Delays|
|  275|   ABE|        ATL|  Long Delays|
|  257|   ABE|        ATL|  Long Delays|
|  247|   ABE|        ATL|  Long Delays|
|  247|   ABE|        DTW|  Long Delays|
|  219|   ABE|   

Create a database and switch to it so that new tables are created inside it:

In [11]:
spark.sql("CREATE DATABASE learn_spark_db")
spark.sql("USE learn_spark_db")

res6: org.apache.spark.sql.DataFrame = []


Creating a managed table (dropping this kind of table deletes both the metadata and the data):

In [None]:
// Using the DataFrame API:
df.write.saveAsTable("managed_us_delay_flights_tbl3")

Creating an unmanaged table (dropping this kind of table deletes only the metadata; the data files are kept):

In [17]:
spark.sql("DROP TABLE us_delay_flights_tbl")

res12: org.apache.spark.sql.DataFrame = []


In [18]:
spark.sql("""CREATE TABLE us_delay_flights_tbl(date STRING, delay INT, 
 distance INT, origin STRING, destination STRING) 
 USING csv OPTIONS (PATH 'C:/Users/nerea.gomez/Documents/Documentacion/Learning Spark/Datasets/departuredelays.csv')""")

res13: org.apache.spark.sql.DataFrame = []


In [19]:
// Using the DataFrame API:
df
 .write
 .option("path", "/tmp/data/us_flights_delay")
 .saveAsTable("us_delay_flights_tbl4")

Creating temporary views:

In [20]:
spark.sql(""" SELECT date, delay, origin, destination from us_delay_flights_tbl WHERE
 origin = 'SFO' """).show()

+--------+-----+------+-----------+
|    date|delay|origin|destination|
+--------+-----+------+-----------+
|01011250|   55|   SFO|        JFK|
|01012230|    0|   SFO|        JFK|
|01010705|   -7|   SFO|        JFK|
|01010620|   -3|   SFO|        MIA|
|01010915|   -3|   SFO|        LAX|
|01011005|   -8|   SFO|        DFW|
|01011800|    0|   SFO|        ORD|
|01011740|   -7|   SFO|        LAX|
|01012015|   -7|   SFO|        LAX|
|01012110|   -1|   SFO|        MIA|
|01011610|  134|   SFO|        DFW|
|01011240|   -6|   SFO|        MIA|
|01010755|   -3|   SFO|        DFW|
|01010020|    0|   SFO|        DFW|
|01010705|   -6|   SFO|        LAX|
|01010925|   -3|   SFO|        ORD|
|01010555|   -6|   SFO|        ORD|
|01011105|   -8|   SFO|        DFW|
|01012330|   32|   SFO|        ORD|
|01011330|    3|   SFO|        DFW|
+--------+-----+------+-----------+
only showing top 20 rows



In [21]:
// DataFrame API: registers a global temp view (query it as global_temp.<name>)
val df_sfo = spark.sql("SELECT date, delay, origin, destination FROM us_delay_flights_tbl WHERE origin = 'SFO'")
df_sfo.createOrReplaceGlobalTempView("us_origin_airport_SFO_global_tmp_view2")
df_sfo.show(5)            
    
//SQL
// Note: despite the view name, CREATE TEMP VIEW makes a session-scoped view, not a global one
spark.sql("""CREATE TEMP VIEW us_origin_airport_SFO_global_tmp_view AS
 SELECT date, delay, origin, destination from us_delay_flights_tbl WHERE
 origin = 'SFO'""")
spark.sql(""" SELECT date, delay, origin, destination from us_origin_airport_SFO_global_tmp_view""").show(5)

+--------+-----+------+-----------+
|    date|delay|origin|destination|
+--------+-----+------+-----------+
|01011250|   55|   SFO|        JFK|
|01012230|    0|   SFO|        JFK|
|01010705|   -7|   SFO|        JFK|
|01010620|   -3|   SFO|        MIA|
|01010915|   -3|   SFO|        LAX|
+--------+-----+------+-----------+
only showing top 5 rows

+--------+-----+------+-----------+
|    date|delay|origin|destination|
+--------+-----+------+-----------+
|01011250|   55|   SFO|        JFK|
|01012230|    0|   SFO|        JFK|
|01010705|   -7|   SFO|        JFK|
|01010620|   -3|   SFO|        MIA|
|01010915|   -3|   SFO|        LAX|
+--------+-----+------+-----------+
only showing top 5 rows



df_sfo: org.apache.spark.sql.DataFrame = [date: string, delay: int ... 2 more fields]


In [22]:
// DataFrame API
val df_jfk = spark.sql("SELECT date, delay, origin, destination FROM us_delay_flights_tbl WHERE origin = 'JFK'")
df_jfk.createOrReplaceTempView("us_origin_airport_JFK_tmp_view2")
df_jfk.show(5)

//SQL
spark.sql("""CREATE TEMP VIEW us_origin_airport_JFK_tmp_view AS
 SELECT date, delay, origin, destination from us_delay_flights_tbl WHERE
 origin = 'JFK'""")
spark.sql(""" SELECT date, delay, origin, destination from us_origin_airport_JFK_tmp_view""").show(5)

+--------+-----+------+-----------+
|    date|delay|origin|destination|
+--------+-----+------+-----------+
|01010900|   14|   JFK|        LAX|
|01011200|   -3|   JFK|        LAX|
|01011900|    2|   JFK|        LAX|
|01011700|   11|   JFK|        LAS|
|01010800|   -1|   JFK|        SFO|
+--------+-----+------+-----------+
only showing top 5 rows

+--------+-----+------+-----------+
|    date|delay|origin|destination|
+--------+-----+------+-----------+
|01010900|   14|   JFK|        LAX|
|01011200|   -3|   JFK|        LAX|
|01011900|    2|   JFK|        LAX|
|01011700|   11|   JFK|        LAS|
|01010800|   -1|   JFK|        SFO|
+--------+-----+------+-----------+
only showing top 5 rows



df_jfk: org.apache.spark.sql.DataFrame = [date: string, delay: int ... 2 more fields]


To query a global temporary view, prefix its name with `global_temp.`:

In [23]:
spark.sql("""CREATE GLOBAL TEMP VIEW prueba AS
 SELECT date, delay, origin, destination from us_delay_flights_tbl WHERE
 origin = 'JFK'""")
spark.sql(""" SELECT date, delay, origin, destination from global_temp.prueba""").show(5)

+--------+-----+------+-----------+
|    date|delay|origin|destination|
+--------+-----+------+-----------+
|01010900|   14|   JFK|        LAX|
|01011200|   -3|   JFK|        LAX|
|01011900|    2|   JFK|        LAX|
|01011700|   11|   JFK|        LAS|
|01010800|   -1|   JFK|        SFO|
+--------+-----+------+-----------+
only showing top 5 rows



In [24]:
spark.sql("""SHOW TABLES""").show()

+--------------+--------------------+-----------+
|      database|           tableName|isTemporary|
+--------------+--------------------+-----------+
|learn_spark_db|us_delay_flights_tbl|      false|
|learn_spark_db|us_delay_flights_...|      false|
|              |us_origin_airport...|       true|
|              |us_origin_airport...|       true|
|              |us_origin_airport...|       true|
+--------------+--------------------+-----------+



Dropping views:

In [25]:
//SQL
spark.sql("""DROP VIEW IF EXISTS us_origin_airport_SFO_global_tmp_view;""")
spark.sql("""DROP VIEW IF EXISTS us_origin_airport_JFK_tmp_view""")


spark.catalog.dropGlobalTempView("prueba") // global temp view
// Returns false here because the SQL statement above already dropped this view
spark.catalog.dropTempView("us_origin_airport_JFK_tmp_view")

res20: Boolean = false


Accessing the metadata of a database, a table, or its columns:

In [26]:
spark.catalog.listDatabases().show()

+--------------+----------------+--------------------+
|          name|     description|         locationUri|
+--------------+----------------+--------------------+
|       default|default database|file:/C:/Users/ne...|
|learn_spark_db|                |file:/C:/Users/ne...|
+--------------+----------------+--------------------+



In [27]:
spark.catalog.listTables().show()

+--------------------+--------------+-----------+---------+-----------+
|                name|      database|description|tableType|isTemporary|
+--------------------+--------------+-----------+---------+-----------+
|us_delay_flights_tbl|learn_spark_db|       null| EXTERNAL|      false|
|us_delay_flights_...|learn_spark_db|       null| EXTERNAL|      false|
|us_origin_airport...|          null|       null|TEMPORARY|       true|
+--------------------+--------------+-----------+---------+-----------+



In [28]:
spark.catalog.listColumns("us_delay_flights_tbl").show()

+-----------+-----------+--------+--------+-----------+--------+
|       name|description|dataType|nullable|isPartition|isBucket|
+-----------+-----------+--------+--------+-----------+--------+
|       date|       null|  string|    true|      false|   false|
|      delay|       null|     int|    true|      false|   false|
|   distance|       null|     int|    true|      false|   false|
|     origin|       null|  string|    true|      false|   false|
|destination|       null|  string|    true|      false|   false|
+-----------+-----------+--------+--------+-----------+--------+



In [29]:
// Load a SQL table into a DataFrame
val usFlightsDF = spark.sql("SELECT * FROM us_delay_flights_tbl")
val usFlightsDF2 = spark.table("us_delay_flights_tbl")

usFlightsDF: org.apache.spark.sql.DataFrame = [date: string, delay: int ... 3 more fields]
usFlightsDF2: org.apache.spark.sql.DataFrame = [date: string, delay: int ... 3 more fields]


Importing files into a DataFrame:

In [30]:
// Use Parquet 
val file ="""C:/Users/nerea.gomez/Documents/Documentacion/Learning Spark/Datasets/2010-summary.parquet"""
val df = spark.read.format("parquet").load(file)
// Use Parquet; you can omit format("parquet") if you wish as it's the default
val df2 = spark.read.load(file)

file: String = C:/Users/nerea.gomez/Documents/Documentacion/Learning Spark/Datasets/2010-summary.parquet
df: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]
df2: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]


In [31]:
// Use CSV
val df3 = spark.read.format("csv")
 .option("inferSchema", "true")
 .option("header", "true")
 .option("mode", "PERMISSIVE")
 .load("C:/Users/nerea.gomez/Documents/Documentacion/Learning Spark/Datasets/csv/*")

df3: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]


In [32]:
// Use JSON
val df4 = spark.read.format("json")
 .load("C:/Users/nerea.gomez/Documents/Documentacion/Learning Spark/Datasets/json/*")

df4: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]


Exporting a DataFrame to a file:

### Python

In [1]:
from pyspark.sql import SparkSession

In [2]:
## Create the SparkSession
spark =(SparkSession
        .builder
        .appName("SparkSQLExampleApp")
        .getOrCreate())

In [3]:
## Path to the data file
csv_file ="C:/Users/nerea.gomez/Documents/Documentacion/Learning Spark/Datasets/departuredelays.csv"

In [4]:
## Read the data
df = (spark.read.format("csv")
     .option("header", "true")
     .option("inferSchema", "true")
     .load(csv_file))

In [5]:
## Create a temporary view
df.createOrReplaceTempView("us_delay_flights_tbl")

To define the DataFrame schema explicitly instead of inferring it:

    schema = "`date` STRING, `delay` INT, `distance` INT, `origin` STRING, `destination` STRING"


In [6]:
df.select("*").show(4)

+-------+-----+--------+------+-----------+
|   date|delay|distance|origin|destination|
+-------+-----+--------+------+-----------+
|1011245|    6|     602|   ABE|        ATL|
|1020600|   -8|     369|   ABE|        DTW|
|1021245|   -2|     602|   ABE|        ATL|
|1020605|   -4|     602|   ABE|        ATL|
+-------+-----+--------+------+-----------+
only showing top 4 rows



Examples with DF vs SQL:

In [7]:
from pyspark.sql import functions as F
from pyspark.sql.functions import col, desc

## DataFrame API
## Option 1
(df.select("distance", "origin", "destination")
 .where(col("distance") > 1000)
 .orderBy(desc("distance"))
 .show(10))

## Option 2
(df.select("distance", "origin", "destination")
 .where("distance > 1000")
 .orderBy("distance", ascending=False).show(10))


##SQL
spark.sql("""SELECT distance, origin, destination 
FROM us_delay_flights_tbl WHERE distance > 1000 
ORDER BY distance DESC""").show(10)


+--------+------+-----------+
|distance|origin|destination|
+--------+------+-----------+
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
+--------+------+-----------+
only showing top 10 rows

+--------+------+-----------+
|distance|origin|destination|
+--------+------+-----------+
|    4330|   JFK|        HNL|
|    4330|   JFK|        HNL|
|    4330|   JFK|        HNL|
|    4330|   JFK|        HNL|
|    4330|   JFK|        HNL|
|    4330|   JFK|        HNL|
|    4330|   JFK|        HNL|
|    4330|   JFK|        HNL|
|    4330|   JFK|        HNL|
|    4330|   JFK|        HNL|
+--------+------+-----------+
only showing top 10 rows

+--------+------+-----------+
|distance|origin|destination|
+--------+------+-----------+
|    4330|   HNL| 

In [8]:
##DF
(df.select("date", "delay", "origin", "destination")\
    .where((col("delay")> 120) & (col("origin")=='SFO') & (col("destination")=='ORD'))\
    .orderBy(desc("delay"))\
    .show(10))

##SQL
spark.sql("""SELECT date, delay, origin, destination 
FROM us_delay_flights_tbl 
WHERE delay > 120 AND ORIGIN = 'SFO' AND DESTINATION = 'ORD' 
ORDER by delay DESC""").show(10)

+-------+-----+------+-----------+
|   date|delay|origin|destination|
+-------+-----+------+-----------+
|2190925| 1638|   SFO|        ORD|
|1031755|  396|   SFO|        ORD|
|1022330|  326|   SFO|        ORD|
|1051205|  320|   SFO|        ORD|
|1190925|  297|   SFO|        ORD|
|2171115|  296|   SFO|        ORD|
|1071040|  279|   SFO|        ORD|
|1051550|  274|   SFO|        ORD|
|3120730|  266|   SFO|        ORD|
|1261104|  258|   SFO|        ORD|
+-------+-----+------+-----------+
only showing top 10 rows

+-------+-----+------+-----------+
|   date|delay|origin|destination|
+-------+-----+------+-----------+
|2190925| 1638|   SFO|        ORD|
|1031755|  396|   SFO|        ORD|
|1022330|  326|   SFO|        ORD|
|1051205|  320|   SFO|        ORD|
|1190925|  297|   SFO|        ORD|
|2171115|  296|   SFO|        ORD|
|1071040|  279|   SFO|        ORD|
|1051550|  274|   SFO|        ORD|
|3120730|  266|   SFO|        ORD|
|1261104|  258|   SFO|        ORD|
+-------+-----+------+-----------+

In [9]:
##DF
## Upper bounds are inclusive so that delays of exactly 60, 120, or 360 minutes are not labeled "Early"
(df.select("delay", "origin", "destination")
    .withColumn("Flights_Delays", F.when(F.col("delay") > 360, "Very Long Delays")
                                    .when((F.col("delay") > 120) & (F.col("delay") <= 360), "Long Delays")
                                    .when((F.col("delay") > 60) & (F.col("delay") <= 120), "Short Delays")
                                    .when((F.col("delay") > 0) & (F.col("delay") <= 60), "Tolerable Delays")
                                    .when(F.col("delay") == 0, "No Delays")
                                    .otherwise("Early"))
     .orderBy("origin", desc("delay"))
     .show(10))


##SQL
spark.sql("""SELECT delay, origin, destination,
 CASE
 WHEN delay > 360 THEN 'Very Long Delays'
 WHEN delay > 120 AND delay <= 360 THEN 'Long Delays'
 WHEN delay > 60 AND delay <= 120 THEN 'Short Delays'
 WHEN delay > 0 AND delay <= 60 THEN 'Tolerable Delays'
 WHEN delay = 0 THEN 'No Delays'
 ELSE 'Early'
 END AS Flight_Delays
 FROM us_delay_flights_tbl
 ORDER BY origin, delay DESC""").show(10)

+-----+------+-----------+--------------+
|delay|origin|destination|Flights_Delays|
+-----+------+-----------+--------------+
|  333|   ABE|        ATL|   Long Delays|
|  305|   ABE|        ATL|   Long Delays|
|  275|   ABE|        ATL|   Long Delays|
|  257|   ABE|        ATL|   Long Delays|
|  247|   ABE|        ATL|   Long Delays|
|  247|   ABE|        DTW|   Long Delays|
|  219|   ABE|        ORD|   Long Delays|
|  211|   ABE|        ATL|   Long Delays|
|  197|   ABE|        DTW|   Long Delays|
|  192|   ABE|        ORD|   Long Delays|
+-----+------+-----------+--------------+
only showing top 10 rows

+-----+------+-----------+-------------+
|delay|origin|destination|Flight_Delays|
+-----+------+-----------+-------------+
|  333|   ABE|        ATL|  Long Delays|
|  305|   ABE|        ATL|  Long Delays|
|  275|   ABE|        ATL|  Long Delays|
|  257|   ABE|        ATL|  Long Delays|
|  247|   ABE|        ATL|  Long Delays|
|  247|   ABE|        DTW|  Long Delays|
|  219|   ABE|   

Create a database and switch to it so that new tables are created inside it:

In [10]:
spark.sql("DROP DATABASE IF EXISTS learn_spark_db CASCADE")

DataFrame[]

In [11]:
spark.sql("CREATE DATABASE learn_spark_db")
spark.sql("USE learn_spark_db")

DataFrame[]

Creating a managed table (dropping this kind of table deletes both the metadata and the data):

In [12]:
spark.sql("CREATE TABLE managed_us_delay_flights_tbl2 (date STRING, delay INT, distance INT, origin STRING, destination STRING)")

DataFrame[]

In [13]:
## Using the DataFrame API:
df.write.saveAsTable("managed_us_delay_flights_tbl")

Creating an unmanaged table (dropping this kind of table deletes only the metadata; the data files are kept):

In [14]:
spark.sql("DROP TABLE us_delay_flights_tbl")

DataFrame[]

In [15]:
spark.sql("""CREATE TABLE us_delay_flights_tbl(date STRING, delay INT, 
 distance INT, origin STRING, destination STRING) 
 USING csv OPTIONS (PATH 'C:/Users/nerea.gomez/Documents/Documentacion/Learning Spark/Datasets/departuredelays.csv')""")

DataFrame[]

In [16]:
## Using the DataFrame API:
(df
 .write
 .option("path", "/tmp/data/us_flights_delay")
 .saveAsTable("us_delay_flights_tbl2"))

Creating temporary views:

In [17]:
spark.sql(""" SELECT date, delay, origin, destination from us_delay_flights_tbl WHERE
 origin = 'SFO' """).show()

+--------+-----+------+-----------+
|    date|delay|origin|destination|
+--------+-----+------+-----------+
|01011250|   55|   SFO|        JFK|
|01012230|    0|   SFO|        JFK|
|01010705|   -7|   SFO|        JFK|
|01010620|   -3|   SFO|        MIA|
|01010915|   -3|   SFO|        LAX|
|01011005|   -8|   SFO|        DFW|
|01011800|    0|   SFO|        ORD|
|01011740|   -7|   SFO|        LAX|
|01012015|   -7|   SFO|        LAX|
|01012110|   -1|   SFO|        MIA|
|01011610|  134|   SFO|        DFW|
|01011240|   -6|   SFO|        MIA|
|01010755|   -3|   SFO|        DFW|
|01010020|    0|   SFO|        DFW|
|01010705|   -6|   SFO|        LAX|
|01010925|   -3|   SFO|        ORD|
|01010555|   -6|   SFO|        ORD|
|01011105|   -8|   SFO|        DFW|
|01012330|   32|   SFO|        ORD|
|01011330|    3|   SFO|        DFW|
+--------+-----+------+-----------+
only showing top 20 rows



In [19]:
## DataFrame API: registers a global temp view (query it as global_temp.<name>)
df_sfo = spark.sql("SELECT date, delay, origin, destination FROM us_delay_flights_tbl WHERE origin = 'SFO'")
df_sfo.createOrReplaceGlobalTempView("us_origin_airport_SFO_global_tmp_view2")
df_sfo.show(5)            
    
##SQL
## Note: despite the view name, CREATE TEMP VIEW makes a session-scoped view, not a global one
spark.sql("""CREATE TEMP VIEW us_origin_airport_SFO_global_tmp_view AS
 SELECT date, delay, origin, destination from us_delay_flights_tbl WHERE
 origin = 'SFO'""")
spark.sql(""" SELECT date, delay, origin, destination from us_origin_airport_SFO_global_tmp_view""").show(5)

+--------+-----+------+-----------+
|    date|delay|origin|destination|
+--------+-----+------+-----------+
|01011250|   55|   SFO|        JFK|
|01012230|    0|   SFO|        JFK|
|01010705|   -7|   SFO|        JFK|
|01010620|   -3|   SFO|        MIA|
|01010915|   -3|   SFO|        LAX|
+--------+-----+------+-----------+
only showing top 5 rows

+--------+-----+------+-----------+
|    date|delay|origin|destination|
+--------+-----+------+-----------+
|01011250|   55|   SFO|        JFK|
|01012230|    0|   SFO|        JFK|
|01010705|   -7|   SFO|        JFK|
|01010620|   -3|   SFO|        MIA|
|01010915|   -3|   SFO|        LAX|
+--------+-----+------+-----------+
only showing top 5 rows



In [20]:
## DataFrame API
df_jfk = spark.sql("SELECT date, delay, origin, destination FROM us_delay_flights_tbl WHERE origin = 'JFK'")
df_jfk.createOrReplaceTempView("us_origin_airport_JFK_tmp_view2")
df_jfk.show(5)

##SQL
spark.sql("""CREATE TEMP VIEW us_origin_airport_JFK_tmp_view AS
 SELECT date, delay, origin, destination from us_delay_flights_tbl WHERE
 origin = 'JFK'""")
spark.sql(""" SELECT date, delay, origin, destination from us_origin_airport_JFK_tmp_view""").show(5)

+--------+-----+------+-----------+
|    date|delay|origin|destination|
+--------+-----+------+-----------+
|01010900|   14|   JFK|        LAX|
|01011200|   -3|   JFK|        LAX|
|01011900|    2|   JFK|        LAX|
|01011700|   11|   JFK|        LAS|
|01010800|   -1|   JFK|        SFO|
+--------+-----+------+-----------+
only showing top 5 rows

+--------+-----+------+-----------+
|    date|delay|origin|destination|
+--------+-----+------+-----------+
|01010900|   14|   JFK|        LAX|
|01011200|   -3|   JFK|        LAX|
|01011900|    2|   JFK|        LAX|
|01011700|   11|   JFK|        LAS|
|01010800|   -1|   JFK|        SFO|
+--------+-----+------+-----------+
only showing top 5 rows



To query a global temporary view, prefix its name with `global_temp.`:

In [21]:
spark.sql("""CREATE GLOBAL TEMP VIEW prueba AS
 SELECT date, delay, origin, destination from us_delay_flights_tbl WHERE
 origin = 'JFK'""")
spark.sql(""" SELECT date, delay, origin, destination from global_temp.prueba""").show(5)

+--------+-----+------+-----------+
|    date|delay|origin|destination|
+--------+-----+------+-----------+
|01010900|   14|   JFK|        LAX|
|01011200|   -3|   JFK|        LAX|
|01011900|    2|   JFK|        LAX|
|01011700|   11|   JFK|        LAS|
|01010800|   -1|   JFK|        SFO|
+--------+-----+------+-----------+
only showing top 5 rows



In [22]:
spark.sql("""SHOW TABLES""").show()

+--------------+--------------------+-----------+
|      database|           tableName|isTemporary|
+--------------+--------------------+-----------+
|learn_spark_db|managed_us_delay_...|      false|
|learn_spark_db|managed_us_delay_...|      false|
|learn_spark_db|us_delay_flights_tbl|      false|
|learn_spark_db|us_delay_flights_...|      false|
|              |us_origin_airport...|       true|
|              |us_origin_airport...|       true|
|              |us_origin_airport...|       true|
+--------------+--------------------+-----------+



Dropping views:

In [23]:
##SQL
spark.sql("""DROP VIEW IF EXISTS us_origin_airport_SFO_global_tmp_view;""")
spark.sql("""DROP VIEW IF EXISTS us_origin_airport_JFK_tmp_view""")

## Catalog API (PySpark)
spark.catalog.dropGlobalTempView("prueba")  # global temp view
spark.catalog.dropTempView("us_origin_airport_JFK_tmp_view")

Accessing the metadata of a database, a table, or its columns:

In [28]:
spark.catalog.listDatabases()

[Database(name='default', description='Default Hive database', locationUri='file:/C:/Users/nerea.gomez/Documents/Documentacion/Learning%20Spark/spark-warehouse'),
 Database(name='learn_spark_db', description='', locationUri='file:/C:/Users/nerea.gomez/Documents/Documentacion/Learning%20Spark/spark-warehouse/learn_spark_db.db')]

In [25]:
spark.catalog.listTables()

[Table(name='managed_us_delay_flights_tbl', database='learn_spark_db', description=None, tableType='MANAGED', isTemporary=False),
 Table(name='managed_us_delay_flights_tbl2', database='learn_spark_db', description=None, tableType='MANAGED', isTemporary=False),
 Table(name='us_delay_flights_tbl', database='learn_spark_db', description=None, tableType='EXTERNAL', isTemporary=False),
 Table(name='us_delay_flights_tbl2', database='learn_spark_db', description=None, tableType='EXTERNAL', isTemporary=False),
 Table(name='us_origin_airport_jfk_tmp_view2', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]

In [26]:
spark.catalog.listColumns("us_delay_flights_tbl")

[Column(name='date', description=None, dataType='string', nullable=True, isPartition=False, isBucket=False),
 Column(name='delay', description=None, dataType='int', nullable=True, isPartition=False, isBucket=False),
 Column(name='distance', description=None, dataType='int', nullable=True, isPartition=False, isBucket=False),
 Column(name='origin', description=None, dataType='string', nullable=True, isPartition=False, isBucket=False),
 Column(name='destination', description=None, dataType='string', nullable=True, isPartition=False, isBucket=False)]

In [29]:
## Load a SQL table into a DataFrame
us_flights_df = spark.sql("SELECT * FROM us_delay_flights_tbl")
us_flights_df2 = spark.table("us_delay_flights_tbl")

Reading files in different formats:

In [30]:
## Use Parquet 
file = """C:/Users/nerea.gomez/Documents/Documentacion/Learning Spark/Datasets/2010-summary.parquet"""
df = spark.read.format("parquet").load(file)
## Use Parquet; you can omit format("parquet") if you wish as it's the default
df2 = spark.read.load(file)

In [31]:
## Use CSV
df3 = spark.read.format("csv").option("inferSchema", "true").option("header", "true").option("mode", "PERMISSIVE")\
            .load("C:/Users/nerea.gomez/Documents/Documentacion/Learning Spark/Datasets/csv/*")

In [32]:
## Use JSON
df4 = spark.read.format("json").load("C:/Users/nerea.gomez/Documents/Documentacion/Learning Spark/Datasets/json/*")