# SparkSQL and DataFrames 

<a href = "http://yogen.io"><img src="http://yogen.io/assets/logo.svg" alt="yogen" style="width: 200px; float: right;"/></a>

## RDDs, DataSets, and DataFrames

RDDs are the original interface for Spark programming.

DataFrames were introduced in Spark 1.3.

Datasets were introduced in Spark 1.6 and unified with DataFrames in 2.0.

### Advantages of DataFrames:

From https://www.datacamp.com/community/tutorials/apache-spark-python:

> More specifically, the performance improvements are due to two things, which you’ll often come across when you’re reading up DataFrames: custom memory management (project Tungsten), which will make sure that your Spark jobs much faster given CPU constraints, and optimized execution plans (Catalyst optimizer), of which the logical plan of the DataFrame is a part.

## SparkSQL and DataFrames 


PySpark does not have the Dataset API, which is available only when you use Spark from a statically typed language: Scala or Java.

From https://spark.apache.org/docs/2.4.4/sql-programming-guide.html

> A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs. The DataFrame API is available in Scala, Java, Python, and R. In Scala and Java, a DataFrame is represented by a Dataset of Rows. In the Scala API, DataFrame is simply a type alias of `Dataset[Row]`. While, in Java API, users need to use `Dataset<Row>` to represent a DataFrame.


### The pyspark.sql module

Important classes of Spark SQL and DataFrames:

* `pyspark.sql.SparkSession` Main entry point for DataFrame and SQL functionality.

* `pyspark.sql.DataFrame` A distributed collection of data grouped into named columns.

* `pyspark.sql.Column` A column expression in a DataFrame.

* `pyspark.sql.Row` A row of data in a DataFrame.

* `pyspark.sql.GroupedData` Aggregation methods, returned by DataFrame.groupBy().

* `pyspark.sql.DataFrameNaFunctions` Methods for handling missing data (null values).

* `pyspark.sql.DataFrameStatFunctions` Methods for statistics functionality.

* `pyspark.sql.functions` List of built-in functions available for DataFrame.

* `pyspark.sql.types` List of data types available.

* `pyspark.sql.Window` For working with window functions.

http://spark.apache.org/docs/2.4.4/api/python/pyspark.sql.html

https://spark.apache.org/docs/2.4.4/sql-programming-guide.html

## SparkSession

The traditional way to interact with Spark is the SparkContext. In the notebooks we get that from the pyspark driver.

Since Spark 2.0, SparkSession is the single entry point: we can use it in place of SparkConf, SparkContext and SQLContext.

### If you are running this notebook in Google Colab

Copy the following to a code cell and run it. It will install and set up Spark for you.

```python
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://apache.uvigo.es/spark/spark-2.4.6/spark-2.4.6-bin-hadoop2.7.tgz
!tar -xf spark-2.4.6-bin-hadoop2.7.tgz
!pip install -q findspark

import os
import findspark
from pyspark.sql import SparkSession

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.6-bin-hadoop2.7"
findspark.init()
spark = SparkSession.builder.master("local[*]").getOrCreate()
```

In [1]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://apache.uvigo.es/spark/spark-2.4.6/spark-2.4.6-bin-hadoop2.7.tgz
!tar -xf spark-2.4.6-bin-hadoop2.7.tgz
!pip install -q findspark pyspark==2.4.6
 
import os
import findspark
from pyspark.sql import SparkSession
 
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.6-bin-hadoop2.7"
findspark.init()
spark = SparkSession.builder.master("local[*]").getOrCreate()



#### Passing other options to the Spark session

In [2]:
spark

We can check option values in the resulting session like this:

In [3]:
spark.sparkContext.getConf().getAll()

[('spark.driver.port', '42039'),
 ('spark.rdd.compress', 'True'),
 ('spark.serializer.objectStreamReset', '100'),
 ('spark.app.id', 'local-1593846551727'),
 ('spark.master', 'local[*]'),
 ('spark.executor.id', 'driver'),
 ('spark.submit.deployMode', 'client'),
 ('spark.driver.host', '0697c868a2f7'),
 ('spark.ui.showConsoleProgress', 'true'),
 ('spark.app.name', 'pyspark-shell')]

In [4]:
spark = SparkSession.builder.master("local[*]").config('Mi Nombrecito', 'Felipito').getOrCreate()

In [5]:
spark.sparkContext.getConf().getAll()

[('spark.driver.port', '42039'),
 ('spark.rdd.compress', 'True'),
 ('Mi Nombrecito', 'Felipito'),
 ('spark.serializer.objectStreamReset', '100'),
 ('spark.app.id', 'local-1593846551727'),
 ('spark.master', 'local[*]'),
 ('spark.executor.id', 'driver'),
 ('spark.submit.deployMode', 'client'),
 ('spark.driver.host', '0697c868a2f7'),
 ('spark.ui.showConsoleProgress', 'true'),
 ('spark.app.name', 'pyspark-shell')]

### Creating DataFrames

`SparkSession.createDataFrame` builds a DataFrame from an RDD, a list or a `pandas.DataFrame`.

In [6]:
import random

random.seed(42)

n = 20
races = random.choices(['elf', 'hobbit', 'orc'], k = n)
creatures = [ (id_, race) for id_, race in zip(range(n), races) ]
creatures

[(0, 'hobbit'),
 (1, 'elf'),
 (2, 'elf'),
 (3, 'elf'),
 (4, 'orc'),
 (5, 'orc'),
 (6, 'orc'),
 (7, 'elf'),
 (8, 'hobbit'),
 (9, 'elf'),
 (10, 'elf'),
 (11, 'hobbit'),
 (12, 'elf'),
 (13, 'elf'),
 (14, 'hobbit'),
 (15, 'hobbit'),
 (16, 'elf'),
 (17, 'hobbit'),
 (18, 'orc'),
 (19, 'elf')]

In [7]:
df = spark.createDataFrame(creatures)
df

DataFrame[_1: bigint, _2: string]

In [8]:
# DataFrames are lazy too, just like RDDs
df.take(5)

[Row(_1=0, _2='hobbit'),
 Row(_1=1, _2='elf'),
 Row(_1=2, _2='elf'),
 Row(_1=3, _2='elf'),
 Row(_1=4, _2='orc')]

In [9]:
from pyspark.sql import Row
# Row objects will help us encapsulate the structure of the DataFrame
Row(id_ = 4, race = 'elf')

Row(id_=4, race='elf')

In [10]:
# Alternatively, pass the schema argument
df = spark.createDataFrame(creatures, schema=['id', 'race'])
df

DataFrame[id: bigint, race: string]

In [11]:
x = df.show(5)
# This action only prints the rows; it returns nothing else (x is None).

+---+------+
| id|  race|
+---+------+
|  0|hobbit|
|  1|   elf|
|  2|   elf|
|  3|   elf|
|  4|   orc|
+---+------+
only showing top 5 rows



In [12]:
# printSchema() is similar to show(): it prints the schema
df.printSchema()

root
 |-- id: long (nullable = true)
 |-- race: string (nullable = true)



### Creating DataFrames

* From RDDs
* From Hive tables
* From Spark sources: parquet (default), json, jdbc, orc, libsvm, csv, text


#### From RDDs

In [14]:
rdd = spark.sparkContext.textFile('coupon150720.csv.gz')
rdd.take(3)

['79062005698500,1,MAA,AUH,9W,9W,56.79,USD,1,H,H,0526,150904,OK,IAF0',
 '79062005698500,2,AUH,CDG,9W,9W,84.34,USD,1,H,H,6120,150905,OK,IAF0',
 '79062005924069,1,CJB,MAA,9W,9W,60.0,USD,1,H,H,2768,150721,OK,IAA0']

In [15]:
split_lines = rdd.map(lambda line: line.split(','))
split_lines.take(3)

[['79062005698500',
  '1',
  'MAA',
  'AUH',
  '9W',
  '9W',
  '56.79',
  'USD',
  '1',
  'H',
  'H',
  '0526',
  '150904',
  'OK',
  'IAF0'],
 ['79062005698500',
  '2',
  'AUH',
  'CDG',
  '9W',
  '9W',
  '84.34',
  'USD',
  '1',
  'H',
  'H',
  '6120',
  '150905',
  'OK',
  'IAF0'],
 ['79062005924069',
  '1',
  'CJB',
  'MAA',
  '9W',
  '9W',
  '60.0',
  'USD',
  '1',
  'H',
  'H',
  '2768',
  '150721',
  'OK',
  'IAA0']]

In [16]:
spark.createDataFrame(split_lines)

DataFrame[_1: string, _2: string, _3: string, _4: string, _5: string, _6: string, _7: string, _8: string, _9: string, _10: string, _11: string, _12: string, _13: string, _14: string, _15: string]

### Inferring and specifying schemas

In [17]:
df = spark.createDataFrame(creatures, schema=['id', 'race'])
df

DataFrame[id: bigint, race: string]

#### Fully specifying a schema

We need to create a `StructType` composed of `StructField`s. Each of those specifies a field with name, type and `nullable` properties.

In [18]:
from pyspark.sql import types

types.BooleanType()

schema = types.StructType([types.StructField('id', types.IntegerType(), False), types.StructField('race', types.StringType())])

df = spark.createDataFrame(creatures, schema = schema)
df.printSchema()

root
 |-- id: integer (nullable = false)
 |-- race: string (nullable = true)



#### From csv files

We can either read them directly into dataframes or read them as RDDs and transform that into a DataFrame. This second way will be very useful if we have unstructured data like web server logs.

In [19]:
spark.read.csv('coupon150720.csv.gz', inferSchema = True)

DataFrame[_c0: bigint, _c1: int, _c2: string, _c3: string, _c4: string, _c5: string, _c6: double, _c7: string, _c8: int, _c9: string, _c10: string, _c11: string, _c12: int, _c13: string, _c14: string]

In [20]:
coupons = spark.sql('''SELECT
                      CAST (_C0 AS BIGINT) as tkt_number,
                      CAST (_C1 AS INT) as cpn_number,
                      _c0 as tkt_number,
                      _c1 as cpn_number,
                      _c2 as origin,
                      _c3 as dest,
                      _c4 as carrier,
                      cast(_c6 AS FLOAT) as amount
                   FROM csv.`coupon150720.csv.gz`''')
coupons.show(10)

+--------------+----------+--------------+----------+------+----+-------+------+
|    tkt_number|cpn_number|    tkt_number|cpn_number|origin|dest|carrier|amount|
+--------------+----------+--------------+----------+------+----+-------+------+
|79062005698500|         1|79062005698500|         1|   MAA| AUH|     9W| 56.79|
|79062005698500|         2|79062005698500|         2|   AUH| CDG|     9W| 84.34|
|79062005924069|         1|79062005924069|         1|   CJB| MAA|     9W|  60.0|
|79065668570385|         1|79065668570385|         1|   DEL| DXB|     9W|160.63|
|79065668737021|         1|79065668737021|         1|   AUH| IXE|     9W|152.46|
|79062006192650|         1|79062006192650|         1|   RPR| BOM|     9W|  68.5|
|79062006192650|         2|79062006192650|         2|   BOM| RPR|     9W|  68.5|
|79062005733853|         1|79062005733853|         1|   DEL| DED|     9W| 56.16|
|79062005836987|         1|79062005836987|         1|   ATL| LGA|     AA|  28.3|
|79062005836987|         2|7

#### From other types of data

Apache Parquet is a free and open-source column-oriented data store of the Apache Hadoop ecosystem. It is similar to the other columnar storage file formats available in Hadoop namely RCFile and Optimized RCFile. It is compatible with most of the data processing frameworks in the Hadoop environment.

In [21]:
spark.read.jdbc # Would be used to read from a database reachable over JDBC (Java Database Connectivity)
spark.read.parquet
spark.read.json

<bound method DataFrameReader.json of <pyspark.sql.readwriter.DataFrameReader object at 0x7f3ac9da9358>>

### Basic operations with DataFrames

In [22]:
df.show()

+---+------+
| id|  race|
+---+------+
|  0|hobbit|
|  1|   elf|
|  2|   elf|
|  3|   elf|
|  4|   orc|
|  5|   orc|
|  6|   orc|
|  7|   elf|
|  8|hobbit|
|  9|   elf|
| 10|   elf|
| 11|hobbit|
| 12|   elf|
| 13|   elf|
| 14|hobbit|
| 15|hobbit|
| 16|   elf|
| 17|hobbit|
| 18|   orc|
| 19|   elf|
+---+------+



In [23]:
df.show(5)

+---+------+
| id|  race|
+---+------+
|  0|hobbit|
|  1|   elf|
|  2|   elf|
|  3|   elf|
|  4|   orc|
+---+------+
only showing top 5 rows



In [24]:
df.take(3)

[Row(id=0, race='hobbit'), Row(id=1, race='elf'), Row(id=2, race='elf')]

### Filtering and selecting

The syntax is inspired by SQL.

In [25]:
# df.filter()

df.select('id')

DataFrame[id: int]

If we want to filter, we will need to build an instance of `Column`, using square bracket notation.

In [26]:
df['id'] # A Column lets us build expressions on a Spark DataFrame. A one-column DataFrame
# is not the same as a Column: a Column is just an object that serves as a reference.

Column<b'id'>

In [27]:
df['id'].show()


TypeError: ignored

In [28]:
df.select(df['id']) # From the DataFrame df, select the column id of df

DataFrame[id: int]

In [29]:
# To filter, we are obliged to build a Column.

df.filter(df['id'] < 5)

DataFrame[id: int, race: string]

Writing `df.filter('id' < 5)` instead would not work: Python evaluates `'id' < 5` first, and a comparison between str and int raises a TypeError, so Spark never even gets the chance to infer which column we are referring to.

`where` is an exact synonym of `filter`.

In [30]:
# where() works exactly like filter()
df_orcs = df.where(df['race'] == 'orc').select('id')
df_orcs.show()

A Column is quite different from a Pandas Series. It is just a reference to a column, and can only be used to construct Spark SQL expressions (select, where...). It can't be collected or taken as a one-dimensional sequence:

In [31]:
df['race'].show()

TypeError: ignored

#### Exercise

Extract all employee ids which correspond to orcs

In [32]:
df_orcs = df.filter(df['race'] == 'orc').select('id')
df_orcs.show()

### Adding columns

DataFrames are immutable, since they are built on top of RDDs, so we cannot assign to them. We need to create new DataFrames with the appropriate columns.

In [33]:
df['square'] = df['id'] ** 2

TypeError: ignored

In [34]:
df2 = df.withColumn('square', df['id'] ** 2)
df2.show(5)

+---+------+------+
| id|  race|square|
+---+------+------+
|  0|hobbit|   0.0|
|  1|   elf|   1.0|
|  2|   elf|   4.0|
|  3|   elf|   9.0|
|  4|   orc|  16.0|
+---+------+------+
only showing top 5 rows



### User defined functions

There are many useful functions in `pyspark.sql.functions`. These work on columns, that is, they are vectorized.

We can write User Defined Functions (`udf`s), which allow us to "vectorize" operations: write a standard function to process single elements, then build a udf with that that works on columns in a DataFrame, like a SQL function.

In [35]:
df2['id'] + df2['square']

Column<b'(id + square)'>

In [36]:
import math

# We obviously can't apply the function directly to a Column,
# but we can wrap the function
math.log1p(df['id'])

TypeError: ignored

This errors out because 

```python
math.log1p
```

is not a udf: it doesn't know how to work with strings or Column objects. The built-in functions in `pyspark.sql.functions`, such as `cos`, do:

In [37]:
from pyspark.sql import functions

df.select(functions.cos(df['id'])).show()

+--------------------+
|             COS(id)|
+--------------------+
|                 1.0|
|  0.5403023058681398|
| -0.4161468365471424|
| -0.9899924966004454|
| -0.6536436208636119|
| 0.28366218546322625|
|  0.9601702866503661|
|  0.7539022543433046|
|-0.14550003380861354|
| -0.9111302618846769|
| -0.8390715290764524|
|0.004425697988050785|
|  0.8438539587324921|
|  0.9074467814501962|
|  0.1367372182078336|
| -0.7596879128588213|
| -0.9576594803233847|
|-0.27516333805159693|
|  0.6603167082440802|
|  0.9887046181866692|
+--------------------+



But we can transform it into a udf:

In [38]:
my_udf = functions.udf(math.log1p, returnType = types.FloatType()) # Wrap the function so that it can be applied to columns
# returnType sets the type of the column that the udf returns
df.select(my_udf(df['id'])).show(5) # Apply the udf

+---------+
|log1p(id)|
+---------+
|      0.0|
|0.6931472|
|1.0986123|
|1.3862944|
| 1.609438|
+---------+
only showing top 5 rows



We can do the same with any function we dream up:

In [39]:
# We can also wrap a lambda, or any other callable.
last_2 = functions.udf(lambda word: word[-2:])
df.select(last_2('race')).show(5)

+--------------+
|<lambda>(race)|
+--------------+
|            it|
|            lf|
|            lf|
|            lf|
|            rc|
+--------------+
only showing top 5 rows



If we want the resulting columns to be of a particular type, we need to specify the return type, because Python return types cannot be inferred. If we don't, `udf` declares the result as a string.

In [40]:
df.select(my_udf(df['id'])).printSchema()

root
 |-- log1p(id): float (nullable = true)



Think about this function: what is its return type?

In [41]:
def incognito (a, b):
  return a + b
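
One way to approach the question: the same Python function returns different types depending on what it receives. A small plain-Python sketch, repeating the definition so that it runs on its own (the sample arguments are arbitrary):

```python
def incognito(a, b):
    return a + b

# The return type depends entirely on the arguments
results = [incognito(1, 2), incognito(1.5, 2), incognito('el', 'f'), incognito([1], [2])]
type_names = [type(r).__name__ for r in results]
print(type_names)  # ['int', 'float', 'str', 'list']
```

Since Spark has to know the column type before running the function on any row, it is usually safer to state the `returnType` explicitly when wrapping such a function in a udf.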

In [42]:
df.select('*',
          my_udf('id').cast(types.DoubleType()).alias('new_column'))

DataFrame[id: int, race: string, new_column: double]

#### Exercise: 

Create a 'hitpoints' field in our df. Make it 30000 for hobbits (halflings), 40000 for elves and 70000 for orcs.



In [43]:
from pyspark.sql import functions

def hitpoints(race):
  reference = {'elf' : 40000, 'orc' : 70000, 'hobbit' : 30000}
  return reference[race]

hitpoints('hobbit')

30000

In [45]:
hp_udf = functions.udf(hitpoints, types.IntegerType())
hp_udf(df['race'])

df3 = df.withColumn('hp', hp_udf('race'))
df3.show()

+---+------+-----+
| id|  race|   hp|
+---+------+-----+
|  0|hobbit|30000|
|  1|   elf|40000|
|  2|   elf|40000|
|  3|   elf|40000|
|  4|   orc|70000|
|  5|   orc|70000|
|  6|   orc|70000|
|  7|   elf|40000|
|  8|hobbit|30000|
|  9|   elf|40000|
| 10|   elf|40000|
| 11|hobbit|30000|
| 12|   elf|40000|
| 13|   elf|40000|
| 14|hobbit|30000|
| 15|hobbit|30000|
| 16|   elf|40000|
| 17|hobbit|30000|
| 18|   orc|70000|
| 19|   elf|40000|
+---+------+-----+



If we have a column that is not the desired type, we can convert it with `cast`.

### Summary statistics

https://databricks.com/blog/2015/06/02/statistical-and-mathematical-functions-with-dataframes-in-spark.html

In [46]:
df3.stat.corr('id', 'hp')

-0.14067080955835778

In [47]:
df3.stat.cov('id', 'hp')

-12105.263157894737

### .crosstab()

Crosstab returns the contingency table for two columns, as a DataFrame.

In [48]:
random.seed(17)
land = functions.udf(lambda: random.choice(['gondor', 'rohan']))

df4 = df3.withColumn('land', land())
df4.show()

# A udf with no arguments builds the column out of nothing

+---+------+-----+------+
| id|  race|   hp|  land|
+---+------+-----+------+
|  0|hobbit|30000|gondor|
|  1|   elf|40000| rohan|
|  2|   elf|40000| rohan|
|  3|   elf|40000|gondor|
|  4|   orc|70000| rohan|
|  5|   orc|70000|gondor|
|  6|   orc|70000| rohan|
|  7|   elf|40000| rohan|
|  8|hobbit|30000| rohan|
|  9|   elf|40000|gondor|
| 10|   elf|40000|gondor|
| 11|hobbit|30000| rohan|
| 12|   elf|40000| rohan|
| 13|   elf|40000|gondor|
| 14|hobbit|30000| rohan|
| 15|hobbit|30000|gondor|
| 16|   elf|40000| rohan|
| 17|hobbit|30000| rohan|
| 18|   orc|70000| rohan|
| 19|   elf|40000|gondor|
+---+------+-----+------+



In [49]:
df4.cache().show(5)

+---+------+-----+------+
| id|  race|   hp|  land|
+---+------+-----+------+
|  0|hobbit|30000|gondor|
|  1|   elf|40000| rohan|
|  2|   elf|40000|gondor|
|  3|   elf|40000|gondor|
|  4|   orc|70000| rohan|
+---+------+-----+------+
only showing top 5 rows



In [50]:
df4.crosstab('race', 'land').show()

+---------+------+-----+
|race_land|gondor|rohan|
+---------+------+-----+
|   hobbit|     3|    3|
|      orc|     1|    3|
|      elf|     6|    4|
+---------+------+-----+



### Grouping

Grouping works very similarly to Pandas: executing groupby (or groupBy) on a DataFrame will return an object (a GroupedData) that can then be aggregated to obtain the results.

In [51]:
gd = df4.groupBy('land')
gd

<pyspark.sql.group.GroupedData at 0x7f3ac9da4f60>

GroupedData has several aggregation functions defined:

In [52]:
gd.sum('hp').show()

+------+-------+
|  land|sum(hp)|
+------+-------+
| rohan| 460000|
|gondor| 400000|
+------+-------+



We can do several aggregations in a single step, with a number of different syntaxes:

In [53]:
gd.agg({'hp' : 'mean', 'id' : 'count'}).show()

+------+-------+---------+
|  land|avg(hp)|count(id)|
+------+-------+---------+
| rohan|46000.0|       10|
|gondor|40000.0|       10|
+------+-------+---------+



In [54]:
gd.agg(functions.mean('hp'), 
       functions.count('id'),
       functions.mean('id')).show()

+------+-------+---------+-------+
|  land|avg(hp)|count(id)|avg(id)|
+------+-------+---------+-------+
| rohan|46000.0|       10|   10.0|
|gondor|40000.0|       10|    9.0|
+------+-------+---------+-------+



In [55]:
df4.groupby(df4['id'] < 5).mean('hp').show()

+--------+------------------+
|(id < 5)|           avg(hp)|
+--------+------------------+
|    true|           44000.0|
|   false|42666.666666666664|
+--------+------------------+



### Intersections

Joins work very much like SQL joins. We can specify the join columns and the join method (left, right, inner, outer); if we give no columns, Spark treats the join as a cross join.

In [56]:
# These are joins like the ones in SQL and Pandas
result = gd.agg(functions.mean('hp'), 
       functions.count('id'),
       functions.mean('id'))

result.show()

+------+-------+---------+-------+
|  land|avg(hp)|count(id)|avg(id)|
+------+-------+---------+-------+
| rohan|46000.0|       10|   10.0|
|gondor|40000.0|       10|    9.0|
+------+-------+---------+-------+



In [57]:
df4.join(result).show()

AnalysisException: ignored

Spark refuses to do implicit cross joins by default. We can either

a) Allow them explicitly:

```python
spark.conf.set("spark.sql.crossJoin.enabled", "true")
```

b) Specify the join criterion

```python
df4.join(new_df, on='id').show()
```

In [58]:
df4.join(result, on = 'land').show()

+------+---+------+-----+-------+---------+-------+
|  land| id|  race|   hp|avg(hp)|count(id)|avg(id)|
+------+---+------+-----+-------+---------+-------+
|gondor|  0|hobbit|30000|40000.0|       10|    9.0|
| rohan|  1|   elf|40000|46000.0|       10|   10.0|
|gondor|  2|   elf|40000|40000.0|       10|    9.0|
|gondor|  3|   elf|40000|40000.0|       10|    9.0|
| rohan|  4|   orc|70000|46000.0|       10|   10.0|
| rohan|  5|   orc|70000|46000.0|       10|   10.0|
| rohan|  6|   orc|70000|46000.0|       10|   10.0|
|gondor|  7|   elf|40000|40000.0|       10|    9.0|
|gondor|  8|hobbit|30000|40000.0|       10|    9.0|
| rohan|  9|   elf|40000|46000.0|       10|   10.0|
|gondor| 10|   elf|40000|40000.0|       10|    9.0|
| rohan| 11|hobbit|30000|46000.0|       10|   10.0|
|gondor| 12|   elf|40000|40000.0|       10|    9.0|
|gondor| 13|   elf|40000|40000.0|       10|    9.0|
| rohan| 14|hobbit|30000|46000.0|       10|   10.0|
| rohan| 15|hobbit|30000|46000.0|       10|   10.0|
| rohan| 16|

In [59]:
df4.join(result, on = df4['id'] > result['count(id)'], how = 'left').show()

+---+------+-----+------+------+-------+---------+-------+
| id|  race|   hp|  land|  land|avg(hp)|count(id)|avg(id)|
+---+------+-----+------+------+-------+---------+-------+
|  0|hobbit|30000|gondor|  null|   null|     null|   null|
|  1|   elf|40000| rohan|  null|   null|     null|   null|
|  2|   elf|40000|gondor|  null|   null|     null|   null|
|  3|   elf|40000|gondor|  null|   null|     null|   null|
|  4|   orc|70000| rohan|  null|   null|     null|   null|
|  5|   orc|70000| rohan|  null|   null|     null|   null|
|  6|   orc|70000| rohan|  null|   null|     null|   null|
|  7|   elf|40000|gondor|  null|   null|     null|   null|
|  8|hobbit|30000|gondor|  null|   null|     null|   null|
|  9|   elf|40000| rohan|  null|   null|     null|   null|
| 10|   elf|40000|gondor|  null|   null|     null|   null|
| 11|hobbit|30000| rohan| rohan|46000.0|       10|   10.0|
| 11|hobbit|30000| rohan|gondor|40000.0|       10|    9.0|
| 12|   elf|40000|gondor| rohan|46000.0|       10|   10.

#### Digression

We can monitor our running jobs and the storage they use in the Spark Web UI. We can get its URL with `spark.sparkContext.uiWebUrl`.

StorageLevels represent how our DataFrame is cached: we can save the results of the computation up to that point, so that if we process the same data several times, only the subsequent steps are recomputed.

We can release the cached data with `unpersist`.

#### Exercise

Calculate the [z-score](http://www.statisticshowto.com/probability-and-statistics/z-score/) of each employee's hitpoints for their location

1) Calculate the mean and std of hitpoints for each location

In [71]:
stats = df4.groupby('land').agg(functions.mean('hp').alias('avg_hp'),
                        functions.stddev('hp').alias('std_hp'))

2) Annotate each employee with the stats corresponding to their location

In [72]:
annotated = df4.join(stats, on = 'land')
annotated.show()

+------+---+------+-----+-------+------------------+
|  land| id|  race|   hp| avg_hp|            std_hp|
+------+---+------+-----+-------+------------------+
|gondor|  0|hobbit|30000|40000.0|11547.005383792515|
| rohan|  1|   elf|40000|46000.0|17126.976771553505|
|gondor|  2|   elf|40000|40000.0|11547.005383792515|
|gondor|  3|   elf|40000|40000.0|11547.005383792515|
| rohan|  4|   orc|70000|46000.0|17126.976771553505|
| rohan|  5|   orc|70000|46000.0|17126.976771553505|
| rohan|  6|   orc|70000|46000.0|17126.976771553505|
|gondor|  7|   elf|40000|40000.0|11547.005383792515|
|gondor|  8|hobbit|30000|40000.0|11547.005383792515|
| rohan|  9|   elf|40000|46000.0|17126.976771553505|
|gondor| 10|   elf|40000|40000.0|11547.005383792515|
| rohan| 11|hobbit|30000|46000.0|17126.976771553505|
|gondor| 12|   elf|40000|40000.0|11547.005383792515|
|gondor| 13|   elf|40000|40000.0|11547.005383792515|
| rohan| 14|hobbit|30000|46000.0|17126.976771553505|
| rohan| 15|hobbit|30000|46000.0|17126.9767715

3) Calculate the z-score

In [73]:
annotated.select('*',
                 ((annotated['hp'] - annotated['avg_hp']) / annotated['std_hp']).alias('z')).show()

+------+---+------+-----+-------+------------------+--------------------+
|  land| id|  race|   hp| avg_hp|            std_hp|                   z|
+------+---+------+-----+-------+------------------+--------------------+
|gondor|  0|hobbit|30000|40000.0|11547.005383792515| -0.8660254037844387|
| rohan|  1|   elf|40000|46000.0|17126.976771553505|-0.35032452487268534|
|gondor|  2|   elf|40000|40000.0|11547.005383792515|                 0.0|
|gondor|  3|   elf|40000|40000.0|11547.005383792515|                 0.0|
| rohan|  4|   orc|70000|46000.0|17126.976771553505|  1.4012980994907414|
| rohan|  5|   orc|70000|46000.0|17126.976771553505|  1.4012980994907414|
| rohan|  6|   orc|70000|46000.0|17126.976771553505|  1.4012980994907414|
|gondor|  7|   elf|40000|40000.0|11547.005383792515|                 0.0|
|gondor|  8|hobbit|30000|40000.0|11547.005383792515| -0.8660254037844387|
| rohan|  9|   elf|40000|46000.0|17126.976771553505|-0.35032452487268534|
|gondor| 10|   elf|40000|40000.0|11547

Note that we can build more complex boolean conditions for joining, as well as joining on columns that do not have the same name:

### Handling null values

In [70]:
# We either drop the nulls or fill them in
mounts = spark.createDataFrame([['rohan', 'horse', None], ['gondor', None, 1]], schema = ['land', 'mount', 'burned_alive'])
mounts.show()

+------+-----+------------+
|  land|mount|burned_alive|
+------+-----+------------+
| rohan|horse|        null|
|gondor| null|           1|
+------+-----+------------+



In [79]:
with_nulls = annotated.join(mounts, on = 'land')
with_nulls.show()

+------+---+------+-----+-------+------------------+-----+------------+
|  land| id|  race|   hp| avg_hp|            std_hp|mount|burned_alive|
+------+---+------+-----+-------+------------------+-----+------------+
| rohan| 19|   elf|40000|46000.0|17126.976771553505|horse|        null|
| rohan| 16|   elf|40000|46000.0|17126.976771553505|horse|        null|
| rohan| 15|hobbit|30000|46000.0|17126.976771553505|horse|        null|
| rohan| 14|hobbit|30000|46000.0|17126.976771553505|horse|        null|
| rohan| 11|hobbit|30000|46000.0|17126.976771553505|horse|        null|
| rohan|  9|   elf|40000|46000.0|17126.976771553505|horse|        null|
| rohan|  6|   orc|70000|46000.0|17126.976771553505|horse|        null|
| rohan|  5|   orc|70000|46000.0|17126.976771553505|horse|        null|
| rohan|  4|   orc|70000|46000.0|17126.976771553505|horse|        null|
| rohan|  1|   elf|40000|46000.0|17126.976771553505|horse|        null|
|gondor| 18|   orc|70000|40000.0|11547.005383792515| null|      

In [80]:
with_nulls.dropna().show()

+----+---+----+---+------+------+-----+------------+
|land| id|race| hp|avg_hp|std_hp|mount|burned_alive|
+----+---+----+---+------+------+-----+------------+
+----+---+----+---+------+------+-----+------------+



## SQL querying

We need to register our DataFrame as a table in the SQL context in order to be able to query against it.

In [None]:
spark.createDataFrame(pd_df)

Once registered, we can perform queries as complex as we want.

In [None]:
filled.createOrReplaceTempView('filled')
spark.sql('SELECT * FROM filled').show()

## Interoperation with Pandas

Easy peasy. We can convert a Spark DataFrame into a Pandas one with `toPandas`, which will `collect` it, and vice versa with `createDataFrame`, which will distribute it.

## Writing out


In [None]:
filled.write.csv('filled.csv')

In [None]:
filled.write.json
filled.write.jdbc

#### Exercise

Repeat the exercise from the previous notebook, but this time with DataFrames.

Get stats for all tickets with destination MAD from `coupon150720.csv.gz`.

You will need to extract ticket amounts with destination MAD, and then calculate:

1. Total ticket amounts per origin
2. Top 10 airlines by average amount

In [81]:
coupons = spark.sql('''SELECT
                      CAST (_C0 AS BIGINT) as tkt_number,
                      CAST (_C1 AS INT) as cpn_number,
                      _c0 as tkt_number,
                      _c1 as cpn_number,
                      _c2 as origin,
                      _c3 as dest,
                      _c4 as carrier,
                      cast(_c6 AS FLOAT) as amount
                   FROM csv.`coupon150720.csv.gz`''')

coupons.show()

+--------------+----------+--------------+----------+------+----+-------+------+
|    tkt_number|cpn_number|    tkt_number|cpn_number|origin|dest|carrier|amount|
+--------------+----------+--------------+----------+------+----+-------+------+
|79062005698500|         1|79062005698500|         1|   MAA| AUH|     9W| 56.79|
|79062005698500|         2|79062005698500|         2|   AUH| CDG|     9W| 84.34|
|79062005924069|         1|79062005924069|         1|   CJB| MAA|     9W|  60.0|
|79065668570385|         1|79065668570385|         1|   DEL| DXB|     9W|160.63|
|79065668737021|         1|79065668737021|         1|   AUH| IXE|     9W|152.46|
|79062006192650|         1|79062006192650|         1|   RPR| BOM|     9W|  68.5|
|79062006192650|         2|79062006192650|         2|   BOM| RPR|     9W|  68.5|
|79062005733853|         1|79062005733853|         1|   DEL| DED|     9W| 56.16|
|79062005836987|         1|79062005836987|         1|   ATL| LGA|     AA|  28.3|
|79062005836987|         2|7

1) Extract the fields you need (c0,c1,c2,c3,c4 and c6) into a dataframe with proper names and types

Remember, you want to calculate:

Total ticket amounts per origin

Top 10 airlines by average amount

In [84]:
mad_tickets = coupons.filter(coupons['dest'] == 'MAD')

2) Total ticket amounts per origin

In [87]:
mad_tickets.groupBy('origin').sum('amount').show(10)

+------+------------------+
|origin|       sum(amount)|
+------+------------------+
|   PMI| 40547.17005729675|
|   YUL|284.44000244140625|
|   HEL| 8195.760055541992|
|   SXB| 264.4599914550781|
|   UIO| 8547.599964141846|
|   XRY| 9250.229990959167|
|   OLB|1801.4999809265137|
|   CCS| 94528.67986679077|
|   VRN|1020.5400009155273|
|   SPC| 7542.699995517731|
+------+------------------+
only showing top 10 rows



3) Top 10 Airlines by average amount



In [90]:
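# Note: grouping `coupons` ranks carriers over all destinations;
# group `mad_tickets` instead to restrict the ranking to destination MAD.
coupons.groupby('carrier').mean('amount').sort('avg(amount)', ascending = False).show(10)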

+-------+------------------+
|carrier|       avg(amount)|
+-------+------------------+
|     S3| 5225.068852659132|
|     9V|1488.6739863189491|
|     GA| 991.2527673277024|
|     TN| 964.6472629731701|
|     7F| 668.0350036621094|
|     DT| 579.0006969363191|
|     B8| 443.2950134277344|
|     V0|438.61227978411614|
|     NE| 430.1000061035156|
|     4M| 377.7557483212701|
+-------+------------------+
only showing top 10 rows



## Further Reading

https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html

https://www.datacamp.com/community/tutorials/apache-spark-python

https://spark.apache.org/docs/2.2.0/sql-programming-guide.html

https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/

https://stackoverflow.com/questions/36822224/what-are-the-pros-and-cons-of-parquet-format-compared-to-other-formats

https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PySpark_SQL_Cheat_Sheet_Python.pdf