In [1]:
%%HTML
<link rel="stylesheet" type="text/css" href="https://fonts.googleapis.com/css?family=Quicksand:300,700" />
<link rel="stylesheet" type="text/css" href="https://fonts.googleapis.com/css?family=Fira Code" />
<link rel="stylesheet" type="text/css" href="rise.css">

In [None]:
data_dir = '../data'
master = 'local[2]'

import os
import pyspark
import pyspark.sql.functions as sf
import pyspark.sql.types as st

spark = (
    pyspark.sql.SparkSession.builder
    .master(master) 
    .getOrCreate()
)
spark

# DataFrames Advanced
1. More DataFrame operations (including joins)
1. Columns
1. Functions
1. UDFs & UDAFs

![footer_logo_new](images/logo_new.png)

## 1. More DataFrame operations

So far we have covered only some of the DataFrame operations.

There are many more, such as:

- `drop()`
- `select()`
- `join()`
- `limit()`
- `distinct()`
- `drop_duplicates()`
- ...

Create a simple DataFrame

In [None]:
sdf1 = spark.createDataFrame([[1, 'a'], [2, 'b'], [2, 'c'], [3, 'd']],
                             schema=['number', 'letter'])

### `select()`: Select columns

In [None]:
sdf1.select('number', 'letter')

### `drop()`: Drop columns


In [None]:
sdf1.drop('letter')

### `join()`: Join two DataFrames

In [None]:
sdf2 = spark.createDataFrame([[2], [3], [4]], schema=['number'])
sdf2.toPandas()

In [None]:
sdf1.toPandas()

Inner join on a column present in both DataFrames:

In [None]:
sdf1.join(sdf2, on='number', how='inner').toPandas()

Left join on an inequality condition:

In [None]:
(
    sdf1.join(sdf2, on=(sdf1.number >= sdf2.number), 
              how='left')
    .toPandas()
)

Note that here we end up with a `number` column from both DataFrames.

To drop the `number` column that came from `sdf1`:

In [None]:
(
    sdf1.join(sdf2, on=(sdf1.number >= sdf2.number), 
              how='left')
    .drop(sdf1.number)
    .toPandas()
)
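To make the result of the inequality left join easier to reason about, here is a minimal plain-Python model of it. This is a sketch, not Spark code: `left_join` is a hypothetical helper written for illustration, and the row order Spark returns is not guaranteed to match.

```python
# Plain-Python model of the left join above (illustration only, not Spark).
sdf1_rows = [(1, 'a'), (2, 'b'), (2, 'c'), (3, 'd')]  # (number, letter)
sdf2_rows = [2, 3, 4]                                  # number


def left_join(left, right, condition):
    """Keep every left row; pair it with each matching right row, or None."""
    result = []
    for l_number, letter in left:
        matches = [r for r in right if condition(l_number, r)]
        if not matches:
            # Unmatched left rows survive, with a null on the right side.
            matches = [None]
        for r_number in matches:
            result.append((l_number, letter, r_number))
    return result


# Same condition as in the notebook: sdf1.number >= sdf2.number
joined = left_join(sdf1_rows, sdf2_rows, lambda l, r: l >= r)
for row in joined:
    print(row)
```

The row `(1, 'a')` has no partner in `sdf2` and is kept with a null, while `(3, 'd')` matches both 2 and 3 and therefore appears twice.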

### Exercise

Spark supports quite a few join types:
+ inner
+ left_outer
+ left_anti
+ right_outer
+ full_outer
+ left_semi

Try them out on the datasets shown, and formulate what each of them does.

### `limit()`: Limit the number of rows

Crucial when you're doing a `.toPandas()`.

Your big Spark DataFrame often won't fit in the RAM of your Gateway, so you don't want all rows to be converted.

Alternatively, use `sample()`.

In [None]:
sdf1.limit(2).toPandas()

In [None]:
sdf1.toPandas()

### Sidebar: `limit()` without `orderBy()` gives arbitrary results

`limit()` chooses the first $n$ rows in the DataFrame; if you haven't explicitly specified the order of the DataFrame, what's it going to be?

Undefined behaviour is something your colleagues will reject during code review.

Something to think about: can the combination of `orderBy()` and `limit()` be optimized?
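One way to approach that question is to compare "sort everything, then take $n$" with "keep only the best $n$ while scanning". The sketch below does this in plain Python with `heapq`; it illustrates the general idea only and makes no claim about how Spark's planner actually handles it.

```python
import heapq
import random

# Hypothetical data standing in for a column we want to order by.
rng = random.Random(42)
values = [rng.randint(0, 1_000_000) for _ in range(10_000)]
n = 5

# Naive approach: sort all rows, then keep the first n.
full_sort_then_limit = sorted(values)[:n]

# Top-n approach: a single pass that never holds more than n candidates.
top_n_only = heapq.nsmallest(n, values)

print(full_sort_then_limit == top_n_only)
```

Both approaches return the same rows, but the second never needs a full sort of the data.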

### `limit()` relative: `sample()`

If you want a more representative sample of your data: `sample(withReplacement, fraction, seed=None)`

Note that you need to come up with a good value for `fraction`, as you can't directly specify the number of rows you want. You can however combine `sample()` with `limit()`.
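The reason `fraction` does not give an exact row count can be sketched in plain Python. The model below assumes sampling without replacement keeps each row independently with probability `fraction`; `bernoulli_sample` is a hypothetical helper, not a Spark function.

```python
import random


def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


rows = list(range(1000))

# Roughly 10% of the rows, but the exact count depends on the seed.
sampled = bernoulli_sample(rows, fraction=0.1, seed=1)

# Capping the sample afterwards gives a hard upper bound on the row count.
max_rows = 50
limited = sampled[:max_rows]

print(len(sampled), len(limited))
```

Sampling first and limiting afterwards gives a spread of rows from across the data with a guaranteed maximum size, which is what you want before a `.toPandas()`.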

### `distinct()`: Find distinct values

In [None]:
(
    sdf1
    .select('number')
    .distinct()
    .toPandas()
)

### `drop_duplicates()`: Drop duplicate entries

In [None]:
sdf1.toPandas()

In [None]:
sdf3 = (
    sdf1
    .drop_duplicates(subset=['number'])
)

sdf3.show()

sdf1.show()
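The difference between de-duplicating on whole rows and on a subset of columns can be modelled in plain Python. This is only a sketch: the hypothetical `drop_duplicates` helper below keeps the first row it sees per key, whereas Spark does not promise which of the duplicate rows survives.

```python
rows = [(1, 'a'), (2, 'b'), (2, 'c'), (3, 'd')]  # (number, letter)


def drop_duplicates(rows, key):
    """Keep one row per distinct key (here: the first one encountered)."""
    seen = set()
    kept = []
    for row in rows:
        k = key(row)
        if k not in seen:
            seen.add(k)
            kept.append(row)
    return kept


# Comparable to drop_duplicates(subset=['number']): one row per number.
by_number = drop_duplicates(rows, key=lambda row: row[0])

# Comparable to distinct(): only fully identical rows are collapsed.
by_whole_row = drop_duplicates(rows, key=lambda row: row)

print(by_number)
print(by_whole_row)
```

Because `(2, 'b')` and `(2, 'c')` differ in `letter`, they only collapse when the comparison is restricted to `number`.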

## 2. Columns

`sf.col()` is an important tool. Use it to build new columns from mathematical operations, to filter rows, and more.

Let's get some new data.

In [None]:
persons = spark.createDataFrame(
    [[float('nan'), 'John'],
     [None, 'Michael'],
     [30., 'Andy'],
     [19., 'Justin'],
     [30., 'James Dr No From Russia with Love Bond']], 
    schema=['age', 'name']
)

### Filtering

In [None]:
persons.filter(sf.col('name') == 'Andy').toPandas()

In [None]:
persons.filter(sf.col('name') != 'Andy').toPandas()

### `.isin()`: Searching in a list

Two ways of checking whether someone is Andy or Justin:

In [None]:
(
    persons
    .withColumn('is_andy_or_justin', ((sf.col('name') == 'Andy') |
                                      (sf.col('name') == 'Justin')))
    .withColumn('is_andy_or_justin2', sf.col('name').isin('Andy', 'Justin'))
    .toPandas()
)

Use `sf.col('name').isin()` when there are many alternatives.

In [None]:
teen_ages = list(range(10, 20))
teens = persons.withColumn('is_teen', sf.col('age').isin(teen_ages))
teens.toPandas()

### `~`: Negation

In [None]:
(
    teens.withColumn('aint_no_teen', ~sf.col('is_teen'))
    .toPandas()
)
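The `persons` data contains a missing age and a NaN age, so it is worth checking what `is_teen` and its negation look like for those rows. The plain-Python sketch below models SQL-style null handling, under the assumption that a null input yields a null result; `sql_isin` and `sql_not` are hypothetical helpers for illustration, not Spark functions.

```python
def sql_isin(value, candidates):
    """IN with SQL-style null handling: a null input gives a null result."""
    return None if value is None else value in candidates


def sql_not(value):
    """Three-valued NOT: unknown (None) stays unknown."""
    return None if value is None else not value


teen_ages = list(range(10, 20))
ages = [float('nan'), None, 30.0, 19.0, 30.0]  # same ages as in `persons`

is_teen = [sql_isin(age, teen_ages) for age in ages]
aint_no_teen = [sql_not(flag) for flag in is_teen]

for age, teen, not_teen in zip(ages, is_teen, aint_no_teen):
    print(age, teen, not_teen)
```

In this model the NaN age is simply "not in the list", while the missing age stays unknown in both columns, so negating a flag does not turn a missing value into `True`.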

### `.isNull()` and `.isNotNull()`: Finding missing values

In [None]:
(
    teens
    .withColumn('missing_age', sf.col('age').isNull())
    .withColumn('not_missing_age', sf.col('age').isNotNull())
    .toPandas()
)

### `.startswith()` and `.contains()`: String operations 

In [None]:
(
    teens
    .withColumn('starts_with_J', sf.col('name').startswith('J'))
    .withColumn('has_an_a', sf.col('name').contains('a'))
    .toPandas()
)

## 3. Functions

Much of the functionality lives in the module `pyspark.sql.functions` or is available as methods on column objects such as `sf.col('name')`.

* Needed for basic operations.
* Lots of functions (too many to cover here).
* Read the [API docs](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html).

In [None]:
from pyspark.sql import functions as sf

### `sf.when().otherwise()`: case statements

In [None]:
(
    persons
    .withColumn('whos_this', 
                sf.when(sf.col('name') == 'Andy', 'Yup, Andy')
                  .when(sf.col('name') == 'Justin', 'Justin here')
                  .otherwise('No idea'))
    .toPandas()
)

When adding a boolean column, don't wrap the condition in `sf.when`; use the comparison itself:

In [None]:
(
    persons
    .withColumn('is_andy_good', sf.col('name') == 'Andy') # YES
    .withColumn('is_andy_bad',
                sf.when(sf.col('name') == 'Andy', True)
                  .otherwise(False)) # NO
    .toPandas()
)

### `sf.lit()`: Add a constant column

In [None]:
persons.withColumn('value', sf.lit(5)).toPandas()

### `sf.isnull()`, `sf.isnan()`: Find missing values or NaNs

- Missing is not the same as Not a Number (NaN)
- (Columns also have the methods `.isNull()` and `.isNotNull()`)

In [None]:
(
    persons
    .withColumn('is_missing', sf.isnull('age'))
    .withColumn('is_nan', sf.isnan('age'))
    .show()
)

Note that in `pandas` the `age` column no longer shows the difference between NaN and missing:

In [None]:
(
    persons
    .withColumn('is_missing', sf.isnull('age'))
    .withColumn('is_nan', sf.isnan('age'))
    .toPandas()
)
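The distinction itself can be illustrated in plain Python, where a missing value is `None` and Not a Number is a special float. This is a sketch of the concept only; `is_missing` and `is_nan` are hypothetical stand-ins for `sf.isnull()` and `sf.isnan()`.

```python
import math

# Same ages as in `persons`: one NaN, one missing, three real values.
ages = [float('nan'), None, 30.0, 19.0, 30.0]


def is_missing(value):
    """True only when there is no value at all."""
    return value is None


def is_nan(value):
    """True only for a float that is Not a Number; missing is not NaN."""
    return value is not None and math.isnan(value)


for age in ages:
    print(age, is_missing(age), is_nan(age))
```

Exactly one row is missing and exactly one row is NaN, and they are different rows.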

### Exercise
Recall our heroes dataset:
1. Sometimes our heroes like to fight in pairs. Find the most powerful pair, based on their cumulative HP. We place a few restrictions:
    - A pair must contain different roles
    - A pair A-B is the same as a pair B-A. Each pair should only appear once

In [None]:
heroes_path = os.path.join(data_dir, 'heroes.csv')
heroes = (
    spark.read.csv(heroes_path, header=True, inferSchema=True, nanValue='NA')
    .filter(~sf.isnan('attack'))
)

In [None]:
%load ../answers/02_heroes_pairs.py

## 4. UDFs & UDAFs

User-defined functions (UDFs) let you write Python code that is executed for every row. User-defined aggregate functions (UDAFs) let you write Python code that computes aggregates over all (or multiple) rows in a DataFrame.

Since Spark 2.3 this machinery can use Apache Arrow, an in-memory columnar format for analytics, to move data between Spark and Python.

In [None]:
%%timeit
from pyspark.sql.functions import pandas_udf, PandasUDFType

# 'double' is the return type; SCALAR means we receive a pandas Series and return a Series of the same length
# alternatives include GROUPED_MAP, which returns a DataFrame per group
@pandas_udf('double', PandasUDFType.SCALAR)  
def plus_one(v):
    return v + 1
    
(
    spark.range(0, 10 * 1000 * 1000)
    .withColumn('plus_one', plus_one(sf.col('id')))
    .select(sf.sum(sf.col('plus_one')))
    .collect()
)

There is a slower alternative, the plain row-at-a-time UDF (which you shouldn't use unless the pandas UDF gets crashy!):

In [None]:
%%timeit
# Row-at-a-time UDF: the lambda is called once per row
plus_one_udf = sf.udf(lambda v: v + 1, st.IntegerType())

(
    spark.range(0, 10 * 1000 * 1000)
    .withColumn('plus_one', plus_one_udf(sf.col('id')))
    .select(sf.sum(sf.col('plus_one')))
    .collect()
)

In [None]:
%%timeit
def my_f(el):
    return el + 1

@sf.udf(st.IntegerType())
def my_g(el):
    return my_f(el) - 1

(
    spark.range(0, 10 * 1000 * 1000)
    .withColumn('neutral', my_g(sf.col('id')))
    .select(sf.sum(sf.col('neutral')))
    .collect()
)

For this example there is also a Spark-native way:

In [None]:
%%timeit
(
    spark.range(0, 10 * 1000 * 1000)
    .withColumn('plus_one', sf.col('id') + 1)
    .select(sf.sum(sf.col('plus_one')))
    .collect()
)

It's fastest by a long way. Anyone want to guess why?

Since Spark 2.3 it is also possible to apply user-defined pandas functions to the groups resulting from `.groupby()`. The resulting DataFrame can be of arbitrary length. The function type of the `@pandas_udf` should be `PandasUDFType.GROUPED_MAP`.

This allows you to define your own functions over groups (or the entire DataFrame):

```python
sdf.groupby(column).apply(pandas_apply_udf)
```

For an aggregate that returns a single value per group, use `PandasUDFType.GROUPED_AGG` (added in Spark 2.4) together with `.agg()`, as in the cells that follow.

Note! Be careful with your `pyarrow` versions. Spark may not always work with the latest version. If in doubt, check the versions in use by Databricks [here](https://docs.databricks.com/release-notes/runtime).

In [None]:
from pyspark.sql.functions import pandas_udf, PandasUDFType
df = spark.createDataFrame(
    [(1, 1.0),
     (1, 2.0), 
     (2, 3.0),
     (2, 5.0),
     (2, 10.0)],
    ("id", "v"))

@pandas_udf("double", PandasUDFType.GROUPED_AGG)
def my_funky_aggregate(v):
    return v.mean() - v.sum()

In [None]:
df.agg(my_funky_aggregate(df['v'])).show()

In [None]:
df.groupby(df['id']).agg(my_funky_aggregate(df['v'])).show()
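To check the output of `my_funky_aggregate` by hand, the same arithmetic can be worked through in plain Python. This is a sketch for verification only: `funky` is a hypothetical stand-in that mirrors `v.mean() - v.sum()` on ordinary lists.

```python
from collections import defaultdict

# Same rows as in `df`: (id, v)
rows = [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)]


def funky(values):
    """Mean minus sum, mirroring v.mean() - v.sum()."""
    return sum(values) / len(values) - sum(values)


# Aggregate over the whole DataFrame: mean 4.2, sum 21.0
overall = funky([v for _, v in rows])

# Aggregate per group, as with groupby('id')
groups = defaultdict(list)
for key, v in rows:
    groups[key].append(v)
per_group = {key: funky(values) for key, values in groups.items()}

print(overall)    # roughly -16.8
print(per_group)  # id 1: 1.5 - 3.0, id 2: 6.0 - 18.0
```

For group 1 this gives 1.5 - 3.0 = -1.5 and for group 2 it gives 6.0 - 18.0 = -12.0, which is what the grouped cell should show.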

# Summary
In this chapter we looked at:
+ Some useful functions to manipulate Spark DataFrames
+ Joins across DataFrames
+ The different types of UDFs and UDAFs and how to define them

## Exercises (part 1)

1. Explore the `airlines` DataFrame, and count how many NaNs you have in each column.
1. Fill the NaNs with something that makes sense for each column.
1. Capture the state in the `airport_name` column (e.g. 'NY' in 'New York, NY: John F. Kennedy International') by using `sf.split()` and `sf.col('name').getItem()` twice.
1. Do the same, but now with `sf.regexp_extract`.
1. Do the same, but now with a `pandas_udf`.
1. Make a new DataFrame `airport_states` with columns `airport` and `state`.
1. Remove duplicates from `airport_states` (hint: look up `drop_duplicates()` in the docs).
1. Join `airport_states` onto the original `airlines`.


In [None]:
airlines_path = os.path.join(data_dir, 'airlines.parquet')
airlines = spark.read.parquet(airlines_path)

## Exercises (part 2)
9. Add a column `weather_condition` that is: 
```
'rainy' if the `weather_delay` is greater than 1200
'stormy' if in addition to this the arrival is diverted by more than 15 minutes
'bright' otherwise
```
10. Split the DataFrame into a train and a test set, sorted by the time columns.

In [None]:
%load ../answers/02_airlines.py