In [None]:
data_dir = '../data'
master = 'local[2]'

import os
import pyspark
import pyspark.sql.functions as sf

spark = (
    pyspark.sql.SparkSession.builder
    .master(master) 
    .getOrCreate()
)
spark

# DataFrame Basics

## Overview

1. DataFrames
1. Basic operations
1. Getting data out

## 1. DataFrames

The core concept of Spark is the __DataFrame__.

DataFrames are tabular and consist of named columns, each containing data of a certain type.

DataFrames are similar to their counterparts in R or Python (`pandas`).
Alternatively, you can see them as a sheet in Excel or a table in a database.

Manually create a DataFrame:

In [None]:
sdf = spark.createDataFrame([[None, 'Michael'],
                             [30, 'Andy'],
                             [19, 'Justin'],
                             [30, 'James Dr No From Russia with Love Bond']], 
                            schema=['age', 'name'])

(Generally we would read data from a source.)

### Methods and properties

DataFrames have methods and properties:

In [None]:
sdf.columns

In [None]:
sdf.dtypes

In [None]:
sdf.printSchema()

In [None]:
sdf.show()

### Actions

Transformations are lazy; processing is only triggered when an action is invoked.

Common actions are:

 - `sdf.count()`: Count the number of rows.
 - `sdf.toPandas()`: Convert to a `pandas` DataFrame.
 - `sdf.show()`: Print some rows to console.
 - `sdf.collect()`: Convert to Python objects.
 - Any write action.

In [None]:
sdf.count()

In [None]:
df = sdf.toPandas()
df

### Datasets vs DataFrames

If you're using Scala, you get the added bonus of strong typing. If you look around, you'll notice something: a DataFrame is actually a Dataset! In Scala, `DataFrame` is simply an alias for `Dataset[Row]`.

Datasets allow you to work with your own objects instead of the generic `Row` that we have to use in PySpark:

```scala
// Straight from a Scala sequence to a Dataset.
// (Needs `import spark.implicits._`, which spark-shell does for you.)
case class Employee(name: String, age: Long)
val caseClassDS = Seq(Employee("Amy", 32)).toDS
caseClassDS.show()
```

```scala
// Convert an untyped DataFrame into a Dataset.
case class Movie(actor_name: String,
                 movie_title: String,
                 produced_year: Long)
val movies = Seq(
    ("Damon, Matt", "The Bourne Ultimatum", 2007L),
    ("Damon, Matt", "Good Will Hunting", 1997L)
)
val moviesDS = movies.toDF("actor_name",
                           "movie_title",
                           "produced_year").as[Movie]
```

## 2. Basic operations

In [None]:
chicago_path = os.path.join(data_dir, 'chicagoCensus.csv')
chicago = spark.read.csv(chicago_path, header=True)
chicago.show(vertical=True)

In [None]:
import pyspark.sql.functions as sf

chicago \
    .filter(sf.col('HARDSHIP INDEX').isNotNull()) \
    .withColumn('high_hardship', sf.col('HARDSHIP INDEX') > 20) \
    .groupby('high_hardship') \
    .agg(sf.count('*').alias('n')) \
    .sort('n', 'high_hardship', ascending=False) \
    .show()

In SQL (depending on your dialect):
```sql
SELECT
    `HARDSHIP INDEX` > 20 AS high_hardship,
    count(*) AS n
FROM
    chicago
WHERE
    `HARDSHIP INDEX` IS NOT NULL
GROUP BY
    `HARDSHIP INDEX` > 20
ORDER BY n DESC
```

In [None]:
chicago.createOrReplaceTempView("chicagoTable")
chicagoSql = spark.sql("""
SELECT
    `HARDSHIP INDEX` > 20 AS high_hardship,
    count(*) AS n
FROM
    chicagoTable
WHERE
    `HARDSHIP INDEX` IS NOT NULL
GROUP BY
    `HARDSHIP INDEX` > 20
ORDER BY
    n DESC""")
chicagoSql.show()

### (Py)Spark DataFrame API vs Spark SQL API

 - In many respects, the DataFrame API and Spark SQL are equivalent:
 
   - DataFrames represent tabular data and transformations.
   - SQL expresses queries on tabular data.
   
 - Sometimes you will prefer one over the other*.
 - Testing DataFrames tends to be easier than SQL.
 - In recent versions of Spark there are SQL queries that cannot be
   implemented via the DataFrame API.
   
\* Check out https://gdd.li/spark-df-api for more on this topic.

What SQL can't be done via DataFrame manipulation? There are some correlated subqueries that are only supported via SQL: I haven't been able to find a way of implementing them via DataFrames.

### `filter()`: Filter rows

```python
.filter(sf.col('HARDSHIP INDEX').isNotNull())
```

Filtering rows is done with a boolean expression; only rows for which it evaluates to `True` remain.

Use `sf.col('column')` to refer to an existing column named `'column'`.

Combine multiple conditions with `&` (and), `|` (or) and `~` (not) like this:

In [None]:
(
    chicago
    .filter(sf.col('HARDSHIP INDEX').isNotNull() & 
            (sf.col('PER CAPITA INCOME ') > 30000) & 
            ((sf.col('Community Area Number') > 5) | ~(sf.col('HARDSHIP INDEX') > 10)))
    .limit(2)
    .toPandas()
)

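Can anyone guess why all those brackets are there in the filter?

Answer: Operator precedence. In Python, `&`, `|` and `~` bind more tightly than comparison operators such as `>`, so each comparison needs its own parentheses.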

### `withColumn()`: Adding a column

```python
sdf.withColumn('high_hardship', sf.col('HARDSHIP INDEX') > 20)
```

Add a column with `sdf.withColumn('name', expression)`.

The `withColumn` function can also be used to replace an existing column. DataFrames are immutable, so this returns a new DataFrame rather than changing `sdf` in place:
   
```python
sdf.withColumn("high_hardship", ~sf.col("high_hardship"))
```


### `groupby().agg()`: Aggregate statistics    
 
```python
.groupby('high_hardship')
.agg(sf.count('*').alias('n'))
```

* Group by one or multiple columns.
* Aggregate with one of the aggregate functions found in `sf`.
* Give columns a readable name by using `.alias('name')`.

### `sort()`: Sort the DataFrame

```python
.sort('n', ascending=False)
```

* Sort the DataFrame by one or multiple columns.
* Pass a list of booleans to `ascending` (one per column) when sorting multiple columns in different directions.

## Intermezzo on style

All operations return a modified DataFrame.

In our example, we chained all the operations one below the other, but you could also write the query as:

In [None]:
filtered = chicago.filter(sf.col('HARDSHIP INDEX').isNotNull())
with_hardship = filtered.withColumn('high_hardship', sf.col('HARDSHIP INDEX') > 20)
n_per_hardship = with_hardship.groupby('high_hardship').agg(sf.count('*').alias('n'))
sorted_n = n_per_hardship.sort('n', ascending=False)
sorted_n.show()

Why didn't we write our query like that?

 - Our first query is more readable.
 - We don't have to come up with good names for temporary variables.

### Wide vs long chains of operations

We can also chain operations on a single line like this:

In [None]:
chicago.withColumn('high_hardship', sf.col('HARDSHIP INDEX') > 20).filter(sf.col('HARDSHIP INDEX').isNotNull()).groupby('high_hardship').agg(sf.count('*').alias('n')).show()

But that isn't really readable.

Irrespective of whether you're chaining long or wide, don't make the chains too long!

While you're writing them they can seem obvious, but if they are too long your colleagues won't be able to follow the evolving schema.

And they will reject your code during review.

## 3. Getting data out

Similar to `spark.read`, most methods are under `sdf.write`:

- `sdf.write.csv()`: CSV.
- `sdf.write.json()`: JSON.
- `sdf.write.parquet()`: Parquet.
- `sdf.write.saveAsTable()`: Hive table.


Save the DataFrame `sdf` as the Hive table `my_table`:

In [None]:
# sdf.write.saveAsTable('my_table', mode='overwrite')
spark.table('my_table').show()

###  `write` and locality

Depending on how you run Spark, it will write files to different places.

This is similar to `spark.read`.

### `write` and tasks

Spark runs many tasks for a single stage (more about that later). They will run in parallel (depending on the number of cores per executor).

Each task will output a part of the file when done.

<img src="images/partitioning.png" width="70%" align="left"/>

If you want to avoid this, use `repartition()` or `coalesce()` before writing:

- `.repartition(numPartitions, *cols)`: Scales the number of partitions up or down, optionally distributing the data so that rows with the same values in the specified columns end up in the same partition.
- `.coalesce(numPartitions)`: Scales the number of partitions down.



In [None]:
# (
#     sdf
#     .repartition('name')
#     .write.saveAsTable('my_second_table')
# )
sdf.show()

Minimizing data movement is important. If you have the following situation:

- Node1: Partition 1A, 1B, 1C
- Node2: Partition 2A, 2B, 2C
- Node3: Partition 3A, 3B, 3C

Invoking `.coalesce(3)` will produce:

- Node1: Partition 1
- Node2: Partition 2
- Node3: Partition 3

Invoking `.coalesce(2)` will produce:

- Node1: Partition (1 + 3A)
- Node2: Partition (2 + 3(B+C))

The parentheses indicate a single partition.

### `coalesce` vs `repartition`
We can use `coalesce` and `repartition` to perform (essentially) the same task. I can run both of these commands with the same effect:

In [None]:
sdf.rdd.getNumPartitions()

In [None]:
sdf.repartition(1).rdd.getNumPartitions()

In [None]:
sdf.coalesce(1).rdd.getNumPartitions()

However, there is a difference.

Can anyone guess what it is?

`repartition` will __always__ incur a *full* shuffle of the data, regardless of whether one is necessary.

`coalesce` will try to combine existing partitions in order to get to the desired number of partitions, which avoids a full shuffle.

### Rule of thumb: `coalesce` vs `repartition`

Typically, only use `repartition` when the future number of partitions is greater than the current number, or if you are looking to partition by a (set of) column(s). Otherwise, use `coalesce`.

## Exercises

1. Load the Heroes of the Storm dataset with `spark.read.csv()`.
   Make sure you parse the header in the first row.
1. Check the dtypes: what do you notice?
   How can you let Spark infer the schema?
1. Explore the data: some _NaN_ values are not encoded properly.
   Tell `spark.read.csv` how _NaN_ is encoded.
1. Filter out the hero with the _NaN_ value.
1. Which hero has the most hp?
1. Add a column with the `attack_momentum`, computed as: ${attack} * {attack\_spd}$.
1. Which role on average has the highest attack?
1. Figure out which roles and types of attack frequently occur together.
1. Deliver a dataframe with names of the heroes with the highest attack in their role.

In [None]:
heroes_path = os.path.join(data_dir, 'heroes.csv')
heroes = spark.read.csv(heroes_path, header=True, inferSchema=True, nanValue='NA').filter(~sf.isnan('attack'))

In [None]:
heroes.sort('attack', ascending=False).groupBy('role').agg(sf.first('name')).show()

#### Bonus
1. Make a function that accepts a DataFrame and a list of column names.
Let it return the mean and standard deviation of the columns.
1. Apply the function to the hp and attack column such that the result has columns:
    - `hp_mean`
    - `hp_stddev`
    - `attack_mean`
    - `attack_stddev`

In [None]:
%load "../answers/02_heroes.py"

# Summary

In this chapter, we looked at:
+ The basics of creating, filtering and working with DataFrames.
+ How to write DataFrames to disk.