# 1 Creating a DataFrame

First, let's create a DataFrame from Python objects. This is probably not the most common approach, but it is easy and helpful in situations where the data already exists as Python objects.

In [1]:
df = spark.createDataFrame([('Alice', 13), ('Bob', 12)], ['name', 'age'])
print(df.collect())

[Row(name='Alice', age=13), Row(name='Bob', age=12)]


## 1.1 Inspect Schema

The `spark` object has different methods for creating a so-called Spark DataFrame object. This object is similar to a table: it contains rows of records, which all conform to a common schema with named columns and specific types. On the surface it heavily borrows concepts from Pandas DataFrames or R DataFrames, although the syntax and many operations are very different.

As the first step, we want to see the schema of the DataFrame. This can be easily done by using the `printSchema` method.

In [2]:
df.printSchema()

root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)



# 2 Reading Data

Of course, manually creating DataFrames from a couple of records is not the real use case. Instead, we want to read data from files. Spark supports various file formats; we will use JSON in the following example.

The entry point for creating Spark objects is an object called `spark`, which is provided in the notebook and ready to use. We will read a file containing some information on a couple of persons, which will serve as the basis for the next examples.

In [3]:
persons = spark.read.json("s3a://dimajix-training/data/persons.json")
persons

DataFrame[age: bigint, height: bigint, name: string, sex: string]

In [4]:
persons.collect()

[Row(age=14, height=156, name='Alice', sex='female'),
 Row(age=21, height=181, name='Bob', sex='male'),
 Row(age=27, height=176, name='Charlie', sex='male'),
 Row(age=24, height=167, name='Eve', sex='female'),
 Row(age=19, height=172, name='Frances', sex='female'),
 Row(age=31, height=191, name='George', sex='male')]

## 2.1 Inspecting a DataFrame

Spark supports various methods for inspecting both the contents and the schema of a DataFrame.

In [5]:
persons.show()

+---+------+-------+------+
|age|height|   name|   sex|
+---+------+-------+------+
| 14|   156|  Alice|female|
| 21|   181|    Bob|  male|
| 27|   176|Charlie|  male|
| 24|   167|    Eve|female|
| 19|   172|Frances|female|
| 31|   191| George|  male|
+---+------+-------+------+



In [6]:
persons.printSchema()

root
 |-- age: long (nullable = true)
 |-- height: long (nullable = true)
 |-- name: string (nullable = true)
 |-- sex: string (nullable = true)



## Pandas Interoperability

Spark also supports interoperation with Pandas, the de facto standard Python library for working with tabular data.

In [7]:
persons.toPandas()

Unnamed: 0,age,height,name,sex
0,14,156,Alice,female
1,21,181,Bob,male
2,27,176,Charlie,male
3,24,167,Eve,female
4,19,172,Frances,female
5,31,191,George,male


## 2.2 Loading CSV Data

Of course Spark also supports reading CSV data. CSV files may optionally contain a header containing the column names.

In [8]:
persons = spark.read \
    .option("header","true") \
    .csv("s3a://dimajix-training/data/persons_header.csv")
persons.toPandas()

Unnamed: 0,age,height,name,sex
0,23,156,Alice,female
1,21,181,Bob,male
2,27,176,Charlie,male
3,24,167,Eve,female
4,19,172,Frances,female
5,31,191,George,female


# 3 Simple Transformations

## 3.1 Projections

The simplest thing to do is to create a new DataFrame with a subset of the available columns.

In [9]:
from pyspark.sql.functions import *

result = persons.select('name', col('age'))
result.toPandas()

Unnamed: 0,name,age
0,Alice,23
1,Bob,21
2,Charlie,27
3,Eve,24
4,Frances,19
5,George,31


## 3.2 Addressing Columns

Spark supports multiple ways of addressing a column. We just saw one way, but the following methods are also supported for specifying a column:

* `df.column_name`
* `df['column_name']`
* `col('column_name')`

All these methods return a `Column` object, which is an abstract representation of the data in the column. As we will see soon, transformations can be applied to a `Column` in order to derive new values.

### Beware of Lowercase and Uppercase

While PySpark itself is case insensitive concerning column names, Python itself is case sensitive. Since the first method of addressing columns, treating them as fields of a Python object, *is* Python syntax, it is also case sensitive!

In [10]:
from pyspark.sql.functions import *

result = persons.select('name', persons.age, col('height'), persons['sex'])
result.toPandas()

Unnamed: 0,name,age,height,sex
0,Alice,23,156,female
1,Bob,21,181,male
2,Charlie,27,176,male
3,Eve,24,167,female
4,Frances,19,172,female
5,George,31,191,female


## 3.3 Transformations 

The `select` method actually accepts any column object. A column object conceptually represents a column in a DataFrame. The column may either refer directly to an existing column of the input DataFrame, or it may represent the result of a calculation or transformation of one or multiple columns of the input DataFrame. For example if we simply want to transform the name into upper case, we can do so by using a function `upper` provided by PySpark.

In [11]:
result = persons.select(persons.name, upper(persons.name))
result.toPandas()

Unnamed: 0,name,upper(name)
0,Alice,ALICE
1,Bob,BOB
2,Charlie,CHARLIE
3,Eve,EVE
4,Frances,FRANCES
5,George,GEORGE


### Defining new Column Names
The resulting DataFrame again has a schema, but the column names do not look very nice. By using the `alias` method of a `Column` object, you can immediately rename the newly created column, just as you would in SQL with `SELECT complex_operation(...) AS nice_name FROM ...`. 

Technically, specifying a new name for the resulting column is not required (as we already saw above). If the name is not specified, PySpark will generate a name from the expression. But since this generated name tends to be rather long and contains the logic instead of the intention, it is highly recommended to always explicitly specify the name of the resulting column using `alias`.

In [12]:
# Result should be "Alice is 23 years old"
result = persons.select(
            concat(persons.name, lit(" is "), persons.age, lit(" years old")).alias("description")
        )
result.toPandas()

Unnamed: 0,description
0,Alice is 23 years old
1,Bob is 21 years old
2,Charlie is 27 years old
3,Eve is 24 years old
4,Frances is 19 years old
5,George is 31 years old


You can also perform simple mathematical calculations like addition, multiplication etc.

In [13]:
result = persons.select((persons.age * 2).alias("age2"))
result.toPandas()

Unnamed: 0,age2
0,46.0
1,42.0
2,54.0
3,48.0
4,38.0
5,62.0


### Common Functions

You can find the full list of available functions at [PySpark SQL Module](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions). Commonly used functions for example are as follows:

* [`concat(*cols)`](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.concat) - Concatenates multiple input columns together into a single column.
* [`substring(col,start,len)`](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.substring) - Substring starts at pos and is of length len when str is String type or returns the slice of byte array that starts at pos in byte and is of length len when str is Binary type.
* [`instr(col,substr)`](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.instr) - Locate the position of the first occurrence of substr column in the given string. Returns null if either of the arguments are null.
* [`locate(substr, str, pos)`](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.locate) - Locate the position of the first occurrence of substr in a string column, after position pos.
* [`length(col)`](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.length) - Computes the character length of string data or number of bytes of binary data. 
* [`upper(col)`](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.upper) - Converts a string column to upper case.
* [`lower(col)`](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.lower) - Converts a string column to lower case.
* [`coalesce(*cols)`](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.coalesce) - Returns the first column that is not null.
* [`isnull(col)`](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.isnull) - An expression that returns true iff the column is null.
* [`isnan(col)`](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.isnan) - An expression that returns true iff the column is NaN.
* [`hash(*cols)`](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.hash) - Calculates the hash code of given columns.

Spark also supports conditional expressions, like the SQL `CASE WHEN` construct
* [`when(condition, value)`](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.when) - Evaluates a list of conditions and returns one of multiple possible result expressions.

There are also some special functions often required
* [`col(str)`](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.col) - Returns a Column based on the given column name.
* [`lit(val)`](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.lit) - Creates a Column of literal value.
* [`expr(str)`](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.expr) - Parses the expression string into the column that it represents

### User Defined Functions
Unfortunately, you cannot directly use normal Python functions for transforming DataFrame columns. Although PySpark already provides many useful functions, these might not always be sufficient. Fortunately, you can *convert* a standard Python function into a PySpark function, thereby defining a so-called *user defined function* (UDF). Details will be explained in the training.

## 3.4 Adding Columns

A special variant of a `select` statement is the `withColumn` method. While the `select` statement requires all resulting columns to be passed as arguments, the `withColumn` method keeps all existing columns and adds a new one. This operation is quite useful since in many cases new columns are derived from the existing ones, while the old ones should still be contained in the result.

Let us have a look at a simple example, which adds a new column indicating whether the age is even or odd:

In [14]:
result = persons.withColumn("even_odd_age", 
                when(persons.age % 2 == 0, "even").otherwise("odd")
        )
result.toPandas()

Unnamed: 0,age,height,name,sex,even_odd_age
0,23,156,Alice,female,odd
1,21,181,Bob,male,odd
2,27,176,Charlie,male,odd
3,24,167,Eve,female,even
4,19,172,Frances,female,odd
5,31,191,George,female,odd


As you can see from the example above, `withColumn` always takes two arguments: The first one is the name of the new column (and it has to be a string), and the second argument is the expression containing the logic for calculating the actual contents.

## 3.5 Dropping a Column

PySpark also supports the opposite operation which simply removes some columns from a dataframe. This is useful if you need to remove some sensitive data before saving it to disk:

In [15]:
result = persons.drop("sex")
result.toPandas()

Unnamed: 0,age,height,name
0,23,156,Alice
1,21,181,Bob
2,27,176,Charlie
3,24,167,Eve
4,19,172,Frances
5,31,191,George


# 4 Filtering

*Filtering* denotes the process of keeping only those rows which meet a given filter criterion.

## 4.1 Simple `WHERE` clauses

PySpark supports two different approaches. The first approach specifies the filtering expression as a PySpark expression using columns:

In [16]:
result = persons.filter(persons.age > 22)
result.toPandas()

Unnamed: 0,age,height,name,sex
0,23,156,Alice,female
1,27,176,Charlie,male
2,24,167,Eve,female
3,31,191,George,female


In [17]:
result = persons.where((persons.age > 22) & (persons.height > 160))
result.toPandas()

Unnamed: 0,age,height,name,sex
0,27,176,Charlie,male
1,24,167,Eve,female
2,31,191,George,female


The second approach simply uses a string containing an SQL expression:

In [18]:
result = persons.filter("age > 22 AND height > 160")
result.toPandas()

Unnamed: 0,age,height,name,sex
0,27,176,Charlie,male
1,24,167,Eve,female
2,31,191,George,female


## 4.2 Limit Operations

When working with large datasets, it may be helpful to limit the number of records (like an SQL `LIMIT` operation).

In [19]:
result = persons.limit(3)
result.toPandas()

Unnamed: 0,age,height,name,sex
0,23,156,Alice,female
1,21,181,Bob,male
2,27,176,Charlie,male


# 5 Simple Aggregations

PySpark supports simple global aggregations, such as `COUNT`, `MAX` and `MIN`.

In [20]:
persons.count()

6

In [21]:
result = persons.select(
        max(persons.age).alias("max_age"), 
        avg(persons.height).alias("avg_height")
    )
result.toPandas()

Unnamed: 0,max_age,avg_height
0,31,173.833333


# 6 Grouping & Aggregating

An important class of operations is grouping and aggregation, which is equivalent to an SQL `SELECT aggregation GROUP BY grouping` statement. In PySpark, grouping and aggregation is always performed by first creating groups using `groupBy`, immediately followed by aggregation expressions inside an `agg` method. (Actually, there are also some predefined aggregations which can be used instead of `agg`, but they do not offer the flexibility which is required most of the time.)

Note that in the `agg` method you only need to specify the aggregation expressions; the grouping columns are added automatically by PySpark to the resulting DataFrame.

In [22]:
result = persons.groupBy(persons.sex).agg(
    avg(persons.age).alias("avg_age"),
    min(persons.height).alias("min_height"),
    max(persons.height).alias("max_height")
)
result.toPandas()

Unnamed: 0,sex,avg_age,min_height,max_height
0,female,24.25,156,191
1,male,24.0,176,181


## Aggregation Functions

PySpark supports many aggregation functions, which can be found in the [PySpark Function Documentation](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions). Aggregation functions are marked as such in the documentation, but unfortunately there is no simple overview. Common aggregation functions include, for example:

* count
* sum
* avg
* corr
* first
* last

# 7 Sorting DataFrames

You can sort the entries (= rows) of a DataFrame by an arbitrary column or expression.

In [23]:
result = persons.sort(persons.height)
result.toPandas()

Unnamed: 0,age,height,name,sex
0,23,156,Alice,female
1,24,167,Eve,female
2,19,172,Frances,female
3,27,176,Charlie,male
4,21,181,Bob,male
5,31,191,George,female


If nothing else is specified, PySpark will sort the records in increasing order of the sort columns. If you require descending order, this can be specified by manipulating the sort column with the `desc()` method as follows:

In [24]:
result = persons.orderBy(persons.height.desc())
result.toPandas()

Unnamed: 0,age,height,name,sex
0,31,191,George,female
1,21,181,Bob,male
2,27,176,Charlie,male
3,19,172,Frances,female
4,24,167,Eve,female
5,23,156,Alice,female


# User Defined Functions

Sometimes the built in functions do not suffice or you want to call an existing function of a Python library. Using User Defined Functions (UDF) it is possible to wrap an existing function into a Spark DataFrame function.

In [25]:
import html
from pyspark.sql.types import *

html_encode = udf(html.escape, StringType())

df = spark.createDataFrame([
        ("Alice & Bob",),
        ("Thelma & Louise",)
    ], ["name"])

result = df.select(html_encode(df.name).alias("html_name"))
result.toPandas()

Unnamed: 0,html_name
0,Alice &amp; Bob
1,Thelma &amp; Louise
