![Spark Image](https://upload.wikimedia.org/wikipedia/commons/thumb/f/f3/Apache_Spark_logo.svg/1200px-Apache_Spark_logo.svg.png)

# Structured APIs

## RDDs vs DataFrames and Datasets

![](https://databricks.com/wp-content/uploads/2018/05/rdd-1024x595.png)

### Resilient Distributed Dataset (RDD)
The RDD has been the primary user-facing API in Spark since its inception. At its core, an RDD is an immutable, distributed collection of data elements, partitioned across the nodes of your cluster, that can be operated on in parallel through a low-level API offering transformations and actions.

### When to use RDDs?
Consider using RDDs when:
- you want low-level transformations and actions and fine-grained control over your dataset;
- your data is unstructured, such as media streams or streams of text;
- you want to manipulate your data with functional programming constructs rather than domain-specific expressions;
- you don’t care about imposing a schema, such as a columnar format, while processing or accessing data attributes by name or column; and
- you can forgo some of the optimization and performance benefits available with DataFrames and Datasets for structured and semi-structured data.

### What happens to RDDs in Apache Spark 2.0?
You may ask: Are RDDs being relegated to second-class citizens? Are they being deprecated?

**The answer is a resounding NO!**

What’s more, you can move seamlessly between DataFrames or Datasets and RDDs with simple API method calls, and DataFrames and Datasets are themselves built on top of RDDs.

### DataFrames

Like an RDD, a DataFrame is an immutable distributed collection of data. Unlike an RDD, the data is organized into named columns, like a table in a relational database. Designed to make processing large data sets even easier, the DataFrame lets developers impose a structure onto a distributed collection of data, which allows a higher level of abstraction. It also provides a domain-specific language API to manipulate your distributed data and makes Spark accessible to a wider audience, beyond specialized data engineers.

In Spark 2.0, the DataFrame API was merged with the Dataset API, unifying data processing capabilities across libraries. Because of this unification, developers have fewer concepts to learn or remember and work with a single high-level, type-safe API called `Dataset`.

![Spark](https://databricks.com/wp-content/uploads/2016/06/Unified-Apache-Spark-2.0-API-1.png)

### Datasets
Starting in Spark 2.0, Dataset takes on two distinct API characteristics: a strongly-typed API and an untyped API, as shown in the table below. Conceptually, consider DataFrame as an alias for `Dataset[Row]`, a collection of generic objects, where a `Row` is a generic untyped JVM object. Dataset, by contrast, is a collection of strongly-typed JVM objects, dictated by a case class you define in Scala or a class in Java.

### Typed and Un-typed APIs

<table class="table">
<thead>
<tr>
<th>Language</th>
<th>Main Abstraction</th>
</tr>
</thead>
<tbody>
<tr>
<td>Scala</td>
<td>Dataset[T] &amp; DataFrame (alias for Dataset[Row])</td>
</tr>
<tr>
<td>Java</td>
<td>Dataset[T]</td>
</tr>
<tr>
<td>Python*</td>
<td>DataFrame</td>
</tr>
<tr>
<td>R*</td>
<td>DataFrame</td>
</tr>
</tbody>
</table>

> **Note:** *Since Python and R have no compile-time type-safety, we only have untyped APIs, namely DataFrames.*

### Benefits of Dataset APIs

### 1. Static-typing and runtime type-safety

Consider static typing and runtime safety as a spectrum, with SQL the least restrictive and Dataset the most restrictive. For instance, in your Spark SQL string queries, you won’t know about a syntax error until runtime (which could be costly), whereas with DataFrames and Datasets in a compiled language such as Scala or Java you can catch such errors at compile time (which saves developer time and costs). That is, if you invoke a function on a DataFrame that is not part of the API, the compiler will catch it. However, it won’t detect a non-existent column name until runtime.

At the far end of the spectrum is Dataset, the most restrictive. Since Dataset APIs are all expressed as lambda functions and JVM typed objects, any mismatch of typed parameters will be detected at compile time. Analysis errors can also be detected at compile time when using Datasets, which again saves developer time and costs.

All this translates to a spectrum of type-safety along syntax and analysis errors in your Spark code, with Datasets as the most restrictive yet productive option for a developer.

![](https://databricks.com/wp-content/uploads/2016/07/sql-vs-dataframes-vs-datasets-type-safety-spectrum.png)

### 2. High-level abstraction and custom view into structured and semi-structured data
DataFrames, as a collection of `Dataset[Row]`, render a structured custom view into your semi-structured data.

### 3. Ease-of-use of APIs with structure

Although structure may limit control in what your Spark program can do with data, it introduces rich semantics and an easy set of domain specific operations that can be expressed as high-level constructs. Most computations, however, can be accomplished with Dataset’s high-level APIs. For example, it’s much simpler to perform `agg`, `select`, `sum`, `avg`, `map`, `filter`, or `groupBy` operations. 

### 4. Performance and Optimization
Along with all the above benefits, you cannot overlook the space efficiency and performance gains in using DataFrames and Dataset APIs for two reasons.

First, because the DataFrame and Dataset APIs are built on top of the Spark SQL engine, they use Catalyst to generate an optimized logical and physical query plan. Across the R, Java, Scala, and Python DataFrame/Dataset APIs, all relational queries pass through the same optimizer, which provides the space and speed efficiency. Whereas the Dataset[T] typed API is optimized for data engineering tasks, the untyped Dataset[Row] (an alias of DataFrame) is even faster and suitable for interactive analysis.

![](https://databricks.com/wp-content/uploads/2016/07/memory-usage-when-caching-datasets-vs-rdds.png)

Second, since Spark as a compiler understands your Dataset type JVM object, it maps your type-specific JVM object to Tungsten’s internal memory representation using Encoders. As a result, Tungsten Encoders can efficiently serialize/deserialize JVM objects as well as generate compact bytecode that can execute at superior speeds.

### When should I use DataFrames or Datasets?
- If you want rich semantics, high-level abstractions, and domain specific APIs, use DataFrame or Dataset.
- If your processing demands high-level expressions, filters, maps, aggregation, averages, sum, SQL queries, columnar access and use of lambda functions on semi-structured data, use DataFrame or Dataset.
- If you want a higher degree of type-safety at compile time, want typed JVM objects, want to take advantage of Catalyst optimization, and want to benefit from Tungsten’s efficient code generation, use Dataset.
- If you want unification and simplification of APIs across Spark Libraries, use DataFrame or Dataset.
- If you are an R user, use DataFrames.
- If you are a Python user, use DataFrames and resort back to RDDs if you need more control.

## Starting a Spark Session

This course uses Python for the implementation, through `pyspark` (PySpark documentation: https://spark.apache.org/docs/latest/api/python/).
PySpark is an interface for Apache Spark in Python. It not only allows you to write Spark applications using Python APIs, but also provides the PySpark shell for interactively analyzing your data in a distributed environment.

In the previous notebooks we have defined *Spark configuration* and *Spark context* objects.
Now we are using *Spark Session*.

SparkSession vs SparkContext<br>
In earlier versions of Spark and PySpark, SparkContext (JavaSparkContext for Java) was the entry point for programming with RDDs and for connecting to a Spark cluster. Since Spark 2.0, SparkSession is the entry point for programming with DataFrames and Datasets.

What is SparkContext<br>
SparkContext is an entry point to Spark, defined in the org.apache.spark package since 1.x, and is used to programmatically create RDDs, accumulators and broadcast variables on the cluster. Since Spark 2.0, most of the functionality (methods) available in SparkContext is also available in SparkSession. Its object `sc` is available by default in spark-shell, and it can be created programmatically using the SparkContext class.

What is SparkSession<br>
SparkSession, introduced in version 2.0, is an entry point to the underlying Spark functionality for programmatically creating RDDs, DataFrames and Datasets. Its object `spark` is available by default in spark-shell, and it can be created programmatically using the SparkSession builder pattern.

In [None]:
from pyspark.sql import SparkSession
# May take a little while on a local computer
spark = SparkSession.builder.appName("Structured API").getOrCreate()

In [None]:
# check (try) if Spark session variable (spark) exists and print information about the Spark context
try:
    spark
except NameError:
    print("Spark session does not exist. Please create Spark session first (run cell above).")
else:
    configurations = spark.sparkContext.getConf().getAll()
    for item in configurations: print(item)

In [None]:
# download 'flight' data from the internet

# import Path library
from pathlib import Path
# check if file already exists
if Path('data/flight-data/2015-summary.json').is_file():
    print ("File 2015-summary.json already in data directory - no need to download.")
else:
    !wget https://raw.githubusercontent.com/databricks/Spark-The-Definitive-Guide/master/data/flight-data/json/2015-summary.json -P data/flight-data

In [None]:
# Data source: https://raw.githubusercontent.com/databricks/Spark-The-Definitive-Guide/master/data/flight-data/json/2015-summary.json

# create a Spark dataframe and read a file in json format
df = spark.read.format("json").load("data/flight-data/2015-summary.json")

In [None]:
# display the schema of the dataframe
df.printSchema()

In [None]:
# read the schema of a json file (direct from the file content)
spark.read.format("json").load("data/flight-data/2015-summary.json").schema

*A schema is a StructType made up of a number of fields, StructFields, that have a name, type, a Boolean flag which specifies whether that column can contain missing or null values, and, finally, users can optionally specify associated metadata with that column. The metadata is a way of storing information about this column.*

*If the types in the data (at runtime) do not match the schema, Spark will throw an error. The example that follows shows how to create and enforce a specific schema on a DataFrame.*

In [None]:
# import types from library
from pyspark.sql.types import StructField, StructType, StringType, LongType

In [None]:
# define a struct object as schema for the json format in the file
myManualSchema = StructType([StructField("DEST_COUNTRY_NAME", StringType(), True),
                             StructField("ORIGIN_COUNTRY_NAME", StringType(), True),
                             StructField("count", LongType(), False, metadata={"hello":"world"}) 
                            ])
df = spark.read.format("json").schema(myManualSchema).load("data/flight-data/2015-summary.json")

In [None]:
# display the schema of the dataframe
df.printSchema()

**Spark Types**<br><br>
Please see https://spark.apache.org/docs/latest/sql-ref-datatypes.html for the list of Spark data types (the table below reflects Spark version 3.2.1).

<table class="table">
<tbody><tr>
  <th style="width:20%">Data type</th>
  <th style="width:40%">Value type in Python</th>
  <th>API to access or create a data type</th></tr>
<tr>
  <td> <b>ByteType</b> </td>
  <td>
  int or long <br>
  <b>Note:</b> Numbers will be converted to 1-byte signed integer numbers at runtime.
  Please make sure that numbers are within the range of -128 to 127.
  </td>
  <td>
  ByteType()
  </td>
</tr>
<tr>
  <td> <b>ShortType</b> </td>
  <td>
  int or long <br>
  <b>Note:</b> Numbers will be converted to 2-byte signed integer numbers at runtime.
  Please make sure that numbers are within the range of -32768 to 32767.
  </td>
  <td>
  ShortType()
  </td>
</tr>
<tr>
  <td> <b>IntegerType</b> </td>
  <td> int or long </td>
  <td>
  IntegerType()
  </td>
</tr>
<tr>
  <td> <b>LongType</b> </td>
  <td>
  long <br>
  <b>Note:</b> Numbers will be converted to 8-byte signed integer numbers at runtime.
  Please make sure that numbers are within the range of
  -9223372036854775808 to 9223372036854775807.
  Otherwise, please convert data to decimal.Decimal and use DecimalType.
  </td>
  <td>
  LongType()
  </td>
</tr>
<tr>
  <td> <b>FloatType</b> </td>
  <td>
  float <br>
  <b>Note:</b> Numbers will be converted to 4-byte single-precision floating
  point numbers at runtime.
  </td>
  <td>
  FloatType()
  </td>
</tr>
<tr>
  <td> <b>DoubleType</b> </td>
  <td> float </td>
  <td>
  DoubleType()
  </td>
</tr>
<tr>
  <td> <b>DecimalType</b> </td>
  <td> decimal.Decimal </td>
  <td>
  DecimalType()
  </td>
</tr>
<tr>
  <td> <b>StringType</b> </td>
  <td> string </td>
  <td>
  StringType()
  </td>
</tr>
<tr>
  <td> <b>BinaryType</b> </td>
  <td> bytearray </td>
  <td>
  BinaryType()
  </td>
</tr>
<tr>
  <td> <b>BooleanType</b> </td>
  <td> bool </td>
  <td>
  BooleanType()
  </td>
</tr>
<tr>
  <td> <b>TimestampType</b> </td>
  <td> datetime.datetime </td>
  <td>
  TimestampType()
  </td>
</tr>
<tr>
  <td> <b>DateType</b> </td>
  <td> datetime.date </td>
  <td>
  DateType()
  </td>
</tr>
<tr>
  <td> <b>ArrayType</b> </td>
  <td> list, tuple, or array </td>
  <td>
  ArrayType(<i>elementType</i>, [<i>containsNull</i>])<br>
  <b>Note:</b> The default value of <i>containsNull</i> is <i>True</i>.
  </td>
</tr>
<tr>
  <td> <b>MapType</b> </td>
  <td> dict </td>
  <td>
  MapType(<i>keyType</i>, <i>valueType</i>, [<i>valueContainsNull</i>])<br>
  <b>Note:</b> The default value of <i>valueContainsNull</i> is <i>True</i>.
  </td>
</tr>
<tr>
  <td> <b>StructType</b> </td>
  <td> list or tuple </td>
  <td>
  StructType(<i>fields</i>)<br>
  <b>Note:</b> <i>fields</i> is a Seq of StructFields. Also, two fields with the same
  name are not allowed.
  </td>
</tr>
<tr>
  <td> <b>StructField</b> </td>
  <td> The value type in Python of the data type of this field
  (for example, int for a StructField with the data type IntegerType) </td>
  <td>
  StructField(<i>name</i>, <i>dataType</i>, [<i>nullable</i>])<br>
  <b>Note:</b> The default value of <i>nullable</i> is <i>True</i>.
  </td>
</tr>
</tbody></table>

In [None]:
# display the content of the dataframe
df.show(20, False)

In [None]:
# display the first row of the dataframe
df.first()

*The result is a `Row` object of the Spark DataFrame.*

In [None]:
# asDict() converts the first row of the dataframe into an object of type dict.
# Specifying the key of a name/value pair (in brackets) returns its value
df.first().asDict()['DEST_COUNTRY_NAME']

In [None]:
# display the first three entries of the dataframe
df.take(3)

### Columns and Expressions

There are many ways to construct and refer to columns, but the two simplest are
the `col` and `column` functions.<br><br>
Unfortunately, a bug in Spark version 3.1.2 prevents the `column` function from working
as intended, so with that specific Spark version we use only the `col` function.
For details, see https://issues.apache.org/jira/browse/SPARK-35643


In [None]:
# import functions from library
from pyspark.sql.functions import col, column

In [None]:
# select column "DEST_COUNTRY_NAME" from the dataframe and display two rows
df.select(col("DEST_COUNTRY_NAME")).show(2)

In Spark version 3.1.2, there is a bug in the function `column`. More about the bug can be read here: https://issues.apache.org/jira/browse/SPARK-35643.<br><br>Therefore, the following cells check the Spark version before calling `column` and skip the call on version 3.1.2.

In [None]:
if (spark.version != "3.1.2"):
    # select column "DEST_COUNTRY_NAME" from the dataframe and display two rows
    df.select(column("DEST_COUNTRY_NAME")).show(2)

### Expressions

In [None]:
# import expression from library
from pyspark.sql.functions import expr
# expr() is a SQL function to execute SQL-like expressions and to use an existing DataFrame column value 
# as an expression argument to Pyspark built-in functions.
df.select(expr("DEST_COUNTRY_NAME as Destination")).show(20, False)

## select and selectExpr


In [None]:
# select column "DEST_COUNTRY_NAME" from the dataframe and display two rows
df.select("DEST_COUNTRY_NAME").show(2)

In [None]:
# select columns "DEST_COUNTRY_NAME" and "ORIGIN_COUNTRY_NAME" from the dataframe and display two rows
df.select("DEST_COUNTRY_NAME", "ORIGIN_COUNTRY_NAME").show(2)


In [None]:
if (spark.version != "3.1.2"):
    # select column "DEST_COUNTRY_NAME" from the dataframe in three different ways and display two rows
    df.select(expr("DEST_COUNTRY_NAME"),
                    col("DEST_COUNTRY_NAME"),
                    column("DEST_COUNTRY_NAME")).show(2)

In [None]:
# select column "DEST_COUNTRY_NAME" from the dataframe with an expression and display two rows
df.select(expr("DEST_COUNTRY_NAME AS destination")).show(2)

In [None]:
# select column "DEST_COUNTRY_NAME" from the dataframe with an expression, set an alias and display two rows
df.select(expr("DEST_COUNTRY_NAME as destination").alias("DEST_COUNTRY_NAME")).show(2)

In [None]:
# select column "DEST_COUNTRY_NAME" from the dataframe with an expression, select the same column by name
# and display two rows
df.selectExpr("DEST_COUNTRY_NAME as newColumnName", "DEST_COUNTRY_NAME").show(2)

In [None]:
# select all original columns from the dataframe and add an additional column with a condition
df.selectExpr(
"*", # all original columns
"(DEST_COUNTRY_NAME = ORIGIN_COUNTRY_NAME) as withinCountry").show(2)

In [None]:
# compute the average of the "count" column and count the distinct values in column "DEST_COUNTRY_NAME"
df.selectExpr("avg(count)", "count(distinct(DEST_COUNTRY_NAME))").show(2)

## Adding Columns

In [None]:
# Adding a column with literal 1
# The lit() function is used to add constant or literal value as a new column to the dataframe.

# import lit function from library
from pyspark.sql.functions import lit
# add column "numberOne" with literal value 1
df.withColumn("numberOne", lit(1)).show(2)

In [None]:
# withColumn() is a DataFrame function that is used to add a new column to DataFrame, change the value of 
# an existing column, convert the datatype of a column or derive a new column from an existing column

# add new column "withinCountry" using an expression to determine the values
df.withColumn("withinCountry", expr("ORIGIN_COUNTRY_NAME == DEST_COUNTRY_NAME")).show(2)

In [None]:
# Renaming Columns
# withColumnRenamed() renames a single existing column; chain several calls to rename multiple columns.
df.withColumnRenamed("DEST_COUNTRY_NAME", "dest").columns

In [None]:
# Removing Columns
# Spark DataFrame provides a drop() method to drop a column/field from a DataFrame/Dataset.
# The drop() method can also remove multiple columns at a time from a Spark DataFrame/Dataset.
df.drop("ORIGIN_COUNTRY_NAME").columns

In [None]:
# Changing a Column’s Type (cast)
# Add column "count2" from column "count" with new type "long" and display the dataframe schema
df.withColumn("count2", col("count").cast("long")).schema

## Filtering Rows


In [None]:
# display only rows with count column value < 2 using the filter function
df.filter(col("count") < 2).show(2)

In [None]:
# display only rows with count column value < 2 using the where function 

# The where() operator is often used if people have a SQL background. 
# filter() and where() functions operate exactly the same.
df.where("count < 2").show(2)

In [None]:
# filter rows by chaining multiple where() calls (the conditions are combined with AND)
df.where(col("count") < 2).where(col("ORIGIN_COUNTRY_NAME") != "Croatia").show(2)

In [None]:
# Getting Unique Rows
# display the number of unique (distinct) rows across the columns "ORIGIN_COUNTRY_NAME" and "DEST_COUNTRY_NAME"
df.select("ORIGIN_COUNTRY_NAME", "DEST_COUNTRY_NAME").distinct().count()

In [None]:
# display the number of unique (distinct) rows from the field "ORIGIN_COUNTRY_NAME"
df.select("ORIGIN_COUNTRY_NAME").distinct().count()

## Random Samples


In [None]:
# Spark sampling is a mechanism to get random sample records from the dataset

# Seed for sampling (default: a random seed). Use a fixed seed to reproduce the same random sample
seed = 5
# Sample with replacement or not (default False).
withReplacement = False
# Fraction of rows to sample, between 0 and 1; the result contains approximately this share of the dataset.
fraction = 0.5
# get samples of the dataset and count the number of samples
df.sample(withReplacement, fraction, seed).count()
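
The count returned above is only approximately `fraction` times the number of rows because, when sampling without replacement, each row is kept or dropped independently with probability `fraction`. The following plain-Python sketch imitates that idea without Spark. The helper `bernoulli_sample` and the stand-in data are hypothetical, and the sketch does not reproduce Spark's random number generator, so its counts will differ from those of `df.sample`.

```python
import random


def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction` (hypothetical helper)."""
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    return [row for row in rows if rng.random() < fraction]


rows = list(range(256))  # stand-in for the rows of a DataFrame
sample_a = bernoulli_sample(rows, 0.5, seed=5)
sample_b = bernoulli_sample(rows, 0.5, seed=5)

print(len(sample_a))         # close to 128, but not guaranteed to be exactly 128
print(sample_a == sample_b)  # True: same seed, same sample
```

If you need an exact number of rows rather than an approximate share, sampling alone is not enough; combining it with `limit` is one possible approach.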

## Concatenating and Appending Rows (Union)


In [None]:
# import row functions from library
from pyspark.sql import Row
# save schema of dataframe df in variable schema
schema = df.schema
# Row can be used to create row objects, here by using positional arguments
# this command creates two new rows, each with three values, which fits the schema
newRows = [Row("New Country", "Other Country", 5),Row("New Country 2", "Other Country 3", 1)]
# create the new RDD in parallel/concurrent tasks
# remember: parallelized collections are created by calling SparkContext’s parallelize method on an existing 
#           iterable or collection in your driver program. The elements of the collection are copied to form 
#           a distributed dataset that can be operated on in parallel.
parallelizedRows = spark.sparkContext.parallelize(newRows)
# create a new dataframe from the RDD using the same schema as in df
newDF = spark.createDataFrame(parallelizedRows, schema)

In [None]:
# display the new dataframe schema and content
newDF.printSchema()
newDF.show()

In [None]:
# The union() method combines two DataFrames that have the same number of columns.
# Columns are matched by position, not by name; a differing number of columns raises an error.

# combine dataframe df with dataframe newDF and select only the rows 
# with "count = 1" and "ORIGIN_COUNTRY_NAME != 'United States'"
df.union(newDF).where("count = 1").where(col("ORIGIN_COUNTRY_NAME") != "United States").show()

## Sorting Rows

In [None]:
# Spark dataframe/dataset class provides sort() function to sort on one or more columns. 
# By default, it sorts by ascending order.
df.sort("count").show(5)

In [None]:
# Alternatively, Spark dataFrame/dataset class also provides orderBy() function to sort on one or more columns. 
# By default, it also orders by ascending.
df.orderBy("count", "DEST_COUNTRY_NAME").show(5)

In [None]:
# sort the dataframe by two columns
df.orderBy(col("count"), col("DEST_COUNTRY_NAME")).show(5)

*Sorting is ascending by default; to set the order explicitly, use `desc` and `asc`.*

In [None]:
# import desc, asc functions from library
from pyspark.sql.functions import desc, asc
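# sort the dataframe on column 'count' in descending order by using the desc function
# (note: expr("count desc") is parsed as column `count` with the alias `desc`, so it would sort ascending)
df.orderBy(desc("count")).show(2)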

In [None]:
# sort the dataframe on two columns. One in ascending order and one in descending order.
df.orderBy(col("count").desc(), col("DEST_COUNTRY_NAME").asc()).show(2)

## Limit

In [None]:
# The limit clause is used to constrain the number of rows returned by the SELECT statement. 
# In general, this clause is used in conjunction with ORDER BY to ensure that the results are deterministic.
df.orderBy(col("count").desc()).limit(6).show()

## Repartition and Coalesce

In [None]:
# in PySpark you get the current number of partitions by calling getNumPartitions() of the RDD class,
# so for a DataFrame you first need to access its underlying RDD.
# RDD: rdd.getNumPartitions()
# dataframe (convert to RDD first): 
df.rdd.getNumPartitions()

In [None]:
# repartition() increases or decreases the number of partitions of an RDD/dataframe (this incurs a full shuffle)
df.repartition(5)

In [None]:
# repartition dataframe and display number of partitions
df.repartition(5).rdd.getNumPartitions()

In [None]:
# If you know that you’re going to be filtering by a certain column often, 
# it can be worth repartitioning based on that column
df.repartition(col("DEST_COUNTRY_NAME"))

In [None]:
# Repartition with defined number of partitions and partition column
# If number of partitions is not specified, the default number of partitions is used.
df.repartition(5, col("DEST_COUNTRY_NAME"))

*Coalesce, on the other hand, will not incur a full shuffle and will try to combine partitions.*

In [None]:
# coalesce() is used to only decrease the number of partitions in an efficient way.
df.repartition(5, col("DEST_COUNTRY_NAME")).coalesce(2)

## Working with Different Types of Data


In [None]:
# download 'retail' data from the internet

# import Path library
from pathlib import Path
# check if file already exists
if Path('data/retail-data/by-day/2010-12-01.csv').is_file():
    print ("File 2010-12-01.csv already in data directory - no need to download.")
else:
    !wget https://raw.githubusercontent.com/databricks/Spark-The-Definitive-Guide/master/data/retail-data/by-day/2010-12-01.csv -P data/retail-data/by-day/

In [None]:
# Data source: https://raw.githubusercontent.com/databricks/Spark-The-Definitive-Guide/master/data/retail-data/by-day/2010-12-01.csv

# Read a file in csv format. The path can be either a single CSV file or a directory of CSV files
# options:
#          header: uses the first line as names of columns. 
#          inferSchema: Infers the input schema automatically from data. It requires one extra pass over the data.
# See this page for all options: https://spark.apache.org/docs/latest/sql-data-sources-csv.html
df = spark.read.format("csv")\
.option("header", "true")\
.option("inferSchema", "true")\
.load("data/retail-data/by-day/2010-12-01.csv")
# display the schema of the dataframe
df.printSchema()

In [None]:
# import lit function from library
from pyspark.sql.functions import lit
# select three literal columns, each holding the value 5 as a different data type (integer, string, double)
df.select(lit(5), lit("five"), lit(5.0))

### Working with Booleans


In [None]:
# import col function from library
# col returns a Column based on the given column name.
from pyspark.sql.functions import col
# select only columns "InvoiceNo" and "Description" AND filter data where "InvoiceNo" != 536365
df.where(col("InvoiceNo") != 536365)\
.select("InvoiceNo", "Description")\
.show(5, False)

In [None]:
# import instr function from library
from pyspark.sql.functions import instr
# create the 'where' condition '(UnitPrice > 600)'
priceFilter = col("UnitPrice") > 600
# instr locates the position of the first occurrence of substr column in the given string. 
# Returns null if either of the arguments are null.
# create the 'where' condition '(instr(Description, POSTAGE) >= 1)'
descripFilter = instr(df.Description, "POSTAGE") >= 1
# isin function is a boolean expression that is evaluated to true if the value of this expression is 
# contained by the evaluated values of the arguments. 
# create 'where' condition to check if the value in column StockCode is in the list of values. 
# The list in this case only contains "DOT"
df.where(df.StockCode.isin("DOT")).where(priceFilter | descripFilter).show()
# the output only displays row where
#    - column StockCode contains the value "DOT" 
#      AND
#    - ('column UnitPrice value is greater than 600' OR 'The position of the word "POSTAGE"
#       in column Description >= 1')

In [None]:
# create the same 'where' conditions as in the cell above with different notations
DOTCodeFilter = col("StockCode") == "DOT"
priceFilter = col("UnitPrice") > 600
descripFilter = instr(col("Description"), "POSTAGE") >= 1
# create a true/false column "isExpensive" based on the where condition in the second argument
# filter all rows by "isExpensive" == True
# and display only the columns "unitPrice" and "isExpensive"
df.withColumn("isExpensive", DOTCodeFilter & (priceFilter | descripFilter))\
.where("isExpensive")\
.select("unitPrice", "isExpensive").show(5)

In [None]:
# create a true/false column "isExpensive" based on an expression (Unit Price > 250)
# filter all rows by "isExpensive" == True
# and display only the columns "Description" and "UnitPrice"
df.withColumn("isExpensive", expr("NOT UnitPrice <= 250"))\
.where("isExpensive")\
.select("Description", "UnitPrice").show(5)

### Working with Numbers

In [None]:
# import expr and pow functions from library
from pyspark.sql.functions import expr, pow
# pow returns the value of the first argument raised to the power of the second argument.
# fabricatedQuantity is an object of type Column (an expression that is evaluated once it is used in a select)
fabricatedQuantity = pow(col("Quantity") * col("UnitPrice"), 2) + 5
# display the Quantity and UnitPrice of the dataframe
df.select(col("CustomerId"), col("Quantity"), col("UnitPrice")).show(2)
# display the new column in table with name "realQuantity"
df.select(expr("CustomerId"), fabricatedQuantity.alias("realQuantity")).show(2)

In [None]:
# same result as in cell above - using selectExpr
# the function selectExpr() takes a set of SQL expressions in a string to execute
df.selectExpr(
"CustomerId",
"(POWER((Quantity * UnitPrice), 2.0) + 5) as realQuantity").show(2)

In [None]:
# import lit, round and bround functions from library
from pyspark.sql.functions import lit, round, bround
# lit: creates a column of literal value
# round: round the given value to scale decimal places using HALF_UP rounding mode if scale >= 0 or 
#        at integral part when scale < 0.
# bround: round the given value to scale decimal places using HALF_EVEN rounding mode if scale >= 0 or 
#        at integral part when scale < 0.
df.select(round(lit("2.5")), bround(lit("2.5"))).show(2)
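
The difference between the two rounding modes only shows up for values that lie exactly halfway between two integers. The sketch below reproduces HALF_UP and HALF_EVEN with Python's standard `decimal` module instead of Spark. The helper names `half_up` and `half_even` are made up for this illustration, and the results are plain Python integers rather than Spark columns.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP


def half_up(value):
    """Round to an integer; ties go away from zero (the mode used by round)."""
    return int(Decimal(value).quantize(Decimal("1"), rounding=ROUND_HALF_UP))


def half_even(value):
    """Round to an integer; ties go to the nearest even integer (the mode used by bround)."""
    return int(Decimal(value).quantize(Decimal("1"), rounding=ROUND_HALF_EVEN))


for value in ["0.5", "1.5", "2.5", "3.5"]:
    print(value, half_up(value), half_even(value))
# 0.5 1 0
# 1.5 2 2
# 2.5 3 2
# 3.5 4 4
```

HALF_EVEN (also called banker's rounding) avoids the systematic upward drift that HALF_UP introduces when many halfway values are summed.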

In [None]:
# import corr function from library
from pyspark.sql.functions import corr
# corr: calculates the correlation of two columns of a dataframe as a double value.
# output type float
print( df.stat.corr("Quantity", "UnitPrice") )
# output type dataframe
df.select(corr("Quantity", "UnitPrice")).show()
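
Both calls above compute the Pearson correlation coefficient. To make the number easier to interpret, here is a minimal plain-Python sketch of the formula on small made-up lists; the function `pearson` is a hypothetical helper, not part of PySpark.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0: perfectly linear, increasing
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))  # -1.0: perfectly linear, decreasing
```

A value near 0 means there is no linear relationship between the two columns, although a non-linear one may still exist.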

In [None]:
# 'describe' computes basic statistics for numeric and string columns.
# This includes count, mean, stddev, min, and max.
# If no columns are given, this function computes statistics for all numerical or string columns.
df.describe().show()

In [None]:
# import count, mean, stddev_pop, min and max functions from library
from pyspark.sql.functions import count, mean, stddev_pop, min, max
# 'approxQuantile' calculates the approximate quantiles of numerical columns of a dataframe
# parameters are:
#    column name. Can be a single column name, or a list of names for multiple columns.
colName = "UnitPrice"
#    probabilities. A list of quantile probabilities. Each number must belong to [0, 1].
quantileProbs = [0.5]
#    relativeError. The relative target precision to achieve (>= 0). 
#                   If set to zero, the exact quantiles are computed, which could be very expensive.
relError = 0.05
# calculate the approximate quantiles of column "UnitPrice" with probability 0.5 and relativeError 0.05
df.stat.approxQuantile(colName, quantileProbs, relError) # 2.51
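
The relative error is a bound on the rank of the returned value, not on the value itself: for N elements, probability p and error err, the exact rank of the result x satisfies floor((p - err) * N) <= rank(x) <= ceil((p + err) * N). The plain-Python sketch below checks this bound on made-up data. The helper `rank_is_acceptable` is hypothetical and only illustrates the guarantee; it is not how Spark computes the quantile.

```python
import math


def rank_is_acceptable(values, x, p, err):
    """Check floor((p - err) * N) <= rank(x) <= ceil((p + err) * N) (hypothetical helper)."""
    n = len(values)
    rank = sum(1 for v in values if v <= x)  # number of elements not larger than x
    return math.floor((p - err) * n) <= rank <= math.ceil((p + err) * n)


values = list(range(1, 101))  # 100 made-up values: 1, 2, ..., 100

print(rank_is_acceptable(values, 50, 0.5, 0.05))  # True: the exact median
print(rank_is_acceptable(values, 53, 0.5, 0.05))  # True: still within the tolerated ranks
print(rank_is_acceptable(values, 60, 0.5, 0.05))  # False: too far from rank 50
```

A smaller `relError` narrows the band of acceptable ranks at the price of more computation.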

In [None]:
# 'crosstab' computes a pair-wise frequency table of the given columns. Also known as a contingency table.
df.stat.crosstab("StockCode", "Quantity").show(2, False)
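
A contingency table simply counts how often each pair of values occurs together. The following plain-Python sketch builds such a table with `collections.Counter` for a few made-up (stock code, quantity) pairs; the data and variable names are assumptions for illustration only.

```python
from collections import Counter

# made-up (StockCode, Quantity) pairs standing in for two DataFrame columns
pairs = [("A", 1), ("A", 1), ("A", 2), ("B", 1)]

counts = Counter(pairs)                           # frequency of every observed pair
row_labels = sorted({code for code, _ in pairs})  # distinct values of the first column
col_labels = sorted({qty for _, qty in pairs})    # distinct values of the second column

# one row per stock code, one column per quantity; unseen pairs count as 0
table = {r: {c: counts[(r, c)] for c in col_labels} for r in row_labels}

print(table)  # {'A': {1: 2, 2: 1}, 'B': {1: 1, 2: 0}}
```

The table has one column per distinct value of the second input column, which is why the result can become very wide for a column with many distinct values.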

In [None]:
# Finding frequent items for columns, possibly with false positives.
df.stat.freqItems(["StockCode", "Quantity"]).show(2)

In [None]:
# import monotonically_increasing_id function from library
from pyspark.sql.functions import monotonically_increasing_id
# monotonically_increasing_id: A column that generates monotonically increasing 64-bit integers.
# The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
df.select(monotonically_increasing_id()).show(2)
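
The IDs are unique but not consecutive because of how they are composed: according to the Spark documentation, the current implementation puts the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits. The sketch below mimics that layout in plain Python for a made-up DataFrame with two partitions of three records each; `sketch_id` is a hypothetical helper.

```python
def sketch_id(partition_id, record_number):
    """Compose an ID from partition ID (upper bits) and record number (lower 33 bits)."""
    return (partition_id << 33) + record_number


# two partitions with three records each
ids = [sketch_id(p, r) for p in range(2) for r in range(3)]

print(ids)  # [0, 1, 2, 8589934592, 8589934593, 8589934594]
```

The jump between the third and the fourth ID marks the start of the second partition.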

### Working with Strings

In [None]:
# import initcap function from library
from pyspark.sql.functions import initcap
# initcap: Translate the first letter of each word to upper case in the sentence.
df.select(initcap(col("Description"))).show(20, False)

In [None]:
# import lower and upper function from library
from pyspark.sql.functions import lower, upper
# lower: Converts a string expression to lower case.
# upper: Converts a string expression to upper case.
df.select(col("Description"),
lower(col("Description")),
upper(lower(col("Description")))).show(2, False)

In [None]:
# import lit, ltrim, rtrim, rpad, lpad and trim functions from library
from pyspark.sql.functions import lit, ltrim, rtrim, rpad, lpad, trim
# lit: Creates a Column of literal value.
# ltrim: Trim the spaces from left end for the specified string value.
# rtrim: Trim the spaces from right end for the specified string value.
# rpad: Right-pad the string column to width len with pad.
# lpad: Left-pad the string column to width len with pad.
# trim: Trim the spaces from both ends for the specified string column.
df.select(
ltrim(lit("   HELLO   ")).alias("ltrim"),
rtrim(lit("   HELLO   ")).alias("rtrim"),
trim(lit("    HELLO   ")).alias("trim"),
lpad(lit("HELLO"), 3, " ").alias("lp"),
rpad(lit("HELLO"), 10, " ").alias("rp")).show(2, False)

In [None]:
# Regular Expressions
# import regexp_replace function from library
from pyspark.sql.functions import regexp_replace
# regexp_replace: Replace all substrings of the specified string value that match regexp with rep.
regex_string = "BLACK|WHITE|RED|GREEN|BLUE"
# replace the strings BLACK, WHITE, RED, GREEN or BLUE with the string COLOR
df.select(
regexp_replace(col("Description"), regex_string, "COLOR").alias("color_clean"),
col("Description")).show(2, False)


In [None]:
# import translate function from library
from pyspark.sql.functions import translate
# translate(srcCol, matching, replace):
# Replaces every character of srcCol that occurs in 'matching' with the character at the
# same position in 'replace'. Characters that do not occur in 'matching' are left unchanged.
# Here L becomes 1, E becomes 3 and T becomes 7.
df.select(translate(col("Description"), "LEET", "1337"),col("Description")).show(2, False)
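
The character mapping that `translate` applies can be emulated in plain Python, which makes the effect easy to check without a Spark session. This is only a sketch of the semantics built on Python's `str.translate`; the helper name `translate_chars` is made up for illustration and the sample string is assumed to be a typical Description value.

```python
def translate_chars(text, matching, replace):
    """Replace each character found in `matching` by the character at the
    same position in `replace`; all other characters stay unchanged."""
    # str.maketrans needs both strings to have the same length, as they do here
    table = str.maketrans(matching, replace)
    return text.translate(table)


print(translate_chars("WHITE METAL LANTERN", "LEET", "1337"))  # WHI73 M37A1 1AN73RN
```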

In [None]:
# import regexp_extract function from library
from pyspark.sql.functions import regexp_extract
# regexp_extract: Extract a specific group matched by a regex, from the specified string column. 
# If the regex did not match, or the specified group did not match, an empty string is returned.
extract_str = "(BLACK|WHITE|RED|GREEN|BLUE)"
df.select(
regexp_extract(col("Description"), extract_str, 1).alias("color_clean"),
col("Description")).show(2, False)

In [None]:
# import instr function from library
from pyspark.sql.functions import instr
# instr: Locate the position of the first occurrence of substr in the given string column.
# The position is 1-based; 0 is returned if substr is not found.
# Returns null if either of the arguments is null.
containsBlack = instr(col("Description"), "BLACK") >= 1
containsWhite = instr(col("Description"), "WHITE") >= 1
# select only the Description column for rows whose text contains the string 'BLACK' or 'WHITE'
df.withColumn("hasSimpleColor", containsBlack | containsWhite)\
.where("hasSimpleColor")\
.select("Description").show(3, False)

### Working with Dates and Timestamps

In [None]:
# import current_date and current_timestamp functions from library
from pyspark.sql.functions import current_date, current_timestamp
# current_date: Returns the current date at the start of query evaluation as a DateType column. 
#               All calls of current_date within the same query return the same value.
# current_timestamp:  Returns the current timestamp at the start of query evaluation as a TimestampType column. 
#               All calls of current_timestamp within the same query return the same value.
#
# create a dataframe with 10 rows and 3 columns (id, today, now)
dateDF = spark.range(10)\
.withColumn("today", current_date())\
.withColumn("now", current_timestamp())
dateDF.show(10, False)

In [None]:
# display the schema of the dataframe
dateDF.printSchema()

In [None]:
# import date_add and date_sub functions from library
from pyspark.sql.functions import date_add, date_sub
# date_add: Returns the date that is 'days' (second parameter) days after start (first parameter)
# date_sub: Returns the date that is 'days' (second parameter) days before start (first parameter)
dateDF.select(date_sub(col("today"), 5), date_add(col("today"), 5)).show(1)

In [None]:
# import datediff, months_between and to_date functions from library
from pyspark.sql.functions import datediff, months_between, to_date
# datediff(end, start): Returns the number of days from start (second parameter) to end (first parameter).
# Here the result is -7 because week_ago lies before today.
dateDF.withColumn("week_ago", date_sub(col("today"), 7))\
.select(datediff(col("week_ago"), col("today"))).show(1)

In [None]:
# months_between(date1, date2):
# Returns the number of months between date1 (first parameter) and date2 (second parameter).
# If date1 is later than date2, the result is positive; here "start" is earlier than "end",
# so the result is negative.
# If both dates are on the same day of month, or both are the last day of month,
# the result is a whole number (time of day is ignored).
# The result is rounded to 8 digits unless roundOff is set to False (optional parameter 'roundOff=True').
dateDF.select(
to_date(lit("2016-01-01")).alias("start"),
to_date(lit("2017-05-22")).alias("end"))\
.select(months_between(col("start"), col("end"))).show(1)
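
The rule described for `months_between` can be traced with a small plain-Python sketch. It assumes the documented behaviour that the fractional part is based on 31 days per month, and it ignores time of day; the function below is an illustration of that rule, not Spark's implementation.

```python
import calendar
from datetime import date


def months_between(date1, date2):
    """Months from date2 to date1; positive when date1 is later than date2."""
    whole_months = (date1.year - date2.year) * 12 + (date1.month - date2.month)
    last_day1 = calendar.monthrange(date1.year, date1.month)[1]
    last_day2 = calendar.monthrange(date2.year, date2.month)[1]
    same_day = date1.day == date2.day
    both_month_end = date1.day == last_day1 and date2.day == last_day2
    if same_day or both_month_end:
        # whole number of months, no fractional part
        return float(whole_months)
    # fractional part assumes 31 days per month; rounded to 8 digits
    return round(whole_months + (date1.day - date2.day) / 31, 8)


# same dates as in the cell above: "start" is earlier than "end", so the result is negative
print(months_between(date(2016, 1, 1), date(2017, 5, 22)))  # -16.67741935
```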

In [None]:
# import to_date and lit functions from library
from pyspark.sql.functions import to_date, lit
# to_date: Converts a string (StringType) column to a date (DateType) column.
#          The first argument is the date string; the optional second argument
#          is the pattern in which the date string is written.
#          Overview datetime patterns: https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
spark.range(5).withColumn("date", lit("2017-01-01"))\
.select(to_date(col("date"))).show(1)

In [None]:
# import to_date function from library
from pyspark.sql.functions import to_date
# to_date: Converts a string (StringType) column to a date (DateType) column.
#          The first argument is the date string; the optional second argument
#          is the pattern in which the date string is written.
#          Overview datetime patterns: https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
dateFormat = "yyyy-dd-MM"
cleanDateDF = spark.range(1).select(
to_date(lit("2017-12-11"), dateFormat).alias("date"),
to_date(lit("2017-20-12"), dateFormat).alias("date2"))
cleanDateDF.show()

In [None]:
# import to_timestamp function from library
from pyspark.sql.functions import to_timestamp
# to_timestamp: Converts a string column to a timestamp (TimestampType).
#               Without a format, the conversion follows the rules for casting to timestamp,
#               so it expects strings such as yyyy-MM-dd HH:mm:ss.
#               If the string cannot be parsed, the function returns null
#               (or raises an error when ANSI mode is enabled).
#               The optional second parameter specifies the pattern of the input string
#               when it deviates from that default.
cleanDateDF.select(to_timestamp(col("date"), dateFormat)).show()

### Working with Nulls in Data

Before we run the 'working with Nulls' examples we want to investigate the dataframe content.<br>
This helps to understand the output of the examples.

In [None]:
# import when, count and col functions from library
from pyspark.sql.functions import when, count, col
print("Schema of the dataset:")
df.printSchema()
print("List of all columns with the number of NULL values in each column:")
df.select([count(when (col(c).isNull(), c)).alias(c) for c in df.columns]).show()

In [None]:
# import coalesce function from library
from pyspark.sql.functions import coalesce
# coalesce (from pyspark.sql.functions) returns the first non-null value among the given
# columns, or null if all columns are null.
# It requires at least one column, and all columns have to be of the same or compatible types.
# Do not confuse it with DataFrame.coalesce(numPartitions), which takes an integer and returns
# a new DataFrame with the number of partitions reduced to numPartitions.
print("Rows where the Description or CustomerID column is NULL:")
df.select(col("InvoiceNo"), col("StockCode"), col("Description"), col("CustomerId")).\
    filter("Description is NULL OR CustomerID is NULL").show(5, False)

print("Rows with coalesce(Description, CustomerId) output:")
df.select(col("InvoiceNo"), col("StockCode"), col("Description"), col("CustomerId"),\
          coalesce(col("Description"), col("CustomerId"))).\
              filter("InvoiceNo = 536414 OR InvoiceNo = 536544").show(5, False)

The following examples use the 'na' functions. '<dataframe>.na' returns a DataFrameNaFunctions object for handling missing values.
<br>Methods within 'na' are:

  drop([how, thresh, subset]) - Returns a new DataFrame omitting rows with null values.<br>
  fill(value[, subset]) - Replaces null values; alias for DataFrame.fillna().<br>
  replace(to_replace[, value, subset]) - Returns a new DataFrame replacing a value with another value.
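
How the `how`, `thresh` and `subset` parameters of `drop` interact can be sketched in plain Python. The helper `drop_nulls` and the three sample rows are made up for illustration; rows are dicts and `None` stands for null. The sketch assumes the documented rule that `thresh` (minimum number of non-null values) overrides `how` when it is given.

```python
def drop_nulls(rows, how="any", thresh=None, subset=None):
    """Return the rows that survive the null check."""
    kept = []
    for row in rows:
        columns = subset if subset is not None else list(row)
        non_null = sum(row[c] is not None for c in columns)
        if thresh is not None:
            keep = non_null >= thresh          # thresh overrides how
        elif how == "any":
            keep = non_null == len(columns)    # drop if any value is null
        else:
            keep = non_null > 0                # how == "all": drop only if all values are null
        if keep:
            kept.append(row)
    return kept


rows = [
    {"InvoiceNo": 1, "Description": "LANTERN", "CustomerID": 17850},
    {"InvoiceNo": 2, "Description": None, "CustomerID": None},
    {"InvoiceNo": None, "Description": None, "CustomerID": None},
]
print(len(drop_nulls(rows)))                         # 1
print(len(drop_nulls(rows, how="all")))              # 2
print(len(drop_nulls(rows, subset=["InvoiceNo"])))   # 2
print(len(drop_nulls(rows, thresh=1)))               # 2
```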

In [None]:
# dropna() returns a new DataFrame omitting rows with null values. 
# DataFrame.dropna() and DataFrameNaFunctions.drop() are aliases of each other.

# The function takes three optional parameters (how, thresh, subset) that control
# which rows with NULL values are removed.
print("number of rows before drop(): " + str(df.count()))
print("number of rows after drop(): " + str(df.na.drop().count()))

In [None]:
# The first parameter 'how' takes values ‘any’ or ‘all’. 
# By using ‘any’, drop a row if it contains NULLs on any columns. 
print("number of rows before drop('any'): " + str(df.count()))
print("number of rows after drop('any'): " + str(df.na.drop("any").count()))

In [None]:
# The first parameter 'how' takes values ‘any’ or ‘all’. 
# By using ‘all’, drop a row only if all columns have NULL values. Default is ‘any’.
print("number of rows before drop('all'): " + str(df.count()))
print("number of rows after drop('all'): " + str(df.na.drop("all").count()))

In [None]:
# subset restricts the NULL check to the listed columns. Default is None (all columns).
print("number of rows before drop('all') with subset: " + str(df.count()))
print("number of rows after drop('all') with subset: " + str(df.na.drop("all", subset=["StockCode", "InvoiceNo"]).count()))

In [None]:
# fill: Replaces null values. DataFrame.fillna() and DataFrameNaFunctions.fill()
# are aliases of each other.
dfNew = df.na.fill("filled-null-description",["Description"]).\
           na.fill(0,["CustomerId"])

print("Some (selected) rows with original value in columns Description and CustomerID:")
df.select(col("InvoiceNo"), col("StockCode"), col("Description"), col("CustomerId")).\
    filter("Description is NULL OR CustomerID is NULL").show(5, False)

print("Rows with null values 'filled' output:")
dfNew.select(col("InvoiceNo"), col("StockCode"), col("Description"), col("CustomerId")).\
              filter("InvoiceNo = 536414 OR InvoiceNo = 536544").show(5, False)

In [None]:
# DataFrame.fillna() or DataFrameNaFunctions.fill() is used to replace NULL/None values on all or 
# selected multiple DataFrame columns with either zero(0), empty string, space, or any constant 
# literal values.
print("Rows where Description is NULL:")
df.select(col("InvoiceNo"), col("StockCode"), col("Description")).\
    filter("Description is NULL").show(5, False)
    
# execute na.fill on dataframe and display only selected output rows    
print("Rows with NULL values replaced by literal value:")
df.na.fill("All Null values become this string").\
    select(col("InvoiceNo"), col("StockCode"), col("Description")).\
              filter(df.InvoiceNo.isin([536414, 536545, 536546, 536547, 536549, 536543])).show(20, False)              

In [None]:
# replace: Returns a new DataFrame replacing a value with another value.
# DataFrame.replace() and DataFrameNaFunctions.replace() are aliases of each other. 
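# Values to_replace and value must have the same type and can only be numerics, booleans, or strings.
# value may be None. When replacing, the new value will be cast to the type of the existing column.
# Note: this replaces empty strings only; null values are untouched, so the filter below still finds them.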
df.na.replace([""], ["UNKNOWN"], "Description").\
    filter("Description is null").show()

In [None]:
# import size and split functions from library
from pyspark.sql.functions import size, split
# size(): returns the length of the array or map stored in the column.
df.select(size(split(col("Description"), " "))).show(2) # shows 5 and 3

In [None]:
# import array_contains function from library
from pyspark.sql.functions import array_contains
# array_contains: returns null if the array is null, true if the array contains the given value, 
#                 and false otherwise.
df.select(array_contains(split(col("Description"), " "), "WHITE")).show(2, False)

In [None]:
# import split and explode functions from library
from pyspark.sql.functions import split, explode
# explode: Returns a new row for each element in the given array or map.
df.withColumn("splitted", split(col("Description"), " "))\
.withColumn("exploded", explode(col("splitted")))\
.select("Description", "InvoiceNo", "exploded").show(10, False)

In [None]:
# MapType (also called map type) is the data type that represents a Python dictionary (dict) of
# key-value pairs. A MapType object comprises three fields: keyType (a DataType), valueType (a DataType)
# and valueContainsNull (a bool). The last field is optional.

# Maps are created by using the create_map function and key-value pairs of columns

# import create_map function from library
from pyspark.sql.functions import create_map
# create the map in new column complex_map from columns Description and InvoiceNo
df.select(create_map(col("Description"), col("InvoiceNo")).alias("complex_map")).show(2, False)
# print the schema of the map
df.select(create_map(col("Description"), col("InvoiceNo")).alias("complex_map")).printSchema()

In [None]:
# it is possible to use the map within 'queries'
# this selectExpr only selects the value within the map if the key value is "WHITE METAL LANTERN". 
# Returns null if the key value is not found.
df.select(col("Description"), create_map(col("Description"), col("InvoiceNo")).alias("complex_map"))\
.selectExpr("Description","complex_map['WHITE METAL LANTERN']").show(5, False)

In [None]:
# The explode function on a MapType column returns one row per key-value pair of the map
df.select(create_map(col("Description"), col("InvoiceNo")).alias("complex_map"))\
.selectExpr("explode(complex_map)").show(5, False)

### User-Defined Functions

A PySpark UDF (user-defined function) extends the built-in capabilities of Spark SQL and the DataFrame API. 
A UDF defines a new column-based function that extends the vocabulary of Spark SQL’s DSL for transforming Datasets.<br><br>
UDFs created with the @udf decorator (or the udf() function) can only be used in the DataFrame API, not in Spark SQL. To use a UDF in Spark SQL, you have to register it using 
spark.udf.register. Note that spark.udf.register accepts not only UDFs but also regular Python functions (in which case you should specify 
the return type; it defaults to string).<br><br> Note: UDFs are among the most expensive operations in PySpark, so use them only when no built-in function does the job.
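
The two ways of exposing a Python function described above can be sketched as follows. This is a minimal sketch, not a definitive recipe: the helper `register_power3` and the SQL name `power3_sql` are hypothetical, and the function converts its input to `float` on the assumption that the declared return type is `DoubleType`. A Python result that does not match the declared type typically shows up as null rather than as an error.

```python
def power3(value):
    """Cube the input and always return a float, matching DoubleType."""
    return float(value) ** 3


def register_power3(spark):
    """Wrap power3 for the DataFrame API and register it for Spark SQL."""
    # imported here so that the plain function above also works without PySpark
    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    power3_udf = udf(power3, DoubleType())                   # usable in DataFrame expressions
    spark.udf.register("power3_sql", power3, DoubleType())   # usable in SQL expressions
    return power3_udf


print(power3(2))  # 8.0
```

With an active session, `register_power3(spark)` returns a function that can be applied to columns in `select`, and `power3_sql(num)` becomes available in `selectExpr` and `spark.sql`.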

In [None]:
# create a dataframe - which will be used in this section
udfExampleDF = spark.range(5).toDF("num")
udfExampleDF.show()

In [None]:
# define the UDF 'power3' 
def power3(double_value):
    return double_value ** 3

# execute the function directly in Python
power3(2.0)

In [None]:
# import udf function from library
from pyspark.sql.functions import udf
# wrapping the function with udf() makes it usable on DataFrame columns - this does NOT register the function in Spark SQL
power3udf = udf(power3)

In [None]:
# import col function from library
from pyspark.sql.functions import col
# the udf can be used in 'queries' with column values
udfExampleDF.select(col("num"), power3udf(col("num"))).show(5)

In [None]:
# import LongType from library
from pyspark.sql.types import LongType
# register the function for use in Spark SQL unless it is already registered.
# The declared return type has to match what the Python function actually returns:
# column "num" holds integers, so power3 returns a Python int and the type must be LongType.
if not ("power3py" in [f.name for f in spark.catalog.listFunctions()]):
    spark.udf.register("power3py", power3, LongType())

In [None]:
# a return type that does not match the actual Python result (for example DoubleType for an int)
# does not raise an error but yields null values
udfExampleDF.selectExpr("power3py(num)").show(5, False)

## Aggregation

In [None]:
# read ALL .csv files from a directory and reduce the result to at most five partitions
df = spark.read.format("csv")\
.option("header", "true")\
.option("inferSchema", "true")\
.load("data/retail-data/all/*.csv")\
.coalesce(5)
# cache: Persists the DataFrame with the default storage level (MEMORY_AND_DISK).
# check if dataframe is already cached - if not cache the dataframe
if not (df.storageLevel.useMemory) :
    df.cache()

In [None]:
# display dataframe content
df.show()

### Aggregation Functions


In [None]:
# import count function from library
from pyspark.sql.functions import count
# count the number of non-null values in column "StockCode" in this dataframe
df.select(count("StockCode")).show()

In [None]:
# import countDistinct function from library
from pyspark.sql.functions import countDistinct
# count the number of DISTINCT values in column "StockCode" in this dataframe
df.select(countDistinct("StockCode")).show() 

In [None]:
# import approx_count_distinct function from library
from pyspark.sql.functions import approx_count_distinct
# approx_count_distinct: Approximate distinct count is much faster at approximately counting the 
# distinct records rather than doing an exact count, which usually needs a lot of shuffles and other 
# operations. While the approximate count is not 100% accurate, many use cases can perform equally 
# well even without an exact count.
df.select(approx_count_distinct("StockCode", 0.1)).show() # 3364

In [None]:
# import first and last functions from library
from pyspark.sql.functions import first, last
# first: The function by default returns the first values it sees. It will return the first non-null value 
#        it sees when ignoreNulls is set to true. If all values are null, then null is returned.
# last: The function by default returns the last values it sees. It will return the last non-null value 
#       it sees when ignoreNulls is set to true. If all values are null, then null is returned.
df.select(first("StockCode"), last("StockCode")).show()

In [None]:
# import min and max functions from library
from pyspark.sql.functions import min, max
# min: returns the minimum value of the expression in a group.
# max: returns the maximum value of the expression in a group.
df.select(min("Quantity"), max("Quantity")).show()

In [None]:
# import sum function from library
from pyspark.sql.functions import sum
# sum: returns the sum of all values in the expression.
df.select(sum("Quantity")).show()

In [None]:
# The function sumDistinct is deprecated since Spark version 3.2.0: Use sum_distinct() instead.
# compare (major, minor) of the Spark version as integers and import the matching function
major, minor = (int(part) for part in spark.version.split(".")[:2])
if (major, minor) < (3, 2):
    # import sumDistinct function from library (spark version < 3.2)
    from pyspark.sql.functions import sumDistinct as sum_distinct
else:
    # import sum_distinct function from library (spark version >= 3.2)
    from pyspark.sql.functions import sum_distinct
# returns the sum of distinct values in the expression
df.select(sum_distinct("Quantity")).show()

In [None]:
from pyspark.sql import SparkSession
# May take a little while on a local computer
spark = SparkSession.builder.appName("Structured API").getOrCreate()

In [None]:
# import sum, count, avg and expr functions from library
from pyspark.sql.functions import sum, count, avg, expr
# count: returns the number of non-null values of the expression.
# sum: returns the sum of all values in the expression.
# avg: returns the average of the values in a group.
# mean: returns the average of the values in a group.
df.select(
count("Quantity").alias("total_transactions"),
sum("Quantity").alias("total_purchases"),
avg("Quantity").alias("avg_purchases"),
expr("mean(Quantity)").alias("mean_purchases"))\
.selectExpr(
"total_purchases", 
"total_transactions",
"avg_purchases",
"mean_purchases").show()

In [None]:
# import var_pop, stddev_pop, var_samp and stddev_samp functions from library
from pyspark.sql.functions import var_pop, stddev_pop, var_samp, stddev_samp
# var_pop: returns the population variance of the values in a group.
# stddev_pop: returns population standard deviation of the expression in a group.
# var_samp: returns the unbiased sample variance of the values in a group.
# stddev_samp: returns the unbiased sample standard deviation of the expression in a group.
df.select(var_pop("Quantity"), var_samp("Quantity"),
stddev_pop("Quantity"), stddev_samp("Quantity")).show()

In [None]:
# Skewness is the measure of the asymmetry of an ideally symmetric probability distribution and 
# is given by the third standardized moment.
# Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to 
# a normal distribution.

# import skewness and kurtosis functions from library
from pyspark.sql.functions import skewness, kurtosis
# skewness: returns the skewness of the values in a group.
# kurtosis: returns the kurtosis of the values in a group.
df.select(skewness("Quantity"), kurtosis("Quantity")).show()

In [None]:
# import corr, covar_pop and covar_samp functions from library
from pyspark.sql.functions import corr, covar_pop, covar_samp
# corr: Calculates the correlation of two columns of a DataFrame as a double value. 
# covar_pop: Returns a new Column for the population covariance of the parameters <col1> and <col2>.
# covar_samp: Returns a new Column for the sample covariance of the parameters <col1> and <col2>.
df.select(corr("InvoiceNo", "Quantity"), covar_samp("InvoiceNo", "Quantity"),
covar_pop("InvoiceNo", "Quantity")).show()

In [None]:
# import collect_set and collect_list functions from library
from pyspark.sql.functions import collect_set, collect_list
# collect_set: returns a set of objects with duplicate elements eliminated.
# collect_list: returns a list of objects with duplicates.
df.agg(collect_set("Country"), collect_list("Country")).show()

In [None]:
# groupBy: Groups the DataFrame using the specified columns.
# group the dataframe on columns "InvoiceNo" and "CustomerId" and add the number of rows in the groups
df.groupBy("InvoiceNo", "CustomerId").count().show()

In [None]:
# a different way to count the number of rows in the groups is to use the agg() function
# agg: Computes aggregates and returns the result as a DataFrame.

# count the number of rows in the groups in two different ways (explicit function and expression)
df.groupBy("InvoiceNo").agg(
count("Quantity").alias("quan"),
expr("count(Quantity)")).show()

In [None]:
# the agg function works with built-in aggregation functions, such as avg, max, min, sum, count, etc.
df.groupBy("InvoiceNo").agg(expr("avg(Quantity)"),expr("stddev_pop(Quantity)"))\
.show()

## Joins


In [None]:
# create three different dataframes
person = spark.createDataFrame([
(0, "Bill Chambers", 0, [100]),
(1, "Matei Zaharia", 1, [500, 250, 100]),
(2, "Michael Armbrust", 1, [250, 100])])\
.toDF("id", "name", "graduate_program", "spark_status")
graduateProgram = spark.createDataFrame([
(0, "Masters", "School of Information", "UC Berkeley"),
(2, "Masters", "EECS", "UC Berkeley"),
(1, "Ph.D.", "EECS", "UC Berkeley")])\
.toDF("id", "degree", "department", "school")
sparkStatus = spark.createDataFrame([
(500, "Vice President"),
(250, "PMC Member"),
(100, "Contributor")])\
.toDF("id", "status")

In [None]:
# display dataframe person
person.show()

In [None]:
# display dataframe graduateProgram
graduateProgram.show()

In [None]:
# display dataframe sparkStatus
sparkStatus.show()

In [None]:
# A join expression compares a key column of one dataframe with a key column of the other.
# This expression uses column "graduate_program" of dataframe "person" as key
# to be joined with column "id" of dataframe "graduateProgram".
# In Scala the same expression would be written as 'person("graduate_program") === graduateProgram("id")'.
joinExpression = person["graduate_program"] == graduateProgram['id']

In [None]:
# in this case the join expression compares non-key columns, so the join result would not
# make sense for the values; it is therefore a wrong join expression.
wrongJoinExpression = person["name"] == graduateProgram["school"]

In [None]:
# as SQL statement this would look like:
# SELECT person.id, person.name, person.graduate_program, person.spark_status, 
#        graduateProgram.id, graduateProgram.degree, graduateProgram.department, graduateProgram.school
# FROM person
# INNER JOIN graduateProgram ON person.graduate_program=graduateProgram.id
joinType = "inner"
person.join(graduateProgram, joinExpression, joinType).show()

In [None]:
# same SQL statement with FULL OUTER JOIN
joinType = "outer"
person.join(graduateProgram, joinExpression, joinType).show()

In [None]:
# LEFT OUTER JOIN, this time with graduateProgram as the left table
joinType = "left_outer"
graduateProgram.join(person, joinExpression, joinType).show()

In [None]:
# same SQL statement with RIGHT OUTER JOIN
joinType = "right_outer"
person.join(graduateProgram, joinExpression, joinType).show()

In [None]:
# A left semi join is similar to an inner join, with the difference that it returns only the columns
# of the left DataFrame/Dataset and ignores all columns of the right one. In other words, it returns
# the left rows that have a match in the right dataset on the join expression;
# rows without a match are dropped.
joinType = "left_semi"
graduateProgram.join(person, joinExpression, joinType).show()

In [None]:
# union: Returns a new dataframe containing the union of rows in this and another dataframe
# (columns are matched by position; duplicate rows are kept)
gradProgram2 = graduateProgram.union(spark.createDataFrame([
(0, "Masters", "Duplicated Row", "Duplicated School")]))
gradProgram2.show()

In [None]:
# A left anti join is the opposite of a left semi join: it returns only the columns
# of the left DataFrame/Dataset, for the rows that have no match in the right dataset.
joinType = "left_anti"
graduateProgram.join(person, joinExpression, joinType).show()

In [None]:
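# A cross join combines each row of the first table with each row of the second table.
# With a join expression, joinType "cross" only returns the matching pairs, like an inner join.
joinType = "cross"
graduateProgram.join(person, joinExpression, joinType).show()
# crossJoin without a join expression returns the full Cartesian product
graduateProgram.crossJoin(person).show()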
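
The difference between the left semi and left anti joins used in this section can be traced in plain Python on the same sample data. This is only a sketch of the semantics with lists of tuples (the `spark_status` column is left out); it does not use Spark.

```python
programs = [
    (0, "Masters", "School of Information", "UC Berkeley"),
    (2, "Masters", "EECS", "UC Berkeley"),
    (1, "Ph.D.", "EECS", "UC Berkeley"),
]
persons = [
    (0, "Bill Chambers", 0),
    (1, "Matei Zaharia", 1),
    (2, "Michael Armbrust", 1),
]

# join keys that occur on the right side (person.graduate_program)
right_keys = {graduate_program for _, _, graduate_program in persons}

# left semi: left rows with at least one match, each returned once, left columns only
left_semi = [p for p in programs if p[0] in right_keys]
# left anti: left rows without any match, left columns only
left_anti = [p for p in programs if p[0] not in right_keys]

print([p[0] for p in left_semi])  # [0, 1]
print([p[0] for p in left_anti])  # [2]
```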

# Spark SQL

To use SQL queries directly with the dataframe, you will need to register it to a temporary view:

In [None]:
# createOrReplaceTempView: Creates or replaces a local temporary view with this dataframe.
# The lifetime of this temporary view is tied to the SparkSession that was used to create this dataframe.

# read a json file and create a local temporary view in the current SparkSession
spark.read.json("data/flight-data/2015-summary.json")\
.createOrReplaceTempView("some_sql_view") # DF => SQL

In [None]:
# To issue any SQL query, use the sql() method on the SparkSession instance
# All spark.sql queries executed in this manner return a DataFrame on which you may perform 
# further Spark operations
spark.sql("""
SELECT DEST_COUNTRY_NAME, sum(count)
FROM some_sql_view GROUP BY DEST_COUNTRY_NAME
""")\
.where("DEST_COUNTRY_NAME like 'S%'").where("`sum(count)` > 10")\
.count() # SQL => DF

## Stop The Spark Session

In [None]:
# stop the underlying SparkContext.
try:
    spark
except NameError:
    print("Spark session does not exist - nothing to stop.")
else:
    spark.stop()

---
*Now you know the concept of Apache Spark, have understood RDDs, worked with dataframes and datasets, and used SQL-like queries. Let's test that knowledge and do a little exercise.*

**Next UP: [Exercise](./05_1_Exercise.ipynb)**