# Day 3 - A Deeper Insight into DataFrames
[pyspark Doc](https://spark.apache.org/docs/2.4.5/api/python/index.html)

My mission for today is to get a better understanding of `DataFrame` objects, their structure and how to operate on them. Since handling data of different shapes has always been a challenge in my life as an ETL developer, I start with the schema topic, which already came up yesterday.

## Schemas

So far, I've learned two things about schemas in Spark. First, they define the names and types of `DataFrame` `Columns`. However, Spark uses the internal types of its Catalyst engine regardless of the API language I'm using. Second, I can ask Spark to derive the schema from a source file or I can explicitly define the schema of the data I want to process.

Spark derives the schema by reading data from the file. If it only looks at a sample, that sample might not be representative of the entire dataset. So for production purposes it is probably a better idea to state explicitly how I expect the data I want to process and analyse to be shaped.
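
To convince myself why a sample can mislead, here is a minimal plain-Python sketch. It is not Spark's inference code: `infer_type()` is a hypothetical helper and the column values are made up for illustration.

```python
def infer_type(values):
    """Guess the narrowest type that fits every value it is shown (hypothetical helper)."""
    def is_int(text):
        try:
            int(text)
            return True
        except ValueError:
            return False

    return "integer" if all(is_int(v) for v in values) else "string"


# Made-up column: the first rows look numeric, a later row does not.
count_column = ["15", "1", "344", "n/a", "62"]

sample_guess = infer_type(count_column[:3])  # only the first three values
full_guess = infer_type(count_column)        # every value

print(sample_guess, full_guess)
```

The guess based on the sample is narrower than what the full column supports, which is exactly the risk an explicit schema avoids.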

What do schemas look like in Spark? The `printSchema()` function will help me.

In [1]:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession\
   .builder\
   .getOrCreate()

spark.read\
   .option("header", "true")\
   .option("inferSchema", "true")\
   .format("csv")\
   .load("./data/flight-data/2015-summary.csv")\
   .printSchema()

root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: integer (nullable = true)



In [2]:
spark.read\
   .option("inferSchema", "true")\
   .format("json")\
   .load("./data/flight-data/2015-summary.json")\
   .printSchema()

root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: long (nullable = true)



Ok, the first thing I notice here is that, even though the two files have different formats, CSV vs. JSON, they have nearly identical schemas, i.e. the same column names and almost the same types. The only difference is that Spark infers the `count` column as `integer` in the CSV file and as `long` in the JSON file.



To find out how to define a schema in Spark, I have a look at how Spark does it. Note that without the `inferSchema` option, every CSV column is read as a string.

In [3]:
spark.read\
   .option("header", "true")\
   .format("csv")\
   .load("./data/flight-data/2015-summary.csv")\
   .schema

StructType(List(StructField(DEST_COUNTRY_NAME,StringType,true),StructField(ORIGIN_COUNTRY_NAME,StringType,true),StructField(count,StringType,true)))

In [4]:
spark.read\
   .format("json")\
   .load("./data/flight-data/2015-summary.json")\
   .schema

StructType(List(StructField(DEST_COUNTRY_NAME,StringType,true),StructField(ORIGIN_COUNTRY_NAME,StringType,true),StructField(count,LongType,true)))

Now I have the blueprints to create similar schemas by myself, e.g. to declare the "count" column as mandatory (not nullable). One tricky aspect here is that, to define a schema, I need to call the constructors of the Spark internal types, e.g. `StringType()`; just using the type name like `StringType` won't work.

In [5]:
from pyspark.sql.types import StructField, StructType, StringType, LongType

myOwnCsv = StructType([
    StructField("DEST_COUNTRY_NAME",StringType(),True),
    StructField("ORIGIN_COUNTRY_NAME",StringType(),True),
    StructField("count",StringType(),False)
])

myOwnJson = StructType([
    StructField("DEST_COUNTRY_NAME",StringType(),True),
    StructField("ORIGIN_COUNTRY_NAME",StringType(),True),
    StructField("count",LongType(),False)
])

My schema definitions match the ones Spark produced except for the nullability of "count", so the file load should also work when I enforce them explicitly by calling the `schema()` function.

In [6]:
spark.read\
   .option("header", "true")\
   .format("csv")\
   .schema(myOwnCsv)\
   .load("./data/flight-data/2015-summary.csv")\
   .show(5)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
|    United States|            Ireland|  344|
|            Egypt|      United States|   15|
|    United States|              India|   62|
+-----------------+-------------------+-----+
only showing top 5 rows



In [7]:
spark.read\
   .format("json")\
   .schema(myOwnJson)\
   .load("./data/flight-data/2015-summary.json")\
   .show(5)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
|    United States|            Ireland|  344|
|            Egypt|      United States|   15|
|    United States|              India|   62|
+-----------------+-------------------+-----+
only showing top 5 rows



And they do.
## Columns and Expressions
The `DataFrame` API provides two ways to create `Column` objects, using either the `col()` or the `expr()` function, which confuses me at first sight. If I apply both of them to the "count" column of my test data, I get exactly the same result.

In [8]:
from pyspark.sql.functions  import col, expr
df = spark.read\
   .option("header", "true")\
   .format("csv")\
   .schema(myOwnCsv)\
   .load("./data/flight-data/2015-summary.csv")

In [9]:
df.select("ORIGIN_COUNTRY_NAME", "count", col("count") -10).show(5)

+-------------------+-----+------------+
|ORIGIN_COUNTRY_NAME|count|(count - 10)|
+-------------------+-----+------------+
|            Romania|   15|         5.0|
|            Croatia|    1|        -9.0|
|            Ireland|  344|       334.0|
|      United States|   15|         5.0|
|              India|   62|        52.0|
+-------------------+-----+------------+
only showing top 5 rows



In [10]:
df.select("ORIGIN_COUNTRY_NAME", "count", expr("count - 10")).show(5)

+-------------------+-----+------------+
|ORIGIN_COUNTRY_NAME|count|(count - 10)|
+-------------------+-----+------------+
|            Romania|   15|         5.0|
|            Croatia|    1|        -9.0|
|            Ireland|  344|       334.0|
|      United States|   15|         5.0|
|              India|   62|        52.0|
+-------------------+-----+------------+
only showing top 5 rows



Ok Spark, explain to me what's going on behind the scenes.

In [11]:
df.select("ORIGIN_COUNTRY_NAME", "count", col("count") - 10).explain()

== Physical Plan ==
*(1) Project [ORIGIN_COUNTRY_NAME#95, count#96, (cast(count#96 as double) - 10.0) AS (count - 10)#132]
+- *(1) FileScan csv [ORIGIN_COUNTRY_NAME#95,count#96] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/home/oli/github/pyspark-tutorial/data/flight-data/2015-summary.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<ORIGIN_COUNTRY_NAME:string,count:string>


In [12]:
df.select("ORIGIN_COUNTRY_NAME", "count", expr("count - 10")).explain()

== Physical Plan ==
*(1) Project [ORIGIN_COUNTRY_NAME#95, count#96, (cast(count#96 as double) - 10.0) AS (count - 10)#136]
+- *(1) FileScan csv [ORIGIN_COUNTRY_NAME#95,count#96] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/home/oli/github/pyspark-tutorial/data/flight-data/2015-summary.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<ORIGIN_COUNTRY_NAME:string,count:string>


In [13]:
df.selectExpr("ORIGIN_COUNTRY_NAME", "count", "count - 10").explain()

== Physical Plan ==
*(1) Project [ORIGIN_COUNTRY_NAME#95, count#96, (cast(count#96 as double) - 10.0) AS (count - 10)#140]
+- *(1) FileScan csv [ORIGIN_COUNTRY_NAME#95,count#96] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/home/oli/github/pyspark-tutorial/data/flight-data/2015-summary.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<ORIGIN_COUNTRY_NAME:string,count:string>


In [14]:
df.createOrReplaceTempView("table")
spark.sql("""SELECT ORIGIN_COUNTRY_NAME, count, count -10 FROM table""").explain()

== Physical Plan ==
*(1) Project [ORIGIN_COUNTRY_NAME#95, count#96, (cast(count#96 as double) - 10.0) AS (CAST(count AS DOUBLE) - CAST(10 AS DOUBLE))#144]
+- *(1) FileScan csv [ORIGIN_COUNTRY_NAME#95,count#96] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/home/oli/github/pyspark-tutorial/data/flight-data/2015-summary.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<ORIGIN_COUNTRY_NAME:string,count:string>


As I've learned on day 2, any functional transformation on `DataFrame` objects is equivalent to an SQL query on a table, so `col("columnName")` is equivalent to a projection in relational algebra, i.e. take the named data element (here the third one) of each record (row). So in the end there are three equivalent ways to express transformations in Spark:
* as a composition of functions: `select(col("count") - 10)`
* as an expression string: `select(expr("count - 10"))` or `selectExpr("count - 10")`
* as an SQL query: `sql("""SELECT count - 10 FROM table""")`
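
The equivalence of a functional projection and an SQL query is not specific to Spark. As a rough analogy only, here is a sketch using Python's built-in `sqlite3` module instead of Spark; the table name `flights` is my own choice, and the three rows are copied from the output above with the counts stored as integers.

```python
import sqlite3

# Mini version of the flight data (origin country, count).
rows = [("Romania", 15), ("Croatia", 1), ("Ireland", 344)]

# Functional style: project the second element of each record and subtract 10.
functional = [count - 10 for _, count in rows]

# SQL style: the same projection as a query on an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE flights (origin TEXT, "count" INTEGER)')
conn.executemany("INSERT INTO flights VALUES (?, ?)", rows)
declarative = [r[0] for r in conn.execute('SELECT "count" - 10 FROM flights ORDER BY rowid')]
conn.close()

print(functional)
print(declarative)
```

Both styles describe the same result; only the notation differs.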

## DataFrame Metadata
`DataFrame` objects have attributes describing their metadata. The attribute `columns` provides a list of all column names of the given `DataFrame`. Python lists are iterable objects, so I can use this attribute to iterate over all column names of a `DataFrame`.

In [15]:
df.columns

['DEST_COUNTRY_NAME', 'ORIGIN_COUNTRY_NAME', 'count']

If I need the data types as well, I can use the attribute `dtypes` instead.

In [16]:
df.dtypes

[('DEST_COUNTRY_NAME', 'string'),
 ('ORIGIN_COUNTRY_NAME', 'string'),
 ('count', 'string')]

The `count()` method returns the number of rows in the `DataFrame`.

In [17]:
df.count()

256

Some basic statistics about the column values are also available for data profiling.

In [18]:
df.describe().show()

+-------+-----------------+-------------------+------------------+
|summary|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|             count|
+-------+-----------------+-------------------+------------------+
|  count|              256|                256|               256|
|   mean|             null|               null|       1770.765625|
| stddev|             null|               null|23126.516918551915|
|    min|          Algeria|             Angola|                 1|
|    max|           Zambia|            Vietnam|               986|
+-------+-----------------+-------------------+------------------+



In [19]:
df.summary().show()

+-------+-----------------+-------------------+------------------+
|summary|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|             count|
+-------+-----------------+-------------------+------------------+
|  count|              256|                256|               256|
|   mean|             null|               null|       1770.765625|
| stddev|             null|               null|23126.516918551915|
|    min|          Algeria|             Angola|                 1|
|    25%|             null|               null|              14.0|
|    50%|             null|               null|              63.0|
|    75%|             null|               null|             268.0|
|    max|           Zambia|            Vietnam|               986|
+-------+-----------------+-------------------+------------------+
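
One detail in these statistics puzzles me at first: the reported `max` of "count" (986) is smaller than its mean. Since "count" was loaded as a string column, the likely reason is that `min` and `max` compare the values as text. A small plain-Python sketch with made-up values shows the effect:

```python
# Hypothetical count values, stored as strings like in the CSV-based DataFrame.
counts_as_text = ["15", "1", "344", "986", "370002"]

# Strings compare character by character, so "986" beats "370002" ('9' > '3').
text_max = max(counts_as_text)

# After casting to numbers the comparison is numeric.
numeric_max = max(int(c) for c in counts_as_text)

print(text_max, numeric_max)
```

So for meaningful minimum and maximum values the column should be read with a numeric type.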



## Records and Rows

Records are logical sets of related data values. In Spark they are technically represented by `Row` objects.

Spark `DataFrame` objects are collections of `Row` objects all having the same structure according to the schema of the `DataFrame`. I can create a `Row` object by myself, even independently of any schema or the existence of a `DataFrame`, which is not possible for `Column` objects.

In [20]:
from pyspark.sql import Row
myOwnRow = Row(42, "is the answer to all questions", True)
type(myOwnRow)

pyspark.sql.types.Row

There is no `DataFrame` and therefore no schema yet, but I have a row now. As with Python lists, I can access the `Row` data elements by their positional index, starting with 0 for the first element.

In [21]:
myOwnRow[1]

'is the answer to all questions'
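
In this respect a `Row` feels much like a Python tuple. As a rough analogy, here is a sketch with `collections.namedtuple` from the standard library, not the pyspark class itself; the type name `Record` and its field names are hypothetical.

```python
from collections import namedtuple

# Hypothetical record type standing in for a row with three fields.
Record = namedtuple("Record", ["id", "message", "flag"])
my_record = Record(42, "is the answer to all questions", True)

second_by_position = my_record[1]   # positional access, index starts at 0
second_by_name = my_record.message  # access by field name
as_plain_tuple = tuple(my_record)   # the record is still an ordinary tuple

print(second_by_position)
```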

Well, the schema comes back into play as soon as I want to collect multiple records into the same `DataFrame`, because then all records must comply with the same schema.

Besides creating DataFrames on the fly from source files, as I did so far, I can also create a `DataFrame` on my own. This becomes relevant when there is no data source because I created the data myself, e.g. simulated scenario data. All I need to do is create a schema, as I did earlier, and some `Row` objects like the one I just created, and compile both into a `DataFrame`.

In [22]:
from pyspark.sql.types import LongType, StringType, BooleanType

myOwnSchema = StructType([
    StructField("ID", LongType(), True),
    StructField("Message", StringType(), True),
    StructField("is true or false", BooleanType(), True),
])

In [23]:
mySecondRow = Row(73, "is a prime number", True)

In [24]:
myDF = spark.createDataFrame([myOwnRow, mySecondRow], myOwnSchema)
myDF.show()

+---+--------------------+----------------+
| ID|             Message|is true or false|
+---+--------------------+----------------+
| 42|is the answer to ...|            true|
| 73|   is a prime number|            true|
+---+--------------------+----------------+



The `DataFrame` constructor `createDataFrame()` accepts a list of `Row` objects, so I don't have to insert rows manually one by one but can hand over a list of objects which I may have generated somewhere else in my code.

> **Note:** 
> By default, `show()` truncates strings longer than 20 characters. I can switch this off by setting the `truncate` parameter to `False`, or I can 
> pass a number to define a different character limit.

In [25]:
rowList = [myOwnRow, mySecondRow]
myDF = spark.createDataFrame(rowList, myOwnSchema)
myDF.show(truncate=False)

+---+------------------------------+----------------+
|ID |Message                       |is true or false|
+---+------------------------------+----------------+
|42 |is the answer to all questions|true            |
|73 |is a prime number             |true            |
+---+------------------------------+----------------+



In [26]:
myDF.show(truncate=25)

+---+-------------------------+----------------+
| ID|                  Message|is true or false|
+---+-------------------------+----------------+
| 42|is the answer to all q...|            true|
| 73|        is a prime number|            true|
+---+-------------------------+----------------+
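
The cut-off rule can be read off the two truncated outputs above: the shortened text plus the three dots is exactly as long as the limit. The following sketch reproduces that behaviour in plain Python. `truncate_cell()` is a hypothetical helper written to match the outputs shown here, not code taken from Spark, and the handling of very small limits is my assumption.

```python
def truncate_cell(text, limit=20):
    """Shorten `text` to at most `limit` characters, ending in '...' (sketch)."""
    if limit <= 0 or len(text) <= limit:
        return text
    if limit < 4:
        # Assumption: too short for an ellipsis, so simply cut the string.
        return text[:limit]
    return text[:limit - 3] + "..."


message = "is the answer to all questions"
default_cut = truncate_cell(message)      # limit of 20 characters
wider_cut = truncate_cell(message, 25)    # limit of 25 characters
untouched = truncate_cell("is a prime number")

print(default_cut)
print(wider_cut)
```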



## DataFrame Transformations
Even though rows are individual objects, Spark does not manipulate them individually. To keep mass data processing fast, `Row` objects are manipulated by column expressions applied to a `DataFrame`, which is in fact a collection of `Row` objects. Again, this is equivalent to the relational algebra behind SQL SELECT queries. Here are some common examples I'm familiar with from SQL, but what do they look like in the functional format?
### Adding a Column with Literal Values

In [27]:
from pyspark.sql.functions import lit

myDF\
   .select(col("ID"), col("Message"))\
   .withColumn("literal", lit(23))\
   .show(truncate=False)

+---+------------------------------+-------+
|ID |Message                       |literal|
+---+------------------------------+-------+
|42 |is the answer to all questions|23     |
|73 |is a prime number             |23     |
+---+------------------------------+-------+



Alternatively, I could use `alias()` instead and create the new column inside of `select()`.

In [28]:
myDF\
   .select(col("ID"), col("Message"), lit(23).alias("literal"))\
   .show(truncate=False)

+---+------------------------------+-------+
|ID |Message                       |literal|
+---+------------------------------+-------+
|42 |is the answer to all questions|23     |
|73 |is a prime number             |23     |
+---+------------------------------+-------+



I appreciate the readability of `withColumn()`, especially in more complex queries. On the other hand, it is less flexible because new columns always get appended on the right, whereas with the second approach, using `alias()`, I can put the new `Column` at any position in the `DataFrame`.

In [29]:
myDF\
   .select(col("ID"), lit(23).alias("literal"), col("Message"))\
   .show(truncate=False)

+---+-------+------------------------------+
|ID |literal|Message                       |
+---+-------+------------------------------+
|42 |23     |is the answer to all questions|
|73 |23     |is a prime number             |
+---+-------+------------------------------+



### Adding a Calculated Column

In [30]:
myDF\
   .select(col("ID"), col("Message"))\
   .withColumn("Calculated ID", col("ID") * 100)\
   .show(truncate=False)

+---+------------------------------+-------------+
|ID |Message                       |Calculated ID|
+---+------------------------------+-------------+
|42 |is the answer to all questions|4200         |
|73 |is a prime number             |7300         |
+---+------------------------------+-------------+



In [31]:
myDF\
   .select(col("ID"), col("Message"), (col("ID") * 100).alias("ID calculated"))\
   .show(truncate=False)

+---+------------------------------+-------------+
|ID |Message                       |ID calculated|
+---+------------------------------+-------------+
|42 |is the answer to all questions|4200         |
|73 |is a prime number             |7300         |
+---+------------------------------+-------------+



In both versions the new column is the result of three consecutive steps:
* `col("ID")`, i.e. take the first data element of each row in the input DataFrame
* `* 100`, i.e. multiply each data value by 100
* `alias("ID calculated")`, or the name argument of `withColumn()`, to name the new column

The first two manipulate data, whereas the third one manipulates the metadata of the `DataFrame`.

### Renaming Columns

In [32]:
myDF\
   .select(col("ID"), col("Message"))\
   .withColumnRenamed("ID", "Identifier")\
   .show(truncate=False)

+----------+------------------------------+
|Identifier|Message                       |
+----------+------------------------------+
|42        |is the answer to all questions|
|73        |is a prime number             |
+----------+------------------------------+



In [33]:
myDF\
   .select(col("ID").alias("Identifier"), col("Message"))\
   .show(truncate=False)

+----------+------------------------------+
|Identifier|Message                       |
+----------+------------------------------+
|42        |is the answer to all questions|
|73        |is a prime number             |
+----------+------------------------------+



In my opinion, the Spark syntax regarding column names is quite misleading. Even though I have to put the column name argument in quotation marks, and even though the output shows the column name with exactly the same upper/lower case as I typed it, Spark is **NOT** case sensitive by default.

In [34]:
myDF\
   .select(col("Message"), col("MESSAGE"))\
   .show(truncate=False)

+------------------------------+------------------------------+
|Message                       |MESSAGE                       |
+------------------------------+------------------------------+
|is the answer to all questions|is the answer to all questions|
|is a prime number             |is a prime number             |
+------------------------------+------------------------------+



However, I can switch to case sensitivity by changing the `SparkSession` configuration.

In [35]:
spark.conf.set("spark.sql.caseSensitive", "true")

In [36]:
myDF\
   .select(col("ID"), (col("ID") * 100).alias("id"))\
   .select(col("id"))\
   .show()

+----+
|  id|
+----+
|4200|
|7300|
+----+
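
To make the difference between the two modes tangible, here is a plain-Python sketch of how I picture the name lookup. `resolve_column()` is a hypothetical helper, not Spark's analyzer; it only illustrates that `ID` and `id` can be told apart when matching is case sensitive, and cannot otherwise.

```python
def resolve_column(name, columns, case_sensitive=False):
    """Return the stored column name that `name` refers to (hypothetical helper)."""
    if case_sensitive:
        matches = [c for c in columns if c == name]
    else:
        matches = [c for c in columns if c.lower() == name.lower()]
    if len(matches) != 1:
        raise LookupError(f"cannot resolve {name!r}: {len(matches)} matches in {columns}")
    return matches[0]


columns = ["ID", "id"]

# Case sensitive: "id" points to exactly one column.
exact_match = resolve_column("id", columns, case_sensitive=True)

# Case insensitive: "id" matches both columns, so the reference is ambiguous.
try:
    resolve_column("id", columns, case_sensitive=False)
    ambiguous = False
except LookupError:
    ambiguous = True

# Case insensitive lookup of a unique name works regardless of spelling.
relaxed_match = resolve_column("MESSAGE", ["ID", "Message"])

print(exact_match, ambiguous, relaxed_match)
```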



Column names with reserved characters or keywords can also be very tricky because, as I've seen before, Spark does not interpret everything in quotation marks literally as a string.

In [37]:
dfWithSpecialNames = myDF\
   .select(col("ID"), col("Message"))\
   .withColumn("ID * by 100", col("ID") * 100)

dfWithSpecialNames.show(truncate=False)

+---+------------------------------+-----------+
|ID |Message                       |ID * by 100|
+---+------------------------------+-----------+
|42 |is the answer to all questions|4200       |
|73 |is a prime number             |7300       |
+---+------------------------------+-----------+



The question is how I can reference the third column without running into problems. First of all, the direct string-to-column reference using `col()` works fine.

In [38]:
dfWithSpecialNames\
   .select(col("ID * by 100"))\
   .show()

+-----------+
|ID * by 100|
+-----------+
|       4200|
|       7300|
+-----------+



In [39]:
dfWithSpecialNames\
   .select(col("ID * by 100") * 100)\
   .show()

+-------------------+
|(ID * by 100 * 100)|
+-------------------+
|             420000|
|             730000|
+-------------------+



Does this reference also work in expression strings?

In [40]:
dfWithSpecialNames\
   .selectExpr("ID * by 100  * 100")\
   .show()

ParseException: "\nmismatched input '100' expecting <EOF>(line 1, pos 8)\n\n== SQL ==\nID * by 100  * 100\n--------^^^\n"

Oops, something broke. Now it is ambiguous which parts of the expression string are operators and which are part of the column name. This problem becomes very relevant when it comes to filter or join conditions, so I need a solution. Spark provides the backtick character to escape column names in expression strings.

In [41]:
dfWithSpecialNames\
   .selectExpr("`ID * by 100`  * 100")\
   .show()

+-------------------+
|(ID * by 100 * 100)|
+-------------------+
|             420000|
|             730000|
+-------------------+
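
If expression strings are assembled in code, the escaping can be wrapped in a small helper. This is a sketch under assumptions: `quote_column()` is a hypothetical function of my own, it does not know about reserved keywords, and doubling a backtick inside a name is my assumption about the escape rule.

```python
import re

# A "plain" name: letters, digits and underscores, not starting with a digit.
_SIMPLE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def quote_column(name):
    """Wrap a column name in backticks unless it is a plain identifier (hypothetical helper)."""
    if _SIMPLE_NAME.fullmatch(name):
        return name
    # Assumption: a backtick inside the name is escaped by doubling it.
    return "`" + name.replace("`", "``") + "`"


expression = quote_column("ID * by 100") + " * 100"
plain = quote_column("ID")

print(expression)
```

The resulting string is the same one I passed to `selectExpr()` by hand, apart from the extra blank.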



Well, my lesson learned here is: avoid blanks, reserved characters or keywords in column names whenever possible.
### Removing Columns
In the SQL world, there is a shortcut to grab all columns using the wildcard, which is very convenient when querying a table with, e.g., 100 columns. What I'm really missing there is the option to exclude just one or two columns and get the remaining 98 columns without having to list them all in the SELECT clause. So I'm very happy to see that Spark provides this feature with the `drop()` function.

In [42]:
dfWithSpecialNames.drop("ID * by 100").show(truncate=False)

+---+------------------------------+
|ID |Message                       |
+---+------------------------------+
|42 |is the answer to all questions|
|73 |is a prime number             |
+---+------------------------------+



### Filtering Rows
Spark provides four ways of filtering rows in total. First, the two methods `filter()` and `where()` are synonyms in pyspark. Second, the filter condition can be either a boolean column created by column manipulation or an expression string that evaluates to true or false.

In [43]:
dfWithSpecialNames.filter(col("ID") == 42).show(truncate=False)

+---+------------------------------+-----------+
|ID |Message                       |ID * by 100|
+---+------------------------------+-----------+
|42 |is the answer to all questions|4200       |
+---+------------------------------+-----------+



In [44]:
dfWithSpecialNames.filter("ID = 42").show(truncate=False)

+---+------------------------------+-----------+
|ID |Message                       |ID * by 100|
+---+------------------------------+-----------+
|42 |is the answer to all questions|4200       |
+---+------------------------------+-----------+



In [45]:
dfWithSpecialNames.where(col("ID") == 42).show(truncate=False)

+---+------------------------------+-----------+
|ID |Message                       |ID * by 100|
+---+------------------------------+-----------+
|42 |is the answer to all questions|4200       |
+---+------------------------------+-----------+



In [46]:
dfWithSpecialNames.where("ID == 42").show(truncate=False)

+---+------------------------------+-----------+
|ID |Message                       |ID * by 100|
+---+------------------------------+-----------+
|42 |is the answer to all questions|4200       |
+---+------------------------------+-----------+



I think the last version looks most familiar to me as a senior SQL user. A nice feature is that I can chain multiple filters, which are combined with AND and are more readable than one complex inline expression.

In [47]:
dfWithSpecialNames\
    .where("ID == 42")\
    .where("`ID * by 100`== 4200")\
    .show(truncate=False)

+---+------------------------------+-----------+
|ID |Message                       |ID * by 100|
+---+------------------------------+-----------+
|42 |is the answer to all questions|4200       |
+---+------------------------------+-----------+



In [48]:
dfWithSpecialNames\
    .where("ID == 42 and `ID * by 100`== 4200")\
    .show(truncate=False)

+---+------------------------------+-----------+
|ID |Message                       |ID * by 100|
+---+------------------------------+-----------+
|42 |is the answer to all questions|4200       |
+---+------------------------------+-----------+



Nevertheless, as soon as OR comes into play, there is no choice anymore: the condition has to go into a single expression.

In [49]:
dfWithSpecialNames\
    .where("ID == 42 or ID == 73")\
    .show(truncate=False)

+---+------------------------------+-----------+
|ID |Message                       |ID * by 100|
+---+------------------------------+-----------+
|42 |is the answer to all questions|4200       |
|73 |is a prime number             |7300       |
+---+------------------------------+-----------+



Filter expressions can get quite complex and hard to read. Maybe I want to reuse them multiple times in my code, and rewriting them every time is a pain. Fortunately, I have the option to predefine filter expressions and give them more self-explanatory names from a business point of view. That way I can reference these filters by name in my queries, which makes the queries much more readable and the filter expressions reusable.

So my question as a Sales Manager could be: which invoices from the Scandinavian market contain high-volume orders?

In [50]:
from pyspark.sql.functions import col, instr

# loading data from CSV source
dfRetail = spark.read\
    .format("csv")\
    .option("header", "true")\
    .option("inferSchema", "true")\
    .load("./data/retail-data/by-day/*.csv")

# named filter expressions evaluating to Boolean
highVolumeOrder = (col("Quantity") >= 40) & (col("UnitPrice") >= 1.50)
Scandinavia = (col("Country") == "Norway") \
               | (col("Country") == "Sweden") \
               | (col("Country") == "Denmark") \
               | (col("Country") == "Finland")

# the query answering my question
dfRetail.where(highVolumeOrder & Scandinavia)\
    .select("InvoiceNo", "Country")\
    .distinct()\
    .show(10)

+---------+-------+
|InvoiceNo|Country|
+---------+-------+
|   540040| Sweden|
|   560549|Finland|
|   580704| Sweden|
|   559145| Sweden|
|   546780|Denmark|
|   541756|Finland|
|   538003|Denmark|
|   536532| Norway|
|   576083| Sweden|
|   552957| Sweden|
+---------+-------+
only showing top 10 rows
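
The idea of named, reusable filters is independent of Spark. As a plain-Python sketch with made-up invoice lines (the invoice numbers are hypothetical, the column names follow the retail data), the same two business rules can be written as named predicates and combined:

```python
# Hypothetical invoice lines; only the column names come from the retail data.
invoice_lines = [
    {"InvoiceNo": "A1", "Country": "Sweden", "Quantity": 48, "UnitPrice": 1.79},
    {"InvoiceNo": "A2", "Country": "Sweden", "Quantity": 6, "UnitPrice": 2.55},
    {"InvoiceNo": "A3", "Country": "Germany", "Quantity": 50, "UnitPrice": 3.00},
    {"InvoiceNo": "A4", "Country": "Norway", "Quantity": 40, "UnitPrice": 1.50},
]


def high_volume_order(row):
    """Business rule: at least 40 items at a unit price of at least 1.50."""
    return row["Quantity"] >= 40 and row["UnitPrice"] >= 1.50


def scandinavia(row):
    """Business rule: the order comes from one of the four Scandinavian markets."""
    return row["Country"] in {"Norway", "Sweden", "Denmark", "Finland"}


def both(*predicates):
    """Combine named predicates with AND."""
    return lambda row: all(p(row) for p in predicates)


wanted = both(high_volume_order, scandinavia)
matches = sorted({row["InvoiceNo"] for row in invoice_lines if wanted(row)})

print(matches)
```

The query line reads like the business question, and each rule can be tested and reused on its own.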



### Random Samples and Split

In [51]:
import pyspark
import random

from pyspark.sql import SparkSession
spark = SparkSession\
   .builder\
   .getOrCreate()

df = spark.read\
   .format("csv")\
   .option("header", "true")\
   .option("inferSchema", "true")\
   .load("./data/retail-data/by-day/*.csv")

df.count()

541909

In [52]:
seed = random.randint(0, 2**31 - 1)  # random.seed() returns None, so draw an explicit integer seed
withReplacement = False
fraction = 0.1 # i.e. 10%

In [53]:
df.sample(withReplacement, fraction, seed).count()

54425

In [54]:
df.sample(withReplacement, fraction, seed).show(5)

+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|   580540|    23358|HOT STUFF HOT WAT...|       6|2011-12-05 08:49:00|     3.75|   13417.0|United Kingdom|
|   580541|    22586|FELTCRAFT HAIRBAN...|      12|2011-12-05 09:03:00|     0.85|   15358.0|United Kingdom|
|   580541|    22047|    EMPIRE GIFT WRAP|      25|2011-12-05 09:03:00|     0.42|   15358.0|United Kingdom|
|   580541|    21703|BAG 125g SWIRLY M...|      12|2011-12-05 09:03:00|     0.42|   15358.0|United Kingdom|
|   580541|    21679|    SKULLS  STICKERS|      12|2011-12-05 09:03:00|     0.85|   15358.0|United Kingdom|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
only showing top 5 rows
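
The sample above contains roughly, but not exactly, 10% of the rows. My assumption is that sampling without replacement keeps each row independently with the given probability, so the fraction is an expectation rather than a guarantee. A plain-Python sketch of that idea (`bernoulli_sample()` is a hypothetical helper, not Spark's sampler):

```python
import random


def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction` (sketch)."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


rows = range(100_000)  # stand-in for the rows of a DataFrame

first = bernoulli_sample(rows, 0.1, seed=42)
second = bernoulli_sample(rows, 0.1, seed=42)  # same seed, same sample

print(len(first))
```

With a fixed integer seed the sample is reproducible, which is the reason for passing a seed in the first place.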



Spark's `randomSplit()` feature can be very useful if I want to train a neural network model, where randomly splitting a dataset into two samples, the training data and the test data, is an important data preparation step.

In [55]:
seed = random.randint(0, 2**31 - 1)  # explicit integer seed; random.seed() would return None
mlData = df.randomSplit([0.8, 0.2], seed)
trainingData = mlData[0]
testData = mlData[1]
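
To picture what a weighted random split does, here is a plain-Python sketch. It is my own simplified model, not Spark's implementation: every row gets one random draw and lands in the split whose weight range contains that draw. `random_split()` is a hypothetical helper.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to exactly one split, with probabilities given by `weights` (sketch)."""
    total = sum(weights)
    # Cumulative upper bounds, e.g. [0.8, 1.0] for weights [0.8, 0.2].
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for index, upper in enumerate(bounds):
            # The last split also catches rounding leftovers.
            if draw < upper or index == len(bounds) - 1:
                splits[index].append(row)
                break
    return splits


training_rows, test_rows = random_split(range(10_000), [0.8, 0.2], seed=7)

print(len(training_rows), len(test_rows))
```

The split sizes are only approximately 80% and 20%, but no row is lost or duplicated.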

### Other DataFrame Transformations
The following functions are nearly the same as I already know from SQL. So I skip them here.

* `df1.count()`
* `df1.distinct().count()`
* `df1.union(df2)`
* `df1.orderBy(... desc, ... asc)`

### Repartitioning Data
As I've learned on day 2, wide transformations force Spark to shuffle data across the cluster, which can end up in performance issues since Spark cannot do it purely in memory. Therefore I can improve performance if I partition the data according to my expected query patterns.

Ok, let's see how Spark has partitioned the data I've read from the daily retail CSV files.

In [56]:
df.rdd.getNumPartitions()

12

Currently there are 12 partitions. I want to optimize the partitioning for queries on stock codes. How many stock codes do I have?

In [57]:
df.select("StockCode").distinct().count()

4070

One option would be to raise the number of partitions. But this doesn't help if I don't have enough CPU cores.

In [58]:
# df.repartition(100)

Furthermore, just raising the number of partitions would not ensure that all rows with the same StockCode value reside in the same partition. The latter is the crucial point to gain partition locality for query processing. So I have to define the StockCode column as the partitioning criterion.

In [59]:
df.repartition("StockCode")

DataFrame[InvoiceNo: string, StockCode: string, Description: string, Quantity: int, InvoiceDate: timestamp, UnitPrice: double, CustomerID: double, Country: string]

Nevertheless, if I had some more CPU cores at hand, I could additionally raise the number of partitions.

In [60]:
df.repartition(20, "StockCode")

DataFrame[InvoiceNo: string, StockCode: string, Description: string, Quantity: int, InvoiceDate: timestamp, UnitPrice: double, CustomerID: double, Country: string]
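
Why does partitioning by a column put equal values into the same partition? The usual idea is hash partitioning: the partition number is derived from a hash of the key. The sketch below shows this in plain Python with made-up rows. `partition_for()` is a hypothetical helper, and `zlib.crc32` is only a stand-in hash that is stable across runs; Spark uses its own hash function.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    """Map a key to a partition number via a stable hash (stand-in for Spark's hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Made-up (InvoiceNo, StockCode) pairs.
rows = [
    ("580538", "23084"),
    ("580538", "23077"),
    ("580541", "23084"),
    ("580541", "22586"),
    ("580540", "23077"),
]

# Distribute the rows over 20 partitions by StockCode.
partitions = defaultdict(list)
for invoice_no, stock_code in rows:
    partitions[partition_for(stock_code, 20)].append((invoice_no, stock_code))

# Record in which partitions each stock code ended up.
partitions_per_code = defaultdict(set)
for partition_id, members in partitions.items():
    for _, stock_code in members:
        partitions_per_code[stock_code].add(partition_id)

print(dict(partitions_per_code))
```

Every stock code maps to exactly one partition, although one partition may hold several stock codes.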

### Collecting Rows to Local Driver
Especially when I do some ad-hoc analysis of the data on my local machine, it is quite useful to limit the data I'm fetching from the cluster.

`collect()` returns all the records as a list of `Row` objects. In Python, lists are iterable sequences.

In [61]:
df.collect()

[Row(InvoiceNo='580538', StockCode='23084', Description='RABBIT NIGHT LIGHT', Quantity=48, InvoiceDate=datetime.datetime(2011, 12, 5, 8, 38), UnitPrice=1.79, CustomerID=14075.0, Country='United Kingdom'),
 Row(InvoiceNo='580538', StockCode='23077', Description='DOUGHNUT LIP GLOSS ', Quantity=20, InvoiceDate=datetime.datetime(2011, 12, 5, 8, 38), UnitPrice=1.25, CustomerID=14075.0, Country='United Kingdom'),
 Row(InvoiceNo='580538', StockCode='22906', Description='12 MESSAGE CARDS WITH ENVELOPES', Quantity=24, InvoiceDate=datetime.datetime(2011, 12, 5, 8, 38), UnitPrice=1.65, CustomerID=14075.0, Country='United Kingdom'),
 Row(InvoiceNo='580538', StockCode='21914', Description='BLUE HARMONICA IN BOX ', Quantity=24, InvoiceDate=datetime.datetime(2011, 12, 5, 8, 38), UnitPrice=1.25, CustomerID=14075.0, Country='United Kingdom'),
 Row(InvoiceNo='580538', StockCode='22467', Description='GUMBALL COAT RACK', Quantity=6, InvoiceDate=datetime.datetime(2011, 12, 5, 8, 38), UnitPrice=2.55, Custom

In [62]:
type(df.collect())

list

`limit(num)` limits the result count to the specified number of rows.

In [63]:
df.limit(10).collect()

[Row(InvoiceNo='580538', StockCode='23084', Description='RABBIT NIGHT LIGHT', Quantity=48, InvoiceDate=datetime.datetime(2011, 12, 5, 8, 38), UnitPrice=1.79, CustomerID=14075.0, Country='United Kingdom'),
 Row(InvoiceNo='580538', StockCode='23077', Description='DOUGHNUT LIP GLOSS ', Quantity=20, InvoiceDate=datetime.datetime(2011, 12, 5, 8, 38), UnitPrice=1.25, CustomerID=14075.0, Country='United Kingdom'),
 Row(InvoiceNo='580538', StockCode='22906', Description='12 MESSAGE CARDS WITH ENVELOPES', Quantity=24, InvoiceDate=datetime.datetime(2011, 12, 5, 8, 38), UnitPrice=1.65, CustomerID=14075.0, Country='United Kingdom'),
 Row(InvoiceNo='580538', StockCode='21914', Description='BLUE HARMONICA IN BOX ', Quantity=24, InvoiceDate=datetime.datetime(2011, 12, 5, 8, 38), UnitPrice=1.25, CustomerID=14075.0, Country='United Kingdom'),
 Row(InvoiceNo='580538', StockCode='22467', Description='GUMBALL COAT RACK', Quantity=6, InvoiceDate=datetime.datetime(2011, 12, 5, 8, 38), UnitPrice=2.55, Custom

`take(num)` returns just the first *num* rows as a list of `Row` objects.

In [64]:
df.take(3)

[Row(InvoiceNo='580538', StockCode='23084', Description='RABBIT NIGHT LIGHT', Quantity=48, InvoiceDate=datetime.datetime(2011, 12, 5, 8, 38), UnitPrice=1.79, CustomerID=14075.0, Country='United Kingdom'),
 Row(InvoiceNo='580538', StockCode='23077', Description='DOUGHNUT LIP GLOSS ', Quantity=20, InvoiceDate=datetime.datetime(2011, 12, 5, 8, 38), UnitPrice=1.25, CustomerID=14075.0, Country='United Kingdom'),
 Row(InvoiceNo='580538', StockCode='22906', Description='12 MESSAGE CARDS WITH ENVELOPES', Quantity=24, InvoiceDate=datetime.datetime(2011, 12, 5, 8, 38), UnitPrice=1.65, CustomerID=14075.0, Country='United Kingdom')]

In [65]:
type(df.take(3))

list

If I just want the first row on top, I can do it this way.

In [66]:
df.first()

Row(InvoiceNo='580538', StockCode='23084', Description='RABBIT NIGHT LIGHT', Quantity=48, InvoiceDate=datetime.datetime(2011, 12, 5, 8, 38), UnitPrice=1.79, CustomerID=14075.0, Country='United Kingdom')

`toLocalIterator()` returns an iterator that contains all of the rows in this `DataFrame`. The iterator will consume as much memory as the largest partition in this `DataFrame`.
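
The difference between collecting everything and iterating locally can be pictured with a Python generator. This is a plain-Python sketch of the memory behaviour described above, with partitions modelled as simple lists; `collect_all()` and `local_iterator()` are hypothetical helpers, not pyspark functions.

```python
def collect_all(partitions):
    """Materialize every row at once, like fetching the full result."""
    return [row for part in partitions for row in part]


def local_iterator(partitions):
    """Yield rows partition by partition, holding only one partition at a time."""
    for part in partitions:
        buffered = list(part)  # only the current partition is held in memory
        yield from buffered


partitions = [[1, 2, 3], [4, 5], [6]]

# Stop early: later partitions are never touched.
first_two = []
for row in local_iterator(partitions):
    first_two.append(row)
    if len(first_two) == 2:
        break

print(first_two)
```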

I think, that's enough for today.