# Spark DataFrames and SQL Basics, Part 2

## 1. More DataFrame Operations

Let's manually load some data that describes purchase events.  As we've seen, there are many ways to load data manually.  Let's create an RDD first and then convert to a DataFrame:

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local[*]').appName('week 7 spark').getOrCreate()

sc = spark.sparkContext
purchases_rdd = sc.parallelize([
("Geoffrey", "2016-04-22", "A", "apples", 1, 50.00, "android"),
("Geoffrey", "2016-05-03", "B", "Lamp", 2, 38.00, "android"),
("Geoffrey", "2016-05-03", "D", "Solar Pannel", 1, 29.00, "windows"),
("Geoffrey", "2016-05-03", "A", "apples", 3, 50.00, "android"),
("Geoffrey", "2016-05-03", "C", "Rice", 5, 15.00, "android"),
("Geoffrey", "2016-06-05", "A", "apples", 5, 50.00, "windows"),
("Geoffrey", "2016-06-05", "A", "bananas", 5, 55.00, "windows"),
("Geoffrey", "2016-06-15", "Y", "Motor skate", 7, 68.00, "windows"),
("Geoffrey", "2016-06-15", "E", "Book: The noose", 1, 125.00, "android"),
("Yann", "2016-04-22", "B", "Lamp", 1, 38.00, "ios"),
("Yann", "2016-05-03", "Y", "Motor skate", 1, 68.00, "ios"),
("Yann", "2016-05-03", "D", "Recycle bin", 5, 27.00, "macos"),
("Yann", "2016-05-03", "C", "Rice", 15, 15.00, "macos"),
("Yann", "2016-04-02", "A", "bananas", 3, 55.00, "macos"),
("Yann", "2016-04-02", "B", "Lamp", 2, 38.00, "macos"),
("Yann", "2016-04-03", "E", "Book: Crime and Punishment", 5, 100.00, "macos"),
("Yann", "2016-04-13", "E", "Book: The noose", 5, 125.00, "macos"),
("Yann", "2016-04-27", "D", "Solar Pannel", 5, 29.00, "ios"),
("Yann", "2016-05-27", "D", "Recycle bin", 5, 27.00, "ios"),
("Yann", "2016-05-27", "A", "bananas", 3, 55.00, "ios"),
("Yann", "2016-05-01", "Y", "Motor skate", 1, 68.00, "ios"),
("Yann", "2016-06-07", "Z", "space ship", 1, 227.00, "ios"),
("Yoshua", "2016-02-07", "Z", "space ship", 2, 227.00, "windows"),
("Yoshua", "2016-02-14", "A", "bananas", 9, 55.00, "windows"),
("Yoshua", "2016-02-14", "B", "Lamp", 2, 38.00, "windows"),
("Yoshua", "2016-02-14", "A", "apples", 10, 55.00, "ios"),
("Yoshua", "2016-03-07", "Z", "space ship", 5, 227.00, "ios"),
("Yoshua", "2016-04-07", "Y", "Motor skate", 4, 68.00, "windows"),
("Yoshua", "2016-04-07", "D", "Recycle bin", 5, 27.00, "ios"),
("Yoshua", "2016-04-07", "C", "Rice", 5, 15.00, "ios"),
("Yoshua", "2016-04-07", "A", "bananas", 9, 55.00, "windows"),
("Yoshua", "2016-04-07", "D", "Solar Pannel", 1, 29.00, "windows"),
("Jurgen", "2016-05-01", "Z", "space ship", 1, 227.00, "macos"),
("Jurgen", "2016-05-01", "A", "bananas", 5, 55.00, "macos"),
("Jurgen", "2016-05-08", "A", "bananas", 5, 55.00, "macos"),
("Jurgen", "2016-05-08", "Y", "Motor skate", 1, 68.00, "android"),
("Jurgen", "2016-06-05", "A", "bananas", 5, 55.00, "android"),
("Jurgen", "2016-06-05", "C", "Rice", 5, 15.00, "windows"),
("Jurgen", "2016-06-05", "Y", "Motor skate", 2, 68.00, "windows"),
("Jurgen", "2016-06-05", "D", "Recycle bin", 5, 27.00, "windows"),
])

In [None]:
column_names = ["customer_name", "date", "category", "product_name", "quantity", "price", "channel"]
purchases_df = purchases_rdd.toDF(column_names)

We could use `.show(5)`, `.take(5)` or `.head(5)`, but converting a small sample to a Pandas DataFrame gives prettier output in a notebook:

In [None]:
purchases_df.limit(5).toPandas().head()

Let's check the distinct products that are being purchased by our customers:

In [None]:
purchases_df.select('product_name').distinct().show()

### Summary statistics on certain columns

A valuable way to get a quick look at some data is to use the `.describe()` method.  This reports a few very basic statistics (count, mean, standard deviation, minimum and maximum) for the columns that we specify:

In [None]:
purchases_df.describe('quantity', 'price').show()
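To make the output easier to read, here is a plain-Python sketch (not Spark code) of the same five statistics, computed on the `quantity` values of the first five rows of our data.  The name `summary` is made up for illustration, and the sketch assumes that the reported standard deviation is the *sample* standard deviation:

```python
import statistics

# quantity values from the first five purchase events in the data above
quantities = [1, 2, 1, 3, 5]

# the same five statistics that .describe() reports for a numeric column
summary = {
    'count': len(quantities),
    'mean': statistics.mean(quantities),
    'stddev': statistics.stdev(quantities),  # sample standard deviation (assumption)
    'min': min(quantities),
    'max': max(quantities),
}

for name, value in summary.items():
    print(name, value)
```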

### Contingency tables

Recall from statistics the concept of a "contingency table".  For DataFrames, the `.crosstab()` method produces one.  This can be a useful way to look at data, but we need to be careful with interpretation here:  it only counts *rows* (purchase events), so the `quantity` column is ignored.

In [None]:
product_freq = purchases_df.crosstab('customer_name', 'product_name')
product_freq.toPandas().head()
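The difference between counting rows and counting units is easy to miss, so here is a small plain-Python sketch (not Spark code) on a handful of rows taken from the data above.  The names `row_counts` and `units` are made up for illustration:

```python
from collections import Counter

# (customer_name, product_name, quantity) for a few purchase events
rows = [
    ("Geoffrey", "apples", 1),
    ("Geoffrey", "Lamp", 2),
    ("Geoffrey", "apples", 3),
    ("Geoffrey", "apples", 5),
    ("Yann", "Lamp", 1),
    ("Yann", "Lamp", 2),
]

# what a contingency table counts: one per row, whatever the quantity
row_counts = Counter((customer, product) for customer, product, _ in rows)

# what it does NOT count: the number of units actually bought
units = Counter()
for customer, product, quantity in rows:
    units[(customer, product)] += quantity

print(row_counts[("Geoffrey", "apples")], units[("Geoffrey", "apples")])
```

Geoffrey appears in three apple purchase events but bought nine apples in total, so the two views can tell quite different stories.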

Let's look at the columns:

In [None]:
cols = product_freq.columns
cols

So now we can just pass these column names to `.describe()` to get some basic purchase frequency stats:

In [None]:
product_freq.describe(cols[1:]).toPandas().head()

`.describe()` should ONLY be used for exploratory analysis.  If we really wanted to get the average number of purchase events per product (to be used in further calculations) then we should perform an explicit aggregation ourselves:

In [None]:
product_count = purchases_df.groupBy('customer_name', 'product_name').count()
product_count.show()

In [None]:
product_count = product_count.withColumnRenamed('count', 'num_purchase_events')
product_count.show()

Let's compute the average number of purchase events per product:

In [None]:
product_count.groupBy('product_name').avg('num_purchase_events').show()

### Pivoting columns

What if we wanted to take the `quantity` column into account (i.e. for each purchase event a customer might buy MORE THAN ONE of a given product)?

One way to analyze this is to use the `.pivot()` method.  `.pivot()` *roughly* "makes a column horizontal".  More precisely, it constructs a new table where the column names are taken from column *values* in the old table.

To make sense of this we always need to start with a `.groupBy()` and end with an aggregation.
It's easier seen than said:

In [None]:
product_quantity = purchases_df.groupBy('customer_name').pivot('product_name').sum('quantity')
product_quantity.toPandas().head()

Look at all of those `NaN` (not a number) values.  Spark stores these cells as `null`, and Pandas displays them as `NaN`.  In this context they mean that the customer never bought that particular product.  Let's fill those in with zeros:

In [None]:
product_quantity = product_quantity.na.fill(0)
product_quantity.toPandas().head()
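If the pivot still feels abstract, the following plain-Python sketch (not Spark code) mimics the same steps on a few rows from the data above: group by customer, turn product *values* into columns, sum the quantities, and replace missing cells with zero.  The names `totals`, `pivoted` and `filled` are made up for illustration:

```python
from collections import defaultdict

# (customer_name, product_name, quantity) for a few purchase events
rows = [
    ("Geoffrey", "apples", 1),
    ("Geoffrey", "Lamp", 2),
    ("Geoffrey", "apples", 3),
    ("Yann", "Lamp", 1),
    ("Yann", "bananas", 3),
]

# the distinct product values become the new column names
products = sorted({product for _, product, _ in rows})

# group by customer and sum the quantity for each product
totals = defaultdict(dict)
for customer, product, quantity in rows:
    totals[customer][product] = totals[customer].get(product, 0) + quantity

# one row per customer, one cell per product; None where nothing was bought
pivoted = {c: [totals[c].get(p) for p in products] for c in totals}

# the equivalent of filling missing values with 0
filled = {c: [0 if v is None else v for v in cells] for c, cells in pivoted.items()}

print(products)
print(filled)
```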

Let's say we wanted to compute the average quantity of each product purchased, taken over all customers.  Let's start by getting a list of products:

In [None]:
products = product_quantity.columns[1:]
products

It is easy to get averages by hand over a couple of products:

In [None]:
avg_quantity = product_quantity.groupBy().avg('apples', 'bananas')
avg_quantity.show()

If we want to compute averages for ALL products then we need to use a specific piece of Python syntax.  Recall that we have a list of products in `products`.  We can "unpack" this list into the arguments of a function by using the `*` operator:

In [None]:
avg_quantity_all = product_quantity.groupBy().avg(*products)
avg_quantity_all.toPandas().head()
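For anyone who has not met `*` unpacking before, here is a minimal plain-Python sketch of what it does.  The function `show_args` and the short product list are made up for illustration:

```python
def show_args(*cols):
    """Return the positional arguments exactly as the function received them."""
    return cols

products = ['apples', 'bananas', 'Rice']

# with *, each list element becomes its own positional argument
unpacked = show_args(*products)

# without *, the whole list arrives as a single argument
not_unpacked = show_args(products)

print(unpacked)
print(not_unpacked)
```

So calling a function with `*products` is the same as writing the column names out by hand, one argument per name.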

## 2. UDFs and Windowing

User-defined functions are very useful when performing computations on DataFrames.  These are similar in spirit to the lambdas that we often used when computing on RDDs:

In [None]:
import pyspark.sql.functions as fn
from pyspark.sql.types import DoubleType

# define the function itself
def amount_spent(quantity, price):
    return quantity*price

# convert it to a UDF
amount_spent_udf = fn.udf(amount_spent, DoubleType())

Now create a new column named `amount_spent` where the values are computed using the UDF:

In [None]:
purchases_df = purchases_df.withColumn('amount_spent', amount_spent_udf(fn.col('quantity'), fn.col('price')))
purchases_df.limit(5).toPandas().head()
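One thing to keep in mind with UDFs is that Spark passes SQL `null` values to the Python function as `None`, and `None * price` raises a `TypeError`.  Our data has no missing values, so the function above is fine here, but a more defensive variant might look like the sketch below.  The name `amount_spent_safe` is made up for illustration:

```python
def amount_spent_safe(quantity, price):
    """Return quantity * price as a float, or None if either input is missing."""
    if quantity is None or price is None:
        return None
    # float() keeps the result consistent with the declared DoubleType
    return float(quantity * price)

# it could then be wrapped exactly like the original function:
# amount_spent_safe_udf = fn.udf(amount_spent_safe, DoubleType())

print(amount_spent_safe(2, 38.0))
print(amount_spent_safe(None, 38.0))
```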

### Windowing

Windowing is the way to aggregate a row with neighboring rows to produce interesting statistics.  For example, imagine answering questions like "average spend over last 5 visits".
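To make the idea of a window *frame* concrete, here is a plain-Python sketch (not Spark code) of two common frames over one customer's rows, already sorted by date: everything up to the current row, and the current row plus the two rows before it.  The helper `frame_sums` and the sample amounts are made up for illustration:

```python
def frame_sums(values, start, end=0):
    """For each row i, sum the values from row i+start to row i+end (inclusive).

    start=None means the frame reaches back to the first row.
    """
    sums = []
    for i in range(len(values)):
        lo = 0 if start is None else max(0, i + start)
        hi = min(len(values), i + end + 1)
        sums.append(sum(values[lo:hi]))
    return sums

# amount spent per purchase event for one customer, ordered by date
amounts = [50.0, 76.0, 29.0, 150.0]

cumulative = frame_sums(amounts, None)  # all rows up to the current one
rolling_3 = frame_sums(amounts, -2)     # the current row and the two before it

print(cumulative)
print(rolling_3)
```

Note that a row-based frame depends on the order of the rows.  When several rows share the same date, as they do in this dataset, the order among those ties is not guaranteed, so the per-row results may vary between runs unless a tie-breaking column is added to the ordering.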

Let's start with a simple example:  cumulative historical spend per customer.

In [None]:
from pyspark.sql import Window

# start by defining the window over which computations will be performed:
# all rows for the same customer, from the earliest date up to the current row
window = Window.partitionBy('customer_name').orderBy('date').rowsBetween(Window.unboundedPreceding, 0)

# now apply the window aggregation to compute a new column `cumulative_spend`
purchases_df = purchases_df.withColumn('cumulative_spend', fn.sum(fn.col('amount_spent')).over(window))

purchases_df.limit(20).toPandas().head(20)

A bounded frame works the same way.  Here the window covers the current row and the two rows before it, giving the spend over each customer's last 3 purchase events:

In [None]:
# the frame now spans the two preceding rows and the current row
window = Window.partitionBy('customer_name').orderBy('date').rowsBetween(-2, 0)

# now apply the window aggregation to compute a new column `cumulative_spend_3`
purchases_df = purchases_df.withColumn('cumulative_spend_3', fn.sum(fn.col('amount_spent')).over(window))

purchases_df.limit(20).toPandas().head(20)

We can make this example more interesting.  Above we were computing spend per *purchase event*.  Very often it is more useful to answer questions about buckets of time (e.g. weekly spend).

Just like we did for RDDs, we can use our old friend `datetime` to perform time analysis:

In [None]:
#start by creating a UDF that converts the date string to a datetime object
from datetime import datetime
from pyspark.sql.types import DateType

def parse_date(datestr):
    return datetime.strptime(datestr, '%Y-%m-%d')

string_to_datetime = fn.udf(parse_date, DateType())

In [None]:
purchases_df = purchases_df.withColumn('datetime', string_to_datetime(fn.col('date')))
purchases_df = purchases_df.drop('date')
purchases_df.limit(10).toPandas().head(10)

Let's add a `weekofyear` column so that we can aggregate by the week:

In [None]:
purchases_df = purchases_df.withColumn('weekofyear', fn.weekofyear(fn.col('datetime')))
purchases_df.limit(10).toPandas().head()
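As a sanity check on the new column, here is a plain-Python sketch (not Spark code) that computes the week number for a few of our dates.  It assumes that `weekofyear` follows ISO 8601 week numbering (weeks start on Monday), which is what `isocalendar()` implements; the helper `iso_week` is made up for illustration:

```python
from datetime import datetime

def iso_week(datestr):
    """Parse a 'YYYY-MM-DD' string and return its ISO 8601 week number."""
    return datetime.strptime(datestr, '%Y-%m-%d').date().isocalendar()[1]

# a few dates taken from the purchases data
weeks = {d: iso_week(d) for d in ['2016-02-07', '2016-04-22', '2016-05-03']}

print(weeks)
```

Because all of our purchases fall in 2016, the week number alone identifies a week.  With data spanning several years, the year would also need to be part of the grouping key.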

Now aggregating by the week is easy!

In [None]:
purchases_df.groupBy('customer_name', 'weekofyear').sum('amount_spent').orderBy('customer_name', 'weekofyear').show()

### Save dataframe to disk

In [None]:
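# Spark writes a directory of part files; header=True keeps column names, mode='overwrite' allows re-running
purchases_df.write.csv('./purchases_df.csv', header=True, mode='overwrite')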

In [None]:
# Parquet is a very popular columnar format; it stores the schema and is usually much more compact and faster to read than CSV
purchases_df.write.parquet('./purchases_df.parquet')