
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 1200px">
</div>

#Introduction to DataFrames, Part #3

**Data Source**
* English Wikipedia pageviews by second
* Size on Disk: ~255 MB
* Type: Parquet files
* More Info: <a href="https://datahub.io/en/dataset/english-wikipedia-pageviews-by-second" target="_blank">https&#58;//datahub.io/en/dataset/english-wikipedia-pageviews-by-second</a>

**Technical Accomplishments:**
* Introduce the various aggregate functions.
* Explore more of the `...sql.functions` operations
  * more aggregate functions
  * date & time functions

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Setup<br>

For each lesson to execute correctly, please make sure to run the **`Classroom-Setup`** cell at the start of each lesson (see the next cell) and the **`Classroom-Cleanup`** cell at the end of each lesson.

In [0]:
%run "../Includes/Classroom-Setup"

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) The Data Source

This lesson uses the **Pageviews By Seconds** data set.

In [0]:
from pyspark.sql.functions import *
from pyspark.sql.types import *

# I've already gone through the exercise to determine
# how many partitions I want and in this case it is...
partitions = 8

# Make sure wide operations don't repartition to 200
spark.conf.set("spark.sql.shuffle.partitions", partitions)

In [0]:
(source, sasEntity, sasToken) = getAzureDataSource()
spark.conf.set(sasEntity, sasToken)

# The directory containing our parquet files.
parquetFile = source + "/wikipedia/pageviews/pageviews_by_second.parquet/"

In [0]:
# Create our initial DataFrame. We can let it infer the 
# schema because the cost for parquet files is really low.
initialDF = (spark.read
  .parquet(parquetFile)          # Read the data in
  .repartition(partitions)       # From 5 >>> 8 partitions
  .cache()                       # Cache the expensive operation
)
# materialize the cache
initialDF.count()

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Preparing Our Data

If we will be working on any given dataset for a while, there are a handful of "necessary" steps to get us ready...

Most of which we've just knocked out above.

**Basic Steps**
0. <div style="text-decoration:line-through">Read the data in</div>
0. <div style="text-decoration:line-through">Balance the number of partitions to the number of slots</div>
0. <div style="text-decoration:line-through">Cache the data</div>
0. <div style="text-decoration:line-through">Adjust the `spark.sql.shuffle.partitions`</div>
0. Perform some basic ETL (i.e., convert strings to timestamp)
0. Possibly re-cache the data if the ETL was costly

What we haven't done is some of the basic ETL necessary to explore our data.

Namely, the problem is that the field "timestamp" is a string.

In order to perform date/time-based computations, I need to convert this string to a proper timestamp type.

In [0]:
initialDF.printSchema()

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) withColumnRenamed(..), withColumn(..), select(..)

My first hangup is that we have a **column named timestamp** and the **datatype will also be timestamp**.

The nice thing about Apache Spark is that I'm allowed to have an issue with this because it's very easy to fix...

Just rename the column...

In [0]:
(initialDF
  .select( col("timestamp").alias("capturedAt"), col("site"), col("requests") )
  .printSchema()
)

There are a number of different ways to rename a column...

In [0]:
(initialDF
  .withColumnRenamed("timestamp", "capturedAt")
  .printSchema()
)

In [0]:
(initialDF
  .toDF("capturedAt", "site", "requests")
  .printSchema()
)

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) unix_timestamp(..) & cast(..)

Now that **we** are over **my** hangup, we can focus on converting the **string** to a **timestamp**.

For this we will be looking at more of the functions in the `functions` package
* `pyspark.sql.functions` in the case of Python
* `org.apache.spark.sql.functions` in the case of Scala & Java

And so that we can watch the transformation, we will take one step at a time...

The first function is `unix_timestamp(..)`

If you look at the API docs, `unix_timestamp(..)` is described like this:
> Convert time string with given pattern (see <a href="http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html" target="_blank">SimpleDateFormat</a>) to Unix time stamp (in seconds), return null if fail.

`SimpleDateFormat` is part of the Java API and provides support for parsing and formatting date and time values.

In order to know what format the data is in, let's take a look at the first row...

Comparing that value with the patterns expressed in the docs for the `SimpleDateFormat` class, we can come up with a format:

**yyyy-MM-dd HH:mm:ss**

In [0]:
tempA = (initialDF
  .withColumnRenamed("timestamp", "capturedAt")
  .select( col("*"), unix_timestamp( col("capturedAt"), "yyyy-MM-dd HH:mm:ss") )
)
tempA.printSchema()

In [0]:
display(tempA)

**Note:** *If you haven't caught it yet, there is a bug in the previous code....*

A couple of things happened...
0. We ended up with a new column - that's OK for now
0. The new column has a really funky name - based upon the name of the function we called and its parameters.
0. The data type is now a long.
  * This value is the Unix epoch time
  * The number of seconds since 1970-01-01T00:00:00Z
  
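The bug is in the pattern: it uses a space where the data has a literal `T` between the date and the time, so the parse fails. With the pattern corrected to `yyyy-MM-dd'T'HH:mm:ss`, we can take the epoch value and use the `Column.cast(..)` method to convert it to a **timestamp**.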
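
To make the two steps concrete outside of Spark, here is a minimal plain-Python sketch of the same idea: parse a string with a pattern to get epoch seconds, then turn those seconds back into a timestamp. The sample value, the helper name `to_epoch_seconds`, and the choice of UTC are assumptions for illustration. Spark's `unix_timestamp(..)` uses Java-style pattern letters and the session time zone, so treat this as an analogy rather than Spark's implementation.

```python
import calendar
from datetime import datetime, timezone

# Hypothetical sample value; the real rows are not reproduced here.
raw = "2015-03-16T00:00:00"

def to_epoch_seconds(value, pattern):
    """Return epoch seconds (UTC assumed), or None when the string does not match."""
    try:
        parsed = datetime.strptime(value, pattern)
    except ValueError:
        return None  # plays the role of the null that a failed parse yields
    return calendar.timegm(parsed.timetuple())

# A pattern with a space instead of the literal 'T' does not match the value.
no_match = to_epoch_seconds(raw, "%Y-%m-%d %H:%M:%S")

# The matching pattern yields whole seconds since 1970-01-01T00:00:00Z.
epoch_seconds = to_epoch_seconds(raw, "%Y-%m-%dT%H:%M:%S")

# "Casting" the epoch value back to a timestamp, again assuming UTC.
as_timestamp = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)

print(no_match, epoch_seconds, as_timestamp.isoformat())
```

Note that Python's `strptime` directives (`%Y`, `%m`, ...) are not the same as the `SimpleDateFormat` letters (`yyyy`, `MM`, ...) that Spark expects.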

In [0]:
from pyspark.sql.functions import col, unix_timestamp

tempB = (initialDF
  .withColumnRenamed("timestamp", "capturedAt")
  .select( col("*"), unix_timestamp( col("capturedAt"), "yyyy-MM-dd'T'HH:mm:ss").cast("timestamp") )
)
tempB.printSchema()

In [0]:
display(tempB)

Now that our column `capturedAt` has been converted from a **string** to a **timestamp**, we just need to deal with this REALLY funky column name.

Again, there are several ways to do this.

I'll let you decide which you like better...

### Option #1
The `alias(..)` method (or `as(..)` in Scala) can be appended to the chain of calls.

This version will actually produce an odd little bug.<br/>
That is, how do you get rid of only one of the two `capturedAt` columns?

In [0]:
from pyspark.sql.functions import col, unix_timestamp

tempC = (initialDF
  .withColumnRenamed("timestamp", "capturedAt")
  .select( col("*"), unix_timestamp( col("capturedAt"), "yyyy-MM-dd'T'HH:mm:ss").cast("timestamp").alias("capturedAt") )
)
tempC.printSchema()

In [0]:
display(tempC)

### Option #2
The `withColumn(..)` method names the resulting column (first param), replacing any<br/>
existing column of that name, and accepts as a second parameter the expression we need for our transformation.

In [0]:
from pyspark.sql.functions import col, unix_timestamp

tempD = (initialDF
  .withColumnRenamed("timestamp", "capturedAt")
  .withColumn("capturedAt", unix_timestamp( col("capturedAt"), "yyyy-MM-dd'T'HH:mm:ss").cast("timestamp") )
)
tempD.printSchema()

In [0]:
display(tempD)

### Option #3

We can take the big ugly name and explicitly rename it.

This version will actually produce an odd little bug.<br/>
That is, how do you get rid of only one of the two `capturedAt` columns?

In [0]:
from pyspark.sql.functions import col, unix_timestamp

tempE = (initialDF
  .withColumnRenamed("timestamp", "capturedAt")
  .select( col("*"), unix_timestamp( col("capturedAt"), "yyyy-MM-dd'T'HH:mm:ss").cast("timestamp") )
  .withColumnRenamed("CAST(unix_timestamp(capturedAt, yyyy-MM-dd'T'HH:mm:ss) AS TIMESTAMP)", "capturedAt")
  # .drop("timestamp")
)
tempE.printSchema()

In [0]:
display(tempE)

### Option #4

The last version is a twist on the others in which we start with the <br/>
name `timestamp` and apply the rename and the expression all in one call.<br/>

But this version leaves us with the old column in the `DataFrame`.

In [0]:
from pyspark.sql.functions import col, unix_timestamp

tempF = (initialDF
  .withColumn("capturedAt", unix_timestamp( col("timestamp"), "yyyy-MM-dd'T'HH:mm:ss").cast("timestamp") )
)
tempF.printSchema()

In [0]:
display(tempF)

Let's pick the "cleanest" version...

And with our base `DataFrame` in place we can start exploring the data a little...

In [0]:
from pyspark.sql.functions import col, unix_timestamp

pageviewsDF = (initialDF
  .withColumnRenamed("timestamp", "capturedAt")
  .withColumn("capturedAt", unix_timestamp( col("capturedAt"), "yyyy-MM-dd'T'HH:mm:ss").cast("timestamp") )
)

pageviewsDF.printSchema()

In [0]:
display(pageviewsDF)

And just so that we don't have to keep performing these transformations.... 

Mark the `DataFrame` as cached and then materialize the result.

In [0]:
pageviewsDF.cache().count()

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) year(..), month(..), dayofyear(..)

Let's take a look at some of the other date & time functions...

With that we can answer a simple question: When was this data captured?

We can start specifically with the year...

In [0]:
from pyspark.sql.functions import col, year

display(
  pageviewsDF
    .select( year( col("capturedAt")) ) # Every record converted to a single column - the year captured
    .distinct()                         # Reduce all years to the list of distinct years
)

Now let's take a look at which months this data was captured in...

In [0]:
from pyspark.sql.functions import col, month

display(
  pageviewsDF
    .select( month( col("capturedAt")) ) # Every record converted to a single column - the month captured
    .distinct()                          # Reduce all months to the list of distinct months
)

And of course both can be combined in a single call...

In [0]:
from pyspark.sql.functions import col, month, year

(pageviewsDF
  .select( month(col("capturedAt")).alias("month"), year(col("capturedAt")).alias("year"))
  .distinct()
  .show()                     
)

It's pretty easy to see that the data was captured during March & April of 2015.
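
As a rough plain-Python picture of what the `select(..)` plus `distinct()` combination above is doing, the sketch below projects each timestamp down to a (month, year) pair and keeps only the unique pairs. The timestamps are made-up values, not rows from the data set.

```python
from datetime import datetime

# Hypothetical timestamps, chosen only to illustrate the projection.
captured_at = [
    datetime(2015, 3, 16, 0, 0, 0),
    datetime(2015, 3, 16, 0, 0, 1),
    datetime(2015, 4, 1, 12, 30, 0),
]

# Project every record to (month, year), then reduce to the distinct pairs.
distinct_month_year = sorted({(ts.month, ts.year) for ts in captured_at})

print(distinct_month_year)
```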

We will have more opportunities to play with the various date and time functions in the next lab.

For now, let's just make sure to review them in the Spark API docs.

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) groupBy()

Aggregating data is one of the more common tasks when working with big data.
* How many customers are over 65?
* What is the ratio of men to women?
* Group all emails by their sender.

The function `groupBy()` is one tool that we can use for this purpose.

If you look at the API docs, `groupBy(..)` is described like this:
> Groups the Dataset using the specified columns, so that we can run aggregation on them.

This function is a **wide** transformation - it will produce a shuffle and mark a stage boundary.

Unlike all of the other transformations we've seen so far, this transformation does not return a `DataFrame`.
* In Scala it returns `RelationalGroupedDataset`
* In Python it returns `GroupedData`

This is because the call `groupBy(..)` is only 1/2 of the transformation.

To see the other half, we need to take a look at its return type, `RelationalGroupedDataset` (`GroupedData` in Python).

### RelationalGroupedDataset

If we take a look at the API docs for `RelationalGroupedDataset`, we can see that it supports the following aggregations:

| Method | Description |
|--------|-------------|
| `avg(..)` | Compute the mean value for each numeric column for each group. |
| `count(..)` | Count the number of rows for each group. |
| `sum(..)` | Compute the sum for each numeric column for each group. |
| `min(..)` | Compute the min value for each numeric column for each group. |
| `max(..)` | Compute the max value for each numeric column for each group. |
| `mean(..)` | Compute the average value for each numeric column for each group. |
| `agg(..)` | Compute aggregates by specifying a series of aggregate columns. |
| `pivot(..)` | Pivots a column of the current DataFrame and performs the specified aggregation. |

With the exception of `pivot(..)`, each of these functions returns a new `DataFrame`.

Together, `groupBy(..)` and `RelationalGroupedDataset` (or `GroupedData` in Python) give us what we need to answer some basic questions.

For example, how many more requests did the desktop site receive than the mobile site?

For this all we need to do is group all records by **site** and then sum all the requests.
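
Before running it in Spark, the group-then-sum idea can be sketched in plain Python. This is only an illustration of the semantics: the rows and the helper name `sum_requests_by_site` are hypothetical, and Spark performs the same work distributed across partitions rather than in a single loop.

```python
from collections import defaultdict

# Hypothetical (site, requests) rows; the numbers are made up.
rows = [("mobile", 10), ("desktop", 25), ("mobile", 5), ("desktop", 15)]

def sum_requests_by_site(records):
    """Group records by site and add up the requests in each group."""
    totals = defaultdict(int)
    for site, requests in records:
        totals[site] += requests
    return dict(totals)

totals = sum_requests_by_site(rows)

# How many more requests did desktop receive than mobile?
difference = totals["desktop"] - totals["mobile"]

print(totals, difference)
```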

In [0]:
from pyspark.sql.functions import col

display(
  pageviewsDF
    .groupBy( col("site") )
    .sum()
)

Notice above that we didn't actually specify which column we were summing....

In this case you will actually receive a total for all numerical values.

There is a performance catch to that - if I have 2, 5, or 10 numeric columns, they will all be summed even though I may only need one.

I can first reduce my columns to those that I want or I can simply specify which column(s) to sum up.

In [0]:
from pyspark.sql.functions import col

display(
  pageviewsDF
    .groupBy( col("site") )
    .sum("requests")
)

And because I don't like the resulting column name, **`sum(requests)`**, I can easily rename it...

In [0]:
from pyspark.sql.functions import col

display(
  pageviewsDF
    .groupBy( col("site") )
    .sum("requests")
    .withColumnRenamed("sum(requests)", "totalRequests")
)

How about the total number of records per site, mobile vs desktop?

In [0]:
from pyspark.sql.functions import col

display(
  pageviewsDF
    .groupBy( col("site") )
    .count()
)

This result shouldn't surprise us... there was, after all, one record per second, per site.

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) sum(), count(), avg(), min(), max()

The `groupBy(..)` operation is not our only option for aggregating.

The `...sql.functions` package actually defines a large number of aggregate functions
* `org.apache.spark.sql.functions` in the case of Scala & Java
* `pyspark.sql.functions` in the case of Python


Let's take a look at this in the Scala API docs (only because the documentation is a little easier to read).

Let's take a look at our last two examples... 

We saw the count of records and the sum of requests.

Let's do this a slightly different way...

This time with the `...sql.functions` operations.

And just for fun, let's throw in the average, minimum and maximum.
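
As a minimal plain-Python sketch of what "filter to one site, then compute several aggregates over one column" means, consider the following. The rows and the helper name `summarize` are hypothetical, and the sketch assumes at least one row matches the site.

```python
# Hypothetical (site, requests) rows; the numbers are made up.
rows = [("mobile", 10), ("desktop", 25), ("mobile", 6), ("desktop", 15)]

def summarize(records, site):
    """Filter to one site, then compute several aggregates over `requests`."""
    # Assumes at least one record matches; an empty list would fail on the average.
    values = [requests for row_site, requests in records if row_site == site]
    return {
        "sum": sum(values),
        "count": len(values),
        "avg": sum(values) / len(values),
        "min": min(values),
        "max": max(values),
    }

mobile_summary = summarize(rows, "mobile")
desktop_summary = summarize(rows, "desktop")

print(mobile_summary)
print(desktop_summary)
```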

In [0]:
from pyspark.sql.functions import col, sum, count, avg, min, max

(pageviewsDF
  .filter("site = 'mobile'")
  .select( sum( col("requests")), count(col("requests")), avg(col("requests")), min(col("requests")), max(col("requests")) )
  .show()
)
          
(pageviewsDF
  .filter("site = 'desktop'")
  .select( sum( col("requests")), count(col("requests")), avg(col("requests")), min(col("requests")), max(col("requests")) )
  .show()
)

And let's just address one more pet peeve...

Was that 3.6M records or 360K records?
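
Thousands separators settle that kind of question at a glance. As a rough plain-Python analogy of the output that `format_number(..)` produces, the sketch below renders numbers with grouping and a fixed number of decimals. The helper name `with_separators` and the sample numbers are hypothetical, and rounding in edge cases may not match Spark exactly.

```python
def with_separators(value, decimals):
    """Render a number with thousands separators and a fixed number of decimals."""
    return f"{value:,.{decimals}f}"

# Hypothetical values, chosen to show how easily the two magnitudes are confused.
print(with_separators(3600000, 0))
print(with_separators(360000, 0))
print(with_separators(1234.5678, 2))
```

Like `format_number(..)`, the helper returns a string, so it belongs at the very end of a pipeline, after any numeric work is done.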

In [0]:
from pyspark.sql.functions import col, sum, count, avg, min, max, format_number

(pageviewsDF
  .filter("site = 'mobile'")
  .select( 
    format_number(sum(col("requests")), 0).alias("sum"), 
    format_number(count(col("requests")), 0).alias("count"), 
    format_number(avg(col("requests")), 2).alias("avg"), 
    format_number(min(col("requests")), 0).alias("min"), 
    format_number(max(col("requests")), 0).alias("max") 
  )
  .show()
)

(pageviewsDF
  .filter("site = 'desktop'")
  .select( 
    format_number(sum(col("requests")), 0).alias("sum"), 
    format_number(count(col("requests")), 0).alias("count"), 
    format_number(avg(col("requests")), 2).alias("avg"), 
    format_number(min(col("requests")), 0).alias("min"), 
    format_number(max(col("requests")), 0).alias("max") 
  )
  .show()
)

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Cleanup<br>

Run the **`Classroom-Cleanup`** cell below to remove any artifacts created by this lesson.

In [0]:
%run "../Includes/Classroom-Cleanup"

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/labs.png) Data Frames Lab #3
It's time to put what we learned to practice.

Go ahead and open the notebook [Introduction to DataFrames, Lab #3]($./Intro To DF Part 3 Lab) and complete the exercises.

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>