# Perform date and time manipulation

**Data Source:**
* English Wikipedia pageviews by second
* Size on Disk: ~255 MB
* Type: Parquet files

**Technical Accomplishments:**
* Explore more of the `...sql.functions` operations
  * Date & time functions

## Getting Started

In [0]:
from pyspark.sql import SparkSession

In [0]:
# Initialize Spark Session
spark = (SparkSession.builder
         .appName("Date-Time Manipulation")
         .getOrCreate())

## The Data Source

This notebook uses the **Pageviews By Seconds** data set.

In [0]:
%run ../DatasetSourcePath

In [0]:
spark.conf.get("spark.sql.shuffle.partitions")

In [0]:
from pyspark.sql.functions import *
from pyspark.sql.types import *

partitions = 7

# Make sure wide operations don't repartition to 200
spark.conf.set("spark.sql.shuffle.partitions", str(partitions))

In [0]:
# The directory containing our parquet files.
parquetFile = sourcePath + "/dataset/pageviews_by_second.parquet/"

In [0]:
# Create our initial DataFrame. We can let it infer the 
# schema because the cost for parquet files is really low.
initialDF = (spark.read
  .option("inferSchema", "true") # The default, but not costly w/Parquet
  .parquet(parquetFile)          # Read the data in
  .repartition(partitions)       # Repartition to the 7 partitions set above
  .cache()                       # Cache the expensive operation
)
# materialize the cache
initialDF.count()

## Preparing Our Data

If we will be working on any given dataset for a while, there are a handful of "necessary" steps to get us ready...

Most of which we've just knocked out above.

**Basic Steps**
1. <div style="text-decoration:line-through">Read the data in</div>
1. <div style="text-decoration:line-through">Balance the number of partitions to the number of slots</div>
1. <div style="text-decoration:line-through">Cache the data</div>
1. <div style="text-decoration:line-through">Adjust the `spark.sql.shuffle.partitions`</div>
1. Perform some basic ETL (i.e., convert strings to timestamp)
1. Possibly re-cache the data if the ETL was costly

What we haven't done is some of the basic ETL necessary to explore our data.

Namely, the problem is that the field "timestamp" is a string.

In order to perform date/time-based computations, I need to convert it to a proper timestamp type.

In [0]:
initialDF.printSchema()

## withColumnRenamed(..), withColumn(..), select(..)

My first hangup is that we have a **column named timestamp** and the **datatype will also be timestamp**

Just rename the column...

In [0]:
(initialDF
  .select( col("timestamp").alias("capturedAt"), col("site"), col("requests") )
  .printSchema()
)

There are a number of different ways to rename a column...

In [0]:
(initialDF
  .withColumnRenamed("timestamp", "capturedAt")
  .printSchema()
)

In [0]:
(initialDF
  .toDF("capturedAt", "site", "requests")
  .printSchema()
)

## unix_timestamp(..) & cast(..)

Now that **we** are over **my** hangup, we can focus on converting the **string** to a **timestamp**.

For this we will be looking at more of the functions in the `functions` package
* `pyspark.sql.functions` in the case of Python
* `org.apache.spark.sql.functions` in the case of Scala & Java

And so that we can watch the transformation, we will take one step at a time...

The first function is `unix_timestamp(..)`

If you look at the API docs, `unix_timestamp(..)` is described like this:
> Convert time string with given pattern (see <a href="http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html" target="_blank">SimpleDateFormat</a>) to Unix time stamp (in seconds), return null if fail.

`SimpleDateFormat` is part of the Java API and provides support for parsing and formatting date and time values.

In order to know what format the data is in, let's take a look at the first row...

Comparing that value with the patterns expressed in the docs for the `SimpleDateFormat` class, we can come up with a format:

**yyyy-MM-dd HH:mm:ss**

In [0]:
tempA = (initialDF
  .withColumnRenamed("timestamp", "capturedAt")
  .select( col("*"), unix_timestamp( col("capturedAt"), "yyyy-MM-dd HH:mm:ss") )
)
tempA.printSchema()

In [0]:
tempA.show(5)

**Note:** *If you haven't caught it yet, there is a bug in the previous code. Compare its format pattern with the one used in the next code cell.*
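
One way to see why the pattern matters is to try the same parse outside Spark. The sketch below uses plain Python's `datetime.strptime` as a stand-in for `unix_timestamp(..)`. The sample value is an assumed example of the data's layout (with a literal `T` separator), not a row taken from the dataset, and the `%`-style directives are only rough Python equivalents of the `SimpleDateFormat` letters.

```python
from datetime import datetime

# Hypothetical sample value: assumes the strings use a literal 'T' separator.
sample = "2015-03-16T00:00:00"

def try_parse(value, fmt):
    """Return a datetime, or None when the pattern does not match
    (similar in spirit to unix_timestamp(..) returning null on failure)."""
    try:
        return datetime.strptime(value, fmt)
    except ValueError:
        return None

# Rough analogue of "yyyy-MM-dd HH:mm:ss" (space separator)
with_space = try_parse(sample, "%Y-%m-%d %H:%M:%S")

# Rough analogue of "yyyy-MM-dd'T'HH:mm:ss" (literal T separator)
with_t = try_parse(sample, "%Y-%m-%dT%H:%M:%S")

print(with_space)  # None
print(with_t)      # 2015-03-16 00:00:00
```

If the real data looks like the sample, the space-separated pattern never matches, which is worth keeping in mind when reading the output of `tempA.show(5)`.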

A couple of things happened...
1. We ended up with a new column - that's OK for now
1. The new column has a really funky name - based upon the name of the function we called and its parameters.
1. The data type is now a long.
    * This value is the Unix epoch time
    * The number of seconds since 1970-01-01T00:00:00Z
  
We can now take that epoch value and use the `Column.cast(..)` method to convert it to a **timestamp**.
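
To make the epoch value less abstract, here is a small plain-Python sketch of the same round trip: string to epoch seconds, then back to a timestamp. The sample string is hypothetical, and it is pinned to UTC here as an assumption for illustration. Spark interprets a zone-less string in the session time zone, so the number it produces may differ from this one by the zone offset.

```python
from datetime import datetime, timezone

# Hypothetical sample value, treated as UTC for this illustration only.
sample = "2015-03-16T00:00:00"

parsed = (datetime
          .strptime(sample, "%Y-%m-%dT%H:%M:%S")
          .replace(tzinfo=timezone.utc))

# Seconds since 1970-01-01T00:00:00Z, the unit unix_timestamp(..) returns
epoch_seconds = int(parsed.timestamp())

# Conceptually what casting the long back to a timestamp does
round_trip = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)

print(epoch_seconds)           # 1426464000
print(round_trip.isoformat())  # 2015-03-16T00:00:00+00:00
```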

In [0]:
tempB = (initialDF
  .withColumnRenamed("timestamp", "capturedAt")
  .select( col("*"), unix_timestamp( col("capturedAt"), "yyyy-MM-dd'T'HH:mm:ss").cast("timestamp") )
)
tempB.printSchema()

In [0]:
tempB.show(5)

Now that our column `capturedAt` has been converted from a **string** to a **timestamp**, we just need to deal with this REALLY funky column name.

Again... there are several ways to do this.

I'll let you decide which you like better...

### Option #1
The `alias()` method (`as()` in Scala) can be appended to the chain of calls.

This version will actually produce an odd little bug.<br/>
That is, how do you get rid of only one of the two `capturedAt` columns?

In [0]:
tempC = (initialDF
  .withColumnRenamed("timestamp", "capturedAt")
  .select( col("*"), unix_timestamp( col("capturedAt"), "yyyy-MM-dd'T'HH:mm:ss").cast("timestamp").alias("capturedAt") )
)
tempC.printSchema()

In [0]:
tempC.show(5)

### Option #2
The `withColumn(..)` method names the resulting column (first param), replacing any existing<br/>
column with that name, and accepts as a second parameter the expression(s) we need for our transformation

In [0]:
tempD = (initialDF
  .withColumnRenamed("timestamp", "capturedAt")
  .withColumn("capturedAt", unix_timestamp( col("capturedAt"), "yyyy-MM-dd'T'HH:mm:ss").cast("timestamp") )
)
tempD.printSchema()

In [0]:
tempD.show(5)

### Option #3

We can take the big ugly name and explicitly rename it.

This version will actually produce an odd little bug.<br/>
That is, how do you get rid of only one of the two `capturedAt` columns?

In [0]:
#Option #3

tempE = (initialDF
  .withColumnRenamed("timestamp", "capturedAt")
  .select( col("*"), unix_timestamp( col("capturedAt"), "yyyy-MM-dd'T'HH:mm:ss").cast("timestamp") )
  .withColumnRenamed("CAST(unix_timestamp(capturedAt, yyyy-MM-dd'T'HH:mm:ss) AS TIMESTAMP)", "capturedAt")
  # .drop("timestamp")
)
tempE.printSchema()

In [0]:
tempE.show(5)

### Option #4

The last version is a twist on the others in which we start with the <br/>
original column `timestamp` and supply both the new name and the expression in one call.

But this version leaves us with the old `timestamp` column in the `DataFrame`.

In [0]:
tempF = (initialDF
  .withColumn("capturedAt", unix_timestamp( col("timestamp"), "yyyy-MM-dd'T'HH:mm:ss").cast("timestamp") )
)
tempF.printSchema()

In [0]:
tempF.show(5)


Let's pick the "cleanest" version...

And with our base `DataFrame` in place we can start exploring the data a little...

In [0]:
pageviewsDF = (initialDF
  .withColumnRenamed("timestamp", "capturedAt")
  .withColumn("capturedAt", unix_timestamp( col("capturedAt"), "yyyy-MM-dd'T'HH:mm:ss").cast("timestamp") )
)

pageviewsDF.printSchema()

In [0]:
pageviewsDF.show(5)


And just so that we don't have to keep performing these transformations.... 

Mark the `DataFrame` as cached and then materialize the result.

In [0]:
pageviewsDF.cache().count()

## year(..), month(..), dayofyear(..)

Let's take a look at some of the other date & time functions...

With that we can answer a simple question: When was this data captured?

We can start specifically with the year...

In [0]:
(pageviewsDF
  .select( year( col("capturedAt")) ) # Every record converted to a single column - the year captured
  .distinct()                         # Reduce all years to the list of distinct years
  .show()
)


Now let's take a look at which months this data was captured in...

In [0]:
(pageviewsDF
    .select( month( col("capturedAt")) ) # Every record converted to a single column - the month captured
    .distinct()                          # Reduce all months to the list of distinct months
    .show()
)

And of course both of these can be combined in a single call...

In [0]:
(pageviewsDF
  .select( month(col("capturedAt")).alias("month"), year(col("capturedAt")).alias("year"))
  .distinct()
  .show()                     
)

It's pretty easy to see that the data was captured during March & April of 2015.
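
The heading of this section also lists `dayofyear(..)`, which none of the cells above call. As a rough illustration of what that kind of function extracts, here is a plain-Python sketch (not Spark code) that pulls the year, month, and day of the year out of two hypothetical timestamps. The values are made up to fall in March and April of 2015 and are not taken from the dataset. I'd expect Spark's `dayofyear(..)` to follow the same 1-based convention, but treat that as an assumption to check against the API docs.

```python
from datetime import datetime

# Hypothetical timestamps chosen to fall in March and April of 2015.
samples = [
    datetime(2015, 3, 16, 0, 0, 0),
    datetime(2015, 4, 20, 12, 30, 0),
]

# (year, month, day-of-year) for each sample; tm_yday counts from 1
parts = [(ts.year, ts.month, ts.timetuple().tm_yday) for ts in samples]

for year_value, month_value, day_of_year in parts:
    print(year_value, month_value, day_of_year)
# 2015 3 75
# 2015 4 110
```

In the notebook itself, the analogous call would use `dayofyear( col("capturedAt") )` inside a `select(..)`, in the same way `year(..)` and `month(..)` are used above.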

In [0]:
# spark.stop()