
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 1200px">
</div>

# Intro To DataFrames, Lab #4
## What-The-Monday?

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Instructions

As we saw in the previous notebook...
* There are a lot more requests for sites on Monday than on any other day of the week.
* The spike is **NOT** unique to either the mobile or the desktop site; both show it.

Your mission, should you choose to accept it, is to demonstrate conclusively why there are more requests on Monday than on any other day of the week.

Feel free to copy & paste from the previous notebook.

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Setup<br>

For each lesson to execute correctly, please make sure to run the **`Classroom-Setup`** cell at the start of each lesson (see the next cell) and the **`Classroom-Cleanup`** cell at the end of each lesson.

In [0]:
%run "../Includes/Classroom-Setup"

In [0]:
(source, sasEntity, sasToken) = getAzureDataSource()
spark.conf.set(sasEntity, sasToken)

fileName = source + "/wikipedia/pageviews/pageviews_by_second.parquet"

In [0]:
# ANSWER

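# Import only what this notebook uses; a wildcard import from
# pyspark.sql.functions shadows Python builtins such as sum, min and max.
from pyspark.sql.functions import col, dayofyear, unix_timestamp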

# I've already gone through the exercise to determine
# how many partitions I want and in this case it is...
partitions = 8

# Make sure wide operations don't repartition to 200
spark.conf.set("spark.sql.shuffle.partitions", str(partitions))

# The directory containing our parquet files.
parquetFile = "/mnt/training/wikipedia/pageviews/pageviews_by_second.parquet/"

# Create our initial DataFrame. Parquet files embed their schema
# in the file metadata, so no inferSchema option is needed.
pageviewsDF = (spark.read
  .parquet(parquetFile)                         # Read the data in
  .repartition(partitions)                      # From 7 >>> 8 partitions
  .withColumnRenamed("timestamp", "capturedAt") # rename and convert to timestamp datatype
  .withColumn("capturedAt", unix_timestamp( col("capturedAt"), "yyyy-MM-dd'T'HH:mm:ss").cast("timestamp") )
  .orderBy( col("capturedAt"), col("site") )    # sort our records
  .cache()                                      # Cache the expensive operation
)
# materialize the cache
pageviewsDF.count()
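
The pattern `yyyy-MM-dd'T'HH:mm:ss` passed to `unix_timestamp` above describes a date and a time separated by a literal `T`. As a rough plain-Python illustration (no Spark required), the corresponding `strptime` pattern is `%Y-%m-%dT%H:%M:%S`. The sample string below is made up for illustration and is not taken from the dataset.

```python
from datetime import datetime

# Hypothetical sample value, shaped like the strings the pattern expects.
raw = "2015-03-16T00:09:55"

# Plain-Python counterpart of the Spark pattern "yyyy-MM-dd'T'HH:mm:ss".
captured_at = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S")

print(captured_at)  # 2015-03-16 00:09:55
```

A string that does not match the pattern raises `ValueError` in plain Python, so checking a few sample values this way is a cheap sanity test before parsing a whole column.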

In [0]:
# ANSWER

recordCount = pageviewsDF.count()
distinctCount = pageviewsDF.distinct().count()

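# The difference is the number of surplus copies, i.e. rows that
# repeat another row exactly.
print(f"The DataFrame contains {recordCount} records, {distinctCount} of them distinct.\n"
      f"{recordCount - distinctCount} records are redundant copies of another record.")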
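
It helps to be precise about what the difference between the two counts measures: it is the number of surplus copies, not the number of records that have a twin. A minimal plain-Python sketch, using invented rows (the values and the third `requests` field are assumptions for illustration only):

```python
# Hypothetical (capturedAt, site, requests) rows; one row appears three times.
rows = [
    ("2015-04-20T00:00:00", "mobile", 1200),
    ("2015-04-20T00:00:00", "mobile", 1200),
    ("2015-04-20T00:00:00", "mobile", 1200),
    ("2015-04-20T00:00:00", "desktop", 2400),
]

record_count = len(rows)         # analogous to pageviewsDF.count()
distinct_count = len(set(rows))  # analogous to pageviewsDF.distinct().count()
surplus = record_count - distinct_count

print(record_count, distinct_count, surplus)  # 4 2 2
```

Here a row that occurs three times contributes two surplus copies, which is why the difference alone does not say how many distinct records were affected.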

In [0]:
# ANSWER
from pyspark.sql.functions import col

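# Let's see the duplicates: count the rows for each (capturedAt, site)
# pair and list the most repeated pairs first. A count above 1 means
# that pair occurs more than once.
display(pageviewsDF
  .groupBy(col("capturedAt"), col("site"))
  .count()
  .orderBy(col("count").desc(), col("capturedAt"))
)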
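
The same group-and-count idea can be tried in plain Python. The sketch below counts rows per `(capturedAt, site)` key with `collections.Counter` and keeps only the keys that occur more than once. The sample keys are invented for illustration.

```python
from collections import Counter

# Hypothetical (capturedAt, site) keys, not taken from the dataset.
keys = [
    ("2015-04-19T23:59:59", "desktop"),
    ("2015-04-20T00:00:00", "desktop"),
    ("2015-04-20T00:00:00", "desktop"),
    ("2015-04-20T00:00:00", "mobile"),
    ("2015-04-20T00:00:00", "mobile"),
]

# Count occurrences of each key, then keep only the repeated ones.
counts = Counter(keys)
duplicated = {key: n for key, n in counts.items() if n > 1}

for key, n in sorted(duplicated.items()):
    print(key, n)
```

In Spark, one way to get the same narrowing would be to add `.filter(col("count") > 1)` after `.count()`, so that only repeated keys are displayed.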

In [0]:
# ANSWER

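# Inspect a single day: keep only the records from day 110 of the
# year (April 20 in a non-leap year), sorted by time and site.
answerDF = (pageviewsDF
  .withColumn("day_of_year", dayofyear(col("capturedAt")))
  .filter("day_of_year = 110")
  .orderBy("capturedAt", "site")
)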
display(answerDF)
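
To connect day 110 back to a weekday, the day-of-year value can be converted to a calendar date. The sketch below assumes the data was captured in 2015. The notebook does not state the year, so treat that as an assumption and substitute the year your data actually covers.

```python
from datetime import date, timedelta

year = 2015        # ASSUMPTION: the year is not stated in the notebook
day_of_year = 110  # the value used in the filter above

# Day 1 is January 1, so day N is N - 1 days later.
target = date(year, 1, 1) + timedelta(days=day_of_year - 1)

# weekday() returns 0 for Monday through 6 for Sunday.
print(target.isoformat(), target.weekday())  # 2015-04-20 0
```

Under that assumption day 110 is a Monday, so duplicated records on that day would inflate the Monday total for both sites.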

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Cleanup<br>

Run the **`Classroom-Cleanup`** cell below to remove any artifacts created by this lesson.

In [0]:
%run "../Includes/Classroom-Cleanup"


&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>