// Databricks notebook source exported at Wed, 10 Feb 2016 20:42:12 UTC


![Wikipedia Logo](http://sameerf-dbc-labs.s3-website-us-west-2.amazonaws.com/data/wikipedia/images/w_logo_for_labs.png)

# Explore English Wikipedia pageviews by second
### Time to complete: 15 minutes

#### Business questions:

* Question #1) How many rows in the table refer to *mobile* vs *desktop* site requests?
* Question #2) How many total incoming requests were to the *mobile* site vs the *desktop* site?
* Question #3) What is the start and end range of time for the pageviews data? How many days total of data do we have?
* Question #4) Which day of the week does Wikipedia get the most traffic?
* Question #5) Can you visualize both the mobile and desktop site requests together in a line chart to compare traffic between both sites by day of the week?

#### Technical Accomplishments:

* Use Spark's Scala and Python APIs
* Learn what a `sqlContext` is and how to use it
* Load a 255 MB tab separated file into a DataFrame
* Cache a DataFrame into memory
* Run some DataFrame transformations and actions to create visualizations
* Learn the following DataFrame operations: `show()`, `printSchema()`, `orderBy()`, `filter()`, `groupBy()`, `cast()`, `alias()`, `distinct()`, `count()`, `sum()`, `avg()`, `min()`, `max()`
* Write a User Defined Function (UDF)
* Join two DataFrames
* Bonus: Use Matplotlib and Python code within a Scala notebook to create a line chart



Dataset: http://datahub.io/en/dataset/english-wikipedia-pageviews-by-second

In [1]:
import sqlContext.implicits._

### Introduction to running Scala in Databricks Notebooks

Place your cursor inside the cells below, one at a time, and hit "Shift" + "Enter" to execute the code:

In [2]:
// This is a Scala cell. You can run normal Scala code here...
val x = 1 + 7

In [3]:
// Here is another Scala cell, that adds 2 to x
val y = 2 + x

In [4]:
// This line uses string interpolation to print what y is equal to...
println(s"y is equal to ${y}")

y is equal to 10


In [5]:
// You can import additional modules and use them
import java.util.Date
println(s"This was last run on: ${new Date}")

This was last run on: Fri Feb 26 05:42:12 UTC 2016


### DataFrames
A `sqlContext` object is your entry point for working with structured data (rows and columns) in Spark.

Let's use the `sqlContext` to read a table of the English Wikipedia pageviews per second.

In [6]:
// Notice that the sqlContext is actually a HiveContext
sqlContext

org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@1ab08e83

A `HiveContext` includes additional features like the ability to write queries using the more complete HiveQL parser, access to Hive UDFs, and the ability to read data from Hive tables. In general, you should prefer a `HiveContext` over the more limited base `SQLContext`.

Create a DataFrame named `pageviewsDF` and understand its schema:

In [7]:
// You need to preload this data into DSE by running a command from a previous notebook.

val pageviewsDF = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "pageviews", "keyspace" -> "pageviews_ks"))
  .load()

In [8]:
// Show the first 20 records as an ASCII table
pageviewsDF.show()

+--------------------+--------+-------+-------------------+
|                 uid|requests|   site|                 ts|
+--------------------+--------+-------+-------------------+
|033c57bb-ee86-40d...|    1387| mobile|2015-04-03T01:42:10|
|9d94abff-0c6f-4c0...|    3360|desktop|2015-04-23T19:02:04|
|e7cbde2b-55b9-450...|    1240| mobile|2015-04-18T13:38:32|
|7eabd00c-47f9-4a2...|    2205|desktop|2015-04-18T06:15:32|
|6b4a1ead-d5b7-477...|    3047|desktop|2015-04-23T19:03:40|
|26c33699-e78b-4f6...|    1326| mobile|2015-04-21T03:05:42|
|a51642df-e07c-408...|    2158|desktop|2015-03-19T01:09:17|
|d3daf8d6-535a-4dd...|    1564| mobile|2015-04-14T17:44:07|
|24a3ef91-b17f-46c...|    1410| mobile|2015-03-19T23:00:42|
|c3cc0447-37dc-45c...|    2145|desktop|2015-03-31T09:34:12|
|c3077138-32a9-4fe...|    1459| mobile|2015-03-17T18:32:46|
|c166a0cf-6bd1-456...|    1665| mobile|2015-04-04T19:49:39|
|cf108e13-49d6-47b...|    2894|desktop|2015-04-13T13:07:31|
|f21a1f05-24b3-471...|    1501| mobile|2

 `printSchema()` prints out the schema, the data types and whether a column can be null:

In [9]:
pageviewsDF.printSchema()

root
 |-- uid: string (nullable = true)
 |-- requests: integer (nullable = true)
 |-- site: string (nullable = true)
 |-- ts: string (nullable = true)



Notice above that the `uid`, `site` and `ts` columns are typed as strings, while the `requests` column holds integers.

Also notice, in the table displayed a few cells above, that the rows are not in chronological order and seem to be missing chunks of time.

The first row shows data from April 3, 2015 at **01:42:10**, and the second row shows data from April 23, 2015 at **19:02:04**. Data appears to be missing between those times because the source does not return rows sorted by timestamp, and Spark reads them into the DataFrame in the order it receives them.

Our data set does actually contain 2 rows for every second (one row for mobile site requests and another for desktop site requests).

We can verify this by ordering the table by the timestamp column:

In [12]:
// The following orders the rows by first the timestamp (ascending), then the site (descending) and then shows the first 10 rows

pageviewsDF.orderBy($"ts", $"site".desc).show(10)

+--------------------+--------+-------+-------------------+
|                 uid|requests|   site|                 ts|
+--------------------+--------+-------+-------------------+
|8480716c-e6c8-408...|    1628| mobile|2015-03-16T00:00:00|
|8fd64417-b4e8-468...|    2343|desktop|2015-03-16T00:00:00|
|eea9678a-4d15-499...|    1636| mobile|2015-03-16T00:00:01|
|0df4666e-b69a-4b5...|    2382|desktop|2015-03-16T00:00:01|
|50f0b75f-398b-422...|    1619| mobile|2015-03-16T00:00:02|
|240b1ef3-3a42-48c...|    2546|desktop|2015-03-16T00:00:02|
|416a3aab-583e-4fa...|    1776| mobile|2015-03-16T00:00:03|
|8d3b769a-786f-463...|    2402|desktop|2015-03-16T00:00:03|
|1d9905f7-8ae2-441...|    1716| mobile|2015-03-16T00:00:04|
|85fa4361-e6e6-492...|    2370|desktop|2015-03-16T00:00:04|
+--------------------+--------+-------+-------------------+
only showing top 10 rows
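The first 10 rows only spot-check the two-rows-per-second claim. A minimal sketch of a check over the whole table follows, assuming the notebook's `pageviewsDF` and `sqlContext.implicits._` are in scope; the names `rowsPerSecondDF`, `n` and `irregularSeconds` are introduced here for illustration only.

```scala
import org.apache.spark.sql.functions.count

// Count how many rows share each timestamp value
val rowsPerSecondDF = pageviewsDF
  .groupBy($"ts")
  .agg(count("*").alias("n"))

// Number of seconds that do NOT have exactly two rows;
// a result of 0 would confirm the claim for the whole table
val irregularSeconds = rowsPerSecondDF.filter("n != 2").count()
println(s"Seconds without exactly two rows: $irregularSeconds")
```

This triggers a full shuffle on `ts`, so it may take noticeably longer than a plain `count()`.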



### Reading from disk vs memory

In [13]:
// Count how many total records (rows) there are
pageviewsDF.count()

Long = 7200000

 Hmm, that took about 10 - 20 seconds. Let's cache the DataFrame into memory to speed it up.

In [14]:
pageviewsDF.cache()

pageviewsDF.type = [uid: string, requests: int, site: string, ts: string]

Caching is a lazy operation (meaning it doesn't take effect until you call an action that needs to read all of the data). So let's call the `count()` action again:

In [15]:
// During this count() action, the data is not only read from Cassandra and counted, but also cached
pageviewsDF.count()

Long = 7200000

The DataFrame should now be cached. Let's run another `count()` to see the speed increase:

In [16]:
pageviewsDF.count()

Long = 7200000

 Notice that operating on the DataFrame now takes less than 1 second!
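To measure these durations rather than estimate them, a small timing wrapper can help. The sketch below is plain Scala; `timed` is a hypothetical helper name, not part of Spark or Databricks, and the local sum is only a stand-in workload so the block runs on its own.

```scala
// Hypothetical helper: runs `body`, prints the elapsed wall-clock time,
// and returns whatever `body` returned
def timed[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  val elapsedMs = (System.nanoTime() - start) / 1e6
  println(f"$label took $elapsedMs%.1f ms")
  result
}

// Stand-in workload; in this notebook you could instead call
// timed("count")(pageviewsDF.count())
val total = timed("local sum")((1L to 1000000L).sum)
```

Wrapping the uncached and cached `count()` calls this way would show the difference in one place, assuming `pageviewsDF` is defined as above.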

### Exploring pageviews

Time to do some data analysis!

 
### Question #1:
**How many rows in the table refer to mobile vs desktop?**

In [17]:
pageviewsDF.filter($"site" === "mobile").count()

Long = 3600000

In [18]:
pageviewsDF.filter($"site" === "desktop").count()

Long = 3600000

 We can also group the data by the `site` column and then call count:

In [19]:
pageviewsDF.groupBy($"site").count().show()

+-------+-------+
|   site|  count|
+-------+-------+
| mobile|3600000|
|desktop|3600000|
+-------+-------+



 So, 3.6 million rows refer to the mobile page views and 3.6 million rows refer to desktop page views.
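Because the `sqlContext` also accepts SQL, the same grouping can be sketched as a query. This assumes the Spark 1.x API used in this notebook, where `registerTempTable` exposes a DataFrame to SQL (later Spark versions replace it with `createOrReplaceTempView`); the table name `pageviews_tmp` and column alias `row_count` are made up for this example.

```scala
// Expose the DataFrame to SQL under a temporary, session-local name
pageviewsDF.registerTempTable("pageviews_tmp")

// Same result as groupBy($"site").count(), expressed in SQL
val rowsBySiteDF = sqlContext.sql(
  "SELECT site, COUNT(*) AS row_count FROM pageviews_tmp GROUP BY site")

rowsBySiteDF.show()
```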

 
### Question #2:
**How many total incoming requests were to the mobile site vs the desktop site?**

 First, let's sum up the `requests` column to see how many total requests are in the dataset.

In [20]:
// Import the sql functions package, which includes statistical functions like sum, max, min, avg, etc.
import org.apache.spark.sql.functions._

In [21]:
pageviewsDF.select(sum($"requests")).show()

+-------------+
|sum(requests)|
+-------------+
|  13342978934|
+-------------+



 So, there are about 13.3 billion requests total.

 But how many of the requests were for the mobile site?

**Challenge 1:** Using just the commands we explored above, can you figure out how to filter the DataFrame for just mobile traffic and then sum the requests column?

In [22]:
//Type in your answer here...
pageviewsDF.filter("site = 'mobile'").select(sum($"requests")).show()

+-------------+
|sum(requests)|
+-------------+
|   4605797962|
+-------------+



So, about 4.6 billion requests were for the mobile site (and probably came from mobile phone browsers).

**Challenge 2:** What about the desktop site? How many requests did it get?

In [23]:
//Type in your answer here...
pageviewsDF.filter("site = 'desktop'").select(sum($"requests")).show()

+-------------+
|sum(requests)|
+-------------+
|   8737180972|
+-------------+



So, nearly twice as many requests (about 8.7 billion) were for the desktop site.
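The comparison between the two sites can be made precise with a little arithmetic. The sketch below is plain Scala; the two totals are copied from the `sum(requests)` outputs above, and the variable names are illustrative.

```scala
// Totals copied from the sum(requests) outputs for each site
val mobileRequests  = 4605797962L
val desktopRequests = 8737180972L

// The two per-site totals should add up to the overall total shown earlier
val totalRequests = mobileRequests + desktopRequests

val desktopToMobile = desktopRequests.toDouble / mobileRequests
val mobileShare     = mobileRequests.toDouble / totalRequests

println(f"desktop/mobile ratio: $desktopToMobile%.2f")
println(f"mobile share of all requests: ${mobileShare * 100}%.1f%%")
```

The sum of the two totals matches the 13,342,978,934 requests computed for the whole table, which is a useful sanity check on the filters.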

 
### Question #3:
**What is the start and end range of time for the pageviews data? How many days of data do we have?**

To accomplish this, we should first convert the `ts` column from a `string` type to a `timestamp` type.

In [24]:
// Currently in our DataFrame, `pageviewsDF`, the `ts` column is typed as a string
pageviewsDF.printSchema()

root
 |-- uid: string (nullable = true)
 |-- requests: integer (nullable = true)
 |-- site: string (nullable = true)
 |-- ts: string (nullable = true)



Create a new DataFrame, `pageviewsDF2`, that casts the `ts` column from a `string` data type to a `timestamp` data type and renames it to `timestamp`.

In [25]:
// Cast the `ts` column to a timestamp and rename it to `timestamp`

val pageviewsDF2 = pageviewsDF.select($"ts".cast("timestamp").alias("timestamp"), $"site", $"requests")

In [26]:
pageviewsDF2.printSchema()

root
 |-- timestamp: timestamp (nullable = true)
 |-- site: string (nullable = true)
 |-- requests: integer (nullable = true)



In [27]:
pageviewsDF2.show()

+--------------------+-------+--------+
|           timestamp|   site|requests|
+--------------------+-------+--------+
|2015-04-03 01:42:...| mobile|    1387|
|2015-04-23 19:02:...|desktop|    3360|
|2015-04-18 13:38:...| mobile|    1240|
|2015-04-18 06:15:...|desktop|    2205|
|2015-04-23 19:03:...|desktop|    3047|
|2015-04-21 03:05:...| mobile|    1326|
|2015-03-19 01:09:...|desktop|    2158|
|2015-04-14 17:44:...| mobile|    1564|
|2015-03-19 23:00:...| mobile|    1410|
|2015-03-31 09:34:...|desktop|    2145|
|2015-03-17 18:32:...| mobile|    1459|
|2015-04-04 19:49:...| mobile|    1665|
|2015-04-13 13:07:...|desktop|    2894|
|2015-04-03 22:22:...| mobile|    1501|
|2015-03-30 08:41:...|desktop|    2217|
|2015-04-05 23:08:...|desktop|    2206|
|2015-04-24 15:34:...| mobile|    1302|
|2015-04-01 02:07:...|desktop|    2232|
|2015-03-29 19:48:...| mobile|    1738|
|2015-04-06 07:52:...| mobile|     977|
+--------------------+-------+--------+
only showing top 20 rows



 How many different years is our data from?

For the next command, we'll use `year()`, one of the date time functions available in Spark. You can review which functions are available for DataFrames in the [Spark API docs](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$).

In [28]:
pageviewsDF2.select(year($"timestamp")).distinct().show()

+---------------+
|year(timestamp)|
+---------------+
|           2015|
+---------------+



 The data only spans 2015. But which months?

**Challenge 3:** Can you figure out how to check which months of 2015 our data covers (using the Spark API docs linked to above)?

In [29]:
//Type in your answer here...
pageviewsDF2.select(month($"timestamp")).distinct().show()

+----------------+
|month(timestamp)|
+----------------+
|               3|
|               4|
+----------------+



The data covers March and April of 2015.

**Challenge 4:** How many weeks does our data cover?

*Hint: check out the date time functions available in the [Spark API docs](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$).*

In [30]:
//Type in your answer below...
pageviewsDF2.select(weekofyear($"timestamp")).distinct().show()

+---------------------+
|weekofyear(timestamp)|
+---------------------+
|                   12|
|                   13|
|                   14|
|                   15|
|                   16|
|                   17|
+---------------------+



The data set covers six calendar weeks (weeks 12 through 17 of the year). Similarly, we can see how many days of coverage we have:

In [31]:
pageviewsDF2.select(dayofyear($"timestamp")).distinct().count()

Long = 41

 We have 41 days of data.
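The distinct-day count answers "how many days", but not the start and end of the range. A minimal sketch for that part of the question follows, assuming `pageviewsDF2` and `sqlContext.implicits._` from this notebook; `timeRangeDF` and the column aliases are illustrative names.

```scala
import org.apache.spark.sql.functions.{datediff, max, min, to_date}

// Earliest and latest timestamps, plus the number of days between their dates
val timeRangeDF = pageviewsDF2
  .select(min($"timestamp").alias("start"), max($"timestamp").alias("end"))
  .select(
    $"start",
    $"end",
    datediff(to_date($"end"), to_date($"start")).alias("days_between"))

// Pass false so the timestamps are not truncated in the printed table
timeRangeDF.show(false)
```

Note that `days_between` is the difference between the first and last calendar dates, so it need not equal the number of distinct days that actually contain data.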

 To understand our data better, let's look at the average, minimum and maximum number of requests received for mobile, then desktop page views over every 1 second interval:

In [32]:
// Look at mobile statistics
pageviewsDF2.filter("site = 'mobile'").select(avg($"requests"), min($"requests"), max($"requests")).show()

+------------------+-------------+-------------+
|     avg(requests)|min(requests)|max(requests)|
+------------------+-------------+-------------+
|1279.3883227777778|          645|         3292|
+------------------+-------------+-------------+



In [33]:
// Look at desktop statistics
pageviewsDF2.filter("site = 'desktop'").select(avg($"requests"), min($"requests"), max($"requests")).show()

+------------------+-------------+-------------+
|     avg(requests)|min(requests)|max(requests)|
+------------------+-------------+-------------+
|2426.9947144444445|         1312|         5695|
+------------------+-------------+-------------+



There certainly appear to be more requests for the desktop site.
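The two filtered queries can also be combined into one grouped aggregation, which puts both sites side by side. This is a sketch assuming `pageviewsDF2` and `sqlContext.implicits._` from this notebook; `statsBySiteDF` and the aliases are names chosen for the example.

```scala
import org.apache.spark.sql.functions.{avg, max, min}

// One row per site with the average, minimum and maximum requests per second
val statsBySiteDF = pageviewsDF2
  .groupBy($"site")
  .agg(
    avg($"requests").alias("avg_requests"),
    min($"requests").alias("min_requests"),
    max($"requests").alias("max_requests"))

statsBySiteDF.show()
```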

 
### Question #4:
**Which day of the week does Wikipedia get the most traffic?**

 Think about how we can accomplish this. We need to pull out the day of the week (like Mon, Tues, etc) from each row, and then sum up all of the requests by day.

 First, use the `date_format` function to extract out the day of the week from the timestamp and rename the column as "Day of week".

Then we'll sum up all of the requests for each day and show the results.

In [34]:
// Notice the use of alias() to rename the new column
// "E" is a pattern in Java's SimpleDateFormat class that extracts the "Day in Week"

// Create a new DataFrame named pageviewsByDayOfWeekDF and cache it
val pageviewsByDayOfWeekDF = pageviewsDF2
  .groupBy(date_format($"timestamp", "E").alias("Day of week"))
  .sum()
  .cache()

// Show what is in the new DataFrame
pageviewsByDayOfWeekDF.show()

+-----------+-------------+
|Day of week|sum(requests)|
+-----------+-------------+
|        Tue|   1995034884|
|        Thu|   1931508977|
|        Sat|   1662762048|
|        Sun|   1576726066|
|        Fri|   1842512718|
|        Mon|   2356818845|
|        Wed|   1977615396|
+-----------+-------------+



 You can learn more about patterns, like "E", that [Java SimpleDateFormat](https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html) allows in the Java Docs.
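To see what the "E" pattern does outside of Spark, here is a small plain-Scala sketch that uses `SimpleDateFormat` directly. The sample timestamp is the earliest one shown in the sorted table above; the helper name `formatUtc` is hypothetical, and the time zone is pinned to UTC as an assumption so the result does not depend on the machine's settings.

```scala
import java.text.SimpleDateFormat
import java.util.{Locale, TimeZone}

// Parse a timestamp string laid out like the values in the `ts` column
val parser = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss", Locale.ENGLISH)
parser.setTimeZone(TimeZone.getTimeZone("UTC"))
val sampleDate = parser.parse("2015-03-16T00:00:00")

// Format the parsed date with an arbitrary SimpleDateFormat pattern
def formatUtc(pattern: String): String = {
  val formatter = new SimpleDateFormat(pattern, Locale.ENGLISH)
  formatter.setTimeZone(TimeZone.getTimeZone("UTC"))
  formatter.format(sampleDate)
}

val shortDay = formatUtc("E")    // abbreviated day name: "Mon"
val longDay  = formatUtc("EEEE") // full day name: "Monday"
println(s"$shortDay / $longDay")
```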

 It would help to visualize the results:

In [35]:
// This is the same command as above, except here we're tacking on an orderBy() to sort by day of week
pageviewsByDayOfWeekDF.orderBy($"Day of week").show()

+-----------+-------------+
|Day of week|sum(requests)|
+-----------+-------------+
|        Fri|   1842512718|
|        Mon|   2356818845|
|        Sat|   1662762048|
|        Sun|   1576726066|
|        Thu|   1931508977|
|        Tue|   1995034884|
|        Wed|   1977615396|
+-----------+-------------+



 Hmm, the ordering of the days of the week is off, because the `orderBy()` operation is ordering the days of the week alphabetically. Instead of that, let's start with Monday and end with Sunday. To accomplish this, we'll need to write a short User Defined Function (UDF) to prepend each `Day of week` with a number.

### User Defined Functions

A UDF lets you code your own logic for processing column values during a DataFrame query. 

First, let's create a Scala match expression for pattern matching:

In [36]:
def matchDayOfWeek(day:String): String = {
  day match {
    case "Mon" => "1-Mon"
    case "Tue" => "2-Tue"
    case "Wed" => "3-Wed"
    case "Thu" => "4-Thu"
    case "Fri" => "5-Fri"
    case "Sat" => "6-Sat"
    case "Sun" => "7-Sun"
    case _ => "UNKNOWN"
  }
}

 Test the match expression:

In [37]:
matchDayOfWeek("Tue")

String = 2-Tue
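The effect of the numeric prefix on sorting can be checked on an ordinary Scala collection before involving Spark. In this sketch, `weekDays` and `prefixDay` are hypothetical names; `prefixDay` is a position-based variant of the match expression above, not a replacement for it.

```scala
// Day abbreviations in calendar order, Monday first
val weekDays = Seq("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")

// Derive the prefix from the day's position in the list; unknown values
// fall back to "UNKNOWN", like the default case of the match expression
def prefixDay(day: String): String = {
  val index = weekDays.indexOf(day)
  if (index >= 0) s"${index + 1}-$day" else "UNKNOWN"
}

// Plain string sorting is alphabetical: Fri, Mon, Sat, Sun, Thu, Tue, Wed
val alphabetical = weekDays.sorted

// With the prefix, string sorting matches calendar order: 1-Mon ... 7-Sun
val chronological = weekDays.map(prefixDay).sorted

println(alphabetical.mkString(", "))
println(chronological.mkString(", "))
```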

 Great, it works! Now define a UDF named `prependNumber`:

In [38]:
val prependNumberUDF = sqlContext.udf.register("prependNumber", (s: String) => matchDayOfWeek(s))

In [None]:
// Note, here is a more idiomatic Scala way of registering the same UDF
// val prependNumberUDF = sqlContext.udf.register("prependNumber", matchDayOfWeek _)
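Registering through `sqlContext.udf.register` also makes the function callable by name from SQL. If only the DataFrame API is needed, wrapping the function with `udf` from `org.apache.spark.sql.functions` is an alternative. The sketch below assumes `matchDayOfWeek`, `pageviewsByDayOfWeekDF` and `sqlContext.implicits._` from this notebook; `prependNumberColumnUDF` is a name introduced here.

```scala
import org.apache.spark.sql.functions.udf

// Wrap the Scala function as a column-level UDF without registering a SQL name
val prependNumberColumnUDF = udf(matchDayOfWeek _)

// Apply it to the "Day of week" column and give the result a readable name
val prefixedDaysDF = pageviewsByDayOfWeekDF
  .select(prependNumberColumnUDF($"Day of week").alias("DOW"))

prefixedDaysDF.show(7)
```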

Test the UDF by prepending a number to each value in the `Day of week` column of the DataFrame:

In [39]:
pageviewsByDayOfWeekDF.select(prependNumberUDF($"Day of week")).show(7)

+----------------+
|UDF(Day of week)|
+----------------+
|           2-Tue|
|           4-Thu|
|           6-Sat|
|           7-Sun|
|           5-Fri|
|           1-Mon|
|           3-Wed|
+----------------+



 Our UDF looks like it's working. Next, let's apply the UDF and also order the x axis from Mon -> Sun:

In [40]:
// Rename the sum column, apply the UDF, and order by the prefixed day of week

pageviewsByDayOfWeekDF.withColumnRenamed("sum(requests)", "total requests")
  .select(prependNumberUDF($"Day of week"), $"total requests")
  .orderBy("UDF(Day of week)").show()

+----------------+--------------+
|UDF(Day of week)|total requests|
+----------------+--------------+
|           1-Mon|    2356818845|
|           2-Tue|    1995034884|
|           3-Wed|    1977615396|
|           4-Thu|    1931508977|
|           5-Fri|    1842512718|
|           6-Sat|    1662762048|
|           7-Sun|    1576726066|
+----------------+--------------+



Click on the bar chart icon to convert the above table into a Bar Chart. Also, under the Plot Options, you may need to set the Keys as "UDF(Day of week)" and the values as "total requests".

 Wikipedia seems to get significantly more traffic on Mondays than other days of the week.
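The generated column name `UDF(Day of week)` is awkward to reference in `orderBy()` and in plot options. One option, sketched below, is to alias the UDF output at the point where it is applied. This assumes `pageviewsByDayOfWeekDF`, `prependNumberUDF` and `sqlContext.implicits._` from this notebook; `requestsByDayDF`, `DOW` and `total_requests` are names chosen for the example.

```scala
// Alias the UDF result so later steps can refer to a stable column name
val requestsByDayDF = pageviewsByDayOfWeekDF
  .withColumnRenamed("sum(requests)", "total_requests")
  .select(prependNumberUDF($"Day of week").alias("DOW"), $"total_requests")
  .orderBy($"DOW")

requestsByDayDF.show()
```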

 
### Question #5:
**Can you visualize both the mobile and desktop site requests in a line chart to compare traffic between both sites by day of the week?**

 First, graph the mobile site requests:

In [41]:
val mobileViewsByDayOfWeekDF = pageviewsDF2
  .filter("site = 'mobile'")
  .groupBy(date_format($"timestamp", "E").alias("Day of week"))
  .sum()
  .withColumnRenamed("sum(requests)", "total requests")
  .select(prependNumberUDF($"Day of week"), $"total requests")
  .orderBy("UDF(Day of week)")
  .toDF("DOW", "mobile_requests")

// Cache this DataFrame
mobileViewsByDayOfWeekDF.cache()

mobileViewsByDayOfWeekDF.show()

+-----+---------------+
|  DOW|mobile_requests|
+-----+---------------+
|1-Mon|      790026669|
|2-Tue|      648087459|
|3-Wed|      631284694|
|4-Thu|      625338164|
|5-Fri|      635169886|
|6-Sat|      646334635|
|7-Sun|      629556455|
+-----+---------------+



Click on the bar chart icon again to convert the above table into a Bar Chart. Also, under the Plot Options, you may need to set the Keys as "DOW" and the values as "mobile_requests".

 Next, graph the desktop site requests:

In [42]:
val desktopViewsByDayOfWeekDF = pageviewsDF2
  .filter("site = 'desktop'")
  .groupBy(date_format($"timestamp", "E").alias("Day of week"))
  .sum()
  .withColumnRenamed("sum(requests)", "total requests")
  .select(prependNumberUDF($"Day of week"), $"total requests")
  .orderBy("UDF(Day of week)")
  .toDF("DOW", "desktop_requests")

// Cache this DataFrame
desktopViewsByDayOfWeekDF.cache()

desktopViewsByDayOfWeekDF.show()

+-----+----------------+
|  DOW|desktop_requests|
+-----+----------------+
|1-Mon|      1566792176|
|2-Tue|      1346947425|
|3-Wed|      1346330702|
|4-Thu|      1306170813|
|5-Fri|      1207342832|
|6-Sat|      1016427413|
|7-Sun|       947169611|
+-----+----------------+



 Now that we have two DataFrames (one for mobile views by day of week and another for desktop views), let's join both of them to compare mobile vs. desktop page views:

In [43]:
mobileViewsByDayOfWeekDF.join(desktopViewsByDayOfWeekDF, mobileViewsByDayOfWeekDF("DOW") === desktopViewsByDayOfWeekDF("DOW")).show()

+-----+---------------+-----+----------------+
|  DOW|mobile_requests|  DOW|desktop_requests|
+-----+---------------+-----+----------------+
|1-Mon|      790026669|1-Mon|      1566792176|
|2-Tue|      648087459|2-Tue|      1346947425|
|3-Wed|      631284694|3-Wed|      1346330702|
|4-Thu|      625338164|4-Thu|      1306170813|
|5-Fri|      635169886|5-Fri|      1207342832|
|6-Sat|      646334635|6-Sat|      1016427413|
|7-Sun|      629556455|7-Sun|       947169611|
+-----+---------------+-----+----------------+
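The joined table above carries the `DOW` column twice, once from each side. A possible variation, sketched here, joins on the column name instead of an equality expression, which keeps a single `DOW` column. It assumes the two DataFrames defined in the previous cells; `viewsByDayOfWeekDF` is an illustrative name.

```scala
// Joining on the shared column name keeps only one DOW column in the result
val viewsByDayOfWeekDF = mobileViewsByDayOfWeekDF
  .join(desktopViewsByDayOfWeekDF, "DOW")
  .orderBy("DOW") // a join does not guarantee row order, so sort explicitly

viewsByDayOfWeekDF.show()
```

With one key column and two value columns, the result is in a convenient shape for plotting both series against the day of the week.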

