[ScaDaMaLe, Scalable Data Science and Distributed Machine Learning](https://lamastex.github.io/scalable-data-science/sds/3/x/)
==============================================================================================================================

Wiki Clickstream Analysis
=========================

**Dataset: 3.2 billion requests collected during the month of
February 2015, grouped by (src, dest)**

**Source: https://datahub.io/dataset/wikipedia-clickstream/**

![NY clickstream
image](https://databricks-prod-cloudfront.s3.amazonaws.com/docs/images/ny.clickstream.png "NY clickstream image")

*This notebook requires Spark 1.6+.*

This notebook was originally a data analysis workflow developed with
[Databricks Community
Edition](https://databricks.com/blog/2016/02/17/introducing-databricks-community-edition-apache-spark-for-all.html),
a free version of Databricks designed for learning [Apache
Spark](https://spark.apache.org/).

Here we elucidate the original Python notebook ([also linked
here](/#workspace/scalable-data-science/xtraResources/sparkSummitEast2016/Wikipedia%20Clickstream%20Data))
used in the talk by Michael Armbrust at Spark Summit East in February 2016,
shared at
<https://twitter.com/michaelarmbrust/status/699969850475737088> (watch
the talk later).

[![Michael Armbrust Spark Summit
East](http://img.youtube.com/vi/35Y-rqSMCCA/0.jpg)](https://www.youtube.com/watch?v=35Y-rqSMCCA)

### Data set

![Wikipedia Logo](http://sameerf-dbc-labs.s3-website-us-west-2.amazonaws.com/data/wikipedia/images/w_logo_for_labs.png)
=======================================================================================================================

The data we are exploring in this lab is the February 2015 English
Wikipedia Clickstream data, and it is available here:
http://datahub.io/dataset/wikipedia-clickstream/resource/be85cc68-d1e6-4134-804a-fd36b94dbb82.

According to Wikimedia:

> "The data contains counts of (referer, resource) pairs extracted from
> the request logs of English Wikipedia. When a client requests a
> resource by following a link or performing a search, the URI of the
> webpage that linked to the resource is included with the request in an
> HTTP header called the "referer". This data captures 22 million
> (referer, resource) pairs from a total of 3.2 billion requests
> collected during the month of February 2015."

The data is approximately 1.2GB and it is hosted in the following
Databricks file:
`/databricks-datasets/wikipedia-datasets/data-001/clickstream/raw-uncompressed`

In [None]:
display(dbutils.fs.ls("/databricks-datasets/wikipedia-datasets/"))

  

[TABLE]

  

### Let us first understand this Wikimedia data set a bit more

Let's read the datahub-hosted link
<https://datahub.io/dataset/wikipedia-clickstream> in the embedding
below. Also click the
[blog](http://ewulczyn.github.io/Wikipedia_Clickstream_Getting_Started/)
by Ellery Wulczyn, Data Scientist at The Wikimedia Foundation, to better
understand how the data was generated (remember to Right-Click and use
-&gt; and &lt;- if navigating within the embedded html frame below).

  

Run the next cell for some housekeeping.

In [None]:
if (org.apache.spark.BuildInfo.sparkBranch < "1.6") sys.error("Attach this notebook to a cluster running Spark 1.6+")
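
The check above compares version strings lexicographically. That works for the Spark versions at hand, but it would order a string such as "1.10" before "1.6". Here is a minimal sketch of a numeric comparison, assuming the version string starts with `major.minor`; `versionAtLeast` is a hypothetical helper and not part of Spark.

```scala
// Hypothetical helper (not a Spark API): compare the leading "major.minor" numerically.
def versionAtLeast(version: String, major: Int, minor: Int): Boolean = {
  // Capture the leading major and minor numbers; ignore any suffix such as ".2" or "-SNAPSHOT".
  val MajorMinor = """^(\d+)\.(\d+).*""".r
  version match {
    case MajorMinor(maj, min) =>
      val (a, b) = (maj.toInt, min.toInt)
      a > major || (a == major && b >= minor)
    case _ => false // unrecognised format: treat as too old
  }
}

// String comparison gets this pair wrong, numeric comparison does not.
val lexicographic = "1.10" < "1.6"                   // true, which is misleading
val numeric       = versionAtLeast("1.10.0", 1, 6)   // true
```

In a notebook one could then write `if (!versionAtLeast(sc.version, 1, 6)) sys.error("...")`, using the version string reported by the SparkContext.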

  

  

### Loading and Exploring the data

In [None]:
val data = sc.textFile("dbfs:///databricks-datasets/wikipedia-datasets/data-001/clickstream/raw-uncompressed")

  

>     data: org.apache.spark.rdd.RDD[String] = dbfs:///databricks-datasets/wikipedia-datasets/data-001/clickstream/raw-uncompressed MapPartitionsRDD[605] at textFile at command-1267216879634612:1

  

##### Looking at the first few lines of the data

In [None]:
data.take(5).foreach(println) 

  

>     prev_id	curr_id	n	prev_title	curr_title	type
>     	3632887	121	other-google	!!	other
>     	3632887	93	other-wikipedia	!!	other
>     	3632887	46	other-empty	!!	other
>     	3632887	10	other-other	!!	other

In [None]:
data.take(2)

  

>     res3: Array[String] = Array(prev_id	curr_id	n	prev_title	curr_title	type, "	3632887	121	other-google	!!	other")

  

-   The first line looks like a header
-   The second line (separated from the first by "," in the array
    printed above) contains data organized according to the header:
    `prev_id` is empty (the line starts with a tab), `curr_id` =
    3632887, `n` = 121, and so on.

Actually, here is the meaning of each column:

-   `prev_id`: if the referer does not correspond to an article in the
    main namespace of English Wikipedia, this value will be empty.
    Otherwise, it contains the unique MediaWiki page ID of the article
    corresponding to the referer i.e. the previous article the client
    was on

-   `curr_id`: the MediaWiki unique page ID of the article the client
    requested

-   `prev_title`: the result of mapping the referer URL to the fixed set
    of values described below

-   `curr_title`: the title of the article the client requested

-   `n`: the number of occurrences of the (referer, resource) pair

-   `type`

    -   "link" if the referer and request are both articles and the
        referer links to the request
    -   "redlink" if the referer is an article and links to the request,
        but the request is not in the production enwiki.page table
    -   "other" if the *referer* and request are both articles but the
        referer does not link to the request. This can happen when
        clients search or spoof their referer.

Referers were mapped to a fixed set of values corresponding to internal
traffic or external traffic from one of the top 5 global traffic sources
to English Wikipedia, based on this scheme:

> -   an article in the main namespace of English Wikipedia -&gt; the
>     article title
> -   any Wikipedia page that is not in the main namespace of English
>     Wikipedia -&gt; `other-wikipedia`
> -   an empty referer -&gt; `other-empty`
> -   a page from any other Wikimedia project -&gt; `other-internal`
> -   Google -&gt; `other-google`
> -   Yahoo -&gt; `other-yahoo`
> -   Bing -&gt; `other-bing`
> -   Facebook -&gt; `other-facebook`
> -   Twitter -&gt; `other-twitter`
> -   anything else -&gt; `other-other`
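
The mapping scheme above suggests a simple way to tell internal article referers from the fixed referer values. The following is a minimal sketch in plain Scala; `fixedReferers` and `isArticleReferer` are hypothetical names introduced here, and the sketch assumes that no article title coincides with one of the fixed values.

```scala
// The nine fixed values listed in the scheme above.
val fixedReferers: Set[String] = Set(
  "other-wikipedia", "other-empty", "other-internal", "other-google",
  "other-yahoo", "other-bing", "other-facebook", "other-twitter", "other-other")

// Hypothetical helper: anything outside the fixed set is taken to be an article title.
def isArticleReferer(prevTitle: String): Boolean =
  !fixedReferers.contains(prevTitle)

val fromGoogle  = isArticleReferer("other-google")        // false
val fromArticle = isArticleReferer("!_(disambiguation)")  // true
```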

In the second line of the file above, we can see there were 121 clicks
from Google to the Wikipedia page on "!!" (double exclamation marks).
People search for everything!

-   prev\_id = *(nothing)*
-   curr\_id = 3632887 *--&gt; (Wikipedia page ID)*
-   n = 121 *(People clicked from Google to this page 121 times in this
    month.)*
-   prev\_title = other-google *(This data record is for referrals from
    Google.)*
-   curr\_title = !! *(This Wikipedia page is about a double exclamation
    mark.)*
-   type = other

### Create a DataFrame from this CSV

-   Since Spark 2.0, CSV is a built-in data source. In Spark 1.6 it was
    provided by the external `spark-csv` package, whose format name
    `com.databricks.spark.csv` is used below.

In [None]:
// Load the raw dataset stored as a CSV file
val clickstream = sqlContext
    .read
    .format("com.databricks.spark.csv")
    .options(Map("header" -> "true", "delimiter" -> "\t", "mode" -> "PERMISSIVE", "inferSchema" -> "true"))
    .load("dbfs:///databricks-datasets/wikipedia-datasets/data-001/clickstream/raw-uncompressed")
  

  

>     clickstream: org.apache.spark.sql.DataFrame = [prev_id: int, curr_id: int ... 4 more fields]

  

##### Print the schema

In [None]:
clickstream.printSchema

  

>     root
>      |-- prev_id: integer (nullable = true)
>      |-- curr_id: integer (nullable = true)
>      |-- n: integer (nullable = true)
>      |-- prev_title: string (nullable = true)
>      |-- curr_title: string (nullable = true)
>      |-- type: string (nullable = true)

  

#### Display some sample data

In [None]:
display(clickstream)

  

[TABLE]

Truncated to 30 rows

  

`display` is a utility provided by Databricks. If you are programming
directly in Spark, use the `show(numRows: Int)` method of DataFrame
instead.

In [None]:
clickstream.show(5)

  

>     +-------+-------+---+------------------+----------+-----+
>     |prev_id|curr_id|  n|        prev_title|curr_title| type|
>     +-------+-------+---+------------------+----------+-----+
>     |   null|3632887|121|      other-google|        !!|other|
>     |   null|3632887| 93|   other-wikipedia|        !!|other|
>     |   null|3632887| 46|       other-empty|        !!|other|
>     |   null|3632887| 10|       other-other|        !!|other|
>     |  64486|3632887| 11|!_(disambiguation)|        !!|other|
>     +-------+-------+---+------------------+----------+-----+
>     only showing top 5 rows

  

### Reading from disk vs memory

The 1.2 GB Clickstream file is currently on S3, which means each time
you scan through it, your Spark cluster has to read the 1.2 GB of data
remotely over the network.

Call the `count()` action to check how many rows are in the DataFrame
and to see how long it takes to read the DataFrame from S3.

In [None]:
clickstream.cache().count()

  

>     res8: Long = 22509897

  

-   It took several minutes to read the 1.2 GB file into your
    Spark cluster. The file has 22.5 million rows/lines.
-   Although we have called `cache`, remember that the DataFrame is
    actually cached only when an action (here `count`) is called

Now call count again to see how much faster it is to read from memory

In [None]:
clickstream.count()

  

>     res9: Long = 22509897

  

-   Orders of magnitude faster!
-   If you are going to be using the same data source multiple times, it
    is better to cache it in memory
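
To put a number on the difference rather than reading it off the notebook's cell timer, one option is a small timing wrapper. The sketch below is plain Scala; `timed` is a hypothetical helper introduced here, not a Spark or Databricks function.

```scala
// Hypothetical helper: run a block, print the elapsed wall-clock time, return the result.
def timed[A](label: String)(body: => A): A = {
  val start = System.nanoTime()
  val result = body // the by-name argument is evaluated here, exactly once
  val millis = (System.nanoTime() - start) / 1e6
  println(f"$label took $millis%.1f ms")
  result
}

// Stand-in computation so the sketch runs without a cluster.
val total = timed("local sum")((1 to 1000000).map(_.toLong).sum)
```

In the notebook the same wrapper could be used as `timed("first count")(clickstream.cache().count())` followed by `timed("second count")(clickstream.count())`.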

### What are the top 10 articles requested?

To do this we also need to order by the sum of column `n`, in descending
order.

In [None]:
//Type in your answer here...
display(clickstream
  .select(clickstream("curr_title"), clickstream("n"))
  .groupBy("curr_title")
  .sum()
  .orderBy($"sum(n)".desc)
  .limit(10))

  

[TABLE]
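
The group, sum, sort and limit steps of the query above can be checked by hand on a tiny in-memory collection. This is a sketch using ordinary Scala collections rather than Spark; `topBySum` is a hypothetical helper and the sample pairs are made up for illustration.

```scala
// Toy (curr_title, n) pairs; the values are made up for illustration.
val sample = Seq(("A", 3), ("B", 5), ("A", 4), ("C", 1))

// Hypothetical helper mirroring groupBy("curr_title").sum().orderBy(desc).limit(k).
def topBySum(rows: Seq[(String, Int)], k: Int): Seq[(String, Long)] =
  rows
    .groupBy { case (title, _) => title }                               // group rows by title
    .map { case (title, group) => title -> group.map(_._2.toLong).sum } // sum n per title
    .toSeq
    .sortBy { case (_, total) => -total }                               // descending by total
    .take(k)                                                            // keep the top k

val top2 = topBySum(sample, 2) // A sums to 7, B to 5, C to 1
```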

  

### Who sent the most traffic to Wikipedia in Feb 2015?

In other words, who were the top referers to Wikipedia?

In [None]:
display(clickstream
  .select(clickstream("prev_title"), clickstream("n"))
  .groupBy("prev_title")
  .sum()
  .orderBy($"sum(n)".desc)
  .limit(10))

  

[TABLE]

  

As expected, the top referer by a large margin is Google. Next comes
refererless traffic (usually clients using HTTPS). The third largest
sender of traffic to English Wikipedia is the set of Wikipedia pages
that are not in the main namespace (ns = 0) of English Wikipedia. Learn
about the Wikipedia namespaces here:
<https://en.wikipedia.org/wiki/Wikipedia:Project_namespace>

Also, note that Twitter sends 10x more requests to Wikipedia than
Facebook.

### What were the top 5 trending articles people from Twitter were looking up in Wikipedia?

In [None]:
//Type in your answer here...
display(clickstream
  .select(clickstream("curr_title"), clickstream("prev_title"), clickstream("n"))
  .filter("prev_title = 'other-twitter'")
  .groupBy("curr_title")
  .sum()
  .orderBy($"sum(n)".desc)
  .limit(5))

  

[TABLE]

  

#### What percentage of page visits in Wikipedia are from other pages in Wikipedia itself?

In [None]:
val allClicks = clickstream.selectExpr("sum(n)").first.getLong(0)
val referals = clickstream.
                filter(clickstream("prev_id").isNotNull).
                selectExpr("sum(n)").first.getLong(0)
(referals * 100.0) / allClicks

  

>     allClicks: Long = 3283067885
>     referals: Long = 1095462001
>     res15: Double = 33.36702253416853

  

#### Register the DataFrame to perform more complex queries

In [None]:
clickstream.createOrReplaceTempView("clicks")

  

  

#### Which Wikipedia pages have the most referrals to the Donald Trump page?

In [None]:
SELECT *
FROM clicks
WHERE 
  curr_title = 'Donald_Trump' AND
  prev_id IS NOT NULL AND prev_title != 'Main_Page'
ORDER BY n DESC
LIMIT 20

  

[TABLE]

  

#### YouTry: Top referrers to other 2016 US presidential candidate pages

`'Donald_Trump', 'Bernie_Sanders', 'Hillary_Rodham_Clinton', 'Ted_Cruz'`

In [None]:
-- YouTry 
---
-- fill in the right sql query here

  

#### Load a visualization library

This code was copied after a live Google search (done by Michael
Armbrust at Spark Summit East in February 2016 and shared at
<https://twitter.com/michaelarmbrust/status/699969850475737088>). The
`d3ivan` package is an updated version of the original package used by
Michael Armbrust, which needed some TLC to work with Spark 2.2 on newer
Databricks notebooks. These changes were kindly made by Ivan Sadikov
from Middle Earth.

You need to hit the Play Button in the next cell and 'Run Cell' exactly
once.

>     Warning: classes defined within packages cannot be redefined without a cluster restart.
>     Compilation successful.

In [None]:
d3ivan.graphs.help()

In [None]:
d3ivan.graphs.force(
  height = 800,
  width = 800,
  clicks = sql("""
    SELECT 
      prev_title AS src,
      curr_title AS dest,
      n AS count FROM clicks
    WHERE 
      curr_title IN ('Donald_Trump', 'Bernie_Sanders', 'Hillary_Rodham_Clinton', 'Ted_Cruz') AND
      prev_id IS NOT NULL AND prev_title != 'Main_Page'
    ORDER BY n DESC
    LIMIT 20""").as[d3ivan.Edge])

  

  

What we have done above is essentially pass the output of an SQL query
into a D3 visualizer via JavaScript. Don't worry about all the details.
The main idea here is that SQL and interactive visualizations usually
come together in a proper data exploration tool, and the above steps are
minimal excursions into how to do it in a simple way from within a
notebook environment like Databricks. Python and R have many plotting
libraries, and we can always write the DataFrame to parquet and load it
into PySpark or SparkR to leverage those languages. But D3 is also a
nice solution, especially if you want something customized for your
queries.

### Convert raw data to parquet

**Recall:**

[Apache Parquet](https://parquet.apache.org/) is a [columnar
storage](http://en.wikipedia.org/wiki/Column-oriented_DBMS) format
available to any project in the Hadoop ecosystem, regardless of the
choice of data processing framework, data model or programming language.
It is a more efficient way to store data frames.

-   To understand the ideas read [Dremel: Interactive Analysis of
    Web-Scale Datasets, Sergey Melnik, Andrey Gubarev, Jing Jing Long,
    Geoffrey Romer, Shiva Shivakumar, Matt Tolton and Theo
    Vassilakis, Proc. of the 36th Int'l Conf on Very Large Data Bases
    (2010), pp. 330-339](http://research.google.com/pubs/pub36632.html),
    whose Abstract is as follows:
    -   Dremel is a scalable, interactive ad-hoc query system for
        analysis of read-only nested data. By combining multi-level
        execution trees and columnar data layouts it is **capable of
        running aggregation queries over trillion-row tables in
        seconds**. The system **scales to thousands of CPUs and
        petabytes of data, and has thousands of users at Google**. In
        this paper, we describe the architecture and implementation of
        Dremel, and explain how it complements MapReduce-based
        computing. We present a novel columnar storage representation
        for nested records and discuss experiments on few-thousand node
        instances of the system.
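
To make the idea of a columnar layout concrete, here is a toy illustration in plain Scala that stores the same three records once row by row and once column by column. It is not how Parquet is implemented (Parquet adds row groups, encodings and compression on top of this idea); `ClickRow` and `ClickColumns` are hypothetical names, and the values come from the first lines of the clickstream file shown earlier.

```scala
// Row-oriented: each record keeps all of its fields together.
final case class ClickRow(prevTitle: String, currTitle: String, n: Int)
val rows = Vector(
  ClickRow("other-google", "!!", 121),
  ClickRow("other-wikipedia", "!!", 93),
  ClickRow("other-empty", "!!", 46))

// Column-oriented: one array per field, aligned by position.
final case class ClickColumns(prevTitle: Array[String], currTitle: Array[String], n: Array[Int])
val cols = ClickColumns(
  rows.map(_.prevTitle).toArray,
  rows.map(_.currTitle).toArray,
  rows.map(_.n).toArray)

// Summing n over rows visits every record; over columns it reads only the n array.
val totalFromRows = rows.map(_.n.toLong).sum
val totalFromCols = cols.n.foldLeft(0L)(_ + _)
```

An aggregation such as `sum(n)` only needs one of the columns, which is why a column-oriented format can skip most of the stored bytes for queries like the ones in this notebook.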

In [None]:
import org.apache.spark.sql.SaveMode

// Convert the DataFrame to a more efficient format to speed up our analysis
clickstream.
  write.
  mode(SaveMode.Overwrite).
  parquet("/datasets/wiki-clickstream") 

  

  

#### Load parquet file efficiently and quickly into a DataFrame

Now we can simply load from this parquet file next time instead of
creating the DataFrame from the raw text file (much slower).

Also, using parquet files to store DataFrames allows us to move between
languages quickly in a scalable manner.

In [None]:
val clicks = sqlContext.read.parquet("/datasets/wiki-clickstream")

  

>     clicks: org.apache.spark.sql.DataFrame = [prev_id: int, curr_id: int ... 4 more fields]

In [None]:
clicks.printSchema

  

>     root
>      |-- prev_id: integer (nullable = true)
>      |-- curr_id: integer (nullable = true)
>      |-- n: integer (nullable = true)
>      |-- prev_title: string (nullable = true)
>      |-- curr_title: string (nullable = true)
>      |-- type: string (nullable = true)

In [None]:
display(clicks)  // let's display this DataFrame

  

[TABLE]

Truncated to 30 rows

  

##### DataFrame in python

In [None]:
clicksPy = sqlContext.read.parquet("/datasets/wiki-clickstream")

In [None]:
# in Python you need to put the object on its own line like this to get the type information
clicksPy 

  

>     Out[2]: DataFrame[prev_id: int, curr_id: int, n: int, prev_title: string, curr_title: string, type: string]

In [None]:
clicksPy.show()

  

>     +--------+-------+---+--------------------+----------+-----+
>     | prev_id|curr_id|  n|          prev_title|curr_title| type|
>     +--------+-------+---+--------------------+----------+-----+
>     |  334751|  19271| 24|            Cambodia|  Mongolia| link|
>     |     737|  19271| 24|         Afghanistan|  Mongolia|other|
>     |18603746|  19271| 13|             Beijing|  Mongolia| link|
>     | 7770444|  19271| 10|Agriculture_in_Mo...|  Mongolia| link|
>     | 7712057|  19271| 12|Christianity_in_M...|  Mongolia| link|
>     |16489766|  19271| 11| Cities_of_East_Asia|  Mongolia| link|
>     |31632993|  19271| 10|Assassin's_Creed:...|  Mongolia| link|
>     | 2421391|  19271| 32|              Bhutan|  Mongolia| link|
>     |17961792|  19271| 66|China–Mongolia_re...|  Mongolia| link|
>     | 1105940|  19271| 26|            Borjigin|  Mongolia| link|
>     |   56896|  19271| 33|     Altai_Mountains|  Mongolia| link|
>     | 5037822|  19271| 11|Economy_of_the_Mo...|  Mongolia| link|
>     |   78449|  19271| 34|  Developing_country|  Mongolia| link|
>     |19605700|  19271|416|           East_Asia|  Mongolia| link|
>     | 5042916|  19271| 11|              Canada|  Mongolia|other|
>     |    3383|  19271| 13|              Brazil|  Mongolia|other|
>     |  572720|  19271| 11|Biligtü_Khan_Ayus...|  Mongolia| link|
>     | 1496582|  19271| 40|        Asia-Pacific|  Mongolia| link|
>     |    6742|  19271| 52|        Central_Asia|  Mongolia| link|
>     | 2067400|  19271| 18|Democratic_Party_...|  Mongolia| link|
>     +--------+-------+---+--------------------+----------+-----+
>     only showing top 20 rows

  

Now you can continue from the original python notebook tweeted by
Michael.

Recall from the beginning of this notebook that this python databricks
notebook was used in the talk by Michael Armbrust at Spark Summit East
February 2016 shared from
<https://twitter.com/michaelarmbrust/status/699969850475737088>

(watch now, if you haven't already!)

[![Michael Armbrust Spark Summit
East](http://img.youtube.com/vi/35Y-rqSMCCA/0.jpg)](https://www.youtube.com/watch?v=35Y-rqSMCCA)

**You Try!**

Try to load a DataFrame in R from the parquet file just as we did for
Python. Read the docs in the Databricks guide first:

-   <https://docs.databricks.com/spark/latest/sparkr/overview.html>

And see the `R` example in the Programming Guide:

-   <https://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files>.

In [None]:
library(SparkR)

# just a quick test
df <- createDataFrame(faithful)
head(df)


In [None]:
# Read in the Parquet file created above. Parquet files are self-describing so the schema is preserved.
# The result of loading a parquet file is also a DataFrame.
clicksR <- read.df("/datasets/wiki-clickstream", source = "parquet")
clicksR # in R you need to put the object on its own line like this to get the type information

In [None]:
head(clicksR)

In [None]:
display(clicksR)

  

[TABLE]

Truncated to 30 rows