// Databricks notebook source exported at Tue, 9 Feb 2016 20:38:06 UTC


#![Wikipedia Logo](http://sameerf-dbc-labs.s3-website-us-west-2.amazonaws.com/data/wikipedia/images/w_logo_for_labs.png)

# Analyzing the Wikipedia PageCounts with RDDs
### Time to complete: 20 minutes

#### Business questions:

* Question # 1) How many unique articles in English Wikipedia were requested in this hour?
* Question # 2) How many requests total did English Wikipedia get in this hour?
* Question # 3) How many total requests did each Wikipedia project get during this hour?
* Question # 4) How many requests did the Apache Spark project receive during this hour? Which language got the most requests?
* Question # 5) How many requests did the English Wiktionary project get during the captured hour?
* Question # 6) Which Apache project in English Wikipedia got the most hits during the capture hour?
* Question # 7) What were the top 10 pages viewed in English Wikipedia during the capture hour?

#### Technical Accomplishments:

* Learn how to use the following actions: count, take, takeSample, collect
* Learn the following transformations: filter, map, reduceByKey, sortBy
* Learn how to cache an RDD and view its number of partitions and total size in memory
* Learn how to send a closure function to a map transformation
* Learn how to define a case class to organize data in an RDD into objects
* Learn how to interpret a DAG visualization and understand the number of stages and tasks
* Learn why groupByKey should be avoided



Dataset: https://dumps.wikimedia.org/other/pagecounts-raw/

### Getting to know the Data
How large is the data? Let's use `%fs` to find out. Note: This is not supported on Jupyter.

In [None]:
//%fs ls /databricks-datasets/wikipedia-datasets/data-001/pagecounts/sample/

 589,722,455 bytes is about 589 MB.
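
As a quick check of that conversion, here is a minimal sketch in plain Scala (no Spark needed). The variable names are illustrative only. It shows the size both in decimal megabytes, which is the figure quoted above, and in binary mebibytes, which some tools report instead:

```scala
val sizeBytes = 589722455L

// 1 MB = 1,000,000 bytes (decimal); 1 MiB = 1,048,576 bytes (binary)
val decimalMB = sizeBytes / 1e6
val binaryMiB = sizeBytes / (1024.0 * 1024.0)

// Round to one decimal place for display
val mbRounded  = math.round(decimalMB * 10) / 10.0
val mibRounded = math.round(binaryMiB * 10) / 10.0

println(mbRounded)  // 589.7
println(mibRounded) // 562.4
```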

 Note that this file is from Nov 24, 2015 at 17:00 (5pm). It captures only 1 hour of page counts for all Wikipedia languages and projects.

### RDDs
RDDs can be created by using the Spark Context object's `textFile()` method.

In [None]:
// In Databricks, the SparkContext is already created for you as the variable sc
sc

In [None]:
val pagecountsRDD = sc.textFile("file:///mnt/ephemeral/summitdata/pagecounts-20160210-180000")
pagecountsRDD.first

If you would like to run it from S3 directly, the credentials are set up.  Replace step 2 with

```
val pagecountsRDD = sc.textFile("s3a://datastaxtraining/summitdata/pagecounts-20160210-180000")
```

 The `count` action counts how many items (lines) total are in the RDD:

In [None]:
pagecountsRDD.count()

 So there are about 7.7 million lines. Notice that the `count()` action took 5 - 10 seconds to run because it had to scan the entire 589 MB file remotely from S3. This command requires 9 tasks to compute.
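
The task count comes from the number of partitions: an action such as `count()` runs one task per partition. A minimal sketch of how to inspect this, assuming the notebook's `sc` and using a small made-up RDD rather than the real file:

```scala
// Assumes the notebook's SparkContext `sc`. The data here is synthetic.
val numbers = sc.parallelize(1 to 1000, numSlices = 4) // ask for 4 partitions explicitly

println(numbers.partitions.length) // 4
println(numbers.count())           // 1000, computed by one task per partition
```

The same check on the real data would be `pagecountsRDD.partitions.length`, which should match the number of tasks reported for the `count()` job.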

 You can use the take action to get the first K records (here K = 10):

In [None]:
pagecountsRDD.take(10)

 The `take` command is much faster because it does not have to read the entire file. This command only requires 1 task to compute.

 Unfortunately this is not very readable because `take()` returns an array and Scala simply prints the array with each element separated by a comma. We can make it prettier by traversing the array to print each record on its own line:

In [None]:
pagecountsRDD.take(10).foreach(println)

 In the output above, the first column `aa` is the Wikimedia project name. The following abbreviations are used:
```
wikibooks: ".b"
wiktionary: ".d"
wikimedia: ".m"
wikipedia mobile: ".mw"
wikinews: ".n"
wikiquote: ".q"
wikisource: ".s"
wikiversity: ".v"
mediawiki: ".w"
```

Projects without a period and a following character are Wikipedia projects.

The second column is the title of the page retrieved, the third column is the number of requests, and the fourth column is the size of the content returned.
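
The naming rule for the first column can be sketched in plain Scala. The helper `describeProject` and the map `projectSuffixes` are hypothetical names introduced here for illustration. The suffix table is copied from the abbreviation list above, and any suffix not in that list is reported as `unknown`:

```scala
// Suffix-to-project table, copied from the abbreviation list above
val projectSuffixes: Map[String, String] = Map(
  "b"  -> "wikibooks",
  "d"  -> "wiktionary",
  "m"  -> "wikimedia",
  "mw" -> "wikipedia mobile",
  "n"  -> "wikinews",
  "q"  -> "wikiquote",
  "s"  -> "wikisource",
  "v"  -> "wikiversity",
  "w"  -> "mediawiki"
)

// Hypothetical helper: split a project code into (language, project)
def describeProject(code: String): (String, String) = {
  val dot = code.indexOf('.')
  if (dot < 0) {
    (code, "wikipedia") // no period and suffix means a Wikipedia project
  } else {
    val suffix = code.substring(dot + 1)
    (code.substring(0, dot), projectSuffixes.getOrElse(suffix, "unknown"))
  }
}

println(describeProject("aa"))   // (aa,wikipedia)
println(describeProject("fr.b")) // (fr,wikibooks)
```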

### Common RDD Transformations and Actions
Next, we'll explore some common transformations and actions.

 But first, let's cache our base RDD into memory:

In [None]:
pagecountsRDD.setName("pagecountsRDD").cache.count

In [None]:
val addr = java.net.InetAddress.getByName("node0_ext").getHostAddress
kernel.magics.html(s"""<iframe src="http://$addr:4040/storage" width=1000 height=500/>""")

You should now see the RDD in Spark UI's storage tab:
#![PagecountsRDD in Storage](http://i.imgur.com/Y3UFJl1.png)

Notice that the RDD takes more than 2x the space when cached in memory deserialized.
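
To confirm how an RDD is cached, you can inspect its storage level. A minimal sketch, assuming the notebook's `sc` and a small synthetic RDD (`cachedSample` is an illustrative name): `cache()` is shorthand for persisting with `MEMORY_ONLY`, which stores deserialized objects in memory.

```scala
import org.apache.spark.storage.StorageLevel

// Assumes the notebook's SparkContext `sc`. The data here is synthetic.
val cachedSample = sc.parallelize(1 to 1000).setName("cachedSample").cache()

// cache() only marks the RDD; the first action materializes it in memory
cachedSample.count()

val levelWhileCached = cachedSample.getStorageLevel
println(levelWhileCached == StorageLevel.MEMORY_ONLY) // true

// Release the cached blocks when they are no longer needed
cachedSample.unpersist()
```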

 
### Question #1:
** How many unique articles in English Wikipedia were requested in this hour?**

 Let's keep only the lines that refer to English Wikipedia:

In [None]:
val enPagecountsRDD = pagecountsRDD.filter { _.startsWith("en ") }

 Note that the above line is lazy and doesn't actually run the filter. We have to trigger the filter transformation to run by calling an action:

In [None]:
enPagecountsRDD.count()
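
The lazy behavior can be seen on a tiny made-up RDD. This is a sketch that assumes the notebook's `sc`; the three sample lines and the names `sampleRDD` and `enSampleRDD` are invented for illustration:

```scala
// Assumes the notebook's SparkContext `sc`. The lines are made-up sample data.
val sampleRDD = sc.parallelize(Seq(
  "en Main_Page 5 100",
  "fr Accueil 2 50",
  "en Apache_Spark 1 10"
))

// Transformation: returns immediately, no job is launched
val enSampleRDD = sampleRDD.filter(_.startsWith("en "))

// Prints only the lineage (the plan), still without running the filter
println(enSampleRDD.toDebugString)

// Action: this is what actually runs the filter
val enSampleCount = enSampleRDD.count()
println(enSampleCount) // 2
```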

 2.4 million lines refer to the English Wikipedia project. So about half of the 5 million articles in English Wikipedia get requested per hour. Let's take a look at 5 random lines:

In [None]:
enPagecountsRDD.takeSample(true, 5).foreach(println)

 
### Question #2:
** How many requests total did English Wikipedia get in this hour?**

 Let's define a function, `parse`, to parse out the 4 fields on each line. Then we'll run the `parse` function on each item in the RDD and create a new RDD named `enPagecountsParsedRDD`.

In [None]:
// Define a function
def parse(line:String) = {
  val fields = line.split(' ') //Split the original line with 4 fields according to spaces
  (fields(0), fields(1), fields(2).toInt, fields(3).toLong) // return the 4 fields with their correct data types
}
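
Note that `parse` throws an exception if a line has fewer than 4 fields or a non-numeric count. A more defensive variant is sketched below in plain Scala. The name `parseSafe` and the sample lines are hypothetical and not part of the original notebook:

```scala
import scala.util.Try

// Hypothetical variant of parse: returns None instead of throwing on a bad line
def parseSafe(line: String): Option[(String, String, Int, Long)] = {
  val fields = line.split(' ')
  if (fields.length != 4) None
  else Try((fields(0), fields(1), fields(2).toInt, fields(3).toLong)).toOption
}

// Made-up sample lines
val sampleLines = Seq(
  "en Main_Page 42 1024", // well-formed
  "en Broken_Line 7",     // only three fields
  "en Bad_Number x 10"    // non-numeric request count
)

// flatMap drops the None results and unwraps the Some results
val parsedSample = sampleLines.flatMap(parseSafe(_))
parsedSample.foreach(println) // prints (en,Main_Page,42,1024)
```

On an RDD the same idea would presumably be written with `flatMap` in place of `map`, so that malformed lines are skipped rather than failing the job.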

 ** Challenge 1:**  Can you use the parse function above in a map closure and assign the results to an RDD named *enPagecountsParsedRDD*?

In [None]:
//Type in your answer here...


In [None]:
enPagecountsParsedRDD.take(3)

 Using a combination of `map` and `take`, we can yank out just the requests field:

In [None]:
enPagecountsParsedRDD.map(_._3).take(10)

 ** Challenge 2:** Finally, let's sum all of the requests to English Wikipedia during the captured hour:

In [None]:
//Type in your answer here...


 We can see that there were about 9.6 million requests to English Wikipedia on Nov 24, 2015 from 5pm - 6pm.

 
### Question #3:
** How many total requests did each Wikipedia project get during this hour?**

 Recall that our data file contains requests to all of the Wikimedia projects, including Wikibooks, Wiktionary, Wikinews, Wikiquote... and all of the 200+ languages.

In [None]:
// Use the parse function in a map closure
val allPagecountsParsedRDD = pagecountsRDD.map(parse)

In [None]:
allPagecountsParsedRDD.take(5).foreach(println)

 Next, we'll create key/value pairs from the project prefix and the number of requests:

In [None]:
allPagecountsParsedRDD.map(line => (line._1, line._3)).take(10)

 Finally, we can use `reduceByKey()` to calculate the final answer:

In [None]:
val projectcountsRDD = allPagecountsParsedRDD.map(line => (line._1, line._3)).reduceByKey(_ + _)

In [None]:
// Sort by the value (number of requests) and pass in false to sort in descending order
projectcountsRDD.sortBy(x => x._2, false).take(10).foreach(println)

 We can see that the English Wikipedia Desktop and the English Wikipedia Mobile got the most hits this hour, followed by the Russian and Spanish Wikipedias.
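
The `reduceByKey()` step above could also be written with `groupByKey()`, but that is generally the slower choice: `groupByKey()` shuffles every key/value pair across the network, while `reduceByKey()` first combines values within each partition. A minimal sketch on made-up pairs, assuming the notebook's `sc`, showing that both give the same totals:

```scala
// Assumes the notebook's SparkContext `sc`. The pairs are made-up sample data.
val pairs = sc.parallelize(Seq(("en", 5), ("fr", 2), ("en", 1), ("de", 4), ("fr", 3)))

// reduceByKey: partial sums are computed per partition before the shuffle
val reduced = pairs.reduceByKey(_ + _).collect().toMap

// groupByKey: all values for a key are shuffled first, then summed
val grouped = pairs.groupByKey().mapValues(_.sum).collect().toMap

println(reduced == grouped) // true
println(reduced("en"))      // 6
```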

 
### Question #4:
** How many requests did the Apache Spark project receive during this hour? Which language got the most requests?**

 First we define a case class to organize our data in PageCount objects:

In [None]:
case class PageCount(val project: String, val title: String, val requests: Long, val size: Long)

In [None]:
val pagecountObjectsRDD = pagecountsRDD
  .map(_.split(' '))
  .filter(_.size == 4)
  .map(pc => new PageCount(pc(0), pc(1), pc(2).toLong, pc(3).toLong))

 Keep only the items that mention "Apache_Spark" in the title:

In [None]:
pagecountObjectsRDD
  .filter(_.title.contains("Apache_Spark"))
  .count

 ** Challenge 3:** Can you figure out which language edition of the Apache Spark page got the most hits? 

Hint: Consider using a .map() after the filter() in the cell above.

In [None]:
//Type in your answer here...


 It seems like the English version of the Apache Spark page got the most hits by far.

 
### Question #5:
** How many requests did the English Wiktionary project get during the captured hour?**

 
The [Wiktionary](https://en.wiktionary.org/wiki/Wiktionary:Main_Page) project is a free dictionary with 4 million+ entries from over 1,500 languages.

 ** Challenge 4:** Can you figure this out? Start by figuring out the correct prefix that identifies the English Wiktionary project.

In [None]:
//Type in your answer here...


 The English Wiktionary project got a total of 76,000 requests.

 
### Question #6:
** Which Apache project in English Wikipedia got the most hits during the capture hour?**

In [None]:
// Here we reuse the PageCount case class we had defined earlier
val enPagecountObjectsRDD = enPagecountsRDD
  .map(_.split(' '))
  .filter(_.size == 4)
  .map(pc => new PageCount(pc(0), pc(1), pc(2).toLong, pc(3).toLong))

In [None]:
enPagecountObjectsRDD
  .filter(_.title.contains("Apache_"))
  .map(x => (x.title, x.requests))
  .collect
  .foreach(println)

In [None]:
enPagecountObjectsRDD
  .filter(_.title.contains("Apache_"))
  .map(x => (x.title, x.requests))
  .map(item => item.swap) // interchanges position of entries in each tuple
  .sortByKey(false, 1) // 1st arg: ascending = false, so sort descending; 2nd arg: use a single partition (one task)
  .map(item => item.swap)
  .collect
  .foreach(println)

 We can infer from the above results that Apache's Hadoop and HTTP Server projects are the most popular, followed by Spark and Tomcat.
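
The swap, `sortByKey`, swap pattern used above can also be expressed with a single `sortBy` call, as in Question #3. A minimal sketch assuming the notebook's `sc`; the titles and request counts below are made up and are not results from the dataset:

```scala
// Assumes the notebook's SparkContext `sc`. The counts are made-up sample data.
val hits = sc.parallelize(Seq(
  ("Apache_Spark", 50L),
  ("Apache_Hadoop", 120L),
  ("Apache_Tomcat", 30L)
))

// Pattern from the cell above: swap, sort by key descending, swap back
val viaSortByKey = hits
  .map(_.swap)
  .sortByKey(ascending = false, numPartitions = 1)
  .map(_.swap)
  .collect()

// Equivalent: sort directly on the second tuple element
val viaSortBy = hits
  .sortBy(_._2, ascending = false, numPartitions = 1)
  .collect()

println(viaSortBy.sameElements(viaSortByKey)) // true
viaSortBy.foreach(println)
```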

 
### Question #7:
** What were the top 10 pages viewed in English Wikipedia during the capture hour?**

In [None]:
//Recall that we already have an RDD created that we can use for this analysis
enPagecountsParsedRDD

In [None]:
enPagecountsParsedRDD
  .takeSample(true, 5)
  .foreach(println)

In [None]:
enPagecountsParsedRDD
  .map(line => (line._2, line._3))
  .reduceByKey(_ + _)
  .sortBy(x => x._2, false)
  .take(10)
  .foreach(println)

 The Lucy article sticks out as a unique article that received over 33,000 requests on Nov 24, 2015 between 5pm and 6pm.

What could have caused this?

On November 24, Google had a special [Google Doodle](https://www.google.com/doodles/41st-anniversary-of-the-discovery-of-lucy) on its main page to celebrate the 41st anniversary of the discovery of Lucy. There were also many news articles globally about Lucy on Nov 24.