// Databricks notebook source exported at Sun, 21 Feb 2016 05:12:02 UTC


#![Wikipedia Logo](http://sameerf-dbc-labs.s3-website-us-west-2.amazonaws.com/data/wikipedia/images/w_logo_for_labs.png)

# Explore English Wikipedia via DataFrames and RDD API
### Time to complete: 20 minutes

#### Business Questions:

* Question # 1) What percentage of Wikipedia articles were edited in the past week (before the data was collected)?
* Question # 2) How many of the 1 million articles were last edited by ClueBot NG, an anti-vandalism bot?
* Question # 3) Which user in the 1 million articles was the last editor of the most articles?
* Question # 4) Can you display the titles of the articles in Wikipedia that contain a particular word?
* Question # 5) Can you extract out all of the words from the Wikipedia articles? (bag of words)
* Question # 6) What are the top 15 most common words in the English language?
* Question # 7) After removing stop words, what are the top 10 most common words in the English language?
* Question # 8) How many distinct/unique words are in noStopWordsListDF?


#### Technical Accomplishments:

* Work with one fifth of the sum of all human knowledge!

 Attach to, and then restart your cluster first to clear out old caches and get to a default, standard environment. The restart should take 1 - 2 minutes.

#![Restart cluster](http://i.imgur.com/xkRjRYy.png)

### Getting to know the Data
Let's pick up where the instructor left off in the earlier demo. Locate the Parquet data from the demo using `dbutils`:

In [None]:
display(dbutils.fs.ls("/mnt/wikipedia-readonly/en_wikipedia/flattenedParquet_updated2016/"))

 These are the ~840 parquet files (~4.3 GB) from the English Wikipedia Articles (Feb 4, 2016 snapshot) that were last updated in 2016.

 Load the articles into memory and lazily cache them:

In [None]:
val wikiDF = sqlContext.read.parquet("dbfs:/mnt/wikipedia-readonly/en_wikipedia/flattenedParquet_updated2016/").cache()

 Notice how fast `printSchema()` runs... this is because we can derive the schema from the Parquet metadata:

In [None]:
wikiDF.printSchema()

 Look at the first 5 rows:

In [None]:
wikiDF.show(5)

 Let's count how many total articles we have. (Note that when using a local mode cluster, the next command will take **4 minutes**, so you may want to skip ahead and read some of the next cells while it runs.)

In [None]:
// You can monitor the progress of this count + cache materialization via the Spark UI's storage tab
wikiDF.count()

 ## During live ETL demo: Run everything above this cell!

 This lab is meant to introduce you to working with unstructured text data in the Wikipedia articles. DataFrames and SQL queries are great for structured data like CSV, JSON or Parquet files. However, when exploring unstructured data, using the RDD or Datasets API directly can give you more flexible, lower-level control.

 In this lab, among other tasks, we will continue the ETL process from the earlier demo and apply basic Natural Language Processing to the article text to extract out a bag of words.

 By now the `.count()` operation might be completed. Go back up and check and only proceed after the count has completed. You should see the count's results as 1,029,377 items.

 Check the Spark UI's Storage tab to ensure that 100% of the data set fits in memory:

#![memory](http://i.imgur.com/YUhs1Bz.png)

 Run `.count()` again to see the speed increase:

In [None]:
wikiDF.count()

 That's pretty impressive! We can scan through 1 million recent articles of English Wikipedia using a single 22 GB Executor in under 2 seconds.

 Register the DataFrame as a temporary table, so we can execute SQL against it:

In [None]:
wikiDF.registerTempTable("wikipedia")

 
### Question #1:
** What percentage of Wikipedia articles were edited in the past week (before the data was collected)? **

 Recall that our dataset was collected on Feb 4, 2016. Let's figure out how many of the articles were last edited between Jan 28, 2016 and Feb 4, 2016. This should give us a good idea of how many articles are "fresh".

 ** Challenge 1:**  Can you write this query using SQL? Hint: Just count all the articles where the last revision time is on or after Jan 28, 2016.

In [None]:
%sql SELECT COUNT(*) FROM wikipedia WHERE lastrev_est_time >= DATE '2016-01-28';

 315 thousand articles are less than a week old. Since English Wikipedia contains 5,072,474 articles, that means:

In [None]:
315931/5072474.0

 About 6% of English Wikipedia is less than 1 week old (from the Feb 4th collection date). Here are 10 such articles:

In [None]:
%sql SELECT title, lastrev_est_time FROM wikipedia WHERE lastrev_est_time >= DATE '2016-01-28' LIMIT 10;
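
 As a quick sanity check on the arithmetic for this question, the ratio can be expressed directly as a percentage. This is a minimal sketch in plain Scala (no Spark needed); the two counts are the ones quoted in this notebook, and the variable names are illustrative only. Because the data set only holds articles last revised in 2016, every article edited in that final week is included in it, which is why dividing by the full English Wikipedia article count is reasonable.

```scala
// Counts quoted in this notebook (Feb 4, 2016 snapshot)
val editedInLastWeek = 315931L     // articles last revised on or after Jan 28, 2016
val allEnglishArticles = 5072474L  // total English Wikipedia articles

// Convert to Double before dividing to avoid integer division
val freshPercent = editedInLastWeek.toDouble / allEnglishArticles * 100

println(f"$freshPercent%.2f%% of articles were edited in the last week")
```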

 
### Question #2:
** How many of the 1 million articles were last edited by [ClueBot NG](https://en.wikipedia.org/wiki/User:ClueBot_NG), an anti-vandalism bot? **

 ** Challenge 2:**  Write a SQL query to answer this question. The username to search for is `ClueBot NG`.

In [None]:
%sql SELECT COUNT(*) FROM wikipedia WHERE contributorusername = "ClueBot NG";

In [None]:
%sql SELECT * FROM wikipedia WHERE contributorusername = "ClueBot NG" LIMIT 10;

 You can study the specific revisions like so: https://en.wikipedia.org/?diff=#

For example: https://en.wikipedia.org/?diff=702283675

 
### Question #3:
** Which user in the 1 million articles was the last editor of the most articles? **

 Here's a slightly more complicated query:

In [None]:
%sql SELECT contributorusername, COUNT(contributorusername) FROM wikipedia GROUP BY contributorusername ORDER BY COUNT(contributorusername) DESC; 

 Hmm, looks like bots are quite active in maintaining Wikipedia.

 Interested in learning more about the bots that edit Wikipedia? Check out: https://en.wikipedia.org/wiki/Wikipedia:List_of_bots_by_number_of_edits

 
### Question #4:
** Can you display the titles of the articles in Wikipedia that contain a particular word? **

 Start by registering a User Defined Function (UDF) that can search for a string in the text of an article.

In [None]:
// Register a function that checks whether a string contains a given word (returns "true" or "false" as a String).

val containsWord = (s: String, w: String) => {
  (s != null && s.indexOfSlice(w) >= 0).toString()
}
sqlContext.udf.register("containsWord", containsWord)

 Verify that the `containsWord` function is working as intended:

In [None]:
// Look for the word 'test' in the first string
containsWord("hello astronaut, how's space?", "test")

In [None]:
// Look for the word 'space' in the first string
containsWord("hello astronaut, how's space?", "space")
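
 Note that this check is a plain substring match, so a search for `astronaut` also matches `astronauts`. If whole-word matches are wanted instead, one possible approach is a regex with word boundaries. The sketch below is plain Scala; `containsWholeWord` is a hypothetical helper that is not part of this lab, and `containsSubstring` is only a stand-in that mirrors the behavior of `containsWord` for comparison.

```scala
import java.util.regex.Pattern

// Stand-in for the notebook's containsWord: a substring check returning "true"/"false"
val containsSubstring = (s: String, w: String) => {
  (s != null && s.contains(w)).toString
}

// Hypothetical whole-word variant: the word must sit between regex word boundaries (\b).
// Pattern.quote keeps any regex metacharacters in the search word literal.
val containsWholeWord = (s: String, w: String) => {
  val pattern = Pattern.compile("\\b" + Pattern.quote(w) + "\\b")
  (s != null && pattern.matcher(s).find()).toString
}

println(containsSubstring("the astronauts waved", "astronaut"))
println(containsWholeWord("the astronauts waved", "astronaut"))
println(containsWholeWord("hello astronaut, how's space?", "space"))
```

 Like the original, both functions return a String so that they can be compared against `'true'` in a SQL query.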

  Use a parameterized query so you can easily change the word to search for:

In [None]:
%sql  select title from wikipedia where containsWord(text, '$word') == 'true'

 Try typing `NASA` or `Manhattan` into the search box above and hit SHIFT + ENTER.

 
### Question #5:
** Can you extract out all of the words from the Wikipedia articles? ** (Create a bag of words)

 Use Spark.ml's RegexTokenizer to read an input column of 'text' and write a new output column of 'words':

In [None]:
import org.apache.spark.ml.feature.RegexTokenizer
 
val tokenizer = new RegexTokenizer()
  .setInputCol("text")
  .setOutputCol("words")
  .setPattern("\\W+")  // split on runs of non-word characters
val wikiWordsDF = tokenizer.transform(wikiDF)

In [None]:
wikiWordsDF.show(5)

In [None]:
wikiWordsDF.select($"title", $"words").first
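
 To get a feel for what splitting on non-word characters does to a piece of text, here is a rough approximation in plain Scala. It assumes the tokenizer lowercases its input and treats the pattern as a delimiter between tokens; it is only an illustration on a sample sentence, not the Spark implementation.

```scala
// Sample sentence reused from the containsWord checks earlier in this lab
val sampleText = "hello astronaut, how's space?"

// Lowercase, split on runs of non-word characters, and drop any empty tokens
val sampleWords = sampleText.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq

println(sampleWords.mkString(", "))
```

 Notice that the apostrophe in `how's` acts as a delimiter, so a stray `s` token appears. Tokens like this are common when splitting English text this way.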

 
### Question #6:
** What are the top 15 most common words in the English language? ** Compute this only on a random 1% of the 1 million articles.

 For this analysis, we should get reasonably accurate results even if we work on just 1% of the 1 million articles. Plus, this will speed things up tremendously. Note that 1% of 1 million is 10,000 articles.

In [None]:
// This sample + repartition command will take 4-5 mins to run, so skip this cell and just read the same results via the parquet file in the following cell
//val onePercentDF = wikiWordsDF.sample(false, .01, 555).repartition(100).cache

In [None]:
val onePercentDF = sqlContext.read.parquet("dbfs:/mnt/wikipedia-readonly/en_wikipedia/flattenedParquet_updated2016_1percent/").cache

In [None]:
onePercentDF.count // Materialize the cache

 The `onePercentDF` contains 10,297 articles (that is 1% of 1 million articles). Take a look at the onePercentDF:

In [None]:
display(onePercentDF)

 Note that the `words` column contains arrays of Strings:

In [None]:
onePercentDF.select($"words")

 Let's explode the `words` column into a table of one word per row:

In [None]:
import org.apache.spark.sql.{functions => func}
val onePercentWordsListDF = onePercentDF.select(func.explode($"words").as("word"))

In [None]:
display(onePercentWordsListDF)

In [None]:
onePercentWordsListDF.cache().count()

 The onePercentWordsListDF contains 18.6 million words.

 Finally, run a wordcount on the exploded table:

In [None]:
val wordGroupCountDF = onePercentWordsListDF
                      .groupBy("word")  // group
                      .agg(func.count("word").as("counts"))  // aggregate
                      .sort(func.desc("counts"))  // sort

wordGroupCountDF.take(15).foreach(println)

 These would be good [stop words](https://en.wikipedia.org/wiki/Stop_words) to filter out before running Natural Language Processing algorithms on our data.
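
 The explode, group, count and sort steps used for this question can be mirrored on ordinary Scala collections, which may help when reasoning about what each DataFrame step does. This is a small sketch on made-up data; the sample sentences and variable names are illustrative only.

```scala
// Stand-in for the `words` column: one sequence of tokens per article
val sampleArticles = Seq(
  Seq("the", "cat", "and", "the", "dog"),
  Seq("the", "bird", "and", "a", "fish")
)

// "Explode": flatten the per-article sequences into one word per element
val sampleWordsList = sampleArticles.flatten

// Group by word, count each group, then sort by descending count
// (ties are broken alphabetically so the ordering is deterministic)
val sampleWordCounts = sampleWordsList
  .groupBy(identity)
  .map { case (word, occurrences) => (word, occurrences.size) }
  .toSeq
  .sortBy { case (word, count) => (-count, word) }

sampleWordCounts.take(3).foreach(println)
```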

 
### Question #7:
** After removing stop words, what are the top 10 most common words in the English language? ** Compute this only on a random 1% of the 1 million articles.

 Use Spark.ml's stop words remover:

In [None]:
import org.apache.spark.ml.feature.StopWordsRemover

val remover = new StopWordsRemover()
  .setInputCol("words")
  .setOutputCol("noStopWords")

 Notice the removal of words like "about", "the",  etc:

In [None]:
remover.transform(onePercentDF).select("id", "title", "words", "noStopWords").show(7)

In [None]:
val noStopWordsListDF = remover.transform(onePercentDF).select(func.explode($"noStopWords").as("word"))

In [None]:
noStopWordsListDF.show(7)

 The onePercentWordsListDF (which included stop words) contained 18.6 million words. How many words are in the noStopWordsListDF?

In [None]:
noStopWordsListDF.cache.count

 13.9 million words remain. That means about 4.7 million words in our 1% sample were actually stop words.

 Finally, let's see the top 15 words now:

In [None]:
val noStopWordsGroupCount = noStopWordsListDF
                      .groupBy("word")  // group
                      .agg(func.count("word").as("counts"))  // aggregate
                      .sort(func.desc("counts"))  // sort

noStopWordsGroupCount.take(15).foreach(println)

 Hmm, there are still some words in the list (like http, 1, 2, s) that are kind of meaningless. Perhaps we should consider a custom stop words remover in the future?
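
 One way to approach that is to extend the stop word list with corpus-specific tokens. `StopWordsRemover` accepts a custom list through `setStopWords`; the sketch below shows only the filtering idea in plain Scala. The `baseStopWords` set is a tiny stand-in rather than Spark's actual default list, and the extra tokens are just the ones called out above.

```scala
// Tiny stand-in for a default stop word list (NOT Spark's actual defaults)
val baseStopWords = Set("the", "of", "and", "about")

// Corpus-specific tokens that carry little meaning in this data set
val extraStopWords = Set("http", "1", "2", "s")

val customStopWords = baseStopWords ++ extraStopWords

// Made-up tokens standing in for one row of the `words` column
val sampleTokens = Seq("the", "history", "of", "http", "s", "wikipedia", "1")

// Keep only tokens that are not in the combined list (case-insensitive match)
val keptTokens = sampleTokens.filterNot(token => customStopWords.contains(token.toLowerCase))

println(keptTokens.mkString(", "))
```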

 
### Question #8:
** How many distinct/unique words are in noStopWordsListDF?**

In [None]:
noStopWordsListDF.distinct.count

 Looks like the Wikipedia corpus has around 500,000 unique words. Probably a lot of these are rare scientific words, numbers, etc.

 This concludes the English Wikipedia NLP lab.