In [None]:
// Databricks notebook source exported at Sun, 21 Feb 2016 05:12:13 UTC

 Check in the Spark UI whether reading all columns of the Parquet file except `text` and `lowerText` results in faster reads with less I/O.

 ### Live Demo: Wikipedia ETL

 We will: 
* Use a cluster with 50 Executors with 8 cores on each (400 cores total)
* ETL from an XML file into a Parquet file
* Work with nested fields in a table

 ##### Step 0: Ask students to run first 5 cells in "Wikipedia - ETL NLP - ReadMe"

 ##### Step 1: Convert XML to Parquet

 Start with using the Spark-XML library to convert the XML file to the more efficient Parquet format:

In [None]:
display(dbutils.fs.ls("/mnt/wikipedia-readonly/"))

 Note that the cell below runs 400 tasks at a time, each taking a median of about 3.5 minutes.

In [None]:
// This cell should take about 30-50 mins to run
// Using Spark-xml1, it takes 50 mins; median task time is 3.8 mins, 75th percentile is 4.5 mins

// write.parquet returns Unit, so the result is not assigned to a val
sqlContext.read
    .format("com.databricks.spark.xml")
    .option("rowTag", "page")
    .load("/mnt/wikipedia-readonly/en_wikipedia/enwiki-20160204-pages-articles-multistream.xml")
    .write.parquet("/mnt/wikipedia-readwrite/en_wikipedia/parquetX/")

 **Instructor note:** 
* Under Executors tab, show that each Executor is reading an equal amount of data, around 1 GB
* Show Event Timeline
* Thread dump will show that most of the time is spent in one of these two calls:

`com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2819)`

or 

`java.util.zip.ZipFile.getEntry(ZipFile.java:309)`

In [None]:
%fs ls /mnt/wikipedia-readonly/en_wikipedia/parquet

 ##### Step 2: Read Parquet into a Dataframe and explore...

 Read the new optimized Parquet file into a Dataframe:

In [None]:
val wikiDF = sqlContext.read.parquet("/mnt/wikipedia-readonly/en_wikipedia/parquet").cache() //We are caching

 Notice the nested schema:

In [None]:
wikiDF.printSchema()

 There are only 6 top-level columns (the rest are nested fields):

In [None]:
wikiDF.show(2)

 Here's an example of how to read nested data:

In [None]:
// Note: If a user is logged in, we don't get an IP
wikiDF.select($"id", $"title", $"revision.contributor.username", $"revision.contributor.ip").show(10)

In [None]:
// These are the articles last touched by anonymous editors
wikiDF.filter("revision.contributor.ip is not null").select($"id", $"title", $"revision.contributor.username", $"revision.contributor.ip").show(10)

 ##### Step 3: Materialize the cache and sanity check the data for namespaces and redirects

Are there really 5 million articles?

In [None]:
//materialize the cache and count how many rows (takes 9 secs to run)
wikiDF.count()

 Hmm, why are there 16 million rows? I thought English Wikipedia had 5 million articles...

 Wikipedia namespaces: https://en.wikipedia.org/wiki/Wikipedia:Namespace

In [None]:
wikiDF.groupBy("ns").count().show()

 Ahh, there are many other namespaces in this Dataframe.

 Filter the Dataframe down to just the main namespace of articles:

In [None]:
val wikiMainDF = wikiDF.filter("ns = 0")

In [None]:
wikiMainDF.count()

 There are still way too many rows... 12 million, instead of 5 million.

In [None]:
// Notice the redirect column
wikiMainDF.show(10)

In [None]:
// Try going to: https://en.wikipedia.org/wiki/AccessibleComputing    This is not a real article, just a redirect

// This is a real article: https://en.wikipedia.org/wiki/Anarchism

// Notice that many of the rows are just redirects

display(wikiMainDF.select($"id", $"title", $"redirect.@title"))

In [None]:
// Now we see that there are 5 million normal articles that are not redirects

wikiMainDF.select($"redirect.@title".isNotNull.as("hasRedirect"))
  .groupBy("hasRedirect")
  .count
  .show

 Create a new wikiArticlesDF with just the 5 million articles, removing the redirect rows:

In [None]:
val wikiArticlesDF = wikiMainDF.filter($"redirect.@title".isNull)

In [None]:
// This makes sense, 5 million articles
wikiArticlesDF.count()

 ##### Step 4: Convert the String timestamp cols to real timestamp data types

In [None]:
wikiArticlesDF.printSchema()

In [None]:
wikiArticlesDF.select($"title", $"revision.timestamp").show(5)

In [None]:
import org.apache.spark.sql.{functions => func}

Let's use a function for time zone manipulation and store the relevant fields as timestamps rather than strings.  `from_utc_timestamp` treats the input as UTC and returns a timestamp shifted to the given time zone.

In [None]:
// We are using this function added in Spark 1.5: https://issues.apache.org/jira/browse/SPARK-8188

// https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$

In [None]:
val wikiArticlesWTimeDF = wikiArticlesDF.withColumn("lastrev_est_time", func.from_utc_timestamp($"revision.timestamp", "US/Eastern"))

wikiArticlesWTimeDF.printSchema

wikiArticlesWTimeDF
  .select($"title", $"revision.timestamp", $"lastrev_est_time")
  .show(3)

 Notice that the lastrev_est_time column is now a timestamp.
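
To see what the conversion does without a cluster, here is a minimal plain-Scala sketch that uses `java.time` rather than Spark. The object and method names are hypothetical, and it assumes the dump's timestamps are ISO-8601 UTC strings such as `2016-01-01T02:00:00Z`.

```scala
import java.time.{Instant, ZoneId, ZonedDateTime}

// Hypothetical helper, not part of Spark: view a UTC instant in another time zone.
object UtcToZoneSketch {
  def toZone(utcTimestamp: String, zoneId: String): ZonedDateTime =
    Instant.parse(utcTimestamp).atZone(ZoneId.of(zoneId))
}
```

For example, `UtcToZoneSketch.toZone("2016-01-01T02:00:00Z", "US/Eastern")` is 9 PM on December 31, 2015, because US/Eastern is five hours behind UTC in winter. An edit made early on January 1 UTC therefore still falls in 2015 once it is expressed in Eastern time, which matters for any later filter on the year of the converted column.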

 ##### Step 5: Flatten the table and drop unnecessary columns

 Flatten out the table and drop some cols:

In [None]:
val wikiFlatForParquetDF = wikiArticlesWTimeDF
                      .drop($"ns")
                      .drop($"redirect")
                      .drop($"restrictions")
                      .withColumn("revid", $"revision.id")
                      .withColumn("comment", $"revision.comment.#value")
                      .withColumn("contributorid", $"revision.contributor.id")
                      .withColumn("contributorusername", $"revision.contributor.username")
                      .withColumn("contributorip", $"revision.contributor.ip")
                      .withColumn("text", $"revision.text.#value")
                      .drop($"revision")

In [None]:
wikiFlatForParquetDF.show(5)

 ##### Step 6: Write full data to Parquet

In [None]:
// takes 93 sec on 45W8c
// takes 108 sec on 30W8c
// takes 1 min to write on 50W8C
wikiFlatForParquetDF.write.parquet("/mnt/wikipedia-readwrite/en_wikipedia/flattenedParquet/")

 ##### Step 7: Keep only the articles that were last updated during or after 2016 and write that smaller subset to Parquet

In [None]:
// Import the sql functions package, which includes statistical functions like sum, max, min, avg, etc.
import org.apache.spark.sql.functions._

In [None]:
wikiFlatForParquetDF.filter(year($"lastrev_est_time") >= 2016).count()

 Over 1 million articles were last updated since the beginning of 2016.

In [None]:
wikiFlatForParquetDF.filter(year($"lastrev_est_time") >= 2016).write.parquet("/mnt/wikipedia-readwrite/en_wikipedia/flattenedParquet_updated2016/")

 ### Write a 1% sample of the 1 million articles for the students' lab:

In [None]:
val wikiFlatDFx = sqlContext.read.parquet("dbfs:/mnt/wikipedia-readonly/en_wikipedia/flattenedParquet_updated2016/")

In [None]:
import org.apache.spark.ml.feature.RegexTokenizer
 
val tokenizer = new RegexTokenizer()
  .setInputCol("text")
  .setOutputCol("words")
  .setPattern("\\W+")

val wikiWordsDFx = tokenizer.transform(wikiFlatDFx)

In [None]:
// Keep the DataFrame in the val; the write happens in the next cell
val onePercentDFx = wikiWordsDFx
                      .sample(false, .01, 555)  // withReplacement = false, fraction = 1%, seed = 555
                      .repartition(100)

In [None]:
onePercentDFx.write.parquet("/mnt/wikipedia-readwrite/en_wikipedia/flattenedParquet_updated2016_1percent/")
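
As a rough illustration of what a 1% sample without replacement means, here is a plain-Scala sketch. It is not Spark's implementation, and the names are hypothetical. It assumes each row is kept independently with the given probability, so the result is close to, but not exactly, 1% of the rows, and a fixed seed makes the draw repeatable.

```scala
import scala.util.Random

// Hypothetical helper: keep each row independently with probability `fraction`.
object BernoulliSampleSketch {
  def sample[A](rows: Seq[A], fraction: Double, seed: Long): Seq[A] = {
    val rng = new Random(seed) // same seed => same sequence of draws
    rows.filter(_ => rng.nextDouble() < fraction)
  }
}
```

Calling `BernoulliSampleSketch.sample(rows, 0.01, 555L)` twice on the same input returns the same subset, which is why a fixed seed is useful when every student should work from identical data.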

 ### More ETL and NLP

 Restart Cluster before continuing...(different sql.functions import)

In [None]:
val wikiFlat5milDF = sqlContext.read.parquet("/mnt/wikipedia-readwrite/en_wikipedia/flattenedParquet/").cache
wikiFlat5milDF.count

 ##### Step 1: Natural Language Processing: lowercase

Next, let's convert the text field to lowercase.  We'll use the `lower` function for this.

In [None]:
wikiFlat5milDF.select($"text").show(5)

In [None]:
import org.apache.spark.sql.functions._

val wikiFlat5milLoweredDF = wikiFlat5milDF.select($"*", lower($"text").as("lowerText"))

In [None]:
wikiFlat5milLoweredDF.select($"text", $"lowerText").show(5)

 ##### Step 2: NLP: Convert the lowerText column into a bag of words and remove stop words

Next, let's convert our text into a list of words so that we can perform some analysis at the word level.  For this we will use a feature transformer called `RegexTokenizer`, which splits strings into tokens (words in our case) based on a split pattern.  We'll split our text on anything that matches one or more non-word characters.

In [None]:
import org.apache.spark.ml.feature.RegexTokenizer
 
val tokenizer = new RegexTokenizer()
  .setInputCol("lowerText")
  .setOutputCol("words")
  .setPattern("\\W+")

val wikiWordsDF = tokenizer.transform(wikiFlat5milLoweredDF)
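
To see what splitting on one or more non-word characters produces, here is a small plain-Scala sketch with hypothetical names that needs no Spark. It only approximates the tokenizer: it lowercases the text, splits on `\W+`, and drops empty strings.

```scala
// Hypothetical helper: a plain-Scala approximation of splitting text on non-word characters.
object TokenizeSketch {
  def tokenize(text: String): Seq[String] =
    text.toLowerCase
      .split("\\W+")        // \W matches anything other than letters, digits and underscore
      .filter(_.nonEmpty)   // a leading delimiter would otherwise yield an empty token
      .toSeq
}
```

Underscores and digits count as word characters, so `TokenizeSketch.tokenize("Hello, World! It's Spark_1.6")` keeps `spark_1` and `6` as tokens and also leaves the one-letter fragment `s` from `it's`. Tokens like these are what the later stop-word step is meant to clean up.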

In [None]:
wikiWordsDF.select("words").first

There are some very common words in our list of words which won't be that useful for our later analysis.  We'll create a UDF to remove them.
 
[StopWordsRemover](http://spark.apache.org/docs/latest/ml-features.html#stopwordsremover) is implemented for Scala but not yet for Python.  We'll use the same [list](http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words) of stop words it uses to build a user-defined function (UDF).

In [None]:
val stopWords = sc.textFile("/mnt/wikipedia-readonly/stopwords/stop_words.txt").collect.toSet

 Create a custom, more powerful function to remove stop words (including words with length < 3 and words containing a digit or underscore):

In [None]:
import scala.collection.mutable.WrappedArray
 
val stopWordsBroadcast = sc.broadcast(stopWords)  // Notice we're using a broadcast variable
 
def isDigitOrUnderscore(c: Char) = {
    Character.isDigit(c) || c == '_'
}
 
def keepWord(word: String) = word match {
    case x if x.length < 3 => false
    case x if stopWordsBroadcast.value(x) => false
    case x if x exists isDigitOrUnderscore => false
    case _ => true
}
 
def removeWords(words: WrappedArray[String]) = {
    words.filter(keepWord(_))
}

Test the function locally.

In [None]:
removeWords(Array("test", "cat", "do343", "343", "spark", "the", "and", "hy-phen", "under_score"))

Create a UDF from our function.

In [None]:
import org.apache.spark.sql.functions.udf
val removeWordsUDF = udf { removeWords _ }

Register this function so that we can call it later from another notebook.  Note that in Scala `register` also returns a `udf` that we can use, so we could have combined the above step into this step.

In [None]:
sqlContext.udf.register("removeWords", removeWords _)

Apply our function to the `wikiWordsDF` `DataFrame`.

In [None]:
val wikiCleanedDF = wikiWordsDF
  .withColumn("noStopWords", removeWordsUDF($"words"))
  .drop("words")
  .withColumnRenamed("noStopWords", "words")
 
wikiCleanedDF.select("words").take(1)

 Let's see what the top 15 words are now:

In [None]:
val noStopWordsListAllWikiDF = wikiCleanedDF.select(explode($"words").as("word"))

In [None]:
noStopWordsListAllWikiDF.show(7)

In [None]:
noStopWordsListAllWikiDF.count()

 That's 2.7 billion words!

 Finally, let's see the top 15 words in all of Wikipedia now (with the stop words removed):

In [None]:
val noStopWordsGroupCount = noStopWordsListAllWikiDF
                      .groupBy("word")  // group
                      .agg(count("word").as("counts"))  // aggregate
                      .sort(desc("counts"))  // sort

noStopWordsGroupCount.take(15).foreach(println)

 Hmm, that looks better than the list we saw when working with just 10,000 articles and using the Spark.ML built-in stop words remover (which left words like 1, 2, s, etc.).
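
The explode, group, count, and sort chain used for the top words can be sketched on ordinary Scala collections. This is only an illustration with hypothetical names. It assumes each article is already a list of cleaned words, and it breaks ties alphabetically, which the DataFrame version does not promise.

```scala
// Hypothetical helper: count words across all documents and return the n most frequent.
object TopWordsSketch {
  def topN(docs: Seq[Seq[String]], n: Int): Seq[(String, Int)] =
    docs.flatten                                   // one element per word, like explode
      .groupBy(identity)                           // group identical words
      .map { case (word, occurrences) => (word, occurrences.size) }
      .toSeq
      .sortBy { case (word, count) => (-count, word) } // highest count first
      .take(n)
}
```

On a cluster the grouping step is a shuffle, which is why the DataFrame version is the expensive part of this section.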

 ##### Step 3: Write the cleaned dataframe to a new Parquet file

In [None]:
// takes 93 sec on 45W8c
// takes 108 sec on 30W8c
// takes 1 min to write on 50W8C
wikiCleanedDF.write.parquet("/mnt/wikipedia-readonly/en_wikipedia/cleanedParquet/")

 ## Machine Learning Pipeline: TF-IDF and K-Means

In [None]:
//val wikiCleanedDF = sqlContext.read.parquet("/mnt/wikipedia-readonly/en_wikipedia/cleanedParquet/").cache()

In [None]:
wikiCleanedDF.columns

In [None]:
wikiCleanedDF.show(5)

 #### Set up the ML Pipeline:

In [None]:
import org.apache.spark.ml.feature.{RegexTokenizer, StopWordsRemover, IDF, HashingTF, Normalizer}

In [None]:
/*val tokenizer = new RegexTokenizer()
  .setInputCol("lowerText")
  .setOutputCol("words2")
  .setPattern("\\W+")
  */

In [None]:
// There are probably > 20K unique words
// More features = more accuracy, but also more complexity and computation time

val hashingTF = new HashingTF().setInputCol("words").setOutputCol("hashingTF").setNumFeatures(20000)
val featurizedData = hashingTF.transform(wikiCleanedDF)

In [None]:
val idf = new IDF().setInputCol("hashingTF").setOutputCol("idf")
val idfModel = idf.fit(featurizedData)
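
For intuition about the two stages above, here is a plain-Scala sketch of the hashing trick and of inverse document frequency. The names are hypothetical. Spark's `HashingTF` uses its own hash function, so the bucket indices here will not match Spark's, and the IDF uses the smoothed form `log((m + 1) / (df + 1))` given in the Spark MLlib documentation.

```scala
// Hypothetical helpers illustrating hashed term frequencies and smoothed IDF.
object TfIdfSketch {
  // Map a term to a bucket in [0, numFeatures); the fix-up keeps negative hash codes in range.
  def bucket(term: String, numFeatures: Int): Int = {
    val raw = term.hashCode % numFeatures
    if (raw < 0) raw + numFeatures else raw
  }

  // Sparse term-frequency vector for one document: bucket index -> count.
  def termFrequencies(words: Seq[String], numFeatures: Int): Map[Int, Double] =
    words.groupBy(bucket(_, numFeatures)).map { case (i, ws) => i -> ws.size.toDouble }

  // Smoothed inverse document frequency for a bucket seen in docFreq of numDocs documents.
  def idf(numDocs: Long, docFreq: Long): Double =
    math.log((numDocs + 1.0) / (docFreq + 1.0))
}
```

With a small `numFeatures`, different words can land in the same bucket and their counts are merged, which is the trade-off behind choosing 20,000 features. A bucket that appears in every document gets an IDF of zero, so very common terms contribute nothing to the final vector.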

In [None]:
// A normalizer is a common operation for text classification.

// It simply gets all of the data on the same scale... for example, if one article is much longer than another, it'll normalize the scales for the different features.

// If we don't normalize, an article with more words would be weighted differently


val normalizer = new Normalizer()
  .setInputCol("idf")
  .setOutputCol("features")
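
What normalization does to a single vector can be sketched in plain Scala. The names are hypothetical, and the sketch assumes the default L2 norm (p = 2): every vector is scaled to unit length, so a long article and a short one with the same word proportions end up with the same features.

```scala
// Hypothetical helper: scale a vector to unit Euclidean length.
object NormalizeSketch {
  def l2Normalize(v: Seq[Double]): Seq[Double] = {
    val norm = math.sqrt(v.map(x => x * x).sum)
    if (norm == 0.0) v else v.map(_ / norm) // leave an all-zero vector unchanged
  }
}
```

For instance, `Seq(3.0, 4.0)` and `Seq(30.0, 40.0)` both normalize to approximately `Seq(0.6, 0.8)`.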

Now, let's build the `KMeans` estimator and a `Pipeline` that will contain all of the stages.  We'll then call `fit` on the `Pipeline`, which will give us back a `PipelineModel`.  On the full dataset this takes several minutes; the comments in the next cell record timings for different cluster sizes.

In [None]:
//for k = 50 takes 7 mins on a VeryLarge 60 worker, 8 core cluster

//for k = 100, 11 mins to run on VeryLarge

// 30W8c can run 120 tasks simultaneously, 15.7 mins (4 cores really each Exec)
// 45W8c can run 180 tasks  simul, 13.3 mins (4 cores really each Exec)

// %sql SET spark.sql.shuffle.partitions

// On a 50 Worker cluster, this takes... 757 sec / 60 = about 12.6 mins

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.KMeans
 
val kmeans = new KMeans()
  .setFeaturesCol("features")
  .setPredictionCol("prediction")
  .setK(100)
  .setSeed(0) // for reproducibility
 
val pipeline = new Pipeline()
  .setStages(Array(hashingTF, idf, normalizer, kmeans))  //add tokenizer, then stop words remover back later for demo2
 
val model = pipeline.fit(wikiCleanedDF)

// This should kick off 33 jobs and take about 21 mins to run (1257 sec)

 Show DAG visualization while we wait.

 The above ML pipeline costs under $50 to run.

Spot prices are set by Amazon EC2 and fluctuate periodically depending on the supply of and demand for Spot instance capacity.

On Demand: $0.66 per hour per r3.2xlarge machine * 50 machines = $33 per hour
https://aws.amazon.com/ec2/pricing/

Spot: $0.07 per hour per r3.2xlarge machine * 50 machines = $3.50 per hour
https://aws.amazon.com/ec2/spot/pricing/
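
The arithmetic behind the cost estimate can be written out as a short sketch. It assumes the listed prices are hourly rates per machine, that the cluster has 50 identical workers, and that the driver is not counted. The names are hypothetical.

```scala
// Hypothetical helper: hourly cost of a cluster of identical machines.
object ClusterCostSketch {
  def hourlyCost(pricePerMachineHour: Double, machines: Int): Double =
    pricePerMachineHour * machines

  val onDemandPerHour: Double = hourlyCost(0.66, 50) // about $33 per hour
  val spotPerHour: Double = hourlyCost(0.07, 50)     // about $3.50 per hour
}
```

Even if a full hour is billed at the on-demand rate, the run stays under the $50 figure quoted above, and the spot rate is roughly a tenth of that.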

Let's take a look at a sample of the data to see if we can see a pattern between predicted clusters and titles.

In [None]:
val predictionsDF = model.transform(wikiCleanedDF)
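
The `prediction` column holds the index of the cluster each article was assigned to. Conceptually, K-Means assigns a point to its nearest cluster center. The plain-Scala sketch below illustrates that assignment step only, with hypothetical names and squared Euclidean distance assumed as the metric. It does not show how the centers are learned.

```scala
// Hypothetical helper: assign a point to the index of its nearest centroid.
object NearestCentroidSketch {
  def squaredDistance(a: Seq[Double], b: Seq[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // The returned index plays the role of the `prediction` column.
  def predict(point: Seq[Double], centroids: Seq[Seq[Double]]): Int =
    centroids.indices.minBy(i => squaredDistance(point, centroids(i)))
}
```

With centroids `Seq(Seq(0.0, 1.0), Seq(1.0, 0.0))`, the point `Seq(0.9, 0.1)` is assigned to cluster 1.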

In [None]:
predictionsDF.columns

In [None]:
predictionsDF.groupBy("prediction").count().show(100)

In [None]:
//politics
display(predictionsDF.filter("prediction = 16").select("title", "prediction"))

In [None]:
// This cluster seems to be about Ford, but notice that TF-IDF can't tell the difference between the car and the last name

//Name:  https://en.wikipedia.org/wiki/Whitey_Ford
// https://en.wikipedia.org/wiki/Harrison_Ford

display(predictionsDF.filter("prediction = 70").select("title", "prediction"))

In [None]:
//Norway, nordic
display(predictionsDF.filter("prediction = 40").withColumn("num_words", size($"words")).select("title", "num_words", "prediction"))

In [None]:
// Games and Sports
display(predictionsDF.filter("prediction = 96").withColumn("num_words", size($"words")).select("title", "num_words", "prediction"))

 Let's find which cluster the Apache Spark article is in:

In [None]:
// Looking for Spark
display(predictionsDF.filter($"title" === "Apache Spark").withColumn("num_words", size($"words")).select("title", "num_words", "prediction"))

In [None]:
// What will the cluster contain? big data? technology? software?
display(predictionsDF.filter("prediction = 8").withColumn("num_words", size($"words")).select("title", "num_words", "prediction"))