## Goal

### Cluster a subset of Wikipedia articles into 100 groups

### KMeans Algorithm
[image1]: ./images/kmeans.jpg
![pipeline][image1]

### Understand the dataset

In [1]:
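import org.apache.spark.sql.SparkSession
// Reuse the session the notebook kernel already provides (also available as `spark`)
val ss = SparkSession.builder().getOrCreate()
import ss.implicits._
import org.apache.spark.sql.functions._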

In [2]:
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

In [10]:
val wikiDF = spark.read.parquet("/Users/josiahsams/SparkDemo/datasets/wiki.parquet")
            .filter($"text".isNotNull)
            .select($"id", $"title", $"revid", 
                    $"lastrev_pdt_time", $"comment", 
                    $"contributorid", $"contributorusername",
                    $"contributorip", lower($"text").as("lowerText"))

wikiDF: org.apache.spark.sql.DataFrame = [id: bigint, title: string ... 7 more fields]


In [178]:
wikiDF.printSchema

root
 |-- id: long (nullable = true)
 |-- title: string (nullable = true)
 |-- revid: long (nullable = true)
 |-- lastrev_pdt_time: timestamp (nullable = true)
 |-- comment: string (nullable = true)
 |-- contributorid: long (nullable = true)
 |-- contributorusername: string (nullable = true)
 |-- contributorip: string (nullable = true)
 |-- lowerText: string (nullable = true)



### The dataset contains articles updated over 3 days

In [107]:
wikiDF.select(dayofyear($"lastrev_pdt_time")).distinct().count

res84: Long = 3


In [108]:
wikiDF.select(to_date($"lastrev_pdt_time").as("dates")).orderBy($"dates").distinct.show

+----------+
|     dates|
+----------+
|2017-09-30|
|2017-10-01|
|2017-10-02|
+----------+



In [109]:
printf("%.2f%%\n", wikiDF.count()/5096292.0*100)

1.86%


In [110]:
wikiDF.select($"title", $"lastrev_pdt_time").show(10)

+--------------------+--------------------+
|               title|    lastrev_pdt_time|
+--------------------+--------------------+
|Fingal (music group)|2017-10-01 06:10:...|
|File:KXOL-AM logo...|2017-10-01 07:55:...|
|Komorze Nowomiejskie|2017-09-30 23:40:...|
|Adamowo, Wolsztyn...|2017-09-30 21:36:...|
|Stara Dąbrowa, Wo...|2017-09-30 21:42:...|
|               TopoR|2017-10-01 23:49:...|
|Jeremy Bates (Ame...|2017-10-01 20:58:...|
|       Matthew Mixer|2017-09-30 19:53:...|
|     Western culture|2017-09-30 17:43:...|
|History of FC Bar...|2017-10-01 12:18:...|
+--------------------+--------------------+
only showing top 10 rows



In [180]:
wikiDF.createOrReplaceTempView("wikipedia")

### Number of Articles edited by Bots 

In [184]:
%%sql
select count(*) from wikipedia where contributorusername like "%Bot %"

+--------+
|count(1)|
+--------+
|     516|
+--------+
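The pattern `%Bot %` in the query above matches only usernames that contain `Bot` followed by a space, so names such as `InternetArchiveBot` or `DatBot` (both listed in the contributor table further down in this notebook) are not counted. The following is a minimal plain-Scala sketch of that matching behaviour, not a Spark query. The helpers `matchesBotSpace` and `endsWithBot` and the username `Example Bot 2` are hypothetical, and treating "ends with bot, ignoring case" as the definition of a bot is an assumption.

```scala
// SQL `name LIKE '%Bot %'` is true when name contains the substring "Bot " (case-sensitive)
def matchesBotSpace(name: String): Boolean = name.contains("Bot ")

// Hypothetical broader rule: the name ends with "bot", ignoring case
def endsWithBot(name: String): Boolean = name.toLowerCase.endsWith("bot")

// "Example Bot 2" is a made-up name; the others appear in this notebook's output
val names = Seq("InternetArchiveBot", "DatBot", "WP 1.0 bot", "Example Bot 2", "Hmains")

val narrow = names.filter(matchesBotSpace)
val broad = names.filter(endsWithBot)

println(narrow)
println(broad)
```

If the broader rule is the intended one, the count above likely understates the number of bot edits.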


### Who is the most active contributor in the last 3 days?

In [185]:
%%sql
select contributorusername, count(contributorusername) from wikipedia group by contributorusername order by count(contributorusername) desc

+-------------------+--------------------------+
|contributorusername|count(contributorusername)|
+-------------------+--------------------------+
|       Red Director|                      4003|
| InternetArchiveBot|                      3134|
|             DatBot|                      1970|
|          WOSlinker|                      1716|
|            AvicBot|                      1077|
|             Dawynn|                       992|
|         WP 1.0 bot|                       886|
|               Tim!|                       879|
|             Hmains|                       846|
|           Onel5969|                       818|
+-------------------+--------------------------+
only showing top 10 rows


### Task for a Data Scientist 

### Spark String Tokenizer using Regular Expression

#### Transformer: RegexTokenizer transforms a DataFrame into another DataFrame with additional columns

In [166]:
import org.apache.spark.ml.feature.RegexTokenizer
 
val tokenizer = new RegexTokenizer()
  .setInputCol("lowerText")
  .setOutputCol("words")
  .setPattern("\\W+")

val wikiWordsDF = tokenizer.transform(wikiDF)

tokenizer: org.apache.spark.ml.feature.RegexTokenizer = regexTok_9e08f2608e68
wikiWordsDF: org.apache.spark.sql.DataFrame = [id: bigint, title: string ... 8 more fields]
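As a rough illustration of what the `\\W+` pattern does to one article, here is a minimal plain-Scala sketch. It assumes the tokenizer treats the pattern as a delimiter, lowercases the text, and drops empty tokens. The function `tokenize` and the sample wikitext string are hypothetical.

```scala
// Split on runs of non-word characters, mirroring the "\\W+" gap pattern
def tokenize(text: String): Seq[String] =
  text.toLowerCase.split("\\W+").toSeq.filter(_.nonEmpty)

// Hypothetical snippet of wikitext markup
val tokens = tokenize("{{Infobox officeholder | name = Filippo Tamagnini}}")

println(tokens)
```

Markup characters such as braces, pipes and equals signs act as separators, which is why template names like `infobox` show up as ordinary words.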


In [167]:
wikiWordsDF.select($"title", $"words").show

+--------------------+--------------------+
|               title|               words|
+--------------------+--------------------+
|Fingal (music group)|[image, with, unk...|
|File:KXOL-AM logo...|[summary, logo, f...|
|Komorze Nowomiejskie|[infobox, settlem...|
|Adamowo, Wolsztyn...|[otherplaces, ada...|
|Stara Dąbrowa, Wo...|[other, places, s...|
|               TopoR|[redir, toporoute...|
|Jeremy Bates (Ame...|[infobox, college...|
|       Matthew Mixer|[no, footnotes, b...|
|     Western culture|[about, this, art...|
|History of FC Bar...|[about, a, statis...|
|John Archer (poli...|[other, people, j...|
|Tremont Avenue–17...|[other, uses, tre...|
|Wikipedia:WikiPro...|[anchor, aastart,...|
|Wikipedia:Article...|[div, class, boil...|
|List of Soviet fi...|[soviet, film, li...|
|           L. Gordon|[no, footnotes, d...|
|              Raygun|[about, the, fict...|
|Black and White (...|[unreferenced, da...|
|        Chabrouh Dam|[the, faraya, cha...|
|Wikipedia:Article...|[div, clas

### Most common words in the edited articles

In [168]:
// Note: despite the variable name, a fraction of .01 draws roughly a 1% sample (seed 555)
val tenPercentDF = wikiWordsDF.sample(false, .01, 555).cache

tenPercentDF: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: bigint, title: string ... 8 more fields]


In [169]:
printf("%,d articles (total)\n", wikiWordsDF.count)
printf("%,d articles (sample)\n", tenPercentDF.count)

94,924 articles (total)
994 articles (sample)


Let's explode the words column into a table of one word per row:

In [170]:
val tenPercentWordsListDF = tenPercentDF.select(explode($"words").as("word"))

tenPercentWordsListDF: org.apache.spark.sql.DataFrame = [word: string]
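Conceptually, `explode` behaves like a `flatMap` over the array column: every array element becomes its own row. A minimal plain-Scala sketch, with hypothetical article titles and word lists:

```scala
// Each input row holds (title, words)
val rows = Seq(
  ("Article A", Seq("san", "marino")),
  ("Article B", Seq("regent"))
)

// One output row per array element, keeping only the word
val exploded: Seq[String] = rows.flatMap { case (_, words) => words }

println(exploded)
```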


In [171]:
tenPercentWordsListDF.show

+------------+
|        word|
+------------+
|     infobox|
|officeholder|
|        name|
|     filippo|
|   tamagnini|
|      office|
|        list|
|          of|
|    captains|
|      regent|
|          of|
|         san|
|      marino|
|     captain|
|      regent|
|          of|
|         san|
|      marino|
|   alongside|
|       maria|
+------------+
only showing top 20 rows



In [172]:
printf("%,d words\n", tenPercentWordsListDF.cache().count())

2,743,246 words


In [173]:
val wordGroupCountDF = tenPercentWordsListDF
                      .groupBy("word")  // group
                      .agg(count("word").as("counts"))  // aggregate
                      .sort(desc("counts"))  // sort

wordGroupCountDF.show(15)

wordGroupCountDF: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [word: string, counts: bigint]


+-----+------+
| word|counts|
+-----+------+
|  the| 88288|
|   of| 48834|
|  ref| 38809|
|  and| 37230|
|   in| 36719|
|    a| 27794|
|   to| 27402|
|    s| 17555|
|title| 17407|
| http| 15277|
|  for| 14738|
|  was| 14436|
|   by| 14144|
|   on| 13677|
| name| 13097|
+-----+------+
only showing top 15 rows



In [177]:
tenPercentWordsListDF.unpersist

res156: tenPercentWordsListDF.type = [word: string]


### Remove the "stop words" before running Natural Language Processing algorithms on the data

In [247]:
import org.apache.spark.ml.feature.StopWordsRemover

val stopwords = spark.read.textFile("/Users/josiahsams/SparkDemo/datasets/stopwords.txt").collect

val remover = new StopWordsRemover()
  .setInputCol("words")
  .setOutputCol("noStopWords")
  .setStopWords(stopwords)

stopwords: Array[String] = Array(a, about, above, after, again, against, all, am, an, and, any, are, aren't, as, at, be, because, been, before, being, below, between, both, but, by, can't, cannot, could, couldn't, did, didn't, do, does, doesn't, doing, don't, down, during, each, few, for, from, further, had, hadn't, has, hasn't, have, haven't, having, he, he'd, he'll, he's, her, here, here's, hers, herself, him, himself, his, how, how's, i, i'd, i'll, i'm, i've, if, in, into, is, isn't, it, it's, its, itself, let's, me, more, most, mustn't, my, myself, no, nor, not, of, off, on, once, only, or, other, ought, our, ours, ourselves, out, over, own, same, shan't, she, she'd, she'll, she's, should, shouldn't, s, so, some, such, than, that, ...
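The effect of stop word removal on a single token list can be sketched in plain Scala as a filter against a set. The sketch assumes case-insensitive matching, and `stopWordSubset` is a hypothetical subset of the entries shown in the output above, not the full file. The helper name `removeStopWords` is also illustrative.

```scala
// Small subset of the stop word list, for illustration only
val stopWordSubset = Set("a", "about", "no", "of", "other", "s")

// Keep only the tokens that are not in the stop word set
def removeStopWords(words: Seq[String], stop: Set[String]): Seq[String] =
  words.filterNot(w => stop.contains(w.toLowerCase))

val cleaned = removeStopWords(
  Seq("list", "of", "captains", "regent", "of", "san", "marino"),
  stopWordSubset
)

println(cleaned)
```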


In [248]:
remover.transform(tenPercentDF).select("id", "title", "words", "noStopWords").show(15)

+--------+--------------------+--------------------+--------------------+
|      id|               title|               words|         noStopWords|
+--------+--------------------+--------------------+--------------------+
|31361855|   Filippo Tamagnini|[infobox, officeh...|[infobox, officeh...|
| 2216434|         FK Pelister|[infobox, footbal...|[infobox, footbal...|
| 9772924|      Trevor Hebberd|[engvarb, date, j...|[engvarb, july, u...|
| 4674998|Borodino-class ba...|[about, the, boro...|[borodino, battle...|
|53866204|Template:2017–18 ...|[noinclude, read,...|[noinclude, read,...|
|53871267|               Saaho|[use, indian, eng...|[use, indian, eng...|
|53980626|      Vicente Parras|[infobox, footbal...|[infobox, footbal...|
|54142804|Recurring Saturda...|[unreferenced, da...|[unreferenced, ma...|
| 6068811|Category:Zoos in ...|[commons, categor...|[commons, zoos, w...|
| 2560194|    Brian Culbertson|[use, mdy, dates,...|[use, mdy, dates,...|
|  229723|                 Lie|[other,

In [249]:
val noStopWordsListDF = remover.transform(tenPercentDF).select(explode($"noStopWords").as("word"))

noStopWordsListDF: org.apache.spark.sql.DataFrame = [word: string]


In [250]:
noStopWordsListDF.show(10)

+------------+
|        word|
+------------+
|     infobox|
|officeholder|
|     filippo|
|   tamagnini|
|      office|
|        list|
|    captains|
|      regent|
|         san|
|      marino|
+------------+
only showing top 10 rows



In [251]:
printf("%,d words\n", noStopWordsListDF.cache.count)

1,711,368 words


In [253]:
val noStopWordsGroupCount = noStopWordsListDF
                      .groupBy("word")  // group
                      .agg(count("word").as("counts"))  // aggregate
                      .sort(desc("counts"))  // sort

noStopWordsGroupCount.show(30)

noStopWordsGroupCount: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [word: string, counts: bigint]


+---------+------+
|     word|counts|
+---------+------+
|september|  3655|
|      may|  3630|
|   united|  3490|
|   closed|  3392|
|   august|  3220|
|       uk|  3171|
|     year|  3165|
|      one|  3160|
|     list|  3087|
|     left|  2900|
| national|  2887|
|       17|  2881|
| articles|  2875|
|    right|  2856|
|  october|  2831|
|     june|  2814|
|       18|  2799|
|     city|  2792|
|     2008|  2779|
|     july|  2757|
|       15|  2740|
|  january|  2720|
|   states|  2689|
|       16|  2685|
|   series|  2673|
|     film|  2652|
|   season|  2636|
|    world|  2626|
|       14|  2573|
|       13|  2548|
+---------+------+
only showing top 30 rows



In [254]:
printf("%,d distinct words", noStopWordsListDF.distinct.count)

138,945 distinct words

### Term Frequency

In [8]:
import org.apache.spark.ml.feature.{RegexTokenizer, StopWordsRemover, HashingTF, IDF, Normalizer}

In [9]:
val noStopWordsListDF = remover.transform(wikiWordsDF)

noStopWordsListDF: org.apache.spark.sql.DataFrame = [id: bigint, title: string ... 10 more fields]


In [10]:
noStopWordsListDF.select("id", "title", "words", "noStopWords").show(15)

+--------+--------------------+--------------------+--------------------+
|      id|               title|               words|         noStopWords|
+--------+--------------------+--------------------+--------------------+
|21127403|Fingal (music group)|[image, with, unk...|[image, unknown, ...|
|21145234|File:KXOL-AM logo...|[summary, logo, f...|[summary, logo, f...|
|21157758|Komorze Nowomiejskie|[infobox, settlem...|[infobox, settlem...|
|21158912|Adamowo, Wolsztyn...|[otherplaces, ada...|[otherplaces, ada...|
|21158944|Stara Dąbrowa, Wo...|[other, places, s...|[places, stara, b...|
|21174956|               TopoR|[redir, toporoute...|[redir, toporoute...|
|21188846|Jeremy Bates (Ame...|[infobox, college...|[infobox, college...|
|21197084|       Matthew Mixer|[no, footnotes, b...|[footnotes, blp, ...|
|21208262|     Western culture|[about, this, art...|[article, equival...|
|21221470|History of FC Bar...|[about, a, statis...|[statistical, bre...|
|21233027|John Archer (poli...|[other,

In [11]:
// More features mean fewer hash collisions (better accuracy) but more memory and computation time

val hashingTF = new HashingTF().setInputCol("noStopWords").setOutputCol("hashingTF").setNumFeatures(20000)
val featurizedDataDF = hashingTF.transform(noStopWordsListDF)

hashingTF: org.apache.spark.ml.feature.HashingTF = hashingTF_d7631946c6e2
featurizedDataDF: org.apache.spark.sql.DataFrame = [id: bigint, title: string ... 11 more fields]
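The idea behind hashed term frequencies can be sketched in plain Scala: each word is hashed to a bucket index below `numFeatures`, and the count per bucket becomes the feature value. This sketch uses `String.hashCode` as a stand-in hash, which is an assumption; Spark's `HashingTF` uses its own hash function, so the bucket indices will differ from those in the output below. The helpers `bucket` and `hashingTf` are hypothetical.

```scala
// Map a term to a bucket in [0, numFeatures); hashCode is only a stand-in hash
def bucket(term: String, numFeatures: Int): Int =
  ((term.hashCode % numFeatures) + numFeatures) % numFeatures

// Sparse term-frequency vector: bucket index -> number of terms hashed into it
def hashingTf(words: Seq[String], numFeatures: Int): Map[Int, Double] =
  words.groupBy(bucket(_, numFeatures)).map { case (i, ws) => i -> ws.size.toDouble }

val tf = hashingTf(Seq("san", "marino", "regent", "san"), 20000)

println(tf)
```

With a small `numFeatures`, unrelated words are more likely to share a bucket, which is the accuracy cost the comment in the cell above refers to.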


In [12]:
featurizedDataDF.select("id", "title", "noStopWords", "hashingTF").show(7)

+--------+--------------------+--------------------+--------------------+
|      id|               title|         noStopWords|           hashingTF|
+--------+--------------------+--------------------+--------------------+
|21127403|Fingal (music group)|[image, unknown, ...|(20000,[196,230,3...|
|21145234|File:KXOL-AM logo...|[summary, logo, f...|(20000,[32,278,34...|
|21157758|Komorze Nowomiejskie|[infobox, settlem...|(20000,[15,589,83...|
|21158912|Adamowo, Wolsztyn...|[otherplaces, ada...|(20000,[15,212,57...|
|21158944|Stara Dąbrowa, Wo...|[places, stara, b...|(20000,[15,212,65...|
|21174956|               TopoR|[redir, toporoute...|(20000,[15,20,22,...|
|21188846|Jeremy Bates (Ame...|[infobox, college...|(20000,[1,15,20,8...|
+--------+--------------------+--------------------+--------------------+
only showing top 7 rows



### Inverse Document Frequency (IDF)

#### Estimator: IDF

* An Estimator is fit on a DataFrame to produce a Model
* The Model is a Transformer that transforms a DataFrame into another DataFrame with additional columns (here, the IDF-weighted features)

In [13]:
// This will take 3 - 4 mins to run
val idf = new IDF().setInputCol("hashingTF").setOutputCol("idf")
val idfModel = idf.fit(featurizedDataDF)

idf: org.apache.spark.ml.feature.IDF = idf_3314bdc1993f
idfModel: org.apache.spark.ml.feature.IDFModel = idf_3314bdc1993f
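The fit-then-transform split can be sketched in plain Scala. Fitting scans the whole corpus once to learn one weight per term, and transforming applies those weights to a document. The sketch works on words rather than hashed indices and uses the smoothed formula log((m + 1) / (df + 1)), which to my understanding matches Spark's IDF. The names `fitIdf` and `tfIdf` and the three-document corpus are hypothetical.

```scala
// "Fit": learn one IDF weight per term from the whole corpus.
// m is the number of documents, df the number of documents containing the term.
def fitIdf(docs: Seq[Seq[String]]): Map[String, Double] = {
  val m = docs.size.toDouble
  val docFreq = docs.flatMap(_.distinct).groupBy(identity).map { case (t, occ) => t -> occ.size }
  docFreq.map { case (term, df) => term -> math.log((m + 1.0) / (df + 1.0)) }
}

// "Transform": multiply each term count in a document by the learned weight
def tfIdf(doc: Seq[String], idf: Map[String, Double]): Map[String, Double] =
  doc.groupBy(identity).map { case (t, occ) => t -> occ.size * idf.getOrElse(t, 0.0) }

val corpus = Seq(
  Seq("airport", "city"),
  Seq("airport", "airline"),
  Seq("airport", "ship", "ship")
)
val idfWeights = fitIdf(corpus)
val weighted = tfIdf(corpus(2), idfWeights)

println(weighted)
```

A term that appears in every document, like `airport` here, gets a weight of zero, so it no longer helps to tell articles apart.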


In [14]:
// Normalization is a common step when working with text features.

// It puts every article's feature vector on the same scale: each vector is rescaled to unit length, so a long article and a short one become comparable.

// Without normalization, an article with more words would carry more weight simply because of its length.

val normalizer = new Normalizer()
  .setInputCol("idf")
  .setOutputCol("features")

normalizer: org.apache.spark.ml.feature.Normalizer = normalizer_93cac342c469
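A minimal plain-Scala sketch of what unit-length scaling does, assuming the Euclidean (L2) norm, which is the `Normalizer` default. The function `l2Normalize` and the two example vectors are hypothetical.

```scala
// Scale a vector to unit Euclidean (L2) length; a zero vector is returned unchanged
def l2Normalize(v: Seq[Double]): Seq[Double] = {
  val norm = math.sqrt(v.map(x => x * x).sum)
  if (norm == 0.0) v else v.map(_ / norm)
}

// Same word proportions, but the second "article" is ten times longer
val shortDoc = l2Normalize(Seq(3.0, 4.0))
val longDoc = l2Normalize(Seq(30.0, 40.0))

println(shortDoc)
println(longDoc)
```

After normalization both vectors point in the same direction with the same length, so the distance between them no longer depends on article size.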


### Create a Spark ML pipeline

[image2]: ./images/ml-pipeline1.png
![mlpipeline][image2]

### Train and save the model

In [None]:
// This will take over 1 hour to run!

/*
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.KMeans
 
val kmeans = new KMeans()
  .setFeaturesCol("features")
  .setPredictionCol("prediction")
  .setK(100)
  .setSeed(0) // for reproducibility
 
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, remover, hashingTF, idf, normalizer, kmeans))  
 
val model = pipeline.fit(wikiDF)

model.save("/Users/josiahsams/SparkDemo/datasets/wiki.model")
*/
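The final pipeline stage groups the normalized feature vectors into `k` clusters. The core loop of k-means can be sketched in plain Scala on a toy 2-D dataset. This is an illustration only: it uses fixed initial centers and plain Euclidean distance, whereas Spark's `KMeans` has its own initialization and runs distributed. The names `squaredDistance`, `assign` and `update`, and all data points, are hypothetical.

```scala
def squaredDistance(a: Seq[Double], b: Seq[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

// Assignment step: index of the nearest center
def assign(p: Seq[Double], centers: Seq[Seq[Double]]): Int =
  centers.indices.minBy(i => squaredDistance(p, centers(i)))

// Update step: each center moves to the mean of the points assigned to it
def update(points: Seq[Seq[Double]], centers: Seq[Seq[Double]]): Seq[Seq[Double]] = {
  val grouped = points.groupBy(assign(_, centers))
  centers.indices.map { i =>
    grouped.get(i) match {
      case Some(ps) => ps.transpose.map(col => col.sum / ps.size)
      case None     => centers(i) // an empty cluster keeps its center
    }
  }
}

val points = Seq(Seq(0.0, 0.0), Seq(0.0, 2.0), Seq(10.0, 10.0), Seq(10.0, 12.0))
val initial = Seq(Seq(1.0, 1.0), Seq(9.0, 9.0))

// Run five assign/update rounds (k = 2 here; the notebook uses k = 100)
val centers = Iterator.iterate(initial)(update(points, _)).drop(5).next()
val labels = points.map(assign(_, centers))

println(centers)
println(labels)
```

The `labels` value plays the role of the `prediction` column: one cluster index per input row.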

### Load the model

In [160]:
val model2 = org.apache.spark.ml.PipelineModel.load("/Users/josiahsams/SparkDemo/datasets/wiki.model")

model2: org.apache.spark.ml.PipelineModel = pipeline_fc837089157a


In [163]:
val rawPredictionsDF = model2.transform(wikiDF)

rawPredictionsDF: org.apache.spark.sql.DataFrame = [id: bigint, title: string ... 13 more fields]


In [164]:
rawPredictionsDF.columns

res147: Array[String] = Array(id, title, revid, lastrev_pdt_time, comment, contributorid, contributorusername, contributorip, lowerText, words, noStopWords, hashingTF, idf, features, prediction)


In [14]:
val predictionsDF = rawPredictionsDF.select($"title", $"prediction").cache

predictionsDF: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [title: string, prediction: int]


In [156]:
predictionsDF.show(10)

+--------------------+----------+
|               title|prediction|
+--------------------+----------+
|Fingal (music group)|        29|
|File:KXOL-AM logo...|         6|
|Komorze Nowomiejskie|        37|
|Adamowo, Wolsztyn...|        37|
|Stara Dąbrowa, Wo...|        37|
|               TopoR|        87|
|Jeremy Bates (Ame...|        46|
|       Matthew Mixer|        15|
|     Western culture|        81|
|History of FC Bar...|        86|
+--------------------+----------+
only showing top 10 rows



In [150]:
// This will take 4-5 minutes
predictionsDF.groupBy("prediction").count().orderBy($"count".desc).show(10)



+----------+-----+
|prediction|count|
+----------+-----+
|        29|21923|
|        81| 9552|
|        12| 6886|
|        87| 4714|
|        21| 3961|
|        42| 3904|
|        58| 2948|
|        86| 2641|
|         6| 2423|
|        91| 1980|
+----------+-----+
only showing top 10 rows



In [151]:
// Awards
predictionsDF.filter("prediction = 27").select("title", "prediction").show(10)

+--------------------+----------+
|               title|prediction|
+--------------------+----------+
|Filmfare Award fo...|        27|
|Filmfare Award fo...|        27|
|    2011 Emmy Awards|        27|
|        IAWTV Awards|        27|
|Sumathi Best Tele...|        27|
|          Saba Qamar|        27|
|List of awards an...|        27|
|List of awards an...|        27|
|List of awards an...|        27|
| World Travel Awards|        27|
+--------------------+----------+
only showing top 10 rows



In [153]:
// Naval warships
predictionsDF.filter("prediction = 11").select("title", "prediction").show(10)

+--------------------+----------+
|               title|prediction|
+--------------------+----------+
|    HMS Alban (1806)|        11|
|TSS Duke of Lanca...|        11|
|Borodino-class ba...|        11|
|USS Glacier (AK-183)|        11|
|USS Muscatine (AK...|        11|
|USS Beaufort (PCS...|        11|
|USNS Private John...|        11|
| Orpheus (1818 ship)|        11|
|USS Dukes County ...|        11|
|USS Alligator (1862)|        11|
+--------------------+----------+
only showing top 10 rows



In [154]:
// Samsung phones
predictionsDF.filter("prediction = 57").select("title", "prediction").show(10)

+--------------------+----------+
|               title|prediction|
+--------------------+----------+
|Samsung Galaxy A3...|        57|
|   Samsung Galaxy S4|        57|
|  Samsung Experience|        57|
|Samsung Internet ...|        57|
|Renault Samsung M...|        57|
|Samsung Galaxy Ta...|        57|
|File:Logo of Sams...|        57|
|    Japanese noctule|        57|
|Samsung Galaxy S III|        57|
|   Samsung Galaxy J5|        57|
+--------------------+----------+
only showing top 10 rows



In [155]:
// Airports and airlines
predictionsDF.filter("prediction = 56").select("title", "prediction").show(10)

+--------------------+----------+
|               title|prediction|
+--------------------+----------+
|Columbia Regional...|        56|
|List of airports ...|        56|
|Gregorio Luperón ...|        56|
|New Plymouth Airport|        56|
|Goa International...|        56|
|Cologne Bonn Airport|        56|
|            Sita Air|        56|
|Yemelyanovo Inter...|        56|
|List of airports ...|        56|
|Dallas/Fort Worth...|        56|
+--------------------+----------+
only showing top 10 rows

