![](http://kodcu.com/wp/wp-content/uploads/2014/06/mllib.png)

----
By the end of this session, you should be able to:
----

- Build ML Pipelines
- Recognize when to use Spark Packages
- Perform fundamental text processing:
    + tf-idf
    + word2vec

---
ML Pipelines
---

![](images/pipelines.png)

ML pipelines combine multiple algorithms into a single pipeline, or workflow. 

---
Review
---

<details><summary>
What is a DataFrame?
</summary>
Spark ML uses DataFrame from Spark SQL as an ML dataset, which can hold a variety of data types.
<br>
E.g., a DataFrame could have different columns storing text, feature vectors, true labels, and predictions.
</details>

---
Pipeline components
----

__Transformer__: A Transformer is an algorithm which can transform one DataFrame into another DataFrame. E.g., an ML model is a Transformer which transforms a DataFrame with features into a DataFrame with predictions.

__Estimator__: An Estimator is an algorithm which can be fit on a DataFrame to produce a Transformer. E.g., a learning algorithm is an Estimator which trains on a DataFrame and produces a model.

__Pipeline__: A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow.

__Parameter__: All Transformers and Estimators now share a common API for specifying parameters.

![](https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/images/spark-mllib-pipeline.png)

![](https://databricks.com/wp-content/uploads/2015/01/pipeline-1.png)

[Source](https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/ml-guide.html)
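
To make the Transformer / Estimator / Pipeline contract concrete, here is a toy sketch in plain Python. It is not Spark's API: a list of dicts stands in for a DataFrame, and every class here (`Tokenizer`, `MajorityLabelEstimator`, and so on) is a hypothetical stand-in written only for illustration. The point is the shape of the calls: an Estimator's `fit` returns a Transformer, and fitting a Pipeline returns a fitted pipeline made only of Transformers.

```python
from collections import Counter


class Tokenizer:
    """Transformer: adds a 'words' column derived from the 'text' column."""

    def transform(self, rows):
        return [{**row, "words": row["text"].lower().split()} for row in rows]


class MajorityLabelModel:
    """Transformer produced by fitting: adds a constant 'prediction' column."""

    def __init__(self, label):
        self.label = label

    def transform(self, rows):
        return [{**row, "prediction": self.label} for row in rows]


class MajorityLabelEstimator:
    """Estimator: fit() learns from the data and returns a Transformer."""

    def fit(self, rows):
        label, _count = Counter(row["label"] for row in rows).most_common(1)[0]
        return MajorityLabelModel(label)


class PipelineModel:
    """Fitted pipeline: applies each fitted stage's transform in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Chains stages; fitting replaces every Estimator with its fitted model."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):      # Estimator: train it first
                stage = stage.fit(rows)
            fitted.append(stage)
            rows = stage.transform(rows)   # feed the output to the next stage
        return PipelineModel(fitted)


training = [
    {"text": "Spark is fast", "label": 1},
    {"text": "Hadoop MapReduce", "label": 0},
    {"text": "Spark ML pipelines", "label": 1},
]
model = Pipeline([Tokenizer(), MajorityLabelEstimator()]).fit(training)
predictions = model.transform([{"text": "Logistic regression"}])
print(predictions)
```

In Spark the same pattern applies, with real DataFrames and with stages such as a tokenizer, a feature hasher, and a learning algorithm in place of these toys.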

---
[Pipelines Demo](https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6058142077065523/505261678981876/4338926410488997/latest.html)
----

---
Hyper-parameter Tuning
---

```python
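from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# `pipeline`, `hashingTF`, `lr`, and `training` are assumed to be defined
# already, e.g. as in the cross_validator.py example linked below.

# Build a parameter grid.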
paramGrid = ParamGridBuilder() \
                .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
                .addGrid(lr.regParam, [0.1, 0.01]) \
                .build()

# Set up cross-validation.
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=3) 
                          
# Fit a model with cross-validation.
cvModel = crossval.fit(training)
```
See http://spark.apache.org/docs/latest/ml-guide.html  
and https://github.com/apache/spark/blob/master/examples/src/main/python/ml/cross_validator.py#L69
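
It helps to count how much work that grid implies. The sketch below is plain Python (no Spark needed) and only mimics the bookkeeping: the grid is the Cartesian product of the candidate values, each combination is fit once per fold, and the best combination is then refit once on the full training set. The variable names are illustrative, not part of the Spark API.

```python
from itertools import product

# Same candidate values as the parameter grid above.
grid = {
    "numFeatures": [10, 100, 1000],
    "regParam": [0.1, 0.01],
}
num_folds = 3

# Every combination of one value per parameter.
param_maps = [dict(zip(grid, values)) for values in product(*grid.values())]

# One fit per (combination, fold), plus one final refit of the best combination.
total_fits = len(param_maps) * num_folds + 1

for param_map in param_maps:
    print(param_map)
print("combinations:", len(param_maps))  # 6
print("total fits:", total_fits)         # 19
```

Adding a third parameter with four values would multiply the combinations to 24, so grids grow quickly; keep them small while prototyping.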

---
Spark 2.0 has model persistence, aka saving and loading
----

```python
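from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.tuning import CrossValidator

# Define the workflow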
rf = RandomForestClassifier()
cv = CrossValidator(estimator=rf, ...)

# Fit the model, running Cross-Validation
cvModel = cv.fit(trainingData)

# Extract the results, i.e., the best Random Forest model
bestModel = cvModel.bestModel

# Save the RandomForest model
bestModel.save("rfModelPath")
```

[Source](https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html)

<br>
<br> 
<br>

----
But what about the "Streaming foo bar on baz" algorithm?
---

It was invented yesterday by Google. They haven't released any code. There is just a fuzzy screenshot of Jeff Dean's desktop, but I __can't__ do my capstone without it!

![](http://i.imgur.com/SK9VCDJ.jpg)

Have you heard of [Spark Packages](http://spark-packages.org/)?



![](http://static1.fjcdn.com/thumbnails/comments/Thank+you+kind+sir+_621c0f81885b90188ac2580876cb77f2.jpg)

<br>
<br> 
<br>

----
Text Processing in MLlib
----

![](http://image.slidesharecdn.com/introtobigdataandhadoop01-141116061412-conversion-gate02/95/an-introduction-to-bigdata-processing-applying-hadoop-4-638.jpg?cb=1417344122)

---
Review
---

<details><summary>
What is the goal of Natural Language Processing (NLP)?
</summary>
Turn strings into numbers, then apply standard machine learning algorithms
</details>
<br>
<br>
<details><summary>
What are the most common text algorithms?
</summary>
1. Word Count (including ngrams)  
2. tf-idf  
3. word2vec
</details>

---
tf-idf demo
----

```python
from pyspark.mllib.feature import HashingTF, IDF

# Three tiny "documents", each already split into terms.
# `sc` is the SparkContext provided by the PySpark shell or notebook.
data = [["a", "a", "b"],
        ["a", "b", "c"],
        ["a", "a", "d"]]
rdd = sc.parallelize(data, numSlices=2)

# Hash each term into one of 100 buckets and count occurrences per document.
tf = HashingTF(numFeatures=100)
tfs = tf.transform(rdd)  # => SparseVector(numFeatures, {term_index: term_frequency})
tfs.collect()
```

```
[SparseVector(100, {31: 1.0, 44: 2.0}),
 SparseVector(100, {14: 1.0, 31: 1.0, 44: 1.0}),
 SparseVector(100, {1: 1.0, 44: 2.0})]
```

```python
# Learn document frequencies, then rescale the term frequencies.
idf = IDF().fit(tfs)
tfidf = idf.transform(tfs)
tfidf.collect()  # => SparseVector(numFeatures, {term_index: tf-idf})
```

```
[SparseVector(100, {31: 0.2877, 44: 0.0}),
 SparseVector(100, {14: 0.6931, 31: 0.2877, 44: 0.0}),
 SparseVector(100, {1: 0.6931, 44: 0.0})]
```
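
The numbers in the demo output can be checked by hand. The sketch below is plain Python (no Spark needed) and uses the smoothed formula from the MLlib documentation, idf(t) = log((m + 1) / (df(t) + 1)), where m is the number of documents and df(t) is the number of documents containing term t. As a simplification, it keys the results by the term itself where `HashingTF` would use a hashed index, so compare the weights, not the keys.

```python
import math
from collections import Counter

docs = [["a", "a", "b"],
        ["a", "b", "c"],
        ["a", "a", "d"]]


def inverse_document_frequencies(docs):
    """idf(t) = log((m + 1) / (df(t) + 1)) for every term t in the corpus."""
    m = len(docs)
    # Count each term once per document it appears in.
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((m + 1) / (count + 1)) for term, count in df.items()}


idf = inverse_document_frequencies(docs)

# tf-idf = raw term count in the document * idf of the term.
tfidf = [{term: tf * idf[term] for term, tf in Counter(doc).items()}
         for doc in docs]

for row in tfidf:
    print({term: round(weight, 4) for term, weight in sorted(row.items())})
# {'a': 0.0, 'b': 0.2877}
# {'a': 0.0, 'b': 0.2877, 'c': 0.6931}
# {'a': 0.0, 'd': 0.6931}
```

The term "a" appears in every document, so its weight is log(4/4) = 0 no matter how often it occurs. The terms "c" and "d" each appear in only one document and get the highest weight, log(4/2) ≈ 0.6931.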

---
Check for understanding
---

<details><summary>
What is tf-idf useful for?
</summary>
Measuring the "real" importance of a term in a given corpus: terms that are frequent in one document but rare across the corpus score highest
</details>

----
word2vec
----

```python
from pyspark.mllib.feature import Word2Vec

# Split each line into words; word order is kept because word2vec
# learns from the context surrounding each word.
inp = (sc
       .textFile("grimms_fairy_tales.txt")
       .map(lambda row: row.split(" ")))

word2vec = Word2Vec()
model = word2vec.fit(inp)

# findSynonyms returns (word, cosine similarity) pairs.
n = 5
synonyms = model.findSynonyms('king', n)

for word, cosine_similarity in synonyms:
    print("{}: {:.3}".format(word, cosine_similarity))
```

```
dwarf: 0.567
morning: 0.567
wind: 0.558
princess: 0.556
table: 0.556
```


[RTFM for PySpark NLP](http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.feature.HashingTF)

---
Check for understanding
---

<details><summary>
Why do the rankings change? What is the best way to prevent that?
</summary>
The weights of the network are randomized differently each time. <br>
Train the model on more data to stabilize the learning.
</details>
<br>
<br>
<details><summary>
What other word2vec functions are missing?
</summary>
Almost all of them! <br>
For example, "doesn't match" (finding the odd word out in a group) <br>
</details>
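
A missing function like "doesn't match" is a small "roll your own" exercise once you have the word vectors. The sketch below is plain Python with made-up three-dimensional vectors; the vectors, the word list, and the `doesnt_match` helper are all hypothetical, and this is one simple way to define the odd word out (lowest cosine similarity to the group mean), not a reproduction of any particular library's implementation. With a trained MLlib model you would look up each word's real vector instead.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def doesnt_match(words, vectors):
    """Return the word whose vector is least similar to the group's mean vector."""
    dims = len(vectors[words[0]])
    mean = [sum(vectors[w][i] for w in words) / len(words) for i in range(dims)]
    return min(words, key=lambda w: cosine_similarity(vectors[w], mean))


# Made-up vectors for illustration only; real word2vec vectors are much longer.
toy_vectors = {
    "king":     [0.90, 0.80, 0.10],
    "queen":    [0.85, 0.90, 0.15],
    "princess": [0.80, 0.85, 0.20],
    "table":    [0.10, 0.20, 0.95],
}

odd_one_out = doesnt_match(["king", "queen", "princess", "table"], toy_vectors)
print(odd_one_out)  # table
```

The same `cosine_similarity` helper is the measure behind the synonym rankings: a higher value means the two word vectors point in more similar directions.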

----
Summary
----

- Spark tries to make Big Data easier (and faster), including moving into production
- What exists is very easy to use, but you will often have to "Roll Your Own" (RYO) code
- You can limit "re-inventing the wheel" with help from the community via Spark Packages
- Spark has powerful (but limited) tools for text processing

<br>
<br> 
<br>

----