# Analytics with Apache Spark

## Why Apache Spark for Machine Learning?

- Bigger than memory datasets

There is no need to sample a smaller but statistically significant fraction of the data in order to train a model on a single machine, since Spark is able to deliver a dataset of any size to a model.

- Compatibility

Multiple languages are supported (Scala, Java, Python, and R), and Spark integrates well with existing tools (e.g., Pandas).

- **General purpose!**

Not only scalable machine learning, but also advanced data preparation: normalizing features, handling invalid values, constructing new features, etc.

## MLlib

- RDD-based API (spark.mllib)
  - No new features, only bugfixes.
  - Was expected to be removed in Spark 3.0, but is still shipped in maintenance mode.


- DataFrame-based API (spark.ml)
  - Friendlier and more uniform API than spark.mllib
  
## Main concepts

In the following code snippets I introduce concepts such as the **DataFrame**, **transformer**, **estimator**, and **pipeline** in Spark's "new" ML API.

In [1]:
from pyspark import SQLContext
from pyspark import SparkContext


sc = SparkContext(appName="CERN Spark ML tutorial: main concepts")
sqlContext = SQLContext(sc)

### Dataframe

In this example we assume we have a dataset consisting of blog comments on machine learning. Every instance (a comment) is assigned a label, which is either negative (0.0) or positive (1.0). Our job is to train a classifier that is able to classify new comments.

In [2]:
# Imagine having a dataset with comments on machine learning. 
# Every instance (example or row), is tagged with a positive (1.0) or a negative (0.0) label.
dataset = sqlContext.createDataFrame([
    (0, "robots will take over and destroy the world like skynet", 0.0),
    (1, "AI helps humanity solve many problems", 1.0),
    (2, "unsupervised learning is pretty cool you can do a lot of awesome stuff with it", 1.0),
    (3, "i think unsupervised learning is naive", 0.0),
    (4, "machine learning is just a hype", 0.0),
    (5, "machine learning is awesome", 1.0)], ["id", "text", "label"])

In [3]:
dataset.show()

+---+--------------------+-----+
| id|                text|label|
+---+--------------------+-----+
|  0|robots will take ...|  0.0|
|  1|AI helps humanity...|  1.0|
|  2|unsupervised lear...|  1.0|
|  3|i think unsupervi...|  0.0|
|  4|machine learning ...|  0.0|
|  5|machine learning ...|  1.0|
+---+--------------------+-----+



### Transformers and estimators

**tl;dr** a *transformer* is responsible for changing the structure or contents of a DataFrame. For example, in Spark, a trained machine learning model is a transformer, since the model adds a `prediction` column (the default name) to every instance in the DataFrame. An *estimator* is an abstraction which computes some parameters based on the provided DataFrame. This could be, for example, a learning algorithm, or a method which gathers some statistics. The result is a model, which in turn can be used to transform a DataFrame.
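
To make the distinction concrete, here is a minimal plain-Python sketch of the estimator/transformer contract. It does not use Spark at all: `MeanCenterer` and `MeanCentererModel` are hypothetical names invented for this illustration, and rows are plain dictionaries instead of DataFrame rows.

```python
class MeanCentererModel:
    """Hypothetical transformer: adds a column using a previously learned mean."""

    def __init__(self, input_col, output_col, mean):
        self.input_col = input_col
        self.output_col = output_col
        self.mean = mean

    def transform(self, rows):
        # Return new rows with an extra column; the input rows are left untouched.
        return [dict(row, **{self.output_col: row[self.input_col] - self.mean})
                for row in rows]


class MeanCenterer:
    """Hypothetical estimator: learns the mean of a numeric column."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def fit(self, rows):
        # One pass over the data to compute the parameter (the mean).
        values = [row[self.input_col] for row in rows]
        mean = sum(values) / len(values)
        # fit() returns a model, and that model is a transformer.
        return MeanCentererModel(self.input_col, self.output_col, mean)


rows = [{"x": 1.0}, {"x": 2.0}, {"x": 6.0}]
model = MeanCenterer(input_col="x", output_col="centered").fit(rows)
centered = model.transform(rows)
```

The same shape appears in the Spark cells of this notebook: calling `fit` on an estimator yields a model, and calling `transform` on that model appends a column.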


We first need to preprocess the full-text data before we can start training our model. This (very) simple example shows how easily this can be done with Spark. In the first step, we apply a tokenizer, which parses the text and creates a vector of words. Next, frequent words like "i", "a", "as", ... are not really descriptive and can thus be filtered out (reducing the dimensionality of the problem in the process). Now that every comment is described by a vector of "meaningful" words, we can start constructing the feature vectors for our machine learning model. In this example, we compute a term-frequency vector to represent an instance. One could also use more complex features such as TF-IDF vectors or word embeddings (see Word2Vec by Tomas Mikolov).

Note that in order to compute a term-frequency vector, one first needs the number of words in the dictionary and a mapping from each word to its index. As a result, this cannot be done directly with a transformer, since it requires two passes over the data: one to build the dictionary and one to count the words. Instead, we fit an estimator, which returns a model:

``cvModel = cv.fit(dataset)``

This model will hold all the information we need in order to construct the term-frequency vectors. Finally, we apply this model to the dataset (thus, the model is a transformer) and obtain the processed dataset that we will feed to the learning algorithm.
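
The two passes can be sketched in plain Python as follows. This is only an illustration of the idea, not Spark's implementation: the function names are hypothetical, and the alphabetical tie-breaking between equally frequent words is an assumption made to keep the example deterministic.

```python
from collections import Counter


def fit_vocabulary(documents):
    """First pass: count every word and assign each one an index."""
    counts = Counter(word for doc in documents for word in doc)
    # Most frequent words first; ties broken alphabetically (an assumption).
    ordered = sorted(counts, key=lambda word: (-counts[word], word))
    return {word: index for index, word in enumerate(ordered)}


def to_count_vector(doc, vocabulary):
    """Second pass: turn one tokenized document into a count vector."""
    vector = [0.0] * len(vocabulary)
    for word in doc:
        index = vocabulary.get(word)
        if index is not None:  # words unseen during fitting are ignored
            vector[index] += 1.0
    return vector


documents = [
    ["machine", "learning", "hype"],
    ["machine", "learning", "awesome"],
]
vocabulary = fit_vocabulary(documents)
vectors = [to_count_vector(doc, vocabulary) for doc in documents]
```

Here `fit_vocabulary` plays the role of the estimator's `fit`, and `to_count_vector` with a fixed `vocabulary` plays the role of the resulting model's `transform`.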


In [4]:
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.feature import StopWordsRemover
from pyspark.ml.feature import Tokenizer

# Most approaches cannot handle full-text. As a result, a word, or a piece of text needs to be described by a set
# of features. In this example we will be using count vectors to describe a comment. However, one could also use
# more advanced features such as word (or paragraph) embeddings (see Word2Vec by Tomas Mikolov for further details).

# Clean the dataset if we rerun this part of the notebook.
print("Original dataframe:")
dataset = dataset.select([dataset.id, dataset.text, dataset.label])
dataset.show()

tokenizer = Tokenizer(inputCol="text", outputCol="words")
stopWordRemover = StopWordsRemover(inputCol="words", outputCol="filtered_words")
cv = CountVectorizer(inputCol="filtered_words", outputCol="features")

# Show how the dataset evolves after every transformer has been applied.
dataset = tokenizer.transform(dataset)
print("After applying tokenizer:")
dataset.show()
dataset = stopWordRemover.transform(dataset)
print("After applying stop-word remover:")
dataset.show()
cvModel = cv.fit(dataset)
print("After applying the count-vectorizer:")
dataset = cvModel.transform(dataset)
dataset.show()

Original dataframe:
+---+--------------------+-----+
| id|                text|label|
+---+--------------------+-----+
|  0|robots will take ...|  0.0|
|  1|AI helps humanity...|  1.0|
|  2|unsupervised lear...|  1.0|
|  3|i think unsupervi...|  0.0|
|  4|machine learning ...|  0.0|
|  5|machine learning ...|  1.0|
+---+--------------------+-----+

After applying tokenizer:
+---+--------------------+-----+--------------------+
| id|                text|label|               words|
+---+--------------------+-----+--------------------+
|  0|robots will take ...|  0.0|[robots, will, ta...|
|  1|AI helps humanity...|  1.0|[ai, helps, human...|
|  2|unsupervised lear...|  1.0|[unsupervised, le...|
|  3|i think unsupervi...|  0.0|[i, think, unsupe...|
|  4|machine learning ...|  0.0|[machine, learnin...|
|  5|machine learning ...|  1.0|[machine, learnin...|
+---+--------------------+-----+--------------------+

After applying stop-word remover:
+---+--------------------+-----+----------------

In [5]:
from pyspark.ml.classification import NaiveBayes

nb = NaiveBayes(smoothing=1.0, modelType="multinomial")
mlModel = nb.fit(dataset)
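
To give some intuition for the `smoothing=1.0` and `modelType="multinomial"` parameters, the sketch below implements textbook multinomial naive Bayes with additive (Laplace) smoothing in plain Python. The function names are hypothetical, and whether the formulas match Spark's internals in every detail is an assumption.

```python
import math


def fit_multinomial_nb(vectors, labels, smoothing=1.0):
    """Estimate smoothed log-priors and per-feature log-likelihoods."""
    classes = sorted(set(labels))
    num_features = len(vectors[0])
    log_prior, log_likelihood = {}, {}
    for c in classes:
        rows = [v for v, y in zip(vectors, labels) if y == c]
        log_prior[c] = math.log(
            (len(rows) + smoothing) / (len(vectors) + smoothing * len(classes)))
        # Total count of every feature over all rows of this class.
        totals = [sum(column) for column in zip(*rows)]
        denominator = sum(totals) + smoothing * num_features
        # Smoothing keeps unseen words from getting probability zero.
        log_likelihood[c] = [math.log((t + smoothing) / denominator) for t in totals]
    return log_prior, log_likelihood


def predict(vector, log_prior, log_likelihood):
    """Return the class with the highest log-posterior score."""
    scores = {
        c: log_prior[c] + sum(x * w for x, w in zip(vector, log_likelihood[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)


train_vectors = [[2.0, 0.0], [0.0, 2.0]]
train_labels = [0.0, 1.0]
log_prior, log_likelihood = fit_multinomial_nb(train_vectors, train_labels)
first = predict([1.0, 0.0], log_prior, log_likelihood)
second = predict([0.0, 3.0], log_prior, log_likelihood)
```

With `smoothing=1.0`, a word that never occurs in a class still receives a small non-zero probability, so a single unseen word cannot rule out a class on its own.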

### Pipelines

One could compact the above lines into a *pipeline*. This will encapsulate the above workflow so other data can be processed more easily without having to write the same code twice.
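
Conceptually, a pipeline fits its stages in order on the training data and remembers the fitted result, so that exactly the same steps can later be replayed on new data. The plain-Python sketch below illustrates that idea; all class names are hypothetical and it is not how Spark implements `Pipeline`.

```python
class Lowercase:
    """Hypothetical transformer: needs no fitting."""

    def transform(self, docs):
        return [[word.lower() for word in doc] for doc in docs]


class VocabularyIndexerModel:
    """Hypothetical fitted model: maps known words to their indices."""

    def __init__(self, mapping):
        self.mapping = mapping

    def transform(self, docs):
        return [[self.mapping[w] for w in doc if w in self.mapping] for doc in docs]


class VocabularyIndexer:
    """Hypothetical estimator: learns a word -> index mapping."""

    def fit(self, docs):
        words = sorted({word for doc in docs for word in doc})
        return VocabularyIndexerModel({w: i for i, w in enumerate(words)})


class MiniPipelineModel:
    def __init__(self, stages):
        self.stages = stages

    def transform(self, data):
        for stage in self.stages:
            data = stage.transform(data)
        return data


class MiniPipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, data):
        fitted = []
        for stage in self.stages:
            # Estimators are fitted first; plain transformers are used as-is.
            model = stage.fit(data) if hasattr(stage, "fit") else stage
            data = model.transform(data)
            fitted.append(model)
        return MiniPipelineModel(fitted)


train = [["Machine", "Learning"], ["machine", "hype"]]
test = [["Skynet", "LEARNING"]]
pipeline_model = MiniPipeline([Lowercase(), VocabularyIndexer()]).fit(train)
train_out = pipeline_model.transform(train)
test_out = pipeline_model.transform(test)
```

Note that the test data is only passed to `transform` on the already fitted model. Fitting again on the test data would produce a different mapping, and the indices would no longer mean the same thing.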

In [6]:
from pyspark.ml import Pipeline

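# Print the old dataset to show that both results are equal.
dataset.show()
# Clean the dataset if we rerun this part of the notebook.
dataset = dataset.select([dataset.id, dataset.text, dataset.label])
# The pipeline only needs the estimator `cv`: fitting the pipeline produces the
# corresponding model, so the already fitted `cvModel` must not be added as well
# (it would try to write the "features" column a second time).
preprocessingPipeline = Pipeline(stages=[tokenizer, stopWordRemover, cv])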
preprocessedModel = preprocessingPipeline.fit(dataset)
dataset = preprocessedModel.transform(dataset)
# Show the output of the preprocessing pipeline.
dataset.show()

+---+--------------------+-----+--------------------+--------------------+--------------------+
| id|                text|label|               words|      filtered_words|            features|
+---+--------------------+-----+--------------------+--------------------+--------------------+
|  0|robots will take ...|  0.0|[robots, will, ta...|[robots, destroy,...|(22,[7,13,18,19,2...|
|  1|AI helps humanity...|  1.0|[ai, helps, human...|[ai, helps, human...|(22,[6,8,10,15,20...|
|  2|unsupervised lear...|  1.0|[unsupervised, le...|[unsupervised, le...|(22,[0,1,2,4,9,11...|
|  3|i think unsupervi...|  0.0|[i, think, unsupe...|[think, unsupervi...|(22,[0,1,12,14],[...|
|  4|machine learning ...|  0.0|[machine, learnin...|[machine, learnin...|(22,[0,3,5,16],[1...|
|  5|machine learning ...|  1.0|[machine, learnin...|[machine, learnin...|(22,[0,2,3],[1.0,...|
+---+--------------------+-----+--------------------+--------------------+--------------------+

+---+--------------------+-----+-------

In [7]:
# Magically fetch new comments (with a label) in the same format as the training dataset.
testset = sqlContext.createDataFrame([
    (0, "skynet is here", 0.0),
    (1, "unsupervised learning is very cool", 1.0)], ["id", "text", "label"])

# Display the testset before any preprocessing steps.
print("Original DataFrame:")
testset.show()
# Feed the test set to the preprocessing pipeline that was fitted on the training data.
# Refitting on the test set would build a different vocabulary than the one the
# classifier was trained on.
testset = preprocessedModel.transform(testset)

result = mlModel.transform(testset)

# Show the label and prediction after applying the machine learning model.
print("DataFrame after applying the model and preprocessing pipeline:")
result.select([result.prediction, result.label, result.probability]).show()

Original DataFrame:
+---+--------------------+-----+
| id|                text|label|
+---+--------------------+-----+
|  0|      skynet is here|  0.0|
|  1|unsupervised lear...|  1.0|
+---+--------------------+-----+

DataFrame after applying the model and preprocessing pipeline:
+----------+-----+--------------------+
|prediction|label|         probability|
+----------+-----+--------------------+
|       0.0|  0.0|[0.67889908256880...|
|       1.0|  1.0|[0.37134813750430...|
+----------+-----+--------------------+

