<img src="uva_seal.png">  

## Machine Learning with MLlib
## *Introduction and Feature Extraction*

### University of Virginia
### DS 5559: Big Data Analytics
### Last Updated: June 11, 2020

---  

### SOURCES 

1. Learning Spark
2. Spark Documentation  
	https://spark.apache.org/docs/latest/mllib-data-types.html  
	http://spark.apache.org/docs/1.2.0/mllib-feature-extraction.html

### OBJECTIVES
1. Introduction to the machine learning library
2. Introduction to MLlib data types
3. Discuss Feature Extraction tools in MLlib


### CONCEPTS AND FUNCTIONS
- pipeline  
- supervised and unsupervised learning  
- learning tasks: classification, regression, clustering, dimensionality reduction  
- training set, testing set  
- feature extraction  

- MLlib data types:  
  - LabeledPoint  
  - sparse vector, dense vector  
  - sparse matrix, dense matrix  
  - Rating  

- Feature Extraction  
- TF-IDF  
- Word2Vec  
- Cosine Similarity  


---  

**Machine Learning in Spark**  
Spark actually has two ML libraries: `MLlib` and `ML`. `MLlib` is the older library and is based on RDDs; the newer `ML` package is based on DataFrames.  `ML` tends to feel more natural, since users will generally already have their data in DataFrames when building features.  This notebook covers `MLlib`; `ML` will be covered later in the course.


**MLlib**

This is the original machine learning library from the "olden" days.
It works on RDDs, because in the olden days, DataFrames did not yet exist, and people needed to do machine learning.

MLlib contains only algorithms that can be parallelized, since those run well on clusters.  This does limit the algorithm choices.

Spark provides a pipeline API for building ML pipelines, similar to scikit-learn in Python; it lives in the newer DataFrame-based `ML` package.  It is HIGHLY recommended that you use pipelines.  They encapsulate the process, reducing the chance of errors and making the scoring process simple.  More on pipelines later.

Next, we jump right in, building a classifier and making predictions. You might not yet know about objects like `LabeledPoint`, but this should be fun and motivating!

### Build LogReg Classifier to Predict Spam vs Not

In [9]:
# IMPORT MODULES
import os
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.sql import SparkSession

In [10]:
spark = SparkSession.builder \
        .master("local") \
        .appName("mllib_classifier") \
        .getOrCreate()

In [11]:
spark

In [12]:
sc = spark.sparkContext

In [5]:
# read in spam and ham (not spam) data
spam = sc.textFile("spam.txt")
ham = sc.textFile("ham.txt")

In [6]:
spam.collect()[0]

'Dear sir, I am a Prince in a far kingdom you have not heard of.  I want to send you money via wire transfer so please ...'

In [7]:
# note you wouldn't collect to driver if RDD was massive
spam.collect()

['Dear sir, I am a Prince in a far kingdom you have not heard of.  I want to send you money via wire transfer so please ...',
 'Get Viagra real cheap!  Send money right away to ...',
 'Oh my gosh you can be really strong too with these drugs found in the rainforest. Get them cheap right now ...',
 'YOUR COMPUTER HAS BEEN INFECTED!  YOU MUST RESET YOUR PASSWORD.  Reply to this email with your password and SSN ...',
 'THIS IS NOT A SCAM!  Send money and get access to awesome stuff really cheap and never have to ...']

In [8]:
ham.collect()

['Dear Spark Learner, Thanks so much for attending the Spark Summit 2014!  Check out videos of talks from the summit at ...',
 'Hi Mom, Apologies for being late about emailing and forgetting to send you the package.  I hope you and bro have been ...',
 'Wow, hey Fred, just heard about the Spark petabyte sort.  I think we need to take time to try it out immediately ...',
 'Hi Spark user list, This is my first question to this list, so thanks in advance for your help!  I tried running ...',
 "Thanks Tom for your email.  I need to refer you to Alice for this one.  I haven't yet figured out that part either ...",
 'Good job yesterday!  I was attending your talk, and really enjoyed it.  I want to try out GraphX ...',
 'Summit demo got whoops from audience!  Had to let you know. --Joe']

In [9]:
# set up a Term Frequency object using the hashing trick
tf = HashingTF(numFeatures = 10000)

In [10]:
# tokenize the datasets, parsing on spaces
spamFeatures = spam.map(lambda email: tf.transform(email.split(" ")))
normalFeatures = ham.map(lambda email: tf.transform(email.split(" ")))
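
To see what the hashing trick is doing, here is a minimal pure-Python sketch. It is not MLlib's implementation: `hashing_tf` is a hypothetical helper, and `zlib.crc32` is a stand-in hash chosen only for reproducibility, so the indices will not match what `HashingTF` produces. It also shows that splitting on a single space leaves punctuation attached and turns a double space into an empty token.

```python
import zlib
from collections import Counter

def hashing_tf(tokens, num_features):
    """Map tokens to a sparse {index: count} dict using the hashing trick."""
    counts = Counter()
    for token in tokens:
        # stand-in hash (assumption); MLlib uses its own hash function
        index = zlib.crc32(token.encode("utf-8")) % num_features
        counts[index] += 1.0
    return dict(counts)

email = "Get Viagra real cheap!  Send money right away to ..."
tokens = email.split(" ")  # the double space yields an empty-string token

features = hashing_tf(tokens, num_features=10000)
tiny = hashing_tf(tokens, num_features=4)  # far too few buckets: collisions

print(tokens)
print(sorted(features.items()))
print(sorted(tiny.items()))
```

With only 4 buckets, the 11 distinct tokens must share indices, so different words become indistinguishable to the model. That is the collision risk that a larger `numFeatures` reduces.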

In [11]:
spam.collect()[0]

'Dear sir, I am a Prince in a far kingdom you have not heard of.  I want to send you money via wire transfer so please ...'

In [12]:
# build LabeledPoint datasets (1=spam, 0=ham)
# LabeledPoints package (label, features) for each record
positiveExamples = spamFeatures.map(lambda features: LabeledPoint(1, features))
negativeExamples = normalFeatures.map(lambda features: LabeledPoint(0, features))

In [13]:
pos = positiveExamples.collect()

In [18]:
pos[4]

LabeledPoint(1.0, (10000,[0,365,1395,1451,1458,1819,2701,3834,4323,4671,5336,5469,5878,6300,6384,6910,7296,9101,9604],[1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0]))

In [19]:
neg = negativeExamples.collect()

In [20]:
neg[0]

LabeledPoint(0.0, (10000,[0,1162,2403,2809,3080,3317,4161,4770,5423,5651,5743,5831,6006,6827,6971,7069,7872,9150,9370,9521,9604],[1.0,2.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0]))

In [21]:
# build training set; this stacks positive and negative records
trainData = positiveExamples.union(negativeExamples)

# cache since model training is iterative; otherwise the RDD would be recomputed on every pass
trainData.cache()

UnionRDD[6] at union at NativeMethodAccessorImpl.java:0

In [22]:
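# train LogReg model for 1000 iterations (other params left at their defaults)
# note: LogisticRegressionWithSGD has been deprecated since Spark 2.0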
model = LogisticRegressionWithSGD.train(trainData, iterations=1000)

In [23]:
# push "not spam" example through classifier. this is label=0
Test = tf.transform("I love learning Spark programming".split(" "))

In [24]:
# Prediction
print("Prediction for example: {}".format(model.predict(Test)))
if model.predict(Test)==0:
    print("CORRECT!")
else:
    print("INCORRECT!")

Prediction for example: 0
CORRECT!


In [29]:
# push "spam" example through classifier. this is label=1
Test2 = tf.transform("Get Viagra real cheap!".split(" "))

# Prediction
print("Prediction for example: {}".format(model.predict(Test2)))
if model.predict(Test2)==1:
    print("CORRECT!")
else:
    print("INCORRECT!")

Prediction for example: 0
INCORRECT!
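
For intuition about what `model.predict` does with a feature vector, here is a rough pure-Python sketch of a logistic regression prediction on a sparse vector. The weights and feature vectors are made up for illustration (they are not the trained model's weights), and `predict` is a hypothetical helper; the 0.5 threshold mirrors the usual default.

```python
import math

def predict(weights, intercept, sparse_features, threshold=0.5):
    """Return (label, probability) for a sparse {index: value} feature vector."""
    margin = intercept + sum(
        weights.get(index, 0.0) * value
        for index, value in sparse_features.items()
    )
    probability = 1.0 / (1.0 + math.exp(-margin))
    label = 1 if probability > threshold else 0
    return label, probability

# hypothetical weights keyed by hashed feature index
weights = {365: 1.2, 9101: 0.8, 1162: -1.5}

spam_like = {365: 2.0, 9101: 1.0}  # hits positively weighted buckets
ham_like = {1162: 2.0, 0: 1.0}     # hits a negatively weighted bucket

print(predict(weights, 0.0, spam_like))
print(predict(weights, 0.0, ham_like))
print(predict(weights, 0.0, {}))  # no evidence: probability 0.5, label 0
```

A prediction only moves away from 0.5 when the example shares buckets with weighted features, which is one reason a tiny training set can misclassify short messages.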


### Definitions

Next we define the `MLlib` objects.

**LabeledPoint**  
Stores a feature vector together with its label.  
**Rating**  
Rating of a product by a user. Used in recommendation, for instance.  
**Vector**  
Handles both dense and sparse vectors. For sparse vectors, only the nonzero values and their indices are stored, along with the vector length.  
Sparse storage saves memory and runtime when most entries are zero.  
**Matrix**  
A local matrix has integer-typed row and column indices and double-typed values, stored on a single machine.  
MLlib supports dense matrices, whose entry values are stored in a single double array in column-major order, and sparse matrices, whose non-zero entry values are stored in the Compressed Sparse Column (CSC) format in column-major order.  
**Distributed matrix**  
A distributed matrix has long-typed row and column indices and double-typed values, stored distributively in one or more RDDs.  
**RowMatrix**  
A RowMatrix is a row-oriented distributed matrix without meaningful row indices, backed by an RDD of its rows.  
**CoordinateMatrix**  
A CoordinateMatrix is a distributed matrix backed by an RDD of its entries.  
It should be used only when both dimensions of the matrix are huge and the matrix is very sparse.
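
The CSC format mentioned above can be hard to picture, so here is a small pure-Python sketch that builds the three CSC arrays from a dense matrix. `to_csc` is a hypothetical helper written for illustration, not an MLlib function.

```python
def to_csc(rows):
    """Convert a dense matrix (list of rows) into CSC arrays.

    Returns (col_ptrs, row_indices, values), where column j's nonzeros
    live in values[col_ptrs[j]:col_ptrs[j + 1]].
    """
    num_rows, num_cols = len(rows), len(rows[0])
    col_ptrs, row_indices, values = [0], [], []
    for j in range(num_cols):          # walk column by column
        for i in range(num_rows):
            if rows[i][j] != 0.0:
                row_indices.append(i)
                values.append(rows[i][j])
        col_ptrs.append(len(values))   # where the next column starts
    return col_ptrs, row_indices, values

dense = [[9.0, 0.0],
         [0.0, 8.0],
         [0.0, 6.0]]

col_ptrs, row_indices, values = to_csc(dense)
print(col_ptrs)     # [0, 1, 3]
print(row_indices)  # [0, 1, 2]
print(values)       # [9.0, 8.0, 6.0]
```

Only the three nonzero values are stored, plus one row index per value and one pointer per column boundary.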

Take a look at this Wikipedia article to learn about row- versus column-major order.  It is super important to know how the data is stored.  Could you imagine what would happen to results if this were mixed up?

https://en.wikipedia.org/wiki/Row-_and_column-major_order
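
Here is a small pure-Python sketch of what goes wrong when the storage order is mixed up. The helper names are hypothetical; the point is only that the same flat array decodes to two different matrices depending on the assumed order.

```python
def to_column_major(rows):
    """Flatten a matrix (list of rows) column by column."""
    return [rows[i][j] for j in range(len(rows[0])) for i in range(len(rows))]

def from_flat(values, num_rows, num_cols, order):
    """Rebuild a matrix from a flat list, assuming the given storage order."""
    if order == "column":
        return [[values[j * num_rows + i] for j in range(num_cols)]
                for i in range(num_rows)]
    return [[values[i * num_cols + j] for j in range(num_cols)]
            for i in range(num_rows)]

matrix = [[1.0, 2.0],
          [3.0, 4.0],
          [5.0, 6.0]]

flat = to_column_major(matrix)
print(flat)  # [1.0, 3.0, 5.0, 2.0, 4.0, 6.0]

right = from_flat(flat, 3, 2, order="column")
wrong = from_flat(flat, 3, 2, order="row")
print(right)  # [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(wrong)  # [[1.0, 3.0], [5.0, 2.0], [4.0, 6.0]]
```

Nothing fails or warns in the second case; the matrix is simply scrambled, and every result computed from it is wrong.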

In [None]:
# Create sparse vector [1.0 0.0 2.0 0.0]
from pyspark.mllib.linalg import Vectors

sv1 = Vectors.sparse(4, {0: 1.0, 2: 2.0})

In [None]:
sv1
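
Sparse vectors print as a `(size, [indices], [values])` triple, as in the `LabeledPoint` output earlier. As a rough illustration of how to read that triple, the pure-Python sketch below expands one into a dense list; `to_dense` is a hypothetical helper, not part of MLlib.

```python
def to_dense(size, indices, values):
    """Expand a sparse (size, indices, values) triple into a plain list."""
    dense = [0.0] * size
    for index, value in zip(indices, values):
        dense[index] = value
    return dense

# same content as sv1: length 4, nonzeros at positions 0 and 2
size, indices, values = 4, [0, 2], [1.0, 2.0]
dense = to_dense(size, indices, values)
print(dense)  # [1.0, 0.0, 2.0, 0.0]

# a 10000-long vector with 19 nonzeros, like pos[4] above
numbers_stored_sparse = 1 + 19 + 19  # size + indices + values
numbers_stored_dense = 10000
print(numbers_stored_sparse, numbers_stored_dense)  # 39 10000
```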

### Feature Extraction

*mllib.feature*  
contains classes for common feature transformations:  
-  Term Frequency-Inverse Document Frequency (TF-IDF)  
Produces feature vectors from text documents

Two classes work together to compute TF-IDF:  

**1. HashingTF**  
	Computes a term frequency vector from a document  
	Can process one document or an RDD of documents  
	Each document needs to be an iterable sequence (a list in Python)  

HashingTF maps each term to a vector index with a hash function, so two different terms can land in the same bucket (a collision).  
To reduce the chance of collision, we can increase the target feature dimension, i.e., the number of buckets of the hash table. The default feature dimension is 1,048,576 (2^20).  

**2. IDF**  
	Computes inverse document frequency  
	Terms that appear in high fraction of the docs are not as valuable  
	IDF will downweight such terms  
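
To make the downweighting concrete, here is a minimal pure-Python sketch of TF-IDF on three toy documents. It uses the smoothed IDF formula given in the Spark documentation, log((m + 1) / (df + 1)), where m is the number of documents and df is the number of documents containing the term. The documents and helper names are made up for illustration, and terms are kept as strings rather than hashed.

```python
import math
from collections import Counter

docs = [
    "spark is fast".split(),
    "spark is fun".split(),
    "spam is cheap".split(),
]

def idf_weights(documents):
    """IDF per term: log((m + 1) / (df + 1))."""
    m = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each term once per document
    return {term: math.log((m + 1) / (df + 1)) for term, df in doc_freq.items()}

def tf_idf(doc, idf):
    """Multiply each term's count in the document by its IDF weight."""
    return {term: count * idf[term] for term, count in Counter(doc).items()}

idf = idf_weights(docs)
vectors = [tf_idf(doc, idf) for doc in docs]

for term in ("is", "spark", "fast"):
    print(term, round(idf[term], 4))
```

The term "is" appears in every document, so its weight is log(4/4) = 0 and it drops out of every feature vector, while a term that appears in only one document, such as "fast", gets the largest weight.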

Here is a good example of Feature Extraction:  
http://spark.apache.org/docs/1.2.0/mllib-feature-extraction.html

**Word2Vec**  
Computes a distributed vector representation of words.  
Similar words are close together in the vector space.  
Useful in many NLP applications:  
named entity recognition, disambiguation, parsing, tagging, and machine translation.  

The algorithm uses a neural network and some interesting concepts like the *hierarchical softmax*.  I encourage you to learn more if you have the time and interest.
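
"Close in the vector space" is usually measured with cosine similarity, the cosine of the angle between two vectors. Here is a minimal pure-Python sketch; the three-dimensional word vectors are invented for illustration (real Word2Vec vectors have many more dimensions and are learned from data).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# hypothetical word vectors
rate = [1.0, 2.0, 0.0]
rates = [2.0, 4.0, 0.0]       # same direction as rate
banana = [0.0, 0.0, 3.0]      # orthogonal to rate
opposite = [-1.0, -2.0, 0.0]  # points the other way

print(cosine_similarity(rate, rates))     # close to 1: very similar
print(cosine_similarity(rate, banana))    # 0.0: unrelated
print(cosine_similarity(rate, opposite))  # close to -1: opposite
```

The value ranges from -1 to 1 and ignores vector length, so larger means more similar. This is a similarity, not a distance.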

### Fit Word2VecModel to some text data

In [30]:
from pyspark.mllib.feature import Word2Vec

inp = sc.textFile("fed_rates_article.txt").map(lambda row: row.split(" "))
topk = 5
print('First {} records:'.format(topk))
first_five = inp.take(topk)
for i in range(topk):
    print(first_five[i])
print("-----------------")
                        
word2vec = Word2Vec() # construct Word2Vec object
model = word2vec.fit(inp) # train Word2Vec on the data

# apply Word2Vec to find synonyms by representing words as vectors
synonyms = model.findSynonyms('rate', 20)

# findSynonyms returns (word, cosine similarity) pairs; larger means more similar
for word, cosine_similarity in synonyms:
    print("{}: {}".format(word, cosine_similarity))

First 5 records:
['Fed', 'expected', 'to', 'leave', 'interest', 'rates', 'alone', 'as', 'Trump', 'pushes', 'for', 'more', 'cuts']
['Donna', 'Borak', 'byline']
['By', 'Donna', 'Borak,', 'CNN', 'Business']
['']
['Trump', 'asks', 'if', 'Federal...']
-----------------
the: 0.2611371576786041
Powell: 0.2211126983165741
economy: 0.20416149497032166
policy: 0.14639779925346375
of: 0.1154128685593605
and: 0.11317558586597443
rates: 0.10572120547294617
that: 0.10568731278181076
to: 0.09924145042896271
is: 0.09905049204826355
US: 0.09054430574178696
have: 0.06850812584161758
has: 0.04858196899294853
a: 0.02839130535721779
Fed: 0.01105659082531929
be: 0.0006084572523832321
said: -0.0003059235750697553
for: -0.025064580142498016
Trump: -0.03947291150689125
if: -0.08069650828838348


In [32]:
type(inp)

pyspark.rdd.PipelinedRDD

**StandardScaler**   

Standardization can improve the convergence rate during the optimization process, and it also prevents features with very large variances from exerting an overly large influence during model training.  

For each feature, `StandardScaler` can:  
1. Scale to unit variance (on by default, `withStd=True`)  
2. Center to mean zero (off by default, `withMean=False`)  

This is useful or even essential for some models.  

`K-means` works in Euclidean space, so all features should be on the same scale.  

Tree models do not need this.

Use this in a *Pipeline* so the statistics learned from the training set can be applied to datasets that are scored later. You would NOT recompute means and standard deviations on the scoring set to standardize it.
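
The fit-once, apply-later idea can be sketched in pure Python as follows. The helper names and data are hypothetical, and the sketch assumes the sample standard deviation (dividing by n - 1); it illustrates the workflow, not MLlib's implementation.

```python
import math

def fit_scaler(rows):
    """Learn per-feature mean and sample standard deviation from training rows."""
    n, num_features = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(num_features)]
    stds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
        for j in range(num_features)
    ]
    return means, stds

def transform(rows, means, stds):
    """Center and scale rows using statistics learned elsewhere."""
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in rows
    ]

train = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
scoring = [[4.0, 250.0]]

means, stds = fit_scaler(train)             # statistics come from train only
train_scaled = transform(train, means, stds)
scoring_scaled = transform(scoring, means, stds)

print(means, stds)     # [2.0, 200.0] [1.0, 100.0]
print(train_scaled)    # [[-1.0, -1.0], [0.0, 0.0], [1.0, 1.0]]
print(scoring_scaled)  # [[2.0, 0.5]]
```

After scaling, the two features are on the same footing even though the second was originally 100 times larger. The scoring row is transformed with the training statistics, so it lands on the same scale the model was trained on.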

### Standard Scaler  
Load a dataset in libsvm format and standardize the features so that the new features have unit variance and/or zero mean.

In [13]:
from pyspark.mllib.util import MLUtils
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.feature import StandardScaler

In [14]:
data = MLUtils.loadLibSVMFile(sc, "sample_libsvm_data.txt")

In [15]:
data.take(1)

[LabeledPoint(0.0, (692,[127,128,129,130,131,154,155,156,157,158,159,181,182,183,184,185,186,187,188,189,207,208,209,210,211,212,213,214,215,216,217,235,236,237,238,239,240,241,242,243,244,245,262,263,264,265,266,267,268,269,270,271,272,273,289,290,291,292,293,294,295,296,297,300,301,302,316,317,318,319,320,321,328,329,330,343,344,345,346,347,348,349,356,357,358,371,372,373,374,384,385,386,399,400,401,412,413,414,426,427,428,429,440,441,442,454,455,456,457,466,467,468,469,470,482,483,484,493,494,495,496,497,510,511,512,520,521,522,523,538,539,540,547,548,549,550,566,567,568,569,570,571,572,573,574,575,576,577,578,594,595,596,597,598,599,600,601,602,603,604,622,623,624,625,626,627,628,629,630,651,652,653,654,655,656,657],[51.0,159.0,253.0,159.0,50.0,48.0,238.0,252.0,252.0,252.0,237.0,54.0,227.0,253.0,252.0,239.0,233.0,252.0,57.0,6.0,10.0,60.0,224.0,252.0,253.0,252.0,202.0,84.0,252.0,253.0,122.0,163.0,252.0,252.0,252.0,253.0,252.0,252.0,96.0,189.0,253.0,167.0,51.0,238.0,253.0,253.0,190.0

In [16]:
type(data)

pyspark.rdd.PipelinedRDD

In [17]:
# extract labels and features; stored as RDDs
label = data.map(lambda x: x.label)
features = data.map(lambda x: x.features)

In [None]:
scaler1 = StandardScaler().fit(features)

In [None]:
# data1 features are scaled to unit variance (defaults: withStd=True, withMean=False)
data1 = label.zip(scaler1.transform(features))

In [None]:
data1.take(2)

**TRY FOR YOURSELF (UNGRADED EXERCISES)**

1) Print the label and features (before scaling) from the first record in *data*.

2) Compute the first 20 synonyms of the word "economy." Then extract and print the cosine similarities.  Do the results make sense?

3) Copy the Ham/Spam classifier code into the cell below.  Then try a different model, leaving the rest of the code unchanged.  Run the code.  Does it get the "not spam" example right?