# Word Embeddings Demo

## Demo 1

Let's import the necessary libraries...

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.ml.feature import Word2Vec

We also need to start a Spark session

In [2]:
spark = SparkSession.builder.appName('word-embeddings-demo').getOrCreate()

#### Data preparation

The Reuters corpus is a collection of news documents. It contains over 10,000 news documents totaling over 1.3 million words.

This Reuters corpus was loaded from Python's `NLTK` package and preprocessed using `NLTK` in the following ways:

  * Removing punctuation
  * Lemmatizing each word

The data was imported into Spark and saved as a single-field, document-level parquet file for this demonstration, which is reloaded below:

In [3]:
reuters_df = spark.read.parquet('data/reuters-train-corpus.parquet')

Some basic information about `reuters_df`:

In [4]:
[reuters_df.count(), len(reuters_df.columns)]

[13328, 1]

In [5]:
reuters_df.printSchema()

root
 |-- value: array (nullable = true)
 |    |-- element: string (containsNull = true)



In [6]:
reuters_df.show(5)

+--------------------+
|               value|
+--------------------+
|[u, k, money, mar...|
|[u, k, money, mar...|
|[south, korea, to...|
|[u, k, money, mar...|
|[u, s, treasury, ...|
+--------------------+
only showing top 5 rows



#### Creating and fitting the model

When using Spark's `Word2Vec` functionality, we need to first instantiate a `Word2Vec` object

In [7]:
reuters_word2vec = Word2Vec(
    vectorSize = 100, windowSize = 5, inputCol = 'value', 
    outputCol = 'embedding', seed = 6231912
)

There are a couple of key parameters here:

  * `vectorSize` - the desired dimension of the output vector
  * `windowSize` - the number of words on each side of a target word that count as its context
  

And there are a few others that we'll leave be for now: `maxSentenceLength`, `maxIter`, `stepSize`, `numPartitions`, etc.
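
To make `windowSize` more concrete, here is a minimal sketch of how a symmetric context window pairs each target word with its neighbors. It is a simplification written for illustration, not Spark's implementation (real word2vec implementations may, for example, randomly shrink the window during training), and the sentence is made up.

```python
def context_pairs(tokens, window_size):
    """Return (target, context) pairs for a symmetric window around each token."""
    pairs = []
    for i, target in enumerate(tokens):
        # Clip the window at the start and end of the sentence.
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


# Made-up, already-lemmatized sentence used only for illustration.
sentence = ["bank", "raise", "interest", "rate", "today"]
for pair in context_pairs(sentence, window_size=1):
    print(pair)
```

A larger window produces more pairs per target word, so each word is trained against a broader context.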

We can then fit a model on the Reuters corpus using `Word2Vec`'s `fit` method

In [8]:
reuters_model = reuters_word2vec.fit(reuters_df)

The resulting `reuters_model` object is the trained model containing the word embeddings

## Demo 2

The `Word2VecModel`'s `findSynonyms(word, num)` method finds the `num` most similar words to `word` using cosine similarity

In [9]:
reuters_model.findSynonyms('spain', 3).show()

+-------+------------------+
|   word|        similarity|
+-------+------------------+
|  italy|0.7977195978164673|
|belgium|0.7509787678718567|
|denmark|0.7170287370681763|
+-------+------------------+



In [10]:
reuters_model.findSynonyms('president', 3).show()

+--------+------------------+
|    word|        similarity|
+--------+------------------+
|chairman|0.6982269883155823|
|    vice|0.6736516952514648|
|   chief|0.6535186767578125|
+--------+------------------+
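
For intuition about what `findSynonyms` computes, the sketch below ranks words by cosine similarity in plain Python. The three-dimensional vectors are made up for illustration and do not come from `reuters_model`; the helper names `cosine_similarity` and `most_similar` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(vectors, word, num):
    """Return the `num` words closest to `word`, excluding `word` itself."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vector))
        for other, vector in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:num]


# Made-up vectors, not taken from any trained model.
toy_vectors = {
    "spain": [0.9, 0.1, 0.0],
    "italy": [0.8, 0.2, 0.1],
    "chairman": [0.0, 0.9, 0.4],
    "president": [0.1, 0.8, 0.5],
}

print(most_similar(toy_vectors, "spain", 2))
```

Cosine similarity depends only on the direction of the vectors, not their length, which is why it is a common choice for comparing embeddings.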



## Demo 3

Word embeddings can also capture analogies through vector arithmetic. We can set up equations like the following and solve for the fourth word:

* $Father - Mother + Grandmother = Grandfather$
* $Lose - Win + Rise = Fall$

The following function estimates the fourth word by: 

  1. Taking the first three words and the model as input
  2. Finding the vectors of the three words according to the model
  3. Computing the fourth vector using the equation previously outlined
  4. Finding the words in the model's vocabulary that are closest to the fourth vector

In [12]:
def getAnalogy(model, word1, word2, word3):
    """Solve word2 - word1 + word3 and return the nearest words, excluding word3."""

    # One row per vocabulary word, with columns `word` and `vector`
    embeddings_df = model.getVectors()

    def lookup(word):
        # Pull the single embedding vector for `word` back to the driver
        return embeddings_df.filter(col('word') == word).select('vector').collect()[0][0]

    word1_vector = lookup(word1)
    word2_vector = lookup(word2)
    word3_vector = lookup(word3)

    # Request one extra result because word3 itself is usually among the nearest words
    return model \
        .findSynonyms(word2_vector - word1_vector + word3_vector, 4) \
        .filter(col('word') != word3)

The analogy below tries to solve the following equation:

$$Lose - Win + Rise = \hspace{2mm}\rule{1.5cm}{0.15mm}$$

In [13]:
getAnalogy(reuters_model, 'win', 'lose', 'rise').show()

+-------+------------------+
|   word|        similarity|
+-------+------------------+
|decline|0.6982008218765259|
|   fall|0.6950353980064392|
|   drop| 0.681812047958374|
+-------+------------------+



While *fall* isn't the top result, it's close and all of the words carry a similar meaning
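
The arithmetic behind this analogy can be checked by hand on a toy example. The sketch below uses made-up two-dimensional vectors, chosen so that one axis loosely encodes direction (up or down) and the other encodes context. The helper `solve_analogy` is hypothetical and, unlike `getAnalogy`, excludes all three input words from the candidates.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(vectors, word1, word2, word3):
    """Return the word whose vector is closest to word2 - word1 + word3."""
    target = [
        b - a + c
        for a, b, c in zip(vectors[word1], vectors[word2], vectors[word3])
    ]
    # Exclude the three input words so they cannot be returned as the answer.
    candidates = {
        word: vector
        for word, vector in vectors.items()
        if word not in (word1, word2, word3)
    }
    return max(candidates, key=lambda word: cosine_similarity(target, candidates[word]))


# Made-up vectors, not taken from any trained model.
toy_vectors = {
    "win": [1.0, 1.0],
    "lose": [-1.0, 1.0],
    "rise": [1.0, -1.0],
    "fall": [-1.0, -1.0],
    "steady": [0.0, -1.0],
}

# lose - win + rise = [-1, -1], which coincides with the toy vector for "fall".
print(solve_analogy(toy_vectors, "win", "lose", "rise"))
```

In a real model the target vector rarely lands exactly on a word, so the nearest neighbors are returned instead, as in the output above.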

## Demo 4

Syntactic relationships are harder to learn and require a large amount of data when creating the word embeddings

We'll load in pre-trained word embeddings developed by Google:

  * These are again based on news documents 
  * The training corpus contains about 100 billion words, and the model provides vectors for 3 million words and phrases
  * Note that these are loaded using the `gensim` package

In [14]:
from gensim.models import KeyedVectors
google_embeddings = KeyedVectors.load_word2vec_format('data/GoogleNews-vectors-negative300.bin', binary = True)

The same analogy framework used to show semantic relationships can be used to show these syntactic relationships

* $Add - Added = Subtract - Subtracted$
* $Run - Ran = Shrink - \hspace{2mm}\rule{1.5cm}{0.15mm}$

A new analogy function is required due to the use of `gensim` rather than Spark.

In [15]:
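def getAnalogyGensim(embeddings, word1, word2, word3):
    """Solve word2 - word1 + word3 and return the nearest words, excluding word3."""

    word1_vector = embeddings.get_vector(word1)
    word2_vector = embeddings.get_vector(word2)
    word3_vector = embeddings.get_vector(word3)

    # The query is a vector rather than a vocabulary word, so use similar_by_vector
    target_vector = word2_vector - word1_vector + word3_vector

    # Request one extra result because word3 itself is usually among the nearest words
    return [
        element for element in embeddings.similar_by_vector(target_vector, topn = 5)
        if element[0] != word3
    ]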

In [16]:
getAnalogyGensim(google_embeddings, 'add', 'added', 'subtract')

[('subtracted', 0.6371101140975952),
 ('subtracting', 0.5363016128540039),
 ('subtracts', 0.5089912414550781),
 ('deducted', 0.4715757668018341)]

In [18]:
getAnalogyGensim(google_embeddings, 'run', 'ran', 'hang')

[('hung', 0.7027653455734253),
 ('hanging', 0.6336132287979126),
 ('ran', 0.6057102680206299),
 ('hangs', 0.5554769039154053)]

#### Plurality

Plurality can also be represented in word embeddings

In [19]:
getAnalogyGensim(google_embeddings, 'cat', 'cats', 'dog')

[('dogs', 0.9074791669845581),
 ('cats', 0.7861816883087158),
 ('pets', 0.7557172775268555),
 ('canines', 0.7553417682647705)]

In [20]:
getAnalogyGensim(google_embeddings, 'duck', 'ducks', 'goose')

[('geese', 0.7856709957122803),
 ('ducks', 0.702297031879425),
 ('Canada_geese', 0.6364935636520386),
 ('mallards', 0.6159207224845886)]

In [22]:
getAnalogyGensim(google_embeddings, 'dog', 'dogs', 'moose')

[('elk', 0.7245393991470337),
 ('caribou', 0.7196517586708069),
 ('deer', 0.7054836750030518),
 ('grizzlies', 0.6522204875946045)]

In [23]:
getAnalogyGensim(google_embeddings, 'dog', 'dogs', 'tableau')

[('tableaux', 0.6940608620643616),
 ('tableaus', 0.6453362703323364),
 ('tableau_vivant', 0.5498777031898499),
 ('tableaux_vivants', 0.5395413637161255)]

## Demo 5

In [24]:
reuters_model.findSynonyms('good', 3).show()

+------------+------------------+
|        word|        similarity|
+------------+------------------+
|  electronic|0.6784631013870239|
|manufactured|0.5808244943618774|
|  electrical|0.4770793318748474|
+------------+------------------+



In [25]:
reuters_model.findSynonyms('right', 3).show()

+---------+------------------+
|     word|        similarity|
+---------+------------------+
|  entitle|0.6891614198684692|
| entitles| 0.643031895160675|
|entitling|0.6147499680519104|
+---------+------------------+



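These nearest neighbors reflect how the words are used in the Reuters corpus. After lemmatization, *good* appears to pick up the sense of manufactured *goods*, and *right* the sense of an entitlement.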

Let's look at these same words with word embeddings trained on the Gutenberg corpus, a collection of novels

In [None]:
# gutenberg_df = spark.read.parquet('data/gutenberg-train-corpus.parquet')

In [None]:
# gutenberg_word2vec = Word2Vec(
#     vectorSize = 300, windowSize = 10, inputCol = 'value', 
#     outputCol = 'embedding', seed = 6231912, maxSentenceLength = 2000, maxIter = 5
# )
# gutenberg_model = gutenberg_word2vec.fit(gutenberg_df)

In [None]:
# gutenberg_model.save("data/gutenberg_model.model")

In [26]:
from pyspark.ml.feature import Word2VecModel
gutenberg_model = Word2VecModel.load("data/gutenberg_model.model")

The resulting synonyms reflect different senses of the words than those found with the Reuters corpus embeddings

In [27]:
gutenberg_model.findSynonyms('good', 3).show()

+-------+------------------+
|   word|        similarity|
+-------+------------------+
|natured|0.5722174644470215|
|   much|0.5657753944396973|
|   well|0.5315241813659668|
+-------+------------------+



In [28]:
gutenberg_model.findSynonyms('right', 3).show()

+---------+-------------------+
|     word|         similarity|
+---------+-------------------+
|     left| 0.3864390254020691|
|admirably| 0.3837079703807831|
|   decide|0.37801557779312134|
+---------+-------------------+
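
One way to see why the two corpora disagree is that word embeddings are learned from the words that appear near each target word. The sketch below simply counts those neighbors in two made-up mini-corpora; the sentences and the helper `neighbor_counts` are hypothetical and only illustrate the idea, not how `Word2Vec` is trained.

```python
from collections import Counter


def neighbor_counts(corpus, word, window_size=2):
    """Count the tokens that appear within `window_size` positions of `word`."""
    counts = Counter()
    for tokens in corpus:
        for i, token in enumerate(tokens):
            if token != word:
                continue
            start = max(0, i - window_size)
            end = min(len(tokens), i + window_size + 1)
            counts.update(tokens[j] for j in range(start, end) if j != i)
    return counts


# Made-up, already-lemmatized sentences in the style of each corpus.
news_like = [
    ["import", "of", "electronic", "good", "rise"],
    ["manufactured", "good", "export", "fall"],
]
novel_like = [
    ["she", "be", "a", "good", "natured", "girl"],
    ["he", "feel", "very", "good", "and", "well"],
]

print(neighbor_counts(news_like, "good").most_common(3))
print(neighbor_counts(novel_like, "good").most_common(3))
```

The same token is surrounded by very different words in each corpus, so a model trained on one will place it near different neighbors than a model trained on the other.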



## Demo 6

**Most word embedding algorithms require a large amount of data to create successful results**

In [29]:
getAnalogy(gutenberg_model, 'father', 'mother', 'grandfather').show()

+-------+------------------+
|   word|        similarity|
+-------+------------------+
|  jerry| 0.668878436088562|
|muskrat|0.6316283345222473|
| anyway|0.6198262572288513|
+-------+------------------+



In [30]:
getAnalogyGensim(google_embeddings, 'father', 'mother', 'grandfather')

[('grandmother', 0.8793414831161499),
 ('aunt', 0.8139014840126038),
 ('mother', 0.8082102537155151),
 ('granddaughter', 0.7867908477783203)]