# Word Vectors

# Topics

- Why Word Vectors exist
- How to use them
- How to train them

# Setup

Run this command from an Anaconda prompt (within the mldds03 environment):

```
(mldds03) conda install gensim cython nltk
```

### gensim: for training word2vec

https://radimrehurek.com/gensim/


### Cython: to speed up training word2vec
http://docs.cython.org/en/latest/src/quickstart/install.html


### NLTK: NLP toolkit
Installation: https://www.nltk.org/install.html

## Why do we need Word Vectors?

To represent word meanings in an efficient way

To express word meaning based on context
 - Context: window of words around this word
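
To make the idea of a context window concrete, here is a minimal sketch in plain Python. The function name `context_window` and the example sentence are made up for illustration; they are not part of NLTK or gensim.

```python
def context_window(tokens, index, window=2):
    """Return the tokens within `window` positions of tokens[index]."""
    start = max(0, index - window)
    left = tokens[start:index]                      # words before the target
    right = tokens[index + 1:index + 1 + window]    # words after the target
    return left + right


tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Context of "sat" (position 2) with a window of 2 words on each side
print(context_window(tokens, 2))
```

A larger `window` pulls in words that are further away, which tends to capture topic rather than just immediate neighbours.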

## Before Word Vectors: Synsets

- Synsets: Lists of synonyms for a word
  - Someone needs to curate the list (create, update, delete)
    - Language-specific
  - Does not adapt as language evolves

## Uses of Synsets
- Synsets are still useful in NLP
 - for tasks where word meaning is relatively static (e.g. dated literature)
 - as a baseline model

# Walkthrough - WordNet

WordNet is a well-known database that contains sets of synonyms of English words.

In this walkthrough, we will query WordNet, first through its web interface and then through NLTK.

1. Open http://wordnetweb.princeton.edu/perl/webwn
2. Enter a word to search for (it has to be a single word)

For example, when searching for the word "machine":

![wordnet](assets/word-vectors/wordnet-search.png)


Besides the browser, you can use NLTK to query WordNet.

Examples:
http://www.nltk.org/howto/wordnet.html

API:
http://www.nltk.org/_modules/nltk/corpus/reader/wordnet.html

In [2]:
import nltk
from nltk.corpus import wordnet as wn

# query synsets for 'machine'
wn.synsets('machine')

[Synset('machine.n.01'),
 Synset('machine.n.02'),
 Synset('machine.n.03'),
 Synset('machine.n.04'),
 Synset('machine.n.05'),
 Synset('car.n.01'),
 Synset('machine.v.01'),
 Synset('machine.v.02')]

A synset is a set of synonyms that share a common meaning.

Synset attributes, accessible via methods with the same name:

- name: The canonical name of this synset, formed using the first lemma
  of this synset. Note that this may be different from the name
  passed to the constructor if that string used a different lemma to
  identify the synset.
- pos: The synset's part of speech, matching one of the module level
  attributes ADJ, ADJ_SAT, ADV, NOUN or VERB.
- lemmas: A list of the Lemma objects for this synset.
- definition: The definition for this synset.
- examples: A list of example strings for this synset.
- offset: The offset in the WordNet dict file of this synset.
- lexname: The name of the lexicographer file containing this synset.

In [22]:
# Loop to print the hypernyms (more general concepts) of each synset
# Each synset is a different semantic meaning of the word 'machine'

for s in wn.synsets('machine'):
    print(s.hypernyms())

[Synset('device.n.01')]
[Synset('person.n.01')]
[Synset('organization.n.01')]
[Synset('mechanical_device.n.01')]
[Synset('organization.n.01')]
[Synset('motor_vehicle.n.01')]
[Synset('shape.v.02')]
[Synset('produce.v.02')]


In [23]:
# check if another word is similar to 'machine', using its 'device' meaning
# (the first synset returned for 'machine')
# larger number means more similarity
machine_synset1 = wn.synsets('machine')[0]

robot_synset1 = wn.synsets('robot')[0]
print('robot vs. machine similarity:', robot_synset1.path_similarity(machine_synset1))

sunset_synset1 = wn.synsets('sunset')[0]
print('sunset vs. machine similarity:', sunset_synset1.path_similarity(machine_synset1))

robot vs. machine similarity: 0.25
sunset vs. machine similarity: 0.058823529411764705
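
To interpret these scores: NLTK's `path_similarity` is 1 / (d + 1), where d is the length of the shortest path between the two synsets in the hypernym hierarchy. The sketch below inverts that relationship for the two values printed above. The helper names are made up for illustration.

```python
def path_similarity_from_distance(distance):
    """Similarity score for two synsets that are `distance` edges apart."""
    return 1 / (distance + 1)


def distance_from_path_similarity(similarity):
    """Recover the shortest-path length from a path similarity score."""
    return round(1 / similarity) - 1


# Scores taken from the output above
print('robot vs. machine path length:', distance_from_path_similarity(0.25))
print('sunset vs. machine path length:',
      distance_from_path_similarity(0.058823529411764705))
```

A shorter path means the two meanings sit closer together in WordNet's hierarchy, so 'robot' is far closer to 'machine' than 'sunset' is.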


# Word Vectors

Instead of static lists of words, word vectors are trained from examples of text.

The examples of text (text corpus) should be "large enough" to capture the possible meanings of the word.

Anyone can train word vectors. Unlike synsets, you don't need to be a linguist or an expert in word meanings. You just need enough examples.

# Pre-trained Word Vectors

Pre-trained word vectors are available, but the downloads are usually large (gigabytes):
- https://code.google.com/archive/p/word2vec/
- https://github.com/Hironsan/awesome-embedding-models

Instead, we will train our own Word Vectors. This is the most flexible option because the vectors adapt to your particular text corpus.

# Workshop: Creating Word2Vec Models

## Download training text corpus

For demonstration purposes, we'll start with Wikipedia articles.

We'll use a python library that wraps the Wikipedia APIs.

https://pypi.org/project/wikipedia/

Run this from an Anaconda prompt (within the mldds03 environment):

```
(mldds03) pip install wikipedia
```

In [None]:
import wikipedia
from wikipedia import search, page

# Get our documents: wikipedia articles
topic = 'singapore'

titles = search(topic)
titles

In [None]:
# retrieve all pages
wikipages = [page(title) for title in titles]

# inspect the first page
wikipages[0].summary

## Process text

- Split into sentences
- Remove special characters
- Convert to lowercase
- Tokenize the text into words
- Optionally remove stop words such as 'a', 'the'
- Stem each word

In [None]:
import re # python regular expressions library
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Download NLTK corpora
# List of available corpora: http://www.nltk.org/book/ch02.html#tab-corpora

# 1. Download the Punkt sentence tokenizer
# https://www.nltk.org/_modules/nltk/tokenize/punkt.html
nltk.download('punkt')

# 2. Download the Stop Words corpus
nltk.download('stopwords')

# 3. Helper function to convert text
def text_to_sentence_wordlists(text, remove_stopwords=True):
    """Cleans and converts text to a list of lists of tokens
    Args:
        text: input text
        remove_stopwords: whether to remove stopwords
    Returns: a tuple of
        A list of lists of tokens that looks like:
           [["cat", "say", "meow"], ["dog", "say", "woof"]]
        The total number of tokens
    """
    # Split into sentences
    # Reference: http://www.nltk.org/api/nltk.tokenize.html
    sentences = nltk.sent_tokenize(text)

    # set of stopwords
    stops = set(stopwords.words('english'))

    stemmer = PorterStemmer()
    
    wordcount = 0
    result = []
    for sentence in sentences:
        # Replace every character that is not a letter or digit with a space
        sentence = re.sub('[^a-zA-Z0-9]', ' ', sentence)

        # Convert to lowercase
        sentence = sentence.lower()
        
        # Tokenize the sentence into words
        tokens = nltk.word_tokenize(sentence)
    
        if remove_stopwords:
            # Remove stop words
            tokens = [token for token in tokens if token not in stops]
    
        # Stem the words
        tokens = [stemmer.stem(t) for t in tokens]
        
        result += [tokens]
        wordcount += len(tokens)
    
    return (result, wordcount)

In [None]:
# Test our helper function to see what it does
text = wikipages[0].summary
print('===== Original text for first article =====')
print(text)

wordlist, count = text_to_sentence_wordlists(text,
                                             remove_stopwords=False)
print('\n===== Stem words [%d words] =====' % count)
print(wordlist)

wordlist, count = text_to_sentence_wordlists(text)
print('\n===== Stem words - stopwords [%d words] =====' % count)
print(wordlist)

### Convert all articles to sentence word lists

Let's now convert all articles on our topic to sentence word lists.

So far we have only examined the summary of each article. Let's see how we can get to the full content.

Looking at the wikipedia library's documentation, we can use `WikipediaPage.content` to get to the plain text content for each page: https://wikipedia.readthedocs.io/en/latest/code.html

In [None]:
wikipages[0].content

In [None]:
print('Converting %d articles to training set...' % len(titles))

training_set = []
training_set_size = 0

for wikipage in wikipages:
    wordlist, count = text_to_sentence_wordlists(wikipage.content)

    training_set_size += count
    training_set += wordlist
    
print('Training set size: %d stem words, %d sentences' \
      % (training_set_size, len(training_set)))

### Question to ponder:

Should we randomize the training set?

Why or why not?

## Train a word2vec model

(Credits: https://www.kaggle.com/c/word2vec-nlp-tutorial#part-2-word-vectors)

With the list of nicely parsed sentences, we're ready to train the model. There are a number of parameter choices that affect the run time and the quality of the final model that is produced.

For details on the algorithms below, see the [word2vec API documentation](https://radimrehurek.com/gensim/models/word2vec.html) as well as the Performance section of the [Google documentation](https://code.google.com/archive/p/word2vec/).

### Domain characteristics

Our training set is:
- Small (under 25k words). Word2vec training sets typically run to hundreds of thousands of words or more.
- Wikipedia articles about a common topic. We expect some words (e.g. singapore) to appear more frequently because of that topic. Whether this is something we need to worry about is unclear.

### Hyperparameters

#### Architecture:
Architecture options are skip-gram (slower, better for infrequent words) or continuous bag of words (CBOW: faster). In gensim, the `sg` parameter selects between them, and the default (`sg=0`) is CBOW.
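
The difference between the two architectures is in how each training example is framed: skip-gram predicts each context word from the centre word, while CBOW predicts the centre word from its combined context. The sketch below only builds those examples for a toy sentence. It does not train anything, and `training_examples` is a made-up helper, not a gensim function.

```python
def training_examples(tokens, window=1):
    """Build (input, target) examples for skip-gram and CBOW."""
    skipgram, cbow = [], []
    for i, center in enumerate(tokens):
        # Words within `window` positions on either side of the centre word
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        # Skip-gram: one example per (centre word -> context word) pair
        skipgram.extend((center, c) for c in context)
        # CBOW: one example per position (whole context -> centre word)
        cbow.append((context, center))
    return skipgram, cbow


skipgram, cbow = training_examples(["cat", "say", "meow"])
print('skip-gram:', skipgram)
print('CBOW:', cbow)
```

Skip-gram produces more examples per sentence, which is one way to see why it is slower to train.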

#### Training algorithm:
This controls which training algorithm to use.

The options are hierarchical softmax (better for infrequent words) or negative sampling (better for frequent words, better with low dimensional vectors). In gensim, negative sampling is the default (`hs=0`, `negative=5`). Start with the default first.

#### Downsampling of frequent words:
This controls the threshold for frequent words to be removed randomly. 

Randomly removing frequent words in large datasets can improve both accuracy and speed.

$$p = \frac{f-t}{f} - \sqrt{\frac{t}{f}}$$

Where:
- $p$: probability that an occurrence of the word is removed
- $f$: frequency of the word in the corpus (as a fraction of all words)
- $t$: the threshold (our downsampling hyperparameter)

A smaller $t$ means more words will be randomly removed.

(Source: https://levyomer.files.wordpress.com/2015/03/improving-distributional-similarity-tacl-2015.pdf)
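
To get a feel for the formula, the sketch below evaluates it for a few word frequencies and thresholds, reading $p$ as the probability that an occurrence of the word is dropped (which is how the cited paper describes it). The function name and the example frequencies are made up, and clamping negative values to 0 is our assumption for words too rare to be downsampled.

```python
import math


def discard_probability(f, t):
    """Probability of dropping one occurrence of a word.

    f: the word's frequency as a fraction of all words in the corpus
    t: the downsampling threshold
    """
    p = (f - t) / f - math.sqrt(t / f)
    # Assumption: a negative value means the word is never dropped
    return max(0.0, p)


for f in (0.05, 0.01, 0.0001):
    for t in (1e-3, 1e-5):
        print('f=%g t=%g p=%.3f' % (f, t, discard_probability(f, t)))
```

For the same word, a smaller threshold gives a larger probability of removal, and words rarer than the threshold are left alone.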

The [Google documentation](https://code.google.com/archive/p/word2vec/) recommends values between 1e-3 and 1e-5. Let's try 1e-3 and then iterate from there, since our training set is small.

#### Word vector dimensionality:
This controls how many features the word vector should have. Higher dimensionality (more features) usually results in better models, but also longer runtimes. Reasonable values can be in the tens to hundreds. We'll try 200.

#### Context / window size:
This defines the window size: the maximum distance between a word and the words treated as its context. For skip-gram usually around 10, for CBOW around 5. More is better, up to a point.

#### Worker threads:
Number of worker threads to run in parallel. This can significantly improve training speed.

The number to choose depends on how many logical CPU cores your computer has (on Windows, Start Menu -> System Information, look for Processors).

Start with a number around 2-4, and then increase it if your computer is more powerful.
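
You can also look up the number of logical cores from Python. This is a small sketch using only the standard library. Capping the result at 4 simply follows the starting range suggested above and is not a gensim requirement.

```python
import os

# os.cpu_count() returns the number of logical CPU cores,
# or None if it cannot be determined
logical_cores = os.cpu_count() or 1

# Start conservatively: at least 1 worker, at most 4
suggested_workers = max(1, min(4, logical_cores))

print('Logical cores: %d, suggested workers: %d'
      % (logical_cores, suggested_workers))
```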

#### Minimum word count:
This helps limit the size of the vocabulary to meaningful words. Any word that does not occur at least this many times across all documents is ignored. 

Reasonable values could be between 10 and 100. Higher values also help limit run time.

For our Wikipedia articles, we'll try a minimum word count of 10.
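
To see the effect of a minimum word count, here is a sketch that counts tokens across a toy training set and keeps only the words that occur often enough. The `vocabulary` helper and the toy sentences are made up for illustration. gensim does its own counting internally.

```python
from collections import Counter


def vocabulary(sentences, min_count):
    """Return the set of words that occur at least min_count times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {word for word, n in counts.items() if n >= min_count}


toy_sentences = [["cat", "say", "meow"],
                 ["dog", "say", "woof"],
                 ["cat", "chase", "dog"]]

print(sorted(vocabulary(toy_sentences, min_count=1)))
print(sorted(vocabulary(toy_sentences, min_count=2)))
```

Raising `min_count` shrinks the vocabulary quickly, so on a small corpus a high value can leave very few words to query.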

In [None]:
from gensim.models import word2vec

word2vec.Word2Vec?

In [None]:
# Credits: https://www.kaggle.com/c/word2vec-nlp-tutorial#part-2-word-vectors

# Set values for various parameters
sg = 1                # Algorithm: 1: skip-gram, 0: CBOW
num_features = 200    # Word vector dimensionality                      
min_word_count = 10   # Minimum word count                        
num_workers = 2       # Number of threads to run in parallel
context = 10          # Context window size                                                                                    
downsampling = 1e-3   # Downsample setting for frequent words

# Import the built-in logging module and configure it so that Word2Vec 
# creates nice output messages
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',\
    level=logging.INFO)

# Initialize and train the model.
# This may take a while if your training set is large (e.g. 500,000 words)
print('Training Word2Vec model...')
%time model = word2vec.Word2Vec(training_set, sg=sg, workers=num_workers, \
            size=num_features, min_count = min_word_count, \
            window = context, sample = downsampling)

# If you don't plan to train the model any further, calling 
# init_sims will make the model much more memory-efficient.
model.init_sims(replace=True)

# It can be helpful to create a meaningful model name and 
# save the model for later use. You can load it later using Word2Vec.load()
model_name = "wikipedia_{}features_{}minwords_{}context_{}downsampling.w2v" \
    .format(num_features, min_word_count, context, str(downsampling))
model.save(model_name)

print('Saved model as %s' % model_name)

## Loading the saved model

Here's how to load a previously saved model.

In [None]:
# Use the file name that was printed when the model was saved
model_name = "wikipedia_200features_10minwords_10context_0.001downsampling.w2v"

model = word2vec.Word2Vec.load(model_name)

## Evaluating the model

The trained model contains a read-only `models.keyedvectors.Word2VecKeyedVectors` with methods for evaluating word relationships.

https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.Word2VecKeyedVectors

Here are some things to try with the word2vec model:

Get the vocabulary of the model:

In [None]:
# number of words in the vocab
len(model.wv.vocab)

In [None]:
model.wv.vocab

Check if a stem word is in the model's vocabulary:

In [None]:
stemmer = PorterStemmer()
stemmer.stem('malaysia') in model.wv.vocab

In [None]:
stemmer.stem('korea') in model.wv.vocab

Find a word that doesn't match in a list of words:

In [None]:
test = 'raffles indian chinese malay'

# you can either use the helper function to convert to stem words
# or call stemmer.stem() directly on each word
wordlist, _ = text_to_sentence_wordlists(test)
print('Input: %s' % wordlist[0])

print("Word that doesn't match: %s"
      % model.wv.doesnt_match(wordlist[0]))
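
One way such an odd-one-out query can be computed is to average the normalized vectors and pick the word whose vector is least similar to that average. The sketch below illustrates the idea with made-up 2-dimensional vectors. It is an assumption about the general approach, not gensim's actual code, and the helper names are hypothetical.

```python
import math


def unit(vector):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def odd_one_out(vectors):
    """vectors: dict mapping each word to its vector."""
    normed = {word: unit(v) for word, v in vectors.items()}
    dims = len(next(iter(normed.values())))
    # Average of the normalized vectors, itself normalized
    mean = unit([sum(v[d] for v in normed.values()) / len(normed)
                 for d in range(dims)])
    # The odd one out has the smallest dot product with the average
    return min(normed,
               key=lambda w: sum(a * b for a, b in zip(normed[w], mean)))


# Made-up vectors: three point in a similar direction, one does not
toy_vectors = {"indian": [1.0, 0.1],
               "chinese": [0.9, 0.2],
               "malay": [1.0, 0.0],
               "raffles": [0.1, 1.0]}

print(odd_one_out(toy_vectors))
```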

Get the top N most similar words:

In [None]:
word = stemmer.stem('singapore')
model.wv.most_similar(word, topn=10)

In [None]:
word = stemmer.stem('changi')
model.wv.most_similar(word, topn=10)

Measure the cosine distance and similarity between two words:

In [None]:
word1 = stemmer.stem('changi')
word2 = stemmer.stem('aircraft')

print('distance: %f' %
      model.wv.distance(word1, word2))

print('similarity: %f' %
      model.wv.similarity(word1, word2))

In [None]:
word1 = stemmer.stem('changi')
word2 = stemmer.stem('british')

print('distance: %f' %
      model.wv.distance(word1, word2))

print('similarity: %f' %
      model.wv.similarity(word1, word2))
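
The two numbers are related: cosine similarity is the cosine of the angle between the two word vectors, and the distance reported by gensim is 1 minus that similarity. Here is a plain-Python sketch of the calculation, using made-up vectors rather than vectors from our model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Distance is defined as 1 - similarity."""
    return 1 - cosine_similarity(a, b)


# Made-up vectors for illustration
v1 = [1.0, 2.0]
v2 = [2.0, 4.0]   # same direction as v1
v3 = [-2.0, 1.0]  # perpendicular to v1

print('similarity: %f, distance: %f'
      % (cosine_similarity(v1, v2), cosine_distance(v1, v2)))
print('similarity: %f, distance: %f'
      % (cosine_similarity(v1, v3), cosine_distance(v1, v3)))
```

Only the direction of the vectors matters here, not their length, which is why `v1` and `v2` count as identical.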

Get the word's representation in vector space as a 1D numpy array:

In [None]:
word = stemmer.stem('malaysia')

raw_vectors = model.wv.word_vec(word, use_norm=True)

raw_vectors.shape

In [None]:
raw_vectors

# Visualizing Word2Vec

Next, we'll plot the Word Vectors to see what the clusters look like:

1. Use t-Distributed Stochastic Neighbor Embedding [TSNE](https://lvdmaaten.github.io/tsne/) to reduce the high-dimensional model into 2D
2. Plot the 2D representation of the word2vec model, with the words in its vocabulary as the labels

Credits: https://stackoverflow.com/questions/43776572/visualise-word2vec-generated-from-gensim

In [None]:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

vocab = list(model.wv.vocab)
X = model.wv[vocab]

# Apply t-SNE
# this can take a while (like 1 minute or more)
tsne = TSNE(n_components=2)
%time X_tsne = tsne.fit_transform(X)

X_tsne

In [None]:
import pandas as pd

# Create a dataframe for the 2 dimensions,
# indexed by the words in the vocab
df = pd.DataFrame(X_tsne, index=vocab, columns=['x', 'y'])
df.head()

In [None]:
# create a zoomable interactive plot
%matplotlib notebook

# Plot the 2D representation of the word2vec model,
# with the words in its vocabulary as the labels

fig, ax = plt.subplots(figsize=(10, 10))

ax.scatter(df['x'], df['y'])

for word, pos in df.iterrows():
    ax.annotate(word, pos)

## Exercise - Create Corpus and Train Word2Vec

In this exercise, we will create our own corpus and use it to train Word2Vec.

### Create Corpus

Create a corpus of text files, organized in a structure like this:

```
corpus/
   text001.txt
   text002.txt
   text003.txt
   ...
```

A sample corpus is included in the `corpus` folder, created with the first 3 chapters of Moby Dick:
https://www.gutenberg.org/files/2701/2701-0.txt

### Import corpus using NLTK

We will use [`nltk.corpus.reader.plaintext`](http://www.nltk.org/howto/corpus.html) to import the corpus.

Credits: https://stackoverflow.com/questions/4951751/creating-a-new-corpus-with-nltk

In [None]:
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

# directory containing the corpus
corpus_dir = 'corpus/'

# By default, PlaintextCorpusReader splits sentences with the Punkt
# sentence tokenizer and splits words with WordPunctTokenizer
newcorpus = PlaintextCorpusReader(corpus_dir,
                                  r'.*\.txt',
                                  encoding='latin1') # or 'utf-8'

In [None]:
# files found by the reader
newcorpus.fileids()

In [None]:
# print the first file in the corpus
f = newcorpus.open('text001.txt')
print(f.read().strip())

In [None]:
# sentences in the corpus:
newcorpus.sents()

In [None]:
# number of sentences
len(newcorpus.sents())

In [None]:
def clean_sentence_lists(sentence_lists, remove_stopwords=True):
    """Cleans and converts the sentence lists
    Args:
        sentence_lists: a list of sentences, each a list of tokens
        remove_stopwords: whether to remove stopwords
    Returns:
        A tuple:
            The cleaned sentence list
            The token count
    """
    # set of English stop words
    stops = set(stopwords.words('english'))

    stemmer = PorterStemmer()
    
    result = []
    wordcount = 0

    for sentence in sentence_lists:
        # Convert to lowercase
        tokens = [t.lower() for t in sentence]
        
        # Remove stop words
        if remove_stopwords:
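            tokens = [t for t in tokens if t not in stops]
        
        # Strip characters that are not letters or digits, then drop
        # tokens that end up empty (e.g. punctuation-only tokens)
        tokens = [re.sub('[^a-zA-Z0-9]', '', t) for t in tokens]
        tokens = [t for t in tokens if t]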
        
        # Stem the words
        tokens = [stemmer.stem(t) for t in tokens]
        
        result += [tokens]
        wordcount += len(tokens)
    
    return (result, wordcount)

Your Tasks:

1. Convert newcorpus.sents() to sentence wordlists, using the `clean_sentence_lists` helper function
2. Train a Word2Vec model, with initial hyperparameter settings (use your best guess)
3. Try some word similarity queries
4. Apply t-SNE to project the high dimensional model to 2-dimensions
5. Plot the 2-dimensional projection of your model
6. Tweak your model by adjusting some hyperparameters

In [None]:
# 1. Convert newcorpus.sents() to sentence wordlists, 
# using the clean_sentence_lists helper function
#
# Your code here












In [None]:
# 2. Train a Word2Vec model, with initial hyperparameter settings
#
# Your code here


























In [None]:
# 3. Try some word similarity queries
# Your code here














In [None]:
# 4. Apply t-SNE to reduce your model to 2 dimensions for plotting
# Your code here













In [None]:
# 5. Plot the completed Word2Vec model
# Your code here













# References

NLTK book: http://www.nltk.org/book/


CS224d 2015 Lecture 2: Word Vectors: https://youtu.be/T8tQZChniMk