# UBC instructor position sample lesson  

### By Varada Kolhatkar [ʋəɾəda kɔːlɦəʈkər]

### Today's plan

- Set the stage (~5 mins)
- Sample class (~40 mins)
- Reflection (~10 mins)
- Vision (~10 mins)
- Q and A (~10 mins)

#### Set the stage (~5 mins)

- I am envisioning this as the second lesson of DSCI 575 (Advanced Machine Learning in the context of Natural Language Processing (NLP) applications). 
- It is the first week of the final block of the MDS program's curriculum. 
- The students have already taken four Machine Learning courses: DSCI 571, DSCI 572, DSCI 573, DSCI 563. 
- They are now ready to apply the concepts they have learned so far to interesting problems. 
- The prerequisites I am assuming are
    - Familiarity with neural networks, which they have done in DSCI 572.

# DSCI 575: Advanced Machine Learning (in the context of NLP applications)

## Lecture 2: Introduction to Word Embeddings: _Toronto_ : _UofT_ :: _Vancouver_ : ?
Instructor: Varada Kolhatkar [ʋəɾəda kɔːlɦəʈkər]


#### Today's plan
- Recap from lecture 1
- Activity 

In [3]:
import pandas as pd
import numpy as np
import os, sys
from IPython.display import display, HTML

import matplotlib.pyplot as plt

from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from scipy.sparse import coo_matrix, csr_matrix

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

import re
from collections import defaultdict
from collections import Counter

import gensim
from gensim.test.utils import common_texts
from gensim.models import Word2Vec, KeyedVectors, FastText
common_texts

plt.rcParams['font.size'] = 16
from preprocessing import MyPreprocessor

In [4]:
# BEGIN STARTER CODE
class CooccurrenceMatrix:
    def __init__(self, corpus, 
                       tokenizer = word_tokenize, 
                       window_size = 3):
        self.corpus = corpus
        self.tokenizer = tokenizer
        self.window_size = window_size
        self.vocab = {}
        self.cooccurrence_matrix = None    
        
    def fit_transform(self):
        """
        Creates a co-occurrence matrix. 
        
        Returns vocabulary (dict) and co-occurrence matrix (csr_matrix)
        """
        data=[]
        row=[]
        col=[]
        for tokens in self.corpus:
            for target_index, token in enumerate(tokens):
                # Get the index of the word in the vocabulary. If the word is not in the vocabulary, 
                # set the index to the size of the vocabulary. 
                i = self.vocab.setdefault(token, len(self.vocab))
                
                # Consider the context words depending upon the context window 
                start = max(0, target_index - self.window_size)
                end = min(len(tokens), target_index + self.window_size + 1)
                
                for context_index in range(start, end):
                    # Do not consider the target word.  
                    if target_index == context_index: 
                        continue                        
                    j = self.vocab.setdefault(tokens[context_index], len(self.vocab))
                    # Set diagonal to 0
                    if i == j:
                        continue
                    data.append(1.0); row.append(i); col.append(j);
        self.cooccurrence_matrix = csr_matrix((data,(row,col)))
        return self.vocab, self.cooccurrence_matrix
            
    def get_word_vector(self, word):
        """
        Given a word returns the word vector associated with it from the co-occurrence matrix. 

        Keyword arguments:
        word -- (str) the word to look up in the vocab.
        """
        # YOUR CODE HERE
        # BEGIN SOLUTION
        if word in self.vocab: 
            return self.cooccurrence_matrix[self.vocab[word]]
        else:
            print('The word not present in the vocab')
        # END SOLUTION

# END STARTER CODE

### Last class: What is Natural Language Processing (NLP)?
#### How often do you search every day? 
<img src="images/Google_search.png" width="900" height="900">

## Last class: What is Natural Language Processing (NLP)?

<img src="images/WhatisNLP.png" width="900" height="900">

## Why is it hard?

- Language is complex and subtle. 
- All the problems related to representation and reasoning in artificial intelligence arise in this domain. 

## Example: Lexical ambiguity

- Language is ambiguous at different levels. 

<img src="images/lexical_ambiguity.png" width="800" height="800">

## Example: Referential ambiguity

 - Language understanding involves common-sense knowledge and real-world reasoning.
 
<img src="files/images/referential_ambiguity.png" width="800" height="800">

### Representing text 

- So far we have been using data that looks like this: 

$X = \begin{bmatrix}1 & 0.8 & 0.3\\ 0 & 0 & 0.4\\ 1 & 0.2 & 0.8\\ \end{bmatrix}$ and $y = \begin{bmatrix}1 \\ 0 \\ 1 \end{bmatrix}$

- But consider data that looks like this instead: 

          
$X = \begin{bmatrix}\text{"@united you're terrible. You don't understand safety",}\\ \text{"@JetBlue safety first !! #lovejetblue"}\\ \text{"@SouthwestAir truly the best in #customerservice!"}\\ \end{bmatrix}$ and $y = \begin{bmatrix}0 \\ 1 \\ 1 \end{bmatrix}$

- The ML algorithms we have seen so far expect well-defined, fixed-length input and output, whereas text data is usually messy and varies in length. 
- How can we effectively represent these texts using features? 
    - Ideally we want to capture the "meaning" of text in our representation. 

#### Today's promise

- A method that learns a powerful representation of text data.  

#### Learning outcomes

From this class, you will be able to 

- Explain the general idea of the vector space models and co-occurrence matrices.
- Explain the difference between sparse and dense word vectors.
- Explain the general idea of the continuous skip-gram model.
- Train your own word vectors with Gensim and work with dense word vectors. 
- Load pre-trained word embeddings.

### Word meaning: _cup_ or _bowl_?

- Before moving to meaning of a sentence or a paragraph, let's start with word meaning. 
- Philosophical debate for centuries 
- Example: Where does the category cup end?
- Labov, 1975

<center>
<img src="files/images/Labov_cups.png" width="600" height="600">
</center>

#### [Are hockey gloves gloves or "articles of plastics"?](https://www.scc-csc.ca/case-dossier/info/sum-som-eng.aspx?cas=36258)

<blockquote>
Canada (A.G.) v. Igloo Vikski Inc. was a tariff code case that made its way to the SCC (Supreme Court of Canada). The case disputed the definition of hockey gloves as either gloves or as "articles of plastics."
</blockquote>



### Word meaning: NLP view
- Modeling word meaning that allows us to 
    * draw useful inferences to solve meaning-related problems 
    * find relationships between words, e.g., which words are similar, which ones have positive or negative connotations

### Pair-share (~5 mins) 

- Suppose you are building a Question Answering system and you are given the following question and three candidate answers. 
- Discuss the following questions with your neighbour(s). 
    - What kind of relationship between words do we need to capture in order to arrive at the correct answer?  
    - Brainstorm ways to represent words that will allow you to capture this relationship.   
<blockquote>       
<p style="font-size:30px"><b>Question:</b> How <b>tall</b> is Machu Picchu?</p>
    <p style="font-size:30px"><b>Candidate 1:</b> Machu Picchu is 13.164 degrees south of the equator.</p>    
<p style="font-size:30px"><b>Candidate 2:</b> The official height of Machu Picchu is 2,430 m.</p>
<p style="font-size:30px"><b>Candidate 3:</b> Machu Picchu is 80 kilometres (50 miles) northwest of Cusco.</p>    
</blockquote> 
    

In [59]:
corpus = ["How tall is Machu Picchu?",
          "Machu Picchu is 13.164 degrees south of the equator.", 
          "The official height of Machu Picchu is 2,430 m.",
          "Machu Picchu is 80 kilometres (50 miles) northwest of Cusco."
         ]
pp = MyPreprocessor()
pp_corpus = pp.preprocess_corpus(corpus)
pp_corpus

[['tall', 'machu', 'picchu'],
 ['machu', 'picchu', '13.164', 'degrees', 'south', 'equator'],
 ['official', 'height', 'machu', 'picchu', '2,430'],
 ['machu', 'picchu', '80', 'kilometres', '50', 'miles', 'northwest', 'cusco']]

In [21]:
vec = CountVectorizer(tokenizer = nltk.word_tokenize)
X = vec.fit_transform(corpus)

In [22]:
### Let's check whether bag of words representation is better
bow_df = pd.DataFrame(X.toarray(), columns=sorted(vec.vocabulary_), index=corpus)
bow_df

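| | ( | ) | . | 13.164 | 2,430 | 50 | 80 | ? | cusco | degrees | ... | m | machu | miles | northwest | of | official | picchu | south | tall | the |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| How tall is Machu Picchu? | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| Machu Picchu is 13.164 degrees south of the equator. | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 |
| The official height of Machu Picchu is 2,430 m. | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 |
| Machu Picchu is 80 kilometres (50 miles) northwest of Cusco. | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | ... | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 |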


In [47]:
### Let's examine the similarity between these three documents
for i in range(1,4,1):
    print('Question: %s \nCandidate: %s \nSimilarity = %0.3f\n\n' \
                          %(bow_df.index[0],bow_df.index[i], \
                            bow_df.iloc[0].dot(bow_df.iloc[i])))

Question: How tall is Machu Picchu? 
Candidate: Machu Picchu is 13.164 degrees south of the equator. 
Similarity = 3.000


Question: How tall is Machu Picchu? 
Candidate: The official height of Machu Picchu is 2,430 m. 
Similarity = 3.000


Question: How tall is Machu Picchu? 
Candidate: Machu Picchu is 80 kilometres (50 miles) northwest of Cusco. 
Similarity = 3.000




### Representation 1: Representing words as atomic symbols

- Build the **vocabulary** containing all unique words from the corpus. 
- Represent each word as **one-hot** encoding. 
- Each one-hot vector has length $V$ (the vocabulary size), with a 1 at the word's index and 0 at every other index.

In [22]:
def get_onehot_encoding(word, vocab):
    onehot = np.zeros(len(vocab), dtype='float64')    
    onehot[vocab[word]] = 1
    print('one-hot encoding of the word "%s" is: %s' % (word, str(onehot)))
    return onehot

In [23]:
# Note: In the NLP community a text data set is referred 
# to as a **corpus** (plural: corpora).

vocab = vec.vocabulary_
word1 = 'tall'
onehot_word1 = get_onehot_encoding(word1, vocab)

word2 = 'height'
onehot_word2 = get_onehot_encoding(word2, vocab)

print("The dot product between %s and %s is: %d" % 
      (word1, word2, onehot_word1.dot(onehot_word2)))

one-hot encoding of the word "tall" is: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.
 0.]
one-hot encoding of the word "height" is: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0.]
The dot product between tall and height is: 0


### Problem with one-hot encoding

-  The problem with this representation is that there is no inherent notion of relationship between words.

<center>
$\vec{\text{height}} \cdot \vec{\text{tall}} = 0$ 
</center>

### Representation 2: Term-term co-occurrence matrix

### Distributional hypothesis

<blockquote> 
    <p>You shall know a word by the company it keeps.</p>
    <footer>Firth, 1957</footer>        
</blockquote>

<blockquote> 
If A and B have almost identical environments we say that they are synonyms.
<footer>Harris, 1954</footer>    
</blockquote>    

Example: 

- Her **child** loves to play in the playground. 
- Her **kid** loves to play in the playground. 
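
The example above can be checked mechanically. The following is a minimal sketch, assuming a naive whitespace tokenizer and using a hypothetical helper `context_sets` (not part of the lecture code), that collects the words each target word occurs with and compares the resulting sets.

```python
from collections import defaultdict

def context_sets(sentences, target_words):
    """For each target word, collect the set of other words in its sentences.

    Hypothetical helper for illustration; tokenization is a naive
    lowercase + whitespace split.
    """
    contexts = defaultdict(set)
    for sentence in sentences:
        tokens = sentence.lower().rstrip(".").split()
        for target in target_words:
            if target in tokens:
                contexts[target].update(t for t in tokens if t != target)
    return contexts

sentences = [
    "Her child loves to play in the playground.",
    "Her kid loves to play in the playground.",
]
contexts = context_sets(sentences, ["child", "kid"])

# "child" and "kid" keep exactly the same company in these two sentences.
print(contexts["child"] == contexts["kid"])
```

In a real corpus the two sets would only overlap heavily rather than match exactly, which is why the representations below use counts and similarity scores instead of set equality.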



### Vector space model

- A standard way to represent meaning in NLP
- Model the meaning of a word by placing it into a vector space.  
- Distances among words in the vector space indicate the relationship between them. 
- Called an "embedding" because it's embedded into a high-dimensional space

<center>
<img src="files/images/t-SNE_word_embeddings.png" width="700" height="700">
    (Attribution: Jurafsky and Martin 3rd edition)
</center>

### Visualizing word vectors and similarity 

<center>
<img src="files/images/word_vectors_and_angles.png" width="600" height="600">
(Credit: Jurafsky and Martin 3rd edition)
</center>

- $similarity_{cosine}(\vec{w_1},\vec{w_2}) = \frac{\vec{w_1} \cdot \vec{w_2}}{\left\lVert \vec{w_1}\right\rVert_2 \left\lVert \vec{w_2}\right\rVert_2}$ 

- $similarity_{cosine}(\vec{\text{digital}},\vec{\text{information}}) = \frac{0 \times 1 + 1 \times 6}{\sqrt{1} \sqrt{1 + 36}} \approx 0.99$

- $similarity_{cosine}(\vec{\text{apricot}},\vec{\text{information}}) = \frac{2 \times 1 + 0 \times 6}{\sqrt{4}\sqrt{1 + 36}} \approx 0.16$

- Often just the dot product is also used as a measure of similarity. 
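
As a quick check of the two worked examples, here is a small NumPy sketch of cosine similarity. The two-dimensional vectors are read off the numerators and denominators of the formulas above; the function name `cosine_similarity` is an illustrative choice.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: (u . v) / (||u|| * ||v||)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Vectors taken from the worked examples above.
digital = [0, 1]
information = [1, 6]
apricot = [2, 0]

print("digital vs information: %0.3f" % cosine_similarity(digital, information))
print("apricot vs information: %0.3f" % cosine_similarity(apricot, information))
```

Unlike the raw dot product, cosine similarity is insensitive to vector length, so frequent words with large counts do not automatically look similar to everything.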

In [64]:
cm = CooccurrenceMatrix(pp_corpus)
vocab, comat = cm.fit_transform()
words = [key for key, value in sorted(vocab.items(), key = lambda item: (item[1],item[0]))]
df = pd.DataFrame(comat.todense(), 
                  columns = words, 
                  index = words,
                  dtype = np.int8
                 )
df.head()

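| | tall | machu | picchu | 13.164 | degrees | south | equator | official | height | 2,430 | 80 | kilometres | 50 | miles | northwest | cusco |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| tall | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| machu | 1 | 0 | 4 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| picchu | 1 | 4 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| 13.164 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| degrees | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |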


In [67]:
vec1 = cm.get_word_vector('tall')
vec2 = cm.get_word_vector('height')

  (0, 1)	1.0
  (0, 2)	1.0


### Sparse vs. dense word vectors

- Term-term co-occurrence matrices are long and sparse. 
    - length |V|= 20,000 to 50,000
    - most elements are zero
- OK because there are efficient ways to deal with sparse matrices.


### Alternative 
- Learn short (~100 to 1000 dimensions) and dense vectors. 
- Short vectors may be easier to train with ML models (fewer weights to train).
- They may generalize better.
- In practice they work much better! 

### Representation 3: Dense word embeddings

### Word2Vec 

- A family of algorithms to create dense word embeddings
- $|V| \rightarrow$ vocabulary size
- $d \rightarrow$ number of dimensions 
<img src="images/word2vec.png" width="700" height="700">


#### Pair-share (~5 mins)

- Before we go into the details of the algorithm, let's examine what we can do with word embeddings. 
- Let's try an online demo of word embeddings created by the Turku NLP group from Finland. 
- Explore similarities between word pairs of your choice. 

http://bionlp-www.utu.fi/wv_demo/

### Success of Word2Vec

- Able to capture complex relationships between words.
- Example: What is the word that is similar to **WOMAN** in the same sense as **KING** is similar to **MAN**?
- Perform simple algebraic operations on the vector representations of words.
    $\vec{X} = \vec{\text{KING}} - \vec{\text{MAN}} + \vec{\text{WOMAN}}$
- Search in the vector space for the word closest to $\vec{X}$, as measured by cosine distance.

<img src="files/images/word_analogies1.png" width="500" height="500">
(Credit: Mikolov et al. 2013)    


### How can we get dense vectors?
 
- Count-based methods
    - Singular Value Decomposition (SVD)
- Prediction-based methods
    - [Word2Vec](https://github.com/tmikolov/word2vec)
    - [fastText](https://fasttext.cc/)
    - [GloVe](https://nlp.stanford.edu/projects/glove/)
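
To make the count-based route concrete, here is a minimal sketch of turning a co-occurrence matrix into short dense vectors with a truncated SVD. It uses NumPy's `np.linalg.svd` on a tiny matrix whose counts are made up for illustration; for a large sparse matrix, scikit-learn's `TruncatedSVD` would be the more usual tool. The names `cooc` and `svd_embeddings` are hypothetical.

```python
import numpy as np

# Toy symmetric co-occurrence counts (made-up values, for illustration only).
words = ["tall", "height", "machu", "picchu"]
cooc = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 4],
    [1, 1, 4, 0],
], dtype=float)

def svd_embeddings(matrix, d):
    """Keep the top-d singular directions and return one d-dimensional row per word."""
    U, S, Vt = np.linalg.svd(matrix, full_matrices=False)
    # Scale the left singular vectors by their singular values.
    return U[:, :d] * S[:d]

dense = svd_embeddings(cooc, d=2)
for word, vector in zip(words, dense):
    print(word, np.round(vector, 3))
```

Words with identical co-occurrence rows (here `tall` and `height`) end up with identical dense vectors, which is the behaviour we were missing with one-hot encodings.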

### Word2Vec 

- A family of models to get word vectors.

- Two primary algorithms 
    - **Skip-gram**
    - Continuous bag of words

- Two moderately efficient training methods 
    - Hierarchical softmax
    - Negative sampling 

### Skip-gram

### Fake word-prediction task 

- Given a target word, predict context words. 
<center>
<blockquote>
    Add freshly squeezed <b>pineapple</b> juice to your smoothie. 
</blockquote> 
</center>

- Target word: **pineapple**
- Is the word **juice** likely to occur in the context of **pineapple**? 


### Context words and target word

- Given a *target* word, we define a **context window** and predict each word in the given context window.
- If the context window is $n$, consider $n$ preceding and $n$ following words of the target word.
- Example 
    * Sequence of words &rarr;  $w_{t-3}, w_{t-2}, w_{t-1}, w_t, w_{t+1}, w_{t+2}$
    * Target word &rarr; $w_t$
    * $n=2$
    * Context words &rarr;  $w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}$.
- Context window does not usually cross sentence boundaries. 

### Context words and target word

<center>
<blockquote>
    Add freshly squeezed <b>pineapple</b> juice to your smoothie . It will make your smoothie extra special . 
</blockquote> 
</center>

- What are the context words of the target word **pineapple** with the window size 2? 
- What are the context words of the target word **smoothie** with the window size 2?

### General idea

- Given a single word, we want to predict the words in its context, similar to what we saw in distributional representations.
- That is, we model $P(context|word)$.
- Once we have that, we define a loss function 

    $$Loss = 1 - P(context|word)$$ 
- If we predict the context perfectly, $P(context|word)$ is 1 and there is no loss.
    
    
- We do that for many word and context pairs in a large corpus
- Keep adjusting the weights to minimize the loss

### Continuous skip-gram objective
- Consider the conditional probabilities $p(w_c|w_t)$ and set the parameters $\theta$ of $p(w_c|w_t; \theta)$ so as to maximize the corpus probability. 

<center>
$$
\begin{aligned}
&\arg\max\limits_{\theta} \prod\limits_{t=1}^{T}\ \prod\limits_{\substack{-m \leq c \leq +m \\ c\neq0}} p(w_{t+c}|w_t;\theta)\\
= &\arg\max\limits_\theta \prod\limits_{(w_c,w_t) \in D} p(w_c|w_t;\theta)
\end{aligned}
$$
</center>

- $w_t$ &rarr; target word
- $w_{t+c}$ &rarr; the context word at offset $c$ from the target (written $w_c$ when we iterate over the pairs in $D$)
- $T$ &rarr; the number of words in the corpus
- $m$ &rarr; the context window size
- $D$ &rarr; the set of all word and context pairs from the text

### Parameters to learn

- Given a corpus with vocabulary of size $V$, where a word $w_i$ is identified by its index $i \in \{1, \dots, V\}$, learn a vector representation for each $w_i$ by predicting the words that appear in its context. 
- Learn the following parameters of the model
    - With $V = 10{,}000$ and $d = 300$, the number of parameters to learn is $2dV = 6{,}000{,}000$! 

<center>
$
\theta = 
\begin{bmatrix} aardvark_t\\
                aback_t\\
                \dots\\
                zymurgi_t\\
                aardvark_c\\
                aback_c\\                
                \dots\\
                zymurgi_c\\                
\end{bmatrix} \in R^{2dV}
$
</center>
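
The parameter count follows directly from the shape of $\theta$: one $d$-dimensional target vector and one $d$-dimensional context vector per vocabulary word. A tiny sketch of the arithmetic (the function name is illustrative):

```python
def skipgram_parameter_count(vocab_size, dim):
    """Two embeddings (target and context) of `dim` values for each of `vocab_size` words."""
    return 2 * vocab_size * dim

print(skipgram_parameter_count(10_000, 300))   # the V = 10,000, d = 300 case above
print(skipgram_parameter_count(100_000, 300))  # a larger vocabulary, same dimensionality
```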

### Modeling conditional probability

- Model the conditional probability using softmax of the dot product.
    * The higher the dot product, the higher the probability, and vice versa.     
    
<center>
$P(w_c|w_t;\theta) = \frac{\exp(\vec{w_c} \cdot \vec{w_t})}{\sum\limits_{c' \in V} \exp(\vec{w_{c'}} \cdot \vec{w_t})}$
</center>    
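
Here is a minimal NumPy sketch of this softmax, assuming randomly initialized embeddings for a toy vocabulary of five words. The matrices `target_vectors` and `context_vectors` and the function `context_probabilities` are illustrative names, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 3  # toy vocabulary size and embedding dimension (assumed values)
target_vectors = rng.normal(size=(V, d))   # one row per word used as a target
context_vectors = rng.normal(size=(V, d))  # one row per word used as context

def context_probabilities(t, target_vectors, context_vectors):
    """P(w_c | w_t) for every candidate context word c, via softmax of dot products."""
    scores = context_vectors @ target_vectors[t]  # one dot product per vocabulary word
    scores = scores - scores.max()                # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

probs = context_probabilities(0, target_vectors, context_vectors)
print(np.round(probs, 3), probs.sum())
```

The denominator sums over the whole vocabulary, which is what makes the exact softmax expensive for realistic vocabulary sizes.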

### Continuous skip-gram objective
   
<center>
$$
\begin{aligned}
&\arg \max\limits_\theta \prod\limits_{(w_c,w_t) \in D} p(w_c|w_t;\theta) = \arg \max\limits_\theta \prod\limits_{(w_c,w_t) \in D}\frac{\exp(\vec{w_c} \cdot \vec{w_t})}{\sum\limits_{c' \in V} \exp(\vec{w_{c'}} \cdot \vec{w_t})}\\
= &\arg \max\limits_\theta \sum\limits_{(w_c,w_t) \in D} \log p(w_c|w_t;\theta) = \arg \max\limits_\theta \sum\limits_{(w_c,w_t) \in D} \left(\log \exp(\vec{w_c} \cdot \vec{w_t}) - \log \sum\limits_{c' \in V} \exp(\vec{w_{c'}} \cdot \vec{w_t})\right)
\end{aligned}
$$
</center>    


- Assumption: Maximizing this objective will result in meaningful embeddings for all words in the vocabulary. 
- Learn two embeddings for each word: context embedding and target embedding. 
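
To see what is being maximized, the log form of the objective can be evaluated directly on a toy example. This sketch assumes word indices instead of strings, small random embeddings, and a hand-written pair list `D`; all names are illustrative.

```python
import numpy as np

def log_likelihood(pairs, target_vectors, context_vectors):
    """Sum over (context, target) index pairs of log p(w_c | w_t)."""
    total = 0.0
    for c, t in pairs:
        scores = context_vectors @ target_vectors[t]
        # log of the softmax denominator, computed stably (log-sum-exp)
        shift = scores.max()
        log_norm = shift + np.log(np.exp(scores - shift).sum())
        total += scores[c] - log_norm
    return total

rng = np.random.default_rng(0)
V, d = 5, 3  # toy vocabulary size and embedding dimension (assumed values)
target_vectors = rng.normal(scale=0.1, size=(V, d))
context_vectors = rng.normal(scale=0.1, size=(V, d))

# Hypothetical (context index, target index) pairs extracted from a corpus.
D = [(1, 0), (2, 0), (0, 1), (3, 4)]

print(log_likelihood(D, target_vectors, context_vectors))
```

Training adjusts both embedding matrices to push this value up; with all-zero embeddings every context word is equally likely, so each pair contributes $-\log V$.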

<img src="images/skip-gram.png" width="1000" height="1000">


### Questions? 

### Main hyperparameters of the model

- Dimensionality of the word vectors 
- Window size
    * shorter window: more syntactic representation
    * longer window: more semantic representation 
- Number of negative samples
    * Mikolov et al. (2013) suggest setting this parameter in the range 5 to 20 for small training datasets and in the range 2 to 5 for large training datasets.    

### Training word2vec embeddings 

- [Original C code](https://code.google.com/archive/p/word2vec/) 
- [GitHub version of the code](https://github.com/tmikolov/word2vec)
- [Gensim](https://radimrehurek.com/gensim/), an open source Python library, provides a Python interface for the word2vec family of algorithms

In [17]:
import gensim
from gensim.test.utils import common_texts
from gensim.models import Word2Vec, KeyedVectors, FastText
common_texts

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

In [18]:
corpus = ["How tall is Machu Picchu?",
          "Machu Picchu is 13.164 degrees south of the equator.", 
          "The official height of Machu Picchu is 2,430 m.",
          "Machu Picchu is 80 kilometres (50 miles) northwest of Cusco."
         ]
pp = MyPreprocessor()
pp_corpus = pp.preprocess_corpus(corpus)
pp_corpus

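# Train a skip-gram model (sg=1) with negative sampling (4 noise words per positive pair).
# Note: `size` is the gensim 3.x argument name; gensim 4.0+ renamed it to `vector_size`.
model = Word2Vec(pp_corpus, 
                 size=100, 
                 window=5, 
                 min_count=1,
                 sg=1, 
                 negative=4, 
                )
model.wv['tall']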



array([-0.00055038, -0.00025228,  0.00476185, -0.00422345, -0.00320533,
        0.00493144,  0.00104723,  0.00467511, -0.00425784,  0.00324989,
       -0.00267231,  0.00352778,  0.00219797, -0.00090192,  0.0031217 ,
        0.00375813, -0.0028545 , -0.00256043,  0.00247986,  0.00249211,
        0.00067601,  0.00359497, -0.0019647 ,  0.00037799,  0.00037788,
       -0.0026525 ,  0.00303038,  0.00310917,  0.00252059,  0.00350445,
       -0.00371354,  0.00395391, -0.00496338, -0.0031164 ,  0.00135512,
        0.00201187, -0.00470974, -0.00464884,  0.00060289, -0.00069758,
        0.00229744, -0.00397985,  0.00255192, -0.00015335, -0.00186081,
       -0.00260885, -0.00122199,  0.00491633,  0.00323905, -0.00066195,
       -0.00134309,  0.00428094,  0.00287925,  0.00470387,  0.00444078,
       -0.00140785,  0.0042861 ,  0.00238582,  0.00021381, -0.00388645,
        0.00193228, -0.00160409,  0.00019206, -0.00120234,  0.00479675,
       -0.00296858,  0.00301671,  0.00423891,  0.00194561,  0.00

### Other popular methods to get embeddings

### Pre-trained embeddings

A number of pre-trained word embeddings are available. The most popular ones are:  

- [word2vec](https://code.google.com/archive/p/word2vec/)
    * trained on several corpora using the word2vec algorithm 
- [GloVe](https://nlp.stanford.edu/projects/glove/)
    * trained using [the GloVe algorithm](https://nlp.stanford.edu/pubs/glove.pdf) 
    * published by Stanford University 
- [fastText pre-trained embeddings for 294 languages](https://fasttext.cc/docs/en/pretrained-vectors.html) 
    * trained using [the fastText algorithm](http://aclweb.org/anthology/Q17-1010)
    * published by Facebook
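
Pre-trained GloVe vectors are distributed as plain text, with one word per line followed by its space-separated vector components. The following is a minimal sketch of reading that layout, assuming no header line (the word2vec text format adds one with the vocabulary size and dimensionality). The function name `load_text_embeddings` and the two sample lines are made up for illustration; an in-memory string stands in for a downloaded file.

```python
import io

def load_text_embeddings(handle):
    """Parse lines of the form 'word v1 v2 ... vd' into a dict of float lists."""
    embeddings = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings

# Made-up sample in the same layout; real files have hundreds of thousands of lines.
sample = io.StringIO("tall 0.1 0.2 0.3\nheight 0.1 0.25 0.3\n")
vectors = load_text_embeddings(sample)
print(vectors["tall"])
```

With a real file, the same function would be called on `open(path, encoding="utf-8")`; Gensim's `KeyedVectors` offers a ready-made loader for the word2vec format.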

In [6]:
# Load Google's pre-trained Word2Vec vectors.
w2v_model = KeyedVectors.load_word2vec_format('data/GoogleNews-vectors-negative300.bin', binary=True)

In [1]:
# `vocab` is the gensim 3.x attribute; gensim 4.0+ uses `key_to_index` instead.
print('Size of vocabulary: ', len(w2v_model.vocab))
word_pairs = [('height','tall'),
              ('pineapple','mango'), 
              ('pineapple','juice'), 
              ('sun','robot')]
for pair in word_pairs: 
    print('The similarity between %s and %s is %0.3f' %(pair[0], pair[1], w2v_model.similarity(pair[0], pair[1])))

### Pre-trained embeddings

Why? 
- Training embeddings is computationally expensive.
- For large corpora, the vocabulary size is more than 100,000.  
- With a vocabulary of 100,000 words and 300-dimensional embeddings, the model has $2 \times 30{,}000{,}000$ parameters.

### Relational meaning between words

- Complex similarities between words
    - **big** : **biggest** :: **small** : ?
- Perform simple algebraic operations on the vector representations of words
    $\vec{X} = \vec{\text{biggest}} - \vec{\text{big}} + \vec{\text{small}}$
- Search in the vector space for the word closest to $\vec{X}$, as measured by cosine distance

<center>
<img src="files/images/word_analogies1.png" width="500" height="500">
(Credit: Mikolov et al. 2013)    
</center>
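
The arithmetic behind an analogy query can be sketched with hand-made vectors. The two-dimensional embeddings below are hypothetical (axis 0 loosely encodes size, axis 1 the superlative form) and chosen so the analogy works out exactly; real embeddings only approximate this. The function name `analogy_by_arithmetic` is illustrative, and excluding the three input words from the search is a common convention.

```python
import numpy as np

# Hypothetical 2-d embeddings, for illustration only.
vectors = {
    "big": np.array([1.0, 0.0]),
    "biggest": np.array([1.0, 1.0]),
    "small": np.array([-1.0, 0.0]),
    "smallest": np.array([-1.0, 1.0]),
    "huge": np.array([1.2, 0.3]),
}

def analogy_by_arithmetic(a, b, c, vectors):
    """Solve a : b :: c : ? by finding the word closest (by cosine) to b - a + c."""
    x = vectors[b] - vectors[a] + vectors[c]
    best_word, best_sim = None, -np.inf
    for word, v in vectors.items():
        if word in (a, b, c):
            continue  # skip the query words themselves
        sim = x @ v / (np.linalg.norm(x) * np.linalg.norm(v))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

print(analogy_by_arithmetic("big", "biggest", "small", vectors))
```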


### Examples of semantic and syntactic relationships

<center>
<img src="files/images/word_analogies2.png" width="800" height="800">
(Credit: Mikolov et al. 2013)
</center>

In [127]:
def analogy(word1, word2, word3, model = w2v_model):
    '''    
    Returns analogy word using the given model. 
    
    Keyword arguments:
    word1 -- (str) 
    word2 -- (str)
    word3 -- (str)
    model -- word embedding model
    '''
    print('%s : %s :: %s : ?' %(word1, word2, word3))
    sim_words = model.most_similar(positive=[word3, word2], negative=[word1])
    return pd.DataFrame(sim_words, columns=['Analogy word', 'Score'])

In [39]:
analogy('man','king','woman')

man : king :: woman : ?


| | Analogy word | Score |
|---|---|---|
| 0 | queen | 0.711819 |
| 1 | monarch | 0.618967 |
| 2 | princess | 0.590243 |
| 3 | crown_prince | 0.549946 |
| 4 | prince | 0.537732 |
| 5 | kings | 0.523684 |
| 6 | Queen_Consort | 0.523595 |
| 7 | queens | 0.518113 |
| 8 | sultan | 0.509859 |
| 9 | monarchy | 0.508741 |


In [40]:
analogy('Montreal', 'Canadiens', 'Vancouver')

Montreal : Canadiens :: Vancouver : ?


| | Analogy word | Score |
|---|---|---|
| 0 | Canucks | 0.821327 |
| 1 | Vancouver_Canucks | 0.750401 |
| 2 | Calgary_Flames | 0.70547 |
| 3 | Leafs | 0.695783 |
| 4 | Maple_Leafs | 0.691617 |
| 5 | Thrashers | 0.687504 |
| 6 | Avs | 0.681716 |
| 7 | Sabres | 0.665307 |
| 8 | Blackhawks | 0.664625 |
| 9 | Habs | 0.661023 |


In [46]:
analogy('Microsoft', 'Windows', 'Apple')

Microsoft : Windows :: Apple : ?


| | Analogy word | Score |
|---|---|---|
| 0 | Macs | 0.673568 |
| 1 | iMac | 0.64634 |
| 2 | Mac_OS | 0.640714 |
| 3 | iPhone | 0.640588 |
| 4 | iPad | 0.633464 |
| 5 | OS_X | 0.632136 |
| 6 | iBook | 0.626197 |
| 7 | iMacs | 0.619245 |
| 8 | iOS | 0.617178 |
| 9 | Mac_mini | 0.61114 |


### Implicit biases and stereotypes in word embeddings

- Reflect gender stereotypes present in broader society.
- They may also amplify these stereotypes because of their widespread usage. 
- See [this paper](http://papers.nips.cc/paper/6228-man-is-to-computer-programmer-as-woman-is-to-homemaker-debiasing-word-embeddings.pdf).

In [27]:
analogy('man', 'computer_programmer', 'woman')

man : computer_programmer :: woman : ?


| | Similar word | Score |
|---|---|---|
| 0 | homemaker | 0.562712 |
| 1 | housewife | 0.510505 |
| 2 | graphic_designer | 0.50518 |
| 3 | schoolteacher | 0.497949 |
| 4 | businesswoman | 0.493489 |
| 5 | paralegal | 0.492551 |
| 6 | registered_nurse | 0.490797 |
| 7 | saleswoman | 0.488163 |
| 8 | electrical_engineer | 0.479773 |
| 9 | mechanical_engineer | 0.47554 |


### Pair-share (5 mins)

- Go to the following online demo again: http://bionlp-www.utu.fi/wv_demo/ 
- Team up with your neighbour and try out some pairs for analogies together (~3 minutes).
- Class discussion (~2 minutes). 

### Summary

- Vector space model 
    * Modeling word meaning by placing it in a vector space.
    * Distance between words in this vector space indicates the relationship between them. 
- Word embeddings
    * Creating short and dense representations of words. 
- Word2Vec
    * A family of models to learn dense vector representations of words
    * Freely available code and pre-trained models 
    * Available for many different languages. 

## Relevant papers

- [Distributed representations of words and phrases and their compositionality](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)
- [Efficient estimation of word representations in vector space](https://arxiv.org/pdf/1301.3781.pdf)
- [Linguistic regularities in continuous space word representations](https://www.aclweb.org/anthology/N13-1090)
- [Enriching Word Vectors with Subword Information](http://aclweb.org/anthology/Q17-1010)


## Links for pre-trained embeddings
- [GloVe](https://nlp.stanford.edu/projects/glove/)
- [fastText pre-trained embeddings for 294 languages](https://fasttext.cc/docs/en/pretrained-vectors.html) 

## Fun tools
[wevi: word embedding visual inspector](https://ronxin.github.io/wevi/)

#### Reflection 
- 10 minute reflection: Once you complete the sample class, we stop role-playing and return to behaving like normal Faculty members, and you have the opportunity to present
    - (1) an organized reflection on the sample class (e.g., addressing pedagogical issues you considered when developing this class, the lecture structure and presentation choices you made and why);
    - (2) your broader vision of pedagogy;
    - (3) how you see yourself contributing to the departmental, university, and broader computing communities.


#### Vision 
