# UBC instructor position sample lesson  

### By Varada Kolhatkar [ʋəɾəda kɔːlɦəʈkər]

### Today's plan

- Set the stage (~5 mins)
- Sample class (~40 mins)
- Reflection (~10 mins)
- Vision (~10 mins)
- Q and A (~10 mins)

#### Set the stage (~5 mins)

- I am envisioning this as the second half of the first lesson of DSCI 575 (Advanced Machine Learning in the context of Natural Language Processing (NLP) applications). 
- It is the first week of the final block of the MDS program's curriculum. 
- The students have already taken four 1-credit Machine Learning courses: DSCI 571, DSCI 572, DSCI 573, DSCI 563. 
- They are now ready to apply the concepts they have learned so far to interesting problems. 
- This course uses Python. 
- The prerequisites I am assuming are
    - Familiarity with dot product, cosine similarity (basic linear algebra)
    - Familiarity with neural networks and softmax, which they covered in DSCI 572.

## DSCI 575: Advanced Machine Learning (in the context of NLP applications)

#### Lecture 1 (Part 2): Introduction to Word Embeddings: _Toronto_ : _UofT_ :: _Vancouver_ : ? 
Instructor: Varada Kolhatkar [ʋəɾəda kɔːlɦəʈkər]

#### Today's plan
- Quick introduction
- Word representations
    - Sparse representations with a co-occurrence matrix
    - Dense representations with word2vec (the skip-gram algorithm) 

In [1]:
import pandas as pd
import numpy as np
import os, sys
from IPython.display import display, HTML

import matplotlib.pyplot as plt

from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from scipy.sparse import coo_matrix, csr_matrix

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

import re
from collections import defaultdict
from collections import Counter

import gensim
from gensim.test.utils import common_texts
from gensim.models import Word2Vec, KeyedVectors, FastText

plt.rcParams['font.size'] = 16
sys.path.append('code/.')
from preprocessing import MyPreprocessor

In [2]:
# BEGIN STARTER CODE
class CooccurrenceMatrix:
    def __init__(self, corpus, 
                       tokenizer = word_tokenize, 
                       window_size = 3):
        self.corpus = corpus
        self.tokenizer = tokenizer
        self.window_size = window_size
        self.vocab = {}
        self.cooccurrence_matrix = None    
        
    def fit_transform(self):
        """
        Creates a co-occurrence matrix. 
        
        Returns vocabulary (dict) and co-occurrence matrix (csr_matrix)
        """
        data=[]
        row=[]
        col=[]
        for tokens in self.corpus:
            for target_index, token in enumerate(tokens):
                # Get the index of the word in the vocabulary. If the word is not in the vocabulary, 
                # set the index to the size of the vocabulary. 
                i = self.vocab.setdefault(token, len(self.vocab))
                
                # Consider the context words depending upon the context window 
                start = max(0, target_index - self.window_size)
                end = min(len(tokens), target_index + self.window_size + 1)
                
                for context_index in range(start, end):
                    # Do not consider the target word.  
                    if target_index == context_index: 
                        continue                        
                    j = self.vocab.setdefault(tokens[context_index], len(self.vocab))
                    # Set diagonal to 0
                    if i == j:
                        continue
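                    data.append(1.0)
                    row.append(i)
                    col.append(j)
        # Duplicate (row, col) entries are summed; pass the shape explicitly so that
        # every vocabulary word gets a row even if it never co-occurs with another word.
        n = len(self.vocab)
        self.cooccurrence_matrix = csr_matrix((data, (row, col)), shape=(n, n))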
        return self.vocab, self.cooccurrence_matrix
            
    def get_word_vector(self, word):
        """
        Given a word returns the word vector associated with it from the co-occurrence matrix. 

        Keyword arguments:
        word -- (str) the word to look up in the vocab.
        """
        # YOUR CODE HERE
        # BEGIN SOLUTION
        if word in self.vocab: 
            return self.cooccurrence_matrix[self.vocab[word]]
        else:
            print('The word is not present in the vocab')
        # END SOLUTION

# END STARTER CODE

### What should a search engine return when asked the following question? 

<img src="imgs/lexical_ambiguity.png" width="800" height="800">


### What is Natural Language Processing (NLP)?

<img src="imgs/WhatisNLP.png" width="800" height="800">

### Why is it hard?

- All the problems related to representation and reasoning in artificial intelligence arise in this domain. 
- For language understanding, we need a representation that captures its "meaning". 



#### Today's promise

- We will learn a state-of-the-art method for word "meaning" representation.  

#### Specific learning outcomes

From this class, you will be able to 

- Explain the general idea of a vector space model.
- Explain the difference between sparse and dense word vectors.
- Explain the general idea of the skip-gram model.
- Use word2vec models to get word similarity and analogies. 

### Representing text

- In order to build machine learning models for text, we need to represent it effectively
    - How can we represent text that captures its "meaning"? 
- Let's start small. 
- How can we represent word "meaning"? 

### Word meaning 

- A favourite topic of philosophers for centuries. 
- An example from the legal domain: [Are hockey gloves gloves or "articles of plastics"?](https://www.scc-csc.ca/case-dossier/info/sum-som-eng.aspx?cas=36258)

<blockquote>
Canada (A.G.) v. Igloo Vikski Inc. was a tariff code case that made its way to the SCC (Supreme Court of Canada). The case disputed the definition of hockey gloves as either gloves or as "articles of plastics."
</blockquote>


<img src="imgs/hockey_gloves_case.png" width="600" height="600">

### Word meaning: NLP view
- Modeling word meaning that allows us to 
    * draw useful inferences to solve meaning-related problems 
    * find relationships between words, e.g., which words are similar, which ones have positive or negative connotations
    

### Activity 1:  Brainstorm ways to represent words (~4 mins) 

- Suppose you are building a Question Answering system and you are given the following question and three candidate answers. 
- Discuss the following points with your neighbour(s). 
    - What kind of relationships between words do we need to capture in order to arrive at the correct answer?  
    - Would the one-hot representation you have seen before work in this context?
    
<blockquote>       
<p style="font-size:30px"><b>Question:</b> How <b>tall</b> is Machu Picchu?</p>
    <p style="font-size:30px"><b>Candidate 1:</b> Machu Picchu is 13.164 degrees south of the equator.</p>    
<p style="font-size:30px"><b>Candidate 2:</b> The official height of Machu Picchu is 2,430 m.</p>
<p style="font-size:30px"><b>Candidate 3:</b> Machu Picchu is 80 kilometres (50 miles) northwest of Cusco.</p>    
</blockquote> 
    

### Reminder: One-hot representation

- Build the **vocabulary** containing all unique words from the corpus 
- Represent each word with a **one-hot** encoding
- A vector of length $V$ in which the value at the word's index is 1 and the values at all other indices are 0
- Example: 
    * Vocabulary size = 10
    * Index of the word *pineapple* = 4
    * One-hot vector for *pineapple*:
    \begin{bmatrix} 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0\end{bmatrix}
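
The one-hot example above can be sketched in a few lines of NumPy. The helper name `one_hot` and the index chosen for *mango* are assumptions made purely for illustration; the point is that the dot product between the one-hot vectors of two different words is always 0, so this representation cannot express that two words are related.

```python
import numpy as np

def one_hot(index, vocab_size):
    """Return a vector of length vocab_size with a 1 at `index` and 0 elsewhere."""
    vec = np.zeros(vocab_size, dtype=int)
    vec[index] = 1
    return vec

pineapple = one_hot(4, 10)   # index 4, vocabulary size 10, as in the example above
mango = one_hot(7, 10)       # hypothetical index for a second word

print(pineapple)             # the one-hot vector for "pineapple"
print(pineapple.dot(mango))  # dot product between two different words
```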

### Representation 1: Term-term co-occurrence matrix

### Distributional hypothesis

<blockquote> 
    <p>You shall know a word by the company it keeps.</p>
    <footer>Firth, 1957</footer>        
</blockquote>

<blockquote> 
If A and B have almost identical environments we say that they are synonyms.
<footer>Harris, 1954</footer>    
</blockquote>    

Example: 

- Her **child** loves to play in the playground. 
- Her **kid** loves to play in the playground. 



### Vector space model

- Model the meaning of a word by placing it into a vector space.  
- A standard way to represent meaning in NLP
- Distances among words in the vector space indicate the relationship between them. 
- Called an "embedding" because it's embedded into a high-dimensional space

<img src="imgs/t-SNE_word_embeddings.png" width="700" height="700">
    (Attribution: Jurafsky and Martin 3rd edition)

### Term-term co-occurrence matrix

- The idea is to go through a corpus of text, keeping a count of all of the words that appear in context of each word (within a window).

<img src="imgs/term-term_comat.png" width="600" height="600">
(Credit: Jurafsky and Martin 3rd edition)


### Visualizing word vectors and similarity 

<img src="imgs/word_vectors_and_angles.png" width="600" height="600">
(Credit: Jurafsky and Martin 3rd edition)

- The similarity is calculated using dot products between word vectors.
    - Example: $\vec{\text{digital}}.\vec{\text{information}} = 0 \times 1 + 1\times 6 = 6$
    - The higher the dot product, the more similar the words.

- We can also calculate a normalized version of dot products. 
    $$similarity_{cosine}(\vec{w_1},\vec{w_2}) = \frac{\vec{w_1}.\vec{w_2}}{\left\lVert w_1\right\rVert_2 \left\lVert w_2\right\rVert_2}$$
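
As a quick check of the two formulas above, here is a minimal NumPy sketch. It uses the two-dimensional vectors implied by the worked example ($\vec{\text{digital}} = [0, 1]$ and $\vec{\text{information}} = [1, 6]$); the variable names are just for illustration.

```python
import numpy as np

digital = np.array([0, 1])
information = np.array([1, 6])

# Dot product: 0*1 + 1*6
dot = digital.dot(information)

# Cosine similarity: dot product divided by the product of the L2 norms
cosine = dot / (np.linalg.norm(digital) * np.linalg.norm(information))

print(dot, round(cosine, 3))
```

Unlike the raw dot product, the cosine similarity does not grow just because a word is frequent and therefore has large counts.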


In [3]:
### Let's build term-term co-occurrence matrix for our text. 
corpus = ["How tall is Machu Picchu?",
          "Machu Picchu is 13.164 degrees south of the equator.", 
          "The official height of Machu Picchu is 2,430 m.",
          "Machu Picchu is 80 kilometres (50 miles) northwest of Cusco.",
          "It is 80 kilometres (50 miles) northwest of Cusco, on the crest of the mountain Machu Picchu, located about 2,430 metres (7,970 feet) above mean sea level, over 1,000 metres (3,300 ft) lower than Cusco, which has an elevation of 3,400 metres (11,200 ft)."
         ]
pp = MyPreprocessor()
pp_corpus = pp.preprocess_corpus(corpus)
cm = CooccurrenceMatrix(pp_corpus)
vocab, comat = cm.fit_transform()
words = [key for key, value in sorted(vocab.items(), 
                                      key = lambda item: (item[1],item[0]))]
df = pd.DataFrame(comat.todense(), 
                  columns = words, 
                  index = words,
                  dtype = np.int8
                 )
df.head()

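| | tall | machu | picchu | 13.164 | degrees | south | equator | official | height | 2,430 | ... | mean | sea | level | 1,000 | 3,300 | ft | lower | elevation | 3,400 | 11,200 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| tall | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| machu | 1 | 0 | 5 | 1 | 1 | 0 | 0 | 1 | 1 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| picchu | 1 | 5 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 13.164 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| degrees | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |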


In [4]:
from sklearn.metrics.pairwise import cosine_similarity
def similarity(word1, word2): 
    """
    Prints the dot product and the cosine similarity between the
    co-occurrence vectors of word1 and word2.
    """
    # get_word_vector returns a 1 x |V| sparse row; convert it to a dense 2D array
    vec1 = cm.get_word_vector(word1).toarray()
    vec2 = cm.get_word_vector(word2).toarray()
    # Flatten to 1D arrays for the dot product
    v1 = vec1.ravel()
    v2 = vec2.ravel()
    print('The dot product between %s and %s is %0.2f and cosine similarity is %0.2f' 
          %(word1, word2, v1.dot(v2), cosine_similarity(vec1, vec2)[0, 0]))

In [5]:
### Let's look at similarity between word pairs
similarity('tall', 'height')
similarity('tall', 'official')

### Not very reliable similarity scores because we used a tiny corpus of only 5 sentences. 

The dot product between tall and height is 2.00 and cosine similarity is 0.71
The dot product between tall and official is 2.00 and cosine similarity is 0.82


### Sparse vs. dense word vectors

- Term-term co-occurrence matrices are long and sparse. 
    - length |V|= 20,000 to 50,000
    - most elements are zero
- OK because there are efficient ways to deal with sparse matrices.


### Alternative 
- Learn short (~100 to 1000 dimensions) and dense vectors. 
- Short vectors may be easier to train with ML models (fewer weights to train).
- They may generalize better.
- In practice they work much better! 

### Representation 2: Dense word embeddings

### Word2Vec 

- A family of algorithms to create dense word embeddings
<img src="imgs/word2vec.png" width="700" height="700">


#### Activity 2: Try out Word similarity with word embeddings (~4 mins)

- Go to the following online demo of word embeddings created by the Turku NLP group from Finland.
http://bionlp-www.utu.fi/wv_demo/
- Under `Select one of the available models`, select `English GoogleNews Negative300`
- Under `Nearest words` option, type a word and get the most similar words for the given word. Some suggestions to get you started: *UBC, bread, Computer_Science*
- To get the similarity between two words, under `Similarity of two words`, type word pairs of your interest. Some pairs to get you started: *tall and height*, *lion and GPU*

### How can we get dense vectors?
 
- Count-based methods
    - Singular Value Decomposition (SVD)
- Prediction-based methods
    - [Word2Vec](https://github.com/tmikolov/word2vec)
    - [fastText](https://fasttext.cc/)
    - [GloVe](https://nlp.stanford.edu/projects/glove/)
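
To make the count-based route concrete, here is a minimal sketch of reducing a co-occurrence matrix to short dense vectors with SVD. The word list and the counts are made up for illustration, and NumPy's `np.linalg.svd` is used for simplicity; on a real, large sparse matrix a truncated SVD would be the more practical choice.

```python
import numpy as np

# Hypothetical toy co-occurrence counts: one row and one column per word.
words = ["child", "kid", "playground", "GPU"]
counts = np.array([[0, 3, 5, 0],
                   [3, 0, 4, 0],
                   [5, 4, 0, 0],
                   [0, 0, 0, 1]], dtype=float)

# Full SVD: counts = U @ diag(S) @ Vt, with singular values in descending order.
U, S, Vt = np.linalg.svd(counts)

k = 2                              # number of dense dimensions to keep
dense_vectors = U[:, :k] * S[:k]   # one k-dimensional row per word

for word, vec in zip(words, dense_vectors):
    print(word, np.round(vec, 2))
```

Each word now has a vector of length `k` instead of length $|V|$; keeping only the largest singular values preserves the directions that explain most of the co-occurrence counts.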

### Word2Vec 

- A family of models to obtain dense word vectors.

- Two primary algorithms 
    - **Skip-gram**
    - Continuous bag of words (CBOW)
- Two moderately efficient training methods 
    - Hierarchical softmax
    - Negative sampling 

### Skip-gram

- A neural network model to obtain robust and dense representations of words. 

### Fake word-prediction task 

- Given a target word (i.e., center word), predict context words (i.e., surrounding words). 
<blockquote>
    Add freshly squeezed$_{context}$ pineapple$_{target}$ juice$_{context}$ to your smoothie. 
</blockquote> 

<center>
<img src="imgs/target_context.png" width="300" height="300">
</center>

- So in the example above given the target word **pineapple**, predict whether: 
    - **juice** is likely to occur in the context of **pineapple**
    - **squeezed** is likely to occur in the context of **pineapple** 
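
The training examples for this fake task are simply (target, context) pairs read off the text with a sliding window. The sketch below shows one way to generate them; the function name `skipgram_pairs`, the lowercased whitespace tokenization, and the window size of 1 are assumptions made for illustration.

```python
def skipgram_pairs(tokens, window_size=1):
    """Return all (target, context) pairs within `window_size` words of each target."""
    pairs = []
    for t, target in enumerate(tokens):
        start = max(0, t - window_size)
        end = min(len(tokens), t + window_size + 1)
        for c in range(start, end):
            if c != t:  # the target word is not its own context
                pairs.append((target, tokens[c]))
    return pairs

tokens = "add freshly squeezed pineapple juice to your smoothie".split()
pairs = skipgram_pairs(tokens, window_size=1)

# Pairs whose target word is "pineapple"
print([pair for pair in pairs if pair[0] == "pineapple"])
```

With a window size of 1, the target word *pineapple* contributes the two pairs from the example above: (pineapple, squeezed) and (pineapple, juice).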

### Skip-gram objective
- Consider the conditional probabilities $p(w_c|w_t)$ and set the parameters $\theta$ of $p(w_c|w_t; \theta)$ so as to maximize the corpus probability. 

<center>
$
\arg \max\limits_\theta \prod\limits_{(w_c,w_t) \in D} p(w_c|w_t;\theta)
$
</center>

- $w_t$ &rarr; target word
- $w_c$ &rarr; context word
- $D$ &rarr; the set of all (context, target) word pairs extracted from the text using a context window of size $m$

### Skip-gram objective

- Model the conditional probability using the softmax of the dot product.
    * The higher the dot product, the higher the probability, and vice versa.     
    

$$p(w_c|w_t;\theta) = \frac{\exp(\vec{w_c}.\vec{w_t})}{\sum\limits_{c' \in V} \exp(\vec{w_{c'}}.\vec{w_t})}$$

- Substituting the softmax of the dot product for the conditional probability: 
$$    
\arg \max\limits_\theta \prod\limits_{(w_c,w_t) \in D} p(w_c|w_t;\theta) = \arg \max\limits_\theta \prod\limits_{(w_c,w_t) \in D}\frac{\exp(\vec{w_c}.\vec{w_t})}{\sum\limits_{c' \in V} \exp(\vec{w_{c'}}.\vec{w_t})}$$
- Assumption: Maximizing this objective will result in meaningful embeddings for all words in the vocabulary. 
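
The softmax of dot products can be sketched directly in NumPy. Everything concrete here is an assumption for illustration: the toy sizes, the randomly initialized matrices `W_target` and `W_context` (one row per vocabulary word), and the function name.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 4                          # toy vocabulary size and embedding dimension
W_target = rng.normal(size=(V, d))   # row i: vector of word i when it is the target
W_context = rng.normal(size=(V, d))  # row i: vector of word i when it is a context word

def context_probabilities(t):
    """Return p(w_c | w_t) for every candidate context word c in the vocabulary."""
    scores = W_context @ W_target[t]   # one dot product per vocabulary word
    scores = scores - scores.max()     # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

probs = context_probabilities(2)
print(np.round(probs, 3), probs.sum())
```

The denominator sums over the whole vocabulary, which is what makes the plain softmax expensive for realistic vocabulary sizes.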

### How do we do it?

- We use a neural network architecture with 
    - an input layer
    - a hidden layer
    - an output layer 
- We use the softmax activation function for the output layer. 
    


### Example 

<img src="imgs/skipgram_0.png" width="1000" height="1000">

### Input layer and "gold" 

<img src="imgs/skipgram_1.png" width="1000" height="1000">

### Hidden layer

<img src="imgs/skipgram_2.png" width="1000" height="1000">

### What will be the dimensions of the weight matrix between input and hidden layers?

1. $10000 \times 1$
2. $300 \times 10000$
3. $300 \times 300$


<img src="imgs/skipgram_2.png" width="500" height="500">

### Hidden layer and output layer 

<img src="imgs/skipgram_3.png" width="1000" height="1000">

### What will be the dimensions of the weight matrix between hidden and output layers?

1. $10000 \times 1$
2. $10000 \times 300$
3. $300 \times 300$


<img src="imgs/skipgram_3.png" width="500" height="500">

### Softmax activation function 

- Apply softmax to get probability distribution 

<img src="imgs/skipgram_4.png" width="1000" height="1000">

### Compare prediction ($\hat{y}$) with "gold" ($y$)

- We want a number closer to 1 in the prediction at index 5,428
    - Loss is high!

<img src="imgs/skipgram_5.png" width="1000" height="1000">



- Learn weights using backpropagation and gradient descent. 

<img src="imgs/skipgram_5.png" width="500" height="500">
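
The walkthrough above (one-hot input, hidden layer, softmax output, comparison with the "gold" vector, weight update) can be condensed into a short NumPy sketch. This is a simplified illustration and not the word2vec implementation: it uses toy sizes, a full softmax, plain gradient descent on a single (target, context) pair, and arbitrarily chosen indices and learning rate.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 3   # toy sizes; the slides use V = 10,000 and d = 300

W_in = rng.normal(scale=0.1, size=(d, V))    # weights between input and hidden layers
W_out = rng.normal(scale=0.1, size=(V, d))   # weights between hidden and output layers

def train_step(W_in, W_out, target_idx, context_idx, lr=0.5):
    """One gradient-descent update for a single pair; returns the loss before the update."""
    h = W_in[:, target_idx].copy()              # W_in times a one-hot vector selects one column
    scores = W_out @ h                          # one score per vocabulary word
    exp_scores = np.exp(scores - scores.max())
    y_hat = exp_scores / exp_scores.sum()       # softmax prediction
    loss = -np.log(y_hat[context_idx])          # cross-entropy against the one-hot "gold" vector

    error = y_hat.copy()
    error[context_idx] -= 1.0                   # gradient of the loss w.r.t. the scores: y_hat - y
    grad_h = W_out.T @ error                    # gradient w.r.t. the hidden layer
    W_out -= lr * np.outer(error, h)            # in-place update of the output weights
    W_in[:, target_idx] -= lr * grad_h          # in-place update of the target word's column
    return loss

losses = [train_step(W_in, W_out, target_idx=2, context_idx=7) for _ in range(20)]
print(round(losses[0], 3), round(losses[-1], 3))
```

Repeating the update on the same pair drives the loss down, i.e., the predicted probability at the gold index moves closer to 1.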


### Skip-gram model for two target-context pairs 

<img src="imgs/skip-gram.png" width="1000" height="1000">


### Parameters to learn

- Given a corpus with vocabulary of size $V$, where a word $w_i$ is identified by its index $i \in \{1, \dots, V\}$, learn a vector representation for each $w_i$ by predicting the words that appear in its context. 
- Learn the following parameters of the model
    - Suppose $V = 10{,}000$ and $d = 300$; the number of parameters to learn is $2dV = 6{,}000{,}000$! 

<center>
$
\theta = 
\begin{bmatrix} \text{aardvark}_t\\
                \text{aback}_t\\
                \dots\\
                \text{zymurgy}_t\\
                \text{aardvark}_c\\
                \text{aback}_c\\
                \dots\\
                \text{zymurgy}_c
\end{bmatrix} \in \mathbb{R}^{2dV}
$
</center>

### Questions? 

### Main hyperparameters of the model

- Dimensionality of the word vectors 
- Window size
    * shorter window: more syntactic representation
    * longer window: more semantic representation 
- Number of negative samples (when training with negative sampling)
    * Mikolov et al. (2013) suggest setting this parameter in the range 5 to 20 for small training datasets and in the range 2 to 5 for large training datasets.    

### Training word2vec embeddings 

- [Original C code](https://code.google.com/archive/p/word2vec/) 
- [GitHub version of the code](https://github.com/tmikolov/word2vec)
- [Gensim](https://radimrehurek.com/gensim/), an open-source Python library that provides a Python interface to the word2vec family of algorithms

In [9]:
import gensim
from gensim.test.utils import common_texts
from gensim.models import Word2Vec, KeyedVectors, FastText
common_texts

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

In [16]:
# Let's build a word2vec model on this tiny corpus
# (`size` is the gensim 3.x argument name; gensim >= 4.0 calls it `vector_size`)
model = Word2Vec(common_texts, 
                 size=100, 
                 window=5, 
                 min_count=1)

# What does a learned dense word vector look like? 
model.wv['trees']

array([-8.0577651e-04, -4.1928720e-03, -4.1012844e-04, -4.5362543e-03,
        2.8545428e-03, -2.0831355e-03, -1.5676343e-03,  2.9114417e-03,
        4.8867948e-03, -2.0088769e-04,  4.4950220e-04, -3.1292173e-03,
        2.3083936e-03,  2.1350672e-03,  4.1387044e-03,  4.6810219e-03,
        3.5922842e-03, -4.8433754e-04,  7.2008377e-04, -3.0727694e-03,
        1.6958485e-03,  1.1154150e-03,  1.2656286e-03,  4.4413866e-03,
       -4.9861602e-04, -2.6142984e-03,  2.7393631e-03,  1.2527937e-03,
       -4.4371560e-03, -2.0429664e-03, -2.4972567e-03, -2.0705322e-03,
        2.5140313e-03,  4.9008396e-03,  4.0714932e-03, -1.8321427e-03,
        3.0235352e-04, -2.1816576e-03, -3.9005610e-03,  3.6648144e-03,
        4.5554382e-03, -4.6852673e-04, -1.5028286e-03,  2.0181837e-03,
       -2.2745879e-04, -2.8567808e-03, -3.6195670e-03,  4.9500149e-03,
        3.0311327e-03, -2.9000025e-03,  5.4507988e-04, -7.7085308e-04,
        1.9078582e-03,  4.7682840e-03,  2.0714775e-03,  1.7742552e-03,
      

### Pre-trained embeddings

A number of pre-trained word embeddings are available. The most popular ones are:  

- [word2vec](https://code.google.com/archive/p/word2vec/)
    * trained on several corpora using the word2vec algorithm 
- [GloVe](https://nlp.stanford.edu/projects/glove/)
    * trained using [the GloVe algorithm](https://nlp.stanford.edu/pubs/glove.pdf) 
    * published by Stanford University 
- [fastText pre-trained embeddings for 294 languages](https://fasttext.cc/docs/en/pretrained-vectors.html) 
    * trained using [the fastText algorithm](http://aclweb.org/anthology/Q17-1010)
    * published by Facebook

In [22]:
# Load Google's pre-trained Word2Vec model.
# You can download them from 
model = KeyedVectors.load_word2vec_format('data/GoogleNews-vectors-negative300.bin', binary=True)

In [38]:
print('Size of vocabulary: ', len(model.vocab))  # in gensim >= 4.0: len(model.key_to_index)
word_pairs = [('height','tall'),
              ('pineapple','mango'), 
              ('pineapple','juice'), 
              ('sun','robot'), 
              ('GPU','lion')]
for pair in word_pairs: 
    print('The similarity between %s and %s is %0.3f' %(pair[0], pair[1], model.similarity(pair[0], pair[1])))
    

Size of vocabulary:  3000000
The similarity between height and tall is 0.473
The similarity between pineapple and mango is 0.668
The similarity between pineapple and juice is 0.418
The similarity between sun and robot is 0.029
The similarity between GPU and lion is 0.002


### Success of Word2Vec

- Able to capture complex relationships between words.
- Example: What is the word that is similar to **WOMAN** in the same sense as **KING** is similar to **MAN**?
- Perform simple algebraic operations on the vector representations of words.
    $\vec{X} = \vec{\text{KING}} - \vec{\text{MAN}} + \vec{\text{WOMAN}}$
- Search in the vector space for the word closest to $\vec{X}$ measured by cosine distance.

<img src="imgs/word_analogies1.png" width="500" height="500">
(Credit: Mikolov et al. 2013)    
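
The vector arithmetic above can be illustrated without any pre-trained model. The two-dimensional vectors below are hand-made (the first dimension loosely stands for "royalty", the second for "gender") and the helper names are hypothetical; real embeddings are learned and have hundreds of dimensions.

```python
import numpy as np

# Hypothetical toy embeddings, chosen by hand for illustration only.
vectors = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.1]),
}

def cosine(u, v):
    return u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))

def toy_analogy(word1, word2, word3):
    """Solve word1 : word2 :: word3 : ? by vector arithmetic and cosine similarity."""
    x = vectors[word2] - vectors[word1] + vectors[word3]
    # Exclude the three query words, then pick the word closest to x
    candidates = [w for w in vectors if w not in (word1, word2, word3)]
    return max(candidates, key=lambda w: cosine(vectors[w], x))

answer = toy_analogy("man", "king", "woman")
print(answer)
```

Here $\vec{X}$ works out to $[1, -1]$, which coincides with the toy vector for *queen*, so *queen* is returned as the closest remaining word.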


In [26]:
def analogy(word1, word2, word3, model = model):
    '''    
    Returns analogy word using the given model. 
    
    Keyword arguments:
    word1 -- (str) 
    word2 -- (str)
    word3 -- (str)
    model -- word embedding model
    '''
    print('%s : %s :: %s : ?' %(word1, word2, word3))
    sim_words = model.most_similar(positive=[word3, word2], negative=[word1])
    return pd.DataFrame(sim_words, columns=['Analogy word', 'Score'])

In [27]:
analogy('man','king','woman')

man : king :: woman : ?


| | Analogy word | Score |
|---|---|---|
| 0 | queen | 0.711819 |
| 1 | monarch | 0.618967 |
| 2 | princess | 0.590243 |
| 3 | crown_prince | 0.549946 |
| 4 | prince | 0.537732 |
| 5 | kings | 0.523684 |
| 6 | Queen_Consort | 0.523595 |
| 7 | queens | 0.518113 |
| 8 | sultan | 0.509859 |
| 9 | monarchy | 0.508741 |


In [29]:
analogy('Montreal', 'Canadiens', 'Vancouver')

Montreal : Canadiens :: Vancouver : ?


| | Analogy word | Score |
|---|---|---|
| 0 | Canucks | 0.821327 |
| 1 | Vancouver_Canucks | 0.750401 |
| 2 | Calgary_Flames | 0.70547 |
| 3 | Leafs | 0.695783 |
| 4 | Maple_Leafs | 0.691617 |
| 5 | Thrashers | 0.687504 |
| 6 | Avs | 0.681716 |
| 7 | Sabres | 0.665307 |
| 8 | Blackhawks | 0.664625 |
| 9 | Habs | 0.661023 |


In [1]:
### Recall the title of today's lesson 
analogy('Toronto', 'UofT', 'Vancouver')

### Implicit biases and stereotypes in word embeddings

- Word embeddings reflect gender stereotypes present in broader society.
- They may also amplify these stereotypes because of their widespread usage. 
- See [this paper](http://papers.nips.cc/paper/6228-man-is-to-computer-programmer-as-woman-is-to-homemaker-debiasing-word-embeddings.pdf).

In [27]:
analogy('man', 'computer_programmer', 'woman')

man : computer_programmer :: woman : ?


| | Similar word | Score |
|---|---|---|
| 0 | homemaker | 0.562712 |
| 1 | housewife | 0.510505 |
| 2 | graphic_designer | 0.50518 |
| 3 | schoolteacher | 0.497949 |
| 4 | businesswoman | 0.493489 |
| 5 | paralegal | 0.492551 |
| 6 | registered_nurse | 0.490797 |
| 7 | saleswoman | 0.488163 |
| 8 | electrical_engineer | 0.479773 |
| 9 | mechanical_engineer | 0.47554 |


### Activity: Explore analogies with word embeddings (~4 mins)

- Go to the following online demo again. 
http://bionlp-www.utu.fi/wv_demo/ 
- Team up with your neighbour and try out some pairs for analogies together (~3 minutes).
- Class discussion (~2 minutes). 

### Summary

- Vector space model 
    * Modeling word meaning by placing it in a vector space.
    * Distances between words in this vector space indicate the relationship between them. 
- Word embeddings
    * Sparse embeddings using co-occurrence matrix
    * Dense embeddings using word2vec models 
        * Freely available code and pre-trained models 
        * Available for many different languages. 

### Revisit: Learning outcomes


1. Word representations created by a term-term co-occurrence matrix are long and sparse, whereas the ones created by word2vec models are short and dense. True or False? 
2. Given the following table, which word pair is more similar in terms of dot product?  
    1. X, Y
    2. X, Z

<img src="imgs/similarity_question.png" width="500" height="500">

3. The skip-gram model predicts context words given a target word. True or False? 
<br><br><br><br><br><br><br><br><br><br>

### Coming up 

- Wait! Don't we want to represent sentences and paragraphs for tasks such as the one below? 
          
$X = \begin{bmatrix}\text{"@united you're terrible. You don't understand safety",}\\ \text{"@JetBlue safety first !! #lovejetblue"}\\ \text{"@SouthwestAir truly the best in #customerservice!"}\\ \end{bmatrix}$ and $y = \begin{bmatrix}0 \\ 1 \\ 1 \end{bmatrix}$

- We will build on the idea of word embeddings to come up with general text representations. 

### Bonus: Relevant papers

- [Distributed representations of words and phrases and their compositionality](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)
- [Efficient estimation of word representations in vector space](https://arxiv.org/pdf/1301.3781.pdf)
- [Linguistic regularities in continuous space word representations](https://www.aclweb.org/anthology/N13-1090)
- [Enriching Word Vectors with Subword Information](http://aclweb.org/anthology/Q17-1010)


### Bonus: Examples of semantic and syntactic relationships

<center>
<img src="files/images/word_analogies2.png" width="800" height="800">
(Credit: Mikolov 2013)
</center>

### Bonus: Links for pre-trained embeddings
- [GloVe](https://nlp.stanford.edu/projects/glove/)
- [fastText pre-trained embeddings for 294 languages](https://fasttext.cc/docs/en/pretrained-vectors.html) 

### Bonus: Fun tools
[wevi: word embedding visual inspector](https://ronxin.github.io/wevi/)

## Reflection (~10 mins)

### Reflection: Activities 






### My vision for MDS


### My vision for CS

- 


### Vision 

- TBA

### Questions?