In [None]:
#@title Install Required Packages

# On your local machine, uncomment the following lines
# !pip install -qU torch
# !pip install -qU numpy
# !pip install -qU pandas

!pip install -qU transformers



In [None]:
#@title Load Packages

from transformers import AutoTokenizer, AutoModelForMaskedLM

from pprint import pprint
from IPython import display

# Word Embedding

Machine learning models see data differently from how we (humans) do. For example, we can easily understand the text "I saw a cat", but our models cannot.
They need vectors of features. Such vectors, called word embeddings, are representations of words that can be fed into your model.

<br/>

In practice, you have a vocabulary of allowed words; you choose this vocabulary in advance. For each vocabulary word, a look-up table contains its embedding. This embedding can be found using the word's index in the vocabulary.

<br/>

To account for unknown words (the ones that are not in the vocabulary), a vocabulary usually contains a special token, UNK. Alternatively, unknown tokens can be ignored or assigned a zero vector.
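As a minimal sketch of the look-up described above, the cell below builds a toy vocabulary with an UNK entry and a small embedding table. The vocabulary, the 3-dimensional vectors, and the helper name `embed` are assumptions made up for illustration; in a real model the table values are learned.

```python
# Toy vocabulary chosen in advance; index 0 is reserved for unknown words.
vocab = ["UNK", "i", "saw", "a", "cat"]
word_to_index = {word: index for index, word in enumerate(vocab)}

# Look-up table: one row (embedding) per vocabulary word.
# These numbers are invented for the example, not trained values.
embedding_table = [
    [0.0, 0.0, 0.0],    # UNK
    [0.1, 0.3, -0.2],   # i
    [0.7, -0.1, 0.4],   # saw
    [0.2, 0.2, 0.1],    # a
    [-0.5, 0.6, 0.3],   # cat
]


def embed(word):
    """Return the embedding of `word`, falling back to the UNK row."""
    index = word_to_index.get(word.lower(), word_to_index["UNK"])
    return embedding_table[index]


# "dog" is not in the vocabulary, so it is mapped to the UNK embedding.
sentence_vectors = [embed(word) for word in "I saw a dog".split()]
print(sentence_vectors)
```

Here the UNK row happens to be a zero vector, which combines two of the options mentioned above; a trained model would typically learn a separate UNK embedding.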



# How Do We Get These Word Vectors?



## Represent as Discrete Symbols: One-hot Vectors

The easiest thing you can do is to represent words as one-hot vectors: for the i-th word in the vocabulary, the vector has 1 in the i-th dimension and 0 in the rest. In machine learning, this is the simplest way to represent categorical features.

One problem is that for large vocabularies these vectors are very long: the vector dimensionality equals the vocabulary size.
More importantly, these vectors know nothing about the words they represent: every pair of distinct words is equally dissimilar.

<br/>
<p align="center">
    <img src="https://hooshvare.s3.ir-thr-at1.arvanstorage.com/one-hot.png" />
    <br/>
    <em>Figure 1: One Hot Encoding</em>
</p>
<br/>
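A short sketch of one-hot encoding, assuming a tiny made-up vocabulary. The helper names `one_hot` and `dot` are hypothetical. The dot product at the end illustrates why these vectors carry no information about meaning: any two different words are orthogonal.

```python
vocab = ["i", "saw", "a", "cat"]


def one_hot(word, vocab):
    """Vector with 1 at the word's vocabulary index and 0 elsewhere."""
    vector = [0] * len(vocab)
    vector[vocab.index(word)] = 1
    return vector


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


print(one_hot("saw", vocab))

# The vector length equals the vocabulary size.
print(len(one_hot("cat", vocab)) == len(vocab))

# Distinct words always have dot product 0, no matter how related they are.
print(dot(one_hot("a", vocab), one_hot("cat", vocab)))
```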

## Distributional Semantics

Words that frequently appear in similar contexts have similar meanings.
Main idea: we need to put information about word contexts into the word representations.






## Count-Based Methods

Count-based methods put this context information into the vectors manually, based on global corpus statistics.

- Co-Occurrence Counts
- Positive Pointwise Mutual Information (PPMI)
- Latent Semantic Analysis (LSA): Understanding Documents
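The first two items in the list can be sketched on a toy corpus. The cell below counts co-occurrences $N(w, c)$ inside a symmetric window and derives PPMI from them, using $\mathrm{PMI}(w, c) = \log \frac{N(w, c) \cdot |D|}{N(w) \cdot N(c)}$, where $|D|$ is the total number of (word, context) pairs. The corpus, the window size, and the function name `ppmi` are assumptions chosen for illustration.

```python
import math
from collections import Counter

# Toy corpus (assumption): two tokenized sentences.
corpus = [["i", "saw", "a", "cat"], ["i", "saw", "a", "dog"]]
window = 1  # number of context words taken on each side

# Co-occurrence counts N(w, c).
pair_counts = Counter()
for sentence in corpus:
    for i, word in enumerate(sentence):
        start = max(0, i - window)
        end = min(len(sentence), i + window + 1)
        for j in range(start, end):
            if j != i:
                pair_counts[(word, sentence[j])] += 1

# Marginal counts N(w), N(c) and the total number of pairs |D|.
total = sum(pair_counts.values())
word_counts = Counter()
context_counts = Counter()
for (w, c), n in pair_counts.items():
    word_counts[w] += n
    context_counts[c] += n


def ppmi(w, c):
    """Positive PMI: max(0, PMI); pairs that never co-occur get 0."""
    n = pair_counts[(w, c)]
    if n == 0:
        return 0.0
    pmi = math.log((n * total) / (word_counts[w] * context_counts[c]))
    return max(0.0, pmi)


print(pair_counts[("saw", "a")])
print(ppmi("a", "cat"))
print(ppmi("cat", "dog"))
```

Each row of the resulting count or PPMI matrix can serve as a word vector; LSA additionally reduces the dimensionality of such a matrix.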

## Word2Vec: a Prediction-Based Method

Learn word vectors by teaching them to predict contexts.
Word2Vec is a model whose parameters are word vectors. These parameters are optimized iteratively for a certain objective. The objective forces word vectors to "know" contexts a word can appear in: the vectors are trained to predict possible contexts of the corresponding words.

Word2Vec is an iterative method. Its main idea is as follows:


- Take a huge text corpus
- Go over the text with a sliding window, moving one word at a time. At each step, there is a central word and context words (other words in this window)
- For the central word, compute probabilities of context words
- Adjust the vectors to increase these probabilities
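The sliding-window step above can be sketched as follows. This covers only how central and context words are picked, not the probability computation or the vector updates. The function name `window_pairs` and the window size are assumptions for illustration.

```python
def window_pairs(tokens, window_size=2):
    """For every position, return (central word, list of context words)."""
    pairs = []
    for center_pos, center in enumerate(tokens):
        start = max(0, center_pos - window_size)
        end = min(len(tokens), center_pos + window_size + 1)
        context = [tokens[j] for j in range(start, end) if j != center_pos]
        pairs.append((center, context))
    return pairs


tokens = ["i", "saw", "a", "cute", "cat"]
for center, context in window_pairs(tokens, window_size=1):
    print(center, "->", context)
```

Near the edges of the text the window is simply truncated, so the first and last words have fewer context words.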


Word2Vec variants: Skip-Gram and CBOW

- **Skip-Gram** predicts context words given the central word. Skip-Gram with negative sampling is the most popular approach.
- **CBOW** (Continuous Bag-of-Words) predicts the central word from the sum of context vectors. This simple sum of word vectors is called "bag of words", which gives the name for the model.
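To make the difference between the two variants concrete, the sketch below builds the training examples each one would see for a single window position. The sentence, the window size, and the 2-dimensional context vectors are made-up assumptions; real vectors are learned parameters.

```python
tokens = ["i", "saw", "a", "cute", "cat"]
center_pos, window_size = 2, 1

center = tokens[center_pos]
start = max(0, center_pos - window_size)
end = min(len(tokens), center_pos + window_size + 1)
context = [tokens[j] for j in range(start, end) if j != center_pos]

# Skip-Gram: one (input, target) example per context word,
# each predicting a context word from the central word.
skip_gram_examples = [(center, context_word) for context_word in context]

# CBOW: a single example that predicts the central word
# from the sum of the context word vectors (toy values below).
context_vectors = {"saw": [1.0, 0.5], "cute": [0.5, -1.0]}
cbow_input = [sum(dims) for dims in zip(*(context_vectors[c] for c in context))]
cbow_example = (cbow_input, center)

print(skip_gram_examples)
print(cbow_example)
```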

<br/>
<p align="center">
    <img src="https://hooshvare.s3.ir-thr-at1.arvanstorage.com/w2v.jpg" />
    <br/>
    <em>Figure 2: Word2Vec</em>
</p>
<br/>






## GloVe: Global Vectors for Word Representation

The GloVe model is a combination of count-based methods and prediction-based methods (e.g., Word2Vec). The model name, GloVe, stands for "Global Vectors", which reflects its idea: the method uses global information from the corpus to learn vectors.

The simplest count-based method uses co-occurrence counts to measure the association between word $w$ and context $c$: $N(w, c)$. GloVe also uses these counts to construct the loss function.


Similar to Word2Vec, we also have different vectors for central and context words - these are our parameters. Additionally, the method has a scalar bias term for each word vector.


What is especially interesting is the way GloVe controls the influence of rare and frequent words: the loss for each pair $(w, c)$ is weighted in such a way that

- rare events are penalized
- very frequent events are not over-weighted
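A minimal sketch of this weighting, assuming the weighting function from the GloVe paper, $f(x) = (x / x_{max})^{\alpha}$ for $x < x_{max}$ and $1$ otherwise, with the paper's suggested values $x_{max} = 100$ and $\alpha = 0.75$. The per-pair loss is a weighted squared error between the model score (dot product plus the two biases) and $\log N(w, c)$. The function names are hypothetical.

```python
import math


def glove_weight(count, x_max=100.0, alpha=0.75):
    """Down-weight rare pairs and cap the weight of very frequent ones."""
    if count < x_max:
        return (count / x_max) ** alpha
    return 1.0


def pair_loss(dot_product, bias_w, bias_c, count):
    """Weighted squared error for one (w, c) pair with count N(w, c) > 0."""
    error = dot_product + bias_w + bias_c - math.log(count)
    return glove_weight(count) * error ** 2


for count in (1, 10, 100, 1000):
    print(count, glove_weight(count))
```

The weight grows with the count up to `x_max` and stays at 1 beyond it, so rare pairs contribute less to the loss while very frequent pairs do not dominate it.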



## FastText

In 2016, the Facebook research team proposed a method and released a library for both learning word representations and sentence classification.


FastText differs from other word representation methods such as Skip-Gram, CBOW, and GloVe, which treat every word as the smallest unit whose vector representation is to be found. FastText instead assumes that a word is formed from character n-grams; for example, sunny is composed of $[sun, sunn, sunny], [sunny, unny, nny]$, etc., where $n$ can range from 1 to the length of the word.
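The character n-gram decomposition can be sketched as below. The function name `char_ngrams` and its default of `min_n=3` are assumptions for illustration; this is a simplified view, not the FastText library's own implementation.

```python
def char_ngrams(word, min_n=3, max_n=None):
    """All character n-grams of `word` with min_n <= n <= max_n."""
    if max_n is None:
        max_n = len(word)  # by default go up to the full word
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(word) - n + 1):
            ngrams.append(word[start:start + n])
    return ngrams


print(char_ngrams("sunny"))
```

Because n-grams are shared across words, a word that is missing from the vocabulary still overlaps with known words through its n-grams.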