# Embedding

## Why Embedding?
As we know, machines can't handle text directly; they can only handle numbers. So how do we convert a word into numbers?

The most naive approach would be to take a list of all the words in your text and assign a number to each of them. It will work, but you can imagine that some problems will appear:
* How do you handle unknown words? 
* Suppose your text contains `doctor`, `nurse`, and `candy`. `doctor` and `nurse` are strongly related, but `candy` is not. How can we make the machine understand that? With our naive technique, `doctor` could be assigned the number `5` and `nurse` the number `98767`.

Of course, a lot of people have already spent time on those problems. The solution that came out of it is called "embedding".

## What are embeddings?

An embedding is a **VECTOR** which represents a word or a document.

Each token is assigned a vector. Each vector has multiple dimensions (usually tens or hundreds).

```
[...] associate with each word in the vocabulary a distributed word feature vector [...] The feature vector represents different aspects of the word: each word is associated with a point in a vector space. The number of features [...] is much smaller than the size of the vocabulary.
```
- [A Neural Probabilistic Language Model](https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf), 2003.

Long story short, embeddings convert words into vectors in a way that allows the machine to measure the similarity between them.

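Each embedding model has its own way of placing words in the vector space. As a mental picture, you can think of each dimension as a broad category, with every word getting a score for each category. In real models the dimensions are learned from data and rarely match a category a human could name.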

To take a simple example, the word `mother` could be classified like this:

|        | female | family | human | animal |
|--------|--------|--------|-------|--------|
| mother | 0.9    | 0.9    | 0.7   | 0.1    |

**Explanations:** Mother has a strong similarity with female, family and human but it has a low similarity with animal.

**Disclaimer:** Those numbers and categories are totally arbitrary and are only here to show an example.

Here is another example with more complete data:

![embedding](https://miro.medium.com/max/2598/1*sAJdxEsDjsPMioHyzlN3_A.png)

## Should I do it by hand?

You could, but if some people already did the job for you and spent a lot of time to optimize it, why not use it?

## What to use?

There are a lot of libraries out there for embeddings. Which one is the best? Once again, *it depends*. The results will change depending on the text you are using, the information you want to extract, the model you use, and so on.

Choosing the "best" embedding model is part of the fine-tuning that you can do at the end of a project.

If you want to understand embeddings in more depth, [follow this link](http://jalammar.github.io/illustrated-word2vec/) or watch this [video](https://www.youtube.com/watch?v=gQddtTdmG_8).

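Here are some popular starting points:

* [Gensim](https://pypi.org/project/gensim/), a Python library that ships several embedding models and pretrained vectors
* [Word2Vec](https://www.tensorflow.org/tutorials/text/word2vec), an embedding algorithm, linked here through a TensorFlow tutorial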

This next bit of code loads a pretrained model for practice:

In [None]:
import os
import gensim.downloader as api
from gensim.models import KeyedVectors
import math
import numpy as np

# Name of the pretrained model in the gensim-data catalogue
model_name = "glove-wiki-gigaword-300"
# Path where you want to store/load the model
model_path = os.path.join("data", model_name + ".kv")

# Load the model from disk if it exists, else download and save it
if os.path.exists(model_path):
    print("Loading model from local file...")
    model = KeyedVectors.load(model_path)
else:
    print("Downloading model...")
    model = api.load(model_name)
    os.makedirs("data", exist_ok=True)  # make sure the target folder exists
    model.save(model_path)
    print("Model downloaded and saved.")

## Practice time!

Enough reading, let's practice a bit with this sentence:

In [None]:
sentence = "I love learning"

What do the word vectors look like? What is their size? What is their [magnitude](https://numpy.org/doc/2.1/reference/generated/numpy.linalg.norm.html)?

In [None]:
# Your code here

## Maths on text

Since the words are embedded into vectors, we can now apply mathematical operations to them.

### Average vector

For example we could build the average vector for a text by using NumPy! This is a straightforward way to build one single representation for a text.

- Apply a gensim model on the text
- Get all word vectors into a list
- Compute and display the average vector of the list
- Find the words closest to it using the gensim `most_similar` method
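
To make the averaging step concrete without giving away the exercise, here is a minimal sketch on toy data. The three-dimensional vectors in `toy_vectors` and the helper `average_vector` are hypothetical and written only for this illustration; they do not come from the GloVe model. With NumPy, the same element-wise mean is what `np.mean(vectors, axis=0)` computes.

```python
# Hypothetical 3-dimensional embeddings, made up for this sketch.
# Real models use hundreds of dimensions.
toy_vectors = {
    "data": [0.5, 0.5, 0.0],
    "scientist": [1.0, 0.0, 0.5],
    "famous": [0.0, 1.0, 1.0],
}


def average_vector(tokens, vectors):
    """Return the element-wise mean of the vectors of all known tokens."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("none of the tokens has a vector")
    # zip(*known) groups the values dimension by dimension
    return [sum(dim) / len(known) for dim in zip(*known)]


tokens = "a famous data scientist".lower().split()
avg = average_vector(tokens, toy_vectors)
print(avg)  # [0.5, 0.5, 0.5]
```

Note that `a` is silently skipped because it has no vector in the toy dictionary. Whether to skip or fail on unknown words is a design choice you will also face with a real model.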

In [None]:
text = "I want to be a famous data scientist"

# Your code here


### Word similarity

We can also compare two words by using similarity or distance measures (e.g. [cosine similarity](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html), [Euclidean distance](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.euclidean_distances.html)...). These measures quantify how close two word embeddings are in the vector space.
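
As a reference for what the two measures compute, here is a plain-Python sketch of both formulas. The function names and the two-dimensional vectors `u` and `v` are assumptions made for this illustration; in practice you would more likely rely on the scikit-learn functions linked above.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Length of the straight line between the points u and v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy vectors, chosen only so the arithmetic is easy to follow.
u = [3.0, 4.0]
v = [4.0, 3.0]

print(cosine_similarity(u, v))   # 0.96
print(euclidean_distance(u, v))  # square root of 2
```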

Identify the fundamental difference between these two metrics when it comes to assessing similarity between vectors.

#### Let's practice!

- Compute the cosine similarity and the Euclidean distance between each pair of these 4 words, and display the results as a similarity table with matplotlib and/or seaborn
- Assess which words are the most similar and the most dissimilar

In [None]:
words = ["computer","keyboard","water","ocean"]

#Your code here

## Combining things together

This next bit of code uses the gensim library to allow you to perform arithmetic operations on vectors. Things you may want to try:

Silly additions:
 - man + hair

Checking for some more abstractions:
 - hair - woman + man
 - mice - home + city
 - children - child + goose
 - paris - france + belgium

Bonus points if you can write a function that handles any combination of additions and subtractions on word vectors.

In [None]:
# Words in `positive` are added, words in `negative` are subtracted
equals = model.most_similar(positive=['king', 'woman'], negative=['man'])[0][0]
print(f"'king' - 'man' + 'woman' = '{equals}'")

#THE REST OF YOUR CODE HERE

When you play with these examples (or others), you quickly notice both the powerful levels of abstraction and the gaping limitations.
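
To see the idea behind this kind of vector arithmetic, here is a simplified sketch on toy data. It is not gensim's actual implementation: the three-dimensional vectors and the helpers `combine`, `cosine`, and `nearest` are hypothetical, and the toy dimensions were hand-picked (roughly "royalty", "male", "female") so that the analogy works out exactly.

```python
import math

# Hypothetical embeddings with hand-picked dimensions: (royalty, male, female)
toy_vectors = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "child": [0.0, 0.5, 0.5],
}


def combine(positive, negative, vectors):
    """Add the vectors of `positive` words and subtract those of `negative` words."""
    dims = len(next(iter(vectors.values())))
    result = [0.0] * dims
    for word in positive:
        result = [r + x for r, x in zip(result, vectors[word])]
    for word in negative:
        result = [r - x for r, x in zip(result, vectors[word])]
    return result


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def nearest(target, vectors, exclude=()):
    """Return the word whose vector is most similar to `target`, skipping `exclude`."""
    candidates = [w for w in vectors if w not in exclude]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


query = combine(["king", "woman"], ["man"], toy_vectors)
answer = nearest(query, toy_vectors, exclude={"king", "woman", "man"})
print(answer)  # queen
```

The input words are excluded from the candidates because the combined vector usually stays very close to them. Real embeddings are far noisier than this toy space, which is one reason the results on a real model are sometimes surprising.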

## More resources
* [Why do we use word embeddings in NLP?](https://towardsdatascience.com/why-do-we-use-embeddings-in-nlp-2f20e1b632d2)
* [More details on what word embeddings are exactly?](https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/)