## Word Embedding
**Word embedding** is a technique used in natural language processing (NLP) to represent words as vectors of real numbers that can be used as input to machine learning algorithms. Each word is mapped to a point in a continuous vector space, typically with tens to a few hundred dimensions, so that the vector captures aspects of the word's meaning and the contexts it appears in. This allows NLP models to compare and analyze words numerically. Word embeddings are often used for tasks such as sentiment analysis, text classification, and machine translation. Popular word embedding techniques include Word2Vec, GloVe, and FastText.

**BOW** and **TF-IDF** have certain limitations:
  1. The vectors are as long as the vocabulary, so they can consume a lot of memory and compute resources. The representation is also **sparse**: most of the values are zero.
  2. Neither of them captures the meaning of the words properly.
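
To make the sparsity point concrete, here is a minimal bag-of-words sketch in plain Python. The three-sentence corpus and the helper name `bow_vector` are made up for illustration; a real pipeline would more likely use a library vectorizer and a much larger vocabulary.

```python
# Toy corpus; the sentences are illustrative only.
corpus = [
    "the dog chased the cat",
    "the cat ate a banana",
    "my computer is slow today",
]

# Vocabulary: one vector position per distinct word in the corpus.
vocab = sorted({word for sentence in corpus for word in sentence.split()})
index = {word: i for i, word in enumerate(vocab)}


def bow_vector(sentence):
    """Count how often each vocabulary word occurs in the sentence."""
    vector = [0] * len(vocab)
    for word in sentence.split():
        vector[index[word]] += 1
    return vector


vectors = [bow_vector(sentence) for sentence in corpus]

# Fraction of entries that are zero across all sentence vectors.
zeros = sum(vector.count(0) for vector in vectors)
sparsity = zeros / (len(vectors) * len(vocab))

print("vocabulary size:", len(vocab))
print("first vector:", vectors[0])
print(f"fraction of zeros: {sparsity:.2f}")
```

Even with only three short sentences, more than half of the entries are zero. The vector length grows with the vocabulary, so on a realistic corpus almost every entry is zero.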
    
### Word embeddings try to address both of these shortcomings.

In **Word Embedding:**
  1. Similar words, or similar sentences, will have similar vectors.
  2. The vectors have far fewer dimensions, and they are **dense representations**, meaning most of the values are non-zero.

<img src = "img.png" width = "600px" height = "500px"></img>

* There are various **Word Embedding** techniques:
   1. Based on CBOW and skip-gram (Word2Vec, FastText) or on global co-occurrence counts (GloVe), we can build the following word embeddings:
       - Word2Vec
       - GloVe
       - FastText

   2. Transformer-based embedding techniques (these are among the more recent improvements in NLP):
       - BERT
       - GPT

   3. Based on LSTMs we have the following embedding technique:
       - ELMo
    

<img src = "img1.png" width = "600px" height = "500px"></img>

* The main idea behind these techniques is that they convert a word or sentence into a vector representation that captures its meaning. Not only that, we can also do arithmetic with the words.

* There are variations of each technique depending on the training dataset. For example, **Word2Vec** can be trained on the Google News dataset or on a different corpus. The same goes for **GloVe**: **glove-twitter-50**, which is available through the gensim library, is trained on tweets, so it handles tweets better (slang, abbreviations, ...). Another one is **glove-wiki-gigaword-100**. There are many such models, but the basic techniques are **GloVe**, **Word2Vec**, **FastText** and so on.

<img src = "img2.png" width = "600px" height = "500px"></img>

* Similarly, transformer-based models such as BERT come in variants depending on the dataset they are trained on, for example **BioBERT** or **FinBERT**. FinBERT is trained on financial text and BioBERT is trained on biomedical text. Changes to BERT's architecture or training procedure give models like **ALBERT** or **RoBERTa**.

<img src = "img3.png" width = "600px" height = "500px"></img>

* To summarize, the whole purpose of these techniques is to convert words into vectors. ML models don't understand text; they need numbers, and these are ways to turn text into numbers.
* We can convert a single word, an entire sentence, or an entire paragraph into a vector.
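
One simple way to get from word vectors to a sentence vector is to average them, which is also what spaCy's `Doc.vector` does by default. The sketch below uses made-up 3-dimensional vectors and a hypothetical helper `sentence_vector`. Skipping unknown words is a choice made here for illustration, not a description of any particular library.

```python
# Hypothetical 3-dimensional word vectors; real models use hundreds of dimensions.
word_vectors = {
    "dogs": [0.9, 0.1, 0.0],
    "chase": [0.2, 0.8, 0.1],
    "cats": [0.8, 0.2, 0.1],
}


def sentence_vector(sentence, vectors, dim=3):
    """Average the vectors of the known words; unknown words are skipped."""
    known = [vectors[w] for w in sentence.lower().split() if w in vectors]
    if not known:
        # No known word at all: fall back to an all-zero vector.
        return [0.0] * dim
    # zip(*known) groups the values of each dimension together.
    return [sum(column) / len(known) for column in zip(*known)]


print(sentence_vector("Dogs chase cats", word_vectors))
print(sentence_vector("Dogs chase cats quickly", word_vectors))
```

Both calls print the same vector, because "quickly" is not in the toy table and is ignored. The same averaging idea extends to paragraphs, although averaging discards word order.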

<img src = "img4.png" width = "600px" height = "500px"></img>

* To use **Word Embedding** in **spaCy** we need to load the large 'en_core_web_lg' or the medium 'en_core_web_md' model. The difference between the two is that the large model has more vectors than the medium model.

In [3]:
import spacy
# Word vectors occupy a lot of space, hence the en_core_web_sm model does not include them.
# Make sure you have run "python -m spacy download en_core_web_lg" to install the large English model.
nlp = spacy.load("en_core_web_lg")

In [4]:
# Let's say we want to inspect the word vectors of different words such as 'dog', 'cat', 'banana'.
# 'token.has_vector' shows whether the token has a vector.
# 'token.is_oov' shows whether the token is out of vocabulary (OOV).

doc = nlp("dog cat banana Baktash computer")

for token in doc:
    print(token.text, "Vector: ", token.has_vector, "OOV: ", token.is_oov)  

dog Vector:  True OOV:  False
cat Vector:  True OOV:  False
banana Vector:  True OOV:  False
Baktash Vector:  False OOV:  True
computer Vector:  True OOV:  False


* So we see that general words such as 'dog', 'cat', 'banana' and 'computer' have vectors, but a rare word like 'Baktash' doesn't. The reason is that the data the vectors were trained on probably did not contain the word 'Baktash', or contained it too rarely.
* **spaCy**'s English models ship with pretrained vectors (**GloVe**, global vectors, in some releases) trained on large, general-purpose English datasets. They capture general English knowledge, so they don't know what the word 'Baktash' means!
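
The out-of-vocabulary behaviour can be pictured as a plain table lookup with a fallback. The sketch below is not spaCy's implementation: the table `embeddings`, the constant `DIM` and the helper `lookup` are hypothetical, and returning an all-zero vector for unknown words is an assumption chosen to mirror what a static vector table typically does.

```python
# Hypothetical embedding table; the values are made up for illustration.
embeddings = {
    "dog": [0.9, 0.1],
    "cat": [0.8, 0.2],
}
DIM = 2  # dimensionality of the toy vectors


def lookup(word):
    """Return (vector, is_oov); out-of-vocabulary words get an all-zero vector."""
    if word in embeddings:
        return embeddings[word], False
    return [0.0] * DIM, True


for word in ["dog", "Baktash"]:
    vector, is_oov = lookup(word)
    print(word, vector, "OOV:", is_oov)
```

An all-zero vector carries no information, so similarity scores involving an OOV word are not meaningful. Subword-based techniques such as FastText reduce this problem by building vectors from character n-grams.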

In [5]:
# If you want to print a particular vector, you can select the token by index.
doc[0].vector   # the vector for the word 'dog'

array([ 1.2330e+00,  4.2963e+00, -7.9738e+00, -1.0121e+01,  1.8207e+00,
        1.4098e+00, -4.5180e+00, -5.2261e+00, -2.9157e-01,  9.5234e-01,
        6.9880e+00,  5.0637e+00, -5.5726e-03,  3.3395e+00,  6.4596e+00,
       -6.3742e+00,  3.9045e-02, -3.9855e+00,  1.2085e+00, -1.3186e+00,
       -4.8886e+00,  3.7066e+00, -2.8281e+00, -3.5447e+00,  7.6888e-01,
        1.5016e+00, -4.3632e+00,  8.6480e+00, -5.9286e+00, -1.3055e+00,
        8.3870e-01,  9.0137e-01, -1.7843e+00, -1.0148e+00,  2.7300e+00,
       -6.9039e+00,  8.0413e-01,  7.4880e+00,  6.1078e+00, -4.2130e+00,
       -1.5384e-01, -5.4995e+00,  1.0896e+01,  3.9278e+00, -1.3601e-01,
        7.7732e-02,  3.2218e+00, -5.8777e+00,  6.1359e-01, -2.4287e+00,
        6.2820e+00,  1.3461e+01,  4.3236e+00,  2.4266e+00, -2.6512e+00,
        1.1577e+00,  5.0848e+00, -1.7058e+00,  3.3824e+00,  3.2850e+00,
        1.0969e+00, -8.3711e+00, -1.5554e+00,  2.0296e+00, -2.6796e+00,
       -6.9195e+00, -2.3386e+00, -1.9916e+00, -3.0450e+00,  2.48

In [6]:
# To get the size (dimension) of the vector:
doc[0].vector.shape  # the output shows it is a 300-dimensional vector

(300,)

* The **spaCy large model** comes with built-in vectors, so as soon as we created the document with 'doc = nlp("dog ...")' we were able to get the vectors for those words.

In [7]:
# Now let's compare these vectors and see what kind of similarity we get between these words!
# We'll create another document for the word 'bread'.
base_token = nlp("bread")
base_token.vector.shape

(300,)

In [8]:
# Now we want to compare the 'bread' vector with other words such as 'sandwich'.
# 'bread', 'sandwich' and 'burger' are related, so we expect their similarity to be
# higher, and the similarity with the other words to be lower.
# The similarity reflects the contexts the words appear in: words that occur in similar
# contexts get similar vectors. For example, 'loss' and 'profit' will tend to have a high
# similarity. Someone may ask why they are similar when they are opposites of each other.
# The answer is that this similarity is based on the context of the words.
doc = nlp("bread sandwich burger car tiger human wheat")

for token in doc:
    print(f"{token.text} <-> {base_token.text}:", token.similarity(base_token))

bread <-> bread: 1.0
sandwich <-> bread: 0.6341067010130894
burger <-> bread: 0.47520687769584247
car <-> bread: 0.06451532596945217
tiger <-> bread: 0.04764611272488976
human <-> bread: 0.2151154210812192
wheat <-> bread: 0.615036141030184
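
By default spaCy's `similarity` is a cosine similarity between the two vectors. The sketch below computes the same quantity by hand in plain Python so the formula is visible. The 3-dimensional vectors are invented for illustration and are not taken from the model.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # The cosine is undefined for a zero vector; 0.0 is returned by convention here.
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up toy vectors: 'bread' and 'sandwich' point in a similar direction, 'car' does not.
bread = [0.9, 0.8, 0.1]
sandwich = [0.8, 0.9, 0.2]
car = [0.1, 0.0, 0.9]

print("bread <-> sandwich:", round(cosine(bread, sandwich), 3))
print("bread <-> car:", round(cosine(bread, car), 3))
```

Cosine similarity depends only on the angle between the vectors, not on their length, which is why a word compared with itself always scores 1.0.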


In [9]:
# Now we'll create a function for words similarity.

def print_similarity(base_word, words_to_compare):
    base_token = nlp(base_word)
    doc = nlp(words_to_compare)
    for token in doc:
        print(f"{token.text} <-> {base_token.text}: ", token.similarity(base_token))

In [10]:
# We'll compare 'iphone' with the words 'apple', 'samsung', ...
print_similarity("iphone", "apple samsung iphone dog kitten")

apple <-> iphone:  0.4387907748060368
samsung <-> iphone:  0.6708590303423401
iphone <-> iphone:  1.0
dog <-> iphone:  0.08211864228011527
kitten <-> iphone:  0.10222317834969896


In [11]:
# Let's do some arithmetic with these word vectors:
king = nlp.vocab["king"].vector
man = nlp.vocab["man"].vector
woman = nlp.vocab["woman"].vector
queen = nlp.vocab["queen"].vector

result = king - man + woman

In [12]:
# Now if we compare the 'result' vector with the 'queen' vector, they should be fairly similar.
# To measure the similarity we'll use cosine similarity.

from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity([result], [queen])

array([[0.6178014]], dtype=float32)
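
A common way to read such an analogy is to look for the vocabulary word whose vector is closest to `king - man + woman`, leaving out the three query words. The sketch below does this over a tiny hand-made vocabulary. The vectors are invented so that the dimensions roughly mean (royalty, male, female), and the helper `nearest` is hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Made-up vectors; dimensions loosely stand for (royalty, male, female).
toy_vectors = {
    "king": [0.9, 0.9, 0.1],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "queen": [0.9, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

# king - man + woman, computed dimension by dimension.
target = [
    k - m + w
    for k, m, w in zip(toy_vectors["king"], toy_vectors["man"], toy_vectors["woman"])
]


def nearest(vector, vectors, exclude):
    """Return the word with the highest cosine similarity, skipping excluded words."""
    candidates = (word for word in vectors if word not in exclude)
    return max(candidates, key=lambda word: cosine(vector, vectors[word]))


best = nearest(target, toy_vectors, exclude={"king", "man", "woman"})
print("king - man + woman is closest to:", best)
```

The toy vectors are built so the analogy works out cleanly. With real embeddings the match is only approximate, as the similarity of about 0.62 above shows.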