# Word Embeddings


A word embedding is a learned representation of text in which words with similar meanings have similar representations.
![image.png](attachment:image.png)


# Word Embedding Algorithms
Word embedding methods learn a real-valued vector representation for a predefined, fixed-size vocabulary from a corpus of text.

The learning process is either joint with a neural network model on some task, such as document classification, or unsupervised, using document statistics.



# Some of the popular word embedding methods are:
Binary Encoding.

TF Encoding.

TF-IDF Encoding.

Latent Semantic Analysis Encoding.

Word2Vec Embedding.

# Binary Encoding.
Binary encoding is a combination of hash encoding and one-hot encoding. In this encoding scheme, the categorical feature is first converted to numbers using an ordinal encoder, and the numbers are then written in binary. ... Binary encoding works well when there is a large number of categories.
![image.png](attachment:image.png)
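The two steps described above (ordinal encoding, then binary digits) can be sketched in a few lines. This is a minimal illustration, not a library implementation: the function name `binary_encode`, the sample colour categories, and the choice to number categories from 1 in order of first appearance are all assumptions made for the example.

```python
def binary_encode(categories):
    """Map each distinct category to a fixed-width list of bits."""
    # Step 1: ordinal encoding (assumed here: 1, 2, 3, ... in order of first appearance).
    ordinal = {}
    for category in categories:
        if category not in ordinal:
            ordinal[category] = len(ordinal) + 1

    # Step 2: number of bits needed to write the largest ordinal in base 2.
    width = max(1, len(ordinal).bit_length())

    # Step 3: write every ordinal as a zero-padded binary number, one bit per column.
    return {c: [int(bit) for bit in format(i, f"0{width}b")] for c, i in ordinal.items()}


# Hypothetical sample data: five distinct categories fit into three bit columns.
codes = binary_encode(["red", "green", "blue", "green", "yellow", "black"])
for category, bits in codes.items():
    print(category, bits)
```

Five categories need only three columns here, whereas one-hot encoding would need five, which is why the scheme scales well to many categories.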

# TF-IDF Encoding.
TF-IDF is a statistical measure of how significant a word is to a document within a collection of documents[2]. The vectorization process is similar to One Hot Encoding, except that the position corresponding to a word holds its TF-IDF value instead of 1. The TF-IDF value is obtained by multiplying the TF and IDF values. As an example, let’s find the TF-IDF values for 3 documents, each consisting of 1 sentence.

<b>[He is Walter]

[He is William]

[He isn’t Peter or September]</b>
    
## TF (Term Frequency)
    
    

In the simplest terms, term frequency is the ratio of the number of occurrences of the target term in the document to the total number of terms in the document. For the example above, the TF values of the terms in each document are:
    
    

    
<b>[0.33, 0.33, 0.33]
    
[0.33, 0.33, 0.33]
    
[0.20, 0.20, 0.20, 0.20, 0.20]</b>
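The TF values above can be reproduced with a short sketch. It assumes that terms are obtained by splitting on whitespace, so "isn't" counts as a single term; the helper name `term_frequencies` is made up for the example.

```python
docs = ["He is Walter", "He is William", "He isn't Peter or September"]


def term_frequencies(doc):
    """Return {term: occurrences of term / total terms} for one document."""
    tokens = doc.split()  # assumption: whitespace tokenization
    return {term: tokens.count(term) / len(tokens) for term in tokens}


tf = [term_frequencies(doc) for doc in docs]
for row in tf:
    print({term: round(value, 2) for term, value in row.items()})
```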

## IDF (Inverse Document Frequency)
The IDF value is the logarithm of the ratio of the total number of documents to the number of documents in which the target term occurs. At this stage, it does not matter how many times the term appears in a document; it is enough to determine whether it occurs at all. In this example the logarithm is taken to base 10, but other bases can be used as well.
    
<b>“He”: log(3/3) = 0,

“is”: log(3/2) = 0.1761,

“or, Peter, ..”: log(3/1) = 0.4771</b>
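As a quick check of the IDF values, here is a minimal sketch using the base-10 logarithm from the text. It assumes whitespace tokenization, under which "isn't" is a different term from "is"; the helper name `idf` is illustrative.

```python
import math

docs = ["He is Walter", "He is William", "He isn't Peter or September"]
tokenized = [set(doc.split()) for doc in docs]  # assumption: whitespace tokenization


def idf(term):
    """log10(total documents / documents containing the term)."""
    document_frequency = sum(1 for tokens in tokenized if term in tokens)
    return math.log10(len(tokenized) / document_frequency)


for term in ["He", "is", "Peter"]:
    print(term, round(idf(term), 4))
```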

    
Thus, both the TF and IDF values have been obtained. To vectorize with these values, a vector with one element per unique word across all documents is first created for each document (in this example, there are 8 unique terms). At this stage, there is a problem to be solved. As seen with the term “He”, an IDF value of 0 makes the TF-IDF value 0 as well. However, words that do not occur in a document (for example, “Peter” does not occur in the 1st sentence) are also assigned 0 during vectorization. To avoid this ambiguity, TF-IDF values are smoothed for vectorization. The most common method is to add 1 to the obtained values. Depending on the purpose, normalization can be applied to these values later. The vectors below are consistent with adding 1 to each IDF value and weighting it by the raw term count, which is 1 for every term in this example, rather than by the length-normalized TF values above. With the columns ordered as [He, is, Walter, William, isn’t, Peter, or, September], the result is:
<b>    
[1. , 1.1761 , 1.4771 , 0. , 0. , 0. , 0. , 0.],
    
[1. , 1.1761 , 0. , 1.4771 , 0. , 0. , 0. , 0.],
    
[1. , 0. , 0. , 0. , 1.4771 , 1.4771, 1.4771 , 1.4771],    
    </b>
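The three vectors above can be reproduced with the sketch below. The weighting `count * (1 + IDF)` is inferred from the printed numbers rather than stated in the text, so treat it as an assumption; the column order (first appearance in the corpus) and whitespace tokenization are assumptions as well.

```python
import math

docs = ["He is Walter", "He is William", "He isn't Peter or September"]
tokenized = [doc.split() for doc in docs]

# Vocabulary in order of first appearance: one vector position per unique term.
vocab = []
for tokens in tokenized:
    for term in tokens:
        if term not in vocab:
            vocab.append(term)


def idf(term):
    """log10(total documents / documents containing the term)."""
    document_frequency = sum(1 for tokens in tokenized if term in tokens)
    return math.log10(len(tokenized) / document_frequency)


# Assumed weighting: raw count times (1 + IDF). Absent terms stay at 0,
# while a term with IDF 0 that is present (such as "He") gets the value 1.
vectors = [
    [round(tokens.count(term) * (1 + idf(term)), 4) for term in vocab]
    for tokens in tokenized
]

print(vocab)
for vector in vectors:
    print(vector)
```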

# One Hot Encoding
One of the most basic techniques for representing data numerically is One Hot Encoding[1]. In this method, a vector is created whose length equals the total number of unique words. Each word's vector has the value 1 at the index assigned to that word and 0 everywhere else. As an example, Figure 1 can be examined.
![image.png](attachment:image.png)
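A minimal sketch of the idea in plain Python follows. The sample sentence and the helper name `one_hot_vectors` are made up for illustration, and indices are assigned in order of first appearance, which is only one possible convention.

```python
def one_hot_vectors(words):
    """Return {word: one-hot list} with one position per unique word."""
    vocab = []
    for word in words:
        if word not in vocab:
            vocab.append(word)
    # Each vector is all zeros except for a 1 at the word's own index.
    return {
        word: [1 if position == index else 0 for position in range(len(vocab))]
        for index, word in enumerate(vocab)
    }


# Hypothetical sample sentence with five unique words.
encoded = one_hot_vectors("the cat sat on the mat".split())
for word, vector in encoded.items():
    print(word, vector)
```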



# Word2Vec

Word2Vec is a statistical method for efficiently learning a standalone word embedding from a text corpus.
![image-2.png](attachment:image-2.png)

There are two architectures used by Word2vec:

<b>
    
Continuous Bag of words (CBOW)
    
Skip gram</b>

# Continuous Bag of Words (CBOW)
In this model, we try to predict the central word from the neighboring words in the window.
![image.png](attachment:image.png)
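To make the CBOW setup concrete, the sketch below builds the (context words, central word) training pairs that such a model would learn from. It only prepares the data and does not train anything; the function name `cbow_pairs`, the window size, and the sample sentence are illustrative assumptions.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, central_word) pairs for a symmetric window."""
    pairs = []
    for i, central in enumerate(tokens):
        # Up to `window` words on each side of the central word.
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, central))
    return pairs


tokens = "we respect the freedom of others".split()
for context, central in cbow_pairs(tokens, window=1):
    print(context, "->", central)
```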
# Skip Gram
In this model, we try to predict the neighboring words from the central word. It is the reverse of the CBOW model. This method is reported to produce more meaningful embeddings.
![image-2.png](attachment:image-2.png)
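The skip-gram direction can be illustrated in the same way: each central word is paired with every word inside its window, and the model learns to predict the second element of each pair from the first. This is a data-preparation sketch only; `skipgram_pairs`, the window size, and the sample sentence are assumptions for the example.

```python
def skipgram_pairs(tokens, window=2):
    """Return (central_word, neighboring_word) pairs for a symmetric window."""
    pairs = []
    for i, central in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # skip the central word itself
                pairs.append((central, tokens[j]))
    return pairs


tokens = "we respect the freedom".split()
for central, neighbor in skipgram_pairs(tokens, window=1):
    print(central, "->", neighbor)
```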

# Bag of Words
The bag of words approach is one of the simplest word embedding approaches.

We will see the word embeddings it generates with the help of an example. Suppose you have a corpus with three sentences.


<b>S1 = I love rain</b>

 <b>S2 = rain rain go away</b>
 
 <b>S3 = I am away</b>


To convert the above sentences into their corresponding word embedding representations using the bag of words approach, we need to perform the following steps:

Create a dictionary of unique words from the corpus. In the above corpus, we have the following unique words: <b>[I, love, rain, go, away, am]</b>

Parse the sentence. For each word in the sentence, add 1 at that word's position in the dictionary, and leave 0 for all the dictionary words that do not appear in the sentence. For instance, the bag of words representation for sentence S1 (I love rain) looks like this: <b>[1, 1, 1, 0, 0, 0]</b>. 
Similarly for S2 and S3, bag of word representations are <b>[0, 0, 2, 1, 1, 0] </b>and
<b>[1, 0, 0, 0, 1, 1]</b> respectively.

Notice that for S2 we added 2 in place of "rain" in the dictionary; this is because S2 contains "rain" twice.
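The two steps above can be sketched in plain Python as follows. Whitespace tokenization and a dictionary ordered by first appearance are assumptions, chosen because they reproduce the dictionary and vectors given in the text.

```python
corpus = ["I love rain", "rain rain go away", "I am away"]

# Step 1: dictionary of unique words, in order of first appearance.
dictionary = []
for sentence in corpus:
    for word in sentence.split():
        if word not in dictionary:
            dictionary.append(word)

# Step 2: for each sentence, count how often every dictionary word occurs.
bow = [[sentence.split().count(word) for word in dictionary] for sentence in corpus]

print(dictionary)
for vector in bow:
    print(vector)
```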



# GloVe
The Global Vectors for Word Representation (GloVe) algorithm is an extension of the word2vec method for efficiently learning word vectors.
![image.png](attachment:image.png)

# Embedding Layer
An embedding layer, for lack of a better name, is a word embedding that is learned jointly with a neural network model on a specific natural language processing task, such as language modeling or document classification.
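Conceptually, an embedding layer can be pictured as a lookup table with one row of weights per vocabulary index. The sketch below shows only the lookup, with random weights standing in for values that a real layer would learn during training; the sizes, the value range, and the names `embedding_matrix` and `embed` are assumptions for illustration.

```python
import random

random.seed(0)
vocab_size, dim = 10, 4  # hypothetical sizes, kept small for readability

# One row per vocabulary index. In a real layer these weights are updated
# by backpropagation together with the rest of the network.
embedding_matrix = [
    [random.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(vocab_size)
]


def embed(indices):
    """Replace every integer index by its row of the embedding matrix."""
    return [embedding_matrix[i] for i in indices]


sequence = [0, 0, 3, 7, 3]  # a padded sequence of word indices
vectors = embed(sequence)
for index, vector in zip(sequence, vectors):
    print(index, [round(x, 3) for x in vector])
```

Equal indices always map to the same row, so repeated words and padding positions share one vector.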

# Word Embedding Techniques using Embedding Layer in Keras

In [1]:
from tensorflow.keras.preprocessing.text import one_hot

In [3]:

### sentences
sent = ['the glass of milk',
        'the glass of juice',
        'the cup of tea',
        'I am a good boy',
        'I am a good developer',
        'understand the meaning of words',
        'your videos are good']
sent

['the glass of milk',
 'the glass of juice',
 'the cup of tea',
 'I am a good boy',
 'I am a good developer',
 'understand the meaning of words',
 'your videos are good']

In [4]:
### Vocabulary size
voc_size=10000

# One Hot Representation

In [5]:
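# one_hot returns one integer index per word (hashed into the range 1..voc_size-1), not a 0/1 vector
onehot_repr = [one_hot(words, voc_size) for words in sent]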
print(onehot_repr)

[[6, 7453, 925, 8119], [6, 7453, 925, 1437], [6, 6426, 925, 1426], [8383, 3931, 1260, 5111, 8299], [8383, 3931, 1260, 5111, 6417], [3220, 6, 5874, 925, 8067], [9292, 8770, 4750, 5111]]
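Despite its name, Keras `one_hot` returns a list of integer indices: each word is hashed into the range 1 to `voc_size - 1`, which is why "the" maps to 6 in every sentence above. The sketch below imitates that behaviour with an MD5 hash so that results are reproducible across runs. It is an approximation for illustration (the name `hashed_indices` is made up, and punctuation filtering is omitted), so its indices will not match the ones printed above.

```python
import hashlib


def hashed_indices(text, n):
    """Map each word to an index in [1, n - 1] using a stable hash."""
    indices = []
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        # Index 0 is never produced, which leaves it free for padding.
        indices.append(int(digest, 16) % (n - 1) + 1)
    return indices


milk = hashed_indices("the glass of milk", 10000)
juice = hashed_indices("the glass of juice", 10000)
print(milk)
print(juice)
```

Because this is hashing rather than a lookup in a fitted vocabulary, two different words can collide on the same index; a large `n` makes that less likely.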


# Word Embedding Representation

In [7]:

from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential

import numpy as np

In [8]:
sent_length=8
embedded_docs=pad_sequences(onehot_repr,padding='pre',maxlen=sent_length)
print(embedded_docs)

[[   0    0    0    0    6 7453  925 8119]
 [   0    0    0    0    6 7453  925 1437]
 [   0    0    0    0    6 6426  925 1426]
 [   0    0    0 8383 3931 1260 5111 8299]
 [   0    0    0 8383 3931 1260 5111 6417]
 [   0    0    0 3220    6 5874  925 8067]
 [   0    0    0    0 9292 8770 4750 5111]]
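What `pad_sequences` does with `padding='pre'` can be written out by hand. The sketch below is a simplified stand-in, not the Keras implementation: `pad_pre` is a made-up name, and it handles only the case of zero padding on the left and dropping items from the front of over-long sequences.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad every sequence with `value` up to exactly `maxlen` items."""
    padded = []
    for sequence in sequences:
        sequence = list(sequence)[-maxlen:]  # keep the last maxlen items if too long
        padded.append([value] * (maxlen - len(sequence)) + sequence)
    return padded


print(pad_pre([[6, 7453, 925, 8119], [8383, 3931, 1260, 5111, 8299]], maxlen=8))
```

Padding to a common length is what lets the sentences be stacked into a single rectangular array for the embedding layer.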


In [9]:
dim=10

In [10]:

model=Sequential()
model.add(Embedding(voc_size, dim, input_length=sent_length))
model.compile('adam','mse')

In [11]:

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 8, 10)             100000    
Total params: 100,000
Trainable params: 100,000
Non-trainable params: 0
_________________________________________________________________


In [12]:
print(model.predict(embedded_docs))

[[[ 4.16819938e-02 -5.43098524e-03 -1.51553042e-02  1.72118433e-02
    2.07072757e-02  7.37534836e-03  3.63064520e-02  2.81833746e-02
   -9.02490690e-03  4.92810272e-02]
  [ 4.16819938e-02 -5.43098524e-03 -1.51553042e-02  1.72118433e-02
    2.07072757e-02  7.37534836e-03  3.63064520e-02  2.81833746e-02
   -9.02490690e-03  4.92810272e-02]
  [ 4.16819938e-02 -5.43098524e-03 -1.51553042e-02  1.72118433e-02
    2.07072757e-02  7.37534836e-03  3.63064520e-02  2.81833746e-02
   -9.02490690e-03  4.92810272e-02]
  [ 4.16819938e-02 -5.43098524e-03 -1.51553042e-02  1.72118433e-02
    2.07072757e-02  7.37534836e-03  3.63064520e-02  2.81833746e-02
   -9.02490690e-03  4.92810272e-02]
  [-1.69889107e-02 -1.37569755e-03 -3.92897241e-02 -1.78460479e-02
   -4.20581102e-02 -3.44532840e-02  1.49007104e-02  4.55289744e-02
   -7.55148008e-03  2.39036717e-02]
  [ 3.77750434e-02 -4.34159301e-02 -1.61685944e-02  9.92812961e-03
   -4.57546823e-02  2.84096263e-02 -2.33545545e-02 -2.57159602e-02
    3.63175757e-

In [13]:

print(model.predict(embedded_docs)[0])

[[ 0.04168199 -0.00543099 -0.0151553   0.01721184  0.02070728  0.00737535
   0.03630645  0.02818337 -0.00902491  0.04928103]
 [ 0.04168199 -0.00543099 -0.0151553   0.01721184  0.02070728  0.00737535
   0.03630645  0.02818337 -0.00902491  0.04928103]
 [ 0.04168199 -0.00543099 -0.0151553   0.01721184  0.02070728  0.00737535
   0.03630645  0.02818337 -0.00902491  0.04928103]
 [ 0.04168199 -0.00543099 -0.0151553   0.01721184  0.02070728  0.00737535
   0.03630645  0.02818337 -0.00902491  0.04928103]
 [-0.01698891 -0.0013757  -0.03928972 -0.01784605 -0.04205811 -0.03445328
   0.01490071  0.04552897 -0.00755148  0.02390367]
 [ 0.03777504 -0.04341593 -0.01616859  0.00992813 -0.04575468  0.02840963
  -0.02335455 -0.02571596  0.03631758  0.04784216]
 [ 0.02253621 -0.01142471  0.04809308  0.00100965 -0.04898838  0.02958932
   0.00636113 -0.03732587 -0.01762884 -0.0301518 ]
 [-0.01348916 -0.02871039 -0.0475859   0.04592964  0.03675845  0.03470277
   0.01562592  0.03713954  0.00154363  0.04721672]]

# Word2Vec using Gensim

In [14]:


import nltk

import gensim
 
nltk.download('punkt')

from gensim.models import Word2Vec
from nltk.corpus import stopwords
import re
paragraph = """I have three visions for India. In 3000 years of our history, people from all over 
               the world have come and invaded us, captured our lands, conquered our minds. 
               From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British,
               the French, the Dutch, all of them came and looted us, took over what was ours. 
               Yet we have not done this to any other nation. We have not conquered anyone. 
               We have not grabbed their land, their culture, 
               their history and tried to enforce our way of life on them. 
               Why? Because we respect the freedom of others.That is why my 
               first vision is that of freedom. I believe that India got its first vision of 
               this in 1857, when we started the War of Independence. It is this freedom that
               we must protect and nurture and build on. If we are not free, no one will respect us.
               My second vision for India’s development. For fifty years we have been a developing nation.
               It is time we see ourselves as a developed nation. We are among the top 5 nations of the world
               in terms of GDP. We have a 10 percent growth rate in most areas. Our poverty levels are falling.
               Our achievements are being globally recognised today. Yet we lack the self-confidence to
               see ourselves as a developed nation, self-reliant and self-assured. Isn’t this incorrect?
               I have a third vision. India must stand up to the world. Because I believe that unless India 
               stands up to the world, no one will respect us. Only strength respects strength. We must be 
               strong not only as a military power but also as an economic power. Both must go hand-in-hand. 
               My good fortune was to have worked with three great minds. Dr. Vikram Sarabhai of the Dept. of 
               space, Professor Satish Dhawan, who succeeded him and Dr. Brahm Prakash, father of nuclear material.
               I was lucky to have worked with all three of them closely and consider this the great opportunity of my life. 
               I see four milestones in my career"""

text = re.sub(r'\[[0-9]*\]',' ',paragraph)
text = re.sub(r'\s+',' ',text)
text = text.lower()
text = re.sub(r'\d',' ',text)
text = re.sub(r'\s+',' ',text)

print(text)

i have three visions for india. in years of our history, people from all over the world have come and invaded us, captured our lands, conquered our minds. from alexander onwards, the greeks, the turks, the moguls, the portuguese, the british, the french, the dutch, all of them came and looted us, took over what was ours. yet we have not done this to any other nation. we have not conquered anyone. we have not grabbed their land, their culture, their history and tried to enforce our way of life on them. why? because we respect the freedom of others.that is why my first vision is that of freedom. i believe that india got its first vision of this in , when we started the war of independence. it is this freedom that we must protect and nurture and build on. if we are not free, no one will respect us. my second vision for india’s development. for fifty years we have been a developing nation. it is time we see ourselves as a developed nation. we are among the top nations of the world in terms

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\deshm\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [33]:
sentences = nltk.sent_tokenize(text)

sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
sentences

[['i', 'have', 'three', 'visions', 'for', 'india', '.'],
 ['in',
  'years',
  'of',
  'our',
  'history',
  ',',
  'people',
  'from',
  'all',
  'over',
  'the',
  'world',
  'have',
  'come',
  'and',
  'invaded',
  'us',
  ',',
  'captured',
  'our',
  'lands',
  ',',
  'conquered',
  'our',
  'minds',
  '.'],
 ['from',
  'alexander',
  'onwards',
  ',',
  'the',
  'greeks',
  ',',
  'the',
  'turks',
  ',',
  'the',
  'moguls',
  ',',
  'the',
  'portuguese',
  ',',
  'the',
  'british',
  ',',
  'the',
  'french',
  ',',
  'the',
  'dutch',
  ',',
  'all',
  'of',
  'them',
  'came',
  'and',
  'looted',
  'us',
  ',',
  'took',
  'over',
  'what',
  'was',
  'ours',
  '.'],
 ['yet',
  'we',
  'have',
  'not',
  'done',
  'this',
  'to',
  'any',
  'other',
  'nation',
  '.'],
 ['we', 'have', 'not', 'conquered', 'anyone', '.'],
 ['we',
  'have',
  'not',
  'grabbed',
  'their',
  'land',
  ',',
  'their',
  'culture',
  ',',
  'their',
  'history',
  'and',
  'tried',
  'to',
  'e

In [34]:
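# Build the stopword set once (requires nltk.download('stopwords')) instead of reloading it for every word
stop_words = set(stopwords.words('english'))
for i in range(len(sentences)):
    sentences[i] = [word for word in sentences[i] if word not in stop_words]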

In [35]:
model = Word2Vec(sentences, min_count=1)

In [43]:
# List the learned vocabulary; get_vecattr is a method, so referencing it without a call only returns the bound method
words = model.wv.index_to_key
words

In [45]:
vector = model.wv['war']
vector


array([-0.00219905, -0.00970885,  0.00929075,  0.00203636, -0.00116388,
       -0.00551955, -0.0085126 , -0.00989383,  0.00894091, -0.00250522,
        0.00459427, -0.00452481,  0.00995189,  0.00366171,  0.00103129,
       -0.00403834,  0.00122027, -0.00265451,  0.00735284,  0.00447542,
        0.00099955,  0.0034782 ,  0.00372712, -0.00680036,  0.00893242,
        0.00173499, -0.00579935,  0.00866838, -0.00129286,  0.00818304,
       -0.0014927 ,  0.00698649,  0.00273452, -0.00436226, -0.00374683,
        0.00919046,  0.00159645, -0.00599784,  0.00034776, -0.00195135,
        0.00159242, -0.00771525,  0.00738298,  0.00131083,  0.00787099,
        0.00445568, -0.00439675,  0.00376054, -0.0006357 , -0.00984484,
        0.00825004,  0.00964326,  0.00965426, -0.00379659, -0.00844202,
        0.00483581, -0.00765107,  0.00853567,  0.00275977,  0.00560496,
        0.00611362,  0.00046455, -0.00209463,  0.000778  ,  0.00983559,
       -0.00711718, -0.00155744, -0.00235984,  0.00487084,  0.00

In [48]:
similar = model.wv.most_similar('people')
similar

[('captured', 0.2381242960691452),
 ('free', 0.22312279045581818),
 ('enforce', 0.18678195774555206),
 ('build', 0.17561188340187073),
 ('percent', 0.16878381371498108),
 ('conquered', 0.16357892751693726),
 ('self-assured', 0.1620420217514038),
 ('time', 0.1589709222316742),
 ('looted', 0.15646469593048096),
 ('greeks', 0.15320080518722534)]
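The scores returned by `most_similar` are cosine similarities between word vectors. A minimal plain-Python sketch of that measure is shown below; the two short vectors are invented for illustration and are not taken from the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional vectors.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))   # same direction
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))   # orthogonal
```

Values near 1 mean two vectors point in almost the same direction. The low scores above (around 0.2) are likely a consequence of training on a single short paragraph.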