# Word Embedding Techniques

Word embedding is a technique in which the individual words of a domain or language are represented as real-valued vectors in a low-dimensional space. Word embeddings are considered one of the most successful applications of unsupervised learning to date: they do not require annotated corpora, and they use a low-dimensional space while preserving semantic relationships between words.
Some popular word embedding and vectorization methods for extracting features from text are:

1. **Bag of words** - Bag of words is a simple and popular technique for extracting features from text. A bag-of-words model counts how many times each word appears in a sentence or document, ignoring word order. This process is also called vectorization.

2. **TF-IDF** - TF-IDF is a popular technique for extracting features from a corpus. It is a statistical measure of how important a word is to a document relative to the other documents in the corpus.

    TF stands for Term Frequency. Each word or token is scored by how often it occurs in a document. Raw frequency depends on document length: a word tends to occur more often in a long document than in a short or medium-sized one. To correct for this, the count is divided by the length of the document (its total number of words). Like bag of words, this produces a sparse matrix with one frequency value per word.

    Formula to calculate Term Frequency (TF):

    TF = number of times the term occurs in a document / total number of words in the document
    
    IDF stands for Inverse Document Frequency. It also assigns a score to each word; this score reflects how rare the word is across all documents. Rarer words get a higher IDF score.

    Formula to calculate Inverse Document Frequency (IDF), using the natural logarithm:

    IDF = ln(total number of documents / number of documents containing the term)

    Formula to calculate the complete TF-IDF value:

    TF-IDF = TF * IDF
    
    The TF-IDF value of a word increases with its frequency in a document and decreases with the number of documents that contain it. Like bag of words, this technique does not capture the semantic meaning of words. It is mostly used for document classification, and search engines such as Google have also used it as a ranking factor for content.


3. **Word2vec** - Word2vec is an algorithm invented at Google for training word embeddings. It relies on the distributional hypothesis, which states that words that often have the same neighboring words tend to be semantically similar. This helps map semantically similar words to geometrically close embedding vectors.

4. **FastText** - FastText is a free, open-source library from Facebook AI Research (FAIR) for learning word embeddings and text classification. It supports both unsupervised and supervised training for obtaining vector representations of words, and it can also evaluate the resulting models. FastText supports both the CBOW and Skip-gram models (*Continuous Bag of Words (CBOW) and Skip-gram are both neural network architectures for learning the underlying representation of each word*).

5. **ELMo (Embeddings from Language Models)** - ELMo is also a powerful computational model that converts words into numbers. This step allows machine learning models (which take numbers, not words, as inputs) to be trained on textual data. ELMo achieved state-of-the-art performance on many popular tasks, including question answering, sentiment analysis, and named-entity extraction. Unlike the earlier methods, ELMo accounts for a word’s context. Previous approaches such as GloVe, bag of words, and Word2Vec produce a single fixed representation for each word, regardless of how the word is used in a sentence.
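
The TF, IDF, and TF-IDF formulas in item 2 can be sketched in plain Python as follows. This is a minimal illustration, not the implementation used by any library: the helper names `tf`, `idf`, and `tf_idf` and the whitespace tokenization are assumptions made for the example.

```python
import math

# Three short documents, already lowercased and split on whitespace
docs = [
    'there was a man'.split(),
    'the man had a dog'.split(),
    'the dog and the man walked'.split(),
]

def tf(term, doc):
    # Occurrences of the term divided by the total number of words in the document
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Natural log of (number of documents / number of documents containing the term).
    # Assumes the term occurs in at least one document.
    containing = sum(term in doc for doc in docs)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf('man', docs[0], docs))  # 'man' occurs in every document
print(tf_idf('was', docs[0], docs))  # 'was' occurs in one document only
```

Because `man` occurs in all three documents, its IDF is ln(3/3) = 0 and its TF-IDF score is 0, while the rarer `was` gets a positive score.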

In [153]:
from nltk.tokenize import word_tokenize
import warnings

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings(action='ignore')

import gensim
import tensorflow_hub as hub
import tensorflow.compat.v1 as tf
# hub.Module and tf.Session (used for ELMo) need TF1-style graph execution
tf.disable_eager_execution()

from keras.preprocessing.text import Tokenizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [154]:
# sample data
text = [
  'There was a man',
  'The man had a dog',
  'The dog and the man walked',
]

In [155]:
class Embeddings:

    # Bag-of-words implementation

    @staticmethod
    def bag_of_words(text):
        '''
        Fit a Keras Tokenizer on the text and print the vocabulary and the
        count matrix.

        The Tokenizer class from the Keras text preprocessing module builds
        the vocabulary and produces the vector representation of each sentence.
        '''
        model = Tokenizer()
        model.fit_on_texts(text)
        print(f'Key : {list(model.word_index.keys())}')  # display the vocabulary (tokens)

        # Create the bag-of-words representation.
        # texts_to_matrix converts a list of texts to a NumPy matrix;
        # mode='count' makes each entry the number of times a token occurs.
        bow = model.texts_to_matrix(text, mode='count')
        print('Bag of words :\n', bow)
    
    #=================================================================================
    
    # TF-IDF implementation

    @staticmethod
    def tf_idf(text):
        # Create the vectorizer
        tfidf = TfidfVectorizer()

        # Compute the TF-IDF matrix
        result = tfidf.fit_transform(text)

        # Print the IDF value of each vocabulary term.
        # get_feature_names() was removed in scikit-learn 1.2;
        # get_feature_names_out() is its replacement.
        print('\nIDF values:')
        for term, idf_value in zip(tfidf.get_feature_names_out(), tfidf.idf_):
            print(term, ':', idf_value)

        # Print the term-to-column index mapping
        print('\nWord indexes:')
        print(tfidf.vocabulary_)

        # Print the TF-IDF values in dense matrix form
        print('\nTF-IDF values in matrix form:')
        print(result.toarray())
    
    #=================================================================================
    
    # Word2vec implementation

    @staticmethod
    def word_2_vec(text):
        # Tokenize each sentence into lowercase words
        # (word_tokenize needs the NLTK 'punkt' tokenizer data to be installed)
        data = [[word.lower() for word in word_tokenize(sentence)]
                for sentence in text]

        # Create a CBOW (Continuous Bag of Words) model (sg defaults to 0)
        model1 = gensim.models.Word2Vec(data, min_count=1,
                                        vector_size=100, window=5)

        # Create a Skip-gram model (sg=1)
        model2 = gensim.models.Word2Vec(data, min_count=1, vector_size=100,
                                        window=5, sg=1)

        # Print the embedding of the first word of the first sentence
        print("Word Embeddings for Word2Vec CBOW model: \n", model1.wv[data[0][0]], '\n')
        print("Word Embeddings for Word2Vec Skip Gram model: \n", model2.wv[data[0][0]])
    
    #=================================================================================
    
    # FastText implementation

    @staticmethod
    def fastext(text):
        # Tokenize each sentence into lowercase words
        # (word_tokenize needs the NLTK 'punkt' tokenizer data to be installed)
        data = [[word.lower() for word in word_tokenize(sentence)]
                for sentence in text]

        # Create a CBOW (Continuous Bag of Words) model (sg defaults to 0)
        model1 = gensim.models.FastText(data, min_count=1,
                                        vector_size=100, window=5)

        # Create a Skip-gram model (sg=1)
        model2 = gensim.models.FastText(data, min_count=1, vector_size=100,
                                        window=5, sg=1)

        # Print the embedding of the first word of the first sentence
        print("Word Embeddings for FastText CBOW model: \n", model1.wv[data[0][0]], '\n')
        print("Word Embeddings for FastText Skip Gram model: \n", model2.wv[data[0][0]])
    
    #=================================================================================
    
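    # ELMo implementation

    @staticmethod
    def elmo(text):
        # Load the pre-trained ELMo module from TensorFlow Hub
        elmo_module = hub.Module("https://tfhub.dev/google/elmo/3", trainable=True)

        # Build the op that computes contextual embeddings for every token
        embeddings = elmo_module(text, signature="default", as_dict=True)["elmo"]

        # initialize_all_variables() is deprecated in favor of
        # global_variables_initializer()
        init = tf.global_variables_initializer()

        # Print the embeddings of all tokens in the three sentences;
        # the context manager closes the session afterwards
        print('\nElmo Embeddings:')
        with tf.Session() as sess:
            sess.run(init)
            print(sess.run(embeddings))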


In [156]:
Embeddings.bag_of_words(text)

Key : ['man', 'the', 'a', 'dog', 'there', 'was', 'had', 'and', 'walked']
Bag of words :
 [[0. 1. 0. 1. 0. 1. 1. 0. 0. 0.]
 [0. 1. 1. 1. 1. 0. 0. 1. 0. 0.]
 [0. 1. 2. 0. 1. 0. 0. 0. 1. 1.]]
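
The count matrix above can be reproduced without Keras, which may make it easier to see what each column means. The sketch below is an approximation under stated assumptions: it tokenizes by lowercasing and splitting on whitespace, and it orders the vocabulary by descending frequency with ties kept in first-seen order, which matches the `Key` list printed above. The names `tokenized`, `vocab`, and `bow` are chosen for illustration. Keras reserves index 0, so its matrix has one extra all-zero leading column that this version omits.

```python
from collections import Counter

text = [
    'There was a man',
    'The man had a dog',
    'The dog and the man walked',
]

# Lowercase and split on whitespace (a simplification of a real tokenizer)
tokenized = [sentence.lower().split() for sentence in text]

# Order the vocabulary by descending corpus frequency; ties keep first-seen order
counts = Counter(word for tokens in tokenized for word in tokens)
vocab = sorted(counts, key=lambda word: -counts[word])

# One row per sentence, one column per vocabulary word
bow = [[tokens.count(word) for word in vocab] for tokens in tokenized]

print(vocab)
for row in bow:
    print(row)
```

In the third row, the value 2 sits in the column for `the`, which occurs twice in "The dog and the man walked".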


In [157]:
Embeddings.tf_idf(text)


IDF values:
and : 1.6931471805599454
dog : 1.2876820724517808
had : 1.6931471805599454
man : 1.0
the : 1.2876820724517808
there : 1.6931471805599454
walked : 1.6931471805599454
was : 1.6931471805599454

Word indexes:
{'there': 5, 'was': 7, 'man': 3, 'the': 4, 'had': 2, 'dog': 1, 'and': 0, 'walked': 6}

TF-IDF values in matrix form:
[[0.         0.         0.         0.38537163 0.         0.65249088
  0.         0.65249088]
 [0.         0.4804584  0.63174505 0.37311881 0.4804584  0.
  0.         0.        ]
 [0.43681766 0.33221109 0.         0.25799154 0.66442217 0.
  0.43681766 0.        ]]
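
The IDF values printed above do not follow the plain ln(N / df) formula. With its default settings, scikit-learn's `TfidfVectorizer` uses a smoothed IDF, ln((1 + N) / (1 + df)) + 1, takes raw term counts as TF, and scales each row to unit Euclidean length. Its default token pattern also drops one-character tokens, which is why `a` is missing from the vocabulary. The sketch below hard-codes the token lists under that assumption and appears to reproduce the printed numbers; the names `docs`, `idf`, and `tfidf_row` are illustrative.

```python
import math

# Tokens per sentence after lowercasing, with the one-character token 'a' dropped
docs = [
    ['there', 'was', 'man'],
    ['the', 'man', 'had', 'dog'],
    ['the', 'dog', 'and', 'the', 'man', 'walked'],
]

n_docs = len(docs)
vocab = sorted({word for doc in docs for word in doc})

# Document frequency and smoothed IDF: ln((1 + N) / (1 + df)) + 1
df = {word: sum(word in doc for doc in docs) for word in vocab}
idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocab}

def tfidf_row(doc):
    # Raw count times IDF, then L2-normalize the row
    raw = [doc.count(word) * idf[word] for word in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

matrix = [tfidf_row(doc) for doc in docs]

for word in vocab:
    print(word, ':', idf[word])
for row in matrix:
    print([round(value, 8) for value in row])
```

Here `man` appears in all three documents, so its smoothed IDF is ln(4/4) + 1 = 1.0, the lowest value in the list.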


In [158]:
Embeddings.word_2_vec(text)

Word Embeddings for Word2Vec CBOW model: 
 [-9.5785474e-03  8.9431144e-03  4.1650678e-03  9.2347339e-03
  6.6435025e-03  2.9247357e-03  9.8040197e-03 -4.4246409e-03
 -6.8033123e-03  4.2273807e-03  3.7290000e-03 -5.6646108e-03
  9.7047593e-03 -3.5583067e-03  9.5494054e-03  8.3472492e-04
 -6.3384580e-03 -1.9771170e-03 -7.3770545e-03 -2.9795242e-03
  1.0416961e-03  9.4826864e-03  9.3558477e-03 -6.5958784e-03
  3.4751510e-03  2.2755694e-03 -2.4893521e-03 -9.2291739e-03
  1.0271263e-03 -8.1657078e-03  6.3201878e-03 -5.8000805e-03
  5.5354382e-03  9.8337224e-03 -1.6000033e-04  4.5284913e-03
 -1.8094016e-03  7.3607611e-03  3.9400961e-03 -9.0103243e-03
 -2.3985051e-03  3.6287690e-03 -9.9568366e-05 -1.2012720e-03
 -1.0554385e-03 -1.6716027e-03  6.0495140e-04  4.1650939e-03
 -4.2527914e-03 -3.8336229e-03 -5.2816868e-05  2.6935578e-04
 -1.6880751e-04 -4.7855065e-03  4.3134023e-03 -2.1719194e-03
  2.1035385e-03  6.6652300e-04  5.9696771e-03 -6.8423818e-03
 -6.8157101e-03 -4.4762585e-03  9.4358278e
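
Raw vectors like these are hard to interpret on their own. How "geometrically close" two embeddings are is commonly measured with cosine similarity. A minimal sketch follows, using made-up 3-dimensional vectors rather than real model output; the function name `cosine_similarity` and the three vectors are assumptions for illustration.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the two vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings, not taken from any trained model
king = [0.9, 0.1, 0.4]
queen = [0.8, 0.2, 0.5]
banana = [-0.3, 0.9, -0.1]

print(cosine_similarity(king, queen))   # vectors pointing in similar directions
print(cosine_similarity(king, banana))  # vectors pointing in different directions
```

For a trained gensim model, `model.wv.similarity(word1, word2)` returns the same kind of score. On a corpus as small as the three sample sentences, such similarities are not meaningful.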

In [159]:
Embeddings.fastext(text)

Word Embeddings for FastText CBOW model: 
 [ 1.00380252e-03  1.39509828e-03  1.89781445e-03  6.24475360e-04
 -1.30063141e-04  3.93523893e-04  6.32923096e-04  1.93771757e-05
 -1.76075741e-03 -2.10116617e-04 -2.47781791e-05  5.73117693e-04
  4.78895672e-04 -2.09319856e-04  1.31731026e-03 -1.37902168e-03
  6.72607741e-04 -1.33673998e-03 -2.25682952e-03  1.56312191e-03
  5.91260068e-05 -7.31510925e-04  1.60478265e-03 -1.52098120e-03
 -3.72569084e-05 -1.67902559e-03  2.16971341e-04 -2.63496279e-03
 -2.17510387e-03 -1.11606430e-04  1.07013562e-03 -2.60453066e-03
  8.77746497e-04 -2.07846123e-03  9.39855352e-04  2.38607987e-03
  2.09304551e-03 -2.41468355e-04 -1.47287245e-03 -3.44380643e-03
  8.93070945e-04  2.11852908e-04 -2.29015644e-03  8.90909520e-04
 -3.19046434e-03 -2.50204117e-03 -5.48833050e-04  1.69254537e-03
  1.46050891e-03 -1.21486676e-03 -2.52983300e-03 -1.47039289e-04
  2.34688423e-03 -1.77372748e-03  1.10516092e-03 -1.54753763e-03
 -1.49488144e-04 -2.30527972e-03 -1.33809668e-0

In [160]:
Embeddings.elmo(text)

INFO:tensorflow:Saver not created because there are no variables in the graph to restore


INFO:tensorflow:Saver not created because there are no variables in the graph to restore



Elmo Embeddings:
[[[-0.297597   -0.06557926 -0.15180793 ... -0.27876338 -0.02799578
   -0.09434052]
  [ 0.22895269 -0.6115743  -0.2225335  ...  0.09913652  0.80268836
    0.38518405]
  [ 0.19487718 -0.20308933  0.24879742 ...  0.18695995 -0.25210536
    0.07623564]
  [-0.6595057  -0.25561672  0.19957364 ... -0.14810869  0.0861657
    0.24787885]
  [-0.02840842 -0.04353214  0.04130162 ...  0.02583169 -0.01429836
   -0.01650422]
  [-0.02840842 -0.04353214  0.04130162 ...  0.02583169 -0.01429836
   -0.01650422]]

 [[-0.00646199  0.00602151 -0.35598326 ... -0.5774933  -0.09805511
   -0.13721548]
  [-0.42775753 -0.22676341 -0.07036661 ... -0.59731114  0.36165738
    0.30080837]
  [-0.10810221 -0.13908526 -0.6891192  ...  0.24988073  0.6034625
    0.3689208 ]
  [ 0.05006097 -0.19613525 -0.05395966 ...  0.2639987  -0.30015522
    0.36188126]
  [-0.38287795 -1.0384216   0.04224534 ...  0.06314397 -0.18220167
    0.4078887 ]
  [-0.02840842 -0.04353214  0.04130162 ...  0.02583169 -0.01429836
  