#### Outline.
            1. Word Embedding: using Frequency Base method
            
            2. Word Embedding: using Prediction Base method.
            
            3. Word Embedding: using Word2vec & GloVe
            
            4. Keras Embedding Layer.

In [1]:
import numpy as np
import pandas as pd

**`Word embedding`** is the collective name for a set of **`language modeling`** and **`feature learning`** techniques in **`Natural language processing (NLP)`** where `words` or `phrases` from the `vocabulary` are *mapped to vectors of real numbers*. 

Conceptually, it involves a `mathematical embedding` from a space with `many dimensions per word` to a `continuous vector space` with a much `lower dimension`.

Methods to generate this mapping include `neural networks` and `dimensionality reduction` on the word `co-occurrence matrix`.
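To make the move from "many dimensions per word" to a "lower dimension" concrete, here is a minimal sketch that contrasts a one-hot vector with a dense lookup. The toy vocabulary, the helper names `one_hot` and `embed`, and the numbers in `embedding_table` are all made up for illustration; real embeddings are learned from data.

```python
# Toy vocabulary; the dense vectors below are made-up numbers for illustration.
vocab = ["cat", "dog", "car"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Sparse representation: one dimension per vocabulary word."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

# Hypothetical 2-dimensional embedding table (one row per word).
embedding_table = [
    [0.9, 0.1],   # cat
    [0.8, 0.2],   # dog
    [0.1, 0.9],   # car
]

def embed(word):
    """Dense representation: look up the row that belongs to the word."""
    return embedding_table[index[word]]

print(one_hot("dog"))   # length equals the vocabulary size
print(embed("dog"))     # length equals the chosen embedding dimension
```

The one-hot vector grows with the vocabulary, while the dense vector keeps a fixed, much smaller length.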

## 1. Word Embedding: using Frequency Base method

Includes: `Count Vector`; `tf-idf Vector` and `Co-occurrence Matrix.`

### 1.1. Using `CountVectorizer` with `TfidfTransformer`

First, consider a few simple sentences.

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['this is the first document',
          'this document is the second document',
          'and this is the third one',
          'is this the first document?',
          'this Document is not yours..']

cvect = CountVectorizer()
X = cvect.fit_transform(corpus)

print("There are %d sentences in this corpus"%(X.shape[0]))
print('The number of the different words is :', X.shape[1], ", and ... they are:")
print(cvect.get_feature_names())

There are 5 sentences in this corpus
The number of the different words is : 11 , and ... they are:
['and', 'document', 'first', 'is', 'not', 'one', 'second', 'the', 'third', 'this', 'yours']


- First, the vectorizer counts how many `different words` appear in the corpus; here there are `11`. Note that the words "`document`" and "`Document`" are both converted to lowercase, i.e. `"document"`.
- Only the second document contains a word with `frequency = 2`, namely `document`.
- The `unique` words in the corpus are sorted in `alphabetical order`, starting with "**a**nd" and ending with "**y**ours".
- `Punctuation` (such as `"?"` or `"!"`) is ignored.
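The behaviour listed above (lowercasing, dropping punctuation, sorting the vocabulary) can be imitated with the standard library. This is only a rough sketch of the default settings, not the scikit-learn implementation; the regular expression is the documented default `token_pattern`, which also drops one-character tokens, and the helper name `tokenize` is chosen here for illustration.

```python
import re
from collections import Counter

# Default token pattern of CountVectorizer: tokens of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text, then keep only the tokens matched by the pattern."""
    return TOKEN_PATTERN.findall(text.lower())

docs = ['this is the first document',
        'this Document is not yours..',
        'is this the first document?']

# Sorted set of all tokens = the feature names (columns of the count matrix).
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})

# One row of counts per document, one column per vocabulary word.
counts = []
for doc in docs:
    tf = Counter(tokenize(doc))
    counts.append([tf[word] for word in vocabulary])

print(vocabulary)
for row in counts:
    print(row)
```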



In [3]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer

vocabulary = cvect.get_feature_names()
pipe = Pipeline([('count', CountVectorizer(vocabulary = vocabulary, min_df=2, max_df=0.5, ngram_range=(1,2))),
                 ('tfid', TfidfTransformer(smooth_idf=False, use_idf=True))]).fit(corpus)
pipe['count'].transform(corpus).toarray()

array([[0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0],
       [0, 2, 0, 1, 0, 0, 1, 1, 0, 1, 0],
       [1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0],
       [0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1]], dtype=int64)

Next, **compute the `IDF` values.** (The `idf` of a word is fixed once the corpus is fixed; it reflects the fraction of documents that contain the word.)

In [4]:
table = pd.DataFrame({"fea_names": cvect.get_feature_names(), "idf_smooth_False)": pipe['tfid'].idf_})
table

Unnamed: 0,fea_names,idf_smooth_False)
0,and,2.609438
1,document,1.223144
2,first,1.916291
3,is,1.0
4,not,2.609438
5,one,2.609438
6,second,2.609438
7,the,1.223144
8,third,2.609438
9,this,1.0


According to the scikit-learn source code (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/text.py#L987-L992) and the `TfidfTransformer` documentation (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html?highlight=tfidf#sklearn.feature_extraction.text.TfidfTransformer):

- If `smooth_idf=False`, the `tf-idf` of a word `w` in a document `d` of a document set is computed as

                                    tf-idf(w, d) = tf(w, d) * idf(w), 

and the `idf` is computed as 

                                        idf(w) = log [ n / df(w) ] + 1, 

where `n` is the `total number of documents in the corpus` and `df(w)` is the `document frequency of w`, i.e. the number of documents in the document set that contain the word `w`. 

The effect of adding `“1”` to the idf in the equation above is that terms with zero idf, i.e., terms that occur in all documents in a training set, ***will not be entirely ignored***. 

For example: 
1. The word `"is"`. The corpus has `5 sentences/documents` and all of them contain this word, so 

                                    idf("is") = log(5 / 5) + 1 = 1

2. For the word `"and"`, which occurs in only one document, we have

                                    idf("and") = log(5 / 1) + 1 ≈ 2.609 

Note that `log` here is the `natural logarithm`.

***Note that the `idf` formula above differs from the standard textbook notation that defines the idf as***

                                    idf(w) = log [ n / (df(w) + 1) ].

- If `smooth_idf=True` (the default), the constant `“1”` is added to the numerator and denominator of the idf as if an extra document was seen containing every term in the collection exactly once, which ***prevents zero divisions:*** 

                                    idf(w) = log [ (1 + n) / (1 + df(w)) ] + 1.
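Both variants of the formula can be checked by hand from the document frequencies alone. The sketch below uses only the standard library; the helper name `idf` and the small `df` dictionary (document frequencies read off the count matrix of the five-sentence corpus) are introduced here for illustration.

```python
import math

def idf(n_docs, doc_freq, smooth=True):
    """Inverse document frequency as described above (natural logarithm)."""
    if smooth:
        # Add one to numerator and denominator, as if one extra document
        # containing every term had been seen.
        return math.log((1 + n_docs) / (1 + doc_freq)) + 1
    return math.log(n_docs / doc_freq) + 1

n = 5  # number of documents in the corpus
df = {"is": 5, "document": 4, "first": 2, "and": 1}  # documents containing each word

for word, freq in df.items():
    print(word,
          round(idf(n, freq, smooth=False), 6),
          round(idf(n, freq, smooth=True), 6))
```

A word that appears in every document gets `idf = 1` under both variants, so its tf-idf score equals its raw term frequency.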

In [5]:
pipe = Pipeline([('count', CountVectorizer(vocabulary = vocabulary, 
                                           min_df=2, max_df=0.5, ngram_range=(1,2))),
                 ('tfid', TfidfTransformer(smooth_idf=True, use_idf=True))]).fit(corpus)

table["idf_smooth_True"] = pipe['tfid'].idf_
table

Unnamed: 0,fea_names,idf_smooth_False),idf_smooth_True
0,and,2.609438,2.098612
1,document,1.223144,1.182322
2,first,1.916291,1.693147
3,is,1.0,1.0
4,not,2.609438,2.098612
5,one,2.609438,2.098612
6,second,2.609438,2.098612
7,the,1.223144,1.182322
8,third,2.609438,2.098612
9,this,1.0,1.0


In [6]:
## verify the idf_value of the word "and"

np.log(5) + 1, np.log((1 + 5)/(1+1))+1

(2.6094379124341005, 2.09861228866811)

**Compute the tf-idf score.** Whichever way the `idf` values are computed, the `tf-idf` is defined by

                                tf-idf(w, d) = tf(w, d) * idf(w)

Recall that `TF` stands for **`term frequency`**, defined here as *the number of times that word `w` occurs in document `d`*.

For example, in the first sentence (`d = 1`) the word `"and"` does not occur, so `tf("and", d=1) = 0`.

By default, `TfidfTransformer` also rescales each document vector to unit Euclidean length (`norm='l2'`), so the scores in the table below are the products `tf * idf` divided by the norm of the document's vector.

See the table below.

In [7]:
vocabulary = cvect.get_feature_names()
pipe = Pipeline([('count', CountVectorizer(vocabulary = vocabulary, 
                                           min_df=2, max_df=0.5, ngram_range=(1,2))),
                 ('tfid', TfidfTransformer(smooth_idf=True, use_idf = True))]).fit(corpus)

count_vector = pipe['count'].transform(corpus).toarray()  ## equivalent with CountVectorizer.fit_transform(corpus)

tf_idf_vector = pipe['tfid'].transform(count_vector)
tf_idf_vector[0].toarray()
table["tfidf_smooth_True_1st_doc"] = tf_idf_vector[0].T.toarray()
table["tfidf_smooth_True_2nd_doc"] = tf_idf_vector[1].T.toarray()
table["tfidf_smooth_True_3rd_doc"] = tf_idf_vector[2].T.toarray()
table

Unnamed: 0,fea_names,idf_smooth_False),idf_smooth_True,tfidf_smooth_True_1st_doc,tfidf_smooth_True_2nd_doc,tfidf_smooth_True_3rd_doc
0,and,2.609438,2.098612,0.0,0.0,0.514923
1,document,1.223144,1.182322,0.42712,0.646126,0.0
2,first,1.916291,1.693147,0.611659,0.0,0.0
3,is,1.0,1.0,0.361255,0.273244,0.245363
4,not,2.609438,2.098612,0.0,0.0,0.0
5,one,2.609438,2.098612,0.0,0.0,0.514923
6,second,2.609438,2.098612,0.0,0.573434,0.0
7,the,1.223144,1.182322,0.42712,0.323063,0.290099
8,third,2.609438,2.098612,0.0,0.0,0.514923
9,this,1.0,1.0,0.361255,0.273244,0.245363
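The first-document column can be reproduced by hand, which also shows why the scores are not simply `tf * idf`: with its default `norm='l2'`, `TfidfTransformer` divides each document vector by its Euclidean length. The sketch below is a standard-library check under that assumption; the dictionaries `tf` and `df` are read off the count matrix shown earlier.

```python
import math

n_docs = 5

# First document: 'this is the first document'.
tf = {"this": 1, "is": 1, "the": 1, "first": 1, "document": 1}   # term counts
df = {"this": 5, "is": 5, "the": 4, "first": 2, "document": 4}   # document frequencies

# Raw tf-idf with the smoothed idf (smooth_idf=True).
raw = {w: tf[w] * (math.log((1 + n_docs) / (1 + df[w])) + 1) for w in tf}

# L2 normalisation: divide by the Euclidean length of the document vector.
norm = math.sqrt(sum(v * v for v in raw.values()))
tfidf = {w: v / norm for w, v in raw.items()}

for word in sorted(tfidf):
    print(word, round(tfidf[word], 6))
```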


### 1.2. `TfidfVectorizer` is equivalent to the first method

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

tf = TfidfVectorizer()
tfidf_matrix =  tf.fit_transform(corpus)
feature_names = tf.get_feature_names()
print(feature_names)

['and', 'document', 'first', 'is', 'not', 'one', 'second', 'the', 'third', 'this', 'yours']


**View the `tf-idf` scores computed by `TfidfVectorizer`, starting with the values for the first sentence.**

In [9]:
tfidf_vectorizer = TfidfVectorizer(use_idf=True)
tfidf_vectorizer_vectors=tfidf_vectorizer.fit_transform(corpus)
M = tfidf_vectorizer_vectors.toarray()
M[0,:]

array([0.        , 0.42712001, 0.6116585 , 0.36125537, 0.        ,
       0.        , 0.        , 0.42712001, 0.        , 0.36125537,
       0.        ])

**tf-idf values using `TfidfVectorizer`**

In [10]:
pd.DataFrame({"fea_names": feature_names, 
              "tfidf_TfVec_1st_doc": M[0, :], 
              "tfidf_TfVec_2nd_doc": M[1, :],
              "tfidf_TfVec_3rd_doc": M[2, :]})

Unnamed: 0,fea_names,tfidf_TfVec_1st_doc,tfidf_TfVec_2nd_doc,tfidf_TfVec_3rd_doc
0,and,0.0,0.0,0.514923
1,document,0.42712,0.646126,0.0
2,first,0.611659,0.0,0.0
3,is,0.361255,0.273244,0.245363
4,not,0.0,0.0,0.0
5,one,0.0,0.0,0.514923
6,second,0.0,0.573434,0.0
7,the,0.42712,0.323063,0.290099
8,third,0.0,0.0,0.514923
9,this,0.361255,0.273244,0.245363


In summary, the main differences between the two classes are as follows:

- With `TfidfTransformer` you first compute the word counts using `CountVectorizer`, then compute the `Inverse Document Frequency (IDF)` values, and only then compute the `tf-idf scores`.

- With `TfidfVectorizer`, by contrast, **you do all three steps at once**. It computes the word counts, IDF values, and tf-idf scores all using the same dataset.

**When to use what?**
So now you may be wondering why you should use more steps than necessary if you can get everything done in one step. There are cases where `TfidfTransformer` is preferable to `TfidfVectorizer`, and it is sometimes not that obvious. Here is a general guideline:

- If you need the `term frequency` (term count) vectors for `different tasks`, use `TfidfTransformer`.
- If you need to compute `tf-idf scores` on documents within your `“training”` dataset, use `TfidfVectorizer`.
- If you need to compute `tf-idf scores` on documents **`outside your “training”`** dataset, use either one, both will work.
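The guideline can be illustrated with a small sketch that scores a document outside the training set in both ways. The names `train_docs` and `new_docs` and the unseen sentence are made up for illustration; both routes use the default settings, so their scores should agree, while only the pipeline route exposes the raw counts.

```python
import numpy as np
from sklearn.feature_extraction.text import (CountVectorizer, TfidfTransformer,
                                             TfidfVectorizer)
from sklearn.pipeline import Pipeline

train_docs = ['this is the first document',
              'this document is the second document',
              'and this is the third one']
new_docs = ['is this the third document']   # hypothetical unseen text

# Route 1: a single object does counting, idf estimation and scoring.
vectorizer = TfidfVectorizer()
vectorizer.fit(train_docs)
scores_a = vectorizer.transform(new_docs).toarray()

# Route 2: explicit steps; the raw counts stay available for other tasks.
pipe = Pipeline([('count', CountVectorizer()),
                 ('tfid', TfidfTransformer())])
pipe.fit(train_docs)
counts = pipe['count'].transform(new_docs).toarray()
scores_b = pipe.transform(new_docs).toarray()

print(counts)
print(np.allclose(scores_a, scores_b))
```

In both routes the vocabulary and the idf values come from `train_docs` only; words of a new document that were never seen during fitting are ignored.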

-----------------------------------

In [11]:
import nltk
from nltk import bigrams
import itertools

However, the `disadvantage` of both preceding methods is that they **only focus on how frequently a word occurs, so the resulting vectors carry almost no contextual meaning**. Using a **`co-occurrence matrix`** can partially solve that problem. 

------------------------

### 1.3. Co-occurrence Matrix.

------------------------

**Definition.** A `co-occurrence matrix` is a square matrix in which each row (or column) is the vector representing the corresponding word; it measures the `co-occurrences` of `features` within a `user-defined context`. The matrix is symmetric when the context window looks in both directions. 

The `context` can be defined as a document or a `window` within a collection of documents, with an `optional vector of weights` applied to the co-occurrence counts.

------------------------

Hence, **`Co-occurrence Matrix`** has the ***advantage of preserving the semantic relationship between words, built on the number of occurrences of word pairs in the `Context Window`***. 


A `Context Window` is determined by its size and direction. The following table is an example of the `Context Window` with `size = 1`.

https://en.wikipedia.org/wiki/Bigram

https://en.wikipedia.org/wiki/Lexical_analysis#Token

https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html

https://www.sketchengine.eu/user-guide/n-grams/


In [51]:
## ------------------------- hidden code --------------------------------------


Unnamed: 0,worried,just,hers,I,NLP,Mathematics,kidding,Don't,Learning,Machine,love,and,you
worried,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
just,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
hers,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
I,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
NLP,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
Mathematics,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
kidding,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Don't,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Learning,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
Machine,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


With `window_size = 1`, using `bi-grams` is a sensible way to build this matrix. A detailed discussion of **`Bi-grams` and `N-grams`** is given at this link:

https://github.com/Nhan121/Lectures_notes-teaching-in-VN-/blob/master/Statistics/NLP/N-grams%20NLP.ipynb

**Quick reminder: Bigrams**

- A `bigram` or `digram` is a sequence of 2 `adjacent elements` from a `string of tokens`, which are typically `letters`, `syllables`, or `words`.

-  The `frequency distribution` of every bigram in a string is commonly used for simple `statistical analysis of text` in many `applications`, such as `computational linguistics`, `cryptography`, `speech recognition`.

- Bigrams help provide the `conditional probability` of a `token` given the **`preceding token`**, when the relation of the conditional probability is applied:

$$ \mathbb{P}\left( \text{token}_{k} \left\vert \text{token}_{k-1}\right. \right) = \dfrac{\mathbb{P}\left( \text{token}_{k}, \text{token}_{k-1} \right)}{ \mathbb{P} \left( \text{token}_{k-1} \right)} $$

For example, the sentence `"The office building was destroyed yesterday"` contains 5 `bigrams`:
        
        the office, office building, building was, was destroyed, destroyed yesterday.

In [39]:
sentence = "The office building was destroyed yesterday"
bi_grams = list(bigrams(sentence.split()))
bi_grams

[('The', 'office'),
 ('office', 'building'),
 ('building', 'was'),
 ('was', 'destroyed'),
 ('destroyed', 'yesterday')]
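The conditional-probability relation from the reminder above can be estimated from counts: the count of the pair divided by the count of the preceding token. The sketch below uses only the standard library; the toy token list and the helper name `cond_prob` are made up for illustration, and the estimate is the plain maximum-likelihood ratio without any smoothing.

```python
from collections import Counter

# Hypothetical toy text, already tokenized.
tokens = ["I", "love", "you", "I", "love", "NLP", "I", "hate", "bugs"]

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))   # pairs of adjacent tokens

def cond_prob(current, previous):
    """Estimate P(current | previous) = count(previous, current) / count(previous)."""
    if unigram_counts[previous] == 0:
        return 0.0   # the preceding token never occurs, so nothing can be estimated
    return bigram_counts[(previous, current)] / unigram_counts[previous]

print(cond_prob("love", "I"))     # 'I' occurs 3 times, 'I love' occurs twice
print(cond_prob("you", "love"))   # 'love' occurs twice, 'love you' occurs once
```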

Before writing a function that builds the `co_occurrence_matrix`, we introduce a few helper classes and packages.

**Insight coding**
First, use `itertools.chain.from_iterable` to flatten a list of tokenized sentences.

In [57]:
draft_text = [["I", "love", "you", "and", "hers"], 
              ["Don't", "worried", "just", "kidding"], 
              ["I", "love", "NLP"],
              ["I", "love", "Machine", "Learning", "and", "Mathematics"]]

z = itertools.chain.from_iterable(draft_text)
z

<itertools.chain at 0x27243ddac88>

**Converting the `itertools.chain` object to a list gives a flat (1D) list of tokens**

In [58]:
data = list(z)
print(data)

['I', 'love', 'you', 'and', 'hers', "Don't", 'worried', 'just', 'kidding', 'I', 'love', 'NLP', 'I', 'love', 'Machine', 'Learning', 'and', 'Mathematics']


**Create the vocabulary_list, set and indexes**

In [59]:
vocab = set(data)
vocab = list(data)
print(vocab)

['I', 'love', 'you', 'and', 'hers', "Don't", 'worried', 'just', 'kidding', 'I', 'love', 'NLP', 'I', 'love', 'Machine', 'Learning', 'and', 'Mathematics']


In [64]:
len(vocab)

18

In [63]:
vocab_index = {word: i for i, word in enumerate(vocab)}
vocab_index

{'I': 12,
 'love': 13,
 'you': 2,
 'and': 16,
 'hers': 4,
 "Don't": 5,
 'worried': 6,
 'just': 7,
 'kidding': 8,
 'NLP': 11,
 'Machine': 14,
 'Learning': 15,
 'Mathematics': 17}

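Each word is mapped to the index of its *last* occurrence in `data`. The reason is that `vocab = list(data)` rebuilds the list from `data` rather than from the set, so `vocab` still holds all 18 tokens, repeats included, and the dictionary comprehension overwrites the index of a word every time it reappears.

For example, the last occurrence of the word `Mathematics` is at position `18` (`index = 17` in Python). To obtain a duplicate-free vocabulary, convert the set instead, i.e. `vocab = list(set(data))`, which is what the function `generate_co_occurrence_matrix` below does.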
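If a duplicate-free vocabulary with a reproducible order is wanted, one option (not the one used in this notebook) is `dict.fromkeys`, which keeps the first occurrence of each word in order of appearance, whereas the order of a `set` of strings can change between runs. A minimal sketch on the same token list:

```python
data = ['I', 'love', 'you', 'and', 'hers', "Don't", 'worried', 'just', 'kidding',
        'I', 'love', 'NLP', 'I', 'love', 'Machine', 'Learning', 'and', 'Mathematics']

# dict.fromkeys drops repeats and preserves the order of first appearance.
vocab = list(dict.fromkeys(data))

# Every word now gets one index in the range 0 .. len(vocab) - 1.
vocab_index = {word: i for i, word in enumerate(vocab)}

print(len(vocab))
print(vocab_index)
```

With contiguous indexes, a matrix of shape `(len(vocab), len(vocab))` has exactly one row and one column per distinct word.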

-------------------------------

**Create bigrams from all words in corpus**

In [62]:
bi_grams = list(bigrams(data))
print(bi_grams)

[('I', 'love'), ('love', 'you'), ('you', 'and'), ('and', 'hers'), ('hers', "Don't"), ("Don't", 'worried'), ('worried', 'just'), ('just', 'kidding'), ('kidding', 'I'), ('I', 'love'), ('love', 'NLP'), ('NLP', 'I'), ('I', 'love'), ('love', 'Machine'), ('Machine', 'Learning'), ('Learning', 'and'), ('and', 'Mathematics')]


***Count the bigrams***

In [46]:
bigram_freq = nltk.FreqDist(bi_grams).most_common(len(bi_grams))
bigram_freq

[(('I', 'love'), 3),
 (('love', 'you'), 1),
 (('you', 'and'), 1),
 (('and', 'hers'), 1),
 (('hers', "Don't"), 1),
 (("Don't", 'worried'), 1),
 (('worried', 'just'), 1),
 (('just', 'kidding'), 1),
 (('kidding', 'I'), 1),
 (('love', 'NLP'), 1),
 (('NLP', 'I'), 1),
 (('love', 'Machine'), 1),
 (('Machine', 'Learning'), 1),
 (('Learning', 'and'), 1),
 (('and', 'Mathematics'), 1)]

Note that `bigram_freq` is now a list of `tuple`s. 

For instance, the first element of this list is the `tuple: (('I', 'love'), 3)`, which combines the `tuple: ('I', 'love')` and the `integer: 3`.

In [47]:
type(bigram_freq), type(bigram_freq[1]), type(bigram_freq[0][1])

(list, tuple, int)

Now, for each `bigram` in `bigram_freq`, assign `current = bigram[0][1]` (the word itself) and `previous = bigram[0][0]` (the preceding word).

In [48]:
co_occurrence_matrix = np.zeros((len(vocab), len(vocab)))
for bigram in bigram_freq:
    current = bigram[0][1]   ## second word of the pair, e.g. 'love' in (('I', 'love'), 3)
    previous = bigram[0][0]  ## first word of the pair, e.g. 'I'
    count = bigram[1]        ## number of times the pair occurs, e.g. 3
    print(bigram, '\t\t', current, "\t\t", previous, "\t\t",count)

(('I', 'love'), 3) 		 love 		 I 		 3
(('love', 'you'), 1) 		 you 		 love 		 1
(('you', 'and'), 1) 		 and 		 you 		 1
(('and', 'hers'), 1) 		 hers 		 and 		 1
(('hers', "Don't"), 1) 		 Don't 		 hers 		 1
(("Don't", 'worried'), 1) 		 worried 		 Don't 		 1
(('worried', 'just'), 1) 		 just 		 worried 		 1
(('just', 'kidding'), 1) 		 kidding 		 just 		 1
(('kidding', 'I'), 1) 		 I 		 kidding 		 1
(('love', 'NLP'), 1) 		 NLP 		 love 		 1
(('NLP', 'I'), 1) 		 I 		 NLP 		 1
(('love', 'Machine'), 1) 		 Machine 		 love 		 1
(('Machine', 'Learning'), 1) 		 Learning 		 Machine 		 1
(('Learning', 'and'), 1) 		 and 		 Learning 		 1
(('and', 'Mathematics'), 1) 		 Mathematics 		 and 		 1


**Wrap the steps in a function**

In [49]:
def generate_co_occurrence_matrix(corpus):
    vocab = set(corpus)
    vocab = list(vocab)
    vocab_index = {word: i for i, word in enumerate(vocab)}
 
    # Create bigrams from all words in corpus
    bi_grams = list(bigrams(corpus))
 
    # Frequency distribution of bigrams ((word1, word2), num_occurrences)
    bigram_freq = nltk.FreqDist(bi_grams).most_common(len(bi_grams))
 
    # Initialise co-occurrence matrix
    # co_occurrence_matrix[current][previous]
    co_occurrence_matrix = np.zeros((len(vocab), len(vocab)))
 
    # Loop through the bigrams taking the current and previous word,
    # and the number of occurrences of the bigram.
    for bigram in bigram_freq:
        current = bigram[0][1]
        previous = bigram[0][0]
        count = bigram[1]
        pos_current = vocab_index[current]
        pos_previous = vocab_index[previous]
        co_occurrence_matrix[pos_current][pos_previous] = count
    
    # create matrix
    co_occurrence_matrix = np.matrix(co_occurrence_matrix)
 
    # return the matrix and the index
    return co_occurrence_matrix, vocab_index

draft_data = list(itertools.chain.from_iterable(draft_text))
matrix, vocab_index = generate_co_occurrence_matrix(draft_data)
  
data_matrix = pd.DataFrame(matrix, index=vocab_index,
                             columns=vocab_index)
data_matrix

Unnamed: 0,worried,just,hers,I,NLP,Mathematics,kidding,Don't,Learning,Machine,love,and,you
worried,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
just,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
hers,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
I,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
NLP,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
Mathematics,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
kidding,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Don't,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Learning,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
Machine,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
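The function above counts only the preceding word and works on the flattened token list, so pairs such as `('hers', "Don't")` cross a sentence boundary. As a variation, here is a sketch of a two-sided window of configurable size that stays inside each sentence; the function name `window_cooccurrence` and its `window_size` parameter are chosen here for illustration and are not part of the notebook's code.

```python
from collections import Counter

def window_cooccurrence(sentences, window_size=1):
    """Count (word, neighbour) pairs within `window_size` positions, per sentence."""
    counts = Counter()
    for sentence in sentences:
        for i, word in enumerate(sentence):
            lo = max(0, i - window_size)
            hi = min(len(sentence), i + window_size + 1)
            for j in range(lo, hi):
                if j != i:   # a word is not its own neighbour
                    counts[(word, sentence[j])] += 1
    return counts

draft_text = [["I", "love", "you", "and", "hers"],
              ["Don't", "worried", "just", "kidding"],
              ["I", "love", "NLP"],
              ["I", "love", "Machine", "Learning", "and", "Mathematics"]]

counts = window_cooccurrence(draft_text, window_size=1)
print(counts[("I", "love")], counts[("love", "I")])
print(counts[("hers", "Don't")])
```

Because the window looks both left and right, the count for `(a, b)` always equals the count for `(b, a)`, so the resulting matrix is symmetric.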


In [50]:
text = [["I", "love", "math", "." , "I", "love", "programming", ".", "I", "love", "Biology"],
         ["I", "love", "Math", "and", "Programming", "and", "Biology"],
         ["I", "hate", "cat", "." , "I", "hate", "dog", "and", "snake"]]

# Create one list using many lists
data = list(itertools.chain.from_iterable(text))
matrix, vocab_index = generate_co_occurrence_matrix(data)
  
data_matrix = pd.DataFrame(matrix, index=vocab_index,
                             columns=vocab_index)
data_matrix

Unnamed: 0,cat,dog,Programming,snake,I,hate,.,love,Math,Biology,and,programming,math
cat,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
dog,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Programming,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
snake,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
I,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,2.0,0.0,0.0,0.0
hate,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
.,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
love,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Math,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
Biology,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0


- Here the word `'love'` is described by its neighbours, for example `'I'` and `'programming'`: we increment the count for both the `'I love'` and the `'love programming'` pairs. Doing this for every window gives the `co-occurrence matrix` above.

- Since `'programming'` and `'math'` have the same co-occurrence values (each is preceded once by `'love'` and followed once by `'.'`), they are mapped to the same vector, meaning that in this context they mean the same thing (or 'pretty much' the same thing). `'Biology'` would be the closest word to these two: it has the closest possible meaning without being identical. The same reasoning applies to every word.

- The semantic and syntactic relationships captured by this technique are useful, but the computation is expensive because the vectors live in a very high-dimensional space (one dimension per vocabulary word). Therefore, we need a technique that reduces dimensionality with as little information loss as possible.

----------------------------------
### Summary.

#### Advantages
- It preserves the `semantic relationship` between words.
- It is typically combined with `SVD (singular value decomposition)` to reduce the size of the word vectors, which gives compact word vector representations.
- It uses `factorization`, which is a `well-defined problem` and can be solved efficiently.
- It only has to be computed once and can then be reused at any time, which saves computation afterwards.
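The SVD step mentioned in the list can be sketched with `numpy.linalg.svd`. The 4 x 4 matrix `M` below is a made-up symmetric co-occurrence matrix, and `k = 2` is an arbitrary choice of target dimension; the sketch only shows the mechanics of keeping the `k` largest singular values.

```python
import numpy as np

words = ["math", "programming", "cat", "dog"]

# Hypothetical symmetric co-occurrence counts (rows and columns follow `words`).
M = np.array([[0, 3, 0, 0],
              [3, 0, 0, 1],
              [0, 0, 0, 2],
              [0, 1, 2, 0]], dtype=float)

# Full decomposition: M = U @ diag(S) @ Vt, singular values in descending order.
U, S, Vt = np.linalg.svd(M)

k = 2                                  # target dimension (arbitrary here)
word_vectors = U[:, :k] * S[:k]        # one k-dimensional row per word
M_approx = word_vectors @ Vt[:k, :]    # best rank-k approximation of M

for word, vec in zip(words, word_vectors):
    print(word, np.round(vec, 3))
print("approximation error:", np.linalg.norm(M - M_approx))
```

The approximation error equals the root of the sum of the squared discarded singular values, so it shows how much information is lost for a given `k`.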

#### Disadvantage
The disadvantage of the **`Co-occurrence Matrix (CM)`** appears when the `text_data` has a `large vocabulary`: the matrix has one row and one column per distinct word, so storing it requires a huge amount of memory. 

To make the word representations clearer and to save the memory used to store the **`CM`**, we can remove unnecessary words (such as **`stopwords`**).
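The memory saving is easy to see on a toy example, since the matrix has one cell per pair of vocabulary words. In the sketch below the token list and the tiny `STOPWORDS` set are hand-picked for illustration; in practice a curated list (for example from NLTK) would be used.

```python
# Hypothetical token list and a tiny hand-picked stopword set.
tokens = ["I", "love", "Machine", "Learning", "and", "Mathematics", "and", "the", "NLP"]
STOPWORDS = {"i", "and", "the"}

# Drop stopwords before building the vocabulary (comparison is case-insensitive).
kept = [t for t in tokens if t.lower() not in STOPWORDS]

vocab_before = set(tokens)
vocab_after = set(kept)

# The co-occurrence matrix needs |V| x |V| cells.
print(len(vocab_before), "words ->", len(vocab_before) ** 2, "cells")
print(len(vocab_after), "words ->", len(vocab_after) ** 2, "cells")
```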

--------------------------

## 2. Word Embedding: using Prediction Base method

**GloVe Embedding**
