### Converting textual data into numerical vectors

1) Frequency-based methods: TF-IDF, CountVectorizer
2) Prediction-based embeddings: word2vec, GloVe -> capture semantic relationships and context

### TF-IDF

TF-IDF is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. It is computed by multiplying two metrics: how many times the word appears in a document (term frequency, TF) and the inverse document frequency (IDF) of the word across the set of documents.

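There are several ways of calculating TF, for example the raw count of the word in the document, or that count divided by the document length.
IDF measures how common or rare a word is across the entire document set. It is calculated by dividing the total number of documents by the number of documents that contain the word, and taking the logarithm of the result. The closer the value is to 0, the more common the word is; rarer words get larger values (up to log N for a word that appears in only one of N documents).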

Multiplying these two numbers results in the TF-IDF score of the word in a document. The higher the score, the more relevant that word is in that particular document.
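As a rough illustration of the calculation described above, here is a minimal pure-Python sketch. The toy corpus and the helper names `tf`, `idf`, and `tf_idf` are assumptions made for this example, and the TF variant (count divided by document length) is only one of several possible choices.

```python
import math
from collections import Counter

# Hypothetical toy corpus, for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
tokenized = [d.split() for d in docs]

def tf(term, tokens):
    # Term frequency: count of the term divided by the document length.
    return Counter(tokens)[term] / len(tokens)

def idf(term, corpus):
    # Inverse document frequency: log(N / df), where df is the number
    # of documents that contain the term. Unseen terms get 0.0 here.
    df = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / df) if df else 0.0

def tf_idf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)

for term in ("the", "cat"):
    print(term, round(tf_idf(term, tokenized[0], tokenized), 4))
```

In the first document, "cat" scores higher than "the" even though "the" occurs more often, because "the" also appears in another document. Libraries such as scikit-learn use a smoothed IDF variant by default, so their numbers will differ from this sketch.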


### CountVectorizer 

It converts a collection of text documents into a matrix of token counts. Each document becomes a vector whose entries are the number of times each vocabulary word appears in that document. These count vectors are a frequency-based representation of the text.
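The idea of a token-count matrix can be sketched in a few lines of plain Python. This is not scikit-learn's implementation; the toy documents and the helper `count_vector` are assumptions for illustration, and tokenization here is a simple whitespace split.

```python
from collections import Counter

# Hypothetical toy corpus, for illustration only.
docs = ["the cat sat", "the cat saw the dog"]

# Vocabulary: sorted unique tokens across all documents.
vocab = sorted({tok for d in docs for tok in d.split()})

def count_vector(doc, vocab):
    # One entry per vocabulary word: how often it occurs in this document.
    counts = Counter(doc.split())
    return [counts[t] for t in vocab]

matrix = [count_vector(d, vocab) for d in docs]

print(vocab)
for row in matrix:
    print(row)
```

Each row corresponds to one document and each column to one vocabulary word. scikit-learn's `CountVectorizer` follows the same idea but returns a sparse matrix and, with its default settings, lowercases the text and ignores single-character tokens.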


### Prediction based embeddings 

These methods use neural networks to learn vector representations, which are often distributed as pre-trained models. They are suitable when the vectors need to carry contextual information about each word.

word2vec: 300-D vector

word2vec learns word embeddings using shallow neural networks.
The pre-trained Google News model contains vectors for about 3 million words and phrases, trained on roughly 100 billion words of text.
Similar words have similar vectors.
Similarity is measured using the cosine similarity between two vectors.
The pre-trained model represents each word as a 300-D vector.
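To make the similarity measure concrete, here is a minimal sketch of cosine similarity on small hand-made vectors. The function name `cosine_sim` and the example vectors are assumptions for illustration; real word2vec vectors have 300 dimensions.

```python
import math

def cosine_sim(u, v):
    # cos(theta) = (u . v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]  # same direction as a, different length
c = [0.0, 0.0, 3.0]  # orthogonal to a

print(cosine_sim(a, b))  # close to 1: same direction
print(cosine_sim(a, c))  # 0: no shared direction
```

Because the measure depends only on direction, `a` and `b` are treated as maximally similar even though their lengths differ.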


Word2vec is not a single algorithm; it comes in two architectures:

CBOW (Continuous Bag of Words)
Skip-Gram model


1. Bag of words: Treat each sentence as a separate document and build a list of all unique words across all documents, excluding punctuation. Then create a vector for each sentence, assigning 1 at the indexes of the words that occur in the sentence and 0 everywhere else. The vector length equals the number of unique words in the dataset. This representation does not take context into account.
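The bag-of-words steps above can be sketched as follows. The example sentences and the helper names `tokenize` and `binary_bow` are hypothetical, chosen only to illustrate the procedure.

```python
import string

# Hypothetical example sentences.
sentences = ["I like apples.", "I like mangoes, not parties!"]

def tokenize(sentence):
    # Lowercase and strip punctuation before splitting on whitespace.
    table = str.maketrans("", "", string.punctuation)
    return sentence.lower().translate(table).split()

# Vocabulary of all unique words, with a fixed index for each word.
vocab = sorted({tok for s in sentences for tok in tokenize(s)})
index = {tok: i for i, tok in enumerate(vocab)}

def binary_bow(sentence):
    # 1 where the word occurs in the sentence, 0 elsewhere.
    vec = [0] * len(vocab)
    for tok in tokenize(sentence):
        vec[index[tok]] = 1
    return vec

print(vocab)
for s in sentences:
    print(binary_bow(s))
```

Word order is lost entirely: any reordering of a sentence produces the same vector, which is what "no context" means here.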

2. CBOW: a modified version of BOW. The context words surrounding a position are the input, and the model tries to predict the target word at that position. In the process of predicting the target word, the model (a neural network) learns vector representations of words that take context into account, which was not available in BOW.

We can give C context vectors to the model; the hidden layer averages them to produce a single vector, from which the target word is predicted.

![CBOW](datasets/CBOW.png)
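A minimal sketch of how CBOW training examples could be formed and how the context vectors are averaged. The helper names `cbow_pairs` and `average`, the sentence, and the 2-D toy embeddings are all made up for illustration; no network is trained here.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs for CBOW training."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:
            pairs.append((context, target))
    return pairs

def average(vectors):
    # Element-wise mean of equally sized vectors: the hidden-layer
    # representation of the context in CBOW.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

tokens = "the quick brown fox jumps".split()
pairs = cbow_pairs(tokens, window=1)
for context, target in pairs:
    print(context, "->", target)

# Made-up 2-D embeddings for two context words.
toy_embeddings = {"the": [1.0, 0.0], "brown": [3.0, 2.0]}
hidden = average([toy_embeddings[w] for w in ["the", "brown"]])
print(hidden)
```

With `window=1`, the word "quick" is predicted from the context `["the", "brown"]`, and the averaged vector is what the output layer would see for that example.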

3. Skip-gram model: follows the same strategy as CBOW but in reverse: given an input word, the model tries to predict its context words.

If we provide one word to the model, it predicts one context word to the left and one to the right of that word. This is a skip-gram model with a context window of 1. A larger window predicts more context words.

![skip-gram](datasets/skip%20gram.png)
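For comparison, here is a small sketch of how skip-gram training pairs could be generated: each (input word, context word) pair becomes one prediction task. The helper name `skipgram_pairs` and the sentence are assumptions for illustration.

```python
def skipgram_pairs(tokens, window=1):
    """Return (center_word, context_word) pairs for skip-gram training."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the quick brown fox".split()
pairs = skipgram_pairs(tokens, window=1)
for center, context in pairs:
    print(center, "->", context)
```

With a window of 1, "quick" produces two pairs, one for its left neighbour and one for its right neighbour. Increasing `window` adds more context words per input word.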

Advantages of the CBOW model:
it is probabilistic in nature and generally performs better than deterministic methods.
it has low memory (RAM) requirements.

Disadvantages of the CBOW model:
because it averages the context, it blends the different senses of a single word into one vector. Ex: Apple as a company and as a fruit. Skip-gram is often preferred for rare words, but note that standard word2vec stores a single vector per word in either mode, so neither architecture separates the two senses on its own.


Word2vec can be trained with either technique; an implementation typically selects one per training run (for example, via the `sg` parameter in gensim).


### Word2vec in practice

In [5]:
# New URL for the Google News vectors
!curl -L -o ./datasets/GoogleNews-vectors-negative300.bin.gz "https://github.com/mmihaltz/word2vec-GoogleNews-vectors/raw/master/GoogleNews-vectors-negative300.bin.gz"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 1570M  100 1570M    0     0  22.5M      0  0:01:09  0:01:09 --:--:-- 21.8M


In [9]:
from gensim.models import KeyedVectors
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

In [6]:
word_vectors = KeyedVectors.load_word2vec_format('./datasets/GoogleNews-vectors-negative300.bin.gz', binary=True)



In [7]:
v_banana = word_vectors['banana']
v_mango = word_vectors['mango']

print(cosine_similarity([v_banana], [v_mango]))


[[0.6365211]]


In [8]:
# A cosine similarity of about 0.64 means the mango and banana vectors point in fairly similar directions (it is not a percentage).

In [11]:
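def odd_one_out(words, word_vectors):
    # Average all word vectors, then return the word least similar to that average.
    all_word_vectors = [word_vectors[w] for w in words]
    avg_vector = np.mean(all_word_vectors, axis=0)
    odd_word = None  # renamed so it does not shadow the function name
    min_sim = float("inf")

    for w in words:
        # cosine_similarity returns a 1x1 array; extract the scalar.
        sim = cosine_similarity([word_vectors[w]], [avg_vector])[0][0]
        if sim < min_sim:
            min_sim = sim
            odd_word = w

    return odd_word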


In [12]:
list_of_words = ["apple","mango","party","juice","orange"]

print(odd_one_out(list_of_words,word_vectors))

party


In [17]:
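def word_analogies(A, B, C, word_vectors):
    """Find D such that B - A is closest to D - C, i.e. D = B - A + C."""
    A, B, C = A.lower(), B.lower(), C.lower()
    max_sim = -100
    D = None

    words = word_vectors.index_to_key
    WA, WB, WC = word_vectors[A], word_vectors[B], word_vectors[C]

    # Note: this loops over the whole vocabulary, so it is slow on the full model.
    for w in words:
        if w in [A, B, C]:
            continue
        w_vector = word_vectors[w]
        # cosine_similarity returns a 1x1 array; extract the scalar.
        sim = cosine_similarity([WB - WA], [w_vector - WC])[0][0]
        if sim > max_sim:
            max_sim = sim
            D = w
    return D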

In [18]:
D = word_analogies("Man","Woman","King",word_vectors)

In [20]:
print(D)

queen
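The rule D = B - A + C can also be checked on a tiny hand-made vocabulary without loading the full model. In this sketch the 2-D vectors, the `toy_vectors` dictionary, and the helpers `cosine_sim` and `analogy` are all invented for illustration. It also uses a slightly different scoring from `word_analogies` above: it builds the target vector B - A + C first and then picks the nearest remaining word.

```python
import math

def cosine_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Made-up 2-D vectors: first axis ~ "royalty", second axis ~ "gender".
toy_vectors = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [3.0, 0.0],
    "queen": [3.0, 1.0],
    "apple": [0.0, 3.0],
}

def analogy(a, b, c, vectors):
    # Target vector: B - A + C, computed element-wise.
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the three query words, then pick the most similar remaining word.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_sim(vectors[w], target))

print(analogy("man", "woman", "king", toy_vectors))
```

Excluding the query words matters: without that filter, one of the inputs (often C) can end up as the nearest neighbour of the target vector.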


In [19]:
result = word_vectors.most_similar(positive = ['woman','king'], negative = ['man'],topn=1)
print(result)



[('queen', 0.7118192911148071)]
