# Techniques to Represent Words as Vectors

1. [Overview](#1)
1. [Count Vectorizer](#2)
1. [Count Vectorizer Implementation](#3)
1. [TF-IDF](#4)
1. [TF-IDF Implementation](#5)
1. [Hashing Vectorizer](#6)
1. [Hashing Vectorizer Implementation](#7)
1. [word2vec Model](#8)
1. [word2vec Model Implementation](#9)

## <span id="1"></span>  1. Overview

Vectorization is the classic approach of converting input data from its raw format (for example, text) into vectors of real numbers, which is the format that ML models work with. The approach long predates modern NLP and has worked well across many domains; it is now a standard step in NLP pipelines.

In Machine Learning, vectorization is a step in feature extraction. The idea is to get some distinct features out of the text for the model to train on, by converting text to numerical vectors.

## <span id="2"></span>  2. Count Vectorizer

CountVectorizer is a tool provided by the scikit-learn library in Python. It transforms a given text into a vector based on the frequency (count) of each word that occurs in the text. This is helpful when we have multiple such texts and want to convert each of them into a vector for further text analysis.

Let us consider a few sample texts from a document (each as a list element):


document = ["One Geek helps Two Geeks", "Two Geeks help Four Geeks", "Each Geek helps many other Geeks at GeeksforGeeks."]

CountVectorizer creates a matrix in which each unique word is represented by a column, and each text sample from the document is a row. The value of each cell is the count of that word in that particular text sample.

This can be visualized as follows:

<img src="https://media.geeksforgeeks.org/wp-content/uploads/20200706061418/table.PNG" class="center">

Key Observations:

1. There are 12 unique words in the document, represented as columns of the table.
2. There are 3 text samples in the document, each represented as a row of the table.
3. Every cell contains a number that represents the count of the word in that particular text.
4. All words have been converted to lowercase.
5. The words in the columns have been arranged alphabetically.
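
The table above can be reproduced without any library. The following is a minimal sketch, not scikit-learn's implementation: the helper names `tokenize` and `count_matrix` are made up for illustration, and the tokenization rule (lowercase, words of two or more characters) is an assumption chosen to mirror CountVectorizer's default behavior.

```python
import re
from collections import Counter

document = [
    "One Geek helps Two Geeks",
    "Two Geeks help Four Geeks",
    "Each Geek helps many other Geeks at GeeksforGeeks.",
]

def tokenize(text):
    # Lowercase the text, then keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

def count_matrix(texts):
    tokenized = [tokenize(t) for t in texts]
    # One column per unique word, sorted alphabetically.
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)  # missing words count as 0
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

vocabulary, rows = count_matrix(document)
print(vocabulary)
for row in rows:
    print(row)
```

Each printed row is one text sample and each position in the row corresponds to one word of the sorted vocabulary, which matches the five observations listed above.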

## <span id="3"></span>  3. Count Vectorizer Implementation

##### Importing Necessary Libraries 

In [36]:
import numpy as np
import pandas as pd
import seaborn as sns

##### Import Dataset

In [88]:
df_cv = pd.read_csv(r'C:\Users\imsanjoykb\Downloads\test.csv')
df_cv

Unnamed: 0,test,class
0,I love Bangladesh,1
1,Could you give me an iphone?,0
2,Hello how are you?,1
3,I want to talk you.,1


##### Separate Features & Label Columns

In [89]:
x = df_cv['test']
x

0               I love Bangladesh
1    Could you give me an iphone?
2              Hello how are you?
3             I want to talk you.
Name: test, dtype: object

In [90]:
y = df_cv['class']
y

0    1
1    0
2    1
3    1
Name: class, dtype: int64

##### Apply Count Vectorizer

In [91]:
### Import Libraries
from sklearn.feature_extraction.text import CountVectorizer

In [92]:
cv = CountVectorizer()

In [93]:
x = cv.fit_transform(x)

In [94]:
x.shape

(4, 14)

In [95]:
cv.vocabulary_

{'love': 8,
 'bangladesh': 2,
 'could': 3,
 'you': 13,
 'give': 4,
 'me': 9,
 'an': 0,
 'iphone': 7,
 'hello': 5,
 'how': 6,
 'are': 1,
 'want': 12,
 'to': 11,
 'talk': 10}
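
Note that the vocabulary has 14 entries and contains neither "I" nor any punctuation. Scikit-learn's documented default `token_pattern` is `(?u)\b\w\w+\b`, which keeps only tokens of two or more word characters, and `lowercase=True` lowercases the text first. The sketch below applies that pattern with Python's `re` module to the four sentences of the dataset. It illustrates the default tokenization rule only and is not a copy of the library's code.

```python
import re

sentences = [
    "I love Bangladesh",
    "Could you give me an iphone?",
    "Hello how are you?",
    "I want to talk you.",
]

# Same regular expression as CountVectorizer's default token_pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

tokens_per_sentence = [TOKEN_PATTERN.findall(s.lower()) for s in sentences]
vocab = sorted(set(tok for toks in tokens_per_sentence for tok in toks))

print(tokens_per_sentence[0])  # ['love', 'bangladesh']: the one-letter "I" is dropped
print(len(vocab))              # 14
```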

In [96]:
doc = x.toarray()

In [97]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
list(cv.get_feature_names_out())

['an',
 'are',
 'bangladesh',
 'could',
 'give',
 'hello',
 'how',
 'iphone',
 'love',
 'me',
 'talk',
 'to',
 'want',
 'you']

In [99]:
freq = pd.DataFrame(doc, index=df_cv, columns=cv.get_feature_names_out())
freq

Unnamed: 0,an,are,bangladesh,could,give,hello,how,iphone,love,me,talk,to,want,you
"(I love Bangladesh, 1)",0,0,1,0,0,0,0,0,1,0,0,0,0,0
"(Could you give me an iphone?, 0)",1,0,0,1,1,0,0,1,0,1,0,0,0,1
"(Hello how are you?, 1)",0,1,0,0,0,1,1,0,0,0,0,0,0,1
"(I want to talk you., 1)",0,0,0,0,0,0,0,0,0,0,1,1,1,1


## <span id="4"></span>  4. TF-IDF

Term Frequency (tf) - It gives the frequency of a word in each document in the corpus. It is the ratio of the number of times the word appears in a document to the total number of words in that document. It increases as the number of occurrences of that word within the document increases.

Inverse Document Frequency (idf) - It is used to calculate the weight of rare words across all documents in the corpus. Words that occur rarely in the corpus have a high IDF score.<br>


Example:<br>
Sentence 1: The car is driven on the road.

Sentence 2: The truck is driven on the highway.

<img src="https://cdn-media-1.freecodecamp.org/images/1*q3qYevXqQOjJf6Pwdlx8Mw.png" class="center">

From the table above, we can see that the TF-IDF of the common words is zero, which shows they are not significant. On the other hand, the TF-IDF values of “car”, “truck”, “road”, and “highway” are non-zero, so these words carry more significance.
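
The zero scores for the shared words follow directly from the definitions: a word that appears in every sentence has idf = log(2/2) = 0. Here is a minimal sketch of the textbook formula (tf = count / sentence length, idf = log(N / document frequency)) applied to the two sentences. The base-10 logarithm and the helper names are assumptions made for illustration; another base would only rescale the non-zero scores. This is not the smoothed variant that scikit-learn uses.

```python
import math
from collections import Counter

sentences = [
    "The car is driven on the road.",
    "The truck is driven on the highway.",
]

def tokenize(text):
    # Lowercase, drop the final period, split on whitespace.
    return text.lower().replace(".", "").split()

docs = [tokenize(s) for s in sentences]
n_docs = len(docs)

# Document frequency: in how many sentences does each word occur?
doc_freq = Counter(word for doc in docs for word in set(doc))

def tf_idf(doc):
    counts = Counter(doc)
    result = {}
    for word, count in counts.items():
        tf = count / len(doc)                      # term frequency
        idf = math.log10(n_docs / doc_freq[word])  # inverse document frequency
        result[word] = tf * idf
    return result

scores = [tf_idf(doc) for doc in docs]
for sentence, sentence_scores in zip(sentences, scores):
    print(sentence)
    for word, score in sentence_scores.items():
        print(f"  {word:8s} {score:.3f}")
```

Words shared by both sentences ("the", "is", "driven", "on") score 0.000, while "car", "road", "truck", and "highway" each score about 0.043.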


## <span id="5"></span>  5. TF-IDF Implementation

##### Import Dataset

In [49]:
df_tfidf = pd.read_csv(r'C:\Users\imsanjoykb\Downloads\test.csv')
df_tfidf

Unnamed: 0,test,class
0,I love Bangladesh,1
1,Could you give me an iphone?,0
2,Hello how are you?,1
3,I want to talk you.,1


##### Separate Features & Label Columns

In [50]:
x2 = df_tfidf['test']
x2

0               I love Bangladesh
1    Could you give me an iphone?
2              Hello how are you?
3             I want to talk you.
Name: test, dtype: object

In [51]:
y2 = df_tfidf['class']
y2

0    1
1    0
2    1
3    1
Name: class, dtype: int64

In [52]:
### Import Tf-Idf Libraries
from sklearn.feature_extraction.text import TfidfVectorizer

In [53]:
tf = TfidfVectorizer()

In [54]:
x_tfidf = tf.fit_transform(x2)

In [55]:
x_tfidf.shape

(4, 14)

In [56]:
doc2 = x_tfidf.toarray()

In [57]:
doc2

array([[0.        , 0.        , 0.70710678, 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.70710678, 0.        ,
        0.        , 0.        , 0.        , 0.        ],
       [0.43003652, 0.        , 0.        , 0.43003652, 0.43003652,
        0.        , 0.        , 0.43003652, 0.        , 0.43003652,
        0.        , 0.        , 0.        , 0.27448674],
       [0.        , 0.5417361 , 0.        , 0.        , 0.        ,
        0.5417361 , 0.5417361 , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.34578314],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.5417361 , 0.5417361 , 0.5417361 , 0.34578314]])
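
The values in the array above can be checked by hand. With its default settings (`smooth_idf=True`, `norm='l2'`), TfidfVectorizer weights each raw count by idf = ln((1 + n) / (1 + df)) + 1 and then scales every row to unit Euclidean length. The following sketch recomputes the second row in plain Python under those assumptions; the helper names are hypothetical and the code is a verification aid, not the library's implementation.

```python
import math
import re
from collections import Counter

sentences = [
    "I love Bangladesh",
    "Could you give me an iphone?",
    "Hello how are you?",
    "I want to talk you.",
]

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

docs = [tokenize(s) for s in sentences]
n_docs = len(docs)
doc_freq = Counter(word for doc in docs for word in set(doc))

def smoothed_tfidf_row(doc):
    counts = Counter(doc)
    # Raw count times smoothed idf: ln((1 + n) / (1 + df)) + 1
    weights = {
        word: count * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
        for word, count in counts.items()
    }
    # L2 normalization so that the row has unit length.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {word: w / norm for word, w in weights.items()}

row = smoothed_tfidf_row(docs[1])
print(round(row["iphone"], 6), round(row["you"], 6))  # 0.430037 0.274487
```

The word "you" occurs in three of the four sentences, so its idf is lower and it receives a smaller weight than the words that occur only once in the corpus.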

In [58]:
# Use df_tfidf (the DataFrame loaded in this section); `df` is not defined here
freq2 = pd.DataFrame(doc2, index=df_tfidf, columns=tf.get_feature_names_out())
freq2

Unnamed: 0,an,are,bangladesh,could,give,hello,how,iphone,love,me,talk,to,want,you
"(I love Bangladesh, 1)",0.0,0.0,0.707107,0.0,0.0,0.0,0.0,0.0,0.707107,0.0,0.0,0.0,0.0,0.0
"(Could you give me an iphone?, 0)",0.430037,0.0,0.0,0.430037,0.430037,0.0,0.0,0.430037,0.0,0.430037,0.0,0.0,0.0,0.274487
"(Hello how are you?, 1)",0.0,0.541736,0.0,0.0,0.0,0.541736,0.541736,0.0,0.0,0.0,0.0,0.0,0.0,0.345783
"(I want to talk you., 1)",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.541736,0.541736,0.541736,0.345783


## <span id="6"></span>  6. Hashing Vectorizer

With HashingVectorizer, each token maps directly to a column position in a matrix whose size is predefined. For example, if you have 10,000 columns in your matrix, each token maps to 1 of the 10,000 columns. This mapping happens via hashing. The hash function used is MurmurHash3.

The figure below gives a visual example of how HashingVectorizer works.


<img src="https://kavita-ganesan.com/wp-content/uploads/how-hashingvectorizer-works.png" class="center">

The benefit of not storing the vocabulary (dictionary of tokens) is twofold. First, this is very efficient for a large dataset.

Holding a 300M token vocabulary in memory could be a challenge in certain computing environments, as the tokens are essentially strings and demand more memory than their integer counterparts.

Second, by not having to store the vocabulary, the resulting HashingVectorizer object, when saved, is much smaller and thus faster to load back into memory when needed.
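
To make the idea concrete, here is a minimal sketch of the hashing trick in plain Python. It is an illustration under stated assumptions, not scikit-learn's implementation: it uses MD5 from `hashlib` as a stand-in hash because MurmurHash3 is not in the standard library, the function name `hashed_counts` is hypothetical, and it omits the sign alternation and row normalization that HashingVectorizer applies by default.

```python
import hashlib
import re

def hashed_counts(text, n_features=16):
    """Count tokens into a fixed-size row; no vocabulary is stored."""
    row = [0] * n_features
    for token in re.findall(r"\b\w\w+\b", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        # Interpret the first 4 bytes as an integer and fold it into the row.
        column = int.from_bytes(digest[:4], "little") % n_features
        row[column] += 1
    return row

row_a = hashed_counts("Hello how are you?")
row_b = hashed_counts("I want to talk you.")
print(row_a)
print(row_b)
```

Because the column is computed from the token itself, the same token always lands in the same column and nothing has to be fitted or stored. The trade-off is that two different tokens can collide in one column, and a column index cannot be mapped back to a word.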

## <span id="7"></span>  7. Hashing Vectorizer Implementation

##### Import Dataset

In [59]:
df_hv = pd.read_csv(r'C:\Users\imsanjoykb\Downloads\test.csv')
df_hv

Unnamed: 0,test,class
0,I love Bangladesh,1
1,Could you give me an iphone?,0
2,Hello how are you?,1
3,I want to talk you.,1


##### Separate Features & Label Columns

In [60]:
x3 = df_hv['test']
x3

0               I love Bangladesh
1    Could you give me an iphone?
2              Hello how are you?
3             I want to talk you.
Name: test, dtype: object

In [61]:
y3 = df_hv['class']
y3

0    1
1    0
2    1
3    1
Name: class, dtype: int64

In [62]:
from sklearn.feature_extraction.text import HashingVectorizer

In [63]:
hv = HashingVectorizer(n_features = 100)

In [64]:
x_hv = hv.fit_transform(x3)

In [65]:
x_hv.shape

(4, 100)

## <span id="8"></span>  8. word2vec model

##### How the word2vec model is trained
Move through the training corpus with a sliding window: Each word is a prediction problem.
The objective is to predict the current word using the neighboring words (or vice versa).
The outcome of the prediction determines whether we adjust the current word vector. Gradually, vectors converge to (hopefully) optimal values.
For example, we can use “artificial” to predict “intelligence”.


<img src="https://miro.medium.com/max/2400/1*WcIBmz0jR8KTtTkwGhQmmg.png" class="center">
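
The sliding-window step can be sketched in a few lines of Python. The function below only generates the (current word, neighboring word) training pairs; it leaves out the neural network, the vector updates, and the sampling optimizations of the real algorithm. The function name `context_pairs` and the example sentence are made up for illustration.

```python
def context_pairs(tokens, window=2):
    """Return (center, context) pairs for every word and its neighbors."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

pairs = context_pairs(["artificial", "intelligence", "is", "fun"], window=1)
for center, context in pairs:
    print(center, "->", context)
```

With `window=1`, the word "artificial" is paired only with "intelligence", while "intelligence" is paired with both "artificial" and "is". Each pair is one prediction problem of the kind described above.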

##### Implementing Word2vec embedding in Gensim
min_count: Minimum number of occurrences of a word in the corpus for it to be included in the model. The higher the number, the fewer words end up in the vocabulary.<br>
window: The maximum distance between the current and predicted word within a sentence.<br>
size: The dimensionality of the feature vectors (this parameter is named vector_size in Gensim 4.x).<br>
workers: The number of worker threads used for training, for example 4 on a machine with 4 cores.<br>
model.build_vocab: Prepare the model vocabulary.<br>
model.train: Train word vectors.<br>
model.init_sims(): When we do not plan to train the model any further, we use this call to make the model more memory-efficient (it is deprecated in Gensim 4.x).

## <span id="9"></span>  9. word2vec model Implementation

##### Import Dataset

In [75]:
df_wv = pd.read_csv(r'C:\Users\imsanjoykb\Downloads\test.csv')
df_wv

Unnamed: 0,test,class
0,I love Bangladesh,1
1,Could you give me an iphone?,0
2,Hello how are you?,1
3,I want to talk you.,1


##### Separate Features & Label Columns

In [76]:
x4 = df_wv['test']
x4

0               I love Bangladesh
1    Could you give me an iphone?
2              Hello how are you?
3             I want to talk you.
Name: test, dtype: object

In [77]:
y4 = df_wv['class']
y4

0    1
1    0
2    1
3    1
Name: class, dtype: int64

In [78]:
from gensim.models import Word2Vec, KeyedVectors

In [79]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\imsanjoykb\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [85]:
textvector = [nltk.word_tokenize(test) for test in x4]
textvector

[['I', 'love', 'Bangladesh'],
 ['Could', 'you', 'give', 'me', 'an', 'iphone', '?'],
 ['Hello', 'how', 'are', 'you', '?'],
 ['I', 'want', 'to', 'talk', 'you', '.']]

In [86]:
model = Word2Vec(textvector,min_count=1)

In [87]:
model.wv.most_similar('give')

[('iphone', 0.1747603714466095),
 ('Hello', 0.11118055880069733),
 ('talk', 0.10888979583978653),
 ('want', 0.10560770332813263),
 ('you', 0.09291953593492508),
 ('to', 0.0805869922041893),
 ('Could', 0.004842505790293217),
 ('?', -0.0027540193405002356),
 ('an', -0.013679753988981247),
 ('are', -0.025461044162511826)]
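
The scores returned by `most_similar` are cosine similarities between word vectors. With only four short sentences the vectors are barely trained, so the scores above are all close to zero and should not be read as meaningful relationships. As a reference for how the measure itself behaves, here is a minimal sketch in plain Python; the function name and the toy vectors are chosen for illustration only.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([3.0, 4.0], [6.0, 8.0]))    # same direction: 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))    # orthogonal: 0.0
print(cosine_similarity([3.0, 4.0], [-3.0, -4.0]))  # opposite direction: -1.0
```

A score near 1 means two vectors point in nearly the same direction, a score near 0 means they are unrelated, and a negative score means they point in opposing directions.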