## Vectorization

Techniques for converting text data into numerical vectors:
* OHE (One-Hot Encoding): represents each word as a vector of 0s with a single 1 at that word's index in the vocabulary.
* BOW (Bag of Words) / count vectorization: represents each document as a vector of word counts (frequencies).
* TF-IDF (term frequency-inverse document frequency): represents each document as a vector of TF-IDF scores. A word gets more weight when it is frequent in a document but rare across the other documents.
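
The three techniques above can be sketched in plain Python. This is a minimal illustration, not a library implementation: the toy documents and the helper names (`one_hot`, `bag_of_words`, `tf_idf`) are made up for this example, and the TF-IDF variant used here (`tf = count / document length`, `idf = log(N / df)`) is only one of several common definitions.

```python
import math
from collections import Counter

# Toy corpus, chosen only for illustration (not part of the notebook's data).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

def one_hot(word):
    """One-hot vector: 1 at the word's vocabulary index, 0 elsewhere."""
    return [1 if w == word else 0 for w in vocab]

def bag_of_words(tokens):
    """Count vector: how often each vocabulary word occurs in the document."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

def tf_idf(tokens):
    """TF-IDF with tf = count / len(doc) and idf = log(N / df).
    Libraries such as scikit-learn use smoothed variants, so their numbers differ."""
    counts = Counter(tokens)
    n_docs = len(tokenized)
    vec = []
    for w in vocab:
        tf = counts[w] / len(tokens)
        df = sum(1 for doc in tokenized if w in doc)  # documents containing w
        vec.append(tf * math.log(n_docs / df))
    return vec

bow = bag_of_words(tokenized[0])
weights = tf_idf(tokenized[0])
print(dict(zip(vocab, bow)))
print({w: round(x, 3) for w, x in zip(vocab, weights) if x > 0})
```

In the first document 'the' occurs twice and 'cat' once, yet 'cat' ends up with the higher TF-IDF weight because it appears in only one of the three documents.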

Shortcomings of these vectorization techniques:

* (IMP) They have no semantic understanding of the words. Take the words king, man, queen, women, apple, and banana: 'king' is closely associated with 'queen', but these vectors cannot express that.
* They all create a very sparse matrix.
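
Both shortcomings can be seen with one-hot vectors. The sketch below assumes a six-word vocabulary built from the example words above; the helper name `cosine` is introduced only for this illustration.

```python
import numpy as np
from numpy.linalg import norm

words = ["king", "queen", "man", "women", "apple", "banana"]
one_hot = np.eye(len(words))  # row i is the one-hot vector of words[i]

def cosine(v1, v2):
    """Cosine similarity between two vectors."""
    return np.dot(v1, v2) / (norm(v1) * norm(v2))

king, queen, apple = one_hot[0], one_hot[1], one_hot[4]

# Any two different one-hot vectors are orthogonal, so 'king' is exactly
# as (dis)similar to 'queen' as it is to 'apple'.
print(cosine(king, queen))
print(cosine(king, apple))

# Fraction of entries that are zero: grows towards 1 as the vocabulary grows.
sparsity = 1 - np.count_nonzero(one_hot) / one_hot.size
print(sparsity)
```

With only six words, five out of every six entries are already zero; with a realistic vocabulary almost every entry is.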

What if we could add semantic understanding to these vectors?

In that case, the model would be able to understand that king and queen are more similar to each other than king and apple.

**Word Embedding (adds semantic understanding to words)**

In [1]:
import numpy as np
import pandas as pd

from numpy.linalg import norm


In [2]:
# Let's see how we can add semantic understanding to words.
# We will create vectors for the words using some hidden features.


king = np.array([0.99, 0.97, 0.01])
queen = np.array([0.96, 0.53, 0.02])
man = np.array([0.12, 1, 0.02])
women = np.array([0.14, 0.5, 0.01])
apple = np.array([0.02, 0.01, 0.96])
banana = np.array([0.01, 0.01, 0.97])

In [3]:
# Now we have vectors that carry some hidden information about the words.
# Can we find which word is most similar to the word 'king'?
# Yes, by computing the cosine similarity between the vectors.

def cosine_similarity(v1, v2):
    # Dot product divided by the product of the vector lengths (L2 norms)
    simi = np.dot(v1, v2) / (norm(v1) * norm(v2))
    return simi
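
For reference, the function above computes the standard cosine similarity. Written out, and worked through for the `king` and `queen` vectors defined earlier (intermediate values rounded to four decimals):

```latex
\cos(\theta) = \frac{v_1 \cdot v_2}{\lVert v_1 \rVert \, \lVert v_2 \rVert}

% king = (0.99, 0.97, 0.01), queen = (0.96, 0.53, 0.02)
v_1 \cdot v_2 = 0.99 \cdot 0.96 + 0.97 \cdot 0.53 + 0.01 \cdot 0.02 = 1.4647

\lVert v_1 \rVert = \sqrt{0.99^2 + 0.97^2 + 0.01^2} \approx 1.3860, \qquad
\lVert v_2 \rVert = \sqrt{0.96^2 + 0.53^2 + 0.02^2} \approx 1.0968

\cos(\theta) \approx \frac{1.4647}{1.3860 \times 1.0968} \approx 0.9635
```

A value close to 1 means the two vectors point in nearly the same direction; a value close to 0 means they are nearly orthogonal.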

In [4]:
# Let's find the similarity of every other vector with the vector for 'king'.
# Note: i is the vector itself, so each output line shows the vector, not the word.

for i in [queen,man,women,apple,banana]:
  print(f'{i}:Similarity Score: {cosine_similarity(king,i)}')

[0.96 0.53 0.02]:Similarity Score: 0.9635160049612765
[0.12 1.   0.02]:Similarity Score: 0.7799426405594699
[0.14 0.5  0.01]:Similarity Score: 0.8664834755420315
[0.02 0.01 0.96]:Similarity Score: 0.029377359668652032
[0.01 0.01 0.97]:Similarity Score: 0.021790878923892244


In [5]:
# If these vectors really carry semantic meaning,
# can we solve the following expression?
# king - man + women = ??

king - man + women

# The resulting vector is very similar to the vector for 'queen'

array([1.01, 0.47, 0.  ])
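
To check the "very similar to queen" claim rather than judge it by eye, the result can be compared against every word with cosine similarity. This is a small sketch that reuses the hand-made vectors from above; the dictionary `vectors` and the helper `nearest_word` are names introduced here for illustration. Excluding the input words mirrors what embedding libraries usually do for analogy queries.

```python
import numpy as np
from numpy.linalg import norm

# Same hand-made vectors as in the cells above.
vectors = {
    "king":   np.array([0.99, 0.97, 0.01]),
    "queen":  np.array([0.96, 0.53, 0.02]),
    "man":    np.array([0.12, 1.00, 0.02]),
    "women":  np.array([0.14, 0.50, 0.01]),
    "apple":  np.array([0.02, 0.01, 0.96]),
    "banana": np.array([0.01, 0.01, 0.97]),
}

def cosine_similarity(v1, v2):
    return np.dot(v1, v2) / (norm(v1) * norm(v2))

def nearest_word(target, exclude=()):
    """Return (word, score) for the vector closest to `target`,
    skipping any word listed in `exclude`."""
    scores = {
        word: cosine_similarity(target, vec)
        for word, vec in vectors.items()
        if word not in exclude
    }
    best = max(scores, key=scores.get)
    return best, scores[best]

result = vectors["king"] - vectors["man"] + vectors["women"]
word, score = nearest_word(result, exclude=("king", "man", "women"))
print(word, score)
```

With these toy vectors the closest remaining word is 'queen'.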

## Word Embedding models

* Word2Vec
    * Pretrained models: already trained on datasets such as Google News, Wikipedia, or Gigaword.
    * Trainable models: can be trained on our own dataset. Examples: CBOW and SkipGram.

* GloVe (pretrained model)
* FastText (pretrained model)

## Trainable models (CBOW and SkipGram)

In [6]:
# CBOW and SkipGram

# Let's create a corpus of tokens

corpus = [['cat','and','dog','are','very','comman','pets'],
          ['cat','is','a','lazy','pet','and','likes','to','eat','fish'],
          ['people','also','have','fish','as','pet'],
          ['cat','and','fish','should','not','be','pet','in','the','same','house'],
          ['dog','hate','cat','but','they','can','live','with','fish']]


In [7]:
pip install gensim -q

Note: you may need to restart the kernel to use updated packages.





In [8]:
from gensim.models import Word2Vec

In [9]:
model_cbow = Word2Vec(sentences=corpus, window=3, vector_size=20, min_count=1, sg=0)  # sg=0 selects CBOW

# sg=1 -> SkipGram
# sg=0 -> CBOW
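
The difference between the two settings lies in how training examples are built from a context window. The sketch below only illustrates that pairing: `training_pairs` is a hypothetical helper, not part of gensim, and a real Word2Vec implementation adds further steps (for example negative sampling and randomly shrunk windows) that are left out here.

```python
def training_pairs(tokens, window):
    """Build illustrative training examples for one sentence.

    CBOW:     (list of context words) -> target word
    SkipGram: target word -> one context word per pair
    """
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        # Up to `window` words on each side of the target.
        context = tokens[max(0, i - window): i] + tokens[i + 1: i + 1 + window]
        cbow.append((context, target))
        skipgram.extend((target, c) for c in context)
    return cbow, skipgram

sentence = ['people', 'also', 'have', 'fish', 'as', 'pet']
cbow_pairs, skipgram_pairs = training_pairs(sentence, window=2)

print(cbow_pairs[3])       # context words around 'fish', then 'fish' itself
print(skipgram_pairs[:2])  # 'people' paired with each of its context words
```

CBOW produces one example per word (predict the word from its neighbours), while SkipGram produces one example per word-neighbour pair (predict each neighbour from the word).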

In [10]:
# Let's see the word embedding for the word 'pet'

model_cbow.wv['pet']

array([-0.04801775,  0.02503647, -0.04379793, -0.02195913, -0.0001755 ,
       -0.00148091, -0.0383062 ,  0.04807372,  0.02491029,  0.04616572,
       -0.04078959,  0.02247899, -0.02068538,  0.00412268,  0.0424931 ,
       -0.02231088,  0.0225875 , -0.0339348 , -0.01774244,  0.04699254],
      dtype=float32)

In [11]:
model_cbow.wv.most_similar('cat',topn=10)

[('have', 0.4368991553783417),
 ('is', 0.3069240152835846),
 ('hate', 0.2542324960231781),
 ('pets', 0.17545878887176514),
 ('be', 0.159010112285614),
 ('and', 0.15230195224285126),
 ('should', 0.1486302763223648),
 ('the', 0.12859836220741272),
 ('eat', 0.12588226795196533),
 ('likes', 0.11898518353700638)]

In [12]:
model_cbow.wv.most_similar('fish',topn=10)

[('very', 0.48102182149887085),
 ('as', 0.466048002243042),
 ('with', 0.3966265022754669),
 ('is', 0.3655945956707001),
 ('not', 0.35118815302848816),
 ('the', 0.33674266934394836),
 ('likes', 0.31869181990623474),
 ('have', 0.3150813579559326),
 ('people', 0.26447102427482605),
 ('are', 0.26341262459754944)]

In [13]:
model_skipgram = Word2Vec(sentences=corpus, window=3, vector_size=20, min_count=1, sg=1)  # sg=1 selects SkipGram

# sg=1 -> SkipGram
# sg=0 -> CBOW

In [14]:
model_skipgram.wv.most_similar('fish',topn=10)

[('very', 0.48119232058525085),
 ('as', 0.4662320017814636),
 ('with', 0.39732688665390015),
 ('is', 0.36547091603279114),
 ('not', 0.3511258065700531),
 ('the', 0.33675912022590637),
 ('likes', 0.3190329968929291),
 ('have', 0.3151017427444458),
 ('people', 0.26445290446281433),
 ('are', 0.2632710039615631)]

In [15]:
model_cbow.wv.key_to_index

{'fish': 0,
 'cat': 1,
 'pet': 2,
 'and': 3,
 'dog': 4,
 'with': 5,
 'live': 6,
 'can': 7,
 'they': 8,
 'but': 9,
 'hate': 10,
 'house': 11,
 'same': 12,
 'the': 13,
 'in': 14,
 'be': 15,
 'not': 16,
 'should': 17,
 'as': 18,
 'have': 19,
 'also': 20,
 'people': 21,
 'eat': 22,
 'to': 23,
 'likes': 24,
 'lazy': 25,
 'a': 26,
 'is': 27,
 'pets': 28,
 'comman': 29,
 'very': 30,
 'are': 31}

In [16]:
len(model_cbow.wv.key_to_index)

32

## Let's look at some pretrained models

In [17]:
# Model name    : word2vec-google-news-300
# Training data : Google News
# Vector size   : 300
# Vocabulary    : approx. 3 million words and phrases
# Architecture  : SkipGram

In [18]:
import gensim.downloader as gen_download

In [None]:
model = gen_download.load('word2vec-google-news-300')


In [None]:
# Other pretrained models that can be loaded the same way:
# 'fasttext-wiki-news-subwords-300'
# 'glove-twitter-25'
# 'glove-wiki-gigaword-100'

In [None]:
model.most_similar('King',topn=20)

[('Jackson', 0.5326348543167114),
 ('Prince', 0.5306329727172852),
 ('Tupou_V.', 0.5292826294898987),
 ('KIng', 0.5227501392364502),
 ('e_mail_robert.king_@', 0.5173623561859131),
 ('king', 0.5158917903900146),
 ('Queen', 0.5157250165939331),
 ('Geoffrey_Rush_Exit', 0.49920955300331116),
 ('prosecutor_Dan_Satterberg', 0.49850785732269287),
 ('NECN_Alison', 0.49128594994544983),
 ('Greene', 0.4909343123435974),
 ('Singer_songwriter_Carole', 0.48465195298194885),
 ('Saunders', 0.4697389304637909),
 ('Lionhearted', 0.4682433605194092),
 ('Rama_VII', 0.4657835066318512),
 ('That_creates_opporunity', 0.46014007925987244),
 ('Stacy_Legg', 0.4587861895561218),
 ('Brown', 0.45793747901916504),
 ('Lopez_Kuhrt', 0.45784181356430054),
 ('agent_Ralph_Vicinanza', 0.4575446844100952)]
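
The output above lists 'King', 'KIng' and 'king' as separate entries, which shows that this vocabulary is case-sensitive. One possible way to deal with that is to check which spellings of a word exist before querying. The sketch below is only an illustration: `lookup_variants` is a hypothetical helper and `toy_vocab` is a made-up stand-in; with the loaded model, `model.key_to_index` would be passed instead.

```python
def lookup_variants(key_to_index, word):
    """Return the spellings of `word` found in the vocabulary, trying the
    original, lowercase, capitalized and uppercase forms in that order."""
    found = []
    for variant in (word, word.lower(), word.capitalize(), word.upper()):
        if variant in key_to_index and variant not in found:
            found.append(variant)
    return found

# Made-up stand-in for a case-sensitive vocabulary (word -> index).
toy_vocab = {"King": 0, "king": 1, "KIng": 2, "queen": 3}

print(lookup_variants(toy_vocab, "king"))
print(lookup_variants(toy_vocab, "Apple"))
```

An empty list means none of the tried spellings is in the vocabulary, so a lookup for that word would fail.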

In [None]:
model.most_similar('pet',topn=20)

[('pets', 0.771199643611908),
 ('Pet', 0.7239742875099182),
 ('dog', 0.7164785265922546),
 ('puppy', 0.6972636580467224),
 ('cat', 0.6891531944274902),
 ('cats', 0.6719794869422913),
 ('pooch', 0.6579219102859497),
 ('Pets', 0.6363636255264282),
 ('animal', 0.6338440179824829),
 ('dogs', 0.6224827766418457),
 ('Tex._App._Eastland', 0.6163941621780396),
 ('doggie', 0.6154476404190063),
 ('feline', 0.6146664619445801),
 ('furry_friends', 0.6051517128944397),
 ('beagle', 0.5969854593276978),
 ('pup', 0.5965884923934937),
 ('puppies', 0.5963678359985352),
 ('doggy', 0.5931218266487122),
 ('Chihuahua_Loki', 0.59295254945755),
 ('kitties', 0.5912436842918396)]

In [None]:
model.most_similar('logic',topn=20)

[('reasoning', 0.7129802703857422),
 ('illogic', 0.6175681352615356),
 ('syllogism', 0.5849214792251587),
 ('syllogisms', 0.5720077753067017),
 ('rationale', 0.5709520578384399),
 ('axioms', 0.5681018829345703),
 ('theory', 0.5613352060317993),
 ('logical_reasoning', 0.5591957569122314),
 ('Such_brazenness_defies', 0.5575210452079773),
 ('inexorable_logic', 0.545832633972168),
 ('mathematical_formalism', 0.5452414751052856),
 ('deductive_logic', 0.5437464714050293),
 ('logics', 0.5364674925804138),
 ('rationality', 0.5295685529708862),
 ('Newtonian_mechanics', 0.5283963680267334),
 ('perverse_logic', 0.5275673866271973),
 ('methodological_naturalism', 0.523600697517395),
 ('logical_fallacy', 0.5222614407539368),
 ('bounded_rationality', 0.5184820294380188),
 ('fallacious_logic', 0.5184167623519897)]

In [None]:
model.most_similar('hello',topn=20)

[('hi', 0.6548984050750732),
 ('goodbye', 0.6399056315422058),
 ('howdy', 0.6310956478118896),
 ('goodnight', 0.5920578241348267),
 ('greeting', 0.5855877995491028),
 ('Hello', 0.5842196345329285),
 ("g'day", 0.5754078030586243),
 ('See_ya', 0.5688871741294861),
 ('ya_doin', 0.5643119812011719),
 ('greet', 0.5636604428291321),
 ('hullo', 0.5621640682220459),
 ('hellos', 0.5596432685852051),
 ('Hey', 0.5594545602798462),
 ('bye_bye', 0.5593389272689819),
 ('bonjour', 0.5587834715843201),
 ('adios', 0.5560759902000427),
 ('ciao', 0.5548770427703857),
 ('hug', 0.5544619560241699),
 ('buh_bye', 0.5511860251426697),
 ("G'day", 0.5494420528411865)]

In [None]:
model = gen_download.load('fasttext-wiki-news-subwords-300')

