<a href="https://colab.research.google.com/github/gkdivya/NLP_Notebooks/blob/main/WordEmbeddings%26LanguageModels/Let's_Code_Lecture2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# One-hot encoding using PyTorch

In [1]:
import torch
torch.manual_seed(0)

<torch._C.Generator at 0x7f43a1979570>

In [2]:
def text_to_onehot(text: str, corpus: str):
    # Read input file
    vocab = []
    with open(corpus, encoding="utf8") as f:
        passage = f.read()
        sentences = passage.lower().replace(".","").split('\n')
        for sentence in sentences:
            for word in sentence.split():
                if word not in vocab:
                    vocab.append(word)
    print(f'No. of words in our vocabulary: {len(vocab)}')
    print(vocab, '\n')

    # Split the provided text into words, normalized the same way as the corpus
    txt = text.lower().replace(".", "").split()

    # One-hot vectors for the whole vocabulary: row i has a 1 at position i
    vocab_vec = torch.eye(len(vocab))

    # Map each word to its one-hot vector
    vocab_dict = {word: vocab_vec[index] for index, word in enumerate(vocab)}

    # One row per word of the text; the vector length equals the vocabulary size
    text_vec = torch.zeros(len(txt), len(vocab))
    for index, word in enumerate(txt):
        if word in vocab_dict:
            # Words missing from the vocabulary keep their all-zero row
            text_vec[index] = vocab_dict[word]
    print(f'One hot vector of your text based on corpus:\n {text_vec}')

In [12]:
text_to_onehot('man eats biscuits and scares dog','/content/sample_data/corpus.txt')

No. of words in our vocabulary: 1403
['man', 'eats', 'biscuits', 'and', 'scares', 'dog', 'how', 'to', 'get', 'your', "dog's", 'appetite', 'back', 'has', 'always', 'been', 'an', 'avid', 'eater', 'as', 'a', 'matter', 'of', 'fact,', 'he', 'gulped', 'down', 'his', 'meal', 'in', 'seconds,', 'made', 'daily', 'trips', 'under', 'kitchen', 'table', 'just', 'looking', 'for', 'few', 'crumbs', 'yet,', 'day', 'comes', 'where', 'categorically', 'refuses', 'food', 'concerned,', 'you', 'start', 'wondering', 'if', 'may', 'be', 'sick', 'or', 'suddenly', 'become', 'fussy', 'some', 'dogs', 'can', 'extremely', 'finicky', 'they', 'turn', 'their', 'nose', 'away', 'from', 'foods', 'dislike', 'dealing', 'with', 'such', 'annoying', 'owners', 'find', 'themselves', 'continuously', 'rotating', 'between', 'different', 'brands', 'satisfy', 'special', 'cravings', 'however,', 'many', 'cases,', 'loss', 'suggest', 'health', 'ailment', 'this', 'is', 'why', 'before', 'considering', 'finicky,', 'it', 'good', 'idea', 'have'

**Words in our text that are not in the corpus get an all-zero one-hot vector. These are out-of-vocabulary (OOV) words.**
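As a minimal sketch of this OOV behaviour, the plain-Python example below builds one-hot rows for a tiny, made-up vocabulary. The `vocab` list and the `one_hot` helper are illustrative assumptions, not part of the notebook's function.

```python
# Toy vocabulary (hypothetical; the notebook builds its own from corpus.txt)
vocab = ["man", "eats", "biscuits", "dog"]
word_to_index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Return a one-hot list for `word`, or all zeros if it is out of vocabulary."""
    vector = [0] * len(vocab)
    index = word_to_index.get(word)
    if index is not None:
        vector[index] = 1
    return vector


# "pizza" is not in the vocabulary, so its row stays all zeros
for word in "dog eats pizza".split():
    print(word, one_hot(word))
```

An all-zero row carries no information about which unknown word was seen, so every OOV word looks identical to the model.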

# BOW Model using sklearn

In [14]:
processed_docs = []
with open('/content/sample_data/corpus.txt', encoding="utf8") as f:
    passage = f.read()
processed_docs = passage.lower().replace(".","").split('\n')
processed_docs

['man eats biscuits and scares dog',
 'man eats biscuits and scares dog',
 'man eats biscuits and scares dog',
 "how to get your dog's appetite back",
 'your dog has always been an avid eater as a matter of fact, he has always gulped down his meal in seconds, and he has always made daily trips under your kitchen table just looking for a few crumbs yet, a day comes where he categorically refuses his food concerned, you start wondering if he may be sick or if he has suddenly become a fussy eater',
 '',
 "some dogs can be extremely finicky they may turn their nose away from foods they dislike dealing with such dogs may be annoying as owners find themselves continuously rotating between different dog food brands to satisfy their dog's special cravings",
 '',
 'however, in many cases, a loss of appetite may suggest a health ailment this is why before considering your dog finicky, it is a good idea to have your vet rule out some conditions such as tooth problems, kidney or liver problems, or

In [15]:
from sklearn.feature_extraction.text import CountVectorizer

#look at the documents list
print("Our corpus: ", processed_docs)

count_vect = CountVectorizer()
#Build a BOW representation for the corpus
bow_rep = count_vect.fit_transform(processed_docs)

#Look at the vocabulary mapping
print("Our vocabulary: ", count_vect.vocabulary_)

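#See the BoW representation for the first 2 documents (both are the same sentence in this corpus)
print("BoW representation for document 0 ('man eats biscuits and scares dog'): ", bow_rep[0].toarray())
print("BoW representation for document 1 ('man eats biscuits and scares dog'): ", bow_rep[1].toarray())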

#Get the representation using this vocabulary, for a new text
temp = count_vect.transform(["dog and dog are friends"])
print("Bow representation:", temp.toarray())

Our corpus:  ['man eats biscuits and scares dog', 'man eats biscuits and scares dog', 'man eats biscuits and scares dog', "how to get your dog's appetite back", 'your dog has always been an avid eater as a matter of fact, he has always gulped down his meal in seconds, and he has always made daily trips under your kitchen table just looking for a few crumbs yet, a day comes where he categorically refuses his food concerned, you start wondering if he may be sick or if he has suddenly become a fussy eater', '', "some dogs can be extremely finicky they may turn their nose away from foods they dislike dealing with such dogs may be annoying as owners find themselves continuously rotating between different dog food brands to satisfy their dog's special cravings", '', 'however, in many cases, a loss of appetite may suggest a health ailment this is why before considering your dog finicky, it is a good idea to have your vet rule out some conditions such as tooth problems, kidney or liver problem

Here we take the frequency of each word into account. Sometimes, however, we do not care about frequency and only want to know whether a word appears in a text. In that case each document is represented as a vector of 0s and 1s. Passing binary=True to CountVectorizer does this, and it gives a different representation for the same sentence whenever a word occurs more than once.
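The difference between counts and presence flags can be sketched in plain Python. The `bow_vector` helper and the five-word vocabulary are assumptions made for illustration; with scikit-learn, the same switch is the `binary=True` option mentioned above.

```python
from collections import Counter


def bow_vector(text, vocabulary, binary=False):
    """Count vocabulary words in `text`; with binary=True, clip counts to 0/1."""
    counts = Counter(text.lower().split())
    vector = [counts[word] for word in vocabulary]
    if binary:
        vector = [1 if count > 0 else 0 for count in vector]
    return vector


vocabulary = ["and", "are", "dog", "friends", "man"]  # toy vocabulary (hypothetical)
text = "dog and dog are friends"

print("counts:", bow_vector(text, vocabulary))
print("binary:", bow_vector(text, vocabulary, binary=True))
```

Only the entry for "dog" differs between the two vectors, because it is the only word that occurs more than once.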

# Bag of N-grams

In [16]:
from sklearn.feature_extraction.text import CountVectorizer


#Ngram vectorization example with count vectorizer and uni, bi, trigrams
count_vect = CountVectorizer(ngram_range=(1,3))

#Build a BOW representation for the corpus
bow_rep = count_vect.fit_transform(processed_docs)

#Look at the vocabulary mapping
print("Our vocabulary: ", count_vect.vocabulary_)

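#See the BoW representation for the first 2 documents (both are the same sentence in this corpus)
print("BoW representation for document 0 ('man eats biscuits and scares dog'): ", bow_rep[0].toarray())
print("BoW representation for document 1 ('man eats biscuits and scares dog'): ", bow_rep[1].toarray())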

#Get the representation using this vocabulary, for a new text
temp = count_vect.transform(["dog and dog are friends"])

print("Bow representation:", temp.toarray())

Our vocabulary:  {'man': 5022, 'eats': 2509, 'biscuits': 1232, 'and': 403, 'scares': 6846, 'dog': 2066, 'man eats': 5023, 'eats biscuits': 2510, 'biscuits and': 1233, 'and scares': 535, 'scares dog': 6847, 'man eats biscuits': 5024, 'eats biscuits and': 2511, 'biscuits and scares': 1234, 'and scares dog': 536, 'how': 4034, 'to': 8191, 'get': 3238, 'your': 9411, 'appetite': 686, 'back': 962, 'how to': 4037, 'to get': 8293, 'get your': 3271, 'your dog': 9418, 'dog appetite': 2074, 'appetite back': 687, 'how to get': 4038, 'to get your': 8299, 'get your dog': 3272, 'your dog appetite': 9420, 'dog appetite back': 2075, 'has': 3503, 'always': 347, 'been': 1131, 'an': 381, 'avid': 933, 'eater': 2467, 'as': 819, 'matter': 5061, 'of': 5728, 'fact': 2654, 'he': 3633, 'gulped': 3454, 'down': 2338, 'his': 3899, 'meal': 5168, 'in': 4176, 'seconds': 6859, 'made': 4971, 'daily': 1794, 'trips': 8531, 'under': 8611, 'kitchen': 4714, 'table': 7486, 'just': 4621, 'looking': 4892, 'for': 3044, 'few': 279

Note that the number of features (and hence the size of the feature vector) increased a lot for the same data compared to the other, single-word representations.
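To see where the extra features come from, the sketch below enumerates the uni-, bi-, and trigrams of the first sentence in the corpus. The `ngrams` helper is an illustrative assumption and uses simple whitespace tokenization rather than CountVectorizer's tokenizer.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of `tokens` as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "man eats biscuits and scares dog".split()

features = []
for n in (1, 2, 3):
    grams = ngrams(tokens, n)
    print(f"{n}-grams ({len(grams)}):", grams)
    features.extend(grams)

print("total features from one sentence:", len(features))
```

A single six-word sentence already contributes 6 + 5 + 4 = 15 features, and most bigrams and trigrams are rare, so the vocabulary grows quickly with the corpus.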

# TF-IDF

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
bow_rep_tfidf = tfidf.fit_transform(processed_docs)

# IDF for all words in the vocabulary
print("IDF for all words in the vocabulary",tfidf.idf_)
print("-"*10)

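# All words in the vocabulary (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it).
print("All words in the vocabulary",list(tfidf.get_feature_names_out()))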
print("-"*10)

# TFIDF representation for all documents in our corpus 
print("TFIDF representation for all documents in our corpus\n",bow_rep_tfidf.toarray()) 
print("-"*10)

temp = tfidf.transform(["dog and man are friends"])
print("Tfidf representation:\n", temp.toarray())

IDF for all words in the vocabulary [5.56434819 5.9698133  5.9698133  ... 5.9698133  3.02537432 5.9698133 ]
----------
All words in the vocabulary ['01', '04', '05', '09', '10', '11', '12', '13', '14', '17', '19', '20', '2008', '2011', '2012', '2017', '2018', '2019', '2020', '22', '23', '24', '25', '26', '27', '29', '30', '31', '3560', '4month', 'able', 'abnormalities', 'about', 'above', 'abruptly', 'absence', 'accept', 'accidentally', 'acclimate', 'accurate', 'ache', 'aches', 'act', 'acting', 'active', 'acts', 'add', 'addicted', 'adding', 'additions', 'address', 'adjust', 'adjusting', 'adopted', 'adrienne', 'adult', 'advance', 'advice', 'after', 'again', 'age', 'aggressive', 'ago', 'ailment', 'ailments', 'air', 'alisha', 'all', 'allows', 'almost', 'already', 'also', 'altogether', 'always', 'am', 'amiss', 'an', 'and', 'animals', 'annoying', 'another', 'answer', 'answers', 'antinausea', 'any', 'anybody', 'anyone', 'anything', 'anyway', 'apartment', 'apetite', 'appallingwhat', 'apparentl
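The IDF values printed above can be reproduced by hand. With its default `smooth_idf=True`, TfidfVectorizer computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. The sketch below assumes n = 287; that figure is inferred from the printed values and is not stated in the notebook.

```python
import math


def smoothed_idf(doc_freq, n_docs):
    """IDF with add-one smoothing: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


n_docs = 287  # assumption: inferred from the printed IDF values

# Rarer terms (small document frequency) get a larger IDF
for doc_freq in (1, 2, 37):
    print(doc_freq, round(smoothed_idf(doc_freq, n_docs), 8))
```

Under that assumption, a term found in one document gets the 5.9698133 seen in the output, and a term found in two documents gets 5.56434819.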

# Pre-trained word2vec model

Google News Dataset

In [1]:
!wget -P /tmp/input/ -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"

--2020-10-29 15:27:36--  https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.248.78
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.248.78|:443... connected.
HTTP request sent, awaiting response... 416 Requested Range Not Satisfiable

    The file is already fully retrieved; nothing to do.



In [2]:
import warnings # This module ignores the various types of warnings generated
warnings.filterwarnings("ignore") 

import os # This module provides a way of using operating system dependent functionality

import psutil # This module helps in retrieving information on running processes and system resource utilization
process = psutil.Process(os.getpid())
from psutil import virtual_memory
mem = virtual_memory()

import time # This module is used to calculate the time

In [3]:
from gensim.models import Word2Vec, KeyedVectors
pretrainedpath = '/tmp/input/GoogleNews-vectors-negative300.bin.gz'

# Load W2V model. This will take some time, but it is a one time effort! 
pre = process.memory_info().rss
print("Memory used in GB before Loading the Model: %0.2f"%float(pre/(10**9))) #Check memory usage before loading the model
print('-'*10)

start_time = time.time() # Start the timer
ttl = mem.total # Total memory available

w2v_model = KeyedVectors.load_word2vec_format(pretrainedpath, binary=True) # load the model
print("%0.2f seconds taken to load"%float(time.time() - start_time)) # Calculate the total time elapsed since starting the timer
print('-'*10)

print('Finished loading Word2Vec')
print('-'*10)

post = process.memory_info().rss
print("Memory used in GB after Loading the Model: {:.2f}".format(float(post/(10**9)))) # Calculate the memory used after loading the model
print('-'*10)

print("Memory after loading as a percentage of memory before: {:.2f}% ".format(float((post/pre)*100))) # Ratio post/pre, not the increase; the increase would be (post-pre)/pre*100
print('-'*10)

print("Number of words in vocabulary: ",len(w2v_model.vocab)) # Number of words in the vocabulary; in gensim >= 4.0 use len(w2v_model.key_to_index)

Memory used in GB before Loading the Model: 0.16
----------
107.21 seconds taken to load
----------
Finished loading Word2Vec
----------
Memory used in GB after Loading the Model: 5.01
----------
Memory after loading as a percentage of memory before: 3215.16% 
----------
Number of words in vocabulary:  3000000


In [4]:
#Let us examine the model by knowing what the most similar words are, for a given word!
w2v_model.most_similar('beautiful')

[('gorgeous', 0.8353004455566406),
 ('lovely', 0.810693621635437),
 ('stunningly_beautiful', 0.7329413890838623),
 ('breathtakingly_beautiful', 0.7231341004371643),
 ('wonderful', 0.6854087114334106),
 ('fabulous', 0.6700063943862915),
 ('loveliest', 0.6612576246261597),
 ('prettiest', 0.6595001816749573),
 ('beatiful', 0.6593326330184937),
 ('magnificent', 0.6591402292251587)]

In [5]:
# What if I am looking for a word that is not in this vocabulary?
w2v_model['kaunhotum?']

KeyError: ignored

# Train our Embedding on WikiCorpus using Gensim

In [None]:
from gensim.models import Word2Vec
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Define training data
# Gensim's Word2Vec expects a 'list of lists' for training: the outer list holds the documents,
# and each inner list holds the tokens of one document.
corpus = [["dog","eats","biscuits"],["man", "eats","biscuits"],['dog','bites','man'], ["man", "bites" ,"dog"]]

#Training the model
model_cbow = Word2Vec(corpus, min_count=1,sg=0) # using CBOW architecture for training
model_skipgram = Word2Vec(corpus, min_count=1,sg=1) # using skip-gram architecture for training
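If the training data starts out as raw strings, it first has to be turned into that list-of-lists shape. The sketch below is one simple way to do it, assuming lowercase whitespace tokenization with periods stripped (the same normalization used earlier for corpus.txt). The `raw_documents` list and `tokenize` helper are hypothetical.

```python
# Hypothetical raw documents, one string per document
raw_documents = [
    "Dog eats biscuits.",
    "Man eats biscuits.",
    "Dog bites man.",
    "Man bites dog.",
]


def tokenize(document):
    """Lowercase, drop periods, and split on whitespace."""
    return document.lower().replace(".", "").split()


# Outer list: documents; inner lists: tokens of each document
corpus = [tokenize(document) for document in raw_documents]
print(corpus)
```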

### Continuous Bag of Words (CBOW)

In [None]:
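#Summarize the trained model
print(model_cbow)

#Summarize vocabulary (in gensim >= 4.0, use model_cbow.wv.key_to_index instead of wv.vocab)
words = list(model_cbow.wv.vocab)
print(words)

#Access vector for one word (through model.wv; indexing the model directly is deprecated)
print(model_cbow.wv['dog'])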

In [None]:
#Compute similarity
print("Similarity between eats and bites:",model_cbow.wv.similarity('eats', 'bites'))
print("Similarity between eats and man:",model_cbow.wv.similarity('eats', 'man'))

From the above similarity scores we can conclude that eats is more similar to bites than to man.
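The score returned by `similarity` is the cosine similarity of the two word vectors. As a rough illustration, the sketch below computes it by hand on made-up three-dimensional vectors. The numbers are invented for the example and are not taken from the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors (hypothetical; real embeddings have 100 or more dimensions)
eats = [0.9, 0.1, 0.3]
bites = [0.8, 0.2, 0.4]
man = [0.1, 0.9, 0.2]

print("eats vs bites:", cosine_similarity(eats, bites))
print("eats vs man:  ", cosine_similarity(eats, man))
```

Values close to 1 mean the vectors point in nearly the same direction. With a four-sentence corpus the learned vectors are noisy, so the scores can vary from run to run.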

### SkipGram

In [None]:
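#Summarize the trained model
print(model_skipgram)

#Summarize vocabulary (in gensim >= 4.0, use model_skipgram.wv.key_to_index instead of wv.vocab)
words = list(model_skipgram.wv.vocab)
print(words)

#Access vector for one word (through model.wv; indexing the model directly is deprecated)
print(model_skipgram.wv['dog'])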

In [None]:
# Compute similarity
print("Similarity between eats and bites:",model_skipgram.wv.similarity('eats', 'bites'))
print("Similarity between eats and man:",model_skipgram.wv.similarity('eats', 'man'))

### Training Your Embedding on Wiki Corpus

In [None]:
!mkdir -p data/en/
!wget -P data/en/ https://dumps.wikimedia.org/enwiki/20201001/enwiki-20201001-pages-articles-multistream14.xml-p13159683p14324602.bz2

In [None]:
from gensim.corpora.wikicorpus import WikiCorpus
from gensim.models.word2vec import Word2Vec
from gensim.models.fasttext import FastText
import time

In [None]:
#Preparing the training data
#Note: the lemmatize argument was removed in gensim 4.0; drop it on newer versions
wiki = WikiCorpus('data/en/enwiki-20201001-pages-articles-multistream14.xml-p13159683p14324602.bz2', 
                  lemmatize=False, dictionary={})
sentences = list(wiki.get_texts())

#### Hyperparameters
sg: selects the training algorithm, 1 for skip-gram and 0 for CBOW. The default is CBOW (sg=0).

min_count: ignores all words with a total frequency lower than this value.
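The effect of min_count can be sketched without Gensim by counting tokens and keeping only those that reach the threshold. The `filter_vocab` helper below is a hypothetical stand-in for the vocabulary-building step, not Gensim's implementation.

```python
from collections import Counter

sentences = [
    ["dog", "eats", "biscuits"],
    ["man", "eats", "biscuits"],
    ["dog", "bites", "man"],
    ["man", "bites", "dog"],
]


def filter_vocab(sentences, min_count):
    """Keep only words whose total frequency is at least `min_count`."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {word: count for word, count in counts.items() if count >= min_count}


print("min_count=1:", filter_vocab(sentences, 1))
print("min_count=3:", filter_vocab(sentences, 3))
```

On this toy corpus, min_count=3 keeps only "dog" and "man". This is why the small example above uses min_count=1, while the Wikipedia run can afford min_count=10.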

### CBOW


In [None]:
start = time.time()
word2vec_cbow = Word2Vec(sentences,min_count=10, sg=0)
end = time.time()

print("CBOW Model Training Complete.\nTime taken for training is:{:.2f} hrs ".format((end-start)/3600.0))

In [None]:
#Summarize the trained model
print(word2vec_cbow)
print("-"*30)

#Summarize vocabulary (in gensim >= 4.0, use word2vec_cbow.wv.key_to_index instead of wv.vocab)
words = list(word2vec_cbow.wv.vocab)
print(words)
print("-"*30)

#Access vector for one word
print(word2vec_cbow.wv['film'])
print("-"*30)

#Compute similarity
print("Similarity between film and drama:",word2vec_cbow.wv.similarity('film', 'drama'))
print("Similarity between film and dog:",word2vec_cbow.wv.similarity('film', 'dog'))
print("-"*30)
print("-"*30)

In [None]:
# save model
from gensim.models import Word2Vec, KeyedVectors   
word2vec_cbow.wv.save_word2vec_format('word2vec_cbow.bin', binary=True)

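# # load model (the file is in word2vec binary format, so load it as KeyedVectors rather than with Word2Vec.load)
# new_word2vec_cbow = KeyedVectors.load_word2vec_format('word2vec_cbow.bin', binary=True)
# print(new_word2vec_cbow)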

### SkipGram

In [None]:
start = time.time()
word2vec_skipgram = Word2Vec(sentences,min_count=10, sg=1)
end = time.time()

print("SkipGram Model Training Complete\nTime taken for training is:{:.2f} hrs ".format((end-start)/3600.0))

In [None]:
#Summarize the trained model
print(word2vec_skipgram)
print("-"*30)

#Summarize vocabulary (in gensim >= 4.0, use word2vec_skipgram.wv.key_to_index instead of wv.vocab)
words = list(word2vec_skipgram.wv.vocab)
print(words)
print("-"*30)

#Access vector for one word
print(word2vec_skipgram.wv['film'])
print("-"*30)

#Compute similarity
print("Similarity between film and drama:", word2vec_skipgram.wv.similarity('film', 'drama'))
print("Similarity between film and dog:",word2vec_skipgram.wv.similarity('film', 'dog'))
print("-"*30)
print("-"*30)

**Skip-gram took more time to train than CBOW. Any guess why?**
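One way to reason about the question is to count how many training examples each architecture derives from the same sentence. The sketch below takes a simplified view: a fixed window of 2, no subsampling, and no negative sampling. The `training_examples` helper is an illustrative assumption and does not reproduce Gensim's internals.

```python
def training_examples(tokens, window=2):
    """Build simplified skip-gram and CBOW training examples for one sentence."""
    skipgram, cbow = [], []
    for i, center in enumerate(tokens):
        # Words within `window` positions on either side of the center word
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        # CBOW: the combined context predicts the center word (one example)
        cbow.append((context, center))
        # Skip-gram: the center word predicts each context word (one example per pair)
        skipgram.extend((center, word) for word in context)
    return skipgram, cbow


tokens = "man eats biscuits and scares dog".split()
skipgram, cbow = training_examples(tokens)

print("CBOW examples:     ", len(cbow))
print("Skip-gram examples:", len(skipgram))
```

For this six-word sentence the simplified count is 6 CBOW examples against 18 skip-gram pairs. More prediction steps per sentence is one plausible reason for the longer training time.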