# Continuous Bag of Words & Skip-gram

In [None]:
!pip install -U gensim



We will learn how to build a word2vec model using Gensim.

In [None]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

import warnings
warnings.filterwarnings('ignore')

# Data processing
import pandas as pd
import re

# Modeling
from gensim.models import Word2Vec
from gensim.models import Phrases
from gensim.models.phrases import Phraser

stopWords = stopwords.words('english')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# Load the Data

Load the dataset. The dataset used in this section is available in the data folder as text.zip; the cell below assumes it has been extracted to /content/text.csv.

In [None]:
data = pd.read_csv('/content/text.csv')


Let us look at the first few rows of the data:



In [None]:
data.head()


Unnamed: 0,room kind clean strong smell dogs. generally average ok overnight stay youre fussy. would consider staying price right. breakfast free better nothing.
0,stayed crown plaza april april . staff friendl...
1,booked hotel hotwire lowest price could find. ...
2,stayed husband sons way alaska cruise. loved h...
3,girlfriends stayed celebrate th birthdays. pla...
4,rooms. one nice clearly updated recently other...


# Preprocess and prepare the dataset



Define a function for preprocessing the data:



In [None]:
def pre_process(text):

    #convert to lowercase
    text = str(text).lower()

    #remove special characters, keeping only alphanumeric characters, whitespace and periods (periods are needed later to split sentences)
    text = re.sub(r'[^A-Za-z0-9\s.]',r'',text)

    #remove new lines
    text = re.sub(r'\n',r' ',text)

    # remove stop words
    text = " ".join([word for word in text.split() if word not in stopWords])

    return text
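As a quick check of what these steps do, the sketch below reruns them on a made-up review. To stay self-contained it uses a tiny hypothetical stop-word set (`demo_stop_words`) instead of NLTK's English list, so the real function will drop more words than this demo does.

```python
import re

# Hypothetical stand-in for NLTK's English stop-word list (assumption, illustration only)
demo_stop_words = {"the", "was", "and", "we", "a", "it"}

def pre_process_demo(text, stop_words=demo_stop_words):
    # Same steps as pre_process, with the stop-word collection passed in explicitly
    text = str(text).lower()
    text = re.sub(r'[^A-Za-z0-9\s.]', r'', text)  # periods survive this step
    text = re.sub(r'\n', r' ', text)
    return " ".join(word for word in text.split() if word not in stop_words)

sample = "The room was CLEAN, and we loved it.\nBreakfast was free!"
result = pre_process_demo(sample)
print(result)  # room clean loved it. breakfast free
```

Note that a stop word directly followed by a period (here "it.") is not removed, because the token no longer matches the stop-word entry. This is one way a stop word can still show up at the end of a sentence in the processed corpus.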

Now, we will process our data:

In [None]:
data.iloc[:, 0] = data.iloc[:, 0].map(pre_process)


In [None]:
data.iloc[:, 0].head()


0    stayed crown plaza april april . staff friendl...
1    booked hotel hotwire lowest price could find. ...
2    stayed husband sons way alaska cruise. loved h...
3    girlfriends stayed celebrate th birthdays. pla...
4    rooms. one nice clearly updated recently other...
Name: room kind clean strong smell dogs. generally average ok overnight stay youre fussy. would consider staying price right. breakfast free better nothing., dtype: object

The Gensim library requires input in the form of a list of lists, i.e., text = [[word1, word2, word3], [word1, word2, word3]]. We know that each row in our data contains a set of sentences:

In [None]:
data.iloc[1, 0].split('.')[:5]


['booked hotel hotwire lowest price could find',
 ' got front desk manager gave us smoking room',
 ' argued little baby would booked room known smoking',
 ' manager would hear anything told hotwire books cheapest rooms available',
 ' get go unhappy']

Each row of our data is currently a single string, but we need a list of lists. To get there, we first split each row on periods ('.') to obtain sentences, and then split each sentence on whitespace to obtain words. The result is a list in which every inner list holds the words of one sentence; empty sentences are skipped.


In [None]:
corpus = []
for line in data.iloc[:, 0]:
    sentences = line.split('.')
    for sentence in sentences:
        words = sentence.split()
        if words:  # Check if the list of words is not empty
            corpus.append(words)


In [None]:
corpus[:2]


[['stayed', 'crown', 'plaza', 'april', 'april'],
 ['staff', 'friendly', 'attentive']]

The problem is that our corpus contains only unigrams, so the model will not return results when we query a bigram such as 'san francisco'. To handle this, we use Gensim's Phrases class, which identifies frequently co-occurring word pairs and joins them with an underscore. For instance, 'san francisco' becomes 'san_francisco'.

We set min_count to 25, so words and bigrams that appear fewer than 25 times are ignored, and threshold to 50, the score a pair must exceed to be joined (a higher threshold yields fewer phrases). This filters out rare or weakly associated pairs and improves the quality of our data representation.

In [None]:
phrases = Phrases(sentences=corpus, min_count=25, threshold=50)
bigram = Phraser(phrases)
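The roles of `min_count` and `threshold` can be illustrated with a simplified, pure-Python version of the default bigram score described in the Gensim documentation, `(count(a b) - min_count) * vocab_size / (count(a) * count(b))`. This is a sketch under stated assumptions, not Gensim's implementation: the function name `bigram_scores` and the toy corpus are hypothetical, `vocab_size` here counts only distinct single words (Gensim's own bookkeeping may differ), and the toy `min_count` and `threshold` are scaled down to suit six short sentences.

```python
from collections import Counter

def bigram_scores(sentences, min_count, threshold):
    """Score adjacent word pairs and keep those whose score exceeds `threshold`."""
    unigrams = Counter(word for s in sentences for word in s)
    bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    vocab_size = len(unigrams)  # simplification: distinct single words only
    kept = {}
    for (a, b), n_ab in bigrams.items():
        score = (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        if score > threshold:
            kept[(a, b)] = score
    return kept

# Toy corpus invented for illustration
toy = [
    ["san", "francisco", "hotel"],
    ["san", "francisco", "trip"],
    ["san", "francisco", "view"],
    ["hotel", "view"],
    ["nice", "hotel"],
    ["nice", "trip"],
]
kept = bigram_scores(toy, min_count=1, threshold=1.0)
print(kept)
```

Only ('san', 'francisco') survives: it occurs three times, while every other pair occurs once and scores zero after `min_count` is subtracted.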

In [None]:
for index,sentence in enumerate(corpus):
    corpus[index] = bigram[sentence]

In [None]:
corpus[111]


['planning',
 'go',
 'planet',
 'hollywood',
 'negotiate',
 'travel',
 'within',
 'main',
 'section',
 'restaurant',
 'shown',
 'table',
 'otherwise',
 'may',
 'end',
 'like',
 'us',
 'top',
 'floor',
 'overlooking',
 'action',
 'it']

In [None]:
corpus[9]


['appeared', 'carpets', 'vacummed', 'every', 'day']

# Build the Model


We define some of the important hyperparameters that the model needs.

* vector_size represents the number of dimensions of the vector used to represent a word. It can be chosen according to the size of the data: for a very small dataset a small value is enough, while for a significantly larger dataset a value such as 300 can be used. In our case, we set it to 100.

* window represents the maximum distance between the target word and a neighboring word. Words farther from the target word than the window size are not considered for learning. Typically, a small window size is preferred.

* min_count represents the minimum frequency of a word, i.e., if a word occurs fewer than min_count times, it is ignored.

* workers specifies the number of worker threads used to train the model.

* sg=1 implies we use the skip-gram method for training.  
* sg=0 implies we use CBOW for training.
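To make the window size and the difference between the two training modes concrete, here is a minimal sketch that lists the training examples each mode would draw from one sentence. The helper names `skipgram_pairs` and `cbow_examples` are hypothetical, and the sketch leaves out details of real implementations such as subsampling or any random shrinking of the window.

```python
def skipgram_pairs(sentence, window):
    """Skip-gram: each target word is paired with every word within `window` positions."""
    pairs = []
    for i, target in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, sentence[j]))
    return pairs

def cbow_examples(sentence, window):
    """CBOW: the surrounding words within `window` positions jointly predict the target."""
    examples = []
    for i, target in enumerate(sentence):
        context = sentence[max(0, i - window):i] + sentence[i + 1:i + window + 1]
        examples.append((context, target))
    return examples

# Sentence taken from corpus[9], spelling as it appears in the data
sentence = ['appeared', 'carpets', 'vacummed', 'every', 'day']
sg_pairs = skipgram_pairs(sentence, window=2)
cbow = cbow_examples(sentence, window=2)
print(sg_pairs[:3])  # [('appeared', 'carpets'), ('appeared', 'vacummed'), ('carpets', 'appeared')]
print(cbow[2])       # (['appeared', 'carpets', 'every', 'day'], 'vacummed')
```

With a window of 2, the middle word sees four neighbors, while words at the sentence edges see only two.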

In [None]:
size = 100
window_size = 2
epochs = 120
min_count = 2
workers = 4
sg = 1

In [None]:
model = Word2Vec(corpus, sg=sg, window=window_size, vector_size=size, min_count=min_count, workers=workers, epochs=epochs)
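The effect of `min_count` on the vocabulary can be sketched with a simple frequency filter. This is only an illustration of the idea, not Gensim's vocabulary-building code; the helper `filter_vocab` is hypothetical, and the third sentence of the toy corpus is made up (the first two come from `corpus[:2]`).

```python
from collections import Counter

def filter_vocab(sentences, min_count):
    """Keep only words that occur at least `min_count` times across all sentences."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}

toy_corpus = [
    ['stayed', 'crown', 'plaza', 'april', 'april'],
    ['staff', 'friendly', 'attentive'],
    ['staff', 'stayed', 'late'],  # invented sentence, for illustration
]
vocab = filter_vocab(toy_corpus, min_count=2)
print(sorted(vocab))  # ['april', 'staff', 'stayed']
```

Words that fall below the cutoff get no vector at all, so querying them later fails with a missing-key error rather than returning a poor vector.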


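To save and load the trained word vectors, we can use the save and load functions respectively. Here we save only the vectors (model.wv) and load them back as a KeyedVectors object, which is enough for similarity queries but cannot be trained further.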

In [None]:
import os

# Create the directory if it doesn't exist
os.makedirs("model/", exist_ok=True)

# Save the trained word vectors (KeyedVectors), not the full Word2Vec model
model.wv.save("model/word2vec.model")


In [None]:
from gensim.models import KeyedVectors

# Load the saved word vectors (KeyedVectors)
model = KeyedVectors.load('model/word2vec.model')


# Evaluate the Embeddings

After training, we assess how well the model has captured word meanings. Using Gensim's most_similar function, we obtain a list of the words most similar to a given input word, each with a similarity score. For example, querying "san_diego" returns mostly city and place names.

In [None]:
model.most_similar('san_diego')


[('la', 0.5867746472358704),
 ('san_antonio', 0.5835387110710144),
 ('anatole', 0.5419203042984009),
 ('boston', 0.5245999097824097),
 ('sea_world', 0.5182870626449585),
 ('studios', 0.517884373664856),
 ('universities', 0.5173631310462952),
 ('oceanside', 0.5145856738090515),
 ('chicago', 0.5138354301452637),
 ('jacksonville', 0.5136340856552124)]
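The scores in this output are cosine similarities between word vectors. A minimal sketch of that ranking is shown below, using hypothetical 3-dimensional vectors invented for illustration (they are not taken from the trained model); `most_similar_demo` is likewise a made-up helper, not a Gensim function.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar_demo(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scored = [(other, cosine(query, vec)) for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:topn]

# Hypothetical toy vectors (assumption, illustration only)
toy_vectors = {
    'san_diego': [1.0, 0.0, 0.0],
    'boston': [0.8, 0.6, 0.0],
    'breakfast': [0.0, 0.0, 1.0],
    'chicago': [0.6, 0.8, 0.0],
}
neighbours = most_similar_demo('san_diego', toy_vectors)
print(neighbours)
```

In this toy setup 'boston' ranks first (similarity about 0.8), followed by 'chicago' (about 0.6), while 'breakfast' points in an unrelated direction and scores 0.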

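We can also apply arithmetic operations to the vectors to check how well they capture relationships between words. The classic example is woman + king - man, which is expected to land close to queen. Because our corpus consists of hotel reviews, the nearest word here is 'queensize' rather than 'queen':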

In [None]:
model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)


[('queensize', 0.5518624782562256)]
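Roughly, an analogy query of this kind adds the normalized positive vectors, subtracts the normalized negative ones, and ranks the remaining words by cosine similarity to the result, leaving out the input words. The sketch below follows that recipe on hypothetical 4-dimensional vectors invented for illustration; `analogy_demo` is a made-up helper, and Gensim's own implementation may differ in its details.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def analogy_demo(positive, negative, vectors, topn=1):
    """Combine positive and negative word vectors, then rank the other words by cosine similarity."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    exclude = set(positive) | set(negative)
    scored = [
        (word, sum(q * x for q, x in zip(query, unit(vec))))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:topn]

# Hypothetical toy vectors with axes (royal, female, person, food); illustration only
toy_vectors = {
    'king': [1.0, 0.0, 1.0, 0.0],
    'man': [0.0, 0.0, 1.0, 0.0],
    'woman': [0.0, 1.0, 1.0, 0.0],
    'queen': [1.0, 1.0, 1.0, 0.0],
    'breakfast': [0.0, 0.0, 0.0, 1.0],
}
result = analogy_demo(positive=['woman', 'king'], negative=['man'], vectors=toy_vectors)
print(result)
```

With these hand-built vectors the top result is 'queen'. On real data the answer depends on how the words are used in the corpus, which is why a hotel-review corpus can favor a bed size instead.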