## Training Word2Vec on the Amazon reviews dataset

- I trained the Word2Vec model on the Amazon reviews dataset, which contains 3.6M reviews (the `corpus_count` reported by the model).
- I used a Kaggle notebook for training because it provides 30GB of RAM.
- Please use a Kaggle notebook to train the Word2Vec model.

- After training Word2Vec, I fine-tuned it on our `clustering.xlsx` dataset.
- After training and fine-tuning, save the model and download it to your local system for further use.
- Place the downloaded Word2Vec model at this path: `../dataset/word2vec/clustering_word2vec.model`

#### How to train?
- Go to the dataset page: [Click to go!](https://www.kaggle.com/datasets/kritanjalijain/amazon-reviews)
- Create a new notebook there, or upload this notebook and run it.
- To fine-tune on `clustering.xlsx`, upload the `clustering.xlsx` file to the Kaggle drive (space) so that the notebook can read it.

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/amazon-reviews/amazon_review_polarity_csv.tgz
/kaggle/input/amazon-reviews/train.csv
/kaggle/input/amazon-reviews/test.csv


In [2]:
import pandas as pd
dataset = pd.read_csv("/kaggle/input/amazon-reviews/train.csv",names=["polarity","summary","text"], header=None)
dataset = dataset[["text"]]
dataset.head()

|   | text |
|---|------|
| 0 | This sound track was beautiful! It paints the ... |
| 1 | I'm reading a lot of reviews saying that this ... |
| 2 | This soundtrack is my favorite music of all ti... |
| 3 | I truly like this soundtrack and I enjoy video... |
| 4 | If you've played the game, you know how divine... |


#### pre-process dataset

In [3]:
import gensim

In [4]:
from time import time
t0 = time()
dataset["text"] = dataset["text"].apply(gensim.utils.simple_preprocess)
print(f"time taken to preprocess text: {time() - t0}")

time taken to preprocess text: 355.1740491390228


In [5]:
dataset.head()

|   | text |
|---|------|
| 0 | [this, sound, track, was, beautiful, it, paint... |
| 1 | [reading, lot, of, reviews, saying, that, this... |
| 2 | [this, soundtrack, is, my, favorite, music, of... |
| 3 | [truly, like, this, soundtrack, and, enjoy, vi... |
| 4 | [if, you, ve, played, the, game, you, know, ho... |


#### Create the Word2Vec model and build the vocabulary

In [6]:
model = gensim.models.Word2Vec(
    window = 10,
    vector_size = 300,
    min_count = 1,
    epochs=10,
)
t0 = time()
model.build_vocab(dataset["text"])
print(f"time taken to build vocab: {time()-t0}")

time taken to build vocab: 84.88864254951477


In [7]:
len(model.wv)

823220
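
The vocabulary size accounts for most of the memory the model needs. As a rough back-of-the-envelope sketch (assuming float32 weights and Gensim's default negative-sampling setup, which keeps one input matrix and one output matrix of the same shape; the helper name `matrix_bytes` is hypothetical):

```python
def matrix_bytes(vocab_size: int, vector_size: int, bytes_per_float: int = 4) -> int:
    """Size in bytes of one dense float32 weight matrix."""
    return vocab_size * vector_size * bytes_per_float

vocab_size, vector_size = 823_220, 300   # vocabulary size and vector_size of this model
one_matrix = matrix_bytes(vocab_size, vector_size)
both = 2 * one_matrix                    # input vectors + output weights

print(f"one matrix: {one_matrix / 2**30:.2f} GiB")   # one matrix: 0.92 GiB
print(f"both:       {both / 2**30:.2f} GiB")         # both:       1.84 GiB
```

With `min_count=1`, every token that occurs even once, including typos, gets its own 300-dimensional row, so raising `min_count` is the simplest way to shrink these matrices.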

#### Train the Word2Vec model

In [8]:
model.corpus_count, model.epochs

(3600000, 10)

In [9]:
t0 = time()
model.train(dataset["text"], total_examples=model.corpus_count, epochs=model.epochs)
print(f"time taken to train the Word2Vec model: {time()-t0}")

time taken to train the Word2Vec model: 3974.2294974327087


In [10]:
# from gensim.models import Word2Vec
# sentences = dataset["text"]

# model = Word2Vec(min_count=1)
# model.build_vocab(sentences)  # prepare the model vocabulary
# model.train(sentences, total_examples=model.corpus_count, epochs=1)

In [11]:
model.save("./review_word2vec.model")

#### Saving model in binary format

In [53]:
model.wv.save_word2vec_format("model.bin", binary=True)

In [18]:
model.wv.most_similar("islam")

[('islamic', 0.8006618022918701),
 ('christianity', 0.7927968502044678),
 ('muslims', 0.7710422277450562),
 ('muslim', 0.7656044960021973),
 ('judaism', 0.7236652374267578),
 ('catholicism', 0.7124875783920288),
 ('religion', 0.6957213878631592),
 ('mormonism', 0.6921664476394653),
 ('fundamentalism', 0.684215247631073),
 ('religions', 0.6679003238677979)]

In [19]:
model.wv.distance("good","best")

0.6559979915618896

In [20]:
model.wv.distance("king","man")

0.6876093447208405

In [23]:
model.wv.distance("king","women")

1.0023832756560296

In [26]:
model.wv.distance("man", "chair")

0.995308366138488

In [27]:
model.wv.distance("table","chair")

0.54062619805336
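
The values returned by `model.wv.distance` are cosine distances, that is, one minus the cosine similarity of the two word vectors. A minimal NumPy sketch with made-up 2-dimensional vectors shows the range: 0 for identical directions, 1 for orthogonal vectors, and values above 1 (as for "king" and "women" above) when the cosine similarity is negative.

```python
import numpy as np

def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    """1 - cosine similarity of u and v."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 0.0])
print(cosine_distance(a, np.array([2.0, 0.0])))    # 0.0  (same direction)
print(cosine_distance(a, np.array([0.0, 3.0])))    # 1.0  (orthogonal)
print(cosine_distance(a, np.array([-1.0, 0.0])))   # 2.0  (opposite direction)
```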

In [39]:
len(model.wv)

823220

## Fine-tuning the trained Word2Vec on our own dataset

#### 1. Create a Word2Vec model
Use the same vector size as the pretrained model so that the pretrained vectors can be copied into it.

In [41]:
new_model = gensim.models.Word2Vec(
    vector_size=300,
    min_count=1,
    epochs=20,
    window=10,
)

#### 2. Build the vocabulary for the new corpus

##### load the dataset

In [47]:
sentences = pd.read_excel("/kaggle/input/sentence-clustering/clustering.xlsx")
sentences = sentences[["Text"]]
sentences.head()

|   | Text |
|---|------|
| 0 | Moeller's student-run newspaper, The Crusader,... |
| 1 | In 2008, The Crusader won First Place, the sec... |
| 2 | The Squire is a student literary journal that ... |
| 3 | Paul Keels - play-by-play announcer for Ohio S... |
| 4 | Joe Uecker - Ohio State Senator (R-66) . |


##### preprocess the dataset

In [48]:
t0 = time()
sentences["Text"] = sentences["Text"].apply(gensim.utils.simple_preprocess)
print(f"time taken to preprocess the dataset: {time()-t0}")

time taken to preprocess the dataset: 1.4718668460845947


In [59]:
sentences.head()

|   | Text |
|---|------|
| 0 | [moeller, student, run, newspaper, the, crusad... |
| 1 | [in, the, crusader, won, first, place, the, se... |
| 2 | [the, squire, is, student, literary, journal, ... |
| 3 | [paul, keels, play, by, play, announcer, for, ... |
| 4 | [joe, uecker, ohio, state, senator] |


In [49]:
t0 = time()
new_model.build_vocab(sentences["Text"])
print(f"time taken to build vocab of the dataset: {time()-t0}")

time taken to build vocab of the dataset: 1.77947998046875


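#### 3. Create a vector of ones
This vector holds one lock factor per word and determines the mutability of the word vectors: 1.0 lets a vector be updated during training, 0.0 freezes it. In previous Gensim versions, the single `lockf` argument of the `intersect_word2vec_format` function was enough; in Gensim 4 the per-word vector has to be allocated first. Starting from a vector of ones makes every word in the vocabulary updatable during fine-tuning, unless `intersect_word2vec_format` later overwrites an entry with its own `lockf` value, which defaults to 0.0.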

In [50]:
import numpy as np
# one lock factor per vocabulary word; float32 matches Gensim's internal dtype
new_model.wv.vectors_lockf = np.ones(len(new_model.wv), dtype=np.float32)

#### 4. Perform a vocabulary intersection
Use the `intersect_word2vec_format` function to initialize the new embeddings with the pretrained embeddings for the words that are in the pretraining vocabulary. The official Gensim documentation describes `intersect_word2vec_format` as follows:

>Merge in an input-hidden weight matrix loaded from the original C word2vec-tool format, where it intersects with the current vocabulary.

>No words are added to the existing vocabulary, but intersecting words adopt the file’s weights, and non-intersecting words are left alone.

In [54]:
new_model.wv.intersect_word2vec_format("./model.bin", binary=True)

#### 5. Train the new model

In [56]:
t0 = time()
new_model.train(sentences["Text"], total_examples=new_model.corpus_count, epochs=new_model.epochs)
print(f"Time taken to fine-tune clustering sentence corpus: {time()-t0}")

Time taken to fine-tune clustering sentence corpus: 26.70268726348877


In [64]:
new_model.wv.most_similar("paul")

[('simon', 0.5700941681861877),
 ('pauls', 0.5690290927886963),
 ('john', 0.5652545690536499),
 ('adrian', 0.5583145618438721),
 ('brian', 0.5408787727355957),
 ('lester', 0.5309159755706787),
 ('patrick', 0.5303022265434265),
 ('richard', 0.5258485674858093),
 ('peter', 0.5240234732627869),
 ('joseph', 0.5191812515258789)]

In [60]:
model.wv.most_similar("crusader")

[('caped', 0.6418017148971558),
 ('tyrant', 0.612069845199585),
 ('warlord', 0.5712436437606812),
 ('stronghold', 0.5492900609970093),
 ('mercenary', 0.5314666628837585),
 ('warriors', 0.5216918587684631),
 ('swordsman', 0.5211747288703918),
 ('tyrannical', 0.5182375311851501),
 ('warrior', 0.5182107090950012),
 ('barbarians', 0.5135986804962158)]

In [61]:
new_model.wv.most_similar("crusader")

[('tyrant', 0.612069845199585),
 ('warlord', 0.5712436437606812),
 ('stronghold', 0.5492900609970093),
 ('mercenary', 0.5314666628837585),
 ('warriors', 0.5216918587684631),
 ('swordsman', 0.5211747288703918),
 ('warrior', 0.5182107090950012),
 ('barbarians', 0.5135986804962158),
 ('conqueror', 0.509844958782196),
 ('rebellion', 0.5082306861877441)]

In [62]:
new_model.save("./clustering_word2vec.model")

In [1]:
# save in binary format (in a fresh session, reload the saved model first)
import gensim
new_model = gensim.models.Word2Vec.load("./clustering_word2vec.model")
new_model.wv.save_word2vec_format("clustering_model.bin", binary=True)

## Training a Doc2Vec model on Amazon reviews and then fine-tuning it on the clustering sentences dataset