# What is the Word2Vec model?

Word2Vec is a popular algorithm used for natural language processing (NLP) tasks, specifically for learning word embeddings, which are dense numerical representations of words. It was developed by Tomas Mikolov and his colleagues at Google in 2013.

Word2Vec is based on the idea that words with similar meanings tend to appear in similar contexts. The algorithm takes a large corpus of text as input and learns to represent words in a continuous vector space, where words with similar meanings are close. The resulting word vectors capture semantic relationships between terms, such as analogies and similarities.
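"Close" in the vector space is usually measured with cosine similarity. The snippet below is a minimal sketch of that measure in plain Python. The three-dimensional vectors are invented for illustration and are not taken from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional embeddings, made up for this example only.
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "kitten": [0.85, 0.9, 0.15],
    "car": [0.1, 0.2, 0.95],
}

sim_cat_kitten = cosine_similarity(toy_vectors["cat"], toy_vectors["kitten"])
sim_cat_car = cosine_similarity(toy_vectors["cat"], toy_vectors["car"])

# Words with similar meanings should score higher than unrelated ones.
print(sim_cat_kitten > sim_cat_car)  # True
```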

Two main architectures are used in Word2Vec: Continuous Bag-of-Words (CBOW) and Skip-gram. In CBOW, the algorithm predicts the target word based on its surrounding context words. In contrast, Skip-gram predicts the context words given a target word. Both architectures use a shallow neural network with a single hidden layer to learn the word embeddings.
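To make the difference between the two architectures concrete, here is a simplified sketch of the training examples each one would see for a toy sentence. The helper names `cbow_examples` and `skipgram_examples` are hypothetical, and real implementations add details that are left out here, such as frequent-word subsampling and randomly shrunk windows.

```python
def cbow_examples(tokens, window):
    """Yield (context_words, target_word) pairs, the way CBOW frames the task."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield left + right, target

def skipgram_examples(tokens, window):
    """Yield (target_word, context_word) pairs, the way Skip-gram frames the task."""
    for context, target in cbow_examples(tokens, window):
        for word in context:
            yield target, word

sentence = ["the", "cat", "sat", "on", "the", "mat"]
cbow = list(cbow_examples(sentence, window=1))
skipgram = list(skipgram_examples(sentence, window=1))

print(cbow[1])       # (['the', 'sat'], 'cat')
print(skipgram[:3])  # [('the', 'cat'), ('cat', 'the'), ('cat', 'sat')]
```

CBOW produces one example per position (many context words in, one word out), while Skip-gram produces one example per (target, context) pair.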

Once trained, the word vectors can be used for various NLP tasks. For example, they can be used to compute the similarity between words or find semantically related words. The word vectors can also be used as input features for downstream machine learning models in tasks like text classification, sentiment analysis, machine translation, and more.
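One common, simple way to turn word vectors into input features for a downstream model is to average the vectors of the words in a text. The sketch below assumes a plain dictionary of made-up two-dimensional vectors. `average_vector` is a hypothetical helper, not part of any library.

```python
def average_vector(tokens, vectors, dim):
    """Mean of the vectors of known tokens; a zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the values of each dimension together.
    return [sum(column) / len(known) for column in zip(*known)]

# Hypothetical 2-dimensional embeddings, made up for this example only.
toy_vectors = {"good": [1.0, 0.0], "great": [0.8, 0.2], "bad": [-1.0, 0.0]}

# "movie" has no vector, so it is skipped.
features = average_vector(["good", "great", "movie"], toy_vectors, dim=2)
print(features)
```

The resulting fixed-length list could then be passed to a classifier as the feature vector for that text.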

Word2Vec has been widely adopted and has significantly impacted the field of NLP. Its ability to capture semantic relationships between words in an unsupervised manner has made it a valuable tool for many language-related applications.

# Training a Word2Vec model
The model was trained on a personal collection of over 270k English-only articles (dataset size 1.22 GB).
The articles were scraped from multiple English news sites and blog posts, for personal research purposes only.

Alternatively, you can use the 20 GB Wikipedia dataset from this link:
https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2


### Installing the required libraries
Before we can train the model, we need to install the required libraries, such as nltk, pandas, and gensim.

In [8]:
%pip install numpy pandas nltk gensim

Collecting numpy
  Downloading numpy-1.25.0-cp39-cp39-macosx_11_0_arm64.whl (14.0 MB)
Collecting pandas
  Downloading pandas-2.0.2-cp39-cp39-macosx_11_0_arm64.whl (10.9 MB)
Collecting nltk
  Using cached nltk-3.8.1-py3-none-any.whl (1.5 MB)
Collecting gensim
  Downloading gensim-4.3.1-cp39-cp39-macosx_11_0_arm64.whl (24.0 MB)
Collecting pytz>=2020.1 (from pandas)
  Using cached pytz-2023.3-py2.py3-none-any.whl (502 kB)
Collecting tzdata>=2022.1 (from pandas)
  Using cached tzdata-2023.3-py2.py3-none-any.whl (341 kB)
Collecting click (from nltk)
  Using cached click-8

### Importing the libraries
With the libraries installed, we can start implementing Word2Vec by importing everything we need.

In [9]:
import pandas as pd
import gensim
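# word_tokenize relies on NLTK's 'punkt' tokenizer data; if it is missing, run nltk.download('punkt') once
from nltk.tokenize import word_tokenize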
import re
import string
from gensim.models import Word2Vec

### Preprocessing the text
Before we pass the text to the model for training, we need to preprocess it: convert all characters to lowercase, delete digits and punctuation, and strip any whitespace at the start or end.
We then generate the tokens, that is, we split each line into a list of words instead of keeping whole lines and paragraphs.

In [10]:
def preprocess_text(a_string):
    a_string = a_string.lower() # make all characters lower case
    a_string = re.sub(r'\d+', '', a_string) # delete any numbers in the text
    a_string = a_string.translate(str.maketrans('', '', string.punctuation)) # remove ASCII punctuation !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ (curly quotes such as ’ are not removed)
    a_string = a_string.strip() # delete leading and trailing whitespace
    return word_tokenize(a_string)
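
Note that `string.punctuation` only covers ASCII punctuation, so typographic characters such as the curly apostrophe `’` survive the preprocessing above. If that is unwanted, one possible approach (a sketch, not what this notebook uses) is to filter on Unicode character categories. `strip_unicode_punctuation` is a hypothetical helper name.

```python
import string
import unicodedata

def strip_unicode_punctuation(text):
    """Drop every character whose Unicode category starts with 'P' (punctuation)."""
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )

sample = "Clive Davis’ party"

# ASCII-only removal, as in preprocess_text: the curly apostrophe is kept.
ascii_only = sample.translate(str.maketrans("", "", string.punctuation))
print(ascii_only)                         # Clive Davis’ party

# Category-based removal also drops the curly apostrophe.
print(strip_unicode_punctuation(sample))  # Clive Davis party
```

Characters such as `$`, `+`, or `~` are classified as symbols rather than punctuation in Unicode, so the two filters would have to be combined to cover both.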

Define the path to the text dataset that we prepared by scraping online sources.

In [11]:
path_to_file = "articles.txt"

Generate a list of lines from the text file.

In [12]:
vec = []

with open(path_to_file) as f:
    lines = f.read().splitlines()

As we can see, we have over 7 million lines.

In [13]:
len(lines)

7152556

In [15]:
# preview the first 10 lines to check the text
lines[:10]

['Nancy Pelosi receives a standing ovation from audience at Clive Davis’ party',
 "Nancy Pelosi recently attended Clive Davis' annual pre-Grammys party, where she received a standing ovation.",
 'The event took place on Saturday, February 9, 2019, at the Beverly Hilton Hotel, in Beverly Hills, and attracted several big names from the music industry.',
 'Speaker of the House Nancy Pelosi | Photo: Getty Images',
 'RECOGNIZED BY THE OTHER GUESTS',
 "After hearing Davis' words, everyone in the audience stood up to honor Pelosi, who also rose from her chair to express her gratitude towards the other party guests.",
 'A VERY EXCLUSIVE GATHERING',
 "Davis' pre-Grammys party is a much-anticipated event by those who are lucky enough to be invited and the guest list is usually filled with the names of famous performers such as Pharrell, Charlie Puth, Dionne Warwick and Beck.",
 "But one doesn't need to be a recording artist to get invited. Just like Pelosi, figures such as Calvin Klein, Caitlyn 

In [16]:
# run the preprocessor we defined to remove unwanted characters and whitespace and to generate the tokens
for line in lines:
    vec.append(preprocess_text(line))
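
Holding 7 million tokenized lines in a single list takes a lot of memory. As an alternative, gensim is documented to accept any iterable of token lists that can be iterated more than once, so the corpus can be streamed from disk instead. The class below is a minimal sketch of such an iterable. `LineCorpus` is a hypothetical name, and `str.split` stands in for `preprocess_text` so the example runs on its own.

```python
import os
import tempfile

class LineCorpus:
    """Restartable iterable that yields one token list per line of a text file."""

    def __init__(self, path, tokenizer):
        self.path = path
        self.tokenizer = tokenizer

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be iterated repeatedly.
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield self.tokenizer(line.rstrip("\n"))

# Demonstration on a small temporary file.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("first line here\nsecond line\n")

corpus = LineCorpus(tmp.name, tokenizer=str.split)
first_pass = list(corpus)
second_pass = list(corpus)  # a second pass yields the same sentences again
os.remove(tmp.name)

print(first_pass)  # [['first', 'line', 'here'], ['second', 'line']]
```

A plain generator would not be enough here, because training needs more than one pass over the data (one to build the vocabulary, then the training epochs).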

In [17]:
# preview the tokens of the first 2 lines
vec[:2]

[['nancy',
  'pelosi',
  'receives',
  'a',
  'standing',
  'ovation',
  'from',
  'audience',
  'at',
  'clive',
  'davis',
  '’',
  'party'],
 ['nancy',
  'pelosi',
  'recently',
  'attended',
  'clive',
  'davis',
  'annual',
  'pregrammys',
  'party',
  'where',
  'she',
  'received',
  'a',
  'standing',
  'ovation']]

### Defining the model

The Word2Vec model requires a few parameters. The first is our tokenized text data; the others are:

- min_count: ignore any token whose frequency is lower than this count
- window: the maximum distance between the predicted word and its neighboring words in the text
- workers: the number of worker threads used to train the model

In our case we chose a min_count of 1, a window of 20 neighboring words, and 4 workers. By default, the training method is Continuous Bag-of-Words (CBOW).
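
To illustrate what min_count does, the sketch below applies the same kind of frequency cutoff to a toy corpus in plain Python. `apply_min_count` is a hypothetical helper that only approximates the filtering. It is not gensim's implementation.

```python
from collections import Counter

def apply_min_count(sentences, min_count):
    """Drop tokens whose corpus-wide frequency is below min_count."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return [
        [token for token in sentence if counts[token] >= min_count]
        for sentence in sentences
    ]

toy_sentences = [
    ["nancy", "pelosi", "receives"],
    ["nancy", "pelosi", "pregrammys"],
]

# With min_count=1 every token is kept, which is the setting used in this notebook.
print(apply_min_count(toy_sentences, min_count=1) == toy_sentences)  # True

# With min_count=2 the tokens that occur only once are dropped.
print(apply_min_count(toy_sentences, min_count=2))
# [['nancy', 'pelosi'], ['nancy', 'pelosi']]
```

A min_count of 1 keeps rare tokens and typos (such as 'womans' in the results further down) in the vocabulary. A higher value trades coverage for a smaller, cleaner vocabulary.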

In [18]:
model = Word2Vec(vec, min_count=1, window=20, workers=4)

In [20]:
# saving the model for later use
model.save("./word2vec.model")

In [22]:
# after saving the model it can be loaded with this line
model = Word2Vec.load("./word2vec.model")

# Testing the model

We will run some tests to see whether the model finds similar words for a few examples, and we will also do some arithmetic on the word vectors to see whether that works too.

In [87]:
vec = model.wv["king"]-model.wv["man"]+model.wv["woman"]
model.wv.most_similar(vec, topn=5)

[('queen', 0.7477008700370789),
 ('king', 0.6856210827827454),
 ('princess', 0.6503725051879883),
 ('crown', 0.646379292011261),
 ('monarch', 0.6389058828353882)]

In [88]:
vec = model.wv["her"]-model.wv["woman"]+model.wv["man"]
model.wv.most_similar(vec, topn=5)

[('his', 0.8368909955024719),
 ('her', 0.7781476378440857),
 ('him', 0.7210463285446167),
 ('he', 0.6017439961433411),
 ('nonito', 0.5735311508178711)]

In [89]:
model.wv.most_similar("woman", topn=5)

[('lady', 0.7298709750175476),
 ('girl', 0.7199440002441406),
 ('man', 0.7185112237930298),
 ('womans', 0.6461623311042786),
 ('person', 0.6370121836662292)]

In [90]:
model.wv.most_similar("messi", topn=5)

[('aleksandr', 0.8017778992652893),
 ('mariya', 0.7633758187294006),
 ('ronaldo', 0.6185779571533203),
 ('golfers', 0.5623908638954163),
 ('cristiano', 0.5520051121711731)]

In [91]:
model.wv.most_similar("cat", topn=5)

[('feline', 0.8932287693023682),
 ('kitty', 0.8868649005889893),
 ('kitten', 0.860683023929596),
 ('dog', 0.8262585997581482),
 ('puppy', 0.8212113976478577)]

In [92]:
model.wv.most_similar("dog", topn=5)

[('pup', 0.9232812523841858),
 ('puppy', 0.9159420728683472),
 ('pooch', 0.8976006507873535),
 ('cat', 0.8262584805488586),
 ('canine', 0.8040332198143005)]

In [93]:
model.wv.most_similar(["father", "her"], topn=5)

[('mother', 0.8382915258407593),
 ('mom', 0.7471005916595459),
 ('she', 0.7199749946594238),
 ('daughter', 0.7157238721847534),
 ('grandmother', 0.7026622891426086)]