# WORD2VEC

The word2vec algorithm uses a neural network model to learn word associations from a large corpus of text. It represents each word as a vector (an embedding), and words that appear in similar contexts end up with similar vectors, so each vector carries some semantic meaning. The model is a shallow, two-layer neural network trained to reconstruct the context of words.

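A model is trained with one of two architectures:

1. CBOW (Continuous Bag of Words) - the model predicts the current word from the surrounding words. The order of the context words is ignored, every word inside the window contributes equally (distant ones as much as near ones), and training is faster.
2. Skip-gram - the model predicts the surrounding window of words from the current word. Closer context words are weighted more heavily than distant ones, and training is slower.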
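
To make the difference between the two architectures concrete, here is a minimal sketch of the training examples each one would draw from a single tokenized sentence. The function name `training_examples` and the `window` argument are illustrative assumptions, not part of gensim, and the sketch ignores details such as randomly shrunk windows.

```python
def training_examples(tokens, window=2):
    """Build CBOW and skip-gram training examples from one tokenized sentence."""
    cbow, skip_gram = [], []
    for i, target in enumerate(tokens):
        # Context = up to `window` tokens on each side of the target.
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if not context:
            continue
        # CBOW: the whole context, treated as an unordered bag, predicts the target.
        cbow.append((context, target))
        # Skip-gram: the target predicts each context word separately.
        skip_gram.extend((target, c) for c in context)
    return cbow, skip_gram


tokens = ["model", "predicts", "current", "word"]
cbow, skip_gram = training_examples(tokens, window=1)
print(cbow)
print(skip_gram)
```

CBOW yields one example per position, while skip-gram yields one example per (target, context word) pair, which is one reason skip-gram is slower to train.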

Hyperparameters involved:

1. Training algorithm - hierarchical softmax and/or negative sampling. Hierarchical softmax works better for infrequent words, while negative sampling works better for frequent words and for low-dimensional vectors.

2. Subsampling - High-frequency words often provide little information. Words with a frequency above a certain threshold may be subsampled (randomly dropped during training) to increase training speed.

3. Dimensionality - Embedding quality improves with vector size, but the gain levels off after a point. Typical sizes are 100 to 1000.

4. Context window - the number of surrounding words used as context. Recommended values are 10 for skip-gram and 5 for CBOW.
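
As an illustration of the subsampling setting, the sketch below uses the discard rule from the original word2vec paper, P(discard) = 1 - sqrt(t / f), where f is a word's relative frequency and t is the threshold (the paper suggests around 1e-5). The toy corpus and the helper name `discard_probability` are assumptions made for this example, and actual implementations may use a slightly different formula.

```python
import math
from collections import Counter


def discard_probability(word_freq, threshold=1e-5):
    """Paper formula: P(discard) = 1 - sqrt(threshold / freq), clipped at 0."""
    return max(0.0, 1.0 - math.sqrt(threshold / word_freq))


# Hypothetical toy corpus; real corpora have millions of tokens.
tokens = ["the"] * 90 + ["embedding"] * 9 + ["softmax"]
counts = Counter(tokens)
total = sum(counts.values())

for word, count in counts.items():
    freq = count / total
    # A threshold far larger than 1e-5 is used so the toy numbers are readable.
    print(word, round(discard_probability(freq, threshold=0.05), 3))
```

In gensim, these four settings correspond to the `Word2Vec` arguments `hs` and `negative` (training algorithm), `sample` (subsampling threshold), `vector_size` (named `size` before gensim 4.0) and `window`, with `sg` selecting skip-gram or CBOW.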

The exercise is to train our own word2vec model and then experiment with a pretrained model.

In [1]:
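import re

import nltk
from gensim.models import Word2Vec
from nltk.corpus import stopwords

# If the NLTK data is missing, a one-time download is needed, for example:
# nltk.download('punkt'); nltk.download('stopwords')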

In [2]:
paragraph = """WORD2VEC
The word2vec algorithm uses a neural network model to learn word associations from a large corpus of text. Each vector has some semantic meaning to it. Created with shallow 2 layered NN that reconstruct the context of words. Helps in developing context for each word using embeddings.

Developed in either of the two model archs:

CBOW - Continuous Bag of Words - model predicts current word from surrounding words. ( No order of context, faster, distant also better)
Skip Gram - model predicts surrounding windows from current word. (context is order, slower, closer ones more important)
Hyper Parameters involved:

Training algorithm - hierarchical softmax and/or negative sampling. hierarchical softmax works better for infrequent words while negative sampling works better for frequent words and better with low dimensional vectors.

Sub Sampling - High-frequency words often provide little information. Words with a frequency above a certain threshold may be subsampled to increase training speed

Dimensionality - After a point of increased embedding size, no point. Usually 100 to 1000 is the size.

Context Window - number of Surrounding words - 10 for skip gram, 5 for CBOW"""

In [3]:
# Preprocess the data: split into sentences, strip punctuation, lowercase, drop stopwords
stop_words = set(stopwords.words('english'))  # build once instead of once per word
sentences = nltk.sent_tokenize(paragraph)
processed_sentences = []
for sentence in sentences:
    print("\nSentence before processing : ", sentence)
    sentence = re.sub(r'[^a-zA-Z0-9]', ' ', sentence)  # keep letters and digits only
    sentence = re.sub(r'\s+', ' ', sentence)  # collapse runs of whitespace
    sentence = sentence.lower()
    words = nltk.word_tokenize(sentence)

    processed_sentence = [word for word in words if word not in stop_words]

    processed_sentences.append(processed_sentence)

    print("\nSentence after processing : ", processed_sentence)


Sentence before processing :  WORD2VEC
The word2vec algorithm uses a neural network model to learn word associations from a large corpus of text.

Sentence after processing :  ['word2vec', 'word2vec', 'algorithm', 'uses', 'neural', 'network', 'model', 'learn', 'word', 'associations', 'large', 'corpus', 'text']

Sentence before processing :  Each vector has some semantic meaning to it.

Sentence after processing :  ['vector', 'semantic', 'meaning']

Sentence before processing :  Created with shallow 2 layered NN that reconstruct the context of words.

Sentence after processing :  ['created', 'shallow', '2', 'layered', 'nn', 'reconstruct', 'context', 'words']

Sentence before processing :  Helps in developing context for each word using embeddings.

Sentence after processing :  ['helps', 'developing', 'context', 'word', 'using', 'embeddings']

Sentence before processing :  Developed in either of the two model archs:

CBOW - Continuous Bag of Words - model predicts current word from surr

In [5]:
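# min_count=1 keeps every word; the default would drop almost all of this tiny corpus
model = Word2Vec(processed_sentences, min_count=1)

# gensim 3.x API: gensim 4.0 removed wv.vocab in favor of wv.key_to_index and wv.get_vecattr
vocab = model.wv.vocab
for key, value in vocab.items():
    print(key, " : ", value)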

word2vec  :  Vocab(count:2, index:7, sample_int:1361428237)
algorithm  :  Vocab(count:2, index:8, sample_int:1361428237)
uses  :  Vocab(count:1, index:23, sample_int:2086370027)
neural  :  Vocab(count:1, index:24, sample_int:2086370027)
network  :  Vocab(count:1, index:25, sample_int:2086370027)
model  :  Vocab(count:4, index:2, sample_int:905746060)
learn  :  Vocab(count:1, index:26, sample_int:2086370027)
word  :  Vocab(count:4, index:3, sample_int:905746060)
associations  :  Vocab(count:1, index:27, sample_int:2086370027)
large  :  Vocab(count:1, index:28, sample_int:2086370027)
corpus  :  Vocab(count:1, index:29, sample_int:2086370027)
text  :  Vocab(count:1, index:30, sample_int:2086370027)
vector  :  Vocab(count:1, index:31, sample_int:2086370027)
semantic  :  Vocab(count:1, index:32, sample_int:2086370027)
meaning  :  Vocab(count:1, index:33, sample_int:2086370027)
created  :  Vocab(count:1, index:34, sample_int:2086370027)
shallow  :  Vocab(count:1, index:35, sample_int:2086370
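
The `count` and `index` fields in the output above can be read as a word's number of occurrences and its row in the embedding matrix. The sketch below rebuilds such a table with plain Python for three of the processed sentences. It assumes that more frequent words receive smaller indices, which matches the output above; the tie-breaking order here is just first-seen order and may differ from gensim's.

```python
from collections import Counter

# Three of the processed sentences from above, used as a toy corpus.
sentences = [
    ["vector", "semantic", "meaning"],
    ["helps", "developing", "context", "word", "using", "embeddings"],
    ["created", "shallow", "2", "layered", "nn", "reconstruct", "context", "words"],
]

counts = Counter(word for sentence in sentences for word in sentence)

# Assumption: indices are assigned in order of descending frequency.
vocab = {
    word: {"count": count, "index": index}
    for index, (word, count) in enumerate(counts.most_common())
}
print(vocab["context"])
```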

In [6]:
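# Look up the learned vector for 'skip', then rank the other words by cosine similarity
vector = model.wv['skip']

# With a corpus this small, the similarity scores are unlikely to be meaningful
similar = model.wv.most_similar('skip')
similar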

[('5', 0.26387906074523926),
 ('developing', 0.2299186885356903),
 ('important', 0.19774191081523895),
 ('size', 0.19406504929065704),
 ('dimensionality', 0.18159811198711395),
 ('word2vec', 0.181463822722435),
 ('closer', 0.16070972383022308),
 ('network', 0.1579560488462448),
 ('meaning', 0.1513058990240097),
 ('using', 0.14122387766838074)]
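
The scores returned by `most_similar` are cosine similarities between word vectors. As a rough sketch of that ranking, the example below computes cosine similarity by hand on made-up 3-dimensional vectors; the `embeddings` dictionary is hypothetical and not taken from the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional embeddings, for illustration only.
embeddings = {
    "skip": [1.0, 0.0, 1.0],
    "gram": [1.0, 0.1, 0.9],
    "size": [0.0, 1.0, 0.0],
}

query = "skip"
ranked = sorted(
    ((word, cosine_similarity(embeddings[query], vec))
     for word, vec in embeddings.items() if word != query),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)
```

A score near 1 means the two vectors point in almost the same direction, and a score near 0 means they are unrelated, which is why the low scores in the output above suggest the toy model has learned little.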

In [7]:
# Pretrained models from the gensim repository (listing all available models)
import gensim.downloader
print(list(gensim.downloader.info()['models'].keys()))

['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis']


In [8]:
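# Load 300-dimensional GloVe vectors (a different embedding method from word2vec);
# the model is downloaded on first use
glove_wiki = gensim.downloader.load('glove-wiki-gigaword-300')

glove_wiki.most_similar('wikipedia')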

[('encyclopedia', 0.5071935653686523),
 ('wikimedia', 0.5039559602737427),
 ('wiki', 0.49234116077423096),
 ('facebook', 0.46857360005378723),
 ('blog', 0.4539109170436859),
 ('conservapedia', 0.4523037075996399),
 ('youtube', 0.45151132345199585),
 ('britannica', 0.44909369945526123),
 ('websites', 0.44877538084983826),
 ('blogs', 0.4350595772266388)]