## Word embeddings


Word embedding is any of a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers. Conceptually it involves a mathematical embedding from a space with many dimensions per word to a continuous vector space with a much lower dimension.

## Word2vec 


### Statistical Language Model
A statistical language model is a probability distribution over sequences of words. Given such a sequence, say of length m, it assigns a probability. to the whole sequence. The language model provides context to distinguish between words and phrases that sound similar.


### CBOW
Given context, setup a neural net to predict next word

### SkipGram 
Given a word in the text sequence, setup a neural net to predict the sequence.





<img src="images/word2vec.jpg"> 


Context size is fixed (hyperparameter) and the input to the neural net is the k words of the context in 1-hot representation.

So if we have 1000 words in the vocab and context 4, the input in CBOW will be 4 stacked 1000 dimensional 1-hot vectors (one vector for each word in the context) and the target vector a 1000 dimensional target vector of the probabilities of the next word.

The hidden layer would be of dimensionality D (again hyperparameter). The byproduct of the training then are two matrices, the input-hidden layer weights and the hidden-output layer weights.

Both matrices have dimensions $Dx|V|$, where D the hidden layer dimension and $|V|$ the size of the vocabulary.


Similarily, the configuration of the skip-gram architecture is similar:

Input vector 1-hot encoding of the input word, output vectors k softmax vectors of dimension $|V|$. 



In [32]:
import gensim 
import numpy as np
import logging
from gensim.models import Word2Vec

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [20]:
class MyCorpus(object):
    """An interator that yields sentences (lists of str)."""
    
    def __init__(self, lines):
        self.lines = lines
            
    def __iter__(self):
        for line in self.lines:
            
            text = line.lower()
            
            yield gensim.utils.simple_preprocess(text)

In [21]:
data = open("./data/amazon_reviews.txt").readlines()

In [22]:
sentences = MyCorpus(data)
for s in sentences:
    print(s)
    break

['bought', 'this', 'album', 'because', 'loved', 'the', 'title', 'song', 'it', 'such', 'great', 'song', 'how', 'bad', 'can', 'the', 'rest', 'of', 'the', 'album', 'be', 'right', 'well', 'the', 'rest', 'of', 'the', 'songs', 'are', 'just', 'filler', 'and', 'are', 'worth', 'the', 'money', 'paid', 'for', 'this', 'it', 'either', 'shameless', 'bubblegum', 'or', 'depressing', 'tripe', 'kenny', 'chesney', 'is', 'popular', 'artist', 'and', 'as', 'result', 'he', 'is', 'in', 'the', 'cookie', 'cutter', 'category', 'of', 'the', 'nashville', 'music', 'scene', 'he', 'gotta', 'pump', 'out', 'the', 'albums', 'so', 'the', 'record', 'company', 'can', 'keep', 'lining', 'their', 'pockets', 'while', 'the', 'suckers', 'out', 'there', 'keep', 'buying', 'this', 'garbage', 'to', 'perpetuate', 'more', 'garbage', 'coming', 'out', 'of', 'that', 'town', 'll', 'get', 'down', 'off', 'my', 'soapbox', 'now', 'but', 'country', 'music', 'really', 'needs', 'to', 'get', 'back', 'to', 'it', 'roots', 'and', 'stop', 'this', 'po

In [23]:
model = Word2Vec( min_count=5, workers=5, size=200) 
model.build_vocab(sentences)

2020-12-03 01:06:51,120 : INFO : collecting all words and their counts
2020-12-03 01:06:51,122 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-12-03 01:06:53,253 : INFO : PROGRESS: at sentence #10000, processed 1242775 words, keeping 41603 word types
2020-12-03 01:06:53,661 : INFO : collected 45128 word types from a corpus of 1480752 raw words and 11914 sentences
2020-12-03 01:06:53,662 : INFO : Loading a fresh vocabulary
2020-12-03 01:06:53,711 : INFO : effective_min_count=5 retains 13926 unique words (30% of original 45128, drops 31202)
2020-12-03 01:06:53,712 : INFO : effective_min_count=5 leaves 1429573 word corpus (96% of original 1480752, drops 51179)
2020-12-03 01:06:53,762 : INFO : deleting the raw counts dictionary of 45128 items
2020-12-03 01:06:53,764 : INFO : sample=0.001 downsamples 50 most-common words
2020-12-03 01:06:53,765 : INFO : downsampling leaves estimated 1091047 word corpus (76.3% of prior 1429573)
2020-12-03 01:06:53,803 : INFO :

In [24]:

sentences = MyCorpus(data)
model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)

2020-12-03 01:07:31,503 : INFO : training model with 5 workers on 13926 vocabulary and 200 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2020-12-03 01:07:32,553 : INFO : EPOCH 1 - PROGRESS: at 25.37% examples, 272751 words/s, in_qsize 9, out_qsize 0
2020-12-03 01:07:33,597 : INFO : EPOCH 1 - PROGRESS: at 59.64% examples, 310007 words/s, in_qsize 9, out_qsize 0
2020-12-03 01:07:34,637 : INFO : EPOCH 1 - PROGRESS: at 92.82% examples, 323266 words/s, in_qsize 10, out_qsize 0
2020-12-03 01:07:34,717 : INFO : worker thread finished; awaiting finish of 4 more threads
2020-12-03 01:07:34,718 : INFO : worker thread finished; awaiting finish of 3 more threads
2020-12-03 01:07:34,720 : INFO : worker thread finished; awaiting finish of 2 more threads
2020-12-03 01:07:34,723 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-12-03 01:07:34,729 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-12-03 01:07:34,730 : INFO : EPOCH - 1 : training o

(5455417, 7403760)

In [25]:
model.wv.most_similar("awful")

2020-12-03 01:08:11,966 : INFO : precomputing L2-norms of word weight vectors


[('ridiculous', 0.8187541961669922),
 ('scary', 0.7904298901557922),
 ('terrible', 0.785262405872345),
 ('stupid', 0.7838867902755737),
 ('lame', 0.7792518138885498),
 ('plainly', 0.772826075553894),
 ('joke', 0.7724997997283936),
 ('plain', 0.771803081035614),
 ('kinda', 0.7712831497192383),
 ('absolutely', 0.7675280570983887)]

In [26]:
model.wv.most_similar("camera")

[('lens', 0.891923189163208),
 ('bag', 0.8825212717056274),
 ('unit', 0.8497478365898132),
 ('camcorder', 0.8489190340042114),
 ('battery', 0.8283607959747314),
 ('tripod', 0.8274000883102417),
 ('case', 0.8206779956817627),
 ('canon', 0.8135760426521301),
 ('charger', 0.8048727512359619),
 ('razor', 0.7936983108520508)]

In [64]:
model.wv.most_similar("book")

[('novel', 0.7962727546691895),
 ('author', 0.7322311997413635),
 ('movie', 0.7139641642570496),
 ('story', 0.6948456764221191),
 ('books', 0.6827583312988281),
 ('writing', 0.6768943071365356),
 ('read', 0.6709310412406921),
 ('film', 0.65472412109375),
 ('bible', 0.6295198202133179),
 ('review', 0.6107279062271118)]

In [65]:
model.wv.most_similar("movie")

[('film', 0.9205185770988464),
 ('story', 0.8234798312187195),
 ('novel', 0.8226872682571411),
 ('plot', 0.7506477236747742),
 ('show', 0.7407904863357544),
 ('ending', 0.7253567576408386),
 ('book', 0.7139641046524048),
 ('album', 0.708283543586731),
 ('song', 0.7051331996917725),
 ('guy', 0.6938921809196472)]

In [39]:
from sklearn.cluster import KMeans

In [89]:
kmeans = KMeans(10)

kmeans.fit(model.wv.vectors)

KMeans(n_clusters=10)

In [90]:
labels = kmeans.labels_

In [91]:
vocab = list(model.wv.vocab.keys())

In [92]:
#model.wv.vocab

In [94]:
for i in range(50):
    ret = np.where( labels==i)[0]
    
    words = [vocab[ret[j]] for j in range( min(5, len(ret)) )]
    print( " ".join(words))
    
    print("\n")

quiet upfront capture determine meeting


chesney popular in while ll


album money result category gotta


bought such how bad shameless


loved great rest be are


kenny keep town my brilliant


title it of just filler


out mainstream cd many they


can well either artist as


this because the song right


























































































































