## Word embeddings


Word embedding is any of a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers. Conceptually it involves a mathematical embedding from a space with many dimensions per word to a continuous vector space with a much lower dimension.

## Word2vec 


### Statistical Language Model
A statistical language model is a probability distribution over sequences of words. Given such a sequence, say of length m, it assigns a probability. to the whole sequence. The language model provides context to distinguish between words and phrases that sound similar.


### CBOW
Given context, setup a neural net to predict next word

### SkipGram 
Given a word in the text sequence, setup a neural net to predict the sequence.





<img src="images/word2vec.jpg"> 


Context size is fixed (hyperparameter) and the input to the neural net is the k words of the context in 1-hot representation.

So if we have 1000 words in the vocab and context 4, the input in CBOW will be 4 stacked 1000 dimensional 1-hot vectors (one vector for each word in the context) and the target vector a 1000 dimensional target vector of the probabilities of the next word.

The hidden layer would be of dimensionality D (again hyperparameter). The byproduct of the training then are two matrices, the input-hidden layer weights and the hidden-output layer weights.

Both matrices have dimensions $Dx|V|$, where D the hidden layer dimension and $|V|$ the size of the vocabulary.


Similarily, the configuration of the skip-gram architecture is similar:

Input vector 1-hot encoding of the input word, output vectors k softmax vectors of dimension $|V|$. 



## word2vec breakthrough

- Very fast to train 
    - (async sgd)
    - (negative sampling)
- trained on larger corpora 
    - better embedding quality 



## fastText 

Fasttext is essentially very similar to word2vec. The main differentiation is that it produces embeddings on subword level (ngrams) and then combines the word representation as the sum of its components.

Because of this property it is really useful in datasets with typos/misplaced or missing characters, as it will produce very similar embeddings. 




## Problems with embeddings

- No contextual semantic sensitivity : bank vs bank 
- inherited bias from existing corpora (so we need to be very careful)


In [1]:
import gensim 
import numpy as np
from gensim.models import Word2Vec

import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [2]:
class MyCorpus(object):
    """An interator that yields sentences (lists of str)."""
    
    def __init__(self, lines):
        self.lines = lines
            
    def __iter__(self):
        for line in self.lines:
            
            text = line.lower()
            
            yield gensim.utils.simple_preprocess(text)

In [3]:
data = open("./data/amazon_reviews.txt").readlines()

In [4]:
sentences = MyCorpus(data)
for s in sentences:
    print(s)
    break

['bought', 'this', 'album', 'because', 'loved', 'the', 'title', 'song', 'it', 'such', 'great', 'song', 'how', 'bad', 'can', 'the', 'rest', 'of', 'the', 'album', 'be', 'right', 'well', 'the', 'rest', 'of', 'the', 'songs', 'are', 'just', 'filler', 'and', 'are', 'worth', 'the', 'money', 'paid', 'for', 'this', 'it', 'either', 'shameless', 'bubblegum', 'or', 'depressing', 'tripe', 'kenny', 'chesney', 'is', 'popular', 'artist', 'and', 'as', 'result', 'he', 'is', 'in', 'the', 'cookie', 'cutter', 'category', 'of', 'the', 'nashville', 'music', 'scene', 'he', 'gotta', 'pump', 'out', 'the', 'albums', 'so', 'the', 'record', 'company', 'can', 'keep', 'lining', 'their', 'pockets', 'while', 'the', 'suckers', 'out', 'there', 'keep', 'buying', 'this', 'garbage', 'to', 'perpetuate', 'more', 'garbage', 'coming', 'out', 'of', 'that', 'town', 'll', 'get', 'down', 'off', 'my', 'soapbox', 'now', 'but', 'country', 'music', 'really', 'needs', 'to', 'get', 'back', 'to', 'it', 'roots', 'and', 'stop', 'this', 'po

In [5]:
model = Word2Vec( min_count=5, workers=5, size=200) 
model.build_vocab(sentences)

2021-05-28 13:44:38,798 : INFO : collecting all words and their counts
2021-05-28 13:44:38,799 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-05-28 13:44:41,004 : INFO : PROGRESS: at sentence #10000, processed 1242775 words, keeping 41603 word types
2021-05-28 13:44:41,419 : INFO : collected 45128 word types from a corpus of 1480752 raw words and 11914 sentences
2021-05-28 13:44:41,420 : INFO : Loading a fresh vocabulary
2021-05-28 13:44:41,495 : INFO : effective_min_count=5 retains 13926 unique words (30% of original 45128, drops 31202)
2021-05-28 13:44:41,496 : INFO : effective_min_count=5 leaves 1429573 word corpus (96% of original 1480752, drops 51179)
2021-05-28 13:44:41,549 : INFO : deleting the raw counts dictionary of 45128 items
2021-05-28 13:44:41,551 : INFO : sample=0.001 downsamples 50 most-common words
2021-05-28 13:44:41,552 : INFO : downsampling leaves estimated 1091047 word corpus (76.3% of prior 1429573)
2021-05-28 13:44:41,586 : INFO :

In [6]:

sentences = MyCorpus(data)
model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)

2021-05-28 13:44:45,067 : INFO : training model with 5 workers on 13926 vocabulary and 200 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2021-05-28 13:44:46,087 : INFO : EPOCH 1 - PROGRESS: at 29.82% examples, 322884 words/s, in_qsize 0, out_qsize 0
2021-05-28 13:44:47,132 : INFO : EPOCH 1 - PROGRESS: at 57.23% examples, 299487 words/s, in_qsize 9, out_qsize 0
2021-05-28 13:44:48,167 : INFO : EPOCH 1 - PROGRESS: at 90.01% examples, 317020 words/s, in_qsize 9, out_qsize 0
2021-05-28 13:44:48,259 : INFO : worker thread finished; awaiting finish of 4 more threads
2021-05-28 13:44:48,262 : INFO : worker thread finished; awaiting finish of 3 more threads
2021-05-28 13:44:48,266 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-05-28 13:44:48,269 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-05-28 13:44:48,274 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-05-28 13:44:48,274 : INFO : EPOCH - 1 : training on

(5454643, 7403760)

In [10]:
len(model.wv.vocab)

13926

In [11]:
v1 = model.wv["awful"]
v2 = model.wv["ridiculous"]


In [12]:
v1.dot(v2) / np.sqrt(v1.dot(v1) * v2.dot(v2))
#[('ridiculous', 0.8427592515945435)

0.8366808

In [13]:
sentence = "this is an awful product"

ret = []
for word in sentence.split():
    v = model.wv[word]
    ret.append(v)
    
ret = np.array(ret)

In [14]:
r1 = np.median( ret, axis = 0)#.shape
r2 = np.mean( ret, axis = 0)#.shape
r3 = np.max( ret, axis = 0)#.shape
r4 = np.min( ret, axis = 0)#.shape

r1.reshape(-1,1).shape

final = np.concatenate( (r1.reshape(-1,1), r2.reshape(-1,1), r3.reshape(-1,1), r4.reshape(-1,1)))
final.shape

(800, 1)

In [15]:
model.wv.most_similar("awful")

2021-05-28 13:45:23,344 : INFO : precomputing L2-norms of word weight vectors


[('ridiculous', 0.8366807699203491),
 ('terrible', 0.8213960528373718),
 ('okay', 0.8132612705230713),
 ('scary', 0.7956511974334717),
 ('kinda', 0.794615626335144),
 ('sad', 0.7933586239814758),
 ('joke', 0.7898640632629395),
 ('dissappointing', 0.7880542278289795),
 ('weird', 0.7867347598075867),
 ('subpar', 0.7781393527984619)]

In [16]:
model.wv.most_similar("camera")

[('lens', 0.8993487358093262),
 ('bag', 0.8668299317359924),
 ('camcorder', 0.8383852243423462),
 ('unit', 0.8286342620849609),
 ('battery', 0.8257143497467041),
 ('case', 0.8062664270401001),
 ('canon', 0.8035275936126709),
 ('tripod', 0.7975265979766846),
 ('charger', 0.7946109175682068),
 ('scale', 0.7940206527709961)]

In [17]:
model.wv.most_similar("book")

[('novel', 0.7867717742919922),
 ('author', 0.7369700074195862),
 ('story', 0.7038761377334595),
 ('books', 0.6955527663230896),
 ('movie', 0.6905809640884399),
 ('read', 0.6675878167152405),
 ('writing', 0.6507339477539062),
 ('review', 0.6300108432769775),
 ('film', 0.626803994178772),
 ('language', 0.5969916582107544)]

In [18]:
model.wv.most_similar("movie")

[('film', 0.9149080514907837),
 ('story', 0.8254029154777527),
 ('novel', 0.7964406609535217),
 ('show', 0.7392091751098633),
 ('ending', 0.7327175140380859),
 ('album', 0.7141045928001404),
 ('acting', 0.6943376064300537),
 ('book', 0.6905809640884399),
 ('stuff', 0.6872981190681458),
 ('song', 0.6854572296142578)]

In [105]:
from sklearn.cluster import KMeans

In [106]:
kmeans = KMeans(10)

kmeans.fit(model.wv.vectors)

KMeans(n_clusters=10)

In [90]:
labels = kmeans.labels_

In [91]:
vocab = list(model.wv.vocab.keys())

In [92]:
#model.wv.vocab

In [94]:
for i in range(50):
    ret = np.where( labels==i)[0]
    
    words = [vocab[ret[j]] for j in range( min(5, len(ret)) )]
    print( " ".join(words))
    
    print("\n")

quiet upfront capture determine meeting


chesney popular in while ll


album money result category gotta


bought such how bad shameless


loved great rest be are


kenny keep town my brilliant


title it of just filler


out mainstream cd many they


can well either artist as


this because the song right


























































































































