# Word Embeddings via Word2Vec

Instead of using pre-trained models, you can train your own embeddings when you have a large corpus and need the vectors to capture domain-specific context.

## General Process

- Define our corpus as the universe (we assume the corpus captures the vocabulary well enough)
- Use spaCy to tokenize the text
- Build a list of lists (each inner list holds the tokens of one document) as input for gensim
- Train gensim Word2Vec
- Tell spaCy about our vectors

> We will use a smaller corpus to highlight the tradeoffs of training our own embeddings
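One way to probe the "corpus == universe" assumption is to measure how many tokens of unseen text are missing from the corpus vocabulary. The sketch below is illustrative only: the toy corpus, the new utterance, and the helper name `oov_rate` are all made up for this example and are not part of the notebook's data.

```python
# Hypothetical toy corpus: each inner list holds the tokens of one document.
corpus = [
    ["book", "a", "flight", "to", "boston"],
    ["cancel", "my", "flight"],
    ["change", "my", "seat"],
]


def oov_rate(tokens, vocabulary):
    """Return the fraction of tokens that are missing from the vocabulary."""
    if not tokens:
        return 0.0
    missing = [tok for tok in tokens if tok not in vocabulary]
    return len(missing) / len(tokens)


# The "universe": every distinct token seen in the corpus.
vocabulary = {tok for doc in corpus for tok in doc}

# A made-up utterance that the corpus only partly covers.
new_tokens = ["upgrade", "my", "seat", "to", "denver"]
rate = oov_rate(new_tokens, vocabulary)
print(f"{rate:.0%} of the new tokens are out of vocabulary")
```

A high rate on held-out text suggests the corpus is too small or too narrow for the vectors to generalize.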



In [None]:
# imports
import numpy as np
import pandas as pd

import spacy
import gensim

In [None]:
# initialize spacy -- we are going to bring our own model
#                     so small is fine (and explicit)

MODEL = "en_core_web_sm"
spacy.cli.download(MODEL)

### ok, we have that model, let's build our own
nlp = spacy.load(MODEL)

In [None]:
# get the data == universe

SQL = "SELECT * from `datasets.airline-intents`"

intents = pd.read_gbq(SQL, "questrom")




In [None]:
# get a sample to see what we have
intents.sample(5)

In [None]:
intents.shape

## Step 1: spacy tokenize text

In [None]:
# tokenize the corpus

def tokenize(text):
  doc = nlp(text)
  return [token.text for token in doc]

In [None]:
# apply it to the text column
intents['tokens'] = intents.text.apply(tokenize)

In [None]:
# another sample
intents.sample(5)

## Step 2: Gensim Word2Vec

In [None]:
# gensim
from gensim.models import Word2Vec

In [None]:
# extract tokens as a list of lists

docs = intents.tokens.to_list()

In [None]:
docs[:5]

In [None]:
# fit the model
# 50-dimensional vectors, a context window of 3, and the skip-gram model (sg=1)
# https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec
# note: gensim 4 renamed the `size` argument to `vector_size`

model = Word2Vec(docs, size=50, window=3, sg=1)
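To make the `window=3` and `sg=1` settings more concrete, here is a simplified sketch of how skip-gram turns a tokenized sentence into (center, context) training pairs. It is not gensim's implementation: gensim also randomly shrinks the window and subsamples frequent words, and the helper name `skipgram_pairs` and the example sentence are made up for illustration.

```python
def skipgram_pairs(tokens, window=3):
    """Yield (center, context) pairs for tokens within `window` positions."""
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a token is never its own context
                yield center, tokens[j]


# Made-up example sentence, already tokenized.
sentence = ["book", "a", "flight", "to", "boston", "please"]
pairs = list(skipgram_pairs(sentence, window=3))

# Every token within 3 positions of "flight" becomes one of its contexts.
flight_contexts = [ctx for center, ctx in pairs if center == "flight"]
print(flight_contexts)
```

A larger window pulls in more distant, topical context, while a smaller one keeps the pairs closer to syntactic neighbors.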

In [None]:
# what do we have
type(model)

In [None]:
# we have some basics

model.corpus_count

In [None]:
intents.shape

In [None]:
# what is the size of our vocab?
# note: gensim 4 removed `wv.vocab`; use len(model.wv.key_to_index) there

len(model.wv.vocab)
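The vocabulary is usually smaller than the number of distinct tokens in the corpus, because `Word2Vec` drops tokens that occur fewer than `min_count` times (the documented default is 5). The sketch below reproduces only that counting step on a made-up toy corpus; it does not call gensim, and `toy_docs` is a hypothetical stand-in for `docs`.

```python
from collections import Counter

MIN_COUNT = 5  # gensim's documented default for Word2Vec(min_count=...)

# Hypothetical toy corpus in the same list-of-lists shape as `docs`.
toy_docs = [["book", "a", "flight"]] * 5 + [["flight", "to", "boston"]] * 2

# Count every token across all documents.
counts = Counter(tok for doc in toy_docs for tok in doc)

# Only tokens that reach the threshold would receive a vector.
kept = {tok for tok, n in counts.items() if n >= MIN_COUNT}
dropped = set(counts) - kept

print("kept:", sorted(kept))
print("dropped:", sorted(dropped))
```

On a small corpus this threshold can remove many domain terms, so a lookup for a token that looks common to a human reader may still fail.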

In [None]:
# get vocab vectors

model.wv.get_vector('the')
model.wv.get_vector('flight')
model.wv.get_vector('boston')
model.wv.get_vector('help')


## it fails hard (KeyError) on a lookup for a token that is not in the vocabulary

In [None]:
# we can also compare

In [None]:
# we can look at the most similar vectors for a token

model.wv.most_similar("boston")
model.wv.most_similar("flight")
# model.wv.most_similar("the")
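`most_similar` ranks the other tokens in the vocabulary by cosine similarity to the query vector. The following is a minimal pure-Python sketch of that idea, not gensim's code: the 3-dimensional vectors are invented for illustration, and the helper names `cosine` and `most_similar` are local to this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 3-dimensional vectors, for illustration only.
toy_vectors = {
    "boston": [0.9, 0.1, 0.0],
    "denver": [0.8, 0.2, 0.1],
    "flight": [0.1, 0.9, 0.2],
    "help": [0.0, 0.2, 0.9],
}


def most_similar(word, vectors, topn=2):
    """Rank every other token by cosine similarity to `word`."""
    query = vectors[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


ranked = most_similar("boston", toy_vectors)
print(ranked)
```

With vectors trained on a small corpus, the top neighbors can look arbitrary, which is one of the tradeoffs this notebook is meant to highlight.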

## Step 3: Save and load into spaCy

In [None]:
# ok, let's save this out to a text file
model.wv.save_word2vec_format("word2vec.txt")
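The word2vec text format is simple: a header line with the vocabulary size and the vector dimension, followed by one line per token containing the token and its space-separated values. The sketch below parses a made-up, in-memory sample of that format; `read_word2vec_text` is a hypothetical helper written for this example, not a gensim or spaCy function.

```python
import io


def read_word2vec_text(handle):
    """Parse the plain-text word2vec format into a {token: [floats]} dict."""
    header = handle.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        token, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {token!r}")
        vectors[token] = values
    if len(vectors) != vocab_size:
        raise ValueError("header vocabulary size does not match the rows read")
    return vectors


# Made-up file contents: 2 tokens, 3 dimensions each.
sample = io.StringIO("2 3\nflight 0.1 0.9 0.2\nboston 0.9 0.1 0.0\n")
vectors = read_word2vec_text(sample)
print(vectors["boston"])
```

Peeking at the first few lines of `word2vec.txt` before compressing it is a quick sanity check that the header matches the vocabulary size and the 50 dimensions requested above.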

In [None]:
# we are going to compress the file

! gzip word2vec.txt

In [None]:
! ls -l

In [None]:
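# inform spacy of a new model
# https://spacy.io/api/cli#init-vectors
# we are on spacy 2, where the command is init-model;
# spacy 3 replaced it with `spacy init vectors` (documented at the link above)

! python -m spacy init-model en brock-model --vectors-loc word2vec.txt.gz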

In [None]:
# rename this to whatever you have above
nlp = spacy.load("brock-model")

In [None]:
# let's check that the vectors are being used
# check boston

nlp("boston").vector

In [None]:
# compare that we are using the same
# model.wv.get_vector("boston")

# confirm spacy and gensim return the same vector
# (np.allclose gives a single True/False instead of an element-wise array)
np.allclose(nlp("boston").vector, model.wv.get_vector("boston"))

In [None]:
# all the other bits still apply
test = nlp("This is the example please btibert@bu.edu")

In [None]:
# let's confirm

for token in test:
  print(token.text, token.lemma_, token.like_email, token.is_oov)

In [None]:
## what does the output above tell us about building our own?
## what might we need to "improve" our model/vectors?


## Where to go from here

- fastText vectors [use txt not bin]: https://github.com/facebookresearch/fastText/blob/master/docs/crawl-vectors.md#models
