# Embeddings Notebook
This notebook trains a custom Word2Vec model on our corpus, optionally extended with any number of Wikipedia articles. Each model is stored in a class that can be pickled and reloaded later. The pickled class also stores the model's version and description, so we can easily keep track of which model produces the best synonym results.

In [1]:
from Embeddings import EmbeddingsObject # embeddings class
import dill # package to pickle an object with all its dependency files
import warnings
warnings.filterwarnings('ignore')

In [2]:
# instantiate embeddings class with version and description
embeddingsObj = EmbeddingsObject(version = "1", description = "main corpus")

In [3]:
# set the corpus; if 'clean_sentence.csv' is not in the directory, first run build_corpus()
#embeddingsObj.build_corpus() # requires full_dataframe.csv in the directory above
embeddingsObj.set_corpus()

Set main corpus as training data for embeddings


In [4]:
# first parameter is search term, second is number of articles with that term to add to training data
#embeddingsObj.scrape_wiki('armament', 50)

In [5]:
# corpus description is a list with all training data descriptions
embeddingsObj.corpus_description

['Main DoD Corpus']

In [6]:
'''
    train_embeddings takes:
    - size: embedding dimensions (default 100)
    - window: context window size (default 5)
    - min_count: minimum word count (default 1)
    - workers: number of worker threads (default 3)
    - training_type: 0 for CBOW, 1 for Skip-Gram
'''
embeddingsObj.train_embeddings(size = 100, window = 5, min_count = 1, workers = 3, training_type = 0)

Finished Training Custom Embeddings in 2 Minutes
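The `window` and `training_type` parameters are easier to reason about with a small example. The sketch below is not the code inside `train_embeddings`; it is a simplified, standard-library illustration of how CBOW and Skip-Gram frame the prediction task for one sentence. The helper names (`context_window`, `cbow_examples`, `skipgram_examples`) and the toy sentence are assumptions made for illustration, and real Word2Vec implementations add details such as sampling a smaller window at random and negative sampling.

```python
from typing import List, Tuple


def context_window(tokens: List[str], i: int, window: int) -> List[str]:
    """Return the tokens within `window` positions of index i, excluding tokens[i]."""
    lo = max(0, i - window)
    hi = min(len(tokens), i + window + 1)
    return tokens[lo:i] + tokens[i + 1:hi]


def cbow_examples(tokens: List[str], window: int) -> List[Tuple[List[str], str]]:
    """CBOW (training_type = 0): predict the centre word from its context words."""
    return [(context_window(tokens, i, window), tokens[i]) for i in range(len(tokens))]


def skipgram_examples(tokens: List[str], window: int) -> List[Tuple[str, str]]:
    """Skip-Gram (training_type = 1): predict each context word from the centre word."""
    return [
        (tokens[i], ctx)
        for i in range(len(tokens))
        for ctx in context_window(tokens, i, window)
    ]


# Hypothetical toy sentence, not taken from the corpus
sentence = ["the", "radar", "tracks", "the", "target"]
cbow = cbow_examples(sentence, window=1)
skipgram = skipgram_examples(sentence, window=1)

print(cbow[1])       # (['the', 'tracks'], 'radar')
print(skipgram[:3])  # [('the', 'radar'), ('radar', 'the'), ('radar', 'tracks')]
```

CBOW produces one example per word, while Skip-Gram produces one example per (word, context word) pair, which is one reason Skip-Gram usually takes longer to train on the same corpus.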


In [7]:
embeddingsObj.find_synonyms(synonym_threshold = 0.9, save_synonyms = True)

num of words: 174838
Through 500 out of 174838 words
Through 1000 out of 174838 words
...
Through 134000 out of 174838 words
...
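The notebook does not show how `find_synonyms` applies `synonym_threshold`. A common approach, assumed here, is to treat two words as synonym candidates when the cosine similarity of their vectors is at or above the threshold. The following is a minimal standard-library sketch of that idea; the function names and the toy vectors are hypothetical and are not part of `EmbeddingsObject`.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def synonyms_above_threshold(
    vectors: Dict[str, List[float]], threshold: float
) -> Dict[str, List[str]]:
    """Map each word to the other words whose similarity meets the threshold."""
    result = {}
    for word, vec in vectors.items():
        matches = [
            other
            for other, other_vec in vectors.items()
            if other != word and cosine_similarity(vec, other_vec) >= threshold
        ]
        if matches:
            result[word] = sorted(matches)
    return result


# Hypothetical 3-dimensional vectors; the trained model uses 100 dimensions
toy_vectors = {
    "rifle": [0.9, 0.1, 0.0],
    "carbine": [0.85, 0.15, 0.05],
    "budget": [0.0, 0.2, 0.9],
}
toy_synonyms = synonyms_above_threshold(toy_vectors, threshold=0.9)
print(toy_synonyms)  # {'rifle': ['carbine'], 'carbine': ['rifle']}
```

Comparing every word against every other word grows quadratically with vocabulary size, which would be consistent with the long progress log above for 174838 words, although the actual implementation may work differently.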

In [9]:
# save embedding object as pickle, named after its description
model_name = embeddingsObj.description.replace(" ", "_")
with open("pickled_models/{}.pkl".format(model_name), "wb") as f:
    dill.dump(embeddingsObj, f)
print("Saved model object as {}.pkl".format(model_name))

Saved model object as main_corpus.pkl


In [10]:
with open("pickled_models/main_corpus.pkl", "rb") as f:
    read_in_model = dill.load(f)
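
The save and load cells can be wrapped in two small helpers so that the file name is always derived from the description in the same way. The sketch below is an illustration under stated assumptions: it uses the standard-library `pickle` module (`dill` exposes the same `dump` and `load` functions), a plain dictionary stands in for the embeddings object, and the helper names `save_model` and `load_model` are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path


def model_path(directory: Path, description: str) -> Path:
    """Build the file path from the description, replacing spaces with underscores."""
    return directory / "{}.pkl".format(description.replace(" ", "_"))


def save_model(model: dict, directory: Path) -> Path:
    """Pickle the model into `directory`, creating the directory if needed."""
    directory.mkdir(parents=True, exist_ok=True)
    path = model_path(directory, model["description"])
    with path.open("wb") as f:
        pickle.dump(model, f)
    return path


def load_model(path: Path) -> dict:
    """Load a previously pickled model."""
    with path.open("rb") as f:
        return pickle.load(f)


# Stand-in for the embeddings object: only version, description, and a toy vector
toy_model = {"version": "1", "description": "main corpus", "vectors": {"rifle": [0.9, 0.1]}}

# A temporary directory keeps the example from touching the real pickled_models folder
with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_model(toy_model, Path(tmp) / "pickled_models")
    restored = load_model(saved_path)
    saved_name = saved_path.name

print(saved_name)                                    # main_corpus.pkl
print(restored["version"], restored["description"])  # 1 main corpus
```

Because the version and description travel with the pickled object, a reloaded model can be identified without relying on the file name alone.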