# Word2Vec
Runs Word2Vec on all text files in the same folder as this notebook<br>
Combines all .txt files into one corpus<br>
Lets you specify the dimensions of the model<br>
Creates a word embedding output in both .txt and .bin format<br>
Credit to Siraj Raval; edited to make it more widely applicable

### Imports

In [1]:
from __future__ import absolute_import, division, print_function
import codecs
import glob
import logging
import multiprocessing
import os
import pprint
import re
import nltk
import gensim.models.word2vec as w2v
import sklearn.manifold
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

In [2]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [3]:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [4]:
nltk.download("punkt")
nltk.download("stopwords")

[nltk_data] Error loading punkt: <urlopen error [Errno -2] Name or
[nltk_data]     service not known>
[nltk_data] Error loading stopwords: <urlopen error [Errno -3]
[nltk_data]     Temporary failure in name resolution>


False

### Read names of .txt files

In [5]:
text_filenames = sorted(glob.glob('./*.txt'))

In [6]:
print("Found texts:")
text_filenames

Found texts:


['./A-Bayesian-Optimization-Approach-to-Compute-the-Nash-Equilibria-of-Potential-Games-using-Bandit-Feedback.txt',
 './A-Disaster-Response-System-Based-on-Human-Agent-Collectives.txt',
 './A-Framework-for-Assessing-the-Performance-of-Pulsar-Search-Pipelines.txt',
 './A-General-Framework-for-Fair-Regression.txt',
 './A-Machine-Learning-Approach-to-Risk-Minimisation-in-Electricity-Markets.txt',
 './A-Machine-Learning-Approach-to-the-Prediction-of-Tidal-Currents.txt',
 './A-Novel-Approach-to-Forecasting-Financial-Volatility-with-Gaussian-Process-Envelopes.txt',
 './A-Probabilistic-Approach-to-Nonparametric-Local-Volatility.txt',
 './A-Proposed-Risk-Modeling-Shift-from-the-Approach-of-SDE-towards-MLC-1.txt',
 './A-Simulation-of-the-Insurance-Industry-The-Problem-of-Risk-Model-Homogeneity.txt',
 './A-signature-based-machine-learning-model-for-bipolar-disorder-and-borderline-personality-disorder.txt',
 './Abate2018_Chapter_ExperimentalBiologicalProtocol.txt',
 './Abate2018_Chapter_ModellingS

### Combine .txt files into a large corpus

In [7]:
corpus_raw = u""
for text_filename in text_filenames:
    print("Reading '{0}'...".format(text_filename))
    with codecs.open(text_filename, "r", "utf-8") as text_file:
        corpus_raw += text_file.read()
    print("Corpus is now {0} characters long".format(len(corpus_raw)))
    print()

Reading './A-Bayesian-Optimization-Approach-to-Compute-the-Nash-Equilibria-of-Potential-Games-using-Bandit-Feedback.txt'...
Corpus is now 43939 characters long

Reading './A-Disaster-Response-System-Based-on-Human-Agent-Collectives.txt'...
Corpus is now 196949 characters long

Reading './A-Framework-for-Assessing-the-Performance-of-Pulsar-Search-Pipelines.txt'...
Corpus is now 283053 characters long

Reading './A-General-Framework-for-Fair-Regression.txt'...
Corpus is now 329803 characters long

Reading './A-Machine-Learning-Approach-to-Risk-Minimisation-in-Electricity-Markets.txt'...
Corpus is now 381392 characters long

Reading './A-Machine-Learning-Approach-to-the-Prediction-of-Tidal-Currents.txt'...
Corpus is now 416454 characters long

Reading './A-Novel-Approach-to-Forecasting-Financial-Volatility-with-Gaussian-Process-Envelopes.txt'...
Corpus is now 455924 characters long

Reading './A-Probabilistic-Approach-to-Nonparametric-Local-Volatility.txt'...
Corpus is now 555150 characte

Corpus is now 9757779 characters long

Reading './Semi-supervised-Learning-with-Deep-Generative-Models.txt'...
Corpus is now 9774289 characters long

Reading './Sequential-Sampling-of-Gaussian-Latent-Variable-Models.txt'...
Corpus is now 9814211 characters long

Reading './Social-Bridges-in-Urban-Purchase-Behaviour.txt'...
Corpus is now 9920684 characters long

Reading './Solving-Strong-Substitutes-Product-Mix-Auctions.txt'...
Corpus is now 10045720 characters long

Reading './Speculative-Trading-of-Electricity-Contracts-in-Interconnected-Locations.txt'...
Corpus is now 10102296 characters long

Reading './Spoofing-and-Price-Manipulation-in-Order-Driven-Markets.txt'...
Corpus is now 10175712 characters long

Reading './Stochastic-Volatility-for-Utility-Maximizers-A-Martingale-Approach.txt'...
Corpus is now 10268598 characters long

Reading './Stochy-automatic-verification-and-synthesis-of-stochastic-processes.txt'...
Corpus is now 10279310 characters long

Reading './String-and-Membran
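The loop above grows `corpus_raw` with `+=`, which is fine at this size (about 10 million characters). A minimal alternative sketch, assuming the same folder layout, collects the pieces in a list and joins them once. The helper name `read_corpus`, the `errors="replace"` setting, and the newline separator are my assumptions, not part of the original notebook.

```python
import glob
import io
import os


def read_corpus(directory="."):
    """Concatenate every .txt file in `directory` into one string."""
    parts = []
    for path in sorted(glob.glob(os.path.join(directory, "*.txt"))):
        # Assumption: errors="replace" keeps the loop going when a file
        # contains bytes that are not valid UTF-8.
        with io.open(path, "r", encoding="utf-8", errors="replace") as handle:
            parts.append(handle.read())
    # A newline between files stops the last word of one text from fusing
    # with the first word of the next.
    return "\n".join(parts)
```

Calling `corpus_raw = read_corpus(".")` would stand in for the loop, without the progress messages.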

### Preprocessing of corpus (sentence splitting and tokenizing)

In [8]:
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

In [9]:
raw_sentences = tokenizer.tokenize(corpus_raw)

In [10]:
# Convert a sentence into a list of words: replace every character that is
# not a letter (digits, punctuation, hyphens) with a space, then split.
def sentence_to_wordlist(raw):
    clean = re.sub("[^a-zA-Z]"," ", raw)
    words = clean.split()
    return words

In [11]:
# Tokenize every non-empty sentence into a list of words.
sentences = []
for raw_sentence in raw_sentences:
    if len(raw_sentence) > 0:
        sentences.append(sentence_to_wordlist(raw_sentence))

In [12]:
print(raw_sentences[5])
print(sentence_to_wordlist(raw_sentences[5]))

Applications of game theory include a wide range of economic phenomena such as
auctions [17], oligopolies, social network formation [15], behavioral economics and political economics; just to name a few.
['Applications', 'of', 'game', 'theory', 'include', 'a', 'wide', 'range', 'of', 'economic', 'phenomena', 'such', 'as', 'auctions', 'oligopolies', 'social', 'network', 'formation', 'behavioral', 'economics', 'and', 'political', 'economics', 'just', 'to', 'name', 'a', 'few']
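To make the effect of the regular expression concrete, here is a small self-contained sketch. It repeats `sentence_to_wordlist` and adds an optional `lowercase` flag, which is my addition and not in the original; the sample sentence is made up. Without lowercasing, `Risk` and `risk` end up as separate vocabulary entries.

```python
import re


def sentence_to_wordlist(raw, lowercase=False):
    # Everything that is not an ASCII letter becomes a space, so citation
    # markers such as [15] vanish and hyphenated words split in two.
    clean = re.sub("[^a-zA-Z]", " ", raw)
    if lowercase:  # hypothetical option, not in the original notebook
        clean = clean.lower()
    return clean.split()


example = "Social network formation [15], risk-neutral pricing."
print(sentence_to_wordlist(example))
# ['Social', 'network', 'formation', 'risk', 'neutral', 'pricing']
print(sentence_to_wordlist(example, lowercase=True))
# ['social', 'network', 'formation', 'risk', 'neutral', 'pricing']
```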


In [13]:
token_count = sum([len(sentence) for sentence in sentences])
print("The text corpus contains {0:,} tokens".format(token_count))

The text corpus contains 1,929,328 tokens


### Assign the parameters for creating the model

In [14]:
# Dimensions of word vectors.
num_features = 100
# Minimum word count threshold.
min_word_count = 3

# Number of threads to run in parallel.
num_workers = multiprocessing.cpu_count()

# Context window length.
context_size = 7

# Downsampling threshold for frequent words.
# gensim documents (0, 1e-5) as the useful range; 1e-3 is its default.
downsampling = 1e-3

# Seed for the random number generator. A fixed seed makes runs repeatable,
# which helps when debugging. Fully deterministic results also need workers=1.
seed = 1
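The `downsampling` value is easier to judge with numbers. The sketch below computes the probability of keeping one occurrence of a word, using the formula from the original word2vec C code, which gensim follows as far as I can tell; treat it as an approximation of the library's behaviour rather than its exact implementation. The function name `keep_probability` is hypothetical.

```python
import math


def keep_probability(word_fraction, sample=1e-3):
    """Chance of keeping one occurrence of a word.

    word_fraction is the word's share of all tokens in the corpus;
    sample is the downsampling threshold (0 disables downsampling).
    """
    if sample <= 0 or word_fraction <= 0:
        return 1.0
    ratio = sample / word_fraction
    # Equivalent to (sqrt(f / sample) + 1) * (sample / f), capped at 1.
    return min(1.0, math.sqrt(ratio) + ratio)


for fraction in (0.05, 0.01, 0.001, 0.0001):
    print(fraction, round(keep_probability(fraction), 3))
```

With `sample = 1e-3`, a word making up 1% of the corpus is kept roughly 42% of the time, while words at or below 0.1% are always kept.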

In [15]:
model = w2v.Word2Vec(
    sg=1,
    seed=seed,
    workers=num_workers,
    size=num_features,  # gensim < 4.0 name; gensim 4.0+ calls this vector_size
    min_count=min_word_count,
    window=context_size,
    sample=downsampling
)

In [16]:
model.build_vocab(sentences)

2020-06-23 16:32:22,212 : INFO : collecting all words and their counts
2020-06-23 16:32:22,213 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-06-23 16:32:22,261 : INFO : PROGRESS: at sentence #10000, processed 183934 words, keeping 15182 word types
2020-06-23 16:32:22,316 : INFO : PROGRESS: at sentence #20000, processed 371982 words, keeping 22858 word types
2020-06-23 16:32:22,368 : INFO : PROGRESS: at sentence #30000, processed 549081 words, keeping 29158 word types
2020-06-23 16:32:22,417 : INFO : PROGRESS: at sentence #40000, processed 736285 words, keeping 33577 word types
2020-06-23 16:32:22,466 : INFO : PROGRESS: at sentence #50000, processed 920779 words, keeping 38366 word types
2020-06-23 16:32:22,522 : INFO : PROGRESS: at sentence #60000, processed 1108806 words, keeping 42633 word types
2020-06-23 16:32:22,581 : INFO : PROGRESS: at sentence #70000, processed 1292044 words, keeping 47437 word types
2020-06-23 16:32:22,643 : INFO : PROGRESS: a

In [17]:
# model.wv.vocab is the gensim < 4.0 API; gensim 4.0+ uses model.wv.key_to_index.
print("Word2Vec vocabulary length:", len(model.wv.vocab))

Word2Vec vocabulary length: 23740
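The vocabulary (23,740 words) is much smaller than the number of word types seen while scanning (the log above already shows 47,437 at sentence 70,000) because `min_word_count = 3` discards rare words. A minimal sketch of that pruning step, using a made-up toy corpus and a hypothetical helper name, could look like this:

```python
from collections import Counter


def prune_vocab(sentences, min_count=3):
    """Count every word and keep only those seen at least min_count times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}


toy_sentences = [
    ["risk", "model", "risk"],
    ["risk", "model", "volatility"],
    ["model", "Risk"],
]
print(prune_vocab(toy_sentences, min_count=3))
# {'risk': 3, 'model': 3}
```

This only illustrates the counting rule; gensim's `build_vocab` also prepares the sampling tables used during training.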


Training begins here. Run time grows with the number of dimensions, but anything under 1000 should be relatively quick.

In [18]:
model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)

2020-06-23 16:32:28,847 : INFO : training model with 8 workers on 23740 vocabulary and 100 features, using sg=1 hs=0 sample=0.001 negative=5 window=7
2020-06-23 16:32:29,874 : INFO : EPOCH 1 - PROGRESS: at 18.09% examples, 273795 words/s, in_qsize 14, out_qsize 1
2020-06-23 16:32:30,880 : INFO : EPOCH 1 - PROGRESS: at 39.95% examples, 298067 words/s, in_qsize 13, out_qsize 2
2020-06-23 16:32:31,902 : INFO : EPOCH 1 - PROGRESS: at 65.22% examples, 325217 words/s, in_qsize 16, out_qsize 0
2020-06-23 16:32:32,952 : INFO : EPOCH 1 - PROGRESS: at 89.76% examples, 332409 words/s, in_qsize 15, out_qsize 0
2020-06-23 16:32:33,258 : INFO : worker thread finished; awaiting finish of 7 more threads
2020-06-23 16:32:33,279 : INFO : worker thread finished; awaiting finish of 6 more threads
2020-06-23 16:32:33,301 : INFO : worker thread finished; awaiting finish of 5 more threads
2020-06-23 16:32:33,304 : INFO : worker thread finished; awaiting finis

(7599230, 9646640)
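The tuple returned by `train` can be checked against the numbers printed earlier. As far as I know, gensim returns (effective words trained, raw words seen); the second value matches the token count times the number of epochs, assuming gensim's default of 5 epochs, which the notebook never sets explicitly.

```python
token_count = 1929328        # from "The text corpus contains 1,929,328 tokens"
epochs = 5                   # assumption: gensim's default, not set in the notebook
effective_words = 7599230    # first value returned by model.train
raw_words = 9646640          # second value returned by model.train

# Every token is seen once per epoch.
assert token_count * epochs == raw_words

# Share of tokens that survived min_count pruning and downsampling.
kept_fraction = effective_words / raw_words
print("Kept about {0:.0%} of raw tokens".format(kept_fraction))
```

So roughly a fifth of the raw tokens were dropped, either as rare words or by downsampling frequent ones.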

In [19]:
# Optional sanity check: list the words closest to 'man'.
model.wv.most_similar('man')

2020-06-23 16:32:51,629 : INFO : precomputing L2-norms of word weight vectors


[('jan', 0.7993326187133789),
 ('babak', 0.7935115694999695),
 ('oxford', 0.7895066738128662),
 ('students', 0.7825140357017517),
 ('email', 0.7712308764457703),
 ('Corresponding', 0.7555925846099854),
 ('obloj', 0.7547800540924072),
 ('maths', 0.7536329030990601),
 ('ECGI', 0.7519477009773254),
 ('mahdavidamghani', 0.7492924332618713)]
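The scores above are cosine similarities between word vectors (hence the "precomputing L2-norms" log line). The following pure-Python sketch shows the idea on made-up two-dimensional vectors; the function names and toy values are hypothetical, and gensim's own implementation is vectorised with NumPy.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(word, vectors, topn=10):
    """Rank every other word by cosine similarity to `word`."""
    target = vectors[word]
    scores = [
        (other, cosine_similarity(target, vector))
        for other, vector in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


toy_vectors = {"risk": [1.0, 0.0], "volatility": [0.9, 0.1], "email": [0.0, 1.0]}
print(most_similar("risk", toy_vectors, topn=2))
```

In the real output, the neighbours of 'man' are mostly author names and title-page words such as 'email' and 'Corresponding', which suggests the word appears mainly in front-matter text in this corpus.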

In [20]:
model.wv.vocab

{'v': <gensim.models.keyedvectors.Vocab at 0x7efc99a838d0>,
 'cs': <gensim.models.keyedvectors.Vocab at 0x7efc99a83e10>,
 'GT': <gensim.models.keyedvectors.Vocab at 0x7efc99a83cd0>,
 'Nov': <gensim.models.keyedvectors.Vocab at 0x7efc99a832d0>,
 'ar': <gensim.models.keyedvectors.Vocab at 0x7efc99a83e50>,
 'X': <gensim.models.keyedvectors.Vocab at 0x7efc99a83d10>,
 'V': <gensim.models.keyedvectors.Vocab at 0x7efc99a83b50>,
 'A': <gensim.models.keyedvectors.Vocab at 0x7efc99a83e90>,
 'Bayesian': <gensim.models.keyedvectors.Vocab at 0x7efc99a83d50>,
 'optimization': <gensim.models.keyedvectors.Vocab at 0x7efc99a83fd0>,
 'approach': <gensim.models.keyedvectors.Vocab at 0x7efc99a83f90>,
 'to': <gensim.models.keyedvectors.Vocab at 0x7efc99a92050>,
 'compute': <gensim.models.keyedvectors.Vocab at 0x7efc99a92090>,
 'the': <gensim.models.keyedvectors.Vocab at 0x7efc99a92110>,
 'Nash': <gensim.models.keyedvectors.Vocab at 0x7efc99a92150>,
 'equilibria': <gensim.models.keyedvectors.Vocab at 0x7efc

Save the model in both .bin and .txt formats

In [21]:
model.wv.save_word2vec_format('model'+ str(num_features) + '.bin', binary=True)
model.wv.save_word2vec_format('model'+ str(num_features) + '.txt', binary=False)

2020-06-23 16:32:51,812 : INFO : storing 23740x100 projection weights into model100.bin
2020-06-23 16:32:51,954 : INFO : storing 23740x100 projection weights into model100.txt
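The .txt file is plain text: a header line with the vocabulary size and dimension count (here `23740 100`), then one line per word with the word followed by its vector values. A minimal reader sketch, assuming space-separated UTF-8 text and using a hypothetical function name, is:

```python
import io


def load_word2vec_text(path):
    """Read a word2vec text file into (vocab_size, dimensions, {word: vector})."""
    vectors = {}
    with io.open(path, "r", encoding="utf-8") as handle:
        # Header line: "<vocab_size> <dimensions>"
        vocab_size, dimensions = (int(x) for x in handle.readline().split())
        for line in handle:
            parts = line.rstrip().split(" ")
            if len(parts) != dimensions + 1:
                continue  # skip blank or malformed lines
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vocab_size, dimensions, vectors
```

For real use, `gensim.models.KeyedVectors.load_word2vec_format('model100.bin', binary=True)` loads either file back into gensim; the sketch is only meant to show what the text file contains.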
