## Calculate and save embeddings

Gensim expects its training input as a list of sentences, where each sentence is itself a list of tokenized words. We can build that fairly easily with a simple function and nltk tokenization. Word2Vec does not require stopword removal, since the most frequently occurring words are downsampled during training anyway.
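To make the downsampling concrete, here is a minimal sketch of the subsampling rule from the original word2vec tool, which Gensim is understood to follow with its default `sample=0.001`. The function name `keep_probability`, the toy counts and the corpus size are assumptions made up for illustration; they are not taken from our data.

```python
import math

def keep_probability(count, total_words, sample=1e-3):
    """Approximate probability of keeping one occurrence of a word.

    `count` is how often the word occurs, `total_words` is the corpus size.
    Words rarer than the threshold are always kept (probability 1.0).
    """
    threshold = sample * total_words
    probability = (math.sqrt(count / threshold) + 1) * (threshold / count)
    return min(probability, 1.0)

# Hypothetical counts in a hypothetical corpus of one million words
toy_counts = {'the': 60000, 'said': 5000, 'lantern': 12}
total = 1000000

for word, count in toy_counts.items():
    print(word, round(keep_probability(count, total), 3))
```

Under these assumptions a stopword like "the" keeps only a small fraction of its occurrences while a rare word is never discarded, which is why we skip explicit stopword removal.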

In [2]:
import nltk
import re
import gensim
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [3]:
# No need to remove stop words with full w2v - very frequent words get downsampled anyway
def text_to_sentence(file_path):
    """Read a text file and return a list of sentences, each a list of lowercase word tokens."""
    print('Processing '+file_path+'...')
    with open(file_path, 'r') as file:
        clean = file.read().lower()

    # Split into sentences first, while the punctuation is still present
    sentences = nltk.sent_tokenize(clean)
    print('Sentences tokenized. Tokenizing words...')
    # Replace everything except a-z with spaces, then split each sentence into words
    sentences = [re.sub('[^a-z]', ' ', sentence) for sentence in sentences]
    sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
    print('done\n')

    return sentences
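
As an illustration of the output shape, the sketch below produces the same list-of-lists structure on a two-sentence string. To keep it dependency-free it swaps in a simple regular-expression splitter for the nltk tokenizers, so `toy_text_to_sentences` is a hypothetical stand-in and will not match nltk on harder cases such as abbreviations.

```python
import re

def toy_text_to_sentences(text):
    """Return a list of sentences, each a list of lowercase word tokens."""
    # Split after ., ! or ? followed by whitespace (stand-in for nltk.sent_tokenize)
    raw_sentences = re.split(r'(?<=[.!?])\s+', text.lower().strip())
    # Replace everything except a-z with spaces, as in text_to_sentence
    cleaned = [re.sub('[^a-z]', ' ', sentence) for sentence in raw_sentences]
    # Whitespace split (stand-in for nltk.word_tokenize); skip empty sentences
    return [sentence.split() for sentence in cleaned if sentence.split()]

example = "The rain stopped. Nobody noticed, though!"
sentences = toy_text_to_sentences(example)
print(sentences)
# [['the', 'rain', 'stopped'], ['nobody', 'noticed', 'though']]
```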

In [4]:
female_text = text_to_sentence('../Data/Female_Authors.txt')
male_text = text_to_sentence('../Data/Male_Authors.txt')
movie_text = text_to_sentence('../Data/Movie_Scripts.txt')

Processing ../Data/Female_Authors.txt...
Sentences tokenized. Tokenizing words...
done

Processing ../Data/Male_Authors.txt...
Sentences tokenized. Tokenizing words...
done

Processing ../Data/Movie_Scripts.txt...
Sentences tokenized. Tokenizing words...
done



In [5]:
# The lyrics dataset needs special processing: the raw data is unpunctuated and does not follow
# a normal sentence structure. Instead, we create artificial sentences with a fixed length of
# 6 words, as an estimate of the length of a typical line of lyrics.

with open('./Data/Combined_lyrics_final.txt', 'r') as file:
    clean = file.read().lower()
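words = nltk.word_tokenize(clean)
# Strip non-letter characters from each token and drop tokens left empty,
# so that no token contains a space or consists only of punctuation
words = [re.sub('[^a-z]', '', word) for word in words]
words = [word for word in words if word]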
lyrics_text = [words[x:x+6] for x in range(0, len(words), 6)]
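
The fixed-length chunking can be sketched on a toy token list as follows. The helper name `chunk_words` and the placeholder tokens are assumptions for illustration only.

```python
def chunk_words(words, size=6):
    """Group a flat list of tokens into consecutive chunks of `size` tokens."""
    return [words[start:start + size] for start in range(0, len(words), size)]

# Placeholder tokens standing in for a tokenized lyrics file: w1 ... w14
tokens = ['w{}'.format(i) for i in range(1, 15)]
pseudo_sentences = chunk_words(tokens)

for sentence in pseudo_sentences:
    print(len(sentence), sentence)
```

With 14 tokens this gives two chunks of 6 and a final chunk of 2, so the last pseudo-sentence is usually shorter than the rest. The chunk boundaries are arbitrary, so some context windows will straddle real line breaks and others will be cut off mid-line.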

In [8]:
# Build the models (they are saved in a later cell). By default Word2Vec trains for 5 epochs.
# To train for longer, pass a larger `iter` value to the constructor or call model.train afterwards.

female_model = gensim.models.Word2Vec(female_text, size=100, window=5, min_count=2, workers=8)
male_model = gensim.models.Word2Vec(male_text, size=100, window=5, min_count=2, workers=8)
movie_model = gensim.models.Word2Vec(movie_text, size=100, window=5, min_count=2, workers=8)
lyrics_model = gensim.models.Word2Vec(lyrics_text, size=100, window=5, min_count=2, workers=8)



2019-05-28 18:12:33,657 : INFO : collecting all words and their counts
2019-05-28 18:12:33,659 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-05-28 18:12:33,727 : INFO : PROGRESS: at sentence #10000, processed 205157 words, keeping 8134 word types
2019-05-28 18:12:33,784 : INFO : PROGRESS: at sentence #20000, processed 405666 words, keeping 10477 word types
2019-05-28 18:12:33,851 : INFO : PROGRESS: at sentence #30000, processed 640469 words, keeping 13419 word types
2019-05-28 18:12:33,900 : INFO : PROGRESS: at sentence #40000, processed 804204 words, keeping 18522 word types
2019-05-28 18:12:33,959 : INFO : PROGRESS: at sentence #50000, processed 994638 words, keeping 20714 word types
2019-05-28 18:12:34,017 : INFO : PROGRESS: at sentence #60000, processed 1185192 words, keeping 23756 word types
2019-05-28 18:12:34,071 : INFO : PROGRESS: at sentence #70000, processed 1371462 words, keeping 26490 word types
2019-05-28 18:12:34,126 : INFO : PROGRESS: at

2019-05-28 18:12:44,253 : INFO : collecting all words and their counts
2019-05-28 18:12:44,253 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-05-28 18:12:44,307 : INFO : PROGRESS: at sentence #10000, processed 174860 words, keeping 11141 word types
2019-05-28 18:12:44,366 : INFO : PROGRESS: at sentence #20000, processed 367988 words, keeping 15332 word types
2019-05-28 18:12:44,419 : INFO : PROGRESS: at sentence #30000, processed 544829 words, keeping 18554 word types
2019-05-28 18:12:44,471 : INFO : PROGRESS: at sentence #40000, processed 721609 words, keeping 20118 word types
2019-05-28 18:12:44,527 : INFO : PROGRESS: at sentence #50000, processed 910107 words, keeping 21665 word types
2019-05-28 18:12:44,582 : INFO : PROGRESS: at sentence #60000, processed 1094806 words, keeping 23128 word types
2019-05-28 18:12:44,636 : INFO : PROGRESS: at sentence #70000, processed 1269560 words, keeping 24622 word types
2019-05-28 18:12:44,688 : INFO : PROGRESS: a

2019-05-28 18:12:56,050 : INFO : EPOCH 5 - PROGRESS: at 49.62% examples, 1642047 words/s, in_qsize 13, out_qsize 2
2019-05-28 18:12:57,052 : INFO : EPOCH 5 - PROGRESS: at 94.72% examples, 1615818 words/s, in_qsize 15, out_qsize 0
2019-05-28 18:12:57,108 : INFO : worker thread finished; awaiting finish of 7 more threads
2019-05-28 18:12:57,113 : INFO : worker thread finished; awaiting finish of 6 more threads
2019-05-28 18:12:57,119 : INFO : worker thread finished; awaiting finish of 5 more threads
2019-05-28 18:12:57,126 : INFO : worker thread finished; awaiting finish of 4 more threads
2019-05-28 18:12:57,129 : INFO : worker thread finished; awaiting finish of 3 more threads
2019-05-28 18:12:57,135 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-05-28 18:12:57,136 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-05-28 18:12:57,137 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-05-28 18:12:57,138 : INFO : EPOCH - 5

2019-05-28 18:12:59,048 : INFO : PROGRESS: at sentence #610000, processed 4859454 words, keeping 55097 word types
2019-05-28 18:12:59,082 : INFO : PROGRESS: at sentence #620000, processed 4940068 words, keeping 55369 word types
2019-05-28 18:12:59,116 : INFO : PROGRESS: at sentence #630000, processed 5030982 words, keeping 55668 word types
2019-05-28 18:12:59,144 : INFO : PROGRESS: at sentence #640000, processed 5099035 words, keeping 55957 word types
2019-05-28 18:12:59,179 : INFO : PROGRESS: at sentence #650000, processed 5193045 words, keeping 56191 word types
2019-05-28 18:12:59,208 : INFO : PROGRESS: at sentence #660000, processed 5265438 words, keeping 56341 word types
2019-05-28 18:12:59,241 : INFO : PROGRESS: at sentence #670000, processed 5347950 words, keeping 56987 word types
2019-05-28 18:12:59,274 : INFO : PROGRESS: at sentence #680000, processed 5429685 words, keeping 57363 word types
2019-05-28 18:12:59,309 : INFO : PROGRESS: at sentence #690000, processed 5519481 words,

2019-05-28 18:13:01,298 : INFO : min_count=2 retains 52523 unique words (69% of original 75099, drops 22576)
2019-05-28 18:13:01,299 : INFO : min_count=2 leaves 10418150 word corpus (99% of original 10440726, drops 22576)
2019-05-28 18:13:01,446 : INFO : deleting the raw counts dictionary of 75099 items
2019-05-28 18:13:01,449 : INFO : sample=0.001 downsamples 56 most-common words
2019-05-28 18:13:01,450 : INFO : downsampling leaves estimated 7987390 word corpus (76.7% of prior 10418150)
2019-05-28 18:13:01,593 : INFO : estimated required memory for 52523 words and 100 dimensions: 68279900 bytes
2019-05-28 18:13:01,594 : INFO : resetting layer weights
2019-05-28 18:13:02,105 : INFO : training model with 8 workers on 52523 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2019-05-28 18:13:03,128 : INFO : EPOCH 1 - PROGRESS: at 17.35% examples, 1368205 words/s, in_qsize 15, out_qsize 0
2019-05-28 18:13:04,131 : INFO : EPOCH 1 - PROGRESS: at 36.32% examples, 14

2019-05-28 18:13:29,391 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-05-28 18:13:29,420 : INFO : PROGRESS: at sentence #10000, processed 60000 words, keeping 4447 word types
2019-05-28 18:13:29,440 : INFO : PROGRESS: at sentence #20000, processed 120000 words, keeping 6870 word types
2019-05-28 18:13:29,461 : INFO : PROGRESS: at sentence #30000, processed 180000 words, keeping 9126 word types
2019-05-28 18:13:29,481 : INFO : PROGRESS: at sentence #40000, processed 240000 words, keeping 11075 word types
2019-05-28 18:13:29,502 : INFO : PROGRESS: at sentence #50000, processed 300000 words, keeping 12703 word types
2019-05-28 18:13:29,522 : INFO : PROGRESS: at sentence #60000, processed 360000 words, keeping 14195 word types
2019-05-28 18:13:29,543 : INFO : PROGRESS: at sentence #70000, processed 420000 words, keeping 15920 word types
2019-05-28 18:13:29,565 : INFO : PROGRESS: at sentence #80000, processed 480000 words, keeping 17563 word types
2019-05-2

2019-05-28 18:13:34,514 : INFO : worker thread finished; awaiting finish of 7 more threads
2019-05-28 18:13:34,517 : INFO : worker thread finished; awaiting finish of 6 more threads
2019-05-28 18:13:34,518 : INFO : worker thread finished; awaiting finish of 5 more threads
2019-05-28 18:13:34,518 : INFO : worker thread finished; awaiting finish of 4 more threads
2019-05-28 18:13:34,519 : INFO : worker thread finished; awaiting finish of 3 more threads
2019-05-28 18:13:34,521 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-05-28 18:13:34,524 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-05-28 18:13:34,526 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-05-28 18:13:34,526 : INFO : EPOCH - 3 : training on 2734842 raw words (1851294 effective words) took 1.3s, 1475503 effective words/s
2019-05-28 18:13:35,544 : INFO : EPOCH 4 - PROGRESS: at 76.39% examples, 1450851 words/s, in_qsize 15, out_qsize 0
2019-05-28 18:13:35
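
The `min_count=2` lines in the training log report how many word types and tokens the vocabulary pruning removed. The sketch below reproduces that bookkeeping on a toy corpus with a plain `Counter`. The helper `prune_vocab` is hypothetical and only approximates what Gensim does internally.

```python
from collections import Counter

def prune_vocab(sentences, min_count=2):
    """Count words and drop those occurring fewer than `min_count` times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    kept = {word: n for word, n in counts.items() if n >= min_count}
    dropped_types = len(counts) - len(kept)
    dropped_tokens = sum(counts.values()) - sum(kept.values())
    return kept, dropped_types, dropped_tokens

toy_sentences = [
    ['the', 'cat', 'sat'],
    ['the', 'dog', 'sat'],
    ['a', 'cat', 'ran'],
]
kept, dropped_types, dropped_tokens = prune_vocab(toy_sentences)
print(sorted(kept), dropped_types, dropped_tokens)
# ['cat', 'sat', 'the'] 3 3
```

With `min_count=2` every dropped word occurs exactly once, so the number of dropped types equals the number of dropped tokens. The log shows the same pattern, with 22576 reported for both.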

In [10]:
female_model.save('../Data/female_model.model')
male_model.save('../Data/male_model.model')
movie_model.save('../Data/movie_model.model')
lyrics_model.save('../Data/lyrics_model.model')

2019-05-28 18:15:16,301 : INFO : saving Word2Vec object under ../Data/female_model.model, separately None
2019-05-28 18:15:16,302 : INFO : not storing attribute vectors_norm
2019-05-28 18:15:16,303 : INFO : not storing attribute cum_table
2019-05-28 18:15:16,642 : INFO : saved ../Data/female_model.model
2019-05-28 18:15:16,643 : INFO : saving Word2Vec object under ../Data/male_model.model, separately None
2019-05-28 18:15:16,644 : INFO : not storing attribute vectors_norm
2019-05-28 18:15:16,644 : INFO : not storing attribute cum_table
2019-05-28 18:15:17,311 : INFO : saved ../Data/male_model.model
2019-05-28 18:15:17,312 : INFO : saving Word2Vec object under ../Data/movie_model.model, separately None
2019-05-28 18:15:17,313 : INFO : not storing attribute vectors_norm
2019-05-28 18:15:17,314 : INFO : not storing attribute cum_table
2019-05-28 18:15:18,173 : INFO : saved ../Data/movie_model.model
2019-05-28 18:15:18,174 : INFO : saving Word2Vec object under ../Data/lyrics_model.model, s

Now that the models are saved, we can load them later using ```model = gensim.models.Word2Vec.load('../Data/female_model.model')```  
or, if we only need the vectors (saved separately in the next cell), ```vectors = gensim.models.KeyedVectors.load('../Data/female_model.wv', mmap='r')```

In [31]:
# Also save the vectors only (to improve load time and memory usage if further training is not required)

female_model.wv.save('../Data/female_model.wv')
male_model.wv.save('../Data/male_model.wv')
movie_model.wv.save('../Data/movie_model.wv')
lyrics_model.wv.save('../Data/lyrics_model.wv')



2019-05-28 18:49:07,899 : INFO : saving Word2VecKeyedVectors object under ../Data/female_model.wv, separately None
2019-05-28 18:49:07,901 : INFO : not storing attribute vectors_norm
2019-05-28 18:49:08,180 : INFO : saved ../Data/female_model.wv
2019-05-28 18:49:08,181 : INFO : saving Word2VecKeyedVectors object under ../Data/male_model.wv, separately None
2019-05-28 18:49:08,182 : INFO : not storing attribute vectors_norm
2019-05-28 18:49:08,639 : INFO : saved ../Data/male_model.wv
2019-05-28 18:49:08,640 : INFO : saving Word2VecKeyedVectors object under ../Data/movie_model.wv, separately None
2019-05-28 18:49:08,641 : INFO : not storing attribute vectors_norm
2019-05-28 18:49:09,359 : INFO : saved ../Data/movie_model.wv
2019-05-28 18:49:09,360 : INFO : saving Word2VecKeyedVectors object under ../Data/lyrics_model.wv, separately None
2019-05-28 18:49:09,360 : INFO : not storing attribute vectors_norm
2019-05-28 18:49:09,600 : INFO : saved ../Data/lyrics_model.wv
