## Training a doc2vec model using Colaboratory

Google's Colaboratory runs code like a Jupyter Notebook, but in the cloud, so Google's machines do the heavy lifting.
In this notebook we will train a doc2vec model on Amazon Reviews, then use the resulting vectors in the Shared Perspectives framework.

We might have to install gensim, which provides the doc2vec implementation, in our notebook first. This can be done directly here!

In [0]:
!pip install gensim

Then we do our imports, mostly for gensim and natural language tools.

In [0]:
import os
import collections
import random
from random import shuffle
import gensim
from gensim import models
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import nltk
from nltk.tokenize import word_tokenize
import json
import time
import gzip
import pickle

Now we can upload the JSON file with the reviews, then parse it line by line, since each line holds one review as its own JSON object.

In [57]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

User uploaded file "Video_Games_5.json" with length 319473515 bytes


In [0]:
data_string = uploaded['Video_Games_5.json']

data = []
for line in data_string.splitlines():
  data.append(json.loads(line))
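
The line-by-line parsing above can be tried without uploading anything. The following is a minimal sketch, assuming the upload arrives as UTF-8 bytes; the two sample records and the helper name `parse_json_lines` are hypothetical and only mimic the fields used later in this notebook.

```python
import json

# Hypothetical two-line sample in the same one-object-per-line layout as the upload.
sample_bytes = (
    b'{"reviewerID": "A1", "asin": "B1", "reviewText": "Great game."}\n'
    b'{"reviewerID": "A2", "asin": "B2", "reviewText": "Too short."}\n'
)

def parse_json_lines(raw):
    """Decode raw bytes and parse each non-empty line as one JSON object."""
    text = raw.decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line.strip()]

records = parse_json_lines(sample_bytes)
print(len(records), records[0]["asin"])
```

Skipping blank lines is a small safety measure, since `json.loads` raises an error on an empty string.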

The gensim model requires reviews in the form of words and tags. Words are the separate tokens of the review (every word in it), and as the tag we use the ID of the reviewer joined with the ID of the product.
We only use reviews with more than 25 words. Amazon requires 20 words, and many of the reviews that are barely long enough are not very useful for our purposes, for example:
"Great product a a a a a"...until 25 words
"Came in time! f g h i j k "...you get the idea
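
The words-and-tags format and the length filter can be illustrated without gensim or nltk. This is only a sketch under stated assumptions: `TaggedDoc` is a stand-in namedtuple for gensim's `TaggedDocument`, a plain whitespace split replaces `word_tokenize`, and `to_tagged_docs` and `min_words` are hypothetical names.

```python
from collections import namedtuple

# Stand-in for gensim's TaggedDocument so the sketch runs without gensim.
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])

def to_tagged_docs(records, min_words=25):
    """Keep reviews with more than min_words tokens and tag each one."""
    docs = []
    for rec in records:
        # Whitespace split is an assumption; the notebook uses nltk's word_tokenize.
        words = rec["reviewText"].lower().split()
        if len(words) > min_words:
            tag = rec["reviewerID"] + "|" + rec["asin"]
            docs.append(TaggedDoc(words=words, tags=[tag]))
    return docs

records = [
    # Exactly 25 tokens of padding-style text: filtered out.
    {"reviewerID": "A1", "asin": "B1", "reviewText": "Great product " + "a " * 23},
    # 30 tokens: kept.
    {"reviewerID": "A2", "asin": "B2", "reviewText": "word " * 30},
]
docs = to_tagged_docs(records)
print([d.tags for d in docs])
```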

It may be necessary to download the nltk tokenizer first.

In [0]:
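# Tokenizer models used by word_tokenize; newer nltk releases may ask for 'punkt_tab' instead.
nltk.download('punkt')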

In [133]:
sentences = []
start_time = time.time()

for review in data:
  # Tokenize once, filter on length, then lowercase the tokens.
  tokens = word_tokenize(review['reviewText'])
  if len(tokens) > 25:
    sentence = models.doc2vec.TaggedDocument(
        words = [token.lower() for token in tokens],
        tags = [review['reviewerID']+'|'+review['asin']])
    sentences.append(sentence)
    
print(time.time()-start_time)

758.03997016


The full dataset has 231780 reviews. After excluding the ones with 25 words or fewer, we have:

In [134]:
len(sentences)

217181

Then we initialize the model with vectors of 100 dimensions, a window of 10 words, and a minimum count of 5 appearances for every word. For more information on these and other hyperparameters, visit the doc2vec GitHub page.

We also assert that the model holds the same number of vectors as there are sentences.

In [0]:
# Note: gensim 4.0 and later name this parameter vector_size instead of size.
model = models.Doc2Vec(sentences, size=100, window=10, min_count=5)
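
To make the minimum count concrete, here is a small sketch of the idea behind it: words that appear fewer than `min_count` times across the whole corpus are dropped from the vocabulary. It is not gensim's own code, and `prune_vocabulary` and the toy corpus are hypothetical.

```python
from collections import Counter

def prune_vocabulary(token_lists, min_count=5):
    """Return the set of words that occur at least min_count times overall."""
    counts = Counter(word for tokens in token_lists for word in tokens)
    return {word for word, n in counts.items() if n >= min_count}

# Toy corpus: "great" appears 5 times, "game" 6 times, "rare" once.
corpus = [["great", "game"]] * 5 + [["rare", "game"]]
vocab = prune_vocabulary(corpus, min_count=5)
print(sorted(vocab))
```

Rare words carry too few examples to learn a useful vector, which is the usual motivation for pruning them.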

In [0]:
assert len(model.docvecs) == len(sentences), "there are duplicate tags! {0} docvecs and {1} documents".format(len(model.docvecs), len(sentences))
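
If the assertion above fails, a likely cause is that two reviews share the same tag, because each distinct tag maps to a single vector. The sketch below shows one way to locate such tags; `find_duplicate_tags` and the sample tags are hypothetical.

```python
from collections import Counter

def find_duplicate_tags(tag_lists):
    """Return a dict mapping every tag seen more than once to its count."""
    counts = Counter(tag for tags in tag_lists for tag in tags)
    return {tag: n for tag, n in counts.items() if n > 1}

# Each inner list plays the role of one document's tags.
tag_lists = [["A1|B1"], ["A2|B2"], ["A1|B1"]]
duplicates = find_duplicate_tags(tag_lists)
print(duplicates)
```

In the notebook, the same check could be run on `[s.tags for s in sentences]`.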

In [137]:
start_time = time.time()
model.train(sentences, total_examples=model.corpus_count, epochs=5)
print(time.time()-start_time)

502.897390127


Done! We can save the model and download the document vectors.

In [0]:
model.save("videogames.doc2vec")

In [0]:
files.download('videogames.doc2vec.docvecs.doctag_syn0.npy')

We can also create a dictionary with the text of the reviews, and find some similar ones.

In [0]:
reviews = {}

for review in data:
  # Same key format as the tags used for training.
  review_id = review['reviewerID'] + '|' + review['asin']
  reviews[review_id] = review['reviewText']
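
The similarity lookup in the next cell ranks document vectors by cosine similarity to a query vector. As a rough illustration of that ranking, here is a pure-Python sketch with made-up two-dimensional vectors; the function names and data are hypothetical and this is not gensim's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(query, vectors, topn=10):
    """Return the topn (tag, similarity) pairs, most similar first."""
    scored = [(tag, cosine(query, vec)) for tag, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Made-up vectors keyed by reviewerID|asin style tags.
vectors = {
    "A1|B1": [1.0, 0.0],
    "A2|B2": [0.9, 0.1],
    "A3|B3": [0.0, 1.0],
}
ranked = most_similar(vectors["A1|B1"], vectors, topn=2)
print(ranked)
```

Because the query is itself one of the stored vectors, it comes back as the first hit with similarity 1, which is why the first result in the notebook output repeats the query review.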

In [143]:
# Pick the query among the documents the model was trained on (reviews also holds the short ones).
query_id = random.choice(sentences).tags[0]
sims = model.docvecs.most_similar(positive=[model.docvecs[query_id]], topn=10)

print('Top similar reviews to: ', reviews[query_id]+'\n')
for tag, score in sims:
    print('ID: ', tag, ' Review: ', reviews[tag]+'\n')

('Top similar reviews to: ', u'We ordered Spectrobes and Spectrobes beyond the Portals........... My grandson perfered &#34;Beyond the Portals&#34;.......and would recommend this one to anyone 11 years old or older.S. George\n')
('ID: ', u'A2TMG30R94E2BT|B000GABOTA', ' Review: ', u'We ordered Spectrobes and Spectrobes beyond the Portals........... My grandson perfered &#34;Beyond the Portals&#34;.......and would recommend this one to anyone 11 years old or older.S. George\n')
('ID: ', u'A2IKKPLUX8U7FW|B000UW21A0', ' Review: ', u'Having recently purchased my first gaming console (a PS3), this was just the game I was looking for - an electronic "RAIDERS OF THE LOST ARK" type of adventure, with action, puzzles, and exotic environments to explore (jungle, temple, etc.).  At around $20 (or less used), well worth the price.\n')
('ID: ', u'A12CBUR8QGQ5UP|B001C3N0OW', ' Review: ', u"just started playing this one. it's another addictive hide and seek type puzzle game. not very hard, but some of