<h1>Implicit Feature Extraction</h1>

This notebook walks through the extraction of implicit features from reviews, using Gensim's doc2vec
(more info: https://radimrehurek.com/gensim/models/doc2vec.html).

The goal is to obtain the Perceptual Tuple from every review for the Experience Items available.


<h2>Doc2vec</h2>

In this section, we train and test a doc2vec model on Amazon user reviews. Doc2vec is a neural network model that embeds each document (here, each review) as a vector of latent features in a high-dimensional space, for example 100 dimensions. These vectors will later be used to try to cluster the reviews by the features the users care about and by how they write about them.

In [189]:
from gensim import models
from nltk.tokenize import word_tokenize
import json
import multiprocessing
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import gensim.models.doc2vec
import numpy as np
from contextlib import contextmanager
import sys
import gzip
from collections import defaultdict
import random
from scipy import spatial

<h2>Pre-processing</h2>

The data was obtained as a json file of reviews from the UCSD website, http://jmcauley.ucsd.edu/data/amazon/

In [66]:
full_reviews = []
with gzip.open('data/reviews_Movies_and_TV_5.json.gz') as f:
    for line in f:
        full_reviews.append(json.loads(line))
print('Total reviews in dataset: ', len(full_reviews))

items = defaultdict(int)
 
for review in full_reviews:
    asin = review['asin']
    items[asin] += 1

print('Total movies and TV items in dataset: ', len(items))

Total reviews in dataset:  1697533
Total movies and TV items in dataset:  50052
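
The cell above loads every review into memory before counting. If only the per-item counts are needed, the file can be streamed instead. The following is a minimal sketch under that assumption; the function name `count_reviews_per_item` and the toy file are made up for illustration and are not part of the original notebook.

```python
import gzip
import json
import os
import tempfile
from collections import Counter


def count_reviews_per_item(path):
    """Stream a gzipped JSON-lines file and count reviews per 'asin'."""
    counts = Counter()
    # 'rt' opens the archive in text mode, so each line is a str
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        for line in f:
            counts[json.loads(line)['asin']] += 1
    return counts


# Tiny made-up file standing in for the real dataset (hypothetical values)
toy_rows = [
    {'asin': 'ITEM_A', 'reviewText': 'first'},
    {'asin': 'ITEM_A', 'reviewText': 'second'},
    {'asin': 'ITEM_B', 'reviewText': 'third'},
]
toy_path = os.path.join(tempfile.mkdtemp(), 'toy_reviews.json.gz')
with gzip.open(toy_path, 'wt', encoding='utf-8') as f:
    for row in toy_rows:
        f.write(json.dumps(row) + '\n')

item_counts = count_reviews_per_item(toy_path)
print('Total reviews:', sum(item_counts.values()))
print('Total items:', len(item_counts))
```

Only one parsed review is held in memory at a time, at the cost of reading the file again later when the review texts themselves are needed.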


So we have about 1.7 million reviews for around 50K movies and TV items from Amazon, and they look like this:

In [5]:
full_reviews[0]

{'asin': '0005019281',
 'helpful': [0, 0],
 'overall': 4.0,
 'reviewText': 'This is a charming version of the classic Dicken\'s tale.  Henry Winkler makes a good showing as the "Scrooge" character.  Even though you know what will happen this version has enough of a change to make it better that average.  If you love A Christmas Carol in any version, then you will love this.',
 'reviewTime': '02 26, 2008',
 'reviewerID': 'ADZPIG9QOCDG5',
 'reviewerName': 'Alice L. Larson "alice-loves-books"',
 'summary': 'good version of a classic',
 'unixReviewTime': 1203984000}

In [68]:
movies_200 = []
for key in items:
    if items[key] > 100 and items[key] < 300:
        if len(movies_200) < 200:
            movies_200.append(key)

In this example we use only the first 200 movies that have between 100 and 300 reviews, for easier processing.
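
The selection step can also be written as a small function that returns a set. This is a sketch, not the notebook's code: `select_items` and the toy counts are hypothetical, and the only intended difference is that a set makes the later `review['asin'] in ...` test a constant-time lookup on average instead of a scan over a list.

```python
from collections import Counter


def select_items(review_counts, low, high, limit):
    """Return up to `limit` item ids with strictly more than `low`
    and strictly fewer than `high` reviews, in iteration order."""
    selected = []
    for item_id, n in review_counts.items():
        if len(selected) >= limit:
            break
        if low < n < high:
            selected.append(item_id)
    return set(selected)


# Made-up counts; only B, C and E fall strictly between 100 and 300
toy_counts = Counter({'A': 50, 'B': 150, 'C': 299, 'D': 300, 'E': 101})
chosen = select_items(toy_counts, low=100, high=300, limit=2)
print(chosen)
```

With the notebook's variables, the equivalent call would be `select_items(items, 100, 300, 200)`.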

In [75]:
reviews = []
for review in full_reviews:
    if review['asin'] in movies_200:
        reviews.append(review)
        
print('Now we have', len(reviews), 'reviews, from', len(movies_200), 'items')

Now we have 31217 reviews, from 200 items


Next, the reviews have to be transformed into the 'TaggedDocument' format: a list of unicode word tokens (which form the "documents" of doc2vec) and a tag. In this case, the tag is created by concatenating the reviewer ID with the Amazon product ID, such as 'A3UF8X1S0ZZ8KR|B000WUVZCK'. The content of the tag has no effect on training; it only identifies the review. The text is lowercased and tokenized, and we only take into account reviews with more than 25 tokens. Since Amazon requires 20 words, a lot of the reviews that are barely long enough are not very useful for our purposes, for example: "Great product a a a a a"...until 25 words "Came in time! f g h i j k "...you get the idea.

Here we also create a dictionary with the labels as keys and text as value, for easier reading and qualitative analysis later.

It may be necessary to download the NLTK tokenizer data first. Run the download command and follow the instructions:

In [None]:
import nltk
nltk.download()

In [129]:
sentences = []
review_text = {}

for review in reviews:
    review_id = review['reviewerID'] + '|' + review['asin']
    # LabeledSentence is gensim's older name for TaggedDocument
    sentence = models.doc2vec.LabeledSentence(
        words=word_tokenize(review['reviewText'].lower()),
        tags=[review_id])
    # Keep only reviews with more than 25 tokens
    if len(sentence.words) > 25:
        sentences.append(sentence)
        review_text[review_id] = review['reviewText']
        
print('Total sentences introduced: ', len(sentences), 'for', len(review_text), 'reviews')

Total sentences introduced:  28944 for 28944 reviews
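
The tagging and length filter can be tried out without gensim or NLTK. The sketch below makes two simplifying assumptions: `ToyTaggedDocument` is a stand-in namedtuple with the same `words` and `tags` fields as gensim's class, and `str.split` replaces `word_tokenize`, so the token counts will differ from the real ones. The helper name `to_tagged` and the toy reviews are made up.

```python
from collections import namedtuple

# Stand-in with the same two fields as gensim's TaggedDocument
ToyTaggedDocument = namedtuple('ToyTaggedDocument', ['words', 'tags'])


def to_tagged(review, min_words):
    """Build a tagged document, or return None if the review
    does not have more than `min_words` tokens."""
    tag = review['reviewerID'] + '|' + review['asin']
    # str.split is a crude stand-in for nltk's word_tokenize
    words = review['reviewText'].lower().split()
    if len(words) <= min_words:
        return None
    return ToyTaggedDocument(words=words, tags=[tag])


toy_reviews = [
    {'reviewerID': 'USER_1', 'asin': 'ITEM_A', 'reviewText': 'Great movie'},
    {'reviewerID': 'USER_2', 'asin': 'ITEM_A',
     'reviewText': 'A long review with plenty of words to keep'},
]
docs = [d for d in (to_tagged(r, min_words=5) for r in toy_reviews)
        if d is not None]
print(len(docs), docs[0].tags)
```

The short review is dropped and the long one keeps its `reviewerID|asin` tag, which is how a vector is later traced back to its review.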


This is what a LabeledSentence looks like:

In [130]:
sentences[0]

LabeledSentence(words=['this', 'is', 'a', 'charming', 'version', 'of', 'the', 'classic', 'dicken', "'s", 'tale', '.', 'henry', 'winkler', 'makes', 'a', 'good', 'showing', 'as', 'the', '``', 'scrooge', "''", 'character', '.', 'even', 'though', 'you', 'know', 'what', 'will', 'happen', 'this', 'version', 'has', 'enough', 'of', 'a', 'change', 'to', 'make', 'it', 'better', 'that', 'average', '.', 'if', 'you', 'love', 'a', 'christmas', 'carol', 'in', 'any', 'version', ',', 'then', 'you', 'will', 'love', 'this', '.'], tags=['ADZPIG9QOCDG5|0005019281'])

Note: If a trained model has already been saved, skip to the model loading step. If not, continue here.

<h2>Training</h2>

Next we build the model. The parameters were tuned by trial and error, starting from the information in the paper by Le & Mikolov, ["Distributed Representations of Sentences and Documents"](http://cs.stanford.edu/~quocle/paragraph_vector.pdf), and from an example by gensim on IMDB, [doc2vec & IMDB](http://localhost:8888/notebooks/GitHub/gensim/docs/notebooks/doc2vec-IMDB.ipynb):

* `size` of 100-dimensional vectors, as the 400d vectors of the paper don't seem to offer much benefit on this task
* The `window` is kept at 10 since it showed good performance with documents of similar size
* Similarly, frequent-word subsampling (randomly dropping some occurrences of very common words) seems to decrease sentiment-prediction accuracy. Note, however, that the cell below does not set the `sample` parameter, so the library default applies (the printed model summary shows `s0.001`).
* `dm=0` selects PV-DBOW (distributed bag of words) mode, the document-level counterpart of word2vec's skip-gram. It is reported to be significantly faster than the Distributed Memory (DM) mode and about as accurate.
* `min_count=5` discards words that appear fewer than five times. This saves quite a bit of model memory, and such rare words are not useful for our Shared Perspective concepts.
* More `workers` allow faster processing when possible.
* `alpha` is the initial learning rate, which decreases linearly to `min_alpha` during training. The cell below sets neither, so the library defaults are used.
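
Two of the parameters above are easy to illustrate in isolation. The sketch below is not gensim's implementation: `prune_vocabulary` shows the effect of a minimum word count on a toy corpus, and `linear_alpha` shows a learning rate falling linearly from a start value to a floor. Both function names and all numeric values are chosen only for illustration.

```python
from collections import Counter


def prune_vocabulary(token_lists, min_count):
    """Keep only words whose total frequency is at least min_count."""
    freq = Counter(word for tokens in token_lists for word in tokens)
    return {word for word, n in freq.items() if n >= min_count}


def linear_alpha(start_alpha, min_alpha, progress):
    """Learning rate after a fraction `progress` (0.0 to 1.0) of training,
    assuming a straight-line decay from start_alpha to min_alpha."""
    return start_alpha - (start_alpha - min_alpha) * progress


toy_docs = [['good', 'movie'], ['good', 'plot'], ['good', 'movie', 'rare']]
vocab = prune_vocabulary(toy_docs, min_count=2)
print(sorted(vocab))

# Illustrative start and floor values, sampled at five points in training
for progress in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(progress, linear_alpha(0.025, 0.0001, progress))
```

In the toy corpus, 'plot' and 'rare' occur once each and are dropped, while 'good' and 'movie' survive the `min_count=2` threshold.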

In [131]:
assert gensim.models.doc2vec.FAST_VERSION > -1, "this will be painfully slow otherwise"

model = models.Doc2Vec(sentences, dm=0, size=100, window=10, min_count=5, workers=multiprocessing.cpu_count())

print(str(model))

Doc2Vec(dbow,d100,n5,mc5,s0.001,t8)


To be sure there are no overlapping tags, here we check whether the number of created vectors is the same as the number of documents put into the model.

In [132]:
assert len(model.docvecs) == len(sentences), "there are overlapping tags! {0} docvecs and {1} documents".format(len(model.docvecs), len(sentences))
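
If this assertion fails, it only reports the two totals. Documents that share a tag also share one vector, so it helps to know which tags collide. A minimal sketch for listing them follows; `duplicate_tags` is a hypothetical helper and the tags are made up.

```python
from collections import Counter


def duplicate_tags(tag_lists):
    """Return, sorted, the tags that occur on more than one document."""
    counts = Counter(tag for tags in tag_lists for tag in tags)
    return sorted(tag for tag, n in counts.items() if n > 1)


# Made-up tags: the same reviewer reviewed the same item twice
toy_tags = [['USER_1|ITEM_A'], ['USER_2|ITEM_A'], ['USER_1|ITEM_A']]
print(duplicate_tags(toy_tags))
```

With the notebook's variables, the call would be `duplicate_tags(s.tags for s in sentences)`.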

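Here we train the DBOW model for a set number of epochs, for example 10. Note that passing `sentences` to the constructor above already builds the vocabulary and runs a first round of training with the default settings, so this call continues training that model. For more detail on the hyper-parameters, visit the following website: https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/doc2vec.py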

In [133]:
model.train(sentences, total_examples=model.corpus_count, epochs=10)

41081398

Then we save (or load) the model for future use.

In [134]:
model.save("movies_dbow10epoch.doc2vec")

#model = Doc2Vec.load("movies_dbow10epoch.doc2vec")

<h2>Analysis</h2>

A.K.A. Playing with the model

First of all, let's see the vectors obtained from the model, or the *Perceptual Tuples.*

In [208]:
example = random.choice(list(review_text.keys()))

In [209]:
model.docvecs[example]

array([ 0.31564757, -0.25758761,  0.32343212,  0.11199896, -0.36150137,
       -0.30959255, -0.11408874, -0.23506793,  0.15536129,  0.00984454,
       -0.34564993, -0.64718872,  0.00950273,  0.18111324, -0.27767479,
       -0.07515359,  0.32019427, -0.32216209, -0.224976  ,  0.23496059,
        0.75499338,  0.13461852, -0.21235451,  0.4820191 , -0.10561399,
       -0.11321183,  0.40781179, -0.46380854, -0.15351984, -0.07887659,
       -0.4131723 ,  0.81180853,  0.00272181, -0.13923369, -0.08895218,
        0.82202286, -0.08987688,  0.20316355, -0.18511321,  0.11422019,
        0.49718273,  0.5551607 ,  0.06297428, -0.15935303, -0.8181302 ,
        0.15749055,  0.24855907,  0.27744496, -0.39455798, -0.72817761,
       -0.05193769, -0.10428967,  0.1791677 ,  0.39620438,  0.33176005,
        0.26038209, -0.08656421,  0.4592309 , -0.28824422, -0.07417981,
       -0.04835309, -0.09215254, -0.16211525,  0.50178498,  0.28218919,
        0.15133481,  1.07339537,  0.45725924,  0.00947233,  0.16

This Perceptual Tuple consists of the values for 100 *Perceptual Features*. They are implicit, so we cannot understand or interpret their meaning.

It would be awesome if after training we could check the values for reviews that mention great action and find what PFs they have in common. Or reviews that talk about great humor and compare their tuples to the ones saying that the movie was boring, and find if there are attributes high for the first and low for the second.

Sadly, whether a given value is -0.25758761 or 0.32343212 makes no sense to us humans on its own.

Anyway, we can start by testing the similarity of a document with itself, as a naive sanity check. First by using the similarity measure implemented by Gensim (should be 1), second by computing the cosine distance between the vectors using scipy (should be 0).

In [214]:
print(model.docvecs.similarity(d1='ADZPIG9QOCDG5|0005019281', d2='ADZPIG9QOCDG5|0005019281'))

print(round(spatial.distance.cosine(model.docvecs['ADZPIG9QOCDG5|0005019281'], model.docvecs['ADZPIG9QOCDG5|0005019281']), 2))

1.0
0.0
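
The two measures used in this check are complementary: cosine distance is one minus cosine similarity. The sketch below spells both out in plain Python on short made-up vectors; the function names are our own and stand in for the Gensim and scipy calls above.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """One minus the cosine similarity."""
    return 1.0 - cosine_similarity(u, v)


vec = [0.3, -0.25, 0.32]  # made-up 3-dimensional "tuple"
print(round(cosine_similarity(vec, vec), 2))
print(round(cosine_distance(vec, vec), 2))
print(round(cosine_distance([1.0, 0.0], [0.0, 1.0]), 2))
```

A vector compared with itself gives similarity 1 and distance 0, while two orthogonal vectors give distance 1. Both measures ignore vector length and depend only on direction.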


Here we can input the vector of a given review and obtain the most similar reviews.

In [162]:
sims = model.docvecs.most_similar(positive=[model.docvecs[example]], topn=4)

print('Top  similar reviews to:', review_text[example])
for review in sims:
    print('ID: ', review[0], ' Review: ', review_text[review[0]]+'\n')

Top  similar reviews to: I was dressed up at my school as freddy krueger yesterday (for my halloween dance) and then i watched this movie. Theis has to be one of the best of the Nightmare. My favorite part is wehn the Freddipilliar appears and eats the little girl. Its awesome. WATCH THIS AWESOME SH**.
ID:  A39W3263A9HCMN|0780630866  Review:  I was dressed up at my school as freddy krueger yesterday (for my halloween dance) and then i watched this movie. Theis has to be one of the best of the Nightmare. My favorite part is wehn the Freddipilliar appears and eats the little girl. Its awesome. WATCH THIS AWESOME SH**.

ID:  A3DEO4BBK4TQ1Q|0780619412  Review:  THIS IS ON MY TOP TEN FAVORITE MOVIES OF ALL TIME. THE FIST TIME I SAW IT I DIDN'T SLEEP FOR A WEEK. FREDDY WITHOUT HUMOR IS SCARY. DEFINITLY THE BEST NIGHTMARE FILM EVER!

ID:  A2XU709F7V64T|0780619412  Review:  ok,i aint a big fan of freddy's movies,but id say this is the best one and scariest,i liked part 3 alot too,check this mo
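
The query review comes back as its own top hit because the ranking is by cosine similarity and the query vector is itself one of the stored vectors. A brute-force sketch of such a ranking, assuming a plain dictionary of tag to vector, is shown below; `most_similar_tags` and the toy vectors are made up, and a real model does this far more efficiently.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar_tags(query, vectors, topn):
    """Rank every stored vector by cosine similarity to `query`
    and return the `topn` best (tag, similarity) pairs."""
    scored = [(tag, cosine_similarity(query, vec))
              for tag, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Made-up 2-dimensional vectors keyed by 'reviewerID|asin' tags
toy_vectors = {
    'USER_1|ITEM_A': [1.0, 0.0],
    'USER_2|ITEM_A': [0.9, 0.1],
    'USER_3|ITEM_B': [0.0, 1.0],
}
top = most_similar_tags(toy_vectors['USER_1|ITEM_A'], toy_vectors, topn=2)
for tag, score in top:
    print(tag, round(score, 3))
```

To skip the self-match, request one extra result and drop the entry whose tag equals the query's tag.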

We can also find only the most similar reviews for the same item.

In [175]:
def same_product_similars(review_id, number):
    # Take the 100 nearest reviews overall, then keep those for the same item
    sims = model.docvecs.most_similar(positive=[review_id], topn=100)
    print('Top similar reviews to: ', review_text[review_id]+'\n')
    target_asin = review_id.split('|', 1)[1]
    shown = 0
    for sim_id, _score in sims:
        if shown >= number:
            break
        if sim_id.split('|', 1)[1] == target_asin:
            print('ID: ', sim_id, ' Review: ', review_text[sim_id]+'\n')
            shown += 1

In [176]:
same_product_similars('A39W3263A9HCMN|0780630866', 5)

Top similar reviews to:  I was dressed up at my school as freddy krueger yesterday (for my halloween dance) and then i watched this movie. Theis has to be one of the best of the Nightmare. My favorite part is wehn the Freddipilliar appears and eats the little girl. Its awesome. WATCH THIS AWESOME SH**.

ID:  A3VHYPCUXD7VHT|0780630866  Review:  Freddy Krueger is my favorite horror icon of all time! He is fun, and scary, but not too over the top when it comes to gore. My personal favorites are Nightmare 1, Nightmare 3: Dream Warriors, Wes Craven's New Nightmare, & Freddy vs. Jason. But, anything with Freddy in it is AWESOME!!! Highly recommended!

ID:  A2940X5L71GK3U|0780630866  Review:  I like scary movies (Horror, Thriller, Suspense). This may be one of the best horror movies ever. This was cool and it was one of four scary movies  that actually scared me. The other ones where Thinner, The Shinning, and  The Sixth Sense. Out of all the Halloween movies, Friday the 13ths, and  Nightmare

Gensim can also infer the vector of a given piece of text (tokenized like the TaggedDocuments above) and test how similar it is to others. This is important for getting vectors for new reviews that were not in the training set of the model. In this case we try it with a review that *is* in the dataset, to see which ones are the most similar. The first result *should* be the review itself.

A higher number of steps (iterations) in the inference process tends to bring the inferred vector closer to the trained one, up to a point, while making it less similar to the others.

In [195]:
# Lowercase the text so it matches the preprocessing used for training
inferred_docvec = model.infer_vector(word_tokenize(review_text[example].lower()), steps=5000, alpha=0.01)
sims = model.docvecs.most_similar(positive=[inferred_docvec], topn=5)

print('Top  similar reviews to:', review_text[example]+'\n')
for review in sims:
    print('ID: ', review[0], ' Review: ', review_text[review[0]]+'\n')

Top  similar reviews to: I was dressed up at my school as freddy krueger yesterday (for my halloween dance) and then i watched this movie. Theis has to be one of the best of the Nightmare. My favorite part is wehn the Freddipilliar appears and eats the little girl. Its awesome. WATCH THIS AWESOME SH**.

ID:  A39W3263A9HCMN|0780630866  Review:  I was dressed up at my school as freddy krueger yesterday (for my halloween dance) and then i watched this movie. Theis has to be one of the best of the Nightmare. My favorite part is wehn the Freddipilliar appears and eats the little girl. Its awesome. WATCH THIS AWESOME SH**.

ID:  A1M82OE9TB0RQ0|0780630874  Review:  Ok. This is like, the best of the Freddy flicks. Dream Child used to be the best to me, then I started watchin this one a lot more. The F/X are great! The cockroach scene is kool. And the song while the main titles are rolling is pretty kool, too. And Lisa Zane is the best female survivor I've seen in the more major horror flicks (

Now we can evaluate the cosine distance between the vector and its own word-based inference.
We can experiment with some values for the number of steps or alpha.

In [206]:
n_steps = [100, 1000, 10000]
alphas = [0.1, 0.01, 0.001]

for trial in n_steps:
    for alpha_value in alphas:
        inferred_docvec = model.infer_vector(word_tokenize(review_text[example].lower()), steps = trial, alpha = alpha_value) 
        print(trial, alpha_value, round(spatial.distance.cosine(inferred_docvec, model.docvecs[example]), 2))

100 0.1 0.1
100 0.01 0.07
100 0.001 0.14
1000 0.1 0.05
1000 0.01 0.02
1000 0.001 0.02
10000 0.1 0.09
10000 0.01 0.09
10000 0.001 0.09


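It seems that 1000 steps with alpha = 0.01 or alpha = 0.001 yield the most similar vectors (cosine distance 0.02), while 10000 steps do worse for every alpha. This is a single review and inference is randomized, so the result is only indicative, but we could use 1000 steps with alpha = 0.01 in future parts of the process, if necessary.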
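
Reading the best setting off the printed grid can be automated. The sketch below copies the distances from the output above into a dictionary and reports every setting that reaches the minimum, so ties stay visible; the variable names are our own.

```python
# Cosine distances copied from the output above: (steps, alpha) -> distance
results = {
    (100, 0.1): 0.10, (100, 0.01): 0.07, (100, 0.001): 0.14,
    (1000, 0.1): 0.05, (1000, 0.01): 0.02, (1000, 0.001): 0.02,
    (10000, 0.1): 0.09, (10000, 0.01): 0.09, (10000, 0.001): 0.09,
}

best_distance = min(results.values())
# Collect every setting that reaches the minimum, not just the first one
best_settings = sorted(key for key, dist in results.items()
                       if dist == best_distance)
print(best_distance, best_settings)
```

Because the distances are rounded to two decimals and come from one review, settings that tie here may still differ slightly on other reviews.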

Now we have a trained doc2vec model on movie reviews, and we can evaluate similarity of the review texts, find the most similar ones, etc.

The next step is to group similar perceptual tuples into Shared Perspectives.