# Sense2Vec with Star Wars Reviews

Here I'll be running Sense2Vec on the Star Wars reviews that I have collected from IMDb and Rotten Tomatoes. This notebook is loosely based on the article 'Sense2vec with spaCy and Gensim', found at https://explosion.ai/blog/sense2vec-with-spacy

In [1]:
import os, re
import numpy as np
import pandas as pd

import spacy
import nltk

import matplotlib
import matplotlib.pyplot as plt
matplotlib.style.use('ggplot')
%matplotlib inline

## Load the data

In [3]:
imdb_data = pd.read_csv("../data/imdb_data.csv")
rt_data_1 = pd.read_csv("../rotten_tomatoes/reviews.csv")
rt_data_2 = pd.read_csv("../rotten_tomatoes/reviews.csv")

rt_data = pd.concat([rt_data_1, rt_data_2]).drop_duplicates()

imdb_data['star_rating'] = imdb_data.star_rating / 2
imdb_data['source'] = 'imdb'
rt_data['source'] = 'rotten tomatoes'

data_cols = ['date', 'name', 'user_link', 'source', 'review', 'downvotes', 'upvotes', 'star_rating']
review_data = pd.concat([imdb_data, rt_data])[data_cols]

## Using spaCy

In [4]:
import spacy

In [5]:
nlp = spacy.load("en")

In [60]:
def transform_texts(texts, nlp):
    # The annotation models are already loaded and passed in as `nlp`.
    # Stream texts through the models. nlp.pipe accumulates a buffer and
    # releases the GIL around the parser, for efficient multi-threading.
    skip_count = 0
    for i, doc in enumerate(nlp.pipe(texts, n_threads = 4)):
        print("\rProcessing %d out of %-30d" % (i, len(texts)), end = "")
        
        # Iterate over base NPs, e.g. "all their good ideas"
        for np in list(doc.noun_chunks):
            # Only keep adjectives and nouns, e.g. "good ideas"
            while np.end - np.start > 1 and np[0].dep_ not in ('amod', 'compound'):
                np = np[1:]
            if np.end - np.start > 1:
                # Merge the tokens, e.g. good_ideas
                np.merge(tag = np.root.tag_, lemma = np.text, ent_type = np.root.ent_type_)
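            # Iterate over named entities. Note that this loop runs inside the
            # noun-chunk loop, so entities are merged while later chunks are pending.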
            for ent in doc.ents:
                if len(ent) > 1:
                    # Merge them into single tokens
                    ent.merge(tag = ent.root.tag_, lemma = ent.text, ent_type = ent.label_)
        token_strings = []
        for token in doc:
            text = token.text.replace(' ', '_')
            tag = token.ent_type_ or token.pos_
            token_strings.append('%s|%s' % (text, tag))
        yield ' '.join(token_strings)
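
To make the output format of `transform_texts` concrete, here is a minimal sketch of just the final formatting step, run on hand-made tokens instead of a real spaCy document. `FakeToken` and `format_tokens` are hypothetical names introduced only for this illustration, and the tags and entity labels are invented examples, not output from the model.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy token, with only the fields used below.
FakeToken = namedtuple("FakeToken", ["text", "pos_", "ent_type_"])

def format_tokens(tokens):
    """Join tokens as text|TAG, preferring the entity label over the POS tag."""
    parts = []
    for token in tokens:
        # Merged tokens may contain spaces, so replace them with underscores.
        text = token.text.replace(" ", "_")
        tag = token.ent_type_ or token.pos_
        parts.append("%s|%s" % (text, tag))
    return " ".join(parts)

# Invented example values, as if "Star Wars" and "good ideas" had been merged.
example = [
    FakeToken("Star Wars", "PROPN", "WORK_OF_ART"),
    FakeToken("has", "VERB", ""),
    FakeToken("good ideas", "NOUN", ""),
]
print(format_tokens(example))
```

Each review therefore becomes one line of space-separated `text|TAG` items, where a merged phrase counts as a single item.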

In [61]:
nlp_reviews = list(transform_texts(review_data.review, nlp))

Processing 5 out of 4368                          

IndexError: Error calculating span: Can't find end

In [68]:
help(np.merge)

Help on built-in function merge:

merge(...) method of spacy.tokens.span.Span instance
    Retokenize the document, such that the span is merged into a single
    token.
    
    **attributes: Attributes to assign to the merged token. By default,
        attributes are inherited from the syntactic root token of the span.
    RETURNS (Token): The newly merged token.
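
Since `merge` retokenizes the document, my guess is that the `IndexError` comes from spans that were computed before a merge and no longer line up with token boundaries afterwards, for example a noun chunk that overlaps a named entity. The sketch below illustrates the general idea on plain Python lists rather than on spaCy objects: drop overlapping spans first, then merge from right to left so earlier indices stay valid. `filter_overlapping` and `merge_spans` are hypothetical helpers, and the token indices are made up for the example.

```python
def filter_overlapping(spans):
    """Keep the longest spans and drop any span overlapping one already kept."""
    kept = []
    taken = set()
    # Longest first; ties are broken in favour of the earlier span.
    for start, end in sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True):
        if not any(i in taken for i in range(start, end)):
            kept.append((start, end))
            taken.update(range(start, end))
    return sorted(kept)

def merge_spans(tokens, spans):
    """Merge each (start, end) token span into one underscore-joined token."""
    merged = list(tokens)
    # Work right to left so that indices of earlier spans are not shifted.
    for start, end in sorted(filter_overlapping(spans), reverse=True):
        merged[start:end] = ["_".join(merged[start:end])]
    return merged

tokens = ["JJ", "Abrams", "made", "the", "best", "Star", "Wars", "movie"]
# Made-up spans: an entity (0, 2), a noun chunk (4, 8) and an entity (5, 7)
# that sits inside the noun chunk.
spans = [(0, 2), (4, 8), (5, 7)]
print(merge_spans(tokens, spans))
```

As far as I know, newer spaCy releases cover the same ground with the `doc.retokenize()` context manager and `spacy.util.filter_spans`, but I have not checked that against the version used in this notebook.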



In [132]:
print(doc)

With thunderous laughter, and disappointment.I have no idea how this made it through scripting, story boards, and early screenings. With no one once standing up and saying "What the hell is this crap? This isn't Star Wars! This comes off as a Comedy Central special, and no Star Wars fan will ever accept this as an acceptable story"I'm sure some youngster, that has never seen another Star Wars movie, is probably going to watch this, and think it's the best movie they have ever seen. But they will grow up, and go back and watch the real story from Episode 1 through to Episode 7, then get back to this one and ask themselves "What Happened".Not the worst movie of all time, but in my opinion, not worthy of existing in the Star Wars universe. This should just be the end of a 2 part arc. Wow... just wow. JJ has a lot of work to do if he's going to try to somehow put all the tooth paste back in the tube after this mess.


In [131]:
for ent in doc.ents:
    print(ent)

Comedy Central
story"I'm
Star Wars
Episode 1
Episode 7
Star Wars
2
JJ
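
The entity `story"I'm` suggests that the scraped review lost the whitespace between sentences (the printed text also contains `disappointment.I` and `"What Happened".Not`), which seems to confuse the tokenizer. Below is a minimal sketch of one possible clean-up step to apply before calling `nlp.pipe`. `add_missing_spaces` is a hypothetical helper, and the regular expressions are rough heuristics that will miss some cases.

```python
import re

# Sentence-ending punctuation glued to a capitalised word, e.g. "disappointment.I".
# Requiring a lowercase letter or a quote before it leaves "U.S.A" untouched.
_PUNCT_GLUE = re.compile(r'(?<=[a-z"])([.!?])(?=[A-Z])')
# Closing quote glued to a capitalised word, e.g. 'story"I'. An opening quote
# is preceded by a space, so it does not match.
_QUOTE_GLUE = re.compile(r'(?<=[a-z.!?,])(")(?=[A-Z])')

def add_missing_spaces(text):
    """Insert a space where a sentence boundary appears to be missing."""
    text = _PUNCT_GLUE.sub(r"\1 ", text)
    text = _QUOTE_GLUE.sub(r"\1 ", text)
    return text

print(add_missing_spaces('an acceptable story"I\'m sure'))
print(add_missing_spaces('ask themselves "What Happened".Not the worst'))
```

Whether this is enough to stop the merge error is something I would still have to check by re-running the pipeline on the cleaned reviews.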
