## How can a comment be enhanced to encourage trending?

1. Get a list of the top 10 words in the trending bucket.
2. The user enters a comment.
3. The comment is scored.
4. Determine which of the top 10 words are missing from the comment, and re-score the comment with each missing word added.
5. Output the comment's score, and what the comment would be if word 'x' were added.
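
The five steps above can be sketched as follows. This is a minimal sketch, not the notebook's implementation: `top_words`, `suggest_additions`, and `make_toy_scorer` are hypothetical names, and the toy scorer only stands in for the trained model.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"\b[a-z][a-z]+\b")


def top_words(trending_reviews, n=10):
    """Return the n most frequent lowercase words across trending reviews."""
    counts = Counter()
    for review in trending_reviews:
        counts.update(WORD_RE.findall(review.lower()))
    return [word for word, _ in counts.most_common(n)]


def suggest_additions(comment, trending_top, score_fn):
    """Score the comment, then re-score it with each missing top word appended."""
    present = set(WORD_RE.findall(comment.lower()))
    base = score_fn(comment)
    suggestions = []
    for word in trending_top:
        if word not in present:
            candidate = comment + " " + word
            suggestions.append((word, candidate, score_fn(candidate)))
    return base, suggestions


def make_toy_scorer(trending_top):
    """Hypothetical stand-in scorer: fraction of top words present in the text."""
    def score(text):
        tokens = set(WORD_RE.findall(text.lower()))
        return sum(w in tokens for w in trending_top) / max(len(trending_top), 1)
    return score
```

In practice `score_fn` would wrap the fitted transform-and-predict pipeline rather than the toy scorer.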

word2vec-style word vectors could be used to find similar words by cosine similarity.
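
For reference, the cosine similarity of two word vectors u and v is their dot product divided by the product of their norms. Below is a minimal sketch with made-up three-dimensional vectors. Real word2vec or spaCy vectors have hundreds of dimensions, and the `vectors` dict and both helper names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def nearest_words(word, vectors, k=3):
    """Rank the other words in `vectors` by cosine similarity to `word`."""
    target = vectors[word]
    scored = [(other, cosine_similarity(target, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up toy vectors, for illustration only.
vectors = {
    "toy": [1.0, 0.0, 0.0],
    "game": [0.9, 0.1, 0.0],
    "price": [0.0, 1.0, 0.0],
}
```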

Process:
1. The user enters a comment.
2. Tokenize the comment.
3. For each token, find the most similar word in the corpus of trending-product reviews.
4. Generate additional comments by swapping out one word at a time.
5. Transform the comments (count vector -> LDA -> log).
6. Predict on the comments (SMOTE -> XGB).
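
Step 4 of the process, generating variants that differ from the original by one word, might look like the sketch below. The `similar` lookup is a hypothetical stand-in for the nearest trending words found in step 3, and the function name is likewise illustrative.

```python
def swap_one_word(tokens, similar):
    """Yield (index, replacement, new_tokens) for every single-word swap.

    tokens  -- tokenized comment, e.g. ['toy', 'amazing']
    similar -- dict mapping a token to a list of replacement candidates
    """
    for i, token in enumerate(tokens):
        for replacement in similar.get(token, []):
            if replacement != token:
                yield i, replacement, tokens[:i] + [replacement] + tokens[i + 1:]


tokens = ['toy', 'amazing', 'much', 'worth', 'bucks']
# Hypothetical candidates; in practice these would come from step 3.
similar = {'amazing': ['awesome'], 'bucks': ['money', 'price']}
variants = [' '.join(new) for _, _, new in swap_one_word(tokens, similar)]
```

Each variant would then pass through steps 5 and 6 so that its predicted score can be compared with that of the original comment.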

In [4]:
import spacy

import numpy as np
import pandas as pd
import dill

import re
from nltk import SnowballStemmer
from nltk.corpus import stopwords

stemmer = SnowballStemmer('english')

import matplotlib.pyplot as plt
%matplotlib inline

%config InlineBackend.figure_format = 'svg'

MODELING_PATH = '../data/modeling/'
PATH = '../data/amazon_reviews_us_Toys_v1_00.tsv'

In [5]:
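# save progress
def save(obj, obj_name):
    """Serialize obj with dill to MODELING_PATH + obj_name."""
    with open(MODELING_PATH + obj_name, 'wb') as f:
        dill.dump(obj, f)

def load(obj_name):
    """Load a dill-serialized object from MODELING_PATH + obj_name."""
    with open(MODELING_PATH + obj_name, 'rb') as f:
        return dill.load(f)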

In [26]:
from AmazonReviews import AmazonReviews

ar = AmazonReviews()
ar.load_data(PATH)
ar.calc_trend_score()
ar.create_observations()

Read from pickle...


In [3]:
# enter a comment
comment = 'This toy is amazing! So much worth the bucks!!'

In [9]:
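STOP_WORDS = set(stopwords.words('english'))  # built once, not per token

def token_comment(comment):
    # lowercase first so capitalized words such as 'This' match [a-z]
    comment_token = re.findall(r'\b[a-z][a-z]+\b', comment.lower())
    return [w for w in comment_token if w not in STOP_WORDS]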

In [13]:
comment_tokenized = token_comment(comment)

In [6]:
nlp = spacy.load('en_core_web_lg')

In [7]:
print(nlp.vocab[u'dog'].similarity(nlp.vocab[u'dachshund']))

0.62467307


In [8]:
def most_similar(word):
    # rank every lexeme in the vocabulary by similarity to `word` (slow on large vocabularies)
    by_similarity = sorted(word.vocab, key=lambda w: word.similarity(w), reverse=True)
    return [w.orth_ for w in by_similarity[:10]]

In [21]:
nlp.vocab

filtered_words = [w for w in nlp.vocab if w.prob >= -15]
filtered_words = set(filtered_words)

In [25]:
filtered_words

{<spacy.lexeme.Lexeme at 0x1019c1048>,
 <spacy.lexeme.Lexeme at 0x1018a7510>,
 <spacy.lexeme.Lexeme at 0x1019fbb40>,
 ...

In [19]:
def get_related(word):
    # replace word.vocab with the set of words in the trending review corpus
    filtered_words = [w for w in word.vocab if w.is_lower == word.is_lower and w.prob >= -15]
    # rank candidates by similarity to `word`, most similar first
    similarity = sorted(filtered_words, key=lambda w: word.similarity(w), reverse=True)
    return similarity[:10]

print([w.lower_ for w in get_related(nlp.vocab[u'plane'])])

In [31]:
# build the corpus of all reviews that have trended
review_corpus = ' '.join(ar.obs[ar.obs.trend == 1].review_body)

In [33]:
trending_words = nlp(review_corpus)

KeyboardInterrupt: 
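
Parsing the whole concatenated corpus in a single `nlp(...)` call is expensive, which is presumably why the cell above was interrupted. Since similarity lookups only need the set of distinct words, one possible workaround is to collect the vocabulary with a regular expression first. A sketch under that assumption (the function name and the `min_count` threshold are hypothetical):

```python
import re
from collections import Counter

WORD_RE = re.compile(r"\b[a-z][a-z]+\b")


def trending_vocabulary(reviews, stop_words=frozenset(), min_count=1):
    """Collect distinct lowercase words from an iterable of review strings."""
    counts = Counter()
    for review in reviews:
        if not isinstance(review, str):
            continue  # skip missing values such as None or NaN
        counts.update(WORD_RE.findall(review.lower()))
    return {w for w, c in counts.items()
            if c >= min_count and w not in stop_words}
```

With the notebook's objects this could be called on `ar.obs[ar.obs.trend == 1].review_body`, and each resulting word looked up as `nlp.vocab[word]` to obtain a lexeme for `similarity`.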

'toy amazing much worth bucks'

In [24]:
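# `d` is not defined in the cells shown; assumed here to be the parsed tokenized comment
d = nlp(' '.join(comment_tokenized))
for token1 in d:
    for token2 in d:
        if token1 != token2:
            print(token1.text, token2.text, token1.similarity(token2))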

toy amazing 0.3016522
toy much 0.25184816
toy worth 0.21597828
toy bucks 0.24880211
amazing toy 0.3016522
amazing much 0.552381
amazing worth 0.44661003
amazing bucks 0.28742218
much toy 0.25184816
much amazing 0.552381
much worth 0.6091755
much bucks 0.42316318
worth toy 0.21597828
worth amazing 0.44661003
worth much 0.6091755
worth bucks 0.6375981
bucks toy 0.24880211
bucks amazing 0.28742218
bucks much 0.42316318
bucks worth 0.6375981
