## How to enhance a comment to encourage trending?

1. Get the top 10 words in the trending bucket.
2. The user enters a comment.
3. Score the comment.
4. Check which of the top 10 words are missing, then re-score the comment with each missing word added.
5. Output the original score and, for each added word 'x', the resulting comment and its score.
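
Steps 1 and 4 can be sketched with the standard library alone. This is a minimal illustration, not the notebook's model: `top_words`, `missing_top_words`, and the `trending_reviews` toy data are hypothetical names, and the tokenizer is a simplified stand-in.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"\b[a-z][a-z]+\b")

def top_words(reviews, n=10):
    """Return the n most frequent tokens across a list of review strings."""
    counts = Counter()
    for review in reviews:
        counts.update(TOKEN_PATTERN.findall(review.lower()))
    return [word for word, _ in counts.most_common(n)]

def missing_top_words(comment, top):
    """Return the top words that do not appear in the comment, in rank order."""
    present = set(TOKEN_PATTERN.findall(comment.lower()))
    return [word for word in top if word not in present]

# Hypothetical toy data standing in for the trending bucket.
trending_reviews = [
    "great toy, my son loved it",
    "great gift, loved the colors",
    "great quality toy",
]
comment = "This toy is amazing!"
top = top_words(trending_reviews, n=3)
missing = missing_top_words(comment, top)
# One candidate comment per missing word; each would then be scored.
candidates = [f"{comment} {word}" for word in missing]
```

Appending the word is the simplest possible way to "add" it; a real version would need to place it grammatically.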

Alternatively, word vectors (e.g. word2vec) could be used to find similar words by cosine similarity.
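
For reference, cosine similarity is the dot product of two vectors divided by the product of their norms. The sketch below uses made-up 3-dimensional vectors purely for illustration; real word vectors have hundreds of dimensions, and `cosine_similarity` here is a hand-rolled helper, not a library call.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)

# Made-up "embeddings", for illustration only.
vectors = {
    "toy": [0.9, 0.1, 0.0],
    "doll": [0.8, 0.2, 0.1],
    "invoice": [0.0, 0.1, 0.9],
}
# Rank the other words by similarity to "toy", most similar first.
ranked = sorted(
    (w for w in vectors if w != "toy"),
    key=lambda w: cosine_similarity(vectors["toy"], vectors[w]),
    reverse=True,
)
```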

Process:
1. User enters a comment.
2. Tokenize the comment.
3. For each token, find the most similar words in the corpus of trending-product reviews.
4. Generate additional comments by swapping out one word at a time.
5. Transform the comments (count vector -> LDA -> log).
6. Predict on the comments (SMOTE -> XGB).
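
Steps 4 to 6 could be sketched roughly as follows. Everything here is an assumption for illustration: `swap_variants` and `rank_variants` are hypothetical helpers, and `toy_score` is a dummy stand-in for the fitted transformer and classifier. The swap uses a whole-word regex so that replacing `toy` does not also touch `toys`.

```python
import re

def swap_variants(comment, token, replacements):
    """Return copies of `comment` with whole-word `token` swapped for each candidate."""
    pattern = re.compile(r"\b" + re.escape(token) + r"\b")
    variants = []
    for candidate in replacements:
        # A function replacement avoids backslash escapes being interpreted.
        variant = pattern.sub(lambda match: candidate, comment)
        if variant != comment and variant not in variants:
            variants.append(variant)
    return variants

def rank_variants(comment, variants, score):
    """Score the original comment and each variant; return (score, text) pairs, best first."""
    scored = [(score(text), text) for text in [comment] + variants]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

def toy_score(text):
    """Dummy scorer standing in for transform + predict_proba."""
    return 1.0 if "great" in text else 0.5

comment = "This toy is amazing!"
variants = swap_variants(comment, "amazing", ["great", "awesome"])
ranked = rank_variants(comment, variants, toy_score)
```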

In [1]:
import spacy

import numpy as np
import pandas as pd
import dill

import re
from nltk import SnowballStemmer
from nltk.corpus import stopwords

stemmer = SnowballStemmer('english')

import matplotlib.pyplot as plt
%matplotlib inline

%config InlineBackend.figure_format = 'svg'

MODELING_PATH = '../data/modeling/'
PATH = '../data/amazon_reviews_us_Toys_v1_00.tsv'

In [2]:
# save progress
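def save(obj, obj_name):
    # Serialize obj under MODELING_PATH; the context manager closes the file.
    f = MODELING_PATH + obj_name
    with open(f, 'wb') as fh:
        dill.dump(obj, fh)

def load(obj_name):
    # Load an object previously written by save().
    f = MODELING_PATH + obj_name
    with open(f, 'rb') as fh:
        return dill.load(fh)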

In [4]:
from AmazonReviews import AmazonReviews

ar = AmazonReviews()
ar.load_data(PATH)
ar.calc_trend_score()
ar.create_observations()

Read from pickle...


In [33]:
# enter a comment
comment = 'This toy is amazing! So much worth the bucks!!'

In [34]:
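# Build the stop-word set once; rebuilding it for every token is very slow.
STOP_WORDS = set(stopwords.words())  # no language argument: all NLTK languages

def token_comment(comment):
    # Matches lowercase words of two or more letters; capitalized words are skipped.
    tkpat = re.compile('\\b[a-z][a-z]+\\b')
    comment_token = tkpat.findall(comment)
    return [w for w in comment_token if w not in STOP_WORDS]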

In [35]:
comment_tokenized = token_comment(comment)
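
One thing to be aware of: the pattern only matches lowercase letters, so a capitalized word such as a sentence-initial "Amazing" is dropped entirely. If that is not intended, lowercasing before matching is one option. The sketch below shows the idea with a tiny hand-written stop list; `tokenize_lower` and `STOP_WORDS_DEMO` are hypothetical names, not part of the notebook.

```python
import re

TOKEN_PATTERN = re.compile(r"\b[a-z][a-z]+\b")
# Tiny illustrative stop list; the notebook uses NLTK's lists instead.
STOP_WORDS_DEMO = frozenset({"this", "is", "so", "the"})

def tokenize_lower(text, stop_words=STOP_WORDS_DEMO):
    """Lowercase, extract alphabetic tokens of 2+ letters, drop stop words."""
    return [w for w in TOKEN_PATTERN.findall(text.lower()) if w not in stop_words]

text = "Amazing toy! So much worth the bucks!!"
tokens = tokenize_lower(text)
# Without lowercasing, the capitalized first word is never matched.
tokens_case_sensitive = [
    w for w in TOKEN_PATTERN.findall(text) if w not in STOP_WORDS_DEMO
]
```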

In [5]:
nlp = spacy.load('en_core_web_lg')

In [7]:
print(nlp.vocab['dog'].similarity(nlp.vocab['dachshund']))

0.62467307


In [19]:
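def get_related(word, top=10):
    # TODO: replace word.vocab with the set of words in the trending review corpus
    # Keep lexemes with matching case and a log-probability of at least -15.
    filtered_words = [w for w in word.vocab if w.is_lower == word.is_lower and w.prob >= -15]
    # Rank the remaining lexemes by vector similarity to `word`, most similar first.
    by_similarity = sorted(filtered_words, key=lambda w: word.similarity(w), reverse=True)
    return by_similarity[:top]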

get_related(nlp.vocab[u'plane'])
# print( [w.lower_ for w in get_related(nlp.vocab[u'plane'])])

In [None]:
## need to get the corpus of all reviews which have trended
review_corpus = ' '.join(ar.obs[ar.obs.trend == 1].review_body)

In [12]:
review_corpus = token_comment(review_corpus) # takes a long time
review_corpus[:10]

['grandchild',
 'loved',
 'book',
 'used',
 'kept',
 'busy',
 'creating',
 'problem',
 'colored',
 'pencils']

In [14]:
review_corpus = set(review_corpus)
save(review_corpus, 'review_corpus.pkl')

In [30]:
review_vocab = [nlp.vocab[w] for w in review_corpus] # critical

In [15]:
len(review_corpus)

6516

In [90]:
def most_similar(word, top=10):
    # Rank every lexeme in the trending-review vocabulary by similarity to `word`.
    by_similarity = sorted(review_vocab, key=lambda w: word.similarity(w), reverse=True)
    return [w.orth_ for w in by_similarity[:top]]
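
Since only the top few matches are needed, a full sort of the vocabulary is more work than necessary. One possible alternative, sketched here with a generic similarity function rather than spaCy lexemes, is `heapq.nlargest`, which returns the k best items without sorting everything. `top_k_similar` and the numeric example are hypothetical.

```python
import heapq

def top_k_similar(query, vocabulary, similarity, k=10):
    """Return the k vocabulary items most similar to `query`, best first."""
    return heapq.nlargest(k, vocabulary, key=lambda item: similarity(query, item))

# Illustration with numbers: "similarity" is the negated distance to the query.
def closeness(a, b):
    return -abs(a - b)

nearest = top_k_similar(5, [1, 4, 7, 9, 5], closeness, k=3)
```

In the notebook's terms this would presumably amount to `heapq.nlargest(top, review_vocab, key=word.similarity)`, though that variant has not been run here.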

In [89]:
# Build candidate comments by swapping each token for its most similar trending words.
comment_list = [comment]
for t in comment_tokenized:
    sim_words = most_similar(nlp.vocab[t])
    for s in sim_words:
        new_comment = comment.replace(t, s)
        # Skip swaps that leave the comment unchanged (the word itself is in the list).
        if new_comment != comment:
            comment_list.append(new_comment)
comment_list

['This toy is amazing! So much worth the bucks!!',
 'This toys is amazing! So much worth the bucks!!',
 'This doll is amazing! So much worth the bucks!!',
 'This dolls is amazing! So much worth the bucks!!',
 'This teddy is amazing! So much worth the bucks!!',
 'This plush is amazing! So much worth the bucks!!',
 'This bunny is amazing! So much worth the bucks!!',
 'This stuffed is amazing! So much worth the bucks!!',
 'This playset is amazing! So much worth the bucks!!',
 'This lego is amazing! So much worth the bucks!!',
 'This toy is incredible! So much worth the bucks!!',
 'This toy is awesome! So much worth the bucks!!',
 'This toy is fantastic! So much worth the bucks!!',
 'This toy is wonderful! So much worth the bucks!!',
 'This toy is great! So much worth the bucks!!',
 'This toy is fabulous! So much worth the bucks!!',
 'This toy is phenomenal! So much worth the bucks!!',
 'This toy is beautiful! So much worth the bucks!!',
 'This toy is gorgeous! So much worth the bucks!!',


## Predict on the new comments

In [92]:
doc_transformer = load('doc_5t_transformer.pkl')
classifier_model = load('final_model_smote_5t.pkl')

In [93]:
comments_transformed = doc_transformer.transform(comment_list)



In [102]:
comment_probs = classifier_model.predict_proba(comments_transformed)[:,1]

In [103]:
for c, p in zip(comment_list, comment_probs):
    print(p, c)

0.07082949 This toy is amazing! So much worth the bucks!!
0.07082949 This toys is amazing! So much worth the bucks!!
0.07369653 This doll is amazing! So much worth the bucks!!
0.07369653 This dolls is amazing! So much worth the bucks!!
0.07082949 This teddy is amazing! So much worth the bucks!!
0.07082949 This plush is amazing! So much worth the bucks!!
0.14121 This bunny is amazing! So much worth the bucks!!
0.07082949 This stuffed is amazing! So much worth the bucks!!
0.12800424 This playset is amazing! So much worth the bucks!!
0.12800424 This lego is amazing! So much worth the bucks!!
0.07082949 This toy is incredible! So much worth the bucks!!
0.07082949 This toy is awesome! So much worth the bucks!!
0.07082949 This toy is fantastic! So much worth the bucks!!
0.07082949 This toy is wonderful! So much worth the bucks!!
0.25630432 This toy is great! So much worth the bucks!!
0.07082949 This toy is fabulous! So much worth the bucks!!
0.21020529 This toy is phenomenal! So much worth t

In [104]:
comment_list[comment_probs.argmax()]  # variant with the highest predicted trend probability

'This toy is great! So much worth the bucks!!'
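
To report how much each swap helps rather than only the single best variant, the probabilities can be compared against the original comment's score. The sketch below assumes the first entry is the original comment; `report_gains` is a hypothetical helper, and the three example rows reuse scores printed above.

```python
def report_gains(comments, probs):
    """Return (gain, prob, comment) rows for variants that beat the original.

    Assumes comments[0] is the original comment and probs[i] scores comments[i].
    """
    baseline = probs[0]
    rows = [
        (p - baseline, p, c)
        for c, p in zip(comments[1:], probs[1:])
        if p > baseline
    ]
    # Largest improvement first.
    return sorted(rows, key=lambda row: row[0], reverse=True)

example_comments = [
    "This toy is amazing! So much worth the bucks!!",
    "This doll is amazing! So much worth the bucks!!",
    "This toy is great! So much worth the bucks!!",
]
example_probs = [0.07082949, 0.07369653, 0.25630432]
rows = report_gains(example_comments, example_probs)
for gain, prob, text in rows:
    print(f"{prob:.4f} (+{gain:.4f}) {text}")
```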

In [105]:
from sklearn.pipeline import Pipeline

In [108]:
model_pipe = Pipeline(
    [
        ('step1', doc_transformer),
        ('step2', classifier_model)
    ]
)  # TODO: the pipeline still needs to be trained as a whole

In [107]:
model_pipe.predict_proba(comment_list)[:,1]



array([0.07082949, 0.07082949, 0.07369653, 0.07369653, 0.07082949,
       0.07082949, 0.14121   , 0.07082949, 0.12800424, 0.12800424,
       0.07082949, 0.07082949, 0.07082949, 0.07082949, 0.25630432,
       0.07082949, 0.21020529, 0.07082949, 0.07082949, 0.07082949,
       0.07082949, 0.07082949, 0.07082949, 0.07082949, 0.07082949,
       0.14021012, 0.07231435, 0.07082949, 0.14292574, 0.07082949,
       0.07082949, 0.13507941, 0.07082949, 0.07082949, 0.07082949,
       0.07082949, 0.21020529, 0.07082949, 0.07082949, 0.07082949,
       0.07082949, 0.07082949, 0.18541679, 0.12800424, 0.07082949,
       0.07082949], dtype=float32)