# Information

## Amazon data
* [Kaggle](https://www.kaggle.com/snap/amazon-fine-food-reviews)
* product info: GET amazon.com/dp/B00006HAXW
* [More data!](http://jmcauley.ucsd.edu/data/amazon/)

## Action Plan

Small Data sample:
1. Explore Data
2. Collaborative Filtering
3. NLP
4. Sentiment Analysis
5. Seq2Seq summarizer

Large Data samples:
* Implement above pipeline

## 1. Explore Data

In [1]:
import pandas as pd
import numpy as np
import json
import gzip
from data.parser import getDF

In [2]:
data = getDF('./data/reviews_Books_5.json.gz')
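
The `getDF` helper lives in the local `data.parser` module, which is not shown in this notebook. A minimal sketch of what such a loader might look like follows, assuming the file is gzipped with one strict-JSON review per line. The names `parse` and `get_df` are hypothetical stand-ins, not the actual module contents.

```python
import gzip
import json

import pandas as pd


def parse(path):
    """Yield one review dict per line of a gzipped JSON-lines file."""
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        for line in f:
            yield json.loads(line)


def get_df(path):
    """Load every review into a DataFrame, one row per review (hypothetical helper)."""
    return pd.DataFrame(list(parse(path)))
```

Materializing the full list before building the DataFrame keeps the sketch short, but it holds every review in memory at once, which matters for a file of this size.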

In [3]:
data.summary

0                                                 Wonderful!
1                                               close to god
2                            Must Read for Life Afficianados
3          Timeless for every good and bad time in your l...
4                                              A Modern Rumi
5                             This book will bring you peace
6                                                 Graet Work
7                                                Such Beauty
8                                                The Prophet
9                                           A Modern Classic
10           Perhaps the greatest book that I have ever read
11                   Great classic that everyone should read
12                                                   Amazing
13                            Everyone should have this book
14                     A book everyone &#34;should&#34; read
15                           phenomenal piece of literature!
16         textured pape

In [4]:
data.head()

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
0,A10000012B7CGYKOMPQ4L,000100039X,Adam,"[0, 0]",Spiritually and mentally inspiring! A book tha...,5.0,Wonderful!,1355616000,"12 16, 2012"
1,A2S166WSCFIFP5,000100039X,"adead_poet@hotmail.com ""adead_poet@hotmail.com""","[0, 2]",This is one my must have books. It is a master...,5.0,close to god,1071100800,"12 11, 2003"
2,A1BM81XB4QHOA3,000100039X,"Ahoro Blethends ""Seriously""","[0, 0]",This book provides a reflection that you can a...,5.0,Must Read for Life Afficianados,1390003200,"01 18, 2014"
3,A1MOSTXNIO5MPJ,000100039X,Alan Krug,"[0, 0]",I first read THE PROPHET in college back in th...,5.0,Timeless for every good and bad time in your l...,1317081600,"09 27, 2011"
4,A2XQ5LZHTD4AFT,000100039X,Alaturka,"[7, 9]",A timeless classic. It is a very demanding an...,5.0,A Modern Rumi,1033948800,"10 7, 2002"


In [5]:
data.columns

Index(['reviewerID', 'asin', 'reviewerName', 'helpful', 'reviewText',
       'overall', 'summary', 'unixReviewTime', 'reviewTime'],
      dtype='object')

In [6]:
from collections import Counter
prod_counts = Counter(data['asin'])
print(prod_counts.most_common(10))
print('Length: {}'.format(len(prod_counts)))

[('030758836X', 7440), ('0439023483', 6717), ('0375831002', 4864), ('038536315X', 4604), ('0439023513', 4440), ('0316055433', 4305), ('0385537859', 4284), ('0007444117', 3821), ('147674355X', 3725), ('0399159347', 3655)]
Length: 367982


In [7]:
user_counts = Counter(data['reviewerID'])
print(user_counts.most_common(10))
print('Length: {}'.format(len(user_counts)))

[('AFVQZQ8PW0L', 23222), ('A14OJS0VWMOSWO', 16090), ('A2F6N60Z96CAJI', 5891), ('A320TMDV6KCFU', 4212), ('AHUT55E980RDR', 3091), ('A13QTZ8CIMHHG4', 2949), ('A1K1JW1C5CUSUZ', 2910), ('A328S9RN3U5M68', 2795), ('A2TX179XAT5GRP', 2529), ('A21NVBFIEQWDSG', 2526)]
Length: 603668
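
The two counts above give a sense of how sparse the user/product rating matrix is. The sketch below works that out, taking the total number of reviews from the training and validation sizes printed in the split step later in this notebook (7120433 + 1777608). It is plain arithmetic on numbers already reported here, not a new measurement.

```python
n_users = 603668              # distinct reviewerID values (from the output above)
n_products = 367982           # distinct asin values (from the output above)
n_ratings = 7120433 + 1777608 # train + validation rows reported in the split step

possible = n_users * n_products   # cells in a full user x product matrix
density = n_ratings / possible    # fraction of cells that hold a rating

print('Ratings: {}'.format(n_ratings))
print('Density: {:.6%}'.format(density))
print('Mean reviews per user: {:.1f}'.format(n_ratings / n_users))
print('Mean reviews per product: {:.1f}'.format(n_ratings / n_products))
```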


In [78]:
# Save the first 100 reviews as a small sample for quick experiments
data[:100].to_json('./data/reviews_books_100.json', orient='records')

## 2. Collaborative Filtering

### Basic filtering based on user overall score

In [8]:
ratings = data[['asin', 'reviewerID', 'overall']]

In [9]:
ratings.head()

Unnamed: 0,asin,reviewerID,overall
0,000100039X,A10000012B7CGYKOMPQ4L,5.0
1,000100039X,A2S166WSCFIFP5,5.0
2,000100039X,A1BM81XB4QHOA3,5.0
3,000100039X,A1MOSTXNIO5MPJ,5.0
4,000100039X,A2XQ5LZHTD4AFT,5.0


In [10]:
users = ratings['reviewerID'].unique()
products = ratings['asin'].unique()

In [11]:
userid2idx = {o:i for i,o in enumerate(users)}
productid2idx = {o:i for i,o in enumerate(products)}
idx2userid = {i:o for i,o in enumerate(users)}
idx2productid = {i:o for i,o in enumerate(products)}

We remap the user and product IDs to contiguous integers starting at 0, because an embedding layer looks up its rows by integer index.

In [12]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
ratings = ratings.assign(userId=ratings['reviewerID'].map(userid2idx),
                         productId=ratings['asin'].map(productid2idx))
ratings = ratings.drop(columns=['reviewerID', 'asin'])
ratings.head()


Unnamed: 0,overall,userId,productId
0,5.0,0,0
1,5.0,1,0
2,5.0,2,0
3,5.0,3,0
4,5.0,4,0
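
The dictionary-based remapping above can also be sketched with `pd.factorize`, which assigns integer codes in order of first appearance, the same order `enumerate(unique())` produces. The data below is toy data with made-up IDs (`U1`, `P1`, ...), used only to show the mechanics.

```python
import pandas as pd

# Toy stand-in for the ratings frame; the IDs are invented for illustration
sample = pd.DataFrame({
    'asin': ['P1', 'P1', 'P2'],
    'reviewerID': ['U1', 'U2', 'U1'],
    'overall': [5.0, 5.0, 4.0],
})

# factorize() returns (integer codes, index of unique values)
user_codes, user_index = pd.factorize(sample['reviewerID'])
prod_codes, prod_index = pd.factorize(sample['asin'])

encoded = pd.DataFrame({
    'overall': sample['overall'],
    'userId': user_codes,
    'productId': prod_codes,
})
print(encoded)

# The unique-value index doubles as the reverse lookup (code -> original ID)
print(user_index[1], prod_index[1])
```

One design difference: the reverse lookup comes for free from the returned index, so separate `idx2...` dictionaries are not needed.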


In [13]:
def stats(column):
    # nunique() counts distinct values, so the label is 'Unique'
    print('Column: {}, Min {}, Max {}, Unique {}'.format(column,
        ratings[column].min(), ratings[column].max(), ratings[column].nunique()))

In [14]:
stats('userId')
stats('productId')

Column: userId, Min 0, Max 603667, Unique 603668
Column: productId, Min 0, Max 367981, Unique 367982


### Prepare dataset

#### Hyperparameters

In [15]:
# number of latent factors
n_factors = 50
# learning rate
learning_rate = 0.001
# batch size
batch_size = 1028
# number of epochs
epochs = 20

In [16]:
# Call the function; assigning to it would overwrite np.random.seed without seeding
np.random.seed(42)

Randomly split the ratings into training (80%) and validation (20%) sets

In [17]:
msk = np.random.rand(len(ratings)) < 0.8
train = ratings[msk]
val = ratings[~msk]

print('Training samples {} ({}), Validation samples {} ({})'.format(
    len(train), len(train)/len(ratings), len(val), len(val)/len(ratings)))

Training samples 7120433 (0.8002247910523227), Validation samples 1777608 (0.19977520894767736)
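
A reproducible variant of this random split can be sketched with NumPy's `Generator` API. This assumes NumPy 1.17 or newer; the helper name `split_mask`, the seed, and the row count of 1000 are illustrative choices, not values from this notebook.

```python
import numpy as np


def split_mask(n_rows, train_frac=0.8, seed=42):
    """Return a boolean mask that is True for rows assigned to training."""
    rng = np.random.default_rng(seed)      # local generator, no global state
    return rng.random(n_rows) < train_frac


msk = split_mask(1000)                     # 1000 stands in for len(ratings)
print('train rows: {}, validation rows: {}'.format(msk.sum(), (~msk).sum()))
```

Because the generator is created inside the function, calling it again with the same seed yields the same mask, so `ratings[msk]` and `ratings[~msk]` would select the same rows on every run.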


### First Model: Dot product
The simplest approach predicts a rating as the dot product of a user embedding and a product embedding, plus a bias term for the user and one for the product.
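
In NumPy terms, the prediction this kind of model computes can be sketched as follows. The sizes and the random parameter values are toy assumptions, whereas in the Keras model below the embeddings and biases are learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_products, n_factors = 4, 3, 5      # toy sizes, not the real ones

# Randomly initialized stand-ins for the learned parameters
user_emb = rng.normal(size=(n_users, n_factors))
prod_emb = rng.normal(size=(n_products, n_factors))
user_bias = rng.normal(size=n_users)
prod_bias = rng.normal(size=n_products)


def predict(user_ids, product_ids):
    """Rating estimate: <user vector, product vector> + user bias + product bias."""
    u = user_emb[user_ids]                    # (batch, n_factors)
    p = prod_emb[product_ids]                 # (batch, n_factors)
    return (u * p).sum(axis=1) + user_bias[user_ids] + prod_bias[product_ids]


preds = predict(np.array([0, 1, 3]), np.array([2, 2, 0]))
print(preds.shape)
```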

In [18]:
from keras.layers import Input, Embedding, dot, Flatten, merge
from keras.regularizers import l2
from keras.models import Model
from keras.optimizers import Adam

Using TensorFlow backend.


In [19]:
n_users = ratings.userId.nunique()
n_products = ratings.productId.nunique()

In [20]:
def embedding_input(name, n_in, n_out, reg):
    inp = Input(shape=(1,), dtype='int64', name=name)
    # embeddings_regularizer is the Keras 2 name for the deprecated W_regularizer
    emb = Embedding(n_in, n_out, input_length=1, embeddings_regularizer=l2(reg))(inp)
    return inp, emb

In [21]:
user_in, user_emb = embedding_input('user_in', n_users, n_factors, 1e-4)
prod_in, prod_emb = embedding_input('prod_in', n_products, n_factors, 1e-4)


In [22]:
def create_bias(inp, n_in):
    x = Embedding(n_in, 1, input_length=1)(inp)
    return Flatten()(x)

In [23]:
user_bias = create_bias(user_in, n_users)
prod_bias = create_bias(prod_in, n_products)

In [24]:
from keras.layers import add

# dot() and add() replace the deprecated merge(..., mode='dot'/'sum') calls
x = dot([user_emb, prod_emb], axes=2)
x = Flatten()(x)
x = add([x, user_bias, prod_bias])

model = Model([user_in, prod_in], x)
# MAE is tracked instead of accuracy, which is not meaningful for a regression target
model.compile(Adam(learning_rate), loss='mse', metrics=['mae'])


In [25]:
model.fit([train.userId, train.productId], train.overall, batch_size=batch_size,
          epochs=epochs, validation_data=([val.userId, val.productId], val.overall))

Train on 7120433 samples, validate on 1777608 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7fa9a6982cf8>

In [26]:
model.save_weights('models/dot-books.h5')
# The with block closes the file on exit, so no explicit close() is needed
with open('models/dot-books.json', 'w') as f:
    f.write(model.to_json())

### Analyze Results

In [27]:
model.load_weights('models/dot-books.h5')

We restrict the analysis to the 1,000 most-reviewed products

In [28]:
g = ratings.groupby('productId')['overall'].count()
top_prods = g.sort_values(ascending=False)[:1000]
top_prods = np.array(top_prods.index)

#### A look at the product bias term. 

In [29]:
get_prod_bias = Model(prod_in, prod_bias)
product_bias = get_prod_bias.predict(top_prods)
prod_scores = [(b[0], i) for i,b in zip(top_prods, product_bias)]

#####  Top and bottom scores (products)

In [30]:
from operator import itemgetter

In [31]:
# Bottom
prod_scores = [(b, idx2productid[i]) for b,i in prod_scores]
sorted(prod_scores, key=itemgetter(0))[:15]

[(0.54248488, '1892112000'),
 (1.0967101, '0345803485'),
 (1.1235007, 'B0093MU7QS'),
 (1.1773204, '0307741907'),
 (1.247512, '0517580519'),
 (1.4369099, '0099451956'),
 (1.6222056, '0091883768'),
 (1.6249024, '0307275558'),
 (1.6812198, '030727828X'),
 (1.6813415, '031218087X'),
 (1.6893973, '0425269205'),
 (1.7422723, '0099297701'),
 (1.743048, '0307950654'),
 (1.7631874, '0316228532'),
 (1.7724576, '074356619X')]

In [32]:
# Top
sorted(prod_scores, key=itemgetter(0), reverse=True)[:15]

[(3.4578693, '0356502465'),
 (3.4181018, '0765326361'),
 (3.3473585, '0143124544'),
 (3.3472676, '023076889X'),
 (3.3235109, '0451464397'),
 (3.3175764, '0765326353'),
 (3.2667656, '0370332288'),
 (3.266572, '0061950726'),
 (3.2662535, '0345543971'),
 (3.2642975, '0141345713'),
 (3.2632172, '0007386648'),
 (3.2542779, '1480563900'),
 (3.2361443, '0743527992'),
 (3.231287, '0425252868'),
 (3.2306137, '045141912X')]

#### A look at the embeddings

In [33]:
get_prod_emb = Model(prod_in, prod_emb)
product_emb = np.squeeze(get_prod_emb.predict([top_prods]))
product_emb.shape

(1000, 50)

Fifty latent dimensions (n_factors) are hard to visualize, so we use PCA to reduce them to 3 principal components

In [34]:
from sklearn.decomposition import PCA
pca = PCA(n_components=3)
prod_pca = pca.fit(product_emb.T).components_
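
Because the PCA above is fit on the transposed matrix, each component holds one value per product rather than one per latent factor. The sketch below illustrates the shapes involved using a small SVD-based PCA written in NumPy. It is a stand-in for illustration, not scikit-learn's implementation, and the random matrix `emb` only mimics the shape of `product_emb`.

```python
import numpy as np


def pca_components(X, n_components):
    """Rows of X are samples; returns the top principal axes, shape (n_components, n_features)."""
    Xc = X - X.mean(axis=0)                          # center each feature
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return vt[:n_components]


rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 50))                    # stand-in for product_emb

per_product = pca_components(emb.T, 3)               # orientation used above
per_factor = pca_components(emb, 3)                  # conventional orientation
scores = (emb - emb.mean(axis=0)) @ per_factor.T     # project products onto 3 axes

print(per_product.shape, per_factor.shape, scores.shape)
```

In the transposed orientation, indexing a component by product position (as the cells below do with `fac0`) works directly. In the conventional orientation, the per-product values would come from the projected `scores` instead.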

1st component

In [35]:
fac0 = prod_pca[0]

In [36]:
prod_comp = [(f, i) for f,i in zip(fac0, top_prods)]
prod_comp = [(b, idx2productid[i]) for b,i in prod_comp]

In [37]:
sorted(prod_comp, key=itemgetter(0), reverse=True)[:10]

[(0.99015409699348511, '0061537934'),
 (0.03440007034824677, '0316024961'),
 (0.02484767574517957, '1611099285'),
 (0.014505928376358567, '0307265439'),
 (0.0099678324471696878, '1250029880'),
 (0.0083678790251605299, '1416585834'),
 (0.0082992179314063802, '1453860959'),
 (0.0082703339368619382, '1451533969'),
 (0.0069236240100475624, '0425266516'),
 (0.0068798024786250345, '1477805028')]

In [38]:
sorted(prod_comp, key=itemgetter(0))[:10]

[(-0.1184655559955928, '0800733428'),
 (-0.025952932679041748, '0805090037'),
 (-0.024492873444534536, '1476712980'),
 (-0.01938769406627926, '0307934055'),
 (-0.015343848318090323, '0441018645'),
 (-0.013855591102053463, '0007172826'),
 (-0.013195580380871369, '045141411X'),
 (-0.010289132005519074, '1455578363'),
 (-0.0084978342120548486, '1480536466'),
 (-0.0076389130060686211, '0385660065')]

### Second Model: Simple Neural Net

In [39]:
user_in, user_emb = embedding_input('user_in', n_users, n_factors, 1e-4)
prod_in, prod_emb = embedding_input('prod_in', n_products, n_factors, 1e-4)


In [40]:
from keras.layers import Dropout, Dense

In [41]:
from keras.layers import concatenate

# concatenate() replaces the deprecated merge(..., mode='concat') call
x = concatenate([user_emb, prod_emb])
x = Flatten()(x)
x = Dropout(0.3)(x)
x = Dense(70, activation='relu')(x)
x = Dropout(0.75)(x)
x = Dense(1)(x)

nn = Model([user_in, prod_in], x)
# MAE is tracked instead of accuracy, which is not meaningful for a regression target
nn.compile(Adam(learning_rate), loss='mse', metrics=['mae'])


In [42]:
nn.fit([train.userId, train.productId], train.overall, batch_size=batch_size,
       epochs=epochs, validation_data=([val.userId, val.productId], val.overall))

Train on 7120433 samples, validate on 1777608 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7fa9ea819c50>

In [43]:
nn.save_weights('models/nn-books.h5')
# The with block closes the file on exit, so no explicit close() is needed
with open('models/nn-books.json', 'w') as f:
    f.write(nn.to_json())

### Test models

In [44]:
test_ratings = val[:10]
print(test_ratings)

    overall  userId  productId
1       5.0       1          0
6       5.0       6          0
9       5.0       9          0
10      5.0      10          0
16      5.0      16          0
18      5.0      18          0
19      5.0      19          0
24      5.0      24          0
27      5.0      27          0
28      5.0      28          0


In [45]:
users = test_ratings['userId'].values
prods = test_ratings['productId'].values

In [46]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning;
# ravel() flattens the (n, 1) prediction array into a single column
test_ratings = test_ratings.assign(preds=nn.predict([users, prods]).ravel())


In [47]:
test_ratings

Unnamed: 0,overall,userId,productId,preds
1,5.0,1,0,4.46137
6,5.0,6,0,4.780422
9,5.0,9,0,4.484874
10,5.0,10,0,4.374266
16,5.0,16,0,4.430554
18,5.0,18,0,4.385927
19,5.0,19,0,4.123089
24,5.0,24,0,4.460001
27,5.0,27,0,4.466776
28,5.0,28,0,4.474891
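
To put a number on how far these ten predictions are from the true ratings, the sketch below computes the mean absolute error and root mean squared error from the values in the table above. Ten rows for a single product say little about overall model quality, so this is only an illustration of the calculation.

```python
import numpy as np

# Values copied from the table above
actual = np.full(10, 5.0)
preds = np.array([4.46137, 4.780422, 4.484874, 4.374266, 4.430554,
                  4.385927, 4.123089, 4.460001, 4.466776, 4.474891])

errors = preds - actual
mae = np.abs(errors).mean()              # mean absolute error
rmse = np.sqrt((errors ** 2).mean())     # root mean squared error

print('MAE: {:.4f}, RMSE: {:.4f}'.format(mae, rmse))
```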


## 3. NLP

In [50]:
import spacy
nlp = spacy.load('en')

In [52]:
reviews = data[['asin', 'reviewerID', 'reviewText', 'summary']]

In [56]:
reviews.head()

Unnamed: 0,asin,reviewerID,reviewText,summary
0,000100039X,A10000012B7CGYKOMPQ4L,Spiritually and mentally inspiring! A book tha...,Wonderful!
1,000100039X,A2S166WSCFIFP5,This is one my must have books. It is a master...,close to god
2,000100039X,A1BM81XB4QHOA3,This book provides a reflection that you can a...,Must Read for Life Afficianados
3,000100039X,A1MOSTXNIO5MPJ,I first read THE PROPHET in college back in th...,Timeless for every good and bad time in your l...
4,000100039X,A2XQ5LZHTD4AFT,A timeless classic. It is a very demanding an...,A Modern Rumi


In [57]:
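# Note: joining with '' fuses the last word of one review with the first word of
# the next (see the 'are!This' token in the output below); ' '.join(...) would
# keep them apart
reviews_text = ''.join(reviews['reviewText'])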

In [58]:
len(reviews_text)

7286830869

In [64]:
doc = nlp.make_doc(reviews_text[:100000])

In [65]:
for proc in nlp.pipeline:
    proc(doc)

In [75]:
for word in doc[:100]:
    print(word.text,':', word.lemma_, word.shape_, word.pos_, word.tag_, word.dep_)

Spiritually : spiritually Xxxxx ADV RB ROOT
and : and xxx CCONJ CC cc
mentally : mentally xxxx ADV RB advmod
inspiring : inspiring xxxx ADJ JJ conj
! : ! ! PUNCT . punct
A : a X DET DT det
book : book xxxx NOUN NN nsubj
that : that xxxx ADJ WDT nsubj
allows : allow xxxx VERB VBZ relcl
you : -PRON- xxx PRON PRP nsubj
to : to xx PART TO aux
question : question xxxx VERB VB ccomp
your : -PRON- xxxx ADJ PRP$ poss
morals : moral xxxx NOUN NNS dobj
and : and xxx CCONJ CC cc
will : will xxxx VERB MD aux
help : help xxxx VERB VB ROOT
you : -PRON- xxx PRON PRP nsubj
discover : discover xxxx VERB VB xcomp
who : who xxx NOUN WP dobj
you : -PRON- xxx PRON PRP nsubj
really : really xxxx ADV RB advmod
are!This : are!this xxx!Xxxx DET DT nsubj
is : be xx VERB VBZ ccomp
one : one xxx NUM CD dobj
my : -PRON- xx ADJ PRP$ nsubj
must : must xxxx VERB MD aux
have : have xxxx VERB VB relcl
books : book xxxx NOUN NNS dobj
. : . . PUNCT . punct
It : -PRON- Xx PRON PRP nsubj
is : be xx VERB VBZ ROOT
a : a x DE