# Information

## Amazon data
* [Kaggle](https://www.kaggle.com/snap/amazon-fine-food-reviews)
* product info: GET amazon.com/dp/B00006HAXW
* [More data!](http://jmcauley.ucsd.edu/data/amazon/)

## Action Plan

Small Data sample:
1. Explore Data
2. Collaborative Filtering
3. Sentiment Analysis
4. Seq2Seq summarizer
5. Web interface

Large Data samples:
* Implement above pipeline

## Data Discovery

In [1]:
import pandas as pd
import numpy as np

In [90]:
data = pd.read_csv('data/Reviews.csv', index_col='Id')
books_data = pd.read_csv('data/ratings_Books.csv', header=None)

In [91]:
books_data.head()

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | AH2L9G3DQHHAJ | 116 | 4.0 | 1019865600 |
| 1 | A2IIIDRK3PRRZY | 116 | 1.0 | 1395619200 |
| 2 | A1TADCM7YWPQ8M | 868 | 4.0 | 1031702400 |
| 3 | AWGH7V0BDOJKB | 13714 | 4.0 | 1383177600 |
| 4 | A3UTQPQPM4TQO0 | 13714 | 5.0 | 1374883200 |


In [4]:
data.columns

Index(['ProductId', 'UserId', 'ProfileName', 'HelpfulnessNumerator',
       'HelpfulnessDenominator', 'Score', 'Time', 'Summary', 'Text'],
      dtype='object')

In [5]:
data.describe()

|       | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time |
|---|---|---|---|---|
| count | 568454.0 | 568454.0 | 568454.0 | 568454.0 |
| mean | 1.743817 | 2.22881 | 4.183199 | 1296257000.0 |
| std | 7.636513 | 8.28974 | 1.310436 | 48043310.0 |
| min | 0.0 | 0.0 | 1.0 | 939340800.0 |
| 25% | 0.0 | 0.0 | 4.0 | 1271290000.0 |
| 50% | 0.0 | 1.0 | 5.0 | 1311120000.0 |
| 75% | 2.0 | 2.0 | 5.0 | 1332720000.0 |
| max | 866.0 | 923.0 | 5.0 | 1351210000.0 |
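
The `Time` values appear to be Unix timestamps (seconds since 1970-01-01 UTC), which makes the summary above hard to read. A minimal sketch of decoding one of them, using the minimum reported by `data.describe()` and only the standard library:

```python
from datetime import datetime, timezone

# Smallest `Time` value from data.describe() above
earliest_ts = 939340800

# Interpret it as seconds since the Unix epoch, in UTC
earliest = datetime.fromtimestamp(earliest_ts, tz=timezone.utc)
print(earliest.date())
```

On the full frame, `pd.to_datetime(data['Time'], unit='s')` should perform the same conversion for the whole column.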


In [6]:
from collections import Counter

In [7]:
product_counts = Counter(data['ProductId'])
print(product_counts.most_common(10))
print('Length: {}'.format(len(product_counts)))

[('B007JFMH8M', 913), ('B002QWP89S', 632), ('B0026RQTGE', 632), ('B002QWHJOU', 632), ('B002QWP8H0', 632), ('B003B3OOPA', 623), ('B001EO5Q64', 567), ('B0013NUGDE', 564), ('B007M83302', 564), ('B000VK8AVK', 564)]
Length: 74258


In [8]:
user_counts = Counter(data['UserId'])
print(user_counts.most_common(10))
print('Length: {}'.format(len(user_counts)))

[('A3OXHLG6DIBRW8', 448), ('A1YUL9PCJR3JTY', 421), ('AY12DBB0U420B', 389), ('A281NPSIMI1C2R', 365), ('A1Z54EM24Y40LL', 256), ('A1TMAVN4CEM8U8', 204), ('A2MUGFV2TDQ47K', 201), ('A3TVZM3ZIXG8YW', 199), ('A3PJZ8TU8FDQ1K', 178), ('AQQLWCMRNDFGI', 176)]
Length: 256059
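
The three counts above (reviews, distinct products, distinct users) give a rough idea of how sparse the user x product rating matrix is. A small sketch of that arithmetic; the variable names are illustrative and the numbers are copied from the outputs above:

```python
# Counts reported by the cells above
n_reviews = 568454
n_products = 74258
n_users = 256059

# Fraction of the user x product matrix that actually holds a rating
# (an upper bound, since a user may review the same product more than once)
density = n_reviews / (n_users * n_products)
print('Density: {:.6%}'.format(density))
```

With only about 0.003% of the cells filled, a dense matrix is impractical, which is one reason to learn embeddings instead.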


## 1. Collaborative Filtering

### Basic filtering based on user score

In [9]:
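# Take an explicit copy so that adding columns later does not trigger SettingWithCopyWarning
ratings = data[['ProductId', 'UserId', 'Score']].copy()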

In [10]:
ratings.head()

| Id | ProductId | UserId | Score |
|---|---|---|---|
| 1 | B001E4KFG0 | A3SGXH7AUHU8GW | 5 |
| 2 | B00813GRG4 | A1D87F6ZCVE5NK | 1 |
| 3 | B000LQOCH0 | ABXLMWJIXXAIN | 4 |
| 4 | B000UA0QIQ | A395BORC6FGVXV | 2 |
| 5 | B006K2ZZ7K | A1UQRSCLF8GW1T | 5 |


In [11]:
users = ratings['UserId'].unique()
products = ratings['ProductId'].unique()

In [74]:
userid2idx = {o:i for i,o in enumerate(users)}
productid2idx = {o:i for i,o in enumerate(products)}
idx2userid = {i:o for i,o in enumerate(users)}
idx2productid = {i:o for i,o in enumerate(products)}

We remap the user and product ids to contiguous, zero-based integers, which the embedding layers need as row indices.

In [13]:
# Map the original string ids to their integer indices
ratings['userId'] = ratings['UserId'].map(userid2idx)
ratings['productId'] = ratings['ProductId'].map(productid2idx)
# Drop the original string id columns
ratings = ratings.drop(['UserId', 'ProductId'], axis=1)
ratings.head()


| Id | Score | userId | productId |
|---|---|---|---|
| 1 | 5 | 0 | 0 |
| 2 | 1 | 1 | 1 |
| 3 | 4 | 2 | 2 |
| 4 | 2 | 3 | 3 |
| 5 | 5 | 4 | 4 |
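
The id remapping above can also be sketched with `pd.factorize`, which assigns integer codes in order of first appearance and returns the unique values needed to decode them. This is an alternative to the dictionaries used in the notebook, shown on a made-up toy frame:

```python
import pandas as pd

# Toy frame standing in for the real ratings (ids are made up)
toy = pd.DataFrame({
    'UserId': ['u_b', 'u_a', 'u_b', 'u_c'],
    'ProductId': ['p_x', 'p_x', 'p_y', 'p_x'],
})

# factorize returns integer codes in order of first appearance,
# plus the unique values used to decode the codes again
user_codes, user_index = pd.factorize(toy['UserId'])
prod_codes, prod_index = pd.factorize(toy['ProductId'])

toy['userId'] = user_codes
toy['productId'] = prod_codes
print(toy)
```

Here `user_index[code]` recovers the original id, playing the role of the reverse lookup dictionaries.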


In [14]:
def stats(column):
    # nunique() counts distinct values, so the label is "Unique"
    print('Column: {}, Min {}, Max {}, Unique {}'.format(column,
        ratings[column].min(), ratings[column].max(), ratings[column].nunique()))

In [15]:
stats('userId')
stats('productId')

Column: userId, Min 0, Max 256058, Unique 256059
Column: productId, Min 0, Max 74257, Unique 74258


### Prepare dataset

#### Hyperparameters

In [53]:
# number of latent factors
n_factors = 50
# learning rate
learning_rate = 0.001
# batch size
batch_size = 256
# number of epochs
epochs = 20

In [17]:
# Call seed(); assigning to it would replace the function without seeding anything
np.random.seed(42)

Randomly split the ratings into training (about 80%) and validation (about 20%) sets

In [18]:
msk = np.random.rand(len(ratings)) < 0.8
train = ratings[msk]
val = ratings[~msk]

print('Training samples {} ({}), Validation samples {} ({})'.format(
    len(train), len(train)/len(ratings), len(val), len(val)/len(ratings)))

Training samples 455010 (0.800434160019984), Validation samples 113444 (0.19956583998001598)
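
A minimal sketch of the same mask-based split made repeatable with a seeded generator. It assumes NumPy 1.17 or newer for `np.random.default_rng`, and `n_rows` is a stand-in for `len(ratings)`:

```python
import numpy as np

# A seeded generator makes the split repeatable across runs
rng = np.random.default_rng(42)

n_rows = 1000  # stand-in for len(ratings)
mask = rng.random(n_rows) < 0.8

# Row positions for each subset; every row lands in exactly one of them
train_idx = np.flatnonzero(mask)
val_idx = np.flatnonzero(~mask)
print(len(train_idx), len(val_idx))
```

Because each row is assigned independently, the split is only approximately 80/20, which matches the slightly uneven proportions printed above.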


### First Model: Dot product
The most basic approach takes the dot product of a user embedding and a product embedding, then adds the respective user and product biases.
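
Before building this in Keras, the prediction rule can be sketched for a single user and product in plain NumPy. All vectors and biases here are made-up values, not learned parameters:

```python
import numpy as np

# Made-up embeddings for one user and one product (3 latent factors)
user_vec = np.array([0.5, -1.0, 2.0])
prod_vec = np.array([1.0, 0.5, 0.25])

# Made-up scalar biases
user_bias_val = 0.3
prod_bias_val = 3.5

# Predicted rating = dot product + user bias + product bias
pred = user_vec @ prod_vec + user_bias_val + prod_bias_val
print(pred)
```

The model below learns one such vector and one such bias per user and per product, with `n_factors` entries per vector.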

In [37]:
from keras.layers import Input, Embedding, dot, Flatten, merge
from keras.regularizers import l2
from keras.models import Model
from keras.optimizers import Adam

In [38]:
n_users = ratings.userId.nunique()
n_products = ratings.productId.nunique()

In [39]:
def embedding_input(name, n_in, n_out, reg):
    inp = Input(shape=(1,), dtype='int64', name=name)
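    # Keras 2 name for the regularizer argument (W_regularizer is the deprecated Keras 1 name)
    emb = Embedding(n_in, n_out, input_length=1, embeddings_regularizer=l2(reg))(inp)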
    return inp, emb

In [40]:
user_in, user_emb = embedding_input('user_in', n_users, n_factors, 1e-4)
prod_in, prod_emb = embedding_input('prod_in', n_products, n_factors, 1e-4)


In [41]:
def create_bias(inp, n_in):
    x = Embedding(n_in, 1, input_length=1)(inp)
    return Flatten()(x)

In [42]:
user_bias = create_bias(user_in, n_users)
prod_bias = create_bias(prod_in, n_products)

In [51]:
from keras.layers import add

# Dot product of the two embeddings, plus the user and product biases
x = dot([user_emb, prod_emb], axes=2)
x = Flatten()(x)
x = add([x, user_bias])
x = add([x, prod_bias])

model = Model([user_in, prod_in], x)
model.compile(Adam(learning_rate), loss='mse', metrics=['accuracy'])


In [52]:
model.fit([train.userId, train.productId], train.Score, batch_size=batch_size,
          epochs=epochs, validation_data=([val.userId, val.productId], val.Score))

Train on 455010 samples, validate on 113444 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7f77043a37f0>

In [154]:
model.save_weights('models/dot.h5')
# The with-block closes the file, so no explicit f.close() is needed
with open('models/dot.json', 'w') as f:
    f.write(model.to_json())

### Analyze Results

In [56]:
model.load_weights('models/dot.h5')

We restrict the analysis to the 1000 most-reviewed products

In [57]:
g = ratings.groupby('productId')['Score'].count()
top_prods = g.sort_values(ascending=False)[:1000]
top_prods = np.array(top_prods.index)

#### A look at the product bias term

In [59]:
get_prod_bias = Model(prod_in, prod_bias)
product_bias = get_prod_bias.predict(top_prods)
prod_scores = [(b[0], i) for i,b in zip(top_prods, product_bias)]

##### Top and bottom scores (products)

In [63]:
from operator import itemgetter

In [76]:
# Bottom
prod_scores = [(b, idx2productid[i]) for b,i in prod_scores]
sorted(prod_scores, key=itemgetter(0))[:15]

[(1.3939834, 'B006N3I69A'),
 (1.4603701, 'B003JA5KBW'),
 (1.6009259, 'B007RTR9DS'),
 (1.6133969, 'B000X1Q1G8'),
 (1.7591424, 'B007RTR9G0'),
 (1.8170928, 'B0041NYV8E'),
 (2.0138457, 'B006BXV176'),
 (2.0272725, 'B005A1LJ04'),
 (2.0381904, 'B006MONQMC'),
 (2.0412643, 'B003YBLF2E'),
 (2.047174, 'B008O3G25W'),
 (2.0507903, 'B005GBIXZM'),
 (2.054209, 'B005GYULZY'),
 (2.0549076, 'B005CUU25G'),
 (2.0764017, 'B008O3G2K2')]

In [77]:
# Top
sorted(prod_scores, key=itemgetter(0), reverse=True)[:15]

[(4.1031499, 'B007R900WA'),
 (4.0837893, 'B000O5DI1E'),
 (4.0695276, 'B003KRHDMI'),
 (4.0091701, 'B000NMJWZO'),
 (3.9898019, 'B003QDRJXY'),
 (3.985074, 'B005BRHVD6'),
 (3.9831145, 'B000FBMFDO'),
 (3.9805987, 'B000FKQD42'),
 (3.9801629, 'B002VLZ8D0'),
 (3.9729517, 'B006H34CUS'),
 (3.9669712, 'B000ET4SM8'),
 (3.9663024, 'B003B3OOPA'),
 (3.9621248, 'B007JFMH8M'),
 (3.9590971, 'B000CPZSC8'),
 (3.9533169, 'B001E5DXEU')]

#### A look at the embeddings

In [68]:
get_prod_emb = Model(prod_in, prod_emb)
product_emb = np.squeeze(get_prod_emb.predict([top_prods]))
product_emb.shape

(1000, 50)

A 50-dimensional embedding (`n_factors`) is hard to visualize, so we use PCA to reduce it to its first 3 principal components

In [69]:
from sklearn.decomposition import PCA
pca = PCA(n_components=3)
prod_pca = pca.fit(product_emb.T).components_
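
Fitting on the transposed matrix treats each latent factor as a sample and each product as a feature, so every component holds one weight per product. A rough NumPy sketch of that idea on random stand-in data, using centering plus an SVD (signs may differ from scikit-learn's output):

```python
import numpy as np

rng = np.random.default_rng(0)
# Random stand-in for product_emb: 20 products, 5 latent factors
toy_emb = rng.normal(size=(20, 5))

# Treat each latent factor as a sample and each product as a feature
X = toy_emb.T
X_centered = X - X.mean(axis=0)

# The top right-singular vectors are the principal axes
_, _, vt = np.linalg.svd(X_centered, full_matrices=False)
toy_components = vt[:3]

print(toy_components.shape)  # one weight per product in each component
```

This is why a single row such as `prod_pca[0]` can be paired element by element with `top_prods`.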

1st component

In [70]:
fac0 = prod_pca[0]

In [81]:
prod_comp = [(f, i) for f,i in zip(fac0, top_prods)]
prod_comp = [(b, idx2productid[i]) for b,i in prod_comp]

In [83]:
sorted(prod_comp, key=itemgetter(0), reverse=True)[:10]

[(0.60545428443152505, 'B008FHUKE6'),
 (0.41063519914343832, 'B001VJ0B0I'),
 (0.34750848945200241, 'B007JT7AIA'),
 (0.31414815584528288, 'B006BXUYN8'),
 (0.064989991838264163, 'B004ZIER34'),
 (0.056133800991978462, 'B008EG58V8'),
 (0.047680678935849041, 'B0045XE32E'),
 (0.046080593948543941, 'B003VXHGPK'),
 (0.03824869631739318, 'B004JRO1S2'),
 (0.027721114763581352, 'B006Q7YG2O')]

In [84]:
sorted(prod_comp, key=itemgetter(0))[:10]

[(-0.36294352238737115, 'B004E4CCSQ'),
 (-0.1774650037264647, 'B004MO6NI8'),
 (-0.11689568065383477, 'B003JA5KBW'),
 (-0.10961930728285804, 'B002IEZJMA'),
 (-0.091150292897382107, 'B003G52BN0'),
 (-0.089676985723096861, 'B0026KPDG8'),
 (-0.074996153014948386, 'B001LG945O'),
 (-0.073189260165791803, 'B005VOOM2W'),
 (-0.06134599095948437, 'B004YV80O4'),
 (-0.050700649066207658, 'B0041NYV8E')]

### Second Model: Simple Neural Net

In [95]:
user_in, user_emb = embedding_input('user_in', n_users, n_factors, 1e-4)
prod_in, prod_emb = embedding_input('prod_in', n_products, n_factors, 1e-4)


In [96]:
from keras.layers import Dropout, Dense, concatenate

In [99]:
# Concatenate the user and product embeddings along the last axis
x = concatenate([user_emb, prod_emb])
x = Flatten()(x)
x = Dropout(0.3)(x)
x = Dense(70, activation='relu')(x)
x = Dropout(0.75)(x)
x = Dense(1)(x)

nn = Model([user_in, prod_in], x)
nn.compile(Adam(learning_rate), loss='mse', metrics=['accuracy'])


In [100]:
nn.fit([train.userId, train.productId], train.Score, batch_size=batch_size,
       epochs=epochs, validation_data=([val.userId, val.productId], val.Score))

Train on 455010 samples, validate on 113444 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7f7568579e10>

In [155]:
nn.save_weights('models/nn.h5')
# The with-block closes the file, so no explicit f.close() is needed
with open('models/nn.json', 'w') as f:
    f.write(nn.to_json())

### Test models

In [132]:
# Copy the slice so a prediction column can be added without SettingWithCopyWarning
test_ratings = val[:10].copy()
print(test_ratings)

    Score  userId  productId
Id                          
8       5       7          4
13      1      12          8
28      4      27          9
31      5      29         12
36      4      34         13
47      5      45         13
53      4      51         14
60      5      58         16
75      2      73         23
86      5      84         28


In [149]:
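# Separate names avoid overwriting the `users` array of unique ids defined earlier
test_users = test_ratings['userId'].values
test_prods = test_ratings['productId'].values

In [151]:
# predict returns shape (n, 1); flatten it to a 1-D column
test_ratings['preds'] = nn.predict([test_users, test_prods]).ravel()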


In [152]:
test_ratings

| Id | Score | userId | productId | preds |
|---|---|---|---|---|
| 8 | 5 | 7 | 4 | 4.478443 |
| 13 | 1 | 12 | 8 | 2.467163 |
| 28 | 4 | 27 | 9 | 4.170745 |
| 31 | 5 | 29 | 12 | 4.34627 |
| 36 | 4 | 34 | 13 | 4.19799 |
| 47 | 5 | 45 | 13 | 4.746248 |
| 53 | 4 | 51 | 14 | 4.097061 |
| 60 | 5 | 58 | 16 | 3.751348 |
| 75 | 2 | 73 | 23 | 4.182163 |
| 86 | 5 | 84 | 28 | 4.57179 |
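
Since the target is a numeric score, an error measure such as RMSE or MAE summarizes these predictions more usefully than eyeballing the rows. A small sketch using only the ten values shown above:

```python
import numpy as np

# Scores and predictions copied from the table above
scores = np.array([5, 1, 4, 5, 4, 5, 4, 5, 2, 5], dtype=float)
preds = np.array([4.478443, 2.467163, 4.170745, 4.34627, 4.19799,
                  4.746248, 4.097061, 3.751348, 4.182163, 4.57179])

errors = preds - scores
rmse = np.sqrt(np.mean(errors ** 2))  # penalizes large misses more heavily
mae = np.mean(np.abs(errors))         # average absolute miss in rating points
print('RMSE {:.3f}, MAE {:.3f}'.format(rmse, mae))
```

Ten rows are far too few to judge the model; the same calculation over the full validation set would give a more representative figure.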


## 2. Sentiment Analysis