# Information

## Amazon data
* [Kaggle](https://www.kaggle.com/snap/amazon-fine-food-reviews)
* product info: GET amazon.com/dp/B00006HAXW
* [More data!](http://jmcauley.ucsd.edu/data/amazon/)

## Action Plan

Small Data sample:
1. Explore Data
2. Collaborative Filtering
3. Sentiment Analysis
4. Seq2Seq summarizer
5. Web interface

Large Data samples:
* Implement above pipeline

## Data Discovery

In [1]:
import pandas as pd
import numpy as np

In [2]:
data = pd.read_csv('data/Reviews.csv', index_col='Id')

In [3]:
data.head()

| Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text |
|---|---|---|---|---|---|---|---|---|---|
| 1 | B001E4KFG0 | A3SGXH7AUHU8GW | delmartian | 1 | 1 | 5 | 1303862400 | Good Quality Dog Food | I have bought several of the Vitality canned d... |
| 2 | B00813GRG4 | A1D87F6ZCVE5NK | dll pa | 0 | 0 | 1 | 1346976000 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... |
| 3 | B000LQOCH0 | ABXLMWJIXXAIN | Natalia Corres "Natalia Corres" | 1 | 1 | 4 | 1219017600 | "Delight" says it all | This is a confection that has been around a fe... |
| 4 | B000UA0QIQ | A395BORC6FGVXV | Karl | 3 | 3 | 2 | 1307923200 | Cough Medicine | If you are looking for the secret ingredient i... |
| 5 | B006K2ZZ7K | A1UQRSCLF8GW1T | Michael D. Bigham "M. Wassir" | 0 | 0 | 5 | 1350777600 | Great taffy | Great taffy at a great price. There was a wid... |


In [4]:
data.columns

Index(['ProductId', 'UserId', 'ProfileName', 'HelpfulnessNumerator',
       'HelpfulnessDenominator', 'Score', 'Time', 'Summary', 'Text'],
      dtype='object')

In [5]:
data.describe()

| | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time |
|---|---|---|---|---|
| count | 568454.0 | 568454.0 | 568454.0 | 568454.0 |
| mean | 1.743817 | 2.22881 | 4.183199 | 1296257000.0 |
| std | 7.636513 | 8.28974 | 1.310436 | 48043310.0 |
| min | 0.0 | 0.0 | 1.0 | 939340800.0 |
| 25% | 0.0 | 0.0 | 4.0 | 1271290000.0 |
| 50% | 0.0 | 1.0 | 5.0 | 1311120000.0 |
| 75% | 2.0 | 2.0 | 5.0 | 1332720000.0 |
| max | 866.0 | 923.0 | 5.0 | 1351210000.0 |
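
The `Time` column appears to hold Unix timestamps in seconds (an assumption based on the values; the notebook does not state the unit). Under that assumption, a minimal sketch of decoding two values taken from the outputs above, using only the standard library (`to_utc_date` is a hypothetical helper name):

```python
from datetime import datetime, timezone

def to_utc_date(ts):
    """Convert a Unix timestamp in seconds to a UTC calendar date."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).date()

first_row_time = 1303862400  # Time of review Id 1 in data.head()
earliest_time = 939340800    # min of Time in data.describe()

print(to_utc_date(first_row_time))  # 2011-04-27
print(to_utc_date(earliest_time))   # 1999-10-08
```

On the full frame, `pd.to_datetime(data['Time'], unit='s')` should perform the same conversion column-wise.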


In [6]:
from collections import Counter

In [7]:
product_counts = Counter(data['ProductId'])
print(product_counts.most_common(10))
print('Length: {}'.format(len(product_counts)))

[('B007JFMH8M', 913), ('B002QWP89S', 632), ('B0026RQTGE', 632), ('B002QWHJOU', 632), ('B002QWP8H0', 632), ('B003B3OOPA', 623), ('B001EO5Q64', 567), ('B0013NUGDE', 564), ('B007M83302', 564), ('B000VK8AVK', 564)]
Length: 74258


In [8]:
user_counts = Counter(data['UserId'])
print(user_counts.most_common(10))
print('Length: {}'.format(len(user_counts)))

[('A3OXHLG6DIBRW8', 448), ('A1YUL9PCJR3JTY', 421), ('AY12DBB0U420B', 389), ('A281NPSIMI1C2R', 365), ('A1Z54EM24Y40LL', 256), ('A1TMAVN4CEM8U8', 204), ('A2MUGFV2TDQ47K', 201), ('A3TVZM3ZIXG8YW', 199), ('A3PJZ8TU8FDQ1K', 178), ('AQQLWCMRNDFGI', 176)]
Length: 256059


## 1. Collaborative Filtering

### Basic filtering based on user score

In [9]:
ratings = data[['ProductId', 'UserId', 'Score']]

In [10]:
ratings.head()

| Id | ProductId | UserId | Score |
|---|---|---|---|
| 1 | B001E4KFG0 | A3SGXH7AUHU8GW | 5 |
| 2 | B00813GRG4 | A1D87F6ZCVE5NK | 1 |
| 3 | B000LQOCH0 | ABXLMWJIXXAIN | 4 |
| 4 | B000UA0QIQ | A395BORC6FGVXV | 2 |
| 5 | B006K2ZZ7K | A1UQRSCLF8GW1T | 5 |


In [11]:
users = ratings['UserId'].unique()
products = ratings['ProductId'].unique()

In [12]:
userid2idx = {o:i for i,o in enumerate(users)}
productid2idx = {o:i for i,o in enumerate(products)}
idx2userid = {i:o for i,o in enumerate(users)}
idx2productid = {i:o for i,o in enumerate(products)}

We remap the user and product ids to contiguous integers, which is what embedding layers expect as input.

In [13]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
ratings = ratings.assign(userId=ratings['UserId'].map(userid2idx),
                         productId=ratings['ProductId'].map(productid2idx))
ratings = ratings.drop(columns=['UserId', 'ProductId'])
ratings.head()

| Id | Score | userId | productId |
|---|---|---|---|
| 1 | 5 | 0 | 0 |
| 2 | 1 | 1 | 1 |
| 3 | 4 | 2 | 2 |
| 4 | 2 | 3 | 3 |
| 5 | 5 | 4 | 4 |
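
As an aside, pandas offers `pd.factorize`, which should produce the same kind of first-appearance integer coding as the hand-built dictionaries. A minimal sketch on a made-up toy frame (the ids and scores below are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy frame with the same raw id columns as the reviews data (values are made up).
toy = pd.DataFrame({
    'UserId': ['A3', 'A1', 'A3', 'A9'],
    'ProductId': ['B1', 'B2', 'B2', 'B1'],
    'Score': [5, 1, 4, 2],
})

# factorize assigns 0, 1, 2, ... in order of first appearance and also
# returns the unique values, which serve as the index -> raw id lookup.
user_codes, user_uniques = pd.factorize(toy['UserId'])
prod_codes, prod_uniques = pd.factorize(toy['ProductId'])

encoded = toy.assign(userId=user_codes, productId=prod_codes)
encoded = encoded.drop(columns=['UserId', 'ProductId'])

print(encoded)
print(user_uniques[2])  # raw id behind user index 2
```

The returned `user_uniques` and `prod_uniques` play the role of the index-to-id dictionaries, so a single call covers both directions of the mapping.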


In [14]:
def stats(column):
    """Print the min, max, and number of distinct values of a ratings column."""
    print('Column: {}, Min {}, Max {}, Unique {}'.format(column,
        ratings[column].min(), ratings[column].max(), ratings[column].nunique()))

In [15]:
stats('userId')
stats('productId')

Column: userId, Min 0, Max 256058, Unique 256059
Column: productId, Min 0, Max 74257, Unique 74258


### Prepare dataset

#### Hyperparameters

In [16]:
# number of latent factors
n_factors = 50
# learning rate
learning_rate = 0.001
# batch size
batch_size = 256
# number of epochs
epochs = 20

In [17]:
# Call the seeding function; assigning to np.random.seed would overwrite it and seed nothing
np.random.seed(42)

Randomly split the data into training (80%) and validation (20%) sets.

In [18]:
msk = np.random.rand(len(ratings)) < 0.8
train = ratings[msk]
val = ratings[~msk]

print('Training samples {} ({}), Validation samples {} ({})'.format(
    len(train), len(train)/len(ratings), len(val), len(val)/len(ratings)))

Training samples 454278 (0.7991464568812956), Validation samples 114176 (0.2008535431187044)
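
The mask-based split can be made reproducible with a seeded generator. A minimal sketch, assuming NumPy's `default_rng` API and a stand-in row count (`n_rows` and the seed value are illustrative choices, not the notebook's):

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded generator; the seed value is arbitrary
n_rows = 10_000                  # stand-in for len(ratings)

# Each row lands in the training set with probability 0.8, so the split
# is only approximately 80/20, as in the output above.
mask = rng.random(n_rows) < 0.8
train_idx = np.flatnonzero(mask)
val_idx = np.flatnonzero(~mask)

print(len(train_idx) / n_rows, len(val_idx) / n_rows)
```

Because membership is drawn independently per row, the realized fractions drift slightly from 0.8 and 0.2; an exact split would need a shuffled index cut at a fixed position instead.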


### First Model: Dot product
The most basic approach takes the dot product of a user embedding and a product embedding, then adds the user's and the product's bias terms.
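
As a rough illustration of that scoring rule, the predicted rating is the dot product of the two embedding vectors plus the two scalar biases. A minimal NumPy sketch with made-up numbers (the vectors, biases, and the `predict_rating` helper are illustrative, not learned weights or notebook code):

```python
import numpy as np

# Made-up 3-dimensional embeddings and biases (the notebook uses 50 factors).
u_vec = np.array([0.5, -1.0, 2.0])   # user embedding
p_vec = np.array([1.0, 0.5, 0.25])   # product embedding
b_user = 0.3                          # user bias
b_prod = 3.9                          # product bias

def predict_rating(u, p, b_u, b_p):
    """Score = <u, p> + b_u + b_p."""
    return float(np.dot(u, p) + b_u + b_p)

# Dot product is 0.5 - 0.5 + 0.5 = 0.5, so the score is roughly 4.7.
print(predict_rating(u_vec, p_vec, b_user, b_prod))
```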

In [19]:
from keras.layers import Input, Embedding, dot, Flatten, merge
from keras.regularizers import l2
from keras.models import Model
from keras.optimizers import Adam

Using TensorFlow backend.


In [20]:
n_users = ratings.userId.nunique()
n_products = ratings.productId.nunique()

In [21]:
def embedding_input(name, n_in, n_out, reg):
    inp = Input(shape=(1,), dtype='int64', name=name)
    emb = Embedding(n_in, n_out, input_length=1, W_regularizer=l2(reg))(inp)
    return inp, emb

In [22]:
user_in, user_emb = embedding_input('user_in', n_users, n_factors, 1e-4)
prod_in, prod_emb = embedding_input('prod_in', n_products, n_factors, 1e-4)



In [23]:
def create_bias(inp, n_in):
    x = Embedding(n_in, 1, input_length=1)(inp)
    return Flatten()(x)

In [24]:
user_bias = create_bias(user_in, n_users)
prod_bias = create_bias(prod_in, n_products)

In [25]:
x = merge([user_emb, prod_emb], mode='dot')
x = Flatten()(x)
x = merge([x, user_bias], mode='sum')
x = merge([x, prod_bias], mode='sum')

model = Model([user_in, prod_in], x)
model.compile(Adam(learning_rate), loss='mse', metrics=['accuracy'])


In [26]:
model.fit([train.userId, train.productId], train.Score, batch_size=batch_size,
          epochs=epochs, validation_data=([val.userId, val.productId], val.Score))

Train on 454278 samples, validate on 114176 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7f5c48577748>

In [27]:
model.save_weights('models/dot.h5')
with open('models/dot.json', 'w') as f:
    f.write(model.to_json())

### Analyze Results

In [28]:
model.load_weights('models/dot.h5')

We'll restrict the analysis to the 1000 most-reviewed products.

In [29]:
g = ratings.groupby('productId')['Score'].count()
top_prods = g.sort_values(ascending=False)[:1000]
top_prods = np.array(top_prods.index)

#### A look at the product bias term

In [30]:
get_prod_bias = Model(prod_in, prod_bias)
product_bias = get_prod_bias.predict(top_prods)
prod_scores = [(b[0], i) for i,b in zip(top_prods, product_bias)]

##### Top and bottom scores (products)

In [31]:
from operator import itemgetter

In [32]:
# Bottom
prod_scores = [(b, idx2productid[i]) for b,i in prod_scores]
sorted(prod_scores, key=itemgetter(0))[:15]

[(1.3422005, 'B006N3I69A'),
 (1.5084307, 'B000X1Q1G8'),
 (1.7365083, 'B003JA5KBW'),
 (1.7572726, 'B007RTR9DS'),
 (1.8644536, 'B007RTR9G0'),
 (1.906934, 'B0041NYV8E'),
 (2.0875323, 'B006BXV176'),
 (2.1724749, 'B006MONQMC'),
 (2.1793909, 'B003YBLF2E'),
 (2.2242985, 'B005CUU25G'),
 (2.2535112, 'B008O3G2GG'),
 (2.2827082, 'B005GYULZY'),
 (2.2875962, 'B004E4HUMY'),
 (2.2976708, 'B005GBIXZM'),
 (2.3044803, 'B004U49QU2')]

In [33]:
# Top
sorted(prod_scores, key=itemgetter(0), reverse=True)[:15]

[(4.3596163, 'B007R900WA'),
 (4.3412876, 'B000O5DI1E'),
 (4.298192, 'B003KRHDMI'),
 (4.2583861, 'B000CPZSC8'),
 (4.2534466, 'B000NMJWZO'),
 (4.231967, 'B001E5DXEU'),
 (4.2232647, 'B005BRHVD6'),
 (4.2176037, 'B003QDRJXY'),
 (4.2129188, 'B003B3OOPA'),
 (4.1953998, 'B000ET4SM8'),
 (4.1953926, 'B007JFMH8M'),
 (4.1851988, 'B000ED9L9E'),
 (4.1765814, 'B001EQ5JLE'),
 (4.166079, 'B000DZDJ0K'),
 (4.1578732, 'B000E1HVR0')]

#### A look at the embeddings

In [34]:
get_prod_emb = Model(prod_in, prod_emb)
product_emb = np.squeeze(get_prod_emb.predict([top_prods]))
product_emb.shape

(1000, 50)

Fifty-dimensional embeddings (`n_factors = 50`) are hard to visualize, so we use PCA to reduce them to just 3 principal components.

In [35]:
from sklearn.decomposition import PCA
pca = PCA(n_components=3)
prod_pca = pca.fit(product_emb.T).components_
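
Note that PCA is fitted on the transposed matrix, so each of the 3 components carries one value per product. A minimal sketch of the same computation using NumPy's SVD instead of scikit-learn, on random stand-in data (the `emb` array is made up, not the trained embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 50))  # random stand-in for product_emb

X = emb.T                          # (50, 1000): factors as samples, products as features
X_centered = X - X.mean(axis=0)    # PCA centers each feature (here: each product)

# Rows of Vt are the principal axes, ordered by decreasing singular value.
_, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
components = Vt[:3]

print(components.shape)  # (3, 1000)
```

The rows should match scikit-learn's `components_` up to sign, since the sign of each principal axis is arbitrary.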

1st component

In [36]:
fac0 = prod_pca[0]

In [37]:
prod_comp = [(f, i) for f,i in zip(fac0, top_prods)]
prod_comp = [(b, idx2productid[i]) for b,i in prod_comp]

In [38]:
sorted(prod_comp, key=itemgetter(0), reverse=True)[:10]

[(0.81157336539207381, 'B001OCKIBY'),
 (0.12140957400438074, 'B004YV80OE'),
 (0.099166830761096975, 'B005A1LJ04'),
 (0.096789230912546667, 'B001EQ55ZO'),
 (0.074578554655234247, 'B005VOOL00'),
 (0.062280536609135295, 'B00503DOWS'),
 (0.05402654935495272, 'B005HG9ESG'),
 (0.033766231225774622, 'B001PMDYV4'),
 (0.027205268094518981, 'B005ZBZLT4'),
 (0.021924805409853341, 'B005VOONGM')]

In [39]:
sorted(prod_comp, key=itemgetter(0))[:10]

[(-0.27075668851049217, 'B003BJZMSM'),
 (-0.26855674959117848, 'B0061IULW2'),
 (-0.24111692145616787, 'B005VOOLXM'),
 (-0.23805374247042008, 'B0039ZOZ86'),
 (-0.092244599300887867, 'B0041NYV8E'),
 (-0.080101059646441458, 'B001BOQ3SW'),
 (-0.065068065654438997, 'B005VOOKS8'),
 (-0.057201845010192674, 'B001BM01BE'),
 (-0.033232452504461968, 'B001181NBA'),
 (-0.032807736732446101, 'B004T3QMD8')]

### Second Model: Simple Neural Net

In [40]:
user_in, user_emb = embedding_input('user_in', n_users, n_factors, 1e-4)
prod_in, prod_emb = embedding_input('prod_in', n_products, n_factors, 1e-4)



In [41]:
from keras.layers import Dropout, Dense

In [42]:
x = merge([user_emb, prod_emb], mode='concat')
x = Flatten()(x)
x = Dropout(0.3)(x)
x = Dense(70, activation='relu')(x)
x = Dropout(0.75)(x)
x = Dense(1)(x)

nn = Model([user_in, prod_in], x)
nn.compile(Adam(learning_rate), loss='mse', metrics=['accuracy'])
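
To make the architecture concrete, here is a minimal NumPy sketch of the forward pass at prediction time, when dropout acts as the identity. All weights are random stand-ins and the names `W1`, `b1`, `W2`, `b2`, and `forward` are hypothetical; this is not how Keras stores or evaluates the model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_factors, n_hidden = 50, 70

# Random stand-ins for learned parameters (illustrative only).
user_vec = rng.normal(size=n_factors)
prod_vec = rng.normal(size=n_factors)
W1 = rng.normal(size=(2 * n_factors, n_hidden)) * 0.1
b1 = np.zeros(n_hidden)
W2 = rng.normal(size=(n_hidden, 1)) * 0.1
b2 = np.zeros(1)

def forward(u, p):
    """Concatenate both embeddings, apply Dense(70, relu), then Dense(1)."""
    x = np.concatenate([u, p])        # shape (100,)
    h = np.maximum(x @ W1 + b1, 0.0)  # hidden layer with ReLU
    return float((h @ W2 + b2)[0])    # linear output: the predicted score

print(forward(user_vec, prod_vec))
```

Unlike the dot-product model, the dense layers let the network learn interactions between user and product factors that are not a plain inner product.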


In [43]:
nn.fit([train.userId, train.productId], train.Score, batch_size=batch_size,
       epochs=epochs, validation_data=([val.userId, val.productId], val.Score))

Train on 454278 samples, validate on 114176 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7f5ba4e2de80>

In [44]:
nn.save_weights('models/nn.h5')
with open('models/nn.json', 'w') as f:
    f.write(nn.to_json())

### Test models

In [45]:
test_ratings = val[:10]
print(test_ratings)

    Score  userId  productId
Id                          
3       4       2          2
8       5       7          4
10      5       9          6
16      5      15          9
17      2      16          9
18      5      17          9
22      5      21          9
30      5      10         11
33      4      31         13
35      5      33         13


In [46]:
users = test_ratings['userId'].values
prods = test_ratings['productId'].values

In [47]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
test_ratings = test_ratings.assign(preds=nn.predict([users, prods]).ravel())

In [48]:
test_ratings

| Id | Score | userId | productId | preds |
|---|---|---|---|---|
| 3 | 4 | 2 | 2 | 4.534613 |
| 8 | 5 | 7 | 4 | 4.450095 |
| 10 | 5 | 9 | 6 | 4.910298 |
| 16 | 5 | 15 | 9 | 4.48889 |
| 17 | 2 | 16 | 9 | 3.086109 |
| 18 | 5 | 17 | 9 | 4.48889 |
| 22 | 5 | 21 | 9 | 4.83528 |
| 30 | 5 | 10 | 11 | 4.936732 |
| 33 | 4 | 31 | 13 | 4.237014 |
| 35 | 5 | 33 | 13 | 4.334737 |
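
Since the target is a continuous score, error magnitudes are one natural way to summarize these predictions. A minimal sketch computing the mean absolute error and root mean squared error over the ten rows shown above, using only the standard library; ten rows are far too few to judge the model, so this only illustrates the calculation.

```python
import math

# (Score, preds) pairs copied from the table above.
rows = [
    (4, 4.534613), (5, 4.450095), (5, 4.910298), (5, 4.48889), (2, 3.086109),
    (5, 4.48889), (5, 4.83528), (5, 4.936732), (4, 4.237014), (5, 4.334737),
]

errors = [pred - score for score, pred in rows]
mae = sum(abs(e) for e in errors) / len(errors)
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))

print('MAE {:.3f}, RMSE {:.3f}'.format(mae, rmse))
```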


## 2. Sentiment Analysis