# Product Rating Prediction Using LSTMs

### 1. Preparing the data

In [1]:
import pandas as pd
import numpy as np

In [2]:
data = pd.read_json('data.json',lines=True)
data = data[['reviewText','overall']]
print(data.head())

                                          reviewText  overall
0  They look good and stick good! I just don't li...        4
1  These stickers work like the review says they ...        5
2  These are awesome and make my phone look so st...        5
3  Item arrived in great time and was in perfect ...        4
4  awesome! stays on, and looks great. can be use...        5


In [3]:
vocab = pd.read_csv('vocab.txt',sep=" ", header=None)
vocab = vocab[0].values
vocab = np.append(vocab,"UNKNOWN")
vocab = np.append(vocab,"ENDPAD")
print(vocab)

['a' 'abandon' 'ability' ... 'zone' 'UNKNOWN' 'ENDPAD']


In [12]:
word_map = {}
for index,value in enumerate(vocab):
    word_map[value] = index
n_words = vocab.shape[0]
print(n_words)

3002


In [20]:
# Format x_data: convert each review into a list of vocabulary ids
import nltk

x_data = []

def get_matrix_ids(s):
    """Tokenize a review and map each alphabetic token to its vocabulary id."""
    tokens = nltk.word_tokenize(s)
    tokens = [t.lower() for t in tokens if t.isalpha()]
    # Dictionary lookup avoids scanning the vocab array for every token;
    # words outside the vocabulary map to the UNKNOWN token
    return [word_map.get(t, word_map["UNKNOWN"]) for t in tokens]

for review in data['reviewText']:
    x_data.append(get_matrix_ids(review))

In [24]:
from keras.preprocessing.sequence import pad_sequences
x_data = pad_sequences(maxlen=65, sequences=x_data, padding="post", value=n_words - 1)

Using TensorFlow backend.


In [36]:
y_data = pd.get_dummies(data['overall']).values

In [38]:
print(x_data[1])
print(y_data[1])

[2696 3000 2968 1538 2685 2254 3000 2697  801 2697 2538 1815 1202  118
 2697 2533 1815 2685 1937 2697 3000 2597 3000  118 3000  376 2379 2688
 2957 1715 2431 3001 3001 3001 3001 3001 3001 3001 3001 3001 3001 3001
 3001 3001 3001 3001 3001 3001 3001 3001 3001 3001 3001 3001 3001 3001
 3001 3001 3001 3001 3001 3001 3001 3001 3001]
[0 0 0 0 1]


Our data is ready. For every training example, x is a sequence of 65 word ids (padded at the end with the ENDPAD id, 3001), and y is the corresponding one-hot encoded vector for the rating.
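To make the encoding concrete, here is a minimal sketch on a toy vocabulary. The names `toy_vocab`, `toy_word_map`, `encode` and `one_hot_rating` are hypothetical and not part of the notebook. For simplicity the sketch truncates from the end of a sequence, whereas `pad_sequences` by default drops tokens from the start of sequences that are too long.

```python
import numpy as np

# Toy vocabulary (hypothetical); the notebook's real one has 3002 entries
toy_vocab = ["good", "great", "look", "stickers", "UNKNOWN", "ENDPAD"]
toy_word_map = {word: index for index, word in enumerate(toy_vocab)}

def encode(tokens, maxlen):
    """Map tokens to ids, then truncate or right-pad with ENDPAD to maxlen."""
    ids = [toy_word_map.get(t, toy_word_map["UNKNOWN"]) for t in tokens]
    ids = ids[:maxlen]
    ids += [toy_word_map["ENDPAD"]] * (maxlen - len(ids))
    return np.array(ids)

def one_hot_rating(rating, n_classes=5):
    """Ratings run from 1 to 5, so rating r sets position r - 1."""
    vec = np.zeros(n_classes, dtype=int)
    vec[rating - 1] = 1
    return vec

# "awesome" is not in the toy vocabulary, so it becomes UNKNOWN (id 4)
x_example = encode(["stickers", "look", "awesome"], maxlen=6)
y_example = one_hot_rating(5)
print(x_example)  # [3 2 4 5 5 5]
print(y_example)  # [0 0 0 0 1]
```

Note that a 5-star rating sets the last position (index 4), which matches the `[0 0 0 0 1]` vector printed above.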

### 2. Building the model

In [42]:
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM

model = Sequential()
model.add(Embedding(input_dim=n_words, output_dim=100, input_length=65))
model.add(LSTM(units=100, recurrent_dropout=0.1))
model.add(Dense(5,activation='softmax'))

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

In [43]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 65, 100)           300200    
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               80400     
_________________________________________________________________
dense_1 (Dense)              (None, 5)                 505       
=================================================================
Total params: 381,105
Trainable params: 381,105
Non-trainable params: 0
_________________________________________________________________
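As a sanity check, the parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for an embedding table, an LSTM layer with biases, and a dense layer. The variable names are illustrative and the sizes are taken from the model definition above.

```python
n_words = 3002       # vocabulary size, including UNKNOWN and ENDPAD
embedding_dim = 100  # output_dim of the Embedding layer
lstm_units = 100     # units of the LSTM layer
n_classes = 5        # one output per rating

# Embedding: one vector of size embedding_dim per vocabulary entry
embedding_params = n_words * embedding_dim

# LSTM: four weight sets (input, forget and output gates plus the cell
# candidate), each with input weights, recurrent weights and a bias
lstm_params = 4 * (embedding_dim * lstm_units + lstm_units * lstm_units + lstm_units)

# Dense: weight matrix plus one bias per output class
dense_params = lstm_units * n_classes + n_classes

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 300200 80400 505 381105
```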


### 3. Training the model

In [44]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.2)

In [45]:
model.fit(x_train, y_train, batch_size=30, epochs=10, validation_split=0.1, verbose=2)

Train on 139995 samples, validate on 15556 samples
Epoch 1/10
 - 603s - loss: 1.0800 - acc: 0.5859 - val_loss: 0.9389 - val_acc: 0.6266
Epoch 2/10
 - 580s - loss: 0.9287 - acc: 0.6280 - val_loss: 0.9070 - val_acc: 0.6386
Epoch 3/10
 - 580s - loss: 0.8953 - acc: 0.6409 - val_loss: 0.8934 - val_acc: 0.6376
Epoch 4/10
 - 590s - loss: 0.8713 - acc: 0.6508 - val_loss: 0.8868 - val_acc: 0.6455
Epoch 5/10
 - 625s - loss: 0.8502 - acc: 0.6588 - val_loss: 0.8907 - val_acc: 0.6441
Epoch 6/10
 - 641s - loss: 0.8312 - acc: 0.6680 - val_loss: 0.8911 - val_acc: 0.6454
Epoch 7/10
 - 601s - loss: 0.8126 - acc: 0.6745 - val_loss: 0.8937 - val_acc: 0.6419
Epoch 8/10
 - 581s - loss: 0.7934 - acc: 0.6829 - val_loss: 0.9062 - val_acc: 0.6470
Epoch 9/10
 - 580s - loss: 0.7749 - acc: 0.6906 - val_loss: 0.9117 - val_acc: 0.6385
Epoch 10/10
 - 586s - loss: 0.7573 - acc: 0.6981 - val_loss: 0.9273 - val_acc: 0.6374


<keras.callbacks.History at 0x2090f466cc0>
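Reading the log, the training loss falls every epoch while the validation loss stops improving early on, which may indicate that the model starts to overfit. The sketch below copies the loss values from the log by hand into plain lists (the names `train_loss` and `val_loss` are illustrative) and finds the epoch with the lowest validation loss.

```python
# Loss values copied from the training log above, one entry per epoch
train_loss = [1.0800, 0.9287, 0.8953, 0.8713, 0.8502,
              0.8312, 0.8126, 0.7934, 0.7749, 0.7573]
val_loss = [0.9389, 0.9070, 0.8934, 0.8868, 0.8907,
            0.8911, 0.8937, 0.9062, 0.9117, 0.9273]

# Epochs are numbered from 1, list indices from 0
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
print("Lowest validation loss at epoch", best_epoch)  # epoch 4

# Gap between validation and training loss after the final epoch
final_gap = val_loss[-1] - train_loss[-1]
print("Final gap: %.4f" % final_gap)  # 0.1700
```

Under this reading, stopping around epoch 4 would likely have given a comparable validation loss at less than half the training time, though a single run is not enough to be sure.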

In [46]:
score,acc = model.evaluate(x_test,y_test, verbose = 2, batch_size = 15)
print("score: %.2f" % (score))
print("acc: %.2f" % (acc))

score: 0.94
acc: 0.63


In [64]:
def rate_sentence(s):
    w = get_matrix_ids(s)
    w = pad_sequences(maxlen=65, sequences=[w], padding="post", value=n_words - 1)
    output = model.predict([w])[0]
    # argmax returns a class index from 0 to 4, while ratings run from 1 to 5
    rating = np.argmax(output) + 1
    print("Predicted rating : " + str(rating))

In [65]:
rate_sentence("I love this product so much. I totally reccomend this")

Predicted rating : 5
