# Introduction
Hello everyone, and welcome to this kernel. Here I classify Amazon food reviews by their star rating using a recurrent neural network (RNN), and I explain each step along the way. Before starting, let's take a look at the contents of this kernel.

# Notebook Content
1. Importing Libraries and The Data
1. Natural Language Processing
1. Training RNN Model
1. Evaluating Model
1. Conclusion

# Importing Libraries and The Data
In this section I import the libraries that I will use. Keras is the deep learning library, and the RNN is built from GRU layers, but you can try LSTM as well.

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt

from keras.models import Sequential
from keras.layers import GRU,Dense,Embedding
from tensorflow.python.keras.layers import CuDNNGRU

import re 
from tensorflow.python.keras.preprocessing.text import Tokenizer
from tensorflow.python.keras.preprocessing.sequence import pad_sequences

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/amazon-fine-food-reviews/database.sqlite
/kaggle/input/amazon-fine-food-reviews/hashes.txt
/kaggle/input/amazon-fine-food-reviews/Reviews.csv


In [2]:
data = pd.read_csv('/kaggle/input/amazon-fine-food-reviews/Reviews.csv')
data.head()

| | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | B001E4KFG0 | A3SGXH7AUHU8GW | delmartian | 1 | 1 | 5 | 1303862400 | Good Quality Dog Food | I have bought several of the Vitality canned d... |
| 1 | 2 | B00813GRG4 | A1D87F6ZCVE5NK | dll pa | 0 | 0 | 1 | 1346976000 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... |
| 2 | 3 | B000LQOCH0 | ABXLMWJIXXAIN | Natalia Corres "Natalia Corres" | 1 | 1 | 4 | 1219017600 | "Delight" says it all | This is a confection that has been around a fe... |
| 3 | 4 | B000UA0QIQ | A395BORC6FGVXV | Karl | 3 | 3 | 2 | 1307923200 | Cough Medicine | If you are looking for the secret ingredient i... |
| 4 | 5 | B006K2ZZ7K | A1UQRSCLF8GW1T | Michael D. Bigham "M. Wassir" | 0 | 0 | 5 | 1350777600 | Great taffy | Great taffy at a great price. There was a wid... |


* There are 10 columns in this dataset, but for this kernel we will only use the score and the review text.

In [3]:
x = data["Text"]
y = data["Score"]


* Now I will one-hot encode our scores. You probably know what one-hot encoding is, but I want to explain it anyway.

Assume that you have data with 3 different label classes, and the labels of the first 5 rows look like this:

                         1 
                         0
                         2
                         1
                         0

After one-hot encoding, each label becomes a row with a 1 in the column of its class and 0 everywhere else:

                    0 1 0
                    1 0 0 
                    0 0 1
                    0 1 0
                    1 0 0
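
The toy example above can be reproduced in a few lines. This is only a minimal sketch using NumPy; the names `labels`, `num_classes` and `one_hot` are made up for the illustration and are not used elsewhere in the kernel.

```python
import numpy as np

# Toy labels from the example above: 3 classes, 5 rows
labels = np.array([1, 0, 2, 1, 0])
num_classes = 3

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing the identity matrix with the labels encodes all rows at once
one_hot = np.eye(num_classes, dtype=int)[labels]
print(one_hot)
```

For the real scores I use `pd.get_dummies`, which does the same job and also keeps the class values (1 to 5) as column names.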
                    

In [4]:
y = pd.get_dummies(y)
y.head()

| | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 1 | 0 |
| 3 | 0 | 1 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 1 |


* Our y data is ready to use. Now we need to process our x.

# Natural Language Processing

In this section I am going to process our x data. I will follow these steps:

1. Cleaning the text
1. Lowercasing the text
1. Tokenizing
1. Padding


### Cleaning the text

In this section I am going to remove everything that is irrelevant, which here means every character that is not a letter or a digit. I will use the re module for this.

In [5]:
pattern = r"[^a-zA-Z0-9]"

x = [re.sub(pattern," ",text) for text in x]

x[:2]

['I have bought several of the Vitality canned dog food products and have found them all to be of good quality  The product looks more like a stew than a processed meat and it smells better  My Labrador is finicky and she appreciates this product better than  most ',
 'Product arrived labeled as Jumbo Salted Peanuts   the peanuts were actually small sized unsalted  Not sure if this was an error or if the vendor intended to represent the product as  Jumbo  ']
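
To see what the pattern does on a small input, here is a sketch on a made-up sentence (`sample` is hypothetical, not taken from the dataset). Every character that is not a letter or digit is replaced by a space, which is why the cleaned reviews above contain runs of spaces. Collapsing those runs is an optional extra step that the kernel itself does not perform; the tokenizer later splits on whitespace anyway.

```python
import re

pattern = r"[^a-zA-Z0-9]"

# Hypothetical review text, only for illustration
sample = "Great taffy!! 5/5, would buy again."

# Each non-alphanumeric character becomes exactly one space
cleaned = re.sub(pattern, " ", sample)
print(repr(cleaned))

# Optional: collapse repeated spaces and strip the ends
collapsed = " ".join(cleaned.split())
print(repr(collapsed))
```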

* Now let's lowercase everything.

In [6]:
x = [text.lower() for text in x]
x[:2]

['i have bought several of the vitality canned dog food products and have found them all to be of good quality  the product looks more like a stew than a processed meat and it smells better  my labrador is finicky and she appreciates this product better than  most ',
 'product arrived labeled as jumbo salted peanuts   the peanuts were actually small sized unsalted  not sure if this was an error or if the vendor intended to represent the product as  jumbo  ']

### Tokenizing
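Now I am going to convert words into integer indexes. To do this I will use the Keras Tokenizer with `num_words=5000`, so only the most frequent words (about 5,000 of them) are kept and the rest are dropped.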

In [7]:
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(x)

In [8]:
x_tokens = tokenizer.texts_to_sequences(x)
print(x_tokens[0])

[2, 18, 126, 324, 7, 1, 4776, 521, 102, 53, 207, 3, 18, 118, 30, 44, 6, 32, 7, 31, 184, 1, 40, 629, 50, 27, 4, 2621, 59, 4, 1179, 447, 3, 5, 619, 100, 13, 8, 1770, 3, 85, 9, 40, 100, 59, 142]


In [9]:
print(x_tokens[1])

[40, 375, 2195, 25, 1948, 1079, 1, 1079, 82, 256, 195, 1050, 3585, 19, 212, 39, 9, 21, 72, 3175, 33, 39, 1, 1568, 2206, 6, 1, 40, 25]
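
If you wonder where these numbers come from, the idea can be sketched in plain Python. This is a simplified illustration and not the actual Keras code: the helper names `fit_word_index` and `texts_to_sequences` and the toy texts are my own. Words are ranked by frequency, the most frequent word gets index 1 (0 is left free for padding), and words whose index is not below `num_words` are skipped.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index, num_words=None):
    """Replace each known word by its index, dropping rare or unknown words."""
    sequences = []
    for text in texts:
        seq = []
        for word in text.split():
            index = word_index.get(word)
            if index is None:
                continue  # word never seen while fitting
            if num_words is not None and index >= num_words:
                continue  # word is too rare to keep
            seq.append(index)
        sequences.append(seq)
    return sequences

# Hypothetical toy corpus, only for illustration
toy_texts = ["the food the dog", "the food is good", "the food dog"]
word_index = fit_word_index(toy_texts)
toy_sequences = texts_to_sequences(toy_texts, word_index, num_words=4)
print(word_index)
print(toy_sequences)
```

This also explains why small numbers such as 1, 2 and 3 appear so often in the sequences above: they belong to the most common words.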


* As you can see, each review has a different length, but the network expects every input to have the same shape. So we have to pad (or truncate) them to one fixed length.

In [10]:
max_len = max([len(text) for text in x_tokens])

mean_len = int(np.mean([len(text) for text in x_tokens]))
print("Maximum length of a text is {}".format(max_len))

print("Mean of length of the texts is {}".format(mean_len))

Maximum length of a text is 3116
Mean of length of the texts is 79


In [11]:
x_tokens_pad = pad_sequences(x_tokens,maxlen=mean_len)
x_tokens_pad.shape

(568454, 79)
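
Note that `pad_sequences` does two things with its default settings: shorter sequences get zeros added at the front, and longer sequences lose their first tokens. Because I pad to the mean length, every review longer than 79 tokens keeps only its last 79. The sketch below imitates that default behaviour with NumPy; `pad_pre` is a made-up helper for illustration, not a Keras function.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences and keep only the last maxlen tokens of long ones."""
    padded = np.full((len(sequences), maxlen), value, dtype=int)
    for row, seq in enumerate(sequences):
        if len(seq) == 0:
            continue  # an empty sequence stays all padding
        kept = seq[-maxlen:]              # truncate from the front
        padded[row, -len(kept):] = kept   # write the tokens at the end of the row
    return padded

# Hypothetical token sequences, only for illustration
toy = [[1, 2, 3, 4, 5], [6, 7]]
toy_padded = pad_pre(toy, maxlen=3)
print(toy_padded)
```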

In [12]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x_tokens_pad,y,test_size=0.2,random_state=1)

x_train,x_val,y_train,y_val = train_test_split(x_train,y_train,test_size=0.1,random_state=1)

del x_tokens_pad
del x_tokens

print("Shape of x_train is {}".format(x_train.shape))
print("Shape of x_val is {}".format(x_val.shape))
print("Shape of x_test is {}".format(x_test.shape))
print("Shape of y_train is {}".format(y_train.shape))
print("Shape of y_val is {}".format(y_val.shape))
print("Shape of y_test is {}".format(y_test.shape))

Shape of x_train is (409286, 79)
Shape of x_val is (45477, 79)
Shape of x_test is (113691, 79)
Shape of y_train is (409286, 5)
Shape of y_val is (45477, 5)
Shape of y_test is (113691, 5)
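
The shapes above follow from simple arithmetic: 20% of the data is held out for testing, then 10% of the remainder for validation, which leaves roughly 72% / 8% / 20% for train / validation / test. The sketch below redoes the calculation, assuming the held-out part is rounded up each time; that assumption reproduces the printed shapes.

```python
import math

n_samples = 568454

# First split: 20% of everything goes to the test set
n_test = math.ceil(n_samples * 0.2)
n_remaining = n_samples - n_test

# Second split: 10% of what is left goes to the validation set
n_val = math.ceil(n_remaining * 0.1)
n_train = n_remaining - n_val

print(n_train, n_val, n_test)
print([round(n / n_samples, 2) for n in (n_train, n_val, n_test)])
```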


# Deep Learning

In [13]:
NODE_SIZE = 256
NUM_CLASSES = 5

VOCAB_SIZE = 5000
VECTOR_SIZE = 100
# Must match the padded length of the inputs (mean_len), not max_len
TOKEN_SIZE = mean_len

model = Sequential()

model.add(Embedding(input_dim=VOCAB_SIZE,
                   output_dim = VECTOR_SIZE,
                   input_length = TOKEN_SIZE
                  ))

model.add(CuDNNGRU(NODE_SIZE,return_sequences=True))

model.add(CuDNNGRU(NODE_SIZE,return_sequences=True))

model.add(CuDNNGRU(NODE_SIZE,return_sequences=False))

model.add(Dense(NUM_CLASSES,activation="softmax"))

model.compile(optimizer="adam",loss="categorical_crossentropy",metrics=["accuracy"])

In [14]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 79, 100)           500000    
_________________________________________________________________
cu_dnngru (CuDNNGRU)         (None, 79, 256)           274944    
_________________________________________________________________
cu_dnngru_1 (CuDNNGRU)       (None, 79, 256)           394752    
_________________________________________________________________
cu_dnngru_2 (CuDNNGRU)       (None, 256)               394752    
_________________________________________________________________
dense (Dense)                (None, 5)                 1285      
Total params: 1,565,733
Trainable params: 1,565,733
Non-trainable params: 0
_________________________________________________________________
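
The parameter counts in the summary can be checked by hand. The sketch below assumes the usual CuDNN-style GRU layout: three weight blocks (update gate, reset gate, candidate state), each with an input kernel, a recurrent kernel and two bias vectors. The helper function names are my own. Notice that none of the counts depend on the sequence length.

```python
def embedding_params(vocab_size, vector_size):
    # One vector of length vector_size per word index
    return vocab_size * vector_size

def cudnn_gru_params(input_size, units):
    # 3 blocks, each with: input kernel + recurrent kernel + 2 bias vectors
    return 3 * (input_size * units + units * units + 2 * units)

def dense_params(input_size, units):
    # Weight matrix plus one bias per output unit
    return input_size * units + units

layer_params = [
    embedding_params(5000, 100),   # embedding
    cudnn_gru_params(100, 256),    # first GRU reads the 100-dim embeddings
    cudnn_gru_params(256, 256),    # second GRU reads the 256-dim GRU output
    cudnn_gru_params(256, 256),    # third GRU
    dense_params(256, 5),          # softmax over the 5 scores
]
total_params = sum(layer_params)
print(layer_params, total_params)
```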


In [15]:
model.fit(x_train,y_train,validation_data=(x_val,y_val),epochs=3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x7feb223ba810>

In [16]:
model.evaluate(x_test,y_test)



[0.5851007103919983, 0.7866234183311462]
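
The two numbers returned by `model.evaluate` are the test loss (about 0.585) and the test accuracy (about 78.7%), in the order given to `model.compile`. As a rough sketch of what they measure, here is the same calculation on two made-up predictions with 3 classes. The arrays are hypothetical and Keras applies some extra numerical safeguards, so treat this as an illustration only.

```python
import numpy as np

# Hypothetical one-hot targets and predicted probabilities (2 samples, 3 classes)
y_true = np.array([[0, 0, 1],
                   [1, 0, 0]])
y_prob = np.array([[0.1, 0.2, 0.7],
                   [0.3, 0.6, 0.1]])

# Categorical cross-entropy: minus the log-probability of the true class, averaged
loss = -np.mean(np.sum(y_true * np.log(y_prob), axis=1))

# Accuracy: how often the most probable class is the true class
accuracy = np.mean(np.argmax(y_prob, axis=1) == np.argmax(y_true, axis=1))

print(loss, accuracy)
```

Here the first prediction is correct and the second is wrong, so the accuracy is 0.5, while the loss also reflects how confident each prediction was.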