<a href="https://colab.research.google.com/github/jawaluke/DL-NLP-natural-language-processing-/blob/master/Tensorflow_sentiment_analysis_amazon_review.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# enable the GPU runtime before running this notebook
# import dependencies

import tensorflow as tf
from tensorflow import keras
import os
import tensorflow_datasets as tfds

# **Classifying Amazon reviews as positive or negative**

#**Using TensorFlow and Keras for NLP classification**

#**First enable the GPU, then check its properties with the code below**

In [None]:
!nvidia-smi

In [None]:
# checking the version

print(tf.__version__)

#**Loading the dataset from TensorFlow Datasets**

In [None]:
# load the dataset together with its metadata

datasets, info = tfds.load("amazon_us_reviews/Mobile_Electronics_v1_00", with_info=True)

# the full dataset is stored under the "train" key

train_data = datasets["train"]



info - metadata about the dataset (useful for exploring it)

train - the full dataset; it comes without a predefined split

Running the code below prints the dataset info, which lists every field stored for each Amazon review.

We only need the review text and the star rating.

In [None]:
# info of the datasets

info


We will use the review text and the star rating. First, look at the dataset object.


In [None]:
# print the dataset object to see its structure

print(train_data)


In [None]:
# number of examples (this iterates over the whole dataset, so it is slow)

len(list(train_data))


In [None]:
# setting the batch size and buffer size

BATCH_SIZE = 128
BUFFER_SIZE = 30000


#**Shuffle the dataset**

The order of the reviews carries no information we need, so we shuffle them.

In [None]:
# shuffle once; reshuffle_each_iteration=False keeps the same order on every pass

train_data = train_data.shuffle(BUFFER_SIZE, reshuffle_each_iteration=False)
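
As a rough illustration of what a shuffle buffer does, here is a minimal pure-Python sketch. It is not TensorFlow's implementation; the function name buffered_shuffle and the seed parameter are made up for this example. It assumes the usual description of buffered shuffling: keep buffer_size elements in memory, emit one of them at random, and replace it with the next element from the stream.

```python
import random

def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in shuffled order using a fixed-size buffer (illustrative only)."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # still filling the buffer
            buffer.append(item)
            continue
        # buffer is full: emit a random element and put the new item in its slot
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # stream exhausted: drain what is left in random order
    rng.shuffle(buffer)
    yield from buffer

shuffled = list(buffered_shuffle(range(10), buffer_size=3, seed=0))
print(shuffled)
```

In this sketch an element can only move a limited distance forward, so when the buffer is smaller than the dataset the result is only an approximately uniform shuffle.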

In TensorFlow you have to iterate over a dataset to display its elements, for example:


In [None]:
# iterate over the dataset to display a couple of elements

for reviews in train_data.take(2):
  print(reviews)
# every field is a tensor

Extracting only the review text

In [None]:
for reviews in train_data.take(4):
  # index the nested dictionary (much like JSON) to print only the review text

  print(reviews["data"]["review_body"])

To get the plain value out of a tensor, use numpy().

Here we display each review text together with its corresponding rating.

Ratings range from 1 to 5, but we want binary labels, 0 and 1,

so we use 3 as the threshold:

0 - negative reviews  ( rating <= 3 )

1 - positive reviews  ( rating > 3 )
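
The thresholding rule can be written as a small plain-Python function. This is only a sketch for checking the rule by hand; the name to_binary_label is hypothetical, and the notebook itself applies the same comparison with tf.where.

```python
def to_binary_label(star_rating, threshold=3):
    """Return 1 for ratings above the threshold, otherwise 0."""
    return 1 if star_rating > threshold else 0

# show the label for every possible star rating
for stars in range(1, 6):
    print(stars, "->", to_binary_label(stars))
```

Note that with a strict "greater than" comparison, a 3-star review ends up with label 0.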

---



In [None]:
# turn each rating into a binary label ( 0 = negative, 1 = positive ) using the threshold

for reviews in train_data.take(5):
  print(reviews["data"].get("review_body").numpy())

  print(reviews["data"].get("star_rating"))

  # apply the condition: 1 if the rating is above 3, otherwise 0

  print(tf.where(reviews["data"].get("star_rating") > 3, 1, 0).numpy())

  print("\n\n")


# **Data Preprocessing**

Next we tokenize the words in the reviews.

In [None]:
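# import the tokenizer and padding utilities
# (taken from tensorflow.keras so they match the keras module imported above)

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences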



In [None]:
# considering the vocab_size

vocab_size = 73738

We create a sentences list and a labels list, and append the review texts and ratings to them.

sentences ----> review texts

labels -------> ratings ( 0 or 1 )

In [None]:
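def sep_text_reviews(train_data):
  """Collect the review texts and their binary labels into two Python lists."""
  sentences = []
  labels = []
  for reviews in train_data:
    # decode the review bytes into a Python string
    sentences.append(reviews["data"].get("review_body").numpy().decode("utf-8"))
    # label is 1 when the rating is above 3, otherwise 0
    labels.append(tf.where(reviews["data"].get("star_rating") > 3, 1, 0).numpy())
  return sentences, labels

# call the function

sentences, labels = sep_text_reviews(train_data)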


In [None]:
len(labels)

In [None]:
len(sentences)

Now all the reviews and ratings are stored in lists.

In [None]:
for i in range(0,3):
  print(sentences[i])
  print(labels[i])
  print("\n\n")

#**Train and Test split**

The dataset contains 104975 reviews.

We split it into:

training ---> 85000

testing ----> 104975 - 85000 = 19975

In [None]:
import numpy as np

training_size = 85000

training_data = np.array(sentences[0:training_size])
training_labels = np.array(labels[0:training_size])

testing_data = np.array(sentences[training_size:])
testing_labels = np.array(labels[training_size:])

training_data.shape , training_labels.shape, testing_data.shape, testing_labels.shape

Tokenizing the reviews into word tokens

In [None]:
# fit the tokenizer on the training data only; unseen words map to the <OOV> token
tokenizer = Tokenizer(num_words=vocab_size, oov_token="<OOV>")
tokenizer.fit_on_texts(training_data)

word_index = tokenizer.word_index


The code below shows the word index:

every word in the review texts has been assigned a number.

In [None]:
word_index.items()
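
To make the numbering concrete, here is a minimal sketch of how such a word index can be built by hand. It is a simplification, not the Keras Tokenizer itself: it assumes lowercasing and whitespace splitting only, reserves index 1 for the out-of-vocabulary token, and numbers the remaining words from most to least frequent. The names build_word_index and texts_to_ids are hypothetical.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Number words by descending frequency, with the OOV token at index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    for word, _ in counts.most_common():
        word_index[word] = len(word_index) + 1
    return word_index

def texts_to_ids(texts, word_index, oov_token="<OOV>"):
    """Replace each word by its number; unknown words get the OOV number."""
    oov_id = word_index[oov_token]
    return [[word_index.get(w, oov_id) for w in t.lower().split()] for t in texts]

toy_reviews = ["great phone great price", "bad phone"]
toy_index = build_word_index(toy_reviews)
print(toy_index)

# "charger" was never seen, so it is mapped to the OOV index
toy_ids = texts_to_ids(["great bad charger"], toy_index)
print(toy_ids)
```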

Using this index, each sentence is converted into a sequence of word numbers.

In [None]:
training_sequences = tokenizer.texts_to_sequences(training_data)

In [None]:

training_sequences[1]

We need the maximum review length over the whole training set.

In [None]:
max_sequence_len = max([len(x) for x in training_sequences])
max_sequence_len

Padding brings every sentence to the same length, the maximum review length.

eg:

 with a maximum review length of 2943,

 every shorter sentence is brought to that length by adding 0s in front

In [None]:
training_padded = pad_sequences(training_sequences, maxlen=max_sequence_len, truncating="post")

In [None]:
training_padded[1]
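
What padding does can be pictured with plain Python lists. The sketch below is an illustration, not the pad_sequences implementation: the helper pad_front is hypothetical, and it assumes zeros are added in front and that overly long sequences are cut at the end.

```python
def pad_front(sequences, maxlen):
    """Left-pad each sequence with zeros up to maxlen, cutting extra tokens at the end."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]                        # drop tokens past maxlen
        padded.append([0] * (maxlen - len(seq)) + seq)  # fill the front with zeros
    return padded

toy_sequences = [[5, 8], [3, 9, 4, 7], [6]]
toy_maxlen = max(len(s) for s in toy_sequences)
toy_padded = pad_front(toy_sequences, toy_maxlen)
for row in toy_padded:
    print(row)
```

Every row now has the same length, which is what lets the reviews be stacked into a single array for the model.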

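The testing sentences are processed the same way

In [None]:
# reuse the tokenizer fitted on the training data and the same padding settings
testing_sequences = tokenizer.texts_to_sequences(testing_data)
testing_padded = pad_sequences(testing_sequences, maxlen=max_sequence_len, truncating="post")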

#**Define our model**

- We use a bidirectional LSTM for our model ([LSTM documentation](https://keras.io/api/layers/recurrent_layers/lstm))

In [None]:
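# define the model

model = keras.Sequential([
    # map each word number to a 16-dimensional vector
    keras.layers.Embedding(vocab_size, 16, input_length=max_sequence_len),
    # read the sequence in both directions, with 32 LSTM units per direction
    keras.layers.Bidirectional(keras.layers.LSTM(32)),
    keras.layers.Dense(24, activation='relu'),
    # single sigmoid output between 0 (negative) and 1 (positive)
    keras.layers.Dense(1, activation='sigmoid')
])

model.compile(
    loss="binary_crossentropy",
    optimizer="adam",
    metrics=["accuracy"]
)
model.summary()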

#**Training our model**

- Training takes a while once it starts.

- If you have plans for the next 30 minutes, go ahead and come back later.

In [31]:
model.fit(training_padded, training_labels, epochs = 12, batch_size = 128, validation_data= (testing_padded, testing_labels))

Epoch 1/12
Epoch 2/12
Epoch 3/12
Epoch 4/12
Epoch 5/12
Epoch 6/12
Epoch 7/12
Epoch 8/12
Epoch 9/12
Epoch 10/12
Epoch 11/12
Epoch 12/12


<tensorflow.python.keras.callbacks.History at 0x7f5871cdf748>

Our model is overfitting the training data, but the validation accuracy is still increasing, which is a good sign.

If you have more time, increase the number of epochs (15, 20, 50, ...).


#**Save the model**

In [32]:
model.save("amazon_reviews.h5")

#**Loading our model**

I will try to provide another notebook that loads the saved model,

so that you can skip the training time.

(If you want to train it yourself, go for it and try to increase the validation accuracy.)



In [33]:
model_path = "/content/amazon_reviews.h5"

In [34]:
from tensorflow.keras.models import load_model

In [35]:
model = load_model(model_path)

In [36]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 2917, 16)          1179808   
_________________________________________________________________
bidirectional (Bidirectional (None, 64)                12544     
_________________________________________________________________
dense (Dense)                (None, 24)                1560      
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 25        
Total params: 1,193,937
Trainable params: 1,193,937
Non-trainable params: 0
_________________________________________________________________
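
The parameter counts in the summary can be checked by hand. The sketch below uses the standard counting formulas for embedding, LSTM and dense layers (an LSTM has four gates, each with input weights, recurrent weights and a bias); the variable names are chosen only for this example.

```python
vocab_size, embed_dim, lstm_units = 73738, 16, 32

# one 16-dimensional vector per vocabulary entry
embedding = vocab_size * embed_dim

# 4 gates, each with (input + recurrent + bias) weights per unit
one_lstm = 4 * ((embed_dim + lstm_units + 1) * lstm_units)
# forward and backward LSTM
bidirectional = 2 * one_lstm

# the dense layer sees the 64 concatenated LSTM outputs
dense = (2 * lstm_units) * 24 + 24
dense_1 = 24 * 1 + 1

total = embedding + bidirectional + dense + dense_1
print(embedding, bidirectional, dense, dense_1, total)
# prints: 1179808 12544 1560 25 1193937
```

Almost all of the parameters sit in the embedding layer, so the vocabulary size dominates the model size.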


#**Predicting the reviews**

For prediction:

- first, we need the tokenizer to convert the input text into a vector of numbers

- then we pad it and run the prediction

In [37]:
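def prediction(model, text):
  # convert the input text to a padded sequence, the same way as the training data
  sequences = tokenizer.texts_to_sequences([text])
  padded = pad_sequences(sequences, maxlen=max_sequence_len, truncating="post")
  # the model returns one sigmoid score per input text
  reviews = model.predict(padded)

  # the printed percentage is the model's score in both branches
  if reviews[0][0] > 0.5:
    print("it is a positive percent : ",str(reviews[0][0]*100)[:3])
  else:
    print("it is a negative percent : ",str(reviews[0][0]*100)[:3])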
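
The decision rule inside the function can be isolated from the model for a quick check. This sketch assumes, as the sigmoid output layer suggests, that the score is the model's estimate that the review is positive, so the score for the negative class is one minus it. The helper describe_score is hypothetical and is not part of the notebook.

```python
def describe_score(score, threshold=0.5):
    """Map a sigmoid score to a label and a confidence percentage for that label."""
    if score > threshold:
        return "positive", round(score * 100, 1)
    # for a negative label, report the complement of the positive score
    return "negative", round((1 - score) * 100, 1)

for s in (0.11, 0.5, 0.99):
    print(s, describe_score(s))
```

Under this assumption a low percentage printed next to "negative" by the function above means the model is fairly sure the review is negative.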





Now call the function with some input text.


In [49]:

text = "i don't know why i buy this"
prediction(model, text)
print(text)

it is a negative percent :  11.
i don't know why i buy this


Wow, a negative review. Let's try a more complicated input text.


In [51]:
text = " waste product very much disappointing"
prediction(model, text)
print(text)

it is a negative percent :  0.0
 waste product very much disappointing


Okay, let's see a positive review.

In [59]:
text = "recently i got my laptop and it perform super cool"
prediction(model, text)
print(text)

it is a positive percent :  99.
recently i got my laptop and it perform super cool
