<a href="https://colab.research.google.com/github/mintusf/Sentiment-analysis/blob/master/Sentiment_analysis_model_training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

To download the dataset and the GloVe embeddings, you need a Kaggle account.
Download kaggle.json from My account -> API -> Create New API Token.

To run the notebook without crashing, I recommend switching the Colab runtime to the maximum available RAM (25 GB).

**Downloading data**

In [1]:
from google.colab import files

Choose the kaggle.json file that you have downloaded

In [4]:
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"filipmintus","key":"<redacted>"}'}

Configure the token file so that the Kaggle CLI can use it in Google Colab.

In [5]:
!ls -lha kaggle.json
!mkdir -p ~/.kaggle 
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

-rw-r--r-- 1 root root 67 Dec 11 11:29 kaggle.json


Download and extract the dataset (Amazon reviews).

The dataset includes 3,600,000 Amazon product reviews in the training set and 400,000 reviews in the test set.

In [8]:
!kaggle datasets download bittlingmayer/amazonreviews
!unzip amazonreviews.zip

Downloading amazonreviews.zip to /content
 98% 481M/493M [00:04<00:00, 110MB/s] 
100% 493M/493M [00:04<00:00, 111MB/s]


Extracting examples and labels

1. Opening Bz2 file
2. Decoding byte type
3. Extracting label from a string

In [10]:
import bz2
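with bz2.open("test.ft.txt.bz2", "rb") as f:
  # Decompress data from file
  content = f.read()
# Bytes -> str (UTF-8 is the default codec)
text = content.decode()
# Every record starts with '__label__'; the piece before the first marker is empty
text = text.split('__label__')[1:]
# The label is the single digit right after the marker (1 = negative, 2 = positive)
Y_test = [a[0] for a in text]
# The review text starts after the digit and the space that follows it
X_test = [a[2:] for a in text]
print(len(X_test)==len(Y_test))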

True
2 Useful for remodels: I recently remodeled my house and these came in extremely useful for boring through studs for electrical and plumbing... as well as prepping doors for hardware.

Useful for remodels: I recently remodeled my house and these came in extremely useful for boring through studs for electrical and plumbing... as well as prepping doors for hardware.

2


In [11]:
import bz2
with bz2.open("train.ft.txt.bz2", "rb") as f:
  # Decompress data from file
  content = f.read()
# Same steps as for the test file: decode, split on the marker, separate label and text
text = content.decode()
text = text.split('__label__')[1:]
Y_train = [a[0] for a in text]
X_train = [a[2:] for a in text]
print(len(X_train)==len(Y_train))

True
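
The three parsing steps can be checked on a tiny in-memory sample before running them on the full files. The sketch below is only an illustration: the two reviews are made up, `parse_reviews` is a hypothetical helper name, and the trailing newline is stripped here although the cells above keep it.

```python
import bz2

# Two made-up reviews in the same "__label__<digit> <text>" layout (hypothetical data)
sample = b"__label__2 Great: works well\n__label__1 Poor: broke fast\n"
compressed = bz2.compress(sample)

def parse_reviews(raw_bytes):
    """Decode the bytes, split on the label marker, separate label from text."""
    text = raw_bytes.decode("utf-8")
    records = text.split("__label__")[1:]        # piece before the first marker is empty
    labels = [r[0] for r in records]             # single digit right after the marker
    reviews = [r[2:].strip() for r in records]   # skip digit and space, drop the newline
    return reviews, labels

reviews, labels = parse_reviews(bz2.decompress(compressed))
print(labels)
print(reviews)
```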


Downloading the 100-dimensional GloVe embeddings trained on 6 billion tokens.

In [14]:
!kaggle datasets download terenceliu4444/glove6b100dtxt
!unzip glove6b100dtxt

Downloading glove6b100dtxt.zip to /content
 90% 118M/131M [00:01<00:00, 70.9MB/s]
100% 131M/131M [00:01<00:00, 97.3MB/s]


A function that builds the look-up dictionaries: word to index, index to word, and word to embedding vector.

In [0]:
import numpy as np

def read_glove_vecs(glove_file):
    # Each line holds one word followed by the components of its vector
    with open(glove_file, 'r', encoding='utf-8') as f:
        words = set()
        word_to_vec_map = {}
        for line in f:
            line = line.strip().split()
            curr_word = line[0]
            words.add(curr_word)
            word_to_vec_map[curr_word] = np.array(line[1:], dtype=np.float64)
    # Indices start at 1, so that 0 stays free for padding and unknown words
    words_to_index = {}
    index_to_words = {}
    for i, w in enumerate(sorted(words), start=1):
        words_to_index[w] = i
        index_to_words[i] = w
    return words_to_index, index_to_words, word_to_vec_map

In [0]:
import numpy as np
words_to_index, index_to_words, word_to_vec_map = read_glove_vecs('glove.6B.100d.txt')
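
To see what the three dictionaries look like, the same idea can be tried on a toy input. This is a minimal sketch, not the function above: `build_lookups` is a hypothetical name, the three "embeddings" are invented two-dimensional values, and plain lists stand in for NumPy arrays.

```python
import io

# Toy stand-in for a GloVe file: word followed by its vector components (made-up values)
toy_glove = io.StringIO("the 0.1 0.2\ncat 0.3 0.4\nsat 0.5 0.6\n")

def build_lookups(lines):
    word_to_vec = {}
    for line in lines:
        parts = line.strip().split()
        word_to_vec[parts[0]] = [float(v) for v in parts[1:]]
    # Words are indexed alphabetically, starting at 1 (0 is left unused)
    word_to_index = {w: i for i, w in enumerate(sorted(word_to_vec), start=1)}
    index_to_word = {i: w for w, i in word_to_index.items()}
    return word_to_index, index_to_word, word_to_vec

w2i, i2w, w2v = build_lookups(toy_glove)
print(w2i)       # {'cat': 1, 'sat': 2, 'the': 3}
print(i2w[3])    # the
print(w2v['cat'])
```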

**Data preprocessing**

A function that converts each input sentence into a list of embedding indices, one per word. Words that are not in the GloVe vocabulary get index 0. The resulting lists are the input for the Keras Embedding layer.

In [0]:
def split_sentence(sentences, words_to_index, max_words):
  # max_words is kept for the calls below but is not used here;
  # the fixed length is enforced later by pad_sequences
  trailing = (',', '.', ':', '!', '?', "'")
  new_sentences = []
  for sentence in sentences:
    new_sentence = []
    for word in sentence.lower().split():
      # Strip a single trailing punctuation character, if present
      if word[-1] in trailing:
        word = word[:-1]
      # Words missing from the vocabulary map to index 0
      new_sentence.append(words_to_index.get(word, 0))
    new_sentences.append(new_sentence)
  return new_sentences

Converting the training and test datasets into lists of indices.

In [0]:
X_train_indices = split_sentence(X_train, words_to_index, 150)

In [0]:
X_test_indices = split_sentence(X_test, words_to_index, 150)
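
The conversion is easier to follow on a single sentence. The sketch below reimplements the same steps under a hypothetical name (`sentence_to_indices`) with a three-word toy vocabulary, so the indices are illustrative and unrelated to the real GloVe indices.

```python
# Hypothetical toy vocabulary; real indices come from the GloVe file
toy_index = {"great": 1, "purchase": 2, "very": 3}

def sentence_to_indices(sentence, word_to_index):
    indices = []
    for word in sentence.lower().split():
        # Drop one trailing punctuation character, as in split_sentence
        if word[-1] in (',', '.', ':', '!', '?', "'"):
            word = word[:-1]
        # Unknown words fall back to 0
        indices.append(word_to_index.get(word, 0))
    return indices

result = sentence_to_indices("Great purchase! Really great.", toy_index)
print(result)  # [1, 2, 0, 1]
```

Here "really" is not in the toy vocabulary, so it becomes 0, the same value that is later used for padding.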

The next step is to bring all sentences to the same length (150).
Sentences that are too long are cut at the beginning, so the last 150 words are kept.
Sentences that are too short are filled with 0 at the beginning.

In [0]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
X_train_padded = pad_sequences(X_train_indices,maxlen = 150, padding='pre', truncating='pre')
X_test_padded = pad_sequences(X_test_indices,maxlen = 150, padding='pre', truncating='pre')
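
The effect of `padding='pre'` and `truncating='pre'` can be mimicked in a few lines of plain Python. This is only a sketch of the behaviour for a single sequence, with a hypothetical helper `pad_pre` and a short `maxlen` of 5; it assumes `maxlen` is positive and is not how Keras implements it.

```python
def pad_pre(seq, maxlen, value=0):
    # Truncate at the beginning: keep only the last `maxlen` items
    seq = seq[-maxlen:]
    # Pad at the beginning with `value` until the length is `maxlen`
    return [value] * (maxlen - len(seq)) + seq

short = pad_pre([1, 2, 3], maxlen=5)
long = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short)  # [0, 0, 1, 2, 3]
print(long)   # [3, 4, 5, 6, 7]
```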

Correcting the labels: the dataset uses 1 for negative and 2 for positive reviews, so 1 is subtracted to make 1 correspond to positive sentiment and 0 to negative.

In [0]:
Y_train = np.array(Y_train, dtype = np.int32)
Y_train-=1
Y_test_true = np.array(Y_test, dtype = np.int32)
Y_test_true-=1

**Model building**

In [0]:
from tensorflow.keras.layers import Embedding

A function that creates an embedding layer initialized with the pre-trained GloVe embeddings.
The weights of the Embedding layer are set to be trainable, so they are fine-tuned during training.

In [0]:
def build_embedding_layer(word_to_vec_map, words_to_index):
  voc_size = len(words_to_index) + 1
  emb_dim = word_to_vec_map['rice'].shape[0]
  emb_matrix = np.zeros((voc_size,emb_dim))
  for word, index in words_to_index.items():
    emb_matrix[index,:] = word_to_vec_map[word]
  embedding_layer = Embedding(voc_size, emb_dim, trainable = True)
  embedding_layer.build((None,))
  embedding_layer.set_weights([emb_matrix])
  return embedding_layer

In [0]:
first_layer = build_embedding_layer(word_to_vec_map, words_to_index)
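
The core of the function is the embedding matrix: row `i` holds the vector of the word with index `i`, and row 0 stays all zeros. A minimal NumPy-only sketch with a hypothetical two-word vocabulary and invented 2-dimensional vectors:

```python
import numpy as np

# Hypothetical toy data (not real GloVe values)
toy_word_to_vec = {"cat": np.array([0.3, 0.4]), "the": np.array([0.1, 0.2])}
toy_word_to_index = {"cat": 1, "the": 2}

voc_size = len(toy_word_to_index) + 1                  # +1 for the reserved index 0
emb_dim = len(next(iter(toy_word_to_vec.values())))    # length of any vector
emb_matrix = np.zeros((voc_size, emb_dim))
for word, index in toy_word_to_index.items():
    emb_matrix[index, :] = toy_word_to_vec[word]

print(emb_matrix.shape)
print(emb_matrix)
```

Reading the dimension from an arbitrary entry, as done here, avoids depending on a specific word such as 'rice' being present in the vocabulary.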

A function that builds an NLP model for sentiment analysis: the embedding layer, two stacked LSTM layers with dropout, and a single sigmoid output unit.

In [0]:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Input, Dropout, LSTM, Activation

def build_sentiment_model(input_shape, word_to_vec_map,words_to_index):
  sentence_input = Input((input_shape),dtype = 'int32')
  
  embedding_layer = build_embedding_layer(word_to_vec_map,words_to_index)
  X = embedding_layer(sentence_input)
  X = LSTM(256,return_sequences = True)(X)
  X = Dropout(0.1)(X)
  X = LSTM(256,return_sequences = False)(X)
  X = Dropout(0.1)(X)
  X = Dense(1)(X)
  X = Activation('sigmoid')(X)

  model = Model (sentence_input, X)
  return model

In [0]:
sentiment_model = build_sentiment_model((150,),word_to_vec_map, words_to_index)


In [50]:
sentiment_model.summary()

Model: "model_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_4 (InputLayer)         [(None, 150)]             0         
_________________________________________________________________
embedding_5 (Embedding)      (None, 150, 100)          40000100  
_________________________________________________________________
lstm_6 (LSTM)                (None, 150, 256)          365568    
_________________________________________________________________
dropout_6 (Dropout)          (None, 150, 256)          0         
_________________________________________________________________
lstm_7 (LSTM)                (None, 256)               525312    
_________________________________________________________________
dropout_7 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 257 
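
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard LSTM layout (four gates, each with input weights, recurrent weights and a bias); the vocabulary size 400001 is inferred from the summary itself (40000100 / 100), and `lstm_params` is a hypothetical helper.

```python
def lstm_params(input_dim, units):
    # 4 gates x (input weights + recurrent weights + bias)
    return 4 * (units * (input_dim + units) + units)

embedding = 400001 * 100          # vocabulary rows x embedding dimension
lstm_1 = lstm_params(100, 256)    # input is the 100-d embedding
lstm_2 = lstm_params(256, 256)    # input is the 256-d output of the first LSTM
dense = 256 * 1 + 1               # weights + bias of the single output unit

total = embedding + lstm_1 + lstm_2 + dense
print(embedding, lstm_1, lstm_2, dense, total)
```

Almost all parameters sit in the embedding layer, which is why making it trainable has a large effect on memory use.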

In [0]:
from tensorflow.keras.optimizers import Adam
sentiment_model.compile(loss='binary_crossentropy', optimizer=Adam(learning_rate = 0.0001), metrics=['accuracy'])

The training dataset consists of 3,600,000 examples. To make training faster, 200,000 indices are drawn at random for this process. Note that np.random.randint samples with replacement, so some examples may appear more than once.

In [0]:
idx = np.random.randint(0,3600000,200000)
X_train_padded_part = X_train_padded[idx]
Y_train_padded_part = Y_train[idx]
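
If duplicates in the subset are not wanted, the indices can be drawn without replacement instead. The following is a sketch of that alternative, not what the cell above does; it uses small, hypothetical sizes so the difference is visible quickly.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_total, n_sample = 1000, 200   # small hypothetical sizes for illustration

# With replacement (same behaviour as np.random.randint): duplicates are possible
idx_with = rng.integers(0, n_total, n_sample)
# Without replacement: every index appears at most once
idx_without = rng.choice(n_total, size=n_sample, replace=False)

print(len(np.unique(idx_with)), len(np.unique(idx_without)))
```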

Training the model with batch size 128 for 2 epochs.

In [52]:
import tensorflow as tf
with tf.device('/device:GPU:0'):
  sentiment_model.fit(X_train_padded_part,Y_train_padded_part,batch_size = 128, epochs = 2, shuffle = True)

Train on 200000 samples
Epoch 1/2
Epoch 2/2


Training the model for an additional 4 epochs.

In [58]:
import tensorflow as tf
with tf.device('/device:GPU:0'):
  sentiment_model.fit(X_train_padded_part,Y_train_padded_part,batch_size = 128, epochs = 4, shuffle = True)

Train on 200000 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


The model reaches 92.5% accuracy on the training subset.

Evaluating the model on 100,000 examples drawn at random (with replacement) from the test dataset.

In [98]:
idx = np.random.randint(0,400000,100000)
sentiment_model.evaluate(X_test_padded[idx], Y_test_true[idx])



[0.22167939500808717, 0.91177]
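
The two numbers returned by `evaluate` are, in order, the loss (binary cross-entropy) and the accuracy. A small hand computation on made-up labels and probabilities shows how both are obtained; the values below are hypothetical and unrelated to the model's results.

```python
import math

# Hypothetical true labels and predicted probabilities
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.7]

# Binary cross-entropy: mean of -[y*log(p) + (1-y)*log(1-p)]
loss = -sum(
    y * math.log(p) + (1 - y) * math.log(1 - p)
    for y, p in zip(y_true, y_prob)
) / len(y_true)

# Accuracy: share of examples where thresholding at 0.5 matches the label
accuracy = sum(
    int(p > 0.5) == y for y, p in zip(y_true, y_prob)
) / len(y_true)

print(loss, accuracy)
```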

Let's try our own sentences.

In [0]:
def predict_sentiment(sentence):
  # `sentence` is a list containing one string, e.g. ['Great purchase!']
  indices = split_sentence(sentence, words_to_index, 150)
  padded = pad_sequences(indices, maxlen = 150)
  prediction = sentiment_model.predict(padded)
  # A sigmoid output above 0.5 is treated as positive sentiment
  if prediction > 0.5: print ('Positive!')
  else: print('Negative!')
  print(prediction)

In [112]:
predict_sentiment(['Such an amazing quality!'])

Positive!
[[0.7540016]]


In [111]:
predict_sentiment(['Great purchase! After one year still looks like new'])

Positive!
[[0.91678673]]


In [110]:
predict_sentiment(['It is so bad that I would not recommend it even to my worst enemy'])

Negative!
[[0.0311656]]


In [109]:
predict_sentiment(['Broken after second use.'])

Negative!
[[0.10456979]]


Run the next cell if you want to save the weights to your Google Drive.

To do so, click the link that appears, choose your Google account, and allow Google Cloud SDK to access it. A verification code will then be shown; paste it into the input field.

In [119]:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

sentiment_model.save_weights('SentimentModel_weights.h5')
weights_file = drive.CreateFile({'title' : 'SentimentModel_weights.h5'})
weights_file.SetContentFile('SentimentModel_weights.h5')
weights_file.Upload()
drive.CreateFile({'id': weights_file.get('id')})
print(f"Weights saved, id: {weights_file.get('id')}")

Weights saved, id: 1DDr_sZZRuVwXqyhsOGJ_-l6gXPThEvfh
