
We will make our first foray into text analytics and build a model for classifying text sentiment. Machine learning models operate on numbers and matrices, so you will see some new methods for converting text data to numerical format.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow import keras
from tensorflow.keras import layers

Download a public dataset from UC Irvine that contains sentences collated from Amazon (shopping), IMDB (movie reviews) and Yelp (restaurant reviews). The sentences come with labels where 0 is negative and 1 is positive.

In [None]:
!wget https://archive.ics.uci.edu/static/public/331/sentiment+labelled+sentences.zip

--2024-11-11 13:51:42--  https://archive.ics.uci.edu/static/public/331/sentiment+labelled+sentences.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified
Saving to: ‘sentiment+labelled+sentences.zip’

sentiment+labelled+     [ <=>                ]  82.21K   456KB/s    in 0.2s    

2024-11-11 13:51:42 (456 KB/s) - ‘sentiment+labelled+sentences.zip’ saved [84188]



In [None]:
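import zipfile

local_zip = 'sentiment+labelled+sentences.zip'
# The context manager closes the archive even if extraction fails
with zipfile.ZipFile(local_zip, 'r') as zip_ref:
    zip_ref.extractall('.')  # extract into the current working directory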

Pandas is a great package for virtually any kind of dataset loading and manipulation. You won't go wrong starting with this. Here we take a quick peek at each set of reviews, then mash it all together into a single dataset.

In [None]:
yelp = pd.read_csv('sentiment labelled sentences/yelp_labelled.txt', sep='\t', header=None)
imdb = pd.read_csv('sentiment labelled sentences/imdb_labelled.txt', sep='\t', header=None)
amazon = pd.read_csv('sentiment labelled sentences/amazon_cells_labelled.txt', sep='\t', header=None)
print(yelp.head(),yelp.shape)
print(imdb.head(),imdb.shape)
print(amazon.head(), amazon.shape)


                                                   0  1
0                           Wow... Loved this place.  1
1                                 Crust is not good.  0
2          Not tasty and the texture was just nasty.  0
3  Stopped by during the late May bank holiday of...  1
4  The selection on the menu was great and so wer...  1 (1000, 2)
                                                   0  1
0  A very, very, very slow-moving, aimless movie ...  0
1  Not sure who was more lost - the flat characte...  0
2  Attempting artiness with black & white and cle...  0
3       Very little music or anything to speak of.    0
4  The best scene in the movie was when Gerardo i...  1 (748, 2)
                                                   0  1
0  So there is no way for me to plug it in here i...  0
1                        Good case, Excellent value.  1
2                             Great for the jawbone.  1
3  Tied to charger for conversations lasting more...  0
4                            

In [None]:
data = pd.concat([yelp, imdb, amazon])
print(data.head(),data.shape)

                                                   0  1
0                           Wow... Loved this place.  1
1                                 Crust is not good.  0
2          Not tasty and the texture was just nasty.  0
3  Stopped by during the late May bank holiday of...  1
4  The selection on the menu was great and so wer...  1 (2748, 2)


As usual, we need to split the data. Here we only need to split into train and test. Later we will see that the model.fit function already provides a validation split.

In [None]:
# YOUR CODE HERE
# Write 1 line of code using train_test_split to split the data into 75% training and 25% testing. Hint: check out how we did this in the Trees exercise.
train_x, test_x, train_y, test_y = train_test_split(data[0].values, data[1].values, test_size=0.25)
# YOUR CODE ENDS

Keras provides a tool called Tokenizer for pre-processing text data. In natural language processing, the units a text is split into (here, words) are known as tokens. The fit_on_texts method looks at all the text you give it and produces a list of unique words, known as a 'vocabulary', with each word assigned an index number. Then, the texts_to_sequences method converts the sentences to number sequences based on the vocabulary. Note that we fit the tokenizer on the training sentences only, so words that appear only in the test sentences are dropped from their sequences.

The sentences have varying lengths, so we need to 'pad' the sequences. We set a max sequence length of 32, pad the shorter sentences with zeroes, and truncate the longer ones.
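To make the two steps above concrete, here is a minimal pure-Python sketch of vocabulary building, sequence conversion and padding. It is a simplification, not the Keras implementation: it splits on whitespace only (no punctuation filtering), and the helper names `build_vocab`, `to_sequence` and `pad_post` are hypothetical. It does follow the same conventions as the cell below: index 0 is reserved for padding, only the most frequent `num_words - 1` words are kept, and over-long sequences lose their earliest tokens.

```python
from collections import Counter

def build_vocab(texts, num_words):
    """Map the most frequent words to indices 1..num_words-1 (0 is reserved for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    most_common = counts.most_common(num_words - 1)
    return {word: i for i, (word, _) in enumerate(most_common, start=1)}

def to_sequence(text, vocab):
    """Replace each known word by its index; words outside the vocabulary are dropped."""
    return [vocab[w] for w in text.lower().split() if w in vocab]

def pad_post(seq, maxlen):
    """Keep at most the last maxlen items, then right-pad with zeros up to maxlen."""
    seq = seq[-maxlen:]
    return seq + [0] * (maxlen - len(seq))

# Toy sentences, invented for illustration only
toy_texts = ["the food was good", "the service was not good"]
toy_vocab = build_vocab(toy_texts, num_words=10)
toy_padded = [pad_post(to_sequence(t, toy_vocab), maxlen=6) for t in toy_texts]

print(toy_vocab)
print(toy_padded)
```

Every row of `toy_padded` has the same length, which is what lets the sentences be stacked into a single matrix like `train_pad`.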

In [None]:
vocab_size = 5000
tokenizer = Tokenizer(num_words=vocab_size)
tokenizer.fit_on_texts(train_x)
train_seq = tokenizer.texts_to_sequences(train_x)
train_pad = pad_sequences(train_seq, padding='post', maxlen=32)
test_seq = tokenizer.texts_to_sequences(test_x)
test_pad = pad_sequences(test_seq, padding='post', maxlen=32)
print(train_pad)
print(train_pad.shape)
print(train_y.shape)

[[  59   92  138 ...    0    0    0]
 [  46   10  180 ...    0    0    0]
 [  38    3  608 ...    0    0    0]
 ...
 [4554 4555  903 ...    0    0    0]
 [ 220  842   36 ...    0    0    0]
 [4556 4557  120 ...    0    0    0]]
(2061, 32)
(2061,)


Here we define a simple model. The model starts with an Embedding layer that converts the index of each word in our sentence sequences into a vector in a high-dimensional space. This is followed by a simple LSTM layer of just 8 units, and it ends with a single neuron that gives a probabilistic output from 0 to 1.
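The embedding step can be pictured as a table lookup. The NumPy sketch below is an illustration with made-up toy sizes (`toy_vocab_size`, `toy_embedding_dim` and `toy_batch` are hypothetical names and values, not the notebook's 5000 and 32), and it uses random numbers where the real layer would use learned weights.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_vocab_size, toy_embedding_dim = 10, 4  # illustrative sizes only

# One row per vocabulary index; in the real model these values are learned.
embedding_matrix = rng.normal(size=(toy_vocab_size, toy_embedding_dim))

# Two padded sentences of length 5 (0 is the padding index).
toy_batch = np.array([[1, 4, 2, 3, 0],
                      [1, 5, 2, 0, 0]])

# The lookup is plain integer indexing: each index is swapped for its row.
embedded = embedding_matrix[toy_batch]
print(embedded.shape)  # (batch, timesteps, embedding_dim)
```

The same word index always maps to the same vector, so the LSTM that follows sees a sequence of vectors rather than a sequence of arbitrary integers.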

In [None]:
embedding_dim = 32 # embedding vectors of length 32
model = keras.Sequential()
model.add(layers.Embedding(vocab_size, embedding_dim))
# YOUR CODE HERE - add 1 LSTM layer with 8 units
model.add(layers.LSTM(8))
# YOUR CODE ENDS
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])


When using model.fit, the validation_split argument holds out a fraction of the training data for validation. 'accuracy' refers to training accuracy and should quickly hit >90%, but that doesn't represent your model's performance on unseen data. Be careful to look at val_accuracy when evaluating the model.
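As a rough sanity check on what validation_split=0.2 does to our data, the arithmetic can be sketched as follows. Keras documents that the validation samples are taken from the end of the arrays, before any shuffling; the exact rounding used here (floor) is an assumption on my part.

```python
import math

n_samples = 2061        # rows in train_pad, from the shape printed earlier
validation_split = 0.2
batch_size = 8

# Assumed rounding: the training portion is floored, the remainder is validation.
n_train = math.floor(n_samples * (1.0 - validation_split))
n_val = n_samples - n_train
steps_per_epoch = math.ceil(n_train / batch_size)

print(n_train, n_val, steps_per_epoch)
```

Under that assumption the step count agrees with the `206/206` progress counter in the training log.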

In [None]:
# YOUR CODE HERE
# write 1 line of code using model.fit to train the model. Use the following settings: batch_size=8, epochs=5, verbose=1, validation_split=0.2
model.fit(train_pad,train_y,batch_size=8,epochs=5,verbose=1,validation_split=0.2)
# YOUR CODE ENDS

Epoch 1/5
206/206 ━━━━━━━━━━━━━━━━━━━━ 5s 11ms/step - accuracy: 0.4875 - loss: 0.6940 - val_accuracy: 0.6029 - val_loss: 0.6623
Epoch 2/5
206/206 ━━━━━━━━━━━━━━━━━━━━ 2s 10ms/step - accuracy: 0.7795 - loss: 0.5479 - val_accuracy: 0.6877 - val_loss: 0.6765
Epoch 3/5
206/206 ━━━━━━━━━━━━━━━━━━━━ 2s 9ms/step - accuracy: 0.8625 - loss: 0.3878 - val_accuracy: 0.7264 - val_loss: 0.6559
Epoch 4/5
206/206 ━━━━━━━━━━━━━━━━━━━━ 2s 10ms/step - accuracy: 0.9074 - loss: 0.2899 - val_accuracy: 0.7191 - val_loss: 0.7253
Epoch 5/5
206/206 ━━━━━━━━━━━━━━━━━━━━ 4s 17ms/step - accuracy: 0.9433 - loss: 0.2322 - val_accuracy: 0.7288 - val_loss: 0.6702


<keras.src.callbacks.history.History at 0x78885a0151e0>

The model looks passable (val_accuracy around 73%) and will probably do even better with longer training and some tweaking of hyperparameters. But here we will experiment with Attention, a powerful mechanism for weighting the relative importance of different parts of a sentence.

We will implement a simple form of Attention, along the lines first proposed by Bahdanau et al. (2014) to improve recurrent sequence models such as LSTMs. Rather than use only the final output (hidden state) of the LSTM, we will learn a weighting over the hidden states emitted at every timestep, such that the layer outputs a weighted sum of those states. This sounds simple enough, but it already requires a bit more work because we have to write our own layer. Most of the work is done for you.
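Before writing the Keras layer, it may help to check the shapes of this computation in plain NumPy. The sketch below is an illustration under stated assumptions: the hidden states are random stand-ins for LSTM outputs, the sizes follow this notebook (32 timesteps, 8 LSTM units), and the `softmax` helper is written by hand for the example.

```python
import numpy as np

def softmax(z, axis):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
batch, timesteps, units = 2, 32, 8

# Stand-in for the LSTM output when it returns one hidden state per timestep
hidden_states = rng.normal(size=(batch, timesteps, units))
W = rng.normal(size=(units, 1))   # one weight per LSTM unit
b = np.zeros((timesteps, 1))      # one bias per timestep

scores = np.tanh(hidden_states @ W + b)          # (batch, timesteps, 1)
weights = softmax(scores, axis=1)                # normalised over the timesteps
context = (hidden_states * weights).sum(axis=1)  # (batch, units)

print(weights.sum(axis=1).ravel())
print(context.shape)
```

The attention weights of each sentence sum to 1 across its timesteps, and the output has one vector of length 8 per sentence, the same shape the plain LSTM produced from its final hidden state.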

In [None]:
from tensorflow.keras.layers import Layer
from tensorflow.keras import backend as K

class Attention(Layer):

    def __init__(self):
        super(Attention,self).__init__()

    def build(self, input_shape):
        # W holds one weight per input feature (last axis); b holds one bias per timestep (axis 1)
        self.W=self.add_weight(name="att_weight", shape=(input_shape[-1],1),
                               initializer="normal")
        self.b=self.add_weight(name="att_bias", shape=(input_shape[1],1),
                               initializer="zeros")

        super(Attention,self).build(input_shape)

    def call(self, x):
        # YOUR CODE HERE
        # 1 line of code that performs a dot product of the input x with the weights and adds the bias term
        # the dot product of u and v is implemented as K.dot(u,v)
        dot_product = K.dot(x,self.W)+self.b
        # YOUR CODE ENDS
        e = K.tanh(dot_product)
        a = K.softmax(e, axis=1)
        output = x*a

        return K.sum(output, axis=1)

We define a new model, and here we add our new Attention layer. Notice that the LSTM now is set to return sequences, which is needed for the attention calculation.

In [None]:
# model initialization
model = keras.Sequential()
# embedding layer
model.add(layers.Embedding(vocab_size, embedding_dim))
# LSTM layer that returns the hidden state at every timestep
model.add(layers.LSTM(8,return_sequences=True))

# YOUR CODE HERE - add our homemade layer
model.add(Attention())
# YOUR CODE ENDS

model.add(layers.Dense(1, activation='sigmoid'))
# compile model
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])


In [None]:
# YOUR CODE HERE - same code you wrote above to train the model
model.fit(train_pad,train_y,batch_size=8,epochs=5,verbose=1,validation_split=0.2)
# YOUR CODE ENDS

Epoch 1/5
206/206 ━━━━━━━━━━━━━━━━━━━━ 5s 12ms/step - accuracy: 0.5382 - loss: 0.6902 - val_accuracy: 0.6731 - val_loss: 0.6147
Epoch 2/5
206/206 ━━━━━━━━━━━━━━━━━━━━ 2s 10ms/step - accuracy: 0.8285 - loss: 0.4598 - val_accuracy: 0.7700 - val_loss: 0.4944
Epoch 3/5
206/206 ━━━━━━━━━━━━━━━━━━━━ 4s 15ms/step - accuracy: 0.9296 - loss: 0.2548 - val_accuracy: 0.7821 - val_loss: 0.5244
Epoch 4/5
206/206 ━━━━━━━━━━━━━━━━━━━━ 4s 10ms/step - accuracy: 0.9374 - loss: 0.2145 - val_accuracy: 0.7676 - val_loss: 0.6089
Epoch 5/5
206/206 ━━━━━━━━━━━━━━━━━━━━ 2s 10ms/step - accuracy: 0.9656 - loss: 0.1463 - val_accuracy: 0.7530 - val_loss: 0.6745


<keras.src.callbacks.history.History at 0x788859a55570>

There is an improvement of a few percentage points: validation accuracy peaks at around 78% in epoch 3 and ends at around 75% after epoch 5.