We will use the IMDB movie review sentiment classification dataset.

Dataset Description: https://keras.io/api/datasets/imdb/

This is a dataset of 25,000 movie reviews from IMDB, tagged by sentiment (positive/negative). The reviews have been preprocessed and each review is coded as a list of (whole) word indexes. For convenience, words are indexed by their overall frequency in the dataset, so that, for example, the integer "3" encodes the 3rd most frequent word in the data.
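
To make the frequency-based encoding concrete, here is a minimal sketch on a made-up three-review corpus. The `reviews` list and the alphabetical tie-breaking are illustrative assumptions. Keras's `imdb.load_data` also reserves a few low indices for padding, start-of-sequence and out-of-vocabulary markers (see its `start_char`, `oov_char` and `index_from` arguments), so its actual indices are shifted relative to this toy version.

```python
from collections import Counter

# Toy corpus standing in for the review texts (hypothetical data).
reviews = [
    "the movie was great",
    "the plot was thin",
    "the acting was great great",
]

# Count every word, then rank by descending frequency (ties broken alphabetically).
counts = Counter(word for review in reviews for word in review.split())
ranked = sorted(counts, key=lambda w: (-counts[w], w))

# Rank 1 is the most frequent word; index 0 is left free for padding.
word_to_index = {word: rank for rank, word in enumerate(ranked, start=1)}

# Each review becomes a list of integer word indexes.
encoded = [[word_to_index[w] for w in review.split()] for review in reviews]
print(word_to_index)
print(encoded)
```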

In [3]:
#!pip install keras
#!pip install tensorflow

In [4]:
import warnings
warnings.filterwarnings('ignore')

In [5]:
import numpy
from tensorflow.keras.datasets import imdb
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Dropout, Flatten
# Import layers from the public tensorflow.keras.layers path;
# the private tensorflow.python.keras modules are not a stable API.
from tensorflow.keras.layers import Embedding, Conv1D, MaxPooling1D
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.utils import to_categorical
numpy.random.seed(7)

In [5]:
db=imdb.load_data()

In [73]:
top_words = 5000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words)

In [7]:
len(X_train)

25000

In [8]:
y_train

array([1, 0, 0, ..., 0, 1, 0])

In [9]:
max_review_length = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)

In [10]:
X_train.shape

(25000, 500)
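
By default, `pad_sequences` pads at the front and also truncates from the front, so every review ends up with exactly `maxlen` entries. The pure-Python helper below (`pad_pre` is a hypothetical name, not a Keras function) is a rough sketch of that default behaviour on plain lists.

```python
def pad_pre(seqs, maxlen, value=0):
    """Left-pad short sequences and keep the last `maxlen` items of long ones."""
    padded = []
    for seq in seqs:
        trimmed = list(seq)[-maxlen:]  # drop items from the front if too long
        padded.append([value] * (maxlen - len(trimmed)) + trimmed)
    return padded

result = pad_pre([[1, 2, 3], [4, 5, 6, 7, 8, 9]], maxlen=5)
print(result)
```

The short sequence gains leading zeros, and the long one loses its earliest items, which for reviews means the end of the text is what the model sees.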

We will use the Embedding layer, which is the first hidden layer of the network. We specify three arguments:

input_dim: the size of the vocabulary in the text data

output_dim: the size of the vector space in which each word will be embedded

input_length: the length of the input sequences; for example, if your documents contain 100 words each, it is 100
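
Conceptually, an embedding layer is a trainable lookup table with `input_dim` rows and `output_dim` columns. The sketch below imitates that lookup with plain Python lists. The sizes and the random initial values are arbitrary choices for illustration, not what Keras uses internally.

```python
import random

random.seed(0)
input_dim, output_dim, input_length = 10, 4, 3

# The layer's weights: one row of `output_dim` floats per vocabulary index.
table = [[random.uniform(-0.05, 0.05) for _ in range(output_dim)]
         for _ in range(input_dim)]

def embed(sequence):
    """Replace every integer index with its row of the table."""
    return [table[i] for i in sequence]

vectors = embed([7, 2, 2])
print(len(vectors), len(vectors[0]))  # input_length rows, output_dim columns
```

Repeated indices (here the two 2s) always map to the same vector, which is what lets the network learn one representation per word.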

In [11]:
# creating the model
embedding_vecor_length = 32
model = Sequential()
model.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))
model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
#print(model.summary())
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=6, batch_size=64)

2023-06-18 11:51:42.075881: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:266] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected


Epoch 1/6




Epoch 2/6


Epoch 3/6


Epoch 4/6


Epoch 5/6


Epoch 6/6



<keras.callbacks.History at 0x7fe0b077eaf0>

In [12]:
#evaluation
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 86.99%


## A simple example of the Embedding layer

In [6]:
docs = ['Well done!',
		'Good work',
		'Great effort',
		'nice work',
		'Excellent!',
		'Weak',
		'Poor effort!',
		'not good',
		'poor work',
		'Could have done better.']
#docs = numpy.arange(len(docs))

In [7]:
labels = numpy.array([1,1,1,1,1,0,0,0,0,0])

In [8]:
vocab_size = 50

In [9]:
encoded_docs = [one_hot(d, vocab_size) for d in docs]

In [10]:
encoded_docs

[[40, 34],
 [42, 32],
 [28, 49],
 [33, 32],
 [30],
 [34],
 [45, 49],
 [47, 42],
 [45, 32],
 [21, 16, 34, 32]]
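
Despite its name, `one_hot` hashes each word into the range 1 to `vocab_size - 1`, so different words can share an index. In the output above, "done" and "Weak" both map to 34, and "work" and "better" both map to 32. The sketch below shows the same effect with a stable MD5-based hash. Both `hashed_index` and `find_collisions` are hypothetical helpers, and their indices will not match the Keras output.

```python
import hashlib

def hashed_index(word, n):
    """Map a word to an integer in [1, n-1] with a stable hash (0 is kept for padding)."""
    digest = hashlib.md5(word.lower().encode("utf-8")).hexdigest()
    return int(digest, 16) % (n - 1) + 1

def find_collisions(words, n):
    """Group distinct words that end up sharing an index."""
    buckets = {}
    for w in set(words):
        buckets.setdefault(hashed_index(w, n), []).append(w)
    return {i: sorted(ws) for i, ws in buckets.items() if len(ws) > 1}

vocab = ["well", "done", "good", "work", "great", "effort", "nice", "excellent",
         "weak", "poor", "not", "could", "have", "better"]
print(find_collisions(vocab, 50))  # may or may not be empty, depending on the hash
print(find_collisions(vocab, 5))   # 14 words into 4 slots must collide
```

Choosing a hashing space well above the number of distinct words reduces collisions but does not rule them out; a fitted `Tokenizer` avoids them entirely.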

In [11]:
max_length = 4
padded_docs = sequence.pad_sequences(encoded_docs, maxlen=max_length, padding='post')
print(padded_docs)

[[40 34  0  0]
 [42 32  0  0]
 [28 49  0  0]
 [33 32  0  0]
 [30  0  0  0]
 [34  0  0  0]
 [45 49  0  0]
 [47 42  0  0]
 [45 32  0  0]
 [21 16 34 32]]


We are now ready to define our Embedding layer as part of our model.

The embedding has a vocabulary size of 50 and an input length of 4. We will choose a small embedding space of 8 dimensions.

The model is a simple binary classifier. Note that the output of the Embedding layer is 4 vectors of 8 dimensions each, one per word. The Flatten layer reshapes this into a single 32-element vector that is passed to the Dense output layer.

In [12]:
model = Sequential()
model.add(Embedding(vocab_size, 8, input_length=max_length))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# summarize the model


In [16]:
print(model.summary())

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 module_wrapper (ModuleWrapp  (None, 4, 8)             400       
 er)                                                             
                                                                 
 module_wrapper_1 (ModuleWra  (None, 32)               0         
 pper)                                                           
                                                                 
 dense (Dense)               (None, 1)                 33        
                                                                 
Total params: 433
Trainable params: 433
Non-trainable params: 0
_________________________________________________________________
None
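
The parameter counts in the summary above can be checked by hand. The short calculation below reproduces them from the three sizes used in this example.

```python
vocab_size, embedding_dim, max_length = 50, 8, 4

embedding_params = vocab_size * embedding_dim  # one 8-d vector per index
flattened_size = max_length * embedding_dim    # 4 vectors of 8 values each
dense_params = flattened_size * 1 + 1          # 32 weights plus one bias

total_params = embedding_params + dense_params
print(embedding_params, flattened_size, dense_params, total_params)
```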


In [14]:
model.fit(padded_docs, labels, epochs=50, verbose=0)

<keras.callbacks.History at 0x79c2782cdc00>

In [15]:
loss, accuracy = model.evaluate(padded_docs, labels, verbose=0)
print('Accuracy: %f' % (accuracy*100))

Accuracy: 89.999998


## To Do: 

1. Try the same thing on the Google reviews dataset (the file is given in the lab directory).
2. Try changing the embedding representation using GloVe and skip-gram.
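
For the skip-gram item, the training data consists of (target, context) word pairs taken from a sliding window over each text. The helper below is a minimal sketch of that pair generation only. `skipgram_pairs` is a hypothetical name, and negative sampling and the training of the vectors themselves are left out.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every token and its neighbours within `window`."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

pairs = skipgram_pairs(["great", "app", "love", "it"], window=1)
print(pairs)
```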

In [18]:
import pandas as pd
import numpy
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split

google_reviews = pd.read_csv('/kaggle/input/reviews/reviews.csv')

In [19]:
df = google_reviews[['content', 'score']]

In [20]:
# Calculate the maximum sequence length
max_sequence_length = max(len(content.split()) for content in df["content"])

# Tokenize the content
num_words=500
tokenizer = Tokenizer(num_words=num_words)
tokenizer.fit_on_texts(df["content"])
sequences = tokenizer.texts_to_sequences(df["content"])

# Pad sequences to the same length
content_embeddings = sequence.pad_sequences(sequences, maxlen=max_sequence_length)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(content_embeddings, df['score'], test_size=0.2, random_state=42)
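
The `score` column is passed to the model unchanged, whereas a sigmoid output trained with binary cross-entropy expects 0/1 targets. Assuming `score` holds 1 to 5 star ratings (an assumption about this CSV, not something shown above), one possible mapping treats 4 and 5 as positive, 1 and 2 as negative, and drops the neutral 3s. The helper name and thresholds below are illustrative choices.

```python
def binarize_scores(scores, positive_from=4, drop=(3,)):
    """Map star ratings to 0/1 labels, skipping the ratings listed in `drop`."""
    kept_positions, labels = [], []
    for pos, score in enumerate(scores):
        if score in drop:
            continue
        kept_positions.append(pos)
        labels.append(1 if score >= positive_from else 0)
    return kept_positions, labels

positions, labels = binarize_scores([5, 1, 3, 4, 2])
print(positions, labels)
```

The returned positions indicate which rows of the padded sequences to keep so that inputs and labels stay aligned.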


In [56]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, MaxPooling1D, LSTM, Dense

embedding_vecor_length = 32
sequence_length = X_train.shape[1]  # Assuming X_train is a numpy array

model = Sequential()
# input_dim must match the Tokenizer's num_words (500), not the toy example's vocab_size
model.add(Embedding(input_dim=num_words, output_dim=embedding_vecor_length, input_length=sequence_length))
model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(LSTM(128))
model.add(Dense(1, activation='sigmoid'))

print(model.summary())

# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])


Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_11 (Embedding)    (None, 391, 32)           16000     
                                                                 
 conv1d_1 (Conv1D)           (None, 391, 32)           3104      
                                                                 
 max_pooling1d_1 (MaxPooling  (None, 195, 32)          0         
 1D)                                                             
                                                                 
 lstm_2 (LSTM)               (None, 128)               82432     
                                                                 
 dense_3 (Dense)             (None, 1)                 129       
                                                                 
Total params: 101,665
Trainable params: 101,665
Non-trainable params: 0
________________________________________________
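
The shapes and parameter counts in this summary follow from standard formulas, sketched below with hypothetical helper names. The LSTM formula assumes the usual four gates with a single bias vector each, and the pooling formula assumes the default 'valid' padding with stride equal to the pool size.

```python
def conv1d_params(kernel_size, in_channels, filters):
    """Weights of every filter plus one bias per filter."""
    return kernel_size * in_channels * filters + filters

def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * ((input_dim + units) * units + units)

def pooled_length(length, pool_size):
    """Output length of MaxPooling1D with 'valid' padding and stride == pool_size."""
    return (length - pool_size) // pool_size + 1

embedding_params = 500 * 32  # 500 indices, 32 dimensions each
dense_params = 128 + 1       # 128 weights plus one bias
total = embedding_params + conv1d_params(3, 32, 32) + lstm_params(32, 128) + dense_params
print(conv1d_params(3, 32, 32), lstm_params(32, 128), pooled_length(391, 2), total)
```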

In [57]:
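# Evaluate the model on the test set.
# Note: no model.fit(...) call precedes this cell, so the weights are still at their random initial values.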
loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
print('Test Loss:', loss)
print('Test Accuracy:', accuracy)

Test Loss: 0.7243567109107971
Test Accuracy: 0.02190476283431053


In [31]:
# Load the GloVe embeddings
glove_embeddings = {}  # Dictionary to store the embeddings
with open('/kaggle/input/glove-global-vectors-for-word-representation/glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = numpy.array(values[1:], dtype='float32')
        glove_embeddings[word] = vector


In [37]:
# Create an embedding matrix
embedding_matrix = numpy.zeros((num_words, 100))
for word, index in tokenizer.word_index.items():
    if index < num_words:
        embedding_vector = glove_embeddings.get(word)
        if embedding_vector is not None:
            embedding_matrix[index] = embedding_vector


In [53]:
embedding_matrix.shape

(500, 100)
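
Words that are missing from the GloVe file keep an all-zero row in the matrix, so it can be worth checking how many of the kept words were actually found. The sketch below runs the same filling logic on toy data and also collects the missing words. `build_matrix`, `toy_vectors` and `toy_index` are stand-ins, not part of the notebook.

```python
def build_matrix(word_index, vectors, num_words, dim):
    """Fill one row per kept index; rows stay all-zero for words missing from `vectors`."""
    matrix = [[0.0] * dim for _ in range(num_words)]
    missing = []
    for word, index in word_index.items():
        if index >= num_words:  # outside the kept vocabulary
            continue
        if word in vectors:
            matrix[index] = list(vectors[word])
        else:
            missing.append(word)
    return matrix, missing

toy_vectors = {"good": [0.1, 0.2], "bad": [-0.1, -0.2]}  # stand-in for glove_embeddings
toy_index = {"good": 1, "bad": 2, "gr8": 3, "rare": 9}   # stand-in for tokenizer.word_index
matrix, missing = build_matrix(toy_index, toy_vectors, num_words=5, dim=2)
print(missing, matrix[3])
```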

In [54]:
# Create the model
model = Sequential()

# Add the Embedding layer
vocab_size = embedding_matrix.shape[0]
embedding_dim = embedding_matrix.shape[1]
model.add(Embedding(vocab_size, embedding_dim, weights=[embedding_matrix], trainable=False))

# Add other layers to the model
model.add(LSTM(128))
model.add(Dense(1, activation='sigmoid'))

# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Print the model summary
print(model.summary())

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_10 (Embedding)    (None, None, 100)         50000     
                                                                 
 lstm_1 (LSTM)               (None, 128)               117248    
                                                                 
 dense_2 (Dense)             (None, 1)                 129       
                                                                 
Total params: 167,377
Trainable params: 117,377
Non-trainable params: 50,000
_________________________________________________________________
None


In [55]:
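# Evaluate the model on the test set.
# Note: this model is also evaluated without a preceding model.fit(...) call; only the frozen GloVe embeddings carry information.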
loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
print('Test Loss:', loss)
print('Test Accuracy:', accuracy)

Test Loss: 0.9715513586997986
Test Accuracy: 0.030793650075793266
