We will use the "IMDB movie review sentiment classification" dataset.

Dataset Description: https://keras.io/api/datasets/imdb/

This is a dataset of movie reviews from IMDB, labeled by sentiment (positive/negative), with 25,000 reviews for training and 25,000 for testing. The reviews have been preprocessed, and each review is encoded as a list of word indexes (integers). For convenience, words are indexed by their overall frequency in the dataset, so that, for example, the integer "3" encodes the 3rd most frequent word in the data.
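
To make the encoding concrete, here is a minimal sketch of how an encoded review could be mapped back to words. It assumes the default `imdb.load_data()` settings, in which the codes 0, 1 and 2 are reserved (padding, start of sequence, out-of-vocabulary) and every word rank is shifted by `index_from=3`. The `decode_review` helper and the tiny `toy_word_index` dictionary are hypothetical stand-ins; with the real data, the dictionary would come from `imdb.get_word_index()`.

```python
def decode_review(encoded, word_index, index_from=3):
    """Map a list of integer word codes back to a readable string."""
    # Invert the word -> rank mapping and apply the index offset.
    reverse_index = {rank + index_from: word for word, rank in word_index.items()}
    # Reserved codes under the assumed default settings.
    reverse_index.update({0: "<pad>", 1: "<start>", 2: "<oov>"})
    return " ".join(reverse_index.get(code, "<oov>") for code in encoded)


# Hypothetical toy vocabulary: word -> frequency rank (1 = most frequent).
toy_word_index = {"the": 1, "movie": 2, "was": 3, "great": 4}

decoded = decode_review([1, 4, 5, 6, 7], toy_word_index)
print(decoded)  # <start> the movie was great
```

Under these assumed defaults, the most frequent word therefore appears in the loaded data as 4 rather than 1.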

In [2]:
!pip install keras

Collecting keras
  Downloading keras-2.11.0-py2.py3-none-any.whl (1.7 MB)
     ---------------------------------------- 1.7/1.7 MB 7.7 MB/s eta 0:00:00
Installing collected packages: keras
Successfully installed keras-2.11.0



[notice] A new release of pip is available: 23.0 -> 23.1.2
[notice] To update, run: python.exe -m pip install --upgrade pip


In [4]:
!pip install tensorflow

Collecting tensorflow
  Downloading tensorflow-2.11.0-cp37-cp37m-win_amd64.whl (1.9 kB)
Collecting tensorflow-intel==2.11.0
  Downloading tensorflow_intel-2.11.0-cp37-cp37m-win_amd64.whl (266.3 MB)
     -------------------------------------- 266.3/266.3 MB 3.1 MB/s eta 0:00:00
Collecting libclang>=13.0.0
  Downloading libclang-16.0.0-py2.py3-none-win_amd64.whl (24.4 MB)
     --------------------------------------- 24.4/24.4 MB 10.7 MB/s eta 0:00:00
Collecting tensorflow-io-gcs-filesystem>=0.23.1
  Downloading tensorflow_io_gcs_filesystem-0.31.0-cp37-cp37m-win_amd64.whl (1.5 MB)
     ---------------------------------------- 1.5/1.5 MB 13.5 MB/s eta 0:00:00
Collecting protobuf<3.20,>=3.9.2
  Downloading protobuf-3.19.6-cp37-cp37m-win_amd64.whl (896 kB)
     ------------------------------------- 896.6/896.6 kB 14.1 MB/s eta 0:00:00
Collecting google-pasta>=0.1.1
  Using cached google_pasta-0.2.0-py3-none-any.whl (57 kB)
Collecting tensorflow-estimator<2.12,>=2.11.0
  Downloading tensorflo



In [15]:
import numpy
import tensorflow as tf
from tensorflow.keras.datasets import imdb
from tensorflow.keras.models import Sequential
# Import layers from the public tensorflow.keras API rather than the
# private tensorflow.python.keras modules, which are not stable across versions.
from tensorflow.keras.layers import Dense, LSTM, Dropout
from tensorflow.keras.layers import Embedding, Conv1D, MaxPooling1D, Flatten
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Fix the random seeds for reproducibility
numpy.random.seed(7)
tf.random.set_seed(7)

In [7]:
db=imdb.load_data()

In [8]:
top_words = 5000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words)

In [9]:
len(X_train)

25000

In [10]:
y_train

array([1, 0, 0, ..., 0, 1, 0], dtype=int64)

In [11]:
max_review_length = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)

In [13]:
X_train.shape

(25000, 500)
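
A note on what `pad_sequences` does here: with its default settings (`padding='pre'`, `truncating='pre'`), shorter reviews get zeros added at the front and longer reviews lose their earliest words. The pure-Python helper below, `pad_pre`, is a hypothetical illustration of that behaviour on toy data, not the Keras implementation.

```python
def pad_pre(seqs, maxlen, value=0):
    """Left-pad or left-truncate every sequence to exactly maxlen items."""
    padded = []
    for seq in seqs:
        kept = seq[-maxlen:]  # keep the last maxlen items (truncating='pre')
        padded.append([value] * (maxlen - len(kept)) + kept)  # fill in front
    return padded


toy_reviews = [[5, 8], [3, 9, 4, 7, 2, 6]]
padded_reviews = pad_pre(toy_reviews, maxlen=4)
print(padded_reviews)  # [[0, 0, 5, 8], [4, 7, 2, 6]]
```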

We will use the Embedding layer, which serves as the first hidden layer of the network. It takes 3 arguments:

input_dim: the size of the vocabulary in the text data

output_dim: the size of the vector space in which each word will be embedded

input_length: the length of the input sequences; for example, if your documents contain 100 words each, then it is 100
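
Conceptually, an embedding layer is a trainable lookup table with `input_dim` rows and `output_dim` columns: each integer in the input sequence selects one row. The following pure-Python sketch illustrates only the shapes involved. The names `toy_table` and `embed` are hypothetical, the values are random rather than learned, and the sizes are deliberately tiny.

```python
import random

random.seed(7)

input_dim = 10     # vocabulary size (toy value)
output_dim = 4     # length of each word vector (toy value)
input_length = 3   # number of word indices per sequence (toy value)

# One row of output_dim numbers per word index.
toy_table = [[random.uniform(-0.05, 0.05) for _ in range(output_dim)]
             for _ in range(input_dim)]


def embed(sequence):
    """Replace every word index with its row from the lookup table."""
    return [toy_table[index] for index in sequence]


vectors = embed([2, 7, 2])
print(len(vectors), len(vectors[0]))  # 3 4
print(vectors[0] == vectors[2])       # True: same word, same vector
```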

In [21]:
# create the model
embedding_vector_length = 32
model = Sequential()
model.add(Embedding(top_words, embedding_vector_length, input_length=max_review_length))
model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=6, batch_size=64)

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 500, 32)           160000    
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 500, 32)           3104      
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 250, 32)           0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 101       
Total params: 216,405
Trainable params: 216,405
Non-trainable params: 0
_________________________________________________________________
None
Train on 25000 samples, validate on 25000 samples
Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6

<tensorflow.python.keras.callbacks.History at 0x1ad71b91e80>

In [23]:

# evaluation
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 88.10%
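
The parameter counts in the model summary above can be checked by hand, which is a useful way to confirm how the layers connect. The sketch below uses the standard formulas for each layer type (the LSTM has four gates, each with input weights, recurrent weights, and a bias); the variable names are chosen for this illustration only.

```python
vocab, embed_dim = 5000, 32      # Embedding(top_words, 32)
filters, kernel_size = 32, 3     # Conv1D(filters=32, kernel_size=3)
lstm_units = 100                 # LSTM(100)

embedding_params = vocab * embed_dim
# Each filter spans kernel_size steps of embed_dim channels, plus one bias.
conv_params = filters * (kernel_size * embed_dim + 1)
# Four gates, each with input weights, recurrent weights and a bias.
lstm_params = 4 * (lstm_units * (filters + lstm_units) + lstm_units)
# One weight per LSTM unit plus one bias.
dense_params = lstm_units + 1

total = embedding_params + conv_params + lstm_params + dense_params
print(embedding_params, conv_params, lstm_params, dense_params, total)
# 160000 3104 53200 101 216405
```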


## A simple example of the embedding layer

In [10]:
docs = ['Well done!',
        'Good work',
        'Great effort',
        'nice work',
        'Excellent!',
        'Weak',
        'Poor effort!',
        'not good',
        'poor work',
        'Could have done better.']

In [11]:
labels = [1,1,1,1,1,0,0,0,0,0]

In [12]:
vocab_size = 50

In [13]:
encoded_docs = [one_hot(d, vocab_size) for d in docs]

In [10]:
print(encoded_docs)

[[12, 43], [7, 40], [28, 36], [1, 40], [5], [20], [3, 36], [16, 7], [3, 40], [37, 26, 43, 40]]
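
Despite its name, `one_hot` does not build one-hot vectors: it maps each word to an integer by hashing, so two different words can land on the same index. In the output above, for instance, the last word of 'Could have done better.' gets 40, the same index as 'work'. The helper below, `hash_encode`, is a simplified, hypothetical sketch of this hashing idea that uses MD5 for stable results. It is not the Keras implementation and will not reproduce the exact integers shown above.

```python
import hashlib
import string


def hash_encode(text, n):
    """Map each word of text to an integer in [1, n - 1] via a stable hash."""
    # Lowercase and strip punctuation, then split on whitespace.
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    codes = []
    for word in cleaned.split():
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        codes.append(int(digest, 16) % (n - 1) + 1)  # 0 stays free for padding
    return codes


toy_docs = ["Good work", "nice work", "not good"]
toy_codes = [hash_encode(d, 50) for d in toy_docs]
print(toy_codes)
```

The same word always receives the same index, but nothing prevents collisions; a larger `n` makes them less likely.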


In [22]:
max_length = 4
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
print(padded_docs)

[[12 43  0  0]
 [ 7 40  0  0]
 [28 36  0  0]
 [ 1 40  0  0]
 [ 5  0  0  0]
 [20  0  0  0]
 [ 3 36  0  0]
 [16  7  0  0]
 [ 3 40  0  0]
 [37 26 43 40]]


We are now ready to define our Embedding layer as part of our model.

The Embedding layer has a vocabulary size of 50 and an input length of 4. We will choose a small embedding space of 8 dimensions.

The model is a simple binary classification model. Note that the output of the Embedding layer will be 4 vectors of 8 dimensions each, one for each word. We flatten this with the Flatten layer into a single 32-element vector to pass to the Dense output layer.
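
The shape bookkeeping described above can be traced with a small pure-Python sketch: look up 4 vectors of 8 values, flatten them into 32 values, and feed those to a single sigmoid unit. All weights below are random placeholders rather than trained values, and the names (`table`, `weights`, `bias`) are chosen for this illustration only.

```python
import math
import random

random.seed(7)
vocab_size, embed_dim, max_length = 50, 8, 4

# Embedding table: vocab_size rows of embed_dim values (50 * 8 = 400 parameters).
table = [[random.uniform(-0.05, 0.05) for _ in range(embed_dim)]
         for _ in range(vocab_size)]
# Dense layer: one weight per flattened input plus a bias (32 + 1 = 33 parameters).
weights = [random.uniform(-0.1, 0.1) for _ in range(max_length * embed_dim)]
bias = 0.0

padded_doc = [12, 43, 0, 0]                          # one padded document
embedded = [table[i] for i in padded_doc]            # shape (4, 8)
flat = [v for vector in embedded for v in vector]    # shape (32,)
logit = sum(w * x for w, x in zip(weights, flat)) + bias
probability = 1.0 / (1.0 + math.exp(-logit))         # sigmoid output in (0, 1)

print(len(embedded), len(embedded[0]), len(flat))    # 4 8 32
print(vocab_size * embed_dim, len(weights) + 1)      # 400 33
```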

In [26]:
model = Sequential()
model.add(Embedding(vocab_size, 8, input_length=max_length))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# summarize the model
print(model.summary())

Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_5 (Embedding)      (None, 4, 8)              400       
_________________________________________________________________
flatten (Flatten)            (None, 32)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 33        
Total params: 433
Trainable params: 433
Non-trainable params: 0
_________________________________________________________________
None


In [27]:
model.fit(padded_docs, labels, epochs=50, verbose=0)

<tensorflow.python.keras.callbacks.History at 0x19d3b571dd8>

In [28]:
loss, accuracy = model.evaluate(padded_docs, labels, verbose=0)
print('Accuracy: %f' % (accuracy*100))

Accuracy: 100.000000


## To Do: 

1. Try the same approach on the Google reviews dataset (the file is provided in the lab directory).
2. Try changing the embedding representation using GloVe and skip-gram.
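
For the second item, one possible starting point is to build an initial weight matrix from pretrained vectors. The sketch below assumes the plain-text GloVe format (one word per line followed by its vector components, separated by spaces). The helpers `parse_glove_lines` and `build_embedding_matrix`, the three-dimensional sample lines, and the `toy_index` mapping are all hypothetical and stand in for a real downloaded file and a real word-to-index mapping.

```python
def parse_glove_lines(lines):
    """Parse 'word v1 v2 ...' lines into a word -> vector dictionary."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector of the word with index i; unknown words stay zero."""
    matrix = [[0.0] * dim for _ in range(max(word_index.values()) + 1)]
    for word, i in word_index.items():
        if word in vectors:
            matrix[i] = vectors[word]
    return matrix


# Made-up sample lines in the assumed format (real GloVe vectors are longer).
sample_lines = ["good 0.1 0.2 0.3", "poor -0.1 -0.2 -0.3"]
toy_index = {"good": 1, "poor": 2, "effort": 3}   # hypothetical word -> index map

matrix = build_embedding_matrix(toy_index, parse_glove_lines(sample_lines), dim=3)
print(matrix)
```

A matrix built this way could then serve as the initial weights of the Embedding layer, in place of the randomly initialized ones used above.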