<a href="https://colab.research.google.com/github/Mi1kDev/CST3133_CW/blob/main/CST3133_CourseWork_B.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing and Deep Learning Coursework Part 2

## 1.2.1 Text Dataset Selection and Preprocessing


### Selecting, reviewing, and adjusting the dataset

In [43]:
from google.colab import drive
drive.mount("/content/drive")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [44]:
# importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [45]:
nlp_dataset = pd.read_csv("/content/drive/MyDrive/CST3133_CW/datasets/IMDB Dataset.csv")
nlp_dataset.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [46]:
nlp_dataset.describe()

Unnamed: 0,review,sentiment
count,50000,50000
unique,49582,2
top,Loved today's show!!! It was a variety and not...,positive
freq,5,25000


In [47]:
nlp_dataset.shape

(50000, 2)

In [48]:
nlp_dataset.isnull().sum()

Unnamed: 0,0
review,0
sentiment,0


In [49]:
nlp_dataset.isnull().sum().sum()

0

Checking for duplicates and removing them

In [50]:
nlp_dataset.duplicated().sum()

418

In [51]:
nlp_dataset.drop_duplicates(inplace=True)
nlp_dataset.shape

(49582, 2)

### Preprocessing the reviews


Importing and downloading all the necessary libraries to tokenise the reviews

In [52]:
# Regular Expressions Library to Clean the data
import re
# Natural Language Toolkit Library to Preprocess the data
import nltk
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
# Downhload the necessary NLTK resources
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

Function to determine the part-of-speech(POS) tag for each word.

In [53]:
def get_wordnet_pos(word):
  tag = nltk.pos_tag([word])[0][1][0].upper()
  # post_tag access = [Tupple][POS Tag][First Letter POS Tag]
  tag_dict = {
      "J": wordnet.ADJ, # Adjectives
      "N": wordnet.NOUN, # Nouns
      "V": wordnet.VERB, # Verbs
      "R": wordnet.ADV # Adverb
      }
  return tag_dict.get(tag, wordnet.NOUN)

Cleaning the dataset
*   Turning each word to lower case
*   Removing HTML tags
*   Tokenising the words
*   Removing Stopwords
*   Applying lemmatization







In [54]:
def preprocess_text(review):
  review = review.lower()
  review = re.sub(r'<[^>]+>', '', review)
  review = re.sub(r'[^a-zA-Z0-9]', ' ', review)
  tokens = word_tokenize(review)
  stop_words = set(stopwords.words('english'))
  tokens = [word for word in tokens if word not in stop_words]
  lemmatizer = WordNetLemmatizer()
  tokens = [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in tokens]

  return " " .join(tokens)

Extracting the tokens of the review and target labels as binary:

In [56]:
nlp_tokenised_reviews = []
nlp_sentiment_labels = []

for index, row in nlp_dataset.iterrows():
  nlp_tokenised_reviews.append(preprocess_text(row['review']))
  nlp_sentiment_labels.append( 1 if row['sentiment'] == 'positive' else 0)

print(nlp_tokenised_reviews[:5])
print(nlp_sentiment_labels[:5])

['one reviewer mention watch 1 oz episode hooked right exactly happen first thing struck oz brutality unflinching scene violence set right word go trust show faint hearted timid show pull punch regard drug sex violence hardcore classic use word call oz nickname give oswald maximum security state penitentary focus mainly emerald city experimental section prison cell glass front face inwards privacy high agenda em city home many aryan muslim gangsta latino christian italian irish scuffle death stare dodgy dealing shady agreement never far away would say main appeal show due fact go show dare forget pretty picture paint mainstream audience forget charm forget romance oz mess around first episode ever saw struck nasty surreal say ready watch developed taste oz get accustom high level graphic violence violence injustice crooked guard sell nickel inmate kill order get away well mannered middle class inmate turn prison bitch due lack street skill prison experience watch oz may become comforta

Importing Google's Word2Vec as Pre-trained Embedding Model for feature representation.

In [57]:
import gensim.downloader as api
word2vec_model = api.load('word2vec-google-news-300')

Function to generate numerical representation (embedding) for a string of tokens from a review

In [58]:
def get_sentence_embedding(sentence, model, vector_size = 300):
  tokens = sentence.split()
  # Storing the numerical vectors of the tokens that are valid
  token_vectors = []
  valid_tokens = [token for token in tokens if token in model.key_to_index]
  if not valid_tokens:
    # Returning a zero vector if no valid tokens are found
    return np.zeros(vector_size)
  for token in valid_tokens:
    token_vectors.append(model[token])
  # Returning the embeddings
  return np.mean(token_vectors, axis=0)


Extracting the embeddings for each review

In [59]:
nlp_embeddings = []
for review in nlp_tokenised_reviews:
  sentence_embedding = get_sentence_embedding(review, word2vec_model)
  nlp_embeddings.append(sentence_embedding)
#  Converting the embeddings and sentiment binary labels to NumPy array for the ML Model
nlp_embeddings = np.array(nlp_embeddings)
nlp_sentiment_labels = np.array(nlp_sentiment_labels)
print(nlp_embeddings.shape)
print(nlp_sentiment_labels.shape)

(49582, 300)
(49582,)


## 1.2.2 Deep Learning Model Implementation


Design and train a neural network, e.g., RNN, LSTM for a text-based task, e.g., sentiment analysis.

Clearly explain the model architecture, e.g., embedding layers, hidden layers, activation functions, and
hyperparameter tuning

In [60]:
''' Design and Train Neural Network using Reccurent Neural Network (RNN) '''
# importing the necessary libraries
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense, Reshape
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

### Reshaping the data to be easy accessable for RNN inputs ###

In [61]:
# retriving maximum sequence length
max_sequence_length = max(len(review.split()) for review in nlp_tokenised_reviews)
print(max_sequence_length) # so far the length is 1429

# applying pad sequences to ensure that the length is uniformed
padded_reviews = pad_sequences([[word2vec_model.key_to_index.get(word, 0)
   for word in review.split()] for review in nlp_tokenised_reviews],
   maxlen=max_sequence_length,
   padding = 'post',
   truncating = 'post')
print(padded_reviews.shape)

1429
(49582, 1429)


### Developing RNN Model ###

In [62]:
model = Sequential()
# adding the input shape for the embenddings
model.add(Reshape((nlp_embeddings.shape[1], 1), input_shape=(nlp_embeddings.shape[1])))
model.add(SimpleRNN(units=128))

# Binary classification output layer presented
model.add(Dense(units=1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

  super().__init__(**kwargs)


ValueError: Cannot convert '300' to a shape.

### Train the Module ###


In [42]:
X_train, X_test, y_train, y_test = train_test_split(nlp_embeddings, nlp_sentiment_labels, test_size=0.2, random_state=42)

model.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_test, y_test))

Epoch 1/10


ValueError: Exception encountered when calling Sequential.call().

[1mCannot take the length of shape with unknown rank.[0m

Arguments received by Sequential.call():
  • inputs=tf.Tensor(shape=<unknown>, dtype=float32)
  • training=True
  • mask=None

## 1.2.3 Evaluation and Insights


Use evaluation metrics, e.g., accuracy, precision, recall, loss curves.

Provide visualizations, e.g., learning curves, confusion matrices, to explain findings, where possible.

Highlight strengths, limitations and areas for improvement.