# D213 Advanced Data Analytics
## Task-2
### Submitted by Muhammad Ilyas, Student ID 011143032
### WGU's MSDA program

## Part-I: Research Question
## A1: Research Question
How can sentiment analysis using neural network models and NLP techniques on IMDb reviews contribute to understanding audience reactions and preferences in the movie industry?
## A2: Objectives or Goals
#### Sentiment Analysis: 
Utilize neural network models to predict sentiment (positive/negative) of IMDb reviews, providing insights into the overall reception of movies.
#### Feature Extraction: 
Extract meaningful features from text data to enhance the model's understanding of sentiment, potentially improving prediction accuracy.
#### Performance Evaluation: 
Assess the performance of the RNN model by evaluating metrics such as accuracy, precision, recall, and F1-score to ensure reliable predictions.
#### Insight Generation: 
Derive insights into common themes or aspects that contribute to positive or negative sentiments, aiding filmmakers and producers in understanding audience preferences.
## A3: Prescribed Network
#### Sequential Data Handling: 
RNNs process data as ordered sequences, which makes them well suited to natural language processing tasks where the order of words matters.
#### Memory Retention: 
RNNs have the ability to retain information from previous steps, which is essential for understanding the context in sentences and capturing dependencies between words.
#### Variable Input Length: 
RNNs can handle variable-length sequences, accommodating the varying lengths of sentences in natural language.
#### Context Awareness: 
The ability to consider the context of a word in relation to preceding words is crucial for sentiment analysis, and RNNs can capture this contextual information effectively.
Using RNNs for sentiment analysis on IMDb reviews aligns with the sequential and context-dependent nature of language, making it a suitable choice for the chosen data set and research question.
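
The memory-retention and variable-length points above can be illustrated with a minimal sketch of a basic recurrent update, h_t = tanh(W_x x_t + W_h h_(t-1) + b). The function name `simple_rnn_forward`, the weight names, and the random inputs are assumptions made for illustration; this is not the Keras implementation used later in the notebook.

```python
import numpy as np

def simple_rnn_forward(inputs, W_x, W_h, b):
    """Run a basic RNN over a sequence and return the final hidden state."""
    h = np.zeros(W_h.shape[0])  # hidden state starts at zero
    for x_t in inputs:
        # The new state depends on the current input AND the previous state
        h = np.tanh(W_x @ x_t + W_h @ h + b)
    return h

rng = np.random.default_rng(0)
units, features = 4, 1  # illustrative sizes only
W_x = rng.normal(size=(units, features))
W_h = rng.normal(size=(units, units))
b = np.zeros(units)

# The same weights handle sequences of different lengths
short_seq = rng.normal(size=(3, features))
long_seq = rng.normal(size=(9, features))
final_short = simple_rnn_forward(short_seq, W_x, W_h, b)
final_long = simple_rnn_forward(long_seq, W_x, W_h, b)
print(final_short.shape, final_long.shape)
```

Because the hidden state is carried from step to step, the final state summarizes the whole sequence regardless of its length.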

## Part-II: Data Preparation
### B1: Data Exploration

#### Import data

In [1]:
import pandas as pd
# Load the data
df = pd.read_csv("imdb_labelled.txt", sep='\t', header=None, names=['sentence', 'sentiment'])
data = df.copy() #to be used in next cell only ---for exploration

# Check the first few rows of the dataframe
df.head()

| | sentence | sentiment |
|---|---|---|
| 0 | A very, very, very slow-moving, aimless movie ... | 0 |
| 1 | Not sure who was more lost - the flat characte... | 0 |
| 2 | Attempting artiness with black & white and cle... | 0 |
| 3 | Very little music or anything to speak of. | 0 |
| 4 | The best scene in the movie was when Gerardo i... | 1 |


In [2]:
from keras.preprocessing.text import Tokenizer
import matplotlib.pyplot as plt
import nltk

# Check for unusual characters
def has_unusual_characters(text):
    return any(char.isascii() for char in text)

data['has_unusual_chars'] = data['sentence'].apply(has_unusual_characters)
print("Presence of unusual characters:")
print(data['has_unusual_chars'].value_counts())

# Tokenize and get vocabulary size (only to get vocab size here as per the requirement)
tokenizer = Tokenizer()
tokenizer.fit_on_texts(data['sentence'])
vocab_size = len(tokenizer.word_index) + 1
print("\nVocabulary Size:", vocab_size)
# Proposed word embedding length
word_embedding_length = 100  # You can choose an appropriate value based on your analysis and resources

# Statistical justification for the chosen maximum sequence length
data['sentence_length'] = data['sentence'].apply(lambda x: len(nltk.word_tokenize(x)))
max_seq_length = data['sentence_length'].max()
mean_seq_length = data['sentence_length'].mean()
std_seq_length = data['sentence_length'].std()

print("\nStatistics for Sentence Length:")
print("Max Length:", max_seq_length)
print("Mean Length:", mean_seq_length)
print("Standard Deviation:", std_seq_length)


Presence of unusual characters:
has_unusual_chars
True    748
Name: count, dtype: int64

Vocabulary Size: 3134

Statistics for Sentence Length:
Max Length: 1625
Mean Length: 22.586898395721924
Standard Deviation: 78.49485484894365


#### Presence of Unusual Characters:

The code is intended to check for the presence of unusual characters (non-English characters or emojis) in the sentences, and the results are summarized with value_counts().
As written, has_unusual_characters returns True when a sentence contains at least one ASCII character, so all 748 sentences are flagged. The output therefore shows only that every sentence contains ASCII text; it does not show how many sentences contain non-ASCII characters.
Detecting unusual characters requires negating the test, for example any(not char.isascii() for char in text).
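
A minimal sketch of a check that flags non-ASCII characters is shown here. The helper name `has_non_ascii` and the two sample sentences are assumptions for illustration and are not taken from the dataset.

```python
def has_non_ascii(text):
    """Return True if the text contains at least one non-ASCII character."""
    return any(not ch.isascii() for ch in text)

# Hypothetical sample sentences, not from imdb_labelled.txt
samples = ["A plain English sentence.", "A café scene with naïve dialogue."]
flags = [has_non_ascii(s) for s in samples]
print(flags)  # [False, True]
```

Applied to the notebook's DataFrame, the same function could replace has_unusual_characters in the apply call.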

#### Vocabulary Size:

Tokenization is performed to obtain the vocabulary size.
The Tokenizer class is fit on the text data, and the vocabulary size is computed as the length of the word index plus one, which gives 3134.
This corresponds to 3,133 unique words in the dataset after tokenization, plus index 0, which Keras reserves for padding. It reflects the richness of the language used in the IMDb reviews.

#### Proposed Word Embedding Length:

The variable word_embedding_length is set to 100. This value is the length of each word embedding vector and can be adjusted based on the analysis and the available resources.

The longest sentence in the dataset has 1625 tokens, which is the maximum sequence length observed in the IMDb reviews. On average, sentences in the dataset have a length of approximately 22.59 tokens.

#### Statistical Justification for Chosen Maximum Sequence Length:

Descriptive statistics (max, mean, and standard deviation) for sentence lengths are calculated. The standard deviation is relatively high (78.49) compared with the mean, indicating a significant amount of variability in sentence lengths.
A histogram of sentence lengths can be plotted to visualize this distribution.
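
Because the maximum (1625) is far above the mean (about 22.59), a percentile is often a more useful basis for the maximum sequence length than the raw maximum. The sketch below shows the idea on a hypothetical list of token counts with one extreme outlier; the values and the choice of the 90th percentile are assumptions for illustration, not the notebook's data or method.

```python
import numpy as np

# Hypothetical token counts: 19 ordinary sentences plus one extreme outlier
lengths = np.array(list(range(5, 43, 2)) + [1625])

# The maximum is dominated by the outlier
print("max:", lengths.max(), "mean:", lengths.mean())

# A percentile-based cap covers most sentences without padding to the outlier
maxlen_90 = int(np.percentile(lengths, 90))
coverage = float(np.mean(lengths <= maxlen_90))
print("90th percentile cap:", maxlen_90, "coverage:", coverage)
```

Sentences longer than the cap would be truncated during padding, which trades a small loss of text for much shorter inputs.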

In [3]:
df = df.rename(columns={'sentence': 'review', 'sentiment': 'sentiment'})
df.head()

| | review | sentiment |
|---|---|---|
| 0 | A very, very, very slow-moving, aimless movie ... | 0 |
| 1 | Not sure who was more lost - the flat characte... | 0 |
| 2 | Attempting artiness with black & white and cle... | 0 |
| 3 | Very little music or anything to speak of. | 0 |
| 4 | The best scene in the movie was when Gerardo i... | 1 |


In [4]:
df.shape

(748, 2)

In [5]:
df['sentiment'].value_counts()

sentiment
1    386
0    362
Name: count, dtype: int64

In [6]:
df.isnull().sum()

review       0
sentiment    0
dtype: int64

In [7]:
x = df['review'] # input dataset
y = df['sentiment'] # output dataset

#### Data cleaning

In [8]:
import string
punct = string.punctuation
punct

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [9]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
stopwords = list(STOP_WORDS)
stopwords

['which',
 'forty',
 'someone',
 'put',
 'least',
 'one',
 'on',
 'as',
 'almost',
 'herein',
 'moreover',
 'both',
 'call',
 'next',
 'then',
 'could',
 'might',
 'does',
 'five',
 'hundred',
 'unless',
 'hence',
 '‘s',
 'their',
 'keep',
 'thus',
 'among',
 'between',
 'doing',
 'whither',
 'hereby',
 'yet',
 'formerly',
 'neither',
 'thereby',
 "'d",
 'already',
 'his',
 'up',
 "'s",
 "'m",
 'whether',
 'above',
 '’m',
 'your',
 'thereupon',
 'each',
 'than',
 'always',
 'what',
 'wherein',
 'in',
 'own',
 'empty',
 'show',
 'sixty',
 'other',
 'whence',
 'used',
 'such',
 'before',
 'still',
 'these',
 'seem',
 'elsewhere',
 '‘ve',
 'will',
 'why',
 'her',
 'upon',
 'another',
 'been',
 'from',
 'three',
 'yourselves',
 'of',
 'further',
 'or',
 'whoever',
 'serious',
 'can',
 'just',
 'they',
 'me',
 'noone',
 'take',
 'below',
 'where',
 'himself',
 "'ll",
 'beside',
 'mine',
 'somehow',
 'bottom',
 'same',
 'was',
 'anyhow',
 'first',
 'but',
 'we',
 'top',
 'others',
 'nowhere'

In [10]:
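import spacy
import en_core_web_sm

# Load the small English spaCy pipeline once and reuse it for every review
nlp = en_core_web_sm.load()

def data_cleaning(text):
    """Lemmatize and lowercase the text, then drop stop words and punctuation."""
    docs = nlp(text)
    tokens = []
    for token in docs:
        # Use the lowercased lemma unless the lemma is the pronoun placeholder
        if token.lemma_ != "_PRON_":
            temp = token.lemma_.lower().strip()
        else:
            temp = token.lower_
        tokens.append(temp)
    # Keep only tokens that are neither stop words nor punctuation
    clean_tokens = []
    for token in tokens:
        if token not in stopwords and token not in punct:
            clean_tokens.append(token)
    return clean_tokens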

In [11]:
data = data_cleaning("Today was not a greaT day!")
data

['today', 'great', 'day']

In [12]:
x_cleaned = x.apply(data_cleaning)

In [13]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_cleaned, y, test_size = 0.2)
print("x_train shape:  ", x_train.shape)
print("y_train shape:  ", y_train.shape)
print("x_test shape:  ", x_test.shape)
print("y_test shape:  ", y_test.shape)
x_test

x_train shape:   (598,)
y_train shape:   (598,)
x_test shape:   (150,)
y_test shape:   (150,)


129    [garbo, right, bat, talent, carry, silent, era...
537                                                [new]
641    [surely, know, coherent, action, movie, screen...
219                                        [bad, ticker]
336                                [movie, lot, mistake]
                             ...                        
274    [plot, derivative, predictable, ending, like, ...
15     [average, act, main, person, low, budget, clea...
580                          [result, film, look, right]
521                [lassie, movie, sleep, ...., forever]
594    [film, paced, understated, good, courtroom, do...
Name: review, Length: 150, dtype: object

## B2: Tokenization

The goals of the tokenization process in natural language processing (NLP) include breaking down a text into smaller units called tokens. Tokens can be words, subwords, or characters, depending on the level of granularity desired. Tokenization is a crucial preprocessing step in NLP and is performed to achieve the following objectives:

#### Breaking Text into Tokens:

The primary goal is to break down a piece of text, such as a sentence or document, into individual tokens. Tokens are the basic building blocks that can be used for further analysis.

#### Normalization:

Tokenization often involves normalizing the text to achieve consistency. This includes converting all text to lowercase to ensure uniformity in word representations. For example, "Word" and "word" should be treated as the same token.

#### Handling Punctuation and Special Characters:

Tokenization aims to handle punctuation and special characters appropriately. This may involve removing punctuation or treating them as separate tokens, depending on the specific requirements of the task.

#### Handling Unusual Characters:

In some cases, unusual characters, emojis, or non-English characters may be present in the text. The tokenization process may need to handle or remove these characters, depending on the desired outcome.

#### Building a Vocabulary:

Tokenization contributes to building a vocabulary, which is a unique set of all tokens present in the dataset. This vocabulary is essential for creating numerical representations of words (word embeddings) for machine learning models.
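
The vocabulary-building goal can be sketched without Keras as follows. The helpers `build_word_index` and `texts_to_sequences` and the three short token lists are assumptions for illustration; they only mimic the general idea of assigning lower indices to more frequent words and reserving index 0 for padding.

```python
from collections import Counter

def build_word_index(tokenized_docs):
    """Map each token to an integer, most frequent first, starting at 1."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    # most_common() sorts by descending count; ties keep first-seen order
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(tokenized_docs, word_index):
    """Replace each known token with its integer index."""
    return [[word_index[t] for t in doc if t in word_index] for doc in tokenized_docs]

# Hypothetical cleaned reviews
docs = [["bad", "movie"], ["good", "movie"], ["movie", "lot", "mistake"]]
word_index = build_word_index(docs)
sequences = texts_to_sequences(docs, word_index)
vocab_size = len(word_index) + 1  # +1 because index 0 is reserved for padding

print(word_index)
print(sequences)
print(vocab_size)
```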

In [15]:

tokenizer = Tokenizer(num_words=3000, lower=False)
tokenizer.fit_on_texts(x_train)
x_train = tokenizer.texts_to_sequences(x_train)
x_test = tokenizer.texts_to_sequences(x_test)
print(x_train[15])
print(x_test[15])


[50, 176, 733, 12]
[898, 1984, 361, 10, 2, 50]


## B3: Padding Process

Padding is a common technique used to standardize the length of sequences in natural language processing (NLP), especially when working with recurrent neural networks (RNNs) or other sequence-based models. The goal is to ensure that all input sequences have the same length, which is crucial for efficient processing in neural networks (Charles, 2023).

In the context of padding:

#### Padding Position:

Padding can occur either before or after the text sequence. The choice of padding position depends on the specific requirements of the model and the underlying framework. In Keras, for example, the default is to pad sequences at the beginning (pre-padding), but it can be adjusted to pad at the end (post-padding) using the padding parameter.


#### Suppose we have two sequences:

Sequence 1: [3, 8, 12, 5] (length = 4)

Sequence 2: [7, 1, 9, 4, 2, 6] (length = 6)

To standardize the length, we decide to pad both sequences to a maximum length of 6. If we choose pre-padding, the sequences become:

Padded Sequence 1: [0, 0, 3, 8, 12, 5] (length = 6)

Padded Sequence 2: [7, 1, 9, 4, 2, 6] (no change, as its length is already 6)

If we choose post-padding, the sequences become:

Padded Sequence 1: [3, 8, 12, 5, 0, 0] (length = 6)

Padded Sequence 2: [7, 1, 9, 4, 2, 6] (no change)

The padding value is typically 0, but it can be any constant value.
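
The two padding positions can be reproduced with a few lines of plain Python. The helper `pad` is a hypothetical illustration, not the Keras pad_sequences function, and it does not truncate sequences that are longer than maxlen.

```python
def pad(sequence, maxlen, padding="pre", value=0):
    """Pad a list of token indices to maxlen with a constant value."""
    fill = [value] * max(0, maxlen - len(sequence))
    if padding == "pre":
        return fill + list(sequence)   # zeros go before the tokens
    return list(sequence) + fill       # zeros go after the tokens

seq1 = [3, 8, 12, 5]
seq2 = [7, 1, 9, 4, 2, 6]

print(pad(seq1, 6, padding="pre"))   # [0, 0, 3, 8, 12, 5]
print(pad(seq1, 6, padding="post"))  # [3, 8, 12, 5, 0, 0]
print(pad(seq2, 6, padding="pre"))   # [7, 1, 9, 4, 2, 6]
```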

In [16]:
from keras.preprocessing.sequence import pad_sequences
maxlen = 300
x_train = pad_sequences(x_train, padding = 'post', maxlen=maxlen)
x_test = pad_sequences(x_test, padding = 'post', maxlen=maxlen)
print(x_train[15])

[ 50 176 733  12   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   

The output shows x_train[15] after tokenization and padding: the four token indices [50, 176, 733, 12] are followed by zeros up to the fixed length of 300 (post-padding, as set by padding = 'post' and maxlen = 300).

The same process can be illustrated on a shorter example that uses a maximum length of 10 and pre-padding:

### Original Sequence:

['aimless', 'movie', 'distressed', 'drifting', 'young', 'man']

#### Token Indices:

[1034, 1, 1035, 1036, 281, 65]
Each word in the original sequence is mapped to a unique integer index using the Tokenizer. These indices represent the vocabulary created from the data.

#### Padded Sequence:

[0, 0, 0, 0, 1034, 1, 1035, 1036, 281, 65]
The original sequence is padded with zeros at the beginning (pre-padding) to reach a consistent length of 10. This is required when feeding sequences into neural networks that expect fixed-length input.
With the token indices and padded sequences generated, the data is ready for further processing, such as embedding and training in a neural network for sentiment analysis.

## B4: Categories of Sentiment
In sentiment analysis, the number of categories (or classes) corresponds to the different sentiments to be classified. In the IMDb dataset, the sentiment labels are binary, with values of 0 or 1, representing negative and positive sentiments, respectively. Therefore, there are two sentiment categories: negative and positive.

For binary classification tasks like this, a common choice for the final dense layer is a single unit with the sigmoid activation function and the binary cross-entropy loss. The sigmoid function squashes its input to a value between 0 and 1, producing a probability-like output to which a threshold can be applied to determine the final class assignment. In this notebook, the labels are instead one-hot encoded with to_categorical, so the final dense layer has two units with the softmax activation function and the categorical cross-entropy loss. For two classes, the two formulations are equivalent.

In [17]:
from keras.utils import to_categorical
num_classes = 2
y_train = to_categorical(y_train, num_classes)
y_test = to_categorical(y_test, num_classes)
print(y_train.shape)
print(y_train[0])

(598, 2)
[0. 1.]
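
The one-hot encoding above, and the relationship between a two-unit softmax output and a single sigmoid output, can be checked with a short NumPy sketch. The label array and logit values are hypothetical and chosen only for illustration.

```python
import numpy as np

# One-hot encode binary labels: 0 -> [1, 0], 1 -> [0, 1]
labels = np.array([0, 1, 1, 0])  # hypothetical labels
one_hot = np.eye(2)[labels]
print(one_hot[1])  # [0. 1.]

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical two-unit outputs before activation
logits = np.array([[0.2, 1.5], [2.0, -1.0]])

# Probability of the positive class from each formulation
p_softmax = softmax(logits)[:, 1]
p_sigmoid = sigmoid(logits[:, 1] - logits[:, 0])
print(np.allclose(p_softmax, p_sigmoid))  # True
```

The positive-class probability from a two-way softmax equals the sigmoid of the difference between the two logits, which is why either output layer can be used for binary sentiment.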


## B5: Steps to Prepare the Data

Preparing data for analysis, especially for natural language processing (NLP) tasks like sentiment analysis, involves several key steps. Let's go through the steps and discuss the size of the training, validation, and test sets:

#### Data Cleaning and Preprocessing:

Remove any irrelevant information, such as HTML tags or special characters.
Tokenize the text into individual words or subword units.
Convert the text to lowercase to ensure consistency.
Remove stop words or other words that may not contribute much to the analysis.

#### Label Encoding:

Encode the sentiment labels (positive/negative) into numerical values (e.g., 0 for negative and 1 for positive). This is necessary for training a machine learning model.

#### Splitting the Data:

Split the dataset into training, validation, and test sets. A common split is around 80% for training, 10% for validation, and 10% for testing, adjusted to the size of the dataset and specific requirements. In this notebook, 80% of the reviews (598) are used for training and 20% (150) are held out as the test set, which also serves as the validation data during training.

#### Tokenization and Padding:

Tokenize the text to convert words into numerical indices.
Pad the sequences to ensure uniform length, necessary for input to a neural network.
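
A separate validation set, as in the 80/10/10 split mentioned above, could be produced along these lines. The helper `split_indices`, its fraction arguments, and the seed value are assumptions for illustration; the notebook itself uses train_test_split with test_size = 0.2.

```python
import numpy as np

def split_indices(n_rows, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle row indices and cut them into train, validation, and test parts."""
    rng = np.random.default_rng(seed)  # fixed seed makes the split reproducible
    order = rng.permutation(n_rows)
    n_train = int(train_frac * n_rows)
    n_val = int(val_frac * n_rows)
    train_idx = order[:n_train]
    val_idx = order[n_train:n_train + n_val]
    test_idx = order[n_train + n_val:]  # the remainder goes to the test set
    return train_idx, val_idx, test_idx

# 748 is the number of rows reported by df.shape
train_idx, val_idx, test_idx = split_indices(748)
print(len(train_idx), len(val_idx), len(test_idx))  # 598 74 76
```

The index arrays can then be used to select rows from the padded inputs and the encoded labels.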

## B6: Copy of Prepared Dataset

In [18]:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, SimpleRNN
from keras import optimizers
import numpy as np
x_train = np.array(x_train).reshape((x_train.shape[0],x_train.shape[1],1))
print(x_train.shape)

x_test = np.array(x_test).reshape((x_test.shape[0],x_test.shape[1],1))
print(x_test.shape)

(598, 300, 1)
(150, 300, 1)


## Part-III: Network Architecture
### C1: Model Summary

In [19]:
num_classes=2
def vanilla_rnn():
    model = Sequential()
    model.add(SimpleRNN(50, input_shape= (maxlen, 1),return_sequences= False))
    model.add(Dense(num_classes))
    model.add(Activation('softmax'))
    model.summary()

    adam = optimizers.Adam(learning_rate = 0.001)
    model.compile(loss = 'categorical_crossentropy', optimizer = adam, metrics = ['accuracy'])

    return model


In [21]:
from scikeras.wrappers import KerasClassifier

model = KerasClassifier(build_fn = vanilla_rnn, epochs = 20, batch_size = 200)

# model.fit(x_train, y_train)

model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=20, batch_size=50)





Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 simple_rnn (SimpleRNN)      (None, 50)                2600      
                                                                 
 dense (Dense)               (None, 2)                 102       
                                                                 
 activation (Activation)     (None, 2)                 0         
                                                                 
Total params: 2702 (10.55 KB)
Trainable params: 2702 (10.55 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Epoch 1/20


Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


The model improves over the first epochs. A brief analysis of the training log follows:

#### Epoch 1:

Initial loss: 0.6909, Initial accuracy: 0.5117
Validation loss: 0.6874, Validation accuracy: 0.5733

#### Epoch 2:

Significant improvement in training and validation metrics.
Training loss reduced to 0.6346, and training accuracy increased to 0.7324.
Validation loss decreased to 0.6388, and validation accuracy improved to 0.6067.

#### Epoch 3:

Continued improvement in both training and validation metrics.
Training loss further reduced to 0.3983, and training accuracy increased to 0.8344.
Validation loss continued to decrease to 0.6197, and validation accuracy improved to 0.6733.

#### Epoch 4-6:

Training metrics continue to improve, with training accuracy approaching 1.0, while validation accuracy fluctuates.
This widening gap between training and validation performance suggests that the model is starting to overfit.

### C2:Network Architecture

#### Number of Layers:

The model has three layers: SimpleRNN, Dense, and Activation, as shown in the model summary in C1.
Type of Layers:

#### SimpleRNN Layer:

Input shape: (None, 300, 1) (padded sequence length of 300, one feature per time step).
Output shape: (None, 50) (50 recurrent units, final state only).
Trainable parameters: 2,600.

#### Dense Layer:

Input shape: (None, 50) (output shape of the SimpleRNN layer).
Output shape: (None, 2) (one unit per sentiment class).
Trainable parameters: 102.

#### Activation Layer:

Applies softmax to the Dense output; output shape: (None, 2).
Trainable parameters: 0.

#### Total Number of Parameters:

The total number of parameters in the model is 2,702, all of them trainable.
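
The parameter counts reported by model.summary() in C1 (2,600 for the SimpleRNN layer and 102 for the Dense layer) can be verified by hand. The helper functions below are written for illustration only; they apply the standard formulas for a basic recurrent layer and a fully connected layer with biases.

```python
def simple_rnn_params(units, input_dim):
    """Input weights + recurrent weights + one bias per unit."""
    return units * (input_dim + units + 1)

def dense_params(inputs, outputs):
    """One weight per input-output pair + one bias per output."""
    return inputs * outputs + outputs

rnn = simple_rnn_params(units=50, input_dim=1)  # SimpleRNN(50) on (300, 1) input
dense = dense_params(inputs=50, outputs=2)      # Dense(2) on the 50-unit state
print(rnn, dense, rnn + dense)  # 2600 102 2702
```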

### C3: Hyper Parameters

#### Activation Functions:

'softmax' is applied to the two-unit Dense output, which suits the one-hot encoded binary labels. The SimpleRNN layer uses its default 'tanh' activation.

#### Number of Nodes per Layer:

Chosen based on the complexity of the task: 50 units in the SimpleRNN layer and 2 units in the Dense layer, one per sentiment class.

#### Loss Function:

'categorical_crossentropy' is used because the labels are one-hot encoded into two classes; 'binary_crossentropy' is the counterpart for a single sigmoid output.

#### Optimizer:

'adam' with a learning rate of 0.001 is used; it is a popular optimizer due to its adaptive learning rates.

#### Stopping Criteria:

I will consider implementing early stopping to monitor validation loss and stop training when it plateaus.

#### Evaluation Metric:

'accuracy' is commonly used for classification tasks.
The model architecture is a simple baseline for a binary sentiment classification task. With only 598 training reviews, I will have to watch for overfitting even though the number of parameters is small.

## Part-IV: Model Evaluation
### D1: Stopping Criteria

#### 1. Defining the Number of Epochs:

The number of epochs is a crucial hyperparameter that defines how many times the model will iterate over the entire training dataset. Setting it too high may lead to overfitting, while setting it too low may result in underfitting.
#### 2. Early Stopping:

Early stopping is a technique to prevent overfitting by monitoring the performance on a validation set and stopping the training process when the performance starts to degrade.
#### 3. Visualization of Final Training Epoch:

To visualize the impact, I will plot the training and validation metrics (e.g., loss and accuracy) across epochs. This allows me to observe trends and understand when the model starts to overfit or plateau.
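
The early-stopping rule described in item 2 can be sketched in plain Python. The function `early_stopping_epoch`, the patience value, and the list of validation losses are hypothetical and serve only to show the logic; they are not the notebook's training log.

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 1-based.

    Training stops once the validation loss has failed to improve
    by more than min_delta for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Hypothetical validation losses: improving, then degrading
losses = [0.69, 0.64, 0.62, 0.63, 0.65, 0.70]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=2)
print(stop_epoch, best_epoch)  # 5 3
```

With a rule like this, the number of epochs becomes an upper bound rather than a fixed value, and the weights from the best epoch can be kept.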

In [22]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Assuming y_pred and y_test are NumPy arrays
y_pred = model.predict(x_test)
y_test_ = np.argmax(y_test, axis=1)  # Convert one-hot encoded labels to class labels

accuracy = accuracy_score(y_test_, np.argmax(y_pred, axis=1))
print("Accuracy:", accuracy)

Accuracy: 0.46



## D2: Fitness

I will check the final performance on the validation set. If the validation loss and accuracy have stabilized, the model has likely learned the patterns in the data. The model performs well on the training set, but the validation accuracy has started to fluctuate, so I will monitor the training log closely. If validation accuracy stops improving or starts to degrade, I will consider adjusting model complexity, incorporating regularization techniques, or tuning hyperparameters.

#### Overfitting Mitigation:

Techniques such as dropout layers or L2 regularization in the recurrent layer can be employed to mitigate overfitting. If overfitting is observed during training, I might need to adjust the model architecture or hyperparameters.

## D3: Training Process

## D4: Predictive Accuracy
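The reported predictive accuracy on the test set is Accuracy: 0.5666666666666667. The cell output in D1 shows 0.46; since no random seed is fixed for the data split or the model, results can vary between runs.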
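
Objective A2 lists precision, recall, and F1-score alongside accuracy. A minimal sketch of how these follow from predicted and true class labels is given here; the function `binary_metrics` and the two label lists are hypothetical and do not reproduce the model's predictions.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels for eight reviews
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(y_true_demo, y_pred_demo)
print(metrics)
```

In the notebook, the same quantities are available from the classification_report function already imported in D1.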


## F: Functionality
explained above

## G: Recommendations

Here are some additional recommendations to enhance the model:

#### Data Preprocessing:
Ensure thorough data preprocessing, including text cleaning, lowercasing, and removal of stop words and special characters. Additionally, consider techniques such as stemming or lemmatization to standardize words.

#### Embedding Layer:
Integrate an embedding layer in your neural network model. Embeddings can help represent words in a dense vector space, capturing semantic relationships between words and potentially improving model performance.
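
What an embedding layer does to the padded input can be sketched with a NumPy lookup table. The sizes reuse values that appear earlier in the notebook (num_words = 3000, an embedding length of 100, maxlen = 300) and the two token sequences printed in B2; the random matrix stands in for weights that a real layer would learn, so this is an illustration rather than a trained embedding.

```python
import numpy as np

vocab_size, embedding_dim, maxlen = 3000, 100, 300

# Stand-in for learned weights: one dense vector per token index
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(scale=0.05, size=(vocab_size, embedding_dim))

# Two post-padded sequences of token indices (0 is the padding index)
padded = np.zeros((2, maxlen), dtype=int)
padded[0, :4] = [50, 176, 733, 12]
padded[1, :6] = [898, 1984, 361, 10, 2, 50]

# Lookup: every index is replaced by its row of the embedding matrix
embedded = embedding_matrix[padded]
print(embedded.shape)  # (2, 300, 100)
```

Each time step then carries a 100-dimensional vector instead of a single raw index, which gives the recurrent layer a richer input than the (300, 1) shape used in B6.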

#### Hyperparameter Tuning:
Conduct systematic hyperparameter tuning to optimize the performance of your RNN model. Explore variations in the number of layers, units per layer, learning rates, and batch sizes to find the combination that yields the best results.

#### Regularization Techniques:
Implement regularization techniques such as dropout to prevent overfitting, especially when dealing with a relatively small dataset. Experiment with different dropout rates to strike a balance between underfitting and overfitting.

#### Ensemble Methods:
Explore the use of ensemble methods to combine predictions from multiple models. Ensemble models, such as stacking or bagging, can enhance robustness and generalization.

## I: Sources for Third Party Code
No third party code was used

## J: Sources
Charles, M. (2023, August 5). Leveraging NLP and Comet Experiment Management: A Case Study on Automated Sentiment Analysis. Medium. https://medium.com/@cmugendi3/leveraging-nlp-and-comet-experiment-management-a-case-study-on-automated-sentiment-analysis-fb3646b06c6c