# __Text Classification Using RNN__

Let's see how to classify text using a recurrent neural network (RNN).

In [1]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Sowmya\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Steps to be followed:

1. Import the libraries
2. Define the hyperparameter
3. Preprocess the data and print the lengths of the labels and article lists
4. Split the data into training and validation sets
5. Initialize a tokenizer and fit it to the training articles
6. Convert the training articles into sequences using the tokenizer
7. Pad the sequence
8. Print the length of validation sequences and the shape of validation padded
9. Train the model
10. Compile the model
11. Plot the graph

### Step 1: Import the libraries
- Import the required libraries.

In [4]:
import csv
import tensorflow as tf
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from nltk.corpus import stopwords
STOPWORDS = set(stopwords.words('english'))
import matplotlib.pyplot as plt


### Step 2: Define the hyperparameter
- Set the value of __vocab_size__ to __5000__, representing the size of the vocabulary.
- Set the value of __embedding_dim__ to __64__, specifying the dimensionality of the word embeddings.
- Set the value of __max_length__ to __200__, indicating the maximum length of input sequences.
- Set the value of __padding_type__ to __post__, specifying that padding should be added at the end of sequences.
- Set the value of __trunc_type__ to __post__, indicating that truncation should be applied at the end of sequences.
- Set the value of __oov_tok__ to `<OOV>`, representing the token to be used for out-of-vocabulary words.
- Set the value of __training_portion__ to __0.8__, representing the proportion of data to be used for training.

In [6]:
vocab_size = 5000
embedding_dim = 64
max_length = 200
padding_type = 'post'
trunc_type = 'post'
oov_tok = '<OOV>'
training_portion = .8

### Step 3: Preprocess the data and print the lengths of the labels and articles lists

- Define two empty lists, articles and labels, to store the articles and labels, respectively.
- Read the contents of the **bbc-text.csv** file using csv.reader and iterate through each row.
- Extract the label from the first column of each row and append it to the labels list.
- Process the article from the second column by removing stopwords and replacing consecutive spaces with a single space, and then append it to the articles list.
- Print the lengths of the labels and articles lists to display the number of labels and processed articles, respectively.

In [8]:
articles = []
labels = []
with open("bbc-text_1742742988916.csv", 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    next(reader)  # skip header
    for row in reader:
        label, article = row
        article = ' '.join([word for word in article.split() if word not in STOPWORDS])
        articles.append(article)
        labels.append(label)
print(len(labels))
print(len(articles))

2225
2225
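
The stopword filter in the loop above can be tried on a single sentence. The sketch below uses a small hypothetical stopword set (`DEMO_STOPWORDS`) instead of the NLTK list, so it runs without any downloads. `remove_stopwords` is an illustrative helper name, not part of the notebook.

```python
# Hypothetical, tiny stopword set used only for illustration;
# the notebook itself uses NLTK's English stopword list.
DEMO_STOPWORDS = {"the", "on", "a", "is"}

def remove_stopwords(text, stopwords):
    """Drop stopwords; split() followed by join() also collapses repeated whitespace."""
    return " ".join(word for word in text.split() if word not in stopwords)

cleaned = remove_stopwords("the cat   sat on the  mat", DEMO_STOPWORDS)
print(cleaned)

# The membership test is case-sensitive, so a capitalized "The" is kept.
kept = remove_stopwords("The cat", DEMO_STOPWORDS)
print(kept)
```

Because the comparison is case-sensitive, this filter works best on text that is already lowercase, as the sample article printed later in this notebook appears to be.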


__Observations:__
- The dataset contains **2,225** articles, each with one label.
- Next, split the data into a training set and a validation set according to the **training_portion** set earlier: 80% for training and 20% for validation.

### Step 4: Split the data into training and validation sets
- Inspect the unique labels and convert them to integers with **LabelEncoder**, storing the result in **labels_encoded.**
- Calculate **train_size** by multiplying the length of the articles list by __training_portion__ and converting the result to an integer.
- Create **train_articles** by slicing the articles list from index **0** to **train_size.**
- Create **training_label_seq** by slicing **labels_encoded** from index **0** to **train_size.**
- Create **validation_articles** by slicing the articles list from **train_size** onward.
- Create **validation_label_seq** by slicing **labels_encoded** from **train_size** onward.
- Print **train_size** and the lengths of the four resulting lists.

In [13]:
print(set(labels))

{'sport', 'tech', 'business', 'politics', 'entertainment'}


__Observation:__

- The output is a set containing the unique labels: 'business', 'tech', 'entertainment', 'politics', and 'sport'.

In [16]:
from sklearn.preprocessing import LabelEncoder

# Convert labels to integers using LabelEncoder which automatically makes them 0-indexed
label_encoder = LabelEncoder()
labels_encoded = label_encoder.fit_transform(labels)

In [18]:
# Check unique labels and adjust the model's last Dense layer
num_classes = len(np.unique(labels_encoded))
print("Number of unique labels:", num_classes)


Number of unique labels: 5
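
For string labels, `LabelEncoder` assigns integers according to the sorted order of the unique classes. The following pure-Python sketch imitates that mapping for the five labels printed above. `encode_labels` and `demo_labels` are hypothetical names, and this is not scikit-learn's implementation.

```python
def encode_labels(labels):
    """Map each distinct label to its position in the sorted list of classes."""
    classes = sorted(set(labels))
    to_index = {name: i for i, name in enumerate(classes)}
    return classes, [to_index[name] for name in labels]

# Made-up label list that contains each of the five categories.
demo_labels = ["tech", "business", "sport", "sport", "entertainment", "politics"]
classes, encoded = encode_labels(demo_labels)
print(classes)
print(encoded)
```

In the notebook, `label_encoder.classes_` holds the same ordering, and `label_encoder.inverse_transform` maps integer predictions back to label names.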


In [20]:
# Split the data
train_size = int(len(articles) * training_portion)
train_articles = articles[:train_size]
training_label_seq = labels_encoded[:train_size]

validation_articles = articles[train_size:]
validation_label_seq = labels_encoded[train_size:]


print(train_size)
print(len(train_articles))
print(len(training_label_seq))
print(len(validation_articles))
print(len(validation_label_seq))

1780
1780
1780
445
445


__Observations:__
- The value of **train_size** is calculated based on the proportion of training data.
- The lengths of **train_articles**, **training_label_seq**, **validation_articles**, and **validation_label_seq** represent the number of items in each list.
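
The sizes printed above follow directly from the split arithmetic. As a minimal sketch (the helper name `split_sizes` is hypothetical):

```python
def split_sizes(n_items, training_portion):
    """Return (train_size, validation_size) for a sequential, unshuffled split."""
    train_size = int(n_items * training_portion)  # int() drops any fractional part
    return train_size, n_items - train_size

train_size, validation_size = split_sizes(2225, 0.8)
print(train_size, validation_size)
```

Note that the slices take the first 80% of rows in file order without shuffling. That is fine here because both splits contain all five labels, as the notebook confirms later, but a file sorted by category would need shuffling first.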

### Step 5: Initialize a tokenizer and fit it to the training articles

- Initialize a **Tokenizer** object named tokenizer with the specified parameters: **num_words** representing the vocabulary size and **oov_token** representing the out-of-vocabulary token.
- Fit the tokenizer on the training articles **(train_articles)** using the **fit_on_texts** method.

  `fit_on_texts`: Updates the internal vocabulary based on a list of texts. This method builds the vocabulary index from word frequency, so more frequent words get lower indices. If you give it a sample sentence like

    `The cat sat on the mat.`

  It will create a dictionary, where every word gets a unique integer value. 0 is reserved for padding. Without an OOV token:
      word_index["the"] = 1; word_index["cat"] = 2.
  When **oov_token** is set, as in this notebook, the OOV token takes index 1 and the words start at index 2.

- This step updates the tokenizer's internal word index based on the words in the training articles.
- Assign the word index obtained from the tokenizer to the variable **word_index.**
- Extract the first 10 items from the word_index dictionary.
- Print the resulting dictionary.

In [24]:
# Initialize and fit tokenizer on training data only
tokenizer = Tokenizer(num_words = vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(train_articles)
word_index = tokenizer.word_index

In [26]:
len(word_index)

27270

In [28]:
dict(list(word_index.items())[0:10])

{'<OOV>': 1,
 'said': 2,
 'mr': 3,
 'would': 4,
 'year': 5,
 'also': 6,
 'people': 7,
 'new': 8,
 'us': 9,
 'one': 10}

__Observations:__
- The code prints a dictionary containing the first 10 items from the word_index dictionary.
- These items represent a subset of the word-to-index mappings generated by the tokenizer.
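
Note that **word_index** contains every word seen during fitting (27,270 entries here), not just the top **vocab_size** words. The **num_words** limit is applied later, when texts are converted to sequences. The frequency-based indexing itself can be sketched in pure Python as below. This is a simplified imitation, not the Keras implementation: it splits on whitespace only and does not filter punctuation, and `build_word_index` is a hypothetical helper name.

```python
from collections import Counter

def build_word_index(texts, oov_token=None):
    """Assign indices by descending frequency; index 0 stays free for padding."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # most_common() sorts by count; words with equal counts keep first-seen order.
    ordered = [word for word, _ in counts.most_common()]
    if oov_token is not None:
        ordered.insert(0, oov_token)  # the OOV token takes index 1
    return {word: i for i, word in enumerate(ordered, start=1)}

with_oov = build_word_index(["the cat sat on the mat"], oov_token="<OOV>")
without_oov = build_word_index(["the cat sat on the mat"])
print(with_oov)
print(without_oov)
```

The word "the" occurs twice, so it gets the lowest word index in both cases; setting an OOV token shifts every word index up by one.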

### Step 6: Convert the training articles into sequences using the tokenizer
- Convert the training articles **(train_articles)** into sequences of numbers using the `texts_to_sequences` method of the tokenizer object and assign the result to the `train_sequences` variable.
- Print the sequence representation of the 11th training article (index 10) by accessing **train_sequences[10].**

In [32]:
train_sequences  = tokenizer.texts_to_sequences(train_articles)

# train_sequences is a list of lists
print(train_sequences[10])

[2430, 1, 225, 4991, 22, 641, 586, 225, 4991, 1, 1, 1660, 1, 1, 2430, 22, 564, 1, 1, 140, 278, 1, 140, 278, 796, 822, 661, 2305, 1, 1144, 1691, 1, 1718, 4992, 1, 1, 1, 1, 1, 4733, 1, 1, 122, 4510, 1, 2, 2873, 1503, 352, 4734, 1, 52, 341, 1, 352, 2170, 3958, 41, 22, 3792, 1, 1, 1, 1, 542, 1, 1, 1, 834, 631, 2365, 347, 4735, 1, 365, 22, 1, 787, 2366, 1, 4298, 138, 10, 1, 3663, 681, 3531, 1, 22, 1, 414, 822, 661, 1, 90, 13, 633, 1, 225, 4991, 1, 598, 1, 1691, 1021, 1, 4993, 807, 1861, 117, 1, 1, 1, 2973, 22, 1, 99, 278, 1, 1604, 4994, 542, 492, 1, 1441, 4736, 778, 1319, 1, 1858, 10, 33, 641, 319, 1, 62, 478, 564, 301, 1504, 22, 479, 1, 1, 1663, 1, 797, 1, 3065, 1, 1363, 6, 1, 2430, 564, 22, 2970, 4730, 1, 1, 1, 1, 1, 850, 39, 1822, 674, 297, 26, 979, 1, 882, 22, 361, 22, 13, 301, 1504, 1341, 374, 20, 63, 883, 1096, 4299, 247]


In [34]:
train_articles[10]

'berlin cheers anti-nazi film german movie anti-nazi resistance heroine drawn loud applause berlin film festival. sophie scholl - final days portrays final days member white rose movement. scholl 21 arrested beheaded brother hans 1943 distributing leaflets condemning abhorrent tyranny adolf hitler. director marc rothemund said: feeling responsibility keep legacy scholls going. must somehow keep ideas alive added. film drew transcripts gestapo interrogations scholl trial preserved archive communist east germany secret police. discovery inspiration behind film rothemund worked closely surviving relatives including one scholl sisters ensure historical accuracy film. scholl members white rose resistance group first started distributing anti-nazi leaflets summer 1942. arrested dropped leaflets munich university calling day reckoning adolf hitler regime. film focuses six days scholl arrest intense trial saw scholl initially deny charges ended defiant appearance. one three german films vying 

In [36]:
word_index['resistance']

7153

In [38]:
len(train_sequences)

1780

__Observations:__
- The code prints the sequence representation of the 11th training article (index 10) in the **train_sequences** list.
- The output is a list of integers, where each integer is the index of the corresponding word in the tokenizer's word index.
- Words whose index is **vocab_size** (5000) or higher are replaced by the OOV index **1**. For example, 'resistance' has index 7153 in **word_index**, so it appears as 1 in the sequence.
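
The effect of the **num_words** limit on the conversion can be sketched in pure Python. This is an imitation under simplifying assumptions (whitespace splitting only), and `to_sequence` is a hypothetical helper. The index for 'resistance' is the one printed above; those for 'berlin' and 'film' are inferred by aligning the article text with its printed sequence.

```python
def to_sequence(text, word_index, num_words, oov_token="<OOV>"):
    """Replace unknown words, and words ranked at or beyond num_words, with the OOV index."""
    oov_index = word_index[oov_token]
    sequence = []
    for word in text.lower().split():
        index = word_index.get(word)
        if index is None or index >= num_words:
            index = oov_index
        sequence.append(index)
    return sequence

# Small excerpt of a word index; 'unseenword' is deliberately absent.
demo_word_index = {"<OOV>": 1, "film": 22, "berlin": 2430, "resistance": 7153}
seq = to_sequence("berlin film resistance unseenword", demo_word_index, num_words=5000)
print(seq)
```

Both the rare word and the unseen word collapse to the same index, which is why the printed sequence for the article contains so many 1s.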

### Step 7: Pad the Sequence
- Pad the sequences in **train_sequences** using the **pad_sequences** function. It is done so that every sequence has the same length.
- Set the maximum length of the padded sequences to **max_length.**
- Specify the padding type as **padding_type** and the truncation type as **trunc_type.**
- Assign the padded sequences to the variable **train_padded.**

In [42]:
train_padded = pad_sequences(train_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

In [44]:
print(train_padded[10])

[2430    1  225 4991   22  641  586  225 4991    1    1 1660    1    1
 2430   22  564    1    1  140  278    1  140  278  796  822  661 2305
    1 1144 1691    1 1718 4992    1    1    1    1    1 4733    1    1
  122 4510    1    2 2873 1503  352 4734    1   52  341    1  352 2170
 3958   41   22 3792    1    1    1    1  542    1    1    1  834  631
 2365  347 4735    1  365   22    1  787 2366    1 4298  138   10    1
 3663  681 3531    1   22    1  414  822  661    1   90   13  633    1
  225 4991    1  598    1 1691 1021    1 4993  807 1861  117    1    1
    1 2973   22    1   99  278    1 1604 4994  542  492    1 1441 4736
  778 1319    1 1858   10   33  641  319    1   62  478  564  301 1504
   22  479    1    1 1663    1  797    1 3065    1 1363    6    1 2430
  564   22 2970 4730    1    1    1    1    1  850   39 1822  674  297
   26  979    1  882   22  361   22   13  301 1504 1341  374   20   63
  883 1096 4299  247    0    0    0    0    0    0    0    0    0    0
    0 

__Observation:__
- The code prints the padded sequence representation of the 11th training article.
- The output is a list of integers representing the word indices of the corresponding words in the article, after applying padding to ensure a consistent length (max_length) for all sequences.
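
With both **padding_type** and **trunc_type** set to `'post'`, short sequences gain zeros at the end and long sequences lose tokens from the end. A minimal pure-Python sketch of that behavior for a single sequence follows; `pad_post` is a hypothetical helper, whereas `pad_sequences` itself works on a whole list and returns a NumPy array.

```python
def pad_post(sequence, max_length, pad_value=0):
    """Truncate at the end, then pad at the end, to exactly max_length items."""
    trimmed = sequence[:max_length]  # 'post' truncation keeps the start of the text
    return trimmed + [pad_value] * (max_length - len(trimmed))

short = pad_post([5, 8, 3], 5)
long = pad_post([5, 8, 3, 9, 2, 7], 5)
print(short)
print(long)
```

The Keras defaults for `pad_sequences` are `'pre'` for both options, so the notebook sets them explicitly.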

### Step 8: Print the length of validation sequences and the shape of validation padded
- Convert the validation articles into sequences using the tokenizer and pad the sequences to a maximum length. Assign the result to **validation_padded.**
- Print the length of **validation_sequences** and the shape of **validation_padded.**
- The labels need no further conversion here: they were already encoded as integers with **LabelEncoder** in Step 4 and are stored in **training_label_seq** and **validation_label_seq.**
- Confirm that both label arrays are zero-indexed by printing their unique values.

In [48]:
validation_sequences = tokenizer.texts_to_sequences(validation_articles)
validation_padded = pad_sequences(validation_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

print(len(validation_sequences))
print(validation_padded.shape)


445
(445, 200)


__Observations:__
- The length of **validation_sequences** is 445, the number of sequences in the validation set.
- The shape of **validation_padded** is (445, 200): 445 sequences, each padded or truncated to **max_length** (200).

In [51]:
# Confirm that labels are correctly zero-indexed
print("Unique training labels:", np.unique(training_label_seq))
print("Unique validation labels:", np.unique(validation_label_seq))

Unique training labels: [0 1 2 3 4]
Unique validation labels: [0 1 2 3 4]


### Step 9: Train the model
- Create a sequential model using **tf.keras.Sequential().**
- Add an embedding layer to the model with the specified vocabulary size **(vocab_size)** and embedding dimension **(embedding_dim).**
- Add a bidirectional SimpleRNN layer to the model with **embedding_dim** units.
- Add a dropout layer with a rate of **0.5** to reduce overfitting.
- Add a dense layer to the model with **embedding_dim** units and the **relu** activation function.
- Add a second dropout layer with a rate of **0.5.**
- Add a dense output layer with `num_classes` units, one per unique label, and the **softmax** activation function.
- Compile the model with the **sparse_categorical_crossentropy** loss (the labels are integers), the **adam** optimizer, and the **accuracy** metric.
- Train the model for **10** epochs, passing the validation data so that validation loss and accuracy are reported after each epoch.
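
The layer sizes above determine how many trainable parameters the model has. The arithmetic below is a sketch that assumes Keras defaults (bias terms enabled, and the bidirectional wrapper concatenating the forward and backward outputs); the Dropout layers in the model code add no parameters.

```python
vocab_size, embedding_dim, num_classes = 5000, 64, 5
units = embedding_dim  # the notebook reuses embedding_dim as the RNN width

embedding = vocab_size * embedding_dim
# SimpleRNN: input weights + recurrent weights + bias, per direction
rnn_one_direction = units * (embedding_dim + units + 1)
bidirectional = 2 * rnn_one_direction
# Concatenating both directions gives 2 * units features to the dense layer
dense_hidden = (2 * units) * embedding_dim + embedding_dim
dense_output = embedding_dim * num_classes + num_classes

total = embedding + bidirectional + dense_hidden + dense_output
print(embedding, bidirectional, dense_hidden, dense_output, total)
```

Most of the parameters sit in the embedding table, which is one reason a small dataset of this size can be memorized quickly.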

In [70]:
model_dropout = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim),
    tf.keras.layers.Bidirectional(tf.keras.layers.SimpleRNN(embedding_dim)),
    tf.keras.layers.Dropout(0.5),  # Add dropout
    tf.keras.layers.Dense(embedding_dim, activation='relu'),
    tf.keras.layers.Dropout(0.5),  # Add another dropout layer
    tf.keras.layers.Dense(num_classes, activation='softmax')
])

model_dropout.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
num_epochs = 10
history = model_dropout.fit(train_padded, training_label_seq, epochs=num_epochs, validation_data=(validation_padded, validation_label_seq), verbose=2)


Epoch 1/10
56/56 - 7s - 118ms/step - accuracy: 0.2084 - loss: 1.6326 - val_accuracy: 0.2652 - val_loss: 1.5553
Epoch 2/10
56/56 - 3s - 61ms/step - accuracy: 0.3247 - loss: 1.5014 - val_accuracy: 0.3596 - val_loss: 1.4989
Epoch 3/10
56/56 - 4s - 66ms/step - accuracy: 0.4236 - loss: 1.3961 - val_accuracy: 0.3416 - val_loss: 1.4577
Epoch 4/10
56/56 - 4s - 65ms/step - accuracy: 0.4904 - loss: 1.1857 - val_accuracy: 0.5101 - val_loss: 1.0970
Epoch 5/10
56/56 - 4s - 63ms/step - accuracy: 0.6545 - loss: 0.8628 - val_accuracy: 0.5685 - val_loss: 1.0514
Epoch 6/10
56/56 - 3s - 61ms/step - accuracy: 0.7747 - loss: 0.6004 - val_accuracy: 0.7236 - val_loss: 0.7250
Epoch 7/10
56/56 - 4s - 63ms/step - accuracy: 0.7938 - loss: 0.5404 - val_accuracy: 0.6404 - val_loss: 0.8550
Epoch 8/10
56/56 - 4s - 68ms/step - accuracy: 0.9028 - loss: 0.3140 - val_accuracy: 0.7663 - val_loss: 0.6402
Epoch 9/10
56/56 - 4s - 63ms/step - accuracy: 0.9579 - loss: 0.1549 - val_accuracy: 0.8067 - val_loss: 0.5338
Epoch 10/

In [None]:
model_dropout.save('my_model.keras')

In [71]:
# Evaluate the model on the validation set
val_loss, val_accuracy = model_dropout.evaluate(validation_padded, validation_label_seq, verbose=2)
print(f"Validation loss: {val_loss}")
print(f"Validation accuracy: {val_accuracy}")

14/14 - 0s - 12ms/step - accuracy: 0.8404 - loss: 0.4957
Validation loss: 0.4957028031349182
Validation accuracy: 0.8404494524002075


In [72]:
# Evaluate the model on the training set
train_loss, train_accuracy = model_dropout.evaluate(train_padded, training_label_seq, verbose=2)
print(f"Training loss: {train_loss}")
print(f"Training accuracy: {train_accuracy}")

56/56 - 1s - 11ms/step - accuracy: 0.9972 - loss: 0.0111
Training loss: 0.01105344295501709
Training accuracy: 0.9971910119056702
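
The `56/56` and `14/14` counters in the logs are the number of batches per pass. Since `fit` and `evaluate` are called without a `batch_size` argument, the sketch below assumes the Keras default of 32.

```python
import math

batch_size = 32  # assumed Keras default; the notebook does not set it explicitly
train_steps = math.ceil(1780 / batch_size)  # training examples
val_steps = math.ceil(445 / batch_size)     # validation examples
print(train_steps, val_steps)
```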


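__Observation__

The model reaches a validation accuracy of about 84%, while the training accuracy is about 99.7%. This gap shows that the model still overfits the training data, even with dropout.

Now let's explore one more technique for addressing overfitting: **Early Stopping**.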

**Explanation of Early Stopping Parameters:**

`monitor`: This is the metric to be monitored, usually 'val_loss' or 'val_accuracy'.

`patience`: The number of epochs to continue training without improvement in the monitored metric. After this number, training stops.

`restore_best_weights`: When set to True, the model weights are reverted to those that achieved the best value of the monitored metric. This is useful because even after the condition for patience is met, the best model may have occurred in an earlier epoch.
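
In Keras, these parameters belong to `tf.keras.callbacks.EarlyStopping`, which is passed to `fit` through its `callbacks` argument. The stopping rule itself can be sketched in pure Python as below. This is a simplified imitation for a metric that should decrease, such as `val_loss`; the helper name and the loss values are hypothetical, not results from this notebook.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-based, for a loss to be minimized."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch  # patience never ran out

# Made-up validation losses for illustration only.
hypothetical_val_losses = [1.00, 0.80, 0.70, 0.75, 0.72, 0.71, 0.90]
stop_epoch, best_epoch = early_stopping_epoch(hypothetical_val_losses, patience=3)
print(stop_epoch, best_epoch)
```

Training stops at epoch 6, but the best loss occurred at epoch 3. With `restore_best_weights` set to True, the weights from that earlier epoch are the ones kept.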