# Complex Document Classification
For a more complex document classification task, we can use a larger dataset and a neural network model, such as a Convolutional Neural Network (CNN) or a Long Short-Term Memory (LSTM) network. Both are well suited to text classification, especially when a large amount of labeled data is available.


**Simple document classification:** TF-IDF features with a Naive Bayes classifier

**Complex document classification:** a Long Short-Term Memory (LSTM) network

One example of a large dataset is the "Jigsaw Multilingual Toxic Comment Classification" dataset available on Kaggle. It contains comments from various online platforms, labeled for toxicity. For this example, however, I will use the "IMDb Movie Reviews" dataset, which is easier to access because it can be loaded directly through Keras. It is large, commonly used for text classification, and contains movie reviews labeled as positive or negative.




In [1]:
# Setup and Data Loading
# Import necessary libraries
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM
from tensorflow.keras.datasets import imdb

# Set parameters for the dataset and model
max_features = 20000  # vocabulary size: keep only the 20,000 most frequent words
maxlen = 80  # length of every input sequence, in tokens, after padding/truncation
batch_size = 32  # number of reviews per gradient update

# Load the IMDb dataset
print("Loading data...")
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

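# Pad or truncate every review to exactly maxlen tokens.
# By default pad_sequences pads and truncates at the start, so long reviews keep their last maxlen tokens.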
print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)


Loading data...
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
25000 train sequences
25000 test sequences
Pad sequences (samples x time)
x_train shape: (25000, 80)
x_test shape: (25000, 80)
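
To make the padding step concrete, here is a minimal NumPy sketch of what `pad_sequences` does with its default settings (`padding='pre'`, `truncating='pre'`, padding value 0). The helper name `pad_pre` and the toy sequences are assumptions made for illustration; they are not part of Keras or of the notebook above.

```python
import numpy as np

def pad_pre(seqs, maxlen, value=0):
    """Hypothetical helper mimicking the pad_sequences defaults:
    pad at the front and truncate at the front."""
    out = np.full((len(seqs), maxlen), value, dtype="int32")
    for i, seq in enumerate(seqs):
        if len(seq) == 0:
            continue  # an empty sequence stays all padding
        trunc = seq[-maxlen:]          # keep only the LAST maxlen tokens
        out[i, -len(trunc):] = trunc   # right-align; padding fills the front
    return out

# Toy word-index sequences (made-up values)
toy = [[5, 8], [1, 2, 3, 4, 5, 6]]
padded = pad_pre(toy, maxlen=4)
print(padded)
```

The short sequence is padded with zeros on the left, and the long one loses its first two tokens. With `maxlen = 80`, the model therefore sees only the final 80 tokens of longer reviews.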


In [2]:
# Building the LSTM Model

# LSTM model architecture
print('Building model...')
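model = Sequential()
# Map each word index to a trainable 128-dimensional vector
model.add(Embedding(max_features, 128))
# One LSTM layer with 128 units; dropout is applied to the inputs and to the recurrent state
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
# A single sigmoid unit outputs the probability that the review is positive
model.add(Dense(1, activation='sigmoid'))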

# Compile the model
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])


Building model...




In [3]:
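# Training the Model
# Note: the test set is also passed as validation data here, so the per-epoch
# validation metrics are not independent of the final test score.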
print('Train...')
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=15,
          validation_data=(x_test, y_test))


Train...
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.src.callbacks.History at 0x7aade21c71f0>
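
The `History` object returned by `model.fit` stores the per-epoch metrics in its `history` attribute, a dictionary of lists. With the compile settings used here and validation data supplied, its keys are `loss`, `accuracy`, `val_loss`, and `val_accuracy`. The sketch below uses a hand-written stand-in dictionary with made-up values (the real per-epoch numbers were not captured in the output above) to show one way to find the epoch with the lowest validation loss. `best_epoch` is a hypothetical helper, not a Keras function.

```python
def best_epoch(history_dict, key="val_loss"):
    """Return (1-based epoch number, value) where history_dict[key] is smallest."""
    values = history_dict[key]
    idx = min(range(len(values)), key=values.__getitem__)
    return idx + 1, values[idx]

# Stand-in for history.history; the values are invented for illustration
example_history = {
    "loss":     [0.45, 0.30, 0.21, 0.15],
    "val_loss": [0.40, 0.37, 0.41, 0.48],
}

epoch, val = best_epoch(example_history)
print(f"lowest val_loss {val:.2f} at epoch {epoch}")
```

In a real run, one would capture the return value, for example `history = model.fit(...)`, and pass `history.history` to the helper. A validation loss that rises while the training loss keeps falling, as in the invented numbers here, is the usual sign of overfitting.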

In [4]:
# Evaluating the Model
# evaluate() returns the loss (binary cross-entropy) first, then the accuracy metric
score, acc = model.evaluate(x_test, y_test,
                            batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)


Test score: 1.0700082778930664
Test accuracy: 0.813319981098175


Here are some common types of visualizations for model evaluation:

- **Training and Validation Loss over Epochs:** This line chart shows the training and validation loss for each training epoch. It helps you assess whether your model is overfitting or underfitting.

- **Training and Validation Accuracy over Epochs:** Similar to the loss chart, this line chart displays the training and validation accuracy for each epoch. It helps you understand how well your model performs on the training and validation data.

- **Confusion Matrix:** A confusion matrix is a table that summarizes the performance of a classification model. It shows the number of true positives, true negatives, false positives, and false negatives. You can display it as a heatmap.

- **ROC Curve and AUC:** For binary classification, you can plot a Receiver Operating Characteristic (ROC) curve and calculate the Area Under the Curve (AUC). The curve shows the trade-off between true positive rate and false positive rate at different thresholds.

- **Precision-Recall Curve:** This curve is useful for imbalanced datasets and shows the trade-off between precision and recall at different thresholds. It helps you choose an appropriate threshold for your classification task.

- **Histogram of Predictions:** A histogram visualizes the distribution of predicted probabilities or scores for each class. It helps you understand how confident your model is in its predictions.

- **Box Plot of Prediction Scores:** A box plot visualizes the spread of prediction scores for each class, providing insight into the model's uncertainty.
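
Several of these plots are built from the same few numbers. As a rough sketch, the code below computes the confusion-matrix counts, precision, recall, and ROC AUC with plain NumPy. The labels and probabilities are made-up toy values, and `confusion_counts` and `roc_auc` are hypothetical helper names. In practice `y_prob` could come from something like `model.predict(x_test).ravel()`, and scikit-learn's `confusion_matrix` and `roc_auc_score` provide tested versions of the same calculations.

```python
import numpy as np

def confusion_counts(y_true, y_prob, threshold=0.5):
    """Return (tp, tn, fp, fn) after thresholding the predicted probabilities."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob, dtype=float) >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    return tp, tn, fp, fn

def roc_auc(y_true, y_prob):
    """AUC as the fraction of (positive, negative) pairs in which the positive
    example receives the higher score; ties count as half."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob, dtype=float)
    pos = y_prob[y_true == 1]
    neg = y_prob[y_true == 0]
    diff = pos[:, None] - neg[None, :]  # every positive compared with every negative
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return float(wins / (len(pos) * len(neg)))

# Toy labels and predicted probabilities (invented for illustration)
y_true = [1, 0, 1, 1, 0, 0]
y_prob = [0.9, 0.2, 0.6, 0.4, 0.7, 0.1]

tp, tn, fp, fn = confusion_counts(y_true, y_prob)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
auc = roc_auc(y_true, y_prob)
print("TP, TN, FP, FN:", tp, tn, fp, fn)
print("precision:", precision, "recall:", recall, "AUC:", auc)
```

The pairwise AUC calculation is quadratic in the number of examples, so it is meant for illustration on small samples rather than for the full 25,000-review test set.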