# CNN for Text Classification

This notebook demonstrates how to build a Convolutional Neural Network (CNN) for text classification tasks.

## Table of Contents
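1. Introduction to CNN for Text Classification
2. Setup and Imports
3. Data Loading and Exploration
4. Data Preprocessing
5. Building the CNN Model
6. Training the Model
7. Training History Visualization
8. Model Evaluation
9. Prediction Distribution Analysis
10. Making Predictions on Custom Text
11. Model Interpretation and Analysis
12. Visualizing Filter Activations (Optional Advanced)
13. Model Comparison: Simple vs Multi-Filter CNN
14. Save and Load Model
15. Key Takeaways and Next Steps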

## Learning Objectives
- Understand how CNNs can be applied to text data
- Learn about text preprocessing and embedding
- Build and train a CNN model for text classification
- Evaluate model performance and make predictions

## 1. Introduction to CNN for Text Classification

### Why CNNs for Text?
CNNs are best known for image processing, but they are also effective for text classification:
- **Local feature detection**: Convolution filters identify n-gram patterns in text
- **Parameter efficiency**: A convolutional layer typically has fewer parameters than a recurrent layer of comparable width
- **Parallel processing**: All positions in a sequence are processed at once, so training is usually faster than with sequential models
- **Translation invariance**: Combined with global max pooling, a filter detects a feature regardless of its position in the text
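
As a rough illustration of the parameter-efficiency point, the sketch below counts the trainable weights of a single Conv1D layer and a single LSTM layer using the standard formulas for each. The sizes (128-dimensional inputs, 128 filters or units, kernel size 3) are assumed for illustration, and the helper names are hypothetical; the comparison will differ for other configurations.

```python
def conv1d_params(kernel_size, input_dim, num_filters):
    # One weight per (kernel position, input channel, filter) plus one bias per filter
    return kernel_size * input_dim * num_filters + num_filters

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias vector
    return 4 * (input_dim * units + units * units + units)

conv_count = conv1d_params(kernel_size=3, input_dim=128, num_filters=128)
lstm_count = lstm_params(input_dim=128, units=128)
print(f"Conv1D parameters: {conv_count}")  # 49280
print(f"LSTM parameters:   {lstm_count}")  # 131584
```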

### How it works:
1. Text is converted to word embeddings (vectors)
2. 1D convolution filters slide over the text sequence
3. Filters detect local patterns (like n-grams)
4. Pooling layers extract the most important features
5. Dense layers classify based on extracted features
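
The steps above can be sketched in plain NumPy. This is a toy illustration with random weights and assumed sizes (7 tokens, 4-dimensional embeddings, 2 filters of width 3), not the Keras implementation used later in the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, embed_dim = 7, 4        # toy sizes, far smaller than a real model
kernel_size, num_filters = 3, 2

# Step 1: pretend each of the 7 tokens has already been mapped to a 4-d embedding
embedded = rng.normal(size=(seq_len, embed_dim))

# Steps 2-3: each filter spans kernel_size consecutive tokens across all embedding dims
filters = rng.normal(size=(kernel_size, embed_dim, num_filters))
bias = np.zeros(num_filters)

num_windows = seq_len - kernel_size + 1   # no padding at the sequence edges
feature_map = np.empty((num_windows, num_filters))
for start in range(num_windows):
    window = embedded[start:start + kernel_size]   # (kernel_size, embed_dim)
    response = np.tensordot(window, filters, axes=([0, 1], [0, 1])) + bias
    feature_map[start] = np.maximum(response, 0.0)  # ReLU activation

# Step 4: global max pooling keeps the strongest response of each filter
pooled = feature_map.max(axis=0)

print(feature_map.shape)  # (5, 2)
print(pooled.shape)       # (2,)
```

Each entry of the pooled vector records how strongly one filter responded anywhere in the sequence, which is what the dense layers in step 5 receive as input.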

## 2. Setup and Imports

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# TensorFlow and Keras
import tensorflow as tf
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import (
    Dense, Dropout, Embedding, Conv1D, GlobalMaxPooling1D,
    MaxPooling1D, Flatten, Input, Concatenate
)
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.utils import plot_model

# For text processing
import re
import nltk
from nltk.corpus import stopwords

# Download NLTK data
nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)

# Set random seeds for reproducibility
np.random.seed(42)
tf.random.set_seed(42)

# Display versions
print(f"TensorFlow version: {tf.__version__}")
print(f"Numpy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

## 3. Data Loading and Exploration

For this tutorial, we'll use the IMDB movie reviews dataset for sentiment analysis (binary classification).

In [None]:
# Load IMDB dataset from Keras
from tensorflow.keras.datasets import imdb

# Parameters
MAX_FEATURES = 10000  # Vocabulary size: only the most frequent words are kept
MAX_LEN = 500  # Maximum sequence length in tokens (longer reviews are truncated)

print("Loading IMDB dataset...")
(X_train_raw, y_train), (X_test_raw, y_test) = imdb.load_data(num_words=MAX_FEATURES)

print(f"\nTraining samples: {len(X_train_raw)}")
print(f"Testing samples: {len(X_test_raw)}")
print(f"\nClass distribution in training set:")
print(f"Positive: {sum(y_train)} ({sum(y_train)/len(y_train)*100:.2f}%)")
print(f"Negative: {len(y_train) - sum(y_train)} ({(len(y_train) - sum(y_train))/len(y_train)*100:.2f}%)")

In [None]:
# Explore sequence lengths
train_lengths = [len(x) for x in X_train_raw]
test_lengths = [len(x) for x in X_test_raw]

plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.hist(train_lengths, bins=50, alpha=0.7, color='blue')
plt.xlabel('Sequence Length')
plt.ylabel('Frequency')
plt.title('Distribution of Training Sequence Lengths')
plt.axvline(x=MAX_LEN, color='red', linestyle='--', label=f'Max Length: {MAX_LEN}')
plt.legend()

plt.subplot(1, 2, 2)
plt.boxplot([train_lengths, test_lengths], labels=['Train', 'Test'])
plt.ylabel('Sequence Length')
plt.title('Sequence Length Distribution')
plt.axhline(y=MAX_LEN, color='red', linestyle='--', alpha=0.7)

plt.tight_layout()
plt.show()

print(f"\nSequence length statistics (Training):")
print(f"Mean: {np.mean(train_lengths):.2f}")
print(f"Median: {np.median(train_lengths):.2f}")
print(f"Max: {np.max(train_lengths)}")
print(f"Min: {np.min(train_lengths)}")

## 4. Data Preprocessing

Pad shorter sequences with zeros and truncate longer ones to MAX_LEN tokens so that every input to the CNN has the same length.

In [None]:
# Pad sequences
print("Padding sequences...")
X_train = pad_sequences(X_train_raw, maxlen=MAX_LEN, padding='post', truncating='post')
X_test = pad_sequences(X_test_raw, maxlen=MAX_LEN, padding='post', truncating='post')

print(f"\nTraining data shape: {X_train.shape}")
print(f"Testing data shape: {X_test.shape}")

# Create validation split
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42, stratify=y_train
)

print(f"\nAfter validation split:")
print(f"Training samples: {len(X_train)}")
print(f"Validation samples: {len(X_val)}")
print(f"Test samples: {len(X_test)}")
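
To make the effect of `padding='post'` and `truncating='post'` concrete, here is a minimal NumPy sketch that mimics those two options on toy data. The helper `pad_post` is hypothetical and written only for illustration; the notebook itself relies on `pad_sequences`.

```python
import numpy as np

def pad_post(sequences, maxlen, value=0):
    """Hypothetical helper: pad or truncate at the end of each sequence."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        trimmed = seq[:maxlen]             # truncating='post' drops tokens from the end
        out[row, :len(trimmed)] = trimmed  # padding='post' leaves the fill value at the end
    return out

toy = [[5, 8, 2], [7, 1, 9, 4, 3, 6]]
print(pad_post(toy, maxlen=4))
# [[5 8 2 0]
#  [7 1 9 4]]
```

With these settings only the first MAX_LEN tokens of a long review are kept, so its ending is discarded. `pad_sequences` defaults to `'pre'` for both options, which would keep the end of a review instead.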

## 5. Building the CNN Model

### Architecture Overview:
1. **Embedding Layer**: Converts word indices to dense vectors
2. **Convolutional Layers**: Extract local features using different filter sizes
3. **Pooling Layers**: Reduce dimensionality and extract most important features
4. **Dense Layers**: Classification layers
5. **Dropout**: Regularization to prevent overfitting

In [None]:
# Model hyperparameters
EMBEDDING_DIM = 128
NUM_FILTERS = 128
FILTER_SIZES = [3, 4, 5]  # Different n-gram sizes
DROPOUT_RATE = 0.5

def build_simple_cnn(vocab_size, embedding_dim, max_length, num_filters, filter_size, dropout_rate):
    """
    Build a simple CNN model for text classification.
    """
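    model = Sequential([
        # Explicit input shape so the model is built before summary() is called
        Input(shape=(max_length,), name='input'),
        
        # Embedding layer (the input_length argument is deprecated in recent Keras versions)
        Embedding(input_dim=vocab_size, 
                 output_dim=embedding_dim, 
                 name='embedding'),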
        
        # Dropout for embedding
        Dropout(dropout_rate, name='embedding_dropout'),
        
        # Convolutional layer
        Conv1D(filters=num_filters, 
               kernel_size=filter_size,
               activation='relu',
               name='conv1d'),
        
        # Global max pooling
        GlobalMaxPooling1D(name='global_max_pooling'),
        
        # Dense layer
        Dense(128, activation='relu', name='dense1'),
        Dropout(dropout_rate, name='dropout1'),
        
        # Output layer
        Dense(1, activation='sigmoid', name='output')
    ])
    
    return model

# Build the model
model = build_simple_cnn(
    vocab_size=MAX_FEATURES,
    embedding_dim=EMBEDDING_DIM,
    max_length=MAX_LEN,
    num_filters=NUM_FILTERS,
    filter_size=3,
    dropout_rate=DROPOUT_RATE
)

# Compile the model
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

# Display model architecture
print("Simple CNN Model Architecture:")
print("="*50)
model.summary()

### Multi-Filter CNN (Kim 2014 Architecture)

This architecture uses multiple filter sizes to capture different n-gram patterns simultaneously.
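
As a quick sanity check on the shapes involved, the sketch below works through the arithmetic for each branch, assuming the hyperparameters defined earlier (sequence length 500, 128 filters, kernel sizes 3, 4, and 5) and Conv1D's default settings of no padding and stride 1. The lowercase names are local to this example.

```python
max_len = 500
num_filters = 128
filter_sizes = [3, 4, 5]

def conv_output_length(seq_len, kernel_size):
    # Conv1D with the default 'valid' padding and stride 1
    return seq_len - kernel_size + 1

branch_lengths = {k: conv_output_length(max_len, k) for k in filter_sizes}
merged_width = num_filters * len(filter_sizes)

for k, length in branch_lengths.items():
    print(f"kernel_size={k}: feature map ({length}, {num_filters}) -> pooled ({num_filters},)")
print(f"Concatenated feature vector width: {merged_width}")  # 384
```

Global max pooling removes the length dimension, so branches with different kernel sizes can be concatenated even though their feature maps differ in length.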

In [None]:
def build_multi_filter_cnn(vocab_size, embedding_dim, max_length, num_filters, filter_sizes, dropout_rate):
    """
    Build a multi-filter CNN model (Kim 2014 architecture).
    Uses multiple filter sizes to capture different n-gram patterns.
    """
    # Input layer
    inputs = Input(shape=(max_length,), name='input')
    
    # Embedding layer
    # Sequence length comes from the Input layer, so input_length is not needed
    embedding = Embedding(input_dim=vocab_size,
                         output_dim=embedding_dim,
                         name='embedding')(inputs)
    
    embedding = Dropout(dropout_rate, name='embedding_dropout')(embedding)
    
    # Create multiple convolutional branches
    conv_blocks = []
    for i, filter_size in enumerate(filter_sizes):
        conv = Conv1D(filters=num_filters,
                     kernel_size=filter_size,
                     activation='relu',
                     name=f'conv_{filter_size}gram')(embedding)
        conv = GlobalMaxPooling1D(name=f'maxpool_{filter_size}gram')(conv)
        conv_blocks.append(conv)
    
    # Concatenate all features
    if len(conv_blocks) > 1:
        merged = Concatenate(name='concatenate')(conv_blocks)
    else:
        merged = conv_blocks[0]
    
    # Dense layers
    dense = Dense(128, activation='relu', name='dense1')(merged)
    dense = Dropout(dropout_rate, name='dropout1')(dense)
    
    # Output layer
    outputs = Dense(1, activation='sigmoid', name='output')(dense)
    
    # Create model
    model = Model(inputs=inputs, outputs=outputs, name='MultiFilterCNN')
    
    return model

# Build multi-filter model
multi_model = build_multi_filter_cnn(
    vocab_size=MAX_FEATURES,
    embedding_dim=EMBEDDING_DIM,
    max_length=MAX_LEN,
    num_filters=NUM_FILTERS,
    filter_sizes=FILTER_SIZES,
    dropout_rate=DROPOUT_RATE
)

# Compile the model
multi_model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

# Display model architecture
print("\nMulti-Filter CNN Model Architecture:")
print("="*50)
multi_model.summary()

## 6. Training the Model

We'll use callbacks for early stopping and model checkpointing.

In [None]:
# Training parameters
BATCH_SIZE = 128
EPOCHS = 20

# Callbacks
early_stopping = EarlyStopping(
    monitor='val_loss',
    patience=3,
    restore_best_weights=True,
    verbose=1
)

model_checkpoint = ModelCheckpoint(
    'best_model.keras',
    monitor='val_accuracy',
    save_best_only=True,
    verbose=1
)

print("Training Multi-Filter CNN Model...")
print("="*50)

# Train the model
history = multi_model.fit(
    X_train, y_train,
    batch_size=BATCH_SIZE,
    epochs=EPOCHS,
    validation_data=(X_val, y_val),
    callbacks=[early_stopping, model_checkpoint],
    verbose=1
)
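
The early-stopping rule can be illustrated without training anything. The sketch below is a simplified re-implementation of the patience logic on a made-up list of validation losses; the function name and the loss values are hypothetical, and the real callback has further options (such as `min_delta`) that are ignored here.

```python
def simulate_early_stopping(val_losses, patience):
    """Return (epoch at which training stops, epoch with the best loss), both 0-based."""
    best_loss = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch   # stop after `patience` epochs without improvement
    return len(val_losses) - 1, best_epoch

toy_losses = [0.50, 0.40, 0.35, 0.36, 0.37, 0.38, 0.30]
stop_epoch, best_epoch = simulate_early_stopping(toy_losses, patience=3)
print(f"Stopped at epoch {stop_epoch + 1}, best epoch was {best_epoch + 1}")
# Stopped at epoch 6, best epoch was 3
```

In this toy run the later improvement at epoch 7 is never reached, which is the trade-off of a small patience value. With `restore_best_weights=True`, the weights from the best epoch are the ones kept.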

## 7. Training History Visualization

In [None]:
def plot_training_history(history):
    """
    Plot training and validation accuracy and loss.
    """
    fig, axes = plt.subplots(1, 2, figsize=(15, 5))
    
    # Accuracy plot
    axes[0].plot(history.history['accuracy'], label='Training Accuracy', marker='o')
    axes[0].plot(history.history['val_accuracy'], label='Validation Accuracy', marker='s')
    axes[0].set_xlabel('Epoch')
    axes[0].set_ylabel('Accuracy')
    axes[0].set_title('Model Accuracy Over Epochs')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)
    
    # Loss plot
    axes[1].plot(history.history['loss'], label='Training Loss', marker='o')
    axes[1].plot(history.history['val_loss'], label='Validation Loss', marker='s')
    axes[1].set_xlabel('Epoch')
    axes[1].set_ylabel('Loss')
    axes[1].set_title('Model Loss Over Epochs')
    axes[1].legend()
    axes[1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    # Print best metrics
    best_epoch = np.argmax(history.history['val_accuracy'])
    print(f"\nBest Validation Accuracy: {history.history['val_accuracy'][best_epoch]:.4f} at Epoch {best_epoch + 1}")
    print(f"Training Accuracy at Best Epoch: {history.history['accuracy'][best_epoch]:.4f}")
    print(f"Validation Loss at Best Epoch: {history.history['val_loss'][best_epoch]:.4f}")

plot_training_history(history)

## 8. Model Evaluation

In [None]:
# Evaluate on test set
print("Evaluating model on test set...")
test_loss, test_accuracy = multi_model.evaluate(X_test, y_test, verbose=0)

print(f"\nTest Loss: {test_loss:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")

# Generate predictions
y_pred_proba = multi_model.predict(X_test, verbose=0)
y_pred = (y_pred_proba > 0.5).astype(int).flatten()

# Classification report
print("\nClassification Report:")
print("="*50)
print(classification_report(y_test, y_pred, target_names=['Negative', 'Positive']))

In [None]:
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Negative', 'Positive'],
            yticklabels=['Negative', 'Positive'])
plt.title('Confusion Matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.show()

# Calculate additional metrics
from sklearn.metrics import precision_score, recall_score, f1_score

precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"\nPrecision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")
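
For reference, these three metrics can be derived directly from the four cells of a binary confusion matrix. The sketch below uses a made-up matrix in scikit-learn's layout (rows are true labels, columns are predictions); the `_toy` names are used so that the notebook's own variables are left untouched.

```python
import numpy as np

# Toy confusion matrix: rows = true labels, columns = predicted labels
cm_toy = np.array([[40, 10],
                   [5, 45]])
tn, fp, fn, tp = cm_toy.ravel()

precision_toy = tp / (tp + fp)   # of all predicted positives, how many are correct
recall_toy = tp / (tp + fn)      # of all actual positives, how many are found
f1_toy = 2 * precision_toy * recall_toy / (precision_toy + recall_toy)

print(f"Precision: {precision_toy:.4f}")  # 0.8182
print(f"Recall: {recall_toy:.4f}")        # 0.9000
print(f"F1-Score: {f1_toy:.4f}")          # 0.8571
```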

## 9. Prediction Distribution Analysis

In [None]:
# Analyze prediction confidence
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.hist(y_pred_proba[y_test == 0], bins=50, alpha=0.7, label='Actual Negative', color='red')
plt.hist(y_pred_proba[y_test == 1], bins=50, alpha=0.7, label='Actual Positive', color='green')
plt.xlabel('Predicted Probability')
plt.ylabel('Frequency')
plt.title('Distribution of Prediction Probabilities')
plt.axvline(x=0.5, color='black', linestyle='--', label='Decision Threshold')
plt.legend()

plt.subplot(1, 2, 2)
confidence = np.abs(y_pred_proba.flatten() - 0.5)
plt.hist(confidence, bins=50, alpha=0.7, color='blue')
plt.xlabel('Prediction Confidence (|prob - 0.5|)')
plt.ylabel('Frequency')
plt.title('Model Confidence Distribution')

plt.tight_layout()
plt.show()

print(f"\nAverage prediction confidence: {np.mean(confidence):.4f}")
print(f"Predictions with confidence > 0.4: {np.sum(confidence > 0.4) / len(confidence) * 100:.2f}%")
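
One way to use the confidence score is to check whether confident predictions are actually more accurate. The sketch below does this on a handful of made-up probabilities and labels; the 0.25 cut-off is an arbitrary assumption, and the same comparison could be run on `y_pred_proba` and `y_test`.

```python
import numpy as np

# Hypothetical predicted probabilities and true labels
probs = np.array([0.95, 0.10, 0.55, 0.45, 0.80, 0.30])
labels = np.array([1, 0, 0, 0, 1, 1])

preds = (probs > 0.5).astype(int)
conf_toy = np.abs(probs - 0.5)
high = conf_toy >= 0.25            # arbitrary cut-off for "confident" predictions

acc_high = np.mean(preds[high] == labels[high])
acc_low = np.mean(preds[~high] == labels[~high])
print(f"Accuracy when confident:     {acc_high:.2f}")  # 1.00
print(f"Accuracy when not confident: {acc_low:.2f}")   # 0.33
```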

## 10. Making Predictions on Custom Text

Let's create a function to preprocess and predict sentiment on custom text.

In [None]:
# Get word index from IMDB dataset
word_index = imdb.get_word_index()
reverse_word_index = {value: key for key, value in word_index.items()}

def preprocess_text(text, word_index, max_len):
    """
    Preprocess custom text for prediction.
    """
    # Lowercase and strip punctuation so tokens like "fantastic!" match the vocabulary
    words = re.findall(r"[a-z0-9']+", text.lower())
    
    # imdb.load_data() shifts every word index by 3 and reserves
    # 0 (padding), 1 (start of sequence), and 2 (unknown word)
    sequence = [1]  # start-of-sequence token, as in the training data
    for word in words:
        idx = word_index.get(word)
        # Unknown words and words outside the top MAX_FEATURES map to 2,
        # matching what load_data(num_words=MAX_FEATURES) does
        sequence.append(idx + 3 if idx is not None and idx + 3 < MAX_FEATURES else 2)
    
    # Pad sequence
    padded = pad_sequences([sequence], maxlen=max_len, padding='post', truncating='post')
    
    return padded

def predict_sentiment(text, model, word_index, max_len):
    """
    Predict sentiment of custom text.
    """
    # Preprocess text
    processed = preprocess_text(text, word_index, max_len)
    
    # Make prediction
    prediction = model.predict(processed, verbose=0)[0][0]
    
    # Determine sentiment
    sentiment = "Positive" if prediction > 0.5 else "Negative"
    confidence = prediction if prediction > 0.5 else 1 - prediction
    
    return sentiment, confidence, prediction

# Test with custom reviews
test_reviews = [
    "This movie was absolutely fantastic! The acting was superb and the plot kept me engaged throughout.",
    "Terrible film. Waste of time and money. The worst movie I've ever seen.",
    "It was okay, nothing special. Some parts were good but overall mediocre.",
    "Brilliant masterpiece! A must-watch for everyone. Outstanding performances all around.",
    "Boring and predictable. I fell asleep halfway through."
]

print("Sentiment Predictions for Custom Reviews:")
print("="*80)

for i, review in enumerate(test_reviews, 1):
    sentiment, confidence, raw_pred = predict_sentiment(review, multi_model, word_index, MAX_LEN)
    print(f"\nReview {i}:")
    print(f"Text: {review[:80]}..." if len(review) > 80 else f"Text: {review}")
    print(f"Predicted Sentiment: {sentiment}")
    print(f"Confidence: {confidence:.4f} ({confidence*100:.2f}%)")
    print(f"Raw Prediction Score: {raw_pred:.4f}")
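
The index convention used by the IMDB dataset is easy to get wrong, so here is a small round-trip sketch on a toy vocabulary. It assumes the defaults of `imdb.load_data()`: 0 for padding, 1 for the start token, 2 for unknown words, and an offset of 3 for every real word. The vocabulary and the `encode`/`decode` helpers are hypothetical.

```python
INDEX_FROM = 3                 # offset applied to every word rank
PAD, START, OOV = 0, 1, 2      # reserved indices

toy_word_index = {"the": 1, "movie": 2, "was": 3, "great": 4}   # hypothetical vocabulary
toy_reverse = {rank: word for word, rank in toy_word_index.items()}

def encode(words, word_index, num_words):
    encoded = [START]
    for word in words:
        rank = word_index.get(word)
        shifted = rank + INDEX_FROM if rank is not None else OOV
        # Indices at or above num_words are replaced by the unknown-word index
        encoded.append(shifted if shifted < num_words else OOV)
    return encoded

def decode(indices, reverse_index):
    specials = {PAD: "<pad>", START: "<start>", OOV: "<unk>"}
    return [specials.get(i) or reverse_index.get(i - INDEX_FROM, "<unk>") for i in indices]

ids = encode("the movie was great superb".split(), toy_word_index, num_words=7)
print(ids)                       # [1, 4, 5, 6, 2, 2]
print(decode(ids, toy_reverse))  # ['<start>', 'the', 'movie', 'was', '<unk>', '<unk>']
```

Here "great" is in the vocabulary but falls outside the `num_words` cap, while "superb" is not in the vocabulary at all; both end up as index 2.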

## 11. Model Interpretation and Analysis

In [None]:
# Analyze some misclassified examples
misclassified_idx = np.where(y_pred != y_test)[0]

print(f"Total misclassified samples: {len(misclassified_idx)}")
print(f"Misclassification rate: {len(misclassified_idx) / len(y_test) * 100:.2f}%")

# Function to decode review
def decode_review(encoded_review):
    return ' '.join([reverse_word_index.get(i - 3, '?') for i in encoded_review if i > 0])

# Show some misclassified examples
print("\nExamples of Misclassified Reviews:")
print("="*80)

for i in range(min(5, len(misclassified_idx))):
    idx = misclassified_idx[i]
    decoded = decode_review(X_test[idx])
    
    print(f"\nExample {i+1}:")
    print(f"Review: {decoded[:200]}..." if len(decoded) > 200 else f"Review: {decoded}")
    print(f"True Label: {'Positive' if y_test[idx] == 1 else 'Negative'}")
    print(f"Predicted: {'Positive' if y_pred[idx] == 1 else 'Negative'}")
    print(f"Prediction Probability: {y_pred_proba[idx][0]:.4f}")

## 12. Visualizing Filter Activations (Optional Advanced)

In [None]:
# Create a model that outputs intermediate layer activations
from tensorflow.keras.models import Model

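# Select the embedding layer and every conv layer by name
# (slicing multi_model.layers by position would also pick up the dropout layer)
layer_names = ['embedding'] + [f'conv_{size}gram' for size in FILTER_SIZES]
layer_outputs = [multi_model.get_layer(name).output for name in layer_names]
activation_model = Model(inputs=multi_model.input, outputs=layer_outputs)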

# Get activations for a sample
sample_idx = 0
sample = X_test[sample_idx:sample_idx+1]
activations = activation_model.predict(sample, verbose=0)

print("Activation shapes:")
for i, activation in enumerate(activations):
    print(f"Layer {i+1}: {activation.shape}")

# Visualize embedding
embedding_output = activations[0][0]  # Shape: (max_len, embedding_dim)

plt.figure(figsize=(15, 5))
plt.imshow(embedding_output.T, aspect='auto', cmap='viridis')
plt.colorbar()
plt.xlabel('Sequence Position')
plt.ylabel('Embedding Dimension')
plt.title('Word Embedding Visualization')
plt.show()
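
Convolution activations can be read in a similar way: for a given filter, the position of its largest activation points to the n-gram it responded to most strongly. The sketch below shows the lookup on made-up tokens and activation values; in the notebook, the values would come from one column of a conv layer's output for a single review.

```python
import numpy as np

tokens = ["this", "movie", "was", "not", "very", "good", "at", "all"]
kernel_size = 3

# Hypothetical activations of one filter, one value per window of 3 tokens
filter_activations = np.array([0.1, 0.2, 0.3, 0.9, 0.0, 0.1])

# Position i in the feature map covers tokens i .. i + kernel_size - 1
best_start = int(np.argmax(filter_activations))
top_ngram = tokens[best_start:best_start + kernel_size]
print(" ".join(top_ngram))  # not very good
```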

## 13. Model Comparison: Simple vs Multi-Filter CNN

Let's train the simple CNN and compare performance.

In [None]:
# Train simple CNN for comparison
print("Training Simple CNN for comparison...")

simple_history = model.fit(
    X_train, y_train,
    batch_size=BATCH_SIZE,
    epochs=EPOCHS,
    validation_data=(X_val, y_val),
    callbacks=[EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)],
    verbose=1
)

# Evaluate simple model
simple_test_loss, simple_test_accuracy = model.evaluate(X_test, y_test, verbose=0)

print("\n" + "="*50)
print("MODEL COMPARISON")
print("="*50)
print(f"\nSimple CNN:")
print(f"  Test Accuracy: {simple_test_accuracy:.4f}")
print(f"  Test Loss: {simple_test_loss:.4f}")

print(f"\nMulti-Filter CNN:")
print(f"  Test Accuracy: {test_accuracy:.4f}")
print(f"  Test Loss: {test_loss:.4f}")

print(f"\nAccuracy difference (Multi-Filter - Simple): {(test_accuracy - simple_test_accuracy)*100:+.2f} percentage points")

## 14. Save and Load Model

In [None]:
# Save the model
model_path = 'cnn_text_classifier.keras'
multi_model.save(model_path)
print(f"Model saved to {model_path}")

# Load the model (demonstration)
loaded_model = tf.keras.models.load_model(model_path)
print("Model loaded successfully!")

# Verify loaded model works
loaded_test_loss, loaded_test_accuracy = loaded_model.evaluate(X_test, y_test, verbose=0)
print(f"\nLoaded model test accuracy: {loaded_test_accuracy:.4f}")

## 15. Key Takeaways and Next Steps

### What We Learned:
1. **CNN Architecture for Text**: 
   - Embedding layer converts words to dense vectors
   - Conv1D layers extract local n-gram features
   - Pooling layers select most important features
   - Multiple filter sizes capture different patterns

2. **Training Best Practices**:
   - Use early stopping to prevent overfitting
   - Monitor validation metrics
   - Use dropout for regularization
   - Save best model checkpoints

3. **Performance Analysis**:
   - Evaluate on held-out test set
   - Analyze confusion matrix
   - Check prediction confidence distribution
   - Examine misclassified examples

### Possible Improvements:
1. **Pre-trained Embeddings**: Use GloVe or Word2Vec
2. **Deeper Architecture**: Add more convolutional layers
3. **Hyperparameter Tuning**: Optimize filter sizes, number of filters, dropout rates
4. **Data Augmentation**: Back-translation, synonym replacement
5. **Ensemble Methods**: Combine multiple models
6. **Attention Mechanisms**: Add attention layers to focus on important words
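
For the first improvement, the main preparation step is building an embedding matrix whose rows line up with the model's word indices. The sketch below shows that alignment on toy data. The vocabulary and vectors are made up (real GloVe or Word2Vec vectors would be loaded from a file), and the offset of 3 is assumed to match the IMDB index convention.

```python
import numpy as np

embedding_dim = 4      # toy size; published pre-trained vectors are larger
index_from = 3         # offset applied to word ranks in the IMDB dataset
num_words = 8          # toy vocabulary cap

toy_word_index = {"the": 1, "movie": 2, "great": 3}   # hypothetical vocabulary
toy_vectors = {                                        # hypothetical pre-trained vectors
    "the": np.array([0.1, 0.2, 0.3, 0.4]),
    "great": np.array([0.9, 0.8, 0.7, 0.6]),
}

# Rows without a pre-trained vector (padding, special tokens, unseen words) stay zero
embedding_matrix = np.zeros((num_words, embedding_dim))
for word, rank in toy_word_index.items():
    row = rank + index_from
    if row < num_words and word in toy_vectors:
        embedding_matrix[row] = toy_vectors[word]

print(embedding_matrix.shape)  # (8, 4)
```

A matrix like this could then be passed to the Embedding layer, for example through `embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix)`, with `trainable=False` if the vectors should stay fixed.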

### Resources:
- Original Paper: "Convolutional Neural Networks for Sentence Classification" (Kim, 2014)
- TensorFlow Documentation (text classification tutorial, RNN-based): https://www.tensorflow.org/tutorials/text/text_classification_rnn
- Understanding CNNs for NLP: http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/

## Exercises for Practice:

1. **Experiment with hyperparameters**: Try different embedding dimensions, filter sizes, and number of filters
2. **Multi-class classification**: Apply this to a dataset with more than 2 classes
3. **Custom dataset**: Use your own text data for classification
4. **Pre-trained embeddings**: Integrate GloVe or Word2Vec embeddings
5. **Regularization techniques**: Try L2 regularization, different dropout rates
6. **Model interpretation**: Implement attention visualization or feature importance analysis