# 09 — CNN (TextCNN)

A **Convolutional Neural Network** for text classification using Word2Vec embeddings.

Unlike the FFN, which averaged word vectors into a single document vector, this model operates on **sequences** of word vectors.
Conv1D filters with window sizes 3, 4, and 5 slide over each sequence to capture n-gram patterns.
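
To make the window sizes concrete, the following plain-Python sketch lists the token windows a size-3 filter would see and computes how many positions each filter size produces on a 50-token sequence. It is not part of the notebook: `ngram_windows` and `conv_output_length` are illustrative names, the sentence is invented, and the arithmetic assumes stride 1 and no padding, which are the `nn.Conv1d` defaults.

```python
def ngram_windows(tokens, size):
    """Return every contiguous window of `size` tokens (stride 1, no padding)."""
    return [tokens[i:i + size] for i in range(len(tokens) - size + 1)]


def conv_output_length(seq_len, kernel_size):
    """Number of positions a stride-1, unpadded 1D filter produces."""
    return seq_len - kernel_size + 1


tokens = "this movie was not good at all".split()
for window in ngram_windows(tokens, 3):
    print(window)

# Output positions per filter size for a 50-token sequence.
lengths = {k: conv_output_length(50, k) for k in (3, 4, 5)}
print(lengths)  # {3: 48, 4: 47, 5: 46}
```

Each window corresponds to one position where a filter is evaluated, so wider filters see longer phrases but produce slightly fewer positions.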

In [1]:
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
from gensim.models import Word2Vec
from sklearn.metrics import accuracy_score, classification_report
import os

# Corpus configuration
CORPUS_NAME = 'raw_corpus' # Options: 'pre-filtered-corpus', 'raw_corpus'
PROCESSED_DATA_DIR = f'../data/processed/{CORPUS_NAME}'
MODELS_DIR_BASE = f'../models/{CORPUS_NAME}'




In [2]:
%load_ext watermark
%watermark -v -n -m -p numpy,pandas,torch,gensim,sklearn

Python implementation: CPython
Python version       : 3.12.12
IPython version      : 9.10.0

numpy  : 2.4.2
pandas : 3.0.0
torch  : 2.2.2
gensim : 4.4.0
sklearn: 1.8.0

Compiler    : Clang 17.0.0 (clang-1700.6.3.2)
OS          : Darwin
Release     : 25.2.0
Machine     : x86_64
Processor   : i386
CPU cores   : 8
Architecture: 64bit



## 1. Data Preparation

We need **sequences** of word indices (not averaged vectors) so that Conv1D can detect local patterns.
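
As a rough illustration of what such a sequence looks like, the sketch below mirrors the truncation and padding logic of `texts_to_sequences` on a tiny vocabulary. `toy_word2idx` and `to_sequence` are invented for this example and are not taken from the Word2Vec model; a short `max_len` keeps the result readable.

```python
# Hypothetical vocabulary; index 0 is reserved for padding, as in the notebook.
toy_word2idx = {"<PAD>": 0, "great": 1, "movie": 2, "not": 3, "bad": 4}


def to_sequence(text, word2idx, max_len):
    """Map tokens to indices, truncate to max_len, then right-pad with 0."""
    seq = [word2idx.get(tok, 0) for tok in text.split()[:max_len]]
    return seq + [0] * (max_len - len(seq))


short = to_sequence("great movie", toy_word2idx, max_len=5)
long_ = to_sequence("not bad not bad not bad movie", toy_word2idx, max_len=5)
unknown = to_sequence("great unseen movie", toy_word2idx, max_len=5)

print(short)    # [1, 2, 0, 0, 0]
print(long_)    # [3, 4, 3, 4, 3]
print(unknown)  # [1, 0, 2, 0, 0]
```

Note that the unknown token and the padding positions share index 0, so the model cannot tell them apart.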

In [3]:
MAX_LEN = 50  # Pad/truncate all documents to 50 tokens
EMBED_DIM = 100
LABEL_MAP = {'NEGATIVE': 0, 'POSITIVE': 1}

def build_embedding_matrix(w2v_model):
    """Create an embedding matrix from Word2Vec. Index 0 is reserved for padding."""
    vocab = w2v_model.wv.key_to_index
    matrix = np.zeros((len(vocab) + 1, EMBED_DIM))  # +1 for padding at index 0
    word2idx = {'<PAD>': 0}
    for word, idx in vocab.items():
        word2idx[word] = idx + 1  # Shift by 1 (0 = padding)
        matrix[idx + 1] = w2v_model.wv[word]
    return matrix, word2idx

def texts_to_sequences(texts, word2idx, max_len):
    """Convert texts to padded integer sequences."""
    sequences = []
    for text in texts:
        tokens = str(text).split()
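        # Truncate to max_len; out-of-vocabulary words fall back to index 0,
        # so they are treated the same as padding
        seq = [word2idx.get(w, 0) for w in tokens[:max_len]]
        # Right-pad with 0 up to max_len
        seq += [0] * (max_len - len(seq))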
        sequences.append(seq)
    return np.array(sequences)

## 2. Model Definition

TextCNN: Embedding → Conv1D filters (window sizes 3, 4, 5) with ReLU → Max-over-time pooling → Concatenate → Dropout → Dense → Sigmoid
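
The core of this pipeline, one filter followed by ReLU and max-over-time pooling, can be sketched in plain Python. The 2-dimensional embeddings and filter weights below are made-up toy values, and `conv1d_single_filter` is an illustrative helper rather than the PyTorch implementation; it assumes stride 1 and no padding.

```python
def conv1d_single_filter(embedded, weights, bias=0.0):
    """Slide one filter over a sequence of embedding vectors.

    embedded: list of seq_len vectors, each of length embed_dim
    weights:  list of kernel_size vectors, each of length embed_dim
    Returns one activation per window position.
    """
    k = len(weights)
    outputs = []
    for start in range(len(embedded) - k + 1):
        total = bias
        for offset in range(k):
            for x, w in zip(embedded[start + offset], weights[offset]):
                total += x * w
        outputs.append(total)
    return outputs


def relu(values):
    """Clamp negative activations to zero."""
    return [max(0.0, v) for v in values]


# Toy sequence of 5 tokens with 2-dimensional embeddings (invented values).
embedded = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
# One filter of window size 3 (invented weights).
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

feature_map = relu(conv1d_single_filter(embedded, weights))
pooled = max(feature_map)  # max-over-time: one number per filter

print(feature_map)  # [4.0, 0.0, 0.0]
print(pooled)       # 4.0
```

With the default arguments of the model defined below, this happens for 100 filters at each of the three window sizes, giving 300 pooled features per document.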

In [4]:
class TextCNN(nn.Module):
    def __init__(self, embedding_matrix, num_filters=100, filter_sizes=(3, 4, 5), dropout=0.3):
        super().__init__()
        vocab_size, embed_dim = embedding_matrix.shape
        
        # Pre-trained embedding layer (frozen)
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.embedding.weight = nn.Parameter(torch.FloatTensor(embedding_matrix))
        self.embedding.weight.requires_grad = False  # Freeze embeddings
        
        # Conv1D filters at different window sizes
        self.convs = nn.ModuleList([
            nn.Conv1d(embed_dim, num_filters, kernel_size=fs)
            for fs in filter_sizes
        ])
        
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(num_filters * len(filter_sizes), 1)
    
    def forward(self, x):
        # x: (batch, seq_len) → (batch, seq_len, embed_dim)
        x = self.embedding(x)
        # Conv1d expects (batch, channels, seq_len)
        x = x.permute(0, 2, 1)
        
        # Apply each conv filter + max-over-time pooling
        conv_outs = []
        for conv in self.convs:
            c = F.relu(conv(x))           # (batch, num_filters, new_len)
            c = F.max_pool1d(c, c.size(2)) # (batch, num_filters, 1)
            c = c.squeeze(2)               # (batch, num_filters)
            conv_outs.append(c)
        
        # Concatenate all filter outputs
        out = torch.cat(conv_outs, dim=1)  # (batch, num_filters * len(filter_sizes))
        out = self.dropout(out)
        # Sigmoid turns the single logit into P(POSITIVE); squeeze to shape (batch,)
        out = torch.sigmoid(self.fc(out)).squeeze(1)
        return out
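
For a sense of scale, the trainable parameter count implied by the default arguments can be worked out by hand. The sketch below is plain arithmetic, not a call into PyTorch. It assumes `embed_dim=100` (the notebook's `EMBED_DIM`), `num_filters=100`, `filter_sizes=(3, 4, 5)`, and a bias term on every layer (the `nn.Conv1d` and `nn.Linear` default), and it excludes the frozen embedding.

```python
embed_dim = 100
num_filters = 100
filter_sizes = (3, 4, 5)

# Each Conv1d has in_channels * kernel_size weights per filter, plus one bias per filter.
conv_params = {
    fs: embed_dim * fs * num_filters + num_filters for fs in filter_sizes
}

# The final Linear maps the concatenated pooled features to a single logit.
fc_in = num_filters * len(filter_sizes)
fc_params = fc_in * 1 + 1

total = sum(conv_params.values()) + fc_params
print(conv_params)  # {3: 30100, 4: 40100, 5: 50100}
print(fc_params)    # 301
print(total)        # 120601
```

The frozen embedding holds a further `vocab_size * 100` values, but none of them receive gradient updates.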

## 3. Training Function

In [5]:
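def train_cnn(variation_name, data_dir, w2v_path, output_dir, epochs=20, lr=1e-3, batch_size=32):
    """Train a TextCNN on one corpus variation, evaluate it on the test split, and save the weights."""
    print(f"\n{'='*20} TextCNN: {variation_name} {'='*20}")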
    
    # Load Word2Vec
    w2v = Word2Vec.load(w2v_path)
    embed_matrix, word2idx = build_embedding_matrix(w2v)
    print(f"Embedding matrix: {embed_matrix.shape}")
    
    # Load data
    train_df = pd.read_csv(f'{data_dir}/train.csv').fillna('')
    test_df  = pd.read_csv(f'{data_dir}/test.csv').fillna('')
    
    X_train = texts_to_sequences(train_df['text_clean'], word2idx, MAX_LEN)
    X_test  = texts_to_sequences(test_df['text_clean'], word2idx, MAX_LEN)
    y_train = train_df['label'].map(LABEL_MAP).values.astype(np.float32)
    y_test  = test_df['label'].map(LABEL_MAP).values.astype(np.float32)
    
    # DataLoaders
    train_ds = TensorDataset(torch.LongTensor(X_train), torch.FloatTensor(y_train))
    train_loader = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
    
    # Model
    model = TextCNN(embed_matrix)
    criterion = nn.BCELoss()
    optimizer = torch.optim.Adam(
        filter(lambda p: p.requires_grad, model.parameters()), lr=lr
    )
    
    # Train
    model.train()
    for epoch in range(epochs):
        total_loss = 0
        for batch_x, batch_y in train_loader:
            optimizer.zero_grad()
            output = model(batch_x)
            loss = criterion(output, batch_y)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        if (epoch + 1) % 5 == 0:
            print(f"  Epoch {epoch+1}/{epochs} — Loss: {total_loss/len(train_loader):.4f}")
    
    # Evaluate
    model.eval()
    with torch.no_grad():
        preds = model(torch.LongTensor(X_test))
    # Threshold the predicted probabilities at 0.5 to obtain class labels
    y_pred = np.array((preds >= 0.5).int().tolist())
    
    acc = accuracy_score(y_test, y_pred)
    print(f"\nTextCNN ({variation_name}) Accuracy: {acc:.4f}")
    print(classification_report(y_test.astype(int), y_pred))
    
    # Save
    os.makedirs(output_dir, exist_ok=True)
    torch.save(model.state_dict(), f'{output_dir}/model.pt')
    print(f"Model saved to {output_dir}/model.pt")
    
    return acc
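
The loss that the training loop prints is binary cross-entropy, computed per batch and then averaged over the batches of an epoch. As a small worked example of the per-batch quantity, `bce_loss` below is an illustrative plain-Python re-implementation, assuming the mean reduction that `nn.BCELoss` uses by default; the probabilities are invented.

```python
import math


def bce_loss(probs, targets):
    """Mean binary cross-entropy: -(y*log(p) + (1-y)*log(1-p)), averaged."""
    per_example = [
        -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for p, y in zip(probs, targets)
    ]
    return sum(per_example) / len(per_example)


# Invented sigmoid outputs and their true labels.
confident = bce_loss([0.9, 0.2], [1.0, 0.0])  # predictions agree with labels
wrong = bce_loss([0.1, 0.8], [1.0, 0.0])      # predictions contradict labels

print(f"{confident:.4f}")  # 0.1643
print(f"{wrong:.4f}")      # 1.9560
```

Confident correct predictions push the loss toward zero, which is the trend visible in the epoch logs below.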

## 4. Run All Pipelines

In [6]:
acc_standard = train_cnn("Standard", f"{PROCESSED_DATA_DIR}/standard", f"{MODELS_DIR_BASE}/word2vec/standard/word2vec.model", f"{MODELS_DIR_BASE}/cnn/standard")

acc_irony = train_cnn("Irony", f"{PROCESSED_DATA_DIR}/irony", f"{MODELS_DIR_BASE}/word2vec/irony/word2vec.model", f"{MODELS_DIR_BASE}/cnn/irony")

acc_obfuscated = train_cnn("Obfuscated", f"{PROCESSED_DATA_DIR}/obfuscated", f"{MODELS_DIR_BASE}/word2vec/obfuscated/word2vec.model", f"{MODELS_DIR_BASE}/cnn/obfuscated")



Embedding matrix: (2332, 100)


  Epoch 5/20 — Loss: 0.3578


  Epoch 10/20 — Loss: 0.2090


  Epoch 15/20 — Loss: 0.1121


  Epoch 20/20 — Loss: 0.0803

TextCNN (Standard) Accuracy: 0.8172
              precision    recall  f1-score   support

           0       0.79      0.86      0.82       232
           1       0.85      0.78      0.81       233

    accuracy                           0.82       465
   macro avg       0.82      0.82      0.82       465
weighted avg       0.82      0.82      0.82       465

Model saved to ../models/raw_corpus/cnn/standard/model.pt



Embedding matrix: (2323, 100)


  Epoch 5/20 — Loss: 0.3526


  Epoch 10/20 — Loss: 0.1951


  Epoch 15/20 — Loss: 0.1111


  Epoch 20/20 — Loss: 0.0782

TextCNN (Irony) Accuracy: 0.8323
              precision    recall  f1-score   support

           0       0.80      0.89      0.84       232
           1       0.87      0.78      0.82       233

    accuracy                           0.83       465
   macro avg       0.84      0.83      0.83       465
weighted avg       0.84      0.83      0.83       465

Model saved to ../models/raw_corpus/cnn/irony/model.pt



Embedding matrix: (2303, 100)


  Epoch 5/20 — Loss: 0.3424


  Epoch 10/20 — Loss: 0.1973


  Epoch 15/20 — Loss: 0.1160


  Epoch 20/20 — Loss: 0.0663

TextCNN (Obfuscated) Accuracy: 0.8280
              precision    recall  f1-score   support

           0       0.81      0.85      0.83       232
           1       0.85      0.80      0.82       233

    accuracy                           0.83       465
   macro avg       0.83      0.83      0.83       465
weighted avg       0.83      0.83      0.83       465

Model saved to ../models/raw_corpus/cnn/obfuscated/model.pt


## 5. Comparison

In [7]:
print("\n=== Final Comparison ===")
print(f"Standard: {acc_standard:.4f}")
print(f"Irony:    {acc_irony:.4f}")
print(f"Obfuscated: {acc_obfuscated:.4f}")
diff = acc_irony - acc_standard
print(f"Impact of Irony features: {diff:+.4f}")


=== Final Comparison ===
Standard: 0.8172
Irony:    0.8323
Obfuscated: 0.8280
Impact of Irony features: +0.0151
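
To put these gaps in perspective, the accuracies can be converted back into counts of correctly classified test documents. The sketch below uses only the numbers printed above (465 test documents per variation). The binomial standard error is a rough back-of-the-envelope yardstick that treats test documents as independent; it is not a formal significance test.

```python
import math

n_test = 465  # support reported in each classification report
accuracies = {"Standard": 0.8172, "Irony": 0.8323, "Obfuscated": 0.8280}

# Accuracy * n recovers the number of correct predictions (rounded, since
# the printed accuracies are themselves rounded to four decimals).
correct = {name: round(acc * n_test) for name, acc in accuracies.items()}
print(correct)  # {'Standard': 380, 'Irony': 387, 'Obfuscated': 385}

# Gap between variations, expressed in documents.
print(correct["Irony"] - correct["Standard"])       # 7
print(correct["Obfuscated"] - correct["Standard"])  # 5

# Approximate standard error of a single accuracy estimate.
p = accuracies["Standard"]
std_err = math.sqrt(p * (1 - p) / n_test)
print(f"{std_err:.4f}")  # 0.0179
```

Under these assumptions the +0.0151 improvement for Irony amounts to 7 of 465 documents and is smaller than one standard error, so it is better read as suggestive than conclusive, particularly since each figure comes from a single training run.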
