# CNN Text Classification with Word2Vec
This notebook shows how to:
- load a pretrained Word2Vec model (GoogleNews)
- convert a document into an embedding matrix
- classify documents with a simple 1D CNN


## Installation

In [1]:
!pip install gensim torch --quiet

## Loading Word2Vec

In [2]:
import gensim.downloader as api
# GoogleNews embeddings (300-dimensional)
w2v = api.load('word2vec-google-news-300')
w2v



<gensim.models.keyedvectors.KeyedVectors at 0x1388f1a90>

## Example data
Small demo texts with binary labels (1 = positive, 0 = negative).

In [3]:
texts = [
    'this movie was fantastic and I loved every minute',
    'absolutely terrible film boring and predictable',
    'the food was great and service was excellent',
    'the service was slow and the food was cold'
]
labels = [1, 0, 1, 0]  # one label per text, in the same order

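## Converting documents into embedding matrices
Each document becomes a matrix of shape (seq_len, emb_dim). Documents longer than seq_len tokens are truncated, shorter ones are zero-padded, and tokens missing from the Word2Vec vocabulary are represented by zero vectors.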

In [4]:
import numpy as np

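def embed_document(text, max_len=20):
    """Return a (max_len, emb_dim) float32 matrix for a whitespace-tokenized text."""
    emb_dim = w2v.vector_size  # 300 for the GoogleNews model
    tokens = text.lower().split()
    vecs = []
    for tok in tokens[:max_len]:  # truncate long documents
        if tok in w2v:
            vecs.append(w2v[tok])
        else:
            # Out-of-vocabulary token: use a zero vector
            vecs.append(np.zeros(emb_dim, dtype=np.float32))
    # Zero-pad short documents up to max_len
    while len(vecs) < max_len:
        vecs.append(np.zeros(emb_dim, dtype=np.float32))
    return np.array(vecs, dtype=np.float32)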

X = np.stack([embed_document(t) for t in texts])
y = np.array(labels)
X.shape

(4, 20, 300)
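
The truncation, padding, and out-of-vocabulary handling can be checked without downloading the full model. The following is a minimal sketch, assuming a hypothetical two-word vocabulary (`toy_vectors`) with 4-dimensional vectors in place of `w2v`; the helper name `embed_toy` is also made up for illustration.

```python
import numpy as np

# Hypothetical toy vocabulary standing in for the Word2Vec model (4-dimensional vectors).
toy_vectors = {
    "good": np.array([1.0, 0.0, 0.0, 0.0], dtype=np.float32),
    "movie": np.array([0.0, 1.0, 0.0, 0.0], dtype=np.float32),
}
EMB_DIM = 4

def embed_toy(text, max_len=5):
    """Truncate to max_len tokens; unknown tokens and padding rows stay zero."""
    matrix = np.zeros((max_len, EMB_DIM), dtype=np.float32)
    for i, tok in enumerate(text.lower().split()[:max_len]):
        if tok in toy_vectors:
            matrix[i] = toy_vectors[tok]
    return matrix

m = embed_toy("Good movie overall")
print(m.shape)        # (5, 4)
print(m.any(axis=1))  # [ True  True False False False]
```

Only the first two rows are non-zero: "overall" is unknown to the toy vocabulary and the last two rows are padding. The model therefore cannot tell an unknown token from padding, which is a simplification this representation accepts.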

## Defining the CNN model

In [5]:
import torch
import torch.nn as nn
import torch.optim as optim

class TextCNN(nn.Module):
    def __init__(self, emb_dim=300, seq_len=20, num_classes=2):
        super().__init__()
        # 1D convolution over the sequence dimension
        self.conv = nn.Conv1d(in_channels=emb_dim, out_channels=100, kernel_size=3)
        self.relu = nn.ReLU()
        # Global max pooling over all sequence positions
        self.pool = nn.AdaptiveMaxPool1d(1)
        self.fc = nn.Linear(100, num_classes)

    def forward(self, x):
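        # x: (batch, seq_len, emb_dim)
        x = x.permute(0, 2, 1)  # -> (batch, emb_dim, seq_len), as Conv1d expects channels first
        x = self.conv(x)  # -> (batch, 100, seq_len - 2)
        x = self.relu(x)
        x = self.pool(x).squeeze(-1)  # -> (batch, 100)
        return self.fc(x)  # -> (batch, num_classes) logits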
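
To see how the tensor shape changes through the forward pass, the layers can be applied one at a time to a dummy batch. This is a small sketch using the same layer settings as `TextCNN`; the all-zero input is only a placeholder for real embeddings.

```python
import torch
import torch.nn as nn

batch, seq_len, emb_dim = 4, 20, 300
x = torch.zeros(batch, seq_len, emb_dim)  # placeholder input

conv = nn.Conv1d(in_channels=emb_dim, out_channels=100, kernel_size=3)
pool = nn.AdaptiveMaxPool1d(1)

h = conv(x.permute(0, 2, 1))  # channels first: (batch, emb_dim, seq_len)
p = pool(torch.relu(h))       # max over all sequence positions
shapes = [tuple(h.shape), tuple(p.shape), tuple(p.squeeze(-1).shape)]
print(shapes)  # [(4, 100, 18), (4, 100, 1), (4, 100)]
```

Without padding, a kernel of size 3 yields seq_len - 3 + 1 = 18 positions. The global max pooling then reduces each of the 100 feature maps to a single value, so the linear layer receives a fixed-size vector regardless of sequence length.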

## Preparing for training

In [6]:
X_tensor = torch.tensor(X, dtype=torch.float32)
y_tensor = torch.tensor(y, dtype=torch.long)
model = TextCNN()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

## Training (short)

In [7]:
for epoch in range(20):
    optimizer.zero_grad()
    outputs = model(X_tensor)
    loss = criterion(outputs, y_tensor)
    loss.backward()
    optimizer.step()
    if epoch % 5 == 0:
        print(f'Epoch {epoch} Loss {loss.item():.4f}')

Epoch 0 Loss 0.7051
Epoch 5 Loss 0.3587
Epoch 10 Loss 0.1695
Epoch 15 Loss 0.0725
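
The first printed loss (0.7051) is close to ln 2 ≈ 0.6931, which is the cross-entropy of a two-class model that assigns equal probability to both classes. The sketch below recomputes the per-example cross-entropy by hand to illustrate this; the logit values are made up, and `nn.CrossEntropyLoss` with default settings averages this quantity over the batch.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy for one example: -log(softmax(logits)[target])."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

# Made-up logits: an undecided model and a model confident in the correct class.
uniform_loss = cross_entropy([0.0, 0.0], target=1)
confident_loss = cross_entropy([0.0, 3.0], target=1)
print(round(uniform_loss, 4))  # 0.6931
print(confident_loss < uniform_loss)  # True
```

A loss falling well below ln 2 on four training examples mainly indicates that the model is memorizing them; it says nothing about performance on unseen text.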


## Example prediction

In [8]:
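test_text = 'the movie was great but the ending was boring'
# unsqueeze(0) adds the batch dimension: (seq_len, emb_dim) -> (1, seq_len, emb_dim)
test_emb = torch.tensor(embed_document(test_text), dtype=torch.float32).unsqueeze(0)
model.eval()
with torch.no_grad():  # no gradients needed for inference
    pred = model(test_emb)
print('Predicted class:', pred.argmax(dim=1).item())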

Predicted class: 1
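
The model outputs raw logits, so `argmax` gives the class but not how confident the model is. As a rough sketch, a softmax over the class dimension turns logits into probabilities; the logit values below are hypothetical and not taken from the trained model.

```python
import torch

# Hypothetical logits for one document: (batch=1, num_classes=2)
logits = torch.tensor([[0.2, 1.1]])

probs = torch.softmax(logits, dim=1)  # each row sums to 1
predicted = probs.argmax(dim=1).item()
print(predicted)  # 1
print(probs)
```

Softmax preserves the ordering of the logits, so the predicted class is unchanged. For a mixed review such as the test sentence, the probabilities can show whether the decision was a close call.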
