# News Text Classification Task

> LSTM Approach

> **Introduction:** This is Question 3 of the APOAI 2025 Mock Competition; it is also the third question of NOAI 2024 (China).

## I. Question Overview

A news text classification dataset is provided as `.csv` files with two columns:

- `text`: The content of the news text.
- `category`: The category of the news text.

The training set is stored in `train_news.csv` and contains 1,000 samples. The test set is stored in `test_news.csv` and contains 200 samples. During the competition, the test set is provided without labels.

## II. Data Set

1. Training set: `train_news.csv`, available at [Training Set](https://bohrium.dp.tech/competitions/2223242868?tab=datasets);
2. Test set (without labels): `test_news_nolabel.csv`, which contestants cannot access or download;
3. Test set (with labels): `test_news_label.csv`, which contestants cannot access or download.

## III. Task

Use PyTorch to design and train a natural language processing model for news text classification: the model takes the text of a news article as input and outputs its category.

The specific requirements are as follows:

1. Total training and testing time on the CPU must not exceed 10 minutes. Connection and queuing time do not count toward this total.
2. Tip: a word embedding layer followed by an LSTM is recommended.
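
One way to keep an eye on the 10-minute budget is to time the pipeline from the first cell onward. The sketch below is only an illustration: `TIME_LIMIT_S`, `SAFETY_MARGIN_S`, `time_left`, and `should_stop` are hypothetical names, and the one-minute margin is an arbitrary assumption, not something the competition specifies.

```python
import time

TIME_LIMIT_S = 10 * 60   # limit stated in the task: 10 minutes for training + testing
SAFETY_MARGIN_S = 60     # hypothetical buffer so prediction still fits in the budget

start = time.perf_counter()  # taken once, at the top of the notebook


def time_left(now=None):
    """Seconds remaining before the self-imposed cutoff."""
    now = time.perf_counter() if now is None else now
    return TIME_LIMIT_S - SAFETY_MARGIN_S - (now - start)


def should_stop(now=None):
    """True once a training loop should stop to leave room for prediction."""
    return time_left(now) <= 0
```

A training loop could call `should_stop()` at the end of each epoch and break out early if it returns `True`.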

## IV. Submission

Submit a `submission.ipynb` file containing the entire training process. The notebook must write its test-set predictions to `submission.csv`, with the label column named and stored the same way as in `train_news.csv`.

For the submission format, see `baseline.ipynb`: [Question 3 of APOAI Mock Competition_baseline](https://bohrium.dp.tech/notebooks/84584239178).
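
As a rough illustration of the expected layout, the sketch below writes a two-column CSV using the column names from `train_news.csv` (`text` and `category`). The rows are made-up placeholders, and the exact set of columns the grader reads is an assumption; `baseline.ipynb` remains the authoritative reference.

```python
import csv
import io

# Placeholder predictions; real rows would come from the model's output on the test set.
rows = [
    {"text": "Example headline one", "category": "sport"},
    {"text": "Example headline two", "category": "business"},
]

# An in-memory buffer keeps the sketch side-effect free;
# swap it for open("submission.csv", "w", newline="") to write a real file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["text", "category"])
writer.writeheader()
writer.writerows(rows)

submission_text = buffer.getvalue()
print(submission_text)
```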

## V. Scoring

1. If training and testing finish within the time limit, the score is the average of the per-category F1 scores (the macro-averaged F1). Contestants are expected to look up the definition of the F1 score themselves.
2. If F1 scores are not calculated for all categories, the score is 0.
3. If the total training and testing time exceeds the limit, the score is 0.
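
The average of per-category F1 scores can be computed by hand, as in the sketch below. It is a plain-Python illustration on made-up labels, not the grader's code; `macro_f1` is a hypothetical helper name. scikit-learn's `f1_score(y_true, y_pred, average="macro")` computes the same quantity.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    # Consider every class that appears in either the true or the predicted labels.
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class is never hit.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels, invented for illustration only.
y_true = ["sport", "sport", "tech", "business"]
y_pred = ["sport", "tech", "tech", "business"]
score = macro_f1(y_true, y_pred)
print(round(score, 4))
```

Because every class carries equal weight, a weak class lowers the score just as much as a large one would.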

In [1]:
import os
import random
import pandas as pd
import numpy as np
import nltk
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from nltk.tokenize import word_tokenize
from sklearn.preprocessing import LabelEncoder
from pathlib import Path
from collections import Counter

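# On the grader, DATA_PATH points to the directory holding the test set; otherwise it is empty.
TEST_PATH = Path(os.environ.get("DATA_PATH") or "")
# Without DATA_PATH, read the training data from the working directory;
# on the grader, read it from the mounted training dataset.
TRAIN_PATH = Path("")
if TEST_PATH != TRAIN_PATH:
    TRAIN_PATH = Path("/bohr/train-t05i/v2")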

nltk.data.path.append(str(TRAIN_PATH / "punkt"))  # nltk.data.path holds string paths

seed = 42
os.environ["PYTHONHASHSEED"] = str(seed)
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

In [2]:
df_train = pd.read_csv(TRAIN_PATH / "train_news.csv")
df_train

| Unnamed: 0 | text | category |
|---|---|---|
| 0 | Campbell rescues Arsenal\n\nSol Campbell prove... | sport |
| 1 | Algeria hit by further gas riots\n\nAlgeria su... | business |
| 2 | Senior Fannie Mae bosses resign\n\nThe two mos... | business |
| 3 | Russia gets investment blessing\n\nSoaring oil... | business |
| 4 | Injury sidelines Philippoussis\n\nMark Philipp... | sport |
| ... | ... | ... |
| 995 | Roxy Music on Isle of Wight bill\n\nRoxy Music... | entertainment |
| 996 | Qwest may spark MCI bidding war\n\nUS phone co... | business |
| 997 | Newcastle 2-1 Bolton\n\nKieron Dyer smashed ho... | sport |
| 998 | Format wars could 'confuse users'\n\nTechnolog... | tech |


## Preprocess text

In [3]:
def preprocess(texts, vocab=None, *, max_len=500, vocab_size=25000):
    # Tokenize
    text_tokens = [word_tokenize(text.lower()) for text in texts]

    # Build vocabulary
    vocab_provided = vocab is not None
    if not vocab_provided:
        common_words = Counter([token for text in text_tokens for token in text]).most_common(vocab_size - 2)
        vocab = {word: idx + 2 for idx, (word, _) in enumerate(common_words)}
        vocab["<UNK>"] = 1
        vocab["<PAD>"] = 0

    # Tokens to token IDs
    text_token_ids = []
    for text in text_tokens:
        encoded = [vocab.get(word, vocab["<UNK>"]) for word in text]
        # Truncate if more, pad if less
        encoded += [vocab["<PAD>"]] * (max_len - len(encoded))
        encoded = encoded[:max_len]
        text_token_ids.append(encoded)

    if vocab_provided:
        return text_token_ids
    return text_token_ids, vocab
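
To make the vocabulary, padding, and truncation steps concrete, here is a scaled-down sketch of the same idea. It is a simplification under stated assumptions: whitespace splitting stands in for `word_tokenize`, the hypothetical `encode` helper always returns a `(ids, vocab)` pair, and the sentences and sizes are invented.

```python
from collections import Counter


def encode(texts, vocab=None, *, max_len=6, vocab_size=10):
    # Whitespace split stands in for nltk's word_tokenize (an assumption for brevity).
    tokens = [text.lower().split() for text in texts]

    # Build the vocabulary only when none is supplied (i.e. on the training set).
    if vocab is None:
        counts = Counter(tok for doc in tokens for tok in doc)
        vocab = {"<PAD>": 0, "<UNK>": 1}
        for word, _ in counts.most_common(vocab_size - 2):
            vocab[word] = len(vocab)

    ids = []
    for doc in tokens:
        # Unknown words map to <UNK>; long texts are cut to max_len.
        row = [vocab.get(tok, vocab["<UNK>"]) for tok in doc][:max_len]
        # Short texts are right-padded with <PAD> up to max_len.
        row += [vocab["<PAD>"]] * (max_len - len(row))
        ids.append(row)
    return ids, vocab


train_ids, vocab = encode(["the cat sat", "the dog sat on the mat today again"])
test_ids, _ = encode(["the bird sat"], vocab)  # "bird" was never seen in training
print(test_ids[0])
```

The test text reuses the training vocabulary, so the unseen word `bird` is encoded as `<UNK>` (ID 1) and the remaining positions are filled with `<PAD>` (ID 0).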

In [4]:
label_encoder = LabelEncoder()
y_train = label_encoder.fit_transform(df_train["category"])
label_encoder.classes_

array(['business', 'entertainment', 'sport', 'tech'], dtype=object)
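
`LabelEncoder` sorts the distinct class names and uses each name's position as its integer label, which is why `business` maps to 0 and `tech` to 3 here. The plain-Python sketch below mirrors that behavior on a made-up list of categories; it illustrates the mapping only and is not a replacement for the encoder.

```python
# Made-up sample of category labels, for illustration only.
categories = ["sport", "business", "business", "tech", "entertainment"]

# Sorted unique class names; the position in this list is the integer label.
classes = sorted(set(categories))
class_to_id = {name: idx for idx, name in enumerate(classes)}

encoded = [class_to_id[name] for name in categories]  # like fit_transform
decoded = [classes[idx] for idx in encoded]           # like inverse_transform

print(classes)
print(encoded)
```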

In [5]:
X_train, vocab = preprocess(df_train["text"])
len(vocab)

23371

In [6]:
class TextDataset(Dataset):
    def __init__(self, texts, labels=None):
        self.texts = torch.tensor(texts, dtype=torch.long)
        self.labels = torch.tensor(labels, dtype=torch.long) if labels is not None else None

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
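        # Always return a pair so `predict` can unpack it;
        # without labels, the text tensor is repeated as a dummy second element.
        label = self.labels[idx] if self.labels is not None else self.texts[idx]
        return self.texts[idx], label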

In [7]:
ds_train = TextDataset(X_train, y_train)
dl_train = DataLoader(ds_train, batch_size=64, shuffle=True)

## Define model and train

In [8]:
class MyModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_classes, padding_idx=0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=padding_idx)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=2, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, num_classes)
    
    def forward(self, x):
        embedded = self.embedding(x)
        _, (hidden, cell) = self.lstm(embedded)
        last_hidden = torch.cat((hidden[-2], hidden[-1]), dim=1)
        logits = self.fc(last_hidden)
        return logits
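
The indexing `hidden[-2]` and `hidden[-1]` relies on how `nn.LSTM` lays out its final hidden states. The sketch below checks the shapes on random input; the sizes are arbitrary toy values chosen for illustration, not the notebook's settings.

```python
import torch
import torch.nn as nn

batch_size, seq_len, embedding_dim, hidden_dim = 4, 7, 8, 5  # arbitrary toy sizes

lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=2, batch_first=True, bidirectional=True)
x = torch.randn(batch_size, seq_len, embedding_dim)

output, (hidden, cell) = lstm(x)

# hidden has shape (num_layers * num_directions, batch, hidden_dim),
# even with batch_first=True (that flag only affects input and output).
# Its last two entries belong to the top layer: forward direction, then backward.
last_hidden = torch.cat((hidden[-2], hidden[-1]), dim=1)

print(output.shape)       # (batch, seq_len, 2 * hidden_dim)
print(hidden.shape)       # (2 layers * 2 directions, batch, hidden_dim)
print(last_hidden.shape)  # (batch, 2 * hidden_dim), the input size of the linear layer
```

Since the concatenated vector has `2 * hidden_dim` features, the final linear layer needs `2 * hidden_dim` input features.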

In [9]:
def train(model, device, optimizer, criterion, dataloader, num_epochs):
    train_losses = []
    for epoch in range(num_epochs):
        model.train()
        running_loss = 0
        
        for inputs, labels in dataloader:
            inputs, labels = inputs.to(device), labels.to(device)
            
            outputs = model(inputs)  # (batch, num_classes); squeezing would break a batch of size 1
            loss = criterion(outputs, labels)
            
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            
            running_loss += loss.item() * inputs.size(0)

        running_loss /= len(dataloader.dataset)
        train_losses.append(running_loss)
        accuracy = eval(model, device, dataloader)
        print(f"Epoch [{epoch + 1}/{num_epochs}], Loss: {running_loss}, Accuracy: {accuracy}")
    return train_losses

In [10]:
@torch.no_grad()
def predict(model, device, dataloader):
    model.eval()
    all_preds = []
    for inputs, _ in dataloader:
        inputs = inputs.to(device)
        
        outputs = model(inputs)  # (batch, num_classes); squeezing would break a batch of size 1
        preds = torch.argmax(outputs, dim=1)
        
        all_preds.append(preds)
    return torch.hstack(all_preds)

In [11]:
# Computes accuracy over a dataloader. Note: the name shadows Python's built-in `eval`.
def eval(model, device, dataloader):
    correct = 0
    total = 0
    for inputs, labels in dataloader:
        inputs, labels = inputs.to(device), labels.to(device)
        
        preds = predict(model, device, [(inputs, None)])
        
        correct += (preds == labels).sum().item()
        total += labels.size(0)
    return correct / total

In [12]:
embedding_dim = 128
hidden_dim = 64

device = "cuda" if torch.cuda.is_available() else "cpu"
model = MyModel(len(vocab), embedding_dim, hidden_dim, len(label_encoder.classes_), vocab["<PAD>"]).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

In [13]:
train(model, device, optimizer, criterion, dl_train, 10);

Epoch [1/10], Loss: 1.3716507558822633, Accuracy: 0.389
Epoch [2/10], Loss: 1.3143046188354492, Accuracy: 0.514
Epoch [3/10], Loss: 1.0938467907905578, Accuracy: 0.674
Epoch [4/10], Loss: 0.7671020851135254, Accuracy: 0.82
Epoch [5/10], Loss: 0.4353484442234039, Accuracy: 0.95
Epoch [6/10], Loss: 0.3092313632965088, Accuracy: 0.976
Epoch [7/10], Loss: 0.17345663273334502, Accuracy: 0.982
Epoch [8/10], Loss: 0.09545973724126816, Accuracy: 0.993
Epoch [9/10], Loss: 0.05204349020123482, Accuracy: 0.997
Epoch [10/10], Loss: 0.02972472044825554, Accuracy: 1.0
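
The accuracy printed here is measured on the training data itself, so it says little about performance on unseen articles. One option, which this notebook does not use, is to hold out part of the training set for validation. A minimal sketch follows; `split_indices` and the 80/20 ratio are illustrative assumptions.

```python
import random


def split_indices(n_samples, val_fraction=0.2, seed=42):
    """Shuffle sample indices and split them into train and validation parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # local RNG, leaves the global seed untouched
    n_val = int(n_samples * val_fraction)
    return indices[n_val:], indices[:n_val]


train_idx, val_idx = split_indices(1000)
print(len(train_idx), len(val_idx))
```

The two index lists could then be passed to `torch.utils.data.Subset` to build separate training and validation loaders.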


## Make predictions (on grader only)

In [14]:
try:
    df_test = pd.read_csv(TEST_PATH / "test_news_nolabel.csv")
except FileNotFoundError:
    df_test = None

In [15]:
if df_test is not None:
    X_test = preprocess(df_test["text"], vocab)

    ds_test = TextDataset(X_test)
    dl_test = DataLoader(ds_test, batch_size=64)

    preds = predict(model, device, dl_test).detach().cpu().numpy()
    df_test["category"] = label_encoder.inverse_transform(preds)

    df_test.to_csv("submission.csv", index=False)

## Score

Leaderboard A:

- F1 - business: 0.7741
- F1 - entertainment: 0.8695
- F1 - sport: 0.8923
- F1 - tech: 0.9310
- Score: 0.8667

Leaderboard B:

- F1 - business: 0.8358
- F1 - entertainment: 0.7741
- F1 - sport: 0.9189
- F1 - tech: 0.7857
- Score: 0.8286
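
Each leaderboard score is the plain average of the four per-category F1 values listed for it. The short check below recomputes both from those numbers; `mean_f1` is a hypothetical helper name.

```python
# Per-category F1 values copied from the two leaderboards.
leaderboard_a = {"business": 0.7741, "entertainment": 0.8695, "sport": 0.8923, "tech": 0.9310}
leaderboard_b = {"business": 0.8358, "entertainment": 0.7741, "sport": 0.9189, "tech": 0.7857}


def mean_f1(per_class):
    """Unweighted average of the per-category F1 scores."""
    return sum(per_class.values()) / len(per_class)


score_a = round(mean_f1(leaderboard_a), 4)
score_b = round(mean_f1(leaderboard_b), 4)
print(score_a, score_b)  # 0.8667 0.8286
```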