<h1><center>AG News Classification</center></h1>

AG is a collection of more than 1 million news articles, gathered from more than 2,000 news sources by ComeToMyHead over more than a year of activity. ComeToMyHead is an academic news search engine that has been running since July 2004. The dataset is provided by the academic community for research purposes in data mining (clustering, classification, etc.), information retrieval (ranking, search, etc.), XML, data compression, data streaming, and any other non-commercial activity.

The AG's news topic classification dataset is constructed by choosing the 4 largest classes from the original corpus. Each class contains 30,000 training samples and 1,900 testing samples, for a total of 120,000 training samples and 7,600 testing samples.
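
As a quick check of the figures above, the short sketch below recomputes the totals and the accuracy a constant guess would reach on perfectly balanced classes. The variable names are illustrative and do not come from the project code.

```python
# Per-class sample counts quoted in the dataset description
NUM_CLASSES = 4
TRAIN_PER_CLASS = 30_000
TEST_PER_CLASS = 1_900

train_total = NUM_CLASSES * TRAIN_PER_CLASS
test_total = NUM_CLASSES * TEST_PER_CLASS

# With balanced classes, always predicting one class is right 1/4 of the time,
# so this is the floor any trained classifier should clear
chance_accuracy = 1 / NUM_CLASSES

print(train_total, test_total, chance_accuracy)
```

Because the classes are balanced, plain accuracy is a reasonable headline metric for this task.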

In [1]:
# Import libraries
import torch
import torch.nn as nn

from src.data.data_loader import create_dataloaders
from src.model.transformer import Transformer, build_transformer
from src.train.training import train_model
from src.utils.utils import get_device
from nltk.tokenize import word_tokenize



In [2]:
# Initialize model and training parameters

train_file = 'data/train/train.csv'
val_file = 'data/val/val.csv'

# Size of embedding vector
d_model = 64
# Number of words in a vocabulary
vocab_size = 30000
# Max sequence length for input words/tokens
seq_len = 100
# Dropout rate
dropout = 0.1
# Number of encoder blocks
num_layers = 1
# Number of attention heads
num_heads = 8
# Number of hidden units in the feed-forward layer
d_ff = 4 * d_model

# Number of epochs
epochs = 5
# Batch size for training
batch_size = 128
# Number of classes
num_classes = 4
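
To get a feel for what these hyperparameters imply, the sketch below estimates parameter counts for the main components. It assumes a conventional encoder block (four biased linear layers in attention, two in the feed-forward network) and ignores layer norms and the classification head; the actual `build_transformer` implementation is not shown here and may differ.

```python
# Hyperparameters copied from the cell above
d_model = 64
vocab_size = 30000
num_heads = 8
d_ff = 4 * d_model

# Each head works on an equal slice of the embedding, so d_model must be
# divisible by num_heads in a standard multi-head attention layer
assert d_model % num_heads == 0
head_dim = d_model // num_heads

# Embedding table: one d_model-sized vector per vocabulary entry
embedding_params = vocab_size * d_model

# Attention: query, key, value, and output projections (weights + biases)
attention_params = 4 * (d_model * d_model + d_model)

# Feed-forward: d_model -> d_ff -> d_model (weights + biases)
ffn_params = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)

total = embedding_params + attention_params + ffn_params
embedding_share = embedding_params / total

print(head_dim, embedding_params, attention_params, ffn_params)
print(f"embedding share: {embedding_share:.3f}")
```

Under these assumptions the embedding table accounts for nearly all of the weights, so `vocab_size` and `d_model` have far more effect on model size than `num_layers` does at this scale.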

In [3]:
# Get a device to use for training/inference
device = get_device()

# Create training and validation data loaders
train_dataloader, val_dataloader, word_to_id = create_dataloaders(
    train_file, val_file, batch_size, seq_len, vocab_size
)
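
The source of `create_dataloaders` is not shown in this notebook, so the following is only a sketch of how a `word_to_id` vocabulary and fixed-length encoding might work. The helpers `build_vocab` and `encode` are hypothetical, whitespace splitting stands in for `word_tokenize`, and the padding id of 0 is an assumption (the notebook itself only confirms that id 1 is used for unknown words).

```python
from collections import Counter

PAD, UNK = 0, 1  # assumed reserved ids; only UNK = 1 appears in the notebook


def build_vocab(texts, vocab_size):
    """Map the most frequent tokens to ids, reserving 0 and 1 (hypothetical helper)."""
    counts = Counter(token for text in texts for token in text.lower().split())
    most_common = counts.most_common(vocab_size - 2)
    return {token: idx + 2 for idx, (token, _) in enumerate(most_common)}


def encode(text, word_to_id, seq_len):
    """Truncate to seq_len tokens, map to ids, then right-pad with PAD."""
    tokens = text.lower().split()[:seq_len]
    ids = [word_to_id.get(token, UNK) for token in tokens]
    return ids + [PAD] * (seq_len - len(ids))


toy_vocab = build_vocab(["stocks rise", "stocks fall"], vocab_size=4)
encoded = encode("stocks soar today", toy_vocab, seq_len=5)
print(toy_vocab)
print(encoded)
```

Capping the vocabulary keeps the embedding table bounded; every word outside the most frequent entries collapses to the single `UNK` id.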

In [4]:
# Create encoder only transformer model
encoder_only_transformer_model = build_transformer(d_model, vocab_size, seq_len, dropout,
                                                   num_layers, num_heads, d_ff, num_classes).to(device)

print(encoder_only_transformer_model)

Transformer(
  (embed): InputEmbedding(
    (embedding): Embedding(30000, 64)
  )
  (pos): PositionalEncoding(
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): Encoder(
    (layers): ModuleList(
      (0): EncoderBlock(
        (self_attention): MultiHeadAttention(
          (dropout): Dropout(p=0.1, inplace=False)
          (query_linear_layer): Linear(in_features=64, out_features=64, bias=True)
          (key_linear_layer): Linear(in_features=64, out_features=64, bias=True)
          (value_linear_layer): Linear(in_features=64, out_features=64, bias=True)
          (output_linear_layer): Linear(in_features=64, out_features=64, bias=True)
        )
        (feed_forward): FeedForward(
          (linear_1): Linear(in_features=64, out_features=256, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (linear_2): Linear(in_features=256, out_features=64, bias=True)
        )
        (dropout): Dropout(p=0.1, inplace=False)
        (norm): LayerNorm((64,), e
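
The printed `PositionalEncoding` module lists only a dropout layer and no learnable weights, which is consistent with a fixed (non-learned) encoding. Assuming it follows the sinusoidal scheme from the original Transformer paper (the code in `src.model.transformer` is not shown, so this is a guess), a pure-Python sketch looks like this:

```python
import math


def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of fixed position signals."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            # Each pair of dimensions shares one frequency; higher
            # dimensions oscillate more slowly
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe


# Same seq_len and d_model as the notebook's hyperparameters
pe = sinusoidal_positional_encoding(seq_len=100, d_model=64)
print(len(pe), len(pe[0]))
```

In this scheme the table is added to the token embeddings before the first encoder block, giving self-attention access to word order.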

In [5]:
# Create optimizer and loss function
optimizer = torch.optim.Adam(encoder_only_transformer_model.parameters())
loss_fn = nn.CrossEntropyLoss()

# Train model
train_model(epochs, num_classes, encoder_only_transformer_model, train_dataloader, val_dataloader,
            loss_fn, optimizer, device)

Epoch 1, Train Loss 0.32070717215538025, Train Accuracy 0.884453296661377, Test Loss 0.23113450407981873, Test Accuracy 0.9214843511581421
Epoch 2, Train Loss 0.15500973165035248, Train Accuracy 0.9456123113632202, Test Loss 0.23425127565860748, Test Accuracy 0.9262152910232544
Epoch 3, Train Loss 0.10607817769050598, Train Accuracy 0.9610374569892883, Test Loss 0.2572510540485382, Test Accuracy 0.9264323115348816
Epoch 4, Train Loss 0.07226314395666122, Train Accuracy 0.9725063443183899, Test Loss 0.3279348909854889, Test Accuracy 0.917881965637207
Epoch 5, Train Loss 0.05304388329386711, Train Accuracy 0.9799773693084717, Test Loss 0.3840404152870178, Test Accuracy 0.9115017652511597
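
In the log above, training loss keeps falling while the held-out loss rises after the first epoch, a typical sign of overfitting. The sketch below picks the best epoch from the logged values (rounded to four decimals) and shows where a simple patience-based early-stopping rule would have halted. `early_stop_epoch` is a hypothetical helper, not part of `train_model`.

```python
# (epoch, held-out loss, held-out accuracy), rounded from the log above
history = [
    (1, 0.2311, 0.9215),
    (2, 0.2343, 0.9262),
    (3, 0.2573, 0.9264),
    (4, 0.3279, 0.9179),
    (5, 0.3840, 0.9115),
]

best_by_loss = min(history, key=lambda row: row[1])[0]
best_by_acc = max(history, key=lambda row: row[2])[0]


def early_stop_epoch(val_losses, patience):
    """Return the epoch at which training stops after `patience` epochs without improvement."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)


stop_epoch = early_stop_epoch([row[1] for row in history], patience=2)
print(best_by_loss, best_by_acc, stop_epoch)
```

Saving a checkpoint whenever the held-out metric improves would let the notebook keep the best weights instead of those from the final epoch.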


In [6]:
# Set up the model save location
from pathlib import Path

# Create models directory
MODEL_PATH = Path("models")
MODEL_PATH.mkdir(parents=True, exist_ok=True)

# Create model save path
MODEL_NAME = "06_news_classification.pth"
MODEL_SAVE_PATH = MODEL_PATH / MODEL_NAME

In [7]:
# Save the model state dict
print(f"Saving model to: {MODEL_SAVE_PATH}")
torch.save(obj=encoder_only_transformer_model.state_dict(), f=MODEL_SAVE_PATH)

Saving model to: models/06_news_classification.pth


In [8]:
# Create new instance of model and load saved state dict
loaded_model = build_transformer(d_model, vocab_size, seq_len, dropout,
                                num_layers, num_heads, d_ff, num_classes)
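loaded_model.load_state_dict(torch.load(MODEL_SAVE_PATH, map_location=device))
loaded_model.to(device)
# Switch to evaluation mode so dropout is disabled during inference
loaded_model.eval()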

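# Categories (labels are 1-based, matching the dataset's class indices)
article_categories = {1: 'World', 2: 'Sports', 3: 'Business', 4: 'Sci/Tech'}

# Token id used for out-of-vocabulary words
UNK = 1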

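def classify_news(news):
    """Return the 1-based category id predicted for a news string."""
    with torch.inference_mode():
        # Lowercase, tokenize, and truncate to the maximum sequence length
        tokenized_words = word_tokenize(news.lower())[:seq_len]
        # Map tokens to ids, falling back to UNK for unseen words
        news_tensor = torch.tensor([word_to_id.get(word, UNK) for word in tokenized_words]).to(device)

        # Add a batch dimension: (num_tokens,) -> (1, num_tokens)
        news_tensor = news_tensor.unsqueeze(dim=0)

        # Run the encoder (None is passed as the second argument), then project to class logits
        encoder_output = loaded_model.encode(news_tensor, None)
        y_logits = loaded_model.project(encoder_output)

        # The argmax index is 0-based; shift it to the 1-based category ids
        y_output = torch.argmax(y_logits, dim=1)
        return y_output.item() + 1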
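
`classify_news` returns only the winning label. If a confidence score is also useful, the logits can be passed through a softmax first. The sketch below does this in plain Python on made-up logits (the values are purely illustrative, not model output); inside the notebook the equivalent call would be `torch.softmax(y_logits, dim=1)`.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities, subtracting the max for numerical stability."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


categories = {1: 'World', 2: 'Sports', 3: 'Business', 4: 'Sci/Tech'}

# Made-up logits for one article, one value per class
example_logits = [0.2, 3.1, -0.5, 0.4]
probs = softmax(example_logits)

# The list index is 0-based, while the category ids start at 1
predicted = probs.index(max(probs)) + 1
print(categories[predicted], round(max(probs), 3))
```

Softmax does not change which class wins, so adding it would leave the predicted labels unchanged.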

In [9]:
article1 = 'The NBA season could end on Thursday night. It’s Game 6 of the NBA Finals, with the Oklahoma City Thunder leading the \
Indiana Pacers 3-2 in the title series. Game 6 is in Indianapolis and Game 7, if necessary, will be Sunday.Shai Gilgeous-Alexander and the \
Thunder are one win away from becoming NBA champions. And Gilgeous-Alexander is on the cusp of a nearly unprecedented season when it comes \
to individual honors.'

print(article_categories.get(classify_news(article1)))

Sports


In [10]:
article2 = 'Global stock markets are experiencing fluctuations due to changing economic indicators, \
central bank policies, and geopolitical developments.'

print(article_categories.get(classify_news(article2)))

Business
