# Article Categorization System

An Article Categorization system in Natural Language Processing (NLP) is a system that automatically classifies or categorizes articles, documents, or textual content into predefined categories or topics. The goal is to assign relevant labels to the text based on its content, enabling easier organization, search, and retrieval of information. This process is also known as text classification or document categorization.

Key components of an Article Categorization system typically include:

**Text Preprocessing**: Cleaning and preparing the text data by removing stop words, punctuation, and other noise.

**Feature Extraction**: Converting the text data into numerical features or vectors, often using techniques like TF-IDF (Term Frequency-Inverse Document Frequency) or word embeddings.

**Machine Learning Model**: Training a machine learning model, such as logistic regression, support vector machines, or neural networks, to learn patterns and relationships between text features and categories.


**Categories/Labels:** Defining a set of categories or labels that the articles can be assigned to. These categories are typically predefined and can cover a wide range of topics, such as politics, sports, technology, health, and more.

**Evaluation Metrics:** Assessing the performance of the model using evaluation metrics like accuracy, precision, recall, F1-score, or confusion matrices.

Applications of Article Categorization in Industry and the Real World:

-- News and Media: Automatic categorization of news articles allows news portals to organize and display content based on topics, making it easier for readers to find articles of interest, and helps in recommending related articles.

-- Content Recommendation: Content platforms, such as blogs, websites, and social media, use categorization to recommend relevant articles or posts to users based on their interests and preferences.

-- Search Engines: Search engines categorize web pages to provide users with more structured and relevant search results. This enhances the user experience and helps users find information quickly.

-- Market Research: Companies can use article categorization to analyze customer reviews, feedback, and social media discussions to gain insights into consumer sentiment and preferences.

-- Customer Support: Automating the categorization of customer support tickets or emails can help companies prioritize and route inquiries to the appropriate departments or support teams.

-- Legal and Compliance: Law firms and regulatory agencies use text categorization to sort and manage large volumes of legal documents, ensuring compliance with regulations and laws.

-- Academic Research: Researchers categorize academic papers and publications to organize and access research in specific domains or fields of study.

-- Content Moderation: Social media platforms use categorization to filter and moderate user-generated content, flagging or removing inappropriate or harmful content.

-- E-commerce: Online retailers categorize product descriptions and reviews to improve the browsing and shopping experience for customers.

-- Sentiment Analysis: Categorizing user reviews, comments, or social media posts into positive, negative, or neutral sentiment categories is a form of article categorization used to gauge public sentiment about products, services, or events.

Overall, Article Categorization systems play a crucial role in information management, content organization, and automation across various industries and real-world applications. They help streamline and enhance the way we access and interact with textual information.

To build a News Article Categorization system using a language model, you can use a pre-trained model for text classification. Below is a complete example code that categorizes news articles into topics like politics, sports, technology, etc., using the Transformers library and a BERT-based text classification model.


Fine-tuning typically involves adjusting the pre-trained model's weights on your specific dataset to make it better suited for your classification task.

Here are the steps to train the BERT model on a custom dataset:

Collect or Create a Labeled Dataset:

- Gather a dataset of news articles or text data relevant to your classification task.
- Label each article or text with one of the categories you want the model to predict (e.g., politics, sports, technology, entertainment, health).

Data Preprocessing:

Tokenize and preprocess your text data using the BERT tokenizer.
Encode the labels as numerical values (e.g., 0 for politics, 1 for sports, etc.).

Split the Dataset:

Split your dataset into training, validation, and test sets. Typically, you use a larger portion for training and smaller portions for validation and testing.

Fine-Tuning:

Load the pre-trained BERT model and modify the classification head to match the number of categories in your dataset.
Fine-tune the model on the training data using techniques like gradient descent. You'll need to define loss functions and optimization strategies.
Monitor the model's performance on the validation set to prevent overfitting.

Evaluation:

Evaluate the model's performance on the test dataset using appropriate evaluation metrics (e.g., accuracy, precision, recall, F1-score).
Inference:

Once your model is trained and evaluated, you can use it to classify new news articles or text.

In [1]:
# Install required libraries
!pip install transformers



In [2]:
# Training a model like BERT for sequence classification requires a labeled dataset with text examples and their corresponding labels.
#You'll need to collect or create such a dataset and then fine-tune the BERT model on it.
# here there are just 48 text samples corresponding to five categories, and each text sample is assigned a numerical label based on its category
# Replace these samples and labels with your own dataset according to your specific classification task.
# ensuring that you have collected or created a labeled dataset that matches your specific classification task

import torch
from transformers import BertTokenizer, BertForSequenceClassification, AdamW
from torch.utils.data import DataLoader, Dataset
import numpy as np

# Load pre-trained BERT model and tokenizer
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=5)  # 5 categories

# Define category labels
category_labels = ['politics', 'sports', 'technology', 'entertainment', 'health',]

# Define a custom dataset class
class CustomDataset(Dataset):
    def __init__(self, texts, labels):
        self.texts = texts
        self.labels = labels

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        return text, label

# Load your custom dataset (replace with your own dataset loading code)
#texts = ["Sample text 1", "Sample text 2", ...]
#labels = [0, 1, ...]  # Numerical labels corresponding to categories

# Load your custom dataset (replace with your own dataset loading code)
texts = [
    "The government announced a new economic policy aimed at boosting the economy.",
    "The local soccer team won the championship after an intense final match.",
    "A new smartphone with advanced features was launched by TechCo.",
    "The upcoming awards show promises to be a star-studded event in the entertainment industry.",
    "A groundbreaking medical discovery could revolutionize healthcare.",
    "Another political news article...",
    "A thrilling sports event took place...",
    "A tech company unveiled a groundbreaking product...",
    "A new movie is creating a buzz in the entertainment world...",
    "Researchers made a significant health-related discovery.",
    "The budget proposal was presented by the government.",
    "A major sports league announced new rules for the upcoming season.",
    "A tech startup secured a multi-million-dollar investment.",
    "A popular actor is set to star in an upcoming blockbuster film.",
    "A breakthrough in medical research may lead to a cure for a serious disease.",
    "The latest political debate focused on key issues facing the nation.",
    "A sports team's victory celebration drew thousands of fans to the streets.",
    "The technology company unveiled its latest product with innovative features.",
    "The entertainment industry is abuzz with rumors of a new celebrity romance.",
    "A new health study suggests the benefits of a balanced diet.",
    "Government leaders discussed economic policies to address inflation.",
    "A thrilling sports match kept fans on the edge of their seats.",
    "A technology conference showcased the latest advancements in the field.",
    "A new movie trailer has generated excitement among moviegoers.",
    "Medical professionals are working on a breakthrough treatment for a rare condition.",
    "Political analysts weigh in on the upcoming election race.",
    "The sports tournament attracted athletes from around the world.",
    "A tech company's stock soared following a successful product launch.",
    "A beloved actor received an award for their contributions to the entertainment industry.",
    "Researchers made a significant discovery in the field of medical science.",
    "Government officials announced a plan to address climate change.",
    "A sports team's coach discussed strategies for the upcoming season.",
    "A tech startup received funding to develop a cutting-edge innovation.",
    "The entertainment industry is buzzing with anticipation for a major film release.",
    "Health experts emphasize the importance of regular exercise.",
    "Political leaders held a summit to address international relations.",
    "A sports event drew record-breaking attendance.",
    "The technology company's CEO outlined their vision for the future.",
    "A new reality show is capturing the attention of television viewers.",
    "Researchers published a groundbreaking study on a medical breakthrough.",
    "Government representatives debated a controversial policy proposal.",
    "A sports championship became a global sensation.",
    "A tech company announced plans for expansion and job creation.",
    "A renowned actor discussed their career in the entertainment industry.",
    "Health professionals shared tips for maintaining a healthy lifestyle.",
]

labels = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
# Numerical labels corresponding to categories (e.g., 0: politics, 1: sports, 2: technology, 3: entertainment, 4: health)



  # we have just 48 example but to increase accuracy ---> Add more text samples as needed
  # Numerical labels corresponding to categories (e.g., 0: politics, 1: sports, 2: technology, 3: entertainment, 4: health)




# Create an instance of your custom dataset
custom_dataset = CustomDataset(texts, labels)

# Split the dataset into training, validation, and test sets (modify as needed)
train_size = int(0.8 * len(custom_dataset))
val_size = int(0.1 * len(custom_dataset))
test_size = len(custom_dataset) - train_size - val_size
train_dataset, val_dataset, test_dataset = torch.utils.data.random_split(custom_dataset, [train_size, val_size, test_size])

# Define dataloaders for training, validation, and test sets
batch_size = 16
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=batch_size)
test_dataloader = DataLoader(test_dataset, batch_size=batch_size)

# Define optimizer and loss function
optimizer = AdamW(model.parameters(), lr=1e-5)
criterion = torch.nn.CrossEntropyLoss()

# Training loop
num_epochs = 5
for epoch in range(num_epochs):
    model.train()
    total_loss = 0.0
    for batch in train_dataloader:
        texts, labels = batch
        encoded_texts = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=512, return_attention_mask=True)
        inputs = {
            'input_ids': encoded_texts['input_ids'],
            'attention_mask': encoded_texts['attention_mask'],
            'labels': labels
        }
        optimizer.zero_grad()
        outputs = model(**inputs)
        loss = criterion(outputs.logits, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

# Validation
model.eval()
val_accuracy = 0.0
val_batches = 0
for batch in val_dataloader:
    texts, labels = batch
    encoded_texts = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=512, return_attention_mask=True)
    inputs = {
        'input_ids': encoded_texts['input_ids'],
        'attention_mask': encoded_texts['attention_mask'],
        'labels': labels
    }
    with torch.no_grad():
        outputs = model(**inputs)
    predictions = np.argmax(outputs.logits, axis=1)
    val_accuracy += np.sum(predictions == labels.numpy())
    val_batches += len(labels)

if val_batches > 0:
    val_accuracy /= val_batches
    print(f"Epoch {epoch+1}/{num_epochs}, Loss: {total_loss}, Validation Accuracy: {val_accuracy}")
else:
    print("No validation batches found.")


# Evaluation on the test set
model.eval()
test_accuracy = 0.0
test_batches = 0
for batch in test_dataloader:
    texts, labels = batch
    encoded_texts = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=512, return_attention_mask=True)
    inputs = {
        'input_ids': encoded_texts['input_ids'],
        'attention_mask': encoded_texts['attention_mask'],
        'labels': labels
    }
    with torch.no_grad():
        outputs = model(**inputs)
    predictions = np.argmax(outputs.logits, axis=1)
    test_accuracy += np.sum(predictions == labels.numpy())
    test_batches += len(labels)

test_accuracy /= test_batches
print(f"Test Accuracy: {test_accuracy}")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 5/5, Loss: 4.470484375953674, Validation Accuracy: 0.0
Test Accuracy: 0.0


In [3]:
def categorize_news_article(news_article):
    # Tokenize the news article
    inputs = tokenizer(news_article, return_tensors="pt", padding=True, truncation=True, max_length=512, return_attention_mask=True)

    # Make a prediction using the fine-tuned BERT model
    with torch.no_grad():
        outputs = model(**inputs)

    # Get the predicted category and category probabilities
    predicted_category = category_labels[np.argmax(outputs.logits)]
    category_probabilities = outputs.logits.softmax(dim=1).tolist()[0]

    return predicted_category, category_probabilities

# Example usage for a politic-related news article
news_article = """
In a recent political development, the government announced new policies aimed at economic growth and job creation. The sports world is abuzz with excitement as a major tournament is set to begin. In the technology sector, a new smartphone with advanced features has been launched. Meanwhile, the entertainment industry is gearing up for a star-studded awards show. On the health front, a new study reveals breakthroughs in medical research.
"""

predicted_category, category_probabilities = categorize_news_article(news_article)
print("Predicted Category:", predicted_category)
print("Category Probabilities:", category_probabilities)


Predicted Category: health
Category Probabilities: [0.23524904251098633, 0.1365949511528015, 0.1593373715877533, 0.17738446593284607, 0.2914341986179352]


In [4]:
# Example usage for a health-related news article
health_article = """
A recent medical breakthrough has the potential to transform healthcare. Researchers have discovered a new treatment for a previously incurable disease, offering hope to millions of patients worldwide. Clinical trials have shown promising results, and experts believe this could be a major breakthrough in the field of medicine.
"""

predicted_category, category_probabilities = categorize_news_article(health_article)
print("Predicted Category:", predicted_category)
print("Category Probabilities:", category_probabilities)


Predicted Category: health
Category Probabilities: [0.24294805526733398, 0.147414430975914, 0.16521532833576202, 0.12176720798015594, 0.32265496253967285]


In [5]:
# Example usage for a sports-related news article
sports_article = """
In a thrilling final match, the local soccer team secured a historic victory, winning the championship for the first time in a decade. The match was a nail-biter, with both teams displaying incredible skills and determination. The winning goal came in the last minutes of the game, sending the fans into a frenzy of celebration. The team's captain credited their success to teamwork and dedication.
"""

predicted_category, category_probabilities = categorize_news_article(sports_article)
print("Predicted Category:", predicted_category)
print("Category Probabilities:", category_probabilities)


Predicted Category: sports
Category Probabilities: [0.2290765643119812, 0.22923582792282104, 0.17658132314682007, 0.15603412687778473, 0.20907217264175415]


In [6]:
# Example usage for a technology-related news article
technology_article = """
A tech giant unveiled its latest innovation today, a groundbreaking artificial intelligence system that promises to revolutionize industries. The AI system can analyze massive datasets in real-time, making it invaluable for sectors like autonomous vehicles and healthcare. Experts predict that this innovation will lead to significant advancements in technology.
"""

predicted_category, category_probabilities = categorize_news_article(technology_article)
print("Predicted Category:", predicted_category)
print("Category Probabilities:", category_probabilities)


Predicted Category: health
Category Probabilities: [0.23467382788658142, 0.1599157750606537, 0.17809340357780457, 0.14168372750282288, 0.28563326597213745]


In [7]:
# Example usage for an entertainment-related news article
entertainment_article = """
The entertainment world is buzzing with excitement as preparations for the annual awards show are in full swing. A-list celebrities are expected to grace the red carpet, and fans are eagerly awaiting the announcement of winners in various categories. This event promises to be a star-studded extravaganza.
"""

predicted_category, category_probabilities = categorize_news_article(entertainment_article)
print("Predicted Category:", predicted_category)
print("Category Probabilities:", category_probabilities)


Predicted Category: entertainment
Category Probabilities: [0.22429685294628143, 0.15624935925006866, 0.14983826875686646, 0.25163793563842773, 0.2179776132106781]
