# Text Classification using Newsgroups Dataset and Embedding-based Model 

## Overview

This project demonstrates the use of text embeddings (produced by the Gemini API) to classify newsgroup posts / articles into various categories. The dataset is preprocessed, embeddings are generated for the text data, and a simple neural network classifier is built to predict the categories.

## Description

The goal is to classify posts from the 20 Newsgroups dataset using embeddings generated by a pre-trained model. The dataset, which covers 20 newsgroup topics, is first preprocessed to extract and clean the subject and body of each post. Each text entry is then converted into an embedding, which serves as input to a neural network classifier. The classifier is trained on a subset of categories and evaluated on a test set. The project walks through preparing text data for machine learning, generating embeddings, building a classification model, and classifying new, unseen text. Training uses early stopping and tracks accuracy.

This technique uses the Gemini API's embeddings as input, avoiding the need to train on text input directly, and as a result it is able to perform quite well using relatively few examples compared to training a text model from scratch.

### Steps:

1. **Loading the Dataset**:
   - The 20 Newsgroups dataset is loaded using `fetch_20newsgroups` from `sklearn`, with training and testing subsets.

2. **Preprocessing the Text Data**:
   - Emails are parsed to extract relevant text, including subjects and bodies.
   - Email addresses are removed, and text is truncated to a length of 5000 characters to standardize the input.

3. **Filtering Data by Categories**:
   - A subset of the dataset is sampled, selecting a specified number of samples per class.
   - Categories containing the string "sci" (for science-related topics) are kept for further processing.

4. **Generating Text Embeddings**:
   - The text data is converted into embeddings using a pre-trained model via a custom `embed_fn` function.
   - These embeddings serve as vector representations of the text, suitable for input into a machine learning model.

5. **Building the Classification Model**:
   - A simple neural network model is created using Keras with dense layers.
   - The model is designed to predict the class of the text based on the embeddings.

6. **Training the Model**:
   - The model is trained on the processed text embeddings, with training and validation data split into `x_train`, `y_train`, `x_val`, and `y_val`.
   - Early stopping is applied to avoid overfitting.

7. **Evaluating the Model**:
   - The trained model is evaluated on the test set to determine its accuracy in predicting the correct categories for new text.

8. **Prediction on New Text**:
   - An example text (not from the training set) is classified by the trained model, and the predicted category probabilities are printed.

### **Pre-requisite steps**

1. Install the Gemini API Python SDK
2. Import the necessary libraries
3. Set up the API key and make it available to this notebook
4. Load the dataset

**Step 1: Install the API**

In [1]:
%pip install -U -q "google-generativeai>=0.8.3"

Note: you may need to restart the kernel to use updated packages.


**Step 2: Import Libraries**

In [2]:
import google.generativeai as genai

import email
import re

import pandas as pd

**Step 3: Set up API key**

* Generate an API key from: [AI Studio](https://aistudio.google.com/app/apikey) using the [detailed instructions in the docs](https://ai.google.dev/gemini-api/docs/api-key)
* Store the API key in a [Kaggle secret](https://www.kaggle.com/discussions/product-feedback/114053) named `GOOGLE_API_KEY`.
* To make the key available through Kaggle secrets, choose `Secrets` from the `Add-ons` menu and follow the instructions to add your key or enable it for this notebook.

* *If you see an error response along the lines of `No user secrets exist for kernel id ...`, you need to add your API key via `Add-ons`, `Secrets` **and** enable it for this notebook.*

![Screenshot of the checkbox to enable GOOGLE_API_KEY secret](https://storage.googleapis.com/kaggle-media/Images/5gdai_sc_3.png)

In [3]:
from kaggle_secrets import UserSecretsClient

GOOGLE_API_KEY = UserSecretsClient().get_secret("GOOGLE_API_KEY")
genai.configure(api_key=GOOGLE_API_KEY)

**Step 4: Load the dataset**

The [20 Newsgroups Text Dataset](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html) contains around 18,000 newsgroup posts on 20 topics, divided into training and test sets. The split between the training and test sets is based on whether a message was posted before or after a specific date.

In [4]:
from sklearn.datasets import fetch_20newsgroups

newsgroups_train = fetch_20newsgroups(subset="train")
newsgroups_test = fetch_20newsgroups(subset="test")

# View list of class names for dataset
newsgroups_train.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

Here is an example of what a record from the training set looks like.

In [5]:
print(newsgroups_train.data[0])

From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----







In [6]:
def preprocess_newsgroup_row(data):
    # Extract only the subject and body
    msg = email.message_from_string(data)
    text = f"{msg['Subject']}\n\n{msg.get_payload()}"
    # Strip any remaining email addresses
    text = re.sub(r"[\w\.-]+@[\w\.-]+", "", text)
    # Truncate each entry to 5,000 characters
    text = text[:5000]

    return text


def preprocess_newsgroup_data(newsgroup_dataset):
    # Put data points into dataframe
    df = pd.DataFrame(
        {"Text": newsgroup_dataset.data, "Label": newsgroup_dataset.target}
    )
    # Clean up the text
    df["Text"] = df["Text"].apply(preprocess_newsgroup_row)
    # Match label to target name index
    df["Class Name"] = df["Label"].map(lambda l: newsgroup_dataset.target_names[l])

    return df
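
To see what the row-level cleaning does to a single message, the following stand-alone sketch applies the same parsing, address stripping, and truncation to a made-up post. The sample message, the function name `preprocess_message`, and the `max_chars` parameter are illustrative assumptions and are not part of the notebook.

```python
import email
import re


def preprocess_message(raw: str, max_chars: int = 5000) -> str:
    """Keep the subject and body, strip email addresses, then truncate."""
    msg = email.message_from_string(raw)
    text = f"{msg['Subject']}\n\n{msg.get_payload()}"
    # Same pattern as in the notebook: anything shaped like name@host is removed.
    text = re.sub(r"[\w\.-]+@[\w\.-]+", "", text)
    return text[:max_chars]


# A made-up post with the usual header block, a blank line, then the body.
sample = (
    "From: someone@example.com (A. Poster)\n"
    "Subject: Test post\n"
    "Lines: 2\n"
    "\n"
    "Please reply to someone@example.com.\n"
    "Thanks!\n"
)

cleaned = preprocess_message(sample)
print(cleaned)
```

Headers other than `Subject` (such as `From` and `Lines`) are dropped because only the subject and payload are copied into the output text.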

In [7]:
# Apply preprocessing function to training and test datasets
df_train = preprocess_newsgroup_data(newsgroups_train)
df_test = preprocess_newsgroup_data(newsgroups_test)

df_train.head()

| | Text | Label | Class Name |
|---|---|---|---|
| 0 | WHAT car is this!?\n\n I was wondering if anyo... | 7 | rec.autos |
| 1 | SI Clock Poll - Final Call\n\nA fair number of... | 4 | comp.sys.mac.hardware |
| 2 | PB questions...\n\nwell folks, my mac plus fin... | 4 | comp.sys.mac.hardware |
| 3 | Re: Weitek P9000 ?\n\nRobert J.C. Kyanko () wr... | 1 | comp.graphics |
| 4 | Re: Shuttle Launch Question\n\nFrom article <>... | 14 | sci.space |


To keep the dataset balanced and focused on specific categories, the `sample_data` function is used. It performs the following operations:


* Sampling: Selects a fixed number of rows (`num_samples`) from each class in the dataset.
* Filtering by Class: Keeps only the rows whose class name contains the string given by the `classes_to_keep` parameter.
* Re-calibrating Label Encoding: After filtering, the remaining classes are assigned new numeric labels, starting from 0.
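
The three operations above can be illustrated without pandas. This is a minimal plain-Python sketch, not the notebook's implementation: the row layout (`class_name`, `encoded_label`), the function name `sample_rows`, and the `seed` parameter are assumptions made for the example, and it uses a plain substring test where the notebook uses `str.contains`.

```python
import random
from collections import defaultdict


def sample_rows(rows, num_samples, classes_to_keep, seed=0):
    """Sample rows per class, keep matching classes, and re-encode labels."""
    rng = random.Random(seed)

    # Group the rows by class name.
    by_class = defaultdict(list)
    for row in rows:
        by_class[row["class_name"]].append(row)

    # Keep only matching classes and draw num_samples rows from each of them.
    kept = []
    for name in sorted(by_class):
        if classes_to_keep in name:
            kept.extend(rng.sample(by_class[name], num_samples))

    # Re-encode: number the surviving classes 0..k-1 in sorted order.
    surviving = sorted({row["class_name"] for row in kept})
    codes = {name: index for index, name in enumerate(surviving)}
    return [dict(row, encoded_label=codes[row["class_name"]]) for row in kept]


# Toy data: three classes with five rows each.
rows = [
    {"text": f"post {i}", "class_name": name}
    for name in ("rec.autos", "sci.med", "sci.space")
    for i in range(5)
]
sampled = sample_rows(rows, num_samples=2, classes_to_keep="sci")
for row in sampled:
    print(row["class_name"], row["encoded_label"])
```

Re-encoding matters because the classifier's output layer has one unit per remaining class, so the labels need to run from 0 to the number of classes minus one.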


In [8]:
def sample_data(df, num_samples, classes_to_keep):
    # Sample rows, selecting num_samples of each Label.
    df = (
        df.groupby("Label")[df.columns]
        .apply(lambda x: x.sample(num_samples))
        .reset_index(drop=True)
    )

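    # Keep only the matching classes. Copy so that the column assignments
    # below modify a new dataframe rather than a view of the original.
    df = df[df["Class Name"].str.contains(classes_to_keep)].copy()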

    # We have fewer categories now, so re-calibrate the label encoding.
    df["Class Name"] = df["Class Name"].astype("category")
    df["Encoded Label"] = df["Class Name"].cat.codes

    return df

In [9]:
TRAIN_NUM_SAMPLES = 100
TEST_NUM_SAMPLES = 25
CLASSES_TO_KEEP = "sci"  # Class name should contain 'sci' to keep science categories

df_train = sample_data(df_train, TRAIN_NUM_SAMPLES, CLASSES_TO_KEEP)
df_test = sample_data(df_test, TEST_NUM_SAMPLES, CLASSES_TO_KEEP)

In [10]:
df_train.value_counts("Class Name")

Class Name
sci.crypt          100
sci.electronics    100
sci.med            100
sci.space          100
Name: count, dtype: int64

In [11]:
df_test.value_counts("Class Name")

Class Name
sci.crypt          25
sci.electronics    25
sci.med            25
sci.space          25
Name: count, dtype: int64

## Create the embeddings

To perform text classification, we need to convert the text data into numerical representations (embeddings) that the model can process. In this step, we use Google’s Generative AI to generate embeddings for each document.

The following operations are performed:

***Embedding Function (embed_fn):***

The function sends the text to the Gemini API and requests an embedding tailored for classification. `embed_fn` is decorated with a retry mechanism to handle transient API failures such as timeouts.
The embedding contained in the API response is returned and later stored in the dataframe.
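
The general idea behind a retry wrapper can be sketched in a few lines. This is a simplified illustration of retrying with exponential backoff, not how `google.api_core.retry` is implemented; the names `with_retry` and `flaky_embed`, the chosen exception types, and the delay values are all assumptions made for the example.

```python
import time


def with_retry(fn, max_attempts=5, initial_delay=1.0, multiplier=2.0,
               retry_on=(TimeoutError, ConnectionError), sleep=time.sleep):
    """Wrap fn so that transient errors are retried with growing delays."""
    def wrapper(*args, **kwargs):
        delay = initial_delay
        for attempt in range(1, max_attempts + 1):
            try:
                return fn(*args, **kwargs)
            except retry_on:
                if attempt == max_attempts:
                    raise  # Out of attempts: surface the last error.
                sleep(delay)
                delay *= multiplier
    return wrapper


# Stand-in for an embedding call that times out twice before succeeding.
calls = {"count": 0}


def flaky_embed(text):
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated timeout")
    return [0.1, 0.2, 0.3]


# Record the delays instead of actually sleeping, to keep the demo fast.
delays = []
safe_embed = with_retry(flaky_embed, sleep=delays.append)
vector = safe_embed("hello")
print(vector, calls["count"], delays)
```

Passing the sleep function in as a parameter keeps the wrapper easy to test, since the waiting can be replaced by a recorder.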

***Create Embeddings (create_embeddings):***

This function applies the embed_fn function to each text entry in the dataframe, generating embeddings for the entire dataset.

**NOTE**: Embeddings are computed one at a time, so large sample sizes can take a long time!
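
If the embedding endpoint in use can accept several texts per request, grouping the texts into batches is one way to reduce the number of round trips. The sketch below only shows the batching logic: `embed_batch_fn` is a hypothetical callable that takes a list of texts and returns one vector per text, and `fake_embed_batch` is a stand-in so the example runs without network access.

```python
def embed_in_batches(texts, embed_batch_fn, batch_size=100):
    """Call embed_batch_fn on consecutive slices of texts and join the results."""
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        vectors = embed_batch_fn(batch)
        if len(vectors) != len(batch):
            raise ValueError("expected one embedding per input text")
        embeddings.extend(vectors)
    return embeddings


# Stand-in embedder: records batch sizes and returns a 1-dimensional "vector".
batch_sizes = []


def fake_embed_batch(batch):
    batch_sizes.append(len(batch))
    return [[float(len(text))] for text in batch]


vectors = embed_in_batches(
    ["a", "bb", "ccc", "dddd", "eeeee"], fake_embed_batch, batch_size=2
)
print(vectors, batch_sizes)
```

The order of the returned vectors matches the order of the input texts, so the result can be assigned back to a dataframe column directly.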

#### Task types

The `text-embedding-004` model supports a task type parameter that generates embeddings tailored for the specific task.

Task Type | Description
---       | ---
RETRIEVAL_QUERY | Specifies the given text is a query in a search/retrieval setting.
RETRIEVAL_DOCUMENT | Specifies the given text is a document in a search/retrieval setting.
SEMANTIC_SIMILARITY | Specifies the given text will be used for Semantic Textual Similarity (STS).
CLASSIFICATION | Specifies that the embeddings will be used for classification.
CLUSTERING | Specifies that the embeddings will be used for clustering.
FACT_VERIFICATION | Specifies that the given text will be used for fact verification.

***We are using Classification embedding type for this task.***

In [12]:
from google.api_core import retry
from tqdm.rich import tqdm


tqdm.pandas()

@retry.Retry(timeout=300.0)
def embed_fn(text: str) -> list[float]:
    # You will be performing classification, so set task_type accordingly.
    response = genai.embed_content(
        model="models/text-embedding-004", content=text, task_type="classification"
    )

    return response["embedding"]


def create_embeddings(df):
    df["Embeddings"] = df["Text"].progress_apply(embed_fn)
    return df

In [13]:
df_train = create_embeddings(df_train)
df_test = create_embeddings(df_test)


In [14]:
df_train.head()

| | Text | Label | Class Name | Encoded Label | Embeddings |
|---|---|---|---|---|---|
| 1100 | Stray thought (was Re: More technical details\... | 11 | sci.crypt | 0 | [-0.016891459, 0.026258027, -0.035663858, 0.05... |
| 1101 | Re: Licensing of public key implementations\n\... | 11 | sci.crypt | 0 | [0.023607424, 0.025368966, -0.022163793, 0.027... |
| 1102 | Re: text of White House announcement and Q&As ... | 11 | sci.crypt | 0 | [-0.011158817, 0.019066054, -0.05927952, -0.01... |
| 1103 | Re: Off the shelf cheap DES keyseach machine (... | 11 | sci.crypt | 0 | [-0.021023186, 0.01573056, -0.025516696, 0.032... |
| 1104 | Re: freely distributable public key cryptograp... | 11 | sci.crypt | 0 | [0.0020360276, 0.011576817, -0.034181945, 0.02... |


## Build a classification model

In this step, we construct a neural network model using Keras to perform text classification based on the generated embeddings. The model architecture is designed as follows:


* Input Layer: Accepts the embeddings generated for each text entry as input.
* Hidden Layer: A dense layer with ReLU activation that processes the embeddings.
* Output Layer: A dense layer with a softmax activation function, producing the probability distribution over the possible classes.
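
The computation these three layers perform can be written out by hand. The following is a toy forward pass in plain Python with made-up weights, meant only to show the data flow (dense, ReLU, dense, softmax); the layer sizes and all numbers are assumptions for the example, and Keras performs the same steps with learned weights and vectorized operations.

```python
import math


def dense(x, weights, bias):
    """Fully connected layer: one weight row and one bias per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(logits):
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, hidden_w, hidden_b, out_w, out_b):
    hidden = relu(dense(x, hidden_w, hidden_b))
    return softmax(dense(hidden, out_w, out_b))


# Toy sizes: a 3-dimensional "embedding", 3 hidden units, 4 classes.
x = [0.5, -1.0, 2.0]
hidden_w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
hidden_b = [0.0, 0.0, 0.0]
out_w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
out_b = [0.0, 0.0, 0.0, 0.0]

probs = forward(x, hidden_w, hidden_b, out_w, out_b)
print(probs)
```

The softmax output always sums to 1, which is why it can be read as a probability distribution over the classes.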

The `build_classification_model` function defines this model. Its input size is the embedding size, which is taken from the length of one embedding vector (i.e., the number of features per text entry).

The model is then compiled using:

* Loss Function: Sparse categorical cross-entropy (suitable for multi-class classification tasks).
* Optimizer: Adam optimizer with a learning rate of 0.001.
* Metrics: Accuracy is tracked during training.
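
As a worked example of the loss, sparse categorical cross-entropy for one sample is the negative log of the probability assigned to the true class, averaged over the batch. The sketch below is a simplified version written for illustration (the function name mirrors the Keras loss, but this is not the Keras implementation, which also handles numerical edge cases).

```python
import math


def sparse_categorical_crossentropy(probs, labels):
    """Mean of -log(probability assigned to the true class)."""
    losses = [-math.log(p[y]) for p, y in zip(probs, labels)]
    return sum(losses) / len(losses)


# A model that guesses uniformly over 4 classes.
chance_loss = sparse_categorical_crossentropy([[0.25, 0.25, 0.25, 0.25]], [2])

# A model that puts 90% of the probability on the correct class (index 3).
confident_loss = sparse_categorical_crossentropy([[0.02, 0.03, 0.05, 0.90]], [3])

print(chance_loss, confident_loss)
```

With four classes, uniform guessing gives a loss of ln 4, roughly 1.386. That is a useful baseline when reading the training log: a loss near that value means the model is doing little better than chance, and the loss should fall well below it as training progresses.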

While running the model, Keras will take care of details like shuffling the data points, calculating metrics and other ML boilerplate.

In [15]:
import keras
from keras import layers


def build_classification_model(input_size: int, num_classes: int) -> keras.Model:
    return keras.Sequential(
        [
            layers.Input([input_size], name="embedding_inputs"),
            layers.Dense(input_size, activation="relu", name="hidden"),
            layers.Dense(num_classes, activation="softmax", name="output_probs"),
        ]
    )

In [16]:
# Derive the embedding size from observing the data. The embedding size can also be specified
# with the `output_dimensionality` parameter to `embed_content` if you need to reduce it.
embedding_size = len(df_train["Embeddings"].iloc[0])

classifier = build_classification_model(
    embedding_size, len(df_train["Class Name"].unique())
)
classifier.summary()

classifier.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(),
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    metrics=["accuracy"],
)

## Train the model

After building and compiling the classification model, the next step is to train it using the preprocessed data. The training process includes the following steps:

***Preparation of Data:***

* The embeddings and corresponding labels are separated into `x_train`, `y_train` for training and `x_val`, `y_val` for validation. In this notebook the sampled test set (`df_test`) serves as the validation set.
* The embeddings are stacked into NumPy arrays to serve as input features, while the labels are used as target outputs.

***Early Stopping:***

To prevent overfitting and save training time, an early stopping callback is used. It monitors the training accuracy (`monitor="accuracy"`) and stops training if that metric fails to improve for three consecutive epochs (`patience=3`).
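
The patience rule can be simulated on a list of per-epoch metric values. This is a simplified model of patience-based stopping for a metric that should increase, written for illustration only; it is not the Keras callback, and the accuracy values are made up.

```python
def early_stop_epoch(values, patience=3, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("-inf")
    wait = 0
    for epoch, value in enumerate(values, start=1):
        if value > best + min_delta:
            # Improvement: remember the new best and reset the counter.
            best = value
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Made-up accuracy curve: no improvement over 0.70 in epochs 4, 5 and 6.
accuracies = [0.50, 0.60, 0.70, 0.70, 0.69, 0.70, 0.80]
print(early_stop_epoch(accuracies, patience=3))

# A curve that keeps improving never triggers the rule.
print(early_stop_epoch([0.1, 0.2, 0.3], patience=3))
```

In the made-up curve, training stops before the jump to 0.80 is ever seen, which shows the trade-off behind the `patience` setting: a small value saves time but can stop on a plateau. Monitoring a validation metric instead of the training metric is a common alternative when the aim is to detect overfitting.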

***Model Training:***

The model is trained using the following parameters:

* Epochs: The training runs for a maximum of 20 epochs.
* Batch Size: 32 samples are processed in each batch.
* Validation Data: The model evaluates its performance on the validation set during training.
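
These parameters also determine how many batches make up one epoch, which is the step counter shown in the training log (for example `13/13`). A quick check of the arithmetic, assuming 100 training and 25 test samples for each of the 4 kept classes and a batch size of 32 for both training and evaluation:

```python
import math


def steps_per_epoch(num_samples, batch_size):
    """Number of batches needed to cover every sample once."""
    return math.ceil(num_samples / batch_size)


train_steps = steps_per_epoch(4 * 100, 32)  # 400 training samples
eval_steps = steps_per_epoch(4 * 25, 32)    # 100 validation samples
print(train_steps, eval_steps)
```

The last batch of each pass is smaller than the others (16 samples for training, 4 for validation) because neither 400 nor 100 is a multiple of 32.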


In [17]:
import numpy as np


NUM_EPOCHS = 20
BATCH_SIZE = 32

# Split the x and y components of the train and validation subsets.
y_train = df_train["Encoded Label"]
x_train = np.stack(df_train["Embeddings"])
y_val = df_test["Encoded Label"]
x_val = np.stack(df_test["Embeddings"])

# Specify that it's OK to stop early if accuracy stabilises.
early_stop = keras.callbacks.EarlyStopping(monitor="accuracy", patience=3)

# Train the model for the desired number of epochs.
history = classifier.fit(
    x=x_train,
    y=y_train,
    validation_data=(x_val, y_val),
    callbacks=[early_stop],
    batch_size=BATCH_SIZE,
    epochs=NUM_EPOCHS,
)

Epoch 1/20
13/13 ━━━━━━━━━━━━━━━━━━━━ 1s 24ms/step - accuracy: 0.3385 - loss: 1.3636 - val_accuracy: 0.2800 - val_loss: 1.3033
Epoch 2/20
13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - accuracy: 0.4099 - loss: 1.2407 - val_accuracy: 0.7100 - val_loss: 1.1381
Epoch 3/20
13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.7998 - loss: 1.0561 - val_accuracy: 0.8300 - val_loss: 0.9791
Epoch 4/20
13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - accuracy: 0.8541 - loss: 0.8948 - val_accuracy: 0.7400 - val_loss: 0.8712
Epoch 5/20
13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - accuracy: 0.8786 - loss: 0.6886 - val_accuracy: 0.9000 - val_loss: 0.6768
Epoch 6/20
13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - accuracy: 0.9606 - loss: 0.5400 - val_accuracy: 0.9100 - val_loss: 0.5416
Epoch 7/20
13/13 ━━━━━━━━━

## Evaluate model performance

After training, the model's performance is evaluated on the validation dataset. The evaluation provides insights into how well the model generalizes to unseen data. The following metrics were used:

* **Accuracy**: The percentage of correctly classified samples out of the total samples in the validation set.
* **Loss**: The sparse categorical cross-entropy loss, which measures the difference between predicted probabilities and true labels.
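
To make the accuracy metric concrete, here is a small sketch that computes it from predicted probabilities: the predicted class is the index with the highest probability, and accuracy is the share of samples where that index matches the label. The probabilities and labels are made up for illustration.

```python
def accuracy(probs, labels):
    """Fraction of samples whose highest-probability class equals the label."""
    predicted = [max(range(len(p)), key=p.__getitem__) for p in probs]
    correct = sum(1 for guess, truth in zip(predicted, labels) if guess == truth)
    return correct / len(labels)


# Four made-up predictions over 4 classes; the third one is wrong.
probs = [
    [0.70, 0.10, 0.10, 0.10],
    [0.05, 0.80, 0.10, 0.05],
    [0.40, 0.30, 0.20, 0.10],
    [0.10, 0.10, 0.10, 0.70],
]
labels = [0, 1, 2, 3]
print(accuracy(probs, labels))
```

Accuracy only looks at which class ranks first, while the loss also reflects how confident each prediction is, so the two metrics can move somewhat independently.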

In [18]:
classifier.evaluate(x=x_val, y=y_val, return_dict=True)

4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9477 - loss: 0.1784


{'accuracy': 0.949999988079071, 'loss': 0.17597123980522156}

**The high accuracy indicates that the model performs well in classifying the given text data based on the embeddings.**

## Inference and Prediction

To test the model's ability to classify new data, we provide an example text that avoids space-specific terminology. This helps show whether the model generalizes beyond domain-specific jargon.

The steps are as follows:

**Input Preparation:**
* A sample text is provided, describing a query about space exploration.
* The text is embedded using the embed_fn function to create an input vector for the model.
  
**Prediction:**

* The embedding is passed to the trained model for classification.
* The model outputs probabilities for each class, indicating the likelihood that the text belongs to each category.
  
**Results:**

* The predicted class probabilities are displayed, showcasing the model's confidence in each category.
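
Turning the probability vector into a single predicted label is a matter of picking the largest entry. In this sketch the category order follows the four science classes in alphabetical order, and the probabilities are made-up numbers for illustration, not output from the trained model.

```python
def top_prediction(categories, probs):
    """Return the category with the highest probability and that probability."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return categories[best], probs[best]


categories = ["sci.crypt", "sci.electronics", "sci.med", "sci.space"]
example_probs = [0.01, 0.02, 0.02, 0.95]  # illustrative values only

label, confidence = top_prediction(categories, example_probs)
print(f"{label}: {confidence * 100:.2f}%")
```

The order of `categories` must match the order of the model's output units, which is why the notebook reads the names from the same categorical column that produced the encoded labels.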

In [19]:
# This example avoids any space-specific terminology to see if the model avoids
# biases towards specific jargon.
new_text = """
First-timer looking to get out of here.

Hi, I'm writing about my interest in travelling to the outer limits!

What kind of craft can I buy? What is easiest to access from this 3rd rock?

Let me know how to do that please.
"""
embedded = embed_fn(new_text)

In [20]:
# Remember that the model takes embeddings as input, and the input must be batched,
# so here they are passed as a list to provide a batch of 1.
# inp = np.array([embedded])
# [result] = classifier.predict(inp)

# for idx, category in enumerate(df_test["Class Name"].cat.categories):
#     print(f"{category}: {result[idx] * 100:0.2f}%")

In [22]:
import pandas as pd
from rich.console import Console
from rich.table import Table

# Predict probabilities
inp = np.array([embedded])
[result] = classifier.predict(inp)

# Using Pandas for tabular display
categories = df_test["Class Name"].cat.categories
output_df = pd.DataFrame({
    "Category": categories,
    "Confidence (%)": [f"{prob * 100:.2f}%" for prob in result]
})

# Display as a Pandas DataFrame
# print(output_df)

# Using Rich for styled console output
console = Console()
table = Table(title="Prediction Results")

table.add_column("Category", justify="left")
table.add_column("Confidence (%)", justify="right")

for category, prob in zip(categories, result):
    table.add_row(category, f"{prob * 100:.2f}%")

console.print(table)


1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 21ms/step


***Observation:***

* **The model's prediction demonstrates strong confidence in classifying the input text under the sci.space category.**
* **Minimal probabilities for other categories, such as sci.crypt, sci.electronics, and sci.med, indicate that the model effectively distinguishes between unrelated topics.**

## Conclusion

This project demonstrates the development of a text classification pipeline, from preprocessing raw text data to embedding generation and training a deep learning model. The model achieves high accuracy and can confidently classify new data, showcasing its generalization capabilities.

### Key Takeaways:
* Robustness: The model performs well in categorizing data into relevant classes with minimal bias.
* Versatility: The preprocessing pipeline ensures that the model is adaptable to various textual contexts.
* Accuracy: A validation accuracy of ~95% (0.95 as reported by `classifier.evaluate`) highlights the effectiveness of the embedding-based approach.

### Further Improvements
While the results are promising, there are opportunities for enhancement:

**Data Augmentation:**

* Introduce more diverse text samples to improve generalization across topics.
* Use techniques like paraphrasing or synonym replacement to increase dataset variety.

**Model Architecture:**

* Experiment with advanced architectures like transformers (e.g., BERT) to handle contextual nuances better.
* Fine-tune hyperparameters such as learning rate and batch size for improved performance.

**Bias and Fairness:**

* Conduct a thorough bias analysis to ensure the model is not disproportionately favoring certain topics.
* Include more balanced samples from underrepresented classes.

**Scalability:**

* Optimize the embedding and classification process to handle larger datasets efficiently.
* Deploy the model using a scalable API for real-time classification.