**AI & Machine Learning (KAN-CINTO4003U) - Copenhagen Business School | Spring 2025**

***


# Part II: BERT

Please see the description of the assignment in the README file (section 2) <br>
**Guide notebook**: [guides/bert_guide.ipynb](guides/bert_guide.ipynb)


***

<br>

* Note that you should report results using a classification report. 

* Also, remember to include some reflections on your results: how do they compare with the results from Part I, BoW? Are there any hyperparameters that are particularly important?

* You should follow the steps given in the `bert_guide` notebook.

* **Optionally**, you can fine-tune a pre-trained BERT model to classify news articles (the same task as in Part I), as is done in [guides/bert_guide_finetuning.ipynb](guides/bert_guide_finetuning.ipynb). Fine-tuning requires more computational resources, which is why this part is optional. For reference, training on a 2020 MacBook Pro with 16 GB RAM and an M1 chip results in an out-of-memory error, so we suggest that you use Google Colab or another cloud-based service with a GPU. You can upload the `bert_guide_finetuning.ipynb` notebook to Google Colab and run it there.

<br>

***

In [None]:
# imports for the project

from datasets import load_dataset, DatasetDict
from transformers import pipeline
import numpy as np
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split


### 1. Load the data

We can load this data directly from the Hugging Face Hub using [Hugging Face Datasets](https://huggingface.co/docs/datasets/). The result is a `Dataset` object per split, which we combine into a `DatasetDict`. Pretty neat!

**Note**: The first run of this cell downloads the dataset. By default, `load_dataset` caches the files on disk, so later runs should reuse the cached copy instead of downloading it again.

You are welcome to load more data by increasing the percentage in the `split` slice (e.g. `train[:20%]`).

In [2]:
ag_news_train = load_dataset("fancyzhx/ag_news", split="train[:20%]")  # 20% of the training data
ag_news_test = load_dataset("fancyzhx/ag_news", split="test")  # full test data

ag_news = DatasetDict({
    "train": ag_news_train,
    "test": ag_news_test
})

ag_news

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 24000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 7600
    })
})
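In the cell above, the amount of training data is set by the slice inside the `split` string rather than by a separate argument. A minimal sketch of building that string for other percentages is shown below; the helper name `train_slice` is hypothetical and not part of the `datasets` library.

```python
def train_slice(percent: int) -> str:
    """Build a split string such as 'train[:20%]' (hypothetical helper)."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range 1..100")
    return f"train[:{percent}%]"


print(train_slice(20))  # the slice used in this notebook
print(train_slice(50))  # half of the training split

# Assumed usage (triggers a download on first run, so it is left commented out):
# ag_news_train = load_dataset("fancyzhx/ag_news", split=train_slice(50))
```

Keep in mind that the embedding step further down scales with the number of rows, so a larger slice also means a longer run.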

In [4]:
embedder = pipeline(
    model="answerdotai/ModernBERT-base",      # model used for embedding
    tokenizer="answerdotai/ModernBERT-base",  # tokenizer used for embedding
    task="feature-extraction",                # feature extraction task (returns embeddings)
    device=0                                  # use GPU 0 if available
)

Device set to use mps:0


In [5]:
def get_embeddings(data):
    """ Extract the [CLS] embedding for each text. """
    embeddings = embedder(data["text"])  # Full token embeddings
    cls_embeddings = [e[0][0] for e in embeddings]  # Extract first token ([CLS])
    return {"embeddings": cls_embeddings}

ag_news = ag_news.map(get_embeddings, batched=True, batch_size=8)


Map:   0%|          | 0/24000 [00:00<?, ? examples/s]

Map:   0%|          | 0/7600 [00:00<?, ? examples/s]
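To illustrate what `e[0][0]` selects, the sketch below uses a made-up stand-in for the pipeline output, assuming each text yields a nested list shaped `[1][num_tokens][hidden_size]`. The variable `fake_output` and the tiny sizes are illustrative only. Mean pooling over tokens is shown as a common alternative to taking the first token; it is not what this notebook does.

```python
import numpy as np

# Hypothetical stand-in for the feature-extraction output:
# one entry per text, each shaped [1][num_tokens][hidden_size].
rng = np.random.default_rng(0)
hidden_size = 4
fake_output = [
    rng.normal(size=(1, 5, hidden_size)).tolist(),  # text with 5 tokens
    rng.normal(size=(1, 3, hidden_size)).tolist(),  # text with 3 tokens
]

# First-token ([CLS]) embedding, as in get_embeddings above
cls_vectors = np.array([e[0][0] for e in fake_output])

# Alternative: average over all token embeddings of each text
mean_vectors = np.array([np.mean(e[0], axis=0) for e in fake_output])

print(cls_vectors.shape)   # one vector per text
print(mean_vectors.shape)  # same shape, different pooling
```

Both variants produce one fixed-size vector per text, which is what the logistic regression classifier needs as input.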

In [6]:
ag_news

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'embeddings'],
        num_rows: 24000
    })
    test: Dataset({
        features: ['text', 'label', 'embeddings'],
        num_rows: 7600
    })
})

In [None]:
X = np.array(ag_news["train"]["embeddings"])
y = np.array(ag_news["train"]["label"])

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

X_test = np.array(ag_news["test"]["embeddings"])
y_test = np.array(ag_news["test"]["label"])

# Check shapes
print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}, y_test shape: {y_test.shape}")

X_train shape: (19200, 768), y_train shape: (19200,)
X_test shape: (7600, 768), y_test shape: (7600,)
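The split above is not stratified, so the class proportions in the training and validation parts can differ slightly. A quick way to inspect them is to count labels per class, sketched here on a small made-up label array (`y_demo` is illustrative, not the AG News labels).

```python
import numpy as np


def class_counts(labels: np.ndarray, num_classes: int) -> np.ndarray:
    """Count how many samples fall into each class index."""
    return np.bincount(labels, minlength=num_classes)


y_demo = np.array([0, 1, 1, 3, 2, 3, 3])
counts = class_counts(y_demo, num_classes=4)

print(counts)                 # samples per class: 1, 2, 1, 3
print(counts / counts.sum())  # class proportions
```

If the proportions should match across the parts, `train_test_split` accepts a `stratify=y` argument for that purpose.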


In [None]:
lr = LogisticRegression(n_jobs=-1, random_state=42, max_iter=2000)

param_grid = {
    'C': [0.01, 0.1, 1, 10, 100]  # Try a range of regularization strengths
}

# Use GridSearchCV to find the best hyperparameters in one go
grid = GridSearchCV(
    estimator=lr,
    param_grid=param_grid,
    cv=5,  # 5-fold cross-validation
    scoring='accuracy',  # accuracy as a default, since the purpose of the classification is not specified
    n_jobs=-1,
    verbose=1
)

grid.fit(X_train, y_train)

best_lr = grid.best_estimator_
print(f"Best C: {grid.best_params_['C']}")

Fitting 5 folds for each of 5 candidates, totalling 25 fits


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Best C: 0.1
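The convergence warning in the output suggests either raising `max_iter` or scaling the data. One possible way to follow the second suggestion is sketched below: a `StandardScaler` is placed in a `Pipeline` so that it is refitted inside each cross-validation fold. The synthetic data from `make_classification` is only a stand-in for the embedding matrix, and whether scaling removes the warning on the real embeddings has not been checked here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the embeddings (sizes are illustrative only)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=10, n_classes=4, random_state=42
)

# Scaling happens inside the pipeline, so each CV fold fits its own scaler
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=2000, random_state=42)),
])

# Pipeline parameters are addressed as <step name>__<parameter>
demo_grid = GridSearchCV(
    estimator=pipe,
    param_grid={"clf__C": [0.01, 0.1, 1, 10]},
    cv=5,
    scoring="accuracy",
)
demo_grid.fit(X_demo, y_demo)

print(demo_grid.best_params_)
```

Fitting the scaler inside the pipeline avoids leaking validation-fold statistics into training, which would happen if the whole matrix were scaled before the grid search.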


In [32]:
label_names = ag_news['train'].features['label'].names
print(label_names)

['World', 'Sports', 'Business', 'Sci/Tech']


In [33]:
y_pred_train = best_lr.predict(X_train)
print("Train classification report:")
print(classification_report(y_train, y_pred_train, digits=5, target_names=label_names))

Train classification report:
              precision    recall  f1-score   support

       World    0.90837   0.89892   0.90362      4996
      Sports    0.94900   0.97300   0.96085      4666
    Business    0.86086   0.85047   0.85563      4474
    Sci/Tech    0.88044   0.87836   0.87940      5064

    accuracy                        0.90021     19200
   macro avg    0.89967   0.90019   0.89988     19200
weighted avg    0.89981   0.90021   0.89996     19200
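To make the columns of the report concrete, the sketch below recomputes precision, recall, F1, accuracy and the macro average from a confusion matrix with NumPy. The arrays `y_true_demo` and `y_pred_demo` are tiny made-up examples and are unrelated to the AG News results.

```python
import numpy as np

y_true_demo = np.array([0, 0, 1, 1, 2, 2])
y_pred_demo = np.array([0, 1, 1, 1, 2, 0])
num_classes = 3

# cm[t, p] counts samples with true class t that were predicted as class p
cm = np.zeros((num_classes, num_classes), dtype=int)
np.add.at(cm, (y_true_demo, y_pred_demo), 1)

tp = np.diag(cm)                 # correct predictions per class
precision = tp / cm.sum(axis=0)  # divide by how often each class was predicted
recall = tp / cm.sum(axis=1)     # divide by how often each class truly occurs
f1 = 2 * precision * recall / (precision + recall)

accuracy = tp.sum() / cm.sum()
macro_f1 = f1.mean()             # unweighted mean over classes

# Note: a class that is never predicted would give a zero denominator here;
# this small example avoids that case.
print(precision, recall, f1)     # approx. (0.5, 0.667, 1.0), (0.5, 1.0, 0.5), (0.5, 0.8, 0.667)
print(accuracy, macro_f1)        # approx. 0.667 and 0.656
```

The `weighted avg` row of the report uses the same per-class values but weights them by the `support` column instead of averaging them equally.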



In [34]:
y_pred_val = best_lr.predict(X_val)
print("Validation classification report:")
print(classification_report(y_val, y_pred_val, digits=5, target_names=label_names))

Validation classification report:
              precision    recall  f1-score   support

       World    0.87573   0.86989   0.87280      1199
      Sports    0.94330   0.96471   0.95388      1190
    Business    0.83717   0.83940   0.83828      1127
    Sci/Tech    0.86846   0.85358   0.86096      1284

    accuracy                        0.88187      4800
   macro avg    0.88117   0.88189   0.88148      4800
weighted avg    0.88149   0.88187   0.88163      4800



### Test set

In [35]:
y_pred_test = best_lr.predict(X_test)
print("Test classification report:")
print(classification_report(y_test, y_pred_test, digits=5, target_names=label_names))

Test classification report:
              precision    recall  f1-score   support

       World    0.90049   0.87158   0.88580      1900
      Sports    0.95376   0.95526   0.95451      1900
    Business    0.83904   0.81211   0.82535      1900
    Sci/Tech    0.82070   0.87211   0.84562      1900

    accuracy                        0.87776      7600
   macro avg    0.87850   0.87776   0.87782      7600
weighted avg    0.87850   0.87776   0.87782      7600



### Conclusion
The best model used the hyperparameter `C = 0.1`, which gave the highest cross-validated accuracy in the grid search.
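As a reminder of what `C` controls, the objective of L2-penalized logistic regression (the scikit-learn default penalty) can be sketched, up to a constant scaling, as follows. The symbols $w$, $x_i$, $y_i$ and $\ell$ are notation introduced here for illustration.

$$
\min_{w} \; \frac{1}{2}\lVert w \rVert_2^2 \;+\; C \sum_{i=1}^{n} \ell\big(y_i, x_i; w\big)
$$

Here $\ell$ is the logistic (cross-entropy) loss of sample $i$. A smaller `C` gives the penalty term relatively more weight, so `C = 0.1` corresponds to stronger regularization than the default `C = 1.0`.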

This model performed better on all metrics compared to the BoW model, reaching an accuracy of about 0.88 on both the validation set (0.882) and the test set (0.878). The main reason is most likely the use of BERT-based embeddings. Unlike bag-of-words counts, these embeddings represent each word in the context of the surrounding words, so texts with similar meaning but different wording end up with similar representations. This gives the logistic regression classifier more informative features for separating the four classes.