<a href="https://colab.research.google.com/github/nitaph/ma2/blob/main/bert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**AI & Machine Learning (KAN-CINTO4003U) - Copenhagen Business School | Spring 2025**

***


# Part II: BERT

Please see the description of the assignment in the README file (section 2) <br>
**Guide notebook**: [guides/bert_guide.ipynb](guides/bert_guide.ipynb)


***

<br>

* Note that you should report results using a classification report.

* Also, remember to include some reflections on your results: how do they compare with the results from Part I, BoW? Are there any hyperparameters that are particularly important?

* You should follow the steps given in the `bert_guide` notebook.

* **Optionally**, you can fine-tune a pre-trained BERT model to classify news articles (the same task as in Part I), as is done in [guides/bert_guide_finetuning.ipynb](guides/bert_guide_finetuning.ipynb). This part is optional because it requires more computational resources: you will need a GPU to train the model. (For reference, training on a 2020 MacBook Pro with 16GB RAM and an M1 chip results in an out-of-memory error.) We therefore suggest Google Colab or another cloud-based service with a GPU; you can upload the `bert_guide_finetuning.ipynb` notebook to Google Colab and run it there.

<br>

***

In [2]:
# imports for the project
!pip install datasets

from datasets import load_dataset, DatasetDict
from transformers import pipeline
import numpy as np
from sklearn.metrics import classification_report

from sklearn.linear_model import LogisticRegression





In [3]:
# =================================
# Section 1: load the data
# =================================

ag_news_train = load_dataset("fancyzhx/ag_news", split="train[:5%]")  # 5% of the training data
ag_news_test = load_dataset("fancyzhx/ag_news", split="test[:20%]")  # 20% of test data

ag_news = DatasetDict({
    "train": ag_news_train,
    "test": ag_news_test
})

ag_news



DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 6000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1520
    })
})

In [4]:
# ===================================
# Section 2: Load ModernBERT pipeline
# ===================================

embedder = pipeline(
    model="answerdotai/ModernBERT-base",      # model used for embedding
    tokenizer="answerdotai/ModernBERT-base",  # tokenizer used for embedding
    task="feature-extraction",                # feature extraction task (returns embeddings)
    device=0                                  # run on GPU 0 (pass device=-1 to run on CPU)
)

Device set to use cuda:0


In [5]:
# ==========================
# Section 3: encode the data
# ==========================

def get_embeddings(data):
    """ Extract the [CLS] embedding for each text. """
    embeddings = embedder(data["text"])  # Full token embeddings
    cls_embeddings = [e[0][0] for e in embeddings]  # Extract first token ([CLS])
    return {"embeddings": cls_embeddings}

ag_news = ag_news.map(get_embeddings, batched=True, batch_size=8)


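The `e[0][0]` indexing in `get_embeddings` relies on the feature-extraction pipeline returning, for each text, a nested list shaped `[1][num_tokens][hidden_size]`. The sketch below illustrates that indexing on synthetic data and contrasts it with mean pooling, a common alternative that the notebook does not use. The names `fake_output`, `first_token_pool`, and `mean_pool` are hypothetical, and the tiny `hidden_size` is chosen only for readability.

```python
import numpy as np

# Synthetic stand-in for the pipeline output (assumption: one entry per text,
# each shaped [1][num_tokens][hidden_size]; the real model uses hidden_size=768).
rng = np.random.default_rng(0)
hidden_size = 4
fake_output = [
    rng.normal(size=(1, num_tokens, hidden_size)).tolist()
    for num_tokens in (5, 3)  # two texts of different lengths
]

def first_token_pool(output):
    """Keep only the first token's vector, as get_embeddings does."""
    return np.array([e[0][0] for e in output])

def mean_pool(output):
    """Average over all token vectors of each text (alternative, not used above)."""
    return np.array([np.mean(e[0], axis=0) for e in output])

cls_vectors = first_token_pool(fake_output)
mean_vectors = mean_pool(fake_output)
print(cls_vectors.shape, mean_vectors.shape)
```

Both strategies yield one fixed-size vector per text regardless of its length, which is what the logistic regression needs as input.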


In [6]:
# =========================================
# Section 4: build feature and label arrays
# =========================================

X_train = np.array(ag_news["train"]["embeddings"])  # Feature embeddings
y_train = np.array(ag_news["train"]["label"])       # Labels

X_test = np.array(ag_news["test"]["embeddings"])
y_test = np.array(ag_news["test"]["label"])


print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}, y_test shape: {y_test.shape}")


X_train shape: (6000, 768), y_train shape: (6000,)
X_test shape: (1520, 768), y_test shape: (1520,)


In [7]:
# ==========================
# Section 5: create classifier
# ==========================

lr = LogisticRegression(C=0.1,
        solver='saga',
        penalty='l1',
        class_weight='balanced',
        max_iter=1000,
        random_state=42,
        tol=1e-4)

lr.fit(X_train, y_train)
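
The value `C=0.1` used above is one point on a range of possible regularization strengths. As a minimal sketch of how several values of `C` could be compared with cross-validation, the example below runs on synthetic data from `make_classification` rather than on the embeddings, and it uses scikit-learn's default L2 penalty and solver for speed, so its scores say nothing about the real task. `X_demo`, `y_demo`, and `candidate_Cs` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic 4-class problem standing in for the embedding features (assumption).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=8, n_classes=4, random_state=42
)

candidate_Cs = (0.01, 0.1, 1.0, 10.0)  # smaller C means stronger regularization
scores = {}
for C in candidate_Cs:
    clf = LogisticRegression(C=C, max_iter=1000)
    # Mean macro-F1 over 3 cross-validation folds
    scores[C] = cross_val_score(clf, X_demo, y_demo, cv=3, scoring="f1_macro").mean()

best_C = max(scores, key=scores.get)
for C, score in scores.items():
    print(f"C={C}: macro-F1={score:.3f}")
print("best C:", best_C)
```

Applied to the notebook, the same loop could take `X_train` and `y_train` in place of the synthetic arrays, keeping the test split untouched until the final evaluation.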

In [8]:
y_pred_train = lr.predict(X_train)

print(classification_report(y_train, y_pred_train))

              precision    recall  f1-score   support

           0       0.86      0.84      0.85      1547
           1       0.90      0.95      0.93      1302
           2       0.82      0.82      0.82      1449
           3       0.86      0.85      0.85      1702

    accuracy                           0.86      6000
   macro avg       0.86      0.87      0.86      6000
weighted avg       0.86      0.86      0.86      6000



In [9]:
# ==========================
# Section 6: make predictions
# ==========================

y_pred = lr.predict(X_test)

In [10]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.81      0.87      0.84       383
           1       0.93      0.94      0.94       404
           2       0.79      0.80      0.80       339
           3       0.88      0.80      0.84       394

    accuracy                           0.86      1520
   macro avg       0.85      0.85      0.85      1520
weighted avg       0.86      0.86      0.86      1520
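
The reports above identify classes only by integer id. As a minimal sketch, assuming the usual AG News label order (0 = World, 1 = Sports, 2 = Business, 3 = Sci/Tech; this could be confirmed with `ag_news["train"].features["label"].names`), `classification_report` accepts a `target_names` argument that makes the rows easier to read. The toy labels below are invented for illustration and are not the notebook's predictions.

```python
from sklearn.metrics import classification_report

# Assumed AG News label order; verify against the dataset's ClassLabel names.
label_names = ["World", "Sports", "Business", "Sci/Tech"]

# Toy ground truth and predictions, for illustration only.
y_true_demo = [0, 1, 2, 3, 0, 1, 2, 3]
y_pred_demo = [0, 1, 2, 3, 0, 1, 3, 3]

# Human-readable text report
print(classification_report(y_true_demo, y_pred_demo, target_names=label_names))

# Same report as a dictionary, convenient for comparing against Part I results
report = classification_report(
    y_true_demo, y_pred_demo, target_names=label_names, output_dict=True
)
```

In the notebook itself, `y_test` and `y_pred` would replace the toy lists.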

