**AI & Machine Learning (KAN-CINTO4003U) - Copenhagen Business School | Spring 2025**

***


# Part II: BERT

Please see the description of the assignment in the README file (section 2) <br>
**Guide notebook**: [guides/bert_guide.ipynb](guides/bert_guide.ipynb)


***

<br>

* Note that you should report results using a classification report. 

* Also, remember to include some reflections on your results: how do they compare with the results from Part I, BoW? Are there any hyperparameters that are particularly important?

* You should follow the steps given in the `bert_guide` notebook.

* **Optionally**, you can fine-tune a pre-trained BERT model to classify news articles (the same task as in Part I), as is done in [guides/bert_guide_finetuning.ipynb](guides/bert_guide_finetuning.ipynb). This part is optional because it requires more computational resources: you will need a GPU to train the model. (For reference, training on a 2020 MacBook Pro with 16 GB RAM and an M1 chip results in an out-of-memory error.) We therefore suggest using Google Colab or another cloud-based service with a GPU. You can easily upload the `bert_guide_finetuning.ipynb` notebook to Google Colab and run it there.

<br>

***

In [1]:
# imports for the project

from datasets import load_dataset, DatasetDict
from transformers import pipeline
import numpy as np
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression

dataset = load_dataset("fancyzhx/ag_news")




### 1. Load the data

We can load this data directly from the Hugging Face Hub with [Hugging Face Datasets](https://huggingface.co/docs/datasets/) and collect the splits in a `DatasetDict`. Pretty neat!

**Note**: This cell downloads the dataset the first time it runs. The `datasets` library caches downloads on disk by default, so re-running the cell should reuse the cached copy instead of downloading it again.

You are welcome to increase `TRAIN_SIZE` and `TEST_SIZE` (both given in percent) to load more data.
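As a rough sketch of what the percent slices select, the helper below computes the expected subset sizes. It assumes the full AG News splits hold 120,000 training rows and 7,600 test rows, which is consistent with the 1,200 and 760 rows this notebook reports for 1% and 10%. The name `rows_for_percent` is hypothetical, and plain integer division is a simplification of how the `datasets` library rounds slice boundaries.

```python
def rows_for_percent(total_rows: int, percent: int) -> int:
    """Approximate number of rows selected by a split such as 'train[:1%]'."""
    return total_rows * percent // 100


# Assumed full split sizes for AG News
FULL_TRAIN_ROWS = 120_000
FULL_TEST_ROWS = 7_600

train_rows = rows_for_percent(FULL_TRAIN_ROWS, 1)   # TRAIN_SIZE = 1
test_rows = rows_for_percent(FULL_TEST_ROWS, 10)    # TEST_SIZE = 10
print(train_rows, test_rows)
```

Increasing either percentage scales the subset linearly, and the embedding step that follows gets slower in proportion.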

In [2]:
from datasets import load_dataset, DatasetDict

TRAIN_SIZE = 1  # percent
TEST_SIZE = 10  # percent

ag_news_train = load_dataset("fancyzhx/ag_news", split=f"train[:{TRAIN_SIZE}%]")
ag_news_test = load_dataset("fancyzhx/ag_news", split=f"test[:{TEST_SIZE}%]")

ag_news = DatasetDict({
    "train": ag_news_train,
    "test": ag_news_test
})


In [3]:
embedder = pipeline(
    model="answerdotai/ModernBERT-base",
    tokenizer="answerdotai/ModernBERT-base",
    task="feature-extraction",
    device=0  # Use GPU 0 if available, or set to -1 for CPU
)

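def get_embeddings(data):
    # For each text the pipeline returns a nested list shaped [1][num_tokens][hidden_size]
    embeddings = embedder(data["text"])  # Full token embeddings
    cls_embeddings = [e[0][0] for e in embeddings]  # Keep only the first token ([CLS]) vector
    return {"embeddings": cls_embeddings}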

ag_news = ag_news.map(get_embeddings, batched=True, batch_size=8)



Device set to use mps:0
Map: 100%|██████████| 1200/1200 [02:53<00:00,  6.92 examples/s]
Map: 100%|██████████| 760/760 [01:40<00:00,  7.53 examples/s]
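To make the `e[0][0]` indexing in `get_embeddings` easier to follow, here is a minimal sketch on made-up data. It assumes the feature-extraction pipeline returns, for each text, a nested list shaped `[1][num_tokens][hidden_size]`. The variable `fake_outputs` and the tiny hidden size of 4 are illustrative only (the real embeddings have 768 dimensions). Mean pooling is shown as an alternative for comparison; it is not what this notebook uses.

```python
import numpy as np

# Stand-in for the pipeline output: one entry per text, each shaped
# [1][num_tokens][hidden_size]. Values and sizes are made up.
rng = np.random.default_rng(0)
hidden_size = 4
fake_outputs = [
    rng.normal(size=(1, num_tokens, hidden_size)).tolist()
    for num_tokens in (5, 3, 7)  # three texts of different lengths
]

# First-token ([CLS]) embedding, the same indexing as in get_embeddings
cls_embeddings = np.array([e[0][0] for e in fake_outputs])

# Alternative (not used in this notebook): average over all token vectors
mean_embeddings = np.array([np.mean(e[0], axis=0) for e in fake_outputs])

print(cls_embeddings.shape, mean_embeddings.shape)
```

Either way, each text ends up as one fixed-length vector regardless of how many tokens it has, which is what the Logistic Regression classifier needs as input.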


In [4]:
ag_news

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'embeddings'],
        num_rows: 1200
    })
    test: Dataset({
        features: ['text', 'label', 'embeddings'],
        num_rows: 760
    })
})

In [5]:

X_train = np.array(ag_news["train"]["embeddings"])  # Feature embeddings
y_train = np.array(ag_news["train"]["label"])       # Labels

X_test = np.array(ag_news["test"]["embeddings"])
y_test = np.array(ag_news["test"]["label"])

# Check shapes
print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}, y_test shape: {y_test.shape}")


X_train shape: (1200, 768), y_train shape: (1200,)
X_test shape: (760, 768), y_test shape: (760,)


In [6]:
lr = LogisticRegression(max_iter=1000)

lr.fit(X_train, y_train)
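The classifier above uses the default regularization strength. As a hedged sketch of how that hyperparameter could be tuned, the example below runs a small cross-validated grid search over `C` on synthetic data. The data from `make_classification`, the grid values, and the names `X_demo` and `y_demo` are assumptions chosen for illustration; on the real task, `X_train` and `y_train` would take their place, and the best value may differ.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the embeddings: 4 classes, 32 features
X_demo, y_demo = make_classification(
    n_samples=400,
    n_features=32,
    n_informative=16,
    n_classes=4,
    random_state=0,
)

# Smaller C means stronger L2 regularization
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=5,                # 5-fold cross-validation on the training data only
    scoring="f1_macro",  # treat all four classes equally
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

Because the search only uses cross-validation folds of the training data, the test split stays untouched for the final classification report.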

In [7]:
y_pred_train = lr.predict(X_train)

print(classification_report(y_train, y_pred_train))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       273
           1       1.00      1.00      1.00       182
           2       1.00      0.99      1.00       202
           3       1.00      1.00      1.00       543

    accuracy                           1.00      1200
   macro avg       1.00      1.00      1.00      1200
weighted avg       1.00      1.00      1.00      1200



In [8]:
y_pred = lr.predict(X_test)

In [10]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.75      0.76      0.75       197
           1       0.91      0.87      0.89       199
           2       0.66      0.59      0.62       158
           3       0.74      0.82      0.78       206

    accuracy                           0.77       760
   macro avg       0.76      0.76      0.76       760
weighted avg       0.77      0.77      0.77       760
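The two average rows in the report can be reproduced from the per-class rows. The sketch below recomputes the macro and weighted F1 from the rounded values printed above, so the results are only accurate to about two decimals.

```python
# Per-class F1 and support copied from the test-set report above
f1_scores = [0.75, 0.89, 0.62, 0.78]
supports = [197, 199, 158, 206]

# Macro average: unweighted mean over the classes
macro_f1 = sum(f1_scores) / len(f1_scores)

# Weighted average: each class counts in proportion to its support
weighted_f1 = sum(f * s for f, s in zip(f1_scores, supports)) / sum(supports)

print(f"macro F1: {macro_f1:.2f}, weighted F1: {weighted_f1:.2f}")
```

The macro average sits slightly below the weighted one because class 2, the weakest class, also has the smallest support.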



In [None]:
## Comment
# The BERT-based model uses ModernBERT to extract the contextual [CLS] token embedding for each input text.
# These embeddings are dense and encode both syntax and semantics.
# A Logistic Regression model was then trained on these embeddings.

# The test report shows that the BERT-based model performs well on the sports category (class 1, F1 0.89)
# and is less accurate on the other categories; the weakest is class 2 (precision 0.66, F1 0.62).
# Training accuracy is 1.00 while test accuracy is 0.77, which suggests the classifier overfits the small training set.

# Although BERT embeddings carry richer contextual information, the BoW model from Part I outperformed it in this case,
# likely because of the small training size (1% of the training split), the lack of hyperparameter tuning,
# and BoW's effectiveness on short texts like headlines.


