**AI & Machine Learning (KAN-CINTO4003U) - Copenhagen Business School | Spring 2025**

***


# Part I: Bag-of-Words Model

Please see the description of the assignment in the README file (section 1) <br>
**Guide notebook**: [guides/bow_guide.ipynb](guides/bow_guide.ipynb)


***

<br>

* Note that you should report results using a classification report. 

* Also, remember to include some reflections on your results: Are there any hyperparameters that are particularly important?

* You should follow the steps given in the `bow_guide` notebook

<br>

***

In [11]:
# imports for the project

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

### 1. Load the data

We can load this data directly from the Hugging Face Hub (see [Hugging Face Datasets](https://huggingface.co/docs/datasets/)) into a pandas DataFrame. Pretty neat!

**Note**: This cell will download the dataset and keep it in memory. If you run this cell multiple times, it will download the dataset multiple times.

You are welcome to increase the `frac` parameter to load more data.

In this part, you will implement a simple Bag-of-Words (BoW) model. You will use the `CountVectorizer` (or `TfidfVectorizer`) from `sklearn` to create a BoW representation of a text dataset. You will then use the BoW representation to train a simple classifier for the AG News dataset. Specifically, your task is to:

- Explore the guidelines in the `bow_guide.ipynb` notebook.
- Implement a BoW model to classify news articles in `assignments/bow.ipynb`.
- Train one or more classifiers (e.g., Logistic Regression) on the BoW representation.
- Experiment with different hyperparameters to (possibly) improve performance over the baseline in `bow.ipynb`.
- Briefly reflect on the performance of your system and your choices of hyperparameters for the BoW model and the classifier.
    - Write your analysis as a markdown cell at the bottom of the notebook.
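
The tasks above mention swapping `CountVectorizer` for `TfidfVectorizer` and experimenting with hyperparameters. One way to keep those choices in a single place is a scikit-learn `Pipeline`. The sketch below is only an illustration: the toy corpus, the variable names (`toy_texts`, `toy_pipeline`) and the particular hyperparameter values are assumptions, not part of the assignment.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for AG News (not real data)
toy_texts = [
    "the team won the match in extra time",
    "the striker scored two goals in the final",
    "shares fell as the bank reported weak profits",
    "the company raised its profit forecast for the year",
]
toy_labels = ["Sports", "Sports", "Business", "Business"]

# Vectorizer and classifier hyperparameters live side by side,
# which makes it easier to vary them together later on
toy_pipeline = Pipeline([
    ("bow", TfidfVectorizer(ngram_range=(1, 2), min_df=1)),  # unigrams and bigrams
    ("clf", LogisticRegression(C=1.0, max_iter=1000)),
])
toy_pipeline.fit(toy_texts, toy_labels)

toy_pred = toy_pipeline.predict(["the goalkeeper saved the match"])
print(toy_pred)
```

Because the pipeline accepts raw text, the same object could later be handed to a search utility such as `GridSearchCV` to vary `bow__ngram_range` or `clf__C`.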


In [12]:

splits = {'train': 'data/train-00000-of-00001.parquet', 'test': 'data/test-00000-of-00001.parquet'}

train = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["train"])
test = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["test"])

print(train.shape, test.shape)

(120000, 2) (7600, 2)
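
As noted above, re-running the loading cell downloads the data again. A small caching helper could avoid that. This is a minimal sketch under assumptions: `load_cached` and the cache file name are hypothetical, and pickle is used only to keep the example free of extra dependencies.

```python
from pathlib import Path

import pandas as pd


def load_cached(source: str, cache_path: str, reader=pd.read_parquet) -> pd.DataFrame:
    """Load a DataFrame from a local cache if present, otherwise read and cache it.

    `reader` is any callable that takes `source` and returns a DataFrame.
    """
    cache = Path(cache_path)
    if cache.exists():
        # Cache hit: skip the (possibly slow) download
        return pd.read_pickle(cache)
    df = reader(source)
    df.to_pickle(cache)
    return df

# Possible usage in this notebook (file name is an assumption):
# train = load_cached("hf://datasets/fancyzhx/ag_news/" + splits["train"], "ag_news_train.pkl")
```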


In [13]:

label_map = {
    0: 'World',
    1: 'Sports',
    2: 'Business',
    3: 'Sci/Tech'
}

def preprocess(df: pd.DataFrame, frac: float = 1e-3, label_map: dict[int, str] = label_map, seed: int = 42) -> pd.DataFrame:
    """ Preprocess the dataset 

    Operations:
    - Map the label to the corresponding category
    - Filter out the labels not in the label_map
    - Sample a fraction of the dataset (stratified by label)

    Args:
    - df (pd.DataFrame): The dataset to preprocess
    - frac (float): The fraction of the dataset to sample in each category
    - label_map (dict): A mapping of the original label to the new label
    - seed (int): The random seed for reproducibility

    Returns:
    - pd.DataFrame: The preprocessed dataset
    """

    return  (
        df
        .assign(label=lambda x: x['label'].map(label_map))
        [lambda df: df['label'].isin(label_map.values())]
        .groupby('label')[["text", "label"]]
        .apply(lambda x: x.sample(frac=frac, random_state=seed))
        .reset_index(drop=True)

    )

train_df = preprocess(train, frac=0.05)
test_df = preprocess(test, frac=0.2)

# clear up some memory by deleting the original dataframes
del train
del test

train_df.shape, test_df.shape

((6000, 2), (1520, 2))
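
Since `preprocess` samples the same fraction from every category, the result should keep the class proportions of the input. The sketch below checks this on a hypothetical toy frame. It uses pandas' `DataFrameGroupBy.sample`, a shorter way to draw a per-group sample and not the exact code used above.

```python
import pandas as pd

# Hypothetical balanced toy frame: 10 documents per category
toy_df = pd.DataFrame({
    "text": [f"doc {i}" for i in range(40)],
    "label": ["World", "Sports", "Business", "Sci/Tech"] * 10,
})

# Sample half of each category, then count what is left per label
toy_sample = toy_df.groupby("label").sample(frac=0.5, random_state=42)
toy_label_counts = toy_sample["label"].value_counts()
print(toy_label_counts)  # 5 documents for each of the four labels
```

Running `train_df["label"].value_counts()` in the notebook would give the same kind of check on the real sample.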

In [14]:
# ===============================
# Section 1: splitting the data
# ===============================

X_train, X_val, y_train, y_val = train_test_split(
    train_df["text"], train_df["label"], test_size=0.2, random_state=42
)

print(X_train.shape, X_val.shape, y_train.shape, y_val.shape)

(4800,) (1200,) (4800,) (1200,)
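
The split above is purely random, so the class counts in the validation set can drift away from the balanced training sample. If equal proportions are wanted, `train_test_split` accepts a `stratify` argument. A minimal sketch on hypothetical toy data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 documents per category
toy_text = pd.Series([f"doc {i}" for i in range(40)])
toy_label = pd.Series(["World", "Sports", "Business", "Sci/Tech"] * 10)

# stratify=toy_label keeps the label proportions equal in both parts
X_tr, X_va, y_tr, y_va = train_test_split(
    toy_text, toy_label, test_size=0.2, random_state=42, stratify=toy_label
)
print(y_va.value_counts())  # 2 documents per label in the validation part
```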


In [15]:
# ===============================
# Section 2: building the BoW model
# ===============================

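# Fit the CountVectorizer on the training text only and build the document-term matrix
cv = CountVectorizer()
X_train_vectorized = cv.fit_transform(X_train)
# Dense view for inspection only; the classifier is trained on the sparse matrix
X_train_vectorized.todense()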

matrix([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]])

In [17]:
# ===============================
# Section 3: training the classifier
# ===============================

# Hyperparameters are set here; class_weight="balanced" weights classes inversely to their frequency
lr_clf = LogisticRegression(tol=1e-5, class_weight="balanced", random_state=42, max_iter=120)

lr_clf.fit(X_train_vectorized, y_train)

In [7]:
# ===============================
# Section 4: predicting on the validation set
# ===============================

X_val_vectorized = cv.transform(X_val) # note that we use transform here, not fit_transform

y_pred = lr_clf.predict(X_val_vectorized)
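
The reason for calling `transform` and not `fit_transform` here is that the validation data must be mapped onto the vocabulary learned from the training data. Words not seen during fitting are simply dropped. A small sketch with made-up sentences:

```python
from sklearn.feature_extraction.text import CountVectorizer

demo_cv = CountVectorizer()
demo_cv.fit(["stocks rise on earnings", "team wins final"])  # 7 distinct words

# "lottery" was never seen during fit, so it gets no column and is ignored
demo_row = demo_cv.transform(["team wins lottery"])
print(demo_row.shape)  # (1, 7)
print(demo_row.sum())  # 2
```

Calling `fit_transform` on the validation text would instead build a new vocabulary with different columns, which the trained classifier could not interpret.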

In [8]:
# ===============================
# Section 5: evaluating the model
# ===============================

# The labels are strings, so classification_report lists them in sorted (alphabetical) order.
# Passing target_names=label_map.values() would attach the names by position and mislabel
# the rows, so the report is left to name the rows from the labels themselves.
print("Performance on the training set:")
print(classification_report(y_train, lr_clf.predict(X_train_vectorized)))

print("Performance on the validation set:")
print(classification_report(y_val, y_pred))

Performance on the training set:
              precision    recall  f1-score   support

    Business       1.00      1.00      1.00      1170
    Sci/Tech       1.00      1.00      1.00      1217
      Sports       1.00      1.00      1.00      1211
       World       1.00      1.00      1.00      1202

    accuracy                           1.00      4800
   macro avg       1.00      1.00      1.00      4800
weighted avg       1.00      1.00      1.00      4800

Performance on the validation set:
              precision    recall  f1-score   support

    Business       0.85      0.83      0.84       330
    Sci/Tech       0.80      0.84      0.82       283
      Sports       0.92      0.95      0.94       289
       World       0.92      0.88      0.90       298

    accuracy                           0.87      1200
   macro avg       0.87      0.87      0.87      1200
weighted avg       0.87      0.87      0.87      1200
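
One detail worth checking when the targets are strings: as far as I understand scikit-learn, `classification_report` orders its rows by sorted label unless `labels=` is given, and `target_names` is then applied by position. The sketch below uses made-up predictions and `output_dict=True` to show the row order with and without an explicit `labels` list.

```python
from sklearn.metrics import classification_report

# Made-up targets and predictions, for illustration only
demo_true = ["World", "Sports", "Business", "Sci/Tech", "World", "Sports"]
demo_pred = ["World", "Sports", "Business", "Sci/Tech", "Sports", "Sports"]
demo_order = ["World", "Sports", "Business", "Sci/Tech"]

# Default: rows follow the sorted labels
default_report = classification_report(demo_true, demo_pred, output_dict=True, zero_division=0)
print(list(default_report)[:4])  # ['Business', 'Sci/Tech', 'Sports', 'World']

# Explicit labels: rows follow the requested order
ordered_report = classification_report(
    demo_true, demo_pred, labels=demo_order, output_dict=True, zero_division=0
)
print(list(ordered_report)[:4])  # ['World', 'Sports', 'Business', 'Sci/Tech']
```

If a custom row order and custom display names are both wanted, passing `labels=` and `target_names=` in the same order keeps them aligned.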



In [9]:

test_df_vectorized = cv.transform(test_df["text"])

# String labels are reported in sorted order, so no target_names are passed (see the note in the evaluation cell)
print("Performance on the test set:")
print(classification_report(test_df["label"], lr_clf.predict(test_df_vectorized)))

Performance on the test set:
              precision    recall  f1-score   support

    Business       0.79      0.83      0.81       380
    Sci/Tech       0.83      0.82      0.82       380
      Sports       0.92      0.92      0.92       380
       World       0.89      0.84      0.87       380

    accuracy                           0.85      1520
   macro avg       0.86      0.85      0.85      1520
weighted avg       0.86      0.85      0.85      1520
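
The reports show perfect scores on the training set but roughly 0.85 to 0.87 accuracy on validation and test, which suggests the model overfits the training vocabulary. One hyperparameter that plausibly matters here is `C` in `LogisticRegression`, where smaller values mean stronger regularization. The sketch below compares a few values on a tiny made-up corpus. The helper name `sweep_C`, the toy sentences and the grid of values are all assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score


def sweep_C(X_tr, y_tr, X_va, y_va, C_values=(0.01, 0.1, 1.0, 10.0)):
    """Return {C: (train macro-F1, validation macro-F1)} for each C."""
    results = {}
    for C in C_values:
        clf = LogisticRegression(C=C, max_iter=1000, random_state=42)
        clf.fit(X_tr, y_tr)
        results[C] = (
            f1_score(y_tr, clf.predict(X_tr), average="macro", zero_division=0),
            f1_score(y_va, clf.predict(X_va), average="macro", zero_division=0),
        )
    return results


# Hypothetical toy corpus (not AG News)
toy_train_text = [
    "team wins the cup final", "striker scores late goal",
    "coach praises young players", "club signs new defender",
    "bank reports higher profits", "shares drop after earnings",
    "company cuts its forecast", "investors buy tech stocks",
]
toy_train_y = ["Sports"] * 4 + ["Business"] * 4
toy_val_text = ["team signs new striker", "bank shares drop"]
toy_val_y = ["Sports", "Business"]

toy_vec = CountVectorizer()
sweep = sweep_C(
    toy_vec.fit_transform(toy_train_text), toy_train_y,
    toy_vec.transform(toy_val_text), toy_val_y,
)
for C, (f1_tr, f1_va) in sweep.items():
    print(f"C={C:<5} train macro-F1={f1_tr:.2f}  val macro-F1={f1_va:.2f}")
```

In the notebook, the same helper could be called with `X_train_vectorized`, `y_train`, `X_val_vectorized` and `y_val`, selecting `C` on the validation set and leaving the test set for the final report only.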

