**AI & Machine Learning (KAN-CINTO4003U) - Copenhagen Business School | Spring 2025**

***


# Part I: Bag-of-Words Model

Please see the description of the assignment in the README file (section 1) <br>
**Guide notebook**: [guides/bow_guide.ipynb](guides/bow_guide.ipynb)


***

<br>

* Note that you should report results using a classification report. 

* Also, remember to include some reflections on your results: Are there any hyperparameters that are particularly important?

* You should follow the steps given in the `bow_guide` notebook

<br>

***

In [24]:
# imports for the project
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

### 1. Load the data

We can load this data directly from the Hugging Face Hub (see [Hugging Face Datasets](https://huggingface.co/docs/datasets/)) into a Pandas DataFrame. Pretty neat!

**Note**: This cell will download the dataset and keep it in memory. If you run this cell multiple times, it will download the dataset multiple times.

You are welcome to increase the `frac` parameter of `preprocess` below to use a larger sample of the data.

In [25]:

splits = {'train': 'data/train-00000-of-00001.parquet', 'test': 'data/test-00000-of-00001.parquet'}

train = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["train"])
test = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["test"])

print(train.shape, test.shape)

(120000, 2) (7600, 2)


In [26]:

label_map = {
    0: 'World',
    1: 'Sports',
    2: 'Business',
    3: 'Sci/Tech'
}

def preprocess(df: pd.DataFrame, frac : float = 1e-2, label_map : dict[int, str] = label_map, seed : int = 42) -> pd.DataFrame:
    """ Preprocess the dataset 

    Operations:
    - Map the label to the corresponding category
    - Filter out the labels not in the label_map
    - Sample a fraction of the dataset (stratified by label)

    Args:
    - df (pd.DataFrame): The dataset to preprocess
    - frac (float): The fraction of the dataset to sample in each category
    - label_map (dict): A mapping of the original label to the new label
    - seed (int): The random seed for reproducibility

    Returns:
    - pd.DataFrame: The preprocessed dataset
    """

    return  (
        df
        .assign(label=lambda x: x['label'].map(label_map))
        [lambda df: df['label'].isin(label_map.values())]
        .groupby('label')[["text", "label"]]
        .apply(lambda x: x.sample(frac=frac, random_state=seed))
        .reset_index(drop=True)

    )

train_df = preprocess(train, frac=0.01)
test_df = preprocess(test, frac=0.1)

# clear up some memory by deleting the original dataframes
# del train
# del test

train_df.shape, test_df.shape

((1200, 2), (760, 2))
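The stratified sampling step in `preprocess` can be illustrated on a toy DataFrame. This is a minimal sketch with made-up data (`toy` is hypothetical, not the AG News dataset). It uses `DataFrameGroupBy.sample`, which should give the same per-category sampling as the `groupby(...).apply(lambda x: x.sample(...))` pattern above.

```python
import pandas as pd

# Hypothetical toy data: 10 rows per category, not the AG News dataset.
toy = pd.DataFrame({
    "text": [f"doc {i}" for i in range(40)],
    "label": ["World", "Sports", "Business", "Sci/Tech"] * 10,
})

# Sample the same fraction from every category so class balance is preserved.
sampled = (
    toy
    .groupby("label")
    .sample(frac=0.5, random_state=42)
    .reset_index(drop=True)
)

counts = sampled["label"].value_counts().to_dict()
print(counts)
```

Every category contributes the same share of rows, which is why `train_df` and `test_df` end up perfectly balanced across the four labels.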

### 2.2. Split the data

The AG News dataset is already split into training and test sets. We will use Scikit-Learn's `train_test_split` to split the training set into a training and validation set. We will use the training set for training our BoW model and the validation set for evaluating the model.

In [27]:
(
    
    X_train,
    X_val,
    y_train,
    y_val

) = train_test_split(train_df["text"], train_df["label"], test_size=0.2, random_state=42)

print(X_train.shape, X_val.shape, y_train.shape, y_val.shape)

(960,) (240,) (960,) (240,)
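The split above is not stratified, so the class counts can drift slightly between the training and validation sets. If exact class balance matters, `train_test_split` accepts a `stratify` argument. The sketch below uses hypothetical toy labels rather than `train_df`.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy labels: 50 examples in each of 4 classes.
texts = [f"doc {i}" for i in range(200)]
labels = ["World", "Sports", "Business", "Sci/Tech"] * 50

# stratify=labels keeps the class proportions identical in both splits.
X_tr, X_va, y_tr, y_va = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

val_counts = {name: y_va.count(name) for name in sorted(set(labels))}
print(val_counts)
```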


### 2.3. Build the BoW model

All the work we did in section 1 is done by `CountVectorizer`. We use its `fit_transform` method to build the BoW model on the training set, and its `transform` method to convert the validation set into a BoW representation.

**Note**: We will __not__ fit our `CountVectorizer` on the validation set. We only fit it on the training set. This is important to avoid data leakage.
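To make the difference between `fit_transform` and `transform` concrete, here is a small sketch on a hypothetical toy corpus (the texts are made up). Words that appear only in the validation text are simply ignored, because the vocabulary is fixed once the vectorizer has been fitted on the training text.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpora standing in for the training and validation texts.
train_texts = ["stocks rise on earnings", "team wins the final"]
val_texts = ["stocks fall as team loses"]

vectorizer = CountVectorizer()
train_bow = vectorizer.fit_transform(train_texts)  # learns the vocabulary
val_bow = vectorizer.transform(val_texts)          # reuses it, learns nothing new

vocab = sorted(vectorizer.vocabulary_)
print(vocab)
print(val_bow.toarray())
```

The validation matrix has exactly as many columns as the training matrix, and only the words shared with the training vocabulary ("stocks" and "team") receive non-zero counts.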

In [28]:
# countvectorizer
cv = CountVectorizer()
X_train_vectorized = cv.fit_transform(X_train)

Observe the shape of the BoW representation. The number of columns in the BoW representation corresponds to the number of unique words in the training set. This is the size of our vocabulary.

Note that `CountVectorizer` has a fairly aggressive default tokenizer. It lowercases all text and strips punctuation, as we did, and it also drops every token shorter than two characters (its default `token_pattern` is `(?u)\b\w\w+\b`). Numbers with two or more digits are kept, since digits count as word characters. We can change this behavior by passing a custom `token_pattern` or `tokenizer` to `CountVectorizer`, if we so desire.
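The default tokenization can be checked directly with `build_analyzer()`. The sketch below compares the default analyzer with a letters-only `token_pattern`; the sample sentence and the custom pattern are illustrative choices, not part of the assignment.

```python
from sklearn.feature_extraction.text import CountVectorizer

sample = "Hello, World! I paid 42 dollars in 2024."

# Default analyzer: lowercase, then keep runs of 2+ word characters.
default_tokens = CountVectorizer().build_analyzer()(sample)

# Letters-only pattern (an illustrative choice, not part of the assignment).
letters_only = CountVectorizer(token_pattern=r"(?u)\b[a-z]{2,}\b")
letter_tokens = letters_only.build_analyzer()(sample)

print(default_tokens)
print(letter_tokens)
```

With the default settings the single-character token "I" disappears, while "42" and "2024" survive; the letters-only pattern removes the numbers as well.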

In [29]:
X_train_vectorized.todense()

matrix([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]])

### 2.4. Create a classifier

Below we create a simple classifier using Scikit-Learn's [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). We train the classifier on the BoW representation of the training set and evaluate it on the BoW representation of the validation set.

In [30]:
lr_clf = LogisticRegression() # Note that we can set hyperparameters here

lr_clf.fit(X_train_vectorized, y_train)

### 2.5. Get predictions

In [31]:
X_val_vectorized = cv.transform(X_val) # note that we use transform here, not fit_transform

y_pred = lr_clf.predict(X_val_vectorized)

### 2.6. Evaluate BoW model

In [32]:

print("Performance on the training set:")
# The labels are strings, so classification_report orders rows by sorted label
# (Business, Sci/Tech, Sports, World). Passing target_names=label_map.values()
# would rename those rows in dict order and mislabel them, so it is omitted.
print(classification_report(y_train, lr_clf.predict(X_train_vectorized)))

print("Performance on the validation set:")
print(classification_report(y_val, y_pred))



Performance on the training set:
              precision    recall  f1-score   support

    Business       1.00      1.00      1.00       238
    Sci/Tech       1.00      1.00      1.00       240
      Sports       1.00      1.00      1.00       240
       World       1.00      1.00      1.00       242

    accuracy                           1.00       960
   macro avg       1.00      1.00      1.00       960
weighted avg       1.00      1.00      1.00       960

Performance on the validation set:
              precision    recall  f1-score   support

    Business       0.76      0.68      0.72        62
    Sci/Tech       0.69      0.60      0.64        60
      Sports       0.79      0.87      0.83        60
       World       0.78      0.90      0.83        58

    accuracy                           0.76       240
   macro avg       0.75      0.76      0.75       240
weighted avg       0.75      0.76      0.75       240
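Since the labels here are strings, it is worth knowing how `classification_report` orders its rows: it sorts the labels, and `target_names` only renames the rows in that sorted order without reordering the scores. The sketch below uses made-up predictions to show the default order and how the `labels` argument sets an explicit one.

```python
from sklearn.metrics import classification_report

# Hypothetical predictions, chosen only to make the rows distinguishable.
y_true = ["World", "World", "Sports", "Business", "Sci/Tech", "Sci/Tech"]
y_hat = ["World", "Sports", "Sports", "Business", "Sci/Tech", "World"]

# Default: rows follow the sorted labels (Business, Sci/Tech, Sports, World).
default_report = classification_report(
    y_true, y_hat, output_dict=True, zero_division=0
)

# labels= fixes the row order explicitly and keeps names and scores aligned.
order = ["World", "Sports", "Business", "Sci/Tech"]
ordered_report = classification_report(
    y_true, y_hat, labels=order, output_dict=True, zero_division=0
)

print(list(default_report)[:4])
print(list(ordered_report)[:4])
```

Passing `labels=` is the safer way to get a custom row order, because the scores move together with the names.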



In [33]:
test_df_vectorized = cv.transform(test_df["text"])

print("Performance on the test set:")
# No target_names: rows follow the sorted string labels, so names and scores stay aligned.
print(classification_report(test_df["label"], lr_clf.predict(test_df_vectorized)))

Performance on the test set:
              precision    recall  f1-score   support

    Business       0.74      0.72      0.73       190
    Sci/Tech       0.75      0.72      0.73       190
      Sports       0.83      0.88      0.86       190
       World       0.79      0.79      0.79       190

    accuracy                           0.78       760
   macro avg       0.78      0.78      0.78       760
weighted avg       0.78      0.78      0.78       760



### 3. Trying new parameters
Using the same pre-processing steps as above

In [83]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

# Define the model
lr_clf = LogisticRegression()

# Define the parameter grid
param_grid = [
    {'C': [0.01, 0.1, 1, 10], 'penalty': ['l2'], 'solver': ['lbfgs'], 'max_iter': [100, 200, 300]},
    {'C': [0.01, 0.1, 1, 10], 'penalty': ['l1', 'l2'], 'solver': ['liblinear'], 'max_iter': [100, 200, 300]}
]

# Perform Grid Search
grid_search = GridSearchCV(lr_clf, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train_vectorized, y_train)

# Print the best parameters and best score
print("Best parameters found: ", grid_search.best_params_)
print("Best cross-validation score: ", grid_search.best_score_)

# Use the best estimator to make predictions
best_lr_clf = grid_search.best_estimator_

# Predictions on the validation set
y_val_pred = best_lr_clf.predict(X_val_vectorized)
print("Performance on the validation set:")
# No target_names: rows follow the sorted string labels, so names and scores stay aligned.
print(classification_report(y_val, y_val_pred))

# Predictions on the train set
y_train_pred = best_lr_clf.predict(X_train_vectorized)
print("Performance on the train set:")
print(classification_report(y_train, y_train_pred))




Best parameters found:  {'C': 10, 'max_iter': 100, 'penalty': 'l2', 'solver': 'liblinear'}
Best cross-validation score:  0.8072916666666666
Performance on the validation set:
              precision    recall  f1-score   support

    Business       0.76      0.71      0.73        62
    Sci/Tech       0.71      0.58      0.64        60
      Sports       0.79      0.87      0.83        60
       World       0.75      0.86      0.80        58

    accuracy                           0.75       240
   macro avg       0.75      0.76      0.75       240
weighted avg       0.75      0.75      0.75       240

Performance on the train set:
              precision    recall  f1-score   support

    Business       1.00      1.00      1.00       238
    Sci/Tech       1.00      1.00      1.00       240
      Sports       1.00      1.00      1.00       240
       World       1.00      1.00      1.00       242

    accuracy                           1.00       960
   macro avg       1.00      1.00 

### 3.1 Test set performance

In [81]:
# Predictions on the test set
test_df_vectorized = cv.transform(test_df["text"])
y_test_pred = best_lr_clf.predict(test_df_vectorized)
print("Performance on the test set:")
# No target_names: rows follow the sorted string labels, so names and scores stay aligned.
print(classification_report(test_df["label"], y_test_pred))

Performance on the test set:
              precision    recall  f1-score   support

    Business       0.73      0.74      0.73       190
    Sci/Tech       0.76      0.73      0.74       190
      Sports       0.85      0.89      0.87       190
       World       0.82      0.81      0.81       190

    accuracy                           0.79       760
   macro avg       0.79      0.79      0.79       760
weighted avg       0.79      0.79      0.79       760



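As the reports above show, using grid search to find the best hyperparameters gives only a slight advantage in final model performance: test accuracy rises from 0.78 to 0.79, while validation accuracy moves from 0.76 to 0.75. The selected configuration keeps the default `l2` penalty but uses weaker regularization (`C=10`) and the `liblinear` solver. The trade-off is longer training time.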
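The grid search above tunes only the classifier, but the vectorizer has hyperparameters too. One possible way to search over both is to wrap them in a `Pipeline`, which also refits the vectorizer inside each cross-validation fold. The sketch below is an illustration under assumptions: the corpus is a hypothetical two-class toy set, and the parameter values are arbitrary examples rather than recommendations.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus with two classes; swap in X_train / y_train in practice.
texts = [
    "stocks rally as markets open", "shares fall on weak earnings",
    "bank profits beat forecasts", "oil prices lift energy stocks",
    "investors eye interest rates", "company reports record revenue",
    "team wins the cup final", "striker scores late goal",
    "coach praises young players", "club signs new goalkeeper",
    "fans celebrate league title", "injury rules out star player",
]
labels = ["Business"] * 6 + ["Sports"] * 6

pipe = Pipeline([
    ("bow", CountVectorizer()),
    ("clf", LogisticRegression(max_iter=300)),
])

# Step-name prefixes route each parameter to the right pipeline stage.
grid = {
    "bow__ngram_range": [(1, 1), (1, 2)],
    "bow__binary": [False, True],
    "clf__C": [0.1, 1, 10],
}

search = GridSearchCV(pipe, grid, cv=3, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
```

Because the pipeline takes raw text as input, `search.predict` can be called directly on `X_val` or `test_df["text"]` without a separate `transform` step.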