**AI & Machine Learning (KAN-CINTO4003U) - Copenhagen Business School | Spring 2025**

***


# Part I: Bag-of-Words Model

Please see the description of the assignment in the README file (section 1) <br>
**Guide notebook**: [guides/bow_guide.ipynb](guides/bow_guide.ipynb)


***

<br>

* Note that you should report results using a classification report. 

* Also, remember to include some reflections on your results: Are there any hyperparameters that are particularly important?

* You should follow the steps given in the `bow_guide` notebook

<br>

***

In [1]:
# imports for the project

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS 
import string

### 1. Load the data

We can load this data directly from the Hugging Face Hub (see [Hugging Face Datasets](https://huggingface.co/docs/datasets/)) into a Pandas DataFrame. Pretty neat!

**Note**: This cell will download the dataset and keep it in memory. If you run this cell multiple times, it will download the dataset multiple times.

You are welcome to increase the `frac` parameter to load more data.

In [2]:

splits = {'train': 'data/train-00000-of-00001.parquet', 'test': 'data/test-00000-of-00001.parquet'}

train = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["train"])
test = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["test"])

print(train.shape, test.shape)

(120000, 2) (7600, 2)


In [3]:

label_map = {
    0: 'World',
    1: 'Sports',
    2: 'Business',
    3: 'Sci/Tech'
}

def preprocess(df: pd.DataFrame, frac: float = 4e-2, label_map: dict[int, str] = label_map, seed: int = 42) -> pd.DataFrame:
    """ Preprocess the dataset 

    Operations:
    - Map the label to the corresponding category
    - Filter out the labels not in the label_map
    - Sample a fraction of the dataset (stratified by label)

    Args:
    - df (pd.DataFrame): The dataset to preprocess
    - frac (float): The fraction of the dataset to sample in each category
    - label_map (dict): A mapping of the original label to the new label
    - seed (int): The random seed for reproducibility

    Returns:
    - pd.DataFrame: The preprocessed dataset
    """

    return (
        df
        .assign(label=lambda x: x['label'].map(label_map))
        [lambda df: df['label'].isin(label_map.values())]
        .groupby('label')[["text", "label"]]
        .apply(lambda x: x.sample(frac=frac, random_state=seed))
        .reset_index(drop=True)
    )

train_df = preprocess(train, frac=0.1)
test_df = preprocess(test, frac=0.1)

# clear up some memory by deleting the original dataframes
del train
del test

train_df.shape, test_df.shape






((12000, 2), (760, 2))

In [4]:
# lower-case the text and labels and strip punctuation (same steps for train and test)
train_df['text'] = train_df['text'].str.lower()
train_df['text'] = train_df['text'].str.replace(f"[{string.punctuation}]", "", regex=True)
train_df['label'] = train_df['label'].str.lower()
test_df['text'] = test_df['text'].str.lower()
test_df['text'] = test_df['text'].str.replace(f"[{string.punctuation}]", "", regex=True)
test_df['label'] = test_df['label'].str.lower()

train_df

,text,label
0,us house sales fall in july sales of nonnew ho...,business
1,dj to acquire marketwatch dow jones amp co pu...,business
2,dollar hits new low on snow speech the united ...,business
3,yukos executives flee russia all the top execu...,business
4,smithfield quarterly profits beat outlook smit...,business
...,...,...
11995,ivory coast to pull troops back ivory coast fo...,world
11996,a toll on palestinians israeli army encounters...,world
11997,unit killed iraqi civilians marine a former us...,world
11998,thousands registered to vote in 2 statesreport...,world


In [5]:
train_df['tokens'] = train_df['text'].str.split()

train_df['tokens']

0        [us, house, sales, fall, in, july, sales, of, ...
1        [dj, to, acquire, marketwatch, dow, jones, amp...
2        [dollar, hits, new, low, on, snow, speech, the...
3        [yukos, executives, flee, russia, all, the, to...
4        [smithfield, quarterly, profits, beat, outlook...
                               ...                        
11995    [ivory, coast, to, pull, troops, back, ivory, ...
11996    [a, toll, on, palestinians, israeli, army, enc...
11997    [unit, killed, iraqi, civilians, marine, a, fo...
11998    [thousands, registered, to, vote, in, 2, state...
11999    [un, approves, oil, sanction, on, sudan, the, ...
Name: tokens, Length: 12000, dtype: object

### 2. Split the data

In [6]:
(
    
    X_train,
    X_val,
    y_train,
    y_val

) = train_test_split(train_df["text"], train_df["label"], test_size=0.2, random_state=42)

print(X_train.shape, X_val.shape, y_train.shape, y_val.shape)

(9600,) (2400,) (9600,) (2400,)


### 3. Build the BoW Model

In [7]:
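# Build the vocabulary on the training text only and count word occurrences,
# dropping scikit-learn's built-in English stop words

cv = CountVectorizer(stop_words='english')
X_train_vectorized = cv.fit_transform(X_train)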

In [8]:
vocab = cv.vocabulary_
print("Vocabulary:\n", vocab)

Vocabulary:


In [9]:
word_counts = X_train_vectorized.toarray().sum(axis=0)
word_freq = pd.DataFrame(list(vocab.items()), columns=['Word', 'Index'])
word_freq['Count'] = word_freq['Index'].apply(lambda i: word_counts[i])

# Sort and get the top 10 words
top_10_words = word_freq.sort_values(by='Count', ascending=False).head(10)
print("\nTop 10 Words and Their Counts:\n", top_10_words[['Word', 'Count']])


Top 10 Words and Their Counts:
          Word  Count
1         39s   2520
10        new   1721
57       said   1641
42    reuters   1341
338        ap   1314
106     world    638
185    monday    634
718       oil    594
228   tuesday    586
92   thursday    581


In [10]:
X_train_vectorized.todense()

matrix([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

### 4. Create a Classifier

In [48]:

# Note that we can set hyperparameters here
lr_clf = LogisticRegression(
    penalty='l2', dual=False, tol=0.0001, C=0.8,
    class_weight=None, solver='liblinear', max_iter=1000,
)

lr_clf.fit(X_train_vectorized, y_train)

### 5. Get Predictions

In [49]:
X_val_vectorized = cv.transform(X_val) # note that we use transform here, not fit_transform

y_pred = lr_clf.predict(X_val_vectorized)

### 6. Evaluate BoW Model

In [50]:
print("Performance on the training set:")
# The labels are lower-cased strings, which classification_report lists in sorted order.
# Passing target_names=label_map.values() would attach the wrong names to the rows,
# so the report uses the labels themselves.
print(classification_report(y_train, lr_clf.predict(X_train_vectorized)))

print("Performance on the validation set:")
print(classification_report(y_val, y_pred))

Performance on the training set:
              precision    recall  f1-score   support

    business       1.00      0.99      1.00      2388
    sci/tech       0.99      1.00      1.00      2376
      sports       1.00      1.00      1.00      2432
       world       1.00      1.00      1.00      2404

    accuracy                           1.00      9600
   macro avg       1.00      1.00      1.00      9600
weighted avg       1.00      1.00      1.00      9600

Performance on the validation set:
              precision    recall  f1-score   support

    business       0.86      0.83      0.84       612
    sci/tech       0.84      0.87      0.85       624
      sports       0.93      0.97      0.95       568
       world       0.90      0.87      0.88       596

    accuracy                           0.88      2400
   macro avg       0.88      0.88      0.88      2400
weighted avg       0.88      0.88      0.88      2400



In [51]:
test_df_vectorized = cv.transform(test_df["text"])

print("Performance on the test set:")
# The labels are lower-cased strings listed in sorted order, so no target_names are passed
print(classification_report(test_df["label"], lr_clf.predict(test_df_vectorized)))

Performance on the test set:
              precision    recall  f1-score   support

    business       0.84      0.87      0.86       190
    sci/tech       0.91      0.85      0.88       190
      sports       0.94      0.96      0.95       190
       world       0.90      0.90      0.90       190

    accuracy                           0.90       760
   macro avg       0.90      0.90      0.90       760
weighted avg       0.90      0.90      0.90       760



### 7. Reflections

The most important hyperparameter for getting the model to run at all is `max_iter`. With the default of 100 iterations the solver can stop before it has converged, and the logistic regression does not perform.

Beyond that, the parameters that made a difference when optimizing the logistic regression were `penalty`, `tol`, `C` and `solver`.

The penalty selects the type of regularization used to prevent overfitting. The tolerance sets the stopping criterion for the optimization algorithm.
`C` is the inverse of the regularization strength: the smaller `C` is, the stronger the regularization, which helps prevent overfitting.
The solver determines the optimization algorithm, and which solvers can be used depends on the chosen penalty.
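
One way to compare such settings systematically is a small grid search. The sketch below is only an illustration under stated assumptions: it uses synthetic binary data from `make_classification` as a stand-in for `X_train_vectorized` and `y_train`, and the grid values and the `*_demo` names are chosen arbitrarily rather than taken from the experiments above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption); in this notebook the inputs
# would be X_train_vectorized and y_train
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=42)

# Arbitrary example grid: smaller C means stronger regularization
param_grid = {
    "C": [0.01, 0.1, 0.8, 10],
    "solver": ["liblinear", "lbfgs"],
}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=3,                # 3-fold cross-validation on the training data
    scoring="f1_macro",  # same kind of score the classification report shows
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Because the search cross-validates on the training data only, the validation and test sets stay untouched for the final evaluation.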

When the labels of the training set are converted to lower case, the labels of the test set have to be lower-cased as well. Otherwise the predicted and true labels never match, and the evaluation sees 8 different labels instead of 4.
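
The effect can be reproduced with a few made-up labels. This is a minimal sketch, not the actual notebook data: the two lists and the helper names are hypothetical.

```python
from sklearn.metrics import classification_report

# Hypothetical labels: predictions are lower-case, true labels kept their capitalization
y_true_mixed = ["World", "Sports", "Business", "Sci/Tech"]
y_pred_lower = ["world", "sports", "business", "sci/tech"]

# Keys of the report dict that are summaries rather than classes
summary_keys = {"accuracy", "micro avg", "macro avg", "weighted avg"}

report_mixed = classification_report(
    y_true_mixed, y_pred_lower, output_dict=True, zero_division=0
)
classes_mixed = sorted(k for k in report_mixed if k not in summary_keys)

# Lower-casing the true labels restores the four intended classes
y_true_lower = [label.lower() for label in y_true_mixed]
report_lower = classification_report(
    y_true_lower, y_pred_lower, output_dict=True, zero_division=0
)
classes_lower = sorted(k for k in report_lower if k not in summary_keys)

print(len(classes_mixed), len(classes_lower))  # prints: 8 4
```

The sorted class list also shows the order in which `classification_report` prints its rows, which matters whenever `target_names` is supplied by hand.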