**AI & Machine Learning (KAN-CINTO4003U) - Copenhagen Business School | Spring 2025**

***


# Part III: LLM

Please see the description of the assignment in the README file (section 3) <br>
**Guide notebook**: [guides/llm_guide.ipynb](guides/llm_guide.ipynb)


***

<br>

* Note that you should report results using a classification report. 

* Also, remember to include some reflections on your results: how do they compare with the results from Part I (BoW) and Part II (BERT)? Are there any hyperparameters or prompting techniques that are particularly important?

* You should follow the steps given in the `llm_guide` notebook.

<br>


***

In [2]:
# imports for the project
from sklearn.metrics import classification_report 
import pandas as pd


### 1. Load the data

We can load this data directly from the Hugging Face Hub (see [Hugging Face Datasets](https://huggingface.co/docs/datasets/)) into a pandas DataFrame. Pretty neat!

**Note**: This cell will download the dataset and keep it in memory. If you run this cell multiple times, it will download the dataset multiple times.

You are welcome to increase the `frac` parameter of `preprocess` to sample a larger share of each class.

In [14]:

splits = {'train': 'data/train-00000-of-00001.parquet', 'test': 'data/test-00000-of-00001.parquet'}
# train = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["train"])
test = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["test"])

In [15]:
label_map = {
    0: 'World',
    1: 'Sports',
    2: 'Business',
    3: 'Sci/Tech'
}

def preprocess(df: pd.DataFrame, frac = 0.4, label_map = label_map, seed=42) -> pd.DataFrame:
    return  (
        df
        .assign(label=lambda x: x['label'].map(label_map))
        [lambda df: df['label'].isin(label_map.values())]
        .groupby('label')
        .apply(lambda x: x.sample(frac=frac, random_state=seed))
        .reset_index(drop=True)

    )

# train_df = preprocess(train, frac=0.01)
test_df = preprocess(test, frac=0.1)

# clear up some memory by deleting the original dataframes
# del train
del test

test_df.shape, # train_df.shape, 

((760, 2),)
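
The `preprocess` function draws the same fraction of rows from every label, which is why `frac=0.1` yields 760 rows here (190 per class in the reports further down). The sketch below illustrates the same idea on made-up toy data. It uses `groupby(...).sample(...)` as a possible alternative to `groupby(...).apply(...)`; the helper name `stratified_sample` and the toy DataFrame are assumptions for illustration, and this variant may select different rows than `preprocess` for the same seed.

```python
import pandas as pd

# Toy stand-in for the test split: 4 labels with 20 rows each (made-up data).
labels = ["World", "Sports", "Business", "Sci/Tech"]
toy = pd.DataFrame({
    "text": [f"story {i}" for i in range(80)],
    "label": [labels[i % 4] for i in range(80)],
})


def stratified_sample(df: pd.DataFrame, frac: float, seed: int = 42) -> pd.DataFrame:
    """Sample the same fraction of rows from every label (hypothetical helper)."""
    return (
        df.groupby("label")
        .sample(frac=frac, random_state=seed)
        .reset_index(drop=True)
    )


sample = stratified_sample(toy, frac=0.1)

# Count how many rows each label contributed to the sample.
counts = sample["label"].value_counts().to_dict()
print(counts)
```

Checking `value_counts()` on the real `test_df` in the same way is a quick way to confirm that the sample stays balanced after changing `frac`.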

In [19]:

# Setup watsonx
from decouple import Config
from ibm_watsonx_ai import APIClient
from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models import ModelInference

import os
from pathlib import Path

# Ensure the working directory is set to the "ma2" folder.
while Path.cwd().name != "ma2" and "ma2" in str(Path.cwd()):
    os.chdir("..")  # Move up one directory
print(f"Working directory set to: {Path.cwd()}")

config = Config('.env')
WX_API_KEY = config('WX_API_KEY')

credentials = Credentials(
                url = "https://us-south.ml.cloud.ibm.com",
                api_key = WX_API_KEY
                )

client = APIClient(
                credentials=credentials, 
                project_id="163839a7-17ed-4d45-8690-531423735ab8"
                )

model = ModelInference(
    api_client=client,
    model_id="ibm/granite-13b-instruct-v2",
)


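# Baseline prompt. Note that it asks for "five categories" although the data has four;
# the later prompts correct this.
SYSTEM_PROMPT = """You task is to classify news stories into one of five categories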

CATEGORIES:
{categories}

TEXT:
{text}

Please assign the correct category to the text. Answer with the correct category and nothing else.

Category:
"""

CATEGORIES = "- " + "\n- ".join(test_df["label"].unique())  # Create a string with all categories

predictions = []

from tqdm import tqdm

for text in tqdm(test_df["text"]):

    # format the prompt with the categories and the text
    prompt = SYSTEM_PROMPT.format(categories=CATEGORIES, text=text)
    
    # generate the response from the model
    response = model.generate(prompt)

    # extract the generated text from the response
    prediction = response["results"][0]["generated_text"].strip()

    # append the prediction to the list of predictions
    predictions.append(prediction)
    
    
print(classification_report(test_df.label, predictions))

Working directory set to: /Users/mikkel/Library/CloudStorage/OneDrive-Personligt/CBS/Cand Merc IT/2. Semester/AI and ML/mas/ma2


100%|██████████| 760/760 [04:34<00:00,  2.77it/s]

              precision    recall  f1-score   support

    Business       0.54      0.91      0.68       190
    Sci/Tech       0.89      0.35      0.50       190
      Sports       0.96      0.91      0.94       190
       World       0.80      0.78      0.79       190

    accuracy                           0.74       760
   macro avg       0.80      0.74      0.73       760
weighted avg       0.80      0.74      0.73       760
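
The report above lists only the four expected labels, so the model appears to have answered with valid category names in this run. Nothing in the loop guarantees that, though: a generated answer such as `Category: Sci/Tech.` would show up as an extra class in the report. The sketch below shows one possible way to map raw output onto the known labels. `normalize_prediction`, `VALID_LABELS`, and the `"Unknown"` fallback are hypothetical names, not part of the guide notebook.

```python
VALID_LABELS = ["World", "Sports", "Business", "Sci/Tech"]


def normalize_prediction(raw: str, valid_labels=VALID_LABELS, fallback: str = "Unknown") -> str:
    """Map raw generated text to one of the valid labels (hypothetical helper)."""
    cleaned = raw.strip().lower()

    # 1) Exact match, ignoring case and surrounding whitespace.
    for label in valid_labels:
        if cleaned == label.lower():
            return label

    # 2) Otherwise pick the valid label that appears earliest in the text.
    hits = [
        (cleaned.find(label.lower()), label)
        for label in valid_labels
        if label.lower() in cleaned
    ]
    if hits:
        return min(hits)[1]

    # 3) No label found: return the fallback so the miss stays visible.
    return fallback


examples = [" sports\n", "Category: Sci/Tech.", "I am not sure"]
print([normalize_prediction(e) for e in examples])
```

Keeping a separate fallback value, rather than forcing a guess, means unparseable answers appear as their own row in `classification_report` and can be counted.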






In [30]:
# Test with improved prompt
SYSTEM_PROMPT = """You are an expert news classifier. Your task is to classify news stories into one of four categories.

4 CATEGORIES:
{categories}

CATEGORY DESCRIPTIONS:
- World: International news, politics, diplomacy, global events, wars, peace talks, foreign relations
- Sports: Athletic competitions, games, players, teams, tournaments, sports business
- Business: Corporate news, finance, markets, mergers, acquisitions, economic data
- Sci/Tech: Scientific discoveries, technology innovations, research, space exploration, gadgets

TEXT TO CLASSIFY:
{text}

Please assign the correct category to the text. Answer with the correct category and nothing else. Remember, there are only four categories.


CATEGORY:
"""

# Reset predictions list for new results
predictions = []

for text in tqdm(test_df["text"]):

    # format the prompt with the categories and the text
    prompt = SYSTEM_PROMPT.format(categories=CATEGORIES, text=text)
    
    # generate the response from the model
    response = model.generate(prompt)

    # extract the generated text from the response
    prediction = response["results"][0]["generated_text"].strip()

    # append the prediction to the list of predictions
    predictions.append(prediction)
    
    
print(classification_report(test_df.label, predictions))

100%|██████████| 760/760 [04:06<00:00,  3.09it/s]

              precision    recall  f1-score   support

    Business       0.52      0.93      0.67       190
    Sci/Tech       0.93      0.29      0.45       190
      Sports       0.95      0.86      0.91       190
       World       0.81      0.81      0.81       190

    accuracy                           0.72       760
   macro avg       0.80      0.72      0.71       760
weighted avg       0.80      0.72      0.71       760
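
The classification loop is repeated verbatim for every prompt variant. A small helper could wrap it so that each experiment only supplies a model and a prompt template. The sketch below assumes the response shape used in the cells above (`response["results"][0]["generated_text"]`); `classify_texts` and the offline stand-in `EchoModel` are hypothetical names and are not part of `ibm_watsonx_ai`.

```python
from typing import Iterable, List


def classify_texts(model, prompt_template: str, texts: Iterable[str], categories: str) -> List[str]:
    """Run one prompt template over all texts and collect the stripped predictions."""
    predictions = []
    for text in texts:
        # Fill the template with the category list and the story to classify.
        prompt = prompt_template.format(categories=categories, text=text)
        response = model.generate(prompt)
        predictions.append(response["results"][0]["generated_text"].strip())
    return predictions


class EchoModel:
    """Stand-in model for offline testing; NOT part of ibm_watsonx_ai."""

    def generate(self, prompt: str) -> dict:
        # Always answers "Sports", mimicking the response shape used above.
        return {"results": [{"generated_text": " Sports\n"}]}


template = "CATEGORIES:\n{categories}\n\nTEXT:\n{text}\n\nCategory:\n"
preds = classify_texts(EchoModel(), template, ["Team wins cup", "Stocks fall"], "- Sports\n- Business")
print(preds)  # ['Sports', 'Sports']
```

With the real `model` object, a call such as `classify_texts(model, SYSTEM_PROMPT, test_df["text"], CATEGORIES)` would replace the loop in each cell.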






In [33]:
# Test with descriptions kept only for World and Business (Sports and Sci/Tech left undescribed)
SYSTEM_PROMPT = """You are an expert news classifier. Your task is to classify news stories into one of four categories.

4 CATEGORIES:
{categories}

CATEGORY DESCRIPTIONS:
- World: International news, politics, diplomacy, global events, wars, peace talks, foreign relations
- Sports
- Business: Corporate news, finance, markets, mergers, acquisitions, economic data
- Sci/Tech

TEXT TO CLASSIFY:
{text}

Please assign the correct category to the text. Answer with the correct category and nothing else. Remember, there are only four categories.


CATEGORY:
"""

# Reset predictions list for new results
predictions = []

for text in tqdm(test_df["text"]):

    # format the prompt with the categories and the text
    prompt = SYSTEM_PROMPT.format(categories=CATEGORIES, text=text)
    
    # generate the response from the model
    response = model.generate(prompt)

    # extract the generated text from the response
    prediction = response["results"][0]["generated_text"].strip()

    # append the prediction to the list of predictions
    predictions.append(prediction)
    
    
print(classification_report(test_df.label, predictions))

100%|██████████| 760/760 [04:34<00:00,  2.77it/s]

              precision    recall  f1-score   support

    Business       0.57      0.89      0.70       190
    Sci/Tech       0.90      0.38      0.54       190
      Sports       0.95      0.93      0.94       190
       World       0.80      0.83      0.82       190

    accuracy                           0.76       760
   macro avg       0.81      0.76      0.75       760
weighted avg       0.81      0.76      0.75       760






In [35]:
# Test the same prompt with a different model (Mistral Large)
model = ModelInference(
    api_client=client,
    model_id="mistralai/mistral-large",
)

SYSTEM_PROMPT = """You are an expert news classifier. Your task is to classify news stories into one of four categories.

4 CATEGORIES:
{categories}

CATEGORY DESCRIPTIONS:
- World: International news, politics, diplomacy, global events, wars, peace talks, foreign relations
- Sports
- Business: Corporate news, finance, markets, mergers, acquisitions, economic data
- Sci/Tech

TEXT TO CLASSIFY:
{text}

Please assign the correct category to the text. Answer with the correct category and nothing else. Remember, there are only four categories.


CATEGORY:
"""

# Reset predictions list for new results
predictions = []

for text in tqdm(test_df["text"]):

    # format the prompt with the categories and the text
    prompt = SYSTEM_PROMPT.format(categories=CATEGORIES, text=text)
    
    # generate the response from the model
    response = model.generate(prompt)

    # extract the generated text from the response
    prediction = response["results"][0]["generated_text"].strip()

    # append the prediction to the list of predictions
    predictions.append(prediction)
    
    
print(classification_report(test_df.label, predictions))

100%|██████████| 760/760 [04:39<00:00,  2.72it/s]

              precision    recall  f1-score   support

    Business       0.70      0.93      0.80       190
    Sci/Tech       0.93      0.61      0.73       190
      Sports       0.97      0.98      0.98       190
       World       0.90      0.91      0.90       190

    accuracy                           0.86       760
   macro avg       0.87      0.86      0.85       760
weighted avg       0.87      0.86      0.85       760






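By switching to Mistral Large (`mistralai/mistral-large`), the best-rated model for classification according to IBM's metrics, I achieved the best results of all four runs: accuracy rose to 0.86 and macro F1 to 0.85, compared with 0.76 and 0.75 for the best Granite run. Mistral Large matched or exceeded every Granite run on each per-class precision, recall, and F1 score. Sci/Tech recall (0.61) and Business precision (0.70) remain the weakest scores, as they were in the Granite runs.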
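
To make the comparison easier to read, the headline numbers from the four classification reports can be placed side by side. The sketch below simply copies accuracy and macro-averaged F1 from the outputs above; the run descriptions are my own shorthand for the four cells.

```python
# Accuracy and macro-averaged F1 copied from the four classification reports above.
runs = [
    ("granite-13b-instruct-v2, baseline prompt", 0.74, 0.73),
    ("granite-13b-instruct-v2, all categories described", 0.72, 0.71),
    ("granite-13b-instruct-v2, World and Business described", 0.76, 0.75),
    ("mistral-large, World and Business described", 0.86, 0.85),
]

# Print a small aligned table.
print(f"{'run':<55}{'accuracy':>10}{'macro F1':>10}")
for name, acc, f1 in runs:
    print(f"{name:<55}{acc:>10.2f}{f1:>10.2f}")

# Best run overall, and its macro F1 gain over the best Granite run.
best = max(runs, key=lambda r: r[2])
best_granite_f1 = max(f1 for name, _, f1 in runs if name.startswith("granite"))
gain = best[2] - best_granite_f1
print(f"Best macro F1: {best[0]} ({best[2]:.2f}), gain over best Granite run: {gain:.2f}")
```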