**AI & Machine Learning (KAN-CINTO4003U) - Copenhagen Business School | Spring 2025**

***


# Part III: LLM

Please see the description of the assignment in the README file (section 3) <br>
**Guide notebook**: [guides/llm_guide.ipynb](guides/llm_guide.ipynb)


***

<br>

* Note that you should report results using a classification report. 

* Also, remember to include some reflections on your results: how do they compare with the results from Part I (BoW) and Part II (BERT)? Are there any hyperparameters or prompting techniques that are particularly important?

* You should follow the steps given in the `llm_guide` notebook

<br>


***

In [6]:
# imports for the project

import pandas as pd
from sklearn.metrics import classification_report 
from tqdm import tqdm
from decouple import config
from ibm_watsonx_ai import APIClient
from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models import ModelInference
from ibm_watsonx_ai.foundation_models.schema import TextGenParameters

### 1. Load the data

We can load this data directly from the Hugging Face Hub (see [Hugging Face Datasets](https://huggingface.co/docs/datasets/)) into a pandas DataFrame. Pretty neat!

**Note**: This cell will download the dataset and keep it in memory. If you run this cell multiple times, it will download the dataset multiple times.

You are welcome to increase the `frac` parameter to load more data.

In [50]:

splits = {'train': 'data/train-00000-of-00001.parquet', 'test': 'data/test-00000-of-00001.parquet'}
# train = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["train"])
test = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["test"])

In [51]:
label_map = {
    0: 'World',
    1: 'Sports',
    2: 'Business',
    3: 'Sci/Tech'
}

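def preprocess(df: pd.DataFrame, frac: float = 1e-2, label_map: dict = label_map, seed: int = 42) -> pd.DataFrame:
    """Map integer labels to category names and sample `frac` of the rows from each category."""
    return (
        df
        .assign(label=lambda x: x['label'].map(label_map))        # integer id -> category name
        [lambda x: x['label'].isin(label_map.values())]           # keep only rows with a known category
        .groupby('label')
        .apply(lambda x: x.sample(frac=frac, random_state=seed))  # same fraction from every category
        .reset_index(drop=True)
    )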

# train_df = preprocess(train, frac=0.01)
test_df = preprocess(test, frac=0.1)

# clear up some memory by deleting the original dataframes
# del train
del test

test_df.shape, # train_df.shape, 

((760, 2),)

In [52]:
WX_API_KEY = config('WX_API_KEY')

credentials = Credentials(
    url = "https://us-south.ml.cloud.ibm.com",
    api_key = WX_API_KEY
)

client = APIClient(
    credentials=credentials, 
    project_id="41afd1e1-f579-4ce0-bf97-2c29ed058f5c"
)

In [53]:
PARAMS = TextGenParameters(
    temperature=0,              # Higher temperature means more randomness - In this case we don't want randomness
    max_new_tokens=10,          # Maximum number of tokens to generate
    stop_sequences=[".", "\n"], # Stop generating text when these sequences are encountered
)

model = ModelInference(
    api_client=client,
    model_id="ibm/granite-13b-instruct-v2",  # We could also try a larger model!
    params=PARAMS
)

In [72]:
# Pick the first text of each category as a few-shot example

categories = test_df["label"].unique()

examples = {}
indices_to_remove = []
for i, row in test_df.iterrows():
    if row["label"] in categories and row["label"] not in examples:
        examples[row["label"]] = row["text"]
        indices_to_remove.append(i)
    if len(examples) == len(categories):
        break

# Print examples
for category, text in examples.items():
    print(f"Category: {category}, Example: {text}")

# Remove selected examples to avoid data leakage
test_df_without_examples = test_df.drop(indices_to_remove).reset_index(drop=True)

# Print remaining dataset size
print("Remaining dataset size:", len(test_df_without_examples))


Category: Business, Example: Ford: Monthly Sales Drop, Company Looks To New Vehicles Cruising along the ever-stretching road of decline. Auto giant Ford Motor (nyse: F - news - people ) reported vehicle sales in October that fell 5 from a year ago.
Category: Sci/Tech, Example: China Closes 1,600 Internet Cafes in Crackdown China shut 1,600 Internet cafes between February and August and imposed \$12.1 million worth of fines for allowing children to play violent or adult-only games and other violations, state media said.
Category: Sports, Example: Agassi Overcomes Verdasco Power  STOCKHOLM (Reuters) - Andre Agassi marched into the  Stockholm Open semifinals Friday, beating Spanish eighth seed  Fernando Verdasco 7-6, 6-2 in his toughest match of the  tournament.
Category: World, Example: Large Explosion Heard in Central Baghdad (Reuters) Reuters - A large blast was heard in central\Baghdad on Thursday, witnesses said.
Remaining dataset size: 756


In [77]:
SYSTEM_PROMPT = """
Classify the following news story into one of the four categories: 
{categories}

Text:  
{text}  

Respond with only the correct category name and nothing else.  

Category:  

"""

In [78]:
CATEGORIES = "- " + "\n- ".join(test_df["label"].unique())  # Create a string with all categories

predictions = []

for text in tqdm(test_df["text"]):

    # format the prompt with the categories and the text
    prompt = SYSTEM_PROMPT.format(categories=CATEGORIES, text=text)
    
    # generate the response from the model
    response = model.generate(prompt)

    # extract the generated text from the response
    prediction = response["results"][0]["generated_text"].strip()

    # append the prediction to the list of predictions
    predictions.append(prediction)

100%|██████████| 760/760 [03:50<00:00,  3.29it/s]


In [79]:
print(classification_report(test_df.label, predictions))
print(CATEGORIES)

                precision    recall  f1-score   support

      Business       0.52      0.93      0.67       190
        Health       0.00      0.00      0.00         0
          Java       0.00      0.00      0.00         0
 Miscellaneous       0.00      0.00      0.00         0
      Sci/Tech       0.86      0.35      0.50       190
         Space       0.00      0.00      0.00         0
Space Sci/Tech       0.00      0.00      0.00         0
        Sports       0.91      0.91      0.91       190
         World       0.85      0.66      0.74       190

      accuracy                           0.71       760
     macro avg       0.35      0.32      0.31       760
  weighted avg       0.78      0.71      0.70       760

- Business
- Sci/Tech
- Sports
- World
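The gap between the macro average (0.35 precision) and the weighted average (0.78) comes from the five labels the model invented: each has zero support, but still counts as a full class in the macro average. Below is a small sketch of that arithmetic, using the per-class scores copied from the report above. The helper `macro` and the variable names are only illustrative, and because the inputs are already rounded to two decimals the results are approximate.

```python
# Per-class (precision, recall, f1) copied from the classification report above.
true_classes = {
    "Business": (0.52, 0.93, 0.67),
    "Sci/Tech": (0.86, 0.35, 0.50),
    "Sports": (0.91, 0.91, 0.91),
    "World": (0.85, 0.66, 0.74),
}
# Labels the model invented: zero support, so every score is 0.
invented = ["Health", "Java", "Miscellaneous", "Space", "Space Sci/Tech"]


def macro(scores, n_classes):
    """Unweighted mean of each metric over n_classes classes.

    Classes missing from `scores` contribute 0 to every metric.
    """
    return tuple(sum(column) / n_classes for column in zip(*scores))


scores = list(true_classes.values())
macro_all = macro(scores, len(true_classes) + len(invented))  # 9 classes, as in the report
macro_true = macro(scores, len(true_classes))                 # only the 4 real categories

print("9 classes:", [f"{m:.2f}" for m in macro_all])
print("4 classes:", [f"{m:.2f}" for m in macro_true])
```

Averaged over all nine labels, the numbers reproduce the `macro avg` row of the report. Averaged over the four real categories only, macro precision is close to 0.8 and macro recall close to 0.7, which is much nearer to the weighted average.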




Originally I tried zero-shot prompting, which gave results similar to those in the `llm_guide` notebook.

Then I tried few-shot prompting: I extracted three examples from the dataset and removed them from it to avoid data leakage. The prompt I used looked like this:

```py
SYSTEM_PROMPT = """
Your task is to classify news stories into one of four categories

CATEGORIES:  
{categories}  

Ford: Monthly Sales Drop, Company Looks To New Vehicles Cruising along the ever-stretching road of decline. Auto giant Ford Motor (nyse: F - news - people ) reported vehicle sales in October that fell 5 from a year ago // Business

United #39;s pension dilemma United Airlines says it likely will end funding for employee pension plans, a move that would be the largest ever default by a US company and could lead to a taxpayer-funded bailout rivaling the savings-and-loan fiasco of the 1980s // Business

Comcast part of group wanting to buy MGM A consortium led by Sony Corp. of America that includes Comcast Corp. has entered into a definitive agreement to acquire Metro-Goldwyn Mayer Inc // Business

{text}:
"""
```

This gave an accuracy of 80%, but the LLM produced many more categories than the four provided, so I tweaked the examples to use the following format:

text: {text}

category: {category}

However, I couldn't get the LLM to respond with only one of the provided categories, even when the prompt explicitly asked for it. So I changed the examples to include one example for each category.
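A prompt in this `text:` / `category:` format with one example per category could be assembled roughly as follows. This is a minimal sketch and not the exact code that was run: `build_few_shot_prompt` is a hypothetical helper, the instruction wording is an assumption, and the demo texts are shortened versions of the examples printed earlier.

```python
def build_few_shot_prompt(categories, examples, text):
    """Build a few-shot prompt with one labelled example per category.

    `examples` maps a category name to one example text, like the
    `examples` dict built earlier in this notebook.
    """
    category_list = "\n".join(f"- {c}" for c in categories)
    # One "text / category" pair per category, separated by blank lines.
    shots = "\n\n".join(
        f"text: {example}\ncategory: {category}"
        for category, example in examples.items()
    )
    # End with an open "category:" so the model only has to complete the label.
    return (
        "Classify the news story into one of these categories:\n"
        f"{category_list}\n\n"
        f"{shots}\n\n"
        f"text: {text}\n"
        "category:"
    )


# Toy data (assumption): shortened versions of the examples printed above.
demo_categories = ["Business", "Sci/Tech", "Sports", "World"]
demo_examples = {
    "Business": "Ford: Monthly Sales Drop, Company Looks To New Vehicles",
    "Sci/Tech": "China Closes 1,600 Internet Cafes in Crackdown",
    "Sports": "Agassi Overcomes Verdasco Power",
    "World": "Large Explosion Heard in Central Baghdad",
}

prompt = build_few_shot_prompt(demo_categories, demo_examples, "Some new headline")
print(prompt)
```

Building the string with f-strings instead of `str.format` avoids errors when a news text happens to contain curly braces.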

This didn't help at all, so I went back to zero-shot prompting, which is the run reported above.

The LLM reached 71% accuracy in the zero-shot run above, but it struggled to stick to the four categories and often invented new ones (Health, Java, Miscellaneous, Space, Space Sci/Tech). BoW with logistic regression has no such problem, since it can only predict labels it saw during training, but it had lower accuracy. BERT with logistic regression performed best, offering better accuracy and a better understanding of context while keeping the categories intact.
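One way to deal with the invented categories, which was not tried in this notebook, is to post-process the raw model output before scoring it. The sketch below is only an illustration: `normalize_prediction`, `VALID_LABELS`, and the `"Unknown"` fallback bucket are hypothetical names, and the substring rule is a heuristic that would change the reported scores.

```python
VALID_LABELS = ["World", "Sports", "Business", "Sci/Tech"]


def normalize_prediction(raw, valid_labels=VALID_LABELS, fallback="Unknown"):
    """Map a raw model output to one of the valid labels, or to `fallback`."""
    cleaned = raw.strip().lower()

    # 1. Exact match, ignoring case and surrounding whitespace.
    for label in valid_labels:
        if cleaned == label.lower():
            return label

    # 2. Exactly one valid label appears inside the output, e.g. "Space Sci/Tech".
    hits = [label for label in valid_labels if label.lower() in cleaned]
    if len(hits) == 1:
        return hits[0]

    # 3. Anything else (e.g. "Health", "Java") goes into a single fallback bucket.
    return fallback


raw_predictions = ["Business", " sports ", "Space Sci/Tech", "Java", "Health"]
normalized = [normalize_prediction(p) for p in raw_predictions]
print(normalized)
```

With this, the classification report would contain at most five labels (the four categories plus the fallback), which makes the macro average easier to compare with the BoW and BERT results.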