**AI & Machine Learning (KAN-CINTO4003U) - Copenhagen Business School | Spring 2025**

***


# Part III: LLM

Please see the description of the assignment in the README file (section 3) <br>
**Guide notebook**: [guides/llm_guide.ipynb](guides/llm_guide.ipynb)


***

<br>

* Note that you should report results using a classification report. 

* Also, remember to include some reflections on your results: how do they compare with the results from Part I (BoW) and Part II (BERT)? Are there any hyperparameters or prompting techniques that are particularly important?

* You should follow the steps given in the `llm_guide` notebook.

<br>


***

In [37]:
from decouple import config
from ibm_watsonx_ai import APIClient
from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models import ModelInference

import pandas as pd
from sklearn.metrics import classification_report 
from tqdm import tqdm
from ibm_watsonx_ai.foundation_models.schema import TextGenParameters

In [38]:
WX_API_KEY = config('WX_API_KEY')

In [39]:
credentials = Credentials(
    url = "https://us-south.ml.cloud.ibm.com",
    api_key = WX_API_KEY
)

client = APIClient(
    credentials=credentials, 
    project_id="0c3df496-9cce-47fa-8c75-2d6a61c758e4"
)

In [40]:
model = ModelInference(
    api_client=client,
    model_id="mistralai/mistral-large",
)

### 1. Load the data

We can load this data directly from [Hugging Face Datasets](https://huggingface.co/docs/datasets/) (the Hugging Face Hub) into a pandas DataFrame. Pretty neat!

**Note**: This cell will download the dataset and keep it in memory. If you run this cell multiple times, it will download the dataset multiple times.

You are welcome to increase the `frac` parameter to load more data.

In [41]:
splits = {'train': 'data/train-00000-of-00001.parquet', 'test': 'data/test-00000-of-00001.parquet'}
# train = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["train"])
test = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["test"])

In [42]:
label_map = {
    0: 'World',
    1: 'Sports',
    2: 'Business',
    3: 'Sci/Tech'
}

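def preprocess(df: pd.DataFrame, frac=1e-2, label_map=label_map, seed=42) -> pd.DataFrame:
    """Map integer labels to category names and sample `frac` of the rows from each label."""
    return (
        df
        .assign(label=lambda x: x['label'].map(label_map))        # integer id -> category name
        [lambda x: x['label'].isin(label_map.values())]           # drop rows whose label is not in label_map
        .groupby('label')
        .apply(lambda x: x.sample(frac=frac, random_state=seed))  # same share of rows from every class
        .reset_index(drop=True)
    )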

# train_df = preprocess(train, frac=0.01)
test_df = preprocess(test, frac=0.1)

# clear up some memory by deleting the original dataframes
# del train
del test

test_df.shape, # train_df.shape, 

((760, 2),)
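To make the effect of `frac` concrete, here is a minimal standard-library sketch of the same per-class sampling idea. It is an illustration only, not how pandas implements `sample`; the function name `stratified_sample` and the synthetic rows are assumptions. The 1900 rows per class follow from the 190 sampled per class at `frac=0.1`.

```python
import random
from collections import defaultdict


def stratified_sample(rows, frac, seed=42):
    """Draw the same fraction of rows from every label (hypothetical helper)."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row["label"]].append(row)

    sampled = []
    for label in sorted(by_label):
        group = by_label[label]
        # A fresh generator per group mirrors passing random_state=seed to each group.
        rng = random.Random(seed)
        sampled.extend(rng.sample(group, round(frac * len(group))))
    return sampled


# Synthetic stand-in for the test split: 4 classes with 1900 rows each.
labels = ["World", "Sports", "Business", "Sci/Tech"]
rows = [{"text": f"story {i}", "label": label} for label in labels for i in range(1900)]

sample = stratified_sample(rows, frac=0.1)
per_class = {label: sum(r["label"] == label for r in sample) for label in labels}
print(len(sample), per_class)
```

Because every class is sampled at the same rate, the class balance of the full split carries over to the sample, which is why each label has the same support in the evaluation.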

In [43]:
PARAMS = TextGenParameters(
    temperature=0,              # Higher temperature means more randomness - In this case we don't want randomness
    max_new_tokens=10,          # Maximum number of tokens to generate
    stop_sequences=[".", "\n"], # Stop generating text when these sequences are encountered
)

model = ModelInference(
    api_client=client,
    model_id="ibm/granite-13b-instruct-v2",  # We could also try a larger model!
    params=PARAMS
)

In [50]:
SYSTEM_PROMPT = """You are a news classification expert. Your task is to classify news stories into EXACTLY ONE of these four categories:

ALLOWED CATEGORIES:
- Business
- Sci/Tech
- Sports
- World

IMPORTANT RULES:
1. You MUST choose from the four categories listed above
2. Do not create new categories
3. Do not add any additional text or explanations
4. Answer with only the category name

CATEGORIES:
{categories}

Here are some examples of how to classify news stories:

Example 1:
Text: "Apple Inc. reported record-breaking quarterly profits, with iPhone sales exceeding expectations."
Category: Business

Example 2:
Text: "Scientists discover new species of deep-sea creatures in the Pacific Ocean."
Category: Sci/Tech

Example 3:
Text: "Manchester United secures victory in the Champions League final."
Category: Sports

Example 4:
Text: "UN Security Council convenes emergency meeting to address global crisis."
Category: World

Now, please classify the following news story. Answer with only the category name:

Text:
{text}

Category:
"""

In [51]:
CATEGORIES = "- " + "\n- ".join(test_df["label"].unique())  # Create a string with all categories

predictions = []

for text in tqdm(test_df["text"]):

    # format the prompt with the categories and the text
    prompt = SYSTEM_PROMPT.format(categories=CATEGORIES, text=text)
    
    # generate the response from the model
    response = model.generate(prompt)

    # extract the generated text from the response
    prediction = response["results"][0]["generated_text"].strip()

    # append the prediction to the list of predictions
    predictions.append(prediction)

100%|██████████| 760/760 [05:20<00:00,  2.37it/s]


In [52]:
print(classification_report(test_df.label, predictions))

              precision    recall  f1-score   support

    Business       0.49      0.96      0.65       190
      Health       0.00      0.00      0.00         0
     Letters       0.00      0.00      0.00         0
    Sci/Tech       0.93      0.30      0.45       190
      Sports       0.92      0.88      0.90       190
         War       0.00      0.00      0.00         0
       World       0.84      0.64      0.73       190

    accuracy                           0.69       760
   macro avg       0.46      0.40      0.39       760
weighted avg       0.80      0.69      0.68       760
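The extra rows (Health, Letters, War) are labels the model produced that never occur in the true labels, so they have support 0. They still count toward the macro average. The short sketch below reproduces that arithmetic from the rounded F1 scores in the report; the variable names are illustrative.

```python
# Per-class F1 scores copied from the classification report above (rounded values).
f1_by_label = {
    "Business": 0.65,
    "Health": 0.00,
    "Letters": 0.00,
    "Sci/Tech": 0.45,
    "Sports": 0.90,
    "War": 0.00,
    "World": 0.73,
}
allowed = ["Business", "Sci/Tech", "Sports", "World"]

# Macro average over every label that appears in either y_true or y_pred.
macro_all = sum(f1_by_label.values()) / len(f1_by_label)

# Macro average restricted to the four real categories.
macro_allowed = sum(f1_by_label[label] for label in allowed) / len(allowed)

print(round(macro_all, 2))      # 0.39, matches the "macro avg" row
print(round(macro_allowed, 2))  # 0.68
```

`classification_report` also accepts a `labels` argument to restrict the report to chosen classes and a `zero_division` argument to control how undefined metrics are reported.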





The LLM did a good job of classifying the news articles, reaching a test accuracy of 86% in my best run. When I ran the notebook again, performance dropped to the 69% shown in the report above and the model made up some new categories (Health, Letters, War), but I did achieve 86% accuracy with only the correct categories in the earlier run. That is higher than the bag-of-words model (83%) but lower than the BERT model (88%).

Unlike the LLM guide, I used the mistral-large model, which has the highest classification score of the models available on IBM's platform. I also changed the prompt and used few-shot prompting, giving the model one example news text for each category, and found that this worked better than zero-shot prompting. The few-shot prompt helped the model classify the articles better than the baseline in the LLM guide, which reached 74%.
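One way to handle made-up categories is to post-process each raw prediction before scoring it. The sketch below is a minimal example under assumptions: `normalize_prediction` and the `"Unknown"` fallback label are hypothetical and were not used to produce the results above.

```python
ALLOWED = ["Business", "Sci/Tech", "Sports", "World"]


def normalize_prediction(raw, allowed=ALLOWED, fallback="Unknown"):
    """Map raw model output to an allowed label, or to `fallback` if none matches."""
    # Remove surrounding whitespace, quotes and trailing periods, then ignore case.
    cleaned = raw.strip().strip(".\"'").lower()
    for label in allowed:
        if cleaned == label.lower():
            return label
    return fallback


raw_predictions = [" sports.", "Sci/Tech", "War", '"Business"', "Health"]
cleaned_predictions = [normalize_prediction(p) for p in raw_predictions]
print(cleaned_predictions)
# ['Sports', 'Sci/Tech', 'Unknown', 'Business', 'Unknown']
```

Collecting every out-of-vocabulary answer under a single fallback label keeps these failures visible as one row in the report instead of one row per invented category.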



