**AI & Machine Learning (KAN-CINTO4003U) - Copenhagen Business School | Spring 2025**

***


# Part III: LLM

Please see the description of the assignment in the README file (section 3) <br>
**Guide notebook**: [guides/llm_guide.ipynb](guides/llm_guide.ipynb)


***

<br>

* Note that you should report results using a classification report. 

* Also, remember to include some reflections on your results: how do they compare with the results from Part I (BoW) and Part II (BERT)? Are there any hyperparameters or prompting techniques that are particularly important?

* You should follow the steps given in the `llm_guide` notebook.

<br>


***

In [12]:
# imports for the project


import pandas as pd
from sklearn.metrics import classification_report 
from tqdm import tqdm
from decouple import config 
from ibm_watsonx_ai import APIClient
from ibm_watsonx_ai import Credentials 
from ibm_watsonx_ai.foundation_models import ModelInference 
from ibm_watsonx_ai.foundation_models.schema import TextGenParameters


In [13]:

# =================================
# Section 1: API client setup
# =================================

WX_API_KEY = config('WX_API_KEY')

credentials = Credentials(
    url = "https://us-south.ml.cloud.ibm.com",
    api_key = WX_API_KEY
)

client = APIClient(
    credentials=credentials, 
    project_id="3022aaff-233b-4f7b-a00d-0931c9f73ce5"
)

In [14]:


splits = {'train': 'data/train-00000-of-00001.parquet', 'test': 'data/test-00000-of-00001.parquet'}
# train = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["train"])
test = pd.read_parquet("hf://datasets/fancyzhx/ag_news/" + splits["test"])

In [15]:
# ===========================
# Section 2: data preparation
# ===========================

label_map = {
    0: 'World',
    1: 'Sports',
    2: 'Business',
    3: 'Sci/Tech'
}

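def preprocess(df: pd.DataFrame, frac: float = 1e-2, label_map: dict = label_map, seed: int = 42) -> pd.DataFrame:
    """Map integer labels to category names and sample the same fraction from each class."""
    return (
        df
        .assign(label=lambda x: x['label'].map(label_map))         # integer id -> category name
        [lambda d: d['label'].isin(label_map.values())]            # drop rows whose label was not in label_map
        .groupby('label')
        .apply(lambda x: x.sample(frac=frac, random_state=seed))   # stratified sample: `frac` of each class
        .reset_index(drop=True)
    )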

test_df = preprocess(test, frac=0.1)

# free some memory by deleting the original dataframe
del test

test_df.shape

(760, 2)

In [17]:
# ===============================
# Section 3: set model parameters
# ===============================

PARAMS = TextGenParameters(
    temperature=0,              # Higher temperature means more randomness - In this case we don't want randomness
    max_new_tokens=10,          # Maximum number of tokens to generate
    stop_sequences=[".", "\n"], # Stop generating text when these sequences are encountered
)

model = ModelInference(
    api_client= client,
    model_id="ibm/granite-20b-code-instruct", 
    params=PARAMS
)

In [20]:
# ===================================
# Section 4: creating a system prompt
# ===================================

SYSTEM_PROMPT = """Your task is to classify news stories into one of four categories

CATEGORIES:
{categories}

TEXT:
{text}

Please assign the correct category to the text. Answer with the correct category and nothing else.

Category:
"""

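The template above is zero-shot: the model sees the category names but no worked examples. Since the assignment asks which prompting techniques matter, a few-shot variant is one option to compare against. The sketch below is only an illustration. `FEW_SHOT_TEMPLATE`, `FEW_SHOT_EXAMPLES`, and `build_few_shot_prompt` are hypothetical names, and the example headlines are made up rather than taken from AG News. In practice the demonstrations would more sensibly be drawn from the training split.

```python
# Hypothetical few-shot variant of the zero-shot template (all names are illustrative).
FEW_SHOT_TEMPLATE = """Your task is to classify news stories into one of four categories

CATEGORIES:
{categories}

EXAMPLES:
{examples}

TEXT:
{text}

Please assign the correct category to the text. Answer with the correct category and nothing else.

Category:
"""

# Made-up demonstrations, one per category (assumption: not real AG News rows).
FEW_SHOT_EXAMPLES = [
    ("Leaders meet for peace talks after border clashes.", "World"),
    ("Striker scores twice as the home side wins the cup final.", "Sports"),
    ("Retailer's shares fall after a profit warning.", "Business"),
    ("Chipmaker unveils a faster processor for laptops.", "Sci/Tech"),
]


def build_few_shot_prompt(text, categories, examples=FEW_SHOT_EXAMPLES):
    """Fill the template with the category list, the demonstrations, and the text to classify."""
    categories_block = "- " + "\n- ".join(categories)
    examples_block = "\n\n".join(
        f"Text: {example_text}\nCategory: {label}" for example_text, label in examples
    )
    return FEW_SHOT_TEMPLATE.format(
        categories=categories_block, examples=examples_block, text=text
    )


prompt = build_few_shot_prompt(
    "Stocks rally as oil prices drop.",
    ["World", "Sports", "Business", "Sci/Tech"],
)
print(prompt)
```

The resulting string could be passed to `model.generate` in place of the zero-shot prompt. Few-shot prompts are longer, so each request processes more tokens.
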
In [21]:
# ===============================
# Section 5: generate predictions
# ===============================

CATEGORIES = "- " + "\n- ".join(test_df["label"].unique())  # Create a string with all categories

predictions = []

for text in tqdm(test_df["text"]):

    # format the prompt with the categories and the text
    prompt = SYSTEM_PROMPT.format(categories=CATEGORIES, text=text)
    
    # generate the response from the model
    response = model.generate(prompt)

    # extract the generated text from the response
    prediction = response["results"][0]["generated_text"].strip()

    # append the prediction to the list of predictions
    predictions.append(prediction)

100%|██████████| 760/760 [04:41<00:00,  2.70it/s]
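
Because the model returns free text, a prediction can differ from the label strings in casing or contain extra words, and `classification_report` would then count it as a separate class. The sketch below shows one possible normalisation step. `normalize_prediction`, `VALID_LABELS`, and the `"Unknown"` fallback are hypothetical names introduced for illustration and are not part of the guide notebook.

```python
VALID_LABELS = ["World", "Sports", "Business", "Sci/Tech"]


def normalize_prediction(raw, valid_labels=VALID_LABELS, fallback="Unknown"):
    """Map raw model output to one of the valid labels, or to `fallback` if ambiguous."""
    cleaned = raw.strip().lower()

    # 1) exact match, ignoring case and surrounding whitespace
    for label in valid_labels:
        if cleaned == label.lower():
            return label

    # 2) otherwise accept the output only if exactly one label occurs inside it
    hits = [label for label in valid_labels if label.lower() in cleaned]
    return hits[0] if len(hits) == 1 else fallback


print(normalize_prediction(" sports "))            # exact match after cleaning
print(normalize_prediction("Category: Sci/Tech"))  # single label inside extra text
print(normalize_prediction("World or Business"))   # ambiguous, falls back
```

Applied to this notebook, it could be run as `predictions = [normalize_prediction(p) for p in predictions]` before the evaluation step. Counting how many predictions end up as `"Unknown"` also gives a quick check of how well the model follows the output instruction.
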


In [22]:
# ===================
# Section 6: evaluate
# ===================

print(classification_report(test_df.label, predictions))

              precision    recall  f1-score   support

    Business       0.51      0.90      0.65       190
    Sci/Tech       0.65      0.54      0.59       190
      Sports       0.76      0.89      0.82       190
       World       0.77      0.17      0.28       190

    accuracy                           0.63       760
   macro avg       0.67      0.63      0.59       760
weighted avg       0.67      0.63      0.59       760
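
The report shows World with fairly high precision (0.77) but very low recall (0.17), while Business has high recall (0.90) and low precision (0.51). This pattern suggests that many World stories are being assigned to another class, possibly Business, but the report alone cannot confirm which one. Tallying (true, predicted) pairs would answer that. The sketch below uses toy labels, not the actual results, and `confusion_counts` and `most_common_errors` are hypothetical helper names.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))


def most_common_errors(y_true, y_pred, n=3):
    """Return the n most frequent misclassifications as ((true, predicted), count)."""
    counts = confusion_counts(y_true, y_pred)
    errors = Counter(
        {pair: count for pair, count in counts.items() if pair[0] != pair[1]}
    )
    return errors.most_common(n)


# Toy data for illustration only (assumption: not the notebook's predictions).
y_true = ["World", "World", "World", "Business", "Sports"]
y_pred = ["Business", "Business", "World", "Business", "Sports"]

print(most_common_errors(y_true, y_pred))
```

In the notebook, the same helper could be called as `most_common_errors(test_df.label, predictions)` to see which category absorbs the misclassified World stories.
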

