<a href="https://colab.research.google.com/github/joshcova/LLMs-for-social-scientists/blob/main/code/04_open_source_llm_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Open source LLMs

We first need to install the relevant package. Note that we pin a specific version of `transformers`.

In [None]:
!pip install -U "transformers==4.40.0"

For this text classification example, we will use one of the most widely used LLMs: **Meta's Llama**.

Unlike most commercial LLMs, Llama is **free** to download and **its model weights are publicly available**, which is useful for advanced fine-tuning and customization. Moreover, Llama models come in different sizes (for example, 8B and 70B parameters for Llama 3), which makes them accessible to users with different computational resources. Finally, Llama can be used for both commercial and research purposes, subject to Meta's license terms.

Nevertheless, it is important to note that Meta **does not disclose what data its foundation models were trained on**, which has led to criticism that the models are not as open-source as often claimed.

Several Llama models are available. For this exercise we will use "meta-llama/Meta-Llama-3-8B-Instruct" (https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct).

Although the 8B-parameter model is lighter than some of the other available Llama models, it is still quite large.

This is why we need a GPU to run this code (for example, one rented from a third-party provider if we do not have the computational resources ourselves).
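
To get a feeling for why a GPU is needed, the following back-of-the-envelope sketch estimates the memory required just to hold the model weights. It assumes roughly 8 billion parameters (the "8B" in the model name) and 2 bytes per parameter, which corresponds to the `bfloat16` precision used when the model is loaded later in this notebook. Generation needs additional memory on top of this, so the figure is a lower bound rather than a precise requirement.

```python
# Rough estimate of the memory needed to store the model weights.
# Assumption: ~8 billion parameters, each stored in 2 bytes (bfloat16).
n_parameters = 8e9
bytes_per_parameter = 2

weight_bytes = n_parameters * bytes_per_parameter
weight_gib = weight_bytes / 1024**3  # convert bytes to GiB

print(f"Approximate size of the weights: {weight_gib:.1f} GiB")
```

Under these assumptions the weights alone take up about 15 GiB.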

Before starting our analysis, we first need to carry out two steps:

1. Request access to Meta's Llama models on Hugging Face by agreeing to the terms and conditions (https://huggingface.co/meta-llama/Meta-Llama-3-8B)
2. Create an access token on Hugging Face in order to download and run the models via the Hugging Face interface (https://huggingface.co/docs/hub/en/security-tokens)

In this notebook, we will test how well Llama performs on a text classification task for two different corpora. The notebook is organized as follows:

1. Setting up Hugging Face and Llama
2. The media corpus of UK newspaper headlines
3. The corpus of parliamentary discussions on central bank independence

# Setting up Hugging Face and Llama

Now you can enter your secret Hugging Face access token.
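
If you prefer not to type the token interactively, one option is to read it from an environment variable and log in programmatically. The sketch below is only an illustration: `get_hf_token` is a hypothetical helper, and the variable name `HF_TOKEN` is an assumption about where you stored the token. The `login` function from `huggingface_hub` is shown commented out so that the block runs without network access.

```python
import os

def get_hf_token(env_var="HF_TOKEN"):
    """Read the Hugging Face token from an environment variable (hypothetical helper)."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Please set the environment variable {env_var} first.")
    return token

# Usage (uncomment once the environment variable is set):
# from huggingface_hub import login
# login(token=get_hf_token())
```

Reading the token from the environment keeps it out of the notebook itself, which matters if the notebook is shared.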

In [None]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: fineGrained).
The token `os_token_gen1` has been saved to /root/.cache/huggingface/stored_tokens
Your token has been saved to /root/.cache/huggingface/token
Login successful.
The current active token is: `os_t

In [None]:
# Load libraries

import transformers
import torch
import pandas as pd

In [None]:
# Initialize model (depending on your computational resources, loading might take some time)

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

pipeline = transformers.pipeline(
    task="text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",
)


Let us test the model by classifying a dummy example.

Note that in this example we also instruct the model to explain why it chose a given category.

In [None]:
messages = [
    {"role": "system", "content": "Based on the text, your task is to classify the following newspaper headlines into one of the following 3 categories: 1. Macroeconomics; 2. Law & Crime; 3. Others. Please motivate your answer."},
    {"role": "user", "content": "Interest rates expected to raise in the next month"},
]

full_prompt = pipeline.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
)


In [None]:
outputs = pipeline(
    full_prompt,
    do_sample=False,  # greedy decoding; temperature has no effect when sampling is off
)
print(outputs[0]["generated_text"][len(full_prompt):])

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


I would classify this headline as: 1. Macroeconomics

Motivation: The headline is about interest rates, which is a key aspect of macroeconomics. Macroeconomics is the study of the economy as a whole, including factors such as inflation, unemployment, and interest rates. The expectation of a rate hike is a significant economic event that can have far-reaching impacts on the economy, making it a macroeconomic topic.


Interesting! But do we really need the additional explanation? Let us simplify the prompt.

In [None]:
messages = [
    {"role": "system", "content": " Based on the text, your task is to classify the following newspaper headlines into one of the following 3 categories: 1. Macroeconomics; 2. Law & Crime; 3. Others. Please answer only with the name of the category."},
    {"role": "user", "content": "Interest rates expected to raise in the next month"},
]

prompt = pipeline.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
)


In [None]:
outputs = pipeline(
    prompt,
    do_sample=False,  # greedy decoding; temperature has no effect when sampling is off
)
print(outputs[0]["generated_text"][len(prompt):])

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


1. Macroeconomics


# Media corpus of UK newspaper headlines



The model seems to work rather well on this dummy example. Let us now scale up by applying it to our dataset.

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/joshcova/LLMs-for-social-scientists/main/data/uk_media_2.csv")

Let us separate the labels assigned by human annotators from the text. We first keep only the label and the text, and draw a sample of 50 headlines per label.

We then store the human labels, which serve as the gold standard, in a separate data frame (`df_label`), so that the data frame passed to the LLM only contains the row index and the text. Keeping the gold standard is important because we need it to check how well the LLM did.

In [None]:
df = df[["majortopic","text"]]
df = df.rename(columns={"majortopic":"label"})

# Sample 50 headlines per label, so we don't have to run the LLM on the entire corpus

df = df.groupby("label").sample(n=50, random_state=1)

In [None]:
df_label = df[["label"]]

In [None]:
df = df[["text"]]

In [None]:
df.head()

Unnamed: 0,text
4386,"Veil is lifted on Arafat's secret wedding, Yas..."
3385,Nakasone tries to soothe US Japanese are urged...
5374,"Marconi move, Inside."
324,Lord Devlin Asked To Head Press Council
1009,"Smith will hang Africans today, says report"


Just as we did with the example in which we used ChatGPT, we can write a function to interact with the LLM.

In [None]:
def classify_text(text):
    messages = [
        {"role": "system", "content": """
            Based on the text, your task is to classify the following newspaper headlines
            into one of the following 3 categories:
            0. The newspaper headline concerns a topic other than Macroeconomics or Law & Crime.
            1. The newspaper headline concerns the topic Macroeconomics.
            2. The newspaper headline concerns the topic Law & Crime.
            Please answer only with the number assigned to the category.
            """},
        {"role": "user", "content": text},
    ]

    # Turn the messages into a single prompt string using the model's chat template
    prompt = pipeline.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    # Use the pipeline to generate output
    outputs = pipeline(
        prompt,
        do_sample=False,  # Greedy decoding, i.e. deterministic output
    )

    # The generated text starts with the prompt, so cut the prompt off
    generated_text = outputs[0]["generated_text"][len(prompt):].strip()
    # Return the first word as the classification (empty string if nothing was generated)
    words = generated_text.split()
    return words[0] if words else ""

In [None]:
# Apply the function to our text

df['category'] = df['text'].apply(classify_text)


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.

In [None]:
# Quickly inspect the results

df.head()

Unnamed: 0,text,category
4386,"Veil is lifted on Arafat's secret wedding, Yas...",2
3385,Nakasone tries to soothe US Japanese are urged...,1
5374,"Marconi move, Inside.",0
324,Lord Devlin Asked To Head Press Council,2
1009,"Smith will hang Africans today, says report",2


In [None]:
df["category"].value_counts()

category,count
2,78
1,46
0,25
I'm,1


In [None]:
# Recode every answer that is not "0", "1" or "2" as "0". This handles cases in which the LLM did not return a valid category (such as the "I'm" above).
df["category"] = df["category"].apply(lambda x: x if x in ["1", "2", "0"] else "0")
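
Taking the first whitespace-delimited word only works if the model answers with a bare digit. An answer such as "1." would not match and would be recoded as 0. A slightly more forgiving approach is sketched below. It is not the approach used in this notebook, and `parse_label` is a hypothetical helper. It follows the notebook's choice of falling back to category 0 when no valid label is found.

```python
import re

def parse_label(generated_text, fallback=0):
    """Return the first standalone digit 0, 1 or 2 in the answer, else the fallback."""
    match = re.search(r"\b([0-2])\b", generated_text)
    if match is None:
        return fallback
    return int(match.group(1))

# A few answers of the kind an LLM might return
for answer in ["2", "1.", "Category: 0", "I'm sorry, I cannot classify this."]:
    print(repr(answer), "->", parse_label(answer))
```

Whether unparseable answers should be mapped to a default category or excluded from the evaluation is a design choice; mapping them to 0 counts them as "other".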


In [None]:
# Convert the category variable to integers so that it has the same data type as the gold labels

df["category"] = df["category"].astype(int)

In [None]:
# Load the scikit-learn metrics to see how well Llama performed

from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score, balanced_accuracy_score

# scikit-learn expects the gold labels first (y_true) and the predictions second (y_pred)
metrics = {
    "Metric": ["F1 Score (macro)", "F1 Score (micro)", "Balanced Accuracy"],
    "Value": [
        f1_score(df_label["label"], df["category"], average='macro'),
        f1_score(df_label["label"], df["category"], average='micro'),
        balanced_accuracy_score(df_label["label"], df["category"])
    ]
}

# Convert the dictionary into a DataFrame for nice tabular representation
results_df = pd.DataFrame(metrics)

# Display the results table
results_df

Unnamed: 0,Metric,Value
0,F1 Score (macro),0.608918
1,F1 Score (micro),0.633333
2,Balanced Accuracy,0.633333
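
To see where the two F1 scores come from, they can be recomputed by hand from a few counts. In the sketch below, the number of headlines per predicted class comes from the value counts above (with the single invalid answer recoded as 0), and the gold standard contains 50 headlines per class by construction. The true-positive counts are not printed in the notebook; the values used here are the ones implied by the per-class F1 scores reported further below, so treat them as an inference.

```python
# Counts for classes 0, 1 and 2
true_positives = [14, 37, 44]    # inferred: headlines where Llama and the annotators agree
predicted_counts = [26, 46, 78]  # headlines Llama assigned to each class
gold_counts = [50, 50, 50]       # headlines per class in the gold standard

# Per-class F1 = 2 * TP / (number predicted + number in the gold standard)
f1_per_class = [
    2 * tp / (pred + gold)
    for tp, pred, gold in zip(true_positives, predicted_counts, gold_counts)
]

# Macro F1: unweighted mean of the per-class F1 scores
macro_f1 = sum(f1_per_class) / len(f1_per_class)

# Micro F1: pool all decisions; with exactly one label per text this equals accuracy
micro_f1 = sum(true_positives) / sum(gold_counts)

print([round(f, 6) for f in f1_per_class])
print(round(macro_f1, 6), round(micro_f1, 6))
```

The macro score treats every class equally, whereas the micro score is driven by the overall share of correctly classified headlines.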


In [None]:
# Calculating metrics per class
# Gold labels come first, predictions second; swap in the predictions of any other model to evaluate it
precision_per_class = precision_score(df_label["label"], df["category"], average=None, labels=[0,1,2])
recall_per_class = recall_score(df_label["label"], df["category"], average=None, labels=[0,1,2])
f1_per_class = f1_score(df_label["label"], df["category"], average=None, labels=[0,1,2])

# Since accuracy is a global metric (not class-specific), we will not recalculate it here.

# Create a DataFrame from the metrics
metrics_per_class_df = pd.DataFrame({
    "Class": [0, 1, 2],
    "Precision": precision_per_class,
    "Recall": recall_per_class,
    "F1 Score": f1_per_class
})

# Display the results table
metrics_per_class_df

Unnamed: 0,Class,Precision,Recall,F1 Score
0,0,0.538462,0.28,0.368421
1,1,0.804348,0.74,0.770833
2,2,0.564103,0.88,0.6875


# Parliamentary speech corpus on central bank independence

In [None]:
df_cbi = pd.read_csv("https://raw.githubusercontent.com/joshcova/LLMs-for-social-scientists/main/data/uk_cbi_sample.csv")

In [None]:
df_results_cbi = df_cbi[["results_number"]]

In [None]:
df_cbi = df_cbi[["id", "sents"]]

In [None]:
df_cbi = df_cbi.rename(columns={"sents": "text"})

In [None]:
categories = ["0: anti-independence", "1: pro-independence", "2: unrelated"]

definitions = """
0: The statement expresses opposition to central bank independence for the Bank of England.
1: The statement expresses support for central bank independence for the Bank of England.
2: The statement does not contain a clear expression of support for or opposition to Bank of England central bank independence or is on an unrelated topic (e.g. European Central Bank).
"""

In [None]:
def classify_text(text):
    messages = [
        {"role": "system", "content": f"""
            You are a skilled research assistant who will help to classify parliamentary interventions on central bank independence.
            Central bank independence can relate to formal independence (the legal provisions that guarantee the central bank's autonomy, such as its mandate, its organizational structure, and the procedures for appointing its leaders), and actual independence (taking into account factors such as its political and institutional environment, its relationship with the government, and the level of transparency and accountability in its operations).
            Classify the following text into one of the given categories: {categories}\n{definitions}
            Only include the number of the selected category in your response and no further text.
            """},
        {"role": "user", "content": text},
    ]

    # Turn the messages into a single prompt string using the model's chat template
    prompt = pipeline.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    # Use the pipeline to generate output
    outputs = pipeline(
        prompt,
        do_sample=False,  # Greedy decoding, i.e. deterministic output
    )

    # The generated text starts with the prompt, so cut the prompt off
    generated_text = outputs[0]["generated_text"][len(prompt):].strip()
    # Return the first word as the classification (empty string if nothing was generated)
    words = generated_text.split()
    return words[0] if words else ""
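
One detail of the system prompt is worth noting: `categories` is a Python list, so inserting it into an f-string yields the list's Python representation, including brackets and quotation marks. The short sketch below shows the difference between interpolating the list directly and joining its items first. Whether this affects the classification is an empirical question that this notebook does not test; `category_text` is a name introduced here for illustration only.

```python
categories = ["0: anti-independence", "1: pro-independence", "2: unrelated"]

# Interpolating the list directly yields its Python representation
print(f"Categories: {categories}")

# Joining the items first yields plain text
category_text = "; ".join(categories)
print(f"Categories: {category_text}")
```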

In [None]:
df_cbi['category'] = df_cbi['text'].apply(classify_text)


In [None]:
df_cbi["category"].value_counts()

category,count
2,63
1,52
0,35


In [None]:
df_cbi["category"] = df_cbi["category"].astype(int)

In [None]:
df_results_cbi.head()

In [None]:
# Load the scikit-learn metrics to see how well Llama performed

from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score, balanced_accuracy_score

# scikit-learn expects the gold labels first (y_true) and the predictions second (y_pred)
metrics = {
    "Metric": ["F1 Score (macro)", "F1 Score (micro)", "Balanced Accuracy"],
    "Value": [
        f1_score(df_results_cbi["results_number"], df_cbi["category"], average='macro'),
        f1_score(df_results_cbi["results_number"], df_cbi["category"], average='micro'),
        balanced_accuracy_score(df_results_cbi["results_number"], df_cbi["category"])
    ]
}

# Convert the dictionary into a DataFrame for nice tabular representation
results_df = pd.DataFrame(metrics)

# Display the results table
results_df

Unnamed: 0,Metric,Value
0,F1 Score (macro),0.550619
1,F1 Score (micro),0.606667
2,Balanced Accuracy,0.669162


In [None]:
# Calculating metrics per class
# Gold labels come first, predictions second; swap in the predictions of any other model to evaluate it
precision_per_class = precision_score(df_results_cbi["results_number"], df_cbi["category"], average=None, labels=[0,1,2])
recall_per_class = recall_score(df_results_cbi["results_number"], df_cbi["category"], average=None, labels=[0,1,2])
f1_per_class = f1_score(df_results_cbi["results_number"], df_cbi["category"], average=None, labels=[0,1,2])

# Since accuracy is a global metric (not class-specific), we will not recalculate it here.

# Create a DataFrame from the metrics
metrics_per_class_df = pd.DataFrame({
    "Class": [0, 1, 2],
    "Precision": precision_per_class,
    "Recall": recall_per_class,
    "F1 Score": f1_per_class
})

# Display the results table
metrics_per_class_df

Unnamed: 0,Class,Precision,Recall,F1 Score
0,0,0.228571,0.8,0.355556
1,1,0.903846,0.552941,0.686131
2,2,0.571429,0.654545,0.610169
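
When reading per-class tables like the one above, it helps to keep in mind that scikit-learn's metric functions take the gold labels as the first argument (`y_true`) and the predictions as the second (`y_pred`). The F1 score of a class is unaffected by the order, but precision and recall trade places if the two are swapped, and balanced accuracy (the mean of the per-class recalls) changes as well. The following sketch illustrates this in plain Python on made-up toy data; `per_class_scores` is a hypothetical helper, not part of scikit-learn.

```python
def per_class_scores(gold, predicted, label):
    """Precision and recall for one class, computed from two lists of labels."""
    tp = sum(g == label and p == label for g, p in zip(gold, predicted))
    n_predicted = sum(p == label for p in predicted)
    n_gold = sum(g == label for g in gold)
    precision = tp / n_predicted if n_predicted else 0.0
    recall = tp / n_gold if n_gold else 0.0
    return precision, recall

# Toy data (made up for illustration, not taken from the corpus)
gold = [0, 0, 1, 1, 1, 1, 2, 2]
predicted = [0, 1, 1, 1, 1, 2, 2, 2]

for label in [0, 1, 2]:
    print(label, per_class_scores(gold, predicted, label))

# Balanced accuracy is the mean of the per-class recalls
recalls = [per_class_scores(gold, predicted, label)[1] for label in [0, 1, 2]]
balanced_accuracy = sum(recalls) / len(recalls)
print("Balanced accuracy:", balanced_accuracy)

# Swapping the two arguments swaps precision and recall
print(per_class_scores(gold, predicted, 0), per_class_scores(predicted, gold, 0))
```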
