<a href="https://colab.research.google.com/github/joshcova/LLMs-for-social-scientists/blob/main/code/open_source_llm_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Open source LLMs

We first need to install the relevant packages. Note that the `transformers` version is pinned to 4.40.0.

In [None]:
!pip install "transformers==4.40.0"

Collecting transformers==4.40.0
  Downloading transformers-4.40.0-py3-none-any.whl.metadata (137 kB)
Collecting tokenizers<0.20,>=0.19 (from transformers==4.40.0)
  Downloading tokenizers-0.19.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Downloading transformers-4.40.0-py3-none-any.whl (9.0 MB)
Downloading tokenizers-0.19.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB)
Installing collected packages: tokenizers, transformers
  Attempting uninstall: tokenizer

For this text classification example we will use one of the most widely used LLMs: **Meta's Llama model**.

Unlike most commercial LLMs, Llama is **free** to use and its **model weights are publicly available**, which is useful for advanced fine-tuning and customization. Moreover, Llama models come in different sizes (e.g. 8B and 70B parameters for Llama 3), which makes them accessible to users with different computational resources. Finally, Llama can be used for both commercial and research purposes, subject to Meta's license terms.

Nevertheless, it is important to note that Meta does **not disclose what data its foundation models were trained on**, which has led to criticism that the model is not as open-source as often claimed.

As there are different Llama models available, for the purposes of this exercise we will use the model `meta-llama/Meta-Llama-3-8B-Instruct` (https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct).

Although the 8B parameter model is lighter than some of the other Llama models that are available, it is nevertheless quite large.

This is why we will need to connect to a GPU to run this code (e.g. through a third-party provider if we do not have the computational resources ourselves).
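
As a rough back-of-the-envelope check of why a GPU is needed, the sketch below estimates the memory taken up by the weights alone. It assumes a round figure of 8 billion parameters and ignores activations and other runtime overhead, so the real footprint will be somewhat higher.

```python
# Rough estimate of the memory needed just to hold the model weights.
# Assumption: a round 8 billion parameters (the "8B" in the model name).
N_PARAMS = 8_000_000_000

# Storage size of one parameter in each numeric format.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_memory_gb(n_params: int, dtype: str) -> float:
    """Return the size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_memory_gb(N_PARAMS, dtype):.0f} GB")
```

In half precision (`bfloat16`, which is what we load the model in later in this notebook) the weights come to roughly 16 GB, half of what full precision would require.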

Before starting our analysis, we first need to carry out two steps:

1. Request access to Meta's Llama models on Hugging Face by accepting the terms and conditions (https://huggingface.co/meta-llama/Meta-Llama-3-8B)
2. Create an access token on Hugging Face (https://huggingface.co/docs/hub/en/security-tokens)
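
For non-interactive runs, the token from step 2 can also be passed programmatically rather than typed in. The sketch below assumes the token is stored in an environment variable called `HF_TOKEN`; the helper names `get_hf_token` and `login_to_hub` are made up for this illustration, while `login` is the function provided by the `huggingface_hub` package.

```python
import os


def get_hf_token(env=os.environ, var_name="HF_TOKEN"):
    """Read the access token from an environment variable (the name is an assumption)."""
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {var_name} environment variable to your access token.")
    return token


def login_to_hub():
    """Authenticate the current session with the Hugging Face Hub."""
    # Imported here so that get_hf_token can be used without the package installed.
    from huggingface_hub import login

    login(token=get_hf_token())
```

Keeping the token in an environment variable (or a Colab secret) avoids pasting it into a notebook that might later be shared.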

In this notebook, we will test how well Llama performs on a text classification task using two different corpora. We will focus on the following aspects:

1. Pre-processing
2. The media corpus of UK newspaper headlines
3. The corpus of parliamentary discussions on central bank independence

# 1. Pre-processing: Hugging Face and Llama

Now log in with your secret Hugging Face access token.

In [None]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: fineGrained).
The token `os_token_gen1` has been saved to /root/.cache/huggingface/stored_tokens
Your token has been saved to /root/.cache/huggingface/token
Login successful.
The current active token is: `os_t

In [None]:
# Load libraries

import transformers
import torch
import pandas as pd

In [None]:
# Initialize model (depending on your computational resources, loading might take some time)

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

pipeline = transformers.pipeline(
    task="text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",
)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Let us test the model by classifying a dummy example.

Note that in this example we also instruct the model to explain why it chose the category.

In [None]:
messages = [
    {"role": "system", "content": "Based on the text, your task is to classify the following newspaper headlines into one of the following 3 categories: 1. Macroeconomics; 2. Law & Crime; 3. Others. Please motivate your answer."},
    {"role": "user", "content": "Interest rates expected to raise in the next month"},
]

full_prompt = pipeline.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
)
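
As a rough illustration of what `apply_chat_template` returns with `tokenize=False`, the sketch below builds a prompt string by hand. The special tokens follow the chat format published for Llama 3, but `render_llama3_prompt` is a hypothetical helper that only serves to make the structure visible; the template shipped with the tokenizer remains the reference.

```python
# Illustration only: a hand-written approximation of the Llama 3 chat format.
# In practice, rely on tokenizer.apply_chat_template, which reads the template
# stored alongside the model.
def render_llama3_prompt(messages, add_generation_prompt=True):
    prompt = "<|begin_of_text|>"
    for message in messages:
        # Each turn is wrapped in a role header and closed with an end-of-turn token.
        prompt += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so that the model continues with its answer.
        prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt


example = [
    {"role": "system", "content": "Classify the headline."},
    {"role": "user", "content": "Interest rates expected to rise"},
]
print(render_llama3_prompt(example))
```

Because the pipeline returns the prompt followed by the model's continuation, the generated answer is recovered in the next cell by slicing off the first `len(full_prompt)` characters.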


In [None]:
outputs = pipeline(
    full_prompt,
    do_sample=False,  # greedy decoding; temperature only applies when sampling
)
print(outputs[0]["generated_text"][len(full_prompt):])

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


I would classify this headline as: 1. Macroeconomics

Motivation: The headline is about interest rates, which is a key aspect of macroeconomics. Macroeconomics is the study of the economy as a whole, including factors such as inflation, unemployment, and interest rates. The expectation of a rate hike is a significant economic event that can have far-reaching impacts on the economy, making it a macroeconomic topic.


Interesting! But do we really need the additional explanation? Let us simplify the prompt.

In [None]:
messages = [
    {"role": "system", "content": " Based on the text, your task is to classify the following newspaper headlines into one of the following 3 categories: 1. Macroeconomics; 2. Law & Crime; 3. Others. Please answer only with the name of the category."},
    {"role": "user", "content": "Interest rates expected to raise in the next month"},
]

prompt = pipeline.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
)


In [None]:
outputs = pipeline(
    prompt,
    do_sample=False,  # greedy decoding; temperature only applies when sampling
)
print(outputs[0]["generated_text"][len(prompt):])

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


1. Macroeconomics


# 2. Media corpus of UK newspaper headlines

The model seems to work rather well on this dummy example. Let us now scale it up by using our dataset.

In [None]:
df = pd.read_csv("/content/drive/MyDrive/Media_analysis/uk_media_2.csv")

Let us remove the labels assigned by human annotators from the original data frame. The resulting data frame thus contains only an id variable and the text.

We keep the human labels, which serve as the gold standard, in a separate data frame (`df_topic`).

In [None]:
df_topic = df[["majortopic"]]

In [None]:
df = df[["id", "text"]]

In [None]:
df.head()

,id,text
0,1,Peeping Toms
1,2,Police Officer Praised For Car Struggle
2,3,Broadmoor Escape Inquiry Urged
3,4,Police Interview Provost No Action Under Local...
4,5,Molotov's Private Thoughts On Germany Free Ele...


In [None]:
def classify_text(text):
    messages = [
        {"role": "system", "content":    "Based on the text, your task is to classify the following newspaper headlines "
                "into one of the following 3 categories: 1. Macroeconomics; 2. Law & Crime; 0. Others. "
                "Please answer only with the number assigned to the category."},
        {"role": "user", "content": text},
    ]

    # Generate the classification prompt
    prompt = pipeline.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    # Use the pipeline to generate output
    outputs = pipeline(
        prompt,
        do_sample=False,  # Greedy decoding gives deterministic output; temperature only applies when sampling
    )

    # Extract and clean the generated classification
    generated_text = outputs[0]["generated_text"][len(prompt):].strip()
    # Return the first word as the classification
    return generated_text.split()[0]

In [None]:
df['category'] = df['text'].apply(classify_text)


Streaming output truncated to the last 5000 lines.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.

In [None]:
df.head()

,id,text,category
0,1,Peeping Toms,2
1,2,Police Officer Praised For Car Struggle,2
2,3,Broadmoor Escape Inquiry Urged,2
3,4,Police Interview Provost No Action Under Local...,2
4,5,Molotov's Private Thoughts On Germany Free Ele...,2


In [None]:
df["category"].value_counts()

category,count
2,4143
0,1424
1,1164


In [None]:
# Recode any answer other than "0", "1" or "2" as "0" (Others) to catch malformed LLM output.
df["category"] = df["category"].where(df["category"].isin(["0", "1", "2"]), "0")
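
The clean-up above sends every unexpected answer to category 0, including near misses such as `2.` or `Category: 2`. In addition, `generated_text.split()[0]` in `classify_text` raises an `IndexError` if the model returns an empty string. A slightly more forgiving alternative, sketched below with a hypothetical helper `parse_label`, is to search the raw answer for the first standalone digit between 0 and 2 and to fall back to a default only when none is found.

```python
import re


def parse_label(generated_text: str, fallback: str = "0") -> str:
    """Return the first standalone digit 0, 1 or 2 in the text, else the fallback."""
    # \b ensures that the digit is not part of a longer number such as "12".
    match = re.search(r"\b[0-2]\b", generated_text)
    return match.group(0) if match else fallback


for answer in ["2", "1. Macroeconomics", "Category: 2", "", "12", "I cannot classify this."]:
    print(repr(answer), "->", parse_label(answer))
```

Whether the fallback should be `"0"` or a dedicated "unparsed" marker is a design choice: a separate marker makes it easier to count how often the model failed to follow the instructions.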


In [None]:
# Import the scikit-learn utilities needed to evaluate how well Llama performed

from sklearn.metrics import confusion_matrix, classification_report, balanced_accuracy_score, accuracy_score
from sklearn.preprocessing import LabelEncoder

In [None]:
y_pred = df["category"]


In [None]:
y_pred.head()

,category
0,2
1,2
2,2
3,2
4,2


In [None]:
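# LabelEncoder sorts the string labels "0", "1", "2" and maps them to the integers 0, 1, 2,
# so that the predictions are comparable with the integer-coded human labels.
label_encoder = LabelEncoder()
category_encoded = label_encoder.fit_transform(y_pred)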

In [None]:
y_test = df_topic["majortopic"]

In [None]:
print("Confusion Matrix:")
print(confusion_matrix(y_test, category_encoded))


Confusion Matrix:
[[1233  505 1702]
 [  37  646  101]
 [ 154   13 2340]]
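
Overall accuracy and per-class precision and recall can be read directly off the confusion matrix. The sketch below hard-codes the matrix printed above and follows the scikit-learn convention that rows are the true (human) labels and columns are the predicted labels; it assumes that `majortopic` uses the same 0/1/2 coding as the prompt.

```python
# Confusion matrix copied from the output above:
# rows are the human labels (0, 1, 2), columns are Llama's predictions.
cm = [
    [1233, 505, 1702],
    [37, 646, 101],
    [154, 13, 2340],
]

n_classes = len(cm)
total = sum(sum(row) for row in cm)

# Accuracy: share of headlines on the diagonal (prediction equals human label).
accuracy = sum(cm[i][i] for i in range(n_classes)) / total
print(f"accuracy: {accuracy:.2f}")

for k in range(n_classes):
    predicted_k = sum(cm[i][k] for i in range(n_classes))  # column sum
    actual_k = sum(cm[k])  # row sum
    precision = cm[k][k] / predicted_k
    recall = cm[k][k] / actual_k
    print(f"class {k}: precision={precision:.2f}, recall={recall:.2f}")
```

Under these assumptions, the first row shows where most errors come from: 1,702 headlines that annotators placed in category 0 were assigned to category 2 by the model, which lowers recall for category 0 and precision for category 2.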


In [None]:
print("\nClassification Report:")
print(classification_report(y_test, category_encoded))



# 3. Parliamentary speech corpus on central bank independence

In [None]:
df_cbi = pd.read_excel("/content/drive/MyDrive/Media_analysis/CBI_UK_sample_labeled.xlsx")

In [None]:
df_results = df_cbi[["results_number"]]

In [None]:
df_cbi = df_cbi[["id", "sents"]]

In [None]:
df_cbi = df_cbi.rename(columns={"sents": "text"})

In [None]:
categories = ["0: anti-independence", "1: pro-independence", "2: unrelated"]

definitions = """
0: The statement expresses opposition for central bank independence. \\
1: The statement expresses support for central bank independence. \\
2: The statement does not contain a clear expression in support or opposition to central bank independence.
"""

In [None]:
def classify_text(text):
    messages = [
        {"role": "system", "content":    f"""
         You are a skilled research assistant who will help to classify parliamentary interventions on central bank independence. \\
                    Central bank independence can relate to formal independence (the legal provisions that guarantee the central bank's autonomy, such as its mandate, its organizational structure, and the procedures for appointing its leaders), and actual independence (taking into account factors such as its political and institutional environment, its relationship with the government, and the level of transparency and accountability in its operations). \\
                    Classify the following text into one of the given categories: {categories}\n{definitions} \\
                    Only include the number of the selected category in your response and no further text."
                    """},
        {"role": "user", "content": text},
    ]

    # Generate the classification prompt
    prompt = pipeline.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    # Use the pipeline to generate output
    outputs = pipeline(
        prompt,
        do_sample=False,  # Greedy decoding gives deterministic output; temperature only applies when sampling
    )

    # Extract and clean the generated classification
    generated_text = outputs[0]["generated_text"][len(prompt):].strip()
    # Return the first word as the classification
    return generated_text.split()[0]
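
When debugging a prompt it helps to look at the exact string the model receives. The short sketch below, which repeats the `categories` list from above, illustrates two details of how Python renders the system prompt: interpolating a list into an f-string inserts its Python representation (brackets and quotes included), and `\\` inside a regular string becomes a single literal backslash. Whether either detail affects the classification is an open question that is not tested here.

```python
categories = ["0: anti-independence", "1: pro-independence", "2: unrelated"]

# Interpolating the list directly inserts its Python representation.
as_list = f"Classify the following text into one of the given categories: {categories}"
print(as_list)

# Joining the items first gives plain text without brackets and quotes.
as_text = f"Classify the following text into one of the given categories: {'; '.join(categories)}"
print(as_text)

# Inside a regular (non-raw) string, "\\" is an escaped backslash,
# so a single backslash character ends up in the prompt.
print(len("\\"))
```

Printing the `content` of the system message once before classifying the whole corpus is a cheap way to catch such surprises.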

In [None]:
df_cbi['category'] = df_cbi['text'].apply(classify_text)


In [None]:
y_pred_cbi = df_cbi["category"]

In [None]:
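# LabelEncoder sorts the string labels and maps them to consecutive integers; this matches
# the human coding only if the predictions are exactly "0", "1" and "2".
label_encoder = LabelEncoder()
category_encoded_cbi = label_encoder.fit_transform(y_pred_cbi)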

In [None]:
y_test_cbi = df_results["results_number"]

In [None]:
print("\nClassification Report:")
print(classification_report(y_test_cbi, category_encoded_cbi))


Classification Report:
              precision    recall  f1-score   support

           0       0.21      0.60      0.31        10
           1       0.89      0.66      0.76        85
           2       0.60      0.64      0.62        55

    accuracy                           0.65       150
   macro avg       0.57      0.63      0.56       150
weighted avg       0.74      0.65      0.68       150
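
To see how the two summary rows relate to the per-class rows, the sketch below recomputes them from the rounded values printed in the report. The macro average weights every class equally, whereas the weighted average weights each class by its support.

```python
# Per-class scores and supports copied from the classification report above.
report = {
    0: {"precision": 0.21, "recall": 0.60, "f1": 0.31, "support": 10},
    1: {"precision": 0.89, "recall": 0.66, "f1": 0.76, "support": 85},
    2: {"precision": 0.60, "recall": 0.64, "f1": 0.62, "support": 55},
}

total_support = sum(row["support"] for row in report.values())


def macro_avg(metric):
    """Unweighted mean over classes."""
    return sum(row[metric] for row in report.values()) / len(report)


def weighted_avg(metric):
    """Mean over classes, weighted by the number of true instances per class."""
    return sum(row[metric] * row["support"] for row in report.values()) / total_support


for metric in ("precision", "recall", "f1"):
    print(f"{metric}: macro={macro_avg(metric):.2f}, weighted={weighted_avg(metric):.2f}")
```

The gap between the two averages comes mainly from the small anti-independence class: with only 10 of the 150 statements, its low precision pulls the macro average down much more than the weighted one.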

