<a href="https://colab.research.google.com/github/joshcova/LLMs-for-social-scientists/blob/main/code/02_classifiers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BERT classifiers

In this notebook we showcase how researchers can use BERT (Bidirectional Encoder Representations from Transformers) to classify texts into a set of predefined categories or labels. Since its release in 2018 by a team of researchers at Google ([Devlin et al. 2018](https://aclanthology.org/N19-1423.pdf)), the BERT language model has been widely used and has been shown to outperform many earlier supervised and unsupervised machine learning methods on natural language processing tasks. There is also a wide range of BERT-like language models designed for other natural languages (e.g. French, Italian, German).

Here you can find out more about what other models are available: [Hugging Face - BERT](https://huggingface.co/docs/transformers/en/model_doc/bert)

You can download this Jupyter notebook and run it in your own local Python environment, but it was written in [**Google Colab**](https://colab.research.google.com/#scrollTo=5fCEDCU_qrC0).

Google Colab is an interactive interface hosted by Google that allows you to 'write and execute Python in your browser' without having to configure anything on your PC. In addition, users can purchase compute units and thereby rent server capacity to run more computationally demanding models.

## Load libraries

In [None]:
%pip install AugmentedSocialScientist

In [3]:
import pandas as pd
import numpy as np
from AugmentedSocialScientist.models import Bert


We will use the transformer-based BERT classifier provided by the [AugmentedSocialScientist library](https://github.com/rubingshen/AugmentedSocialScientist) developed by Do et al. (2022).

We will conduct this analysis on both of our corpora: the **media corpus** and the **central bank speech corpus**.

## Media corpus

In [4]:
df_media = pd.read_csv("https://raw.githubusercontent.com/joshcova/LLMs-for-social-scientists/main/data/uk_media_2.csv")

In [5]:
# Split the dataset: we first train and test the classifier on one part of the dataset
# (rows 1 to 2999; the end of a slice is exclusive) and then apply it to the remainder as a second step

df_media_1 = df_media[1:3000]

In [6]:
df_media_1 = df_media_1.rename(columns = {"majortopic":"label"})

In [7]:
# This is the dataset on which we will test how well our classification strategy did

pred_data = df_media[3001:6730]

In [8]:
# We remove the labels and keep only the texts
# (.copy() avoids a pandas SettingWithCopyWarning when we add prediction columns later)

pred_data = pred_data[["text"]].copy()
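
The slices above follow Python's zero-based, end-exclusive convention, so `df_media[1:3000]` holds 2,999 rows and neither row 0 nor row 3000 ends up in either part. The sketch below illustrates this on a made-up ten-row DataFrame (`toy` and all other names in it are hypothetical), and shows one possible way to split without leaving rows out.

```python
import pandas as pd

# Toy stand-in for df_media: ten rows with a default 0..9 index
toy = pd.DataFrame({"text": [f"headline {i}" for i in range(10)]})

# Same pattern as in the notebook: start is inclusive, stop is exclusive
first_part = toy[1:4]     # rows 1, 2, 3 (row 0 is skipped)
second_part = toy[5:10]   # rows 5 to 9 (row 4 is skipped)

# Rows that end up in neither part
used = set(first_part.index) | set(second_part.index)
unused = sorted(set(toy.index) - used)
print(unused)

# An alternative that uses every row exactly once
head_part = toy.iloc[:4]  # rows 0 to 3
tail_part = toy.iloc[4:]  # rows 4 to 9
print(len(head_part) + len(tail_part) == len(toy))
```

Whether the skipped rows matter depends on the analysis; here they only affect two headlines.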

In [9]:
# To train our classifier we need to divide the dataset into a training and a testing dataset. The training set allows the algorithms that power the BERT
# classifier to *learn* how our data is structured; the testing set is used to check how well they have learned it.

df_train = df_media_1.sample(frac=0.70)
df_test = df_media_1.drop(df_train.index)
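
Because `sample()` draws rows at random, the split above changes from run to run. The following is a minimal sketch, on made-up data, of a variant that is repeatable and keeps the class shares equal in both parts. The DataFrame `toy`, the seed 42 and the per-label sampling are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

# Made-up labelled data: 40 texts in three classes of unequal size
toy = pd.DataFrame({
    "text": [f"headline {i}" for i in range(40)],
    "label": [0] * 20 + [1] * 10 + [2] * 10,
})

# Draw 70% of the rows within each label so that class shares are preserved;
# random_state fixes the random draw so the split can be repeated
toy_train = toy.groupby("label").sample(frac=0.70, random_state=42)
toy_test = toy.drop(toy_train.index)

# Repeating the draw with the same seed selects the same rows
again = toy.groupby("label").sample(frac=0.70, random_state=42)
print(toy_train.index.equals(again.index))

print(toy_train["label"].value_counts().sort_index().to_dict())
print(toy_test["label"].value_counts().sort_index().to_dict())
```

The same `random_state` argument can be passed to a plain `df.sample(frac=0.70, random_state=...)` if stratification is not needed.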

In [10]:
# We initialize the Bert classifier.
# Note that the classifier is also available for other languages, but given that our corpus is in English we will keep the default option

bert = Bert()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


There are 1 GPU(s) available.
We will use GPU 0: Tesla T4


In [11]:
train_loader = bert.encode(
    df_train.text.values,      #list of texts
    df_train.label.values      #list of labels
    )

  0%|          | 0/2099 [00:00<?, ?it/s]

  0%|          | 0/2099 [00:00<?, ?it/s]

label ids: {0: 0, 1: 1, 2: 2}


In [12]:
test_loader = bert.encode(
    df_test.text.values,       #list of texts
    df_test.label.values       #list of labels
    )


  0%|          | 0/900 [00:00<?, ?it/s]

  0%|          | 0/900 [00:00<?, ?it/s]

label ids: {0: 0, 1: 1, 2: 2}


In [13]:
# Here we fine-tune the model on the training data; it is evaluated on the test data after each epoch

scores = bert.run_training(
    train_loader,             #training dataloader
    test_loader,              #test dataloader
    lr=5e-5,                  #learning rate
    n_epochs=3,               #number of epochs
    random_state=1,          #random state (for replicability)
    save_model_as='media_analysis' #name of model to save as
)

model.safetensors:   0%|          | 0.00/436M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Training...
  Batch    40  of     66.    Elapsed: 0:00:10.

  Average training loss: 0.54
  Training took: 0:00:15

Running Validation...

  Average test loss: 0.28
  Validation took: 0:00:01
              precision    recall  f1-score   support

           0       0.92      0.91      0.91       508
           1       0.80      0.93      0.86       108
           2       0.90      0.87      0.88       284

    accuracy                           0.90       900
   macro avg       0.87      0.90      0.88       900
weighted avg       0.90      0.90      0.90       900


Training...
  Batch    40  of     66.    Elapsed: 0:00:09.

  Average training loss: 0.18
  Training took: 0:00:14

Running Validation...

  Average test loss: 0.30
  Validation took: 0:00:01
              precision    recall  f1-score   support

           0       0.91      0.94      0.92       508
           1       0.85      0.92      0.88       108
           2       0.93      0.85      0.89       284

    accuracy   

Now it is time to apply the model to an unlabeled dataset.

In [14]:
pred_loader = bert.encode(pred_data.text.values) #input a list of unlabeled texts

  0%|          | 0/3729 [00:00<?, ?it/s]

  0%|          | 0/3729 [00:00<?, ?it/s]

In [15]:
pred = bert.predict_with_model(
    pred_loader,
    model_path="./models/media_analysis"  #relative path, so it also works outside Colab (where the working directory is /content)
    )

  0%|          | 0/117 [00:00<?, ?it/s]

label ids: {0: 0, 1: 1, 2: 2}


In [16]:
pred_data['pred_label'] = np.argmax(pred, axis=1)
pred_data['pred_proba'] = np.max(pred, axis=1)

A nice feature of BERT classifiers is that they also provide a predicted probability for each label, which we store alongside the predicted label.
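
To make the `np.argmax` and `np.max` step concrete, here is a minimal sketch with a made-up prediction matrix `toy_pred`. It assumes, as the `pred_proba` values in this notebook suggest, that each row holds one probability per class.

```python
import numpy as np

# Made-up prediction matrix: one row per text, one column per class
toy_pred = np.array([
    [0.81, 0.10, 0.09],
    [0.02, 0.01, 0.97],
    [0.30, 0.45, 0.25],
])

# Column index of the largest value in each row = predicted class
pred_label = np.argmax(toy_pred, axis=1)

# The largest value itself = probability attached to that class
pred_proba = np.max(toy_pred, axis=1)

print(pred_label)
print(pred_proba)
```

The third row shows why the probability is worth keeping: class 1 wins, but only narrowly.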

In [17]:
pred_data.head()

|      | text | pred_label | pred_proba |
|------|------|------------|------------|
| 3001 | The heavy hand of the Spanish police | 0 | 0.810919 |
| 3002 | Queen's police were slow to react | 2 | 0.996655 |
| 3003 | Pitfalls of civil nuptials in Greece | 0 | 0.995636 |
| 3004 | Two jailed after Countryman | 2 | 0.996795 |
| 3005 | Jobless total sets new record | 1 | 0.991864 |


In [18]:
validate_df = df_media[3001:6730]

In [19]:
validate_df.head()

|      | id | text | majortopic |
|------|----|------|------------|
| 3001 | 3002 | The heavy hand of the Spanish police | 2 |
| 3002 | 3003 | Queen's police were slow to react | 2 |
| 3003 | 3004 | Pitfalls of civil nuptials in Greece | 0 |
| 3004 | 3005 | Two jailed after Countryman | 2 |
| 3005 | 3006 | Jobless total sets new record | 1 |


In [20]:
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score, balanced_accuracy_score

In [21]:
metrics = {
    "Metric": ["F1 Score (macro)", "F1 Score (micro)", "Balanced Accuracy"],
    "Value": [
        f1_score(validate_df["majortopic"], pred_data["pred_label"], average='macro'),
        f1_score(validate_df["majortopic"], pred_data["pred_label"], average='micro'),
        balanced_accuracy_score(validate_df["majortopic"], pred_data["pred_label"])
    ]
}

In [22]:
results_df = pd.DataFrame(metrics)

# Display the results table
results_df

|   | Metric | Value |
|---|--------|-------|
| 0 | F1 Score (macro) | 0.894928 |
| 1 | F1 Score (micro) | 0.905337 |
| 2 | Balanced Accuracy | 0.905858 |
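
The three summary metrics differ in how they average over classes. The sketch below recomputes them by hand with NumPy on ten made-up labels (`y_true` and `y_pred` are invented for illustration and are not the media data), so that the difference between macro and micro averaging is visible.

```python
import numpy as np

# Made-up true and predicted labels for ten texts
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 1, 2, 2, 0, 2])
classes = [0, 1, 2]

def per_class_counts(y_true, y_pred, c):
    """Return true positives, false positives and false negatives for class c."""
    tp = int(np.sum((y_true == c) & (y_pred == c)))
    fp = int(np.sum((y_true != c) & (y_pred == c)))
    fn = int(np.sum((y_true == c) & (y_pred != c)))
    return tp, fp, fn

counts = [per_class_counts(y_true, y_pred, c) for c in classes]

# Macro F1: compute F1 for each class, then take the unweighted mean
f1_per_class = [2 * tp / (2 * tp + fp + fn) for tp, fp, fn in counts]
macro_f1 = float(np.mean(f1_per_class))

# Micro F1: pool the counts over all classes first, then compute one F1
tp_all = sum(tp for tp, _, _ in counts)
fp_all = sum(fp for _, fp, _ in counts)
fn_all = sum(fn for _, _, fn in counts)
micro_f1 = 2 * tp_all / (2 * tp_all + fp_all + fn_all)

# Balanced accuracy: unweighted mean of the per-class recall
balanced_acc = float(np.mean([tp / (tp + fn) for tp, _, fn in counts]))

print(round(macro_f1, 4), round(micro_f1, 4), round(balanced_acc, 4))
```

In a single-label setting like this one, micro F1 equals plain accuracy, whereas macro F1 and balanced accuracy give every class the same weight regardless of its size.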


In [23]:
# Calculating metrics per class
# To evaluate a different model, replace pred_data["pred_label"] with that model's predicted labels
precision_per_class = precision_score(validate_df["majortopic"], pred_data["pred_label"], average=None, labels=[0,1,2])
recall_per_class = recall_score(validate_df["majortopic"], pred_data["pred_label"], average=None, labels=[0,1,2])
f1_per_class = f1_score(validate_df["majortopic"], pred_data["pred_label"], average=None, labels=[0,1,2])

# Since accuracy is a global metric (not class-specific), we will not recalculate it here.

# Create a DataFrame from the metrics
metrics_per_class_df = pd.DataFrame({
    "Class": [0, 1, 2],
    "Precision": precision_per_class,
    "Recall": recall_per_class,
    "F1 Score": f1_per_class
})

# Display the results table
metrics_per_class_df

|   | Class | Precision | Recall | F1 Score |
|---|-------|-----------|--------|----------|
| 0 | 0 | 0.895414 | 0.929193 | 0.911991 |
| 1 | 1 | 0.819533 | 0.910377 | 0.86257 |
| 2 | 2 | 0.944898 | 0.878003 | 0.910223 |
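
Per-class precision and recall can also be read off a confusion matrix, which additionally shows which classes get mistaken for which. The following sketch builds one with NumPy from made-up labels; the arrays are invented for illustration and do not come from the media corpus.

```python
import numpy as np

# Made-up true and predicted labels for ten texts
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 1, 2, 2, 0, 2])
n_classes = 3

# Rows are true labels, columns are predicted labels
confusion = np.zeros((n_classes, n_classes), dtype=int)
np.add.at(confusion, (y_true, y_pred), 1)
print(confusion)

# Precision: correct predictions divided by everything predicted as that class (column sums)
precision = np.diag(confusion) / confusion.sum(axis=0)

# Recall: correct predictions divided by everything that truly is that class (row sums)
recall = np.diag(confusion) / confusion.sum(axis=1)

print(precision.round(3))
print(recall.round(3))
```

A column sum of zero (a class that is never predicted) would lead to a division by zero here, which is the situation scikit-learn flags with an undefined-metric warning.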


The classifier performs well on the comparatively simple task of classifying newspaper headlines (macro F1 of about 0.89 on the held-out rows). As the next section shows, this is not necessarily the case for more complex tasks.

## Central bank corpus

In [None]:
df_cbi = pd.read_csv("https://raw.githubusercontent.com/joshcova/LLMs-for-social-scientists/main/data/uk_cbi_sample.csv")

In [None]:
df_cbi = df_cbi[["sents", "results_number"]]

In [None]:
df_cbi = df_cbi.rename(columns = {"sents":"text", "results_number":"label"})

In [None]:
df_cbi_train = df_cbi.sample(frac=0.70)
df_cbi_test = df_cbi.drop(df_cbi_train.index)

In [None]:
train_loader_cbi = bert.encode(
    df_cbi_train.text.values,      #list of texts
    df_cbi_train.label.values      #list of labels
    )

  0%|          | 0/105 [00:00<?, ?it/s]

  0%|          | 0/105 [00:00<?, ?it/s]

label ids: {0: 0, 1: 1, 2: 2}


In [None]:
test_loader_cbi = bert.encode(
    df_cbi_test.text.values,      #list of texts
    df_cbi_test.label.values      #list of labels
    )

  0%|          | 0/45 [00:00<?, ?it/s]

  0%|          | 0/45 [00:00<?, ?it/s]

label ids: {0: 0, 1: 1, 2: 2}


In [None]:
scores_cbi = bert.run_training(
    train_loader_cbi,             #training dataloader
    test_loader_cbi,              #test dataloader
    lr=5e-5,                  #learning rate
    n_epochs=3,               #number of epochs
    random_state=1,          #random state (for replicability)
    save_model_as='cbi_model' #name of model to save as
)

model.safetensors:   0%|          | 0.00/436M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Training...

  Average training loss: 1.02
  Training took: 0:00:06

Running Validation...

  Average test loss: 0.90
  Validation took: 0:00:00
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         5
           1       0.62      1.00      0.77        28
           2       0.00      0.00      0.00        12

    accuracy                           0.62        45
   macro avg       0.21      0.33      0.26        45
weighted avg       0.39      0.62      0.48        45


Training...


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))



  Average training loss: 0.85
  Training took: 0:00:04

Running Validation...

  Average test loss: 0.94
  Validation took: 0:00:00
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         5
           1       0.63      0.61      0.62        28
           2       0.17      0.25      0.20        12

    accuracy                           0.44        45
   macro avg       0.27      0.29      0.27        45
weighted avg       0.44      0.44      0.44        45


Training...





  Average training loss: 0.75
  Training took: 0:00:04

Running Validation...

  Average test loss: 0.92
  Validation took: 0:00:00
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         5
           1       0.59      0.79      0.68        28
           2       0.12      0.08      0.10        12

    accuracy                           0.51        45
   macro avg       0.24      0.29      0.26        45
weighted avg       0.40      0.51      0.45        45


Training complete!




In [None]:
df_cbi_val = pd.read_csv("https://raw.githubusercontent.com/joshcova/LLMs-for-social-scientists/main/data/sample_validate_cbi.csv")

In [None]:
df_cbi_val_text = df_cbi_val[["sents"]]

In [None]:
# Only the "sents" column was kept above, so it is the only column that needs renaming
df_cbi_val_text = df_cbi_val_text.rename(columns = {"sents":"text"})

In [None]:
pred_loader_cbi = bert.encode(df_cbi_val_text.text.values) #input a list of unlabeled texts

  0%|          | 0/149 [00:00<?, ?it/s]

  0%|          | 0/149 [00:00<?, ?it/s]

In [None]:
pred_cbi = bert.predict_with_model(
    pred_loader_cbi,
    model_path="./models/cbi_model"  #relative path, so it also works outside Colab (where the working directory is /content)
    )

  0%|          | 0/5 [00:00<?, ?it/s]

label ids: {0: 0, 1: 1, 2: 2}


In [None]:
df_cbi_val_text['pred_label'] = np.argmax(pred_cbi, axis=1)
df_cbi_val_text['pred_proba'] = np.max(pred_cbi, axis=1)

In [None]:
df_cbi_val_text.head()

|   | text | pred_label | pred_proba |
|---|------|------------|------------|
| 0 | far from being afraid, britain should welcome... | 1 | 0.53635 |
| 1 | i am not saying that this should happen becaus... | 2 | 0.526135 |
| 2 | we think that it is important to strengthen, r... | 1 | 0.654744 |
| 3 | despite what the economic secretary said earl... | 1 | 0.60178 |
| 4 | to clarify, let me point out that the financia... | 2 | 0.511647 |
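
The predicted probabilities shown above are much lower than for the media headlines, which suggests the model is unsure about these sentences. One possible use of `pred_proba` is to flag uncertain predictions for manual checking. The sketch below does this for the five displayed values; the cut-off of 0.60 is a hypothetical choice made only for illustration.

```python
import numpy as np

# The five predicted probabilities displayed in the table above
shown_proba = np.array([0.53635, 0.526135, 0.654744, 0.60178, 0.511647])

# Hypothetical cut-off, chosen only for illustration
threshold = 0.60

# True where the model's top probability falls below the cut-off
needs_review = shown_proba < threshold

print(needs_review)
print(int(needs_review.sum()), "of", len(shown_proba), "rows fall below the threshold")
```

On the full data the same comparison could be applied to the `pred_proba` column directly, for example `df_cbi_val_text["pred_proba"] < threshold`.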


In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score, balanced_accuracy_score

In [None]:
metrics = {
    "Metric": ["F1 Score (macro)", "F1 Score (micro)", "Balanced Accuracy"],
    "Value": [
        f1_score(df_cbi_val["label_number"], df_cbi_val_text["pred_label"], average='macro'),
        f1_score(df_cbi_val["label_number"], df_cbi_val_text["pred_label"], average='micro'),
        balanced_accuracy_score(df_cbi_val["label_number"], df_cbi_val_text["pred_label"])
    ]
}

In [None]:
results_df_cbi = pd.DataFrame(metrics)

# Display the results table
results_df_cbi

|   | Metric | Value |
|---|--------|-------|
| 0 | F1 Score (macro) | 0.350947 |
| 1 | F1 Score (micro) | 0.516779 |
| 2 | Balanced Accuracy | 0.390034 |


In [None]:
# Calculating metrics per class
# To evaluate a different model, replace df_cbi_val_text["pred_label"] with that model's predicted labels
precision_per_class = precision_score(df_cbi_val["label_number"], df_cbi_val_text["pred_label"], average=None, labels=[0,1,2])
recall_per_class = recall_score(df_cbi_val["label_number"], df_cbi_val_text["pred_label"], average=None, labels=[0,1,2])
f1_per_class = f1_score(df_cbi_val["label_number"], df_cbi_val_text["pred_label"], average=None, labels=[0,1,2])

# Since accuracy is a global metric (not class-specific), we will not recalculate it here.

# Create a DataFrame from the metrics
metrics_per_class_df_cbi = pd.DataFrame({
    "Class": [0, 1, 2],
    "Precision": precision_per_class,
    "Recall": recall_per_class,
    "F1 Score": f1_per_class
})

# Display the results table
metrics_per_class_df_cbi

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


|   | Class | Precision | Recall | F1 Score |
|---|-------|-----------|--------|----------|
| 0 | 0 | 0.0 | 0.0 | 0.0 |
| 1 | 1 | 0.5 | 0.820896 | 0.621469 |
| 2 | 2 | 0.564103 | 0.349206 | 0.431373 |
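
One way to put the central bank results into perspective is to compare them with a trivial baseline that always predicts the most frequent class. The sketch below works this out for the 45 test sentences, using the class supports (5, 28, 12) printed in the training reports above; everything else in it is an illustrative assumption.

```python
import numpy as np

# Class supports from the test-split reports above (classes 0, 1, 2)
support = np.array([5, 28, 12])
majority = int(np.argmax(support))

# Rebuild a label vector with these supports and predict the majority class everywhere
y_true = np.repeat(np.arange(3), support)
y_pred = np.full_like(y_true, majority)

accuracy = float(np.mean(y_true == y_pred))

# Per-class F1, set to 0 when a class has no true or predicted members
f1_per_class = []
for c in range(3):
    tp = np.sum((y_true == c) & (y_pred == c))
    fp = np.sum((y_true != c) & (y_pred == c))
    fn = np.sum((y_true == c) & (y_pred != c))
    denom = 2 * tp + fp + fn
    f1_per_class.append(float(2 * tp / denom) if denom > 0 else 0.0)
macro_f1 = float(np.mean(f1_per_class))

print(round(accuracy, 2), round(macro_f1, 2))
```

This baseline reaches an accuracy of 0.62 and a macro F1 of 0.26, the same figures as the first-epoch report, in which every sentence was assigned to class 1. After three epochs the test accuracy is 0.51, so with only 105 training sentences the fine-tuned model does not yet beat the baseline on accuracy.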
