<a href="https://colab.research.google.com/github/joshcova/LLMs-for-social-scientists/blob/main/code/classifiers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BERT classifiers

In [None]:
%pip install AugmentedSocialScientist

In [None]:
import pandas as pd
import numpy as np
from AugmentedSocialScientist.models import Bert


We will use transformer-based BERT classifiers, which are available through the AugmentedSocialScientist library introduced in Do et al. (2022).

We will conduct this analysis on both of our corpora: the media corpus and the central bank speech corpus.

In [None]:
df_media = pd.read_csv("/content/drive/MyDrive/Media_analysis/uk_media_2.csv")

In [None]:
# Split the dataset: we first train and test the classifier on one portion of the dataset (rows 1 to 2999),
# then apply it to the remainder of the dataset as a second step

df_media_1 = df_media[1:3000]

In [None]:
df_media_1 = df_media_1.rename(columns = {"majortopic":"label"})

In [None]:
pred_data = df_media[3001:6730]
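
The two slices above follow Python's zero-based, end-exclusive convention, so they do not cover every row. A minimal sketch on a toy frame (a hypothetical stand-in for `df_media`, assumed here to have 6730 rows and a default integer index) shows which rows are selected, together with one gap-free alternative that is not what the notebook uses:

```python
import pandas as pd

# Hypothetical stand-in for df_media: 6730 rows with a default integer index.
toy = pd.DataFrame({"text": [f"headline {i}" for i in range(6730)]})

labelled = toy[1:3000]       # rows 1..2999: row 0 is skipped, row 3000 is excluded
to_predict = toy[3001:6730]  # rows 3001..6729: row 3000 is skipped as well

print(len(labelled), len(to_predict))         # 2999 3729
print(labelled.index[0], labelled.index[-1])  # 1 2999

# Alternative (an assumption, not the notebook's choice): use every row exactly once.
labelled_all = toy.iloc[:3000]
to_predict_all = toy.iloc[3000:]
print(len(labelled_all) + len(to_predict_all) == len(toy))  # True
```

The row counts of the slices match the 2099 + 900 and 3729 totals reported by the encoding steps.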

In [None]:
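# Keep only the text column; .copy() lets us add prediction columns later without a SettingWithCopyWarning
pred_data = pred_data[["text"]].copy()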

In [None]:
df_train = df_media_1.sample(frac=0.70, random_state=1)  # fixed seed so that the split is replicable
df_test = df_media_1.drop(df_train.index)
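
Because `sample` draws rows at random, the split is only repeatable when a seed is passed through `random_state`. The sketch below, on a small hypothetical frame, checks that a seeded 70/30 split is identical across calls and that the training and test rows do not overlap:

```python
import pandas as pd

# Hypothetical toy data standing in for the labelled headlines.
toy = pd.DataFrame({
    "text": [f"headline {i}" for i in range(10)],
    "label": [0, 1, 2, 0, 1, 2, 0, 1, 2, 0],
})

train = toy.sample(frac=0.70, random_state=1)
test = toy.drop(train.index)  # everything not drawn for training

# Drawing again with the same seed selects the same rows in the same order.
train_again = toy.sample(frac=0.70, random_state=1)
print(train.index.equals(train_again.index))  # True

print(len(train), len(test))               # 7 3
print(set(train.index) & set(test.index))  # set()
```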

In [None]:
# We initialize the Bert classifier. Note that the classifier is also available for other languages, but given that our corpus is in English we will keep the default option

bert = Bert()

In [None]:
train_loader = bert.encode(
    df_train.text.values,      #list of texts
    df_train.label.values      #list of labels
    )

  0%|          | 0/2099 [00:00<?, ?it/s]

  0%|          | 0/2099 [00:00<?, ?it/s]

label ids: {0: 0, 1: 1, 2: 2}


In [None]:
test_loader = bert.encode(
    df_test.text.values,       #list of texts
    df_test.label.values       #list of labels
    )


  0%|          | 0/900 [00:00<?, ?it/s]

  0%|          | 0/900 [00:00<?, ?it/s]

label ids: {0: 0, 1: 1, 2: 2}


In [None]:
scores = bert.run_training(
    train_loader,             #training dataloader
    test_loader,              #test dataloader
    lr=5e-5,                  #learning rate
    n_epochs=3,               #number of epochs
    random_state=1,          #random state (for replicability)
    save_model_as='media_analysis' #name of model to save as
)

In [None]:
pred_loader = bert.encode(pred_data.text.values) #input a list of unlabeled texts

  0%|          | 0/3729 [00:00<?, ?it/s]

  0%|          | 0/3729 [00:00<?, ?it/s]

In [None]:
pred = bert.predict_with_model(
    pred_loader,
    model_path="/content/models/media_analysis"
    )

  0%|          | 0/117 [00:00<?, ?it/s]

label ids: {0: 0, 1: 1, 2: 2}


In [None]:
pred_data['pred_label'] = np.argmax(pred, axis=1)
pred_data['pred_proba'] = np.max(pred, axis=1)

A nice feature of BERT classifiers is that they also provide a predicted probability for each label; here we keep the highest one as `pred_proba`.
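
To make the two columns concrete, here is a small sketch. It assumes that `pred` is an array with one row per text and one column per label, holding class probabilities, which is what the `argmax`/`max` calls above imply. The probability values, the toy texts, and the 0.90 threshold are all hypothetical; the threshold simply illustrates how low-confidence predictions could be set aside for manual checking.

```python
import numpy as np
import pandas as pd

# Hypothetical probabilities for three texts and three labels (each row sums to 1).
pred = np.array([
    [0.76, 0.14, 0.10],
    [0.00, 0.01, 0.99],
    [0.40, 0.35, 0.25],
])

out = pd.DataFrame({"text": ["text a", "text b", "text c"]})
out["pred_label"] = np.argmax(pred, axis=1)  # column index of the most likely label
out["pred_proba"] = np.max(pred, axis=1)     # probability assigned to that label

# Arbitrary cut-off, chosen only for illustration.
threshold = 0.90
uncertain = out[out["pred_proba"] < threshold]
print(uncertain)
```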

In [None]:
pred_data.head()

|      | text                                 | pred_label | pred_proba |
|------|--------------------------------------|------------|------------|
| 3001 | The heavy hand of the Spanish police | 0          | 0.761444   |
| 3002 | Queen's police were slow to react    | 2          | 0.99784    |
| 3003 | Pitfalls of civil nuptials in Greece | 0          | 0.997114   |
| 3004 | Two jailed after Countryman          | 2          | 0.997342   |
| 3005 | Jobless total sets new record        | 1          | 0.994617   |


While the classifier performs well on the simpler task of classifying newspaper headlines, this is not necessarily the case for more complex tasks, as the central bank speech corpus illustrates.

In [None]:
df_cbi = pd.read_excel("/content/drive/MyDrive/Media_analysis/CBI_UK_sample_labeled.xlsx")

In [None]:
df_cbi = df_cbi[["sents", "results_number"]]

In [None]:
df_cbi = df_cbi.rename(columns = {"sents":"text", "results_number":"label"})

In [None]:
df_cbi_train = df_cbi.sample(frac=0.70, random_state=1)  # fixed seed so that the split is replicable
df_cbi_test = df_cbi.drop(df_cbi_train.index)
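
On a small labelled sample, a purely random split can leave a rare label with only a handful of rows in the test set. One option, which is not what this notebook does, is to draw the 70% within each label so that both parts keep the same label proportions. The sketch below uses a hypothetical, imbalanced toy frame and pandas' `groupby(...).sample`:

```python
import pandas as pd

# Hypothetical imbalanced data: 10, 80 and 60 sentences for labels 0, 1 and 2.
toy = pd.DataFrame({
    "text": [f"sentence {i}" for i in range(150)],
    "label": [0] * 10 + [1] * 80 + [2] * 60,
})

# Draw 70% of the rows within each label, then keep the rest for testing.
train = toy.groupby("label").sample(frac=0.70, random_state=1)
test = toy.drop(train.index)

print(train["label"].value_counts().sort_index().to_dict())  # {0: 7, 1: 56, 2: 42}
print(test["label"].value_counts().sort_index().to_dict())   # {0: 3, 1: 24, 2: 18}
```

Stratifying does not create more examples of the rare label; it only guarantees that the few available ones are shared between training and test data in the intended proportion.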

In [None]:
train_loader_cbi = bert.encode(
    df_cbi_train.text.values,      #list of texts
    df_cbi_train.label.values      #list of labels
    )

  0%|          | 0/105 [00:00<?, ?it/s]

  0%|          | 0/105 [00:00<?, ?it/s]

label ids: {0: 0, 1: 1, 2: 2}


In [None]:
test_loader_cbi = bert.encode(
    df_cbi_test.text.values,      #list of texts
    df_cbi_test.label.values      #list of labels
    )

  0%|          | 0/45 [00:00<?, ?it/s]

  0%|          | 0/45 [00:00<?, ?it/s]

label ids: {0: 0, 1: 1, 2: 2}


In [None]:
scores_cbi = bert.run_training(
    train_loader_cbi,             #training dataloader
    test_loader_cbi,              #test dataloader
    lr=5e-5,                  #learning rate
    n_epochs=3,               #number of epochs
    random_state=1,          #random state (for replicability)
    save_model_as='cbi_model' #name of model to save as
)

model.safetensors:  19%|#9        | 83.9M/436M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Training...

  Average training loss: 1.07
  Training took: 0:00:06

Running Validation...

  Average test loss: 0.92
  Validation took: 0:00:00
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         2
           1       0.56      1.00      0.71        25
           2       0.00      0.00      0.00        18

    accuracy                           0.56        45
   macro avg       0.19      0.33      0.24        45
weighted avg       0.31      0.56      0.40        45


Training...


  Average training loss: 0.89
  Training took: 0:00:04

Running Validation...

  Average test loss: 0.84
  Validation took: 0:00:00
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         2
           1       0.56      1.00      0.71        25
           2       0.00      0.00      0.00        18

    accuracy                           0.56        45
   macro avg       0.19      0.33      0.24        45
weighted avg       0.31      0.56      0.40        45


Training...


  Average training loss: 0.84
  Training took: 0:00:04

Running Validation...

  Average test loss: 0.83
  Validation took: 0:00:00
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         2
           1       0.56      1.00      0.71        25
           2       0.00      0.00      0.00        18

    accuracy                           0.56        45
   macro avg       0.19      0.33      0.24        45
weighted avg       0.31      0.56      0.40        45


Training complete!
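
The three reports above are identical: recall is 1.00 for label 1 and 0.00 for labels 0 and 2, which means that every one of the 45 test sentences was assigned label 1. The short check below recomputes the reported figures from the support counts alone, under that single assumption, to show that the 0.56 accuracy is just the share of the majority label.

```python
# Support per label in the 45-sentence test set, taken from the reports above.
support = {0: 2, 1: 25, 2: 18}
n = sum(support.values())

# Assumption implied by the recall column: every sentence is predicted as label 1.
predicted = 1

f1 = {}
for label, count in support.items():
    true_pos = count if label == predicted else 0
    n_predicted = n if label == predicted else 0
    precision = true_pos / n_predicted if n_predicted else 0.0  # 0 when a label is never predicted
    recall = true_pos / count
    denom = precision + recall
    f1[label] = 2 * precision * recall / denom if denom else 0.0

accuracy = support[predicted] / n
macro_f1 = sum(f1.values()) / len(f1)
print(round(accuracy, 2), round(f1[1], 2), round(macro_f1, 2))  # 0.56 0.71 0.24
```

In other words, after three epochs on 105 training sentences the model has not learned to separate the labels and performs no better than always guessing the most frequent one.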


