## Installation instructions

[![colab badge](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mcallaghan/text-as-data/blob/master/Session-11-Spacy-and-Transformers/BERT.ipynb)

To install the required libraries, run the following in a terminal (or prepend `!` to each line and run it in a Colab cell):

```
pip install transformers
pip install datasets
pip install torch
```


In [1]:
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import Trainer
from torch import tensor
from torch.nn import Sigmoid, Softmax

In [2]:
from transformers import pipeline

pipe = pipeline("sentiment-analysis")

pipe(["This movie was really bad", "I loved watching this movie"])

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'NEGATIVE', 'score': 0.9998040795326233},
 {'label': 'POSITIVE', 'score': 0.999700665473938}]
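
The warning above recommends naming the model and revision explicitly. A minimal sketch of doing so, reusing the model name and revision printed in the log. The helper name `build_sentiment_pipeline` is hypothetical, and the import sits inside the function so that defining it does not trigger a download:

```python
# Model name and revision copied from the log message above
MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"
REVISION = "af0f99b"


def build_sentiment_pipeline():
    """Return a sentiment pipeline pinned to a fixed model and revision."""
    # Imported lazily; requires `pip install transformers torch`
    from transformers import pipeline

    return pipeline("sentiment-analysis", model=MODEL_NAME, revision=REVISION)


# Usage (downloads the model on the first call):
# pipe = build_sentiment_pipeline()
# pipe(["This movie was really bad", "I loved watching this movie"])
```

Pinning both values means the same weights are loaded even if the library's default model for the task changes later.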

In [3]:
pipe = pipeline("text-classification", model="nbroad/ESG-BERT")
texts = [
    "The Hertie School is committed to embedding and mainstreaming diversity, equity and inclusion into all areas of its activities."
]
pipe(texts)


[{'label': 'Employee_Engagement_Inclusion_And_Diversity',
  'score': 0.9726636409759521}]

In [4]:
from textwrap import wrap
run_galactica = False
if run_galactica:
    pipe = pipeline("text-generation", model="facebook/galactica-1.3b")
else:
    pipe = pipeline("text-generation")
    
res = pipe("Large language models can be useful. However,")
wrap(res[0]["generated_text"])

No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


['Large language models can be useful. However, sometimes this can lead',
 'to other problems.  Note: These are not technical challenges.  If you',
 'have trouble finding a way to do things you use in your language, read',
 'my previous article.']

In [5]:
from transformers import AutoTokenizer, OPTForCausalLM
run_galactica = False
if run_galactica:
    #pipe = pipeline("text-generation", model="facebook/galactica-1.3b")
    #res = pipe("What are the benefits of taking the Text as Data course at Hertie", max_length=200)

    tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-1.3b")
    model = OPTForCausalLM.from_pretrained("facebook/galactica-1.3b", device_map="auto")

    input_text = "Wiki page about the Text as Data Course at the Hertie School of Governance. \n#Introduction"
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

    outputs = model.generate(input_ids, max_length=200)
    print(tokenizer.decode(outputs[0]))

In [6]:
unmasker = pipeline("fill-mask")

unmasker(f"The GOP is going to {unmasker.tokenizer.mask_token} this election")

No model was supplied, defaulted to distilroberta-base and revision ec58a5b (https://huggingface.co/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'score': 0.4639686048030853,
  'token': 339,
  'token_str': ' win',
  'sequence': 'The GOP is going to win this election'},
 {'score': 0.4603649973869324,
  'token': 2217,
  'token_str': ' lose',
  'sequence': 'The GOP is going to lose this election'},
 {'score': 0.011407200247049332,
  'token': 8052,
  'token_str': ' steal',
  'sequence': 'The GOP is going to steal this election'},
 {'score': 0.008465026505291462,
  'token': 11781,
  'token_str': ' dominate',
  'sequence': 'The GOP is going to dominate this election'},
 {'score': 0.00669073686003685,
  'token': 9808,
  'token_str': ' sweep',
  'sequence': 'The GOP is going to sweep this election'}]

In [7]:
# Let's take our texts and our labels again
texts, y = zip(
    *[
        ("Climate change is impacting human systems", 1),
        ("Climate change is caused by fossil fuels", 0),
        ("Agricultural yields are affected by climate change", 1),
        ("System change not climate change", 0),
        ("higher temperatures are impacting human health", 1),
        ("Forest fires are becoming more frequent due to climate change", 1),
        ("Machine learning can read texts", 0),
        ("AI can help solve climate change!", 0),
        ("We need to save gas this winter", 0),
        ("More frequent droughts are impacting crop yields", 1),
        ("Many communities are affected by rising sea levels", 1),
        ("Global emissions continue to rise", 0),
        ("Ecosystems are increasingly impacted by rising temperatures", 1),
        ("Emissions from fossil fuels need to decline", 0),
        ("Anthropogenic climate change is impacting vulnerable communities", 1),
    ]
)

In [8]:

# To use these with transformers, we are going to need to get them into the right format.
from datasets import Dataset
from transformers import AutoTokenizer

# First we'll put them into a HuggingFace Dataset object
dataset = Dataset.from_dict({"text": texts, "label": y})

# And now we need to tokenize the texts, using the tokenizer that matches our pretrained model.
# A climate-specific alternative is "climatebert/distilroberta-base-climate-f";
# the outputs below were produced with bert-base-uncased.
model_name = "bert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(model_name)



def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)


tokenized_dataset = dataset.map(tokenize_function, batched=True)
tokenized_dataset


Dataset({
    features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 15
})
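
To make `padding="max_length"` and `truncation=True` concrete, here is a toy sketch in plain Python. It is not the tokenizer's implementation: the token ids, `MAX_LENGTH` and `PAD_ID` are made-up values for illustration, and a real tokenizer also takes care of special tokens when it truncates.

```python
MAX_LENGTH = 6  # hypothetical; bert-base-uncased uses 512
PAD_ID = 0      # hypothetical padding token id


def pad_or_truncate(token_ids, max_length=MAX_LENGTH, pad_id=PAD_ID):
    """Cut a sequence to max_length, then right-pad it and build the attention mask."""
    ids = token_ids[:max_length]
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        # 1 marks a real token, 0 marks padding the model should ignore
        "attention_mask": [1] * len(ids) + [0] * n_pad,
    }


short = pad_or_truncate([101, 4785, 2689, 102])
long_example = pad_or_truncate([101, 1, 2, 3, 4, 5, 6, 102])
print(short)
print(long_example)
```

Every row ends up with the same length, which is what lets the `Trainer` stack examples into a batch.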

In [9]:
# We can wrap this into one function that turns any set of texts (and optional labels)
# into a tokenized huggingface dataset
def datasetify(x, tokenizer, y=None):
    data_dict = {"text": x}
    if y is not None:
        data_dict["label"] = y
    dataset = Dataset.from_dict(data_dict)

    def tokenize_function(examples):
        return tokenizer(examples["text"], padding="max_length", truncation=True)

    return dataset.map(tokenize_function, batched=True)

In [10]:
# Now we want to load our model, and instantiate a Trainer class
from transformers import AutoModelForSequenceClassification


# We set num_labels to 2 for binary classification, as we have two classes - positive and negative
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
# The trainer class needs to be supplied with a model, and a dataset (and will also accept TrainingArguments and validation data)
trainer = Trainer(model=model, train_dataset=datasetify(texts, tokenizer, y))
# Once this has been instantiated we can apply the train() method
trainer.train()

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at


The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 15
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 6






Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=6, training_loss=0.6291860342025757, metrics={'train_runtime': 6.4255, 'train_samples_per_second': 7.003, 'train_steps_per_second': 0.934, 'total_flos': 300624936300.0, 'train_loss': 0.6291860342025757, 'epoch': 3.0})
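
The "Total optimization steps" figure in the log can be checked by hand. A small sketch, assuming a single device and one gradient accumulation step (as the log reports); the function name is made up for illustration:

```python
import math


def total_optimization_steps(num_examples, batch_size, num_epochs):
    """Batches per epoch (the last, partial batch counts too) times epochs."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    return batches_per_epoch * num_epochs


# 15 examples in batches of 8 gives 2 batches per epoch, over 3 epochs
print(total_optimization_steps(15, 8, 3))  # 6
```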

In [11]:
# To generate predictions, we just need to supply a dataset to the predict method
new_texts = [
    "climate change is impacting terrestrial ecosystems",
    "Machine Learning will solve climate change",
    "Fossil fuels are responsible for rising temperature",
]


pred = trainer.predict(datasetify(new_texts, tokenizer, [1, 0, 0]))
pred


The following columns in the test set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 3
  Batch size = 8


PredictionOutput(predictions=array([[-0.10175121,  0.40060183],
       [ 0.0580028 ,  0.16400006],
       [-0.18335238,  0.1685566 ]], dtype=float32), label_ids=array([1, 0, 0]), metrics={'test_loss': 0.7017471194267273, 'test_runtime': 0.0716, 'test_samples_per_second': 41.898, 'test_steps_per_second': 13.966})
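
The `predictions` array holds one logit per class for each text. A short sketch, in plain Python with the values copied from the output above, of turning those logits into predicted labels and an accuracy score:

```python
# Logits and labels copied from the PredictionOutput above
logits = [
    [-0.10175121, 0.40060183],
    [0.0580028, 0.16400006],
    [-0.18335238, 0.1685566],
]
label_ids = [1, 0, 0]

# The predicted class is the index of the largest logit in each row
predicted = [row.index(max(row)) for row in logits]
accuracy = sum(p == t for p, t in zip(predicted, label_ids)) / len(label_ids)

print(predicted)  # [1, 1, 1]
print(accuracy)
```

After only six optimization steps on 15 examples, the model predicts class 1 for every text, so only the first prediction matches its label.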

In [12]:
# However, the model output gives us logits: one unnormalised score per class.
# The predicted class is the one with the larger logit.
# We can turn these into probabilities with an activation function
from torch import tensor
from torch.nn import Sigmoid, Softmax

activation = (
    Softmax(dim=1)
)  # Since we have two *exclusive classes*, we use the Softmax function across the class dimension
activation(tensor(pred.predictions))


tensor([[0.3770, 0.6230],
        [0.4735, 0.5265],
        [0.4129, 0.5871]])
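
As a cross-check, the softmax probabilities and the `test_loss` reported by `trainer.predict` can be recomputed in plain Python. This is a sketch that assumes the loss is the mean cross-entropy over the three examples, which is what a single-label classification head uses; the logits and labels are copied from the prediction output.

```python
import math

# Logits and labels copied from the PredictionOutput
logits = [
    [-0.10175121, 0.40060183],
    [0.0580028, 0.16400006],
    [-0.18335238, 0.1685566],
]
label_ids = [1, 0, 0]


def softmax(row):
    """Exponentiate and normalise so that the row sums to 1."""
    shifted = [x - max(row) for x in row]  # subtract the max for numerical stability
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]


probs = [softmax(row) for row in logits]
# Cross-entropy: mean negative log-probability assigned to the true class
loss = -sum(math.log(p[t]) for p, t in zip(probs, label_ids)) / len(label_ids)

print([[round(p, 4) for p in row] for row in probs])
print(loss)
```

The probabilities should match the tensor above, and the loss should agree with the reported `test_loss` to about three decimal places.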

In [13]:
activation = (
    Sigmoid()
)  # With a Sigmoid function, the probabilities don't need to add up to 1 (useful for multilabel classification)
activation(tensor(pred.predictions))

tensor([[0.4746, 0.5988],
        [0.5145, 0.5409],
        [0.4543, 0.5420]])
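
The difference between the two activations can be seen on a single row. A sketch in plain Python, using the first row of logits from the prediction output: the sigmoid scores each logit on its own, while a two-class softmax depends only on the difference between the two logits.

```python
import math


def sigmoid(x):
    """Map a single logit to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


a, b = -0.10175121, 0.40060183  # first row of logits from the prediction output

# Sigmoid treats each logit separately, so the two values need not sum to 1
independent = [sigmoid(a), sigmoid(b)]

# Two-class softmax for class 1 equals the sigmoid of the logit difference
softmax_class_1 = math.exp(b) / (math.exp(a) + math.exp(b))

print([round(p, 4) for p in independent])
print(softmax_class_1, sigmoid(b - a))
```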

In [14]:
# If we want to always get probabilities, we can subclass Trainer and add a new predict_proba method

import numpy as np
from transformers.trainer_utils import PredictionOutput

class ProbTrainer(Trainer):
    def predict_proba(self, test_dataset: Dataset) -> np.ndarray:
        logits = self.predict(test_dataset).predictions
        # More than two outputs are treated as independent (multilabel) classes,
        # two outputs as exclusive classes
        if logits.shape[1] > 2:
            activation = Sigmoid()
        else:
            activation = Softmax(dim=1)
        return activation(tensor(logits)).numpy()


trainer = ProbTrainer(model=model, train_dataset=datasetify(texts, tokenizer, y))
trainer.train()

pred = trainer.predict_proba(datasetify(new_texts, tokenizer))
pred


No `TrainingArguments` passed, using `output_dir=tmp_trainer`.
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 15
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 6






Training completed. Do not forget to share your model on huggingface.co/models =)





The following columns in the test set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 3
  Batch size = 8




array([[0.24335532, 0.75664467],
       [0.73538405, 0.26461592],
       [0.5834491 , 0.41655087]], dtype=float32)

In [15]:
params = {
  "batch_size": [16, 32],
  "learning_rate": [5e-5, 3e-5, 2e-5],
  "number of epochs": [2,3,4]
}
import itertools
def product_dict(**kwargs):
    keys = kwargs.keys()
    vals = kwargs.values()
    for instance in itertools.product(*vals):
        yield dict(zip(keys, instance))
param_space = list(product_dict(**params))
len(param_space)



18
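
The count of 18 is the product of the list lengths (2 × 3 × 3). A self-contained sketch of the same expansion, written as a list comprehension rather than the generator above, that also prints the first two configurations to show the order in which they are produced:

```python
import itertools

params = {
    "batch_size": [16, 32],
    "learning_rate": [5e-5, 3e-5, 2e-5],
    "number of epochs": [2, 3, 4],
}

keys = list(params)
# itertools.product varies the last list fastest, like nested for-loops
param_space = [dict(zip(keys, combo)) for combo in itertools.product(*params.values())]

print(len(param_space))  # 18
print(param_space[0])
print(param_space[1])
```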

In [16]:
from transformers import TrainingArguments
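for p in param_space:
    training_args = TrainingArguments(
        num_train_epochs=p["number of epochs"],
        learning_rate=p["learning_rate"],
        per_device_train_batch_size=p["batch_size"],
        output_dir="out"
    )
    # Reload the pretrained weights so that every configuration starts from the same point
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
    trainer = ProbTrainer(model=model, train_dataset=datasetify(texts, tokenizer, y), args=training_args)
    trainer.train()
    break  # Stop after the first configuration to keep the demonstration short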
                           

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).



The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 15
  Num Epochs = 2
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 2






Training completed. Do not forget to share your model on huggingface.co/models =)




In [17]:
import pandas as pd
df = pd.read_csv("https://github.com/mcallaghan/text-as-data/raw/master/datasets/uk_manifestos.csv")
df.head()

Unnamed: 0,text,cmp_code,eu_code,party
0,This election is about the crisis of living st...,503.0,,Labour
1,and the climate and environmental emergency.,501.0,,Labour
2,"Whether we are ready or not, we stand on the b...",501.0,,Labour
3,We must confront this change while dealing wit...,503.0,,Labour
4,Labour led the UK Parliament in declaring a cl...,501.0,,Labour


In [18]:
import numpy as np
df["climate"] = np.where(df["cmp_code"]==501,1,0)
df.head()

Unnamed: 0,text,cmp_code,eu_code,party,climate
0,This election is about the crisis of living st...,503.0,,Labour,0
1,and the climate and environmental emergency.,501.0,,Labour,1
2,"Whether we are ready or not, we stand on the b...",501.0,,Labour,1
3,We must confront this change while dealing wit...,503.0,,Labour,0
4,Labour led the UK Parliament in declaring a cl...,501.0,,Labour,1
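
The labelling rule is simple enough to spell out without NumPy. A plain-Python sketch using the `cmp_code` values of the five rows shown above, plus one hypothetical missing value to show how it is handled:

```python
# cmp_code values of the five rows shown above, plus a hypothetical missing value
cmp_codes = [503.0, 501.0, 501.0, 503.0, 501.0, float("nan")]

# 1 where the code equals 501, 0 everywhere else
# (NaN never compares equal to anything, so it maps to 0)
climate = [1 if code == 501 else 0 for code in cmp_codes]

print(climate)  # [0, 1, 1, 0, 1, 0]
```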


In [19]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df.text, df.climate, test_size=0.2, random_state=42)

y_train

378     0
1813    0
1090    0
2492    0
2142    0
       ..
3092    0
3772    0
5191    0
5226    0
860     0
Name: climate, Length: 4192, dtype: int64

In [37]:
class ProbTrainer(Trainer):
    def predict_proba(self, test_dataset: Dataset) -> np.ndarray:
        logits = self.predict(test_dataset).predictions
        # More than two outputs are treated as independent (multilabel) classes,
        # two outputs as exclusive classes
        if logits.shape[1] > 2:
            activation = Sigmoid()
        else:
            activation = Softmax(dim=1)
        return activation(tensor(logits)).numpy()

# We can wrap this into one function that turns any set of texts (and optional labels)
# into a tokenized huggingface dataset
def datasetify(x, tokenizer, y=None):
    data_dict = {"text": x}
    if y is not None:
        data_dict["label"] = y
    dataset = Dataset.from_dict(data_dict)

    def tokenize_function(examples):
        return tokenizer(examples["text"], padding="max_length", truncation=True)

    return dataset.map(tokenize_function, batched=True)

trainer = ProbTrainer(model=model, train_dataset=datasetify(X_train, tokenizer, y_train))
trainer.train()


No `TrainingArguments` passed, using `output_dir=tmp_trainer`.
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 4192
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 1572


Step,Training Loss


KeyboardInterrupt: 