<img align="right" width="450px" src="https://github.com/digitalepidemiologylab/covid-twitter-bert/raw/master/images/COVID-Twitter-BERT-medium.png">

# Finetuning COVID-Twitter-BERT using Huggingface
In this notebook we will finetune CT-BERT for sentiment classification using the Transformers library by Huggingface.

Learn more about this library [here](https://huggingface.co/transformers/).

## Before proceeding
Create a copy of this notebook by going to "File - Save a Copy in Drive".


# Install transformers and import libraries

In [None]:
!pip install transformers



In [None]:
!pip install tf-nightly
!pip install -U -q kaggle --quiet
!pip install livelossplot --quiet
!pip install hiplot --quiet


In [None]:
from transformers import (
   AutoConfig,
   AutoTokenizer,
   TFAutoModelForSequenceClassification,
   glue_convert_examples_to_features
)

import ast
import json

import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_datasets as tfds
from sklearn import preprocessing




# Choose a Model from the Huggingface Library

In [None]:
# Choose model
# @markdown >The default model is <i><b>COVID-Twitter-BERT</b></i>. You can, however, choose <i><b>BERT Base</b></i> or <i><b>BERT Large</b></i> to compare these models with <i><b>COVID-Twitter-BERT</b></i>. All three models are initialized with a random classification layer. If you go directly to the Predict cell after compiling the model, the prediction will still run, but the output will be random. The training steps below finetune the model for the specific task. <br /><br />
model_name = 'digitalepidemiologylab/covid-twitter-bert' #@param ["digitalepidemiologylab/covid-twitter-bert", "bert-large-uncased", "bert-base-uncased"]

# Initialise tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)


In [None]:
def label_text_to_relevant_id(label):
    # Map the four stance labels to a binary relevance id:
    # 'not_relevant' -> 0; 'agree', 'disagree', 'no_stance' -> 1
    if label == 'not_relevant':
        return 0
    if label in ('agree', 'disagree', 'no_stance'):
        return 1
    raise ValueError(f'Unknown label: {label}')

In [None]:
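def process_list(a):
  # For each row, x[1] holds the annotation as a dict literal string
  # with a numeric key and a stance label as its value.
  train_y = []
  for x in a:
    annotation = ast.literal_eval(x[1])
    key = list(annotation.keys())[0]
    # Combine the key and the binary relevance id into one string label, e.g. '(5, 0)'
    train_y.append(str((int(key), label_text_to_relevant_id(annotation[key]))))

  return train_y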
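
To make the label construction concrete, here is a small sketch of what `process_list` does to individual rows. The rows below are made up for illustration (the real CSV contents are not shown in this notebook); the sketch only assumes, as the code above does, that the second column holds a dict literal with one numeric key and a stance label.

```python
import ast

def label_text_to_relevant_id(label):
    if label == 'not_relevant':
        return 0
    if label in ('agree', 'disagree', 'no_stance'):
        return 1
    raise ValueError(f'Unknown label: {label}')

# Hypothetical rows in the shape process_list expects: [text, annotation string]
example_rows = [
    ['some tweet text', "{'5': 'not_relevant'}"],
    ['another tweet', "{'2': 'agree'}"],
]

combined_labels = []
for row in example_rows:
    annotation = ast.literal_eval(row[1])   # string -> dict
    key = list(annotation.keys())[0]        # first (and only) key
    relevant = label_text_to_relevant_id(annotation[key])
    combined_labels.append(str((int(key), relevant)))

print(combined_labels)  # ['(5, 0)', '(2, 1)']
```

Each combined string is later treated as one class, so the number of classes is the number of distinct (key, relevance) pairs in the training data.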

In [None]:


train_set = pd.read_csv('train.csv', index_col=0)
del train_set["created_at"]
del train_set["id"]

dev_set = pd.read_csv('dev.csv', index_col=0)
del dev_set["created_at"]
del dev_set["id"]

test_set = pd.read_csv('test.csv', index_col=0)
del test_set["created_at"]
del test_set["id"]

train_list = train_set.values.tolist()
dev_list = dev_set.values.tolist()
test_list = test_set.values.tolist()

train_text = train_set['text'].values.tolist()
dev_text = dev_set['text'].values.tolist()
test_text = test_set['text'].values.tolist()

# Tokenize with TensorFlow tensors ("tf"), since the model used below is a TF/Keras model.
# token_type_ids are returned because the dev_xy cell further down reads them.
def encode_texts(texts):
  return tokenizer(texts,
                   add_special_tokens=True,
                   max_length=348,
                   return_token_type_ids=True,
                   padding="max_length",
                   return_attention_mask=True,
                   truncation=True,
                   return_tensors="tf")

train_x = encode_texts(train_text)
train_x_inputs = train_x['input_ids']
dev_x = encode_texts(dev_text)
dev_x_inputs = dev_x['input_ids']
test_x = encode_texts(test_text)
test_x_inputs = test_x['input_ids']

train_y = process_list(train_list)
dev_y = process_list(dev_list)
test_y = process_list(test_list)

print(train_x_inputs)
print(train_y)

"""
train_x, train_y = process_list(train_list)
dev_x, dev_y = process_list(dev_list)
test_x, test_y = process_list(test_list)
"""
#print(tokenized_train)

#train_set['text'] = train_set['text'].astype(str)


df_col_len = int(train_set['text'].str.len().max())
#print(df_col_len)
#print(dev_set['text'].str.len().max())
#print(test_set['text'].str.len().max())



tensor([[  101,  1996,  8128,  ...,     0,     0,     0],
        [  101,  1030,  2390,  ...,     0,     0,     0],
        [  101,  1030, 11803,  ...,     0,     0,     0],
        ...,
        [  101, 19075,  1024,  ...,     0,     0,     0],
        [  101,  1996,  3891,  ...,     0,     0,     0],
        [  101,  1030,  3712,  ...,     0,     0,     0]])
['(5, 0)', '(2, 1)', '(20, 1)', '(2, 1)', '(4, 1)', '(9, 0)', '(1, 0)', '(1, 1)', '(11, 0)', '(9, 0)', '(3, 1)', '(1, 1)', '(15, 0)', '(1, 1)', '(8, 1)', '(11, 1)', '(9, 1)', '(20, 0)', '(13, 1)', '(11, 0)', '(3, 0)', '(17, 0)', '(12, 1)', '(2, 1)', '(3, 1)', '(10, 1)', '(9, 0)', '(11, 1)', '(9, 1)', '(3, 0)', '(8, 1)', '(8, 1)', '(16, 0)', '(12, 1)', '(13, 1)', '(11, 1)', '(20, 0)', '(10, 1)', '(17, 1)', '(10, 1)', '(8, 1)', '(13, 0)', '(3, 0)', '(17, 0)', '(13, 0)', '(17, 0)', '(1, 1)', '(16, 0)', '(8, 1)', '(5, 0)', '(3, 1)', '(17, 0)', '(8, 1)', '(3, 0)', '(3, 0)', '(5, 0)', '(13, 1)', '(3, 1)', '(5, 1)', '(10, 1)', '(20, 0)',
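
The tokenizer call above pads or truncates every tweet to `max_length` and returns an attention mask that marks the real tokens. The pure-Python sketch below mimics that on a list of token ids so the shapes are easy to follow. `pad_or_truncate` is a hypothetical helper, not part of the Transformers API, and it simplifies truncation: the real tokenizer keeps the closing `[SEP]` token when it cuts a sequence short.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]            # truncate if too long
    mask = [1] * len(ids)                   # 1 marks a real token
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding

# 101 and 102 are the [CLS] and [SEP] ids of the uncased BERT vocabulary
short_example = [101, 1996, 8128, 102]
input_ids, attention_mask = pad_or_truncate(short_example, max_length=8)
print(input_ids)       # [101, 1996, 8128, 102, 0, 0, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```

This is why the printed `input_ids` rows end in runs of zeros: those positions are padding, and the attention mask tells the model to ignore them.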

In [None]:
train_x["input_ids"].shape
print(train_y)

['(5, 0)', '(2, 1)', '(20, 1)', '(2, 1)', '(4, 1)', '(9, 0)', '(1, 0)', '(1, 1)', '(11, 0)', '(9, 0)', '(3, 1)', '(1, 1)', '(15, 0)', '(1, 1)', '(8, 1)', '(11, 1)', '(9, 1)', '(20, 0)', '(13, 1)', '(11, 0)', '(3, 0)', '(17, 0)', '(12, 1)', '(2, 1)', '(3, 1)', '(10, 1)', '(9, 0)', '(11, 1)', '(9, 1)', '(3, 0)', '(8, 1)', '(8, 1)', '(16, 0)', '(12, 1)', '(13, 1)', '(11, 1)', '(20, 0)', '(10, 1)', '(17, 1)', '(10, 1)', '(8, 1)', '(13, 0)', '(3, 0)', '(17, 0)', '(13, 0)', '(17, 0)', '(1, 1)', '(16, 0)', '(8, 1)', '(5, 0)', '(3, 1)', '(17, 0)', '(8, 1)', '(3, 0)', '(3, 0)', '(5, 0)', '(13, 1)', '(3, 1)', '(5, 1)', '(10, 1)', '(20, 0)', '(12, 0)', '(1, 0)', '(8, 1)', '(2, 1)', '(16, 1)', '(5, 0)', '(11, 1)', '(5, 1)', '(11, 0)', '(5, 0)', '(16, 1)', '(1, 0)', '(12, 1)', '(10, 1)', '(17, 0)', '(15, 1)', '(20, 0)', '(12, 0)', '(3, 1)', '(9, 1)', '(2, 1)', '(9, 1)', '(5, 0)', '(7, 1)', '(9, 0)', '(11, 1)', '(4, 1)', '(2, 1)', '(13, 0)', '(8, 1)', '(4, 1)', '(3, 0)', '(5, 0)', '(16, 1)', '(17, 0

In [None]:
np.unique(train_y).size
encoder = preprocessing.LabelEncoder()
tranformed_label = encoder.fit_transform(train_y)
numpy_y = np.asarray(tranformed_label)
print(tranformed_label)
tranformed_label = [tf.convert_to_tensor(x) for x in tranformed_label]

print()
tensor_label = tf.stack(tranformed_label)

print(tranformed_label)
print(tensor_label)
print(len(train_y))

print(len(train_x_inputs))
print(type(numpy_y))



[26 19 21 ... 19 33  0]

[<tf.Tensor: shape=(), dtype=int64, numpy=26>, <tf.Tensor: shape=(), dtype=int64, numpy=19>, <tf.Tensor: shape=(), dtype=int64, numpy=21>, <tf.Tensor: shape=(), dtype=int64, numpy=19>, <tf.Tensor: shape=(), dtype=int64, numpy=25>, <tf.Tensor: shape=(), dtype=int64, numpy=32>, <tf.Tensor: shape=(), dtype=int64, numpy=0>, <tf.Tensor: shape=(), dtype=int64, numpy=1>, <tf.Tensor: shape=(), dtype=int64, numpy=4>, <tf.Tensor: shape=(), dtype=int64, numpy=32>, <tf.Tensor: shape=(), dtype=int64, numpy=23>, <tf.Tensor: shape=(), dtype=int64, numpy=1>, <tf.Tensor: shape=(), dtype=int64, numpy=12>, <tf.Tensor: shape=(), dtype=int64, numpy=1>, <tf.Tensor: shape=(), dtype=int64, numpy=31>, <tf.Tensor: shape=(), dtype=int64, numpy=5>, <tf.Tensor: shape=(), dtype=int64, numpy=33>, <tf.Tensor: shape=(), dtype=int64, numpy=20>, <tf.Tensor: shape=(), dtype=int64, numpy=9>, <tf.Tensor: shape=(), dtype=int64, numpy=4>, <tf.Tensor: shape=(), dtype=int64, numpy=22>, <tf.Tensor: shap
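
`LabelEncoder` assigns each distinct label string an integer by sorting the unique labels and using their positions. Below is a minimal pure-Python sketch of that mapping, and of the inverse mapping needed to turn a predicted class id back into a label. It uses a handful of label strings taken from the output above, so the resulting ids differ from the ones the full training set produces.

```python
labels = ['(5, 0)', '(2, 1)', '(20, 1)', '(2, 1)', '(1, 0)', '(1, 1)']

# Sorted unique labels define the class ids (sorting is lexicographic on strings)
classes = sorted(set(labels))
label_to_id = {label: i for i, label in enumerate(classes)}
id_to_label = {i: label for label, i in label_to_id.items()}

encoded = [label_to_id[label] for label in labels]
print(classes)  # ['(1, 0)', '(1, 1)', '(2, 1)', '(20, 1)', '(5, 0)']
print(encoded)  # [4, 2, 3, 2, 0, 1]
```

Because the labels are strings, `'(20, 1)'` sorts before `'(5, 0)'`. The ids therefore carry no numeric meaning and should only be used as class indices.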

In [None]:
dev_xy = []
print(dev_x)
for a, b, c, d in zip(dev_x['input_ids'], dev_x['token_type_ids'], dev_x['attention_mask'], dev_y):
  dev_xy.append(({'input_ids': a, 'token_type_ids': b, 'attention_mask': c}, d))

print(dev_xy)

{'input_ids': tensor([[ 101, 2926, 2007,  ...,    0,    0,    0],
        [ 101, 2755, 4638,  ...,    0,    0,    0],
        [ 101, 2064, 1996,  ...,    0,    0,    0],
        ...,
        [ 101, 1030, 5697,  ...,    0,    0,    0],
        [ 101, 2748, 1010,  ...,    0,    0,    0],
        [ 101, 1030, 1021,  ...,    0,    0,    0]]), 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        ...,
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])}
[({'input_ids': tensor([  101,  2926,  2007,  2023,  3563, 17404,  1012,  2009,  2515,  2025,
         5383,  1037,  2542,  4496,  1037,  2757,  7865,  1999,  2009, 18971,
         1012

In [None]:
max_seq_length = 348 #@param {type: "integer"}
train_batch_size =  13#@param {type: "integer"}
eval_batch_size = 13 #@param {type: "integer"}
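
The batch size determines how many optimizer steps make up one epoch. As a rough sketch of that arithmetic (the example count of 1000 rows is an arbitrary illustration, not the size of this dataset, and `steps_per_epoch` is a hypothetical helper):

```python
import math

def steps_per_epoch(num_examples, batch_size, drop_remainder=True):
    """Number of batches needed to see every example once."""
    if drop_remainder:
        return num_examples // batch_size        # full batches only
    return math.ceil(num_examples / batch_size)  # includes the final partial batch

train_batch_size = 13
full_batches = steps_per_epoch(1000, train_batch_size)
all_batches = steps_per_epoch(1000, train_batch_size, drop_remainder=False)
print(full_batches, all_batches)  # 76 77
```

Note that the count must be the number of rows; counting cells (rows times columns) would overstate the number of steps.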

# Download the SST-2 Dataset and Prepare for Finetuning
You can skip this step if you are using the already finetuned model

In [None]:
"""# Parameters
#@markdown >Batch size and sequence length need to be set to prepare the data. The batch size depends on available memory. For a Colab GPU, limit the batch size to 8 and the sequence length to 96. By reducing the length of the input (max_seq_length) you can also increase the batch size. For a dataset like SST-2 with lots of short sentences, this will likely benefit training.
max_seq_length = 96 #@param {type: "integer"}
train_batch_size =  8#@param {type: "integer"}
eval_batch_size = 8 #@param {type: "integer"}


#@markdown >The GLUE dataset has around 62000 examples, and we really do not need them all for training a decent model. To cut down training time, please reduce this to only a percentage of the entire set.
use_percentage_of_data = 38 #@param {type: "slider", min: 1, max: 100}

# get dataset sizes
glue_builder = tfds.builder('glue/sst2')
num_train_examples = glue_builder.info.splits['train'].num_examples
num_dev_examples = glue_builder.info.splits['validation'].num_examples
num_labels = glue_builder.info.features['label'].num_classes

# download datasets and convert to training features
glue_builder.download_and_prepare()
train_data = glue_builder.as_dataset(split='train')
train_dataset = glue_convert_examples_to_features(train_data, tokenizer, max_length=max_seq_length, task='sst-2')
train_dataset = train_dataset.shuffle(100).batch(train_batch_size)

dev_data = glue_builder.as_dataset(split='validation')
dev_dataset = glue_convert_examples_to_features(dev_data, tokenizer, max_length=max_seq_length, task='sst-2')
dev_dataset = dev_dataset.shuffle(100).batch(eval_batch_size)

# Map the labels for printing
label_mapping = {i: glue_builder.info.features['label'].int2str(i) for i in range(num_labels)}

print(f'\n\nThe dataset is downloaded. The entire dataset has {num_train_examples + num_dev_examples} examples of which you are using {use_percentage_of_data}%. This will result in a train dataset with {int(num_train_examples * (use_percentage_of_data/100))} examples and a validation dataset with {int(num_dev_examples * (use_percentage_of_data/100))} examples.')
"""


# Compile the Model, Train it on the SST-2 Task and Save the Result
You can skip this step if you are using the already finetuned model

In [None]:
#@markdown >The default learning rate of 2e-5 will be fine in most cases
learning_rate = 2e-5 #@param {type: "number"}

#@markdown >Typically this type of model is finetuned for 3 epochs. This can be increased for small datasets and decreased for large datasets.
num_epochs = 1  #@param {type: "integer"}

# Initialise a Model for Sequence Classification with 34 labels (the encoded labels range from 0 to 33)
config = AutoConfig.from_pretrained(model_name, num_labels=34)
model = TFAutoModelForSequenceClassification.from_pretrained(model_name, config=config)

# Optimizer and loss
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# Metrics and callbacks
metrics = [tf.keras.metrics.SparseCategoricalAccuracy('accuracy', dtype=tf.float32)]
checkpoint_path = './checkpoints/checkpoint.{epoch:02d}'
callbacks = [tf.keras.callbacks.ModelCheckpoint(checkpoint_path, save_weights_only=True)]

# len() counts rows; DataFrame.size would count cells (rows x columns)
num_train_examples = len(train_set)
num_dev_examples = len(dev_set)
use_percentage_of_data = 100
# Compute some variables
train_steps_per_epoch = int(num_train_examples * (use_percentage_of_data/100) / train_batch_size)
dev_steps_per_epoch = int(num_dev_examples * (use_percentage_of_data/100) / eval_batch_size)


# Compile model
model.compile(optimizer=optimizer, loss=loss, metrics=metrics)

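# Train the model. The labels are passed as a single integer array rather than
# a list of scalar tensors, and the attention mask is passed along with the ids.
history = model.fit(x={'input_ids': np.asarray(train_x['input_ids']),
                       'attention_mask': np.asarray(train_x['attention_mask'])},
                    y=numpy_y,
                    batch_size=train_batch_size,
                    epochs=num_epochs,
                    callbacks=callbacks)

# Validation is currently disabled; pass validation_data=... to model.fit to enable it.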

# Print some information about the training
print(f'\nTraining has finished after {num_epochs} epochs.')
print('\nThe history contains the accuracy and loss at every epoch:')
print(json.dumps(history.history, indent=4))

print('\nThe checkpoint callback has generated a checkpoint after every epoch (loss being the training loss, val_loss is the validation loss):')
!ls -lha ./checkpoints/

print('\nWe will now save the finetuned model and the corresponding config file on your Colab disk.')
model.save_pretrained('./huggingface_model/')

print('\nTensorflow model and config file are saved in ./huggingface_model/')
!ls -lha ./huggingface_model/

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at digitalepidemiologylab/covid-twitter-bert and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.




IndexError: ignored

# Predict
Let's run some inference with the trained model

In [None]:
# Small function only used for formatting the output
def format_prediction(preds, label_mapping, label_name):
    preds = tf.nn.softmax(preds, axis=1)
    formatted_preds = []
    for pred in preds.numpy():
        # convert to Python types and sort
        pred = {label: float(probability) for label, probability in zip(label_mapping.values(), pred)}
        pred = {k: v for k, v in sorted(pred.items(), key=lambda item: item[1], reverse=True)}
        formatted_preds.append({label_name: list(pred.keys())[0], f'{label_name}_probabilities': pred})
    return formatted_preds
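
`format_prediction` applies a softmax to the logits and sorts the labels by probability. The sketch below performs the same computation for a single example in plain Python. The logits and the two-entry label mapping are made up for illustration only.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

example_logits = [2.0, 0.5]                        # hypothetical model output
example_mapping = {0: 'negative', 1: 'positive'}   # hypothetical label mapping

probs = softmax(example_logits)
ranked = sorted(zip(example_mapping.values(), probs),
                key=lambda item: item[1], reverse=True)
top_label = ranked[0][0]
print(top_label, ranked)
```

The label with the largest logit always ends up first, since softmax preserves the ordering of its inputs.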

In [None]:
#@markdown >Please input text that the model can try to classify
input_text = 'Happy little clouds'  #@param {type: "string"}

# Tokenize the input
input_ids = tf.constant(tokenizer.encode(input_text, add_special_tokens=True))[None, :]

# Run predictions
preds = model(input_ids)

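# Format logits. label_mapping is only defined in the disabled SST-2 cell above,
# so rebuild it here from the classes seen by the LabelEncoder.
label_mapping = {i: str(c) for i, c in enumerate(encoder.classes_)}
formatted_preds = format_prediction(preds[0], label_mapping, 'sentiment')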

print(f'\nLabel Mapping:{json.dumps(label_mapping, indent=4)}')
print(f'\nLogits: {preds}')
print(f'\nProbabilities:{json.dumps(formatted_preds, indent=4)}')

##### Copyright 2020 Per Egil Kummervold and Martin Müller