<a href="https://colab.research.google.com/github/m-newhauser/rep-or-dem-tweets/blob/main/finetune_full_architecture_tftrainer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Resources

* [Fine-tuning DistilBERT with TF (freezing last hidden layer from DistilBERT models)](https://towardsdatascience.com/hugging-face-transformers-fine-tuning-distilbert-for-binary-classification-tasks-490f1d192379)
* [Fine-tuning DistilBERT with only TF](https://medium.com/geekculture/hugging-face-distilbert-tensorflow-for-custom-text-classification-1ad4a49e26a7)
* [Fine-tuning multi-class BERT in PyTorch](https://colab.research.google.com/drive/18vy67le2DC-iMJK-AiB0vVKtMRAxmBnB?usp=sharing#scrollTo=4c81NkyZYCab)
* [TFTrain DistilBERT](https://wandb.ai/ayush-thakur/huggingface/reports/How-to-Fine-Tune-Hugging-Face-Transformers-with-Weights-Biases---Vmlldzo0MzQ2MDc)
* [Minimal code example to fine-tune BERT (and save to S3)](https://engineering.freeagent.com/2021/09/15/fine-tuning-bert-for-multiclass-categorisation-with-amazon-sagemaker/)

In [1]:
!pip install transformers==4.6.0
!pip install tweet-preprocessor



In [2]:
import random
import pandas as pd
import numpy as np
import csv
import tensorflow as tf
import preprocessor as p

# Only the TensorFlow model class is needed; the PyTorch one was unused
from transformers import (
    DistilBertTokenizerFast,
    TFDistilBertForSequenceClassification,
    TFTrainer,
    TFTrainingArguments,
)

from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

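# Seed every random number generator used below for reproducibility
random.seed(123)
np.random.seed(123)
tf.random.set_seed(123)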

## Pre-process data

In [3]:
# Read in raw data -- https://fivethirtyeight.datasettes.com/fivethirtyeight/twitter-ratio%2Fsenators#export
# tweets_raw = pd.read_csv("senators.csv")
tweets_raw = pd.read_parquet("senators.parquet").sample(n=10000, random_state=123)

# # Save to parquet (only have to do this the first time)
# tweets_raw.to_parquet("senators.parquet")

In [29]:
# Configure tweet-preprocessor to strip numbers and emojis
p.set_options(p.OPT.NUMBER, p.OPT.EMOJI)

tweets = (tweets_raw
          .drop(columns=["created_at", "url", "bioguide_id"])
          .assign(
              # Clean each tweet, spell out "&amp;" as "and", and truncate to 512 characters
              text_clean=tweets_raw["text"].apply(p.clean).str.replace("&amp;", "and ").str[:512],
              # Relabel Bernie Sanders from I to D
              party=np.where(tweets_raw.user == "SenSanders", "D", tweets_raw.party)
              )
          .query('party != "I"') # Drop tweets from the remaining Independent senators
          )

tweets.shape

(9894, 9)
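
The cleaning step above can be approximated with the standard library alone. The sketch below is a rough stand-in, not what `tweet-preprocessor` does internally: the `clean_tweet` helper and its digit rule are assumptions made for illustration. Note that the cap of 512 counts characters, while DistilBERT's limit of 512 counts tokens, so the tokenizer's `truncation=True` is what actually enforces the model limit.

```python
import html
import re

MAX_CHARS = 512  # same character cap as the .str[:512] slice in the cell above


def clean_tweet(text: str) -> str:
    """Hypothetical stand-in for the cleaning step (not tweet-preprocessor itself)."""
    text = html.unescape(text)                # turn "&amp;" back into "&"
    text = text.replace("&", "and")           # spell out ampersands
    text = re.sub(r"\d+", "", text)           # drop runs of digits (assumed number rule)
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace left behind
    return text[:MAX_CHARS]                   # truncate by characters, not tokens


example = clean_tweet("Jobs &amp; wages rose 3 times")
print(example)
```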

In [30]:
# Create a column with numeric labels (only "D" and "R" remain after filtering)
label_mapping = {
    "D": 0,
    "R": 1
}

tweets['label'] = tweets['party'].map(label_mapping)
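
When reading predictions later it helps to go from the numeric label back to the party. A minimal sketch, assuming the same `label_mapping` as above; `id_to_party` and the predicted ids are made up for illustration.

```python
label_mapping = {"D": 0, "R": 1}

# Invert the mapping so numeric predictions can be decoded to party names
id_to_party = {idx: party for party, idx in label_mapping.items()}

predicted_ids = [1, 0, 0, 1]  # made-up model outputs, for illustration only
predicted_parties = [id_to_party[i] for i in predicted_ids]
print(predicted_parties)
```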

In [31]:
# Convert the text and label columns to lists
texts = list(tweets.text_clean)
labels = list(tweets.label)

# Split the data into training (70%) and test (30%) sets
(train_texts, test_texts, train_labels, test_labels) = train_test_split(
    texts, labels, test_size=0.3, random_state=123
)
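
A random split can leave the two parties unevenly represented in the training and test sets. The sketch below shows one way to check the class shares after splitting; `class_shares` and the toy label lists are assumptions for illustration. If the shares drift apart, `train_test_split` accepts a `stratify` argument that keeps them aligned.

```python
from collections import Counter


def class_shares(labels):
    """Return the fraction of examples belonging to each label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: counts[label] / total for label in sorted(counts)}


# Toy label lists standing in for train_labels and test_labels
toy_train_labels = [0, 0, 0, 1, 1, 1, 1, 0]
toy_test_labels = [0, 1, 1, 1]

print(class_shares(toy_train_labels))
print(class_shares(toy_test_labels))
```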

### Tokenize data for DistilBERT

In [32]:
# Load DistilBERT tokenizer and tokenize (encode) the texts
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

train_encodings = tokenizer(train_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)
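
With `padding=True` the tokenizer pads every sequence in a call to the longest one in that call and returns an attention mask marking real tokens with 1 and padding with 0. The sketch below imitates that behaviour on plain lists. It is an illustration only, not the tokenizer's implementation: `pad_batch` is a hypothetical helper, the token ids are made up, and a real tokenizer keeps the closing special token when it truncates.

```python
PAD_ID = 0  # id of the [PAD] token in the distilbert-base-uncased vocabulary


def pad_batch(sequences, max_length=None):
    """Pad (and optionally cut) token-id lists to one common length."""
    longest = max(len(seq) for seq in sequences)
    target = min(longest, max_length) if max_length else longest
    input_ids, attention_mask = [], []
    for seq in sequences:
        seq = seq[:target]                 # naive truncation, for illustration
        n_pad = target - len(seq)
        input_ids.append(seq + [PAD_ID] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because the training and test texts are tokenized in separate calls, their padded lengths can differ.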


### Create TensorFlow datasets

In [33]:
# Wrap the encodings and labels in TensorFlow datasets
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels
))

test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    test_labels
))

## Fine-tune the entire DistilBERT architecture (all layers)

In [34]:
# Compute evaluation metrics from the model's predictions and return them as a dict
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }


# Provide args for fine-tuning DistilBERT on our data
training_args = TFTrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=5,              # total # of training epochs
    per_device_train_batch_size=32,  # batch size per device during training
    per_device_eval_batch_size=32,   # batch size for evaluation
    learning_rate=2e-05,             # start with a low learning rate when fine-tuning
    warmup_steps=250,                # number of warmup steps for learning rate scheduler ([500, 1000] are normal but start low)
    weight_decay=0.01,               # strength of weight decay
    evaluation_strategy="epoch",
    logging_dir='./logs',            # directory for storing logs
    logging_steps=1,
    eval_steps=10
)

# Instantiate the pre-trained model
with training_args.strategy.scope():
    model = TFDistilBertForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", 
        num_labels=2
    )

# Create the trainer
trainer = TFTrainer(
    model=model,  # the instantiated 🤗 Transformers model to be trained
    args=training_args,  # training arguments, defined above
    train_dataset=train_dataset,  # training dataset
    eval_dataset=test_dataset,  # evaluation dataset
    compute_metrics=compute_metrics # custom function with metrics to compute
)

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['vocab_transform', 'vocab_layer_norm', 'vocab_projector', 'activation_13']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier', 'classifier', 'dropout_39']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
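
To make the numbers returned by `compute_metrics` easier to interpret, here is a minimal sketch that derives the same four metrics from raw counts. With `average='binary'`, scikit-learn treats label 1 (Republican here) as the positive class. The `binary_metrics` helper and the toy labels are assumptions for illustration.

```python
def binary_metrics(labels, preds, positive=1):
    """Accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    correct = sum(1 for y, p in pairs if y == p)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": correct / len(pairs),
        "f1": f1,
        "precision": precision,
        "recall": recall,
    }


toy_labels = [1, 1, 1, 0, 0, 0, 1, 0]  # made-up ground truth
toy_preds = [1, 1, 0, 0, 1, 0, 1, 0]   # made-up predictions
toy_metrics = binary_metrics(toy_labels, toy_preds)
print(toy_metrics)
```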

In [35]:
# Train the model
trainer.train()
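
It is worth checking how `warmup_steps=250` compares with the total number of optimizer steps. The arithmetic below is a sketch under stated assumptions: a single device, no gradient accumulation, and a final partial batch that counts as one step. The row count comes from `tweets.shape` above.

```python
import math

n_examples = 9894   # rows left after filtering, from tweets.shape
test_size = 0.3
batch_size = 32
epochs = 5
warmup_steps = 250

n_test = math.ceil(test_size * n_examples)  # scikit-learn rounds the test share up
n_train = n_examples - n_test
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs
warmup_share = warmup_steps / total_steps

print(n_train, n_test, steps_per_epoch, total_steps, round(warmup_share, 3))
```

Under these assumptions the warmup covers a bit less than a quarter of training, slightly more than the first epoch.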



In [36]:
# Evaluate the model
trainer.evaluate()



{'eval_accuracy': 0.8780241935483871,
 'eval_f1': 0.8843580758203249,
 'eval_loss': 0.31257635547268775,
 'eval_precision': 0.8696741854636592,
 'eval_recall': 0.899546338302009}

In [19]:
# Save the model
trainer.save_model("finetune-distilbert-senators")

In [37]:
# Make predictions on the test set
test_predictions = trainer.predict(test_dataset)

# Take the argmax over the logits to get the predicted labels for the test set
test_predictions_labels = test_predictions.predictions.argmax(-1)
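
# The model returns raw logits. Softmax is only needed when class
# probabilities are wanted: it preserves the ordering of the logits,
# so the argmax of the logits already gives the predicted label.
# A minimal pure-Python sketch with made-up logits:

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value (first index on ties)."""
    return max(range(len(values)), key=values.__getitem__)


toy_logits = [[2.0, -1.0], [-0.5, 0.3], [0.1, 0.1]]  # made-up logits
labels_from_logits = [argmax(row) for row in toy_logits]
labels_from_probs = [argmax(softmax(row)) for row in toy_logits]
print(labels_from_logits, labels_from_probs)
```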



In [38]:
# Create an output dataframe with true and predicted labels for the test set
predictions_df = pd.DataFrame({
    "text": test_texts,
    "label": test_labels,
    "pred": test_predictions_labels
})

# Now merge it with other Twitter information
predictions_df = (tweets[["rowid", "user", "state", "party", "text_clean"]]
                  .merge(predictions_df, left_on="text_clean", right_on="text")
                  .drop(columns=["text_clean"])
                  )
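
Merging on the tweet text can duplicate rows when two tweets share the same cleaned text, and it can also pull in training rows that happen to match a test text. One alternative is to carry a row identifier through the split so predictions can be joined back on it. The sketch below uses only the standard library; `split_with_ids` and the toy rows are assumptions for illustration. `train_test_split` also accepts additional arrays, so `rowid` could be passed alongside `texts` and `labels`.

```python
import math
import random


def split_with_ids(row_ids, texts, labels, test_size=0.3, seed=123):
    """Shuffle once and split ids, texts and labels with the same ordering."""
    order = list(range(len(texts)))
    random.Random(seed).shuffle(order)
    n_test = math.ceil(test_size * len(order))
    test_idx, train_idx = order[:n_test], order[n_test:]

    def pick(idx):
        return {
            "row_ids": [row_ids[i] for i in idx],
            "texts": [texts[i] for i in idx],
            "labels": [labels[i] for i in idx],
        }

    return pick(train_idx), pick(test_idx)


# Toy rows; two of them share the same text on purpose
toy_ids = [11, 12, 13, 14, 15, 16]
toy_texts = ["vote today", "thank you", "thank you", "tax cuts", "health care", "jobs"]
toy_labels = [0, 1, 0, 1, 0, 1]

train_part, test_part = split_with_ids(toy_ids, toy_texts, toy_labels)
print(test_part["row_ids"])
```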

In [39]:
# Accuracy by party
(predictions_df
 .groupby("party")
 .apply(lambda x: accuracy_score(x["label"], x["pred"]))
 )

party
D    0.854647
R    0.899285
dtype: float64

In [40]:
# Accuracy by state
(predictions_df
 .groupby("state")
 .apply(lambda x: accuracy_score(x["label"], x["pred"]))
 .sort_values()
 )

state
WV    0.705128
OH    0.706897
LA    0.717391
NH    0.777778
VT    0.779661
VA    0.787879
NV    0.823529
DE    0.823529
HI    0.833333
TX    0.836364
RI    0.850000
MT    0.860759
MD    0.860759
CO    0.862069
CA    0.862745
SC    0.866667
NM    0.868852
IA    0.869565
AK    0.869565
OR    0.870370
NE    0.872340
IL    0.876923
PA    0.881356
MO    0.888889
MA    0.891892
MI    0.892308
SD    0.893617
WA    0.901639
ND    0.901961
NC    0.906250
NY    0.907692
UT    0.909091
NJ    0.910256
MN    0.910448
ID    0.911111
TN    0.913043
GA    0.913043
AL    0.920000
WY    0.922078
KS    0.930556
KY    0.933333
CT    0.933333
AR    0.942308
MS    0.943396
AZ    0.945205
FL    0.952381
ME    0.956522
OK    0.963636
WI    0.969231
IN    0.970149
dtype: float64

In [41]:
# Accuracy by user
(predictions_df
 .groupby("user")
 .apply(lambda x: accuracy_score(x["label"], x["pred"]))
 .sort_values()
 )

user
Sen_JoeManchin     0.636364
SenatorStrange     0.666667
SenatorShaheen     0.666667
senrobportman      0.666667
SenatorLeahy       0.714286
                     ...   
SenPatRoberts      0.975000
McConnellPress     1.000000
SenJoniErnst       1.000000
JeffFlake          1.000000
SenMurphyOffice    1.000000
Length: 99, dtype: float64
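
Per-user accuracies of exactly 1.0 may simply reflect users with very few tweets in the test set, so it helps to report the group size next to each accuracy. A minimal sketch with made-up data; `accuracy_by_group` is a hypothetical helper. In pandas the group sizes are available through `groupby(...).size()`.

```python
from collections import defaultdict


def accuracy_by_group(groups, labels, preds):
    """Return {group: (accuracy, n_examples)} for each group."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for group, y, p in zip(groups, labels, preds):
        totals[group] += 1
        hits[group] += int(y == p)
    return {g: (hits[g] / totals[g], totals[g]) for g in sorted(totals)}


# Made-up users, labels and predictions, for illustration only
toy_users = ["a", "a", "a", "a", "b", "b", "c"]
toy_labels = [0, 0, 1, 1, 1, 1, 0]
toy_preds = [0, 1, 1, 1, 1, 1, 0]

by_user = accuracy_by_group(toy_users, toy_labels, toy_preds)
print(by_user)
```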