<a href="https://colab.research.google.com/github/m-newhauser/rep-or-dem-tweets/blob/main/finetune_distilbert_senator_tweets_pt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning DistilBERT on senator tweets
A guide to fine-tuning DistilBERT on the tweets of American Senators with snscrape, SQLite, and Transformers (PyTorch) on Google Colab.

*This notebook performs the actual fine-tuning. The data is scraped from Twitter in [Part 1: Creating the dataset](https://github.com/m-newhauser/rep-or-dem-tweets/blob/main/get_tweets.ipynb).*

🔗 [Medium article](https://medium.com/@mary.newhauser/fine-tuning-distilbert-on-senator-tweets-a6f2425ca50e)

💾 [Dataset](https://huggingface.co/datasets/m-newhauser/senator-tweets)

🤗 [Model](https://huggingface.co/m-newhauser/distilbert-political-tweets)

In [1]:
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Install if necessary
print('Installing packages')
!pip install datasets==1.18.3 transformers[sentencepiece]==4.16.2 tweet-preprocessor


In [2]:
import transformers
import datasets
print(f"Running on transformers v{transformers.__version__} and datasets v{datasets.__version__}")

Running on transformers v4.16.2 and datasets v1.18.3


In [4]:
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

# Define root dir in Google Drive
root_dir = "/content/drive/MyDrive/colab_data"

Mounted at /content/drive


## Pre-process data

In [None]:
import sqlite3
import pandas as pd

# Connect to locally created sqlite DB
conn = sqlite3.connect(f"{root_dir}/raw_data/TWEETS.db")  # path to db

# Select only tweets from current session of Congress in 2021
tweets_df = pd.read_sql("SELECT * FROM senators WHERE date BETWEEN '2021-01-20' AND '2021-12-31'", conn)

# Print total number of tweets
print(f"{tweets_df.shape[0]} total tweets in dataset\n")

99693 total tweets in dataset
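
One detail worth checking in the query above: if `date` is stored as text in `YYYY-MM-DD HH:MM:SS` form (the dataframe preview later in this notebook suggests so, though the schema of `TWEETS.db` is not shown here), SQLite compares the values as strings, so an upper bound of `'2021-12-31'` leaves out tweets posted later on December 31. The sketch below reproduces this with a hypothetical in-memory table; the rows and the `count_rows` helper are made up for illustration.

```python
import sqlite3

# Hypothetical in-memory table that mimics a text "date" column
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE senators (date TEXT, text TEXT)")
conn.executemany(
    "INSERT INTO senators VALUES (?, ?)",
    [
        ("2021-01-19 23:59:59", "before the window"),
        ("2021-01-20 15:10:33", "first day of the window"),
        ("2021-12-31 18:00:00", "evening of the last day"),
    ],
)

def count_rows(upper_bound):
    """Count rows between 2021-01-20 and the given upper bound (string comparison)."""
    query = "SELECT COUNT(*) FROM senators WHERE date BETWEEN '2021-01-20' AND ?"
    return conn.execute(query, (upper_bound,)).fetchone()[0]

date_only_count = count_rows("2021-12-31")          # misses the evening tweet
full_day_count = count_rows("2021-12-31 23:59:59")  # includes the whole last day
print(date_only_count, full_day_count)
```

If tweets from the last day matter, an upper bound such as `'2021-12-31 23:59:59'` would include them.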



In [None]:
# Check class distribution of the dataset
tweets_df["party"].value_counts()

Democrat       50091
Republican     48252
Independent     1350
Name: party, dtype: int64
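
For context when reading the evaluation metrics later, the counts above give a rough majority-class baseline. This is a small arithmetic sketch, assuming Independents are merged into the Democrat class as done in the next cell; the variable names are illustrative only.

```python
# Class counts copied from the value_counts() output above
party_counts = {"Democrat": 50091, "Republican": 48252, "Independent": 1350}

# Merge Independents into Democrats, mirroring the relabeling in the next cell
merged_counts = {
    "Democrat": party_counts["Democrat"] + party_counts["Independent"],
    "Republican": party_counts["Republican"],
}

total = sum(merged_counts.values())
# Accuracy of a classifier that always predicts the larger class
majority_baseline = max(merged_counts.values()) / total
print(f"{total} tweets, majority-class baseline accuracy: {majority_baseline:.4f}")
```

The two classes are close to balanced, so accuracy is a reasonable headline metric for this dataset.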

In [None]:
import preprocessor as p
import numpy as np

# Configure tweet-preprocessor to strip numbers and emojis
p.set_options(p.OPT.NUMBER, p.OPT.EMOJI)

tweets_df = (tweets_df
             .assign(
                 text=tweets_df["text"].apply(p.clean).str.replace("&amp;", "and ").str[:512], # replace HTML-escaped &'s with "and" and truncate to 512 characters
                 party=np.where(tweets_df.party == "Independent", "Democrat", tweets_df.party) # Relabel Independent senators as Democrats
                 )
             .drop(columns="index")
             )
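
To make the cleaning steps concrete without the `tweet-preprocessor` dependency, here is a rough standard-library approximation. The `clean_text` and `merge_party` helpers are hypothetical, the digit-stripping regex is only a stand-in for `p.OPT.NUMBER` (it does not handle emojis), and the whitespace collapsing is an addition of this sketch rather than something the cell above does.

```python
import re

def clean_text(text, max_chars=512):
    """Approximate the cleaning above: fix &amp;, drop digits, truncate."""
    text = text.replace("&amp;", "and")       # HTML-escaped ampersand
    text = re.sub(r"\d+", "", text)           # rough stand-in for p.OPT.NUMBER
    text = re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace (sketch only)
    return text[:max_chars]                   # character-level truncation

def merge_party(party):
    """Fold Independent senators into the Democrat class."""
    return "Democrat" if party == "Independent" else party

print(clean_text("Met with 25 farmers &amp; ranchers today"))
print(merge_party("Independent"))
```

Note that the truncation is by characters, not tokens; the tokenizer applies its own token-level limit later.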

In [None]:
# Build id2label from the unique party names, then invert it to get label2id
id2label = {str(i): label for i, label in enumerate(tweets_df["party"].unique().tolist())}
label2id = {v: k for k, v in id2label.items()}

print(label2id)

{'Republican': '0', 'Democrat': '1'}
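
The mapping depends on the order in which party names first appear in the dataframe. The sketch below shows the same construction on a hypothetical list of party values, using `dict.fromkeys` as a stand-in for pandas `unique()` (both keep order of first appearance).

```python
# Hypothetical party column; Republican appears first, as in the dataset above
parties = ["Republican", "Republican", "Democrat", "Republican", "Democrat"]

# Unique classes in order of first appearance
classes = list(dict.fromkeys(parties))

# Same construction as the cell above: string ids mapped to labels, then inverted
id2label = {str(i): label for i, label in enumerate(classes)}
label2id = {label: idx for idx, label in id2label.items()}

# Map every row to its label id
labels = [label2id[party] for party in parties]
print(label2id)
print(labels)
```

Because the ids follow row order, sorting or reshuffling the dataframe before this step could swap which party is class 0.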


In [None]:
# Create a "labels" column from the label2id mapping
tweets_df = (tweets_df
             .assign(labels=tweets_df["party"].map(label2id)) # Create a labels column (for expected DistilBERT input)
             )
tweets_df.head()

| | date | id | username | text | party | labels |
|---|---|---|---|---|---|---|
| 0 | 2021-01-20 15:10:33 | 1351909990752280579 | SenToddYoung | Last night, @GovHolcomb mapped out a bright pa... | Republican | 0 |
| 1 | 2021-01-20 19:28:32 | 1351974915524718593 | SenToddYoung | Today, I attended the th Presidential Inaugura... | Republican | 0 |
| 2 | 2021-01-20 19:28:33 | 1351974919689658370 | SenToddYoung | The peaceful transfer of power is an essential... | Republican | 0 |
| 3 | 2021-01-20 19:28:34 | 1351974921350606848 | SenToddYoung | I stand ready to work with the new administrat... | Republican | 0 |
| 4 | 2021-01-20 19:28:34 | 1351974922202046466 | SenToddYoung | I would also like to once again thank the @ING... | Republican | 0 |


In [None]:
from datasets import Dataset

# Put the clean data in a dataset and split it into train (80%) and test (20%) sets
dataset = Dataset.from_pandas(tweets_df).train_test_split(train_size=0.8, seed=123)
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['date', 'id', 'username', 'text', 'party', 'labels'],
        num_rows: 79754
    })
    test: Dataset({
        features: ['date', 'id', 'username', 'text', 'party', 'labels'],
        num_rows: 19939
    })
})
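
The split sizes printed above follow from the 80/20 ratio applied to 99,693 rows. The arithmetic below is a sketch of how the numbers work out; it is not necessarily how the library rounds internally.

```python
import math

n_rows = 99693      # total tweets reported earlier
train_size = 0.8    # fraction passed to train_test_split

# Round the training share down and give the remainder to the test set
n_train = math.floor(train_size * n_rows)
n_test = n_rows - n_train
print(n_train, n_test)
```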


In [None]:
# Cast labels column as class labels
dataset = dataset.class_encode_column("labels")


## Tokenize data for DistilBERT

In [8]:
from transformers import AutoTokenizer

# Load DistilBERT tokenizer and tokenize (encode) the texts
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")


In [None]:
# Make a list of columns to remove before tokenization
cols_to_remove = [col for col in dataset["train"].column_names if col != "labels"]
print(cols_to_remove)

['date', 'id', 'username', 'text', 'party']


In [None]:
# Tokenize and encode the dataset
def tokenize(batch):
    tokenized_batch = tokenizer(batch['text'],   # tokenize the "text" column
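                                padding=True,    # pad to the longest sequence in the batch
                                truncation=True, # truncate sequences longer than max_length
                                max_length=512)  # DistilBERT's maximum input length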
    return tokenized_batch

dataset_enc = dataset.map(tokenize, batched=True, remove_columns=cols_to_remove, num_proc=4)

# Set dataset format for PyTorch
dataset_enc.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])

# Check the output
print(dataset_enc["train"].column_names)

['labels', 'input_ids', 'attention_mask']


In [None]:
from transformers import DataCollatorWithPadding
from torch.utils.data import DataLoader

# Instantiate a data collator with dynamic padding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Create data loaders that batch and pad the data for the PyTorch model
train_dataloader = DataLoader(
    dataset_enc["train"], shuffle=True, batch_size=8, collate_fn=data_collator
)
eval_dataloader = DataLoader(
    dataset_enc["test"], batch_size=8, collate_fn=data_collator
)
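
Dynamic padding means each batch is padded only to the length of its own longest sequence rather than to a global maximum. The sketch below illustrates the idea in plain Python; `pad_to_longest` is a hypothetical helper and the token ids are made up, so this is an illustration of the concept, not the collator's actual implementation.

```python
def pad_to_longest(sequences, pad_id=0):
    """Pad token-id lists to the longest sequence in this batch only."""
    target_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (target_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (target_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token ids for two short tweets
short_batch = pad_to_longest([[101, 7592, 102], [101, 102]])
print(short_batch["input_ids"])
print(short_batch["attention_mask"])
```

With a batch size of 8 and tweets of varying length, padding per batch keeps most batches much shorter than the 512-token limit.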

## Fine-tune DistilBERT

In [None]:
import torch
from transformers import AutoModelForSequenceClassification

# Dynamically set number of class labels based on dataset
num_labels = dataset["train"].features["labels"].num_classes
print(f"Number of labels: {num_labels}")

# Load model from checkpoint
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", 
                                                           num_labels=num_labels)



Number of labels: 2


Downloading:   0%|          | 0.00/256M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.weight', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias', 'pre_classifier

In [None]:
from transformers import AdamW
from transformers import get_scheduler

# Training hyperparameters
learning_rate = 5e-5
num_epochs = 5

# Create the optimizer
optimizer = AdamW(model.parameters(), lr=learning_rate)

# Define the learning rate scheduler
num_training_batches = len(train_dataloader)
num_training_steps = num_epochs * num_training_batches
lr_scheduler = get_scheduler(
    "linear",                   # linear decay
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
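
To see what the scheduler does with these settings, the sketch below recomputes the step count from the training split size (79,754 rows, batch size 8) and evaluates a linear decay with zero warmup steps. `linear_lr` is a hypothetical helper written for illustration; it mirrors the usual linear-with-warmup formula but is not the library's code.

```python
import math

def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate at a given step under linear warmup followed by linear decay."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return base_lr * max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

# The last batch is partial, so round the batch count up
num_training_batches = math.ceil(79754 / 8)
num_training_steps = 5 * num_training_batches

for step in (0, num_training_steps // 2, num_training_steps):
    print(step, linear_lr(step, 5e-5, num_training_steps))
```

The learning rate starts at 5e-5, is halved at the midpoint, and reaches zero on the final step.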

In [None]:
# Set the device automatically (GPU or CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

# Move model to device
model.to(device)

cuda


DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
       

In [None]:
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

# Train the model with PyTorch training loop
model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

  0%|          | 0/49850 [00:00<?, ?it/s]

In [None]:
# Save model to disk
model.save_pretrained(f"{root_dir}/models/distilbert-political-tweets")

## Evaluate model

In [None]:
from datasets import load_metric

# Load the GLUE MRPC metric, which computes accuracy and F1 for binary classification
metric = load_metric("glue", "mrpc")

# Iteratively evaluate the model and compute metrics
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

# Get model accuracy and F1 score
metric.compute()

{'accuracy': 0.9076182356186369, 'f1': 0.9116716217512228}
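
For readers who want to see what the two numbers measure, here is a minimal pure-Python sketch of accuracy and binary F1. It assumes the F1 is computed for the class encoded as 1 (Democrat under the mapping printed earlier); `accuracy_and_f1` and the toy predictions are hypothetical and only illustrate the formulas.

```python
def accuracy_and_f1(predictions, references, positive_label=1):
    """Compute accuracy and binary F1 for the given positive class."""
    pairs = list(zip(predictions, references))
    correct = sum(p == r for p, r in pairs)
    tp = sum(p == positive_label and r == positive_label for p, r in pairs)
    fp = sum(p == positive_label and r != positive_label for p, r in pairs)
    fn = sum(p != positive_label and r == positive_label for p, r in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "f1": f1}

# Toy example: 5 predictions against 5 reference labels
toy_scores = accuracy_and_f1([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
print(toy_scores)
```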

In [None]:
from transformers import TFDistilBertForSequenceClassification

# Convert PyTorch to TensorFlow checkpoint
# (the config is read from config.json in the saved model directory)
tf_model = TFDistilBertForSequenceClassification.from_pretrained(
    f"{root_dir}/models/distilbert-political-tweets", 
    from_pt=True
)

# Save TensorFlow model to disk
tf_model.save_pretrained(f"{root_dir}/models/distilbert-political-tweets")

All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


## Reload model from disk and run inference

In [6]:
from transformers import AutoModelForSequenceClassification
from transformers import AutoConfig
import torch

# Set the device automatically (GPU or CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Define model config
config = AutoConfig.from_pretrained(f"{root_dir}/models/distilbert-political-tweets",
                                    # label2id=label2id, 
                                    # id2label=id2label
                                    )

# Load model from file and move to GPU
model = AutoModelForSequenceClassification.from_pretrained(f"{root_dir}/models/distilbert-political-tweets", config=config).to(device)

In [9]:
# Tweet from Senator Ted Cruz
cruz_tweet = [""".@SenRonJohnson and I had a great conversation today 
with the truckers of the People’s Convoy. I have long shared their 
concerns about tyrannical COVID-19 mandates. No petty government 
authoritarian should control your personal medical decisions!"""]

# Tokenize inputs
inputs = tokenizer(cruz_tweet, padding=True, truncation=True, return_tensors="pt").to(device) # Move the tensors to the device

# Run the model and get logits
outputs = model(**inputs)
print(outputs)

SequenceClassifierOutput(loss=None, logits=tensor([[ 6.1816, -5.2975]], device='cuda:0', grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)


In [10]:
# Convert logits to class probabilities
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)


tensor([[9.9999e-01, 1.0343e-05]], device='cuda:0', grad_fn=<SoftmaxBackward0>)
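
The probabilities above can be checked by hand. The sketch below applies a numerically stable softmax to the logits printed for this tweet and maps the winning index back to a party, assuming index 0 is Republican and index 1 is Democrat as in the `label2id` mapping printed earlier.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits copied from the model output above
probs = softmax([6.1816, -5.2975])

# Assumed label order, taken from the label2id mapping printed earlier
id2label = {0: "Republican", 1: "Democrat"}
predicted_party = id2label[probs.index(max(probs))]
print(probs, predicted_party)
```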


In [None]:
# Tweet from Senator Elizabeth Warren
warren_tweet = ["""Right-wing extremists are fanning the flames of 
hate and picking on trans children in states across the country. 
It’s sickening. To anyone who’s trans: You deserve to be seen, 
respected, and loved for who you are. #ProtectTransKids"""]

# Tokenize inputs
inputs = tokenizer(warren_tweet, padding=True, truncation=True, return_tensors="pt").to(device) # Move the tensors to the device

# Run the model and get logits
outputs = model(**inputs)

# Convert logits to class probabilities
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)


tensor([[2.6837e-06, 1.0000e+00]], device='cuda:0', grad_fn=<SoftmaxBackward0>)


## Push fine-tuned model to Hugging Face 🤗 repo

In [None]:
# Install git-lfs
import huggingface_hub
huggingface_hub.lfs.install_lfs_in_userspace()

In [None]:
# Log in to Hugging Face via the CLI
!transformers-cli login

In [None]:
# Push TF model to hub
tf_model.push_to_hub(
    "distilbert-political-tweets",
    commit_message="add updated tf model",
    language="en",
    dataset_tags="m-newhauser/senator-tweets",
    tags=["text-classification", "transformers", "pytorch"],
    finetuned_from="distilbert-base-uncased"
    )

Upload file tf_model.h5:   0%|          | 32.0k/256M [00:00<?, ?B/s]

To https://huggingface.co/m-newhauser/distilbert-political-tweets
   1d362aa..9f3a1d8  main -> main



'https://huggingface.co/m-newhauser/distilbert-political-tweets/commit/9f3a1d8d9104274d173c3b10cf37704bc9e97561'

In [None]:
# Configure git settings
!git config --global user.email "XXXX"
!git config --global user.name "XXXX"

# Push PT model to hub
model.push_to_hub(
    "distilbert-political-tweets",                            # model name
    language="en",                                            # language
    dataset_tags="m-newhauser/senator-tweets",                # HF dataset used for training
    library_name="pytorch",
    metrics=["accuracy", "f1"],                               
    tags=["text-classification", "transformers", "pytorch"],  # model tags
    finetuned_from="distilbert-base-uncased",                 # base model
    # commit_message="..."
    )

## Push dataset to Hugging Face 🤗 repo

In [None]:
!transformers-cli login


Username: m-newhauser
Password: 
ERROR:root:HfApi.login: This method is deprecated in favor of `set_access_token`.
Login successful
Your token: *****

Your token has been saved to /root/.huggingface/token


In [None]:
# Push the processed train/test splits to the Hub
dataset.push_to_hub("senator-tweets")

Pushing split train to the Hub.


Pushing dataset shards to the dataset hub:   0%|          | 0/1 [00:00<?, ?it/s]

Pushing split test to the Hub.
The repository already exists: the `private` keyword argument will be ignored.


Pushing dataset shards to the dataset hub:   0%|          | 0/1 [00:00<?, ?it/s]

## Resources
* [Hugging Face Course - Write your training loop in PyTorch](https://huggingface.co/course/chapter3/4?fw=pt) (Article)
* [Hugging Face Course - A full training](https://colab.research.google.com/github/huggingface/notebooks/blob/master/course/chapter3/section4.ipynb#scrollTo=WARodF9Sa6Yq) (Notebook)
* [Hugging Face Course - Sharing pretrained models](https://huggingface.co/course/chapter4/3?fw=pt) (Article)
* [Hugging Face - Share a model](https://huggingface.co/docs/transformers/model_sharing) (Article)