# Transformer as a feature extractor and a classifier

**TODO** In this project work

## Stage 1: Using Transformer as Token Feature Extractor + External Classifier

- Choosing and preparing a dataset

There are a lot of datasets that can be used as a base for this project, such as Sentiment140, TweetEval, etc. We will be using TweetEval since it is built specifically for evaluation of models on Twitter data. It contains around 58.000 tweets. 

- Preprocessing the text data

Now we want to tokenize the input text using the tokenizer to convert it into input features.

- Loading a Pre-trained Transformer (in our case DistilBERT)

DistilBERT is a smaller, faster, and lighter version of BERT. It's key features are: reduced size, retained performance, transformer architecture.

- Extracting features

We pass the tokenized input through the transformer to get embeddings from the transformer.

- Using the Embeddings with a Classifier and evaluating the baseline model

Now we can use the embeddings as features for a traditional classifier, in this project we are using Logistic Regression. After training the classifier, we evaluate it on a test set using the usual metrics like accuracy, precision, recall. 

In [1]:
import pandas as pd
import torch
from transformers import DistilBertTokenizer, DistilBertModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from datasets import load_dataset
import evaluate

In [2]:
# loading the TweetEval dataset
dataset = load_dataset("tweet_eval", "sentiment")  

# viewing the dataset structure
print(dataset)

n = 20000

x_train = dataset['train']['text'][:n]
x_test = dataset['test']['text'][:n]

y_train = dataset['train']['label'][:n]
y_test = dataset['test']['label'][:n]

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 45615
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 12284
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
})


In [3]:
model_name = "distilbert-base-uncased"

In [4]:
tokenizer_base = DistilBertTokenizer.from_pretrained(model_name)
model_base = DistilBertModel.from_pretrained(model_name)

In [5]:
def extract_features(texts):
    inputs = tokenizer_base(texts, return_tensors='pt', padding=True, truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model_base(**inputs)
    return outputs.last_hidden_state[:, 0, :].numpy()  # Use [CLS] token features

train_features = extract_features(x_train)
test_features = extract_features(x_test)

In [6]:
clf = LogisticRegression(max_iter=100000)  # Increase max_iter if convergence issues arise
clf.fit(train_features, y_train)

y_pred = clf.predict(test_features)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.68      0.64      0.66      3972
           1       0.66      0.67      0.67      5937
           2       0.58      0.62      0.60      2375

    accuracy                           0.65     12284
   macro avg       0.64      0.64      0.64     12284
weighted avg       0.65      0.65      0.65     12284



We want to check a couple of examples from the test set and see the results given by our classifier.

In [7]:
def classify_text(text):
    inputs = tokenizer_base([text], return_tensors='pt', padding=True, truncation=True, max_length=512)
    with torch.no_grad():
        features = model_base(**inputs).last_hidden_state[:, 0, :].numpy()
    
    predicted_label = clf.predict(features)[0]
    
    return predicted_label

# to test we select a couple of samples from the test set
num_samples = 2
test_texts = dataset['test']['text'][:num_samples]
test_labels = dataset['test']['label'][:num_samples]

for i, text in enumerate(test_texts):
    predicted_label = classify_text(text)
    print(f"Text: {text}")
    print(f"True Label: {test_labels[i]}, Predicted Label: {predicted_label}")
    print("-" * 50)

Text: @user @user what do these '1/2 naked pics' have to do with anything? They're not even like that.
True Label: 1, Predicted Label: 0
--------------------------------------------------
Text: OH: “I had a blue penis while I was this” [playing with Google Earth VR]
True Label: 1, Predicted Label: 1
--------------------------------------------------


It is logical for the model to get the wrong label in about half of the cases since the average accuracy is 0.65. This accuracy can be improved by using the whole dataset or training for longer.

## Stage 2: Fine-Tuning Transformer

**TODO** Now, instead of using the transformer just as a feature extractor, we want to fine-tune it to handle both feature extraction and classification in an efficient manner.


In [5]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from peft import LoraConfig, get_peft_model
from sklearn.metrics import accuracy_score, f1_score
import numpy as np
from torch.utils.data import DataLoader
import wandb

In [6]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels in our case is 3 since the possible sentiments are negative, neutral, positive
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

lora_config = LoraConfig(
    r=8,              # Rank for LoRA
    lora_alpha=32,    # Scaling factor
    lora_dropout=0.1, # Dropout for LoRA layers
    target_modules=["q_lin", "k_lin", "v_lin"]  # Modules to apply LoRA that correspond to "query", "key", "value" in DistilBERT
)
model_lora = get_peft_model(model, lora_config)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [7]:
def preprocess_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=128)

tokenized_datasets = dataset.map(preprocess_function, batched=True)

In [9]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    loss = torch.nn.functional.cross_entropy(
        torch.tensor(logits), torch.tensor(labels)
    ).item()
    return {"eval_loss": loss}

training_args = TrainingArguments(
    output_dir="./results",
    logging_dir="./logs",
    evaluation_strategy="steps",
    logging_strategy="steps",
    eval_steps=300,
    logging_steps=300,
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=2,
    weight_decay=0.01,
    report_to=["wandb"],
)

# initializing the trainer
trainer = Trainer(
    model=model_lora,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

  trainer = Trainer(


In [10]:
trainer.train()

[34m[1mwandb[0m: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33melizaveta-mochalova[0m ([33melizaveta-mochalova-universita-firenze[0m). Use [1m`wandb login --relogin`[0m to force relogin




Step,Training Loss,Validation Loss
300,1.0487,No log
600,0.9825,No log
900,0.9323,No log
1200,0.8894,No log




TrainOutput(global_step=1426, training_loss=0.9502775130733367, metrics={'train_runtime': 388.6648, 'train_samples_per_second': 234.727, 'train_steps_per_second': 3.669, 'total_flos': 3036801251888640.0, 'train_loss': 0.9502775130733367, 'epoch': 2.0})

In [None]:
def classify_text(text):
    model.to("cpu")
    
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=128)
    inputs = {key: value.to("cpu") for key, value in inputs.items()}
    
    with torch.no_grad():
        outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)
    
    return predictions.item()


# Test the Model with an Example
example_text = "I love this!"
predicted_label = classify_text(example_text)
print(f"Text: {example_text}")
print(f"Predicted Label: {predicted_label}")


Text: I tollerate you!
Predicted Label: 1
