# Transformer as a feature extractor and a classifier

**TODO** In this project work

## Stage 1: Using Transformer as Token Feature Extractor + External Classifier

### Step 1: Choosing and preparing a dataset

Several datasets could serve as a base for this project, such as Sentiment140 or TweetEval. We use TweetEval because it is built specifically for evaluating models on Twitter data. Its sentiment subset contains roughly 60,000 tweets: 45,615 for training and 12,284 for testing, plus a validation split.

In [1]:
pip install torch transformers datasets scikit-learn

Note: you may need to restart the kernel to use updated packages.


In [3]:
import torch
from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification, Trainer, TrainingArguments
from torch.optim import AdamW  # the AdamW in transformers is deprecated and missing from recent releases
from datasets import load_dataset
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np

In [4]:
dataset = load_dataset("tweet_eval", "sentiment")

# Select the predefined train and test splits
train_data = dataset['train']
test_data = dataset['test']


### Step 2: Loading a Pre-trained Transformer (in our case DistilBERT)

TODO: describe the transformer and some of its characteristics

In [5]:
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

### Step 3: Tokenizing the Input Text

Next, the tokenizer converts each tweet into input IDs and an attention mask that the model can consume.

In [6]:
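def tokenize_data(example):
    # Tweets are short, so cap the length instead of padding every row to the model maximum of 512 tokens
    return tokenizer(example['text'], padding='max_length', truncation=True, max_length=128)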

In [7]:
# Apply the tokenizer on the train and test data
train_data = train_data.map(tokenize_data, batched=True)
test_data = test_data.map(tokenize_data, batched=True)

# Convert the input IDs and labels to tensors
train_inputs = torch.tensor(train_data['input_ids'])
train_labels = torch.tensor(train_data['label'])
test_inputs = torch.tensor(test_data['input_ids'])
test_labels = torch.tensor(test_data['label'])

Map:   0%|          | 0/45615 [00:00<?, ? examples/s]

Map:   0%|          | 0/12284 [00:00<?, ? examples/s]

### Step 4: Extracting features

Pass the tokenized input through the transformer and mean-pool the last hidden state to get one embedding vector per tweet.

In [8]:
def embed(ids, batch_size=64):
    # Encode in batches (one call on the full set exhausts memory) and mask the padding tokens
    batches = (model(b, attention_mask=b.ne(tokenizer.pad_token_id).long()).last_hidden_state.mean(dim=1) for b in ids.split(batch_size))
    return torch.cat(list(batches)).numpy()

with torch.no_grad():
    train_embeddings, test_embeddings = embed(train_inputs), embed(test_inputs)
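
Two things matter in this step. A forward pass over the whole training set has to hold one hidden vector per token position for every tweet at once, so features are better extracted batch by batch. A plain `.mean(dim=1)` also averages the padding positions into each embedding. The following is a minimal sketch of batched, mask-aware mean pooling. The names `masked_mean_pool` and `extract_features` and the batch sizes are illustrative assumptions, and the toy encoder is a stand-in for `model(...).last_hidden_state`, not the notebook's model.

```python
import torch

def masked_mean_pool(hidden, mask):
    """Average token vectors over real (non-padding) positions only.

    hidden: (batch, seq_len, dim) float tensor
    mask:   (batch, seq_len) tensor, 1 for real tokens and 0 for padding
    """
    mask = mask.unsqueeze(-1).to(hidden.dtype)   # (batch, seq_len, 1)
    summed = (hidden * mask).sum(dim=1)          # padding contributes zero
    counts = mask.sum(dim=1).clamp(min=1.0)      # avoid division by zero
    return summed / counts

def extract_features(encoder, input_ids, attention_mask, batch_size=64):
    """Run the encoder batch by batch and return pooled features as a NumPy array."""
    pooled = []
    with torch.no_grad():
        for start in range(0, input_ids.shape[0], batch_size):
            ids = input_ids[start:start + batch_size]
            mask = attention_mask[start:start + batch_size]
            hidden = encoder(ids, mask)          # (batch, seq_len, dim)
            pooled.append(masked_mean_pool(hidden, mask))
    return torch.cat(pooled).numpy()

# Toy stand-in encoder (assumption): an embedding lookup instead of DistilBERT.
torch.manual_seed(0)
embedding = torch.nn.Embedding(10, 4)
toy_encoder = lambda ids, mask: embedding(ids)

toy_ids = torch.tensor([[1, 2, 0, 0], [3, 4, 5, 6], [7, 0, 0, 0]])
toy_mask = (toy_ids != 0).long()                 # token id 0 plays the role of padding
features = extract_features(toy_encoder, toy_ids, toy_mask, batch_size=2)
print(features.shape)
```

With the notebook's objects, the encoder would be `lambda ids, mask: model(ids, attention_mask=mask).last_hidden_state`, and the mask would come from the `attention_mask` column that the tokenizer adds to `train_data` and `test_data`.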

### Step 5: Use the Embeddings with a Classifier

Once you have the embeddings, you can use them as features for a traditional classifier like Logistic Regression, SVM, or a neural network.

In [None]:
clf = LogisticRegression(max_iter=1000)
clf.fit(train_embeddings, train_labels.numpy())  # scikit-learn expects NumPy arrays, not torch tensors
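
Because the classifier only sees a fixed feature matrix, trying a different one is cheap. The sketch below compares logistic regression with a linear SVM, each behind a `StandardScaler`. It uses synthetic features from `make_classification` as a stand-in for the real embeddings (an assumption, so that the example runs without the transformer), and the hyperparameters are illustrative, not tuned.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for (embeddings, labels): 3 classes, 32-dimensional features.
X, y = make_classification(n_samples=600, n_features=32, n_informative=10,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

candidates = {
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "linear_svm": make_pipeline(StandardScaler(), LinearSVC(max_iter=5000)),
}

scores = {}
for name, candidate in candidates.items():
    candidate.fit(X_train, y_train)
    scores[name] = candidate.score(X_test, y_test)  # mean accuracy on the held-out part
    print(f"{name}: {scores[name]:.4f}")
```

On the real data, `X_train` and `X_test` would be `train_embeddings` and `test_embeddings`, with the labels converted to NumPy arrays.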

### Step 6: Evaluate the Baseline Model

After training the classifier, evaluate it on a test set using metrics like accuracy, precision, recall, etc.

In [None]:
test_predictions = clf.predict(test_embeddings)
accuracy = accuracy_score(test_labels.numpy(), test_predictions)
print(f"Baseline Accuracy: {accuracy:.4f}")
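
Accuracy alone can hide poor performance on a minority class, so it helps to look at per-class precision and recall as well. A small sketch with scikit-learn follows. The labels are hand-made toy values (an assumption, not the notebook's predictions), and the class names follow the TweetEval sentiment label order 0 = negative, 1 = neutral, 2 = positive.

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             precision_recall_fscore_support)

# Toy labels standing in for test_labels and test_predictions.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

accuracy = accuracy_score(y_true, y_pred)

# Macro averaging weights every class equally, regardless of its frequency.
macro_p, macro_r, macro_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"accuracy={accuracy:.4f} macro_precision={macro_p:.4f} "
      f"macro_recall={macro_r:.4f} macro_f1={macro_f1:.4f}")

# Per-class breakdown.
print(classification_report(
    y_true, y_pred,
    target_names=["negative", "neutral", "positive"],
    zero_division=0,
))
```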

## Stage 2: Fine-Tuning Transformer
Now, instead of using the transformer just as a feature extractor, you want to fine-tune it to handle both feature extraction and classification in an efficient manner.

### Step 1: Load Pre-Trained Transformer with a Classification Head

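Load the same checkpoint with a classification head. For this checkpoint, `AutoModelForSequenceClassification` resolves to `DistilBertForSequenceClassification`. The head is newly initialized, and `num_labels=3` matches the three sentiment classes (negative, neutral, positive).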


In [1]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)





### Step 2: Use Transfer Learning and Fine-Tune Specific Layers

Fine-tuning the entire transformer can be computationally expensive. To make it efficient, fine-tune only a few layers while keeping most of the model frozen.
For example, you can fine-tune just the last few layers or the classification head.


In [None]:
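for param in model.base_model.parameters():
    param.requires_grad = False  # Freeze the DistilBERT encoder; the classification head stays trainable

# Pass only the parameters that still require gradients to the optimizer
optimizer = AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-5)
loss_fn = torch.nn.CrossEntropyLoss()

# optimizer and loss_fn are meant for a hand-written training loop;
# the Trainer in Step 3 creates its own optimizer and computes the loss itself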
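
Freezing the whole base model trains only the classification head. A middle ground is to also unfreeze the last few transformer blocks. The sketch below shows one way to do that. The helper name `unfreeze_last_layers` is hypothetical, and the tiny randomly initialized model is an assumption used only so that the example runs without downloading weights.

```python
from transformers import DistilBertConfig, DistilBertForSequenceClassification

# Tiny randomly initialised model (assumption) so the sketch runs without a download.
config = DistilBertConfig(vocab_size=100, max_position_embeddings=32, n_layers=4,
                          n_heads=2, dim=16, hidden_dim=32, num_labels=3)
tiny_model = DistilBertForSequenceClassification(config)

def unfreeze_last_layers(model, n_last=1):
    """Freeze the encoder, then re-enable gradients for its last n_last blocks (n_last >= 1)."""
    for param in model.base_model.parameters():
        param.requires_grad = False
    for block in model.distilbert.transformer.layer[-n_last:]:
        for param in block.parameters():
            param.requires_grad = True

unfreeze_last_layers(tiny_model, n_last=1)

# The classification head was never frozen, so it remains trainable as well.
trainable = [name for name, p in tiny_model.named_parameters() if p.requires_grad]
n_trainable = sum(p.numel() for p in tiny_model.parameters() if p.requires_grad)
n_total = sum(p.numel() for p in tiny_model.parameters())
print(f"trainable parameters: {n_trainable} of {n_total}")
```

With the pre-trained model from Step 1, the call would be `unfreeze_last_layers(model, n_last=2)`. How many blocks to unfreeze is a trade-off between training cost and accuracy that has to be checked empirically.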

### Step 3: Train and Evaluate the Model
Train the model by passing tokenized inputs and labels, and only updating the parameters of the layers you chose to fine-tune.


In [None]:
# Training arguments for Hugging Face Trainer
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",  # named eval_strategy in recent transformers releases
    learning_rate=1e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
)

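# Report accuracy during evaluation; without compute_metrics the results
# contain only the loss, and 'eval_accuracy' would be missing
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": accuracy_score(labels, np.argmax(logits, axis=-1))}

# Trainer setup
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=test_data,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)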

# Train the model
trainer.train()

# Evaluate the fine-tuned model
results = trainer.evaluate()
print(f"Fine-tuned Accuracy: {results['eval_accuracy']:.4f}")