<a href="https://colab.research.google.com/github/jyotidabass/Project-Fine-Tuning-Language-Models/blob/main/Project_Fine_Tuning_Language_Models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project: Fine-Tuning Language Models**

# **Step 1: Introduction to the Final Project**

In this project, we'll fine-tune a pre-trained language model on a custom dataset to improve its performance on a specific task. We'll use the Hugging Face Transformers library, which provides a simple interface for working with transformer models.

# **Step 2: Step-by-Step Guide to Fine-Tuning on a Custom Dataset**

We'll use a dummy dataset with two columns: `text` (the input sentence) and `label` (1 for positive, 0 for negative). Our goal is to fine-tune a model to predict the label from the text.

Dummy Data

In [1]:
import pandas as pd

# Create a dummy dataset
data = {
    'text': ['This is a positive review', 'This is a negative review', 'I love this product', 'I hate this product'],
    'label': [1, 0, 1, 0]
}
df = pd.DataFrame(data)
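
The later steps in this notebook train and evaluate on the same four rows, so the reported accuracy cannot say anything about generalization. With a larger dataset one would usually hold out a validation set. The following is a minimal sketch of a stratified split in plain Python; the function name `stratified_split`, the `val_fraction` parameter, and the fixed seed are illustrative assumptions, not part of the original notebook.

```python
import random
from collections import defaultdict

def stratified_split(texts, labels, val_fraction=0.5, seed=0):
    """Split (text, label) pairs so each label appears in both parts where possible."""
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, val = [], []
    for label, items in by_label.items():
        items = items[:]  # copy before shuffling
        rng.shuffle(items)
        # At least one example per label goes to validation
        n_val = max(1, round(len(items) * val_fraction))
        val += [(t, label) for t in items[:n_val]]
        train += [(t, label) for t in items[n_val:]]
    return train, val

texts = ['This is a positive review', 'This is a negative review',
         'I love this product', 'I hate this product']
labels = [1, 0, 1, 0]
train_rows, val_rows = stratified_split(texts, labels)
print(len(train_rows), len(val_rows))
```

With only two examples per label, each part ends up with one positive and one negative row, which is far too little to train on; the sketch is only meant to show the mechanics.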

# **Step 3: Model Training: From Data Preparation to Evaluation**

We'll use the transformers library to load a pre-trained model and fine-tune it on our custom dataset. We'll also define a custom dataset class to handle our data.

In [2]:
import torch  # needed for torch.tensor in TextDataset below
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from torch.utils.data import Dataset, DataLoader

# Load pre-trained model and tokenizer
model_name = 'distilbert-base-uncased'
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Define custom dataset class
class TextDataset(Dataset):
    def __init__(self, df, tokenizer):
        self.df = df
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        text = self.df.iloc[idx, 0]
        label = self.df.iloc[idx, 1]

        encoding = self.tokenizer.encode_plus(
            text,
            max_length=512,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt'
        )

        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }

# Create dataset and data loader
dataset = TextDataset(df, tokenizer)
batch_size = 16
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
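
Padding every example to `max_length=512`, as the dataset class above does, means each short sentence becomes a 512-token row that is mostly padding. A common alternative is to pad each batch only to its longest sequence. The sketch below shows the idea in plain Python on hypothetical token IDs; `pad_batch` and `pad_id=0` are illustrative assumptions (in the Transformers library, `DataCollatorWithPadding` plays this role).

```python
def pad_batch(sequences, pad_id=0):
    """Pad variable-length token ID lists to the longest length in the batch."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {'input_ids': input_ids, 'attention_mask': attention_mask}

# Hypothetical token IDs for three sentences of different lengths
batch = pad_batch([[101, 1045, 2293, 102], [101, 102], [101, 2023, 102]])
print(batch['input_ids'])
print(batch['attention_mask'])
```

Here every row is padded to length 4 rather than 512, which reduces the work done per batch without changing what the model attends to.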



# **Step 4: Assessing Model Performance on Various NLP Tasks**

We'll define a function to evaluate the model's performance on our custom dataset.

In [3]:
import torch
from sklearn.metrics import accuracy_score, classification_report

def evaluate(model, dataloader):
    """Print accuracy and a per-class report over every batch in the dataloader."""
    device = next(model.parameters()).device  # run on whichever device the model is on
    model.eval()
    all_labels, all_preds = [], []
    with torch.no_grad():
        for batch in dataloader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)

            outputs = model(input_ids, attention_mask=attention_mask)
            predicted = outputs.logits.argmax(dim=1)

            # Collect results from every batch, not just the last one
            all_labels.extend(batch['labels'].tolist())
            all_preds.extend(predicted.cpu().tolist())

    print(f'Accuracy: {accuracy_score(all_labels, all_preds):.4f}')
    print(classification_report(all_labels, all_preds, zero_division=0))

# **Step 5: Application of Transformer Models to Complex Problems**

We'll fine-tune the model on our custom dataset and evaluate its performance.

In [6]:
# Fine-tune the model
import torch
import torch.nn as nn  # provides the loss function
from torch.optim import Adam

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=1e-5)

for epoch in range(5):
    model.train()
    total_loss = 0
    for batch in dataloader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        optimizer.zero_grad()

        # The loss is computed explicitly below, so labels are not passed to the model
        outputs = model(input_ids, attention_mask=attention_mask)
        loss = criterion(outputs.logits, labels)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
    print(f'Epoch {epoch+1}, Loss: {total_loss / len(dataloader)}')

# Evaluate the model (on the training data; the dummy dataset has no held-out split)
evaluate(model, dataloader)

Epoch 1, Loss: 0.7089009284973145
Epoch 2, Loss: 0.6667802929878235
Epoch 3, Loss: 0.6734365224838257
Epoch 4, Loss: 0.6881130933761597
Epoch 5, Loss: 0.694121778011322
Accuracy: 0.5000
              precision    recall  f1-score   support

           0       0.50      1.00      0.67         2
           1       0.00      0.00      0.00         2

    accuracy                           0.50         4
   macro avg       0.25      0.50      0.33         4
weighted avg       0.25      0.50      0.33         4
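
The per-epoch losses above stay close to 0.69, which is roughly ln 2. That is the cross-entropy a two-class classifier incurs when it assigns equal probability to both classes, so a loss at this level suggests the model has not yet learned to separate the labels. The sketch below computes this by hand; the function name and the example logits are illustrative assumptions.

```python
import math

def cross_entropy(logits, target):
    """Softmax cross-entropy for one example, given raw logits and the true class index."""
    shift = max(logits)  # subtract the max for numerical stability
    log_sum_exp = shift + math.log(sum(math.exp(x - shift) for x in logits))
    return log_sum_exp - logits[target]

uninformed = cross_entropy([0.0, 0.0], target=1)  # equal logits: a 50/50 guess
confident = cross_entropy([-2.0, 2.0], target=1)  # strongly favors the true class
print(f'{uninformed:.4f} {confident:.4f}')
```

The first value equals ln 2; the second is much smaller, which is the direction the training loss would move if the model were fitting the data.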





# **Step 6: End-of-Project Evaluation**

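We'll re-run the evaluation as a final assessment. The result is unchanged: accuracy is 0.50, and the recall of 1.00 for class 0 against 0.00 for class 1 indicates that the model predicts class 0 for every example. Because the four examples used here are also the training data, and training ran for only five epochs at a learning rate of 1e-5, this should be read as a smoke test of the pipeline rather than a measure of model quality.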

In [7]:
# Evaluate the model
evaluate(model, dataloader)

Accuracy: 0.5000
              precision    recall  f1-score   support

           0       0.50      1.00      0.67         2
           1       0.00      0.00      0.00         2

    accuracy                           0.50         4
   macro avg       0.25      0.50      0.33         4
weighted avg       0.25      0.50      0.33         4
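
The report above is consistent with a model that predicts class 0 for every example. As a worked check, the sketch below recomputes per-class precision, recall, and F1 in plain Python for that case; the `per_class_metrics` helper and the all-zero prediction list are assumptions made for illustration, inferred from the report rather than logged by the notebook.

```python
def per_class_metrics(y_true, y_pred, cls):
    """Precision, recall, and F1 for one class; undefined ratios are reported as 0.0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [1, 0, 1, 0]  # labels from the dummy dataset
y_pred = [0, 0, 0, 0]  # assumed: the model always predicts class 0

for cls in (0, 1):
    p, r, f = per_class_metrics(y_true, y_pred, cls)
    print(f'class {cls}: precision={p:.2f} recall={r:.2f} f1={f:.2f}')
```

Class 1 is never predicted, so its precision is undefined; scikit-learn reports it as 0.00 and, by default, emits a warning when that happens.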





# **Step 7: Building and Sharing an NLP Demo**

We'll create a simple demo that takes user input and predicts the label using our fine-tuned model.

In [14]:
def predict(text):
    model.eval()  # disable dropout so predictions are deterministic
    inputs = tokenizer(
        text,
        max_length=512,
        truncation=True,
        return_tensors='pt'
    ).to(device)

    with torch.no_grad():  # no gradients needed for inference
        outputs = model(**inputs)
    return outputs.logits.argmax(dim=1).item()

# Test the demo
text = 'I hate this product'
print(predict(text))

0
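
`predict` returns only the class index. For a demo it is often more useful to show a readable label and a confidence score. The sketch below applies a softmax to a pair of logits and maps the winning index to a name; the `ID2LABEL` mapping follows the dummy dataset (1 for positive, 0 for negative), while the helper name and the example logits are hypothetical.

```python
import math

ID2LABEL = {0: 'negative', 1: 'positive'}  # assumed mapping, matching the dummy dataset

def label_with_confidence(logits):
    """Turn raw logits into (label name, softmax probability of that label)."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]

# Hypothetical logits, standing in for outputs.logits[0].tolist() from the model
name, confidence = label_with_confidence([0.3, -0.1])
print(f'{name} ({confidence:.2f})')
```

A confidence near 0.5, as with these example logits, signals that the prediction is close to a coin flip, which is worth surfacing in a demo built on a model trained as briefly as this one.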


*This code provides a basic example of fine-tuning a pre-trained language model on a custom dataset. You can modify the code to suit your specific needs and experiment with different models, hyperparameters, and techniques to improve performance.*
