
<div class="box">
  <table>
    <tr>
        <th colspan="2"><h1><b>Deep Learning Final Project</b></h1></th>
    </tr>
    <tr>
        <td><h3><b> Nikkhah</b></h3></td>
                <td><h3> 99102445</h3></td>
    </tr>
    <tr>
        <td><h3>Mohammad Beigi</h3></td>
                <td><h3> 99102189</h3></td>
    </tr>
    <tr>
        <td><h3>Ostadi</h3></td>
                <td><h3> 99101058</h3></td>
    </tr>

  </table>
</div>



# GAN-BERT: Generative Adversarial Network for BERT

## Introduction

Generative Adversarial Networks (GANs) have shown success in generating realistic data. In the context of natural language processing, integrating GANs with pre-trained models like BERT can lead to enhanced language understanding and generation.

## GAN Overview

A GAN consists of two neural networks, a generator $G$ and a discriminator $D$, trained simultaneously through adversarial training. The generator aims to produce data that is indistinguishable from real data, while the discriminator aims to differentiate between real and generated data.

The training process involves a minimax game where the generator tries to maximize the probability of fooling the discriminator, and the discriminator tries to minimize the probability of misclassifying real and generated samples.
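
Written out, the minimax game described above is usually stated as the value function below. This is the standard GAN objective, not something specific to this project. The symbols $p_{\text{data}}$ (distribution of real samples) and $p_z$ (noise distribution fed to the generator) are notation introduced here for illustration.

$$
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right]
$$

The discriminator $D$ raises $V$ by scoring real samples near 1 and generated samples near 0, while the generator $G$ lowers $V$ by producing samples that $D$ scores near 1.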

## GAN-BERT Architecture

To apply GANs to BERT, we can use the generator to create augmented or synthesized text data. The discriminator, in turn, evaluates the authenticity of the combined real and generated data.

### Generator

The generator takes random noise or input data and transforms it to resemble real BERT-like text samples. It can inject diversity into the dataset, making the model more robust.

### Discriminator

The discriminator is a binary classifier that assesses the authenticity of a given text sample, determining whether it comes from the real dataset or is generated by the GAN.

## Training Process

The GAN-BERT model is trained in an adversarial manner. The generator aims to generate text that is difficult for the discriminator to classify correctly, while the discriminator evolves to better distinguish between real and generated text.
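
As a rough illustration of the two competing objectives, the sketch below computes a discriminator loss and a generator loss from discriminator scores, using only the standard library. The function names and the score values are hypothetical, and the generator uses the common non-saturating loss ($-\log D(G(z))$) rather than the literal minimax form. It is not the training code used in this notebook.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy for D: real samples should score 1, generated samples 0."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return -(real_term + fake_term)

def generator_loss(d_fake):
    """Non-saturating generator loss: G is rewarded when D scores its samples near 1."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

# Hypothetical discriminator outputs D(x) in (0, 1) for two real and two generated samples
d_real = [0.9, 0.8]
d_fake_early = [0.1, 0.2]   # D easily spots the generated samples
d_fake_later = [0.5, 0.5]   # D can no longer tell them apart

print("D loss (early, later):", discriminator_loss(d_real, d_fake_early), discriminator_loss(d_real, d_fake_later))
print("G loss (early, later):", generator_loss(d_fake_early), generator_loss(d_fake_later))
```

As the generated samples become harder to distinguish, the generator loss falls and the discriminator loss rises, which is the tension the adversarial training exploits.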

## Applications

The GAN-BERT model can be applied in various NLP tasks, including text augmentation, data synthesis, and enhancing the robustness of pre-trained BERT models.

## Conclusion

Integrating GANs with BERT opens up possibilities for improved language representation and understanding. The adversarial training process can lead to more versatile and reliable language models.


<h1> Load, Preprocess, Tokenize Dataset</h1>

<h2> Load Dataset </h2>

In [None]:
!pip install gdown

In [None]:
!gdown --folder https://drive.google.com/drive/folders/11YeloR2eTXcTzdwI04Z-M2QVvIeQAU6-

In [6]:
import json

import pandas as pd
import torch
from sklearn.metrics import accuracy_score, f1_score
from torch.nn import DataParallel
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
from torch.optim import AdamW
from torch.utils.data import DataLoader, Dataset
from tqdm import tqdm
from transformers import BertForSequenceClassification, BertTokenizer, get_linear_schedule_with_warmup

<h1>Dataset Information </h1>

In [7]:
# Paths to the JSONL files
train_path = '/kaggle/working/SubtaskB/subtaskB_train.jsonl'
dev_path = '/kaggle/working/SubtaskB/subtaskB_dev.jsonl'
# Load train dataset into DataFrame
with open(train_path, 'r') as file:
    train_data = [json.loads(line) for line in file]
train_df = pd.DataFrame(train_data)
# Load dev dataset into DataFrame
with open(dev_path, 'r') as file:
    dev_data = [json.loads(line) for line in file]
dev_df = pd.DataFrame(dev_data)
# Print info
print("Train Dataset Info:")
print(train_df.info())
print("\nDev Dataset Info:")
print(dev_df.info())
# Print summary
print("\nTrain Dataset Summary:")
print(train_df.describe())
print("\nDev Dataset Summary:")
print(dev_df.describe())
# Display samples
print("\nTrain Dataset Samples:")
print(train_df.head())
print("\nDev Dataset Samples:")
print(dev_df.head())

Train Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71027 entries, 0 to 71026
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    71027 non-null  object
 1   model   71027 non-null  object
 2   source  71027 non-null  object
 3   label   71027 non-null  int64 
 4   id      71027 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 2.7+ MB
None

Dev Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    3000 non-null   object
 1   model   3000 non-null   object
 2   source  3000 non-null   object
 3   label   3000 non-null   int64 
 4   id      3000 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 117.3+ KB
None

Train Dataset Summary:
              label            id
count  71027.000000  71027.000000
mean       2.494347  35513.000000
std  

In [8]:
import plotly.express as px

# Create a histogram of label counts for the Train Dataset
fig_train_label = px.histogram(train_df, x='label', title="Train Dataset Label Distribution")
fig_train_label.update_layout(xaxis_title='Label', yaxis_title='Count')

# Create a histogram of label counts for the Dev Dataset
fig_dev_label = px.histogram(dev_df, x='label', title="Dev Dataset Label Distribution")
fig_dev_label.update_layout(xaxis_title='Label', yaxis_title='Count')

# Display the plots
fig_train_label.show()
fig_dev_label.show()



<h1> Setup GPU CUDA </h1>

In [9]:

# Set the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


# Check if CUDA (GPU support) is available
if torch.cuda.is_available():
    # Get the number of available GPUs
    num_gpus = torch.cuda.device_count()
    
    # Print the available GPU devices
    for gpu_id in range(num_gpus):
        print(f"GPU {gpu_id}: {torch.cuda.get_device_name(gpu_id)}")
else:
    print("CUDA is not available. Using CPU.")


GPU 0: Tesla T4
GPU 1: Tesla T4


<h1> Tokenize, test tokenize </h1>

In [10]:
from transformers import BertTokenizer

# Load the tokenizer that matches the BERT checkpoint used for fine-tuning
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# BERT pads with its own [PAD] token (ID 0); pad on the right
tokenizer.padding_side = "right"

# Tokenize and check functionality
text = "This is an example sentence to tokenize."
tokens = tokenizer.encode(text, return_tensors='pt', max_length=256, truncation=True, padding='max_length')

# Decode tokens
decoded_text = tokenizer.decode(tokens.squeeze(), skip_special_tokens=True)

print("Tokenized Text:")
print(tokens)
print("\nDecoded Text:")
print(decoded_text)


Tokenized Text:
tensor([[  101,  2023,  2003,  2019,  2742,  6251,  2000, 19204,  4697,  1012,
           102,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,
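
The printed tensor is mostly zeros because `padding='max_length'` right-pads every sequence to 256 positions. The sketch below reproduces that padding and the matching attention mask in plain Python, using the eleven token IDs printed above. `pad_and_mask` is a hypothetical helper written for illustration. It simplifies truncation to a slice, whereas the real tokenizer keeps the closing `[SEP]` token when it truncates.

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Truncate or right-pad token_ids to max_length and build the matching attention mask."""
    kept = token_ids[:max_length]
    padding = [pad_id] * (max_length - len(kept))
    input_ids = kept + padding
    # 1 marks a real token, 0 marks padding that the model should ignore
    attention_mask = [1] * len(kept) + [0] * len(padding)
    return input_ids, attention_mask

# Token IDs printed above for "This is an example sentence to tokenize."
example_ids = [101, 2023, 2003, 2019, 2742, 6251, 2000, 19204, 4697, 1012, 102]
input_ids, attention_mask = pad_and_mask(example_ids, max_length=256)

print(len(input_ids), sum(attention_mask))
```

Only the positions marked 1 in the attention mask contribute to BERT's self-attention, so the padding does not change the prediction for a short text.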

<h1> Tokenize dataset </h1>

In [11]:
import torch
from tqdm import tqdm
from transformers import BertTokenizer

# Load a pre-trained tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def tokenize_data(data, tokenizer, max_length=256):
    """Add a 'tokenized_data' column holding input_ids and attention_mask for each text."""
    tokenized_data = []

    # Ensure that the 'text' column is of string type
    data['text'] = data['text'].astype(str)

    # Tokenization runs on the CPU, so the tensors are kept there;
    # batches are moved to the GPU later, inside the training loop
    for text in tqdm(data['text'], desc='Tokenizing data'):
        inputs = tokenizer(text, return_tensors='pt', max_length=max_length, truncation=True, padding='max_length')
        tokenized_data.append({
            'input_ids': inputs['input_ids'].squeeze(0),
            'attention_mask': inputs['attention_mask'].squeeze(0)
        })

    data['tokenized_data'] = tokenized_data
    return data

# Tokenize training data
tokenized_data_train = tokenize_data(data=train_df, tokenizer=tokenizer)

# Tokenize development data
tokenized_data_dev = tokenize_data(data=dev_df, tokenizer=tokenizer)


Tokenizing data: 100%|██████████| 71027/71027 [17:02<00:00, 69.43it/s] 
Tokenizing data: 100%|██████████| 3000/3000 [00:32<00:00, 91.46it/s] 


In [12]:
def get_sampled_data(data, sample_percentage):
    sampled_size = int(len(data) * (sample_percentage / 100))
    sampled_data = data.sample(sampled_size)
    
    labels = torch.tensor(sampled_data['label'].values, dtype=torch.long)

    # Convert the Series to a list before applying torch.stack
    input_ids = torch.stack(list(sampled_data['tokenized_data'].apply(lambda x: x['input_ids'].squeeze().clone().detach())))
    attention_mask = torch.stack(list(sampled_data['tokenized_data'].apply(lambda x: x['attention_mask'].squeeze().clone().detach())))

    return input_ids, attention_mask, labels

<h1> Tokenized samples print </h1>

In [13]:
def print_sample_examples(input_ids, attention_mask, labels, num_examples=5):
    for i in range(min(num_examples, len(input_ids))):
        print(f"Example {i + 1}:")
        print(f"input_ids: {input_ids[i]}")
        print(f"attention_mask: {attention_mask[i]}")
        print(f"label: {labels[i]}")
        
        print("=" * 50)

# Assuming you have obtained input_ids, attention_mask, and labels using get_sampled_data
sample_percentage = 10
input_ids, attention_mask, labels = get_sampled_data(data=train_df, sample_percentage=sample_percentage)

# Print some sample examples
print_sample_examples(input_ids, attention_mask, labels, num_examples=5)


Example 1:
input_ids: tensor([  101,  1996,  2263,  2103,  1997, 15435,  4278,  2001,  1996,  2117,
         2461,  1997,  3868,  1999,  1996, 15754,  3565, 10010,  2528,  1010,
         2218,  2006,  5095,  2385,  2337,  2012,  5322, 23018,  2379, 15435,
         1010,  2660,  1012,  1996,  2679,  2003,  2036,  2124,  2004,  1000,
         1996,  2502, 24419,  2091,  2104,  1000,  1998,  2038,  2042,  2448,
         6604,  2144,  2384,  1012,  2009,  2956,  2048,  3837,  1025,  2028,
         6042,  5219,  2628,  2011,  2093,  6241,  1997, 18559,  2077,  4399,
         2020,  7259,  2058,  2382, 10876,  2169,  1012,  8263,   102,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,

In [16]:
from torch.utils.data import Dataset

class CustomDataset(Dataset):
    def __init__(self, input_ids, attention_mask, labels):
        self.input_ids = input_ids
        self.attention_mask = attention_mask
        self.labels = labels

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        label = (self.labels[idx]).clone().detach()

        return {
            'input_ids': self.input_ids[idx],
            'attention_mask': self.attention_mask[idx],
            'label': label
        }


In [17]:
import torch

# Clear cached memory on every visible GPU (this machine exposes GPU 0 and GPU 1)
for gpu_id in range(torch.cuda.device_count()):
    try:
        with torch.cuda.device(gpu_id):
            torch.cuda.empty_cache()
        print(f"Cleared memory on GPU {gpu_id}")
    except Exception as e:
        print(f"Error clearing memory on GPU {gpu_id}: {e}")


Cleared memory on GPU 0
Cleared memory on GPU 1


In [18]:
!pip install GPUtil

import gc

import torch
from GPUtil import showUtilization as gpu_usage
from numba import cuda

gpu_usage()


| ID | GPU | MEM |
------------------
|  0 |  0% |  1% |
|  1 |  0% |  0% |


In [19]:


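def free_gpu_cache():
    print("Initial GPU Usage")
    gpu_usage()

    # Release cached, unused blocks held by PyTorch's allocator
    torch.cuda.empty_cache()

    # Reset the CUDA context on GPU 0 through numba, then select the device again
    cuda.select_device(0)
    cuda.close()
    cuda.select_device(0)
    torch.cuda.empty_cache()
    print("GPU Usage after emptying the cache")
    gpu_usage()

free_gpu_cache()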


Initial GPU Usage
| ID | GPU | MEM |
------------------
|  0 |  0% |  1% |
|  1 |  0% |  0% |
GPU Usage after emptying the cache
| ID | GPU | MEM |
------------------
|  0 |  2% |  1% |
|  1 |  0% |  0% |


# Impact of Increasing Training Set Size When Fine-Tuning BERT for Classification

## Effects of Increasing the Training Set Size (i.e. the labeled percentage of the data)

### 1. **Better Use of Model Capacity:**
   - A larger training set provides more diverse examples, allowing the model to learn more complex patterns and relationships in the data.
   - The capacity of the model is fixed by its architecture; more data lets that capacity go toward generalizing to unseen examples instead of fitting a small sample.

### 2. **Improved Robustness:**
   - A larger dataset exposes the model to a broader range of inputs, making it more robust to variation and noise in the input data.

### 3. **Reduced Overfitting:**
   - With more data, the model is less likely to memorize specific training examples (overfitting) and more likely to capture the underlying patterns.

### 4. **Enhanced Learning of Task-Specific Features:**
   - More labeled examples allow the model to learn finer features specific to the classification task, which can improve task performance.

### 5. **Higher Computational Cost:**
   - Training time per epoch grows with the number of samples, so available memory and processing power become limiting factors.

## Conclusion

Increasing the size of the training set when fine-tuning BERT for classification generally improves the model's performance, generalization, and robustness. These gains have to be weighed against the additional compute, and overfitting should still be monitored, especially at the smallest sample percentages. The choice of dataset size should match the requirements and constraints of the classification task.
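
To make the computational trade-off concrete, the sketch below estimates the time per epoch for different labeled percentages. It assumes a constant per-batch time of about 1.18 s, read off the progress bars of the training run in this notebook (batch size 16, two Tesla T4 GPUs). `estimate_epoch_minutes` is a hypothetical helper, and the 100% row is an extrapolation, not a measured result.

```python
import math

def estimate_epoch_minutes(num_samples, batch_size, seconds_per_batch):
    """Estimate wall-clock minutes for one epoch from a measured per-batch time."""
    num_batches = math.ceil(num_samples / batch_size)
    return num_batches, num_batches * seconds_per_batch / 60

TRAIN_SIZE = 71027        # rows in the training DataFrame
SECONDS_PER_BATCH = 1.18  # assumed constant; taken from this notebook's progress bars

for percentage in (1, 5, 10, 100):
    n = int(TRAIN_SIZE * percentage / 100)
    batches, minutes = estimate_epoch_minutes(n, batch_size=16, seconds_per_batch=SECONDS_PER_BATCH)
    print(f"{percentage:>3}% -> {n} samples, {batches} batches, ~{minutes:.1f} min/epoch")
```

Under this assumption the cost scales linearly with the labeled percentage, which is why the experiments in this notebook stop at 10% of the data.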


In [20]:
import gc

import torch
from torch.utils.data import DataLoader
from sklearn.metrics import accuracy_score, f1_score
from tqdm import tqdm
import plotly.graph_objects as go

def fine_tune_bert(train_dataset, dev_dataset, model, batch_size=16, epochs=5, sample_percentages=(1, 5, 10)):
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    
    # Wrap the model with DataParallel for multi-GPU training
    if torch.cuda.device_count() > 1:
        print("Using", torch.cuda.device_count(), "GPUs!")
        model = torch.nn.DataParallel(model, device_ids=[0, 1])

    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    criterion = torch.nn.CrossEntropyLoss()
    results = {'loss': {percentage: [] for percentage in sample_percentages},
               'accuracy': {percentage: [] for percentage in sample_percentages},
               'f1_score': {percentage: [] for percentage in sample_percentages}}

    fig_loss = go.Figure()
    fig_accuracy = go.Figure()
    fig_f1_score = go.Figure()

    # NOTE: the same model and optimizer are reused for every percentage, so the 5% and 10%
    # runs continue from the weights reached in the previous run instead of starting fresh.
    for percentage in sample_percentages:
        sampled_size = int(len(train_dataset) * (percentage / 100))

        # Sample the training subset
        input_ids, attention_mask, labels = get_sampled_data(train_dataset, sample_percentage=percentage)
        sampled_train_dataset = CustomDataset(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        sampled_train_loader = DataLoader(sampled_train_dataset, batch_size=batch_size, shuffle=True, num_workers=2)

        # The dev set is subsampled at the same percentage (1% of 3000 is only 30 examples)
        input_ids, attention_mask, labels = get_sampled_data(dev_dataset, sample_percentage=percentage)
        sampled_dev_dataset = CustomDataset(input_ids=input_ids, attention_mask=attention_mask, labels=labels)

        for epoch in range(epochs):
            model.train()
            running_loss = 0.0

            # Print the number of samples for each training iteration
            print(f"Training with {percentage}% of the dataset which is {sampled_size}. Epoch {epoch + 1}/{epochs}")

            for batch in tqdm(sampled_train_loader, desc=f'Epoch {epoch + 1}/{epochs}'):
                inputs = batch['input_ids'].to(device)
                attention_mask = batch['attention_mask'].to(device)
                labels = batch['label'].to(device)

                optimizer.zero_grad()

                outputs = model(inputs, attention_mask=attention_mask)
                loss = criterion(outputs.logits, labels)
                loss.backward()
                optimizer.step()

                running_loss += loss.item()

                # Manually delete tensors to free up GPU memory
                del inputs, attention_mask, labels, outputs
                gc.collect()
                torch.cuda.empty_cache()

            avg_loss = running_loss / len(sampled_train_loader)
            results['loss'][percentage].append(avg_loss)

            # Evaluate on the dev set
            model.eval()
            all_preds = []
            all_labels = []

            dev_loader = DataLoader(sampled_dev_dataset, batch_size=batch_size, shuffle=False, num_workers=4)

            with torch.no_grad():
                for dev_batch in tqdm(dev_loader, desc=f'Evaluation - Epoch {epoch + 1}/{epochs}'):
                    dev_inputs = dev_batch['input_ids'].to(device)
                    dev_attention_mask = dev_batch['attention_mask'].to(device)
                    dev_labels = dev_batch['label'].to(device)

                    dev_outputs = model(dev_inputs, attention_mask=dev_attention_mask)
                    preds = torch.argmax(dev_outputs.logits, dim=1).cpu().numpy()

                    all_preds.extend(preds)
                    all_labels.extend(dev_labels.cpu().numpy())

                # Manually delete tensors to free up GPU memory
                del dev_inputs, dev_attention_mask, dev_labels, dev_outputs

            accuracy = accuracy_score(all_labels, all_preds)
            f1 = f1_score(all_labels, all_preds, average='weighted')

            results['accuracy'][percentage].append(accuracy)
            results['f1_score'][percentage].append(f1)


            torch.cuda.empty_cache()
    # Plotting
    for percentage in sample_percentages:
        fig_loss.add_trace(go.Scatter(x=list(range(epochs)), y=results['loss'][percentage], mode='lines', name=f'{percentage}%'))
        fig_accuracy.add_trace(go.Scatter(x=list(range(epochs)), y=results['accuracy'][percentage], mode='lines', name=f'{percentage}%'))
        fig_f1_score.add_trace(go.Scatter(x=list(range(epochs)), y=results['f1_score'][percentage], mode='lines', name=f'{percentage}%'))

    # Update layout
    fig_loss.update_layout(title='Loss vs Epochs', xaxis_title='Epoch', yaxis_title='Loss', showlegend=True)
    fig_accuracy.update_layout(title='Accuracy vs Epochs', xaxis_title='Epoch', yaxis_title='Accuracy', showlegend=True)
    fig_f1_score.update_layout(title='F1 Score vs Epochs', xaxis_title='Epoch', yaxis_title='F1 Score', showlegend=True)

    # Show the plots
    fig_loss.show()
    fig_accuracy.show()
    fig_f1_score.show()

num_classes = 6
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=num_classes)

# Example usage
fine_tune_bert(train_dataset=tokenized_data_train, dev_dataset=tokenized_data_dev, model=model, batch_size=16, epochs=3)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Using 2 GPUs!
Training with 1% of the dataset which is 710. Epoch 1/3


Epoch 1/3: 100%|██████████| 45/45 [00:54<00:00,  1.21s/it]
Evaluation - Epoch 1/3: 100%|██████████| 2/2 [00:00<00:00,  3.59it/s]


Training with 1% of the dataset which is 710. Epoch 2/3


Epoch 2/3: 100%|██████████| 45/45 [00:53<00:00,  1.18s/it]
Evaluation - Epoch 2/3: 100%|██████████| 2/2 [00:00<00:00,  3.51it/s]


Training with 1% of the dataset which is 710. Epoch 3/3


Epoch 3/3: 100%|██████████| 45/45 [00:53<00:00,  1.18s/it]
Evaluation - Epoch 3/3: 100%|██████████| 2/2 [00:00<00:00,  3.53it/s]


Training with 5% of the dataset which is 3551. Epoch 1/3


Epoch 1/3: 100%|██████████| 222/222 [04:23<00:00,  1.19s/it]
Evaluation - Epoch 1/3: 100%|██████████| 10/10 [00:01<00:00,  5.56it/s]


Training with 5% of the dataset which is 3551. Epoch 2/3


Epoch 2/3: 100%|██████████| 222/222 [04:22<00:00,  1.18s/it]
Evaluation - Epoch 2/3: 100%|██████████| 10/10 [00:01<00:00,  5.47it/s]


Training with 5% of the dataset which is 3551. Epoch 3/3


Epoch 3/3: 100%|██████████| 222/222 [04:23<00:00,  1.19s/it]
Evaluation - Epoch 3/3: 100%|██████████| 10/10 [00:01<00:00,  5.49it/s]


Training with 10% of the dataset which is 7102. Epoch 1/3


Epoch 1/3: 100%|██████████| 444/444 [08:44<00:00,  1.18s/it]
Evaluation - Epoch 1/3: 100%|██████████| 19/19 [00:03<00:00,  5.76it/s]


Training with 10% of the dataset which is 7102. Epoch 2/3


Epoch 2/3: 100%|██████████| 444/444 [08:47<00:00,  1.19s/it]
Evaluation - Epoch 2/3: 100%|██████████| 19/19 [00:03<00:00,  5.72it/s]


Training with 10% of the dataset which is 7102. Epoch 3/3


Epoch 3/3: 100%|██████████| 444/444 [08:44<00:00,  1.18s/it]
Evaluation - Epoch 3/3: 100%|██████████| 19/19 [00:03<00:00,  5.71it/s]
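
The evaluation loop above reports `f1_score(..., average='weighted')`. As a reminder of what that averaging does, the sketch below computes the same quantity by hand in plain Python: a per-class F1 score, averaged with weights proportional to how often each class occurs in the true labels. `weighted_f1` is a hypothetical helper and the labels are made up for illustration; they are not results from this notebook.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to each class's support in y_true."""
    support = Counter(y_true)
    total = 0.0
    for cls, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        n_pred = sum(1 for p in y_pred if p == cls)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += f1 * n_true
    return total / len(y_true)

# Hypothetical labels for illustration only
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
print(weighted_f1(y_true, y_pred))
```

Because the weights follow class frequency, frequent classes dominate the score. On a small dev sample, a handful of misclassified examples can therefore move the metric noticeably.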


<h1>Using Adapters</h1>


In [None]:
!pip install -U adapters

In [29]:
from adapters import AutoAdapterModel
from transformers import AutoTokenizer

model = AutoAdapterModel.from_pretrained("roberta-base")
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

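# NOTE: "AdapterHub/bert" is not an existing Hub repository, so this call raises the
# RepositoryNotFoundError shown below; it needs the ID of an adapter trained for roberta-base.
model.load_adapter("AdapterHub/bert", source="hf", set_active=True)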

print(model(**tokenizer("This works great!", return_tensors="pt")).logits)

Some weights of RobertaAdapterModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['heads.default.3.bias', 'roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


RepositoryNotFoundError: 401 Client Error. (Request ID: Root=1-65befe13-7d54cabe0204151e4a0d222d;e031307b-0a47-4e69-ba23-209ee893c1bb)

Repository Not Found for url: https://huggingface.co/api/models/AdapterHub/bert/revision/main.
Please make sure you specified the correct `repo_id` and `repo_type`.
If you are trying to access a private or gated repo, make sure you are authenticated.
Invalid username or password.

In [None]:
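import adapters
from transformers import AutoModelForSequenceClassification

# "bert" alone is not a valid checkpoint name; use the same checkpoint as the rest of the notebook
model1 = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")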

In [None]:
import torch
from torch.utils.data import DataLoader
from sklearn.metrics import accuracy_score, f1_score
from tqdm import tqdm
import plotly.graph_objects as go
from transformers import BertForSequenceClassification

import adapters


def fine_tune_bert_with_adapters(train_dataset, dev_dataset, model, batch_size=16, epochs=5, sample_percentages=(1, 5, 10)):
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

    # Add adapter support to the plain transformers model, then insert and activate a
    # bottleneck adapter. This has to happen before the DataParallel wrap and before the
    # optimizer is created. The "seq_bn" config is an assumed choice.
    adapters.init(model)
    model.add_adapter("adapter", config="seq_bn")
    model.train_adapter("adapter")  # freezes the pre-trained encoder weights
    model = model.to(device)

    # Wrap the model with DataParallel for multi-GPU training
    if torch.cuda.device_count() > 1:
        print("Using", torch.cuda.device_count(), "GPUs!")
        model = torch.nn.DataParallel(model, device_ids=[0, 1])

    # Optimize only the parameters that train_adapter left trainable
    optimizer = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=2e-5)
    criterion = torch.nn.CrossEntropyLoss()
    results = {'loss': {percentage: [] for percentage in sample_percentages},
               'accuracy': {percentage: [] for percentage in sample_percentages},
               'f1_score': {percentage: [] for percentage in sample_percentages}}

    fig_loss = go.Figure()
    fig_accuracy = go.Figure()
    fig_f1_score = go.Figure()

    # NOTE: the same adapter is reused for every percentage, so later runs continue
    # from the weights reached in the previous run.
    for percentage in sample_percentages:
        sampled_size = int(len(train_dataset) * (percentage / 100))

        # Get sampled data
        input_ids, attention_mask, labels = get_sampled_data(train_dataset, sample_percentage=percentage)

        sampled_train_dataset = CustomDataset(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        sampled_train_loader = DataLoader(sampled_train_dataset, batch_size=batch_size, shuffle=True, num_workers=8)
        input_ids, attention_mask, labels = get_sampled_data(dev_dataset, sample_percentage=percentage)
        sampled_dev_dataset = CustomDataset(input_ids=input_ids, attention_mask=attention_mask, labels=labels)

        for epoch in range(epochs):
            model.train()
            running_loss = 0.0

            # Print the number of samples for each training iteration
            print(f"Training with {percentage}% of the dataset which is {sampled_size}. Epoch {epoch + 1}/{epochs}")

            for batch in tqdm(sampled_train_loader, desc=f'Epoch {epoch + 1}/{epochs}'):
                inputs = batch['input_ids'].to(device)
                attention_mask = batch['attention_mask'].to(device)
                labels = batch['label'].to(device)

                optimizer.zero_grad()

                # The adapter activated by train_adapter is applied in the forward pass
                outputs = model(inputs, attention_mask=attention_mask)
                loss = criterion(outputs.logits, labels)
                loss.backward()
                optimizer.step()

                running_loss += loss.item()

                # Manually delete tensors to free up GPU memory
                del inputs, attention_mask, labels, outputs
                torch.cuda.empty_cache()

            avg_loss = running_loss / len(sampled_train_loader)
            results['loss'][percentage].append(avg_loss)

            # Evaluate on the dev set
            model.eval()
            all_preds = []
            all_labels = []

            dev_loader = DataLoader(sampled_dev_dataset, batch_size=batch_size, shuffle=False, num_workers=4)

            with torch.no_grad():
                for dev_batch in tqdm(dev_loader, desc=f'Evaluation - Epoch {epoch + 1}/{epochs}'):
                    dev_inputs = dev_batch['input_ids'].to(device)
                    dev_attention_mask = dev_batch['attention_mask'].to(device)
                    dev_labels = dev_batch['label'].to(device)

                    dev_outputs = model(dev_inputs, attention_mask=dev_attention_mask)
                    preds = torch.argmax(dev_outputs.logits, dim=1).cpu().numpy()

                    all_preds.extend(preds)
                    all_labels.extend(dev_labels.cpu().numpy())

                # Manually delete tensors to free up GPU memory
                del dev_inputs, dev_attention_mask, dev_labels, dev_outputs

            accuracy = accuracy_score(all_labels, all_preds)
            f1 = f1_score(all_labels, all_preds, average='weighted')

            results['accuracy'][percentage].append(accuracy)
            results['f1_score'][percentage].append(f1)

            torch.cuda.empty_cache()

    # Plotting: one trace per sample percentage
    for percentage in sample_percentages:
        fig_loss.add_trace(go.Scatter(x=list(range(epochs)), y=results['loss'][percentage], mode='lines', name=f'{percentage}%'))
        fig_accuracy.add_trace(go.Scatter(x=list(range(epochs)), y=results['accuracy'][percentage], mode='lines', name=f'{percentage}%'))
        fig_f1_score.add_trace(go.Scatter(x=list(range(epochs)), y=results['f1_score'][percentage], mode='lines', name=f'{percentage}%'))

    # Update layout
    fig_loss.update_layout(title='Loss vs Epochs', xaxis_title='Epoch', yaxis_title='Loss', showlegend=True)
    fig_accuracy.update_layout(title='Accuracy vs Epochs', xaxis_title='Epoch', yaxis_title='Accuracy', showlegend=True)
    fig_f1_score.update_layout(title='F1 Score vs Epochs', xaxis_title='Epoch', yaxis_title='F1 Score', showlegend=True)

    # Show the plots
    fig_loss.show()
    fig_accuracy.show()
    fig_f1_score.show()

num_classes = 6
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=num_classes)

# Example usage
fine_tune_bert_with_adapters(train_dataset=tokenized_data_train, dev_dataset=tokenized_data_dev, model=model, batch_size=16, epochs=3)
