<a href="https://colab.research.google.com/github/marco-siino/GM_SOURCE_CODE/blob/main/NVIDIA_DS/GM_NVIDIA_DistilBERT_Zero_Shot_Fine_Tuned_multiclass_classification_MSiino.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine Tuning Transformer for MultiClass Text Classification of Source Code. Notebook by Marco Siino et al.

# Importing Python Libraries and preparing the environment

In this step we import the libraries and modules needed to run the notebook:
* Pandas
* PyTorch
* PyTorch utilities for Dataset and DataLoader
* Transformers
* DistilBERT model and tokenizer
* scikit-learn `KFold` for cross-validation

We then prepare the device for CUDA execution. This configuration is needed if you want to use an onboard GPU.

In [1]:
# Importing the libraries needed
import pandas as pd
import torch
import transformers
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer, BertForMaskedLM, BertModel, DistilBertModel, DistilBertTokenizer
from sklearn.model_selection import KFold



In [2]:
# Setting up the device for GPU usage

from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'

# Importing and Pre-Processing the domain data
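
As a minimal sketch of the preprocessing done in the next cell, the snippet below builds the model input by prefixing the kernel source with its `transfer` and `wgsize` values and maps the `oracle` label to an integer. The row values and the helper names `build_input` and `encode_label` are hypothetical and only illustrate the transformation; the notebook itself does this with pandas.

```python
def build_input(transfer, wgsize, src):
    """Prefix the kernel source with the transfer size and work-group size."""
    return f"{transfer} - {wgsize} - {src}"


def encode_label(oracle):
    """Map the oracle device to a class id: GPU -> 1, anything else -> 0."""
    return 1 if oracle == "GPU" else 0


# Hypothetical row, shaped like the rows of dataset-devmap-nvidia.csv.
row = {"transfer": 2048, "wgsize": 255, "src": "__kernel void A(int a) {}", "oracle": "GPU"}

text = build_input(row["transfer"], row["wgsize"], row["src"])
label = encode_label(row["oracle"])
print(text)   # 2048 - 255 - __kernel void A(int a) {}
print(label)  # 1
```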

In [3]:
# Import the csv into pandas dataframe and add the headers

df = pd.read_csv('dataset-devmap-nvidia.csv',sep=',')
#df = pd.concat(map(pd.read_csv, ['dataset-devmap-nvidia.csv', 'dataset-devmap-amd.csv']))
#df = pd.read_csv('dataset-devmap-nvidia.csv', sep='\t', names=['benchmark','dataset','comp','rational','mem','localmem','coalesced','atomic','transfer','wgsize','oracle','runtime_cpu','runtime_gpu','src','seq'])
#print(df.head())
# Prepend the transfer and wgsize values to the src column.
df['src'] = df['transfer'].astype(str) +" - "+ df['wgsize'].astype(str) +" - "+df["src"]
# Drop the unwanted columns, keeping only the source text (src) and the target label (oracle)
df = df[['src','oracle']]
#print(df.head())

encode_dict = {}

def encode_cat(x):
    if x == "GPU":
      encode_dict[x]=1
    else:
      encode_dict[x]=0
    return encode_dict[x]

df['ENCODE_CAT'] = df['oracle'].apply(lambda x: encode_cat(x))

print(df)

df = df.sample(frac=1, random_state=1).reset_index(drop=True)

print(df)

                                                   src oracle  ENCODE_CAT
0    2048 - 255 - __kernel void A(int a, const __gl...    GPU           1
1    131072 - 256 - __kernel void A(__global uint* ...    GPU           1
2    3145728 - 256 - extern void B(float4 a, float4...    GPU           1
3    4096 - 256 - __kernel void A(__global float* a...    GPU           1
4    524288 - 256 - __kernel void A(__global uint* ...    CPU           0
..                                                 ...    ...         ...
675  2000628 - 128 - __kernel void A(__global const...    CPU           0
676  2000628 - 128 - __kernel void A(__global const...    CPU           0
677  71647488 - 0 - extern int D(__private int, __p...    CPU           0
678  71647488 - 256 - extern int B(int, int, int);\...    CPU           0
679  117440512 - 128 - __kernel void A(__global con...    CPU           0

[680 rows x 3 columns]
                                                   src oracle  ENCODE_CAT
0    6346800 -

# Preparing the Dataset and Dataloader

We start by defining a few key variables that will be used later during the training/fine tuning stage.
We then create the Dataset class, which defines how the text is pre-processed before it is sent to the neural network. We also define the Dataloader that feeds the data to the neural network in batches for training and processing.
Dataset and Dataloader are constructs of the PyTorch library for defining and controlling the data pre-processing and its passage to the neural network. For further reading on Dataset and Dataloader, see the [docs at PyTorch](https://pytorch.org/docs/stable/data.html)

#### *Triage* Dataset Class
- This class accepts the DataFrame as input and generates the tokenized output that the DistilBERT model uses for training.
- We use the DistilBERT tokenizer to tokenize the data in the `src` column of the dataframe.
- The tokenizer's `encode_plus` method performs the tokenization and generates the necessary outputs, namely `ids` and `attention_mask`.
- To read further into the tokenizer, [refer to this document](https://huggingface.co/transformers/model_doc/distilbert.html#distilberttokenizer)
- `targets` is the encoded `oracle` label (`ENCODE_CAT`): 1 for GPU, 0 for CPU.
- The *Triage* class is used to create 2 datasets per fold, one for training and one for validation.
- *Training Dataset* is used to fine tune the model: **80% of the original data**
- *Validation Dataset* is used to evaluate the performance of the model: the remaining 20%, which the model has not seen during training.
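
To make the `ids` and `attention_mask` outputs concrete, here is a rough sketch of truncation and padding to a fixed length. It uses a toy whitespace tokenizer and a hypothetical vocabulary; the real DistilBERT tokenizer splits text into WordPiece tokens and adds special tokens, so the actual ids will differ.

```python
def toy_encode(text, vocab, max_len, pad_id=0, unk_id=1):
    """Toy stand-in for a tokenizer: whitespace split, truncate, then pad."""
    tokens = text.split()[:max_len]                   # truncation
    ids = [vocab.get(tok, unk_id) for tok in tokens]  # token -> id
    mask = [1] * len(ids)                             # 1 marks real tokens
    padding = max_len - len(ids)
    ids += [pad_id] * padding                         # pad up to max_len
    mask += [0] * padding                             # 0 marks padding
    return {"input_ids": ids, "attention_mask": mask}


# Hypothetical vocabulary, for illustration only.
vocab = {"__kernel": 2, "void": 3, "A": 4}
encoded = toy_encode("__kernel void A ( )", vocab, max_len=8)
print(encoded["input_ids"])       # [2, 3, 4, 1, 1, 0, 0, 0]
print(encoded["attention_mask"])  # [1, 1, 1, 1, 1, 0, 0, 0]
```

The attention mask tells the model which positions hold real tokens, so the padding does not influence the prediction.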

#### Dataloader
- The Dataloader is used to create the training and validation dataloaders that feed data to the neural network in a defined manner. This is needed because all the data from the dataset cannot be loaded into memory at once, so the amount of data loaded into memory and then passed to the neural network needs to be controlled.
- This control is achieved with parameters such as `batch_size` (samples per step) and `max_len` (maximum number of tokens per sample, applied in the Dataset class).
- The training and validation dataloaders are used in the training and validation parts of the flow, respectively.
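
The batching behaviour can be illustrated with a small, self-contained sketch. `ToyDataset` is a hypothetical stand-in for the *Triage* class and holds 10 integers instead of tokenized source code; it only shows how `batch_size` splits a dataset into steps.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToyDataset(Dataset):
    """Hypothetical dataset of n integer samples, for illustration only."""

    def __init__(self, n):
        self.values = torch.arange(n)

    def __len__(self):
        return len(self.values)

    def __getitem__(self, index):
        # Return a dict, like the Triage class does.
        return {"ids": self.values[index]}


loader = DataLoader(ToyDataset(10), batch_size=4, shuffle=False)
batch_sizes = [batch["ids"].size(0) for batch in loader]
print(batch_sizes)  # [4, 4, 2]
```

The last batch is smaller whenever the dataset size is not a multiple of `batch_size`.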

In [4]:
# REPLACEMENT MCROCK
# Defining some key variables that will be used later on in the training
MAX_LEN = 512
TRAIN_BATCH_SIZE = 4
VALID_BATCH_SIZE = 2
EPOCHS = 10
LEARNING_RATE = 1e-05
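# NOTE: the underscore name below is assumed to be a local directory; the Hugging Face Hub id is 'distilbert-base-uncased'
tokenizer = DistilBertTokenizer.from_pretrained('distilbert_base_uncased')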

In [5]:
class Triage(Dataset):
    def __init__(self, dataframe, tokenizer, max_len):
        self.len = len(dataframe)
        self.data = dataframe
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __getitem__(self, index):
        title = str(self.data.src[index])
        title = " ".join(title.split())
        inputs = self.tokenizer.encode_plus(
            title,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',  # replaces the deprecated pad_to_max_length=True
            return_token_type_ids=True,
            truncation=True
        )
        ids = inputs['input_ids']
        mask = inputs['attention_mask']

        return {
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),
            'targets': torch.tensor(self.data.ENCODE_CAT[index], dtype=torch.long)
        }

    def __len__(self):
        return self.len

# Creating the Transformer for Fine Tuning
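
Before the full model definition, here is a small sketch of what the classification head does to the encoder output. A random tensor stands in for DistilBERT's last hidden state (an assumption made so the snippet runs without downloading any weights); the layer sizes follow the class defined in the next cell.

```python
import torch

torch.manual_seed(0)
batch_size, seq_len, hidden = 2, 16, 768

# Random tensor standing in for DistilBERT's last hidden state.
hidden_state = torch.randn(batch_size, seq_len, hidden)

pre_classifier = torch.nn.Linear(hidden, hidden)
dropout = torch.nn.Dropout(0.3)
classifier = torch.nn.Linear(hidden, 2)

pooled = hidden_state[:, 0]                  # first-token vector, shape (2, 768)
pooled = torch.relu(pre_classifier(pooled))
pooled = dropout(pooled)
logits = classifier(pooled)                  # one score per class, shape (2, 2)
print(logits.shape)  # torch.Size([2, 2])
```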


In [6]:
# Creating the customized model by adding a dropout and a dense layer on top of DistilBERT to produce the final output.

class DistillBERTClass(torch.nn.Module):
    def __init__(self):
        super(DistillBERTClass, self).__init__()
        self.l1 = DistilBertModel.from_pretrained("distilbert_base_uncased")
        self.pre_classifier = torch.nn.Linear(768, 768)
        self.dropout = torch.nn.Dropout(0.3)
        self.classifier = torch.nn.Linear(768, 2)

    def forward(self, input_ids, attention_mask):
        output_1 = self.l1(input_ids=input_ids, attention_mask=attention_mask)
        hidden_state = output_1[0]
        pooler = hidden_state[:, 0]
        pooler = self.pre_classifier(pooler)
        pooler = torch.nn.ReLU()(pooler)
        pooler = self.dropout(pooler)
        output = self.classifier(pooler)
        return output


# Generate the 5 fold objects.
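
The following sketch shows which row indices end up in each fold when the split is not shuffled, as in the cell below. `contiguous_kfold` is a hypothetical pure-Python helper written for illustration; the notebook uses scikit-learn's `KFold`. Since the dataframe was already shuffled once with a fixed seed, contiguous folds still contain a random mix of rows.

```python
def contiguous_kfold(n_samples, n_splits):
    """Yield (train_indices, test_indices) for unshuffled K-fold splitting."""
    # The first n_samples % n_splits folds get one extra sample.
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# 680 rows and 5 folds, as in this notebook.
splits = list(contiguous_kfold(680, 5))
for i, (train, test) in enumerate(splits):
    # Fold number, training size, test size, first and last test index.
    print(i, len(train), len(test), test[0], test[-1])
```

Each fold trains on 544 rows and tests on the remaining 136.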

In [7]:
# Each i-train fold can be accessed with df_train[i]. Same for test.
fold_nr = 5

kf = KFold(n_splits=fold_nr, random_state=None, shuffle=False)

df_train = []
df_test = []
model = []

for i, (train_index, test_index) in enumerate(kf.split(df)):
  df_train.append(df.iloc[train_index])
  df_test.append(df.iloc[test_index])
# print(df_train[0])

for i in range(0,fold_nr):
  df_train[i] = df_train[i].reset_index(drop=True)
  df_test[i] = df_test[i].reset_index(drop=True)
  # Generate a different model for each fold.
  model.append(DistillBERTClass())
  model[i].to(device)


In [8]:
# Creating the dataset and dataloader

#train_size = 0.8
#train_dataset=df.sample(frac=train_size,random_state=200)
#test_dataset=df.drop(train_dataset.index).reset_index(drop=True)
#train_dataset = train_dataset.reset_index(drop=True)


#print("FULL Dataset: {}".format(df.shape))
#print("TRAIN Dataset: {}".format(train_dataset.shape))
#print("TEST Dataset: {}".format(test_dataset.shape))

#training_set = Triage(train_dataset, tokenizer, MAX_LEN)
#testing_set = Triage(test_dataset, tokenizer, MAX_LEN)

training_set = []
testing_set = []
training_loader = []
testing_loader = []


train_params = {'batch_size': TRAIN_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

test_params = {'batch_size': VALID_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }


for i in range(0,fold_nr):
  training_set.append(Triage(df_train[i], tokenizer, MAX_LEN))
  testing_set.append(Triage(df_test[i], tokenizer, MAX_LEN))
  training_loader.append(DataLoader(training_set[i], **train_params))
  testing_loader.append(DataLoader(testing_set[i], **test_params))


# Calculate accuracy
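
As a small worked example, accuracy is accumulated across batches by counting correct predictions and dividing by the number of examples seen. The predictions and labels below are hypothetical, and `count_correct` is a plain-Python stand-in for the tensor-based function in the next cell.

```python
def count_correct(predictions, targets):
    """Number of positions where the predicted class equals the target."""
    return sum(int(p == t) for p, t in zip(predictions, targets))


# Hypothetical (predictions, targets) pairs for three batches of size 2.
batches = [([1, 0], [1, 1]), ([0, 0], [0, 0]), ([1, 1], [0, 1])]

n_correct = 0
n_examples = 0
for predictions, targets in batches:
    n_correct += count_correct(predictions, targets)
    n_examples += len(targets)

accuracy = n_correct * 100 / n_examples
print(f"{accuracy:.2f}%")  # 66.67%
```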

In [9]:
# Function that counts the correct predictions in a batch (used to compute accuracy)

def calcuate_accu(big_idx, targets):
    n_correct = (big_idx==targets).sum().item()
    return n_correct

In [10]:
# Creating the loss function and optimizer
optimizer = []
loss_function = torch.nn.CrossEntropyLoss()
for i in range(0,fold_nr):
  optimizer.append(torch.optim.Adam(params =  model[i].parameters(), lr=LEARNING_RATE))

# Define the Fine Tuning of the Model

After all the effort of loading and preparing the data and datasets, creating the model, and defining its loss and optimizer, this is probably the easiest step in the process.

Here we define a training function that trains the model on the training dataset created above a specified number of times (`EPOCHS`). The number of epochs defines how many times the complete data is passed through the network.

The following events happen in this function to fine tune the neural network:
- The dataloader passes data to the model based on the batch size.
- The model output and the actual category are compared to calculate the loss.
- The loss value is used to optimize the weights of the neurons in the network.
- Every 5000 steps the running loss and accuracy are printed to the console.
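
A quick back-of-the-envelope check shows how often that periodic print fires with this dataset. The numbers below come from the notebook (680 rows, 5 folds, training batch size 4); `LOG_EVERY` is a hypothetical name for the 5000-step interval.

```python
import math

n_rows = 680                                 # rows in the dataset
fold_nr = 5
train_rows = n_rows - n_rows // fold_nr      # 544 rows per training fold
TRAIN_BATCH_SIZE = 4
LOG_EVERY = 5000

steps_per_epoch = math.ceil(train_rows / TRAIN_BATCH_SIZE)
logged_steps = [step for step in range(steps_per_epoch) if step % LOG_EVERY == 0]

print(steps_per_epoch)  # 136
print(logged_steps)     # [0]
```

With only 136 steps per epoch, the periodic message is printed once per epoch, after the first batch. This is why the "per 5000 steps" accuracies in the training output are multiples of 25: they are computed on a single batch of 4 samples.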

In [11]:
# Training function: fine-tunes a DistilBERT model on the training split of a fold (80% of the dataset)

def train(epoch, model,optimizer, training_loader):
    tr_loss = 0
    n_correct = 0
    nb_tr_steps = 0
    nb_tr_examples = 0
    model.train()
    for _,data in enumerate(training_loader, 0):
        ids = data['ids'].to(device, dtype = torch.long)
        mask = data['mask'].to(device, dtype = torch.long)
        targets = data['targets'].to(device, dtype = torch.long)

        outputs = model(ids, mask)
        loss = loss_function(outputs, targets)
        tr_loss += loss.item()
        big_val, big_idx = torch.max(outputs.data, dim=1)
        n_correct += calcuate_accu(big_idx, targets)

        nb_tr_steps += 1
        nb_tr_examples+=targets.size(0)

        if _%5000==0:
            loss_step = tr_loss/nb_tr_steps
            accu_step = (n_correct*100)/nb_tr_examples
            print(f"\nTraining Loss per 5000 steps: {loss_step}")
            print(f"Training Accuracy per 5000 steps: {accu_step}")

        optimizer.zero_grad()
        loss.backward()
        # Apply the gradient update to the model weights
        optimizer.step()

    print(f'The Total Accuracy for Epoch {epoch}: {(n_correct*100)/nb_tr_examples}')
    epoch_loss = tr_loss/nb_tr_steps
    epoch_accu = (n_correct*100)/nb_tr_examples
    print(f"Training Loss Epoch: {epoch_loss}")
    print(f"Training Accuracy Epoch: {epoch_accu}")

    return

# Zero-Shot evaluation on the 5 folds.
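
One way to read the losses printed in this section: a two-class classifier that assigns roughly equal probability to both classes has a cross-entropy of about ln 2 ≈ 0.693. The sketch below computes that value in plain Python, assuming equal logits as an idealized stand-in for a classification head that has not been trained yet.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one example: -log(softmax(logits)[target])."""
    shifted = [x - max(logits) for x in logits]   # shift for numerical stability
    log_sum_exp = math.log(sum(math.exp(x) for x in shifted))
    return -(shifted[target] - log_sum_exp)


# Equal logits mean the model assigns probability 0.5 to each class.
loss = cross_entropy([0.0, 0.0], target=1)
print(round(loss, 4))  # 0.6931
```

Validation losses close to this value, together with accuracies not far above 50%, are consistent with a model that has not yet learned the task.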

In [12]:

def valid(model, testing_loader):
    tr_loss = 0
    n_correct = 0
    nb_tr_steps = 0
    nb_tr_examples = 0
    model.eval()
    with torch.no_grad():
        for _, data in enumerate(testing_loader, 0):
            #print("\n\nSTEP Nr. ", nb_tr_steps)
            # The validation batch size is 2, so each step yields 2 predictions and the number of steps is half the size of the test fold.
            ids = data['ids'].to(device, dtype = torch.long)
            mask = data['mask'].to(device, dtype = torch.long)
            targets = data['targets'].to(device, dtype = torch.long)
            #print("ids is", ids)
            #print("mask is", mask)
            # No .squeeze() here: it would drop the batch dimension of a final batch of size 1.
            outputs = model(ids, mask)
            #print("the outputs are:",outputs)
            #print("the targets are:",targets)
            loss = loss_function(outputs, targets)
            tr_loss += loss.item()
            big_val, big_idx = torch.max(outputs.data, dim=1)
            n_correct += calcuate_accu(big_idx, targets)

            #print("big_idx is",big_idx)
            #print("Correct now is:", n_correct)

            nb_tr_steps += 1
            nb_tr_examples+=targets.size(0)

            if _%5000==0:
                loss_step = tr_loss/nb_tr_steps
                accu_step = (n_correct*100)/nb_tr_examples
                print(f"Validation Loss per 100 steps: {loss_step}")
                print(f"Validation Accuracy per 100 steps: {accu_step}")
    epoch_loss = tr_loss/nb_tr_steps
    epoch_accu = (n_correct*100)/nb_tr_examples
    print(f"Validation Loss Epoch: {epoch_loss}")
    print(f"Validation Accuracy Epoch: {epoch_accu}")

    return epoch_accu


In [14]:
print('This is the validation section to print the accuracy and see how it performs')
print('Here we are leveraging the dataloader created for the validation dataset; the approach uses more of PyTorch')
for i in range(0,fold_nr):
  print("\nEntering FOLD NR. ", i)
  acc = valid(model[i], testing_loader[i])
  print("Accuracy on test data = %0.2f%%" % acc)

This is the validation section to print the accuracy and see how it performs
Here we are leveraging the dataloader created for the validation dataset; the approach uses more of PyTorch

Entering FOLD NR.  0




Validation Loss per 100 steps: 0.6982085704803467
Validation Accuracy per 100 steps: 50.0
Validation Loss Epoch: 0.6947986176785301
Validation Accuracy Epoch: 51.470588235294116
Accuracy on test data = 51.47%

Entering FOLD NR.  1
Validation Loss per 100 steps: 0.5970005989074707
Validation Accuracy per 100 steps: 100.0
Validation Loss Epoch: 0.6738659003201652
Validation Accuracy Epoch: 62.5
Accuracy on test data = 62.50%

Entering FOLD NR.  2
Validation Loss per 100 steps: 0.7264237403869629
Validation Accuracy per 100 steps: 0.0
Validation Loss Epoch: 0.6882631656001595
Validation Accuracy Epoch: 58.088235294117645
Accuracy on test data = 58.09%

Entering FOLD NR.  3
Validation Loss per 100 steps: 0.719153642654419
Validation Accuracy per 100 steps: 50.0
Validation Loss Epoch: 0.6846001604024101
Validation Accuracy Epoch: 58.088235294117645
Accuracy on test data = 58.09%

Entering FOLD NR.  4
Validation Loss per 100 steps: 0.6990581154823303
Validation Accuracy per 100 steps: 50.0
V

# Fine-Tuning the model.

In [15]:
for i in range(0,fold_nr):
  print("\nEntering FOLD NR. ", i)
  for epoch in range(EPOCHS):
    train(epoch,model[i],optimizer[i],training_loader[i])


Entering FOLD NR.  0
Training Loss per 5000 steps: 0.7033916711807251
Training Accuracy per 5000 steps: 50.0

The Total Accuracy for Epoch 0: 59.00735294117647
Training Loss Epoch: 0.6446728985756636
Training Accuracy Epoch: 59.00735294117647
Training Loss per 5000 steps: 0.47283250093460083
Training Accuracy per 5000 steps: 100.0

The Total Accuracy for Epoch 1: 73.71323529411765
Training Loss Epoch: 0.5524507199468858
Training Accuracy Epoch: 73.71323529411765
Training Loss per 5000 steps: 0.9352677464485168
Training Accuracy per 5000 steps: 50.0

The Total Accuracy for Epoch 2: 79.41176470588235
Training Loss Epoch: 0.45783138461411
Training Accuracy Epoch: 79.41176470588235
Training Loss per 5000 steps: 0.47082337737083435
Training Accuracy per 5000 steps: 75.0

The Total Accuracy for Epoch 3: 83.82352941176471
Training Loss Epoch: 0.3837527225122732
Training Accuracy Epoch: 83.82352941176471
Training Loss per 5000 steps: 0.17644649744033813
Training Accuracy per 5000 steps: 100.0

# Evaluate the test set on the fine-tuned model.

In [16]:
print('This is the validation section to print the accuracy and see how it performs')
print('Here we are leveraging the dataloader created for the validation dataset; the approach uses more of PyTorch')
for i in range(0,fold_nr):
  print("\nEntering FOLD NR. ", i)
  acc = valid(model[i], testing_loader[i])
  print("Accuracy on test data = %0.2f%%" % acc)

This is the validation section to print the accuracy and see how it performs
Here we are leveraging the dataloader created for the validation dataset; the approach uses more of PyTorch

Entering FOLD NR.  0
Validation Loss per 100 steps: 0.14858460426330566
Validation Accuracy per 100 steps: 100.0
Validation Loss Epoch: 0.23749760952641202
Validation Accuracy Epoch: 88.97058823529412
Accuracy on test data = 88.97%

Entering FOLD NR.  1
Validation Loss per 100 steps: 0.16743624210357666
Validation Accuracy per 100 steps: 100.0
Validation Loss Epoch: 0.4595368433990242
Validation Accuracy Epoch: 86.76470588235294
Accuracy on test data = 86.76%

Entering FOLD NR.  2
Validation Loss per 100 steps: 0.015113845467567444
Validation Accuracy per 100 steps: 100.0
Validation Loss Epoch: 0.27191994463175756
Validation Accuracy Epoch: 87.5
Accuracy on test data = 87.50%

Entering FOLD NR.  3
Validation Loss per 100 steps: 0.0025896800216287374
Validation Accuracy per 100 steps: 100.0
Valid

# Saving the Trained Model Artifacts for inference

This is the final step in the process of fine tuning the model.

The model and its vocabulary are saved locally. These files can then be used in the future to run inference on new source code inputs.

Please remember that a trained neural network is only useful when used in actual inference after its training.

In the lifecycle of an ML project this is only half the job done. We will leave the inference of these models for some other day.
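
As a hedged alternative to pickling whole model objects, a common PyTorch pattern is to save only the weights (`state_dict`) and load them into a freshly built model of the same architecture. The sketch below uses a small `torch.nn.Linear` as a stand-in for one fold's model, and the file name `fold_0.pt` is hypothetical; it is not what the cell below does.

```python
import os
import tempfile

import torch

# Small stand-in model; one fold's DistilBERT model could be handled the same way.
toy_model = torch.nn.Linear(4, 2)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "fold_0.pt")     # hypothetical file name
    torch.save(toy_model.state_dict(), path)      # save the weights only

    restored = torch.nn.Linear(4, 2)              # rebuild the same architecture
    restored.load_state_dict(torch.load(path))    # load the weights back
    restored.eval()                               # switch to inference mode

same = torch.equal(toy_model.weight, restored.weight)
print(same)  # True
```

Saving the `state_dict` keeps the file independent of the exact class definition location, at the cost of having to rebuild the model before loading.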

In [None]:
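# Saving the files for re-use
import os

os.makedirs('./models', exist_ok=True)  # torch.save does not create missing directories

output_model_file = './models/pytorch_distilbert_news.bin'
output_vocab_file = './models/vocab_distilbert_news.bin'

# `model` is the list of per-fold models, so the whole list is pickled into one file
model_to_save = model
torch.save(model_to_save, output_model_file)
tokenizer.save_vocabulary(output_vocab_file)

print('All files saved')
print('This tutorial is completed')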