Here I install the Python packages required for the project: transformers, evaluate, datasets, and accelerate.

In [3]:
!pip install transformers
!pip install evaluate
!pip install accelerate -U
!pip install transformers[torch]
!pip install datasets

Collecting transformers
  Downloading transformers-4.31.0-py3-none-any.whl (7.4 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

Importing the installed modules and packages here.

In [4]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer, AutoTokenizer
import numpy as np
import evaluate
from datasets import load_dataset

Now I load the "Yelp Review Full" [dataset](https://huggingface.co/datasets/yelp_review_full). It contains customer reviews with labels from 0 to 4, which correspond to ratings of 1 to 5 stars.

In [5]:
dataset = load_dataset("yelp_review_full")
dataset["train"][100]

Downloading builder script:   0%|          | 0.00/4.41k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/2.04k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/6.55k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/196M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/650000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/50000 [00:00<?, ? examples/s]

{'label': 0,
 'text': 'My expectations for McDonalds are t rarely high. But for one to still fail so spectacularly...that takes something special!\\nThe cashier took my friends\'s order, then promptly ignored me. I had to force myself in front of a cashier who opened his register to wait on the person BEHIND me. I waited over five minutes for a gigantic order that included precisely one kid\'s meal. After watching two people who ordered after me be handed their food, I asked where mine was. The manager started yelling at the cashiers for \\"serving off their orders\\" when they didn\'t have their food. But neither cashier was anywhere near those controls, and the manager was the one serving food to customers and clearing the boards.\\nThe manager was rude when giving me my order. She didn\'t make sure that I had everything ON MY RECEIPT, and never even had the decency to apologize that I felt I was getting poor service.\\nI\'ve eaten at various McDonalds restaurants for over 30 years. 
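Since the labels run from 0 to 4 while the ratings run from 1 to 5 stars, it can help to convert between the two when reading examples. The helper below is only a small sketch; `label_to_stars` is a hypothetical name and not part of the `datasets` library.

```python
def label_to_stars(label: int) -> int:
    """Convert a 0-4 class label into the 1-5 star rating it stands for."""
    if not 0 <= label <= 4:
        raise ValueError(f"expected a label between 0 and 4, got {label}")
    return label + 1


# The example shown above has label 0, i.e. a 1-star review.
example = {"label": 0}
print(label_to_stars(example["label"]))
```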

I use the pre-trained **DistilBERT** model. Here I tokenize the reviews and then create smaller train and test subsets (20,000 and 4,000 examples) that the "Trainer" will use later.

In [10]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")


def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)


tokenized_datasets = dataset.map(tokenize_function, batched=True)

small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(20000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=84).select(range(4000))

Map:   0%|          | 0/50000 [00:00<?, ? examples/s]
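To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a toy sketch on plain lists of token ids. It is not the tokenizer's actual implementation: `pad_or_truncate` and `attention_mask` are hypothetical helpers, the pad id of 0 is an assumption, and a real tokenizer also preserves special tokens when truncating.

```python
from typing import List


def pad_or_truncate(token_ids: List[int], max_length: int, pad_id: int = 0) -> List[int]:
    """Cut the sequence at max_length, then right-pad it up to max_length."""
    clipped = token_ids[:max_length]  # truncation
    return clipped + [pad_id] * (max_length - len(clipped))  # padding


def attention_mask(token_ids: List[int], max_length: int) -> List[int]:
    """1 marks a real token, 0 marks padding the model should ignore."""
    real = min(len(token_ids), max_length)
    return [1] * real + [0] * (max_length - real)


short_review = [101, 7592, 102]       # shorter than max_length: gets padded
long_review = list(range(1, 11))      # longer than max_length: gets cut

print(pad_or_truncate(short_review, 6))
print(attention_mask(short_review, 6))
print(pad_or_truncate(long_review, 6))
```

With this setting every example ends up with the same length, which keeps batching simple at the cost of processing many padding tokens for short reviews.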

Loading the DistilBERT model with 5 output labels, the default training hyperparameters, and the 'accuracy' metric.

In [7]:
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-cased", num_labels=5)
training_args = TrainingArguments(output_dir="test_trainer")
metric = evaluate.load("accuracy")

Downloading pytorch_model.bin:   0%|          | 0.00/263M [00:00<?, ?B/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['classifier.bias', 'pre_classifier.weight', 'classifier.weight', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]

A new function, 'compute_metrics', picks the label with the highest logit for each example and computes the accuracy against the true labels.

In [8]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
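To see what this function computes, here is the same idea worked through on a tiny hand-made batch. This is only a sketch with plain Python lists so it runs without any dependencies; the logits and labels are made-up values, and `argmax` and `accuracy` are hypothetical stand-ins for `np.argmax` and the 'accuracy' metric.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class matches the true label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Four made-up examples with five class scores each (labels 0 to 4).
logits = [
    [2.0, 0.1, -1.0, 0.3, 0.0],   # highest score at index 0
    [0.2, 0.1, 0.0, 1.5, 0.9],    # highest score at index 3
    [-0.5, 0.0, 0.4, 0.3, 2.2],   # highest score at index 4
    [0.1, 1.1, 0.9, 0.0, -0.2],   # highest score at index 1
]
labels = [0, 3, 2, 1]             # the third prediction (4) is wrong

print(accuracy(logits, labels))   # 3 of 4 correct
```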

Now I fine-tune the pre-trained model with the "Trainer" class, using my train and test subsets.

In [11]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()

Step,Training Loss
500,1.1063
1000,1.0277
1500,1.0013
2000,0.9804
2500,0.9576
3000,0.753
3500,0.7708
4000,0.7623
4500,0.7549
5000,0.7664



TrainOutput(global_step=7500, training_loss=0.7580432474772135, metrics={'train_runtime': 2955.0025, 'train_samples_per_second': 20.305, 'train_steps_per_second': 2.538, 'total_flos': 7948469145600000.0, 'train_loss': 0.7580432474772135, 'epoch': 3.0})
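The numbers in this output can be cross-checked with a little arithmetic. The sketch below assumes the `TrainingArguments` defaults were in effect (a per-device batch size of 8 and 3 epochs), a single device, and no gradient accumulation; `total_steps` is a hypothetical helper.

```python
import math


def total_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Optimizer steps for a full run: batches per epoch times epochs."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


num_examples = 20000        # size of small_train_dataset
batch_size = 8              # assumed default per-device train batch size
epochs = 3                  # matches 'epoch': 3.0 in the output
runtime_s = 2955.0025       # 'train_runtime' from the output

steps = total_steps(num_examples, batch_size, epochs)
print(steps)                                           # global_step
print(round(num_examples * epochs / runtime_s, 3))     # samples per second
print(round(steps / runtime_s, 3))                     # steps per second
```

Under these assumptions the step count agrees with `global_step=7500`, and the two throughput figures follow from dividing by the reported runtime.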

Mounting Google Drive.

In [12]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Saving my trained model to Drive.

In [13]:
model.save_pretrained('/content/drive/MyDrive/NN_LLM_Project_Files')

Saving my tokenizer to the same Drive folder.

In [14]:
tokenizer.save_pretrained('/content/drive/MyDrive/NN_LLM_Project_Files')

('/content/drive/MyDrive/NN_LLM_Project_Files/tokenizer_config.json',
 '/content/drive/MyDrive/NN_LLM_Project_Files/special_tokens_map.json',
 '/content/drive/MyDrive/NN_LLM_Project_Files/vocab.txt',
 '/content/drive/MyDrive/NN_LLM_Project_Files/added_tokens.json',
 '/content/drive/MyDrive/NN_LLM_Project_Files/tokenizer.json')
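With both the model and the tokenizer saved in the same folder, they can be loaded back later with `from_pretrained`. The following is a minimal sketch rather than part of the original run: `predict_stars` is a hypothetical helper, it assumes Drive is mounted at the same path, and it assumes PyTorch is installed.

```python
SAVE_DIR = "/content/drive/MyDrive/NN_LLM_Project_Files"  # folder used above


def predict_stars(text: str, save_dir: str = SAVE_DIR) -> int:
    """Reload the saved model and tokenizer and predict a 1-5 star rating."""
    # Imported here so that defining the function does not need the libraries.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(save_dir)
    model = AutoModelForSequenceClassification.from_pretrained(save_dir)
    model.eval()  # switch off dropout for inference

    inputs = tokenizer(text, truncation=True, return_tensors="pt")
    with torch.no_grad():  # no gradients needed for prediction
        logits = model(**inputs).logits

    label = int(logits.argmax(dim=-1).item())  # class label in 0..4
    return label + 1                           # star rating in 1..5


# Example call (needs the saved files to exist):
# predict_stars("Great food and friendly staff!")
```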