
# Task
Run an experiment that uses an open-source LLM to classify text.

## Install necessary libraries

### Subtask:
Install the required libraries such as `transformers` and `torch`.


**Reasoning**:
Install the required libraries using pip.



In [1]:
%pip install transformers torch datasets

Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2.1.3 (from torch)
  Downloading nvidia_cufft_cu12-11.2.1.3-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-curand-cu12==10.3.5.147 (from torch)
  Downloading nvidia_curand_cu12-10.3.5
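
Since some `transformers` argument names differ between releases, it can help to record which versions were actually installed. The following is a minimal sketch using only the standard library; the helper name `installed_version` is an illustrative choice, not part of any of these packages.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Packages used later in this notebook.
for name in ("transformers", "torch", "datasets"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```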

## Load a pre-trained open-source LLM

### Subtask:
Choose and load a suitable open-source LLM from the Hugging Face Transformers library.


**Reasoning**:
Load the tokenizer and model for text classification using the transformers library.



In [2]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2) # Assuming a binary classification task

print("Tokenizer loaded successfully:", tokenizer)
print("Model loaded successfully:", model)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Tokenizer loaded successfully: DistilBertTokenizerFast(name_or_path='distilbert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)
Model loaded successfully: DistilBertForSequenceClassification(
  (distilbert): Dis
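
Because the model is loaded with `num_labels=2`, its classification head produces two raw scores (logits) per input. As a rough sketch of how such scores are usually turned into a prediction, the snippet below applies a softmax and picks the highest-probability class. The logits are made up, and `LABEL_0`/`LABEL_1` are placeholder label names assumed for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits, id2label):
    """Return (label name, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


# Hypothetical logits for one sentence; the label names are placeholders.
id2label = {0: "LABEL_0", 1: "LABEL_1"}
label, prob = predict_label([-1.2, 2.3], id2label)
print(label, round(prob, 3))
```

Until the newly initialized classifier weights are trained, predictions made this way are essentially arbitrary, which is what the warning in the output above is pointing out.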

## Load a dataset

### Subtask:
Load a dataset for text classification.


**Reasoning**:
Import the load_dataset function and load the sst2 subset of the glue dataset.



In [3]:
from datasets import load_dataset

dataset = load_dataset('glue', 'sst2')
print(dataset)


DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 872
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1821
    })
})


## Preprocess the data

### Subtask:
Prepare the text data for the model, including tokenization and padding.


**Reasoning**:
Define a function to tokenize the 'sentence' field and apply it to the dataset splits, then remove unnecessary columns.



In [4]:
def tokenize_function(examples):
    return tokenizer(examples["sentence"], truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

tokenized_datasets = tokenized_datasets.remove_columns(["sentence", "idx"])

print(tokenized_datasets)


DatasetDict({
    train: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 872
    })
    test: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 1821
    })
})
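
Note that the tokenization step truncates but does not pad, so the sequences in `input_ids` still have different lengths. Padding is typically applied later, per batch. The sketch below illustrates the idea in plain Python; it is a simplified stand-in, not the library's implementation, and `pad_batch` is a hypothetical helper. The pad id of 0 and right-side padding match the tokenizer settings printed earlier.

```python
def pad_batch(batch_input_ids, pad_token_id=0):
    """Pad variable-length id lists to the longest length in the batch."""
    max_len = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        input_ids.append(list(ids) + [pad_token_id] * n_pad)  # right padding
        attention_mask.append([1] * len(ids) + [0] * n_pad)   # 0 marks padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids; 101 and 102 are [CLS] and [SEP] in this tokenizer's vocabulary.
batch = pad_batch([[101, 2023, 102], [101, 2023, 2003, 2307, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding to the longest sequence in each batch, rather than to the model maximum of 512, keeps batches of short sentences small.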


## Perform text classification

### Subtask:
Use the loaded LLM to classify the text data.


**Reasoning**:
Define the data collator, training arguments, and create a Trainer instance to train and evaluate the model.



In [5]:
from transformers import DataCollatorWithPadding, TrainingArguments, Trainer

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()
eval_results = trainer.evaluate()
print(eval_results)

TypeError: TrainingArguments.__init__() got an unexpected keyword argument 'evaluation_strategy'

**Reasoning**:
The previous code failed because the installed version of `transformers` does not accept `evaluation_strategy` as an argument to `TrainingArguments`; in this version the argument is named `eval_strategy`. Fix the argument name and rerun the code to define the data collator and training arguments, create a `Trainer` instance, and train and evaluate the model.
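
If the notebook has to run against more than one library version, one option is to check which keyword a callable accepts before passing it. The sketch below shows the idea with the standard `inspect` module; `first_supported_kwarg` and `fake_training_arguments` are hypothetical names used only for illustration, with the latter standing in for the real class so the example runs without `transformers`.

```python
import inspect


def first_supported_kwarg(func, candidates):
    """Return the first name in `candidates` that `func` accepts as a parameter."""
    params = inspect.signature(func).parameters
    for name in candidates:
        if name in params:
            return name
    raise TypeError(f"{func.__name__} accepts none of {candidates}")


# Stand-in for TrainingArguments in a release that uses the newer name.
def fake_training_arguments(output_dir, eval_strategy="no"):
    return {"output_dir": output_dir, "eval_strategy": eval_strategy}


name = first_supported_kwarg(
    fake_training_arguments, ["eval_strategy", "evaluation_strategy"]
)
args = fake_training_arguments("./results", **{name: "epoch"})
print(name, args)
```

Pinning the library version at install time is a simpler alternative when only one environment matters.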



In [6]:
from transformers import DataCollatorWithPadding, TrainingArguments, Trainer

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    eval_strategy="epoch", # Corrected argument name
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()
eval_results = trainer.evaluate()
print(eval_results)

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize?ref=models
wandb: Paste an API key from your profile and hit enter:

 ··········


wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: prisar (prisar-quantiphi) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 0.1856        | 0.301694        |
| 2     | 0.1341        | 0.384326        |
| 3     | 0.0772        | 0.4339          |


{'eval_loss': 0.43390020728111267, 'eval_runtime': 1.1356, 'eval_samples_per_second': 767.906, 'eval_steps_per_second': 48.434, 'epoch': 3.0}
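
The run above reports only `eval_loss`, which is hard to interpret on its own for a classification task. A common addition is an accuracy metric computed from the logits and labels. The function below is a minimal sketch written in plain Python so it works on nested lists as well as NumPy arrays; it assumes the evaluation output unpacks into `(logits, labels)`, and the sample values are invented.

```python
def compute_metrics(eval_pred):
    """Compute accuracy from a (logits, labels) pair."""
    logits, labels = eval_pred
    # Predicted class = index of the largest logit in each row.
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == int(y)) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(predictions)}


# Invented logits and labels for three examples.
sample_logits = [[0.1, 0.9], [2.0, -1.0], [0.3, 0.4]]
sample_labels = [1, 0, 0]
result = compute_metrics((sample_logits, sample_labels))
print(result)
```

A function of this shape can be supplied through the `compute_metrics` argument of `Trainer`, in which case the metric should appear alongside `eval_loss` in the evaluation results.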


## Summarize the experiment

### Subtask:
Summarize the experiment and its results.

**Reasoning**:
Present the experiment and the evaluation results in a clear and concise manner.

## Experiment Summary

This experiment classified text with a pre-trained open-source LLM. We fine-tuned the `distilbert-base-uncased` model from the Hugging Face Transformers library on the `sst2` subset of the GLUE dataset, a binary sentence-classification task.

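Each sentence was tokenized with truncation enabled, and the `sentence` and `idx` columns were dropped; padding was applied per batch by `DataCollatorWithPadding`. The model was then fine-tuned on the 67,349-example training split for 3 epochs (learning rate 2e-5, batch size 16, weight decay 0.01) and evaluated on the 872-example validation split.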

The evaluation results are as follows:

In [7]:
print(eval_results)

{'eval_loss': 0.43390020728111267, 'eval_runtime': 1.1356, 'eval_samples_per_second': 767.906, 'eval_steps_per_second': 48.434, 'epoch': 3.0}


## Conclusion

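The experiment demonstrated fine-tuning an open-source LLM for text classification. The model reached an evaluation loss of 0.4339 on the validation dataset after 3 epochs. Validation loss rose every epoch (0.3017, 0.3843, 0.4339) while training loss fell (0.1856, 0.1341, 0.0772), a pattern that suggests overfitting after the first epoch, so training for fewer epochs or using early stopping may help. No accuracy metric was computed in this run; adding one and tuning hyperparameters are natural next steps.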
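
As a small worked check of the per-epoch losses reported in the training log, the sketch below finds the epoch with the lowest validation loss and tests whether training and validation loss move in opposite directions. The helper names are illustrative, and the numbers are copied from the log.

```python
# (epoch, training loss, validation loss) as reported in the training log.
history = [
    (1, 0.1856, 0.301694),
    (2, 0.1341, 0.384326),
    (3, 0.0772, 0.4339),
]


def best_epoch(history):
    """Return the epoch with the lowest validation loss."""
    return min(history, key=lambda row: row[2])[0]


def losses_diverge(history):
    """True if training loss falls and validation loss rises at every step."""
    steps = zip(history, history[1:])
    return all(b[1] < a[1] and b[2] > a[2] for a, b in steps)


print("best epoch:", best_epoch(history))
print("losses diverge:", losses_diverge(history))
```

On these numbers the first epoch has the lowest validation loss, which is why keeping the checkpoint from the best epoch, or stopping early, is worth trying before other hyperparameter changes.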