[![Run on Gradient](https://assets.paperspace.io/img/gradient-badge.svg)](https://console.paperspace.com/github/petuch03/machine-learning-things/blob/main/generative-ai/practices-and-hws/Week4/homework/machine-human-text-classification.ipynb)
<a target="_blank" href="https://colab.research.google.com/github/petuch03/machine-learning-things/blob/main/generative-ai/practices-and-hws/Week4/homework/machine-human-text-classification.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Overview
My solution is based entirely on the model provided in the practice session.

## Step 1
Here we install the required libraries and mount Google Drive in Colab.

In [1]:
!pip install transformers # supports Transformer-based models
!pip install datasets # datasets for experiments
!pip install evaluate # evaluation metrics for experiments
!pip install transformers[torch] # backend for training

import pandas as pd # data manipulation & storage
from transformers.utils import logging
from transformers import set_seed # fix random seed
from google.colab import drive

drive.mount('/content/drive')
logging.set_verbosity_error()
set_seed(0)

output_path = '/content/drive/My Drive/atd'


## Step 2
I loaded the development dataset and split it into train and validation sets (80/20). After that, I loaded the test dataset and merged everything into a single DatasetDict.

In [2]:
from datasets import load_dataset

dev_ds = load_dataset('csv', data_files='/content/drive/MyDrive/Colab Notebooks/generative-ai/Week4/dev.csv')

train_ds, val_ds = dev_ds['train'].train_test_split(test_size=0.2).values() # split the dataset into train (80%) and validation (20%)


In [3]:
val_ds

Dataset({
    features: ['ID', 'Text', 'Class'],
    num_rows: 400
})

In [4]:
test_ds = load_dataset('csv', data_files='/content/drive/MyDrive/Colab Notebooks/generative-ai/Week4/test.csv')


In [5]:
test_ds

DatasetDict({
    train: Dataset({
        features: ['ID', 'Text'],
        num_rows: 20000
    })
})

In [6]:
from datasets import DatasetDict

ds = DatasetDict({
    'train': train_ds,
    'validation': val_ds,
    'test': test_ds['train']
})

In [7]:
ds

DatasetDict({
    train: Dataset({
        features: ['ID', 'Text', 'Class'],
        num_rows: 1600
    })
    validation: Dataset({
        features: ['ID', 'Text', 'Class'],
        num_rows: 400
    })
    test: Dataset({
        features: ['ID', 'Text'],
        num_rows: 20000
    })
})
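
To make the split sizes concrete, here is a minimal pure-Python sketch of a shuffled 80/20 split. It is not the `datasets` implementation: `split_indices` and its fixed `seed` are hypothetical names introduced for illustration, and the 2000-row total is inferred from the 1600 + 400 rows shown above.

```python
import random

def split_indices(n_rows, test_size=0.2, seed=0):
    """Shuffle row indices and cut them into (train, validation) parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # fixed seed keeps the split reproducible
    n_val = round(n_rows * test_size)      # 20% of the rows go to validation
    return indices[n_val:], indices[:n_val]

train_idx, val_idx = split_indices(2000)
print(len(train_idx), len(val_idx))  # 1600 400
```

Passing an explicit seed to the split is one way to keep the validation set identical across runs.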

## Step 3
Starting with dictionaries for simpler ID-to-label conversion (same as in the practice session), I set up the tokenizer and slightly changed the preprocess function compared to the one from the practice session: it adds labels only when the batch has a 'Class' column. As a result, we have a tokenized and properly labeled dataset that is almost ready for training.

In [8]:
# map class IDs to labels
id2label = {0: 'H', 1: 'M'}

# map labels to class IDs
label2id = {'H': 0, 'M': 1}

In [9]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

def preprocess(batch):
    tokenized_batch = tokenizer(batch['Text'], padding=True, truncation=True, max_length=128)
    # Adds 'label' only if 'Class' exists in the batch. No labeling for test Dataset
    if 'Class' in batch:
        tokenized_batch['label'] = [label2id[label] for label in batch['Class']]
    return tokenized_batch

# Apply preprocessing to the DatasetDict
tokenized_ds = ds.map(preprocess, batched=True)


In [10]:
tokenized_ds

DatasetDict({
    train: Dataset({
        features: ['ID', 'Text', 'Class', 'input_ids', 'attention_mask', 'label'],
        num_rows: 1600
    })
    validation: Dataset({
        features: ['ID', 'Text', 'Class', 'input_ids', 'attention_mask', 'label'],
        num_rows: 400
    })
    test: Dataset({
        features: ['ID', 'Text', 'input_ids', 'attention_mask'],
        num_rows: 20000
    })
})
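
The label handling in `preprocess` can be checked in isolation, without a tokenizer. The sketch below is a simplified stand-in: `add_labels` is a hypothetical helper, and the toy batches are made up for illustration.

```python
id2label = {0: 'H', 1: 'M'}
label2id = {'H': 0, 'M': 1}

def add_labels(batch):
    """Return a copy of the batch with a numeric 'label' column if 'Class' is present."""
    out = dict(batch)
    if 'Class' in batch:                       # train/validation batches carry gold classes
        out['label'] = [label2id[c] for c in batch['Class']]
    return out                                 # test batches pass through unlabeled

labeled = add_labels({'Text': ['a', 'b'], 'Class': ['M', 'H']})
unlabeled = add_labels({'Text': ['c']})
print(labeled['label'], 'label' in unlabeled)  # [1, 0] False
```

This mirrors the printout above, where only the train and validation splits gain a `label` feature.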

## Step 4

Here I set up the data collator and add the accuracy metric for later performance evaluation. Everything is copied from the practice session.

In [11]:
from transformers import DataCollatorWithPadding # import the DataCollatorWithPadding class from the transformers package

# create an instance of DataCollatorWithPadding
# it takes 'tokenizer' as an argument, which will be used for padding sequences
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
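
As a rough illustration of what dynamic padding does, the sketch below pads a batch of token-ID lists to the length of its longest member and builds the matching attention mask. It is a pure-Python approximation, not the `DataCollatorWithPadding` implementation: `pad_batch` is a hypothetical helper, the token IDs are made up, and the pad ID of 0 is assumed to match the `padding_idx=0` shown for the embedding layer in Step 5.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token-ID lists to the longest one and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)             # pad on the right
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 0 marks padding
    return {'input_ids': input_ids, 'attention_mask': attention_mask}

batch = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
print(batch['input_ids'][0])       # [101, 7592, 102, 0, 0]
print(batch['attention_mask'][0])  # [1, 1, 1, 0, 0]
```

Because `preprocess` already calls the tokenizer with `padding=True`, sequences reach the collator partly padded; dropping that argument would leave padding entirely to the collator.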

In [12]:
import evaluate # import the evaluate package

accuracy = evaluate.load('accuracy') # we will use the accuracy metric as the main one


In [13]:
import numpy as np # import the numpy package

# this function gets the predictions (the model's score for each class), takes the most likely class, and compares it to the gold label
def compute_metrics(eval_pred):

    # get the predicted class scores (logits) and the gold labels
    predictions, labels = eval_pred

    # get the most likely prediction
    predictions = np.argmax(predictions, axis=1)

    # compute and return the accuracy value
    return accuracy.compute(predictions=predictions, references=labels)
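
To see what `compute_metrics` computes, here is a dependency-free sketch of the same argmax-then-compare logic. `accuracy_from_logits` is a hypothetical helper and the logits are made-up numbers; the notebook itself relies on `np.argmax` and the `evaluate` accuracy metric.

```python
def accuracy_from_logits(logits, labels):
    """Pick the highest-scoring class per row and compare with the gold labels."""
    predicted = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(p == y for p, y in zip(predicted, labels))
    return {'accuracy': correct / len(labels)}

logits = [[2.1, -1.3], [0.2, 0.9], [-0.5, 1.7], [1.0, 0.4]]
labels = [0, 1, 0, 0]
print(accuracy_from_logits(logits, labels))  # {'accuracy': 0.75}
```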

## Step 5

The model is exactly the same as in the practice session, as are the training arguments and the Trainer. The step ends with model training.

In [14]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer # import necessary components from the transformers library

# initialize a model for sequence classification (e.g. for text classification)
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id
)


In [15]:
model

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

In [16]:
# define the training arguments for the model
training_args = TrainingArguments(
    output_dir='tmp/',                            # directory to save the model and results
    learning_rate=2e-5,                            # learning rate for optimization
    per_device_train_batch_size=32,              # batch size per GPU for training
    per_device_eval_batch_size=32,               # batch size per GPU for evaluation
    num_train_epochs=10,                           # number of training epochs
    weight_decay=0.01,                            # weight decay for regularization
    evaluation_strategy='epoch',                  # evaluation strategy during training (per epoch)
    save_strategy='epoch',                        # saving strategy during training (per epoch)
    load_best_model_at_end=True,                  # load the best model at the end of training
)

# initialize the Trainer with necessary components and settings
trainer = Trainer(
    model=model,                                  # model to be trained
    args=training_args,                           # training arguments defined above
    train_dataset=tokenized_ds['train'],          # training dataset
    eval_dataset=tokenized_ds['validation'],      # validation dataset
    tokenizer=tokenizer,                          # tokenizer for data processing
    data_collator=data_collator,                  # data collator for padding
    compute_metrics=compute_metrics               # function to compute evaluation metrics
)
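
The arguments above determine how many optimizer steps the run takes. A quick check, assuming a single device and no gradient accumulation (neither is stated in the notebook); `total_steps` is a hypothetical helper:

```python
import math

def total_steps(n_examples, batch_size, epochs):
    """Optimizer steps for a run: batches per epoch (the last may be partial) times epochs."""
    steps_per_epoch = math.ceil(n_examples / batch_size)
    return steps_per_epoch * epochs

print(total_steps(n_examples=1600, batch_size=32, epochs=10))  # 500
```

With 1600 training rows and a batch size of 32, each epoch is 50 steps, which agrees with the `global_step=500` reported by `trainer.train()`.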

In [17]:
# train the model
trainer.train()

{'eval_loss': 0.2059372067451477, 'eval_accuracy': 0.9275, 'eval_runtime': 1.4417, 'eval_samples_per_second': 277.446, 'eval_steps_per_second': 9.017, 'epoch': 1.0}
{'eval_loss': 0.10901492089033127, 'eval_accuracy': 0.9625, 'eval_runtime': 1.3641, 'eval_samples_per_second': 293.238, 'eval_steps_per_second': 9.53, 'epoch': 2.0}
{'eval_loss': 0.08289474248886108, 'eval_accuracy': 0.9725, 'eval_runtime': 1.3672, 'eval_samples_per_second': 292.567, 'eval_steps_per_second': 9.508, 'epoch': 3.0}
{'eval_loss': 0.09285690635442734, 'eval_accuracy': 0.975, 'eval_runtime': 1.4217, 'eval_samples_per_second': 281.347, 'eval_steps_per_second': 9.144, 'epoch': 4.0}
{'eval_loss': 0.10428817570209503, 'eval_accuracy': 0.9725, 'eval_runtime': 1.4234, 'eval_samples_per_second': 281.008, 'eval_steps_per_second': 9.133, 'epoch': 5.0}
{'eval_loss': 0.09679096192121506, 'eval_accuracy': 0.9725, 'eval_runtime': 1.4754, 'eval_samples_per_second': 271.111, 'eval_steps_per_second': 8.811, 'epoch': 6.0}
{'eval_

TrainOutput(global_step=500, training_loss=0.06429217529296875, metrics={'train_runtime': 226.4021, 'train_samples_per_second': 70.671, 'train_steps_per_second': 2.208, 'train_loss': 0.06429217529296875, 'epoch': 10.0})
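
With `load_best_model_at_end=True` and no `metric_for_best_model` set, the Trainer compares checkpoints by evaluation loss, so the checkpoint it reloads need not be the one with the highest accuracy. The sketch below applies both criteria to the six epochs visible in the log above (values rounded; the log is truncated, so later epochs are not considered). `best_epoch` is a hypothetical helper.

```python
# (epoch, eval_loss, eval_accuracy) copied from the visible part of the training log
history = [
    (1, 0.2059, 0.9275),
    (2, 0.1090, 0.9625),
    (3, 0.0829, 0.9725),
    (4, 0.0929, 0.9750),
    (5, 0.1043, 0.9725),
    (6, 0.0968, 0.9725),
]

def best_epoch(rows, column, greater_is_better):
    """Return the epoch whose value in `column` is best under the given direction."""
    pick = max if greater_is_better else min
    return pick(rows, key=lambda row: row[column])[0]

print(best_epoch(history, column=1, greater_is_better=False))  # 3 (lowest loss)
print(best_epoch(history, column=2, greater_is_better=True))   # 4 (highest accuracy)
```

If accuracy should drive the selection instead, `TrainingArguments` accepts `metric_for_best_model='accuracy'`; whether that would change the final model here cannot be told from the truncated log.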

## Step 6

I predict the label ID for each example in the test dataset (the model's output scores are post-processed with argmax to select the most likely class ID) and convert it back to the plain label format: 'H' or 'M'. After that, I create a DataFrame for submission, fill it with the IDs and predicted labels, and save it to my Drive in CSV format.

In [19]:
predictions = trainer.predict(tokenized_ds['test']).predictions

predicted_class_indices = np.argmax(predictions, axis=1)
predicted_labels = [id2label[i] for i in predicted_class_indices]

In [22]:
submission_df = pd.DataFrame({
    'ID': test_ds['train']['ID'],
    'Class': predicted_labels
})

submission_path = '/content/drive/MyDrive/Colab Notebooks/generative-ai/Week4/submission_10_epochs.csv'
submission_df.to_csv(submission_path, index=False)
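
The submission step boils down to mapping class IDs back to labels and writing two columns. Here is a minimal sketch with the standard `csv` module, writing to an in-memory buffer instead of Drive; the IDs and predictions are made up, and `build_submission` is a hypothetical helper.

```python
import csv
import io

id2label = {0: 'H', 1: 'M'}

def build_submission(ids, class_indices):
    """Return CSV text with an ID column and the predicted label for each row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(['ID', 'Class'])                  # header expected by the submission format
    for row_id, idx in zip(ids, class_indices):
        writer.writerow([row_id, id2label[idx]])      # class ID -> 'H' or 'M'
    return buffer.getvalue()

csv_text = build_submission(ids=[10, 11, 12], class_indices=[1, 0, 1])
print(csv_text)
```

As with `to_csv(..., index=False)` above, the output has no index column, only `ID` and `Class`.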

In [23]:
model.save_pretrained('/content/drive/MyDrive/Colab Notebooks/generative-ai/Week4/model_10_epochs')