# Text classification with BERT-like LLMs

The goal of this project is to perform text classification using a BERT-like LLM. Since our GPU resources are limited, we will use the DistilBERT model, which is available on Hugging Face.

We will perform the classification on two datasets: Dmoz-Computers and Dmoz-Sports.

Note that this notebook was executed on Google Colab. Depending on when you run it, you might need to install some libraries. If that is the case, please install the `datasets` and `evaluate` libraries.
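
One way to find out whether the installation step is needed is to look for the packages before importing them. The following is a minimal sketch; the list of package names is an assumption (it adds `transformers`, which the notebook relies on but which is not named above).

```python
import importlib.util

# Packages this notebook is assumed to need; adjust the list to your setup.
required = ["datasets", "evaluate", "transformers"]

# find_spec returns None when a top-level package cannot be found.
missing = [name for name in required if importlib.util.find_spec(name) is None]

if missing:
    print("Missing packages, install with: pip install " + " ".join(missing))
else:
    print("All required packages are available.")
```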

This document is divided into two parts: first we train DistilBERT on the Dmoz-Computers dataset, then we repeat the process on the Dmoz-Sports dataset.

For that we will do the following:
1. Load the class defined in `bert_classification.py`.
2. Load the dataset.
3. Split the dataset into train (70%), validation (10%) and test (20%).
4. Load the tokenizer and tokenize the datasets.
5. Load and train the model.
6. Show the F1 scores, accuracy and confusion matrix.

We will repeat this process for both datasets.

## Load the class and model

First we load the class `BERT_like_classification`, which implements the interface. The class uses the Hugging Face libraries to load the datasets, train the model and evaluate it.

You can look at the class methods in more detail if you want.


In [4]:
from bert_classification import BERT_like_classification

Here we define the model name, "distilbert/distilbert-base-uncased".

Afterwards, we instantiate our class.

In [5]:
model_name = "distilbert/distilbert-base-uncased"

BERT_classificator = BERT_like_classification(model_name)

## Dmoz-Computers

### Load data

With the class instantiated, we can start working with the data. First we load the dataset; here we are working with the Dmoz-Computers dataset.
The `load_dataset` method loads the CSV file into a dataset and computes the `id2label` and `label2id` mappings.
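```python
def load_dataset(self, data_file):
    # Load the CSV file; all rows end up in a single 'train' split
    self.ds = load_dataset("csv", data_files=data_file)

    # Unique class names found in the 'class' column
    class_set = set(self.ds['train']['class'])

    # Map integer ids to class names and back
    self.id2label = {i: cls for i, cls in enumerate(class_set)}
    self.label2id = {cls: i for i, cls in enumerate(class_set)}
```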
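
The mapping logic can be tried out without any dataset. The sketch below uses a made-up list of class names (hypothetical values) and builds both dictionaries. It sorts the unique names first, which is a deliberate deviation from the method above: the iteration order of a Python `set` of strings can differ between interpreter runs, so sorting is one way to keep the ids stable.

```python
# Hypothetical class column, standing in for self.ds['train']['class']
classes = ["Programming", "Hardware", "Programming", "Internet", "Hardware"]

# Sorting the unique names makes the id assignment reproducible across runs
class_names = sorted(set(classes))

id2label = {i: cls for i, cls in enumerate(class_names)}
label2id = {cls: i for i, cls in enumerate(class_names)}

print(id2label)  # {0: 'Hardware', 1: 'Internet', 2: 'Programming'}
print(label2id)  # {'Hardware': 0, 'Internet': 1, 'Programming': 2}
```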

In [6]:
BERT_classificator.load_dataset('Data/Dmoz-Computers.csv')

Afterwards we have to split the data into train, validation and test sets. We were asked to use a 70-10-20% split, and so we did. However, these percentages are hardcoded in the method; if you wish to change them, you need to change the method inside the class:

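```python
    def split_dataset(self):
        # First split: 70% train, 30% held out for validation + test
        ds_train_devtest = self.ds['train'].train_test_split(test_size=0.3, seed=42)

        # Second split: 66% of the held-out part becomes test (about 20% overall),
        # the remaining 34% becomes validation (about 10% overall)
        ds_devtest = ds_train_devtest['test'].train_test_split(test_size=0.66, seed=42)

        self.ds_splits = DatasetDict({
            'train': ds_train_devtest['train'],
            'valid': ds_devtest['train'],
            'test': ds_devtest['test']
        })
```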
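
To see how the two hardcoded ratios translate into row counts, the arithmetic can be replayed by hand. This is a rough sketch, not the library's code: the function name is hypothetical and rounding to the nearest integer is an assumption, so the real split may differ by a row in edge cases.

```python
def two_stage_split_sizes(n_rows, first_test_size=0.3, second_test_size=0.66):
    """Approximate row counts for the train/valid/test splits."""
    n_devtest = round(n_rows * first_test_size)   # held out for valid + test
    n_train = n_rows - n_devtest
    n_test = round(n_devtest * second_test_size)  # share of the held-out part
    n_valid = n_devtest - n_test
    return n_train, n_valid, n_test


# 9500 is the total number of rows in the Dmoz-Computers splits
n_train, n_valid, n_test = two_stage_split_sizes(9500)
print(n_train, n_valid, n_test)  # 6650 969 1881
print(round(n_valid / 9500, 3), round(n_test / 9500, 3))  # 0.102 0.198
```

Changing the split therefore means changing both `test_size` values: the second one is a fraction of the held-out 30%, not of the whole dataset.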

In [7]:
BERT_classificator.split_dataset()
BERT_classificator.ds_splits

DatasetDict({
    train: Dataset({
        features: ['file_name', 'text', 'class'],
        num_rows: 6650
    })
    valid: Dataset({
        features: ['file_name', 'text', 'class'],
        num_rows: 969
    })
    test: Dataset({
        features: ['file_name', 'text', 'class'],
        num_rows: 1881
    })
})

After splitting the data we have to tokenize the text in the dataset.

For that we load a tokenizer and apply it to the dataset splits. Here is the code of the `set_tokenizer` and `tokenize_ds_splits` methods:

```python
    def set_tokenizer(self, model_name):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)

    def tokenize_ds_splits(self):
        # Tokenize the text of every split in batches
        self.tokenized_ds = self.ds_splits.map(self.preprocess_function, batched=True)

        # Add an integer "labels" column derived from the "class" column
        self.tokenized_ds = self.tokenized_ds.map(
            lambda example: {"labels": self.label2id[example["class"]]})
```
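
The `preprocess_function` passed to `map` is not shown above. A plausible shape for it, assuming it simply tokenizes the `text` column with truncation, is sketched below. `make_preprocess_function` and `toy_tokenizer` are hypothetical names used only for illustration, and the toy tokenizer stands in for a real one so the sketch runs without downloading a model.

```python
def make_preprocess_function(tokenizer, text_column="text"):
    """Build a batched preprocessing function around any tokenizer callable."""
    def preprocess_function(examples):
        # With batched=True, examples[text_column] is a list of strings
        return tokenizer(examples[text_column], truncation=True)
    return preprocess_function


def toy_tokenizer(texts, truncation=True):
    # Stand-in tokenizer: splits on whitespace and returns word positions
    # as placeholder ids instead of real vocabulary ids.
    return {"input_ids": [list(range(len(t.split()))) for t in texts]}


preprocess = make_preprocess_function(toy_tokenizer)
batch = {"text": ["custom tags for site visitors", "official site"]}
print(preprocess(batch))  # {'input_ids': [[0, 1, 2, 3, 4], [0, 1]]}
```

With a real tokenizer, the object returned by `AutoTokenizer.from_pretrained(model_name)` would take the place of `toy_tokenizer`.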

Note that at the end we also create a `labels` column, which is used during training.

After this code is executed, the datasets in `tokenized_ds` contain the additional columns

'input_ids', 'attention_mask', 'labels'

In [8]:
BERT_classificator.set_tokenizer(model_name)
BERT_classificator.tokenize_ds_splits()
BERT_classificator.tokenized_ds

DatasetDict({
    train: Dataset({
        features: ['file_name', 'text', 'class', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 6650
    })
    valid: Dataset({
        features: ['file_name', 'text', 'class', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 969
    })
    test: Dataset({
        features: ['file_name', 'text', 'class', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1881
    })
})

To visualize what happened to the dataset, you can inspect its values. Here we show the first example of the train dataset.

You can see that the file name, class and text are unchanged.

In addition, we now have the `input_ids` (the tokenized text), the `attention_mask` and the `labels` value of the instance.

In [9]:
BERT_classificator.tokenized_ds['train'][0]

{'file_name': '1485443.txt',
 'text': 'CFCustomtags.com Collection of custom tags organized and suggested by site visitors. ',
 'class': 'Programming',
 'input_ids': [101, 12935, 7874, 20389, 15900, 2015, 1012, 4012, 3074, 1997, 7661, 22073, 4114, 1998, 4081, 2011, 2609, 5731, 1012, 102],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'labels': 0}
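
A quick sanity check on this example: in the `bert-base-uncased` vocabulary that DistilBERT reuses, id 101 is `[CLS]` and id 102 is `[SEP]`, so an encoded sequence should start and end with them. The sketch below copies the values from the output above; the helper name is hypothetical.

```python
CLS_ID, SEP_ID = 101, 102  # special-token ids in the bert-base-uncased vocabulary

example = {
    "input_ids": [101, 12935, 7874, 20389, 15900, 2015, 1012, 4012, 3074, 1997,
                  7661, 22073, 4114, 1998, 4081, 2011, 2609, 5731, 1012, 102],
    "attention_mask": [1] * 20,
}


def looks_well_formed(encoded):
    """Check the special tokens and that the mask covers every token."""
    ids, mask = encoded["input_ids"], encoded["attention_mask"]
    return ids[0] == CLS_ID and ids[-1] == SEP_ID and len(ids) == len(mask)


print(looks_well_formed(example))  # True
```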

Then we add the data collator to the class. The data collator pads the examples of each batch to a common length, which is needed during training.
```python
    def set_data_collator(self):
        self.data_collator = DataCollatorWithPadding(tokenizer=self.tokenizer)
```
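
What padding a batch means can be illustrated in plain Python. This is a sketch of the idea only, not the library's implementation (the real collator also converts the batch to tensors). The function name is hypothetical, and the pad id 0 corresponds to `[PAD]` in the `bert-base-uncased` vocabulary.

```python
def pad_batch(batch, pad_id=0):
    """Pad every example to the length of the longest one in the batch."""
    max_len = max(len(ex["input_ids"]) for ex in batch)
    padded = {"input_ids": [], "attention_mask": []}
    for ex in batch:
        n_pad = max_len - len(ex["input_ids"])
        padded["input_ids"].append(ex["input_ids"] + [pad_id] * n_pad)
        # Padding positions get mask 0 so the model ignores them
        padded["attention_mask"].append(ex["attention_mask"] + [0] * n_pad)
    return padded


batch = [
    {"input_ids": [101, 2609, 102], "attention_mask": [1, 1, 1]},
    {"input_ids": [101, 2880, 2609, 1012, 102], "attention_mask": [1, 1, 1, 1, 1]},
]
print(pad_batch(batch))
```

Padding per batch rather than to a global maximum keeps short batches short, which saves compute on a dataset of brief descriptions like this one.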


In [10]:
BERT_classificator.set_data_collator()

### Training

Now we can finally train the model.

To do so, we define the train and eval datasets, load the model and pass the datasets to the `train_model` method. The `load_model` and `train_model` methods are defined as follows:

```python
    def load_model(self, model_name):
        self.model = AutoModelForSequenceClassification.from_pretrained(
            model_name, num_labels=len(self.id2label), id2label=self.id2label, label2id=self.label2id)

    def train_model(self, train_dataset, eval_dataset):
        self.trainer = Trainer(
            model=self.model,
            args=self.training_args,
            train_dataset=train_dataset,
            eval_dataset=eval_dataset,
            processing_class=self.tokenizer,
            data_collator=self.data_collator,
            compute_metrics=self.compute_metrics,
        )

        self.trainer.train()
```

Note that we are using several values that we set up before, such as `id2label`, the tokenizer, the data collator, and so on.

We also defined the `compute_metrics` method to compute accuracy, `f1_micro` and `f1_macro` as follows:

```python
    def compute_metrics(self, predictions):
        # predictions.predictions holds the logits; pick the highest-scoring class
        preds = np.argmax(predictions.predictions, axis=1)
        return {
            "accuracy": accuracy_score(predictions.label_ids, preds),
            "f1_micro": f1_score(predictions.label_ids, preds, average="micro"),
            "f1_macro": f1_score(predictions.label_ids, preds, average="macro"),
        }
```
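
The difference between the two F1 averages is easier to see on a tiny example. The sketch below reimplements the metrics in plain Python on made-up labels (it is not the scikit-learn code used by the class). It also shows why accuracy and micro F1 come out identical when every example has exactly one label: each error counts once as a false positive and once as a false negative.

```python
def f1_report(y_true, y_pred):
    """Accuracy, micro F1 and macro F1 for single-label classification."""
    labels = sorted(set(y_true) | set(y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

    tp_sum = fp_sum = fn_sum = 0
    per_class_f1 = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        per_class_f1.append(2 * tp / denom if denom else 0.0)
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn

    return {
        "accuracy": accuracy,
        # Micro: pool the counts over all classes, then compute one F1
        "f1_micro": 2 * tp_sum / (2 * tp_sum + fp_sum + fn_sum),
        # Macro: compute F1 per class, then take the unweighted mean
        "f1_macro": sum(per_class_f1) / len(per_class_f1),
    }


# Hypothetical labels for seven examples over three classes
report = f1_report([0, 0, 0, 0, 1, 1, 2], [0, 0, 0, 1, 1, 2, 2])
print({name: round(value, 4) for name, value in report.items()})
# {'accuracy': 0.7143, 'f1_micro': 0.7143, 'f1_macro': 0.6746}
```

Macro F1 weights every class equally, so it drops below accuracy when small classes are predicted poorly.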

Finally, we have to talk about the training arguments. You can change them using the `set_training_arguments` method. Here, however, we are using the following training arguments:

```python
        self.training_args = TrainingArguments(
            output_dir="bert-computers-model",
            learning_rate=2e-5,
            per_device_train_batch_size=16,
            per_device_eval_batch_size=16,
            num_train_epochs=5,
            weight_decay=0.01,
            eval_strategy="epoch",
            save_strategy="epoch",
            load_best_model_at_end=True,
            push_to_hub=False,
            report_to="none"
        )

    def set_training_arguments(self, training_arguments):
        self.training_args = training_arguments
```

In [11]:
train_ds = BERT_classificator.tokenized_ds["train"]
eval_ds = BERT_classificator.tokenized_ds["valid"]
BERT_classificator.load_model(model_name)

BERT_classificator.train_model(train_ds, eval_ds)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Epoch | Training Loss | Validation Loss | Accuracy | F1 Micro | F1 Macro |
|---|---|---|---|---|---|
| 1 | No log | 1.336701 | 0.658411 | 0.658411 | 0.631772 |
| 2 | 1.888800 | 1.021377 | 0.713106 | 0.713106 | 0.695092 |
| 3 | 0.952100 | 0.937679 | 0.74097 | 0.74097 | 0.728839 |
| 4 | 0.648600 | 0.896812 | 0.752322 | 0.752322 | 0.742244 |
| 5 | 0.494000 | 0.882202 | 0.760578 | 0.760578 | 0.751337 |
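
The `No log` entry for epoch 1 is most likely a logging artifact rather than a missing value: `TrainingArguments` logs the training loss every `logging_steps` optimizer steps (500 by default), and the first epoch finishes before that. The arithmetic below is a sketch that assumes a single device and no gradient accumulation.

```python
import math

train_rows = 6650        # size of the train split shown earlier
batch_size = 16          # per_device_train_batch_size
epochs = 5               # num_train_epochs
logging_steps = 500      # Trainer default, not overridden in this notebook

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch)                  # 416
print(total_steps)                      # 2080
print(steps_per_epoch < logging_steps)  # True: nothing logged yet after epoch 1
```

Passing a smaller `logging_steps` through `set_training_arguments` would make a training loss appear for the first epoch as well.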


### Results

You can see the accuracy and F1 scores of each epoch on the validation set. Now we want to see how the model performs on the test dataset.

To do so, we take the test dataset and run the `model_predict` method:

```python
    def model_predict(self,test_dataset):
        return self.trainer.predict(test_dataset)
```

This returns the predictions (one row of logits per example), the true label ids, and the same metrics that we computed during training.

In [12]:
test_ds = BERT_classificator.tokenized_ds["test"]
results = BERT_classificator.model_predict(test_ds)
results

PredictionOutput(predictions=array([[-0.3578872 , -1.6224889 , -1.3283521 , ..., -1.7544192 ,
        -1.4088093 , -0.7575517 ],
       [-0.93554765, -0.8362965 , -1.1481228 , ...,  4.001464  ,
         0.6307227 ,  1.0953135 ],
       [ 0.3050696 , -1.386666  , -1.4261293 , ...,  1.166365  ,
        -1.1761849 , -1.3785346 ],
       ...,
       [-1.2533303 , -1.5188222 , -1.2313766 , ..., -0.82733566,
        -0.11195961, -0.9971294 ],
       [ 0.7394495 , -1.727582  , -2.5882478 , ...,  1.9668806 ,
        -1.0289481 , -1.9177725 ],
       [-1.0823113 , -1.6979483 , -0.31248593, ..., -0.14146702,
         0.93240064, -0.7035682 ]], dtype=float32), label_ids=array([11, 15,  8, ..., 11,  8, 12]), metrics={'test_loss': 0.9120056629180908, 'test_accuracy': 0.7416267942583732, 'test_f1_micro': 0.7416267942583732, 'test_f1_macro': 0.7401203104952365, 'test_runtime': 3.0984, 'test_samples_per_second': 607.097, 'test_steps_per_second': 38.085})
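
The `predictions` array holds raw logits, not probabilities. If probabilities or class names are needed, a softmax followed by a lookup in `id2label` does the job. The sketch below uses plain Python with made-up logits and a made-up four-class mapping; none of these values come from the run above.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one example over four classes
logits = [-0.94, -0.84, 4.00, 0.63]
# Hypothetical id-to-name mapping in the style of id2label
id2label = {0: "Hardware", 1: "Internet", 2: "Programming", 3: "Software"}

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(id2label[best], round(probs[best], 3))
```

Taking the argmax of the logits gives the same class as taking the argmax of the probabilities, which is why the notebook can skip the softmax when it only needs labels.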

Finally, we show the results: accuracy, F1 macro and F1 micro.

Afterwards we compute the confusion matrix and present it as a pandas DataFrame, with true labels as rows and predicted labels as columns.

If you want to compare these results, you can see that the DistilBERT classifier performed better than the other methods that we used in previous experiments, such as LR, NB and KNN.

In [24]:
# compute confusion matrix
from sklearn.metrics import confusion_matrix
import numpy as np
import pandas as pd
predictions = results.predictions
predicted_labels = np.argmax(predictions, axis=1)
metrics = results.metrics
print(f'accuracy: {metrics["test_accuracy"]}')
print(f'f1-macro: {metrics["test_f1_macro"]}')
print(f'f1-micro: {metrics["test_f1_micro"]}')

print('Confusion matrix:')
pd.DataFrame(confusion_matrix(test_ds['labels'], predicted_labels))


accuracy: 0.7416267942583732
f1-macro: 0.7401203104952365
f1-micro: 0.7416267942583732
Confusion matrix:


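| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52 | 6 | 1 | 1 | 0 | 1 | 3 | 0 | 11 | 6 | 0 | 0 | 2 | 0 | 3 | 6 | 3 | 1 |
| 1 | 5 | 72 | 0 | 0 | 0 | 1 | 1 | 1 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 2 | 0 | 0 | 98 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 96 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 3 | 6 |
| 4 | 1 | 2 | 0 | 0 | 79 | 1 | 0 | 0 | 5 | 3 | 0 | 2 | 0 | 0 | 1 | 1 | 2 | 0 |
| 5 | 3 | 0 | 1 | 0 | 0 | 84 | 0 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 0 |
| 6 | 0 | 0 | 1 | 2 | 0 | 0 | 72 | 0 | 0 | 1 | 0 | 0 | 5 | 0 | 0 | 1 | 1 | 0 |
| 7 | 7 | 0 | 6 | 0 | 1 | 7 | 0 | 76 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 5 | 10 | 0 | 5 | 3 | 0 | 4 | 0 | 31 | 7 | 5 | 1 | 6 | 1 | 2 | 3 | 1 | 2 |
| 9 | 5 | 5 | 1 | 1 | 4 | 0 | 4 | 1 | 7 | 144 | 4 | 5 | 6 | 3 | 4 | 2 | 1 | 0 |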
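
A confusion matrix can also be reduced to per-class precision and recall, which makes weak classes easier to spot. For instance, row 8 above has 31 correct predictions out of 86 examples, a recall of about 0.36. The sketch below shows the computation on a small made-up matrix; the function name is hypothetical, and rows are assumed to be true classes and columns predicted classes, as in scikit-learn's `confusion_matrix`.

```python
def per_class_precision_recall(matrix):
    """Return {class_id: (precision, recall)} for a square confusion matrix."""
    n = len(matrix)
    report = {}
    for k in range(n):
        tp = matrix[k][k]
        row_total = sum(matrix[k])                       # true members of class k
        col_total = sum(matrix[r][k] for r in range(n))  # predictions of class k
        recall = tp / row_total if row_total else 0.0
        precision = tp / col_total if col_total else 0.0
        report[k] = (precision, recall)
    return report


# Small made-up 3-class matrix, not taken from the results above
toy_matrix = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 1, 9],
]
for label, (precision, recall) in per_class_precision_recall(toy_matrix).items():
    print(label, round(precision, 2), round(recall, 2))
# 0 0.8 0.8
# 1 0.75 0.6
# 2 0.75 0.9
```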


## Dmoz-Sports

Now we will repeat the same process for the other dataset. Since the process is basically the same and only the dataset changes, we will not explain each step again.

Once again we will follow these steps:
1. Load the class defined in `bert_classification.py`.
2. Load the dataset.
3. Split the dataset into train (70%), validation (10%) and test (20%).
4. Load the tokenizer and tokenize the datasets.
5. Load and train the model.
6. Show the F1 scores, accuracy and confusion matrix.

In [25]:
BERT_classificator = BERT_like_classification(model_name)

In [26]:
BERT_classificator.load_dataset('Data/Dmoz-Sports.csv')

In [27]:
BERT_classificator.split_dataset()
BERT_classificator.ds_splits

DatasetDict({
    train: Dataset({
        features: ['file_name', 'text', 'class'],
        num_rows: 9450
    })
    valid: Dataset({
        features: ['file_name', 'text', 'class'],
        num_rows: 1377
    })
    test: Dataset({
        features: ['file_name', 'text', 'class'],
        num_rows: 2673
    })
})

In [28]:
BERT_classificator.set_tokenizer(model_name)
BERT_classificator.tokenize_ds_splits()
BERT_classificator.tokenized_ds

DatasetDict({
    train: Dataset({
        features: ['file_name', 'text', 'class', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 9450
    })
    valid: Dataset({
        features: ['file_name', 'text', 'class', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1377
    })
    test: Dataset({
        features: ['file_name', 'text', 'class', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 2673
    })
})

In [29]:
BERT_classificator.tokenized_ds['train'][0]

{'file_name': '3273768.txt',
 'text': 'Jason Jennings Official site for this referee includes biography, career information and pictures. ',
 'class': 'Wrestling',
 'input_ids': [101, 4463, 14103, 2880, 2609, 2005, 2023, 5330, 2950, 8308, 1010, 2476, 2592, 1998, 4620, 1012, 102],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'labels': 16}

In [30]:
BERT_classificator.set_data_collator()

In [31]:
train_ds = BERT_classificator.tokenized_ds["train"]
eval_ds = BERT_classificator.tokenized_ds["valid"]
BERT_classificator.load_model(model_name)

BERT_classificator.train_model(train_ds, eval_ds)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Epoch | Training Loss | Validation Loss | Accuracy | F1 Micro | F1 Macro |
|---|---|---|---|---|---|
| 1 | 1.5194 | 0.433201 | 0.914306 | 0.914306 | 0.91609 |
| 2 | 0.3709 | 0.290114 | 0.931009 | 0.931009 | 0.932443 |
| 3 | 0.215 | 0.283541 | 0.933914 | 0.933914 | 0.934098 |
| 4 | 0.1389 | 0.286384 | 0.938272 | 0.938272 | 0.938999 |
| 5 | 0.0962 | 0.284447 | 0.936819 | 0.936819 | 0.937319 |
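
Because `load_best_model_at_end=True` is set without `metric_for_best_model`, the Trainer should select the checkpoint with the lowest validation loss rather than the highest accuracy. The sketch below replays that selection on the values from the training log above; it does not call the Trainer itself.

```python
# (epoch, validation loss, accuracy) copied from the training log above
history = [
    (1, 0.433201, 0.914306),
    (2, 0.290114, 0.931009),
    (3, 0.283541, 0.933914),
    (4, 0.286384, 0.938272),
    (5, 0.284447, 0.936819),
]

best_by_loss = min(history, key=lambda row: row[1])
best_by_accuracy = max(history, key=lambda row: row[2])

print(best_by_loss[0])      # 3
print(best_by_accuracy[0])  # 4
```

The two criteria disagree here, so the model evaluated on the test set is probably the epoch 3 checkpoint, not the one with the highest validation accuracy.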


In [32]:
test_ds = BERT_classificator.tokenized_ds["test"]
results = BERT_classificator.model_predict(test_ds)
results

PredictionOutput(predictions=array([[-1.4874568 , -1.3725168 , -0.42121738, ..., -1.2879579 ,
        -2.6495423 , -1.4335514 ],
       [ 6.998868  , -1.1218852 , -0.63164437, ..., -0.24794273,
        -1.0862033 , -1.2787061 ],
       [-1.5502948 , -1.889889  ,  0.14690319, ..., -1.7733362 ,
        -0.36319673, -0.79816693],
       ...,
       [-2.2477841 ,  0.19130567, -2.2924504 , ..., -1.2653762 ,
        -1.6541051 , -0.40284488],
       [-0.11651286, -1.55833   , -0.88925624, ..., -0.9760277 ,
        -0.42926943, -1.4897208 ],
       [ 0.16858548, -1.796612  , -1.254937  , ..., -0.6036927 ,
        -1.1408494 , -0.9983588 ]], dtype=float32), label_ids=array([23,  0,  9, ..., 17, 18,  7]), metrics={'test_loss': 0.29002314805984497, 'test_accuracy': 0.9248035914702581, 'test_f1_micro': 0.9248035914702581, 'test_f1_macro': 0.9242414554144767, 'test_runtime': 3.8212, 'test_samples_per_second': 699.522, 'test_steps_per_second': 43.965})

### Results

Once again DistilBERT performs better than the other ML models. Note also that other BERT-like models may perform slightly better than DistilBERT; however, we chose this one because it trains faster with less compute power.

In [35]:
pd.set_option('display.max_columns', 30)
predictions = results.predictions
predicted_labels = np.argmax(predictions, axis=1)
metrics = results.metrics
print(f'accuracy: {metrics["test_accuracy"]}')
print(f'f1-macro: {metrics["test_f1_macro"]}')
print(f'f1-micro: {metrics["test_f1_micro"]}')

print('Confusion matrix:')
pd.DataFrame(confusion_matrix(test_ds['labels'], predicted_labels))

accuracy: 0.9248035914702581
f1-macro: 0.9242414554144767
f1-micro: 0.9248035914702581
Confusion matrix:


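| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 105 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 100 | 0 | 5 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 9 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 0 | 0 | 92 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 2 | 0 | 71 | 0 | 1 | 0 | 1 | 2 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 101 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 95 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 1 | 0 | 0 | 0 | 87 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 79 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 2 | 0 | 0 | 1 | 0 | 0 | 0 |
| 8 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 2 | 89 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 85 | 0 | 0 | 1 | 0 | 1 | 0 | 3 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |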
