# Bayesian Optimization links

- https://distill.pub/2020/bayesian-optimization/
- https://medium.com/latinxinai/tuning-deep-learning-made-easy-with-scikit-optimize-150c1b1d4ba
- https://scikit-optimize.github.io/stable/modules/generated/skopt.Space.html
- https://medium.com/distributed-computing-with-ray/hyperparameter-optimization-for-transformers-a-guide-c4e32c6c989b
- https://medium.datadriveninvestor.com/k-fold-cross-validation-for-parameter-tuning-75b6cb3214f
- https://towardsdatascience.com/hugging-face-transformers-fine-tuning-distilbert-for-binary-classification-tasks-490f1d192379
- https://huggingface.co/blog/ray-tune
- https://wandb.ai/amogkam/transformers/reports/Hyperparameter-Optimization-for-HuggingFace-Transformers--VmlldzoyMTc2ODI

## [Link 1 (Agnihotri & Batra, 2020)](https://distill.pub/2020/bayesian-optimization/)

- Gold Mining problem: "In this problem, we want to accurately estimate the gold distribution on the new land. We can not drill at every location due to the prohibitive cost. Instead, we should drill at locations providing high information about the gold distribution. This problem is akin to Active Learning
[2, 3]"
- "In this problem, we want to find the location of the maximum gold content. We, again, can not drill at every location. Instead, we should drill at locations showing high promise about the gold content." ""sthing called an acquisition function. Acquisition functions are heuristics for how desirable it is to evaluate a point, based on our present model 4 . We will spend much of this section going through different options for acquisition functions.
This brings us to how Bayesian Optimization works. At every step, we determine what the best point to evaluate next is according to the acquisition function by optimizing it. We then update our model and repeat this process to determine the next point to evaluate.
You may be wondering what’s “Bayesian” about Bayesian Optimization if we’re just optimizing these acquisition functions. Well, at every step we maintain a model describing our estimates and uncertainty at each point, which we update according to Bayes’ rule
[2] at each step. Our acquisition functions are based on this model, and nothing would be possible without them."
  

"When training a model is not expensive and time-consuming, we can do a grid search to find the optimum hyperparameters. However, grid search is not feasible if function evaluations are costly, as in the case of a large neural network that takes days to train. Further, grid search scales poorly in terms of the number of hyperparameters."

"Next, we looked at the “Bayes” in Bayesian Optimization — the function evaluations are used as data to obtain the surrogate posterior. We look at acquisition functions, which are functions of the surrogate posterior and are optimized sequentially. This new sequential optimization is in-expensive and thus of utility of us. "

## [Link 2 (Alvarez, 2023)](https://medium.com/latinxinai/tuning-deep-learning-made-easy-with-scikit-optimize-150c1b1d4ba)

-  #simple to do params section
    search_space = {'lr': np.arange(.001, .0501, .0001),
                    'layers': np.arange(1, max_layers+1),
                    'batch_size': [16, 32, 64, 128],
                    'optimizers': [RMSprop, Adam, Nadam]
                    #SGD causes the exploding gradient problem in regression
              

-      }

# References

[1] Agnihotri, A., & Batra, N. (2020). Exploring Bayesian Optimization. Distill. Retrieved June 25, 2024, from https://distill.pub/2020/bayesian-optimization/

[2] 
Alvarez, J. (2023, June 7). Tuning deep learning made easy with Scikit-Optimize. *LatinXinAI*. Retrieved June 25, 2024, from https://medium.com/latinxinai/tuning-deep-learning-made-easy-with-scikit-optimize

In [22]:
# Import the "datasets" library that allows downloading datasets from Hugging Face
#!pip install datasets # datasets library is already installed on this machine.
import datasets
from datasets import load_dataset
# Enable loading the local copy of the dataset with this function
from datasets import load_from_disk
import pandas as pd
# Download NLP libraries
import nltk
from nltk.tag import pos_tag
# Word tokenizer
from nltk.tokenize import word_tokenize
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk import bigrams
# Download stopwords
nltk.download('stopwords')
from nltk.stem import WordNetLemmatizer
import string
import re
import copy

## Import wordnet functionality for negation handling
from nltk.corpus import wordnet
nltk.download('wordnet')

# Import libraries for plotting confusion matrices
import matplotlib.pyplot as plt
import numpy as np

### Scikit-Learn functionality
# TF-IDF functionality
from sklearn.feature_extraction.text import TfidfVectorizer
# Import evaluation metrics
from nltk.classify import accuracy
from sklearn.metrics import accuracy_score, f1_score,  precision_recall_fscore_support, classification_report, confusion_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
# Import train_test_split for stratified dataset splitting to maintain proportions of each class in each split for cross-val part of the coursework
from sklearn.model_selection import train_test_split
# Allow stratified k-fold cross-validation to address challenges of dealing with an unbalanced dataset.
from sklearn.model_selection import StratifiedKFold

# Import sentiment lexicons
from afinn import Afinn
from nltk.corpus import sentiwordnet as swn
nltk.download('sentiwordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ophel\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ophel\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ophel\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package sentiwordnet to
[nltk_data]     C:\Users\ophel\AppData\Roaming\nltk_data...
[nltk_data]   Package sentiwordnet is already up-to-date!


True

In [3]:
# Load the HuggingFace poem dataset from local storage into a variable called "dataset"
original_dir = './datasets/original_poem_sentiment_dataset'
poem_dataset = load_from_disk(original_dir)
print(f"Original dataset: {poem_dataset}")

train_ds = poem_dataset['train']
val_ds = poem_dataset['validation']
test_ds = poem_dataset['test']

# Convert the three splits into pandas dataframes for easier viewing and analysis of the dataset using the inbuilt 'to_pandas' method
train_df = train_ds.to_pandas() 
val_df = val_ds.to_pandas()
test_df = test_ds.to_pandas()

# Display the first 10 lines in each of the training, val and test data splits
print("TRAIN DATA")
print(train_df.head(10))
print('\n**********************************************************************\n')
print("VALIDATION DATA")
print(val_df.head(10))
print('\n**********************************************************************\n')
print("TEST DATA")
print(test_df.head(10))


Original dataset: DatasetDict({
    train: Dataset({
        features: ['id', 'verse_text', 'label'],
        num_rows: 892
    })
    validation: Dataset({
        features: ['id', 'verse_text', 'label'],
        num_rows: 105
    })
    test: Dataset({
        features: ['id', 'verse_text', 'label'],
        num_rows: 104
    })
})
TRAIN DATA
   id                                         verse_text  label
0   0  with pale blue berries. in these peaceful shad...      1
1   1                it flows so long as falls the rain,      2
2   2                 and that is why, the lonesome day,      0
3   3  when i peruse the conquered fame of heroes, an...      3
4   4            of inward strife for truth and liberty.      3
5   5                   the red sword sealed their vows!      3
6   6                          and very venus of a pipe.      2
7   7                who the man, who, called a brother.      2
8   8           and so on. then a worthless gaud or two,      0
9   9        

In [5]:
print(train_ds.features)
print(val_ds.features)
print(test_ds.features)

{'id': Value(dtype='int32', id=None), 'verse_text': Value(dtype='string', id=None), 'label': ClassLabel(names=['negative', 'positive', 'no_impact', 'mixed'], id=None)}
{'id': Value(dtype='int32', id=None), 'verse_text': Value(dtype='string', id=None), 'label': ClassLabel(names=['negative', 'positive', 'no_impact', 'mixed'], id=None)}
{'id': Value(dtype='int32', id=None), 'verse_text': Value(dtype='string', id=None), 'label': ClassLabel(names=['negative', 'positive', 'no_impact', 'mixed'], id=None)}


### Tunstall, von Werra and Thomas Wolf: Natural Language Processing with Transformers --> Building Language Applications with Hugging Face (ch. 2)

- DistilBERT cannot receive raw strings as input, the texts must be tokenized and "encoded".
- Tokenization: breaking down a string into the atomic units used in the model.
- The optimal splitting of words into subunits is **usually learned from the corpus**.
- p.33 on subword tokenization, dealing with complex words/mispellings by combining the best aspects of character and word tokenization --> it is "learned" from the pre-training corpus using a mix of rule-based and statistical algorithms.
- WordPiece is used by DistilBERT tokenizers.
- "HuggingFace Trans
formers provides a convenient AutoTokenizer class that allows you to quickly loa 
the tokenizer associated with a pretrained model—we just call its from_pretrained )
method, providing the ID of a model on the Hub or a local file path. Let’s start by
loading the tokenizer for Distil." (p.33)
- BRT ""

### DistilBERT Link and Description
- Link: https://huggingface.co/distilbert/distilbert-base-uncased
- "DistilBERT is a transformers model, smaller and faster than BERT, which was pretrained on the same corpus in a self-supervised fashion, using the BERT base model as a teacher. This means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts using the BERT base model."
- "Distillation loss: the model was trained to return the same probabilities as the BERT base model."
- "it's mostly intended to be fine-tuned on a downstream task."
- "Even if the training data used for this model could be characterized as fairly neutral, this model can have biased predictions."
- "DistilBERT pretrained on the same data as BERT, which is BookCorpus, a dataset consisting of 11,038 unpublished books and English Wikipedia (excluding lists, tables and headers)."
- The texts are lowercased and tokenized using WordPiece and a vocabulary size of 30,000. The inputs of the model are then of the form: "[CLS] Sentence A [SEP] Sentence B [SEP]"
- 

### Getting Started with Sentiment Analysis using Python

https://huggingface.co/blog/sentiment-analysis-python



### Hyperparameter Search with Transformers and Ray Tune
## Code adapted from: https://huggingface.co/blog/ray-tune

https://huggingface.co/blog/ray-tune

In [8]:
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
from transformers import Trainer, TrainingArguments
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
import tensorflow as tf

In [9]:
# CODE FROM: https://python.plainenglish.io/fine-tuning-distilbert-with-your-own-dataset-for-multi-classification-task-69f944189648


In [10]:
# Load in original train-val-test datasets and extract samples and labels
train_set = pd.read_csv("original_test_df.csv")
val_set = pd.read_csv("original_val_df.csv")
test_set = pd.read_csv("original_test_df.csv")
print(train_set.head(5))
# Convert from pandas Series to list as this is what the distilbert tokenizer requires as inputs
train_texts = train_set["verse_text"].to_list()
train_labels=train_set["label"].to_list()
val_texts = val_set["verse_text"].to_list()
val_labels = val_set["label"].to_list()
test_texts = test_set["verse_text"].to_list()
test_labels = test_set["label"].to_list()


tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
# Find longest token length in dataset
max_length = 0

# Tokenize each text sample and find max length
for text in train_set["verse_text"]:
    # Tokenize text
    input_ids = tokenizer.encode(text, add_special_tokens=True) # special tokens = indicate start of sequence, end of sequence, separation
    # Update max length
    max_length = max(max_length, len(input_ids))

print("Maximum sequence length:", max_length)

max_length = 0
# Tokenize each val sample and find max length
for text in val_set["verse_text"]:
    # Tokenize text
    input_ids = tokenizer.encode(text, add_special_tokens=True) # special tokens = indicate start of sequence, end of sequence, separation
    # Update max length
    max_length = max(max_length, len(input_ids))

print("Maximum sequence length:", max_length)

max_length = 0
# Tokenize each val sample and find max length
for text in test_set["verse_text"]:
    # Tokenize text
    input_ids = tokenizer.encode(text, add_special_tokens=True) # special tokens = indicate start of sequence, end of sequence, separation
    # Update max length
    max_length = max(max_length, len(input_ids))

print("Maximum sequence length:", max_length)

   id                                         verse_text  label
0   0                      my canoe to make more steady,      2
1   1  and be glad in the summer morning when the kin...      1
2   2       and when they reached the strait symplegades      2
3   3                             she sought for flowers      2
4   4                       if they are hungry, paradise      2
Maximum sequence length: 20
Maximum sequence length: 27
Maximum sequence length: 20


In [11]:
## Code from: https://www.sunnyville.ai/fine-tuning-distilbert-multi-class-text-classification-using-transformers-and-tensorflow/
"""
    Fine-tuning in the HuggingFace's transformers library involves using a pre-trained model and a tokenizer 
    that is compatible with that model's architecture and input requirements.
    Each pre-trained model in transformers can be accessed using the right model
    class and be used with the associated tokenizer class. Since we want to use 
    DistilBert for a classification task, we will use the DistilBertTokenizer tokenizer 
    class to tokenize our texts and then use TFDistilBertForSequenceClassification 
    model class in a later section to fine-tune the pre-trained model using the output from the tokenizer.
    The DistilBertTokenizer generates input_ids and attention_mask as outputs.
    This is what is required by a DistilBert model as its inputs.
"""


tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)
print(train_encodings.keys())
# Code from: https://medium.com/@raoashish10/fine-tuning-a-pre-trained-bert-model-for-classification-using-native-pytorch-c5f33e87616e
print(val_encodings['input_ids'][8])
print(val_encodings['attention_mask'][8])
print(tokenizer.decode(train_encodings['input_ids'][8]))

dict_keys(['input_ids', 'attention_mask'])
[101, 2010, 2132, 2003, 11489, 1012, 2002, 6732, 2006, 2273, 1998, 5465, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[CLS] of long - uncoupled bed, and childless eld, [SEP] [PAD] [PAD] [PAD] [PAD]


From https://www.sunnyville.ai/fine-tuning-distilbert-multi-class-text-classification-using-transformers-and-tensorflow/

*"So, in the above code, we defined the tokenizer object using the from_pretrained() method which downloads and caches the tokenizer files associated with the DistilBert model. When we pass text through this tokenizer the generated output will be in the format expected by the DistilBert architecture, as stated above. We use padding and truncation to make sure all the vectors are the same size. You can learn more about DistilBert and it's tokenizer from the DistilBert section of the transformers library's official documentation. And more info regarding the padding and truncation options is available here. Now that we have our texts in an encoded form, there is only one step left before we can begin the fine-tuning process."*

## Fine-Tuning using TFTrainer Class Provided by the *transformers* Library: enables easy training and evaluation of models


From https://www.sunnyville.ai/fine-tuning-distilbert-multi-class-text-classification-using-transformers-and-tensorflow/
- The TFTrainer (Trainer for Pytorch) is a class provided by the transformers library that offers a simple, yet feature-rich, method of training and evaluating models.
- The following code shows how to define the configuration settings and build a model using the TFTrainer class.s.

In [12]:
# Code from: https://huggingface.co/transformers/v3.4.0/custom_datasets.html
""" We put the data in this format so that the data can be easily batched such that each key in the batch encoding
    corresponds to a named parameter of the forward() method of the model we will train. """

# Create a custom dataset class inheriting from PyTorch's Dataset class --> required as inputs to the DistilbertModel
class PoemDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        # Store list of encoded tokens here
        self.encodings = encodings
        # Store list of corresponding labels for each sample here
        self.labels = labels
        
    # This special __ getter function enables retrieving items in an encoding using []-notation and 'idx' as the integer to index into the 
    # encodings
    def __getitem__(self, idx):
        # dict comprehension: creates a key for each key in the encoding for a specific idx/sample: e.g., 'input_ids'
        # the values will be the list containing the encoded tokens, attention masks etc.
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        # add key-value pair for label of indexed sample
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    # return size of dataset
    def __len__(self):
        return len(self.labels)

train_dataset = PoemDataset(train_encodings, train_labels)
val_dataset = PoemDataset(val_encodings, val_labels)
test_dataset = PoemDataset(test_encodings, test_labels)

# Verify this has worked by taking the first train sample as an example
print(train_texts[0], train_labels[0], '\n')
print(train_dataset[0])

my canoe to make more steady, 2 

{'input_ids': tensor([  101,  2026, 14347,  2000,  2191,  2062,  6706,  1010,   102,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0]), 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), 'labels': tensor(2)}


In [23]:
# Ref: https://medium.com/@rakeshrajpurohit/customized-evaluation-metrics-with-hugging-face-trainer-3ff00d936f99

def compute_metrics(pred):
    true_labels = pred.label_ids
    predicted_labels = pred.predictions.argmax(-1)
    # select metrics for dealing with unbalanced classes (macro precision/recall/f1-score)
    precision, recall, f1, _ = precision_recall_fscore_support(true_labels, predicted_labels, average='macro')
    return {
        'precision': precision,
        'recall': recall,
        'f1-score': f1,
    }

In [None]:
# Now all we need to do is create a model to fine-tune, define the TrainingArguments/TFTrainingArguments and instantiate a Trainer/TFTrainer.


# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

# Initialize model with correct number of labels
 # this line will result in warning: You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
# this is a warning telling us to fine-tune the model on our own dataset before proceeding with evaluating on test set, which we will now do.
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=4)

# Create Trainer instance
trainer = Trainer(
    model=model,                      # the instantiated 🤗 Transformers model to be trained
    args=training_args,               # training arguments, defined above
    train_dataset=train_dataset,      # training dataset
    eval_dataset=val_dataset,          # evaluation dataset
    compute_metrics=compute_metrics    # compute macro-avg precision/recall/f1-scores
)

# Start training
trainer.train()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss


In [None]:
# Evaluate on test dataset
eval_results = trainer.evaluate(eval_dataset=test_dataset)

# Print evaluation results
print("Evaluation results:")
for key, value in eval_results.items():
    print(f"{key}: {value}")