# 1. Install dependencies

In [None]:
!pip install -q transformers datasets accelerate -U

# 2. Example: Run sentiment analysis predictions by using Pipeline
When no model name is specified, `pipeline` falls back to a default model for the task. For `sentiment-analysis`, that default is `distilbert-base-uncased-finetuned-sst-2-english`.

In [None]:
# Using pipeline class to make predictions from models available in the Hub in an easy way
from transformers import pipeline
sentiment_pipeline = pipeline("sentiment-analysis")
data = ["I love you", "I hate you"]
sentiment_pipeline(data)

# 3. Example: Select a sentiment analysis model for our fine-tuning task
We want to pick a base model that is small enough to train in Colab with free resources and see some results after fine-tuning. We will select a couple to evaluate based on likes, number of downloads, and size.
- The `finiteautomata/bertweet-base-sentiment-analysis` model has 135 million parameters. It is based on BERTweet, which is a RoBERTa model pre-trained on 850M English tweets. It is specifically fine-tuned for sentiment analysis on social media text.
- The `distilbert-base-uncased-finetuned-sst-2-english` model is the default model provided by HuggingFace (that we tried out above). It has 66 million parameters and uses DistilBERT, a smaller, faster, and cheaper version of BERT.
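
To get a rough feel for what these parameter counts mean on a free Colab instance, here is a back-of-the-envelope sketch. It assumes 4 bytes per parameter (fp32) to load a model, and roughly 16 bytes per parameter to train it with an Adam-style optimizer (weights, gradients, and two optimizer moments). Those per-parameter figures are rule-of-thumb assumptions, and activation memory is ignored, so real usage will be higher.

```python
# Back-of-the-envelope memory estimate from a parameter count.
# Assumptions: 4 bytes/param to load (fp32), ~16 bytes/param to train
# (weights + gradients + two optimizer moments). Activations are ignored.

def estimate_memory_gb(num_params, bytes_per_param):
    """Return an approximate memory footprint in gigabytes (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9

# Parameter counts as quoted in the list above
candidate_models = {
    "finiteautomata/bertweet-base-sentiment-analysis": 135_000_000,
    "distilbert-base-uncased-finetuned-sst-2-english": 66_000_000,
}

for name, num_params in candidate_models.items():
    load_gb = estimate_memory_gb(num_params, 4)
    train_gb = estimate_memory_gb(num_params, 16)
    print(f"{name}: ~{load_gb:.2f} GB to load, ~{train_gb:.2f} GB to train (before activations)")
```

Under these assumptions, both candidates should fit comfortably in the memory of a free GPU runtime.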

Since both models are small and efficient, we'll try out `finiteautomata/bertweet-base-sentiment-analysis`. This model has POSITIVE, NEGATIVE, and NEUTRAL labels, instead of only POSITIVE/NEGATIVE. This time we will pull it into our code from the HuggingFace Models Hub by name. Note that our values will look slightly different since we're using a new model.

**Model Cards:**


*   https://huggingface.co/finiteautomata/bertweet-base-sentiment-analysis
*   https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english

**Tip:** If you sign up for a free Hugging Face account, you can interact with their Inference API available on model card pages, allowing you to quickly test sample outputs for various inputs.

In [None]:
# TODO: Set up our selected model for sentiment analysis
sentiment_analysis_model = None
# TODO: Call the model on the `data` list from above to ensure it's working as expected


# 4. Create test data
Now that we have a model in place, let's create some test data that's a little more ambiguous. Add your own!



In [None]:
test_data = [
    # TODO: Add your own test data
    "That was really uncool",
    "I feel unsure",
    "I would rather not work with this company again",
    "I feel you guys did everything you could",
    "The movie was okay, not great but not bad either",
    "I guess it could be worse",
    "The service was decent but could improve",
    "I'm not entirely satisfied with the results",
    "It wasn't what I expected, but it was alright",
    "Is the store open on the weekends?",
    "The food was somewhat edible",
    "I have mixed feelings about this project",
    "How many lines of credit can I open?",
    "The presentation was underwhelming",
    "I suppose it's fine for now",
    "It's not my favorite, but it's acceptable",
    "It's summer weather outside",
    "There are things I like and dislike about it",
    "I'm neither happy nor upset about the outcome",
    "It's passable, though not impressive",
    "The product is okay, but I've seen better",
    "I can tolerate it, but I don't love it",
    "It didn't meet my expectations, but it's not terrible",
    "I'm pretty neutral",
    "Fairly alright",
    "The couch is gray",
    "I hope he eats dirt",
    "Through all the doom and gloom, he survived in the end, though not unscathed",
    "I can't believe how invigorating this experience was, never experienced anything like it",
    "I'm really grateful for the extended hours during the exam season.",
    "I had a hard time finding a quiet spot due to loud conversations.",
    "Are there any penalties for returning books late?"
]

# 5. Get a baseline
Log the sentiment labels and scores from the baseline model for each item in the test data. We will use this for a comparison with our fine-tuned model to ensure its performance is improving.
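
Beyond printing each result, it can help to summarize the baseline. The sketch below assumes the results are a list of dictionaries with `label` and `score` keys, which is the shape a text-classification pipeline returns. The phrases, labels, scores, and the `0.7` threshold are all invented for illustration, and the exact label strings depend on the model's configuration.

```python
from collections import Counter

# Hypothetical results, invented for illustration
example_phrases = ["I feel unsure", "The couch is gray", "I hope he eats dirt"]
example_results = [
    {"label": "NEGATIVE", "score": 0.62},
    {"label": "NEUTRAL", "score": 0.97},
    {"label": "NEGATIVE", "score": 0.98},
]

def summarize(phrases, results, low_confidence=0.7):
    """Count predictions per label and collect phrases scored below the threshold."""
    label_counts = Counter(result["label"] for result in results)
    uncertain = [
        (phrase, result["label"], result["score"])
        for phrase, result in zip(phrases, results)
        if result["score"] < low_confidence
    ]
    return label_counts, uncertain

label_counts, uncertain = summarize(example_phrases, example_results)
print(label_counts)
print(uncertain)
```

Low-confidence predictions are a useful place to start when looking for phrases the model finds ambiguous.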

In [None]:
# TODO: Call our sentiment analysis model again, this time with `test_data`, to get sentiment labels and scores
results = None

# TODO: Print each test data phrase with its corresponding sentiment analysis result for visibility
for phrase, result in zip(test_data, results):
    print(f"Sentence: {phrase}\nSentiment: {result}\n")

## Analysis: Areas for improvement
A somewhat subjective analysis:

- Likely Neutral:
  - "I feel unsure" is classified as negative but could be considered more contextually neutral. Its score should reflect more ambiguity.
  - "The movie was okay, not great but not bad either" should be classified as neutral rather than positive.
  - "I have mixed feelings about this project" should be classified as neutral rather than negative.
  - "I'm neither happy nor upset about the outcome" should be classified as neutral rather than negative.
  - "Through all the doom and gloom, he survived in the end, though not unscathed" should likely be neutral rather than positive.
  - "Are there any penalties for returning books late?" should be neutral instead of negative.
- Neutral/Negative:
  - "The product is okay, but I've seen better" should be classified as neutral or negative rather than a strong positive.
  - "I can tolerate it, but I don't love it" may be better as a weaker negative.
  - "The service was decent but could improve" is classified as positive, but has neutral/negative undertones.
- Neutral/Positive:
  - "I can't believe how invigorating this experience was, never experienced anything like it" is classified as negative, but has neutral/positive undertones.
  - "Fairly alright" is more neutral/positive than positive.
- Negative:
  - "The food was somewhat edible" is classified as positive, but it suggests a negative sentiment.
  - "It didn't meet my expectations, but it's not terrible" is classified as positive, but it suggests a negative sentiment.

# 6. Fine-tune
## Load a training dataset
Let's try out `yamini0506/hotel_reviews_sentiment_1K`.

Dataset card: https://huggingface.co/datasets/yamini0506/hotel_reviews_sentiment_1K

This dataset comes pre-split into train, val, and test data. This makes it easy for us. However, sometimes we have to split our data ourselves, which can look something like this:
```
### EXAMPLE ###
### SPLITTING A DATASET ###
from datasets import load_dataset

# Load the dataset
dataset = load_dataset('davanstrien/fake-library-chats-with-sentiment')

# Split the 'train' split into train (90%) and test (10%) sets
train_test_split = dataset['train'].train_test_split(test_size=0.1)
train_dataset = train_test_split['train']
test_dataset = train_test_split['test']
```

- **Training dataset**: Used to train the model. The model learns patterns and relationships from this data. It is usually the largest subset so the model has enough examples to learn from.
- **Validation dataset**: Used to tune hyperparameters and select a model. It measures performance during training and helps detect overfitting by showing how well the model generalizes. It is smaller than the training dataset.
- **Testing dataset**: Used to evaluate the final model after training and hyperparameter tuning. It provides an unbiased estimate of the model's performance on unseen data. It is also smaller than the training dataset.

A common split is 70-80% for training, 10-15% for validation, and 10-15% for testing.
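
The three-way split described above can be sketched with plain Python indices. This is a minimal illustration assuming an 80/10/10 split and a fixed seed; `split_indices` is a hypothetical helper, not part of the `datasets` library. With `datasets`, a similar result could be reached by calling `train_test_split` twice.

```python
import random

def split_indices(num_examples, val_fraction=0.1, test_fraction=0.1, seed=42):
    """Shuffle example indices and cut them into train/val/test lists."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    num_test = round(num_examples * test_fraction)
    num_val = round(num_examples * val_fraction)
    test_idx = indices[:num_test]
    val_idx = indices[num_test:num_test + num_val]
    train_idx = indices[num_test + num_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))  # 800 100 100
```

Shuffling before cutting matters: if the source data is sorted by label or date, unshuffled slices would give each subset a different distribution.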

### Tip – Data Integrity:
I referenced another dataset above, `davanstrien/fake-library-chats-with-sentiment`. It looks great at first glance, but I saw abnormally high training accuracy (a sign of overfitting) when I used it. On closer inspection, the data is full of duplicates. Overfitting means the model learns the training data too closely and fails to generalize to unseen data. Make sure the data you use is high-quality and meaningful to get good results.

Dataset Card: https://huggingface.co/datasets/davanstrien/fake-library-chats-with-sentiment
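
One way to spot duplicates like these is to count repeated texts within a single split. The sketch below is a minimal illustration: `find_duplicates` is a hypothetical helper, the sample rows are invented, and the normalization (lowercasing and collapsing whitespace) is an assumption about what should count as "the same" text.

```python
from collections import Counter

def find_duplicates(texts):
    """Return {normalized text: count} for texts that appear more than once."""
    normalized = (" ".join(text.lower().split()) for text in texts)
    counts = Counter(normalized)
    return {text: count for text, count in counts.items() if count > 1}

# Hypothetical rows, invented for illustration
sample_texts = [
    "Where is the reference desk?",
    "where is the  reference desk?",
    "The wifi keeps dropping.",
    "Thanks for the quick help!",
    "Thanks for the quick help!",
]

duplicates = find_duplicates(sample_texts)
duplicate_rows = sum(count - 1 for count in duplicates.values())
print(duplicates)
print(f"{duplicate_rows} of {len(sample_texts)} rows are repeats")
```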

Below, we can see one method for checking a dataset for overlapping samples.

In [None]:
from datasets import load_dataset, Dataset
from collections import Counter

# TODO: Load the `yamini0506/hotel_reviews_sentiment_1K` dataset
dataset = None

# Inspect the dataset
print(dataset)

# TODO: Split the dataset into train, validation, and test sets
train_dataset = None
val_dataset = None
test_dataset = None

# Inspect the split datasets
print(train_dataset)
print(val_dataset)
print(test_dataset)

# Check for overlapping examples
train_reviews = set([example['total_review'] for example in train_dataset])
test_reviews = set([example['total_review'] for example in test_dataset])
overlap = train_reviews.intersection(test_reviews)
print(f"Number of overlapping reviews: {len(overlap)}")

## Preprocess the dataset
The key piece here is tokenizing our data.

* Tokenization is a crucial step in NLP that breaks down text into tokens, enabling models to process and learn from the data.
* Tokenizers like AutoTokenizer handle the complex task of converting text into tokens and token IDs, including handling padding, truncation, and special tokens.
* Implementation in practice involves initializing a tokenizer, defining a preprocessing function, and applying it to the dataset.

Each tokenizer has a vocabulary, a mapping of tokens to unique identifiers. During tokenization, text is converted into these identifiers for the model to process.

Example:
```
tokens = tokenizer.tokenize("I love machine learning")
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(token_ids)  # Example output: [1045, 2293, 3698, 4086] (exact IDs depend on the tokenizer's vocabulary)
```
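
To make the vocabulary lookup, truncation, and padding steps concrete, here is a toy sketch. It is not how `AutoTokenizer` works internally: real tokenizers split text into subword units and add special tokens, while this example simply splits on whitespace. The vocabulary and the `toy_encode` helper are invented for illustration.

```python
# Toy vocabulary, invented for illustration
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "i": 2, "love": 3, "machine": 4, "learning": 5}

def toy_encode(text, max_length):
    """Map whitespace-separated words to IDs, then truncate or pad to max_length."""
    ids = [toy_vocab.get(word, toy_vocab["[UNK]"]) for word in text.lower().split()]
    ids = ids[:max_length]                    # truncation: drop tokens past the limit
    attention_mask = [1] * len(ids)           # 1 marks a real token
    num_padding = max_length - len(ids)
    ids += [toy_vocab["[PAD]"]] * num_padding  # padding: fill up to max_length
    attention_mask += [0] * num_padding        # 0 marks padding the model should ignore
    return {"input_ids": ids, "attention_mask": attention_mask}

print(toy_encode("I love machine learning", max_length=6))
print(toy_encode("I love deep learning a lot", max_length=4))
```

The first call is padded with two `[PAD]` IDs, and the second is cut to four tokens, with the unknown word "deep" mapped to `[UNK]`.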

In [None]:
from transformers import AutoTokenizer

# TODO: Set up the AutoTokenizer for our base model
tokenizer = None

# TODO: Write a preprocessing function
def preprocess_function(examples):
    # Tokenize the input texts
    # Use `truncation=True` and `padding=True` in the tokenization process to ensure all input sequences are of consistent length
    tokenized_inputs = None
    # Add labels to the tokenized inputs
    # The Trainer class in transformers library expects 'labels', so we have to remap
    tokenized_inputs['labels'] = None
    return tokenized_inputs

# TODO: Apply the preprocessing function to the train, val, and test datasets
# `batched=True` is a performance optimization that allows the tokenizer to process multiple examples at once
tokenized_train_dataset = None
tokenized_val_dataset = None
tokenized_test_dataset = None

# Inspect the tokenized dataset
print(tokenized_train_dataset)
print(tokenized_val_dataset)
print(tokenized_test_dataset)
print(tokenized_train_dataset[0])
print(tokenized_val_dataset[0])
print(tokenized_test_dataset[0])

## Fine-tune the model using the dataset we loaded

Note: At the top of the cell below, we define a `compute_metrics` function that calculates accuracy during evaluation. We pass this function to the `Trainer` (not to `TrainingArguments`) when we initialize it.
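
What an accuracy calculation like this does can be sketched in plain Python: take the highest-scoring class for each example and compare it with the true label. The logits and labels below are invented for illustration, and the helper names are hypothetical.

```python
def argmax(scores):
    """Return the index of the largest value in a list of scores."""
    return max(range(len(scores)), key=lambda i: scores[i])

def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class matches the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(pred == label for pred, label in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical logits for 4 examples and 3 classes
example_logits = [
    [2.1, 0.3, -1.0],   # predicts class 0
    [0.2, 1.5, 0.1],    # predicts class 1
    [-0.5, 0.0, 3.2],   # predicts class 2
    [1.0, 0.9, 0.8],    # predicts class 0
]
example_labels = [0, 1, 2, 1]
print(accuracy(example_logits, example_labels))  # 0.75
```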


### Key Terms
* **Epoch**: Refers to one complete pass through the entire training dataset. During training, the model updates its weights as it processes the data. An epoch means the model has seen every training example once. Training for multiple epochs allows the model to learn better patterns from the data.
  * Example: If you have 1000 training examples and you train for 5 epochs, your model will process 5000 examples in total, seeing each example 5 times.
* **Batch**: A batch is a subset of the training data. Instead of updating the model weights after every single training example, batches allow the model to update weights after processing a set number of examples. Using batches improves computational efficiency and makes the learning process smoother and more stable.
  * Example: If you have 1000 training examples and use a batch size of 100, each epoch will have 10 batches.
* **Loss**: A measure of how well the model's predictions match the actual labels. It quantifies the difference between the predicted outputs and the true outputs. The goal of training is to minimize the loss, making the model's predictions as accurate as possible.
* **Learning Rate**: Hyperparameter that controls how much to change the model in response to the estimated error each time the model weights are updated. A higher learning rate means the model will change weights more quickly, while a lower learning rate means the model will change weights more slowly. The learning rate determines the speed and quality of the convergence to the minimum loss.
* **Weight Decay**:  A regularization technique that involves adding a small penalty to the loss function to discourage large weights. By penalizing large weights, weight decay helps prevent the model from overfitting to the training data. If the model's weights are too large, it might memorize the training data and perform poorly on unseen data. Weight decay helps to keep the weights in check.
* **Overfitting**: Overfitting occurs when a machine learning model captures the noise and details in the training data to such an extent that it negatively impacts the model's performance on new, unseen data. A model that overfits will perform exceptionally well on the training data but poorly on validation or test data because it has memorized the training data rather than learning the underlying patterns.
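
The arithmetic behind epochs and batches, and the effect of the learning rate and weight decay on a single weight, can be sketched as follows. The update rule is a simplified gradient-descent step with the weight shrunk directly (often called decoupled weight decay). It is an illustration only, not the optimizer `Trainer` actually runs, and both helper names are hypothetical.

```python
import math

def training_steps(num_examples, batch_size, num_epochs):
    """Return (steps per epoch, total steps); a partial final batch still counts."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch, steps_per_epoch * num_epochs

print(training_steps(1000, 100, 5))  # (10, 50)
print(training_steps(1000, 16, 1))   # (63, 63)

def sgd_step(weight, gradient, learning_rate, weight_decay):
    """One simplified update: shrink the weight, then move against the gradient."""
    weight = weight * (1 - learning_rate * weight_decay)  # weight decay pulls toward zero
    weight = weight - learning_rate * gradient            # learning rate scales the step
    return weight

# Hypothetical values chosen to keep the numbers easy to follow
print(sgd_step(weight=1.0, gradient=0.5, learning_rate=0.1, weight_decay=0.01))
```

With a batch size that does not divide the dataset evenly, the last batch of each epoch is smaller, which is why the step count is rounded up.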

In [None]:
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments, EarlyStoppingCallback
import numpy as np

# Accuracy is computed directly with NumPy, so no extra metrics library is needed

# Define a function to compute metrics during evaluation
def compute_metrics(p):
    # Get the predictions (logits) and labels from the inputs
    predictions, labels = p
    # Convert the logits to the class with the highest score
    predictions = np.argmax(predictions, axis=1)
    # Compute and return the accuracy metric
    return {"accuracy": float((predictions == labels).mean())}

# TODO: Load the base model using AutoModelForSequenceClassification
# We have 3 sentiment classes (positive, neutral, negative), so pass in 3 labels

# TODO: Initialize training arguments
training_args = TrainingArguments(
    output_dir=None,  # Directory to save the model and other outputs
    eval_strategy=None,  # Evaluate the model at the end of each epoch
    save_strategy=None,  # Save the model at the end of each epoch
    logging_strategy=None,  # Log training information at regular intervals
    logging_steps=None,  # Log every 10 steps
    learning_rate=None,  # Size of optimization step. Here we have a learning rate of 0.00002
    per_device_train_batch_size=None,  # Number of examples to process per step during training
    per_device_eval_batch_size=None,  # Number of examples to process per step during evaluation
    num_train_epochs=None,  # Number of times the model runs through the entire training data. We're using a single epoch for fast train time
    weight_decay=None,  # Penalizes large weights to help prevent overfitting. Set on a logarithmic scale (0.1, 0.01, 0.001...)
)

# TODO: Create a Trainer instance with the specified training arguments and datasets
trainer = Trainer(
    model=None,  # The model to be trained
    args=None,  # Training arguments defined above
    train_dataset=None,  # The dataset to be used for training
    eval_dataset=None,  # The dataset to be used for evaluation
    tokenizer=None,  # The tokenizer used for preprocessing the data
    compute_metrics=None  # The function to compute metrics during evaluation
)

# TODO: Train the model

# TODO: Save the model and tokenizer after training

# 7. Evaluate
Evaluate both the original model and the new model for accuracy.
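
When comparing the two models on the hand-written test phrases, it can be easier to look only at the phrases whose label changed. The sketch below assumes both pipelines return dictionaries with a `label` key and use the same label names. The results shown are invented for illustration, and `changed_predictions` is a hypothetical helper.

```python
def changed_predictions(phrases, old_results, new_results):
    """Return (phrase, old label, new label) for phrases whose label changed."""
    return [
        (phrase, old["label"], new["label"])
        for phrase, old, new in zip(phrases, old_results, new_results)
        if old["label"] != new["label"]
    ]

# Hypothetical results, invented for illustration
example_phrases = ["I feel unsure", "The couch is gray"]
example_old = [{"label": "NEGATIVE", "score": 0.62}, {"label": "NEUTRAL", "score": 0.97}]
example_new = [{"label": "NEUTRAL", "score": 0.71}, {"label": "NEUTRAL", "score": 0.95}]

for phrase, old_label, new_label in changed_predictions(example_phrases, example_old, example_new):
    print(f"{phrase!r}: {old_label} -> {new_label}")
```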

In [None]:
# TODO: Load the fine-tuned model from the specified directory
new_model = None
# TODO: Create a pipeline for sentiment analysis using the new model and tokenizer
new_model_pipeline = None

# TODO: Get predictions from the original model
# Note: 'sentiment_analysis_model' is the pipeline for the original model
old_results = None
# TODO: Get predictions from the fine-tuned model
# Note: 'new_model_pipeline' is the pipeline for the fine-tuned model
new_results = None

# Compare the results for our test data
for phrase, old_result, new_result in zip(test_data, old_results, new_results):
    print(f"Sentence: {phrase}\nOld Model: {old_result}\nNew Model: {new_result}\n")

This is where we test the accuracy of the model using the `compute_metrics` function we defined earlier.

We use the `test` data split from our dataset this time.
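
`Trainer.evaluate()` returns a dictionary of metrics whose keys carry an `eval_` prefix, so an `accuracy` value from `compute_metrics` would appear as `eval_accuracy`. A minimal sketch for comparing two such dictionaries follows. The numbers are invented for illustration, and `compare_accuracy` is a hypothetical helper.

```python
def compare_accuracy(original_results, new_results, key="eval_accuracy"):
    """Return the accuracy change (new minus original) in percentage points."""
    return (new_results[key] - original_results[key]) * 100

# Hypothetical evaluation results, invented for illustration
example_original = {"eval_loss": 1.20, "eval_accuracy": 0.55}
example_new = {"eval_loss": 0.60, "eval_accuracy": 0.80}

change = compare_accuracy(example_original, example_new)
print(f"Accuracy change: {change:+.1f} percentage points")
```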

In [None]:
# TODO: Load the original model from HuggingFace model hub
# The model has 3 sentiment classes (positive, neutral, negative), so we pass in 3 labels
original_model = None

# TODO: Initialize training arguments for evaluating the original model
original_training_args = TrainingArguments(
    output_dir=None,
    eval_strategy=None,
    per_device_eval_batch_size=None,
)

# TODO: Create a Trainer instance for the original model
original_trainer = Trainer(
    model=None,
    args=None,
    eval_dataset=None,
    tokenizer=None,
    compute_metrics=None,
)

# TODO: Evaluate the original model using the Trainer
original_model_eval_results = None
print("Original Model Evaluation Results:", original_model_eval_results)

# TODO: Evaluate the fine-tuned model using the Trainer
new_model_eval_results = None
print("New Model Evaluation Results:", new_model_eval_results)

## Congrats! You've fine-tuned a sentiment analysis model.
There are many training techniques and hyperparameters that we didn't cover. I encourage you to explore other ways of fine-tuning and see if you can improve the model further.

#### Additional Reading Recommendations:
* https://rentry.org/llm-training
* https://levelup.gitconnected.com/fine-tune-smaller-nlp-models-with-hugging-face-for-specific-use-cases-1745813471dc
* https://addepto.com/blog/rag-vs-fine-tuning-a-comparative-analysis-of-llm-learning-techniques/