<a href="https://colab.research.google.com/github/hidaaf/Machine-Learning-methods-for-Text-Classification-using-the-IMDB-corpus/blob/main/ML_methods_for_Text_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to HLT 2023 Project (Template)

- Student(s) Name(s): *Hiba Daafane*, *Parisa Piran*
- Date: *19/06/2023*
- Chosen Corpus: IMDB
- Contributions (if group project): We were both involved in all tasks, especially model training, which required multiple runs. Hiba was responsible for data processing, while Parisa took care of tokenization. For model training, we each ran different configurations and then agreed on the best version. For the annotation, each of us annotated 50 documents, and we then applied the trained model to them for evaluation. All comments and results were discussed between the two of us.

### Corpus information

- Description of the chosen corpus: The IMDB dataset is the Large Movie Review Dataset, intended for binary sentiment classification. The training set contains 25,000 highly polar movie reviews, with another 25,000 for testing. Additional unlabeled data is provided for use in unsupervised learning settings.
- Paper(s) and other published materials related to the corpus:

1. [Deep CNN-LSTM with combined kernels from multiple branches for IMDb review sentiment analysis](https://ieeexplore.ieee.org/abstract/document/8249013)
2. [Sentiment analysis on IMDB using lexicon and neural networks](https://link.springer.com/article/10.1007/s42452-019-1926-x)
3. [Example of trained model published on the Hugging Face](https://huggingface.co/fabriceyhc/bert-base-uncased-imdb)

- **State-of-the-art performance (best published results) on this corpus:** The two papers above propose neural-network models that reach accuracies of 89% and 91%; both raise concerns about overfitting during training. The model published on Hugging Face, described as a fine-tuned version of distilbert-base-uncased on the imdb dataset, achieves an accuracy of 92.8% with a loss of 0.1903. We took this information into account when training our model, hoping to outperform these results.

---

## 1. Setup

In [None]:
# Your code to install and import libraries etc. here
!pip install --quiet datasets evaluate optuna transformers[torch] accelerate -U

In [None]:
import datasets
import transformers
import accelerate
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from pprint import pprint
import torch

import numpy as np
import evaluate

In [None]:
datasets.disable_progress_bar()
datasets.logging.set_verbosity_error()

---

## 2. Data download and preprocessing

### 2.1. Download the corpus

In [None]:
# Your code to download the corpus here
dataset = datasets.load_dataset('imdb')

Downloading and preparing dataset imdb/plain_text to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0...
Dataset imdb downloaded and prepared to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0. Subsequent calls will reuse this data.


In [None]:
# Get info about the corpus
builder = datasets.load_dataset_builder('imdb')
print(builder.info.description)

Large Movie Review Dataset.
This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.


### 2.2. Preprocessing

In [None]:
# Get a look at the data
pprint(dataset["train"][0])

{'label': 0,
 'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the '
         'controversy that surrounded it when it was first released in 1967. I '
         'also heard that at first it was seized by U.S. customs if it ever '
         'tried to enter this country, therefore being a fan of films '
         'considered "controversial" I really had to see this for myself.<br '
         '/><br />The plot is centered around a young Swedish drama student '
         'named Lena who wants to learn everything she can about life. In '
         'particular she wants to focus her attentions to making some sort of '
         'documentary on what the average Swede thought about certain '
         'political issues such as the Vietnam War and race issues in the '
         'United States. In between asking politicians and ordinary denizens '
         'of Stockholm about their opinions on politics, she has sex with her '
         'drama teacher, classmates, and married men.<br

In [None]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})


In [None]:
# The corpus does not have a validation set, so we split off a random, stratified portion of the training data to use for validation
from sklearn.model_selection import train_test_split

del dataset['unsupervised']
train_data, val_data = train_test_split(dataset['train'], test_size=0.2, random_state=42, stratify=[example['label'] for example in dataset['train']])
test_data = dataset['test']

# Create DatasetDict and assign splits
train = datasets.Dataset.from_dict(train_data)
valid = datasets.Dataset.from_dict(val_data)

dataset_dict = datasets.DatasetDict({"train": train, "validation": valid, "test": test_data})
print(dataset_dict)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 20000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
})
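The 20,000 / 5,000 split above comes from `test_size=0.2` with `stratify` set to the labels, so both parts keep the same share of positive and negative reviews. As a rough illustration of what stratification does, here is a minimal standard-library sketch. It is not scikit-learn's implementation, and the function name `stratified_split` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction, seed):
    """Split indices so that each label keeps the same share in both parts."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    train_idx, val_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)                      # random order within one class
        cut = round(len(indices) * test_fraction)
        val_idx.extend(indices[:cut])             # held-out share of this class
        train_idx.extend(indices[cut:])           # remainder goes to training
    return sorted(train_idx), sorted(val_idx)

# Hypothetical balanced toy labels (50 negative, 50 positive)
labels = [0] * 50 + [1] * 50
train_idx, val_idx = stratified_split(labels, test_fraction=0.2, seed=42)
print(len(train_idx), len(val_idx))
print(sum(labels[i] for i in val_idx), "positive examples in the validation part")
```

Because the split is done per class, the validation part stays balanced whenever the original labels are balanced.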


In [None]:
dataset = dataset_dict.shuffle()

# Keep the test set the same size as the validation set (this should not have much of an effect)
dataset['test'] = dataset['test'].select(range(5000))

In [None]:
# Tokenization

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

def tokenize_example(example):
  tokenized = tokenizer(example['text'], truncation=True, padding='max_length')
  return tokenized

In [None]:
tokenized_dataset = dataset.map(tokenize_example, batched=True)

In [None]:
# Check if everything works
tokenized =tokenize_example(dataset["train"][0])
decoded_tokens = tokenizer.convert_ids_to_tokens(tokenized['input_ids'])

print("\nOriginal Example:")
print(dataset["train"][0])
print("\nTokenized Example:")
print(decoded_tokens)


Original Example:
{'text': "This will be a different kind of review. I've seen this movie twice on TV and would like to have a copy because it talks about Panama City and the beach in the winter time which is my favorite time to be there. It was the first movie I'd seen by Ashley Judd and she was great and I've enjoyed every other thing I've seen her in. Sundance's reaction made an impression on me too, as did the director, Victor Nunez, who has directed and written several movies about Florida. This movie speaks to me and I've seen nothing with which to compare it. The plot speaks less to me than the surroundings. Well, I told you it would be a different kind of review.", 'label': 1}

Tokenized Example:
['[CLS]', 'this', 'will', 'be', 'a', 'different', 'kind', 'of', 'review', '.', 'i', "'", 've', 'seen', 'this', 'movie', 'twice', 'on', 'tv', 'and', 'would', 'like', 'to', 'have', 'a', 'copy', 'because', 'it', 'talks', 'about', 'panama', 'city', 'and', 'the', 'beach', 'in', 'the', 'win
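
With `truncation=True` and `padding='max_length'`, every review ends up with the same number of token ids: long reviews are cut, short ones are filled with padding tokens, and the attention mask marks which positions are real. The toy sketch below illustrates only that idea in plain Python. It is not the tokenizer's implementation (a real tokenizer also keeps the closing `[SEP]` token when truncating), and the ids and the helper name `pad_or_truncate` are hypothetical.

```python
PAD_ID = 0  # assumed padding id for this toy example

def pad_or_truncate(input_ids, max_length, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = input_ids[:max_length]          # truncation: drop everything past max_length
    mask = [1] * len(ids)                 # 1 marks a real token
    padding = max_length - len(ids)
    # 0 in the mask marks a padding position the model should ignore
    return ids + [pad_id] * padding, mask + [0] * padding

short_ids, short_mask = pad_or_truncate([101, 2023, 3185, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Padding every example to the maximum length is simple but spends computation on padding positions; padding per batch is a common alternative.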

---

## 3. Machine learning model

### 3.1. Model training

In [None]:
# Your code to train the machine learning model on the training set and evaluate the performance on the validation set here
model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased')

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias', 'pre_classifier.bias', 'pre_classifier.we

In [None]:
# Define the training arguments
training_args = transformers.TrainingArguments(
    output_dir='checkpoints',
    evaluation_strategy="steps",
    logging_strategy="steps",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    eval_steps=100,
    logging_steps=100,
    max_steps=2000,
    learning_rate=5e-05,
    #weight_decay=0.02,
    load_best_model_at_end=True
    #metric_for_best_model='accuracy',
)


In [None]:
accuracy = evaluate.load('accuracy')

def compute_accuracy(outputs_and_labels):
  outputs, labels = outputs_and_labels
  predictions = np.argmax(outputs, axis = -1)
  return accuracy.compute(predictions = predictions, references = labels)

In [None]:
# Define the trainer

trainer = transformers.Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['validation'],
    compute_metrics= compute_accuracy,
    callbacks=[transformers.EarlyStoppingCallback(5)]
)

In [None]:
# Now we train and hope for the best :)

trainer.train()



Step,Training Loss,Validation Loss,Accuracy
100,0.4778,0.366877,0.846
200,0.3844,0.409428,0.8568
300,0.3497,0.343815,0.8694
400,0.3152,0.511713,0.852
500,0.3212,0.300107,0.8958
600,0.3763,0.341987,0.8826
700,0.316,0.302667,0.8892
800,0.2974,0.234931,0.9088
900,0.2699,0.273646,0.9086
1000,0.3234,0.364689,0.8876


TrainOutput(global_step=2000, training_loss=0.3085549211502075, metrics={'train_runtime': 2578.4053, 'train_samples_per_second': 6.205, 'train_steps_per_second': 0.776, 'total_flos': 2119478378496000.0, 'train_loss': 0.3085549211502075, 'epoch': 0.8})
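
The `'epoch': 0.8` in the output above follows from the step budget: 2,000 steps with a batch size of 8 cover 16,000 of the 20,000 training examples. A small sketch of that arithmetic, assuming a single device and no gradient accumulation (the helper name `epochs_covered` is hypothetical):

```python
def epochs_covered(max_steps, batch_size, num_train_examples):
    """Fraction of passes over the training set completed after max_steps updates."""
    # Assumes one device and no gradient accumulation
    return max_steps * batch_size / num_train_examples

first_run = epochs_covered(max_steps=2000, batch_size=8, num_train_examples=20_000)
final_run = epochs_covered(max_steps=1500, batch_size=32, num_train_examples=20_000)
print(first_run)   # first training run in section 3.1
print(final_run)   # settings used later in section 3.3
```

The same calculation gives the 2.4 epochs reported for the run in section 3.3 (1,500 steps with a batch size of 32).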

### 3.2 Hyperparameter optimization

In [None]:
print(tokenized_dataset)

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 20000
    })
    validation: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 5000
    })
})


In [None]:
# Your code for hyperparameter optimization here
import optuna

def objective(trial: optuna.Trial):
    # Define the search space for hyperparameters
    learning_rate = trial.suggest_float("learning_rate", 5e-6, 5e-4, log=True)
    train_epochs = trial.suggest_int('num_train_epochs', low = 2,high = 5)
    batch_size = trial.suggest_categorical("batch_size", [8, 16, 32, 64])


    trainer_args = transformers.TrainingArguments(
        'checkpoints',
        evaluation_strategy="steps",
        logging_strategy="steps",
        eval_steps=100,
        logging_steps=100,
        learning_rate=learning_rate,
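        num_train_epochs=train_epochs,
        # NOTE: a positive max_steps overrides num_train_epochs, so the sampled epoch count has no effect here
        max_steps=2000,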
        load_best_model_at_end=True,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size
    )

    model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased')

    trainer = transformers.Trainer(
        model=model,
        args=trainer_args,
        train_dataset=tokenized_dataset['train'],
        eval_dataset=tokenized_dataset['validation'],
        compute_metrics=compute_accuracy,
        callbacks=[transformers.EarlyStoppingCallback(4)]
    )

    # Train the model, then evaluate it on the validation set
    trainer.train()
    eval_results = trainer.evaluate()
    return eval_results["eval_accuracy"]  # the study maximizes validation accuracy

In [None]:
# Clear cached CUDA memory to help avoid "CUDA out of memory" errors
import torch
torch.cuda.empty_cache()

In [None]:
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=3, gc_after_trial=True)

[I 2023-06-29 14:32:21,688] A new study created in memory with name: no-name-45fcdeeb-0506-49c2-b274-313452a583b8


Step,Training Loss,Validation Loss,Accuracy
100,0.4136,0.286817,0.8872
200,0.2886,0.239341,0.9016
300,0.2356,0.234156,0.9014
400,0.2339,0.224099,0.9148
500,0.2442,0.206775,0.9164
600,0.2274,0.217248,0.919
700,0.1861,0.208999,0.9252
800,0.171,0.2214,0.9198
900,0.1716,0.202177,0.9268
1000,0.1682,0.203569,0.925


[I 2023-06-29 15:33:55,951] Trial 0 finished with value: 0.925 and parameters: {'learning_rate': 1.5330213131632162e-05, 'num_train_epochs': 5, 'batch_size': 32}. Best is trial 0 with value: 0.925.

Step,Training Loss,Validation Loss,Accuracy
100,0.5989,0.353484,0.8756
200,0.3123,0.271825,0.8936
300,0.3215,0.308141,0.8648
400,0.2858,0.237303,0.9102
500,0.2341,0.240097,0.9078
600,0.2592,0.250462,0.9066
700,0.2747,0.226208,0.9154
800,0.2394,0.226085,0.9154
900,0.2671,0.217822,0.9162
1000,0.2506,0.212071,0.919


[I 2023-06-29 16:11:19,404] Trial 1 finished with value: 0.919 and parameters: {'learning_rate': 9.55326461900999e-06, 'num_train_epochs': 4, 'batch_size': 16}. Best is trial 0 with value: 0.925.

Step,Training Loss,Validation Loss,Accuracy
100,0.705,0.694562,0.5
200,0.7009,0.693643,0.5
300,0.7005,0.697826,0.5
400,0.6964,0.693778,0.5
500,0.6948,0.693287,0.5
600,0.6935,0.695555,0.5
700,0.6954,0.693165,0.5
800,0.6937,0.693145,0.5
900,0.7055,0.702432,0.5
1000,0.6963,0.693147,0.5


[I 2023-06-29 17:04:34,674] Trial 2 finished with value: 0.5 and parameters: {'learning_rate': 0.00039648323404752603, 'num_train_epochs': 3, 'batch_size': 32}. Best is trial 0 with value: 0.925.
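
Trial 2 is stuck at an accuracy of 0.5 with a validation loss of about 0.693. That value is what cross-entropy gives when a binary classifier assigns probability 0.5 to the correct class for every example, since -ln(0.5) = ln 2. A plausible reading, which we did not verify further, is that the learning rate of about 4e-4 was too high and the model collapsed to uninformative predictions. The check below uses only the standard library; the function name is hypothetical.

```python
import math

def cross_entropy_for_true_class(p_true_class):
    """Cross-entropy loss of one example given the probability of its true class."""
    return -math.log(p_true_class)

chance_loss = cross_entropy_for_true_class(0.5)  # uniform guess over two classes
print(round(chance_loss, 6))
print(abs(chance_loss - 0.693147) < 1e-6)        # compare with the final logged validation loss
```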


In [None]:
# Print the best trial's hyperparameters and objective value
best_trial = study.best_trial
print(f"Best trial (number {best_trial.number}):")
print(f"  Value: {best_trial.value}")
print(f"  Params: {best_trial.params}")

# Print all trials' hyperparameters and objective values
print("\nAll trials:")
for trial in study.trials:
    print(f"  Trial {trial.number}:")
    print(f"    Value: {trial.value}")
    print(f"    Params: {trial.params}")


Best trial (number 0):
  Value: 0.925
  Params: {'learning_rate': 1.5330213131632162e-05, 'num_train_epochs': 5, 'batch_size': 32}

All trials:
  Trial 0:
    Value: 0.925
    Params: {'learning_rate': 1.5330213131632162e-05, 'num_train_epochs': 5, 'batch_size': 32}
  Trial 1:
    Value: 0.919
    Params: {'learning_rate': 9.55326461900999e-06, 'num_train_epochs': 4, 'batch_size': 16}
  Trial 2:
    Value: 0.5
    Params: {'learning_rate': 0.00039648323404752603, 'num_train_epochs': 3, 'batch_size': 32}
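
The learning rates listed above range from roughly 9.6e-6 to 4.0e-4. They come from `trial.suggest_float("learning_rate", 5e-6, 5e-4, log=True)`, where `log=True` asks Optuna to sample from the log domain, so each order of magnitude is about equally likely. The sketch below shows plain log-uniform sampling with the standard library only. It is an illustration of the idea, not Optuna's sampler, and the helper name `sample_log_uniform` is hypothetical.

```python
import math
import random

def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniformly distributed between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)
samples = [sample_log_uniform(5e-6, 5e-4, rng) for _ in range(1000)]

# 5e-5 is the geometric midpoint of the range, so about half the samples fall below it
share_below_midpoint = sum(s < 5e-5 for s in samples) / len(samples)
print(min(samples), max(samples))
print(share_below_midpoint)
```

With a linear scale, by contrast, about 90% of the draws would land above 5e-5, leaving small learning rates almost unexplored.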


### 3.3. Evaluation on test set

In [None]:
# Unfortunately, with our code we could not find a way to directly evaluate the best model from the search on the test set,
# so we create a new instance of the model and train it with the best hyperparameters

model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased')

trainer_args = transformers.TrainingArguments(
    output_dir='best_checkpoints',
    evaluation_strategy = 'steps',
    logging_strategy = 'steps',
    load_best_model_at_end = True,
    eval_steps = 100,
    logging_steps = 100,
    learning_rate = 1.5330213131632162e-05,
    num_train_epochs = 5,
    per_device_train_batch_size = 32,
    per_device_eval_batch_size = 32,
    max_steps = 1500,  # NOTE: overrides num_train_epochs; training stops after 1500 steps
)

trainer = transformers.Trainer(
    model = model,
    args = trainer_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['validation'],
    compute_metrics=compute_accuracy,
    callbacks=[transformers.EarlyStoppingCallback(4)]
)

trainer.train()



Step,Training Loss,Validation Loss,Accuracy
100,0.476,0.290378,0.8866
200,0.2879,0.287921,0.8772
300,0.2575,0.283164,0.883
400,0.2454,0.213839,0.9182
500,0.2205,0.242045,0.9044
600,0.2287,0.202617,0.9216
700,0.191,0.271627,0.9008
800,0.1768,0.211797,0.9216
900,0.1752,0.216287,0.9172
1000,0.1602,0.222442,0.9166


TrainOutput(global_step=1500, training_loss=0.20906474367777506, metrics={'train_runtime': 3597.5419, 'train_samples_per_second': 13.342, 'train_steps_per_second': 0.417, 'total_flos': 6358435135488000.0, 'train_loss': 0.20906474367777506, 'epoch': 2.4})
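
Since `load_best_model_at_end=True` is set without `metric_for_best_model`, the `Trainer` compares checkpoints by evaluation loss rather than accuracy. The sketch below simply picks the best evaluation from the rows logged above (only the first 1,000 of the 1,500 steps appear in the log). It is a plain-Python illustration, not the `Trainer`'s logic, and as far as we understand only steps at which a checkpoint was actually saved can be reloaded.

```python
# (step, training_loss, validation_loss, accuracy) rows copied from the log above
history = [
    (100, 0.476, 0.290378, 0.8866),
    (200, 0.2879, 0.287921, 0.8772),
    (300, 0.2575, 0.283164, 0.883),
    (400, 0.2454, 0.213839, 0.9182),
    (500, 0.2205, 0.242045, 0.9044),
    (600, 0.2287, 0.202617, 0.9216),
    (700, 0.191, 0.271627, 0.9008),
    (800, 0.1768, 0.211797, 0.9216),
    (900, 0.1752, 0.216287, 0.9172),
    (1000, 0.1602, 0.222442, 0.9166),
]

best_by_loss = min(history, key=lambda row: row[2])      # lowest validation loss
best_by_accuracy = max(history, key=lambda row: row[3])  # highest accuracy (first one on ties)

print("best step by validation loss:", best_by_loss[0])
print("best step by accuracy:", best_by_accuracy[0])
```

Here both criteria point to the same step, but they can disagree; uncommenting `metric_for_best_model='accuracy'` in the training arguments would be the way to select by accuracy instead.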

In [None]:
eval_results = trainer.evaluate(eval_dataset = tokenized_dataset['test'])
#print(eval_results)

for metric, value in eval_results.items():
    print(f"{metric}: {value}")

eval_loss: 0.18921582400798798
eval_accuracy: 0.9324
eval_runtime: 88.347
eval_samples_per_second: 56.595
eval_steps_per_second: 1.777
epoch: 2.4


In [None]:
# Save the model
save_directory = "/content/drive/MyDrive/Best_model"
trainer.save_model(save_directory)

In [None]:
# Load the model again (connect to Drive first)
load_directory = "/content/drive/MyDrive/Best_model"

# Use the sequence-classification class so that the fine-tuned classifier head is restored;
# plain AutoModel loads only the encoder and discards the classifier weights
loaded_model = AutoModelForSequenceClassification.from_pretrained(load_directory)


---

## 4. Results and summary

### 4.1 Corpus insights

As mentioned in the introduction, the IMDB dataset is the Large Movie Review Dataset provided by Stanford. It is quite popular and is often used as a benchmark for sentiment analysis and text classification tasks. The dataset contains a collection of movie reviews, each labeled with either a positive or a negative sentiment.

It is important to note that the version used for this work is relatively large, containing 25,000 movie reviews for training and another 25,000 for testing, plus additional unlabeled data for unsupervised learning (which we discarded in this work). However, due to limited computational power, we used only 20,000 instances for training and 5,000 each for validation and testing.

### 4.2 Results

In this project, we used the pre-trained model "distilbert-base-uncased" to initialize both the tokenizer and the classifier. Training the classifier was somewhat tricky. One of the main issues we faced was that our initial hyperparameter values caused the model to overfit: we obtained very high accuracy values together with a training loss that almost reached 60%. Before moving on to hyperparameter tuning, we tried different parameter combinations until we obtained satisfying results, with the model achieving 92.6% accuracy and a loss of 0.22 on the validation data. After the hyperparameter optimization process, we achieved a best accuracy of 92.7%, a validation loss of 0.20, and a training loss of 0.13.

### 4.3 Relation to state of the art

The accuracy our model achieved surpasses all of the results discussed in the introduction. However, further searching showed that the highest accuracy we could find reported on the IMDB dataset is about 96.21%, achieved by [XLNet](https://arxiv.org/abs/1906.08237v2). Taking into account the extremely limited resources we had to work with and the constant memory crashes, we believe the obtained results are quite satisfactory, with our model sitting at about the top 20 of recorded benchmark models.

---

## 5. Bonus Task (optional)

### 5.1. Annotating out-of-domain documents

For this task we chose a collection of 100 review comments gathered from Amazon. The comments cover a variety of products (clothing items, shoes, electronics, cosmetics, ...) in order to ensure diversity in the vocabulary used to describe them. The data contains almost equal proportions of negative and positive reviews.

For the annotation process we used the open-source annotation platform [INCEpTION](https://inception-project.github.io/). This tool comes from the same team that developed WebAnno and offers a bit more flexibility and some extra features. The annotation process was quite simple: we uploaded the plain text containing the unannotated corpus, then selected each document (comment) and chose the appropriate label (0 for negative, 1 for positive). Finally, we could download the annotated data in a variety of formats, and we chose the CoNLL format.

### 5.2 Conversion into dataset

In [None]:
def read_conll(file_path):
    with open(file_path, 'r') as file:
        data = file.read().split("\n\n")
        documents = []
        for document in data:
            sentences = document.split('\n')
            parsed_sentences = []
            for sentence in sentences:
                words = sentence.split(' ')
                parsed_sentences.append(words)
            documents.append(parsed_sentences)
    return documents

In [None]:
file_path1 = '/content/anotation 1.conll'
file_path2 = '/content/anotation 2.conll'

parsed_data1 = read_conll(file_path1)
parsed_data2= read_conll(file_path2)
total_parsed_data=parsed_data2+parsed_data1
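item_list = []
for sample in total_parsed_data:
    comment = []
    for i in sample:
        comment.append(i[0])  # the first field is the token
        if len(i) == 2:
            # the second field carries the document label after a hyphen
            tag = int(i[-1].split('-')[1])
    # NOTE: `tag` is not reset per sample, so a sample without a labelled line reuses the previous label
    item = {'label': tag, 'text': comment}
    item_list.append(item)
# Print an example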
sentence = " ".join(item_list[1]['text'])
print(sentence)
print(item_list[1]['label'])

Fits great , very flowy and comfortable to wear . Just long enough to cover the front pouch .
1
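
One thing to keep in mind with the conversion above is that the label variable carries over from one document to the next, so a document without any labelled line silently inherits the previous label. The following is a more defensive sketch under stated assumptions: documents are separated by blank lines, each line is a token optionally followed by a tag such as `B-1`, and the label is the integer after the hyphen. The tag format, the sample text, and the function name `parse_documents` are hypothetical and may not match the actual INCEpTION export.

```python
def parse_documents(raw):
    """Parse blank-line-separated documents into {'label', 'text'} records."""
    documents = []
    for block in raw.strip().split("\n\n"):
        tokens, label = [], None              # reset the label for every document
        for line in block.splitlines():
            fields = line.split()
            if not fields:
                continue                      # skip empty lines
            tokens.append(fields[0])
            if len(fields) == 2 and "-" in fields[1]:
                label = int(fields[1].split("-")[1])
        if tokens and label is not None:      # drop documents without a label
            documents.append({"label": label, "text": " ".join(tokens)})
    return documents

# Hypothetical sample: the middle document has no label and is skipped
sample = "Fits B-1\ngreat\n.\n\nNo\nlabel\nhere\n\nToo B-0\nsmall\n."
parsed = parse_documents(sample)
print(parsed)
```

Skipping unlabelled documents, rather than guessing their label, makes annotation gaps visible as a lower document count.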


In [None]:
for element in item_list:
  element['text'] = ' '.join(element['text'])

In [None]:
#make sure we don't select any empty lines
data = [d for d in item_list if d['text']]

In [None]:
#make sure labels are integers
for instance in data:
    instance['label'] = int(instance['label'])

In [None]:
data[0]

{'label': 1,
 'text': "This is a nice quality lightweight blouse , great for summer weather . It is lightweight , and slightly see through when wearing darker undergarments . It fits loosely so it's flattering on larger body types and it's pretty long so it looks good either tucked in or untucked . The material is soft and flowy and it's comfortable to wear ."}

In [None]:
from datasets import Dataset
test_dataset = Dataset.from_list(data)
dataset=test_dataset.shuffle()

In [None]:
dataset

Dataset({
    features: ['label', 'text'],
    num_rows: 105
})

In [None]:
test_dataset[0]

{'label': 1,
 'text': "This is a nice quality lightweight blouse , great for summer weather . It is lightweight , and slightly see through when wearing darker undergarments . It fits loosely so it's flattering on larger body types and it's pretty long so it looks good either tucked in or untucked . The material is soft and flowy and it's comfortable to wear ."}

In [None]:
# Tokenize the out-of-domain dataset
tokenized_dataset = test_dataset.map(tokenize_example, batched=True)

In [None]:
# Check if everything works
tokenized =tokenize_example(test_dataset[0])
decoded_tokens = tokenizer.convert_ids_to_tokens(tokenized['input_ids'])

print("\nOriginal Example:")
print(test_dataset[0])
print("\nTokenized Example:")
print(decoded_tokens)


Original Example:
{'label': 1, 'text': "This is a nice quality lightweight blouse , great for summer weather . It is lightweight , and slightly see through when wearing darker undergarments . It fits loosely so it's flattering on larger body types and it's pretty long so it looks good either tucked in or untucked . The material is soft and flowy and it's comfortable to wear ."}

Tokenized Example:
['[CLS]', 'this', 'is', 'a', 'nice', 'quality', 'lightweight', 'blouse', ',', 'great', 'for', 'summer', 'weather', '.', 'it', 'is', 'lightweight', ',', 'and', 'slightly', 'see', 'through', 'when', 'wearing', 'darker', 'under', '##gar', '##ments', '.', 'it', 'fits', 'loosely', 'so', 'it', "'", 's', 'flat', '##tering', 'on', 'larger', 'body', 'types', 'and', 'it', "'", 's', 'pretty', 'long', 'so', 'it', 'looks', 'good', 'either', 'tucked', 'in', 'or', 'un', '##tu', '##cked', '.', 'the', 'material', 'is', 'soft', 'and', 'flow', '##y', 'and', 'it', "'", 's', 'comfortable', 'to', 'wear', '.', '[S

### 5.3. Model evaluation on out-of-domain test set

In [None]:
# Your code to evaluate the model on the out-of-domain test set here
eval_results = trainer.evaluate(tokenized_dataset)
print(eval_results)

for metric, value in eval_results.items():
    print(f"{metric}: {value}")

{'eval_loss': 1.6044725179672241, 'eval_accuracy': 0.5333333333333333, 'eval_runtime': 1.867, 'eval_samples_per_second': 56.241, 'eval_steps_per_second': 2.143, 'epoch': 2.4}
eval_loss: 1.6044725179672241
eval_accuracy: 0.5333333333333333
eval_runtime: 1.867
eval_samples_per_second: 56.241
eval_steps_per_second: 2.143
epoch: 2.4


### 5.4 Bonus task results

Our final evaluation on the annotated dataset is quite disappointing: despite the model's strong in-domain performance, it reached only 53% accuracy, which is barely better than random guessing.

This kind of result is not really surprising, because fine-tuned models (like ours) are trained on specific datasets and capture the characteristics and patterns of that training data. When the model is evaluated on a new dataset from a different text domain, even if the task is the same, it may struggle to generalize due to differences in language style, vocabulary, topic distribution, or sentiment expressions.

This is similar to using BERT directly on our movie review dataset without any task-specific training, which is why fine-tuning is important when it comes to language models.

Another factor to take into account is the manual annotation itself: the web is full of all sorts of corpora and expressions, so the choice of which documents to collect for a given task can also play a big role in the final results.
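
The remark that 53% is barely better than random guessing can be checked with an exact binomial test: an accuracy of 0.5333 on 105 documents corresponds to 56 correct predictions. The sketch below uses only the standard library and assumes that a random guesser is right with probability 0.5 on each document; the function name is hypothetical.

```python
from math import comb

def two_sided_binomial_p(successes, trials):
    """Exact two-sided p-value for the null hypothesis of chance accuracy (p = 0.5)."""
    distance = abs(successes - trials / 2)
    # Count all outcomes at least as far from the expected value as the observed one
    extreme = sum(
        comb(trials, k)
        for k in range(trials + 1)
        if abs(k - trials / 2) >= distance
    )
    return extreme / 2 ** trials

n_docs = 105
n_correct = round(0.5333333333333333 * n_docs)   # accuracy reported in section 5.3
p_value = two_sided_binomial_p(n_correct, n_docs)
print(n_correct, "correct out of", n_docs)
print("p-value:", p_value)
```

The resulting p-value is far above the usual 0.05 threshold, so on this sample the model cannot be distinguished from a coin flip.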

### 5.5. Annotated data

In [None]:
# The annotated data will be included in the project directory