## Email A/B Test
In marketing, the A/B tests are often done via e-mail with two groups that have different subject lines and they measure the click-through rate to check users’ engagements. In this case, launching A/B testing is simply sending emails to groups and do an analysis on the responses after collecting the data back.

A typical example of web browsing data is text, and with conventional natural language processing methods, keyword-based classification models are used. BERT understands the context of the entire text, and thus is expected to be able to classify users' interests more precisely.

In the web page A/B testing case, users in two groups will see two different web pages and we can look at which page has better user engagements.

### Install Required Package

### Load and Check Movie Review Dataset

In [1]:
from datasets import load_dataset
import warnings
warnings.filterwarnings('ignore')
 
raw_datasets = load_dataset("imdb")
print(raw_datasets)


Found cached dataset imdb (C:/Users/user/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1)


  0%|          | 0/3 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})


### Select Samples for Train and Test

In [2]:
sample_train_val = raw_datasets['train'].shuffle().select(range(0,2000)).to_pandas()
sample_test = raw_datasets['test'].shuffle().select(range(0,500)).to_pandas()

### Import Libraries

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score, recall_score 
from sklearn.metrics import precision_score, f1_score
 
from transformers import TrainingArguments, Trainer
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import EarlyStoppingCallback
 
import torch
import numpy as np

### Define Pretrained Tokenizer and Model

In [4]:
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

### Preprocess Dataset

In [5]:
# Define a simple class inherited from torch dataset
class Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels=None):
        self.encodings = encodings
        self.labels = labels
 
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        if self.labels:
            item["labels"] = torch.tensor(self.labels[idx])
        return item
 
    def __len__(self):
        return len(self.encodings["input_ids"])
 
sample_x = list(sample_train_val["text"])
sample_y = list(sample_train_val["label"])
 
X_train, X_val, Y_train, Y_val = train_test_split(sample_x, sample_y, test_size=0.2)
X_train_tokenized = tokenizer(X_train, padding=True, truncation=True, max_length=512)
X_val_tokenized = tokenizer(X_val, padding=True, truncation=True, max_length=512)
 
input_train = Dataset(X_train_tokenized, Y_train)
input_val = Dataset(X_val_tokenized, Y_val)


### Define Evaluation Metrics

In [6]:
def compute_metrics(p):
    pred, labels = p
    pred = np.argmax(pred, axis=1)
    print(classification_report(labels, pred))
 
    accuracy = accuracy_score(y_true=labels, y_pred=pred)
    recall = recall_score(y_true=labels, y_pred=pred)
    precision = precision_score(y_true=labels, y_pred=pred)
    f1 = f1_score(y_true=labels, y_pred=pred)
 
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1_score}


### Fine-tune BERT

One of the most common uses of BERT is to download a model that has been pre-trained with a large amount of text and fine tuning it with a small amount of data.

In [7]:
# Define Training Arguments
args = TrainingArguments(
    output_dir="models",
    evaluation_strategy="steps",
    eval_steps=100,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=2,
    seed=0,
    load_best_model_at_end=True,
)
 
# Define Trainer
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=input_train,
    eval_dataset=input_val,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
 
# Fine-tune pre-trained BERT
trainer.train()

***** Running training *****
  Num examples = 1600
  Num Epochs = 2
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 400
  Number of trainable parameters = 109483778


Step,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
100,No log,0.401446,0.8275,0.977941,0.668342,
200,No log,0.305296,0.8975,0.938889,0.849246,
300,No log,0.359036,0.91,0.905473,0.914573,
400,No log,0.34449,0.91,0.909548,0.909548,


***** Running Evaluation *****
  Num examples = 400
  Batch size = 8
Trainer is attempting to log a value of "<function f1_score at 0x000001F0613FB670>" of type <class 'function'> for key "eval/f1" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.


              precision    recall  f1-score   support

           0       0.75      0.99      0.85       201
           1       0.98      0.67      0.79       199

    accuracy                           0.83       400
   macro avg       0.86      0.83      0.82       400
weighted avg       0.86      0.83      0.82       400



***** Running Evaluation *****
  Num examples = 400
  Batch size = 8
Trainer is attempting to log a value of "<function f1_score at 0x000001F0613FB670>" of type <class 'function'> for key "eval/f1" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.


              precision    recall  f1-score   support

           0       0.86      0.95      0.90       201
           1       0.94      0.85      0.89       199

    accuracy                           0.90       400
   macro avg       0.90      0.90      0.90       400
weighted avg       0.90      0.90      0.90       400



***** Running Evaluation *****
  Num examples = 400
  Batch size = 8
Trainer is attempting to log a value of "<function f1_score at 0x000001F0613FB670>" of type <class 'function'> for key "eval/f1" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.


              precision    recall  f1-score   support

           0       0.91      0.91      0.91       201
           1       0.91      0.91      0.91       199

    accuracy                           0.91       400
   macro avg       0.91      0.91      0.91       400
weighted avg       0.91      0.91      0.91       400



***** Running Evaluation *****
  Num examples = 400
  Batch size = 8
Trainer is attempting to log a value of "<function f1_score at 0x000001F0613FB670>" of type <class 'function'> for key "eval/f1" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.


              precision    recall  f1-score   support

           0       0.91      0.91      0.91       201
           1       0.91      0.91      0.91       199

    accuracy                           0.91       400
   macro avg       0.91      0.91      0.91       400
weighted avg       0.91      0.91      0.91       400





Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=400, training_loss=0.29400196075439455, metrics={'train_runtime': 11896.6906, 'train_samples_per_second': 0.269, 'train_steps_per_second': 0.034, 'total_flos': 841955377152000.0, 'train_loss': 0.29400196075439455, 'epoch': 2.0})

### Load Fine-tuned BERT and Run Prediction

In [9]:
# Load test data
X_test = list(sample_test["text"])
X_test_tokenized = tokenizer(X_test, padding=True, truncation=True, max_length=512)
 
# Create torch dataset
test_dataset = Dataset(X_test_tokenized)
 
# Load trained model
model_path = "models/checkpoint-100"
model = BertForSequenceClassification.from_pretrained(model_path, num_labels=2)
 
# Define test trainer
test_trainer = Trainer(model)
 
# Make prediction
raw_pred, _, _ = test_trainer.predict(test_dataset)
 
# Preprocess raw predictions
y_pred = np.argmax(raw_pred, axis=1)

OSError: models/checkpoint-100 is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo with `use_auth_token` or log in with `huggingface-cli login` and pass `use_auth_token=True`.