### Adapting a pre-trained model to a specific task (fine-tuning)

Pre-training is self-supervised, whereas fine-tuning is supervised. For example, BERT was pre-trained on a corpus of roughly 3 billion words. Foundation models from OpenAI, Mistral, Anthropic, Meta, and others are trained with specialized, large-scale compute resources that are not available to ordinary users. Fine-tuning needs far less data: a fine-tuning dataset such as MRPC has only about 3k labeled examples.

#### Let's fine-tune BERT for finding phishing URLs

In [4]:
%pip install datasets
%pip install evaluate
%pip install torch
%pip install tensorflow

Note: you may need to restart the kernel to use updated packages.


In [5]:
from datasets import DatasetDict, Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer

import evaluate
import numpy as np
from transformers import DataCollatorWithPadding




In [6]:
from datasets import load_dataset
df = load_dataset('shawhin/phishing-site-classification')

In [7]:
#pre-trained model
model_path = 'google-bert/bert-base-uncased'

#tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)

id2label = {0: 'safe', 1: 'not safe'} #not safe is the positive class
label2id = {'safe': 0, 'not safe': 1}

model = AutoModelForSequenceClassification.from_pretrained(model_path,
                                                           num_labels=2,
                                                           id2label=id2label,
                                                           label2id=label2id)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
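The two label dictionaries passed to the model have to be exact inverses of each other. As a minimal sketch, one can be derived from the other, and the class index with the largest logit can be mapped back to a readable label. The helper `logits_to_label` is a hypothetical name used only for illustration.

```python
# Same mapping as in the cell above; 'not safe' is the positive class.
id2label = {0: "safe", 1: "not safe"}

# Derive the inverse mapping instead of typing it by hand.
label2id = {label: idx for idx, label in id2label.items()}

def logits_to_label(logits):
    """Return the label whose logit is largest (hypothetical helper)."""
    best_idx = max(range(len(logits)), key=logits.__getitem__)
    return id2label[best_idx]

print(label2id)
print(logits_to_label([0.2, 1.3]))
```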


In [8]:
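# Option: freeze all base model parameters (left disabled in this run)
#for name, param in model.base_model.named_parameters():
#    param.requires_grad = False

# Keep the final pooler layer trainable. Note: because the freeze above is
# commented out, every parameter is already trainable and this loop changes nothing.
for name, param in model.base_model.named_parameters():
    if 'pooler' in name:
        param.requires_grad = True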

In [9]:
#define text preprocessing
def preprocess_function(examples):
    return tokenizer(examples['text'], truncation=True) #truncates to the model's maximum length (512 tokens for BERT)

tokenized_data = df.map(preprocess_function, batched=True)

#### Pad every example in a batch to the same length

In [10]:
#create data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
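To make the idea of dynamic padding concrete, here is a minimal pure-Python sketch of what a padding collator does conceptually: each batch is padded only to the length of its own longest sequence, and an attention mask marks which positions are real tokens. The function `pad_batch`, the token ids, and the assumption that the padding id is 0 are illustrative; this is not the library's implementation.

```python
def pad_batch(batch, pad_id=0):
    """Pad each token-id list to the longest sequence in this batch."""
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token ids for three URLs of different lengths
batch = [[101, 7592, 102], [101, 102], [101, 2023, 2003, 102]]
padded = pad_batch(batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```

Padding per batch rather than to a fixed global length keeps batches of short URLs small, which saves computation.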

#### define evaluation metrics

In [11]:
accuracy = evaluate.load('accuracy')
auc_score = evaluate.load('roc_auc')

def compute_metrics(eval_pred):
    #eval_pred holds the raw logits and the integer labels (0 = safe, 1 = not safe)
    predictions, labels = eval_pred

    #apply softmax to turn unbounded logits into probabilities in [0, 1];
    #subtracting the row-wise max avoids overflow without changing the result
    shifted = predictions - predictions.max(axis=-1, keepdims=True)
    probabilities = np.exp(shifted) / np.exp(shifted).sum(-1, keepdims=True)

    #probabilities of the positive (not safe) class for ROC AUC
    positive_class_probs = probabilities[:,1]

    #references are the ground-truth labels
    auc = np.round(auc_score.compute(prediction_scores=positive_class_probs,
                                     references=labels)['roc_auc'],3)

    #predict most probable class
    predicted_classes = np.argmax(predictions, axis=1)
    acc = np.round(accuracy.compute(predictions=predicted_classes,
                   references=labels)['accuracy'],3)

    return {'accuracy': acc, 'AUC': auc}
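The two metrics measure different things: accuracy depends on which class gets the largest logit, while ROC AUC depends only on how the positive-class scores rank the examples. The following pure-Python sketch reproduces both on toy data. The logits are made-up values, and `softmax` and `roc_auc` are hypothetical helpers written for illustration, not the `evaluate` implementations.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def roc_auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy logits (hypothetical), one row per URL: [safe, not safe]
toy_logits = [[2.0, 0.0], [1.0, 0.5], [0.0, 3.0], [1.5, 1.2]]
toy_labels = [0, 0, 1, 1]

probs = [softmax(row) for row in toy_logits]
positive_probs = [p[1] for p in probs]
predicted = [max(range(2), key=lambda i: row[i]) for row in toy_logits]
accuracy = sum(int(p == y) for p, y in zip(predicted, toy_labels)) / len(toy_labels)
auc = roc_auc(positive_probs, toy_labels)
print(accuracy, auc)
```

In this toy batch the last URL is misclassified by argmax, yet both positives still score higher than both negatives, so the ranking is perfect while accuracy is not.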

In [12]:
%pip install "transformers[torch]"

Note: you may need to restart the kernel to use updated packages.


In [13]:
%pip install "accelerate>=0.26.0"

Note: you may need to restart the kernel to use updated packages.


In [14]:
#training hyperparameters
lr = 2e-4
batch_size = 8
num_epochs = 10

training_args = TrainingArguments(
    output_dir='bert-phishing-classifier_teacher',
    learning_rate=lr,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    logging_strategy='epoch',
    eval_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True,
)
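#optional: cosine schedule with warmup (pass num_epochs, otherwise the setter resets it to its default)
#training_args.set_lr_scheduler(name="cosine", num_epochs=num_epochs, warmup_ratio=0.05)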
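The commented-out scheduler call above refers to a cosine schedule with warmup. As a rough sketch of what such a schedule looks like, the code below computes the number of optimizer steps and the learning rate at a given step. The training-set size of 2000 is a hypothetical value, the helper names are made up for illustration, and the exact rounding used by the library may differ.

```python
import math

def total_training_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps on one device, assuming no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs

def cosine_lr(step, total_steps, base_lr, warmup_ratio=0.05):
    """Linear warmup to base_lr, then half-cosine decay to zero."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Hypothetical training-set size; lr, batch size and epochs match the cell above
steps = total_training_steps(num_examples=2000, batch_size=8, num_epochs=10)
for step in (0, 125, steps // 2, steps):
    print(step, cosine_lr(step, steps, base_lr=2e-4))
```

Warmup starts the run at a small learning rate, which can help when pre-trained weights are being updated with a comparatively large peak rate.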

In [15]:
%pip install mlflow

Note: you may need to restart the kernel to use updated packages.


#### Fine-tune the model

#### Evaluate on the validation split (not seen during training)

In [16]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_data['train'],
    eval_dataset=tokenized_data['test'],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)


trainer.train()
predictions = trainer.predict(tokenized_data['validation'])
logits = predictions.predictions
labels = predictions.label_ids

metrics = compute_metrics((logits, labels))
print(metrics)



| Epoch | Training Loss | Validation Loss | Accuracy | AUC |
|---|---|---|---|---|
| 1 | 0.7179 | 0.701961 | 0.509 | 0.754 |
| 2 | 0.7161 | 0.722116 | 0.491 | 0.748 |
| 3 | 0.7134 | 0.719389 | 0.509 | 0.73 |
| 4 | 0.7142 | 0.702855 | 0.491 | 0.738 |
| 5 | 0.7013 | 0.701998 | 0.491 | 0.276 |
| 6 | 0.7022 | 0.699196 | 0.491 | 0.461 |
| 7 | 0.705 | 0.697413 | 0.491 | 0.365 |
| 8 | 0.7007 | 0.693127 | 0.509 | 0.739 |
| 9 | 0.6987 | 0.693327 | 0.509 | 0.705 |
| 10 | 0.6966 | 0.693925 | 0.491 | 0.696 |








{'accuracy': 0.5, 'AUC': 0.692}
