<a href="https://colab.research.google.com/github/Upeshjeengar/Fine-Tuning-BERT-/blob/main/Fine_tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-Tuning BERT for Phishing URL Identification


Fine-tuning involves adapting a pre-trained model to a particular use case through additional training.

Pre-trained models are typically developed via self-supervised learning on unlabeled text, which removes the need for large-scale labeled datasets. Fine-tuned models can then reuse the pre-trained representations, which significantly reduces training cost and improves performance compared to training from scratch.

BERT, short for Bidirectional Encoder Representations from Transformers, is a language model for natural language processing (NLP). Google introduced it in 2018 to improve contextual understanding of text across a broad range of tasks: it is pre-trained on unlabeled text and uses the words both before and after a position (bidirectional context) to predict the text at that position.


---


**Use cases**:
* Sentiment analysis
* Question answering for chatbots
* Helps predict text when writing an email
* Can quickly summarize long legal contracts
* Differentiates words that have multiple meanings based on the surrounding text




**BERT**
* Bidirectional: each token is encoded using context from both its left and its right. BERT uses the encoder segment of a transformer model.
* Applied in Google Docs, Gmail, Smart Compose, enhanced search, voice assistance, analyzing customer reviews, and so on.
* GLUE score of 80.4% and 93.3% accuracy on the SQuAD dataset.
* Pre-trained with two unsupervised tasks: masked language modeling (fill in the blanks) and next sentence prediction (does sentence B come after sentence A?).
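
The masked language modeling task mentioned above can be illustrated with a small, self-contained sketch. It is a simplification and not BERT's actual pre-training code: the function name `mask_tokens`, the whitespace tokenization, and the masking probability used in the demo are assumptions made for illustration. BERT itself selects about 15% of tokens and replaces only most of those with `[MASK]`; the sketch always uses `[MASK]`.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with [MASK]; return inputs and targets."""
    rng = random.Random(seed)  # seeded so the demo is reproducible
    masked, targets = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            targets.append(token)  # the model must recover this token
        else:
            masked.append(token)
            targets.append(None)   # this position does not contribute to the loss
    return masked, targets

# Hypothetical example sentence, split on whitespace for simplicity
tokens = "the quick brown fox jumps over the lazy dog".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(targets)
```

During pre-training, the model sees `masked` as input and is trained to predict the non-`None` entries of `targets`, using context on both sides of each blank.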

**GPT**
* Autoregressive and unidirectional: text is processed in one direction (left to right). GPT uses the decoder segment of a transformer model.
* Applied in application building, generating ML code, websites, writing articles, podcasts, creating legal documents, and so on.
* 64.3% accuracy on the TriviaQA benchmark and 76.2% accuracy on LAMBADA, with zero-shot learning.
* Straightforward text generation using autoregressive language modeling.
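
One way to picture the bidirectional versus unidirectional distinction is through attention masks. The sketch below is illustrative only: the helper names `bidirectional_mask` and `causal_mask` are hypothetical, and real models build these masks as tensors inside the attention layers. A `1` at row `i`, column `j` means position `i` may attend to position `j`.

```python
def bidirectional_mask(n):
    """Encoder style (BERT): every position may attend to every position."""
    return [[1] * n for _ in range(n)]

def causal_mask(n):
    """Decoder style (GPT): position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

print("bidirectional:")
for row in bidirectional_mask(4):
    print(row)

print("causal:")
for row in causal_mask(4):
    print(row)
```

The lower-triangular causal mask is what makes left-to-right generation possible, while the full mask lets an encoder use context on both sides of each token.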


BERT uses an encoder that is very similar to the original transformer encoder, so BERT can be described as a transformer-based model.



In [None]:
# all of the packages below are from Hugging Face
!pip install datasets
!pip install transformers
!pip install evaluate

In [None]:
from datasets import DatasetDict, Dataset, load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
import evaluate
import numpy as np
from transformers import DataCollatorWithPadding

**About dataset features**
- text = website URL  
- label = phishing site indicator (1=phishing, 0=not phishing)

In [None]:
# https://huggingface.co/datasets/shawhin/phishing-site-classification
dataset_dict = load_dataset("shawhin/phishing-site-classification")

**About model**  
**BERT base model (uncased)**  
A model pretrained on English text using a masked language modeling (MLM) objective. It was introduced in [this paper](https://arxiv.org/abs/1810.04805) and first released in [this repository](https://github.com/google-research/bert). The model is uncased: it makes no distinction between `english` and `English`.

[Read more](https://huggingface.co/google-bert/bert-base-uncased)


**Tokenizer**  
Tokenizers are one of the core components of the NLP pipeline. They serve one purpose: to translate text into data that the model can process. Models can only process numbers, so tokenizers convert text inputs into numerical IDs.
* Use a model-specific tokenizer class (for example, `BertTokenizer`) when you know exactly which model you are working with.
* Use `AutoTokenizer` for ease of use and flexibility, especially when working with multiple models or for general-purpose tasks.
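
To make the text-to-numbers step concrete, here is a minimal sketch of WordPiece-style tokenization, the subword scheme BERT uses. It is a toy version under stated assumptions: the vocabulary `toy_vocab`, the ID mapping, and the function name `wordpiece_tokenize` are hypothetical, and the real tokenizer also lowercases, splits on punctuation, and adds special tokens such as `[CLS]` and `[SEP]`.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1  # try a shorter candidate
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary; the real bert-base-uncased vocabulary is far larger
toy_vocab = {"phi", "##shing", "##sh", "##ing", "[UNK]"}
token_to_id = {tok: i for i, tok in enumerate(sorted(toy_vocab))}

pieces = wordpiece_tokenize("phishing", toy_vocab)
ids = [token_to_id[p] for p in pieces]
print(pieces, ids)
```

The model never sees the strings themselves, only the integer IDs, which is why the tokenizer and the model checkpoint must share the same vocabulary.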


**AutoModelForSequenceClassification** loads the pre-trained architecture that matches a given checkpoint and adds a sequence classification head on top of it. It handles model selection and configuration from the checkpoint name, so the same code works across different base models.





In [None]:
# define pre-trained model path
model_path = "google-bert/bert-base-uncased"

# load model tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)

# load model with binary classification head
id2label = {0: "Safe", 1: "Not Safe"}
label2id = {"Safe": 0, "Not Safe": 1}
model = AutoModelForSequenceClassification.from_pretrained(model_path,
                                                           num_labels=2,
                                                           id2label=id2label,
                                                           label2id=label2id,)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# freeze all base model parameters
for name, param in model.base_model.named_parameters():
    param.requires_grad = False

# unfreeze base model pooling layers
for name, param in model.base_model.named_parameters():
    if "pooler" in name:
        param.requires_grad = True
        print(name,param.shape)

pooler.dense.weight torch.Size([768, 768])
pooler.dense.bias torch.Size([768])
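
As a worked check of how small the trainable part of the model is after freezing, the sketch below counts trainable parameters from their shapes. The pooler shapes are taken from the output above; the classifier shapes are an assumption based on a linear head that maps the 768-dimensional pooled vector to 2 labels (the head is not part of `base_model`, so it stays trainable).

```python
from math import prod

# Shapes of the parameters that remain trainable after the freeze step.
# Pooler shapes come from the printed output; classifier shapes are assumed
# for a linear head from 768 features to 2 labels.
trainable_shapes = {
    "pooler.dense.weight": (768, 768),
    "pooler.dense.bias": (768,),
    "classifier.weight": (2, 768),
    "classifier.bias": (2,),
}

counts = {name: prod(shape) for name, shape in trainable_shapes.items()}
total_trainable = sum(counts.values())

for name, count in counts.items():
    print(f"{name}: {count:,}")
print(f"total trainable: {total_trainable:,}")  # 592,130
```

On the actual model, the same figure can be checked with `sum(p.numel() for p in model.parameters() if p.requires_grad)`.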


In [None]:
# define text preprocessing
def preprocess_function(examples):
    # return tokenized text with truncation: input that exceeds the model's
    # maximum sequence length is cut to fit
    return tokenizer(examples["text"], truncation=True)

# preprocess all datasets
tokenized_data = dataset_dict.map(preprocess_function, batched=True)

Map:   0%|          | 0/2100 [00:00<?, ? examples/s]

Map:   0%|          | 0/450 [00:00<?, ? examples/s]

Map:   0%|          | 0/450 [00:00<?, ? examples/s]

In [None]:
# DataCollatorWithPadding is a utility that builds batches of data with padding.
# Padding adds extra tokens to sequences so that all sequences in a batch have the same length.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
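
The idea behind dynamic padding can be sketched in plain Python. This is not the library's implementation: the helper `pad_batch` is hypothetical, and the token IDs in the example are made up apart from `101` and `102`, which are the `[CLS]` and `[SEP]` IDs in the `bert-base-uncased` vocabulary, and `0`, which is its `[PAD]` ID.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest length in this batch only."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two hypothetical tokenized URLs of different lengths
batch = pad_batch([[101, 8224, 102], [101, 8224, 1012, 4012, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than to a fixed global length keeps batches of short URLs small, which saves computation.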

In [None]:
# load metrics
accuracy = evaluate.load("accuracy")
auc_score = evaluate.load("roc_auc")

def compute_metrics(eval_pred):
    # get predictions
    predictions, labels = eval_pred

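    # apply a numerically stable softmax to get probabilities
    # (subtracting the row maximum avoids overflow in np.exp)
    shifted = predictions - predictions.max(axis=-1, keepdims=True)
    probabilities = np.exp(shifted) / np.exp(shifted).sum(-1, keepdims=True)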
    # use probabilities of the positive class for ROC AUC
    positive_class_probs = probabilities[:, 1]
    # compute auc
    auc = np.round(auc_score.compute(prediction_scores=positive_class_probs,
                                     references=labels)['roc_auc'],3)

    # predict most probable class
    predicted_classes = np.argmax(predictions, axis=1)
    # compute accuracy
    acc = np.round(accuracy.compute(predictions=predicted_classes,
                                     references=labels)['accuracy'],3)

    return {"Accuracy": acc, "AUC": auc}
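
For intuition about the AUC metric, ROC AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it that way on a handful of made-up scores; the function name `pairwise_auc` and the data are hypothetical, and the quadratic loop is for illustration only, not a replacement for the `roc_auc` metric loaded above.

```python
def pairwise_auc(scores, labels):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

# Hypothetical phishing probabilities and true labels (1 = phishing)
scores = [0.9, 0.8, 0.35, 0.6, 0.2]
labels = [1, 1, 1, 0, 0]
auc = pairwise_auc(scores, labels)
print(round(auc, 3))  # 5 of 6 pairs are ordered correctly -> 0.833
```

Unlike accuracy, this measure does not depend on a decision threshold, which is why the notebook reports both.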


In [None]:
# hyperparameters
lr = 2e-4
batch_size = 8
num_epochs = 10

training_args = TrainingArguments(
    output_dir="bert-phishing-classifier_teacher",  # Directory to save model and logs
    learning_rate=lr,  # Learning rate for optimizer
    per_device_train_batch_size=batch_size,  # Batch size for training per device
    per_device_eval_batch_size=batch_size,  # Batch size for evaluation per device
    num_train_epochs=num_epochs,  # Total training epochs
    logging_strategy="epoch",  # Log metrics at the end of each epoch
    eval_strategy="epoch",  # Evaluate model at the end of each epoch
    save_strategy="epoch",  # Save model checkpoint at the end of each epoch
    load_best_model_at_end=True,  # Load the best model after training
)
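
As a quick sanity check on these hyperparameters, the number of optimizer steps they imply can be worked out by hand. The sketch assumes 2,100 training examples (the count shown in the first `Map` progress line), a single device, no gradient accumulation, and that the last partial batch of each epoch is kept.

```python
import math

train_examples = 2100  # assumption: size of the train split from the Map output
batch_size = 8
num_epochs = 10

# The last batch of an epoch may be smaller than batch_size, hence the ceiling
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch)  # 263
print(total_steps)      # 2630
```

The total agrees with the `global_step=2630` value that the training run reports.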

In [None]:
# Initialize the Trainer class
trainer = Trainer(
    model=model,  # Pass the model to be trained
    args=training_args,  # Specify training arguments (like learning rate, epochs, etc.)
    train_dataset=tokenized_data["train"],  # Provide the training dataset
    eval_dataset=tokenized_data["test"],  # Provide the evaluation (test) dataset
    tokenizer=tokenizer,  # Include the tokenizer for preprocessing
    data_collator=data_collator,  # Define how to collate data into batches
    compute_metrics=compute_metrics,  # Function to compute evaluation metrics
)

trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy | Auc |
|---|---|---|---|---|
| 1 | 0.5056 | 0.379588 | 0.824 | 0.913 |
| 2 | 0.4078 | 0.337738 | 0.842 | 0.931 |
| 3 | 0.3538 | 0.313385 | 0.856 | 0.939 |
| 4 | 0.3573 | 0.348363 | 0.853 | 0.946 |
| 5 | 0.3518 | 0.335028 | 0.862 | 0.948 |
| 6 | 0.3496 | 0.288773 | 0.871 | 0.951 |
| 7 | 0.3349 | 0.288813 | 0.878 | 0.95 |
| 8 | 0.3107 | 0.28792 | 0.867 | 0.95 |
| 9 | 0.3145 | 0.283944 | 0.869 | 0.951 |
| 10 | 0.3135 | 0.288296 | 0.867 | 0.951 |


TrainOutput(global_step=2630, training_loss=0.35994325935160704, metrics={'train_runtime': 144.6137, 'train_samples_per_second': 145.214, 'train_steps_per_second': 18.186, 'total_flos': 706603239165360.0, 'train_loss': 0.35994325935160704, 'epoch': 10.0})
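
Because `load_best_model_at_end=True` is set without a `metric_for_best_model`, the checkpoint that gets reloaded should be the one with the lowest evaluation loss, assuming the `Trainer` default of comparing on loss. The sketch below finds that epoch from the validation losses logged above; it only reads the logged numbers and does not inspect the saved checkpoints.

```python
# (epoch, validation loss) pairs copied from the training log above
val_losses = [
    (1, 0.379588), (2, 0.337738), (3, 0.313385), (4, 0.348363),
    (5, 0.335028), (6, 0.288773), (7, 0.288813), (8, 0.28792),
    (9, 0.283944), (10, 0.288296),
]

# Lower loss is better, so take the minimum over the second element
best_epoch, best_loss = min(val_losses, key=lambda pair: pair[1])
print(best_epoch, best_loss)  # 9 0.283944
```

Note that the lowest-loss epoch (9) is not the highest-accuracy epoch (7), so the choice of selection metric changes which checkpoint is kept.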

In [None]:
# apply model to the validation split, which was not used for training or per-epoch evaluation
predictions = trainer.predict(tokenized_data["validation"])

# Extract the logits and labels from the predictions object
logits = predictions.predictions
labels = predictions.label_ids

# Use your compute_metrics function
metrics = compute_metrics((logits, labels))
print(metrics)

{'Accuracy': 0.891, 'AUC': 0.945}


In [None]:
import torch
# Test URLs
test_urls = ["google.com", "google.ghfsbc.live/amazon"]

# Preprocess the test URLs
tokenized_test_urls = tokenizer(test_urls, truncation=True, padding=True, return_tensors="pt")

# Move the input tensors to the same device as the model
device = model.device
tokenized_test_urls = {key: value.to(device) for key, value in tokenized_test_urls.items()}

# Predict the output for the test URLs
with torch.no_grad():
    outputs = model(**tokenized_test_urls)

# Get the logits (raw prediction scores)
logits = outputs.logits

# Apply softmax to get the predicted probabilities
probabilities = torch.softmax(logits, dim=-1)

# Get the predicted classes
predicted_classes = torch.argmax(probabilities, dim=-1)

# Convert predictions from a tensor to a NumPy array
predicted_classes = predicted_classes.cpu().numpy()

# Mapping predicted labels to their textual representation
predicted_labels = [id2label[class_idx] for class_idx in predicted_classes]

# Output the results
for url, label, prob in zip(test_urls, predicted_labels, probabilities):
    print(f"URL: {url}")
    print(f"Predicted Label: {label}")
    print(f"Class Probabilities: Safe={prob[0]:.3f}, Not Safe={prob[1]:.3f}\n")

URL: google.com
Predicted Label: Safe
Class Probabilities: Safe=0.893, Not Safe=0.107

URL: google.ghfsbc.live/amazon
Predicted Label: Not Safe
Class Probabilities: Safe=0.421, Not Safe=0.579
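
The second URL is flagged with a probability only slightly above 0.5, which shows how much the label depends on the decision threshold. The sketch below applies an explicit threshold to the `Not Safe` probabilities printed above instead of taking the argmax. The helper `label_with_threshold` and the alternative threshold of 0.7 are hypothetical choices for illustration; a suitable value would need to be tuned on held-out data.

```python
def label_with_threshold(p_not_safe, threshold=0.5):
    """Label a URL as Not Safe when its phishing probability meets the threshold."""
    return "Not Safe" if p_not_safe >= threshold else "Safe"

# Not Safe probabilities copied from the output above
examples = {"google.com": 0.107, "google.ghfsbc.live/amazon": 0.579}

for url, p in examples.items():
    default_label = label_with_threshold(p)                # same as argmax for 2 classes
    strict_label = label_with_threshold(p, threshold=0.7)  # hypothetical stricter cut-off
    print(f"{url}: default={default_label}, threshold 0.7={strict_label}")
```

Raising the threshold reduces false alarms at the cost of missing borderline phishing URLs, while lowering it does the opposite.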



# Saving the model to Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Save the model to your Google Drive
trainer.save_model('/content/drive/MyDrive/bert-phishing-classifier_teacher')

Mounted at /content/drive


In [None]:
# Load the saved model from Google Drive when needed
from transformers import AutoModelForSequenceClassification, AutoTokenizer

loaded_model = AutoModelForSequenceClassification.from_pretrained('/content/drive/MyDrive/bert-phishing-classifier_teacher')
loaded_tokenizer = AutoTokenizer.from_pretrained('/content/drive/MyDrive/bert-phishing-classifier_teacher')

# Now you can use loaded_model and loaded_tokenizer for inference or further training