In [None]:
# Install all required packages
%pip install -r requirements.txt

### Load Tokenizer and Dataset

In this cell, we are performing two main actions:

1. **Loading the Tokenizer:**
   - We are using the `AutoTokenizer` class from the `transformers` library to load a pre-trained tokenizer for the `distilbert-base-uncased` model. This tokenizer is responsible for converting text into tokens (the basic units the model understands), which will be necessary for processing the text data in our dataset.

   - `AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")` loads the tokenizer associated with the pre-trained `DistilBERT` model. This model is a lighter version of BERT that maintains much of the original performance.

2. **Loading the Dataset:**
   - We use the `load_dataset` function from the `datasets` library to load CSV files containing training and test data. 
   - `data_files` is a dictionary where we specify the file paths to the training and testing datasets (`train.csv` and `test.csv`).
   - `load_dataset("csv", data_files=data_files)` loads the CSV files into a `DatasetDict` with a `train` and a `test` split, which will be used for training and evaluation.
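
For orientation, the sketch below shows the kind of file layout the rest of the notebook appears to rely on. The `text` column is the one read during preprocessing; the `label` column name and the two sample rows are assumptions made for illustration and are not taken from the actual files. It parses an in-memory CSV with the standard library only.

```python
import csv
import io

# Hypothetical file contents: a header row followed by one example per line.
# "text" is the column tokenized later; "label" (0 = Safe, 1 = Fraud) is an assumed name.
sample_csv = io.StringIO(
    "text,label\n"
    '"Your invoice is attached, see you on Monday",0\n'
    '"Send money to this account to unlock your prize",1\n'
)

rows = list(csv.DictReader(sample_csv))
columns = list(rows[0].keys())
labels = [int(row["label"]) for row in rows]

print(columns)  # ['text', 'label']
print(labels)   # [0, 1]
```

Texts that contain commas need to be quoted, as in the first row, so that they stay in a single column.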


In [None]:
# Load tokenizer and dataset
from transformers import AutoTokenizer
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")

data_files = {
	"train": "./datasets/train.csv",
	"test": "./datasets/test.csv"
}

dataset = load_dataset("csv", data_files=data_files)

### Preprocess the Data

In this cell, we define and apply a preprocessing function to the dataset:

1. **Defining the Preprocessing Function:**
   - The function `preprocess_function` takes as input a batch of examples from the dataset (specifically the `"text"` column) and tokenizes them using the previously loaded tokenizer.
   - `tokenizer(examples["text"], truncation=True, padding=True)`:
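     - `truncation=True`: Ensures that texts longer than the model's maximum input length (512 tokens for DistilBERT) are truncated.
     - `padding=True`: Pads shorter texts to the length of the longest text in the batch passed to the function. With `batched=True`, `map` passes 1,000 examples at a time by default, so this is the longest text in each group of 1,000, not in each training batch. The data collator created in a later cell pads every training batch again, so padding at this stage is not strictly required.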
   
2. **Applying the Preprocessing Function:**
   - `dataset.map(preprocess_function, batched=True)` applies the `preprocess_function` to the entire dataset. The `batched=True` argument allows the function to be applied to batches of examples rather than individual examples, making the process more efficient.
   - The result is a tokenized version of the dataset: the tokenizer's outputs (`input_ids` and `attention_mask`) are added as new columns alongside the original `"text"` column, ready to be fed into the model.
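
To make truncation and padding concrete, here is a minimal pure-Python sketch of what happens to a batch of token-ID lists. The helper `truncate_and_pad`, the token IDs, and `max_length=5` are illustrative assumptions. The real tokenizer is more careful: it keeps the closing special token when truncating, whereas this sketch simply slices.

```python
def truncate_and_pad(batch, max_length, pad_id=0):
    """Cut each sequence to max_length, then pad all to the longest one left."""
    truncated = [seq[:max_length] for seq in batch]
    longest = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token IDs: one short and one long sequence.
batch = [[101, 7592, 102], [101, 2023, 2003, 1037, 2146, 3793, 102]]
encoded = truncate_and_pad(batch, max_length=5)

print(encoded["input_ids"])       # [[101, 7592, 102, 0, 0], [101, 2023, 2003, 1037, 2146]]
print(encoded["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```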


In [None]:
# Preprocess the data

def preprocess_function(examples):
	tokenized = tokenizer(examples["text"], truncation=True, padding=True)
	return tokenized

tokenized_dataset = dataset.map(preprocess_function, batched=True)

### Create a Data Collator

In this cell, we are setting up a **data collator** to help with batching the tokenized data during training:

1. **What is a Data Collator?**
   - A **data collator** is responsible for taking a batch of samples and combining them into a single batch of data that can be passed into the model. It ensures that all sequences in the batch are padded to the same length, which is crucial for efficient batch processing.

2. **Using `DataCollatorWithPadding`:**
   - `DataCollatorWithPadding(tokenizer=tokenizer)` is a collator from the `transformers` library that automatically handles padding. It pads all the sequences in the batch to the length of the longest sequence, based on the tokenizer you provide.
   - This ensures that the model receives consistent input lengths during training or evaluation.
   
   - Since we passed the `tokenizer` as an argument, the collator uses the same tokenizer that was used during preprocessing, ensuring that padding is done according to the tokenizer's settings.

3. **How This Helps:**
   - This collator will be used during data loading to create batches of tokenized data, with padding handled dynamically, so you don’t have to worry about it when training the model.
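
The benefit of dynamic padding can be sketched without any library. The example below pads each batch only to its own longest sequence and compares the total number of token slots with padding everything to the longest sequence in the dataset. The helper `pad_batch`, the sequence lengths, and the batch size of 2 are assumptions chosen for illustration; this is not the collator's actual implementation.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence in the batch to the longest sequence in that batch."""
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]

# Hypothetical token-ID sequences of lengths 3, 4, 12 and 5.
sequences = [[1] * 3, [1] * 4, [1] * 12, [1] * 5]
batch_size = 2

batches = [sequences[i:i + batch_size] for i in range(0, len(sequences), batch_size)]
padded_lengths = [len(pad_batch(batch)[0]) for batch in batches]
print(padded_lengths)  # [4, 12]

# Token slots processed with per-batch padding vs. padding to the global maximum.
dynamic_total = sum(len(row) for batch in batches for row in pad_batch(batch))
global_total = len(sequences) * max(len(seq) for seq in sequences)
print(dynamic_total, global_total)  # 32 48
```

The first batch is padded to length 4 instead of 12, which is where the savings come from.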


In [None]:
# Create a data collator
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

### Define Metrics

In this cell, we are defining a function to compute evaluation metrics during model evaluation or testing:

1. **What is `compute_metrics`?**
   - The `compute_metrics` function is used to evaluate the performance of the model by calculating metrics (e.g., accuracy) during evaluation.
   - It takes as input `eval_pred`, which is a tuple containing the model's predictions and the true labels.

2. **Loading the Accuracy Metric:**
   - `accuracy = evaluate.load("accuracy")` loads the accuracy metric using the `evaluate` library, which provides various evaluation metrics for machine learning tasks. In this case, we are specifically interested in accuracy.
   
3. **Processing the Predictions:**
   - The model's raw output (`predictions`) is an array of logits, one unnormalized score per class, not probabilities. `np.argmax(predictions, axis=1)` selects the class with the highest logit for each example; because softmax preserves ordering, this is the same class that would have the highest probability.
   - This converts the model's raw scores into discrete class labels.

4. **Computing the Metric:**
   - `accuracy.compute(predictions=predictions, references=labels)` computes the accuracy by comparing the predicted labels (`predictions`) with the true labels (`labels`). It returns the accuracy score, the fraction of correct predictions as a value between 0 and 1.
   
   - The result will be a dictionary containing the accuracy score, typically in the form `{"accuracy": value}`.
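
The computation can be traced by hand on a tiny example. The sketch below uses plain Python in place of `np.argmax` and `evaluate`, with made-up logits and labels, and also checks that taking the argmax before or after softmax gives the same classes. The helper names `softmax` and `argmax` are local to this example.

```python
import math

def softmax(row):
    """Convert a list of logits into probabilities that sum to 1."""
    exps = [math.exp(x - max(row)) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(row):
    """Index of the largest value in the list."""
    return max(range(len(row)), key=row.__getitem__)

# Hypothetical model outputs for four examples, two classes each.
logits = [[2.1, -0.3], [-1.0, 0.5], [0.2, 0.1], [-0.7, 1.9]]
labels = [0, 1, 1, 1]

predictions = [argmax(row) for row in logits]
same_after_softmax = predictions == [argmax(softmax(row)) for row in logits]
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

print(predictions)             # [0, 1, 0, 1]
print(same_after_softmax)      # True
print({"accuracy": accuracy})  # {'accuracy': 0.75}
```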


In [None]:
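# Define metrics
import evaluate
import numpy as np

# Load the metric once, not on every evaluation call
accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
	# eval_pred unpacks into raw logits and the true labels
	predictions, labels = eval_pred
	# Pick the class with the highest logit for each example
	predictions = np.argmax(predictions, axis=1)
	return accuracy.compute(predictions=predictions, references=labels)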

### Load Model and Set Up Trainer

In this cell, we are loading a pre-trained model, defining training arguments, and setting up the `Trainer` to handle the training and evaluation of the model:

1. **Loading the Pre-trained Model:**
   - We use `AutoModelForSequenceClassification.from_pretrained` to load the pre-trained `distilbert-base-uncased` model, this time set up for a **binary classification** task (`num_labels=2`). The pre-trained weights are reused for the encoder, while the classification head on top is newly initialized and learned during fine-tuning.
   - The `id2label` and `label2id` mappings are provided to map between numeric labels and string labels:
     - `id2label={0: "Safe", 1: "Fraud"}`: Maps numeric labels (0 and 1) to the string labels ("Safe" and "Fraud").
     - `label2id={"Safe": 0, "Fraud": 1}`: Maps the string labels back to numeric IDs. Both mappings are stored in the model's configuration and saved with the model.

2. **Setting Training Arguments:**
   - The `TrainingArguments` class is used to configure the settings for the training process:
     - `output_dir="./out/detext_dataset"`: Directory where model checkpoints and training results will be saved.
     - `learning_rate=2e-5`: The learning rate for model optimization.
     - `per_device_train_batch_size=16`: The batch size used for training on each device.
     - `per_device_eval_batch_size=16`: The batch size used for evaluation.
     - `num_train_epochs=2`: Number of times the model will go through the entire dataset during training.
     - `weight_decay=0.01`: A regularization technique to prevent overfitting.
     - `eval_strategy="epoch"`: Evaluation is done at the end of each epoch.
     - `save_strategy="epoch"`: Model checkpoints will be saved at the end of each epoch.
     - `load_best_model_at_end=True`: The best checkpoint is reloaded at the end of training. Because `metric_for_best_model` is not set, "best" means the lowest evaluation loss, not the highest accuracy.
     - `report_to="none"`: Disables logging to external services (like TensorBoard or WandB).

3. **Setting Up the Trainer:**
   - `Trainer` is a high-level API in Hugging Face’s `transformers` library that simplifies training and evaluation:
     - `model=model`: The pre-trained model we loaded earlier.
     - `args=training_args`: The training arguments defined above.
     - `train_dataset=tokenized_dataset["train"]`: The tokenized training dataset.
     - `eval_dataset=tokenized_dataset["test"]`: The tokenized test dataset.
     - `processing_class=tokenizer`: The tokenizer used to preprocess the data.
     - `data_collator=data_collator`: The collator used to handle padding during batching.
     - `compute_metrics=compute_metrics`: The function to compute metrics like accuracy during evaluation.

   The `Trainer` is now set up and ready to handle the training and evaluation process with the specified configurations.
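
As a rough sanity check on these settings, the sketch below estimates how many optimizer steps the run takes and which checkpoint would be reloaded at the end. It assumes a single device and no gradient accumulation; the training-set size of 10,000 and the two evaluation losses are made-up numbers, and `total_optimizer_steps` is a helper defined only for this example.

```python
import math

def total_optimizer_steps(num_examples, per_device_batch_size, num_epochs, num_devices=1):
    """Batches per epoch (a final partial batch counts) times the number of epochs."""
    steps_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    return steps_per_epoch, steps_per_epoch * num_epochs

# 10_000 is a hypothetical dataset size; 16 and 2 come from the arguments above.
steps_per_epoch, total_steps = total_optimizer_steps(10_000, 16, 2)
print(steps_per_epoch, total_steps)  # 625 1250

# Hypothetical evaluation losses recorded at the end of each epoch, keyed by step.
eval_loss_by_step = {625: 0.21, 1250: 0.24}

# With the default selection metric, the checkpoint with the lowest loss wins.
best_step = min(eval_loss_by_step, key=eval_loss_by_step.get)
print(f"checkpoint-{best_step}")  # checkpoint-625
```

In this made-up case the first epoch's checkpoint is kept as the best model even though training continued for a second epoch.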


In [None]:

# Load model and set up Trainer
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(
	"distilbert/distilbert-base-uncased",
	num_labels=2,
	id2label={0: "Safe", 1: "Fraud"},
	label2id={"Safe": 0, "Fraud": 1}
)

training_args = TrainingArguments(
	output_dir="./out/detext_dataset",
	learning_rate=2e-5,
	per_device_train_batch_size=16,
	per_device_eval_batch_size=16,
	num_train_epochs=2,
	weight_decay=0.01,
	eval_strategy="epoch",
	save_strategy="epoch",
	load_best_model_at_end=True,
	report_to="none"
)

trainer = Trainer(
	model=model,
	args=training_args,
	train_dataset=tokenized_dataset["train"],
	eval_dataset=tokenized_dataset["test"],
	processing_class=tokenizer,
	data_collator=data_collator,
	compute_metrics=compute_metrics,
)


In [None]:
# Train the model
trainer.train()

### Model Inference for Text Classification

In this cell, we are performing inference with the trained model to predict the class of a given input text. The steps are as follows:

1. **Define the Label Mapping (Optional):**
   - A separate `id_to_label = {0: "Safe", 1: "Fraud"}` dictionary is not needed: the same mapping was saved with the model, so we use `model.config.id2label` to map predicted class IDs to human-readable labels.

2. **Load the Tokenizer and Model:**
   - `tokenizer = AutoTokenizer.from_pretrained(model_path)`: The tokenizer used during training is loaded from the saved model directory (`model_path`).
   - `model = AutoModelForSequenceClassification.from_pretrained(model_path)`: The trained sequence classification model is loaded from the saved model directory.
   - `model.eval()`: The model is set to evaluation mode, which disables certain training-specific features like dropout.

3. **Prepare the Input Text:**
   - `text = "You have to send money on this bank account: 10001"`: The input text to be classified.
   - `inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)`: The input text is tokenized and returned as PyTorch tensors (`return_tensors="pt"`). `truncation=True` caps the input at the model's maximum length; `padding=True` has no effect on a single text and only matters when several texts are passed at once.

4. **Perform Prediction:**
   - `with torch.no_grad()`: This context manager disables gradient calculation, as we’re in inference mode and don’t need to track gradients.
   - `outputs = model(**inputs)`: The model is used to make predictions on the input text. `outputs` contains the raw logits, which are the unnormalized predictions from the model.
   - `logits = outputs.logits`: Extracts the logits from the model’s output.
   - `probabilities = F.softmax(logits, dim=-1)`: The logits are passed through a softmax function to convert them into probabilities.
   - `predicted_class_id = torch.argmax(probabilities, dim=1).item()`: The class with the highest probability is selected as the predicted class.
   - `confidence_score = probabilities[0][predicted_class_id].item()`: The confidence score is the probability of the predicted class.

5. **Map to Label (using `model.config.id2label`):**
   - `label = model.config.id2label[predicted_class_id]`: The predicted class ID is mapped to the corresponding human-readable label using the `id2label` attribute of the model's configuration.

6. **Output the Results:**
   - `print(f"Predicted label: {label}")`: The predicted label ("Safe" or "Fraud") is printed.
   - `print(f"Confidence score: {confidence_score:.4f}")`: The confidence score of the prediction is printed, showing how confident the model is in its classification.
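
The last three steps (softmax, argmax, label lookup) can be followed on a single pair of numbers. The sketch below uses plain Python in place of `torch`, with hypothetical logits standing in for `outputs.logits[0]`; the `id2label` dictionary repeats the mapping stored in the model configuration.

```python
import math

id2label = {0: "Safe", 1: "Fraud"}  # same mapping as model.config.id2label
logits = [-1.2, 2.3]                # hypothetical logits for one input text

# Softmax: exponentiate (shifted by the max for numerical stability) and normalize.
exps = [math.exp(x - max(logits)) for x in logits]
probabilities = [e / sum(exps) for e in exps]

# The predicted class is the most probable one; its probability is the confidence.
predicted_class_id = max(range(len(probabilities)), key=probabilities.__getitem__)
confidence_score = probabilities[predicted_class_id]

print(f"Predicted label: {id2label[predicted_class_id]}")  # Predicted label: Fraud
print(f"Confidence score: {confidence_score:.4f}")         # Confidence score: 0.9707
```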

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

# Replace [Model Path] with the checkpoint directory saved under output_dir
model_path = "./out/detext_dataset/[Model Path]"

# Load
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)
model.eval()

# Test input
text = "You have to send money on this bank account: 10001"
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

# Predict
with torch.no_grad():
	outputs = model(**inputs)
	logits = outputs.logits
	probabilities = F.softmax(logits, dim=-1)
	predicted_class_id = torch.argmax(probabilities, dim=1).item()
	confidence_score = probabilities[0][predicted_class_id].item()

label = model.config.id2label[predicted_class_id]

# Output
print(f"Predicted label: {label}")
print(f"Confidence score: {confidence_score:.4f}")