# **Comprehensive Step-by-Step Guide: Fine-Tuning and Deploying a Pre-Trained Model on Azure Machine Learning**
This guide walks you through **fine-tuning a pre-trained transformer model** on Azure Machine Learning (AML) to classify **research papers into multiple academic domains**. We will leverage **Azure AI Studio, Azure Machine Learning, DeepSpeed, and ONNX Runtime** for efficient training and deployment.

---

## **🚀 Step 1: Define Your Task**
### **Task Overview**
**Objective**: Develop a multi-label classification AI model that predicts the **academic domain(s)** of research papers based on their **title and abstract**.  
**Use Case**: This model helps **automate research paper classification** in academic repositories, reducing **manual labeling effort**.

### **Dataset Requirements**
- Dataset should contain **research paper titles, abstracts, and their categories**.
- **Multi-label** setup (a paper can belong to more than one category).
- Labels include: **Computer Science, Physics, Mathematics, Statistics, Quantitative Biology, and Quantitative Finance**.
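
In a multi-label setup, each paper's target is a fixed-length 0/1 vector with one position per domain. A minimal sketch of that encoding, using the six labels listed above; the helper name `to_multi_hot` is illustrative only:

```python
LABELS = [
    "Computer Science", "Physics", "Mathematics",
    "Statistics", "Quantitative Biology", "Quantitative Finance",
]

def to_multi_hot(domains):
    """Encode a list of domain names as a 0/1 vector ordered like LABELS."""
    unknown = set(domains) - set(LABELS)
    if unknown:
        raise ValueError(f"Unknown labels: {sorted(unknown)}")
    return [1 if label in domains else 0 for label in LABELS]

# A paper tagged with two domains gets two positions set
print(to_multi_hot(["Physics", "Mathematics"]))  # [0, 1, 1, 0, 0, 0]
```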

### **Expected Outcome**
- A **fine-tuned transformer model** that accurately classifies papers into multiple domains.
- **Deployment of the model** as an API endpoint for real-world applications.

---

## **📝 Step 2: Prepare Your Dataset**
### **1️⃣ Collect and Preprocess the Data**
Use the **Kaggle Multi-Label Classification Dataset** (or any research paper dataset).  
If using Kaggle, download and extract the dataset:

```python
import pandas as pd

# Load dataset
dataset_path = "/kaggle/input/multilabel-classification-dataset/train.csv"
df = pd.read_csv(dataset_path)

# Rename columns
df.rename(columns={"TITLE": "title", "ABSTRACT": "abstract"}, inplace=True)

# Combine title and abstract
df["text"] = df["title"] + " " + df["abstract"]

# Define label columns
label_columns = ["Computer Science", "Physics", "Mathematics", "Statistics", "Quantitative Biology", "Quantitative Finance"]

# Convert labels to binary format
df[label_columns] = df[label_columns].astype(int)

# Keep only relevant columns
df = df[["text"] + label_columns]
```

### **2️⃣ Split Dataset into Training, Validation, and Test Sets**
```python
from sklearn.model_selection import train_test_split

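# Hold out 20% for testing, then 10% of the remainder for validation
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
train_df, val_df = train_test_split(train_df, test_size=0.1, random_state=42)

# Save the splits to disk so they can be uploaded
train_df.to_csv("train.csv", index=False)
val_df.to_csv("val.csv", index=False)
test_df.to_csv("test.csv", index=False)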
```
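
Because the validation set is carved out of what remains after the test split, `test_size=0.1` in the second call is 10% of the remaining 80%, not 10% of the full dataset. A short worked calculation of the resulting proportions, using exact fractions to avoid rounding noise (the function name is illustrative):

```python
from fractions import Fraction

def split_fractions(test_size, val_size):
    """Fractions of the full dataset after two successive splits."""
    remaining = 1 - test_size       # what is left after the test split
    val = remaining * val_size      # validation is taken from the remainder
    train = remaining - val
    return train, val, test_size

train_frac, val_frac, test_frac = split_fractions(Fraction(1, 5), Fraction(1, 10))
print(float(train_frac), float(val_frac), float(test_frac))  # 0.72 0.08 0.2
```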

### **3️⃣ Upload Dataset to Azure Blob Storage**
Upload your dataset to Azure for use in training. Use **Azure Storage Explorer** or the Azure SDK. The example below uploads `train.csv`; repeat the upload for the validation and test files:

```python
from azure.storage.blob import BlobServiceClient

# Azure Storage details
storage_account_name = "your_storage_account"
storage_account_key = "your_storage_key"
container_name = "datasets"

# Upload data
blob_service_client = BlobServiceClient(account_url=f"https://{storage_account_name}.blob.core.windows.net", credential=storage_account_key)
blob_client = blob_service_client.get_blob_client(container=container_name, blob="train.csv")

with open("train.csv", "rb") as data:
    blob_client.upload_blob(data, overwrite=True)
```
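
Hard-coding the storage key in a script makes it easy to leak. One possible alternative, sketched here, is to read the account name and key from environment variables. The variable names `AZURE_STORAGE_ACCOUNT` and `AZURE_STORAGE_KEY` and the helper `load_storage_settings` are assumptions for this example; use whatever names your environment defines.

```python
import os

def load_storage_settings(env=None):
    """Build BlobServiceClient keyword arguments from environment variables."""
    env = os.environ if env is None else env
    name = env.get("AZURE_STORAGE_ACCOUNT")  # assumed variable name
    key = env.get("AZURE_STORAGE_KEY")       # assumed variable name
    if not name or not key:
        raise RuntimeError("Storage account name or key is not set")
    return {
        "account_url": f"https://{name}.blob.core.windows.net",
        "credential": key,
    }

# Example with an explicit mapping instead of the real environment
settings = load_storage_settings(
    {"AZURE_STORAGE_ACCOUNT": "demoaccount", "AZURE_STORAGE_KEY": "demo-key"}
)
print(settings["account_url"])
```

The returned dictionary can then be unpacked into the client constructor, for example `BlobServiceClient(**load_storage_settings())`.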

---

## **🤖 Step 3: Select and Fine-Tune a Pre-Trained Model**
### **1️⃣ Choose a Model from Azure Model Catalog**
Use **DistilBERT** from Hugging Face, which balances **efficiency and accuracy** for multi-label classification.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load model
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Define number of labels
num_labels = len(label_columns)

# Load model for multi-label classification
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=num_labels, problem_type="multi_label_classification"
)
```

### **2️⃣ Convert Data to Hugging Face Dataset Format**
```python
from datasets import Dataset

# Convert DataFrame to Hugging Face dataset
train_dataset = Dataset.from_pandas(train_df)
val_dataset = Dataset.from_pandas(val_df)
test_dataset = Dataset.from_pandas(test_df)
```

### **3️⃣ Tokenize Data**
```python
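def preprocess_function(examples):
    encoding = tokenizer(examples["text"], padding="max_length", truncation=True)
    # The Trainer reads targets from "labels"; multi-label loss expects floats
    rows = zip(*[examples[col] for col in label_columns])
    encoding["labels"] = [[float(v) for v in row] for row in rows]
    return encoding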

train_tokenized = train_dataset.map(preprocess_function, batched=True)
val_tokenized = val_dataset.map(preprocess_function, batched=True)
test_tokenized = test_dataset.map(preprocess_function, batched=True)
```
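
With `batched=True`, the mapped function receives a dictionary of column lists rather than one row at a time. For multi-label training, the per-domain columns need to end up as one float vector per paper. A minimal sketch of that reshaping on toy data; the helper name `build_label_vectors` and the shortened two-label list are illustrative only:

```python
def build_label_vectors(batch, label_columns):
    """Turn per-label columns into one float vector per example."""
    columns = [batch[name] for name in label_columns]
    # zip(*columns) walks the columns in parallel, yielding one row at a time
    return [[float(value) for value in row] for row in zip(*columns)]

toy_labels = ["Computer Science", "Physics"]  # shortened for the example
toy_batch = {
    "text": ["paper A", "paper B", "paper C"],
    "Computer Science": [1, 0, 1],
    "Physics": [0, 1, 1],
}

vectors = build_label_vectors(toy_batch, toy_labels)
print(vectors)  # [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
```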

### **4️⃣ Configure Azure Machine Learning Training Script**
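Save the following script as **train.py** and upload it to Azure. As a standalone script, it must also include the data loading, model loading, and tokenization code from the previous steps, because it relies on `model`, `train_tokenized`, and `val_tokenized`.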

```python
import torch
from transformers import Trainer, TrainingArguments

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",  # named eval_strategy in newer transformers releases
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
    fp16=True,  # Mixed-precision training (requires a CUDA GPU)
)

# Define Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_tokenized,
    eval_dataset=val_tokenized,
)

# Train model
trainer.train()
```
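
By default the `Trainer` reports only the loss at each evaluation. A sketch of a `compute_metrics` function for the multi-label case follows, assuming the evaluation set carries multi-hot labels. It applies a sigmoid and a 0.5 threshold (an assumed cut-off) and computes micro-F1 and Hamming loss with NumPy alone:

```python
import numpy as np

def compute_metrics(eval_pred, threshold=0.5):
    """Micro-F1 and Hamming loss from raw logits and multi-hot labels."""
    logits, labels = eval_pred
    probs = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=np.float64)))
    preds = (probs > threshold).astype(int)
    labels = np.asarray(labels).astype(int)

    # Count label-level hits and misses across all examples (micro averaging)
    tp = int(np.sum((preds == 1) & (labels == 1)))
    fp = int(np.sum((preds == 1) & (labels == 0)))
    fn = int(np.sum((preds == 0) & (labels == 1)))
    denom = 2 * tp + fp + fn
    micro_f1 = 2 * tp / denom if denom else 0.0

    # Hamming loss is the fraction of individual label slots predicted wrongly
    hamming = float(np.mean(preds != labels))
    return {"micro_f1": micro_f1, "hamming_loss": hamming}

# Toy check with two examples and three labels (made-up numbers)
toy_logits = [[2.0, -1.0, 0.5], [-3.0, 4.0, -0.2]]
toy_labels = [[1, 0, 0], [0, 1, 1]]
toy_metrics = compute_metrics((toy_logits, toy_labels))
print(toy_metrics)
```

Passing `compute_metrics=compute_metrics` to `Trainer(...)` would add these values to each epoch's evaluation log.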

### **5️⃣ Run Training on Azure ML Compute Cluster**
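Create an **Azure Machine Learning compute cluster** and submit the training job. `az ml job create` expects a YAML job specification rather than a Python file; `job.yml` below is a placeholder name for a specification that points to **train.py**, its environment, and the compute cluster:

```bash
az ml job create --file job.yml
```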
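
The job submitted with `az ml job create` is described by a YAML specification. The sketch below renders a minimal command-job specification from Python so its fields are visible. The environment and compute names are hypothetical placeholders, and the field set (`command`, `code`, `environment`, `compute`) is an assumption based on the Azure ML CLI v2 command-job schema; check it against the current documentation before use.

```python
def render_job_spec(settings):
    """Render a flat dict of strings as simple `key: value` YAML lines."""
    return "".join(f"{key}: {value}\n" for key, value in settings.items())

job_settings = {
    "command": "python train.py",                      # entry point inside the code folder
    "code": ".",                                       # folder uploaded with the job
    "environment": "azureml:my-training-env@latest",   # hypothetical environment name
    "compute": "azureml:gpu-cluster",                  # hypothetical cluster name
}

job_yaml = render_job_spec(job_settings)
print(job_yaml)
```

Saving the printed text to a `.yml` file gives a specification that can be passed to the `--file` argument.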

---

## **📊 Step 4: Evaluate Your Fine-Tuned Model**
Use **F1-score, accuracy, and Hamming loss** for evaluation.

```python
import torch
from sklearn.metrics import accuracy_score, f1_score, hamming_loss

# Get model predictions
predictions = trainer.predict(test_tokenized)

# Convert logits to binary predictions
y_pred = (torch.sigmoid(torch.tensor(predictions.predictions)) > 0.5).int().numpy()
y_true = predictions.label_ids.astype(int)

# Compute metrics (accuracy here is subset accuracy: all labels must match)
print(f"F1-Score: {f1_score(y_true, y_pred, average='micro')}")
print(f"Accuracy: {accuracy_score(y_true, y_pred)}")
print(f"Hamming Loss: {hamming_loss(y_true, y_pred)}")
```
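
On multi-label targets, scikit-learn's `accuracy_score` is subset accuracy: a paper counts as correct only if all six labels match, whereas Hamming loss counts individual label errors. A small worked example on made-up predictions (not model output) shows how far the two can diverge:

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of examples whose full label vector is predicted exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def hamming(y_true, y_pred):
    """Fraction of individual label slots that are predicted wrongly."""
    wrong = sum(tv != pv for t, p in zip(y_true, y_pred) for tv, pv in zip(t, p))
    total = sum(len(t) for t in y_true)
    return wrong / total

toy_true = [[1, 0, 0, 1, 0, 0], [0, 1, 1, 0, 0, 0], [1, 0, 0, 0, 0, 0]]
toy_pred = [[1, 0, 0, 0, 0, 0], [0, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 0]]

# Two of three papers have one wrong label each: 2 errors in 18 slots
print(round(subset_accuracy(toy_true, toy_pred), 3))  # 0.333
print(round(hamming(toy_true, toy_pred), 3))          # 0.111
```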

---

## **🌐 Step 5: Deploy the Model on Azure**
### **1️⃣ Convert to ONNX Format for Efficient Deployment**
```python
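import torch

# Trace on CPU in inference mode; inputs mirror the tokenizer output (1 x 512)
model = model.cpu().eval()
dummy_ids = torch.zeros(1, 512, dtype=torch.int64)
dummy_mask = torch.ones(1, 512, dtype=torch.int64)

# Name inputs and outputs so ONNX Runtime can feed tensors by name
torch.onnx.export(
    model, (dummy_ids, dummy_mask), "model.onnx", export_params=True,
    input_names=["input_ids", "attention_mask"], output_names=["logits"],
)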
```

### **2️⃣ Register the Model in Azure ML**
```bash
az ml model create --name research-paper-classifier --path model.onnx
```

### **3️⃣ Deploy as an API Endpoint**
Create **deploy.py** with:

```python
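from azureml.core.model import Model
from fastapi import FastAPI
from transformers import AutoTokenizer
import numpy as np
import onnxruntime

app = FastAPI()

# Load the tokenizer and the ONNX model once at startup
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model_path = Model.get_model_path("research-paper-classifier")
onnx_model = onnxruntime.InferenceSession(model_path)

@app.post("/predict")
async def predict(text: str):
    # Pad to the 512-token shape used when the model was exported
    enc = tokenizer(text, padding="max_length", truncation=True, max_length=512, return_tensors="np")
    # Feed only the inputs the ONNX graph declares
    feed = {i.name: enc[i.name].astype(np.int64) for i in onnx_model.get_inputs()}
    logits = onnx_model.run(None, feed)[0]
    # Sigmoid gives one probability per label; convert to a JSON-friendly list
    probs = 1.0 / (1.0 + np.exp(-logits))
    return {"predictions": probs[0].tolist()}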
```

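Deploy with the Azure ML CLI. The endpoint command expects a YAML endpoint specification rather than a Python file; `endpoint.yml` below is a placeholder name for that specification. The scoring code in **deploy.py** and the registered model are attached in a separate deployment step (`az ml online-deployment create`):
```bash
az ml online-endpoint create --name research-classifier --file endpoint.yml
```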
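
To call the service, note that `predict(text: str)` declares a plain string parameter, which FastAPI reads from the query string. The sketch below builds the request URL and maps a response back to domain names. It assumes the endpoint returns one probability per label in the order of `label_columns`; the base URL, the 0.5 threshold, and both helper names are illustrative.

```python
from urllib.parse import urlencode

LABELS = [
    "Computer Science", "Physics", "Mathematics",
    "Statistics", "Quantitative Biology", "Quantitative Finance",
]

def build_predict_url(base_url, text):
    """Build the /predict URL with the paper text as a query parameter."""
    return f"{base_url.rstrip('/')}/predict?{urlencode({'text': text})}"

def decode_predictions(probabilities, threshold=0.5):
    """Return the label names whose probability exceeds the threshold."""
    return [label for label, p in zip(LABELS, probabilities) if p > threshold]

# Hypothetical local address and made-up probabilities
url = build_predict_url("http://localhost:8000/", "Quantum finance")
print(url)  # http://localhost:8000/predict?text=Quantum+finance
print(decode_predictions([0.1, 0.8, 0.3, 0.2, 0.05, 0.9]))  # ['Physics', 'Quantitative Finance']
```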

---

## **📄 Step 6: Write the Final Report**
### **Sections**
1️⃣ **Task Definition** – Multi-label classification of research papers.  
2️⃣ **Dataset Preparation** – Preprocessing and Azure storage upload.  
3️⃣ **Model Fine-Tuning** – Training DistilBERT using Azure ML.  
4️⃣ **Evaluation Metrics** – **F1-score, recall, accuracy, Hamming loss**.  
5️⃣ **Deployment Process** – ONNX optimization and API setup.  
6️⃣ **Future Enhancements** – More training data, hyperparameter tuning.

---

## **🎯 Final Outcome**
✅ **Fine-tuned transformer model** deployed on **Azure** as a real-time **classification API**.  
✅ **Optimized inference** using **ONNX Runtime**.  
✅ **Scalable deployment** with **Azure Machine Learning Endpoint**.  

🔥 **Next Steps**: Test the API with real research papers! 🚀