# **Comprehensive Guide: Multi-Label Classification of Research Papers using Azure Machine Learning**
This guide provides a **step-by-step tutorial** to **design, implement, and deploy** an **AI solution** for **multi-label classification of research papers** using **Azure AI Studio**.

---

## **📌 Project Overview**
### **🎯 Objective**
We will **classify research papers** into **multiple academic domains** using a **pre-trained transformer model** from the **Azure AI Studio Model Catalog**. This AI solution will:
✅ Automate research paper classification.  
✅ Improve discoverability in academic repositories.  
✅ Allow for scalable, cloud-based deployment.

---

## **📝 Step 1: Define the AI Task**
### **✅ Task Definition**
- **Goal:** Assign **multiple academic fields** (e.g., *Computer Science, Physics, Mathematics*) to research papers based on their **title and abstract**.
- **Dataset:** A Kaggle **multi-label classification dataset** of **academic papers**, with `TITLE` and `ABSTRACT` text columns and one binary column per academic field.
- **Real-World Application:** Used by **academic search engines, journals, and research institutions** to automatically tag papers.
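
To make the task definition concrete, here is a minimal sketch of how a multi-label target can be represented: each paper gets one 0/1 flag per field, and several flags may be set at once. The three-field label list and the helper names `to_multi_hot` and `from_multi_hot` are illustrative assumptions, not part of the dataset or of any library.

```python
# Illustrative label set; the real column names come from the dataset's header.
LABELS = ["Computer Science", "Physics", "Mathematics"]


def to_multi_hot(fields, labels=LABELS):
    """Return one 0/1 flag per label; several flags may be 1 at once."""
    unknown = set(fields) - set(labels)
    if unknown:
        raise ValueError(f"Unknown labels: {sorted(unknown)}")
    return [1 if label in fields else 0 for label in labels]


def from_multi_hot(vector, labels=LABELS):
    """Map a 0/1 vector back to the list of field names."""
    return [label for label, flag in zip(labels, vector) if flag]


# A hypothetical paper tagged with two fields at the same time
paper_vector = to_multi_hot(["Computer Science", "Mathematics"])
print(paper_vector)                  # [1, 0, 1]
print(from_multi_hot(paper_vector))  # ['Computer Science', 'Mathematics']
```

This is what separates multi-label from multi-class classification: the target is a vector of independent yes/no decisions rather than a single class index.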

---

## **🔍 Step 2: Explore the Model Catalog**
### **✅ Choose a Pre-trained Model**
Azure AI Studio provides **pre-trained models** from:
- **Microsoft**
- **OpenAI**
- **Hugging Face**

Since we already fine-tuned **DistilBERT** on a similar dataset, we will:
1️⃣ **Use Hugging Face’s DistilBERT model from the Azure Model Catalog.**  
2️⃣ **Fine-tune the model on our multi-label dataset.**  
3️⃣ **Deploy it as an API using Azure Machine Learning.**

---

## **📦 Step 3: Manage Your Model in Azure AI Studio**
### **✅ Set Up the Environment**
1️⃣ **Sign into Azure AI Studio**  
- Go to **[Azure AI Studio](https://ai.azure.com/studio/)**.
- Create a **new workspace** for the project.

2️⃣ **Create an Azure Machine Learning Compute Instance**
- In **Azure ML Studio**, navigate to **Compute** → **Create New**.
- Choose a **GPU-backed VM size** (e.g., *Standard_NC6*); the sizes you can select depend on your region and subscription quota.

3️⃣ **Organize the Model**
- Open **Azure AI Studio**.
- Select **Models** → **Browse Model Catalog**.
- **Search for "DistilBERT"** and select **Hugging Face’s `distilbert-base-uncased`**.

4️⃣ **Enable Version Control**
- **Register the model** inside your workspace.
- Assign a **version number** to track changes.

---

## **📊 Step 4: Develop the AI Solution**
### **✅ 1. Input Data Preparation**
#### **1️⃣ Load Dataset from Kaggle**
```python
import pandas as pd

# Load dataset
dataset_path = "/kaggle/input/multilabel-classification-dataset/train.csv"
df = pd.read_csv(dataset_path)

# Merge title and abstract into a single text column (missing values become empty strings)
df["text"] = df["TITLE"].fillna("") + " " + df["ABSTRACT"].fillna("")

# Define label columns
label_columns = ["Computer Science", "Physics", "Mathematics", "Statistics", "Quantitative Biology", "Quantitative Finance"]

# Ensure labels are binary
df[label_columns] = df[label_columns].astype(int)

# Keep only relevant columns
df = df[["text"] + label_columns]
```

---

### **✅ 2. Convert Data to Azure Dataset**
1️⃣ **Upload the dataset to Azure ML**:
- Navigate to **Data** → **Upload Data**.
- Select **Tabular Dataset**.

2️⃣ **Register the dataset**:
- Assign a **name** and **version**.

---

### **✅ 3. Preprocess Text Data**
```python
from transformers import AutoTokenizer
from datasets import Dataset

# Load tokenizer from Hugging Face
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Convert Pandas DataFrame to Hugging Face Dataset format
dataset = Dataset.from_pandas(df)

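# Tokenize the text and pack the label columns into one float vector per example.
# The multi-label loss expects a "labels" field of floats, not separate int columns.
def preprocess_function(examples):
    encoded = tokenizer(examples["text"], padding="max_length", truncation=True)
    encoded["labels"] = [
        [float(value) for value in row]
        for row in zip(*(examples[col] for col in label_columns))
    ]
    return encoded

# Apply tokenization and drop the raw columns so only model inputs remain
tokenized_dataset = dataset.map(
    preprocess_function, batched=True, remove_columns=dataset.column_names
)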
```
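
The `padding="max_length"` and `truncation=True` arguments make every example the same length. The following pure-Python sketch illustrates the idea under simplifying assumptions: `pad_or_truncate` is a hypothetical helper, the token IDs are made-up placeholders, and a real tokenizer additionally keeps its special end-of-sequence token when truncating.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids[:max_length])   # truncation: drop tokens past the limit
    mask = [1] * len(ids)                # 1 marks a real token
    padding = max_length - len(ids)
    ids += [pad_id] * padding            # padding: fill the remainder with pad_id
    mask += [0] * padding                # 0 marks padding the model should ignore
    return ids, mask


# A short sequence is padded, a long one is cut (IDs are placeholders)
short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_length=5)
long_ids, long_mask = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_length=5)

print(short_ids, short_mask)  # [101, 7592, 102, 0, 0] [1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [101, 1, 2, 3, 4] [1, 1, 1, 1, 1]
```

Because no `max_length` is passed to the tokenizer above, it falls back to the model's maximum input length, which is 512 tokens for `distilbert-base-uncased`.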

---

### **✅ 4. Train DistilBERT on Azure**
1️⃣ **Fine-tune model**
```python
import torch
from transformers import AutoModelForSequenceClassification

# Load pre-trained DistilBERT for multi-label classification
num_labels = len(label_columns)
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=num_labels,
    problem_type="multi_label_classification"
)

# Move model to GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
```

2️⃣ **Set training parameters**
```python
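import torch
from transformers import TrainingArguments, Trainer

# Hold out 10% of the data so evaluation is not run on the training examples
# (the split size and seed are illustrative choices)
split = tokenized_dataset.train_test_split(test_size=0.1, seed=42)

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="steps",  # newer transformers releases rename this to eval_strategy
    eval_steps=500,
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
    logging_dir="./logs",
    fp16=torch.cuda.is_available(),  # mixed precision requires a CUDA device
)

trainer = Trainer(
    model=model,  # Trainer moves the model to the available device itself
    args=training_args,
    train_dataset=split["train"],
    eval_dataset=split["test"],
)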
```

3️⃣ **Train the model**
```python
trainer.train()
```
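
Setting `problem_type="multi_label_classification"` when loading the model makes the training loss a binary cross-entropy over logits, so each label is scored independently with a sigmoid rather than competing in a softmax. The sketch below reproduces that computation for a single example in plain Python; the logits and targets are made-up values, and `bce_with_logits` is an illustrative helper rather than the library's implementation.

```python
import math


def sigmoid(x):
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over all labels of one example."""
    total = 0.0
    for z, y in zip(logits, targets):
        p = sigmoid(z)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(logits)


# Hypothetical model outputs for three labels, and the true 0/1 targets
logits = [2.0, -1.0, 0.5]
targets = [1.0, 0.0, 1.0]

probs = [sigmoid(z) for z in logits]
loss = bce_with_logits(logits, targets)

print([round(p, 3) for p in probs])  # per-label probabilities, not a distribution
print(round(loss, 4))
```

Note that the probabilities need not sum to 1, which is what allows a paper to belong to several fields at once.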

---

## **🚀 Step 5: Deploy the Model in Azure**
### **✅ Convert Model to ONNX for Azure**
1️⃣ **Export model**
```python
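import torch

# Export in inference mode with named inputs and outputs so a scoring script can feed them by name
model.eval()
dummy_ids = torch.ones(1, 512, dtype=torch.int64, device=device)
dummy_mask = torch.ones_like(dummy_ids)
torch.onnx.export(model, (dummy_ids, dummy_mask), "distilbert.onnx",
                  input_names=["input_ids", "attention_mask"], output_names=["logits"],
                  dynamic_axes={"input_ids": {0: "batch"}, "attention_mask": {0: "batch"}, "logits": {0: "batch"}})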
```

2️⃣ **Upload to Azure**
- Go to **Azure ML Studio** → **Models** → **Upload Model**.
- Select **distilbert.onnx**.

---

### **✅ Deploy as an API in Azure**
1️⃣ **Create an Azure ML Endpoint**
- In **Azure ML Studio**, go to **Endpoints** → **Create New Endpoint**.

2️⃣ **Deploy the Model**
```python
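from azureml.core import Environment, Workspace
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AciWebservice

# Load the Azure workspace from a local config.json
ws = Workspace.from_config()

# Build an environment with the packages score.py needs
# (the package list is an assumption; adjust it to your scoring script)
env = Environment("paper-classifier-env")
env.python.conda_dependencies = CondaDependencies.create(
    pip_packages=["azureml-defaults", "onnxruntime", "transformers", "numpy"]
)
inference_config = InferenceConfig(entry_script="score.py", environment=env)

# Register the exported model (a separate name avoids shadowing the PyTorch `model`)
registered_model = Model.register(ws, model_name="distilbert-paper-classifier", model_path="distilbert.onnx")

# Deploy as a web service on Azure Container Instances (resource sizes are illustrative)
deployment_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=2)
deployment = Model.deploy(ws, "paper-classifier-endpoint", [registered_model], inference_config, deployment_config)
deployment.wait_for_deployment(show_output=True)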
```
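
The deployment refers to an entry script named `score.py`, which this guide does not show. Below is a minimal sketch of what such a script might look like, assuming the service loads the exported `distilbert.onnx` with ONNX Runtime and receives a JSON body of the form `{"texts": [...]}`. The request schema, the `logits_to_labels` helper, the constants, and the choice to download the tokenizer at startup are all assumptions to adapt to your setup.

```python
import json
import math
import os

# Hypothetical constants: adjust to match your registered model and label order.
MODEL_FILE = "distilbert.onnx"
LABELS = ["Computer Science", "Physics", "Mathematics",
          "Statistics", "Quantitative Biology", "Quantitative Finance"]
THRESHOLD = 0.5

session = None
tokenizer = None


def logits_to_labels(logits, labels=LABELS, threshold=THRESHOLD):
    """Apply a sigmoid per logit and keep labels whose probability exceeds the threshold."""
    return [label for label, z in zip(labels, logits)
            if 1.0 / (1.0 + math.exp(-z)) > threshold]


def init():
    """Called once when the service starts: load the ONNX session and the tokenizer."""
    global session, tokenizer
    import onnxruntime as ort  # imported here because it is only needed inside the service
    from transformers import AutoTokenizer

    model_dir = os.getenv("AZUREML_MODEL_DIR", ".")
    session = ort.InferenceSession(os.path.join(model_dir, MODEL_FILE))
    # Assumes the container can download the tokenizer files at startup
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")


def run(raw_data):
    """Called per request; assumes a JSON body like {"texts": ["title and abstract", ...]}."""
    texts = json.loads(raw_data)["texts"]
    encoded = tokenizer(texts, padding="max_length", truncation=True,
                        max_length=512, return_tensors="np")
    # Feed tensors in export order (token IDs first, then the mask if the graph declares it)
    names = [inp.name for inp in session.get_inputs()]
    feed = dict(zip(names, (encoded["input_ids"], encoded["attention_mask"])))
    logits = session.run(None, feed)[0]
    return {"predictions": [logits_to_labels(row.tolist()) for row in logits]}
```

Keeping the post-processing in a small pure function makes it possible to unit-test the thresholding logic locally without starting the service.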

---

## **📊 Step 6: Evaluate the AI Solution**
### **✅ Performance Metrics**
```python
import numpy as np
import torch
from sklearn.metrics import classification_report

# Evaluate on the trainer's evaluation set
predictions = trainer.predict(trainer.eval_dataset)

# Sigmoid turns each logit into an independent per-label probability; threshold at 0.5
y_pred = (torch.sigmoid(torch.tensor(predictions.predictions)) > 0.5).int().numpy()
y_true = np.asarray(predictions.label_ids).astype(int)

print(classification_report(y_true, y_pred, target_names=label_columns, zero_division=0))
```
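
When reading the report, it helps to know how averaged F1 scores differ. Micro-averaging pools true and false positives across all labels, so frequent labels dominate; macro-averaging computes F1 per label and then takes the mean, so rare labels count equally. The following sketch works through both on toy data; the helper functions and the example matrices are illustrative and independent of scikit-learn.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def per_label_counts(y_true, y_pred):
    """Return a (tp, fp, fn) tuple for each label column."""
    counts = []
    for j in range(len(y_true[0])):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))
    return counts


def macro_f1(y_true, y_pred):
    """Average the per-label F1 scores, weighting every label equally."""
    counts = per_label_counts(y_true, y_pred)
    return sum(f1(*c) for c in counts) / len(counts)


def micro_f1(y_true, y_pred):
    """Pool the counts over all labels, then compute a single F1."""
    counts = per_label_counts(y_true, y_pred)
    tp, fp, fn = (sum(col) for col in zip(*counts))
    return f1(tp, fp, fn)


# Toy data: label 0 is common and always right, label 1 is rare and always missed
y_true = [[1, 0], [1, 0], [1, 0], [0, 1]]
y_pred = [[1, 0], [1, 0], [1, 0], [0, 0]]

print(round(micro_f1(y_true, y_pred), 3))  # 0.857
print(round(macro_f1(y_true, y_pred), 3))  # 0.5
```

A large gap between the two averages, as in this toy case, usually signals that the model is neglecting the less frequent labels.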

---

## **📜 Step 7: Write the Report**
Your report should include:

📌 **Task Definition**: Why automated paper classification is useful.  
📌 **Model Selection**: Justify using **DistilBERT**.  
📌 **Management Process**: Version control, dataset handling.  
📌 **Solution Development**: Training and deployment process.  
📌 **Evaluation Results**: F1-score, precision-recall metrics.  
📌 **Future Improvements**: Improve recall with **focal loss**, experiment with **BERT/RoBERTa**.
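
For the focal-loss idea mentioned under future improvements, a minimal per-label sketch is shown below. It scales the binary cross-entropy by `(1 - p_t) ** gamma`, which shrinks the contribution of labels the model already predicts confidently. The function name, the default `gamma=2.0`, and the optional `alpha` weighting are illustrative assumptions; plugging such a loss into training would require a custom loss in the training loop, which is not shown here.

```python
import math


def binary_focal_loss(logit, target, gamma=2.0, alpha=None):
    """Focal loss for one label: -(alpha_t) * (1 - p_t)**gamma * log(p_t)."""
    p = 1.0 / (1.0 + math.exp(-logit))
    p_t = p if target == 1 else 1.0 - p          # probability assigned to the true outcome
    loss = -((1.0 - p_t) ** gamma) * math.log(p_t)
    if alpha is not None:
        alpha_t = alpha if target == 1 else 1.0 - alpha
        loss *= alpha_t                          # optional class-balance weight
    return loss


# An easy positive (confident and correct) versus a hard positive (confident and wrong)
easy = binary_focal_loss(4.0, 1)
hard = binary_focal_loss(-2.0, 1)

# With gamma=0 the modulating factor disappears and the loss is plain cross-entropy
plain_bce_hard = binary_focal_loss(-2.0, 1, gamma=0.0)

print(easy < hard < plain_bce_hard)  # True
```

Because easy examples are down-weighted far more than hard ones, the gradient signal concentrates on the labels the model currently gets wrong.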

---
