# LinguaFuse Fine-Tuning Tutorial

This notebook demonstrates how to use the LinguaFuse framework to load and process a sample dataset and to load a transformer model ready for fine-tuning. The framework supports several scopes (Local, AWS, AML); the examples here use the Local scope.

## 1. Install and Import Dependencies

Install the project requirements, then import the necessary modules.

In [10]:
# Install dependencies (run once)
# %pip install -r ../requirements.txt

# Imports
import sys
from pathlib import Path

from transformers import PreTrainedTokenizerFast

# Add the "libs" folder under the parent directory to sys.path so that linguafuse can be imported
root_dir = Path.cwd().resolve().parent
sys.path.append(str(root_dir / "libs"))

from linguafuse.cloud import Scope
from linguafuse.framework import (
    FineTuneOrchestration,
    LocalDataArguments,
)
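
Notebook cells are often re-run, and each run of `sys.path.append` adds the same entry again. A minimal sketch of an idempotent variant is shown below; `add_to_sys_path` is a hypothetical helper written for this illustration and is not part of LinguaFuse.

```python
import sys
from pathlib import Path


def add_to_sys_path(directory: Path) -> bool:
    """Append directory to sys.path unless it is already present.

    Returns True if the entry was added, False if it was already there.
    """
    entry = str(directory.resolve())
    if entry in sys.path:
        return False
    sys.path.append(entry)
    return True


# Mirrors the cell above: the "libs" folder next to the notebook's parent directory
libs_dir = Path.cwd().resolve().parent / "libs"
add_to_sys_path(libs_dir)  # adds the entry if it is missing
add_to_sys_path(libs_dir)  # repeated calls leave sys.path unchanged
```

With this guard, re-running the import cell does not grow `sys.path`.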

## 2. Load and Process Dataset

Use the Local scope to load the sample CSV, then process it into a `ProcessedDataset`.

In [11]:
# Define sample data path
sample_path = root_dir / 'tests' / 'example_data.csv'

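# Initialize tokenizer. Loading a BERT checkpoint through the generic PreTrainedTokenizerFast
# class triggers the class-mismatch warning in the output below; AutoTokenizer or BertTokenizerFast avoids it.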
tokenizer = PreTrainedTokenizerFast.from_pretrained('bert-base-uncased')

# Set up orchestration for local dataset
local_args = LocalDataArguments(path=sample_path)
orl = FineTuneOrchestration(data_args=local_args, scope=Scope.LOCAL, tokenizer=tokenizer)

# Process dataset (_create_dataset is underscore-prefixed, so treat it as a private method that may change)
orl._create_dataset()
print(f"Dataset columns: {orl.dataset.data.columns.tolist()}")
print(f"Number of examples: {len(orl.dataset.data)}")

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'BertTokenizer'. 
The class this function is called from is 'PreTrainedTokenizerFast'.


Connecting locally with asset: path=PosixPath('/Users/steven/git/LinguaFuse/tests/example_data.csv') <class 'linguafuse.framework.LocalDataArguments'>
Dataset columns: ['label', 'encoded_label', 'text']
Number of examples: 10


## 3. Load Transformer Model

Load the transformer model with the correct `num_labels` inferred from the dataset.

In [12]:
# Load model
model = orl.load_model('bert-base-uncased')
print(f"Model config num_labels: {orl.num_labels}")

Connecting locally with asset: path=PosixPath('/Users/steven/git/LinguaFuse/tests/example_data.csv') <class 'linguafuse.framework.LocalDataArguments'>


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model config num_labels: 3
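
The reported value of 3 presumably equals the number of distinct classes in the `label` column. As a rough illustration of how an `encoded_label` column and a class count can be derived from string labels, here is a sketch that uses only the standard library. The sample labels and the `encode_labels` helper are made up for this example; LinguaFuse's actual encoding logic is not shown in this notebook and may differ.

```python
from __future__ import annotations


def encode_labels(labels: list[str]) -> tuple[list[int], dict[str, int]]:
    """Map each distinct label to an integer id, using sorted order for determinism."""
    label_to_id = {label: idx for idx, label in enumerate(sorted(set(labels)))}
    encoded = [label_to_id[label] for label in labels]
    return encoded, label_to_id


# Made-up labels standing in for the `label` column of the sample CSV
labels = ["positive", "negative", "neutral", "positive", "negative"]
encoded, label_to_id = encode_labels(labels)

# The number of distinct ids is what a classification head needs as num_labels
num_labels = len(label_to_id)
print(label_to_id)
print(encoded)
print(num_labels)
```

Sorting the unique labels before assigning ids keeps the mapping stable across runs, which matters when a saved model's outputs have to be mapped back to label names.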


## 4. Next Steps

- Extend this notebook with a training loop that uses the loaded model and data loaders.
- Experiment with the AWS or AML scopes by providing `AwsDataArguments` or `AmlDataArguments` together with the appropriate credentials.
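
As a starting point for the training-loop suggestion, the batching skeleton can be sketched without any framework: each epoch shuffles the examples and walks through them in mini-batches. The `iter_batches` helper and the toy rows below are hypothetical stand-ins; in practice a `torch.utils.data.DataLoader` and an optimizer step would replace the placeholder comment.

```python
from __future__ import annotations

import random
from typing import Iterator, Sequence


def iter_batches(examples: Sequence[dict], batch_size: int, seed: int) -> Iterator[list[dict]]:
    """Yield shuffled mini-batches; the last batch may be smaller."""
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [examples[i] for i in indices[start:start + batch_size]]


# Toy rows standing in for the 10 processed examples
examples = [{"text": f"example {i}", "encoded_label": i % 3} for i in range(10)]

num_epochs = 2
seen_per_epoch = []
for epoch in range(num_epochs):
    seen = 0
    # A different seed per epoch gives a different shuffle order each time
    for batch in iter_batches(examples, batch_size=4, seed=epoch):
        # A real loop would tokenize the batch, run the model forward,
        # compute the loss, and take an optimizer step here.
        seen += len(batch)
    seen_per_epoch.append(seen)
```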