# LinguaFuse Fine-Tuning Tutorial

This notebook shows how to use the LinguaFuse framework to load and process a sample dataset and fine-tune a transformer model. LinguaFuse supports several data scopes (Local, AWS, AML); this walkthrough uses the Local scope.

## 1. Install and Import Dependencies

Install the project requirements, then import the modules used throughout the notebook.

In [1]:
# Install dependencies (run once)
# %pip install -r ../requirements.txt

# Imports
from torch.optim import AdamW
from transformers.optimization import get_linear_schedule_with_warmup
from transformers import AutoTokenizer

from pathlib import Path
import sys

# Add the parent directory to sys.path to resolve imports
root_dir = Path.cwd().resolve().parent
sys.path.append(str(root_dir / "libs"))

from linguafuse.cloud import Scope
from linguafuse.trainer import TrainerArguments
from linguafuse.framework import (
    FineTuneOrchestration,
    LocalDataArguments,
)
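
The `sys.path` tweak above assumes the notebook is started from a directory one level below the repository root. A minimal sketch of a more defensive variant follows. `add_to_sys_path` is a hypothetical helper, not part of LinguaFuse; it refuses to add a directory that does not exist and avoids duplicate entries.

```python
import sys
from pathlib import Path


def add_to_sys_path(directory: Path) -> bool:
    """Append `directory` to sys.path if it exists and is not already listed.

    Hypothetical helper for illustration; returns True when the path was added.
    """
    resolved = str(directory.resolve())
    if not directory.is_dir():
        # Fail early instead of letting a later import raise ModuleNotFoundError.
        raise FileNotFoundError(f"Expected a directory at {resolved}")
    if resolved in sys.path:
        return False
    sys.path.append(resolved)
    return True


# Same layout assumption as the cell above: a `libs` folder in the parent directory.
libs_dir = Path.cwd().resolve().parent / "libs"
try:
    print("added" if add_to_sys_path(libs_dir) else "already present")
except FileNotFoundError as err:
    print(err)
```

Failing early makes a wrong working directory obvious, rather than surfacing later as an import error for `linguafuse`.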



## 2. Load and Process Dataset

Use the Local scope to load the sample CSV, then process it into a `ProcessedDataset`.

In [2]:
# Define sample data path
sample_path = root_dir / 'tests' / 'example_data.csv'

# Initialize tokenizer
# Note: Section 3 loads 'bert-base-uncased'; keep the tokenizer and model checkpoints compatible.
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

# Set up orchestration for local dataset
local_args = LocalDataArguments(path=sample_path)
orl = FineTuneOrchestration(
    data_args=local_args,
    scope=Scope.LOCAL,
    tokenizer=tokenizer,
)

# Process dataset (`_create_dataset` is underscore-prefixed, so treat it as internal API that may change)
orl._create_dataset()
print(f"Dataset columns: {orl.processed_dataset.data.columns.tolist()}")
print(f"Number of examples: {len(orl.processed_dataset.data)}")

Connecting locally with asset: path=PosixPath('/Users/steven/git/LinguaFuse/tests/example_data.csv') <class 'linguafuse.framework.LocalDataArguments'>
Hint: Expecting 'data' to be a tuple of (text, labels)
Unique labels: [0 1 2]
Counts: [4 5 1]
Filtered labels: [0 1]
Filtered indices: [ True  True  True  True  True False  True  True  True  True]
Represented data: (array(['This product exceeds my expectations! Great quality.',
       'Poor customer service and long delivery times.',
       'The interface is intuitive and user-friendly.',
       'Broke after two weeks of use. Disappointed.',
       'Excellent value for money and fast shipping.',
       'Love the design and functionality!',
       'Missing crucial features and documentation.',
       'Quick installation and smooth performance.',
       'Not worth the price. Save your money.'], dtype=object), array([1, 0, 1, 0, 1, 1, 0, 1, 0]))
Sampling 0.2 of the data
Dataset columns: ['label', 'encoded_label', 'text']
Number of examples:
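
The log above reports three unique labels with counts `[4 5 1]`, then keeps only labels 0 and 1 and masks out one example. That is consistent with dropping labels that have too few examples. Below is a minimal sketch of such a filter, assuming a minimum of two examples per label. The actual threshold and implementation inside LinguaFuse are not shown in this notebook, `filter_rare_labels` is a hypothetical name, and the label list is reconstructed from the log.

```python
from collections import Counter


def filter_rare_labels(labels, min_count=2):
    """Return the labels that occur at least `min_count` times and a keep-mask.

    Hypothetical helper; `min_count=2` is an assumption, not a LinguaFuse default.
    """
    counts = Counter(labels)
    kept = sorted(label for label, count in counts.items() if count >= min_count)
    kept_set = set(kept)
    mask = [label in kept_set for label in labels]
    return kept, mask


# Labels reconstructed from the log: label 2 appears once, at position 5.
labels = [1, 0, 1, 0, 1, 2, 1, 0, 1, 0]
kept, mask = filter_rare_labels(labels)

print(kept)  # [0, 1]
print(mask)  # [True, True, True, True, True, False, True, True, True, True]
```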

## 3. Load Transformer Model

Load the transformer model with the correct `num_labels` inferred from the dataset.

In [3]:
# Load model
model = orl.load_model('bert-base-uncased')
print(f"Model config num_labels: {orl.num_labels}")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model config num_labels: 3
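
`num_labels` has to match the number of distinct classes the classification head predicts, and the `encoded_label` column suggests that labels are first mapped to contiguous integer ids. The sketch below shows that mapping under those assumptions. `build_label_maps` is a hypothetical name, the string labels are invented for illustration, and LinguaFuse may derive these values differently.

```python
def build_label_maps(labels):
    """Map each distinct label to a contiguous integer id, and back."""
    unique = sorted(set(labels))
    label2id = {label: idx for idx, label in enumerate(unique)}
    id2label = {idx: label for label, idx in label2id.items()}
    return label2id, id2label


# Invented example labels; the sample CSV in this notebook already uses integers.
raw_labels = ["negative", "positive", "neutral", "positive", "negative"]
label2id, id2label = build_label_maps(raw_labels)
encoded = [label2id[label] for label in raw_labels]
num_labels = len(label2id)

print(label2id)    # {'negative': 0, 'neutral': 1, 'positive': 2}
print(encoded)     # [0, 2, 1, 2, 0]
print(num_labels)  # 3
```

A count derived this way is the value that would typically be passed as `num_labels` when loading a Hugging Face sequence classification model.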


## 4. Next Steps

- Extend this notebook with a custom training loop over the loaded model and data loaders.
- Try the AWS or AML scopes by passing `AwsDataArguments` or `AmlDataArguments` along with the appropriate credentials.

## 5. Fine-Tune the Model

This section fine-tunes the loaded transformer model with the LinguaFuse framework.  
You define the training arguments (batch size, learning rate, number of epochs, optimizer, and scheduler), and `orl.train()` handles the training loop, evaluation, and model saving.

Make sure the dataset and tokenizer are set up correctly before you start fine-tuning.  
To customize training further, adjust the parameters passed to `TrainerArguments`.
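
The scheduler in the next cell needs the total number of optimizer steps and a warmup length. The sketch below shows how the two interact. It assumes one optimizer step per batch (no gradient accumulation) and mirrors the linear warmup and decay multiplier that `get_linear_schedule_with_warmup` applies to the base learning rate. `total_training_steps` and `linear_warmup_multiplier` are illustrative names, not LinguaFuse or transformers APIs, and `get_training_steps` may compute its value differently.

```python
import math


def total_training_steps(num_examples, batch_size, epochs):
    """Optimizer steps for a full run, assuming one step per batch."""
    return math.ceil(num_examples / batch_size) * epochs


def linear_warmup_multiplier(step, warmup_steps, total_steps):
    """Factor applied to the base learning rate at a given step."""
    if step < warmup_steps:
        # Linear ramp from 0 up to 1 during warmup.
        return step / max(1, warmup_steps)
    # Linear decay from 1 down to 0 over the remaining steps.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


# Illustrative numbers, not taken from the sample dataset.
steps = total_training_steps(num_examples=1000, batch_size=8, epochs=2)
print(steps)  # 250

for step in (0, 5, 10, 130, 250):
    print(step, linear_warmup_multiplier(step, warmup_steps=10, total_steps=steps))
```

If the warmup length is greater than or equal to the total number of steps, the multiplier never reaches 1.0 and the first step runs at a learning rate of zero. On a very small dataset it is worth checking that `WARM_UP` is smaller than `STEPS`.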

In [4]:
# Define LinguaFuse training arguments
LR = 5e-5
BATCH_SIZE = 1
EPOCHS = 1
OPT = AdamW(params=model.parameters(), lr=LR)
WARM_UP = 10  # number of warmup steps; should normally be smaller than STEPS
STEPS = orl.processed_dataset.get_training_steps(epochs=EPOCHS)

lf_training_args = TrainerArguments(
    name="linguafuse_trainer",
    batch_size=BATCH_SIZE,
    learning_rate=LR,
    epochs=EPOCHS,
    save_model=True,
    evaluation_strategy="steps",
    training_data=None, # will be set in orl.train() implicitly
    validation_data=None,
    optimizer=OPT,
    scheduler=get_linear_schedule_with_warmup(optimizer=OPT, num_warmup_steps=WARM_UP, num_training_steps=STEPS),
)

# Fine-tune the model using LinguaFuse framework
orl.train(trainer_args=lf_training_args)

Training data: <torch.utils.data.dataloader.DataLoader object at 0x168879a90>


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
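
The warning above comes from the `tokenizers` library and names its own remedy: set `TOKENIZERS_PARALLELISM` explicitly. One way to do that, as a sketch, is to set the variable near the top of the notebook, before the tokenizer is first used. The value `false` is chosen here on the assumption that the data loader forks worker processes.

```python
import os

# Set this before the tokenizer is first used, for example in the first notebook cell.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])  # false
```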


Example {'input_ids': tensor([[ 101, 3532, 8013,  ...,    0,    0,    0],
        [ 101, 3631, 2044,  ...,    0,    0,    0],
        [ 101, 1996, 8278,  ...,    0,    0,    0],
        ...,
        [ 101, 2023, 4031,  ...,    0,    0,    0],
        [ 101, 2293, 1996,  ...,    0,    0,    0],
        [ 101, 2009, 2573,  ...,    0,    0,    0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]]), 'labels': tensor([0, 0, 1, 1, 0, 1, 1, 2])}
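
In the batch above, each `input_ids` row starts with 101 and ends in zeros, and `attention_mask` is 1 for real tokens and 0 for padding. In the BERT uncased vocabulary, 101 is `[CLS]`, 102 is `[SEP]`, and 0 is `[PAD]`. The following sketch reproduces that layout in plain Python. `pad_batch` is a hypothetical helper, and the short sequences are shortened illustrations built from ids visible in the output, not the real tokenized sentences.

```python
PAD_ID = 0  # [PAD] in the BERT uncased vocabulary


def pad_batch(sequences, max_length, pad_id=PAD_ID):
    """Right-pad token id sequences and build the matching attention masks.

    Sequences longer than `max_length` are simply cut off in this sketch.
    """
    input_ids, attention_mask = [], []
    for seq in sequences:
        seq = list(seq)[:max_length]
        n_pad = max_length - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[101, 3532, 8013, 102], [101, 2293, 102]], max_length=6)

print(batch["input_ids"])       # [[101, 3532, 8013, 102, 0, 0], [101, 2293, 102, 0, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 0, 0]]
```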


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
INFO:linguafuse.trainer:Starting epoch 1/1
INFO:linguafuse.trainer:Starting step iteration...


check the size 1


Training:   0%|          | 0/1 [00:00<?, ?it/s]
Training: 100%|██████████| 1/1 [00:16<00:00, 16.27s/it, loss=loss.item():.4f]
INFO:linguafuse.trainer:Epoch 1 completed with Training loss: 1.1719036102294922
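
A quick sanity check on the reported loss: a classifier that assigns equal probability to each of K classes has a cross-entropy of ln(K). With `num_labels` of 3 that is about 1.0986, so a loss of roughly 1.17 after a single step is close to chance level, which is plausible for a freshly initialized classification head. A sketch of the arithmetic (the function names are illustrative):

```python
import math


def uniform_cross_entropy(num_classes):
    """Cross-entropy of a prediction that spreads probability evenly over all classes."""
    return math.log(num_classes)


def cross_entropy(probabilities, target_index):
    """Cross-entropy for one example, given predicted class probabilities."""
    return -math.log(probabilities[target_index])


baseline = uniform_cross_entropy(3)
print(round(baseline, 4))  # 1.0986

# The same value, computed from an explicit uniform prediction.
print(round(cross_entropy([1 / 3, 1 / 3, 1 / 3], target_index=2), 4))  # 1.0986
```

A loss that stays near this baseline over many steps would suggest the model is not learning, whereas a single step on a handful of examples says little either way.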
