## 📚 Prerequisites

Before starting, ensure your Azure Services are operational, your Conda environment is configured, and your environment variables are set as described in the [README.md](README.md) document.

## 📋 Table of Contents

This notebook provides a practical guide to enhancing the relevance of Phi-3 models through fine-tuning with Azure Machine Learning:

1. [**Introduction to Fine-Tuning**](#define-field-types): Delve into the essentials of fine-tuning and Retrieval Augmented Generation (RAG) for Phi-3 models. This section covers their importance, benefits, and the strategic approach to customizing language models, offering insights into their technical aspects and real-world applications.
2. [**Exploring the Phi-3 Model Universe**](#exploring-the-phi-3-model-universe): An overview of the expanded Phi-3 model family, including parameter counts, context lengths, and typical use cases.
3. [**Use Case: Enhancing Query Retrieval through Phi-3 Fine-Tuning**](#use-case-enhancing-query-retrieval-through-phi-3-fine-tuning): Learn how a fine-tuned small language model (SLM) can improve search by classifying queries and routing each one to the most suitable retrieval strategy.

For additional information, refer to the following resources:
- [Phi-3 Release Documentation](https://azure.microsoft.com/en-us/blog/introducing-phi-3-redefining-whats-possible-with-slms/)

In [1]:
import os

# Define the target directory
target_directory = r"C:\Users\pablosal\Desktop\gbba-ai-small-language-models"  # change your directory here

# Check if the directory exists
if os.path.exists(target_directory):
    # Change the current working directory
    os.chdir(target_directory)
    print(f"Directory changed to {os.getcwd()}")
else:
    print(f"Directory {target_directory} does not exist.")

Directory changed to C:\Users\pablosal\Desktop\gbba-ai-small-language-models


## Exploring the Phi-3 Model Universe

Before we dive in, I highly recommend getting familiar with the nuances of fine-tuning Small Language Models (SLMs), especially focusing on the Phi-3 model. This deep dive will not only introduce you to the flexibility of SLMs in managing real-time applications and niche projects but also showcase how they can be tailored for custom solutions. For a comprehensive guide, check out: [Task Adaptation in Small Language Models: Fine-Tuning Phi-3](https://pabloaicorner.hashnode.dev/task-adaptation-in-small-language-models-fine-tuning-phi-3).

Looking for more resources? Don't miss the [Fine-Tuning with Azure Machine Learning Documentation](https://github.com/Azure/azure-llm-fine-tuning/tree/main/fine-tuning) for additional insights and guidance.

### Phi-3 Versions

| Model         | Parameters | Context Lengths | Capabilities                                                                                   | Use Cases                                                                                                         |
|---------------|------------|-----------------|------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------|
| **Phi-3-Vision** | 4.2 billion | 128K            | A multimodal marvel that seamlessly integrates language and vision.                            | Ideal for interpreting real-world images and digital documents, extracting text from visuals, and analyzing charts and diagrams. |
| **Phi-3-Small**  | 7 billion   | 128K, 8K        | A versatile language model excelling in language, reasoning, coding, and math benchmarks.      | Offers unparalleled performance, setting a new standard for efficiency and cost-effectiveness.                    |
| **Phi-3-Medium** | 14 billion  | 128K, 4K        | A language model that continues to outshine larger models in understanding and reasoning tasks. | Showcases exceptional capabilities in language understanding, reasoning tasks, and coding benchmarks.             |
| **Phi-3-Mini**   | 3.8 billion | 128K, 4K        | Introduced on April 23, 2024, excels in long-context scenarios for its size.                   | Perfect for reasoning tasks and is readily accessible via Models-as-a-Service (MaaS).                             |

### 💡 Benefits of Phi-3 Models

- **Performance and Efficiency:** The Phi-3 family is engineered to surpass models within the same parameter range, delivering top-tier performance in both language and vision tasks across a multitude of applications.
- **Cost Efficiency:** Models like Phi-3-mini and Phi-3-small offer budget-friendly solutions without sacrificing performance, ideal for environments where computational resources are limited.
- **Versatility:** With capabilities spanning natural language processing, coding, math, and multimodal tasks, the Phi-3 models are adept at handling both general-purpose and specialized applications.


## Use Case: Enhancing Query Retrieval through Phi-3 Fine-Tuning

Our goal is to improve our query retrieval system so that it categorizes user queries accurately and selects the most suitable retrieval strategy for each one. With this fine-tuning, we aim to improve the user experience while reducing operational expenses.


#### 🌐 Phi-3 Models in the Wild: Versatility at Its Best

Phi-3 models stand out for their exceptional adaptability, making them perfect candidates for a myriad of real-world applications:

- **Efficiency on the Edge:** Tailored for performance in environments with limited computational resources, these models are a boon for language processing tasks on edge devices.
  
- **Instant Insights, Anywhere:** Phi-3 models are designed for rapid, local inference, ideal for powering responsive applications in offline settings, such as customer support chatbots.
  
- **Tailored for Every Domain:** From summarizing extensive documents to generating engaging narratives, Phi-3 models are adept at specialized tasks across various industries.

#### 🎯 Retrieval Revolution: Strategies Unveiled

For background on how these strategies compare, see:
- [Azure AI Search: Outperforming Vector Search with Hybrid Approaches](https://techcommunity.microsoft.com/t5/ai-azure-ai-services-blog/azure-ai-search-outperforming-vector-search-with-hybrid/ba-p/3929167)

Our retrieval system combines four strategies to maximize relevance and accuracy across query types:

1. **Keyword Retrieval:** The cornerstone of traditional search, effective for straightforward queries but less so for complex or misspelled inputs.

2. **Vector Retrieval:** A modern twist using embeddings to understand semantic similarities, overcoming the limitations of keyword-only searches.

3. **Hybrid Retrieval:** A best-of-both-worlds strategy that combines keyword and vector methods to cover all bases, ensuring no query is left behind.

4. **Semantic Ranking:** The final touch, reordering results to prioritize relevance, leveraging deep learning models inspired by Microsoft Bing's algorithms.
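
To make the hybrid strategy above concrete, here is a minimal sketch of Reciprocal Rank Fusion (RRF), the method Azure AI Search documents for merging keyword and vector result lists. It is an illustration only, not the service's implementation: the document ids are made up, and `k=60` is the value conventionally used with RRF.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into a single ranking.

    Each document receives 1 / (k + rank) from every list it appears in,
    where rank starts at 1 for the top result.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from a keyword search and a vector search.
keyword_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Documents that rank well in both lists rise to the top, which is why a hybrid query can recover results that either method alone would rank poorly.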

#### 🎢 Strategy and Objectives: Navigating the Path to Excellence

Our roadmap is clear: harness the Phi-3 model to categorize user queries with razor-sharp accuracy, ensuring each query activates the most suitable retrieval mechanism. This strategy promises to significantly enhance system performance. Key milestones include:

- **Crafting the Perfect Dataset:** A curated collection of data designed to boost the Phi-3 model's query classification prowess.
  
- **Building a Superior Retrieval System:** A cutting-edge system that marries keyword and vector search methods with semantic ranking for unmatched relevance.
  
- **Setting the Performance Bar High:** Rigorous evaluation using leading metrics to ensure our system sets new standards for effectiveness.
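
As a rough sketch of the routing idea described above, the snippet below parses a category out of a model reply and maps it to a retrieval strategy. The reply format mirrors the assistant message used in the training script later in this notebook. The mapping itself, the strategy names, and the fallback are hypothetical placeholders that would need to be chosen from evaluation results.

```python
import re

# Hypothetical mapping from query category to retrieval strategy.
STRATEGY_BY_CATEGORY = {
    "Keyword queries": "keyword",
    "Exact snippet search": "keyword",
    "Concept Seeking queries": "hybrid_semantic",
    "Queries with misspellings": "vector",
}
# Assumed fallback when the category is missing or unknown.
DEFAULT_STRATEGY = "hybrid_semantic"

# Matches replies of the form: This query is a '<category>' type.
LABEL_PATTERN = re.compile(r"This query is a '(.+?)' type\.")


def route_query(model_reply):
    """Return the retrieval strategy for a classifier reply."""
    match = LABEL_PATTERN.search(model_reply)
    if match is None:
        return DEFAULT_STRATEGY
    return STRATEGY_BY_CATEGORY.get(match.group(1), DEFAULT_STRATEGY)


print(route_query("This query is a 'Keyword queries' type."))
```

Falling back to a single default strategy keeps the system usable when the classifier produces an unexpected label.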

#### 🚧 From Blueprint to Reality: The Journey Ahead

We're set to embark on a journey of fine-tuning and system integration, with the following phases:

1. **Dataset Curation:** Assembling a rich dataset to train the Phi-3 model, covering a wide spectrum of query types.
   
2. **Precision Fine-Tuning:** Employing QLoRA to fine-tune the Phi-3 model, enhancing its ability to accurately classify and route queries.
   
3. **Seamless System Integration:** Incorporating the refined model into our retrieval framework, tailored to adapt to diverse search scenarios.
   
4. **Benchmarking Success:** Evaluating our system's performance in real-world and controlled environments to confirm our anticipated gains in accuracy, responsiveness, and efficiency.

With this project, we aim to improve the precision and efficiency of our retrieval system and to show how far fine-tuning of the Phi-3 model can take query classification.

## 0. Setup Azure Machine Learning (AML) Requirements

To establish a connection with an Azure Machine Learning workspace, specific identifying parameters are required: a subscription ID, a resource group name, and a workspace name. These details will be utilized with the `MLClient` from `azure.ai.ml` to access the desired Azure Machine Learning workspace. For authentication, the default Azure credentials will be employed in this hands-on guide.
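
A minimal sketch of that connection, assuming the three environment variable names used in this notebook and the `azure-ai-ml` and `azure-identity` packages. The helper names `workspace_params` and `connect_to_workspace` are illustrative and not part of the repository.

```python
import os

REQUIRED_VARS = ("AZURE_SUBSCRIPTION_ID", "AZURE_RESOURCE_GROUP", "AZURE_WORKSPACE")


def workspace_params(env=os.environ):
    """Collect the workspace identifiers, failing early if any are missing."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise ValueError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "subscription_id": env["AZURE_SUBSCRIPTION_ID"],
        "resource_group_name": env["AZURE_RESOURCE_GROUP"],
        "workspace_name": env["AZURE_WORKSPACE"],
    }


def connect_to_workspace(env=os.environ):
    """Build an MLClient with the default Azure credential chain."""
    # Imported lazily so workspace_params() can be used without the Azure SDK installed.
    from azure.ai.ml import MLClient
    from azure.identity import DefaultAzureCredential

    return MLClient(credential=DefaultAzureCredential(), **workspace_params(env))
```

Validating the variables before creating the client turns a missing `.env` entry into an immediate, readable error rather than a failure later in the run.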

In [2]:
import os

import yaml
from dotenv import load_dotenv
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential
from azure.ai.ml import MLClient, Input, Output, command, load_component
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.dsl import pipeline
from azure.ai.ml.entities import Data

# Load environment variables from .env file
load_dotenv()

# Gather values from .env
AZURE_SUBSCRIPTION_ID = os.getenv('AZURE_SUBSCRIPTION_ID')
AZURE_RESOURCE_GROUP = os.getenv('AZURE_RESOURCE_GROUP')
AZURE_WORKSPACE = os.getenv('AZURE_WORKSPACE')

In [3]:
import yaml

# safe_load is sufficient for a plain configuration file and avoids constructing arbitrary objects
with open(r'src\finetuning\phi3\config.yaml') as f:
    config = yaml.safe_load(f)

# Configuration for Azure and model setup
azure_config = {
    'data_directory': config['config']['data_directory'],
    'cloud_storage_directory': config['config']['cloud_storage_directory'],
    'model_identifier': config['config']['model_identifier'],
    'debug_mode_enabled': config['config']['debug_mode_enabled'],
    'use_low_priority_vm_option': config['config']['use_low_priority_vm_option']
}

# Training environment settings on Azure
azure_training_settings = {
    'azure_environment_name': config['training_settings']['azure_environment_name'],
    'azure_compute_cluster_name': config['training_settings']['azure_compute_cluster_name'],
    'azure_compute_cluster_vm_size': config['training_settings']['azure_compute_cluster_vm_size']
}
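
The cell above raises a bare `KeyError` if `config.yaml` lacks one of the expected entries. As an optional sketch, the helper below reports every missing key at once. The key names are taken from the cell above; the helper name `missing_config_keys` and the sample values are illustrative.

```python
REQUIRED_KEYS = {
    "config": [
        "data_directory",
        "cloud_storage_directory",
        "model_identifier",
        "debug_mode_enabled",
        "use_low_priority_vm_option",
    ],
    "training_settings": [
        "azure_environment_name",
        "azure_compute_cluster_name",
        "azure_compute_cluster_vm_size",
    ],
}


def missing_config_keys(config):
    """Return the dotted names of all required keys absent from the parsed YAML."""
    missing = []
    for section, keys in REQUIRED_KEYS.items():
        section_values = config.get(section) or {}
        for key in keys:
            if key not in section_values:
                missing.append(f"{section}.{key}")
    return missing


# Placeholder configuration that is missing two entries.
sample = {
    "config": {
        "data_directory": "data",
        "cloud_storage_directory": "cloud",
        "model_identifier": "some-model",
        "debug_mode_enabled": False,
    },
    "training_settings": {
        "azure_environment_name": "env",
        "azure_compute_cluster_name": "cluster",
    },
}
print(missing_config_keys(sample))
```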

In [4]:
from src.aml_helper import AMLManager

ml_manager = AMLManager(AZURE_SUBSCRIPTION_ID, 
                        AZURE_RESOURCE_GROUP, 
                        AZURE_WORKSPACE)

In [5]:
ml_manager.get_or_create_environment_asset(azure_training_settings['azure_environment_name'],
                                           conda_yml="src/finetuning/phi3/env/conda.yaml",
                                           update=True)

2024-07-10 17:12:33,419 - micro - MainProcess - ERROR    Exception: Found Environment asset, but will update the Environment. (aml_helper.py:get_or_create_environment_asset:67)
Traceback (most recent call last):
  File "C:\Users\pablosal\Desktop\gbba-ai-small-language-models\src\aml_helper.py", line 62, in get_or_create_environment_asset
    raise ResourceExistsError('Found Environment asset, but will update the Environment.')
azure.core.exceptions.ResourceExistsError: Found Environment asset, but will update the Environment.
2024-07-10 17:12:34,635 - micro - MainProcess - INFO     Created Environment asset: llm-finetuning-phi3-2024-07-10. (aml_helper.py:get_or_create_environment_asset:75)


Environment({'arm_type': 'environment_version', 'latest_version': None, 'image': 'mcr.microsoft.com/azureml/curated/acft-hf-nlp-gpu:latest', 'intellectual_property': None, 'is_anonymous': False, 'auto_increment_version': False, 'auto_delete_setting': None, 'name': 'llm-finetuning-phi3-2024-07-10', 'description': 'Environment created for llm fine-tuning.', 'tags': {}, 'properties': {'azureml.labels': 'latest'}, 'print_as_yaml': False, 'id': '/subscriptions/1a4bb722-f155-4502-8033-022a9eb1481b/resourceGroups/dev/providers/Microsoft.MachineLearningServices/workspaces/ml-workspace-dev-eastus-001/environments/llm-finetuning-phi3-2024-07-10/versions/2', 'Resource__source_path': '', 'base_path': 'C:\\Users\\pablosal\\Desktop\\gbba-ai-small-language-models', 'creation_context': <azure.ai.ml.entities._system_data.SystemData object at 0x000001D1FB358C40>, 'serialize': <msrest.serialization.Serializer object at 0x000001D1FB358A30>, 'version': '2', 'conda_file': {'channels': ['conda-forge'], 'depe

In [6]:
# Create or reuse the compute cluster
compute = ml_manager.create_or_reuse_compute_cluster(
    azure_training_settings['azure_compute_cluster_name'], 
    azure_training_settings['azure_compute_cluster_vm_size'],
    tier='lowpriority',
    max_instances=1)

2024-07-10 17:15:02,402 - micro - MainProcess - INFO     Compute cluster 'gpu-cluster-phi3' does not exist. Creating a new one with size 'Standard_NC12s_v3', tier 'lowpriority', and max instances '1'. (aml_helper.py:create_or_reuse_compute_cluster:129)
2024-07-10 17:15:34,055 - micro - MainProcess - INFO     Compute cluster 'gpu-cluster-phi3' created successfully. (aml_helper.py:create_or_reuse_compute_cluster:139)


## 1. Dataset Curation

This section outlines the approach for generating training and evaluation datasets, specifically tailored for query categorization and retrieval tasks. The process leverages the "Contoso Employee Handbook" as the source document, ensuring relevance and consistency across query categories.

#### Step 1: Categorization of Queries
To accommodate diverse query types, we've delineated the following categories:

- **Concept Seeking Queries**: Abstract inquiries necessitating elaborate responses.
- **Exact Snippet Search**: Precise, lengthy queries directly extracted from the source text.
- **Web Search-like Queries**: Concise queries mimicking typical search engine inputs.
- **Low Query/Doc Term Overlap**: Queries and answers with minimal lexical similarity.
- **Fact Seeking Queries**: Direct questions expecting singular, definitive answers.
- **Keyword Queries**: Brief, focused queries emphasizing critical terms.
- **Queries with Misspellings**: Queries incorporating common spelling errors.
- **Long Queries**: Inquiries exceeding 20 tokens in length.
- **Medium Queries**: Queries spanning between 5 and 20 tokens.
- **Short Queries**: Queries comprising fewer than 5 tokens.
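
The three length-based categories can be assigned mechanically. The sketch below assumes that a "token" is a whitespace-separated word, which the text does not specify; a model tokenizer would give different counts.

```python
def length_bucket(query):
    """Assign a length category using whitespace-separated words as tokens (an assumption)."""
    n_tokens = len(query.split())
    if n_tokens < 5:
        return "Short Queries"
    if n_tokens <= 20:
        return "Medium Queries"
    return "Long Queries"


print(length_bucket("vacation policy"))
print(length_bucket("How does Contoso define at-will employment?"))
```

The remaining categories describe intent or wording rather than length, so they cannot be derived from a simple rule like this one.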

**Source Document**
The foundational text for this exercise is the "Contoso Employee Handbook," a comprehensive guide detailing the policies and procedures for Contoso employees.

#### Step 2: Training Dataset Generation
For each query category, I used GPT-4 to craft 20 distinct questions derived from the "Contoso Employee Handbook," ensuring a broad representation of potential inquiries. Each question was labeled with the category it belongs to.


#### Step 3: Evaluation Dataset Creation
To assess the model's performance, I again used GPT-4 to formulate 5 unique questions per category, distinct from the training set yet pertinent to the "Contoso Employee Handbook."
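
Because both sets are generated by a language model, it is worth checking that each category has the expected number of questions and that no evaluation question also appears in the training set. The following is a small sketch using plain `(question, category)` pairs; the helper name and the sample rows are illustrative.

```python
from collections import Counter


def check_split(train_rows, eval_rows):
    """Return per-category counts for both sets and any evaluation questions seen in training.

    Rows are (question, category) pairs; questions are compared case-insensitively.
    """
    train_counts = Counter(category for _, category in train_rows)
    eval_counts = Counter(category for _, category in eval_rows)
    train_questions = {question.strip().lower() for question, _ in train_rows}
    leaked = [q for q, _ in eval_rows if q.strip().lower() in train_questions]
    return train_counts, eval_counts, leaked


# Made-up rows for illustration only.
train_rows = [
    ("How does Contoso define at-will employment?", "Concept Seeking queries"),
    ("contoso vacation days", "Keyword queries"),
]
eval_rows = [
    ("How does Contoso define at-will employment? ", "Concept Seeking queries"),
    ("contoso sick leave", "Keyword queries"),
]
train_counts, eval_counts, leaked = check_split(train_rows, eval_rows)
print(train_counts, eval_counts, leaked)
```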



In [7]:
import pandas as pd
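df_train = pd.read_csv(r"src\finetuning\phi3\data\retrieval_train.csv")
# NOTE: this reads the training CSV a second time; point it at the evaluation CSV to get a separate validation set.
df_val = pd.read_csv(r"src\finetuning\phi3\data\retrieval_train.csv")
df_train.head()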

Unnamed: 0,Question,Kind of Query
0,What is the purpose of the Contoso Employee Ha...,Concept Seeking queries
1,How does Contoso define at-will employment?,Concept Seeking queries
2,What are the classifications of employees at C...,Concept Seeking queries
3,Explain Contoso's policy on equal employment o...,Concept Seeking queries
4,What does Contoso's confidentiality policy ent...,Concept Seeking queries


## 2. Creating the Training Script

In [8]:
%%writefile src/finetuning/phi3/train_mlflow.py
import argparse
import logging
import os
import sys
from datetime import datetime
from typing import List, Dict

import datasets
import mlflow
import pandas as pd
import torch
import transformers
from datasets import load_dataset
from mlflow.models import infer_signature
from mlflow.models.signature import ModelSignature
from mlflow.types import DataType, Schema, ColSpec
from peft import LoraConfig
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    PreTrainedTokenizer,  # used in the apply_chat_template type annotation
    TrainingArguments,
)
from trl import SFTTrainer

logger = logging.getLogger(__name__)

def load_model(model_name_or_path: str = "microsoft/Phi-3-mini-4k-instruct",
               use_cache: bool = False,
               trust_remote_code: bool = True,
               torch_dtype: torch.dtype = torch.bfloat16,
               device_map: dict = None,
               max_seq_length: int = 4096) -> tuple:
    """
    Loads a pre-trained model and its tokenizer with specified configurations.

    Parameters:
    - model_name_or_path (str): Identifier for the model to load. Can be a model ID or path.
    - use_cache (bool): Whether to use caching for model outputs.
    - trust_remote_code (bool): Whether to trust remote code when loading the model.
    - torch_dtype (torch.dtype): Data type for model tensors. Recommended to use torch.bfloat16 for efficiency.
    - device_map (dict): Custom device map for distributing the model's layers across devices.
    - max_seq_length (int): Maximum sequence length for the tokenizer.

    Returns:
    - tuple: A tuple containing the loaded model and tokenizer.
    """
    model_kwargs = {
        "use_cache": use_cache,
        "trust_remote_code": trust_remote_code,
        "torch_dtype": torch_dtype,
        "device_map": device_map
    }
    model = AutoModelForCausalLM.from_pretrained(model_name_or_path, **model_kwargs)
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
    tokenizer.model_max_length = max_seq_length
    tokenizer.pad_token = tokenizer.unk_token  # use unk rather than eos token to prevent endless generation
    tokenizer.pad_token_id = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
    tokenizer.padding_side = "right"
    return model, tokenizer

def convert_to_chat_format(df: pd.DataFrame) -> List[List[Dict[str, str]]]:
    """
    Converts a DataFrame containing questions and their types into a chat format.

    Parameters:
    - df (pd.DataFrame): A DataFrame with at least two columns: 'Question' and 'Kind of Query'.
      Each row represents a user query and its categorization.

    Returns:
    - List[List[Dict[str, str]]]: A list of chats, where each chat is a list of messages.
      Each message is a dictionary with 'role' and 'content' keys. The 'role' can be 'system',
      'user', or 'assistant', indicating the sender of the message. The 'content' is the text of the message.
    """
    chats = []

    for index, row in df.iterrows():
        chat = [
            {
                "role": "system",
                "content": "You are an AI assistant supporting users by categorizing their queries."
            },
            {
                "role": "user",
                "content": row["Question"]
            },
            {
                "role": "assistant",
                "content": f"This query is a '{row['Kind of Query']}' type."
            }
        ]
        chats.append(chat)
    
    return chats

def convert_chats_to_dataframe(chats: List[List[Dict[str, str]]]) -> pd.DataFrame:
    """
    Converts a list of chats into a DataFrame where each chat is represented as a dictionary in the 'message' column.

    Parameters:
    - chats (List[List[Dict[str, str]]]): A list of chats, where each chat is a list of messages.
      Each message is a dictionary with 'role' and 'content' keys.

    Returns:
    - pd.DataFrame: A DataFrame with a single column 'message', where each row contains a dictionary
      representing a chat.
    """
    # Convert each chat into a dictionary and store it in a list
    chat_dicts = [{'message': chat} for chat in chats]
    
    # Create a DataFrame from the list of dictionaries
    df = pd.DataFrame(chat_dicts)
    
    return df


def apply_chat_template(
    example: dict,
    tokenizer: PreTrainedTokenizer,
) -> dict:
    """
    Applies a chat template to the messages in an example from a dataset.

    This function modifies the input example by adding a system message at the beginning
    if it does not already start with one. It then applies a chat template formatting
    using the specified tokenizer.

    Parameters:
    - example (dict): A dictionary representing a single example from a dataset. It must
      contain a key 'message', which is a list of message dictionaries. Each message
      dictionary should have 'role' and 'content' keys.
    - tokenizer (PreTrainedTokenizer): An instance of a tokenizer that supports the
      `apply_chat_template` method for formatting chat messages.

    Returns:
    - dict: The modified example dictionary with an added 'text' key that contains the
      formatted chat as a string.
    """
    messages = example["message"]
    # Add an empty system message if there is none
    if messages[0]["role"] != "system":
        messages.insert(0, {"role": "system", "content": ""})
    example["text"] = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=False)
    return example

def main(args):
    
    ###################
    # Hyper-parameters
    ###################
    # Only overwrite environ if wandb param passed
    if len(args.wandb_project) > 0:
        os.environ['WANDB_API_KEY'] = args.wandb_api_key    
        os.environ["WANDB_PROJECT"] = args.wandb_project
    if len(args.wandb_watch) > 0:
        os.environ["WANDB_WATCH"] = args.wandb_watch
    if len(args.wandb_log_model) > 0:
        os.environ["WANDB_LOG_MODEL"] = args.wandb_log_model

    use_wandb = len(args.wandb_project) > 0 or ("WANDB_PROJECT" in os.environ and len(os.environ["WANDB_PROJECT"]) > 0) 

    training_config = {
        "bf16": True,
        "do_eval": False,
        "learning_rate": args.learning_rate,
        "log_level": "info",
        "logging_steps": args.logging_steps,
        "logging_strategy": "steps",
        "lr_scheduler_type": args.lr_scheduler_type,
        "num_train_epochs": args.epochs,
        "max_steps": -1,
        "output_dir": args.output_dir,
        "overwrite_output_dir": True,
        "per_device_train_batch_size": args.train_batch_size,
        "per_device_eval_batch_size": args.eval_batch_size,
        "remove_unused_columns": True,
        "save_steps": args.save_steps,
        "save_total_limit": 1,
        "seed": args.seed,
        "gradient_checkpointing": True,
        "gradient_checkpointing_kwargs": {"use_reentrant": False},
        "gradient_accumulation_steps": args.grad_accum_steps,
        "warmup_ratio": args.warmup_ratio,
    }

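    peft_config = {
        "r": args.lora_r,
        "lora_alpha": args.lora_alpha,
        "lora_dropout": args.lora_dropout,
        "bias": "none",
        "task_type": "CAUSAL_LM",
        # Note: the Hugging Face Phi-3 implementation fuses some projections (qkv_proj, gate_up_proj),
        # so only the names below that exist in the loaded model are adapted; "all-linear" targets every linear layer.
        #"target_modules": "all-linear",
        "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
        "modules_to_save": None,
    }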

    checkpoint_dir = os.path.join(args.output_dir, "checkpoints")

    train_conf = TrainingArguments(
        **training_config,
        report_to="wandb" if use_wandb else "azure_ml",
        run_name=args.wandb_run_name if use_wandb else None,    
    )
    peft_conf = LoraConfig(**peft_config)
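    # load_model expects a model id or path, not the argparse namespace; use its default model id
    model, tokenizer = load_model(max_seq_length=args.max_seq_length)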

    ###############
    # Setup logging
    ###############
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",
        handlers=[logging.StreamHandler(sys.stdout)],
    )
    log_level = train_conf.get_process_log_level()
    logger.setLevel(log_level)
    datasets.utils.logging.set_verbosity(log_level)
    transformers.utils.logging.set_verbosity(log_level)
    transformers.utils.logging.enable_default_handler()
    transformers.utils.logging.enable_explicit_format()

    # Log on each process a small summary
    logger.warning(
        f"Process rank: {train_conf.local_rank}, device: {train_conf.device}, n_gpu: {train_conf.n_gpu}"
        + f" distributed training: {bool(train_conf.local_rank != -1)}, 16-bits training: {train_conf.fp16}"
    )
    logger.info(f"Training/evaluation parameters {train_conf}")
    logger.info(f"PEFT parameters {peft_conf}")    

    ##################
    # Data Processing
    ##################

    # Build chat-formatted datasets from the CSV files (columns: 'Question', 'Kind of Query')
    train_data = pd.read_csv(args.train_dir)
    train_chats = convert_to_chat_format(train_data)
    train_dataset = datasets.Dataset.from_pandas(convert_chats_to_dataframe(train_chats))

    # `--eval_data` is not defined in parse_args(); until it is, fall back to the training CSV
    eval_data = pd.read_csv(getattr(args, "eval_data", args.train_dir))
    eval_chats = convert_to_chat_format(eval_data)
    eval_dataset = datasets.Dataset.from_pandas(convert_chats_to_dataframe(eval_chats))

    # Drop the raw 'message' column once the formatted 'text' column has been added
    column_names = list(train_dataset.features)

    processed_train_dataset = train_dataset.map(
        apply_chat_template,
        fn_kwargs={"tokenizer": tokenizer},
        num_proc=10,
        remove_columns=column_names,
        desc="Applying chat template to train_sft",
    )

    processed_eval_dataset = eval_dataset.map(
        apply_chat_template,
        fn_kwargs={"tokenizer": tokenizer},
        num_proc=10,
        remove_columns=column_names,
        desc="Applying chat template to test_sft",
    )

    with mlflow.start_run() as run:        
        ###########
        # Training
        ###########
        trainer = SFTTrainer(
            model=model,
            args=train_conf,
            peft_config=peft_conf,
            train_dataset=processed_train_dataset,
            eval_dataset=processed_eval_dataset,
            max_seq_length=args.max_seq_length,
            dataset_text_field="text",
            tokenizer=tokenizer,
            packing=True,
        )

        # Show current memory stats
        gpu_stats = torch.cuda.get_device_properties(0)
        start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
        max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
        logger.info(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
        logger.info(f"{start_gpu_memory} GB of memory reserved.")
        
        last_checkpoint = None
        if os.path.isdir(checkpoint_dir):
            checkpoints = [os.path.join(checkpoint_dir, d) for d in os.listdir(checkpoint_dir)]
            if len(checkpoints) > 0:
                checkpoints.sort(key=os.path.getmtime, reverse=True)
                last_checkpoint = checkpoints[0]        

        trainer_stats = trainer.train(resume_from_checkpoint=last_checkpoint)

        #############
        # Logging
        #############
        metrics = trainer_stats.metrics

        # Show final memory and time stats 
        used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
        used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
        used_percentage = round(used_memory         /max_memory*100, 3)
        lora_percentage = round(used_memory_for_lora/max_memory*100, 3)

        logger.info(f"{metrics['train_runtime']} seconds used for training.")
        logger.info(f"{round(metrics['train_runtime']/60, 2)} minutes used for training.")
        logger.info(f"Peak reserved memory = {used_memory} GB.")
        logger.info(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
        logger.info(f"Peak reserved memory % of max memory = {used_percentage} %.")
        logger.info(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
                
        trainer.log_metrics("train", metrics)

        model_info = mlflow.transformers.log_model(
            transformers_model={"model": trainer.model, "tokenizer": tokenizer},
            #prompt_template=prompt_template,
            #signature=signature,
            artifact_path=args.model_dir,  # This is a relative path to save model files within MLflow run
        )


def parse_args():
    # setup argparse
    parser = argparse.ArgumentParser()
    # curr_time = datetime.now().strftime("%Y-%m-%d_%H:%M:%S")

    # hyperparameters
    parser.add_argument("--train_dir", default="data", type=str, help="Input directory for training")
    parser.add_argument("--model_dir", default="./model", type=str, help="output directory for model")
    parser.add_argument("--epochs", default=1, type=int, help="number of epochs")
    parser.add_argument("--output_dir", default="./output_dir", type=str, help="directory to temporarily store when training a model")
    parser.add_argument("--train_batch_size", default=8, type=int, help="training - mini batch size for each gpu/process")
    parser.add_argument("--eval_batch_size", default=8, type=int, help="evaluation - mini batch size for each gpu/process")
    parser.add_argument("--learning_rate", default=5e-06, type=float, help="learning rate")
    parser.add_argument("--logging_steps", default=2, type=int, help="logging steps")
    parser.add_argument("--save_steps", default=100, type=int, help="save steps")    
    parser.add_argument("--grad_accum_steps", default=4, type=int, help="gradient accumulation steps")
    parser.add_argument("--lr_scheduler_type", default="linear", type=str)
    parser.add_argument("--seed", default=0, type=int, help="seed")
    parser.add_argument("--warmup_ratio", default=0.2, type=float, help="warmup ratio")
    parser.add_argument("--max_seq_length", default=2048, type=int, help="max seq length")
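    # store_true avoids the argparse type=bool pitfall, where any non-empty string (even "False") parses as True
    parser.add_argument("--save_merged_model", action="store_true")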

    # lora hyperparameters
    parser.add_argument("--lora_r", default=16, type=int, help="lora r")
    parser.add_argument("--lora_alpha", default=16, type=int, help="lora alpha")
    parser.add_argument("--lora_dropout", default=0.05, type=float, help="lora dropout")
    
    # wandb params
    parser.add_argument("--wandb_api_key", type=str, default="")
    parser.add_argument("--wandb_project", type=str, default="")
    parser.add_argument("--wandb_run_name", type=str, default="")
    parser.add_argument("--wandb_watch", type=str, default="gradients") # options: false | gradients | all
    parser.add_argument("--wandb_log_model", type=str, default="false") # options: false | true

    # parse args
    args = parser.parse_args()

    # return args
    return args

if __name__ == "__main__":
    #sys.argv = ['']
    args = parse_args()
    main(args)


Overwriting src/finetuning/phi3/train_mlflow.py
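
To give a feel for the defaults in `parse_args()` (`--lora_r 16`, `--lora_alpha 16`, `--train_batch_size 8`, `--grad_accum_steps 4`), the sketch below works through the standard LoRA arithmetic. The 3072 x 3072 layer size is only an illustrative example, and the helper names are not part of the script.

```python
def lora_trainable_params(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in) and B is (d_out x r)."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha, r):
    """Standard LoRA scaling factor applied to the low-rank update."""
    return lora_alpha / r


def effective_batch_size(per_device_batch, grad_accum_steps, n_devices=1):
    """Number of sequences contributing to each optimizer step."""
    return per_device_batch * grad_accum_steps * n_devices


# Illustrative layer size; the defaults below come from parse_args().
adapter_params = lora_trainable_params(d_in=3072, d_out=3072, r=16)
full_params = 3072 * 3072

print(adapter_params)                      # 16 * (3072 + 3072) = 98304
print(adapter_params / full_params)        # fraction of the full weight matrix
print(lora_scaling(lora_alpha=16, r=16))   # 1.0
print(effective_batch_size(8, 4))          # 32
```

Note that the trainer is created with `packing=True`, so each sequence in a batch may contain several short examples concatenated together; the effective batch size counts packed sequences, not individual queries.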


In [None]:
from azure.ai.ml import MLClient, Input, command
from azure.identity import DefaultAzureCredential

# Workspace client built from the values loaded from .env earlier in this notebook
ml_client = MLClient(DefaultAzureCredential(), AZURE_SUBSCRIPTION_ID, AZURE_RESOURCE_GROUP, AZURE_WORKSPACE)

# NOTE: `AZURE_DATA_NAME` (the registered Data asset name) and `d` (a dict holding the training
# hyperparameters under d['train']) are not defined in the cells shown above; define them before running.
job = command(
    inputs=dict(
        #train_dir=Input(type="uri_folder", path=DATA_DIR), # Get data from local path
        train_dir=Input(path=f"{AZURE_DATA_NAME}@latest"),  # Get data from Data asset
        epoch=d['train']['epoch'],
        train_batch_size=d['train']['train_batch_size'],
        eval_batch_size=d['train']['eval_batch_size'],
        model_dir=d['train']['model_dir']
    ),
    code="./src/finetuning/phi3",  # local folder that contains train_mlflow.py
    compute=azure_training_settings['azure_compute_cluster_name'],
    command="python train_mlflow.py --train_dir ${{inputs.train_dir}} --epochs ${{inputs.epoch}} --train_batch_size ${{inputs.train_batch_size}} --eval_batch_size ${{inputs.eval_batch_size}} --model_dir ${{inputs.model_dir}}",
    #environment="azureml://registries/azureml/environments/acft-hf-nlp-gpu/versions/57", # Use built-in Environment asset
    environment=f"{azure_training_settings['azure_environment_name']}@latest",
    distribution={
        "type": "PyTorch",
        "process_count_per_instance": 1, # For multi-gpu training set this to an integer value more than 1
    },
)
returned_job = ml_client.jobs.create_or_update(job)
ml_client.jobs.stream(returned_job.name)
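
The command string in the job above pairs each script flag with an `${{inputs.<name>}}` placeholder that Azure Machine Learning fills in at run time. The flag and the input name do not have to match: here the `epoch` input feeds the `--epochs` flag. As a small sketch, a hypothetical helper such as `build_command` can assemble the string from a mapping and keep the two in sync.

```python
def build_command(script, flag_to_input):
    """Assemble a job command line from a {script flag: job input name} mapping."""
    parts = ["python " + script]
    for flag, input_name in flag_to_input.items():
        # Produces, for example: --epochs ${{inputs.epoch}}
        parts.append("--" + flag + " ${{inputs." + input_name + "}}")
    return " ".join(parts)


command_line = build_command(
    "train_mlflow.py",
    {
        "train_dir": "train_dir",
        "epochs": "epoch",
        "train_batch_size": "train_batch_size",
        "eval_batch_size": "eval_batch_size",
        "model_dir": "model_dir",
    },
)
print(command_line)
```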