## 📚 Prerequisites

Before starting, ensure your Azure Services are operational, your Conda environment is configured, and your environment variables are set as described in the [README.md](README.md) document.

## 📋 Table of Contents

This notebook provides a practical guide to enhancing the relevance of Phi-3 models through fine-tuning with Azure Machine Learning:

1. [**Introduction to Fine-Tuning**](#intro-to-fine-tuning-phi-3-models): Covers the essentials of fine-tuning and Retrieval Augmented Generation (RAG) for Phi-3 models: why they matter, their benefits, and where each fits in the hierarchy of language model customization.
2. [**Exploring the Phi-3 Model Universe**](#exploring-the-phi-3-model-universe): Compares the models in the expanded Phi-3 family by parameter count, context length, and intended use.
3. [**Use Case: Enhancing Query Retrieval with Phi-3 Fine-Tuning**](#elevating-query-retrieval-with-phi-3-fine-tuning): Shows how a fine-tuned small language model (SLM) can classify user queries and route each one to the most suitable retrieval strategy.

For additional information, refer to the following resources:
- [Phi-3 Release Documentation](https://azure.microsoft.com/en-us/blog/introducing-phi-3-redefining-whats-possible-with-slms/)

In [1]:
import os

# Define the target directory
target_directory = r"C:\Users\pablosal\Desktop\gbba-ai-small-language-models"  # change your directory here

# Check if the directory exists
if os.path.exists(target_directory):
    # Change the current working directory
    os.chdir(target_directory)
    print(f"Directory changed to {os.getcwd()}")
else:
    print(f"Directory {target_directory} does not exist.")

Directory changed to C:\Users\pablosal\Desktop\gbba-ai-small-language-models


## Intro to Fine-Tuning Phi-3 models

In the dynamic world of language models, fine-tuning and Retrieval Augmented Generation (RAG) are key to unlocking superior model performance. This guide explores their roles, benefits, and the strategic hierarchy of language model customization, shedding light on their technical nuances and practical applications.

#### What is Fine-Tuning?

Fine-tuning refines a pre-trained large language model (LLM) by training it further on a specific task or dataset. The goal is to improve the model's accuracy and efficiency on that task by adjusting its parameters.

#### Benefits of Fine-Tuning

- **🚀 Enhanced Performance:** Tailoring models to specific tasks significantly improves accuracy and efficiency.
- **🛑 Reduced Hallucinations:** Fine-tuning helps minimize the generation of irrelevant or fabricated content.
- **💸 Cost-Effectiveness:** Enables smaller models to achieve remarkable task-specific performance, avoiding the need for larger, costlier models.
- **🎨 Customization:** Facilitates the creation of models uniquely adapted to specific requirements or datasets.

### Technical Approach: Azure OpenAI's Strategy

Azure OpenAI employs Low-Rank Adaptation (LoRA) for efficient fine-tuning: the pre-trained weights stay frozen and only a small set of added low-rank matrices is trained, which reduces the cost of fine-tuning while largely preserving model performance.
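
To make the low-rank idea concrete, here is a minimal sketch in plain Python of how a LoRA update is usually formulated: the frozen weight matrix `W` is combined with two small trainable matrices `A` and `B`, giving an effective weight of `W + (alpha / r) * B @ A`. The matrix sizes, rank, and `alpha` values are illustrative assumptions, not settings used by Azure OpenAI.

```python
def lora_param_counts(d_out: int, d_in: int, r: int) -> tuple[int, int]:
    """Return (frozen, trainable) parameter counts for one adapted matrix."""
    frozen = d_out * d_in               # W stays frozen
    trainable = r * d_in + d_out * r    # A is (r x d_in), B is (d_out x r)
    return frozen, trainable


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_effective_weight(w, a, b, alpha, r):
    """Compute W + (alpha / r) * (B @ A) for small list-of-lists matrices."""
    scale = alpha / r
    delta = matmul(b, a)
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


# Tiny worked example: a 2 x 2 weight with a rank-1 update (toy numbers).
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 2.0]]       # shape (r=1, d_in=2)
b = [[0.5], [0.0]]     # shape (d_out=2, r=1)
w_eff = lora_effective_weight(w, a, b, alpha=2, r=1)
print(w_eff)

# Hypothetical 3072 x 3072 projection adapted with rank 16.
frozen, trainable = lora_param_counts(3072, 3072, 16)
print(f"trainable share: {trainable / frozen:.2%}")
```

The second print shows why the approach is cheap: for a matrix of this size, the trainable adapter is roughly one percent of the frozen weights.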

#### 📚 Hierarchy of Language Model Customization

1. **Prompt Engineering:** A simple yet effective method for guiding model responses with well-crafted prompts.
2. **Retrieval Augmented Generation (RAG):** Augments prompts with external data for up-to-date information, without modifying the model.
3. **Fine-Tuning:** Provides deeper customization for specific tasks, though it's used less frequently due to its complexity.
4. **Training a New Model:** The most resource-intensive method, involving the development of a model from the ground up.

#### 🎨 Types of Fine-Tuning

- **Continual Pre-Training (CPT):** Further pre-training with additional data, albeit costly and with limited applicability.
- **Supervised Fine-Tuning (SFT):** Adapts models to new tasks with prompt-completion pairs, requiring high-quality data.
- **Reinforcement Learning from Human Feedback (RLHF):** Aligns models with human preferences through feedback, offering high steerability but requiring high-quality preference data.

#### Fine-Tuning Options: Hosted vs. Self-Managed Options on Azure

When it comes to fine-tuning the Phi-3 model, Azure offers two distinct pathways tailored to different operational needs and expertise levels: **Cloud-Hosted Managed Services** and **Self-Managed Fine-Tuning**. Each option provides unique advantages, allowing you to choose the best fit for your project's requirements.

### 🌐 Cloud-Hosted Managed Services: Streamline Your AI Operations

Azure's Cloud-Hosted Managed Services offer a hassle-free way to fine-tune your models, ideal for teams looking to streamline their AI operations without the overhead of managing infrastructure. This option is perfect for:

- **Fully Managed Services:** Azure takes care of all the underlying infrastructure, allowing you to focus solely on fine-tuning your models.
- **Scalability:** Easily scale your operations up or down based on your needs, without worrying about hardware limitations.
- **Security and Compliance:** Benefit from Azure's built-in security features, ensuring your data and models are protected according to the highest standards.

Example: [out-of-the-box UI fine-tuning of Phi-3 in Azure AI Studio](https://github.com/microsoft/Phi-3CookBook/blob/main/md/04.Fine-tuning/FineTuning_AIStudio.md)

### 🛠️ Self-Managed Fine-Tuning: Harnessing Open-Source Tools

For teams with the capability to manage their infrastructure and who prefer a hands-on approach, Self-Managed Fine-Tuning on Azure provides the flexibility to use open-source tools and frameworks. This option is well-suited for:

- **"White Box" LLMs Fine-Tuning:** Efficiently fine-tune open-source or "white box" Large Language Models on small GPU clusters or even single GPU machines, offering a cost-effective solution for smaller projects.
- **Advanced Techniques:** Leverage the full power of Azure's infrastructure to experiment with advanced fine-tuning techniques, such as reinforcement learning, to push the boundaries of what your models can achieve.
- **Customization and Control:** Enjoy complete control over your fine-tuning environment, allowing for deep customization and optimization of your models and infrastructure.

We will showcase this path using Azure ML, learning to manage our infrastructure remotely (cloud infrastructure) and leveraging the library peft to utilize QLoRA for a fine-tuning approach.
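
QLoRA keeps the frozen base weights in 4-bit precision while the LoRA adapters are trained in higher precision. The back-of-the-envelope sketch below estimates the memory taken by the weights alone for a model with 3.8 billion parameters (the Phi-3-mini size listed later in this notebook). It is a rough estimate under stated assumptions: it ignores activations, optimizer state, adapter weights, and quantization metadata.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


PHI3_MINI_PARAMS = 3.8e9  # parameter count from the model comparison table

bf16_gb = weight_memory_gb(PHI3_MINI_PARAMS, 16)
four_bit_gb = weight_memory_gb(PHI3_MINI_PARAMS, 4)

print(f"16-bit weights: ~{bf16_gb:.1f} GB")
print(f" 4-bit weights: ~{four_bit_gb:.1f} GB")
```

The fourfold reduction in weight memory is what makes fine-tuning on a single GPU plausible for a model of this size.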

### Choosing Your Path

Whether you opt for the ease and simplicity of Cloud-Hosted Managed Services or the flexibility and control of Self-Managed Fine-Tuning, Azure provides a robust platform to support your Phi-3 fine-tuning endeavors. Consider your team's expertise, project requirements, and operational capacity when choosing the best path forward.

### ⚠️ Cautions and Best Practices

#### Cautions

- **Time and Cost:** Fine-tuning requires significant investment, making it less suitable for projects needing quick turnarounds.
- **Data Volatility:** Rapid changes in datasets can challenge the fine-tuning process.
- **Explainability and Grounding:** When these aspects are crucial, consider alternatives such as prompt engineering or RAG before turning to fine-tuning.

#### Best Practices

- **Data Quality:** Emphasize the collection of large, high-quality datasets for fine-tuning.
- **Baseline Performance:** Utilize prompt engineering or RAG to set performance baselines and pinpoint improvement areas.
- **Iterative Evaluation:** Apply both quantitative and qualitative measures to continually enhance model performance.

#### 🤝 Integrating Fine-Tuning with RAG

Fine-tuning and RAG complement each other, with fine-tuning modifying the model and RAG altering the input data. Their synergy can significantly boost model performance for specific tasks, as demonstrated by case studies from Microsoft Research and other practical applications (see [arXiv:2401.08406](https://arxiv.org/pdf/2401.08406)).
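
The distinction can be illustrated with a minimal sketch of the RAG side: retrieved passages are placed in the prompt, and the model itself is left untouched. The function name `build_rag_prompt`, the prompt wording, and the example passages are hypothetical and chosen only for illustration.

```python
def build_rag_prompt(question: str, passages: list[str]) -> str:
    """Prepend retrieved passages to the user question; no model weights change."""
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Made-up passages standing in for the output of a retrieval step.
passages = [
    "Phi-3-mini has 3.8 billion parameters.",
    "Phi-3-medium has 14 billion parameters.",
]
prompt = build_rag_prompt("How many parameters does Phi-3-mini have?", passages)
print(prompt)
```

A fine-tuned model can consume exactly the same prompt, which is why the two techniques combine without conflict.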

### 🆚 Comparing Small and Large Language Models

What sets Small Language Models (SLMs) apart from Large Language Models (LLMs)? 

The core difference between SLMs and LLMs lies in their inherent capabilities and the scale of data they're trained on. LLMs excel in tasks requiring extensive world knowledge, complex reasoning, and advanced functionalities like orchestration and tool use. They're ideal for open-ended applications where the breadth of understanding and creativity is paramount. On the other hand, SLMs shine in more narrowly defined scenarios, where the focus is on specific language tasks or reasoning within a constrained domain. 

The choice between using an SLM or an LLM isn't binary but rather depends on the specific requirements of your project. Our experience in developing various AI-driven tools has shown that both types of models have their unique advantages and can be complementary. Here are some considerations to help you decide:

### When to Opt for SLMs:

- **Resource Efficiency:** If computational resources are limited, SLMs are the way to go. They require less power, making them more accessible for projects with budget constraints.
- **Faster Inference:** The compact size of SLMs allows for quicker data processing, enabling real-time applications even on local machines.
- **Domain-Specific Tasks:** For projects focused on a particular niche, SLMs can be more practical. They're not only quicker but also more cost-effective to fine-tune, allowing for tailored solutions without the hefty price tag.

Fine-tuning smaller models, such as phi-3, presents a strategic alternative to leveraging larger models for certain applications. With their reduced demand for computational power and memory, SLMs offer a practical solution for tasks that don't require the full might of an LLM. This approach not only accelerates training times but also minimizes operational costs, making SLMs a compelling choice for a wide range of applications.

In this guide, we'll take you through a comprehensive journey of fine-tuning phi-3, demonstrating how to harness its capabilities for domain-specific challenges.

## Exploring the Phi-3 Model Universe

On May 21, 2024, Microsoft announced an expansion of the Phi-3 model family. The table below compares the new additions with Phi-3-Mini, which was introduced a month earlier:

| Model         | Parameters | Context Lengths | Capabilities                                                                                   | Use Cases                                                                                                         |
|---------------|------------|-----------------|------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------|
| **Phi-3-Vision** | 4.2 billion | 128K            | A multimodal marvel that seamlessly integrates language and vision.                            | Ideal for interpreting real-world images and digital documents, extracting text from visuals, and analyzing charts and diagrams. |
| **Phi-3-Small**  | 7 billion   | 128K, 8K        | A versatile language model excelling in language, reasoning, coding, and math benchmarks.      | Offers unparalleled performance, setting a new standard for efficiency and cost-effectiveness.                    |
| **Phi-3-Medium** | 14 billion  | 128K, 4K        | A language model that continues to outshine larger models in understanding and reasoning tasks. | Showcases exceptional capabilities in language understanding, reasoning tasks, and coding benchmarks.             |
| **Phi-3-Mini**   | 3.8 billion | 128K, 4K        | Introduced on April 23, 2024, excels in long-context scenarios for its size.                   | Perfect for reasoning tasks and is readily accessible via Models-as-a-Service (MaaS).                             |

### 💡 Benefits of Phi-3 Models

- **Performance and Efficiency:** The Phi-3 family is engineered to surpass models within the same parameter range, delivering top-tier performance in both language and vision tasks across a multitude of applications.
- **Cost Efficiency:** Models like Phi-3-mini and Phi-3-small offer budget-friendly solutions without sacrificing performance, ideal for environments where computational resources are limited.
- **Versatility:** With capabilities spanning natural language processing, coding, math, and multimodal tasks, the Phi-3 models are adept at handling both general-purpose and specialized applications.

Available through Azure AI Studio and MaaS, these models ensure easy integration into projects, with Phi-3-mini and Phi-3-medium being particularly accessible in regions such as East US2 and Sweden Central.

### 🛠️ Support and Availability

- **Fine-Tuning Support:** With the exception of Phi-3-vision, all models are fine-tunable using the Azure Machine Learning SDK and UI, allowing for customization to meet specific project needs.
- **Global Reach:** Phi-3 models are accessible in all regions supported by Azure AI Studio, ensuring developers worldwide can leverage their advanced capabilities.


## Elevating Query Retrieval with Phi-3 Fine-Tuning 

In this guide, we fine-tune the phi-3 model so that our retrieval system can classify user queries accurately and select the most effective retrieval strategy for each one. The goal is to improve the user experience and reduce operational costs.

- [Fine-Tuning with Azure Machine Learning Documentation](https://github.com/Azure/azure-llm-fine-tuning/tree/main/fine-tuning)

#### 🌐 Phi-3 Models in the Wild: Versatility at Its Best

Phi-3 models stand out for their exceptional adaptability, making them perfect candidates for a myriad of real-world applications:

- **Efficiency on the Edge:** Tailored for performance in environments with limited computational resources, these models are a boon for language processing tasks on edge devices.
  
- **Instant Insights, Anywhere:** Phi-3 models are designed for rapid, local inference, ideal for powering responsive applications in offline settings, such as customer support chatbots.
  
- **Tailored for Every Domain:** From summarizing extensive documents to generating engaging narratives, phi-3 models are adept at specialized tasks across various industries.

#### 🎯 Retrieval Revolution: Strategies Unveiled

For background on these strategies and how they compare, see:
- [Azure AI Search: Outperforming Vector Search with Hybrid Approaches](https://techcommunity.microsoft.com/t5/ai-azure-ai-services-blog/azure-ai-search-outperforming-vector-search-with-hybrid/ba-p/3929167)

Our retrieval system combines several strategies to maximize relevance and accuracy across queries:

1. **Keyword Retrieval:** The cornerstone of traditional search, effective for straightforward queries but less so for complex or misspelled inputs.

2. **Vector Retrieval:** A modern twist using embeddings to understand semantic similarities, overcoming the limitations of keyword-only searches.

3. **Hybrid Retrieval:** A best-of-both-worlds strategy that combines keyword and vector methods to cover all bases, ensuring no query is left behind.

4. **Semantic Ranking:** The final touch, reordering results to prioritize relevance, leveraging deep learning models inspired by Microsoft Bing's algorithms.
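
One common way to merge a keyword ranking and a vector ranking into a single hybrid result list is Reciprocal Rank Fusion (RRF), the method Azure AI Search documents for hybrid queries. The sketch below is a plain-Python illustration of the formula, not the service's implementation; the document IDs are made up and the constant `k = 60` is a commonly used default that should be treated as an assumption.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists: each document scores sum(1 / (k + rank))."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


keyword_hits = ["doc_a", "doc_b", "doc_c"]   # hypothetical keyword results
vector_hits = ["doc_c", "doc_a", "doc_d"]    # hypothetical vector results

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.4f}")
```

Documents that appear near the top of both lists rise in the fused ranking, while documents found by only one method are still retained.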

#### 🎢 Strategy and Objectives: Navigating the Path to Excellence

Our roadmap is clear: harness the phi-3 model to categorize user queries with razor-sharp accuracy, ensuring each query activates the most suitable retrieval mechanism. This strategy promises to significantly enhance system performance. Key milestones include:

- **Crafting the Perfect Dataset:** A curated collection of data designed to boost the phi-3 model's query classification prowess.
  
- **Building a Superior Retrieval System:** A cutting-edge system that marries keyword and vector search methods with semantic ranking for unmatched relevance.
  
- **Setting the Performance Bar High:** Rigorous evaluation using leading metrics to ensure our system sets new standards for effectiveness.

#### 🚧 From Blueprint to Reality: The Journey Ahead

We're set to embark on a journey of fine-tuning and system integration, with the following phases:

1. **Dataset Curation:** Assembling a rich dataset to train the phi-3 model, covering a wide spectrum of query types.
   
2. **Precision Fine-Tuning:** Employing QLoRA to fine-tune the phi-3 model, enhancing its ability to accurately classify and route queries.
   
3. **Seamless System Integration:** Incorporating the refined model into our retrieval framework, tailored to adapt to diverse search scenarios.
   
4. **Benchmarking Success:** Evaluating our system's performance in real-world and controlled environments to confirm our anticipated gains in accuracy, responsiveness, and efficiency.

By embarking on this project, we aim to not only elevate the precision and efficiency of our retrieval system but also to redefine the benchmarks for language processing tasks through innovative fine-tuning of the phi-3 model.
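
The routing step described above can be sketched as a small dispatcher that reads the class label from the model's reply and picks a retrieval strategy. The label names, strategy names, and fallback are hypothetical; the reply pattern follows the "This query is a '...' type." wording used for assistant messages in the training script later in this notebook.

```python
import re

# Hypothetical mapping from query class to retrieval strategy.
STRATEGY_BY_LABEL = {
    "keyword": "keyword_retrieval",
    "concept": "vector_retrieval",
    "complex": "hybrid_retrieval_with_semantic_ranking",
}
# Fall back to the broadest strategy when the label is unknown or missing.
DEFAULT_STRATEGY = "hybrid_retrieval_with_semantic_ranking"

LABEL_PATTERN = re.compile(r"This query is a '([^']+)' type\.")


def route_query(model_reply: str) -> str:
    """Extract the class label from the model reply and map it to a strategy."""
    match = LABEL_PATTERN.search(model_reply)
    if match is None:
        return DEFAULT_STRATEGY
    label = match.group(1).strip().lower()
    return STRATEGY_BY_LABEL.get(label, DEFAULT_STRATEGY)


print(route_query("This query is a 'Keyword' type."))
print(route_query("I am not sure how to classify this."))
```

Keeping a default strategy means a malformed or unexpected model reply degrades to a safe choice instead of failing the request.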

## 0. Setup Azure Machine Learning (AML) Requirements

To establish a connection with an Azure Machine Learning workspace, specific identifying parameters are required: a subscription ID, a resource group name, and a workspace name. These details will be utilized with the `MLClient` from `azure.ai.ml` to access the desired Azure Machine Learning workspace. For authentication, the default Azure credentials will be employed in this hands-on guide.

In [5]:
from dotenv import load_dotenv
import os

# Load environment variables from .env file
load_dotenv()

# Gather values from .env
AZURE_SUBSCRIPTION_ID = os.getenv('AZURE_SUBSCRIPTION_ID')
AZURE_RESOURCE_GROUP = os.getenv('AZURE_RESOURCE_GROUP')
AZURE_WORKSPACE = os.getenv('AZURE_WORKSPACE')
AZURE_DATA_NAME = os.getenv('AZURE_DATA_NAME')
DATA_DIR = os.getenv('DATA_DIR')
CLOUD_DIR = os.getenv('CLOUD_DIR')
HF_MODEL_NAME_OR_PATH = os.getenv('HF_MODEL_NAME_OR_PATH')
IS_DEBUG = os.getenv('IS_DEBUG') == 'True'  # Assuming IS_DEBUG is a boolean value stored as string
USE_LOWPRIORITY_VM = os.getenv('USE_LOWPRIORITY_VM') == 'True'  # Assuming USE_LOWPRIORITY_VM is a boolean value stored as string

azure_env_name = os.getenv('azure_env_name')
azure_compute_cluster_name = os.getenv('azure_compute_cluster_name')
azure_compute_cluster_size = os.getenv('azure_compute_cluster_size')

os.makedirs(DATA_DIR, exist_ok=True)
os.makedirs(CLOUD_DIR, exist_ok=True)

FileNotFoundError: [WinError 3] The system cannot find the path specified: ''
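
The `FileNotFoundError` above is what `os.makedirs` raises when it receives an empty path, which happens when `DATA_DIR` or `CLOUD_DIR` is present in `.env` but blank (an unset variable would instead come back as `None` and raise a `TypeError`). A minimal sketch of a fail-fast check follows; the helper name `require_env` and the demonstration value are hypothetical.

```python
import os


def require_env(name: str) -> str:
    """Return an environment variable's value, or raise if it is unset or blank."""
    value = os.getenv(name, "").strip()
    if not value:
        raise ValueError(
            f"Environment variable {name} is missing or empty; check your .env file."
        )
    return value


# Demonstration only: set a placeholder value, then read it back.
os.environ["DATA_DIR"] = "data"
DATA_DIR = require_env("DATA_DIR")
print(DATA_DIR)
```

Calling such a helper for each required variable surfaces a clear message at load time instead of a path error several lines later.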

In [None]:
from src. import AMLManager

ml_manager = AMLManager(

In [None]:
# import required libraries
import time
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential
from azure.ai.ml import MLClient, Input, Output, command, load_component
from azure.ai.ml.dsl import pipeline
from azure.ai.ml.entities import Data, Environment, BuildContext, Model
from azure.ai.ml.constants import AssetTypes
from azure.core.exceptions import ResourceNotFoundError, ResourceExistsError

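credential = DefaultAzureCredential()
ml_client = None
try:
    # Prefer a local config.json that describes the workspace, if one exists
    ml_client = MLClient.from_config(credential)
except Exception as ex:
    # Otherwise fall back to the identifiers loaded from the environment
    print(ex)
    ml_client = MLClient(credential, AZURE_SUBSCRIPTION_ID, AZURE_RESOURCE_GROUP, AZURE_WORKSPACE)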

In [None]:
%%writefile {CLOUD_DIR}/conda.yml
name: model-env
channels:
  - conda-forge
dependencies:
  - python=3.10
  - pip=24.0
  - pip:
    - bitsandbytes==0.43.1
    - transformers~=4.41
    - peft~=0.11
    - accelerate~=0.30
    - trl==0.9.4
    - einops==0.8.0
    - datasets==2.20.0
    - wandb==0.17.2
    - mlflow==2.14.1
    - azureml-mlflow==1.56.0
    - azureml-sdk==1.56.0
    - torchvision==0.18.1

In [None]:
import os
import yaml
from datetime import datetime
snapshot_date = datetime.now().strftime("%Y-%m-%d")

with open('config.yml') as f:
    # safe_load is sufficient for a plain key/value configuration file
    d = yaml.safe_load(f)
    
AZURE_SUBSCRIPTION_ID = d['config']['AZURE_SUBSCRIPTION_ID']
AZURE_RESOURCE_GROUP = d['config']['AZURE_RESOURCE_GROUP']
AZURE_WORKSPACE = d['config']['AZURE_WORKSPACE']
AZURE_DATA_NAME = d['config']['AZURE_DATA_NAME']    
DATA_DIR = d['config']['DATA_DIR']
CLOUD_DIR = d['config']['CLOUD_DIR']
HF_MODEL_NAME_OR_PATH = d['config']['HF_MODEL_NAME_OR_PATH']
IS_DEBUG = d['config']['IS_DEBUG']
USE_LOWPRIORITY_VM = d['config']['USE_LOWPRIORITY_VM']

azure_env_name = d['train']['azure_env_name']  
azure_compute_cluster_name = d['train']['azure_compute_cluster_name']
azure_compute_cluster_size = d['train']['azure_compute_cluster_size']

!rm -rf $DATA_DIR 
os.makedirs(DATA_DIR, exist_ok=True)
os.makedirs(CLOUD_DIR, exist_ok=True)

## 1. Dataset Curation

In [None]:
{"prompt":"Create a step-by-step guide on how to make your own DIY makeup remover using natural ingredients. Include specific measurements and instructions on how to mix the ingredients together. Also, recommend storage options and provide tips on how to use it effectively.","prompt_id":"8c24e2e4794d078ef832e0994696de577f8f241ebfb9b6492cddbca43e003385","messages":[{"content":"Create a step-by-step guide on how to make your own DIY makeup remover using natural ingredients. Include specific measurements and instructions on how to mix the ingredients together. Also, recommend storage options and provide tips on how to use it effectively.","role":"user"}, {"content":"Step-by-Step Guide to Make Your Own DIY Makeup Remover using Natural Ingredients:\n\nIngredients:\n\n- 1\/2 cup of extra-virgin olive oil\n- 1\/4 cup of witch hazel\n- 1\/4 cup of filtered water\n- 1 tablespoon of castile soap\n- 5 drops of tea tree oil\n\nTools Required:\n\n- A mixing bowl\n- A stirring spoon\n- A glass container for storage\n\nInstructions:\n\n1. In a mixing bowl, add 1\/2 cup of extra-virgin olive oil.\n\n2. Add 1\/4 cup of witch hazel to the mixing bowl and stir well.\n\n3. Add 1\/4 cup of filtered water to the mixing bowl and stir until the ingredients are well combined.\n\n4. Add 1 tablespoon of castile soap to the mixing bowl and stir gently to combine.\n\n5. Add 5 drops of tea tree oil to the mixture and stir.\n\n6. Transfer the mixture to a glass container for storage.\n\n7. Label the container with the date and ingredients used.\n\nTips on Using Your DIY Makeup Remover:\n\n- Shake well before using to ensure the ingredients are mixed.\n- Use a cotton pad to apply the mixture to your face.\n- Rub the cotton pad gently in circular motions over your face to remove makeup.\n- Wash your face with warm water after using the makeup remover.\n- Store your DIY makeup remover in a cool, dry, and dark place to avoid spoiling.\n- Try using different essential oils, such as lavender or chamomile, to customize your makeup remover according to your skin type.\n- You can also use this makeup remover as a facial cleanser to remove dirt and impurities from your face.","role":"assistant"},{"content":"This guide is great! Can you recommend some specific brands for the ingredients, such as the castile soap and witch hazel? And also, can you add some tips on how to remove stubborn eye makeup with this DIY makeup remover?","role":"user"},{"content":"Yes, definitely! Here are some recommendations for brands of the ingredients:\n\n- Castile soap: Dr. Bronner's Castile Soap\n- Witch hazel: Thayers Witch Hazel Toner\n- Tea tree oil: Healing Solutions Tea Tree Essential Oil\n\nTips on Removing Stubborn Eye Makeup:\n\n- Soak a cotton pad with the DIY makeup remover and hold it over your closed eye for a few seconds to allow the makeup to dissolve.\n- Gently swipe the cotton pad over your eyelids and eyelashes, using small circular motions.\n- For particularly stubborn eye makeup, you can use a Q-tip soaked in the DIY makeup remover to remove any remnants.\n- Be careful not to rub the cotton pad too hard on your eyes, as this can cause irritation or damage to your delicate eye area.","role":"assistant"}]}
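
The sample record above has three top-level keys (`prompt`, `prompt_id`, `messages`), and its messages alternate between the `user` and `assistant` roles, starting with `user`. The sketch below checks one JSONL line against that layout. It assumes every record follows the same pattern as the sample, and the shortened record used for the demonstration is made up.

```python
import json


def validate_chat_record(line: str) -> int:
    """Parse one JSONL line and return its message count, raising on a malformed record."""
    record = json.loads(line)
    for key in ("prompt", "prompt_id", "messages"):
        if key not in record:
            raise KeyError(f"missing key: {key}")
    messages = record["messages"]
    for position, message in enumerate(messages):
        # Even positions should be user turns, odd positions assistant turns.
        expected_role = "user" if position % 2 == 0 else "assistant"
        if message["role"] != expected_role:
            raise ValueError(
                f"message {position} has role {message['role']!r}, expected {expected_role!r}"
            )
    return len(messages)


# Shortened, made-up record in the same layout as the sample.
sample_line = json.dumps({
    "prompt": "Say hello.",
    "prompt_id": "example-id",
    "messages": [
        {"content": "Say hello.", "role": "user"},
        {"content": "Hello!", "role": "assistant"},
    ],
})
print(validate_chat_record(sample_line))
```

Running a check like this over the training file before launching a job catches formatting problems early, when they are cheap to fix.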

In [6]:
from datasets import load_dataset
from random import randrange

# Load dataset from the hub
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split='train_sft[:2%]')

print(f"dataset size: {len(dataset)}")
print(dataset[randrange(len(dataset))])

Generating train_sft split: 100%|██████████| 207865/207865 [00:05<00:00, 36322.25 examples/s]
Generating test_sft split: 100%|██████████| 23110/23110 [00:00<00:00, 37195.72 examples/s]
Generating train_gen split: 100%|██████████| 256032/256032 [00:05<00:00, 45348.03 examples/s]
Generating test_gen split: 100%|██████████| 

dataset size: 4157
{'prompt': 'Write a step-by-step guide on how to set up and effectively use Trello for managing a project, including how to create boards, lists, and cards, assign tasks to team members, track progress, and collaborate efficiently. Provide tips on best practices for labeling, tagging, and organizing information within Trello, as well as suggestions for integrating other project management tools or software.', 'prompt_id': '08e593adfb879510d70e6065a38e605a9199a6686d7ab8a44d541aa30e5268ab', 'messages': [{'content': 'Write a step-by-step guide on how to set up and effectively use Trello for managing a project, including how to create boards, lists, and cards, assign tasks to team members, track progress, and collaborate efficiently. Provide tips on best practices for labeling, tagging, and organizing information within Trello, as well as suggestions for integrating other project management tools or software.', 'role': 'user'}, {'content': 'Step-by-Step Guide on how to s

In [7]:
# Hold out 20% of the rows for evaluation
dataset = dataset.train_test_split(test_size=0.2)
train_dataset = dataset['train']
train_dataset.to_json("data/train.jsonl")
test_dataset = dataset['test']
test_dataset.to_json("data/eval.jsonl")

Creating json from Arrow format: 100%|██████████| 4/4 [00:01<00:00,  2.80ba/s]
Creating json from Arrow format: 100%|██████████| 1/1 [00:00<00:00,  4.46ba/s]


5803139
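
The `train_test_split` call above does not pass a seed, so the rows that land in each file can differ between runs. With the `datasets` library, passing a `seed` argument to `train_test_split` makes the split repeatable. The sketch below illustrates the same idea with only the standard library; the helper name and seed value are hypothetical, and the exact rounding of the split sizes may differ from the library's.

```python
import random


def split_indices(n_rows: int, test_fraction: float, seed: int) -> tuple[list[int], list[int]]:
    """Shuffle row indices with a fixed seed and split them into (train, test)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # seeded shuffle is repeatable
    n_test = round(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]


# 4157 is the dataset size printed earlier in this notebook.
train_idx, test_idx = split_indices(4157, 0.2, seed=42)
print(len(train_idx), len(test_idx))
```

A fixed seed keeps the evaluation set stable, so metrics from different fine-tuning runs remain comparable.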

## 2. Creating training script 

In [92]:
%%writefile src/finetuning/phi3/train_mlflow.py
from mlflow.models.signature import ModelSignature
from mlflow.types import DataType, Schema, ColSpec
import pandas as pd
from typing import List, Dict

import os
import mlflow
from mlflow.models import infer_signature
import argparse
import sys
import logging

import datasets
from datasets import load_dataset
from peft import LoraConfig
import torch
import transformers
from trl import SFTTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer, PreTrainedTokenizer, TrainingArguments, BitsAndBytesConfig
from datetime import datetime

logger = logging.getLogger(__name__)

def load_model(model_name_or_path: str = "microsoft/Phi-3-mini-4k-instruct",
               use_cache: bool = False,
               trust_remote_code: bool = True,
               torch_dtype: torch.dtype = torch.bfloat16,
               device_map: dict = None,
               max_seq_length: int = 4096) -> tuple:
    """
    Loads a pre-trained model and its tokenizer with specified configurations.

    Parameters:
    - model_name_or_path (str): Identifier for the model to load. Can be a model ID or path.
    - use_cache (bool): Whether to use caching for model outputs.
    - trust_remote_code (bool): Whether to trust remote code when loading the model.
    - torch_dtype (torch.dtype): Data type for model tensors. Recommended to use torch.bfloat16 for efficiency.
    - device_map (dict): Custom device map for distributing the model's layers across devices.
    - max_seq_length (int): Maximum sequence length for the tokenizer.

    Returns:
    - tuple: A tuple containing the loaded model and tokenizer.
    """
    model_kwargs = {
        "use_cache": use_cache,
        "trust_remote_code": trust_remote_code,
        "torch_dtype": torch_dtype,
        "device_map": device_map
    }
    model = AutoModelForCausalLM.from_pretrained(model_name_or_path, **model_kwargs)
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
    tokenizer.model_max_length = max_seq_length
    tokenizer.pad_token = tokenizer.unk_token  # use unk rather than eos token to prevent endless generation
    tokenizer.pad_token_id = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
    tokenizer.padding_side = "right"
    return model, tokenizer

def convert_to_chat_format(df: pd.DataFrame) -> List[List[Dict[str, str]]]:
    """
    Converts a DataFrame containing questions and their types into a chat format.

    Parameters:
    - df (pd.DataFrame): A DataFrame with at least two columns: 'Question' and 'Kind of Query'.
      Each row represents a user query and its categorization.

    Returns:
    - List[List[Dict[str, str]]]: A list of chats, where each chat is a list of messages.
      Each message is a dictionary with 'role' and 'content' keys. The 'role' can be 'system',
      'user', or 'assistant', indicating the sender of the message. The 'content' is the text of the message.
    """
    chats = []

    for index, row in df.iterrows():
        chat = [
            {
                "role": "system",
                "content": "You are an AI assistant supporting users by categorizing their queries."
            },
            {
                "role": "user",
                "content": row["Question"]
            },
            {
                "role": "assistant",
                "content": f"This query is a '{row['Kind of Query']}' type."
            }
        ]
        chats.append(chat)
    
    return chats

def convert_chats_to_dataframe(chats: List[List[Dict[str, str]]]) -> pd.DataFrame:
    """
    Converts a list of chats into a DataFrame where each chat is represented as a dictionary in the 'message' column.

    Parameters:
    - chats (List[List[Dict[str, str]]]): A list of chats, where each chat is a list of messages.
      Each message is a dictionary with 'role' and 'content' keys.

    Returns:
    - pd.DataFrame: A DataFrame with a single column 'message', where each row contains a dictionary
      representing a chat.
    """
    # Convert each chat into a dictionary and store it in a list
    chat_dicts = [{'message': chat} for chat in chats]
    
    # Create a DataFrame from the list of dictionaries
    df = pd.DataFrame(chat_dicts)
    
    return df


def apply_chat_template(
    example: dict,
    tokenizer: PreTrainedTokenizer,
) -> dict:
    """
    Applies a chat template to the messages in an example from a dataset.

    This function modifies the input example by adding a system message at the beginning
    if it does not already start with one. It then applies a chat template formatting
    using the specified tokenizer.

    Parameters:
    - example (dict): A dictionary representing a single example from a dataset. It must
      contain a key 'message', which is a list of message dictionaries. Each message
      dictionary should have 'role' and 'content' keys.
    - tokenizer (PreTrainedTokenizer): An instance of a tokenizer that supports the
      `apply_chat_template` method for formatting chat messages.

    Returns:
    - dict: The modified example dictionary with an added 'text' key that contains the
      formatted chat as a string.
    """
    messages = example["message"]
    # Add an empty system message if there is none
    if messages[0]["role"] != "system":
        messages.insert(0, {"role": "system", "content": ""})
    example["text"] = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=False)
    return example

def main(args):
    
    ###################
    # Hyper-parameters
    ###################
    # Only overwrite environ if wandb param passed
    if len(args.wandb_project) > 0:
        os.environ['WANDB_API_KEY'] = args.wandb_api_key    
        os.environ["WANDB_PROJECT"] = args.wandb_project
    if len(args.wandb_watch) > 0:
        os.environ["WANDB_WATCH"] = args.wandb_watch
    if len(args.wandb_log_model) > 0:
        os.environ["WANDB_LOG_MODEL"] = args.wandb_log_model

    use_wandb = len(args.wandb_project) > 0 or ("WANDB_PROJECT" in os.environ and len(os.environ["WANDB_PROJECT"]) > 0) 

    training_config = {
        "bf16": True,
        "do_eval": False,
        "learning_rate": args.learning_rate,
        "log_level": "info",
        "logging_steps": args.logging_steps,
        "logging_strategy": "steps",
        "lr_scheduler_type": args.lr_scheduler_type,
        "num_train_epochs": args.epochs,
        "max_steps": -1,
        "output_dir": args.output_dir,
        "overwrite_output_dir": True,
        "per_device_train_batch_size": args.train_batch_size,
        "per_device_eval_batch_size": args.eval_batch_size,
        "remove_unused_columns": True,
        "save_steps": args.save_steps,
        "save_total_limit": 1,
        "seed": args.seed,
        "gradient_checkpointing": True,
        "gradient_checkpointing_kwargs": {"use_reentrant": False},
        "gradient_accumulation_steps": args.grad_accum_steps,
        "warmup_ratio": args.warmup_ratio,
    }

    peft_config = {
        "r": args.lora_r,
        "lora_alpha": args.lora_alpha,
        "lora_dropout": args.lora_dropout,
        "bias": "none",
        "task_type": "CAUSAL_LM",
        #"target_modules": "all-linear",
        "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
        "modules_to_save": None,
    }

    checkpoint_dir = os.path.join(args.output_dir, "checkpoints")

    train_conf = TrainingArguments(
        **training_config,
        report_to="wandb" if use_wandb else "azure_ml",
        run_name=args.wandb_run_name if use_wandb else None,    
    )
    peft_conf = LoraConfig(**peft_config)
    model, tokenizer = load_model(args)

    ###############
    # Setup logging
    ###############
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",
        handlers=[logging.StreamHandler(sys.stdout)],
    )
    log_level = train_conf.get_process_log_level()
    logger.setLevel(log_level)
    datasets.utils.logging.set_verbosity(log_level)
    transformers.utils.logging.set_verbosity(log_level)
    transformers.utils.logging.enable_default_handler()
    transformers.utils.logging.enable_explicit_format()

    # Log on each process a small summary
    logger.warning(
        f"Process rank: {train_conf.local_rank}, device: {train_conf.device}, n_gpu: {train_conf.n_gpu}"
        + f" distributed training: {bool(train_conf.local_rank != -1)}, 16-bits training: {train_conf.fp16 or train_conf.bf16}"
    )
    logger.info(f"Training/evaluation parameters {train_conf}")
    logger.info(f"PEFT parameters {peft_conf}")    

    ##################
    # Data Processing
    ##################

    # Each JSONL row must hold one chat under the 'message' key,
    # because apply_chat_template reads example["message"].
    train_dataset = load_dataset('json', data_files=os.path.join(args.train_dir, 'train.jsonl'), split='train')
    eval_dataset = load_dataset('json', data_files=os.path.join(args.train_dir, 'eval.jsonl'), split='train')

    # Original columns are dropped after templating so only 'text' remains
    column_names = list(train_dataset.features)

    processed_train_dataset = train_dataset.map(
        apply_chat_template,
        fn_kwargs={"tokenizer": tokenizer},
        num_proc=10,
        remove_columns=column_names,
        desc="Applying chat template to train_sft",
    )

    processed_eval_dataset = eval_dataset.map(
        apply_chat_template,
        fn_kwargs={"tokenizer": tokenizer},
        num_proc=10,
        remove_columns=column_names,
        desc="Applying chat template to test_sft",
    )

    with mlflow.start_run() as run:        
        ###########
        # Training
        ###########
        trainer = SFTTrainer(
            model=model,
            args=train_conf,
            peft_config=peft_conf,
            train_dataset=processed_train_dataset,
            eval_dataset=processed_eval_dataset,
            max_seq_length=args.max_seq_length,
            dataset_text_field="text",
            tokenizer=tokenizer,
            packing=True,
        )

        # Show current memory stats
        gpu_stats = torch.cuda.get_device_properties(0)
        start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
        max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
        logger.info(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
        logger.info(f"{start_gpu_memory} GB of memory reserved.")
        
        last_checkpoint = None
        if os.path.isdir(checkpoint_dir):
            checkpoints = [os.path.join(checkpoint_dir, d) for d in os.listdir(checkpoint_dir)]
            if len(checkpoints) > 0:
                checkpoints.sort(key=os.path.getmtime, reverse=True)
                last_checkpoint = checkpoints[0]        

        trainer_stats = trainer.train(resume_from_checkpoint=last_checkpoint)

        #############
        # Logging
        #############
        metrics = trainer_stats.metrics

        # Show final memory and time stats 
        used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
        used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
        used_percentage = round(used_memory / max_memory * 100, 3)
        lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)

        logger.info(f"{metrics['train_runtime']} seconds used for training.")
        logger.info(f"{round(metrics['train_runtime']/60, 2)} minutes used for training.")
        logger.info(f"Peak reserved memory = {used_memory} GB.")
        logger.info(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
        logger.info(f"Peak reserved memory % of max memory = {used_percentage} %.")
        logger.info(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
                
        trainer.log_metrics("train", metrics)

        model_info = mlflow.transformers.log_model(
            transformers_model={"model": trainer.model, "tokenizer": tokenizer},
            #prompt_template=prompt_template,
            #signature=signature,
            artifact_path=args.model_dir,  # This is a relative path to save model files within MLflow run
        )


def parse_args():
    # setup argparse
    parser = argparse.ArgumentParser()
    # curr_time = datetime.now().strftime("%Y-%m-%d_%H:%M:%S")

    # hyperparameters
    parser.add_argument("--train_dir", default="data", type=str, help="Input directory for training")
    parser.add_argument("--model_dir", default="./model", type=str, help="output directory for model")
    parser.add_argument("--epochs", default=1, type=int, help="number of epochs")
    parser.add_argument("--output_dir", default="./output_dir", type=str, help="directory for checkpoints and temporary outputs written during training")
    parser.add_argument("--train_batch_size", default=8, type=int, help="training - mini batch size for each gpu/process")
    parser.add_argument("--eval_batch_size", default=8, type=int, help="evaluation - mini batch size for each gpu/process")
    parser.add_argument("--learning_rate", default=5e-06, type=float, help="learning rate")
    parser.add_argument("--logging_steps", default=2, type=int, help="logging steps")
    parser.add_argument("--save_steps", default=100, type=int, help="save steps")    
    parser.add_argument("--grad_accum_steps", default=4, type=int, help="gradient accumulation steps")
    parser.add_argument("--lr_scheduler_type", default="linear", type=str)
    parser.add_argument("--seed", default=0, type=int, help="seed")
    parser.add_argument("--warmup_ratio", default=0.2, type=float, help="warmup ratio")
    parser.add_argument("--max_seq_length", default=2048, type=int, help="max seq length")
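    # type=bool would treat any non-empty string (even "False") as True, so parse the text explicitly
    parser.add_argument("--save_merged_model", type=lambda s: str(s).lower() in ("true", "1", "yes"), default=False)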

    # lora hyperparameters
    parser.add_argument("--lora_r", default=16, type=int, help="lora r")
    parser.add_argument("--lora_alpha", default=16, type=int, help="lora alpha")
    parser.add_argument("--lora_dropout", default=0.05, type=float, help="lora dropout")
    
    # wandb params
    parser.add_argument("--wandb_api_key", type=str, default="")
    parser.add_argument("--wandb_project", type=str, default="")
    parser.add_argument("--wandb_run_name", type=str, default="")
    parser.add_argument("--wandb_watch", type=str, default="gradients") # options: false | gradients | all
    parser.add_argument("--wandb_log_model", type=str, default="false") # options: false | true

    # parse args
    args = parser.parse_args()

    # return args
    return args

if __name__ == "__main__":
    #sys.argv = ['']
    args = parse_args()
    main(args)


Overwriting src/finetuning/phi3/train_mlflow.py
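
To make the data flow in the script more concrete, the sketch below builds one chat record in the same shape as `convert_to_chat_format` and flattens it into a single training string. `render_chat` is a hypothetical stand-in for `tokenizer.apply_chat_template`; the `<|role|> ... <|end|>` layout is an assumption for illustration, since the real template is defined by the tokenizer. The question and label are made-up sample values.

```python
from typing import Dict, List

SYSTEM_PROMPT = "You are an AI assistant supporting users by categorizing their queries."


def build_chat(question: str, kind: str) -> List[Dict[str, str]]:
    """Build one chat in the same shape the training script produces."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
        {"role": "assistant", "content": f"This query is a '{kind}' type."},
    ]


def render_chat(messages: List[Dict[str, str]]) -> str:
    """Hypothetical stand-in for tokenizer.apply_chat_template(tokenize=False)."""
    # Copy first so the caller's list is not mutated when a system turn is added.
    messages = list(messages)
    if messages[0]["role"] != "system":
        messages.insert(0, {"role": "system", "content": ""})
    # Assumed marker layout; the real template is defined by the tokenizer.
    return "".join(f"<|{m['role']}|>\n{m['content']}<|end|>\n" for m in messages)


# Made-up sample row, stored under the 'message' key as in the script.
example = {"message": build_chat("Where is my order?", "order status")}
example["text"] = render_chat(example["message"])
print(example["text"])
```

After this step only the `text` field is needed for supervised fine-tuning, which is why the script drops the original columns when mapping the dataset.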


## Creating the Compute Cluster

In [None]:
import os
import yaml
from datetime import datetime
snapshot_date = datetime.now().strftime("%Y-%m-%d")

with open('config.yml') as f:
    # safe_load only builds plain Python objects, which is all a config file needs
    d = yaml.safe_load(f)
    
AZURE_SUBSCRIPTION_ID = d['config']['AZURE_SUBSCRIPTION_ID']
AZURE_RESOURCE_GROUP = d['config']['AZURE_RESOURCE_GROUP']
AZURE_WORKSPACE = d['config']['AZURE_WORKSPACE']
AZURE_DATA_NAME = d['config']['AZURE_DATA_NAME']    
DATA_DIR = d['config']['DATA_DIR']
CLOUD_DIR = d['config']['CLOUD_DIR']
HF_MODEL_NAME_OR_PATH = d['config']['HF_MODEL_NAME_OR_PATH']
IS_DEBUG = d['config']['IS_DEBUG']
USE_LOWPRIORITY_VM = d['config']['USE_LOWPRIORITY_VM']

azure_env_name = d['train']['azure_env_name']  
azure_compute_cluster_name = d['train']['azure_compute_cluster_name']
azure_compute_cluster_size = d['train']['azure_compute_cluster_size']

import shutil
shutil.rmtree(DATA_DIR, ignore_errors=True)  # portable equivalent of `rm -rf`; also works on Windows
os.makedirs(DATA_DIR, exist_ok=True)
os.makedirs(CLOUD_DIR, exist_ok=True)
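
The cell above indexes the loaded configuration directly, so a missing entry surfaces as a bare `KeyError`. As a rough sketch, a small check like the one below could report every missing entry at once. The key names are taken from the cell; `find_missing_keys` and the sample dictionary are hypothetical and only illustrate the idea.

```python
from typing import Dict, List

# Keys the configuration cell reads from each section of config.yml.
REQUIRED_KEYS: Dict[str, List[str]] = {
    "config": [
        "AZURE_SUBSCRIPTION_ID", "AZURE_RESOURCE_GROUP", "AZURE_WORKSPACE",
        "AZURE_DATA_NAME", "DATA_DIR", "CLOUD_DIR", "HF_MODEL_NAME_OR_PATH",
        "IS_DEBUG", "USE_LOWPRIORITY_VM",
    ],
    "train": [
        "azure_env_name", "azure_compute_cluster_name", "azure_compute_cluster_size",
    ],
}


def find_missing_keys(cfg: dict) -> List[str]:
    """Return 'section.key' for every required entry absent from cfg."""
    missing = []
    for section, keys in REQUIRED_KEYS.items():
        # A missing or empty section counts as having no keys at all.
        present = cfg.get(section) or {}
        missing.extend(f"{section}.{key}" for key in keys if key not in present)
    return missing


# Made-up partial config, standing in for the dict returned by the YAML loader.
sample = {"config": {"DATA_DIR": "./dataset", "CLOUD_DIR": "./cloud"}, "train": {}}
problems = find_missing_keys(sample)
print(f"{len(problems)} missing keys, first: {problems[0]}")
```

Running such a check right after loading the file would fail early, before any directories are deleted or recreated.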

In [None]:
from azure.ai.ml.entities import AmlCompute

### Create the compute cluster
try:
    compute = ml_client.compute.get(azure_compute_cluster_name)
    print("The compute cluster already exists! Reusing it for the current run")
except Exception as ex:
    print(
        f"Looks like the compute cluster doesn't exist. Creating a new one with compute size {azure_compute_cluster_size}!"
    )
    try:
        tier = 'LowPriority' if USE_LOWPRIORITY_VM else 'Dedicated'
        print(f"Attempt #1 - Trying to create a {tier} compute")
        compute = AmlCompute(
            name=azure_compute_cluster_name,
            size=azure_compute_cluster_size,
            tier=tier,
            max_instances=1,  # For multi-node training, set this to an integer greater than 1
        )
        ml_client.compute.begin_create_or_update(compute).wait()
    except Exception as e:
        # Surface the real cause instead of a bare "Error", then stop the cell
        print(f"Failed to create the compute cluster: {e}")
        raise
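
The cell above follows a get-or-create pattern: look the cluster up, and create it only if the lookup fails. The sketch below isolates that pattern with injected callables so it can run without Azure. `NotFound`, `get_or_create`, and the in-memory `clusters` dictionary are hypothetical; the notebook does not show which exception the SDK raises for a missing cluster, so the narrow `except` is an assumption about how one might tighten the broad `except Exception`.

```python
from typing import Callable, Dict, TypeVar

T = TypeVar("T")


class NotFound(Exception):
    """Hypothetical stand-in for the service's 'resource not found' error."""


def get_or_create(get: Callable[[], T], create: Callable[[], T]) -> T:
    """Return the existing resource, creating it only when it is missing."""
    try:
        return get()
    except NotFound:
        # Only a missing resource triggers creation; other errors
        # (for example authentication failures) propagate to the caller.
        return create()


# In-memory stand-in for a workspace, used only to exercise the helper.
clusters: Dict[str, str] = {}


def get_cluster() -> str:
    if "demo-cluster" not in clusters:
        raise NotFound("demo-cluster")
    return clusters["demo-cluster"]


def create_cluster() -> str:
    clusters["demo-cluster"] = "Dedicated"
    return clusters["demo-cluster"]


first = get_or_create(get_cluster, create_cluster)   # creates the entry
second = get_or_create(get_cluster, create_cluster)  # reuses it
print(first, second)
```

Catching only the not-found case keeps unrelated failures visible instead of silently attempting to create a cluster.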

In [None]:
env = get_or_create_environment_asset(ml_client, azure_env_name, conda_yml=f"{CLOUD_DIR}/conda.yml", update=False)
data = get_or_create_data_asset(ml_client, AZURE_DATA_NAME, data_local_dir=DATA_DIR, update=False)