## **Project Idea: Predictive Supply Chain and Inventory Management System**

● Description: Develop an AI system that predicts inventory demand, monitors supply chain disruptions, and provides recommendations to optimize stock levels.

● Key Features:

    o LLM Fine-Tuning on Supply Chain Data: Fine-tune an LLM on historical supply chain data, logistics reports, and data from inventory management systems.
    o RAG for Supply Chain Insights: Use retrieval-augmented generation (RAG) to fetch real-time supply chain data, news, and market trends.
    o Agent for Optimization: Predict inventory requirements and optimize supply chain routes and stock levels.

● Steps:

    o Collect datasets of supply chain logs, inventory data, and product demand forecasts.
    o Fine-tune the LLM to understand supply chain operations and logistics.
    o Implement RAG to fetch real-time data on inventory, suppliers, and logistics disruptions.
    o Build an agent that provides real-time supply chain optimization and recommendations.
    o Evaluate the system’s performance using real-world supply chain scenarios.

Install Unsloth, an open-source fine-tuning library, and use its FastLanguageModel class. This project uses parameter-efficient fine-tuning (PEFT) based on LoRA.

In [None]:
!pip install unsloth trl peft accelerate bitsandbytes

In [None]:
from unsloth import FastLanguageModel
import torch
from transformers import BitsAndBytesConfig

max_seq_length = 1024
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'None'}")

quantization_config = BitsAndBytesConfig(
    load_in_4bit = True,
    bnb_4bit_quant_type = "nf4",
    bnb_4bit_compute_dtype = torch.float16,
    bnb_4bit_use_double_quant = True,
    llm_int8_enable_fp32_cpu_offload = True, # Crucial for allowing 32-bit CPU offload
)

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-Instruct",
    max_seq_length = max_seq_length,
    device_map = "auto",
    quantization_config = quantization_config,
    dtype = torch.float16,
)
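
A back-of-the-envelope estimate shows why the model is loaded in 4-bit. The sketch below assumes roughly 8.03 billion parameters for Llama 3 8B and counts only the stored weights. It ignores quantization constants, layers that stay in higher precision, activations, and the KV cache, so real usage will be higher.

```python
# Rough weight-memory estimate; the parameter count is an approximation.
PARAMS = 8.03e9  # assumed approximate parameter count of Llama 3 8B

def weight_gib(num_params: float, bits_per_param: float) -> float:
    """Return the memory needed to store the weights, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

fp16_gib = weight_gib(PARAMS, 16)  # half-precision weights
nf4_gib = weight_gib(PARAMS, 4)    # 4-bit quantized weights

print(f"fp16 weights:  {fp16_gib:.1f} GiB")
print(f"4-bit weights: {nf4_gib:.1f} GiB")
```

Under these assumptions the 4-bit weights take a quarter of the fp16 footprint, which is what lets an 8B model fit on a single consumer or free-tier GPU.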

I will now add LoRA adapters for parameter-efficient fine-tuning, so that only a small fraction (under 1%) of the model's parameters is trained.

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj","gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)
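
The "under 1%" figure can be checked by hand. For each adapted linear layer, LoRA adds two small matrices with `r * (d_in + d_out)` parameters in total. The sketch below assumes the published Llama 3 8B dimensions (hidden size 4096, 8 key/value heads of dimension 128, feed-forward size 14336, 32 layers) and an approximate total of 8.03 billion parameters. Treat these as assumptions rather than values read from the loaded model.

```python
# LoRA adds two matrices per adapted layer: A (r x d_in) and B (d_out x r).
R = 16          # LoRA rank used in the cell above
HIDDEN = 4096   # assumed Llama 3 8B hidden size
KV_DIM = 1024   # assumed: 8 key/value heads x 128 head dimension
MLP = 14336     # assumed feed-forward (intermediate) size
LAYERS = 32     # assumed number of transformer blocks

# (d_in, d_out) for each target module listed in target_modules
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

per_layer = sum(R * (d_in + d_out) for d_in, d_out in shapes.values())
trainable = per_layer * LAYERS
fraction = trainable / 8.03e9  # assumed approximate total parameter count

print(f"{trainable:,} trainable parameters ({fraction:.2%} of the model)")
```

With these dimensions the adapters come to about 42 million parameters, roughly half a percent of the base model.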

Data Prep Stage

In [None]:
# Load a sample supply chain dataset from Hugging Face for demo purposes.
# A real deployment would use the company's historical data for better grounding.
from datasets import load_dataset
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "phi-4",
)

ds = load_dataset("alalfi/SupplyChainDataset", split = "train")

I will now convert each dataset row into a user/assistant conversation and render it as text with the tokenizer's chat template.

In [None]:
from unsloth.chat_templates import get_chat_template

# Redefine tokenizer to ensure the chat template is applied
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "phi-4",
)

def formatting_prompts_func(examples):
    texts = []
    for i in range(len(examples['Product_Name'])):
        product_name = examples['Product_Name'][i]
        # Assuming 'Order_Item_Quantity' is a suitable replacement for 'Stock Quantity'
        stock_quantity = examples['Order_Item_Quantity'][i]
        # Assuming 'Days_for_shipping_(real)' is a suitable replacement for 'Lead Time (days)'
        lead_time = examples['Days_for_shipping_(real)'][i]
        # Assuming delivery status for getting risks of late deliveries as an example
        delivery_status = examples['Delivery_Status'][i]
        # Add new columns for more comprehensive insights
        late_delivery_risk = examples['Late_delivery_risk'][i]
        order_status = examples['Order_Status'][i]
        shipping_mode = examples['Shipping_Mode'][i]
        sales_amount = examples['Sales'][i]

        # Additional fields for extra context (extracted here but not used in the prompt below)
        customer_city = examples['Customer_City'][i]
        customer_country = examples['Customer_Country'][i]
        customer_segment = examples['Customer_Segment'][i]
        department_name = examples['Department_Name'][i]
        product_price = examples['Product_Price'][i]
        product_description = examples['Product_Description'][i]
        # Assuming 'Category_Name' is a suitable replacement for 'Category'
        category = examples['Category_Name'][i]

        # Create a simple prompt-response structure for fine-tuning.
        # This can be customized based on what you want the LLM to learn.
        conversation = [
            {"role": "user", "content": f"What are the key supply chain details for product '{product_name}'?"},
            {"role": "assistant", "content": f"Product Name: {product_name}, Category: {category}, Stock Quantity: {stock_quantity}, Lead Time: {lead_time} days, Delivery Status: {delivery_status}, Late Delivery Risk: {late_delivery_risk}, Order Status: {order_status}, Shipping Mode: {shipping_mode}, Sales: {sales_amount}."}
        ]
        texts.append(tokenizer.apply_chat_template(
            conversation, tokenize=False, add_generation_prompt=False
        ))
    return { "text" : texts, }

# Apply the custom formatting function directly to the dataset
dataset = ds.map(
    formatting_prompts_func,
    batched=True,
    remove_columns=ds.column_names # Remove original columns if not needed after text generation
)
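
With `batched=True`, `map` passes the function a dictionary of column lists and expects a dictionary of equal-length lists back. The toy sketch below shows that contract without loading a model. `render_chat` is a made-up stand-in for `tokenizer.apply_chat_template`, and the two sample rows are invented.

```python
# Stand-in for tokenizer.apply_chat_template; the tag format is made up.
def render_chat(conversation):
    return "".join(
        f"<{m['role']}>{m['content']}</{m['role']}>" for m in conversation
    )

def format_batch(examples):
    """Turn a batch (dict of column lists) into one text string per row."""
    texts = []
    for name, mode in zip(examples["Product_Name"], examples["Shipping_Mode"]):
        conversation = [
            {"role": "user", "content": f"How is '{name}' shipped?"},
            {"role": "assistant", "content": f"Shipping Mode: {mode}."},
        ]
        texts.append(render_chat(conversation))
    return {"text": texts}

# Invented sample rows in the same column-oriented layout map() provides
batch = {
    "Product_Name": ["Widget A", "Widget B"],
    "Shipping_Mode": ["Standard Class", "First Class"],
}
out = format_batch(batch)
print(out["text"][0])
```

Each input row yields exactly one entry in `text`, which is why the original columns can be dropped afterwards with `remove_columns`.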

Training the model

In [None]:
from trl import SFTConfig, SFTTrainer
from transformers import DataCollatorForSeq2Seq
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    packing = False, # Setting this to True can make training up to 5x faster for short sequences.
    args = SFTConfig(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 8, # 8 for more stable gradients
        warmup_steps = 5,
        num_train_epochs = 2, # Ignored while max_steps is positive
        max_steps = 30, # A positive max_steps overrides num_train_epochs
        learning_rate = 1e-5,
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.1,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none",
        fp16 = True,
        bf16 = False,
        max_grad_norm = 0.3, # Added gradient clipping to prevent exploding gradients
    ),
)
# Return the tokenized columns as PyTorch tensors when rows are accessed
trainer.train_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "text"])
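
The batch settings above determine how much data this short run actually sees. In the Hugging Face `Trainer`, a positive `max_steps` takes precedence over `num_train_epochs`, so the run stops after 30 optimizer steps. The arithmetic below assumes a single GPU.

```python
per_device_train_batch_size = 2
gradient_accumulation_steps = 8
max_steps = 30
num_gpus = 1  # assumption: single-GPU run

# Examples contributing to each optimizer step
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
# Total examples processed before training stops
examples_seen = effective_batch * max_steps

print(f"Effective batch size: {effective_batch}")
print(f"Examples seen in {max_steps} steps: {examples_seen}")
```

This gives an effective batch size of 16 and 480 examples in total. That is enough for a quick demo, but likely only a small slice of the dataset.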

In [None]:
tokenizer.decode(trainer.train_dataset[0]["input_ids"])

In [None]:
space = tokenizer(" ", add_special_tokens = False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["input_ids"]])
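
The `-100` check in the cell above refers to the label value that PyTorch's cross-entropy loss ignores by default. It shows up in a `labels` column once prompt tokens are masked so that the loss covers only the assistant's reply. Here is a minimal sketch of that masking, using invented token ids and a hypothetical `mask_prompt` helper rather than any library call:

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def mask_prompt(input_ids, response_start):
    """Copy input_ids into labels, hiding every token before the response."""
    return [
        IGNORE_INDEX if i < response_start else tok
        for i, tok in enumerate(input_ids)
    ]

# Invented ids: the first 5 tokens are the prompt, the last 3 the reply
input_ids = [101, 7, 8, 9, 102, 42, 43, 44]
labels = mask_prompt(input_ids, response_start=5)
print(labels)
```

Replacing `-100` with a space token before decoding, as the cell above does, makes the masked positions visible as blanks instead of raising a decoding error.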

Show System Stats

In [None]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

In [None]:
# Run training and keep the returned stats

trainer_stats = trainer.train()

In [None]:
display(trainer_stats)

Print time and memory stats

In [None]:
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

Running inference with the model

In [None]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "phi-4",
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    # Swap which line is commented out to compare a question the model cannot fully answer with one it may have seen in training.
    #{"role": "user", "content": "What are the best strategies to optimize inventory stock levels and reduce holding costs?"}, # Not covered by the training data
    {"role": "user", "content": "What products are at risk of late delivery?"}, # May be answerable from the training data
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(
    input_ids = inputs, max_new_tokens = 150, use_cache = True, temperature = 1.5, min_p = 0.1
)
tokenizer.batch_decode(outputs)
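
The `temperature` and `min_p` arguments shape the distribution that tokens are sampled from. Temperature divides the logits before the softmax, and min-p then drops every token whose probability is below `min_p` times that of the most likely token. The pure-Python sketch below illustrates the idea on four invented logits. It is not the library's implementation.

```python
import math

def min_p_filter(logits, temperature=1.5, min_p=0.1):
    """Return renormalized probabilities after temperature scaling and min-p filtering."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep only tokens at least min_p times as likely as the top token
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    norm = sum(kept)
    return [p / norm for p in kept]

probs = min_p_filter([3.0, 2.0, 0.0, -3.0])  # invented logits
print([round(p, 3) for p in probs])
```

A temperature above 1 flattens the distribution and makes output more varied, while min-p removes the unlikely tail that a high temperature would otherwise promote.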

In [None]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "Which customer is impacted by late deliveries?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True,
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(
    input_ids = inputs, streamer = text_streamer, max_new_tokens = 100,
    use_cache = False, temperature = 1.5, min_p = 0.1
)

Save pretrained model for later use

In [None]:
model.save_pretrained("lora_model")  # Local saving
tokenizer.save_pretrained("lora_model")
model.push_to_hub("prateekmarda/lora_model", token = "") # Online saving; the token was intentionally removed, so supply your own Hugging Face access token
tokenizer.push_to_hub("prateekmarda/lora_model", token = "") # Online saving; supply your own Hugging Face access token here as well
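
Rather than pasting an access token into the notebook, one option is to read it from an environment variable. The helper below is a hypothetical sketch. `get_hf_token` is not part of any library, and `HF_TOKEN` is an assumed variable name that you would set yourself before running the cell.

```python
import os

def get_hf_token(var_name="HF_TOKEN"):
    """Read the access token from the environment and fail early if it is missing."""
    token = os.environ.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return token

# Usage (commented out so the sketch runs without credentials):
# model.push_to_hub("prateekmarda/lora_model", token = get_hf_token())
# tokenizer.push_to_hub("prateekmarda/lora_model", token = get_hf_token())
```

Keeping the token out of the notebook means it does not leak when the file is shared or archived.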

## This completes Part 1 of the project: fine-tuning an LLM on sample supply chain data

In [None]:
!zip GGI5A_notebook_archive_S12456.zip GGI5A_S12456_Capstone_SCM_Part1_Finetuning_LLM.ipynb

In [None]:
!ls