# Local Inference with Azure Foundry Local

This notebook demonstrates how to use Azure Foundry Local to run inference with your optimized model on your local machine. Azure Foundry Local provides a simple way to serve and interact with large language models locally, including those you have fine-tuned and exported from Azure ML.

## What You'll Learn
- How to install and configure Azure Foundry Local
- How to launch a local model server using Foundry
- How to send prompts and receive completions from your model
- How to use the Foundry Python SDK for local inference



## 2. Prepare Your Model and Config for Foundry Local

- Ensure your model and any adapters (such as LoRA) are exported in a format supported by Foundry Local (e.g., ONNX, GGUF, or HuggingFace Transformers format).
- Place your model files in a directory, e.g., `./LocalFoundryEnv/`.
- Create or update an `inference_model.json` config file in that directory, following the [Foundry Local model config guide](https://github.com/microsoft/Foundry-Local/blob/main/docs/model-config.md).

Example `inference_model.json`:
```json
{
  "model_format": "onnx",
  "model_path": "./phi-4-mini-onnx-int4-cpu/1/model",
  "adapter_path": "./phi-4-mini-onnx-int4-cpu/1/model/adapter_weights.onnx_adapter",
  "chat_template": "You are a helpful assistant. Your output should only be one of the five choices: 'A', 'B', 'C', 'D', or 'E'."
}
```
> Tip: If you used 05.Local_Download.ipynb, your model files should already be in a suitable directory. Just add or edit the config file as above.
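
Before launching anything, it can help to sanity-check the config file. The sketch below is illustrative only: the helper name `check_inference_config` is hypothetical, the key names come from the example above, and treating `model_format` and `model_path` as required is an assumption rather than something taken from the Foundry Local documentation.

```python
import json
from pathlib import Path

# Key names come from the example config above. Which ones are mandatory
# is an assumption of this sketch.
REQUIRED_KEYS = ("model_format", "model_path")
PATH_KEYS = ("model_path", "adapter_path")


def check_inference_config(config_path):
    """Return a list of human-readable problems found in the config file."""
    config_path = Path(config_path)
    try:
        config = json.loads(config_path.read_text(encoding="utf-8"))
    except FileNotFoundError:
        return [f"config file not found: {config_path}"]
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(config, dict):
        return ["top-level JSON value must be an object"]

    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in config]

    # Relative paths are resolved against the directory that holds the config.
    for key in PATH_KEYS:
        if key in config and not (config_path.parent / config[key]).exists():
            problems.append(f"{key} does not exist: {config[key]}")
    return problems


if __name__ == "__main__":
    for problem in check_inference_config("./LocalFoundryEnv/inference_model.json"):
        print("WARNING:", problem)
```

An empty list means the file parsed and every listed path resolved. It does not guarantee that Foundry Local will accept the config.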


## 3. Install Foundry Local (If Not Already Installed)

Download Foundry Local for your platform from the releases page.

Follow the on-screen prompts to install the package. After installation, the tool is available from the command line as `foundry`.
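
To confirm from Python that the installation put the CLI on your `PATH`, a minimal check like the following may be enough. The helper name `find_foundry_cli` is hypothetical; the only external assumption is that the executable is called `foundry`, as stated above.

```python
import shutil


def find_foundry_cli(executable="foundry"):
    """Return the full path of the CLI executable, or None if it is not on PATH."""
    return shutil.which(executable)


if __name__ == "__main__":
    cli_path = find_foundry_cli()
    if cli_path is None:
        print("`foundry` was not found on PATH; reopen the terminal or reinstall.")
    else:
        print(f"Foundry Local CLI found at: {cli_path}")
```

If the check fails right after installation, opening a new terminal session is usually the first thing to try, since `PATH` changes are not picked up by shells that were already running.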


## 4. Running Your First Model
- Open a command prompt or terminal window.
- Run a model using the following command: `foundry model run deepseek-r1-1.5b-cpu`

This command will:

- Download the model to your local disk
- Load the model into your device
- Start a chat interface

💡 TIP: Replace `deepseek-r1-1.5b-cpu` with any model from the catalog; run `foundry model list` to see the available models. Here, replace it with the model you downloaded locally in 05.Local_Download.ipynb. If you used the default settings in that notebook, the model name is `fine-tuning-phi-4-mini-onnx-int4-cpu`.


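In [None]:
# Install the Foundry Local SDK and the OpenAI client if not already installed
import importlib.util
import subprocess
import sys

def install_package(package, import_name=None):
    """Install `package` with pip unless `import_name` is already importable."""
    import_name = import_name or package
    if importlib.util.find_spec(import_name) is not None:
        print(f"{package} is already installed.")
    else:
        print(f"Installing {package}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])

# The pip name and the import name differ for the SDK
# (assumes the `foundry-local-sdk` package, imported as `foundry_local`).
install_package("foundry-local-sdk", "foundry_local")
install_package("openai")


## 5. Connect to the Local Foundry Server and Run Inference

Now use the Foundry Local SDK to locate your local server, then send prompts for inference through its OpenAI-compatible endpoint.


In [None]:

from foundry_local import FoundryLocalManager
from openai import OpenAI

# Start (or attach to) the local Foundry service and discover its endpoint.
# Assumes FoundryLocalManager exposes `endpoint` and `api_key`; verify this
# against your installed SDK version, or pass the URL reported by
# `foundry service status` as base_url instead.
manager = FoundryLocalManager()
client = OpenAI(base_url=manager.endpoint, api_key=manager.api_key)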

# Example multiple-choice question
question = "Which planet is closest to the Sun?"
choices = {
    "A": "Venus",
    "B": "Earth",
    "C": "Mercury",
    "D": "Mars",
    "E": "Jupiter"
}
choice_text = "\n".join([f"({k}) {v}" for k, v in choices.items()])
prompt = (
    "Answer the following multiple-choice question by selecting the correct option.\n\n"
    f"Question: {question}\nAnswer Choices:\n{choice_text}"
)

system_prompt = "You are a helpful assistant. Your output should only be one of the five choices: 'A', 'B', 'C', 'D', or 'E'."

response = client.chat.completions.create(
    model="local",  # assumed model name; replace it with the name your Foundry Local service reports for your model
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": prompt}
    ],
    max_tokens=10,
    temperature=0.0
)

print("Model response:", response.choices[0].message.content)
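
Even with a strict system prompt, a model may wrap its answer in extra text such as `(C)` or `The answer is C.`. The sketch below shows one way to normalize such replies to a single letter. The helper name `extract_choice` is hypothetical and the matching rule is a heuristic, not part of any Foundry API.

```python
import re

VALID_CHOICES = "ABCDE"


def extract_choice(text, valid=VALID_CHOICES):
    """Return the choice letter found in `text`, or None if there is none."""
    # Case 1: the reply is a single letter, possibly wrapped in punctuation.
    cleaned = text.strip().strip("()[].:'\" ")
    if len(cleaned) == 1 and cleaned.upper() in valid:
        return cleaned.upper()
    # Case 2: look for the first standalone uppercase letter in a longer reply.
    match = re.search(rf"\b([{valid}])\b", text)
    return match.group(1) if match else None


if __name__ == "__main__":
    for reply in ["C", " (c) ", "The answer is C.", "I am not sure"]:
        print(repr(reply), "->", extract_choice(reply))
```

The second case only matches uppercase letters so that a lowercase article such as "a" is not mistaken for choice A. A sentence that begins with the word "A" would still be misread, so treat the result as a best guess.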


## 6. Try Your Own Questions

You can now use the `client` object to send any prompt to your local model. Try with your own multiple-choice questions or other tasks supported by your model.


In [None]:

def ask_foundry_mcq(client, question, choices):
    choice_text = "\n".join([f"({k}) {v}" for k, v in choices.items()])
    prompt = (
        "Answer the following multiple-choice question by selecting the correct option.\n\n"
        f"Question: {question}\nAnswer Choices:\n{choice_text}"
    )
    system_prompt = "You are a helpful assistant. Your output should only be one of the five choices: 'A', 'B', 'C', 'D', or 'E'."
    response = client.chat.completions.create(
        model="local",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt}
        ],
        max_tokens=10,
        temperature=0.0
    )
    print("Q:", question)
    print("A:", response.choices[0].message.content)



# Example usage


In [None]:
ask_foundry_mcq(
    client,
    "What is the capital of France?",
    {
        "A": "Berlin",
        "B": "London",
        "C": "Paris",
        "D": "Madrid",
        "E": "Rome"
    }
)
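
To go from single questions to a small evaluation, you could score a batch of questions against known answers. The sketch below is illustrative: `score_mcq_batch` and `fake_ask` are hypothetical names, and it assumes an `ask_fn(question, choices)` callable that returns the model's reply as a string. The `ask_foundry_mcq` helper above prints its result, so it would need to return the reply text instead to be used this way.

```python
def score_mcq_batch(ask_fn, items):
    """Return the fraction of items answered correctly.

    ask_fn(question, choices) must return the model's raw text reply.
    Each item is a (question, choices, correct_letter) tuple.
    """
    if not items:
        return 0.0
    correct = 0
    for question, choices, expected in items:
        # Drop surrounding whitespace and a leading bracket or quote, e.g. "(C)".
        reply = ask_fn(question, choices).strip().lstrip("('\"").upper()
        # Count the reply as correct only if it starts with the expected letter.
        if reply.startswith(expected.upper()):
            correct += 1
    return correct / len(items)


def fake_ask(question, choices):
    """Stand-in for a real model call; always answers 'C'."""
    return "C"


if __name__ == "__main__":
    items = [
        ("What is the capital of France?",
         {"A": "Berlin", "B": "London", "C": "Paris", "D": "Madrid", "E": "Rome"},
         "C"),
        ("Which planet is closest to the Sun?",
         {"A": "Venus", "B": "Earth", "C": "Mercury", "D": "Mars", "E": "Jupiter"},
         "C"),
    ]
    print(f"accuracy: {score_mcq_batch(fake_ask, items):.2f}")
```

Passing the model call in as a function keeps the scoring logic testable without a running server. Swap `fake_ask` for a function that calls your local client once the server is up.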


## 7. Next Steps

- Explore more advanced prompt engineering and system instructions
- Benchmark your model's performance locally
- Integrate the local Foundry server into your applications
- For more details, see the [Foundry Local documentation](https://github.com/microsoft/Foundry-Local/tree/main/docs)
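
As a starting point for local benchmarking, the following sketch times repeated calls to any zero-argument function. The helper name `time_calls` is hypothetical. It measures wall-clock latency only and says nothing about token throughput or answer quality.

```python
import statistics
import time


def time_calls(fn, repeats=5):
    """Call fn() `repeats` times and return (mean_seconds, max_seconds)."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations), max(durations)


if __name__ == "__main__":
    # Replace the lambda with a real request, for example a call to
    # client.chat.completions.create(...) using the client from section 5.
    mean_s, worst_s = time_calls(lambda: sum(range(100_000)), repeats=3)
    print(f"mean: {mean_s:.4f}s, worst: {worst_s:.4f}s")
```

The first request after a model loads is often slower than later ones, so consider discarding a warm-up call before recording timings.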


**Congratulations!** You have successfully run local inference with your optimized model using Azure Foundry Local.

