# Foundry Local RAG Implementation Guide

This notebook demonstrates how to build a Retrieval-Augmented Generation (RAG) system using Foundry Local with Semantic Kernel, ONNX embeddings, and the Qdrant vector database.

## Package Installation

First, we install the required NuGet packages for Semantic Kernel and related components.

In [1]:
#r "nuget: Microsoft.SemanticKernel, 1.60.0"

### Install Microsoft Semantic Kernel Core Package

Installing the main Semantic Kernel package which provides the core functionality for building AI applications.

In [2]:
#r "nuget: Microsoft.SemanticKernel.Connectors.Onnx, 1.60.0-alpha"

### Install Semantic Kernel ONNX Connector

Installing the ONNX connector package which enables using ONNX models for embeddings generation in Semantic Kernel.

In [4]:
#r "nuget: Microsoft.SemanticKernel.Connectors.Qdrant, 1.60.0-preview"

### Install Semantic Kernel Qdrant Connector

Installing the Qdrant connector package to enable vector database operations with Semantic Kernel.

In [5]:
#r "nuget: Qdrant.Client, 1.14.1"

### Install Qdrant Client

Installing the official Qdrant client library for direct communication with the Qdrant vector database.

In [6]:
using Microsoft.SemanticKernel;

## Setup and Configuration

### Import Semantic Kernel

Importing the core Semantic Kernel namespace to access the main functionality.

In [7]:
var builder = Kernel.CreateBuilder();

### Create Kernel Builder

Creating a kernel builder instance which will be used to configure and build the Semantic Kernel with various services.

In [None]:
var embeddModelPath = "Your Jinaai jina-embeddings-v2-base-en onnx model path";
var embedVocab = "Your Jinaai jina-embeddings-v2-base-en vocab file path";

### Define Embedding Model Paths

Setting up file paths for the Jina embedding model: the ONNX model file and the vocabulary file used to tokenize text before embedding. Replace both placeholder strings with paths on your machine.
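Because both paths are placeholders, a wrong path only surfaces later when the embedding generator tries to load the model. A minimal sketch of an early check, assuming a hypothetical helper named RequireFile that is not part of Semantic Kernel or the original notebook:

```csharp
using System;
using System.IO;

// Hypothetical helper (not from the notebook): returns the path unchanged
// if the file exists, otherwise throws with a descriptive message.
string RequireFile(string path, string description)
{
    if (!File.Exists(path))
        throw new FileNotFoundException($"{description} not found at '{path}'.", path);
    return path;
}

// Possible usage with the variables from the cell above:
// RequireFile(embeddModelPath, "ONNX embedding model");
// RequireFile(embedVocab, "Vocabulary file");
```

Failing at this point keeps the error message close to the cell that needs editing.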

In [9]:
builder.AddBertOnnxEmbeddingGenerator(embeddModelPath, embedVocab);
builder.AddOpenAIChatCompletion("qwen2.5-0.5b-instruct-generic-gpu", new Uri("http://localhost:5273/v1"), apiKey: "", serviceId: "qwen2.5-0.5b");

### Configure AI Services

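Adding the BERT ONNX embedding generator and an OpenAI-compatible chat completion service to the kernel builder. The chat service connects to a Foundry Local instance at http://localhost:5273/v1 running the qwen2.5-0.5b-instruct-generic-gpu model, registered under the service ID "qwen2.5-0.5b". The API key is left empty; adjust the port if your Foundry Local service listens on a different one.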

In [10]:
var kernel = builder.Build();

### Build the Kernel

Building the final kernel instance with all configured services (embedding generator and chat completion service).

In [11]:
using Microsoft.SemanticKernel.Embeddings;
using Microsoft.SemanticKernel.ChatCompletion;
using Microsoft.Extensions.AI;



### Import Additional Required Namespaces

Importing namespaces for embeddings, chat completion, and Microsoft Extensions AI functionality.

In [12]:
using System.Net.Http;

### Import HTTP Client

Importing System.Net.Http for HTTP communication capabilities.

In [13]:

using Microsoft.SemanticKernel.Memory;
using Microsoft.SemanticKernel.Connectors.Qdrant;

### Import Memory and Vector Database Connectors

Importing Semantic Kernel memory functionality and Qdrant connector for vector database operations.

In [14]:
using Qdrant.Client;
using Qdrant.Client.Grpc;

### Import Qdrant Client Libraries

Importing the Qdrant client and gRPC libraries for direct communication with the Qdrant vector database.

In [15]:

public class VectorStoreService
{
    private readonly QdrantClient _client;
    private readonly string _collectionName;

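    public VectorStoreService(string endpoint, string apiKey, string collectionName)
    {
        // Pass the API key only when one is supplied; an empty string means no authentication.
        _client = new QdrantClient(new Uri(endpoint), string.IsNullOrEmpty(apiKey) ? null : apiKey);
        _collectionName = collectionName;
    }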

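    public async Task InitializeAsync(int vectorSize = 768)
    {
        // Create the collection only if it is missing, rather than relying on
        // a bare catch that would also hide connection or authentication errors.
        if (!await _client.CollectionExistsAsync(_collectionName))
        {
            await _client.CreateCollectionAsync(_collectionName, new VectorParams
            {
                Size = (ulong)vectorSize,
                Distance = Distance.Cosine
            });
        }
    }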

    public async Task UpsertAsync(string id, ReadOnlyMemory<float> embedding, Dictionary<string, object> metadata)
    {
        var point = new PointStruct
        {
            Id = new PointId { Uuid = id },
            Vectors = embedding.ToArray(),
            Payload = { }
        };

        foreach (var kvp in metadata)
        {
            point.Payload[kvp.Key] = kvp.Value switch
            {
                string s => s,
                int i => i,
                bool b => b,
                _ => kvp.Value.ToString() ?? string.Empty
            };
        }

        await _client.UpsertAsync(_collectionName, new[] { point });
    }

    public async Task<List<ScoredPoint>> SearchAsync(ReadOnlyMemory<float> queryEmbedding, int limit = 3)
    {
        var searchResult = await _client.SearchAsync(_collectionName, queryEmbedding.ToArray(), limit: (ulong)limit);
        return searchResult.ToList();
    }
}

## Service Classes

### Vector Store Service Class

This class provides a wrapper around the Qdrant client to handle vector database operations including:
- Collection initialization with proper vector configuration
- Upserting vectors with metadata
- Searching for similar vectors using cosine similarity
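
Qdrant computes the similarity score on the server, so the notebook never calculates it directly. To make the metric concrete, here is a minimal sketch of cosine similarity on plain arrays; the function name CosineSimilarity is hypothetical and is not part of the Qdrant client.

```csharp
using System;

// Cosine similarity: dot(a, b) / (|a| * |b|).
// Returns 0 when either vector has zero length.
double CosineSimilarity(float[] a, float[] b)
{
    if (a.Length != b.Length)
        throw new ArgumentException("Vectors must have the same dimension.");

    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }

    if (normA == 0 || normB == 0) return 0;
    return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}

var queryVector = new float[] { 1f, 0f };
Console.WriteLine(CosineSimilarity(queryVector, new float[] { 2f, 0f })); // same direction
Console.WriteLine(CosineSimilarity(queryVector, new float[] { 0f, 3f })); // orthogonal
```

Vectors pointing the same way score close to 1 regardless of magnitude, and orthogonal vectors score 0, which is why the metric suits embeddings of texts with different lengths.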

In [16]:

public class RagQueryService
{
    private readonly IEmbeddingGenerator<string, Embedding<float>> _embeddingService;
    private readonly IChatCompletionService _chatService;
    private readonly VectorStoreService _vectorStoreService;

    public RagQueryService(
        IEmbeddingGenerator<string, Embedding<float>> embeddingService,
        IChatCompletionService chatService,
        VectorStoreService vectorStoreService)
    {
        _embeddingService = embeddingService;
        _chatService = chatService;
        _vectorStoreService = vectorStoreService;
    }

    public async Task<string> QueryAsync(string question)
    {
        // Embed the question and retrieve the most similar chunks.
        var queryEmbeddingResult = await _embeddingService.GenerateAsync(question);
        var queryEmbedding = queryEmbeddingResult.Vector;
        var searchResults = await _vectorStoreService.SearchAsync(queryEmbedding, limit: 5);

        // Collect the stored chunk text from each hit's payload.
        var contextParts = new List<string>();
        foreach (var result in searchResults)
        {
            if (result.Payload.TryGetValue("text", out var text))
            {
                // StringValue yields the raw string; ToString() on a payload
                // value would include the protobuf wrapper formatting.
                contextParts.Add(text.StringValue);
            }
        }
        var context = string.Join("\n\n", contextParts);
        var prompt = $"According to the question {question}, optimize and simplify the content.\n\n{context}";

        var chatHistory = new ChatHistory();
        chatHistory.AddSystemMessage("You are a helpful assistant that answers questions based on the provided context.");
        chatHistory.AddUserMessage(prompt);

        // Accumulate the streamed response into a single string.
        var fullMessage = string.Empty;
        await foreach (var chatUpdate in _chatService.GetStreamingChatMessageContentsAsync(chatHistory, cancellationToken: default))
        {
            if (chatUpdate.Content is { Length: > 0 })
            {
                fullMessage += chatUpdate.Content;
            }
        }
        return fullMessage.Length > 0 ? fullMessage : "I couldn't generate a response.";
    }
}

### RAG Query Service Class

This service implements the core RAG (Retrieval-Augmented Generation) functionality:
1. Converts user questions into embeddings
2. Searches for relevant context from the vector database
3. Combines the retrieved context with the user question
4. Generates responses using the chat completion service
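
Step 3 is plain string assembly, so it can be tried without a model or database. The following is a minimal sketch, assuming a hypothetical helper named BuildPrompt and made-up chunk strings; it separates chunks with blank lines so the model can tell where one ends and the next begins.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not from the notebook): joins the retrieved chunks
// with blank lines and places them after the instruction and question.
string BuildPrompt(string question, IEnumerable<string> chunks)
{
    var context = string.Join("\n\n", chunks);
    return $"According to the question {question}, optimize and simplify the content.\n\n{context}";
}

// Illustrative chunks only; real ones come from the vector search results.
var exampleChunks = new List<string> { "First retrieved chunk.", "Second retrieved chunk." };
Console.WriteLine(BuildPrompt("What's Foundry Local?", exampleChunks));
```

Printing the assembled prompt like this is a quick way to check what the chat model actually receives when an answer looks off.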

In [17]:
using System.IO;

### Import File I/O

Importing System.IO for file reading operations needed for document ingestion.

In [18]:

public class DocumentIngestionService
{
    private readonly IEmbeddingGenerator<string, Embedding<float>> _embeddingService;
    private readonly VectorStoreService _vectorStoreService;

    public DocumentIngestionService(IEmbeddingGenerator<string, Embedding<float>> embeddingService, VectorStoreService vectorStoreService)
    {
        _embeddingService = embeddingService;
        _vectorStoreService = vectorStoreService;
    }

    public async Task IngestDocumentAsync(string documentPath, string documentId)
    {
        var content = await File.ReadAllTextAsync(documentPath);
        var chunks = ChunkText(content, 300, 60);

        for (int i = 0; i < chunks.Count; i++)
        {
            var chunk = chunks[i];
            var embeddingResult = await _embeddingService.GenerateAsync(chunk);
            var embedding = embeddingResult.Vector;
            
            await _vectorStoreService.UpsertAsync(
                id: Guid.NewGuid().ToString(),
                embedding: embedding,
                metadata: new Dictionary<string, object>
                {
                    ["document_id"] = documentId,
                    ["chunk_index"] = i,
                    ["text"] = chunk,
                    ["document_path"] = documentPath
                }
            );
        }
    }

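    private List<string> ChunkText(string text, int chunkSize, int overlap)
    {
        // The loop advances by chunkSize - overlap, so the overlap must be
        // smaller than the chunk size or the loop would never move forward.
        if (overlap < 0 || overlap >= chunkSize)
            throw new ArgumentException("overlap must be non-negative and smaller than chunkSize.");

        var chunks = new List<string>();
        var words = text.Split(' ', StringSplitOptions.RemoveEmptyEntries);

        for (int i = 0; i < words.Length; i += chunkSize - overlap)
        {
            // Take up to chunkSize words starting at the current offset.
            var chunkWords = words.Skip(i).Take(chunkSize).ToArray();
            var chunk = string.Join(" ", chunkWords);
            chunks.Add(chunk);

            // Stop once this chunk reaches the end of the text.
            if (i + chunkSize >= words.Length)
                break;
        }

        return chunks;
    }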
}

### Document Ingestion Service Class

This service handles the process of ingesting documents into the vector database:
1. Reads document content from files
2. Splits text into chunks with configurable size and overlap
3. Generates embeddings for each chunk
4. Stores chunks with embeddings and metadata in the vector database
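
With a chunk size of 300 words and an overlap of 60, consecutive chunks start 240 words apart. The same logic is easier to follow with small numbers. Below is a minimal standalone sketch; the function name ChunkWords and the sample text are illustrative and mirror the word-based approach of the service above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Splits text into word-based chunks where each chunk repeats the last
// `overlap` words of the previous one.
List<string> ChunkWords(string text, int chunkSize, int overlap)
{
    if (overlap < 0 || overlap >= chunkSize)
        throw new ArgumentException("overlap must be non-negative and smaller than chunkSize.");

    var words = text.Split(' ', StringSplitOptions.RemoveEmptyEntries);
    var chunks = new List<string>();
    int stride = chunkSize - overlap; // distance between chunk start positions

    for (int i = 0; i < words.Length; i += stride)
    {
        chunks.Add(string.Join(" ", words.Skip(i).Take(chunkSize)));
        if (i + chunkSize >= words.Length) break; // last chunk reached the end
    }
    return chunks;
}

// Ten words, chunks of 4 with 1 word of overlap (stride 3).
foreach (var piece in ChunkWords("a b c d e f g h i j", chunkSize: 4, overlap: 1))
    Console.WriteLine(piece);
// Prints:
// a b c d
// d e f g
// g h i j
```

The repeated word at each boundary is the overlap; it keeps a sentence that straddles two chunks retrievable from either side.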

In [19]:

using Microsoft.SemanticKernel.ChatCompletion;

### Additional Chat Completion Import

Re-importing the chat completion namespace. This repeats an import from an earlier cell, so it is redundant but harmless.

In [None]:
var chatService = kernel.GetRequiredService<IChatCompletionService>(serviceKey: "qwen2.5-0.5b");
var embeddingService = kernel.GetRequiredService<IEmbeddingGenerator<string, Embedding<float>>>();

## Initialize Services

### Get Services from Kernel

Retrieving services from the configured kernel: the chat completion service by its service key ("qwen2.5-0.5b"), and the embedding generator, which was registered without a key.

In [21]:
var vectorStoreService = new VectorStoreService(
    "http://localhost:6334",
    "",
    "demodocs");

await vectorStoreService.InitializeAsync();

### Create and Initialize Vector Store Service

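Creating a VectorStoreService instance pointing to a local Qdrant instance on port 6334 (Qdrant's gRPC port, which the .NET client uses) and initializing the "demodocs" collection for storing document embeddings. The default vector size of 768 matches the output dimension of jina-embeddings-v2-base-en.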

In [22]:
var documentIngestionService = new DocumentIngestionService(embeddingService, vectorStoreService);
var ragQueryService = new RagQueryService(embeddingService, chatService, vectorStoreService);

### Create Service Instances

Creating instances of the DocumentIngestionService and RagQueryService with the necessary dependencies (embedding service, chat service, and vector store service).

In [23]:
var filePath = "./foundry-local-architecture.md";
var fileID = "3";

## Document Ingestion Demo

### Define Document Information

Setting up the file path and document ID for the Foundry Local architecture document that will be ingested into the vector database.

In [24]:
await documentIngestionService.IngestDocumentAsync(filePath, fileID);

### Ingest Document into Vector Database

Processing the Foundry Local architecture document by reading its content, chunking it, generating embeddings for each chunk, and storing them in the vector database with metadata.

In [25]:
var question = "What's Foundry Local?";

## RAG Query Demo

### Define Query Question

Setting up a test question to demonstrate the RAG functionality: asking what Foundry Local is.

In [26]:
var answer = await ragQueryService.QueryAsync(question);

### Execute RAG Query

Running the RAG query which will:
1. Convert the question to embeddings
2. Search for relevant context in the vector database
3. Combine retrieved context with the question
4. Generate a response using the chat completion service

In [27]:
answer

 Here's a simplified version of the text:

---

**Title:** Introduction to Foundry Local

**Overview:** Foundry Local is a design focused on optimizing AI model inference on local devices. This guide explores the core components of Foundry Local and their interactions.

**Key Components**:
- Built-in System Platform (OSX)
- REST Server Framework (API)
- Local Execution Provider
- Model Manager
- Cloud Connectivity Framework

### Foundry Local Services Overview

- Endpoint: http://localhost:PORT/v1  
- Use Case: Run Models Locally, Access the Local Executor.
- ONNX Runtime: Utilizes optimized ONNX models to support local inference.

### ONNX Runtime

- Supported by Multiple Providers: NVIDIA, AMD, Intel (supported by OSLC).
- Provides Unified Interface for All Providers.

### Model Management
- Model Cache (local storage): Automatically generated when models are downloaded from the OSX platform.
- TTL for Memory Storage: Determines how long models

### Display RAG Response

Displaying the final answer generated by the RAG system, which should contain information about Foundry Local based on the ingested document.