Skip to content

muneeb-devp/Hearth.AI

Repository files navigation

Hearth

Batteries-included local LLM inference for .NET

Hearth icon

NuGet License: MIT Build

Docs: muneeb-devp.github.io/Hearth.AI

Run any GGUF language model locally in a .NET 8/9 app with a single line of registration:

builder.Services.AddHearth(o => o.Model = "./models/qwen2.5-7b-q4_k_m.gguf");

Hearth implements the standard IChatClient interface from Microsoft.Extensions.AI, so your application code is identical whether the inference backend is a local model or a cloud API. Swap models without changing a line of business logic.


Why Hearth?

The .NET AI ecosystem offers many cloud-hosted model clients but very little for on-device, private inference. Existing options require significant LLamaSharp boilerplate, don't integrate with the Microsoft.Extensions.* stack, or lack streaming support.

Hearth solves that:

Problem How Hearth helps
LLamaSharp setup is verbose One AddHearth() call wires everything up
Cloud APIs leak sensitive data All inference runs on your machine
Vendor lock-in Implements IChatClient — swap backends without code changes
GGUF model variety is confusing Sane defaults; ChatML template covers 90% of modern models
Streaming is hard to wire up GetStreamingResponseAsync works out of the box

Use Cases

Privacy-sensitive workloads — Legal document review, medical triage, HR workflows: data never leaves your infrastructure.

Air-gapped environments — Industrial control systems, secure government networks, or any deployment without internet access.

Cost control at scale — Running millions of inference calls against a cloud API gets expensive fast. A local model on modest hardware is a fixed cost.

Developer tooling and CI — Code review bots, test fixture generators, commit message writers — run them in CI without API keys or rate limits.

Edge and embedded — Raspberry Pi 5, Jetson Nano, or in-vehicle compute where latency to a data centre is unacceptable.

Compliance and data residency — GDPR, HIPAA, or contractual requirements that prohibit sending data to third-party processors.


Quick Start

1. Install

dotnet add package Hearth.AI

2. Download a GGUF model

Hearth works with any ChatML-compatible GGUF. Qwen 2.5 is an excellent starting point:

# ~4 GB — good quality-to-size ratio
wget https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF/resolve/main/qwen2.5-7b-instruct-q4_k_m.gguf

3. Register and inject

builder.Services.AddHearth(options =>
{
    options.Model = "./models/qwen2.5-7b-q4_k_m.gguf";
    options.ContextSize = 8192;
    options.GpuLayers = 35;  // 0 = CPU-only, 999 = offload everything
});

Then inject IChatClient anywhere in your application:

public class SummaryService(IChatClient chat)
{
    public async Task<string> SummarizeAsync(string document, CancellationToken ct = default)
    {
        var response = await chat.GetResponseAsync(
        [
            new(ChatRole.System, "Summarize the following document in three sentences."),
            new(ChatRole.User, document)
        ], cancellationToken: ct);

        return response.Message.Text ?? string.Empty;
    }
}

Code Examples

Single-turn question answering

var response = await chat.GetResponseAsync(
[
    new(ChatRole.System, "You are a helpful assistant. Be concise."),
    new(ChatRole.User, "What is the capital of Japan?")
]);

Console.WriteLine(response.Message.Text); // "Tokyo."

Streaming response

await foreach (var update in chat.GetStreamingResponseAsync(
[
    new(ChatRole.User, "Write a haiku about autumn.")
]))
{
    Console.Write(update.Text);  // tokens arrive as they are generated
}

Multi-turn conversation with history

var history = new List<ChatMessage>
{
    new(ChatRole.System, "You are a knowledgeable but concise assistant.")
};

while (true)
{
    Console.Write("> ");
    var input = Console.ReadLine();
    if (input is null || input == "exit") break;

    history.Add(new(ChatRole.User, input));

    var sb = new StringBuilder();
    await foreach (var update in chat.GetStreamingResponseAsync(history))
    {
        Console.Write(update.Text);
        sb.Append(update.Text);
    }
    Console.WriteLine();

    history.Add(new(ChatRole.Assistant, sb.ToString().Trim()));
}

Document classification

public enum Sentiment { Positive, Negative, Neutral }

public async Task<Sentiment> ClassifyAsync(string review)
{
    var response = await chat.GetResponseAsync(
    [
        new(ChatRole.System,
            "Classify the sentiment of the review as exactly one word: Positive, Negative, or Neutral."),
        new(ChatRole.User, review)
    ]);

    return Enum.TryParse<Sentiment>(response.Message.Text?.Trim(), ignoreCase: true, out var result)
        ? result
        : Sentiment.Neutral;
}

Structured JSON extraction

public record PersonInfo(string Name, int Age, string City);

public async Task<PersonInfo?> ExtractAsync(string paragraph)
{
    var response = await chat.GetResponseAsync(
    [
        new(ChatRole.System,
            """
            Extract person information from the text.
            Respond with valid JSON only, no explanation.
            Schema: {"Name": string, "Age": number, "City": string}
            """),
        new(ChatRole.User, paragraph)
    ]);

    var json = response.Message.Text?.Trim();
    return json is null ? null : JsonSerializer.Deserialize<PersonInfo>(json);
}

ASP.NET Core minimal API endpoint

builder.Services.AddHearth(o => o.Model = "./models/qwen2.5-7b-q4_k_m.gguf");

var app = builder.Build();

app.MapPost("/ask", async (AskRequest req, IChatClient chat) =>
{
    var response = await chat.GetResponseAsync(
    [
        new(ChatRole.System, "You are a helpful assistant."),
        new(ChatRole.User, req.Question)
    ]);
    return Results.Ok(new { Answer = response.Message.Text });
});

record AskRequest(string Question);

Background worker with cancellation

public class DocumentProcessorWorker(IChatClient chat) : BackgroundService
{
    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        await foreach (var doc in documentQueue.ReadAllAsync(stoppingToken))
        {
            try
            {
                var response = await chat.GetResponseAsync(
                [
                    new(ChatRole.System, "Summarize in one paragraph."),
                    new(ChatRole.User, doc.Content)
                ], cancellationToken: stoppingToken);

                await SaveSummaryAsync(doc.Id, response.Message.Text ?? string.Empty);
            }
            catch (OperationCanceledException)
            {
                break;
            }
        }
    }
}

ChatOptions for per-call temperature control

var creative = new ChatOptions { Temperature = 1.2f, MaxOutputTokens = 512 };
var focused  = new ChatOptions { Temperature = 0.1f, MaxOutputTokens = 128 };

var story = await chat.GetResponseAsync(storyMessages, creative);
var fact  = await chat.GetResponseAsync(factMessages,  focused);

Configuration Reference

Register options via AddHearth(Action<HearthOptions>):

builder.Services.AddHearth(options =>
{
    options.Model        = "./models/qwen2.5-7b-q4_k_m.gguf";
    options.ContextSize  = 8192;
    options.GpuLayers    = 35;
    options.BatchSize    = 512;
    options.Threads      = -1;
});
Property Type Default Description
Model string? (required) Path to a local .gguf file, or a Hugging Face repo ID (e.g. Qwen/Qwen2.5-7B-Instruct-GGUF).
ModelFile string? null Specific file within a CacheDirectory. Ignored when Model is already a direct path.
ContextSize int 4096 Maximum tokens in the context window (prompt + response). Larger values need more RAM/VRAM.
GpuLayers int 0 Layers to offload to GPU. 0 = CPU-only. 999 = offload all layers. A good starting point for a 7B Q4 model on an 8 GB card is 35.
BatchSize int 512 Prompt-processing batch size. Higher values improve throughput on long prompts at the cost of memory.
Threads int -1 CPU threads for inference. -1 = let the runtime decide based on available cores.
CacheDirectory string? ~/.hearth/models Directory for cached model files (used in Phase 2 auto-download).

Binding from appsettings.json

{
  "Hearth": {
    "Model": "./models/qwen2.5-7b-q4_k_m.gguf",
    "ContextSize": 8192,
    "GpuLayers": 35
  }
}
builder.Services.AddHearth(options =>
    builder.Configuration.GetSection("Hearth").Bind(options));

Supported Models

Hearth uses the ChatML prompt template, which covers the majority of modern instruction-tuned GGUFs:

Model family Example GGUF repo Notes
Qwen 2.5 Qwen/Qwen2.5-7B-Instruct-GGUF Native ChatML
Llama 3 / 3.1 / 3.2 bartowski/Meta-Llama-3-8B-Instruct-GGUF Native ChatML
Mistral v0.3 / Mixtral TheBloke/Mistral-7B-Instruct-v0.3-GGUF ChatML compatible
Phi-3 / Phi-3.5 microsoft/Phi-3-mini-4k-instruct-gguf ChatML compatible
Gemma 2 bartowski/gemma-2-9b-it-GGUF ChatML compatible
DeepSeek R1 bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF Native ChatML

Recommended quantizations:

  • Q4_K_M — best balance of size and quality; the default choice
  • Q5_K_M / Q6_K — higher quality if RAM/VRAM allows
  • Q8_0 — near-lossless; only practical on large-memory machines
  • Avoid Q2_K for tasks requiring coherent multi-step reasoning

Hearth auto-detects the template family from the GGUF filename and applies the correct prompt format — no manual configuration required.


Performance Guide

GPU layers

GPU offload is the single biggest performance lever. Even partial offload gives a large speedup:

// Apple Silicon — offload everything; Metal backend handles it
options.GpuLayers = 999;

// NVIDIA 8 GB + 7B Q4 model (~4 GB)
// Start at 35 and raise until VRAM runs out
options.GpuLayers = 35;

// CPU-only (default)
options.GpuLayers = 0;

GPU backend packages that unlock GpuLayers > 0 are planned for Phase 5:

Package Backend
Hearth.AI CPU (llama.cpp AVX2) — default
Hearth.AI.Cuda NVIDIA CUDA 12
Hearth.AI.Metal Apple Metal (M-series)
Hearth.AI.Vulkan Vulkan — AMD/Intel GPUs

Memory usage

Factor Rule of thumb
ContextSize Each extra 1024 tokens ≈ +0.5 GB VRAM for a 7B model
Concurrent requests Each in-flight request holds its own KV-cache
Model size 7B Q4_K_M ≈ 4 GB; 13B Q4_K_M ≈ 8 GB

For batch/background processing (not real-time), lower ContextSize to 2048 if your prompts are short. For a long-context use case (document Q&A), 16 384 or 32 768 may be worth the memory cost.

CPU threads

// Cap to half the physical cores when running alongside other services
options.Threads = Environment.ProcessorCount / 2;

-1 (auto) works well for single-process deployments.


Architecture

Your application code
        │
        │  IChatClient  (Microsoft.Extensions.AI)
        ▼
HearthChatClient
        │  formats prompt via ChatML template
        │  builds InferenceParams from ChatOptions
        ▼
StatelessExecutor  (LLamaSharp)
        │  allocates a fresh KV-cache per call
        │  streams tokens back as IAsyncEnumerable<string>
        ▼
LLamaWeights  (singleton, loaded once at DI resolution)
        │
        ▼
llama.cpp  (native GGUF inference)

Singleton model, stateless executor. LLamaWeights is loaded once and held for the application lifetime. Each GetResponseAsync or GetStreamingResponseAsync call creates a short-lived StatelessExecutor with its own KV-cache, which is freed when the call completes. Consequences:

  • Concurrent calls are safe — no shared mutable state between requests.
  • No session lifetime to manage.
  • Memory per concurrent request ≈ ContextSize × num_layers × sizeof(float16).

Lazy loading. The model loads on first DI resolution, not at builder.Build(). Startup time is unaffected; the first inference call pays the load cost (typically 1–5 seconds depending on model size and disk speed).


Console Sample

The repo includes a streaming chat REPL in samples/Hearth.Samples.Console:

# Pass model path as CLI argument
dotnet run --project samples/Hearth.Samples.Console -- ./models/qwen2.5-7b-q4_k_m.gguf

# Or via environment variable
HEARTH_MODEL=./models/qwen2.5-7b-q4_k_m.gguf dotnet run --project samples/Hearth.Samples.Console

Sample session:

╔══════════════════════════════╗
║        Hearth Chat           ║
╚══════════════════════════════╝
Model : /home/user/models/qwen2.5-7b-q4_k_m.gguf
Quit  : type 'exit' or press Ctrl+C

> What is a KV-cache in the context of transformer inference?
Assistant: A KV-cache stores the key/value attention tensors for tokens already processed...

Type exit or press Ctrl+C to quit.


Blazor Chat Component

Hearth.AI.Blazor provides a drop-in streaming chat UI component for Blazor Server and WebAssembly. Install it alongside the base package, register it, and add one line of markup:

dotnet add package Hearth.AI.Blazor
// Program.cs
builder.Services.AddHearth(options => { options.Model = "./models/qwen2.5-7b-q4_k_m.gguf"; });
builder.Services.AddHearthBlazor();
@using Hearth.Blazor.Components

<HearthChat SystemPrompt="You are a helpful assistant." />

The component handles streaming token output, conversation history, Markdown rendering, scroll-to-bottom, cancellation, and four built-in visual themes. See the Blazor chat component guide for the full parameter reference and customization options.

Blazor sample

samples/Hearth.Samples.Blazor shows the component running in a full Blazor Server app:

dotnet run --project samples/Hearth.Samples.Blazor -- ./models/qwen2.5-7b-q4_k_m.gguf

Aspire Integration

Hearth ships two packages for .NET Aspire orchestration:

Package Project type Purpose
Hearth.AI.Aspire.Hosting AppHost Declares the Hearth inference server as an Aspire resource
Hearth.AI.Aspire Service projects Registers IChatClient from the Aspire-injected connection string

AppHost:

var hearth = builder.AddHearth("ai")
    .WithModel("Qwen/Qwen2.5-7B-Instruct-GGUF")
    .WithModelCacheMount("/data/models")
    .WithGpuAcceleration(35);

builder.AddProject<Projects.MyWebApp>("web")
    .WithReference(hearth);

Service project:

builder.AddHearth("ai");  // reads Aspire connection string, registers IChatClient

Then inject IChatClient anywhere — no other changes needed. Aspire handles service discovery and connection string injection automatically. See the Aspire integration guide for the full API reference.


RAG Pipeline

Hearth.AI.Rag adds retrieval-augmented generation using the same local embedding model:

dotnet add package Hearth.AI.Rag
builder.Services
    .AddHearth(o => o.Model = "./models/qwen2.5-7b-q4_k_m.gguf")
    .AddRag(o =>
    {
        o.VectorStore = VectorStoreType.Sqlite;
        o.SqlitePath   = "knowledge.db";
        o.ChunkSize    = 512;
    });

Index documents, then ask grounded questions:

public class KnowledgeService(IRagPipeline rag, DocumentLoaderRegistry loaders)
{
    public async Task IndexFileAsync(string path, CancellationToken ct = default)
    {
        var doc = await loaders.LoadAsync(path, ct);
        await rag.IndexDocumentAsync(doc, ct);
    }

    public async Task<string> AskAsync(string question, CancellationToken ct = default)
    {
        var result = await rag.AskAsync(question, cancellationToken: ct);
        return result.Answer;
    }
}

Built-in document loaders: plain text, Markdown, HTML. Vector backends: in-memory (default) or SQLite (persistent across restarts). See the RAG pipeline guide for chunking strategies, query options, and source inspection.


Roadmap

Phase Feature Status
1 Local GGUF inference, IChatClient, streaming, console sample ✅ Done
2 Hugging Face model downloader — resumable, SHA-verified, progress callbacks; auto-quantization selection; lazy model loading ✅ Done
3 Tool/function calling; IEmbeddingGenerator<string, Embedding<float>>; per-model-family chat templates ✅ Done
4 Hearth.AI.AspNetCoreMapHearth() extension; OpenAI-compatible /v1/chat/completions, /v1/embeddings, /v1/models; SSE streaming ✅ Done
5 Blazor streaming chat sample; GitHub Actions CI/CD; NuGet release; GPU backend packages (Hearth.AI.Cuda, Hearth.AI.Metal, Hearth.AI.Vulkan) ✅ Done
6 Hearth.AI.Aspire.Hosting + Hearth.AI.Aspire — Aspire orchestration; samples/Hearth.Samples.Server inference server; GHCR Docker publish ✅ Done
7 Hearth.AI.Rag — RAG pipeline; RecursiveChunker + MarkdownChunker; plain-text, Markdown, HTML loaders; in-memory and SQLite vector stores ✅ Done
8 Hearth.AI.BlazorHearthChat streaming component; Markdown rendering; four themes; IAsyncDisposable JS interop cleanup ✅ Done

Requirements

  • .NET 8 or .NET 9
  • A .gguf model file — ChatML-compatible models are recommended (see Supported Models)
  • No GPU required for CPU-only inference

Platform support

Hearth inherits LLamaSharp's platform matrix:

Platform Architecture Notes
Windows x64, arm64 Full support
Linux x64, arm64 Full support
macOS x64, Apple Silicon Metal GPU in Phase 5

Author

Muneeb Mughal


Contributing

Contributions are welcome. Please open an issue before submitting a large PR so the approach can be discussed.

# Clone and build
git clone https://github.com/muneeb-devp/Hearth.git
cd Hearth
dotnet build

# Run all tests
dotnet test

# Run the console sample (requires a local GGUF)
dotnet run --project samples/Hearth.Samples.Console -- /path/to/model.gguf

# Build the docs site
dotnet tool restore
dotnet docfx docs/docfx.json

Code style is enforced by .editorconfig and EnforceCodeStyleInBuild=true. CI will reject PRs with style warnings.


License

MIT

About

Run any GGUF language model locally in a .NET 8/9 app with a **single line of registration**:

Topics

Resources

License

Contributing

Stars

Watchers

Forks

Contributors