# Deploy GPU-Accelerated Ollama LLM Server on FABRIC

This notebook demonstrates deploying a complete Large Language Model (LLM) inference server using Ollama on FABRIC's research infrastructure. Following FABRIC's standard deployment patterns, it provisions a GPU-enabled slice and configures a production-ready AI service accessible across the testbed.

## What This Notebook Does

**Infrastructure Provisioning:**
- Automatically selects an optimal FABRIC site with available GPU resources (RTX6000, Tesla T4, A30, or A40)
- Creates a slice with GPU-accelerated compute node connected to FABNetv4 for inter-slice communication
- Configures Docker with NVIDIA runtime support for containerized AI workloads

**LLM Server Deployment:**
- Installs and configures Ollama natively with the `deepseek-r1:7b` model (customizable to other models)
- Sets up Open-WebUI for browser-based interaction with the LLM
- Enables secure remote access via SSH tunneling following FABRIC security practices

**Cross-Slice Integration:**
- Configures the Ollama API server to accept connections from other FABRIC slices via FABNetv4
- Provides both REST API and web interface access for flexible integration with research workflows
- Demonstrates querying patterns for distributed AI applications across FABRIC infrastructure

## Use Cases

- **AI Research**: Deploy custom LLMs for distributed machine learning experiments
- **Multi-Slice Applications**: Provide AI services to other FABRIC experiments via FABNetv4 networking
- **Educational Demonstrations**: Showcase GPU-accelerated AI deployment on research infrastructure
- **Prototype Development**: Test AI applications before scaling to production environments

This example follows FABRIC's jupyter-examples patterns and can be adapted for different LLM models, GPU types, or integrated with other FABRIC services like monitoring via MFLib.

## FABlib Documentation Reference

This notebook demonstrates many key FABlib classes and methods for FABRIC resource management:

### Core Classes
- **[FablibManager](https://fabric-fablib.readthedocs.io/en/latest/fablib.html)**: Main interface for FABRIC operations
- **[Slice](https://fabric-fablib.readthedocs.io/en/latest/slice.html)**: Container for experiment resources
- **[Node](https://fabric-fablib.readthedocs.io/en/latest/node.html)**: Individual compute resources (VMs)
- **[Interface](https://fabric-fablib.readthedocs.io/en/latest/interface.html)**: Network interfaces for connectivity

### Key Methods Used
- **Resource Discovery**: [`list_hosts()`](https://fabric-fablib.readthedocs.io/en/latest/fablib.html#fabrictestbed_extensions.fablib.fablib.FablibManager.list_hosts)
- **Slice Management**: [`new_slice()`](https://fabric-fablib.readthedocs.io/en/latest/fablib.html#fabrictestbed_extensions.fablib.fablib.fablib.new_slice), [`get_slice()`](https://fabric-fablib.readthedocs.io/en/latest/fablib.html#fabrictestbed_extensions.fablib.fablib.fablib.get_slice)
- **Node Operations**: [`add_node()`](https://fabric-fablib.readthedocs.io/en/latest/slice.html#fabrictestbed_extensions.fablib.slice.Slice.add_node), [`execute()`](https://fabric-fablib.readthedocs.io/en/latest/node.html#fabrictestbed_extensions.fablib.node.Node.execute)
- **Networking**: [`add_l3network()`](https://fabric-fablib.readthedocs.io/en/latest/slice.html#fabrictestbed_extensions.fablib.slice.Slice.add_l3network), [`get_interface()`](https://fabric-fablib.readthedocs.io/en/latest/node.html#fabrictestbed_extensions.fablib.node.Node.get_interface)
- **Components**: [`add_component()`](https://fabric-fablib.readthedocs.io/en/latest/node.html#fabrictestbed_extensions.fablib.node.Node.add_component) for GPUs and NICs
- **Automation**: [`add_post_boot_upload_directory()`](https://fabric-fablib.readthedocs.io/en/latest/node.html#fabrictestbed_extensions.fablib.node.Node.add_post_boot_upload_directory), [`add_post_boot_execute()`](https://fabric-fablib.readthedocs.io/en/latest/node.html#fabrictestbed_extensions.fablib.node.Node.add_post_boot_execute)

### Complete Documentation
For comprehensive API reference, visit:
**[FABRIC FABlib Documentation](https://fabric-fablib.readthedocs.io/en/latest/)**

## Import the FABlib Library

Initialize the [FablibManager](https://fabric-fablib.readthedocs.io/en/latest/fablib.html) - the core class for managing FABRIC resources and operations.

In [None]:
from ipaddress import ip_address, IPv4Address, IPv6Address, IPv4Network, IPv6Network
import ipaddress
from fabrictestbed_extensions.fablib.fablib import FablibManager as fablib_manager

fablib = fablib_manager()

fablib.show_config();

## Create the Experiment Slice

This section provisions a FABRIC slice for GPU-accelerated LLM deployment following the standard resource allocation workflow. The process includes:

1. **Site Discovery**: Identifies FABRIC sites with available GPU resources matching our requirements
2. **Slice Provisioning**: Creates a new slice with GPU-enabled compute nodes 
3. **Network Configuration**: Establishes FABNetv4 connectivity for inter-slice communication
4. **Automated Setup**: Configures post-boot scripts for Docker and NVIDIA driver installation

The slice will contain a single node optimized for AI workloads, with the selected GPU (RTX6000, Tesla T4, A30, or A40) attached and connected to FABRIC's research network infrastructure. Post-boot automation ensures the node is ready for native Ollama installation with full GPU acceleration support.

In [None]:
ollama_slice_name = 'Ollama-slice'

ollama_node_name = 'ollama_node'

network_name = 'net1'
nic_name = 'nic1'
model_name = 'NIC_Basic'

## Select a Site

Configure resource requirements and site preferences for the Ollama server deployment. This section defines the minimum hardware specifications needed to support GPU-accelerated LLM inference and establishes site selection criteria following FABRIC's standard allocation patterns.

Set minimum resource thresholds for:
- **CPU cores** and **RAM** for LLM processing workloads
- **Disk space** for model storage and container images  
- **GPU availability** for accelerated inference (`min_gpu_any`; the default of 0 requires at least one free GPU of any supported model)

Optionally specify site preferences using `sites_prefer` for prioritized locations or `sites_avoid` to exclude sites with known constraints or maintenance issues.

In [None]:
# If empty -> do not filter by name
sites_prefer: list[str] = []  # e.g., ['BRIST', 'TOKY'] or [] to disable
sites_avoid: list[str] = ["TACC", "CIEN"]   # e.g., ['BRIST', 'TOKY'] or [] to disable
min_cores = 4
min_ram_gb = 16
min_disk_gb = 200
min_gpu_any = 0       # host must offer MORE than this many GPUs of some model (0 -> at least one GPU)

### GPU Host Selection Algorithm

The following cell implements a smart host selection algorithm that:

1. **Filters available hosts** using [`fablib.list_hosts()`](https://fabric-fablib.readthedocs.io/en/latest/fablib.html#fabrictestbed_extensions.fablib.fablib.FablibManager.list_hosts) based on resource requirements (cores, RAM, disk) and GPU availability
2. **Applies site preferences** - hosts are restricted to `sites_prefer` inside the filter function (when that list is non-empty), and sites listed in `sites_avoid` are excluded through the `avoid` parameter
3. **Randomly selects** from eligible hosts to distribute load across the FABRIC infrastructure
4. **Identifies the GPU type** to request: the first model with free units on the selected host, checked in the order RTX6000, Tesla T4, A30, A40

This approach ensures your slice gets allocated to a site with sufficient resources while respecting any site constraints you've configured.
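
To make the selection logic concrete, the sketch below applies the same kind of filter to a few hand-written host rows. The rows, the host names, and the `is_eligible` and `pick_host` helpers are illustrative assumptions, not real `list_hosts()` output; only the field names mirror those used in this notebook.

```python
import random

# Availability fields mapped to FABlib GPU component models (same mapping as the notebook).
GPU_MODELS = {
    "rtx6000_available": "GPU_RTX6000",
    "tesla_t4_available": "GPU_TeslaT4",
    "a30_available": "GPU_A30",
    "a40_available": "GPU_A40",
}

# Hypothetical rows standing in for list_hosts() output.
SAMPLE_HOSTS = [
    {"name": "site1-w1.example.net", "state": "Active", "cores_available": 32,
     "ram_available": 128, "disk_available": 1000, "a30_available": 2},
    {"name": "site2-w1.example.net", "state": "Active", "cores_available": 2,
     "ram_available": 8, "disk_available": 100, "tesla_t4_available": 1},
    {"name": "site3-w2.example.net", "state": "Maint", "cores_available": 64,
     "ram_available": 256, "disk_available": 2000, "rtx6000_available": 3},
]


def is_eligible(row, min_cores=4, min_ram_gb=16, min_disk_gb=200, prefer=()):
    """Return True if the row is active, exceeds the thresholds, and offers any GPU."""
    name = (row.get("name") or "").lower()
    name_ok = not prefer or any(tok.lower() in name for tok in prefer)
    res_ok = (
        row.get("state") == "Active"
        and row.get("cores_available", 0) > min_cores
        and row.get("ram_available", 0) > min_ram_gb
        and row.get("disk_available", 0) > min_disk_gb
    )
    gpu_ok = any(row.get(field, 0) > 0 for field in GPU_MODELS)
    return name_ok and res_ok and gpu_ok


def pick_host(rows, rng=random):
    """Pick a random eligible host and the first GPU model it offers."""
    eligible = [row for row in rows if is_eligible(row)]
    if not eligible:
        raise RuntimeError("No host satisfies the resource and GPU requirements")
    host = rng.choice(eligible)
    gpu_field = next(f for f in GPU_MODELS if host.get(f, 0) > 0)
    return host["name"], GPU_MODELS[gpu_field]


print(pick_host(SAMPLE_HOSTS))
```

In this sample only the first row passes: the second lacks cores, RAM, and disk, and the third is not in the `Active` state.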

In [None]:
import random

gpu_models = { 'rtx6000_available': "GPU_RTX6000",
               'tesla_t4_available': "GPU_TeslaT4",
               'a30_available': "GPU_A30",
               'a40_available': "GPU_A40"}

fields = ['name', 'state', 'cores_available', 'ram_available', 'disk_available']
fields = fields + list(gpu_models.keys())

def filter_function(row: dict) -> bool:
    # Name filter: only apply if sites_prefer is non-empty
    if sites_prefer:
        name = (row.get('name') or '')
        name_ok = any(tok.lower() in name.lower() for tok in sites_prefer)
    else:
        name_ok = True

    res_ok = (
        row.get('cores_available', 0) > min_cores and
        row.get('ram_available', 0) > min_ram_gb and
        row.get('disk_available', 0) > min_disk_gb and
        row.get('state') == 'Active'
    )
    any_gpu_ok = any(row.get(gf, 0) > min_gpu_any for gf in gpu_models.keys())

    return name_ok and res_ok and any_gpu_ok

hosts = fablib.list_hosts(fields=fields, 
                            pretty_names=False, 
                            avoid=sites_avoid, 
                            filter_function=filter_function,
                            output='list',
                            quiet=True)


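# Fail early with a clear message instead of an IndexError from random.choice([])
if not hosts:
    raise RuntimeError("No eligible GPU host found; relax the thresholds or site filters.")

host = random.choice(hosts)

# Use the first GPU model (in gpu_models order) that has free units on the chosen host
picked_gpu_key = next((gf for gf in gpu_models.keys() if host[gf] > 0), None)
picked_gpu_model = gpu_models.get(picked_gpu_key)
picked_gpu_count = host.get(picked_gpu_key, 0)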

picked_host = host['name']
picked_site = picked_host.split('-', 1)[0].upper()

print(
    f"Chosen Host: {host['name']} | "
    f"GPU: {picked_gpu_model} | Available: {picked_gpu_count}"
)


### Create the Slice  

This cell provisions the FABRIC slice following the standard FABRIC deployment pattern using key FABlib methods:

1. **Creates a new slice** using [`fablib.new_slice()`](https://fabric-fablib.readthedocs.io/en/latest/fablib.html#fabrictestbed_extensions.fablib.fablib.fablib.new_slice) named 'Ollama-slice'
2. **Establishes FABNetv4 connectivity** by adding an L3 network using [`slice.add_l3network()`](https://fabric-fablib.readthedocs.io/en/latest/slice.html#fabrictestbed_extensions.fablib.slice.Slice.add_l3network) for inter-slice communication
3. **Provisions a GPU-enabled node** with [`slice.add_node()`](https://fabric-fablib.readthedocs.io/en/latest/slice.html#fabrictestbed_extensions.fablib.slice.Slice.add_node), applying the minimum resource requirements (cores, RAM, disk)
4. **Attaches the chosen GPU** using [`node.add_component()`](https://fabric-fablib.readthedocs.io/en/latest/node.html#fabrictestbed_extensions.fablib.node.Node.add_component) (RTX6000, Tesla T4, A30, or A40) to support AI/LLM workloads
5. **Connects to FABNetv4** via a NIC_Basic component configured in auto mode using [`interface.set_mode()`](https://fabric-fablib.readthedocs.io/en/latest/interface.html#fabrictestbed_extensions.fablib.interface.Interface.set_mode) for seamless networking
6. **Uploads deployment tools** using [`node.add_post_boot_upload_directory()`](https://fabric-fablib.readthedocs.io/en/latest/node.html#fabrictestbed_extensions.fablib.node.Node.add_post_boot_upload_directory) including ollama_tools and node_tools directories
7. **Configures post-boot automation** with [`node.add_post_boot_execute()`](https://fabric-fablib.readthedocs.io/en/latest/node.html#fabrictestbed_extensions.fablib.node.Node.add_post_boot_execute) to enable Docker and install NVIDIA dependencies for GPU support
8. **Submits the slice** for provisioning using [`slice.submit()`](https://fabric-fablib.readthedocs.io/en/latest/slice.html#fabrictestbed_extensions.fablib.slice.Slice.submit)

The slice follows FABRIC's standard pattern for AI infrastructure deployment, ensuring the node is ready for native Ollama installation with GPU acceleration and cross-slice connectivity.


In [None]:
#Create Slice
ollama_slice = fablib.new_slice(name=ollama_slice_name)

net1 = ollama_slice.add_l3network(name=network_name)

ollama_node = ollama_slice.add_node(name=ollama_node_name, cores=min_cores, ram=min_ram_gb, host=picked_host,
                                    disk=min_disk_gb, site=picked_site, image='default_ubuntu_22')

ollama_node.add_component(model=picked_gpu_model, name='gpu1')


iface1 = ollama_node.add_component(model=model_name, name=nic_name).get_interfaces()[0]
iface1.set_mode('auto')
net1.add_interface(iface1)

ollama_node.add_post_boot_upload_directory('ollama_tools','.')
ollama_node.add_post_boot_upload_directory('node_tools','.')
ollama_node.add_post_boot_execute('node_tools/enable_docker.sh {{ _self_.image }} ')
ollama_node.add_post_boot_execute('node_tools/dependencies.sh {{ _self_.image }} ')

ollama_slice.submit();

## Choose the LLM Model

Set the model that Ollama will serve. The default is `deepseek-r1:7b`; alternative models can be substituted by changing `default_llm_model`.

Ollama identifies models as `name:tag`, for example:

`llama2:7b`, `mistral:7b`, `gemma:7b`, `phi:2.7b`, `deepseek-r1:70b`

Confirm the exact tag before pulling, since names that are not in the Ollama library fail at the pull step. For more available models, visit: [Ollama Model Search](https://ollama.com/search)

In [None]:
default_llm_model = "deepseek-r1:7b"

### Reconnect to Existing Slice

If reconnecting to an existing slice (e.g., after restarting your notebook), use [`fablib.get_slice()`](https://fabric-fablib.readthedocs.io/en/latest/fablib.html#fabrictestbed_extensions.fablib.fablib.fablib.get_slice) and [`slice.get_node()`](https://fabric-fablib.readthedocs.io/en/latest/slice.html#fabrictestbed_extensions.fablib.slice.Slice.get_node) to retrieve existing resources.

In [None]:
ollama_slice = fablib.get_slice(name=ollama_slice_name)
ollama_node = ollama_slice.get_node(ollama_node_name)

## Install and Configure Ollama

This section installs Ollama natively on the Ubuntu node using the official installer and configures it to accept connections from remote hosts. This approach provides optimal performance by installing Ollama directly on the system rather than using containers.

**Configuration includes:**
- Native Ollama installation with GPU acceleration support
- Systemd service configuration for remote access
- Firewall setup for API connectivity
- Model download and initialization

All configuration steps use the [`node.execute()`](https://fabric-fablib.readthedocs.io/en/latest/node.html#fabrictestbed_extensions.fablib.node.Node.execute) method to run commands on the remote FABRIC node.

### Install Ollama using the official installer

Download and install Ollama natively on Ubuntu. This provides better performance than containerized approaches and direct access to GPU resources.

In [None]:
stdout, stderr = ollama_node.execute("curl -fsSL https://ollama.com/install.sh | sh", quiet=True, output_file=f"{ollama_node.get_name()}.log")
print("Installation output:")
print(stdout)

### Configure Ollama to accept remote connections

By default, Ollama only listens on localhost. Configure it for remote access by creating a systemd override to bind to all interfaces.
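
The drop-in file is a small INI-style fragment. As a rough sketch, the hypothetical helpers below render that fragment for a given bind address and port, then quote it for the shell with `shlex.quote`. They only build strings and do not touch the node; the `printf`-based write command is one possible alternative to the `echo` form used in this notebook.

```python
import shlex

OVERRIDE_PATH = "/etc/systemd/system/ollama.service.d/override.conf"


def build_override(host="0.0.0.0", port=11434):
    """Render a systemd drop-in that sets OLLAMA_HOST for the ollama service."""
    return f'[Service]\nEnvironment="OLLAMA_HOST={host}:{port}"\n'


def build_write_command(content, path=OVERRIDE_PATH):
    """Build a shell command that writes content to path through sudo tee."""
    return f"printf %s {shlex.quote(content)} | sudo tee {shlex.quote(path)}"


override = build_override()
print(override)
print(build_write_command(override))
```

Quoting with `shlex.quote` keeps the embedded double quotes and newlines intact regardless of what the configuration text contains.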

In [None]:
# By default, Ollama only listens on localhost. We need to configure it for remote access.

# Create systemd override directory
stdout, stderr = ollama_node.execute("sudo mkdir -p /etc/systemd/system/ollama.service.d/")

# Create override configuration to bind to all interfaces
override_config = """[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
"""

# Write the override file
stdout, stderr = ollama_node.execute(f'echo \'{override_config}\' | sudo tee /etc/systemd/system/ollama.service.d/override.conf')
print("Override configuration created:")
print(stdout)

### Reload systemd and restart Ollama service

Apply the configuration changes by reloading systemd, restarting the Ollama service, and enabling it to start automatically on boot.

In [None]:
stdout, stderr = ollama_node.execute("sudo systemctl daemon-reload")
stdout, stderr = ollama_node.execute("sudo systemctl restart ollama")
stdout, stderr = ollama_node.execute("sudo systemctl enable ollama")

# Check service status
stdout, stderr = ollama_node.execute("sudo systemctl status ollama")
print("Ollama service status:")
print(stdout)

### Configure firewall to allow connections on port 11434

Open the firewall to allow incoming connections on port 11434, which is the default port for the Ollama API.

In [None]:
stdout, stderr = ollama_node.execute("sudo ufw allow 11434/tcp", quiet=True)
print("Firewall configuration:")
print(stdout);

### Pull the desired model

Download the specified LLM model and verify it's available for use. This may take several minutes depending on the model size.

In [None]:
stdout, stderr = ollama_node.execute(f"ollama pull {default_llm_model}", quiet=True, output_file=f"{ollama_node.get_name()}.log")
print(f"Pulling model {default_llm_model}:")
print(stdout)

# List available models
stdout, stderr = ollama_node.execute("ollama list")
print("Available models:")
print(stdout)

### Test local API access

Verify that the Ollama API is running and accessible by testing the connection locally.
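
The `/api/tags` endpoint returns a JSON object whose `models` list names the locally available models. A minimal sketch of checking that response for a specific model follows; the sample body is hand-written and trimmed (its `size` value is made up), and `model_is_available` is a hypothetical helper.

```python
import json

# Hypothetical, trimmed example of a /api/tags response body.
SAMPLE_TAGS = '{"models": [{"name": "deepseek-r1:7b", "size": 4683075271}]}'


def model_is_available(tags_json, model):
    """Return True if model appears in the "models" list of a /api/tags response."""
    payload = json.loads(tags_json)
    names = {entry.get("name") for entry in payload.get("models", [])}
    return model in names


print(model_is_available(SAMPLE_TAGS, "deepseek-r1:7b"))
print(model_is_available(SAMPLE_TAGS, "llama2:7b"))
```

The same function could be applied to the `stdout` string returned by the `curl` call in the next cell.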

In [None]:
# Test API locally first
print("Local API test:")
stdout, stderr = ollama_node.execute("curl -s http://localhost:11434/api/tags", quiet=True)
print(stdout);

### Test a Generation Query

Send a generation request to the model with curl. The cell below runs curl on the node itself against `localhost`; from another FABRIC slice, the same request works with the node's FABNetv4 IP address in place of `localhost`.
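
From another slice, the same request can also be issued without curl. The sketch below builds (but does not send) the equivalent `POST /api/generate` request with Python's standard library. The address `10.0.0.2` is a placeholder and `build_generate_request` is a hypothetical helper; the commented-out lines show how it could be sent from a host that can reach the node.

```python
import json
import urllib.request


def build_generate_request(host, model, prompt, port=11434):
    """Build (but do not send) a POST request for Ollama's /api/generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{host}:{port}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# 10.0.0.2 is a placeholder; substitute the node's FABNetv4 address.
req = build_generate_request("10.0.0.2", "deepseek-r1:7b",
                             "Tell me a joke about computer networks.")
print(req.full_url)
print(req.data.decode("utf-8"))

# To send it from a host that can reach the node:
# with urllib.request.urlopen(req, timeout=120) as resp:
#     print(json.loads(resp.read())["response"])
```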

In [None]:
import json

# Test a simple query to the model
test_query = '''curl -X POST http://localhost:11434/api/generate \\
  -H "Content-Type: application/json" \\
  -d '{
    "model": "''' + default_llm_model + '''",
    "prompt": "Tell me a joke about computer networks.",
    "stream": false
  }' '''

print(test_query)

stdout, stderr = ollama_node.execute(test_query, quiet=True)
print("Query test result:")
# Parse the JSON response
response = json.loads(stdout)

# Print all keys and values
for key, value in response.items():
  print(f"{key}: {value}")


## Enable Access to Ollama Node Across FABRIC  

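Make `ollama_node` reachable from any VM on FABNetv4 across FABRIC. Because the FABNetv4 interface was added in `auto` mode, FABlib is expected to assign its address and routes when the slice is provisioned, so other slices only need the node's FABNetv4 IP address.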

### Retrieve the FabNet IP Address  
Display the FabNet IP address of the Ollama node for sharing with other slices using [`node.get_interface()`](https://fabric-fablib.readthedocs.io/en/latest/node.html#fabrictestbed_extensions.fablib.node.Node.get_interface) and [`interface.get_ip_addr()`](https://fabric-fablib.readthedocs.io/en/latest/interface.html#fabrictestbed_extensions.fablib.interface.Interface.get_ip_addr).
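
Clients in other slices usually need a full base URL rather than a bare address. A small sketch, assuming the default Ollama port of 11434, validates the address with the `ipaddress` module and formats the URL. The `ollama_base_url` helper and both sample addresses are illustrative only.

```python
import ipaddress


def ollama_base_url(ip, port=11434):
    """Return an http:// base URL for the given address, bracketing IPv6 literals."""
    addr = ipaddress.ip_address(str(ip))  # raises ValueError on malformed input
    host = f"[{addr}]" if addr.version == 6 else str(addr)
    return f"http://{host}:{port}"


# Placeholder addresses for illustration only.
print(ollama_base_url("10.128.1.2"))
print(ollama_base_url("2001:db8::2"))
```

Passing the value through `str()` first means the helper accepts either a string or an `ipaddress` object.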

In [None]:
ollama_fabnet_ip_addr = ollama_node.get_interface(network_name=network_name).get_ip_addr()

print(f"Ollama is accessible from other slices at: {ollama_fabnet_ip_addr}")

## Querying Ollama

Users can interact with the LLM through the command-line interface, the REST API, or Open WebUI.


### CLI Examples

SSH into the `ollama_node` (the SSH command can be printed with `ollama_node.get_ssh_command()`).
To send a prompt to the model, run:

```bash
ollama run deepseek-r1:7b  "Tell me a joke about computers"
```

Alternatively, you can run the command using the FABlib API:


In [None]:
stdout, stderr = ollama_node.execute(f'ollama run {default_llm_model} "Tell me a joke about computers"')


### REST Examples

The `query.py` script demonstrates how to query the LLM over the REST interface. The example below runs the script on the node and passes the node's FABNetv4 address as `--host`, which exercises the same path that other slices use. To target the local loopback instead, pass `--host localhost`; a different `--port` can also be specified as needed.
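
The contents of `query.py` are not shown in this notebook. Independently of that script, when a request is sent with `"stream": true` (Ollama's default), the response arrives as one JSON object per line, each carrying a `response` fragment and a `done` flag. The sketch below assembles such a stream; the sample lines are hand-written and `assemble_stream` is a hypothetical helper.

```python
import json

# Hypothetical streamed body: one JSON object per line.
SAMPLE_STREAM = "\n".join([
    '{"model": "deepseek-r1:7b", "response": "Why did ", "done": false}',
    '{"model": "deepseek-r1:7b", "response": "the packet cross the road?", "done": false}',
    '{"model": "deepseek-r1:7b", "response": "", "done": true}',
])


def assemble_stream(lines):
    """Concatenate the "response" fragments until a chunk reports done."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank keep-alive lines
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


print(assemble_stream(SAMPLE_STREAM.splitlines()))
```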


In [None]:
stdout, stderr = ollama_node.execute(f'python3 ollama_tools/query.py --host {ollama_fabnet_ip_addr} --model {default_llm_model} --prompt "Tell me a joke about computers"')

### Open Web UI

To access the Open WebUI from your laptop, you'll need to start the Open WebUI server in a Docker container, create an SSH tunnel from your laptop, and connect to the server using a browser on your laptop.

Follow the steps below to complete the setup.

#### Start the Open WebUI Server

The required Docker Compose files were included in the post-boot upload. The following command starts the server.

In [None]:
# Copy the environment template and start the Open WebUI containers in the background
ollama_node.execute("cd ollama_tools && cp env.template .env && docker compose up -d",
                    quiet=True, output_file=f"{ollama_node.get_name()}.log");

#### Start the SSH Tunnel

- Create the SSH tunnel configuration archive `fabric_ssh_tunnel_tools.zip` by running the next cell, which calls [`fablib.create_ssh_tunnel_config()`](https://fabric-fablib.readthedocs.io/en/latest/fablib.html#fabrictestbed_extensions.fablib.fablib.FablibManager.create_ssh_tunnel_config)
- Download your custom `fabric_ssh_tunnel_tools.zip` archive from the `fabric_config` folder.
- Unzip the archive and put the resulting folder (`fabric_ssh_tunnel_tools`) somewhere you can access it from the command line.
- Open a terminal window. (Windows: use `powershell`)
- Use `cd` to navigate to the `fabric_ssh_tunnel_tools` folder.
- In your terminal, run the SSH command printed by the last cell of the "Connect to the Open Web UI" step (leave the terminal window open).

In [None]:
fablib.create_ssh_tunnel_config(overwrite=True)

#### Connect to the Open Web UI

To access the Open WebUI running on the Ollama node, create an SSH tunnel from your local machine. The next cell prints the exact command, which has this general form:

```bash
ssh -L 127.0.0.1:8080:127.0.0.1:8080 -i <private_key> -F ssh_config <username>@<management-ip>
```

Here `<username>` and `<management-ip>` are the login user and management IP address of the Ollama node, and `<private_key>` is your slice private key.

Then, open your browser and navigate to:


http://localhost:8080
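
The `-L` argument has the form `local_host:local_port:target_host:target_port`, where the target is resolved from the node's point of view. As a hedged sketch, the hypothetical helper below assembles such a command with `shlex.join`; the user name, address, and key file name are placeholders, and the default ports assume Open WebUI listens on 8080 as in this notebook.

```python
import shlex


def build_tunnel_command(user, management_ip, key_file, local_port=8080, target_port=8080,
                         local_host="127.0.0.1", ssh_config="ssh_config"):
    """Build an ssh -L command forwarding local_host:local_port to the node's loopback."""
    forward = f"{local_host}:{local_port}:127.0.0.1:{target_port}"
    return shlex.join(["ssh", "-L", forward, "-i", key_file, "-F", ssh_config,
                       f"{user}@{management_ip}"])


# Placeholder user, address, and key name for illustration only.
print(build_tunnel_command("ubuntu", "2001:db8::10", "slice_key"))
```

Changing `local_port` is enough to avoid a clash if port 8080 is already in use on the laptop; the browser URL then uses that port instead.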


In [None]:
import os
# Port on your local machine that Open WebUI is mapped to.
local_port='8080'
# Local interface to bind the tunnel to (can be `localhost`)
local_host='127.0.0.1'

# Port on the node used by the Open WebUI service
target_port='8080'

# Username/node on FABRIC
target_host=f'{ollama_node.get_username()}@{ollama_node.get_management_ip()}'

print("Use `cd` to navigate into the `fabric_ssh_tunnel_tools` folder.")
print("In your terminal, run the SSH tunnel command")
print()
print(f'ssh  -L {local_host}:{local_port}:127.0.0.1:{target_port} -i {os.path.basename(fablib.get_default_slice_public_key_file())[:-4]} -F ssh_config {target_host}')
print()
print("After running the SSH command, open Open WebUI at http://localhost:8080. If prompted, create an account and start asking questions.")

## Delete the Slice

Please delete your slice when you are done with your experiment.

In [None]:
# Uncomment the two lines below to delete the slice
#ollama_slice = fablib.get_slice(name=ollama_slice_name)
#ollama_slice.delete()