<a href="https://colab.research.google.com/github/jinlee-m/llm/blob/main/lightrag_ollama_client.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🤗 Use Ollama in LightRAG


# Quick Links

Github repo: https://github.com/SylphAI-Inc/LightRAG

Full Tutorials: https://lightrag.sylph.ai/index.html#

Discord: https://discord.gg/ezzszrRZvT




## 📖 Outline

We will show you how to use Ollama LLM and embedding models with the LightRAG `Generator` and `Embedder`. In particular, we will test `llama3` as the LLM and `jina/jina-embeddings-v2-base-en` for embeddings. For more models, refer to the [Ollama model library](https://ollama.com/library).

This guide goes beyond just using Ollama in LightRAG.

1. We also explore the performance of async calls.
2. We will guide you on how to modify the model file of Ollama.
3. Additionally, you will see llama3's model prompt.

## ⛵ Quick Intro

Ollama is an app that lets you run transformer models locally with optimized performance.
We integrated Ollama as a model client, [`OllamaClient`]().
It uses the [ollama Python SDK](https://github.com/ollama/ollama-python) and its `Client` and `AsyncClient` to interact with the Ollama API.
For LLMs, we use `generate` instead of `chat` because LightRAG handles the formatting of the prompt itself.
For embeddings, we use `embeddings` to get the embeddings of the input text.
As of 2024-07-18, Ollama does not support batch embeddings.
Their blog post mentions the plan to support batch embeddings: [link](https://ollama.com/blog/embedding-models).
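
To make the `generate` versus `chat` distinction concrete, here is a small sketch that builds the two request bodies side by side without sending anything. The field names (`model`, `prompt`, `messages`, `stream`) follow Ollama's API as we understand it; the two helper functions are hypothetical names used only for illustration and are not part of LightRAG or the ollama SDK.

```python
import json
from typing import Optional


def build_generate_payload(model: str, prompt: str) -> dict:
    """Generate-style request: a single prompt string that is already formatted."""
    return {"model": model, "prompt": prompt, "stream": False}


def build_chat_payload(model: str, user_message: str, system: Optional[str] = None) -> dict:
    """Chat-style request: a list of role-tagged messages; the server formats them."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": user_message})
    return {"model": model, "messages": messages, "stream": False}


question = "What is the capital of France?"
generate_payload = build_generate_payload("llama3", question)
chat_payload = build_chat_payload("llama3", question)

print(json.dumps(generate_payload, indent=2))
print(json.dumps(chat_payload, indent=2))
```

With `generate`, whatever assembles the prompt (here, LightRAG) stays in full control of the final string, which is why it fits a library that does its own prompt formatting.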


### Prerequisites

Because the models run locally, we first need to install the app by following their [official instructions](https://github.com/ollama/ollama?tab=readme-ov-file).
Then, we need to pull the models we want to use.

```bash
ollama pull llama3
ollama pull jina/jina-embeddings-v2-base-en
```

### Set up the T4 GPU

We will use the T4 GPU for the Ollama server. In the upper-right corner, click `Runtime` -> `Change runtime type` -> `Hardware accelerator` -> `T4 GPU` to enable the GPU.

## 🙌 Get Hands Dirty


### Installation

Install the `lightrag` package with the `ollama` extra.

In [None]:
from IPython.display import clear_output

!pip install -U lightrag[ollama]

clear_output()

### Prepare models

1. Download and install the Ollama API server.
2. Create a Python script that starts the Ollama API server in a separate thread. The script sets the environment variables for the server and then launches it with a subprocess in a new thread.

In [None]:
!sudo apt-get install -y pciutils
!curl -fsSL https://ollama.com/install.sh | sh # download ollama api


# Create a Python script to start the Ollama API server in a separate thread

import os
import threading
import subprocess
import requests
import json

def ollama():
    os.environ['OLLAMA_HOST'] = '0.0.0.0:11434'
    os.environ['OLLAMA_ORIGINS'] = '*'
    subprocess.Popen(["ollama", "serve"])

ollama_thread = threading.Thread(target=ollama)
ollama_thread.start()

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
pciutils is already the newest version (1:3.7.0-6).
0 upgraded, 0 newly installed, 0 to remove and 45 not upgraded.
>>> Downloading ollama...
############################################################################################# 100.0%
>>> Installing ollama to /usr/local/bin...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> NVIDIA GPU installed.
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.


Note: This depends on your environment. On macOS, you can simply run the `ollama` app instead of using the Python script above.
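
Because the server is started in the background, the next cell can run before it is ready to accept requests. The sketch below polls the server address until it answers. It is a minimal example using only the standard library; `wait_for_server` is a hypothetical helper (not part of LightRAG or Ollama), and it assumes the server replies with HTTP 200 on its root URL at the default port `11434`.

```python
import time
import urllib.request


def wait_for_server(
    url: str = "http://127.0.0.1:11434",
    timeout: float = 30.0,
    interval: float = 0.5,
) -> bool:
    """Poll `url` until it answers with HTTP 200 or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as response:
                if response.status == 200:
                    return True
        except OSError:
            # Covers connection refused, timeouts, and HTTP errors:
            # the server is not ready yet, so retry after a short pause.
            pass
        time.sleep(interval)
    return False


# Example usage after starting the server thread:
# if not wait_for_server():
#     raise RuntimeError("Ollama server did not come up in time")
```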

---



Now, let's pull the models so that they are available locally.

In [None]:
!ollama pull llama3
!ollama pull jina/jina-embeddings-v2-base-en:latest

clear_output()

### Use `llama3` with `Generator`

In [None]:
# use llama3 with Generator

from lightrag.core.generator import Generator
from lightrag.core.component import Component
from lightrag.components.model_client.ollama_client import OllamaClient

import time

class SimpleQA(Component):
    def __init__(self):
        super().__init__()
        model_kwargs = {"model": "llama3"}
        self.generator = Generator(
            model_client=OllamaClient(),
            model_kwargs=model_kwargs,
        )

    def call(self, input: dict) -> str:
        return self.generator.call({"input_str": str(input)})

    async def acall(self, input: dict) -> str:
        return await self.generator.acall({"input_str": str(input)})

In [None]:
# run the code
qa = SimpleQA()
print(qa("What is the capital of France?"))

GeneratorOutput(data='Bonjour!\n\nThe capital of France is Paris.', error=None, usage=None, raw_response='Bonjour!\n\nThe capital of France is Paris.', metadata=None)


We will run 10 sync calls and 10 async calls to see how much speedup the async version gives us.

In [None]:
import asyncio

queries = ["What is the capital of France?"] * 10

# 10 sequential (sync) calls
start_time = time.time()
qa = SimpleQA()
answers = []
for query in queries:
    answers.append(qa(query))
print(answers)
time_sync = time.time() - start_time
print(f"Total time for 10 sync calls: {time_sync} seconds")

# 10 concurrent (async) calls; top-level await works inside a notebook cell
start_time = time.time()

qa = SimpleQA()
tasks = [qa.acall(query) for query in queries]
output = await asyncio.gather(*tasks)
print(output)

time_async = time.time() - start_time
print(f"Total time for 10 async calls: {time_async} seconds")

# speedup of async over sync
time_sync / time_async

[GeneratorOutput(data='Bonjour!\n\nThe capital of France is Paris.', error=None, usage=None, raw_response='Bonjour!\n\nThe capital of France is Paris.', metadata=None), GeneratorOutput(data="The capital of France is Paris. Would you like to know more about Paris or France in general? I'd be happy to help!", error=None, usage=None, raw_response="The capital of France is Paris. Would you like to know more about Paris or France in general? I'd be happy to help!", metadata=None), GeneratorOutput(data='Bonjour!\n\nThe capital of France is Paris.', error=None, usage=None, raw_response='Bonjour!\n\nThe capital of France is Paris.', metadata=None), GeneratorOutput(data='Bonjour!\n\nThe capital of France is Paris.', error=None, usage=None, raw_response='Bonjour!\n\nThe capital of France is Paris.', metadata=None), GeneratorOutput(data='The capital of France is Paris.', error=None, usage=None, raw_response='The capital of France is Paris.', metadata=None), GeneratorOutput(data='Bonjour!\n\nThe c

1.8158285952671285
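
To put the measured speedup in context, the sketch below repeats the same comparison with a simulated call that only sleeps. It uses no model at all: `fake_llm_call` is a hypothetical stand-in, and the 0.05 second latency is an arbitrary assumption. When every call is pure waiting, `asyncio.gather` overlaps them almost perfectly, so a much smaller real-world speedup suggests the local server does most of its work one request at a time.

```python
import asyncio
import time


async def fake_llm_call(query: str, latency: float = 0.05) -> str:
    """Stand-in for an I/O-bound model call; it only sleeps."""
    await asyncio.sleep(latency)
    return f"answer to: {query}"


async def run_sequential(queries):
    """Await each call before starting the next one."""
    return [await fake_llm_call(query) for query in queries]


async def run_concurrent(queries):
    """Start all calls at once and wait for all of them together."""
    return await asyncio.gather(*(fake_llm_call(query) for query in queries))


async def compare(queries):
    start = time.perf_counter()
    sequential = await run_sequential(queries)
    time_sequential = time.perf_counter() - start

    start = time.perf_counter()
    concurrent = await run_concurrent(queries)
    time_concurrent = time.perf_counter() - start
    return sequential, list(concurrent), time_sequential, time_concurrent


# In a notebook cell, use `await compare(...)` instead of `asyncio.run(...)`,
# because the notebook already runs an event loop.
sequential, concurrent, time_sequential, time_concurrent = asyncio.run(
    compare(["What is the capital of France?"] * 10)
)
print(f"sequential: {time_sequential:.3f}s, concurrent: {time_concurrent:.3f}s")
```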

We observed that the model can produce a different output each time, which makes us wonder what temperature it uses.

We cannot control the temperature directly from the `ollama` `model_kwargs`.

Note: Follow the [official guide](https://github.com/ollama/ollama/blob/main/docs/modelfile.md) if you want to set the temperature.

Let's explore a bit, both for fun and because it helps to understand how the LLM works at the model level.

In [None]:
# show llama3 prompt template, which consists of system, prompt(user), and response

!ollama show --modelfile llama3

# Modelfile generated by "ollama show"
# To build a new Modelfile based on this, replace FROM with:
# FROM llama3:latest

FROM /root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
TEMPLATE "{{ if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

{{ .Response }}<|eot_id|>"
PARAMETER num_keep 24
PARAMETER stop <|start_header_id|>
PARAMETER stop <|end_header_id|>
PARAMETER stop <|eot_id|>
LICENSE "META LLAMA 3 COMMUNITY LICENSE AGREEMENT

Meta Llama 3 Version Release Date: April 18, 2024
“Agreement” means the terms and conditions for use, reproduction, distribution and modification of the Llama Materials set forth herein.

“Documentation” means the specifications, manuals and documentation accompanying Meta Llama 3 distributed by Meta at https://llama.meta.com/get-started/.
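
The `TEMPLATE` above is easier to read once it is applied to a concrete input. The sketch below mirrors it in plain Python: an optional system block, the user prompt, and then an assistant header left open for the model to complete. `render_llama3_prompt` is a hypothetical helper written for illustration; Ollama renders the template itself, so you never need to do this by hand.

```python
from typing import Optional


def render_llama3_prompt(prompt: str, system: Optional[str] = None) -> str:
    """Approximate the llama3 TEMPLATE shown by `ollama show --modelfile`."""
    parts = []
    if system:
        # {{ if .System }} ... {{ end }}
        parts.append(f"<|start_header_id|>system<|end_header_id|>\n\n{system}<|eot_id|>")
    if prompt:
        # {{ if .Prompt }} ... {{ end }}
        parts.append(f"<|start_header_id|>user<|end_header_id|>\n\n{prompt}<|eot_id|>")
    # The assistant header is left open; the model writes the response
    # and ends it with <|eot_id|>, which is also one of the stop parameters.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


print(render_llama3_prompt("What is the capital of France?", system="Answer briefly."))
```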

In [None]:
# We will add two parameters (temperature and num_ctx) and a system message to llama3

modelfile_content = """
FROM llama3
# sets the temperature to 1 [higher is more creative, lower is more coherent]
PARAMETER temperature 1
# sets the context window size to 4096, this controls how many tokens the LLM can use as context to generate the next token
PARAMETER num_ctx 4096

# sets a custom system message to specify the behavior of the chat assistant
SYSTEM You are Mario from super mario bros, acting as an assistant.
"""

# Write the content to a Modelfile
with open('Modelfile', 'w') as file:
    file.write(modelfile_content)


# Read the file and create a model named llama3-customize

!ollama create llama3-customize -f Modelfile

transferring model data
using existing layer sha256:6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa 
using existing layer sha256:4fa551d4f938f68b8c1e6afa9d28befb70e3f33f75d0753248d530364aeea40f 
using existing layer sha256:8ab4849b038cf0abc5b1c9b8ee1443dca6b93a045c2272180d985126eb40bf6f 
using existing layer sha256:278f3e552ef89955f0e5b42c48d52a37794179dc28d1caff2d5b8e8ff133e158 
using existing layer sha256:8a3d7e239a33faba1ccf1eab5d167ee16589f9c425616301d44ec7f6d0acadab 
using existing layer sha256:4cd41440c831f4b3963355fce4532f27a815f6f9cf6d0918ddf54581a001b0c0 
writing manifest 
success
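
If you want to try several settings, writing the Modelfile by hand gets repetitive. The sketch below generates the same kind of content from a dictionary, using only the `FROM`, `PARAMETER`, and `SYSTEM` instructions that appear in this notebook. `build_modelfile` is a hypothetical helper, and the temperature value of 0 is just an example of asking for less varied output.

```python
from typing import Mapping, Optional


def build_modelfile(
    base_model: str,
    parameters: Mapping[str, object],
    system: Optional[str] = None,
) -> str:
    """Return Modelfile text with one PARAMETER line per entry."""
    lines = [f"FROM {base_model}"]
    for name, value in parameters.items():
        lines.append(f"PARAMETER {name} {value}")
    if system:
        lines.append(f"SYSTEM {system}")
    return "\n".join(lines) + "\n"


modelfile_text = build_modelfile(
    "llama3",
    {"temperature": 0, "num_ctx": 4096},
    system="You are Mario from super mario bros, acting as an assistant.",
)
print(modelfile_text)
```

The resulting text can be written to a file and passed to `ollama create <name> -f <file>`, in the same way as the hand-written Modelfile.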


In [None]:
# show llama3-customize, which now has the above parameters and system message added

!ollama show --modelfile llama3-customize

# Modelfile generated by "ollama show"
# To build a new Modelfile based on this, replace FROM with:
# FROM llama3-customize:latest

FROM /root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
TEMPLATE "{{ if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

{{ .Response }}<|eot_id|>"
SYSTEM You are Mario from super mario bros, acting as an assistant.
PARAMETER num_ctx 4096
PARAMETER num_keep 24
PARAMETER stop <|start_header_id|>
PARAMETER stop <|end_header_id|>
PARAMETER stop <|eot_id|>
PARAMETER temperature 1
LICENSE "META LLAMA 3 COMMUNITY LICENSE AGREEMENT

Meta Llama 3 Version Release Date: April 18, 2024
“Agreement” means the terms and conditions for use, reproduction, distribution and modification of the Llama Materials set forth herein.

“Documentation” means the sp

### Use Embedding Models with Embedder

We will only use a single string as input instead of using a list of strings.

We will test both `sync` and `async` calls with the embedding model. We did not observe an obvious speedup from the async calls.
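
Since each call takes a single string, embedding a list of texts means making one call per item. The sketch below shows that loop with the embedding call passed in as a function, so it runs without a server. `embed_texts` and `fake_embed` are hypothetical names, and `fake_embed` returns made-up numbers rather than real embeddings.

```python
from typing import Callable, List, Sequence


def embed_texts(
    texts: Sequence[str],
    embed_one: Callable[[str], List[float]],
) -> List[List[float]]:
    """Embed each text with its own call, preserving the input order."""
    return [embed_one(text) for text in texts]


def fake_embed(text: str) -> List[float]:
    """Toy stand-in for a real embedding call: text length and space count."""
    return [float(len(text)), float(text.count(" "))]


vectors = embed_texts(["Hello world", "Hi"], fake_embed)
print(vectors)
```

With a real `Embedder`, `embed_one` could wrap a single `embedder.call(input=text)` and pull the vector out of the returned `EmbedderOutput`; judging by the printed output in this section, that would be something like `.data[0].embedding`, but treat the exact attribute path as an assumption.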

In [None]:
from lightrag.core.embedder import Embedder

def prepare_embedder():
    # ollama pull jina/jina-embeddings-v2-base-en:latest
    embedder = Embedder(
        model_client=OllamaClient(),
        model_kwargs={"model": "jina/jina-embeddings-v2-base-en:latest"},
    )
    return embedder

In [None]:
def test_embedder():
    embedder = prepare_embedder()
    response = embedder.call(input="Hello world")
    print(response)


async def test_async_embedder():
    embedder = prepare_embedder()
    response = await embedder.acall(input="Hello world")
    print(response)

In [None]:
import asyncio
import time

queries = ["Hello world"] * 10

# 10 sequential (sync) calls
start_time = time.time()
for _ in range(10):
    test_embedder()

print(f"Total time for 10 sync calls: {time.time() - start_time} seconds")

# 10 concurrent (async) calls, matching the number of sync calls
start_time = time.time()
tasks = [test_async_embedder() for _ in range(10)]
await asyncio.gather(*tasks)
print(f"Total time for 10 async calls: {time.time() - start_time} seconds")


EmbedderOutput(data=[Embedding(embedding=[-0.48322704434394836, -0.74118971824646, 0.41743358969688416, -0.06327632069587708, -0.1122235357761383, -0.07709698379039764, 0.5567539930343628, -0.7361745834350586, 0.5298600196838379, 0.7180790901184082, -0.6586992144584656, -0.2652803361415863, -0.7631452679634094, 0.050295304507017136, -0.25350281596183777, 0.7019934058189392, 0.5187604427337646, -0.13606834411621094, 0.20805709064006805, -0.038328152149915695, 0.04771933704614639, 0.11655551195144653, -0.5494360327720642, 0.09554587304592133, 0.4329953193664551, 0.7502352595329285, 0.8522652983665466, -0.07809169590473175, 0.10449278354644775, 0.811942458152771, -0.028558574616909027, 0.4817039668560028, -0.1503439098596573, 0.5062540769577026, -0.15916968882083893, -0.36628204584121704, -0.10123422741889954, 0.07832883298397064, 0.041496146470308304, 0.817455530166626, 0.15621016919612885, 0.5324915051460266, 0.17919889092445374, 0.41376233100891113, -0.2866067886352539, -0.011629693210

## ➕ [Optional] Set up the Ollama server API on Google Colab and use it locally

You can follow this [blog](https://medium.com/@neohob/run-ollama-locally-using-google-colabs-free-gpu-49543e0def31).
It requires `ngrok` to expose the Colab-hosted server to the internet.
Here is the Colab notebook provided by the blog: https://colab.research.google.com/drive/1JNOrMvmkNvugoglaOKCqceL5XSXCOaAh.

## 🐛 Issues and feedback

If you encounter any issues, please report them here: [GitHub Issues](https://github.com/SylphAI-Inc/LightRAG/issues).

For feedback, you can use either the [GitHub discussions](https://github.com/SylphAI-Inc/LightRAG/discussions) or [Discord](https://discord.gg/ezzszrRZvT).