<a href="https://colab.research.google.com/github/jessmka/RAG/blob/main/RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Coding a simple RAG from scratch in Colab

How to run Ollama in Google Colab: https://stackoverflow.com/questions/77697302/how-to-run-ollama-in-google-colab

RAG tutorial this notebook is based on: https://huggingface.co/blog/ngxson/make-your-own-rag

In [1]:
!pip install ollama

Collecting ollama
  Downloading ollama-0.4.6-py3-none-any.whl.metadata (4.7 kB)
Collecting httpx<0.28.0,>=0.27.0 (from ollama)
  Downloading httpx-0.27.2-py3-none-any.whl.metadata (7.1 kB)
Downloading ollama-0.4.6-py3-none-any.whl (13 kB)
Downloading httpx-0.27.2-py3-none-any.whl (76 kB)
Installing collected packages: httpx, ollama
  Attempting uninstall: httpx
    Found existing installation: httpx 0.28.1
    Uninstalling httpx-0.28.1:
      Successfully uninstalled httpx-0.28.1
Successfully installed httpx-0.27.2 ollama-0.4.6


In [2]:
!curl https://ollama.ai/install.sh | sh

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 13269    0 13269    0     0  42091      0 --:--:-- --:--:-- --:--:-- 41990
>>> Installing ollama to /usr/local
>>> Downloading Linux amd64 bundle
############################################################################################# 100.0%
>>> Creating ollama user...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.


In [3]:
import os
import asyncio

# NB: Depending on which backend you are running, you may need to set these
# environment variables to get CUDA working.
# Add the CUDA binaries to PATH
os.environ['PATH'] += ':/usr/local/cuda/bin'
# Set LD_LIBRARY_PATH to include both /usr/lib64-nvidia and the CUDA lib directory
os.environ['LD_LIBRARY_PATH'] = '/usr/lib64-nvidia:/usr/local/cuda/lib64'

async def run_process(cmd):
    print('>>> starting', *cmd)
    process = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE
    )

    # Forward each line of a stream to the cell output as it arrives
    async def pipe(lines):
        async for line in lines:
            print(line.decode().strip())

    # Drain stdout and stderr concurrently until the process exits
    await asyncio.gather(pipe(process.stdout), pipe(process.stderr))

import threading

async def start_ollama_serve():
    await run_process(['ollama', 'serve'])

def run_async_in_thread(loop, coro):
    asyncio.set_event_loop(loop)
    loop.run_until_complete(coro)
    loop.close()

# Create a new event loop that will run in a new thread
new_loop = asyncio.new_event_loop()

# Start ollama serve in a separate thread so the cell won't block execution
thread = threading.Thread(target=run_async_in_thread, args=(new_loop, start_ollama_serve()))
thread.start()

>>> starting ollama serve
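
Because the server starts in a background thread, the next cell can run before Ollama is accepting connections. A minimal sketch of a readiness check follows. It assumes the default address `127.0.0.1:11434` reported by the installer above. `wait_for_port` is a hypothetical helper, not part of the `ollama` package.

```python
import socket
import time

def wait_for_port(host="127.0.0.1", port=11434, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not listening yet (or connection refused); wait and retry
            time.sleep(interval)
    return False
```

Calling `wait_for_port()` at the top of the `ollama pull` cell would let that cell fail fast with a clear message if the server never came up. This only checks that the port is open, not that the API is healthy.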


In [4]:
!ollama pull hf.co/CompendiumLabs/bge-base-en-v1.5-gguf
!ollama pull hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF

Couldn't find '/root/.ollama/id_ed25519'. Generating new private key.
Your new public key is:

ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIBvViC/1wVEdCfBFHQ4iUx24PG2AlmHNPzqbukTtnYzJ

2025/01/21 00:12:53 routes.go:1187: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* f

In [None]:
# !pip install datasets

In [None]:
# from datasets import load_dataset

In [None]:
# dataset = load_dataset("text", data_files="https://huggingface.co/ngxson/demo_simple_rag_py/resolve/main/cat-facts.txt")


In [None]:
# dset = list(dataset.data.values())

In [None]:
# dset

In [7]:
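# Assumes Google Drive is mounted at /content/drive.
# Each line of the file is one fact, which we treat as one chunk.
with open('/content/drive/MyDrive/homework/cat-facts.txt', 'r', encoding='utf-8') as file:
  dataset = file.readlines()
print(f'Loaded {len(dataset)} entries')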

Loaded 150 entries


In [8]:
# len(dset)

In [9]:
import ollama

EMBEDDING_MODEL = 'hf.co/CompendiumLabs/bge-base-en-v1.5-gguf'
LANGUAGE_MODEL = 'hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF'

# Each element in the VECTOR_DB will be a tuple (chunk, embedding)
# The embedding is a list of floats, for example: [0.1, 0.04, -0.34, 0.21, ...]
VECTOR_DB = []

def add_chunk_to_database(chunk):
  embedding = ollama.embed(model=EMBEDDING_MODEL, input=chunk)['embeddings'][0]
  VECTOR_DB.append((chunk, embedding))

In [10]:
for i, chunk in enumerate(dataset):
  add_chunk_to_database(chunk)
  print(f'Added chunk {i+1}/{len(dataset)} to the database')

time=2025-01-21T00:13:59.031Z level=INFO source=server.go:104 msg="system memory" total="12.7 GiB" free="11.3 GiB" free_swap="0 B"
time=2025-01-21T00:13:59.032Z level=INFO source=memory.go:356 msg="offload to cpu" layers.requested=-1 layers.model=13 layers.offload=0 layers.split="" memory.available="[11.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="86.0 MiB" memory.required.partial="0 B" memory.required.kv="6.0 MiB" memory.required.allocations="[86.0 MiB]" memory.weights.total="56.4 MiB" memory.weights.repeating="43.8 MiB" memory.weights.nonrepeating="12.6 MiB" memory.graph.full="12.0 MiB" memory.graph.partial="12.0 MiB"
time=2025-01-21T00:13:59.033Z level=INFO source=server.go:376 msg="starting llama server" cmd="/usr/local/lib/ollama/runners/cpu_avx2/ollama_llama_server runner --model /root/.ollama/models/blobs/sha256-74aebb552ea73b271d3b9c709923b4b7633b304fbc897a0498e52a180c3a9da9 --ctx-size 2048 --batch-size 512 --threads 1 --no-mmap --parallel 1 --port 38105"
time=2025-0

In [11]:
def cosine_similarity(a, b):
  dot_product = sum([x * y for x, y in zip(a, b)])
  norm_a = sum([x ** 2 for x in a]) ** 0.5
  norm_b = sum([x ** 2 for x in b]) ** 0.5
  return dot_product / (norm_a * norm_b)
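
To sanity-check the function above, it can help to try it on tiny vectors where the answer is known in advance. The sketch below repeats the same formula (using `math.sqrt` instead of `** 0.5`) and evaluates three hand-picked pairs. The vectors are illustrative and not taken from the embedding model.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the two vector lengths
    dot_product = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot_product / (norm_a * norm_b)

# Same direction, different length: similarity should be close to 1
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors: similarity should be 0
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
# Opposite direction: similarity should be close to -1
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])

print(same_direction, orthogonal, opposite)
```

Note that the function divides by the vector norms, so an all-zero vector would raise `ZeroDivisionError`. Real embeddings are unlikely to be all zeros, but a guard may be worth adding if the inputs are not trusted.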


In [13]:
def retrieve(query, top_n=3):
  query_embedding = ollama.embed(model=EMBEDDING_MODEL, input=query)['embeddings'][0]
  # temporary list to store (chunk, similarity) pairs
  similarities = []
  for chunk, embedding in VECTOR_DB:
    similarity = cosine_similarity(query_embedding, embedding)
    similarities.append((chunk, similarity))
  # sort by similarity in descending order, because higher similarity means more relevant chunks
  similarities.sort(key=lambda x: x[1], reverse=True)
  # finally, return the top N most relevant chunks
  return similarities[:top_n]
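
The ranking step in `retrieve` can be tried without a running Ollama server by substituting hand-written vectors for real embeddings. The sketch below does that and uses `heapq.nlargest` instead of a full sort, which is a swapped-in alternative and not what the cell above does. `top_n_chunks` and `toy_db` are hypothetical names, and the three-dimensional vectors are made up for illustration.

```python
import heapq
import math

def cosine_similarity(a, b):
    dot_product = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot_product / (norm_a * norm_b)

def top_n_chunks(query_embedding, vector_db, top_n=3):
    """Return the top_n (chunk, similarity) pairs, most similar first."""
    scored = (
        (chunk, cosine_similarity(query_embedding, embedding))
        for chunk, embedding in vector_db
    )
    # nlargest keeps only top_n items in memory and returns them in descending order
    return heapq.nlargest(top_n, scored, key=lambda pair: pair[1])

# Toy database with the same (chunk, embedding) layout as VECTOR_DB
toy_db = [
    ("chunk about sleep", [1.0, 0.0, 0.0]),
    ("chunk about diet", [0.0, 1.0, 0.0]),
    ("chunk about sleep and diet", [1.0, 1.0, 0.0]),
]

# A query vector pointing mostly along the first axis
results = top_n_chunks([1.0, 0.1, 0.0], toy_db, top_n=2)
for chunk, similarity in results:
    print(f"{similarity:.2f} {chunk}")
```

For 150 chunks the difference between sorting and `heapq.nlargest` is negligible. The heap version mainly matters when the database is much larger than `top_n`.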

In [14]:
input_query = input('Ask me a question: ')
retrieved_knowledge = retrieve(input_query)

print('Retrieved knowledge:')
for chunk, similarity in retrieved_knowledge:
  print(f' - (similarity: {similarity:.2f}) {chunk}')

instruction_prompt = f'''
You are a helpful chatbot.
Use only the following pieces of context to answer the question. Don't make up any new information:
{''.join([f' - {chunk}' for chunk, similarity in retrieved_knowledge])}
'''


Ask me a question: tell me about fat cats
[GIN] 2025/01/21 - 00:16:48 | 200 |   84.960028ms |       127.0.0.1 | POST     "/api/embed"
Retrieved knowledge:
 - (similarity: 0.70) The largest cat breed is the Ragdoll. Male Ragdolls weigh between 12 and 20 lbs. (5.4-9.0 k). Females weigh between 10 and 15 lbs. (4.5-6.8 k).

 - (similarity: 0.67) The heaviest cat on record is Himmy, a Tabby from Queensland, Australia. He weighed nearly 47 pounds (21 kg). He died at the age of 10.

 - (similarity: 0.67) Cats must have fat in their diet because they can’t produce it on their own.
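
The blank lines between the retrieved facts come from `readlines()`, which keeps each line's trailing newline. One possible way to make prompt construction independent of that detail is to strip each chunk and join with explicit newlines, as in the sketch below. `build_instruction_prompt` is a hypothetical helper and the two facts are placeholders, not entries from the dataset.

```python
def build_instruction_prompt(retrieved):
    """Build the system prompt from a list of (chunk, similarity) pairs."""
    # Strip stray whitespace so each fact sits on exactly one bullet line
    context = "\n".join(f" - {chunk.strip()}" for chunk, _similarity in retrieved)
    return (
        "You are a helpful chatbot.\n"
        "Use only the following pieces of context to answer the question. "
        "Don't make up any new information:\n"
        f"{context}\n"
    )

# Placeholder (chunk, similarity) pairs shaped like the output of retrieve()
sample_retrieved = [("Fact one.\n", 0.90), ("Fact two.\n", 0.80)]
sample_prompt = build_instruction_prompt(sample_retrieved)
print(sample_prompt)
```

Wrapping the prompt in a function also makes it easy to rebuild it for each new question instead of reusing a module-level string.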



In [None]:
stream = ollama.chat(
  model=LANGUAGE_MODEL,
  messages=[
    {'role': 'system', 'content': instruction_prompt},
    {'role': 'user', 'content': input_query},
  ],
  stream=True,
)

# print the response from the chatbot in real-time
print('Chatbot response:')
for chunk in stream:
  print(chunk['message']['content'], end='', flush=True)


Chatbot response:
time=2025-01-21T00:16:57.324Z level=INFO source=server.go:104 msg="system memory" total="12.7 GiB" free="11.2 GiB" free_swap="0 B"
time=2025-01-21T00:16:57.324Z level=INFO source=memory.go:356 msg="offload to cpu" layers.requested=-1 layers.model=17 layers.offload=0 layers.split="" memory.available="[11.2 GiB]" memory.gpu_overhead="0 B" memory.required.full="1.6 GiB" memory.required.partial="0 B" memory.required.kv="256.0 MiB" memory.required.allocations="[1.6 GiB]" memory.weights.total="813.3 MiB" memory.weights.repeating="607.8 MiB" memory.weights.nonrepeating="205.5 MiB" memory.graph.full="544.0 MiB" memory.graph.partial="554.3 MiB"
time=2025-01-21T00:16:57.325Z level=INFO source=server.go:376 msg="starting llama server" cmd="/usr/local/lib/ollama/runners/cpu_avx2/ollama_llama_server runner --model /root/.ollama/models/blobs/sha256-6f85a640a97cf2bf5b8e764087b1e83da0fdb51d7c9fab7d0fece9385611df83 --ctx-size 8192 --batch-size 512 --threads 1 --no-mmap --parallel 4 --