<a href="https://colab.research.google.com/github/lizhieffe/llm_knowledge/blob/main/examples/colab/Ollama_on_Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook shows how to run LLM inference with Ollama on Google Colab.

Connect to a GPU runtime for the best performance.
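
To confirm that the runtime actually has a GPU attached before installing anything, a check along these lines may help. This is a minimal sketch: `list_gpus` is a hypothetical helper name, and it assumes NVIDIA hardware exposed through the `nvidia-smi` tool.

```python
import shutil
import subprocess


def list_gpus() -> list[str]:
    """Return the names of NVIDIA GPUs visible to the runtime, or [] if none."""
    # nvidia-smi is only on PATH when an NVIDIA driver is installed.
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return []
    # One GPU name per non-empty output line.
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


print(list_gpus() or "No GPU detected.")
```

An empty list suggests the notebook is connected to a CPU-only runtime.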

Other references
- https://github.com/lizhieffe/Ollama-in-Google-Colab

In [1]:
# @title Download Ollama
%%capture

! sudo apt-get install -y pciutils
! curl -fsSL https://ollama.ai/install.sh | sh

In [2]:
# @title Ollama Libs

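import os
import subprocess
import time
import requests
import json

def server_ollama():
    """Start the Ollama server in the background, listening on all interfaces."""
    os.environ['OLLAMA_HOST'] = '0.0.0.0:11434'
    os.environ['OLLAMA_ORIGINS'] = '*'
    subprocess.Popen(["ollama", "serve"])

def ensure_ollama_running(timeout_s: float = 30.0):
  """Ensure the Ollama service is running, starting it if necessary."""
  res = !ollama ps
  if res and not res[0].startswith("Error"):
    return

  print("Restarting Ollama")
  server_ollama()  # Popen returns immediately, so no extra thread is needed.

  # Poll until the server responds so later commands do not race its startup.
  deadline = time.monotonic() + timeout_s
  while time.monotonic() < deadline:
    res = !ollama ps
    if res and not res[0].startswith("Error"):
      return
    time.sleep(0.5)
  raise RuntimeError(f"Ollama did not start within {timeout_s} seconds.")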

def download_ollama_model_if_necessary(model_name: str):
  """Download given model to Ollama."""
  ensure_ollama_running()

  list_ret = !ollama list
  downloaded = False
  for it in list_ret:
    if it.startswith(model_name):
      downloaded = True
      break

  if not downloaded:
    !ollama pull $model_name



Restarting Ollama

In [None]:
# @title Run inference

# Choose the model to download to Ollama.

# Example: Hugging Face model (uncomment to use instead):
# https://huggingface.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF/tree/main
# model_name = "hf.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:Q5_K_S"

# Example: Ollama library model:
# https://ollama.com/library/qwen3
model_name = "qwen3:0.6b"

download_ollama_model_if_necessary(model_name)

# Make sure Ollama is running.
ensure_ollama_running()

# Run the inference.
messages = [
    {"role": "system", "content": "You are a GPU expert."},
    {"role": "user", "content": "How do T4 and L4 GPUs compare?"},
]

url = 'http://localhost:11434/api/chat'
payload = {
    "model": model_name,
    "messages": messages,
    "stream": False,
    # Sampling parameters such as temperature belong under "options".
    "options": {"temperature": 0.6},
}

response = requests.post(url, json=payload)
response.raise_for_status()  # Fail loudly on HTTP errors.
resp_str = response.json()['message']['content']

print(resp_str)

<think>
Okay, the user is asking about how T4 vs L4 GPUs compare. First, I need to make sure I understand the question correctly. T4 and L4 are both GPUs from NVIDIA, but I should check if there's a specific context or product they're referring to. The user might be comparing these two models for their performance or features.

I should start by confirming that both models are part of NVIDIA's latest product line. T4 is the fourth generation, and L4 is the fourth-generation model. It's possible that the user is comparing them in terms of architecture, performance, or other specifications. 

Next, I need to outline the key differences between the two. For architecture, T4 has more cores and threads, which can handle more tasks. L4 might have fewer cores but a different architecture. Performance-wise, T4 is generally considered better for general computing, while L4 could be suitable for specific tasks. 

I should also mention any specific use cases, like gaming or heavy processing, and 
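
The response above opens with a `<think>` block that holds the model's reasoning ahead of its final answer. If only the answer is wanted, the two parts can be separated with a small helper. This is a minimal sketch: `split_thinking` is a hypothetical name, and it assumes the reasoning is wrapped in a single leading `<think>...</think>` pair, as in the output shown here.

```python
import re

# Matches one leading <think>...</think> block; DOTALL lets "." span newlines.
_THINK_RE = re.compile(r"^\s*<think>(.*?)</think>\s*", re.DOTALL)


def split_thinking(text: str) -> tuple[str, str]:
    """Split a response into (reasoning, answer).

    If the <think> block is never closed (for example, in a truncated
    response), everything after the opening tag is treated as reasoning.
    """
    match = _THINK_RE.match(text)
    if match:
        return match.group(1).strip(), text[match.end():].strip()
    stripped = text.lstrip()
    if stripped.startswith("<think>"):
        return stripped[len("<think>"):].strip(), ""
    return "", text.strip()


# Hypothetical sample string standing in for `resp_str`.
sample = "<think>\nCompare memory and architecture.\n</think>\n\nThe L4 is newer."
reasoning, answer = split_thinking(sample)
print(answer)
```

In the notebook, calling `split_thinking(resp_str)` would return the reasoning and the answer as separate strings.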