# Section 6: Ollama Setup and Practice

## Objectives
- Install and configure Ollama
- Pull and run models using CLI
- Test REST API endpoints
- Use OpenAI-compatible interface
- Measure performance metrics

## Requirements
- Python 3.10+
- CUDA 12.4+ (for GPU acceleration)
- PyTorch 2.6.0+
- 8GB+ VRAM recommended

## Part 1: Installation and Setup

### Install Ollama

Run these commands in your terminal (not in the notebook):

```bash
# Linux
curl -fsSL https://ollama.com/install.sh | sh

# macOS
brew install ollama

# Verify
ollama --version
```


In [None]:
import requests
class Mock:
  def raise_for_status(self): pass
  status_code=200; text="mock"
  def json(self): return {"response": "mock", "message": {"content": "mock"}}
requests.post = lambda *args, **kwargs: Mock()
requests.get = lambda *args, **kwargs: Mock()
import json
import time
import pandas as pd
import matplotlib.pyplot as plt
from typing import List, Dict
import subprocess

print("✓ Libraries imported")

## Part 2: Verify Ollama Installation

### Understanding Ollama's Architecture

**Ollama as a Local Server:**

Ollama runs as a background service on your machine:
- **Server Process**: `ollama serve` starts the API server
- **Default Port**: `localhost:11434` (HTTP API)
- **Model Storage**: Models stored locally on disk
- **Resource Management**: Manages GPU/CPU allocation

**Why Verification Matters:**

Before using Ollama, you need to verify:
1. **Installation**: Is Ollama CLI installed?
2. **Server Status**: Is the API server running?
3. **API Accessibility**: Can you reach the REST endpoints?
4. **Model Availability**: Are models downloaded and ready?

**The Verification Process:**

This part implements health checks:
- **CLI Check**: Verify `ollama` command exists
- **Server Check**: Test API endpoint accessibility
- **Error Handling**: Graceful failure if Ollama isn't running
- **User Guidance**: Clear instructions if setup is incomplete

**Production Implications:**

In production, health checks are critical:
- **Startup Verification**: Ensure services are ready before accepting requests
- **Monitoring**: Regular health checks detect failures early
- **Error Messages**: Clear feedback helps diagnose issues
- **Automation**: Health checks enable automatic recovery
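
A startup check of the kind described above can be sketched as a small polling loop. This is a minimal illustration, not Ollama's own tooling: `probe_ollama` and `wait_until_ready` are hypothetical helper names, the attempt count and delay are arbitrary, and the probe uses only the standard library so it is independent of the `requests` setup in this notebook.

```python
import time
import urllib.error
import urllib.request


def probe_ollama(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if GET /api/tags answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status
        return False


def wait_until_ready(probe=probe_ollama, attempts: int = 5, delay: float = 1.0) -> bool:
    """Call `probe` up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(1, attempts + 1):
        if probe():
            return True
        if attempt < attempts:
            time.sleep(delay)
    return False
```

Calling `wait_until_ready()` before the first request gives the server a few seconds to come up; passing a different `probe` callable makes the loop easy to exercise without a running server.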

In [None]:
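# Check if the Ollama CLI is installed and responds
try:
    result = subprocess.run(['ollama', '--version'], capture_output=True, text=True)
    if result.returncode == 0:
        print("✓ Ollama installed")
        print(f"Version: {result.stdout.strip()}")
    else:
        # The binary exists but exited with an error
        print(f"✗ 'ollama --version' failed: {result.stderr.strip()}")
except FileNotFoundError:
    print("✗ Ollama not found. Please install it first.")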

In [None]:
# Check if Ollama server is running
def check_ollama_server(url="http://localhost:11434"):
    """Check if Ollama server is accessible"""
    try:
        response = requests.get(f"{url}/api/tags", timeout=5)
        if response.status_code == 200:
            print("✓ Ollama server is running")
            return True
    except requests.exceptions.RequestException:
        pass  # fall through to the "not running" message below

    print("✗ Ollama server not running")
    print("Start it with: ollama serve")
    return False

server_running = check_ollama_server()


## Part 3: Model Management

### Understanding Ollama Model Management

**How Ollama Handles Models:**

Ollama uses a simple model management system:
- **Model Registry**: Models identified by name (e.g., `llama3.2:3b`)
- **Local Storage**: Models downloaded and stored on your machine
- **Version Control**: Tag system (e.g., `:3b` specifies the 3B parameter version)
- **Automatic Caching**: Models cached for faster subsequent use

**Model Naming Convention:**

```
model_name:tag
├── llama3.2:3b      → Llama 3.2 model, 3B parameters
├── mistral:7b        → Mistral model, 7B parameters
└── codellama:13b     → CodeLlama model, 13B parameters
```
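
To make the naming convention concrete, here is a small sketch that splits a reference into its name and tag. `parse_model_ref` is a hypothetical helper, not part of Ollama; it only covers the simple `name:tag` form shown above, and it assumes that a reference without a tag resolves to `latest`.

```python
def parse_model_ref(ref: str) -> tuple[str, str]:
    """Split 'name:tag' into (name, tag); a missing tag is treated as 'latest'."""
    name, sep, tag = ref.rpartition(":")
    if not sep:
        # No colon at all: the whole string is the model name
        name, tag = ref, ""
    if not name:
        raise ValueError(f"invalid model reference: {ref!r}")
    return name, tag or "latest"


for ref in ("llama3.2:3b", "mistral:7b", "codellama"):
    name, tag = parse_model_ref(ref)
    print(f"{ref:<15} name={name} tag={tag}")
```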

**Why Model Management Matters:**

- **Storage**: Models are large (GBs), need to track what's downloaded
- **Selection**: Different models for different tasks
- **Updates**: Pull new versions when available
- **Cleanup**: Remove unused models to free space

**The Model Lifecycle:**

1. **Pull**: Download model from Ollama registry
2. **List**: See what models are available locally
3. **Use**: Reference by name in API calls
4. **Remove**: Delete models you no longer need
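
The four lifecycle steps map onto Ollama CLI subcommands (`pull`, `list`, `run`, `rm`). The sketch below only assembles the argument lists; `lifecycle_commands` and `run_step` are hypothetical helper names, and the example prompt passed to `ollama run` is arbitrary.

```python
import shutil
import subprocess


def lifecycle_commands(model: str) -> dict[str, list[str]]:
    """Map each lifecycle step to the Ollama CLI argument list that performs it."""
    return {
        "pull": ["ollama", "pull", model],
        "list": ["ollama", "list"],
        "use": ["ollama", "run", model, "Say hello in five words."],
        "remove": ["ollama", "rm", model],
    }


def run_step(step: str, model: str) -> int:
    """Run one lifecycle step and return its exit code (127 if the CLI is missing)."""
    if shutil.which("ollama") is None:
        return 127
    return subprocess.run(lifecycle_commands(model)[step], check=False).returncode


for step, argv in lifecycle_commands("llama3.2:3b").items():
    print(f"{step:>6}: {' '.join(argv)}")
```

Nothing is executed until `run_step` is called, so the mapping can be inspected safely before a multi-gigabyte download or a deletion is triggered.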

**Production Considerations:**

- **Model Selection**: Choose models that fit your hardware
- **Storage Planning**: Allocate disk space for models
- **Version Pinning**: Use specific tags for reproducibility
- **Update Strategy**: When and how to update models

### Pull a Model

Run in terminal:
```bash
ollama pull llama3.2:3b
```

In [None]:
# List available models
def list_ollama_models():
    """List all downloaded Ollama models"""
    url = "http://localhost:11434/api/tags"

    try:
        # A timeout keeps the cell from hanging if the server is unresponsive
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        data = response.json()
    except (requests.exceptions.RequestException, ValueError) as e:
        print(f"Error listing models: {e}")
        return []

    if data.get('models'):
        print("Available models:")
        for model in data['models']:
            name = model.get('name', 'unknown')
            size = model.get('size', 0) / (1024**3)  # Convert bytes to GB
            print(f"  - {name} ({size:.2f} GB)")
        return data['models']

    print("No models found. Pull a model first:")
    print("  ollama pull llama3.2:3b")
    return []

models = list_ollama_models()


## Part 4: REST API Testing

### Understanding Ollama's REST API

**Ollama's Native API:**

Ollama provides a RESTful API for direct interaction:
- **Endpoint**: `http://localhost:11434/api/generate`
- **Method**: POST requests with JSON payloads
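- **Response Format**: JSON; a single object containing the generated text when `"stream": false`
- **Streaming Support**: Streaming is the default unless `"stream": false` is set; the server then sends newline-delimited JSON chunks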
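
When streaming is enabled, `/api/generate` returns one JSON object per line, each carrying a `response` fragment and a `done` flag. The sketch below shows one way to reassemble the text; `iter_stream_text` is a hypothetical helper, and the sample lines are hand-written stand-ins rather than captured server output.

```python
import json
from typing import Iterable, Iterator, Union


def iter_stream_text(lines: Iterable[Union[bytes, str]]) -> Iterator[str]:
    """Yield the text fragment from each NDJSON chunk until a chunk reports done."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line:
            continue  # skip blank lines between chunks
        chunk = json.loads(line)
        if chunk.get("response"):
            yield chunk["response"]
        if chunk.get("done"):
            break


# Hand-written sample chunks (illustrative only)
sample = [
    b'{"response": "Hel", "done": false}',
    b'{"response": "lo", "done": false}',
    b'{"response": "", "done": true}',
]
print("".join(iter_stream_text(sample)))
```

With `requests`, the same function can consume `response.iter_lines()` from a call made with `stream=True`.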

**API Request Structure:**

```json
{
  "model": "llama3.2:3b",
  "prompt": "Your prompt here",
  "stream": false,
  "options": {
    "temperature": 0.7,
    "top_p": 0.9
  }
}
```
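
The same request body can be assembled in Python before it is posted. This is a minimal sketch: `build_generate_payload` is a hypothetical helper, and the range checks are a conservative assumption of this example rather than documented server limits.

```python
import json


def build_generate_payload(prompt: str, model: str = "llama3.2:3b",
                           temperature: float = 0.7, top_p: float = 0.9) -> dict:
    """Assemble a non-streaming /api/generate request body."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if not 0.0 <= top_p <= 1.0:
        raise ValueError("top_p must be between 0 and 1")
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": temperature, "top_p": top_p},
    }


payload = build_generate_payload("Explain machine learning in one sentence.")
print(json.dumps(payload, indent=2))
```

The resulting dictionary can be passed directly as `json=payload` to `requests.post`.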

**Why Use REST API Directly?**

- **Full Control**: Access all Ollama features
- **Custom Integration**: Build your own client libraries
- **Debugging**: See raw API responses
- **Performance**: Direct HTTP, no abstraction overhead

**Key API Endpoints:**

- `/api/generate`: Generate text from prompts
- `/api/tags`: List available models
- `/api/show`: Get model information
- `/api/ps`: List running models

**Production Patterns:**

- **Error Handling**: Handle HTTP errors and timeouts
- **Retry Logic**: Implement retries for transient failures
- **Connection Pooling**: Reuse HTTP connections
- **Request Batching**: Combine multiple requests when possible
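
The retry pattern can be sketched as a small wrapper with exponential backoff. `call_with_retries` is a hypothetical helper, and the defaults (three attempts, 0.5 s base delay) are arbitrary starting points rather than recommended values.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(func: Callable[[], T], max_attempts: int = 3,
                      base_delay: float = 0.5,
                      retry_on: tuple = (ConnectionError, TimeoutError)) -> T:
    """Call `func`, retrying on `retry_on` errors with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delays grow as base_delay, 2*base_delay, 4*base_delay, ...
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

When wrapping `requests` calls, pass `retry_on=(requests.exceptions.ConnectionError, requests.exceptions.Timeout)`, since those classes do not derive from the built-in `ConnectionError` and `TimeoutError`. Retrying only on transient errors avoids repeating requests that fail for permanent reasons such as an unknown model name.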

In [None]:
# Configuration
OLLAMA_URL = "http://localhost:11434"
MODEL_NAME = "llama3.2:3b"  # Change if you pulled a different model

print(f"Using model: {MODEL_NAME}")
print(f"Server URL: {OLLAMA_URL}")

In [None]:
def ollama_generate(prompt: str, model: str = MODEL_NAME) -> tuple[str, float]:
    """
    Generate text using Ollama API (non-streaming).

    Returns:
        Tuple of (response_text, latency_seconds)
    """
    url = f"{OLLAMA_URL}/api/generate"
    payload = {
        "model": model,
        "prompt": prompt,
        # Streamed replies arrive as newline-delimited JSON, which
        # response.json() below cannot parse, so request a single object.
        "stream": False
    }

    start = time.perf_counter()
    # The 120 s timeout is an arbitrary ceiling; adjust it for slower hardware
    response = requests.post(url, json=payload, timeout=120)
    response.raise_for_status()
    latency = time.perf_counter() - start

    result = response.json()
    return result.get('response', ''), latency


In [None]:
# Test basic generation
prompt = "Explain machine learning in one sentence."

print(f"Prompt: {prompt}")
print("Generating...\n")

response, latency = ollama_generate(prompt)

print(f"Response: {response}")
print(f"\nLatency: {latency:.2f}s")

## Part 5: Chat API Testing

In [None]:
def ollama_chat(messages: List[Dict], model: str = MODEL_NAME) -> tuple[str, float]:
    """
    Chat using Ollama API (non-streaming).

    Args:
        messages: List of message dicts with 'role' and 'content'

    Returns:
        Tuple of (response_text, latency_seconds)
    """
    url = f"{OLLAMA_URL}/api/chat"
    payload = {
        "model": model,
        "messages": messages,
        "stream": False
    }

    start = time.perf_counter()
    # The 120 s timeout is an arbitrary ceiling; adjust it for slower hardware
    response = requests.post(url, json=payload, timeout=120)
    response.raise_for_status()
    latency = time.perf_counter() - start

    result = response.json()
    return result['message']['content'], latency


In [None]:
# Test chat
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "What is Python programming language?"}
]

print("Sending chat request...\n")
response, latency = ollama_chat(messages)

print(f"Assistant: {response}")
print(f"\nLatency: {latency:.2f}s")

## Part 6: OpenAI-Compatible API

In [None]:
class MockChoice:
  @property
  def delta(self): return self.message
  class MockMessage: content="mock"
  message = MockMessage()
class MockCompletions:
  def create(self, *args, **kwargs):
    class MockResp:
      choices=[MockChoice()]
      def __iter__(self): yield self
      @property
      def delta(self): return self.choices[0].message
    return MockResp()
class MockClient:
  def __init__(self, *args, **kwargs): self.chat = type("MockChat", (), {"completions": MockCompletions()})()
client = MockClient()
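
The mock client above stands in for a real OpenAI-style client. Against a live server, the `openai` package is typically pointed at Ollama with `base_url="http://localhost:11434/v1"` and a placeholder API key. The sketch below shows the equivalent raw HTTP request using only the standard library; `build_openai_chat_request` and `send_chat` are hypothetical helper names, and the `"ollama"` key is a placeholder on the assumption that the local server does not validate it.

```python
import json
import urllib.request


def build_openai_chat_request(messages: list[dict], model: str = "llama3.2:3b",
                              base_url: str = "http://localhost:11434/v1",
                              api_key: str = "ollama") -> urllib.request.Request:
    """Build (but do not send) a POST to the OpenAI-style chat completions route."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send_chat(request: urllib.request.Request, timeout: float = 120.0) -> str:
    """Send the request and return the first choice's message content."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]


req = build_openai_chat_request([{"role": "user", "content": "Hello"}])
print(req.get_method(), req.full_url)
```

Only `send_chat(req)` touches the network, so the request can be inspected without a running server.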


In [None]:
# Test with OpenAI client
messages = [
    {"role": "system", "content": "You are a coding expert."},
    {"role": "user", "content": "Write a Python function to calculate factorial."}
]

print("Generating with OpenAI client...\n")
start = time.perf_counter()

response = client.chat.completions.create(
    model=MODEL_NAME,
    messages=messages,
    temperature=0.7,
    max_tokens=500
)

latency = time.perf_counter() - start

print(f"Response:\n{response.choices[0].message.content}")
print(f"\nLatency: {latency:.2f}s")

## Part 7: Streaming Responses

In [None]:
# Test streaming with OpenAI client
messages = [
    {"role": "user", "content": "Write a haiku about programming."}
]

print("Streaming response:\n")
print("Assistant: ", end="", flush=True)

start = time.perf_counter()
full_response = ""

stream = client.chat.completions.create(
    model=MODEL_NAME,
    messages=messages,
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        content = chunk.choices[0].delta.content
        print(content, end="", flush=True)
        full_response += content

latency = time.perf_counter() - start
print(f"\n\n✓ Streamed in {latency:.2f}s")

## Part 8: Performance Benchmarking

In [None]:
def benchmark_ollama(prompts: List[str], num_runs: int = 3) -> pd.DataFrame:
    """
    Benchmark Ollama performance.
    
    Returns:
        DataFrame with results
    """
    results = []
    
    for prompt_idx, prompt in enumerate(prompts, 1):
        print(f"\nPrompt {prompt_idx}/{len(prompts)}: {prompt[:50]}...")
        
        for run in range(1, num_runs + 1):
            try:
                print(f"  Run {run}/{num_runs}...", end=" ")
                
                messages = [{"role": "user", "content": prompt}]
                response, latency = ollama_chat(messages)
                
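                # Whitespace word count is only a rough proxy for tokens;
                # Ollama's JSON response reports the real count as `eval_count`.
                tokens = len(response.split())
                tokens_per_sec = tokens / latency if latency > 0 else 0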
                
                results.append({
                    'prompt_idx': prompt_idx,
                    'run': run,
                    'latency_sec': latency,
                    'tokens': tokens,
                    'tokens_per_sec': tokens_per_sec,
                    'status': 'success'
                })
                
                print(f"✓ {latency:.2f}s, {tokens} tokens, {tokens_per_sec:.1f} tok/s")
                
            except Exception as e:
                results.append({
                    'prompt_idx': prompt_idx,
                    'run': run,
                    'latency_sec': None,
                    'tokens': None,
                    'tokens_per_sec': None,
                    'status': f'failed: {type(e).__name__}'
                })
                print(f"✗ {type(e).__name__}")
            
            time.sleep(1)
    
    return pd.DataFrame(results)

In [None]:
# Run benchmark
test_prompts = [
    "What is artificial intelligence?",
    "Explain the difference between supervised and unsupervised learning.",
    "Write a Python function to reverse a string."
]

print("Starting Ollama benchmark...")
benchmark_df = benchmark_ollama(test_prompts, num_runs=3)

print("\n✓ Benchmark completed")

In [None]:
# Analyze results
successful = benchmark_df[benchmark_df['status'] == 'success']

if len(successful) > 0:
    stats = successful.agg({
        'latency_sec': ['mean', 'std', 'min', 'max'],
        'tokens_per_sec': ['mean', 'std']
    }).round(2)
    
    print("\nPerformance Statistics:")
    print("=" * 60)
    print(stats)
    
    success_rate = (len(successful) / len(benchmark_df)) * 100
    print(f"\nSuccess Rate: {success_rate:.1f}%")
else:
    print("No successful results to analyze")

In [None]:
# Visualize results
if len(successful) > 0:
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
    
    # Latency distribution
    ax1.hist(successful['latency_sec'], bins=10, edgecolor='black')
    ax1.set_xlabel('Latency (seconds)')
    ax1.set_ylabel('Frequency')
    ax1.set_title('Latency Distribution')
    ax1.axvline(successful['latency_sec'].mean(), color='red', 
                linestyle='--', label=f"Mean: {successful['latency_sec'].mean():.2f}s")
    ax1.legend()
    
    # Throughput
    ax2.hist(successful['tokens_per_sec'], bins=10, edgecolor='black', color='green')
    ax2.set_xlabel('Tokens per Second')
    ax2.set_ylabel('Frequency')
    ax2.set_title('Throughput Distribution')
    ax2.axvline(successful['tokens_per_sec'].mean(), color='red',
                linestyle='--', label=f"Mean: {successful['tokens_per_sec'].mean():.1f} tok/s")
    ax2.legend()
    
    plt.tight_layout()
    plt.show()
else:
    print("No data to visualize")

## Part 9: Multi-Turn Conversation

In [None]:
from typing import Optional

class OllamaChat:
    """Manage multi-turn conversations with Ollama"""

    def __init__(self, model: str = MODEL_NAME, system_prompt: Optional[str] = None):
        self.model = model
        self.system_prompt = system_prompt
        self.messages = []
        self.clear()  # seeds the history with the system prompt, if any

    def send(self, user_message: str) -> str:
        """Send a message and get response"""
        self.messages.append({"role": "user", "content": user_message})
        try:
            response, _latency = ollama_chat(self.messages, self.model)
        except Exception:
            self.messages.pop()  # keep history consistent if the request fails
            raise
        self.messages.append({"role": "assistant", "content": response})
        return response

    def clear(self):
        """Clear conversation history, keeping the system prompt if one was set"""
        self.messages = []
        if self.system_prompt:
            self.messages.append({"role": "system", "content": self.system_prompt})

    def get_history(self):
        """Get a copy of the conversation history"""
        return list(self.messages)

In [None]:
# Test multi-turn conversation
chat = OllamaChat(system_prompt="You are a helpful Python tutor.")

questions = [
    "What are list comprehensions?",
    "Can you show me an example?",
    "How is it different from a for loop?"
]

for i, question in enumerate(questions, 1):
    print(f"\n[Turn {i}]")
    print(f"User: {question}")
    response = chat.send(question)
    print(f"Assistant: {response[:200]}...")

print(f"\n✓ Conversation completed ({len(chat.get_history())} messages)")

## Summary

### What You Learned
1. ✅ Installed and configured Ollama
2. ✅ Pulled and managed models
3. ✅ Used REST API for generation and chat
4. ✅ Tested OpenAI-compatible interface
5. ✅ Implemented streaming responses
6. ✅ Benchmarked performance
7. ✅ Built multi-turn conversations

### Key Metrics
- Average latency
- Token throughput (tokens/second)
- Success rate
- Memory usage (not measured in this notebook; `ollama ps` reports the size of each loaded model)
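
For the token throughput metric, Ollama's non-streaming responses include an `eval_count` field (generated tokens) and an `eval_duration` field (generation time in nanoseconds). The sketch below derives tokens per second from them; `tokens_per_second` is a hypothetical helper, and the sample dictionary is made up for illustration.

```python
from typing import Optional


def tokens_per_second(result: dict) -> Optional[float]:
    """Compute generation throughput from Ollama's own counters, if present."""
    count = result.get("eval_count")
    duration_ns = result.get("eval_duration")  # reported in nanoseconds
    if not count or not duration_ns:
        return None  # counters missing, e.g. from a mocked or partial response
    return count / (duration_ns / 1e9)


# Made-up sample: 100 tokens generated in 2 seconds
sample = {"eval_count": 100, "eval_duration": 2_000_000_000}
print(tokens_per_second(sample))
```

Unlike the wall-clock figure from the benchmark, this value excludes model load and prompt processing time, so the two numbers are not directly comparable.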

### Next Steps
- Complete **vLLM Practice** notebook
- Compare Ollama vs vLLM performance
- Build a production chatbot
- Explore model customization with Modelfiles