## Week 6 Lab Manual
### Foundations of Deep Learning & AI Functionality

**Instructor Note**: This lab manual provides the aim, code, and explanation for each practical task. Focus on the architectural patterns and the transition from theoretical concepts to functional AI implementations.

---

# Week 6: Local LLM Deep Dive & Quantization

Welcome to Week 6. After exploring RAG systems, we now focus on the engine that powers many of these systems locally: **Ollama** and the concept of **Quantization**.

###  Weekly Table of Contents
1. [Advanced Local Control with Ollama SDK](#-Lab-6.1:-Advanced-Local-Control-with-Ollama-SDK)
2. [Local LLM Benchmarking (Speed vs Model Size)](#-Lab-6.2:-Local-LLM-Benchmarking-(Speed-vs-Model-Size))
3. [Understanding Quantization Levels & Modelfiles](#-Lab-6.3:-Understanding-Quantization-Levels-&-Modelfiles)

###  Learning Objectives
1.  Understand the benefits and trade-offs of Local vs Cloud LLMs.
2.  Learn the basics of LLM Quantization (4-bit, 8-bit).
3.  Master the Ollama Python SDK for advanced control.
4.  Build a local benchmarking tool to measure tokens-per-second (TPS).

---
## 6.1 Local LLMs & The VRAM Problem

Cloud models (like Gemini) run on large server clusters, and their parameter counts are generally assumed to be far beyond what consumer hardware can hold in memory. To run an LLM on a consumer laptop, we rely on:
*   **Smaller Architectures:** Like Gemma 2 (2B or 9B parameters).
*   **Quantization:** Reducing the precision of weights to save memory.
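
To make the memory saving concrete, here is a rough sketch that estimates the memory needed for the weights alone as `parameters × bits per weight / 8`. The parameter counts are the nominal sizes from the model names, and the helper names (`weight_memory_gib`, `memory_table`) are illustrative rather than part of any library. Real usage is higher because of the KV cache, activations, and runtime overhead.

```python
def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / (1024 ** 3)

# Nominal sizes taken from the model names ("2B", "9B"); exact counts differ slightly.
MODELS = {"2B": 2e9, "9B": 9e9}
PRECISIONS = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4}

def memory_table() -> dict:
    """Map (model size, precision) to the estimated weight memory in GiB."""
    return {
        (name, precision): round(weight_memory_gib(params, bits), 2)
        for name, params in MODELS.items()
        for precision, bits in PRECISIONS.items()
    }

for (name, precision), gib in memory_table().items():
    print(f"{name} @ {precision}: ~{gib} GiB")
```

Halving the bits per weight halves the estimate, which is why a 4-bit build of a model can fit on hardware where the full-precision weights cannot.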

In [None]:
# 📦 WEEK 6 INITIALIZATION
import os
import time
import ollama
from dotenv import load_dotenv
from IPython.display import Markdown, display

# --- CONFIGURATION ---
load_dotenv(override=True)
LOCAL_MODEL = "gemma2:2b" # Updated to gemma2 for better performance

# Check that the Ollama server is reachable before running the labs
try:
    ollama.list()
    print("✅ Ollama is running.")
    print(f"✅ Week 6 Ready. Local Model: {LOCAL_MODEL}")
except Exception as e:
    # Surface the underlying error so connection problems are easier to diagnose
    print(f"❌ ERROR: Could not reach Ollama ({e}). Please start the Ollama application.")


##  Lab 6.1: Advanced Local Control with Ollama SDK
**Aim**: To programmatically interact with local LLMs using the Ollama Python SDK, enabling streaming responses and custom system instructions.

**Explanation**:
This lab demonstrates how to move beyond basic Ollama CLI commands:
1.  **SDK Integration**: We use the `ollama` Python library to initiate chat sessions directly from code.
2.  **Streaming**: Implementing `stream=True` to improve perceived latency by displaying tokens as they are generated.
3.  **Custom Modelfiles**: Creating new model variants (e.g., `expert-gemma`) with baked-in system instructions and temperature parameters; the code for this step appears in Lab 6.3.
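
One way to quantify the "perceived latency" benefit of streaming is to record the time until the first token arrives. The sketch below is a minimal, hypothetical helper (`collect_stream` is not part of the Ollama SDK) that prints chunks as they arrive and returns the full text plus the time to first token. It assumes each chunk exposes `chunk['message']['content']`, as the chat stream in this lab does, and it uses a hand-written stand-in stream so it runs without a server.

```python
import time

def collect_stream(chunks):
    """Print streamed chunks as they arrive; return (full_text, seconds_to_first_token)."""
    start = time.perf_counter()
    first_token_at = None
    parts = []
    for chunk in chunks:
        text = chunk["message"]["content"]
        # Record the delay until the first non-empty piece of text shows up
        if text and first_token_at is None:
            first_token_at = time.perf_counter() - start
        print(text, end="", flush=True)
        parts.append(text)
    print()
    return "".join(parts), first_token_at

# Stand-in for a streamed chat response, so the sketch runs without Ollama.
fake_stream = [{"message": {"content": t}} for t in ["Use ", "contiguous ", "arrays."]]
full_text, ttft = collect_stream(fake_stream)
print(f"Time to first token: {ttft:.6f}s")
```

With a running server, the same helper should accept the iterator returned by `ollama.chat(model=LOCAL_MODEL, messages=messages, stream=True)`; the timing then starts when iteration begins.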

In [None]:
# Custom system prompt for a technical expert
system_prompt = "You are a senior C++ engineer. Give concise, low-level answers."

def chat_stream(prompt):
    messages = [
        {'role': 'system', 'content': system_prompt},
        {'role': 'user', 'content': prompt}
    ]
    
    response = ollama.chat(model=LOCAL_MODEL, messages=messages, stream=True)
    
    for chunk in response:
        print(chunk['message']['content'], end='', flush=True)

# Test the stream
chat_stream("How do I optimize a for loop for vectorization?")

##  Lab 6.2: Local LLM Benchmarking (Speed vs Model Size)
**Aim**: To build a diagnostic tool that measures the performance of local LLMs in terms of Tokens Per Second (TPS) and response latency.

**Explanation**:
Performance metrics are critical for local deployment:
1.  **Timer Logic**: We capture `start_time` and `end_time` around the generation call.
2.  **Token Counting**: Utilizing Ollama's `eval_count` metadata to get the exact number of tokens generated.
3.  **Efficiency Analysis**: By running this across different models (e.g., 2B vs 9B), students can see how tokens per second typically drops as parameter count grows on their specific hardware.
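
A wall-clock timer around the call also counts model loading and prompt processing, so it can understate generation speed, especially on the first run. Alongside `eval_count`, Ollama responses report `load_duration`, `prompt_eval_duration`, and `eval_duration` in nanoseconds. The sketch below splits those phases apart; `summarize_timing` is a hypothetical helper, and the sample numbers are made up for illustration, not measured.

```python
NS_PER_SECOND = 1_000_000_000  # Ollama reports durations in nanoseconds

def summarize_timing(meta):
    """Split Ollama timing metadata into load, prompt, and generation phases."""
    eval_seconds = meta["eval_duration"] / NS_PER_SECOND
    return {
        "load_s": meta["load_duration"] / NS_PER_SECOND,
        "prompt_s": meta["prompt_eval_duration"] / NS_PER_SECOND,
        "generation_s": eval_seconds,
        # Generation-only throughput: tokens produced divided by time spent producing them
        "generation_tps": meta["eval_count"] / eval_seconds if eval_seconds > 0 else 0.0,
    }

# Illustrative values only (not a real measurement).
sample = {
    "load_duration": 1_500_000_000,
    "prompt_eval_duration": 200_000_000,
    "eval_count": 120,
    "eval_duration": 4_000_000_000,
}

for key, value in summarize_timing(sample).items():
    print(f"{key}: {value:.2f}")
```

Comparing `generation_tps` across models gives a fairer picture than wall-clock TPS, because a model that is already loaded in memory would otherwise look faster than one that had to be loaded first.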

In [None]:
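def benchmark_model(model_name, prompt):
    """Time one non-streaming generation and report wall-clock tokens per second."""
    print(f"\nBenchmarking {model_name}...")
    start_time = time.perf_counter()  # monotonic clock, better suited to timing than time.time()

    try:
        response = ollama.generate(model=model_name, prompt=prompt)

        end_time = time.perf_counter()
        # Wall-clock time includes model loading and prompt processing, not just generation
        total_time = end_time - start_time

        tokens = response['eval_count']  # number of tokens generated
        tps = tokens / total_time if total_time > 0 else 0.0

        print(f"Total Time: {total_time:.2f}s")
        print(f"Total Tokens: {tokens}")
        print(f"Tokens Per Second: {tps:.2f} tokens/s")
    except Exception as e:
        print(f"Error benchmarking {model_name}: {e}. Ensure the model is downloaded.")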

# Run benchmark
benchmark_model(LOCAL_MODEL, 'Explain quantum physics in one paragraph.')

##  Lab 6.3: Understanding Quantization Levels & Modelfiles
**Aim**: To understand how model weight precision impacts performance and how to create custom local model instances using Ollama Modelfiles.

**Explanation**:
Ollama allows us to customize model behavior permanently:
1.  **Quantization**: We discuss the math behind FP32 vs INT4. A 4-bit model is often a good balance between memory use and output quality on consumer hardware.
2.  **Modelfiles**: Similar to Dockerfiles, these let us set parameters like `temperature` and a `SYSTEM` prompt so they are baked into a new model name (e.g., `expert-gemma`).
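
The FP32 vs INT4 math can be illustrated with a simplified sketch: symmetric quantization with a single scale for the whole tensor, mapping floats to signed 4-bit integers in the range -8 to 7. This is an assumption-laden teaching example, not the scheme Ollama uses; the quantized formats it ships work block-wise with per-block scales. The function names and sample weights are illustrative.

```python
def quantize_int4(weights):
    """Symmetric per-tensor quantization to signed 4-bit integers in [-8, 7]."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor: the largest magnitude maps to +/-7
    scale = max_abs / 7 if max_abs > 0 else 1.0
    quantized = [max(-8, min(7, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the 4-bit integers."""
    return [q * scale for q in quantized]

weights = [0.70, -0.42, 0.10, 0.02, -0.70]  # toy FP32 weights
q, scale = quantize_int4(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("Quantized:", q)
print("Restored: ", [round(r, 3) for r in restored])
print(f"Max rounding error: {max_error:.4f} (bounded by scale / 2 = {scale / 2:.4f})")
```

Each weight now needs 4 bits instead of 32, at the cost of a rounding error of at most half the scale. Small weights such as `0.02` collapse to zero, which is one reason practical schemes use many small blocks with their own scales.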

In [None]:
# Step 4: Programmatic Modelfile Creation
# We will create a model called 'expert-gemma' that is configured (via its system prompt, not fine-tuned) for system administration questions.

modelfile = f"""
FROM {LOCAL_MODEL}
PARAMETER temperature 0.2
SYSTEM You are an expert Linux System Administrator. Answer only with technical commands.
"""

print("Creating custom model 'expert-gemma'...")
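# Older SDK releases accept a Modelfile string via `modelfile=`; newer releases replaced it
# with keyword arguments (from_=, system=, parameters=). Check your installed version.
# ollama.create(model='expert-gemma', modelfile=modelfile)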

# Test the custom model
print("\nTesting 'expert-gemma':")
# response = ollama.generate(model='expert-gemma', prompt="How do I check open ports on Ubuntu?")
# print(response['response'])
print("Note: ollama.create is commented out to prevent repeated model creation during walkthroughs.")

In [None]:
# Final summary and completion

print("="*60)
print("WEEK 6 COMPILED NOTEBOOK - SETUP COMPLETE")
print("="*60)
print()
print("✅ Local LLM environment configured")
print("✅ Lab 6.1: Streaming SDK control ready")
print("✅ Lab 6.2: Performance benchmarking tool ready")
print("✅ Lab 6.3: Custom Modelfile generation logic ready")
print()
print("🚀 Ready to explore local inference and quantization!")
print("📊 Use benchmark_model() to compare different local models.")
print("="*60)


---

##  Instructor's Evaluation & Lab Summary

###  Assessment Criteria
1. **Technical Implementation**: Adherence to the lab objectives and code functionality.
2. **Logic & Reasoning**: Clarity in the explanation of the underlying AI principles.
3. **Best Practices**: Use of secure environment variables and structured prompts.

**Lab Completion Status: Verified**
**Focus Area**: Language Modelling & Deep Learning Systems.