
# Overview of Ollama & Demo with Mistral 7B

This notebook introduces **Ollama**, a platform for running large language models **locally**, and gives a hands-on demo using **Mistral 7B**, a lightweight but capable open-weight model.

### Outline
1. What is Ollama? Motivation & features
2. Model ecosystem
3. Installing & setup
4. Using Ollama via Python / shell
5. Demo: prompt + evaluation
6. Limitations, best practices, and extensions

*This notebook can be run on a laptop with adequate resources (see instructions below).* 



## 1. What is Ollama?

- **Ollama** is an open-source tool for running large language models **locally**, avoiding dependence on cloud APIs.
- **Motivations**:
  - Privacy & data control: your data never leaves your machine.
  - No network round trip, and direct control over inference settings (speed still depends on your hardware).
  - No per-token API cost once the model is downloaded.
  - Freedom to experiment: inspect model details, customize prompts and parameters, and load your own fine-tuned weights.
- Recent releases have added support for new architectures, along with performance and quantization improvements.
- Includes a public model registry with many popular open-weight models.



## 2. Model ecosystem in Ollama

In the Ollama model registry you will find many models from small to large:

- **Mistral 7B**: an open-weight model from Mistral AI with about 7 billion parameters.
- **Phi-4**, **OLMo 2**, and others.

We will use **Mistral 7B**, which provides a good balance of performance and resource requirements for laptop setups when quantized to 4-bit.



## 3. Installing & setup

Run the following shell commands (Linux / macOS). For Windows, use WSL or the Windows installer from the Ollama website.


In [3]:

# Install Ollama (if not installed)
#!curl -fsSL https://ollama.com/install.sh | sh

# Check version in terminal
#ollama --version

# Pull the Mistral 7B model
#!ollama pull mistral

# (Optional) show model info in terminal
#ollama show mistral



The quantized model is roughly 4–6 GB and will run on laptops with at least:
- Apple Silicon M1/M2/M3 with ≥16 GB unified memory, or
- NVIDIA GPU with ≥8 GB VRAM.

CPU-only laptops can also run it but will be slower.
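
The size figure above follows roughly from parameter count times bits per weight. The snippet below is a back-of-the-envelope sketch, assuming a round 7 billion parameters; the helper name `estimate_weight_gb` is made up for this notebook. It ignores quantization metadata, any layers kept at higher precision, and the runtime memory used for the context, so real downloads and memory use are somewhat larger than the 4-bit estimate.

```python
def estimate_weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the raw model weights in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Compare common precisions for a model with ~7 billion parameters (assumed round number)
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{estimate_weight_gb(7e9, bits):.1f} GB")
```

This is why 4-bit quantization matters on a laptop: the same weights at 16-bit precision would need several times the memory.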



## 4. Using Ollama via Python (and shell)

Ollama can be used from the command line or from Python using `subprocess` or its HTTP API.

### CLI chat
```bash
ollama run mistral "Tell me a short story about a robot and a cat."
```
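
The same one-shot CLI call can be made from Python with `subprocess`. This is a minimal sketch, assuming the `ollama` binary is on your `PATH` and the model has already been pulled; the helper names `build_ollama_cmd` and `ollama_cli` are made up for this notebook.

```python
import subprocess


def build_ollama_cmd(prompt: str, model: str = "mistral") -> list[str]:
    """Build the argument list for a one-shot `ollama run MODEL PROMPT` call."""
    return ["ollama", "run", model, prompt]


def ollama_cli(prompt: str, model: str = "mistral", timeout: float = 300.0) -> str:
    """Run the prompt through the Ollama CLI and return the text it prints."""
    result = subprocess.run(
        build_ollama_cmd(prompt, model),
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode bytes to str
        check=True,           # raise CalledProcessError on a non-zero exit code
        timeout=timeout,      # give up if generation takes too long
    )
    return result.stdout.strip()
```

Calling `ollama_cli("Tell me a short story about a robot and a cat.")` should return the story as a string. Passing the command as a list (rather than a shell string) avoids quoting problems when the prompt contains quotes or special characters.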



### Ollama HTTP API
You can call the `ollama` CLI from Python with `subprocess`, but the HTTP API is the preferred method: it avoids starting a new process for every prompt and returns structured JSON.

> NOTE: Before running the next cell, open a terminal and run
```bash
ollama serve
```
> to start the HTTP server.
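
Before sending prompts, it can help to confirm that the server is reachable and that the model has been pulled. The sketch below queries the `/api/tags` endpoint, which lists locally installed models, using only the standard library. It assumes the default address `http://localhost:11434` and a response containing a `models` list whose entries have a `name` field; the helper names are made up for this notebook.

```python
import json
import urllib.request


def has_model(tags: dict, model: str) -> bool:
    """Return True if `model` appears in a parsed /api/tags response."""
    names = [m.get("name", "") for m in tags.get("models", [])]
    # Installed names usually carry a tag suffix, e.g. "mistral:latest"
    return any(n == model or n.split(":")[0] == model for n in names)


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """Fetch the list of locally installed models from the Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)
```

For example, `has_model(fetch_tags(), "mistral")` should be `True` once `ollama pull mistral` has completed. If `fetch_tags()` raises a connection error, the server is probably not running.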

In [4]:
import html

import requests
from IPython.display import display, HTML

def ollama_chat(prompt: str, model: str = "mistral") -> str:
    """
    Send a prompt to the local Ollama HTTP API and return the model's reply.
    """
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
    )
    r.raise_for_status()  # fail loudly if the server returns an error status
    return r.json()["response"]

# Example usage
text = ollama_chat("Tell me a short story about a robot and a cat. One paragraph only.")

# Escape the model output, then display it with wrapping inside a <pre> block
display(HTML(f"<pre style='white-space: pre-wrap; word-wrap: break-word;'>{html.escape(text)}</pre>"))
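
The cell above sets `"stream": False`, so the reply arrives only when generation has finished. With streaming enabled, the server sends one JSON object per line, each carrying a fragment of text in `response` and a `done` flag on the last one. The helper below is a sketch of how those lines could be reassembled; the name `join_stream` is made up for this notebook, and the parsing assumes the line-delimited format just described.

```python
import json
from typing import Iterable


def join_stream(lines: Iterable[bytes]) -> str:
    """Concatenate the `response` fragments from a line-delimited JSON stream."""
    parts = []
    for raw in lines:
        if not raw.strip():
            continue  # skip keep-alive blank lines
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final object marks the end of generation
    return "".join(parts)


# Possible usage with requests (requires a running server):
# r = requests.post(
#     "http://localhost:11434/api/generate",
#     json={"model": "mistral", "prompt": "Hello", "stream": True},
#     stream=True,
# )
# text = join_stream(r.iter_lines())
```

To show text as it arrives instead of joining it, print each fragment inside the loop.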


## 5. Demo: Prompting & Evaluation

Below we send several prompts to the model and display each output.


In [5]:
import html

import requests
from IPython.display import display, HTML

prompts = [
    "Write a short, whimsical poem about a moonlit forest.",
    "Explain in simple terms how backpropagation works in neural networks.",
    "Given the sentence: 'The cat sat on the mat.', produce an alternative sentence with the same meaning but a different structure."
]

for p in prompts:
    # POST request to the local Ollama server
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "mistral", "prompt": p, "stream": False},
    )
    r.raise_for_status()  # fail loudly if the server returns an error status
    out = r.json()["response"]

    # Show the prompt and the wrapped response (both escaped for safe HTML display)
    display(HTML(
        f"<b>Prompt:</b> {html.escape(p)}<br>"
        f"<pre style='white-space: pre-wrap; word-wrap: break-word;'>{html.escape(out)}</pre>"
        "<hr>"
    ))




### Evaluation ideas
- Compare outputs from Mistral 7B with other models (if available) to observe differences in coherence and creativity.
- Critique hallucinations or factual errors.
- Try different prompting styles (formal, casual, etc.).
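
The first idea, comparing models, can be sketched as a small harness that sends the same prompt to several models and records the output and wall-clock time for each. The function name `compare_models` is made up for this notebook; it takes the generation function as an argument, so any callable with the signature `(prompt, model) -> str` will work.

```python
import time
from typing import Callable


def compare_models(
    prompt: str,
    models: list[str],
    generate: Callable[[str, str], str],
) -> list[dict]:
    """Run one prompt through several models and collect output and timing."""
    rows = []
    for model in models:
        start = time.perf_counter()
        output = generate(prompt, model)
        rows.append({
            "model": model,
            "seconds": time.perf_counter() - start,  # wall-clock generation time
            "output": output,
        })
    return rows
```

With the `ollama_chat` function from section 4, a call could look like `compare_models("Explain backpropagation simply.", ["mistral", other_model], ollama_chat)`, where `other_model` is the name of any second model you have pulled. Timing says nothing about quality, so read the outputs side by side as well.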



## 6. Limitations, best practices, and extensions

**Limitations / challenges**
- Laptop resource constraints (especially CPU-only).
- Quantization trade-offs: aggressive quantization may reduce quality.
- Latency on long prompts.

**Best practices**
- Keep prompts and context windows short where possible; this reduces memory use and latency.
- Adjust temperature / top_p to control creativity.
- Explore chain-of-thought prompting to improve reasoning.
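
Sampling and context settings can be passed per request through the `options` field of the `/api/generate` payload. The sketch below builds such a payload; the helper name `build_payload` and the default values are illustrative choices for this notebook, not recommendations.

```python
def build_payload(
    prompt: str,
    model: str = "mistral",
    temperature: float = 0.7,
    top_p: float = 0.9,
    num_ctx: int = 2048,
) -> dict:
    """Build a JSON payload for /api/generate with explicit sampling options."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {
            "temperature": temperature,  # lower values give more deterministic output
            "top_p": top_p,              # nucleus sampling cutoff
            "num_ctx": num_ctx,          # context window size in tokens
        },
    }
```

The result can be sent with `requests.post("http://localhost:11434/api/generate", json=build_payload("..."))`. Running the same prompt at a low and a high temperature is a quick way to see how much the setting changes the output.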

**Extensions**
- Experiment with tool calling.
- Fine-tune smaller adapters for specific domains.
- Combine with external knowledge sources in a retrieval-augmented generation (RAG) pipeline.
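
To make the retrieval-augmented idea concrete, here is a toy sketch: it ranks a handful of text snippets by word overlap with the question and places the best matches in the prompt. Word overlap is a deliberately simple stand-in for the embedding-based search a real pipeline would typically use, and all function names and the prompt template are made up for this notebook.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents sharing the most words with the question."""
    q = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]


def build_rag_prompt(question: str, documents: list[str], k: int = 2) -> str:
    """Build a prompt that asks the model to answer from retrieved context."""
    context = "\n".join(f"- {d}" for d in retrieve(question, documents, k))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

The resulting string can be passed to the model like any other prompt, for example with the `ollama_chat` function from section 4.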
