# Local Large Language Model with Ollama (Llama 3 8B)

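This notebook uses the Llama 3 8B model, served locally on your machine through Ollama. The 8B variant is the smaller of the two original Llama 3 sizes (8B and 70B), so it stays resource-efficient while remaining capable enough for the chat, extraction, and summarization tasks below.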

In [1]:
import os
import requests
import json
from typing import List, Dict, Any, Optional

OLLAMA_ENDPOINT = "http://localhost:11434"
MODEL_NAME = "llama3:8b" 

In [2]:
def check_ollama_status():
    """Print the locally available models and whether Llama 3 8B is among them."""
    try:
        # A timeout keeps the cell from hanging if the server does not answer
        response = requests.get(f"{OLLAMA_ENDPOINT}/api/tags", timeout=5)
    except requests.RequestException as e:
        print(f"Error connecting to Ollama: {e}")
        print("Make sure the Ollama service is running with: 'systemctl --user start ollama'")
        return

    if response.status_code != 200:
        print(f"Ollama responded with HTTP status {response.status_code}.")
        return

    models = response.json().get("models", [])
    if not models:
        print("Ollama is running, but no models were found. Run 'ollama pull llama3:8b' in the terminal.")
        return

    print("Ollama is running with the following models:")
    for model in models:
        print(f" - {model['name']}")

    # Check if Llama 3 8B is available
    if any(MODEL_NAME in m["name"].lower() for m in models):
        print("\n✓ Llama 3 8B model found")
    else:
        print("\n⚠ Llama 3 8B was not found. Run 'ollama pull llama3:8b' in the terminal.")

check_ollama_status()

Ollama is running with the following models:
 - german-mixtral:latest
 - mixtral:latest
 - llama3:8b
 - mannix/llama3.1-8b-abliterated:latest
 - llama3.1:latest
 - dolphin-llama3:latest

✓ Llama 3 8B model found
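
The availability check above matches on a substring of the model name. A minimal sketch of a stricter lookup follows, assuming the `/api/tags` payload has the shape implied by the listing above (a `models` list whose entries carry a `name`). `has_model` is a hypothetical helper, not part of Ollama, and treating an untagged name as `:latest` is an assumption based on how the names appear in the listing.

```python
from typing import Any, Dict


def has_model(tags_payload: Dict[str, Any], wanted: str) -> bool:
    """Return True if `wanted` is among the model names in a /api/tags payload.

    Assumption: a name without a tag (e.g. "llama3.1") refers to "<name>:latest".
    """
    if ":" not in wanted:
        wanted = f"{wanted}:latest"
    names = {m.get("name", "").lower() for m in tags_payload.get("models", [])}
    return wanted.lower() in names


# Hand-built sample payload based on the listing above (no network call needed)
sample_tags = {
    "models": [
        {"name": "mixtral:latest"},
        {"name": "llama3:8b"},
        {"name": "llama3.1:latest"},
    ]
}

print(has_model(sample_tags, "llama3:8b"))
print(has_model(sample_tags, "llama3.1"))
print(has_model(sample_tags, "llama3:70b"))
```

An exact match avoids false positives such as a custom model whose name merely contains `llama3:8b`.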


In [3]:
class Llama3Client:
    def __init__(self, base_url: str = OLLAMA_ENDPOINT, model: str = MODEL_NAME):
        self.base_url = base_url
        self.model = model
        
        # Default parameters optimized for llama3
        self.default_params = {
            "temperature": 0.7,
            "top_p": 0.9,
            "top_k": 40,
            "num_ctx": 4096  
        }
        
    def chat_completion(self, messages: List[Dict[str, str]], 
                         max_tokens: Optional[int] = None, 
                         stream: bool = False,
                         **kwargs):
        endpoint = f"{self.base_url}/api/chat"
        
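        # Merge per-call overrides (e.g. temperature=0.2) into the default options
        options = {**self.default_params, **kwargs}

        # Ollama calls the generation limit "num_predict"
        if max_tokens is not None:
            options["num_predict"] = max_tokens
        
        payload = {
            "model": self.model,
            "messages": messages,
            "stream": stream,
            "options": options
        }
            
        # With stream=True the caller gets an iterator over raw JSON lines;
        # otherwise the parsed JSON response is returned.
        response = requests.post(endpoint, json=payload, stream=stream)
        response.raise_for_status()  # surface HTTP errors instead of failing later on a missing key
        return response.iter_lines() if stream else response.json()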
    
    def get_token_count(self, messages: List[Dict[str, str]]):
        """Estimate the number of tokens in the messages"""
        # ~4 chars/token for llama3
        total_chars = sum(len(m["content"]) for m in messages)
        estimated_tokens = total_chars // 4
        return estimated_tokens
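
`get_token_count` relies on a rough characters-per-token heuristic. A minimal sketch of how such an estimate could be compared against the `num_ctx` setting before sending a request is shown here. `fits_in_context` and `reserve_for_reply` are hypothetical names introduced for illustration, and the 4 chars/token ratio is the same approximation as in the class, not a real tokenizer.

```python
from typing import Dict, List

CHARS_PER_TOKEN = 4  # same rough heuristic as get_token_count; not a real tokenizer


def estimate_tokens(messages: List[Dict[str, str]]) -> int:
    """Estimate the prompt size from the total character count."""
    return sum(len(m["content"]) for m in messages) // CHARS_PER_TOKEN


def fits_in_context(messages: List[Dict[str, str]], num_ctx: int = 4096,
                    reserve_for_reply: int = 200) -> bool:
    """True if the estimated prompt plus the reserved reply budget fits in num_ctx."""
    return estimate_tokens(messages) + reserve_for_reply <= num_ctx


short_chat = [{"role": "user", "content": "x" * 400}]     # about 100 tokens
long_chat = [{"role": "user", "content": "x" * 20_000}]   # about 5000 tokens

print(fits_in_context(short_chat))
print(fits_in_context(long_chat))
```

Because the estimate is coarse, it is safer to treat a result close to the limit as "probably too long" than to rely on the exact number.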

## Send a single message to the model


In [7]:
client = Llama3Client()

message_text = [
    {"role":"system","content":"You are an AI assistant that helps people find answers. You use the Llama 3 8B model and can answer complex questions."},
    {"role":"user","content":"Tell me what a flat white is and how it differs from other popular coffee beverages like cappuccino."},
]

completion = client.chat_completion(message_text, max_tokens=200)

print(completion["message"]["content"])

I'd be happy to help you with that!

A flat white is a type of coffee drink that originated in Australia and New Zealand. It's a double shot of espresso topped with a thin layer of microfoam (steamed milk that's been frothed to a consistency similar to whipped cream). The ratio of espresso to milk is typically around 1:3, which means the drink is strong on coffee flavor but still has a creamy texture.

Now, let's compare it to other popular coffee drinks like cappuccino:

* Cappuccino: A traditional Italian drink that consists of one-third espresso, one-third steamed milk, and one-third frothed milk. The key difference is the layering of the ingredients, with the frothed milk on top in a cappuccino, whereas in a flat white, the microfoam is more evenly distributed throughout the drink.
* Latte: A latte has a higher milk-to-coffee ratio than a flat white,


## Streaming responses

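Streaming returns the reply piece by piece as it is generated, so output appears immediately instead of after the whole completion has finished. This is most noticeable with longer responses.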

In [None]:
stream_resp = client.chat_completion(message_text, max_tokens=200, stream=True)

full_response = ""
for line in stream_resp:
    if line:
        chunk = json.loads(line)
        if chunk.get("message") and chunk["message"].get("content"):
            content = chunk["message"]["content"]
            full_response += content
            print(content, end="", flush=True)

Deutsche Telekom MMS, a subsidiary of Deutsche Telekom AG, is an Information Technology (IT) service provider offering managed services, cloud solutions, and business applications to customers. To understand what sets them apart, let's first consider the market landscape.

In the IT services sector, companies typically focus on one or more areas:

1. **Managed Services**: Proactively monitoring and managing customers' IT infrastructure, networks, and systems.
2. **Cloud Services**: Offering cloud-based solutions for storage, computing, and applications, often with a subscription-based model.
3. **Business Applications**: Providing software solutions for specific industries or business functions, such as enterprise resource planning (ERP) or customer relationship management (CRM).

Deutsche Telekom MMS differentiates itself from other IT service providers in several ways:

1. **Telecommunications heritage**: As a subsidiary of Deutsche Telekom AG, MMS benefits from the parent company's 
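
The streaming loop above can be exercised without a running server. A minimal sketch follows, assuming each streamed line is a JSON object with a `message.content` fragment and that the final object carries `"done": true`. The sample lines are hand-written for illustration, not captured output, and `collect_stream` is a hypothetical helper.

```python
import json
from typing import Iterable


def collect_stream(lines: Iterable[bytes]) -> str:
    """Concatenate the content fragments from newline-delimited JSON chunks."""
    parts = []
    for line in lines:
        if not line:            # iter_lines() can yield empty keep-alive lines
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):   # stop once the final chunk is marked as done
            break
    return "".join(parts)


# Hand-written stand-ins for what response.iter_lines() would yield
fake_lines = [
    b'{"message": {"role": "assistant", "content": "A flat "}, "done": false}',
    b"",
    b'{"message": {"role": "assistant", "content": "white"}, "done": false}',
    b'{"message": {"role": "assistant", "content": ""}, "done": true}',
]

print(collect_stream(fake_lines))  # prints: A flat white
```

Separating the parsing from the printing makes it possible to test the chunk handling offline and to reuse it for logging or post-processing.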

## Structured Outputs - Extracting JSON

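Llama 3 8B can usually follow a JSON schema given in the prompt, but it may still wrap the result in Markdown or add comments. The function below therefore cleans the reply before parsing and validating it.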

In [None]:
from pydantic import BaseModel
import json

class CalendarEvent(BaseModel):
    name: str
    date: str
    participants: list[str]

# Function to extract structured data with Llama 3 8B
def extract_structured_data(prompt, schema_class, temperature=0.2):
    # Lower temperature for more precise structured outputs
    schema_dict = schema_class.model_json_schema()
    schema_str = json.dumps(schema_dict, indent=2)
    
    # Optimized prompt for Llama 3 8B
    structured_prompt = f"""
    Extract the following information from the text and return it as valid JSON.
    Schema: {schema_str}
    
    Text: {prompt}
    
    Respond with a valid JSON object (without Markdown formatting or additional text).
    """
    
    # Request with optimal parameters for structured extraction
    messages = [
        {"role": "system", "content": "You are a precise assistant that extracts information as JSON."},
        {"role": "user", "content": structured_prompt},
    ]
    
    response = client.chat_completion(messages, temperature=temperature)
    
    # Extract JSON from the response
    response_text = response["message"]["content"]
    
    # If the response contains Markdown code blocks
    if "```json" in response_text:
        json_str = response_text.split("```json")[1].split("```")[0].strip()
    elif "```" in response_text:
        json_str = response_text.split("```")[1].strip()
    else:
        json_str = response_text.strip()
    
    # Remove any comments if present
    json_lines = [line for line in json_str.split('\n') if not line.strip().startswith('//')]
    clean_json = '\n'.join(json_lines)
    
    # Parse the JSON text into a Python object
    try:
        result = json.loads(clean_json)
    except json.JSONDecodeError as e:
        print(f"Error parsing JSON: {e}")
        print(f"Received text: {clean_json}")
        raise
    # Validate against the Pydantic model; raises pydantic.ValidationError
    # if a field is missing or has the wrong type
    return schema_class.model_validate(result)

# Example with more complex text for Llama 3 8B
event_text = """Tina Smith and Dr. Tony Miller have scheduled to meet for the annual lottery 
at the community center on December 12, 2024. The event starts at 3:00 PM and prizes will be raffled off."""

event = extract_structured_data(event_text, CalendarEvent)
print(event)
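
The fence-splitting logic above can be factored into a small helper and tested offline. A minimal sketch using only the standard library follows. `extract_json_object` is a hypothetical name, and the approach (take the text between the first `{` and the last `}`) is a simplification that assumes the reply contains exactly one JSON object. The sample reply is hand-written, not model output.

```python
import json
from typing import Any, Dict


def extract_json_object(reply: str) -> Dict[str, Any]:
    """Parse the span from the first '{' to the last '}' of a reply as JSON."""
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")
    return json.loads(reply[start:end + 1])


FENCE = "`" * 3  # Markdown code fence, built here to keep this example readable
payload = '{"name": "Annual lottery", "date": "2024-12-12", "participants": ["Tina Smith", "Dr. Tony Miller"]}'
fenced_reply = f"Here is the result:\n{FENCE}json\n{payload}\n{FENCE}"

event_dict = extract_json_object(fenced_reply)
print(event_dict["participants"])
```

The resulting dictionary can then be passed to the Pydantic model for validation, as in the function above.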

## Multi-turn Conversation with Llama 3 8B

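The chat endpoint is stateless: the model only sees what is in the `messages` list of each request. To hold a multi-turn dialogue, append every assistant reply to the list and resend the full history with the next question, as in the example below.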

In [None]:
# Function for multi-turn dialogue
def conversation(messages, print_response=True):
    response = client.chat_completion(messages)
    assistant_message = response["message"]
    
    if print_response:
        print(f"Assistant: {assistant_message['content']}\n")
    
    # Add message to the dialogue
    messages.append(assistant_message)
    return messages

# Start a new conversation
convo = [
    {"role": "system", "content": "You are a helpful assistant who can explain complex topics clearly."}
]

# First question
convo.append({"role": "user", "content": "What are the key differences between traditional neural networks and transformer models?"})
convo = conversation(convo)

# Follow-up question
convo.append({"role": "user", "content": "Can you give me an example where these differences matter in practice?"})
convo = conversation(convo)

# Another follow-up
convo.append({"role": "user", "content": "Which programming languages and libraries should I learn to work with these models?"})
convo = conversation(convo)
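
Because the whole message list is resent on every turn, the prompt grows with each exchange. A minimal sketch of trimming older turns to stay under a token budget while keeping the system message is shown here. `trim_history` and its `max_tokens` budget are hypothetical, and the estimate is the same rough 4 chars/token heuristic used in `get_token_count`.

```python
from typing import Dict, List


def estimate_tokens(messages: List[Dict[str, str]]) -> int:
    """Rough size estimate: about 4 characters per token."""
    return sum(len(m["content"]) for m in messages) // 4


def trim_history(messages: List[Dict[str, str]], max_tokens: int = 3000) -> List[Dict[str, str]]:
    """Drop the oldest non-system messages until the estimate fits the budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    # Always keep the most recent message, even if it alone exceeds the budget
    while len(rest) > 1 and estimate_tokens(system + rest) > max_tokens:
        rest.pop(0)
    return system + rest


history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "a" * 8000},        # about 2000 tokens
    {"role": "assistant", "content": "b" * 8000},   # about 2000 tokens
    {"role": "user", "content": "Follow-up question?"},
]

trimmed = trim_history(history, max_tokens=3000)
print([m["role"] for m in trimmed])
```

Dropping single messages can leave the history starting with an assistant turn; removing user/assistant pairs together is a possible refinement.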

## Advanced Application: Document Summarization

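Llama 3 8B can summarize documents, provided the text fits into the context window configured above (`num_ctx` = 4096 tokens). The function below warns when the estimated length gets close to that limit.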

In [None]:
def summarize_document(text, max_length=200, focus_points=None):
    """Summarize a longer text with Llama 3 8B"""
    
    # Estimate if the text fits within context
    estimated_tokens = len(text) // 4  # Rough estimation
    
    instructions = f"Summarize the following text in about {max_length} words."
    
    if focus_points:
        focus_str = ", ".join(focus_points)
        instructions += f" Focus particularly on these aspects: {focus_str}."
    
    if estimated_tokens > 3500:
        print("Warning: The text might be too long for the context window. The summary might be incomplete.")
    
    messages = [
        {"role": "system", "content": "You are an assistant that creates precise summaries, capturing the key points."},
        {"role": "user", "content": f"{instructions}\n\nDOCUMENT: {text}"}
    ]
    
    # Optimized parameters for summaries
    response = client.chat_completion(messages, temperature=0.3, top_p=0.95)
    return response["message"]["content"]

# Example: Summarize a longer text
sample_text = """
[This would be a longer text...]
Modern software development faces numerous challenges. Agile methods have become the standard, 
but their implementation varies greatly depending on company culture and project requirements. DevOps practices enable 
faster development cycles, but require close collaboration between development and operations teams...
[etc. for a longer text]
"""

summary = summarize_document(sample_text, focus_points=["Agile methods", "DevOps", "Collaboration"])
print(summary)
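
When a document exceeds the context window, one common workaround is to summarize it in pieces and then summarize the partial summaries. A minimal sketch of the splitting step only is shown here. `chunk_text` and `max_chunk_tokens` are hypothetical names, and the size estimate is again the rough 4 chars/token heuristic.

```python
from typing import List


def chunk_text(text: str, max_chunk_tokens: int = 3000, chars_per_token: int = 4) -> List[str]:
    """Split text into chunks of roughly max_chunk_tokens, breaking on blank lines."""
    max_chars = max_chunk_tokens * chars_per_token
    chunks: List[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars or not current:
            # Fits, or is a single oversized paragraph that is kept whole
            current = candidate
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


doc = "\n\n".join(["p" * 100] * 5)              # five 100-character paragraphs
pieces = chunk_text(doc, max_chunk_tokens=60)   # budget of 240 characters per chunk
print([len(p) for p in pieces])                 # prints: [202, 202, 100]
```

Each chunk could then be passed to `summarize_document`, and the partial summaries joined and summarized once more.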

## Creating a Custom Llama 3 8B Model

Here's how to create and use a custom version of Llama 3 8B:

In [None]:
# This cell only shows the contents of a Modelfile for a custom Llama 3 8B;
# nothing is created from the notebook. Run the steps in a terminal.

'''
# In terminal, create a file named Modelfile with these contents:

FROM llama3:8b

# Set model parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER num_ctx 4096

# Define system behavior
SYSTEM """
You are a helpful assistant who provides clear, accurate, and concise answers.
You prefer structured, well-organized responses and list important information in a clear format.
"""

# Then build it with:
ollama create custom-llama3-8b -f Modelfile
'''

print("Run the above commands in your terminal to create a custom Llama 3 8B model.")
print("After creating it, you can use it by changing MODEL_NAME to 'custom-llama3-8b' at the top of this notebook.")
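
The Modelfile can also be written from Python instead of by hand. A minimal sketch using only the standard library follows. `write_modelfile` is a hypothetical helper, it writes into a temporary directory for demonstration, and the `ollama create` step still has to be run in a terminal afterwards.

```python
import tempfile
from pathlib import Path


def write_modelfile(directory: Path, base_model: str = "llama3:8b",
                    system_prompt: str = "You are a helpful assistant.",
                    **parameters) -> Path:
    """Write a Modelfile with FROM, PARAMETER and SYSTEM instructions."""
    lines = [f"FROM {base_model}", ""]
    lines += [f"PARAMETER {key} {value}" for key, value in parameters.items()]
    lines += ["", f'SYSTEM """\n{system_prompt}\n"""', ""]
    path = directory / "Modelfile"
    path.write_text("\n".join(lines), encoding="utf-8")
    return path


with tempfile.TemporaryDirectory() as tmp:
    modelfile = write_modelfile(Path(tmp), temperature=0.7, top_p=0.9,
                                top_k=40, num_ctx=4096)
    content = modelfile.read_text(encoding="utf-8")
    print(content)
```

Generating the file this way keeps the parameters in one place, so the notebook defaults and the custom model do not drift apart.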