# LLM API Integration: From Basics to Advanced Applications

This notebook provides a comprehensive introduction to using Large Language Models (LLMs) via APIs in your applications. Before diving into complex agent-based systems, it's important to understand the fundamentals of API interaction with modern language models.

## Learning Objectives

1. Understand basic API calling patterns for OpenAI and other LLM providers
2. Learn about key parameters that control model behavior
3. Explore strategies for managing context length and efficient batching
4. Implement best practices for caching and cost optimization
5. Utilize advanced capabilities including tool calls and multimodal inputs
6. Apply these concepts to real-world applications

## Prerequisites

- Basic Python knowledge
- OpenAI API key (and optionally keys for other providers)
- Understanding of how to run a Jupyter notebook

## 1 - Environment Setup

First, let's set up our environment with the necessary packages. We'll be using:
- `openai`: For accessing OpenAI's models
- `anthropic`: (Optional) For Claude models
- `requests`: For making HTTP requests to other API providers
- `tiktoken`: For token counting
- `matplotlib`: For visualizations
- `pandas`: For data handling

In [None]:
import os
import getpass
import json
import time
from typing import List, Dict, Any, Optional, Union

# Install necessary packages if not already installed
try:
    import openai
    print(f"✅ openai {openai.__version__}")
except ImportError:
    print("⏳ Installing openai...")
    !pip install -q openai
    import openai
    print(f"✅ openai {openai.__version__}")

try:
    import tiktoken
    print(f"✅ tiktoken {tiktoken.__version__}")
except ImportError:
    print("⏳ Installing tiktoken...")
    !pip install -q tiktoken
    import tiktoken
    print(f"✅ tiktoken {tiktoken.__version__}")

try:
    import matplotlib.pyplot as plt
    print("✅ matplotlib")
except ImportError:
    print("⏳ Installing matplotlib...")
    !pip install -q matplotlib
    import matplotlib.pyplot as plt
    print("✅ matplotlib")

try:
    import pandas as pd
    print(f"✅ pandas {pd.__version__}")
except ImportError:
    print("⏳ Installing pandas...")
    !pip install -q pandas
    import pandas as pd
    print(f"✅ pandas {pd.__version__}")

✅ openai 1.96.1
✅ tiktoken 0.9.0
✅ matplotlib
✅ pandas 2.2.2


### Note: Setting Up Virtual Environments

To isolate a project's dependencies, create a virtual environment by running `python -m venv env` in your terminal. This creates a folder named `env` in your current directory.

Activate the environment with `.\env\Scripts\activate` on Windows or `source env/bin/activate` on macOS/Linux. Once activated, your terminal prompt will change to show the environment name. Install the required packages with `pip install openai tiktoken matplotlib pandas`. When you're finished working, you can deactivate the environment by simply typing `deactivate`.


## 2 - API Key Management

Most LLM providers require authentication through API keys. Let's set up our API keys safely:

In [None]:
# Setting up OpenAI API key
if not os.getenv("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("🔑 Enter your OpenAI API key: ")

# Initialize clients
from openai import OpenAI
client = OpenAI()  # Uses the OPENAI_API_KEY environment variable by default

🔑 Enter your OpenAI API key: ··········


### API Key Security Best Practices

When working with API keys, always follow these best practices:

1. **Never hardcode API keys** in your source code
2. **Don't commit API keys** to version control
3. Use **environment variables** for production
4. Set up **usage limits** in your API provider dashboard
5. **Rotate keys periodically** and when team members change

For local development, you can use `.env` files, but remember to add them to `.gitignore`.
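As a rough illustration of what loading a `.env` file involves, here is a minimal standard-library sketch. The helper name `load_env_file` is hypothetical, and the parser assumes simple `KEY=VALUE` lines with optional quotes. In practice, a dedicated package such as `python-dotenv` is the more common choice.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env", override: bool = False) -> dict:
    """Parse KEY=VALUE lines from a file and copy them into os.environ.

    Hypothetical helper for illustration; returns the pairs parsed from the file.
    """
    loaded = {}
    env_path = Path(path)
    if not env_path.exists():
        return loaded

    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip()
        value = value.strip().strip("'\"")
        # Keep values already set in the environment unless override is requested
        if override or key not in os.environ:
            os.environ[key] = value
        loaded[key] = value
    return loaded
```

Calling `load_env_file()` before creating the client would make a key stored in `.env` visible to `os.getenv("OPENAI_API_KEY")`.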

## 3 - Basic API Calling Patterns

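### OpenAI Chat Completions API (Single Prompt)

The most fundamental interaction with an LLM is generating text from a prompt. With the current OpenAI SDK this goes through the Chat Completions endpoint (`client.chat.completions.create`), with the prompt sent as a single user message: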

In [None]:
def basic_completion(prompt: str, model: str = "gpt-4o", temperature: float = 0.7) -> str:
    """Basic text completion using OpenAI's chat completion API."""
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature
        )
        return response.choices[0].message.content
    except Exception as e:
        return f"Error: {str(e)}"

# Example usage
prompt = "Explain quantum computing in 50 words or less."
result = basic_completion(prompt)
print(result)

Quantum computing harnesses quantum mechanics to process information using qubits, which can represent both 0 and 1 simultaneously. This enables parallel computations and complex problem-solving far beyond classical computers, potentially revolutionizing fields like cryptography, optimization, and drug discovery by offering exponential speed-ups for certain tasks.


### ChatGPT-style Conversation API

Most modern LLMs support conversation-style interactions with different roles. Here's how to implement a conversation:

In [None]:
def chat_conversation(messages: List[Dict[str, str]],
                      model: str = "gpt-3.5-turbo",
                      temperature: float = 0.7) -> str:
    """Multi-turn conversation using OpenAI's chat completion API.
       This is designed to send a complete conversation history to the API and get the next response."""
    try:
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            temperature=temperature
        )
        return response.choices[0].message.content
    except Exception as e:
        return f"Error: {str(e)}"

# Example conversation
conversation = [
    {"role": "system", "content": "You are a helpful assistant specializing in physics."},
    {"role": "user", "content": "What is the Higgs boson?"},
    {"role": "assistant", "content": "The Higgs boson is an elementary particle in the Standard Model of physics that gives mass to other particles through the Higgs mechanism."},
    {"role": "user", "content": "How was it discovered?"}
]

result = chat_conversation(conversation)
print(result)

The Higgs boson was discovered in 2012 at the Large Hadron Collider (LHC) at CERN. Scientists at CERN conducted experiments colliding protons at very high energies and analyzing the data to search for the signature of the Higgs boson. The discovery was confirmed by two independent experiments, ATLAS and CMS, which both observed a particle consistent with the predicted properties of the Higgs boson.


##### Conversation Structure

The chat API allows you to send multiple messages with different roles:
- `system`: Sets overall behavior/personality instructions
- `user`: Messages from the human
- `assistant`: Previous AI responses
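Because the API is stateless, the caller has to keep the message list and resend it on every turn. The sketch below shows one way to build that list. The helper names `start_conversation` and `add_turn` are hypothetical, and no API call is made here.

```python
from typing import Dict, List


def start_conversation(system_prompt: str) -> List[Dict[str, str]]:
    """Create a new history that begins with the system message."""
    return [{"role": "system", "content": system_prompt}]


def add_turn(messages: List[Dict[str, str]], role: str, content: str) -> List[Dict[str, str]]:
    """Return a new history with one more user or assistant message appended."""
    if role not in {"user", "assistant"}:
        raise ValueError("role must be 'user' or 'assistant'")
    # Build a new list so earlier snapshots of the history stay unchanged
    return messages + [{"role": role, "content": content}]


history = start_conversation("You are a helpful assistant specializing in physics.")
history = add_turn(history, "user", "What is the Higgs boson?")
history = add_turn(history, "assistant", "An elementary particle in the Standard Model.")
history = add_turn(history, "user", "How was it discovered?")
print([m["role"] for m in history])
```

The resulting `history` has the same shape as the `conversation` list passed to `chat_conversation` above.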

### Structured Output with JSON Mode

When you need the model to return structured data that can be directly parsed by your application:

In [None]:
def json_output(prompt: str, model: str = "gpt-3.5-turbo") -> Dict[str, Any]:
    """Get structured JSON output from the model."""
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"}
        )
        return json.loads(response.choices[0].message.content)
    except Exception as e:
        return {"error": str(e)}

# Example
product_prompt = "Create a JSON object for a product with name, price, category, and three features."
product_data = json_output(product_prompt)
print(json.dumps(product_data, indent=2))

{
  "product": {
    "name": "Laptop",
    "price": "$1000",
    "category": "Electronics",
    "features": {
      "screenSize": "15.6 inch",
      "memory": "8GB RAM",
      "storage": "256GB SSD"
    }
  }
}
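JSON mode ensures the reply parses as JSON, but it does not enforce a particular shape. In the sample output above, the fields are nested under a `product` key and `features` came back as an object rather than a list. A small check after parsing can catch this. The sketch below is one possible approach using only the standard library; the helper name `parse_and_check` is hypothetical.

```python
import json
from typing import Any, Dict, List


def parse_and_check(raw: str, required_keys: List[str]) -> Dict[str, Any]:
    """Parse a JSON string and report any missing top-level keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return {"ok": False, "error": f"invalid JSON: {exc.msg}", "data": None}
    if not isinstance(data, dict):
        return {"ok": False, "error": "top-level value is not an object", "data": data}
    missing = [key for key in required_keys if key not in data]
    if missing:
        return {"ok": False, "error": f"missing keys: {missing}", "data": data}
    return {"ok": True, "error": None, "data": data}


# The fields are nested under "product", so a top-level check fails
raw_reply = '{"product": {"name": "Laptop", "price": "$1000"}}'
print(parse_and_check(raw_reply, ["name", "price"])["error"])
# -> missing keys: ['name', 'price']
```

Spelling out the exact keys and types in the prompt, then validating as above, tends to make the result easier to consume downstream.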


### API Calling Patterns with Other Providers

Other LLM providers follow similar patterns, though with some syntax differences:

#### Anthropic (Claude)
```python
import anthropic
client = anthropic.Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))

response = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1000,
    messages=[
        {"role": "user", "content": "Explain quantum entanglement briefly."}
    ]
)
print(response.content[0].text)
```

#### DeepSeek
```python
import requests

headers = {
    "Authorization": f"Bearer {os.environ.get('DEEPSEEK_API_KEY')}",
    "Content-Type": "application/json"
}

data = {
    "model": "deepseek-chat",
    "messages": [
        {"role": "user", "content": "What are transformer models?"}
    ],
    "temperature": 0.7
}

response = requests.post("https://api.deepseek.com/v1/chat/completions",
                         headers=headers,
                         json=data)
result = response.json()
print(result["choices"][0]["message"]["content"])
```

These examples illustrate that while the specific syntax varies, the core pattern of sending a request with messages/prompts and receiving generated text in response remains consistent across providers.

## 4 - Model Parameters and Their Impact

Model parameters significantly affect the output quality, creativity, and behavior of LLMs. Let's explore the key parameters:

### Temperature

Temperature controls randomness in the output. Lower values (near 0) make responses more deterministic and focused, while higher values (near 1-2) increase creativity and randomness.

- **Low temperature (0.0-0.3)**: Good for factual Q&A, code generation, structured data extraction
- **Medium temperature (0.4-0.7)**: Balanced for general conversation and content creation
- **High temperature (0.8-1.0+)**: Creative writing, brainstorming, divergent thinking
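To see why temperature has this effect, it helps to look at the textbook formulation: the model's raw scores (logits) are divided by the temperature before the softmax. The sketch below uses made-up logits for three candidate tokens. It illustrates the general idea only and is not a description of any provider's exact sampling code.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after scaling by 1 / temperature."""
    if temperature <= 0:
        # A temperature of 0 is usually handled as greedy decoding rather than division
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum for numerical stability before exponentiating
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


example_logits = [2.0, 1.0, 0.5]  # illustrative values, not real model output
for t in (0.2, 0.7, 1.5):
    probs = softmax_with_temperature(example_logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

At low temperature nearly all probability mass lands on the top token, while higher temperatures flatten the distribution so less likely tokens are sampled more often.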

In [None]:
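def compare_temperatures(prompt: str, temperatures: Optional[List[float]] = None):
    """Compare model outputs at different temperature settings."""
    # Avoid a mutable default argument; fall back to a representative spread
    if temperatures is None:
        temperatures = [0.0, 0.5, 1.0, 1.5]
    results = {}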

    for temp in temperatures:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            temperature=temp,
            # seed=42  # Using seed for better comparison
        )
        results[f"Temperature {temp}"] = response.choices[0].message.content

    return results

# Example
creative_prompt = "Generate a slogan for a new electric car brand called 'Volt'"
temp_comparison = compare_temperatures(creative_prompt)

for temp, response in temp_comparison.items():
    print(f"\n{temp}:")
    print(f"{response}\n{'-' * 50}")


Temperature 0.0:
"Empower your drive with Volt - the future of electric mobility."
--------------------------------------------------

Temperature 0.5:
"Powering the future with Volt: Electrify your drive!"
--------------------------------------------------

Temperature 1.0:
"Experience the power and efficiency of Volt: driving into the future."
--------------------------------------------------

Temperature 1.5:
"Unleash the power of Volt: Electric driving, redefined."
--------------------------------------------------


### Top-p Sampling (Nucleus Sampling)

Top-p controls token diversity by sampling only from the smallest set of most likely tokens whose cumulative probability reaches `top_p`.

- **Low top_p (0.1-0.4)**: More focused and deterministic responses
- **Medium top_p (0.5-0.7)**: Balanced, somewhat diverse responses
- **High top_p (0.8-1.0)**: More varied, potentially more creative responses
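The filtering step behind nucleus sampling can be sketched in a few lines. The probabilities below are invented for illustration, and `nucleus_filter` is a hypothetical helper showing the general technique rather than any provider's implementation.

```python
from typing import Dict


def nucleus_filter(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the most likely tokens until their cumulative probability reaches top_p."""
    if not 0 < top_p <= 1:
        raise ValueError("top_p must be in the range (0, 1]")
    kept = {}
    cumulative = 0.0
    # Walk through tokens from most to least likely
    for token, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the surviving probabilities sum to 1
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}


next_token_probs = {"the": 0.5, "a": 0.3, "an": 0.15, "zebra": 0.05}
print(nucleus_filter(next_token_probs, top_p=0.7))
```

With `top_p=0.7`, only `"the"` and `"a"` survive, so unlikely tokens such as `"zebra"` can never be sampled.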

### Max Tokens

Limits the maximum length of the generated response. Important for:
- Controlling response length for specific formats
- Managing costs
- Preventing unnecessarily verbose outputs

### Frequency and Presence Penalties

Controls repetition in the model's output:

- **Frequency penalty**: Reduces the likelihood of repeating tokens that have already appeared frequently
- **Presence penalty**: Reduces the likelihood of repeating any token that has appeared before, regardless of frequency

Both penalties are most useful in long-form generation, where they discourage repetitive wording.
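The difference between the two penalties is easier to see numerically. OpenAI's documentation describes the adjustment as subtracting `count * frequency_penalty` plus a flat `presence_penalty` from the logit of every token that has already been generated. The sketch below applies that formula to made-up logits; the function name `apply_penalties` is hypothetical.

```python
from collections import Counter
from typing import Dict, List


def apply_penalties(logits: Dict[str, float], generated: List[str],
                    frequency_penalty: float = 0.0,
                    presence_penalty: float = 0.0) -> Dict[str, float]:
    """Lower the logits of tokens that already appear in the generated text."""
    counts = Counter(generated)
    adjusted = {}
    for token, logit in logits.items():
        count = counts[token]
        adjusted[token] = (
            logit
            - count * frequency_penalty                       # grows with each repetition
            - (1.0 if count > 0 else 0.0) * presence_penalty  # flat, applied once
        )
    return adjusted


example_logits = {"exercise": 3.0, "health": 2.5, "sleep": 2.0}  # illustrative values
so_far = ["exercise", "exercise", "health"]

print(apply_penalties(example_logits, so_far, frequency_penalty=1.0))
# -> {'exercise': 1.0, 'health': 1.5, 'sleep': 2.0}
print(apply_penalties(example_logits, so_far, presence_penalty=1.0))
# -> {'exercise': 2.0, 'health': 1.5, 'sleep': 2.0}
```

The frequency penalty hits `"exercise"` twice as hard because it appeared twice, while the presence penalty treats every already-seen token the same.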

In [None]:
def compare_penalties(prompt: str):
    """Compare model outputs with different frequency and presence penalties."""
    results = {}

    # No penalty
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
        max_tokens=150
    )
    results["No penalty"] = response.choices[0].message.content

    # Frequency penalty
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
        max_tokens=150,
        frequency_penalty=1.0
    )
    results["Frequency penalty 1.0"] = response.choices[0].message.content

    # Presence penalty
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
        max_tokens=150,
        presence_penalty=1.0
    )
    results["Presence penalty 1.0"] = response.choices[0].message.content

    return results

# Example
prompt = "List 5 benefits of regular exercise."
penalty_comparison = compare_penalties(prompt)

for setting, response in penalty_comparison.items():
    print(f"\n{setting}:")
    print(f"{response}\n{'-' * 50}")


No penalty:
1. Improved physical health: Regular exercise can help prevent chronic diseases such as heart disease, diabetes, and obesity. It also helps strengthen muscles and bones, improve flexibility, and increase overall fitness levels.

2. Mental health benefits: Exercise has been shown to have a positive impact on mental health by reducing symptoms of anxiety and depression, improving mood, and boosting self-esteem. It can also help alleviate stress and improve cognitive function.

3. Weight management: Regular exercise can help with weight loss or weight maintenance by burning calories and increasing metabolism. It also helps build muscle mass, which can further contribute to a healthy body composition.

4. Increased energy levels: Engaging in regular physical activity can help boost energy levels and combat feelings of fatigue. Exercise can improve
--------------------------------------------------

Frequency penalty 1.0:
1. Improved physical health: Regular exercise can help r

## 5 - Context Length Management

The context length is the total number of tokens the model can process in a single request, including both the input (prompt) and output (completion). Managing context effectively is crucial for:

1. **Cost optimization**: Fewer tokens = lower cost
2. **Response quality**: Most relevant context = better answers
3. **Performance**: Smaller contexts = faster responses

### Token Counting

Tokens are the basic units processed by LLMs - roughly 4 characters or 3/4 of a word in English:

In [None]:
def count_tokens(text: str, model: str = "gpt-3.5-turbo"):
    """Count the number of tokens in a text string for a specific model."""
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(text)
    return len(tokens)

# Examples
texts = [
    "Hello world",
    "This is a slightly longer sentence to demonstrate token counting.",
    """This is a much longer paragraph that contains multiple sentences.
    Token counting becomes more important as the length of text increases,
    especially when working with models that have context limits and
    when you're paying per token for API usage."""
]

for i, text in enumerate(texts):
    n_tokens = count_tokens(text)
    print(f"Text {i+1} ({len(text)} characters):")
    print(f"- {n_tokens} tokens")
    # Compare the actual count with the 4-characters-per-token rule of thumb
    print(f"- Approx. {n_tokens / (len(text) / 4):.2f}x the 4-chars-per-token estimate")

Text 1 (11 characters):
- 2 tokens
- Approx. 0.73x the 4-chars-per-token estimate
Text 2 (65 characters):
- 11 tokens
- Approx. 0.68x the 4-chars-per-token estimate
Text 3 (259 characters):
- 48 tokens
- Approx. 0.74x the 4-chars-per-token estimate


### Context Window Considerations

Different models have different context window sizes. For example:

- GPT-3.5 Turbo: 16K tokens
- GPT-4: 8K-128K tokens (depending on version)
- Claude 3 Opus: 200K tokens
- Mistral Large: 32K tokens

### Strategies for Managing Large Contexts

1. **Chunking**: Break large documents into smaller segments
2. **Summarization**: Generate summaries of lengthy content
3. **Selective Context**: Only include the most relevant context
4. **Sliding Window**: Process long documents in overlapping chunks

Let's implement a simple chunking strategy:

In [None]:
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100, model: str = "gpt-3.5-turbo"):
    """Split text into overlapping chunks of approximately chunk_size tokens."""
    # The step size is chunk_size - overlap, so it must be positive or the loop never advances
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(text)

    chunks = []
    i = 0
    while i < len(tokens):
        # Get chunk of tokens
        chunk_end = min(i + chunk_size, len(tokens))
        chunk_tokens = tokens[i:chunk_end]

        # Convert tokens back to text (named `piece` to avoid shadowing the function name)
        piece = encoding.decode(chunk_tokens)
        chunks.append(piece)

        # Move to next chunk with overlap
        i += (chunk_size - overlap)

    return chunks

# Example with a longer text
long_text = """[The Massachusetts Institute of Technology (MIT) is a private research university located in Cambridge, Massachusetts, United States. It was founded in 1861 in response to the increasing industrialization of the United States and adopted a European polytechnic university model, emphasizing laboratory instruction in applied science and engineering. Classes officially began in 1865, after delays caused by the American Civil War.
MIT is organized into five schools: the School of Architecture and Planning; the School of Engineering; the School of Humanities, Arts, and Social Sciences; the Sloan School of Management; and the School of Science. In 2022, MIT also launched the Schwarzman College of Computing to integrate computer science and AI across disciplines. The institute has approximately 4,600 undergraduate students and 7,300 graduate students. It maintains a low student-to-faculty ratio and is known for its rigorous academic programs.
MIT has long been a global leader in science, technology, and engineering education and research. Its faculty and alumni have been associated with numerous groundbreaking inventions and innovations, including radar, the digital computer, synthetic self-replicating molecules, and the development of the internet’s early architecture. As of 2024, MIT affiliates have won 100+ Nobel Prizes, 50+ National Medals of Science, and numerous other honors. The Institute places strong emphasis on entrepreneurship; companies founded by MIT alumni generate trillions of dollars in annual revenue and employ millions of people worldwide.
Research at MIT is supported by more than 30 departments and over 80 interdisciplinary labs and centers. Notable among them are the MIT Media Lab, the Computer Science and Artificial Intelligence Laboratory (CSAIL), the Lincoln Laboratory (a federally funded R&D center), and the Koch Institute for Integrative Cancer Research. MIT consistently ranks among the top universities in global research output and innovation metrics.
MIT's campus spans 168 acres along the Charles River and features a mix of historic and modern architecture, including buildings designed by noted architects such as Alvar Aalto, Frank Gehry, and I.M. Pei. Student life at MIT is known for being intense but collaborative, with a wide array of extracurricular opportunities, including athletics, music, entrepreneurship clubs, and the Independent Activities Period (IAP), a January term where students can pursue short courses, projects, or travel.
Admissions to MIT are highly competitive. For undergraduate admission, MIT does not consider legacy status or demonstrated interest and has a need-blind admissions policy for all applicants, including international students. Financial aid is entirely need-based, and MIT guarantees to meet the full demonstrated financial need of admitted students. In 2023, over 70% of undergraduates received financial aid, and the median annual debt at graduation was zero.
MIT’s motto, “Mens et Manus” — Latin for “Mind and Hand” — reflects its educational philosophy of integrating theoretical knowledge with practical application. The Institute remains at the forefront of addressing global challenges through education, research, and innovation, in fields ranging from climate change and clean energy to AI, biotechnology, and quantum computing.]"""

chunks = chunk_text(long_text, chunk_size=100, overlap=20)
print(f"Split into {len(chunks)} chunks with ~100 tokens each and 20 tokens overlap")

# Print the first two chunks
print("\nChunk 1:")
print(f"{chunks[0]}")
print("\nChunk 2:")
print(f"{chunks[1]}")
print("\nNotice the overlap between the chunks")


Split into 8 chunks with ~100 tokens each and 20 tokens overlap

Chunk 1:
[The Massachusetts Institute of Technology (MIT) is a private research university located in Cambridge, Massachusetts, United States. It was founded in 1861 in response to the increasing industrialization of the United States and adopted a European polytechnic university model, emphasizing laboratory instruction in applied science and engineering. Classes officially began in 1865, after delays caused by the American Civil War.
MIT is organized into five schools: the School of Architecture and Planning; the School of Engineering; the School of Humanities

Chunk 2:
 into five schools: the School of Architecture and Planning; the School of Engineering; the School of Humanities, Arts, and Social Sciences; the Sloan School of Management; and the School of Science. In 2022, MIT also launched the Schwarzman College of Computing to integrate computer science and AI across disciplines. The institute has approximately 4,60
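Once a document is chunked, the selective-context strategy listed above amounts to sending only the chunks that matter for the question. The sketch below ranks chunks by simple keyword overlap with the query. This is a stand-in chosen to keep the example dependency-free; production systems more commonly rank by embedding similarity. The helper name `select_relevant_chunks` is hypothetical.

```python
import re
from typing import List


def select_relevant_chunks(chunks: List[str], query: str, top_k: int = 2) -> List[str]:
    """Pick up to top_k chunks that share the most distinct words with the query."""
    query_words = set(re.findall(r"\w+", query.lower()))
    scored = []
    for index, chunk in enumerate(chunks):
        chunk_words = set(re.findall(r"\w+", chunk.lower()))
        score = len(query_words & chunk_words)
        if score > 0:  # ignore chunks with no overlap at all
            scored.append((score, index))
    # Highest overlap first; ties fall back to document order
    best = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:top_k]
    # Return the winners in document order so the context reads naturally
    return [chunks[index] for _, index in sorted(best, key=lambda pair: pair[1])]


sample_chunks = [
    "MIT was founded in 1861.",
    "The campus spans 168 acres.",
    "Admissions are need-blind.",
]
print(select_relevant_chunks(sample_chunks, "When was MIT founded?"))
# -> ['MIT was founded in 1861.']
```

The selected chunks can then be placed in the prompt in place of the full document, which keeps the request well inside the context window.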

## 6 - Caching and Cost Optimization

API calls to LLMs can be expensive. Implementing caching helps reduce costs and improve application performance.

### Simple In-Memory Cache

In [None]:
class LLMCache:
    """Simple in-memory cache for LLM responses."""

    def __init__(self, max_size: int = 1000):
        self.cache = {}
        self.max_size = max_size

    def get_key(self, model: str, messages: List[Dict[str, str]], temperature: float):
        """Create a cache key from request parameters."""
        # Convert messages to a hashable form
        msg_str = json.dumps(messages, sort_keys=True)
        return f"{model}|{msg_str}|{temperature}"

    def get(self, model: str, messages: List[Dict[str, str]], temperature: float):
        """Retrieve a cached response if it exists."""
        key = self.get_key(model, messages, temperature)
        return self.cache.get(key)

    def set(self, model: str, messages: List[Dict[str, str]], temperature: float, response: str):
        """Cache a response."""
        if len(self.cache) >= self.max_size:
            # Simple eviction strategy: drop the oldest entry (dicts preserve insertion order)
            self.cache.pop(next(iter(self.cache)))

        key = self.get_key(model, messages, temperature)
        self.cache[key] = response

# Cached completion function
cache = LLMCache()

def cached_completion(messages: List[Dict[str, str]], model: str = "gpt-3.5-turbo", temperature: float = 0.7):
    """Completion function with caching."""
    # Check cache first (compare against None so an empty-string response still counts as a hit)
    cached_response = cache.get(model, messages, temperature)
    if cached_response is not None:
        print("Cache hit! ⚡️")
        return cached_response

    print("Cache miss, calling API... 🌐")
    # Make API call
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temperature
    )
    result = response.choices[0].message.content

    # Cache the result
    cache.set(model, messages, temperature, result)

    return result

# Example usage
print("First call:")
response1 = cached_completion([{"role": "user", "content": "What's the capital of France?"}])
print(f"Response: {response1[:50]}...")

print("\nSecond call (should be cached):")
response2 = cached_completion([{"role": "user", "content": "What's the capital of France?"}])
print(f"Response: {response2[:50]}...")

First call:
Cache miss, calling API... 🌐
Response: The capital of France is Paris....

Second call (should be cached):
Cache hit! ⚡️
Response: The capital of France is Paris....


### Production Caching Strategies

For production applications, consider using:

1. **Redis**: Fast, distributed in-memory cache
2. **DynamoDB/MongoDB**: For persistent caching across restarts
3. **Semantic caching**: Cache similar but not identical queries
4. **Time-based expiration**: Refresh cached entries after a set period
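Time-based expiration can be prototyped in memory before moving to Redis or a database. The sketch below stores a timestamp next to each value and treats entries older than `ttl_seconds` as missing. The class name `TTLCache` and the injectable `clock` parameter are assumptions made for this example; the clock is injectable mainly so the behavior can be tested without waiting.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """In-memory cache whose entries expire after ttl_seconds (illustrative sketch)."""

    def __init__(self, ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        """Return the cached value, or None if it is missing or expired."""
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self.clock() - stored_at > self.ttl_seconds:
            # Entry is stale: drop it so the caller fetches a fresh response
            del self._store[key]
            return None
        return value

    def set(self, key: str, value: Any) -> None:
        """Store a value together with the time it was cached."""
        self._store[key] = (self.clock(), value)
```

The key could be the same string that `LLMCache.get_key` builds, so the two approaches combine naturally.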

### Cost Optimization Tips

1. **Use the right model size**: Smaller models are cheaper and often sufficient
2. **Optimize prompts**: Clear, concise prompts reduce token usage
3. **Implement rate limiting**: Prevent accidental overuse
4. **Monitor usage**: Track consumption patterns for optimization
5. **Batch processing**: Process multiple requests at once when possible
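Rate limiting from the list above can start as a small client-side guard. The sketch below allows at most `max_calls` requests within any window of `period_seconds`. The class name and parameters are hypothetical, and real limits should follow whatever quotas your provider documents.

```python
import time
from collections import deque
from typing import Callable, Deque


class SlidingWindowRateLimiter:
    """Allow at most max_calls within any period_seconds window (illustrative sketch)."""

    def __init__(self, max_calls: int, period_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.max_calls = max_calls
        self.period_seconds = period_seconds
        self.clock = clock
        self._calls: Deque[float] = deque()

    def allow(self) -> bool:
        """Return True and record the call if it fits in the current window."""
        now = self.clock()
        # Forget calls that have aged out of the window
        while self._calls and now - self._calls[0] >= self.period_seconds:
            self._calls.popleft()
        if len(self._calls) < self.max_calls:
            self._calls.append(now)
            return True
        return False
```

A caller would check `limiter.allow()` before each API request and either wait or skip the request when it returns `False`.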

## 7 - Function/Tool Calling

One of the most powerful features of modern LLMs is their ability to use tools or functions to interact with external systems. This enables them to:

1. Access real-time information
2. Perform calculations
3. Take actions in the real world
4. Generate structured outputs

Let's implement a simple function calling example using OpenAI:

In [None]:
def get_weather(location: str, unit: str = "celsius"):
    """Simulate getting weather information for a location."""
    # In a real application, this would call a weather API
    import random
    temp = random.randint(0, 35)
    conditions = random.choice(["sunny", "cloudy", "rainy", "snowy"])

    if unit == "fahrenheit":
        temp = temp * 9/5 + 32

    return {
        "location": location,
        "temperature": temp,
        "unit": unit,
        "conditions": conditions
    }

def get_current_time(timezone: str = "UTC"):
    """Get the current time in a specific timezone."""
    from datetime import datetime
    # Simplified implementation without actual timezone handling
    return {"current_time": datetime.now().strftime("%Y-%m-%d %H:%M:%S"), "timezone": timezone}

# Define the function specifications for the LLM
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name, e.g., 'San Francisco'"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["location"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "get_current_time",
            "description": "Get the current time in a specific timezone",
            "parameters": {
                "type": "object",
                "properties": {
                    "timezone": {
                        "type": "string",
                        "description": "Timezone, e.g., 'UTC', 'EST'"
                    }
                }
            }
        }
    }
]

# Function to handle the AI assistant with function calling
def assistant_with_tools(user_message: str):
    """Process user messages and handle function calls as needed."""
    messages = [{"role": "user", "content": user_message}]

    # First, get the model's response with potential function calls
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=messages,
        tools=tools,
        tool_choice="auto"  # Let the model decide when to use functions
    )

    response_message = response.choices[0].message
    messages.append(response_message)  # Add response to conversation history

    # Check if the model wants to call a function
    if response_message.tool_calls:
        # Process each function call
        for tool_call in response_message.tool_calls:
            function_name = tool_call.function.name
            function_args = json.loads(tool_call.function.arguments)

            # Execute the function
            if function_name == "get_weather":
                function_response = get_weather(**function_args)
            elif function_name == "get_current_time":
                function_response = get_current_time(**function_args)
            else:
                # Report unknown tools back to the model instead of skipping them
                function_response = {"error": f"Unknown function: {function_name}"}

            # Every tool call needs a matching tool message, or the follow-up request is rejected
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "name": function_name,
                "content": json.dumps(function_response)
            })

        # Get a new response from the model after function call
        second_response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=messages
        )

        return second_response.choices[0].message.content
    else:
        # If no function was called, return the initial response
        return response_message.content

# Test the assistant with function calling
queries = [
    "What's the weather like in Tokyo?",
    "What time is it in London?",
    "I'm planning a trip to Paris next week, should I pack a raincoat?"
]

for query in queries:
    print(f"\nUser: {query}")
    response = assistant_with_tools(query)
    print(f"Assistant: {response}")


User: What's the weather like in Tokyo?
Assistant: The weather in Tokyo is currently 8 degrees Celsius and sunny.

User: What time is it in London?
Assistant: The current time in London is 18:18 (6:18 PM) on July 17, 2025.

User: I'm planning a trip to Paris next week, should I pack a raincoat?
Assistant: It looks like the weather in Paris next week will be snowy with temperatures around 33°C. I would recommend packing a warm coat or jacket rather than a raincoat to prepare for the snow. Safe travels!
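As the number of tools grows, the `if`/`elif` chain can be replaced by a lookup table. The sketch below is one possible shape for that dispatch step. `dispatch_tool_call` and `echo_city` are hypothetical names, and the function only covers local execution, not the API round trip.

```python
import json
from typing import Any, Callable, Dict


def dispatch_tool_call(name: str, arguments_json: str,
                       registry: Dict[str, Callable[..., Any]]) -> str:
    """Run a registered tool and always return a JSON string for the tool message."""
    func = registry.get(name)
    if func is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        kwargs = json.loads(arguments_json or "{}")
        result = func(**kwargs)
    except (json.JSONDecodeError, TypeError) as exc:
        # Covers malformed JSON and arguments that do not match the function signature
        return json.dumps({"error": f"bad arguments for {name}: {exc}"})
    return json.dumps(result)


def echo_city(location: str) -> dict:
    """Stand-in tool used only for this example."""
    return {"location": location}


registry = {"echo_city": echo_city}
print(dispatch_tool_call("echo_city", '{"location": "Tokyo"}', registry))
# -> {"location": "Tokyo"}
print(dispatch_tool_call("get_stock_price", "{}", registry))
# -> {"error": "unknown tool: get_stock_price"}
```

Returning an error payload instead of raising lets the model see what went wrong and respond accordingly. Note that catching `TypeError` this broadly would also hide a `TypeError` raised inside the tool itself, which may or may not be what you want.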


## 8 - Practical Application: Simple Question-Answering System

Let's put everything together to build a simple but effective question-answering system that demonstrates best practices:

In [None]:
class SimpleQASystem:
    """A simple question-answering system with caching and error handling."""

    def __init__(self, model="gpt-3.5-turbo"):
        self.model = model
        self.client = OpenAI()
        self.cache = LLMCache()  # Reusing our cache from earlier
        self.conversation_history = []
        self.system_message = {"role": "system", "content": "You are a helpful assistant that provides concise, accurate answers."}

    def count_tokens(self, text):
        """Count tokens for the model."""
        try:
            encoding = tiktoken.encoding_for_model(self.model)
            return len(encoding.encode(text))
        except Exception:
            # Fallback approximation (about 1.3 tokens per word), kept as an integer
            return int(len(text.split()) * 1.3)

    def trim_conversation(self, max_tokens=3000):
        """Trim conversation history to fit within token limit."""
        # Always keep system message and last user question
        essential_msgs = [self.system_message, self.conversation_history[-1]]
        essential_tokens = sum(self.count_tokens(msg["content"]) for msg in essential_msgs)

        available_tokens = max_tokens - essential_tokens
        trimmed_history = []

        # Add as many previous messages as will fit
        for msg in reversed(self.conversation_history[:-1]):
            msg_tokens = self.count_tokens(msg["content"])
            if msg_tokens <= available_tokens:
                trimmed_history.insert(0, msg)
                available_tokens -= msg_tokens
            else:
                break

        return [self.system_message] + trimmed_history + [self.conversation_history[-1]]

    def answer_question(self, question, temperature=0.7):
        """Answer a user question with caching and error handling."""
        # Add question to history
        self.conversation_history.append({"role": "user", "content": question})

        # Prepare messages with trimming if needed
        messages = self.trim_conversation()

        # Check cache
        cached = self.cache.get(self.model, messages, temperature)
        if cached is not None:
            print("Using cached response")
            answer = cached
        else:
            print(f"Calling {self.model} API...")
            try:
                # Make API call with error handling and a request timeout
                start_time = time.time()
                response = self.client.chat.completions.create(
                    model=self.model,
                    messages=messages,
                    temperature=temperature,
                    timeout=30,  # seconds
                )
                answer = response.choices[0].message.content
                elapsed = time.time() - start_time
                print(f"API call completed in {elapsed:.2f}s")

                # Cache the result
                self.cache.set(self.model, messages, temperature, answer)

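            except Exception as e:
                print(f"API Error: {e}")
                # Drop the unanswered question so neither it nor the error text
                # is sent to the model as history on later calls
                self.conversation_history.pop()
                return f"I'm sorry, I encountered an error: {e}"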

        # Add answer to conversation history
        self.conversation_history.append({"role": "assistant", "content": answer})
        return answer

    def reset_conversation(self):
        """Reset the conversation history."""
        self.conversation_history = []

# Example usage
qa_system = SimpleQASystem()

# Test with a few questions
questions = [
    "What is the difference between supervised and unsupervised learning?",
    "Can you give me a simple example of each?",
    "What would be a good use case for reinforcement learning?"
]

for q in questions:
    print(f"\nQ: {q}")
    answer = qa_system.answer_question(q)
    print(f"A: {answer}")


Q: What is the difference between supervised and unsupervised learning?
Calling gpt-3.5-turbo API...
API call completed in 1.13s
A: Supervised learning is a type of machine learning where the model is trained on labeled data, meaning it is given input data along with the correct output. The model learns to map input data to the correct output based on the labeled examples.

Unsupervised learning, on the other hand, is a type of machine learning where the model is trained on unlabeled data. The model learns to find patterns and structures in the data without explicit guidance on the correct output.

Q: Can you give me a simple example of each?
Calling gpt-3.5-turbo API...
API call completed in 1.58s
A: Sure!

Supervised learning example: 
Predicting housing prices based on features like size, location, and number of bedrooms. The model is trained on a dataset where each house's features are accompanied by its actual selling price.

Unsupervised learning example:
Clustering customer dat
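
To make the trimming step in `trim_conversation` concrete, here is a minimal standalone sketch of the same newest-first budgeting idea. The names `trim_messages` and `approx_tokens` are hypothetical, and the counter simply counts whitespace-separated words instead of calling `tiktoken`, so the numbers are illustrative only.

```python
from typing import Dict, List

Message = Dict[str, str]


def approx_tokens(text: str) -> int:
    """Hypothetical stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def trim_messages(system: Message, history: List[Message], max_tokens: int) -> List[Message]:
    """Keep the system message and the newest message, then add older
    messages newest-first for as long as the remaining budget allows."""
    if not history:
        return [system]
    latest = history[-1]
    budget = max_tokens - approx_tokens(system["content"]) - approx_tokens(latest["content"])
    kept: List[Message] = []
    for msg in reversed(history[:-1]):
        cost = approx_tokens(msg["content"])
        if cost > budget:
            break  # stop at the first message that does not fit, so the kept history has no gaps
        kept.append(msg)
        budget -= cost
    kept.reverse()  # restore chronological order
    return [system] + kept + [latest]


system = {"role": "system", "content": "Be concise."}  # 2 words
history = [
    {"role": "user", "content": "one two three four five six"},  # 6 words
    {"role": "assistant", "content": "alpha beta gamma"},        # 3 words
    {"role": "user", "content": "final question here"},          # 3 words
]

trimmed = trim_messages(system, history, max_tokens=9)
print([m["content"] for m in trimmed])
```

With a budget of 9, the system message and the latest question use 5, leaving 4: the three-word assistant reply fits, while the older six-word question does not and is dropped.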

## 10 - Conclusion and Best Practices

Congratulations! You've learned the essential skills for integrating LLMs into your applications via APIs. Here's a summary of best practices to remember:

### Technical Best Practices

1. **Use the right model for the task**: Balance cost vs. capability
2. **Set appropriate parameters**: Adjust temperature based on the task's creative needs
3. **Implement caching**: Reduce costs and improve responsiveness
4. **Manage context efficiently**: Be mindful of token limits
5. **Handle errors gracefully**: Always implement timeouts and fallbacks
6. **Use structured outputs when possible**: Function calling or JSON mode for predictable formats
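
As one possible reading of "timeouts and fallbacks" in point 5, the sketch below retries a call with exponential backoff and then falls back to a second model. The helper `call_with_retries`, the model names, and `fake_call` are hypothetical and exist only for illustration; a real `call` would wrap your provider's client and catch its specific error types.

```python
import time
from typing import Callable, List, Optional, Sequence


def call_with_retries(
    call: Callable[[str], str],
    models: Sequence[str],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try each model in order, retrying with exponential backoff before falling back."""
    last_error: Optional[Exception] = None
    for model in models:
        for attempt in range(max_attempts):
            try:
                return call(model)
            except Exception as exc:  # assumption: narrow this to provider errors in real code
                last_error = exc
                if attempt < max_attempts - 1:
                    sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
    raise RuntimeError(f"All models failed: {last_error}") from last_error


# Demo with a fake provider: the primary model always times out
attempts: List[str] = []
delays: List[float] = []


def fake_call(model: str) -> str:
    attempts.append(model)
    if model == "primary-model":
        raise TimeoutError("simulated timeout")
    return f"answer from {model}"


result = call_with_retries(
    fake_call,
    ["primary-model", "fallback-model"],
    max_attempts=2,
    sleep=delays.append,  # record delays instead of actually sleeping
)
print(result)
print(attempts)
print(delays)
```

Passing `sleep` as a parameter keeps the demo instant and makes the backoff schedule easy to inspect.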

### Ethical and Security Considerations

1. **Never expose API keys**: Use environment variables or secure vaults
2. **Monitor usage and costs**: Set up billing alerts and rate limits
3. **Implement content filtering**: Consider using moderation endpoints
4. **Provide clear user guidance**: Set expectations about AI-generated content
5. **Consider privacy implications**: Be careful with user data sent to APIs
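
For point 1, a minimal sketch of reading a key from the environment and masking it before it reaches any log. The helpers `load_api_key` and `mask_key` are hypothetical, and the key in the demo is fake; `OPENAI_API_KEY` is the variable the OpenAI client reads by default.

```python
import os
from typing import Mapping, Optional


def load_api_key(var_name: str = "OPENAI_API_KEY", env: Optional[Mapping[str, str]] = None) -> str:
    """Read an API key from the environment and fail loudly if it is missing."""
    source = os.environ if env is None else env
    key = source.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"{var_name} is not set; export it or load it from a secrets manager.")
    return key


def mask_key(key: str, visible: int = 4) -> str:
    """Return a log-safe form of the key that shows only the last few characters."""
    if len(key) <= visible:
        return "*" * len(key)
    return "*" * (len(key) - visible) + key[-visible:]


# Demo with a fake key passed in explicitly, so the real environment is untouched
demo_env = {"OPENAI_API_KEY": "sk-demo-1234567890"}
key = load_api_key(env=demo_env)
print(mask_key(key))
```

Failing at startup when the variable is missing tends to be easier to debug than an authentication error on the first request.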

### Next Steps

Now that you understand the basics of LLM API integration, you're ready to explore more advanced concepts like:

1. Building sophisticated AI agents that can take actions
2. Implementing RAG (Retrieval Augmented Generation) systems
3. Fine-tuning models for your specific use cases
4. Developing hybrid systems that combine rule-based and ML approaches

The companion notebook "Agentic AI Lab" explores how to build an AI agent that can make strategic decisions in a simulated environment, applying many of the concepts you've learned here.