# Section 5: Chat Inference with Hugging Face Providers

## Objectives
- Perform chat inference using Hugging Face models
- Compare provider performance for conversational AI
- Measure latency and token throughput
- Implement streaming responses

## Requirements
- Python 3.10+
- PyTorch 2.6.0+
- CUDA 12.4+ (optional)
- Hugging Face API token

In [None]:
# Install required packages
# Quote the version specifiers so the shell does not treat ">" as output redirection
!pip install -q "huggingface_hub>=0.20.0" "torch>=2.6.0" pandas matplotlib

In [None]:
import os
from getpass import getpass
from huggingface_hub import InferenceClient
import time
import pandas as pd
import matplotlib.pyplot as plt
from typing import List, Dict, Generator
import json

print("✓ Libraries imported")

## Authentication

In [None]:
# Load token securely
HF_TOKEN = os.getenv("HF_TOKEN")

if not HF_TOKEN:
    HF_TOKEN = getpass("Enter your Hugging Face token: ")

assert HF_TOKEN, "Token required"
print("✓ Token loaded")

In [None]:
# Initialize client
client = InferenceClient(token=HF_TOKEN)

# Model for chat
CHAT_MODEL = "meta-llama/Llama-3.2-3B-Instruct"

print(f"✓ Using model: {CHAT_MODEL}")

## Exercise 1: Basic Chat Completion

**What you'll practice:** Perform basic chat inference using Hugging Face models.

This exercise introduces you to conversational AI:
- Setting up chat completion requests with message history
- Understanding the message format (system, user, assistant roles)
- Measuring response latency
- Handling chat completion responses

You'll learn the fundamental pattern for all conversational AI interactions, which forms the basis for more advanced features like multi-turn conversations and streaming responses.
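
The message format described above can be checked locally before a request is sent. The following is a minimal sketch, not part of the Hugging Face API: `validate_messages` and `VALID_ROLES` are hypothetical names, and the rules it enforces (non-empty string content, a system message only in first position) are assumptions that match the examples in this notebook rather than requirements documented by any provider.

```python
from typing import Dict, List

# Roles used in this notebook; assumed, not an exhaustive list for every provider.
VALID_ROLES = {"system", "user", "assistant"}


def validate_messages(messages: List[Dict[str, str]]) -> None:
    """Raise ValueError if a message list does not match the expected chat format."""
    if not messages:
        raise ValueError("messages must contain at least one entry")

    for idx, msg in enumerate(messages):
        role = msg.get("role")
        content = msg.get("content")

        if role not in VALID_ROLES:
            raise ValueError(f"message {idx}: unknown role {role!r}")
        if not isinstance(content, str) or not content.strip():
            raise ValueError(f"message {idx}: content must be a non-empty string")
        # Assumption: a system prompt, if present, must come first.
        if role == "system" and idx != 0:
            raise ValueError(f"message {idx}: system message must be first")


# Passes silently for a well-formed conversation.
validate_messages([
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain machine learning in simple terms."},
])
```

Catching a malformed list locally avoids spending a network round trip on a request that would be rejected anyway.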

In [None]:
def chat_completion(
    messages: List[Dict[str, str]],
    model: str = CHAT_MODEL,
    max_tokens: int = 500,
    temperature: float = 0.7
) -> tuple[str, float]:
    """
    Perform chat completion.
    
    Args:
        messages: List of message dicts with 'role' and 'content'
        model: Model to use
        max_tokens: Maximum tokens to generate
        temperature: Sampling temperature
    
    Returns:
        Tuple of (response_text, latency_seconds)
    """
    start = time.perf_counter()
    
    response = client.chat_completion(
        messages=messages,
        model=model,
        max_tokens=max_tokens,
        temperature=temperature
    )
    
    latency = time.perf_counter() - start
    response_text = response.choices[0].message.content
    
    return response_text, latency

In [None]:
# Simple chat example
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain machine learning in simple terms."}
]

print("Sending chat request...")
response, latency = chat_completion(messages)

print(f"\n✓ Response received in {latency:.2f}s")
print(f"\nAssistant: {response}")

## Exercise 2: Multi-Turn Conversation

**What you'll practice:** Build conversations that maintain context across multiple exchanges.

This exercise teaches you to:
- Maintain conversation history across multiple turns
- Format messages correctly for context preservation
- Handle conversation state management
- Create natural dialogue flows

You'll learn how to build chatbots and assistants that remember previous parts of the conversation, enabling more natural and useful interactions.
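
Because every turn is appended to the message list, long conversations eventually grow past what a model can accept. One way to manage conversation state is to keep the system prompt and only the most recent exchanges. The sketch below is illustrative: `trim_history` and `max_turns` are hypothetical names, and counting messages instead of tokens is a simplifying assumption.

```python
from typing import Dict, List


def trim_history(messages: List[Dict[str, str]], max_turns: int = 4) -> List[Dict[str, str]]:
    """Keep system messages plus a recent window of user/assistant messages."""
    system = [m for m in messages if m["role"] == "system"]
    dialogue = [m for m in messages if m["role"] != "system"]

    # One turn is a user message plus its reply, so keep at most 2 * max_turns entries.
    keep = dialogue[-2 * max_turns:] if max_turns > 0 else []

    # Drop leading replies so the kept window starts with a user message.
    while keep and keep[0]["role"] != "user":
        keep = keep[1:]

    return system + keep


history = [
    {"role": "system", "content": "You are an expert in artificial intelligence."},
    {"role": "user", "content": "What is deep learning?"},
    {"role": "assistant", "content": "(reply 1)"},
    {"role": "user", "content": "How does it differ from traditional machine learning?"},
    {"role": "assistant", "content": "(reply 2)"},
    {"role": "user", "content": "Can you give a practical example?"},
]

trimmed = trim_history(history, max_turns=2)
print([m["role"] for m in trimmed])  # ['system', 'user', 'assistant', 'user']
```

The trade-off is that the model loses access to anything outside the window, so early details of the conversation may need to be restated or summarized.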

In [None]:
class ChatSession:
    """Manage a multi-turn conversation"""
    
    def __init__(self, system_prompt: str = "You are a helpful assistant."):
        self.messages = [
            {"role": "system", "content": system_prompt}
        ]
        self.latencies = []
    
    def send(self, user_message: str) -> str:
        """Send a message and get response"""
        self.messages.append({"role": "user", "content": user_message})
        
        response, latency = chat_completion(self.messages)
        self.latencies.append(latency)
        
        self.messages.append({"role": "assistant", "content": response})
        
        return response
    
    def get_history(self) -> List[Dict]:
        """Get conversation history"""
        return [m for m in self.messages if m["role"] != "system"]
    
    def print_conversation(self):
        """Print formatted conversation"""
        for msg in self.get_history():
            role = msg["role"].capitalize()
            print(f"\n{role}: {msg['content']}")
            print("-" * 70)

In [None]:
# Create a chat session
session = ChatSession("You are an expert in artificial intelligence.")

# Multi-turn conversation
questions = [
    "What is deep learning?",
    "How does it differ from traditional machine learning?",
    "Can you give a practical example?"
]

for i, question in enumerate(questions, 1):
    print(f"\n[Turn {i}] User: {question}")
    response = session.send(question)
    print(f"Assistant: {response[:200]}...")
    print(f"Latency: {session.latencies[-1]:.2f}s")

print(f"\n✓ Conversation completed")
print(f"Average latency: {sum(session.latencies) / len(session.latencies):.2f}s")

## Exercise 3: Provider Performance Comparison

**What you'll practice:** Compare different providers' performance for chat inference.

This exercise demonstrates how to:
- Benchmark multiple providers with the same prompts
- Measure latency and approximate token throughput (tokens per second)
- Analyze performance differences between providers
- Make informed decisions about provider selection

You'll learn to systematically evaluate different AI providers so you can choose the best one for your use case. This benchmark measures speed only; cost and output quality need to be assessed separately.
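
When comparing latency, the mean alone can hide slow outliers, so it helps to look at percentiles as well. The following is a small sketch using the nearest-rank method; `percentile` and `summarize_latencies` are hypothetical helper names, and the sample values are made up for illustration, not measurements from any provider.

```python
import math
from typing import Dict, List


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def summarize_latencies(latencies: List[float]) -> Dict[str, float]:
    """Summarize a list of latencies (seconds) with mean, median, p95, and max."""
    return {
        "mean": sum(latencies) / len(latencies),
        "p50": percentile(latencies, 50),
        "p95": percentile(latencies, 95),
        "max": max(latencies),
    }


# Illustrative values only: four fast requests and one slow outlier.
sample = [0.8, 0.9, 1.0, 1.1, 4.0]
summary = summarize_latencies(sample)
print(f"p50={summary['p50']}, p95={summary['p95']}")  # p50=1.0, p95=4.0
```

In this example the median stays at 1.0 s while the mean is pulled up by the single slow request, which is why both are worth reporting.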

In [None]:
def benchmark_chat_providers(
    test_messages: List[List[Dict]],
    providers: List[str] = ["auto"],
    num_runs: int = 3
) -> pd.DataFrame:
    """
    Benchmark chat performance across providers.
    
    Args:
        test_messages: List of message lists to test
        providers: Providers to test
        num_runs: Runs per test case
    
    Returns:
        DataFrame with results
    """
    results = []
    
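    for provider in providers:
        print(f"\nTesting provider: {provider}")
        print("-" * 50)
        
        # Use a client bound to this provider. The global `client` would send
        # every request the same way, making `provider` only a label.
        # Note: the `provider` argument requires a recent huggingface_hub release.
        provider_client = InferenceClient(provider=provider, token=HF_TOKEN)
        
        for msg_idx, messages in enumerate(test_messages, 1):
            for run in range(1, num_runs + 1):
                try:
                    print(f"  Test {msg_idx}/{len(test_messages)}, Run {run}...", end=" ")
                    
                    start = time.perf_counter()
                    result = provider_client.chat_completion(
                        messages=messages,
                        model=CHAT_MODEL,
                        max_tokens=500,
                        temperature=0.7
                    )
                    latency = time.perf_counter() - start
                    response = result.choices[0].message.content
                    tokens = len(response.split())  # Word count as a rough proxy for tokens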
                    
                    results.append({
                        'provider': provider,
                        'test_case': msg_idx,
                        'run': run,
                        'latency_sec': latency,
                        'tokens': tokens,
                        'tokens_per_sec': tokens / latency if latency > 0 else 0,
                        'status': 'success'
                    })
                    
                    print(f"✓ {latency:.2f}s, {tokens} tokens")
                    
                except Exception as e:
                    results.append({
                        'provider': provider,
                        'test_case': msg_idx,
                        'run': run,
                        'latency_sec': None,
                        'tokens': None,
                        'tokens_per_sec': None,
                        'status': f'failed: {type(e).__name__}'
                    })
                    print(f"✗ {type(e).__name__}")
                
                time.sleep(1)
    
    return pd.DataFrame(results)

In [None]:
# Define test cases
test_cases = [
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Python?"}
    ],
    [
        {"role": "system", "content": "You are a coding expert."},
        {"role": "user", "content": "Explain list comprehensions in Python."}
    ],
    [
        {"role": "system", "content": "You are a data science tutor."},
        {"role": "user", "content": "What is the difference between supervised and unsupervised learning?"}
    ]
]

# Run benchmark
print("Starting chat benchmark...")
chat_benchmark_df = benchmark_chat_providers(
    test_messages=test_cases,
    providers=["auto"],
    num_runs=3
)

print("\n✓ Benchmark completed")

In [None]:
# Analyze results
successful = chat_benchmark_df[chat_benchmark_df['status'] == 'success']

if len(successful) > 0:
    stats = successful.groupby('provider').agg({
        'latency_sec': ['mean', 'std', 'min', 'max'],
        'tokens_per_sec': ['mean', 'std']
    }).round(2)
    
    print("\nProvider Statistics:")
    print("=" * 70)
    print(stats)
    
    # Visualize
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
    
    # Latency
    successful.boxplot(column='latency_sec', by='provider', ax=ax1)
    ax1.set_title('Latency Distribution')
    ax1.set_ylabel('Seconds')
    
    # Throughput
    successful.boxplot(column='tokens_per_sec', by='provider', ax=ax2)
    ax2.set_title('Token Throughput')
    ax2.set_ylabel('Tokens/Second')
    
    plt.suptitle('')
    plt.tight_layout()
    plt.show()
else:
    print("No successful results to analyze")

## Exercise 4: Streaming Responses

In [None]:
def chat_stream(
    messages: List[Dict[str, str]],
    model: str = CHAT_MODEL
) -> Generator[str, None, None]:
    """
    Stream chat completion tokens.
    
    Yields:
        Token strings as they arrive
    """
    stream = client.chat_completion(
        messages=messages,
        model=model,
        stream=True,
        max_tokens=500
    )
    
    for chunk in stream:
        # Skip chunks that carry no choices or no text content
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content

In [None]:
# Test streaming
messages = [
    {"role": "system", "content": "You are a creative writer."},
    {"role": "user", "content": "Write a short story about a robot learning to paint."}
]

print("Streaming response:\n")
print("Assistant: ", end="", flush=True)

full_response = ""
start_time = time.perf_counter()

for token in chat_stream(messages):
    print(token, end="", flush=True)
    full_response += token

elapsed = time.perf_counter() - start_time

print(f"\n\n✓ Streamed in {elapsed:.2f}s")
print(f"✓ Approximate word count: {len(full_response.split())}")
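
For interactive use, the time until the first chunk arrives often matters more than the total time. The sketch below wraps any iterable of text chunks and records both values. `timed_stream` and the `stats` dictionary keys are hypothetical names, and `fake_stream` is a stand-in for `chat_stream(messages)` so the example runs without a network call.

```python
import time
from typing import Dict, Iterable, Iterator


def timed_stream(chunks: Iterable[str], stats: Dict[str, float]) -> Iterator[str]:
    """Yield chunks unchanged while recording first-chunk and total elapsed time in `stats`."""
    # The clock starts when iteration begins, which is also when a lazy
    # generator such as chat_stream() would actually issue its request.
    start = time.perf_counter()
    stats["chunks"] = 0
    for chunk in chunks:
        if stats["chunks"] == 0:
            stats["time_to_first_chunk"] = time.perf_counter() - start
        stats["chunks"] += 1
        yield chunk
    stats["total_time"] = time.perf_counter() - start


def fake_stream() -> Iterator[str]:
    """Stand-in for chat_stream(messages); emits a few pieces with a short delay."""
    for piece in ["Once ", "upon ", "a ", "time"]:
        time.sleep(0.01)
        yield piece


stats: Dict[str, float] = {}
text = "".join(timed_stream(fake_stream(), stats))

print(text)
print(f"First chunk after {stats['time_to_first_chunk']:.3f}s, "
      f"total {stats['total_time']:.3f}s over {stats['chunks']} chunks")
```

In the notebook, `fake_stream()` would be replaced by `chat_stream(messages)`; the wrapper does not change what is yielded.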

## Exercise 5: Error Handling and Retry Logic

In [None]:
def chat_with_retry(
    messages: List[Dict],
    max_retries: int = 3,
    backoff_factor: float = 2.0
) -> tuple:
    """
    Chat completion with exponential backoff retry.
    
    Args:
        messages: Chat messages
        max_retries: Maximum retry attempts
        backoff_factor: Multiplier for delay
    
    Returns:
        Tuple of (response, latency, attempts)
    """
    for attempt in range(1, max_retries + 1):
        try:
            print(f"Attempt {attempt}/{max_retries}...", end=" ")
            response, latency = chat_completion(messages)
            print("✓ Success")
            return response, latency, attempt
            
        except Exception as e:
            print(f"✗ Failed: {type(e).__name__}")
            
            if attempt < max_retries:
                delay = backoff_factor ** (attempt - 1)
                print(f"  Retrying in {delay:.1f}s...")
                time.sleep(delay)
            else:
                # Chain the original exception so its traceback is preserved
                raise RuntimeError(f"Failed after {max_retries} attempts: {e}") from e

In [None]:
# Test retry logic
messages = [
    {"role": "user", "content": "What is the capital of France?"}
]

try:
    response, latency, attempts = chat_with_retry(messages)
    print(f"\nResponse: {response}")
    print(f"Latency: {latency:.2f}s")
    print(f"Attempts: {attempts}")
except Exception as e:
    print(f"\nAll retries failed: {e}")
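
Two common refinements to the retry pattern are retrying only on errors that are likely to be transient, and adding random jitter so that many clients do not retry at the same moment. The following is a sketch under stated assumptions: `retry_with_jitter` is a hypothetical helper, and the built-in `ConnectionError` and `TimeoutError` are placeholders for whatever exceptions your client actually raises on transient failures.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_jitter(
    func: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 1.0,
    backoff_factor: float = 2.0,
    retry_on: Tuple[Type[Exception], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func(), retrying only on `retry_on` errors with jittered exponential backoff."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")

    for attempt in range(1, max_retries + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_retries:
                raise  # Out of attempts: surface the last error unchanged
            max_delay = base_delay * backoff_factor ** (attempt - 1)
            # Full jitter: wait a random duration between 0 and max_delay.
            sleep(random.uniform(0, max_delay))

    raise RuntimeError("unreachable")  # The loop always returns or raises


# Simulated call that fails twice, then succeeds.
attempts = {"count": 0}


def flaky_call() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated transient failure")
    return "ok"


result = retry_with_jitter(flaky_call, base_delay=0.01)
print(result, attempts["count"])  # ok 3
```

Errors outside `retry_on`, such as a `ValueError` from a malformed request, propagate immediately instead of being retried. In this notebook the helper could wrap a request as `retry_with_jitter(lambda: chat_completion(messages))`, with `retry_on` adjusted to the exceptions observed in practice.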

## Summary

### Key Takeaways
1. ✅ Performed chat completions with Hugging Face models
2. ✅ Managed multi-turn conversations
3. ✅ Benchmarked provider performance
4. ✅ Implemented streaming responses
5. ✅ Added robust error handling

### Best Practices
- Maintain conversation context in messages list
- Use streaming for better UX in interactive apps
- Implement retry logic with exponential backoff
- Monitor token throughput and latency
- Handle errors gracefully

### Next Steps
- Complete the **Provider Benchmarking** notebook
- Proceed to **Section 6: Local Inference Endpoints**
- Build a production chat application

In [None]:
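# Cleanup: remove the token from the notebook namespace.
# The `client` object may still hold the token, so remove it as well.
for _name in ("HF_TOKEN", "client"):
    globals().pop(_name, None)
print("✓ Cleanup completed")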