[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kgweber-cwru/coding-with-ai-wn26/blob/main/week-6-best-practices/concepts.ipynb)

# Week 6: Best Practices and Production Patterns

## Learning Objectives
- Implement robust error handling
- Optimize costs and performance
- Test LLM applications effectively
- Design for production deployment
- Consider security and privacy

In [6]:
import os
import sys
from pathlib import Path

IN_COLAB = "google.colab" in sys.modules

if IN_COLAB:
    !pip install -q google-genai google-auth google-api-core python-dotenv numpy chromadb
    from google.colab import auth
    auth.authenticate_user()
    try:
        PROJECT_ID = input("Enter your Google Cloud Project ID (press Enter to use default ADC): ").strip()
    except Exception:
        PROJECT_ID = ""
    if PROJECT_ID:
        os.environ["GOOGLE_CLOUD_PROJECT"] = PROJECT_ID
else:
    def find_service_account_json(max_up=6):
        p = Path.cwd()
        for _ in range(max_up):
            candidate = p / "series-2-coding-llms" / "creds"
            if candidate.exists():
                for f in candidate.glob("*.json"):
                    return str(f.resolve())
            candidate2 = p / "creds"
            if candidate2.exists():
                for f in candidate2.glob("*.json"):
                    return str(f.resolve())
            p = p.parent
        return None

    sa_path = find_service_account_json()
    if sa_path:
        os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = sa_path
    else:
        try:
            from dotenv import load_dotenv
            load_dotenv()
        except Exception:
            pass


In [7]:
import os
import time
from dotenv import load_dotenv
from google import genai
from google.genai import types
import google.auth
from google.api_core import exceptions

creds, project = google.auth.default()
project = os.environ.get("GOOGLE_CLOUD_PROJECT", project)
client = genai.Client(vertexai=True, project=project, location="us-central1")
print(f"Using project: {project}")

print("✅ Environment loaded successfully!")

Using project: coding-with-ai-wn-26
✅ Environment loaded successfully!


## Part 1: Error Handling

In [8]:
from google.genai import errors  # google-genai raises errors.APIError on HTTP failures

def robust_api_call(contents, max_retries=3, **kwargs):
    """Make an API call, retrying failed attempts with backoff"""

    for attempt in range(max_retries):
        last_attempt = attempt == max_retries - 1
        try:
            response = client.models.generate_content(
                model=kwargs.get('model', 'gemini-2.5-flash-lite'),
                contents=contents,
                config=types.GenerateContentConfig(
                    temperature=kwargs.get('temperature', 0.7),
                    max_output_tokens=kwargs.get('max_output_tokens', None)
                )
            )
            return response.text

        except (errors.APIError, exceptions.GoogleAPICallError) as e:
            # Both exception families expose the HTTP status as `.code`; 429 means rate limited
            if getattr(e, "code", None) == 429:
                if last_attempt:
                    return "Error: Rate limit exceeded. Please try again later."
                wait_time = 2 ** attempt  # Exponential backoff: 1s, 2s, 4s, ...
                print(f"Rate limit hit. Waiting {wait_time}s...")
                time.sleep(wait_time)
            else:
                print(f"API error: {e}")
                if last_attempt:
                    return f"Error: API unavailable - {e}"
                time.sleep(1)

        except Exception as e:
            # Anything else is unexpected, so do not retry
            print(f"Unexpected error: {e}")
            return f"Error: {e}"

    return "Error: Max retries exceeded"

# Test it
result = robust_api_call("Say hello!")
print(result)

Hello! How can I help you today?
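
The retry loop above waits `2 ** attempt` seconds between rate-limited attempts. A common refinement, sketched below, caps the wait and adds random jitter so that many clients do not retry in lockstep. The helper name `backoff_delays` and its parameters are hypothetical, not part of the notebook or of any SDK.

```python
import random

def backoff_delays(max_retries=3, base=1.0, cap=30.0, rng=None):
    """Return one wait time (seconds) per retry: exponential, capped, with full jitter.

    Hypothetical helper for illustration only.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        ceiling = min(cap, base * 2 ** attempt)  # 1s, 2s, 4s, ... up to `cap`
        delays.append(rng.uniform(0, ceiling))   # "full jitter": anywhere in [0, ceiling]
    return delays

# A seeded generator is used only so the example is repeatable
print(backoff_delays(max_retries=4, rng=random.Random(0)))
```

Each value could replace `wait_time` in the loop; the trade-off is a slightly less predictable wait in exchange for spreading retries out.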


## Part 2: Cost Management

Pricing changes often and, even for a given model, depends on usage volume and on the type of thinking used. Here's Google's price list:

https://ai.google.dev/gemini-api/docs/pricing

In [11]:
class CostTracker:
    """Track API costs across calls"""
    
    # Approximate USD per 1M tokens; check the price list above for current rates
    PRICING = {
        "gemini-2.5-flash-lite": {"input": 0.10, "output": 0.40},
        "gemini-2.5-pro": {"input": 1.25, "output": 10.00},
        "gemini-embedding-001": {"input": 0.15, "output": 0},  # embeddings have no output tokens
    }
    
    def __init__(self):
        self.calls = []
        self.total_cost = 0.0
    
    def track_completion(self, response, model="gemini-2.5-flash-lite"):
        """Track a completion call"""
        if not response.usage_metadata:
            return 0.0

        # Token counts can be None (e.g. when no candidate is returned), so default to 0
        input_tokens = response.usage_metadata.prompt_token_count or 0
        output_tokens = response.usage_metadata.candidates_token_count or 0
        
        # Default pricing if model not found
        pricing = self.PRICING.get(model, self.PRICING["gemini-2.5-flash-lite"])

        cost = (
            (input_tokens / 1_000_000) * pricing["input"] +
            (output_tokens / 1_000_000) * pricing["output"]
        )
        
        self.calls.append({
            "model": model,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "cost": cost
        })
        
        self.total_cost += cost
        return cost
    
    def report(self):
        """Generate cost report"""
        if not self.calls:
            return "No calls tracked"
        
        total_input = sum(c['input_tokens'] for c in self.calls)
        total_output = sum(c['output_tokens'] for c in self.calls)
        
        return f"""Cost Report:
Total calls: {len(self.calls)}
Total input tokens: {total_input:,}
Total output tokens: {total_output:,}
Total cost: ${self.total_cost:.8f}
Average cost per call: ${self.total_cost/len(self.calls):.8f}"""

# Test it
tracker = CostTracker()

for i in range(3):
    response = client.models.generate_content(
        model="gemini-2.5-flash-lite",
        contents=f"Count to {i+1}"
    )
    cost = tracker.track_completion(response)
    print(f"Call {i+1} cost: ${cost:.8f}")

print("\n" + tracker.report())

Call 1 cost: $0.00000080
Call 2 cost: $0.00000200
Call 3 cost: $0.00000360

Cost Report:
Total calls: 3
Total input tokens: 12
Total output tokens: 13
Total cost: $0.00000640
Average cost per call: $0.00000213
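
As a check on the report above, the same formula can be applied by hand to the totals it prints (12 input tokens, 13 output tokens) at the approximate `gemini-2.5-flash-lite` rates from `PRICING`. The function name `estimate_cost` is a hypothetical stand-in for the arithmetic inside `track_completion`.

```python
def estimate_cost(input_tokens, output_tokens, input_rate, output_rate):
    """Cost in USD, with both rates quoted per 1M tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# (12 * 0.10 + 13 * 0.40) / 1,000,000 = 6.4 / 1,000,000
total = estimate_cost(12, 13, input_rate=0.10, output_rate=0.40)
print(f"${total:.8f}")  # → $0.00000640
```

The result matches the "Total cost" line of the report, and most of it comes from output tokens, which are priced four times higher than input tokens for this model.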


## Part 3: Testing Strategies

In [12]:
class LLMTestSuite:
    """Test suite for LLM applications"""
    
    def __init__(self):
        self.tests = []
        self.results = []
    
    def add_test(self, name, prompt, expected_contains=None, expected_not_contains=None):
        """Add a test case"""
        self.tests.append({
            "name": name,
            "prompt": prompt,
            "expected_contains": expected_contains or [],
            "expected_not_contains": expected_not_contains or []
        })
    
    def run_tests(self, temperature=0):
        """Run all tests"""
        self.results = []  # Reset so repeated runs don't inflate the pass count
        print(f"Running {len(self.tests)} tests...\n")
        
        for test in self.tests:
            response = client.models.generate_content(
                model="gemini-2.5-flash-lite",
                contents=test["prompt"],
                config=types.GenerateContentConfig(temperature=temperature)
            )
            
            # response.text can be None (e.g. a blocked or empty response)
            output = (response.text or "").lower()
            
            # Check expectations
            passed = True
            failures = []
            
            for expected in test["expected_contains"]:
                if expected.lower() not in output:
                    passed = False
                    failures.append(f"Missing: {expected}")
            
            for not_expected in test["expected_not_contains"]:
                if not_expected.lower() in output:
                    passed = False
                    failures.append(f"Should not contain: {not_expected}")
            
            result = {
                "name": test["name"],
                "passed": passed,
                "output": output,
                "failures": failures
            }
            
            self.results.append(result)
            
            status = "✓ PASS" if passed else "✗ FAIL"
            print(f"{status}: {test['name']}")
            if failures:
                for f in failures:
                    print(f"  - {f}")
        
        # Summary
        passed_count = sum(1 for r in self.results if r['passed'])
        print(f"\n{passed_count}/{len(self.tests)} tests passed")

# Example tests
suite = LLMTestSuite()

suite.add_test(
    "Math calculation",
    "What is 15 + 27? Reply with only the number.",
    expected_contains=["42"]
)

suite.add_test(
    "Medical terminology",
    "Translate 'hypertension' to plain language in one sentence.",
    expected_contains=["blood pressure", "high"],
    expected_not_contains=["hypertension"]  # Should use plain language
)

suite.run_tests()

Running 2 tests...

✓ PASS: Math calculation
✗ FAIL: Medical terminology
  - Should not contain: hypertension

1/2 tests passed
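
The failing test above shows how brittle substring checks can be: a plain-language explanation that names the term it is defining still trips `expected_not_contains`. One option, sketched here, is to pull the checks into a pure function so they can be unit-tested against canned outputs without calling the API. The name `check_output` and the sample strings are hypothetical.

```python
def check_output(output, expected_contains=(), expected_not_contains=()):
    """Return a list of failure messages; an empty list means the output passed."""
    text = output.lower()
    failures = [f"Missing: {s}" for s in expected_contains if s.lower() not in text]
    failures += [
        f"Should not contain: {s}" for s in expected_not_contains if s.lower() in text
    ]
    return failures

# Canned outputs stand in for model responses (made up for this example)
print(check_output("42", expected_contains=["42"]))
print(check_output(
    "Hypertension means high blood pressure.",
    expected_contains=["blood pressure", "high"],
    expected_not_contains=["hypertension"],
))
```

With the checks isolated like this, it is easier to decide whether a failure reflects a bad response or an expectation that is too strict.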


## Part 4: Caching for Cost Reduction

The class below shows one way to cache responses yourself.

Most current LLM APIs also offer built-in caching that you can take advantage of.

Gemini models cache implicitly (minimum request size: 1024 tokens for `flash`, 4096 for `pro`). The response object's `usage_metadata` field reports how many tokens were cache hits.

Detailed documentation on Gemini's explicit caching capability is here:

https://ai.google.dev/gemini-api/docs/caching?lang=python

In [14]:
import hashlib
import json

class CachedLLM:
    """LLM with response caching"""
    
    def __init__(self):
        self.cache = {}
        self.hits = 0
        self.misses = 0
    
    def _make_key(self, contents, **kwargs):
        """Create cache key from request"""
        key_dict = {
            "contents": contents,
            "model": kwargs.get('model', 'gemini-2.5-flash-lite'),
            "temperature": kwargs.get('temperature', 0.7)
        }
        # Handle string or list contents for hashing
        key_str = json.dumps(key_dict, sort_keys=True, default=str)
        return hashlib.md5(key_str.encode()).hexdigest()
    
    def complete(self, contents, **kwargs):
        """Complete with caching"""
        key = self._make_key(contents, **kwargs)
        
        if key in self.cache:
            self.hits += 1
            print("  [Cache hit]")
            return self.cache[key]
        
        self.misses += 1
        print("  [Cache miss - calling API]")
        
        response = client.models.generate_content(
            model=kwargs.get('model', 'gemini-2.5-flash-lite'),
            contents=contents,
            config=types.GenerateContentConfig(
                temperature=kwargs.get('temperature', 0.7)
            )
        )
        
        result = response.text
        self.cache[key] = result
        return result
    
    def stats(self):
        total = self.hits + self.misses
        hit_rate = self.hits / total if total > 0 else 0
        return f"Cache stats: {self.hits} hits, {self.misses} misses ({hit_rate:.1%} hit rate)"

# Test caching
cached_llm = CachedLLM()

# Same request multiple times
for i in range(3):
    print(f"\nRequest {i+1}:")
    result = cached_llm.complete(
        "What is the capital of France?",
        temperature=0
    )

print("\n" + cached_llm.stats())


Request 1:
  [Cache miss - calling API]

Request 2:
  [Cache hit]

Request 3:
  [Cache hit]

Cache stats: 2 hits, 1 misses (66.7% hit rate)
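
The in-memory cache above is separate from Gemini's implicit caching. To see whether implicit caching applied to a call, the response's `usage_metadata` can be inspected. The sketch below assumes the field is named `cached_content_token_count`, as in the google-genai SDK at the time of writing. The helper `cached_fraction` is hypothetical, and a stand-in object replaces a real response so the example runs offline.

```python
from types import SimpleNamespace

def cached_fraction(response):
    """Share of prompt tokens served from cache (0.0 when nothing is reported)."""
    usage = getattr(response, "usage_metadata", None)
    prompt = getattr(usage, "prompt_token_count", None) or 0
    cached = getattr(usage, "cached_content_token_count", None) or 0
    return cached / prompt if prompt else 0.0

# Stand-in for a real response; the token counts are made up
fake = SimpleNamespace(
    usage_metadata=SimpleNamespace(prompt_token_count=2048, cached_content_token_count=1024)
)
print(f"{cached_fraction(fake):.0%} of prompt tokens were cache hits")
```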


## Part 5: Input Validation and Safety

In [15]:
class SafeLLMApp:
    """LLM application with input validation"""
    
    MAX_INPUT_LENGTH = 2000  # characters
    MAX_TOKENS = 500
    
    @staticmethod
    def validate_input(user_input):
        """Validate user input"""
        if not user_input or not user_input.strip():
            return False, "Input cannot be empty"
        
        if len(user_input) > SafeLLMApp.MAX_INPUT_LENGTH:
            return False, f"Input too long (max {SafeLLMApp.MAX_INPUT_LENGTH} chars)"
        
        # Add other validation as needed
        # - Check for malicious patterns
        # - Filter PII if required
        # - Check for prompt injection attempts
        
        return True, "OK"
    
    @staticmethod
    def sanitize_output(output):
        """Clean up LLM output"""
        # Remove potential PII or sensitive info
        # Format output consistently
        # Add disclaimers if needed
        return output.strip()
    
    @staticmethod
    def safe_query(user_input, system_message="You are a helpful assistant."):
        """Process query safely"""
        # Validate
        valid, message = SafeLLMApp.validate_input(user_input)
        if not valid:
            return {"error": message}
        
        try:
            # Call API with limits
            response = client.models.generate_content(
                model="gemini-2.5-flash-lite",
                contents=user_input,
                config=types.GenerateContentConfig(
                    system_instruction=system_message,
                    max_output_tokens=SafeLLMApp.MAX_TOKENS,
                    temperature=0.7
                )
            )
            
            output = response.text
            return {"response": SafeLLMApp.sanitize_output(output)}
            
        except Exception as e:
            return {"error": f"Processing failed: {str(e)}"}

# Test safe query
result = SafeLLMApp.safe_query("What is machine learning?")
print(result)

# Test validation
result = SafeLLMApp.safe_query("")  # Empty input
print(result)

result = SafeLLMApp.safe_query("x" * 3000)  # Too long
print(result)

{'response': 'Machine learning is a subfield of artificial intelligence (AI) that focuses on **enabling computer systems to learn from data without being explicitly programmed.** Instead of providing a computer with a rigid set of instructions to perform a task, machine learning algorithms allow the computer to identify patterns, make predictions, and improve its performance over time as it\'s exposed to more data.\n\nThink of it like teaching a child. You don\'t give them a step-by-step manual for every single thing they\'ll encounter. Instead, you show them examples, explain concepts, and they gradually learn to understand and respond to new situations. Machine learning works in a similar way.\n\nHere\'s a breakdown of the key concepts:\n\n**1. Data is King:**\n* Machine learning algorithms rely heavily on **data**. This data can be anything from images, text, numbers, sensor readings, or even sounds.\n* The quality and quantity of data significantly impact the performance of a machi
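
`sanitize_output` above leaves PII removal as a comment. A minimal sketch of what that step might look like follows, assuming only email addresses and US-style phone numbers matter. The patterns and the name `redact_pii` are hypothetical, and real PII detection needs much broader coverage than two regular expressions.

```python
import re

# Illustrative patterns only, not a complete PII detector
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
US_PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

def redact_pii(text):
    """Replace email addresses and US-style phone numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return US_PHONE_RE.sub("[PHONE]", text)

print(redact_pii("Contact jane.doe@example.com or 216-555-0123."))
# → Contact [EMAIL] or [PHONE].
```

The same function could also be applied to user input inside `validate_input` before it is sent to the model.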

## Part 6: Production Checklist

### Before deploying to production:

#### Security
- [ ] API keys stored securely (environment variables, secret manager)
- [ ] Input validation implemented
- [ ] Output sanitization for PII
- [ ] Rate limiting on user requests
- [ ] Prompt injection protection

#### Reliability
- [ ] Error handling for all API calls
- [ ] Retry logic with exponential backoff
- [ ] Timeout handling
- [ ] Graceful degradation when API unavailable
- [ ] Logging for debugging

#### Cost Management
- [ ] Cost tracking implemented
- [ ] Token limits set appropriately
- [ ] Caching for repeated queries
- [ ] Budget alerts configured
- [ ] Regular cost monitoring

#### Quality
- [ ] Test suite with expected behaviors
- [ ] Edge case testing
- [ ] Output quality evaluation
- [ ] User feedback mechanism
- [ ] A/B testing capability

#### User Experience
- [ ] Clear error messages
- [ ] Loading indicators
- [ ] Source attribution (for RAG)
- [ ] Disclaimers where appropriate
- [ ] Feedback collection
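
The "Rate limiting on user requests" item in the Security list can be approached in many ways. The sketch below is one in-memory option, a sliding-window limiter keyed by user. The class name `SlidingWindowLimiter` and its parameters are hypothetical, and a multi-process deployment would need shared storage instead of a Python dict.

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Allow at most `max_calls` per `window` seconds for each user (in-memory sketch)."""

    def __init__(self, max_calls=5, window=60.0, clock=time.monotonic):
        self.max_calls = max_calls
        self.window = window
        self.clock = clock   # injectable so the limiter can be tested without sleeping
        self.calls = {}      # user_id -> deque of recent call timestamps

    def allow(self, user_id):
        now = self.clock()
        recent = self.calls.setdefault(user_id, deque())
        while recent and now - recent[0] >= self.window:
            recent.popleft()  # drop timestamps that fell out of the window
        if len(recent) >= self.max_calls:
            return False
        recent.append(now)
        return True

limiter = SlidingWindowLimiter(max_calls=2, window=60.0)
print([limiter.allow("alice") for _ in range(3)])  # → [True, True, False]
```

A request handler would call `allow(user_id)` before calling the model and return a clear error message when it gets `False`.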

## Key Takeaways

1. **Always handle errors** - APIs fail, networks drop, and rate limits get hit
2. **Track costs** - They add up quickly in production
3. **Test systematically** - LLMs are probabilistic, so test thoroughly
4. **Cache when possible** - Save money and improve speed
5. **Validate inputs** - Protect against misuse and errors
6. **Monitor in production** - Watch costs, errors, and quality

## Congratulations!

You've completed the series! You now know how to:
- Work with LLM APIs
- Build conversations
- Engineer prompts programmatically
- Use embeddings for semantic search
- Build RAG systems
- Deploy production-ready applications

### Next Steps
- Build real applications in your domain
- Explore advanced topics (fine-tuning, agents, etc.)
- Share your work with colleagues
- Keep learning as the field evolves!