# 🚀 EchoNote Inference API - FastAPI + NGROK

**Production-Ready API Server for Meeting Summarization**

This notebook creates a FastAPI server that loads your fine-tuned EchoNote model from the HuggingFace Hub and exposes it through an NGROK tunnel, so a client running on your local machine can call it remotely.

## 📋 Features

- ✅ Load model from HuggingFace Hub (haris936hk/echonote)
- ✅ FastAPI server with auto-generated Swagger docs
- ✅ NGROK tunnel with static domain support
- ✅ API Key authentication
- ✅ Rate limiting and timeout protection
- ✅ Error handling with retries
- ✅ Request logging and monitoring
- ✅ Batch inference support

## 🎯 Workflow

```
HuggingFace Model → FastAPI Server → NGROK Tunnel → Your Localhost Client
```

## ⚡ Quick Start

1. Run all cells in order
2. Get your NGROK URL from the output
3. Use the URL to make API calls from localhost
4. Visit `[NGROK_URL]/docs` for interactive API documentation

## 📦 1. Install Dependencies

In [None]:
%%capture
# Core dependencies
!pip install fastapi uvicorn python-multipart
!pip install pyngrok
!pip install slowapi  # Rate limiting

# Model loading options (choose one based on your preference)
# Option 1: Unsloth (Faster inference, recommended)
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

# Option 2: Standard transformers (uncomment if not using unsloth)
# !pip install transformers accelerate bitsandbytes

# Utilities
!pip install tenacity  # Retry logic
!pip install nest_asyncio  # Nested event loops in Jupyter (imported in section 6)

print("✅ All dependencies installed!")

## 🔧 2. Configuration

In [None]:
import os
import json
import time
import logging
from datetime import datetime

# ============================================================================
# CONFIGURATION - EDIT THESE VALUES
# ============================================================================

# HuggingFace Model
MODEL_NAME = "haris936hk/echonote"  # Your fine-tuned model
MAX_SEQ_LENGTH = 4096  # Maximum sequence length
LOAD_IN_4BIT = True  # Memory efficient loading

# NGROK Configuration
NGROK_AUTH_TOKEN = "YOUR_NGROK_AUTH_TOKEN"  # Get from https://dashboard.ngrok.com
NGROK_STATIC_DOMAIN = None  # Optional: e.g., "your-echonote.ngrok-free.app"

# Security
API_KEY = "echonote-secret-api-key-2025"  # Change this to a secure key!

# Server Settings
HOST = "0.0.0.0"
PORT = 8000

# Rate Limiting (requests per minute)
RATE_LIMIT = "10/minute"  # Adjust based on your needs

# Inference Settings
MAX_NEW_TOKENS = 1000
TEMPERATURE = 0.3  # Lower for more deterministic outputs
TOP_P = 0.95
REQUEST_TIMEOUT = 60  # Seconds

# ============================================================================

# Setup logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

print("✅ Configuration loaded!")
print(f"📦 Model: {MODEL_NAME}")
print(f"🔐 API Key: {'*' * (len(API_KEY) - 4) + API_KEY[-4:]}")
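
Hardcoding secrets in a notebook makes them easy to leak when the notebook is shared. A minimal sketch, assuming a hypothetical environment variable named `ECHONOTE_API_KEY` (the original notebook does not define it), reads the key from the environment and falls back to a random per-session key. The helper names `load_api_key` and `mask` are illustrative only.

```python
import os
import secrets


def load_api_key(env_name: str = "ECHONOTE_API_KEY") -> str:
    """Return the API key from the environment, or generate a random one."""
    key = os.environ.get(env_name)
    if key:
        return key
    # No key configured: generate a URL-safe random key for this session only.
    return secrets.token_urlsafe(32)


def mask(secret: str, visible: int = 4) -> str:
    """Hide all but the last `visible` characters, for safe logging."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]


API_KEY = load_api_key()
print(f"API key: {mask(API_KEY)}")
```

The same pattern could be applied to `NGROK_AUTH_TOKEN`, so that neither value is stored in the notebook itself.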

## 🤗 3. Load Model from HuggingFace

In [None]:
print("🔄 Loading model from HuggingFace...")
print(f"📦 Model: {MODEL_NAME}")

# Option 1: Load with Unsloth (RECOMMENDED - Faster inference)
try:
    from unsloth import FastLanguageModel
    
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=MODEL_NAME,
        max_seq_length=MAX_SEQ_LENGTH,
        dtype=None,  # Auto-detect best dtype
        load_in_4bit=LOAD_IN_4BIT,
    )
    
    # Enable fast inference mode
    FastLanguageModel.for_inference(model)
    
    print("✅ Model loaded with Unsloth (fast inference enabled)!")
    USING_UNSLOTH = True
    
except ImportError:
    # Option 2: Standard transformers (fallback)
    print("⚠️ Unsloth not available, using standard transformers...")
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    import torch

    # Passing load_in_4bit directly to from_pretrained is deprecated;
    # wrap it in a BitsAndBytesConfig instead.
    quant_config = BitsAndBytesConfig(load_in_4bit=True) if LOAD_IN_4BIT else None

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        quantization_config=quant_config,
        device_map="auto",
        torch_dtype=torch.float16,
    )

    print("✅ Model loaded with standard transformers!")
    USING_UNSLOTH = False

# Model info
print(f"\n📊 Model Information:")
print(f"   - Max sequence length: {MAX_SEQ_LENGTH}")
print(f"   - 4-bit quantization: {LOAD_IN_4BIT}")
print(f"   - Using Unsloth: {USING_UNSLOTH}")

## 🔄 4. Define Inference Function

In [None]:
import torch
from tenacity import retry, stop_after_attempt, wait_exponential

# System prompt from your training
SYSTEM_PROMPT = """You are an AI assistant specialized in analyzing meeting transcripts.

Your task is to:
1. Read the meeting transcript carefully
2. Analyze the NLP features provided
3. Generate a structured JSON summary

Output Format (strict JSON):
{
  "executiveSummary": string (60-100 words),
  "keyDecisions": string[],
  "actionItems": [
    {"task": string, "assignee": string, "deadline": string, "priority": "high/medium/low"}
  ],
  "nextSteps": string[],
  "keyTopics": string[],
  "sentiment": "positive" | "neutral" | "negative"
}

Important:
- Output ONLY valid JSON, nothing else
- executiveSummary must be at least 150 characters
- If no decisions/actions found, return empty arrays []
- sentiment must match the tone of the meeting
"""

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10),
    reraise=True  # Surface the original exception instead of tenacity's RetryError
)
def generate_summary(transcript: str) -> str:
    """
    Generate meeting summary from transcript.
    
    Args:
        transcript: Meeting transcript text
    
    Returns:
        JSON string with structured summary
    """
    try:
        # Format prompt with chat template
        messages = [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": transcript}
        ]
        
        # Apply chat template
        prompt = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )
        
        # Tokenize
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        
        # Generate
        start_time = time.time()
        
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=MAX_NEW_TOKENS,
                temperature=TEMPERATURE,
                top_p=TOP_P,
                do_sample=True,
                pad_token_id=tokenizer.pad_token_id,
                eos_token_id=tokenizer.eos_token_id,
            )
        
        # Decode only the newly generated tokens. With skip_special_tokens=True
        # the chat-template markers are stripped, so searching the decoded text
        # for "<|im_start|>assistant" or for the raw prompt would never match
        # and the prompt itself would end up in the output.
        prompt_length = inputs["input_ids"].shape[1]
        json_output = tokenizer.decode(
            outputs[0][prompt_length:],
            skip_special_tokens=True
        ).strip()
        
        # Remove any markdown code blocks if present
        json_output = json_output.replace('```json', '').replace('```', '').strip()
        
        inference_time = time.time() - start_time
        
        logger.info(f"✅ Inference completed in {inference_time:.2f}s")
        
        return json_output
        
    except Exception as e:
        logger.error(f"❌ Inference error: {str(e)}")
        raise

print("✅ Inference function ready!")
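
Model output sometimes wraps the JSON in explanatory prose or code fences, which stripping fence markers alone does not handle. As a sketch (the helper name `extract_json_object` is hypothetical and not part of the notebook), the first decodable JSON object can be located with the standard library's `json.JSONDecoder.raw_decode`:

```python
import json
from typing import Any, Dict


def extract_json_object(text: str) -> Dict[str, Any]:
    """Return the first JSON object found in `text`.

    Raises ValueError if no decodable object is present.
    """
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at index `start`
            # and ignores whatever follows it.
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        # Not a valid object here: try the next opening brace.
        start = text.find("{", start + 1)
    raise ValueError("no JSON object found in model output")
```

A function like this could replace the `json.loads` call on the raw output, at the cost of silently ignoring any text around the object.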

## 🌐 5. Create FastAPI Server

In [None]:
from fastapi import FastAPI, HTTPException, Security, Request
from fastapi.security import APIKeyHeader
from fastapi.responses import JSONResponse
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel, Field
from typing import Optional, List, Dict, Any
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.util import get_remote_address
from slowapi.errors import RateLimitExceeded
import asyncio

# ============================================================================
# Pydantic Models (Input/Output Validation)
# ============================================================================

class MeetingInput(BaseModel):
    """Input model for meeting transcript"""
    transcript: str = Field(
        ...,
        min_length=100,
        max_length=10000,
        description="Meeting transcript text (100-10000 characters)"
    )
    
    # Pydantic v2 style; on Pydantic v1 use `class Config: schema_extra = {...}`
    model_config = {
        "json_schema_extra": {
            "example": {
                "transcript": "MEETING TRANSCRIPT:\nOkay, let's get started. Sarah, can you give us an update on the Q3 numbers? Sure, looking at the dashboard, we're seeing revenue at $2.5M which is actually 15% above our target..."
            }
        }
    }

class BatchMeetingInput(BaseModel):
    """Input model for batch processing"""
    transcripts: List[str] = Field(
        ...,
        min_length=1,  # Pydantic v2 name for min_items
        max_length=10,  # Pydantic v2 name for max_items
        description="List of meeting transcripts (max 10)"
    )

class SummaryOutput(BaseModel):
    """Output model for meeting summary"""
    summary: Dict[str, Any] = Field(..., description="Structured meeting summary")
    metadata: Dict[str, Any] = Field(..., description="Request metadata")

class HealthResponse(BaseModel):
    """Health check response"""
    status: str
    model: str
    timestamp: str
    uptime_seconds: float

# ============================================================================
# FastAPI App Setup
# ============================================================================

app = FastAPI(
    title="EchoNote Inference API",
    description="Production-ready API for meeting summarization using fine-tuned Qwen2.5-7B",
    version="1.0.0",
    docs_url="/docs",
    redoc_url="/redoc",
)

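# CORS middleware (allow all origins for development)
# Credentials are disabled: the CORS spec forbids a wildcard origin together
# with credentials, and authentication here uses the X-API-Key header instead.
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=False,
    allow_methods=["*"],
    allow_headers=["*"],
)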

# Rate limiting
limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

# API Key authentication
api_key_header = APIKeyHeader(name="X-API-Key", auto_error=False)

def verify_api_key(api_key: Optional[str] = Security(api_key_header)):
    """Verify API key from request header"""
    if api_key is None or api_key != API_KEY:
        raise HTTPException(
            status_code=403,
            detail="Invalid or missing API key. Include 'X-API-Key' header."
        )
    return api_key

# Track server start time
SERVER_START_TIME = time.time()

# ============================================================================
# Middleware for Logging
# ============================================================================

@app.middleware("http")
async def log_requests(request: Request, call_next):
    """Log all requests with timing information"""
    start_time = time.time()
    
    # Process request
    response = await call_next(request)
    
    # Calculate duration
    duration = time.time() - start_time
    
    # Log
    logger.info(
        f"{request.method} {request.url.path} - "
        f"Status: {response.status_code} - "
        f"Duration: {duration:.2f}s"
    )
    
    # Add timing header
    response.headers["X-Process-Time"] = str(duration)
    
    return response

# ============================================================================
# API Endpoints
# ============================================================================

@app.get("/", tags=["General"])
async def root():
    """Root endpoint - API information"""
    return {
        "message": "Welcome to EchoNote Inference API",
        "version": "1.0.0",
        "model": MODEL_NAME,
        "endpoints": {
            "docs": "/docs",
            "health": "/health",
            "predict": "/predict",
            "batch_predict": "/batch-predict"
        },
        "documentation": "Visit /docs for interactive API documentation"
    }

@app.get("/health", response_model=HealthResponse, tags=["General"])
async def health_check():
    """Health check endpoint"""
    return HealthResponse(
        status="healthy",
        model=MODEL_NAME,
        timestamp=datetime.now().isoformat(),
        uptime_seconds=time.time() - SERVER_START_TIME
    )

@app.post("/predict", response_model=SummaryOutput, tags=["Inference"])
@limiter.limit(RATE_LIMIT)
async def predict(
    request: Request,
    meeting_input: MeetingInput,
    api_key: str = Security(verify_api_key)
):
    """
    Generate meeting summary from transcript.
    
    **Authentication:** Requires X-API-Key header
    
    **Rate Limit:** 10 requests per minute per IP
    
    **Returns:** Structured JSON summary with executive summary, decisions, action items, etc.
    """
    try:
        start_time = time.time()
        
        # Run inference with timeout
        try:
            summary_json = await asyncio.wait_for(
                asyncio.to_thread(generate_summary, meeting_input.transcript),
                timeout=REQUEST_TIMEOUT
            )
        except asyncio.TimeoutError:
            raise HTTPException(
                status_code=504,
                detail=f"Request timeout after {REQUEST_TIMEOUT} seconds"
            )
        
        # Parse JSON
        try:
            summary_dict = json.loads(summary_json)
        except json.JSONDecodeError as e:
            logger.error(f"JSON parsing error: {str(e)}")
            logger.error(f"Raw output: {summary_json[:500]}")
            raise HTTPException(
                status_code=500,
                detail="Failed to parse model output as JSON"
            )
        
        # Calculate metrics
        inference_time = time.time() - start_time
        
        # Return response
        return SummaryOutput(
            summary=summary_dict,
            metadata={
                "model": MODEL_NAME,
                "inference_time_seconds": round(inference_time, 2),
                "timestamp": datetime.now().isoformat(),
                "transcript_length": len(meeting_input.transcript),
            }
        )
        
    except HTTPException:
        raise
    except Exception as e:
        logger.error(f"Prediction error: {str(e)}")
        raise HTTPException(
            status_code=500,
            detail=f"Internal server error: {str(e)}"
        )

@app.post("/batch-predict", tags=["Inference"])
@limiter.limit("3/minute")  # Stricter limit for batch
async def batch_predict(
    request: Request,
    batch_input: BatchMeetingInput,
    api_key: str = Security(verify_api_key)
):
    """
    Generate summaries for multiple transcripts.
    
    **Authentication:** Requires X-API-Key header
    
    **Rate Limit:** 3 requests per minute per IP
    
    **Max Batch Size:** 10 transcripts
    """
    try:
        results = []
        
        for idx, transcript in enumerate(batch_input.transcripts):
            try:
                summary_json = await asyncio.to_thread(generate_summary, transcript)
                summary_dict = json.loads(summary_json)
                
                results.append({
                    "index": idx,
                    "status": "success",
                    "summary": summary_dict
                })
            except Exception as e:
                results.append({
                    "index": idx,
                    "status": "error",
                    "error": str(e)
                })
        
        return {
            "results": results,
            "total": len(batch_input.transcripts),
            "successful": sum(1 for r in results if r["status"] == "success"),
            "failed": sum(1 for r in results if r["status"] == "error")
        }
        
    except Exception as e:
        logger.error(f"Batch prediction error: {str(e)}")
        raise HTTPException(
            status_code=500,
            detail=f"Batch processing error: {str(e)}"
        )

print("✅ FastAPI server configured!")
print(f"📚 Endpoints: /, /health, /predict, /batch-predict")
print(f"🔐 Authentication: X-API-Key header required")
print(f"🚦 Rate limit: {RATE_LIMIT}")
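
The `!=` comparison in `verify_api_key` can, in principle, leak timing information about how many leading characters of a guess are correct. A hedged sketch of a constant-time check using the standard library's `hmac.compare_digest` follows; the function name `api_key_matches` is illustrative and not part of the notebook.

```python
import hmac
from typing import Optional


def api_key_matches(provided: Optional[str], expected: str) -> bool:
    """Compare API keys in constant time; a missing key never matches."""
    if provided is None:
        return False
    # Encode to bytes: compare_digest rejects non-ASCII str arguments.
    return hmac.compare_digest(
        provided.encode("utf-8"),
        expected.encode("utf-8"),
    )
```

Inside `verify_api_key`, the condition would then read `if not api_key_matches(api_key, API_KEY):`.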

## 🌐 6. Setup NGROK Tunnel

In [None]:
from pyngrok import ngrok, conf
import nest_asyncio

# Allow nested event loops (required for Jupyter)
nest_asyncio.apply()

# ============================================================================
# NGROK Setup
# ============================================================================

# Set auth token
if NGROK_AUTH_TOKEN and NGROK_AUTH_TOKEN != "YOUR_NGROK_AUTH_TOKEN":
    ngrok.set_auth_token(NGROK_AUTH_TOKEN)
    print("✅ NGROK auth token configured")
else:
    print("⚠️ NGROK_AUTH_TOKEN not set! Get yours from: https://dashboard.ngrok.com")
    print("   Update the NGROK_AUTH_TOKEN variable in section 2")

# Kill any existing tunnels
ngrok.kill()

# Configure NGROK
conf.get_default().region = "us"  # Change to your region: us, eu, ap, au, sa, jp, in

# Create tunnel
if NGROK_STATIC_DOMAIN:
    # Use static domain (free tier includes 1 static domain)
    print(f"🔗 Creating tunnel with static domain: {NGROK_STATIC_DOMAIN}")
    tunnel = ngrok.connect(
        PORT,
        domain=NGROK_STATIC_DOMAIN,
        bind_tls=True
    )
else:
    # Use random domain
    print("🔗 Creating tunnel with random domain...")
    tunnel = ngrok.connect(PORT, bind_tls=True)

# ngrok.connect() returns an NgrokTunnel object; keep only its URL string
public_url = tunnel.public_url

print("\n" + "="*80)
print("🎉 NGROK TUNNEL CREATED SUCCESSFULLY!")
print("="*80)
print(f"\n🌐 Public URL: {public_url}")
print(f"📚 API Docs: {public_url}/docs")
print(f"🔍 Redoc: {public_url}/redoc")
print(f"❤️ Health Check: {public_url}/health")
# Mask the key so it is not stored in the notebook output
print(f"\n🔐 API Key: {'*' * (len(API_KEY) - 4) + API_KEY[-4:]}")
print(f"\n⚡ Ready to accept requests!\n")
print("="*80)

# Store for later use
NGROK_PUBLIC_URL = str(public_url)

## 🚀 7. Start FastAPI Server

**⚠️ Important:** This cell will run continuously. The server will keep running until you stop it manually.

To stop the server:
- Click the **Stop** button in the toolbar
- Or press **Kernel → Interrupt** in the menu

In [None]:
import uvicorn

print("\n🚀 Starting FastAPI server...")
print(f"📍 Local: http://{HOST}:{PORT}")
print(f"🌐 Public: {NGROK_PUBLIC_URL}")
print("\n⏳ Server is running... Press STOP to terminate.\n")
print("="*80)

# Run server
try:
    uvicorn.run(
        app,
        host=HOST,
        port=PORT,
        log_level="info",
        access_log=True
    )
except KeyboardInterrupt:
    print("\n🛑 Server stopped by user")
finally:
    # Cleanup
    ngrok.kill()
    print("✅ NGROK tunnel closed")

---

# 📝 Testing Section

**Run these cells in a SEPARATE notebook or terminal while the server above is still running; once it is stopped, the API is no longer reachable.**

## 🧪 8. Test API from Localhost

### Option A: Test in Browser

1. Copy your NGROK URL from above
2. Visit: `[NGROK_URL]/docs`
3. Click "Authorize" and enter your API key
4. Test the `/predict` endpoint

### Option B: Test with Python Code (Run in separate notebook/terminal)

In [None]:
import requests
import json

# ============================================================================
# UPDATE THESE FROM YOUR SERVER OUTPUT
# ============================================================================
NGROK_URL = "YOUR_NGROK_URL"  # e.g., https://your-domain.ngrok-free.app
API_KEY = "echonote-secret-api-key-2025"  # Same as server

# Sample transcript
sample_transcript = """MEETING TRANSCRIPT:
Okay everyone, let's get started with our Q3 review. Sarah, can you walk us through the numbers?

Sure thing. So looking at the dashboard, we're seeing revenue at $2.5M which is actually 15% above our target for the quarter. Our MRR is sitting at $830K with a healthy 5% month-over-month growth.

That's fantastic Sarah. What about customer metrics?

Customer count is at 450, up from 380 last quarter. Churn rate has dropped to 2.5% which is the lowest we've seen. NPS score is holding steady at 72.

Great work team. Now, we need to discuss the product roadmap for Q4. Mike, what are we looking at?

Well, we have three major initiatives. First is the mobile app redesign which should launch by October 15th. Second is the new analytics dashboard - that's about 60% complete. Third is the API v2 rollout.

Okay, let's make sure we have clear ownership. Mike, you'll lead the mobile redesign. Sarah, can you take the analytics dashboard? And I'll handle the API rollout with the engineering team.

Sounds good. We should also discuss the competitor analysis that came out last week.

Right. The TechCrunch article about CompetitorX raising $50M is concerning. We need to accelerate our feature development to stay ahead.

Agreed. Let's schedule a strategy session for next week to dive deeper into this. Everyone free Tuesday at 2pm?

Works for me.

Same here.

Perfect. Let's wrap up with action items. Mike - mobile redesign launch plan by Friday. Sarah - analytics dashboard progress update by Wednesday. I'll send out the strategy session invite for Tuesday.

Got it, thanks everyone!
"""

# ============================================================================
# Test Health Endpoint
# ============================================================================

print("🔍 Testing health endpoint...")
try:
    response = requests.get(
        f"{NGROK_URL}/health",
        headers={"ngrok-skip-browser-warning": "true"},
        timeout=30  # Fail instead of hanging if the tunnel is down
    )
    print(f"✅ Status: {response.status_code}")
    print(f"📄 Response: {json.dumps(response.json(), indent=2)}")
except Exception as e:
    print(f"❌ Error: {str(e)}")

# ============================================================================
# Test Prediction Endpoint
# ============================================================================

print("\n" + "="*80)
print("🔍 Testing prediction endpoint...")
print("="*80)

try:
    response = requests.post(
        f"{NGROK_URL}/predict",
        json={"transcript": sample_transcript},
        headers={
            "X-API-Key": API_KEY,
            "Content-Type": "application/json",
            "ngrok-skip-browser-warning": "true"  # Skip NGROK interstitial page
        },
        timeout=120  # Longer than the server's 60 s inference timeout
    )
    
    if response.status_code == 200:
        result = response.json()
        
        print("\n✅ SUCCESS!")
        print(f"\n⏱️ Inference time: {result['metadata']['inference_time_seconds']}s")
        print(f"\n📊 Summary:")
        print(json.dumps(result['summary'], indent=2))
        
    else:
        print(f"\n❌ Error: {response.status_code}")
        print(f"📄 Response: {response.text}")
        
except Exception as e:
    print(f"\n❌ Request failed: {str(e)}")

print("\n" + "="*80)

## 📦 9. Example: Batch Processing

In [None]:
# Example batch request
batch_transcripts = [
    sample_transcript,
    "Another meeting transcript here...",
]

try:
    response = requests.post(
        f"{NGROK_URL}/batch-predict",
        json={"transcripts": batch_transcripts},
        headers={
            "X-API-Key": API_KEY,
            "Content-Type": "application/json",
            "ngrok-skip-browser-warning": "true"
        }
    )
    
    if response.status_code == 200:
        result = response.json()
        print(f"✅ Batch processing complete")
        print(f"📊 Total: {result['total']}")
        print(f"✅ Successful: {result['successful']}")
        print(f"❌ Failed: {result['failed']}")
    else:
        print(f"❌ Error: {response.status_code} - {response.text}")
        
except Exception as e:
    print(f"❌ Request failed: {str(e)}")

---

## 📖 Documentation & Troubleshooting

### 🎯 Quick Reference

**API Endpoints:**
- `GET /` - API information
- `GET /health` - Health check
- `POST /predict` - Single transcript inference
- `POST /batch-predict` - Batch inference (max 10)
- `GET /docs` - Interactive Swagger UI
- `GET /redoc` - ReDoc documentation

**Authentication:**
```python
headers = {
    "X-API-Key": "your-api-key",
    "Content-Type": "application/json",
    "ngrok-skip-browser-warning": "true"  # Skip NGROK warning page
}
```

**Rate Limits:**
- `/predict`: 10 requests/minute per IP
- `/batch-predict`: 3 requests/minute per IP
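
One caveat on "per IP": because every request reaches the server through the local NGROK agent, `get_remote_address` may report the same address for all clients, which would turn the per-IP limits into a single shared limit. A minimal sketch of an alternative key function follows, assuming the tunnel passes the original client address in an `X-Forwarded-For` header. The function name `forwarded_client_ip` is hypothetical, and a `SimpleNamespace` stands in for a real Starlette `Request`.

```python
from types import SimpleNamespace


def forwarded_client_ip(request) -> str:
    """Rate-limit key: first address in X-Forwarded-For, else the socket peer."""
    forwarded = request.headers.get("x-forwarded-for", "")
    first = forwarded.split(",")[0].strip()
    if first:
        return first
    # Fall back to the direct peer address when the header is absent.
    return request.client.host if request.client else "unknown"


# Stand-in for a Starlette Request, only to demonstrate the behaviour.
fake_request = SimpleNamespace(
    headers={"x-forwarded-for": "203.0.113.7, 10.0.0.1"},
    client=SimpleNamespace(host="127.0.0.1"),
)
print(forwarded_client_ip(fake_request))  # 203.0.113.7
```

It would be passed as `Limiter(key_func=forwarded_client_ip)`. The header is only trustworthy when every request arrives through a proxy that sets it, since a direct client could send any value.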

### ⚠️ Common Issues

**1. NGROK Tunnel Not Working:**
- ✅ Make sure you set `NGROK_AUTH_TOKEN` in section 2
- ✅ Get token from: https://dashboard.ngrok.com/get-started/your-authtoken
- ✅ Check if tunnel is active: call `ngrok.get_tunnels()` from `pyngrok` in the server notebook

**2. API Key Errors:**
- ✅ Include `X-API-Key` header in all requests
- ✅ API key must match the one set in configuration

**3. Rate Limit Errors:**
- ✅ Wait 60 seconds between batches of requests
- ✅ Use batch endpoint for multiple transcripts
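
On the client side, waiting can be automated by retrying with exponential backoff whenever the server answers with HTTP 429. The sketch below is one possible shape, not part of the notebook: `call_with_backoff` and its parameters are hypothetical, and `send` is any callable that performs one request and returns `(status_code, body)`.

```python
import time
from typing import Callable, Tuple


def call_with_backoff(
    send: Callable[[], Tuple[int, str]],
    max_attempts: int = 4,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call `send` until it returns a status other than 429.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts
    and gives up after `max_attempts` calls, returning the last result.
    """
    status, body = send()
    for attempt in range(1, max_attempts):
        if status != 429:
            break
        sleep(base_delay * 2 ** (attempt - 1))
        status, body = send()
    return status, body
```

With `requests`, `send` could wrap the `requests.post` call from section 8 and return `(response.status_code, response.text)`.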

**4. Model Loading Errors:**
- ✅ Ensure you have sufficient GPU/RAM
- ✅ Try setting `LOAD_IN_4BIT = True` for lower memory usage
- ✅ Check model exists: https://huggingface.co/haris936hk/echonote

**5. JSON Parsing Errors:**
- ✅ Model might need more examples or fine-tuning
- ✅ Check if transcript is in correct format
- ✅ Ensure transcript is between 100-10000 characters
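
Output that parses as JSON can still miss fields the system prompt asks for. A small sketch of a structural check follows, based on the field names listed in the system prompt; the function name `find_summary_problems` and the constants are hypothetical additions, not part of the notebook.

```python
from typing import Any, Dict, List

# Field names and types taken from the output format in the system prompt.
REQUIRED_FIELDS = {
    "executiveSummary": str,
    "keyDecisions": list,
    "actionItems": list,
    "nextSteps": list,
    "keyTopics": list,
    "sentiment": str,
}
ALLOWED_SENTIMENTS = {"positive", "neutral", "negative"}


def find_summary_problems(summary: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means the shape is valid."""
    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in summary:
            problems.append(f"missing field: {name}")
        elif not isinstance(summary[name], expected_type):
            problems.append(f"{name} should be {expected_type.__name__}")
    sentiment = summary.get("sentiment")
    if isinstance(sentiment, str) and sentiment not in ALLOWED_SENTIMENTS:
        problems.append(f"unexpected sentiment: {sentiment}")
    return problems
```

Logging the returned list next to the raw output makes it easier to tell a malformed response from a truncated one.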

**6. Timeout Errors:**
- ✅ Increase `REQUEST_TIMEOUT` in configuration
- ✅ Shorter transcripts process faster
- ✅ Check GPU availability
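
Note that the timeout in `/predict` only stops the request from waiting: `asyncio.wait_for` cannot interrupt a function already running in a worker thread, so a timed-out generation keeps occupying the GPU until it finishes. The standalone sketch below illustrates this with a sleeping stand-in for `generate_summary`; all names are illustrative, and it is meant to be run as a script rather than inside the server notebook.

```python
import asyncio
import threading
import time

finished = threading.Event()


def slow_inference() -> str:
    """Stand-in for generate_summary: blocks for a short while."""
    time.sleep(0.3)
    finished.set()
    return "done"


async def main() -> bool:
    try:
        await asyncio.wait_for(asyncio.to_thread(slow_inference), timeout=0.1)
    except asyncio.TimeoutError:
        pass  # The caller gets its timeout here...
    # ...but the worker thread was not interrupted and still completes.
    return await asyncio.to_thread(finished.wait, 1.0)


still_ran = asyncio.run(main())
print(still_ran)  # True
```

In practice this means that repeated timeouts can queue up work behind the scenes, so raising `REQUEST_TIMEOUT` or lowering `MAX_NEW_TOKENS` may help more than retrying.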

**7. NGROK Bandwidth Limit:**
- ✅ Free tier: 1GB/month (~1000 requests)
- ✅ Upgrade to paid plan or use alternative tunneling service

### 🚀 Production Deployment

**This setup is for TESTING ONLY. For production:**

1. **Deploy to proper hosting:**
   - HuggingFace Inference Endpoints
   - AWS/GCP/Azure with proper API gateway
   - Modal.com, Replicate, or RunPod

2. **Add proper security:**
   - JWT authentication instead of API keys
   - HTTPS with proper SSL certificates
   - Rate limiting per user/organization
   - Request validation and sanitization

3. **Add monitoring:**
   - Prometheus/Grafana for metrics
   - Sentry for error tracking
   - CloudWatch/Datadog for logs

4. **Add caching:**
   - Redis for response caching
   - Reduce redundant inference calls
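
As an illustration of the caching idea, the sketch below keys cached summaries by a hash of the transcript. A plain dictionary stands in for Redis, and the class name `SummaryCache` is hypothetical.

```python
import hashlib
from typing import Callable, Dict


class SummaryCache:
    """Cache summaries by transcript hash. A dict stands in for Redis here."""

    def __init__(self, summarize: Callable[[str], str]) -> None:
        self._summarize = summarize
        self._store: Dict[str, str] = {}
        self.misses = 0

    def get(self, transcript: str) -> str:
        # Hash the transcript so keys have a fixed size regardless of length.
        key = hashlib.sha256(transcript.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._summarize(transcript)
        return self._store[key]
```

Because generation samples with a non-zero temperature, a cache like this returns the first sampled summary for a given transcript rather than a fresh one on each call.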

### 📚 Additional Resources

- **FastAPI Docs:** https://fastapi.tiangolo.com
- **NGROK Docs:** https://ngrok.com/docs
- **Unsloth Docs:** https://docs.unsloth.ai
- **HuggingFace Hub:** https://huggingface.co/docs/hub

### 💡 Tips

- 🔥 Use `/docs` endpoint for interactive testing
- 📊 Check `/health` endpoint to verify server is running
- 🔐 Never commit API keys to git repositories
- ⚡ Use batch endpoint for multiple transcripts (one HTTP round trip; transcripts are still processed one at a time on the server)
- 💾 Save important outputs immediately (NGROK sessions can disconnect)
- 🌐 Get a static NGROK domain for consistent URL

---

## ✨ What's Next?

1. **Test thoroughly** with different meeting transcripts
2. **Monitor performance** and adjust parameters
3. **Collect feedback** on summary quality
4. **Fine-tune further** if needed
5. **Deploy to production** with proper infrastructure

---

**🎉 Congratulations! You now have a working AI-powered meeting summarization API!**