# Combined API Deployment for Project Agent and Marketing Agent

This notebook combines two separate APIs:
1. Project Agent API - uses a Hugging Face model to generate project-related insights
2. Marketing Agent API - uses a fine-tuned model for marketing-related responses

Both APIs are served from different endpoints but hosted on the same Google Colab instance.

## Install Required Packages

In [None]:
# Install required packages for both APIs
!pip install transformers huggingface_hub fastapi uvicorn pyngrok pydantic
!pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
!pip install sentencepiece protobuf datasets hf_transfer
!pip install --no-deps unsloth
!pip install nest_asyncio



## Import Libraries

In [None]:
# Common imports
import os
import threading
import uvicorn
import nest_asyncio
import sqlite3
import json
from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel
from typing import Dict, Optional, List, Any
from pyngrok import ngrok

# Apply nest_asyncio to allow running asyncio code in Jupyter
nest_asyncio.apply()

## Configure ngrok

In [None]:
ngrok_auth_token = ""  # Replace with your ngrok auth token
ngrok.set_auth_token(ngrok_auth_token)
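
A token typed directly into a notebook cell is easy to leak when the notebook is shared. One alternative is to read it from an environment variable. The sketch below assumes the token has been exported beforehand; the helper `read_secret` and the variable name `NGROK_AUTH_TOKEN` are hypothetical and not something pyngrok requires.

```python
import os

def read_secret(name: str) -> str:
    """Return the stripped value of environment variable `name`, or raise if it is missing or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running this cell.")
    return value

# Hypothetical usage in the cell above (the variable name is an assumption):
# ngrok.set_auth_token(read_secret("NGROK_AUTH_TOKEN"))
```

Failing early with a clear message is usually easier to debug than an authentication error raised later by the tunnel.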

## Project Agent API Setup

Instead of loading Llama 3.2 locally, the Project Agent calls the Hugging Face Inference API with `mistralai/Mistral-7B-Instruct-v0.2`, which avoids the 'apply_qkv' attribute error.

In [None]:
# Project Agent Imports
from huggingface_hub import login
import requests
import time

# Authenticate with Hugging Face for Project Agent
PROJECT_HF_TOKEN = ""  # Replace with your actual token if needed
login(token=PROJECT_HF_TOKEN)

# Create FastAPI app for Project Agent
project_app = FastAPI()

# Setup SQLite Database
db_name = "project_responses.db"
conn = sqlite3.connect(db_name)
cursor = conn.cursor()

cursor.execute('''
CREATE TABLE IF NOT EXISTS project_responses (
    section TEXT,
    heading TEXT,
    sub_question TEXT,
    response TEXT,
    PRIMARY KEY (section, sub_question)
)
''')

# Define Project Agent questions and sub-questions
questions_and_subquestions = {
    "Technical Architecture": [
        ("High Level System Architecture", "What should be the high-level system architecture?"),
        ("Backend and Frontend Technologies", "What backend and frontend technologies would be suitable?"),
    ],
    "Required Technical Tools & Stack": [
        ("Programming Languages and Integrations", "What programming languages, cloud services, and third-party integrations are needed?"),
    ],
    "Engineering Team Structure": [
        ("Required Engineers and Expertise", "What kind of engineers and expertise are required (e.g., backend, frontend, DevOps, ML engineers, etc.)?"),
        ("Development Team Size", "How many developers would be needed at each stage?"),
    ],
}

# Project Agent request model
class ProjectRequest(BaseModel):
    task: str

# Using Hugging Face Inference API instead of loading model locally to avoid errors
def generate_text_with_hf_api(prompt, model_name="mistralai/Mistral-7B-Instruct-v0.2"):
    API_URL = f"https://api-inference.huggingface.co/models/{model_name}"
    headers = {"Authorization": f"Bearer {PROJECT_HF_TOKEN}"}

    # Format for instruction models
    payload = {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": 150,
            "temperature": 0.7,
            "top_p": 0.9,
            "return_full_text": False
        }
    }

    # Retry mechanism
    max_retries = 3
    retry_delay = 5  # seconds
    request_timeout = 60  # seconds; keeps a stalled request from hanging the endpoint

    for attempt in range(max_retries):
        try:
            response = requests.post(API_URL, headers=headers, json=payload, timeout=request_timeout)
        except requests.RequestException as e:
            # Network errors and timeouts are retried like any other failure
            print(f"Request failed: {e}")
        else:
            if response.status_code == 200:
                try:
                    return response.json()[0]["generated_text"].strip()
                except (KeyError, IndexError, TypeError, ValueError):
                    # ValueError covers a body that is not valid JSON
                    print(f"Unexpected response format: {response.text}")
            elif response.status_code == 503 and "loading" in response.text.lower():
                print("Model is loading.")
            else:
                print(f"Error: {response.status_code}, {response.text}")

        # Wait before the next attempt, but not after the final one
        if attempt < max_retries - 1:
            print(f"Waiting {retry_delay} seconds before retry...")
            time.sleep(retry_delay)

    # Fallback response if all retries fail
    return "Unable to generate response at this time. Please try again later."

# Project Agent API endpoint
@project_app.post("/generate")
async def generate_project_responses(request: ProjectRequest):
    # Prepare the JSON response structure
    response_json = {}

    for section, sub_questions in questions_and_subquestions.items():
        section_responses = {}

        for heading, sub_question in sub_questions:
            prompt = f"""
            Based on the following project idea: {request.task}

            Provide a concise, well-structured, and informative response to the following question in a single paragraph only (The response should be just a single paragraph and not bullet points etc):

            **{sub_question}**

            Limit your response to 200-250 characters. The response should focus on providing key insights and actionable recommendations. Avoid conversational phrases like "Choose the best answer for your project" or any unnecessary options. The goal is to provide a detailed but succinct summary that directly addresses the question in a factual, professional tone.
            """
            print(f"Asking about: {sub_question}...")

            # Generate the response using Hugging Face API
            response = generate_text_with_hf_api(prompt)

            # Add response to the section
            section_responses[heading] = response

            # Store response in SQLite database
            cursor.execute("INSERT OR REPLACE INTO project_responses (section, heading, sub_question, response) VALUES (?, ?, ?, ?)",
                          (section, heading, sub_question, response))
            conn.commit()

        # Add section responses to the main response
        response_json[section] = section_responses

    return {"generated_text": response_json}

# Handle database closing
@project_app.on_event("shutdown")
async def project_shutdown():
    conn.close()

        on_event is deprecated, use lifespan event handlers instead.

        Read more about it in the
        [FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).
        
  @project_app.on_event("shutdown")
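
The table's primary key is `(section, sub_question)`, so `INSERT OR REPLACE` keeps only the latest response for each question: a request for a new `task` overwrites the rows stored for the previous one. Below is a minimal sketch of that behavior and of reading a section back. It uses an in-memory database, and the helpers `save_response` and `load_section` are hypothetical (neither is defined in the notebook).

```python
import sqlite3

# In-memory database for illustration; the notebook itself uses project_responses.db
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("""
CREATE TABLE IF NOT EXISTS project_responses (
    section TEXT,
    heading TEXT,
    sub_question TEXT,
    response TEXT,
    PRIMARY KEY (section, sub_question)
)
""")

def save_response(section, heading, sub_question, response):
    """Insert a response, replacing any earlier row with the same (section, sub_question)."""
    demo_conn.execute(
        "INSERT OR REPLACE INTO project_responses (section, heading, sub_question, response) "
        "VALUES (?, ?, ?, ?)",
        (section, heading, sub_question, response),
    )
    demo_conn.commit()

def load_section(section):
    """Return {heading: response} for every row stored under `section`."""
    rows = demo_conn.execute(
        "SELECT heading, response FROM project_responses WHERE section = ? ORDER BY heading",
        (section,),
    ).fetchall()
    return dict(rows)

section = "Engineering Team Structure"
question = "How many developers would be needed at each stage?"
save_response(section, "Development Team Size", question, "Answer for the first task")
save_response(section, "Development Team Size", question, "Answer for the second task")

# Only the second answer survives because both rows share the same primary key
print(load_section(section))
```

If responses for several tasks need to coexist, one option is to add the task (or a hash of it) to the primary key; that would be a schema change beyond what the notebook does.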


## Marketing Agent API Setup

In [None]:
# Marketing Agent Imports
import torch
from unsloth import FastLanguageModel
from transformers import TextStreamer

# Create FastAPI app for Marketing Agent
marketing_app = FastAPI()

# Configure CORS for Marketing Agent
marketing_app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# Define Marketing Agent models
class GenerationRequest(BaseModel):
    instruction: str
    input_text: str = ""
    max_new_tokens: int = 2048
    temperature: float = 0.7
    top_p: float = 0.9

class GenerationResponse(BaseModel):
    generated_text: str

# Marketing Agent model parameters
max_seq_length = 2048
dtype = None  # None for auto detection
load_in_4bit = True
HF_MODEL_PATH = "Hamza-Mubashir/marketing_rafam97_finetuned"  # Replace with your model if needed

# Global variables for Marketing Agent model and tokenizer
marketing_model = None
marketing_tokenizer = None

# Define the prompt template for Marketing Agent
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

# Load Marketing Agent model function
def load_marketing_model():
    global marketing_model, marketing_tokenizer
    try:
        print("Loading model for Marketing Agent...")
        marketing_model, marketing_tokenizer = FastLanguageModel.from_pretrained(
            model_name=HF_MODEL_PATH,
            max_seq_length=max_seq_length,
            dtype=dtype,
            load_in_4bit=load_in_4bit,
        )
        FastLanguageModel.for_inference(marketing_model)  # Enable faster inference
        print("Marketing Agent model loaded successfully!")
    except Exception as e:
        print(f"Error loading Marketing Agent model: {e}")
        raise e

# Marketing Agent API endpoints
@marketing_app.get("/")
async def marketing_root():
    return {"message": "Marketing Agent API is running. Send POST requests to /generate endpoint."}

@marketing_app.post("/generate", response_model=GenerationResponse)
async def generate_marketing_text(request: GenerationRequest):
    global marketing_model, marketing_tokenizer

    if marketing_model is None or marketing_tokenizer is None:
        raise HTTPException(status_code=503, detail="Marketing Agent model not loaded yet. Please try again later.")

    try:
        # Format the prompt
        formatted_prompt = alpaca_prompt.format(
            request.instruction,
            request.input_text,
            ""  # Leave output blank for generation
        )

        # Tokenize input
        inputs = marketing_tokenizer([formatted_prompt], return_tensors="pt").to("cuda")

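        # Generate text
        with torch.no_grad():
            outputs = marketing_model.generate(
                **inputs,
                max_new_tokens=request.max_new_tokens,
                do_sample=True,  # temperature and top_p only take effect when sampling is enabled
                temperature=request.temperature,
                top_p=request.top_p,
                use_cache=True
            )

        # Decode the generated text, dropping special tokens such as <|begin_of_text|>
        full_output = marketing_tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]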

        # Extract only the response part
        response_prefix = "### Response:"
        if response_prefix in full_output:
            generated_text = full_output.split(response_prefix)[1].strip()
        else:
            generated_text = full_output

        return {"generated_text": generated_text}

    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Generation error: {str(e)}")
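
The prompt formatting and response extraction used by the `/generate` endpoint can be tried without loading the model. The sketch below reuses the Alpaca template from this cell; `build_prompt` and `extract_response` are hypothetical helper names, and `<|end_of_text|>` is an assumed end-of-sequence marker (only `<|begin_of_text|>` appears in this notebook's output). Unlike the endpoint's `split(...)[1]`, it keeps everything after the first `### Response:` marker.

```python
ALPACA_PROMPT = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

RESPONSE_MARKER = "### Response:"

def build_prompt(instruction: str, input_text: str = "") -> str:
    """Fill the template, leaving the response slot empty for the model to complete."""
    return ALPACA_PROMPT.format(instruction, input_text, "")

def extract_response(decoded: str,
                     special_tokens=("<|begin_of_text|>", "<|end_of_text|>")) -> str:
    """Return the text after the first response marker, with special tokens removed."""
    _, found, tail = decoded.partition(RESPONSE_MARKER)
    text = tail if found else decoded  # fall back to the full text if the marker is absent
    for token in special_tokens:
        text = text.replace(token, "")
    return text.strip()

# Simulated decoder output: the prompt followed by a made-up completion
decoded = ("<|begin_of_text|>"
           + build_prompt("Create a slogan", "A neighborhood coffee shop")
           + "Fresh cups, fresh ideas.<|end_of_text|>")
print(extract_response(decoded))
```

Keeping these two steps in plain functions makes it possible to check the string handling separately from GPU inference.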

## Combined FastAPI Application Setup

In [None]:
# Create a main FastAPI app to mount both APIs
main_app = FastAPI(title="Multi-Agent API System")

# Mount both APIs
main_app.mount("/project", project_app)
main_app.mount("/marketing", marketing_app)

## Function to Keep Colab Alive

In [None]:
# Function to prevent Colab from disconnecting
def keep_alive():
    import time
    import IPython.display
    from google.colab import output

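    while True:
        try:
            time.sleep(60)
            # NOTE: 'https://dummy.mp3' is a placeholder, not a real audio file. The server log
            # shows the browser rejecting it with NotSupportedError, which lands in the except branch.
            output.eval_js("new Audio('https://dummy.mp3').play();")
            IPython.display.clear_output(wait=True)
            print("Server is still running.")
        except Exception as e:
            print(f"Keep-alive error: {e}")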
            # Continue even if there's an error
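
The `while True` loop above can only be stopped by ending the process. Below is a minimal sketch of a stoppable variant built on `threading.Event`. The helper `start_heartbeat` is a hypothetical name, and the sketch only shows the threading pattern: it makes no claim that a heartbeat by itself prevents Colab from disconnecting.

```python
import threading

def start_heartbeat(interval_s, on_beat, stop_event):
    """Call on_beat every interval_s seconds in a daemon thread until stop_event is set."""
    def _loop():
        # Event.wait returns True as soon as stop_event is set, which ends the loop
        while not stop_event.wait(interval_s):
            try:
                on_beat()
            except Exception as exc:
                # Keep the loop running even if a single beat fails
                print(f"Heartbeat error: {exc}")

    thread = threading.Thread(target=_loop, daemon=True)
    thread.start()
    return thread

stop = threading.Event()
worker = start_heartbeat(60, lambda: print("Server is still running."), stop)

# Later, for example after the server exits, stop the loop cleanly
stop.set()
worker.join(timeout=1)
```

Using `stop_event.wait` instead of `time.sleep` lets the thread react to the stop request immediately rather than after the full interval.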

## Test the Marketing Agent Model

In [None]:
# Load and test the Marketing Agent model
def test_marketing_model():
    try:
        # Test model with a sample generation
        print("\nTesting Marketing Agent model with a sample prompt:")
        test_instruction = "Venture Force - A Multi Agent Framework for Early Age Startups"
        test_input = """Large Language Model (LLM) dialogue agents have unveiled unforeseen limitations in specific domains
        due to their generalized training data with typical problems i.e poor contextual parsing, lack of domain
        knowledge, factual inaccuracies, ethical dilemmas, bias propagation, and hallucinations."""

        formatted_prompt = alpaca_prompt.format(test_instruction, test_input, "")
        inputs = marketing_tokenizer([formatted_prompt], return_tensors="pt").to("cuda")
        text_streamer = TextStreamer(marketing_tokenizer)
        print("Generated sample:")
        _ = marketing_model.generate(
            **inputs,
            streamer=text_streamer,
            max_new_tokens=100,
            temperature=0.7,
            top_p=0.9,
            use_cache=True
        )
    except Exception as e:
        print(f"Error testing marketing model: {e}")
        print("Continuing with API setup despite testing error...")

# Try to load Marketing model with error handling
try:
    load_marketing_model()
    test_marketing_model()
except Exception as e:
    print(f"Error loading or testing marketing model: {e}")
    print("The Marketing API may not function correctly, but we'll continue setting up the server.")

Loading model for Marketing Agent...
==((====))==  Unsloth 2025.3.19: Fast Llama patching. Transformers: 4.50.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Marketing Agent model loaded successfully!

Testing Marketing Agent model with a sample prompt:
Generated sample:
<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Venture Force - A Multi Agent Framework for Early Age Startups

### Input:
Large Language Model (LLM) dialogue agents have unveiled unforeseen limitations in specific domains
        due to their generalized training data with typical problems i.e poor contextual parsing, lack of domain
        knowledge, factual inaccuracies, ethical dilemmas, bias propagation, and hallucinations.

### Response:
Develop a Venture Force - a multi-agent framework integrating AI, human experts, and user feedback to address
        domain-specific challenges. Utilize a combination of LLMs, rule-based systems, and knowledge graphs to
        provide accurate, context-specific, and up-to-date information. 

## Run the Combined Server

In [None]:
# Start ngrok with error handling
try:
    ngrok_tunnel = ngrok.connect(8000)
    print(f"\nPublic URL: {ngrok_tunnel.public_url}")
    print(f"Project Agent API Endpoint: {ngrok_tunnel.public_url}/project/generate")
    print(f"Marketing Agent API Endpoint: {ngrok_tunnel.public_url}/marketing/generate")

    # Print example curl commands.
    # Backslashes are doubled so the printed commands keep their line continuations;
    # a single backslash before a newline would be consumed by the string literal.
    print("\nExample curl commands:")
    print(f'''# Project Agent API
    curl -X 'POST' \\
      '{ngrok_tunnel.public_url}/project/generate' \\
      -H 'Content-Type: application/json' \\
      -d '{{
        "task": "Build an AI-powered customer relationship management system"
      }}'
    ''')

    print(f'''# Marketing Agent API
    curl -X 'POST' \\
      '{ngrok_tunnel.public_url}/marketing/generate' \\
      -H 'Content-Type: application/json' \\
      -d '{{
        "instruction": "Create a marketing strategy",
        "input_text": "Our startup provides AI-powered customer service solutions for small businesses",
        "max_new_tokens": 1024
      }}'
    ''')
except Exception as e:
    print(f"Error setting up ngrok: {e}")
    print("Will attempt to start server without ngrok...")

# Keep the server alive in a separate thread with error handling
thread = threading.Thread(target=keep_alive, daemon=True)
thread.start()

# Start uvicorn server
try:
    uvicorn.run(main_app, host="0.0.0.0", port=8000)
except Exception as e:
    print(f"Error starting server: {e}")


Public URL: https://68b9-35-198-236-96.ngrok-free.app
Project Agent API Endpoint: https://68b9-35-198-236-96.ngrok-free.app/project/generate
Marketing Agent API Endpoint: https://68b9-35-198-236-96.ngrok-free.app/marketing/generate

Example curl commands:
# Project Agent API
    curl -X 'POST'       'https://68b9-35-198-236-96.ngrok-free.app/project/generate'       -H 'Content-Type: application/json'       -d '{
        "task": "Build an AI-powered customer relationship management system"
      }'
    
# Marketing Agent API
    curl -X 'POST'       'https://68b9-35-198-236-96.ngrok-free.app/marketing/generate'       -H 'Content-Type: application/json'       -d '{
        "instruction": "Create a marketing strategy",
        "input_text": "Our startup provides AI-powered customer service solutions for small businesses",
        "max_new_tokens": 1024
      }'
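
The same requests as the curl commands above can be built from Python with the standard library. This is a sketch under stated assumptions: `build_generate_request` is a hypothetical helper, and the base URL is a placeholder to be replaced with the public URL printed by ngrok. Both endpoints accept only POST; the server log records a GET to `/project/generate` answered with 405 Method Not Allowed.

```python
import json
import urllib.request

def build_generate_request(base_url: str, agent: str, payload: dict) -> urllib.request.Request:
    """Build a POST request for /<agent>/generate with a JSON body."""
    url = f"{base_url.rstrip('/')}/{agent}/generate"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Placeholder base URL; substitute the public URL printed by ngrok
request = build_generate_request(
    "https://example.ngrok-free.app",
    "project",
    {"task": "Build an AI-powered customer relationship management system"},
)
print(request.get_method(), request.full_url)

# Sending is left commented out because it needs the live server:
# with urllib.request.urlopen(request, timeout=120) as resp:
#     print(json.loads(resp.read())["generated_text"])
```

For the Marketing Agent, pass `"marketing"` as the agent and a payload with `instruction`, `input_text`, and optionally `max_new_tokens`.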
    


INFO:     Started server process [225]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)


Keep-alive error: NotSupportedError: Failed to load because no supported source was found.
INFO:     2601:19b:67c:83f0:cdfc:c2ef:8119:1957:0 - "GET /project/generate HTTP/1.1" 405 Method Not Allowed
Asking about: What should be the high-level system architecture?...
Asking about: What backend and frontend technologies would be suitable?...
Asking about: What programming languages, cloud services, and third-party integrations are needed?...
Asking about: What kind of engineers and expertise are required (e.g., backend, frontend, DevOps, ML engineers, etc.)?...
Asking about: How many developers would be needed at each stage?...
INFO:     34.169.116.93:0 - "POST /project/generate HTTP/1.1" 200 OK
INFO:     34.169.116.93:0 - "POST /marketing/generate HTTP/1.1" 200 OK
Asking about: What should be the high-level system architecture?...
Asking about: What backend and frontend technologies would be suitable?...
Asking about: What programming languages, cloud services, and third-party integrati