# VNeID Voice AI Backend (Colab/Kaggle)

Run this notebook on **Kaggle** or **Google Colab** with GPU enabled.

**Setup:**
1. Enable GPU. Colab: Runtime > Change runtime type > GPU (T4). Kaggle: select a GPU accelerator in the notebook's session settings.
2. Add your `CLAUDE_API_KEY` and `NGROK_AUTH_TOKEN` to Secrets
3. Run cell 1 (Install) → **RESTART RUNTIME** → Run remaining cells
4. Copy the ngrok URL to your React Native app

**Important:** After installing dependencies, you MUST restart the runtime!
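
After the restart, it can help to confirm that the pinned versions from Step 1 are the ones actually installed. The cell below is a minimal sketch of such a check; `PINS` and `find_mismatches` are illustrative names that are not part of this notebook, and the pins are copied from the install cell.

```python
from importlib import metadata

# Pins copied from the Step 1 install cell.
PINS = {"numpy": "1.26.4", "scipy": "1.12.0"}


def find_mismatches(pins, get_version=metadata.version):
    """Return {package: installed_version_or_None} for every unsatisfied pin."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[name] = installed
    return mismatches


mismatches = find_mismatches(PINS)
if mismatches:
    print(f"Version mismatch, re-run Step 1 and restart: {mismatches}")
else:
    print("Pinned versions are active.")
```

An empty result means the installed metadata matches the pins; it does not prove that the running kernel has already re-imported them, which is what the restart is for.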

In [None]:
# Step 1: Install dependencies
# After running this cell, RESTART THE RUNTIME, then continue from Step 2

!pip uninstall -y scipy -q 2>/dev/null
!pip install -q numpy==1.26.4 scipy==1.12.0
!pip install -q flask flask-cors pyngrok anthropic
!pip install -q faster-whisper
!pip install -q edge-tts  # Microsoft Edge TTS (free, Vietnamese support)

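# Try to install VieNeu-TTS (optional; may fail due to dependency conflicts).
# A failing `!pip` shell command does not raise a Python exception, so run pip
# through subprocess and check its exit code instead of wrapping it in try/except.
import subprocess
import sys

pip_result = subprocess.run(
    [sys.executable, "-m", "pip", "install", "-q", "vieneu"],
    stderr=subprocess.DEVNULL,
)
if pip_result.returncode == 0:
    print("VieNeu-TTS installed!")
else:
    print("VieNeu-TTS skipped - will use Edge TTS")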

print("\n" + "="*50)
print("RESTART RUNTIME NOW!")
print("Kaggle: Session > Restart Session")
print("Colab: Runtime > Restart runtime")
print("Then run from Step 2 (Configuration)")
print("="*50)

In [None]:
# Step 2: Configuration (run this AFTER restarting runtime)
import os

# For Kaggle: Add secrets in "Add-ons" > "Secrets"
# For Colab: Add secrets in the key icon on the left sidebar

try:
    # Kaggle
    from kaggle_secrets import UserSecretsClient
    secrets = UserSecretsClient()
    CLAUDE_API_KEY = secrets.get_secret("CLAUDE_API_KEY")
    NGROK_AUTH_TOKEN = secrets.get_secret("NGROK_AUTH_TOKEN")
except Exception:  # not on Kaggle, or a secret is missing
    try:
        # Colab
        from google.colab import userdata
        CLAUDE_API_KEY = userdata.get('CLAUDE_API_KEY')
        NGROK_AUTH_TOKEN = userdata.get('NGROK_AUTH_TOKEN')
    except Exception:  # not on Colab, or a secret is missing
        # Manual input (replace with your keys; do not share the notebook with real keys in it)
        CLAUDE_API_KEY = "your-claude-api-key-here"
        NGROK_AUTH_TOKEN = "your-ngrok-token-here"

# Model settings
WHISPER_MODEL = "base"  # tiny, base, small, medium, large-v3
VIENEU_VOICE = "Binh"

print(f"Claude API: {'OK' if CLAUDE_API_KEY and CLAUDE_API_KEY != 'your-claude-api-key-here' else 'MISSING'}")
print(f"Ngrok: {'OK' if NGROK_AUTH_TOKEN and NGROK_AUTH_TOKEN != 'your-ngrok-token-here' else 'MISSING'}")
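
The configuration cell above tries Kaggle secrets, then Colab secrets, then placeholders. The same lookup order can be written as a list of providers, which also makes it easy to add an environment variable as a further fallback. This is only a sketch: `load_secret`, `from_kaggle`, `from_colab`, `from_env` and `PROVIDERS` are hypothetical helpers, not part of the notebook.

```python
import os


def load_secret(name, providers, placeholder=None):
    """Return the first non-empty value produced by `providers`, else `placeholder`.

    Each provider is a callable that takes the secret name. A provider that
    raises is treated as unavailable on the current platform.
    """
    for provider in providers:
        try:
            value = provider(name)
        except Exception:
            continue
        if value:
            return value
    return placeholder


def from_kaggle(name):
    from kaggle_secrets import UserSecretsClient  # only importable on Kaggle
    return UserSecretsClient().get_secret(name)


def from_colab(name):
    from google.colab import userdata  # only importable on Colab
    return userdata.get(name)


def from_env(name):
    return os.environ.get(name)


PROVIDERS = [from_kaggle, from_colab, from_env]
```

With these helpers, `CLAUDE_API_KEY = load_secret("CLAUDE_API_KEY", PROVIDERS)` returns `None` when no source has the key, which is easier to test for than a placeholder string.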

In [None]:
# Step 3: Load models
import torch
print(f"GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'CPU only'}")

# Load Whisper
print(f"\nLoading Whisper ({WHISPER_MODEL})...")
from faster_whisper import WhisperModel
whisper_model = WhisperModel(
    WHISPER_MODEL,
    device="cuda" if torch.cuda.is_available() else "cpu",
    compute_type="float16" if torch.cuda.is_available() else "int8"
)
print("Whisper loaded!")

# Try VieNeu-TTS first, fallback to Edge TTS
USE_VIENEU = False
vieneu_tts = None
vieneu_voice = None

try:
    print(f"\nTrying VieNeu-TTS...")
    from vieneu import Vieneu
    vieneu_tts = Vieneu()
    vieneu_voice = vieneu_tts.get_preset_voice(VIENEU_VOICE)
    USE_VIENEU = True
    print(f"VieNeu-TTS loaded! Voices: {vieneu_tts.list_preset_voices()}")
except Exception as e:
    print(f"VieNeu-TTS failed: {e}")
    print("Using Edge TTS (Microsoft) as fallback")
    import edge_tts
    USE_VIENEU = False

print(f"\nTTS Engine: {'VieNeu-TTS' if USE_VIENEU else 'Edge TTS (vi-VN-HoaiMyNeural)'}")

In [None]:
# Step 4: Backend server code
from flask import Flask, request, jsonify
from flask_cors import CORS
import anthropic
import numpy as np
import base64
import tempfile
import json
import re
import io
import wave
import asyncio

app = Flask(__name__)
CORS(app)

claude = anthropic.Anthropic(api_key=CLAUDE_API_KEY)
conversation_history = []
MAX_HISTORY = 10


def get_system_prompt(user_context, screen_context):
    user_info = ""
    if user_context:
        user_info = f"""
THONG TIN CONG DAN (DA XAC THUC - KHONG HOI LAI):
- Ho ten: {user_context.get('hoTen', 'N/A')}
- So CCCD: {user_context.get('cccd', 'N/A')}
- Ngay sinh: {user_context.get('ngaySinh', 'N/A')}
"""

    screen_info = ""
    if screen_context:
        screen_name = screen_context.get('screen_name', 'home')
        step = screen_context.get('current_step', 0)
        actions = screen_context.get('available_actions', [])
        screen_info = f"""
MAN HINH: {screen_name}
Buoc hien tai: {step}
Thao tac kha dung: {', '.join(actions)}
"""

    user_name = "ong/ba"
    if user_context and user_context.get('hoTen'):
        name_parts = user_context.get('hoTen', '').split()
        if name_parts:
            user_name = f"ong/ba {name_parts[-1]}"

    return f"""Ban la can bo huong dan thu tuc hanh chinh tai bo phan mot cua.

{user_info}
{screen_info}

NGUYEN TAC GIAO TIEP:
1. Van phong hanh chinh, lich su, chuyen nghiep
2. Xung "toi", goi nguoi dan la "{user_name}"
3. KHONG hoi lai thong tin da co
4. KHONG lap lai noi dung da noi
5. KHONG noi cau thua nhu "vang a", "da a", "duoc a", "de toi"
6. Khi da hieu yeu cau -> thuc hien ngay, khong hoi lai

CACH TRA LOI:
- Voi LENH/YEU CAU don gian (lam LLTP, tiep tuc, quay lai...): Tra loi 1 cau ngan + thuc hien
- Voi CAU HOI can giai thich (LLTP la gi? Can giay to gi? Bao lau?...): Tra loi DAY DU, RO RANG, CHINH XAC

VI DU TRA LOI NGAN (lenh hanh dong):
- "lam ly lich tu phap" -> "Toi mo form dang ky." + ACTION
- "2 ban" -> "Da ghi nhan." + ACTION  
- "tiep" -> "Chuyen buoc tiep." + ACTION

VI DU TRA LOI DAY DU (cau hoi):
- "Ly lich tu phap la gi?" -> "Ly lich tu phap la van ban ghi lai thong tin ve tien an, tien su cua cong dan. Day la giay to bat buoc khi xin viec, du hoc, dinh cu nuoc ngoai, hoac lam ho so phap ly."
- "Can giay to gi?" -> "De lam LLTP, {user_name} can chuan bi: 1. CCCD/CMND ban chinh, 2. Ho khau hoac giay xac nhan cu tru. Neu lam ho nguoi khac can them giay uy quyen co cong chung."
- "Mat bao lau?" -> "Thoi gian xu ly tu 10-15 ngay lam viec ke tu ngay nop ho so hop le. Truong hop cap dac biet la 3 ngay."

KHONG tra loi kieu:
- "Vang a, toi se giup..."
- "Da, de toi xem..."
- Lap lai cau hoi cua nguoi dan

KHI CAN THUC HIEN THAO TAC, them JSON cuoi cau:
@@ACTION@@{{"action": "ten_action", "data": {{}}}}@@END@@

CAC ACTION:
- navigate_lltp: Mo form Ly lich tu phap
- navigate_home: Ve trang chu  
- next_step: Chuyen buoc tiep
- prev_step: Quay lai
- fill_field: Dien thong tin (data: muc_dich, so_ban, loai_phieu)
- submit: Nop ho so
"""


def transcribe(audio_path):
    segments, _ = whisper_model.transcribe(audio_path, language="vi", beam_size=5, vad_filter=True)
    return " ".join([s.text for s in segments]).strip()


def text_to_speech_vieneu(text):
    """TTS using VieNeu-TTS"""
    audio = vieneu_tts.infer(text=text, voice=vieneu_voice, temperature=1.0, top_k=50)
    if audio.dtype in [np.float32, np.float64]:
        # Clip to [-1, 1] first so out-of-range samples do not wrap around as int16
        audio = (np.clip(audio, -1.0, 1.0) * 32767).astype(np.int16)
    
    buffer = io.BytesIO()
    with wave.open(buffer, 'wb') as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(24000)
        wav.writeframes(audio.tobytes())
    buffer.seek(0)
    return base64.b64encode(buffer.read()).decode('utf-8')


async def text_to_speech_edge_async(text):
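    """TTS using Edge TTS (async).

    Returns the streamed audio bytes base64-encoded. Unlike the VieNeu path,
    these bytes are not wrapped in a WAV container (Edge TTS streams compressed
    audio, MP3 by default), so the client should not assume WAV.
    """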
    import edge_tts
    
    communicate = edge_tts.Communicate(text, "vi-VN-HoaiMyNeural")
    buffer = io.BytesIO()
    
    async for chunk in communicate.stream():
        if chunk["type"] == "audio":
            buffer.write(chunk["data"])
    
    buffer.seek(0)
    return base64.b64encode(buffer.read()).decode('utf-8')


def text_to_speech_edge(text):
    """TTS using Edge TTS (sync wrapper)"""
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    try:
        return loop.run_until_complete(text_to_speech_edge_async(text))
    finally:
        loop.close()


def text_to_speech(text):
    """Convert text to speech"""
    if not text:
        return None
    try:
        if USE_VIENEU:
            return text_to_speech_vieneu(text)
        else:
            return text_to_speech_edge(text)
    except Exception as e:
        print(f"TTS Error: {e}")
        return None


def process_with_claude(text, user_context, screen_context):
    global conversation_history
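    conversation_history.append({"role": "user", "content": text})
    if len(conversation_history) > MAX_HISTORY * 2:
        conversation_history = conversation_history[-MAX_HISTORY * 2:]
    # Trimming an odd-length history leaves an assistant turn at the front;
    # drop it so the conversation sent to the API starts with a user turn.
    while conversation_history and conversation_history[0]["role"] != "user":
        conversation_history.pop(0)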

    try:
        response = claude.messages.create(
            model="claude-3-5-haiku-20241022",
            max_tokens=300,
            system=get_system_prompt(user_context, screen_context),
            messages=conversation_history
        )
        result = response.content[0].text
        conversation_history.append({"role": "assistant", "content": result})

        action, data = None, {}
        match = re.search(r'@@ACTION@@(.+?)@@END@@', result, re.DOTALL)
        if match:
            try:
                action_json = json.loads(match.group(1).strip())
                action = action_json.get('action')
                data = action_json.get('data', {})
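            except (json.JSONDecodeError, AttributeError):
                # Malformed or non-object action payload: return the text without an action
                pass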

        clean_text = re.sub(r'@@ACTION@@.*?@@END@@', '', result, flags=re.DOTALL).strip()
        return clean_text, action, data
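    except Exception as e:
        print(f"Claude Error: {e}")
        # Drop the unanswered user turn so the history keeps alternating roles
        if conversation_history and conversation_history[-1]["role"] == "user":
            conversation_history.pop()
        return "He thong gap loi. Vui long thu lai.", None, {}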


@app.route('/health', methods=['GET'])
def health():
    return jsonify({
        "status": "healthy", 
        "gpu": torch.cuda.is_available(),
        "tts": "vieneu" if USE_VIENEU else "edge"
    })


@app.route('/reset', methods=['POST'])
def reset():
    global conversation_history
    conversation_history = []
    return jsonify({"status": "ok"})


@app.route('/process_text', methods=['POST'])
def process_text():
    try:
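        # silent=True returns None instead of raising on a missing or invalid JSON body
        data = request.get_json(silent=True) or {}
        text = data.get('text', '')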
        user_context = data.get('user_context')
        screen_context = data.get('screen_context')

        if not text:
            return jsonify({"success": False, "error": "No text"})

        response_text, action, action_data = process_with_claude(text, user_context, screen_context)
        audio_b64 = text_to_speech(response_text)

        return jsonify({
            "success": True,
            "transcript": text,
            "response": response_text,
            "audio": audio_b64,
            "action": action,
            "data": action_data,
        })
    except Exception as e:
        return jsonify({"success": False, "error": str(e)})


@app.route('/process_voice', methods=['POST'])
def process_voice():
    try:
        if 'audio' not in request.files:
            return jsonify({"success": False, "error": "No audio"})

        audio_file = request.files['audio']
        user_context = request.form.get('user_context')
        screen_context = request.form.get('screen_context')

        if user_context:
            user_context = json.loads(user_context)
        if screen_context:
            screen_context = json.loads(screen_context)

        with tempfile.NamedTemporaryFile(delete=False, suffix='.wav') as tmp:
            audio_file.save(tmp.name)
            tmp_path = tmp.name

        try:
            transcript = transcribe(tmp_path)
        finally:
            # Remove the temp file even if transcription fails
            os.unlink(tmp_path)

        print(f"Transcript: {transcript}")

        if not transcript:
            no_hear = "Khong nghe ro. Vui long noi lai."
            return jsonify({
                "success": True,
                "transcript": "",
                "response": no_hear,
                "audio": text_to_speech(no_hear),
                "action": None,
                "data": {},
            })

        response_text, action, action_data = process_with_claude(transcript, user_context, screen_context)
        audio_b64 = text_to_speech(response_text)

        return jsonify({
            "success": True,
            "transcript": transcript,
            "response": response_text,
            "audio": audio_b64,
            "action": action,
            "data": action_data,
        })
    except Exception as e:
        import traceback
        traceback.print_exc()
        return jsonify({"success": False, "error": str(e)})


print("Flask app ready!")
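
The system prompt asks the model to append `@@ACTION@@{...}@@END@@` to its reply, and `process_with_claude` separates that tag from the spoken text. The stand-alone sketch below shows the same parsing step so it can be tried without calling the API. `split_reply` and `ACTION_RE` are illustrative names, and the sample replies are taken from the prompt's own examples.

```python
import json
import re

ACTION_RE = re.compile(r"@@ACTION@@(.*?)@@END@@", re.DOTALL)


def split_reply(reply):
    """Split a model reply into (spoken_text, action, data)."""
    action, data = None, {}
    match = ACTION_RE.search(reply)
    if match:
        try:
            payload = json.loads(match.group(1).strip())
        except json.JSONDecodeError:
            payload = None  # tag present but its body is not valid JSON
        if isinstance(payload, dict):
            action = payload.get("action")
            data = payload.get("data") or {}
    # Remove every action tag so only the spoken sentence is sent to TTS
    spoken = ACTION_RE.sub("", reply).strip()
    return spoken, action, data


reply = 'Toi mo form dang ky. @@ACTION@@{"action": "navigate_lltp", "data": {}}@@END@@'
print(split_reply(reply))
```

A reply without a tag, or with a tag whose body is not a JSON object, comes back with `action` set to `None`, so the client can treat it as speech only.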

In [None]:
# Step 5: Start ngrok tunnel and run server
from pyngrok import ngrok
import threading

# Set ngrok auth token
ngrok.set_auth_token(NGROK_AUTH_TOKEN)

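# Start ngrok tunnel. In recent pyngrok versions connect() returns an NgrokTunnel
# object, so keep only its public_url string for building endpoint URLs below.
public_url = ngrok.connect(5000).public_url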
print("=" * 60)
print("VNeID Voice AI Backend is running!")
print("=" * 60)
print(f"\n>>> PUBLIC URL: {public_url}")
print(f"\nCopy this URL to your React Native app's BACKEND_URL")
print("=" * 60)

# Run Flask in a thread
def run_flask():
    app.run(host='0.0.0.0', port=5000, threaded=True, use_reloader=False)

# daemon=True so the server thread does not block kernel shutdown
flask_thread = threading.Thread(target=run_flask, daemon=True)
flask_thread.start()

print("\nServer is running! Keep this notebook open.")
print("Check /health endpoint to verify:", f"{public_url}/health")
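
Because Flask starts in a background thread, the first request through the tunnel can arrive before the server is listening. One way to handle that is to poll `/health` a few times before handing the URL to the app. The sketch below uses only the standard library; `fetch_status` and `wait_until_healthy` are hypothetical helpers, and `base_url` is assumed to be the tunnel URL as a plain string.

```python
import json
import time
import urllib.request


def fetch_status(url, timeout=5):
    """GET `url` and return the parsed JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_until_healthy(base_url, attempts=10, delay=1.0,
                       fetch=fetch_status, sleep=time.sleep):
    """Poll `<base_url>/health` until it reports status "healthy".

    Returns the health payload, or None if every attempt failed.
    """
    for attempt in range(attempts):
        try:
            payload = fetch(f"{base_url}/health")
            if payload.get("status") == "healthy":
                return payload
        except (OSError, ValueError, AttributeError):
            pass  # server not up yet, or the response was not a JSON object
        if attempt < attempts - 1:
            sleep(delay)
    return None
```

The `fetch` and `sleep` parameters exist so the loop can be exercised without a network; in the notebook the defaults are enough.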

In [None]:
# Step 6 (Optional): Test the endpoints
import requests

# Test health
r = requests.get(f"{public_url}/health")
print("Health:", r.json())

# Test text processing
r = requests.post(f"{public_url}/process_text", json={
    "text": "Xin chao",
    "user_context": {"hoTen": "Nguyen Van A"},
    "screen_context": {"screen_name": "home"}
})
result = r.json()
print("Response:", result.get('response'))
print("Has audio:", "Yes" if result.get('audio') else "No")
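
The `audio` field in the response is base64, but its container depends on the TTS engine: the VieNeu path writes a WAV file with the `wave` module, while the Edge TTS path returns the streamed bytes as they arrive (assumed here to be MP3, the library's default output). A client can tell the two apart from the first bytes. The helper below is a sketch; `decode_audio` is not part of the backend.

```python
import base64


def decode_audio(audio_b64):
    """Decode the base64 `audio` field and guess its container from magic bytes.

    Returns (raw_bytes, extension) where extension is "wav", "mp3" or "bin".
    """
    raw = base64.b64decode(audio_b64)
    # WAV files start with a RIFF header whose form type is WAVE
    if raw[:4] == b"RIFF" and raw[8:12] == b"WAVE":
        return raw, "wav"
    # MP3 data starts with an ID3 tag or with an MPEG frame sync (11 set bits)
    if raw[:3] == b"ID3" or (
        len(raw) >= 2 and raw[0] == 0xFF and (raw[1] & 0xE0) == 0xE0
    ):
        return raw, "mp3"
    return raw, "bin"
```

For example, with the `result` dictionary from the test above, `raw, ext = decode_audio(result["audio"])` followed by writing `raw` to `f"reply.{ext}"` gives a file that a media player can open directly.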

In [None]:
# Step 7: Keep the notebook running
# Run this cell to keep the server alive
import time
print("="*60)
print("Server is running!")
print(f"URL: {public_url}")
print("="*60)
print("\nKeep this cell running. Press stop button to shutdown.")

while True:
    time.sleep(60)
    print(".", end="", flush=True)