# 🚀 Machine Translation & Document Search on Google Colab

Run the Machine Translation + Document Search system on Google Colab.

**Requirements:**
- Google account
- Colab GPU runtime (recommended)
- About 30 minutes on the first run (to download the models)

## 1️⃣ Install Dependencies

In [None]:
# Upgrade pip
!pip install --upgrade pip

# Install the required libraries
!pip install -q \
    fastapi==0.104.1 \
    uvicorn==0.24.0 \
    python-multipart==0.0.6 \
    sqlalchemy==2.0.23 \
    sentence-transformers==2.2.2 \
    torch==2.0.1 \
    transformers==4.35.0 \
    annoy==1.17.3 \
    numpy==1.24.3 \
    pydantic==2.4.2 \
    httpx==0.25.0

print("✅ Dependencies installed successfully!")
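
Because the install cell pins exact versions, it can help to confirm what actually ended up in the runtime. The following is a minimal sketch using only the standard library; the helper name `installed_versions` and the `PINNED` list are illustrative and not part of the project.

```python
from importlib import metadata

def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

# Distribution names taken from the install cell above.
PINNED = ["fastapi", "uvicorn", "sqlalchemy", "sentence-transformers",
          "torch", "transformers", "numpy", "pydantic", "httpx"]

for name, version in installed_versions(PINNED).items():
    print(f"{name:25s} {version or 'NOT INSTALLED'}")
```

Any line that shows `NOT INSTALLED`, or a version different from the pin, suggests that the install cell did not complete as intended.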

## 2️⃣ Clone the Project from GitHub (Optional)

In [None]:
# Clone the project if you want to use the code from GitHub
# !git clone https://github.com/phuocdai2004/haystack.git
# %cd haystack/backend

# Or create the project structure locally
import os
from pathlib import Path

# Create the directories
os.makedirs('haystack_colab/app/models', exist_ok=True)
os.makedirs('haystack_colab/app/routes', exist_ok=True)
os.makedirs('haystack_colab/app/services', exist_ok=True)
os.makedirs('haystack_colab/app/utils', exist_ok=True)
os.makedirs('haystack_colab/data', exist_ok=True)

print("✅ Project structure created!")

## 3️⃣ Create the Database Models

In [None]:
# Create database.py
database_code = '''
"""Database configuration for Colab"""
from sqlalchemy import create_engine, text
from sqlalchemy.orm import sessionmaker, Session
from typing import Generator
import logging

logger = logging.getLogger(__name__)

# SQLite database
DATABASE_URL = "sqlite:///./haystack.db"

engine = create_engine(
    DATABASE_URL,
    connect_args={"check_same_thread": False},
    echo=False
)

SessionLocal = sessionmaker(
    bind=engine,
    class_=Session,
    expire_on_commit=False
)

def init_db():
    """Initialize database"""
    try:
        with engine.begin() as conn:
            conn.execute(text("""
                CREATE TABLE IF NOT EXISTS document (
                    id INTEGER PRIMARY KEY AUTOINCREMENT,
                    title VARCHAR(255),
                    content TEXT NOT NULL,
                    language VARCHAR(50) NOT NULL DEFAULT 'vi',
                    doc_metadata JSON,
                    embedding BLOB,
                    created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP
                )
            """))
            
            conn.execute(text("CREATE INDEX IF NOT EXISTS idx_document_language ON document(language)"))
            conn.execute(text("CREATE INDEX IF NOT EXISTS idx_document_title ON document(title)"))
            conn.execute(text("CREATE INDEX IF NOT EXISTS idx_document_created_at ON document(created_at)"))
        
        logger.info("Database initialized successfully!")
    except Exception as e:
        logger.error(f"Database initialization error: {e}")
        raise

def get_session() -> Generator[Session, None, None]:
    """Get database session"""
    with SessionLocal() as session:
        yield session
'''

with open('haystack_colab/app/database.py', 'w') as f:
    f.write(database_code)

print("✅ Database module created!")
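
To see how the `document` table behaves without touching `haystack.db`, the same schema can be tried in a throwaway in-memory SQLite database. This is a sketch using the standard `sqlite3` module rather than the SQLAlchemy engine defined above; the function name `insert_and_fetch` is illustrative.

```python
import sqlite3

# Same table definition as in database.py.
SCHEMA = """
CREATE TABLE IF NOT EXISTS document (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title VARCHAR(255),
    content TEXT NOT NULL,
    language VARCHAR(50) NOT NULL DEFAULT 'vi',
    doc_metadata JSON,
    embedding BLOB,
    created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP
)
"""

def insert_and_fetch(title, content):
    """Insert one row into an in-memory database and read it back."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(SCHEMA)
        conn.execute(
            "INSERT INTO document (title, content) VALUES (?, ?)",
            (title, content),
        )
        conn.commit()
        return conn.execute(
            "SELECT id, title, language, embedding, created_at FROM document"
        ).fetchone()
    finally:
        conn.close()

row = insert_and_fetch("Sample", "Xin chào")
print(row)
```

The returned row shows the column defaults at work: `language` falls back to `'vi'`, `embedding` stays `NULL` until one is supplied, and `created_at` is filled in by SQLite.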

## 4️⃣ Create the Embedding Service

In [None]:
# Create embedding_service.py
embedding_service_code = '''
"""Embedding service for semantic search"""
from sentence_transformers import SentenceTransformer
import numpy as np
import logging

logger = logging.getLogger(__name__)

embedding_model = None

def init_embeddings():
    """Initialize embedding model"""
    global embedding_model
    if embedding_model is None:
        try:
            logger.info("Loading embedding model...")
            embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
            logger.info(f"✓ Model loaded: {embedding_model.get_sentence_embedding_dimension()} dims")
        except Exception as e:
            logger.error(f"Failed to load model: {e}")
            embedding_model = None
    return embedding_model

def get_embedding(text: str) -> np.ndarray:
    """Generate embedding for text"""
    model = init_embeddings()
    if model is None:
        raise RuntimeError("Embedding model not available")
    
    embedding = model.encode(text, convert_to_numpy=True)
    return embedding

def get_embeddings_batch(texts: list) -> list:
    """Generate embeddings for multiple texts"""
    model = init_embeddings()
    if model is None:
        raise RuntimeError("Embedding model not available")
    
    embeddings = model.encode(texts, convert_to_numpy=True)
    return embeddings.tolist()
'''

os.makedirs('haystack_colab/app/services', exist_ok=True)
# The module contains non-ASCII characters, so write it explicitly as UTF-8
with open('haystack_colab/app/services/embedding_service.py', 'w', encoding='utf-8') as f:
    f.write(embedding_service_code)

print("✅ Embedding service created!")

## 5️⃣ Create the Translation Service

In [None]:
# Create translation_service.py
translation_service_code = '''
"""Translation service"""
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import logging

logger = logging.getLogger(__name__)

translation_models = {}

def load_translation_model(model_name: str):
    """Load translation model"""
    if model_name not in translation_models:
        try:
            logger.info(f"Loading {model_name}...")
            tokenizer = AutoTokenizer.from_pretrained(model_name)
            model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
            translation_models[model_name] = (tokenizer, model)
            logger.info(f"✓ Model loaded: {model_name}")
        except Exception as e:
            logger.error(f"Failed to load model {model_name}: {e}")
            raise
    return translation_models[model_name]

def translate(text: str, source_lang: str, target_lang: str) -> str:
    """Translate text"""
    if source_lang == "en" and target_lang == "vi":
        model_name = "Helsinki-NLP/opus-mt-en-vi"
    elif source_lang == "vi" and target_lang == "en":
        model_name = "Helsinki-NLP/opus-mt-vi-en"
    else:
        raise ValueError(f"Language pair {source_lang}-{target_lang} not supported")
    
    tokenizer, model = load_translation_model(model_name)
    
    inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
    outputs = model.generate(**inputs, max_length=512)
    result = tokenizer.decode(outputs[0], skip_special_tokens=True)
    
    return result
'''

# The module contains non-ASCII characters, so write it explicitly as UTF-8
with open('haystack_colab/app/services/translation_service.py', 'w', encoding='utf-8') as f:
    f.write(translation_service_code)

print("✅ Translation service created!")
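
The `translate` function tokenizes with `max_length=512, truncation=True`, so anything beyond that limit is silently dropped. One possible workaround is to split long input into smaller pieces and translate them one by one. The sketch below groups sentences by character count, which is only a rough stand-in for the token limit; `split_into_chunks` and the `max_chars` value are assumptions, not part of the service.

```python
import re

def split_into_chunks(text, max_chars=400):
    """Group sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole, so the
    tokenizer's own truncation would still apply to it.
    """
    # Split after '.', '!' or '?' when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

sample = "First sentence. Second sentence! Third one? Fourth."
print(split_into_chunks(sample, max_chars=35))
# → ['First sentence. Second sentence!', 'Third one? Fourth.']
```

Each chunk could then be passed to `translate(chunk, source_lang, target_lang)` and the results joined with spaces.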

## 6️⃣ Create the FastAPI Application

In [None]:
# Create main.py
main_code = '''
"""Main FastAPI application"""
from fastapi import FastAPI, UploadFile, File, HTTPException
from fastapi.responses import JSONResponse
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel
import logging
import sys
import os

# Add this file's directory to the import path
sys.path.insert(0, os.path.dirname(__file__))

from app.database import init_db, SessionLocal, engine
from app.services.translation_service import translate
from app.services.embedding_service import get_embedding, get_embeddings_batch

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

app = FastAPI(title="Translation & Document Search API")

# CORS middleware
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# Request/Response models
class TranslationRequest(BaseModel):
    text: str
    source_lang: str
    target_lang: str

class TranslationResponse(BaseModel):
    source: str
    target: str
    source_lang: str
    target_lang: str

class EmbeddingRequest(BaseModel):
    texts: list[str]

class DocumentUploadRequest(BaseModel):
    title: str
    content: str
    language: str = "vi"

# Initialize database
@app.on_event("startup")
async def startup():
    logger.info("Initializing database...")
    init_db()
    logger.info("Application startup complete")

# Routes
@app.get("/health")
async def health_check():
    return {"status": "ok", "message": "API is running on Colab!"}

@app.post("/api/translate")
async def translate_text(request: TranslationRequest):
    try:
        result = translate(request.text, request.source_lang, request.target_lang)
        return TranslationResponse(
            source=request.text,
            target=result,
            source_lang=request.source_lang,
            target_lang=request.target_lang
        )
    except Exception as e:
        logger.error(f"Translation error: {e}")
        raise HTTPException(status_code=500, detail=str(e))

@app.post("/api/embed")
async def get_embeddings(request: EmbeddingRequest):
    try:
        embeddings = get_embeddings_batch(request.texts)
        return {
            "texts": request.texts,
            "embeddings": embeddings,
            # Report the actual vector size (384 for all-MiniLM-L6-v2)
            "dimension": len(embeddings[0]) if embeddings else 0
        }
    except Exception as e:
        logger.error(f"Embedding error: {e}")
        raise HTTPException(status_code=500, detail=str(e))

@app.post("/api/documents")
async def upload_document(request: DocumentUploadRequest):
    try:
        from sqlalchemy import text
        
        # Create the embedding
        embedding = get_embedding(request.content)
        
        # Save to the database
        with SessionLocal() as session:
            session.execute(text("""
                INSERT INTO document (title, content, language, embedding)
                VALUES (:title, :content, :language, :embedding)
            """), {
                "title": request.title,
                "content": request.content,
                "language": request.language,
                "embedding": embedding.tobytes()
            })
            session.commit()
        
        return {"status": "success", "message": "Document uploaded"}
    except Exception as e:
        logger.error(f"Upload error: {e}")
        raise HTTPException(status_code=500, detail=str(e))

if __name__ == "__main__":
    import uvicorn
    logger.info("Starting FastAPI server...")
    uvicorn.run(app, host="0.0.0.0", port=8000)
'''

with open('haystack_colab/main.py', 'w', encoding='utf-8') as f:
    f.write(main_code)

print("✅ Main application created!")
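
The upload endpoint stores `embedding.tobytes()` in the `embedding` BLOB column. Raw bytes carry neither dtype nor shape, so reading a vector back requires knowing both. Below is a minimal sketch of the round trip, assuming the vectors are one-dimensional float32 arrays (the usual output of `model.encode`, but worth verifying in your runtime). The helper names are hypothetical.

```python
import numpy as np

def embedding_to_blob(vector):
    """Serialize a vector as raw float32 bytes, mirroring embedding.tobytes()."""
    return np.asarray(vector, dtype=np.float32).tobytes()

def blob_to_embedding(blob):
    """Rebuild the float32 vector from the bytes stored in the BLOB column."""
    return np.frombuffer(blob, dtype=np.float32)

original = np.array([0.25, -1.5, 3.0], dtype=np.float32)
blob = embedding_to_blob(original)
restored = blob_to_embedding(blob)
print(len(blob), restored)  # 3 values * 4 bytes each = 12 bytes
```

Note that `np.frombuffer` returns a read-only view of the bytes; call `.copy()` on the result if the vector needs to be modified.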

## 7️⃣ Create the __init__.py Files

In [None]:
# Create the __init__.py files
import os

for dir_path in [
    'haystack_colab',
    'haystack_colab/app',
    'haystack_colab/app/models',
    'haystack_colab/app/routes',
    'haystack_colab/app/services',
    'haystack_colab/app/utils'
]:
    init_file = os.path.join(dir_path, '__init__.py')
    if not os.path.exists(init_file):
        with open(init_file, 'w') as f:
            f.write('')

print("✅ __init__.py files created!")

## 8️⃣ Download Models (Optional)

In [None]:
# Download the models now to save time later
print("⏳ Downloading translation models (this may take a few minutes)...")

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

failed = []  # labels of the downloads that raised an error

try:
    print("Downloading Helsinki-NLP/opus-mt-en-vi...")
    AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-vi")
    AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-vi")
    print("✓ EN→VI model downloaded")
except Exception as e:
    failed.append("EN→VI")
    print(f"⚠️ Error downloading EN→VI: {e}")

try:
    print("\nDownloading Helsinki-NLP/opus-mt-vi-en...")
    AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-vi-en")
    AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-vi-en")
    print("✓ VI→EN model downloaded")
except Exception as e:
    failed.append("VI→EN")
    print(f"⚠️ Error downloading VI→EN: {e}")

try:
    print("\nDownloading Sentence Transformers embeddings model...")
    from sentence_transformers import SentenceTransformer
    SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
    print("✓ Embeddings model downloaded")
except Exception as e:
    failed.append("embeddings")
    print(f"⚠️ Error downloading embeddings: {e}")

# Report success only if every download completed
if failed:
    print(f"\n⚠️ Downloads failed: {', '.join(failed)}")
else:
    print("\n✅ All models downloaded!")

## 9️⃣ Test Translation API

In [None]:
# Test translation
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

print("Testing Translation API...\n")

# Test EN → VI
print("📝 Test 1: Translate English → Vietnamese")
model_name_en_vi = "Helsinki-NLP/opus-mt-en-vi"
tokenizer_en_vi = AutoTokenizer.from_pretrained(model_name_en_vi)
model_en_vi = AutoModelForSeq2SeqLM.from_pretrained(model_name_en_vi)

text_en = "Hello, how are you today?"
inputs = tokenizer_en_vi(text_en, return_tensors="pt", max_length=512, truncation=True)
outputs = model_en_vi.generate(**inputs, max_length=512)
text_vi = tokenizer_en_vi.decode(outputs[0], skip_special_tokens=True)

print(f"Input (EN): {text_en}")
print(f"Output (VI): {text_vi}")
print()

# Test VI → EN
print("📝 Test 2: Translate Vietnamese → English")
model_name_vi_en = "Helsinki-NLP/opus-mt-vi-en"
tokenizer_vi_en = AutoTokenizer.from_pretrained(model_name_vi_en)
model_vi_en = AutoModelForSeq2SeqLM.from_pretrained(model_name_vi_en)

text_vi_input = "Xin chào, bạn khỏe không?"
inputs = tokenizer_vi_en(text_vi_input, return_tensors="pt", max_length=512, truncation=True)
outputs = model_vi_en.generate(**inputs, max_length=512)
text_en_output = tokenizer_vi_en.decode(outputs[0], skip_special_tokens=True)

print(f"Input (VI): {text_vi_input}")
print(f"Output (EN): {text_en_output}")
print()

print("✅ Translation API working!")

## 🔟 Test Embedding API

In [None]:
# Test embeddings
from sentence_transformers import SentenceTransformer
import numpy as np

print("Testing Embeddings API...\n")

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

texts = [
    "Machine learning is a subset of artificial intelligence",
    "Deep learning uses neural networks",
    "Python is a programming language"
]

embeddings = model.encode(texts)

print(f"Model dimension: {embeddings[0].shape}")
print(f"Number of texts: {len(texts)}")
print()

for i, text in enumerate(texts):
    print(f"Text {i+1}: {text[:50]}...")
    print(f"Embedding shape: {embeddings[i].shape}")
    print(f"First 5 values: {embeddings[i][:5]}")
    print()

print("✅ Embeddings API working!")
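
Embeddings become useful for document search once they are compared with a query vector. The notebook does not include a search step, so the following is only a sketch of one common approach, ranking by cosine similarity. It assumes non-zero vectors; `rank_by_cosine` is a hypothetical helper, and toy 2-D vectors stand in for the real 384-dimensional ones.

```python
import numpy as np

def rank_by_cosine(query_vec, doc_vecs):
    """Return (indices sorted by descending cosine similarity, similarities)."""
    query = np.asarray(query_vec, dtype=np.float32)
    docs = np.asarray(doc_vecs, dtype=np.float32)
    # Normalize to unit length so the dot product equals cosine similarity.
    query = query / np.linalg.norm(query)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    scores = docs @ query
    order = np.argsort(-scores)  # negate to sort from highest to lowest
    return order, scores

# Toy vectors: document 0 points almost the same way as the query.
toy_docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
order, scores = rank_by_cosine([1.0, 0.1], toy_docs)
print(order, scores)
```

With the cell above, a call such as `rank_by_cosine(model.encode("neural networks"), embeddings)` would rank the three sample texts against a query.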

## 1️⃣1️⃣ Run the FastAPI Server on Colab

In [None]:
# Install pyngrok to expose the API through an ngrok tunnel
!pip install -q pyngrok

from pyngrok import ngrok
import os

# Sign up at https://dashboard.ngrok.com to get an auth token
# ngrok.set_auth_token("YOUR_NGROK_AUTH_TOKEN")

print("⏳ Starting FastAPI server...")
print("Tip: an ngrok auth token is required to run the server directly")
print("Alternatively, you can test the API locally inside Colab")

## 1️⃣2️⃣ Test the API Locally (Recommended for Colab)

In [None]:
# Test the API functions locally without running a server
print("Testing API Functions Directly...\n")

# Test 1: Translation
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

def translate_func(text, source_lang, target_lang):
    if source_lang == "en" and target_lang == "vi":
        model_name = "Helsinki-NLP/opus-mt-en-vi"
    elif source_lang == "vi" and target_lang == "en":
        model_name = "Helsinki-NLP/opus-mt-vi-en"
    else:
        raise ValueError("Language pair not supported")
    
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    
    inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
    outputs = model.generate(**inputs, max_length=512)
    result = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return result

# Test translations
test_cases = [
    ("Good morning", "en", "vi"),
    ("Thank you very much", "en", "vi"),
    ("Tôi yêu lập trình", "vi", "en"),
]

for text, source, target in test_cases:
    result = translate_func(text, source, target)
    print(f"[{source.upper()}→{target.upper()}] {text}")
    print(f"Result: {result}")
    print()

print("✅ All tests passed!")
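
`translate_func` above calls `from_pretrained` on every invocation, so each test case reloads the tokenizer and model. One way to avoid that is to memoize the loader with `functools.lru_cache`. In this sketch `_load_from_hub` is a placeholder that only records its calls so the caching is visible; in the notebook it would return the `AutoTokenizer` / `AutoModelForSeq2SeqLM` pair instead.

```python
from functools import lru_cache

LOAD_CALLS = []  # records each real load, only to make the caching visible

def _load_from_hub(model_name):
    """Placeholder loader (assumption): stands in for the from_pretrained calls."""
    LOAD_CALLS.append(model_name)
    return ("tokenizer for " + model_name, "model for " + model_name)

@lru_cache(maxsize=None)
def get_model(model_name):
    """Load a tokenizer/model pair once per name and reuse it afterwards."""
    return _load_from_hub(model_name)

get_model("Helsinki-NLP/opus-mt-en-vi")
get_model("Helsinki-NLP/opus-mt-en-vi")  # served from the cache
get_model("Helsinki-NLP/opus-mt-vi-en")
print(LOAD_CALLS)
# → ['Helsinki-NLP/opus-mt-en-vi', 'Helsinki-NLP/opus-mt-vi-en']
```

This plays the same role as the `translation_models` dictionary in `translation_service.py`, with the bookkeeping handled by the decorator.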

## 📌 Using the Notebook on Colab

### ✅ Advantages
- ✓ Free GPU from Google
- ✓ Nothing to install on your own machine
- ✓ Models are downloaded automatically
- ✓ Easy to share and collaborate on

### ⚠️ Limitations
- ⚠ Each session runs for only about 12 hours
- ⚠ Everything must be recomputed after a restart
- ⚠ No persistent file system
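
Since the file system does not survive a session, files such as `haystack.db` need to be copied somewhere durable before the runtime shuts down. Here is a minimal sketch of a timestamped backup helper. The name `backup_file` is hypothetical, and the destination is assumed to be a folder that outlives the session, for example inside a Google Drive mounted with `google.colab.drive.mount`. The demonstration uses temporary paths so that it runs anywhere.

```python
import shutil
import tempfile
from datetime import datetime
from pathlib import Path

def backup_file(src, dest_dir):
    """Copy src into dest_dir under a timestamped name and return the new path."""
    src = Path(src)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = dest_dir / f"{src.stem}-{stamp}{src.suffix}"
    shutil.copy2(src, target)  # copy2 also preserves file timestamps
    return target

# Demonstration with temporary paths instead of a real Drive folder.
with tempfile.TemporaryDirectory() as tmp:
    db = Path(tmp) / "haystack.db"
    db.write_bytes(b"demo")
    copy = backup_file(db, Path(tmp) / "backups")
    copied_bytes = copy.read_bytes()
    copy_name = copy.name

print(copy_name)
```

Restoring is the reverse copy at the start of the next session, before `init_db()` runs.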

### 🚀 Running the Server Directly (Advanced)

```python
# Requires ngrok
# 1. Sign up at https://dashboard.ngrok.com
# 2. Get an auth token
# 3. Run the cell below
```

## 1️⃣3️⃣ Run the Full Server (Advanced, Requires ngrok)

In [None]:
# How to run the full server on Colab

# Raw string so that the "\n" escapes inside the script are printed literally
full_server_code = r'''
import subprocess
import sys
import time

# Step 1: Set up ngrok
print("Setting up ngrok...")
subprocess.run([sys.executable, "-m", "pip", "install", "-q", "pyngrok"], check=True)

from pyngrok import ngrok

# ⚠️ Replace YOUR_AUTH_TOKEN with your token from https://dashboard.ngrok.com
AUTH_TOKEN = "YOUR_AUTH_TOKEN"
ngrok.set_auth_token(AUTH_TOKEN)

# Step 2: Start the FastAPI server from the directory that contains main.py
print("Starting FastAPI server...")
server = subprocess.Popen(
    [sys.executable, "-m", "uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"],
    cwd="haystack_colab",
)

# Step 3: Expose the server with ngrok
print("Exposing with ngrok...")
tunnel = ngrok.connect(8000, "http")
public_url = tunnel.public_url  # connect() returns a tunnel object, not a string
print(f"\n✅ Public URL: {public_url}")
print(f"API docs are available at: {public_url}/docs")

# Step 4: Keep the server running (interrupt the cell to stop it)
try:
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    server.terminate()
    ngrok.disconnect(public_url)
'''

print("Code for running the full server on Colab:")
print(full_server_code)
print("\n⚠️ Edit AUTH_TOKEN before running!")
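
The server script opens the tunnel right after launching uvicorn, but the application may need some time to start before it accepts connections. A small polling helper can bridge that gap. The sketch below uses only the standard library; `wait_for_port` and its default timings are assumptions, not part of the project.

```python
import socket
import time

def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, else False."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Nothing is listening yet: wait briefly and try again.
            time.sleep(interval)
    return False

# Self-contained demonstration: listen on a free local port, then probe it.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
demo_port = listener.getsockname()[1]
is_open = wait_for_port("127.0.0.1", demo_port, timeout=5.0)
listener.close()
print(is_open)
```

In the server script, a call such as `wait_for_port("127.0.0.1", 8000)` would fit between starting uvicorn and calling `ngrok.connect`.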

## 📊 Summary

| Feature | Colab | Local Machine |
|---------|-------|---------------|
| GPU | ✅ Free | Depends on hardware |
| Setup | ✅ Automatic | ⚠️ Manual |
| Persistence | ⚠️ 12 h per session | ✅ Permanent |
| Server | ✅ Yes (via ngrok) | ✅ Native |
| Cost | ✅ Free | ⚠️ Electricity |

## 🎯 Recommendations
- **For demos**: use Colab (fast, free)
- **For production**: run on a local machine or a dedicated server