# RAG + Rule-based Itinerary Generator

This notebook:
- Loads the `tourism_places` data from PostgreSQL.
- Encodes records with `sentence-transformers` (multilingual-e5-small).
- Builds a FAISS vector store (local, in-memory).
- Uses RAG (vector search) to retrieve relevant context.
- Builds a **rule-based** itinerary (split into morning/noon/afternoon/evening) from the **selected places**, using their metadata fields (highlights, activities, duration_recommend, weather_notes, special_for, ...).

In [2]:
!pip install sentence-transformers faiss-cpu python-dotenv sqlalchemy psycopg2-binary



## imports + dotenv

In [3]:
import os
import sys
from typing import List, Dict, Any
from dotenv import load_dotenv
load_dotenv()

# ML / DB libraries
from sentence_transformers import SentenceTransformer
import numpy as np
import faiss
from sqlalchemy import create_engine, MetaData, Table, text



## DB connection note

In [4]:
DATABASE_URL = os.getenv("DATABASE_URL") or "postgresql+psycopg2://postgres:abc123@localhost:5432/tourismdb"

engine = create_engine(DATABASE_URL)
print("Engine:", engine)

def load_postgres_data_dynamic(engine, table_names: List[str]) -> List[Dict[str, Any]]:
    """
    Load rows from given table names. Return list of dicts: {"id": "<table>:<id>", "record": row_dict}
    """
    metadata = MetaData()
    results = []
    with engine.begin() as conn:
        for tbl_name in table_names:
            table = Table(tbl_name, metadata, autoload_with=engine)
            rows = conn.execute(table.select()).fetchall()
            for r in rows:
                row_dict = dict(r._mapping)
                rid = row_dict.get("id", None)
                results.append({
                    "id": f"{tbl_name}:{rid}",
                    "record": row_dict
                })
    return results

# Test connection quickly (optional)
try:
    with engine.connect() as conn:
        r = conn.execute(text("SELECT 1")).fetchone()
        print("DB connected:", r)
except Exception as e:
    print("DB connection error:", e)

Engine: Engine(postgresql+psycopg2://postgres:***@localhost:5432/tourismdb)
DB connected: (1,)


## Embedding model

In [5]:
embed_model = SentenceTransformer("intfloat/multilingual-e5-small")

def record_to_text(record: Dict[str, Any]) -> str:
    """
    Convert a DB record dict to a single text chunk for embedding.
    Keep the most useful fields for retrieval: name, description, highlights, activities, tags, province, category.
    """
    parts = []
    if record.get("name"):
        parts.append(str(record["name"]))
    if record.get("province"):
        parts.append(str(record["province"]))
    if record.get("category"):
        parts.append(str(record["category"]))
    # highlights
    if record.get("highlights"):
        if isinstance(record["highlights"], list):
            parts.append("Highlights: " + ", ".join(map(str, record["highlights"])))
        else:
            parts.append("Highlights: " + str(record["highlights"]))
    # activities
    if record.get("activities"):
        if isinstance(record["activities"], list):
            parts.append("Activities: " + ", ".join(map(str, record["activities"])))
        else:
            parts.append("Activities: " + str(record["activities"]))
    if record.get("description"):
        parts.append(str(record["description"]))
    if record.get("tags"):
        if isinstance(record["tags"], list):
            parts.append("Tags: " + ", ".join(map(str, record["tags"])))
        else:
            parts.append("Tags: " + str(record["tags"]))
    return " | ".join(parts)

def embed_texts(texts: List[str]) -> np.ndarray:
    emb = embed_model.encode(texts, convert_to_numpy=True, show_progress_bar=False)
    return emb.astype("float32")
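
The model card for `intfloat/multilingual-e5-small` asks for every input to start with either `query: ` or `passage: `, while `embed_texts` above encodes the raw text. The sketch below shows one way the prefixes could be added before encoding. The helper names `as_passages` and `as_query` are hypothetical and the sample strings are illustrative only.

```python
from typing import List

# Prefixes described in the multilingual-e5 model card.
PASSAGE_PREFIX = "passage: "
QUERY_PREFIX = "query: "


def as_passages(texts: List[str]) -> List[str]:
    """Prefix documents that will be stored in the index (hypothetical helper)."""
    return [PASSAGE_PREFIX + t for t in texts]


def as_query(text: str) -> str:
    """Prefix a user query before it is embedded for search (hypothetical helper)."""
    return QUERY_PREFIX + text


# Illustrative inputs; in the notebook these would come from record_to_text(...)
passages = as_passages(["Hang Én | cave | Activities: trekking"])
query = as_query("cave trekking")
print(passages[0])
print(query)
```

With helpers like these, the index would be built from `embed_texts(as_passages(texts))` and a search would embed `as_query(text)`. Whether this improves retrieval on this dataset would need to be measured.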

## VectorStore wrapper

In [6]:
class VectorStore:
    def __init__(self, vectors: np.ndarray, records: List[Dict[str, Any]]):
        """
        vectors: numpy array (N, D)
        records: parallel list of dicts (each contains 'id' and 'record')
        """
        self.records = records
        d = vectors.shape[1]
        self.index = faiss.IndexFlatL2(d)
        self.index.add(vectors)

    def search_by_vector(self, q_vec: np.ndarray, top_k: int = 5):
        if q_vec.ndim == 1:
            q_vec = q_vec.reshape(1, -1)
        distances, indices = self.index.search(q_vec, top_k)
        out = []
        for idx in indices[0]:
            if 0 <= idx < len(self.records):
                out.append(self.records[idx])
        return out

    def search_by_text(self, text: str, top_k: int = 5):
        q_emb = embed_texts([text])
        return self.search_by_vector(q_emb, top_k=top_k)
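
`VectorStore` ranks hits by L2 distance on the raw embeddings. If cosine similarity is preferred, one option is to L2-normalize the vectors first (for example with `normalize_embeddings=True` in `SentenceTransformer.encode`), because for unit vectors the squared L2 distance equals `2 - 2 * cosine`, so both give the same ranking. The pure-Python sketch below only illustrates that identity on toy vectors; it is not part of the notebook's pipeline.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


# Toy query and documents (illustrative values only)
q = l2_normalize([1.0, 2.0, 0.0])
docs = [l2_normalize(v) for v in ([2.0, 4.0, 0.1], [0.0, 1.0, 5.0], [1.0, 0.0, 0.0])]

# Ascending L2 distance vs. descending cosine similarity
by_l2 = sorted(range(len(docs)), key=lambda i: squared_l2(q, docs[i]))
by_cos = sorted(range(len(docs)), key=lambda i: -dot(q, docs[i]))
print(by_l2, by_cos)
```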

## Build store from DB

In [7]:
# Load tourism_places from postgres and build index (heavy step)
table_names = ["tourism_places"]
data = load_postgres_data_dynamic(engine, table_names)
print("Loaded records:", len(data))

# Convert to text and embed
texts = [record_to_text(r["record"]) for r in data]
vectors = embed_texts(texts)
print("Vectors shape:", vectors.shape)

# Build store
store = VectorStore(vectors, data)
print("Vector store ready.")

Loaded records: 387
Vectors shape: (387, 384)
Vector store ready.


## Rule-based itinerary generator (design)
Main rules:
- The user selects **place_ids** (at least one).
- Alternatively, given `province + days + preferences`, the system fetches places from the DB by province and filters them by preferences.
- RAG (vector search) is used to fetch additional relevant **contexts** (top_k).
- The rule-based scheduler:
  - Sorts places by `popularity_score` (if available) or by the order in which the user selected them.
  - Assigns 1–3 places per day, split across morning / noon / afternoon / evening (prioritizing duration_recommend).
  - Outputs, for each place: name, address, highlights, activities, duration_recommend, price_range, open_hours, seasonal_events, special_for, weather_notes.


In [None]:
from math import ceil
from collections import defaultdict

def enrich_selected_places(selected_records: List[Dict[str, Any]], rag_contexts: List[Dict[str, Any]]):
    """
    Selected_records: list of dicts each {"id":..., "record": {...}} or user-built dict.
    rag_contexts: list of RAG hit objects (each has 'record' or 'raw')
    Return enriched places list with useful fields.
    """
    # build map by id or lower-name
    by_id = {}
    by_name = {}
    for item in rag_contexts:
        raw = item.get("raw") or item.get("record") or {}
        if not isinstance(raw, dict):
            continue
        rid = raw.get("id")
        name = raw.get("name")
        if rid is not None:
            by_id[int(rid)] = raw
        if isinstance(name, str):
            by_name[name.strip().lower()] = raw

    enriched = []
    for rec in selected_records:
        # rec may come as {"id":..., "record": {...}} OR normal place dict (from ORM)
        if "record" in rec and isinstance(rec["record"], dict):
            r = rec["record"]
        else:
            r = rec
        meta = {
            "id": r.get("id"),
            "name": r.get("name"),
            "address": r.get("address"),
            "latitude": r.get("latitude"),
            "longitude": r.get("longitude"),
            "highlights": r.get("highlights") or [],
            "activities": r.get("activities") or [],
            "duration_recommend": r.get("duration_recommend"),
            "price_range": r.get("price_range"),
            "open_hours": r.get("open_hours"),
            "seasonal_events": r.get("seasonal_events") or [],
            "special_for": r.get("special_for") or [],
            "best_time_to_visit": r.get("best_time_to_visit"),
            "tags": r.get("tags") or [],
            "weather_notes": r.get("weather_notes"),
            "popularity_score": float(r.get("popularity_score") or 0.0)
        }

        # try to augment from rag_contexts (matching by id or name)
        matched = None
        try:
            if meta["id"] and int(meta["id"]) in by_id:
                matched = by_id[int(meta["id"])]
            else:
                nm = meta["name"]
                if nm and nm.strip().lower() in by_name:
                    matched = by_name[nm.strip().lower()]
        except Exception:
            matched = None

        if matched:
            # override sparse fields if present
            for f in ["highlights", "activities", "duration_recommend", "price_range", "open_hours", "seasonal_events", "special_for", "best_time_to_visit", "tags", "weather_notes", "popularity_score"]:
                if matched.get(f):
                    meta[f] = matched[f]

        enriched.append(meta)
    return enriched

def schedule_places(enriched_places: List[Dict[str, Any]], days: int):
    """
    Very simple scheduler:
    - sort by popularity_score desc
    - chunk into roughly ceil(n / days) per day
    - allocate into morning/noon/afternoon/evening depending on position and duration_recommend
    """
    if days <= 0:
        raise ValueError("days must be > 0")
    n = len(enriched_places)
    per_day = max(1, ceil(n / days))
    sorted_places = sorted(enriched_places, key=lambda x: x.get("popularity_score", 0.0), reverse=True)
    schedule = []
    idx = 0
    for d in range(1, days + 1):
        day_block = {"day": d, "slots": {"morning": [], "noon": [], "afternoon": [], "evening": []}}
        for s in range(per_day):
            if idx >= n:
                break
            place = sorted_places[idx]
            # assign slot heuristically: prefer longer durations to morning/afternoon multi-hour
            dur = str(place.get("duration_recommend") or "").lower()
            # match on the unit only; a bare digit such as "4" would also match "45 minutes"
            if "day" in dur or "ngày" in dur:
                # full-day or multi-day place: listed once, in the morning slot
                day_block["slots"]["morning"].append(place)
            else:
                # rotate slots: morning -> noon -> afternoon -> evening
                slot_order = ["morning", "noon", "afternoon", "evening"]
                slot = slot_order[(s) % 4]
                day_block["slots"][slot].append(place)
            idx += 1
        schedule.append(day_block)
    return schedule

def format_itinerary(schedule):
    lines = []
    for day in schedule:
        d = day["day"]
        lines.append(f"🗓️ **Day {d}**\n")
        for slot in ["morning", "noon", "afternoon", "evening"]:
            items = day["slots"].get(slot, [])
            if not items:
                continue
            # human-friendly (Vietnamese) slot name: morning / noon / afternoon / evening
            slot_name = {"morning":"Sáng","noon":"Trưa","afternoon":"Chiều","evening":"Tối"}[slot]
            lines.append(f"**{slot_name}:** " + ("; ".join([p["name"] for p in items])))
            for p in items:
                # Vietnamese labels: activities, tips, highlights, duration, reference price, opening hours
                if p.get("activities"):
                    lines.append(f"- Hoạt động: {', '.join(p.get('activities'))}")
                if p.get("weather_notes"):
                    lines.append(f"- Gợi ý: {p.get('weather_notes')}")
                if p.get("highlights"):
                    lines.append(f"- Nổi bật: {', '.join(p.get('highlights'))}")
                if p.get("duration_recommend"):
                    lines.append(f"- Thời gian: {p.get('duration_recommend')}")
                if p.get("price_range"):
                    lines.append(f"- Giá tham khảo: {p.get('price_range')}")
                if p.get("open_hours"):
                    lines.append(f"- Giờ mở: {p.get('open_hours')}")
                # blank line between places
                lines.append("")
        lines.append("")  # extra blank line per day
    return "\n".join(lines)
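
The scheduler decides whether a place is a long visit by looking for substrings in `duration_recommend`. A slightly more robust alternative is to parse the number of days out of the string. The sketch below is one possible approach, assuming durations are written like `4 ngày 3 đêm` or `2 days`; the function names are hypothetical and other formats (for example "half a day") are not handled.

```python
import re
import unicodedata
from typing import Optional

# A number followed by the Vietnamese or English word for "day"
_DAY_PATTERN = re.compile(r"(\d+)\s*(?:ngày|days?)\b", re.IGNORECASE)


def parse_duration_days(duration: Optional[str]) -> int:
    """Return the number of days mentioned in a duration string, or 0 if none."""
    if not duration:
        return 0
    # NFC normalization so decomposed accents still match the pattern
    text = unicodedata.normalize("NFC", str(duration)).lower()
    match = _DAY_PATTERN.search(text)
    return int(match.group(1)) if match else 0


def is_full_day_or_longer(duration: Optional[str]) -> bool:
    """True when the duration mentions at least one whole day."""
    return parse_duration_days(duration) >= 1


for sample in ["4 ngày 3 đêm", "2 ngày 1 đêm", "45 minutes", "4 hours", None]:
    print(repr(sample), parse_duration_days(sample), is_full_day_or_longer(sample))
```

Hour-based strings such as `4 hours` are reported as 0 days here, so they would fall through to the regular slot rotation.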

## Top-level generator: from selected place_ids

In [9]:
def generate_itinerary_from_place_ids(place_ids: List[int], start_date: str, end_date: str, top_k_contexts: int = 6):
    # days calculation
    from datetime import datetime
    s = datetime.strptime(start_date, "%Y-%m-%d")
    e = datetime.strptime(end_date, "%Y-%m-%d")
    if e < s:
        raise ValueError("end_date < start_date")
    days = (e - s).days + 1

    # load records for these ids from DB
    # Simple approach: query tourism_places by id
    md = MetaData()
    tbl = Table("tourism_places", md, autoload_with=engine)
    # the context manager closes the connection even if the query raises
    with engine.connect() as conn:
        rows = conn.execute(tbl.select().where(tbl.c.id.in_(place_ids))).fetchall()

    selected_records = [{"id": r._mapping.get("id"), "record": dict(r._mapping)} for r in rows]

    # RAG: build query from place names (fallback)
    names = [r["record"].get("name") for r in selected_records if r["record"].get("name")]
    query = " ".join(names)
    contexts = store.search_by_text(query, top_k=top_k_contexts)

    # Enrich selected places
    enriched = enrich_selected_places(selected_records, contexts)

    # schedule
    schedule = schedule_places(enriched, days)

    # format text
    itinerary_text = format_itinerary(schedule)

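    return {
        # province comes from the raw DB record; the enriched dicts do not carry it
        "province": selected_records[0]["record"].get("province") if selected_records else None,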
        "start_date": start_date,
        "end_date": end_date,
        "days": days,
        "selected_count": len(enriched),
        "rag_contexts_used": contexts,
        "schedule": schedule,
        "itinerary_text": itinerary_text
    }

## Run example with your sample input

In [None]:
# Example run using your sample place IDs
sample_place_ids = [339, 385]  # Hang Sơn Đoòng, Hang Én
start_date = "2025-06-01"
end_date = "2025-06-02"

out = generate_itinerary_from_place_ids(sample_place_ids, start_date, end_date, top_k_contexts=6)
print("=== ITINERARY TEXT ===\n")
print(out["itinerary_text"])

=== ITINERARY TEXT ===

🗓️ **Day 1**

**Sáng:** Hang Sơn Đoòng
- Hoạt động: Thám hiểm chuyên nghiệp, Cắm trại trong hang, Bơi hang
- Gợi ý: Yêu cầu thể lực cao
- Nổi bật: Bức tường Việt Nam, Vườn Địa Đàng, Hố sụt
- Thời gian: 4 ngày 3 đêm
- Giá tham khảo: 3.000 USD (Tour thám hiểm)
- Giờ mở: Theo tour đặt trước


🗓️ **Day 2**

**Sáng:** Hang Én
- Hoạt động: Trekking rừng, Cắm trại, Tắm suối
- Gợi ý: Đi bộ đường dài
- Nổi bật: Bãi trại trong hang, Cửa hang khổng lồ, Chim én
- Thời gian: 2 ngày 1 đêm
- Giá tham khảo: 7-9 triệu VNĐ (Tour)
- Giờ mở: Theo tour




## Conclusion: How the RAG System Works in This Project

The recommendation system uses RAG (Retrieval-Augmented Generation) to produce accurate travel itinerary suggestions that fit the user's needs, based on the available place data.
Although the project no longer uses an LLM to generate long-form text, the RAG architecture still drives the pipeline: retrieving data, filtering, reasoning, and assembling the itinerary.

### 1. Data Retrieval

- The user provides parameters such as:

    - province

    - days

    - interests, pace, group_type, time_preferences

- The system uses them to query the database and fetch the places that:

    - Match the user's interests

    - Are located in the given province

    - Are not in avoid_categories

- Retrieval yields the set of candidate places.
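
The retrieval step above can be sketched as a plain filter over place records. This is a minimal illustration, not the project's implementation: it assumes each record is a dict with `province`, `category`, and `tags` fields, that interests are matched against `tags`, and that `avoid_categories` is matched against `category`. The function name and the sample records are hypothetical.

```python
from typing import Any, Dict, Iterable, List, Optional, Set


def _as_lower_set(value: Any) -> Set[str]:
    """Normalize None, a string, or an iterable of values to a set of lowercase strings."""
    if value is None:
        return set()
    if isinstance(value, str):
        return {value.strip().lower()} if value.strip() else set()
    return {str(v).strip().lower() for v in value}


def filter_candidates(
    places: Iterable[Dict[str, Any]],
    province: str,
    interests: Optional[Iterable[str]] = None,
    avoid_categories: Optional[Iterable[str]] = None,
) -> List[Dict[str, Any]]:
    """Keep places in the province, outside avoided categories, matching any interest."""
    wanted = _as_lower_set(interests)
    avoided = _as_lower_set(avoid_categories)
    province_key = province.strip().lower()
    candidates = []
    for place in places:
        if str(place.get("province") or "").strip().lower() != province_key:
            continue
        if str(place.get("category") or "").strip().lower() in avoided:
            continue
        # With no interests given, every remaining place is a candidate
        if wanted and not (wanted & _as_lower_set(place.get("tags"))):
            continue
        candidates.append(place)
    return candidates


# Hypothetical records for illustration
places = [
    {"id": 1, "name": "Place A", "province": "Quảng Bình", "category": "cave", "tags": ["adventure", "nature"]},
    {"id": 2, "name": "Place B", "province": "Quảng Bình", "category": "nightlife", "tags": ["adventure"]},
    {"id": 3, "name": "Place C", "province": "Hà Nội", "category": "museum", "tags": ["culture"]},
    {"id": 4, "name": "Place D", "province": "Quảng Bình", "category": "beach", "tags": ["relax"]},
]
hits = filter_candidates(places, "Quảng Bình", ["adventure"], ["nightlife"])
print([p["id"] for p in hits])
```

In practice the province filter would more likely be pushed into the SQL query, with only the interest matching done in Python.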

### 2. Ranking

- Each place is scored on the following factors:

    - Degree of overlap with interests

    - Suitability for the time of day

    - Whether the place has reasonable weather_notes and duration_recommend

    - Whether the type of place fits the pace (chill/medium/full)

    - Suggestions by group_type (family, adventure, couple, ...)

- Aggregate score → priority ordering → select the best places.
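
A scoring function along these lines might look like the sketch below. It covers only three of the listed factors (interest overlap, `group_type` match against `special_for`, and presence of `weather_notes` / `duration_recommend`); time of day and pace are left out. The weights, function names, and sample records are assumptions made for illustration, since the text names the factors but not how they are combined.

```python
from typing import Any, Dict, List

# Illustrative weights (assumed, not taken from the project)
W_INTERESTS = 3.0
W_GROUP = 2.0
W_METADATA = 0.5


def score_place(place: Dict[str, Any], interests: List[str], group_type: str) -> float:
    """Combine interest overlap, group match, and metadata completeness into one score."""
    tags = {str(t).lower() for t in place.get("tags") or []}
    special_for = {str(s).lower() for s in place.get("special_for") or []}
    wanted = {i.lower() for i in interests}

    score = 0.0
    if wanted:
        # Fraction of the requested interests that the place covers
        score += W_INTERESTS * len(wanted & tags) / len(wanted)
    if group_type.lower() in special_for:
        score += W_GROUP
    # Small bonus for each descriptive field that is filled in
    score += W_METADATA * sum(1 for f in ("weather_notes", "duration_recommend") if place.get(f))
    return score


def rank_places(places: List[Dict[str, Any]], interests: List[str], group_type: str, top_n: int) -> List[Dict[str, Any]]:
    """Sort by score, highest first, and keep the top_n places."""
    ranked = sorted(places, key=lambda p: score_place(p, interests, group_type), reverse=True)
    return ranked[:top_n]


# Hypothetical records for illustration
place_a = {"name": "Place A", "tags": ["adventure", "nature"], "special_for": ["adventure"],
           "weather_notes": "dry season only", "duration_recommend": "2 days"}
place_b = {"name": "Place B", "tags": ["nature"], "special_for": ["family"],
           "weather_notes": None, "duration_recommend": "3 hours"}
place_c = {"name": "Place C", "tags": ["culture"], "special_for": []}

top = rank_places([place_c, place_b, place_a], ["adventure", "nature"], "adventure", top_n=2)
print([p["name"] for p in top])
```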

### 3. Generation

The system uses:

- Markdown templates

- Rule-based formatting

- Logic for splitting the plan by day and by time of day

For example:

- Mornings favor sightseeing and photography

- Afternoons favor exploration and physical activity

- Evenings favor rest or a city tour

The system automatically:

- Splits the schedule by days

- Allocates places according to time_preferences

- Attaches activities, highlights, and weather notes
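
The time-of-day preferences above could be expressed as keyword rules over each place's `activities`. The sketch below is a rough illustration under stated assumptions: the keyword lists are hypothetical and in English, whereas the actual data is Vietnamese and would need its own keywords. Places with no matching keyword fall back to rotating through the slots by position.

```python
from typing import Any, Dict, List

SLOT_ORDER = ["morning", "noon", "afternoon", "evening"]

# Hypothetical keyword rules, loosely following the examples in the text
SLOT_KEYWORDS = {
    "morning": ["sightseeing", "photo"],
    "afternoon": ["explor", "trek", "hik", "swim"],
    "evening": ["relax", "city tour", "night"],
}


def pick_slot(place: Dict[str, Any], fallback_index: int) -> str:
    """Return the first slot whose keywords appear in the place's activities."""
    text = " ".join(str(a).lower() for a in place.get("activities") or [])
    for slot in SLOT_ORDER:
        if any(keyword in text for keyword in SLOT_KEYWORDS.get(slot, [])):
            return slot
    # No keyword matched: rotate through the slots by position
    return SLOT_ORDER[fallback_index % len(SLOT_ORDER)]


def assign_day(places: List[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Group one day's place names by time-of-day slot."""
    slots: Dict[str, List[str]] = {slot: [] for slot in SLOT_ORDER}
    for i, place in enumerate(places):
        slots[pick_slot(place, i)].append(place["name"])
    return slots


# Hypothetical places for illustration
day = assign_day([
    {"name": "Place A", "activities": ["Photography", "Sightseeing"]},
    {"name": "Place B", "activities": ["Trekking", "Swimming"]},
    {"name": "Place C", "activities": ["City tour"]},
    {"name": "Place D", "activities": []},
])
print(day)
```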