This is an AI search-and-summary chatbot that provides real-time search (RAG), summarization, and translation based on 10B–20B-scale open-source LLMs.
- Users ask questions in natural language → the model rewrites them into search queries → the search results are synthesized to generate answers.
- Up-to-date information is supplemented with the Web Search API (Google Custom Search).
- At inference time, a prompt system separates the roles of a planner (search decision / query rewrite) and a generator (final answer synthesis).
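The planner/generator role split above can be sketched as two calls to the same LLM with different system prompts. This is a minimal illustration, not the project's actual prompts: the `llm(system=..., user=...)` callable, the prompt texts, and the JSON planner output schema are all assumptions.

```python
import json

# Hypothetical system prompts for the two roles (illustrative wording only).
PLANNER_SYSTEM = (
    "You are a search planner. Given a user question, decide whether a web "
    'search is needed and output JSON: {"search": true|false, '
    '"queries": ["..."]} with 1-5 rewritten queries.'
)
GENERATOR_SYSTEM = (
    "You are an answer generator. Using only the provided context, write a "
    "summary with key points and sources. If the evidence is insufficient, "
    'reply "Insufficient Information".'
)

def plan(llm, question: str) -> dict:
    """Call the LLM in its planner role: search decision + query rewrites."""
    raw = llm(system=PLANNER_SYSTEM, user=question)
    decision = json.loads(raw)
    # Clamp to the 1-5 query budget described in the pipeline.
    decision["queries"] = decision.get("queries", [])[:5]
    return decision

def generate(llm, question: str, context: str) -> str:
    """Call the same LLM in its generator role to synthesize the answer."""
    user = f"Question: {question}\n\nContext:\n{context}"
    return llm(system=GENERATOR_SYSTEM, user=user)
```

Keeping both roles behind one `llm` callable means the same 20B checkpoint serves both steps; only the system prompt changes.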
- Model
- LLM: GPT-OSS 20B
- Embedder: bge-small / e5-small
- Reranker: monoT5-small / e5-reranker
- Framework: PyTorch, Transformers, TRL
- Search: FAISS + BM25, Google Custom Search API
- Serving/Deployment: Docker, HF Space, Gradio
- Aim for search accuracy comparable to large-scale models using a medium-sized LLM.
- Generate reliable, evidence-grounded answers with RAG.
- Ensure freshness with a real-time search API.
- Optimize the prompt system, decoding policy, and RAG context design.
- Demonstrate SaaS-based deployment with Hugging Face Space.
- Explore scalability through research extensions (architecture level).
- Rate limiting and cache reuse
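Rate limiting for the external search API could look like the token-bucket sketch below. This is an illustrative assumption, not the project's implementation; the rate and capacity parameters are placeholders.

```python
import time

class TokenBucket:
    """Simple token-bucket rate limiter for outbound search API calls."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate              # tokens refilled per second
        self.capacity = capacity      # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Return True if a call may proceed now, consuming one token."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A denied call would fall back to the cache (see the cache storage step at the end of the pipeline) instead of hitting the API.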
- LLM (planner prompt)
- Analyzes question intent, evaluates the quality of local retrieval results, and generates 1–5 search queries when needed.
- Recommends a web search for time-sensitive or uncertain topics.
- Local search + web search API
- Hybrid index (BM25 + dense/FAISS) → Top-k; calls the Google Custom Search API if necessary.
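One common way to merge the BM25 and dense (FAISS) result lists is reciprocal rank fusion; the snippet below is a generic RRF sketch, not necessarily the fusion the project uses, and `k=60` is the conventional default rather than a project setting.

```python
def rrf_fuse(rankings, k: int = 60):
    """Reciprocal Rank Fusion: merge several ranked doc-id lists
    (e.g. one from BM25, one from dense retrieval) into one ranking.

    rankings: list of ranked lists of doc ids, best first.
    """
    scores = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked):
            # Each list contributes 1/(k + rank + 1) to the doc's score.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF needs only ranks, so it sidesteps calibrating BM25 scores against cosine similarities.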
- Reranking
- Top-50 → Top-5~10 with a cross-encoder
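The rerank step reduces to scoring each (query, passage) pair and keeping the best few. In this sketch the cross-encoder (e.g. monoT5-small) is abstracted as an injected `score_fn` so the cut-down logic stands alone; the function name and signature are assumptions.

```python
def rerank(query: str, candidates: list[str], score_fn, keep: int = 10):
    """Cut a Top-50 candidate list down to Top-5~10.

    score_fn(query, passage) -> float is assumed to wrap a cross-encoder
    model; higher scores mean more relevant.
    """
    scored = [(score_fn(query, passage), passage) for passage in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:keep]]
```

In practice `score_fn` would batch pairs through the model rather than score one at a time.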
- Context Composition
- Key snippets, titles, and URLs normalized to fit the token budget
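Fitting reranked snippets into the prompt budget can be done greedily, as sketched below. The whitespace `count_tokens` default is a stand-in for the model's real tokenizer, and the `(title, url, text)` snippet shape is an assumption.

```python
def build_context(snippets, budget: int,
                  count_tokens=lambda s: len(s.split())):
    """Greedily pack (title, url, text) snippets until the token budget
    is exhausted; snippets are assumed to arrive in rerank order."""
    parts, used = [], 0
    for title, url, text in snippets:
        block = f"[{title}]({url})\n{text}"
        cost = count_tokens(block)
        if used + cost > budget:
            break  # budget exhausted; later (lower-ranked) snippets dropped
        parts.append(block)
        used += cost
    return "\n\n".join(parts)
```

Keeping titles and URLs in each block lets the generator cite sources directly from the context.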
- LLM (Generator Prompt)
- Generates answers with summaries/key points/sources based on context
- Marks "Insufficient Information" when evidence is lacking
- Multilingual Service
- Response Streaming/Meta Display
- Remaining search quota / source links / timestamps
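Streaming UIs such as Gradio render a chatbot response by consuming a generator of progressively longer strings; the sketch below shows one way the answer stream and the metadata footer could be combined. The function name and footer format are illustrative assumptions.

```python
def stream_answer(token_iter, sources, remaining_searches, timestamp):
    """Yield cumulative partial answers for a streaming UI, then one
    final chunk with a metadata footer (sources, quota, timestamp)."""
    text = ""
    for tok in token_iter:
        text += tok
        yield text          # progressively longer partial answer
    meta = (f"\n\nSources: {', '.join(sources)}"
            f" | Searches left: {remaining_searches}"
            f" | {timestamp}")
    yield text + meta       # final chunk carries the metadata
```

Yielding cumulative strings (rather than deltas) matches how Gradio chat components typically expect streamed output.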
- Cache Storage/Log Collection
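Cache storage plus log collection could be as simple as the TTL cache below. This is an assumed in-memory sketch; a deployed service might use Redis or disk instead, and the 600-second TTL is a placeholder.

```python
import time

class TTLCache:
    """In-memory cache so repeated queries skip the search API within
    the TTL window; also records a simple hit/miss log."""

    def __init__(self, ttl: float = 600.0):
        self.ttl = ttl
        self.store = {}   # key -> (expiry_time, value)
        self.log = []     # (key, "hit" | "miss") entries

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        entry = self.store.get(key)
        if entry and entry[0] > now:
            self.log.append((key, "hit"))
            return entry[1]
        self.log.append((key, "miss"))
        return None

    def put(self, key, value, now=None):
        now = time.monotonic() if now is None else now
        self.store[key] = (now + self.ttl, value)
```

The `now` parameter exists only to make expiry deterministic in tests; production code would rely on the `time.monotonic()` default.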