InsightFlow


InsightFlow is an AI-native web app for career-content search and grounded summarization. It ingests structured job-market content, retrieves evidence with hybrid search, and returns structured summaries with citations, evidence snippets, and source cards.

The project is intentionally narrow: it focuses on retrieval and summarization for job, internship, and career research, rather than being a general-purpose chat product or social platform.

What It Does

  • Imports CSV and JSON datasets into a local SQLite store
  • Preserves source metadata across normalization, chunking, indexing, and retrieval
  • Builds hybrid retrieval with SQLite keyword search and a FAISS vector index
  • Produces evidence-backed structured summaries through a FastAPI backend
  • Shows summary sections, source cards, and evidence snippets in a React frontend
  • Supports fake providers for local development and OpenAI-compatible providers for real inference

Stack

  • Backend: FastAPI, SQLAlchemy, SQLite
  • Retrieval: SQLite keyword search, FAISS vector index, deterministic merge and rerank pipeline
  • Frontend: React, TypeScript, Vite
  • Testing: pytest, Vitest
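The deterministic merge-and-rerank step can be illustrated with a small sketch. This is not the project's actual pipeline code; the reciprocal-rank-fusion weighting, function name, and result shape below are assumptions:

```python
# Hypothetical sketch of deterministically merging keyword and vector hits.
# The real pipeline lives in backend/; names and weights here are assumptions.

def merge_and_rerank(keyword_hits: list[str], vector_hits: list[str], k: int = 60) -> list[str]:
    """Reciprocal-rank fusion: score = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for hits in (keyword_hits, vector_hits):
        for rank, chunk_id in enumerate(hits):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by score descending; break ties by chunk id so output is deterministic.
    return sorted(scores, key=lambda c: (-scores[c], c))

merged = merge_and_rerank(["a", "b", "c"], ["b", "d"])
print(merged[0])  # "b" appears in both lists, so it ranks first
```

Because ties are broken by chunk id, the same inputs always produce the same ordering, which keeps retrieval results reproducible across runs.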

Repository Structure

insightflow/
  backend/      FastAPI app, retrieval pipeline, ingestion logic, tests, and scripts
  frontend/     React client for search and grounded summary display
  data/         Local raw and processed data directories
  docs/         Validation notes, experiments, and iteration history
  .planning/    GSD planning artifacts and project state

Current Scope

Implemented in the current MVP:

  • Content import
  • Chunking with parent-document traceability
  • Hybrid retrieval
  • Structured summary API
  • Citation linking and evidence snippet extraction
  • Source-card rendering in the UI
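As a rough illustration of chunking with parent-document traceability, each chunk can carry its parent document id and character offsets. The field names and sizes below are illustrative assumptions, not InsightFlow's actual schema:

```python
# Hypothetical sketch of chunking with parent-document traceability.
# Field names (doc_id, start, end) are illustrative, not the project's schema.

def chunk_document(doc_id: str, text: str, size: int = 200, overlap: int = 50) -> list[dict]:
    """Split text into overlapping chunks, each pointing back to its parent doc."""
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        end = min(start + size, len(text))
        chunks.append({"doc_id": doc_id, "start": start, "end": end,
                       "text": text[start:end]})
        if end == len(text):
            break
    return chunks

chunks = chunk_document("job-001", "x" * 450)
# Every chunk keeps its parent id and offsets, so a citation can resolve
# back to the exact span of the source document.
assert all(c["doc_id"] == "job-001" for c in chunks)
```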

Explicitly out of scope for v1:

  • User accounts
  • Social/community features
  • Personalized recommendations
  • Multi-turn memory chat
  • Mobile app

Quick Start

1. Bootstrap the backend

.\bootstrap.ps1

This script prepares the backend virtual environment and installs dependencies.

2. Start the API

.\bootstrap.ps1 -RunApp

Core backend endpoints:

  • GET /healthz
  • POST /imports
  • POST /retrieval/index/build
  • POST /retrieval/search
  • POST /summary

3. Start the frontend

Set-Location frontend
npm install
$env:VITE_API_BASE_URL = "http://127.0.0.1:8001"
npm run dev -- --host 127.0.0.1 --port 5173

Recommended local pairing:

  • Backend: 127.0.0.1:8001
  • Frontend: 127.0.0.1:5173

Example Local Workflow

Import sample data:

& .\backend\.venv\Scripts\python.exe -m app.cli.import_batch `
  --input backend\tests\fixtures\sample_job_dataset.csv `
  --source sample_dataset `
  --source-type dataset `
  --db data\processed\insightflow.db

Build retrieval indexes:

Set-Location backend
$env:CHAT_MODEL_PROVIDER = "fake"
$env:FORCE_REBUILD = "true"
& .\.venv\Scripts\python.exe scripts\build_retrieval_index.py `
  --db sqlite:///../data/processed/insightflow.db `
  --vector-index-dir ../data/processed/vector_index

Run the backend with fake providers:

Set-Location backend
$env:CHAT_MODEL_PROVIDER = "fake"
$env:EMBEDDING_MODEL_PROVIDER = "fake"
$env:FORCE_REBUILD = "true"
& .\.venv\Scripts\python.exe -m uvicorn app.main:app --host 127.0.0.1 --port 8001

Provider Configuration

InsightFlow supports fake providers for local testing and OpenAI-compatible providers for real inference.

Example real-provider setup:

$env:CHAT_MODEL_PROVIDER = "openai_compatible"
$env:CHAT_API_BASE_URL = "https://api.example.com/v1"
$env:CHAT_API_KEY = "your-chat-key"
$env:CHAT_MODEL_NAME = "your-chat-model"
$env:EMBEDDING_MODEL_PROVIDER = "openai_compatible"
$env:EMBEDDING_API_BASE_URL = "https://api.example.com/v1"
$env:EMBEDDING_API_KEY = "your-embedding-key"
$env:EMBEDDING_MODEL_NAME = "your-embedding-model"
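These variables would typically be read into a single settings object at backend startup. A minimal sketch of that pattern (the class and its defaults are assumptions, only the environment variable names come from the block above):

```python
import os
from dataclasses import dataclass

# Hypothetical settings loader; the env var names match the README,
# but this class and its defaults are illustrative assumptions.
@dataclass
class ProviderSettings:
    chat_provider: str
    chat_base_url: str
    embedding_provider: str

    @classmethod
    def from_env(cls) -> "ProviderSettings":
        # Fall back to the fake providers so local runs need no configuration.
        return cls(
            chat_provider=os.environ.get("CHAT_MODEL_PROVIDER", "fake"),
            chat_base_url=os.environ.get("CHAT_API_BASE_URL", ""),
            embedding_provider=os.environ.get("EMBEDDING_MODEL_PROVIDER", "fake"),
        )
```

Defaulting to `"fake"` keeps the zero-configuration local workflow working while real deployments override via `$env:` variables.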

Important default paths:

  • Database: data/processed/insightflow.db
  • Vector index: data/processed/vector_index

Testing

Backend:

& .\backend\.venv\Scripts\python.exe -m pytest backend\tests -q

Frontend:

Set-Location frontend
npm test

Build frontend:

Set-Location frontend
npm run build

Design Principles

  • Evidence first: summaries should stay within retrieved evidence boundaries
  • Source transparency: each claim should be traceable to citations and snippets
  • Modular backend: API, services, retrieval, db, schemas, and prompts remain clearly separated
  • Narrow scope: solve career-information retrieval and synthesis well before expanding
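"Evidence first" can be enforced mechanically: reject any summary claim whose citations point outside the retrieved evidence. A hypothetical sketch of such a check (the claim/citation structure is an assumption, not InsightFlow's schema):

```python
# Hypothetical grounding check: every cited snippet id must come from retrieval.
# The claim/citation structure here is illustrative, not the project's schema.

def ungrounded_claims(claims: list[dict], evidence_ids: set[str]) -> list[dict]:
    """Return claims with no citations, or citations outside the retrieved evidence."""
    return [c for c in claims
            if not c["citations"] or not set(c["citations"]) <= evidence_ids]

claims = [
    {"text": "Remote roles grew in 2023.", "citations": ["chunk-1"]},
    {"text": "Unsupported claim.", "citations": ["chunk-9"]},
]
bad = ungrounded_claims(claims, evidence_ids={"chunk-1", "chunk-2"})
print(len(bad))  # 1: only the claim citing chunk-9 is flagged
```

Running a check like this after summarization turns the design principle into a testable invariant rather than a prompt-level hope.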

Documentation

Project Status

The initial MVP implementation is complete; current work focuses on verification, retrieval-quality tuning, and release readiness.

License

This project is currently released under the MIT License.

Release Notes

The first public release notes are available at docs/release_notes_v0.1.0.md.
