BookTutor Backend Skeleton

This repo implements the ingestion pipeline defined in spec/pdf-ingestion-and-chunking.md, now backed by Postgres so later retrieval/model work can build on the stored chunks.

Postgres Setup

Start Postgres (Docker Compose example):
```
docker compose up -d postgres
```

Configure the connection string for local development:

export DATABASE_URL=postgresql://postgres:postgres@localhost:5432/booktutor
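A malformed connection string is a common cause of first-run failures, so it can help to check what the URL actually resolves to. The snippet below is a minimal sketch that uses only the standard library; `describe_database_url` is a hypothetical helper, not part of this repo.

```python
import os
from urllib.parse import urlsplit


def describe_database_url(url: str) -> dict:
    """Split a Postgres URL into its parts so typos are easy to spot.

    Hypothetical helper for illustration; the booktutor package may
    parse DATABASE_URL differently.
    """
    parts = urlsplit(url)
    if parts.scheme not in ("postgresql", "postgres"):
        raise ValueError(f"unexpected scheme: {parts.scheme!r}")
    return {
        "user": parts.username,
        "host": parts.hostname,
        "port": parts.port or 5432,  # fall back to the Postgres default port
        "database": parts.path.lstrip("/"),
    }


if __name__ == "__main__":
    # Falls back to the local development URL shown above.
    url = os.environ.get(
        "DATABASE_URL",
        "postgresql://postgres:postgres@localhost:5432/booktutor",
    )
    print(describe_database_url(url))
```

The password is deliberately left out of the returned dictionary so the output is safe to paste into a bug report.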

Create the schema from the provided SQL, which includes the generated search_tsv column and a GIN index:

psql "$DATABASE_URL" -f booktutor/schema.sql
# or: python -c "from booktutor import db; db.init_db()"

Python Environment

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Ingest a Chapter PDF

The CLI ingests a single chapter PDF, normalizes the text, splits it into chunks of 900–1200 tokens with roughly 200 tokens of overlap between consecutive chunks, and writes the chunk metadata and text to Postgres. It expects DATABASE_URL to point at your Postgres instance.

python ingest_chapter.py path/to/chapter.pdf \
  --course-id=HIST101 \
  --course-title="Modern History" \
  --chapter-no=1 \
  --chapter-title="Foundations" \
  --resource-title="Chapter 1"

Add --init-schema to create tables on first run, and --replace to delete any existing chapter resource before ingesting.
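The overlapping chunking described above can be sketched as a sliding window. This is an illustration under stated assumptions, not the pipeline's actual code: `chunk_tokens` is a hypothetical name, tokens are plain list items rather than the output of whatever tokenizer the pipeline uses, and the 900-token lower bound is not enforced (a real implementation would likely cut at sentence or paragraph boundaries, which is where a size range comes from).

```python
from typing import List


def chunk_tokens(
    tokens: List[str], max_tokens: int = 1200, overlap: int = 200
) -> List[List[str]]:
    """Split tokens into windows of at most max_tokens.

    Each window after the first starts `overlap` tokens before the end
    of the previous one, so neighbouring chunks share context.
    Hypothetical sketch; the defaults mirror the sizes quoted above.
    """
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be non-negative and smaller than max_tokens")

    step = max_tokens - overlap  # how far the window advances each time
    chunks: List[List[str]] = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # this window already reached the end of the text
        start += step
    return chunks


if __name__ == "__main__":
    # Whitespace splitting stands in for a real tokenizer.
    words = ("lorem ipsum " * 1250).split()
    for index, chunk in enumerate(chunk_tokens(words)):
        print(index, len(chunk))
```

With these defaults the window advances 1000 tokens at a time, so the final chunk of a chapter can be shorter than the others.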

Batch Ingest via Manifest

To ingest many chapters in a single run, create chapters/manifest.csv with the columns chapter_no,chapter_title,pdf_path and run:

python scripts/ingest_manifest.py chapters/manifest.csv \
  --course-id=HIST101 \
  --course-title="Modern History" \
  --replace

The script reads each CSV row, optionally clears prior resources for that (course_id, chapter_no) pair, runs the ingestion pipeline, and prints a compact summary per chapter (resource id, pages, chunk count).
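For reference, reading a manifest with those three columns can be sketched as below. This is not the code in scripts/ingest_manifest.py: `ManifestRow` and `parse_manifest` are hypothetical names, and the sample PDF paths are made up for illustration.

```python
import csv
from dataclasses import dataclass
from pathlib import Path
from typing import Iterable, List

REQUIRED_COLUMNS = ("chapter_no", "chapter_title", "pdf_path")


@dataclass
class ManifestRow:
    chapter_no: int
    chapter_title: str
    pdf_path: Path


def parse_manifest(lines: Iterable[str]) -> List[ManifestRow]:
    """Parse manifest lines (header first) into typed rows.

    Hypothetical sketch; accepts an open file or any iterable of lines.
    """
    reader = csv.DictReader(lines)
    present = reader.fieldnames or []
    missing = [name for name in REQUIRED_COLUMNS if name not in present]
    if missing:
        raise ValueError(f"manifest is missing columns: {missing}")

    rows = []
    for record in reader:
        rows.append(
            ManifestRow(
                chapter_no=int(record["chapter_no"]),  # fails early on bad numbers
                chapter_title=record["chapter_title"].strip(),
                pdf_path=Path(record["pdf_path"].strip()),
            )
        )
    return rows


if __name__ == "__main__":
    sample = [
        "chapter_no,chapter_title,pdf_path",
        "1,Foundations,chapters/ch01.pdf",  # made-up path
    ]
    for row in parse_manifest(sample):
        print(row)
```

Validating the header before any ingestion starts means a misnamed column fails fast instead of partway through a batch.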

Inspect Stored Data

Quick Postgres queries to confirm what was ingested:

-- Per-chapter counts straight from resources (set during ingestion)
SELECT chapter_no, chapter_title, page_count, chunk_count
FROM resources
WHERE course_id = 'HIST101'
ORDER BY chapter_no;

-- If you want the min/max page ranges, the resource_stats view augments those counts
SELECT chapter_no, chapter_title, page_count, chunk_count, min_page_start, max_page_end
FROM resource_stats
WHERE course_id = 'HIST101'
ORDER BY chapter_no;

-- Preview the first few chunks for chapter 1
SELECT chunk_index, LEFT(text, 200)
FROM chunks
WHERE course_id = 'HIST101' AND chapter_no = 1
ORDER BY chunk_index
LIMIT 3;

-- (Optional) verify the generated search_tsv column is populated
SELECT COUNT(*) FROM chunks WHERE search_tsv @@ plainto_tsquery('modern history');
