This repo implements the ingestion pipeline defined in spec/pdf-ingestion-and-chunking.md, now backed by Postgres so later retrieval/model work can build on the stored chunks.
- Start Postgres (Docker Compose example):
docker compose up -d postgres
- Configure the connection string for local development:
export DATABASE_URL=postgresql://postgres:postgres@localhost:5432/booktutor
- Create the schema using the provided SQL (includes the generated search_tsv column + GIN index):
psql "$DATABASE_URL" -f booktutor/schema.sql
# or: python -c "from booktutor import db; db.init_db()"
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
The CLI ingests a single chapter PDF, normalizes the text, chunks it (900–1200 tokens with ~200 token overlap), and writes chunk metadata/text into Postgres. It expects DATABASE_URL to point at your Postgres instance.
python ingest_chapter.py path/to/chapter.pdf \
--course-id=HIST101 \
--course-title="Modern History" \
--chapter-no=1 \
--chapter-title="Foundations" \
--resource-title="Chapter 1"
Add --init-schema to create tables on first run, and --replace to delete any existing chapter resource before ingesting.
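To make the chunking step concrete, here is a minimal sketch of overlapping windows over a token list. It is not the repo's implementation: the function name `chunk_tokens`, its parameters, and the use of whitespace-split words as "tokens" are all assumptions for illustration. The actual pipeline may use a real tokenizer and vary chunk size within the 900–1200 range.

```python
from typing import List, Sequence


def chunk_tokens(
    tokens: Sequence[str],
    max_tokens: int = 1200,  # upper bound taken from the 900–1200 range above
    overlap: int = 200,      # tokens shared between consecutive chunks
) -> List[List[str]]:
    """Hypothetical helper: split tokens into overlapping fixed-size windows."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in [0, max_tokens)")

    stride = max_tokens - overlap  # how far each new window advances
    chunks: List[List[str]] = []
    start = 0
    while start < len(tokens):
        chunks.append(list(tokens[start:start + max_tokens]))
        # Stop once this window has reached the end of the input.
        if start + max_tokens >= len(tokens):
            break
        start += stride
    return chunks


# Usage: whitespace splitting stands in for a real tokenizer (assumption).
sample_tokens = ("word " * 2500).split()
sample_chunks = chunk_tokens(sample_tokens)
print([len(c) for c in sample_chunks])
```

With a fixed window the final chunk can fall below 900 tokens; a production chunker would likely merge or rebalance a short tail.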
To ingest many chapters in one shot, define chapters/manifest.csv with columns chapter_no,chapter_title,pdf_path and run:
python scripts/ingest_manifest.py chapters/manifest.csv \
--course-id=HIST101 \
--course-title="Modern History" \
--replace
The script reads each CSV row, optionally clears prior resources for that (course_id, chapter_no) pair, runs the ingestion pipeline, and prints a compact summary per chapter (resource id, pages, chunk count).
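As an illustration of the manifest format, the following sketch parses and validates rows with the three columns named above. The `ManifestRow` dataclass and `read_manifest` function are hypothetical names, not taken from scripts/ingest_manifest.py, and the sample chapter titles and PDF paths are made-up data.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterator, TextIO

REQUIRED_COLUMNS = ("chapter_no", "chapter_title", "pdf_path")


@dataclass
class ManifestRow:
    """Hypothetical container for one manifest line."""
    chapter_no: int
    chapter_title: str
    pdf_path: str


def read_manifest(handle: TextIO) -> Iterator[ManifestRow]:
    """Yield validated rows from a manifest CSV opened as text."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"manifest is missing columns: {', '.join(missing)}")
    for row in reader:
        yield ManifestRow(
            chapter_no=int(row["chapter_no"]),  # raises ValueError on non-numeric input
            chapter_title=row["chapter_title"].strip(),
            pdf_path=row["pdf_path"].strip(),
        )


# Usage with in-memory sample data (paths are illustrative only).
sample = io.StringIO(
    "chapter_no,chapter_title,pdf_path\n"
    "1,Foundations,chapters/ch01.pdf\n"
    "2,Revolutions,chapters/ch02.pdf\n"
)
for entry in read_manifest(sample):
    print(entry.chapter_no, entry.chapter_title, entry.pdf_path)
```

Validating the header up front means a misnamed column fails before any chapter is ingested, rather than partway through a batch.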
Quick Postgres queries to confirm what was ingested:
-- Per-chapter counts straight from resources (set during ingestion)
SELECT chapter_no, chapter_title, page_count, chunk_count
FROM resources
WHERE course_id = 'HIST101'
ORDER BY chapter_no;
-- If you want the min/max page ranges, the resource_stats view augments those counts
SELECT chapter_no, chapter_title, page_count, chunk_count, min_page_start, max_page_end
FROM resource_stats
WHERE course_id = 'HIST101'
ORDER BY chapter_no;
-- Preview the first few chunks for chapter 1
SELECT chunk_index, LEFT(text, 200)
FROM chunks
WHERE course_id = 'HIST101' AND chapter_no = 1
ORDER BY chunk_index
LIMIT 3;
-- (Optional) verify the generated search_tsv column is populated
SELECT COUNT(*) FROM chunks WHERE search_tsv @@ plainto_tsquery('modern history');