
BookTutor Backend Skeleton

This repo implements the ingestion pipeline defined in spec/pdf-ingestion-and-chunking.md. Ingested chunks and their metadata are stored in Postgres so that later retrieval/model work can build on them.

Postgres Setup

  1. Start Postgres (Docker Compose example):
    docker compose up -d postgres
  2. Configure the connection string for local development:
    export DATABASE_URL=postgresql://postgres:postgres@localhost:5432/booktutor
  3. Create the schema using the provided SQL (includes the generated search_tsv column + GIN index):
    psql "$DATABASE_URL" -f booktutor/schema.sql
    # or: python -c "from booktutor import db; db.init_db()"
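
To confirm the database is reachable and the schema exists, you can run a quick check from Python. This is a minimal sketch; it assumes a psycopg2 driver is available in your environment and that DATABASE_URL is exported as above:

# db_check.py -- minimal connectivity/schema check (sketch; assumes psycopg2 is installed)
import os
import psycopg2

conn = psycopg2.connect(os.environ["DATABASE_URL"])
with conn, conn.cursor() as cur:
    # to_regclass returns NULL if the table does not exist yet
    cur.execute("SELECT to_regclass('public.chunks')")
    print("chunks table present:", cur.fetchone()[0] is not None)
conn.close()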

Python Environment

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Ingest a Chapter PDF

The CLI ingests a single chapter PDF, normalizes the text, chunks it (900–1200 tokens with ~200 token overlap), and writes chunk metadata/text into Postgres. It expects DATABASE_URL to point at your Postgres instance.

python ingest_chapter.py path/to/chapter.pdf \
  --course-id=HIST101 \
  --course-title="Modern History" \
  --chapter-no=1 \
  --chapter-title="Foundations" \
  --resource-title="Chapter 1"

Add --init-schema to create tables on first run, and --replace to delete any existing chapter resource before ingesting.
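
For reference, the windowing described above works roughly like the sketch below. This is illustrative only: it uses naive whitespace tokenization and a hypothetical chunk_tokens helper, while the real tokenizer and boundary handling live in the booktutor ingestion code.

# Illustrative sketch of 900-1200 token windows with ~200-token overlap.
# chunk_tokens is a hypothetical helper, not the repo's implementation.
def chunk_tokens(tokens, max_tokens=1200, overlap=200):
    step = max_tokens - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + max_tokens]
        if window:
            yield window
        if start + max_tokens >= len(tokens):
            break

normalized_text = "..."  # normalized chapter text goes here
chunks = [" ".join(window) for window in chunk_tokens(normalized_text.split())]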

Batch Ingest via Manifest

To ingest many chapters in one shot, define chapters/manifest.csv with columns chapter_no,chapter_title,pdf_path and run:

python scripts/ingest_manifest.py chapters/manifest.csv \
  --course-id=HIST101 \
  --course-title="Modern History" \
  --replace

The script reads each CSV row, optionally clears prior resources for that (course_id, chapter_no) pair, runs the ingestion pipeline, and prints a compact summary per chapter (resource id, pages, chunk count).
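
A minimal manifest might look like this (the chapter titles and paths below are just placeholders):

chapter_no,chapter_title,pdf_path
1,Foundations,chapters/ch01.pdf
2,Industrialization,chapters/ch02.pdf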

Inspect Stored Data

Quick Postgres queries to confirm what was ingested:

-- Per-chapter counts straight from resources (set during ingestion)
SELECT chapter_no, chapter_title, page_count, chunk_count
FROM resources
WHERE course_id = 'HIST101'
ORDER BY chapter_no;

-- If you want the min/max page ranges, the resource_stats view augments those counts
SELECT chapter_no, chapter_title, page_count, chunk_count, min_page_start, max_page_end
FROM resource_stats
WHERE course_id = 'HIST101'
ORDER BY chapter_no;

-- Preview the first few chunks for chapter 1
SELECT chunk_index, LEFT(text, 200)
FROM chunks
WHERE course_id = 'HIST101' AND chapter_no = 1
ORDER BY chunk_index
LIMIT 3;

-- (Optional) verify the generated search_tsv column is populated
SELECT COUNT(*) FROM chunks WHERE search_tsv @@ plainto_tsquery('modern history');
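
Because search_tsv is a generated column backed by a GIN index, the same check can be turned into a simple ranked lookup from Python. A minimal sketch, assuming psycopg2 and the default text search configuration (adjust if schema.sql specifies a different one):

# fts_preview.py -- ranked full-text lookup sketch; assumes psycopg2 and DATABASE_URL
import os
import psycopg2

conn = psycopg2.connect(os.environ["DATABASE_URL"])
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT chunk_index, LEFT(text, 120),
               ts_rank(search_tsv, plainto_tsquery(%s)) AS rank
        FROM chunks
        WHERE course_id = %s AND search_tsv @@ plainto_tsquery(%s)
        ORDER BY rank DESC
        LIMIT 5
        """,
        ("modern history", "HIST101", "modern history"),
    )
    for row in cur.fetchall():
        print(row)
conn.close()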
