# 🧠 Oncorag2 Tutorial: End-to-End Clinical Feature Extraction & Question Answering

Welcome to the **Oncorag2** tutorial!

This notebook guides you through the full pipeline for extracting and querying clinical features from oncology patient records using:

- 🤖 Conversational feature engineering (via LLM agent)
- 📄 Clinical text processing + feature extraction
- 🔍 Retrieval-Augmented Generation (RAG) chatbot for patient-specific Q&A

---

## 📌 Step 1: Generate a Feature Extraction Template
This step launches a conversational agent to define a custom feature set for your oncology task (e.g., "Triple-negative breast cancer").

In [4]:
!python scripts/run_feature_generation.py

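The paths in this notebook (`scripts/`, `config/`, `data/`, `output/`) are relative, so the working directory matters. The recorded run of this cell failed with `[Errno 2] No such file or directory` because Python looked for the script under `notebooks/scripts/`. If you see the same error, change to the directory that contains `scripts/` (likely the repository root, e.g. with `%cd ..`) and rerun the cell.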
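
If the cell above fails with `No such file or directory`, one option is to locate the project root programmatically instead of relying on where the notebook was started. The sketch below assumes that the project root is the nearest parent directory containing `scripts/run_feature_generation.py`. `find_repo_root` is a hypothetical helper written for this tutorial, not part of Oncorag2.

```python
from pathlib import Path


def find_repo_root(start, marker="scripts/run_feature_generation.py"):
    """Walk upward from `start` until a directory containing `marker` is found.

    Assumption: the marker script lives at <project root>/scripts/.
    """
    start = Path(start).resolve()
    # Check the starting directory first, then each parent in turn.
    for candidate in (start, *start.parents):
        if (candidate / marker).is_file():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains {marker}")
```

In a notebook cell you could then run `import os; os.chdir(find_repo_root(Path.cwd()))` once, before any of the `!python scripts/...` cells.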


The agent will ask for:
- The disease/entity you are working with
- Any areas of interest (e.g., biomarkers, staging)

It then generates 5 features per batch and lets you confirm them or request more.

The output is saved to `config/<entity>_feature_config.json`.
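
Before moving on, it can help to confirm that the config file was written and parses as JSON. The sketch below makes no assumption about the config's schema: it only reports the top-level type, its size, and (for objects) the keys. `summarize_config` is a hypothetical helper, not an Oncorag2 function.

```python
import json
from pathlib import Path


def summarize_config(path):
    """Return (top-level type name, item count, sorted keys or None) for a JSON file."""
    with Path(path).open(encoding="utf-8") as fh:
        data = json.load(fh)
    # Only JSON objects have keys; lists and scalars report None.
    keys = sorted(data) if isinstance(data, dict) else None
    size = len(data) if isinstance(data, (dict, list)) else 1
    return type(data).__name__, size, keys
```

Call it with the path to your own file, i.e. `config/<entity>_feature_config.json` with `<entity>` replaced by the name the agent used.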

## 📌 Step 2: Extract Data From Patient Records
This step uses the generated feature config to extract features from either real or synthetic patient notes.

In [None]:
!python scripts/run_data_extraction.py \
  --features config/tripple_negative_breast_feature_config.json \
  --data-dir data/sample_patient \
  --output output/extracted_data.csv \
  --context-csv output/context_data.csv \
  --raw-csv output/raw_context_data.csv \
  --collection patient_contexts

✅ If `--data-dir` is empty or not provided, synthetic notes will be generated automatically.

The following will be created:
- `extracted_data.csv`: structured features per patient
- `context_data.csv`: context snippets used for vector search
- `raw_context_data.csv`: full extracted mentions (raw + cleaned)

These are also ingested into the IRIS vector database for downstream retrieval.
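
A quick sanity check on the three CSV files can catch an empty or missing output before you start the chatbot. The sketch below uses only the standard library and assumes each file begins with a header row; it does not assume any particular column names. `describe_csv` and `describe_outputs` are hypothetical helpers, not part of Oncorag2.

```python
import csv
from pathlib import Path


def describe_csv(path):
    """Return (column names, number of data rows) for a CSV with a header row."""
    with Path(path).open(newline="", encoding="utf-8") as fh:
        reader = csv.reader(fh)
        header = next(reader, [])       # empty list if the file is empty
        n_rows = sum(1 for _ in reader)  # remaining rows are data rows
    return header, n_rows


def describe_outputs(output_dir="output"):
    """Map each expected output file to its description, or None if it is missing."""
    names = ("extracted_data.csv", "context_data.csv", "raw_context_data.csv")
    report = {}
    for name in names:
        path = Path(output_dir) / name
        report[name] = describe_csv(path) if path.is_file() else None
    return report
```

`describe_outputs()` returns a dictionary, so printing it in a notebook cell shows at a glance which files exist and how many rows each contains.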

## 📌 Step 3: Launch the Patient Data Chatbot
This step launches a chatbot for querying patient-specific details with RAG-based search over both structured features and unstructured note text.

In [None]:
!python scripts/run_chatbot.py \
  --extracted-data output/extracted_data.csv \
  --collection patient_contexts \
  --verbose

✅ Example questions you can ask:
- *What is the patient's ER status?*
- *How many positive nodes were found?*
- *Did the patient receive chemotherapy?*

The chatbot will combine context from both structured fields and embedded note segments.
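
To make the idea of combining the two sources concrete, here is a generic illustration of how a RAG prompt can be assembled from one structured row plus retrieved note snippets. This is a sketch of the general technique only, not Oncorag2's actual implementation: `build_prompt`, its parameters, and the prompt layout are all assumptions made for illustration.

```python
def build_prompt(question, structured_row, snippets):
    """Assemble one prompt string from structured fields and retrieved snippets.

    structured_row: dict of feature name -> value (hypothetical shape)
    snippets: list of note excerpts returned by a vector search (hypothetical)
    """
    field_lines = [f"- {name}: {value}" for name, value in structured_row.items()]
    # Number the excerpts so an answer can point back to its source.
    snippet_lines = [f"[{i}] {text}" for i, text in enumerate(snippets, start=1)]
    parts = [
        "Structured fields:",
        *field_lines,
        "",
        "Note excerpts:",
        *snippet_lines,
        "",
        f"Question: {question}",
    ]
    return "\n".join(parts)
```

Numbering the excerpts is a common design choice because it lets the model cite which snippet supports each part of its answer, which matters when the structured field and the note text disagree.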

## ✅ Done!
You’ve completed the full Oncorag2 pipeline.

**From feature engineering to clinical Q&A — fully automated.**

Feel free to explore the generated files in `output/` or test other entity configs.