# BibleAssistant

<a target="_blank" href="https://colab.research.google.com/github/simonguest/CS-394/blob/main/src/06/notebooks/generate-synthetic.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
<a target="_blank" href="https://github.com/simonguest/CS-394/raw/refs/heads/main/src/06/notebooks/generate-synthetic.ipynb">
  <img src="https://img.shields.io/badge/Download_.ipynb-blue" alt="Download .ipynb"/>
</a>

## Data generation settings

In [None]:
from typing import List, Literal, Optional, Dict, Any, Tuple
from pydantic import BaseModel, Field, ValidationError

NUM_TRAIN_EXAMPLES = 1000  # @param {type:"number"}
NUM_VAL_EXAMPLES = 100  # @param {type:"number"}
NUM_TEST_EXAMPLES = 10 # @param {type:"number"}
TEMPERATURE = 0.8  # @param {type:"number"}

DATA_FOLDER = "./.data/generated"
!mkdir -p {DATA_FOLDER}

DATAGEN_MODEL = "openai/gpt-5-nano"

## Dataset diversity

In [None]:
BIBLE_BOOKS = [
    "Old Testament",
    "New Testament",
]

ISSUE_QUESTIONS_CATEGORIES = [
    "anxiety", "fear", "grief", "guilt", "anger", "loneliness",
    "guidance", "forgiveness", "burnout", "temptation", "question"
]

SEVERITY_LEVELS = ["mild", "moderate", "severe", "crisis"]

# How often we generate each severity (crisis is rarer)
SEVERITY_WEIGHTS = [0.35, 0.35, 0.22, 0.08]
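
To make the effect of these weights concrete, the sketch below computes the expected number of examples per severity for a 1000-example run and compares it with one seeded draw from `random.choices`, the same call the generation loop uses. The helper names `expected_counts` and `sample_counts` and the seed value are illustrative assumptions, not part of the notebook.

```python
import random
from collections import Counter

SEVERITY_LEVELS = ["mild", "moderate", "severe", "crisis"]
SEVERITY_WEIGHTS = [0.35, 0.35, 0.22, 0.08]

def expected_counts(num_examples: int) -> dict:
    """Expected number of examples per severity level (hypothetical helper)."""
    return {
        level: round(num_examples * weight)
        for level, weight in zip(SEVERITY_LEVELS, SEVERITY_WEIGHTS)
    }

def sample_counts(num_examples: int, seed: int = 0) -> Counter:
    """Draw severities with a private seeded generator and count them."""
    rng = random.Random(seed)  # assumed seed, only to make the draw repeatable
    draws = rng.choices(SEVERITY_LEVELS, weights=SEVERITY_WEIGHTS, k=num_examples)
    return Counter(draws)

print(expected_counts(1000))  # {'mild': 350, 'moderate': 350, 'severe': 220, 'crisis': 80}
print(sample_counts(1000))    # actual counts vary around the expected values
```

With only 10 test examples, the expected number of crisis cases is below one, so a small split may contain no crisis examples at all.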

## Model for structured output

In [None]:
class BibleExplanation(BaseModel):
    issue_question: str
    verse: str
    explanation: str
    guidance: List[str] = Field(default_factory=list)
    note: str

class BibleConversation(BaseModel):
    user: str
    assistant: BibleExplanation


## Get OpenRouter API key

In [None]:
import sys
import os
from dotenv import load_dotenv

if 'google.colab' in sys.modules:
  from google.colab import userdata # type:ignore
  os.environ['OPENROUTER_API_KEY'] = userdata.get('OpenRouter')
else:
  load_dotenv()

## Conversation generation functions

In [None]:
import openai
import os

client = openai.OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ.get("OPENROUTER_API_KEY"),
)

def generate_completion(prompt: str) -> Optional[BibleConversation]:
    try:
        response = client.responses.parse(
            model=DATAGEN_MODEL,
            input=[{"role": "user", "content": prompt}],
            temperature=TEMPERATURE,
            stream=False,
            text_format=BibleConversation,
        )
        return response.output_parsed
    except Exception:
        # Any API or parsing failure yields None so the caller can retry.
        return None

def create_conversation(issue_question: str, severity: str) -> Optional[BibleConversation]:
    # Build the request: Bible questions get a plain explanation,
    # personal issues get a response scaled to the severity level.
    if issue_question == "question":
        request = f"Provide a Bible verse about {issue_question}, and explain the meaning."
    elif severity == "mild":
        request = f"Provide a Bible verse about {issue_question}, with brief encouragement (1–2 sentences)."
    elif severity == "moderate":
        request = f"Provide a Bible verse about {issue_question}, with a short explanation of its meaning and 3–4 practical guidance steps."
    elif severity == "severe":
        request = f"Provide a Bible verse about {issue_question}, with deeper explanations and 4–6 compassionate guidance steps. Include strong reassurance."
    elif severity == "crisis":
        request = f"Provide a Bible verse about {issue_question}, with careful, supportive explanations. Include clear encouragement to seek immediate help from trusted people or local emergency services if the person is in danger."
    else:
        # Fallback for an unrecognized severity value
        request = f"Provide a Bible verse about {issue_question}, with short explanations and practical encouragement."

    prompt = f"""
    Create a realistic conversation from {request}.

    Context:
    - Topic: {issue_question}
    - Severity: {severity}

    Rules:
    - Use ONLY books from this list:
    {", ".join(BIBLE_BOOKS)}
    - Verse references must be: "Book Chapter:Verse" or "Book Chapter:Verse-Verse"
    - Do NOT cite any book outside the allowed list

    Return the following if {issue_question} is "question":
    1. 1–3 sentences written as the user describing their situation about {issue_question}, in your own words, as a string.
    2. The single selected Bible verse and its content as a single string.
    3. One paragraph, as a single string, explaining how the verse addresses the issue.

    Return the following if {issue_question} is not "question":
    1. 1–3 sentences written as the user describing their situation about {issue_question}, in your own words, as a string.
    2. The single selected Bible verse and its content as a single string.
    3. One paragraph, as a single string, explaining how the verse addresses the issue.
    4. 1–6 short practical action steps.
    5. A short compassionate closing note. If severity is crisis, encourage contacting trusted people or emergency services.
    """

    return generate_completion(prompt)
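
The prompt asks for references in the form "Book Chapter:Verse" or "Book Chapter:Verse-Verse", but nothing in the notebook checks that the model complied. A minimal sketch of such a check follows. The function name `parse_verse_reference` and the regular expression are assumptions for illustration. The check only tests the shape of the reference at the start of the `verse` string, not whether the book or verse actually exists, because `BIBLE_BOOKS` lists the two testaments rather than individual books.

```python
import re
from typing import Optional

# Optional leading "1 ", "2 " or "3 ", one or more words, then Chapter:Verse[-Verse]
VERSE_REF = re.compile(
    r"^(?P<book>(?:[1-3] )?[A-Za-z]+(?: [A-Za-z]+)*) "
    r"(?P<chapter>\d+):(?P<start>\d+)(?:-(?P<end>\d+))?"
)

def parse_verse_reference(verse: str) -> Optional[dict]:
    """Return the parsed reference, or None if the string does not start with one."""
    match = VERSE_REF.match(verse.strip())
    if match is None:
        return None
    start = int(match.group("start"))
    # A single verse is treated as a range of length one
    end = int(match.group("end")) if match.group("end") else start
    if end < start:
        return None  # reversed range such as "3:16-15"
    return {
        "book": match.group("book"),
        "chapter": int(match.group("chapter")),
        "start": start,
        "end": end,
    }

print(parse_verse_reference("Matthew 6:14-15 - For if ye forgive men their trespasses"))
print(parse_verse_reference("Psalms chapter 23"))
```

A check like this could run on `conv.assistant.verse` before an example is written, with a failed check treated like a failed generation.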

## Dataset generation functions

In [None]:
import random
import json
from tqdm import tqdm
import os

def generate_dataset_files(num_examples: int, jsonl_path: str) -> None:
    os.makedirs(os.path.dirname(jsonl_path), exist_ok=True)

    with open(jsonl_path, "w", encoding="utf-8") as f_jsonl:
        for idx in tqdm(range(num_examples)):
            issue = random.choice(ISSUE_QUESTIONS_CATEGORIES)
            severity = random.choices(SEVERITY_LEVELS, weights=SEVERITY_WEIGHTS)[0]

            conv = None
            tries = 0
            while conv is None and tries < 6:
                conv = create_conversation(issue, severity)
                tries += 1

            if conv is None:
                print(f"Error generating conversation for example {idx}")
                continue

            template = {
                "messages": [
                    {"role": "user", "content": conv.user},
                    {
                        "role": "assistant",
                        "content": json.dumps(conv.assistant.model_dump(), ensure_ascii=False),
                    },
                ]
            }

            f_jsonl.write(json.dumps(template, ensure_ascii=False) + "\n")
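
Because the assistant message is itself a JSON string nested inside each JSONL record, it can help to read a generated file back and confirm that both layers parse. The sketch below is one way to do that with the standard library only. `check_jsonl_line` and `REQUIRED_FIELDS` are hypothetical names, and the field list is copied from the `BibleExplanation` model.

```python
import json
from typing import List

REQUIRED_FIELDS = ("issue_question", "verse", "explanation", "guidance", "note")

def check_jsonl_line(line: str) -> List[str]:
    """Return a list of problems found in one JSONL record (empty if none)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"line is not valid JSON: {exc}"]

    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or len(messages) != 2:
        return ["expected a 'messages' list with exactly two entries"]

    roles = [m.get("role") for m in messages if isinstance(m, dict)]
    if roles != ["user", "assistant"]:
        return [f"unexpected roles: {roles}"]

    # The assistant content is a JSON document serialized as a string
    try:
        answer = json.loads(messages[1].get("content", ""))
    except (json.JSONDecodeError, TypeError):
        return ["assistant content is not a JSON string"]
    if not isinstance(answer, dict):
        return ["assistant content is not a JSON object"]

    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in answer]
    if "guidance" in answer and not isinstance(answer["guidance"], list):
        problems.append("guidance is not a list")
    return problems

good = json.dumps({"messages": [
    {"role": "user", "content": "I feel anxious."},
    {"role": "assistant", "content": json.dumps({
        "issue_question": "anxiety", "verse": "Philippians 4:6",
        "explanation": "...", "guidance": [], "note": "..."})},
]})
print(check_jsonl_line(good))          # []
print(check_jsonl_line("not json"))    # one problem reported
```

To check a whole file, the function can be applied to each line of the generated `.jsonl` file and the non-empty results printed with their line numbers.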

## Generate all the data!

In [None]:
from datetime import datetime

# Compute the timestamp once so all three files share the same suffix
RUN_TIMESTAMP = datetime.now().strftime('%Y-%m-%d-%H:%M:%S')

TRAIN_FILE = f"{DATA_FOLDER}/train_{RUN_TIMESTAMP}.jsonl"
VALID_FILE = f"{DATA_FOLDER}/valid_{RUN_TIMESTAMP}.jsonl"
TEST_FILE = f"{DATA_FOLDER}/test_{RUN_TIMESTAMP}.jsonl"

generate_dataset_files(NUM_TRAIN_EXAMPLES, TRAIN_FILE)
generate_dataset_files(NUM_VAL_EXAMPLES, VALID_FILE)
generate_dataset_files(NUM_TEST_EXAMPLES, TEST_FILE)

 23%|██▎       | 232/1000 [4:03:24<25:51:19, 121.20s/it]

Error generating conversation for example 231


 33%|███▎      | 331/1000 [5:33:15<20:46:49, 111.82s/it]

Error generating conversation for example 330


 38%|███▊      | 379/1000 [6:19:47<18:45:34, 108.75s/it]

Error generating conversation for example 378


 44%|████▎     | 437/1000 [7:12:24<15:44:21, 100.64s/it]

Error generating conversation for example 436


100%|██████████| 1000/1000 [15:02:54<00:00, 54.17s/it]
 95%|█████████▌| 95/100 [1:19:36<10:00, 120.12s/it]

Error generating conversation for example 94


100%|██████████| 100/100 [1:24:00<00:00, 50.40s/it]
100%|██████████| 10/10 [08:09<00:00, 48.98s/it]


# What use case did you choose and why?
People face many struggles in their lives, so I wanted to create a Bible assistant that can answer their questions about the Bible and help them feel better. I believe the Bible has the ability to heal everyone in the world. An AI assistant can respond to the questions in people's hearts right away, without making them wait.

# Your diversity dimensions and how they shape the dataset

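For the source dimension, I limited the Bible books to the Old and New Testaments because they are widely accepted Christian scriptures around the world. I want to ensure that users are protected from false teachings or misleading interpretations that may come from cult sources on the internet. I also vary each example along two sampled dimensions: one of eleven issue categories (such as anxiety, grief, forgiveness, or a general Bible question) and one of four severity levels, weighted 0.35/0.35/0.22/0.08 so that crisis cases are the rarest.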

Regarding input, I expect users to either ask a question about the Bible or share a personal issue they are facing. For the output, the system should first restate the user’s concern in its own words to show understanding. Then it should select an appropriate verse, explain its meaning in clear and simple language—since many verses can be difficult to understand—and provide practical guidance on what the user can do. Finally, it should include a short compassionate note to make the response feel warm and caring.

With these choices, users should feel as though they are consulting with a sincere and thoughtful Christian.
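
The output structure described above maps directly onto the fields of `BibleExplanation`. As a rough illustration, the sketch below turns one parsed assistant response into readable text in that order. The function name `render_response` and the section labels are assumptions. The notebook itself stores the response as JSON and does not include a rendering step.

```python
from typing import Any, Dict

def render_response(answer: Dict[str, Any]) -> str:
    """Format a parsed assistant response as plain text (hypothetical helper)."""
    parts = [
        f"Your concern: {answer['issue_question']}",
        f"Verse: {answer['verse']}",
        f"Meaning: {answer['explanation']}",
    ]
    # Guidance may be empty, for example for plain Bible questions
    steps = answer.get("guidance") or []
    if steps:
        numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
        parts.append("What you can do:\n" + numbered)
    if answer.get("note"):
        parts.append(answer["note"])
    return "\n\n".join(parts)

example = {
    "issue_question": "I am trying to forgive someone who betrayed my trust.",
    "verse": "Matthew 6:14-15",
    "explanation": "Forgiveness is a choice we make.",
    "guidance": ["Acknowledge the hurt.", "Ask God for strength to forgive."],
    "note": "Be gentle with yourself.",
}
print(render_response(example))
```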

# Observations from your quality review (what looked good, what needed fixing)

# Sample

{"messages":
[
  {"role": "user", "content": "I've been hurt by someone who betrayed my trust, and I'm trying to forgive and move forward. I'm seeking guidance and peace rather than staying in anger."},
  
  {"role": "assistant", "content": "{
    \"issue_question\": \"I've been hurt by someone who betrayed my trust, and I'm trying to forgive and move forward. I'm seeking guidance and peace rather than staying in anger.\",

   \"verse\": \"Matthew 6:14-15 - For if ye forgive men their trespasses, your heavenly Father will forgive you: But if ye forgive not men their trespasses, neither will your Father forgive your trespasses.\",
   
   \"explanation\": \"These verses teach that forgiveness is a choice we make, not merely a feeling that comes and goes. By choosing to forgive, we align with God’s mercy and open ourselves to healing, even when hurt remains. The passage also cautions that withholding forgiveness can hinder our own experience of divine mercy, reminding us that mercy towards others is connected to mercy received.\",
   
   \"guidance\": [\"Take a moment to acknowledge the hurt and name what you want to release.\", \"Ask God for the strength to forgive and release bitterness.\", \"Choose to forgive in your heart today, even if emotions linger.\", \"If safe and possible, speak with the person about forgiveness in a calm, boundaries-respecting way.\", \"Journal or pray daily about your progress and seek support from a trusted friend or counselor.\"],
   
   \"note\": \"Be gentle with yourself—forgiveness is a process, and progress may be gradual.\"}"}
   ]
   }

# What Looked Good?

I think the output looks good so far. It reads like what I would expect when consulting with a Christian.

# What Needed Fixing?

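Having the assistant restate the issue feels unnecessary. It does not rephrase the issue in its own words as I expected (in the sample above, `issue_question` repeats the user's message verbatim), but this is acceptable.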

# Lesson Learned

I initially tried this approach for a fighting game helper model, but the generated data was unusable. I think this happened because fighting games are a niche topic and require highly specific knowledge. Most of that knowledge is found in YouTube tutorial videos rather than structured text sources.

In contrast, the Bible is a widely known and well-documented topic with abundant written material available. That is likely why this approach works much better for the Bible-based assistant.