## OMIM-RAG Score Collection Tutorial

In this tutorial, we demonstrate the pipeline for obtaining penalty factors for a pre-specified list of genes, whose names are supplied as a .pkl or .txt file. <br>
The process consists of three steps: <br>
(1). Scraping information from OMIM into JSON files using the OMIM API. <br>
(2). Preprocessing the scraped OMIM JSON files and populating a Chroma-based vectorstore with the OMIM knowledge base. <br>
(3). Collecting penalty factors with a specified user prompt and an LLM. <br>

In [None]:
import json  # used below to parse the chunked OMIM file
import os
import sys
sys.path.insert(0, '..')
from omim_scrape.parse_omim import *
from src.llm_lasso.utils.chunking import chunk_by_gene
from omim_scrape.process_mim_number import *
from src.llm_lasso.task_specific_lasso.llm_lasso import *
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.schema import Document
import constants
import warnings
warnings.filterwarnings("ignore")

### (1). Scraping OMIM entries for a pre-specified list of gene names


In [None]:
# Step 1: Update the _my_constants.py file with your API keys: the OMIM API key and an OpenRouter and/or OpenAI API key.
# Step 2: Test that the OMIM API key is working.
test_omim_api_access()

In [None]:
# Step 3: Fetch MIM numbers for a list of genes
file_path = 'example_data/example_genenames.txt'
save_mim_path = 'example_data/example_mim_nums.pkl'
mim_dict = get_specified_mim(file_path, save_mim_path)
print(mim_dict)
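
The gene list can be supplied as a .txt or a .pkl file. The sketch below shows one way to inspect such a file before fetching MIM numbers. It assumes the .txt file holds one gene name per line and the .pkl file holds a pickled list of strings; neither layout is confirmed by this tutorial, and `load_gene_names` is a hypothetical helper, not part of the repository.

```python
import pickle
from pathlib import Path


def load_gene_names(path):
    """Hypothetical helper: read gene names from a .txt or .pkl file.

    Assumes one name per line for .txt and a pickled list of strings for .pkl.
    """
    path = Path(path)
    if path.suffix == ".pkl":
        with open(path, "rb") as f:
            names = pickle.load(f)
    elif path.suffix == ".txt":
        with open(path, "r", encoding="utf-8") as f:
            names = f.read().splitlines()
    else:
        raise ValueError(f"Unsupported gene list format: {path.suffix}")
    # Strip surrounding whitespace and drop empty entries
    return [str(name).strip() for name in names if str(name).strip()]
```

Calling `load_gene_names(file_path)` and printing the result is a quick way to confirm the list looks as expected before any OMIM requests are made.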

In [None]:
# Step 4: Save scraped output from OMIM database using the fetched MIM numbers
save_json_path = 'example_data/omim_context.json'
process_mim_numbers_to_json(save_mim_path, save_json_path)

### (2). Preprocessing and preparing an OMIM knowledge base for the specified list of genes

In [None]:
# Step 1: Preprocess the raw JSON file by splitting each gene's entry into overlapping chunks
chunked_json_path = 'example_data/omim_context_chunked.json'
chunk_by_gene(save_json_path, chunked_json_path, chunk_size=1000, chunk_overlap=200)
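
To build intuition for the `chunk_size` and `chunk_overlap` arguments, here is a minimal sketch of sliding-window chunking over characters. It is an illustration only: the internals of `chunk_by_gene` are not shown in this tutorial and may differ (for example, by splitting on separators or counting tokens), and `chunk_text` is a hypothetical name.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Hypothetical sketch: split text into overlapping character windows."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks
```

With the values used above, consecutive chunks would share 200 characters, so a sentence that straddles a chunk boundary still appears intact in at least one chunk.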

In [None]:
# Step 2: Populate the vector store using the preprocessed OMIM JSON file.

# (i). Load the chunked OMIM data
print("Loading chunked OMIM JSON data...")
documents = []

# The chunked file is read line by line, one JSON object per line
with open(chunked_json_path, "r", encoding="utf-8") as f:
    for line in f:
        entry = json.loads(line)
        documents.append(entry)

print(f"Loaded {len(documents)} total chunks from the OMIM database.")
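
The vector store step later reads the `content` and `metadata` keys of every entry, so it can help to validate the file while loading it. The following is a minimal sketch, assuming the chunked file contains one JSON object per line as the loop above implies; `load_chunks` is a hypothetical helper, not part of the repository.

```python
import json


def load_chunks(path):
    """Hypothetical helper: load one JSON object per line and check its keys."""
    chunks = []
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            entry = json.loads(line)
            missing = {"content", "metadata"} - entry.keys()
            if missing:
                raise ValueError(f"Line {line_no} is missing keys: {sorted(missing)}")
            chunks.append(entry)
    return chunks
```

Failing early here gives a line number to inspect, rather than a `KeyError` raised later while the documents are being wrapped.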

In [None]:
# (ii). Initialize embeddings
os.environ["OPENAI_API_KEY"] = constants.OPENAI_API
embeddings = OpenAIEmbeddings()

# (iii). Create or load the OMIM-based vector store
PERSIST = True # Enable persistence to save the database to disk; set False otherwise.
persist_directory = "example_data/omim_vectorstore"  # Directory to save the vectorstore
if PERSIST and os.path.exists(persist_directory):
    print("Reusing existing database...")
    vectorstore = Chroma(persist_directory=persist_directory, embedding_function=embeddings)
else:
    print("Creating a new database...")
    # Wrap each entry into a Document object
    documents_wrapped = [
        Document(page_content=doc['content'], metadata=doc['metadata']) for doc in documents
    ]
    vectorstore = Chroma.from_documents(
        documents=documents_wrapped,  # Use the wrapped documents
        embedding=embeddings,
        # Only write to disk when persistence is enabled
        persist_directory=persist_directory if PERSIST else None
    )
    if PERSIST:
        vectorstore.persist()  # Save the database to disk
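
Before collecting penalty factors, it can be useful to check that the store returns sensible chunks for a test query. The sketch below relies only on the `similarity_search(query, k=...)` method of the LangChain vector store and the `page_content` and `metadata` attributes of the returned documents; `preview_retrieval` itself is a hypothetical helper, and the example query in the note that follows is an assumption.

```python
def preview_retrieval(store, query, k=3, width=80):
    """Hypothetical helper: summarize the top-k chunks retrieved for a query."""
    results = store.similarity_search(query, k=k)
    lines = []
    for rank, doc in enumerate(results, start=1):
        # Show only the first `width` characters of each chunk on one line
        snippet = doc.page_content[:width].replace("\n", " ")
        lines.append(f"{rank}. {snippet} | metadata={doc.metadata}")
    return lines
```

For instance, printing each line of `preview_retrieval(vectorstore, "BCL2 lymphoma")` should show OMIM chunks related to that gene. Note that each query embeds the text through the OpenAI API, so it requires a valid key and incurs a small cost.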

### (3). Collect penalty factors for the list of genes using a RAG-enhanced LLM

In [None]:
# Optional integration with the LangSmith API to trace retrieved documents.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = constants.LANGCHAIN_API  # your LangSmith API key
os.environ["LANGCHAIN_PROJECT"] = "YOUR PROJECT NAME"

In [None]:
# Step 1: Define your user prompt in "example_data/user_prompt.txt"
# Step 2: Edit constants.py and set OMIM_PERSIST_DIRECTORY = "examples/example_data/omim_vectorstore"
# Step 3: Get LLM scores by running the command below with --omim_rag enabled

Navigate to the parent directory to run the command:
```
cd ..
```

With RAG:
```
$ python scripts/llm_lasso_scores.py \
        --prompt-filename "examples/example_data/user_prompt.txt" \
        --feature_names_path "examples/example_data/example_genenames.txt" \
        --category "Follicular Lymphoma (FL) and Diffuse Large B-Cell Lymphoma (DLBCL)" \
        --wipe \
        --omim_rag \
        --save_dir "examples/example_data" \
        --n-trials 1 \
        --model-type gpt-4o \
        --temp 0
```

To use additional features for penalty collection, see the documentation in src/llm_lasso/llm_penalty/penalty_collection.py.