## Retrieval-Augmented Generation (RAG) Proof of Concept on Healthcare Mortality Data

This notebook demonstrates a **proof of concept (POC)** for building a RAG pipeline. 

The goal: allow users to ask natural language questions (e.g., *"What was the heart disease mortality rate in Texas in 2019?"*) and get grounded answers **only from structured healthcare data**.

**Key steps:**
1. Convert structured rows into natural-language "facts"
2. Embed facts into vector space using `SentenceTransformers`
3. Store embeddings in a FAISS index for efficient retrieval
4. Given a query, retrieve the most relevant facts
5. Pass facts + query into an LLM (via Ollama) for grounded generation

This is a minimal POC to test feasibility before scaling up with bigger datasets.

### Setup

We load required libraries:

- `sentence_transformers` for embeddings
- `faiss` for vector similarity search
- `ollama` for running LLMs

In [43]:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
import faiss
import subprocess
import json

### Turn Data into Facts

We use a small dataset of healthcare mortality rates. Each row will be converted into a "fact" — a natural-language string describing the data point.

This makes the data easier for LLMs to consume.

In [24]:
df = pd.read_csv("../poc/data/Underlying Cause of Death, 1999-2020.xls", sep="\t")
df = df.rename(
    columns = {
        'State':'state',
        'Year':'year',
        'ICD-10 113 Cause List':'cause',
        'Deaths':'deaths',
        'Population':'population',
        'Crude Rate':'crude_rate'
    }
)
df = df[['state','year','cause','deaths','population','crude_rate']]
df['crude_rate'] = pd.to_numeric(df['crude_rate'], errors='coerce')
df['year'] = pd.to_numeric(df['year'], errors='coerce')
df = df.astype({"deaths": "float64", "population": "float64"})
df['cause'] = (
    df['cause']
    .str.replace(r"\(.*?\)","",regex=True)
    .str.replace(r"#","",regex=True)
    .str.strip()
) 
df['crude_rate'] = np.round(df['deaths'] / df['population'] * 1e5,1)
df.head()

Unnamed: 0,state,year,cause,deaths,population,crude_rate
0,Alabama,2015.0,Certain other intestinal infections,159.0,4858979.0,3.3
1,Alabama,2015.0,Tuberculosis,11.0,4858979.0,0.2
2,Alabama,2015.0,Septicemia,1046.0,4858979.0,21.5
3,Alabama,2015.0,Viral hepatitis,96.0,4858979.0,2.0
4,Alabama,2015.0,Human immunodeficiency virus disease,126.0,4858979.0,2.6


In [None]:
facts = []

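# Build one sentence per row. Note that 'year' is a float column here,
# so it is rendered as e.g. "2015.0" inside the fact text.
for i, row in df.iterrows():
    fact = f"In {row['state']} in {row['year']}, the {row['cause']} mortality rate was {row['crude_rate']} per 100,000."
    facts.append(fact)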

print(facts[:5])
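
The loop above can also be written as a small helper. The sketch below is a hypothetical variant, not the notebook's own code: `row_to_fact` is an assumed name, it casts the year to an integer so the sentence reads "2015" rather than "2015.0", and it skips rows whose year or rate is missing. It works on plain dictionaries so it runs without the dataset.

```python
import math


def row_to_fact(row):
    """Return a one-sentence fact for a row, or None if the row is unusable.

    `row` is any mapping with 'state', 'year', 'cause' and 'crude_rate' keys.
    (Hypothetical helper; the notebook builds the string inline.)
    """
    year, rate = row["year"], row["crude_rate"]
    # Values that could not be parsed as numbers show up as NaN
    if any(isinstance(v, float) and math.isnan(v) for v in (year, rate)):
        return None
    return (
        f"In {row['state']} in {int(year)}, the {row['cause']} "
        f"mortality rate was {rate} per 100,000."
    )


# Two illustrative rows: the first mirrors a row shown above, the second has a missing year
example_rows = [
    {"state": "Alabama", "year": 2015.0, "cause": "Tuberculosis", "crude_rate": 0.2},
    {"state": "Alabama", "year": float("nan"), "cause": "Septicemia", "crude_rate": 21.5},
]
example_facts = [f for f in map(row_to_fact, example_rows) if f is not None]
print(example_facts)
```

With a DataFrame, the same helper could be applied to each `row` yielded by `df.iterrows()`, since a pandas row supports the same key lookups.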

### Embeddings + FAISS Index

We use SentenceTransformers to embed each fact into a high-dimensional vector. These embeddings capture the semantic meaning of each fact. We then store them in a FAISS index, which allows fast similarity search.

In [None]:
embedder = SentenceTransformer('all-MiniLM-L6-v2')
fact_embeddings = embedder.encode(facts, convert_to_numpy=True)
print(fact_embeddings.shape)

dimension = fact_embeddings.shape[1]
# set up a search engine that compares vectors by distance
index = faiss.IndexFlatL2(dimension)
# load all the fact embeddings into the search engine so they can be queried later
index.add(fact_embeddings.astype("float32"))
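
To make "compares vectors by distance" concrete, here is a pure-Python sketch of what an exhaustive ("flat") L2 search computes: the squared Euclidean distance from the query to every stored vector, sorted from smallest to largest. The function name `flat_l2_search` and the 3-dimensional toy vectors are made up for illustration; this is not FAISS code, only the same idea at small scale.

```python
def flat_l2_search(query, vectors, k):
    """Return (distances, indices) of the k stored vectors nearest to `query`.

    Distances are squared Euclidean distances, smallest first.
    """
    scored = []
    for idx, vec in enumerate(vectors):
        dist = sum((q - v) ** 2 for q, v in zip(query, vec))
        scored.append((dist, idx))
    scored.sort()  # ascending by distance, ties broken by index
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


# Toy "embeddings" standing in for real fact vectors
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 0.0],
]
toy_distances, toy_indices = flat_l2_search([1.0, 0.0, 0.0], toy_vectors, k=2)
print(toy_distances, toy_indices)  # [0.0, 0.5] [0, 2]
```

A flat index does this comparison against every stored vector, which is exact but scales linearly with the number of facts. That is acceptable for a POC of this size.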

### Why convert rows to "facts"?

An LLM is a language model: it operates on text, not on structured databases such as rows in a SQL table. A raw CSV row gives it little context to reason about. If we turn the row into a sentence such as "In Texas in 2019, the heart disease mortality rate was 153 per 100,000.", the model can read and reason in its native format. That is why we transform structured data into sentences, or "facts".

This also makes semantic search possible. If a user asks "What was the heart disease mortality rate in Texas in 2019?", the embedding of that query should lie close to the embedding of the matching fact string.

### What is RAG?

RAG means Retrieval-Augmented Generation. Imagine you ask an LLM "What was the heart disease mortality rate in Texas in 2019?" The model doesn't actually have your dataset inside it. It will just "guess" based on patterns it learned during training. That guess might be wrong, outdated, or incomplete.

RAG works in three steps:

- Retrieve: Before answering, look up relevant information from an external knowledge base (like your facts in FAISS)
- Augment: Add that fact into the prompt given to the LLM. So the prompt will have both the CONTEXT + QUESTION
- Generate: The LLM now produces the answer using both its language skills and the retrieved context. So instead of guessing, it reads from your actual data.
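
The three steps can be sketched as three small functions. Everything here is a simplified stand-in under stated assumptions: `retrieve` ranks facts by shared words instead of embedding similarity, and `echo_llm` is a placeholder that returns the first fact in the prompt rather than calling a real model. The function names are hypothetical; the two facts are taken from the outputs in this notebook, with the year written as an integer.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, facts, k=1):
    """Retrieve: rank facts by word overlap (a stand-in for vector search)."""
    question_tokens = tokenize(question)
    ranked = sorted(
        facts, key=lambda fact: len(question_tokens & tokenize(fact)), reverse=True
    )
    return ranked[:k]


def augment(question, context_facts):
    """Augment: put the retrieved facts and the question into one prompt."""
    context = "\n".join(f"- {fact}" for fact in context_facts)
    return (
        "Use only the facts provided to answer the question.\n\n"
        f"Facts:\n{context}\n\n"
        f"Question: {question}\n\nAnswer:"
    )


def generate(prompt, llm):
    """Generate: hand the prompt to any callable that maps text to text."""
    return llm(prompt)


def echo_llm(prompt):
    """Placeholder model: returns the first fact line found in the prompt."""
    for line in prompt.splitlines():
        if line.startswith("- "):
            return line[2:]
    return "I don't have data on that."


toy_facts = [
    "In Florida in 2020, the Malignant neoplasms mortality rate was 210.7 per 100,000.",
    "In Alabama in 2015, the Tuberculosis mortality rate was 0.2 per 100,000.",
]
question = "What was the tuberculosis mortality rate in Alabama in 2015?"
context = retrieve(question, toy_facts, k=1)
answer = generate(augment(question, context), echo_llm)
print(answer)
```

Passing the model in as a callable keeps the retrieval and prompt logic testable without a running LLM.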

Why it matters:

✅ Keeps answers grounded in your data

✅ Works with private datasets the LLM has never seen

✅ Reduces hallucinations (the model making things up)

### Query + Retrieval

Given a user query, we:

- Embed the query in the same vector space
- Search the FAISS index built above for the closest facts
- Return the top K most relevant facts
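
A FAISS search returns two arrays of shape `(n_queries, k)`: distances and row positions. The sketch below uses plain nested lists in place of those arrays and pairs each position with its fact and distance. `collect_hits` is a hypothetical helper name, and the toy values are invented. It also skips position `-1`, which FAISS is documented to use as a filler when fewer than `k` results are available.

```python
def collect_hits(distances, indices, facts):
    """Pair each retrieved fact with its distance for the first query.

    `distances` and `indices` are 2-D (one row per query), as returned by
    a FAISS-style search; only row 0 is read here.
    """
    hits = []
    for dist, idx in zip(distances[0], indices[0]):
        if idx < 0:  # filler entry: no result in this slot
            continue
        hits.append((facts[idx], float(dist)))
    return hits


# Toy search result: k=3 requested, only two real matches returned
toy_fact_list = ["fact A", "fact B", "fact C"]
toy_D = [[0.12, 0.40, 3.4e38]]
toy_I = [[2, 0, -1]]
hits = collect_hits(toy_D, toy_I, toy_fact_list)
print(hits)  # [('fact C', 0.12), ('fact A', 0.4)]
```

Keeping the distance next to each fact makes it easier to spot weak matches before they reach the prompt.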

In [57]:
query = "Did cancer mortality increase or decrease in Florida between 2015 and 2020?"

query_embeddings = embedder.encode([query], convert_to_numpy=True)
D, I = index.search(query_embeddings.astype("float32"), k=2)
retrieved_facts = [facts[i] for i in I[0]]

print(retrieved_facts)

# Augment + Generate

prompt = f"""
    You are a helpful assistant. Use only the facts provided to answer the question. 
    If the answer is not in the facts, say "I don't have data on that."

    Facts:
    {retrieved_facts}

    Question:

    {query}

    Answer:
"""

process = subprocess.run(
    ["ollama", "run", "mistral", prompt],
    capture_output=True,
    text=True
)

print(process.stdout)

['In Florida in 2020.0, the Malignant neoplasms mortality rate was 210.7 per 100,000.', 'In Florida in 2015.0, the Malignant neoplasms mortality rate was 217.2 per 100,000.']


 The cancer mortality rate decreased in Florida between 2015 and 2020. This can be concluded as the given data shows a lower mortality rate in 2020 (210.7) compared to 2015 (217.2).




In [58]:
df[(df['year']==2015)&(df['state']=='Florida')&(df['cause'].str.contains("malignant neoplasm", case=False, na=False))]

Unnamed: 0,state,year,cause,deaths,population,crude_rate
3233,Florida,2015.0,Malignant neoplasms,44027.0,20271272.0,217.2
3234,Florida,2015.0,"Malignant neoplasms of lip, oral cavity and ph...",778.0,20271272.0,3.8
3235,Florida,2015.0,Malignant neoplasm of esophagus,1120.0,20271272.0,5.5
3236,Florida,2015.0,Malignant neoplasm of stomach,751.0,20271272.0,3.7
3237,Florida,2015.0,"Malignant neoplasms of colon, rectum and anus",3779.0,20271272.0,18.6
3238,Florida,2015.0,Malignant neoplasms of liver and intrahepatic ...,1942.0,20271272.0,9.6
3239,Florida,2015.0,Malignant neoplasm of pancreas,3125.0,20271272.0,15.4
3240,Florida,2015.0,Malignant neoplasm of larynx,331.0,20271272.0,1.6
3241,Florida,2015.0,"Malignant neoplasms of trachea, bronchus and lung",11943.0,20271272.0,58.9
3243,Florida,2015.0,Malignant neoplasm of breast,2890.0,20271272.0,14.3
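
The first row above can be used to cross-check the retrieved fact and the model's answer with plain arithmetic. The sketch below recomputes the 2015 crude rate from the deaths and population shown in the table, then computes the relative change between the two retrieved rates. The helper names are hypothetical. The 2020 rate is taken from the retrieved fact, because the 2020 deaths and population are not shown here.

```python
def crude_rate(deaths, population):
    """Deaths per 100,000 population, rounded to one decimal place."""
    return round(deaths / population * 1e5, 1)


def percent_change(old, new):
    """Relative change from `old` to `new`, in percent, one decimal place."""
    return round((new - old) / old * 100, 1)


# Florida, 2015, "Malignant neoplasms" row from the table above
rate_2015 = crude_rate(44027, 20271272)
# Rates quoted in the two retrieved facts
change = percent_change(217.2, 210.7)

print(rate_2015)  # 217.2
print(change)     # -3.0
```

The recomputed rate matches the retrieved fact, and the negative change agrees with the model's answer that the rate decreased.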
