#Introduction

This notebook is part of the DSA Databricks blog post here: <link here>

This is the final notebook. It sets up a Genie space using fake patient data generated with the Python library Faker.

Recommended Compute Type: Classic Compute

Recommended Runtime: 16.4 ML

#Install dependencies

In [0]:
%pip install faker
dbutils.library.restartPython()

Note: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.


In [0]:
from config import (
    volume_label, volume_name, catalog, schema, model_name, model_endpoint_name,
    embedding_table_name, embedding_table_name_index, registered_model_name,
    vector_search_endpoint_name,
)

In [0]:
import pandas as pd
import random
from faker import Faker

In [0]:
spark.sql(f"CREATE CATALOG IF NOT EXISTS {catalog}")
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema}")
spark.sql(f"DROP TABLE IF EXISTS {catalog}.{schema}.patient_visits")
spark.sql(f"DROP TABLE IF EXISTS {catalog}.{schema}.practice_locations")

DataFrame[]

In [0]:
fake = Faker()

# Define fixed options
patients = [fake.unique.first_name() + " " + fake.unique.last_name() for _ in range(15)]
insurance_providers = ["Insurance Company 2", "Insurance Company 3", "Fake Company", "Insurance Company 1"]
insurance_types = ["HMO", "PPO", "EPO"]
reasons_for_visit = [
    "Routine Checkup", "Flu Symptoms", "Injury", "Chronic Condition", 
    "Follow-up", "Prescription Refill", "Surgery", "Physical Therapy"
]
cities = ["LA", "Chicago", "NY"]

# Assign a random but fixed insurance provider and type to each patient
patient_insurance = {patient: random.choice(insurance_providers) for patient in patients}
patient_insurance_type = {patient: random.choice(insurance_types) for patient in patients}

# Generate patient visit data
data = []
for _ in range(300):
    patient = random.choice(patients)
    # Split on the first space only, so a multi-word last name cannot break the unpacking
    first_name, last_name = patient.split(" ", 1)
    insurance_provider = patient_insurance[patient]
    insurance_type = patient_insurance_type[patient]
    policy_number = fake.uuid4() if random.random() > 0.2 else None  # 80% chance to have a policy number
    email = fake.email()
    city = random.choice(cities)
    practice_id = fake.random_int(min=1000, max=9999)
    doctor_notes = fake.sentence()
    reason_for_visit = random.choice(reasons_for_visit)

    data.append([
        first_name, last_name, insurance_provider, insurance_type, policy_number,
        email, city, practice_id, doctor_notes, reason_for_visit
    ])

# Create DataFrame
columns = [
    "first_name", "last_name", "insurance_provider_name", "insurance_type",
    "insurance_policy_number", "email", "city", "practice_visited_practice_id",
    "doctor_notes", "reason_for_visit"
]
patients_visits_df = pd.DataFrame(data, columns=columns)

# Convert to Spark DataFrame
patients_visits_spark_df = spark.createDataFrame(patients_visits_df)

# Save DataFrame to specified catalog and schema
patients_visits_spark_df.write.saveAsTable(f"{catalog}.{schema}.patient_visits")
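
The cell above draws from unseeded generators, so each run produces a different table. Below is a minimal sketch of making the per-patient insurance assignment repeatable using only the standard library. The function name `assign_insurance`, the seed value, and the sample patient names are illustrative assumptions and not part of this notebook. Values produced by Faker can be fixed in a similar way by calling `Faker.seed(...)` before generating data.

```python
import random

def assign_insurance(patients, providers, plan_types, seed=42):
    """Map each patient to one fixed (provider, plan type) pair, reproducibly."""
    # A local generator keeps the global `random` state untouched
    rng = random.Random(seed)
    return {p: (rng.choice(providers), rng.choice(plan_types)) for p in patients}

# Hypothetical patient names, for illustration only
sample_patients = ["Patient A", "Patient B", "Patient C"]
providers = ["Insurance Company 1", "Insurance Company 2", "Insurance Company 3", "Fake Company"]
plan_types = ["HMO", "PPO", "EPO"]

first = assign_insurance(sample_patients, providers, plan_types)
second = assign_insurance(sample_patients, providers, plan_types)

# Same seed and same inputs give the same assignment on every run
print(first == second)
```

Seeding is optional for a demo, but it makes the answers Genie returns comparable across reruns of the notebook.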

In [0]:
# Practice Location Table 

import random

# Define possible values
cities = ["LA", "Chicago", "NY"]
insurance_providers = ["Insurance Company 2", "Insurance Company 3", "Fake Company", "Insurance Company 1"]
insurance_plan_types = ["HMO", "PPO", "EPO"]
network_statuses = ["In Network", "Out of Network"]

# Generate sample data
num_entries = 50
data = {
    "practice_name": [f"Medical Center {i}" for i in range(1, num_entries + 1)],
    "city": [random.choice(cities) for _ in range(num_entries)],
    "contact": [f"(555) 555-12{str(i).zfill(2)}" for i in range(num_entries)],
    "insurance_id": [f"INS-{random.randint(1000, 9999)}" for _ in range(num_entries)],
    "insurance_company": [random.choice(insurance_providers) for _ in range(num_entries)],
    "insurance_plan_type": [random.choice(insurance_plan_types) for _ in range(num_entries)],
    "network_status": [random.choice(network_statuses) for _ in range(num_entries)],
}

# Create DataFrame
practice_locations_df = spark.createDataFrame(pd.DataFrame(data))

# Save DataFrame to specified catalog and schema
practice_locations_df.write.saveAsTable(f"{catalog}.{schema}.practice_locations")
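
The two tables share no explicit key, so questions that span both have to match on descriptive columns. The sketch below shows one plausible way to relate them, assuming a match on city, insurance company, and plan type. That matching rule, the function name `in_network_practices`, and the hand-written rows are assumptions for illustration, not something this notebook defines.

```python
def in_network_practices(visit, practices):
    """Return names of in-network practices matching a visit's city, insurer, and plan type."""
    return [
        p["practice_name"]
        for p in practices
        if p["city"] == visit["city"]
        and p["insurance_company"] == visit["insurance_provider_name"]
        and p["insurance_plan_type"] == visit["insurance_type"]
        and p["network_status"] == "In Network"
    ]

# Hand-written rows for illustration; the real tables contain random values
visit = {"city": "LA", "insurance_provider_name": "Fake Company", "insurance_type": "PPO"}
practices = [
    {"practice_name": "Medical Center 1", "city": "LA", "insurance_company": "Fake Company",
     "insurance_plan_type": "PPO", "network_status": "In Network"},
    {"practice_name": "Medical Center 2", "city": "LA", "insurance_company": "Fake Company",
     "insurance_plan_type": "PPO", "network_status": "Out of Network"},
    {"practice_name": "Medical Center 3", "city": "NY", "insurance_company": "Fake Company",
     "insurance_plan_type": "PPO", "network_status": "In Network"},
]

matches = in_network_practices(visit, practices)
print(matches)
```

Because the generated data is random, some patients may have no matching in-network practice at all.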

#Creating the Genie Space

Unfortunately, there is currently no programmatic way to create a Genie space. Go to the Genie Space section in the UI and create a new space that points to the tables printed below.

You can follow the instructions here to understand how to set up a Genie Space: https://docs.databricks.com/aws/en/genie/set-up

In [0]:
print(f"Table 1: {catalog}.{schema}.patient_visits")
print(f"Table 2: {catalog}.{schema}.practice_locations")

Table 1: austin_choi_demo_catalog.agents.patient_visits
Table 2: austin_choi_demo_catalog.agents.practice_locations
