<a href="https://colab.research.google.com/github/nick-allen21/synthetic_patient_analysis/blob/main/nallen21_project1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CS145: Project 1 | Synthetic Patient Analysis

## Author
* *Nicholas Allen, nallen21*

# Section 1: Project Overview

---
I am using synthetic patient data to create actionable outcomes for patients. These questions will give them insight into the biggest health risks for their specific demographic and state, as well as which providers and payers they should pursue to help mitigate these risks.

1. Identify the top 3 chronic conditions (e.g., hypertension, diabetes) for each age and gender group.
2. Compare the rate of encounters and disease occurrence across high-, middle-, and low-income groups.
3. Determine the provider and payer combination in each state that has the longest encounter time with its patients.
---


# Section 2: Dataset Analysis

### 2.1) Initialize BigQuery resources: authenticate, connect to the bucket, and create the dataset

In [46]:
from google.colab import auth
auth.authenticate_user()

from google.cloud import bigquery, storage

PROJECT_ID = "cs145-project-1-475101"
BUCKET = "nallen21_bucket_cs145"
DATASET_ID = "cs145_data"

bq_client = bigquery.Client(project=PROJECT_ID)
storage_client = storage.Client(project=PROJECT_ID)

# Create dataset if not already created
dataset_ref = bigquery.Dataset(f"{PROJECT_ID}.{DATASET_ID}")
dataset_ref.location = "US"
bq_client.create_dataset(dataset_ref, exists_ok=True)

Dataset(DatasetReference('cs145-project-1-475101', 'cs145_data'))

### 2.2) Detailed Overview

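The data lives in a Google Cloud Storage bucket and is exposed to BigQuery through external tables, so table sizes cannot be read from the BigQuery table metadata. Instead, I use `gsutil` to get the size of each source file and a `COUNT(*)` query to get the row count. The external tables are created in the same loop.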


In [48]:
import subprocess
from google.cloud import bigquery

table_ids = [
    "conditions",
    "encounters",
    "medications",
    "observations",
    "patients",
    "payer_transitions",
    "payers",
    "providers",
]

for table_name in table_ids:
    table_id = f"{PROJECT_ID}.{DATASET_ID}.{table_name}_ext"
    gcs_path = f"gs://{BUCKET}/{table_name}.csv"

    # ---- Create external table ----
    external_config = bigquery.ExternalConfig("CSV")
    external_config.source_uris = [gcs_path]
    external_config.autodetect = True
    external_config.options.skip_leading_rows = 1

    table = bigquery.Table(table_id)
    table.external_data_configuration = external_config

    bq_client.create_table(table, exists_ok=True)
    print(f"\nExternal table created: {table_id}")

    # ---- Get row count ----
    row_query = f"SELECT COUNT(*) AS row_count FROM `{table_id}`"
    row_count = bq_client.query(row_query).to_dataframe().iloc[0, 0]

    # ---- Get GCS size (with units) ----
    # `gsutil du -s` prints a summary total in bytes, so we can choose the units ourselves
    raw = subprocess.run(["gsutil", "du", "-s", gcs_path], capture_output=True, text=True).stdout.strip()

    if raw:
        bytes_size = int(raw.split()[0])
        mb = bytes_size / (1024 * 1024)
        # choose MB or GB for nicer display
        gcs_size = f"{mb:.2f} MB" if mb < 1024 else f"{mb/1024:.2f} GB"
    else:
        gcs_size = "N/A"


    print(f"{table_name}_ext — Rows: {row_count:,}, Size: {gcs_size}")



External table created: cs145-project-1-475101.cs145_data.conditions_ext
conditions_ext — Rows: 2,522,756, Size: 379.36 MB

External table created: cs145-project-1-475101.cs145_data.encounters_ext
encounters_ext — Rows: 4,850,448, Size: 1.51 GB

External table created: cs145-project-1-475101.cs145_data.medications_ext
medications_ext — Rows: 5,992,699, Size: 1.47 GB

External table created: cs145-project-1-475101.cs145_data.observations_ext
observations_ext — Rows: 48,847,506, Size: 8.17 GB

External table created: cs145-project-1-475101.cs145_data.patients_ext
patients_ext — Rows: 61,208, Size: 17.28 MB

External table created: cs145-project-1-475101.cs145_data.payer_transitions_ext
payer_transitions_ext — Rows: 2,438,672, Size: 392.72 MB

External table created: cs145-project-1-475101.cs145_data.payers_ext
payers_ext — Rows: 440, Size: 0.07 MB

External table created: cs145-project-1-475101.cs145_data.providers_ext
providers_ext — Rows: 40,002, Size: 7.23 MB


SQL: Show table sizes, row counts

### 2.3) Table Relationships

#### patients_ext
*   **primary key:** `Id`

#### encounters_ext
*   **primary key:** `Id`
*   **foreign key:** `PATIENT` to `Id` in patients_ext

#### conditions_ext
*   **foreign key:** `ENCOUNTER` to `Id` in encounters_ext
*   **foreign key:** `PATIENT` to `Id` in patients_ext

#### payers_ext
*   **primary key:** `Id`

#### payer_transitions_ext
*   **foreign key:** `PAYER` to `Id` in payers_ext
*   **foreign key:** `PATIENT` to `Id` in patients_ext

#### observations_ext
*   **foreign key:** `PATIENT` to `Id` in patients_ext
*   **foreign key:** `ENCOUNTER` to `Id` in encounters_ext

#### medications_ext
*   **foreign key:** `PATIENT` to `Id` in patients_ext
*   **foreign key:** `ENCOUNTER` to `Id` in encounters_ext
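
The relationships listed above are declared by convention only, since nothing enforces them on external tables. One way to check that a foreign key actually resolves is an anti-join that looks for child rows with no matching parent. The sketch below runs against a small in-memory SQLite database as a stand-in for BigQuery; the helper name `find_orphans` and the toy rows are assumptions made for illustration, not part of the project data.

```python
import sqlite3

def find_orphans(conn, child, fk_col, parent, pk_col="Id"):
    """Return child-side key values that have no matching row in the parent table."""
    sql = f"""
        SELECT c.{fk_col}
        FROM {child} AS c
        LEFT JOIN {parent} AS p ON c.{fk_col} = p.{pk_col}
        WHERE c.{fk_col} IS NOT NULL   -- NULL keys are a separate issue
          AND p.{pk_col} IS NULL       -- no parent row matched
    """
    return [row[0] for row in conn.execute(sql)]

# Toy stand-in tables (hypothetical rows, not the real dataset)
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patients (Id TEXT);
    CREATE TABLE encounters (Id TEXT, PATIENT TEXT);
    INSERT INTO patients VALUES ('p1'), ('p2');
    INSERT INTO encounters VALUES ('e1', 'p1'), ('e2', 'p2'), ('e3', 'p9');
""")

orphans = find_orphans(conn, "encounters", "PATIENT", "patients")
print(orphans)  # ['p9']
```

The same `LEFT JOIN ... WHERE parent key IS NULL` pattern should carry over to the `*_ext` tables by swapping in the fully qualified table names.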









### 2.4) Data Issues


SQL: Check for NULLs, duplicates

In [80]:
# patients_ext: rows with a NULL primary key (expect none)
query = f"""
SELECT Id AS null_patient_id
FROM `{PROJECT_ID}.{DATASET_ID}.patients_ext`
WHERE Id IS NULL
"""
df = bq_client.query(query).to_dataframe()
display(df)

# patients_ext: primary key values that appear more than once (expect none)
query = f"""
SELECT COUNT(Id) AS duplicate_id_patient_count
FROM `{PROJECT_ID}.{DATASET_ID}.patients_ext`
GROUP BY Id
HAVING COUNT(Id) > 1
"""
df = bq_client.query(query).to_dataframe()
display(df)

# encounters_ext: rows with a NULL primary key (expect none)
query = f"""
SELECT Id AS null_encounters_id
FROM `{PROJECT_ID}.{DATASET_ID}.encounters_ext`
WHERE Id IS NULL
"""
df = bq_client.query(query).to_dataframe()
display(df)

# encounters_ext: primary key values that appear more than once (expect none)
query = f"""
SELECT COUNT(Id) AS duplicate_encounters_id
FROM `{PROJECT_ID}.{DATASET_ID}.encounters_ext`
GROUP BY Id
HAVING COUNT(Id) > 1
"""
df = bq_client.query(query).to_dataframe()
display(df)

# conditions_ext: rows missing either foreign key
query = f"""
SELECT ENCOUNTER AS null_condition_encounter, PATIENT AS null_condition_patient
FROM `{PROJECT_ID}.{DATASET_ID}.conditions_ext`
WHERE (
  ENCOUNTER IS NULL
  OR
  PATIENT IS NULL
)
"""
df = bq_client.query(query).to_dataframe()
display(df)

# observations_ext: rows with a NULL PATIENT foreign key
query = f"""
SELECT PATIENT AS null_observation_patient
FROM `{PROJECT_ID}.{DATASET_ID}.observations_ext`
WHERE (
  PATIENT IS NULL
)
"""
df = bq_client.query(query).to_dataframe()
display(df)

# observations_ext: rows with a NULL ENCOUNTER foreign key
query = f"""
SELECT ENCOUNTER AS null_observation_encounter
FROM `{PROJECT_ID}.{DATASET_ID}.observations_ext`
WHERE (
  ENCOUNTER IS NULL
)
"""
df = bq_client.query(query).to_dataframe()
display(df)


# payer_transitions_ext: patients that appear in more than one row
# (PATIENT is not a unique key in this table)
query = f"""
SELECT PATIENT AS dup_payer_transitions_patient
FROM `{PROJECT_ID}.{DATASET_ID}.payer_transitions_ext`
GROUP BY PATIENT
HAVING COUNT(PATIENT) > 1
"""
df = bq_client.query(query).to_dataframe()
display(df)

Unnamed: 0,null_patient_id


Unnamed: 0,duplicate_id_patient_count


Unnamed: 0,null_encounters_id


Unnamed: 0,duplicate_encounters_id


Unnamed: 0,null_condition_encounter,null_condition_patient


Unnamed: 0,null_observation_patient


Unnamed: 0,null_observation_encounter
0,
1,
2,
3,
4,
...,...
891982,
891983,
891984,
891985,


Unnamed: 0,dup_payer_transitions_patient
0,b5462c9a-0893-fad5-4a10-c99d6fe567cc
1,bf8cba8b-d398-a9b8-5a52-9666caa2ac9b
2,cd06b749-3b9e-22b7-49e6-bafdeff7747b
3,0c654c40-c1d7-bb91-c256-cf794a866fbc
4,9874751e-a4c0-4755-00fd-12b4c57db3ab
...,...
59992,9bd44063-a30b-6e76-0228-a67f32996fb1
59993,36b89681-41d8-5d04-77fb-acdf3710d237
59994,f08f8921-03b8-38e4-c988-47b6006d81e8
59995,6a9301d2-8534-9e22-b866-42a6a277995a
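
The NULL check on `ENCOUNTER` in observations_ext returns roughly 890k individual rows, which is hard to read as a data-quality signal. A possible alternative is to return counts instead of rows. The sketch below computes total rows, NULL count, and duplicate count for one column in a single query. It uses an in-memory SQLite table with made-up rows as a stand-in for BigQuery, and the helper name `key_quality_summary` is an assumption for illustration.

```python
import sqlite3

def key_quality_summary(conn, table, column):
    """Return (total_rows, null_count, duplicate_count) for one column of a table."""
    sql = f"""
        SELECT
          COUNT(*) AS total_rows,
          SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END) AS null_count,
          -- COUNT(col) skips NULLs, so this counts repeated non-NULL values
          COUNT({column}) - COUNT(DISTINCT {column}) AS duplicate_count
        FROM {table}
    """
    return conn.execute(sql).fetchone()

# Toy stand-in table (hypothetical rows, not the real dataset)
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE observations (PATIENT TEXT, ENCOUNTER TEXT);
    INSERT INTO observations VALUES
      ('p1', 'e1'), ('p1', 'e1'), ('p2', NULL), ('p3', 'e2');
""")

summary = key_quality_summary(conn, "observations", "ENCOUNTER")
print(summary)  # (4, 1, 1)
```

The SQL uses only `COUNT`, `SUM`, and `CASE`, so it should run on BigQuery as well once the table name is replaced with the backtick-quoted `*_ext` path.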


# Section 3: Get Your Feet Wet

Debug tables for the following queries:
* Subquery 1: [your title]
* CTE 1: [your title]

## Subqueries

2 queries with scoped variables

### SubQuery 1:


Select all patients who have had at least one encounter whose base cost is more than twice their own average base encounter cost

In [104]:
query = f"""
SELECT DISTINCT p.FIRST, p.MIDDLE, p.LAST
FROM `{PROJECT_ID}.{DATASET_ID}.patients_ext` AS p
-- inner join: the WHERE filter on e already discards patients without encounters
JOIN `{PROJECT_ID}.{DATASET_ID}.encounters_ext` AS e ON p.Id = e.PATIENT
WHERE e.BASE_ENCOUNTER_COST > (
  -- correlated subquery: twice this patient's own average base encounter cost
  SELECT AVG(e2.BASE_ENCOUNTER_COST) * 2 AS double_avg_cost
  FROM `{PROJECT_ID}.{DATASET_ID}.encounters_ext` AS e2
  WHERE e2.PATIENT = p.Id -- uses the patient id from the outer query
)
"""
df = bq_client.query(query).to_dataframe()
display(df)

Unnamed: 0,FIRST,MIDDLE,LAST
0,Alejandro916,Manuel446,Acevedo301
1,Lynda214,Lura184,Kreiger457
2,Hester117,Valda518,Legros616
3,Tom274,,Wilkinson796
4,Zane918,Noel608,Rohan584
...,...,...,...
343,Lucas404,Marcus77,Erdman779
344,Otto672,,Lindgren255
345,Mickey576,Owen89,Bradtke547
346,Quincy153,Willie882,Mertz280
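
To make the semantics of the correlated subquery concrete, here is a minimal sketch on toy data. It uses an in-memory SQLite table as a stand-in for encounters_ext; the rows and costs are invented for illustration. Patient `p1` averages 40, so the 100-cost encounter exceeds twice the average (80). Patient `p2` averages 50, and no encounter exceeds 100.

```python
import sqlite3

# Toy stand-in for encounters_ext (hypothetical rows, not the real dataset)
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE encounters (PATIENT TEXT, BASE_ENCOUNTER_COST REAL);
    INSERT INTO encounters VALUES
      ('p1', 10), ('p1', 10), ('p1', 100),
      ('p2', 50), ('p2', 50);
""")

rows = conn.execute("""
    SELECT DISTINCT e.PATIENT
    FROM encounters AS e
    WHERE e.BASE_ENCOUNTER_COST > (
      -- re-evaluated per outer row: twice that patient's own average cost
      SELECT AVG(e2.BASE_ENCOUNTER_COST) * 2
      FROM encounters AS e2
      WHERE e2.PATIENT = e.PATIENT
    )
""").fetchall()

flagged = [r[0] for r in rows]
print(flagged)  # ['p1']
```

Because the threshold is each patient's own average, a patient with uniformly expensive encounters is not flagged, while a patient with one unusually costly visit is.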


### SubQuery 2:


Select all payers that cover at least one patient who owns their own plan (`PLAN_OWNERSHIP = 'Self'`)

In [107]:
query = f"""
SELECT p.NAME
FROM `{PROJECT_ID}.{DATASET_ID}.payers_ext` p
WHERE EXISTS (
  SELECT 1
  FROM `{PROJECT_ID}.{DATASET_ID}.payer_transitions_ext` pt
  WHERE pt.PAYER = p.Id
    AND pt.PLAN_OWNERSHIP = 'Self'
)
"""
df = bq_client.query(query).to_dataframe()
display(df)

Unnamed: 0,NAME
0,Medicare
1,Medicaid
2,Dual Eligible
3,Humana
4,Blue Cross Blue Shield
...,...
391,Blue Cross Blue Shield
392,UnitedHealthcare
393,Aetna
394,Cigna Health



## CTEs

2 queries with WITH clauses
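
As a reference for the shape of a `WITH` query, the sketch below names an intermediate result (encounters per patient) and then reuses it twice in the final query, once as the source and once to compute the overall average. It runs on an in-memory SQLite table with invented rows as a stand-in for encounters_ext. It is an illustration of the pattern, not one of the project queries.

```python
import sqlite3

# Toy stand-in for encounters_ext (hypothetical rows, not the real dataset)
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE encounters (Id TEXT, PATIENT TEXT);
    INSERT INTO encounters VALUES
      ('e1', 'p1'), ('e2', 'p1'), ('e3', 'p1'),
      ('e4', 'p2'),
      ('e5', 'p3'), ('e6', 'p3');
""")

above_average = conn.execute("""
    WITH encounter_counts AS (
      -- one row per patient with their number of encounters
      SELECT PATIENT, COUNT(*) AS n_encounters
      FROM encounters
      GROUP BY PATIENT
    )
    SELECT PATIENT, n_encounters
    FROM encounter_counts
    WHERE n_encounters > (SELECT AVG(n_encounters) FROM encounter_counts)
    ORDER BY PATIENT
""").fetchall()

print(above_average)  # [('p1', 3)]
```

Naming the per-patient counts once avoids repeating the `GROUP BY` in both the outer query and the average calculation.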

### CTE 1:



### CTE 2:



## Window Functions

3 queries with OVER, including RANK vs ROW_NUMBER. Please note the FAQs!
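
The difference between `RANK` and `ROW_NUMBER` only shows up when there are ties, which matters for a "top 3 per group" question. The sketch below ranks condition counts within a group where two conditions tie. It uses an in-memory SQLite table (window functions need SQLite 3.25 or newer); the table `condition_counts`, its columns, and its rows are hypothetical and exist only for this illustration.

```python
import sqlite3

# Toy pre-aggregated counts (hypothetical table and rows, not the real dataset)
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE condition_counts (age_group TEXT, description TEXT, n INTEGER);
    INSERT INTO condition_counts VALUES
      ('18-39', 'Hypertension', 5),
      ('18-39', 'Diabetes', 5),
      ('18-39', 'Asthma', 3);
""")

ranked = conn.execute("""
    SELECT
      description,
      n,
      -- tied rows share a rank, and the next rank is skipped
      RANK() OVER (PARTITION BY age_group ORDER BY n DESC) AS rnk,
      -- every row gets a distinct number; description breaks the tie deterministically
      ROW_NUMBER() OVER (PARTITION BY age_group ORDER BY n DESC, description) AS row_num
    FROM condition_counts
    ORDER BY row_num
""").fetchall()

for row in ranked:
    print(row)
# ('Diabetes', 5, 1, 1)
# ('Hypertension', 5, 1, 2)
# ('Asthma', 3, 3, 3)
```

Filtering on `rnk <= 3` can return more than three rows per group when counts tie, while filtering on `row_num <= 3` always returns at most three but picks among tied rows by the tie-breaker.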

### Window Function 1:



### Window Function 2:



### Window Function 3:



# Section 4: Exploring Central Questions

## Question 1: What are the top 3 chronic conditions for each age and gender group?

sql query and analysis

## Question 2: How does the rate of encounters and disease occurrence compare across high-, middle-, and low-income groups?

sql query and analysis

## Question 3: Which provider and payer combination in each state has the longest encounter time with its patients?

sql query and analysis

# Section 5: Takeaways

---

*TODO: Final conclusions based on the rest of your project*

---