<a href="https://colab.research.google.com/github/nick-allen21/synthetic_patient_analysis/blob/main/nallen21_project1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CS145: Project 1 | Project Name

## Author
* *Nicholas Allen, nallen21*

# Section 1: Project Overview

---
I am using synthetic patient data to create actionable outcomes for patients. The questions below will give them insight into the biggest health risks for their specific demographic and state, as well as which providers and payers they should pursue to help mitigate these risks.

1. Identify the top 3 chronic conditions (e.g., hypertension, diabetes) for each age and gender group.
2. Compare the rate of encounters and disease occurrence across high-, middle-, and low-income groups.
3. Determine the provider and payer combination in each state that has the longest encounter time with its patients.
---


# Section 2: Dataset Analysis

### 2.1) Initialize BigQuery resources by connecting to the bucket and creating tables

In [36]:
from google.colab import auth
auth.authenticate_user()

from google.cloud import bigquery, storage

PROJECT_ID = "cs145-project-1-475101"
BUCKET = "nallen21_bucket_cs145"
DATASET_ID = "cs145_data"

bq_client = bigquery.Client(project=PROJECT_ID)
storage_client = storage.Client(project=PROJECT_ID)

# Create dataset if not already created
dataset_ref = bigquery.Dataset(f"{PROJECT_ID}.{DATASET_ID}")
dataset_ref.location = "US"
bq_client.create_dataset(dataset_ref, exists_ok=True)

Dataset(DatasetReference('cs145-project-1-475101', 'cs145_data'))

In [43]:
table_ids = [
  "conditions",
  "encounters",
  "medications",
  "observations",
  "patients",
  "payer_transitions",
  "payers",
  "providers"
]
for table_name in table_ids:

  # Table name in BigQuery
  table_id = f"{PROJECT_ID}.{DATASET_ID}.{table_name}_ext"

  # GCS path for this table's CSV file
  source_uris = [f"gs://{BUCKET}/{table_name}.csv"]

  # Define external table configuration
  external_config = bigquery.ExternalConfig("CSV")
  external_config.source_uris = source_uris
  external_config.autodetect = True
  external_config.options.skip_leading_rows = 1  # skip CSV headers

  # Create table object and attach configuration
  table = bigquery.Table(table_id)
  table.external_data_configuration = external_config

  # Create or overwrite the table
  bq_client.create_table(table, exists_ok=True)
  print("External table successfully created:", table_id, "✅")


External table successfully created: cs145-project-1-475101.cs145_data.conditions_ext ✅
External table successfully created: cs145-project-1-475101.cs145_data.encounters_ext ✅
External table successfully created: cs145-project-1-475101.cs145_data.medications_ext ✅
External table successfully created: cs145-project-1-475101.cs145_data.observations_ext ✅
External table successfully created: cs145-project-1-475101.cs145_data.patients_ext ✅
External table successfully created: cs145-project-1-475101.cs145_data.payer_transitions_ext ✅
External table successfully created: cs145-project-1-475101.cs145_data.payers_ext ✅
External table successfully created: cs145-project-1-475101.cs145_data.providers_ext ✅


In [45]:
import subprocess
from google.cloud import bigquery

table_ids = [
    "conditions",
    "encounters",
    "medications",
    "observations",
    "patients",
    "payer_transitions",
    "payers",
    "providers",
]

for table_name in table_ids:
    table_id = f"{PROJECT_ID}.{DATASET_ID}.{table_name}_ext"
    gcs_path = f"gs://{BUCKET}/{table_name}.csv"

    # ---- Create external table ----
    external_config = bigquery.ExternalConfig("CSV")
    external_config.source_uris = [gcs_path]
    external_config.autodetect = True
    external_config.options.skip_leading_rows = 1

    table = bigquery.Table(table_id)
    table.external_data_configuration = external_config

    bq_client.create_table(table, exists_ok=True)
    print(f"\nExternal table created: {table_id}")

    # ---- Get row count ----
    row_query = f"SELECT COUNT(*) AS row_count FROM `{table_id}`"
    row_count = bq_client.query(row_query).to_dataframe().iloc[0, 0]

    # ---- Get GCS size (with units) ----
    # gsutil du -s prints the total size in bytes, so we can choose the display units ourselves
    raw = subprocess.run(["gsutil", "du", "-s", gcs_path], capture_output=True, text=True).stdout.strip()

    if raw:
        bytes_size = int(raw.split()[0])
        mb = bytes_size / (1024 * 1024)
        # choose MB or GB for nicer display
        gcs_size = f"{mb:.2f} MB" if mb < 1024 else f"{mb/1024:.2f} GB"
    else:
        gcs_size = "N/A"


    print(f"{table_name}_ext — Rows: {row_count:,}, Size: {gcs_size}")



External table created: cs145-project-1-475101.cs145_data.conditions_ext
📊 conditions_ext — Rows: 2,522,756, Size: 379.36 MB

External table created: cs145-project-1-475101.cs145_data.encounters_ext
📊 encounters_ext — Rows: 4,850,448, Size: 1.51 GB

External table created: cs145-project-1-475101.cs145_data.medications_ext
📊 medications_ext — Rows: 5,992,699, Size: 1.47 GB

External table created: cs145-project-1-475101.cs145_data.observations_ext
📊 observations_ext — Rows: 48,847,506, Size: 8.17 GB

External table created: cs145-project-1-475101.cs145_data.patients_ext
📊 patients_ext — Rows: 61,208, Size: 17.28 MB

External table created: cs145-project-1-475101.cs145_data.payer_transitions_ext
📊 payer_transitions_ext — Rows: 2,438,672, Size: 392.72 MB

External table created: cs145-project-1-475101.cs145_data.payers_ext
📊 payers_ext — Rows: 440, Size: 0.07 MB

External table created: cs145-project-1-475101.cs145_data.providers_ext
📊 providers_ext — Rows: 40,002, Size: 7.23 MB
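
The MB/GB formatting used in the cell above can be pulled into a small helper so it can be checked without calling `gsutil`. This is only a minimal sketch: the name `format_size` is hypothetical and does not appear in the notebook, and it assumes the same binary units (1 MB = 1024 * 1024 bytes) as the original code.

```python
def format_size(num_bytes: int) -> str:
    """Format a byte count as MB, switching to GB once it reaches 1024 MB."""
    mb = num_bytes / (1024 * 1024)
    return f"{mb:.2f} MB" if mb < 1024 else f"{mb / 1024:.2f} GB"


print(format_size(5 * 1024 * 1024))  # 5.00 MB
print(format_size(3 * 1024 ** 3))    # 3.00 GB
```

Inside the loop, `gcs_size = format_size(bytes_size)` would replace the inline conditional.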


### 2.2) Detailed Overview


NotFound: 404 Not found: Table cs145-project-1-475101:cs145_data.patient_ext was not found in location US; reason: notFound, message: Not found: Table cs145-project-1-475101:cs145_data.patient_ext was not found in location US

Location: US
Job ID: b7d742c2-2256-4c65-a989-14aea0ae0791


SQL: Show table sizes, row counts
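
One way to get all row counts in a single result is to generate one `UNION ALL` query over the eight `_ext` tables created in 2.1. The sketch below only builds the SQL string; the helper name `build_row_count_query` is hypothetical, and it covers row counts only, since the file sizes were read from the bucket with `gsutil` in 2.1.

```python
TABLES = ["conditions", "encounters", "medications", "observations",
          "patients", "payer_transitions", "payers", "providers"]


def build_row_count_query(project_id: str, dataset_id: str, tables: list[str]) -> str:
    """Return one query that reports the row count of every `<table>_ext` table."""
    selects = [
        f"SELECT '{name}_ext' AS table_name, COUNT(*) AS row_count "
        f"FROM `{project_id}.{dataset_id}.{name}_ext`"
        for name in tables
    ]
    # ORDER BY after the last SELECT applies to the combined result.
    return "\nUNION ALL\n".join(selects) + "\nORDER BY row_count DESC"


query = build_row_count_query("cs145-project-1-475101", "cs145_data", TABLES)
print(query)
```

In the notebook, the string could then be run with `bq_client.query(query).to_dataframe()`, as in 2.1.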

## Table Relationships

SQL: Explore keys and joins
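
A common way to explore a key relationship is to count child rows whose foreign key has no match in the parent table. The sketch below builds such a query. The helper name `build_orphan_check_query` is hypothetical, and the column names `PATIENT` and `Id` are assumptions about the CSV headers that should be confirmed against the autodetected schema before use.

```python
def build_orphan_check_query(child: str, child_key: str,
                             parent: str, parent_key: str) -> str:
    """Count child rows whose key has no matching row in the parent table.

    Rows with a NULL child key are also counted, since they match nothing.
    """
    return (
        f"SELECT COUNT(*) AS orphan_rows\n"
        f"FROM `{child}` AS c\n"
        f"LEFT JOIN `{parent}` AS p ON c.{child_key} = p.{parent_key}\n"
        f"WHERE p.{parent_key} IS NULL"
    )


# Assumed columns: encounters_ext.PATIENT references patients_ext.Id
orphan_query = build_orphan_check_query(
    "cs145-project-1-475101.cs145_data.encounters_ext", "PATIENT",
    "cs145-project-1-475101.cs145_data.patients_ext", "Id",
)
print(orphan_query)
```

A result of 0 would suggest every encounter joins to a patient; a nonzero count points at keys worth inspecting before relying on the join.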

## Data Issues


SQL: Check for NULLs, duplicates
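
NULL and duplicate checks on a single column can share one pass over the table. The sketch below generates that query; the helper name `build_quality_check_query` is hypothetical, and the column `Id` is an assumed header name that should be checked against the actual schema.

```python
def build_quality_check_query(table: str, column: str) -> str:
    """Build a query counting total rows, NULLs, and repeated values in one column."""
    return (
        f"SELECT\n"
        f"  COUNT(*) AS total_rows,\n"
        f"  COUNTIF({column} IS NULL) AS null_rows,\n"
        # COUNT(col) ignores NULLs, so the difference is the number of
        # rows that repeat an already-seen non-NULL value.
        f"  COUNT({column}) - COUNT(DISTINCT {column}) AS duplicate_rows\n"
        f"FROM `{table}`"
    )


# Assumed column: patients_ext.Id is expected to be unique and non-NULL
quality_query = build_quality_check_query(
    "cs145-project-1-475101.cs145_data.patients_ext", "Id"
)
print(quality_query)
```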

# Section 3: Get Your Feet Wet

*DELETE WHEN DONE READING: please write a tiny title and description for each query. Don't forget to add comments!*

*DELETE WHEN DONE READING: feel free to add more queries! But keep the format the same :)*

*DELETE WHEN DONE READING: In this cell, please specify which two queries you are uploading a debug table. Then underneath those queries, show the debug table. Below is an example:*

Debug Tables for the following queries:
* Subquery 1: [your title]
* CTE 1: [your title]

## Subqueries

2 queries with scoped variables

### SubQuery 1:


Please copy and paste your debug table image here for this query. Otherwise delete this cell.

In [None]:
%%bigquery --project $PROJECT_ID
# the %%bigquery header above must be the first line of each SQL cell
# write code here

# here's an example of a SQL query using the NCAA basketball dataset in bigquery-public-data
SELECT id, market, name, mascot, mascot_name
FROM `bigquery-public-data.ncaa_basketball.mascots`
LIMIT 5;

### SubQuery 2:


Please copy and paste your debug table image here for this query. Otherwise delete this cell.

In [None]:
%%bigquery --project $PROJECT_ID
# the %%bigquery header above must be the first line of each SQL cell
# write code here

# here's an example of a SQL query using the NCAA basketball dataset in bigquery-public-data
SELECT id, market, name, mascot, mascot_name
FROM `bigquery-public-data.ncaa_basketball.mascots`
LIMIT 5;

## CTEs

2 queries with WITH clauses
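
A CTE names an intermediate result up front, while a subquery writes the same logic inline. The sketch below shows the two forms returning identical rows. It is a toy illustration only: it uses Python's built-in SQLite rather than BigQuery, and the `encounters` table, its columns, and its rows are made-up stand-ins, not the project data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE encounters (patient TEXT, minutes INTEGER)")
conn.executemany(
    "INSERT INTO encounters VALUES (?, ?)",
    [("a", 30), ("a", 50), ("b", 20), ("c", 90)],  # hypothetical rows
)

# Subquery form: the aggregate lives inline in the FROM clause.
subquery_sql = """
SELECT patient, avg_minutes
FROM (
  SELECT patient, AVG(minutes) AS avg_minutes
  FROM encounters
  GROUP BY patient
) AS per_patient
WHERE avg_minutes > 30
ORDER BY patient
"""

# CTE form: the same aggregate, named first with WITH.
cte_sql = """
WITH per_patient AS (
  SELECT patient, AVG(minutes) AS avg_minutes
  FROM encounters
  GROUP BY patient
)
SELECT patient, avg_minutes
FROM per_patient
WHERE avg_minutes > 30
ORDER BY patient
"""

subquery_rows = conn.execute(subquery_sql).fetchall()
cte_rows = conn.execute(cte_sql).fetchall()
print(subquery_rows)  # [('a', 40.0), ('c', 90.0)]
print(cte_rows)       # same rows as the subquery form
```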

### CTE 1:


Please copy and paste your debug table image here for this query. Otherwise delete this cell.

In [None]:
%%bigquery --project $PROJECT_ID
# the %%bigquery header above must be the first line of each SQL cell
# write code here

# here's an example of a SQL query using the NCAA basketball dataset in bigquery-public-data
SELECT id, market, name, mascot, mascot_name
FROM `bigquery-public-data.ncaa_basketball.mascots`
LIMIT 5;

### CTE 2:


Please copy and paste your debug table image here for this query. Otherwise delete this cell.

In [None]:
%%bigquery --project $PROJECT_ID
# the %%bigquery header above must be the first line of each SQL cell
# write code here

# here's an example of a SQL query using the NCAA basketball dataset in bigquery-public-data
SELECT id, market, name, mascot, mascot_name
FROM `bigquery-public-data.ncaa_basketball.mascots`
LIMIT 5;

## Window Functions

3 queries with OVER, including RANK vs ROW_NUMBER. Please note the FAQs!
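
The difference between `RANK` and `ROW_NUMBER` only shows up when values tie. The pure-Python sketch below mimics both over an already sorted list; the function names and the counts are hypothetical and serve only to illustrate the numbering rules.

```python
def row_number(values):
    """ROW_NUMBER(): consecutive positions 1..n, so tied values still get distinct numbers."""
    return list(range(1, len(values) + 1))


def rank(values):
    """RANK(): tied values share a rank, and the next distinct value skips ahead."""
    ranks = []
    for i, v in enumerate(values):
        if i > 0 and v == values[i - 1]:
            ranks.append(ranks[-1])  # tie: reuse the previous rank
        else:
            ranks.append(i + 1)      # new value: rank is its 1-based position
    return ranks


# Hypothetical condition counts, already sorted as by ORDER BY count DESC
counts = [90, 75, 75, 40]
print(row_number(counts))  # [1, 2, 3, 4]
print(rank(counts))        # [1, 2, 2, 4]
```

For a "top 3 per group" question, filtering on `RANK` can return more than three rows when there are ties, while `ROW_NUMBER` returns exactly three but picks among tied rows arbitrarily.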

### Window Function 1:


Please copy and paste your debug table image here for this query. Otherwise delete this cell.

In [None]:
%%bigquery --project $PROJECT_ID
# the %%bigquery header above must be the first line of each SQL cell
# write code here

# here's an example of a SQL query using the NCAA basketball dataset in bigquery-public-data
SELECT id, market, name, mascot, mascot_name
FROM `bigquery-public-data.ncaa_basketball.mascots`
LIMIT 5;

### Window Function 2:


Please copy and paste your debug table image here for this query. Otherwise delete this cell.

In [None]:
%%bigquery --project $PROJECT_ID
# the %%bigquery header above must be the first line of each SQL cell
# write code here

# here's an example of a SQL query using the NCAA basketball dataset in bigquery-public-data
SELECT id, market, name, mascot, mascot_name
FROM `bigquery-public-data.ncaa_basketball.mascots`
LIMIT 5;

### Window Function 3:


Please copy and paste your debug table image here for this query. Otherwise delete this cell.

In [None]:
%%bigquery --project $PROJECT_ID
# the %%bigquery header above must be the first line of each SQL cell
# write code here

# here's an example of a SQL query using the NCAA basketball dataset in bigquery-public-data
SELECT id, market, name, mascot, mascot_name
FROM `bigquery-public-data.ncaa_basketball.mascots`
LIMIT 5;

# Section 4: Exploring Central Questions

*[Please delete when done reading]Friendly reminder: Do not forget comments!!!*

## Question 1: Top 3 chronic conditions for each age and gender group

sql query and analysis

## Question 2: Rate of encounters and disease occurrence for high-, middle-, and low-income groups

sql query and analysis

## Question 3: Provider and payer combination with the longest encounter time in each state

sql query and analysis

# Section 5: Takeaways

---

*TODO: Final conclusions based on the rest of your project*

---