# Overview: an approach to "machine-readable" files
1. Try to load it in DuckDB
2. Try to load it in pandas
3. Use command-line tools to determine the encoding
4. Re-encode the file as UTF-8
5. Try again to load it in DuckDB and pandas
6. Remove any invalid escape characters using command-line tools

## Command-line dependencies
These can be installed using Homebrew on macOS:
```
brew install libiconv
brew install uchardet
```

In [1]:
# Load duckdb, which lets us efficiently load large files
import duckdb

# Load pandas, which lets us manipulate dataframes
import pandas as pd

# Import jupysql Jupyter extension to create SQL cells
%load_ext sql

# Configure jupysql to return query results as pandas DataFrames and to print less output to the notebook.
%config SqlMagic.autopandas = True

%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False

# Allow named parameters (python variables) in SQL cells
%config SqlMagic.named_parameters=True

# Connect jupysql to DuckDB using a SQLAlchemy-style connection string. Connect either to an in-memory DuckDB or to a file-backed database.
%sql duckdb:///:memory:

In [2]:
!wget https://www.lvhn.org/sites/default/files/2022-12/231689692_Lehigh_Valley_Hospital_StandardCharges.zip -P /tmp

--2023-08-31 10:14:43--  https://www.lvhn.org/sites/default/files/2022-12/231689692_Lehigh_Valley_Hospital_StandardCharges.zip
Resolving www.lvhn.org (www.lvhn.org)... 2620:12a:8001::1, 2620:12a:8000::1, 23.185.0.1
Connecting to www.lvhn.org (www.lvhn.org)|2620:12a:8001::1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 37214088 (35M) [application/zip]
Saving to: ‘/tmp/231689692_Lehigh_Valley_Hospital_StandardCharges.zip’


2023-08-31 10:14:49 (6.85 MB/s) - ‘/tmp/231689692_Lehigh_Valley_Hospital_StandardCharges.zip’ saved [37214088/37214088]



In [5]:
!unzip /tmp/231689692_Lehigh_Valley_Hospital_StandardCharges.zip && mv 231689692_Lehigh_Valley_Hospital_StandardCharges.JSON ~/data/payless_health

Archive:  /tmp/231689692_Lehigh_Valley_Hospital_StandardCharges.zip
  inflating: 231689692_Lehigh_Valley_Hospital_StandardCharges.JSON  


In [6]:
ls -lh ~/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges.JSON

-rw-r--r--  1 me  staff   858M Dec  2  2022 /Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges.JSON


In [7]:
!head /Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges.JSON

{
"Standard Charges":[
{
"Header":"Lehigh Valley Hospital-Cedar Crest"
}
,
{
"Header":"Lehigh Valley Hospital-Muhlenberg"
}
,


In [5]:
file_path = '/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges.JSON'

In [6]:
%%sql
SELECT * FROM read_json_auto(:file_path, records=true, maximum_object_size=500000000)

RuntimeError: (duckdb.InvalidInputException) Invalid Input Error: Attempting to execute an unsuccessful or closed pending query result
Error: Invalid Input Error: Malformed JSON in file "/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges.JSON", at byte 1964 in record/value 2: invalid UTF-8 encoding in string. 
[SQL: SELECT * FROM read_json_auto(?, records=true, maximum_object_size=500000000)]
[parameters: ('/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges.JSON',)]
(Background on this error at: https://sqlalche.me/e/20/f405)
If you need help solving this issue, send us a message: https://ploomber.io/community
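
DuckDB reports the byte position where UTF-8 decoding first fails. The following is a minimal Python sketch for checking that independently by scanning the file in chunks. `first_invalid_utf8_offset` is a hypothetical helper name, not part of DuckDB or pandas, and the offset it returns may not match DuckDB's exactly if the two count from different starting points.

```python
import codecs


def first_invalid_utf8_offset(path, chunk_size=1 << 20):
    """Return the offset of the first byte that is not valid UTF-8, or None."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    bytes_read = 0  # bytes handed to the decoder in earlier iterations
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            # Bytes of an incomplete sequence held over from the previous chunk
            pending = len(decoder.getstate()[0])
            try:
                # An empty chunk means end of file, so flush with final=True
                decoder.decode(chunk, final=not chunk)
            except UnicodeDecodeError as e:
                # e.start is relative to (held-over bytes + current chunk)
                return bytes_read - pending + e.start
            if not chunk:
                return None
            bytes_read += len(chunk)
```

Calling it with the `file_path` variable defined earlier should point at roughly the same location as the error message, which can then be inspected with `od`.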


In [7]:
!uchardet /Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges.JSON

ISO-8859-2


In [8]:
import io
import chardet
with open(file_path, 'rb') as f:
    result = chardet.detect(f.read(5000000))  # detect on the first 5 MB rather than the whole file

result

{'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}

In [10]:
!file -bI /Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges.JSON

text/plain; charset=iso-8859-1
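
The three tools disagree (`uchardet` says ISO-8859-2, while `chardet` and `file` say ISO-8859-1). One way to choose between the candidates is to decode the same non-ASCII bytes under each encoding and see which result reads sensibly. A small sketch, where `decode_with_candidates` is a hypothetical helper and the sample bytes are illustrative, not taken from the hospital file:

```python
def decode_with_candidates(sample, encodings=("iso-8859-1", "iso-8859-2")):
    """Decode the same bytes under each candidate encoding for comparison."""
    return {name: sample.decode(name) for name in encodings}


# Illustrative bytes only: 0xE9 decodes identically in both encodings,
# while 0xF5 decodes to a different letter in each.
for name, text in decode_with_candidates(b"caf\xe9 \xf5").items():
    print(name, repr(text))
```

Both encodings map every byte to some character, so neither will raise an error; the comparison has to be done by eye on bytes that actually occur in the file.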


In [11]:
!iconv -f ISO-8859-1 -t UTF-8 /Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges.JSON > /Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges.ISO-8859-1.utf8.JSON
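
If `iconv` is not available, the same re-encoding can be sketched with the Python standard library. This is an assumed equivalent, not what the notebook ran; `reencode_to_utf8` is a hypothetical name, and the default source encoding simply mirrors the `-f ISO-8859-1` argument above.

```python
import shutil


def reencode_to_utf8(src, dst, src_encoding="iso-8859-1", chunk_chars=1 << 20):
    """Stream src (in src_encoding) to dst as UTF-8 without loading it all into memory."""
    # newline="" disables newline translation so line endings are preserved as-is
    with open(src, "r", encoding=src_encoding, newline="") as fin, \
         open(dst, "w", encoding="utf-8", newline="") as fout:
        shutil.copyfileobj(fin, fout, chunk_chars)
```

Because the copy is chunked, memory use stays bounded even for a file of several hundred megabytes.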

In [12]:
file_path_utf8 = '/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges.ISO-8859-1.utf8.JSON'

In [13]:
%%sql
SELECT * FROM read_json_auto(:file_path_utf8, records=true, maximum_object_size=500000000)

RuntimeError: (duckdb.InvalidInputException) Invalid Input Error: Attempting to execute an unsuccessful or closed pending query result
Error: Invalid Input Error: Malformed JSON in file "/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges.ISO-8859-1.utf8.JSON", at byte 878701 in record/value 2: invalid escaped character in string. 
[SQL: SELECT * FROM read_json_auto(?, records=true, maximum_object_size=500000000)]
[parameters: ('/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges.ISO-8859-1.utf8.JSON',)]
(Background on this error at: https://sqlalche.me/e/20/f405)
If you need help solving this issue, send us a message: https://ploomber.io/community


In [18]:
!od -j 878699 -N 16 -a /Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges.ISO-8859-1.utf8.JSON 

3264153    \   F   I   N   G   E   R  sp   L   S   N   ;  sp   S   U   B
3264173


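The invalid escape sequence starts with a backslash (`\F` is not a valid JSON escape). We can remove the backslashes in place using sed (BSD/macOS syntax):
```
sed -i '' 's/\\//g' 230831-lehigh-valley.csv
```

Or with tr, writing to a new file (redirecting the output to the input file would truncate it before it is read):
```
tr -d '\\' < 230831-lehigh-valley.csv > 230831-lehigh-valley.nobackslash.csv
```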
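
Deleting every backslash also removes legitimate escapes such as `\"`, which could break strings that contain quoted text. A more conservative sketch in Python drops only the backslashes that do not begin a valid JSON escape. `drop_invalid_escapes` is a hypothetical helper, it works on text already in memory (for a large file it would be applied line by line), and it does not check that `\u` is followed by four hex digits.

```python
import re

# Characters that may legally follow a backslash inside a JSON string
VALID_ESCAPE_CHARS = '"\\/bfnrtu'


def drop_invalid_escapes(text):
    """Remove backslashes that do not start a valid JSON escape sequence."""
    def fix(match):
        # Keep valid escapes untouched; otherwise keep only the escaped character
        if match.group(1) in VALID_ESCAPE_CHARS:
            return match.group(0)
        return match.group(1)

    # Matching backslash + next character together means "\\" is consumed as a pair
    return re.sub(r"\\(.)", fix, text)
```

For example, `\FINGER` becomes `FINGER`, while `\"` and `\\` are left as they are.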

In [19]:
!tr -d '\\' < /Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges.ISO-8859-1.utf8.JSON > /Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges.ISO-8859-1.utf8.nobackslash.JSON

In [2]:
%%sql
SELECT * FROM read_json_auto('/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges.ISO-8859-1.utf8.nobackslash.JSON', records=true, maximum_object_size=500000000)

Unnamed: 0,Standard Charges,Minimum & Maximum Allowed,NCC 4,NCC 5,NCC 5 RCH,NCC 6,NCC 9,NCC 14,NCC 14 RCH,NCC 15,...,NCC 484,NCC 485,NCC 486,NCC 494,NCC 499,NCC 526,NCC 535,NCC 536,NCC 571,NCC 574
0,[{'Header': 'Lehigh Valley Hospital-Cedar Cres...,"[{'Payment Type': 'APR DRG', 'Description': 'L...","[{'Payment Type': 'MS DRG', 'Description': 'HE...",[{'Header': 'Lehigh Valley Hospital-Cedar Cres...,[{'Header': 'Lehigh Valley Hospital-Reilly Chi...,[{'Header': 'Lehigh Valley Hospital-Cedar Cres...,"[{'Payment Type': 'MS DRG', 'Description': 'HE...",[{'Header': 'Lehigh Valley Hospital-Cedar Cres...,[{'Header': 'Lehigh Valley Hospital-Reilly Chi...,"[{'Payment Type': 'MS DRG', 'Description': 'HE...",...,[{'Header': 'Lehigh Valley Hospital-Hecktown O...,[{'Header': 'Lehigh Valley Hospital-Hecktown O...,"[{'Payment Type': 'MS DRG', 'Description': 'HE...","[{'Payment Type': 'APR DRG', 'Description': 'L...","[{'Payment Type': 'PER DIEM', 'Description': '...","[{'Header': 'Lehigh Valley Hospital-Carbon', '...",[{'Header': 'Lehigh Valley Hospital-Coordinate...,"[{'Payment Type': 'MS DRG', 'Description': 'HE...","[{'Payment Type': 'MS DRG', 'Description': 'HE...","[{'Payment Type': 'APR DRG', 'Description': 'L..."


In [4]:
!head -n 100 /Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges.ISO-8859-1.utf8.nobackslash.JSON

{
"Standard Charges":[
{
"Header":"Lehigh Valley Hospital-Cedar Crest"
}
,
{
"Header":"Lehigh Valley Hospital-Muhlenberg"
}
,
{
"Header":"Lehigh Valley Hospital-Hecktown Oaks"
}
,
{
"Header":"Lehigh Valley Hospital-Reilly Children's Hospital"
}
,
{
"Header":"Lehigh Valley Hospital-Tilghman Surgery Center"
}
,
{
"Header":"Lehigh Valley Hospital-17th Street"
}
,
{
"Header":"Lehigh Valley Hospital-Carbon"
}
,
{
"Header":"Lehigh Valley Hospital-Coordinated Health Allentown"
}
,
{
"Header":"Lehigh Valley Hospital-Coordinated Health Bethlehem"
}
,
{
"Header":"Comprehensive Machine Readable File"
}
,
{
"Header":"Date of last Update: 12/15/2022"
}
,
{
"DESCRIPTION":"HB MED SURG PRIVATE BED"
,
"BILLING_CODE":"110"
,
"GROSS_CHARGES_IP":"$3,730.00"
,
"GROSS_CHARGES_OP":"N/A"
,
"SELF-PAY_PRICE_IP":"$1,865.00"
,
"SELF-PAY_PRICE_OP":"N/A"
}
,
{
"DESCRIPTION":"HB EBOLA PRIVATE"
,
"BILLING_CODE":"110"
,
"GROSS_CHARGES_IP":"$34,485.00"
,
"GROSS_CHARGES_OP":"N/A"
,
"SELF-PAY_PRICE_IP":"$17,242.50"
,
"SE

In [15]:
file_path_iso_8859_1_utf8 = '/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges.JSON.ISO-8859-1.utf8'

In [2]:
%%sql
SELECT * FROM read_json_auto('/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges.JSON.ISO-8859-1.utf8', records=true, maximum_object_size=500000000)

Unnamed: 0,Standard Charges,Minimum & Maximum Allowed,NCC 4,NCC 5,NCC 5 RCH,NCC 6,NCC 9,NCC 14,NCC 14 RCH,NCC 15,...,NCC 484,NCC 485,NCC 486,NCC 494,NCC 499,NCC 526,NCC 535,NCC 536,NCC 571,NCC 574
0,[{'Header': 'Lehigh Valley Hospital-Cedar Cres...,"[{'Payment Type': 'APR DRG', 'Description': 'L...","[{'Payment Type': 'MS DRG', 'Description': 'HE...",[{'Header': 'Lehigh Valley Hospital-Cedar Cres...,[{'Header': 'Lehigh Valley Hospital-Reilly Chi...,[{'Header': 'Lehigh Valley Hospital-Cedar Cres...,"[{'Payment Type': 'MS DRG', 'Description': 'HE...",[{'Header': 'Lehigh Valley Hospital-Cedar Cres...,[{'Header': 'Lehigh Valley Hospital-Reilly Chi...,"[{'Payment Type': 'MS DRG', 'Description': 'HE...",...,[{'Header': 'Lehigh Valley Hospital-Hecktown O...,[{'Header': 'Lehigh Valley Hospital-Hecktown O...,"[{'Payment Type': 'MS DRG', 'Description': 'HE...","[{'Payment Type': 'APR DRG', 'Description': 'L...","[{'Payment Type': 'PER DIEM', 'Description': '...","[{'Header': 'Lehigh Valley Hospital-Carbon', '...",[{'Header': 'Lehigh Valley Hospital-Coordinate...,"[{'Payment Type': 'MS DRG', 'Description': 'HE...","[{'Payment Type': 'MS DRG', 'Description': 'HE...","[{'Payment Type': 'APR DRG', 'Description': 'L..."


# Parsing this JSON with `jq`

1. Install `jq` with Homebrew
2. Test it with `jq . 230831-lehigh-valley.json`
3. Use `jq` to extract the `data` array: `jq .data 230831-lehigh-valley.json`
4. Use `jq` to extract the `data` array and write it to a new file: `jq .data 230831-lehigh-valley.json > 230831-lehigh-valley.data.json` (redirecting to the input file would truncate it before `jq` reads it)
5. Use `jq.py` (https://github.com/mwilliamson/jq.py) in Python to process the 83 columns.

## Copilot prompt

`jq` command to retrieve column names from the json file:
````
jq .data[0] 230831-lehigh-valley.json
````

## Claude prompt
````
please write a jq command to retrieve the 83 columns names from the json file at /Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges.ISO-8859-1.utf8.nobackslash.JSON where the column names are at the top level and followed by lists.
````

Claude response:
```
jq -r 'keys | .[]' /Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges.ISO-8859-1.utf8.nobackslash.JSON   
```
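
For comparison, a rough Python equivalent of `jq -r 'keys | .[]'`, assuming the file fits in memory once parsed (the file here is several hundred megabytes, so this may be slow). `top_level_keys` is a hypothetical helper name.

```python
import json


def top_level_keys(path):
    """Return the sorted top-level keys of a JSON object stored at path."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    # jq's `keys` returns the keys sorted, so sort here as well
    return sorted(data)
```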




In [3]:
keys = !jq -r 'keys | .[]' /Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges.ISO-8859-1.utf8.nobackslash.JSON   

In [4]:
keys

['Minimum & Maximum Allowed',
 'NCC 100',
 'NCC 102',
 'NCC 105',
 'NCC 108',
 'NCC 117',
 'NCC 128',
 'NCC 133',
 'NCC 14',
 'NCC 14 RCH',
 'NCC 148',
 'NCC 15',
 'NCC 151',
 'NCC 153',
 'NCC 157',
 'NCC 171',
 'NCC 19',
 'NCC 26',
 'NCC 265',
 'NCC 269',
 'NCC 269 RCH',
 'NCC 270',
 'NCC 270 RCH',
 'NCC 285',
 'NCC 286',
 'NCC 298',
 'NCC 299',
 'NCC 31',
 'NCC 33',
 'NCC 333',
 'NCC 350',
 'NCC 350 MHC',
 'NCC 359',
 'NCC 378',
 'NCC 38',
 'NCC 4',
 'NCC 42',
 'NCC 421',
 'NCC 421 RCH',
 'NCC 422',
 'NCC 422 RCH',
 'NCC 423',
 'NCC 429',
 'NCC 43 HO',
 'NCC 43 LVH',
 'NCC 43 MHC',
 'NCC 43 RCH',
 'NCC 432',
 'NCC 44',
 'NCC 447',
 'NCC 474',
 'NCC 475',
 'NCC 484',
 'NCC 485',
 'NCC 486',
 'NCC 494',
 'NCC 499',
 'NCC 5',
 'NCC 5 RCH',
 'NCC 52',
 'NCC 526',
 'NCC 53',
 'NCC 535',
 'NCC 536',
 'NCC 57',
 'NCC 571',
 'NCC 574',
 'NCC 59',
 'NCC 6',
 'NCC 62',
 'NCC 64',
 'NCC 66',
 'NCC 70',
 'NCC 73',
 'NCC 77',
 'NCC 78',
 'NCC 80',
 'NCC 80 RCH',
 'NCC 82',
 'NCC 87',
 'NCC 9',
 '

Claude prompt:
```
one of the top level keys is NCC 14, please select the records for only this key and print the first 10 records in it using `jq`

```

In [18]:
!jq '.["NCC 14"] | .[0:10]' /Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges.ISO-8859-1.utf8.nobackslash.JSON

[
  {
    "Header": "Lehigh Valley Hospital-Cedar Crest"
  },
  {
    "Header": "Lehigh Valley Hospital-Muhlenberg"
  },
  {
    "Header": "Lehigh Valley Hospital-Tilghman Surgery Center"
  },
  {
    "Header": "Lehigh Valley Hospital-17th Street"
  },
  {
    "Payment Type": "MS DRG",
    "Description": "HEART TRANSPLANT OR IMPLANT OF HEART ASSIST SYSTEM WITH MCC",
    "Billing Code": "MS001",
    "Payor:CAPITAL_BLUE_CROSS_Plan:CBC_KEYSTONE_HEALTH_PLAN_CENTRAL_KIDS_IP": "$245,068.11",
    "Payor:CAPITAL_BLUE_CROSS_Pl

Claude prompt:

> write a jq command to retrieve the records that do not have the key "Header" from the file /Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges.ISO-8859-1.utf8.nobackslash.JSON


Next: print the records without a "Header" field, or split the file into 83 files (one per top-level key) and skip the "Header" records with `jq`. Then split the payor keys on the colon into separate values, so the data is ready for DuckDB and a box-and-whisker plot.
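
The "skip the Header records" step can also be sketched in plain Python once a key's list has been loaded. `without_header` is a hypothetical helper; like the jq test `."Header" == null`, it keeps records where the key is missing or null.

```python
def without_header(records):
    """Keep only the records that have no "Header" value."""
    return [r for r in records if r.get("Header") is None]


# Illustrative records shaped like the jq output above
sample = [
    {"Header": "Lehigh Valley Hospital-Cedar Crest"},
    {"Payment Type": "MS DRG", "Billing Code": "MS001"},
]
print(without_header(sample))
```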




In [1]:
!jq '.["NCC 14"][] | select(."Header" == null)' /Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges.ISO-8859-1.utf8.nobackslash.JSON > /tmp/output.txt

In [2]:
!head -n 50 /tmp/output.txt

{
  "Payment Type": "MS DRG",
  "Description": "HEART TRANSPLANT OR IMPLANT OF HEART ASSIST SYSTEM WITH MCC",
  "Billing Code": "MS001",
  "Payor:CAPITAL_BLUE_CROSS_Plan:CBC_KEYSTONE_HEALTH_PLAN_CENTRAL_KIDS_IP": "$245,068.11",
  "Payor:CAPITAL_BLUE_CROSS_Plan:CBC_KEYSTONE_HEALTH_PLAN_CENTRAL_KIDS_OP": "N/A"
}
{
  "Payment Type": "MS DRG",
  "Description": "HEART TRANSPLANT OR IMPLANT OF HEART ASSIST SYSTEM WITHOUT MCC",
  "Billing Code": "MS002",
  "Payor:CAPITAL_BLUE_CROSS_Plan:CBC_KEYSTONE_HEALTH_PLAN_CENTRAL_KIDS_IP": "$146,820.91",
  "Payor:CAPITAL_BLUE_CROSS_Plan:CBC_KEYSTONE_HEALTH_PLAN_CENTRAL_KIDS_OP": "N/A"
}
{
  "Payment Type": "MS DRG",
  "Description": "ECMO OR TRACHEOSTOMY WITH MV >96 HOURS OR PRINCIPAL DIAGNOSIS EXCEPT FACE, MOUTH AND NECK WITH MAJOR O.R. PROCEDURES",
  "Billing Code": "MS003",
  "Payor:CAPITAL_BLUE_CROSS_Plan:CBC_KEYSTONE_HEALTH_PLAN_CENTRAL_KIDS_IP": "$169,723.94",
  "Payor:CAPITAL_BLUE_CROSS_Plan:CBC_KEYSTONE_HEALTH_PLAN_CENTRAL_KIDS_OP": "N/A"
}
{
  

# Generate JSON files for each of the 83 top-level keys

In [15]:
# Run this on every file: 
# !jq '.["NCC 14"][] | select(."Header" == null)' /Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges.ISO-8859-1.utf8.nobackslash.JSON > /tmp/output.txt
import subprocess
from pathlib import Path

file_path = Path('/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges.ISO-8859-1.utf8.nobackslash.JSON')
output_path = Path('/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges')
# output_path.mkdir()


In [51]:
for key in keys[1:]:
  print(key)
  # get the rows that don't have a header
  temp_file = output_path / f'{key}.tmp'
  with open(temp_file, 'w') as f:
    subprocess.run(['jq', f'.["{key}"][] | select(."Header" == null)', str(file_path)], stdout=f)
  # create valid json
  output_file = output_path / f'{key}.json'
  with open(output_file, 'w') as f:
    subprocess.run(['jq', '-s', '.', temp_file], stdout=f)
  break
  

NCC 100
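
The two jq passes above (filter, then `jq -s .` to wrap the stream back into an array) can probably be collapsed into one by wrapping the filter in `[...]`. Passing the key with `--arg` also avoids interpolating it into the filter string. This is a sketch of that alternative, not what the notebook ran; `jq_filter_command` and `write_filtered` are hypothetical names.

```python
import subprocess


def jq_filter_command(key, json_path):
    """Build a jq command that emits one JSON array of non-Header records for a key."""
    return [
        "jq",
        "--arg", "key", key,  # exposes the key to the filter as $key
        "[.[$key][] | select(.Header == null)]",
        str(json_path),
    ]


def write_filtered(key, json_path, output_file):
    """Run jq and write its output to output_file; requires jq on the PATH."""
    with open(output_file, "w") as f:
        subprocess.run(jq_filter_command(key, json_path), stdout=f, check=True)
```

With `check=True`, a jq failure raises an exception instead of silently leaving an empty output file.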


In [54]:
# read output_file with json
import json
with open(output_file) as f:
  data = json.load(f)
  print(data[0:1])

[{'Payment Type': 'APR DRG', 'Description': 'LIVER TRANSPLANT &/OR INTESTINAL TRANSPLANT', 'Billing Code': 'APR001-1', 'Payor:AETNA_BETTER_HEALTH_Plan:PA_AETNA_BETTER_HEALTH_MEDICAID_IP': '$83,546.60', 'Payor:AETNA_BETTER_HEALTH_Plan:PA_AETNA_BETTER_HEALTH_MEDICAID_OP': 'N/A', 'Payor:AETNA_BETTER_HEALTH_Plan:AETNA_BETTER_HEALTH_KIDS_IP': '$83,546.60', 'Payor:AETNA_BETTER_HEALTH_Plan:AETNA_BETTER_HEALTH_KIDS_OP': 'N/A', 'Payor:AETNA_BETTER_HEALTH_DENTAL_Plan:DENTAL_AETNA_BETTER_HEALTH_MEDICAID_IP': '$83,546.60', 'Payor:AETNA_BETTER_HEALTH_DENTAL_Plan:DENTAL_AETNA_BETTER_HEALTH_MEDICAID_OP': 'N/A', 'Payor:AETNA_BETTER_HEALTH_DENTAL_Plan:DENTAL_AETNA_BETTER_HEALTH_KIDS_IP': '$83,546.60', 'Payor:AETNA_BETTER_HEALTH_DENTAL_Plan:DENTAL_AETNA_BETTER_HEALTH_KIDS_OP': 'N/A'}]


Claude prompt:

```
edit the json records in the data dictionary and split the keys in the records in the following way:

if the key begins with the string Payor, then split the string according to colons (`:`) that appear in the key.
place the value of the key into a new field with a key of Value
```

In [13]:
import json

def process_lehigh_valley_column_file(file_path, output_path):
  """Rename the fixed fields and split each "Payor:<payor>:<plan>" key into separate fields.

  Note: a record with several Payor keys keeps only the last payor/plan/charge,
  because each key overwrites the fields set by the previous one.
  """
  with open(file_path) as f:
    data = json.load(f)

  for record in data:
    # pop() raises KeyError for records that lack these fields
    record['payment_type'] = record.pop('Payment Type')
    record['billing_code'] = record.pop('Billing Code')
    record['description'] = record.pop('Description')
    # Iterate over a copy of the keys because the loop removes keys from the record
    for key in list(record.keys()):
      if key.startswith('Payor'):
        # Keys look like "Payor:<payor>:<plan>"
        parts = key.split(':')
        record['payor_name'] = parts[1]
        record['plan_name'] = parts[2]
        record['charge'] = record.pop(key)

  with open(output_path, 'w') as f:
    json.dump(data, f)
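
Because every `Payor:...` key in a record writes to the same three fields, the function above keeps only the last payor per record. If one row per payor and plan is wanted instead, a sketch along these lines could be used. `explode_payor_columns` is a hypothetical helper, and it assumes every Payor key has the form `Payor:<payor>:<plan>`.

```python
def explode_payor_columns(record):
    """Return one flat row per "Payor:<payor>:<plan>" key in the record."""
    base = {
        "payment_type": record.get("Payment Type"),
        "billing_code": record.get("Billing Code"),
        "description": record.get("Description"),
    }
    rows = []
    for key, value in record.items():
        if key.startswith("Payor:"):
            # Assumes exactly two colons' worth of structure: prefix, payor, plan
            _, payor_name, plan_name = key.split(":", 2)
            rows.append({**base, "payor_name": payor_name,
                         "plan_name": plan_name, "charge": value})
    return rows
```

A record with ten Payor keys then yields ten rows, so no payor is lost before the `WHERE charge != 'N/A'` filter is applied.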

In [66]:
process_lehigh_valley_column_file(output_file, output_path / 'lehigh_valley.json')

In [68]:
from tqdm.notebook import tqdm

In [72]:
for key in tqdm(keys):
  print(key)
  # get the rows that don't have a header
  temp_file = output_path / f'{key}.tmp'
  with open(temp_file, 'w') as f:
    subprocess.run(['jq', f'.["{key}"][] | select(."Header" == null)', str(file_path)], stdout=f)
  # create valid json
  filtered_file = output_path / f'{key}.json'
  with open(filtered_file, 'w') as f:
    subprocess.run(['jq', '-s', '.', temp_file], stdout=f)
  processed_file = output_path / f'{key}_processed.json'
  process_lehigh_valley_column_file(filtered_file, processed_file)

  0%|          | 0/83 [00:00<?, ?it/s]

Minimum & Maximum Allowed
NCC 100
NCC 102
NCC 105
NCC 108
NCC 117
NCC 128
NCC 133
NCC 14
NCC 14 RCH
NCC 148
NCC 15
NCC 151
NCC 153
NCC 157
NCC 171
NCC 19
NCC 26
NCC 265
NCC 269
NCC 269 RCH
NCC 270
NCC 270 RCH
NCC 285
NCC 286
NCC 298
NCC 299
NCC 31
NCC 33
NCC 333
NCC 350
NCC 350 MHC
NCC 359
NCC 378
NCC 38
NCC 4
NCC 42
NCC 421
NCC 421 RCH
NCC 422
NCC 422 RCH
NCC 423
NCC 429
NCC 43 HO
NCC 43 LVH
NCC 43 MHC
NCC 43 RCH
NCC 432
NCC 44
NCC 447
NCC 474
NCC 475
NCC 484
NCC 485
NCC 486
NCC 494
NCC 499
NCC 5
NCC 5 RCH
NCC 52
NCC 526
NCC 53
NCC 535
NCC 536
NCC 57
NCC 571
NCC 574
NCC 59
NCC 6
NCC 62
NCC 64
NCC 66
NCC 70
NCC 73
NCC 77
NCC 78
NCC 80
NCC 80 RCH
NCC 82
NCC 87
NCC 9
NCC 98
Standard Charges


In [16]:
file_list_subset = [f for f in output_path.glob('*.json') if 'Standard Charges' not in str(f) and 'Minimum & Maximum Allowed' not in str(f)]

for file in tqdm(file_list_subset):
  print(file)
  processed_file = file.with_suffix('.processed.json')
  process_lehigh_valley_column_file(file, processed_file)

  0%|          | 0/82 [00:00<?, ?it/s]

/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges/NCC 43 MHC.json
/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges/NCC 484.json
/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges/NCC 105.json
/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges/NCC 447.json
/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges/NCC 5.json
/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges/NCC 59.json
/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges/NCC 133.json
/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges/NCC 80.json
/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges/NCC 38.json
/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges/NCC 298.json
/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges/NCC

In [17]:
for file in output_path.glob('*.processed.json'):
  print(file)
  with open(file) as f:
    data = json.load(f)
    print(data[0:1])

/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges/NCC 82.processed.json
[{'payment_type': 'APR DRG', 'billing_code': 'APR001-1', 'description': 'LIVER TRANSPLANT &/OR INTESTINAL TRANSPLANT', 'payor_name': 'UNITED_HEALTHCARE_COMMUNITY_PLAN_DENTAL_Plan', 'plan_name': 'DENTAL_PA_UHCCP_FOR_FAMILIES_MEDICAID_OP', 'charge': 'N/A'}]
/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges/NCC 494.processed.json
[{'payment_type': 'APR DRG', 'billing_code': 'APR001-1', 'description': 'LIVER TRANSPLANT &/OR INTESTINAL TRANSPLANT', 'payor_name': 'HEALTH_PARTNERS_Plan', 'plan_name': 'KIDZPARTNERS_OP', 'charge': 'N/A'}]
/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges/NCC 19.processed.json
[{'payment_type': 'APR DRG', 'billing_code': 'APR001-1', 'description': 'LIVER TRANSPLANT &/OR INTESTINAL TRANSPLANT', 'payor_name': 'HIGHMARK_WHOLECARE_GATEWAY_DENTAL_Plan', 'plan_name': 'DENTAL_HIGHMARK_WC_GATEWAY_MEDICAID_OP', 'ch

In [2]:
%%sql 
SELECT * FROM read_json_auto('/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges/NCC 64.processed.json') 

Unnamed: 0,payment_type,billing_code,description,payor_name,plan_name,charge
0,MS DRG,MS001,HEART TRANSPLANT OR IMPLANT OF HEART ASSIST SY...,UNITED_HEALTHCARE_MEDICARE_ADVANTAGE_DENTAL_Plan,DENTAL_PA_UHC_DUAL_COMPLETE_MED_ADV_OP,
1,MS DRG,MS002,HEART TRANSPLANT OR IMPLANT OF HEART ASSIST SY...,UNITED_HEALTHCARE_MEDICARE_ADVANTAGE_DENTAL_Plan,DENTAL_PA_UHC_DUAL_COMPLETE_MED_ADV_OP,
2,MS DRG,MS003,ECMO OR TRACHEOSTOMY WITH MV >96 HOURS OR PRIN...,UNITED_HEALTHCARE_MEDICARE_ADVANTAGE_DENTAL_Plan,DENTAL_PA_UHC_DUAL_COMPLETE_MED_ADV_OP,
3,MS DRG,MS004,TRACHEOSTOMY WITH MV >96 HOURS OR PRINCIPAL DI...,UNITED_HEALTHCARE_MEDICARE_ADVANTAGE_DENTAL_Plan,DENTAL_PA_UHC_DUAL_COMPLETE_MED_ADV_OP,
4,MS DRG,MS005,LIVER TRANSPLANT WITH MCC OR INTESTINAL TRANSP...,UNITED_HEALTHCARE_MEDICARE_ADVANTAGE_DENTAL_Plan,DENTAL_PA_UHC_DUAL_COMPLETE_MED_ADV_OP,
...,...,...,...,...,...,...
13459,NOT SEPARATELY PAID,V2790,AMNIOTIC MEMBRANE,UNITED_HEALTHCARE_MEDICARE_ADVANTAGE_DENTAL_Plan,DENTAL_PA_UHC_DUAL_COMPLETE_MED_ADV_OP,
13460,PER DIEM,90853,OP PARTIAL HOSPITALIZATION-FULL DAY ADOLESCENT,UNITED_HEALTHCARE_MEDICARE_ADVANTAGE_DENTAL_Plan,DENTAL_PA_UHC_DUAL_COMPLETE_MED_ADV_OP,$399.00
13461,PER DIEM,90853,OP PARTIAL HOSPITALIZATION-1/2 DAY ADOLESCENT,UNITED_HEALTHCARE_MEDICARE_ADVANTAGE_DENTAL_Plan,DENTAL_PA_UHC_DUAL_COMPLETE_MED_ADV_OP,$323.00
13462,PER DIEM,90853,OP PARTIAL HOSPITALIZATION-FULL DAY ADULT,UNITED_HEALTHCARE_MEDICARE_ADVANTAGE_DENTAL_Plan,DENTAL_PA_UHC_DUAL_COMPLETE_MED_ADV_OP,$399.00


In [3]:
%%sql 
SELECT * FROM read_json_auto('/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges/NCC 64.processed.json') 
WHERE charge != 'N/A'

Unnamed: 0,payment_type,billing_code,description,payor_name,plan_name,charge
0,PER UNIT,10004,FNA BX W/O IMG GDN EA ADDL,UNITED_HEALTHCARE_MEDICARE_ADVANTAGE_DENTAL_Plan,DENTAL_PA_UHC_DUAL_COMPLETE_MED_ADV_OP,$44.14
1,PER UNIT,10005,FNA BX W/US GDN 1ST LES,UNITED_HEALTHCARE_MEDICARE_ADVANTAGE_DENTAL_Plan,DENTAL_PA_UHC_DUAL_COMPLETE_MED_ADV_OP,$685.07
2,PER UNIT,10006,FNA BX W/US GDN EA ADDL,UNITED_HEALTHCARE_MEDICARE_ADVANTAGE_DENTAL_Plan,DENTAL_PA_UHC_DUAL_COMPLETE_MED_ADV_OP,$52.03
3,PER UNIT,10007,FNA BX W/FLUOR GDN 1ST LES,UNITED_HEALTHCARE_MEDICARE_ADVANTAGE_DENTAL_Plan,DENTAL_PA_UHC_DUAL_COMPLETE_MED_ADV_OP,$685.07
4,PER UNIT,10008,FNA BX W/FLUOR GDN EA ADDL,UNITED_HEALTHCARE_MEDICARE_ADVANTAGE_DENTAL_Plan,DENTAL_PA_UHC_DUAL_COMPLETE_MED_ADV_OP,$59.17
...,...,...,...,...,...,...
10921,PER VISIT,V2785,CORNEAL TISSUE PROCESSING,UNITED_HEALTHCARE_MEDICARE_ADVANTAGE_DENTAL_Plan,DENTAL_PA_UHC_DUAL_COMPLETE_MED_ADV_OP,"$3,576.30"
10922,PER DIEM,90853,OP PARTIAL HOSPITALIZATION-FULL DAY ADOLESCENT,UNITED_HEALTHCARE_MEDICARE_ADVANTAGE_DENTAL_Plan,DENTAL_PA_UHC_DUAL_COMPLETE_MED_ADV_OP,$399.00
10923,PER DIEM,90853,OP PARTIAL HOSPITALIZATION-1/2 DAY ADOLESCENT,UNITED_HEALTHCARE_MEDICARE_ADVANTAGE_DENTAL_Plan,DENTAL_PA_UHC_DUAL_COMPLETE_MED_ADV_OP,$323.00
10924,PER DIEM,90853,OP PARTIAL HOSPITALIZATION-FULL DAY ADULT,UNITED_HEALTHCARE_MEDICARE_ADVANTAGE_DENTAL_Plan,DENTAL_PA_UHC_DUAL_COMPLETE_MED_ADV_OP,$399.00


In [4]:
%%sql 
COPY (
SELECT * FROM read_json_auto('/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges/NCC 64.processed.json') 
WHERE charge != 'N/A'
) TO '/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges/NCC 64_processed.parquet' (COMPRESSION ZSTD)

Unnamed: 0,Success


In [7]:
from pathlib import Path
output_path = Path('/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges')
file_list_subset = [f for f in output_path.glob('*.processed.json') if 'Standard Charges' not in str(f) and 'Minimum & Maximum Allowed' not in str(f)]

In [8]:
file_list_subset

[PosixPath('/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges/NCC 82.processed.json'),
 PosixPath('/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges/NCC 494.processed.json'),
 PosixPath('/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges/NCC 19.processed.json'),
 PosixPath('/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges/NCC 350.processed.json'),
 PosixPath('/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges/NCC 105.processed.json'),
 PosixPath('/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges/NCC 486.processed.json'),
 PosixPath('/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges/NCC 333.processed.json'),
 PosixPath('/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges/NCC 98.processed.json'),
 PosixPath('/Users/me/data/payless_health/231689692_Lehigh_Valley_H

In [9]:
# execute duckdb query to convert to parquet for every processed file
import duckdb
from tqdm.notebook import tqdm

for file in tqdm(file_list_subset):
  print(file)
  file_path = file.absolute().as_posix()
  file_path_output = file.with_suffix('.parquet').absolute().as_posix()
  duckdb.sql(f"COPY (SELECT * FROM read_json_auto('{file_path}') WHERE charge != 'N/A') TO '{file_path_output}' (COMPRESSION ZSTD)")

  0%|          | 0/82 [00:00<?, ?it/s]

/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges/NCC 82.processed.json
/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges/NCC 494.processed.json
/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges/NCC 19.processed.json
/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges/NCC 350.processed.json
/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges/NCC 105.processed.json
/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges/NCC 486.processed.json
/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges/NCC 333.processed.json
/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges/NCC 98.processed.json
/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges/NCC 5.processed.json
/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges/NCC 117.

In [10]:
%%sql 
SELECT DISTINCT description FROM '/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges/*.parquet'

Unnamed: 0,description
0,FNA BX W/O IMG GDN EA ADDL
1,DRAINAGE OF SKIN ABSCESS
2,DRAINAGE OF HEMATOMA/FLUID
3,PUNCTURE DRAINAGE OF LESION
4,DEB SUBQ TISSUE 20 SQ CM/<
...,...
88495,STNT VIABAHN HEP 7X25X120
88496,CAGE CENTERPIECE 19X28MM 19C
88497,CAGE XPN CMED-L25X32 24-26 0/3
88498,DISC MOBI-C 13X15 H6 US


In [11]:
%%sql 
SELECT * FROM '/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges/*.parquet'

Unnamed: 0,payment_type,billing_code,description,payor_name,plan_name,charge
0,PER UNIT,10004,FNA BX W/O IMG GDN EA ADDL,UPMC_COMMERCIAL_DENTAL_Plan,DENTAL_UPMC_FOR_KIDS_OP,$617.56
1,PER UNIT,10005,FNA BX W/US GDN 1ST LES,UPMC_COMMERCIAL_DENTAL_Plan,DENTAL_UPMC_FOR_KIDS_OP,"$1,511.88"
2,PER UNIT,10006,FNA BX W/US GDN EA ADDL,UPMC_COMMERCIAL_DENTAL_Plan,DENTAL_UPMC_FOR_KIDS_OP,$758.50
3,PER UNIT,10021,FNA BX W/O IMG GDN 1ST LES,UPMC_COMMERCIAL_DENTAL_Plan,DENTAL_UPMC_FOR_KIDS_OP,$855.88
4,PER UNIT,10060,DRAINAGE OF SKIN ABSCESS,UPMC_COMMERCIAL_DENTAL_Plan,DENTAL_UPMC_FOR_KIDS_OP,$494.56
...,...,...,...,...,...,...
878217,PER UNIT,U0004,COV-19 TEST NON-CDC HGH THRU,INDEPENDENCE_BLUE_CROSS_MEDICARE_ADVANTAGE_Plan,IBC_KEYSTONE_FIRST_VIP_CHOICE_DUAL_MEDICARE_AD...,$75.00
878218,PER UNIT,U0005,INFEC AGEN DETEC AMPLI PROBE,INDEPENDENCE_BLUE_CROSS_MEDICARE_ADVANTAGE_Plan,IBC_KEYSTONE_FIRST_VIP_CHOICE_DUAL_MEDICARE_AD...,$25.00
878219,PER UNIT,V2630,ANTER CHAMBER INTRAOCUL LENS,INDEPENDENCE_BLUE_CROSS_MEDICARE_ADVANTAGE_Plan,IBC_KEYSTONE_FIRST_VIP_CHOICE_DUAL_MEDICARE_AD...,$125.10
878220,PER UNIT,V2632,POST CHMBR INTRAOCULAR LENS,INDEPENDENCE_BLUE_CROSS_MEDICARE_ADVANTAGE_Plan,IBC_KEYSTONE_FIRST_VIP_CHOICE_DUAL_MEDICARE_AD...,$125.10


In [15]:
%%sql 
SELECT DISTINCT 
  billing_code, 
  payor_name,
  description,
  CAST(regexp_replace(charge, '[$,]', '', 'g') AS FLOAT)
FROM '/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges/*.parquet'

Unnamed: 0,billing_code,payor_name,description,"CAST(regexp_replace(charge, '[$,]', '', 'g') AS FLOAT)"
0,10030,HEALTH_PARTNERS_MEDICARE_ADVANTAGE_Plan,GUIDE CATHET FLUID DRAINAGE,691.659973
1,10035,HEALTH_PARTNERS_MEDICARE_ADVANTAGE_Plan,PERQ DEV SOFT TISS 1ST IMAG,691.659973
2,10080,HEALTH_PARTNERS_MEDICARE_ADVANTAGE_Plan,DRAINAGE OF PILONIDAL CYST,691.659973
3,10121,HEALTH_PARTNERS_MEDICARE_ADVANTAGE_Plan,REMOVE FOREIGN BODY,1563.880005
4,11301,HEALTH_PARTNERS_MEDICARE_ADVANTAGE_Plan,SHAVE SKIN LESION 0.6-1.0 CM,199.589996
...,...,...,...,...
790133,C1713,AMERIHEALTH_ADMINISTRATORS_VALLEY_PREFERRED_Plan,CAPSTONE 12 DEG 13X27,15074.799805
790134,C1713,AMERIHEALTH_ADMINISTRATORS_VALLEY_PREFERRED_Plan,SPACER FRA 4A 13MM BRD FF,15098.919922
790135,C1776,AMERIHEALTH_ADMINISTRATORS_VALLEY_PREFERRED_Plan,STEM CORAIL COLLARLESS STD 15,11016.200195
790136,270,AMERIHEALTH_ADMINISTRATORS_VALLEY_PREFERRED_Plan,GRAFT PRECISION 14X26X26,18553.599609
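
Values such as `691.659973` in the output are rounding artifacts of DuckDB's 32-bit `FLOAT` type. Where exact currency amounts matter, the cleanup could be done with decimals instead. A sketch in Python, where `parse_charge` is a hypothetical helper and treating unparseable text as missing is an assumption:

```python
from decimal import Decimal, InvalidOperation


def parse_charge(text):
    """Convert a string such as "$1,511.88" to Decimal; return None if it is not a number."""
    cleaned = text.replace("$", "").replace(",", "").strip()
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        # Covers placeholders such as "N/A" and empty strings
        return None
```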


In [24]:
%%sql 
SELECT 
  billing_code,
  description,
  payor_name,
  CAST(regexp_replace(charge, '[$,]', '', 'g') AS FLOAT) AS charge,
FROM '/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges/*.parquet'

Unnamed: 0,billing_code,description,payor_name,charge
0,10004,FNA BX W/O IMG GDN EA ADDL,UPMC_COMMERCIAL_DENTAL_Plan,617.559998
1,10005,FNA BX W/US GDN 1ST LES,UPMC_COMMERCIAL_DENTAL_Plan,1511.880005
2,10006,FNA BX W/US GDN EA ADDL,UPMC_COMMERCIAL_DENTAL_Plan,758.500000
3,10021,FNA BX W/O IMG GDN 1ST LES,UPMC_COMMERCIAL_DENTAL_Plan,855.880005
4,10060,DRAINAGE OF SKIN ABSCESS,UPMC_COMMERCIAL_DENTAL_Plan,494.559998
...,...,...,...,...
878217,U0004,COV-19 TEST NON-CDC HGH THRU,INDEPENDENCE_BLUE_CROSS_MEDICARE_ADVANTAGE_Plan,75.000000
878218,U0005,INFEC AGEN DETEC AMPLI PROBE,INDEPENDENCE_BLUE_CROSS_MEDICARE_ADVANTAGE_Plan,25.000000
878219,V2630,ANTER CHAMBER INTRAOCUL LENS,INDEPENDENCE_BLUE_CROSS_MEDICARE_ADVANTAGE_Plan,125.099998
878220,V2632,POST CHMBR INTRAOCULAR LENS,INDEPENDENCE_BLUE_CROSS_MEDICARE_ADVANTAGE_Plan,125.099998
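
The cast to `FLOAT` is why a charge listed as `$125.10` in the source appears as `125.099998` in this output: DuckDB's `FLOAT` is a 32-bit type, so most decimal amounts cannot be stored exactly. The following is a minimal standard-library sketch of the same cleaning step, showing the rounding effect next to an exact `Decimal`. The helper names `clean_charge` and `as_float32` are hypothetical and not part of the notebook; the regular expression mirrors the `'[$,]'` pattern used in the query.

```python
import re
import struct
from decimal import Decimal


def clean_charge(raw: str) -> Decimal:
    """Strip '$' and ',' (same pattern as the SQL query) and parse exactly."""
    return Decimal(re.sub(r"[$,]", "", raw))


def as_float32(value: Decimal) -> float:
    """Round-trip through 4 bytes to mimic storing the value as a 32-bit float."""
    return struct.unpack("f", struct.pack("f", float(value)))[0]


exact = clean_charge("$125.10")  # amount taken from the output shown earlier
stored = as_float32(exact)

print(exact)            # 125.10
print(f"{stored:.6f}")  # 125.099998
```

If exact cents matter downstream, casting to a fixed-point type such as `DECIMAL(12, 2)` instead of `FLOAT` may be worth trying; the width and scale here are assumptions about this dataset, not something the notebook verifies.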


In [25]:
%%sql 
COPY (
SELECT 
  billing_code,
  description,
  payor_name,
  CAST(regexp_replace(charge, '[$,]', '', 'g') AS FLOAT) AS charge,
FROM '/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges/*.parquet'
) TO '/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges.parquet' (FORMAT PARQUET, COMPRESSION ZSTD)

Unnamed: 0,Success


# Create the file for visualization

Prompt:
```
please revise this query for the file '/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges.parquet' with the following columns:

billing_code,
description,
payor_name,
charge
  ```

In [27]:
%%sql 
COPY (
  WITH max_min_charges AS (
    SELECT
      billing_code,
      description,
      MIN(charge) AS min_charge,
      MAX(charge) AS max_charge
    FROM '/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges.parquet'
    GROUP BY billing_code, description
  ),

  min_max_names AS (
    SELECT
      mmc.billing_code,
      mmc.description,
      mmc.min_charge,
      mmc.max_charge,
      FIRST_VALUE(payor_name) OVER (PARTITION BY mmc.billing_code, mmc.description ORDER BY charge ASC) AS name_minimum,
      FIRST_VALUE(payor_name) OVER (PARTITION BY mmc.billing_code, mmc.description ORDER BY charge DESC) AS name_maximum
    FROM max_min_charges mmc
    JOIN '/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges.parquet' sc
    ON mmc.billing_code = sc.billing_code AND mmc.description = sc.description
  )

  SELECT DISTINCT
    billing_code,
    description,
    min_charge,
    max_charge,
    name_minimum,
    name_maximum
  FROM min_max_names
) TO '/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges_analysis.parquet' (FORMAT 'parquet');

Unnamed: 0,Success


In [28]:
%%sql 
SELECT * FROM '/Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges_analysis.parquet'

Unnamed: 0,billing_code,description,min_charge,max_charge,name_minimum,name_maximum
0,C1776,CUP BIMENTUM PRESS 63MM,5508.100098,6355.500000,AMERIHEALTH_ADMINISTRATORS_VALLEY_PREFERRED_Plan,HOMESTEAD_CLAIM_WATCHER_Plan
1,27043,HB 27043 EXCISION TUMOR SOFT TISSUE PELVIS&HIP...,5298.799805,6114.000000,AMERIHEALTH_ADMINISTRATORS_VALLEY_PREFERRED_Plan,HOMESTEAD_CLAIM_WATCHER_Plan
2,C1713,SCREW SKYLINE VAR S-D 12MM,878.799988,1014.000000,AMERIHEALTH_ADMINISTRATORS_VALLEY_PREFERRED_Plan,HOMESTEAD_CLAIM_WATCHER_Plan
3,C1713,SCREW TI POLY 5.00X60,4406.479980,5084.399902,AMERIHEALTH_ADMINISTRATORS_VALLEY_PREFERRED_Plan,HOMESTEAD_CLAIM_WATCHER_Plan
4,C1776,LINER PINN MARA NEUT 28X66 121928066,4455.419922,5140.870117,AMERIHEALTH_ADMINISTRATORS_VALLEY_PREFERRED_Plan,HOMESTEAD_CLAIM_WATCHER_Plan
...,...,...,...,...,...,...
90620,272,BIT MILLING 2.0MM,1176.459961,2094.590088,AETNA_Plan,HOMESTEAD_CLAIM_WATCHER_Plan
90621,272,FIBER FLEXIVA 550,1661.449951,2958.070068,AETNA_Plan,HOMESTEAD_CLAIM_WATCHER_Plan
90622,272,KIT VERTEROPLASTY STABILIT VP,2994.770020,5331.930176,AETNA_Plan,HOMESTEAD_CLAIM_WATCHER_Plan
90623,272,SUT LASSO CVD R 45 DEG,736.010010,1310.400024,AETNA_Plan,HOMESTEAD_CLAIM_WATCHER_Plan
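
To make the logic of the query in `In [27]` concrete, here is a minimal pure-Python sketch of the same idea: for every (`billing_code`, `description`) pair, keep the lowest and highest charge together with the payer attached to each. The rows are invented for illustration, and `Charge` and `summarize_charges` are hypothetical names, not part of the notebook.

```python
from typing import NamedTuple


class Charge(NamedTuple):
    billing_code: str
    description: str
    payor_name: str
    charge: float


def summarize_charges(rows):
    """Return {(billing_code, description): (row_with_min, row_with_max)}."""
    summary = {}
    for row in rows:
        key = (row.billing_code, row.description)
        if key not in summary:
            summary[key] = (row, row)
            continue
        lo, hi = summary[key]
        if row.charge < lo.charge:
            lo = row
        if row.charge > hi.charge:
            hi = row
        summary[key] = (lo, hi)
    return summary


# Illustrative rows only; codes, payer names, and amounts are made up.
rows = [
    Charge("A100", "EXAMPLE ITEM", "PAYER_A", 100.0),
    Charge("A100", "EXAMPLE ITEM", "PAYER_B", 180.0),
    Charge("A100", "EXAMPLE ITEM", "PAYER_C", 140.0),
    Charge("B200", "OTHER ITEM", "PAYER_B", 50.0),
]

for (code, desc), (lo, hi) in summarize_charges(rows).items():
    print(code, desc, lo.charge, hi.charge, lo.payor_name, hi.payor_name)
```

On ties the sketch keeps the first payer it encounters. The SQL version makes no such promise: when several payers share the same `charge`, `FIRST_VALUE` can return any of them, so `name_minimum` and `name_maximum` may vary between runs for those rows.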


In [29]:
!ls -lh /Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges_analysis.parquet

-rw-r--r--  1 me  staff   2.7M Aug 31 18:12 /Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges_analysis.parquet


In [30]:
!cp /Users/me/data/payless_health/231689692_Lehigh_Valley_Hospital_StandardCharges_analysis.parquet /Users/me/projects/payless.health/docs/public/data/

# Create the visualization specification

Prompt:

```` 
meta:
  title: Hospital Charge Data
  description: An interactive dashboard of hospital charge data

data:
  charges: {file: data/231352213_StLukesHospitalBethlehemCampus_standardcharges-analysis.parquet}  
  dd: [
    { u: 0, v: 0 },
    { u: 500000, v: 500000 },
  ]
hconcat:
- vconcat:
  - hconcat:
    - input: menu 
      label: Insurance_minimum
      as: $query
      from: charges
      column: name_minimum
    - input: search
      label: description
      as: $query
      from: charges
      column: description
      type: contains
  - vspace: 10
  - plot:
    - mark: dot
      data: {from: charges, filterBy: $query}
      x: min_charge  
      y: max_charge
      opacity: 0.1
      fill: name_minimum
      r: 2      
    - mark: regressionY
      data: {from: charges, filterBy: $query}  
      x: min_charge
      y: max_charge
    - select: intervalXY
      as: $query
      brush: { fillOpacity: 0, stroke: black }     
    - mark: lineY
      data: { from: dd }
      x: u
      y: v
      stroke: red
      # curve: monotone-x
    margins: { left: 60, top: 20, right: 60 }
    xyDomain: Fixed
    width: 590
    height: 350
    yDomain: [0, 100000]
    xDomain: [0, 100000]
    yGrid: true
    xGrid: true
  - vspace: 5
  - input: table
    from: charges
    filterBy: $query
    maxWidth: 700
    columns: 
      - cpt_drg
      - record_id
      - description
      - min_charge
      - max_charge
      - name_minimum
      - name_maximum
    width: 
      cpt_drg: 60
      record_id: 70  
      description: 200
      min_charge: 90
      max_charge: 90
      name_maximum: 120
      name_minimum: 120

please revise this YAML specification for the file at data/231689692_Lehigh_Valley_Hospital_StandardCharges_analysis.parquet with the columns billing_code, description, min_charge, max_charge, name_minimum, name_maximum
````
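
Before revising the specification, it can help to see which columns the template references that the new analysis file does not contain. The sketch below simply compares two hand-copied lists: the table columns from the template and the columns shown in the `SELECT *` output of the analysis file. The variable names are illustrative only.

```python
# Columns referenced by the template's table input (copied from the spec).
spec_columns = [
    "cpt_drg", "record_id", "description",
    "min_charge", "max_charge", "name_minimum", "name_maximum",
]

# Columns present in the analysis parquet file (from the SELECT * output).
file_columns = [
    "billing_code", "description",
    "min_charge", "max_charge", "name_minimum", "name_maximum",
]

missing = [c for c in spec_columns if c not in file_columns]
unused = [c for c in file_columns if c not in spec_columns]

print("in spec but not in file:", missing)  # ['cpt_drg', 'record_id']
print("in file but not in spec:", unused)   # ['billing_code']
```

This suggests that the revised specification needs to replace `cpt_drg` with `billing_code`, drop `record_id`, and point `charges` at the new file, while the plot marks can likely keep `min_charge`, `max_charge`, and `name_minimum` unchanged.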