# Displaying raw data
First, download the comma-separated values (CSV) file and look at its header:

In [1]:
!wget https://data.payless.health/data.cms.gov%2Fnewyork_newark_core-based-statistical-area-hospital-quality-measures.csv

--2023-07-21 08:38:56--  https://data.payless.health/data.cms.gov%2Fnewyork_newark_core-based-statistical-area-hospital-quality-measures.csv
Resolving data.payless.health (data.payless.health)... 2600:9000:2511:9800:1f:282d:1900:93a1, 2600:9000:2511:8200:1f:282d:1900:93a1, 2600:9000:2511:dc00:1f:282d:1900:93a1, ...
Connecting to data.payless.health (data.payless.health)|2600:9000:2511:9800:1f:282d:1900:93a1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 34535 (34K) [text/csv]
Saving to: ‘data.cms.gov%2Fnewyork_newark_core-based-statistical-area-hospital-quality-measures.csv’


2023-07-21 08:38:56 (3.92 MB/s) - ‘data.cms.gov%2Fnewyork_newark_core-based-statistical-area-hospital-quality-measures.csv’ saved [34535/34535]



In [2]:
!head -n 4 data.cms.gov%2Fnewyork_newark_core-based-statistical-area-hospital-quality-measures.csv

,Direction,Higher is Better,Lower is Better,Higher is Better,Lower is Better,Lower is Better,Lower is Better,Lower is Better,Higher is Better,Higher is Better,Lower is Better,Lower is Better,Lower is Better,Lower is Better,Lower is Better,Higher is Better,Higher is Better
PRVDR_NUM,FAC_NAME,CMS Overall Rating (Stars),"30-Day Unplanned Readmissions, Hospital Wide",HCAHPS Overall hospital rating (H_HSP_RATING_LINEAR_SCORE),CLABSI Standardized Infection Ratio,CAUTI Standardized Infection Ratio,Patient Safety and Adverse Events Composite (PSI-90),Early Delivery (PC-01),Exclusive Breast Milk Feeding (PC-05),Appropriate care for severe sepsis and septic shock (SEP-1),Average (median) time patients spent in the emergency department before leaving from the visit (OP-18B),Safe Use of Opioids - Concurrent Prescribing (SAFE_USE_OPIOIDS),MRI Lumbar Spine for Low Back Pain (OP-8),Abdomen CT Use of Contrast Material (OP-10),Outpatients who got cardiac imaging stress tests before low-risk outpatient 
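The output shows that the file has two header rows: the first gives the direction of each measure ("Higher is Better" or "Lower is Better") and the second gives the column names. As a minimal sketch, assuming only Python's standard `csv` module and a sample shortened to the first five columns, the two rows can be paired up like this (the variable names are illustrative and not part of the original notebook):

```python
import csv
import io

# First two lines of the file, shortened to the first five columns for illustration.
sample = (
    ',Direction,Higher is Better,Lower is Better,Higher is Better\n'
    'PRVDR_NUM,FAC_NAME,CMS Overall Rating (Stars),'
    '"30-Day Unplanned Readmissions, Hospital Wide",'
    'HCAHPS Overall hospital rating (H_HSP_RATING_LINEAR_SCORE)\n'
)

reader = csv.reader(io.StringIO(sample))
directions = next(reader)  # row 1: whether higher or lower values are better
names = next(reader)       # row 2: the actual column names

# Skip the two identifier columns (PRVDR_NUM, FAC_NAME), which carry no direction.
direction_by_column = dict(zip(names[2:], directions[2:]))

for name, direction in direction_by_column.items():
    print(f"{direction:>16}  {name}")
```

Because the column names only appear on the second line, the first line has to be skipped when the file is loaded into a database.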

# Loading the data into a database

We will use `duckdb` to load the data into a database, which lets us query it using the Structured Query Language (SQL).

In [3]:
# Load duckdb, which lets us efficiently load large files
import duckdb

# Load pandas, which lets us manipulate dataframes
import pandas as pd

# Import jupysql Jupyter extension to create SQL cells
%load_ext sql

# Configure jupysql to return results directly as Pandas dataframes and to simplify the output printed to the notebook.
%config SqlMagic.autopandas = True

%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False

# Connect jupysql to DuckDB using a SQLAlchemy-style connection string. Either connect to an in-memory DuckDB or to a file-backed database.
%sql duckdb:///:memory:

```
 %%sql
    SELECT *
    FROM read_csv('https://data.cityofnewyork.us/api/views/erm2-nwe9/rows.csv?accessType=DOWNLOAD',
        header=True,
        delim=',',
        quote='"',
        columns={'Unique Key': 'BIGINT',
        'Created Date': 'VARCHAR',
        'Closed Date': 'VARCHAR',
        'Agency': 'VARCHAR',
        'Agency Name': 'VARCHAR',
        'Complaint Type': 'VARCHAR',
        'Descriptor': 'VARCHAR',
        'Location Type': 'VARCHAR',
        'Incident Zip': 'VARCHAR',
        'Incident Address': 'VARCHAR',
        'Street Name': 'VARCHAR',
        'Cross Street 1': 'VARCHAR',
        'Cross Street 2': 'VARCHAR',
        'Intersection Street 1': 'VARCHAR',
        'Intersection Street 2': 'VARCHAR',
        'Address Type': 'VARCHAR',
        'City': 'VARCHAR',
        'Landmark': 'VARCHAR',
        'Facility Type': 'VARCHAR',
        'Status': 'VARCHAR',
        'Due Date': 'VARCHAR',
        'Resolution Description': 'VARCHAR',
        'Resolution Action Updated Date': 'VARCHAR',
        'Community Board': 'VARCHAR',
        'BBL': 'VARCHAR',
        'Borough': 'VARCHAR',
        'X Coordinate (State Plane)': 'VARCHAR',
        'Y Coordinate (State Plane)': 'VARCHAR',
        'Open Data Channel Type': 'VARCHAR',
        'Park Facility Name': 'VARCHAR',
        'Park Borough': 'VARCHAR',
        'Vehicle Type': 'VARCHAR',
        'Taxi Company Borough': 'VARCHAR',
        'Taxi Pick Up Location': 'VARCHAR',
        'Bridge Highway Name': 'VARCHAR',
        'Bridge Highway Direction': 'VARCHAR',
        'Road Ramp': 'VARCHAR',
        'Bridge Highway Segment': 'VARCHAR',
        'Latitude': 'DOUBLE',
        'Longitude': 'DOUBLE',
        'Location': 'VARCHAR'}) 
    LIMIT 10;

please rewrite this duckdb SQL query using the following header of the CSV file to be loaded from

    ./data/newyork_newark_core-based-statistical-area_hospital.csv

    ,Direction,Higher is Better,Lower is Better,Higher is Better,Lower is Better,Lower is Better,Lower is Better,Lower is Better,Higher is Better,Higher is Better,Lower is Better,Lower is Better,Lower is Better,Lower is Better,Lower is Better,Higher is Better,Higher is Better
    PRVDR_NUM,FAC_NAME,CMS Overall Rating (Stars),"30-Day Unplanned Readmissions, Hospital Wide",HCAHPS Overall hospital rating (H_HSP_RATING_LINEAR_SCORE),CLABSI Standardized Infection Ratio,CAUTI Standardized Infection Ratio,Patient Safety and Adverse Events Composite (PSI-90),Early Delivery (PC-01),Exclusive Breast Milk Feeding (PC-05),Appropriate care for severe sepsis and septic shock (SEP-1),Average (median) time patients spent in the emergency department before leaving from the visit (OP-18B),Safe Use of Opioids - Concurrent Prescribing (SAFE_USE_OPIOIDS),MRI Lumbar Spine for Low Back Pain (OP-8),Abdomen CT Use of Contrast Material (OP-10),Outpatients who got cardiac imaging stress tests before low-risk outpatient surgery (OP-13),Percent of patients receiving follow-up care within 30 days after hospitalization for mental illness (FUH-30),Alcohol and other drug use disorder treatment provided or offered at discharge (SUB-3)
    310001,HACKENSACK UNIVERSITY MEDICAL CENTER,4,14.2,84,0.354,0.391,0.74,7%,Not Available,72%,209,Not Available,50%,5%,5%,75%,97%
    310002,NEWARK BETH ISRAEL MEDICAL CENTER,2,16.0,83,0.421,0.744,0.95,0%,Not Available,78%,218,Not Available,Not Available,8%,4%,41%,100%
```

In [4]:
%%sql
SELECT * 
FROM read_csv('data.cms.gov%2Fnewyork_newark_core-based-statistical-area-hospital-quality-measures.csv', 
  header=True, 
  delim=',',
  skip=1,
  nullstr='Not Available',
  quote='"',
  columns={
    'PRVDR_NUM': 'VARCHAR',
    'FAC_NAME': 'VARCHAR',
    'CMS Overall Rating (Stars)': 'DOUBLE',
    '30-Day Unplanned Readmissions, Hospital Wide': 'DOUBLE',
    'HCAHPS Overall hospital rating (H_HSP_RATING_LINEAR_SCORE)': 'DOUBLE',
    'CLABSI Standardized Infection Ratio': 'DOUBLE',
    'CAUTI Standardized Infection Ratio': 'DOUBLE',
    'Patient Safety and Adverse Events Composite (PSI-90)': 'DOUBLE',
    'Early Delivery (PC-01)': 'VARCHAR',
    'Exclusive Breast Milk Feeding (PC-05)': 'VARCHAR',
    'Appropriate care for severe sepsis and septic shock (SEP-1)': 'VARCHAR',
    'Average (median) time patients spent in the emergency department before leaving from the visit (OP-18B)': 'DOUBLE',
    'Safe Use of Opioids - Concurrent Prescribing (SAFE_USE_OPIOIDS)': 'VARCHAR',
    'MRI Lumbar Spine for Low Back Pain (OP-8)': 'VARCHAR',
    'Abdomen CT Use of Contrast Material (OP-10)': 'VARCHAR',
    'Outpatients who got cardiac imaging stress tests before low-risk outpatient surgery (OP-13)': 'VARCHAR',
    'Percent of patients receiving follow-up care within 30 days after hospitalization for mental illness (FUH-30)': 'VARCHAR',
    'Alcohol and other drug use disorder treatment provided or offered at discharge (SUB-3)': 'VARCHAR'
  })
LIMIT 10;

Unnamed: 0,PRVDR_NUM,FAC_NAME,CMS Overall Rating (Stars),"30-Day Unplanned Readmissions, Hospital Wide",HCAHPS Overall hospital rating (H_HSP_RATING_LINEAR_SCORE),CLABSI Standardized Infection Ratio,CAUTI Standardized Infection Ratio,Patient Safety and Adverse Events Composite (PSI-90),Early Delivery (PC-01),Exclusive Breast Milk Feeding (PC-05),Appropriate care for severe sepsis and septic shock (SEP-1),Average (median) time patients spent in the emergency department before leaving from the visit (OP-18B),Safe Use of Opioids - Concurrent Prescribing (SAFE_USE_OPIOIDS),MRI Lumbar Spine for Low Back Pain (OP-8),Abdomen CT Use of Contrast Material (OP-10),Outpatients who got cardiac imaging stress tests before low-risk outpatient surgery (OP-13),Percent of patients receiving follow-up care within 30 days after hospitalization for mental illness (FUH-30),Alcohol and other drug use disorder treatment provided or offered at discharge (SUB-3)
0,310001,HACKENSACK UNIVERSITY MEDICAL CENTER,4.0,14.2,84.0,0.354,0.391,0.74,7%,,72%,209.0,,50%,5%,5%,75%,97%
1,310002,NEWARK BETH ISRAEL MEDICAL CENTER,2.0,16.0,83.0,0.421,0.744,0.95,0%,,78%,218.0,,,8%,4%,41%,100%
2,310003,PALISADES MEDICAL CENTER,2.0,15.3,85.0,1.187,0.36,0.96,1%,,59%,182.0,,,5%,5%,,
3,310005,HUNTERDON MEDICAL CENTER,4.0,15.1,88.0,0.72,1.263,0.89,0%,,74%,225.0,,,6%,3%,,
4,310006,ST MARY'S GENERAL HOSPITAL,3.0,16.1,84.0,0.0,0.365,0.8,4%,,78%,166.0,,,8%,5%,27%,
5,310008,HOLY NAME MEDICAL CENTER,2.0,14.4,88.0,2.391,0.302,1.32,7%,,69%,208.0,,40%,6%,6%,81%,99%
6,310009,CLARA MAASS MEDICAL CENTER,4.0,15.2,83.0,0.454,0.286,1.18,0%,,82%,179.0,,,5%,3%,39%,100%
7,310010,PENN MEDICINE PRINCETON MEDICAL CENTER,4.0,15.4,90.0,0.0,0.138,0.79,8%,,68%,205.0,,,6%,6%,74%,84%
8,310012,VALLEY HOSPITAL,4.0,13.6,87.0,1.552,1.957,0.74,0%,,57%,207.0,,43%,4%,2%,,
9,310015,MORRISTOWN MEDICAL CENTER,5.0,13.5,90.0,0.318,0.285,0.66,1%,,45%,228.0,,28%,5%,4%,82%,98%
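The percentage measures in the query are declared as `VARCHAR`, so values such as `7%` stay strings in the result. As a rough sketch, assuming every non-missing value is a number followed by `%` and that missing values arrive as `None`, they could be converted to numbers after loading. The helper name `percent_to_float` is hypothetical and not part of the original notebook:

```python
from typing import Optional


def percent_to_float(value: Optional[str]) -> Optional[float]:
    """Convert a string such as '7%' to 7.0; pass missing values through as None."""
    if value is None:
        return None
    text = value.strip()
    if not text:
        return None
    return float(text.rstrip("%"))


# Values taken from the percentage columns in the output above.
examples = ["7%", "0%", "72%", None]
converted = [percent_to_float(v) for v in examples]
print(converted)  # [7.0, 0.0, 72.0, None]
```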


## Saving the database to a parquet file

Parquet is a columnar file format designed for compact storage and fast analytical reads. `duckdb` can write the result of a query to a Parquet file with `COPY ... TO`.

In [5]:
%%sql
COPY (SELECT * 
FROM read_csv('data.cms.gov%2Fnewyork_newark_core-based-statistical-area-hospital-quality-measures.csv', 
  header=True, 
  delim=',',
  skip=1,
  nullstr='Not Available',
  quote='"',
  columns={
    'PRVDR_NUM': 'VARCHAR',
    'FAC_NAME': 'VARCHAR',
    'CMS Overall Rating (Stars)': 'DOUBLE',
    '30-Day Unplanned Readmissions, Hospital Wide': 'DOUBLE',
    'HCAHPS Overall hospital rating (H_HSP_RATING_LINEAR_SCORE)': 'DOUBLE',
    'CLABSI Standardized Infection Ratio': 'DOUBLE',
    'CAUTI Standardized Infection Ratio': 'DOUBLE',
    'Patient Safety and Adverse Events Composite (PSI-90)': 'DOUBLE',
    'Early Delivery (PC-01)': 'VARCHAR',
    'Exclusive Breast Milk Feeding (PC-05)': 'VARCHAR',
    'Appropriate care for severe sepsis and septic shock (SEP-1)': 'VARCHAR',
    'Average (median) time patients spent in the emergency department before leaving from the visit (OP-18B)': 'DOUBLE',
    'Safe Use of Opioids - Concurrent Prescribing (SAFE_USE_OPIOIDS)': 'VARCHAR',
    'MRI Lumbar Spine for Low Back Pain (OP-8)': 'VARCHAR',
    'Abdomen CT Use of Contrast Material (OP-10)': 'VARCHAR',
    'Outpatients who got cardiac imaging stress tests before low-risk outpatient surgery (OP-13)': 'VARCHAR',
    'Percent of patients receiving follow-up care within 30 days after hospitalization for mental illness (FUH-30)': 'VARCHAR',
    'Alcohol and other drug use disorder treatment provided or offered at discharge (SUB-3)': 'VARCHAR'
  })
-- LIMIT 10
) TO './newyork_newark_quality_metrics.parquet' (COMPRESSION ZSTD);

Unnamed: 0,Count
0,176


## Version control of the SQL query 

The SQL query is kept under version control at https://github.com/onefact/data_build_tool_payless.health/blob/main/payless_health/models/quality_metrics/newyork_newark_quality_metrics.sql