# Goals

This notebook downloads the CMS.gov National Plan and Provider Enumeration System (NPPES) database and converts it into a developer-friendly format hosted on the Payless Health public S3 bucket at https://data.payless.health.

# Background

The Centers for Medicare & Medicaid Services (CMS) maintains a database of National Provider Identifier (NPI) numbers here:

https://www.cms.gov/Regulations-and-Guidance/Administrative-Simplification/NationalProvIdentStand/DataDissemination 

The database is updated monthly. At the time of writing, the most recent update (July 2023) is available here:

https://download.cms.gov/nppes/NPPES_Data_Dissemination_July_2023.zip 

This database is needed to link providers to records in hospital price transparency data and Transparency in Coverage data.
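
The download URL above embeds the month name and the year. As a minimal sketch, assuming other monthly releases keep the same naming pattern (an assumption, not something this notebook verifies), the URL for a given month could be built as follows. The helper name `nppes_url` is hypothetical.

```python
# English month names, spelled out to avoid any locale dependence.
MONTHS = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)

BASE_URL = "https://download.cms.gov/nppes"


def nppes_url(year: int, month: int) -> str:
    """Build a monthly NPPES download URL.

    Assumes the July 2023 file naming pattern also holds for other months.
    """
    if not 1 <= month <= 12:
        raise ValueError(f"month must be in 1..12, got {month}")
    return f"{BASE_URL}/NPPES_Data_Dissemination_{MONTHS[month - 1]}_{year}.zip"


print(nppes_url(2023, 7))
```

For July 2023 this reproduces the URL used in the `wget` cell of this notebook.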

In [1]:
!wget https://download.cms.gov/nppes/NPPES_Data_Dissemination_July_2023.zip

--2023-07-28 11:41:13--  https://download.cms.gov/nppes/NPPES_Data_Dissemination_July_2023.zip
Resolving download.cms.gov (download.cms.gov)... 104.127.188.67
Connecting to download.cms.gov (download.cms.gov)|104.127.188.67|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 942006146 (898M) [application/zip]
Saving to: ‘NPPES_Data_Dissemination_July_2023.zip’


2023-07-28 11:44:43 (4.29 MB/s) - ‘NPPES_Data_Dissemination_July_2023.zip’ saved [942006146/942006146]



In [2]:
!unzip NPPES_Data_Dissemination_July_2023.zip

Archive:  NPPES_Data_Dissemination_July_2023.zip
  inflating: othername_pfile_20050523-20230709.csv  
  inflating: othername_pfile_20050523-20230709_fileheader.csv  
  inflating: endpoint_pfile_20050523-20230709.csv  
  inflating: endpoint_pfile_20050523-20230709_fileheader.csv  
  inflating: pl_pfile_20050523-20230709.csv  
  inflating: pl_pfile_20050523-20230709_fileheader.csv  
  inflating: npidata_pfile_20050523-20230709.csv  
  inflating: npidata_pfile_20050523-20230709_fileheader.csv  
  inflating: NPPES_Data_Dissemination_Readme.pdf  
  inflating: NPPES_Data_Dissemination_CodeValues.pdf  


In [3]:
!head npidata_pfile_20050523-20230709.csv

"NPI","Entity Type Code","Replacement NPI","Employer Identification Number (EIN)","Provider Organization Name (Legal Business Name)","Provider Last Name (Legal Name)","Provider First Name","Provider Middle Name","Provider Name Prefix Text","Provider Name Suffix Text","Provider Credential Text","Provider Other Organization Name","Provider Other Organization Name Type Code","Provider Other Last Name","Provider Other First Name","Provider Other Middle Name","Provider Other Name Prefix Text","Provider Other Name Suffix Text","Provider Other Credential Text","Provider Other Last Name Type Code","Provider First Line Business Mailing Address","Provider Second Line Business Mailing Address","Provider Business Mailing Address City Name","Provider Business Mailing Address State Name","Provider Business Mailing Address Postal Code","Provider Business Mailing Address Country Code (If outside U.S.)","Provider Business Mailing Address Telephone Number","Provider Business Mailing Address Fax Number",

In [2]:
!ls -lh npidata_pfile_20050523-20230709.csv

-rw-r--r--@ 1 me  staff   8.8G Jul 10 03:57 npidata_pfile_20050523-20230709.csv


In [7]:
!wc -l npidata_pfile_20050523-20230709.csv

 7890767 npidata_pfile_20050523-20230709.csv
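
With a file this large, it can be useful to inspect the header without loading any data. The sketch below uses only Python's standard `csv` module to read the first row and count the columns. `read_header` is a hypothetical helper, and the tiny demo file (with a made-up data row) stands in for the real 8.8 GB CSV.

```python
import csv
import os
import tempfile


def read_header(path: str) -> list[str]:
    """Return the column names from the first row of a CSV file."""
    with open(path, newline="", encoding="utf-8") as f:
        return next(csv.reader(f))


# Demo on a small stand-in file; for the real data, pass the NPPES CSV path instead.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write('"NPI","Entity Type Code","Replacement NPI"\n"1234567890","1",""\n')
    demo_path = tmp.name

header = read_header(demo_path)
print(len(header), header)
os.remove(demo_path)
```

Because only the first row is read, the cost does not depend on the size of the file.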


## Example SQL query that hangs with LIMIT 1 on the 8.8 GB CSV file

In [2]:
# Load duckdb, which lets us efficiently load large files
import duckdb

# Import jupysql Jupyter extension to create SQL cells
%load_ext sql

# Configure jupysql to output data directly to Pandas and to simplify the output that is printed to the notebook.
%config SqlMagic.autopandas = True

%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False

# Connect jupysql to DuckDB using a SQLAlchemy-style connection string. Connect either to an in-memory DuckDB database or to a file-backed one.
%sql duckdb:///:memory:

In [3]:
%%sql
SELECT *
FROM read_csv('npidata_pfile_20050523-20230709.csv', 
  header=True,
  delim=',',
  quote='"',
  nullstr='<UNAVAIL>',
  dateformat='%m/%d/%Y',
  parallel=True,
  columns={
    'NPI': 'BIGINT',
    'Entity Type Code': 'INT',
    'Replacement NPI': 'BIGINT', 
    'Employer Identification Number (EIN)': 'VARCHAR',
    'Provider Organization Name (Legal Business Name)': 'VARCHAR',
    'Provider Last Name (Legal Name)': 'VARCHAR',
    'Provider First Name': 'VARCHAR',
    'Provider Middle Name': 'VARCHAR',
    'Provider Name Prefix Text': 'VARCHAR',
    'Provider Name Suffix Text': 'VARCHAR',
    'Provider Credential Text': 'VARCHAR',
    'Provider Other Organization Name': 'VARCHAR',
    'Provider Other Organization Name Type Code': 'VARCHAR',
    'Provider Other Last Name': 'VARCHAR',
    'Provider Other First Name': 'VARCHAR',
    'Provider Other Middle Name': 'VARCHAR',
    'Provider Other Name Prefix Text': 'VARCHAR',
    'Provider Other Name Suffix Text': 'VARCHAR',
    'Provider Other Credential Text': 'VARCHAR',
    'Provider Other Last Name Type Code': 'INT',
    'Provider First Line Business Mailing Address': 'VARCHAR',
    'Provider Second Line Business Mailing Address': 'VARCHAR',
    'Provider Business Mailing Address City Name': 'VARCHAR',
    'Provider Business Mailing Address State Name': 'VARCHAR',
    'Provider Business Mailing Address Postal Code': 'VARCHAR',
    'Provider Business Mailing Address Country Code (If outside U.S.)': 'VARCHAR',
    'Provider Business Mailing Address Telephone Number': 'VARCHAR',
    'Provider Business Mailing Address Fax Number': 'VARCHAR',
    'Provider First Line Business Practice Location Address': 'VARCHAR',
    'Provider Second Line Business Practice Location Address': 'VARCHAR',
    'Provider Business Practice Location Address City Name': 'VARCHAR',
    'Provider Business Practice Location Address State Name': 'VARCHAR',
    'Provider Business Practice Location Address Postal Code': 'VARCHAR',
    'Provider Business Practice Location Address Country Code (If outside U.S.)': 'VARCHAR',
    'Provider Business Practice Location Address Telephone Number': 'VARCHAR',
    'Provider Business Practice Location Address Fax Number': 'VARCHAR',
    'Provider Enumeration Date': 'DATE',
    'Last Update Date': 'DATE',
    'NPI Deactivation Reason Code': 'VARCHAR',
    'NPI Deactivation Date': 'DATE',
    'NPI Reactivation Date': 'DATE',
    'Provider Gender Code': 'VARCHAR', 
    'Authorized Official Last Name': 'VARCHAR',
    'Authorized Official First Name': 'VARCHAR',
    'Authorized Official Middle Name': 'VARCHAR',
    'Authorized Official Title or Position': 'VARCHAR',
    'Authorized Official Telephone Number': 'VARCHAR',
    'Healthcare Provider Taxonomy Code_1': 'VARCHAR',
    'Provider License Number_1': 'VARCHAR',
    'Provider License Number State Code_1': 'VARCHAR',
    'Healthcare Provider Primary Taxonomy Switch_1': 'VARCHAR',
    'Healthcare Provider Taxonomy Code_2': 'VARCHAR',
    'Provider License Number_2': 'VARCHAR',
    'Provider License Number State Code_2': 'VARCHAR',
    'Healthcare Provider Primary Taxonomy Switch_2': 'VARCHAR',
    'Healthcare Provider Taxonomy Code_3': 'VARCHAR',
    'Provider License Number_3': 'VARCHAR',
    'Provider License Number State Code_3': 'VARCHAR',
    'Healthcare Provider Primary Taxonomy Switch_3': 'VARCHAR',
    'Healthcare Provider Taxonomy Code_4': 'VARCHAR',
    'Provider License Number_4': 'VARCHAR',
    'Provider License Number State Code_4': 'VARCHAR',
    'Healthcare Provider Primary Taxonomy Switch_4': 'VARCHAR',
    'Healthcare Provider Taxonomy Code_5': 'VARCHAR',
    'Provider License Number_5': 'VARCHAR',
    'Provider License Number State Code_5': 'VARCHAR',
    'Healthcare Provider Primary Taxonomy Switch_5': 'VARCHAR',
    'Healthcare Provider Taxonomy Code_6': 'VARCHAR',
    'Provider License Number_6': 'VARCHAR',
    'Provider License Number State Code_6': 'VARCHAR',
    'Healthcare Provider Primary Taxonomy Switch_6': 'VARCHAR',
    'Healthcare Provider Taxonomy Code_7': 'VARCHAR',
    'Provider License Number_7': 'VARCHAR',
    'Provider License Number State Code_7': 'VARCHAR',
    'Healthcare Provider Primary Taxonomy Switch_7': 'VARCHAR',
    'Healthcare Provider Taxonomy Code_8': 'VARCHAR',
    'Provider License Number_8': 'VARCHAR',
    'Provider License Number State Code_8': 'VARCHAR',
    'Healthcare Provider Primary Taxonomy Switch_8': 'VARCHAR',
    'Healthcare Provider Taxonomy Code_9': 'VARCHAR',
    'Provider License Number_9': 'VARCHAR',
    'Provider License Number State Code_9': 'VARCHAR',
    'Healthcare Provider Primary Taxonomy Switch_9': 'VARCHAR',
    'Healthcare Provider Taxonomy Code_10': 'VARCHAR',
    'Provider License Number_10': 'VARCHAR',
    'Provider License Number State Code_10': 'VARCHAR',
    'Healthcare Provider Primary Taxonomy Switch_10': 'VARCHAR',
    'Healthcare Provider Taxonomy Code_11': 'VARCHAR',
    'Provider License Number_11': 'VARCHAR',
    'Provider License Number State Code_11': 'VARCHAR',
    'Healthcare Provider Primary Taxonomy Switch_11': 'VARCHAR',
    'Healthcare Provider Taxonomy Code_12': 'VARCHAR',
    'Provider License Number_12': 'VARCHAR',
    'Provider License Number State Code_12': 'VARCHAR',
    'Healthcare Provider Primary Taxonomy Switch_12': 'VARCHAR',
    'Healthcare Provider Taxonomy Code_13': 'VARCHAR',
    'Provider License Number_13': 'VARCHAR',
    'Provider License Number State Code_13': 'VARCHAR',
    'Healthcare Provider Primary Taxonomy Switch_13': 'VARCHAR',   
    'Healthcare Provider Taxonomy Code_14': 'VARCHAR',
    'Provider License Number_14': 'VARCHAR',
    'Provider License Number State Code_14': 'VARCHAR',
    'Healthcare Provider Primary Taxonomy Switch_14': 'VARCHAR',  
    'Healthcare Provider Taxonomy Code_15': 'VARCHAR',
    'Provider License Number_15': 'VARCHAR',
    'Provider License Number State Code_15': 'VARCHAR',
    'Healthcare Provider Primary Taxonomy Switch_15': 'VARCHAR',
    'Other Provider Identifier_1': 'VARCHAR',
    'Other Provider Identifier Type Code_1': 'VARCHAR',
    'Other Provider Identifier State_1': 'VARCHAR',
    'Other Provider Identifier Issuer_1': 'VARCHAR',
    'Other Provider Identifier_2': 'VARCHAR',
    'Other Provider Identifier Type Code_2': 'VARCHAR',
    'Other Provider Identifier State_2': 'VARCHAR',
    'Other Provider Identifier Issuer_2': 'VARCHAR',
    'Other Provider Identifier_3': 'VARCHAR',
    'Other Provider Identifier Type Code_3': 'VARCHAR',
    'Other Provider Identifier State_3': 'VARCHAR',
    'Other Provider Identifier Issuer_3': 'VARCHAR',
    'Other Provider Identifier_4': 'VARCHAR',
    'Other Provider Identifier Type Code_4': 'VARCHAR',
    'Other Provider Identifier State_4': 'VARCHAR',
    'Other Provider Identifier Issuer_4': 'VARCHAR',
    'Other Provider Identifier_5': 'VARCHAR',
    'Other Provider Identifier Type Code_5': 'VARCHAR',
    'Other Provider Identifier State_5': 'VARCHAR',
    'Other Provider Identifier Issuer_5': 'VARCHAR',
    'Other Provider Identifier_6': 'VARCHAR',
    'Other Provider Identifier Type Code_6': 'VARCHAR',
    'Other Provider Identifier State_6': 'VARCHAR',
    'Other Provider Identifier Issuer_6': 'VARCHAR',
    'Other Provider Identifier_7': 'VARCHAR',
    'Other Provider Identifier Type Code_7': 'VARCHAR',
    'Other Provider Identifier State_7': 'VARCHAR',
    'Other Provider Identifier Issuer_7': 'VARCHAR',
    'Other Provider Identifier_8': 'VARCHAR',
    'Other Provider Identifier Type Code_8': 'VARCHAR',
    'Other Provider Identifier State_8': 'VARCHAR',
    'Other Provider Identifier Issuer_8': 'VARCHAR',
    'Other Provider Identifier_9': 'VARCHAR',
    'Other Provider Identifier Type Code_9': 'VARCHAR',
    'Other Provider Identifier State_9': 'VARCHAR',
    'Other Provider Identifier Issuer_9': 'VARCHAR',
    'Other Provider Identifier_10': 'VARCHAR',
    'Other Provider Identifier Type Code_10': 'VARCHAR',
    'Other Provider Identifier State_10': 'VARCHAR',
    'Other Provider Identifier Issuer_10': 'VARCHAR',
    'Other Provider Identifier_11': 'VARCHAR',
    'Other Provider Identifier Type Code_11': 'VARCHAR',
    'Other Provider Identifier State_11': 'VARCHAR',
    'Other Provider Identifier Issuer_11': 'VARCHAR',
    'Other Provider Identifier_12': 'VARCHAR',
    'Other Provider Identifier Type Code_12': 'VARCHAR',
    'Other Provider Identifier State_12': 'VARCHAR',
    'Other Provider Identifier Issuer_12': 'VARCHAR',
    'Other Provider Identifier_13': 'VARCHAR',
    'Other Provider Identifier Type Code_13': 'VARCHAR',
    'Other Provider Identifier State_13': 'VARCHAR',
    'Other Provider Identifier Issuer_13': 'VARCHAR',
    'Other Provider Identifier_14': 'VARCHAR',
    'Other Provider Identifier Type Code_14': 'VARCHAR', 
    'Other Provider Identifier State_14': 'VARCHAR',
    'Other Provider Identifier Issuer_14': 'VARCHAR',
    'Other Provider Identifier_15': 'VARCHAR',
    'Other Provider Identifier Type Code_15': 'VARCHAR',
    'Other Provider Identifier State_15': 'VARCHAR',
    'Other Provider Identifier Issuer_15': 'VARCHAR',
    'Other Provider Identifier_16': 'VARCHAR',
    'Other Provider Identifier Type Code_16': 'VARCHAR',
    'Other Provider Identifier State_16': 'VARCHAR',
    'Other Provider Identifier Issuer_16': 'VARCHAR',
    'Other Provider Identifier_17': 'VARCHAR',
    'Other Provider Identifier Type Code_17': 'VARCHAR',
    'Other Provider Identifier State_17': 'VARCHAR',
    'Other Provider Identifier Issuer_17': 'VARCHAR',
    'Other Provider Identifier_18': 'VARCHAR',
    'Other Provider Identifier Type Code_18': 'VARCHAR',
    'Other Provider Identifier State_18': 'VARCHAR',
    'Other Provider Identifier Issuer_18': 'VARCHAR',
    'Other Provider Identifier_19': 'VARCHAR',
    'Other Provider Identifier Type Code_19': 'VARCHAR',
    'Other Provider Identifier State_19': 'VARCHAR',
    'Other Provider Identifier Issuer_19': 'VARCHAR',
    'Other Provider Identifier_20': 'VARCHAR',
    'Other Provider Identifier Type Code_20': 'VARCHAR',
    'Other Provider Identifier State_20': 'VARCHAR',
    'Other Provider Identifier Issuer_20': 'VARCHAR',
    'Other Provider Identifier_21': 'VARCHAR',
    'Other Provider Identifier Type Code_21': 'VARCHAR',
    'Other Provider Identifier State_21': 'VARCHAR',
    'Other Provider Identifier Issuer_21': 'VARCHAR',
    'Other Provider Identifier_22': 'VARCHAR',
    'Other Provider Identifier Type Code_22': 'VARCHAR',
    'Other Provider Identifier State_22': 'VARCHAR',
    'Other Provider Identifier Issuer_22': 'VARCHAR',
    'Other Provider Identifier_23': 'VARCHAR',
    'Other Provider Identifier Type Code_23': 'VARCHAR',
    'Other Provider Identifier State_23': 'VARCHAR',
    'Other Provider Identifier Issuer_23': 'VARCHAR',
    'Other Provider Identifier_24': 'VARCHAR',  
    'Other Provider Identifier Type Code_24': 'VARCHAR',
    'Other Provider Identifier State_24': 'VARCHAR',
    'Other Provider Identifier Issuer_24': 'VARCHAR',
    'Other Provider Identifier_25': 'VARCHAR',
    'Other Provider Identifier Type Code_25': 'VARCHAR',
    'Other Provider Identifier State_25': 'VARCHAR',
    'Other Provider Identifier Issuer_25': 'VARCHAR',
    'Other Provider Identifier_26': 'VARCHAR',
    'Other Provider Identifier Type Code_26': 'VARCHAR',
    'Other Provider Identifier State_26': 'VARCHAR',
    'Other Provider Identifier Issuer_26': 'VARCHAR',
    'Other Provider Identifier_27': 'VARCHAR',
    'Other Provider Identifier Type Code_27': 'VARCHAR',
    'Other Provider Identifier State_27': 'VARCHAR',
    'Other Provider Identifier Issuer_27': 'VARCHAR',
    'Other Provider Identifier_28': 'VARCHAR',
    'Other Provider Identifier Type Code_28': 'VARCHAR',
    'Other Provider Identifier State_28': 'VARCHAR',
    'Other Provider Identifier Issuer_28': 'VARCHAR',
    'Other Provider Identifier_29': 'VARCHAR',
    'Other Provider Identifier Type Code_29': 'VARCHAR',
    'Other Provider Identifier State_29': 'VARCHAR',
    'Other Provider Identifier Issuer_29': 'VARCHAR',
    'Other Provider Identifier_30': 'VARCHAR',
    'Other Provider Identifier Type Code_30': 'VARCHAR',
    'Other Provider Identifier State_30': 'VARCHAR',
    'Other Provider Identifier Issuer_30': 'VARCHAR',
    'Other Provider Identifier_31': 'VARCHAR',
    'Other Provider Identifier Type Code_31': 'VARCHAR',
    'Other Provider Identifier State_31': 'VARCHAR',
    'Other Provider Identifier Issuer_31': 'VARCHAR',
    'Other Provider Identifier_32': 'VARCHAR',
    'Other Provider Identifier Type Code_32': 'VARCHAR',
    'Other Provider Identifier State_32': 'VARCHAR',
    'Other Provider Identifier Issuer_32': 'VARCHAR',
    'Other Provider Identifier_33': 'VARCHAR',
    'Other Provider Identifier Type Code_33': 'VARCHAR',
    'Other Provider Identifier State_33': 'VARCHAR',
    'Other Provider Identifier Issuer_33': 'VARCHAR',
    'Other Provider Identifier_34': 'VARCHAR',
    'Other Provider Identifier Type Code_34': 'VARCHAR',
    'Other Provider Identifier State_34': 'VARCHAR',
    'Other Provider Identifier Issuer_34': 'VARCHAR',
    'Other Provider Identifier_35': 'VARCHAR',
    'Other Provider Identifier Type Code_35': 'VARCHAR',
    'Other Provider Identifier State_35': 'VARCHAR',
    'Other Provider Identifier Issuer_35': 'VARCHAR',
    'Other Provider Identifier_36': 'VARCHAR',
    'Other Provider Identifier Type Code_36': 'VARCHAR',
    'Other Provider Identifier State_36': 'VARCHAR',
    'Other Provider Identifier Issuer_36': 'VARCHAR',
    'Other Provider Identifier_37': 'VARCHAR',
    'Other Provider Identifier Type Code_37': 'VARCHAR',
    'Other Provider Identifier State_37': 'VARCHAR',
    'Other Provider Identifier Issuer_37': 'VARCHAR',
    'Other Provider Identifier_38': 'VARCHAR',
    'Other Provider Identifier Type Code_38': 'VARCHAR',
    'Other Provider Identifier State_38': 'VARCHAR',
    'Other Provider Identifier Issuer_38': 'VARCHAR',
    'Other Provider Identifier_39': 'VARCHAR',
    'Other Provider Identifier Type Code_39': 'VARCHAR',
    'Other Provider Identifier State_39': 'VARCHAR',
    'Other Provider Identifier Issuer_39': 'VARCHAR',
    'Other Provider Identifier_40': 'VARCHAR',
    'Other Provider Identifier Type Code_40': 'VARCHAR',
    'Other Provider Identifier State_40': 'VARCHAR',
    'Other Provider Identifier Issuer_40': 'VARCHAR',
    'Other Provider Identifier_41': 'VARCHAR',
    'Other Provider Identifier Type Code_41': 'VARCHAR',
    'Other Provider Identifier State_41': 'VARCHAR',
    'Other Provider Identifier Issuer_41': 'VARCHAR',
    'Other Provider Identifier_42': 'VARCHAR',
    'Other Provider Identifier Type Code_42': 'VARCHAR',
    'Other Provider Identifier State_42': 'VARCHAR',
    'Other Provider Identifier Issuer_42': 'VARCHAR',
    'Other Provider Identifier_43': 'VARCHAR',
    'Other Provider Identifier Type Code_43': 'VARCHAR',
    'Other Provider Identifier State_43': 'VARCHAR',
    'Other Provider Identifier Issuer_43': 'VARCHAR',
    'Other Provider Identifier_44': 'VARCHAR',
    'Other Provider Identifier Type Code_44': 'VARCHAR',
    'Other Provider Identifier State_44': 'VARCHAR',
    'Other Provider Identifier Issuer_44': 'VARCHAR',
    'Other Provider Identifier_45': 'VARCHAR',
    'Other Provider Identifier Type Code_45': 'VARCHAR',
    'Other Provider Identifier State_45': 'VARCHAR',
    'Other Provider Identifier Issuer_45': 'VARCHAR',
    'Other Provider Identifier_46': 'VARCHAR',
    'Other Provider Identifier Type Code_46': 'VARCHAR',
    'Other Provider Identifier State_46': 'VARCHAR',
    'Other Provider Identifier Issuer_46': 'VARCHAR',
    'Other Provider Identifier_47': 'VARCHAR',
    'Other Provider Identifier Type Code_47': 'VARCHAR',
    'Other Provider Identifier State_47': 'VARCHAR',
    'Other Provider Identifier Issuer_47': 'VARCHAR',
    'Other Provider Identifier_48': 'VARCHAR',
    'Other Provider Identifier Type Code_48': 'VARCHAR',
    'Other Provider Identifier State_48': 'VARCHAR',
    'Other Provider Identifier Issuer_48': 'VARCHAR',
    'Other Provider Identifier_49': 'VARCHAR',
    'Other Provider Identifier Type Code_49': 'VARCHAR',
    'Other Provider Identifier State_49': 'VARCHAR',
    'Other Provider Identifier Issuer_49': 'VARCHAR',  
    'Other Provider Identifier_50': 'VARCHAR',
    'Other Provider Identifier Type Code_50': 'VARCHAR',
    'Other Provider Identifier State_50': 'VARCHAR',
    'Other Provider Identifier Issuer_50': 'VARCHAR',
    'Is Sole Proprietor': 'VARCHAR',
    'Is Organization Subpart': 'VARCHAR', 
    'Parent Organization LBN': 'VARCHAR',
    'Parent Organization TIN': 'VARCHAR',
    'Authorized Official Name Prefix Text': 'VARCHAR',
    'Authorized Official Name Suffix Text': 'VARCHAR',
    'Authorized Official Credential Text': 'VARCHAR',
    'Healthcare Provider Taxonomy Group_1': 'VARCHAR',
    'Healthcare Provider Taxonomy Group_2': 'VARCHAR',
    'Healthcare Provider Taxonomy Group_3': 'VARCHAR',
    'Healthcare Provider Taxonomy Group_4': 'VARCHAR',
    'Healthcare Provider Taxonomy Group_5': 'VARCHAR',
    'Healthcare Provider Taxonomy Group_6': 'VARCHAR',
    'Healthcare Provider Taxonomy Group_7': 'VARCHAR',
    'Healthcare Provider Taxonomy Group_8': 'VARCHAR',
    'Healthcare Provider Taxonomy Group_9': 'VARCHAR',
    'Healthcare Provider Taxonomy Group_10': 'VARCHAR',
    'Healthcare Provider Taxonomy Group_11': 'VARCHAR',
    'Healthcare Provider Taxonomy Group_12': 'VARCHAR',
    'Healthcare Provider Taxonomy Group_13': 'VARCHAR',
    'Healthcare Provider Taxonomy Group_14': 'VARCHAR',
    'Healthcare Provider Taxonomy Group_15': 'VARCHAR',
    'Certification Date': 'DATE'
  }
)
LIMIT 1;
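
Most of the `columns` map above consists of numbered groups that repeat the same field names with a suffix: 15 taxonomy slots, 50 other-identifier slots, and 15 taxonomy groups. As a sketch, those repeated entries could be generated in Python rather than typed by hand. `repeated_columns` is a hypothetical helper; it covers only the numbered VARCHAR columns, not the fixed columns at the start and end of the map.

```python
def repeated_columns() -> dict[str, str]:
    """Generate the numbered VARCHAR columns (taxonomy, other identifier, taxonomy group)."""
    taxonomy_fields = [
        "Healthcare Provider Taxonomy Code",
        "Provider License Number",
        "Provider License Number State Code",
        "Healthcare Provider Primary Taxonomy Switch",
    ]
    identifier_fields = [
        "Other Provider Identifier",
        "Other Provider Identifier Type Code",
        "Other Provider Identifier State",
        "Other Provider Identifier Issuer",
    ]
    columns: dict[str, str] = {}
    # 15 taxonomy slots, four fields each.
    for i in range(1, 16):
        for field in taxonomy_fields:
            columns[f"{field}_{i}"] = "VARCHAR"
    # 50 other-identifier slots, four fields each.
    for i in range(1, 51):
        for field in identifier_fields:
            columns[f"{field}_{i}"] = "VARCHAR"
    # 15 taxonomy group columns.
    for i in range(1, 16):
        columns[f"Healthcare Provider Taxonomy Group_{i}"] = "VARCHAR"
    return columns


cols = repeated_columns()
print(len(cols))
```

The generated names match the hand-written entries in the query, so the dictionary could be merged with the fixed columns and rendered into the `columns={...}` argument.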

## Working example with an 18 GB CSV file

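The comparison file is the NYC "311 Service Requests from 2010 to Present" dataset, published at https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9. The direct CSV download is: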

https://data.cityofnewyork.us/api/views/erm2-nwe9/rows.csv?accessType=DOWNLOAD

In [1]:
!wget https://data.cityofnewyork.us/api/views/erm2-nwe9/rows.csv\?accessType=DOWNLOAD

--2023-07-28 12:41:21--  https://data.cityofnewyork.us/api/views/erm2-nwe9/rows.csv?accessType=DOWNLOAD
Resolving data.cityofnewyork.us (data.cityofnewyork.us)... 52.206.68.26, 52.206.140.205, 52.206.140.199
Connecting to data.cityofnewyork.us (data.cityofnewyork.us)|52.206.68.26|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/csv]
Saving to: ‘rows.csv?accessType=DOWNLOAD.1’

rows.csv?accessType     [                <=> ]  17.92G  3.32MB/s    in 1h 43m  

2023-07-28 14:24:27 (2.97 MB/s) - ‘rows.csv?accessType=DOWNLOAD.1’ saved [19239179665]



In [3]:
!head rows.csv?accessType=DOWNLOAD.1

Unique Key,Created Date,Closed Date,Agency,Agency Name,Complaint Type,Descriptor,Location Type,Incident Zip,Incident Address,Street Name,Cross Street 1,Cross Street 2,Intersection Street 1,Intersection Street 2,Address Type,City,Landmark,Facility Type,Status,Due Date,Resolution Description,Resolution Action Updated Date,Community Board,BBL,Borough,X Coordinate (State Plane),Y Coordinate (State Plane),Open Data Channel Type,Park Facility Name,Park Borough,Vehicle Type,Taxi Company Borough,Taxi Pick Up Location,Bridge Highway Name,Bridge Highway Direction,Road Ramp,Bridge Highway Segment,Latitude,Longitude,Location
47812452,10/07/2020 12:41:00 PM,10/09/2020 12:00:00 PM,DSNY,Department of Sanitation,Sanitation Condition,15 Street Cond/Dump-Out/Drop-Off,Street,11223,,,,,WEST    1 STREET,KINGS HIGHWAY,INTERSECTION,BROOKLYN,,DSNY Garage,Closed,,The Department of Sanitation removed the items.,10/09/2020 12:00:00 PM,11 BROOKLYN,,BROOKLYN,991413,159481,PHONE,Unspecified,BROOKLYN,,,,,,,,40.60441

In [4]:
!ls -lh rows.csv?accessType=DOWNLOAD.1

-rw-r--r--@ 1 me  staff    18G Jul 27 21:32 rows.csv?accessType=DOWNLOAD.1


In [6]:
!wc -l rows.csv?accessType=DOWNLOAD.1

 33705608 rows.csv?accessType=DOWNLOAD.1


In [5]:
%%sql
SELECT *
FROM read_csv('rows.csv?accessType=DOWNLOAD.1',
    header=True,
    delim=',',
    quote='"',
    parallel=false,
    columns={'Unique Key': 'BIGINT',
    'Created Date': 'VARCHAR',
    'Closed Date': 'VARCHAR',
    'Agency': 'VARCHAR',
    'Agency Name': 'VARCHAR',
    'Complaint Type': 'VARCHAR',
    'Descriptor': 'VARCHAR',
    'Location Type': 'VARCHAR',
    'Incident Zip': 'VARCHAR',
    'Incident Address': 'VARCHAR',
    'Street Name': 'VARCHAR',
    'Cross Street 1': 'VARCHAR',
    'Cross Street 2': 'VARCHAR',
    'Intersection Street 1': 'VARCHAR',
    'Intersection Street 2': 'VARCHAR',
    'Address Type': 'VARCHAR',
    'City': 'VARCHAR',
    'Landmark': 'VARCHAR',
    'Facility Type': 'VARCHAR',
    'Status': 'VARCHAR',
    'Due Date': 'VARCHAR',
    'Resolution Description': 'VARCHAR',
    'Resolution Action Updated Date': 'VARCHAR',
    'Community Board': 'VARCHAR',
    'BBL': 'VARCHAR',
    'Borough': 'VARCHAR',
    'X Coordinate (State Plane)': 'VARCHAR',
    'Y Coordinate (State Plane)': 'VARCHAR',
    'Open Data Channel Type': 'VARCHAR',
    'Park Facility Name': 'VARCHAR',
    'Park Borough': 'VARCHAR',
    'Vehicle Type': 'VARCHAR',
    'Taxi Company Borough': 'VARCHAR',
    'Taxi Pick Up Location': 'VARCHAR',
    'Bridge Highway Name': 'VARCHAR',
    'Bridge Highway Direction': 'VARCHAR',
    'Road Ramp': 'VARCHAR',
    'Bridge Highway Segment': 'VARCHAR',
    'Latitude': 'DOUBLE',
    'Longitude': 'DOUBLE',
    'Location': 'VARCHAR'}) 
LIMIT 10;



Unnamed: 0,Unique Key,Created Date,Closed Date,Agency,Agency Name,Complaint Type,Descriptor,Location Type,Incident Zip,Incident Address,...,Vehicle Type,Taxi Company Borough,Taxi Pick Up Location,Bridge Highway Name,Bridge Highway Direction,Road Ramp,Bridge Highway Segment,Latitude,Longitude,Location
0,47812452,10/07/2020 12:41:00 PM,10/09/2020 12:00:00 PM,DSNY,Department of Sanitation,Sanitation Condition,15 Street Cond/Dump-Out/Drop-Off,Street,11223,,...,,,,,,,,40.604412,-73.974204,"(40.60441192977613, -73.9742039699663)"
1,47812453,10/07/2020 09:52:16 PM,10/07/2020 10:12:26 PM,NYPD,New York City Police Department,Noise - Commercial,Car/Truck Music,Store/Commercial,10460,1569 HOE AVENUE,...,,,,,,,,40.834901,-73.888121,"(40.834900968256505, -73.88812143532677)"
2,47812454,10/07/2020 07:46:00 PM,10/10/2020 12:00:00 AM,DSNY,Department of Sanitation,Request Large Bulky Item Collection,Request Large Bulky Item Collection,Sidewalk,10465,720 WILCOX AVENUE,...,,,,,,,,40.82964,-73.816197,"(40.82964046550051, -73.8161966544683)"
3,47812455,10/07/2020 08:04:00 PM,10/10/2020 12:00:00 AM,DSNY,Department of Sanitation,Request Large Bulky Item Collection,Request Large Bulky Item Collection,Sidewalk,11422,149-95 254 STREET,...,,,,,,,,40.650345,-73.73589,"(40.6503448115306, -73.7358902635658)"
4,47812456,10/07/2020 08:16:00 PM,10/09/2020 12:00:00 AM,DSNY,Department of Sanitation,Request Large Bulky Item Collection,Request Large Bulky Item Collection,Sidewalk,10465,219 REVERE AVENUE,...,,,,,,,,40.816393,-73.816988,"(40.81639294813431, -73.81698831093635)"
5,47812457,10/07/2020 04:25:46 PM,10/30/2020 06:15:23 PM,OSE,Mayor’s Office of Special Enforcement,NonCompliance with Phased Reopening,Business not in compliance,Store/Commercial,10036,1466 BROADWAY,...,,,,,,,,40.755569,-73.986465,"(40.75556864630674, -73.98646452281714)"
6,47812458,10/07/2020 09:34:00 PM,10/09/2020 12:00:00 AM,DSNY,Department of Sanitation,Request Large Bulky Item Collection,Request Large Bulky Item Collection,Sidewalk,10460,1440 TAYLOR AVENUE,...,,,,,,,,40.836242,-73.866147,"(40.83624168153411, -73.86614702832426)"
7,47812459,10/07/2020 11:02:00 PM,10/09/2020 12:00:00 AM,DSNY,Department of Sanitation,Request Large Bulky Item Collection,Request Large Bulky Item Collection,Sidewalk,11221,517A LEXINGTON AVENUE,...,,,,,,,,40.688832,-73.94099,"(40.688831986186905, -73.94099029352407)"
8,47812460,10/07/2020 11:45:00 PM,10/16/2020 12:00:00 AM,DSNY,Department of Sanitation,Request Large Bulky Item Collection,Request Large Bulky Item Collection,Sidewalk,10028,513 EAST 82 STREET,...,,,,,,,,40.773116,-73.948451,"(40.77311602453136, -73.94845053944576)"
9,47812461,10/08/2020 12:48:00 AM,10/11/2020 12:00:00 AM,DSNY,Department of Sanitation,Request Large Bulky Item Collection,Request Large Bulky Item Collection,Sidewalk,11216,307 PUTNAM AVENUE,...,,,,,,,,40.684115,-73.949377,"(40.68411497439874, -73.94937747284051)"


In [4]:
%%sql
COPY (SELECT *
FROM read_csv('rows.csv?accessType=DOWNLOAD.1',
    header=True,
    delim=',',
    quote='"',
    parallel=false,
    columns={'Unique Key': 'BIGINT',
    'Created Date': 'VARCHAR',
    'Closed Date': 'VARCHAR',
    'Agency': 'VARCHAR',
    'Agency Name': 'VARCHAR',
    'Complaint Type': 'VARCHAR',
    'Descriptor': 'VARCHAR',
    'Location Type': 'VARCHAR',
    'Incident Zip': 'VARCHAR',
    'Incident Address': 'VARCHAR',
    'Street Name': 'VARCHAR',
    'Cross Street 1': 'VARCHAR',
    'Cross Street 2': 'VARCHAR',
    'Intersection Street 1': 'VARCHAR',
    'Intersection Street 2': 'VARCHAR',
    'Address Type': 'VARCHAR',
    'City': 'VARCHAR',
    'Landmark': 'VARCHAR',
    'Facility Type': 'VARCHAR',
    'Status': 'VARCHAR',
    'Due Date': 'VARCHAR',
    'Resolution Description': 'VARCHAR',
    'Resolution Action Updated Date': 'VARCHAR',
    'Community Board': 'VARCHAR',
    'BBL': 'VARCHAR',
    'Borough': 'VARCHAR',
    'X Coordinate (State Plane)': 'VARCHAR',
    'Y Coordinate (State Plane)': 'VARCHAR',
    'Open Data Channel Type': 'VARCHAR',
    'Park Facility Name': 'VARCHAR',
    'Park Borough': 'VARCHAR',
    'Vehicle Type': 'VARCHAR',
    'Taxi Company Borough': 'VARCHAR',
    'Taxi Pick Up Location': 'VARCHAR',
    'Bridge Highway Name': 'VARCHAR',
    'Bridge Highway Direction': 'VARCHAR',
    'Road Ramp': 'VARCHAR',
    'Bridge Highway Segment': 'VARCHAR',
    'Latitude': 'DOUBLE',
    'Longitude': 'DOUBLE',
    'Location': 'VARCHAR'}) 
-- LIMIT 1000000 -- uncomment this line to create a smaller version of the file for testing purposes
) TO './service_requests.parquet' (COMPRESSION ZSTD);

Unnamed: 0,Count
0,33705607
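
The exported row count can be cross-checked against the earlier `wc -l` output. A CSV with one header line, a trailing newline, and no newlines embedded in quoted fields should have exactly one more line than it has data rows. A small sketch of that arithmetic, using the hypothetical helper `expected_rows`:

```python
def expected_rows(line_count: int, header_lines: int = 1) -> int:
    """Data rows expected from a line count, assuming one record per line."""
    return line_count - header_lines


wc_lines = 33705608     # from `wc -l` on the downloaded CSV
copied_rows = 33705607  # Count reported by the COPY statement

assert expected_rows(wc_lines) == copied_rows
print("row counts agree")
```

The two numbers agree here, which suggests that every line of the CSV was read as exactly one record.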


In [11]:
import polars as pl

In [12]:
df = pl.read_parquet('service_requests.parquet')

In [13]:
df

Unique Key,Created Date,Closed Date,Agency,Agency Name,Complaint Type,Descriptor,Location Type,Incident Zip,Incident Address,Street Name,Cross Street 1,Cross Street 2,Intersection Street 1,Intersection Street 2,Address Type,City,Landmark,Facility Type,Status,Due Date,Resolution Description,Resolution Action Updated Date,Community Board,BBL,Borough,X Coordinate (State Plane),Y Coordinate (State Plane),Open Data Channel Type,Park Facility Name,Park Borough,Vehicle Type,Taxi Company Borough,Taxi Pick Up Location,Bridge Highway Name,Bridge Highway Direction,Road Ramp,Bridge Highway Segment,Latitude,Longitude,Location
i64,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,f64,f64,str
47812452,"""10/07/2020 12:…","""10/09/2020 12:…","""DSNY""","""Department of …","""Sanitation Con…","""15 Street Cond…","""Street""","""11223""",,,,,"""WEST 1 STRE…","""KINGS HIGHWAY""","""INTERSECTION""","""BROOKLYN""",,"""DSNY Garage""","""Closed""",,"""The Department…","""10/09/2020 12:…","""11 BROOKLYN""",,"""BROOKLYN""","""991413""","""159481""","""PHONE""","""Unspecified""","""BROOKLYN""",,,,,,,,40.604412,-73.974204,"""(40.6044119297…"
47812453,"""10/07/2020 09:…","""10/07/2020 10:…","""NYPD""","""New York City …","""Noise - Commer…","""Car/Truck Musi…","""Store/Commerci…","""10460""","""1569 HOE AVENU…","""HOE AVENUE""","""EAST 172 STRE…","""EAST 173 STRE…","""EAST 172 STRE…","""EAST 173 STRE…",,"""BRONX""","""HOE AVENUE""",,"""Closed""",,"""The Police Dep…","""10/07/2020 10:…","""03 BRONX""","""2029820027""","""BRONX""","""1015209""","""243474""","""MOBILE""","""Unspecified""","""BRONX""",,,,,,,,40.834901,-73.888121,"""(40.8349009682…"
47812454,"""10/07/2020 07:…","""10/10/2020 12:…","""DSNY""","""Department of …","""Request Large …","""Request Large …","""Sidewalk""","""10465""","""720 WILCOX AVE…","""WILCOX AVENUE""","""RANDALL AVENUE…","""PHILIP AVENUE""",,,"""ADDRESS""","""BRONX""",,,"""Closed""",,,"""10/10/2020 12:…","""10 BRONX""","""2054800051""","""BRONX""","""1035116""","""241591""","""UNKNOWN""","""Unspecified""","""BRONX""",,,,,,,,40.82964,-73.816197,"""(40.8296404655…"
47812455,"""10/07/2020 08:…","""10/10/2020 12:…","""DSNY""","""Department of …","""Request Large …","""Request Large …","""Sidewalk""","""11422""","""149-95 254 STR…","""254 STREET""","""149 DRIVE""","""CRAFT AVENUE""",,,"""ADDRESS""","""Rosedale""",,,"""Closed""",,,"""10/10/2020 12:…","""13 QUEENS""","""4136580031""","""QUEENS""","""1057537""","""176325""","""UNKNOWN""","""Unspecified""","""QUEENS""",,,,,,,,40.650345,-73.73589,"""(40.6503448115…"
47812456,"""10/07/2020 08:…","""10/09/2020 12:…","""DSNY""","""Department of …","""Request Large …","""Request Large …","""Sidewalk""","""10465""","""219 REVERE AVE…","""REVERE AVENUE""","""HARDING AVENUE…","""LAWTON AVENUE""",,,"""ADDRESS""","""BRONX""",,,"""Closed""",,,"""10/09/2020 12:…","""10 BRONX""","""2055900181""","""BRONX""","""1034907""","""236764""","""UNKNOWN""","""Unspecified""","""BRONX""",,,,,,,,40.816393,-73.816988,"""(40.8163929481…"
47812457,"""10/07/2020 04:…","""10/30/2020 06:…","""OSE""","""Mayor’s Offi…","""NonCompliance …","""Business not i…","""Store/Commerci…","""10036""","""1466 BROADWAY""","""BROADWAY""","""WEST 41 STRE…","""WEST 42 STRE…","""WEST 41 STRE…","""WEST 42 STRE…",,"""NEW YORK""","""BROADWAY""",,"""Closed""",,"""Thank you for …","""10/30/2020 06:…","""05 MANHATTAN""","""1009947502""","""MANHATTAN""","""988000""","""214551""","""ONLINE""","""Unspecified""","""MANHATTAN""",,,,,,,,40.755569,-73.986465,"""(40.7555686463…"
47812458,"""10/07/2020 09:…","""10/09/2020 12:…","""DSNY""","""Department of …","""Request Large …","""Request Large …","""Sidewalk""","""10460""","""1440 TAYLOR AV…","""TAYLOR AVENUE""","""EAST 174 STREE…","""ARCHER STREET""",,,"""ADDRESS""","""BRONX""",,,"""Closed""",,,"""10/09/2020 12:…","""09 BRONX""","""2039000021""","""BRONX""","""1021289""","""243971""","""UNKNOWN""","""Unspecified""","""BRONX""",,,,,,,,40.836242,-73.866147,"""(40.8362416815…"
47812459,"""10/07/2020 11:…","""10/09/2020 12:…","""DSNY""","""Department of …","""Request Large …","""Request Large …","""Sidewalk""","""11221""","""517A LEXINGTON…","""LEXINGTON AVEN…","""THROOP AVENUE""","""MARCUS GARVEY …",,,"""ADDRESS""","""BROOKLYN""",,,"""Closed""",,,"""10/09/2020 12:…","""03 BROOKLYN""","""3018010070""","""BROOKLYN""","""1000615""","""190242""","""UNKNOWN""","""Unspecified""","""BROOKLYN""",,,,,,,,40.688832,-73.94099,"""(40.6888319861…"
47812460,"""10/07/2020 11:…","""10/16/2020 12:…","""DSNY""","""Department of …","""Request Large …","""Request Large …","""Sidewalk""","""10028""","""513 EAST 82 …","""EAST 82 STRE…","""YORK AVENUE""","""EAST END AVENU…",,,"""ADDRESS""","""NEW YORK""",,,"""Closed""",,,"""10/16/2020 12:…","""08 MANHATTAN""","""1015790009""","""MANHATTAN""","""998528""","""220948""","""UNKNOWN""","""Unspecified""","""MANHATTAN""",,,,,,,,40.773116,-73.948451,"""(40.7731160245…"
47812461,"""10/08/2020 12:…","""10/11/2020 12:…","""DSNY""","""Department of …","""Request Large …","""Request Large …","""Sidewalk""","""11216""","""307 PUTNAM AVE…","""PUTNAM AVENUE""","""NOSTRAND AVENU…","""MARCY AVENUE""",,,"""ADDRESS""","""BROOKLYN""",,,"""Closed""",,,"""10/11/2020 12:…","""03 BROOKLYN""","""3018230089""","""BROOKLYN""","""998290""","""188522""","""UNKNOWN""","""Unspecified""","""BROOKLYN""",,,,,,,,40.684115,-73.949377,"""(40.6841149743…"


## System information

In [14]:
import duckdb 
duckdb.__version__

'0.8.1'