# Parsing New York City Department of Buildings Permit Issuance Data

Makeshift data dictionary:
* https://www.nyc.gov/site/buildings/industry/permit-type-and-job-status-codes.page

Data source: 
* https://data.cityofnewyork.us/Housing-Development/DOB-Permit-Issuance/ipu4-2q9a/about_data

Via: Aaron 🧎

We will follow the data thinking workflow: 

1. First, use the Jupyter notebook `%%sql` cell magic to parse the entirety of the comma-separated values file with a `SELECT *` statement.
2. Then specify data types for every field needed.
3. Then add `ENUM`s to properly enumerate the data dictionary references.
4. Then reach out to a human who is incentivized to help us understand whether we made the correct type-inference decisions in Steps 1–3.
5. If time permits, file bugs for any open source issues that we believe can be readily reproduced and that would compound into technical debt or user experience regressions if we relied on this infrastructure in the future.

We will use large language models throughout this exercise so that we write as little code as possible by hand. On airplanes or without wireless internet, we will use ethernet cables or local large language models to help us (though a Visual Studio Code plugin for WizardCoder or other coding-specific LLMs that run with MLX may not be available yet).

For air-gapped systems or airplanes, we will download documentation in advance:

* https://duckdb.org/duckdb-docs.pdf 

This will also be helpful to feed to the large language models :)

In [2]:
# Load duckdb, which lets us efficiently load large files
import duckdb

# Load pandas, which lets us manipulate dataframes
import pandas as pd

# Import jupysql Jupyter extension to create SQL cells
%load_ext sql

# Set configurations on jupysql to output data directly to pandas and to simplify the output that is printed to the notebook.
%config SqlMagic.autopandas = True

%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False

# Allow named parameters (Python variables) in SQL cells.
# JupySQL expects one of "warn", "enabled", or "disabled" here; a bare True is rejected with a warning.
# See https://jupysql.ploomber.io/en/latest/api/configuration.html#named-parameters
%config SqlMagic.named_parameters = "enabled"

# Connect jupysql to DuckDB using a SQLAlchemy-style connection string. Either connect to an in-memory DuckDB or a file-backed database.
%sql duckdb:///:memory:


# 1. Load with `SELECT *`

In [10]:
%%sql
-- DuckDB's default date format is ISO (e.g. 2017-07-27); this file uses date_format = %m/%d/%Y (Auto-Detected), so the date columns are read as VARCHAR for now.
-- Issue to file: JupySQL (Ploomber) cannot auto-format a SQL magic cell :(
-- Plan: (1) first get the file to parse with SELECT *, (2) then add ENUMs/data dictionary references, (3) then reach a human whose job is on the line with these issues and confirm your decisions.
SELECT * FROM read_csv_auto('~/Downloads/DOB_Permit_Issuance_20240419.csv', 
                            types={
                                'Job Start Date': 'VARCHAR', 
                                'Filing Date': 'VARCHAR', 
                                'Issuance Date': 'VARCHAR',
                                'Expiration Date': 'VARCHAR'
                                })

Unnamed: 0,BOROUGH,Bin #,House #,Street Name,Job #,Job doc. #,Job Type,Self_Cert,Block,Lot,...,Owner’s House State,Owner’s House Zip Code,Owner's Phone #,DOBRunDate,PERMIT_SI_NO,LATITUDE,LONGITUDE,COUNCIL_DISTRICT,CENSUS_TRACT,NTA_NAME
0,MANHATTAN,1088749,1,MADISON AVE,141008987,01,A3,Y,00853,00002,...,,,2125942700,05/11/2022 00:00:00,3905851,40.740909,-73.987947,2.0,56.0,Hudson Yards-Chelsea-Flatiron-Union Square
1,STATEN ISLAND,5076937,87,BOYLAN STREET,540218539,01,A2,Y,05687,00066,...,,,9174201655,05/11/2022 00:00:00,3905852,40.563654,-74.179584,51.0,17008.0,Arden Heights
2,STATEN I;iay;iaSLAND,5001506,217,LAFAYETTE AVENUE,540218575,01,A2,Y,00064,00022,...,,,7188125847,05/11/2022 00:00:00,3905853,40.639633,-74.094169,49.0,81.0,West New Brighton-New Brighton-St. George
3,STATEN ISLAND,5067021,170,OAKDALE STREET,540218600,01,A2,Y,05260,00001,...,,,3478575846,05/11/2022 00:00:00,3905854,40.544597,-74.157153,51.0,15601.0,Great Kills
4,STATEN ISLAND,5058036,273,10 STREET,540218628,01,A2,Y,04242,00045,...,,,7186195891,05/11/2022 00:00:00,3905855,40.566798,-74.119726,50.0,134.0,New Dorp-Midland Beach
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3964751,BROOKLYN,3081923,2096,STRAUSS ST.,321455849,01,A1,N,03569,00040,...,,,7189643877,04/18/2024 00:00:00,3974607,40.662655,-73.914773,42.0,898.0,Brownsville
3964752,BRONX,2096464,2050,SEDGWICK AVENUE,220682740,02,A2,N,03222,00062,...,,,6466642624,04/18/2024 00:00:00,3974750,40.858855,-73.915100,14.0,249.0,University Heights-Morris Heights
3964753,BRONX,2100243,2060,SEDGWICK AVENUE,220682759,01,A2,N,03222,00062,...,,,6466642624,04/18/2024 00:00:00,3974751,40.858905,-73.914981,14.0,249.0,University Heights-Morris Heights
3964754,BRONX,2100243,2060,SEDGWICK AVENUE,220682759,02,A2,N,03222,00062,...,,,6466642624,04/18/2024 00:00:00,3974752,40.858905,-73.914981,14.0,249.0,University Heights-Morris Heights


# 2. Fix date types

In [14]:
%%sql
SELECT *
FROM read_csv_auto('~/Downloads/DOB_Permit_Issuance_20240419.csv', 
                            types={
                                'Job Start Date': 'VARCHAR', 
                                'Filing Date': 'VARCHAR', 
                                'Issuance Date': 'VARCHAR',
                                'Expiration Date': 'VARCHAR'
                                })
LIMIT 10;

Unnamed: 0,BOROUGH,Bin #,House #,Street Name,Job #,Job doc. #,Job Type,Self_Cert,Block,Lot,...,Owner’s House State,Owner’s House Zip Code,Owner's Phone #,DOBRunDate,PERMIT_SI_NO,LATITUDE,LONGITUDE,COUNCIL_DISTRICT,CENSUS_TRACT,NTA_NAME
0,MANHATTAN,1088749,1,MADISON AVE,141008987,1,A3,Y,853,2,...,,,2125942700,05/11/2022 00:00:00,3905851,40.740909,-73.987947,2,56,Hudson Yards-Chelsea-Flatiron-Union Square
1,STATEN ISLAND,5076937,87,BOYLAN STREET,540218539,1,A2,Y,5687,66,...,,,9174201655,05/11/2022 00:00:00,3905852,40.563654,-74.179584,51,17008,Arden Heights
2,STATEN I;iay;iaSLAND,5001506,217,LAFAYETTE AVENUE,540218575,1,A2,Y,64,22,...,,,7188125847,05/11/2022 00:00:00,3905853,40.639633,-74.094169,49,81,West New Brighton-New Brighton-St. George
3,STATEN ISLAND,5067021,170,OAKDALE STREET,540218600,1,A2,Y,5260,1,...,,,3478575846,05/11/2022 00:00:00,3905854,40.544597,-74.157153,51,15601,Great Kills
4,STATEN ISLAND,5058036,273,10 STREET,540218628,1,A2,Y,4242,45,...,,,7186195891,05/11/2022 00:00:00,3905855,40.566798,-74.119726,50,134,New Dorp-Midland Beach
5,BROOKLYN,3006577,101,DOUGLASS STREET,321004603,1,NB,N,409,48,...,,,7187079550,05/11/2022 00:00:00,3905856,40.683034,-73.991282,33,69,DUMBO-Vinegar Hill-Downtown Brooklyn-Boerum Hill
6,BROOKLYN,3171950,1272,EAST 10TH ST,321980335,1,A1,N,6545,36,...,,,9292949460,05/11/2022 00:00:00,3905857,40.618282,-73.964974,44,454,Ocean Parkway South
7,STATEN ISLAND,5067915,203,THORNYCROFT AVE,540218584,1,A2,Y,5289,54,...,,,7189094572,05/11/2022 00:00:00,3905858,40.537587,-74.154776,51,15601,Great Kills
8,BROOKLYN,3039744,91,RALPH AVENUE,340811081,1,A3,Y,1485,8,...,,,7184751836,05/11/2022 00:00:00,3905859,40.686912,-73.923501,41,375,Stuyvesant Heights
9,BRONX,2075402,3421,COUNTRY CLUB ROAD,220516118,1,A1,N,5409,424,...,,,9145303057,05/11/2022 00:00:00,3905860,40.839666,-73.815547,13,27402,Pelham Bay-Country Club-City Island


In [19]:
%%sql
SELECT "Job Start Date"
FROM read_csv_auto('~/Downloads/DOB_Permit_Issuance_20240419.csv', 
                            types={
                                'Job Start Date': 'VARCHAR', 
                                'Filing Date': 'VARCHAR', 
                                'Issuance Date': 'VARCHAR',
                                'Expiration Date': 'VARCHAR'
                                })
LIMIT 10;

Unnamed: 0,Job Start Date
0,05/10/2022
1,05/12/2022
2,05/15/2022
3,05/15/2022
4,05/24/2022
5,06/19/2017
6,02/17/2021
7,05/10/2022
8,05/06/2021
9,12/15/2017


In [37]:
%%sql
SELECT "Job Start Date", "Filing Date", "Issuance Date", "Expiration Date"
FROM read_csv('~/Downloads/DOB_Permit_Issuance_20240419.csv', 
                            types={
                                'Job Start Date': 'DATE', 
                                'Filing Date': 'DATE', 
                                'Issuance Date': 'DATE',
                                'Expiration Date': 'DATE',
                                },
                            dateformat='%m/%d/%Y')
LIMIT 10;

Unnamed: 0,Job Start Date,Filing Date,Issuance Date,Expiration Date
0,2022-05-10,2022-05-10,2022-05-10,2023-05-10
1,2022-05-12,2022-05-10,2022-05-10,2022-10-01
2,2022-05-15,2022-05-10,2022-05-10,2022-10-01
3,2022-05-15,2022-05-10,2022-05-10,2022-10-01
4,2022-05-24,2022-05-10,2022-05-10,2022-11-15
5,2017-06-19,2022-05-10,2022-05-10,2023-05-10
6,2021-02-17,2022-05-10,2022-05-10,2023-04-10
7,2022-05-10,2022-05-10,2022-05-10,2022-08-12
8,2021-05-06,2022-05-10,2022-05-10,2023-05-10
9,2017-12-15,2022-05-10,2022-05-10,2022-11-16


In [39]:
%%sql
SELECT *
FROM read_csv('~/Downloads/DOB_Permit_Issuance_20240419.csv', 
                            types={
                                'Job Start Date': 'DATE', 
                                'Filing Date': 'DATE', 
                                'Issuance Date': 'DATE',
                                'Expiration Date': 'DATE',
                                },
                            dateformat='%m/%d/%Y')
LIMIT 10;

Unnamed: 0,BOROUGH,Bin #,House #,Street Name,Job #,Job doc. #,Job Type,Self_Cert,Block,Lot,...,Owner’s House State,Owner’s House Zip Code,Owner's Phone #,DOBRunDate,PERMIT_SI_NO,LATITUDE,LONGITUDE,COUNCIL_DISTRICT,CENSUS_TRACT,NTA_NAME
0,MANHATTAN,1088749,1,MADISON AVE,141008987,1,A3,Y,853,2,...,,,2125942700,05/11/2022 00:00:00,3905851,40.740909,-73.987947,2,56,Hudson Yards-Chelsea-Flatiron-Union Square
1,STATEN ISLAND,5076937,87,BOYLAN STREET,540218539,1,A2,Y,5687,66,...,,,9174201655,05/11/2022 00:00:00,3905852,40.563654,-74.179584,51,17008,Arden Heights
2,STATEN I;iay;iaSLAND,5001506,217,LAFAYETTE AVENUE,540218575,1,A2,Y,64,22,...,,,7188125847,05/11/2022 00:00:00,3905853,40.639633,-74.094169,49,81,West New Brighton-New Brighton-St. George
3,STATEN ISLAND,5067021,170,OAKDALE STREET,540218600,1,A2,Y,5260,1,...,,,3478575846,05/11/2022 00:00:00,3905854,40.544597,-74.157153,51,15601,Great Kills
4,STATEN ISLAND,5058036,273,10 STREET,540218628,1,A2,Y,4242,45,...,,,7186195891,05/11/2022 00:00:00,3905855,40.566798,-74.119726,50,134,New Dorp-Midland Beach
5,BROOKLYN,3006577,101,DOUGLASS STREET,321004603,1,NB,N,409,48,...,,,7187079550,05/11/2022 00:00:00,3905856,40.683034,-73.991282,33,69,DUMBO-Vinegar Hill-Downtown Brooklyn-Boerum Hill
6,BROOKLYN,3171950,1272,EAST 10TH ST,321980335,1,A1,N,6545,36,...,,,9292949460,05/11/2022 00:00:00,3905857,40.618282,-73.964974,44,454,Ocean Parkway South
7,STATEN ISLAND,5067915,203,THORNYCROFT AVE,540218584,1,A2,Y,5289,54,...,,,7189094572,05/11/2022 00:00:00,3905858,40.537587,-74.154776,51,15601,Great Kills
8,BROOKLYN,3039744,91,RALPH AVENUE,340811081,1,A3,Y,1485,8,...,,,7184751836,05/11/2022 00:00:00,3905859,40.686912,-73.923501,41,375,Stuyvesant Heights
9,BRONX,2075402,3421,COUNTRY CLUB ROAD,220516118,1,A1,N,5409,424,...,,,9145303057,05/11/2022 00:00:00,3905860,40.839666,-73.815547,13,27402,Pelham Bay-Country Club-City Island


In [41]:
!mkdir -p ~/data/cityofnewyork.us

In [45]:
%%sql
COPY(
    SELECT *
    FROM read_csv('~/Downloads/DOB_Permit_Issuance_20240419.csv', 
                types={
                    'Job Start Date': 'DATE', 
                    'Filing Date': 'DATE', 
                    'Issuance Date': 'DATE',
                    'Expiration Date': 'DATE',
                    },
                dateformat='%m/%d/%Y',
                ignore_errors=false)
) TO '~/data/cityofnewyork.us/DOB_Permit_Issuance_20240419.parquet' (FORMAT 'parquet', COMPRESSION 'zstd')

RuntimeError: (duckdb.duckdb.ConversionException) Conversion Error: CSV Error on Line: 260774
Error when converting column "Job Start Date".
Could not parse string "06/00/2000" according to format specifier "%m/%d/%Y"
06/00/2000
   ^
Error: Day out of range, expected a value between 1 and 31
Column Job Start Date is being converted as type DATE
This type was either manually set or derived from an existing table. Select a different type to correctly parse this column.
  file=/Users/me/Downloads/DOB_Permit_Issuance_20240419.csv
  delimiter = , (Auto-Detected)
  quote = " (Auto-Detected)
  escape = " (Auto-Detected)
  new_line = \n (Auto-Detected)
  header = true (Auto-Detected)
  skip_rows = 0 (Auto-Detected)
  date_format = %m/%d/%Y (Set By User)
  timestamp_format =  (Auto-Detected)
  null_padding=0
  sample_size=20480
  ignore_errors=0
  all_varchar=0

[SQL: COPY(
    SELECT *
    FROM read_csv('~/Downloads/DOB_Permit_Issuance_20240419.csv',
                types={
                   

In [9]:
%%sql
COPY(
    SELECT *
    FROM read_csv('~/Downloads/DOB_Permit_Issuance_20240419.csv', 
                types={
                    'Job Start Date': 'DATE', 
                    'Filing Date': 'DATE', 
                    'Issuance Date': 'DATE',
                    'Expiration Date': 'DATE',
                    },
                dateformat='%m/%d/%Y',
                ignore_errors=true,)
    LIMIT 10
) TO '~/data/cityofnewyork.us/DOB_Permit_Issuance_20240419.parquet' (FORMAT 'parquet', COMPRESSION 'zstd')

RuntimeError: If using snippets, you may pass the --with argument explicitly.
For more details please refer: https://jupysql.ploomber.io/en/latest/compose.html#with-argument


Original error message from DB driver:
(duckdb.duckdb.ParserException) Parser Error: syntax error at or near ")"
[SQL: COPY(
    SELECT *
    FROM read_csv('~/Downloads/DOB_Permit_Issuance_20240419.csv',
                types={
                    'Job Start Date': 'DATE',
                    'Filing Date': 'DATE',
                    'Issuance Date': 'DATE',
                    'Expiration Date': 'DATE',
                    },
                dateformat='%m/%d/%Y',
                ignore_errors=true,)
    LIMIT 10
) TO '~/data/cityofnewyork.us/DOB_Permit_Issuance_20240419.parquet' (FORMAT 'parquet', COMPRESSION 'zstd')]
(Background on this error at: https://sqlalche.me/e/20/f405)

If you need help solving this issue, send us a message: https://ploomber.io/community
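
Before falling back to `ignore_errors=true`, which skips rows that fail conversion, it can help to know which date strings are actually invalid. The following is a minimal sketch using only the Python standard library; the helper name `find_bad_dates` and the sample list are hypothetical, and only `06/00/2000` comes from the error message above.

```python
from datetime import datetime

DATE_FORMAT = "%m/%d/%Y"  # same format string that was passed to DuckDB's dateformat


def find_bad_dates(values, date_format=DATE_FORMAT):
    """Return (position, value) pairs for strings that do not parse as dates."""
    bad = []
    for position, value in enumerate(values):
        if value is None or value == "":
            continue  # treat empty fields as missing, not as errors
        try:
            datetime.strptime(value, date_format)
        except ValueError:
            bad.append((position, value))
    return bad


# Hypothetical sample; "06/00/2000" is the value reported in the conversion error.
sample = ["05/10/2022", "06/00/2000", "", "12/15/2017"]
bad_dates = find_bad_dates(sample)
print(bad_dates)
```

Running the same check over a date column read as `VARCHAR` would list the values to raise with the data owner in Step 4, rather than dropping them silently.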


In [50]:
%%sql 
SELECT * FROM '~/data/cityofnewyork.us/DOB_Permit_Issuance_20240419.parquet' 


Unnamed: 0,BOROUGH,Bin #,House #,Street Name,Job #,Job doc. #,Job Type,Self_Cert,Block,Lot,...,Owner’s House State,Owner’s House Zip Code,Owner's Phone #,DOBRunDate,PERMIT_SI_NO,LATITUDE,LONGITUDE,COUNCIL_DISTRICT,CENSUS_TRACT,NTA_NAME
0,MANHATTAN,1088749,1,MADISON AVE,141008987,01,A3,Y,00853,00002,...,,,2125942700,05/11/2022 00:00:00,3905851,40.740909,-73.987947,2.0,56.0,Hudson Yards-Chelsea-Flatiron-Union Square
1,STATEN ISLAND,5076937,87,BOYLAN STREET,540218539,01,A2,Y,05687,00066,...,,,9174201655,05/11/2022 00:00:00,3905852,40.563654,-74.179584,51.0,17008.0,Arden Heights
2,STATEN I;iay;iaSLAND,5001506,217,LAFAYETTE AVENUE,540218575,01,A2,Y,00064,00022,...,,,7188125847,05/11/2022 00:00:00,3905853,40.639633,-74.094169,49.0,81.0,West New Brighton-New Brighton-St. George
3,STATEN ISLAND,5067021,170,OAKDALE STREET,540218600,01,A2,Y,05260,00001,...,,,3478575846,05/11/2022 00:00:00,3905854,40.544597,-74.157153,51.0,15601.0,Great Kills
4,STATEN ISLAND,5058036,273,10 STREET,540218628,01,A2,Y,04242,00045,...,,,7186195891,05/11/2022 00:00:00,3905855,40.566798,-74.119726,50.0,134.0,New Dorp-Midland Beach
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3956124,BROOKLYN,3081923,2096,STRAUSS ST.,321455849,01,A1,N,03569,00040,...,,,7189643877,04/18/2024 00:00:00,3974607,40.662655,-73.914773,42.0,898.0,Brownsville
3956125,BRONX,2096464,2050,SEDGWICK AVENUE,220682740,02,A2,N,03222,00062,...,,,6466642624,04/18/2024 00:00:00,3974750,40.858855,-73.915100,14.0,249.0,University Heights-Morris Heights
3956126,BRONX,2100243,2060,SEDGWICK AVENUE,220682759,01,A2,N,03222,00062,...,,,6466642624,04/18/2024 00:00:00,3974751,40.858905,-73.914981,14.0,249.0,University Heights-Morris Heights
3956127,BRONX,2100243,2060,SEDGWICK AVENUE,220682759,02,A2,N,03222,00062,...,,,6466642624,04/18/2024 00:00:00,3974752,40.858905,-73.914981,14.0,249.0,University Heights-Morris Heights


In [52]:
!head ~/Downloads/DOB_Permit_Issuance_20240419.csv

BOROUGH,Bin #,House #,Street Name,Job #,Job doc. #,Job Type,Self_Cert,Block,Lot,Community Board,Zip Code,Bldg Type,Residential,Special District 1,Special District 2,Work Type,Permit Status,Filing Status,Permit Type,Permit Sequence #,Permit Subtype,Oil Gas,Site Fill,Filing Date,Issuance Date,Expiration Date,Job Start Date,Permittee's First Name,Permittee's Last Name,Permittee's Business Name,Permittee's Phone #,Permittee's License Type,Permittee's License #,Act as Superintendent,Permittee's Other Title,HIC License,Site Safety Mgr's First Name,Site Safety Mgr's Last Name,Site Safety Mgr Business Name,Superintendent First & Last Name,Superintendent Business Name,Owner's Business Type,Non-Profit,Owner's Business Name,Owner's First Name,Owner's Last Name,Owner's House #,Owner's House Street Name,Owner’s House City,Owner’s House State,Owner’s House Zip Code,Owner's Phone #,DOBRunDate,PERMIT_SI_NO,LATITUDE,LONGITUDE,COUNCIL_DISTRICT,CENSUS_TRACT,NTA_NAME
MANHATTAN,1088749,1,MADISON AVE,

## Format the first row of the file (column names) into multiple lines
```
"BOROUGH",
"Bin #",
"House #",
"Street Name",
"Job #",
"Job doc. #",
"Job Type",
"Self_Cert",
"Block",
"Lot",
"Community Board",
"Zip Code",
"Bldg Type",
"Residential",
"Special District 1",
"Special District 2",
"Work Type",
"Permit Status",
"Filing Status",
"Permit Type",
"Permit Sequence #",
"Permit Subtype",
"Oil Gas",
"Site Fill",
"Filing Date",
"Issuance Date",
"Expiration Date",
"Job Start Date",
"Permittee's First Name",
"Permittee's Last Name",
"Permittee's Business Name",
"Permittee's Phone #",
"Permittee's License Type",
"Permittee's License #",
"Act as Superintendent",
"Permittee's Other Title",
"HIC License",
"Site Safety Mgr's First Name",
"Site Safety Mgr's Last Name",
"Site Safety Mgr Business Name",
"Superintendent First & Last Name",
"Superintendent Business Name",
"Owner's Business Type",
"Non-Profit",
"Owner's Business Name",
"Owner's First Name",
"Owner's Last Name",
"Owner's House #",
"Owner's House Street Name",
"Owner’s House City",
"Owner’s House State",
"Owner’s House Zip Code",
"Owner's Phone #",
"DOBRunDate",
"PERMIT_SI_NO",
"LATITUDE",
"LONGITUDE",
"COUNCIL_DISTRICT",
"CENSUS_TRACT",
"NTA_NAME",
```
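
A list like the one above does not have to be typed by hand. As a minimal sketch, assuming the header is a single comma-separated line, the standard `csv` module can split it and emit one double-quoted name per line. The function name `header_to_lines` and the shortened header are hypothetical; the real header is much longer.

```python
import csv
import io


def header_to_lines(header_line):
    """Split a CSV header row and return one double-quoted name per line."""
    names = next(csv.reader(io.StringIO(header_line)))
    return "\n".join(f'"{name}",' for name in names)


# Shortened, hypothetical header taken from the first few real column names.
header = "BOROUGH,Bin #,House #,Street Name,Job #"
formatted = header_to_lines(header)
print(formatted)
```

Using `csv.reader` rather than `str.split(",")` keeps names intact if a column name is itself quoted and contains a comma. Names that contain a double quote would still need escaping, which this sketch does not handle.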

## Print the unique values in every column

In [84]:
import pandas as pd
df = pd.read_csv('~/Downloads/DOB_Permit_Issuance_20240419.csv')
for column in df.columns:
    unique = df[column].unique()
    if len(unique) < 100:
        print(column, unique.tolist())


BOROUGH ['MANHATTAN', 'STATEN ISLAND', 'STATEN I;iay;iaSLAND', 'BROOKLYN', 'BRONX', 'QUEENS']
Job doc. # [1, 3, 2, 4, 5, 6, 9, 7, 8, 10, 11, 12]
Job Type ['A3', 'A2', 'NB', 'A1', 'DM', 'SG']
Self_Cert ['Y', 'N', nan, 'R', 'J', 'X']
Bldg Type [2.0, 1.0, nan]
Residential [nan, 'YES']
Special District 2 [nan, 'POPS', 'GW', 'IBZ', 'JAM', 'HILI', 'BPRK', 'GCP2']
Work Type ['EQ', 'OT', 'PL', nan, 'BL', 'MH', 'SD', 'SP', 'FB', 'FS', 'FP', 'CC', 'FA', 'NB']
Permit Status ['ISSUED', 'IN PROCESS', 'RE-ISSUED', nan, 'REVOKED']
Filing Status ['INITIAL', 'RENEWAL']
Permit Type ['EQ', 'EW', 'PL', 'AL', 'NB', 'FO', 'DM', 'SG', nan]
Permit Sequence # [1, 4, 2, 7, 5, 3, 6, 8, 9, 15, 10, 16, 12, 14, 13, 11, 17, 18, 24, 21, 22, 31, 20, 19, 25, 33, 34, 23, 28, 26, 27, 29, 35, 36, 30, 32]
Permit Subtype ['OT', nan, 'FN', 'EA', 'BL', 'MH', 'SD', 'SP', 'SH', 'SF', 'FB', 'FS', 'FP', 'CH', 'FA', 'SC']
Oil Gas [nan, 'OIL', 'GAS']
Site Fill [nan, 'NOT APPLICABLE', 'ON-SITE', 'USE UNDER 300 CU.YD', 'OFF-SITE', 'N
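
The same low-cardinality scan can be done without pandas, which may matter when the file is too large to load comfortably. This is a rough sketch using `csv.DictReader`; the function name, the tiny inline sample, and the `limit` value are all hypothetical, and every value stays a string, so zero padding such as `01` is preserved.

```python
import csv
import io
from collections import defaultdict


def low_cardinality_columns(rows, limit=100):
    """Return {column: sorted unique values} for columns with fewer than `limit` values."""
    seen = defaultdict(set)
    for row in rows:
        for column, value in row.items():
            if value != "":  # skip empty fields, which pandas reports as nan
                seen[column].add(value)
    return {c: sorted(v) for c, v in seen.items() if len(v) < limit}


# Tiny hypothetical extract that reuses three of the real column names.
sample_csv = """BOROUGH,Job Type,Job #
MANHATTAN,A3,141008987
STATEN ISLAND,A2,540218539
BROOKLYN,NB,321004603
BROOKLYN,A2,321980335
"""
result = low_cardinality_columns(csv.DictReader(io.StringIO(sample_csv)), limit=4)
print(result)
```

Note that this sketch keeps a set of every distinct value per column, so memory use still grows for high-cardinality columns such as `Job #`.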

## Create a data dictionary of types

In [3]:
data_types = {
    "BOROUGH": str,
    "Bin #": str,
    "House #": str,
    "Street Name": str,
    "Job #": int,
    "Job doc. #": int,
    "Job Type": str,
    "Self_Cert": str,
    "Block": str,
    "Lot": str,
    "Community Board": int,
    "Zip Code": str,
    "Bldg Type": int,
    "Residential": str,
    "Special District 1": str,
    "Special District 2": str,
    "Work Type": str,
    "Permit Status": str,
    "Filing Status": str,
    "Permit Type": str,
    "Permit Sequence #": int,
    "Permit Subtype": str,
    "Oil Gas": str,
    "Site Fill": str,
    "Filing Date": str,
    "Issuance Date": str,
    "Expiration Date": str,
    "Job Start Date": str,
    "Permittee's First Name": str,
    "Permittee's Last Name": str,
    "Permittee's Business Name": str,
    "Permittee's Phone #": str,
    "Permittee's License Type": str,
    "Permittee's License #": str,
    "Act as Superintendent": str,
    "Permittee's Other Title": str,
    "HIC License": str,
    "Site Safety Mgr's First Name": str,
    "Site Safety Mgr's Last Name": str,
    "Site Safety Mgr Business Name": str,
    "Superintendent First & Last Name": str,
    "Superintendent Business Name": str,
    "Owner's Business Type": str,
    "Non-Profit": str,
    "Owner's Business Name": str,
    "Owner's First Name": str,
    "Owner's Last Name": str,
    "Owner's House #": str,
    "Owner's House Street Name": str,
    "Owner’s House City": str,
    "Owner’s House State": str,
    "Owner’s House Zip Code": str,
    "Owner's Phone #": str,
    "DOBRunDate": str,
    "PERMIT_SI_NO": str,
    "LATITUDE": float,
    "LONGITUDE": float,
    "COUNCIL_DISTRICT": int,
    "CENSUS_TRACT": str,
    "NTA_NAME": str,
}

## Generate SQL for the unique entries

In [4]:
import pandas as pd

df = pd.read_csv("~/Downloads/DOB_Permit_Issuance_20240419.csv")


In [79]:
from typing import List


def clean(value, strings_to_remove: List[str]):
    """Remove every substring in strings_to_remove from a string value."""
    if isinstance(value, str):
        for string in strings_to_remove:
            value = value.replace(string, "")
    return value


string = ""
strings_to_remove = ["’", "'"]
for column in df.columns:
    unique_values = df[column].dropna().unique()
    if len(unique_values) < 100:
        type_cls = data_types[column]
        clean_values = [clean(value, strings_to_remove) for value in unique_values]
        sorted_values = sorted([type_cls(value) for value in clean_values])
        enum_args = ",".join([f"'{value}'" for value in sorted_values])
        clean_column = clean(column, strings_to_remove)
        if type_cls == int:
            # Strip leading zeros only (e.g. COBOL-style '01' -> '1'). An unanchored
            # '0' pattern would also remove the zero in values such as '10' or '50'.
            string += f"""
        regexp_replace("{column}", '^0+', '')::ENUM ({enum_args}) AS "{clean_column}","""
        elif any(
            any(char in strings_to_remove for char in str(value))
            for value in unique_values
        ):
            # allow: letters, digits, spaces, hyphens, colons, slashes, and semicolons
            regexp = f"""regexp_replace("{column}", '[^a-zA-Z0-9-/;: ]', '', 'g')"""
            string += f"""
        {regexp}::ENUM ({enum_args}) AS "{clean_column}","""
        else:
            string += f"""
        "{column}"::ENUM ({enum_args}) AS "{clean_column}","""
    # else:
    #     string += f"""
    #     "{column}" AS "{column}", """
print(string)


        "BOROUGH"::ENUM ('BRONX','BROOKLYN','MANHATTAN','QUEENS','STATEN I;iay;iaSLAND','STATEN ISLAND') AS "BOROUGH",
        regexp_replace("Job doc. #", '^0+', '')::ENUM ('1','2','3','4','5','6','7','8','9','10','11','12') AS "Job doc. #",
        "Job Type"::ENUM ('A1','A2','A3','DM','NB','SG') AS "Job Type",
        "Self_Cert"::ENUM ('J','N','R','X','Y') AS "Self_Cert",
        regexp_replace("Bldg Type", '^0+', '')::ENUM ('1','2') AS "Bldg Type",
        "Residential"::ENUM ('YES') AS "Residential",
        "Special District 2"::ENUM ('BPRK','GCP2','GW','HILI','IBZ','JAM','POPS') AS "Special District 2",
        "Work Type"::ENUM ('BL','CC','EQ','FA','FB','FP','FS','MH','NB','OT','PL','SD','SP') AS "Work Type",
        "Permit Status"::ENUM ('IN PROCESS','ISSUED','RE-ISSUED','REVOKED') AS "Permit Status",
        "Filing Status"::ENUM ('INITIAL','RENEWAL') AS "Filing Status",
        "Permit Type"::ENUM ('AL','DM','EQ','EW','FO','NB','PL','SG') AS "Permit Type",
        regexp_repl

In [62]:
%%sql 
SELECT
        "BOROUGH"::ENUM ('BRONX','BROOKLYN','MANHATTAN','QUEENS','STATEN I;iay;iaSLAND','STATEN ISLAND') AS "BOROUGH",
        regexp_replace("Job doc. #", '^0+', '')::ENUM ('1','2','3','4','5','6','7','8','9','10','11','12') AS "Job doc. #",
        "Job Type"::ENUM ('A1','A2','A3','DM','NB','SG') AS "Job Type",
        "Self_Cert"::ENUM ('J','N','R','X','Y') AS "Self_Cert",
        regexp_replace("Bldg Type", '^0+', '')::ENUM ('1','2') AS "Bldg Type",
        "Residential"::ENUM ('YES') AS "Residential",
        "Special District 2"::ENUM ('BPRK','GCP2','GW','HILI','IBZ','JAM','POPS') AS "Special District 2",
        "Work Type"::ENUM ('BL','CC','EQ','FA','FB','FP','FS','MH','NB','OT','PL','SD','SP') AS "Work Type",
        "Permit Status"::ENUM ('IN PROCESS','ISSUED','RE-ISSUED','REVOKED') AS "Permit Status",
        "Filing Status"::ENUM ('INITIAL','RENEWAL') AS "Filing Status",
        "Permit Type"::ENUM ('AL','DM','EQ','EW','FO','NB','PL','SG') AS "Permit Type",
        regexp_replace("Permit Sequence #", '^0+', '')::ENUM ('1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20','21','22','23','24','25','26','27','28','29','30','31','32','33','34','35','36') AS "Permit Sequence #",
        "Permit Subtype"::ENUM ('BL','CH','EA','FA','FB','FN','FP','FS','MH','OT','SC','SD','SF','SH','SP') AS "Permit Subtype",
        "Oil Gas"::ENUM ('GAS','OIL') AS "Oil Gas",
        "Site Fill"::ENUM ('NONE','NOT APPLICABLE','OFF-SITE','ON-SITE','USE UNDER 300 CU.YD') AS "Site Fill",
        "Permittee's License Type"::ENUM ('5S','DM','FS','GC','HI','MP','N','NW','OB','OW','PE','RA','SI','T@') AS "Permittees License Type",
        "Act as Superintendent"::ENUM ('A','N','Y') AS "Act as Superintendent",
        regexp_replace("Owner's Business Type", '[^a-zA-Z0-9-/;: ]', '', 'g')::ENUM ('2022-05-09 00:00:00','CONDO/CO-OP','CORPORATION','DCAS','DOE','HHC','HPD','INDIVIDUAL','NY STATE','NYC AGENCY','NYCHA','NYCHA/HHC','OTHER','OTHER GOVT AGENCY','PARTNERSHIP') AS "Owners Business Type",
        "Non-Profit"::ENUM ('8','N','Y','Â—') AS "Non-Profit",
        "Owner’s House State"::ENUM ('AK','AZ','CA','CO','CT','DC','DE','FL','GA','IA','IL','IN','KS','KY','LA','MA','MD','ME','MI','MN','MO','NC','ND','NE','NH','NJ','NM','NV','NY','OH','OK','OR','PA','PR','RI','SC','SD','TN','TX','UT','VA','VT','WA') AS "Owners House State",
        regexp_replace("COUNCIL_DISTRICT", '^0+', '')::ENUM ('1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20','21','22','23','24','25','26','27','28','29','30','31','32','33','34','35','36','37','38','39','40','41','42','43','44','45','46','47','48','49','50','51') AS "COUNCIL_DISTRICT",
FROM read_csv('/Users/me/Downloads/DOB_Permit_Issuance_20240419.csv', 
            types={
                'Bldg Type': 'VARCHAR',
                'Residential': 'VARCHAR',
                'Permit Sequence #': 'VARCHAR',
                'COUNCIL_DISTRICT': 'VARCHAR',
                'Job Start Date': 'DATE', 
                'Filing Date': 'DATE', 
                'Issuance Date': 'DATE',
                'Expiration Date': 'DATE',
                },
            dateformat='%m/%d/%Y',
            ignore_errors=true)
LIMIT 10

Unnamed: 0,BOROUGH,Job doc. #,Job Type,Self_Cert,Bldg Type,Residential,Special District 2,Work Type,Permit Status,Filing Status,...,Permit Sequence #,Permit Subtype,Oil Gas,Site Fill,Permittees License Type,Act as Superintendent,Owners Business Type,Non-Profit,Owners House State,COUNCIL_DISTRICT
0,MANHATTAN,1,A3,Y,2,,,EQ,ISSUED,INITIAL,...,1,OT,,,GC,,CORPORATION,N,,2
1,STATEN ISLAND,1,A2,Y,1,YES,,OT,ISSUED,INITIAL,...,1,OT,,NOT APPLICABLE,GC,,INDIVIDUAL,N,,51
2,STATEN I;iay;iaSLAND,1,A2,Y,1,YES,,OT,ISSUED,INITIAL,...,1,OT,,NOT APPLICABLE,GC,,INDIVIDUAL,N,,49
3,STATEN ISLAND,1,A2,Y,1,YES,,OT,ISSUED,INITIAL,...,1,OT,,NOT APPLICABLE,GC,,INDIVIDUAL,N,,51
4,STATEN ISLAND,1,A2,Y,1,YES,,OT,ISSUED,INITIAL,...,1,OT,,NOT APPLICABLE,GC,,INDIVIDUAL,N,,50
5,BROOKLYN,1,NB,N,2,YES,,PL,ISSUED,RENEWAL,...,4,,,ON-SITE,MP,,PARTNERSHIP,N,,33
6,BROOKLYN,1,A1,N,1,YES,,,ISSUED,RENEWAL,...,4,,,USE UNDER 300 CU.YD,GC,,INDIVIDUAL,N,,44
7,STATEN ISLAND,1,A2,Y,1,YES,,OT,ISSUED,INITIAL,...,1,OT,,NOT APPLICABLE,GC,,INDIVIDUAL,N,,51
8,BROOKLYN,1,A3,Y,2,YES,,EQ,ISSUED,RENEWAL,...,2,OT,,,GC,,CORPORATION,N,,41
9,BRONX,1,A1,N,1,YES,,,ISSUED,RENEWAL,...,7,,,NOT APPLICABLE,GC,,INDIVIDUAL,N,,13
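
Stripping zero padding with a regular expression is easy to get subtly wrong. DuckDB's `regexp_replace` replaces only the first match unless the `'g'` option is given, so an unanchored pattern `'0'` turns `50` into `5`, while the anchored pattern `'^0+'` touches only leading zeros. A minimal sketch of the difference using Python's `re` module (the helper names are hypothetical):

```python
import re


def strip_first_zero(value):
    """Mimic replacing only the first '0' found anywhere in the string."""
    return re.sub("0", "", value, count=1)


def strip_leading_zeros(value):
    """Remove zeros only at the start of the string."""
    return re.sub(r"^0+", "", value)


# Compare both approaches on zero-padded and non-padded values.
for raw in ["01", "10", "50", "00853"]:
    print(raw, strip_first_zero(raw), strip_leading_zeros(raw))
```

The two helpers agree on `01` but disagree on `10`, `50`, and `00853`, which is why the council district in row 4 is a useful spot check against the earlier `SELECT *` output.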


In [65]:
%%sql

COPY(
    SELECT
        "BOROUGH"::ENUM ('BRONX','BROOKLYN','MANHATTAN','QUEENS','STATEN I;iay;iaSLAND','STATEN ISLAND') AS "BOROUGH",
        regexp_replace("Job doc. #", '^0+', '')::ENUM ('1','2','3','4','5','6','7','8','9','10','11','12') AS "Job doc. #",
        "Job Type"::ENUM ('A1','A2','A3','DM','NB','SG') AS "Job Type",
        "Self_Cert"::ENUM ('J','N','R','X','Y') AS "Self_Cert",
        regexp_replace("Bldg Type", '^0+', '')::ENUM ('1','2') AS "Bldg Type",
        "Residential"::ENUM ('YES') AS "Residential",
        "Special District 2"::ENUM ('BPRK','GCP2','GW','HILI','IBZ','JAM','POPS') AS "Special District 2",
        "Work Type"::ENUM ('BL','CC','EQ','FA','FB','FP','FS','MH','NB','OT','PL','SD','SP') AS "Work Type",
        "Permit Status"::ENUM ('IN PROCESS','ISSUED','RE-ISSUED','REVOKED') AS "Permit Status",
        "Filing Status"::ENUM ('INITIAL','RENEWAL') AS "Filing Status",
        "Permit Type"::ENUM ('AL','DM','EQ','EW','FO','NB','PL','SG') AS "Permit Type",
        regexp_replace("Permit Sequence #", '^0+', '')::ENUM ('1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20','21','22','23','24','25','26','27','28','29','30','31','32','33','34','35','36') AS "Permit Sequence #",
        "Permit Subtype"::ENUM ('BL','CH','EA','FA','FB','FN','FP','FS','MH','OT','SC','SD','SF','SH','SP') AS "Permit Subtype",
        "Oil Gas"::ENUM ('GAS','OIL') AS "Oil Gas",
        "Site Fill"::ENUM ('NONE','NOT APPLICABLE','OFF-SITE','ON-SITE','USE UNDER 300 CU.YD') AS "Site Fill",
        "Permittee's License Type"::ENUM ('5S','DM','FS','GC','HI','MP','N','NW','OB','OW','PE','RA','SI','T@') AS "Permittees License Type",
        "Act as Superintendent"::ENUM ('A','N','Y') AS "Act as Superintendent",
        regexp_replace("Owner's Business Type", '[^a-zA-Z0-9-/;: ]', '', 'g')::ENUM ('2022-05-09 00:00:00','CONDO/CO-OP','CORPORATION','DCAS','DOE','HHC','HPD','INDIVIDUAL','NY STATE','NYC AGENCY','NYCHA','NYCHA/HHC','OTHER','OTHER GOVT AGENCY','PARTNERSHIP') AS "Owners Business Type",
        "Non-Profit"::ENUM ('8','N','Y','Â—') AS "Non-Profit",
        "Owner’s House State"::ENUM ('AK','AZ','CA','CO','CT','DC','DE','FL','GA','IA','IL','IN','KS','KY','LA','MA','MD','ME','MI','MN','MO','NC','ND','NE','NH','NJ','NM','NV','NY','OH','OK','OR','PA','PR','RI','SC','SD','TN','TX','UT','VA','VT','WA') AS "Owners House State",
        regexp_replace("COUNCIL_DISTRICT", '^0+', '')::ENUM ('1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20','21','22','23','24','25','26','27','28','29','30','31','32','33','34','35','36','37','38','39','40','41','42','43','44','45','46','47','48','49','50','51') AS "COUNCIL_DISTRICT",
    FROM read_csv('~/Downloads/DOB_Permit_Issuance_20240419.csv', 
                types={
                    'Bldg Type': 'VARCHAR',
                    'Residential': 'VARCHAR',
                    'Permit Sequence #': 'VARCHAR',
                    'COUNCIL_DISTRICT': 'VARCHAR',
                    'Job Start Date': 'DATE', 
                    'Filing Date': 'DATE', 
                    'Issuance Date': 'DATE',
                    'Expiration Date': 'DATE',
                    },
                dateformat='%m/%d/%Y',
                ignore_errors=true)
) TO '~/data/cityofnewyork.us/DOB_Permit_Issuance_20240419.parquet' (FORMAT 'parquet', COMPRESSION 'zstd')

Unnamed: 0,Success


In [76]:
%%sql 
DESCRIBE SELECT * FROM '~/data/dob_permit_issuance.parquet'

Unnamed: 0,Success


## Now add the rest of the columns

In [80]:
from typing import List


def clean(value, strings_to_remove: List[str]):
    """Strip each of `strings_to_remove` from string values; pass other types through."""
    if isinstance(value, str):
        for string in strings_to_remove:
            value = value.replace(string, "")
    return value


# `df` and `data_types` (column name -> Python type) are assumed to be defined in earlier cells.
select_list = ""
strings_to_remove = ["’", "'"]
for column in df.columns:
    unique_values = df[column].dropna().unique()
    if len(unique_values) < 100:
        # Low-cardinality column: emit an ENUM cast listing every observed value.
        type_cls = data_types[column]
        clean_values = [clean(value, strings_to_remove) for value in unique_values]
        sorted_values = sorted(type_cls(value) for value in clean_values)
        enum_args = ",".join(f"'{value}'" for value in sorted_values)
        clean_column = clean(column, strings_to_remove)
        if type_cls == int:
            # Intended to drop the leading zero of COBOL-style two-digit integers ('01' -> '1').
            # Caution: the pattern is unanchored and has no 'g' flag, so it removes the first '0'
            # anywhere in the value ('50' -> '5'); '^0+' would strip leading zeros only.
            select_list += f"""
        regexp_replace("{column}", '0', '')::ENUM ({enum_args}) AS "{clean_column}","""
        elif any(
            isinstance(value, str) and any(char in strings_to_remove for char in value)
            for value in unique_values
        ):
            # Keep only letters, digits, spaces, hyphens, slashes, semicolons, and colons
            regexp = f"""regexp_replace("{column}", '[^a-zA-Z0-9-/;: ]', '', 'g')"""
            select_list += f"""
        {regexp}::ENUM ({enum_args}) AS "{clean_column}","""
        else:
            select_list += f"""
        "{column}"::ENUM ({enum_args}) AS "{clean_column}","""
    else:
        # High-cardinality column: pass through unchanged.
        select_list += f"""
        "{column}" AS "{column}", """
print(select_list)


        "BOROUGH"::ENUM ('BRONX','BROOKLYN','MANHATTAN','QUEENS','STATEN I;iay;iaSLAND','STATEN ISLAND') AS "BOROUGH",
        "Bin #" AS "Bin #", 
        "House #" AS "House #", 
        "Street Name" AS "Street Name", 
        "Job #" AS "Job #", 
        regexp_replace("Job doc. #", '0', '')::ENUM ('1','2','3','4','5','6','7','8','9','10','11','12') AS "Job doc. #",
        "Job Type"::ENUM ('A1','A2','A3','DM','NB','SG') AS "Job Type",
        "Self_Cert"::ENUM ('J','N','R','X','Y') AS "Self_Cert",
        "Block" AS "Block", 
        "Lot" AS "Lot", 
        "Community Board" AS "Community Board", 
        "Zip Code" AS "Zip Code", 
        regexp_replace("Bldg Type", '0', '')::ENUM ('1','2') AS "Bldg Type",
        "Residential"::ENUM ('YES') AS "Residential",
        "Special District 1" AS "Special District 1", 
        "Special District 2"::ENUM ('BPRK','GCP2','GW','HILI','IBZ','JAM','POPS') AS "Special District 2",
        "Work Type"::ENUM ('BL','CC','EQ','FA','FB','FP','F

In [81]:
%%sql 
SELECT
        "BOROUGH"::ENUM ('BRONX','BROOKLYN','MANHATTAN','QUEENS','STATEN I;iay;iaSLAND','STATEN ISLAND') AS "BOROUGH",
        "Bin #" AS "Bin #", 
        "House #" AS "House #", 
        "Street Name" AS "Street Name", 
        "Job #" AS "Job #", 
        regexp_replace("Job doc. #", '0', '')::ENUM ('1','2','3','4','5','6','7','8','9','10','11','12') AS "Job doc. #",
        "Job Type"::ENUM ('A1','A2','A3','DM','NB','SG') AS "Job Type",
        "Self_Cert"::ENUM ('J','N','R','X','Y') AS "Self_Cert",
        "Block" AS "Block", 
        "Lot" AS "Lot", 
        "Community Board" AS "Community Board", 
        "Zip Code" AS "Zip Code", 
        regexp_replace("Bldg Type", '0', '')::ENUM ('1','2') AS "Bldg Type",
        "Residential"::ENUM ('YES') AS "Residential",
        "Special District 1" AS "Special District 1", 
        "Special District 2"::ENUM ('BPRK','GCP2','GW','HILI','IBZ','JAM','POPS') AS "Special District 2",
        "Work Type"::ENUM ('BL','CC','EQ','FA','FB','FP','FS','MH','NB','OT','PL','SD','SP') AS "Work Type",
        "Permit Status"::ENUM ('IN PROCESS','ISSUED','RE-ISSUED','REVOKED') AS "Permit Status",
        "Filing Status"::ENUM ('INITIAL','RENEWAL') AS "Filing Status",
        "Permit Type"::ENUM ('AL','DM','EQ','EW','FO','NB','PL','SG') AS "Permit Type",
        regexp_replace("Permit Sequence #", '0', '')::ENUM ('1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20','21','22','23','24','25','26','27','28','29','30','31','32','33','34','35','36') AS "Permit Sequence #",
        "Permit Subtype"::ENUM ('BL','CH','EA','FA','FB','FN','FP','FS','MH','OT','SC','SD','SF','SH','SP') AS "Permit Subtype",
        "Oil Gas"::ENUM ('GAS','OIL') AS "Oil Gas",
        "Site Fill"::ENUM ('NONE','NOT APPLICABLE','OFF-SITE','ON-SITE','USE UNDER 300 CU.YD') AS "Site Fill",
        "Filing Date" AS "Filing Date", 
        "Issuance Date" AS "Issuance Date", 
        "Expiration Date" AS "Expiration Date", 
        "Job Start Date" AS "Job Start Date", 
        "Permittee's First Name" AS "Permittee's First Name", 
        "Permittee's Last Name" AS "Permittee's Last Name", 
        "Permittee's Business Name" AS "Permittee's Business Name", 
        "Permittee's Phone #" AS "Permittee's Phone #", 
        "Permittee's License Type"::ENUM ('5S','DM','FS','GC','HI','MP','N','NW','OB','OW','PE','RA','SI','T@') AS "Permittees License Type",
        "Permittee's License #" AS "Permittee's License #", 
        "Act as Superintendent"::ENUM ('A','N','Y') AS "Act as Superintendent",
        "Permittee's Other Title" AS "Permittee's Other Title", 
        "HIC License" AS "HIC License", 
        "Site Safety Mgr's First Name" AS "Site Safety Mgr's First Name", 
        "Site Safety Mgr's Last Name" AS "Site Safety Mgr's Last Name", 
        "Site Safety Mgr Business Name" AS "Site Safety Mgr Business Name", 
        "Superintendent First & Last Name" AS "Superintendent First & Last Name", 
        "Superintendent Business Name" AS "Superintendent Business Name", 
        regexp_replace("Owner's Business Type", '[^a-zA-Z0-9-/;: ]', '', 'g')::ENUM ('2022-05-09 00:00:00','CONDO/CO-OP','CORPORATION','DCAS','DOE','HHC','HPD','INDIVIDUAL','NY STATE','NYC AGENCY','NYCHA','NYCHA/HHC','OTHER','OTHER GOVT AGENCY','PARTNERSHIP') AS "Owners Business Type",
        "Non-Profit"::ENUM ('8','N','Y','Â—') AS "Non-Profit",
        "Owner's Business Name" AS "Owner's Business Name", 
        "Owner's First Name" AS "Owner's First Name", 
        "Owner's Last Name" AS "Owner's Last Name", 
        "Owner's House #" AS "Owner's House #", 
        "Owner's House Street Name" AS "Owner's House Street Name", 
        "Owner’s House City" AS "Owner’s House City", 
        "Owner’s House State"::ENUM ('AK','AZ','CA','CO','CT','DC','DE','FL','GA','IA','IL','IN','KS','KY','LA','MA','MD','ME','MI','MN','MO','NC','ND','NE','NH','NJ','NM','NV','NY','OH','OK','OR','PA','PR','RI','SC','SD','TN','TX','UT','VA','VT','WA') AS "Owners House State",
        "Owner’s House Zip Code" AS "Owner’s House Zip Code", 
        "Owner's Phone #" AS "Owner's Phone #", 
        "DOBRunDate" AS "DOBRunDate", 
        "PERMIT_SI_NO" AS "PERMIT_SI_NO", 
        "LATITUDE" AS "LATITUDE", 
        "LONGITUDE" AS "LONGITUDE", 
        regexp_replace("COUNCIL_DISTRICT", '0', '')::ENUM ('1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20','21','22','23','24','25','26','27','28','29','30','31','32','33','34','35','36','37','38','39','40','41','42','43','44','45','46','47','48','49','50','51') AS "COUNCIL_DISTRICT",
        "CENSUS_TRACT" AS "CENSUS_TRACT", 
        "NTA_NAME" AS "NTA_NAME", 
FROM read_csv('/Users/me/Downloads/DOB_Permit_Issuance_20240419.csv', 
            types={
                'Bldg Type': 'VARCHAR',
                'Residential': 'VARCHAR',
                'Permit Sequence #': 'VARCHAR',
                'COUNCIL_DISTRICT': 'VARCHAR',
                'Job Start Date': 'DATE', 
                'Filing Date': 'DATE', 
                'Issuance Date': 'DATE',
                'Expiration Date': 'DATE',
                },
            dateformat='%m/%d/%Y',
            ignore_errors=true)
LIMIT 10

Unnamed: 0,BOROUGH,Bin #,House #,Street Name,Job #,Job doc. #,Job Type,Self_Cert,Block,Lot,...,Owners House State,Owner’s House Zip Code,Owner's Phone #,DOBRunDate,PERMIT_SI_NO,LATITUDE,LONGITUDE,COUNCIL_DISTRICT,CENSUS_TRACT,NTA_NAME
0,MANHATTAN,1088749,1,MADISON AVE,141008987,1,A3,Y,853,2,...,,,2125942700,05/11/2022 00:00:00,3905851,40.740909,-73.987947,2,56,Hudson Yards-Chelsea-Flatiron-Union Square
1,STATEN ISLAND,5076937,87,BOYLAN STREET,540218539,1,A2,Y,5687,66,...,,,9174201655,05/11/2022 00:00:00,3905852,40.563654,-74.179584,51,17008,Arden Heights
2,STATEN I;iay;iaSLAND,5001506,217,LAFAYETTE AVENUE,540218575,1,A2,Y,64,22,...,,,7188125847,05/11/2022 00:00:00,3905853,40.639633,-74.094169,49,81,West New Brighton-New Brighton-St. George
3,STATEN ISLAND,5067021,170,OAKDALE STREET,540218600,1,A2,Y,5260,1,...,,,3478575846,05/11/2022 00:00:00,3905854,40.544597,-74.157153,51,15601,Great Kills
4,STATEN ISLAND,5058036,273,10 STREET,540218628,1,A2,Y,4242,45,...,,,7186195891,05/11/2022 00:00:00,3905855,40.566798,-74.119726,5,134,New Dorp-Midland Beach
5,BROOKLYN,3006577,101,DOUGLASS STREET,321004603,1,NB,N,409,48,...,,,7187079550,05/11/2022 00:00:00,3905856,40.683034,-73.991282,33,69,DUMBO-Vinegar Hill-Downtown Brooklyn-Boerum Hill
6,BROOKLYN,3171950,1272,EAST 10TH ST,321980335,1,A1,N,6545,36,...,,,9292949460,05/11/2022 00:00:00,3905857,40.618282,-73.964974,44,454,Ocean Parkway South
7,STATEN ISLAND,5067915,203,THORNYCROFT AVE,540218584,1,A2,Y,5289,54,...,,,7189094572,05/11/2022 00:00:00,3905858,40.537587,-74.154776,51,15601,Great Kills
8,BROOKLYN,3039744,91,RALPH AVENUE,340811081,1,A3,Y,1485,8,...,,,7184751836,05/11/2022 00:00:00,3905859,40.686912,-73.923501,41,375,Stuyvesant Heights
9,BRONX,2075402,3421,COUNTRY CLUB ROAD,220516118,1,A1,N,5409,424,...,,,9145303057,05/11/2022 00:00:00,3905860,40.839666,-73.815547,13,27402,Pelham Bay-Country Club-City Island
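
The `regexp_replace("…", '0', '')` casts in the query above deserve a second look. DuckDB's `regexp_replace` replaces only the first match unless the `'g'` option is passed, and the pattern `'0'` is not anchored, so it also removes the zero from values such as `'10'` or `'50'`. In the output above, one Staten Island row shows `COUNCIL_DISTRICT` 5 while the other Staten Island rows show 49 and 51, which may be an instance of this. The sketch below emulates the behavior with Python's `re.sub(..., count=1)` (an approximation for illustration, not DuckDB itself) and compares it with an anchored `'^0+'` pattern. Both helper names are hypothetical.

```python
import re


def strip_first_zero(value: str) -> str:
    """Emulate regexp_replace(value, '0', ''): only the first '0' is removed."""
    return re.sub("0", "", value, count=1)


def strip_leading_zeros(value: str) -> str:
    """Emulate regexp_replace(value, '^0+', ''): only leading zeros are removed."""
    return re.sub(r"^0+", "", value)


# Compare both patterns on zero-padded and non-padded two-digit values.
for raw in ["01", "05", "10", "50", "51"]:
    print(raw, strip_first_zero(raw), strip_leading_zeros(raw))
```

If the anchored pattern is adopted, the `ENUM` value lists can stay as they are, since they already contain `'10'`, `'20'`, and so on.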


## Now link the data to the shape files for tax lots

In [83]:
%%sql 

SELECT * FROM '~/data/dob_permit_issuance.parquet';

Unnamed: 0,BOROUGH,Bin #,House #,Street Name,Job #,Job doc. #,Job Type,Self_Cert,Block,Lot,...,Owners House State,Owner’s House Zip Code,Owner's Phone #,DOBRunDate,PERMIT_SI_NO,LATITUDE,LONGITUDE,COUNCIL_DISTRICT,CENSUS_TRACT,NTA_NAME
0,MANHATTAN,1088749,1,MADISON AVE,141008987,1,A3,Y,00853,00002,...,,,2125942700,05/11/2022 00:00:00,3905851,40.740909,-73.987947,2,56.0,Hudson Yards-Chelsea-Flatiron-Union Square
1,STATEN ISLAND,5076937,87,BOYLAN STREET,540218539,1,A2,Y,05687,00066,...,,,9174201655,05/11/2022 00:00:00,3905852,40.563654,-74.179584,51,17008.0,Arden Heights
2,STATEN I;iay;iaSLAND,5001506,217,LAFAYETTE AVENUE,540218575,1,A2,Y,00064,00022,...,,,7188125847,05/11/2022 00:00:00,3905853,40.639633,-74.094169,49,81.0,West New Brighton-New Brighton-St. George
3,STATEN ISLAND,5067021,170,OAKDALE STREET,540218600,1,A2,Y,05260,00001,...,,,3478575846,05/11/2022 00:00:00,3905854,40.544597,-74.157153,51,15601.0,Great Kills
4,STATEN ISLAND,5058036,273,10 STREET,540218628,1,A2,Y,04242,00045,...,,,7186195891,05/11/2022 00:00:00,3905855,40.566798,-74.119726,5,134.0,New Dorp-Midland Beach
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3956124,BROOKLYN,3081923,2096,STRAUSS ST.,321455849,1,A1,N,03569,00040,...,,,7189643877,04/18/2024 00:00:00,3974607,40.662655,-73.914773,42,898.0,Brownsville
3956125,BRONX,2096464,2050,SEDGWICK AVENUE,220682740,2,A2,N,03222,00062,...,,,6466642624,04/18/2024 00:00:00,3974750,40.858855,-73.915100,14,249.0,University Heights-Morris Heights
3956126,BRONX,2100243,2060,SEDGWICK AVENUE,220682759,1,A2,N,03222,00062,...,,,6466642624,04/18/2024 00:00:00,3974751,40.858905,-73.914981,14,249.0,University Heights-Morris Heights
3956127,BRONX,2100243,2060,SEDGWICK AVENUE,220682759,2,A2,N,03222,00062,...,,,6466642624,04/18/2024 00:00:00,3974752,40.858905,-73.914981,14,249.0,University Heights-Morris Heights


In [77]:
%%sql 

SELECT * FROM "/Users/me/data/nyc_mappluto_24v1_fgdb/MapPLUTO24v1_wgs84.parquet";

Unnamed: 0,Borough,Block,Lot,CD,BCT2020,BCTCB2020,CT2010,CB2010,SchoolDist,Council,...,FIRM07_FLAG,PFIRM15_FLAG,Version,DCPEdited,Latitude,Longitude,Notes,Shape_Leng,Shape_Area,geometry
0,MN,1,10,101.0,1000500,10005000003,5,1000,02,1.0,...,1,1,24v1,,40.688766,-74.018682,,0.0,7.478663e+06,"[1, 6, 0, 0, 0, 1, 0, 0, 0, 1, 3, 0, 0, 0, 2, ..."
1,MN,1,201,101.0,1000100,10001001000,1,1000,02,1.0,...,,1,24v1,,40.698188,-74.041329,,0.0,1.148538e+06,"[1, 6, 0, 0, 0, 1, 0, 0, 0, 1, 3, 0, 0, 0, 1, ..."
2,MN,2,1,101.0,1000900,10009001022,9,1025,02,1.0,...,1,1,24v1,t,40.700369,-74.012911,,0.0,1.008251e+05,"[1, 6, 0, 0, 0, 1, 0, 0, 0, 1, 3, 0, 0, 0, 1, ..."
3,MN,3,10,101.0,1031900,10319001006,319,1003,02,1.0,...,1,1,24v1,,40.700918,-74.014444,,0.0,4.216466e+04,"[1, 6, 0, 0, 0, 1, 0, 0, 0, 1, 3, 0, 0, 0, 1, ..."
4,MN,10,23,101.0,1000900,10009001023,9,1026,02,1.0,...,1,1,24v1,,40.703808,-74.012757,,0.0,2.051657e+04,"[1, 6, 0, 0, 0, 1, 0, 0, 0, 1, 3, 0, 0, 0, 1, ..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
856808,SI,8050,37,503.0,5024800,50248001014,248,1016,31,51.0,...,,,24v1,,40.507801,-74.251802,,0.0,9.977568e+03,"[1, 6, 0, 0, 0, 1, 0, 0, 0, 1, 3, 0, 0, 0, 1, ..."
856809,SI,8050,50,503.0,5024800,50248001014,248,1016,31,51.0,...,,,24v1,,40.508139,-74.251512,,0.0,5.566174e+03,"[1, 6, 0, 0, 0, 1, 0, 0, 0, 1, 3, 0, 0, 0, 1, ..."
856810,SI,8050,55,503.0,5024800,50248001014,248,1016,31,51.0,...,,,24v1,,40.508318,-74.251307,,0.0,1.183376e+02,"[1, 6, 0, 0, 0, 1, 0, 0, 0, 1, 3, 0, 0, 0, 1, ..."
856811,SI,8050,56,503.0,5024800,50248001014,248,1016,31,51.0,...,,,24v1,,40.508362,-74.251257,,0.0,4.872917e+03,"[1, 6, 0, 0, 0, 1, 0, 0, 0, 1, 3, 0, 0, 0, 1, ..."
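
The two tables above do not share key formats: the permit data spells out `BOROUGH` and stores `Block` and `Lot` as zero-padded strings (`'00853'`, `'00002'`), while MapPLUTO uses two-letter `Borough` codes and what appear to be integer `Block` and `Lot` values. The notebook does not show the join itself, so the following is only a minimal sketch of one way to normalize the permit side into a comparable (borough, block, lot) key. The function name and the mapping dictionary are hypothetical.

```python
from typing import Optional, Tuple

# Assumed mapping from DOB borough names to MapPLUTO's two-letter borough codes.
BOROUGH_CODES = {
    "MANHATTAN": "MN",
    "BRONX": "BX",
    "BROOKLYN": "BK",
    "QUEENS": "QN",
    "STATEN ISLAND": "SI",
}


def tax_lot_key(borough: str, block: str, lot: str) -> Optional[Tuple[str, int, int]]:
    """Return (borough_code, block, lot), or None when the row cannot be keyed."""
    code = BOROUGH_CODES.get(borough.strip().upper())
    if code is None or not block.strip().isdigit() or not lot.strip().isdigit():
        return None
    # int() drops the zero padding, so '00853' compares equal to 853.
    return code, int(block), int(lot)


print(tax_lot_key("MANHATTAN", "00853", "00002"))
print(tax_lot_key("STATEN I;iay;iaSLAND", "00064", "00022"))
```

Dirty values such as `'STATEN I;iay;iaSLAND'` return `None` here, so they can be reviewed separately instead of being silently dropped or mis-joined.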


In [None]:
!gpq convert "/Users/me/data/nyc_mappluto_24v1_fgdb/MapPLUTO24v1_wgs84.parquet" --to=geojson > ~/data/nyc_mappluto_24v1_fgdb/dataMapPLUTO24v1_wgs84.geojson