# Search Text Flood
Author: Mark Bauer

Goal: Search for the text pattern *flood* in every column of every NYC Open Data dataset with more than 1M rows. Because of the size of these datasets, we'll query them through the Socrata API.

# Importing Libraries

In [1]:
# importing libraries
import pandas as pd
import numpy as np
from sodapy import Socrata
import requests
import time

Documentation for installing watermark: https://github.com/rasbt/watermark.

In [2]:
# performed for reproducibility
%reload_ext watermark
%watermark -t -d -v -p pandas,sodapy

Python implementation: CPython
Python version       : 3.11.0
IPython version      : 8.6.0

pandas: 1.5.1
sodapy: 2.2.0



# Socrata API
I used the Socrata API to retrieve metadata for datasets hosted on NYC Open Data. Documentation can be found here: https://dev.socrata.com/. Additionally, I used sodapy, the Python client for the Socrata API, to query the metadata.

We'll use this API to gather all the datasets on NYC Open Data.

### Note:
Because no app token is passed to the client below, sodapy logs this warning:  
`WARNING:root:Requests made without an app_token will be subject to strict throttling limits.`

Read more from the SODA documentation here: https://dev.socrata.com/docs/app-tokens.html
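
One way to avoid the throttling is to register an app token and pass it to the client. The following is a minimal sketch, assuming the token is stored in an environment variable. The variable name `SOCRATA_APP_TOKEN` and the helper `load_app_token` are hypothetical and not part of this notebook.

```python
import os


def load_app_token(env_var="SOCRATA_APP_TOKEN"):
    """Return the token stored in env_var, or None if it is unset or blank."""
    token = os.environ.get(env_var, "").strip()
    return token or None


# None reproduces the anonymous (throttled) behavior used in this notebook
app_token = load_app_token()
print("token found" if app_token else "no token, requests will be throttled")
```

The result could be passed as the `app_token` argument of the `Socrata` client in place of `None`.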

In [3]:
ls

analysis.ipynb                          search-text-flood.ipynb
cover-photo.ipynb                       searchstring-flood.ipynb
search-text-flood-large-datasets.ipynb


In [7]:
# load the export log written by the other script, which processed datasets with
# fewer than 1M rows and logged the larger ones as skipped
datasets = pd.read_csv(
    "../logs/export-log.txt",
    on_bad_lines='skip',
    names=['timestamp', 'dataset', 'column'],
    low_memory=False
)

print(datasets.shape)
datasets.head()

(55047, 3)


Unnamed: 0,timestamp,dataset,column
0,2024-12-28 22:16:19,Processing mhyv-6iza,
1,2024-12-28 22:16:20,Processing mhyv-6iza,column Trust Name
2,2024-12-28 22:16:20,Processing mhyv-6iza,column Expenditure Name
3,2024-12-28 22:16:20,Processing mhyv-6iza,column Amount
4,2024-12-28 22:16:20,Processing mhyv-6iza,column Date of Incurrence


In [8]:
# keep only the log entries for datasets skipped because of too many rows
datasets = (
    datasets
    .loc[datasets['dataset'].fillna("").str.contains('too many rows')]
    .drop(columns=['column'])
    .reset_index(drop=True)
)

print(datasets.shape)
datasets.head()

(162, 2)


Unnamed: 0,timestamp,dataset
0,2024-12-28 22:19:05,Skipping u9wf-3gbt due to too many rows.
1,2024-12-28 22:25:47,Skipping c23c-uwsm due to too many rows.
2,2024-12-28 22:34:17,Skipping 5zhs-2jue due to too many rows.
3,2024-12-28 23:00:59,Skipping 6a2s-2t65 due to too many rows.
4,2024-12-28 23:04:24,Skipping fvp3-gcb2 due to too many rows.


In [9]:
# the dataset ID is the second space-separated token of the log message
datasets['dataset_id'] = datasets['dataset'].str.split(" ").str[1]

datasets

Unnamed: 0,timestamp,dataset,dataset_id
0,2024-12-28 22:19:05,Skipping u9wf-3gbt due to too many rows.,u9wf-3gbt
1,2024-12-28 22:25:47,Skipping c23c-uwsm due to too many rows.,c23c-uwsm
2,2024-12-28 22:34:17,Skipping 5zhs-2jue due to too many rows.,5zhs-2jue
3,2024-12-28 23:00:59,Skipping 6a2s-2t65 due to too many rows.,6a2s-2t65
4,2024-12-28 23:04:24,Skipping fvp3-gcb2 due to too many rows.,fvp3-gcb2
...,...,...,...
157,2024-12-29 08:26:16,Skipping ic3t-wcy2 due to too many rows.,ic3t-wcy2
158,2024-12-29 08:28:43,Skipping ipu4-2q9a due to too many rows.,ipu4-2q9a
159,2024-12-29 08:29:48,Skipping rhe8-mgbb due to too many rows.,rhe8-mgbb
160,2024-12-29 08:30:28,Skipping h9gi-nx95 due to too many rows.,h9gi-nx95
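
Splitting on spaces works because every skip message has the same shape. A slightly stricter alternative, sketched here with the standard library, matches the whole message and returns `None` for anything unexpected. The helper name `extract_dataset_id` and the four-by-four ID pattern are assumptions based on the log lines shown above.

```python
import re

# Dataset IDs in the log look like "u9wf-3gbt": two groups of four lowercase
# letters or digits joined by a hyphen (pattern assumed from the log above)
SKIP_PATTERN = re.compile(
    r"^Skipping ([a-z0-9]{4}-[a-z0-9]{4}) due to too many rows\.$"
)


def extract_dataset_id(message):
    """Return the dataset ID from a 'Skipping ...' log message, else None."""
    match = SKIP_PATTERN.match(message.strip())
    return match.group(1) if match else None


print(extract_dataset_id("Skipping u9wf-3gbt due to too many rows."))
print(extract_dataset_id("Processing mhyv-6iza"))
```

It could be applied to the log column with `datasets['dataset'].map(extract_dataset_id)`.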


In [10]:
# source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# initialize Socrata object to fetch data
client = Socrata(
    domain=socrata_domain,
    app_token=None,
    timeout=10000
)

print(client)



<sodapy.socrata.Socrata object at 0x1683eb150>


In [11]:
# Discovery API
url = 'https://api.us.socrata.com/api/catalog/v1?search_context=data.cityofnewyork.us&limit=50000'

# fetch the JSON data from the web and fail early on an HTTP error
response = requests.get(url)
response.raise_for_status()

# parse the JSON response
data_dict = response.json()

# preview keys
data_dict.keys()
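
The request above asks for up to 50,000 results in a single call. If the catalog had to be read in smaller pages instead, the loop could look like the sketch below. It assumes the Discovery API accepts an `offset` parameter alongside `limit` and returns each page under the `results` key. The helper names and the injected `fetch_page` callable are hypothetical, and the demo runs against a fake in-memory catalog so nothing is requested over the network.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.us.socrata.com/api/catalog/v1"


def build_catalog_url(domain, limit, offset=0):
    """Build a Discovery API URL for one page of a domain's catalog."""
    params = {"search_context": domain, "limit": limit, "offset": offset}
    return f"{BASE_URL}?{urlencode(params)}"


def collect_results(fetch_page, domain, page_size):
    """Call fetch_page(url) until a page comes back shorter than page_size."""
    results, offset = [], 0
    while True:
        page = fetch_page(build_catalog_url(domain, page_size, offset))["results"]
        results.extend(page)
        if len(page) < page_size:
            return results
        offset += page_size


# fake catalog of 5 records standing in for the real HTTP call
FAKE_CATALOG = [{"resource": {"id": f"id-{i}"}} for i in range(5)]


def fake_fetch(url):
    """Serve a slice of FAKE_CATALOG based on the URL's offset and limit."""
    query = dict(pair.split("=") for pair in url.split("?", 1)[1].split("&"))
    start, size = int(query["offset"]), int(query["limit"])
    return {"results": FAKE_CATALOG[start:start + size]}


records = collect_results(fake_fetch, "data.cityofnewyork.us", page_size=2)
print(len(records))
```

With `requests`, `fetch_page` could be as simple as `lambda url: requests.get(url).json()`.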



In [12]:
# convert into df
df = pd.DataFrame.from_records(data_dict['results'])

# sanity check
print(df.shape)
df.head()

(3240, 8)


Unnamed: 0,resource,classification,metadata,permalink,link,owner,creator,preview_image_url
0,"{'name': 'For Hire Vehicles (FHV) - Active', '...","{'categories': [], 'tags': [], 'domain_categor...",{'domain': 'data.cityofnewyork.us'},https://data.cityofnewyork.us/d/8wbx-tsch,https://data.cityofnewyork.us/Transportation/F...,"{'id': '5fuc-pqz2', 'user_type': 'interactive'...","{'id': '5fuc-pqz2', 'user_type': 'interactive'...",
1,"{'name': 'Civil Service List (Active)', 'id': ...","{'categories': [], 'tags': [], 'domain_categor...",{'domain': 'data.cityofnewyork.us'},https://data.cityofnewyork.us/d/vx8i-nprf,https://data.cityofnewyork.us/City-Government/...,"{'id': '5fuc-pqz2', 'user_type': 'interactive'...","{'id': '5fuc-pqz2', 'user_type': 'interactive'...",
2,"{'name': 'DOB Job Application Filings', 'id': ...","{'categories': [], 'tags': [], 'domain_categor...",{'domain': 'data.cityofnewyork.us'},https://data.cityofnewyork.us/d/ic3t-wcy2,https://data.cityofnewyork.us/Housing-Developm...,"{'id': '5fuc-pqz2', 'user_type': 'interactive'...","{'id': '5fuc-pqz2', 'user_type': 'interactive'...",
3,"{'name': 'TLC New Driver Application Status', ...","{'categories': [], 'tags': [], 'domain_categor...",{'domain': 'data.cityofnewyork.us'},https://data.cityofnewyork.us/d/dpec-ucu7,https://data.cityofnewyork.us/Transportation/T...,"{'id': '5fuc-pqz2', 'user_type': 'interactive'...","{'id': '5fuc-pqz2', 'user_type': 'interactive'...",
4,{'name': 'For Hire Vehicles (FHV) - Active Dri...,"{'categories': [], 'tags': [], 'domain_categor...",{'domain': 'data.cityofnewyork.us'},https://data.cityofnewyork.us/d/xjfq-wh2d,https://data.cityofnewyork.us/Transportation/F...,"{'id': '5fuc-pqz2', 'user_type': 'interactive'...","{'id': '5fuc-pqz2', 'user_type': 'interactive'...",


In [13]:
# convert resource key to a dataframe
df = pd.DataFrame.from_records(df['resource'])

# sanity check
print(df.shape)
df.head()

(3240, 27)


Unnamed: 0,name,id,resource_name,parent_fxf,description,attribution,attribution_link,contact_email,type,updatedAt,...,columns_description,columns_format,download_count,provenance,lens_view_type,lens_display_type,locked,blob_mime_type,hide_from_data_json,publication_date
0,For Hire Vehicles (FHV) - Active,8wbx-tsch,,[],"<b>PLEASE NOTE:</b> This dataset, which includ...",Taxi and Limousine Commission (TLC),,,dataset,2024-12-29T20:05:32.000Z,...,"[Last Time Updated, Certification Date, Base N...","[{'displayStyle': 'plain', 'align': 'left'}, {...",535601,official,tabular,table,False,,False,2021-04-05T13:20:47.000Z
1,Civil Service List (Active),vx8i-nprf,,[],A Civil Service List consists of all candidate...,Department of Citywide Administrative Services...,,,dataset,2024-12-27T14:09:28.000Z,...,[A candidate’s last name as it appears on thei...,"[{'displayStyle': 'plain', 'align': 'left'}, {...",68870,official,tabular,table,False,,False,2024-01-12T16:15:05.000Z
2,DOB Job Application Filings,ic3t-wcy2,,[],This dataset contains all job applications sub...,Department of Buildings (DOB),,,dataset,2024-12-29T21:06:44.000Z,...,"[Proposed Dwelling Units, Document Number, Num...","[{'align': 'right'}, {'align': 'right'}, {'ali...",59754,official,tabular,table,False,,False,2020-06-22T18:23:35.000Z
3,TLC New Driver Application Status,dpec-ucu7,,[],THIS DATASET IS UPDATED SEVERAL TIMES PER DAY....,Taxi and Limousine Commission (TLC),,,dataset,2024-12-29T23:06:31.000Z,...,[This is the number linked to your application...,"[{'precisionStyle': 'standard', 'noCommas': 't...",39667,official,tabular,table,False,,False,2019-12-17T18:44:57.000Z
4,For Hire Vehicles (FHV) - Active Drivers,xjfq-wh2d,,[],"<b>PLEASE NOTE:</b> This dataset, which includ...",Taxi and Limousine Commission (TLC),,,dataset,2024-12-29T20:07:08.000Z,...,"[Driver Name\n\n, Last Time Updated, Type of L...","[{'displayStyle': 'plain', 'align': 'left'}, {...",421843,official,tabular,table,False,,False,2024-01-11T19:58:17.000Z


In [14]:
# keep only the catalog entries for the skipped (large) datasets
dataset_ids = datasets['dataset_id'].to_list()

df = df.loc[df['id'].isin(dataset_ids)].reset_index(drop=True)

# sanity check
print(df.shape)
df.head()

(162, 27)


Unnamed: 0,name,id,resource_name,parent_fxf,description,attribution,attribution_link,contact_email,type,updatedAt,...,columns_description,columns_format,download_count,provenance,lens_view_type,lens_display_type,locked,blob_mime_type,hide_from_data_json,publication_date
0,DOB Job Application Filings,ic3t-wcy2,,[],This dataset contains all job applications sub...,Department of Buildings (DOB),,,dataset,2024-12-29T21:06:44.000Z,...,"[Proposed Dwelling Units, Document Number, Num...","[{'align': 'right'}, {'align': 'right'}, {'ali...",59754,official,tabular,table,False,,False,2020-06-22T18:23:35.000Z
1,311 Service Requests from 2010 to Present,erm2-nwe9,,[],<b>NOTE:</b> The 311 dataset is currently show...,311,,,dataset,2024-12-29T02:33:06.000Z,...,"[, Indicates how the SR was submitted to 311. ...","[{}, {'displayStyle': 'plain', 'align': 'left'...",446207,official,tabular,table,False,,False,2023-12-01T06:51:46.000Z
2,Civil Service List Certification,a9md-ynri,,[],A List Certification includes the names of eli...,Department of Citywide Administrative Services...,,,dataset,2024-12-27T14:12:33.000Z,...,"[The name of an appointing Agency.\n, An eligi...","[{'displayStyle': 'plain', 'align': 'left'}, {...",22427,official,tabular,table,False,,False,2021-04-22T15:38:30.000Z
3,Citywide Payroll Data (Fiscal Year),k397-673e,,[],Data is collected because of public interest i...,Office of Payroll Administration (OPA),,,dataset,2024-10-30T15:00:01.000Z,...,"[Payroll Number, Number of regular hours emplo...","[{}, {}, {'precisionStyle': 'currency', 'curre...",36880,official,tabular,table,False,,False,2023-11-28T17:52:17.000Z
4,Motor Vehicle Collisions - Crashes,h9gi-nx95,,[],The Motor Vehicle Collisions crash table conta...,Police Department (NYPD),,,dataset,2024-12-27T23:54:06.000Z,...,"[, Street address if known, Factors contributi...","[{}, {'align': 'left'}, {'align': 'left'}, {'a...",207788,official,tabular,table,False,,False,2021-04-19T14:42:57.000Z


In [15]:
# preview columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 162 entries, 0 to 161
Data columns (total 27 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   name                 162 non-null    object
 1   id                   162 non-null    object
 2   resource_name        0 non-null      object
 3   parent_fxf           162 non-null    object
 4   description          162 non-null    object
 5   attribution          156 non-null    object
 6   attribution_link     43 non-null     object
 7   contact_email        0 non-null      object
 8   type                 162 non-null    object
 9   updatedAt            162 non-null    object
 10  createdAt            162 non-null    object
 11  metadata_updated_at  162 non-null    object
 12  data_updated_at      162 non-null    object
 13  page_views           162 non-null    object
 14  columns_name         162 non-null    object
 15  columns_field_name   162 non-null    object
 16  columns_

In [16]:
# review asset types; we only want datasets
df['type'].value_counts()

dataset    162
Name: type, dtype: int64

In [17]:
# we only want datasets
df = (
    df
    .loc[df['type'] == 'dataset']
    .reset_index(drop=True)
)

# sanity check
print(df.shape)
df.head()

(162, 27)


Unnamed: 0,name,id,resource_name,parent_fxf,description,attribution,attribution_link,contact_email,type,updatedAt,...,columns_description,columns_format,download_count,provenance,lens_view_type,lens_display_type,locked,blob_mime_type,hide_from_data_json,publication_date
0,DOB Job Application Filings,ic3t-wcy2,,[],This dataset contains all job applications sub...,Department of Buildings (DOB),,,dataset,2024-12-29T21:06:44.000Z,...,"[Proposed Dwelling Units, Document Number, Num...","[{'align': 'right'}, {'align': 'right'}, {'ali...",59754,official,tabular,table,False,,False,2020-06-22T18:23:35.000Z
1,311 Service Requests from 2010 to Present,erm2-nwe9,,[],<b>NOTE:</b> The 311 dataset is currently show...,311,,,dataset,2024-12-29T02:33:06.000Z,...,"[, Indicates how the SR was submitted to 311. ...","[{}, {'displayStyle': 'plain', 'align': 'left'...",446207,official,tabular,table,False,,False,2023-12-01T06:51:46.000Z
2,Civil Service List Certification,a9md-ynri,,[],A List Certification includes the names of eli...,Department of Citywide Administrative Services...,,,dataset,2024-12-27T14:12:33.000Z,...,"[The name of an appointing Agency.\n, An eligi...","[{'displayStyle': 'plain', 'align': 'left'}, {...",22427,official,tabular,table,False,,False,2021-04-22T15:38:30.000Z
3,Citywide Payroll Data (Fiscal Year),k397-673e,,[],Data is collected because of public interest i...,Office of Payroll Administration (OPA),,,dataset,2024-10-30T15:00:01.000Z,...,"[Payroll Number, Number of regular hours emplo...","[{}, {}, {'precisionStyle': 'currency', 'curre...",36880,official,tabular,table,False,,False,2023-11-28T17:52:17.000Z
4,Motor Vehicle Collisions - Crashes,h9gi-nx95,,[],The Motor Vehicle Collisions crash table conta...,Police Department (NYPD),,,dataset,2024-12-27T23:54:06.000Z,...,"[, Street address if known, Factors contributi...","[{}, {'align': 'left'}, {'align': 'left'}, {'a...",207788,official,tabular,table,False,,False,2021-04-19T14:42:57.000Z


In [18]:
# sanity check view type
df['lens_view_type'].value_counts()

tabular    162
Name: lens_view_type, dtype: int64

In [20]:
# sort df by download count in ascending order
df = (
    df
    .sort_values(by='download_count')
    .reset_index(drop=True)
)

# sanity check
print(df.shape)
df.head()

(162, 27)


Unnamed: 0,name,id,resource_name,parent_fxf,description,attribution,attribution_link,contact_email,type,updatedAt,...,columns_description,columns_format,download_count,provenance,lens_view_type,lens_display_type,locked,blob_mime_type,hide_from_data_json,publication_date
0,Building Footprints (P Layer),u9wf-3gbt,,[],Shapefile of footprint outlines of buildings i...,Office of Technology and Innovation (OTI),,,dataset,2024-12-24T18:15:01.000Z,...,[This column was automatically created in orde...,"[{}, {}, {'noCommas': 'true'}, {}, {}, {}, {},...",42,official,tabular,table,False,,False,2024-11-19T16:45:45.000Z
1,SweepNYC Street Cleaning,c23c-uwsm,,[],This dataset contains NYC Street Centerline (C...,Department of Sanitation (DSNY),,,dataset,2024-12-29T15:50:54.000Z,...,[Date and time when street segment was last as...,"[{}, {}, {}]",86,official,tabular,table,False,,False,2024-09-27T19:30:02.000Z
2,Building Footprints,5zhs-2jue,,[],Shapefile of footprint outlines of buildings i...,Office of Technology and Innovation (OTI),,,dataset,2024-12-24T18:24:48.000Z,...,"[Geometry column used for mapping, Type of Bui...","[{}, {'noCommas': 'true'}, {}, {'noCommas': 't...",121,official,tabular,table,False,,False,2024-11-13T16:16:55.000Z
3,2012 Yellow Taxi Trip Data,kerk-3eby,,[],These records are generated from the trip reco...,Taxi and Limousine Commission (TLC),https://www.nyc.gov/site/tlc/about/tlc-trip-re...,,dataset,2023-12-14T20:44:28.000Z,...,[Miscellaneous extras and surcharges. Currentl...,"[{'precisionStyle': 'standard', 'noCommas': 'f...",260,official,tabular,table,False,,False,2015-12-08T21:22:21.000Z
4,Local Law 84 Monthly Data,fvp3-gcb2,,[],Monthly whole building electricity and natural...,Department of Buildings (DOB),,,dataset,2024-10-01T19:34:42.000Z,...,[Energy Use by Type is a summary of the annual...,"[{}, {'noCommas': 'true'}, {}, {}, {}, {}, {},...",261,official,tabular,table,False,,False,2024-10-01T15:22:10.000Z


In [21]:
# explode the parallel lists of field names and datatypes into one row per column
field_names_df = df.loc[:, ['id', 'columns_field_name']].explode('columns_field_name')
field_types_df = df.loc[:, ['id', 'columns_datatype']].explode('columns_datatype')

df = (
    pd
    .concat([field_names_df, field_types_df.drop(columns=['id'])], axis=1)
    .reset_index(drop=True)
)

# sanity check
print(df.shape)
df.head()

(4292, 3)


Unnamed: 0,id,columns_field_name,columns_datatype
0,u9wf-3gbt,:@computed_region_f5dn_yrer,Number
1,u9wf-3gbt,the_geom,Point
2,u9wf-3gbt,feat_code,Number
3,u9wf-3gbt,groundelev,Number
4,u9wf-3gbt,base_bbl,Text
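
The two `explode` calls rely on each dataset's `columns_field_name` and `columns_datatype` lists having the same length and order. Since pandas 1.3, `DataFrame.explode` also accepts a list of columns and explodes them together, which avoids the `concat`. The following is a minimal sketch on toy data; the IDs and column values are made up for illustration.

```python
import pandas as pd

# toy stand-in for the catalog metadata: one row per dataset,
# with parallel lists describing its columns
catalog = pd.DataFrame({
    "id": ["aaaa-0001", "bbbb-0002"],
    "columns_field_name": [["the_geom", "name"], ["amount"]],
    "columns_datatype": [["Point", "Text"], ["Number"]],
})

# explode both list columns in lockstep: one output row per dataset column;
# pandas raises a ValueError if the lists in a row differ in length
columns_long = catalog.explode(
    ["columns_field_name", "columns_datatype"],
    ignore_index=True,
)

print(columns_long)
```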


In [22]:
# keep only columns with the Text datatype
df = (
    df
    .loc[df['columns_datatype'] == 'Text']
    .reset_index(drop=True)
)

# sanity check
print(df.shape)
df.head()

(2324, 3)


Unnamed: 0,id,columns_field_name,columns_datatype
0,u9wf-3gbt,base_bbl,Text
1,u9wf-3gbt,geomsource,Text
2,u9wf-3gbt,mpluto_bbl,Text
3,u9wf-3gbt,name,Text
4,u9wf-3gbt,lststatype,Text


In [23]:
# number of unique datasets with at least one Text column
df['id'].nunique()

161
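
The count is 161 rather than 162, so one dataset appears to have dropped out at the Text filter. Such datasets could be identified by comparing the IDs before and after the filter, as sketched below. The sketch uses toy data because the unfiltered frame is overwritten above, and the names `columns_long` and `text_only` are hypothetical.

```python
import pandas as pd

# toy long-format table: one row per (dataset, column)
columns_long = pd.DataFrame({
    "id": ["aaaa-0001", "aaaa-0001", "bbbb-0002", "cccc-0003"],
    "columns_field_name": ["the_geom", "name", "amount", "borough"],
    "columns_datatype": ["Point", "Text", "Number", "Text"],
})

text_only = columns_long.loc[columns_long["columns_datatype"] == "Text"]

# dataset IDs present before the filter but absent afterwards
without_text = sorted(set(columns_long["id"]) - set(text_only["id"]))

print(without_text)
```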

In [24]:
# save as a CSV file
df.to_csv('../data/dataset-ids-columns.csv', index=False)

In [25]:
# sanity check
%ls ../data/

columns-large-datasets.csv  dataset-ids.csv
columns.csv                 datasets.csv
dataset-ids-columns.csv
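
The saved CSV pairs each large dataset with its Text columns, which is the input a server-side search needs. As one possible next step (not shown in this notebook), each pair could be turned into a SoQL filter that matches *flood* case-insensitively. The sketch below only builds the query strings. The helper name `flood_where_clause` is hypothetical, and it assumes SoQL's `lower()` function and `like` operator with `%` wildcards.

```python
def flood_where_clause(field_name, pattern="flood"):
    """Build a SoQL where clause matching pattern anywhere in a text field."""
    # double any single quotes, as in SQL string literals (assumed to hold for SoQL)
    safe_pattern = pattern.lower().replace("'", "''")
    return f"lower({field_name}) like '%{safe_pattern}%'"


# (dataset ID, field name) pairs as they appear in dataset-ids-columns.csv
pairs = [("u9wf-3gbt", "name"), ("u9wf-3gbt", "geomsource")]

for dataset_id, field_name in pairs:
    print(dataset_id, flood_where_clause(field_name))
```

Each clause could then be passed as the `where` argument of sodapy's `client.get`, together with a small `limit`, so that only matching rows are transferred.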
