# Search Text Flood
Author: Mark Bauer

Goal: Search for the text pattern *flood* in every column of every dataset with fewer than 1M rows on NYC Open Data. We'll retrieve datasets with more than 1M rows with an alternative method.

# Importing Libraries

In [1]:
# importing libraries
import pandas as pd
import numpy as np
from sodapy import Socrata
import requests
import time

Documentation for installing watermark: https://github.com/rasbt/watermark.

In [2]:
# performed for reproducibility
%reload_ext watermark
%watermark -t -d -v -p pandas,sodapy

Python implementation: CPython
Python version       : 3.11.0
IPython version      : 8.6.0

pandas: 1.5.1
sodapy: 2.2.0



# Socrata API
I used the Socrata API to retrieve metadata for datasets hosted on NYC Open Data. Documentation can be found here: https://dev.socrata.com/. Additionally, I used sodapy, the Python client for the Socrata API, to query the metadata.

We'll use this API to gather all the datasets on NYC Open Data.

### Note:  
`WARNING:root:Requests made without an app_token will be subject to strict throttling limits.`

Read more from the SODA documentation here: https://dev.socrata.com/docs/app-tokens.html
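
One way to avoid the throttling warning is to register an app token and pass it to the client. The sketch below assumes the token is stored in an environment variable named `SOCRATA_APP_TOKEN`; that name and the helper `socrata_client_kwargs` are hypothetical and not part of the original notebook. When the variable is not set, the helper falls back to `None`, which is the anonymous access used here.

```python
import os


def socrata_client_kwargs(domain, env_var="SOCRATA_APP_TOKEN", timeout=60):
    """Build keyword arguments for sodapy.Socrata (hypothetical helper).

    The app token is read from an environment variable so that it never
    has to be written into the notebook itself.
    """
    return {
        "domain": domain,
        "app_token": os.environ.get(env_var),  # None when the variable is unset
        "timeout": timeout,
    }


kwargs = socrata_client_kwargs("data.cityofnewyork.us")

# with sodapy installed, the client could then be created as:
# from sodapy import Socrata
# client = Socrata(**kwargs)
```

Keeping the token out of the notebook also keeps it out of version control.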

In [3]:
# source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# initialize Socrata object to fetch data
client = Socrata(
    domain=socrata_domain,
    app_token=None,
    timeout=10000
)

print(client)



<sodapy.socrata.Socrata object at 0x10e86fb50>


In [4]:
# Discovery API
url = 'https://api.us.socrata.com/api/catalog/v1?search_context=data.cityofnewyork.us&limit=50000'

# fetch the JSON data from the web and fail early on an HTTP error
response = requests.get(url)
response.raise_for_status()

# parse the JSON response
data_dict = response.json() 

# preview keys    
data_dict.keys() 
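
The request above asks for everything in one call with a large `limit`. If the catalog ever outgrows a single response, the Discovery API's `limit` and `offset` parameters can be used to page through it. The following is a minimal sketch, not the method used in this notebook: the helper names `get_json` and `fetch_catalog` and the page size of 1000 are assumptions, and it uses only the standard library so it can be run without extra packages.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CATALOG_URL = "https://api.us.socrata.com/api/catalog/v1"


def get_json(url):
    """Fetch a URL and parse the body as JSON (urlopen raises on HTTP errors)."""
    with urlopen(url, timeout=60) as resp:
        return json.load(resp)


def fetch_catalog(domain, page_size=1000, fetch=get_json):
    """Collect all catalog results for a domain, one page at a time."""
    results = []
    offset = 0
    while True:
        query = urlencode(
            {"search_context": domain, "limit": page_size, "offset": offset}
        )
        page = fetch(f"{CATALOG_URL}?{query}").get("results", [])
        results.extend(page)
        # a short (or empty) page means there is nothing left to fetch
        if len(page) < page_size:
            break
        offset += page_size
    return results


# example call (performs network requests):
# results = fetch_catalog("data.cityofnewyork.us")
```

The `fetch` argument is injectable so the paging logic can be exercised with a stub instead of the live API.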



In [5]:
# convert into df
df = pd.DataFrame.from_records(data_dict['results'])

# sanity check
print(df.shape)
df.head()

(3240, 8)


Unnamed: 0,resource,classification,metadata,permalink,link,owner,creator,preview_image_url
0,"{'name': 'For Hire Vehicles (FHV) - Active', '...","{'categories': [], 'tags': [], 'domain_categor...",{'domain': 'data.cityofnewyork.us'},https://data.cityofnewyork.us/d/8wbx-tsch,https://data.cityofnewyork.us/Transportation/F...,"{'id': '5fuc-pqz2', 'user_type': 'interactive'...","{'id': '5fuc-pqz2', 'user_type': 'interactive'...",
1,"{'name': 'Civil Service List (Active)', 'id': ...","{'categories': [], 'tags': [], 'domain_categor...",{'domain': 'data.cityofnewyork.us'},https://data.cityofnewyork.us/d/vx8i-nprf,https://data.cityofnewyork.us/City-Government/...,"{'id': '5fuc-pqz2', 'user_type': 'interactive'...","{'id': '5fuc-pqz2', 'user_type': 'interactive'...",
2,"{'name': 'DOB Job Application Filings', 'id': ...","{'categories': [], 'tags': [], 'domain_categor...",{'domain': 'data.cityofnewyork.us'},https://data.cityofnewyork.us/d/ic3t-wcy2,https://data.cityofnewyork.us/Housing-Developm...,"{'id': '5fuc-pqz2', 'user_type': 'interactive'...","{'id': '5fuc-pqz2', 'user_type': 'interactive'...",
3,"{'name': 'TLC New Driver Application Status', ...","{'categories': [], 'tags': [], 'domain_categor...",{'domain': 'data.cityofnewyork.us'},https://data.cityofnewyork.us/d/dpec-ucu7,https://data.cityofnewyork.us/Transportation/T...,"{'id': '5fuc-pqz2', 'user_type': 'interactive'...","{'id': '5fuc-pqz2', 'user_type': 'interactive'...",
4,{'name': 'For Hire Vehicles (FHV) - Active Dri...,"{'categories': [], 'tags': [], 'domain_categor...",{'domain': 'data.cityofnewyork.us'},https://data.cityofnewyork.us/d/xjfq-wh2d,https://data.cityofnewyork.us/Transportation/F...,"{'id': '5fuc-pqz2', 'user_type': 'interactive'...","{'id': '5fuc-pqz2', 'user_type': 'interactive'...",


In [6]:
# convert resource key to a dataframe
df = pd.DataFrame.from_records(df['resource'])

# sanity check
print(df.shape)
df.head()

(3240, 27)


Unnamed: 0,name,id,resource_name,parent_fxf,description,attribution,attribution_link,contact_email,type,updatedAt,...,columns_description,columns_format,download_count,provenance,lens_view_type,lens_display_type,locked,blob_mime_type,hide_from_data_json,publication_date
0,For Hire Vehicles (FHV) - Active,8wbx-tsch,,[],"<b>PLEASE NOTE:</b> This dataset, which includ...",Taxi and Limousine Commission (TLC),,,dataset,2024-12-29T20:05:32.000Z,...,"[Last Time Updated, Certification Date, Base N...","[{'displayStyle': 'plain', 'align': 'left'}, {...",535601,official,tabular,table,False,,False,2021-04-05T13:20:47.000Z
1,Civil Service List (Active),vx8i-nprf,,[],A Civil Service List consists of all candidate...,Department of Citywide Administrative Services...,,,dataset,2024-12-27T14:09:28.000Z,...,[A candidate’s last name as it appears on thei...,"[{'displayStyle': 'plain', 'align': 'left'}, {...",68870,official,tabular,table,False,,False,2024-01-12T16:15:05.000Z
2,DOB Job Application Filings,ic3t-wcy2,,[],This dataset contains all job applications sub...,Department of Buildings (DOB),,,dataset,2024-12-29T21:06:44.000Z,...,"[Proposed Dwelling Units, Document Number, Num...","[{'align': 'right'}, {'align': 'right'}, {'ali...",59754,official,tabular,table,False,,False,2020-06-22T18:23:35.000Z
3,TLC New Driver Application Status,dpec-ucu7,,[],THIS DATASET IS UPDATED SEVERAL TIMES PER DAY....,Taxi and Limousine Commission (TLC),,,dataset,2024-12-29T23:06:31.000Z,...,[This is the number linked to your application...,"[{'precisionStyle': 'standard', 'noCommas': 't...",39667,official,tabular,table,False,,False,2019-12-17T18:44:57.000Z
4,For Hire Vehicles (FHV) - Active Drivers,xjfq-wh2d,,[],"<b>PLEASE NOTE:</b> This dataset, which includ...",Taxi and Limousine Commission (TLC),,,dataset,2024-12-29T20:07:08.000Z,...,"[Driver Name\n\n, Last Time Updated, Type of L...","[{'displayStyle': 'plain', 'align': 'left'}, {...",421843,official,tabular,table,False,,False,2024-01-11T19:58:17.000Z
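
The nested `resource` dictionaries could also be flattened in a single step. The sketch below uses `pd.json_normalize` on a small made-up sample that mimics the shape of `data_dict['results']`; the sample values are hypothetical. Nested keys come out as dotted column names such as `resource.id`.

```python
import pandas as pd

# hypothetical records shaped like the entries of data_dict['results']
sample_results = [
    {
        "resource": {"name": "Example A", "id": "aaaa-0001", "type": "dataset"},
        "permalink": "https://example.org/d/aaaa-0001",
    },
    {
        "resource": {"name": "Example B", "id": "bbbb-0002", "type": "map"},
        "permalink": "https://example.org/d/bbbb-0002",
    },
]

# flatten nested dictionaries into dotted column names
flat = pd.json_normalize(sample_results)

print(sorted(flat.columns))
```

The two-step approach used in this notebook keeps the shorter, undotted column names, which is convenient for the filtering that follows.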


In [7]:
# preview columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3240 entries, 0 to 3239
Data columns (total 27 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   name                 3240 non-null   object
 1   id                   3240 non-null   object
 2   resource_name        0 non-null      object
 3   parent_fxf           3240 non-null   object
 4   description          3240 non-null   object
 5   attribution          3169 non-null   object
 6   attribution_link     477 non-null    object
 7   contact_email        0 non-null      object
 8   type                 3240 non-null   object
 9   updatedAt            3240 non-null   object
 10  createdAt            3240 non-null   object
 11  metadata_updated_at  3240 non-null   object
 12  data_updated_at      3063 non-null   object
 13  page_views           3240 non-null   object
 14  columns_name         3240 non-null   object
 15  columns_field_name   3240 non-null   object
 16  column

In [8]:
# review asset types; we only want datasets
df['type'].value_counts()

dataset          2559
map               339
file              167
href              147
filter             24
story               2
chart               1
visualization       1
Name: type, dtype: int64

In [9]:
# we only want datasets
df = (
    df
    .loc[df['type'] == 'dataset']
    .reset_index(drop=True)
)

# sanity check
print(df.shape)
df.head()

(2559, 27)


Unnamed: 0,name,id,resource_name,parent_fxf,description,attribution,attribution_link,contact_email,type,updatedAt,...,columns_description,columns_format,download_count,provenance,lens_view_type,lens_display_type,locked,blob_mime_type,hide_from_data_json,publication_date
0,For Hire Vehicles (FHV) - Active,8wbx-tsch,,[],"<b>PLEASE NOTE:</b> This dataset, which includ...",Taxi and Limousine Commission (TLC),,,dataset,2024-12-29T20:05:32.000Z,...,"[Last Time Updated, Certification Date, Base N...","[{'displayStyle': 'plain', 'align': 'left'}, {...",535601,official,tabular,table,False,,False,2021-04-05T13:20:47.000Z
1,Civil Service List (Active),vx8i-nprf,,[],A Civil Service List consists of all candidate...,Department of Citywide Administrative Services...,,,dataset,2024-12-27T14:09:28.000Z,...,[A candidate’s last name as it appears on thei...,"[{'displayStyle': 'plain', 'align': 'left'}, {...",68870,official,tabular,table,False,,False,2024-01-12T16:15:05.000Z
2,DOB Job Application Filings,ic3t-wcy2,,[],This dataset contains all job applications sub...,Department of Buildings (DOB),,,dataset,2024-12-29T21:06:44.000Z,...,"[Proposed Dwelling Units, Document Number, Num...","[{'align': 'right'}, {'align': 'right'}, {'ali...",59754,official,tabular,table,False,,False,2020-06-22T18:23:35.000Z
3,TLC New Driver Application Status,dpec-ucu7,,[],THIS DATASET IS UPDATED SEVERAL TIMES PER DAY....,Taxi and Limousine Commission (TLC),,,dataset,2024-12-29T23:06:31.000Z,...,[This is the number linked to your application...,"[{'precisionStyle': 'standard', 'noCommas': 't...",39667,official,tabular,table,False,,False,2019-12-17T18:44:57.000Z
4,For Hire Vehicles (FHV) - Active Drivers,xjfq-wh2d,,[],"<b>PLEASE NOTE:</b> This dataset, which includ...",Taxi and Limousine Commission (TLC),,,dataset,2024-12-29T20:07:08.000Z,...,"[Driver Name\n\n, Last Time Updated, Type of L...","[{'displayStyle': 'plain', 'align': 'left'}, {...",421843,official,tabular,table,False,,False,2024-01-11T19:58:17.000Z


In [10]:
# sort df by download count in ascending order
df = (
    df
    .sort_values(by='download_count')
    .reset_index(drop=True)
)

# sanity check
print(df.shape)
df.head()

(2559, 27)


Unnamed: 0,name,id,resource_name,parent_fxf,description,attribution,attribution_link,contact_email,type,updatedAt,...,columns_description,columns_format,download_count,provenance,lens_view_type,lens_display_type,locked,blob_mime_type,hide_from_data_json,publication_date
0,Legal Defense Trust Expenditures,mhyv-6iza,,[],Pursuant to the City's Legal Defense Trusts La...,Conflicts of Interest Board (COIB),https://coib-ldt.cityofnewyork.us/s/,,dataset,2024-12-23T17:41:28.000Z,...,"[Amount of the expenditure, Date the expenses ...","[{'precisionStyle': 'currency', 'decimalSepara...",13,official,tabular,table,False,,False,2024-09-18T19:58:13.000Z
1,Legal Defense Trust Donations,jsiv-zh9r,,[],Pursuant to the City's Legal Defense Trusts La...,Conflicts of Interest Board (COIB),https://coib-ldt.cityofnewyork.us/s/,,dataset,2024-12-23T17:38:51.000Z,...,"[Donor's state, Donor's city, Value of the don...","[{}, {}, {'precision': '2', 'decimalSeparator'...",14,official,tabular,table,False,,False,2024-09-18T19:15:13.000Z
2,Legal Defense Trust Refunded Donations,t3pj-3dgu,,[],Pursuant to the City's Legal Defense Trusts La...,Conflicts of Interest Board (COIB),https://www.nyc.gov/site/coib/public-documents...,,dataset,2024-12-23T17:40:52.000Z,...,[The name of the legal defense trust refunding...,"[{}, {'precisionStyle': 'currency', 'decimalSe...",14,official,tabular,table,False,,False,2024-09-18T19:37:40.000Z
3,"Summer Sports Experience and ""Kids in Motion"" ...",4pta-f4ca,,[],The Kids in Motion (KIM) program provides free...,,,,dataset,2024-12-28T14:01:41.000Z,...,[Name of sport played for Summer Sports Experi...,"[{}, {}, {}, {}, {}, {}, {'decimalSeparator': ...",15,official,tabular,table,False,,False,2024-11-25T16:05:22.000Z
4,Aquatics Programming: 2021 to current,bzby-7sfr,,[],This dataset contains attendance and location ...,,,,dataset,2024-12-27T15:50:15.000Z,...,"[Location of the swim class, Session during wh...","[{}, {}, {}, {}, {}, {}, {'decimalSeparator': ...",16,official,tabular,table,False,,False,2024-11-26T19:38:05.000Z
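
The counts above appear to sort numerically, so no conversion was needed here. If a numeric column ever arrives as strings, however, `sort_values` orders it lexicographically. The sketch below illustrates the difference on made-up data and uses `pd.to_numeric` as a guard; the sample values are hypothetical.

```python
import pandas as pd

# hypothetical sample where the counts are stored as strings
sample = pd.DataFrame(
    {"id": ["a", "b", "c"], "download_count": ["100", "13", "9"]}
)

# string sort: compares character by character, so '100' < '13' < '9'
as_text = sample.sort_values(by="download_count").reset_index(drop=True)

# convert to numbers first; values that cannot be parsed become NaN
sample["download_count"] = pd.to_numeric(sample["download_count"], errors="coerce")
as_number = sample.sort_values(by="download_count").reset_index(drop=True)

print(as_text["id"].tolist())
print(as_number["id"].tolist())
```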


In [11]:
# keep only the dataset id column
df = df.loc[:, ['id']]

# sanity check
print(df.shape)
df.head()

(2559, 1)


Unnamed: 0,id
0,mhyv-6iza
1,jsiv-zh9r
2,t3pj-3dgu
3,4pta-f4ca
4,bzby-7sfr


In [12]:
# save as a CSV file
df.to_csv('../data/dataset-ids.csv', index=False)

In [13]:
# sanity check
%ls ../data/

columns-large-datasets.csv  dataset-ids.csv
columns.csv                 datasets.csv
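
The goal stated at the top separates datasets by a 1M-row threshold. One possible way to apply that threshold to the saved ids is a SoQL `count(*)` query per dataset. This is only a sketch under assumptions: the helpers `count_rows` and `split_by_size` are hypothetical, and `client` is expected to behave like the sodapy `Socrata` object created earlier, whose `get` method accepts a `select` keyword.

```python
def count_rows(client, dataset_id):
    """Return the number of rows in a dataset using a SoQL count query."""
    rows = client.get(dataset_id, select="count(*)")
    # the response is expected to be a one-row list holding a single value;
    # take that value without relying on the exact key name
    return int(next(iter(rows[0].values())))


def split_by_size(client, dataset_ids, threshold=1_000_000):
    """Split dataset ids into those below the threshold and the rest."""
    small, large = [], []
    for dataset_id in dataset_ids:
        target = small if count_rows(client, dataset_id) < threshold else large
        target.append(dataset_id)
    return small, large


# example call (performs one network request per dataset):
# small_ids, large_ids = split_by_size(client, df['id'])
```

Without an app token, a loop like this over a few thousand ids is likely to run into the throttling limits noted earlier, so pausing between requests may be necessary.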
