# Data Cleaning
## NYC Open *Big* Data Analysis
Author: Mark Bauer

Objective: Clean data to use for analysis.

In [1]:
# import libraries
import pandas as pd

In [2]:
# list files in directory
%ls

LICENSE              data/                figures/
README.md            data-cleaning.ipynb  log.txt
analysis.ipynb       data-export.ipynb


In [3]:
# log file
file = 'log.txt'

# manually specify column names
names = [
    'datetime_log',
    'id',
    'error_log',
    'count_rows'
]

# read log file into dataframe
df = pd.read_csv(file, names=names)

# preview data
print(f"shape of data: {df.shape}")
df.head()

shape of data: (2554, 4)


Unnamed: 0,datetime_log,id,error_log,count_rows
0,2024-08-09 13:56:42,fkec-mjr6,,182.0
1,2024-08-09 13:56:47,mzxg-pwib,,27673.0
2,2024-08-09 13:56:53,6r9j-qrwz,,91.0
3,2024-08-09 13:57:00,99xv-he3n,,188.0
4,2024-08-09 13:57:06,ufxk-pq9j,,39.0


In [4]:
# column information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2554 entries, 0 to 2553
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   datetime_log  2554 non-null   object 
 1   id            2554 non-null   object 
 2   error_log     1 non-null      object 
 3   count_rows    2553 non-null   float64
dtypes: float64(1), object(3)
memory usage: 79.9+ KB


There's one record with a non-null error log.

In [5]:
# summary statistics
df.describe().round(1)

Unnamed: 0,count_rows
count,2553.0
mean,2337140.8
std,18089982.8
min,0.0
25%,138.0
50%,1396.0
75%,12383.0
max,376404531.0


In [6]:
# is dataset id unique
df['id'].is_unique

True

In [7]:
# count nulls per column
df.isnull().sum()

datetime_log       0
id                 0
error_log       2553
count_rows         1
dtype: int64

Check row with null `count_rows`.

In [8]:
df.loc[df['count_rows'].isnull()]

Unnamed: 0,datetime_log,id,error_log,count_rows
174,2024-08-09 14:32:04,erdf-2akx,Request error for erdf-2akx: 408 Client Error:...,


In [9]:
# preview error log
df.loc[df['error_log'].notnull(), 'error_log'].values[0]

'Request error for erdf-2akx: 408 Client Error: Request Timeout for url: https://data.cityofnewyork.us/resource/erdf-2akx.json?$select=count(*)'

I reproduced this error with the URL above: the request timed out. Skip this dataset for now.
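
A timed-out request like this one could also be retried with a growing delay before it is skipped. The following is only a minimal sketch, not the method used to produce `log.txt`: `count_url` and `fetch_with_retry` are hypothetical helpers, `fetch` stands in for whatever HTTP call is actually used, and the retry count and delays are arbitrary assumptions. A real HTTP library raises its own timeout exception, which a wrapper would need to translate into `TimeoutError` for this sketch to apply.

```python
import time


def count_url(dataset_id, domain="data.cityofnewyork.us"):
    """Build a row-count URL in the same form as the one in the error log."""
    return f"https://{domain}/resource/{dataset_id}.json?$select=count(*)"


def fetch_with_retry(fetch, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url); on TimeoutError, wait and retry with a doubled delay.

    `fetch` is any callable supplied by the caller (hypothetical here).
    Returns the fetch result, or None if every attempt timed out.
    """
    delay = base_delay
    for attempt in range(retries):
        try:
            return fetch(url)
        except TimeoutError:
            if attempt < retries - 1:
                sleep(delay)
                delay *= 2
    return None


# Stand-in for a real HTTP call: times out twice, then succeeds.
calls = []


def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise TimeoutError("simulated timeout")
    return 123


# sleep is replaced with a no-op so the demonstration runs instantly
result = fetch_with_retry(flaky_fetch, count_url("erdf-2akx"), sleep=lambda s: None)
```

Passing `fetch` and `sleep` in as arguments keeps the retry logic separate from the network call, so it can be exercised without touching the API.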

In [10]:
# fill null count_rows and cast column as int
df['count_rows'] = df['count_rows'].fillna(0).astype(int)

df.head()

Unnamed: 0,datetime_log,id,error_log,count_rows
0,2024-08-09 13:56:42,fkec-mjr6,,182
1,2024-08-09 13:56:47,mzxg-pwib,,27673
2,2024-08-09 13:56:53,6r9j-qrwz,,91
3,2024-08-09 13:57:00,99xv-he3n,,188
4,2024-08-09 13:57:06,ufxk-pq9j,,39


In [11]:
df.loc[df['count_rows'] == 0].shape[0]

59

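This count includes erdf-2akx, the dataset whose request timed out, because its null `count_rows` was filled with 0 above. We will identify the datasets with zero rows later in this notebook.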
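
One way to keep a failed request distinguishable from a true zero is pandas' nullable `Int64` dtype, which stores integers but leaves missing values as `<NA>`. This is a sketch on made-up values, not the notebook's data, and it is an alternative to the `fillna(0)` approach used above rather than a description of it.

```python
import pandas as pd

# toy values: two real counts, one true zero, one failed request (missing)
counts = pd.Series([182.0, 0.0, None, 27673.0], name="count_rows")

# nullable integer dtype: integers where present, <NA> where missing
counts_nullable = counts.astype("Int64")

n_zero = int((counts_nullable == 0).sum())     # true zero-row datasets only
n_missing = int(counts_nullable.isna().sum())  # failed requests stay visible
print(n_zero, n_missing)
```

With this dtype, a filter such as `count_rows == 0` would no longer pick up the dataset that could not be counted.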

# Metadata API
Learn more about the Socrata Metadata API in the official docs: https://dev.socrata.com/docs/other/metadata#?route=overview.

In [12]:
# read dataset metadata to join information
path = 'https://data.cityofnewyork.us/api/views/metadata/v1'
metadata_df = pd.read_json(path)

print(metadata_df.shape)
metadata_df.head()

(3234, 21)


Unnamed: 0,id,name,attribution,attributionLink,category,createdAt,dataUpdatedAt,dataUri,description,domain,...,hideFromCatalog,hideFromDataJson,license,metadataUpdatedAt,provenance,updatedAt,webUri,approvals,customFields,tags
0,bzg2-2abf,Bicycle Parking (Map),Department of Transportation (DOT),,Transportation,2024-08-22T18:16:30+0000,2024-08-22T18:14:38+0000,https://data.cityofnewyork.us/resource/bzg2-2abf,The data contains locations of all bike parkin...,data.cityofnewyork.us,...,False,False,,2024-08-22T18:19:32+0000,OFFICIAL,2024-08-22T18:20:37+0000,https://data.cityofnewyork.us/d/bzg2-2abf,"[{'reviewedAt': 1724350837, 'reviewedAutomatic...","{'Update': {'Automation': 'No', 'Date Made Pub...",
1,592z-n7dk,Bicycle Parking,Department of Transportation (DOT),,Transportation,2024-08-22T16:43:20+0000,2024-08-22T18:14:38+0000,https://data.cityofnewyork.us/resource/592z-n7dk,The data contains locations of all bike parkin...,data.cityofnewyork.us,...,False,False,,2024-08-22T18:19:15+0000,OFFICIAL,2024-08-22T18:20:13+0000,https://data.cityofnewyork.us/d/592z-n7dk,"[{'reviewedAt': 1724350813, 'reviewedAutomatic...","{'Update': {'Automation': 'No', 'Date Made Pub...","[bicycle, racks, bicycle racks, cityracks, par..."
2,fkec-mjr6,"DOHMH Cryptosporidiosis by Race/Ethnicity, Age...",Department of Health and Mental Hygiene (DOHMH),,Health,2024-08-05T14:12:47+0000,2024-08-05T16:04:46+0000,https://data.cityofnewyork.us/resource/fkec-mjr6,"Cryptosporidiosis, number of cases and annual ...",data.cityofnewyork.us,...,False,False,,2024-08-05T16:33:29+0000,OFFICIAL,2024-08-05T16:34:05+0000,https://data.cityofnewyork.us/d/fkec-mjr6,"[{'reviewedAt': 1722875645, 'reviewedAutomatic...","{'Update': {'Automation': 'No', 'Date Made Pub...","[cryptosporidiosis, diagnosis year, race ethni..."
3,r6e8-2fwe,Location of Disposal Facilities and Sites Used...,NYC Department of Sanitation (DSNY),,City Government,2024-07-31T14:38:56+0000,2024-07-31T14:33:03+0000,https://data.cityofnewyork.us/resource/r6e8-2fwe,The location of the disposal facilities where ...,data.cityofnewyork.us,...,False,False,,2024-07-31T19:40:30+0000,OFFICIAL,2024-07-31T19:53:25+0000,https://data.cityofnewyork.us/d/r6e8-2fwe,"[{'reviewedAt': 1722455605, 'reviewedAutomatic...",{'Data Collection': {'Data Collection': 'Dispo...,
4,9e2b-mctv,New York City Bike Routes\t (Map),Department of Transportation (DOT),https://www.nyc.gov/html/dot/html/bicyclists/b...,,2024-07-24T16:08:52+0000,2024-07-24T16:06:04+0000,https://data.cityofnewyork.us/resource/9e2b-mctv,The New York City Department of Transportation...,data.cityofnewyork.us,...,False,False,,2024-08-06T21:34:51+0000,OFFICIAL,2024-08-06T21:34:51+0000,https://data.cityofnewyork.us/d/9e2b-mctv,"[{'reviewedAt': 1722300713, 'reviewedAutomatic...","{'Update': {'Automation': 'No', 'Update Freque...",


In [13]:
# metadata column info
metadata_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3234 entries, 0 to 3233
Data columns (total 21 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 3234 non-null   object 
 1   name               3234 non-null   object 
 2   attribution        3091 non-null   object 
 3   attributionLink    455 non-null    object 
 4   category           3124 non-null   object 
 5   createdAt          3234 non-null   object 
 6   dataUpdatedAt      3061 non-null   object 
 7   dataUri            3234 non-null   object 
 8   description        3160 non-null   object 
 9   domain             3234 non-null   object 
 10  externalId         0 non-null      float64
 11  hideFromCatalog    3234 non-null   bool   
 12  hideFromDataJson   3234 non-null   bool   
 13  license            79 non-null     object 
 14  metadataUpdatedAt  3234 non-null   object 
 15  provenance         3234 non-null   object 
 16  updatedAt          3234 

# Discovery API
The Discovery API is similar to the Metadata API but contains much more information on how the datasets are being used on NYC Open Data.

In [14]:
# discovery views
path = 'https://data.cityofnewyork.us/api/views/'
views_df = pd.read_json(path)

print(views_df.shape)
views_df.head()

(3234, 50)


Unnamed: 0,id,name,assetType,averageRating,category,createdAt,description,displayType,downloadCount,hideFromCatalog,...,blobFilename,blobFileSize,blobId,blobMimeType,ratings,childViews,indexUpdatedAt,iconUrl,previewImageId,disabledFeatureFlags
0,bzg2-2abf,Bicycle Parking (Map),map,0,Transportation,1724350590,The data contains locations of all bike parkin...,visualization_canvas_map,0,False,...,,,,,,,,,,
1,592z-n7dk,Bicycle Parking,dataset,0,Transportation,1724345000,The data contains locations of all bike parkin...,table,11,False,...,,,,,,,,,,
2,fkec-mjr6,"DOHMH Cryptosporidiosis by Race/Ethnicity, Age...",dataset,0,Health,1722867167,"Cryptosporidiosis, number of cases and annual ...",table,9,False,...,,,,,,,,,,
3,r6e8-2fwe,Location of Disposal Facilities and Sites Used...,map,0,City Government,1722436736,The location of the disposal facilities where ...,visualization_canvas_map,0,False,...,,,,,,,,,,
4,9e2b-mctv,New York City Bike Routes\t (Map),map,0,,1721837332,The New York City Department of Transportation...,visualization_canvas_map,0,False,...,,,,,,,,,,
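
The two endpoints encode `createdAt` differently: here it is an integer, while the Metadata API preview shows an ISO 8601 string. The standard-library sketch below takes the first row's values from the two previews (`bzg2-2abf`) and checks the assumption that the integer is Unix epoch seconds in UTC.

```python
from datetime import datetime, timezone

epoch_created = 1724350590                # createdAt for bzg2-2abf, views endpoint
iso_created = "2024-08-22T18:16:30+0000"  # createdAt for bzg2-2abf, Metadata API

# interpret the integer as seconds since 1970-01-01 UTC (assumption being tested)
from_epoch = datetime.fromtimestamp(epoch_created, tz=timezone.utc)

# parse the ISO string, including its +0000 offset
from_iso = datetime.strptime(iso_created, "%Y-%m-%dT%H:%M:%S%z")

print(from_epoch == from_iso)
```

For this row the two values describe the same instant, which supports reading the integer column as epoch seconds.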


In [15]:
# preview column info
views_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3234 entries, 0 to 3233
Data columns (total 50 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   id                        3234 non-null   object 
 1   name                      3234 non-null   object 
 2   assetType                 3234 non-null   object 
 3   averageRating             3234 non-null   int64  
 4   category                  3124 non-null   object 
 5   createdAt                 3234 non-null   int64  
 6   description               3160 non-null   object 
 7   displayType               3234 non-null   object 
 8   downloadCount             3234 non-null   int64  
 9   hideFromCatalog           3234 non-null   bool   
 10  hideFromDataJson          3234 non-null   bool   
 11  locked                    3234 non-null   bool   
 12  modifyingViewUid          201 non-null    object 
 13  newBackend                3234 non-null   bool   
 14  numberOf

In [16]:
# retrieve only selected columns
cols = [
    'id',
    'viewCount', 'downloadCount',
    'assetType', 'displayType'
]

views_df = views_df.loc[:, cols]

views_df.head()

Unnamed: 0,id,viewCount,downloadCount,assetType,displayType
0,bzg2-2abf,51,0,map,visualization_canvas_map
1,592z-n7dk,34,11,dataset,table
2,fkec-mjr6,99,9,dataset,table
3,r6e8-2fwe,70,0,map,visualization_canvas_map
4,9e2b-mctv,435,0,map,visualization_canvas_map


## Merge Metadata and Discovery APIs together

In [17]:
# merge metadata with the selected views columns
metadata_merged_df = metadata_df.merge(
    views_df,
    on='id',
    how='right'
)

print(metadata_merged_df.shape)
metadata_merged_df.head()

(3234, 25)


Unnamed: 0,id,name,attribution,attributionLink,category,createdAt,dataUpdatedAt,dataUri,description,domain,...,provenance,updatedAt,webUri,approvals,customFields,tags,viewCount,downloadCount,assetType,displayType
0,bzg2-2abf,Bicycle Parking (Map),Department of Transportation (DOT),,Transportation,2024-08-22T18:16:30+0000,2024-08-22T18:14:38+0000,https://data.cityofnewyork.us/resource/bzg2-2abf,The data contains locations of all bike parkin...,data.cityofnewyork.us,...,OFFICIAL,2024-08-22T18:20:37+0000,https://data.cityofnewyork.us/d/bzg2-2abf,"[{'reviewedAt': 1724350837, 'reviewedAutomatic...","{'Update': {'Automation': 'No', 'Date Made Pub...",,51,0,map,visualization_canvas_map
1,592z-n7dk,Bicycle Parking,Department of Transportation (DOT),,Transportation,2024-08-22T16:43:20+0000,2024-08-22T18:14:38+0000,https://data.cityofnewyork.us/resource/592z-n7dk,The data contains locations of all bike parkin...,data.cityofnewyork.us,...,OFFICIAL,2024-08-22T18:20:13+0000,https://data.cityofnewyork.us/d/592z-n7dk,"[{'reviewedAt': 1724350813, 'reviewedAutomatic...","{'Update': {'Automation': 'No', 'Date Made Pub...","[bicycle, racks, bicycle racks, cityracks, par...",34,11,dataset,table
2,fkec-mjr6,"DOHMH Cryptosporidiosis by Race/Ethnicity, Age...",Department of Health and Mental Hygiene (DOHMH),,Health,2024-08-05T14:12:47+0000,2024-08-05T16:04:46+0000,https://data.cityofnewyork.us/resource/fkec-mjr6,"Cryptosporidiosis, number of cases and annual ...",data.cityofnewyork.us,...,OFFICIAL,2024-08-05T16:34:05+0000,https://data.cityofnewyork.us/d/fkec-mjr6,"[{'reviewedAt': 1722875645, 'reviewedAutomatic...","{'Update': {'Automation': 'No', 'Date Made Pub...","[cryptosporidiosis, diagnosis year, race ethni...",99,9,dataset,table
3,r6e8-2fwe,Location of Disposal Facilities and Sites Used...,NYC Department of Sanitation (DSNY),,City Government,2024-07-31T14:38:56+0000,2024-07-31T14:33:03+0000,https://data.cityofnewyork.us/resource/r6e8-2fwe,The location of the disposal facilities where ...,data.cityofnewyork.us,...,OFFICIAL,2024-07-31T19:53:25+0000,https://data.cityofnewyork.us/d/r6e8-2fwe,"[{'reviewedAt': 1722455605, 'reviewedAutomatic...",{'Data Collection': {'Data Collection': 'Dispo...,,70,0,map,visualization_canvas_map
4,9e2b-mctv,New York City Bike Routes\t (Map),Department of Transportation (DOT),https://www.nyc.gov/html/dot/html/bicyclists/b...,,2024-07-24T16:08:52+0000,2024-07-24T16:06:04+0000,https://data.cityofnewyork.us/resource/9e2b-mctv,The New York City Department of Transportation...,data.cityofnewyork.us,...,OFFICIAL,2024-08-06T21:34:51+0000,https://data.cityofnewyork.us/d/9e2b-mctv,"[{'reviewedAt': 1722300713, 'reviewedAutomatic...","{'Update': {'Automation': 'No', 'Update Freque...",,435,0,map,visualization_canvas_map
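
Both input frames have 3,234 rows and so does the merged result, which suggests the join is one-to-one on `id`. pandas can enforce that assumption at merge time through the `validate` argument. The sketch below uses toy frames with made-up ids and values.

```python
import pandas as pd

left = pd.DataFrame({"id": ["a", "b", "c"], "name": ["A", "B", "C"]})
right = pd.DataFrame({"id": ["a", "b", "c"], "viewCount": [51, 34, 99]})

# validate="one_to_one" raises pandas.errors.MergeError if `id` repeats on either side
merged = left.merge(right, on="id", how="right", validate="one_to_one")

# a duplicated key on the right side is rejected instead of silently duplicating rows
right_dup = pd.concat([right, right.iloc[[0]]], ignore_index=True)
try:
    left.merge(right_dup, on="id", how="right", validate="one_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
```

A failed validation surfaces a duplicated `id` immediately, rather than as an unexpected row count later on.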


In [18]:
# preview column info
metadata_merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3234 entries, 0 to 3233
Data columns (total 25 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 3234 non-null   object 
 1   name               3234 non-null   object 
 2   attribution        3091 non-null   object 
 3   attributionLink    455 non-null    object 
 4   category           3124 non-null   object 
 5   createdAt          3234 non-null   object 
 6   dataUpdatedAt      3061 non-null   object 
 7   dataUri            3234 non-null   object 
 8   description        3160 non-null   object 
 9   domain             3234 non-null   object 
 10  externalId         0 non-null      float64
 11  hideFromCatalog    3234 non-null   bool   
 12  hideFromDataJson   3234 non-null   bool   
 13  license            79 non-null     object 
 14  metadataUpdatedAt  3234 non-null   object 
 15  provenance         3234 non-null   object 
 16  updatedAt          3234 

In [19]:
# select specific columns
cols = [
    'id', 'name', 'attribution', 'description',
    'viewCount', 'downloadCount',
    'category', 'assetType', 'displayType', 'tags',
    'createdAt', 'updatedAt', 'dataUpdatedAt', 'metadataUpdatedAt',
    'domain', 'attributionLink', 'webUri', 'dataUri'
]

metadata_merged_df = metadata_merged_df.loc[:, cols]

metadata_merged_df.head()

Unnamed: 0,id,name,attribution,description,viewCount,downloadCount,category,assetType,displayType,tags,createdAt,updatedAt,dataUpdatedAt,metadataUpdatedAt,domain,attributionLink,webUri,dataUri
0,bzg2-2abf,Bicycle Parking (Map),Department of Transportation (DOT),The data contains locations of all bike parkin...,51,0,Transportation,map,visualization_canvas_map,,2024-08-22T18:16:30+0000,2024-08-22T18:20:37+0000,2024-08-22T18:14:38+0000,2024-08-22T18:19:32+0000,data.cityofnewyork.us,,https://data.cityofnewyork.us/d/bzg2-2abf,https://data.cityofnewyork.us/resource/bzg2-2abf
1,592z-n7dk,Bicycle Parking,Department of Transportation (DOT),The data contains locations of all bike parkin...,34,11,Transportation,dataset,table,"[bicycle, racks, bicycle racks, cityracks, par...",2024-08-22T16:43:20+0000,2024-08-22T18:20:13+0000,2024-08-22T18:14:38+0000,2024-08-22T18:19:15+0000,data.cityofnewyork.us,,https://data.cityofnewyork.us/d/592z-n7dk,https://data.cityofnewyork.us/resource/592z-n7dk
2,fkec-mjr6,"DOHMH Cryptosporidiosis by Race/Ethnicity, Age...",Department of Health and Mental Hygiene (DOHMH),"Cryptosporidiosis, number of cases and annual ...",99,9,Health,dataset,table,"[cryptosporidiosis, diagnosis year, race ethni...",2024-08-05T14:12:47+0000,2024-08-05T16:34:05+0000,2024-08-05T16:04:46+0000,2024-08-05T16:33:29+0000,data.cityofnewyork.us,,https://data.cityofnewyork.us/d/fkec-mjr6,https://data.cityofnewyork.us/resource/fkec-mjr6
3,r6e8-2fwe,Location of Disposal Facilities and Sites Used...,NYC Department of Sanitation (DSNY),The location of the disposal facilities where ...,70,0,City Government,map,visualization_canvas_map,,2024-07-31T14:38:56+0000,2024-07-31T19:53:25+0000,2024-07-31T14:33:03+0000,2024-07-31T19:40:30+0000,data.cityofnewyork.us,,https://data.cityofnewyork.us/d/r6e8-2fwe,https://data.cityofnewyork.us/resource/r6e8-2fwe
4,9e2b-mctv,New York City Bike Routes\t (Map),Department of Transportation (DOT),The New York City Department of Transportation...,435,0,,map,visualization_canvas_map,,2024-07-24T16:08:52+0000,2024-08-06T21:34:51+0000,2024-07-24T16:06:04+0000,2024-08-06T21:34:51+0000,data.cityofnewyork.us,https://www.nyc.gov/html/dot/html/bicyclists/b...,https://data.cityofnewyork.us/d/9e2b-mctv,https://data.cityofnewyork.us/resource/9e2b-mctv


In [20]:
# keep only assets that are datasets (e.g. not maps or dashboards) and are displayed as tables
metadata_merged_df = metadata_merged_df.loc[
    (metadata_merged_df['assetType'] == 'dataset')
    & (metadata_merged_df['displayType'] == 'table')
]

print(metadata_merged_df.shape)
metadata_merged_df.head()

(2552, 18)


Unnamed: 0,id,name,attribution,description,viewCount,downloadCount,category,assetType,displayType,tags,createdAt,updatedAt,dataUpdatedAt,metadataUpdatedAt,domain,attributionLink,webUri,dataUri
1,592z-n7dk,Bicycle Parking,Department of Transportation (DOT),The data contains locations of all bike parkin...,34,11,Transportation,dataset,table,"[bicycle, racks, bicycle racks, cityracks, par...",2024-08-22T16:43:20+0000,2024-08-22T18:20:13+0000,2024-08-22T18:14:38+0000,2024-08-22T18:19:15+0000,data.cityofnewyork.us,,https://data.cityofnewyork.us/d/592z-n7dk,https://data.cityofnewyork.us/resource/592z-n7dk
2,fkec-mjr6,"DOHMH Cryptosporidiosis by Race/Ethnicity, Age...",Department of Health and Mental Hygiene (DOHMH),"Cryptosporidiosis, number of cases and annual ...",99,9,Health,dataset,table,"[cryptosporidiosis, diagnosis year, race ethni...",2024-08-05T14:12:47+0000,2024-08-05T16:34:05+0000,2024-08-05T16:04:46+0000,2024-08-05T16:33:29+0000,data.cityofnewyork.us,,https://data.cityofnewyork.us/d/fkec-mjr6,https://data.cityofnewyork.us/resource/fkec-mjr6
5,mzxg-pwib,New York City Bike Routes,Department of Transportation (DOT),The New York City Department of Transportation...,619,134,,dataset,table,"[nyc bike routes, bike routes]",2024-07-24T15:57:31+0000,2024-07-30T00:51:27+0000,2024-07-24T16:06:04+0000,2024-07-30T00:50:54+0000,data.cityofnewyork.us,https://www.nyc.gov/html/dot/html/bicyclists/b...,https://data.cityofnewyork.us/d/mzxg-pwib,https://data.cityofnewyork.us/resource/mzxg-pwib
6,6r9j-qrwz,DSNY Disposal Facilities Used by Year,NYC Department of Sanitation (DSNY),A listing of the facilities used by year to ha...,106,17,City Government,dataset,table,"[sanitation, waste, transfer station, waste to...",2024-07-12T18:37:24+0000,2024-07-31T19:51:22+0000,2024-07-31T14:21:50+0000,2024-07-31T19:45:38+0000,data.cityofnewyork.us,,https://data.cityofnewyork.us/d/6r9j-qrwz,https://data.cityofnewyork.us/resource/6r9j-qrwz
7,99xv-he3n,DSNY Disposal Sites Used by Facilities by Year,NYC Department of Sanitation (DSNY),A listing of the disposal sites used by each f...,75,15,City Government,dataset,table,"[sanitation, waste, transfer station, waste to...",2024-07-12T18:18:59+0000,2024-07-31T19:51:26+0000,2024-07-31T14:18:13+0000,2024-07-31T19:44:47+0000,data.cityofnewyork.us,,https://data.cityofnewyork.us/d/99xv-he3n,https://data.cityofnewyork.us/resource/99xv-he3n


In [21]:
# now we can safely drop these columns, as each value is the same
cols = ['assetType', 'displayType']
metadata_merged_df = metadata_merged_df.loc[
    :,
    ~metadata_merged_df.columns.isin(cols)
]

metadata_merged_df.head()

Unnamed: 0,id,name,attribution,description,viewCount,downloadCount,category,tags,createdAt,updatedAt,dataUpdatedAt,metadataUpdatedAt,domain,attributionLink,webUri,dataUri
1,592z-n7dk,Bicycle Parking,Department of Transportation (DOT),The data contains locations of all bike parkin...,34,11,Transportation,"[bicycle, racks, bicycle racks, cityracks, par...",2024-08-22T16:43:20+0000,2024-08-22T18:20:13+0000,2024-08-22T18:14:38+0000,2024-08-22T18:19:15+0000,data.cityofnewyork.us,,https://data.cityofnewyork.us/d/592z-n7dk,https://data.cityofnewyork.us/resource/592z-n7dk
2,fkec-mjr6,"DOHMH Cryptosporidiosis by Race/Ethnicity, Age...",Department of Health and Mental Hygiene (DOHMH),"Cryptosporidiosis, number of cases and annual ...",99,9,Health,"[cryptosporidiosis, diagnosis year, race ethni...",2024-08-05T14:12:47+0000,2024-08-05T16:34:05+0000,2024-08-05T16:04:46+0000,2024-08-05T16:33:29+0000,data.cityofnewyork.us,,https://data.cityofnewyork.us/d/fkec-mjr6,https://data.cityofnewyork.us/resource/fkec-mjr6
5,mzxg-pwib,New York City Bike Routes,Department of Transportation (DOT),The New York City Department of Transportation...,619,134,,"[nyc bike routes, bike routes]",2024-07-24T15:57:31+0000,2024-07-30T00:51:27+0000,2024-07-24T16:06:04+0000,2024-07-30T00:50:54+0000,data.cityofnewyork.us,https://www.nyc.gov/html/dot/html/bicyclists/b...,https://data.cityofnewyork.us/d/mzxg-pwib,https://data.cityofnewyork.us/resource/mzxg-pwib
6,6r9j-qrwz,DSNY Disposal Facilities Used by Year,NYC Department of Sanitation (DSNY),A listing of the facilities used by year to ha...,106,17,City Government,"[sanitation, waste, transfer station, waste to...",2024-07-12T18:37:24+0000,2024-07-31T19:51:22+0000,2024-07-31T14:21:50+0000,2024-07-31T19:45:38+0000,data.cityofnewyork.us,,https://data.cityofnewyork.us/d/6r9j-qrwz,https://data.cityofnewyork.us/resource/6r9j-qrwz
7,99xv-he3n,DSNY Disposal Sites Used by Facilities by Year,NYC Department of Sanitation (DSNY),A listing of the disposal sites used by each f...,75,15,City Government,"[sanitation, waste, transfer station, waste to...",2024-07-12T18:18:59+0000,2024-07-31T19:51:26+0000,2024-07-31T14:18:13+0000,2024-07-31T19:44:47+0000,data.cityofnewyork.us,,https://data.cityofnewyork.us/d/99xv-he3n,https://data.cityofnewyork.us/resource/99xv-he3n


In [22]:
# merge dataset log file with metadata
merged_df = df.merge(
    metadata_merged_df,
    on='id',
    how='left'
)

print(merged_df.shape)
merged_df.head()

(2554, 19)


Unnamed: 0,datetime_log,id,error_log,count_rows,name,attribution,description,viewCount,downloadCount,category,tags,createdAt,updatedAt,dataUpdatedAt,metadataUpdatedAt,domain,attributionLink,webUri,dataUri
0,2024-08-09 13:56:42,fkec-mjr6,,182,"DOHMH Cryptosporidiosis by Race/Ethnicity, Age...",Department of Health and Mental Hygiene (DOHMH),"Cryptosporidiosis, number of cases and annual ...",99.0,9.0,Health,"[cryptosporidiosis, diagnosis year, race ethni...",2024-08-05T14:12:47+0000,2024-08-05T16:34:05+0000,2024-08-05T16:04:46+0000,2024-08-05T16:33:29+0000,data.cityofnewyork.us,,https://data.cityofnewyork.us/d/fkec-mjr6,https://data.cityofnewyork.us/resource/fkec-mjr6
1,2024-08-09 13:56:47,mzxg-pwib,,27673,New York City Bike Routes,Department of Transportation (DOT),The New York City Department of Transportation...,619.0,134.0,,"[nyc bike routes, bike routes]",2024-07-24T15:57:31+0000,2024-07-30T00:51:27+0000,2024-07-24T16:06:04+0000,2024-07-30T00:50:54+0000,data.cityofnewyork.us,https://www.nyc.gov/html/dot/html/bicyclists/b...,https://data.cityofnewyork.us/d/mzxg-pwib,https://data.cityofnewyork.us/resource/mzxg-pwib
2,2024-08-09 13:56:53,6r9j-qrwz,,91,DSNY Disposal Facilities Used by Year,NYC Department of Sanitation (DSNY),A listing of the facilities used by year to ha...,106.0,17.0,City Government,"[sanitation, waste, transfer station, waste to...",2024-07-12T18:37:24+0000,2024-07-31T19:51:22+0000,2024-07-31T14:21:50+0000,2024-07-31T19:45:38+0000,data.cityofnewyork.us,,https://data.cityofnewyork.us/d/6r9j-qrwz,https://data.cityofnewyork.us/resource/6r9j-qrwz
3,2024-08-09 13:57:00,99xv-he3n,,188,DSNY Disposal Sites Used by Facilities by Year,NYC Department of Sanitation (DSNY),A listing of the disposal sites used by each f...,75.0,15.0,City Government,"[sanitation, waste, transfer station, waste to...",2024-07-12T18:18:59+0000,2024-07-31T19:51:26+0000,2024-07-31T14:18:13+0000,2024-07-31T19:44:47+0000,data.cityofnewyork.us,,https://data.cityofnewyork.us/d/99xv-he3n,https://data.cityofnewyork.us/resource/99xv-he3n
4,2024-08-09 13:57:06,ufxk-pq9j,,39,Location of Disposal Facilities and Sites Used...,NYC Department of Sanitation (DSNY),The location of the disposal facilities where ...,92.0,28.0,City Government,"[sanitation, waste, transfer station, waste to...",2024-07-12T17:54:05+0000,2024-07-31T19:51:24+0000,2024-07-31T14:33:03+0000,2024-07-31T19:45:15+0000,data.cityofnewyork.us,,https://data.cityofnewyork.us/d/ufxk-pq9j,https://data.cityofnewyork.us/resource/ufxk-pq9j


In [23]:
# rearrange columns
cols = [
    'id', 'name', 'attribution', 'description',
    'count_rows', 'viewCount', 'downloadCount',
    'category', 'tags',
    'createdAt', 'updatedAt', 'dataUpdatedAt', 'metadataUpdatedAt',
    'domain', 'attributionLink', 'webUri', 'dataUri'
]

merged_df = merged_df.loc[:, cols]

merged_df.head()

Unnamed: 0,id,name,attribution,description,count_rows,viewCount,downloadCount,category,tags,createdAt,updatedAt,dataUpdatedAt,metadataUpdatedAt,domain,attributionLink,webUri,dataUri
0,fkec-mjr6,"DOHMH Cryptosporidiosis by Race/Ethnicity, Age...",Department of Health and Mental Hygiene (DOHMH),"Cryptosporidiosis, number of cases and annual ...",182,99.0,9.0,Health,"[cryptosporidiosis, diagnosis year, race ethni...",2024-08-05T14:12:47+0000,2024-08-05T16:34:05+0000,2024-08-05T16:04:46+0000,2024-08-05T16:33:29+0000,data.cityofnewyork.us,,https://data.cityofnewyork.us/d/fkec-mjr6,https://data.cityofnewyork.us/resource/fkec-mjr6
1,mzxg-pwib,New York City Bike Routes,Department of Transportation (DOT),The New York City Department of Transportation...,27673,619.0,134.0,,"[nyc bike routes, bike routes]",2024-07-24T15:57:31+0000,2024-07-30T00:51:27+0000,2024-07-24T16:06:04+0000,2024-07-30T00:50:54+0000,data.cityofnewyork.us,https://www.nyc.gov/html/dot/html/bicyclists/b...,https://data.cityofnewyork.us/d/mzxg-pwib,https://data.cityofnewyork.us/resource/mzxg-pwib
2,6r9j-qrwz,DSNY Disposal Facilities Used by Year,NYC Department of Sanitation (DSNY),A listing of the facilities used by year to ha...,91,106.0,17.0,City Government,"[sanitation, waste, transfer station, waste to...",2024-07-12T18:37:24+0000,2024-07-31T19:51:22+0000,2024-07-31T14:21:50+0000,2024-07-31T19:45:38+0000,data.cityofnewyork.us,,https://data.cityofnewyork.us/d/6r9j-qrwz,https://data.cityofnewyork.us/resource/6r9j-qrwz
3,99xv-he3n,DSNY Disposal Sites Used by Facilities by Year,NYC Department of Sanitation (DSNY),A listing of the disposal sites used by each f...,188,75.0,15.0,City Government,"[sanitation, waste, transfer station, waste to...",2024-07-12T18:18:59+0000,2024-07-31T19:51:26+0000,2024-07-31T14:18:13+0000,2024-07-31T19:44:47+0000,data.cityofnewyork.us,,https://data.cityofnewyork.us/d/99xv-he3n,https://data.cityofnewyork.us/resource/99xv-he3n
4,ufxk-pq9j,Location of Disposal Facilities and Sites Used...,NYC Department of Sanitation (DSNY),The location of the disposal facilities where ...,39,92.0,28.0,City Government,"[sanitation, waste, transfer station, waste to...",2024-07-12T17:54:05+0000,2024-07-31T19:51:24+0000,2024-07-31T14:33:03+0000,2024-07-31T19:45:15+0000,data.cityofnewyork.us,,https://data.cityofnewyork.us/d/ufxk-pq9j,https://data.cityofnewyork.us/resource/ufxk-pq9j


In [24]:
# preview column info
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2554 entries, 0 to 2553
Data columns (total 17 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 2554 non-null   object 
 1   name               2550 non-null   object 
 2   attribution        2424 non-null   object 
 3   description        2476 non-null   object 
 4   count_rows         2554 non-null   int64  
 5   viewCount          2550 non-null   float64
 6   downloadCount      2550 non-null   float64
 7   category           2463 non-null   object 
 8   tags               1903 non-null   object 
 9   createdAt          2550 non-null   object 
 10  updatedAt          2550 non-null   object 
 11  dataUpdatedAt      2530 non-null   object 
 12  metadataUpdatedAt  2550 non-null   object 
 13  domain             2550 non-null   object 
 14  attributionLink    358 non-null    object 
 15  webUri             2550 non-null   object 
 16  dataUri            2550 

In [25]:
# summary statistics
merged_df.describe().round(1)

Unnamed: 0,count_rows,viewCount,downloadCount
count,2554.0,2550.0,2550.0
mean,2336225.7,10893.2,4538.9
std,18086498.7,107296.0,40557.1
min,0.0,51.0,9.0
25%,135.8,364.0,388.2
50%,1395.0,825.5,788.0
75%,12382.8,2560.8,2082.5
max,376404531.0,2813069.0,1664776.0


In [26]:
# null counts per column
merged_df.isnull().sum()

id                      0
name                    4
attribution           130
description            78
count_rows              0
viewCount               4
downloadCount           4
category               91
tags                  651
createdAt               4
updatedAt               4
dataUpdatedAt          24
metadataUpdatedAt       4
domain                  4
attributionLink      2196
webUri                  4
dataUri                 4
dtype: int64

Examine why four datasets have `name` as null.

In [27]:
merged_df.loc[merged_df['name'].isnull()]

Unnamed: 0,id,name,attribution,description,count_rows,viewCount,downloadCount,category,tags,createdAt,updatedAt,dataUpdatedAt,metadataUpdatedAt,domain,attributionLink,webUri,dataUri
358,in83-58q5,,,,334044,,,,,,,,,,,,
359,evu4-6zyr,,,,335616,,,,,,,,,,,,
360,njuk-taxk,,,,309528,,,,,,,,,,,,
540,ykru-djh7,,,,2166,,,,,,,,,,,,


In [28]:
merged_df.loc[merged_df['name'].isnull()].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 358 to 540
Data columns (total 17 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 4 non-null      object 
 1   name               0 non-null      object 
 2   attribution        0 non-null      object 
 3   description        0 non-null      object 
 4   count_rows         4 non-null      int64  
 5   viewCount          0 non-null      float64
 6   downloadCount      0 non-null      float64
 7   category           0 non-null      object 
 8   tags               0 non-null      object 
 9   createdAt          0 non-null      object 
 10  updatedAt          0 non-null      object 
 11  dataUpdatedAt      0 non-null      object 
 12  metadataUpdatedAt  0 non-null      object 
 13  domain             0 non-null      object 
 14  attributionLink    0 non-null      object 
 15  webUri             0 non-null      object 
 16  dataUri            0 non-n

These might be unauthorized or private datasets hosted on NYC Open Data; I couldn't access them. Let's drop them.
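
Another possibility worth ruling out: the metadata was filtered to `assetType == 'dataset'` and `displayType == 'table'` before this merge, so a log id could be unmatched simply because it belongs to a different asset type. The sketch below shows one way to tell the two cases apart. It uses toy frames with invented ids and values; in the notebook, the unfiltered `metadata_df` would play the role of `all_meta`.

```python
import pandas as pd

log = pd.DataFrame({"id": ["a", "b", "c", "d"], "count_rows": [10, 20, 30, 40]})
all_meta = pd.DataFrame({
    "id": ["a", "b", "c"],
    "assetType": ["dataset", "dataset", "map"],
})
tables_only = all_meta.loc[all_meta["assetType"] == "dataset"]

# indicator=True adds a `_merge` column recording where each row was found
check = log.merge(tables_only, on="id", how="left", indicator=True)
unmatched = check.loc[check["_merge"] == "left_only", "id"]

# split the unmatched ids: filtered out by asset type vs. absent from the catalog
filtered_out = unmatched[unmatched.isin(all_meta["id"])].tolist()
not_in_catalog = unmatched[~unmatched.isin(all_meta["id"])].tolist()
print(filtered_out, not_in_catalog)
```

Ids that land in `not_in_catalog` are the ones consistent with the private-or-unauthorized explanation.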

In [29]:
merged_df = (
    merged_df
    .loc[merged_df['name'].notnull()]
    .reset_index(drop=True)
)

print(merged_df.shape)
merged_df.head()

(2550, 17)


Unnamed: 0,id,name,attribution,description,count_rows,viewCount,downloadCount,category,tags,createdAt,updatedAt,dataUpdatedAt,metadataUpdatedAt,domain,attributionLink,webUri,dataUri
0,fkec-mjr6,"DOHMH Cryptosporidiosis by Race/Ethnicity, Age...",Department of Health and Mental Hygiene (DOHMH),"Cryptosporidiosis, number of cases and annual ...",182,99.0,9.0,Health,"[cryptosporidiosis, diagnosis year, race ethni...",2024-08-05T14:12:47+0000,2024-08-05T16:34:05+0000,2024-08-05T16:04:46+0000,2024-08-05T16:33:29+0000,data.cityofnewyork.us,,https://data.cityofnewyork.us/d/fkec-mjr6,https://data.cityofnewyork.us/resource/fkec-mjr6
1,mzxg-pwib,New York City Bike Routes,Department of Transportation (DOT),The New York City Department of Transportation...,27673,619.0,134.0,,"[nyc bike routes, bike routes]",2024-07-24T15:57:31+0000,2024-07-30T00:51:27+0000,2024-07-24T16:06:04+0000,2024-07-30T00:50:54+0000,data.cityofnewyork.us,https://www.nyc.gov/html/dot/html/bicyclists/b...,https://data.cityofnewyork.us/d/mzxg-pwib,https://data.cityofnewyork.us/resource/mzxg-pwib
2,6r9j-qrwz,DSNY Disposal Facilities Used by Year,NYC Department of Sanitation (DSNY),A listing of the facilities used by year to ha...,91,106.0,17.0,City Government,"[sanitation, waste, transfer station, waste to...",2024-07-12T18:37:24+0000,2024-07-31T19:51:22+0000,2024-07-31T14:21:50+0000,2024-07-31T19:45:38+0000,data.cityofnewyork.us,,https://data.cityofnewyork.us/d/6r9j-qrwz,https://data.cityofnewyork.us/resource/6r9j-qrwz
3,99xv-he3n,DSNY Disposal Sites Used by Facilities by Year,NYC Department of Sanitation (DSNY),A listing of the disposal sites used by each f...,188,75.0,15.0,City Government,"[sanitation, waste, transfer station, waste to...",2024-07-12T18:18:59+0000,2024-07-31T19:51:26+0000,2024-07-31T14:18:13+0000,2024-07-31T19:44:47+0000,data.cityofnewyork.us,,https://data.cityofnewyork.us/d/99xv-he3n,https://data.cityofnewyork.us/resource/99xv-he3n
4,ufxk-pq9j,Location of Disposal Facilities and Sites Used...,NYC Department of Sanitation (DSNY),The location of the disposal facilities where ...,39,92.0,28.0,City Government,"[sanitation, waste, transfer station, waste to...",2024-07-12T17:54:05+0000,2024-07-31T19:51:24+0000,2024-07-31T14:33:03+0000,2024-07-31T19:45:15+0000,data.cityofnewyork.us,,https://data.cityofnewyork.us/d/ufxk-pq9j,https://data.cityofnewyork.us/resource/ufxk-pq9j
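The filter above can also be written with `dropna`. Here is a minimal sketch on a small made-up frame (`toy` and its values are hypothetical stand-ins for `merged_df`), showing that both forms should keep the same rows.

```python
import pandas as pd

# hypothetical toy frame standing in for merged_df
toy = pd.DataFrame({
    'id': ['aaaa-1111', 'bbbb-2222', 'cccc-3333'],
    'name': ['Dataset A', None, 'Dataset C'],
    'count_rows': [10, 5, 0],
})

# approach used above: boolean mask on non-null names
by_mask = toy.loc[toy['name'].notnull()].reset_index(drop=True)

# alternative: drop rows where 'name' is null
by_dropna = toy.dropna(subset=['name']).reset_index(drop=True)

# both approaches keep the same rows
print(by_mask.equals(by_dropna))
```

Either form works; the mask version just makes the condition explicit.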


In [30]:
# sanity check
merged_df.isnull().sum()

id                      0
name                    0
attribution           126
description            74
count_rows              0
viewCount               0
downloadCount           0
category               87
tags                  647
createdAt               0
updatedAt               0
dataUpdatedAt          20
metadataUpdatedAt       0
domain                  0
attributionLink      2192
webUri                  0
dataUri                 0
dtype: int64

In [31]:
# cast count columns from float to int (no nulls remain, per the check above)
merged_df = merged_df.astype({
    'viewCount':int,
    'downloadCount':int
})

merged_df.head()

Unnamed: 0,id,name,attribution,description,count_rows,viewCount,downloadCount,category,tags,createdAt,updatedAt,dataUpdatedAt,metadataUpdatedAt,domain,attributionLink,webUri,dataUri
0,fkec-mjr6,"DOHMH Cryptosporidiosis by Race/Ethnicity, Age...",Department of Health and Mental Hygiene (DOHMH),"Cryptosporidiosis, number of cases and annual ...",182,99,9,Health,"[cryptosporidiosis, diagnosis year, race ethni...",2024-08-05T14:12:47+0000,2024-08-05T16:34:05+0000,2024-08-05T16:04:46+0000,2024-08-05T16:33:29+0000,data.cityofnewyork.us,,https://data.cityofnewyork.us/d/fkec-mjr6,https://data.cityofnewyork.us/resource/fkec-mjr6
1,mzxg-pwib,New York City Bike Routes,Department of Transportation (DOT),The New York City Department of Transportation...,27673,619,134,,"[nyc bike routes, bike routes]",2024-07-24T15:57:31+0000,2024-07-30T00:51:27+0000,2024-07-24T16:06:04+0000,2024-07-30T00:50:54+0000,data.cityofnewyork.us,https://www.nyc.gov/html/dot/html/bicyclists/b...,https://data.cityofnewyork.us/d/mzxg-pwib,https://data.cityofnewyork.us/resource/mzxg-pwib
2,6r9j-qrwz,DSNY Disposal Facilities Used by Year,NYC Department of Sanitation (DSNY),A listing of the facilities used by year to ha...,91,106,17,City Government,"[sanitation, waste, transfer station, waste to...",2024-07-12T18:37:24+0000,2024-07-31T19:51:22+0000,2024-07-31T14:21:50+0000,2024-07-31T19:45:38+0000,data.cityofnewyork.us,,https://data.cityofnewyork.us/d/6r9j-qrwz,https://data.cityofnewyork.us/resource/6r9j-qrwz
3,99xv-he3n,DSNY Disposal Sites Used by Facilities by Year,NYC Department of Sanitation (DSNY),A listing of the disposal sites used by each f...,188,75,15,City Government,"[sanitation, waste, transfer station, waste to...",2024-07-12T18:18:59+0000,2024-07-31T19:51:26+0000,2024-07-31T14:18:13+0000,2024-07-31T19:44:47+0000,data.cityofnewyork.us,,https://data.cityofnewyork.us/d/99xv-he3n,https://data.cityofnewyork.us/resource/99xv-he3n
4,ufxk-pq9j,Location of Disposal Facilities and Sites Used...,NYC Department of Sanitation (DSNY),The location of the disposal facilities where ...,39,92,28,City Government,"[sanitation, waste, transfer station, waste to...",2024-07-12T17:54:05+0000,2024-07-31T19:51:24+0000,2024-07-31T14:33:03+0000,2024-07-31T19:45:15+0000,data.cityofnewyork.us,,https://data.cityofnewyork.us/d/ufxk-pq9j,https://data.cityofnewyork.us/resource/ufxk-pq9j
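Casting float columns to `int` only works cleanly when there are no nulls and no fractional values. Below is a minimal sketch of a guard that could run before the cast, using a made-up frame (`toy`, `cols` and the values are hypothetical, not taken from `merged_df`).

```python
import pandas as pd

# hypothetical toy frame; the real columns live in merged_df
toy = pd.DataFrame({
    'viewCount': [99.0, 619.0, 106.0],
    'downloadCount': [9.0, 134.0, 17.0],
})

cols = ['viewCount', 'downloadCount']

# astype(int) raises on NaN and silently truncates fractions,
# so confirm neither is present before casting
assert toy[cols].notnull().all().all()
assert (toy[cols] % 1 == 0).all().all()

# cast the checked columns to int
toy = toy.astype({col: int for col in cols})
print(toy.dtypes)
```

If either assertion fails, it is probably worth looking at the offending rows before deciding how to cast.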


# Examine Datasets with Zero Rows

In [32]:
# count of datasets with zero rows
merged_df.loc[merged_df['count_rows'] == 0].shape[0]

59

In [33]:
# preview these datasets
merged_df.loc[merged_df['count_rows'] == 0].head(10)

Unnamed: 0,id,name,attribution,description,count_rows,viewCount,downloadCount,category,tags,createdAt,updatedAt,dataUpdatedAt,metadataUpdatedAt,domain,attributionLink,webUri,dataUri
174,erdf-2akx,EZ Pass Readers (Tabular),Department of Transportation (DOT),E-Z Pass readers are installed throughout NYC ...,0,2346,727,Transportation,"[traffic, speed, midtown, central business dis...",2022-09-21T14:42:59+0000,2024-07-10T17:20:39+0000,2024-07-09T12:10:37+0000,2024-07-10T17:20:39+0000,data.cityofnewyork.us,,https://data.cityofnewyork.us/d/erdf-2akx,https://data.cityofnewyork.us/resource/erdf-2akx
498,mvn6-575n,2020 - 2021 Remote Learning Legislation Device...,NYC Department of Education,"For fall 2020, when buildings reopened and st...",0,444,178,Education,,2021-03-11T22:07:58+0000,2021-03-11T22:15:22+0000,2021-03-11T22:15:21+0000,2021-03-11T22:15:22+0000,data.cityofnewyork.us,,https://data.cityofnewyork.us/d/mvn6-575n,https://data.cityofnewyork.us/resource/mvn6-575n
1544,5x7b-baa2,2014-2015 School Closure Discharge Reporting -...,NYC Department of Education,"In June 2014, 21 New York City public schools ...",0,282,695,Education,,2018-03-27T14:19:05+0000,2024-07-05T13:43:20+0000,2018-03-27T14:19:05+0000,2024-07-05T13:43:20+0000,data.cityofnewyork.us,,https://data.cityofnewyork.us/d/5x7b-baa2,https://data.cityofnewyork.us/resource/5x7b-baa2
1545,hjqb-2g4j,2014-2015 School Closure Discharge Reporting C...,NYC Department of Education,"In June 2014, 21 New York City public schools ...",0,231,657,Education,,2018-03-27T14:14:48+0000,2024-07-05T13:43:21+0000,2018-03-27T14:14:48+0000,2024-07-05T13:43:21+0000,data.cityofnewyork.us,,https://data.cityofnewyork.us/d/hjqb-2g4j,https://data.cityofnewyork.us/resource/hjqb-2g4j
1546,x2z9-ihqe,2013-2014 School Closure Discharge Reporting -...,NYC Department of Education,"In June 2014, 21 New York City public schools ...",0,223,824,Education,,2018-03-27T12:29:22+0000,2024-07-05T13:43:20+0000,2018-03-27T12:29:22+0000,2024-07-05T13:43:20+0000,data.cityofnewyork.us,,https://data.cityofnewyork.us/d/x2z9-ihqe,https://data.cityofnewyork.us/resource/x2z9-ihqe
1547,ips9-qxk8,2013-2014 School Closure Discharge Reporting ...,NYC Department of Education,"In June 2014, 21 New York City public schools ...",0,195,711,Education,,2018-03-27T12:25:01+0000,2024-07-05T13:43:20+0000,2018-03-27T12:25:01+0000,2024-07-05T13:43:20+0000,data.cityofnewyork.us,,https://data.cityofnewyork.us/d/ips9-qxk8,https://data.cityofnewyork.us/resource/ips9-qxk8
1548,hnb3-32fi,2012-2013 School Closure Discharge Reporting -...,NYC Department of Education,"In June 2013, Some New York City public school...",0,207,744,Education,,2018-03-26T20:10:02+0000,2024-07-05T13:43:19+0000,2018-03-26T20:10:02+0000,2024-07-05T13:43:19+0000,data.cityofnewyork.us,,https://data.cityofnewyork.us/d/hnb3-32fi,https://data.cityofnewyork.us/resource/hnb3-32fi
1549,adqm-qibj,2012-2013 School Closure Discharge Reporting -...,NYC Department of Education,"In June 2013, Some New York City public school...",0,236,761,Education,,2018-03-26T20:03:58+0000,2024-07-05T13:43:18+0000,2018-03-26T20:03:58+0000,2024-07-05T13:43:18+0000,data.cityofnewyork.us,,https://data.cityofnewyork.us/d/adqm-qibj,https://data.cityofnewyork.us/resource/adqm-qibj
1550,myhg-kfw9,2012-2013 School Closure Discharge Reporting -...,NYC Department of Education,"In June 2013, Some New York City public school...",0,210,652,Education,,2018-03-26T19:56:59+0000,2024-07-05T13:43:18+0000,2018-03-26T19:56:59+0000,2024-07-05T13:43:18+0000,data.cityofnewyork.us,,https://data.cityofnewyork.us/d/myhg-kfw9,https://data.cityofnewyork.us/resource/myhg-kfw9
1551,b8z5-673p,2012-2013 School Closure Discharge Reporting -...,NYC Department of Education,"In June 2013, Some New York City public schoo...",0,232,643,Education,,2018-03-26T19:49:16+0000,2024-07-05T13:43:19+0000,2018-03-26T19:49:16+0000,2024-07-05T13:43:19+0000,data.cityofnewyork.us,,https://data.cityofnewyork.us/d/b8z5-673p,https://data.cityofnewyork.us/resource/b8z5-673p


In [34]:
# examine counts by agency
(merged_df
 .loc[merged_df['count_rows'] == 0]
 .groupby(by='attribution')['id']
 .count()
 .sort_values(ascending=False)
)

attribution
NYC Department of Education            49
NYC Department Of Education             6
Department of Transportation (DOT)      1
NYC Department Of Educarion             1
NYC Department of Educaton              1
Taxi and Limousine Commission (TLC)     1
Name: id, dtype: int64

This analysis is about the number of rows per dataset on NYC Open Data, and datasets that report zero rows add nothing to it, so I think it's appropriate to drop them.

In [35]:
merged_df = merged_df.loc[merged_df['count_rows'] > 0].reset_index(drop=True)

# sanity check: no datasets with zero rows should remain
merged_df.loc[merged_df['count_rows'] == 0].shape[0]

0
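If it is useful to keep a record of what was removed, the ids can be captured before filtering. This is only a sketch on a made-up frame (`toy`, `has_rows`, `dropped_ids` and `kept` are hypothetical names, not part of the notebook).

```python
import pandas as pd

# hypothetical toy frame standing in for merged_df
toy = pd.DataFrame({
    'id': ['aaaa-1111', 'bbbb-2222', 'cccc-3333'],
    'count_rows': [182, 0, 39],
})

# same condition as the filter above
has_rows = toy['count_rows'] > 0

# remember which datasets are being removed
dropped_ids = toy.loc[~has_rows, 'id'].tolist()

# keep only datasets with at least one row
kept = toy.loc[has_rows].reset_index(drop=True)

print(dropped_ids)
print(len(toy) - len(kept))
```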

# Strip Extra White Space from Agency Name

In [36]:
def strip_extra_whitespace(text):
    # collapse runs of whitespace into a single space and trim both ends
    return ' '.join(text.split())

merged_df['attribution_formatted'] = (
    merged_df['attribution']
    .fillna("")
    .apply(strip_extra_whitespace)
)

merged_df.head()

Unnamed: 0,id,name,attribution,description,count_rows,viewCount,downloadCount,category,tags,createdAt,updatedAt,dataUpdatedAt,metadataUpdatedAt,domain,attributionLink,webUri,dataUri,attribution_formatted
0,fkec-mjr6,"DOHMH Cryptosporidiosis by Race/Ethnicity, Age...",Department of Health and Mental Hygiene (DOHMH),"Cryptosporidiosis, number of cases and annual ...",182,99,9,Health,"[cryptosporidiosis, diagnosis year, race ethni...",2024-08-05T14:12:47+0000,2024-08-05T16:34:05+0000,2024-08-05T16:04:46+0000,2024-08-05T16:33:29+0000,data.cityofnewyork.us,,https://data.cityofnewyork.us/d/fkec-mjr6,https://data.cityofnewyork.us/resource/fkec-mjr6,Department of Health and Mental Hygiene (DOHMH)
1,mzxg-pwib,New York City Bike Routes,Department of Transportation (DOT),The New York City Department of Transportation...,27673,619,134,,"[nyc bike routes, bike routes]",2024-07-24T15:57:31+0000,2024-07-30T00:51:27+0000,2024-07-24T16:06:04+0000,2024-07-30T00:50:54+0000,data.cityofnewyork.us,https://www.nyc.gov/html/dot/html/bicyclists/b...,https://data.cityofnewyork.us/d/mzxg-pwib,https://data.cityofnewyork.us/resource/mzxg-pwib,Department of Transportation (DOT)
2,6r9j-qrwz,DSNY Disposal Facilities Used by Year,NYC Department of Sanitation (DSNY),A listing of the facilities used by year to ha...,91,106,17,City Government,"[sanitation, waste, transfer station, waste to...",2024-07-12T18:37:24+0000,2024-07-31T19:51:22+0000,2024-07-31T14:21:50+0000,2024-07-31T19:45:38+0000,data.cityofnewyork.us,,https://data.cityofnewyork.us/d/6r9j-qrwz,https://data.cityofnewyork.us/resource/6r9j-qrwz,NYC Department of Sanitation (DSNY)
3,99xv-he3n,DSNY Disposal Sites Used by Facilities by Year,NYC Department of Sanitation (DSNY),A listing of the disposal sites used by each f...,188,75,15,City Government,"[sanitation, waste, transfer station, waste to...",2024-07-12T18:18:59+0000,2024-07-31T19:51:26+0000,2024-07-31T14:18:13+0000,2024-07-31T19:44:47+0000,data.cityofnewyork.us,,https://data.cityofnewyork.us/d/99xv-he3n,https://data.cityofnewyork.us/resource/99xv-he3n,NYC Department of Sanitation (DSNY)
4,ufxk-pq9j,Location of Disposal Facilities and Sites Used...,NYC Department of Sanitation (DSNY),The location of the disposal facilities where ...,39,92,28,City Government,"[sanitation, waste, transfer station, waste to...",2024-07-12T17:54:05+0000,2024-07-31T19:51:24+0000,2024-07-31T14:33:03+0000,2024-07-31T19:45:15+0000,data.cityofnewyork.us,,https://data.cityofnewyork.us/d/ufxk-pq9j,https://data.cityofnewyork.us/resource/ufxk-pq9j,NYC Department of Sanitation (DSNY)
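The same cleanup can be sketched with pandas string methods, which leave missing values as missing instead of turning them into empty strings. The series below is made up for illustration (`s` and `normalized` are hypothetical names), and for ordinary spaces, tabs and newlines the result should match `strip_extra_whitespace`.

```python
import pandas as pd

# hypothetical agency names with irregular spacing and one missing value
s = pd.Series([
    'NYC  Department of   Education',
    ' Department of Transportation (DOT) ',
    None,
])

# collapse whitespace runs to one space, then trim both ends;
# the missing value is left as missing
normalized = s.str.replace(r'\s+', ' ', regex=True).str.strip()

print(normalized.tolist())
```

Whether to keep nulls or fill them with `""` depends on how the agency column is used later; this sketch only shows the null-preserving option.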


In [37]:
# save cleaned dataset
merged_df.to_csv('data/data.csv', index=False)

In [38]:
%ls data/

data.csv
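A quick way to check an export like this is to read the file back and compare it with the frame that was written. The sketch below uses a made-up frame and a temporary directory rather than `data/data.csv` (`toy`, `path` and `check` are hypothetical names). It also assumes `tags` holds Python lists, as the display suggests; in that case the cells come back from CSV as plain strings.

```python
import tempfile
from pathlib import Path

import pandas as pd

# hypothetical toy frame; 'tags' is assumed to hold Python lists
toy = pd.DataFrame({
    'id': ['aaaa-1111', 'bbbb-2222'],
    'count_rows': [182, 27673],
    'tags': [['bike routes'], None],
})

# write to a temporary file and read it straight back
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'data.csv'
    toy.to_csv(path, index=False)
    check = pd.read_csv(path)

# shape and row counts survive the round trip
print(check.shape == toy.shape)
print(check['count_rows'].tolist() == toy['count_rows'].tolist())

# list-valued cells are read back as strings, not lists
print(type(check.loc[0, 'tags']))
```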
