# Sodapy Basics Tutorial Using NYC Open Data 
Mark Bauer

Table of Contents
=================

   * [1. Introduction](#1-Introduction)
       
       
   * [2. Sodapy](#2-Sodapy)
       * [2.1 Using Sodapy](#21-Using-Sodapy)
       * [2.2 Sodapy Methods](#22-Sodapy-Methods)
       
       
   * [3. Importing Libraries](#3-Importing-Libraries)
   
   
   * [4. Socrata Class](#4-Socrata-Class)
   

   * [5. Sodapy Methods](#5-Sodapy-Methods)
       * [5.1 .datasets()](#51-datasets())
       * [5.2 .get()](#52-get())
       * [5.3 .get_metadata()](#53-get_metadata())
       
          
   * [6. Conclusion](#6-Conclusion)

# 1. Introduction  
This notebook demonstrates how to use sodapy, the Python client for the Socrata Open Data API (SODA), and reviews several of its methods for retrieving data from Socrata Open Data. The data in this tutorial is from NYC Open Data.

# 2. Sodapy

## 2.1 Using Sodapy

To use sodapy, a source domain (i.e., the Socrata Open Data source you are trying to connect to) needs to be passed to the Socrata class. Additionally, to query a specific data set on Socrata Open Data, the data set identifier (i.e., the data set id on the given source domain) needs to be passed to the relevant method as well. Below, we identify NYC Open Data's source domain, `data.cityofnewyork.us`, and the data set identifier for the NYC 311 data set, `erm2-nwe9`. The screenshot below shows where to find this information.

![nyc-311-api-docs](images/nyc-311-api-docs.png)  

Source: https://dev.socrata.com/foundry/data.cityofnewyork.us/erm2-nwe9

## 2.2 Sodapy Methods

![socrata-methods](images/socrata-methods.png)

Source: https://github.com/xmunoz/sodapy#datasetslimit0-offset0

We will focus on three sodapy methods:

* `.datasets()`: Returns the list of datasets associated with a particular domain.
* `.get()`: Reads data from the requested resource. Options for `content_type` are json, csv, and xml.
* `.get_metadata()`: Retrieves the metadata for a particular dataset.

# 3. Importing Libraries

In [297]:
# importing libraries
import pandas as pd
from sodapy import Socrata

In [298]:
%reload_ext watermark

In [299]:
%watermark -a "Mark Bauer" -u -t -d -v -p pandas,sodapy

Mark Bauer 
last updated: 2021-02-07 12:24:09 

CPython 3.7.1
IPython 7.18.1

pandas 1.0.0
sodapy 2.0.0


Documentation for installing watermark: https://github.com/rasbt/watermark

# 4. Socrata Class 

We assign the source domain and data set identifier from Section 2.1 to the `socrata_domain` and `socrata_dataset_identifier` variables below.

In [300]:
# source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# NYC 311 data set identifier
socrata_dataset_identifier = 'erm2-nwe9'

In [301]:
# The main class that interacts with the SODA API.

# The required arguments are:
#     domain: the domain you wish to access
#     app_token: your Socrata application token
# Simple requests are possible without an app_token, though these
# requests will be rate-limited.

client = Socrata(socrata_domain, None, timeout=100)



In [302]:
# printing the Socrata object 'client'
print(client)

<sodapy.Socrata object at 0x124857320>


In [303]:
# printing type of the Socrata object 'client'
type(client)

sodapy.Socrata

In [304]:
# printing attributes of client object
for key, value in client.__dict__.items():
    print(key + ':', value)

domain: data.cityofnewyork.us
session: <requests.sessions.Session object at 0x124857358>
uri_prefix: https://
timeout: 100
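
The comments in the cell above note that requests without an app token are rate-limited. A minimal sketch of reading a token from an environment variable, rather than hard-coding it in the notebook, might look like the following. The variable name `SOCRATA_APP_TOKEN` and the helpers `load_app_token` and `make_client` are assumptions made for illustration; they are not part of sodapy.

```python
import os


def load_app_token(env=None, var_name="SOCRATA_APP_TOKEN"):
    """Return the app token from the environment, or None if it is unset.

    SOCRATA_APP_TOKEN is a hypothetical variable name chosen for this sketch.
    """
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    return token or None


def make_client(domain, timeout=100, env=None):
    """Build a sodapy client, falling back to unauthenticated access."""
    # Imported here so load_app_token can be used without sodapy installed.
    from sodapy import Socrata

    return Socrata(domain, load_app_token(env), timeout=timeout)
```

With this in place, `client = make_client(socrata_domain)` behaves like the cell above when no token is set, and `client.close()` releases the underlying session when you are finished.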


# 5. Sodapy Methods

## 5.1  `.datasets()`

`datasets` method: Returns the list of datasets associated with a particular domain.
WARNING: Large limits (>1000) will return megabytes of data,
which can be slow on low-bandwidth networks, and is also a lot of
data to hold in memory.
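
Because each call to `client.datasets()` sends a new request to the catalog, it may help to call it once and reuse the result. The sketch below assumes the `limit` keyword from the signature shown in the sodapy README (`datasets(limit=0, offset=0)`); `fetch_datasets` and its module-level cache are hypothetical helpers, not sodapy features.

```python
_dataset_cache = {}


def fetch_datasets(client, limit=0):
    """Return client.datasets(limit=limit), requesting each limit only once.

    `client` is expected to be a sodapy Socrata object (or anything with a
    compatible `datasets` method). Results are cached per (domain, limit).
    """
    key = (getattr(client, "domain", None), limit)
    if key not in _dataset_cache:
        _dataset_cache[key] = client.datasets(limit=limit)
    return _dataset_cache[key]
```

For example, `fetch_datasets(client, limit=10)` keeps the response small while exploring the structure of the returned dictionaries.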

In [305]:
print(type(client.datasets()))

<class 'list'>


In [306]:
len(client.datasets())

3179

In [307]:
print('Number of data sets on NYC Open Data: {}.'.format(len(client.datasets())))

Number of data sets on NYC Open Data: 3179.


In [308]:
print(type(client.datasets()[0]))

<class 'dict'>


In [309]:
client.datasets()[0].keys()

dict_keys(['resource', 'classification', 'metadata', 'permalink', 'link', 'owner', 'creator'])

In [310]:
for key in client.datasets()[0].keys():
    print(key)

resource
classification
metadata
permalink
link
owner
creator


In [311]:
client.datasets()[0].items()

dict_items([('resource', {'name': 'DOB Job Application Filings', 'id': 'ic3t-wcy2', 'parent_fxf': [], 'description': 'This dataset contains all job applications submitted through the Borough Offices, through eFiling, or through the HUB, which have a "Latest Action Date" since January 1, 2000. This dataset does not include jobs submitted through DOB NOW. See the DOB NOW: Build – Job Application Filings dataset for DOB NOW jobs.', 'attribution': 'Department of Buildings (DOB)', 'attribution_link': None, 'contact_email': None, 'type': 'dataset', 'updatedAt': '2021-02-06T21:17:53.000Z', 'createdAt': '2013-04-18T15:18:56.000Z', 'metadata_updated_at': '2020-06-23T02:08:44.000Z', 'data_updated_at': '2021-02-06T21:17:53.000Z', 'page_views': {'page_views_last_week': 618, 'page_views_last_month': 2534, 'page_views_total': 2263841, 'page_views_last_week_log': 9.273795599214266, 'page_views_last_month_log': 11.307770031890703, 'page_views_total_log': 21.11034184119946}, 'columns_name': ["Owner's L

In [312]:
type(client.datasets()[0]['resource'])

dict

In [313]:
for key in client.datasets()[0]['resource'].keys():
    print(key)

name
id
parent_fxf
description
attribution
attribution_link
contact_email
type
updatedAt
createdAt
metadata_updated_at
data_updated_at
page_views
columns_name
columns_field_name
columns_datatype
columns_description
columns_format
download_count
provenance
lens_view_type
blob_mime_type
hide_from_data_json
publication_date


In [314]:
for key in client.datasets()[0]['classification'].keys():
    print(key)

categories
tags
domain_category
domain_tags
domain_metadata


In [315]:
for key, value in client.datasets()[0]['classification'].items():
    print(key + ':', value)

categories: ['economy', 'environment', 'housing & development']
tags: []
domain_category: Housing & Development
domain_tags: ['buildings', 'dob', 'job']
domain_metadata: [{'key': 'Update_Automation', 'value': 'Yes'}, {'key': 'Update_Date-Made-Public', 'value': '4/26/2013'}, {'key': 'Update_Update-Frequency', 'value': 'Daily'}, {'key': 'Dataset-Information_Agency', 'value': 'Department of Buildings (DOB)'}]
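
The `domain_metadata` value is a list of `{'key': ..., 'value': ...}` pairs, which is awkward to query directly. One possible way to flatten it into a single dictionary is sketched here; `domain_metadata_to_dict` is a hypothetical helper, and the sample values are copied from the output above.

```python
def domain_metadata_to_dict(classification):
    """Flatten the list of {'key': ..., 'value': ...} pairs into one dict."""
    pairs = classification.get("domain_metadata", [])
    return {pair["key"]: pair["value"] for pair in pairs}


# Sample copied from the classification output above.
sample_classification = {
    "domain_category": "Housing & Development",
    "domain_metadata": [
        {"key": "Update_Automation", "value": "Yes"},
        {"key": "Update_Date-Made-Public", "value": "4/26/2013"},
        {"key": "Update_Update-Frequency", "value": "Daily"},
        {"key": "Dataset-Information_Agency", "value": "Department of Buildings (DOB)"},
    ],
}

flat = domain_metadata_to_dict(sample_classification)
print(flat["Update_Update-Frequency"])  # Daily
```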


In [316]:
stop = 0

for key, value in list(client.datasets()[0]['resource'].items()):
    print(key + ':', value, '\n')
    stop += 1
    
    if stop == 5:
        break
        
print('\nPreviewing top {} results in resource dictionary'.format(stop))        

name: DOB Job Application Filings 

id: ic3t-wcy2 

parent_fxf: [] 

description: This dataset contains all job applications submitted through the Borough Offices, through eFiling, or through the HUB, which have a "Latest Action Date" since January 1, 2000. This dataset does not include jobs submitted through DOB NOW. See the DOB NOW: Build – Job Application Filings dataset for DOB NOW jobs. 

attribution: Department of Buildings (DOB) 


Previewing top 5 results in resource dictionary


In [317]:
stop = 0

for key, value in list(client.datasets()[0]['classification'].items()):
    print(key + ':', value, '\n')
    stop += 1
    
    if stop == 5:
        break
        
print('\nPreviewing top {} results in classification dictionary'.format(stop))        

categories: ['economy', 'environment', 'housing & development'] 

tags: [] 

domain_category: Housing & Development 

domain_tags: ['buildings', 'dob', 'job'] 

domain_metadata: [{'key': 'Update_Automation', 'value': 'Yes'}, {'key': 'Update_Date-Made-Public', 'value': '4/26/2013'}, {'key': 'Update_Update-Frequency', 'value': 'Daily'}, {'key': 'Dataset-Information_Agency', 'value': 'Department of Buildings (DOB)'}] 


Previewing top 5 results in classification dictionary


In [318]:
stop = 0

for key, value in list(client.datasets()[0]['metadata'].items()):
    print(key + ':', value, '\n')
    stop += 1
    
    if stop == 5:
        break
        
print('\nPreviewing top {} results in metadata dictionary'.format(stop))

domain: data.cityofnewyork.us 


Previewing top 1 results in metadata dictionary


In [319]:
stop = 0

for key, value in list(client.datasets()[0]['owner'].items()):
    print(key + ':', value, '\n')
    stop += 1
    
    if stop == 5:
        break
        
print('\nPreviewing top {} results in owner dictionary'.format(stop))

id: 5fuc-pqz2 

user_type: interactive 

display_name: NYC OpenData 


Previewing top 3 results in owner dictionary


In [320]:
stop = 0

for key, value in list(client.datasets()[0]['creator'].items()):
    print(key + ':', value, '\n')
    stop += 1
    
    if stop == 5:
        break
        
print('\nPreviewing top {} results in creator dictionary'.format(stop))        

id: 5fuc-pqz2 

user_type: interactive 

display_name: NYC OpenData 


Previewing top 3 results in creator dictionary
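
The preview cells above repeat the same counter-and-break loop for each dictionary. One way to factor that pattern out is `itertools.islice` from the standard library; `preview` is a hypothetical helper name, and the sample dictionary is copied from the owner output above.

```python
from itertools import islice


def preview(mapping, n=5):
    """Return the first n (key, value) pairs of a dictionary as a list."""
    return list(islice(mapping.items(), n))


# Values copied from the owner dictionary printed above.
owner = {"id": "5fuc-pqz2", "user_type": "interactive", "display_name": "NYC OpenData"}

for key, value in preview(owner):
    print(key + ":", value)
```

`islice` stops early when the dictionary has fewer than `n` items, so no explicit counter is needed.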


In [321]:
# Once we've identified the structure of the dictionary, we try to 
# find the 311 data set and identify its position in the datasets list.

# Requesting the catalog once and reusing it avoids a new request on every iteration.
datasets = client.datasets()

for idx, dataset in enumerate(datasets):
    if dataset['resource']['id'] == socrata_dataset_identifier:
        print('We found the NYC 311 data set!\n')
        print(dataset['resource']['name'], '\nIndex is:', idx)
        break

We found the NYC 311 data set!

311 Service Requests from 2010 to Present 
Index is: 5
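
When several data sets need to be located, scanning the list once and building a lookup table may be more convenient than looping for each id. The following is a small sketch of that idea; `index_by_id` is a hypothetical helper, and the two-entry list is an abbreviated stand-in for `client.datasets()` whose positions are illustrative only.

```python
def index_by_id(datasets):
    """Map each dataset id to its position in the datasets list."""
    return {entry["resource"]["id"]: i for i, entry in enumerate(datasets)}


# Abbreviated stand-in for client.datasets(); the ids and names appear in this
# tutorial, but the positions in this short list are illustrative.
sample_datasets = [
    {"resource": {"id": "ic3t-wcy2", "name": "DOB Job Application Filings"}},
    {"resource": {"id": "erm2-nwe9", "name": "311 Service Requests from 2010 to Present"}},
]

positions = index_by_id(sample_datasets)
sample_idx = positions.get("erm2-nwe9")  # None when the id is not in the list
print(sample_datasets[sample_idx]["resource"]["name"])
```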


In [322]:
# Saving the index of the 311 data set in the datasets list for reuse in the cells below.

idx_311 = idx
print('The 311 data set index in the datasets list is:', idx_311)

The 311 data set index in the datasets list is: 5


In [323]:
# Since the output of the datasets method is long, let's list the keys of the resource
# dictionary to identify the ones we want to preview.

for key in client.datasets()[idx_311]['resource'].keys():
    print(key)

name
id
parent_fxf
description
attribution
attribution_link
contact_email
type
updatedAt
createdAt
metadata_updated_at
data_updated_at
page_views
columns_name
columns_field_name
columns_datatype
columns_description
columns_format
download_count
provenance
lens_view_type
blob_mime_type
hide_from_data_json
publication_date


In [324]:
items = client.datasets()[idx_311]['resource'].items()

for key, value in items:
    print(key + ':', str(value) + '\n')

name: 311 Service Requests from 2010 to Present

id: erm2-nwe9

parent_fxf: []

description: <b>NOTE: This data does not present a full picture of 311 calls or service requests, in part because of operational and system complexities associated with remote call taking necessitated by the unprecedented volume 311 is handling during the Covid-19 crisis. The City is working to address this issue. </b>

All 311 Service Requests from 2010 to present. This information is automatically updated daily.

attribution: 311, DoITT

attribution_link: None

contact_email: None

type: dataset

updatedAt: 2021-02-07T02:34:01.000Z

createdAt: 2011-10-10T05:52:17.000Z

metadata_updated_at: 2020-04-22T20:18:38.000Z

data_updated_at: 2021-02-07T02:34:01.000Z

page_views: {'page_views_last_week': 1276, 'page_views_last_month': 5433, 'page_views_total': 440401, 'page_views_last_week_log': 10.318542809702723, 'page_views_last_month_log': 12.407798850221974, 'page_views_total_log': 18.748461495072814}

column

In [325]:
type(items)

dict_items

In [326]:
dict(items)['name']

'311 Service Requests from 2010 to Present'

In [327]:
print('The {} data description:\n\n{}'.format(dict(items)['name'], dict(items)['description']))

The 311 Service Requests from 2010 to Present data description:

<b>NOTE: This data does not present a full picture of 311 calls or service requests, in part because of operational and system complexities associated with remote call taking necessitated by the unprecedented volume 311 is handling during the Covid-19 crisis. The City is working to address this issue. </b>

All 311 Service Requests from 2010 to present. This information is automatically updated daily.


In [328]:
print('The {} data was created at:\n\n{}'.format(dict(items)['name'], dict(items)['createdAt']),
      'and last updated at:\n\n{}.'.format(dict(items)['data_updated_at']))

The 311 Service Requests from 2010 to Present data was created at:

2011-10-10T05:52:17.000Z and last updated at:

2021-02-07T02:34:01.000Z.
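
The catalog returns these timestamps as ISO 8601 strings. If you want to compare or subtract them, they can be parsed into `datetime` objects. This is a minimal sketch using only the standard library; `parse_socrata_timestamp` is a hypothetical helper, and it assumes every timestamp follows the `YYYY-MM-DDTHH:MM:SS.fffZ` pattern seen above.

```python
from datetime import datetime, timezone


def parse_socrata_timestamp(text):
    """Parse a catalog timestamp such as '2011-10-10T05:52:17.000Z' as UTC."""
    parsed = datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


# Values copied from the createdAt and data_updated_at fields above.
created = parse_socrata_timestamp("2011-10-10T05:52:17.000Z")
updated = parse_socrata_timestamp("2021-02-07T02:34:01.000Z")

print(created.date(), updated.date())
print("Days between creation and last update:", (updated - created).days)
```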


In [329]:
print('311 page views information:\n')

for key, value in dict(items)['page_views'].items():
    print(key + ':', value)

311 page views information:

page_views_last_week: 1276
page_views_last_month: 5433
page_views_total: 440401
page_views_last_week_log: 10.318542809702723
page_views_last_month_log: 12.407798850221974
page_views_total_log: 18.748461495072814


In [330]:
print('The NYC 311 data view count: {}'.format(f"{dict(items)['page_views']['page_views_total']:,}"), \
      
     'and download count: {}.'.format(f"{dict(items)['download_count']:,}"))

The NYC 311 data view count: 440,401 and download count: 398,513.


##  5.2 `.get()`

`get` method: Read data from the requested resource. Options for content_type are json,
csv, and xml.

In [331]:
client = Socrata(socrata_domain, None, timeout=60)



In [332]:
# Using try and except statements because these requests are large and may time out.
# If the request times out, we print a message and move on.

try:
    print(type(client.get(socrata_dataset_identifier)))
except Exception:
    print('timeout error.')

<class 'list'>
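
Skipping a request that times out is one option; retrying it a few times is another. The helper below is a generic sketch of that approach and is not part of sodapy. The name `call_with_retries` and its parameters are assumptions, and by default it retries on any `Exception`, which you may want to narrow to the timeout error raised by your HTTP library.

```python
import time


def call_with_retries(func, attempts=3, delay_seconds=2.0, retry_on=(Exception,)):
    """Call func() up to `attempts` times, sleeping between failed tries.

    Returns the first successful result and re-raises the last error if
    every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise
            time.sleep(delay_seconds)
```

A possible use with the client from this notebook is `call_with_retries(lambda: client.get(socrata_dataset_identifier, limit=100))`.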


In [333]:
# Using try and except statements because these requests are large and may time out.
# If the request times out, we print a message and move on.

# printing the column headers

try:
    # The request itself goes inside the try block so that a timeout is caught.
    keys = client.get(socrata_dataset_identifier)[0].keys()
    for key in keys:
        print(key)
except Exception:
    print('timeout error.')

unique_key
created_date
agency
agency_name
complaint_type
descriptor
location_type
incident_zip
incident_address
street_name
cross_street_1
cross_street_2
intersection_street_1
intersection_street_2
city
landmark
status
community_board
bbl
borough
x_coordinate_state_plane
y_coordinate_state_plane
open_data_channel_type
park_facility_name
park_borough
latitude
longitude
location
:@computed_region_efsh_h5xi
:@computed_region_f5dn_yrer
:@computed_region_yeji_bk3q
:@computed_region_92fq_4b7q
:@computed_region_sbqj_enih


In [334]:
# Printing the column and value of the first record

try:
    # The request itself goes inside the try block so that a timeout is caught.
    items = client.get(socrata_dataset_identifier)[0].items()
    for key, value in items:
        print(key + ':',  value)
except Exception:
    print('timeout error.')

unique_key: 49721241
created_date: 2021-02-06T02:06:28.000
agency: NYPD
agency_name: New York City Police Department
complaint_type: Noise - Street/Sidewalk
descriptor: Loud Talking
location_type: Street/Sidewalk
incident_zip: 11214
incident_address: 1901 84 STREET
street_name: 84 STREET
cross_street_1: 19 AVENUE
cross_street_2: 20 AVENUE
intersection_street_1: 19 AVENUE
intersection_street_2: 20 AVENUE
city: BROOKLYN
landmark: 84 STREET
status: In Progress
community_board: 11 BROOKLYN
bbl: 3063280001
borough: BROOKLYN
x_coordinate_state_plane: 984553
y_coordinate_state_plane: 160388
open_data_channel_type: ONLINE
park_facility_name: Unspecified
park_borough: BROOKLYN
latitude: 40.60690434935019
longitude: -73.99890876884116
location: {'latitude': '40.60690434935019', 'longitude': '-73.99890876884116', 'human_address': '{"address": "", "city": "", "state": "", "zip": ""}'}
:@computed_region_efsh_h5xi: 17616
:@computed_region_f5dn_yrer: 1
:@computed_region_yeji_bk3q: 2
:@computed_region
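
In the record above, every value arrives as a string, and the `location` field nests a JSON string under `human_address`. A short sketch of converting that field into numeric coordinates and a parsed address follows; `parse_location` is a hypothetical helper, and the sample value is copied from the first record printed above.

```python
import json


def parse_location(location):
    """Convert a SODA location value into floats and a parsed address dict."""
    return {
        "latitude": float(location["latitude"]),
        "longitude": float(location["longitude"]),
        # 'human_address' arrives as a JSON string inside the dictionary.
        "address": json.loads(location.get("human_address", "{}")),
    }


# Value copied from the first record printed above.
sample_location = {
    "latitude": "40.60690434935019",
    "longitude": "-73.99890876884116",
    "human_address": '{"address": "", "city": "", "state": "", "zip": ""}',
}

parsed_location = parse_location(sample_location)
print(parsed_location["latitude"], parsed_location["longitude"])
print(parsed_location["address"])
```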

In [335]:
# Source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# Data set id for NYC 311 on NYC Open Data on Socrata
socrata_dataset_identifier = 'erm2-nwe9'

# Socrata - The main class that interacts with the SODA API. We pass the source domain value
# of NYC Open data, the app token as 'None', and set the timeout parameter for '1,000 seconds'
client = Socrata(socrata_domain, None, timeout=1000)

# SoQL query string below:
# retrieve all columns and limit our records to 100.

query = """
SELECT 
    *
LIMIT 
    100
"""

# Returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get(socrata_dataset_identifier, query=query)

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)

print('shape of data:', results_df.shape)

results_df.head()



shape of data: (100, 37)


| | unique_key | created_date | agency | agency_name | complaint_type | descriptor | location_type | incident_zip | incident_address | street_name | ... | location | :@computed_region_efsh_h5xi | :@computed_region_f5dn_yrer | :@computed_region_yeji_bk3q | :@computed_region_92fq_4b7q | :@computed_region_sbqj_enih | taxi_pick_up_location | closed_date | resolution_description | resolution_action_updated_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 49721241 | 2021-02-06T02:06:28.000 | NYPD | New York City Police Department | Noise - Street/Sidewalk | Loud Talking | Street/Sidewalk | 11214 | 1901 84 STREET | 84 STREET | ... | {'latitude': '40.60690434935019', 'longitude':... | 17616 | 1 | 2 | 45 | 37 | | | | |
| 1 | 49718744 | 2021-02-06T02:05:20.000 | NYPD | New York City Police Department | Noise - Residential | Loud Music/Party | Residential Building/House | 10009 | 202 AVENUE A | AVENUE A | ... | {'latitude': '40.729457068256956', 'longitude'... | 11729 | 70 | 4 | 50 | 5 | | | | |
| 2 | 49720071 | 2021-02-06T02:04:41.000 | TLC | Taxi and Limousine Commission | Taxi Complaint | Driver Complaint - Passenger | | 11354 | 135-41 ROOSEVELT AVENUE | ROOSEVELT AVENUE | ... | {'latitude': '40.75922794840877', 'longitude':... | 13832 | 22 | 3 | 3 | 67 | 135-41 ROOSEVELT AVENUE, QUEENS (FLUSHING), NY... | | | |
| 3 | 49717575 | 2021-02-06T02:04:33.000 | NYPD | New York City Police Department | Noise - Residential | Loud Music/Party | Residential Building/House | 11233 | 1851 EASTERN PARKWAY | EASTERN PARKWAY | ... | {'latitude': '40.675706919721506', 'longitude'... | 13516 | 55 | 2 | 37 | 46 | | | | |
| 4 | 49723996 | 2021-02-06T02:03:54.000 | NYPD | New York City Police Department | Noise - Residential | Loud Music/Party | Residential Building/House | 10032 | 650 WEST 171 STREET | WEST 171 STREET | ... | {'latitude': '40.843571750739436', 'longitude'... | 13090 | 47 | 4 | 39 | 21 | | | | |
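
The query above caps the result at 100 rows. To read more rows without issuing one very large request, a common approach is to page through the data. The sketch below assumes the `limit`, `offset`, and `order` keyword arguments that sodapy's `get` forwards to the API, and it assumes ordering by the `:id` system field gives stable pages; `iter_rows`, `page_size`, and `max_rows` are hypothetical names.

```python
def iter_rows(client, dataset_id, page_size=1000, max_rows=5000, order=":id"):
    """Yield up to max_rows records, requesting page_size rows at a time."""
    offset = 0
    while offset < max_rows:
        limit = min(page_size, max_rows - offset)
        page = client.get(dataset_id, limit=limit, offset=offset, order=order)
        if not page:
            break
        yield from page
        if len(page) < limit:
            break  # a short page means the last rows were reached
        offset += len(page)
```

The rows can then be collected into a DataFrame with `pd.DataFrame.from_records(list(iter_rows(client, socrata_dataset_identifier, max_rows=2000)))`.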


## 5.3  `.get_metadata()`

`get_metadata` method: Retrieve the metadata for a particular dataset.

In [336]:
client = Socrata(socrata_domain, None, timeout=100)



In [337]:
type(client.get_metadata(socrata_dataset_identifier))

dict

In [338]:
# Previewing keys vertically.
keys = client.get_metadata(socrata_dataset_identifier).keys()
for key in keys:
    print(key)

id
name
assetType
attribution
averageRating
category
createdAt
description
displayType
downloadCount
hideFromCatalog
hideFromDataJson
indexUpdatedAt
newBackend
numberOfComments
oid
provenance
publicationAppendEnabled
publicationDate
publicationGroup
publicationStage
rowClass
rowIdentifierColumnId
rowsUpdatedAt
rowsUpdatedBy
tableId
totalTimesRated
viewCount
viewLastModified
viewType
approvals
columns
grants
metadata
owner
query
rights
tableAuthor
tags
flags


In [339]:
# Previewing the id and name of the data set.
print('id and name of dataset\n' + \
      '-' * 30 + \
      '\nid:', client.get_metadata(socrata_dataset_identifier)['id'], \
      '\nname:', client.get_metadata(socrata_dataset_identifier)['name'])

id and name of dataset
------------------------------
id: erm2-nwe9 
name: 311 Service Requests from 2010 to Present


In [340]:
# Previewing the first 30 keys, values of the dictionary.
metadata_items = client.get_metadata(socrata_dataset_identifier).items()
print(type(metadata_items), '\n')

metadata_items = list(metadata_items)
print(type(metadata_items), '\n')

stop = 0
for key, value in metadata_items:
    print(key + ':',  value)
    stop += 1
    
    if stop == 30:
        break

<class 'dict_items'> 

<class 'list'> 

id: erm2-nwe9
name: 311 Service Requests from 2010 to Present
assetType: dataset
attribution: 311, DoITT
averageRating: 0
category: Social Services
createdAt: 1318225937
description: <b>NOTE: This data does not present a full picture of 311 calls or service requests, in part because of operational and system complexities associated with remote call taking necessitated by the unprecedented volume 311 is handling during the Covid-19 crisis. The City is working to address this issue. </b>

All 311 Service Requests from 2010 to present. This information is automatically updated daily.
displayType: table
downloadCount: 398513
hideFromCatalog: False
hideFromDataJson: False
indexUpdatedAt: 1571326778
newBackend: True
numberOfComments: 19
oid: 28506835
provenance: official
publicationAppendEnabled: False
publicationDate: 1524193398
publicationGroup: 244403
publicationStage: published
rowClass: 
rowIdentifierColumnId: 354922030
rowsUpdatedAt: 1612665241
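
Unlike the catalog output from `.datasets()`, the date fields here are integers. Treating them as Unix epoch seconds is an assumption, though `createdAt` then agrees with the `2011-10-10T05:52:17.000Z` value shown in Section 5.1. A small sketch of converting them to readable UTC strings follows; `readable_timestamps` and `EPOCH_FIELDS` are hypothetical names.

```python
from datetime import datetime, timezone

# Fields from the output above that look like epoch seconds (an assumption).
EPOCH_FIELDS = ("createdAt", "indexUpdatedAt", "publicationDate", "rowsUpdatedAt")


def readable_timestamps(metadata, fields=EPOCH_FIELDS):
    """Return ISO 8601 UTC strings for the epoch-second fields that are present."""
    return {
        name: datetime.fromtimestamp(metadata[name], tz=timezone.utc).isoformat()
        for name in fields
        if name in metadata
    }


# Values copied from the metadata output above.
sample_metadata = {"createdAt": 1318225937, "rowsUpdatedAt": 1612665241}
print(readable_timestamps(sample_metadata))
```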

In [341]:
# Previewing the first item in columns list
metadata_items = client.get_metadata(socrata_dataset_identifier)

print("metadata_items['columns']:", type(metadata_items['columns']))

metadata_items['columns'][0]

metadata_items['columns']: <class 'list'>


{'id': 354922030,
 'name': 'Unique Key',
 'dataTypeName': 'text',
 'description': 'Unique identifier of a Service Request (SR) in the open data set\n',
 'fieldName': 'unique_key',
 'position': 1,
 'renderTypeName': 'text',
 'tableColumnId': 1567787,
 'width': 220,
 'cachedContents': {'largest': '49724656',
  'non_null': '24895045',
  'null': '0',
  'top': [{'item': '10693408', 'count': '1'},
   {'item': '10836749', 'count': '1'},
   {'item': '10836967', 'count': '1'},
   {'item': '11051177', 'count': '1'},
   {'item': '11413576', 'count': '1'},
   {'item': '11463895', 'count': '1'},
   {'item': '11463896', 'count': '1'},
   {'item': '11464334', 'count': '1'},
   {'item': '11464394', 'count': '1'},
   {'item': '11464467', 'count': '1'},
   {'item': '11464508', 'count': '1'},
   {'item': '11464509', 'count': '1'},
   {'item': '11464521', 'count': '1'},
   {'item': '11464567', 'count': '1'},
   {'item': '11464572', 'count': '1'},
   {'item': '11464639', 'count': '1'},
   {'item': '1146484

In [342]:
for key in metadata_items['columns'][0].keys():
    print(key)

id
name
dataTypeName
description
fieldName
position
renderTypeName
tableColumnId
width
cachedContents
format


In [343]:
# Saving metadata dictionary as 'metadata'
metadata = client.get_metadata(socrata_dataset_identifier)
print('metadata', type(metadata))

# Previewing the datatype of columns
print("metadata['columns']", type(metadata['columns']), 'length:', len(metadata['columns']), '\n')

print('columns', '\n' + '-----------')

# Previewing the field names for each element in our columns list
for x in metadata['columns']:
    print(x['fieldName'])

metadata <class 'dict'>
metadata['columns'] <class 'list'> length: 46 

columns 
-----------
unique_key
created_date
closed_date
agency
agency_name
complaint_type
descriptor
location_type
incident_zip
incident_address
street_name
cross_street_1
cross_street_2
intersection_street_1
intersection_street_2
address_type
city
landmark
facility_type
status
due_date
resolution_description
resolution_action_updated_date
community_board
bbl
borough
x_coordinate_state_plane
y_coordinate_state_plane
open_data_channel_type
park_facility_name
park_borough
vehicle_type
taxi_company_borough
taxi_pick_up_location
bridge_highway_name
bridge_highway_direction
road_ramp
bridge_highway_segment
latitude
longitude
location
:@computed_region_efsh_h5xi
:@computed_region_f5dn_yrer
:@computed_region_yeji_bk3q
:@computed_region_92fq_4b7q
:@computed_region_sbqj_enih
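
The metadata lists 46 columns, while the DataFrame built in Section 5.2 has 37. A likely explanation, stated here as an assumption, is that the JSON response omits fields that are empty in a record, so columns with no values in the sampled rows never appear. The sketch below shows one way to check which fields are missing and to look up each column's data type; both helpers are hypothetical, and the `closed_date` data type in the sample is a placeholder for illustration.

```python
def field_types(columns):
    """Map each column's fieldName to its dataTypeName."""
    return {col["fieldName"]: col["dataTypeName"] for col in columns}


def fields_missing_from(records, columns):
    """List metadata field names that appear in none of the given records."""
    seen = set()
    for record in records:
        seen.update(record.keys())
    return [col["fieldName"] for col in columns if col["fieldName"] not in seen]


# Tiny stand-in data. Only the unique_key entry is copied from the output
# above; the closed_date data type is a placeholder.
sample_columns = [
    {"fieldName": "unique_key", "dataTypeName": "text"},
    {"fieldName": "closed_date", "dataTypeName": "calendar_date"},
]
sample_records = [{"unique_key": "49721241"}]

print(field_types(sample_columns))
print(fields_missing_from(sample_records, sample_columns))
```

With the objects from this notebook, the equivalent call would be `fields_missing_from(results, metadata['columns'])`.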


# 6. Conclusion