# Sodapy Basics Tutorial Using NYC Open Data 
Mark Bauer

Table of Contents
=================

   1. Introduction
   2. Sodapy
       - 2.1 Using Sodapy
       - 2.2 Sodapy Methods
   3. Importing Libraries
   4. Socrata Class
   5. Sodapy Methods
       - 5.1 .datasets()
       - 5.2 .get()
       - 5.3 .get_metadata()

# 1. Introduction  
This notebook demonstrates how to use sodapy, the Python client for the Socrata Open Data API (SODA), and reviews several methods for retrieving data from Socrata Open Data. The data in this tutorial is from NYC Open Data.

# 2. Sodapy

## 2.1 Using Sodapy

To use sodapy, a source domain (i.e. the Socrata Open Data source you are trying to connect to) needs to be passed to the Socrata class. Additionally, to query a specific dataset on Socrata Open Data, the dataset identifier (i.e. the dataset id on the given source domain) needs to be passed as well. Below, we identify NYC Open Data's source domain, `data.cityofnewyork.us`, and the dataset identifier for the NYC 311 dataset, `erm2-nwe9`. The screenshot below shows where this information can be found.

![nyc-311-api-docs](images/nyc-311-api-docs.png)  

**Source**: https://dev.socrata.com/foundry/data.cityofnewyork.us/erm2-nwe9
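As a minimal sketch of how the two values fit together, the helper below checks that a dataset identifier has the usual Socrata "four-by-four" shape and builds the corresponding SODA resource URL. The function name `resource_url` and the regular expression are illustrative assumptions, not part of sodapy.

```python
import re

# Assumption: dataset identifiers follow the "four-by-four" pattern, e.g. erm2-nwe9.
DATASET_ID_PATTERN = re.compile(r"^[a-z0-9]{4}-[a-z0-9]{4}$")


def resource_url(domain, dataset_id, fmt="json"):
    """Build the SODA resource URL for a dataset (hypothetical helper)."""
    if not DATASET_ID_PATTERN.match(dataset_id):
        raise ValueError("unexpected dataset identifier: {!r}".format(dataset_id))
    return "https://{}/resource/{}.{}".format(domain, dataset_id, fmt)


print(resource_url("data.cityofnewyork.us", "erm2-nwe9"))
```

sodapy assembles requests like this internally, so the helper is only meant to show what the domain and identifier refer to.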

## 2.2 Sodapy Methods

![socrata-methods](images/socrata-methods.png)

**Source**: https://github.com/xmunoz/sodapy#datasetslimit0-offset0

We will be focusing on three sodapy methods:
-  `.datasets()`
Returns the list of datasets associated with a particular domain.

- `.get()`
Read data from the requested resource. Options for content_type are json, csv, and xml.

- `.get_metadata()`
Retrieve the metadata for a particular dataset.

# 3. Importing Libraries

In [1]:
# importing libraries
import pandas as pd
from sodapy import Socrata

In [2]:
# documentation for installing watermark: https://github.com/rasbt/watermark
%reload_ext watermark
%watermark -u -t -d -v -p pandas,sodapy

Last updated: 2023-07-07 13:10:10

Python implementation: CPython
Python version       : 3.8.13
IPython version      : 8.4.0

pandas: 1.4.2
sodapy: 2.2.0



# 4. Socrata Class 

In [3]:
# source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# Socrata(): The main class that interacts with the SODA API.

# The required arguments are:
#     domain: the domain you wish to access
#     app_token: your Socrata application token
# Simple requests are possible without an app_token, though these
# requests will be rate-limited.

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

client



<sodapy.socrata.Socrata at 0x160fc8bb0>

In [4]:
# printing information about the Socrata object
print(client)
print('type: {}'.format(type(client)))

<sodapy.socrata.Socrata object at 0x160fc8bb0>
type: <class 'sodapy.socrata.Socrata'>


In [5]:
# printing attributes of object
for key, value in client.__dict__.items():
    print('{}: {}'.format(key, value))

domain: data.cityofnewyork.us
session: <requests.sessions.Session object at 0x160fc8f10>
uri_prefix: https://
timeout: 100


In [6]:
# close the session when finished
client.close()
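The cell above passes `app_token=None` and closes the session by hand. The sketch below shows one possible way to read a token from an environment variable and to guarantee that `.close()` runs even if an error occurs, using `contextlib.closing` from the standard library. The variable name `SOCRATA_APP_TOKEN` and both helper functions are assumptions made for illustration.

```python
import os
from contextlib import closing


def client_kwargs(env_var="SOCRATA_APP_TOKEN", timeout=100):
    """Return keyword arguments for Socrata(); app_token is None if the variable is unset."""
    token = os.environ.get(env_var) or None
    return {"app_token": token, "timeout": timeout}


def run_with_client(make_client, work):
    """Create a client, run work(client), and always call client.close() afterwards."""
    with closing(make_client()) as client:
        return work(client)


# Example usage (requires sodapy and network access), left commented out:
# from sodapy import Socrata
# count = run_with_client(
#     lambda: Socrata("data.cityofnewyork.us", **client_kwargs()),
#     lambda client: len(client.datasets(limit=5)),
# )
```

Keeping the token out of the notebook avoids committing it by accident, and unauthenticated requests still work when the variable is not set.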

# 5. Sodapy Methods

## 5.1  `.datasets()`

`datasets` method: Returns the list of datasets associated with a particular domain.

**Warning**: Large limits (>1000) will return megabytes of data, which can be slow on low-bandwidth networks and is also a lot of data to hold in memory.

In [7]:
# source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

print('object type: {}'.format(type(client.datasets())))

datasets = len(client.datasets())
print('Number of datasets on NYC Open Data: {:,}.'.format(datasets))



object type: <class 'list'>
Number of datasets on NYC Open Data: 3,525.
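Given the warning about large responses, a lighter way to explore the catalog is to request only a few entries and keep the result in a variable instead of calling `.datasets()` repeatedly. The sketch below assumes the `limit` keyword shown in the sodapy README; the helper name `preview_datasets` is hypothetical.

```python
def preview_datasets(client, limit=5):
    """Return (id, name) pairs for the first few datasets on the client's domain."""
    # One request only; the returned list is reused below.
    sample = client.datasets(limit=limit)
    return [(d["resource"]["id"], d["resource"]["name"]) for d in sample]


# Example usage with a sodapy client (requires network access), left commented out:
# for dataset_id, name in preview_datasets(client, limit=5):
#     print("{}: {}".format(dataset_id, name))
```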


In [8]:
# reviewing the type of the first dataset
print(type(client.datasets()[0]))

<class 'dict'>


In [9]:
# reviewing the keys of the first dataset
client.datasets()[0].keys()

dict_keys(['resource', 'classification', 'metadata', 'permalink', 'link', 'owner', 'creator'])

In [10]:
# reviewing information about resource key
for key, value in client.datasets()[0]['resource'].items():
    print('{}: {}\n'.format(key, value))      

name: Civil Service List (Active)

id: vx8i-nprf

parent_fxf: []

description: A Civil Service List consists of all candidates who passed an exam, ranked in score order. An established list is considered active for no less than one year and no more than four years from the date of establishment. For more information visit DCAS’ “Work for the City” webpage at: https://www1.nyc.gov/site/dcas/employment/take-an-exam.page

attribution: Department of Citywide Administrative Services (DCAS)

attribution_link: None

contact_email: None

type: dataset

updatedAt: 2023-07-07T13:34:17.000Z

createdAt: 2016-06-14T21:12:15.000Z

metadata_updated_at: 2023-07-07T13:34:17.000Z

data_updated_at: 2023-07-07T13:24:32.000Z

page_views: {'page_views_last_week': 4033, 'page_views_last_month': 19065, 'page_views_total': 2311699, 'page_views_last_week_log': 11.977995368612962, 'page_views_last_month_log': 14.218714581067502, 'page_views_total_log': 21.14052275410272}

columns_name: ['List Div Code', 'List Ag

In [11]:
# reviewing information about classification key
for key, value in client.datasets()[0]['classification'].items():
    print('{}: {}'.format(key, value))

categories: []
tags: []
domain_category: City Government
domain_tags: ['2018od4a-report', '2018od4a-video']
domain_metadata: [{'key': 'Update_Automation', 'value': 'Yes'}, {'key': 'Update_Date-Made-Public', 'value': '7/12/2016'}, {'key': 'Update_Update-Frequency', 'value': 'Daily'}, {'key': 'Update_Data-Change-Frequency', 'value': 'Weekly'}, {'key': 'Dataset-Information_Agency', 'value': 'Department of Citywide Administrative Services (DCAS)'}]


In [12]:
# reviewing information about metadata key
for key, value in client.datasets()[0]['metadata'].items():
    print('{}: {}\n'.format(key, value))

domain: data.cityofnewyork.us



In [13]:
for key, value in client.datasets()[0]['owner'].items():
    print('{}: {}'.format(key, value))

id: 5fuc-pqz2
user_type: interactive
display_name: NYC OpenData


In [14]:
for key, value in list(client.datasets()[0]['creator'].items()):
    print('{}: {}'.format(key, value))     

id: 5fuc-pqz2
user_type: interactive
display_name: NYC OpenData


Once we've identified the structure of the dictionary, we try to find the 311 dataset and identify its position in the datasets list.

In [15]:
# NYC 311 dataset identifier
socrata_dataset_identifier = 'erm2-nwe9'

# call .datasets() once and reuse the list; every call is a separate API request
all_datasets = client.datasets()

for idx, dataset in enumerate(all_datasets):
    if dataset['resource']['id'] == socrata_dataset_identifier:
        print('We found the NYC 311 dataset!')
        print('Index is: {}'.format(idx))

        dataset_index = idx
        break

We found the NYC 311 dataset!
Index is: 6


Since the output of the `datasets` method is long, let's pick out specific keys to preview in the resource dictionary.

In [16]:
# preview items for 311 dataset
for key, value in client.datasets()[dataset_index]['resource'].items():
    print('{}: {}\n'.format(key, value)) 

name: 311 Service Requests from 2010 to Present

id: erm2-nwe9

parent_fxf: []

description: <b>Please note: Due to pandemic call handling modifications, the ‘Open Data Channel Type’ values may not accurately indicate the channel the Service Request was submitted in for the period starting March 2020.</b>
<p>
All 311 Service Requests from 2010 to present. This information is automatically updated daily.
</p>

attribution: 311, DoITT

attribution_link: None

contact_email: None

type: dataset

updatedAt: 2023-07-07T01:32:42.000Z

createdAt: 2011-10-10T05:52:17.000Z

metadata_updated_at: 2023-06-22T22:31:54.000Z

data_updated_at: 2023-07-07T01:32:42.000Z

page_views: {'page_views_last_week': 1563, 'page_views_last_month': 7524, 'page_views_total': 441908, 'page_views_last_week_log': 10.611024797307353, 'page_views_last_month_log': 12.877475866534427, 'page_views_total_log': 18.75338978802358}

columns_name: ['Due Date', 'Community Board', 'Status', 'Incident Zip', 'Landmark', 'Locatio

In [17]:
name = client.datasets()[dataset_index]['resource']['name']
desc = client.datasets()[dataset_index]['resource']['description']

print('The {} data description:\n\n{}.'.format(name, desc))

The 311 Service Requests from 2010 to Present data description:

<b>Please note: Due to pandemic call handling modifications, the ‘Open Data Channel Type’ values may not accurately indicate the channel the Service Request was submitted in for the period starting March 2020.</b>
<p>
All 311 Service Requests from 2010 to present. This information is automatically updated daily.
</p>.


In [18]:
created = client.datasets()[dataset_index]['resource']['createdAt']
updated = client.datasets()[dataset_index]['resource']['updatedAt']

print('created: {}\nupdated: {}.'.format(created, updated))

created: 2011-10-10T05:52:17.000Z
updated: 2023-07-07T01:32:42.000Z.
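The `createdAt` and `updatedAt` values are returned as strings. A minimal sketch for turning them into `datetime` objects follows, assuming the trailing `Z` denotes UTC and that the format matches the output above; `parse_socrata_timestamp` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def parse_socrata_timestamp(value):
    """Parse a string such as '2011-10-10T05:52:17.000Z' into a timezone-aware datetime."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    # Assumption: the trailing 'Z' means the timestamp is in UTC.
    return parsed.replace(tzinfo=timezone.utc)


created = parse_socrata_timestamp("2011-10-10T05:52:17.000Z")
updated = parse_socrata_timestamp("2023-07-07T01:32:42.000Z")
print("days between creation and last update: {}".format((updated - created).days))
```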


In [19]:
# NYC 311 dataset page views information
for key, value in client.datasets()[dataset_index]['resource']['page_views'].items():
    print('{}: {}'.format(key, value))

page_views_last_week: 1563
page_views_last_month: 7524
page_views_total: 441908
page_views_last_week_log: 10.611024797307353
page_views_last_month_log: 12.877475866534427
page_views_total_log: 18.75338978802358


In [20]:
client.close()

##  5.2 `.get()`

`get` method: Read data from the requested resource. Options for content_type are json,
csv, and xml.

In [21]:
# source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# NYC 311 dataset identifier
socrata_dataset_identifier = 'erm2-nwe9'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=1000
)

# printing the column and value of the first record
for key, value in client.get(socrata_dataset_identifier)[0].items():
    print('{}: {}'.format(key, value))



unique_key: 58102744
created_date: 2023-07-06T12:00:00.000
closed_date: 2023-07-06T12:00:00.000
agency: DSNY
agency_name: Department of Sanitation
complaint_type: Derelict Vehicles
descriptor: Derelict Vehicles
location_type: Street
incident_zip: 11231
incident_address: 40 CENTRE MALL
street_name: CENTRE MALL
cross_street_1: COLUMBIA STREET
cross_street_2: HICKS STREET
address_type: ADDRESS
city: BROOKLYN
facility_type: DSNY Garage
status: Closed
resolution_description: The owner claimed the vehicle. Your request is now closed and no further action will be taken.
resolution_action_updated_date: 2023-07-06T12:00:00.000
community_board: 06 BROOKLYN
bbl: 3005380001
borough: BROOKLYN
x_coordinate_state_plane: 982585
y_coordinate_state_plane: 185410
open_data_channel_type: PHONE
park_facility_name: Unspecified
park_borough: BROOKLYN
latitude: 40.675584218632636
longitude: -74.00600254428969
location: {'latitude': '40.675584218632636', 'longitude': '-74.00600254428969', 'human_address': '{"

In [22]:
# running the same request again; the first record returned can differ between calls
for key, value in client.get(socrata_dataset_identifier)[0].items():
    print('{}: {}'.format(key, value))     

unique_key: 58103107
created_date: 2023-07-06T12:00:00.000
agency: DSNY
agency_name: Department of Sanitation
complaint_type: Derelict Vehicles
descriptor: Derelict Vehicles
location_type: Street
incident_zip: 10469
incident_address: 3525 EASTCHESTER ROAD
street_name: EASTCHESTER ROAD
cross_street_1: CHESTER STREET
cross_street_2: HICKS STREET
address_type: ADDRESS
city: BRONX
status: Open
resolution_description: If the abandoned vehicle meets the criteria to be classified as a derelict (i.e. junk) the Department of Sanitation (DSNY) will investigate and tag the vehicle within three business days.
resolution_action_updated_date: 2023-07-06T12:00:00.000
community_board: 12 BRONX
bbl: 2047220006
borough: BRONX
x_coordinate_state_plane: 1026426
y_coordinate_state_plane: 259584
open_data_channel_type: PHONE
park_facility_name: Unspecified
park_borough: BRONX
latitude: 40.87907158488153
longitude: -73.84748453642206
location: {'latitude': '40.87907158488153', 'longitude': '-73.8474845364220
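The two identical calls above printed different first records, which suggests the default row order is not fixed between requests. One way to get repeatable results is to pass an explicit `order` together with `limit` and `offset`, which sodapy forwards to the API as query parameters. The sketch below is an assumption-laden example: `fetch_page` is a hypothetical helper, and ordering by `unique_key` is just one reasonable choice.

```python
def fetch_page(client, dataset_id, page=0, page_size=5, order="unique_key"):
    """Fetch one page of records in a stable order (hypothetical helper)."""
    return client.get(
        dataset_id,
        order=order,              # explicit sort so pages do not shift between calls
        limit=page_size,          # number of records per page
        offset=page * page_size,  # number of records to skip
    )


# Example usage with a sodapy client (requires network access), left commented out:
# first_page = fetch_page(client, "erm2-nwe9", page=0)
# second_page = fetch_page(client, "erm2-nwe9", page=1)
```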

In [23]:
# SoQL query string below:
# retrieve all columns and limit our records to 100

query = (
    """
    SELECT *
    LIMIT 100
    """
)

# returned as JSON from API / converted to Python list of dictionaries by sodapy
results = client.get(
    socrata_dataset_identifier,
    query=query
)

# convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)
client.close()

print('shape of data: {}'.format(results_df.shape))
results_df.head()

shape of data: (100, 41)


| | unique_key | created_date | agency | agency_name | complaint_type | descriptor | location_type | incident_zip | incident_address | street_name | ... | :@computed_region_92fq_4b7q | :@computed_region_sbqj_enih | closed_date | facility_type | intersection_street_1 | intersection_street_2 | landmark | bridge_highway_name | bridge_highway_segment | bridge_highway_direction |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 58102040 | 2023-07-06T12:00:00.000 | DSNY | Department of Sanitation | Derelict Vehicles | Derelict Vehicles | Street | 10467.0 | 660 EAST 221 STREET | EAST 221 STREET | ... | 2.0 | 30.0 | | | | | | | | |
| 1 | 58103107 | 2023-07-06T12:00:00.000 | DSNY | Department of Sanitation | Derelict Vehicles | Derelict Vehicles | Street | 10469.0 | 3525 EASTCHESTER ROAD | EASTCHESTER ROAD | ... | 2.0 | 30.0 | | | | | | | | |
| 2 | 58102744 | 2023-07-06T12:00:00.000 | DSNY | Department of Sanitation | Derelict Vehicles | Derelict Vehicles | Street | 11231.0 | 40 CENTRE MALL | CENTRE MALL | ... | 7.0 | 48.0 | 2023-07-06T12:00:00.000 | DSNY Garage | | | | | | |
| 3 | 58106627 | 2023-07-06T12:00:00.000 | DSNY | Department of Sanitation | Derelict Vehicles | Derelict Vehicles | Street | | | | ... | | | | DSNY Garage | MIDWOOD STREET | MIDWOOD STREET | | | | |
| 4 | 58106335 | 2023-07-06T02:08:06.000 | NYPD | New York City Police Department | Illegal Fireworks | | Street/Sidewalk | 11226.0 | 31 KENMORE PLACE | KENMORE PLACE | ... | 11.0 | 43.0 | | | WOODRUFF AVENUE | CATON AVENUE | KENMORE PLACE | | | |
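The query above selects every column. As a hedged sketch of a narrower SoQL query, the helper below assembles a `SELECT ... WHERE ... ORDER BY ... LIMIT` string from a few pieces; the column names come from the 311 output shown earlier, while `build_soql` itself and the chosen filter are illustrative assumptions.

```python
def build_soql(columns, where=None, order_by=None, limit=100):
    """Assemble a simple SoQL query string (hypothetical helper).

    Note: values are inserted as given and are not escaped, so this is only
    suitable for trusted, hand-written conditions.
    """
    parts = ["SELECT {}".format(", ".join(columns))]
    if where:
        parts.append("WHERE {}".format(where))
    if order_by:
        parts.append("ORDER BY {}".format(order_by))
    parts.append("LIMIT {}".format(int(limit)))
    return "\n".join(parts)


query = build_soql(
    columns=["unique_key", "created_date", "complaint_type"],
    where="agency = 'NYPD'",
    order_by="created_date DESC",
    limit=10,
)
print(query)

# The string can then be passed to sodapy (requires network access):
# results = client.get("erm2-nwe9", query=query)
```

Selecting only the needed columns keeps the response small, which matters for a dataset as large as the 311 service requests.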


## 5.3  `.get_metadata()`

`get_metadata` method: Retrieve the metadata for a particular dataset.

In [24]:
# source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# dataset id for NYC 311 on NYC Open Data on Socrata
socrata_dataset_identifier = 'erm2-nwe9'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

# previewing keys
for key in client.get_metadata(socrata_dataset_identifier).keys():
    print(key)



id
name
assetType
attribution
averageRating
category
createdAt
description
displayType
downloadCount
hideFromCatalog
hideFromDataJson
indexUpdatedAt
newBackend
numberOfComments
oid
provenance
publicationAppendEnabled
publicationDate
publicationGroup
publicationStage
rowClass
rowIdentifierColumnId
rowsUpdatedAt
rowsUpdatedBy
tableId
totalTimesRated
viewCount
viewLastModified
viewType
approvals
clientContext
columns
grants
metadata
owner
query
rights
tableAuthor
tags
flags


In [25]:
# preview keys in columns
for key in client.get_metadata(socrata_dataset_identifier)['columns'][0].keys():
    print(key)

id
name
dataTypeName
description
fieldName
position
renderTypeName
tableColumnId
width
cachedContents
format


In [26]:
# preview field names
for column in client.get_metadata(socrata_dataset_identifier)['columns']:
    print(column['fieldName'])

unique_key
created_date
closed_date
agency
agency_name
complaint_type
descriptor
location_type
incident_zip
incident_address
street_name
cross_street_1
cross_street_2
intersection_street_1
intersection_street_2
address_type
city
landmark
facility_type
status
due_date
resolution_description
resolution_action_updated_date
community_board
bbl
borough
x_coordinate_state_plane
y_coordinate_state_plane
open_data_channel_type
park_facility_name
park_borough
vehicle_type
taxi_company_borough
taxi_pick_up_location
bridge_highway_name
bridge_highway_direction
road_ramp
bridge_highway_segment
latitude
longitude
location
:@computed_region_efsh_h5xi
:@computed_region_f5dn_yrer
:@computed_region_yeji_bk3q
:@computed_region_92fq_4b7q
:@computed_region_sbqj_enih
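The column metadata can be condensed into a lookup from field name to data type, and the `:@computed_region_*` entries listed above can be filtered out. The sketch below runs on a small hand-written sample that mimics the structure of the metadata; the `dataTypeName` values in the sample are placeholders rather than values taken from the API, and both helper names are hypothetical.

```python
def column_types(metadata):
    """Map each column's fieldName to its dataTypeName."""
    return {col["fieldName"]: col["dataTypeName"] for col in metadata["columns"]}


def user_facing_fields(metadata):
    """Return field names, skipping computed-region columns that start with ':@'."""
    return [
        col["fieldName"]
        for col in metadata["columns"]
        if not col["fieldName"].startswith(":@")
    ]


# Hand-written sample with placeholder data types (not real API output).
sample_metadata = {
    "columns": [
        {"fieldName": "unique_key", "dataTypeName": "text"},
        {"fieldName": "created_date", "dataTypeName": "calendar_date"},
        {"fieldName": ":@computed_region_efsh_h5xi", "dataTypeName": "number"},
    ]
}

print(column_types(sample_metadata))
print(user_facing_fields(sample_metadata))

# With a sodapy client (requires network access):
# metadata = client.get_metadata("erm2-nwe9")
# print(column_types(metadata))
```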


In [27]:
client.close()