# Sodapy Basics Tutorial Using NYC Open Data 
Mark Bauer

Table of Contents
=================

   * 1. Introduction
   * 2. Sodapy
       * 2.1 Using Sodapy
       * 2.2 Sodapy Methods
   * 3. Importing Libraries
   * 4. Socrata Class
   * 5. Sodapy Methods
       * 5.1 .datasets()
       * 5.2 .get()
       * 5.3 .get_metadata()

# 1. Introduction  
This notebook demonstrates how to use sodapy, the Python client for the Socrata Open Data API (SODA), and reviews several of its methods for retrieving data from Socrata Open Data. The data in this tutorial comes from NYC Open Data.

# 2. Sodapy

## 2.1 Using Sodapy

To use sodapy, a source domain (i.e., the Socrata Open Data source you are trying to connect to) needs to be passed to the Socrata class. Additionally, to query a specific dataset on Socrata Open Data, the dataset identifier (i.e., the dataset id on the given source domain) needs to be passed to the method being called. Below, we identify NYC Open Data's source domain, `data.cityofnewyork.us`, and the dataset identifier for the NYC 311 dataset, `erm2-nwe9`. The screenshot below shows where to find this information.

![nyc-311-api-docs](images/nyc-311-api-docs.png)  

Source: https://dev.socrata.com/foundry/data.cityofnewyork.us/erm2-nwe9
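
As a rough illustration of how these two values fit together, the sketch below assembles the resource URL for a dataset from its domain and identifier. The helper name `build_resource_url` is made up for this example, and the `/resource/<id>.json` pattern is assumed from the Socrata API docs page linked above; sodapy builds its own URLs internally.

```python
def build_resource_url(domain, dataset_id, content_type='json'):
    """Assemble a SODA resource URL from a domain and a dataset identifier.

    Hypothetical helper for illustration only; it is not part of sodapy.
    """
    return 'https://{}/resource/{}.{}'.format(domain, dataset_id, content_type)


socrata_domain = 'data.cityofnewyork.us'
socrata_dataset_identifier = 'erm2-nwe9'

nyc_311_url = build_resource_url(socrata_domain, socrata_dataset_identifier)
print(nyc_311_url)
```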

## 2.2 Sodapy Methods

![socrata-methods](images/socrata-methods.png)

Source: https://github.com/xmunoz/sodapy#datasetslimit0-offset0

We will be focusing on three sodapy methods:

- `.datasets()`: Returns the list of datasets associated with a particular domain.
- `.get()`: Reads data from the requested resource. Options for `content_type` are json, csv, and xml.
- `.get_metadata()`: Retrieves the metadata for a particular dataset.

# 3. Importing Libraries

In [1]:
# importing libraries
import pandas as pd
from sodapy import Socrata

In [2]:
# documentation for installing watermark: https://github.com/rasbt/watermark
%reload_ext watermark
%watermark -u -t -d -v -p pandas,sodapy

Last updated: 2023-06-27 17:25:22

Python implementation: CPython
Python version       : 3.8.13
IPython version      : 8.4.0

pandas: 1.4.2
sodapy: 2.2.0



# 4. Socrata Class 

In [3]:
# source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# Socrata(): The main class that interacts with the SODA API.

# The required arguments are:
#     domain: the domain you wish to access
#     app_token: your Socrata application token
# Simple requests are possible without an app_token, though these
# requests will be rate-limited.

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

client



<sodapy.socrata.Socrata at 0x10f3f6cd0>

In [4]:
# printing information about the Socrata object
print(client)
print('type: {}'.format(type(client)))

<sodapy.socrata.Socrata object at 0x10f3f6cd0>
type: <class 'sodapy.socrata.Socrata'>


In [5]:
# printing attributes of object
for key, value in client.__dict__.items():
    print('{}: {}'.format(key, value))

domain: data.cityofnewyork.us
session: <requests.sessions.Session object at 0x10f3f6ee0>
uri_prefix: https://
timeout: 100
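
Since requests without an app token are rate-limited, one option is to keep a token outside the notebook and read it at runtime. The sketch below is a minimal example under stated assumptions: the environment variable name `SOCRATA_APP_TOKEN` and the helper `load_app_token` are both made up for illustration.

```python
import os


def load_app_token(env_var='SOCRATA_APP_TOKEN', environ=None):
    """Return the app token stored in an environment variable, or None.

    The variable name SOCRATA_APP_TOKEN is an assumption for this sketch.
    """
    environ = os.environ if environ is None else environ
    token = environ.get(env_var, '').strip()
    return token or None


app_token = load_app_token()
print('app token found:', app_token is not None)

# The value could then be passed to the client, for example:
# client = Socrata('data.cityofnewyork.us', app_token=app_token, timeout=100)
```

When the variable is not set, the helper returns `None`, which matches the `app_token=None` value used in this tutorial.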


# 5. Sodapy Methods

## 5.1  `.datasets()`

`datasets` method: Returns the list of datasets associated with a particular domain.
WARNING: Large limits (>1000) will return megabytes of data,
which can be slow on low-bandwidth networks, and is also a lot of
data to hold in memory.
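
One way to keep memory use bounded is to request the catalog in smaller pages with the `limit` and `offset` arguments. The sketch below assumes the client returns an empty or short list once the catalog is exhausted; `iter_datasets` and `FakeClient` are hypothetical names, and the fake client only stands in for a real one so the example runs without network access.

```python
def iter_datasets(client, page_size=100):
    """Yield catalog entries one page at a time using limit and offset."""
    offset = 0
    while True:
        page = client.datasets(limit=page_size, offset=offset)
        if not page:
            break
        for entry in page:
            yield entry
        # a short page means there is nothing left to request
        if len(page) < page_size:
            break
        offset += page_size


class FakeClient:
    """Stand-in for a sodapy client (illustration only)."""

    def __init__(self, entries):
        self._entries = entries

    def datasets(self, limit=0, offset=0):
        return self._entries[offset:offset + limit]


fake = FakeClient([{'resource': {'id': 'id-{}'.format(i)}} for i in range(250)])
ids = [entry['resource']['id'] for entry in iter_datasets(fake, page_size=100)]
print(len(ids))
```

With a real client, `iter_datasets(client, page_size=100)` could be used in place of a single `client.datasets()` call.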

In [6]:
# datasets object type
print(type(client.datasets()))

<class 'list'>


In [7]:
datasets = len(client.datasets())
print('Number of datasets on NYC Open Data: {:,}.'.format(datasets))

Number of datasets on NYC Open Data: 3,529.


In [8]:
# reviewing the type of the first dataset
print(type(client.datasets()[0]))

<class 'dict'>


In [9]:
# reviewing the keys of the first dataset
client.datasets()[0].keys()

dict_keys(['resource', 'classification', 'metadata', 'permalink', 'link', 'owner', 'creator'])

In [10]:
# reviewing information about resource key
for key, value in client.datasets()[0]['resource'].items():
    print(key + ':', value, '\n')      

name: Civil Service List (Active) 

id: vx8i-nprf 

parent_fxf: [] 

description: A Civil Service List consists of all candidates who passed an exam, ranked in score order. An established list is considered active for no less than one year and no more than four years from the date of establishment. For more information visit DCAS’ “Work for the City” webpage at: https://www1.nyc.gov/site/dcas/employment/take-an-exam.page 

attribution: Department of Citywide Administrative Services (DCAS) 

attribution_link: None 

contact_email: None 

type: dataset 

updatedAt: 2023-06-27T13:33:40.000Z 

createdAt: 2016-06-14T21:12:15.000Z 

metadata_updated_at: 2023-06-27T13:33:40.000Z 

data_updated_at: 2023-06-27T13:24:22.000Z 

page_views: {'page_views_last_week': 4143, 'page_views_last_month': 19081, 'page_views_total': 2305826, 'page_views_last_week_log': 12.016808287686555, 'page_views_last_month_log': 14.219924768862814, 'page_views_total_log': 21.13685284484433} 

columns_name: ['List Title 

In [11]:
# reviewing information about classification key
for key, value in client.datasets()[0]['classification'].items():
    print(key + ':', value)

categories: []
tags: []
domain_category: City Government
domain_tags: ['2018od4a-video', '2018od4a-report']
domain_metadata: [{'key': 'Update_Automation', 'value': 'Yes'}, {'key': 'Update_Date-Made-Public', 'value': '7/12/2016'}, {'key': 'Update_Update-Frequency', 'value': 'Daily'}, {'key': 'Update_Data-Change-Frequency', 'value': 'Weekly'}, {'key': 'Dataset-Information_Agency', 'value': 'Department of Citywide Administrative Services (DCAS)'}]


In [12]:
# reviewing information about metadata key
for key, value in client.datasets()[0]['metadata'].items():
    print(key + ':', value, '\n')

domain: data.cityofnewyork.us 



In [13]:
for key, value in client.datasets()[0]['owner'].items():
    print(key + ':', value)

id: 5fuc-pqz2
user_type: interactive
display_name: NYC OpenData


In [14]:
for key, value in list(client.datasets()[0]['creator'].items()):
    print(key + ':', value)       

id: 5fuc-pqz2
user_type: interactive
display_name: NYC OpenData


Once we've identified the structure of the dictionary, we try to find the 311 dataset and identify its position in the datasets list.

In [15]:
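# NYC 311 dataset identifier
socrata_dataset_identifier = 'erm2-nwe9'

# fetch the catalog once; calling client.datasets() inside the loop
# would send a new request to the API on every iteration
all_datasets = client.datasets()

for idx, dataset in enumerate(all_datasets):
    if dataset['resource']['id'] == socrata_dataset_identifier:
        print('We found the NYC 311 dataset!')
        print('Index is: {}'.format(idx))
        break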

We found the NYC 311 dataset!
Index is: 6
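
A position in the list can change whenever the catalog is reordered, so an alternative is to look a dataset up by its identifier. The sketch below is one possible approach; `index_by_id` is a hypothetical helper, and the sample list only imitates the structure returned by `client.datasets()`.

```python
def index_by_id(datasets):
    """Map each dataset identifier to its catalog entry."""
    return {entry['resource']['id']: entry for entry in datasets}


# Small sample shaped like the output of client.datasets() (illustration only).
sample_datasets = [
    {'resource': {'id': 'vx8i-nprf', 'name': 'Civil Service List (Active)'}},
    {'resource': {'id': 'erm2-nwe9', 'name': '311 Service Requests from 2010 to Present'}},
]

catalog = index_by_id(sample_datasets)
entry = catalog.get('erm2-nwe9')
print(entry['resource']['name'])
```

Using `.get()` on the dictionary returns `None` for an unknown identifier instead of raising an error.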


Since the output of the `datasets` method is long, let's identify the specific keys we want to preview in the resource dictionary.

In [16]:
# preview items for 311 dataset
for key, value in client.datasets()[idx]['resource'].items():
    print(str(key) + ':', value, '\n')

name: 311 Service Requests from 2010 to Present 

id: erm2-nwe9 

parent_fxf: [] 

description: <b>Please note: Due to pandemic call handling modifications, the ‘Open Data Channel Type’ values may not accurately indicate the channel the Service Request was submitted in for the period starting March 2020.</b>
<p>
All 311 Service Requests from 2010 to present. This information is automatically updated daily.
</p> 

attribution: 311, DoITT 

attribution_link: None 

contact_email: None 

type: dataset 

updatedAt: 2023-06-27T01:32:58.000Z 

createdAt: 2011-10-10T05:52:17.000Z 

metadata_updated_at: 2023-06-22T22:31:54.000Z 

data_updated_at: 2023-06-27T01:32:58.000Z 

page_views: {'page_views_last_week': 1808, 'page_views_last_month': 8938, 'page_views_total': 441908, 'page_views_last_week_log': 10.82097669262124, 'page_views_last_month_log': 13.125897731760327, 'page_views_total_log': 18.75338978802358} 

columns_name: ['Open Data Channel Type', 'Location Type', 'Cross Street 2', 'Tax

In [17]:
name = client.datasets()[idx]['resource']['name']
desc = client.datasets()[idx]['resource']['description']

print('The {} data description:\n\n{}.'.format(name, desc))

The 311 Service Requests from 2010 to Present data description:

<b>Please note: Due to pandemic call handling modifications, the ‘Open Data Channel Type’ values may not accurately indicate the channel the Service Request was submitted in for the period starting March 2020.</b>
<p>
All 311 Service Requests from 2010 to present. This information is automatically updated daily.
</p>.


In [18]:
created = client.datasets()[idx]['resource']['createdAt']
updated = client.datasets()[idx]['resource']['updatedAt']

print('created: {}\nupdated: {}.'.format(created, updated))

created: 2011-10-10T05:52:17.000Z
updated: 2023-06-27T01:32:58.000Z.
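
The timestamps are returned as strings. The sketch below shows one way to turn them into `datetime` objects, assuming the trailing `Z` denotes UTC and that the format matches the values printed above; `parse_socrata_timestamp` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def parse_socrata_timestamp(value):
    """Parse a catalog timestamp such as '2011-10-10T05:52:17.000Z' as UTC."""
    parsed = datetime.strptime(value, '%Y-%m-%dT%H:%M:%S.%fZ')
    return parsed.replace(tzinfo=timezone.utc)


created = parse_socrata_timestamp('2011-10-10T05:52:17.000Z')
updated = parse_socrata_timestamp('2023-06-27T01:32:58.000Z')

age = updated - created
print('days between creation and last update: {:,}'.format(age.days))
```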


In [19]:
# NYC 311 dataset page views information
for key, value in client.datasets()[idx]['resource']['page_views'].items():
    print(key + ':', value)

page_views_last_week: 1808
page_views_last_month: 8938
page_views_total: 441908
page_views_last_week_log: 10.82097669262124
page_views_last_month_log: 13.125897731760327
page_views_total_log: 18.75338978802358
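
The `_log` fields are not explained in the output above. As a quick arithmetic check, the values printed for this dataset appear consistent with a base-2 logarithm of the view count plus one. This is an observation about these particular numbers, not a documented definition, and `matches_log2_plus_one` is a hypothetical helper.

```python
import math

# Values copied from the output above.
page_views = {
    'page_views_last_week': 1808,
    'page_views_last_month': 8938,
    'page_views_total': 441908,
    'page_views_last_week_log': 10.82097669262124,
    'page_views_last_month_log': 13.125897731760327,
    'page_views_total_log': 18.75338978802358,
}


def matches_log2_plus_one(stats, key, tol=1e-6):
    """Check whether stats[key + '_log'] is close to log2(stats[key] + 1)."""
    return math.isclose(math.log2(stats[key] + 1), stats[key + '_log'], abs_tol=tol)


count_keys = ['page_views_last_week', 'page_views_last_month', 'page_views_total']
checks = {key: matches_log2_plus_one(page_views, key) for key in count_keys}
print(checks)
```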


##  5.2 `.get()`

`get` method: Read data from the requested resource. Options for content_type are json,
csv, and xml.
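
Assuming a single call returns only one page of rows, larger extracts can be collected by repeating the call with the `limit` and `offset` keyword arguments. The sketch below is a minimal version of that loop: `fetch_rows` and `FakeClient` are hypothetical, the fake client replaces a real one so the example runs offline, and ordering by the `:id` system field is an assumption made to keep pages stable between requests.

```python
def fetch_rows(client, dataset_id, page_size=1000, max_rows=5000):
    """Collect up to max_rows records by calling client.get() page by page."""
    rows = []
    offset = 0
    while len(rows) < max_rows:
        limit = min(page_size, max_rows - len(rows))
        page = client.get(dataset_id, limit=limit, offset=offset, order=':id')
        rows.extend(page)
        # a short page means the dataset has no more rows
        if len(page) < limit:
            break
        offset += len(page)
    return rows


class FakeClient:
    """Stand-in for a sodapy client (illustration only)."""

    def __init__(self, n_rows):
        self._rows = [{'unique_key': str(i)} for i in range(n_rows)]

    def get(self, dataset_id, limit=1000, offset=0, order=None):
        return self._rows[offset:offset + limit]


rows = fetch_rows(FakeClient(2500), 'erm2-nwe9', page_size=1000, max_rows=2200)
print(len(rows))
```

The `max_rows` cap is a deliberate safeguard here, since the 311 dataset is large and an uncapped loop could run for a long time.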

In [20]:
client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=1000
)

# NYC 311 data set identifier
socrata_dataset_identifier = 'erm2-nwe9'



In [21]:
# printing the column and value of the first record
for key, value in client.get(socrata_dataset_identifier)[0].items():
    print(key + ":", value)      

unique_key: 58002872
created_date: 2023-06-26T12:00:00.000
agency: DSNY
agency_name: Department of Sanitation
complaint_type: Derelict Vehicles
descriptor: Derelict Vehicles
location_type: Street
incident_zip: 11203
incident_address: 240 EAST   53 STREET
street_name: EAST   53 STREET
cross_street_1: LENOX ROAD
cross_street_2: LINDEN BOULEVARD
address_type: ADDRESS
city: BROOKLYN
status: Open
resolution_description: If the abandoned vehicle meets the criteria to be classified as a derelict (i.e. junk) the Department of Sanitation (DSNY) will investigate and tag the vehicle within three business days.
resolution_action_updated_date: 2023-06-26T12:00:00.000
community_board: 17 BROOKLYN
borough: BROOKLYN
x_coordinate_state_plane: 1004301
y_coordinate_state_plane: 177661
open_data_channel_type: PHONE
park_facility_name: Unspecified
park_borough: BROOKLYN
latitude: 40.65429238853374
longitude: -73.92773656642883
location: {'latitude': '40.65429238853374', 'longitude': '-73.92773656642883', '

In [22]:
# source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# dataset id for NYC 311 on NYC Open Data on Socrata
socrata_dataset_identifier = 'erm2-nwe9'

# Socrata - The main class that interacts with the SODA API. We pass the source domain value
# of NYC Open data, the app token as 'None', and set the timeout parameter for '1,000 seconds'
client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=1000
)

client



<sodapy.socrata.Socrata at 0x11270d6d0>

In [23]:
# SoQL query string below:
# retrieve all columns and limit our records to 100.

query = (
    """
    SELECT *
    LIMIT 100
    """
)

# returned as JSON from API / converted to Python list of dictionaries by sodapy.
results = client.get(
    socrata_dataset_identifier,
    query=query
)

# convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)

print('shape of data:', results_df.shape)
results_df.head()

shape of data: (100, 37)


| | unique_key | created_date | agency | agency_name | complaint_type | descriptor | location_type | incident_zip | incident_address | street_name | ... | location | :@computed_region_efsh_h5xi | :@computed_region_f5dn_yrer | :@computed_region_yeji_bk3q | :@computed_region_92fq_4b7q | :@computed_region_sbqj_enih | intersection_street_1 | intersection_street_2 | landmark | closed_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58006471 | 2023-06-26T12:00:00.000 | DSNY | Department of Sanitation | Derelict Vehicles | Derelict Vehicles | Street | 10032 | 177 FORT WASHINGTON AVENUE | FORT WASHINGTON AVENUE | ... | {'latitude': '40.84115227324087', 'longitude':... | 13090 | 47 | 4 | 39 | 21 | | | | |
| 1 | 58006508 | 2023-06-26T12:00:00.000 | DSNY | Department of Sanitation | Derelict Vehicles | Derelict Vehicles | Street | 11214 | 8015 BAY PARKWAY | BAY PARKWAY | ... | {'latitude': '40.60469714576356', 'longitude':... | 17616 | 1 | 2 | 18 | 37 | | | | |
| 2 | 58002872 | 2023-06-26T12:00:00.000 | DSNY | Department of Sanitation | Derelict Vehicles | Derelict Vehicles | Street | 11203 | 240 EAST 53 STREET | EAST 53 STREET | ... | {'latitude': '40.65429238853374', 'longitude':... | 16866 | 61 | 2 | 17 | 40 | | | | |
| 3 | 58002867 | 2023-06-26T12:00:00.000 | DSNY | Department of Sanitation | Derelict Vehicles | Derelict Vehicles | Street | 11204 | 6908 16 AVENUE | 16 AVENUE | ... | {'latitude': '40.61960174125197', 'longitude':... | 13511 | 1 | 2 | 44 | 37 | | | | |
| 4 | 58006651 | 2023-06-26T02:00:17.000 | NYPD | New York City Police Department | Noise - Residential | Loud Music/Party | Residential Building/House | 10467 | 3530 DECATUR AVENUE | DECATUR AVENUE | ... | {'latitude': '40.8792856112555', 'longitude': ... | 11605 | 24 | 5 | 40 | 34 | EAST GUN HILL ROAD | EAST 211 STREET | DECATUR AVENUE | |
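
The query above selects every column. As a hedged sketch of how a narrower SoQL string could be assembled, the example below builds a query from optional clauses. `build_soql` is a hypothetical helper, and the column names and filter value are taken from the output shown in this tutorial.

```python
def build_soql(columns=None, where=None, order=None, limit=100):
    """Assemble a SoQL query string from optional clauses."""
    parts = ['SELECT ' + (', '.join(columns) if columns else '*')]
    if where:
        parts.append('WHERE ' + where)
    if order:
        parts.append('ORDER BY ' + order)
    parts.append('LIMIT {}'.format(int(limit)))
    return '\n'.join(parts)


query = build_soql(
    columns=['unique_key', 'created_date', 'agency', 'complaint_type'],
    where="agency = 'NYPD'",
    order='created_date DESC',
    limit=100,
)
print(query)

# The string could then be passed to the client, for example:
# results = client.get(socrata_dataset_identifier, query=query)
```

Selecting only the needed columns keeps the response smaller than `SELECT *`.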


## 5.3  `.get_metadata()`

`get_metadata` method: Retrieve the metadata for a particular dataset.

In [24]:
# source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# dataset id for NYC 311 on NYC Open Data on Socrata
socrata_dataset_identifier = 'erm2-nwe9'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

# previewing keys
for key in client.get_metadata(socrata_dataset_identifier).keys():
    print(key)



id
name
assetType
attribution
averageRating
category
createdAt
description
displayType
downloadCount
hideFromCatalog
hideFromDataJson
indexUpdatedAt
newBackend
numberOfComments
oid
provenance
publicationAppendEnabled
publicationDate
publicationGroup
publicationStage
rowClass
rowIdentifierColumnId
rowsUpdatedAt
rowsUpdatedBy
tableId
totalTimesRated
viewCount
viewLastModified
viewType
approvals
clientContext
columns
grants
metadata
owner
query
rights
tableAuthor
tags
flags


In [25]:
# preview keys in columns
for key in client.get_metadata(socrata_dataset_identifier)['columns'][0].keys():
    print(key)

id
name
dataTypeName
description
fieldName
position
renderTypeName
tableColumnId
width
cachedContents
format


In [26]:
# preview field names
for column in client.get_metadata(socrata_dataset_identifier)['columns']:
    print(column['fieldName'])

unique_key
created_date
closed_date
agency
agency_name
complaint_type
descriptor
location_type
incident_zip
incident_address
street_name
cross_street_1
cross_street_2
intersection_street_1
intersection_street_2
address_type
city
landmark
facility_type
status
due_date
resolution_description
resolution_action_updated_date
community_board
bbl
borough
x_coordinate_state_plane
y_coordinate_state_plane
open_data_channel_type
park_facility_name
park_borough
vehicle_type
taxi_company_borough
taxi_pick_up_location
bridge_highway_name
bridge_highway_direction
road_ramp
bridge_highway_segment
latitude
longitude
location
:@computed_region_efsh_h5xi
:@computed_region_f5dn_yrer
:@computed_region_yeji_bk3q
:@computed_region_92fq_4b7q
:@computed_region_sbqj_enih
