# Sodapy Basics Tutorial Using NYC Open Data 
Author: Mark Bauer

Table of Contents
=================

   1. Introduction
   2. Sodapy
       - 2.1 Using Sodapy
       - 2.2 Sodapy Methods
   3. Importing Libraries
   4. Socrata Class
   5. Sodapy Methods
       - 5.1 .datasets()
       - 5.2 .get()
       - 5.3 .get_metadata()

# 1. Introduction  
This notebook demonstrates how to use sodapy, the Python client for the Socrata Open Data API (SODA), and reviews several methods for retrieving data from Socrata Open Data. The data in this tutorial comes from NYC Open Data.

# 2. Sodapy

## 2.1 Using Sodapy

To use sodapy, pass a source domain (i.e., the Socrata Open Data source you are connecting to) to the `Socrata` class. To query a specific dataset on Socrata Open Data, you also need its dataset identifier (i.e., the dataset id on the given source domain). Below, we identify NYC Open Data's source domain, `data.cityofnewyork.us`, and the dataset identifier for the NYC 311 dataset, `erm2-nwe9`. The screenshot below shows where this information can be found.

![nyc-311-api-docs](images/nyc-311-api-docs.png)  

**Source**: https://dev.socrata.com/foundry/data.cityofnewyork.us/erm2-nwe9
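
The domain and dataset identifier together determine the address of the dataset's API endpoint. The sketch below illustrates that relationship, assuming the usual SODA `/resource/<id>.<format>` URL pattern. The helper name `build_resource_url` is hypothetical and is not part of sodapy.

```python
def build_resource_url(domain, dataset_id, content_type="json"):
    """Return the SODA endpoint URL for a dataset (illustrative helper, not a sodapy function)."""
    return "https://{}/resource/{}.{}".format(domain, dataset_id, content_type)


socrata_domain = "data.cityofnewyork.us"
socrata_dataset_identifier = "erm2-nwe9"

endpoint = build_resource_url(socrata_domain, socrata_dataset_identifier)
print(endpoint)
# https://data.cityofnewyork.us/resource/erm2-nwe9.json
```

sodapy assembles these addresses internally, so the helper is only meant to show what the two values identify.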

## 2.2 Sodapy Methods

![socrata-methods](images/socrata-methods.png)

**Source**: https://github.com/xmunoz/sodapy#datasetslimit0-offset0

We will focus on three sodapy methods:

- `.datasets()`: Returns the list of datasets associated with a particular domain.
- `.get()`: Reads data from the requested resource. Options for `content_type` are `json`, `csv`, and `xml`.
- `.get_metadata()`: Retrieves the metadata for a particular dataset.

# 3. Importing Libraries

In [1]:
# importing libraries
import pandas as pd
from sodapy import Socrata

In [2]:
# documentation for installing watermark: https://github.com/rasbt/watermark
%reload_ext watermark
%watermark -u -t -d -v -p pandas,sodapy

Last updated: 2024-04-23 15:53:19

Python implementation: CPython
Python version       : 3.8.13
IPython version      : 8.4.0

pandas: 1.4.3
sodapy: 2.1.1



# 4. Socrata Class 

### Note:  
`WARNING:root:Requests made without an app_token will be subject to strict throttling limits.`

Read more from the SODA documentation here: https://dev.socrata.com/docs/app-tokens.html
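
One way to avoid hard-coding a token in the notebook is to read it from an environment variable. This is a minimal sketch under that assumption: the variable name `SOCRATA_APP_TOKEN` and the helper `load_app_token` are hypothetical choices, not something sodapy defines.

```python
import os


def load_app_token(env=None, var_name="SOCRATA_APP_TOKEN"):
    """Return the app token from the environment, or None if unset or blank.

    The variable name is an assumption made for this example.
    """
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    return token or None


app_token = load_app_token()
if app_token is None:
    print("No app token found; requests will be subject to throttling.")

# The value can then be passed to the client, for example:
# client = Socrata('data.cityofnewyork.us', app_token=app_token, timeout=100)
```

Returning `None` when the variable is missing keeps the behavior identical to the `app_token=None` calls used in this notebook.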

In [3]:
# source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# Socrata(): The main class that interacts with the SODA API.

# The required arguments are:
#     domain: the domain you wish to access
#     app_token: your Socrata application token
# Simple requests are possible without an app_token, though these
# requests will be rate-limited.

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

client



<sodapy.socrata.Socrata at 0x15efb4b50>

In [4]:
# printing information about the Socrata object
print(client)
print('type: {}'.format(type(client)))

<sodapy.socrata.Socrata object at 0x15efb4b50>
type: <class 'sodapy.socrata.Socrata'>


In [5]:
# printing attributes of object
for key, value in client.__dict__.items():
    print('{}: {}'.format(key, value))

domain: data.cityofnewyork.us
session: <requests.sessions.Session object at 0x15efb4ca0>
uri_prefix: https://
timeout: 100


In [6]:
# close the session when finished
client.close()
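
Calling `close()` by hand is easy to miss when a request raises an exception first. One option, sketched here, is the standard library's `contextlib.closing`, which calls `close()` on exit for any object that has that method. `StandInClient` is a hypothetical stand-in so the example runs without network access; it is not part of sodapy.

```python
from contextlib import closing


class StandInClient:
    """Hypothetical stand-in for a Socrata client; it only records whether close() ran."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


stand_in = StandInClient()

try:
    with closing(stand_in) as client:
        # simulate a request that fails part-way through
        raise RuntimeError("simulated request failure")
except RuntimeError:
    pass

# close() was still called, even though the block raised
print(stand_in.closed)
```

With a real client the same pattern would read `with closing(Socrata(socrata_domain, app_token=None, timeout=100)) as client:`.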

# 5. Sodapy Methods

## 5.1  .datasets()

`datasets` method: Returns the list of datasets associated with a particular domain.

**WARNING**: Large limits (>1000) will return megabytes of data, which can be slow on low-bandwidth networks and is also a lot of data to hold in memory.
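
The `limit` and `offset` parameters make it possible to walk through a large catalog in smaller pages instead of one large response. The sketch below shows that idea with a generic helper. `iter_pages` and `fake_datasets` are hypothetical names, and the fake catalog stands in for `client.datasets` so the example runs offline.

```python
def iter_pages(fetch_page, page_size=100):
    """Yield records page by page until a short (or empty) page signals the end.

    fetch_page is any callable that accepts `limit` and `offset` keyword
    arguments and returns a list.
    """
    offset = 0
    while True:
        page = fetch_page(limit=page_size, offset=offset)
        for record in page:
            yield record
        if len(page) < page_size:
            break
        offset += page_size


# stand-in catalog so the sketch runs without network access (made-up ids)
fake_catalog = [{"resource": {"id": "id-{}".format(i)}} for i in range(250)]


def fake_datasets(limit=0, offset=0):
    """Mimic a paged catalog call by slicing the in-memory list."""
    return fake_catalog[offset:offset + limit]


records = list(iter_pages(fake_datasets, page_size=100))
print(len(records))
```

Against the live API, `fetch_page` would be `client.datasets`, assuming it honors `limit` and `offset` as its signature suggests.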

In [7]:
# source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

# fetch the catalog once and reuse the result for both printouts
all_datasets = client.datasets()
print('object type: {}'.format(type(all_datasets)))

print('Number of datasets on NYC Open Data: {:,}.'.format(len(all_datasets)))



object type: <class 'list'>
Number of datasets on NYC Open Data: 3,342.


In [8]:
# reviewing the type of the first dataset
print(type(client.datasets()[0]))

<class 'dict'>


In [9]:
# reviewing the keys of the first dataset
client.datasets()[0].keys()

dict_keys(['resource', 'classification', 'metadata', 'permalink', 'link', 'owner', 'creator'])

In [10]:
# reviewing information about resource key
resources = client.datasets()[0]['resource'].items()

for key, value in resources:
    print('{}: {}\n'.format(key, value))      

name: For Hire Vehicles (FHV) - Active

id: 8wbx-tsch

parent_fxf: []

description: <b>PLEASE NOTE:</b> This dataset, which includes all TLC licensed for-hire vehicles which are in good standing and able to drive, is updated every day in the evening between 4-7pm. Please check the 'Last Update Date' field to make sure the list has updated successfully. 'Last Update Date'  should show either today or yesterday's date, depending on the time of day. If the list is outdated, please download the most recent list from the link below. 
http://www1.nyc.gov/assets/tlc/downloads/datasets/tlc_for_hire_vehicle_active_and_inactive.csv

TLC authorized For-Hire vehicles that are active. This list is accurate to the date and time represented in the Last Date Updated and Last Time Updated fields. For inquiries about the contents of this dataset, please email licensinginquiries@tlc.nyc.gov.

attribution: Taxi and Limousine Commission (TLC)

attribution_link: None

contact_email: None

type: dataset



In [11]:
# reviewing information about classification key
classification = client.datasets()[0]['classification'].items()

for key, value in classification:
    print('{}: {}'.format(key, value))

categories: []
tags: []
domain_category: Transportation
domain_tags: ['active', 'for-hire', 'for hire', 'fhv', 'drivers', 'inactive', 'taxi', 'for-hire-vehicles']
domain_metadata: [{'key': 'Update_Automation', 'value': 'Yes'}, {'key': 'Update_Date-Made-Public', 'value': '7/20/2015'}, {'key': 'Update_Update-Frequency', 'value': 'Daily'}, {'key': 'Dataset-Information_Agency', 'value': 'Taxi and Limousine Commission (TLC)'}]


In [12]:
# reviewing information about metadata key
metadata = client.datasets()[0]['metadata'].items()

for key, value in metadata:
    print('{}: {}\n'.format(key, value))

domain: data.cityofnewyork.us



In [13]:
owner = client.datasets()[0]['owner'].items()

for key, value in owner:
    print('{}: {}'.format(key, value))

id: 5fuc-pqz2
user_type: interactive
display_name: NYC OpenData


In [14]:
creator = list(client.datasets()[0]['creator'].items())
     
for key, value in creator:
    print('{}: {}'.format(key, value))     

id: 5fuc-pqz2
user_type: interactive
display_name: NYC OpenData


Once we've identified the structure of the dictionary, we try to find the 311 dataset and identify its position in the datasets list.

In [15]:
# NYC 311 dataset identifier
socrata_dataset_identifier = 'erm2-nwe9'

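# fetch the catalog once; calling client.datasets() inside the loop
# would send a new request to the API on every iteration
all_datasets = client.datasets()

for idx, dataset in enumerate(all_datasets):
    if dataset['resource']['id'] == socrata_dataset_identifier:
        print('We found the NYC 311 dataset!')
        print('Index is: {}'.format(idx))
        dataset_index = idx
        break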

We found the NYC 311 dataset!
Index is: 5
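
A list position can change as the catalog is updated, so an alternative is to index the datasets by identifier after a single fetch. This is a small sketch of that approach. `index_by_id` is a hypothetical helper, and `sample_datasets` is a trimmed stand-in for the list returned by `client.datasets()`, using names and ids that appear in this notebook.

```python
def index_by_id(datasets):
    """Map each dataset identifier to its 'resource' dictionary."""
    return {d["resource"]["id"]: d["resource"] for d in datasets}


# trimmed stand-in for client.datasets(); real entries carry many more keys
sample_datasets = [
    {"resource": {"id": "8wbx-tsch", "name": "For Hire Vehicles (FHV) - Active"}},
    {"resource": {"id": "erm2-nwe9", "name": "311 Service Requests from 2010 to Present"}},
]

resources = index_by_id(sample_datasets)
print(resources["erm2-nwe9"]["name"])
# 311 Service Requests from 2010 to Present
```

Looking up by identifier also avoids repeating the `client.datasets()` request every time a field is needed.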


Since the output of the `datasets` method is long, let's preview specific keys in the resource dictionary for the 311 dataset.

In [16]:
# preview items for 311 dataset
resource_311 = client.datasets()[dataset_index]['resource'].items()

for key, value in resource_311:
    print('{}: {}\n'.format(key, value)) 

name: 311 Service Requests from 2010 to Present

id: erm2-nwe9

parent_fxf: []

description: <b>Please note: Due to pandemic call handling modifications, the ‘Open Data Channel Type’ values may not accurately indicate the channel the Service Request was submitted in for the period starting March 2020.</b>
<p>
All 311 Service Requests from 2010 to present. This information is automatically updated daily.
</p>

attribution: 311

attribution_link: None

contact_email: None

type: dataset

updatedAt: 2024-04-23T01:34:04.000Z

createdAt: 2011-10-10T05:52:17.000Z

metadata_updated_at: 2024-01-24T21:32:48.000Z

data_updated_at: 2024-04-23T01:34:04.000Z

page_views: {'page_views_last_week': 2173, 'page_views_last_month': 9758, 'page_views_total': 829836, 'page_views_last_week_log': 11.08613622502731, 'page_views_last_month_log': 13.252517607762288, 'page_views_total_log': 19.66246845862453}

columns_name: ['Borough Boundaries', 'Community Districts', 'Zip Codes', 'Bridge Highway Segment', '

In [17]:
name = client.datasets()[dataset_index]['resource']['name']
desc = client.datasets()[dataset_index]['resource']['description']

print('The {} data description:\n\n{}.'.format(name, desc))

The 311 Service Requests from 2010 to Present data description:

<b>Please note: Due to pandemic call handling modifications, the ‘Open Data Channel Type’ values may not accurately indicate the channel the Service Request was submitted in for the period starting March 2020.</b>
<p>
All 311 Service Requests from 2010 to present. This information is automatically updated daily.
</p>.


In [18]:
created = client.datasets()[dataset_index]['resource']['createdAt']
updated = client.datasets()[dataset_index]['resource']['updatedAt']

print('created: {}\nupdated: {}.'.format(created, updated))

created: 2011-10-10T05:52:17.000Z
updated: 2024-04-23T01:34:04.000Z.
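
The timestamps are returned as strings. To compare them or compute durations, they can be parsed into `datetime` objects. The sketch below assumes the format seen in the output above (millisecond precision with a trailing `Z` for UTC). `parse_socrata_timestamp` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def parse_socrata_timestamp(value):
    """Parse a string such as '2011-10-10T05:52:17.000Z' into a UTC-aware datetime."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


# values copied from the output above
created = parse_socrata_timestamp("2011-10-10T05:52:17.000Z")
updated = parse_socrata_timestamp("2024-04-23T01:34:04.000Z")

print("dataset age at last update: {} days".format((updated - created).days))
```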


In [19]:
# NYC 311 dataset page views information
for key, value in client.datasets()[dataset_index]['resource']['page_views'].items():
    print('{}: {}'.format(key, value))

page_views_last_week: 2173
page_views_last_month: 9758
page_views_total: 829836
page_views_last_week_log: 11.08613622502731
page_views_last_month_log: 13.252517607762288
page_views_total_log: 19.66246845862453
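
The `_log` fields are not explained in the output. Judging only from the numbers printed above, they appear to equal the base-2 logarithm of the count plus one. The check below recomputes them under that assumption; it is an observation about these values, not documented API behavior.

```python
import math

# values copied from the page_views output above
page_views = {
    "page_views_last_week": 2173,
    "page_views_last_month": 9758,
    "page_views_total": 829836,
    "page_views_last_week_log": 11.08613622502731,
    "page_views_last_month_log": 13.252517607762288,
    "page_views_total_log": 19.66246845862453,
}

differences = []
for period in ("last_week", "last_month", "total"):
    count = page_views["page_views_{}".format(period)]
    reported = page_views["page_views_{}_log".format(period)]
    recomputed = math.log2(count + 1)  # assumed formula: log2(count + 1)
    differences.append(abs(reported - recomputed))
    print("{}: reported={:.6f} recomputed={:.6f}".format(period, reported, recomputed))
```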


In [20]:
client.close()

##  5.2 .get()

`get` method: Reads data from the requested resource. Options for `content_type` are JSON, CSV, and XML.

In [21]:
# source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# NYC 311 dataset identifier
socrata_dataset_identifier = 'erm2-nwe9'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=1000
)

# printing the column and value of the first record
for key, value in client.get(socrata_dataset_identifier)[0].items():
    print('{}: {}'.format(key, value))



unique_key: 60939475
created_date: 2024-04-22T01:51:26.000
agency: NYPD
agency_name: New York City Police Department
complaint_type: Noise - Residential
descriptor: Loud Music/Party
location_type: Residential Building/House
incident_zip: 11375
incident_address: 89-17 69 ROAD
street_name: 69 ROAD
cross_street_1: METROPOLITAN AVENUE
cross_street_2: OLCOTT STREET
intersection_street_1: METROPOLITAN AVENUE
intersection_street_2: OLCOTT STREET
address_type: ADDRESS
city: FOREST HILLS
landmark: 69 ROAD
status: In Progress
resolution_action_updated_date: 2024-04-22T02:23:56.000
community_board: 06 QUEENS
bbl: 4032060011
borough: QUEENS
x_coordinate_state_plane: 1024544
y_coordinate_state_plane: 198466
open_data_channel_type: ONLINE
park_facility_name: Unspecified
park_borough: QUEENS
latitude: 40.71132832064505
longitude: -73.8546568589395
location: {'latitude': '40.71132832064505', 'longitude': '-73.8546568589395', 'human_address': '{"address": "", "city": "", "state": "", "zip": ""}'}
:@com


In [23]:
# SoQL query string below:
# retrieve all columns and limit our records to 100

query = (
    """
    SELECT *
    LIMIT 100
    """
)

# returned as JSON from API / converted to Python list of dictionaries by sodapy
results = client.get(
    socrata_dataset_identifier,
    query=query
)

# convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)
client.close()

print('shape of data: {}'.format(results_df.shape))
results_df.head()

shape of data: (100, 44)


| | unique_key | created_date | agency | agency_name | complaint_type | descriptor | location_type | incident_zip | incident_address | street_name | ... | :@computed_region_sbqj_enih | :@computed_region_7mpf_4k6g | taxi_pick_up_location | closed_date | resolution_description | vehicle_type | facility_type | bridge_highway_name | bridge_highway_segment | bridge_highway_direction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60939475 | 2024-04-22T01:51:26.000 | NYPD | New York City Police Department | Noise - Residential | Loud Music/Party | Residential Building/House | 11375 | 89-17 69 ROAD | 69 ROAD | ... | 70 | 70 |  |  |  |  |  |  |  |  |
| 1 | 60938407 | 2024-04-22T01:51:25.000 | NYPD | New York City Police Department | Noise - Commercial | Loud Music/Party | Club/Bar/Restaurant | 10458 | 621 CRESCENT AVENUE | CRESCENT AVENUE | ... | 31 | 31 |  |  |  |  |  |  |  |  |
| 2 | 60935450 | 2024-04-22T01:50:12.000 | NYPD | New York City Police Department | Noise - Residential | Loud Music/Party | Residential Building/House | 11235 | 1245 AVENUE X | AVENUE X | ... | 36 | 36 |  |  |  |  |  |  |  |  |
| 3 | 60937446 | 2024-04-22T01:50:11.000 | NYPD | New York City Police Department | Noise - Commercial | Loud Music/Party | Club/Bar/Restaurant | 11373 | 92-02 CORONA AVENUE | CORONA AVENUE | ... | 68 | 68 |  |  |  |  |  |  |  |  |
| 4 | 60935436 | 2024-04-22T01:50:11.000 | NYPD | New York City Police Department | Noise - Commercial | Loud Music/Party | Club/Bar/Restaurant | 11423 | 184-15 JAMAICA AVENUE | JAMAICA AVENUE | ... | 61 | 61 |  |  |  |  |  |  |  |  |
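
SoQL queries can also select specific columns, filter rows, and sort results. The sketch below assembles such a query string from optional clauses. `build_soql` is a hypothetical helper, not a sodapy function, and the column names are taken from the record printed earlier.

```python
def build_soql(select="*", where=None, order=None, limit=None):
    """Assemble a SoQL query string from optional clauses (illustrative helper)."""
    clauses = ["SELECT {}".format(select)]
    if where:
        clauses.append("WHERE {}".format(where))
    if order:
        clauses.append("ORDER BY {}".format(order))
    if limit is not None:
        clauses.append("LIMIT {}".format(int(limit)))
    return "\n".join(clauses)


query = build_soql(
    select="unique_key, created_date, complaint_type",
    where="borough = 'QUEENS'",
    order="created_date DESC",
    limit=100,
)
print(query)

# With a live client, the string is passed the same way as in the cell above:
# results = client.get(socrata_dataset_identifier, query=query)
```

Selecting only the needed columns keeps the response smaller than `SELECT *`, which matters for a dataset as large as NYC 311.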


## 5.3  .get_metadata()

`get_metadata` method: Retrieve the metadata for a particular dataset.

In [24]:
# source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# dataset id for NYC 311 on NYC Open Data on Socrata
socrata_dataset_identifier = 'erm2-nwe9'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

# previewing keys
for key in client.get_metadata(socrata_dataset_identifier).keys():
    print(key)



id
name
assetType
attribution
averageRating
category
createdAt
description
displayType
downloadCount
hideFromCatalog
hideFromDataJson
newBackend
numberOfComments
oid
provenance
publicationAppendEnabled
publicationDate
publicationGroup
publicationStage
rowClass
rowIdentifierColumnId
rowsUpdatedAt
rowsUpdatedBy
tableId
totalTimesRated
viewCount
viewLastModified
viewType
approvals
clientContext
columns
grants
metadata
owner
query
rights
tableAuthor
tags
flags


In [25]:
# preview download count
client.get_metadata(socrata_dataset_identifier)['downloadCount']

434088
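
The `columns` key listed above holds one dictionary per column, which is useful for building a quick data dictionary. The sketch below works on a trimmed stand-in for the metadata so it runs offline. The per-column keys `fieldName` and `dataTypeName` and the sample type values are assumptions about the response shape and should be checked against the real output. `summarize_columns` is a hypothetical helper.

```python
def summarize_columns(metadata):
    """Return (field name, data type) pairs from a metadata dictionary.

    Assumes each column dictionary carries 'fieldName' and 'dataTypeName' keys.
    """
    return [
        (column.get("fieldName"), column.get("dataTypeName"))
        for column in metadata.get("columns", [])
    ]


# trimmed stand-in for client.get_metadata(socrata_dataset_identifier);
# the column entries are illustrative, not copied from the API
sample_metadata = {
    "name": "311 Service Requests from 2010 to Present",
    "downloadCount": 434088,
    "columns": [
        {"name": "Unique Key", "fieldName": "unique_key", "dataTypeName": "text"},
        {"name": "Created Date", "fieldName": "created_date", "dataTypeName": "calendar_date"},
    ],
}

for field_name, data_type in summarize_columns(sample_metadata):
    print("{}: {}".format(field_name, data_type))
```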

In [26]:
client.close()