# The Socrata Query Language (SoQL)
## Socrata Open Data API (SODA) Tutorial Using NYC Open Data
Author: Mark Bauer

Table of Contents
=================

   1. Introduction
   2. Socrata Open Data
       * 2.1 Socrata Open Data API (SODA)
       * 2.2 Sodapy
       * 2.3 Socrata Query Language (SoQL)
   3. Importing Libraries
   4. Retrieving Data Directly from the Socrata Open Data API Endpoint
   5. SoQL with Sodapy
       * 5.1 SoQL Clauses
       * 5.2 SoQL Function and Keyword Listing  

# 1. Introduction  
This notebook demonstrates how to interact with the [Socrata Open Data API (SODA)](https://dev.socrata.com/) and introduces the [Socrata Query Language (SoQL)](https://dev.socrata.com/docs/queries/), which is used for querying data from Socrata-powered platforms. We’ll use SoQL to fetch data from NYC Open Data. Additionally, this notebook introduces [Sodapy](https://github.com/xmunoz/sodapy), a Python client for SODA, and shows how to use Sodapy alongside SoQL to extract and work with data.

I'll demonstrate how to work with the Socrata API Endpoint as well as Sodapy, but my preferred method of retrieving data is with Sodapy. However, please note that the Sodapy project is now archived on GitHub.

# 2. Socrata Open Data

## 2.1 Socrata Open Data API (SODA)

The Socrata Open Data API (SODA) provides a programmatic way to access datasets, not only from NYC Open Data but from a wide range of sources globally. In my experience, it's one of the most efficient and user-friendly methods for accessing open data.

For more information, I encourage you to visit the official [Socrata Open Data API](https://dev.socrata.com/) website. You'll find a wealth of helpful resources there, and I highly recommend reviewing the [API Docs](https://dev.socrata.com/docs/endpoints.html) to deepen your understanding of how to query data effectively and efficiently. This guide is intended to complement the official documentation and help you get started quickly.

![dev socrata](images/dev-socrata.png)

Source: https://dev.socrata.com/

## 2.2 Sodapy

In addition to accessing datasets from NYC Open Data via the Socrata API Endpoint, you can also use Sodapy, a Python client. You can find more information about Sodapy in the [official documentation](https://github.com/xmunoz/sodapy) on GitHub. For an introduction to basic workflows, you can also refer to my tutorial [sodapy-basics.ipynb](https://github.com/mebauer/sodapy-tutorial-nyc-opendata/blob/main/socrata-api-basics.ipynb).

### Attention
When querying all records, set the `limit` parameter to a value larger than the total number of records in the dataset. If the number of records returned is exactly equal to `limit`, you likely haven't retrieved all the data: the result was probably cut off at the limit. In that case, raise `limit` and run the query again.
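
A minimal sketch of that check, using plain Python. The helper name `looks_truncated` and the sample data are hypothetical and not part of Sodapy; in practice `records` would be the list returned by the API.

```python
def looks_truncated(records, limit):
    """Return True when the row count reached the limit, so more rows may exist."""
    # hypothetical helper: a full page is a hint, not proof, that rows were cut off
    return len(records) >= limit


# stand-in data; in practice `records` would be the list returned by the API
sample_records = [{"unique_key": str(i)} for i in range(1000)]

print(looks_truncated(sample_records, limit=1000))    # limit reached: raise it and re-run
print(looks_truncated(sample_records, limit=100000))  # limit not reached
```

If the check returns `True`, raise the limit and query again before treating the result as complete.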

**Please note** that the Sodapy project is publicly archived on GitHub and is read-only.

In order to use Sodapy, a **source domain** (i.e. the open data source you are trying to connect to) needs to be passed to the `Socrata` class. To query a specific dataset, the **dataset identifier** (i.e. the dataset ID on the given source domain) needs to be passed with the request as well. Below, we identify NYC Open Data's source domain, `data.cityofnewyork.us`, and the dataset identifier for the NYC 311 dataset, `erm2-nwe9`. The screenshot shows the API documentation page for the 311 dataset.

![nyc-311-api-docs](images/nyc-311-api-docs.png)  

Source: https://dev.socrata.com/foundry/data.cityofnewyork.us/erm2-nwe9
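
To make the two values concrete, here is a small sketch that combines a source domain and a dataset identifier into the URLs used in this notebook. The helper names are hypothetical, and the identifier pattern (four characters, a hyphen, four characters) is an assumption based on the identifiers that appear here.

```python
import re

# assumption: dataset identifiers follow the "four-by-four" pattern seen in this
# notebook (four alphanumeric characters, a hyphen, four more)
FOUR_BY_FOUR = re.compile(r"[a-z0-9]{4}-[a-z0-9]{4}")


def resource_url(domain, dataset_id, fmt="json"):
    """Build the SODA resource URL for a dataset (hypothetical helper)."""
    if not FOUR_BY_FOUR.fullmatch(dataset_id):
        raise ValueError(f"unexpected dataset identifier: {dataset_id!r}")
    return f"https://{domain}/resource/{dataset_id}.{fmt}"


def api_docs_url(domain, dataset_id):
    """Build the API docs page URL, following the Source link shown above."""
    return f"https://dev.socrata.com/foundry/{domain}/{dataset_id}"


print(resource_url("data.cityofnewyork.us", "erm2-nwe9"))
print(api_docs_url("data.cityofnewyork.us", "erm2-nwe9"))
```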

## 2.3 Socrata Query Language (SoQL)
If you're familiar with SQL, you'll feel right at home with SoQL. This notebook demonstrates how to use SoQL to fetch data from NYC Open Data.

![soql screenshot](images/soql-screenshot.png)

Source: https://dev.socrata.com/docs/queries/
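
As a rough illustration of how closely SoQL follows SQL, the sketch below assembles a statement from its clauses in the order they are written in this notebook. The `build_soql` helper is hypothetical and only concatenates strings; it does not validate the query.

```python
def build_soql(select="*", where=None, group_by=None, having=None,
               order_by=None, limit=None):
    """Assemble a SoQL statement from optional clauses (hypothetical helper)."""
    parts = [f"SELECT {select}"]
    if where:
        parts.append(f"WHERE {where}")
    if group_by:
        parts.append(f"GROUP BY {group_by}")
    if having:
        parts.append(f"HAVING {having}")
    if order_by:
        parts.append(f"ORDER BY {order_by}")
    if limit is not None:
        parts.append(f"LIMIT {int(limit)}")
    return " ".join(parts)


query = build_soql(
    select="borough, count(*) AS count",
    where="descriptor = 'Street Flooding (SJ)'",
    group_by="borough",
    order_by="count DESC",
)
print(query)
```

Because the clauses are plain strings, this is only suitable for queries you write yourself, not for untrusted input.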

# 3. Importing Libraries

In [1]:
# importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
import seaborn as sns
import urllib.parse

# sodapy
from sodapy import Socrata

In [2]:
# documentation for installing watermark: https://github.com/rasbt/watermark (used here for reproducibility)
%reload_ext watermark
%watermark -t -d -v -p pandas,sodapy

Python implementation: CPython
Python version       : 3.11.0
IPython version      : 8.6.0

pandas: 1.5.1
sodapy: 2.2.0



# 4. Retrieving Data Directly from the Socrata Open Data API Endpoint
In this section, we'll retrieve data directly from the Socrata Open Data API (SODA) by constructing URLs with specific parameters. I typically use Sodapy, the Python client for Socrata, but it's useful to understand how to work with the API directly.

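Note: requests made without an app token are throttled. Sodapy logs the following warning when a client is created without one:  
`WARNING:root:Requests made without an app_token will be subject to strict throttling limits.`  

To avoid these limits, use an app token when making API requests.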
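
One way to attach a token to a direct request is the `X-App-Token` HTTP header. The sketch below only prepares the request and does not send it; the token value is a placeholder, and the commented lines at the end are an assumption about how you might load the response.

```python
import urllib.request

# placeholder value: replace with a token registered on the data portal
APP_TOKEN = "YOUR_APP_TOKEN"

url = "https://data.cityofnewyork.us/resource/bs59-f3nu.csv?$limit=20"

# attach the token as the X-App-Token request header
request = urllib.request.Request(url, headers={"X-App-Token": APP_TOKEN})

# inspect the prepared request; nothing is sent until urlopen() is called
print(request.full_url)
print(request.header_items())

# to send it and load the result (requires network access):
# import pandas as pd
# with urllib.request.urlopen(request) as response:
#     df = pd.read_csv(response)
```

With Sodapy, the same token can be passed as `app_token` in place of `None` when creating the client.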

In [3]:
# define the source domain for accessing NYC Open Data via Socrata API
socrata_domain = 'data.cityofnewyork.us'

# define the dataset identifier for the DEP Green Infrastructure dataset on Socrata
socrata_dataset_identifier = 'bs59-f3nu'

# set the row limit for the query (in this case, limiting to 20 rows)
limit = 20

# construct the full URL to access the dataset, specifying the limit parameter
url = f'https://{socrata_domain}/resource/{socrata_dataset_identifier}.csv?$limit={limit}'

# preview the constructed URL to ensure it's correct before making the request
print(f'Preview URL: {url}')

# load the dataset directly from the constructed URL into a pandas DataFrame
df = pd.read_csv(url)

# perform basic sanity checks on the DataFrame
# print the shape of the DataFrame (rows, columns)
print(df.shape)

# display the first few rows of the DataFrame to verify the data
df.head()

Preview URL: https://data.cityofnewyork.us/resource/bs59-f3nu.csv?$limit=20
(20, 30)


Unnamed: 0,the_geom,asset_id,gi_id,dep_contra,dep_cont_1,row_onsite,project_na,asset_type,status,asset_x_co,...,asset_leng,asset_widt,asset_area,gi_feature,tree_latin,tree_commo,constructi,construc_1,program_ar,status_gro
0,POINT (-73.81167623024226 40.69138622900597),94002.0,1A,GQJA03-02,2,ROW,DDC JAM-003 Phase 2,ROWB,Constructed (Full Maintenance),1036475.0,...,17.0,5.0,85.0,Standard,Chionanthus retusus,Chinese Fringetree,GCJA03-2A,Package-1,Right of Way (ROW),Constructed
1,POINT (-73.81228577606385 40.69238458134393),94012.0,GS6A,GQJA03-02,2,ROW,DDC JAM-003 Phase 2,ROWGS,Constructed (Full Maintenance),1036305.0,...,13.0,3.5,45.5,,No Tree,,GCJA03-2A,Package-1,Right of Way (ROW),Constructed
2,POINT (-73.8122344420821 40.69312522070409),94017.0,GS8C,GQJA03-02,2,ROW,DDC JAM-003 Phase 2,ROWGS,Constructed (Full Maintenance),1036319.0,...,20.0,3.5,70.0,,No Tree,,GCJA03-2A,Package-1,Right of Way (ROW),Constructed
3,POINT (-73.8120597400255 40.6931738947353),94019.0,GS8E,GQJA03-02,2,ROW,DDC JAM-003 Phase 2,ROWGS,Constructed (Full Maintenance),1036368.0,...,20.0,3.5,70.0,,No Tree,,GCJA03-2A,Package-1,Right of Way (ROW),Constructed
4,POINT (-73.81310191327061 40.69279332424906),94021.0,10A,GQJA03-02,2,ROW,DDC JAM-003 Phase 2,ROWB,Constructed (Full Maintenance),1036079.0,...,13.0,4.0,52.0,Standard,Quercus palustris,Pin Oak,GCJA03-2A,Package-1,Right of Way (ROW),Constructed


In [4]:
# preview columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 30 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   the_geom    20 non-null     object 
 1   asset_id    20 non-null     float64
 2   gi_id       20 non-null     object 
 3   dep_contra  20 non-null     object 
 4   dep_cont_1  20 non-null     int64  
 5   row_onsite  20 non-null     object 
 6   project_na  20 non-null     object 
 7   asset_type  20 non-null     object 
 8   status      20 non-null     object 
 9   asset_x_co  20 non-null     float64
 10  asset_y_co  20 non-null     float64
 11  borough     20 non-null     object 
 12  sewer_type  20 non-null     object 
 13  outfall     20 non-null     object 
 14  nyc_waters  20 non-null     object 
 15  bbl         20 non-null     float64
 16  secondary_  20 non-null     float64
 17  community_  20 non-null     float64
 18  city_counc  20 non-null     float64
 19  assembly_d  20 non-null     flo

## WHERE Statements
The `$where` parameter filters rows. In the cell below it is written into the URL by hand, with spaces encoded as `%20` and single quotes as `%27`.

In [5]:
# define the dataset identifier for the DEP Green Infrastructure dataset on Socrata
socrata_dataset_identifier = 'bs59-f3nu'

# WHERE statements
area = '200' # units sq ft
column = 'asset_area'
limit = 20

# construct url
url = f'https://data.cityofnewyork.us/resource/{socrata_dataset_identifier}.csv?$where={column}%20>=%20%27{area}%27&$limit={limit}'

print(f'Preview URL: {url}')

# read data into pandas DataFrame
df = pd.read_csv(url)

# sanity check
print(df.shape)
df.head()

Preview URL: https://data.cityofnewyork.us/resource/bs59-f3nu.csv?$where=asset_area%20>=%20%27200%27&$limit=20
(20, 30)


Unnamed: 0,the_geom,asset_id,gi_id,dep_contra,dep_cont_1,row_onsite,project_na,asset_type,status,asset_x_co,...,asset_leng,asset_widt,asset_area,gi_feature,tree_latin,tree_commo,constructi,construc_1,program_ar,status_gro
0,POINT (-73.93538660699062 40.69408170243197),138299.0,ROO2-05SRa,GKNC15-02-OS9,2,Onsite,Roosevelt II Houses,Subsurface Storage,Constructed,1002168.0,...,0.0,0.0,200.0,,,,,,Public Onsite,Constructed
1,POINT (-73.92090216658133 40.66345958422758),158143.0,UNST-1,BEPA-GR-1,GR,ROW,Nitrogen Consent Order SW Pilot EBP,ROWB,Constructed (Full Maintenance),1006194.0,...,40.0,5.0,200.0,Standard,Quercus bicolor,Swamp White Oak,,,Right of Way (ROW),Constructed
2,POINT (-73.91994796986424 40.670230728097),158144.0,HOWAV-1,BEPA-GR-1,GR,ROW,Nitrogen Consent Order SW Pilot EBP,ROWB,Constructed (Full Maintenance),1006457.0,...,40.0,5.0,200.0,Type B - Stormwater Inlet,Liquidambar styraciflua,Sweetgum,,,Right of Way (ROW),Constructed
3,POINT (-73.91198377084235 40.6728839138885),158145.0,EPKY-1,BEPA-GR-1,GR,ROW,Nitrogen Consent Order SW Pilot EBP,ROWB,Constructed (Full Maintenance),1008665.0,...,40.0,5.0,200.0,Standard,Liquidambar styraciflua,Sweetgum,,,Right of Way (ROW),Constructed
4,POINT (-73.75503439377286 40.71276892893676),158154.0,99AV-1,BEPA-GR-1,GR,ROW,Nitrogen Consent Order SW Pilot EBP,ROWB,Constructed (Full Maintenance),1052161.0,...,40.0,5.0,200.0,Type B - Stormwater Inlet,Nyssa sylvatica,Blackgum,,,Right of Way (ROW),Constructed


In [6]:
# sanity check
df['asset_area'].describe()

count     20.000000
mean     200.627000
std        1.367076
min      200.000000
25%      200.000000
50%      200.000000
75%      201.000000
max      206.000000
Name: asset_area, dtype: float64

In [7]:
# preview columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 30 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   the_geom    20 non-null     object 
 1   asset_id    20 non-null     float64
 2   gi_id       20 non-null     object 
 3   dep_contra  20 non-null     object 
 4   dep_cont_1  20 non-null     object 
 5   row_onsite  20 non-null     object 
 6   project_na  20 non-null     object 
 7   asset_type  20 non-null     object 
 8   status      20 non-null     object 
 9   asset_x_co  20 non-null     float64
 10  asset_y_co  20 non-null     float64
 11  borough     20 non-null     object 
 12  sewer_type  20 non-null     object 
 13  outfall     20 non-null     object 
 14  nyc_waters  20 non-null     object 
 15  bbl         20 non-null     float64
 16  secondary_  20 non-null     float64
 17  community_  20 non-null     float64
 18  city_counc  20 non-null     float64
 19  assembly_d  20 non-null     flo
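
As an alternative to encoding the `$where` clause by hand, the standard library can build the query string. This is a sketch under the same filter as the cell above; it encodes spaces as `+` rather than `%20`, which is the same form `urllib.parse.quote_plus()` produces.

```python
from urllib.parse import urlencode

socrata_domain = "data.cityofnewyork.us"
socrata_dataset_identifier = "bs59-f3nu"

# each SoQL clause is its own URL parameter
params = {
    "$where": "asset_area >= '200'",
    "$limit": 20,
}

# safe="$" keeps the leading dollar sign readable; everything else is escaped
query_string = urlencode(params, safe="$")
url = f"https://{socrata_domain}/resource/{socrata_dataset_identifier}.csv?{query_string}"

print(url)
```

The resulting URL can be passed to `pd.read_csv()` in the same way as the hand-built one.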

## QUERY Parameters
The `$query` parameter accepts a complete SoQL statement. In the cell below the statement is written into the URL by hand, with spaces encoded as `%20`.

In [8]:
# define the dataset identifier for the DEP Green Infrastructure dataset on Socrata
socrata_dataset_identifier = 'bs59-f3nu'

# QUERY parameter
query = 'SELECT%20*%20LIMIT%20100'
url = f'https://data.cityofnewyork.us/resource/{socrata_dataset_identifier}.csv?$query={query}'

print(f'Preview URL: {url}')

# read data into pandas DataFrame
df = pd.read_csv(url)

# sanity check
print(df.shape)
df.head()

Preview URL: https://data.cityofnewyork.us/resource/bs59-f3nu.csv?$query=SELECT%20*%20LIMIT%20100
(100, 30)


Unnamed: 0,the_geom,asset_id,gi_id,dep_contra,dep_cont_1,row_onsite,project_na,asset_type,status,asset_x_co,...,asset_leng,asset_widt,asset_area,gi_feature,tree_latin,tree_commo,constructi,construc_1,program_ar,status_gro
0,POINT (-73.81167623024226 40.69138622900597),94002.0,1A,GQJA03-02,2,ROW,DDC JAM-003 Phase 2,ROWB,Constructed (Full Maintenance),1036475.0,...,17.0,5.0,85.0,Standard,Chionanthus retusus,Chinese Fringetree,GCJA03-2A,Package-1,Right of Way (ROW),Constructed
1,POINT (-73.81228577606385 40.69238458134393),94012.0,GS6A,GQJA03-02,2,ROW,DDC JAM-003 Phase 2,ROWGS,Constructed (Full Maintenance),1036305.0,...,13.0,3.5,45.5,,No Tree,,GCJA03-2A,Package-1,Right of Way (ROW),Constructed
2,POINT (-73.8122344420821 40.69312522070409),94017.0,GS8C,GQJA03-02,2,ROW,DDC JAM-003 Phase 2,ROWGS,Constructed (Full Maintenance),1036319.0,...,20.0,3.5,70.0,,No Tree,,GCJA03-2A,Package-1,Right of Way (ROW),Constructed
3,POINT (-73.8120597400255 40.6931738947353),94019.0,GS8E,GQJA03-02,2,ROW,DDC JAM-003 Phase 2,ROWGS,Constructed (Full Maintenance),1036368.0,...,20.0,3.5,70.0,,No Tree,,GCJA03-2A,Package-1,Right of Way (ROW),Constructed
4,POINT (-73.81310191327061 40.69279332424906),94021.0,10A,GQJA03-02,2,ROW,DDC JAM-003 Phase 2,ROWB,Constructed (Full Maintenance),1036079.0,...,13.0,4.0,52.0,Standard,Quercus palustris,Pin Oak,GCJA03-2A,Package-1,Right of Way (ROW),Constructed


In [9]:
# preview columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 30 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   the_geom    100 non-null    object 
 1   asset_id    100 non-null    float64
 2   gi_id       100 non-null    object 
 3   dep_contra  100 non-null    object 
 4   dep_cont_1  100 non-null    int64  
 5   row_onsite  100 non-null    object 
 6   project_na  100 non-null    object 
 7   asset_type  100 non-null    object 
 8   status      100 non-null    object 
 9   asset_x_co  100 non-null    float64
 10  asset_y_co  100 non-null    float64
 11  borough     100 non-null    object 
 12  sewer_type  100 non-null    object 
 13  outfall     100 non-null    object 
 14  nyc_waters  100 non-null    object 
 15  bbl         100 non-null    float64
 16  secondary_  100 non-null    float64
 17  community_  100 non-null    float64
 18  city_counc  100 non-null    float64
 19  assembly_d  100 non-null    fl

I prefer formatting my queries with the `$query` parameter, since it accepts a full SQL-style statement as a single argument to the Socrata API. Note the use of `urllib.parse.quote_plus()` below to encode the query string for the URL.

As you'll see, we can also use the `query` parameter when working with Sodapy.

In [10]:
# define the dataset identifier for NYC 311 Complaints
socrata_dataset_identifier = 'erm2-nwe9'

# define the query to filter the dataset:
# this SQL-like query selects rows where 'created_date' is on or after
# 2020-01-01 and 'descriptor' is 'Street Flooding (SJ)'.
# The result is limited to 100 records.

query = """
    SELECT *
    WHERE
        created_date >= '2020-01-01'
        AND descriptor = 'Street Flooding (SJ)'
    LIMIT
        100
"""

# encode the query string for use in the URL. This ensures any special characters 
# in the query are properly escaped and can be safely included in the URL.
safe_string = urllib.parse.quote_plus(query)

# construct the full URL by appending the query string to the Socrata dataset URL.
# this URL points to the NYC 311 dataset, filtered by the query above.
url = f'https://data.cityofnewyork.us/resource/{socrata_dataset_identifier}.csv?$query={safe_string}'

# print the constructed URL to verify it's correct before requesting the data
print(f'Preview URL: {url}')

# load the filtered dataset into a pandas DataFrame using the URL
df = pd.read_csv(url)

# sanity checks
print(df.shape)
df.head()

Preview URL: https://data.cityofnewyork.us/resource/erm2-nwe9.csv?$query=%0A++++SELECT+%2A%0A++++WHERE%0A++++++++created_date+%3E%3D+%272020-01-01%27%0A++++++++AND+descriptor+%3D+%27Street+Flooding+%28SJ%29%27%0A++++LIMIT%0A++++++++100%0A
(100, 41)


Unnamed: 0,unique_key,created_date,closed_date,agency,agency_name,complaint_type,descriptor,location_type,incident_zip,incident_address,...,vehicle_type,taxi_company_borough,taxi_pick_up_location,bridge_highway_name,bridge_highway_direction,road_ramp,bridge_highway_segment,latitude,longitude,location
0,63250920,2024-11-30T23:29:00.000,,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),,11234,,...,,,,,,,,40.623203,-73.933152,"\n, \n(40.62320271523411, -73.93315206805134)"
1,63254360,2024-11-30T11:44:00.000,,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),,11223,1991 WEST 10 STREET,...,,,,,,,,40.598841,-73.981793,"\n, \n(40.59884144472194, -73.98179340419189)"
2,63252042,2024-11-30T08:20:00.000,,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),,10467,WEBSTER AVENUE,...,,,,,,,,,,
3,63254361,2024-11-30T08:02:00.000,,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),,11201,LIVINGSTON STREET,...,,,,,,,,,,
4,63253219,2024-11-30T07:26:00.000,,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),,11434,140-59 161 STREET,...,,,,,,,,40.669915,-73.774547,"\n, \n(40.669915085838014, -73.77454676860035)"
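
The preview URL above carries the query's newlines and indentation as `%0A` and runs of `+`. A small optional sketch: collapsing whitespace before encoding gives a shorter URL for the same query. This is a readability tweak I am suggesting, not something the API requires.

```python
import urllib.parse

query = """
    SELECT *
    WHERE
        created_date >= '2020-01-01'
        AND descriptor = 'Street Flooding (SJ)'
    LIMIT
        100
"""

# collapse newlines and indentation into single spaces before encoding
# caution: this would also collapse repeated spaces inside a quoted literal
compact_query = " ".join(query.split())
safe_string = urllib.parse.quote_plus(compact_query)

print(compact_query)
print(safe_string)
```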


# 5. SoQL with Sodapy
The primary focus of this notebook is SoQL (Socrata Query Language), which is used to query data from Socrata-powered platforms. In this section we run SoQL queries through [Sodapy](https://github.com/xmunoz/sodapy), a Python client for the Socrata API and my preferred method for extracting data from NYC Open Data. As noted earlier, the Sodapy project is archived on GitHub.

## 5.1 SoQL Clauses

In [11]:
# source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# define the dataset identifier for the DEP Green Infrastructure dataset on Socrata
socrata_dataset_identifier = 'bs59-f3nu'

# Socrata - the main class that interacts with the SODA API.
# we pass NYC Open Data's source domain, set the app token to None,
# and set the timeout to 100 seconds

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

# SoQL query string below: select all columns, limit our records to 10
query = """
    SELECT *    
    LIMIT 10
"""

# returned as JSON from API / converted to Python list of dictionaries by sodapy
results = client.get(socrata_dataset_identifier, query=query)

# convert to pandas DataFrame
df = pd.DataFrame.from_records(results)

# sanity check
print(df.shape)
df.head()



(10, 30)


Unnamed: 0,the_geom,asset_id,gi_id,dep_contra,dep_cont_1,row_onsite,project_na,asset_type,status,asset_x_co,...,asset_leng,asset_widt,asset_area,gi_feature,tree_latin,tree_commo,constructi,construc_1,program_ar,status_gro
0,"{'type': 'Point', 'coordinates': [-73.81167623...",94002.0,1A,GQJA03-02,2,ROW,DDC JAM-003 Phase 2,ROWB,Constructed (Full Maintenance),1036475.27735,...,17.0,5.0,85.0,Standard,Chionanthus retusus,Chinese Fringetree,GCJA03-2A,Package-1,Right of Way (ROW),Constructed
1,"{'type': 'Point', 'coordinates': [-73.81228577...",94012.0,GS6A,GQJA03-02,2,ROW,DDC JAM-003 Phase 2,ROWGS,Constructed (Full Maintenance),1036305.46107,...,13.0,3.5,45.5,,No Tree,,GCJA03-2A,Package-1,Right of Way (ROW),Constructed
2,"{'type': 'Point', 'coordinates': [-73.81223444...",94017.0,GS8C,GQJA03-02,2,ROW,DDC JAM-003 Phase 2,ROWGS,Constructed (Full Maintenance),1036319.11813,...,20.0,3.5,70.0,,No Tree,,GCJA03-2A,Package-1,Right of Way (ROW),Constructed
3,"{'type': 'Point', 'coordinates': [-73.81205974...",94019.0,GS8E,GQJA03-02,2,ROW,DDC JAM-003 Phase 2,ROWGS,Constructed (Full Maintenance),1036367.52667,...,20.0,3.5,70.0,,No Tree,,GCJA03-2A,Package-1,Right of Way (ROW),Constructed
4,"{'type': 'Point', 'coordinates': [-73.81310191...",94021.0,10A,GQJA03-02,2,ROW,DDC JAM-003 Phase 2,ROWB,Constructed (Full Maintenance),1036078.81888,...,13.0,4.0,52.0,Standard,Quercus palustris,Pin Oak,GCJA03-2A,Package-1,Right of Way (ROW),Constructed


In [12]:
# examine columns and datatypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 30 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   the_geom    10 non-null     object
 1   asset_id    10 non-null     object
 2   gi_id       10 non-null     object
 3   dep_contra  10 non-null     object
 4   dep_cont_1  10 non-null     object
 5   row_onsite  10 non-null     object
 6   project_na  10 non-null     object
 7   asset_type  10 non-null     object
 8   status      10 non-null     object
 9   asset_x_co  10 non-null     object
 10  asset_y_co  10 non-null     object
 11  borough     10 non-null     object
 12  sewer_type  10 non-null     object
 13  outfall     10 non-null     object
 14  nyc_waters  10 non-null     object
 15  bbl         10 non-null     object
 16  secondary_  10 non-null     object
 17  community_  10 non-null     object
 18  city_counc  10 non-null     object
 19  assembly_d  10 non-null     object
 20  asset_leng  1
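
The `info()` output above lists every column as `object`: the records arrive as dictionaries of strings, so numeric fields need an explicit conversion. Below is a minimal sketch in plain Python; `coerce_numeric` is a hypothetical helper, and the two sample records reuse values from the rows shown earlier.

```python
def coerce_numeric(records, columns):
    """Return copies of the records with the named fields converted to float."""
    converted = []
    for record in records:
        row = dict(record)  # copy so the original records stay untouched
        for column in columns:
            value = row.get(column)
            if value is not None:
                row[column] = float(value)
        converted.append(row)
    return converted


# two sample records shaped like the API results shown above (values are strings)
results = [
    {"asset_id": "94002.0", "gi_id": "1A", "asset_area": "85.0"},
    {"asset_id": "94012.0", "gi_id": "GS6A", "asset_area": "45.5"},
]

clean = coerce_numeric(results, columns=["asset_id", "asset_area"])
print(clean[0])
```

The converted list can be passed to `pd.DataFrame.from_records()` as before. Within pandas, `pd.to_numeric()` on a column is another option.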

In [13]:
socrata_domain = 'data.cityofnewyork.us'
socrata_dataset_identifier = 'erm2-nwe9'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

# SoQL query string below:
# select all columns, where the descriptor
# is Street Flooding (SJ), limit our records to 1,000

query = """
    SELECT *
    WHERE
        descriptor = 'Street Flooding (SJ)'
    LIMIT
        1000
"""

results = client.get(socrata_dataset_identifier, query=query)

df = pd.DataFrame.from_records(results)

print(df.shape)
df.head()



(1000, 30)


Unnamed: 0,unique_key,created_date,agency,agency_name,complaint_type,descriptor,incident_zip,intersection_street_1,intersection_street_2,address_type,...,longitude,location,incident_address,street_name,cross_street_1,cross_street_2,bbl,closed_date,resolution_description,resolution_action_updated_date
0,63250920,2024-11-30T23:29:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),11234,AVENUE L,TROY AVENUE,INTERSECTION,...,-73.93315206805134,"{'latitude': '40.62320271523411', 'longitude':...",,,,,,,,
1,63254360,2024-11-30T11:44:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),11223,,,ADDRESS,...,-73.98179340419189,"{'latitude': '40.59884144472194', 'longitude':...",1991 WEST 10 STREET,WEST 10 STREET,AVENUE S,AVENUE T,3070790047.0,,,
2,63252042,2024-11-30T08:20:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),10467,,,BLOCKFACE,...,,,WEBSTER AVENUE,WEBSTER AVENUE,GUNHILL ROAD,EAST 233 STREET,,,,
3,63254361,2024-11-30T08:02:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),11201,,,BLOCKFACE,...,,,LIVINGSTON STREET,LIVINGSTON STREET,SMITH STREET,GALLATIN PLACE,,,,
4,63253219,2024-11-30T07:26:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),11434,,,ADDRESS,...,-73.77454676860035,"{'latitude': '40.669915085838014', 'longitude'...",140-59 161 STREET,161 STREET,140 AVE,N CONDUIT AVE,4123170015.0,,,


### Attention
As noted in section 2.2, set `LIMIT` higher than the number of records you expect to retrieve. If the number of rows returned equals the limit, the result was probably cut off. The next two queries use `LIMIT 100000`, which is well above the row counts they return.
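
When a result may exceed any single limit, another option is to request it in pages. The sketch below shows the loop with a stand-in data source, so it runs without network access. `fetch_all` and `fake_fetch_page` are hypothetical names, and wiring `fetch_page` to a real query with a limit and an offset is left as an assumption.

```python
def fetch_all(fetch_page, page_size=1000):
    """Collect every record by requesting pages until a short page arrives."""
    records = []
    offset = 0
    while True:
        page = fetch_page(limit=page_size, offset=offset)
        records.extend(page)
        if len(page) < page_size:  # a short (or empty) page means the end
            return records
        offset += page_size


# stand-in for a real request: serves slices of an in-memory list
fake_dataset = [{"unique_key": str(i)} for i in range(2644)]


def fake_fetch_page(limit, offset):
    return fake_dataset[offset:offset + limit]


all_records = fetch_all(fake_fetch_page, page_size=1000)
print(len(all_records))
```

With a real source, each page should come from a query with a stable sort order so that pages do not overlap or skip rows.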

In [14]:
socrata_domain = 'data.cityofnewyork.us'
socrata_dataset_identifier = 'erm2-nwe9'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

# SoQL query string below:
# select all columns, where the descriptor is Street Flooding (SJ)
# and created_date is between 2011-01-01 and 2012-01-01, limit our records to 100,000

query = """
    SELECT * 
    WHERE 
        created_date BETWEEN '2011-01-01' AND '2012-01-01'
        AND descriptor = 'Street Flooding (SJ)'
    LIMIT 100000    
"""

results = client.get(socrata_dataset_identifier, query=query)

df = pd.DataFrame.from_records(results)

# sanity checks
print('sanity check:')
print('min:', df.created_date.min())
print('max:', df.created_date.max())

print(df.shape)
df.head()



sanity check:
min: 2011-01-02T10:13:00.000
max: 2011-12-31T17:03:00.000
(2644, 31)


Unnamed: 0,unique_key,created_date,closed_date,agency,agency_name,complaint_type,descriptor,incident_zip,incident_address,street_name,...,x_coordinate_state_plane,y_coordinate_state_plane,open_data_channel_type,park_facility_name,park_borough,latitude,longitude,location,intersection_street_1,intersection_street_2
0,22426149,2011-12-31T17:03:00.000,2012-01-02T08:50:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),10460.0,1956 CROTONA PARKWAY,CROTONA PARKWAY,...,1015982.0,246199.0,UNKNOWN,Unspecified,BRONX,40.84237755161368,-73.88531510513788,"{'latitude': '40.84237755161368', 'longitude':...",,
1,22424342,2011-12-30T10:00:00.000,2011-12-31T09:20:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),10024.0,,,...,991690.0,225337.0,UNKNOWN,Unspecified,MANHATTAN,40.78517106970749,-73.97313367344907,"{'latitude': '40.78517106970749', 'longitude':...",WEST 84 STREET,COLUMBUS AVENUE
2,22425059,2011-12-30T09:25:00.000,2011-12-30T13:55:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),,,,...,,,UNKNOWN,Unspecified,QUEENS,,,,GRAHAM CT,26 AVE
3,22415128,2011-12-29T17:13:00.000,2011-12-30T11:00:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),10306.0,263 COLONY AVENUE,COLONY AVENUE,...,958629.0,147876.0,UNKNOWN,Unspecified,STATEN ISLAND,40.572524396506175,-74.09222458237058,"{'latitude': '40.572524396506175', 'longitude'...",,
4,22414065,2011-12-29T12:33:00.000,2011-12-30T11:30:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),10306.0,,,...,946267.0,146214.0,UNKNOWN,Unspecified,STATEN ISLAND,40.56791819419245,-74.13671306905549,"{'latitude': '40.56791819419245', 'longitude':...",AMBER STREET,THOMAS STREET


In [15]:
socrata_domain = 'data.cityofnewyork.us'
socrata_dataset_identifier = 'erm2-nwe9'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

# SoQL query string below:
# select all columns, where the descriptor is Street Flooding (SJ),
# sort by created_date in descending order and limit our records to 100,000

query = """
    SELECT *
    WHERE
        descriptor = 'Street Flooding (SJ)'
    ORDER BY
        created_date DESC
    LIMIT
        100000
"""

results = client.get(socrata_dataset_identifier, query=query)

df = pd.DataFrame.from_records(results)

print(df.shape)
df.head()



(40949, 32)


Unnamed: 0,unique_key,created_date,agency,agency_name,complaint_type,descriptor,incident_zip,intersection_street_1,intersection_street_2,address_type,...,incident_address,street_name,cross_street_1,cross_street_2,bbl,closed_date,resolution_description,resolution_action_updated_date,facility_type,due_date
0,63250920,2024-11-30T23:29:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),11234,AVENUE L,TROY AVENUE,INTERSECTION,...,,,,,,,,,,
1,63254360,2024-11-30T11:44:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),11223,,,ADDRESS,...,1991 WEST 10 STREET,WEST 10 STREET,AVENUE S,AVENUE T,3070790047.0,,,,,
2,63252042,2024-11-30T08:20:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),10467,,,BLOCKFACE,...,WEBSTER AVENUE,WEBSTER AVENUE,GUNHILL ROAD,EAST 233 STREET,,,,,,
3,63254361,2024-11-30T08:02:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),11201,,,BLOCKFACE,...,LIVINGSTON STREET,LIVINGSTON STREET,SMITH STREET,GALLATIN PLACE,,,,,,
4,63253219,2024-11-30T07:26:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),11434,,,ADDRESS,...,140-59 161 STREET,161 STREET,140 AVE,N CONDUIT AVE,4123170015.0,,,,,


In [16]:
socrata_domain = 'data.cityofnewyork.us'
socrata_dataset_identifier = 'erm2-nwe9'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

# SoQL query string below:
# select the descriptor, borough and record count, grouped by descriptor and borough,
# where the descriptor is Street Flooding (SJ), sort the count in descending order

query = """
    SELECT 
        descriptor,
        borough, 
        count(unique_key) AS count
    WHERE 
        descriptor = 'Street Flooding (SJ)'
    GROUP BY 
        descriptor,
        borough
    ORDER BY 
        count DESC
"""

results = client.get(socrata_dataset_identifier, query=query)

df = pd.DataFrame.from_records(results)

print(df.shape)
df



(7, 3)


Unnamed: 0,descriptor,borough,count
0,Street Flooding (SJ),QUEENS,15773
1,Street Flooding (SJ),BROOKLYN,11037
2,Street Flooding (SJ),STATEN ISLAND,7257
3,Street Flooding (SJ),MANHATTAN,3535
4,Street Flooding (SJ),BRONX,3293
5,Street Flooding (SJ),Unspecified,50
6,Street Flooding (SJ),,4


In [17]:
socrata_domain = 'data.cityofnewyork.us'
socrata_dataset_identifier = 'erm2-nwe9'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

# SoQL query string below: select the descriptor, borough and record count,
# grouped by descriptor and borough, keeping only groups with a count above 5,000,
# where the descriptor is Street Flooding (SJ), sort the count in descending order

query = """
    SELECT 
        descriptor,
        borough, 
        count(*) AS count
    WHERE 
        descriptor = 'Street Flooding (SJ)'
    GROUP BY 
        descriptor,
        borough
    HAVING 
        count > 5000
    ORDER BY 
        count DESC
"""

results = client.get(socrata_dataset_identifier, query=query)

df = pd.DataFrame.from_records(results)

print(df.shape)
df.head(10)



(3, 3)


Unnamed: 0,descriptor,borough,count
0,Street Flooding (SJ),QUEENS,15773
1,Street Flooding (SJ),BROOKLYN,11037
2,Street Flooding (SJ),STATEN ISLAND,7257


## 5.2 SoQL Function and Keyword Listing

In [18]:
socrata_domain = 'data.cityofnewyork.us'
socrata_dataset_identifier = 'erm2-nwe9'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=1000
)

# SoQL query string below:
# select descriptor and count grouped by descriptor,
# where the word "flood" is in descriptor, sort count in descending order and
# limit our records to 1,000

query = """
    SELECT 
        descriptor, 
        count(unique_key) AS count
    WHERE 
        LOWER(descriptor) LIKE '%flood%'
    GROUP BY 
        descriptor
    ORDER BY 
        count DESC
    LIMIT 
        1000
"""

results = client.get(socrata_dataset_identifier, query=query)

df = pd.DataFrame.from_records(results)

print(df.shape)
df



(13, 2)


Unnamed: 0,descriptor,count
0,Catch Basin Clogged/Flooding (Use Comments) (SC),118501
1,Street Flooding (SJ),40949
2,Flood Light Lamp Out,6606
3,Highway Flooding (SH),3186
4,Flood Light Lamp Cycling,2614
5,Flooding on Street,673
6,Ready NY - Flooding,271
7,Flood Light Lamp Dayburning,237
8,Flood Light Lamp Missing,216
9,Flood Light Lamp Dim,195


In [19]:
socrata_domain = 'data.cityofnewyork.us'
socrata_dataset_identifier = 'erm2-nwe9'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

# SoQL query string below:
# select created_date truncated to the year and the record count, grouped by year,
# where the descriptor is Street Flooding (SJ), sort the count in descending order and
# limit our records to 1,000

query = """
    SELECT 
        descriptor,
        date_trunc_y(created_date) AS year,
        count(unique_key) AS count
    WHERE 
        descriptor = 'Street Flooding (SJ)'
    GROUP BY 
        descriptor,
        year
    ORDER BY 
        count DESC    
    LIMIT 
        1000
"""

results = client.get(socrata_dataset_identifier, query=query)

df = pd.DataFrame.from_records(results)

print(df.shape)
df.head(10)



(15, 3)


Unnamed: 0,descriptor,year,count
0,Street Flooding (SJ),2018-01-01T00:00:00.000,4140
1,Street Flooding (SJ),2021-01-01T00:00:00.000,3702
2,Street Flooding (SJ),2023-01-01T00:00:00.000,3484
3,Street Flooding (SJ),2019-01-01T00:00:00.000,3434
4,Street Flooding (SJ),2022-01-01T00:00:00.000,3078
5,Street Flooding (SJ),2024-01-01T00:00:00.000,2783
6,Street Flooding (SJ),2011-01-01T00:00:00.000,2644
7,Street Flooding (SJ),2017-01-01T00:00:00.000,2532
8,Street Flooding (SJ),2010-01-01T00:00:00.000,2531
9,Street Flooding (SJ),2014-01-01T00:00:00.000,2498
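
`date_trunc_y` returns a timestamp pinned to January 1 of each year, and the query sorts by count rather than by date. Here is a small sketch of turning that output into a chronological series with pandas. The rows are copied from the output above, and I'm assuming the values arrive as strings.

```python
import pandas as pd

# A few rows copied from the output above (assumed to arrive as strings)
df = pd.DataFrame({
    "year": [
        "2018-01-01T00:00:00.000",
        "2021-01-01T00:00:00.000",
        "2023-01-01T00:00:00.000",
        "2019-01-01T00:00:00.000",
    ],
    "count": ["4140", "3702", "3484", "3434"],
})

# Parse the truncated timestamps and the counts into proper dtypes
df["year"] = pd.to_datetime(df["year"])
df["count"] = pd.to_numeric(df["count"])

# Re-sort chronologically and index by date, ready for a time-series plot
by_year = df.sort_values("year").set_index("year")["count"]

print(by_year)
```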


In [20]:
socrata_domain = 'data.cityofnewyork.us'
socrata_dataset_identifier = 'erm2-nwe9'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

# SoQL query string below:
# select descriptor, created_date truncated to the year and month, and the count,
# grouped by descriptor and year_month, where the descriptor is Street Flooding (SJ),
# sort the count in descending order and limit our records to 1,000

query = """
    SELECT 
        descriptor,
        date_trunc_ym(created_date) AS year_month,
        count(unique_key) AS count
    WHERE 
        descriptor = 'Street Flooding (SJ)'
    GROUP BY 
        descriptor,
        year_month
    ORDER BY 
        count DESC    
    LIMIT 
        1000
"""

results = client.get(socrata_dataset_identifier, query=query)

df = pd.DataFrame.from_records(results)

print(df.shape)
df.head(10)



(179, 3)


Unnamed: 0,descriptor,year_month,count
0,Street Flooding (SJ),2021-09-01T00:00:00.000,1035
1,Street Flooding (SJ),2023-09-01T00:00:00.000,932
2,Street Flooding (SJ),2018-11-01T00:00:00.000,710
3,Street Flooding (SJ),2021-08-01T00:00:00.000,595
4,Street Flooding (SJ),2024-03-01T00:00:00.000,575
5,Street Flooding (SJ),2022-12-01T00:00:00.000,530
6,Street Flooding (SJ),2017-05-01T00:00:00.000,524
7,Street Flooding (SJ),2021-07-01T00:00:00.000,499
8,Street Flooding (SJ),2011-08-01T00:00:00.000,497
9,Street Flooding (SJ),2016-02-01T00:00:00.000,490
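
The `year_month` values above carry a day and time component that is always the first of the month at midnight. As a rough sketch, the standard library can shorten them to `YYYY-MM` labels for display. The helper name `month_label` is my own, and the rows are copied from the output above.

```python
from datetime import datetime

# Sample rows copied from the output above
rows = [
    {"year_month": "2021-09-01T00:00:00.000", "count": "1035"},
    {"year_month": "2023-09-01T00:00:00.000", "count": "932"},
    {"year_month": "2018-11-01T00:00:00.000", "count": "710"},
]


def month_label(timestamp_text):
    """Turn a truncated timestamp string into a short label such as '2021-09'."""
    parsed = datetime.strptime(timestamp_text, "%Y-%m-%dT%H:%M:%S.%f")
    return parsed.strftime("%Y-%m")


labels = [month_label(row["year_month"]) for row in rows]

print(labels)
```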


In [21]:
socrata_domain = 'data.cityofnewyork.us'
socrata_dataset_identifier = 'erm2-nwe9'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

# SoQL query string below:
# select descriptor, created_date truncated to the year, month and day, and the count,
# grouped by descriptor and year_month_day, where the descriptor is Street Flooding (SJ),
# sort the count in descending order and limit our records to 1,000

query = """
    SELECT 
        descriptor,
        date_trunc_ymd(created_date) AS year_month_day,
        count(unique_key) AS count
    WHERE 
        descriptor = 'Street Flooding (SJ)'
    GROUP BY 
        descriptor,
        year_month_day
    ORDER BY 
        count DESC    
    LIMIT 
        1000
"""

results = client.get(socrata_dataset_identifier, query=query)

df = pd.DataFrame.from_records(results)

print(df.shape)
df.head(10)



(1000, 3)


Unnamed: 0,descriptor,year_month_day,count
0,Street Flooding (SJ),2023-09-29T00:00:00.000,623
1,Street Flooding (SJ),2021-09-02T00:00:00.000,350
2,Street Flooding (SJ),2021-09-01T00:00:00.000,344
3,Street Flooding (SJ),2022-12-23T00:00:00.000,308
4,Street Flooding (SJ),2017-05-05T00:00:00.000,247
5,Street Flooding (SJ),2014-12-09T00:00:00.000,226
6,Street Flooding (SJ),2014-04-30T00:00:00.000,189
7,Street Flooding (SJ),2021-10-26T00:00:00.000,177
8,Street Flooding (SJ),2018-04-16T00:00:00.000,163
9,Street Flooding (SJ),2013-05-08T00:00:00.000,162
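
The shape above, `(1000, 3)`, matches the `LIMIT` of 1,000 exactly, which suggests the daily results were cut off. One common way around this is to request pages with `LIMIT` and `OFFSET` until a short page comes back. The sketch below shows only the paging loop: `fetch_page` is a hypothetical callable that would wrap `client.get` with a paged query, and `fake_fetch` is a stand-in so the example runs offline.

```python
def fetch_all_pages(fetch_page, page_size):
    """Collect records page by page until a page comes back shorter than page_size."""
    records = []
    offset = 0
    while True:
        page = fetch_page(page_size, offset)
        records.extend(page)
        if len(page) < page_size:
            # A short (or empty) page means there is nothing left to fetch
            break
        offset += page_size
    return records


# Offline stand-in for the API: 2,500 fake rows served in slices
fake_rows = [{"row": i} for i in range(2500)]


def fake_fetch(limit, offset):
    return fake_rows[offset:offset + limit]


all_rows = fetch_all_pages(fake_fetch, page_size=1000)

print(len(all_rows))
```

Paging like this relies on a stable sort order; when sorting by count, ties could shuffle between pages, so adding a second sort column would be a sensible precaution.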


In [22]:
socrata_domain = 'data.cityofnewyork.us'
socrata_dataset_identifier = 'erm2-nwe9'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

# SoQL query string below:
# select descriptor, the year extracted from created_date, and the count,
# grouped by descriptor and year, where the descriptor is Street Flooding (SJ),
# sort the count in descending order and limit our records to 1,000

query = """
    SELECT 
        descriptor,
        date_extract_y(created_date) AS year,
        count(unique_key) AS count
    WHERE 
        descriptor = 'Street Flooding (SJ)'
    GROUP BY 
        descriptor,
        year
    ORDER BY 
        count DESC    
    LIMIT 
        1000
"""

results = client.get(socrata_dataset_identifier, query=query)

df = pd.DataFrame.from_records(results)

print(df.shape)
df.head(10)



(15, 3)


Unnamed: 0,descriptor,year,count
0,Street Flooding (SJ),2018,4140
1,Street Flooding (SJ),2021,3702
2,Street Flooding (SJ),2023,3484
3,Street Flooding (SJ),2019,3434
4,Street Flooding (SJ),2022,3078
5,Street Flooding (SJ),2024,2783
6,Street Flooding (SJ),2011,2644
7,Street Flooding (SJ),2017,2532
8,Street Flooding (SJ),2010,2531
9,Street Flooding (SJ),2014,2498
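
Comparing the two year outputs, `date_trunc_y` keeps a full timestamp while `date_extract_y` returns just the year number. Here is a rough Python analogy using the standard library; the timestamp is made up for illustration.

```python
from datetime import datetime

# A made-up complaint timestamp, for illustration only
created = datetime(2021, 9, 1, 22, 15)

# Analogous to date_trunc_y: keep a timestamp, reset to the start of the year
truncated_to_year = created.replace(
    month=1, day=1, hour=0, minute=0, second=0, microsecond=0
)

# Analogous to date_extract_y: pull out the year as a plain integer
extracted_year = created.year

print(truncated_to_year.isoformat(), extracted_year)
```

Truncation is handy when the result should still behave like a date (for example on a time axis), while extraction suits grouping across years, as in the month and day-of-week queries.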


In [23]:
socrata_domain = 'data.cityofnewyork.us'
socrata_dataset_identifier = 'erm2-nwe9'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

# SoQL query string below:
# select descriptor, the month extracted from created_date, and the count,
# grouped by descriptor and month, where the descriptor is Street Flooding (SJ),
# sort the count in descending order and limit our records to 1,000

query = """
    SELECT 
        descriptor,
        date_extract_m(created_date) AS month,
        count(unique_key) AS count
    WHERE 
        descriptor = 'Street Flooding (SJ)'
    GROUP BY 
        descriptor,
        month
    ORDER BY 
        count DESC    
    LIMIT 
        1000
"""

results = client.get(socrata_dataset_identifier, query=query)

df = pd.DataFrame.from_records(results)

print(df.shape)
df.head(10)



(12, 3)


Unnamed: 0,descriptor,month,count
0,Street Flooding (SJ),5,4331
1,Street Flooding (SJ),9,4249
2,Street Flooding (SJ),8,4020
3,Street Flooding (SJ),7,3902
4,Street Flooding (SJ),6,3450
5,Street Flooding (SJ),12,3333
6,Street Flooding (SJ),3,3202
7,Street Flooding (SJ),4,3098
8,Street Flooding (SJ),10,3059
9,Street Flooding (SJ),2,2907
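
Month numbers are compact but not very readable. A small sketch of mapping them to names with the standard library's `calendar` module, using rows copied from the output above (values assumed to arrive as strings):

```python
import calendar

# Top three rows copied from the output above
rows = [
    {"month": "5", "count": "4331"},
    {"month": "9", "count": "4249"},
    {"month": "8", "count": "4020"},
]

# calendar.month_name is indexed 1-12, matching the month numbers shown above
by_month = {
    calendar.month_name[int(row["month"])]: int(row["count"])
    for row in rows
}

print(by_month)
```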


In [24]:
socrata_domain = 'data.cityofnewyork.us'
socrata_dataset_identifier = 'erm2-nwe9'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

# SoQL query string below:
# select descriptor, the day of the month extracted from created_date, and the count,
# grouped by descriptor and day, where the descriptor is Street Flooding (SJ),
# sort the count in descending order and limit our records to 1,000

query = """
    SELECT 
        descriptor,
        date_extract_d(created_date) AS day,
        count(unique_key) AS count
    WHERE 
        descriptor = 'Street Flooding (SJ)'
    GROUP BY 
        descriptor,
        day
    ORDER BY 
        count DESC    
    LIMIT 
        1000
"""

results = client.get(socrata_dataset_identifier, query=query)

df = pd.DataFrame.from_records(results)

print(df.shape)
df.head(10)



(31, 3)


Unnamed: 0,descriptor,day,count
0,Street Flooding (SJ),29,1972
1,Street Flooding (SJ),23,1745
2,Street Flooding (SJ),2,1744
3,Street Flooding (SJ),1,1724
4,Street Flooding (SJ),30,1670
5,Street Flooding (SJ),13,1590
6,Street Flooding (SJ),9,1531
7,Street Flooding (SJ),18,1495
8,Street Flooding (SJ),8,1458
9,Street Flooding (SJ),25,1457


In [25]:
socrata_domain = 'data.cityofnewyork.us'
socrata_dataset_identifier = 'erm2-nwe9'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

# SoQL query string below:
# select descriptor, the week of year extracted from created_date, and the count,
# grouped by descriptor and week_of_year, where the descriptor is Street Flooding (SJ),
# sort the count in descending order and limit our records to 1,000

query = """
    SELECT 
        descriptor,
        date_extract_woy(created_date) AS week_of_year,
        count(unique_key) AS count
    WHERE 
        descriptor = 'Street Flooding (SJ)'
    GROUP BY 
        descriptor,
        week_of_year
    ORDER BY 
        count DESC    
    LIMIT 
        1000
"""

results = client.get(socrata_dataset_identifier, query=query)

df = pd.DataFrame.from_records(results)

print(df.shape)
df.head(10)



(53, 3)


Unnamed: 0,descriptor,week_of_year,count
0,Street Flooding (SJ),39,1440
1,Street Flooding (SJ),18,1360
2,Street Flooding (SJ),35,1214
3,Street Flooding (SJ),33,1179
4,Street Flooding (SJ),30,1104
5,Street Flooding (SJ),32,1061
6,Street Flooding (SJ),51,1052
7,Street Flooding (SJ),23,955
8,Street Flooding (SJ),29,919
9,Street Flooding (SJ),20,913


In [26]:
socrata_domain = 'data.cityofnewyork.us'
socrata_dataset_identifier = 'erm2-nwe9'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

# SoQL query string below:
# select descriptor, the day of week extracted from created_date, and the count,
# grouped by descriptor and day_of_week, where the descriptor is Street Flooding (SJ),
# sort the count in descending order and limit our records to 1,000

query = """
    SELECT 
        descriptor,
        date_extract_dow(created_date) AS day_of_week,
        count(unique_key) AS count
    WHERE 
        descriptor = 'Street Flooding (SJ)'
    GROUP BY 
        descriptor,
        day_of_week
    ORDER BY 
        count DESC    
    LIMIT 
        1000
"""

results = client.get(socrata_dataset_identifier, query=query)

df = pd.DataFrame.from_records(results)

print(df.shape)
df.head(10)



(7, 3)


Unnamed: 0,descriptor,day_of_week,count
0,Street Flooding (SJ),5,7324
1,Street Flooding (SJ),2,6889
2,Street Flooding (SJ),1,6715
3,Street Flooding (SJ),3,6529
4,Street Flooding (SJ),4,6298
5,Street Flooding (SJ),0,3607
6,Street Flooding (SJ),6,3587
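
The day-of-week values run from 0 to 6. The sketch below maps them to names and compares weekdays with the weekend. It assumes 0 is Sunday and 6 is Saturday, which is consistent with the two lowest counts above falling on 0 and 6; please confirm against the SoQL documentation before relying on it.

```python
# Assumption: date_extract_dow numbers days 0 = Sunday through 6 = Saturday
DAY_NAMES = [
    "Sunday", "Monday", "Tuesday", "Wednesday",
    "Thursday", "Friday", "Saturday",
]

# All seven rows copied from the output above (values assumed to be strings)
rows = [
    {"day_of_week": "5", "count": "7324"},
    {"day_of_week": "2", "count": "6889"},
    {"day_of_week": "1", "count": "6715"},
    {"day_of_week": "3", "count": "6529"},
    {"day_of_week": "4", "count": "6298"},
    {"day_of_week": "0", "count": "3607"},
    {"day_of_week": "6", "count": "3587"},
]

# Replace each day number with its name and convert counts to integers
named = {DAY_NAMES[int(r["day_of_week"])]: int(r["count"]) for r in rows}

weekend_total = named["Saturday"] + named["Sunday"]
weekday_total = sum(named.values()) - weekend_total

print(named)
print(weekday_total, weekend_total)
```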


In [27]:
socrata_domain = 'data.cityofnewyork.us'
socrata_dataset_identifier = 'erm2-nwe9'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

# SoQL query string below:
# select descriptor, the hour extracted from created_date, and the count,
# grouped by descriptor and hour, where the descriptor is Street Flooding (SJ),
# sort the count in descending order and limit our records to 1,000

query = """
    SELECT 
        descriptor,
        date_extract_hh(created_date) AS hour,
        count(unique_key) AS count
    WHERE 
        descriptor = 'Street Flooding (SJ)'
    GROUP BY 
        descriptor,
        hour
    ORDER BY 
        count DESC    
    LIMIT
        1000
"""

results = client.get(socrata_dataset_identifier, query=query)

df = pd.DataFrame.from_records(results)

print(df.shape)
df.head(10)



(24, 3)


Unnamed: 0,descriptor,hour,count
0,Street Flooding (SJ),11,3500
1,Street Flooding (SJ),9,3443
2,Street Flooding (SJ),10,3438
3,Street Flooding (SJ),12,3114
4,Street Flooding (SJ),15,2999
5,Street Flooding (SJ),14,2926
6,Street Flooding (SJ),16,2861
7,Street Flooding (SJ),13,2849
8,Street Flooding (SJ),8,2611
9,Street Flooding (SJ),17,2266
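
The date queries in this section differ only in the date function and its alias. As a sketch, the query string could be generated from a template instead of being retyped in each cell. `QUERY_TEMPLATE` and `build_query` are names I'm introducing here for illustration, not part of Sodapy or SoQL.

```python
# Template mirroring the queries above; only the function and alias vary
QUERY_TEMPLATE = """
    SELECT
        descriptor,
        {function}(created_date) AS {alias},
        count(unique_key) AS count
    WHERE
        descriptor = 'Street Flooding (SJ)'
    GROUP BY
        descriptor,
        {alias}
    ORDER BY
        count DESC
    LIMIT
        1000
"""


def build_query(function, alias):
    """Fill in the SoQL date function and the alias used for the grouped column."""
    return QUERY_TEMPLATE.format(function=function, alias=alias)


query = build_query("date_extract_hh", "hour")

print(query)
```

The resulting string could then be passed to `client.get(socrata_dataset_identifier, query=query)` exactly as in the cells above.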


In [28]:
# close connection
client.close()
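
Closing the client by hand works, but it is easy to forget, especially if a query raises an error first. One option is `contextlib.closing` from the standard library, which calls `close()` on any object when the `with` block ends. The sketch below uses `FakeClient`, a hypothetical stand-in for the Sodapy client, so that it runs without a network connection.

```python
from contextlib import closing


class FakeClient:
    """Hypothetical stand-in for the Sodapy client so the sketch runs offline."""

    def __init__(self):
        self.closed = False

    def get(self, dataset_identifier, query=None):
        # Returns one canned record instead of calling the API
        return [{"descriptor": "Street Flooding (SJ)", "count": "40949"}]

    def close(self):
        self.closed = True


client = FakeClient()

# closing() guarantees client.close() runs when the block exits,
# even if the query inside raises an exception
with closing(client) as c:
    results = c.get("erm2-nwe9", query="SELECT descriptor, count(unique_key) AS count")

print(results, client.closed)
```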