# The Socrata Query Language (SoQL)
## Socrata Open Data API (SODA) Tutorial Using NYC Open Data
Author: Mark Bauer

Table of Contents
=================

   1. Introduction
   2. The Socrata Open Data API
       * 2.1 Socrata Open Data API (SODA)
       * 2.2 Sodapy
       * 2.3 Socrata Query Language (SoQL)
   3. Importing Libraries
   4. Retrieving Data Directly from the Socrata Open Data API (SODA)
   5. SoQL with Sodapy
       * 5.1 SoQL Clauses
       * 5.2 SoQL Function and Keyword Listing  

# 1. Introduction  
This notebook demonstrates how to interact with the Socrata Open Data API (SODA) and introduces the Socrata Query Language (SoQL), which is used for querying data from Socrata-powered platforms. We’ll use SoQL to fetch data from NYC Open Data. Additionally, this notebook introduces sodapy, a Python client for SODA, and shows how to use sodapy alongside SoQL to extract and work with data.

# 2. The Socrata Open Data API

## 2.1 Socrata Open Data API (SODA)

The Socrata Open Data API (SODA) provides a programmatic way to access datasets, not only from NYC Open Data but from a wide range of sources globally. In my experience, it's one of the most efficient and user-friendly methods for accessing open data.

For more information, visit the official [Socrata Open Data API (SODA)](https://dev.socrata.com/) website. You'll find a wealth of helpful resources there, and I highly recommend reviewing the [API Docs](https://dev.socrata.com/docs/endpoints.html) to deepen your understanding.

![dev socrata](images/dev-socrata.png)

Source: https://dev.socrata.com/

## 2.2 Sodapy

In addition to accessing datasets from NYC Open Data via the Socrata API, you can also use sodapy, a Python client. You can find more information about sodapy in the [official documentation](https://github.com/xmunoz/sodapy) on GitHub, as well as sample workflows in my tutorial notebook [sodapy-basics.ipynb](https://github.com/mebauer/sodapy-tutorial-nyc-open-data/blob/main/sodapy-basics.ipynb).

**Please note** that the sodapy project is publicly archived on GitHub and is read-only.

To use sodapy, you pass a **source domain** (i.e. the open data source you are connecting to) to the `Socrata` class. To query a specific dataset, you also pass the **dataset identifier** (i.e. the dataset id on the given source domain). Below, we identify NYC Open Data's source domain `data.cityofnewyork.us` and the dataset identifier for the NYC 311 dataset `erm2-nwe9`. The screenshot shows the API documentation page for the 311 dataset from NYC Open Data.

![nyc-311-api-docs](images/nyc-311-api-docs.png)  

Source: https://dev.socrata.com/foundry/data.cityofnewyork.us/erm2-nwe9
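As a minimal sketch of how those two values fit together, the helper below assembles the resource endpoint that the rest of this notebook queries. The function name `build_resource_url` is my own for illustration and is not part of sodapy; the `/resource/<id>.<format>` pattern is the one used in the URLs later in this notebook.

```python
def build_resource_url(domain: str, dataset_id: str, fmt: str = "csv") -> str:
    """Return the SODA resource endpoint for a dataset (illustrative helper)."""
    return f"https://{domain}/resource/{dataset_id}.{fmt}"


# source domain and dataset identifier for NYC 311 on NYC Open Data
socrata_domain = "data.cityofnewyork.us"
socrata_dataset_identifier = "erm2-nwe9"

url = build_resource_url(socrata_domain, socrata_dataset_identifier)
print(url)
```

Swapping `fmt` to `"json"` points at the JSON version of the same endpoint.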

## 2.3 Socrata Query Language (SoQL)
If you're familiar with SQL, you'll feel right at home with SoQL. This notebook demonstrates how to use SoQL to fetch data from NYC Open Data.

![soql screenshot](images/soql-screenshot.png)

Source: https://dev.socrata.com/docs/queries/
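One difference from SQL is that there is no `FROM` clause: the dataset is identified by the endpoint, and each clause can be sent as its own `$`-prefixed URL parameter. The sketch below builds such a URL with the standard library and does not send a request. The selected columns are just an example, and `safe="$"` is only there to keep the parameter names readable in the printed URL.

```python
from urllib.parse import urlencode

# SoQL clauses expressed as individual request parameters
params = {
    "$select": "unique_key, created_date, descriptor",
    "$where": "descriptor = 'Street Flooding (SJ)'",
    "$order": "created_date DESC",
    "$limit": 5,
}

base_url = "https://data.cityofnewyork.us/resource/erm2-nwe9.csv"

# urlencode escapes spaces, quotes, and parentheses in the values
url = f"{base_url}?{urlencode(params, safe='$')}"
print(url)
```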

# 3. Importing Libraries

In [1]:
# importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
import seaborn as sns
import urllib.parse

# sodapy
from sodapy import Socrata

In [2]:
# documentation for installing watermark: https://github.com/rasbt/watermark
# perform for reproducibility
%reload_ext watermark
%watermark -t -d -v -p pandas,sodapy

Python implementation: CPython
Python version       : 3.11.0
IPython version      : 8.6.0

pandas: 1.5.1
sodapy: 2.2.0



# 4. Retrieving Data Directly from the Socrata Open Data API (SODA)
In this section, we’ll walk through how to retrieve data directly from the Socrata Open Data API (SODA) by constructing URLs with specific parameters. Although I use sodapy, the Python client for Socrata, whenever possible, it’s useful to understand how to work with the API directly.

Note: requests made without an app token are subject to strict throttling limits. sodapy logs the following warning when no token is supplied:  
`WARNING:root:Requests made without an app_token will be subject to strict throttling limits.`  

To avoid these limits, it's recommended to use an app token when making API requests.
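As a rough sketch of what attaching a token looks like when calling the API directly, the snippet below builds (but does not send) a request with the `X-App-Token` header. The helper name `build_request` and the placeholder token are my own assumptions for illustration.

```python
import urllib.request
from typing import Optional


def build_request(url: str, app_token: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a request, attaching an app token if one is given."""
    request = urllib.request.Request(url)
    if app_token:
        # Socrata reads the application token from the X-App-Token header
        request.add_header("X-App-Token", app_token)
    return request


url = "https://data.cityofnewyork.us/resource/erm2-nwe9.csv?$limit=20"

# "YOUR_APP_TOKEN" is a placeholder, not a real token
request = build_request(url, app_token="YOUR_APP_TOKEN")

# normalize header names to lowercase for easy inspection
headers = {name.lower(): value for name, value in request.header_items()}
print(headers)
```

With sodapy, the token is passed through the `app_token` argument of the `Socrata` class, in place of the `None` used in this notebook.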

In [3]:
# define the source domain for accessing NYC Open Data via Socrata API
socrata_domain = 'data.cityofnewyork.us'

# define the dataset identifier for the NYC 311 dataset on Socrata
socrata_dataset_identifier = 'erm2-nwe9'

# set the row limit for the query (in this case, limiting to 20 rows)
limit = 20

# construct the full URL to access the dataset, specifying the limit parameter
url = f'https://{socrata_domain}/resource/{socrata_dataset_identifier}.csv?$limit={limit}'

# preview the constructed URL to ensure it's correct before making the request
print(f'Preview URL: {url}')

# load the dataset directly from the constructed URL into a pandas DataFrame
df = pd.read_csv(url)

## perform basic sanity checks on the DataFrame
# print the shape of the DataFrame (rows, columns)
print(df.shape)

# display the first few rows of the DataFrame to verify the data
df.head()

Preview URL: https://data.cityofnewyork.us/resource/erm2-nwe9.csv?$limit=20
(20, 41)


Unnamed: 0,unique_key,created_date,closed_date,agency,agency_name,complaint_type,descriptor,location_type,incident_zip,incident_address,...,vehicle_type,taxi_company_borough,taxi_pick_up_location,bridge_highway_name,bridge_highway_direction,road_ramp,bridge_highway_segment,latitude,longitude,location
0,63231133,2024-11-29T04:00:33.000,,DOT,Department of Transportation,Street Condition,Pothole,,11213.0,,...,,,,,,,,40.665687,-73.93976,"\n, \n(40.66568740802441, -73.93975997355022)"
1,63233902,2024-11-29T01:51:28.000,,NYPD,New York City Police Department,Noise - Residential,Loud Music/Party,Residential Building/House,,,...,,,,,,,,40.892051,-73.860009,"\n, \n(40.89205062779512, -73.86000893554395)"
2,63231561,2024-11-29T01:51:03.000,,DPR,Department of Parks and Recreation,Illegal Tree Damage,Bicycle Chained to Tree,Street,,,...,,,,,,,,40.83195,-73.929394,"\n, \n(40.83195011687482, -73.92939377993235)"
3,63232135,2024-11-29T01:50:50.000,,NYPD,New York City Police Department,Noise - Residential,Loud Music/Party,Residential Building/House,,,...,,,,,,,,40.892051,-73.860009,"\n, \n(40.89205062779512, -73.86000893554395)"
4,63233043,2024-11-29T01:50:14.000,,NYPD,New York City Police Department,Noise - Residential,Loud Music/Party,Residential Building/House,,,...,,,,,,,,40.734143,-73.774759,"\n, \n(40.73414322554409, -73.7747588426775)"


In [4]:
# preview columns and datatypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 41 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   unique_key                      20 non-null     int64  
 1   created_date                    20 non-null     object 
 2   closed_date                     0 non-null      float64
 3   agency                          20 non-null     object 
 4   agency_name                     20 non-null     object 
 5   complaint_type                  20 non-null     object 
 6   descriptor                      20 non-null     object 
 7   location_type                   19 non-null     object 
 8   incident_zip                    1 non-null      float64
 9   incident_address                0 non-null      float64
 10  street_name                     0 non-null      float64
 11  cross_street_1                  0 non-null      float64
 12  cross_street_2                  0 non-

## WHERE Statements
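The `$where` parameter filters rows. In the cell below, the WHERE clause is URL-encoded by hand (`%20` is a space and `%27` is a single quote) and appended to the URL.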

In [5]:
# WHERE statements
year = '2020'
column = 'created_date'
limit = 20

# construct url
url = f'https://data.cityofnewyork.us/resource/erm2-nwe9.csv?$where={column}%20>=%20%27{year}%27&$limit={limit}'

print(f'Preview URL: {url}')

# read data into pandas DataFrame
df = pd.read_csv(url)

# sanity check
print(df.shape)
df.head()

Preview URL: https://data.cityofnewyork.us/resource/erm2-nwe9.csv?$where=created_date%20>=%20%272020%27&$limit=20
(20, 41)


Unnamed: 0,unique_key,created_date,closed_date,agency,agency_name,complaint_type,descriptor,location_type,incident_zip,incident_address,...,vehicle_type,taxi_company_borough,taxi_pick_up_location,bridge_highway_name,bridge_highway_direction,road_ramp,bridge_highway_segment,latitude,longitude,location
0,63231133,2024-11-29T04:00:33.000,,DOT,Department of Transportation,Street Condition,Pothole,,11213.0,,...,,,,,,,,40.665687,-73.93976,"\n, \n(40.66568740802441, -73.93975997355022)"
1,63233902,2024-11-29T01:51:28.000,,NYPD,New York City Police Department,Noise - Residential,Loud Music/Party,Residential Building/House,,,...,,,,,,,,40.892051,-73.860009,"\n, \n(40.89205062779512, -73.86000893554395)"
2,63231561,2024-11-29T01:51:03.000,,DPR,Department of Parks and Recreation,Illegal Tree Damage,Bicycle Chained to Tree,Street,,,...,,,,,,,,40.83195,-73.929394,"\n, \n(40.83195011687482, -73.92939377993235)"
3,63232135,2024-11-29T01:50:50.000,,NYPD,New York City Police Department,Noise - Residential,Loud Music/Party,Residential Building/House,,,...,,,,,,,,40.892051,-73.860009,"\n, \n(40.89205062779512, -73.86000893554395)"
4,63233043,2024-11-29T01:50:14.000,,NYPD,New York City Police Department,Noise - Residential,Loud Music/Party,Residential Building/House,,,...,,,,,,,,40.734143,-73.774759,"\n, \n(40.73414322554409, -73.7747588426775)"
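Hand-encoded URLs are hard to read, so it can help to decode one and confirm it says what you intended. A small sketch using the standard library on the preview URL printed above:

```python
from urllib.parse import parse_qs, urlsplit

url = (
    "https://data.cityofnewyork.us/resource/erm2-nwe9.csv"
    "?$where=created_date%20>=%20%272020%27&$limit=20"
)

# split off the query string and decode each parameter
params = parse_qs(urlsplit(url).query)

for name, values in params.items():
    print(name, "->", values[0])
```

The decoded `$where` value reads `created_date >= '2020'`, which is the clause the f-string was meant to produce.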


## QUERY Parameters
The `$query` parameter accepts a complete SoQL statement as a single URL parameter. In the cell below, the statement is again URL-encoded by hand.

In [6]:
# QUERY parameter
query = 'SELECT%20*%20LIMIT%2020'
url = f'https://data.cityofnewyork.us/resource/erm2-nwe9.csv?$query={query}'

print(f'Preview URL: {url}')

# read data into pandas DataFrame
df = pd.read_csv(url)

# sanity check
print(df.shape)
df.head()

Preview URL: https://data.cityofnewyork.us/resource/erm2-nwe9.csv?$query=SELECT%20*%20LIMIT%2020
(20, 47)


Unnamed: 0,unique_key,created_date,closed_date,agency,agency_name,complaint_type,descriptor,location_type,incident_zip,incident_address,...,bridge_highway_segment,latitude,longitude,location,:@computed_region_efsh_h5xi,:@computed_region_f5dn_yrer,:@computed_region_yeji_bk3q,:@computed_region_92fq_4b7q,:@computed_region_sbqj_enih,:@computed_region_7mpf_4k6g
0,63231133,2024-11-29T04:00:33.000,,DOT,Department of Transportation,Street Condition,Pothole,,11213.0,,...,,40.665687,-73.93976,"\n, \n(40.66568740802441, -73.93975997355022)",17615.0,17.0,2.0,48.0,44.0,44.0
1,63233902,2024-11-29T01:51:28.000,,NYPD,New York City Police Department,Noise - Residential,Loud Music/Party,Residential Building/House,,,...,,40.892051,-73.860009,"\n, \n(40.89205062779512, -73.86000893554395)",11275.0,29.0,5.0,2.0,30.0,30.0
2,63231561,2024-11-29T01:51:03.000,,DPR,Department of Parks and Recreation,Illegal Tree Damage,Bicycle Chained to Tree,Street,,,...,,40.83195,-73.929394,"\n, \n(40.83195011687482, -73.92939377993235)",10930.0,50.0,5.0,35.0,27.0,27.0
3,63232135,2024-11-29T01:50:50.000,,NYPD,New York City Police Department,Noise - Residential,Loud Music/Party,Residential Building/House,,,...,,40.892051,-73.860009,"\n, \n(40.89205062779512, -73.86000893554395)",11275.0,29.0,5.0,2.0,30.0,30.0
4,63233043,2024-11-29T01:50:14.000,,NYPD,New York City Police Department,Noise - Residential,Loud Music/Party,Residential Building/House,,,...,,40.734143,-73.774759,"\n, \n(40.73414322554409, -73.7747588426775)",14508.0,25.0,3.0,16.0,65.0,65.0


I prefer formatting my queries with the `$query` parameter because it accepts a full SQL-style statement as a single argument to the Socrata API. The next cell uses the `urllib.parse.quote_plus()` function to encode the query string so it can be safely included in the URL.

As you'll see in Section 5, the query parameter is also available when working with sodapy, a Python client.

In [7]:
# define the query to filter the dataset:
# this SQL-like query will select rows 
# where the 'created_date' is greater than or equal to 2020 and the 'descriptor' 
# is 'Street Flooding (SJ)'. The result is limited to the first 100 records.

query = """
    SELECT *
    WHERE
        created_date >= '2020-01-01'
        AND descriptor = 'Street Flooding (SJ)'
    LIMIT
        100
"""

# encode the query string for use in the URL. This ensures any special characters 
# in the query are properly escaped and can be safely included in the URL.
safe_string = urllib.parse.quote_plus(query)

# construct the full URL by appending the query string to the Socrata dataset URL.
# this URL points to the NYC 311 dataset, filtered by the query above.
url = f'https://data.cityofnewyork.us/resource/erm2-nwe9.csv?$query={safe_string}'

# print the constructed URL to verify it's correct before requesting the data
print(f'Preview URL: {url}')

# load the filtered dataset into a pandas DataFrame using the URL
df = pd.read_csv(url)

# sanity checks
print(df.shape)
df.head()

Preview URL: https://data.cityofnewyork.us/resource/erm2-nwe9.csv?$query=%0A++++SELECT+%2A%0A++++WHERE%0A++++++++created_date+%3E%3D+%272020-01-01%27%0A++++++++AND+descriptor+%3D+%27Street+Flooding+%28SJ%29%27%0A++++LIMIT%0A++++++++100%0A
(100, 41)


Unnamed: 0,unique_key,created_date,closed_date,agency,agency_name,complaint_type,descriptor,location_type,incident_zip,incident_address,...,vehicle_type,taxi_company_borough,taxi_pick_up_location,bridge_highway_name,bridge_highway_direction,road_ramp,bridge_highway_segment,latitude,longitude,location
0,63233749,2024-11-28T22:26:00.000,,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),,10473,210 BETTS AVENUE,...,,,,,,,,40.810312,-73.850236,"\n, \n(40.8103119559549, -73.85023582999517)"
1,63234576,2024-11-28T19:38:00.000,2024-11-28T20:34:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),,11434,177-37 135 AVENUE,...,,,,,,,,40.675681,-73.762969,"\n, \n(40.67568056757403, -73.76296916912234)"
2,63235424,2024-11-28T15:50:00.000,,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),,11412,,...,,,,,,,,40.691349,-73.760793,"\n, \n(40.691348770666536, -73.76079314302808)"
3,63232233,2024-11-28T14:44:00.000,,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),,11422,147-35 BROOKVILLE BOULEVARD,...,,,,,,,,40.656717,-73.744947,"\n, \n(40.65671670572484, -73.74494743956363)"
4,63233750,2024-11-28T14:29:00.000,,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),,11413,220-18 139 AVENUE,...,,,,,,,,40.672329,-73.752674,"\n, \n(40.6723294727065, -73.75267447072801)"
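The preview URL above is long partly because every newline and indentation space in the triple-quoted string is encoded (`%0A` and runs of `+`). One optional tweak, sketched below, is to collapse whitespace before encoding. This is a convenience of my own rather than anything the API requires, and it assumes no string literal in the query depends on repeated spaces.

```python
import urllib.parse

query = """
    SELECT *
    WHERE
        created_date >= '2020-01-01'
        AND descriptor = 'Street Flooding (SJ)'
    LIMIT
        100
"""

# collapse newlines and indentation into single spaces before encoding
compact_query = " ".join(query.split())
safe_string = urllib.parse.quote_plus(compact_query)

url = f"https://data.cityofnewyork.us/resource/erm2-nwe9.csv?$query={safe_string}"
print(f"Preview URL: {url}")
```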


# 5. SoQL with Sodapy
The primary focus of this notebook is the Socrata Query Language (SoQL), which is used to query data from Socrata-powered platforms. In this section, we’ll run SoQL queries through [sodapy](https://github.com/xmunoz/sodapy), a Python client for the Socrata API. This is my preferred method for extracting data from NYC Open Data. However, please note that the sodapy project is now archived on GitHub.

## 5.1 SoQL Clauses

In [8]:
# source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# dataset id for NYC 311 on NYC Open Data on Socrata
socrata_dataset_identifier = 'erm2-nwe9'

# Socrata - The main class that interacts with the SODA API.
# we pass the source domain value of NYC Open data, the app token as 'None',
# and set the timeout parameter for '100 seconds'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

# SoQL query string below: select all columns, limit our records to 10
query = """
    SELECT *    
    LIMIT 10
"""

# returned as JSON from API / converted to Python list of dictionaries by sodapy
results = client.get(socrata_dataset_identifier, query=query)

# convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)

# sanity check
print(results_df.shape)
results_df.head()



(10, 32)


Unnamed: 0,unique_key,created_date,agency,agency_name,complaint_type,descriptor,incident_zip,intersection_street_1,intersection_street_2,address_type,...,latitude,longitude,location,:@computed_region_efsh_h5xi,:@computed_region_f5dn_yrer,:@computed_region_yeji_bk3q,:@computed_region_92fq_4b7q,:@computed_region_sbqj_enih,:@computed_region_7mpf_4k6g,location_type
0,63231133,2024-11-29T04:00:33.000,DOT,Department of Transportation,Street Condition,Pothole,11213.0,ALBANY AVENUE,CROWN STREET,INTERSECTION,...,40.66568740802441,-73.93975997355022,"{'latitude': '40.66568740802441', 'longitude':...",17615,17,2,48,44,44,
1,63233902,2024-11-29T01:51:28.000,NYPD,New York City Police Department,Noise - Residential,Loud Music/Party,,,,,...,40.89205062779512,-73.86000893554395,"{'latitude': '40.89205062779512', 'longitude':...",11275,29,5,2,30,30,Residential Building/House
2,63231561,2024-11-29T01:51:03.000,DPR,Department of Parks and Recreation,Illegal Tree Damage,Bicycle Chained to Tree,,,,,...,40.83195011687482,-73.92939377993235,"{'latitude': '40.83195011687482', 'longitude':...",10930,50,5,35,27,27,Street
3,63232135,2024-11-29T01:50:50.000,NYPD,New York City Police Department,Noise - Residential,Loud Music/Party,,,,,...,40.89205062779512,-73.86000893554395,"{'latitude': '40.89205062779512', 'longitude':...",11275,29,5,2,30,30,Residential Building/House
4,63233043,2024-11-29T01:50:14.000,NYPD,New York City Police Department,Noise - Residential,Loud Music/Party,,,,,...,40.73414322554409,-73.7747588426775,"{'latitude': '40.73414322554409', 'longitude':...",14508,25,3,16,65,65,Residential Building/House


In [9]:
# examine columns and datatypes
results_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 32 columns):
 #   Column                          Non-Null Count  Dtype 
---  ------                          --------------  ----- 
 0   unique_key                      10 non-null     object
 1   created_date                    10 non-null     object
 2   agency                          10 non-null     object
 3   agency_name                     10 non-null     object
 4   complaint_type                  10 non-null     object
 5   descriptor                      10 non-null     object
 6   incident_zip                    1 non-null      object
 7   intersection_street_1           1 non-null      object
 8   intersection_street_2           1 non-null      object
 9   address_type                    1 non-null      object
 10  city                            1 non-null      object
 11  facility_type                   1 non-null      object
 12  status                          10 non-null     objec
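Every column in the `info()` output above has dtype `object`, because the values arrive as strings. Here is a minimal sketch of converting a few fields to proper types with the standard library. The two records copy values from the output above, and the helper name `convert_record` is hypothetical.

```python
from datetime import datetime

# two records shaped like the sodapy output above (all values are strings)
records = [
    {"unique_key": "63231133", "created_date": "2024-11-29T04:00:33.000", "latitude": "40.66568740802441"},
    {"unique_key": "63233902", "created_date": "2024-11-29T01:51:28.000", "latitude": "40.89205062779512"},
]


def convert_record(record: dict) -> dict:
    """Return a copy of the record with typed values instead of strings."""
    return {
        "unique_key": int(record["unique_key"]),
        "created_date": datetime.strptime(record["created_date"], "%Y-%m-%dT%H:%M:%S.%f"),
        "latitude": float(record["latitude"]),
    }


typed_records = [convert_record(record) for record in records]
print(typed_records[0])
```

In pandas, `pd.to_numeric` and `pd.to_datetime` do the same job on whole columns of `results_df`.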

In [10]:
socrata_domain = 'data.cityofnewyork.us'
socrata_dataset_identifier = 'erm2-nwe9'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

# SoQL query string below:
# select all columns, where the descriptor
# is Street Flooding (SJ), limit our records to 1,000

query = """
    SELECT *
    WHERE
        descriptor = 'Street Flooding (SJ)'
    LIMIT
        1000
"""

results = client.get(socrata_dataset_identifier, query=query)

results_df = pd.DataFrame.from_records(results)

print(results_df.shape)
results_df.head()



(1000, 30)


Unnamed: 0,unique_key,created_date,agency,agency_name,complaint_type,descriptor,incident_zip,incident_address,street_name,cross_street_1,...,park_facility_name,park_borough,latitude,longitude,location,closed_date,resolution_description,resolution_action_updated_date,intersection_street_1,intersection_street_2
0,63233749,2024-11-28T22:26:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),10473,210 BETTS AVENUE,BETTS AVENUE,GILDERSLEEVE AVE,...,Unspecified,BRONX,40.8103119559549,-73.85023582999517,"{'latitude': '40.8103119559549', 'longitude': ...",,,,,
1,63234576,2024-11-28T19:38:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),11434,177-37 135 AVENUE,135 AVENUE,FARMERS BLVD,...,Unspecified,QUEENS,40.67568056757403,-73.76296916912234,"{'latitude': '40.67568056757403', 'longitude':...",2024-11-28T20:34:00.000,The Department of Environmental Protection has...,2024-11-28T20:34:00.000,,
2,63235424,2024-11-28T15:50:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),11412,,,,...,Unspecified,QUEENS,40.691348770666536,-73.76079314302808,"{'latitude': '40.691348770666536', 'longitude'...",,,,117 ROAD,190 STREET
3,63232233,2024-11-28T14:44:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),11422,147-35 BROOKVILLE BOULEVARD,BROOKVILLE BOULEVARD,147 AVE,...,Unspecified,QUEENS,40.65671670572484,-73.74494743956363,"{'latitude': '40.65671670572484', 'longitude':...",,,,,
4,63233750,2024-11-28T14:29:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),11413,220-18 139 AVENUE,139 AVENUE,CARSON ST,...,Unspecified,QUEENS,40.6723294727065,-73.75267447072801,"{'latitude': '40.6723294727065', 'longitude': ...",,,,,


In [11]:
socrata_domain = 'data.cityofnewyork.us'
socrata_dataset_identifier = 'erm2-nwe9'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

# SoQL query string below:
# select all columns, where the descriptor
# is Street Flooding (SJ) and created_date is between 2011 and 2012, limit our records to 1,000

query = """
    SELECT * 
    WHERE 
        created_date BETWEEN '2011-01-01' AND '2012-01-01'
        AND descriptor = 'Street Flooding (SJ)'
    LIMIT 
        1000
"""

results = client.get(socrata_dataset_identifier, query=query)

results_df = pd.DataFrame.from_records(results)

# sanity checks
print('sanity check:')
print('min:', results_df.created_date.min())
print('max:', results_df.created_date.max())

print(results_df.shape)
results_df.head()



sanity check:
min: 2011-08-16T23:52:00.000
max: 2011-12-31T17:03:00.000
(1000, 31)


Unnamed: 0,unique_key,created_date,closed_date,agency,agency_name,complaint_type,descriptor,incident_zip,incident_address,street_name,...,x_coordinate_state_plane,y_coordinate_state_plane,open_data_channel_type,park_facility_name,park_borough,latitude,longitude,location,intersection_street_1,intersection_street_2
0,22426149,2011-12-31T17:03:00.000,2012-01-02T08:50:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),10460.0,1956 CROTONA PARKWAY,CROTONA PARKWAY,...,1015982.0,246199.0,UNKNOWN,Unspecified,BRONX,40.84237755161368,-73.88531510513788,"{'latitude': '40.84237755161368', 'longitude':...",,
1,22424342,2011-12-30T10:00:00.000,2011-12-31T09:20:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),10024.0,,,...,991690.0,225337.0,UNKNOWN,Unspecified,MANHATTAN,40.78517106970749,-73.97313367344907,"{'latitude': '40.78517106970749', 'longitude':...",WEST 84 STREET,COLUMBUS AVENUE
2,22425059,2011-12-30T09:25:00.000,2011-12-30T13:55:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),,,,...,,,UNKNOWN,Unspecified,QUEENS,,,,GRAHAM CT,26 AVE
3,22415128,2011-12-29T17:13:00.000,2011-12-30T11:00:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),10306.0,263 COLONY AVENUE,COLONY AVENUE,...,958629.0,147876.0,UNKNOWN,Unspecified,STATEN ISLAND,40.572524396506175,-74.09222458237058,"{'latitude': '40.572524396506175', 'longitude'...",,
4,22414065,2011-12-29T12:33:00.000,2011-12-30T11:30:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),10306.0,,,...,946267.0,146214.0,UNKNOWN,Unspecified,STATEN ISLAND,40.56791819419245,-74.13671306905549,"{'latitude': '40.56791819419245', 'longitude':...",AMBER STREET,THOMAS STREET
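One thing to watch with `BETWEEN`: assuming it includes both endpoints, as in standard SQL, a record stamped exactly `2012-01-01T00:00:00` would also match the query above. If you want exactly one calendar year, a half-open range is a common alternative. The helper below is a hypothetical sketch that only builds the query string.

```python
def year_range_clause(column: str, year: int) -> str:
    """Build a half-open date filter covering exactly one calendar year."""
    start = f"{year}-01-01"
    end = f"{year + 1}-01-01"
    # >= start and < end excludes midnight on January 1 of the following year
    return f"{column} >= '{start}' AND {column} < '{end}'"


query = f"""
    SELECT *
    WHERE
        {year_range_clause('created_date', 2011)}
        AND descriptor = 'Street Flooding (SJ)'
    LIMIT
        1000
"""

print(query)
```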


In [12]:
socrata_domain = 'data.cityofnewyork.us'
socrata_dataset_identifier = 'erm2-nwe9'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

# SoQL query string below:
# select all columns, where the descriptor
# is Street Flooding (SJ), sort the created_date in descending order and limit our records to 1,000

query = """
    SELECT *
    WHERE
        descriptor = 'Street Flooding (SJ)'
    ORDER BY
        created_date DESC
    LIMIT
        1000
"""

results = client.get(socrata_dataset_identifier, query=query)

results_df = pd.DataFrame.from_records(results)

print(results_df.shape)
results_df.head()



(1000, 30)


Unnamed: 0,unique_key,created_date,agency,agency_name,complaint_type,descriptor,incident_zip,incident_address,street_name,cross_street_1,...,park_facility_name,park_borough,latitude,longitude,location,closed_date,resolution_description,resolution_action_updated_date,intersection_street_1,intersection_street_2
0,63233749,2024-11-28T22:26:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),10473,210 BETTS AVENUE,BETTS AVENUE,GILDERSLEEVE AVE,...,Unspecified,BRONX,40.8103119559549,-73.85023582999517,"{'latitude': '40.8103119559549', 'longitude': ...",,,,,
1,63234576,2024-11-28T19:38:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),11434,177-37 135 AVENUE,135 AVENUE,FARMERS BLVD,...,Unspecified,QUEENS,40.67568056757403,-73.76296916912234,"{'latitude': '40.67568056757403', 'longitude':...",2024-11-28T20:34:00.000,The Department of Environmental Protection has...,2024-11-28T20:34:00.000,,
2,63235424,2024-11-28T15:50:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),11412,,,,...,Unspecified,QUEENS,40.691348770666536,-73.76079314302808,"{'latitude': '40.691348770666536', 'longitude'...",,,,117 ROAD,190 STREET
3,63232233,2024-11-28T14:44:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),11422,147-35 BROOKVILLE BOULEVARD,BROOKVILLE BOULEVARD,147 AVE,...,Unspecified,QUEENS,40.65671670572484,-73.74494743956363,"{'latitude': '40.65671670572484', 'longitude':...",,,,,
4,63233750,2024-11-28T14:29:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),11413,220-18 139 AVENUE,139 AVENUE,CARSON ST,...,Unspecified,QUEENS,40.6723294727065,-73.75267447072801,"{'latitude': '40.6723294727065', 'longitude': ...",,,,,
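`LIMIT 1000` returns only the first 1,000 matching rows. To walk through more rows, one approach is to pair `LIMIT` with `OFFSET` and keep the sort order stable. The generator below is a sketch that only builds the query strings. The `total_rows` value is made up for illustration, and adding `unique_key` as a tie-breaker in the sort is my own suggestion.

```python
def paged_queries(total_rows: int, page_size: int = 1000):
    """Yield one SoQL query string per page of results."""
    for offset in range(0, total_rows, page_size):
        yield f"""
    SELECT *
    WHERE
        descriptor = 'Street Flooding (SJ)'
    ORDER BY
        created_date DESC, unique_key
    LIMIT
        {page_size}
    OFFSET
        {offset}
"""


# 2,500 rows is an arbitrary example size, giving three pages
queries = list(paged_queries(total_rows=2500, page_size=1000))

print(len(queries))
print(queries[-1])
```

Each string can then be passed to `client.get(socrata_dataset_identifier, query=...)` in turn.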


In [13]:
socrata_domain = 'data.cityofnewyork.us'
socrata_dataset_identifier = 'erm2-nwe9'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

# SoQL query string below:
# select the descriptor, borough, and count grouped by descriptor and borough,
# where the descriptor is Street Flooding (SJ), sort the count in descending order

query = """
    SELECT 
        descriptor,
        borough, 
        count(unique_key) AS count
    WHERE 
        descriptor = 'Street Flooding (SJ)'
    GROUP BY 
        descriptor,
        borough
    ORDER BY 
        count DESC
"""

results = client.get(socrata_dataset_identifier, query=query)

results_df = pd.DataFrame.from_records(results)

print(results_df.shape)
results_df.head(10)



(7, 3)


Unnamed: 0,descriptor,borough,count
0,Street Flooding (SJ),QUEENS,15771
1,Street Flooding (SJ),BROOKLYN,11033
2,Street Flooding (SJ),STATEN ISLAND,7256
3,Street Flooding (SJ),MANHATTAN,3534
4,Street Flooding (SJ),BRONX,3292
5,Street Flooding (SJ),Unspecified,50
6,Street Flooding (SJ),,4
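The counts above come back as strings, so they need converting before any arithmetic. A small sketch using the values from the output above computes the total and each borough's share. I'm assuming the row with no borough simply lacks the `borough` key.

```python
# rows shaped like the grouped results above (counts arrive as strings)
results = [
    {"borough": "QUEENS", "count": "15771"},
    {"borough": "BROOKLYN", "count": "11033"},
    {"borough": "STATEN ISLAND", "count": "7256"},
    {"borough": "MANHATTAN", "count": "3534"},
    {"borough": "BRONX", "count": "3292"},
    {"borough": "Unspecified", "count": "50"},
    {"count": "4"},  # the row with no borough value
]

total = sum(int(row["count"]) for row in results)

# share of all Street Flooding (SJ) complaints per borough
shares = {
    row.get("borough", "(missing)"): int(row["count"]) / total
    for row in results
}

print(total)
for borough, share in shares.items():
    print(f"{borough}: {share:.1%}")
```

The total matches the count reported for `Street Flooding (SJ)` in Section 5.2.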


In [14]:
socrata_domain = 'data.cityofnewyork.us'
socrata_dataset_identifier = 'erm2-nwe9'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

# SoQL query string below: select the descriptor, borough, and count grouped by
# descriptor and borough, keeping only groups with a count above 5,000,
# where the descriptor is Street Flooding (SJ), sort the count in descending order

query = """
    SELECT 
        descriptor,
        borough, 
        count(*) AS count
    WHERE 
        descriptor = 'Street Flooding (SJ)'
    GROUP BY 
        descriptor,
        borough
    HAVING 
        count > 5000
    ORDER BY 
        count DESC
"""

results = client.get(socrata_dataset_identifier, query=query)

results_df = pd.DataFrame.from_records(results)

print(results_df.shape)
results_df.head(10)



(3, 3)


Unnamed: 0,descriptor,borough,count
0,Street Flooding (SJ),QUEENS,15771
1,Street Flooding (SJ),BROOKLYN,11033
2,Street Flooding (SJ),STATEN ISLAND,7256
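`WHERE` filters rows before they are grouped, while `HAVING` filters the groups afterwards. To keep the clauses in the order used throughout this section, a small builder can help. The function `build_soql` below is a hypothetical helper of my own, not part of sodapy.

```python
from typing import Optional


def build_soql(
    select: str,
    where: Optional[str] = None,
    group_by: Optional[str] = None,
    having: Optional[str] = None,
    order_by: Optional[str] = None,
    limit: Optional[int] = None,
) -> str:
    """Assemble a SoQL statement, emitting clauses in a fixed order."""
    clauses = [f"SELECT {select}"]
    if where:
        clauses.append(f"WHERE {where}")
    if group_by:
        clauses.append(f"GROUP BY {group_by}")
    if having:
        clauses.append(f"HAVING {having}")
    if order_by:
        clauses.append(f"ORDER BY {order_by}")
    if limit is not None:
        clauses.append(f"LIMIT {limit}")
    return " ".join(clauses)


query = build_soql(
    select="descriptor, borough, count(*) AS count",
    where="descriptor = 'Street Flooding (SJ)'",
    group_by="descriptor, borough",
    having="count > 5000",
    order_by="count DESC",
)
print(query)
```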


## 5.2 SoQL Function and Keyword Listing

In [15]:
socrata_domain = 'data.cityofnewyork.us'
socrata_dataset_identifier = 'erm2-nwe9'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=1000
)

# SoQL query string below:
# select descriptor and count grouped by descriptor,
# where the word "flood" is in descriptor, sort count in descending order and
# limit our records to 1,000

query = """
    SELECT 
        descriptor, 
        count(unique_key) AS count
    WHERE 
        LOWER(descriptor) LIKE '%flood%'
    GROUP BY 
        descriptor
    ORDER BY 
        count DESC
    LIMIT 
        1000
"""

results = client.get(socrata_dataset_identifier, query=query)

results_df = pd.DataFrame.from_records(results)

print(results_df.shape)
results_df.head(10)



(13, 2)


Unnamed: 0,descriptor,count
0,Catch Basin Clogged/Flooding (Use Comments) (SC),118486
1,Street Flooding (SJ),40940
2,Flood Light Lamp Out,6606
3,Highway Flooding (SH),3186
4,Flood Light Lamp Cycling,2614
5,Flooding on Street,673
6,Ready NY - Flooding,271
7,Flood Light Lamp Dayburning,237
8,Flood Light Lamp Missing,216
9,Flood Light Lamp Dim,195
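Note that `LIKE '%flood%'` also matches the "Flood Light" street light descriptors, which are not about flooding. The sketch below reproduces the match locally on the descriptors shown above and separates the two groups. In SoQL, adding a condition such as `AND LOWER(descriptor) NOT LIKE '%flood light%'` should have a similar effect, though I haven't run that query here.

```python
# the ten descriptors returned above
descriptors = [
    "Catch Basin Clogged/Flooding (Use Comments) (SC)",
    "Street Flooding (SJ)",
    "Flood Light Lamp Out",
    "Highway Flooding (SH)",
    "Flood Light Lamp Cycling",
    "Flooding on Street",
    "Ready NY - Flooding",
    "Flood Light Lamp Dayburning",
    "Flood Light Lamp Missing",
    "Flood Light Lamp Dim",
]

# local equivalent of LOWER(descriptor) LIKE '%flood%'
matches_flood = [d for d in descriptors if "flood" in d.lower()]

# street light complaints that match only because of the word "Flood"
flood_lights = [d for d in matches_flood if "flood light" in d.lower()]

# descriptors that actually concern flooding
flooding_only = [d for d in matches_flood if "flood light" not in d.lower()]

print(len(matches_flood), len(flood_lights), len(flooding_only))
print(flooding_only)
```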


In [16]:
socrata_domain = 'data.cityofnewyork.us'
socrata_dataset_identifier = 'erm2-nwe9'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

# SoQL query string below:
# select the descriptor, unique_key, borough, and case(borough != 'BRONX'),
# where the descriptor is Street Flooding (SJ), limit our records to 1,000

query = """
    SELECT 
        unique_key,
        descriptor,
        borough,
        case(borough != 'BRONX', False, True, True) AS in_bronx
    WHERE 
        descriptor = 'Street Flooding (SJ)'
    LIMIT 
        1000
"""

results = client.get(socrata_dataset_identifier, query=query)

results_df = pd.DataFrame.from_records(results)

print(results_df.shape)
results_df.head(10)



(1000, 4)


Unnamed: 0,unique_key,descriptor,borough,in_bronx
0,63233749,Street Flooding (SJ),BRONX,True
1,63234576,Street Flooding (SJ),QUEENS,False
2,63235424,Street Flooding (SJ),QUEENS,False
3,63232233,Street Flooding (SJ),QUEENS,False
4,63233750,Street Flooding (SJ),QUEENS,False
5,63230649,Street Flooding (SJ),QUEENS,False
6,63235421,Street Flooding (SJ),QUEENS,False
7,63232857,Street Flooding (SJ),MANHATTAN,False
8,63230650,Street Flooding (SJ),QUEENS,False
9,63236266,Street Flooding (SJ),QUEENS,False


In [17]:
# sanity check
(results_df
 .groupby(by=['borough', 'in_bronx'])['unique_key']
 .count()
)

borough        in_bronx
BRONX          True        111
BROOKLYN       False       244
MANHATTAN      False        81
QUEENS         False       438
STATEN ISLAND  False       126
Name: unique_key, dtype: int64
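The `case()` function takes condition/value pairs and returns the value paired with the first condition that is true, so the final `True, True` pair acts as a default. To make that reading concrete, here is a small Python imitation. `soql_case` is my own illustrative function and only mirrors the behavior as I understand it.

```python
def soql_case(*pairs):
    """Mimic case(): return the value paired with the first true condition."""
    conditions = pairs[0::2]
    values = pairs[1::2]
    for condition, value in zip(conditions, values):
        if condition:
            return value
    # no condition matched
    return None


for borough in ["BRONX", "QUEENS"]:
    # same argument order as the SoQL query above
    in_bronx = soql_case(borough != "BRONX", False, True, True)
    print(borough, in_bronx)
```

Read this way, the query's `case(borough != 'BRONX', False, True, True)` yields `False` for any borough other than the Bronx and falls through to `True` otherwise, which agrees with the sanity check above.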

In [18]:
socrata_domain = 'data.cityofnewyork.us'
socrata_dataset_identifier = 'erm2-nwe9'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

# SoQL query string below:
# select the descriptor, created_date truncated to the year, and the count,
# grouped by descriptor and year, where the descriptor is Street Flooding (SJ),
# sort the count in descending order and limit our records to 1,000

query = """
    SELECT 
        descriptor,
        date_trunc_y(created_date) AS year,
        count(unique_key) AS count
    WHERE 
        descriptor = 'Street Flooding (SJ)'
    GROUP BY 
        descriptor,
        year
    ORDER BY 
        count DESC    
    LIMIT 
        1000
"""

results = client.get(socrata_dataset_identifier, query=query)

results_df = pd.DataFrame.from_records(results)

print(results_df.shape)
results_df.head(10)



(15, 3)


Unnamed: 0,descriptor,year,count
0,Street Flooding (SJ),2018-01-01T00:00:00.000,4140
1,Street Flooding (SJ),2021-01-01T00:00:00.000,3702
2,Street Flooding (SJ),2023-01-01T00:00:00.000,3484
3,Street Flooding (SJ),2019-01-01T00:00:00.000,3434
4,Street Flooding (SJ),2022-01-01T00:00:00.000,3078
5,Street Flooding (SJ),2024-01-01T00:00:00.000,2774
6,Street Flooding (SJ),2011-01-01T00:00:00.000,2644
7,Street Flooding (SJ),2017-01-01T00:00:00.000,2532
8,Street Flooding (SJ),2010-01-01T00:00:00.000,2531
9,Street Flooding (SJ),2014-01-01T00:00:00.000,2498
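`date_trunc_y` returns each year as a timestamp string for January 1 of that year, and the counts are strings too. A short sketch of tidying the first three rows from the output above into integers:

```python
# first three rows from the output above
rows = [
    {"year": "2018-01-01T00:00:00.000", "count": "4140"},
    {"year": "2021-01-01T00:00:00.000", "count": "3702"},
    {"year": "2023-01-01T00:00:00.000", "count": "3484"},
]

# the first four characters of the truncated timestamp are the year
counts_by_year = {int(row["year"][:4]): int(row["count"]) for row in rows}

# print in chronological order rather than by count
for year in sorted(counts_by_year):
    print(year, counts_by_year[year])
```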


In [19]:
socrata_domain = 'data.cityofnewyork.us'
socrata_dataset_identifier = 'erm2-nwe9'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

# SoQL query string below:
# select the descriptor, created_date truncated to the year and month, and the count,
# grouped by descriptor and year_month, where the descriptor is Street Flooding (SJ),
# sort the count in descending order and limit our records to 1,000

query = """
    SELECT 
        descriptor,
        date_trunc_ym(created_date) AS year_month,
        count(unique_key) AS count
    WHERE 
        descriptor = 'Street Flooding (SJ)'
    GROUP BY 
        descriptor,
        year_month
    ORDER BY 
        count DESC    
    LIMIT 
        1000
"""

results = client.get(socrata_dataset_identifier, query=query)

results_df = pd.DataFrame.from_records(results)

print(results_df.shape)
results_df.head(10)



(179, 3)


Unnamed: 0,descriptor,year_month,count
0,Street Flooding (SJ),2021-09-01T00:00:00.000,1035
1,Street Flooding (SJ),2023-09-01T00:00:00.000,932
2,Street Flooding (SJ),2018-11-01T00:00:00.000,710
3,Street Flooding (SJ),2021-08-01T00:00:00.000,595
4,Street Flooding (SJ),2024-03-01T00:00:00.000,575
5,Street Flooding (SJ),2022-12-01T00:00:00.000,530
6,Street Flooding (SJ),2017-05-01T00:00:00.000,524
7,Street Flooding (SJ),2021-07-01T00:00:00.000,499
8,Street Flooding (SJ),2011-08-01T00:00:00.000,497
9,Street Flooding (SJ),2016-02-01T00:00:00.000,490
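
The truncated `year_month` values are full timestamps, which can be noisy in tables and chart labels. As a minimal sketch, the snippet below turns three rows copied from the output into short `YYYY-MM` labels using only the standard library. The helper name `month_label` is hypothetical, and the timestamp format is inferred from the output shown above.

```python
from datetime import datetime

# Three rows copied from the output above.
rows = [
    {'year_month': '2021-09-01T00:00:00.000', 'count': '1035'},
    {'year_month': '2023-09-01T00:00:00.000', 'count': '932'},
    {'year_month': '2018-11-01T00:00:00.000', 'count': '710'},
]

def month_label(timestamp):
    """Convert a timestamp string such as 2021-09-01T00:00:00.000 to 2021-09."""
    parsed = datetime.strptime(timestamp, '%Y-%m-%dT%H:%M:%S.%f')
    return parsed.strftime('%Y-%m')

# Sort chronologically by label instead of by count.
labeled = sorted((month_label(r['year_month']), int(r['count'])) for r in rows)

for label, count in labeled:
    print(label, count)
```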


In [20]:
socrata_domain = 'data.cityofnewyork.us'
socrata_dataset_identifier = 'erm2-nwe9'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

# SoQL query string below:
# select the descriptor, created_date truncated to the day, and the record count,
# where the descriptor is Street Flooding (SJ); group by descriptor and year_month_day,
# sort by count in descending order, and limit the results to 1,000 rows

query = """
    SELECT 
        descriptor,
        date_trunc_ymd(created_date) AS year_month_day,
        count(unique_key) AS count
    WHERE 
        descriptor = 'Street Flooding (SJ)'
    GROUP BY 
        descriptor,
        year_month_day
    ORDER BY 
        count DESC    
    LIMIT 
        1000
"""

results = client.get(socrata_dataset_identifier, query=query)

results_df = pd.DataFrame.from_records(results)

print(results_df.shape)
results_df.head(10)



(1000, 3)


Unnamed: 0,descriptor,year_month_day,count
0,Street Flooding (SJ),2023-09-29T00:00:00.000,623
1,Street Flooding (SJ),2021-09-02T00:00:00.000,350
2,Street Flooding (SJ),2021-09-01T00:00:00.000,344
3,Street Flooding (SJ),2022-12-23T00:00:00.000,308
4,Street Flooding (SJ),2017-05-05T00:00:00.000,247
5,Street Flooding (SJ),2014-12-09T00:00:00.000,226
6,Street Flooding (SJ),2014-04-30T00:00:00.000,189
7,Street Flooding (SJ),2021-10-26T00:00:00.000,177
8,Street Flooding (SJ),2018-04-16T00:00:00.000,163
9,Street Flooding (SJ),2013-05-08T00:00:00.000,162
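
Note that the shape above is `(1000, 3)`, exactly the `LIMIT` value, which suggests the daily results were cut off at 1,000 rows. One way to retrieve the rest is to page through the results with SoQL's `OFFSET` clause. The sketch below only builds the query strings; `daily_count_query` is a hypothetical helper, and the extra `year_month_day` sort key is an assumption added so that rows with tied counts keep a stable order across pages.

```python
def daily_count_query(limit=1000, offset=0):
    """Build the daily-count SoQL query for one page of results (hypothetical helper)."""
    return f"""
    SELECT
        descriptor,
        date_trunc_ymd(created_date) AS year_month_day,
        count(unique_key) AS count
    WHERE
        descriptor = 'Street Flooding (SJ)'
    GROUP BY
        descriptor,
        year_month_day
    ORDER BY
        count DESC,
        year_month_day
    LIMIT
        {limit}
    OFFSET
        {offset}
    """

first_page = daily_count_query()              # rows 0-999
second_page = daily_count_query(offset=1000)  # rows 1000-1999

print(second_page)
```

Each string could then be passed to `client.get(socrata_dataset_identifier, query=...)` as in the cells above, repeating with a larger offset until a page comes back with fewer rows than the limit.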


In [21]:
socrata_domain = 'data.cityofnewyork.us'
socrata_dataset_identifier = 'erm2-nwe9'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

# SoQL query string below:
# select the descriptor, the year extracted from created_date as a number, and the record count,
# where the descriptor is Street Flooding (SJ); group by descriptor and year,
# sort by count in descending order, and limit the results to 1,000 rows

query = """
    SELECT 
        descriptor,
        date_extract_y(created_date) AS year,
        count(unique_key) AS count
    WHERE 
        descriptor = 'Street Flooding (SJ)'
    GROUP BY 
        descriptor,
        year
    ORDER BY 
        count DESC    
    LIMIT 
        1000
"""

results = client.get(socrata_dataset_identifier, query=query)

results_df = pd.DataFrame.from_records(results)

print(results_df.shape)
results_df.head(10)



(15, 3)


Unnamed: 0,descriptor,year,count
0,Street Flooding (SJ),2018,4140
1,Street Flooding (SJ),2021,3702
2,Street Flooding (SJ),2023,3484
3,Street Flooding (SJ),2019,3434
4,Street Flooding (SJ),2022,3078
5,Street Flooding (SJ),2024,2774
6,Street Flooding (SJ),2011,2644
7,Street Flooding (SJ),2017,2532
8,Street Flooding (SJ),2010,2531
9,Street Flooding (SJ),2014,2498
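
Comparing this output with the earlier `date_trunc_y` result, the counts are the same but the `year` column differs: truncation returned a timestamp at the start of the year, while extraction returned only the year number. The snippet below is a local Python analogy of that difference, not Socrata's implementation, and the `created` timestamp is a made-up example value.

```python
from datetime import datetime

# Hypothetical timestamp standing in for a created_date value.
created = datetime(2018, 4, 16, 9, 30)

# Analogy for date_trunc_y: keep the year, reset everything else.
truncated_year = created.replace(
    month=1, day=1, hour=0, minute=0, second=0, microsecond=0
)

# Analogy for date_extract_y: pull out the year as an integer.
extracted_year = created.year

print(truncated_year.isoformat(timespec='milliseconds'))
print(extracted_year)
```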


In [22]:
socrata_domain = 'data.cityofnewyork.us'
socrata_dataset_identifier = 'erm2-nwe9'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

# SoQL query string below:
# select the descriptor, the month extracted from created_date, and the record count,
# where the descriptor is Street Flooding (SJ); group by descriptor and month,
# sort by count in descending order, and limit the results to 1,000 rows

query = """
    SELECT 
        descriptor,
        date_extract_m(created_date) AS month,
        count(unique_key) AS count
    WHERE 
        descriptor = 'Street Flooding (SJ)'
    GROUP BY 
        descriptor,
        month
    ORDER BY 
        count DESC    
    LIMIT 
        1000
"""

results = client.get(socrata_dataset_identifier, query=query)

results_df = pd.DataFrame.from_records(results)

print(results_df.shape)
results_df.head(10)



(12, 3)


Unnamed: 0,descriptor,month,count
0,Street Flooding (SJ),5,4331
1,Street Flooding (SJ),9,4249
2,Street Flooding (SJ),8,4020
3,Street Flooding (SJ),7,3902
4,Street Flooding (SJ),6,3450
5,Street Flooding (SJ),12,3333
6,Street Flooding (SJ),3,3202
7,Street Flooding (SJ),4,3098
8,Street Flooding (SJ),10,3059
9,Street Flooding (SJ),2,2907
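
The month numbers above are easier to read as names. A minimal sketch using the standard library's `calendar` module follows, with three rows copied from the output. It assumes the numbers run from 1 (January) to 12 (December), which matches the 12 rows returned, and that Python's default English locale is in effect.

```python
import calendar

# Three rows copied from the output above.
rows = [
    {'month': '5', 'count': '4331'},
    {'month': '9', 'count': '4249'},
    {'month': '8', 'count': '4020'},
]

# calendar.month_name is indexed 1-12; index 0 is an empty string.
named = [
    (calendar.month_name[int(r['month'])], int(r['count']))
    for r in rows
]

for name, count in named:
    print(name, count)
```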


In [23]:
socrata_domain = 'data.cityofnewyork.us'
socrata_dataset_identifier = 'erm2-nwe9'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

# SoQL query string below:
# select the descriptor, the day of the month extracted from created_date, and the record count,
# where the descriptor is Street Flooding (SJ); group by descriptor and day,
# sort by count in descending order, and limit the results to 1,000 rows

query = """
    SELECT 
        descriptor,
        date_extract_d(created_date) AS day,
        count(unique_key) AS count
    WHERE 
        descriptor = 'Street Flooding (SJ)'
    GROUP BY 
        descriptor,
        day
    ORDER BY 
        count DESC    
    LIMIT 
        1000
"""

results = client.get(socrata_dataset_identifier, query=query)

results_df = pd.DataFrame.from_records(results)

print(results_df.shape)
results_df.head(10)



(31, 3)


Unnamed: 0,descriptor,day,count
0,Street Flooding (SJ),29,1968
1,Street Flooding (SJ),23,1745
2,Street Flooding (SJ),2,1744
3,Street Flooding (SJ),1,1724
4,Street Flooding (SJ),30,1665
5,Street Flooding (SJ),13,1590
6,Street Flooding (SJ),9,1531
7,Street Flooding (SJ),18,1495
8,Street Flooding (SJ),8,1458
9,Street Flooding (SJ),25,1457


In [24]:
socrata_domain = 'data.cityofnewyork.us'
socrata_dataset_identifier = 'erm2-nwe9'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

# SoQL query string below:
# select the descriptor, the week of the year extracted from created_date, and the record count,
# where the descriptor is Street Flooding (SJ); group by descriptor and week_of_year,
# sort by count in descending order, and limit the results to 1,000 rows

query = """
    SELECT 
        descriptor,
        date_extract_woy(created_date) AS week_of_year,
        count(unique_key) AS count
    WHERE 
        descriptor = 'Street Flooding (SJ)'
    GROUP BY 
        descriptor,
        week_of_year
    ORDER BY 
        count DESC    
    LIMIT 
        1000
"""

results = client.get(socrata_dataset_identifier, query=query)

results_df = pd.DataFrame.from_records(results)

print(results_df.shape)
results_df.head(10)



(53, 3)


Unnamed: 0,descriptor,week_of_year,count
0,Street Flooding (SJ),39,1440
1,Street Flooding (SJ),18,1360
2,Street Flooding (SJ),35,1214
3,Street Flooding (SJ),33,1179
4,Street Flooding (SJ),30,1104
5,Street Flooding (SJ),32,1061
6,Street Flooding (SJ),51,1052
7,Street Flooding (SJ),23,955
8,Street Flooding (SJ),29,919
9,Street Flooding (SJ),20,913


In [25]:
socrata_domain = 'data.cityofnewyork.us'
socrata_dataset_identifier = 'erm2-nwe9'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

# SoQL query string below:
# select the descriptor, the day of the week extracted from created_date, and the record count,
# where the descriptor is Street Flooding (SJ); group by descriptor and day_of_week,
# sort by count in descending order, and limit the results to 1,000 rows

query = """
    SELECT 
        descriptor,
        date_extract_dow(created_date) AS day_of_week,
        count(unique_key) AS count
    WHERE 
        descriptor = 'Street Flooding (SJ)'
    GROUP BY 
        descriptor,
        day_of_week
    ORDER BY 
        count DESC    
    LIMIT 
        1000
"""

results = client.get(socrata_dataset_identifier, query=query)

results_df = pd.DataFrame.from_records(results)

print(results_df.shape)
results_df.head(10)



(7, 3)


Unnamed: 0,descriptor,day_of_week,count
0,Street Flooding (SJ),5,7320
1,Street Flooding (SJ),2,6889
2,Street Flooding (SJ),1,6715
3,Street Flooding (SJ),3,6529
4,Street Flooding (SJ),4,6298
5,Street Flooding (SJ),0,3607
6,Street Flooding (SJ),6,3582
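
The `day_of_week` values are integers from 0 to 6. The sketch below maps them to names using the rows from the output above. It assumes that 0 is Sunday and 6 is Saturday, which is worth confirming against the SoQL documentation; the lower counts for 0 and 6 are at least consistent with those being weekend days. `DAY_NAMES` and `sample_df` are illustrative names.

```python
import pandas as pd

# Assumption: date_extract_dow numbers the days 0-6 starting from Sunday.
DAY_NAMES = {
    0: 'Sunday', 1: 'Monday', 2: 'Tuesday', 3: 'Wednesday',
    4: 'Thursday', 5: 'Friday', 6: 'Saturday',
}

# Rows copied from the output above, kept as strings.
sample_df = pd.DataFrame({
    'day_of_week': ['5', '2', '1', '3', '4', '0', '6'],
    'count': ['7320', '6889', '6715', '6529', '6298', '3607', '3582'],
})

# Convert both columns to integers, then attach the day names.
sample_df = sample_df.astype({'day_of_week': int, 'count': int})
sample_df['day_name'] = sample_df['day_of_week'].map(DAY_NAMES)

# Total for the two days assumed to be the weekend.
weekend_total = sample_df.loc[
    sample_df['day_of_week'].isin([0, 6]), 'count'
].sum()

print(sample_df.sort_values('day_of_week'))
print(weekend_total)
```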


In [26]:
socrata_domain = 'data.cityofnewyork.us'
socrata_dataset_identifier = 'erm2-nwe9'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

# SoQL query string below:
# select the descriptor, the hour extracted from created_date, and the record count,
# where the descriptor is Street Flooding (SJ); group by descriptor and hour,
# sort by count in descending order, and limit the results to 1,000 rows

query = """
    SELECT 
        descriptor,
        date_extract_hh(created_date) AS hour,
        count(unique_key) AS count
    WHERE 
        descriptor = 'Street Flooding (SJ)'
    GROUP BY 
        descriptor,
        hour
    ORDER BY 
        count DESC    
    LIMIT
        1000
"""

results = client.get(socrata_dataset_identifier, query=query)

results_df = pd.DataFrame.from_records(results)

print(results_df.shape)
results_df.head(10)



(24, 3)


Unnamed: 0,descriptor,hour,count
0,Street Flooding (SJ),11,3499
1,Street Flooding (SJ),9,3442
2,Street Flooding (SJ),10,3437
3,Street Flooding (SJ),12,3114
4,Street Flooding (SJ),15,2999
5,Street Flooding (SJ),14,2926
6,Street Flooding (SJ),16,2861
7,Street Flooding (SJ),13,2848
8,Street Flooding (SJ),8,2609
9,Street Flooding (SJ),17,2266
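
The query cells above differ only in the date function and the column alias, so the repeated SoQL string could be generated by a small helper. The function below, `count_by_date_part`, is a hypothetical sketch and not part of sodapy. It interpolates its arguments directly into the query, so it is only suitable for trusted, hard-coded values like the ones used here.

```python
def count_by_date_part(date_function, alias,
                       descriptor='Street Flooding (SJ)', limit=1000):
    """Build a grouped-count SoQL query for one date function (hypothetical helper)."""
    return f"""
    SELECT
        descriptor,
        {date_function}(created_date) AS {alias},
        count(unique_key) AS count
    WHERE
        descriptor = '{descriptor}'
    GROUP BY
        descriptor,
        {alias}
    ORDER BY
        count DESC
    LIMIT
        {limit}
    """

# Build a few of the queries used above, keyed by alias.
queries = {
    alias: count_by_date_part(function, alias)
    for function, alias in [
        ('date_trunc_y', 'year'),
        ('date_extract_m', 'month'),
        ('date_extract_hh', 'hour'),
    ]
}

print(queries['hour'])
```

Each generated string can be passed to `client.get(socrata_dataset_identifier, query=...)` in the same way as the hand-written queries.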


In [27]:
client.close()