# Socrata Query Language (SoQL) Clauses and Functions
Mark Bauer

Table of Contents
=================

   1. Introduction
   2. Socrata Open Data
       * 2.1 Socrata Open Data API (SODA)
       * 2.2 Sodapy
       * 2.3 Socrata Query Language or "SoQL"
   3. Importing Libraries
   4. SoQL with Sodapy
       * 4.1 SoQL Clauses
       * 4.2 SoQL Function and Keyword Listing
   5. Answering Questions about NYC 311 Complaints with SoQL
   6. Retrieving Data Directly from Socrata Open Data API (SODA)    

# 1. Introduction  
This notebook demonstrates basic queries using SoQL, the Socrata Query Language. 

# 2. Socrata Open Data

## 2.1 Socrata Open Data API (SODA)

More information can be found on the official [Socrata Open Data API (SODA)](https://dev.socrata.com/) website. We use sodapy, a Python client, to interact with the Socrata Open Data API.

There are many great resources on the website, and I encourage you to read through the [API Docs](https://dev.socrata.com/docs/endpoints.html) to further your understanding.

![dev socrata](images/dev-socrata.png)

**Source**: https://dev.socrata.com/

## 2.2 Sodapy

Sodapy is a Python client for the Socrata Open Data API.
Information about sodapy can be found in its official docs on [GitHub](https://github.com/xmunoz/sodapy), as well as in my tutorial notebook in this project, [sodapy-basics.ipynb](https://github.com/mebauer/sodapy-tutorial-nyc-open-data/blob/main/sodapy-basics.ipynb).


In order to use sodapy, a **source domain** (i.e. the open data source you are trying to connect to) needs to be passed to the Socrata class. Additionally, to query a specific dataset, the **dataset identifier** (i.e. the dataset id on the given source domain) needs to be passed as well. Below, we identify NYC Open Data's source domain `data.cityofnewyork.us` and the dataset identifier for the NYC 311 dataset `erm2-nwe9`. The screenshot shows the API documentation page for the 311 dataset from NYC Open Data.

![nyc-311-api-docs](images/nyc-311-api-docs.png)  

**Source**: https://dev.socrata.com/foundry/data.cityofnewyork.us/erm2-nwe9
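
As a rough illustration of how these two values fit together, the sketch below builds the dataset's SODA resource URL from the source domain and the dataset identifier. The variable names mirror the ones used later in this notebook; `resource_url` is only an illustrative name, and the `.json` suffix is assumed here as one of the output formats SODA serves.

```python
# source domain and dataset identifier described above
socrata_domain = 'data.cityofnewyork.us'
socrata_dataset_identifier = 'erm2-nwe9'

# SODA resource endpoints follow the pattern
# https://<domain>/resource/<dataset id>.<format>
resource_url = 'https://{}/resource/{}.json'.format(
    socrata_domain, socrata_dataset_identifier
)
print(resource_url)
```

Sodapy assembles requests like this for us, which is why only the domain and the dataset identifier need to be supplied.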

## 2.3 Socrata Query Language or "SoQL"

![soql screenshot](images/soql-screenshot.png)

**Source**: https://dev.socrata.com/docs/queries/

# 3. Importing Libraries

In [1]:
# importing libraries
import pandas as pd
import numpy as np
from sodapy import Socrata
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
import seaborn as sns
import urllib.parse

plt.style.use('ggplot')
plt.rcParams['savefig.facecolor'] = 'white'
%matplotlib inline

In [2]:
# documentation for installing watermark: https://github.com/rasbt/watermark
%reload_ext watermark
%watermark -t -d -v -p pandas,sodapy

Python implementation: CPython
Python version       : 3.8.13
IPython version      : 8.4.0

pandas: 1.4.2
sodapy: 2.2.0



# 4. SoQL with Sodapy

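### Note:  
The queries in this section pass `app_token=None`, so sodapy logs the following warning each time a client is created:

`WARNING:root:Requests made without an app_token will be subject to strict throttling limits.`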

Read more from the SODA documentation here: https://dev.socrata.com/docs/app-tokens.html
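
If you do have an app token, one possible way to avoid hard-coding it is to read it from an environment variable. This is a minimal sketch, not part of the original notebook: `SOCRATA_APP_TOKEN` is a hypothetical variable name, and `client_kwargs` is just an illustrative dictionary of the keyword arguments used with the `Socrata` class in this notebook.

```python
import os

# SOCRATA_APP_TOKEN is a hypothetical environment variable name,
# not a sodapy convention; os.environ.get returns None when it is unset
app_token = os.environ.get('SOCRATA_APP_TOKEN')

# keyword arguments that could be passed to sodapy's Socrata class
client_kwargs = {'app_token': app_token, 'timeout': 100}

# with sodapy installed, the client could then be created like this:
# from sodapy import Socrata
# client = Socrata('data.cityofnewyork.us', **client_kwargs)
print('app token set:', app_token is not None)
```

When the variable is unset, `app_token` is `None` and the behavior is the same as in the cells below.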

## 4.1 SoQL Clauses

In [3]:
# source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# dataset id for NYC 311 on NYC Open Data on Socrata
socrata_dataset_identifier = 'erm2-nwe9'

""" Socrata - the main class that interacts with the SODA API.
We pass the NYC Open Data source domain, set app_token to None,
and set the timeout parameter to 100 seconds.
"""
client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

""" SoQL query string below:
select all columns,
limit our records to 10
"""

query = """
SELECT
    *    
LIMIT
    10
"""

# returned as JSON from API / converted to Python list of dictionaries by sodapy
results = client.get(
    socrata_dataset_identifier,
    query=query
)
# closing client
client.close()

# convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)

print('shape of data: {}'.format(results_df.shape))
results_df.head()



shape of data: (10, 37)


Unnamed: 0,unique_key,created_date,agency,agency_name,complaint_type,descriptor,location_type,incident_zip,incident_address,street_name,...,location,:@computed_region_efsh_h5xi,:@computed_region_f5dn_yrer,:@computed_region_yeji_bk3q,:@computed_region_92fq_4b7q,:@computed_region_sbqj_enih,facility_type,intersection_street_1,intersection_street_2,landmark
0,58235385,2023-07-19T12:00:00.000,DSNY,Department of Sanitation,Derelict Vehicles,Derelict Vehicles,Street,11422,243-57 CANEY ROAD,CANEY ROAD,...,"{'latitude': '40.660805762163754', 'longitude'...",24018,63,3,47,63,,,,
1,58232638,2023-07-19T12:00:00.000,DSNY,Department of Sanitation,Derelict Vehicles,Derelict Vehicles,Street,11232,161 19 STREET,19 STREET,...,"{'latitude': '40.664163252269546', 'longitude'...",13515,9,2,7,45,,,,
2,58240811,2023-07-19T12:00:00.000,DSNY,Department of Sanitation,Derelict Vehicles,Derelict Vehicles,Street,10461,1177 VAN NEST AVENUE,VAN NEST AVENUE,...,"{'latitude': '40.849584225416905', 'longitude'...",11270,59,5,12,32,DSNY Garage,,,
3,58238017,2023-07-19T02:07:20.000,NYPD,New York City Police Department,Noise - Street/Sidewalk,Loud Music/Party,Street/Sidewalk,10454,554 EAST 137 STREET,EAST 137 STREET,...,"{'latitude': '40.80616697208923', 'longitude':...",10932,49,5,35,23,,BROOK AVENUE,ST ANNS AVENUE,EAST 137 STREET
4,58234065,2023-07-19T02:07:18.000,DHS,Department of Homeless Services,Homeless Person Assistance,Chronic,Street/Sidewalk,10019,200 WEST 52 STREET,WEST 52 STREET,...,"{'latitude': '40.76238879923077', 'longitude':...",12081,11,4,51,10,,7 AVENUE,BROADWAY,WEST 52 STREET
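
The remaining cells repeat the same pattern: create a client, run a query, close the client. A small helper could wrap the last two steps. The sketch below is only a suggestion; `run_soql` and `FakeClient` are hypothetical names, and the fake client stands in for sodapy's `Socrata` object so that the example runs without network access.

```python
def run_soql(client, dataset_id, query):
    """Run a SoQL query and always close the client afterwards.

    `client` is assumed to expose get(dataset_id, query=...) and close(),
    the way the sodapy Socrata client is used in this notebook.
    """
    try:
        return client.get(dataset_id, query=query)
    finally:
        client.close()


class FakeClient:
    """Hypothetical stand-in client so the sketch runs offline."""

    def __init__(self):
        self.closed = False

    def get(self, dataset_id, query=None):
        # echo the request instead of calling the API
        return [{'dataset_id': dataset_id, 'query': query.strip()}]

    def close(self):
        self.closed = True


fake = FakeClient()
rows = run_soql(fake, 'erm2-nwe9', 'SELECT * LIMIT 10')
print(rows, fake.closed)
```

With a real `Socrata` client in place of `FakeClient`, the returned list could be passed to `pd.DataFrame.from_records` as in the cell above.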


In [4]:
# examine available columns
results_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 37 columns):
 #   Column                          Non-Null Count  Dtype 
---  ------                          --------------  ----- 
 0   unique_key                      10 non-null     object
 1   created_date                    10 non-null     object
 2   agency                          10 non-null     object
 3   agency_name                     10 non-null     object
 4   complaint_type                  10 non-null     object
 5   descriptor                      10 non-null     object
 6   location_type                   10 non-null     object
 7   incident_zip                    10 non-null     object
 8   incident_address                10 non-null     object
 9   street_name                     10 non-null     object
 10  cross_street_1                  10 non-null     object
 11  cross_street_2                  10 non-null     object
 12  address_type                    10 non-null     objec

In [5]:
socrata_domain = 'data.cityofnewyork.us'
socrata_dataset_identifier = 'erm2-nwe9'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

""" SoQL query string below:
select all columns,
where the descriptor is Street Flooding (SJ),
limit our records to 1,000
"""

query = """
SELECT
    *
WHERE
    descriptor == 'Street Flooding (SJ)'
LIMIT
    1000
"""

results = client.get(
    socrata_dataset_identifier,
    query=query
)
client.close()

results_df = pd.DataFrame.from_records(results)

print('shape of data: {}'.format(results_df.shape))
results_df.head()



shape of data: (1000, 30)


Unnamed: 0,unique_key,created_date,agency,agency_name,complaint_type,descriptor,incident_zip,incident_address,street_name,cross_street_1,...,park_facility_name,park_borough,latitude,longitude,location,intersection_street_1,intersection_street_2,closed_date,resolution_description,resolution_action_updated_date
0,58234412,2023-07-18T20:08:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),10473,800 BOLTON AVENUE,BOLTON AVENUE,LAFAYETTE AVE,...,Unspecified,BRONX,40.82221080548931,-73.85932846516333,"{'latitude': '40.82221080548931', 'longitude':...",,,,,
1,58241178,2023-07-18T17:56:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),11385,,,,...,Unspecified,QUEENS,40.70204262831607,-73.85026266467501,"{'latitude': '40.702042628316065', 'longitude'...",FREEDOM DRIVE,MYRTLE AVENUE,,,
2,58237100,2023-07-18T16:53:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),11432,172-03 HIGHLAND AVENUE,HIGHLAND AVENUE,KINGSTON PL,...,Unspecified,QUEENS,40.713078025858366,-73.79124626094924,"{'latitude': '40.713078025858366', 'longitude'...",,,,,
3,58234413,2023-07-18T16:47:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),11692,70-20 ROCKAWAY BEACH BOULEVARD,ROCKAWAY BEACH BOULEVARD,CORAL REEF WAY,...,Unspecified,QUEENS,40.58956376846801,-73.79967588630932,"{'latitude': '40.58956376846801', 'longitude':...",,,,,
4,58239758,2023-07-18T16:11:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),10033,,,,...,Unspecified,MANHATTAN,40.85179479152171,-73.93009926586655,"{'latitude': '40.85179479152171', 'longitude':...",AUDUBON AVENUE,WEST 186 STREET,,,


In [6]:
socrata_domain = 'data.cityofnewyork.us'
socrata_dataset_identifier = 'erm2-nwe9'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

""" SoQL query string below:
select all columns,
where the descriptor is Street Flooding (SJ) and created_date is between 2011 and 2012,
limit our records to 1,000
"""

query = """
SELECT 
    * 
WHERE 
    created_date BETWEEN '2011' AND '2012'
    AND descriptor == 'Street Flooding (SJ)'
LIMIT 
    1000
"""

results = client.get(
    socrata_dataset_identifier,
    query=query
)
client.close()

results_df = pd.DataFrame.from_records(results)

print('sanity check:')
print('min:', results_df.created_date.min())
print('max:', results_df.created_date.max())

print('\nshape of data: {}'.format(results_df.shape))
results_df.head()



sanity check:
min: 2011-08-16T23:52:00.000
max: 2011-12-31T17:03:00.000

shape of data: (1000, 31)


Unnamed: 0,unique_key,created_date,closed_date,agency,agency_name,complaint_type,descriptor,incident_zip,incident_address,street_name,...,x_coordinate_state_plane,y_coordinate_state_plane,open_data_channel_type,park_facility_name,park_borough,latitude,longitude,location,intersection_street_1,intersection_street_2
0,22426149,2011-12-31T17:03:00.000,2012-01-02T08:50:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),10460.0,1956 CROTONA PARKWAY,CROTONA PARKWAY,...,1015982.0,246199.0,UNKNOWN,Unspecified,BRONX,40.84237755161368,-73.88531510513788,"{'latitude': '40.84237755161368', 'longitude':...",,
1,22424342,2011-12-30T10:00:00.000,2011-12-31T09:20:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),10024.0,,,...,991690.0,225337.0,UNKNOWN,Unspecified,MANHATTAN,40.78517106970749,-73.97313367344907,"{'latitude': '40.78517106970749', 'longitude':...",WEST 84 STREET,COLUMBUS AVENUE
2,22425059,2011-12-30T09:25:00.000,2011-12-30T13:55:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),,,,...,,,UNKNOWN,Unspecified,QUEENS,,,,GRAHAM CT,26 AVE
3,22415128,2011-12-29T17:13:00.000,2011-12-30T11:00:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),10306.0,263 COLONY AVENUE,COLONY AVENUE,...,958629.0,147876.0,UNKNOWN,Unspecified,STATEN ISLAND,40.572524396506175,-74.09222458237058,"{'latitude': '40.572524396506175', 'longitude'...",,
4,22414065,2011-12-29T12:33:00.000,2011-12-30T11:30:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),10306.0,,,...,946267.0,146214.0,UNKNOWN,Unspecified,STATEN ISLAND,40.56791819419245,-74.13671306905549,"{'latitude': '40.56791819419245', 'longitude':...",AMBER STREET,THOMAS STREET


In [7]:
socrata_domain = 'data.cityofnewyork.us'
socrata_dataset_identifier = 'erm2-nwe9'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

""" SoQL query string below:
select all columns,
where the descriptor is Street Flooding (SJ),
sort the created_date in descending order and limit our records to 1,000
"""

query = """
SELECT
    *
WHERE
    descriptor == 'Street Flooding (SJ)'
ORDER BY
    created_date DESC
LIMIT
    1000
"""

results = client.get(
    socrata_dataset_identifier,
    query=query
)
client.close()

results_df = pd.DataFrame.from_records(results)

print('shape of data: {}'.format(results_df.shape))
results_df.head()



shape of data: (1000, 30)


Unnamed: 0,unique_key,created_date,agency,agency_name,complaint_type,descriptor,incident_zip,incident_address,street_name,cross_street_1,...,park_facility_name,park_borough,latitude,longitude,location,intersection_street_1,intersection_street_2,closed_date,resolution_description,resolution_action_updated_date
0,58234412,2023-07-18T20:08:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),10473,800 BOLTON AVENUE,BOLTON AVENUE,LAFAYETTE AVE,...,Unspecified,BRONX,40.82221080548931,-73.85932846516333,"{'latitude': '40.82221080548931', 'longitude':...",,,,,
1,58241178,2023-07-18T17:56:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),11385,,,,...,Unspecified,QUEENS,40.70204262831607,-73.85026266467501,"{'latitude': '40.702042628316065', 'longitude'...",FREEDOM DRIVE,MYRTLE AVENUE,,,
2,58237100,2023-07-18T16:53:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),11432,172-03 HIGHLAND AVENUE,HIGHLAND AVENUE,KINGSTON PL,...,Unspecified,QUEENS,40.713078025858366,-73.79124626094924,"{'latitude': '40.713078025858366', 'longitude'...",,,,,
3,58234413,2023-07-18T16:47:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),11692,70-20 ROCKAWAY BEACH BOULEVARD,ROCKAWAY BEACH BOULEVARD,CORAL REEF WAY,...,Unspecified,QUEENS,40.58956376846801,-73.79967588630932,"{'latitude': '40.58956376846801', 'longitude':...",,,,,
4,58239758,2023-07-18T16:11:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),10033,,,,...,Unspecified,MANHATTAN,40.85179479152171,-73.93009926586655,"{'latitude': '40.85179479152171', 'longitude':...",AUDUBON AVENUE,WEST 186 STREET,,,


In [8]:
socrata_domain = 'data.cityofnewyork.us'
socrata_dataset_identifier = 'erm2-nwe9'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

""" SoQL query string below:
select the borough and count grouped by borough,
where the descriptor is Street Flooding (SJ),
sort the count in descending order
"""

query = """
SELECT 
    borough, 
    count(unique_key) AS count
WHERE 
    descriptor == 'Street Flooding (SJ)'
GROUP BY 
    borough
ORDER BY 
    count DESC
"""

results = client.get(
    socrata_dataset_identifier,
    query=query
)
client.close()

results_df = pd.DataFrame.from_records(results)

print('shape of data: {}'.format(results_df.shape))
results_df.head(10)



shape of data: (7, 2)


Unnamed: 0,borough,count
0,QUEENS,13929
1,BROOKLYN,9493
2,STATEN ISLAND,6506
3,MANHATTAN,3206
4,BRONX,2910
5,Unspecified,48
6,,4


In [9]:
socrata_domain = 'data.cityofnewyork.us'
socrata_dataset_identifier = 'erm2-nwe9'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

""" SoQL query string below:
select the borough and count grouped by borough, keeping only groups with a count above 5,000,
where the descriptor is Street Flooding (SJ),
sort the count in descending order
"""

query = """
SELECT 
    borough, 
    count(*) AS count
WHERE 
    descriptor == 'Street Flooding (SJ)'
GROUP BY 
    borough
HAVING 
    count > 5000
ORDER BY 
    count DESC
"""

results = client.get(
    socrata_dataset_identifier,
    query=query
)
client.close()

results_df = pd.DataFrame.from_records(results)

print('shape of data: {}'.format(results_df.shape))
results_df.head(10)



shape of data: (3, 2)


Unnamed: 0,borough,count
0,QUEENS,13929
1,BROOKLYN,9493
2,STATEN ISLAND,6506
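
To make the role of `HAVING` concrete, the sketch below applies the same filter in plain Python to the borough counts returned by the earlier `GROUP BY` query. It only illustrates the semantics (filtering happens after aggregation) and is not how Socrata evaluates the query; the row with a blank borough (count 4) is left out for brevity.

```python
# borough counts copied from the GROUP BY output shown earlier
grouped_counts = {
    'QUEENS': 13929,
    'BROOKLYN': 9493,
    'STATEN ISLAND': 6506,
    'MANHATTAN': 3206,
    'BRONX': 2910,
    'Unspecified': 48,
}

# HAVING count > 5000: keep only groups whose aggregate passes the test
having = {borough: n for borough, n in grouped_counts.items() if n > 5000}

# ORDER BY count DESC
ordered = sorted(having.items(), key=lambda item: item[1], reverse=True)
print(ordered)
```

The three boroughs that survive the filter match the three rows returned by the `HAVING` query.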


## 4.2 SoQL Function and Keyword Listing

In [10]:
socrata_domain = 'data.cityofnewyork.us'
socrata_dataset_identifier = 'erm2-nwe9'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=1000
)

""" SoQL query string below:
Select descriptor and count grouped by descriptor,
where the word "flood" is in descriptor,
sort count in descending order and limit our records to 1,000
"""

query = """
SELECT 
    descriptor, 
    count(unique_key) AS count
WHERE 
    LOWER(descriptor) LIKE '%flood%'
GROUP BY 
    descriptor
ORDER BY 
    count DESC
LIMIT 
    1000
"""

results = client.get(
    socrata_dataset_identifier,
    query=query
)
client.close()

results_df = pd.DataFrame.from_records(results)

print('shape of data: {}'.format(results_df.shape))
results_df.head(10)



shape of data: (11, 2)


Unnamed: 0,descriptor,count
0,Catch Basin Clogged/Flooding (Use Comments) (SC),108332
1,Street Flooding (SJ),36096
2,Flood Light Lamp Out,6418
3,Highway Flooding (SH),3077
4,Flood Light Lamp Cycling,2573
5,Ready NY - Flooding,271
6,Flood Light Lamp Dayburning,223
7,Flood Light Lamp Missing,206
8,Flood Light Lamp Dim,184
9,RAIN GARDEN FLOODING (SRGFLD),152
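
The `LOWER(descriptor) LIKE '%flood%'` condition acts as a case-insensitive substring match, which is why the result includes both the flooding complaints and the "Flood Light" descriptors. A rough Python equivalent, using a few descriptor values that appear in this notebook:

```python
# the first three values appear in the output above;
# 'Loud Music/Party' appears in an earlier cell's output
descriptors = [
    'Street Flooding (SJ)',
    'Flood Light Lamp Out',
    'RAIN GARDEN FLOODING (SRGFLD)',
    'Loud Music/Party',
]

# LOWER(descriptor) LIKE '%flood%' behaves like a
# case-insensitive "contains" test
matches = [d for d in descriptors if 'flood' in d.lower()]
print(matches)
```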


In [11]:
socrata_domain = 'data.cityofnewyork.us'
socrata_dataset_identifier = 'erm2-nwe9'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

""" SoQL query string below:
Select the unique_key, descriptor, borough, and an in_bronx flag built with case(),
where the descriptor is Street Flooding (SJ),
limit our records to 1,000
"""

query = """
SELECT 
    unique_key,
    descriptor,
    borough,
    case(borough != 'BRONX', False, True, True) AS in_bronx
WHERE 
    descriptor == 'Street Flooding (SJ)'
LIMIT 
    1000
"""

results = client.get(
    socrata_dataset_identifier, 
    query=query
)
client.close()

results_df = pd.DataFrame.from_records(results)

print('shape of data: {}'.format(results_df.shape))
results_df.head(10)



shape of data: (1000, 4)


Unnamed: 0,unique_key,descriptor,borough,in_bronx
0,58234412,Street Flooding (SJ),BRONX,True
1,58241178,Street Flooding (SJ),QUEENS,False
2,58237100,Street Flooding (SJ),QUEENS,False
3,58234413,Street Flooding (SJ),QUEENS,False
4,58239758,Street Flooding (SJ),MANHATTAN,False
5,58238426,Street Flooding (SJ),STATEN ISLAND,False
6,58235751,Street Flooding (SJ),QUEENS,False
7,58233021,Street Flooding (SJ),STATEN ISLAND,False
8,58239759,Street Flooding (SJ),BROOKLYN,False
9,58234414,Street Flooding (SJ),STATEN ISLAND,False


In [12]:
# sanity check
(results_df
 .groupby(by=['borough', 'in_bronx'])['unique_key']
 .count()
)

borough        in_bronx
BRONX          True         86
BROOKLYN       False       322
MANHATTAN      False        79
QUEENS         False       369
STATEN ISLAND  False       144
Name: unique_key, dtype: int64
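
The `case()` call above takes alternating condition/value arguments and returns the value paired with the first condition that is true. The sketch below mimics that behavior in Python under this reading; `soql_case` is a hypothetical helper written for illustration, not Socrata's implementation.

```python
def soql_case(*args):
    """Return the value paired with the first true condition, else None.

    Arguments alternate condition, value, condition, value, ...
    (a simplified Python stand-in for SoQL's case()).
    """
    for condition, value in zip(args[0::2], args[1::2]):
        if condition:
            return value
    return None


boroughs = ['BRONX', 'QUEENS', 'MANHATTAN']

# same argument order as the query: case(borough != 'BRONX', False, True, True)
in_bronx = [soql_case(b != 'BRONX', False, True, True) for b in boroughs]
print(in_bronx)
```

The final `True, True` pair acts as a catch-all branch: any row not matched by the first condition gets `True`.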

In [13]:
socrata_domain = 'data.cityofnewyork.us'
socrata_dataset_identifier = 'erm2-nwe9'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

""" SoQL query string below:
Select created_date truncated to the year and the count, grouped by year,
where the descriptor is Street Flooding (SJ),
sort the count in descending order and limit our records to 1,000
"""

query = """
SELECT 
    date_trunc_y(created_date) AS year,
    count(unique_key) AS count
WHERE 
    descriptor == 'Street Flooding (SJ)'
GROUP BY 
    year
ORDER BY 
    count DESC    
LIMIT 
    1000
"""

results = client.get(
    socrata_dataset_identifier,
    query=query
)
client.close()

results_df = pd.DataFrame.from_records(results)

print('shape of data: {}'.format(results_df.shape))
results_df.head(10)



shape of data: (14, 2)


Unnamed: 0,year,count
0,2018-01-01T00:00:00.000,4140
1,2021-01-01T00:00:00.000,3702
2,2019-01-01T00:00:00.000,3434
3,2022-01-01T00:00:00.000,3078
4,2011-01-01T00:00:00.000,2644
5,2017-01-01T00:00:00.000,2532
6,2010-01-01T00:00:00.000,2531
7,2014-01-01T00:00:00.000,2498
8,2016-01-01T00:00:00.000,2262
9,2012-01-01T00:00:00.000,2203
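
Note that `date_trunc_y` returns a full timestamp for the first instant of each year rather than a bare year number. If only the year is needed on the Python side, the values can be parsed after the fact. A minimal sketch, assuming the API values arrive as strings in the list of dictionaries that sodapy returns (the three records are copied from the output above):

```python
from datetime import datetime

# first three rows of the output above, written as sodapy-style records;
# values are assumed to arrive as strings
records = [
    {'year': '2018-01-01T00:00:00.000', 'count': '4140'},
    {'year': '2021-01-01T00:00:00.000', 'count': '3702'},
    {'year': '2019-01-01T00:00:00.000', 'count': '3434'},
]

# parse the truncated timestamp and cast the count to an integer
parsed = [
    {'year': datetime.fromisoformat(r['year']).year, 'count': int(r['count'])}
    for r in records
]
print(parsed)
```

The `date_extract_y` function used later in this section returns the year number directly, so this parsing step is only needed with the `date_trunc_*` functions.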


In [14]:
socrata_domain = 'data.cityofnewyork.us'
socrata_dataset_identifier = 'erm2-nwe9'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

""" SoQL query string below:
Select created_date truncated to the year and month and the count, grouped by year_month,
where the descriptor is Street Flooding (SJ),
sort the count in descending order and limit our records to 1,000
"""

query = """
SELECT 
    date_trunc_ym(created_date) AS year_month,
    count(unique_key) AS count
WHERE 
    descriptor == 'Street Flooding (SJ)'
GROUP BY 
    year_month
ORDER BY 
    count DESC    
LIMIT 
    1000
"""

results = client.get(
    socrata_dataset_identifier,
    query=query
)
client.close()

results_df = pd.DataFrame.from_records(results)

print('shape of data: {}'.format(results_df.shape))
results_df.head(10)



shape of data: (163, 2)


Unnamed: 0,year_month,count
0,2021-09-01T00:00:00.000,1035
1,2018-11-01T00:00:00.000,710
2,2021-08-01T00:00:00.000,595
3,2022-12-01T00:00:00.000,530
4,2017-05-01T00:00:00.000,524
5,2021-07-01T00:00:00.000,499
6,2011-08-01T00:00:00.000,497
7,2016-02-01T00:00:00.000,490
8,2010-03-01T00:00:00.000,489
9,2019-05-01T00:00:00.000,452


In [15]:
socrata_domain = 'data.cityofnewyork.us'
socrata_dataset_identifier = 'erm2-nwe9'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

""" SoQL query string below:
Select created_date truncated to the year, month, and day and the count, grouped by year_month_day,
where the descriptor is Street Flooding (SJ),
sort the count in descending order and limit our records to 1,000
"""

query = """
SELECT 
    date_trunc_ymd(created_date) AS year_month_day,
    count(unique_key) AS count
WHERE 
    descriptor == 'Street Flooding (SJ)'
GROUP BY 
    year_month_day
ORDER BY 
    count DESC    
LIMIT 
    1000
"""

results = client.get(
    socrata_dataset_identifier,
    query=query
)
client.close()

results_df = pd.DataFrame.from_records(results)

print('shape of data: {}'.format(results_df.shape))
results_df.head(10)



shape of data: (1000, 2)


Unnamed: 0,year_month_day,count
0,2021-09-02T00:00:00.000,350
1,2021-09-01T00:00:00.000,344
2,2022-12-23T00:00:00.000,308
3,2017-05-05T00:00:00.000,247
4,2014-12-09T00:00:00.000,226
5,2014-04-30T00:00:00.000,189
6,2021-10-26T00:00:00.000,177
7,2018-04-16T00:00:00.000,163
8,2013-05-08T00:00:00.000,162
9,2021-08-22T00:00:00.000,158


In [16]:
socrata_domain = 'data.cityofnewyork.us'
socrata_dataset_identifier = 'erm2-nwe9'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

""" SoQL query string below:
Select the year and the count columns grouped by year,
where the descriptor is Street Flooding (SJ),
sort the count in descending order and limit our records to 1,000
"""

query = """
SELECT 
    date_extract_y(created_date) AS year,
    count(unique_key) AS count
WHERE 
    descriptor == 'Street Flooding (SJ)'
GROUP BY 
    year
ORDER BY 
    count DESC    
LIMIT 
    1000
"""

results = client.get(
    socrata_dataset_identifier,
    query=query
)
client.close()

results_df = pd.DataFrame.from_records(results)

print('shape of data: {}'.format(results_df.shape))
results_df.head(10)



shape of data: (14, 2)


Unnamed: 0,year,count
0,2018,4140
1,2021,3702
2,2019,3434
3,2022,3078
4,2011,2644
5,2017,2532
6,2010,2531
7,2014,2498
8,2016,2262
9,2012,2203


In [17]:
socrata_domain = 'data.cityofnewyork.us'
socrata_dataset_identifier = 'erm2-nwe9'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

""" SoQL query string below:
Select the month and the count columns grouped by month,
where the descriptor is Street Flooding (SJ),
sort the count in descending order and limit our records to 1,000
"""

query = """
SELECT 
    date_extract_m(created_date) AS month,
    count(unique_key) AS count
WHERE 
    descriptor == 'Street Flooding (SJ)'
GROUP BY 
    month
ORDER BY 
    count DESC    
LIMIT 
    1000
"""

results = client.get(
    socrata_dataset_identifier,
    query=query
)
client.close()

results_df = pd.DataFrame.from_records(results)

print('shape of data: {}'.format(results_df.shape))
results_df.head(10)



shape of data: (12, 2)


Unnamed: 0,month,count
0,5,4081
1,7,3555
2,8,3326
3,6,3235
4,9,3170
5,12,3012
6,4,2812
7,2,2800
8,10,2727
9,3,2627


In [18]:
socrata_domain = 'data.cityofnewyork.us'
socrata_dataset_identifier = 'erm2-nwe9'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

""" SoQL query string below:
Select the day of the month and the count, grouped by day,
where the descriptor is Street Flooding (SJ),
sort the count in descending order and limit our records to 1,000
"""

query = """
SELECT 
    date_extract_d(created_date) AS day,
    count(unique_key) AS count
WHERE 
    descriptor == 'Street Flooding (SJ)'
GROUP BY 
    day
ORDER BY 
    count DESC    
LIMIT 
    1000
"""

results = client.get(
    socrata_dataset_identifier,
    query=query
)
client.close()

results_df = pd.DataFrame.from_records(results)

print('shape of data: {}'.format(results_df.shape))
results_df.head(10)



shape of data: (31, 2)


Unnamed: 0,day,count
0,1,1639
1,2,1581
2,30,1521
3,23,1503
4,9,1416
5,13,1360
6,8,1346
7,16,1339
8,25,1302
9,29,1250


In [19]:
socrata_domain = 'data.cityofnewyork.us'
socrata_dataset_identifier = 'erm2-nwe9'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

""" SoQL query string below:
Select the week of year and the count columns grouped by week of year,
where the descriptor is Street Flooding (SJ),
sort the count in descending order and limit our records to 1,000
"""

query = """
SELECT 
    date_extract_woy(created_date) AS week_of_year,
    count(unique_key) AS count
WHERE 
    descriptor == 'Street Flooding (SJ)'
GROUP BY 
    week_of_year
ORDER BY 
    count DESC    
LIMIT 
    1000
"""

results = client.get(
    socrata_dataset_identifier,
    query=query
)
client.close()

results_df = pd.DataFrame.from_records(results)

print('shape of data: {}'.format(results_df.shape))
results_df.head(10)



shape of data: (53, 2)


Unnamed: 0,week_of_year,count
0,18,1337
1,35,1153
2,33,985
3,30,966
4,51,898
5,23,880
6,20,864
7,21,847
8,50,835
9,43,825


In [20]:
socrata_domain = 'data.cityofnewyork.us'
socrata_dataset_identifier = 'erm2-nwe9'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

""" SoQL query string below:
Select the day of week and the count columns grouped by day of week,
where the descriptor is Street Flooding (SJ),
sort the count in descending order and limit our records to 1,000
"""

query = """
SELECT 
    date_extract_dow(created_date) AS day_of_week,
    count(unique_key) AS count
WHERE 
    descriptor == 'Street Flooding (SJ)'
GROUP BY 
    day_of_week
ORDER BY 
    count DESC    
LIMIT 
    1000
"""

results = client.get(
    socrata_dataset_identifier,
    query=query
)
client.close()

results_df = pd.DataFrame.from_records(results)

print('shape of data: {}'.format(results_df.shape))
results_df.head(10)



shape of data: (7, 2)


Unnamed: 0,day_of_week,count
0,2,6264
1,5,6158
2,1,6031
3,3,5810
4,4,5743
5,0,3183
6,6,2907
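
The integers returned by `date_extract_dow` are easier to read once they are mapped to day names. The sketch below assumes that 0 stands for Sunday and 6 for Saturday; this is an assumption worth checking against the SoQL function documentation, although it is consistent with the two lowest counts above falling on 0 and 6.

```python
# assumed mapping: 0 = Sunday ... 6 = Saturday
day_names = [
    'Sunday', 'Monday', 'Tuesday', 'Wednesday',
    'Thursday', 'Friday', 'Saturday',
]

# (day_of_week, count) pairs copied from the output above
rows = [
    (2, 6264), (5, 6158), (1, 6031), (3, 5810),
    (4, 5743), (0, 3183), (6, 2907),
]

# replace each integer with its assumed day name
labeled = [(day_names[dow], count) for dow, count in rows]
print(labeled)
```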


In [21]:
socrata_domain = 'data.cityofnewyork.us'
socrata_dataset_identifier = 'erm2-nwe9'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

""" SoQL query string below:
Select the hour and the count columns grouped by hour,
where the descriptor is Street Flooding (SJ),
sort the count in descending order and limit our records to 1,000
"""

query = """
SELECT 
    date_extract_hh(created_date) AS hour,
    count(unique_key) AS count
WHERE 
    descriptor == 'Street Flooding (SJ)'
GROUP BY 
    hour
ORDER BY 
    count DESC    
LIMIT 
    1000
"""

results = client.get(
    socrata_dataset_identifier,
    query=query
)
client.close()

results_df = pd.DataFrame.from_records(results)

print('shape of data: {}'.format(results_df.shape))
results_df.head(10)



shape of data: (24, 2)


Unnamed: 0,hour,count
0,11,3058
1,9,2984
2,10,2943
3,12,2784
4,15,2653
5,14,2599
6,13,2547
7,16,2502
8,8,2264
9,17,1992


In [22]:
socrata_domain = 'data.cityofnewyork.us'
socrata_dataset_identifier = 'erm2-nwe9'

client = Socrata(
    socrata_domain,
    app_token=None,
    timeout=100
)

""" SoQL query string below:
Select the minute and the count columns grouped by minute,
where the descriptor is Street Flooding (SJ),
sort the count in descending order and limit our records to 1,000
"""

query = """
SELECT 
    date_extract_mm(created_date) AS minute,
    count(unique_key) AS count
WHERE 
    descriptor == 'Street Flooding (SJ)'
GROUP BY 
    minute
ORDER BY 
    count DESC    
LIMIT 
    1000
"""

results = client.get(
    socrata_dataset_identifier,
    query=query
)
client.close()

results_df = pd.DataFrame.from_records(results)

print('shape of data: {}'.format(results_df.shape))
results_df.head(10)



shape of data: (60, 2)


Unnamed: 0,minute,count
0,44,745
1,38,722
2,35,721
3,47,710
4,56,702
5,29,697
6,32,696
7,23,695
8,2,691
9,17,678


# 6. Retrieving Data Directly from Socrata Open Data API (SODA)

In [23]:
# Source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# Dataset id for NYC 311 on NYC Open Data on Socrata
socrata_dataset_identifier = 'erm2-nwe9'

url = 'https://{}/resource/{}.csv?$limit=20'.format(socrata_domain, socrata_dataset_identifier)
print(url)

df = pd.read_csv(url)

print(df.shape)
df.head()

https://data.cityofnewyork.us/resource/erm2-nwe9.csv?$limit=20
(20, 41)


Unnamed: 0,unique_key,created_date,closed_date,agency,agency_name,complaint_type,descriptor,location_type,incident_zip,incident_address,...,vehicle_type,taxi_company_borough,taxi_pick_up_location,bridge_highway_name,bridge_highway_direction,road_ramp,bridge_highway_segment,latitude,longitude,location
0,58232638,2023-07-19T12:00:00.000,,DSNY,Department of Sanitation,Derelict Vehicles,Derelict Vehicles,Street,11232,161 19 STREET,...,,,,,,,,40.664163,-73.995022,"\n, \n(40.664163252269546, -73.99502216556228)"
1,58240811,2023-07-19T12:00:00.000,,DSNY,Department of Sanitation,Derelict Vehicles,Derelict Vehicles,Street,10461,1177 VAN NEST AVENUE,...,,,,,,,,40.849584,-73.848785,"\n, \n(40.849584225416905, -73.84878471556283)"
2,58235385,2023-07-19T12:00:00.000,,DSNY,Department of Sanitation,Derelict Vehicles,Derelict Vehicles,Street,11422,243-57 CANEY ROAD,...,,,,,,,,40.660806,-73.73853,"\n, \n(40.660805762163754, -73.73853048191496)"
3,58238017,2023-07-19T02:07:20.000,,NYPD,New York City Police Department,Noise - Street/Sidewalk,Loud Music/Party,Street/Sidewalk,10454,554 EAST 137 STREET,...,,,,,,,,40.806167,-73.918079,"\n, \n(40.80616697208923, -73.91807880412587)"
4,58234065,2023-07-19T02:07:18.000,,DHS,Department of Homeless Services,Homeless Person Assistance,Chronic,Street/Sidewalk,10019,200 WEST 52 STREET,...,,,,,,,,40.762389,-73.982485,"\n, \n(40.76238879923077, -73.98248510391365)"


In [24]:
# preview columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 41 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   unique_key                      20 non-null     int64  
 1   created_date                    20 non-null     object 
 2   closed_date                     0 non-null      float64
 3   agency                          20 non-null     object 
 4   agency_name                     20 non-null     object 
 5   complaint_type                  20 non-null     object 
 6   descriptor                      20 non-null     object 
 7   location_type                   20 non-null     object 
 8   incident_zip                    20 non-null     int64  
 9   incident_address                20 non-null     object 
 10  street_name                     20 non-null     object 
 11  cross_street_1                  20 non-null     object 
 12  cross_street_2                  20 non

In [25]:
year = '2020'
column = 'created_date'

url = 'https://data.cityofnewyork.us/resource/erm2-nwe9.csv?\
$where={}%20>=%20%27{}%27&$limit=20'.format(column, year)
print(url)

df = pd.read_csv(url)

print(df.shape)
df.head()

https://data.cityofnewyork.us/resource/erm2-nwe9.csv?$where=created_date%20>=%20%272020%27&$limit=20
(20, 41)


Unnamed: 0,unique_key,created_date,closed_date,agency,agency_name,complaint_type,descriptor,location_type,incident_zip,incident_address,...,vehicle_type,taxi_company_borough,taxi_pick_up_location,bridge_highway_name,bridge_highway_direction,road_ramp,bridge_highway_segment,latitude,longitude,location
0,58232638,2023-07-19T12:00:00.000,,DSNY,Department of Sanitation,Derelict Vehicles,Derelict Vehicles,Street,11232,161 19 STREET,...,,,,,,,,40.664163,-73.995022,"\n, \n(40.664163252269546, -73.99502216556228)"
1,58240811,2023-07-19T12:00:00.000,,DSNY,Department of Sanitation,Derelict Vehicles,Derelict Vehicles,Street,10461,1177 VAN NEST AVENUE,...,,,,,,,,40.849584,-73.848785,"\n, \n(40.849584225416905, -73.84878471556283)"
2,58235385,2023-07-19T12:00:00.000,,DSNY,Department of Sanitation,Derelict Vehicles,Derelict Vehicles,Street,11422,243-57 CANEY ROAD,...,,,,,,,,40.660806,-73.73853,"\n, \n(40.660805762163754, -73.73853048191496)"
3,58238017,2023-07-19T02:07:20.000,,NYPD,New York City Police Department,Noise - Street/Sidewalk,Loud Music/Party,Street/Sidewalk,10454,554 EAST 137 STREET,...,,,,,,,,40.806167,-73.918079,"\n, \n(40.80616697208923, -73.91807880412587)"
4,58234065,2023-07-19T02:07:18.000,,DHS,Department of Homeless Services,Homeless Person Assistance,Chronic,Street/Sidewalk,10019,200 WEST 52 STREET,...,,,,,,,,40.762389,-73.982485,"\n, \n(40.76238879923077, -73.98248510391365)"
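
The URL above is encoded by hand and leaves the `>` character unescaped. As a minimal sketch, the same request can be assembled with the standard library's `urllib.parse.urlencode`, which percent-encodes every reserved character. The helper name `build_filter_url` and the constant `BASE_URL` are hypothetical names introduced for illustration; they are not part of the original notebook or of SODA.

```python
from urllib.parse import urlencode, quote

# endpoint used throughout this section (NYC 311 data set, CSV output)
BASE_URL = 'https://data.cityofnewyork.us/resource/erm2-nwe9.csv'


def build_filter_url(where, limit):
    """Return a SODA URL with percent-encoded $where and $limit parameters.

    Hypothetical helper for illustration only.
    """
    params = {'$where': where, '$limit': limit}
    # safe='$' keeps the leading "$" of each parameter name readable;
    # quote_via=quote encodes spaces as %20 instead of "+"
    return BASE_URL + '?' + urlencode(params, safe='$', quote_via=quote)


url = build_filter_url("created_date >= '2020'", 20)
print(url)
```

The resulting string can be passed to `pd.read_csv(url)` in the same way as the hand-built URL.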


In [26]:
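# pass a complete SoQL statement through the $query parameter
url = 'https://data.cityofnewyork.us/resource/erm2-nwe9.csv?$query=SELECT%20*%20LIMIT%2020'
print(url)

# read the CSV response directly into a DataFrame
df = pd.read_csv(url)

print(df.shape)
df.head()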

https://data.cityofnewyork.us/resource/erm2-nwe9.csv?$query=SELECT%20*%20LIMIT%2020
(20, 46)


Unnamed: 0,unique_key,created_date,closed_date,agency,agency_name,complaint_type,descriptor,location_type,incident_zip,incident_address,...,road_ramp,bridge_highway_segment,latitude,longitude,location,:@computed_region_efsh_h5xi,:@computed_region_f5dn_yrer,:@computed_region_yeji_bk3q,:@computed_region_92fq_4b7q,:@computed_region_sbqj_enih
0,58235385,2023-07-19T12:00:00.000,,DSNY,Department of Sanitation,Derelict Vehicles,Derelict Vehicles,Street,11422,243-57 CANEY ROAD,...,,,40.660806,-73.73853,"\n, \n(40.660805762163754, -73.73853048191496)",24018,63,3,47,63
1,58232638,2023-07-19T12:00:00.000,,DSNY,Department of Sanitation,Derelict Vehicles,Derelict Vehicles,Street,11232,161 19 STREET,...,,,40.664163,-73.995022,"\n, \n(40.664163252269546, -73.99502216556228)",13515,9,2,7,45
2,58240811,2023-07-19T12:00:00.000,,DSNY,Department of Sanitation,Derelict Vehicles,Derelict Vehicles,Street,10461,1177 VAN NEST AVENUE,...,,,40.849584,-73.848785,"\n, \n(40.849584225416905, -73.84878471556283)",11270,59,5,12,32
3,58238017,2023-07-19T02:07:20.000,,NYPD,New York City Police Department,Noise - Street/Sidewalk,Loud Music/Party,Street/Sidewalk,10454,554 EAST 137 STREET,...,,,40.806167,-73.918079,"\n, \n(40.80616697208923, -73.91807880412587)",10932,49,5,35,23
4,58234065,2023-07-19T02:07:18.000,,DHS,Department of Homeless Services,Homeless Person Assistance,Chronic,Street/Sidewalk,10019,200 WEST 52 STREET,...,,,40.762389,-73.982485,"\n, \n(40.76238879923077, -73.98248510391365)",12081,11,4,51,10
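
In the output above, the `$query` request returned 46 columns rather than 41; the five extra columns all begin with `:@computed_region`. If those columns are not needed, one possible approach is to filter them out by name. This is a sketch only, and `without_computed_columns` is a hypothetical helper name, not part of the original notebook.

```python
def without_computed_columns(columns):
    """Return the column names that do not start with ':@computed_region'.

    Hypothetical helper for illustration only.
    """
    return [name for name in columns if not name.startswith(':@computed_region')]


# a few of the column names shown in the preview above
sample_columns = [
    'unique_key',
    'created_date',
    'location',
    ':@computed_region_efsh_h5xi',
    ':@computed_region_sbqj_enih',
]

print(without_computed_columns(sample_columns))
```

With the DataFrame from the cell above, the same helper could be applied as `df = df[without_computed_columns(df.columns)]`.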


In [27]:
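# urllib.parse provides quote_plus for URL-encoding the query text
import urllib.parse

# write the SoQL statement as a readable multi-line string
query = """
    SELECT
        *
    WHERE
        created_date >= '2020'
        AND descriptor == 'Street Flooding (SJ)'
    LIMIT
        100
    """

# encode spaces, newlines, and reserved characters so the query is URL-safe
safe_string = urllib.parse.quote_plus(query)
print(safe_string)

url = 'https://data.cityofnewyork.us/resource/erm2-nwe9.csv?$query={}'.format(safe_string)
print('url:', url)

# read the CSV response directly into a DataFrame
df = pd.read_csv(url)

print(df.shape)
df.head()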

%0A++++SELECT%0A++++++++%2A%0A++++WHERE%0A++++++++created_date+%3E%3D+%272020%27%0A++++++++AND+descriptor+%3D%3D+%27Street+Flooding+%28SJ%29%27%0A++++LIMIT%0A++++++++100%0A++++
url: https://data.cityofnewyork.us/resource/erm2-nwe9.csv?$query=%0A++++SELECT%0A++++++++%2A%0A++++WHERE%0A++++++++created_date+%3E%3D+%272020%27%0A++++++++AND+descriptor+%3D%3D+%27Street+Flooding+%28SJ%29%27%0A++++LIMIT%0A++++++++100%0A++++
(100, 41)


Unnamed: 0,unique_key,created_date,closed_date,agency,agency_name,complaint_type,descriptor,location_type,incident_zip,incident_address,...,vehicle_type,taxi_company_borough,taxi_pick_up_location,bridge_highway_name,bridge_highway_direction,road_ramp,bridge_highway_segment,latitude,longitude,location
0,58234412,2023-07-18T20:08:00.000,,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),,10473,800 BOLTON AVENUE,...,,,,,,,,40.822211,-73.859328,"\n, \n(40.82221080548931, -73.85932846516333)"
1,58241178,2023-07-18T17:56:00.000,,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),,11385,,...,,,,,,,,40.702043,-73.850263,"\n, \n(40.702042628316065, -73.85026266467501)"
2,58237100,2023-07-18T16:53:00.000,,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),,11432,172-03 HIGHLAND AVENUE,...,,,,,,,,40.713078,-73.791246,"\n, \n(40.713078025858366, -73.79124626094924)"
3,58234413,2023-07-18T16:47:00.000,,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),,11692,70-20 ROCKAWAY BEACH BOULEVARD,...,,,,,,,,40.589564,-73.799676,"\n, \n(40.58956376846801, -73.79967588630932)"
4,58239758,2023-07-18T16:11:00.000,,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),,10033,,...,,,,,,,,40.851795,-73.930099,"\n, \n(40.85179479152171, -73.93009926586655)"
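
The encoded URL printed above carries every newline and indentation space of the multi-line query string (`%0A`, `++++`). As a rough sketch, the steps can be wrapped in a small function that collapses whitespace before encoding, which yields a shorter URL. The name `build_query_url` is hypothetical and introduced only for illustration. Collapsing whitespace would also alter any string literal in the query that contains consecutive spaces, so this assumes the query has none.

```python
from urllib.parse import quote


def build_query_url(soql,
                    base_url='https://data.cityofnewyork.us/resource/erm2-nwe9.csv'):
    """Return a SODA URL that sends `soql` through the $query parameter.

    Hypothetical helper for illustration only.
    """
    # collapse newlines and runs of spaces into single spaces
    # (assumes no string literal in the query relies on repeated whitespace)
    compact = ' '.join(soql.split())
    # safe='' percent-encodes every reserved character, including "*" and "/"
    return '{}?$query={}'.format(base_url, quote(compact, safe=''))


query = """
    SELECT
        *
    WHERE
        created_date >= '2020'
        AND descriptor == 'Street Flooding (SJ)'
    LIMIT
        100
    """

url = build_query_url(query)
print(url)
```

As before, the resulting URL can be handed to `pd.read_csv(url)`.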
