# Socrata Query Language (SoQL) Clauses and Functions
Mark Bauer

# Introduction  
This notebook demonstrates how to use sodapy, the Python client for the Socrata Open Data API, with NYC Open Data. It includes examples of popular methods as well as basic queries written in SoQL, the Socrata Query Language. 

# sodapy
sodapy is a Python client for the Socrata Open Data API.

# Information about sodapy
 
**Installing**: https://pypi.org/project/sodapy/  
**GitHub**: https://github.com/xmunoz/sodapy

**The official Socrata Open Data API (SODA) docs**  
https://dev.socrata.com/

**Queries using SODA**  
https://dev.socrata.com/docs/queries/

**Inspiration for this notebook**  
https://github.com/xmunoz/sodapy/blob/master/examples/soql_queries.ipynb

# Importing Libraries

In [185]:
# importing libraries
import pandas as pd
from sodapy import Socrata
import itertools 

In [186]:
%reload_ext watermark

In [187]:
%watermark -a "Mark Bauer" -u -t -d -v -p pandas,sodapy,itertools

Mark Bauer 
last updated: 2021-01-29 12:19:51 

CPython 3.7.1
IPython 7.18.1

pandas 1.0.0
sodapy 2.0.0
itertools unknown


Documentation for installing watermark: https://github.com/rasbt/watermark

# Using sodapy

In order to use sodapy, a source domain (i.e. the open data source you are trying to connect to) needs to be passed to the Socrata class. Additionally, to query a specific data set, the data set identifier (i.e. the data set id on the given source domain) needs to be passed as well. Below, we identify NYC Open Data's source domain, `data.cityofnewyork.us`, and the data set identifier for the NYC 311 data set, `erm2-nwe9`. The screenshot shows the API documentation page of the 311 data set from NYC Open Data.

![nyc-311-api-docs](nyc-311-api-docs.png)  

Source: https://dev.socrata.com/foundry/data.cityofnewyork.us/erm2-nwe9

We assign this information to `socrata_domain` and `socrata_dataset_identifier` variables below.
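To make the roles of the two values concrete, here is a minimal sketch of how a domain and a data set identifier combine into a request URL. It uses only the standard library. `build_soda_url` is a hypothetical helper name (not part of sodapy), and the sketch assumes the usual SODA layout of `/resource/<id>.json` with the query passed in a `$query` parameter.

```python
from urllib.parse import urlencode


def build_soda_url(domain, dataset_id, query=None):
    """Return the JSON endpoint for a data set, optionally with a SoQL query.

    Hypothetical helper for illustration; sodapy builds its own requests.
    """
    url = f"https://{domain}/resource/{dataset_id}.json"
    if query:
        # Collapse a multi-line query into a single line before encoding it
        compact_query = " ".join(query.split())
        url += "?" + urlencode({"$query": compact_query})
    return url


print(build_soda_url("data.cityofnewyork.us", "erm2-nwe9"))
# https://data.cityofnewyork.us/resource/erm2-nwe9.json
```

Pasting the printed URL into a browser is a quick way to check that the domain and identifier are correct before writing any queries.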

# Socrata Query Language (SoQL)

# SoQL Clauses

In [188]:
# Source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# Data set id for NYC 311 on NYC Open Data on Socrata
socrata_dataset_identifier = 'erm2-nwe9'

# Socrata - The main class that interacts with the SODA API. We pass the source domain value
# of NYC Open data, the app token as 'None', and set the timeout parameter for '1,000 seconds'
client = Socrata(socrata_domain, None, timeout=1000)

# SoQL query string below:
# Select all columns,
# where the descriptor is Street Flooding (SJ),
# and limit our records to 1,000.

query = """
SELECT 
    *
WHERE 
    descriptor == 'Street Flooding (SJ)'
LIMIT 
    1000
"""

# Returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get(socrata_dataset_identifier, query=query)

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)

print('shape of data:', results_df.shape)
print('unique descriptors in returned dataframe:', results_df.descriptor.unique())

results_df.head()



shape of data: (1000, 31)
unique descriptors in returned dataframe: ['Street Flooding (SJ)']


Unnamed: 0,unique_key,created_date,closed_date,agency,agency_name,complaint_type,descriptor,incident_zip,incident_address,street_name,...,x_coordinate_state_plane,y_coordinate_state_plane,open_data_channel_type,park_facility_name,park_borough,latitude,longitude,location,intersection_street_1,intersection_street_2
0,20068090,2011-03-19T10:52:00.000,2011-03-22T10:00:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),10019,1674 BROADWAY,BROADWAY,...,988929,217208,UNKNOWN,Unspecified,MANHATTAN,40.7628609882071,-73.98310948480062,"{'latitude': '40.7628609882071', 'longitude': ...",,
1,20068091,2011-03-19T09:18:00.000,2011-03-23T12:30:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),11418,112-19 ATLANTIC AVE,ATLANTIC AVE,...,1030380,191607,UNKNOWN,Unspecified,QUEENS,40.69247350418123,-73.83365304856,"{'latitude': '40.69247350418123', 'longitude':...",,
2,20079313,2011-03-21T11:38:00.000,2011-03-21T12:50:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),11379,7963 69 AVE,69 AVE,...,1019886,197923,UNKNOWN,Unspecified,QUEENS,40.709857903354816,-73.87146142347333,"{'latitude': '40.709857903354816', 'longitude'...",,
3,20079309,2011-03-21T18:36:00.000,2011-03-21T20:00:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),10312,24 LITTLEFIELD AVE,LITTLEFIELD AVE,...,942781,134211,UNKNOWN,Unspecified,STATEN ISLAND,40.53495667592927,-74.14918670594267,"{'latitude': '40.53495667592927', 'longitude':...",,
4,20079316,2011-03-21T09:37:00.000,2011-03-22T08:40:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),11691,12-50 BRUNSWICK AVE,BRUNSWICK AVE,...,1053866,161375,UNKNOWN,Unspecified,QUEENS,40.60934000436298,-73.74927384921338,"{'latitude': '40.60934000436298', 'longitude':...",,
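The cells in this notebook repeat the same steps: create a client, send a query, and collect the records. As a sketch, the query step could be wrapped in a small helper. `run_soql` is a hypothetical name, and `StubClient` is a stand-in we define here so the example runs offline; the helper only assumes the client exposes `get(dataset_identifier, query=...)`, as the cell above uses it.

```python
def run_soql(client, dataset_identifier, query):
    """Send a SoQL query through a sodapy-style client and return a list of dicts."""
    records = client.get(dataset_identifier, query=query.strip())
    return list(records)


class StubClient:
    """Offline stand-in for sodapy.Socrata (hypothetical, for illustration only)."""

    def get(self, dataset_identifier, query=None):
        # A real client would send the query to the API; this returns a canned row
        return [{"unique_key": "20068090", "descriptor": "Street Flooding (SJ)"}]


records = run_soql(StubClient(), "erm2-nwe9", "SELECT * LIMIT 1")
print(len(records), records[0]["descriptor"])
```

With a real `Socrata` client in place of the stub, the returned list can be passed to `pd.DataFrame.from_records` exactly as in the cells of this notebook.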


In [189]:
# Source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# Data set id for NYC 311 on NYC Open Data on Socrata
socrata_dataset_identifier = 'erm2-nwe9'

# Socrata - The main class that interacts with the SODA API. We pass the source domain value
# of NYC Open data, the app token as 'None', and set the timeout parameter for '1,000 seconds'
client = Socrata(socrata_domain, None, timeout=1000)

# SoQL query string below:
# Select all columns,
# where the descriptor is Street Flooding (SJ),
# sort by created_date in descending order and limit our records to 1,000.

query = """
SELECT 
    *
WHERE 
    descriptor == 'Street Flooding (SJ)'
ORDER BY 
    created_date DESC
LIMIT 
    1000
"""

# Returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get(socrata_dataset_identifier, query=query)

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)

print('shape of data:', results_df.shape)

results_df.head()



shape of data: (1000, 30)


Unnamed: 0,unique_key,created_date,agency,agency_name,complaint_type,descriptor,incident_zip,incident_address,street_name,cross_street_1,...,park_facility_name,park_borough,latitude,longitude,location,closed_date,intersection_street_1,intersection_street_2,resolution_description,resolution_action_updated_date
0,49636123,2021-01-27T23:58:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),10308,145 GREAT KILLS ROAD,GREAT KILLS ROAD,DENT RD,...,Unspecified,STATEN ISLAND,40.55027022364553,-74.14472292734358,"{'latitude': '40.55027022364553', 'longitude':...",,,,,
1,49640168,2021-01-27T16:48:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),11101,43-25 HUNTER STREET,HUNTER STREET,44 RD,...,Unspecified,QUEENS,40.74789180264069,-73.942594531949,"{'latitude': '40.74789180264069', 'longitude':...",,,,,
2,49640066,2021-01-27T14:02:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),10016,,,,...,Unspecified,MANHATTAN,40.74966676207011,-73.98157172697513,"{'latitude': '40.74966676207011', 'longitude':...",2021-01-27T14:40:00.000,EAST 37 STREET,MADISON AVENUE,The Department of Environment Protection inspe...,2021-01-27T14:40:00.000
3,49630794,2021-01-26T20:25:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),11237,340 WEIRFIELD STREET,WEIRFIELD STREET,IRVING AVE,...,Unspecified,BROOKLYN,40.695144569398586,-73.90712618152287,"{'latitude': '40.695144569398586', 'longitude'...",,,,,
4,49632939,2021-01-26T17:41:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),11378,59-34 58 ROAD,58 ROAD,GRAND AVE,...,Unspecified,QUEENS,40.72043739380568,-73.9068961273691,"{'latitude': '40.720437393805675', 'longitude'...",,,,,


In [190]:
results_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 30 columns):
 #   Column                          Non-Null Count  Dtype 
---  ------                          --------------  ----- 
 0   unique_key                      1000 non-null   object
 1   created_date                    1000 non-null   object
 2   agency                          1000 non-null   object
 3   agency_name                     1000 non-null   object
 4   complaint_type                  1000 non-null   object
 5   descriptor                      1000 non-null   object
 6   incident_zip                    993 non-null    object
 7   incident_address                749 non-null    object
 8   street_name                     749 non-null    object
 9   cross_street_1                  749 non-null    object
 10  cross_street_2                  749 non-null    object
 11  address_type                    1000 non-null   object
 12  city                            993 non-null    o
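The `info()` output lists every column with dtype `object`, which suggests the values arrive as strings, including keys, dates and coordinates. Below is a minimal sketch of converting one record's fields with the standard library. The sample values are copied from the first query's output above, and the timestamp format string is our assumption based on how `created_date` is displayed there.

```python
from datetime import datetime

# One record with string values, as shown in the first query's output
record = {
    "unique_key": "20068090",
    "created_date": "2011-03-19T10:52:00.000",
    "latitude": "40.7628609882071",
    "longitude": "-73.98310948480062",
}

# Assumed layout of the timestamps returned for this data set
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%f"

converted = {
    "unique_key": int(record["unique_key"]),
    "created_date": datetime.strptime(record["created_date"], TIMESTAMP_FORMAT),
    "latitude": float(record["latitude"]),
    "longitude": float(record["longitude"]),
}

print(converted["created_date"].year, converted["latitude"])
```

For whole DataFrame columns, `pd.to_datetime` and `pd.to_numeric` serve the same purpose.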

In [191]:
# Source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# Data set id for NYC 311 on NYC Open Data on Socrata
socrata_dataset_identifier = 'erm2-nwe9'

# Socrata - The main class that interacts with the SODA API. We pass the source domain value
# of NYC Open data, the app token as 'None', and set the timeout parameter for '1,000 seconds'
client = Socrata(socrata_domain, None, timeout=1000)

# SoQL query string below:
# Select the borough and the count of records grouped by borough,
# where the descriptor is Street Flooding (SJ),
# and sort the count in descending order.

query = """
SELECT 
    borough, 
    count(*) AS count
WHERE 
    descriptor == 'Street Flooding (SJ)'
GROUP BY 
    borough
ORDER BY 
    count DESC
"""

# Returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get(socrata_dataset_identifier, query=query)

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)

print('shape of data:', results_df.shape)

results_df.head(10)



shape of data: (6, 2)


Unnamed: 0,borough,count
0,QUEENS,10936
1,BROOKLYN,6993
2,STATEN ISLAND,4946
3,MANHATTAN,2587
4,BRONX,2048
5,Unspecified,13


In [192]:
# Source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# Data set id for NYC 311 on NYC Open Data on Socrata
socrata_dataset_identifier = 'erm2-nwe9'

# Socrata - The main class that interacts with the SODA API. We pass the source domain value
# of NYC Open data, the app token as 'None', and set the timeout parameter for '1,000 seconds'
client = Socrata(socrata_domain, None, timeout=1000)

# SoQL query string below:
# Select the borough and the count of records grouped by borough,
# where the descriptor is Street Flooding (SJ),
# keep only groups with a count above 5,000 and sort the count in descending order.

query = """
SELECT 
    borough, 
    count(*) AS count
WHERE 
    descriptor == 'Street Flooding (SJ)'
GROUP BY 
    borough
HAVING 
    count > 5000
ORDER BY 
    count DESC
"""

# Returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get(socrata_dataset_identifier, query=query)

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)

print('shape of data:', results_df.shape)

results_df.head(10)



shape of data: (2, 2)


Unnamed: 0,borough,count
0,QUEENS,10936
1,BROOKLYN,6993
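`WHERE` filters rows before they are grouped, while `HAVING` filters the groups after aggregation. The sketch below applies the `HAVING count > 5000` step locally to the borough counts returned by the previous `GROUP BY` query. It is only an illustration of the logic, not a description of how the server evaluates the query.

```python
# Borough counts copied from the GROUP BY output above
borough_counts = {
    "QUEENS": 10936,
    "BROOKLYN": 6993,
    "STATEN ISLAND": 4946,
    "MANHATTAN": 2587,
    "BRONX": 2048,
    "Unspecified": 13,
}

# HAVING count > 5000: keep only groups whose aggregate passes the test
having = {borough: n for borough, n in borough_counts.items() if n > 5000}

# ORDER BY count DESC
ordered = sorted(having.items(), key=lambda item: item[1], reverse=True)

print(ordered)
# [('QUEENS', 10936), ('BROOKLYN', 6993)]
```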


# SoQL Function and Keyword Listing

In [193]:
# Source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# Data set id for NYC 311 on NYC Open Data on Socrata
socrata_dataset_identifier = 'erm2-nwe9'

# Socrata - The main class that interacts with the SODA API. We pass the source domain value
# of NYC Open data, the app token as 'None', and set the timeout parameter for '1,000 seconds'
client = Socrata(socrata_domain, None, timeout=1000)

# SoQL query string below:
# Select the distinct values of the status column,
# where the descriptor is Street Flooding (SJ),
# and limit our records to 1,000.

query = """
SELECT 
    distinct(status) as distinct_status 
WHERE 
    descriptor == 'Street Flooding (SJ)'
LIMIT 
    1000
"""

# Returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get(socrata_dataset_identifier, query=query)

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)

print('shape of data:', results_df.shape)

results_df.head(10)



shape of data: (5, 1)


Unnamed: 0,distinct_status
0,Assigned
1,Open
2,Started
3,Pending
4,Closed


In [194]:
# Source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# Data set id for NYC 311 on NYC Open Data on Socrata
socrata_dataset_identifier = 'erm2-nwe9'

# Socrata - The main class that interacts with the SODA API. We pass the source domain value
# of NYC Open data, the app token as 'None', and set the timeout parameter for '1,000 seconds'
client = Socrata(socrata_domain, None, timeout=1000)

# SoQL query string below:
# Select all columns,
# where created_date is between '2011' and '2012' and the descriptor is Street Flooding (SJ),
# and limit our records to 1,000.

query = """
SELECT 
    * 
WHERE 
    created_date between '2011' and '2012' AND
    descriptor == 'Street Flooding (SJ)'
LIMIT 
    1000
"""

# Returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get(socrata_dataset_identifier, query=query)

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)

print('shape of data:', results_df.shape)
print('min:', results_df.created_date.min())
print('max:', results_df.created_date.max())

results_df.head()



shape of data: (1000, 31)
min: 2011-01-08T12:08:00.000
max: 2011-12-30T10:00:00.000


Unnamed: 0,unique_key,created_date,closed_date,agency,agency_name,complaint_type,descriptor,incident_zip,incident_address,street_name,...,x_coordinate_state_plane,y_coordinate_state_plane,open_data_channel_type,park_facility_name,park_borough,latitude,longitude,location,intersection_street_1,intersection_street_2
0,21431905,2011-09-13T17:50:00.000,2011-09-20T14:15:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),10019,210 WEST 50 STREET,WEST 50 STREET,...,988573,216734,UNKNOWN,Unspecified,MANHATTAN,40.76156016137893,-73.98439489825104,"{'latitude': '40.76156016137893', 'longitude':...",,
1,21431198,2011-09-13T14:12:00.000,2011-09-14T10:55:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),10036,200 WEST 42 STREET,WEST 42 STREET,...,987804,214752,UNKNOWN,Unspecified,MANHATTAN,40.75612042107343,-73.98717187083514,"{'latitude': '40.756120421073426', 'longitude'...",,
2,21431904,2011-09-13T17:34:00.000,2011-09-14T10:20:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),10019,,,...,988757,216332,UNKNOWN,Unspecified,MANHATTAN,40.76045668270954,-73.98373096728758,"{'latitude': '40.76045668270954', 'longitude':...",WEST 49 STREET,7 AVENUE
3,21431156,2011-09-13T12:47:00.000,2011-09-14T08:05:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),11697,5 STATE ROAD,STATE ROAD,...,1015533,145614,UNKNOWN,Unspecified,QUEENS,40.56629748227965,-73.8874051386105,"{'latitude': '40.56629748227965', 'longitude':...",,
4,21431213,2011-09-13T13:43:00.000,2011-09-14T09:25:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),11420,,,...,1037479,185728,UNKNOWN,Unspecified,QUEENS,40.67629718064151,-73.80810023963616,"{'latitude': '40.676297180641505', 'longitude'...",131 STREET,FOCH BOULEVARD
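The `min` and `max` printed above are computed on strings, since `created_date` comes back as text. This still gives the right answer because timestamps in a fixed ISO 8601 layout sort lexicographically in chronological order. A small sketch with values taken from the output above; the format string is our assumption about the layout.

```python
from datetime import datetime

# Timestamps copied from the output above, deliberately out of order
stamps = [
    "2011-09-13T17:50:00.000",
    "2011-01-08T12:08:00.000",
    "2011-12-30T10:00:00.000",
]

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%f"

# Sort once as plain strings and once as parsed datetimes
by_string = sorted(stamps)
by_time = sorted(stamps, key=lambda s: datetime.strptime(s, TIMESTAMP_FORMAT))

print(by_string == by_time)
# True
print(min(stamps), max(stamps))
# 2011-01-08T12:08:00.000 2011-12-30T10:00:00.000
```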


In [195]:
# Source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# Data set id for NYC 311 on NYC Open Data on Socrata
socrata_dataset_identifier = 'erm2-nwe9'

# Socrata - The main class that interacts with the SODA API. We pass the source domain value
# of NYC Open data, the app token as 'None', and set the timeout parameter for '1,000 seconds'
client = Socrata(socrata_domain, None, timeout=1000)

# SoQL query string below:
# Select unique_key, borough and an in_bronx flag built with case()
# (False when the borough is not BRONX, True otherwise),
# where the descriptor is Street Flooding (SJ), and limit our records to 1,000.

query = """
SELECT 
    unique_key,
    borough,
    case(borough != 'BRONX', False, True, True) AS in_bronx
WHERE 
    descriptor == 'Street Flooding (SJ)'
LIMIT 
    1000
"""

# Returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get(socrata_dataset_identifier, query=query)

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)

print('shape of data:', results_df.shape)

results_df.head()



shape of data: (1000, 3)


Unnamed: 0,unique_key,borough,in_bronx
0,21431905,MANHATTAN,False
1,21431198,MANHATTAN,False
2,21431904,MANHATTAN,False
3,21431156,QUEENS,False
4,21431213,QUEENS,False


In [196]:
results_df.sort_values(by='in_bronx', ascending=False).head()

Unnamed: 0,unique_key,borough,in_bronx
144,21563414,BRONX,True
905,39737538,BRONX,True
884,39564383,BRONX,True
885,40091298,BRONX,True
886,39769602,BRONX,True


In [197]:
# Source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# Data set id for NYC 311 on NYC Open Data on Socrata
socrata_dataset_identifier = 'erm2-nwe9'

# Socrata - The main class that interacts with the SODA API. We pass the source domain value
# of NYC Open data, the app token as 'None', and set the timeout parameter for '1,000 seconds'
client = Socrata(socrata_domain, None, timeout=1000)

# SoQL query string below:
# Select the status and the count of created_date grouped by status,
# where the descriptor is Street Flooding (SJ),
# sort the count in descending order and limit our records to 1,000.

query = """
SELECT 
    status, 
    count(created_date) AS count
WHERE 
    descriptor == 'Street Flooding (SJ)'
GROUP BY 
    status
ORDER BY 
    count DESC
LIMIT 
    1000
"""

# Returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get(socrata_dataset_identifier, query=query)

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)

print('shape of data:', results_df.shape)

results_df.head(10)



shape of data: (5, 2)


Unnamed: 0,status,count
0,Closed,27506
1,Open,9
2,Pending,4
3,Started,3
4,Assigned,1


In [198]:
# Source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# Data set id for NYC 311 on NYC Open Data on Socrata
socrata_dataset_identifier = 'erm2-nwe9'

# Socrata - The main class that interacts with the SODA API. We pass the source domain value
# of NYC Open data, the app token as 'None', and set the timeout parameter for '1,000 seconds'
client = Socrata(socrata_domain, None, timeout=1000)

# SoQL query string below:
# Select the day of the month and the count of created_date grouped by day,
# where the descriptor is Street Flooding (SJ),
# sort the count in descending order and limit our records to 1,000.

query = """
SELECT 
    date_extract_d(created_date) AS day,
    count(created_date) AS count
WHERE 
    descriptor == 'Street Flooding (SJ)'
GROUP BY 
    day
ORDER BY 
    count DESC    
LIMIT 
    1000
"""

# Returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get(socrata_dataset_identifier, query=query)

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)

print('shape of data:', results_df.shape)

results_df.head(10)



shape of data: (31, 2)


Unnamed: 0,day,count
0,30,1215
1,25,1110
2,9,1106
3,13,1086
4,8,1054
5,29,1041
6,16,1040
7,1,1015
8,5,998
9,27,976


In [199]:
# Source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# Data set id for NYC 311 on NYC Open Data on Socrata
socrata_dataset_identifier = 'erm2-nwe9'

# Socrata - The main class that interacts with the SODA API. We pass the source domain value
# of NYC Open data, the app token as 'None', and set the timeout parameter for '1,000 seconds'
client = Socrata(socrata_domain, None, timeout=1000)

# SoQL query string below:
# Select the day of the week and the count of created_date grouped by day of the week,
# where the descriptor is Street Flooding (SJ),
# sort the count in descending order and limit our records to 1,000.

query = """
SELECT 
    date_extract_dow(created_date) AS day_of_week,
    count(created_date) AS count
WHERE 
    descriptor == 'Street Flooding (SJ)'
GROUP BY 
    day_of_week
ORDER BY 
    count DESC    
LIMIT 
    1000
"""

# Returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get(socrata_dataset_identifier, query=query)

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)

print('shape of data:', results_df.shape)

results_df.head(10)



shape of data: (7, 2)


Unnamed: 0,day_of_week,count
0,2,4868
1,1,4865
2,5,4663
3,3,4522
4,4,4231
5,0,2281
6,6,2093
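`date_extract_dow` returns an integer from 0 to 6. The sketch below maps those integers to day names, assuming 0 is Sunday and 6 is Saturday; that is our reading of the SoQL documentation and should be verified there. The rows are copied from the output above and written as strings, on the assumption that the API returns them as text.

```python
# Assumed mapping: 0 = Sunday ... 6 = Saturday (verify against the SoQL docs)
DAY_NAMES = ["Sunday", "Monday", "Tuesday", "Wednesday",
             "Thursday", "Friday", "Saturday"]

# Rows copied from the output above
rows = [
    {"day_of_week": "2", "count": "4868"},
    {"day_of_week": "1", "count": "4865"},
    {"day_of_week": "5", "count": "4663"},
    {"day_of_week": "3", "count": "4522"},
    {"day_of_week": "4", "count": "4231"},
    {"day_of_week": "0", "count": "2281"},
    {"day_of_week": "6", "count": "2093"},
]

# Replace each integer code with its day name and cast the count to int
labeled = [(DAY_NAMES[int(r["day_of_week"])], int(r["count"])) for r in rows]

for name, count in labeled:
    print(f"{name:<10}{count}")
```

Under that assumption, the two lowest counts fall on Saturday and Sunday.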


In [200]:
# Source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# Data set id for NYC 311 on NYC Open Data on Socrata
socrata_dataset_identifier = 'erm2-nwe9'

# Socrata - The main class that interacts with the SODA API. We pass the source domain value
# of NYC Open data, the app token as 'None', and set the timeout parameter for '1,000 seconds'
client = Socrata(socrata_domain, None, timeout=1000)

# SoQL query string below:
# Select the hour of the day and the count of created_date grouped by hour,
# where the descriptor is Street Flooding (SJ),
# sort the count in descending order and limit our records to 1,000.

query = """
SELECT 
    date_extract_hh(created_date) AS hour,
    count(created_date) AS count
WHERE 
    descriptor == 'Street Flooding (SJ)'
GROUP BY 
    hour
ORDER BY 
    count DESC    
LIMIT 
    1000
"""

# Returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get(socrata_dataset_identifier, query=query)

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)

print('shape of data:', results_df.shape)

results_df.head(10)



shape of data: (24, 2)


Unnamed: 0,hour,count
0,11,2465
1,9,2358
2,10,2328
3,12,2219
4,15,2124
5,14,2069
6,13,2060
7,16,1915
8,8,1745
9,17,1456


In [201]:
# Source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# Data set id for NYC 311 on NYC Open Data on Socrata
socrata_dataset_identifier = 'erm2-nwe9'

# Socrata - The main class that interacts with the SODA API. We pass the source domain value
# of NYC Open data, the app token as 'None', and set the timeout parameter for '1,000 seconds'
client = Socrata(socrata_domain, None, timeout=1000)

# SoQL query string below:
# Select the month and the count of created_date grouped by month,
# where the descriptor is Street Flooding (SJ),
# sort the count in descending order and limit our records to 1,000.

query = """
SELECT 
    date_extract_m(created_date) AS month,
    count(created_date) AS count
WHERE 
    descriptor == 'Street Flooding (SJ)'
GROUP BY 
    month
ORDER BY 
    count DESC    
LIMIT 
    1000
"""

# Returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get(socrata_dataset_identifier, query=query)

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)

print('shape of data:', results_df.shape)

results_df.head(10)



shape of data: (12, 2)


Unnamed: 0,month,count
0,5,3289
1,6,2626
2,8,2560
3,2,2418
4,12,2368
5,11,2190
6,3,2152
7,10,2128
8,7,2112
9,4,2086


In [202]:
# Source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# Data set id for NYC 311 on NYC Open Data on Socrata
socrata_dataset_identifier = 'erm2-nwe9'

# Socrata - The main class that interacts with the SODA API. We pass the source domain value
# of NYC Open data, the app token as 'None', and set the timeout parameter for '1,000 seconds'
client = Socrata(socrata_domain, None, timeout=1000)

# SoQL query string below:
# Select the minute of the hour and the count of created_date grouped by minute,
# where the descriptor is Street Flooding (SJ),
# sort the count in descending order and limit our records to 1,000.

query = """
SELECT 
    date_extract_mm(created_date) AS minute,
    count(created_date) AS count
WHERE 
    descriptor == 'Street Flooding (SJ)'
GROUP BY 
    minute
ORDER BY 
    count DESC    
LIMIT 
    1000
"""

# Returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get(socrata_dataset_identifier, query=query)

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)

print('shape of data:', results_df.shape)

results_df.head(10)



shape of data: (60, 2)


Unnamed: 0,minute,count
0,0,524
1,32,504
2,44,501
3,40,495
4,28,486
5,33,482
6,23,482
7,35,482
8,49,481
9,43,480


In [203]:
# Source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# Data set id for NYC 311 on NYC Open Data on Socrata
socrata_dataset_identifier = 'erm2-nwe9'

# Socrata - The main class that interacts with the SODA API. We pass the source domain value
# of NYC Open data, the app token as 'None', and set the timeout parameter for '1,000 seconds'
client = Socrata(socrata_domain, None, timeout=1000)

# SoQL query string below:
# Select the week of the year and the count of created_date grouped by week of the year,
# where the descriptor is Street Flooding (SJ),
# sort the count in descending order and limit our records to 1,000.

query = """
SELECT 
    date_extract_woy(created_date) AS week_of_year,
    count(created_date) AS count
WHERE 
    descriptor == 'Street Flooding (SJ)'
GROUP BY 
    week_of_year
ORDER BY 
    count DESC    
LIMIT 
    1000
"""

# Returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get(socrata_dataset_identifier, query=query)

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)

print('shape of data:', results_df.shape)

results_df.head(10)



shape of data: (53, 2)


Unnamed: 0,week_of_year,count
0,18,1050
1,6,752
2,50,717
3,19,714
4,23,711
5,48,710
6,21,696
7,24,682
8,33,668
9,32,665


In [204]:
# Source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# Data set id for NYC 311 on NYC Open Data on Socrata
socrata_dataset_identifier = 'erm2-nwe9'

# Socrata - The main class that interacts with the SODA API. We pass the source domain value
# of NYC Open data, the app token as 'None', and set the timeout parameter for '1,000 seconds'
client = Socrata(socrata_domain, None, timeout=1000)

# SoQL query string below:
# Select the year and the count of created_date grouped by year,
# where the descriptor is Street Flooding (SJ),
# sort the count in descending order and limit our records to 1,000.

query = """
SELECT 
    date_extract_y(created_date) AS year,
    count(created_date) AS count
WHERE 
    descriptor == 'Street Flooding (SJ)'
GROUP BY 
    year
ORDER BY 
    count DESC    
LIMIT 
    1000
"""

# Returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get(socrata_dataset_identifier, query=query)

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)

print('shape of data:', results_df.shape)

results_df.head(10)



shape of data: (12, 2)


Unnamed: 0,year,count
0,2018,4140
1,2019,2972
2,2011,2644
3,2017,2532
4,2010,2531
5,2014,2498
6,2016,2262
7,2012,2203
8,2020,2139
9,2015,1839


In [205]:
# Source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# Data set id for NYC 311 on NYC Open Data on Socrata
socrata_dataset_identifier = 'erm2-nwe9'

# Socrata - The main class that interacts with the SODA API. We pass the source domain value
# of NYC Open data, the app token as 'None', and set the timeout parameter for '1,000 seconds'
client = Socrata(socrata_domain, None, timeout=1000)

# SoQL query string below:
# Select created_date truncated to the year and the count of created_date grouped by year,
# where the descriptor is Street Flooding (SJ),
# sort the count in descending order and limit our records to 1,000.

query = """
SELECT 
    date_trunc_y(created_date) AS year,
    count(created_date) AS count
WHERE 
    descriptor == 'Street Flooding (SJ)'
GROUP BY 
    year
ORDER BY 
    count DESC    
LIMIT 
    1000
"""

# Returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get(socrata_dataset_identifier, query=query)

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)

print('shape of data:', results_df.shape)

results_df.head(10)



shape of data: (12, 2)


Unnamed: 0,year,count
0,2018-01-01T00:00:00.000,4140
1,2019-01-01T00:00:00.000,2972
2,2011-01-01T00:00:00.000,2644
3,2017-01-01T00:00:00.000,2532
4,2010-01-01T00:00:00.000,2531
5,2014-01-01T00:00:00.000,2498
6,2016-01-01T00:00:00.000,2262
7,2012-01-01T00:00:00.000,2203
8,2020-01-01T00:00:00.000,2139
9,2015-01-01T00:00:00.000,1839
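As the two outputs above show, `date_extract_y` returns the year as a number, while `date_trunc_y` returns a timestamp truncated to the start of the year. The following sketch reproduces that distinction locally with `datetime`. The sample timestamp is a hypothetical value chosen for illustration.

```python
from datetime import datetime

# Hypothetical sample timestamp for illustration
created = datetime(2018, 11, 25, 14, 30)

# Like date_extract_y: pull out the year component as an integer
extracted_year = created.year

# Like date_trunc_y: keep a full timestamp, reset to the start of the year
truncated_year = created.replace(month=1, day=1, hour=0, minute=0,
                                 second=0, microsecond=0)

# Like date_trunc_ym: reset to the start of the month
truncated_month = created.replace(day=1, hour=0, minute=0,
                                  second=0, microsecond=0)

print(extracted_year)
# 2018
print(truncated_year.isoformat(timespec="milliseconds"))
# 2018-01-01T00:00:00.000
print(truncated_month.isoformat(timespec="milliseconds"))
# 2018-11-01T00:00:00.000
```

Truncated timestamps remain sortable and plottable as dates, which makes them the more convenient choice for time series.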


In [206]:
# Source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# Data set id for NYC 311 on NYC Open Data on Socrata
socrata_dataset_identifier = 'erm2-nwe9'

# Socrata - The main class that interacts with the SODA API. We pass the source domain value
# of NYC Open data, the app token as 'None', and set the timeout parameter for '1,000 seconds'
client = Socrata(socrata_domain, None, timeout=1000)

# SoQL query string below:
# Select created_date truncated to the year and month and the count of created_date grouped by year and month,
# where the descriptor is Street Flooding (SJ),
# sort the count in descending order and limit our records to 1,000.

query = """
SELECT 
    date_trunc_ym(created_date) AS year_month,
    count(created_date) AS count
WHERE 
    descriptor == 'Street Flooding (SJ)'
GROUP BY 
    year_month
ORDER BY 
    count DESC    
LIMIT 
    1000
"""

# Returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get(socrata_dataset_identifier, query=query)

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)

print('shape of data:', results_df.shape)

results_df.head(10)



shape of data: (133, 2)


Unnamed: 0,year_month,count
0,2018-11-01T00:00:00.000,710
1,2017-05-01T00:00:00.000,524
2,2011-08-01T00:00:00.000,497
3,2016-02-01T00:00:00.000,490
4,2010-03-01T00:00:00.000,489
5,2019-05-01T00:00:00.000,452
6,2019-12-01T00:00:00.000,440
7,2019-06-01T00:00:00.000,435
8,2014-02-01T00:00:00.000,418
9,2018-03-01T00:00:00.000,405


In [207]:
# Source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# Data set id for NYC 311 on NYC Open Data on Socrata
socrata_dataset_identifier = 'erm2-nwe9'

# Socrata - The main class that interacts with the SODA API. We pass the source domain value
# of NYC Open data, the app token as 'None', and set the timeout parameter for '1,000 seconds'
client = Socrata(socrata_domain, None, timeout=1000)

# SoQL query string below:
# Select created_date truncated to the year, month and day and the count of created_date grouped by that date,
# where the descriptor is Street Flooding (SJ),
# sort the count in descending order and limit our records to 1,000.

query = """
SELECT 
    date_trunc_ymd(created_date) AS year_month_day,
    count(created_date) AS count
WHERE 
    descriptor == 'Street Flooding (SJ)'
GROUP BY 
    year_month_day
ORDER BY 
    count DESC    
LIMIT 
    1000
"""

# Returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get(socrata_dataset_identifier, query=query)

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)

print('shape of data:', results_df.shape)

results_df.head(10)



shape of data: (1000, 2)


Unnamed: 0,year_month_day,count
0,2017-05-05T00:00:00.000,247
1,2014-12-09T00:00:00.000,226
2,2014-04-30T00:00:00.000,189
3,2018-04-16T00:00:00.000,163
4,2013-05-08T00:00:00.000,162
5,2016-11-15T00:00:00.000,151
6,2016-02-08T00:00:00.000,150
7,2018-11-25T00:00:00.000,142
8,2020-07-10T00:00:00.000,131
9,2010-10-01T00:00:00.000,130


In [208]:
# Source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# Data set id for NYC 311 on NYC Open Data on Socrata
socrata_dataset_identifier = 'erm2-nwe9'

# Socrata - The main class that interacts with the SODA API. We pass the source domain value
# of NYC Open data, the app token as 'None', and set the timeout parameter for '1,000 seconds'
client = Socrata(socrata_domain, None, timeout=1000)

# SoQL query string below:
# Select descriptor and the count of rows grouped by descriptor,
# where the word "flood" is in descriptor,
# sort count in descending order and limit our records to 1,000.

query = """
SELECT 
    descriptor, 
    count(*) AS count
WHERE 
    LOWER(descriptor) LIKE '%flood%'
GROUP BY 
    descriptor
ORDER BY 
    count(descriptor) DESC
LIMIT 
    1000
"""

# Returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get(socrata_dataset_identifier, query=query)

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)

print('shape of data:', results_df.shape)
results_df



shape of data: (11, 2)


Unnamed: 0,descriptor,count
0,Catch Basin Clogged/Flooding (Use Comments) (SC),89857
1,Street Flooding (SJ),27523
2,Flood Light Lamp Out,5964
3,Highway Flooding (SH),2834
4,Flood Light Lamp Cycling,2511
5,Ready NY - Flooding,271
6,Flood Light Lamp Dayburning,205
7,Flood Light Lamp Missing,190
8,Flood Light Lamp Dim,177
9,RAIN GARDEN FLOODING (SRGFLD),78


# Analyzing NYC 311 Complaints with SoQL

## Most NYC 311 Complaints by Complaint Type

In [181]:
# Source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# Data set id for NYC 311 on NYC Open Data on Socrata
socrata_dataset_identifier = 'erm2-nwe9'

# Socrata - The main class that interacts with the SODA API. We pass the source domain value
# of NYC Open data, the app token as 'None', and set the timeout parameter for '1,000 seconds'
client = Socrata(socrata_domain, None, timeout=1000)

# SoQL query string below:
# Select the complaint_type and the count of complaint_type columns 
# grouped by complaint_type, sorted by the count of complaint_type in descending order
# and limit our records to 1,000.

query = """
SELECT 
    complaint_type, 
    count(complaint_type) AS count
GROUP BY 
    complaint_type
ORDER BY 
    count(complaint_type) DESC
LIMIT 
    1000
"""

# Requesting data from the NYC 311 data set
# and passing our query as a full SoQL query string
results = client.get(socrata_dataset_identifier, query=query)

# results is returned as JSON from API and converted to Python list of
# dictionaries by sodapy
print(type(results), 'Returned a list from our request.\n')

# Identifying type of first element of our results list
print(type(results[0]), 'However, request is actually a list of dictionaries.\n')

# Convert list of dictionaries to a pandas DataFrame
results_df = pd.DataFrame.from_records(results)

print(type(results_df), 'Convert list of dictionaries to DataFrame.')
print('Rows and columns of data:', results_df.shape)

results_df.head(10)



<class 'list'> Returned a list from our request.

<class 'dict'> However, request is actually a list of dictionaries.

<class 'pandas.core.frame.DataFrame'> Convert list of dictionaries to DataFrame.
Rows and columns of data: (445, 2)


Unnamed: 0,complaint_type,count
0,Noise - Residential,2238037
1,HEAT/HOT WATER,1423237
2,Illegal Parking,1126019
3,Blocked Driveway,1045927
4,Street Condition,1020888
5,Street Light Condition,984066
6,HEATING,887869
7,PLUMBING,747937
8,Water System,687400
9,Noise - Street/Sidewalk,682532
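
The cells in this section repeat the same two steps: request the records with `client.get(..., query=query)` and convert the returned list of dictionaries to a DataFrame. One possible way to factor that out is a small helper. The name `soql_to_df` is hypothetical and not part of sodapy; the sketch only assumes the `client.get(dataset_id, query=...)` call already used in this notebook.

```python
import pandas as pd


def soql_to_df(client, dataset_id, query):
    """Run a full SoQL query string and return the records as a DataFrame.

    `client` is expected to be a sodapy Socrata instance (or any object
    exposing a compatible `get` method).
    """
    # sodapy returns a list of dictionaries, one per record
    records = client.get(dataset_id, query=query)
    return pd.DataFrame.from_records(records)
```

With the `client`, `socrata_dataset_identifier` and `query` variables defined as in the cell above, the call would be `results_df = soql_to_df(client, socrata_dataset_identifier, query)`.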


## Most NYC 311 Complaints by Descriptor

In [182]:
# Source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# Data set id for NYC 311 on NYC Open Data on Socrata
socrata_dataset_identifier = 'erm2-nwe9'

# Socrata - The main class that interacts with the SODA API. We pass the source domain value
# of NYC Open data, the app token as 'None', and set the timeout parameter for '1,000 seconds'
client = Socrata(socrata_domain, None, timeout=1000)

# SoQL query string below:
# Select the descriptor and count of descriptor columns grouped by descriptor,
# sort by the count in descending order and limit our records to 1,000.

query = """
SELECT 
    descriptor, 
    count(descriptor) AS count
GROUP BY 
    descriptor
ORDER BY 
    count DESC
LIMIT 
    1000
"""

# Requesting data from the NYC 311 data set
# and passing our query as a full SoQL query string
results = client.get(socrata_dataset_identifier, query=query)

# Convert list of dictionaries to a pandas DataFrame
results_df = pd.DataFrame.from_records(results)

print('Rows and columns of data:', results_df.shape)

results_df.head(10)



Rows and columns of data: (1000, 2)


Unnamed: 0,descriptor,count
0,Loud Music/Party,2336486
1,ENTIRE BUILDING,926711
2,HEAT,871935
3,No Access,778808
4,Street Light Out,729603
5,Pothole,617865
6,Banging/Pounding,613329
7,APARTMENT ONLY,496526
8,Loud Talking,359034
9,CEILING,358956


## Most NYC 311 Complaints by Day

In [183]:
# Source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# Data set id for NYC 311 on NYC Open Data on Socrata
socrata_dataset_identifier = 'erm2-nwe9'

# Socrata - The main class that interacts with the SODA API. We pass the source domain value
# of NYC Open data, the app token as 'None', and set the timeout parameter for '1,000 seconds'
client = Socrata(socrata_domain, None, timeout=1000)

# SoQL query string below:
# Select the day and the count of records per day, grouped by day,
# sort by count in descending order and limit our records to 1,000.

query = """
SELECT 
    date_trunc_ymd(created_date) AS day, 
    count(day) AS count
GROUP BY 
    day
ORDER BY 
    count DESC
LIMIT 
    1000
"""

# Requesting data from the NYC 311 data set
# and passing our query as a full SoQL query string
results = client.get(socrata_dataset_identifier, query=query)

# Convert list of dictionaries to a pandas DataFrame
results_df = pd.DataFrame.from_records(results)

print('Rows and columns of data:', results_df.shape)

results_df.head(10)



Rows and columns of data: (1000, 2)


Unnamed: 0,day,count
0,2020-08-04T00:00:00.000,23314
1,2020-08-05T00:00:00.000,18305
2,2020-07-05T00:00:00.000,16014
3,2020-07-04T00:00:00.000,15365
4,2020-06-20T00:00:00.000,15098
5,2020-06-21T00:00:00.000,14965
6,2020-06-28T00:00:00.000,12899
7,2020-06-27T00:00:00.000,12074
8,2020-08-09T00:00:00.000,12057
9,2020-08-06T00:00:00.000,12043


## Displaying the difference between the date (timestamp) and day (date_trunc_ymd) columns

In [184]:
# Source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# Data set id for NYC 311 on NYC Open Data on Socrata
socrata_dataset_identifier = 'erm2-nwe9'

# Socrata - The main class that interacts with the SODA API. We pass the source domain value
# of NYC Open data, the app token as 'None', and set the timeout parameter for '1,000 seconds'
client = Socrata(socrata_domain, None, timeout=1000)

# SoQL query string below:
# Select the timestamp, day and count of timestamp grouped by timestamp,
# sort by count in ascending order and limit our records to 1,000.

query = """
SELECT 
    created_date as timestamp, 
    date_trunc_ymd(created_date) as day, 
    count(timestamp) AS count
GROUP BY 
    timestamp
ORDER BY 
    count ASC
LIMIT 
    1000
"""

# Requesting data from the NYC 311 data set
# and passing our query as a full SoQL query string
results = client.get(socrata_dataset_identifier, query=query)

# Convert list of dictionaries to a pandas DataFrame
results_df = pd.DataFrame.from_records(results)

print('Rows and columns of data:', results_df.shape)

results_df.head(10)



Rows and columns of data: (1000, 3)


Unnamed: 0,timestamp,day,count
0,2010-01-01T15:48:17.000,2010-01-01T00:00:00.000,1
1,2010-01-01T16:01:57.000,2010-01-01T00:00:00.000,1
2,2010-01-01T15:40:55.000,2010-01-01T00:00:00.000,1
3,2010-01-01T15:48:01.000,2010-01-01T00:00:00.000,1
4,2010-01-01T15:57:07.000,2010-01-01T00:00:00.000,1
5,2010-01-01T16:01:43.000,2010-01-01T00:00:00.000,1
6,2010-01-01T15:35:00.000,2010-01-01T00:00:00.000,1
7,2010-01-01T15:39:32.000,2010-01-01T00:00:00.000,1
8,2010-01-01T15:45:00.000,2010-01-01T00:00:00.000,1
9,2010-01-01T15:48:00.000,2010-01-01T00:00:00.000,1
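
In the output above, every `timestamp` is recorded to the second, so grouping by it yields counts of 1, while the `day` column collapses those timestamps to midnight of the same date. A rough pandas analogue of that truncation is sketched below with three timestamps copied from the output; `dt.floor('D')` is used here as a stand-in for `date_trunc_ymd`, not as its exact implementation.

```python
import pandas as pd

# Three timestamps copied from the output above
timestamps = pd.Series(pd.to_datetime([
    '2010-01-01T15:48:17.000',
    '2010-01-01T16:01:57.000',
    '2010-01-01T15:40:55.000',
]))

# Truncate each timestamp to the start of its day
days = timestamps.dt.floor('D')

print('distinct timestamps:', timestamps.nunique())
print('distinct days:', days.nunique())
```

Grouping by the truncated value is what lets the earlier "by day" queries produce meaningful counts.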


## Analyzing NYC 311 Street Flooding Complaints

### Searching the data set for the word "flood" in the complaint_type field

In [185]:
# Source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# Data set id for NYC 311 on NYC Open Data on Socrata
socrata_dataset_identifier = 'erm2-nwe9'

# Socrata - The main class that interacts with the SODA API. We pass the source domain value
# of NYC Open data, the app token as 'None', and set the timeout parameter for '1,000 seconds'
client = Socrata(socrata_domain, None, timeout=1000)

# SoQL query string below:
# Select complaint_type and count of complaint_type grouped by complaint_type,
# where the word "flood" is in complaint_type,
# sort count in descending order and limit our records to 1,000.

query = """
SELECT 
    complaint_type, 
    count(complaint_type) AS count
WHERE 
    LOWER(complaint_type) LIKE '%flood%'
GROUP BY 
    complaint_type
ORDER BY 
    count(descriptor) DESC
LIMIT 
    1000
"""

# Returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get(socrata_dataset_identifier, query=query)

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)

print('shape of data:', results_df.shape)
results_df



shape of data: (0, 0)


### Searching the data set for the word "flood" in the descriptor field

In [186]:
# Source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# Data set id for NYC 311 on NYC Open Data on Socrata
socrata_dataset_identifier = 'erm2-nwe9'

# Socrata - The main class that interacts with the SODA API. We pass the source domain value
# of NYC Open data, the app token as 'None', and set the timeout parameter for '1,000 seconds'
client = Socrata(socrata_domain, None, timeout=1000)

# SoQL query string below:
# Select complaint_type and count of complaint_type grouped by complaint_type,
# where the word "flood" is in descriptor,
# sort count in descending order and limit our records to 1,000.

query = """
SELECT 
    complaint_type, 
    count(complaint_type) AS count
WHERE 
    LOWER(descriptor) LIKE '%flood%'
GROUP BY 
    complaint_type
ORDER BY 
    count(descriptor) DESC
LIMIT 
    1000
"""

# Returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get(socrata_dataset_identifier, query=query)

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)

print('shape of data:', results_df.shape)
results_df



shape of data: (4, 2)


Unnamed: 0,complaint_type,count
0,Sewer,120281
1,Street Light Condition,9047
2,OEM Literature Request,271
3,Public Toilet,48


### Searching the data set where complaint_type field = 'Sewer'

In [187]:
# Source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# Data set id for NYC 311 on NYC Open Data on Socrata
socrata_dataset_identifier = 'erm2-nwe9'

# Socrata - The main class that interacts with the SODA API. We pass the source domain value
# of NYC Open data, the app token as 'None', and set the timeout parameter for '1,000 seconds'
client = Socrata(socrata_domain, None, timeout=1000)

# SoQL query string below:
# Select descriptor and count of descriptor grouped by descriptor,
# where complaint_type = 'Sewer',
# sort count in descending order and limit our records to 1,000.

query = """
SELECT 
    descriptor, 
    count(descriptor)
WHERE 
    complaint_type='Sewer'
GROUP BY 
    descriptor
ORDER BY 
    count(descriptor) DESC
LIMIT 1000
"""

# First 1000 results, returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get(socrata_dataset_identifier, query=query)

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)

print('shape of data:', results_df.shape)
results_df.head(len(results_df))



shape of data: (27, 2)


Unnamed: 0,descriptor,count_descriptor
0,Sewer Backup (Use Comments) (SA),148695
1,Catch Basin Clogged/Flooding (Use Comments) (SC),89849
2,Catch Basin Sunken/Damaged/Raised (SC1),28528
3,Street Flooding (SJ),27520
4,Manhole Cover Broken/Making Noise (SB),19778
5,Manhole Cover Missing (Emergency) (SA3),17430
6,Sewer Odor (SA2),15351
7,Defective/Missing Curb Piece (SC4),8486
8,Manhole Overflow (Use Comments) (SA1),6842
9,Catch Basin Search (SC2),4153


### Searching the data set where the word "flood" is in the descriptor field

In [188]:
# Source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# Data set id for NYC 311 on NYC Open Data on Socrata
socrata_dataset_identifier = 'erm2-nwe9'

# Socrata - The main class that interacts with the SODA API. We pass the source domain value
# of NYC Open data, the app token as 'None', and set the timeout parameter for '1,000 seconds'
client = Socrata(socrata_domain, None, timeout=1000)

# SoQL query string below:
# Select descriptor and count of descriptor grouped by descriptor,
# where the word "flood" is in descriptor,
# sort count in descending order and limit our records to 1,000.

query = """
SELECT 
    descriptor, 
    count(descriptor) AS count
WHERE 
    LOWER(descriptor) LIKE '%flood%'
GROUP BY 
    descriptor
ORDER BY 
    count(descriptor) DESC
LIMIT 
    1000
"""

# Returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get(socrata_dataset_identifier, query=query)

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)

print('shape of data:', results_df.shape)
results_df



shape of data: (11, 2)


Unnamed: 0,descriptor,count
0,Catch Basin Clogged/Flooding (Use Comments) (SC),89849
1,Street Flooding (SJ),27520
2,Flood Light Lamp Out,5964
3,Highway Flooding (SH),2834
4,Flood Light Lamp Cycling,2511
5,Ready NY - Flooding,271
6,Flood Light Lamp Dayburning,205
7,Flood Light Lamp Missing,190
8,Flood Light Lamp Dim,177
9,RAIN GARDEN FLOODING (SRGFLD),78
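
The filter `LOWER(descriptor) LIKE '%flood%'` is a case-insensitive substring match, which is why it returns both "Street Flooding (SJ)" and the "Flood Light Lamp" descriptors. The same logic can be sketched locally in pandas on a few descriptors taken from the outputs above; this illustrates the matching rule only and does not query the API.

```python
import pandas as pd

# A few descriptors copied from the outputs above
descriptors = pd.Series([
    'Street Flooding (SJ)',
    'Flood Light Lamp Out',
    'RAIN GARDEN FLOODING (SRGFLD)',
    'Sewer Backup (Use Comments) (SA)',
])

# Lowercase first, then test for the literal substring 'flood'
mask = descriptors.str.lower().str.contains('flood', regex=False)
matches = descriptors[mask]

print(matches.tolist())
```

Because the pattern also catches street light descriptors, narrowing the filter to an exact value such as 'Street Flooding (SJ)' is the safer choice when only flooding complaints are wanted.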


### Displaying the highest number of street flooding complaints by day

In [189]:
# Source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# Data set id for NYC 311 on NYC Open Data on Socrata
socrata_dataset_identifier = 'erm2-nwe9'

# Socrata - The main class that interacts with the SODA API. We pass the source domain value
# of NYC Open data, the app token as 'None', and set the timeout parameter for '1,000 seconds'
client = Socrata(socrata_domain, None, timeout=1000)

# SoQL query string below:
# Select the day and the count of records per day, grouped by day,
# where the descriptor is Street Flooding (SJ),
# sort the count in descending order and limit our records to 1,000.

query = """
SELECT 
    date_trunc_ymd(created_date) as day, 
    count(created_date) AS count
WHERE 
    descriptor == 'Street Flooding (SJ)'
GROUP BY 
    day
ORDER BY 
    count DESC
LIMIT 
    1000
"""

# Returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get(socrata_dataset_identifier, query=query)

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)

print('shape of data:', results_df.shape)

results_df.head(10)



shape of data: (1000, 2)


Unnamed: 0,day,count
0,2017-05-05T00:00:00.000,247
1,2014-12-09T00:00:00.000,226
2,2014-04-30T00:00:00.000,189
3,2018-04-16T00:00:00.000,163
4,2013-05-08T00:00:00.000,162
5,2016-11-15T00:00:00.000,151
6,2016-02-08T00:00:00.000,150
7,2018-11-25T00:00:00.000,142
8,2020-07-10T00:00:00.000,131
9,2010-10-01T00:00:00.000,130


### Selecting all the rows and columns where the descriptor field = 'Street Flooding (SJ)'

In [190]:
# Source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# Data set id for NYC 311 on NYC Open Data on Socrata
socrata_dataset_identifier = 'erm2-nwe9'

# Socrata - The main class that interacts with the SODA API. We pass the source domain value
# of NYC Open data, the app token as 'None', and set the timeout parameter for '1,000 seconds'
client = Socrata(socrata_domain, None, timeout=1000)

# SoQL query string below:
# Select all columns where the descriptor is Street Flooding (SJ),
# sort the created date field in descending order and limit our records to 1,000.

query = """
SELECT 
    *
WHERE 
    descriptor == 'Street Flooding (SJ)'
ORDER BY 
    created_date DESC
LIMIT 
    1000
"""

# Requesting data from the NYC 311 data set
# and passing our query as a full SoQL query string
results = client.get(socrata_dataset_identifier, query=query)

# results is returned as JSON from API and converted to Python list of
# dictionaries by sodapy
print(type(results), 'Returned a list from our request.\n')

# Identifying type of first element of our results list
print(type(results[0]), 'However, request is actually a list of dictionaries.\n')

# Convert list of dictionaries to a pandas DataFrame
results_df = pd.DataFrame.from_records(results)

print(type(results_df), 'Convert list of dictionaries to DataFrame.')
print('Rows and columns of data:', results_df.shape)

# Writing out sample data as a csv
results_df.to_csv('sample_data_street_flooding.csv', index=False)

# Previewing the first five rows of our DataFrame
results_df.head()



<class 'list'> Returned a list from our request.

<class 'dict'> However, request is actually a list of dictionaries.

<class 'pandas.core.frame.DataFrame'> Convert list of dictionaries to DataFrame.
Rows and columns of data: (1000, 30)


Unnamed: 0,unique_key,created_date,agency,agency_name,complaint_type,descriptor,incident_zip,incident_address,street_name,cross_street_1,...,park_facility_name,park_borough,latitude,longitude,location,closed_date,resolution_description,resolution_action_updated_date,intersection_street_1,intersection_street_2
0,49630794,2021-01-26T20:25:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),11237.0,340 WEIRFIELD STREET,WEIRFIELD STREET,IRVING AVE,...,Unspecified,BROOKLYN,40.695144569398586,-73.90712618152287,"{'latitude': '40.695144569398586', 'longitude'...",,,,,
1,49632939,2021-01-26T17:41:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),11378.0,59-34 58 ROAD,58 ROAD,GRAND AVE,...,Unspecified,QUEENS,40.72043739380568,-73.9068961273691,"{'latitude': '40.720437393805675', 'longitude'...",,,,,
2,49618841,2021-01-25T18:09:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),11432.0,185-23 80 ROAD,80 ROAD,CHEVY CHASE ST,...,Unspecified,QUEENS,40.7277798170635,-73.78278642955841,"{'latitude': '40.7277798170635', 'longitude': ...",,,,,
3,49622225,2021-01-25T17:09:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),,PELHAM BAY PARK,PELHAM BAY PARK,BRUCKNER EXPRESSWAY EXIT 8 C NB,...,Unspecified,BRONX,,,,2021-01-26T11:15:00.000,The Department of Environmental Protection inv...,2021-01-26T11:15:00.000,,
4,49610767,2021-01-24T19:20:00.000,DEP,Department of Environmental Protection,Sewer,Street Flooding (SJ),11420.0,124-19 HAWTREE CREEK ROAD,HAWTREE CREEK ROAD,124 ST,...,Unspecified,QUEENS,40.68086403950213,-73.81622814854843,"{'latitude': '40.68086403950213', 'longitude':...",2021-01-25T10:50:00.000,The Department of Environmental Protection inv...,2021-01-25T10:50:00.000,,


## Analyzing the Most Downloaded Data Sets on NYC Open Data

In [191]:
type(client)

sodapy.Socrata

In [192]:
type(client.datasets())

list

In [193]:
type(client.datasets()[0])

dict

In [194]:
len(client.datasets())

3143

In [195]:
# Reading in a list of dictionaries of our data into a pandas DataFrame
df = pd.DataFrame.from_records(client.datasets())

df.head()

Unnamed: 0,resource,classification,metadata,permalink,link,owner,creator,preview_image_url
0,"{'name': 'DOB Job Application Filings', 'id': ...","{'categories': ['economy', 'environment', 'hou...",{'domain': 'data.cityofnewyork.us'},https://data.cityofnewyork.us/d/ic3t-wcy2,https://data.cityofnewyork.us/Housing-Developm...,"{'id': '5fuc-pqz2', 'user_type': 'interactive'...","{'id': '5fuc-pqz2', 'user_type': 'interactive'...",
1,"{'name': 'Civil Service List (Active)', 'id': ...","{'categories': [], 'tags': [], 'domain_categor...",{'domain': 'data.cityofnewyork.us'},https://data.cityofnewyork.us/d/vx8i-nprf,https://data.cityofnewyork.us/City-Government/...,"{'id': '5fuc-pqz2', 'user_type': 'interactive'...","{'id': '5fuc-pqz2', 'user_type': 'interactive'...",
2,"{'name': 'TLC New Driver Application Status', ...","{'categories': ['transportation', 'environment...",{'domain': 'data.cityofnewyork.us'},https://data.cityofnewyork.us/d/dpec-ucu7,https://data.cityofnewyork.us/Transportation/T...,"{'id': '5fuc-pqz2', 'user_type': 'interactive'...","{'id': '5fuc-pqz2', 'user_type': 'interactive'...",
3,"{'name': 'For Hire Vehicles (FHV) - Active', '...","{'categories': [], 'tags': [], 'domain_categor...",{'domain': 'data.cityofnewyork.us'},https://data.cityofnewyork.us/d/8wbx-tsch,https://data.cityofnewyork.us/Transportation/F...,"{'id': '5fuc-pqz2', 'user_type': 'interactive'...","{'id': '5fuc-pqz2', 'user_type': 'interactive'...",
4,{'name': 'For Hire Vehicles (FHV) - Active Dri...,"{'categories': [], 'tags': [], 'domain_categor...",{'domain': 'data.cityofnewyork.us'},https://data.cityofnewyork.us/d/xjfq-wh2d,https://data.cityofnewyork.us/Transportation/F...,"{'id': '5fuc-pqz2', 'user_type': 'interactive'...","{'id': '5fuc-pqz2', 'user_type': 'interactive'...",


In [196]:
# Only saving the dictionary in the resource column
df = df.resource

# Reading the dictionary in the resource column into a pandas DataFrame
df = pd.DataFrame.from_records(df)

df.head()

Unnamed: 0,name,id,parent_fxf,description,attribution,attribution_link,contact_email,type,updatedAt,createdAt,...,columns_field_name,columns_datatype,columns_description,columns_format,download_count,provenance,lens_view_type,blob_mime_type,hide_from_data_json,publication_date
0,DOB Job Application Filings,ic3t-wcy2,[],This dataset contains all job applications sub...,Department of Buildings (DOB),,,dataset,2021-01-27T21:18:28.000Z,2013-04-18T15:18:56.000Z,...,"[fuel_storage, fee_status, curb_cut, professio...","[Text, Text, Text, Text, Text, Text, Text, Tex...","[Fuel Storage Work Type? (X=Yes, Blank=No), T...","[{'align': 'right'}, {'align': 'right'}, {'ali...",36893.0,official,tabular,,False,2020-06-22T18:23:35.000Z
1,Civil Service List (Active),vx8i-nprf,[],A Civil Service List consists of all candidate...,Department of Citywide Administrative Services...,,,dataset,2021-01-28T14:55:02.000Z,2016-06-14T21:12:15.000Z,...,"[exam_no, list_no, first_name, mi, last_name, ...","[text, number, text, text, text, number, text,...",[A four (4) digit number that identifies a civ...,"[{'displayStyle': 'plain', 'align': 'left'}, {...",34291.0,official,tabular,,False,2021-01-28T14:55:02.000Z
2,TLC New Driver Application Status,dpec-ucu7,[],THIS DATASET IS UPDATED SEVERAL TIMES PER DAY....,Taxi and Limousine Commission (TLC),,,dataset,2021-01-28T17:16:32.000Z,2016-05-17T18:43:43.000Z,...,"[status, driver_exam, app_no, app_date, fru_in...","[Text, Text, Number, Calendar date, Text, Text...","[""Incomplete"": Your application is missing req...","[{'displayStyle': 'plain', 'align': 'left'}, {...",35363.0,official,tabular,,False,2019-12-17T18:44:57.000Z
3,For Hire Vehicles (FHV) - Active,8wbx-tsch,[],"<b>PLEASE NOTE:</b> This dataset, which includ...",Taxi and Limousine Commission (TLC),,,dataset,2021-01-27T20:31:43.000Z,2015-07-16T17:33:32.000Z,...,"[active, vehicle_license_number, name, license...","[text, text, text, text, calendar_date, text, ...","[Permit active or not\n, FHV Vehicle License N...","[{'displayStyle': 'plain', 'align': 'left'}, {...",254346.0,official,tabular,,False,2021-01-27T20:31:43.000Z
4,For Hire Vehicles (FHV) - Active Drivers,xjfq-wh2d,[],"<b>PLEASE NOTE:</b> This dataset, which includ...",Taxi and Limousine Commission (TLC),,,dataset,2021-01-27T20:20:02.000Z,2015-07-16T17:24:02.000Z,...,"[license_number, name, type, expiration_date, ...","[number, text, text, calendar_date, text, cale...","[FHV License Number\n, Driver Name\n\n, Type o...","[{'precisionStyle': 'standard', 'noCommas': 't...",222191.0,official,tabular,,False,2021-01-27T20:20:02.000Z
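
The two cells above unpack the nested `resource` dictionary in two steps. As an alternative sketch, `pd.json_normalize` can flatten nested dictionaries in one call, producing dotted column names such as `resource.name`. The records below are a trimmed, hand-written stand-in for what `client.datasets()` returns, with values taken from the output above.

```python
import pandas as pd

# Trimmed stand-in for two entries of client.datasets()
datasets = [
    {
        'resource': {'name': 'DOB Job Application Filings',
                     'id': 'ic3t-wcy2',
                     'download_count': 36893},
        'permalink': 'https://data.cityofnewyork.us/d/ic3t-wcy2',
    },
    {
        'resource': {'name': 'Civil Service List (Active)',
                     'id': 'vx8i-nprf',
                     'download_count': 34291},
        'permalink': 'https://data.cityofnewyork.us/d/vx8i-nprf',
    },
]

# Nested keys become dotted column names, e.g. 'resource.download_count'
flat = pd.json_normalize(datasets)

print(flat[['resource.name', 'resource.download_count']])
```

This keeps the top-level fields such as `permalink` alongside the resource fields, which the two-step approach discards.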


In [197]:
len(df)

3143

In [198]:
# Sorting the data sets by download_count
df[['name', 'download_count']].sort_values(by='download_count', ascending=False).head()

Unnamed: 0,name,download_count
33,Demographic Statistics By Zip Code,1014101.0
1237,Overhead Electronic Signs,429382.0
5,311 Service Requests from 2010 to Present,398304.0
10,Medallion Drivers - Active,287979.0
3,For Hire Vehicles (FHV) - Active,254346.0


In [199]:
highest_downloaded = df[['name', 'download_count']].sort_values(by='download_count', ascending=False)

name = highest_downloaded['name'].iloc[0]
downloads = highest_downloaded['download_count'].iloc[0]

print(f'The data set {name} has {downloads:,.0f} downloads',
      'and is the most downloaded data set on NYC Open Data.')

The data set Demographic Statistics By Zip Code has 1,014,101 downloads and is the most downloaded data set on NYC Open Data.


# Retrieving Data Directly from SODA API

In [200]:
# Source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# Data set id for NYC 311 on NYC Open Data on Socrata
socrata_dataset_identifier = 'erm2-nwe9'

url = 'https://' + socrata_domain + '/resource/' + socrata_dataset_identifier + '.csv?$limit=20'
print(url)

df = pd.read_csv(url)
print(df.shape)

df.head()

https://data.cityofnewyork.us/resource/erm2-nwe9.csv?$limit=20
(20, 41)


Unnamed: 0,unique_key,created_date,closed_date,agency,agency_name,complaint_type,descriptor,location_type,incident_zip,incident_address,...,vehicle_type,taxi_company_borough,taxi_pick_up_location,bridge_highway_name,bridge_highway_direction,road_ramp,bridge_highway_segment,latitude,longitude,location
0,49633221,2021-01-27T01:59:12.000,,DOHMH,Department of Health and Mental Hygiene,Rodent,Rat Sighting,1-2 Family Dwelling,11238,465 SAINT JOHN'S PLACE,...,,,,,,,,40.673467,-73.962338,"\n, \n(40.6734674790127, -73.9623383832787)"
1,49628739,2021-01-27T01:59:02.000,,NYPD,New York City Police Department,Illegal Parking,Blocked Hydrant,Street/Sidewalk,11365,67-10 190 LANE,...,,,,,,,,40.736984,-73.781609,"\n, \n(40.73698352494025, -73.78160893379135)"
2,49627727,2021-01-27T01:57:54.000,,NYPD,New York City Police Department,Noise - Residential,Banging/Pounding,Residential Building/House,11101,40-03 10 STREET,...,,,,,,,,40.75667,-73.944561,"\n, \n(40.75667046806001, -73.9445613744302)"
3,49630814,2021-01-27T01:55:27.000,,NYPD,New York City Police Department,Noise - Residential,Loud Music/Party,Residential Building/House,11365,67-10 190 LANE,...,,,,,,,,40.736984,-73.781609,"\n, \n(40.73698352494025, -73.78160893379135)"
4,49626583,2021-01-27T01:54:11.000,,NYPD,New York City Police Department,Abandoned Vehicle,With License Plate,Street/Sidewalk,11434,115 AVENUE,...,,,,,,,,40.690441,-73.776349,"\n, \n(40.690441338844835, -73.77634885043103)"


In [201]:
# Source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# Data set id for NYC 311 on NYC Open Data on Socrata
socrata_dataset_identifier = 'erm2-nwe9'

year = '2020'
column = 'created_date'

url = 'https://data.cityofnewyork.us/resource/erm2-nwe9.csv?' \
'$where=' + column + '%20>=%20%27' + year + '%27'  \
'&$limit=20'
print(url)

df = pd.read_csv(url)
print(df.shape)

df.head()

https://data.cityofnewyork.us/resource/erm2-nwe9.csv?$where=created_date%20>=%20%272020%27&$limit=20
(20, 41)


Unnamed: 0,unique_key,created_date,closed_date,agency,agency_name,complaint_type,descriptor,location_type,incident_zip,incident_address,...,vehicle_type,taxi_company_borough,taxi_pick_up_location,bridge_highway_name,bridge_highway_direction,road_ramp,bridge_highway_segment,latitude,longitude,location
0,45285347,2020-01-01T00:00:00.000,2020-01-10T00:00:01.000,DOHMH,Department of Health and Mental Hygiene,Food Poisoning,3 or More,Restaurant/Bar/Deli/Bakery,11229,3442 NOSTRAND AVENUE,...,,,,,,,,40.600129,-73.941843,"\n, \n(40.6001292057807, -73.94184291675883)"
1,45285651,2020-01-01T00:00:00.000,2020-01-02T00:00:01.000,DOHMH,Department of Health and Mental Hygiene,Food Poisoning,1 or 2,Restaurant/Bar/Deli/Bakery,10458,2701 DECATUR AVENUE,...,,,,,,,,40.864866,-73.888783,"\n, \n(40.86486556770799, -73.88878325729915)"
2,45285821,2020-01-01T00:00:00.000,2020-01-02T09:50:09.000,DOHMH,Department of Health and Mental Hygiene,Food Poisoning,1 or 2,Other (Explain Below),11203,5707 CHURCH AVENUE,...,,,,,,,,40.652536,-73.92354,"\n, \n(40.65253575905768, -73.92353994017134)"
3,45287907,2020-01-01T00:00:00.000,2020-01-02T00:00:01.000,DOHMH,Department of Health and Mental Hygiene,Food Poisoning,3 or More,Restaurant/Bar/Deli/Bakery,11214,1602 SHORE PARKWAY,...,,,,,,,,40.595653,-74.000173,"\n, \n(40.59565343138651, -74.00017283917487)"
4,45288120,2020-01-01T00:00:00.000,2020-01-02T09:51:29.000,DOHMH,Department of Health and Mental Hygiene,Food Poisoning,3 or More,Restaurant/Bar/Deli/Bakery,10455,748 EAST 149 STREET,...,,,,,,,,40.812996,-73.907973,"\n, \n(40.81299645614164, -73.90797324533352)"


In [202]:
url = 'https://data.cityofnewyork.us/resource/erm2-nwe9.csv?$query=SELECT%20*%20LIMIT%2020'
print(url)

df = pd.read_csv(url)
print(df.shape)

df.head()

https://data.cityofnewyork.us/resource/erm2-nwe9.csv?$query=SELECT%20*%20LIMIT%2020
(20, 41)


Unnamed: 0,unique_key,created_date,closed_date,agency,agency_name,complaint_type,descriptor,location_type,incident_zip,incident_address,...,vehicle_type,taxi_company_borough,taxi_pick_up_location,bridge_highway_name,bridge_highway_direction,road_ramp,bridge_highway_segment,latitude,longitude,location
0,49633221,2021-01-27T01:59:12.000,,DOHMH,Department of Health and Mental Hygiene,Rodent,Rat Sighting,1-2 Family Dwelling,11238,465 SAINT JOHN'S PLACE,...,,,,,,,,40.673467,-73.962338,"\n, \n(40.6734674790127, -73.9623383832787)"
1,49628739,2021-01-27T01:59:02.000,,NYPD,New York City Police Department,Illegal Parking,Blocked Hydrant,Street/Sidewalk,11365,67-10 190 LANE,...,,,,,,,,40.736984,-73.781609,"\n, \n(40.73698352494025, -73.78160893379135)"
2,49627727,2021-01-27T01:57:54.000,,NYPD,New York City Police Department,Noise - Residential,Banging/Pounding,Residential Building/House,11101,40-03 10 STREET,...,,,,,,,,40.75667,-73.944561,"\n, \n(40.75667046806001, -73.9445613744302)"
3,49630814,2021-01-27T01:55:27.000,,NYPD,New York City Police Department,Noise - Residential,Loud Music/Party,Residential Building/House,11365,67-10 190 LANE,...,,,,,,,,40.736984,-73.781609,"\n, \n(40.73698352494025, -73.78160893379135)"
4,49626583,2021-01-27T01:54:11.000,,NYPD,New York City Police Department,Abandoned Vehicle,With License Plate,Street/Sidewalk,11434,115 AVENUE,...,,,,,,,,40.690441,-73.776349,"\n, \n(40.690441338844835, -73.77634885043103)"


In [203]:
query = """
        SELECT %20
            * %20
        WHERE %20
            created_date %20 >= %20%27 2020 %27%20 
        LIMIT %20
            20
        """

query = ''.join(query.split())
print('query:', query)

url = 'https://data.cityofnewyork.us/resource/erm2-nwe9.csv?$query=' + query
print('url:', url)

df = pd.read_csv(url)
print(df.shape)

df.head()

query: SELECT%20*%20WHERE%20created_date%20>=%20%272020%27%20LIMIT%2020
url: https://data.cityofnewyork.us/resource/erm2-nwe9.csv?$query=SELECT%20*%20WHERE%20created_date%20>=%20%272020%27%20LIMIT%2020
(20, 41)


Unnamed: 0,unique_key,created_date,closed_date,agency,agency_name,complaint_type,descriptor,location_type,incident_zip,incident_address,...,vehicle_type,taxi_company_borough,taxi_pick_up_location,bridge_highway_name,bridge_highway_direction,road_ramp,bridge_highway_segment,latitude,longitude,location
0,45285347,2020-01-01T00:00:00.000,2020-01-10T00:00:01.000,DOHMH,Department of Health and Mental Hygiene,Food Poisoning,3 or More,Restaurant/Bar/Deli/Bakery,11229,3442 NOSTRAND AVENUE,...,,,,,,,,40.600129,-73.941843,"\n, \n(40.6001292057807, -73.94184291675883)"
1,45285651,2020-01-01T00:00:00.000,2020-01-02T00:00:01.000,DOHMH,Department of Health and Mental Hygiene,Food Poisoning,1 or 2,Restaurant/Bar/Deli/Bakery,10458,2701 DECATUR AVENUE,...,,,,,,,,40.864866,-73.888783,"\n, \n(40.86486556770799, -73.88878325729915)"
2,45285821,2020-01-01T00:00:00.000,2020-01-02T09:50:09.000,DOHMH,Department of Health and Mental Hygiene,Food Poisoning,1 or 2,Other (Explain Below),11203,5707 CHURCH AVENUE,...,,,,,,,,40.652536,-73.92354,"\n, \n(40.65253575905768, -73.92353994017134)"
3,45287907,2020-01-01T00:00:00.000,2020-01-02T00:00:01.000,DOHMH,Department of Health and Mental Hygiene,Food Poisoning,3 or More,Restaurant/Bar/Deli/Bakery,11214,1602 SHORE PARKWAY,...,,,,,,,,40.595653,-74.000173,"\n, \n(40.59565343138651, -74.00017283917487)"
4,45288120,2020-01-01T00:00:00.000,2020-01-02T09:51:29.000,DOHMH,Department of Health and Mental Hygiene,Food Poisoning,3 or More,Restaurant/Bar/Deli/Bakery,10455,748 EAST 149 STREET,...,,,,,,,,40.812996,-73.907973,"\n, \n(40.81299645614164, -73.90797324533352)"
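
The cell above writes the `%20` and `%27` escapes into the query by hand. A possible alternative is to write the SoQL as plain text and let the standard library percent-encode it. The sketch below uses `urllib.parse.quote`, which also encodes `*`, `>` and `=`; that the endpoint accepts the fully encoded form is an assumption based on standard URL decoding and is not shown in this notebook.

```python
from urllib.parse import quote

# The same query as above, written as plain SoQL
query = """
SELECT
    *
WHERE
    created_date >= '2020'
LIMIT
    20
"""

# Collapse newlines and indentation into single spaces
compact_query = ' '.join(query.split())

# Percent-encode spaces, quotes and operators for use in a URL
encoded_query = quote(compact_query)

url = 'https://data.cityofnewyork.us/resource/erm2-nwe9.csv?$query=' + encoded_query
print(url)
```

The resulting `url` can then be passed to `pd.read_csv(url)` as in the cells above.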
