# sodapy tutorial using NYC Open Data 
Mark Bauer

# Introduction  
This notebook demonstrates how to use sodapy, the Python client for the Socrata Open Data API, with NYC Open Data. It covers the most commonly used methods and basic queries in SoQL, the Socrata Query Language.

# sodapy
sodapy is a Python client for the Socrata Open Data API.

# Information about sodapy
 
**Installing**: https://pypi.org/project/sodapy/  
**GitHub**: https://github.com/xmunoz/sodapy

**The official Socrata Open Data API (SODA) docs**  
https://dev.socrata.com/

**Queries using SODA**  
https://dev.socrata.com/docs/queries/

**Inspiration for this notebook**  
https://github.com/xmunoz/sodapy/blob/master/examples/soql_queries.ipynb

# Importing Libraries

In [101]:
# importing libraries
import pandas as pd
from sodapy import Socrata
import itertools 

In [102]:
%reload_ext watermark

In [103]:
%watermark -a "Mark Bauer" -u -t -d -v -p pandas,sodapy,itertools

Mark Bauer 
last updated: 2021-01-25 15:42:47 

CPython 3.7.1
IPython 7.18.1

pandas 1.0.0
sodapy 2.0.0
itertools unknown


Documentation for installing watermark: https://github.com/rasbt/watermark

# Using sodapy

To use sodapy, you need the source domain of the open data portal you want to connect to. To query a specific data set, you also need its data set identifier (the data set id on that source domain). Below, we identify NYC Open Data's source domain, `data.cityofnewyork.us`, and the data set identifier for the NYC 311 data set, `erm2-nwe9`.

![nyc-311-api-docs](nyc-311-api-docs.png)  

source: https://dev.socrata.com/foundry/data.cityofnewyork.us/erm2-nwe9

We save this information as variables below.

In [104]:
socrata_domain = 'data.cityofnewyork.us'
socrata_dataset_identifier = 'erm2-nwe9'

# Socrata class in sodapy

In [105]:
# The main class that interacts with the SODA API.

# The required arguments are:
#     domain: the domain you wish to access
#     app_token: your Socrata application token
# Simple requests are possible without an app_token, though these
# requests will be rate-limited.

client = Socrata(socrata_domain, None, timeout=1000)
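The comment above notes that requests without an app token are rate-limited. A minimal sketch of supplying a token without hard-coding it, assuming the token is stored in an environment variable. The variable name `SOCRATA_APP_TOKEN` and the helper `socrata_client_args` are hypothetical and not part of sodapy.

```python
import os


def socrata_client_args(domain, env_var="SOCRATA_APP_TOKEN", timeout=60):
    """Build positional and keyword arguments for sodapy's Socrata class.

    env_var is a hypothetical variable name. When it is unset, the token
    is None, which matches the unauthenticated client used in this notebook.
    """
    app_token = os.environ.get(env_var)
    return (domain, app_token), {"timeout": timeout}


args, kwargs = socrata_client_args("data.cityofnewyork.us")
print(args[0], kwargs)

# With sodapy installed, the client would then be created as:
# client = Socrata(*args, **kwargs)
```

Reading the token from the environment keeps it out of the notebook, which matters when the notebook is shared publicly.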



In [106]:
type(client)

sodapy.Socrata

Socrata Methods

![socrata-methods](socrata-methods.png)

source: https://github.com/xmunoz/sodapy#datasetslimit0-offset0

We will review a few of the popular methods in this tutorial.

# Socrata Methods

## `.datasets()`

`datasets` method: Returns the list of datasets associated with a particular domain.
WARNING: Large limits (>1000) will return megabytes of data,
which can be slow on low-bandwidth networks, and is also a lot of
data to hold in memory.

In [107]:
print(type(client.datasets()))

<class 'list'>


In [108]:
print(type(client.datasets()[0]))

<class 'dict'>


In [109]:
client.datasets()[0].keys()

dict_keys(['resource', 'classification', 'metadata', 'permalink', 'link', 'owner', 'creator'])

In [110]:
# Viewing the resource dictionary in the first item of the datasets list. From there, 
# we view the first five items under the resource dictionary.

limit = 5

dict(itertools.islice(client.datasets()[0]['resource'].items(), limit))

{'name': 'DOB Job Application Filings',
 'id': 'ic3t-wcy2',
 'parent_fxf': [],
 'description': 'This dataset contains all job applications submitted through the Borough Offices, through eFiling, or through the HUB, which have a "Latest Action Date" since January 1, 2000. This dataset does not include jobs submitted through DOB NOW. See the DOB NOW: Build – Job Application Filings dataset for DOB NOW jobs.',
 'attribution': 'Department of Buildings (DOB)'}

In [111]:
# Once we've identified the structure of the dictionary, we try to 
# find the 311 data set and identify its position in the datasets list.
# The list is fetched once and reused, rather than calling the API on every iteration.

datasets = client.datasets()

for idx, dataset in enumerate(datasets):
    if dataset['resource']['id'] == 'erm2-nwe9':
        print(dataset['resource']['name'],
              '\nindex:', idx)
        break

311 Service Requests from 2010 to Present 
index: 5


In [112]:
# Previewing information about the 311 data set from the datasets method.
# Note: Information is quite long.

idx_311 = idx
print(idx_311)

client.datasets()[idx_311]

5


{'resource': {'name': '311 Service Requests from 2010 to Present',
  'id': 'erm2-nwe9',
  'parent_fxf': [],
  'description': '<b>NOTE: This data does not present a full picture of 311 calls or service requests, in part because of operational and system complexities associated with remote call taking necessitated by the unprecedented volume 311 is handling during the Covid-19 crisis. The City is working to address this issue. </b>\r\n\r\nAll 311 Service Requests from 2010 to present. This information is automatically updated daily.',
  'attribution': '311, DoITT',
  'attribution_link': None,
  'contact_email': None,
  'type': 'dataset',
  'updatedAt': '2021-01-25T02:41:21.000Z',
  'createdAt': '2011-10-10T05:52:17.000Z',
  'metadata_updated_at': '2020-04-22T20:18:38.000Z',
  'data_updated_at': '2021-01-25T02:41:21.000Z',
  'page_views': {'page_views_last_week': 1294,
   'page_views_last_month': 4868,
   'page_views_total': 438194,
   'page_views_last_week_log': 10.338736382573916,
   'p

In [113]:
# Since the output of the datasets method is long, let's list the keys of the
# resource dictionary so we can choose which ones to preview.

lst_of_keys = list(client.datasets()[idx_311]['resource'].keys())

print(lst_of_keys)

['name', 'id', 'parent_fxf', 'description', 'attribution', 'attribution_link', 'contact_email', 'type', 'updatedAt', 'createdAt', 'metadata_updated_at', 'data_updated_at', 'page_views', 'columns_name', 'columns_field_name', 'columns_datatype', 'columns_description', 'columns_format', 'download_count', 'provenance', 'lens_view_type', 'blob_mime_type', 'hide_from_data_json', 'publication_date']


In [114]:
# Previewing keys, values in the resource dictionary.

for item in lst_of_keys:
    print(item + ':', client.datasets()[idx_311]['resource'][item], '\n')

name: 311 Service Requests from 2010 to Present 

id: erm2-nwe9 

parent_fxf: [] 

description: <b>NOTE: This data does not present a full picture of 311 calls or service requests, in part because of operational and system complexities associated with remote call taking necessitated by the unprecedented volume 311 is handling during the Covid-19 crisis. The City is working to address this issue. </b>

All 311 Service Requests from 2010 to present. This information is automatically updated daily. 

attribution: 311, DoITT 

attribution_link: None 

contact_email: None 

type: dataset 

updatedAt: 2021-01-25T02:41:21.000Z 

createdAt: 2011-10-10T05:52:17.000Z 

metadata_updated_at: 2020-04-22T20:18:38.000Z 

data_updated_at: 2021-01-25T02:41:21.000Z 

page_views: {'page_views_last_week': 1294, 'page_views_last_month': 4868, 'page_views_total': 438194, 'page_views_last_week_log': 10.338736382573916, 'page_views_last_month_log': 12.249409785269128, 'page_views_total_log': 18.74121349706627

lens_view_type: tabular 

blob_mime_type: None 

hide_from_data_json: False 

publication_date: 2018-04-20T03:03:18.000Z 



## `.get()`

`get` method: Read data from the requested resource. Options for content_type are json,
csv, and xml.

In [138]:
client = Socrata(socrata_domain, None, timeout=100)



In [139]:
# Using try and except statements because these requests are large and may time out.
# If the request times out, we skip it.

try:
    print(type(client.get(socrata_dataset_identifier)))
except Exception:
    print('timeout error. skipping.')

<class 'list'>


In [140]:
# Using try and except statements because these requests are large and may time out.
# If the request times out, we skip it.

try:
    print(client.get(socrata_dataset_identifier)[0].keys())
except Exception:
    print('timeout error. skipping.')

dict_keys(['unique_key', 'created_date', 'agency', 'agency_name', 'complaint_type', 'descriptor', 'location_type', 'incident_zip', 'incident_address', 'street_name', 'cross_street_1', 'cross_street_2', 'intersection_street_1', 'intersection_street_2', 'city', 'landmark', 'status', 'community_board', 'bbl', 'borough', 'x_coordinate_state_plane', 'y_coordinate_state_plane', 'open_data_channel_type', 'park_facility_name', 'park_borough', 'latitude', 'longitude', 'location', ':@computed_region_efsh_h5xi', ':@computed_region_f5dn_yrer', ':@computed_region_yeji_bk3q', ':@computed_region_92fq_4b7q', ':@computed_region_sbqj_enih'])


In [141]:
# Using try and except statements because these requests are large and may time out.
# If the request times out, we skip it.

try:
    print(client.get(socrata_dataset_identifier, select='*')[0])
except Exception:
    print('timeout error. skipping.')

timeout error. skipping.
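A smaller request is less likely to time out. One option is to name a few columns and cap the row count; `select` and `limit` are SoQL parameters that `get` accepts as keyword arguments. The helper name `preview_params` below is hypothetical, and the sketch only builds the arguments so it runs without network access.

```python
def preview_params(columns, limit=5):
    """Return keyword arguments for a small preview request."""
    return {"select": ", ".join(columns), "limit": limit}


params = preview_params(["unique_key", "created_date", "complaint_type"])
print(params)

# With a live client, the request would look like:
# rows = client.get(socrata_dataset_identifier, **params)
```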


## `.get_metadata()`

`get_metadata` method: Retrieve the metadata for a particular dataset.

In [142]:
client = Socrata(socrata_domain, None, timeout=1000)



In [143]:
type(client.get_metadata(socrata_dataset_identifier))

dict

In [144]:
# Previewing keys in dictionary
client.get_metadata(socrata_dataset_identifier).keys()

dict_keys(['id', 'name', 'assetType', 'attribution', 'averageRating', 'category', 'createdAt', 'description', 'displayType', 'downloadCount', 'hideFromCatalog', 'hideFromDataJson', 'indexUpdatedAt', 'newBackend', 'numberOfComments', 'oid', 'provenance', 'publicationAppendEnabled', 'publicationDate', 'publicationGroup', 'publicationStage', 'rowClass', 'rowIdentifierColumnId', 'rowsUpdatedAt', 'rowsUpdatedBy', 'tableId', 'totalTimesRated', 'viewCount', 'viewLastModified', 'viewType', 'approvals', 'columns', 'grants', 'metadata', 'owner', 'query', 'rights', 'tableAuthor', 'tags', 'flags'])

In [145]:
# Previewing keys vertically.
keys = client.get_metadata(socrata_dataset_identifier).keys()
for key in keys:
    print(key)

id
name
assetType
attribution
averageRating
category
createdAt
description
displayType
downloadCount
hideFromCatalog
hideFromDataJson
indexUpdatedAt
newBackend
numberOfComments
oid
provenance
publicationAppendEnabled
publicationDate
publicationGroup
publicationStage
rowClass
rowIdentifierColumnId
rowsUpdatedAt
rowsUpdatedBy
tableId
totalTimesRated
viewCount
viewLastModified
viewType
approvals
columns
grants
metadata
owner
query
rights
tableAuthor
tags
flags


In [146]:
# Previewing the id and name of the data set.
print('id and name of dataset\n' + \
      '-' * 30 + \
      '\nid:', client.get_metadata(socrata_dataset_identifier)['id'], \
      '\nname:', client.get_metadata(socrata_dataset_identifier)['name'])

id and name of dataset
------------------------------
id: erm2-nwe9 
name: 311 Service Requests from 2010 to Present


In [147]:
# Previewing the first 30 keys, values of the dictionary.
metadata = client.get_metadata(socrata_dataset_identifier)

limit = 30
out = dict(itertools.islice(metadata.items(), limit))
print(type(out), '\n')

for key, value in out.items():
    print(key + ':',  value)

<class 'dict'> 

id: erm2-nwe9
name: 311 Service Requests from 2010 to Present
assetType: dataset
attribution: 311, DoITT
averageRating: 0
category: Social Services
createdAt: 1318225937
description: <b>NOTE: This data does not present a full picture of 311 calls or service requests, in part because of operational and system complexities associated with remote call taking necessitated by the unprecedented volume 311 is handling during the Covid-19 crisis. The City is working to address this issue. </b>

All 311 Service Requests from 2010 to present. This information is automatically updated daily.
displayType: table
downloadCount: 398217
hideFromCatalog: False
hideFromDataJson: False
indexUpdatedAt: 1571326778
newBackend: True
numberOfComments: 19
oid: 28506835
provenance: official
publicationAppendEnabled: False
publicationDate: 1524193398
publicationGroup: 244403
publicationStage: published
rowClass: 
rowIdentifierColumnId: 354922030
rowsUpdatedAt: 1611542481
rowsUpdatedBy: 5fuc-pq
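Several metadata fields above (`createdAt`, `publicationDate`, `rowsUpdatedAt`) are integers rather than date strings. They appear to be Unix timestamps in seconds: converting `createdAt` and `rowsUpdatedAt` this way reproduces the `createdAt` and `data_updated_at` strings returned by `.datasets()` earlier. A short sketch using only the standard library (`epoch_to_iso` is a hypothetical helper name):

```python
from datetime import datetime, timezone


def epoch_to_iso(seconds):
    """Convert Unix epoch seconds to an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).isoformat()


created_at = epoch_to_iso(1318225937)       # createdAt from the metadata above
rows_updated_at = epoch_to_iso(1611542481)  # rowsUpdatedAt from the metadata above

print(created_at)
print(rows_updated_at)
```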

In [148]:
# Saving metadata dictionary as 'metadata'
metadata = client.get_metadata(socrata_dataset_identifier)
print(type(metadata))

# Previewing the datatype of columns
print(type(metadata['columns']), ', length:', len(metadata['columns']))

# Previewing the field names for each element in our columns list
for x in metadata['columns']:
    print(x['fieldName'])

<class 'dict'>
<class 'list'> , length: 46
unique_key
created_date
closed_date
agency
agency_name
complaint_type
descriptor
location_type
incident_zip
incident_address
street_name
cross_street_1
cross_street_2
intersection_street_1
intersection_street_2
address_type
city
landmark
facility_type
status
due_date
resolution_description
resolution_action_updated_date
community_board
bbl
borough
x_coordinate_state_plane
y_coordinate_state_plane
open_data_channel_type
park_facility_name
park_borough
vehicle_type
taxi_company_borough
taxi_pick_up_location
bridge_highway_name
bridge_highway_direction
road_ramp
bridge_highway_segment
latitude
longitude
location
:@computed_region_efsh_h5xi
:@computed_region_f5dn_yrer
:@computed_region_yeji_bk3q
:@computed_region_92fq_4b7q
:@computed_region_sbqj_enih


In [149]:
metadata = client.get_metadata(socrata_dataset_identifier)

# Previewing the first element in our columns list
metadata['columns'][0]

{'id': 354922030,
 'name': 'Unique Key',
 'dataTypeName': 'text',
 'description': 'Unique identifier of a Service Request (SR) in the open data set\n',
 'fieldName': 'unique_key',
 'position': 1,
 'renderTypeName': 'text',
 'tableColumnId': 1567787,
 'width': 220,
 'cachedContents': {'largest': '49608994',
  'non_null': '24796821',
  'null': '0',
  'top': [{'item': '10693408', 'count': '1'},
   {'item': '10836749', 'count': '1'},
   {'item': '10836967', 'count': '1'},
   {'item': '11051177', 'count': '1'},
   {'item': '11413576', 'count': '1'},
   {'item': '11463895', 'count': '1'},
   {'item': '11463896', 'count': '1'},
   {'item': '11464334', 'count': '1'},
   {'item': '11464394', 'count': '1'},
   {'item': '11464467', 'count': '1'},
   {'item': '11464508', 'count': '1'},
   {'item': '11464509', 'count': '1'},
   {'item': '11464521', 'count': '1'},
   {'item': '11464567', 'count': '1'},
   {'item': '11464572', 'count': '1'},
   {'item': '11464639', 'count': '1'},
   {'item': '1146484

In [150]:
metadata = client.get_metadata(socrata_dataset_identifier)

# Identifying our columns list
cols = metadata['columns']

# Previewing the second element (index 1) in our columns list
cols[1]

{'id': 354922031,
 'name': 'Created Date',
 'dataTypeName': 'calendar_date',
 'description': 'Date SR  was created\n',
 'fieldName': 'created_date',
 'position': 2,
 'renderTypeName': 'calendar_date',
 'tableColumnId': 1567788,
 'width': 244,
 'cachedContents': {'largest': '2021-01-24T02:04:05.000',
  'non_null': '24796821',
  'null': '0',
  'top': [{'item': '2013-01-24T00:00:00.000', 'count': '7650'},
   {'item': '2015-01-08T00:00:00.000', 'count': '7242'},
   {'item': '2014-01-07T00:00:00.000', 'count': '7030'},
   {'item': '2015-02-16T00:00:00.000', 'count': '6430'},
   {'item': '2014-01-08T00:00:00.000', 'count': '6197'},
   {'item': '2012-01-04T00:00:00.000', 'count': '5933'},
   {'item': '2013-11-25T00:00:00.000', 'count': '5909'},
   {'item': '2014-01-23T00:00:00.000', 'count': '5782'},
   {'item': '2014-01-22T00:00:00.000', 'count': '5497'},
   {'item': '2015-01-07T00:00:00.000', 'count': '5432'},
   {'item': '2011-01-24T00:00:00.000', 'count': '5380'},
   {'item': '2011-10-28T

In [151]:
# Creating a fieldName dictionary for every element in our column list
fieldName = {x['fieldName']: x for x in cols}

# Previewing the field names (keys) in our fieldName dictionary
for key in fieldName.keys():
    print(key)

unique_key
created_date
closed_date
agency
agency_name
complaint_type
descriptor
location_type
incident_zip
incident_address
street_name
cross_street_1
cross_street_2
intersection_street_1
intersection_street_2
address_type
city
landmark
facility_type
status
due_date
resolution_description
resolution_action_updated_date
community_board
bbl
borough
x_coordinate_state_plane
y_coordinate_state_plane
open_data_channel_type
park_facility_name
park_borough
vehicle_type
taxi_company_borough
taxi_pick_up_location
bridge_highway_name
bridge_highway_direction
road_ramp
bridge_highway_segment
latitude
longitude
location
:@computed_region_efsh_h5xi
:@computed_region_f5dn_yrer
:@computed_region_yeji_bk3q
:@computed_region_92fq_4b7q
:@computed_region_sbqj_enih


In [152]:
# Dropping the last five field names (the computed-region columns)
list(fieldName.keys())[:-5]

['unique_key',
 'created_date',
 'closed_date',
 'agency',
 'agency_name',
 'complaint_type',
 'descriptor',
 'location_type',
 'incident_zip',
 'incident_address',
 'street_name',
 'cross_street_1',
 'cross_street_2',
 'intersection_street_1',
 'intersection_street_2',
 'address_type',
 'city',
 'landmark',
 'facility_type',
 'status',
 'due_date',
 'resolution_description',
 'resolution_action_updated_date',
 'community_board',
 'bbl',
 'borough',
 'x_coordinate_state_plane',
 'y_coordinate_state_plane',
 'open_data_channel_type',
 'park_facility_name',
 'park_borough',
 'vehicle_type',
 'taxi_company_borough',
 'taxi_pick_up_location',
 'bridge_highway_name',
 'bridge_highway_direction',
 'road_ramp',
 'bridge_highway_segment',
 'latitude',
 'longitude',
 'location']

In [153]:
# Saving the field names without the last five (the computed-region columns)
cols_as_list = list(fieldName.keys())[:-5]

cols_as_list

['unique_key',
 'created_date',
 'closed_date',
 'agency',
 'agency_name',
 'complaint_type',
 'descriptor',
 'location_type',
 'incident_zip',
 'incident_address',
 'street_name',
 'cross_street_1',
 'cross_street_2',
 'intersection_street_1',
 'intersection_street_2',
 'address_type',
 'city',
 'landmark',
 'facility_type',
 'status',
 'due_date',
 'resolution_description',
 'resolution_action_updated_date',
 'community_board',
 'bbl',
 'borough',
 'x_coordinate_state_plane',
 'y_coordinate_state_plane',
 'open_data_channel_type',
 'park_facility_name',
 'park_borough',
 'vehicle_type',
 'taxi_company_borough',
 'taxi_pick_up_location',
 'bridge_highway_name',
 'bridge_highway_direction',
 'road_ramp',
 'bridge_highway_segment',
 'latitude',
 'longitude',
 'location']
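Slicing off the last five names works only while the computed-region columns stay at the end of the list. A sketch of an alternative that filters by the `:@computed_region` prefix seen in the field names above. The short sample list stands in for `fieldName.keys()`, and `drop_computed_regions` is a hypothetical helper name.

```python
def drop_computed_regions(field_names, prefix=":@computed_region"):
    """Keep only the field names that do not start with the given prefix."""
    return [name for name in field_names if not name.startswith(prefix)]


# A shortened stand-in for list(fieldName.keys())
sample_fields = [
    "unique_key",
    "created_date",
    "location",
    ":@computed_region_efsh_h5xi",
    ":@computed_region_f5dn_yrer",
]

cols_filtered = drop_computed_regions(sample_fields)
print(cols_filtered)

# The filtered names can be joined into a SoQL select clause.
select_clause = ", ".join(cols_filtered)
print(select_clause)
```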

# Socrata Query Language (SoQL) - Analyzing NYC 311 Complaints

## Most NYC 311 Complaints by Complaint Type

In [154]:
# Source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# Data set id for NYC 311 on NYC Open Data on Socrata
socrata_dataset_identifier = 'erm2-nwe9'

# Socrata - The main class that interacts with the SODA API. We pass the source domain value
# of NYC Open data, the app token as 'None', and set the timeout parameter for '1,000 seconds'
client = Socrata(socrata_domain, None, timeout=1000)

# SoQL query string below:
# Select the complaint_type and the count of complaint_type columns 
# grouped by complaint_type, sorted by the count of complaint_type in descending order
# and limit our records to 1,000.

query = """
SELECT 
    complaint_type, 
    count(complaint_type) AS count
GROUP BY 
    complaint_type
ORDER BY 
    count(complaint_type) DESC
LIMIT 
    1000
"""

# Requesting data from the NYC 311 data set
# and passing our query as a full SoQL query string
results = client.get(socrata_dataset_identifier, query=query)

# results is returned as JSON from API and converted to Python list of
# dictionaries by sodapy
print(type(results), 'Returned a list from our request.\n')

# Identifying type of first element of our results list
print(type(results[0]), 'However, request is actually a list of dictionaries.\n')

# Convert list of dictionaries to a pandas DataFrame
results_df = pd.DataFrame.from_records(results)

print(type(results_df), 'Convert list of dictionaries to DataFrame.')
print('Rows and columns of data:', results_df.shape)

results_df.head(10)



<class 'list'> Returned a list from our request.

<class 'dict'> However, request is actually a list of dictionaries.

<class 'pandas.core.frame.DataFrame'> Convert list of dictionaries to DataFrame.
Rows and columns of data: (445, 2)


| | complaint_type | count |
|---|---|---|
| 0 | Noise - Residential | 2236002 |
| 1 | HEAT/HOT WATER | 1418669 |
| 2 | Illegal Parking | 1124210 |
| 3 | Blocked Driveway | 1044959 |
| 4 | Street Condition | 1020518 |
| 5 | Street Light Condition | 983562 |
| 6 | HEATING | 887869 |
| 7 | PLUMBING | 747458 |
| 8 | Water System | 686847 |
| 9 | Noise - Street/Sidewalk | 682283 |
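Values in the JSON response, including counts, typically arrive as strings, so numeric columns usually need converting before sorting or arithmetic. A minimal sketch on the first three rows shown above; `with_int_counts` is a hypothetical helper name. In pandas, the equivalent step would be `pd.to_numeric(results_df['count'])`.

```python
# The first three rows of the result above, as the API would return them
# (this assumes counts arrive as strings).
rows = [
    {"complaint_type": "Noise - Residential", "count": "2236002"},
    {"complaint_type": "HEAT/HOT WATER", "count": "1418669"},
    {"complaint_type": "Illegal Parking", "count": "1124210"},
]


def with_int_counts(records, field="count"):
    """Return copies of the records with the given field cast to int."""
    return [{**record, field: int(record[field])} for record in records]


converted = with_int_counts(rows)
total = sum(record["count"] for record in converted)
print(total)
```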


## Most NYC 311 Complaints by Descriptor

In [155]:
# Source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# Data set id for NYC 311 on NYC Open Data on Socrata
socrata_dataset_identifier = 'erm2-nwe9'

# Socrata - The main class that interacts with the SODA API. We pass the source domain value
# of NYC Open data, the app token as 'None', and set the timeout parameter for '1,000 seconds'
client = Socrata(socrata_domain, None, timeout=1000)

# SoQL query string below:
# Select the descriptor and count of descriptor columns grouped by descriptor,
# sort by the count in descending order and limit our records to 1,000.

query = """
SELECT 
    descriptor, 
    count(descriptor) AS count
GROUP BY 
    descriptor
ORDER BY 
    count DESC
LIMIT 
    1000
"""

# Requesting data from the NYC 311 data set
# and passing our query as a full SoQL query string
results = client.get(socrata_dataset_identifier, query=query)

# Convert list of dictionaries to a pandas DataFrame
results_df = pd.DataFrame.from_records(results)

print('Rows and columns of data:', results_df.shape)

results_df.head(10)



Rows and columns of data: (1000, 2)


| | descriptor | count |
|---|---|---|
| 0 | Loud Music/Party | 2335159 |
| 1 | ENTIRE BUILDING | 923819 |
| 2 | HEAT | 871935 |
| 3 | No Access | 778060 |
| 4 | Street Light Out | 729234 |
| 5 | Pothole | 617633 |
| 6 | Banging/Pounding | 612473 |
| 7 | APARTMENT ONLY | 494850 |
| 8 | Loud Talking | 358823 |
| 9 | CEILING | 358776 |
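The complaint-type and descriptor queries differ only in the column being counted. A sketch of a small helper that builds the same SoQL string for any column. `count_by_query` is a hypothetical name, and the optional `where` argument is passed through as written, so it must already be valid SoQL.

```python
def count_by_query(column, where=None, limit=1000):
    """Build a SoQL query that counts rows per value of `column`."""
    lines = [f"SELECT {column}, count({column}) AS count"]
    if where:
        lines.append(f"WHERE {where}")
    lines.append(f"GROUP BY {column}")
    lines.append("ORDER BY count DESC")
    lines.append(f"LIMIT {limit}")
    return "\n".join(lines)


query = count_by_query("descriptor")
print(query)

# With a live client, the query would be passed exactly as in the cells above:
# results = client.get(socrata_dataset_identifier, query=query)
```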


## Most NYC 311 Complaints by Day

In [156]:
# Source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# Data set id for NYC 311 on NYC Open Data on Socrata
socrata_dataset_identifier = 'erm2-nwe9'

# Socrata - The main class that interacts with the SODA API. We pass the source domain value
# of NYC Open data, the app token as 'None', and set the timeout parameter for '1,000 seconds'
client = Socrata(socrata_domain, None, timeout=1000)

# SoQL query string below:
# Select the day and count day grouped by day,
# sort by count in descending order and limit our records to 1,000.

query = """
SELECT 
    date_trunc_ymd(created_date) AS day, 
    count(day) AS count
GROUP BY 
    day
ORDER BY 
    count DESC
LIMIT 
    1000
"""

# Requesting data from the NYC 311 data set
# and passing our query as a full SoQL query string
results = client.get(socrata_dataset_identifier, query=query)

# Convert list of dictionaries to a pandas DataFrame
results_df = pd.DataFrame.from_records(results)

print('Rows and columns of data:', results_df.shape)

results_df.head(10)



Rows and columns of data: (1000, 2)


| | day | count |
|---|---|---|
| 0 | 2020-08-04T00:00:00.000 | 23314 |
| 1 | 2020-08-05T00:00:00.000 | 18305 |
| 2 | 2020-07-05T00:00:00.000 | 16014 |
| 3 | 2020-07-04T00:00:00.000 | 15365 |
| 4 | 2020-06-20T00:00:00.000 | 15098 |
| 5 | 2020-06-21T00:00:00.000 | 14965 |
| 6 | 2020-06-28T00:00:00.000 | 12899 |
| 7 | 2020-06-27T00:00:00.000 | 12074 |
| 8 | 2020-08-09T00:00:00.000 | 12057 |
| 9 | 2020-08-06T00:00:00.000 | 12043 |


## Displaying the difference between the date (timestamp) and day (date_trunc_ymd) columns

In [157]:
# Source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# Data set id for NYC 311 on NYC Open Data on Socrata
socrata_dataset_identifier = 'erm2-nwe9'

# Socrata - The main class that interacts with the SODA API. We pass the source domain value
# of NYC Open data, the app token as 'None', and set the timeout parameter for '1,000 seconds'
client = Socrata(socrata_domain, None, timeout=1000)

# SoQL query string below:
# Select the timestamp, day and count timestamp grouped by timestamp,
# sort by count in descending order and limit our records to 1,000.

query = """
SELECT 
    created_date as timestamp, 
    date_trunc_ymd(created_date) as day, 
    count(timestamp) AS count
GROUP BY 
    timestamp
ORDER BY 
    count ASC
LIMIT 
    1000
"""

# Requesting data from the NYC 311 data set
# and passing our query as a full SoQL query string
results = client.get(socrata_dataset_identifier, query=query)

# Convert list of dictionaries to a pandas DataFrame
results_df = pd.DataFrame.from_records(results)

print('Rows and columns of data:', results_df.shape)

results_df.head(10)



Rows and columns of data: (1000, 3)


| | timestamp | day | count |
|---|---|---|---|
| 0 | 2010-01-01T15:48:17.000 | 2010-01-01T00:00:00.000 | 1 |
| 1 | 2010-01-01T16:01:57.000 | 2010-01-01T00:00:00.000 | 1 |
| 2 | 2010-01-01T15:40:55.000 | 2010-01-01T00:00:00.000 | 1 |
| 3 | 2010-01-01T15:48:01.000 | 2010-01-01T00:00:00.000 | 1 |
| 4 | 2010-01-01T15:57:07.000 | 2010-01-01T00:00:00.000 | 1 |
| 5 | 2010-01-01T16:01:43.000 | 2010-01-01T00:00:00.000 | 1 |
| 6 | 2010-01-01T15:35:00.000 | 2010-01-01T00:00:00.000 | 1 |
| 7 | 2010-01-01T15:39:32.000 | 2010-01-01T00:00:00.000 | 1 |
| 8 | 2010-01-01T15:45:00.000 | 2010-01-01T00:00:00.000 | 1 |
| 9 | 2010-01-01T15:48:00.000 | 2010-01-01T00:00:00.000 | 1 |
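`date_trunc_ymd` keeps the year, month, and day of a timestamp and zeroes the time of day, which is what the `day` column above shows. The same truncation in plain Python, as an illustration only (the server does this work inside the query); `trunc_ymd` is a hypothetical helper name.

```python
from datetime import datetime


def trunc_ymd(timestamp):
    """Zero the time of day of an ISO 8601 timestamp string."""
    parsed = datetime.fromisoformat(timestamp)
    day = parsed.replace(hour=0, minute=0, second=0, microsecond=0)
    return day.isoformat(timespec="milliseconds")


day = trunc_ymd("2010-01-01T15:48:17.000")
print(day)
```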


## Analyzing NYC 311 Street Flooding Complaints

### Searching the data set for the word "flood" in the complaint_type field

In [158]:
# Source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# Data set id for NYC 311 on NYC Open Data on Socrata
socrata_dataset_identifier = 'erm2-nwe9'

# Socrata - The main class that interacts with the SODA API. We pass the source domain value
# of NYC Open data, the app token as 'None', and set the timeout parameter for '1,000 seconds'
client = Socrata(socrata_domain, None, timeout=1000)

# SoQL query string below:
# Select complaint_type and count of complaint_type grouped by complaint_type,
# where the word "flood" is in complaint_type,
# sort count in descending order and limit our records to 1,000.

query = """
SELECT 
    complaint_type, 
    count(complaint_type) AS count
WHERE 
    LOWER(complaint_type) LIKE '%flood%'
GROUP BY 
    complaint_type
ORDER BY 
    count DESC
LIMIT 
    1000
"""

# Returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get(socrata_dataset_identifier, query=query)

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)

print('shape of data:', results_df.shape)
results_df



shape of data: (0, 0)
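The shape `(0, 0)` means the query matched no rows: no `complaint_type` value contains the word "flood", so the API returned an empty list. Code that indexes `results[0]` would raise an `IndexError` in that case, so a guard can be useful. The sketch below uses a hypothetical helper, `summarize`, and a made-up sample row.

```python
def summarize(results):
    """Describe a list of result dictionaries without assuming it is non-empty."""
    if not results:
        return "no matching rows"
    columns = ", ".join(results[0].keys())
    return f"{len(results)} rows with columns: {columns}"


print(summarize([]))

# A made-up row, shaped like the query results in this notebook.
print(summarize([{"complaint_type": "Sewer", "count": "10"}]))
```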


### Searching the data set for the word "flood" in the descriptor field

In [None]:
# Source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# Data set id for NYC 311 on NYC Open Data on Socrata
socrata_dataset_identifier = 'erm2-nwe9'

# Socrata - The main class that interacts with the SODA API. We pass the source domain value
# of NYC Open data, the app token as 'None', and set the timeout parameter for '1,000 seconds'
client = Socrata(socrata_domain, None, timeout=1000)

# SoQL query string below:
# Select complaint_type and count of complaint_type grouped by complaint_type,
# where the word "flood" is in descriptor,
# sort count in descending order and limit our records to 1,000.

query = """
SELECT 
    complaint_type, 
    count(complaint_type) AS count
WHERE 
    LOWER(descriptor) LIKE '%flood%'
GROUP BY 
    complaint_type
ORDER BY 
    count DESC
LIMIT 
    1000
"""

# Returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get(socrata_dataset_identifier, query=query)

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)

print('shape of data:', results_df.shape)
results_df



### Searching the data set where complaint_type field = 'Sewer'

In [None]:
# Source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# Data set id for NYC 311 on NYC Open Data on Socrata
socrata_dataset_identifier = 'erm2-nwe9'

# Socrata - The main class that interacts with the SODA API. We pass the source domain value
# of NYC Open data, the app token as 'None', and set the timeout parameter for '1,000 seconds'
client = Socrata(socrata_domain, None, timeout=1000)

# SoQL query string below:
# Select descriptor and count of descriptor grouped by descriptor,
# where complaint_type = 'Sewer',
# sort count in descending order and limit our records to 1,000.

query = """
SELECT 
    descriptor, 
    count(descriptor)
WHERE 
    complaint_type='Sewer'
GROUP BY 
    descriptor
ORDER BY 
    count(descriptor) DESC
LIMIT 1000
"""

# First 1000 results, returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get(socrata_dataset_identifier, query=query)

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)

print('shape of data:', results_df.shape)
results_df.head(len(results_df))

### Searching the data set where the word "flood" is in the descriptor field

In [None]:
# Source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# Data set id for NYC 311 on NYC Open Data on Socrata
socrata_dataset_identifier = 'erm2-nwe9'

# Socrata - The main class that interacts with the SODA API. We pass the source domain value
# of NYC Open data, the app token as 'None', and set the timeout parameter for '1,000 seconds'
client = Socrata(socrata_domain, None, timeout=1000)

# SoQL query string below:
# Select descriptor and count of descriptor grouped by descriptor,
# where the word "flood" is in descriptor,
# sort count in descending order and limit our records to 1,000.

query = """
SELECT 
    descriptor, 
    count(descriptor) AS count
WHERE 
    LOWER(descriptor) LIKE '%flood%'
GROUP BY 
    descriptor
ORDER BY 
    count(descriptor) DESC
LIMIT 
    1000
"""

# Returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get(socrata_dataset_identifier, query=query)

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)

print('shape of data:', results_df.shape)
results_df

### Displaying the highest number of street flooding complaints by day

In [None]:
# Source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# Data set id for NYC 311 on NYC Open Data on Socrata
socrata_dataset_identifier = 'erm2-nwe9'

# Socrata - The main class that interacts with the SODA API. We pass the source domain value
# of NYC Open data, the app token as 'None', and set the timeout parameter for '1,000 seconds'
client = Socrata(socrata_domain, None, timeout=1000)

# SoQL query string below:
# Select the day and the count day columns grouped by day,
# where the descriptor is Street Flooding (SJ),
# sort the count in descending order and limit our records to 1,000.

query = """
SELECT 
    date_trunc_ymd(created_date) as day, 
    count(created_date) AS count
WHERE 
    descriptor == 'Street Flooding (SJ)'
GROUP BY 
    day
ORDER BY 
    count DESC
LIMIT 
    1000
"""

# Returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get(socrata_dataset_identifier, query=query)

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)

print('shape of data:', results_df.shape)

results_df.head(10)

### Selecting all columns for the most recent rows where the descriptor field = 'Street Flooding (SJ)'

In [None]:
# Source domain for NYC Open Data on Socrata
socrata_domain = 'data.cityofnewyork.us'

# Data set id for NYC 311 on NYC Open Data on Socrata
socrata_dataset_identifier = 'erm2-nwe9'

# Socrata - The main class that interacts with the SODA API. We pass the source domain value
# of NYC Open data, the app token as 'None', and set the timeout parameter for '1,000 seconds'
client = Socrata(socrata_domain, None, timeout=1000)

# SoQL query string below:
# Select all columns where the descriptor is Street Flooding (SJ),
# sort the created date field in descending order and limit our records to 1,000.

query = """
SELECT 
    *
WHERE 
    descriptor == 'Street Flooding (SJ)'
ORDER BY 
    created_date DESC
LIMIT 
    1000
"""

# Requesting data from the NYC 311 data set
# and passing our query as a full SoQL query string
results = client.get(socrata_dataset_identifier, query=query)

# results is returned as JSON from API and converted to Python list of
# dictionaries by sodapy
print(type(results), 'Returned a list from our request.\n')

# Identifying type of first element of our results list
print(type(results[0]), 'However, request is actually a list of dictionaries.\n')

# Convert list of dictionaries to a pandas DataFrame
results_df = pd.DataFrame.from_records(results)

print(type(results_df), 'Convert list of dictionaries to DataFrame.')
print('Rows and columns of data:', results_df.shape)

# Writing out sample data as a csv
results_df.to_csv('sample_data_street_flooding.csv', index=False)

# Previewing the first five rows of our DataFrame
results_df.head()
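Each query in this section stops at 1,000 rows. To collect every matching row, one option is to request pages with increasing offsets until a short page comes back; `limit` and `offset` are SoQL parameters that `get` accepts as keyword arguments. The sketch below uses a stand-in fetch function so it runs without network access. `fetch_all` and `fake_fetch` are hypothetical names.

```python
def fetch_all(fetch_page, page_size=1000):
    """Collect rows page by page until a page comes back short."""
    rows = []
    offset = 0
    while True:
        page = fetch_page(limit=page_size, offset=offset)
        rows.extend(page)
        if len(page) < page_size:
            return rows
        offset += page_size


# Stand-in for the API: 2,500 fake rows served in slices.
data = list(range(2500))


def fake_fetch(limit, offset):
    return data[offset:offset + limit]


all_rows = fetch_all(fake_fetch)
print(len(all_rows))

# With a live client, fetch_page could be:
# lambda limit, offset: client.get(
#     socrata_dataset_identifier,
#     where="descriptor = 'Street Flooding (SJ)'",
#     order="created_date DESC",
#     limit=limit,
#     offset=offset,
# )
```

Paging this way relies on a stable sort order, which is why the commented example keeps the `order` clause.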

## Finding the Most Downloaded Data Sets on NYC Open Data

In [None]:
type(client)

In [None]:
type(client.datasets())

In [None]:
type(client.datasets()[0])

In [None]:
len(client.datasets())

In [None]:
# Reading in a list of dictionaries of our data into a pandas DataFrame
df = pd.DataFrame.from_records(client.datasets())

df.head()

In [None]:
# Only saving the dictionary in the resource column
df = df.resource

# Reading the dictionary in the resource column into a pandas DataFrame
df = pd.DataFrame.from_records(df)

df.head()

In [None]:
len(df)

In [None]:
# Sorting the data sets by download_count
df[['name', 'download_count']].sort_values(by='download_count', ascending=False).head()

In [None]:
highest_downloaded = df[['name', 'download_count']].sort_values(by='download_count', ascending=False)

print('The data set {}'.format(highest_downloaded['name'].iloc[0]), \
     'has {} downloads'.format(f"{highest_downloaded['download_count'].iloc[0]:,.0f}"), \
     'and is the most downloaded data set on NYC Open Data.')
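The same ranking can be done without pandas by sorting the list that `.datasets()` returns. A sketch on a made-up three-item list shaped like that output; the names and download counts are illustrative values, not real figures, and `top_downloads` is a hypothetical helper name.

```python
import heapq

# Made-up entries that mimic the structure of client.datasets()
sample_datasets = [
    {"resource": {"name": "Data set A", "download_count": 120}},
    {"resource": {"name": "Data set B", "download_count": 4500}},
    {"resource": {"name": "Data set C", "download_count": 980}},
]


def top_downloads(datasets, n=2):
    """Return (name, download_count) for the n most downloaded data sets."""
    resources = [dataset["resource"] for dataset in datasets]
    # Treat a missing count (None) as zero so the comparison does not fail.
    top = heapq.nlargest(n, resources, key=lambda r: r["download_count"] or 0)
    return [(r["name"], r["download_count"]) for r in top]


ranking = top_downloads(sample_datasets)
print(ranking)
```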