<a href="https://www.youtube.com/c/ByteSizeDataScience" target="_blank"><IMG SRC="https://github.com/jacquesroy/byte-size-data-science/raw/master/images/Banner.png" ALT="BSDS Banner" WIDTH=100% /></a>

This notebook is provided by the youtube channel Byte Size Data Science.<br/>
Please keep this notice with the notebook.

Click here to access the youtube channel: <a href="https://www.youtube.com/c/ByteSizeDataScience" target="_blank">Byte Size Data Science</a>

# Open Data: Socrata API
### Accessing the open data portal with Socrata
We use the Socrata API to access a large catalog of data.<br/>
Socrata documentation: https://dev.socrata.com/

The production API endpoints for the public version of this API are at https://api.us.socrata.com/api/catalog/v1 for domains in North America<br/> 
and https://api.eu.socrata.com/api/catalog/v1 for all other domains.
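
As a small illustration of the two endpoints listed above, the sketch below keeps both base URLs in one place and builds a catalog URL from a region key. The helper name `catalog_endpoint` and the `"us"`/`"eu"` keys are assumptions made for this example; they are not part of the Socrata API.

```python
# Base URLs copied from the text above; the dictionary keys are illustrative.
CATALOG_ENDPOINTS = {
    "us": "https://api.us.socrata.com/api/catalog/v1",
    "eu": "https://api.eu.socrata.com/api/catalog/v1",
}


def catalog_endpoint(region="us", path=""):
    """Return the catalog URL for a region, optionally with a sub-path such as 'domains'."""
    base = CATALOG_ENDPOINTS[region.lower()]
    if not path:
        return base
    return base + "/" + path.strip("/")


print(catalog_endpoint("us", "domains"))
print(catalog_endpoint("eu"))
```

An unknown region raises `KeyError`, which is usually preferable to silently querying the wrong catalog.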

### Other data sources
Outside of the Socrata API, you can also find other public data in locations like:
- https://www.data.gov and https://catalog.data.gov
- https://www.census.gov

In [1]:
# Libraries needed in the notebook
import urllib3, requests, json
import pandas as pd
import numpy as np

# Uncomment to show full column contents (None replaces the deprecated -1)
# pd.set_option('display.max_colwidth', None)

## Socrata Domains
You can search the two Socrata catalogs to find out how many assets are in each domain.<br/>
In our case, we care only about datasets, so we limit the count by domain to datasets.

We start with the US catalog that includes Canada and Mexico amongst others.

In [2]:
# Get dataset count by domain
# The default limit is 100, maximum 10,000
cnt_by_doms="https://api.us.socrata.com/api/catalog/v1/domains?only=dataset&limit=10000"
response = requests.get(cnt_by_doms)
if (response.status_code != 200) :
    print(response.status_code)
jsondoc = json.loads(response.text)
# pd.json_normalize replaces the deprecated pd.io.json.json_normalize (pandas >= 1.0)
cnt_by_doms_df = pd.json_normalize(jsondoc['results'])
print('Number of domains: ' + str(cnt_by_doms_df.shape[0]))
print('Total number of datasets: ' + str(cnt_by_doms_df['count'].sum()))
cnt_by_doms_df.head()

Number of domains: 226
Total number of datasets: 38125


Unnamed: 0,count,domain
0,2,data.permits.performance.gov
1,30,data.auburnwa.gov
2,81,datapoints.dallascityhall.com
3,537,data.ct.gov
4,75,opengov.cityofdubuque.org
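
The request pattern used above (GET, check the status, normalize `results`) recurs in several cells. One possible way to factor it out is sketched below; the function names `results_to_frame` and `fetch_catalog` are assumptions for this example, and the 30-second timeout is an arbitrary choice. The sample rows are copied from the output above so the parsing step can be checked without network access.

```python
import pandas as pd
import requests


def results_to_frame(payload):
    """Flatten the 'results' list of a catalog response into a DataFrame."""
    return pd.json_normalize(payload.get("results", []))


def fetch_catalog(url, **params):
    """GET a catalog endpoint and raise on an error status instead of only printing it."""
    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()
    return results_to_frame(response.json())


# Offline check with two rows copied from the output above
sample = {"results": [{"domain": "data.auburnwa.gov", "count": 30},
                      {"domain": "data.ct.gov", "count": 537}]}
sample_df = results_to_frame(sample)
print(sample_df["count"].sum())
```

With network access, a call such as `fetch_catalog("https://api.us.socrata.com/api/catalog/v1/domains", only="dataset", limit=10000)` should be equivalent to the cell above, with the difference that a failed request stops execution.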


## European catalog
The European catalog has fewer domains.

In [3]:
# Get dataset count by domain
eu_cnt_by_doms="https://api.eu.socrata.com/api/catalog/v1/domains?only=dataset&limit=10000"
response = requests.get(eu_cnt_by_doms)
if (response.status_code != 200) :
    print(response.status_code)
jsondoc = json.loads(response.text)
# pd.json_normalize replaces the deprecated pd.io.json.json_normalize (pandas >= 1.0)
eu_cnt_by_doms_df = pd.json_normalize(jsondoc['results'])
print('Number of domains: ' + str(eu_cnt_by_doms_df.shape[0]))
print('Total number of datasets: ' + str(eu_cnt_by_doms_df['count'].sum()))
eu_cnt_by_doms_df.head()

Number of domains: 18
Total number of datasets: 4979


Unnamed: 0,count,domain
0,38,cohesiondata.ec.europa.eu
1,55,opendata.granollers.cat
2,81,analisis.datosabiertos.jcyl.es
3,77,data.theaudienceagency.org
4,25,opendata.rubi.cat
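
To compare the two catalogs side by side, the per-domain counts can be stacked into one frame with a column naming the catalog. The sketch below uses small stand-in frames built from rows shown in the outputs above; in the notebook, `cnt_by_doms_df` and `eu_cnt_by_doms_df` would take their place. The column name `catalog` is an assumption for this example.

```python
import pandas as pd

# Stand-ins for cnt_by_doms_df and eu_cnt_by_doms_df (rows copied from the outputs above)
us_df = pd.DataFrame({
    "count": [2, 30, 537],
    "domain": ["data.permits.performance.gov", "data.auburnwa.gov", "data.ct.gov"],
})
eu_df = pd.DataFrame({
    "count": [38, 55],
    "domain": ["cohesiondata.ec.europa.eu", "opendata.granollers.cat"],
})

# Tag each row with its catalog, then stack the two frames
all_doms_df = pd.concat(
    [us_df.assign(catalog="us"), eu_df.assign(catalog="eu")],
    ignore_index=True,
)

# Number of domains ("size") and number of datasets ("sum") per catalog
per_catalog = all_doms_df.groupby("catalog")["count"].agg(["size", "sum"])
print(per_catalog)

# Largest domains across both catalogs
top3 = all_doms_df.nlargest(3, "count")
print(top3)
```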


### List all the domains from the US catalog

In [4]:
cnt_by_doms_df.domain.sort_values().values

array(['agtransport.usda.gov', 'amopen.amo.on.ca',
       'apd-data.albanyny.gov', 'bis.data.commerce.gov',
       'bronx.lehman.cuny.edu', 'bythenumbers.sco.ca.gov',
       'census.data.commerce.gov', 'chronicdata.cdc.gov',
       'cipdata.cityofnovi.org', 'controllerdata.lacity.org',
       'corstat.coronaca.gov', 'daisi.datacenterresearch.org',
       'dashboard.edmonton.ca', 'dashboard.hawaii.gov',
       'dashboard.plano.gov', 'dashboard.udot.utah.gov',
       'dashboarddata.caloes.ca.gov', 'data.acgov.org',
       'data.albanyny.gov', 'data.auburnwa.gov', 'data.austintexas.gov',
       'data.baltimorecity.gov', 'data.bayareametro.gov', 'data.brla.gov',
       'data.buffalony.gov', 'data.calgary.ca', 'data.cambridgema.gov',
       'data.cdc.gov', 'data.cincinnati-oh.gov',
       'data.cityofberkeley.info', 'data.cityofchicago.org',
       'data.cityofevanston.org', 'data.cityofgainesville.org',
       'data.cityofgp.com', 'data.cityofnewyork.us',
       'data.cityoforlando.net', '

### Count by tags

In [5]:
# Get dataset count by tags
cnt_by_tags="https://api.us.socrata.com/api/catalog/v1/domain_tags?only=dataset&limit=10000"
response = requests.get(cnt_by_tags)
if (response.status_code != 200) :
    print(response.status_code)
jsondoc = json.loads(response.text)
# pd.json_normalize replaces the deprecated pd.io.json.json_normalize (pandas >= 1.0)
cnt_by_tags_df = pd.json_normalize(jsondoc['results'])
print('Number of tags: ' + str(cnt_by_tags_df.shape[0]))
print('Total number of datasets: ' + str(cnt_by_tags_df['count'].sum()))
cnt_by_tags_df.head()

Number of tags: 10000
Total number of datasets: 118824


Unnamed: 0,count,domain_tag
0,634,educación
1,621,icfes
2,530,saber359
3,437,education
4,431,student


In [6]:
cnt_by_tags_df[0:100]['domain_tag'].sort_values().values

array(['2013', '2014', '2015', '2016', '2017', '2018', '2019',
       'agriculture', 'assessment', 'budget', 'building', 'buildings',
       'business', 'cali', 'census', 'children', 'city of topeka',
       'colorado', 'community', 'construction', 'county', 'cps', 'crime',
       'data book', 'databook', 'delito', 'departamento', 'development',
       'dfps', 'dijin', 'district', 'doe', 'driver feedback sign',
       'educacion', 'educación', 'education', 'elections', 'employment',
       'energy', 'enrollment', 'establecimiento', 'estudiantes',
       'expenditures', 'facility', 'finance', 'fire', 'gender',
       'gocodecolorado', 'health', 'hospital', 'housing', 'icfes',
       'income', 'información', 'insight community', 'jobs', 'k-12',
       'lifelong learning', 'measure k', 'mpup', 'municipio', 'nedss',
       'netss', 'nndss', 'nutrition', 'oidb', 'operations', 'ospi',
       'parks', 'permit', 'permits', 'población', 'police',
       'policía nacional', 'population', 'povert
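
The tag list contains near-duplicates such as `educacion` and `educación`. One way to group such variants is to fold case and strip accents before counting. This is a sketch, not something the Socrata API does for you: the helper names are assumptions, and the count used for `educacion` is a made-up placeholder (the other two counts come from the table above).

```python
import unicodedata
from collections import defaultdict


def fold_tag(tag):
    """Lower-case a tag and strip accents, so 'educación' and 'educacion' compare equal."""
    decomposed = unicodedata.normalize("NFKD", tag.strip().lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def merge_tag_counts(rows):
    """Sum the counts of (tag, count) pairs whose folded form is identical."""
    merged = defaultdict(int)
    for tag, count in rows:
        merged[fold_tag(tag)] += count
    return dict(merged)


# The count for 'educacion' is a placeholder for illustration only
rows = [("educación", 634), ("educacion", 10), ("education", 437)]
print(merge_tag_counts(rows))
```

Summed counts are an upper bound: a dataset tagged with both spellings would be counted twice.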

### Count by categories

In [7]:
# Get dataset count by categories
cnt_by_cats="https://api.us.socrata.com/api/catalog/v1/domain_categories?only=dataset&limit=10000"
response = requests.get(cnt_by_cats)
if (response.status_code != 200) :
    print(response.status_code)
jsondoc = json.loads(response.text)
# pd.json_normalize replaces the deprecated pd.io.json.json_normalize (pandas >= 1.0)
cnt_by_cats_df = pd.json_normalize(jsondoc['results'])
print('Number of categories: ' + str(cnt_by_cats_df.shape[0]))
print('Total number of assets: ' + str(cnt_by_cats_df['count'].sum()))
cnt_by_cats_df.head()

Number of categories: 843
Total number of assets: 32791


Unnamed: 0,count,domain_category
0,1972,Educación
1,1735,Education
2,1346,Salud y Protección Social
3,1143,Government
4,974,Health


In [8]:
cnt_by_cats_df[0:100]['domain_category'].sort_values().values

array(['A Livable and Sustainable City', 'A Prosperous City',
       'A Well Run City', 'Active Specifications',
       'Administration & Finance', 'Administrative',
       'Agricultura y Desarrollo Rural', 'Agriculture',
       'Ambiente y Desarrollo Sostenible', 'Budget', 'Business',
       'Business & Economy', 'Business and Economy', 'Census',
       'Ciencia, Tecnología e Innovación', 'City Administration',
       'City Business', 'City Government', 'City Management and Ethics',
       'City Services', 'Comercio, Industria y Turismo', 'Community',
       'Community Services', 'County Government', 'Courts', 'Cultura',
       'Dashboard', 'Demographic Profiles', 'Demographics',
       'Deporte y Recreación', 'Economic Development', 'Economy',
       'Economía y Finanzas', 'Educación', 'Education',
       'Energy & Environment', 'Energy and Environment', 'Environment',
       'Environment and Energy', 'Equity Indicators',
       'Estadísticas Nacionales', 'Finance', 'Finance & Admini

## Searching for specific types of datasets
We can do some searches for specific subjects. For example, `Transportation`

In [9]:
# Get the datasets in the transportation category
transport_cats="https://api.us.socrata.com/api/catalog/v1?categories=transportation&only=dataset&limit=10000"
response = requests.get(transport_cats)
if (response.status_code != 200) :
    print(response.status_code)
jsondoc = json.loads(response.text)
#print(jsondoc)
# pd.json_normalize replaces the deprecated pd.io.json.json_normalize (pandas >= 1.0)
transport_cats_df = pd.json_normalize(jsondoc['results'])
print('Number of datasets: ' + str(transport_cats_df.shape[0]))
transport_cats_df.head()

Number of datasets: 1667


Unnamed: 0,classification.categories,classification.domain_category,classification.domain_metadata,classification.domain_tags,classification.tags,link,metadata.domain,metadata.license,owner.display_name,owner.id,...,resource.page_views.page_views_last_month,resource.page_views.page_views_last_month_log,resource.page_views.page_views_last_week,resource.page_views.page_views_last_week_log,resource.page_views.page_views_total,resource.page_views.page_views_total_log,resource.parent_fxf,resource.provenance,resource.type,resource.updatedAt
0,"[finance, transportation, infrastructure]",,[],"[cta, public transit, ridership]",[],https://data.cityofevanston.org/dataset/CTA-Ri...,data.cityofevanston.org,,Hillary Beata,x6gn-3tzg,...,0,0.0,0,0.0,5,2.584963,,official,dataset,2019-05-29T19:48:57.000Z
1,"[transportation, economy]",Dept of Workforce Services,"[{'key': 'Periodicity_StateAgency', 'value': '...",[safety activity community and housing program...,[],https://opendata.utah.gov/Dept-of-Workforce-Se...,opendata.utah.gov,,Utah Open Data Portal Kung-Fu Master,rhju-e3ii,...,0,0.0,0,0.0,35,5.169925,,official,dataset,2019-02-11T21:18:31.000Z
2,"[transportation, economy]",Dept of Workforce Services,"[{'key': 'Periodicity_StateAgency', 'value': '...","[entertainment, arts, social assistance, nursi...",[],https://opendata.utah.gov/Dept-of-Workforce-Se...,opendata.utah.gov,Public Domain,Utah Open Data Portal Kung-Fu Master,rhju-e3ii,...,0,0.0,0,0.0,26,4.754888,,official,dataset,2019-02-11T21:17:58.000Z
3,[transportation],My Neighborhood,[],"[lots, garage, parking]",[],https://bronx.lehman.cuny.edu/My-Neighborhood/...,bronx.lehman.cuny.edu,,cpbride,v4c9-bc9b,...,0,0.0,0,0.0,3853,11.912141,,official,dataset,2019-04-18T20:33:01.000Z
4,"[transportation, economy]",Business & Economy,"[{'key': 'Dataset-Summary_Data-Frequency', 'va...",[safety activity community and housing program...,[],https://opendata.utah.gov/Business-Economy/Uta...,opendata.utah.gov,Public Domain,Utah Open Data Portal Kung-Fu Master,rhju-e3ii,...,0,0.0,0,0.0,47,5.584963,,official,dataset,2019-02-11T21:18:30.000Z
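
Category names can contain spaces or `&` (for example `housing & development` in the results), so it may be safer to build the search URL with `urllib.parse.urlencode` than by hand. The sketch below uses only the query parameters that already appear in this notebook (`categories`, `only`, `limit`); the function name `build_search_url` is an assumption.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

CATALOG_URL = "https://api.us.socrata.com/api/catalog/v1"


def build_search_url(category, only="dataset", limit=10000):
    """Return a catalog search URL with properly escaped query parameters."""
    query = urlencode({"categories": category, "only": only, "limit": limit})
    return CATALOG_URL + "?" + query


url = build_search_url("housing & development")
print(url)

# Decoding the query string recovers the original values
print(parse_qs(urlsplit(url).query))
```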


## Drill down further
First convert `resource.updatedAt` to a datetime so we can select more current datasets.

In [10]:
# infer_datetime_format is deprecated in pandas 2.0; ISO 8601 strings parse without it
transport_cats_df['resource.updatedAt'] = pd.to_datetime(transport_cats_df['resource.updatedAt'])
# Show the most recent datetime
transport_cats_df['resource.updatedAt'].max()

Timestamp('2019-09-18 16:20:39+0000', tz='UTC')

In [11]:
from datetime import datetime, timedelta

current_m30 = datetime.now() - timedelta(days=30)
df2019 = transport_cats_df[transport_cats_df['resource.updatedAt'].dt.date >= current_m30.date()]
print('Number of datasets: ' + str(df2019.shape[0]))
df2019.head()

Number of datasets: 504


Unnamed: 0,classification.categories,classification.domain_category,classification.domain_metadata,classification.domain_tags,classification.tags,link,metadata.domain,metadata.license,owner.display_name,owner.id,...,resource.page_views.page_views_last_month,resource.page_views.page_views_last_month_log,resource.page_views.page_views_last_week,resource.page_views.page_views_last_week_log,resource.page_views.page_views_total,resource.page_views.page_views_total_log,resource.parent_fxf,resource.provenance,resource.type,resource.updatedAt
21,[transportation],Community Improvement,[],"[strategic plan, community improvement, diseas...",[],https://healthstat.dph.sbcounty.gov/Community-...,healthstat.dph.sbcounty.gov,,Pavneet Kaur,qhzd-iv6b,...,3,2.0,0,0.0,5,2.584963,,official,dataset,2019-08-27 21:48:05+00:00
27,"[housing & development, transportation]",Government,[],"[el dorado county, municipal debt, public debt...",[],https://data.debtwatch.treasurer.ca.gov/Govern...,data.debtwatch.treasurer.ca.gov,,Devinder Kumar,3qdj-b5hf,...,0,0.0,0,0.0,287,8.169925,,official,dataset,2019-08-23 21:57:49+00:00
34,[transportation],Economy,[],[],[],https://data.michigan.gov/Economy/Payroll-Jobs...,data.michigan.gov,,LMISI,29vf-2472,...,2,1.584963,0,0.0,220,7.787903,,official,dataset,2019-08-21 15:22:27+00:00
35,[transportation],Economy,[],[],[],https://data.michigan.gov/Economy/Payroll-Jobs...,data.michigan.gov,,LMISI,29vf-2472,...,2,1.584963,0,0.0,227,7.83289,,official,dataset,2019-08-21 15:26:33+00:00
40,"[transportation, infrastructure]",Transportation,[{'key': 'Department-Metrics_Publishing-Depart...,[shuttles],[],https://data.sfgov.org/Transportation/Commuter...,data.sfgov.org,Open Data Commons Public Domain Dedication and...,OpenData,dbag-6qd9,...,11,3.584963,4,2.321928,48,5.61471,,official,dataset,2019-09-06 03:15:54+00:00
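
The 30-day filter can also be written against timezone-aware timestamps, which avoids mixing a local `datetime.now()` with UTC values. The helper below is a sketch: the name `updated_since` and the `now` parameter (added so the result is reproducible) are assumptions. Unlike the cell above, it compares full timestamps rather than calendar dates, so results can differ at the day boundary.

```python
import pandas as pd


def updated_since(df, days, now=None, column="resource.updatedAt"):
    """Keep rows whose timestamp column falls within the last `days` days (UTC)."""
    now = pd.Timestamp.now(tz="UTC") if now is None else now
    cutoff = now - pd.Timedelta(days=days)
    stamps = pd.to_datetime(df[column], utc=True)
    return df[stamps >= cutoff]


# Timestamps copied from the outputs above
demo = pd.DataFrame({"resource.updatedAt": [
    "2019-05-29T19:48:57.000Z",
    "2019-08-27T21:48:05.000Z",
    "2019-09-06T03:15:54.000Z",
]})
recent = updated_since(demo, days=30, now=pd.Timestamp("2019-09-18", tz="UTC"))
print(len(recent))
```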


In [12]:
result = df2019['classification.domain_category'].unique()
result = result[~pd.isnull(result)]
print("Number of domain categories: " + str(result.shape[0]))
np.sort(result)

Number of domain categories: 114


array(['A Livable and Sustainable City', 'A Safe City', 'A Well Run City',
       'Active Specifications', 'Administration & Finance',
       'Administrative', 'Assets & Infrastructure', 'Automobiles',
       'Barge', 'Base Maps', 'Better Future', 'Budget and Finances',
       'Business', 'Business and Economy', 'Business and Financial',
       'Ciencia, Tecnología e Innovación', 'City Administration',
       'City Business', 'City Government', 'City Infrastructure',
       'City Management and Ethics', 'City Planning',
       'Community & Economic Development', 'Community Improvement',
       'Community Model', 'Community Services', 'Consumer/Housing',
       'County Government', 'County Operations', 'Culture and Recreation',
       'Disability Insurance', 'Drug Pricing and Payment',
       'Early Education', 'Economic Development', 'Economic Growth',
       'Economy', 'Economía y Finanzas', 'Educación', 'Education',
       'Energy & Environment', 'Energy and Environment', 'Environmen

In [13]:
result = df2019[df2019['classification.domain_category'].notna() &
                df2019['classification.domain_category'].str.contains('ransport')]
# Each row is a dataset, so this counts datasets rather than categories
print("Number of datasets: " + str(result.shape[0]))
result.head()

Number of datasets: 147


Unnamed: 0,classification.categories,classification.domain_category,classification.domain_metadata,classification.domain_tags,classification.tags,link,metadata.domain,metadata.license,owner.display_name,owner.id,...,resource.page_views.page_views_last_month,resource.page_views.page_views_last_month_log,resource.page_views.page_views_last_week,resource.page_views.page_views_last_week_log,resource.page_views.page_views_total,resource.page_views.page_views_total_log,resource.parent_fxf,resource.provenance,resource.type,resource.updatedAt
40,"[transportation, infrastructure]",Transportation,[{'key': 'Department-Metrics_Publishing-Depart...,[shuttles],[],https://data.sfgov.org/Transportation/Commuter...,data.sfgov.org,Open Data Commons Public Domain Dedication and...,OpenData,dbag-6qd9,...,11,3.584963,4,2.321928,48,5.61471,,official,dataset,2019-09-06 03:15:54+00:00
59,"[transportation, environment, infrastructure]",Transportation,[{'key': 'Department-Metrics_Publishing-Depart...,"[parking, parking regulations, time limits, rp...",[],https://data.sfgov.org/Transportation/Parking-...,data.sfgov.org,Open Data Commons Public Domain Dedication and...,OpenData,dbag-6qd9,...,26,4.754888,6,2.807355,93,6.554589,,official,dataset,2019-09-16 17:01:27+00:00
68,"[infrastructure, transportation]",Transportation,[{'key': 'Metadata_Last-Updated-Date-via-Autom...,"[streets, permits, moratoriums]",[],https://data.cityofchicago.org/Transportation/...,data.cityofchicago.org,,cocadmin,scy9-9wg4,...,416,8.703904,112,6.820179,18911,14.207014,,official,dataset,2019-09-18 06:00:22+00:00
86,"[transportation, public safety]",Transportation,"[{'key': 'Metadata_Time-Period', 'value': 'Cur...","[transportation, red light cameras, traffic]",[],https://data.cityofchicago.org/Transportation/...,data.cityofchicago.org,,cocadmin,scy9-9wg4,...,196,7.622052,42,5.426265,6222,12.603395,,official,dataset,2019-09-18 10:27:18+00:00
89,"[transportation, infrastructure, housing & dev...",Transportation Infrastructure,"[{'key': 'Time-Frame_Date-Made-Public', 'value...","[on-street construction, on-street maintenance...",[],https://data.edmonton.ca/Transportation-Infras...,data.edmonton.ca,See Terms of Use,opendata@edmonton.ca,dhgx-s4x9,...,25,4.70044,5,2.584963,761,9.573647,,official,dataset,2019-09-18 10:08:39+00:00
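
The `'ransport'` substring is a workaround for matching both `Transportation` and `transportation`, and the `notna()` test guards against missing categories. Both can be expressed through the `case` and `na` arguments of `Series.str.contains`, as the sketch below shows on a small illustrative series (the lower-case entry is added for the demonstration).

```python
import pandas as pd

# Illustrative values; None stands for a dataset without a domain category
cats = pd.Series([
    "Transportation",
    None,
    "Transportation Infrastructure",
    "Economy",
    "transportation",
])

# case=False ignores capitalization, na=False treats missing values as non-matches
mask = cats.str.contains("transport", case=False, na=False)
print(mask.tolist())

# A case-sensitive search misses the lower-case entry
strict = cats.str.contains("Transport", na=False)
print(strict.tolist())
```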


In [14]:
print("Number of unique domains: " + str(result['metadata.domain'].unique().shape[0]))
result['metadata.domain'].unique()

Number of unique domains: 23


array(['data.sfgov.org', 'data.cityofchicago.org', 'data.edmonton.ca',
       'data.cityofnewyork.us', 'data.ny.gov', 'data.brla.gov',
       'data.seattle.gov', 'data.austintexas.gov', 'data.cambridgema.gov',
       'data.melbourne.vic.gov.au', 'data.oregon.gov',
       'opendata.maryland.gov', 'data.calgary.ca', 'data.nashville.gov',
       'www.data.act.gov.au', 'data.ct.gov',
       'data.montgomerycountymd.gov', 'agtransport.usda.gov',
       'data.smgov.net', 'data.buffalony.gov', 'data.everettwa.gov',
       'www.datos.gov.co', 'data.baltimorecity.gov'], dtype=object)

In [15]:
result[result['metadata.domain'] == 'data.cityofchicago.org'][[
        'classification.categories', 'classification.domain_category',
        'classification.domain_metadata', 'classification.domain_tags',
        'link', 'metadata.domain', 'resource.updatedAt'
       ]]

Unnamed: 0,classification.categories,classification.domain_category,classification.domain_metadata,classification.domain_tags,link,metadata.domain,resource.updatedAt
68,"[infrastructure, transportation]",Transportation,[{'key': 'Metadata_Last-Updated-Date-via-Autom...,"[streets, permits, moratoriums]",https://data.cityofchicago.org/Transportation/...,data.cityofchicago.org,2019-09-18 06:00:22+00:00
86,"[transportation, public safety]",Transportation,"[{'key': 'Metadata_Time-Period', 'value': 'Cur...","[transportation, red light cameras, traffic]",https://data.cityofchicago.org/Transportation/...,data.cityofchicago.org,2019-09-18 10:27:18+00:00
115,[transportation],Transportation,"[{'key': 'Metadata_Time-Period', 'value': 'Cur...","[parking restrictions, parking]",https://data.cityofchicago.org/Transportation/...,data.cityofchicago.org,2019-09-18 05:33:15+00:00
129,[transportation],Transportation,"[{'key': 'Metadata_Time-Period', 'value': 'Rea...","[traffic, sustainability]",https://data.cityofchicago.org/Transportation/...,data.cityofchicago.org,2019-09-18 16:20:10+00:00
533,"[transportation, health, infrastructure]",Transportation,"[{'key': 'Metadata_Time-Period', 'value': '201...","[transportation, public safety, vision zero]",https://data.cityofchicago.org/Transportation/...,data.cityofchicago.org,2019-09-18 15:09:11+00:00
672,"[transportation, demographics]",Transportation,"[{'key': 'Metadata_Time-Period', 'value': '201...","[taxis, transportation]",https://data.cityofchicago.org/Transportation/...,data.cityofchicago.org,2019-09-11 18:36:11+00:00
674,"[public safety, transportation]",Transportation,"[{'key': 'Metadata_Time-Period', 'value': '201...","[transportation, public safety, vision zero]",https://data.cityofchicago.org/Transportation/...,data.cityofchicago.org,2019-09-18 14:54:39+00:00
692,[transportation],Transportation,"[{'key': 'Metadata_Time-Period', 'value': 'Mar...","[traffic, sustainability, historical]",https://data.cityofchicago.org/Transportation/...,data.cityofchicago.org,2019-09-18 16:15:19+00:00
693,[transportation],Transportation,"[{'key': 'Metadata_Time-Period', 'value': 'Rea...","[traffic, sustainability]",https://data.cityofchicago.org/Transportation/...,data.cityofchicago.org,2019-09-18 16:18:40+00:00
1064,"[transportation, economy, public safety]",Transportation,"[{'key': 'Metadata_Time-Period', 'value': '201...","[permits, transportation, streets]",https://data.cityofchicago.org/Transportation/...,data.cityofchicago.org,2019-09-18 06:17:59+00:00


## Querying the City of Chicago
We selected the City of Chicago domain.

Instead of issuing HTTP queries directly, we use sodapy, a Python client for the Socrata API.

In [16]:
# Library used to read datasets
# https://github.com/xmunoz/sodapy
!pip install sodapy
from sodapy import Socrata

Collecting sodapy
  Downloading https://files.pythonhosted.org/packages/ae/e9/99b640c13544f03fc8d169fc99811d834a0d69ba5a69c68a3e891962ae5d/sodapy-1.5.4-py2.py3-none-any.whl
Installing collected packages: sodapy
Successfully installed sodapy-1.5.4


In [17]:
# Unauthenticated client only works with public data sets. Note 'None'
# in place of application token, and no username or password:
client = Socrata("data.cityofchicago.org", None)
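
If you have an application token, one option is to read it from an environment variable rather than hard-coding it. The sketch below assumes a variable named `SOCRATA_APP_TOKEN`; that name and both helper names are choices made for this example, not a sodapy convention. The `timeout` keyword is passed to the `Socrata` constructor.

```python
import os


def read_app_token(env=None, name="SOCRATA_APP_TOKEN"):
    """Return the token from the given mapping (default: os.environ), or None if unset or blank."""
    env = os.environ if env is None else env
    token = env.get(name, "").strip()
    return token or None


def make_client(domain, timeout=30):
    """Create a sodapy client; falls back to an unauthenticated client when no token is set."""
    from sodapy import Socrata  # imported lazily so read_app_token works without sodapy
    return Socrata(domain, read_app_token(), timeout=timeout)


print(read_app_token({"SOCRATA_APP_TOKEN": " abc "}))
print(read_app_token({}))
```

`make_client("data.cityofchicago.org")` would then behave like the cell above when no token is configured.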



In [18]:
# Print the URLs
result[result['metadata.domain'] == 'data.cityofchicago.org']['link'].values

array(['https://data.cityofchicago.org/Transportation/Roadway-Construction-Moratoriums/ndbz-vy4e',
       'https://data.cityofchicago.org/Transportation/Red-Light-Camera-Locations/thvf-6diy',
       'https://data.cityofchicago.org/Transportation/Parking-Permit-Zones/u9xt-hiju',
       'https://data.cityofchicago.org/Transportation/Chicago-Traffic-Tracker-Congestion-Estimates-by-Re/t2qc-9pjd',
       'https://data.cityofchicago.org/Transportation/Traffic-Crashes-Crashes/85ca-t3if',
       'https://data.cityofchicago.org/Transportation/Taxi-Trips/wrvz-psew',
       'https://data.cityofchicago.org/Transportation/Traffic-Crashes-Vehicles/68nd-jvt3',
       'https://data.cityofchicago.org/Transportation/Chicago-Traffic-Tracker-Historical-Congestion-Esti/sxs8-h27x',
       'https://data.cityofchicago.org/Transportation/Chicago-Traffic-Tracker-Congestion-Estimates-by-Se/n4j6-wkkf',
       'https://data.cityofchicago.org/Transportation/Transportation-Department-Permits/pubx-yq2d',
       'http

In [19]:
# We start from the results above
# Get some metadata information: id and name
for lnk in result[result['metadata.domain'] == 'data.cityofchicago.org']['link'] : 
    dataset_id = lnk.split('/')[-1]  # avoid shadowing the built-in id()
    meta = client.get_metadata(dataset_id)
    print(meta['id'] + '|' + meta['name'])

ndbz-vy4e|Roadway Construction Moratoriums
thvf-6diy|Red Light Camera Locations
u9xt-hiju|Parking Permit Zones
t2qc-9pjd|Chicago Traffic Tracker - Congestion Estimates by Regions
85ca-t3if|Traffic Crashes - Crashes
wrvz-psew|Taxi Trips
68nd-jvt3|Traffic Crashes - Vehicles
sxs8-h27x|Chicago Traffic Tracker - Historical Congestion Estimates by Segment - 2018-Current
n4j6-wkkf|Chicago Traffic Tracker - Congestion Estimates by Segments
pubx-yq2d|Transportation Department Permits
ygr5-vcbg|Towed Vehicles
u6pd-qa9d|Traffic Crashes - People
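
Taking the last path segment works for the links above, but a truncated or unexpected link would silently produce a wrong identifier. The sketch below adds a check, assuming dataset identifiers always have the `xxxx-xxxx` shape seen in this output (four lower-case letters or digits, a dash, four more). The helper name is hypothetical.

```python
import re

# Assumed identifier shape, based on the ids printed above (e.g. 85ca-t3if)
FOUR_BY_FOUR = re.compile(r"^[a-z0-9]{4}-[a-z0-9]{4}$")


def dataset_id_from_link(link):
    """Return the dataset id at the end of a dataset link, or raise ValueError."""
    candidate = link.rstrip("/").rsplit("/", 1)[-1]
    if not FOUR_BY_FOUR.match(candidate):
        raise ValueError("No dataset id found at the end of: " + link)
    return candidate


print(dataset_id_from_link(
    "https://data.cityofchicago.org/Transportation/Taxi-Trips/wrvz-psew"))
```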


## Get some metadata on the columns

In [20]:
# Get the columns for 85ca-t3if|Traffic Crashes - Crashes
meta = client.get_metadata('85ca-t3if')
total_count = 0
for c in meta['columns'] :
    print(c['fieldName'] + '| type: ' + c['renderTypeName'] + '| not null count: ' + c['cachedContents']['not_null'] )
    if (c['fieldName'] == 'rd_no') :
        total_count = int(c['cachedContents']['not_null'])

rd_no| type: text| not null count: 339953
crash_date_est_i| type: text| not null count: 25474
crash_date| type: calendar_date| not null count: 339953
posted_speed_limit| type: number| not null count: 339953
traffic_control_device| type: text| not null count: 339953
device_condition| type: text| not null count: 339953
weather_condition| type: text| not null count: 339953
lighting_condition| type: text| not null count: 339953
first_crash_type| type: text| not null count: 339953
trafficway_type| type: text| not null count: 339953
lane_cnt| type: number| not null count: 198544
alignment| type: text| not null count: 339953
roadway_surface_cond| type: text| not null count: 339953
road_defect| type: text| not null count: 339953
report_type| type: text| not null count: 332288
crash_type| type: text| not null count: 339953
intersection_related_i| type: text| not null count: 74187
private_property_i| type: text| not null count: 15672
hit_and_run_i| type: text| not null count: 94378
damage| type:
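
The same column metadata can be collected into a DataFrame, which makes it easier to sort or filter. The sketch below assumes the structure used in the cell above (`columns`, `fieldName`, `renderTypeName`, `cachedContents.not_null`) and tolerates columns without `cachedContents`. The sample dictionary is hand-built for illustration: the first two entries mirror the printed output, and `example_col` is hypothetical.

```python
import pandas as pd


def columns_frame(meta):
    """Summarize dataset column metadata as one row per column."""
    rows = []
    for col in meta.get("columns", []):
        cached = col.get("cachedContents") or {}
        not_null = cached.get("not_null")
        rows.append({
            "fieldName": col["fieldName"],
            "type": col.get("renderTypeName"),
            # not_null arrives as a string in the output above, so convert it
            "not_null": int(not_null) if not_null is not None else None,
        })
    return pd.DataFrame(rows)


sample_meta = {"columns": [
    {"fieldName": "rd_no", "renderTypeName": "text",
     "cachedContents": {"not_null": "339953"}},
    {"fieldName": "lane_cnt", "renderTypeName": "number",
     "cachedContents": {"not_null": "198544"}},
    {"fieldName": "example_col", "renderTypeName": "text"},  # hypothetical, no statistics
]}
frame = columns_frame(sample_meta)
print(frame)
```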

## Get some records

In [21]:
first10_df = pd.DataFrame(client.get("85ca-t3if", limit=10))
first10_df.head()

Unnamed: 0,alignment,beat_of_occurrence,crash_date,crash_date_est_i,crash_day_of_week,crash_hour,crash_month,crash_type,damage,date_police_notified,...,report_type,road_defect,roadway_surface_cond,sec_contributory_cause,street_direction,street_name,street_no,traffic_control_device,trafficway_type,weather_condition
0,STRAIGHT AND LEVEL,634,2019-09-18T02:55:00.000,,4,2,9,NO INJURY / DRIVE AWAY,"OVER $1,500",2019-09-18T02:56:00.000,...,ON SCENE,NO DEFECTS,DRY,FAILING TO REDUCE SPEED TO AVOID CRASH,W,87TH ST,125,NO CONTROLS,PARKING LOT,CLEAR
1,STRAIGHT AND LEVEL,321,2019-09-18T01:00:00.000,,4,1,9,NO INJURY / DRIVE AWAY,$500 OR LESS,2019-09-18T01:05:00.000,...,,NO DEFECTS,DRY,NOT APPLICABLE,S,COTTAGE GROVE AVE,7040,NO CONTROLS,PARKING LOT,CLEAR
2,STRAIGHT AND LEVEL,1631,2019-09-18T00:47:00.000,,4,0,9,NO INJURY / DRIVE AWAY,"OVER $1,500",2019-09-18T01:00:00.000,...,ON SCENE,NO DEFECTS,DRY,UNABLE TO DETERMINE,N,ORANGE AVE,3540,NO CONTROLS,ONE-WAY,CLEAR
3,STRAIGHT AND LEVEL,1511,2019-09-18T00:05:00.000,,4,0,9,NO INJURY / DRIVE AWAY,"OVER $1,500",2019-09-18T00:13:00.000,...,ON SCENE,NO DEFECTS,DRY,FAILING TO REDUCE SPEED TO AVOID CRASH,N,CENTRAL AVE,800,TRAFFIC SIGNAL,NOT DIVIDED,CLEAR
4,STRAIGHT AND LEVEL,223,2019-09-17T22:05:00.000,,3,22,9,INJURY AND / OR TOW DUE TO CRASH,"OVER $1,500",2019-09-17T22:05:00.000,...,ON SCENE,NO DEFECTS,DRY,NOT APPLICABLE,E,51ST ST,500,NO CONTROLS,NOT DIVIDED,CLEAR


In [22]:
# Check data types
first10_df.dtypes

alignment                        object
beat_of_occurrence               object
crash_date                       object
crash_date_est_i                 object
crash_day_of_week                object
crash_hour                       object
crash_month                      object
crash_type                       object
damage                           object
date_police_notified             object
device_condition                 object
first_crash_type                 object
hit_and_run_i                    object
injuries_fatal                   object
injuries_incapacitating          object
injuries_no_indication           object
injuries_non_incapacitating      object
injuries_reported_not_evident    object
injuries_total                   object
injuries_unknown                 object
intersection_related_i           object
latitude                         object
lighting_condition               object
location                         object
longitude                        object


In [None]:
# Cast columns to their proper types
first10_df = first10_df.astype({'beat_of_occurrence': 'int32','crash_day_of_week': 'int32',
                   'crash_hour': 'int32','crash_month':'int32',
                   'injuries_fatal':'int32','injuries_incapacitating':'int32',
                   'injuries_no_indication':'int32','injuries_non_incapacitating':'int32',
                    'injuries_reported_not_evident':'int32','injuries_total':'int32',
                    'injuries_unknown':'int32','latitude':'float64','longitude':'float64',
                    'num_units':'int32','posted_speed_limit':'int32',
                    'street_no':'int32'})
first10_df['crash_date'] = pd.to_datetime(first10_df['crash_date'])
first10_df['date_police_notified'] = pd.to_datetime(first10_df['date_police_notified'])
first10_df.head()
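
`astype` raises an error as soon as one value is missing or not parsable, which can happen with columns such as `lane_cnt`. A more forgiving variant, sketched below, uses `pd.to_numeric(errors="coerce")` so that bad values become `NaN`. The helper name is an assumption, and the demo frame reuses a few values from the records shown above with one missing value added.

```python
import pandas as pd


def to_numeric_columns(df, columns):
    """Return a copy where the listed columns (if present) are numeric; bad values become NaN."""
    out = df.copy()
    for col in columns:
        if col in out.columns:
            out[col] = pd.to_numeric(out[col], errors="coerce")
    return out


demo = pd.DataFrame({
    "crash_hour": ["2", "1", None],  # missing value added for illustration
    "street_no": ["125", "7040", "3540"],
    "street_name": ["87TH ST", "COTTAGE GROVE AVE", "ORANGE AVE"],
})
# 'lane_cnt' is absent from the demo frame and is skipped
clean = to_numeric_columns(demo, ["crash_hour", "street_no", "lane_cnt"])
print(clean.dtypes)
```

Columns that contain missing values end up as floats, since `NaN` cannot be stored in a plain integer column.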

In [None]:
# Check data types
first10_df.dtypes

In [None]:
# Last 10 records
last10_df = pd.DataFrame(client.get("85ca-t3if", offset=total_count - 10))
last10_df

## Too many records!
By default, the API returns at most 1,000 records per request. We can set `limit` to a different value.

How do we know whether we have read all the records we wanted? We use `offset` in a loop. Note that the underlying query has a limit of 10,000 records.

In [None]:
years_df = pd.DataFrame(client.get("85ca-t3if", where="crash_date > '2018'", limit=1005))
print("Number of accidents: " + str(years_df.shape[0]))

### We need to loop
We don't know how many records are available, so we loop until we have them all.

In [None]:
# Page through the results 10,000 records at a time until an empty page comes back.
# DataFrame.append was removed in pandas 2.0, so collect the pages and concatenate once.
offset = 0
pages = []
result = client.get("85ca-t3if", where="crash_date > '2019'", offset=offset, limit=10000)
while (len(result) > 0) :
    pages.append(pd.DataFrame(result))
    offset += 10000
    result = client.get("85ca-t3if", where="crash_date > '2019'", offset=offset, limit=10000)
y2019_df = pd.concat(pages, ignore_index=True) if pages else pd.DataFrame()

print("Number of records: " + str(y2019_df.shape[0]))
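
The paging pattern can be separated from the API call by wrapping it in a generator that accepts any `fetch_page(offset, limit)` function. This is a sketch with assumed names (`iter_pages`, `fetch_page`); it stops when a page is empty or shorter than the page size. A list slice stands in for the API so the logic can be checked offline.

```python
def iter_pages(fetch_page, page_size=10000):
    """Yield successive pages from fetch_page(offset, limit) until the data runs out."""
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        if not page:
            return
        yield page
        if len(page) < page_size:
            return  # a short page means there is nothing left to fetch
        offset += page_size


# Offline stand-in for the API: 25 fake records served in pages of 10
records = list(range(25))
calls = []


def fake_fetch(offset, limit):
    calls.append(offset)
    return records[offset:offset + limit]


pages = list(iter_pages(fake_fetch, page_size=10))
print([len(p) for p in pages])
```

With sodapy, `fetch_page` could be something like `lambda offset, limit: client.get("85ca-t3if", where="crash_date > '2019'", order=":id", offset=offset, limit=limit)`. Adding an explicit `order` is an assumption here, made so that pages do not overlap if the server returns rows in a different order between requests.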

In [None]:
y2019_df.head()

## Accessing data from a URL
See the documentation:
- https://dev.socrata.com/docs/endpoints.html
- https://dev.socrata.com/docs/queries/


In [23]:
# Read the first 100 crash records as CSV directly from the resource URL
crashes="https://data.cityofchicago.org/resource/85ca-t3if.csv?$limit=100"
crashes_df = pd.read_csv(crashes, parse_dates=['crash_date'])
crashes_df.head()

Unnamed: 0,rd_no,crash_date_est_i,crash_date,posted_speed_limit,traffic_control_device,device_condition,weather_condition,lighting_condition,first_crash_type,trafficway_type,...,injuries_non_incapacitating,injuries_reported_not_evident,injuries_no_indication,injuries_unknown,crash_hour,crash_day_of_week,crash_month,latitude,longitude,location
0,JC438003,,2019-09-18 02:55:00,30,NO CONTROLS,NO CONTROLS,CLEAR,"DARKNESS, LIGHTED ROAD",OTHER OBJECT,PARKING LOT,...,0,0,2,0,2,4,9,41.736159,-87.627785,POINT (-87.627784584001 41.736159141056)
1,JC437968,,2019-09-18 01:00:00,10,NO CONTROLS,NO CONTROLS,CLEAR,DARKNESS,PARKED MOTOR VEHICLE,PARKING LOT,...,0,0,2,0,1,4,9,41.766431,-87.605748,POINT (-87.605747860649 41.766430892538)
2,JC437962,,2019-09-18 00:47:00,30,NO CONTROLS,NO CONTROLS,CLEAR,"DARKNESS, LIGHTED ROAD",PARKED MOTOR VEHICLE,ONE-WAY,...,0,0,1,0,0,4,9,41.94432,-87.824268,POINT (-87.8242679345 41.944319879827)
3,JC437938,,2019-09-18 00:05:00,30,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,CLEAR,"DARKNESS, LIGHTED ROAD",TURNING,NOT DIVIDED,...,0,0,3,0,0,4,9,41.894897,-87.765543,POINT (-87.765543258459 41.894896579397)
4,JC437840,,2019-09-17 22:05:00,30,NO CONTROLS,NO CONTROLS,CLEAR,"DARKNESS, LIGHTED ROAD",PARKED MOTOR VEHICLE,NOT DIVIDED,...,0,0,1,0,22,3,9,41.802239,-87.613996,POINT (-87.613995727141 41.802238709769)
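
Resource URLs like the one above can be assembled from a domain, a dataset id, and query parameters. The sketch below prefixes each keyword with `$` (as in `$limit`) and URL-encodes the values; `$where` and `$order` are shown as examples of other query parameters described in the documentation linked above. The function name `resource_url` is an assumption.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def resource_url(domain, dataset_id, fmt="csv", **soql):
    """Build https://<domain>/resource/<id>.<fmt> with $-prefixed query parameters."""
    url = "https://{}/resource/{}.{}".format(domain, dataset_id, fmt)
    # safe="$" keeps the dollar sign readable instead of encoding it as %24
    query = urlencode({"$" + key: value for key, value in soql.items()}, safe="$")
    return url + "?" + query if query else url


crashes_url = resource_url("data.cityofchicago.org", "85ca-t3if", limit=100)
print(crashes_url)

filtered_url = resource_url(
    "data.cityofchicago.org", "85ca-t3if",
    where="crash_date > '2019-01-01'", order="crash_date", limit=100,
)
print(filtered_url)
```

The resulting string can be passed to `pd.read_csv` in the same way as the hand-written URL.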


In [24]:
crashes_df.dtypes

rd_no                                    object
crash_date_est_i                         object
crash_date                       datetime64[ns]
posted_speed_limit                        int64
traffic_control_device                   object
device_condition                         object
weather_condition                        object
lighting_condition                       object
first_crash_type                         object
trafficway_type                          object
lane_cnt                                float64
alignment                                object
roadway_surface_cond                     object
road_defect                              object
report_type                              object
crash_type                               object
intersection_related_i                   object
private_property_i                       object
hit_and_run_i                            object
damage                                   object
date_police_notified                    