# The Concept: API Endpoints

In data science, API endpoints usually refer to URLs that return data. They behave a lot like querying a database, but over the internet. 

APIs are useful because they let you access data in real time, automate retrieval in your data pipeline, and control exactly which data you get back.

For example,
* Dataset : [NYPD Complaint Data Current (Year To Date)](https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Current-Year-To-Date-/5uac-w243)
* API endpoint: https://data.cityofnewyork.us/resource/5uac-w243.json 
* **How to access**:
    * via web browser
        * you can use a web extension to format the output into something more readable. I use [JSONView](https://chrome.google.com/webstore/detail/jsonview/chklaanhfefbnpoihckbnefhakgolnmc) in Chrome. 
    * programmatically, e.g. with Python
* **What is returned**:
    * data in a format such as JSON, GeoJSON or CSV, depending on the API
        * Most commonly, JSON is returned. It stands for JavaScript Object Notation.
        * JSON looks a lot like a Python dictionary, in the form `{"attribute": value}`
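
To make the dictionary comparison concrete, here is a minimal sketch that parses a small JSON string with Python's built-in `json` module. The record is a hand-written sample shaped like one row from the endpoint (an assumption for illustration, not a live response).

```python
import json

# Hand-written sample shaped like an endpoint response: a JSON array of objects.
raw = '[{"boro_nm": "QUEENS", "addr_pct_cd": "103", "crm_atpt_cptd_cd": "COMPLETED"}]'

records = json.loads(raw)  # JSON array -> Python list, JSON object -> Python dict
first = records[0]

print(type(records).__name__, type(first).__name__)
print(first["boro_nm"])
```

Once parsed, each record can be read with ordinary dictionary lookups.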

### What you can do with endpoints: querying and filtering

For each dataset, you can specify which data you want right in the URL. This requires knowing the data structure, i.e. which fields/columns are available to filter on. Once you know those, you can query the data by adding filter parameters to the URL.

**Syntax for querying**: `<base URL>?<field><condition><value>`

For example,
* `https://data.cityofnewyork.us/resource/5uac-w243.json?cmplnt_fr_dt=1018-12-16T00:00:00.000`
* `https://soda.demo.socrata.com/resource/4tka-6guv.json?$where=magnitude > 3.0`
    * For advanced querying, use special keywords such as `$where`, as explained [here](https://dev.socrata.com/docs/queries/).
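
To see the `<base URL>?<field><condition><value>` anatomy in practice, the sketch below takes the first example URL apart with Python's standard `urllib.parse` module. Nothing is fetched; the code only splits the string.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://data.cityofnewyork.us/resource/5uac-w243.json?cmplnt_fr_dt=1018-12-16T00:00:00.000"

parts = urlsplit(url)
base_url = f"{parts.scheme}://{parts.netloc}{parts.path}"  # everything before the "?"
filters = parse_qs(parts.query)                            # field name -> list of values

print(base_url)
print(filters)
```

Everything after the `?` is the query, and each `field=value` pair in it acts as a filter.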


**How do I know what I can query from an endpoint?** 
Most (good) APIs come with documentation; that's where you can find out things like which fields exist for a specific endpoint. Open data portals here seem to be built using the Socrata Open Data API (SODA), which has additional support for documentation, giving you useful information such as the [data type](https://dev.socrata.com/docs/datatypes/#,) of each field and the corresponding filter operations you can perform.

For example, 
* **API endpoint**/Base URL: https://data.cityofnewyork.us/resource/5uac-w243.json 
* **API documentation**: https://dev.socrata.com/foundry/data.cityofnewyork.us/5uac-w243
* The field `cmplnt_fr_dt` (complaint from date) is of type `floating_timestamp`. Click on the field's tab in the documentation to see which query options are possible.
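
As a small illustration of why the data type matters: a `floating_timestamp` carries no time zone, so in Python it maps naturally onto a naive `datetime`. The sketch below parses a sample value copied from the dataset. Using `datetime.fromisoformat` here is an assumption about a convenient approach, not something the SODA documentation prescribes.

```python
from datetime import datetime

# Sample cmplnt_fr_dt value, as it appears in the dataset.
sample = "2019-09-30T00:00:00.000"

parsed = datetime.fromisoformat(sample)  # naive datetime: no time zone attached
print(parsed)
print(parsed.tzinfo)
```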



##### Additional resources
Learn more about SODA API endpoints [here](https://dev.socrata.com/docs/endpoints.html). In general, everyone builds their APIs differently, so APIs from different sources are bound to differ. However, they usually follow a similar framework.

# Python/SODA API Cheatsheet

### 1) Accessing endpoints: `pd.read_csv`, `pd.read_json`

```python
# using pandas to read data into dataframes
import pandas as pd

crime_in_parks = pd.read_json("https://data.cityofnewyork.us/resource/5uac-w243.json")
# same as 
# crime_in_parks = pd.read_csv("https://data.cityofnewyork.us/resource/5uac-w243.csv")

crime_in_parks.head()
```

| | cmplnt_num | addr_pct_cd | boro_nm | cmplnt_fr_dt | cmplnt_fr_tm | cmplnt_to_dt | cmplnt_to_tm | crm_atpt_cptd_cd | hadevelopt | housing_psa | ... | susp_sex | transit_district | vic_age_group | vic_race | vic_sex | x_coord_cd | y_coord_cd | latitude | longitude | lat_lon |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 660160752 | 103 | QUEENS | 2019-09-30T00:00:00.000 | 14:00:00 | 2019-09-30T00:00:00.000 | 14:01:00 | COMPLETED | | | ... | M | | 25-44 | UNKNOWN | M | | | | | |
| 1 | 219394661 | 40 | BRONX | 2019-09-30T00:00:00.000 | 13:00:00 | 2019-09-30T00:00:00.000 | 14:00:00 | COMPLETED | | 772.0 | ... | | | UNKNOWN | UNKNOWN | E | | | | | |
| 2 | 184671534 | 18 | MANHATTAN | 2019-09-30T00:00:00.000 | 11:40:00 | 2019-09-30T00:00:00.000 | 11:45:00 | COMPLETED | | | ... | M | | 25-44 | WHITE | F | | | | | |
| 3 | 825219244 | 48 | BRONX | 2019-09-30T00:00:00.000 | 02:00:00 | 2019-09-30T00:00:00.000 | 03:30:00 | COMPLETED | | | ... | U | | 25-44 | WHITE | F | | | | | |
| 4 | 457673231 | 32 | MANHATTAN | 2019-09-29T00:00:00.000 | 20:30:00 | | | COMPLETED | | | ... | U | | UNKNOWN | UNKNOWN | D | | | | | |
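
The `.json` and `.csv` URLs above differ only in their extension, so switching formats can be scripted. A minimal sketch follows; `with_format` is a hypothetical helper name introduced for illustration, and it assumes the endpoint URL ends in a file extension.

```python
def with_format(endpoint: str, fmt: str) -> str:
    """Return the endpoint URL with its file extension replaced by `fmt`."""
    stem, _, _ = endpoint.rpartition(".")  # split at the last dot
    return f"{stem}.{fmt}"

json_url = "https://data.cityofnewyork.us/resource/5uac-w243.json"
csv_url = with_format(json_url, "csv")
print(csv_url)
```

The resulting URL can be passed to `pd.read_csv` in the same way the `.json` URL is passed to `pd.read_json`.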


### 2) Querying: `urllib.parse.urlencode`
Since you're making queries right in the URL, it's good practice to **[URL encode](https://en.wikipedia.org/wiki/Percent-encoding)** them so that they are interpreted correctly. This converts spaces and special characters (e.g. ?$%^&*) into URL-safe strings. In Python, this is easily done with the standard library module `urllib.parse`.

```python
from urllib.parse import urlencode

base_url = "https://data.cityofnewyork.us/resource/5uac-w243.json?" # note the question mark needed for querying
```

```python
# example 1
query_params = urlencode({'$where': 'parks_nm IS NOT NULL'}) # i asked for all records where park names are not empty/NULL
df = pd.read_json(base_url + query_params)
df.head(2)
```

query params look like: "%24where=parks_nm+IS+NOT+NULL"


| | addr_pct_cd | boro_nm | cmplnt_fr_dt | cmplnt_fr_tm | cmplnt_num | cmplnt_to_dt | cmplnt_to_tm | crm_atpt_cptd_cd | hadevelopt | housing_psa | ... | prem_typ_desc | rpt_dt | susp_age_group | susp_race | susp_sex | vic_age_group | vic_race | vic_sex | x_coord_cd | y_coord_cd |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | MANHATTAN | 2019-05-21T00:00:00.000 | 17:00:00 | 264363701 | 2019-05-22T00:00:00.000 | 20:00:00 | COMPLETED | | 645.0 | ... | RESIDENCE - PUBLIC HOUSING | 2019-05-22T00:00:00.000 | UNKNOWN | BLACK | M | <18 | BLACK | F | | |
| 1 | 48 | BRONX | 2019-03-24T00:00:00.000 | 14:25:00 | 883586818 | 2019-03-24T00:00:00.000 | 14:30:00 | COMPLETED | | | ... | PARK/PLAYGROUND | 2019-03-24T00:00:00.000 | <18 | WHITE HISPANIC | F | <18 | BLACK HISPANIC | F | | |


```python
# understanding how urlencoding works
before = {'$where': 'parks_nm IS NOT NULL'}
query_params = urlencode(before)
print(f'{before} ==> URL encoding ==> "{query_params}"')
```

```
{'$where': 'parks_nm IS NOT NULL'} ==> URL encoding ==> "%24where=parks_nm+IS+NOT+NULL"
```
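
Encoding is reversible, and the `+` for spaces is only one of two common styles. The sketch below (standard library only, no request is made) shows the `%20` alternative via `quote_via=quote` and decodes the string back with `unquote_plus`.

```python
from urllib.parse import urlencode, quote, unquote_plus

params = {'$where': 'parks_nm IS NOT NULL'}

plus_style = urlencode(params)                      # spaces become "+"
percent_style = urlencode(params, quote_via=quote)  # spaces become "%20"
decoded = unquote_plus(plus_style)                  # back to the readable form

print(plus_style)
print(percent_style)
print(decoded)
```

Decoding is handy when you need to check what an encoded URL is actually asking for.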


```python
# example 2
query = {"$where": "cmplnt_fr_dt between '2018-01-10T12:00:00' and '2019-01-10T14:00:00'"}
df = pd.read_json(base_url + urlencode(query))
df.head(2)
```

| | addr_pct_cd | boro_nm | cmplnt_fr_dt | cmplnt_fr_tm | cmplnt_num | cmplnt_to_dt | cmplnt_to_tm | crm_atpt_cptd_cd | hadevelopt | housing_psa | ... | station_name | susp_age_group | susp_race | susp_sex | transit_district | vic_age_group | vic_race | vic_sex | x_coord_cd | y_coord_cd |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 73 | BROOKLYN | 2018-01-11T00:00:00.000 | 12:48:00 | 667212057 | 2019-01-11T00:00:00.000 | 13:30:00 | COMPLETED | | | ... | | 45-64 | BLACK | M | | 45-64 | BLACK | M | 1009102.0 | 182811.0 |
| 1 | 72 | BROOKLYN | 2018-01-11T00:00:00.000 | 12:30:00 | 514197470 | 2019-01-11T00:00:00.000 | 12:35:00 | COMPLETED | | | ... | | 25-44 | BLACK | M | | 18-24 | BLACK | M | 990242.0 | 178571.0 |


### 3) Multiple conditional queries: [boolean operators](https://dev.socrata.com/docs/queries/where.html)


```python
condition_1 = 'parks_nm IS NOT NULL'
condition_2 = "cmplnt_fr_dt between '2018-01-10T12:00:00' and '2019-01-10T14:00:00'"
query_params = urlencode({'$where': condition_1 + ' AND ' + condition_2}) # using boolean operator AND
query_params
```

```
'%24where=parks_nm+IS+NOT+NULL+AND+cmplnt_fr_dt+between+%272018-01-10T12%3A00%3A00%27+and+%272019-01-10T14%3A00%3A00%27'
```

```python
df = pd.read_json(base_url + query_params)
df.head(2)
```

| | addr_pct_cd | boro_nm | cmplnt_fr_dt | cmplnt_fr_tm | cmplnt_num | cmplnt_to_dt | cmplnt_to_tm | crm_atpt_cptd_cd | hadevelopt | housing_psa | ... | prem_typ_desc | rpt_dt | susp_age_group | susp_race | susp_sex | vic_age_group | vic_race | vic_sex | x_coord_cd | y_coord_cd |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30 | MANHATTAN | 2018-02-01T00:00:00.000 | 12:00:00 | 321118618 | 2018-10-31T00:00:00.000 | 12:00:00 | COMPLETED | | | ... | PARK/PLAYGROUND | 2019-08-26T00:00:00.000 | 45-64 | BLACK | M | <18 | BLACK | M | | |
| 1 | 69 | BROOKLYN | 2018-05-14T00:00:00.000 | 15:45:00 | 521681118 | 2019-07-11T00:00:00.000 | 20:00:00 | COMPLETED | | | ... | STREET | 2019-07-11T00:00:00.000 | | | | UNKNOWN | UNKNOWN | E | | |
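
With more than two conditions, string concatenation gets fiddly. One possible approach is a small helper that wraps each condition in parentheses and joins them with `AND`. The name `build_where` is hypothetical, and the parentheses are a defensive assumption so that a condition containing `OR` keeps its intended grouping.

```python
from urllib.parse import urlencode

def build_where(*conditions: str) -> str:
    """Wrap each condition in parentheses and join them with AND."""
    return " AND ".join(f"({c})" for c in conditions)

where = build_where(
    "parks_nm IS NOT NULL",
    "cmplnt_fr_dt between '2018-01-10T12:00:00' and '2019-01-10T14:00:00'",
)
query_params = urlencode({"$where": where})  # ready to append to the base URL
print(where)
```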


### 4) (SODA) API Application Tokens: `$$app_token` query parameter
If you access a lot of data on a regular basis, you may find that responses start arriving more slowly because your access is being throttled. A SODA application token lets you access data at higher throttling limits. You can acquire a token per project, i.e. one for each project.

See [Obtaining an Application Token](https://dev.socrata.com/docs/app-tokens.html). You should get a long string of characters.

```python
# using the app token as part of your API endpoint request
APP_TOKEN = '2EqneQvd21Xp25hEqUeBSW2b6' 

params = urlencode({'$$app_token': APP_TOKEN,
                    '$where': 'parks_nm IS NOT NULL'})

df = pd.read_json("https://data.cityofnewyork.us/resource/5uac-w243.json?" + params)
df.head(2)
```

| | addr_pct_cd | boro_nm | cmplnt_fr_dt | cmplnt_fr_tm | cmplnt_num | cmplnt_to_dt | cmplnt_to_tm | crm_atpt_cptd_cd | hadevelopt | housing_psa | ... | prem_typ_desc | rpt_dt | susp_age_group | susp_race | susp_sex | vic_age_group | vic_race | vic_sex | x_coord_cd | y_coord_cd |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | MANHATTAN | 2019-05-21T00:00:00.000 | 17:00:00 | 264363701 | 2019-05-22T00:00:00.000 | 20:00:00 | COMPLETED | | 645.0 | ... | RESIDENCE - PUBLIC HOUSING | 2019-05-22T00:00:00.000 | UNKNOWN | BLACK | M | <18 | BLACK | F | | |
| 1 | 48 | BRONX | 2019-03-24T00:00:00.000 | 14:25:00 | 883586818 | 2019-03-24T00:00:00.000 | 14:30:00 | COMPLETED | | | ... | PARK/PLAYGROUND | 2019-03-24T00:00:00.000 | <18 | WHITE HISPANIC | F | <18 | BLACK HISPANIC | F | | |
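
If the notebook is going to be shared, one option is to keep the token out of the code and read it from an environment variable instead. The sketch below assumes a variable named `SODA_APP_TOKEN` (a hypothetical name, pick your own) and a hypothetical helper `build_params`. It leaves the token out entirely when the variable is not set.

```python
import os
from urllib.parse import urlencode

def build_params(where: str) -> str:
    """URL-encode a $where filter, adding $$app_token only if one is configured."""
    params = {"$where": where}
    token = os.environ.get("SODA_APP_TOKEN")  # hypothetical variable name
    if token:
        params["$$app_token"] = token
    return urlencode(params)

params = build_params("parks_nm IS NOT NULL")
```

The resulting string can be appended to the endpoint URL exactly as in the cell above.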
