# Data Download

To make an API call using Python, we use the `requests` library. Before we start coding, we need to read the documentation to see whether any configuration must be set up beforehand.

Below is a copy of the Motor Vehicle Collisions - Crashes (MVCC) [API Documentation](https://dev.socrata.com/foundry/data.cityofnewyork.us/h9gi-nx95).
#### START

### API Documentation 

#### Getting Started
All communication with the API is done through HTTPS, and errors are communicated through HTTP response codes. Available response types include JSON (including GeoJSON), XML, and CSV, which are selectable by the "extension" (.json, etc.) on the API endpoint or through content-negotiation with HTTP Accepts headers.

This documentation also includes inline, runnable examples. Click on any link that contains a gear symbol next to it to run that example live against the Motor Vehicle Collisions - Crashes API. If you just want to grab the API endpoint and go, you'll find it below.

#### Tokens
All requests should include an app token that identifies your application, and each application should have its own unique app token. A limited number of requests can be made without an app token, but they are subject to much lower throttling limits than requests that do include one. With an app token, your application is guaranteed access to its own pool of requests. If you don't have an app token yet, click the button to the right to sign up for one.

Once you have an app token, you can include it with your request either by using the X-App-Token HTTP header, or by passing it via the $$app_token parameter on your URL.

#### END

**The above tells us a few important points:**
1. All API calls are made over HTTPS, and errors are communicated through HTTP response codes, which tell the client what happened. Typically a 200/201 response indicates success, while a 401/403 indicates an authentication or authorization failure.

2. We can pass {'X-App-Token': APP_TOKEN} as the headers of our GET request. The documentation never says where to pass our SECRET, and it notes that a limited number of requests can be made without a token, so I suspect this header is optional.
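The two ways of sending the app token described in the documentation can be sketched as follows. This is a minimal illustration using only the standard library; the helper names `token_header` and `token_url` and the placeholder token are assumptions made for this example, not part of the Socrata API.

```python
from urllib.parse import urlencode

ENDPOINT = "https://data.cityofnewyork.us/resource/h9gi-nx95.json"


def token_header(app_token):
    """Option 1: send the token in the X-App-Token HTTP header."""
    return {"X-App-Token": app_token}


def token_url(endpoint, app_token, params=None):
    """Option 2: send the token as the $$app_token URL parameter."""
    query = {"$$app_token": app_token, **(params or {})}
    # Keep '$' unescaped so the parameter names stay readable in the URL.
    return f"{endpoint}?{urlencode(query, safe='$')}"


# "YOUR_APP_TOKEN" is a placeholder, not a real credential.
headers = token_header("YOUR_APP_TOKEN")
url = token_url(ENDPOINT, "YOUR_APP_TOKEN", {"$limit": 5})
print(headers)
print(url)
```

The `headers` dictionary can then be passed to `requests.get(ENDPOINT, headers=headers)`, or the `url` string can be requested directly without any extra headers.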

### Things to Know / Plans

We have multiple ways to query the underlying data: a basic cURL request, the NYC OpenData [Socrata API](https://dev.socrata.com/docs/queries/), or SQL against the Google [bigquery-public-data project](https://cloud.google.com/bigquery/public-data).

1. The Socrata API documentation tells us that we need [Paging Through Data](https://dev.socrata.com/docs/paging.html) to pull all 1.83 million records from the table, because the API returns only 1,000 records per request by default. Paging lets us set an offset, which tells the API where to start the returned list of results. The data must be ordered on a stable key so that the results stay consistent as we page through the dataset.
2. The dataset has 1.83 million rows.
3. There are several noticeable data quality issues, which we will discuss in this notebook.
4. After we have built the simple API call, we will parallelize the I/O by increasing the number of workers so the data is retrieved faster.
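To make the paging arithmetic in point 1 concrete, here is a small sketch that computes how many pages are needed and builds one URL per page. The helper name `page_urls` is hypothetical, and the row count is the approximate 1.83 million figure quoted above, so the exact page count in practice may differ.

```python
import math
from urllib.parse import urlencode

ENDPOINT = "https://data.cityofnewyork.us/resource/h9gi-nx95.json"


def page_urls(total_rows, limit, order_by="collision_id"):
    """Build one URL per page so the pages together cover total_rows rows."""
    n_pages = math.ceil(total_rows / limit)
    urls = []
    for page in range(n_pages):
        # Each page starts where the previous one ended.
        query = {"$limit": limit, "$offset": page * limit, "$order": order_by}
        urls.append(f"{ENDPOINT}?{urlencode(query, safe='$')}")
    return urls


urls = page_urls(total_rows=1_830_000, limit=50_000)
print(len(urls))
print(urls[0])
print(urls[-1])
```

With 50,000 rows per page, roughly 37 requests cover the table, compared with about 1,830 requests at the default limit of 1,000.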

In [1]:
import sys
sys.path.append('/Users/jordancarson/Projects/JPM/nyc-open-data')
import os
import requests
from requests.auth import HTTPBasicAuth
import json
import base64 
import traceback
from datetime import datetime, timedelta
import warnings

import pandas as pd
from pandas.core import api
from sodapy import Socrata

from tqdm import tqdm
# personal common library
from common.utilities import decorators 
from concurrent.futures import ThreadPoolExecutor

warnings.filterwarnings('ignore')

In [2]:
API_LIMIT = 50000 # we want to pull 50,000 records at each iteration
NYC_OPEN_DATA_API_ENDPOINT = 'https://data.cityofnewyork.us/resource/h9gi-nx95.json'
NYC_OPEN_DATA_API_KEY = "311jty15z5y8qcksv6wy8f724"
NYC_OPEN_DATA_API_SECRET =  "43c9eoayynpkyfceayqri48epnfmdxntbpli0p5zrz24yw6fjp"
NYC_OPEN_DATA_APP_TOKEN = 'Ivn8M6s3sEWUF69NbSH3Tbbkm'
NYC_OPEN_DATA_APP_SECRET = 'f2WCAvrC-sUWGRIvBlHQlLalJJI_uaQrhInk'

### Step 1: Query the Data

We are going to create a function that calls the API using the offset and limit parameters in the request URL. The code below downloads all the data and returns a single pandas DataFrame.

In [3]:
@decorators.timeit
def api_pagination_results(orient='records'):
    """
    One method to pull data from the API is to page through it with
    $limit and $offset, ordered by collision_id, and combine the pages
    into a single DataFrame.
    """
    ID = 'collision_id'  # stable sort key so pages do not overlap or skip rows
    offset = 0
    out_frames = []
    while True:
        ENDPOINT = f'{NYC_OPEN_DATA_API_ENDPOINT}?$limit={API_LIMIT}&$offset={offset}&$order={ID}'
        response = requests.get(ENDPOINT, auth=HTTPBasicAuth(NYC_OPEN_DATA_API_KEY, NYC_OPEN_DATA_API_SECRET))
        response.raise_for_status()  # stop on 4xx/5xx instead of parsing an error body

        temp_df = pd.read_json(response.text, orient=orient)
        out_frames.append(temp_df)

        # advance the offset by the number of records actually returned
        offset += len(temp_df)

        # a short page means we have reached the end of the dataset
        if len(temp_df) < API_LIMIT:
            break

    # concatenate the list into one master dataframe
    return pd.concat(out_frames, ignore_index=True)

In [None]:
# data = api_pagination_results()
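The termination logic of the paging loop can be checked offline by swapping the network call for a stand-in. The sketch below assumes a hypothetical `fetch_page(limit, offset)` callable and a fake in-memory table; neither is part of the notebook's real code.

```python
def fetch_all(fetch_page, limit):
    """Collect rows page by page until a short page signals the end.

    fetch_page(limit, offset) is a hypothetical callable returning a list of rows.
    """
    rows, offset = [], 0
    while True:
        page = fetch_page(limit, offset)
        rows.extend(page)
        offset += len(page)
        if len(page) < limit:  # short (or empty) page: nothing left to fetch
            return rows


# Stand-in for the API: 25 fake rows served in slices.
FAKE_TABLE = [{"collision_id": i} for i in range(25)]


def fake_fetch(limit, offset):
    return FAKE_TABLE[offset:offset + limit]


rows = fetch_all(fake_fetch, limit=10)
print(len(rows))
```

Note that when the row count is an exact multiple of the limit, the loop makes one extra request that returns an empty page before stopping.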

We are now going to parallelize the requests using the `ThreadPoolExecutor` from the `concurrent.futures` module. The idea is to write a function that fetches the response from a single endpoint, and then use the executor to map that function over a list of URLs. This approach requires the full list of URLs ahead of time.

When we parallelize the job, the requests complete in:
    attack_all => 7653.711795806885 ms
    (measured when returning a DataFrame instead of response.text.encode('utf8'))
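The executor pattern described above can be tried without touching the network. In this sketch, `fake_get` is a hypothetical stand-in that sleeps briefly to imitate I/O latency, and the example.com URLs are placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_get(url):
    """Stand-in for a network call: sleep briefly, then echo the URL."""
    time.sleep(0.05)
    return f"response for {url}"


urls = [f"https://example.com/page/{i}" for i in range(8)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as executor:
    # map() yields results in the same order as the input URLs.
    results = list(executor.map(fake_get, urls))
elapsed = time.perf_counter() - start

print(results[0])
print(f"{elapsed:.2f}s for {len(urls)} calls")
```

Because `executor.map` preserves input order, the pages come back in offset order even though the requests overlap in time, which keeps the later concatenation straightforward.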

In [16]:
def get_one(url):
    # identify the application with the X-App-Token header described in the API documentation
    headers = {"X-App-Token": NYC_OPEN_DATA_APP_TOKEN}
    response = requests.get(url, auth=HTTPBasicAuth(NYC_OPEN_DATA_API_KEY, NYC_OPEN_DATA_API_SECRET), headers=headers)
    response.raise_for_status()  # stop on 4xx/5xx instead of parsing an error body
    return pd.read_json(response.text, orient='records')

In [17]:
@decorators.timeit
def get_all(urls, workers=15):
    with ThreadPoolExecutor(max_workers=workers) as executor:
        results = list(
            tqdm(executor.map(get_one, urls, timeout=60), total=len(urls))
        )
        return results

In [18]:
def create_urls(order_by='collision_id', limit=API_LIMIT, total=200 * 1000):
    # Build one URL per page. The offset starts at 0 and advances by `limit`,
    # the number of records requested per page, until `total` rows are covered.
    # With the defaults this yields four URLs (the first 200,000 rows), not the full dataset.
    urls = []
    for offset in range(0, total, limit):
        ENDPOINT = f'https://data.cityofnewyork.us/resource/h9gi-nx95.json?$limit={limit}&$offset={offset}&$order={order_by}'
        urls.append(ENDPOINT)
    return urls

In [19]:
urls = create_urls()
results = get_all(urls)
for result in results:
    print(result)

100%|██████████| 4/4 [00:08<00:00,  2.13s/it]


get_all => 8515.436887741089 ms
                    crash_date          crash_time    borough  zip_code  \
0      2012-07-01T00:00:00.000 2021-10-15 10:40:00  MANHATTAN   10013.0   
1      2012-07-01T00:00:00.000 2021-10-15 12:18:00  MANHATTAN   10004.0   
2      2012-07-01T00:00:00.000 2021-10-15 15:00:00        NaN       NaN   
3      2012-07-01T00:00:00.000 2021-10-15 18:00:00  MANHATTAN   10007.0   
4      2012-07-01T00:00:00.000 2021-10-15 19:30:00  MANHATTAN   10013.0   
...                        ...                 ...        ...       ...   
49995  2013-10-11T00:00:00.000 2021-10-15 12:50:00  MANHATTAN   10028.0   
49996  2013-10-11T00:00:00.000 2021-10-15 10:20:00  MANHATTAN   10028.0   
49997  2013-10-11T00:00:00.000 2021-10-15 15:19:00  MANHATTAN   10021.0   
49998  2013-10-11T00:00:00.000 2021-10-15 14:15:00  MANHATTAN   10075.0   
49999  2013-10-11T00:00:00.000 2021-10-15 13:01:00  MANHATTAN   10075.0   

        latitude  longitude  \
0      40.720854 -74.003929   
1    