# STA 141B Data & Web Technologies for Data Analysis


### Lecture 8, 2/5/24, APIs


### Last week's topics

- APIs

### Today's topics

- Undocumented APIs

### Resources
 - [Yolo County Health Inspections](https://yoloeco.envisionconnect.com/)

### Recap: HTTP

A response to an HTTP request always includes a status code that summarizes whether the request was successful. Wikipedia has a full [list of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes). Generally,

* 200-299: Your request succeeded.
* 300-399: You need to take further action to complete the request.
* 400-499: Your request wasn't valid (you made a mistake). You've probably seen 404 before!
* 500-599: Your request failed (the server made a mistake).

In [None]:
import requests
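The ranges above can be written as a small lookup. This is only a sketch: `describe_status` is a hypothetical helper name and not part of `requests`.

```python
def describe_status(code):
    """Map an HTTP status code to the broad category listed above."""
    if 200 <= code <= 299:
        return "success"
    if 300 <= code <= 399:
        return "further action needed"
    if 400 <= code <= 499:
        return "client error"
    if 500 <= code <= 599:
        return "server error"
    return "other"

for code in (200, 301, 404, 503):
    print(code, describe_status(code))
```

In practice you rarely need such a helper, because `raise_for_status()` raises an exception for the 400-599 ranges.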

### Undocumented Web APIs

Many websites use undocumented web APIs to get data. For example:

 - [University of California Compensation](https://ucannualwage.ucop.edu/wage/)
 - [Yolo County Health Inspections](https://yoloeco.envisionconnect.com/)

You can identify these websites by looking at the requests in your browser's developer tools. In Firefox and Chrome, open them with <kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>i</kbd> on Windows or <kbd>&#8984;</kbd> + <kbd>&#8997;</kbd> + <kbd>i</kbd> on MacOS, then watch the network tab while you use the page.

Requests to web APIs almost always return JSON or XML data. By examining the browser requests, you can work out the endpoints and parameters, allowing you to use the API.
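A request URL copied from the developer tools can be split into its endpoint and parameters with the standard library. The sketch below uses a URL of the form that appears in the Yolo County example in this lecture; the variable names are ours.

```python
from urllib.parse import urlsplit, parse_qs

# A request URL as it would appear in the developer tools' network tab.
copied = (
    "https://yoloeco.envisionconnect.com/api/pressAgentClient/programs"
    "?FacilityId=FA0001973&PressAgentOid=c08cb189-894c-4c8c-b595-a5ef010226b4"
)

parts = urlsplit(copied)
# Everything before the "?" is the endpoint.
endpoint = f"{parts.scheme}://{parts.netloc}{parts.path}"
# parse_qs returns a list per key; each key occurs only once here.
params = {key: values[0] for key, values in parse_qs(parts.query).items()}

print(endpoint)
print(params)
```

The resulting `endpoint` and `params` can be passed straight to `requests.get(endpoint, params=params)`.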

**CAUTION:** Web APIs that are undocumented are often undocumented for a reason. Using an undocumented API may make someone angry or get you into legal trouble! Government and quasi-government websites (like the examples above) are probably okay, as long as you cache and rate-limit your requests. For everything else, find an alternative or get permission first.
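One simple way to rate-limit is to enforce a minimum gap between consecutive requests. The class below is a sketch under that assumption; `RateLimiter` and the 0.1 second interval are our own choices, not something the server prescribes.

```python
import time

class RateLimiter:
    """Block so that consecutive calls are at least `min_interval` seconds apart."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last = None  # time of the previous call; None before the first

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()

limiter = RateLimiter(min_interval=0.1)
start = time.monotonic()
for _ in range(3):
    limiter.wait()  # call this right before each request
elapsed = time.monotonic() - start
print(f"3 calls took {elapsed:.2f} s")
```

Unlike a fixed `time.sleep` before every request, this only sleeps for the part of the interval that has not already passed.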

Let's reverse engineer the Yolo County Health Inspections web API so that we can get data about local restaurants.

In [None]:
import numpy as np
import pandas as pd
import requests
import requests_cache
requests_cache.install_cache("lecture5")

In [None]:
url = 'https://yoloeco.envisionconnect.com/api/pressAgentClient/searchFacilities'

In [None]:
result = requests.post(url, params = {
    'PressAgentOid': 'c08cb189-894c-4c8c-b595-a5ef010226b4'
}, 
                       data = {
    'FacilityName': "Ali Baba"
})
result.raise_for_status()

Check the [docs](https://requests.readthedocs.io/en/latest/api/?highlight=post#requests.post) for `requests`!

In [None]:
result.url
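`result.url` contains `PressAgentOid` but not `FacilityName`, because `params` go into the query string while `data` is sent in the request body. The sketch below imitates that split with the standard library; it approximates what `requests` does and is not its actual implementation.

```python
from urllib.parse import urlencode

url = "https://yoloeco.envisionconnect.com/api/pressAgentClient/searchFacilities"
params = {"PressAgentOid": "c08cb189-894c-4c8c-b595-a5ef010226b4"}
data = {"FacilityName": "Ali Baba"}

# `params` end up in the query string, so they are visible in the URL.
full_url = url + "?" + urlencode(params)
# `data` is form-encoded and travels in the request body instead.
body = urlencode(data)

print(full_url)
print(body)
```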

In [None]:
result.json()

Let's investigate this further. The second request takes the `FacilityId` as a parameter.

In [None]:
url = 'https://yoloeco.envisionconnect.com/api/pressAgentClient/programs'
result = requests.get(url, params = {
    'FacilityId': 'FA0001973', 
    'PressAgentOid': 'c08cb189-894c-4c8c-b595-a5ef010226b4'
})
result.raise_for_status()
result.json()

In [None]:
result.url

We are interested in the inspection text, for which we have to provide the `ProgramId` parameter.

In [None]:
url = 'https://yoloeco.envisionconnect.com/api/pressAgentClient/inspections'

In [None]:
result = requests.get(url, params = {
    'PressAgentOid': 'c08cb189-894c-4c8c-b595-a5ef010226b4', 
    'ProgramId': 'PR0000674'
})
result.raise_for_status()

In [None]:
results = result.json()
results

In [None]:
results_df = pd.DataFrame(results)
results_df

In [None]:
results_df['violations'][1]

In [None]:
results_df['violations'][1][0]['v_memo']

In [None]:
len(results_df['violations'][1])

In [None]:
violations = [
    violation['violation_description'] for violation in results_df['violations'][1]
]
violations

In [None]:
{'Ali Baba': violations}
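The same extraction works on the parsed JSON directly, without a data frame, and can cover every inspection at once. The sample below is invented: only the keys `violations`, `violation_description` and `v_memo` come from the cells above, and the real response may contain other fields.

```python
# Made-up sample shaped like the API response: a list of inspections,
# each holding a list of violation records.
sample = [
    {"violations": []},
    {"violations": [
        {"violation_description": "Example violation A", "v_memo": "Example memo A"},
        {"violation_description": "Example violation B", "v_memo": "Example memo B"},
    ]},
]

# Flatten: one description per violation, across all inspections.
descriptions = [
    violation["violation_description"]
    for inspection in sample
    for violation in inspection["violations"]
]
print(descriptions)
```

Inspections without violations contribute nothing to the list, so no special case is needed for them.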

How can we generalize this procedure? 

In [None]:
url = 'https://yoloeco.envisionconnect.com/api/pressAgentClient/searchFacilities'

In [None]:
result=requests.post(url, params  = {
    "PressAgentOid": "c08cb189-894c-4c8c-b595-a5ef010226b4"
}, 
                     data = {
    "FacilityName": "Ali Baba", 
})
result.raise_for_status()

In [None]:
result.json()

In [None]:
result=requests.post(url, params  = {
    "PressAgentOid": "c08cb189-894c-4c8c-b595-a5ef010226b4"}, 
              data = {
    "FacilityName": "a", 
})
result.json()

In [None]:
pd.DataFrame(result.json())

Let's write a pipeline.

In [None]:
def fetch_violations(ProgramId):
    # Request the inspections for one program and return the violation
    # descriptions of the first inspection in the response.
    result = requests.get('https://yoloeco.envisionconnect.com/api/pressAgentClient/inspections', 
                          params = {
        'PressAgentOid': 'c08cb189-894c-4c8c-b595-a5ef010226b4', 
        'ProgramId': ProgramId
    })
    result.raise_for_status()
    results = result.json()
    results_df = pd.DataFrame(results)
    violations = [
        violation['violation_description'] for violation in results_df['violations'][0]
    ]
    return violations

In [None]:
fetch_violations('PR0024103') # for in-n-out

In [None]:
result = requests.get('https://yoloeco.envisionconnect.com/api/pressAgentClient/inspections', 
                          params = {
        'PressAgentOid': 'c08cb189-894c-4c8c-b595-a5ef010226b4', 
        'ProgramId': 'PR0024103'
    })

In [None]:
result.text

In [None]:
def fetch_ProgramId(FacilityID):
    result = requests.get('https://yoloeco.envisionconnect.com/api/pressAgentClient/programs', 
                          params = {
        'PressAgentOid': 'c08cb189-894c-4c8c-b595-a5ef010226b4', 
        'FacilityID': FacilityID
    })
    result.raise_for_status()
    ProgramId = result.json()[0]['ProgramId']
    return(ProgramId)

In [None]:
fetch_ProgramId('FA0003293')

In [None]:
def fetch_FacilityID(letter):
    # Search facilities by (partial) name and keep only the ID and name columns.
    result = requests.post('https://yoloeco.envisionconnect.com/api/pressAgentClient/searchFacilities', 
                           params  = {
    "PressAgentOid": "c08cb189-894c-4c8c-b595-a5ef010226b4"}, 
                           data = {
    "FacilityName": letter, 
    })
    result.raise_for_status()
    facility_table = pd.DataFrame(result.json())[['FacilityId', 'FacilityName']]
    return facility_table

In [None]:
fetch_FacilityID('in')

In [None]:
import time

In [None]:
[letter for letter in map(chr, range(97, 99))]
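The character codes 97 to 122 are the lowercase letters `a` to `z`. An alternative that avoids the magic numbers is `string.ascii_lowercase` from the standard library, sketched here:

```python
import string

letters = list(string.ascii_lowercase)

# Same 26 letters as map(chr, range(97, 123)).
same_as_chr = letters == [chr(code) for code in range(97, 123)]
print(same_as_chr)
print(letters[:2])
```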

In [None]:
x = {}
type(x)

In [None]:
def get_violations(): 
    violations = {}
    for letter in map(chr, range(97, 99)): # map(chr, range(97, 123)) takes too long
        time.sleep(0.05) # pause before the search request for each letter
        facility_table = fetch_FacilityID(letter)
        for index in range(facility_table.shape[0]): # for all facilities returned for this letter
            FacilityId, FacilityName = facility_table.iloc[index]
            time.sleep(0.1) # sleep again for each individual request
            ProgramId = fetch_ProgramId(FacilityId)
            print(FacilityName)
            violations[FacilityName] = fetch_violations(ProgramId)
    return(violations)

In [None]:
violations = get_violations()

In [None]:
fetch_FacilityID('A&B LIQUOR')

In [None]:
fetch_ProgramId('FA0001345')

In [None]:
ProgramId = fetch_ProgramId('FA0001345')            
ProgramId

In [None]:
fetch_violations('PR0000623')

In [None]:
result = requests.get('https://yoloeco.envisionconnect.com/api/pressAgentClient/inspections', params = {
        'PressAgentOid': 'c08cb189-894c-4c8c-b595-a5ef010226b4', 
        'ProgramId': 'PR0000623'
})
result.raise_for_status()

In [None]:
results = result.json()
results

Let's check this in the browser!

In [None]:
result = requests.get('https://yoloeco.envisionconnect.com/api/pressAgentClient/programs', params = {
        'PressAgentOid': 'c08cb189-894c-4c8c-b595-a5ef010226b4', 
        'FacilityID': 'FA0001345'
    }).json()
[result[i]['ProgramId'] for i in range(len(result))]

In [None]:
def fetch_ProgramId(FacilityID):
    result = requests.get('https://yoloeco.envisionconnect.com/api/pressAgentClient/programs', params = {
        'PressAgentOid': 'c08cb189-894c-4c8c-b595-a5ef010226b4', 
        'FacilityID': FacilityID
    }).json()
    ProgramId = [result[i]['ProgramId'] for i in range(len(result))]
    return(ProgramId)

In [None]:
fetch_ProgramId('FA0001345')

In [None]:
def fetch_violations(ProgramId_list):
    violations = []
    for ProgramId in ProgramId_list: 
        result = requests.get('https://yoloeco.envisionconnect.com/api/pressAgentClient/inspections', params = {
            'PressAgentOid': 'c08cb189-894c-4c8c-b595-a5ef010226b4', 
            'ProgramId': ProgramId
        }).json()
        results_df = pd.DataFrame(result)
        if not results_df.empty: # only append violations if there are any
            violations.extend(
                [results_df['violations'][0][i]['violation_description'] for i in range(len(results_df['violations'][0]))]
            )
    return(violations)

In [None]:
fetch_violations(['PR0000623', 'PR0069422'])

In [None]:
violations = get_violations()

In [None]:
violations
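The pipeline chains three lookups: search term to facilities, facility to programs, programs to violations. One way to check that control flow without sending any requests is to pass the three steps in as functions. Everything in this sketch is hypothetical, including `collect_violations`, the stand-in functions and their invented data.

```python
import time

def collect_violations(terms, search, programs, inspections, delay=0.0):
    """Chain the lookups: search term -> facilities -> programs -> violations.

    `search`, `programs` and `inspections` are passed in as functions so the
    loop can be tried with stand-ins before touching the real server.
    """
    collected = {}
    for term in terms:
        for facility_id, facility_name in search(term):
            time.sleep(delay)  # pause before each facility's requests
            program_ids = programs(facility_id)
            collected[facility_name] = inspections(program_ids)
    return collected

# Stand-ins with invented data; real versions would call the web API.
def fake_search(term):
    return [("FA1", "Cafe " + term.upper())]

def fake_programs(facility_id):
    return ["PR1", "PR2"]

def fake_inspections(program_ids):
    return [f"violation for {program_id}" for program_id in program_ids]

demo = collect_violations(["a", "b"], fake_search, fake_programs, fake_inspections)
print(demo)
```

Note that a dictionary keyed by facility name keeps only one entry when two facilities share a name; keying by facility ID would avoid that.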

#### Safeway

Check the [docs](https://requests.readthedocs.io/en/latest/api/?requests.get)!

In [None]:
url = 'https://www.safeway.com/abs/pub/xapi/pgmsearch/v1/search/products'
params = {
    'request-id': 9401706033563384632,
    'q': 'eggs',
    'rows': 30,
    'start': 0,
    'search-type': 'keyword',
    'storeid': 3132,
    'featured': 'true',
    'url': 'https://www.safeway.com',
    'pageurl': 'https://www.safeway.com', 
    'search-uid': 'uid%3D3640904575678%3Av%3D12.0%3Ats%3D1674581210532%3Ahc%3D3', 
    'pagename': 'search',
    'dvid': 'web-4.1search',
}
header = {
    'Ocp-Apim-Subscription-Key': '5e790236c84e46338f4290aa1050cdd4', 
}
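Values copied from the developer tools are sometimes already percent-encoded, as `search-uid` above appears to be. The sketch below decodes it with the standard library and shows what encoding it a second time does. It assumes that `requests` encodes `params` values roughly like `urllib.parse.urlencode`; whether this server cares about the difference is not something we know.

```python
from urllib.parse import unquote, urlencode

copied = "uid%3D3640904575678%3Av%3D12.0%3Ats%3D1674581210532%3Ahc%3D3"

# Undo the percent-encoding to see the actual value.
decoded = unquote(copied)
print(decoded)

# Encoding the already-encoded string escapes each "%" again as "%25".
double_encoded = urlencode({"search-uid": copied})
print(double_encoded)

# Encoding the decoded string reproduces what the browser sent.
single_encoded = urlencode({"search-uid": decoded})
print(single_encoded)
```

If a request built from copied parameters behaves differently from the browser's, passing the decoded value is one thing to try.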

In [None]:
results = requests.get(url, params=params)
results.raise_for_status()

In [None]:
results = requests.get(url, params=params, headers=header)
results.json()

In [None]:
results.raise_for_status()

### Summary 

- Check the request method (GET or POST), headers and params using the developer tools 
- Often, several API requests are needed to display one result 