Urban Data Science & Smart Cities <br>
URSP688Y Spring 2025<br>
Instructor: Chester Harvey <br>
Urban Studies & Planning <br>
National Center for Smart Growth <br>
University of Maryland

# Demo 5 - Accessing and Wrangling Data from the Web

- GitHub Branches
- Loading Data from the Web with APIs
- Debugging

In [None]:
# Import package dependencies
import pandas as pd
import requests # for making RESTful API requests
import json # for converting strings in JSON format to python dictionaries and lists
import yaml # for converting yaml-structured text into python dictionaries and lists
import os # for basic operating system functions, like compiling paths
from datetime import datetime # for getting current date and time

## APIs (Application Programming Interfaces)

An API is an interface for accessing information or functionality. At the most general level, nearly all programs that can be accessed with code have an API.

Python functions, for example, are programs with APIs. You access them by calling the function name and defining arguments.
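
As a small illustration of a function's API, the sketch below calls the built-in `sorted()` function by name and defines its arguments. The street names are made-up example data.

```python
# The built-in sorted() function has an API: a name, one required
# argument (the iterable to sort), and optional keyword arguments.
street_names = ['Baltimore Ave', 'Knox Rd', 'Campus Dr']

# Call the function by name and define its arguments
by_length = sorted(street_names, key=len, reverse=True)
print(by_length)
```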

In practice, when people talk about getting data from APIs, they are usually talking about web APIs:
- These are usually designed with a software architecture called [REST](https://en.wikipedia.org/wiki/REST).
- Using a REST API involves making a request to a URL and receiving a response.
- Often, that response is in a format called [JSON](https://en.wikipedia.org/wiki/JSON), which is structured like nested dictionaries and lists.
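
To make "nested dictionaries and lists" concrete, here is a minimal sketch that parses a JSON string with Python's built-in `json` module. The JSON text and its field names are invented for illustration and do not come from any particular API.

```python
import json

# A made-up JSON string, shaped like a typical web API response
json_text = '''
{
    "last_updated": 1700000000,
    "data": {
        "stations": [
            {"name": "Station A", "bikes_available": 4},
            {"name": "Station B", "bikes_available": 0}
        ]
    }
}
'''

# json.loads converts the string into nested dictionaries and lists
parsed = json.loads(json_text)

# Drill down with dictionary keys and list indices
first_station = parsed['data']['stations'][0]
print(type(parsed), type(parsed['data']['stations']))
print(first_station['name'], first_station['bikes_available'])
```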

Today, we're going to practice retrieving data about cities from web APIs, then wrangling the data they return into a tabular format we can analyze.

Making a request to an API is just another way to get data, similar to downloading it from an open data portal. Why would you bother querying an API instead of just downloading a table?
- APIs allow programmatic access to data that can be easily scaled, replicated, and documented
- APIs can allow customization of which data you are accessing
- JSON allows for much more complex data structures than downloadable tables
- APIs can be an easy way to access real-time data
- Can we think of other reasons?

### Capital Bikeshare Data — Free, simple, real-time

Some APIs with data about cities are free and simple. The Capital Bikeshare (CABI) system, for example, has an API that reports on the status of bikes in its system in real time. It's available free as part of CABI's operating agreement with the City of Washington, D.C.

The District Department of Transportation (DDOT) lists APIs for all of the micromobility systems operating in the city on [this webpage](https://ddot.dc.gov/page/dockless-api).

Let's request some data from the CABI system and see what it looks like.

- What could we do with these data?
- What are their limitations?

In [None]:
# Making a "GET" request
response = requests.get('https://gbfs.lyft.com/gbfs/1.1/dca-cabi/en/free_bike_status.json')

# Get JSON content
data = response.json()
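
Requests can fail (for example, if the server is down or the URL is wrong), and calling `.json()` on a failed response can produce confusing errors. One possible way to guard against this is sketched below. The helper name `get_json` and its `timeout` and `session` parameters are assumptions made for this example, not part of the original workflow.

```python
import requests


def get_json(url, timeout=10, session=None):
    """Request a URL and return its JSON content, raising on HTTP errors.

    session: optional object with a .get() method (e.g., requests.Session);
        defaults to the requests module itself
    """
    http = session if session is not None else requests
    response = http.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx status codes instead of failing later
    response.raise_for_status()
    return response.json()


# Example usage (makes a live request, so it is left commented out):
# data = get_json('https://gbfs.lyft.com/gbfs/1.1/dca-cabi/en/free_bike_status.json')
```

The `timeout` argument keeps the request from hanging indefinitely if the server never responds.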

In [None]:
# Preview the data
data

In [None]:
# Inspect the keys
data.keys()

In [None]:
# Make a dataframe out of data for available bikes
df = pd.DataFrame(data['data']['bikes'])

df.head()
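
When records contain nested dictionaries, `pd.DataFrame` keeps each nested dictionary as a single object in one column. The sketch below shows how `pd.json_normalize` can flatten such records into separate columns. The records are made up, and whether the CABI response nests fields this way is an assumption.

```python
import pandas as pd

# Made-up records shaped like the nested JSON an API might return
bikes = [
    {'bike_id': 'a1', 'lat': 38.90, 'lon': -77.03,
     'rental_uris': {'android': 'https://example.com/a1',
                     'ios': 'https://example.com/a1'}},
    {'bike_id': 'b2', 'lat': 38.91, 'lon': -77.04,
     'rental_uris': {'android': 'https://example.com/b2',
                     'ios': 'https://example.com/b2'}},
]

# pd.DataFrame keeps each nested dict as a single object in one column
nested_df = pd.DataFrame(bikes)

# pd.json_normalize expands nested dicts into dot-separated columns
flat_df = pd.json_normalize(bikes)
print(flat_df.columns.tolist())
```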

In [None]:
# Save the json data for later

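def save_json(data, file_name, timestamp=False):
    """Save data as a JSON file

    data: JSON-compatible data structure (nested dicts and lists)
    file_name: string for file name; DO NOT include file extension (e.g., ".json")
    timestamp: optional value appended to the file name (e.g., a Unix timestamp);
        if False, no timestamp is appended
    """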
    if timestamp:
        file_name = f'{file_name}_{timestamp}.json'
    else:
        file_name = f'{file_name}.json'
    with open(file_name, 'w') as f:
        json.dump(data, f, indent=4)

save_json(data, 'cabi_data')

In [None]:
# Make an automated workflow to retrieve data and save it, all at once

def get_and_save_cabi_data():
    """Get current data from the CABI API and save as a timestamped JSON
    """
    # Making a "GET" request
    response = requests.get('https://gbfs.lyft.com/gbfs/1.1/dca-cabi/en/free_bike_status.json')
    # Get JSON content
    data = response.json()
    # Get timestamp from data
    timestamp = data['last_updated']
    # Save to file
    save_json(data, 'cabi_data', timestamp) 

In [None]:
# Run the automated workflow
# Could we do this on a schedule to collect "snapshots" of the state of the CABI system?
get_and_save_cabi_data()
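
The timestamp saved in the file name above is a raw number. Assuming `last_updated` is a Unix timestamp in seconds, a sketch like the following could turn it into a human-readable, sortable string for file names. The function name `format_timestamp` and the choice of UTC are assumptions for this example.

```python
from datetime import datetime, timezone


def format_timestamp(unix_seconds):
    """Convert a Unix timestamp (seconds) to a sortable UTC string."""
    moment = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
    return moment.strftime('%Y%m%d_%H%M%S')


# Example: build a file name from a timestamp
file_name = f'cabi_data_{format_timestamp(1700000000)}.json'
print(file_name)
```

File names in year-month-day order sort chronologically, which helps when collecting many snapshots.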

### Rentcast — Paid, more complex, updated less frequently

Free APIs like CABI's are becoming less common. (Does this sound familiar in light of today's reading about smart cities as emerging markets?) Many other APIs require that you pay for data, either through a subscription or per request. Some have free tiers, but they're usually quite limited.

Several years ago, Zillow provided data about real estate markets through a free API available to the public. You now have to go through a complicated application process to get access to their API, and your use case needs to be aligned with their business model.

An alternative source of real estate data is a company called [Rentcast](https://www.rentcast.io/). They allow anyone to set up an account and purchase data through an API, but it [gets expensive fast](https://www.rentcast.io/api#api-pricing). You get 50 requests free for "development," but after that you pay \$0.20 per request or \$74 per month for a subscription to make up to 1,000 requests.

They keep track of who is making requests with an 'API key', which is a long string of characters you include in your request as a 'header'. Because API keys are attached to billing information (i.e., your credit card), they're very sensitive. You ***NEVER*** want to commit your API key to GitHub or share it anywhere else publicly.

It's best practice to store your API key in a separate file (I like to use a format called YAML) that you prevent from being committed by adding it to your repository's `.gitignore` file. This is a list of files that you explicitly tell git not to keep track of.

When you want to use your API key, you load the configs into memory in the Python kernel you're currently working in. When you close or restart the kernel, the computer forgets it.
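
As a rough sketch of what a configs file might contain, the example below parses YAML text from a string rather than a file. The values are placeholders, and `local_data_path` is a hypothetical key name chosen for illustration.

```python
import yaml

# Example contents of a configs.yaml file (placeholder values, not real credentials)
example_yaml = """
rentcast_api_key: YOUR_API_KEY_HERE
local_data_path: /path/to/your/data
"""

# safe_load parses YAML text into a Python dictionary
example_configs = yaml.safe_load(example_yaml)
print(example_configs['rentcast_api_key'])

# To keep the real file out of version control, add this line to .gitignore:
#     configs.yaml
```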

In [None]:
# Load personal data from a configs file (API key, local data path)
with open('configs.yaml', 'r') as file:
    CONFIGS = yaml.safe_load(file)

In [None]:
# Load eviction data we used last week
df = pd.read_csv('District_Court_of_Maryland_Eviction_Case_Data_MG_PG.csv')

In [None]:
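# Get rentcast market data for the zipcodes that are most represented in the eviction case data
# NOTE: head(1) requests only the single most common zipcode; use head(10) for the top 10
# (each zipcode costs one API request)
zipcodes = df['Tenant ZIP Code'].value_counts().head(1).index.astype('Int64')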
for zipcode in zipcodes:
    # Make GET request to rentcast API
    url = f'https://api.rentcast.io/v1/markets?zipCode={zipcode}&dataType=All&historyRange=6'
    headers = {
        'X-Api-Key': CONFIGS['rentcast_api_key'],
        'Accept': 'application/json', 
    }
    response = requests.get(url, headers=headers)
    data = response.json()
    # Save to json
    file_path = f'rentcast_{zipcode}.json'
    with open(file_path, 'w') as file:
        json.dump(data, file, indent=4)
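
The part of the URL after the `?` is a query string of `key=value` parameters joined by `&`. As an alternative to writing it by hand in an f-string, the sketch below builds the same kind of URL from a dictionary using the standard library. The ZIP code `20740` is a placeholder value.

```python
from urllib.parse import urlencode

base_url = 'https://api.rentcast.io/v1/markets'
# 20740 is a placeholder ZIP code for illustration
params = {'zipCode': 20740, 'dataType': 'All', 'historyRange': 6}

# urlencode joins the parameters into a query string and escapes special characters
url = f'{base_url}?{urlencode(params)}'
print(url)
```

The `requests` library can also do this step itself if you pass the dictionary as `requests.get(base_url, params=params, headers=headers)`.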

In [None]:
# Preview the data
data

## Debugging

Errors are frustrating and inevitable. Even professional programmers probably spend most of their time debugging.

Luckily, there are good tools and techniques for making debugging a little easier.

An "interactive debugger" helps diagnose problems by stepping through the code one line at a time.

The debugger provides tools for setting "breakpoints" where the code will stop running temporarily, a table that shows the values of variables at that time, and buttons to start, stop, and step through the code.

See the [JupyterLab debugger documentation](https://jupyterlab.readthedocs.io/en/stable/user/debugger.html) for details.

In [None]:
def check_adult(age, cutoff=18):
    if age < cutoff:
        adult = False
    else:
        adult = True
    return adult

check_adult(20)

## Style guidelines for Python
- At the very least, do things consistently
- One statement per line
- Try to limit line length to 79 characters (72 for comments and docstrings), as PEP 8 recommends
- Use four spaces to indent
- Put spaces around operators (e.g., `1 + 1` or `day = 'Monday'`) (except in keyword function arguments)
- Use blank lines intentionally and consistently
- Use meaningful names
- Name variables and functions with `lowercase_underscores`
- Constants are often named in `ALL_CAPS_WITH_UNDERSCORES` (e.g., `SPEED_OF_LIGHT = 2.99792458e+8`)
- Name custom classes with `CapWords`
- In general, avoid spaces in folder and filenames used for programming
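
Here is a short sketch that applies several of these guidelines together. All of the names in it are invented for illustration.

```python
SECONDS_PER_MINUTE = 60  # constant in ALL_CAPS_WITH_UNDERSCORES


class TripRecord:  # custom class in CapWords
    def __init__(self, duration_seconds):
        self.duration_seconds = duration_seconds


def trip_minutes(trip, round_digits=1):  # function in lowercase_underscores
    # Spaces around operators, but not around "=" in keyword arguments
    minutes = trip.duration_seconds / SECONDS_PER_MINUTE
    return round(minutes, round_digits)


print(trip_minutes(TripRecord(duration_seconds=450), round_digits=2))
```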

See [Code Readability](https://github.com/ncsg/ursp688y_sp2024/blob/main/README.md#code-readability) on the syllabus. [CS61A](https://cs61a.org/articles/composition/) has an excellent composition guide. [PEP 8](https://peps.python.org/pep-0008/) is a standard Python style guide. [Google](https://google.github.io/styleguide/pyguide.html) publishes their internal Python style guide.