Salena Kha

### Data Collection

I plan to use the TransitLand API to predict information about public transportation in major cities. After college, or as new entrants to the workforce, many young adults look for full-time jobs and often need to move to a new city. When choosing a city, one thing they may consider very important is a public transportation system that fits their commute to work. I want to develop a model that predicts or provides information about the efficiency of transportation in the major cities that young adults may move to. The TransitLand API is very easy to use, and I will use its data, which includes modes of transportation, the routes each city offers, and efficiency as measured through route sort order, to categorize each city as transit-rich or transit-limited!

In [6]:
import requests
import json
import pandas as pd

# comment out the dotenv lines below if you are not keeping the API key in a .env file
import os
from dotenv import load_dotenv

# load variables from .env so os.getenv can see the key
load_dotenv()

# API key -- secret!
api_key = os.getenv('TRANSITLAND_API_KEY')  # replace this with your API key if you are not using a .env file

# transitland API url
base_url = "https://transit.land/api/v2/rest/routes"

# major cities that recent college grads may move to
cities = [
    {"name": "Boston", "bbox": "-71.191155,42.227926,-70.986166,42.400819"},
    {"name": "New York", "bbox": "-74.259090,40.477399,-73.700272,40.917577"},
    {"name": "Chicago", "bbox": "-87.940101,41.643919,-87.523985,42.023131"},
    {"name": "San Francisco", "bbox": "-122.515,37.703,-122.357,37.812"},
]

# empty dict for Transit info
transit_dict = {
    'city': [],
    'route_name': [],
    'route_type': [],  # categorical route type
    'route_id_numeric': [],  # numeric value of the routeID
    'route_sort_order': [] # numeric value of route order
}

# for all cities, create params
for city in cities:
    params = {
        "bbox": city['bbox'], 
        "limit": 50,  # maximum number of routes returned per request
        "apikey": api_key
    }
    
    response = requests.get(base_url, params=params)
    response.raise_for_status()  # stop early on a bad key or a failed request
    data = response.json()
    
    # organize cols
    for route in data.get('routes', []):
        transit_dict['city'].append(city['name'])
        # prefer the long name, fall back to the short name when the long name is missing or empty
        transit_dict['route_name'].append(route.get('route_long_name') or route.get('route_short_name') or 'Unknown')
        
        # obtain categorical route type (0=tram, 1=subway, 2=rail, 3=bus, ...) to measure variability
        transit_dict['route_type'].append(route.get('route_type', 3))
        
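        # obtain numeric value of the route ID
        # note: the hash() fallback for non-integer IDs is salted per Python process,
        # so those fallback values are not reproducible across runs
        route_id = route.get('id', 0)
        transit_dict['route_id_numeric'].append(route_id if isinstance(route_id, int) else hash(str(route_id)) % 10000)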
        
        # obtain numeric route sort order (the order in which routes are meant to be listed),
        # used here as a proxy for route priority; 999 marks a missing value
        transit_dict['route_sort_order'].append(route.get('route_sort_order', 999))


# convert to df
transit_df = pd.DataFrame(transit_dict)
display(transit_df.head(30))

Output: the first 30 rows of transit_df, with columns city, route_name, route_type, route_id_numeric, route_sort_order.
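
Because each request above asks for at most 50 routes, a city with more routes than that would be cut off, and any per-city route count would be capped at 50. Below is a minimal sketch of collecting every page by following a cursor. It assumes the response carries the next-page cursor under `meta.after`, which should be checked against the TransitLand documentation. `collect_routes` and `fetch_page` are hypothetical names, and the fake pages stand in for real API responses.

```python
from typing import Callable, Optional


def collect_routes(fetch_page: Callable[[Optional[int]], dict], max_pages: int = 20) -> list:
    """Gather routes across pages by following a cursor.

    `fetch_page(after)` is a hypothetical helper that performs one request
    (sending `after` as a query parameter when it is not None) and returns
    the decoded JSON body as a dict.
    """
    routes = []
    after = None
    for _ in range(max_pages):  # hard cap so a bad cursor cannot loop forever
        page = fetch_page(after)
        routes.extend(page.get("routes", []))
        # ASSUMPTION: the next-page cursor lives at page["meta"]["after"]
        after = (page.get("meta") or {}).get("after")
        if after is None:
            break
    return routes


# Stand-in for the real API: three fake pages keyed by cursor (illustrative only).
fake_pages = {
    None: {"routes": [{"id": 1}, {"id": 2}], "meta": {"after": 2}},
    2: {"routes": [{"id": 3}], "meta": {"after": 3}},
    3: {"routes": [{"id": 4}]},
}
all_routes = collect_routes(lambda after: fake_pages[after])
print(len(all_routes))
```

In the real notebook, `fetch_page` would wrap the existing `requests.get(base_url, params=params)` call and add the cursor to `params` when it is set.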


### Data Usage and Remaining Issues

The dataset above is mostly clean already: each row is one route, with its city, name, type, numeric ID, and sort order.


Questions of interest:

1. Which major city offers the most accessible and efficient public transit system for recent college graduates or young adults?

2. How does the mix of transportation modes (bus, subway, etc.) vary across these major cities?
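
One way to start on the second question is a cross-tabulation of city against route type. This is a minimal sketch on made-up rows shaped like `transit_df`; `sample_df`, `mode_counts`, and `modes_per_city` are illustrative names, and the values are not real API output.

```python
import pandas as pd

# Illustrative rows only, shaped like transit_df; not real API output.
sample_df = pd.DataFrame({
    "city": ["Boston", "Boston", "Boston", "Chicago", "Chicago"],
    "route_type": [1, 3, 3, 3, 3],
})

# Rows: cities, columns: route_type codes, cells: number of routes.
mode_counts = pd.crosstab(sample_df["city"], sample_df["route_type"])

# Number of distinct modes each city offers.
modes_per_city = sample_df.groupby("city")["route_type"].nunique()

print(mode_counts)
print(modes_per_city)
```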


My main concern with this data is that route_type is returned as numeric codes, which is what I wanted for modeling, but is not as human-readable. However, this could also be a positive trait, since it makes it easy to look up and sort by one mode of transportation, for example by typing "2" instead of "rail."
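
The readability concern can be handled without giving up the numeric codes by adding a label column next to them. This is a minimal sketch: the mapping covers only the basic GTFS codes 0 through 4, the "other" fallback is my own choice, and `sample_df` holds made-up values.

```python
import pandas as pd

# Basic GTFS route_type codes; any other code falls through to "other".
ROUTE_TYPE_LABELS = {0: "tram", 1: "subway", 2: "rail", 3: "bus", 4: "ferry"}

# Illustrative codes only; 700 stands in for a code outside the basic set.
sample_df = pd.DataFrame({"route_type": [0, 1, 2, 3, 700]})

# Keep the numeric column for modeling and add a readable one for display.
sample_df["route_type_label"] = (
    sample_df["route_type"].map(ROUTE_TYPE_LABELS).fillna("other")
)
print(sample_df)
```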

Similarly, route_id_numeric could be aggregated to give the number of routes per city instead, which may be more helpful.
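
A sketch of that aggregation, assuming the column names from `transit_df`; the rows and IDs here are made up for illustration.

```python
import pandas as pd

# Illustrative rows only, shaped like transit_df; the IDs are made up.
sample_df = pd.DataFrame({
    "city": ["Boston", "Boston", "New York", "New York", "New York"],
    "route_id_numeric": [101, 102, 201, 202, 202],
})

# Count distinct route IDs per city so a repeated route is not counted twice.
routes_per_city = (
    sample_df.groupby("city")["route_id_numeric"]
    .nunique()
    .rename("n_routes")
    .reset_index()
)
print(routes_per_city)
```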

Overall, I have the features I need to classify each city as transit-rich or transit-limited, a useful categorization for those moving to a new city. I plan either to predict numeric results such as route priority through regression, or to predict the categorical label "transit-rich" or "transit-limited" through classification.
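
Before any model is trained, the two labels need a definition. The sketch below shows one possible rule-based labeling on per-city features. The cutoffs `MIN_ROUTES` and `MIN_MODES` and the feature values are hypothetical placeholders, not results from the data.

```python
import pandas as pd

# Illustrative per-city features; the numbers are made up.
city_features = pd.DataFrame({
    "city": ["A", "B", "C"],
    "n_routes": [40, 12, 25],
    "n_modes": [3, 1, 2],
})

MIN_ROUTES = 20  # hypothetical cutoff
MIN_MODES = 2    # hypothetical cutoff

# A city counts as transit-rich only if it clears both cutoffs.
is_rich = (city_features["n_routes"] >= MIN_ROUTES) & (city_features["n_modes"] >= MIN_MODES)
city_features["transit_label"] = is_rich.map({True: "transit-rich", False: "transit-limited"})
print(city_features)
```

With only four cities in the dataset there would be only four labeled examples, so a trained classifier may need more cities or a finer unit of analysis to learn anything useful.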
