## Overview

The goal of this notebook is to gather data from the National Transit Database (NTD) and parse it into a form we can use for our transit carbon calculations. We import and parse two datasets: the [NTD Annual Data - Fuel and Energy](https://data.transportation.gov/Public-Transit/2022-NTD-Annual-Data-Fuel-and-Energy/8ehq-7his/data) set and the [TS2.2 - Service Data and Operating Expenses Time Series by System](https://www.transit.dot.gov/ntd/data-product/ts22-service-data-and-operating-expenses-time-series-system-0) set. The former gives us the fuel types used by transit agencies, while the latter gives us the passenger miles traveled (PMT) of each agency. Once the notebook has run, all of the parsed data is written to `revised-ntd-json-data.json` in the following format:

<details>
<summary>Data Format</summary>

```json
{
    "meta": {
        // Meta data, column names, etc...
    },
    "data": {
        // uace_code: [ array of dictionaries/all rows that pertain to the uace_code ]
    },
    "pmt": {
        // uace_code: total_pmt_value
    }
}
```

</details>
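
As a minimal sketch of how a consumer might read this structure, the snippet below builds a tiny in-memory stand-in for `revised-ntd-json-data.json` and looks up the rows and PMT for one UACE code. The helper name `lookup_uace` and the sample values are illustrative assumptions, not part of this notebook.

```python
import json

# Tiny in-memory stand-in for revised-ntd-json-data.json (values are made up).
sample = json.loads("""
{
    "meta": {},
    "data": {"63217": [{"Agency": "Example Transit", "Mode": "MB"}]},
    "pmt": {"63217": 1000.0}
}
""")

def lookup_uace(doc, code):
    """Return (rows, pmt) for a UACE code, or ([], 0) if the code is absent."""
    return doc["data"].get(code, []), doc["pmt"].get(code, 0)

rows, pmt = lookup_uace(sample, "63217")
print(len(rows), pmt)
```

Note that UACE codes are string keys in both `data` and `pmt`, so lookups should pass `"63217"` rather than the integer `63217`.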

#### To Maintain

To update the data each year, point the two URL variables below at the most recent download links.

In [1]:
# NTD Annual Data - Fuel and Energy: https://data.transportation.gov/Public-Transit/2022-NTD-Annual-Data-Fuel-and-Energy/8ehq-7his/data
ntd_fuel_energy_url = "https://data.transportation.gov/api/views/8ehq-7his/rows.json?accessType=DOWNLOAD"
# TS2.2 - Service Data and Operating Expenses Time Series by System: https://www.transit.dot.gov/ntd/data-product/ts22-service-data-and-operating-expenses-time-series-system-0
ntd_pmt_url = "https://www.transit.dot.gov/sites/fta.dot.gov/files/2024-06/2022%20TS2.2%20Service%20Data%20and%20Operating%20Expenses%20Time%20Series%20by%20System.xlsx"

### Setup NTD fuel and energy data

In [2]:
import json
import requests
import pandas as pd

# Download file helper function
def download_file(url, filename):
    response = requests.get(url, stream=True)
    response.raise_for_status()
    with open(filename, 'wb') as file:
        for chunk in response.iter_content(chunk_size=8192):
            file.write(chunk)

# Download NTD data 
ntd_fuel_energy_name = "ntd-json-data.json"
download_file(ntd_fuel_energy_url, ntd_fuel_energy_name)

with open(ntd_fuel_energy_name, 'r') as file:
    raw = json.load(file)

# Focus on actual data
data = raw["data"]

# Data comes back as an array of arrays, with each inner array corresponding to one row of the dataset table, e.g.
print(json.dumps(data[0], indent=2))



[
  "row-jzh9~3jst-vtgf",
  "00000000-0000-0000-ECEB-FE81C09A5D87",
  0,
  1708100603,
  null,
  1708100603,
  null,
  "{ }",
  "MTA New York City Transit",
  "Brooklyn",
  "NY",
  "20008",
  "Subsidiary Unit of a Transit Agency, Reporting Separately",
  "Full Reporter",
  "63217",
  "New York--Jersey City--Newark, NY--NJ",
  "19426449",
  "10019",
  "CB",
  "DO",
  "443",
  "3697741",
  null,
  "0",
  null,
  "0",
  null,
  "0",
  null,
  "0",
  null,
  "0",
  null,
  "0",
  null,
  "0",
  null,
  "0",
  null,
  "15739148",
  null,
  "0",
  null,
  "0",
  null,
  "0",
  null,
  "0",
  null,
  "0",
  null,
  "0",
  null,
  "0",
  null,
  "4.2564",
  null,
  null,
  null,
  null,
  null,
  null,
  null,
  null,
  null,
  null,
  null,
  null,
  null,
  null,
  null
]


### Refactor each row from an array into a dictionary

Now that we have the data, we can see that each row in the table is represented as an array. Accessing values by position is ambiguous, since nothing in the array says which index holds which column, so I think we should convert each array to a dictionary keyed by column name so we can access the data in a cleaner way.

Note: While converting each row array into a dictionary, I also drop any keys/column names that contain "Questionable". Such a column follows almost every column that reports some type of numerical data, and I believe it is used to flag a value whose source is unreliable. I haven't seen it set, and it just clutters our dictionary, so I am going to remove these columns for the time being.
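
Before dropping these columns, it may be worth checking that they really are empty. The sketch below counts non-null values in every column whose name contains "Questionable". The function name `count_flagged` and the two-column sample are hypothetical; in the notebook, the column names would come from `raw["meta"]["view"]["columns"]` and the rows from `raw["data"]`.

```python
def count_flagged(keys, rows, marker="Questionable"):
    """Count non-null values in every column whose name contains `marker`."""
    counts = {}
    for row in rows:
        for key, value in zip(keys, row):
            if marker in key and value is not None:
                counts[key] = counts.get(key, 0) + 1
    return counts

# Hypothetical sample: the column name and the "Q" flag value are made up.
sample_keys = ["Diesel (gal)", "Diesel (gal) Questionable"]
sample_rows = [["3697741", None], ["120", "Q"]]
print(count_flagged(sample_keys, sample_rows))
```

An empty result would support removing the columns; any non-zero count would suggest keeping them or at least logging the affected rows.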

In [3]:
# Retrieve all column values
keys = [column["name"] for column in raw["meta"]["view"]["columns"]]

# Map column names to each value in the row, and remove "Questionable" fields
converted_data = [{k: v for k, v in dict(zip(keys, row)).items() if "Questionable" not in k} for row in data]

# Now each element in the array is a dictionary with mapped keys to each value
print(json.dumps(converted_data[0], indent=2))

{
  "sid": "row-jzh9~3jst-vtgf",
  "id": "00000000-0000-0000-ECEB-FE81C09A5D87",
  "position": 0,
  "created_at": 1708100603,
  "created_meta": null,
  "updated_at": 1708100603,
  "updated_meta": null,
  "meta": "{ }",
  "Agency": "MTA New York City Transit",
  "City": "Brooklyn",
  "State": "NY",
  "NTD ID": "20008",
  "Organization Type": "Subsidiary Unit of a Transit Agency, Reporting Separately",
  "Reporter Type": "Full Reporter",
  "UACE Code": "63217",
  "UZA Name": "New York--Jersey City--Newark, NY--NJ",
  "Primary UZA Population": "19426449",
  "Agency VOMS": "10019",
  "Mode": "CB",
  "TOS": "DO",
  "Mode VOMS": "443",
  "Diesel (gal)": "3697741",
  "Gasoline (gal)": "0",
  "Liquefied Petroleum Gas (gal equivalent)": "0",
  "Compressed Natural Gas (gal equivalent)": "0",
  "Bio-Diesel (gal)": "0",
  "Hydrogen (kg)": "0",
  "Other Fuel (gal/gal equivalent)": "0",
  "Electric Propulsion (kWh)": "0",
  "Electric Battery (kWh)": "0",
  "Diesel (miles)": "15739148",
  "Gasoline (

### Aggregate all rows with same UACE code into one area

Now that each row has been turned into a dictionary, let's organize the data by UACE code. Each code becomes a key in a dictionary whose value is an array of all the entries that pertain to that code.

In [4]:
aggregate_data = {}

for row in converted_data:
    code = row["UACE Code"]
    if aggregate_data.get(code) is None:
        aggregate_data[code] = [row]
    else:
        aggregate_data[code].append(row)

# Show first few rows of data in 63217 area code
print(json.dumps(aggregate_data["63217"][0:3], indent=2))
print("Number of entries for 63217: " + str(len(aggregate_data["63217"])))

[
  {
    "sid": "row-jzh9~3jst-vtgf",
    "id": "00000000-0000-0000-ECEB-FE81C09A5D87",
    "position": 0,
    "created_at": 1708100603,
    "created_meta": null,
    "updated_at": 1708100603,
    "updated_meta": null,
    "meta": "{ }",
    "Agency": "MTA New York City Transit",
    "City": "Brooklyn",
    "State": "NY",
    "NTD ID": "20008",
    "Organization Type": "Subsidiary Unit of a Transit Agency, Reporting Separately",
    "Reporter Type": "Full Reporter",
    "UACE Code": "63217",
    "UZA Name": "New York--Jersey City--Newark, NY--NJ",
    "Primary UZA Population": "19426449",
    "Agency VOMS": "10019",
    "Mode": "CB",
    "TOS": "DO",
    "Mode VOMS": "443",
    "Diesel (gal)": "3697741",
    "Gasoline (gal)": "0",
    "Liquefied Petroleum Gas (gal equivalent)": "0",
    "Compressed Natural Gas (gal equivalent)": "0",
    "Bio-Diesel (gal)": "0",
    "Hydrogen (kg)": "0",
    "Other Fuel (gal/gal equivalent)": "0",
    "Electric Propulsion (kWh)": "0",
    "Electric Ba

### PMT Data

To properly assess the amount of carbon emitted when a person uses public transit, we need to know how many people were sharing that service at the same time. It is not reasonable to ask users to report how many people were on board with them, so the next best thing is an estimate. We can get one by taking the total passenger miles traveled (PMT) for each agency and dividing it by the total miles of service the agency provided, which gives the average number of passengers served per mile.
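
As a worked example of that division, the sketch below uses the totals this notebook reports for UACE 63217 (PMT of 14407877556 and 861570286 total miles). The function name `passengers_per_mile` is an illustrative assumption.

```python
def passengers_per_mile(pmt, total_miles):
    """Average passengers per mile: PMT divided by miles of service (0 if none)."""
    return pmt / total_miles if total_miles else 0.0

# Totals taken from this notebook's output for UACE 63217.
avg = passengers_per_mile(14_407_877_556, 861_570_286)
print(round(avg, 2))  # 16.72
```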

_Requires `pandas` and `openpyxl`_

In [5]:
# Download PMT data
pmt_file_name = "pmt-data.xlsx"
download_file(ntd_pmt_url, pmt_file_name)

# Import PMT data, and replace any NaN/empty values with 0
df = pd.read_excel(pmt_file_name, sheet_name="PMT")
df["2022"] = df["2022"].fillna(0)

# Map all NTD ids to their 2022 PMT value
ntd_to_pmt = {row['NTD ID']: row['2022'] for _, row in df.iterrows()}

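# Get the PMT for an agency given their NTD id
def find_pmt_by_ntd_id(ntd_id):
    # The fuel dataset stores NTD ids as strings, while the lookup uses integer keys
    ntd_id = int(ntd_id)
    # Default to a numeric 0 (not the string "0") so the result can always be summed
    return ntd_to_pmt.get(ntd_id, 0)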

# Retrieve all NTD ids within a certain UACE code
def get_all_ntd_id_by_uace(code):
    s = set()
    for row in aggregate_data[code]:
        s.add(row['NTD ID'])
    # print("In " + code + ", there were " + str(len(s)) + " unique ids")
    return s

# Aggregates and sums all PMTs in one UACE code
def total_pmt_by_uace(code):
    ids = get_all_ntd_id_by_uace(code)
    total = 0
    for ntd_id in ids:
        total += find_pmt_by_ntd_id(ntd_id)
    # print("Total PMT in " + code + " is " + str(total))
    return total


# Calculate all PMTs for each area code and put them in a dictionary 
def calculate_all_pmts():
    # Dict to hold uace_code: PMT values
    uace_pmt = {}
    for code in aggregate_data:
        pmt = total_pmt_by_uace(code)
        uace_pmt[code] = pmt
    return uace_pmt

# Run the calculations
pmt_dict = calculate_all_pmts()

  warn("""Cannot parse header or footer so it will be ignored""")


### Write JSON to file

Integrate our revised data into the old data, and then export the file for further use elsewhere.

In [6]:
# Replace the "data" field for our new aggregate_data variable, and add in a "pmt" field
json_data = raw
json_data["data"] = aggregate_data
json_data["pmt"] = pmt_dict
json_data = json.dumps(json_data, indent=4)
combined_data_name = "revised-ntd-json-data.json"

# Write JSON to file
with open(combined_data_name, 'w') as file:
    file.write(json_data)
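
A quick way to confirm the export is lossless is to read the file back and compare it with what was written. The sketch below does this for a small stand-in payload in a temporary directory; the function name `write_and_verify` and the file name `check.json` are assumptions for illustration.

```python
import json
import os
import tempfile

def write_and_verify(payload, path):
    """Write payload as JSON, reload it, and report whether it round-trips."""
    with open(path, "w") as f:
        json.dump(payload, f, indent=4)
    with open(path) as f:
        return json.load(f) == payload

# Stand-in payload with the same top-level keys as the exported file.
payload = {"meta": {}, "data": {"63217": []}, "pmt": {"63217": 0}}
with tempfile.TemporaryDirectory() as tmp:
    ok = write_and_verify(payload, os.path.join(tmp, "check.json"))
print(ok)
```

The comparison only holds when all dictionary keys are strings, because JSON converts non-string keys to strings on export.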

### Data analysis

With all of our data reformatted and saved in one place, we can now perform some data analysis.

In [8]:
# Import data
with open(combined_data_name, 'r') as file:
    combined_data = json.load(file)

# Find all key/column names
keys = [column["name"] for column in combined_data["meta"]["view"]["columns"]]

# Total "fuel + (unit)" count based off area code and mode
def total_count(code, fuel, unit, mode=""):
    total_miles = 0
    total_entries = 0
    for row in combined_data["data"][code]:
        if mode == "" or row["Mode"] == mode:
            total_miles += float(row[fuel + " (" + unit + ")"])
            total_entries += 1
    return (total_miles, total_entries)

# Average miles in area code based on fuel type and mode
def average_miles(code, fuel, mode=""):
    (total, entries) = total_count(code, fuel, "miles", mode)
    avg = total / entries if entries > 0 else 0
    print("Area Code: " + code + "\n" + "Total " + fuel + " Miles: " + str(total) + "\nTotal Entries: " + str(entries) + "\nAvg: " + str(avg) + "\n")

# Aggregate and rank fuel mileage based on area code and mode
def ranked_fuel_mileage(code, mode=""):
    fuel_types = [k.split(" (").pop(0) for k in keys if k.endswith("(miles)")]
    aggregate_fuel_mileage = []
    for fuel in fuel_types:
        (total, _) = total_count(code, fuel, "miles", mode)
        aggregate_fuel_mileage.append((fuel + " (miles)", total))
    sorted_aggregate_fuel_mileage = sorted(aggregate_fuel_mileage, key=lambda x: x[1], reverse=True)
    print("Aggregate fuel mileage for UACE " + code + " ranked in descending order...")
    print(str(sorted_aggregate_fuel_mileage) + "\n")

# Utilize total miles and PMT to calculate the average amount of passengers per mile
def average_passengers_per_mile(code, mode=""):
    fuel_types = [k.split(" (").pop(0) for k in keys if k.endswith("(miles)")]
    total = 0
    for fuel in fuel_types:
        (m, _) = total_count(code, fuel, "miles", mode)
        total += m
    pmt = combined_data["pmt"][code]
    # Guard against areas that report no miles at all
    avg = pmt / total if total > 0 else 0
    print("Avg Passengers per mile for " + code + "\nTotal miles: " + str(total) + "\nPMT: " + str(pmt) + "\nAvg: " + str(avg))
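
The raw rows contain `null` in some cells, and `float(None)` raises a `TypeError`. Whether the "(miles)" columns can ever be null is an assumption I have not verified, but a tolerant conversion helper is cheap insurance. The name `to_float` and the comma handling are hypothetical additions, not something the dataset is known to require.

```python
def to_float(value, default=0.0):
    """Convert an NTD cell to float, treating None, blanks and bad strings as `default`."""
    if value is None:
        return default
    try:
        # Stripping thousands separators is an assumption about possible formatting.
        return float(str(value).replace(",", ""))
    except ValueError:
        return default

print([to_float(v) for v in ["15739148", None, "", "1,234.5"]])
```

Swapping `float(...)` for `to_float(...)` inside `total_count` would keep a single missing value from aborting the whole aggregation.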
    

### Running data analysis

In [None]:
average_miles("63217", "Gasoline")

ranked_fuel_mileage("63217")

average_passengers_per_mile("63217")

Area Code: 63217
Total Gasoline Miles: 66437808.0
Total Entries: 58
Avg: 1145479.448275862

Aggregate fuel mileage for UACE 63217 ranked in descending order...
[('Electric Propulsion (miles)', 497061246.0), ('Diesel (miles)', 259872221.0), ('Gasoline (miles)', 66437808.0), ('Compressed Natural Gas (miles)', 32531976.0), ('Other Fuel (miles)', 5410262.0), ('Electric Battery (miles)', 256773.0), ('Liquefied Petroleum Gas (miles)', 0.0), ('Hydrogen (miles)', 0.0)]

Avg Passengers per mile for 63217
Total miles: 861570286.0
PMT: 14407877556.0
Avg: 16.722811580342732
