## Overview

The goal of this notebook is to gather data from the National Transit Database (NTD) and parse it into a form we can use for our public transit carbon calculations. We use two datasets: [NTD Annual Data - Fuel and Energy](https://data.transportation.gov/Public-Transit/2022-NTD-Annual-Data-Fuel-and-Energy/8ehq-7his/data) and [NTD Service](https://www.transit.dot.gov/ntd/data-product/2022-service). The Fuel and Energy dataset tells us, for each agency, how much of each fuel type was used (gallons, kWh, etc.) and how many miles were accumulated using that fuel type. The Service dataset gives us PMT (passenger miles traveled), which lets us estimate the average passenger load of a transit system. Together, the two datasets let us estimate the efficiency of a transit mode by fuel type, along with the carbon footprint of using it.

<br>

---

#### Results of Notebook

As stated above, the goal is to parse these datasets into something usable for our transit calculations. Once the notebook has finished running, the data will be unified into two separate JSON files, structured as follows.

<details>
<summary>NTD Fuel and Energy data format</summary>

```json
    {
        "2022": {
            "uace_code": [
                {
                    // Info about agency in uace_code
                },
                ...
            ],
            "uace_code2": [
                {
                    // Info about agency in uace_code2
                },
                ...
            ],
            ...
        },
        "2021": {
            ...
        },
        ...
    }
```

</details>

<br>

<details>
<summary>NTD Service data format</summary>

```json
    {
        "2022": {
            "ntd_id": {
                "mode": [
                    {
                        // Info about a specific mode found in ntd_id
                    },
                    {
                        // Can be multiple of the same mode found in one ntd_id
                    }
                ],
                "mode2": [
                    {
                        // Info about a specific mode2 found in ntd_id
                    }
                ],
                ...
            },
            ...
        },
        "2021": {
            ...
        },
        ...
    }

```

</details>
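
As a rough illustration of how the two layouts relate, the sketch below walks both structures using small hand-written samples. The sample values and the helper names `agencies_in_area` and `rows_for_mode` are hypothetical and only for illustration. One detail worth keeping in mind: JSON object keys are always strings, so NTD IDs used as keys in the Service file are strings even when the `NTD ID` field inside a row is numeric.

```python
# Hand-written samples that mimic the two layouts (all values are made up).
fuel_energy = {
    "2022": {
        "12345": [  # UACE code -> list of agency/mode rows
            {"NTD ID": 1, "Mode": "MB", "Diesel (gal)": "1,000"},
            {"NTD ID": 2, "Mode": "LR", "Electric Propulsion (kWh)": "5,000"},
        ]
    }
}
service = {
    "2022": {
        "1": {  # NTD ID (string key) -> mode -> list of rows
            "MB": [{"NTD ID": 1, "Mode": "MB", "PMT": "20,000"}],
        }
    }
}


def agencies_in_area(fuel_energy, year, uace_code):
    """Return the NTD IDs (as strings) found under one UACE code."""
    return sorted({str(row["NTD ID"]) for row in fuel_energy[year][uace_code]})


def rows_for_mode(service, year, ntd_id, mode):
    """Return the Service rows for one agency and mode, or [] if absent."""
    return service.get(year, {}).get(ntd_id, {}).get(mode, [])


for ntd_id in agencies_in_area(fuel_energy, "2022", "12345"):
    print(ntd_id, rows_for_mode(service, "2022", ntd_id, "MB"))
```

Converting the NTD ID to a string before the lookup is what bridges the UACE-keyed file and the NTD-ID-keyed file.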

<br>

---

#### Notebook Index

0. [URLs for datasets](#section0)
1. [Downloading datasets](#section1)
2. [Parse and refactor NTD Fuel and Energy dataset](#section2)
3. [Parse and refactor NTD Service dataset](#section3)
4. [Carbon calculations using datasets](#section4)

<br>

----

#### To Maintain

To update the data each year, append the newest version of each dataset to its respective list below. Assuming the data format has not changed, the script should be plug and play. Note that the 2018-2021 Fuel and Energy data were formatted slightly differently from the 2022 data, so we need to watch future versions in case the format changes again. The column formats for the two time frames are shown below. The current script refactors the 2018-2021 data to look like the 2022 data, as those keys are easier to work with in a dictionary.

2018 - 2021 Fuel and Energy Data Format

| Miles Traveled by Vehicles Fueled by | Diesel | Gasoline | ... | 
|--------------------------------------|--------|----------|-----|

2022 Fuel and Energy Data Format

| Diesel (miles) | Gasoline (miles) | ... | 
|----------------|------------------|-----|
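
The renaming between the two layouts can be sketched on a plain list of column names. This is a simplified illustration, not the notebook's exact code: it assumes the miles columns sit between two marker columns, and the default marker strings are the ones the parsing code in this notebook checks for. The sample column list is made up.

```python
def rename_legacy_columns(columns,
                          start="Miles Traveled by Vehicles Fueled by:",
                          end="Miles per Gallon/KwH:"):
    """Rename 2018-2021 style columns to the 2022 style.

    Columns between the `start` and `end` markers get a " (miles)" suffix,
    and the marker columns themselves are dropped.
    """
    renamed = []
    in_miles = False
    for name in columns:
        if name == start:
            in_miles = True
            continue
        if name == end:
            in_miles = False
            continue
        renamed.append(f"{name} (miles)" if in_miles else name)
    return renamed


# Illustrative legacy header (not copied from a real NTD file).
legacy = [
    "NTD ID",
    "Diesel (gal)",
    "Miles Traveled by Vehicles Fueled by:",
    "Diesel",
    "Gasoline",
    "Miles per Gallon/KwH:",
]
print(rename_legacy_columns(legacy))
# ['NTD ID', 'Diesel (gal)', 'Diesel (miles)', 'Gasoline (miles)']
```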

<br>

### <a id="section0">File URLs </a>

In [168]:
ntd_fuel_energy_urls = [
    {
        # https://www.transit.dot.gov/ntd/data-product/2022-fuel-and-energy | CSV file
        "year": "2022",
        "file_type": "csv",
        "url": "https://data.transportation.gov/api/views/8ehq-7his/rows.csv?date=20231027&accessType=DOWNLOAD&bom=true&format=true"
    }, 
    {
        # https://www.transit.dot.gov/ntd/data-product/2021-fuel-and-energy | xlsx file
        "year": "2021",
        "file_type": "xlsx",
        "url": "https://www.transit.dot.gov/sites/fta.dot.gov/files/2023-12/2021%20Fuel%20and%20Energy_1-1_0.xlsx"
    },
    {
        # https://www.transit.dot.gov/ntd/data-product/2020-fuel-and-energy | xlsx file, Data is organized differently
        "year": "2020",
        "file_type": "xlsx",
        "url": "https://www.transit.dot.gov/sites/fta.dot.gov/files/2023-12/2020-Fuel%20and%20Energy_1-1_1.xlsx",
    },
    {
        # https://www.transit.dot.gov/ntd/data-product/2019-fuel-and-energy | ZIP file then xlsm, Data is organized differently
        "year": "2019",
        "file_type": "zip",
        "url": "https://www.transit.dot.gov/sites/fta.dot.gov/files/Fuel%20and%20Energy.zip",
    }, 
    {
        # https://www.transit.dot.gov/ntd/data-product/2018-fuel-and-energy | xlsm, Data is organized differently
        "year": "2018",
        "file_type": "xlsm",
        "url": "https://www.transit.dot.gov/sites/fta.dot.gov/files/Fuel%20and%20Energy_3.xlsm"
    }
]

ntd_service_pmt_urls = [
    {
        # https://www.transit.dot.gov/ntd/data-product/2022-service
        "year": "2022",
        "file_type": "csv",
        "url": "https://data.transportation.gov/api/views/4fir-qbim/rows.csv?date=20231102&accessType=DOWNLOAD&bom=true&format=true"
    },
    {
        # https://www.transit.dot.gov/ntd/data-product/2021-service | Annual Service Data by Mode
        "year": "2021",
        "file_type": "xlsx",
        "url": "https://www.transit.dot.gov/sites/fta.dot.gov/files/2022-10/2021%20Service_static.xlsx"
    },
    {
        # https://www.transit.dot.gov/ntd/data-product/2020-service | Annual Service Data by Mode
        "year": "2020",
        "file_type": "zip",
        "url": "https://www.transit.dot.gov/sites/fta.dot.gov/files/2020-Service.zip"
    },
    {
        # https://www.transit.dot.gov/ntd/data-product/2019-service | Annual Service Data by Mode
        "year": "2019",
        "file_type": "zip",
        "url": "https://www.transit.dot.gov/sites/fta.dot.gov/files/Service.zip"
    },
    {
        # https://www.transit.dot.gov/ntd/data-product/2018-service | Annual Service Data by Mode
        "year": "2018",
        "file_type": "xlsm",
        "url": "https://www.transit.dot.gov/sites/fta.dot.gov/files/Service_4.xlsm"
    },

]

### <a id="section1">Download the NTD Fuel and Energy + NTD Service data</a>

In [169]:
import json
import requests
import pandas as pd
import zipfile
import os
from collections import defaultdict

# Download file helper function
def download_file(url, filename):
    response = requests.get(url, stream=True)
    response.raise_for_status()
    with open(filename, 'wb') as file:
        for chunk in response.iter_content(chunk_size=8192):
            file.write(chunk)

# Download all NTD Fuel and Energy Data 
for url_entry in ntd_fuel_energy_urls:
    ntd_fuel_energy_name = url_entry["year"] + "-fuel-energy." + url_entry["file_type"]
    download_file(url_entry["url"], ntd_fuel_energy_name)

# Download all NTD Service Data 
for url_entry in ntd_service_pmt_urls:
    ntd_service_pmt_name = url_entry["year"] + "-service-pmt." + url_entry["file_type"]
    download_file(url_entry["url"], ntd_service_pmt_name)

### <a id="section2">Unify the NTD Fuel and Energy data into one JSON file</a>

All of the NTD data between the years is split up into different file structures and formats, so we need to unify that into one cohesive JSON file. We will also need to refactor all of the data a bit by converting all the rows, which are represented by arrays, into dictionaries, removing extraneous values, and adding additional fields where necessary.
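
The two core steps, turning array-like rows into dictionaries and grouping them by a key, can be sketched without pandas as follows. The helper names `rows_to_dicts` and `group_by` and the sample header (including the `Diesel Questionable` column) are hypothetical stand-ins chosen for illustration.

```python
from collections import defaultdict


def rows_to_dicts(header, rows, drop_if_contains="Questionable"):
    """Pair each row with the header and drop columns whose name contains the marker."""
    return [
        {k: v for k, v in zip(header, row) if drop_if_contains not in str(k)}
        for row in rows
    ]


def group_by(records, key):
    """Group a list of dictionaries by the value stored under `key`."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record[key]].append(record)
    return dict(grouped)


# Made-up sample data: two agencies in area 100, one in area 200.
header = ["NTD ID", "UACE Code", "Diesel (gal)", "Diesel Questionable"]
rows = [[1, 100, 50, "Q"], [2, 100, 70, ""], [3, 200, 10, ""]]

grouped = group_by(rows_to_dicts(header, rows), "UACE Code")
print(grouped[100])
```

Grouping into plain lists keeps every row intact, which matters because one UACE code can contain several agencies and modes.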

In [170]:
ntd_fuel_energy_data = {}
ntd_to_uace = {}

def refactor_fuel_energy_data(url_entry):
    '''
    Refactor one year of NTD Fuel and Energy data:
    1) Load the data from the downloaded file
    2) Convert the rows into dictionaries
    3) Group all the rows by UACE code
    4) Store the grouped rows under their year in ntd_fuel_energy_data
    '''
    # Load in data
    df = load_dataframe(url_entry, "fuel-energy", "Fuel and Energy")
    # Ensure there are no NaN values
    df = df.fillna(0)
    # Convert all rows that are arrays into dictionaries
    converted_rows = convert_arrays_dictionary(df, url_entry["year"])
    # Group all rows by UACE code
    aggregate_data = group_by_uace(converted_rows)
    # Add to data
    ntd_fuel_energy_data[url_entry["year"]] = aggregate_data
    print("Fuel and Energy data year " + url_entry["year"] + " is finished!")

def load_dataframe(url_entry, name, sheet_name):
    '''
    Open a dataframe for the file we just downloaded based upon what file type it was
    '''
    df = pd.DataFrame()
    file_name = url_entry["year"] + "-" + name + "." + url_entry["file_type"]
    if url_entry["file_type"] == "csv":
        df = pd.read_csv(file_name)
    elif url_entry["file_type"] == "xlsx" or url_entry["file_type"] == "xlsm":
        df = pd.read_excel(file_name, sheet_name=sheet_name)
    elif url_entry["file_type"] == "zip":
        with zipfile.ZipFile(file_name, "r") as zip_ref:
            with zip_ref.open(zip_ref.namelist()[0]) as xlsm_file:
                df = pd.read_excel(xlsm_file, sheet_name=sheet_name)
    return df
    
def convert_arrays_dictionary(df, year):
    '''
    Takes in a dataframe and converts all of the rows into a dictionary.
    '''
    converted_rows = []
    # Previous years had a different column name structure than 2022
    if int(year) < 2022:
        # Map column names to each value in the row, and remove "Questionable" fields
        for _, row in df.iterrows():
            temp_row = row.to_dict()
            converted_row = {}
            miles = False
            for k, v in temp_row.items():
                if k == "Miles Traveled by Vehicles Fueled by:" or k == "Miles per Gallon/KwH:":
                    # Once we get to the start or end of the miles columns, note it
                    miles = not miles
                    continue
                if miles:
                    # If at the miles column, add it to the name
                    k = k + " (miles)"
                if "Questionable" not in str(k):
                    # Only add k,v back if it is not a "Questionable" field
                    converted_row[k] = v
            converted_row["UACE Code"] = ntd_to_uace.get(converted_row["NTD ID"], -1)
            converted_rows.append(converted_row)
    else:
        converted_rows = [{k: v for k, v in row.to_dict().items() if "Questionable" not in k} for _, row in df.iterrows()]
    
    return converted_rows

def group_by_uace(converted_rows):
    '''
    Organize the data into UACE codes.
    Example data:
    {
        "uace_code": [
            {
                "field1": 1,
                ...
            },
            {
                "field1": 1,
                ...
            }
        ]
    } 
    '''
    aggregate_data = defaultdict(list)
    for row in converted_rows:
        code = row["UACE Code"]
        aggregate_data[code].append(row)
    return aggregate_data

def map_ntd_to_uace():
    '''
    Use 2022 data to create a mapping between NTD ids and UACE codes because older versions of the data
    2018-2021 don't have an UACE field.
    '''
    df = pd.read_csv("2022-fuel-energy.csv")
    for _, row in df.iterrows():
        temp = row.to_dict()
        ntd_to_uace[temp["NTD ID"]] = temp["UACE Code"]
        
def delete_old_files(file_name, urls):
    '''
    Delete the downloaded source files for a dataset once they are no longer needed
    '''
    for url_entry in urls:
        file_path = url_entry["year"] + "-" + file_name + "." + url_entry["file_type"]
        try:
            os.remove(file_path)
            print(f"The file {file_path} has been removed successfully.")
        except FileNotFoundError:
            print(f"The file {file_path} does not exist.")
        except Exception as e:
            print(f"An error occurred: {e}")

# Create NTD to UACE mapping
map_ntd_to_uace()
            
# For each year of data, refactor it
for url_entry in ntd_fuel_energy_urls:
    refactor_fuel_energy_data(url_entry)

# Add all the data to a file
ntd_fuel_energy_data_json = json.dumps(ntd_fuel_energy_data, indent=2)
with open("ntd_fuel_energy.json", 'w') as file:
    file.write(ntd_fuel_energy_data_json)
    print("NTD Fuel and Energy dataset can now be found at ntd_fuel_energy.json")

# Remove the old files
delete_old_files("fuel-energy", ntd_fuel_energy_urls)

Service data year 2022 is finished!
Service data year 2021 is finished!
Service data year 2020 is finished!
Service data year 2019 is finished!
Service data year 2018 is finished!
NTD Fuel and Energy dataset can now be found at ntd_fuel_energy.json
The file 2022-fuel-energy.csv has been removed successfully.
The file 2021-fuel-energy.xlsx has been removed successfully.
The file 2020-fuel-energy.xlsx has been removed successfully.
The file 2019-fuel-energy.zip has been removed successfully.
The file 2018-fuel-energy.xlsm has been removed successfully.


### <a id="section3">Unify the NTD Service (PMT) Data</a>

Now that the NTD Fuel and Energy data is unified into a JSON file, we do the same for the Service (PMT) data. We trim the data considerably and store only the values important to us, because the 2018-2022 files take up 50 MB when combined. The fields we keep are:

`["Total Vehicle Miles", "Vehicle Revenue Miles", "Vehicle Deadhead Miles", "Total Train Miles", "Train Revenue Miles", "PMT"]`

This step is important because, to properly assess the carbon emitted when a person uses public transit, we need to know how many people were using the service at the same time. It is not reasonable to ask users to report how many people were on board with them, so the next best thing is an estimate. We take the total passenger miles traveled (PMT) for each agency and divide it by the total miles of service the agency provided, which gives the average number of passengers carried per mile of service.
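
The estimate described above can be sketched as a small function. This is a minimal illustration that assumes rows already use the unified field names and comma-formatted number strings, and that "miles of service" means vehicle revenue miles plus train revenue miles. The function names and sample numbers are made up.

```python
def to_int(value):
    """Parse a number that may be formatted with thousands separators."""
    return int(str(value).replace(",", ""))


def average_load(rows):
    """Estimate average passengers on board as total PMT / total revenue miles."""
    total_pmt = sum(to_int(r["PMT"]) for r in rows)
    total_miles = sum(
        to_int(r["Vehicle Revenue Miles"]) + to_int(r["Train Revenue Miles"])
        for r in rows
    )
    # Guard against agencies that report no service miles.
    return total_pmt / total_miles if total_miles else 0.0


# Two made-up rows: 1,500,000 passenger miles over 150,000 revenue miles.
sample_rows = [
    {"PMT": "1,200,000", "Vehicle Revenue Miles": "100,000", "Train Revenue Miles": "0"},
    {"PMT": "300,000", "Vehicle Revenue Miles": "50,000", "Train Revenue Miles": "0"},
]
print(average_load(sample_rows))  # 10.0
```

Because both the numerator and the denominator are in miles, the ratio is unitless and is the same whether the inputs are converted to kilometers or not.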

In [171]:
ntd_service_pmt_data = {}

def refactor_service_pmt_data(url_entry):
    # Load in dataframe
    df = load_dataframe(url_entry, "service-pmt", "Annual Service Data By Mode")
    # Get rid of all NaN values
    df = df.fillna(0)
    # Convert all array rows into dictionaries
    converted_rows = convert_and_filter_rows(url_entry["year"], df)
    # Group all of the rows by NTD ID, and then mode that they pertain to
    aggregate_data = group_by_ntd_mode(converted_rows)
    # Add to overall data
    ntd_service_pmt_data[url_entry["year"]] = aggregate_data
    print("Service data year " + url_entry["year"] + " is finished!")

def convert_and_filter_rows(year, df):
    '''
    Convert each array into a dictionary, filter out fields we don't want, and then change field names for unification.
    '''
    # Specify the fields we want to keep so we aren't saving redundant data
    fields_2022 = ["NTD ID","Mode","Actual Vehicle/Passenger Car Miles","Actual Vehicle/Passenger Car Revenue Miles","Actual Vehicle/ Passenger Deadhead Miles","Train Miles","Train Revenue Miles","Passenger Miles Traveled"]  
    fields = ["NTD ID","Mode","Vehicle Miles","Vehicle Revenue Miles","Deadhead Miles","Train Miles","Train \nRevenue\nMiles","Passenger Miles"] 

    # Specify what we want to change the key's name to to unify the data
    convert = ["NTD ID","Mode","Total Vehicle Miles","Vehicle Revenue Miles","Vehicle Deadhead Miles","Total Train Miles","Train Revenue Miles","PMT"]
    fields_2022_convert = dict(zip(fields_2022, convert))
    fields_convert = dict(zip(fields, convert))

    converted_rows = []

    # Convert each row into a dictionary, keep only the keys listed in "fields" (or "fields_2022"),
    # and rename each kept key to its shorter, unified name
    if int(year) < 2022:
        converted_rows = [{(fields_convert[k]): v for k, v in row.to_dict().items() if k in fields} for _,row in df.iterrows()]
    else:
        converted_rows = [{(fields_2022_convert[k]): v for k, v in row.to_dict().items() if k in fields_2022} for _,row in df.iterrows()]
    
    return converted_rows

def group_by_ntd_mode(converted_rows):
    '''
    Group each row by NTD ID, and then by mode. Each mode stores a list
    of rows because there can be multiple instances of the same mode within one NTD ID.
    Example Structure:
    {
        "ntd_id": {
            "mode": [
                {
                    "field1": 1,
                    "field2": 2,
                    ...
                },
                {
                    "field1": 1,
                    "field2": 2,
                    ...
                }
            ]
        }
    }
    '''
    aggregate_data = defaultdict(lambda: defaultdict(list))
    for row in converted_rows:
        ntd_id = row["NTD ID"]
        mode = row["Mode"]
        aggregate_data[ntd_id][mode].append(row)
    return aggregate_data

for url_entry in ntd_service_pmt_urls:
    refactor_service_pmt_data(url_entry)

# Add all the data to a file
ntd_service_pmt_data_json = json.dumps(ntd_service_pmt_data, indent=2)
with open("ntd_service_pmt.json", 'w') as file:
    file.write(ntd_service_pmt_data_json)
    print("NTD Service (PMT) dataset can now be found at ntd_service_pmt.json")

# Delete all old files
delete_old_files("service-pmt", ntd_service_pmt_urls)

Service data year 2022 is finished!
Service data year 2021 is finished!
Service data year 2020 is finished!
Service data year 2019 is finished!
Service data year 2018 is finished!
NTD Service (PMT) dataset can now be found at ntd_fuel_energy.json
The file 2022-service-pmt.csv has been removed successfully.
The file 2021-service-pmt.xlsx has been removed successfully.
The file 2020-service-pmt.zip has been removed successfully.
The file 2019-service-pmt.zip has been removed successfully.
The file 2018-service-pmt.xlsm has been removed successfully.


### <a id="section4">Carbon calculations</a>

In [186]:
# Temporary constants
# https://www.epa.gov/energy/greenhouse-gases-equivalencies-calculator-calculations-and-references
KG_CO2_PER_GALLON_GASOLINE = 8.89
KG_CO2_PER_GALLON_DIESEL = 10.18
KG_CO2_PER_GALLON_BIODIESEL = KG_CO2_PER_GALLON_DIESEL * .26 # https://afdc.energy.gov/fuels/biodiesel-benefits
KG_CO2_PER_GALLON_LPG = 5.75 # https://www.eia.gov/environment/emissions/co2_vol_mass.php
KG_CO2_PER_GALLON_CNG = KG_CO2_PER_GALLON_GASOLINE * 1.22 # https://www.ctc-n.org/technology-library/vehicle-and-fuel-technologies/compressed-natural-gas-cng-fuel
KG_CO2_PER_KG_HYDROGEN = 0 
KG_CO2_PER_KWH_ELECTRICITY = 0.5 # Figure out way to integrate this with eGrid work

DIESEL_GGE = 1.136 # from energy.gov
KWH_PER_GALLON_GASOLINE = 33.7 # from the EPA, used as the basis for MPGe
KWH_PER_GALLON_DIESEL = KWH_PER_GALLON_GASOLINE * 1.14
# GGE constants found from https://epact.energy.gov/fuel-conversion-factors
KWH_PER_GALLON_BIODIESEL = KWH_PER_GALLON_GASOLINE * 1.05 
KWH_PER_GALLON_LPG = KWH_PER_GALLON_GASOLINE * .74
KWH_PER_GALLON_CNG = KWH_PER_GALLON_GASOLINE * .26
KWH_PER_KG_HYDROGEN = KWH_PER_GALLON_GASOLINE * 1.00

# Import data
with open('ntd_fuel_energy.json', 'r') as file:
    fuel_energy_data = json.load(file)

with open("ntd_service_pmt.json", 'r') as file:
    service_pmt_data = json.load(file)

factors = {
    "Gasoline": {"kWh_per_unit":  KWH_PER_GALLON_GASOLINE, "kg_CO2_per_unit":  KG_CO2_PER_GALLON_GASOLINE},
    "Diesel": {"kWh_per_unit":  KWH_PER_GALLON_DIESEL, "kg_CO2_per_unit":  KG_CO2_PER_GALLON_DIESEL},
    "Bio-Diesel": {"kWh_per_unit":  KWH_PER_GALLON_BIODIESEL, "kg_CO2_per_unit":  KG_CO2_PER_GALLON_BIODIESEL},
    "Liquefied Petroleum Gas": {"kWh_per_unit":  KWH_PER_GALLON_LPG, "kg_CO2_per_unit":  KG_CO2_PER_GALLON_LPG},
    "Compressed Natural Gas": {"kWh_per_unit":  KWH_PER_GALLON_CNG, "kg_CO2_per_unit":  KG_CO2_PER_GALLON_CNG},
    "Hydrogen": {"kWh_per_unit":  KWH_PER_KG_HYDROGEN, "kg_CO2_per_unit":  KG_CO2_PER_KG_HYDROGEN},
    "Electric Propulsion": {"kWh_per_unit": 1, "kg_CO2_per_unit":  KG_CO2_PER_KWH_ELECTRICITY},
    "Electric Battery": {"kWh_per_unit": 1, "kg_CO2_per_unit":  KG_CO2_PER_KWH_ELECTRICITY},
    "Other Fuel": {"kWh_per_unit":  KWH_PER_GALLON_GASOLINE, "kg_CO2_per_unit":  KG_CO2_PER_GALLON_GASOLINE},
}

mode_conversion = {
    "Bus": ["CB", "MB", "RB", "TB"],
    "Train": ["LR", "CC", "SR", "TR", "CR", "HR", "MG", "YR"],
    "": []
}

def get_ntd_ids_by_uace(code, year):
    '''
    Given an UACE code, find all the NTD ids within it. Necessary because the PMT data
    only has NTD ids, so this helps us bridge the gap between Fuel + Energy data and Service (PMT) data.
    '''
    ids = set()
    for row in fuel_energy_data[year][code]:
        ids.add(str(row["NTD ID"]))
    return ids

def average_passengers(code, modes, year):
    '''
    Calculate the average number of passengers using public transit given the constraints of UACE code, modes, and year.
    To do this we do the following steps:

    1) Gather all NTD ids in a given UACE code. 
       - This is necessary as the Service/PMT data uses NTD ids instead of UACE codes, so we must convert our UACE code into its corresponding NTD ids that make it up. 
    2) Search through each NTD id (aka agency) in our UACE code and see if it has data on our modes we are searching for.
    3) If the agency has information on the modes we are looking for, add it to our aggregate_modes array.
    4) Sum up all the instances of "Total Miles" and "PMT" in all the mode data we found.
    '''
    # Find all NTD id's by UACE code
    ntd_ids = get_ntd_ids_by_uace(code, year)
    # Goal is to collect all data about each mode 
    aggregate_modes = []
    # Search through each ntd_id in a given UACE code
    for ntd_id in ntd_ids:
        # Get the agency based on the ntd_id, if it doesn't exist we can skip
        agency = service_pmt_data[year].get(ntd_id, None)
        if agency is None:
            continue
        # Given an agency with an ntd_id in our UACE code, search through the agency's modes for the ones we are looking for
        for mode in agency:
            # If we find a mode within the agency we are looking for, add to our aggregate data.
            if mode in modes:
                aggregate_modes.append(agency[mode])
    # Sum up all the miles
    total_miles = sum(
        (int(mode["Vehicle Revenue Miles"].replace(",", "")) + int(mode["Train Revenue Miles"].replace(",", "")) )
        # (int(mode["Total Vehicle Miles"].replace(",", "")) + int(mode["Total Train Miles"].replace(",", "")) )
        for modes in aggregate_modes
        for mode in modes
    )
    # Sum up all the PMTs
    total_pmt = sum(
        int(mode["PMT"].replace(",", ""))
        for modes in aggregate_modes
        for mode in modes
    )
    # Convert to km
    total_kms =  total_miles * 1.60934
    total_pkt = total_pmt * 1.60934
    avg = 0
    if total_kms != 0:
        avg = float(total_pkt) / float(total_kms)
    return (avg, aggregate_modes)

def aggregate_fuel_data(code, modes, year, fields, get_factor):
    '''
    Aggregate and sum all fields provided in a given year, area code, and modes. Then applies a constant
    factor to the item which is obtained by passing in a "field_name" to get_factor().
    '''
    # Store all the totals in one big dictionary
    totals = defaultdict(int)
    aggregate_data = []
    # Look through every entry in our fuel data based on year and area code
    for entry in fuel_energy_data[year][code]:
        # Only care about modes that we have specified
        if entry["Mode"] in modes:
            # Create new trimmed down object for all the fields we care about
            new_entry = {"NTD ID": entry["NTD ID"], "Mode": entry["Mode"]}
            # Keep track of total that all fields sum up to
            total_value = 0
            # Extract and sum the data for each field
            for field in fields:
                field_num = int(entry[field].replace(",",""))
                total_value += field_num
                # Copy field we want into our new trimmed object
                new_entry[field] = field_num
                # Trim the name, e.g. "Gasoline (miles)" --> "Gasoline"
                field_name = field.split(" (").pop(0)
                # Put the data in our totals
                totals[field_name] += field_num * get_factor(field_name)
            # Add in the total_value
            new_entry["Total"] = total_value
            # Add filtered dictionary to aggregate_data
            aggregate_data.append(new_entry)
        
    return (totals, aggregate_data)

def aggregate_total_whs(code, modes, year):
    '''
    Finds total Wh (kWh per unit * 1000) in a given year, area code, and between modes
    '''
    fields = ["Diesel (gal)","Gasoline (gal)","Liquefied Petroleum Gas (gal equivalent)","Compressed Natural Gas (gal equivalent)","Other Fuel (gal/gal equivalent)","Electric Propulsion (kWh)","Electric Battery (kWh)"]
    if int(year) >= 2022:
        fields.append("Hydrogen (kg)")
    get_factor = lambda factor: factors[factor]["kWh_per_unit"] * 1000
    (total_whs, aggregate_gallons_data) = aggregate_fuel_data(code, modes, year, fields, get_factor)
    return (total_whs, aggregate_gallons_data)

def aggregate_total_kms(code, modes, year):
    '''
    Finds total KMs in a given year, area code, and between modes
    '''
    fields = ["Diesel (miles)","Gasoline (miles)","Liquefied Petroleum Gas (miles)","Compressed Natural Gas (miles)","Other Fuel (miles)","Electric Propulsion (miles)","Electric Battery (miles)"]
    if int(year) >= 2022:
        fields.append("Hydrogen (miles)")
    get_factor = lambda _: 1.60934
    (total_kms, aggregate_miles_data) = aggregate_fuel_data(code, modes, year, fields, get_factor)
    return (total_kms, aggregate_miles_data)

def calculate_weights(aggregate_gallons_data, aggregate_modes):
    '''
    Calculate the weights of each fuel type. The calculation is described in this GitHub comment:
    https://github.com/JGreenlee/e-mission-common/pull/2#issuecomment-2252070684
    '''
    refactored = defaultdict(dict)
    # First thing we want to do is to combine the aggregate_modes into the ntd_id: { mode1: {}, mode2:{} } format
    for modes in aggregate_modes:
        for mode in modes:
            ntd_id = mode['NTD ID']
            mode_type = mode['Mode']
            pkt = int(mode["PMT"].replace(",","")) * 1.60934
            if ntd_id in refactored and mode_type in refactored[ntd_id]:
                # Check to see if mode already exists within ntd_id, if it does then add the values
                refactored[ntd_id][mode_type]["PKT"] += pkt
            else:
                # If mode doesn't exist, just add it in
                refactored[ntd_id][mode_type] = {"PKT": pkt}

    # Now we want to add all the gallon data into refactored
    for agency in aggregate_gallons_data:
        ntd_id = agency["NTD ID"]
        mode = agency["Mode"]
        # Add in the gallon data, and remove NTD ID + Mode because we already store that info
        refactored[ntd_id][mode].update(agency)
        refactored[ntd_id][mode].pop("NTD ID")
        refactored[ntd_id][mode].pop("Mode")

    # Find the percentage of each fuel type used and add up total passenger km
    total_passenger_km = 0
    for agency in refactored:
        for mode in refactored[agency]:
            for field in refactored[agency][mode]:
                if field == "PKT":
                    # Add up total passenger kms 
                    total_passenger_km += refactored[agency][mode][field]
                elif field != "Total":
                    # Calculating averages for each fuel type within each mode
                    if refactored[agency][mode][field] != 0:
                        refactored[agency][mode][field] /= refactored[agency][mode]["Total"]

    # Calculate the weight by agency & mode based off of the PKT, and then adjust the percentages of the fuel types using that weight
    for agency in refactored:
        for mode in refactored[agency]:
            # Calculate the weight by passenger kilometers traveled
            refactored[agency][mode]["weight_by_pkt"] = refactored[agency][mode]["PKT"] / total_passenger_km
            for field in refactored[agency][mode]:
                if field != "weight_by_pkt" and field != "PKT" and field != "Total":
                    if refactored[agency][mode][field] != 0:
                        refactored[agency][mode][field] *= refactored[agency][mode]["weight_by_pkt"]

    fuel_type_weights = defaultdict(int)

    # Add up all the percentages of each fuel type
    for agency in refactored:
        for mode in refactored[agency]:
            for field in refactored[agency][mode]:
                if field != "weight_by_pkt" and field != "PKT" and field != "Total":
                    fuel_name = field.split(" (").pop(0)
                    fuel_type_weights[fuel_name] += refactored[agency][mode][field]

    return fuel_type_weights

def calculate(trip, modes):
    year = trip["year"]
    code = trip["code"]
    # Set a default value in case we get a KeyError somewhere and have to return something
    fuel_efficiencies = {"Diesel": {"wh_per_km": 1000, "weight": 1.0}}

    try:
        # Test to see if year and code exist
        _ = fuel_energy_data[year]
        _ = fuel_energy_data[year][code]

        # Get all the data we need to compute efficiencies, 
        (total_kms, aggregate_miles_data) = aggregate_total_kms(code, modes, year)
        (total_whs, aggregate_gallons_data) = aggregate_total_whs(code, modes, year)
        (average_number_passengers, aggregate_modes) = average_passengers(code, modes, year)
        weights = calculate_weights(aggregate_gallons_data, aggregate_modes)

    except KeyError as e:
        print(f"Key not found: {e}")
        return fuel_efficiencies

    for fuel in total_kms:
        wh_per_km = 0
        wh_per_km_passenger = 0
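        # Skip the division if there is no distance or no passenger estimate (avoids ZeroDivisionError)
        if total_kms[fuel] != 0 and average_number_passengers != 0: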
            wh_per_km = total_whs[fuel] / total_kms[fuel]
            wh_per_km_passenger = wh_per_km / average_number_passengers

        fuel_efficiencies[fuel] = {
            "wh_per_km": wh_per_km_passenger,
            "weight": weights[fuel]
        }

    print(average_number_passengers)    
    
    print(json.dumps(fuel_efficiencies, indent=4))


fake_trip = {
    "year": "2022",
    "distance": 1000,
    "code": "63217"
}
modes = ["MB","CB"]

calculate(fake_trip, modes)

11.484986876537171
{
    "Diesel": {
        "wh_per_km": 547.2951046624571,
        "weight": 0.8485528516203305
    },
    "Gasoline": {
        "wh_per_km": 152.49903279152704,
        "weight": 0.001856387118270534
    },
    "Liquefied Petroleum Gas": {
        "wh_per_km": 0,
        "weight": 0
    },
    "Compressed Natural Gas": {
        "wh_per_km": 137.95315732765698,
        "weight": 0.13031106613939686
    },
    "Other Fuel": {
        "wh_per_km": 0,
        "weight": 0
    },
    "Electric Propulsion": {
        "wh_per_km": 0,
        "weight": 0
    },
    "Electric Battery": {
        "wh_per_km": 134.869174597796,
        "weight": 0.008713526812239027
    },
    "Hydrogen": {
        "wh_per_km": 0,
        "weight": 0
    }
}
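
One possible way to collapse the per-fuel results above into a single figure is a weight-averaged intensity. The sketch below is an assumption about how the `weight` values might be used downstream, not something the notebook itself does. It copies the non-zero entries from the output above, and the function name `weighted_wh_per_km` is hypothetical.

```python
# Non-zero entries copied from the output above.
fuel_efficiencies = {
    "Diesel": {"wh_per_km": 547.2951046624571, "weight": 0.8485528516203305},
    "Gasoline": {"wh_per_km": 152.49903279152704, "weight": 0.001856387118270534},
    "Compressed Natural Gas": {"wh_per_km": 137.95315732765698, "weight": 0.13031106613939686},
    "Electric Battery": {"wh_per_km": 134.869174597796, "weight": 0.008713526812239027},
}


def weighted_wh_per_km(efficiencies):
    """Average Wh per passenger-km across fuels, weighted by each fuel's weight.

    The weights are normalized by their sum because they need not add up to 1.
    """
    total_weight = sum(e["weight"] for e in efficiencies.values())
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(e["wh_per_km"] * e["weight"] for e in efficiencies.values())
    return weighted_sum / total_weight


print(weighted_wh_per_km(fuel_efficiencies))
```

Normalizing by the sum of the weights is a design choice made here so the result stays a true average even when some fuel shares are missing from the input.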
