# Augment: Intercity Passenger Rail Service Station Performance Metrics

This notebook augments the quarterly [Amtrak](https://www.amtrak.com/home.html) station performance
metrics with additional information about each station. The station information is sourced from the
US Department of Transportation (DOT), Bureau of Transportation Statistics (BTS), ArcGIS Online
[Amtrak Stations](https://geodata.bts.gov/datasets/1ed62a9f46304679aaa396bed4c8565a_0/about) layer.
It describes the location of each station, including the station name, city, state, and
geo coordinates.

### Variable names

A number of variable names in this project use the following abbreviations. The naming
strategy aims to balance brevity and readability:

* `amtk`: Amtrak (reporting mark)
* `chrt`: chart
* `cols`: columns
* `const`: constant
* `cwd`: current working directory
* `eb`: eastbound direction of travel
* `lm`: linear model
* `mi`: miles
* `mm`: minutes (ISO 8601)
* `nb`: northbound direction of travel
* `psgr`: passenger
* `qtr`: quarter
* `rte`: route
* `sb`: southbound direction of travel
* `stats`: summary statistics
* `stn`: station
* `stns`: stations
* `svc`: service
* `trn`: train
* `wb`: westbound direction of travel

In [1]:
import json
import numpy as np
import pandas as pd
import pathlib as pl
import re
import tomllib as tl

import fra_amtrak.amtk_frame as frm

# Set random seed
rdg = np.random.default_rng(24)

## 1.0 Read files

### 1.1 Resolve paths

Create `pathlib.Path` objects that represent absolute paths to the `data/raw`, `data/interim`, and `data/processed` directories.

In [2]:
parent_path = pl.Path.cwd()  # current working directory
parent_path

data_raw_path = parent_path.joinpath("data", "raw")
data_interim_path = parent_path.joinpath("data", "interim")
data_processed_path = parent_path.joinpath("data", "processed")

### 1.2 Load constants

Load a companion [TOML](https://toml.io/en/) file containing constants.

In [3]:
filepath = parent_path.joinpath("notebook.toml")
with open(filepath, "rb") as file_obj:
    const = tl.load(file_obj)

# Access constants
COLS = const["columns"]


### 1.3 Retrieve performance data

In [4]:
filepath = data_interim_path.joinpath("station_performance_metrics-v1p1.csv")
stations = pd.read_csv(filepath)

### 1.4 Review the `DataFrame`

In [5]:
stations.shape

(68412, 16)

In [6]:
stations.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68412 entries, 0 to 68411
Data columns (total 16 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   Fiscal Year                             68412 non-null  int64  
 1   Fiscal Quarter                          68412 non-null  int64  
 2   Service Line                            68412 non-null  object 
 3   Service                                 68412 non-null  object 
 4   Sub Service                             68412 non-null  object 
 5   Train Number                            68412 non-null  int64  
 6   Arrival Station Code                    68412 non-null  object 
 7   Arrival Station Name                    68412 non-null  object 
 8   Arrival Station                         68412 non-null  object 
 9   State                                   68412 non-null  object 
 10  Division                                68412 non-null  ob

In [7]:
stations.head(3)

Unnamed: 0,Fiscal Year,Fiscal Quarter,Service Line,Service,Sub Service,Train Number,Arrival Station Code,Arrival Station Name,Arrival Station,State,Division,Region,Country,Total Detraining Customers,Late Detraining Customers,Late Detraining Customers Avg Min Late
0,2024,3,Long Distance,Auto Train,Auto Train,52,LOR,"Lorton (Auto Train), Virginia",Lorton (Auto Train),Virginia,South Atlantic,South,United States,42445,23316,95.0
1,2024,3,Long Distance,Auto Train,Auto Train,53,SFA,"Sanford (Auto Train), Florida",Sanford (Auto Train),Florida,South Atlantic,South,United States,28034,18439,91.0
2,2024,3,Long Distance,California Zephyr,California Zephyr,5,BRL,"Burlington, Iowa",Burlington,Iowa,West North Central,Midwest,United States,557,223,54.0


## 2.0 Add route miles

Every named train is associated with a route that Amtrak measures in miles. The route miles data was sourced
from the FRA's
[_Methodology Report for the Performance and Service Quality of Intercity Passenger Train Operations_](https://railroads.dot.gov/sites/fra.dot.gov/files/2024-08/Methodology%20Report_FY24Q3_web.pdf) (FY 2024 v.2), pp. 12-15.

### 2.1 Retrieve data

In [8]:
with open(data_processed_path.joinpath("amtk_sub_services.json"), "r") as file:
    amtk_sub_svcs = json.load(file)

route_miles = [
    {"Route": route["sub service"], "Route Miles": sum([host["miles"] for host in route["hosts"]])}
    for route in amtk_sub_svcs
]

# Create DataFrame
route_miles = pd.DataFrame.from_dict(route_miles, orient="columns")
route_miles

Unnamed: 0,Route,Route Miles
0,Auto Train,914
1,California Zephyr,2408
2,Capitol Ltd,788
3,Cardinal,1140
4,City Of New Orleans,930
5,Coast Starlight,1388
6,Crescent,1367
7,Empire Builder,2560
8,Lake Shore Ltd,1255
9,Palmetto,885


### 2.2 Combine data [1 pt]

Add `route_miles` to the `stations` `DataFrame`. Once the data is combined, move the "Route Miles"
column from the last position to the fifth (`5th`) position in `stations`. Drop any redundant
columns after reordering the columns.
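
One possible way to approach this step is sketched here on toy data. The frames `stations_demo` and `route_miles_demo` are illustrative stand-ins, and the use of the nullable `Int64` dtype and of `DataFrame.pop()` with `DataFrame.insert()` is an assumption about style, not the notebook's required solution.

```python
import pandas as pd

# Toy stand-ins for `stations` and `route_miles` (illustrative values only).
stations_demo = pd.DataFrame({
    "Service": ["Auto Train", "Cardinal", "Unknown Svc"],
    "Sub Service": ["Auto Train", "Cardinal", "Unknown Svc"],
    "Train Number": [52, 50, 999],
})
route_miles_demo = pd.DataFrame({
    "Route": ["Auto Train", "Cardinal"],
    "Route Miles": [914, 1140],
})

# A left join keeps every station row, matched or not.
merged = stations_demo.merge(
    route_miles_demo, left_on="Service", right_on="Route", how="left"
)

# The nullable integer dtype keeps unmatched routes as <NA> instead of 0.
merged["Route Miles"] = merged["Route Miles"].astype("Int64")

# Drop the redundant join key, then move "Route Miles" to index 2.
merged = merged.drop(columns="Route")
merged.insert(2, "Route Miles", merged.pop("Route Miles"))
print(merged.columns.tolist())
```

Filling unmatched rows with `0` also works, but it makes a missing route indistinguishable from a zero-mile route.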

In [9]:
# YOUR CODE HERE
stations = pd.merge(stations, route_miles, left_on="Service", right_on="Route", how="left")

# Convert to int with replacement
stations['Route Miles'] = stations['Route Miles'].replace([np.inf, -np.inf], np.nan)
stations['Route Miles'] = stations['Route Miles'].fillna(0).astype(int)

# Get column names
cols = list(stations.columns)

# Remove the last column
last_col = cols.pop()

# Insert it at index 5 (directly after "Sub Service")
cols.insert(5, last_col)

# Reorder the DataFrame
stations = stations[cols]

# Drop columns
stations = stations.drop(columns=["Route"])

stations.info()
stations.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68412 entries, 0 to 68411
Data columns (total 17 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   Fiscal Year                             68412 non-null  int64  
 1   Fiscal Quarter                          68412 non-null  int64  
 2   Service Line                            68412 non-null  object 
 3   Service                                 68412 non-null  object 
 4   Sub Service                             68412 non-null  object 
 5   Route Miles                             68412 non-null  int64  
 6   Train Number                            68412 non-null  int64  
 7   Arrival Station Code                    68412 non-null  object 
 8   Arrival Station Name                    68412 non-null  object 
 9   Arrival Station                         68412 non-null  object 
 10  State                                   68412 non-null  ob

Unnamed: 0,Fiscal Year,Fiscal Quarter,Service Line,Service,Sub Service,Route Miles,Train Number,Arrival Station Code,Arrival Station Name,Arrival Station,State,Division,Region,Country,Total Detraining Customers,Late Detraining Customers,Late Detraining Customers Avg Min Late
0,2024,3,Long Distance,Auto Train,Auto Train,914,52,LOR,"Lorton (Auto Train), Virginia",Lorton (Auto Train),Virginia,South Atlantic,South,United States,42445,23316,95.0
1,2024,3,Long Distance,Auto Train,Auto Train,914,53,SFA,"Sanford (Auto Train), Florida",Sanford (Auto Train),Florida,South Atlantic,South,United States,28034,18439,91.0
2,2024,3,Long Distance,California Zephyr,California Zephyr,2408,5,BRL,"Burlington, Iowa",Burlington,Iowa,West North Central,Midwest,United States,557,223,54.0
3,2024,3,Long Distance,California Zephyr,California Zephyr,2408,5,COX,"Colfax, California",Colfax,California,Pacific,West,United States,508,326,99.0
4,2024,3,Long Distance,California Zephyr,California Zephyr,2408,5,CRN,"Creston, Iowa",Creston,Iowa,West North Central,Midwest,United States,205,144,67.0


In [10]:
#hidden tests are within this cell

## 3.0 Add location data

The Bureau of Transportation Statistics (BTS) maintains an [Amtrak stations](https://data-usdot.opendata.arcgis.com/datasets/amtrak-stations/about) dataset that provides mapping (i.e., location) information.

### 3.1 Retrieve data

In [11]:
filepath = data_raw_path.joinpath("NTAD_Amtrak_Stations_-3056704789218436106.csv")
ntad_stations = pd.read_csv(filepath)

### 3.2 Filter data [1 pt]

Filter out all bus stations and reset the index.

In [12]:
# YOUR CODE HERE
ntad_stations = ntad_stations[ntad_stations["StnType"] != "BUS"].reset_index(drop=True)

In [13]:
#hidden tests are within this cell

### 3.3 Drop columns [1 pt]

Drop the following columns. They are not required for the analysis.

* OBJECTID
* StnType
* State
* Name
* StationName
* StationFacilityName
* StationAliases
* DateModif
* x
* y

Retain only the "StaType", "ZipCode", "City", "Address2", "Address1", "Code", "lon", and "lat" columns.

In [14]:
# YOUR CODE HERE
ntad_stations = ntad_stations.drop(columns=["OBJECTID", "StnType", "State", "Name", "StationName", "StationFacilityName", "StationAliases", "DateModif", "x", "y"])

In [15]:
#hidden tests are within this cell

## 4.0 Clean data

### 4.1 Blank and missing values

The check below returns no columns that contain empty strings or missing values.

In [16]:
# Combined condition to check for empty strings or NaN
mask = (ntad_stations == "") | pd.isna(ntad_stations)
empty_nan_values = ntad_stations.columns[mask.any()]
empty_nan_values

# Count empty or NaN values
# empty_nan_counts = ntad_stations[empty_nan_values].apply(lambda x: x.isin(["", np.nan]).sum())
# empty_nan_counts

Index([], dtype='object')

### 4.2 Normalize strings

Trim leading and trailing spaces from each string value. Also search for and remove unnecessary spaces within each string value using a regular expression pattern (`re.Pattern`). Call the function `frm.normalize_dataframe_strings()` to perform this operation.

#### 4.2.1 Locate suspect strings

As is illustrated below, the regex pattern to employ is `"\s{2,}"`.
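
To show what the pattern matches, this standalone sketch applies it to two illustrative strings using only the standard library. It is not the implementation of `frm.normalize_dataframe_strings()`, whose internals are not shown in this notebook; the sample strings and the name `EXTRA_SPACES` are made up for the example.

```python
import re

# Two or more consecutive whitespace characters.
EXTRA_SPACES = re.compile(r"\s{2,}")

samples = ["  320 1st  Street SW ", "74 Main Street"]  # illustrative values

for text in samples:
    trimmed = text.strip()                    # drop leading/trailing whitespace
    has_extra = bool(EXTRA_SPACES.search(trimmed))
    cleaned = EXTRA_SPACES.sub(" ", trimmed)  # collapse each run to one space
    print(f"{text!r} -> {cleaned!r} (extra spaces found: {has_extra})")
```

The raw-string prefix (`r"..."`) keeps `\s` from being treated as a string escape sequence.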

In [17]:
# Locate extra spaces in all string columns
extra_spaces = ntad_stations.select_dtypes(include="object").apply(
    lambda x: x.str.contains(r"\s{2,}").sum()
)
extra_spaces

StaType     0
ZipCode     0
City        0
Address2    0
Address1    3
Code        0
dtype: int64

#### 4.2.2 Clean strings [1 pt]

In [18]:
# YOUR CODE HERE
ntad_stations = frm.normalize_dataframe_strings(frame=ntad_stations, pattern=r"\s{2,}")

In [19]:
#hidden tests are within this cell

## 5.0 Manipulate data

### 5.1 Rename the columns

Note the use of the `COLS` constants for the new column names.

In [20]:
mapper = {
    "StaType": COLS["station_type"],
    "ZipCode": COLS["zip_code"],
    "City": COLS["city"],
    "Address2": COLS["address_02"],
    "Address1": COLS["address_01"],
    "Code": "Code",
    "lon": COLS["lon"],
    "lat": COLS["lat"],
}
ntad_stations.rename(columns=mapper, inplace=True)
ntad_stations.head(3)

Unnamed: 0,Arrival Station Type,ZIP Code,City,Address 02,Address 01,Code,Longitude,Latitude
0,Station Building (with waiting room),21001,Aberdeen,,18 East Bel Air Avenue,ABE,-76.16326,39.508447
1,Platform with Shelter,8201,Absecon,,Shore Road and Ohio Avenue,ABN,-74.501475,39.424041
2,Station Building (with waiting room),87102,Albuquerque,,320 1st Street SW,ABQ,-106.647975,35.082061


### 5.2 Reorder columns

:bulb: By convention, latitude is typically listed before longitude.

In [21]:
columns = [
    "Code",
    COLS["station_type"],
    COLS["city"],
    COLS["address_01"],
    COLS["address_02"],
    COLS["zip_code"],
    COLS["lat"],
    COLS["lon"],
]
ntad_stations = ntad_stations.loc[:, columns]
ntad_stations.sample(n=7, random_state=rdg)

Unnamed: 0,Code,Arrival Station Type,City,Address 01,Address 02,ZIP Code,Latitude,Longitude
210,HLK,Platform with Shelter,Holyoke,74 Main Street,,1041,42.204156,-72.602309
70,CHW,Station Building (with waiting room),Charleston,350 MacCorkle Avenue - Southeast,,25314,38.346368,-81.638494
182,GNS,Station Building (with waiting room),Gainesville,116 Industrial Boulevard,,30501,34.288897,-83.819694
435,SBG,Station Building (with waiting room),Sebring,601 East Center Avenue,,33870,27.496632,-81.434202
319,MVW,Station Building (with waiting room),Mount Vernon,105 East Kincaid Street,,98273,48.41847,-122.334738
415,ROM,Station Building (with waiting room),Rome,6599 Martin Street,,13440,43.199425,-75.44996
224,IDP,Platform only (no shelter),Independence,600 South Grand Avenue,,64050,39.086917,-94.429711


## 6.0 Merge data [1 pt]

Merge `stations` and `ntad_stations`. Perform a __left join__ to retain all rows in the `stations` `DataFrame`, joining on the "Arrival Station Code" column in `stations` and the "Code" column in `ntad_stations`.
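
A left join can be sanity-checked with the `validate` and `indicator` parameters of `DataFrame.merge()`. The sketch uses small illustrative frames (`left`, `right`) and is one possible check, not part of the graded solution.

```python
import pandas as pd

left = pd.DataFrame({
    "Arrival Station Code": ["LOR", "FAL", "LOR"],
    "Total Detraining Customers": [42445, 13, 100],
})
right = pd.DataFrame({
    "Code": ["LOR", "SFA"],
    "Latitude": [38.708143, 28.808544],
    "Longitude": [-77.220942, -81.291274],
})

joined = left.merge(
    right,
    left_on="Arrival Station Code",
    right_on="Code",
    how="left",
    validate="many_to_one",  # raises MergeError if "Code" is duplicated on the right
    indicator=True,          # adds "_merge": "both" or "left_only"
)

# Station codes that found no match in the lookup table.
unmatched_mask = joined["_merge"] == "left_only"
missing_codes = sorted(joined.loc[unmatched_mask, "Arrival Station Code"].unique())
print(len(joined), missing_codes)
```

With `many_to_one` validation the row count of the result equals the row count of the left frame, so a changed shape signals a problem in the lookup table.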

In [25]:
# YOUR CODE HERE
stations = pd.merge(stations, ntad_stations, left_on='Arrival Station Code', right_on='Code', how='left')

Unnamed: 0,Fiscal Year,Fiscal Quarter,Service Line,Service,Route Miles,Sub Service,Train Number,Arrival Station Code,Arrival Station Name,Arrival Station,...,Late Detraining Customers,Late Detraining Customers Avg Min Late,Code,Arrival Station Type,City,Address 01,Address 02,ZIP Code,Latitude,Longitude
0,2024,3,Long Distance,Auto Train,914,Auto Train,52,LOR,"Lorton (Auto Train), Virginia",Lorton (Auto Train),...,23316,95.0,LOR,Station Building (with waiting room),Lorton,8006 Lorton Road,,22079,38.708143,-77.220942
1,2024,3,Long Distance,Auto Train,914,Auto Train,53,SFA,"Sanford (Auto Train), Florida",Sanford (Auto Train),...,18439,91.0,SFA,Station Building (with waiting room),Sanford,600 South Persimmon Avenue,,32771,28.808544,-81.291274
2,2024,3,Long Distance,California Zephyr,2408,California Zephyr,5,BRL,"Burlington, Iowa",Burlington,...,223,54.0,BRL,Station Building (with waiting room),Burlington,300 South Main Street,,52601,40.805788,-91.101951
3,2024,3,Long Distance,California Zephyr,2408,California Zephyr,5,COX,"Colfax, California",Colfax,...,326,99.0,COX,Station Building (with waiting room),Colfax,99 Railroad Street,,95713,39.099172,-120.953075
4,2024,3,Long Distance,California Zephyr,2408,California Zephyr,5,CRN,"Creston, Iowa",Creston,...,144,67.0,CRN,Station Building (with waiting room),Creston,116 West Adams Street,,50801,41.05692,-94.361617


In [None]:
#hidden tests are within this cell

## 7.0 Check geo coordinates [1 pt]

Check for missing geo coordinates in the latitude and longitude columns in the merged DataFrame
named `stations`. Create a new `DataFrame` named `missing_coords` containing the filtered rows.
Limit the new `DataFrame` to the following columns:

* "Arrival Station Code"
* "Arrival Station"
* "State"
* "Latitude"
* "Longitude"
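
As a hedged sketch of this kind of check, the example restricts the missing-value test to the two coordinate columns and then reduces the result to one row per station. The frame `demo` is illustrative, and the extra `drop_duplicates()` step is an optional convenience rather than something the task asks for.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "Arrival Station Code": ["LOR", "FAL", "FAL"],
    "Arrival Station": ["Lorton (Auto Train)", "Falmouth", "Falmouth"],
    "State": ["Virginia", "Maine", "Maine"],
    "Latitude": [38.708143, np.nan, np.nan],
    "Longitude": [-77.220942, np.nan, np.nan],
})

# Flag rows where either coordinate is missing.
coord_cols = ["Latitude", "Longitude"]
mask = demo[coord_cols].isna().any(axis=1)
missing_demo = demo.loc[mask]

# One row per station makes the affected stations easier to read off.
unique_missing = missing_demo.drop_duplicates(subset="Arrival Station Code")
print(unique_missing["Arrival Station Code"].tolist())
```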

In [31]:
# YOUR CODE HERE
missing_coords = stations[["Arrival Station Code", "Arrival Station", "State", "Latitude", "Longitude"]]
missing_coords = missing_coords[missing_coords.isna().any(axis=1)]
missing_coords.head()

Unnamed: 0,Arrival Station Code,Arrival Station,State,Latitude,Longitude
31113,FAL,Falmouth,Maine,,
31125,FAL,Falmouth,Maine,,
31147,FAL,Falmouth,Maine,,
31170,FAL,Falmouth,Maine,,
31193,FAL,Falmouth,Maine,,


In [None]:
#hidden tests are within this cell

### 7.1 Missing geo coordinates

The BTS Amtrak stations dataset does not contain geo coordinates for the following stations:

* CBN: Canadian Border, NY
* FAL: Falmouth, ME
* MCI: Michigan City, IN

#### 7.1.1 CBN

This is not a physical station but an international border crossing in the vicinity of
Niagara Falls that features an exchange of US and Canadian train crews.

The MCI [Michigan City Station](https://en.wikipedia.org/wiki/Michigan_City_station) is a former Amtrak
station that was closed on 4 April 2022. The geo coordinates for the station can be obtained from
[Google Maps](https://www.google.com/maps/place/41%C2%B043'16.0%22N+86%C2%B054'20.0%22W/@41.721111,-86.905556,15z/data=!4m4!3m3!8m2!3d41.721111!4d-86.905556?hl=en&entry=ttu&g_ep=EgoyMDI0MTAyOS4wIKXMDSoASAFQAw%3D%3D).

#### 7.1.2 FAL

A special event stop for the Amtrak [Downeaster](https://www.amtrak.com/downeaster-train)
in support of _The Live + Work in Maine Open Golf Tournament_ held at the
[Falmouth Country Club](https://www.falmouthcc.org/) during June 24-27, 2021 and June 23-26, 2022
(source: http://www.trainweb.org/usarail/falmouth.htm).

FAL row values can be updated with the following information:

Muirfield Road at Railroad Crossing \
Falmouth, ME 04105 \
Latitude: `43.769600`, Longitude: `-70.259500`

In [32]:
values = ("Falmouth", "Muirfield Road at Railroad Crossing", "04105", 43.769600, -70.259500)
mask = stations[COLS["station_code"]] == "FAL"
stations.loc[
    mask,
    [COLS["city"], COLS["address_01"], COLS["zip_code"], COLS["lat"], COLS["lon"]],
] = values
stations[mask]

Unnamed: 0,Fiscal Year,Fiscal Quarter,Service Line,Service,Route Miles,Sub Service,Train Number,Arrival Station Code,Arrival Station Name,Arrival Station,...,Late Detraining Customers,Late Detraining Customers Avg Min Late,Code,Arrival Station Type,City,Address 01,Address 02,ZIP Code,Latitude,Longitude
31113,2022,3,State Supported,Downeaster,145,Downeaster,681,FAL,"Falmouth, Maine",Falmouth,...,0,,,,Falmouth,Muirfield Road at Railroad Crossing,,4105,43.7696,-70.2595
31125,2022,3,State Supported,Downeaster,145,Downeaster,682,FAL,"Falmouth, Maine",Falmouth,...,0,,,,Falmouth,Muirfield Road at Railroad Crossing,,4105,43.7696,-70.2595
31147,2022,3,State Supported,Downeaster,145,Downeaster,684,FAL,"Falmouth, Maine",Falmouth,...,0,,,,Falmouth,Muirfield Road at Railroad Crossing,,4105,43.7696,-70.2595
31170,2022,3,State Supported,Downeaster,145,Downeaster,686,FAL,"Falmouth, Maine",Falmouth,...,0,,,,Falmouth,Muirfield Road at Railroad Crossing,,4105,43.7696,-70.2595
31193,2022,3,State Supported,Downeaster,145,Downeaster,688,FAL,"Falmouth, Maine",Falmouth,...,0,,,,Falmouth,Muirfield Road at Railroad Crossing,,4105,43.7696,-70.2595
31224,2022,3,State Supported,Downeaster,145,Downeaster,691,FAL,"Falmouth, Maine",Falmouth,...,13,18.0,,,Falmouth,Muirfield Road at Railroad Crossing,,4105,43.7696,-70.2595
31246,2022,3,State Supported,Downeaster,145,Downeaster,693,FAL,"Falmouth, Maine",Falmouth,...,2,23.0,,,Falmouth,Muirfield Road at Railroad Crossing,,4105,43.7696,-70.2595
31258,2022,3,State Supported,Downeaster,145,Downeaster,694,FAL,"Falmouth, Maine",Falmouth,...,0,,,,Falmouth,Muirfield Road at Railroad Crossing,,4105,43.7696,-70.2595
31270,2022,3,State Supported,Downeaster,145,Downeaster,695,FAL,"Falmouth, Maine",Falmouth,...,0,,,,Falmouth,Muirfield Road at Railroad Crossing,,4105,43.7696,-70.2595


#### 7.1.3 MCI

Formerly Amtrak's Michigan City, IN station, closed since April 2022. MCI row values can be
updated with the following information:

Amtrak Michigan City Station (closed) \
100 Washington Street \
Michigan City, Indiana 46360 \
Latitude: `41.721111`, Longitude: `-86.905556`

In [33]:
values = ("Michigan City", "100 Washington Street", "46360", 41.721111, -86.905556)
mask = stations[COLS["station_code"]] == "MCI"
stations.loc[
    mask, [COLS["city"], COLS["address_01"], COLS["zip_code"], COLS["lat"], COLS["lon"]]
] = values
stations[mask]

Unnamed: 0,Fiscal Year,Fiscal Quarter,Service Line,Service,Route Miles,Sub Service,Train Number,Arrival Station Code,Arrival Station Name,Arrival Station,...,Late Detraining Customers,Late Detraining Customers Avg Min Late,Code,Arrival Station Type,City,Address 01,Address 02,ZIP Code,Latitude,Longitude


In [34]:
stations.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41929 entries, 0 to 41928
Data columns (total 25 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   Fiscal Year                             41929 non-null  int64  
 1   Fiscal Quarter                          41929 non-null  int64  
 2   Service Line                            41929 non-null  object 
 3   Service                                 41929 non-null  object 
 4   Route Miles                             41929 non-null  int64  
 5   Sub Service                             41929 non-null  object 
 6   Train Number                            41929 non-null  int64  
 7   Arrival Station Code                    41929 non-null  object 
 8   Arrival Station Name                    41929 non-null  object 
 9   Arrival Station                         41929 non-null  object 
 10  State                                   41929 non-null  ob

## 8.0 Reorder columns

The `stations` columns are reordered as follows:

| Position | Column Name | Note |
| :----- | :------------- | :------------- |
| `0`-`1` | "Fiscal Year", "Fiscal Quarter" | &nbsp; |
| `2`-`6` | "Service Line", "Service", "Route Miles", "Sub Service", "Train Number" | &nbsp; |
| `7`-`11` | "Arrival Station Code", "Arrival Station Name", "Arrival Station", "Code", "Arrival Station Type" | Drop "Code" after confirming column order. |
| `12`-`15` | "City", "Address 01", "Address 02", "ZIP Code" | &nbsp; |
| `16`-`19` | "State", "Division", "Region", "Country" | &nbsp; |
| `20`-`21` | "Latitude", "Longitude" | &nbsp; |
| `22`-`24` | "Total Detraining Customers", "Late Detraining Customers", "Late Detraining Customers Avg Min Late" | &nbsp; |
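
When the target order is known up front, an explicit list plus a guard against dropped or misspelled labels is one alternative to slicing by index. The sketch uses a shortened, illustrative column set and is only an assumption about how such a check could look.

```python
import pandas as pd

# Empty toy frame; only the column labels matter for this illustration.
demo = pd.DataFrame(columns=[
    "Arrival Station Code", "State", "Total Detraining Customers",
    "Code", "City", "Latitude",
])

desired = [
    "Arrival Station Code", "Code", "City",
    "State", "Latitude", "Total Detraining Customers",
]

# Fail early if the new order drops a column or names an unknown one.
missing = set(demo.columns) - set(desired)
unknown = set(desired) - set(demo.columns)
if missing or unknown:
    raise ValueError(f"column mismatch: missing={missing}, unknown={unknown}")

demo = demo.loc[:, desired]
print(demo.columns.tolist())
```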

In [35]:
# Indices of interest
state_idx = stations.columns.get_loc(COLS["state"])
total_detrain_idx = stations.columns.get_loc(COLS["total_detrn"])
code_idx = stations.columns.get_loc("Code")

columns_start = stations.columns[:state_idx].tolist()
columns_start.extend([
    "Code",
    COLS["station_type"],
    COLS["city"],
    COLS["address_01"],
    COLS["address_02"],
    COLS["zip_code"],
])
print(f"columns_start = {columns_start}")

columns_middle = stations.columns[state_idx:total_detrain_idx].tolist()
columns_middle.extend([COLS["lat"], COLS["lon"]])
print(f"columns_middle = {columns_middle}")

columns_end = stations.columns[total_detrain_idx:code_idx].tolist()
print(f"columns_end = {columns_end}")

columns = columns_start + columns_middle + columns_end
print(f"columns = {columns}")

# Reorder DataFrame
stations = stations.loc[:, columns]
stations.shape

columns_start = ['Fiscal Year', 'Fiscal Quarter', 'Service Line', 'Service', 'Route Miles', 'Sub Service', 'Train Number', 'Arrival Station Code', 'Arrival Station Name', 'Arrival Station', 'Code', 'Arrival Station Type', 'City', 'Address 01', 'Address 02', 'ZIP Code']
columns_middle = ['State', 'Division', 'Region', 'Country', 'Latitude', 'Longitude']
columns_end = ['Total Detraining Customers', 'Late Detraining Customers', 'Late Detraining Customers Avg Min Late']
columns = ['Fiscal Year', 'Fiscal Quarter', 'Service Line', 'Service', 'Route Miles', 'Sub Service', 'Train Number', 'Arrival Station Code', 'Arrival Station Name', 'Arrival Station', 'Code', 'Arrival Station Type', 'City', 'Address 01', 'Address 02', 'ZIP Code', 'State', 'Division', 'Region', 'Country', 'Latitude', 'Longitude', 'Total Detraining Customers', 'Late Detraining Customers', 'Late Detraining Customers Avg Min Late']


(41929, 25)

In [36]:
stations.head(3)

Unnamed: 0,Fiscal Year,Fiscal Quarter,Service Line,Service,Route Miles,Sub Service,Train Number,Arrival Station Code,Arrival Station Name,Arrival Station,...,ZIP Code,State,Division,Region,Country,Latitude,Longitude,Total Detraining Customers,Late Detraining Customers,Late Detraining Customers Avg Min Late
0,2024,3,Long Distance,Auto Train,914,Auto Train,52,LOR,"Lorton (Auto Train), Virginia",Lorton (Auto Train),...,22079,Virginia,South Atlantic,South,United States,38.708143,-77.220942,42445,23316,95.0
1,2024,3,Long Distance,Auto Train,914,Auto Train,53,SFA,"Sanford (Auto Train), Florida",Sanford (Auto Train),...,32771,Florida,South Atlantic,South,United States,28.808544,-81.291274,28034,18439,91.0
2,2024,3,Long Distance,California Zephyr,2408,California Zephyr,5,BRL,"Burlington, Iowa",Burlington,...,52601,Iowa,West North Central,Midwest,United States,40.805788,-91.101951,557,223,54.0


## 9.0 Drop column [1 pt]

Drop the redundant "Code" column.

In [37]:
# YOUR CODE HERE
stations = stations.drop(columns="Code")

Unnamed: 0,Fiscal Year,Fiscal Quarter,Service Line,Service,Route Miles,Sub Service,Train Number,Arrival Station Code,Arrival Station Name,Arrival Station,...,ZIP Code,State,Division,Region,Country,Latitude,Longitude,Total Detraining Customers,Late Detraining Customers,Late Detraining Customers Avg Min Late
0,2024,3,Long Distance,Auto Train,914,Auto Train,52,LOR,"Lorton (Auto Train), Virginia",Lorton (Auto Train),...,22079,Virginia,South Atlantic,South,United States,38.708143,-77.220942,42445,23316,95.0
1,2024,3,Long Distance,Auto Train,914,Auto Train,53,SFA,"Sanford (Auto Train), Florida",Sanford (Auto Train),...,32771,Florida,South Atlantic,South,United States,28.808544,-81.291274,28034,18439,91.0
2,2024,3,Long Distance,California Zephyr,2408,California Zephyr,5,BRL,"Burlington, Iowa",Burlington,...,52601,Iowa,West North Central,Midwest,United States,40.805788,-91.101951,557,223,54.0
3,2024,3,Long Distance,California Zephyr,2408,California Zephyr,5,COX,"Colfax, California",Colfax,...,95713,California,Pacific,West,United States,39.099172,-120.953075,508,326,99.0
4,2024,3,Long Distance,California Zephyr,2408,California Zephyr,5,CRN,"Creston, Iowa",Creston,...,50801,Iowa,West North Central,Midwest,United States,41.05692,-94.361617,205,144,67.0


In [None]:
#hidden tests are within this cell

## 10.0 Late detraining passengers

Calculate the ratio of late detraining passengers to total detraining passengers _for each station_
and assign the results to a new column named "Late to Total Detraining Customers Ratio" (use the
associated `COLS` constant rather than hard-coding the string name into the code). Round the
values to the fifth (`5th`) decimal place.

Note: Design your `lambda` function carefully to avoid a `ZeroDivisionError`.
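
For comparison with a row-wise `lambda`, the same ratio can be sketched as a vectorized operation. This is an alternative shown on toy data, not the approach the exercise asks for; the first two rows reuse counts that appear in this notebook's output, and the last two are made-up edge cases.

```python
import pandas as pd

demo = pd.DataFrame({
    "Total Detraining Customers": [42445, 557, 0, 30],
    "Late Detraining Customers": [23316, 223, 0, 0],
})

total = demo["Total Detraining Customers"]
late = demo["Late Detraining Customers"]

# Replace zero totals with NaN so the division yields NaN rather than inf.
safe_total = total.where(total != 0)
demo["Late to Total Detraining Customers Ratio"] = (late / safe_total).round(5)
print(demo)
```

A row with no late passengers but a non-zero total gets a ratio of `0.0`, while a row with a zero total stays missing.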

### 10.1 Calculate the ratio [1 pt]

In [38]:
COLS

{'year': 'Fiscal Year',
 'quarter': 'Fiscal Quarter',
 'year_quarter': 'Fiscal Year Quarter',
 'svc_line': 'Service Line',
 'svc': 'Service',
 'sub_svc': 'Sub Service',
 'route_miles': 'Route Miles',
 'station_code': 'Arrival Station Code',
 'station': 'Arrival Station',
 'station_name': 'Arrival Station Name',
 'station_type': 'Arrival Station Type',
 'city': 'City',
 'address_01': 'Address 01',
 'address_02': 'Address 02',
 'zip_code': 'ZIP Code',
 'state': 'State',
 'division': 'Division',
 'region': 'Region',
 'country': 'Country',
 'lat': 'Latitude',
 'lon': 'Longitude',
 'trn': 'Train Number',
 'trn_arrivals': 'Train Arrivals',
 'trn_arrival_ratio': 'Train Arrival Ratio',
 'avg_mm_late': 'Avg Min Late',
 'avg_mm_late_c': 'Avg Min Late (Lt C)',
 'avg_mm_late_cs': 'Avg Min Late (Lt CS)',
 'detrn_ratio': 'Detraining Ratio',
 'late_detrn': 'Late Detraining Customers',
 'late_detrn_avg_mm_late': 'Late Detraining Customers Avg Min Late',
 'late_to_total_detrn_ratio': 'Late to Total Det

In [39]:
# YOUR CODE HERE
# Guard on the denominator: rows with zero total customers yield NaN
stations[COLS['late_to_total_detrn_ratio']] = stations.apply(
    lambda row: round(row[COLS['late_detrn']] / row[COLS['total_detrn']], 5) if row[COLS['total_detrn']] else np.nan, axis=1
)

stations.head()

Unnamed: 0,Fiscal Year,Fiscal Quarter,Service Line,Service,Route Miles,Sub Service,Train Number,Arrival Station Code,Arrival Station Name,Arrival Station,...,State,Division,Region,Country,Latitude,Longitude,Total Detraining Customers,Late Detraining Customers,Late Detraining Customers Avg Min Late,Late to Total Detraining Customers Ratio
0,2024,3,Long Distance,Auto Train,914,Auto Train,52,LOR,"Lorton (Auto Train), Virginia",Lorton (Auto Train),...,Virginia,South Atlantic,South,United States,38.708143,-77.220942,42445,23316,95.0,0.54932
1,2024,3,Long Distance,Auto Train,914,Auto Train,53,SFA,"Sanford (Auto Train), Florida",Sanford (Auto Train),...,Florida,South Atlantic,South,United States,28.808544,-81.291274,28034,18439,91.0,0.65774
2,2024,3,Long Distance,California Zephyr,2408,California Zephyr,5,BRL,"Burlington, Iowa",Burlington,...,Iowa,West North Central,Midwest,United States,40.805788,-91.101951,557,223,54.0,0.40036
3,2024,3,Long Distance,California Zephyr,2408,California Zephyr,5,COX,"Colfax, California",Colfax,...,California,Pacific,West,United States,39.099172,-120.953075,508,326,99.0,0.64173
4,2024,3,Long Distance,California Zephyr,2408,California Zephyr,5,CRN,"Creston, Iowa",Creston,...,Iowa,West North Central,Midwest,United States,41.05692,-94.361617,205,144,67.0,0.70244


In [None]:
#hidden tests are within this cell

### 10.2 Sample the rows

Return a sample of rows to verify row values.

In [40]:
# Weight "Long Distance" rows 3x so they are more likely to appear in the sample
weights = stations[COLS["svc_line"]].apply(lambda svc_line: 3 if svc_line == "Long Distance" else 1)
stations.sample(n=7, weights=weights, random_state=rdg)

Unnamed: 0,Fiscal Year,Fiscal Quarter,Service Line,Service,Route Miles,Sub Service,Train Number,Arrival Station Code,Arrival Station Name,Arrival Station,...,State,Division,Region,Country,Latitude,Longitude,Total Detraining Customers,Late Detraining Customers,Late Detraining Customers Avg Min Late,Late to Total Detraining Customers Ratio
3673,2024,3,State Supported,Vermonter,602,Vermonter,56,NCR,"New Carrollton, Maryland",New Carrollton,...,Maryland,South Atlantic,South,United States,38.948098,-76.871494,9,1,27.0,0.11111
30072,2022,3,Long Distance,Texas Eagle,1257,Texas Eagle,22,CRV,"Carlinville, Illinois",Carlinville,...,Illinois,East North Central,Midwest,United States,39.279295,-89.889276,46,33,149.0,0.71739
33377,2022,2,Long Distance,Silver Star,1592,Silver Star,91,BAL,"Baltimore (Penn Station), Maryland",Baltimore (Penn Station),...,Maryland,South Atlantic,South,United States,39.307302,-76.615688,831,289,48.0,0.34777
29459,2022,3,Long Distance,Capitol Ltd,788,Capitol Ltd,30,WAS,"Washington, District of Columbia",Washington,...,District of Columbia,South Atlantic,South,United States,38.896993,-77.006422,13342,9792,99.0,0.73392
16428,2023,3,Northeast Corridor,Acela,457,Acela,2271,NHV,"New Haven (Union Station), Connecticut",New Haven (Union Station),...,Connecticut,New England,Northeast,United States,41.297714,-72.92667,30,1,70.0,0.03333
38667,2022,1,State Supported,Missouri,271,Missouri,311,WAH,"Washington, Missouri",Washington,...,Missouri,West North Central,Midwest,United States,38.561466,-91.012717,342,2,26.0,0.00585
626,2024,3,Long Distance,Silver Star,1592,Silver Star,92,WIL,"Wilmington, Delaware",Wilmington,...,Delaware,South Atlantic,South,United States,39.737263,-75.551095,532,454,116.0,0.85338


### 10.3 Reorder columns [1 pt]

Move "Late to Total Detraining Customers Ratio" to the __second to last__ position in `stations`.
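
The behavior of `list.insert()` with a negative index is easy to misread, so here is a minimal sketch with shortened, illustrative column names. Inserting at `-1` places the item before the current last element, which makes it second to last.

```python
cols = ["Total", "Late", "Avg Min Late", "Ratio"]  # shortened, illustrative names

last_col = cols.pop()      # removes and returns "Ratio"
cols.insert(-1, last_col)  # inserts before the current last item
print(cols)
# ['Total', 'Late', 'Ratio', 'Avg Min Late']
```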

In [41]:
# YOUR CODE HERE
# Get column names
cols = list(stations.columns)

# Remove the last column
last_col = cols.pop()

# Insert it at the second to last position
cols.insert(-1, last_col)

# Reorder the DataFrame
stations = stations[cols]

# Drop last column
# stations = stations.drop(columns=new_stations.columns[-1])
# stations.head()

Unnamed: 0,Fiscal Year,Fiscal Quarter,Service Line,Service,Route Miles,Sub Service,Train Number,Arrival Station Code,Arrival Station Name,Arrival Station,...,State,Division,Region,Country,Latitude,Longitude,Total Detraining Customers,Late to Total Detraining Customers Ratio,Late Detraining Customers,Late Detraining Customers Avg Min Late
0,2024,3,Long Distance,Auto Train,914,Auto Train,52,LOR,"Lorton (Auto Train), Virginia",Lorton (Auto Train),...,Virginia,South Atlantic,South,United States,38.708143,-77.220942,42445,0.54932,23316,95.0
1,2024,3,Long Distance,Auto Train,914,Auto Train,53,SFA,"Sanford (Auto Train), Florida",Sanford (Auto Train),...,Florida,South Atlantic,South,United States,28.808544,-81.291274,28034,0.65774,18439,91.0
2,2024,3,Long Distance,California Zephyr,2408,California Zephyr,5,BRL,"Burlington, Iowa",Burlington,...,Iowa,West North Central,Midwest,United States,40.805788,-91.101951,557,0.40036,223,54.0
3,2024,3,Long Distance,California Zephyr,2408,California Zephyr,5,COX,"Colfax, California",Colfax,...,California,Pacific,West,United States,39.099172,-120.953075,508,0.64173,326,99.0
4,2024,3,Long Distance,California Zephyr,2408,California Zephyr,5,CRN,"Creston, Iowa",Creston,...,Iowa,West North Central,Midwest,United States,41.05692,-94.361617,205,0.70244,144,67.0


In [None]:
#hidden tests are within this cell

## 11.0 Persist data

### 11.1 Recheck data

In [42]:
stations.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41929 entries, 0 to 41928
Data columns (total 25 columns):
 #   Column                                    Non-Null Count  Dtype  
---  ------                                    --------------  -----  
 0   Fiscal Year                               41929 non-null  int64  
 1   Fiscal Quarter                            41929 non-null  int64  
 2   Service Line                              41929 non-null  object 
 3   Service                                   41929 non-null  object 
 4   Route Miles                               41929 non-null  int64  
 5   Sub Service                               41929 non-null  object 
 6   Train Number                              41929 non-null  int64  
 7   Arrival Station Code                      41929 non-null  object 
 8   Arrival Station Name                      41929 non-null  object 
 9   Arrival Station                           41929 non-null  object 
 10  Arrival Station Type              

### 11.2 Write to file [1 pt]

Write data to a CSV file.
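
One thing to keep in mind when the file is read back: CSV stores no type information, so ZIP codes with a leading zero (such as `04105`) are parsed as integers unless the column type is declared. The round trip is sketched here against an in-memory buffer; the frame `demo` is illustrative.

```python
import io

import pandas as pd

demo = pd.DataFrame({
    "Arrival Station Code": ["FAL", "LOR"],
    "ZIP Code": ["04105", "22079"],
})

buffer = io.StringIO()
demo.to_csv(buffer, index=False)

# Default type inference parses the ZIP codes as integers.
buffer.seek(0)
inferred = pd.read_csv(buffer)

# Declaring the column as a string preserves the leading zero.
buffer.seek(0)
as_text = pd.read_csv(buffer, dtype={"ZIP Code": "string"})

print(inferred["ZIP Code"].tolist(), as_text["ZIP Code"].tolist())
```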

In [45]:
filepath = data_interim_path.joinpath("station_performance_metrics-v1p2.csv")
stations.to_csv(filepath, index=False)

# filepath = data_interim_path.joinpath("station_performance_metrics-v1p1.csv")
# stations.to_csv(filepath, index=False)

In [None]:
#hidden tests are within this cell

## 12.0 Watermark

In [None]:
%load_ext watermark
%watermark -h -i -iv -m -v