# Explore: Amtrak Trains

## Intercity Passenger Rail Service Station Performance Metrics

The Amtrak [network](https://www.amtrak.com/content/dam/projects/dotcom/english/public/documents/Maps/Amtrak-System-Map-020923.pdf)
provides intercity passenger rail service in the
continental United States and to select Canadian cities. The network is operated by the
[National Railroad Passenger Corporation](https://railroads.dot.gov/passenger-rail/amtrak/amtrak),
a federally chartered for-profit corporation that receives federal and state funding in addition
to revenue from ticket sales and other services.

This notebook commences exploration of the augmented quarterly
[Amtrak](https://www.amtrak.com/home.html) station performance metrics. The goal is to better
understand individual Amtrak train performance and identify potential areas for further analysis.

### Variable names

A number of variable names in this project leverage the following abbreviations. The naming
strategy is to strike a balance between brevity and readability:

* `amtk`: Amtrak (reporting mark)
* `chrt`: chart
* `cols`: columns
* `const`: constant
* `cwd`: current working directory
* `eb`: eastbound direction of travel
* `lm`: linear model
* `mi`: miles
* `mm`: minutes (ISO 8601)
* `nb`: northbound direction of travel
* `psgr`: passenger
* `qtr`: quarter
* `rte`: route
* `sb`: southbound direction of travel
* `stats`: summary statistics
* `stn`: station
* `stns`: stations
* `svc`: service
* `trn`: train
* `wb`: westbound direction of travel

In [1]:
import numpy as np
import pandas as pd
import pathlib as pl
import tomllib as tl

import fra_amtrak.amtk_detrain as detrn
import fra_amtrak.amtk_frame as frm
import fra_amtrak.amtk_network as ntwk
import fra_amtrak.chart_box_preagg as boxp
import fra_amtrak.chart_hist as hst
import fra_amtrak.chart_title as ttl


## 1.0 Read files

### 1.1 Resolve paths

In [2]:
parent_path = pl.Path.cwd() # current working directory
parent_path

PosixPath('/home/jovyan/work/assignments/Course4')

### 1.2 Load constants

Load a companion [TOML](https://toml.io/en/) file containing constants.

In [3]:
filepath = parent_path.joinpath("notebook.toml")
with open(filepath, "rb") as file_obj:
    const = tl.load(file_obj)

# Access constants
AGG = const["agg"]
CHRT_BAR = const["chart"]["bar"]
COLORS = const["colors"]
COLS = const["columns"]
DIRECTION = const["train"]["direction"]
SUB_SVC = const["train"]["sub_service"]
TRN = const["train"]


### 1.3 Retrieve performance data

In [4]:
filepath = parent_path.joinpath("data", "processed", "station_performance_metrics-v1p2.csv")
trains = pd.read_csv(
    filepath, dtype={"Address 02": "str", "ZIP Code": "str"}, low_memory=False
)  # avoid DtypeWarning
trains.shape

(68412, 24)

In [5]:
trains.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68412 entries, 0 to 68411
Data columns (total 24 columns):
 #   Column                                    Non-Null Count  Dtype  
---  ------                                    --------------  -----  
 0   Fiscal Year                               68412 non-null  int64  
 1   Fiscal Quarter                            68412 non-null  int64  
 2   Service Line                              68412 non-null  object 
 3   Service                                   68412 non-null  object 
 4   Sub Service                               68412 non-null  object 
 5   Route Miles                               68412 non-null  int64  
 6   Train Number                              68412 non-null  int64  
 7   Arrival Station Code                      68412 non-null  object 
 8   Arrival Station                           68412 non-null  object 
 9   Arrival Station Type                      68386 non-null  object 
 10  City                              

In [6]:
trains.head(3)

Unnamed: 0,Fiscal Year,Fiscal Quarter,Service Line,Service,Sub Service,Route Miles,Train Number,Arrival Station Code,Arrival Station,Arrival Station Type,...,State,Division,Region,Country,Latitude,Longitude,Total Detraining Customers,Late Detraining Customers,Late to Total Detraining Customers Ratio,Late Detraining Customers Avg Min Late
0,2024,3,Long Distance,Auto Train,Auto Train,914,52,LOR,Lorton (Auto Train),Station Building (with waiting room),...,Virginia,South Atlantic,South,United States,38.708143,-77.220942,42445,23316,0.54932,95.0
1,2024,3,Long Distance,Auto Train,Auto Train,914,53,SFA,Sanford (Auto Train),Station Building (with waiting room),...,Florida,South Atlantic,South,United States,28.808544,-81.291274,28034,18439,0.65774,91.0
2,2024,3,Long Distance,California Zephyr,California Zephyr,2408,5,BRL,Burlington,Station Building (with waiting room),...,Iowa,West North Central,Midwest,United States,40.805788,-91.101951,557,223,0.40036,54.0


## 2.0 Select trains: Northeast Corridor (NEC)

Amtrak's Northeast Corridor (NEC) is the busiest passenger rail corridor in the United States. It is the only Amtrak line that operates high-speed Acela Express service. The NEC is a shared asset with commuter rail operators, including the Massachusetts Bay Transportation Authority (MBTA), Metro-North Railroad, New Jersey Transit, Southeastern Pennsylvania Transportation Authority (SEPTA), and the Maryland Transit Administration (MTA).

### 2.1 _Acela Express_ (Boston - New York - Philadelphia - Washington, D.C.)  [1 pt]

Amtrak's [_Acela_](https://www.amtrak.com/acela-train) service is a high-speed rail service (`150` mph | `240` km/h) that operates along the Northeast Corridor (NEC) between Boston, MA and Washington, D.C. The service features multiple daily departures, with express trains that make limited stops and regional trains that make all stops.

This section focuses on the _Acela Express_ service. Retrieve the _Acela Express_ performance data by calling the appropriate `amtk_network` function. Assign the return value of the function call to a variable named `acela_xp`.
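
The internals of the `amtk_network` functions are not shown here. A minimal stand-in for a sub-service filter, assuming it amounts to a boolean mask on the "Sub Service" column followed by an index reset, might look like this. The function name `filter_by_sub_service` and the sample rows are hypothetical.

```python
import pandas as pd


def filter_by_sub_service(frame: pd.DataFrame, name: str) -> pd.DataFrame:
    """Return the rows whose "Sub Service" equals name, with a fresh index."""
    mask = frame["Sub Service"] == name
    return frame.loc[mask].reset_index(drop=True)


# Hypothetical three-row sample using column labels from the dataset
sample = pd.DataFrame(
    {
        "Sub Service": ["Acela Express", "Pacific Surfliner", "Acela Express"],
        "Train Number": [2103, 562, 2155],
    }
)
acela_sample = filter_by_sub_service(sample, "Acela Express")
print(acela_sample)
```

Resetting the index keeps row labels contiguous, which matches the `0, 1, 2, ...` index seen in the filtered output.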

In [7]:
# YOUR CODE HERE
acela_xp = ntwk.by_sub_service(trains, "Acela Express")
acela_xp

Unnamed: 0,Fiscal Year,Fiscal Quarter,Service Line,Service,Sub Service,Route Miles,Train Number,Arrival Station Code,Arrival Station,Arrival Station Type,...,State,Division,Region,Country,Latitude,Longitude,Total Detraining Customers,Late Detraining Customers,Late to Total Detraining Customers Ratio,Late Detraining Customers Avg Min Late
0,2023,1,Northeast Corridor,Acela Express,Acela Express,457,2103,BAL,Baltimore (Penn Station),Station Building (with waiting room),...,Maryland,South Atlantic,South,United States,39.307302,-76.615688,1502,57,0.03795,79.0
1,2023,1,Northeast Corridor,Acela Express,Acela Express,457,2103,MET,Metropark (Iselin),Station Building (with waiting room),...,New Jersey,Middle Atlantic,Northeast,United States,40.568056,-74.329583,7,1,0.14286,19.0
2,2023,1,Northeast Corridor,Acela Express,Acela Express,457,2103,NWK,Newark (Penn Station),Station Building (with waiting room),...,New Jersey,Middle Atlantic,Northeast,United States,40.734706,-74.164750,10,0,0.00000,
3,2023,1,Northeast Corridor,Acela Express,Acela Express,457,2103,PHL,Philadelphia (Gray 30th St Sta),Station Building (with waiting room),...,Pennsylvania,Middle Atlantic,Northeast,United States,39.955615,-75.181041,1730,58,0.03353,76.0
4,2023,1,Northeast Corridor,Acela Express,Acela Express,457,2103,TRE,Trenton,Station Building (with waiting room),...,New Jersey,Middle Atlantic,Northeast,United States,40.219011,-74.754440,132,6,0.04545,60.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2448,2021,4,Northeast Corridor,Acela Express,Acela Express,457,2259,RTE,Route 128 (Westwood),Station Building (with waiting room),...,Massachusetts,New England,Northeast,United States,42.210242,-71.147894,2,0,0.00000,
2449,2021,4,Northeast Corridor,Acela Express,Acela Express,457,2259,STM,Stamford,Station Building (with waiting room),...,Connecticut,New England,Northeast,United States,41.047130,-73.542160,110,18,0.16364,13.0
2450,2021,4,Northeast Corridor,Acela Express,Acela Express,457,2259,TRE,Trenton,Station Building (with waiting room),...,New Jersey,Middle Atlantic,Northeast,United States,40.219011,-74.754440,49,16,0.32653,62.0
2451,2021,4,Northeast Corridor,Acela Express,Acela Express,457,2259,WAS,Washington,Station Building (with waiting room),...,District of Columbia,South Atlantic,South,United States,38.896993,-77.006422,1555,759,0.48810,41.0


In [8]:
#hidden tests are within this cell

### 2.2 _Acela Express_: on-time performance metrics (entire period) [1 pt]

_Acela Express_ performance data is a compilation of quarterly metrics that focus on late
detraining passengers. Detraining passengers are considered on-time if they arrive at their
destination no later than fifteen (`15`) minutes after their scheduled arrival time. All other
detraining passengers are considered late.
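
To make the arithmetic concrete, here is a small sketch that derives on-time counts and the late-to-total ratio from the two customer count columns. It uses the counts from the first three _Acela Express_ rows displayed earlier; the variable names are illustrative only.

```python
import pandas as pd

# Counts copied from the first three Acela Express rows shown above
sample = pd.DataFrame(
    {
        "Total Detraining Customers": [1502, 7, 10],
        "Late Detraining Customers": [57, 1, 0],
    }
)

total = int(sample["Total Detraining Customers"].sum())
late = int(sample["Late Detraining Customers"].sum())
on_time = total - late  # everyone not counted as late arrived within 15 minutes
late_ratio = round(late / total, 4)  # share of detraining customers who were late

print(total, late, on_time, late_ratio)
```

Note that the ratio is computed from the summed counts rather than by averaging per-row ratios, so busy stations carry proportionally more weight.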

In [9]:
# Total train arrivals
acela_xp_trn_arrivals = acela_xp.shape[0]

# Detraining totals
acela_xp_detrn = acela_xp[COLS["total_detrn"]].sum()
acela_xp_detrn_late = acela_xp[COLS["late_detrn"]].sum()
acela_xp_detrn_on_time = acela_xp_detrn - acela_xp_detrn_late

print(
    f"Train Arrivals: {acela_xp_trn_arrivals}",
    f"Total Detraining Customers: {acela_xp_detrn}",
    f"Late Detraining Customers: {acela_xp_detrn_late}",
    f"On-Time Detraining Customers: {acela_xp_detrn_on_time}",
    sep="\n",
)

# Compute summary statistics
acela_xp_stats = detrn.get_sum_stats(acela_xp, AGG["columns"], AGG["funcs"])
acela_xp_stats

Train Arrivals: 2453
Total Detraining Customers: 3310356
Late Detraining Customers: 576649
On-Time Detraining Customers: 2733707


Unnamed: 0,Train Arrivals,Total Detraining Customers sum,Total Detraining Customers mean,Total Detraining Customers median,Total Detraining Customers std,Late Detraining Customers sum,Late Detraining Customers mean,Late Detraining Customers median,Late Detraining Customers std,Late to Total Detraining Customers Ratio,Late Detraining Customers Avg Min Late mean,Total On Time Detraining Customers sum
0,2453,3310356.0,1349.5132,377.0,2404.6604,576649.0,235.0791,48.0,501.387,0.1742,,2733707.0


In [10]:
#hidden tests are within this cell

### 2.3 _Acela Express_ trains [1 pt]

Each _Acela Express_ train is identified by a unique train number.

Create a `DataFrame` named `acela_xp_trns` that contains one row for each train comprising the _Acela Express_ service. Include the following columns in the `DataFrame` in the order specified:

1. "Service Line"
2. "Service"
3. "Sub Service"
4. "Route Miles"
5. "Train Number"

Reset the index (set `drop=True`) when creating the new `DataFrame`.

In [104]:
# Select the columns, keep one row per train number, then sort and reset the index.
# Chaining avoids an in-place sort on a derived frame.
acela_xp_trns = (
    acela_xp[["Service Line", "Service", "Sub Service", "Route Miles", "Train Number"]]
    .drop_duplicates(subset="Train Number")
    .sort_values(by="Train Number")
    .reset_index(drop=True)
)
acela_xp_trns

Unnamed: 0,Service Line,Service,Sub Service,Route Miles,Train Number
0,Northeast Corridor,Acela Express,Acela Express,457,2103
1,Northeast Corridor,Acela Express,Acela Express,457,2106
2,Northeast Corridor,Acela Express,Acela Express,457,2107
3,Northeast Corridor,Acela Express,Acela Express,457,2109
4,Northeast Corridor,Acela Express,Acela Express,457,2121
5,Northeast Corridor,Acela Express,Acela Express,457,2122
6,Northeast Corridor,Acela Express,Acela Express,457,2126
7,Northeast Corridor,Acela Express,Acela Express,457,2128
8,Northeast Corridor,Acela Express,Acela Express,457,2150
9,Northeast Corridor,Acela Express,Acela Express,457,2151


In [12]:
#hidden tests are within this cell

### 2.4 _Acela Express_: mean late arrival times summary statistics

Review the central tendency, dispersion, and shape for the mean late arrival times of _Acela Express_ trains. Call the custom function named `frm.describe_numeric_column()` to return a dictionary of summary statistics.

In [13]:
# Drop missing values
acela_xp_avg_mm_late = acela_xp[COLS["late_detrn_avg_mm_late"]].dropna().reset_index(drop=True)

# Describe the column
acela_xp_avg_mm_late_describe = frm.describe_numeric_column(acela_xp_avg_mm_late)
acela_xp_avg_mm_late_describe

{'type': pandas.core.series.Series,
 'name': 'Late Detraining Customers Avg Min Late',
 'values': {'non_null': np.int64(1986),
  'missing': np.int64(0),
  'dtype': dtype('float64')},
 'center': {'mean': np.float64(30.614300100704934),
  'median': 26.0,
  'mode': np.float64(21.0)},
 'position': {'min': 11.0,
  '25%': np.float64(20.0),
  '50%': np.float64(26.0),
  '75%': np.float64(35.0),
  'max': 305.0},
 'spread': {'variance': 441.80179036631756,
  'std': 21.019081577612223,
  'range': 294.0,
  'iqr': np.float64(15.0)},
 'shape': {'skewness': np.float64(5.309165458832421),
  'kurtosis': np.float64(51.41815789389943)}}

The skewness and kurtosis values returned suggest that the distribution of mean late arrival times of _Acela Express_ trains is positively skewed and features a sharper peak and heavier right tail than a normal distribution. Let's confirm this visually by generating a histogram.
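
As a quick illustration of how a few extreme values drive these shape statistics, the sketch below computes skewness and excess kurtosis with pandas on a small, hypothetical sample. It is an assumption that `frm.describe_numeric_column()` relies on comparable estimators; the sample values are made up.

```python
import pandas as pd

# Hypothetical sample: mostly moderate delays plus one extreme value
delays = pd.Series([11, 20, 21, 21, 26, 35, 305], dtype="float64")

skewness = delays.skew()  # positive when the right tail is longer
excess_kurtosis = delays.kurt()  # Fisher definition: a normal distribution scores 0
mean_above_median = delays.mean() > delays.median()  # typical of right-skewed data

print(skewness, excess_kurtosis, mean_above_median)
```

A mean that sits well above the median, as in the summary statistics above (roughly 30.6 versus 26.0), is another sign of a long right tail.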

### 2.5 _Acela Express_: visualize distribution of mean late arrival times

Visualize mean late arrival times for the entire period. The data is binned prior to plotting.
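
The binning itself is handled by custom `frm` helpers whose internals are not shown. A minimal stand-in using `numpy.histogram()`, with fixed-width bins aligned to multiples of the bin width, could look like the following. The function name `bin_counts` and the sample values are hypothetical.

```python
import numpy as np


def bin_counts(values, bin_width):
    """Count values in fixed-width bins aligned to multiples of bin_width."""
    values = np.asarray(values, dtype=float)
    start = np.floor(values.min() / bin_width) * bin_width
    stop = np.ceil(values.max() / bin_width) * bin_width
    stop = max(stop, start + bin_width)  # guarantee at least one bin
    edges = np.arange(start, stop + bin_width, bin_width)
    counts, edges = np.histogram(values, bins=edges)
    centers = (edges[:-1] + edges[1:]) / 2  # midpoints, usable as x positions
    return edges, centers, counts


# Hypothetical average-minutes-late values
edges, centers, counts = bin_counts([11, 14, 19, 21, 21, 26, 35, 48], bin_width=10)
print(centers, counts)
```

Plotting bin centers against counts, rather than passing every raw value to the charting library, keeps the chart dataset small.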

#### 2.5.1 Create the chart data

In [14]:
# Convert to DataFrame
acela_xp_avg_mm_late = acela_xp_avg_mm_late.to_frame(name=COLS["avg_mm_late"])

# Get mean and standard deviation
mu = acela_xp_avg_mm_late_describe["center"]["mean"]
sigma = acela_xp_avg_mm_late_describe["spread"]["std"]

# Get max value (for x-axis ticks); pad max value for chart display
max_val = acela_xp_avg_mm_late_describe["position"]["max"]
max_val_ceil = (np.ceil(max_val / 10) * 10).astype(int)

# Create bins
acela_xp_mm_late, bins, num_bins, bin_width = frm.create_bins(
    acela_xp_avg_mm_late, COLS["avg_mm_late"], 10
)

# Bin the data
chrt_data = frm.bin_data(acela_xp_mm_late, COLS["avg_mm_late"], bins)
# chrt_data

#### 2.5.2 Generate the histogram

In [15]:
# Chart title
title_txt = f"Amtrak {SUB_SVC['ace_xp']} Service Late Detraining Passengers"
title = ttl.format_title(acela_xp_stats, title_txt)

# Tooltips
tooltip_config = [
    {"shorthand": "bin_center:Q", "title": "Average Minutes Late", "format": None},
    {"shorthand": "count:Q", "title": "Late Arrivals Count", "format": None},
]

# Create and display the histogram
chart = hst.create_histogram(
    frame=chrt_data,
    x_shorthand="bin_center:Q",
    x_title="Average Minutes Late",
    y_shorthand="count:Q",
    y_title="Late Arrivals Count",
    y_stack=False,
    line_shorthand="Avg Min Late:Q",
    mu=mu,
    sigma=sigma,
    num_bins=num_bins,
    bin_width=bin_width,
    x_tick_count_max=max_val_ceil,
    bar_color=COLORS["amtk_blue"],
    mu_color=COLORS["amtk_red"],
    sigma_color=COLORS["anth_gray"],
    tooltip_config=tooltip_config,
    title=title,
)
chart.display()

### 2.6 _Acela Express_, Trains 2155 & 2154

_Acela_ trains 2155 (southbound) and 2154 (northbound) operate daily between [South Station](https://www.amtrak.com/stations/bos), Boston, MA ([BOS](https://www.amtrak.com/stations/bos)) and [Union Station](https://www.amtrak.com/stations/was), Washington, D.C. ([WAS](https://www.amtrak.com/stations/was)).

#### 2.6.1 _Acela Express_ Train 2155, southbound, detraining passengers summary statistics [1 pt]

Departing daily from [South Station](https://www.amtrak.com/stations/bos), Boston, MA ([BOS](https://www.amtrak.com/stations/bos)).

In [26]:
# Base columns for routes
rte_cols = [
    COLS["trn"],
    COLS["station_code"],
    COLS["station"],
    COLS["state"],
    COLS["lat"],
    COLS["lon"],
]

# Train 2155 southbound
amtk_2155 = ntwk.by_train_number(trains, 2155)
amtk_2155_rte = ntwk.create_route(amtk_2155, TRN["2155"]["direction"])
amtk_2155_rte_stats = detrn.get_route_sum_stats(
    amtk_2155_rte,
    COLS["station_code"],
    AGG["columns"],
    AGG["funcs"],
    rte_cols,
)
amtk_2155_rte_stats.sort_values(by=[COLS["lat"]], ascending=False, inplace=True)

amtk_2155_rte_stats

Unnamed: 0,Train Number,Arrival Station Code,Arrival Station,State,Latitude,Longitude,Train Arrivals,Total Detraining Customers sum,Total Detraining Customers mean,Total Detraining Customers median,Total Detraining Customers std,Late Detraining Customers sum,Late Detraining Customers mean,Late Detraining Customers median,Late Detraining Customers std,Late to Total Detraining Customers Ratio,Late Detraining Customers Avg Min Late mean,Total On Time Detraining Customers sum
12,2155,BBY,Boston (Back Bay Station),Massachusetts,42.347317,-71.075828,9,17,1.8889,2.0,0.928,0,0.0,0.0,0.0,0.0,,17
11,2155,RTE,Route 128 (Westwood),Massachusetts,42.210242,-71.147894,10,28,2.8,2.0,2.201,1,0.1,0.0,0.3162,0.0357,16.0,27
10,2155,PVD,Providence (Amtrak),Rhode Island,41.82949,-71.413478,12,6158,513.1667,502.5,117.8889,219,18.25,13.5,13.9618,0.0356,24.2727,5939
9,2155,NHV,New Haven (Union Station),Connecticut,41.297714,-72.92667,12,4000,333.3333,348.5,85.8342,442,36.8333,16.0,38.7294,0.1105,24.1667,3558
8,2155,STM,Stamford,Connecticut,41.04713,-73.54216,12,7912,659.3333,672.5,141.15,1535,127.9167,99.5,91.3957,0.194,21.3333,6377
7,2155,NYP,NY Moynihan Train Hall at Penn Station,New York,40.751038,-73.996327,12,122032,10169.3333,10580.0,1206.4759,20019,1668.25,1296.0,968.1477,0.164,28.25,102013
6,2155,NWK,Newark (Penn Station),New Jersey,40.734706,-74.16475,12,5434,452.8333,480.0,95.5004,1177,98.0833,79.5,68.8285,0.2166,26.1667,4257
5,2155,MET,Metropark (Iselin),New Jersey,40.568056,-74.329583,11,3661,332.8182,387.0,174.7277,957,87.0,63.0,64.8521,0.2614,22.2,2704
4,2155,PHL,Philadelphia (Gray 30th St Sta),Pennsylvania,39.955615,-75.181041,12,30483,2540.25,2516.0,362.5107,8892,741.0,682.5,289.7519,0.2917,23.1667,21591
3,2155,WIL,Wilmington,Delaware,39.737263,-75.551095,12,9161,763.4167,755.5,117.2088,3191,265.9167,228.5,94.7527,0.3483,24.8333,5970


In [None]:
#hidden tests are within this cell

##### 2.6.1.1 Write to file [1 pt]

Write `amtk_2155_rte_stats` to a CSV file named `stu-amtk_2155_rte_stats.csv`. Store the file in the `data/student` directory. Then compare it to the accompanying `fxt-amtk_2155_rte_stats.csv` file. It must match line for line, character for character.

In [27]:
filepath = parent_path.joinpath("data", "student", "stu-amtk_2155_rte_stats.csv")
amtk_2155_rte_stats.to_csv(filepath, index=False)
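
One way to check the "line for line, character for character" requirement is to compare the two files as raw bytes, one line at a time. The sketch below is a hypothetical helper, not part of the course tooling; it uses temporary files so it can run on its own.

```python
import tempfile
from pathlib import Path


def first_mismatch(path_a, path_b):
    """Return the 1-based number of the first differing line, or None if identical."""
    # Compare bytes so that line endings and encoding differences are not hidden
    lines_a = Path(path_a).read_bytes().splitlines(keepends=True)
    lines_b = Path(path_b).read_bytes().splitlines(keepends=True)
    for num, (line_a, line_b) in enumerate(zip(lines_a, lines_b), start=1):
        if line_a != line_b:
            return num
    if len(lines_a) != len(lines_b):
        return min(len(lines_a), len(lines_b)) + 1  # one file has extra lines
    return None


with tempfile.TemporaryDirectory() as tmp:
    fixture = Path(tmp, "fixture.csv")
    student = Path(tmp, "student.csv")
    fixture.write_text("Train Number,Train Arrivals\n2155,12\n", encoding="utf-8")

    student.write_text("Train Number,Train Arrivals\n2155,12\n", encoding="utf-8")
    same = first_mismatch(student, fixture)

    # A float-formatted value is enough to break an exact match
    student.write_text("Train Number,Train Arrivals\n2155,12.0\n", encoding="utf-8")
    differs_at = first_mismatch(student, fixture)

print(same, differs_at)
```

Reporting the first differing line number makes it easier to spot causes such as an extra index column or a changed sort order.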

In [None]:
#hidden tests are within this cell

#### 2.6.2 _Acela Express_ Train 2155: late detraining metrics (fiscal year and quarter)

Review the central tendency, dispersion, and shape for the mean late arrival times of _Acela Express_ Train 2155. Call the custom function named `frm.describe_numeric_column()` to return a dictionary of summary statistics.


In [29]:
# Drop missing values
amtk_2155_avg_mm_late = amtk_2155[COLS["late_detrn_avg_mm_late"]].dropna().reset_index(drop=True)

# Describe the column
amtk_2155_avg_mm_late_describe = frm.describe_numeric_column(amtk_2155_avg_mm_late)
amtk_2155_avg_mm_late_describe

{'type': pandas.core.series.Series,
 'name': 'Late Detraining Customers Avg Min Late',
 'values': {'non_null': np.int64(121),
  'missing': np.int64(0),
  'dtype': dtype('float64')},
 'center': {'mean': np.float64(25.28099173553719),
  'median': 23.0,
  'mode': np.float64(21.0)},
 'position': {'min': 12.0,
  '25%': np.float64(20.0),
  '50%': np.float64(23.0),
  '75%': np.float64(29.0),
  'max': 77.0},
 'spread': {'variance': 79.07038567493116,
  'std': 8.892153039333678,
  'range': 65.0,
  'iqr': np.float64(9.0)},
 'shape': {'skewness': np.float64(2.2531334382697175),
  'kurtosis': np.float64(9.855369686527695)}}

##### 2.6.2.1 Retrieve the data

In [30]:
# Base columns for average minutes late
cols = [COLS["year"], COLS["quarter"], COLS["late_detrn_avg_mm_late"]]

# Chart data
chrt_data = detrn.get_qtr_avg_min_late(
    amtk_2155_rte, cols, COLS["year_quarter"], [COLORS["amtk_blue"], COLORS["amtk_red"]]
)
chrt_data

Unnamed: 0,Fiscal Year Quarter,Late Detraining Customers Avg Min Late,Color
0,2021Q4,29.0,#00537e
1,2021Q4,25.0,#00537e
2,2021Q4,28.0,#00537e
3,2021Q4,25.0,#00537e
4,2021Q4,23.0,#00537e
...,...,...,...
136,2024Q3,25.0,#ef3824
137,2024Q3,17.0,#ef3824
138,2024Q3,14.0,#ef3824
139,2024Q3,14.0,#ef3824
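
The period labels in the output above (for example `2021Q4`) combine the fiscal year and fiscal quarter. How `detrn.get_qtr_avg_min_late()` builds them is not shown; a plausible pandas sketch, offered as an assumption, is plain string concatenation. The color assignment is not reproduced here.

```python
import pandas as pd

# Hypothetical two-row sample using column labels from the dataset
sample = pd.DataFrame({"Fiscal Year": [2024, 2021], "Fiscal Quarter": [3, 4]})

# Build labels such as "2021Q4"
sample["Fiscal Year Quarter"] = (
    sample["Fiscal Year"].astype(str) + "Q" + sample["Fiscal Quarter"].astype(str)
)

# Labels in this form sort chronologically as plain strings
ordered = sample.sort_values(by="Fiscal Year Quarter")["Fiscal Year Quarter"].tolist()
print(ordered)
```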


##### 2.6.2.2 Preaggregate the data

Attempting to instantiate a Vega-Altair [`alt.Chart()`](https://altair-viz.github.io/user_guide/generated/toplevel/altair.Chart.html) object by passing it a dataset comprising more than `5000` rows will trigger a `MaxRowsError`. You can disable the `MaxRows` check by calling the `alt.data_transformers.disable_max_rows()` method. However, disabling the check may result in performance issues, including browser crashes.

The preferred approach when [working with large datasets](https://altair-viz.github.io/user_guide/large_datasets.html#large-datasets) is to _preaggregate_ the data before generating a plot. This can be achieved "manually" (the approach adopted in this notebook) or by [installing](https://altair-viz.github.io/user_guide/large_datasets.html#installing-vegafusion) and [enabling](https://altair-viz.github.io/user_guide/large_datasets.html#enabling-the-vegafusion-data-transformer) Altair's companion [vegafusion](https://vegafusion.io/) data transformer package.
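
To illustrate what manual preaggregation for a box plot can involve, the sketch below reduces each period to quartiles, whisker ends, and outliers using pandas. It assumes the common 1.5 × IQR whisker rule; whether `frm.aggregate_data()` uses the same rule or column names is not confirmed by this notebook, and `preaggregate_box` is a hypothetical name.

```python
import pandas as pd


def preaggregate_box(frame: pd.DataFrame, group_col: str, value_col: str) -> pd.DataFrame:
    """Reduce each group to the handful of numbers a box plot needs."""
    rows = []
    for key, values in frame.groupby(group_col)[value_col]:
        q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
        iqr = q3 - q1
        lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # assumed whisker rule
        inside = values[(values >= lower_fence) & (values <= upper_fence)]
        outside = values[(values < lower_fence) | (values > upper_fence)]
        rows.append(
            {
                group_col: key,
                "lower": inside.min(),  # lower whisker end
                "q1": q1,
                "median": median,
                "q3": q3,
                "upper": inside.max(),  # upper whisker end
                "outliers": outside.tolist(),
            }
        )
    return pd.DataFrame(rows)


# Hypothetical sample values
sample = pd.DataFrame(
    {
        "Fiscal Year Quarter": ["2021Q4"] * 6 + ["2022Q1"] * 3,
        "Avg Min Late": [20.0, 22.0, 24.0, 26.0, 28.0, 77.0, 14.0, 17.0, 25.0],
    }
)
box = preaggregate_box(sample, "Fiscal Year Quarter", "Avg Min Late")
print(box)
```

The result has one row per period regardless of how many observations each period holds, which keeps the chart dataset far below the row limit.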

In [31]:
# Base columns for aggregation statistics
cols = [COLS["year_quarter"], COLS["late_detrn_avg_mm_late"]]

# Pre-aggregate the data
chrt_data = frm.aggregate_data(chrt_data, cols)

##### 2.6.2.3 Generate box plots

Visualize the distribution of mean late arrival times for late detraining passengers. Illustrate with box plots.

In [32]:
# Create chart title
txt = TRN["2155"]
title_txt = (
    f"Amtrak {txt['name']} Train {txt['number']} Late Detraining Passengers\n"
    f"{txt['route']} ({txt['direction']})"
)
title = ttl.format_title(amtk_2155_rte_stats, title_txt)

# Create and display the vertical boxplot
chart_vertical = boxp.create_boxplot(
    data=chrt_data,
    x_shorthand="Fiscal Year Quarter:N",
    x_title="Period",
    y_shorthand="Late Detraining Customers Avg Min Late:Q",
    y_title="Average Minutes Late",
    box_size=20,
    outlier_shorthand="outliers:Q",
    color_shorthand="Color:N",
    chart_title=title,
    orient=boxp.Orient.VERTICAL,
)
chart_vertical.display()

#### 2.6.3 _Acela Express_ Train 2154, northbound, detraining passengers summary statistics [1 pt]

Departing daily from [Union Station](https://www.amtrak.com/stations/was), Washington, D.C. 
([WAS](https://www.amtrak.com/stations/was)).

Review previous code employed to generate summary statistics for an Amtrak train. Then leverage functions available in the `amtk_network` and `amtk_detrain` modules to create three new `DataFrame` objects named `amtk_2154`, `amtk_2154_rte`, and `amtk_2154_rte_stats`, respectively.

In [44]:
# YOUR CODE HERE
# Base columns for routes
rte_cols = [
    COLS["trn"],
    COLS["station_code"],
    COLS["station"],
    COLS["state"],
    COLS["lat"],
    COLS["lon"],
]

# Train 2154 northbound
amtk_2154 = ntwk.by_train_number(trains, 2154)
amtk_2154_rte = ntwk.create_route(amtk_2154, TRN["2154"]["direction"])
amtk_2154_rte_stats = detrn.get_route_sum_stats(
    amtk_2154_rte,
    COLS["station_code"],
    AGG["columns"],
    AGG["funcs"],
    rte_cols,
)
amtk_2154_rte_stats.sort_values(by=[COLS["lat"]], ascending=True, inplace=True)

filepath = parent_path.joinpath("data", "student", "stu-amtk_2154_rte_stats.csv")
amtk_2154_rte_stats.to_csv(filepath, index=False)

amtk_2154_rte_stats

Unnamed: 0,Train Number,Arrival Station Code,Arrival Station,State,Latitude,Longitude,Train Arrivals,Total Detraining Customers sum,Total Detraining Customers mean,Total Detraining Customers median,Total Detraining Customers std,Late Detraining Customers sum,Late Detraining Customers mean,Late Detraining Customers median,Late Detraining Customers std,Late to Total Detraining Customers Ratio,Late Detraining Customers Avg Min Late mean,Total On Time Detraining Customers sum
0,2154,BWI,BWI Thurgood Marshall Airport Station,Maryland,39.192362,-76.6943,12,1278,106.5,95.0,48.6051,41,3.4167,3.0,2.811,0.0321,27.7,1237
1,2154,BAL,Baltimore (Penn Station),Maryland,39.307302,-76.615688,12,1669,139.0833,151.5,48.9702,83,6.9167,4.0,6.0221,0.0497,26.2727,1586
2,2154,WIL,Wilmington,Delaware,39.737263,-75.551095,12,4089,340.75,382.5,129.1955,275,22.9167,22.0,13.8003,0.0673,24.5,3814
3,2154,PHL,Philadelphia (Gray 30th St Sta),Pennsylvania,39.955615,-75.181041,12,25334,2111.1667,2221.0,518.8538,2468,205.6667,240.5,77.4331,0.0974,25.0,22866
4,2154,TRE,Trenton,New Jersey,40.219011,-74.75444,1,28,28.0,28.0,,1,1.0,1.0,,0.0357,11.0,27
5,2154,NWK,Newark (Penn Station),New Jersey,40.734706,-74.16475,12,9512,792.6667,842.5,127.6446,1683,140.25,125.0,71.9319,0.1769,23.0833,7829
6,2154,NYP,NY Moynihan Train Hall at Penn Station,New York,40.751038,-73.996327,12,133093,11091.0833,11278.0,1207.6415,14971,1247.5833,1098.5,634.6931,0.1125,28.0833,118122
7,2154,STM,Stamford,Connecticut,41.04713,-73.54216,12,7586,632.1667,645.0,81.4559,2130,177.5,146.5,94.4029,0.2808,22.0833,5456
8,2154,NHV,New Haven (Union Station),Connecticut,41.297714,-72.92667,11,7418,674.3636,722.0,162.3387,1596,145.0909,144.0,68.5835,0.2152,25.3636,5822
9,2154,PVD,Providence (Amtrak),Rhode Island,41.82949,-71.413478,12,19647,1637.25,1668.0,193.583,6986,582.1667,574.5,224.3264,0.3556,27.0833,12661


In [None]:
#hidden tests are within this cell

##### 2.6.3.1 Write to file [1 pt]

Write `amtk_2154_rte_stats` to a CSV file named `stu-amtk_2154_rte_stats.csv`. Store the file in the
`data/student` directory. Then compare it to the accompanying `fxt-amtk_2154_rte_stats.csv` file.
It must match line for line, character for character.

In [45]:
# YOUR CODE HERE
filepath = parent_path.joinpath("data", "student", "stu-amtk_2154_rte_stats.csv")
amtk_2154_rte_stats.to_csv(filepath, index=False)

In [None]:
#hidden tests are within this cell

#### 2.6.4 _Acela Express_ Train 2154: late detraining metrics (fiscal year and quarter)

Review the central tendency, dispersion, and shape for the mean late arrival times of _Acela Express_ Train 2154. Call the custom function named `frm.describe_numeric_column()` to return a dictionary of summary statistics. Then visualize each fiscal year and quarter data with a box plot.

In [37]:
# Drop missing values
amtk_2154_avg_mm_late = amtk_2154[COLS["late_detrn_avg_mm_late"]].dropna().reset_index(drop=True)

# Describe the column
amtk_2154_avg_mm_late_describe = frm.describe_numeric_column(amtk_2154_avg_mm_late)
amtk_2154_avg_mm_late_describe

{'type': pandas.core.series.Series,
 'name': 'Late Detraining Customers Avg Min Late',
 'values': {'non_null': np.int64(141),
  'missing': np.int64(0),
  'dtype': dtype('float64')},
 'center': {'mean': np.float64(26.80141843971631),
  'median': 26.0,
  'mode': np.float64(21.0)},
 'position': {'min': 11.0,
  '25%': np.float64(20.0),
  '50%': np.float64(26.0),
  '75%': np.float64(31.0),
  'max': 56.0},
 'spread': {'variance': 83.81742654508611,
  'std': 9.155185773379266,
  'range': 45.0,
  'iqr': np.float64(11.0)},
 'shape': {'skewness': np.float64(0.8783906477596932),
  'kurtosis': np.float64(0.6057990936284594)}}

##### 2.6.4.1 Retrieve the chart data

In [38]:
# Base columns for average minutes late
cols = [COLS["year"], COLS["quarter"], COLS["late_detrn_avg_mm_late"]]

# Chart data
chrt_data = detrn.get_qtr_avg_min_late(
    amtk_2154_rte, cols, COLS["year_quarter"], [COLORS["amtk_blue"], COLORS["amtk_red"]]
)
# chrt_data

##### 2.6.4.2 Preaggregate the data

In [39]:
# Base columns for aggregation statistics
cols = [COLS["year_quarter"], COLS["late_detrn_avg_mm_late"]]

# Pre-aggregate the data
chrt_data = frm.aggregate_data(chrt_data, cols)

##### 2.6.4.3 Generate box plots

In [40]:
# Create chart title
txt = TRN["2154"]
title_txt = (
    f"Amtrak {txt['name']} Train {txt['number']} Late Detraining Passengers\n"
    f"{txt['route']} ({txt['direction']})"
)
title = ttl.format_title(amtk_2154_rte_stats, title_txt)

# Create and display the vertical boxplot
chart_vertical = boxp.create_boxplot(
    data=chrt_data,
    x_shorthand="Fiscal Year Quarter:N",
    x_title="Period",
    y_shorthand="Late Detraining Customers Avg Min Late:Q",
    y_title="Average Minutes Late",
    box_size=20,
    outlier_shorthand="outliers:Q",
    color_shorthand="Color:N",
    chart_title=title,
    orient=boxp.Orient.VERTICAL,
)
chart_vertical.display()

## 3.0 Select trains: State Supported Pacific Surfliner Service

Amtrak's state-supported trains are funded by state governments. These services are typically shorter in length and operate within a single state or across multiple states.

### 3.1 _Pacific Surfliner_ Service (San Luis Obispo - Santa Barbara - Los Angeles - San Diego) [1 pt]

Amtrak's [Pacific Surfliner](https://www.amtrak.com/pacific-surfliner-train) service operates between San Luis Obispo, CA ([SLO](https://www.amtrak.com/stations/slo)) and [Santa Fe Depot](https://www.amtrak.com/stations/san), San Diego, CA ([SAN](https://www.amtrak.com/stations/san)). The service features multiple departures daily and serves a number of popular destinations, including Santa Barbara, CA ([SBA](https://www.amtrak.com/stations/sba)), Los Angeles, CA ([LAX](https://www.amtrak.com/stations/lax)), Anaheim, CA ([ANA](https://www.amtrak.com/stations/ana)), and San Juan Capistrano, CA ([SNC](https://www.amtrak.com/stations/snc)).

Retrieve the _Pacific Surfliner_ performance data by calling the appropriate `amtk_network` function. Assign the return value of the function call to a variable named `surf`.

In [41]:
surf = ntwk.by_sub_service(trains, "Pacific Surfliner")
surf

Unnamed: 0,Fiscal Year,Fiscal Quarter,Service Line,Service,Sub Service,Route Miles,Train Number,Arrival Station Code,Arrival Station,Arrival Station Type,...,State,Division,Region,Country,Latitude,Longitude,Total Detraining Customers,Late Detraining Customers,Late to Total Detraining Customers Ratio,Late Detraining Customers Avg Min Late
0,2024,3,State Supported,Pacific Surfliner,Pacific Surfliner,351,562,ANA,Anaheim,Station Building (with waiting room),...,California,Pacific,West,United States,33.803803,-117.877308,289,20,0.06920,39.0
1,2024,3,State Supported,Pacific Surfliner,Pacific Surfliner,351,562,FUL,Fullerton,Station Building (with waiting room),...,California,Pacific,West,United States,33.868969,-117.922849,360,19,0.05278,32.0
2,2024,3,State Supported,Pacific Surfliner,Pacific Surfliner,351,562,IRV,Irvine,Station Building (with waiting room),...,California,Pacific,West,United States,33.656771,-117.733695,1516,28,0.01847,37.0
3,2024,3,State Supported,Pacific Surfliner,Pacific Surfliner,351,562,OLT,San Diego (Old Town),Platform only (no shelter),...,California,Pacific,West,United States,32.755266,-117.200073,1732,248,0.14319,36.0
4,2024,3,State Supported,Pacific Surfliner,Pacific Surfliner,351,562,OSD,Oceanside,Station Building (with waiting room),...,California,Pacific,West,United States,33.192515,-117.379430,720,71,0.09861,33.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4443,2021,4,State Supported,Pacific Surfliner,Pacific Surfliner,351,1793,OSD,Oceanside,Station Building (with waiting room),...,California,Pacific,West,United States,33.192515,-117.379430,639,0,0.00000,
4444,2021,4,State Supported,Pacific Surfliner,Pacific Surfliner,351,1793,SOL,Solana Beach,Platform only (no shelter),...,California,Pacific,West,United States,32.992937,-117.271135,9,0,0.00000,
4445,2021,4,State Supported,Pacific Surfliner,Pacific Surfliner,351,1796,CPN,Carpinteria,Platform with Shelter,...,California,Pacific,West,United States,34.396780,-119.522975,6,2,0.33333,30.0
4446,2021,4,State Supported,Pacific Surfliner,Pacific Surfliner,351,1796,OXN,Oxnard,Station Building (with waiting room),...,California,Pacific,West,United States,34.199241,-119.175978,288,133,0.46181,36.0


In [None]:
#hidden tests are within this cell

### 3.2 _Pacific Surfliner_: on-time performance metrics (entire period) [1 pt]

_Pacific Surfliner_ performance data is a compilation of quarterly metrics that focus on late detraining passengers. Detraining passengers are considered on-time if they arrive at their destination no later than fifteen (`15`) minutes after their scheduled arrival time. All other detraining passengers are considered late.

In [43]:
# Total train arrivals
surf_trn_arrivals = surf.shape[0]

# Detraining totals
surf_detrn = surf[COLS["total_detrn"]].sum()
surf_detrn_late = surf[COLS["late_detrn"]].sum()
surf_detrn_on_time = surf_detrn - surf_detrn_late

print(
    f"Train Arrivals: {surf_trn_arrivals}",
    f"Total Detraining Customers: {surf_detrn}",
    f"Late Detraining Customers: {surf_detrn_late}",
    f"On-Time Detraining Customers: {surf_detrn_on_time}",
    sep="\n",
)

# Compute summary statistics
surf_stats = detrn.get_sum_stats(surf, AGG["columns"], AGG["funcs"])
surf_stats

Train Arrivals: 4448
Total Detraining Customers: 5040187
Late Detraining Customers: 932657
On-Time Detraining Customers: 4107530


Unnamed: 0,Train Arrivals,Total Detraining Customers sum,Total Detraining Customers mean,Total Detraining Customers median,Total Detraining Customers std,Late Detraining Customers sum,Late Detraining Customers mean,Late Detraining Customers median,Late Detraining Customers std,Late to Total Detraining Customers Ratio,Late Detraining Customers Avg Min Late mean,Total On Time Detraining Customers sum
0,4448,5040187.0,1133.1356,478.5,1976.3956,932657.0,209.6801,58.0,445.2761,0.185,,4107530.0
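The headline figures in the output above can be checked by hand. The following is a minimal sketch, not part of the `amtk_detrain` module, that recomputes the on-time total, the late ratio, and the mean from the printed totals. The variable names are illustrative only, and the sketch assumes that "Train Arrivals" is the row count of `surf` (one row per train, station, fiscal year, and quarter).

```python
# Totals copied from the printed output above
row_count = 4_448            # "Train Arrivals" (rows in surf)
total_detrn = 5_040_187      # "Total Detraining Customers"
late_detrn = 932_657         # "Late Detraining Customers"

# On-time passengers are whoever is not late
on_time_detrn = total_detrn - late_detrn

# Share of late passengers, rounded like the summary table
late_ratio = round(late_detrn / total_detrn, 4)

# Mean detraining passengers per row, rounded like the summary table
mean_per_row = round(total_detrn / row_count, 4)

print(f"On-time detraining customers: {on_time_detrn}")
print(f"Late to total ratio: {late_ratio}")
print(f"Mean detraining customers per row: {mean_per_row}")
```

The results agree with the summary table: 4,107,530 on-time passengers, a late ratio of 0.185, and a mean of 1133.1356 passengers per row.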


In [None]:
#hidden tests are within this cell

### 3.3 _Pacific Surfliner_ trains [1 pt]

Each _Pacific Surfliner_ train is identified by a unique train number. Create a `DataFrame` named `surf_trns` that contains one row for each train comprising the _Pacific Surfliner_ service. Include the following columns in the `DataFrame` in the order specified: 

1. "Service Line"
2. "Service"
3. "Sub Service"
4. "Route Miles"
5. "Train Number"

Reset the index (set `drop=True`) when creating the new `DataFrame`.

In [46]:
surf_trns = surf.drop_duplicates(subset='Train Number', ignore_index=True)
surf_trns = surf_trns[["Service Line", "Service", "Sub Service", "Route Miles", "Train Number"]]
surf_trns.head(3)

Unnamed: 0,Service Line,Service,Sub Service,Route Miles,Train Number
0,State Supported,Pacific Surfliner,Pacific Surfliner,351,562
1,State Supported,Pacific Surfliner,Pacific Surfliner,351,564
2,State Supported,Pacific Surfliner,Pacific Surfliner,351,572


In [None]:
#hidden tests are within this cell

### 3.4 _Pacific Surfliner_: mean late arrival times summary statistics

Review the central tendency, dispersion, and shape for the mean late arrival times of _Pacific Surfliner_ trains. Call the custom function named `frm.describe_numeric_column()` to return a dictionary of summary statistics.

In [47]:
# Drop missing values
surf_avg_mm_late = surf[COLS["late_detrn_avg_mm_late"]].dropna().reset_index(drop=True)

# Call the custom frm.describe_numeric_column() function again
surf_avg_mm_late_describe = frm.describe_numeric_column(surf_avg_mm_late)
surf_avg_mm_late_describe

{'type': pandas.core.series.Series,
 'name': 'Late Detraining Customers Avg Min Late',
 'values': {'non_null': np.int64(3774),
  'missing': np.int64(0),
  'dtype': dtype('float64')},
 'center': {'mean': np.float64(42.63222045574987),
  'median': 39.0,
  'mode': np.float64(33.0)},
 'position': {'min': 2.0,
  '25%': np.float64(31.0),
  '50%': np.float64(39.0),
  '75%': np.float64(49.0),
  'max': 391.0},
 'spread': {'variance': 422.04811078520663,
  'std': 20.543809548990826,
  'range': 389.0,
  'iqr': np.float64(18.0)},
 'shape': {'skewness': np.float64(4.752313978802546),
  'kurtosis': np.float64(54.5568427759855)}}

The skewness and kurtosis values returned suggest that the distribution of mean late arrival times of Pacific Surfliner trains is positively skewed and features a sharper peak and heavier right tail than a normal distribution. Let's confirm this visually by generating a histogram.
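To make the interpretation concrete, here is a small sketch on invented data showing how a single extreme value pulls the mean above the median and produces positive skewness and positive excess kurtosis. It uses the pandas `Series.skew()` and `Series.kurt()` methods; whether `frm.describe_numeric_column()` relies on the same estimators is an assumption.

```python
import pandas as pd

# Invented average-minutes-late values with one extreme observation
late_minutes = pd.Series([22, 25, 28, 30, 31, 33, 33, 39, 47, 128], dtype="float64")

summary = {
    "mean": late_minutes.mean(),
    "median": late_minutes.median(),
    "skewness": late_minutes.skew(),   # > 0 indicates a longer right tail
    "kurtosis": late_minutes.kurt(),   # excess kurtosis; > 0 indicates heavier tails than normal
}

for key, value in summary.items():
    print(f"{key}: {value:.4f}")
```

A mean that exceeds the median, as in the _Pacific Surfliner_ output above (42.63 versus 39.0), is the same pattern at a larger scale.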

### 3.5 _Pacific Surfliner_: visualize distribution of mean late arrival times

Visualize mean late arrival times for the entire period. The data is binned prior to plotting.
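Binning before plotting means the chart receives one row per bin rather than one row per observation. The sketch below illustrates the idea with NumPy on invented values, assuming a fixed bin width of 10 minutes. It is not the implementation of `frm.create_bins()` or `frm.bin_data()`, and the meaning of their arguments is not assumed here.

```python
import numpy as np

# Invented average-minutes-late values
values = np.array([22, 25, 28, 31, 33, 33, 39, 41, 47, 58, 64, 95], dtype=float)

bin_width = 10  # assumed bin width in minutes

# Pad the maximum up to the next multiple of the bin width
max_edge = int(np.ceil(values.max() / bin_width) * bin_width)

# Bin edges from 0 to the padded maximum, inclusive
edges = np.arange(0, max_edge + bin_width, bin_width)

# Count observations per bin; the last bin includes its right edge
counts, edges = np.histogram(values, bins=edges)

# Bin centers are convenient x positions for bars
centers = (edges[:-1] + edges[1:]) / 2

for center, count in zip(centers, counts):
    print(f"{center:5.1f}: {count}")
```

The padding step mirrors the `np.ceil(max_val / 10) * 10` expression used in the next cell: a maximum of 391 minutes, for example, is padded to 400.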

#### 3.5.1 Create the chart data [1 pt]

In [48]:
# Convert to DataFrame
surf_avg_mm_late = surf_avg_mm_late.to_frame(name=COLS["avg_mm_late"])

# Get mean and standard deviation
mu = surf_avg_mm_late_describe["center"]["mean"]
sigma = surf_avg_mm_late_describe["spread"]["std"]

# Get max value (for x-axis ticks); pad max value for chart display
max_val = surf_avg_mm_late_describe["position"]["max"]
max_val_ceil = (np.ceil(max_val / 10) * 10).astype(int)

# Create bins
surf_mm_late, bins, num_bins, bin_width = frm.create_bins(surf_avg_mm_late, COLS["avg_mm_late"], 10)

# Bin the data
chrt_data = frm.bin_data(surf_mm_late, COLS["avg_mm_late"], bins)
# chrt_data

In [None]:
#hidden tests are within this cell

#### 3.5.2 Generate the histogram

In [49]:
# Chart title
title_txt = f"Amtrak {SUB_SVC['surf']} Service Late Detraining Passengers"
title = ttl.format_title(surf_stats, title_txt)

# Tooltips
tooltip_config = [
    {"shorthand": "bin_center:Q", "title": "Average Minutes Late", "format": None},
    {"shorthand": "count:Q", "title": "Late Arrivals Count", "format": None},
]

# Create and display the histogram
chart = hst.create_histogram(
    frame=chrt_data,
    x_shorthand="bin_center:Q",
    x_title="Average Minutes Late",
    y_shorthand="count:Q",
    y_title="Late Arrivals Count",
    y_stack=False,
    line_shorthand="Avg Min Late:Q",
    mu=mu,
    sigma=sigma,
    num_bins=num_bins,
    bin_width=bin_width,
    x_tick_count_max=max_val_ceil,
    bar_color=COLORS["amtk_blue"],
    mu_color=COLORS["amtk_red"],
    sigma_color=COLORS["anth_gray"],
    tooltip_config=tooltip_config,
    title=title,
)
chart.display()

### 3.6 _Pacific Surfliner_ Trains 774 & 777

_Pacific Surfliner_ trains 774 (southbound) and 777 (northbound) operate Sunday to Friday between San Luis Obispo, CA ([SLO](https://www.amtrak.com/stations/slo)) and [Santa Fe Depot](https://www.amtrak.com/stations/san), San Diego, CA ([SAN](https://www.amtrak.com/stations/san)).

#### 3.6.1 _Pacific Surfliner_ Train 774, southbound, detraining passengers summary statistics [1 pt]

Departs Sunday to Friday from San Luis Obispo, CA ([SLO](https://www.amtrak.com/stations/slo)).

In [50]:
# Base columns for routes
rte_cols = [
    COLS["trn"],
    COLS["station_code"],
    COLS["station"],
    COLS["state"],
    COLS["lat"],
    COLS["lon"],
]

# Train 774 southbound
amtk_774 = ntwk.by_train_number(trains, 774)
amtk_774_rte = ntwk.create_route(amtk_774, TRN["774"]["direction"])
amtk_774_rte_stats = detrn.get_route_sum_stats(
    amtk_774_rte,
    COLS["station_code"],
    AGG["columns"],
    AGG["funcs"],
    rte_cols,
)
amtk_774_rte_stats.sort_values(by=[COLS["lat"]], ascending=False, inplace=True)

amtk_774_rte_stats

Unnamed: 0,Train Number,Arrival Station Code,Arrival Station,State,Latitude,Longitude,Train Arrivals,Total Detraining Customers sum,Total Detraining Customers mean,Total Detraining Customers median,Total Detraining Customers std,Late Detraining Customers sum,Late Detraining Customers mean,Late Detraining Customers median,Late Detraining Customers std,Late to Total Detraining Customers Ratio,Late Detraining Customers Avg Min Late mean,Total On Time Detraining Customers sum
0,774,GVB,Grover Beach,California,35.12126,-120.629266,12,146,12.1667,3.5,13.8356,1,0.0833,0.0,0.2887,0.0068,54.0,145
1,774,GUA,Guadalupe-Santa Maria,California,34.962927,-120.57341,10,94,9.4,6.0,11.5297,1,0.1,0.0,0.3162,0.0106,22.0,93
2,774,LPS,Lompoc-Surf,California,34.682704,-120.605001,12,118,9.8333,8.5,4.7832,14,1.1667,1.0,1.1146,0.1186,41.25,104
3,774,GTA,Goleta,California,34.437729,-119.843092,12,1729,144.0833,151.0,43.4584,500,41.6667,40.0,21.0641,0.2892,28.3333,1229
4,774,SBA,Santa Barbara,California,34.413718,-119.692785,12,8608,717.3333,728.5,139.1751,1806,150.5,140.5,102.3581,0.2098,34.3333,6802
5,774,CPN,Carpinteria,California,34.39678,-119.522975,12,2929,244.0833,241.5,62.8005,947,78.9167,83.5,50.9393,0.3233,29.0909,1982
6,774,MPK,Moorpark,California,34.28476,-118.878059,11,1238,112.5455,116.0,47.0773,251,22.8182,17.0,20.2031,0.2027,40.6364,987
7,774,VEC,Ventura,California,34.276929,-119.299918,12,3773,314.4167,287.5,116.0975,891,74.25,65.5,50.9637,0.2362,32.5833,2882
8,774,SIM,Simi Valley,California,34.270204,-118.695163,12,1832,152.6667,147.0,33.9688,423,35.25,28.0,19.7904,0.2309,45.4167,1409
9,774,CWT,Chatsworth,California,34.253205,-118.599417,12,4994,416.1667,393.0,112.2836,1213,101.0833,79.5,73.3676,0.2429,40.9167,3781
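The per-station figures above are grouped aggregations. As a rough sketch of the idea, the example below groups invented quarterly records by station code and derives the late ratio and on-time total from the sums. The column aliases and the helper-free approach are assumptions for illustration and do not reproduce `detrn.get_route_sum_stats()`.

```python
import pandas as pd

# Invented quarterly records for two stations
records = pd.DataFrame(
    {
        "Arrival Station Code": ["GTA", "GTA", "SBA", "SBA"],
        "Total Detraining Customers": [150, 140, 700, 730],
        "Late Detraining Customers": [40, 45, 150, 140],
    }
)

# One row per station with named aggregations
stats = records.groupby("Arrival Station Code", as_index=False).agg(
    arrivals=("Total Detraining Customers", "count"),
    total_sum=("Total Detraining Customers", "sum"),
    total_mean=("Total Detraining Customers", "mean"),
    late_sum=("Late Detraining Customers", "sum"),
)

# Derived columns computed from the sums, not from per-quarter ratios
stats["late_ratio"] = (stats["late_sum"] / stats["total_sum"]).round(4)
stats["on_time_sum"] = stats["total_sum"] - stats["late_sum"]

print(stats)
```

The table above is consistent with a ratio computed from sums: for Goleta, 500 late passengers out of 1,729 gives approximately 0.2892.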


In [None]:
#hidden tests are within this cell

##### 3.6.1.1 Write to file [1 pt]

Write `amtk_774_rte_stats` to a CSV file named `stu-amtk_774_rte_stats.csv`. Store the file in the `data/student` directory. Then compare it to the accompanying `fxt-amtk_774_rte_stats.csv` file. It must match line for line, character for character.
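One way to perform the comparison is sketched below using only the standard library. The helper `first_mismatch()` and the demo file names are hypothetical; in practice the two paths would point at the student and fixture files.

```python
import filecmp
import tempfile
from pathlib import Path


def first_mismatch(path_a, path_b):
    """Return the 1-based number of the first differing line, or None if all lines match."""
    lines_a = Path(path_a).read_text(encoding="utf-8").splitlines()
    lines_b = Path(path_b).read_text(encoding="utf-8").splitlines()
    for num, (a, b) in enumerate(zip(lines_a, lines_b), start=1):
        if a != b:
            return num
    if len(lines_a) != len(lines_b):
        # One file has extra lines after the shared prefix
        return min(len(lines_a), len(lines_b)) + 1
    return None


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical demo files standing in for the fixture and student files
    fxt = Path(tmp, "fxt-demo.csv")
    stu_good = Path(tmp, "stu-demo-good.csv")
    stu_bad = Path(tmp, "stu-demo-bad.csv")

    fxt.write_text("Train Number,Train Arrivals\n774,12\n", encoding="utf-8")
    stu_good.write_text("Train Number,Train Arrivals\n774,12\n", encoding="utf-8")
    stu_bad.write_text("Train Number,Train Arrivals\n774,12.0\n", encoding="utf-8")

    # Byte-for-byte comparison (shallow=False forces a content check)
    same_bytes = filecmp.cmp(stu_good, fxt, shallow=False)
    diff_bytes = filecmp.cmp(stu_bad, fxt, shallow=False)

    # Locate the first differing line for easier debugging
    mismatch_line = first_mismatch(stu_bad, fxt)

print(same_bytes, diff_bytes, mismatch_line)
```

Note that `splitlines()` ignores differences in line endings, so the byte-level check with `filecmp.cmp()` is the stricter of the two.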

In [52]:
filepath = parent_path.joinpath("data", "student", "stu-amtk_774_rte_stats.csv")
amtk_774_rte_stats.to_csv(filepath, index=False)

In [None]:
#hidden tests are within this cell

#### 3.6.2 _Pacific Surfliner_ Train 774: late detraining metrics (fiscal year and quarter)

Review the central tendency, dispersion, and shape for the mean late arrival times of _Pacific Surfliner_ Train 774. Call the custom function named `frm.describe_numeric_column()` to return a dictionary of summary statistics. Then visualize the data for each fiscal year and quarter with box plots.

In [53]:
# Drop missing values
amtk_774_avg_mm_late = amtk_774[COLS["late_detrn_avg_mm_late"]].dropna().reset_index(drop=True)

# Describe the column
amtk_774_avg_mm_late_describe = frm.describe_numeric_column(amtk_774_avg_mm_late)
amtk_774_avg_mm_late_describe

{'type': pandas.core.series.Series,
 'name': 'Late Detraining Customers Avg Min Late',
 'values': {'non_null': np.int64(268),
  'missing': np.int64(0),
  'dtype': dtype('float64')},
 'center': {'mean': np.float64(39.57462686567164),
  'median': 37.0,
  'mode': np.float64(37.0)},
 'position': {'min': 18.0,
  '25%': np.float64(32.0),
  '50%': np.float64(37.0),
  '75%': np.float64(45.0),
  'max': 128.0},
 'spread': {'variance': 147.5112639051932,
  'std': 12.145421520276404,
  'range': 110.0,
  'iqr': np.float64(13.0)},
 'shape': {'skewness': np.float64(2.541218278069484),
  'kurtosis': np.float64(13.800652191508451)}}

##### 3.6.2.1 Retrieve the chart data

In [55]:
# Base columns for average minutes late
cols = [COLS["year"], COLS["quarter"], COLS["late_detrn_avg_mm_late"]]

# Chart data
chrt_data = detrn.get_qtr_avg_min_late(
    amtk_774_rte, cols, COLS["year_quarter"], [COLORS["amtk_blue"], COLORS["amtk_red"]]
)
chrt_data

Unnamed: 0,Fiscal Year Quarter,Late Detraining Customers Avg Min Late,Color
0,2021Q4,,#00537e
1,2021Q4,22.0,#00537e
2,2021Q4,25.0,#00537e
3,2021Q4,28.0,#00537e
4,2021Q4,30.0,#00537e
...,...,...,...
288,2024Q3,41.0,#ef3824
289,2024Q3,54.0,#ef3824
290,2024Q3,58.0,#ef3824
291,2024Q3,60.0,#ef3824


##### 3.6.2.2 Preaggregate the data
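Pre-aggregating reduces each period to the handful of numbers a box plot needs. The sketch below shows one common way to do this on invented data, assuming Tukey's convention (whiskers extend to the most extreme values within 1.5 times the interquartile range of the quartiles). The `box_stats()` helper is hypothetical and is not claimed to match `frm.aggregate_data()`.

```python
import pandas as pd


def box_stats(values: pd.Series) -> dict:
    """Compute box plot statistics using the 1.5 * IQR rule (an assumed convention)."""
    q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = values[(values >= lower_fence) & (values <= upper_fence)]
    outliers = values[(values < lower_fence) | (values > upper_fence)]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower": inside.min(),   # lower whisker end
        "upper": inside.max(),   # upper whisker end
        "outliers": outliers.tolist(),
    }


# Invented average-minutes-late values for two periods
demo = pd.DataFrame(
    {
        "Fiscal Year Quarter": ["2021Q4"] * 4 + ["2024Q3"] * 6,
        "Avg Min Late": [22.0, 25.0, 28.0, 30.0, 30.0, 35.0, 37.0, 40.0, 45.0, 128.0],
    }
)

# One dictionary of statistics per period
aggregated = {
    period: box_stats(group)
    for period, group in demo.groupby("Fiscal Year Quarter")["Avg Min Late"]
}

for period, stats in aggregated.items():
    print(period, stats)
```

In this example the 128-minute value falls above the upper fence for 2024Q3, so it is reported as an outlier and the upper whisker stops at 45.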

In [56]:
# Base columns for aggregation statistics
cols = [COLS["year_quarter"], COLS["late_detrn_avg_mm_late"]]

# Pre-aggregate the data
chrt_data = frm.aggregate_data(chrt_data, cols)

##### 3.6.2.3 Generate box plots

In [57]:
# Create chart title
txt = TRN["774"]
title_txt = (
    f"Amtrak {txt['name']} Train {txt['number']} Late Detraining Passengers\n"
    f"{txt['route']} ({txt['direction']})"
)
title = ttl.format_title(amtk_774_rte_stats, title_txt)

# Create and display the vertical boxplot
chart_vertical = boxp.create_boxplot(
    data=chrt_data,
    x_shorthand="Fiscal Year Quarter:N",
    x_title="Period",
    y_shorthand="Late Detraining Customers Avg Min Late:Q",
    y_title="Average Minutes Late",
    box_size=20,
    outlier_shorthand="outliers:Q",
    color_shorthand="Color:N",
    chart_title=title,
    orient=boxp.Orient.VERTICAL,
)
chart_vertical.display()

#### 3.6.3 _Pacific Surfliner_ Train 777, northbound, detraining passengers summary statistics [1 pt]

Departs Sunday to Friday from [Santa Fe Depot](https://www.amtrak.com/stations/san), San Diego, CA ([SAN](https://www.amtrak.com/stations/san)).

Review previous code employed to generate summary statistics for an Amtrak train. Then leverage functions available in the `amtk_network` and `amtk_detrain` modules to create three new `DataFrame` objects named `amtk_777`, `amtk_777_rte`, and `amtk_777_rte_stats`, respectively.

In [59]:
# Base columns for routes
rte_cols = [
    COLS["trn"],
    COLS["station_code"],
    COLS["station"],
    COLS["state"],
    COLS["lat"],
    COLS["lon"],
]

# Train 777 northbound
amtk_777 = ntwk.by_train_number(trains, 777)
amtk_777_rte = ntwk.create_route(amtk_777, TRN["777"]["direction"])
amtk_777_rte_stats = detrn.get_route_sum_stats(
    amtk_777_rte,
    COLS["station_code"],
    AGG["columns"],
    AGG["funcs"],
    rte_cols,
)
amtk_777_rte_stats.sort_values(by=[COLS["lat"]], ascending=True, inplace=True)

amtk_777_rte_stats

Unnamed: 0,Train Number,Arrival Station Code,Arrival Station,State,Latitude,Longitude,Train Arrivals,Total Detraining Customers sum,Total Detraining Customers mean,Total Detraining Customers median,Total Detraining Customers std,Late Detraining Customers sum,Late Detraining Customers mean,Late Detraining Customers median,Late Detraining Customers std,Late to Total Detraining Customers Ratio,Late Detraining Customers Avg Min Late mean,Total On Time Detraining Customers sum
0,777,OLT,San Diego (Old Town),California,32.755266,-117.200073,11,344,31.2727,37.0,19.9253,5,0.4545,0.0,0.8202,0.0145,29.6667,339
1,777,SOL,Solana Beach,California,32.992937,-117.271135,11,2227,202.4545,185.0,143.0569,23,2.0909,1.0,2.9139,0.0103,40.1667,2204
2,777,OSD,Oceanside,California,33.192515,-117.37943,11,4740,430.9091,456.0,163.0751,138,12.5455,12.0,11.8858,0.0291,48.5556,4602
3,777,SNC,San Juan Capistrano,California,33.501318,-117.663813,11,11642,1058.3636,1186.0,437.4074,900,81.8182,62.0,69.2717,0.0773,53.4,10742
4,777,IRV,Irvine,California,33.656771,-117.733695,12,14275,1189.5833,1382.5,584.7832,1080,90.0,78.5,70.7377,0.0757,55.4545,13195
5,777,SNA,Santa Ana,California,33.751629,-117.856607,12,9223,768.5833,819.5,272.3101,760,63.3333,51.0,44.7667,0.0824,51.0,8463
6,777,ANA,Anaheim,California,33.803803,-117.877308,12,20234,1686.1667,1513.5,550.7651,1781,148.4167,130.5,80.1322,0.088,51.5833,18453
7,777,FUL,Fullerton,California,33.868969,-117.922849,12,17801,1483.4167,1488.0,214.2112,1801,150.0833,150.0,66.3057,0.1012,44.5833,16000
8,777,LAX,Los Angeles,California,34.056177,-118.236778,12,128371,10697.5833,10209.0,1638.2428,11042,920.1667,828.0,324.0058,0.086,56.0833,117329
9,777,GDL,Glendale,California,34.123706,-118.258868,12,9433,786.0833,724.0,263.4117,1161,96.75,72.0,112.2636,0.1231,57.5833,8272


In [None]:
#hidden tests are within this cell

##### 3.6.3.1 Write to file [1 pt]

Write `amtk_777_rte_stats` to a CSV file named `stu-amtk_777_rte_stats.csv`. Store the file in the `data/student` directory. Then compare it to the accompanying `fxt-amtk_777_rte_stats.csv` file. It must match line for line, character for character.

In [62]:
filepath = parent_path.joinpath("data", "student", "stu-amtk_777_rte_stats.csv")
amtk_777_rte_stats.to_csv(filepath, index=False)

In [None]:
#hidden tests are within this cell

#### 3.6.4 _Pacific Surfliner_ Train 777: late detraining metrics (fiscal year and quarter) [1 pt]

Review the central tendency, dispersion, and shape for the mean late arrival times of _Pacific Surfliner_ Train 777. Call the custom function named `frm.describe_numeric_column()` to return a dictionary of summary statistics. Then visualize the data for each fiscal year and quarter with box plots.

In [63]:
# Drop missing values
amtk_777_avg_mm_late = amtk_777[COLS["late_detrn_avg_mm_late"]].dropna().reset_index(drop=True)

# Describe the column
amtk_777_avg_mm_late_describe = frm.describe_numeric_column(amtk_777_avg_mm_late)
amtk_777_avg_mm_late_describe

{'type': pandas.core.series.Series,
 'name': 'Late Detraining Customers Avg Min Late',
 'values': {'non_null': np.int64(296),
  'missing': np.int64(0),
  'dtype': dtype('float64')},
 'center': {'mean': np.float64(49.12837837837838),
  'median': 45.0,
  'mode': np.float64(40.0)},
 'position': {'min': 17.0,
  '25%': np.float64(37.0),
  '50%': np.float64(45.0),
  '75%': np.float64(56.25),
  'max': 137.0},
 'spread': {'variance': 349.14956481905637,
  'std': 18.68554427409211,
  'range': 120.0,
  'iqr': np.float64(19.25)},
 'shape': {'skewness': np.float64(1.772095312174482),
  'kurtosis': np.float64(4.790974152163351)}}

In [None]:
#hidden tests are within this cell

##### 3.6.4.1 Retrieve the chart data

In [64]:
# Base columns for average minutes late
cols = [COLS["year"], COLS["quarter"], COLS["late_detrn_avg_mm_late"]]

# Chart data
chrt_data = detrn.get_qtr_avg_min_late(
    amtk_777_rte, cols, COLS["year_quarter"], [COLORS["amtk_blue"], COLORS["amtk_red"]]
)
# chrt_data

##### 3.6.4.2 Preaggregate the data

In [65]:
# Base columns for aggregation statistics
cols = [COLS["year_quarter"], COLS["late_detrn_avg_mm_late"]]

# Pre-aggregate the data
chrt_data = frm.aggregate_data(chrt_data, cols)

##### 3.6.4.3 Generate box plots

Visualize the distribution of mean late arrival times for late detraining passengers. Illustrate with box plots.

In [66]:
# Create chart title
txt = TRN["777"]
title_txt = (
    f"Amtrak {txt['name']} Train {txt['number']} Late Detraining Passengers\n"
    f"{txt['route']} ({txt['direction']})"
)
title = ttl.format_title(amtk_777_rte_stats, title_txt)

# Create and display the vertical boxplot
chart_vertical = boxp.create_boxplot(
    data=chrt_data,
    x_shorthand="Fiscal Year Quarter:N",
    x_title="Period",
    y_shorthand="Late Detraining Customers Avg Min Late:Q",
    y_title="Average Minutes Late",
    box_size=20,
    outlier_shorthand="outliers:Q",
    color_shorthand="Color:N",
    chart_title=title,
    orient=boxp.Orient.VERTICAL,
)
chart_vertical.display()

## 4.0 Long-distance trains

Amtrak's long-distance trains operate across the United States. These trains are typically overnight services that connect major cities and regions. Long-distance trains are known for their scenic routes and dining car services.

### 4.1 _City of New Orleans_ service (Chicago - Memphis - New Orleans) [1 pt]

The [_City of New Orleans_](https://www.amtrak.com/city-of-new-orleans-train) operates daily between [Chicago Union Station](https://www.amtrak.com/stations/chi), Chicago, IL ([CHI](https://www.amtrak.com/stations/chi)) and [Union Passenger Terminal](https://www.amtrak.com/stations/NOL), New Orleans, LA ([NOL](https://www.amtrak.com/stations/NOL)) via [Central Station](https://www.amtrak.com/stations/mem), Memphis, TN ([MEM](https://www.amtrak.com/stations/mem)). The train revives the name of the Illinois Central Railroad's [_City of New Orleans_](https://en.wikipedia.org/wiki/City_of_New_Orleans_(train)), which operated between 1947 and 1971. The train is also known for its association with the classic tune ["City of New Orleans"](https://en.wikipedia.org/wiki/City_of_New_Orleans_(song)) written by Steve Goodman and popularized by Arlo Guthrie.

Retrieve the _City of New Orleans_ performance data by calling the appropriate `amtk_network` function. Assign the return value of the function call to a variable named `cno`.
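Filtering on a text column is sensitive to capitalization. In this dataset the sub service appears as "City Of New Orleans" (capital "Of"), so an exact match on "City of New Orleans" would return no rows. The sketch below demonstrates the pitfall on invented data; `filter_by_sub_service()` is a hypothetical helper and is not the `amtk_network` function.

```python
import pandas as pd


def filter_by_sub_service(frame, name, column="Sub Service"):
    """Return rows whose sub service matches name, ignoring case (hypothetical helper)."""
    mask = frame[column].str.casefold() == name.casefold()
    return frame.loc[mask].reset_index(drop=True)


# Invented rows using the capitalization found in the dataset
demo = pd.DataFrame(
    {
        "Sub Service": ["City Of New Orleans", "Vermonter", "City Of New Orleans"],
        "Train Number": [58, 57, 59],
    }
)

# Exact match with a lowercase "of" finds nothing
exact = demo[demo["Sub Service"] == "City of New Orleans"]

# Case-insensitive match finds both rows
relaxed = filter_by_sub_service(demo, "City of New Orleans")

print(len(exact), len(relaxed))
```

When calling a function that performs an exact match, pass the name exactly as it appears in the "Sub Service" column.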

In [69]:
trains

Unnamed: 0,Fiscal Year,Fiscal Quarter,Service Line,Service,Sub Service,Route Miles,Train Number,Arrival Station Code,Arrival Station,Arrival Station Type,...,State,Division,Region,Country,Latitude,Longitude,Total Detraining Customers,Late Detraining Customers,Late to Total Detraining Customers Ratio,Late Detraining Customers Avg Min Late
0,2024,3,Long Distance,Auto Train,Auto Train,914,52,LOR,Lorton (Auto Train),Station Building (with waiting room),...,Virginia,South Atlantic,South,United States,38.708143,-77.220942,42445,23316,0.54932,95.0
1,2024,3,Long Distance,Auto Train,Auto Train,914,53,SFA,Sanford (Auto Train),Station Building (with waiting room),...,Florida,South Atlantic,South,United States,28.808544,-81.291274,28034,18439,0.65774,91.0
2,2024,3,Long Distance,California Zephyr,California Zephyr,2408,5,BRL,Burlington,Station Building (with waiting room),...,Iowa,West North Central,Midwest,United States,40.805788,-91.101951,557,223,0.40036,54.0
3,2024,3,Long Distance,California Zephyr,California Zephyr,2408,5,COX,Colfax,Station Building (with waiting room),...,California,Pacific,West,United States,39.099172,-120.953075,508,326,0.64173,99.0
4,2024,3,Long Distance,California Zephyr,California Zephyr,2408,5,CRN,Creston,Station Building (with waiting room),...,Iowa,West North Central,Midwest,United States,41.056920,-94.361617,205,144,0.70244,67.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
68407,2021,4,State Supported,Vermonter,Vermonter,602,57,WAS,Washington,Station Building (with waiting room),...,District of Columbia,South Atlantic,South,United States,38.896993,-77.006422,5191,187,0.03602,37.0
68408,2021,4,State Supported,Vermonter,Vermonter,602,57,WIL,Wilmington,Station Building (with waiting room),...,Delaware,South Atlantic,South,United States,39.737263,-75.551095,464,45,0.09698,28.0
68409,2021,4,State Supported,Vermonter,Vermonter,602,57,WNL,Windsor Locks,Platform with Shelter,...,Connecticut,New England,Northeast,United States,41.913956,-72.626101,21,12,0.57143,35.0
68410,2021,4,State Supported,Vermonter,Vermonter,602,57,WNM,Windsor,Platform only (no shelter),...,Vermont,New England,Northeast,United States,43.479908,-72.384985,14,10,0.71429,26.0


In [70]:
cno = ntwk.by_sub_service(trains, "City Of New Orleans")
cno

Unnamed: 0,Fiscal Year,Fiscal Quarter,Service Line,Service,Sub Service,Route Miles,Train Number,Arrival Station Code,Arrival Station,Arrival Station Type,...,State,Division,Region,Country,Latitude,Longitude,Total Detraining Customers,Late Detraining Customers,Late to Total Detraining Customers Ratio,Late Detraining Customers Avg Min Late
0,2024,3,Long Distance,City Of New Orleans,City Of New Orleans,930,58,BRH,Brookhaven,Station Building (with waiting room),...,Mississippi,East South Central,South,United States,31.582961,-90.441070,166,108,0.65060,37.0
1,2024,3,Long Distance,City Of New Orleans,City Of New Orleans,930,58,CDL,Carbondale (Amtrak),Station Building (with waiting room),...,Illinois,East North Central,Midwest,United States,37.724235,-89.216628,857,170,0.19837,126.0
2,2024,3,Long Distance,City Of New Orleans,City Of New Orleans,930,58,CEN,Centralia,Station Building (with waiting room),...,Illinois,East North Central,Midwest,United States,38.527531,-89.136118,84,32,0.38095,52.0
3,2024,3,Long Distance,City Of New Orleans,City Of New Orleans,930,58,CHI,Chicago (Union Station),Station Building (with waiting room),...,Illinois,East North Central,Midwest,United States,41.878992,-87.641015,16591,2365,0.14255,145.0
4,2024,3,Long Distance,City Of New Orleans,City Of New Orleans,930,58,CHM,Champaign-Urbana,Station Building (with waiting room),...,Illinois,East North Central,Midwest,United States,40.115843,-88.241366,1159,532,0.45902,88.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
531,2022,1,Long Distance,City Of New Orleans,City Of New Orleans,930,59,MEM,Memphis,Station Building (with waiting room),...,Tennessee,East South Central,South,United States,35.132458,-90.059109,2294,324,0.14124,127.0
532,2022,1,Long Distance,City Of New Orleans,City Of New Orleans,930,59,MKS,Marks,Platform with Shelter,...,Mississippi,East South Central,South,United States,34.258227,-90.272342,191,54,0.28272,62.0
533,2022,1,Long Distance,City Of New Orleans,City Of New Orleans,930,59,NBN,Newbern-Dyersburg,Station Building (with waiting room),...,Tennessee,East South Central,South,United States,36.112711,-89.262264,239,125,0.52301,78.0
534,2022,1,Long Distance,City Of New Orleans,City Of New Orleans,930,59,NOL,New Orleans,Station Building (with waiting room),...,Louisiana,West South Central,South,United States,29.946085,-90.078291,6351,721,0.11353,167.0


In [None]:
#hidden tests are within this cell

### 4.2 _City of New Orleans_: on-time performance metrics (entire period)

_City of New Orleans_ performance data is a compilation of quarterly metrics that focus on late
detraining passengers. Detraining passengers are considered on-time if they arrive at their
destination no later than fifteen (`15`) minutes after their scheduled arrival time. All other
detraining passengers are considered late.

In [71]:
# Total train arrivals
cno_trn_arrivals = cno.shape[0]

# Detraining totals
cno_detrn = cno[COLS["total_detrn"]].sum()
cno_detrn_late = cno[COLS["late_detrn"]].sum()
cno_detrn_on_time = cno_detrn - cno_detrn_late

print(
    f"Train Arrivals: {cno_trn_arrivals}",
    f"Total Detraining Customers: {cno_detrn}",
    f"Late Detraining Customers: {cno_detrn_late}",
    f"On-Time Detraining Customers: {cno_detrn_on_time}",
    sep="\n",
)

# Compute summary statistics
cno_stats = detrn.get_sum_stats(cno, AGG["columns"], AGG["funcs"])
cno_stats

Train Arrivals: 536
Total Detraining Customers: 570194
Late Detraining Customers: 159831
On-Time Detraining Customers: 410363


Unnamed: 0,Train Arrivals,Total Detraining Customers sum,Total Detraining Customers mean,Total Detraining Customers median,Total Detraining Customers std,Late Detraining Customers sum,Late Detraining Customers mean,Late Detraining Customers median,Late Detraining Customers std,Late to Total Detraining Customers Ratio,Late Detraining Customers Avg Min Late mean,Total On Time Detraining Customers sum
0,536,570194.0,1063.7948,212.5,2503.7947,159831.0,298.1922,95.5,605.9606,0.2803,,410363.0


### 4.3 _City of New Orleans_ trains [1 pt]

Each _City of New Orleans_ train is identified by a unique train number.

Create a `DataFrame` named `cno_trns` that contains one row for each train comprising the _City of New Orleans_ service. Include the following columns in the `DataFrame` in the order specified:

1. "Service Line"
2. "Service"
3. "Sub Service"
4. "Route Miles"
5. "Train Number"

Reset the index (set `drop=True`) when creating the new `DataFrame`.

In [72]:
cno_trns = cno.drop_duplicates(subset='Train Number', ignore_index=True)
cno_trns = cno_trns[["Service Line", "Service", "Sub Service", "Route Miles", "Train Number"]]
cno_trns.head(3)

Unnamed: 0,Service Line,Service,Sub Service,Route Miles,Train Number
0,Long Distance,City Of New Orleans,City Of New Orleans,930,58
1,Long Distance,City Of New Orleans,City Of New Orleans,930,59
2,Long Distance,City Of New Orleans,City Of New Orleans,930,1058


In [None]:
#hidden tests are within this cell

### 4.4 _City of New Orleans_: mean late arrival times summary statistics

Review the central tendency, dispersion, and shape for the mean late arrival times of _City of New Orleans_ trains. Call the custom function named `frm.describe_numeric_column()` to return a dictionary of summary statistics.

In [76]:
# Drop missing values
cno_avg_mm_late = cno[COLS["late_detrn_avg_mm_late"]].dropna().reset_index(drop=True)

# Call the custom frm.describe_numeric_column() function again
cno_avg_mm_late_describe = frm.describe_numeric_column(cno_avg_mm_late)
cno_avg_mm_late_describe

{'type': pandas.core.series.Series,
 'name': 'Late Detraining Customers Avg Min Late',
 'values': {'non_null': np.int64(493),
  'missing': np.int64(0),
  'dtype': dtype('float64')},
 'center': {'mean': np.float64(71.50101419878297),
  'median': 68.0,
  'mode': np.float64(68.0)},
 'position': {'min': 18.0,
  '25%': np.float64(51.0),
  '50%': np.float64(68.0),
  '75%': np.float64(85.0),
  'max': 387.0},
 'spread': {'variance': 1251.2423770180908,
  'std': 35.372904560102086,
  'range': 369.0,
  'iqr': np.float64(34.0)},
 'shape': {'skewness': np.float64(2.54874272793014),
  'kurtosis': np.float64(15.515159904674796)}}

The skewness and kurtosis values returned suggest that the distribution of mean late arrival times of _City of New Orleans_ trains is positively skewed and features a sharper peak and heavier right tail than a normal distribution. Let's confirm this visually by generating a histogram.

### 4.5 _City of New Orleans_: visualize distribution of mean late arrival times

Visualize mean late arrival times for the entire period. The data is binned prior to plotting.

#### 4.5.1 Create the chart data

In [77]:
# Convert to DataFrame
cno_avg_mm_late = cno_avg_mm_late.to_frame(name=COLS["avg_mm_late"])

# Get mean and standard deviation
mu = cno_avg_mm_late_describe["center"]["mean"]
sigma = cno_avg_mm_late_describe["spread"]["std"]

# Get max value (for x-axis ticks); pad max value for chart display
max_val = cno_avg_mm_late_describe["position"]["max"]
max_val_ceil = (np.ceil(max_val / 10) * 10).astype(int)

# Create bins
cno_mm_late, bins, num_bins, bin_width = frm.create_bins(cno_avg_mm_late, COLS["avg_mm_late"], 10)

# Bin the data
chrt_data = frm.bin_data(cno_mm_late, COLS["avg_mm_late"], bins)
# chrt_data

#### 4.5.2 Generate the histogram

In [78]:
# Chart title
title_txt = f"Amtrak {SUB_SVC['cno']} Service Late Detraining Passengers"
title = ttl.format_title(cno_stats, title_txt)

# Tooltips
tooltip_config = [
    {"shorthand": "bin_center:Q", "title": "Average Minutes Late", "format": None},
    {"shorthand": "count:Q", "title": "Late Arrivals Count", "format": None},
]

chart = hst.create_histogram(
    frame=chrt_data,
    x_shorthand="bin_center:Q",
    x_title="Average Minutes Late",
    y_shorthand="count:Q",
    y_title="Late Arrivals Count",
    y_stack=False,
    line_shorthand="Avg Min Late:Q",
    mu=mu,
    sigma=sigma,
    num_bins=num_bins,
    bin_width=bin_width,
    x_tick_count_max=max_val_ceil,
    bar_color=COLORS["amtk_blue"],
    mu_color=COLORS["amtk_red"],
    sigma_color=COLORS["anth_gray"],
    tooltip_config=tooltip_config,
    title=title,
)
chart.display()

### 4.6 _City of New Orleans_ Trains 59 & 58

_City of New Orleans_ trains 59 (southbound) and 58 (northbound) operate daily between [Chicago Union Station](https://www.amtrak.com/stations/chi), Chicago, IL ([CHI](https://www.amtrak.com/stations/chi)) and [Union Passenger Terminal](https://www.amtrak.com/stations/NOL), New Orleans, LA ([NOL](https://www.amtrak.com/stations/NOL)).

#### 4.6.1 _City of New Orleans_ Train 59, southbound, detraining passengers summary statistics [1 pt]

Departs daily from [Chicago Union Station](https://www.amtrak.com/stations/chi), Chicago, IL ([CHI](https://www.amtrak.com/stations/chi)).

In [81]:
# Base columns for routes
rte_cols = [COLS["trn"], COLS["station_code"], COLS["station"], COLS["state"], COLS["lat"], COLS["lon"]]

# Train 59 southbound
amtk_59 = ntwk.by_train_number(trains, 59)
amtk_59_rte = ntwk.create_route(amtk_59, "southbound")
amtk_59_rte_stats = detrn.get_route_sum_stats(
    amtk_59_rte, COLS["station_code"], AGG["columns"], AGG["funcs"], rte_cols
)
amtk_59_rte_stats

Unnamed: 0,Train Number,Arrival Station Code,Arrival Station,State,Latitude,Longitude,Train Arrivals,Total Detraining Customers sum,Total Detraining Customers mean,Total Detraining Customers median,Total Detraining Customers std,Late Detraining Customers sum,Late Detraining Customers mean,Late Detraining Customers median,Late Detraining Customers std,Late to Total Detraining Customers Ratio,Late Detraining Customers Avg Min Late mean,Total On Time Detraining Customers sum
0,59,HMW,Homewood,Illinois,41.562446,-87.668685,11,469,42.6364,44.0,18.0902,86,7.8182,9.0,3.7635,0.1834,88.1818,383
1,59,KKI,Kankakee,Illinois,41.119259,-87.86543,11,2392,217.4545,220.0,48.9374,583,53.0,54.0,9.3381,0.2437,65.5455,1809
2,59,CHM,Champaign-Urbana,Illinois,40.115843,-88.241366,11,54699,4972.6364,4972.0,960.7651,14149,1286.2727,1262.0,238.6508,0.2587,66.7273,40550
3,59,MAT,Mattoon,Illinois,39.48273,-88.376045,11,7434,675.8182,675.0,236.6866,2722,247.4545,219.0,93.7213,0.3662,63.6364,4712
4,59,EFG,Effingham,Illinois,39.117059,-88.54711,11,3548,322.5455,333.0,111.2622,1397,127.0,121.0,41.9952,0.3937,63.7273,2151
5,59,CEN,Centralia,Illinois,38.527531,-89.136118,11,2435,221.3636,215.0,81.8099,1074,97.6364,99.0,32.7667,0.4411,62.1818,1361
6,59,CDL,Carbondale (Amtrak),Illinois,37.724235,-89.216628,11,15294,1390.3636,1421.0,442.8492,5902,536.5455,450.0,201.4921,0.3859,67.3636,9392
7,59,FTN,Fulton,Kentucky,36.525704,-88.888772,11,3159,287.1818,281.0,104.772,1920,174.5455,161.0,72.6682,0.6078,66.8182,1239
8,59,NBN,Newbern-Dyersburg,Tennessee,36.112711,-89.262264,11,3105,282.2727,258.0,70.9804,1775,161.3636,157.0,36.9277,0.5717,67.3636,1330
9,59,MEM,Memphis,Tennessee,35.132458,-90.059109,11,31894,2899.4545,2749.0,790.1422,6495,590.4545,620.0,189.5006,0.2036,101.9091,25399


In [None]:
#hidden tests are within this cell
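
The derived columns in the Train 59 table above can be cross-checked against the sums. The following is a minimal sketch, assuming (the `amtk_detrain` source is not shown here) that the ratio is the late sum divided by the total sum, rounded to four decimal places, and that the on-time sum is the difference between the two. The three rows are copied from the output above; `ratio_check` and `on_time_check` are illustrative column names.

```python
import pandas as pd

# Three rows copied from the Train 59 summary statistics output
stats = pd.DataFrame(
    {
        "Arrival Station Code": ["HMW", "KKI", "MEM"],
        "Total Detraining Customers sum": [469, 2392, 31894],
        "Late Detraining Customers sum": [86, 583, 6495],
    }
)

# Assumption: ratio = late sum / total sum, rounded to four decimal places
stats["ratio_check"] = (
    stats["Late Detraining Customers sum"] / stats["Total Detraining Customers sum"]
).round(4)

# Assumption: on-time sum = total sum - late sum
stats["on_time_check"] = (
    stats["Total Detraining Customers sum"] - stats["Late Detraining Customers sum"]
)

print(stats[["Arrival Station Code", "ratio_check", "on_time_check"]])
```

Under these assumptions the recomputed values agree with the `Late to Total Detraining Customers Ratio` and `Total On Time Detraining Customers sum` columns for the three stations.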

##### 4.6.1.1 Write to file [1 pt]

Write `amtk_59_rte_stats` to a CSV file named `stu-amtk_59_rte_stats.csv` and store it in the `data/student` directory. Then compare it to the accompanying `fxt-amtk-59_route_stats.csv` file. The two files must match line for line and character for character.

In [82]:
# Write the Train 59 route summary statistics to a CSV file (omit the index)
filepath = parent_path.joinpath("data", "student", "stu-amtk_59_rte_stats.csv")
amtk_59_rte_stats.to_csv(filepath, index=False)

In [None]:
#hidden tests are within this cell

#### 4.6.2 _City of New Orleans_ Train 59: late detraining metrics (fiscal year and quarter)

Review the central tendency, dispersion, and shape of the average minutes late recorded for late detraining passengers on _City of New Orleans_ Train 59. Call the custom function `frm.describe_numeric_column()` to return a dictionary of summary statistics. Then visualize the distribution for each fiscal year quarter with a box plot.

In [83]:
# Drop missing values
amtk_59_avg_mm_late = amtk_59[COLS["late_detrn_avg_mm_late"]].dropna().reset_index(drop=True)

# Describe the column
amtk_59_avg_mm_late_describe = frm.describe_numeric_column(amtk_59_avg_mm_late)
amtk_59_avg_mm_late_describe

{'type': pandas.core.series.Series,
 'name': 'Late Detraining Customers Avg Min Late',
 'values': {'non_null': np.int64(209),
  'missing': np.int64(0),
  'dtype': dtype('float64')},
 'center': {'mean': np.float64(81.14832535885168),
  'median': 76.0,
  'mode': np.float64(60.0)},
 'position': {'min': 41.0,
  '25%': np.float64(62.0),
  '50%': np.float64(76.0),
  '75%': np.float64(91.0),
  'max': 172.0},
 'spread': {'variance': 677.0788553551712,
  'std': 26.020738947139282,
  'range': 131.0,
  'iqr': np.float64(29.0)},
 'shape': {'skewness': np.float64(1.2400740474347807),
  'kurtosis': np.float64(1.7221454158217755)}}
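
The entries in the dictionary above are related to one another, and those relationships offer a quick sanity check. The sketch below copies values from the Train 59 output and confirms that the range equals `max - min`, the IQR equals `75% - 25%`, and the variance equals the squared standard deviation. The variable names are illustrative only.

```python
import math

# Values copied from the Train 59 summary dictionary
center = {"mean": 81.14832535885168, "median": 76.0}
position = {"min": 41.0, "25%": 62.0, "50%": 76.0, "75%": 91.0, "max": 172.0}
spread = {
    "variance": 677.0788553551712,
    "std": 26.020738947139282,
    "range": 131.0,
    "iqr": 29.0,
}

# Dispersion measures derived from the position measures
range_ok = position["max"] - position["min"] == spread["range"]
iqr_ok = position["75%"] - position["25%"] == spread["iqr"]

# Variance is the square of the standard deviation (allowing for float rounding)
std_ok = math.isclose(spread["std"] ** 2, spread["variance"], rel_tol=1e-9)

# A mean above the median is consistent with the positive skewness reported
mean_above_median = center["mean"] > center["median"]

print(range_ok, iqr_ok, std_ok, mean_above_median)
```

A mean roughly five minutes above the median, together with a skewness of about 1.24, points to a right-skewed distribution: a few very late arrivals pull the mean upward.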

##### 4.6.2.1 Retrieve the chart data

In [84]:
# Base columns for chart data
cols = [COLS["year"], COLS["quarter"], COLS["late_detrn_avg_mm_late"]]

# Get the chart data
chrt_data = detrn.get_qtr_avg_min_late(
    amtk_59_rte, cols, COLS["year_quarter"], [COLORS["amtk_blue"], COLORS["amtk_red"]]
)
chrt_data

Unnamed: 0,Fiscal Year Quarter,Late Detraining Customers Avg Min Late,Color
0,2022Q1,107.0,#ef3824
1,2022Q1,57.0,#ef3824
2,2022Q1,51.0,#ef3824
3,2022Q1,59.0,#ef3824
4,2022Q1,68.0,#ef3824
...,...,...,...
204,2024Q3,121.0,#ef3824
205,2024Q3,67.0,#ef3824
206,2024Q3,74.0,#ef3824
207,2024Q3,71.0,#ef3824


##### 4.6.2.2 Preaggregate the data

In [85]:
# Base columns for aggregation statistics
cols = [COLS["year_quarter"], COLS["late_detrn_avg_mm_late"]]

# Pre-aggregate the data
chrt_data = frm.aggregate_data(chrt_data, cols)
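
The internals of `frm.aggregate_data()` are not shown in this notebook. As a rough illustration of what pre-aggregating box plot data can involve, the sketch below computes quartiles, whisker ends, and outliers per period with pandas. `preaggregate_box_stats`, its output column names, and the 1.5 × IQR whisker rule are all assumptions rather than the project's confirmed behavior. The sample holds only the head and tail rows visible in the Train 59 chart data, so its statistics are not those of the full quarters.

```python
import pandas as pd


def preaggregate_box_stats(frame: pd.DataFrame, group_col: str, value_col: str) -> pd.DataFrame:
    """Hypothetical helper: return one row of box plot statistics per group."""
    rows = []
    for period, values in frame.groupby(group_col)[value_col]:
        q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
        iqr = q3 - q1
        # Assumption: whiskers extend to the most extreme values within 1.5 x IQR
        lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        within = values.between(lower_fence, upper_fence)
        rows.append(
            {
                group_col: period,
                "q1": q1,
                "median": median,
                "q3": q3,
                "whisker_low": values[within].min(),
                "whisker_high": values[within].max(),
                "outliers": values[~within].tolist(),
            }
        )
    return pd.DataFrame(rows)


# Visible head and tail rows of the Train 59 chart data (not the full data set)
sample = pd.DataFrame(
    {
        "Fiscal Year Quarter": ["2022Q1"] * 5 + ["2024Q3"] * 4,
        "Late Detraining Customers Avg Min Late": [
            107.0, 57.0, 51.0, 59.0, 68.0, 121.0, 67.0, 74.0, 71.0,
        ],
    }
)
box_stats = preaggregate_box_stats(
    sample, "Fiscal Year Quarter", "Late Detraining Customers Avg Min Late"
)
print(box_stats)
```

Computing these statistics up front means the chart only needs one small row per period instead of every observation, which keeps the chart specification compact.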

##### 4.6.2.3 Generate box plots

In [86]:
# Create chart title
txt = TRN["59"]
title_txt = (
    f"Amtrak {txt['name']} Train {txt['number']} Late Detraining Passengers\n"
    f"{txt['route']} ({txt['direction']})"
)
title = ttl.format_title(amtk_59_rte_stats, title_txt)

# Create and display vertical boxplots
chart_vertical = boxp.create_boxplot(
    data=chrt_data,
    x_shorthand="Fiscal Year Quarter:N",
    x_title="Period",
    y_shorthand="Late Detraining Customers Avg Min Late:Q",
    y_title="Average Minutes Late",
    box_size=20,
    outlier_shorthand="outliers:Q",
    color_shorthand="Color:N",
    chart_title=title,
    orient=boxp.Orient.VERTICAL,
)
chart_vertical.display()

#### 4.6.3 _City of New Orleans_ Train 58, northbound, detraining passengers summary statistics [1 pt]

Departs daily from [Union Passenger Terminal](https://www.amtrak.com/stations/NOL), New Orleans, LA ([NOL](https://www.amtrak.com/stations/NOL)). Review the code used earlier to generate summary statistics for an Amtrak train. Then use the functions available in the `amtk_network` and `amtk_detrain` modules to create three new `DataFrame` objects named `amtk_58`, `amtk_58_rte`, and `amtk_58_rte_stats`.

In [88]:
# Train 58 northbound
amtk_58 = ntwk.by_train_number(trains, 58)
amtk_58_rte = ntwk.create_route(amtk_58, "northbound")
amtk_58_rte_stats = detrn.get_route_sum_stats(
    amtk_58_rte, COLS["station_code"], AGG["columns"], AGG["funcs"], rte_cols
)
amtk_58_rte_stats


Unnamed: 0,Train Number,Arrival Station Code,Arrival Station,State,Latitude,Longitude,Train Arrivals,Total Detraining Customers sum,Total Detraining Customers mean,Total Detraining Customers median,Total Detraining Customers std,Late Detraining Customers sum,Late Detraining Customers mean,Late Detraining Customers median,Late Detraining Customers std,Late to Total Detraining Customers Ratio,Late Detraining Customers Avg Min Late mean,Total On Time Detraining Customers sum
0,58,HMD,Hammond,Louisiana,30.50718,-90.462169,11,2264,205.8182,198.0,69.5698,1126,102.3636,98.0,40.1952,0.4973,30.6364,1138
1,58,MCB,McComb,Mississippi,31.244467,-90.451334,11,1698,154.3636,155.0,48.5042,1025,93.1818,89.0,30.8539,0.6037,34.4545,673
2,58,BRH,Brookhaven,Mississippi,31.582961,-90.44107,11,1558,141.6364,150.0,60.3892,985,89.5455,78.0,39.7577,0.6322,41.9091,573
3,58,HAZ,Hazlehurst,Mississippi,31.86132,-90.394347,11,959,87.1818,78.0,31.1731,657,59.7273,55.0,22.4058,0.6851,38.3636,302
4,58,JAN,Jackson,Mississippi,32.300808,-90.190936,11,21050,1913.6364,1970.0,573.9058,2584,234.9091,203.0,163.0825,0.1228,63.2727,18466
5,58,YAZ,Yazoo City,Mississippi,32.848477,-90.41523,11,993,90.2727,72.0,50.3212,200,18.1818,17.0,10.9619,0.2014,60.8182,793
6,58,GWD,Greenwood,Mississippi,33.517159,-90.176454,11,4154,377.6364,379.0,131.9661,1155,105.0,96.0,48.0687,0.278,61.6364,2999
7,58,MKS,Marks,Mississippi,34.258227,-90.272342,11,2664,242.1818,208.0,69.0924,1112,101.0909,106.0,42.7819,0.4174,59.3636,1552
8,58,MEM,Memphis,Tennessee,35.132458,-90.059109,11,43340,3940.0,4042.0,1226.0761,11764,1069.4545,870.0,538.3936,0.2714,74.1818,31576
9,58,NBN,Newbern-Dyersburg,Tennessee,36.112711,-89.262264,11,1459,132.6364,145.0,55.9665,639,58.0909,50.0,28.137,0.438,62.0909,820


In [None]:
#hidden tests are within this cell
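
Column names such as `Total Detraining Customers sum` in the table above resemble the result of a grouped aggregation whose two-level column index has been flattened. The sketch below shows that general pandas pattern on made-up numbers. It is not the implementation of `detrn.get_route_sum_stats()`, and the input values are invented for illustration.

```python
import pandas as pd

# Invented quarterly records for two stations (illustrative values only)
records = pd.DataFrame(
    {
        "Arrival Station Code": ["HMD", "HMD", "MCB", "MCB"],
        "Total Detraining Customers": [200, 210, 150, 160],
        "Late Detraining Customers": [100, 90, 95, 85],
    }
)

# Aggregate each measure with several functions: the result has two-level columns
route_stats = records.groupby("Arrival Station Code").agg(
    {
        "Total Detraining Customers": ["sum", "mean"],
        "Late Detraining Customers": ["sum", "mean"],
    }
)

# Flatten ("Total Detraining Customers", "sum") to "Total Detraining Customers sum"
route_stats.columns = [" ".join(pair) for pair in route_stats.columns]
route_stats = route_stats.reset_index()

# Share of detraining customers who arrived late
route_stats["Late to Total Detraining Customers Ratio"] = (
    route_stats["Late Detraining Customers sum"]
    / route_stats["Total Detraining Customers sum"]
).round(4)

print(route_stats)
```

Note that `groupby` sorts the station codes alphabetically by default, whereas the route tables in this notebook list stations in order of travel, so the real function evidently applies its own ordering.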

##### 4.6.3.1 Write to file [1 pt]

Write `amtk_58_rte_stats` to a CSV file named `stu-amtk_58_rte_stats.csv` and store it in the `data/student` directory. Then compare it to the accompanying `fxt-amtk-58_route_stats.csv` file. The two files must match line for line and character for character.

In [91]:
# Write the Train 58 route summary statistics to a CSV file (omit the index)
filepath = parent_path.joinpath("data", "student", "stu-amtk_58_rte_stats.csv")
amtk_58_rte_stats.to_csv(filepath, index=False)

In [None]:
#hidden tests are within this cell

#### 4.6.4 _City of New Orleans_ Train 58: late detraining metrics (fiscal year and quarter)

Review the central tendency, dispersion, and shape of the average minutes late recorded for late detraining passengers on _City of New Orleans_ Train 58. Call the custom function `frm.describe_numeric_column()` to return a dictionary of summary statistics. Then visualize the distribution for each fiscal year quarter with a box plot.

In [92]:
# Drop missing values
amtk_58_avg_mm_late = amtk_58[COLS["late_detrn_avg_mm_late"]].dropna().reset_index(drop=True)

# Describe the column
amtk_58_avg_mm_late_describe = frm.describe_numeric_column(amtk_58_avg_mm_late)
amtk_58_avg_mm_late_describe

{'type': pandas.core.series.Series,
 'name': 'Late Detraining Customers Avg Min Late',
 'values': {'non_null': np.int64(209),
  'missing': np.int64(0),
  'dtype': dtype('float64')},
 'center': {'mean': np.float64(66.76555023923444),
  'median': 62.0,
  'mode': np.float64(76.0)},
 'position': {'min': 23.0,
  '25%': np.float64(45.0),
  '50%': np.float64(62.0),
  '75%': np.float64(81.0),
  'max': 221.0},
 'spread': {'variance': 960.7764998159735,
  'std': 30.996394948702882,
  'range': 198.0,
  'iqr': np.float64(36.0)},
 'shape': {'skewness': np.float64(1.4509054944484152),
  'kurtosis': np.float64(3.869952017175466)}}
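
The quartiles in the dictionary above also indicate whether a box plot of the pooled values would show outliers. The sketch below applies the conventional 1.5 × IQR rule (Tukey fences) to the Train 58 quartiles. Whether `boxp.create_boxplot()` uses the same rule is an assumption, and `tukey_fences` is a hypothetical helper.

```python
def tukey_fences(q1: float, q3: float, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences at k times the interquartile range."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles and extremes copied from the Train 58 summary dictionary
q1_58, q3_58 = 45.0, 81.0
min_58, max_58 = 23.0, 221.0

lower_58, upper_58 = tukey_fences(q1_58, q3_58)

# Extremes beyond a fence imply at least one outlier on that side
has_low_outlier = min_58 < lower_58
has_high_outlier = max_58 > upper_58

print(lower_58, upper_58, has_low_outlier, has_high_outlier)
```

With fences at -9 and 135 minutes, the 221 minute maximum lies well above the upper fence while the 23 minute minimum lies inside the lower one. This matches the positive skewness and the relatively high kurtosis reported for Train 58.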

##### 4.6.4.1 Retrieve the chart data [1 pt]

In [97]:
# Base columns for average minutes late
cols = [COLS["year"], COLS["quarter"], COLS["late_detrn_avg_mm_late"]]

# Chart data
chrt_data = detrn.get_qtr_avg_min_late(
    amtk_58_rte, cols, COLS["year_quarter"], [COLORS["amtk_blue"], COLORS["amtk_red"]]
)
chrt_data

Unnamed: 0,Fiscal Year Quarter,Late Detraining Customers Avg Min Late,Color
0,2022Q1,34.0,#ef3824
1,2022Q1,35.0,#ef3824
2,2022Q1,40.0,#ef3824
3,2022Q1,56.0,#ef3824
4,2022Q1,59.0,#ef3824
...,...,...,...
204,2024Q3,76.0,#ef3824
205,2024Q3,88.0,#ef3824
206,2024Q3,84.0,#ef3824
207,2024Q3,77.0,#ef3824


In [None]:
#hidden tests are within this cell
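
The `Fiscal Year Quarter` labels in the output above, such as `2022Q1`, combine a year and a quarter into a single value. A minimal sketch of building such a label with pandas follows. The `Fiscal Year` and `Fiscal Quarter` column names, the integer data types, and the 50.0 value are assumptions made for illustration, and the code is not taken from `detrn.get_qtr_avg_min_late()`.

```python
import pandas as pd

# Hypothetical input columns; 76.0 and 34.0 appear in the output above, 50.0 is invented
records = pd.DataFrame(
    {
        "Fiscal Year": [2024, 2022, 2023],
        "Fiscal Quarter": [3, 1, 2],
        "Late Detraining Customers Avg Min Late": [76.0, 34.0, 50.0],
    }
)

# Concatenate year and quarter into a label such as "2022Q1"
records["Fiscal Year Quarter"] = (
    records["Fiscal Year"].astype(str) + "Q" + records["Fiscal Quarter"].astype(str)
)

# Labels of this form sort chronologically when sorted as plain strings
ordered = records.sort_values("Fiscal Year Quarter").reset_index(drop=True)
print(ordered[["Fiscal Year Quarter", "Late Detraining Customers Avg Min Late"]])
```

Because a four-digit year precedes a single-digit quarter, string order and chronological order coincide, which is convenient for a nominal x-axis.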

##### 4.6.4.2 Preaggregate the data

In [98]:
# Base columns for average minutes late
cols = [COLS["year_quarter"], COLS["late_detrn_avg_mm_late"]]

# Pre-aggregate the data
chrt_data = frm.aggregate_data(chrt_data, cols)

##### 4.6.4.3 Generate box plots

In [99]:
# Create chart title
txt = TRN["58"]
title_txt = (
    f"Amtrak {txt['name']} Train {txt['number']} Late Detraining Passengers\n"
    f"{txt['route']} ({txt['direction']})"
)
title = ttl.format_title(amtk_58_rte_stats, title_txt)

# Create and display vertical boxplots
chart_vertical = boxp.create_boxplot(
    data=chrt_data,
    x_shorthand="Fiscal Year Quarter:N",
    x_title="Period",
    y_shorthand="Late Detraining Customers Avg Min Late:Q",
    y_title="Average Minutes Late",
    box_size=20,
    outlier_shorthand="outliers:Q",
    color_shorthand="Color:N",
    chart_title=title,
    orient=boxp.Orient.VERTICAL,
)
chart_vertical.display()

## 5.0 Watermark

In [None]:
%load_ext watermark
%watermark -h -i -iv -m -v