# Explore: Amtrak Network

## Intercity Passenger Rail Service Station Performance Metrics

The Amtrak [network](https://www.amtrak.com/content/dam/projects/dotcom/english/public/documents/Maps/Amtrak-System-Map-020923.pdf)
provides intercity passenger rail service in the contiguous United States and to select Canadian
cities. The network is operated by the
[National Railroad Passenger Corporation](https://railroads.dot.gov/passenger-rail/amtrak/amtrak),
a federally chartered corporation that is operated as a for-profit company, receives federal and
state funding, and earns revenue by selling tickets and providing other services.

This notebook begins exploring the augmented quarterly
[Amtrak](https://www.amtrak.com/home.html) station performance metrics. The goal is to better
understand the Amtrak network as a whole and identify potential areas for further analysis.

### Variable names

A number of variable names in this project use the following abbreviations. The naming strategy
balances brevity and readability:

* `amtk`: Amtrak (reporting mark)
* `chrt`: chart
* `cols`: columns
* `const`: constant
* `cwd`: current working directory
* `eb`: eastbound direction of travel
* `lm`: linear model
* `mi`: miles
* `mm`: minutes (ISO 8601)
* `nb`: northbound direction of travel
* `psgr`: passenger
* `qtr`: quarter
* `rte`: route
* `sb`: southbound direction of travel
* `stats`: summary statistics
* `stn`: station
* `stns`: stations
* `svc`: service
* `trn`: train
* `wb`: westbound direction of travel

In [1]:
import numpy as np
import pandas as pd
import pathlib as pl
import scipy.stats as stats
import tomllib as tl

import fra_amtrak.amtk_detrain as detrn
import fra_amtrak.amtk_frame as frm
import fra_amtrak.chart_bar as bar
import fra_amtrak.chart_box_preagg as boxp
import fra_amtrak.chart_hist as hst
import fra_amtrak.chart_title as ttl

## 1.0 Read files

### 1.1 Resolve paths


In [2]:
parent_path = pl.Path.cwd()  # current working directory
parent_path

PosixPath('/home/jovyan/work/assignments/Course4')

### 1.2 Load constants

Load a companion [TOML](https://toml.io/en/) file named `notebook.toml` containing constants.

In [3]:
filepath = parent_path.joinpath("notebook.toml")
with open(filepath, "rb") as file_obj:
    const = tl.load(file_obj)

# Access constants
AGG = const["agg"]
CHRT_BAR = const["chart"]["bar"]
COLORS = const["colors"]
COLS = const["columns"]


### 1.3 Retrieve performance data

In [4]:
filepath = parent_path.joinpath("data", "processed", "station_performance_metrics-v1p2.csv")
network = pd.read_csv(
    filepath, dtype={"Address 02": "str", "ZIP Code": "str"}, low_memory=False
)  # avoid DtypeWarning


### 1.4 Review the `DataFrame`

In [5]:
network.shape

(68412, 24)

In [6]:
network.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68412 entries, 0 to 68411
Data columns (total 24 columns):
 #   Column                                    Non-Null Count  Dtype  
---  ------                                    --------------  -----  
 0   Fiscal Year                               68412 non-null  int64  
 1   Fiscal Quarter                            68412 non-null  int64  
 2   Service Line                              68412 non-null  object 
 3   Service                                   68412 non-null  object 
 4   Sub Service                               68412 non-null  object 
 5   Route Miles                               68412 non-null  int64  
 6   Train Number                              68412 non-null  int64  
 7   Arrival Station Code                      68412 non-null  object 
 8   Arrival Station                           68412 non-null  object 
 9   Arrival Station Type                      68386 non-null  object 
 10  City                              

In [7]:
network.head(3)

Unnamed: 0,Fiscal Year,Fiscal Quarter,Service Line,Service,Sub Service,Route Miles,Train Number,Arrival Station Code,Arrival Station,Arrival Station Type,...,State,Division,Region,Country,Latitude,Longitude,Total Detraining Customers,Late Detraining Customers,Late to Total Detraining Customers Ratio,Late Detraining Customers Avg Min Late
0,2024,3,Long Distance,Auto Train,Auto Train,914,52,LOR,Lorton (Auto Train),Station Building (with waiting room),...,Virginia,South Atlantic,South,United States,38.708143,-77.220942,42445,23316,0.54932,95.0
1,2024,3,Long Distance,Auto Train,Auto Train,914,53,SFA,Sanford (Auto Train),Station Building (with waiting room),...,Florida,South Atlantic,South,United States,28.808544,-81.291274,28034,18439,0.65774,91.0
2,2024,3,Long Distance,California Zephyr,California Zephyr,2408,5,BRL,Burlington,Station Building (with waiting room),...,Iowa,West North Central,Midwest,United States,40.805788,-91.101951,557,223,0.40036,54.0


## 2.0 The Amtrak network

Network-wide summary statistics covering all available fiscal years and quarters. The data is
derived by flattening a `DataFrame` into a single row of metrics.

In [8]:
# Total train arrivals
network_trn_arrivals = network.shape[0]

# Detraining totals
network_detrn = network[COLS["total_detrn"]].sum()
network_detrn_late = network[COLS["late_detrn"]].sum()
network_detrn_on_time = network_detrn - network_detrn_late

print(
    f"Train Arrivals: {network_trn_arrivals}",
    f"Total Detraining Customers: {network_detrn}",
    f"Late Detraining Customers: {network_detrn_late}",
    f"On-Time Detraining Customers: {network_detrn_on_time}",
    sep="\n",
)

# Compute summary statistics
network_stats = detrn.get_sum_stats(network, AGG["columns"], AGG["funcs"])
network_stats

Train Arrivals: 68412
Total Detraining Customers: 78330934
Late Detraining Customers: 19425149
On-Time Detraining Customers: 58905785


Unnamed: 0,Train Arrivals,Total Detraining Customers sum,Total Detraining Customers mean,Total Detraining Customers median,Total Detraining Customers std,Late Detraining Customers sum,Late Detraining Customers mean,Late Detraining Customers median,Late Detraining Customers std,Late to Total Detraining Customers Ratio,Late Detraining Customers Avg Min Late mean,Total On Time Detraining Customers sum
0,68412,78330934.0,1144.9882,323.0,2429.1023,19425149.0,283.9436,50.0,788.9444,0.248,,58905785.0
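The internals of `detrn.get_sum_stats()` are not shown here. A minimal sketch of the flattening idea, assuming plain pandas aggregation on made-up toy data, is given below. Note that the network-wide ratio reported above is consistent with a ratio of sums (19425149 / 78330934 ≈ 0.248) rather than a mean of per-row ratios.

```python
import pandas as pd

# Toy data; column names mirror the notebook, values are made up
toy = pd.DataFrame(
    {
        "Total Detraining Customers": [100, 300, 600],
        "Late Detraining Customers": [20, 30, 200],
    }
)

# Aggregate every column with every function: rows are funcs, columns are metrics
agg = toy.agg(["sum", "mean", "median", "std"])

# Flatten to a single row keyed by "<column> <func>"
flat = {f"{col} {func}": agg.loc[func, col] for col in agg.columns for func in agg.index}
flat["Train Arrivals"] = len(toy)

# Ratio of sums, not the mean of per-row ratios
flat["Late to Total Detraining Customers Ratio"] = round(
    flat["Late Detraining Customers sum"] / flat["Total Detraining Customers sum"], 4
)

summary = pd.DataFrame([flat])
print(summary.T)
```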


### 2.1 Service lines

The Amtrak network consists of several service lines. The service lines are used to group
services, stations, and train routes. Service line summary statistics cover all available fiscal
years and quarters and are derived by aggregating metrics across each service line.

In [9]:
svc_line_stats = detrn.get_sum_stats_by_group(
    network, COLS["svc_line"], AGG["columns"], AGG["funcs"]
)
svc_line_stats

Unnamed: 0,Service Line,Train Arrivals,Total Detraining Customers sum,Total Detraining Customers mean,Total Detraining Customers median,Total Detraining Customers std,Late Detraining Customers sum,Late Detraining Customers mean,Late Detraining Customers median,Late Detraining Customers std,Late to Total Detraining Customers Ratio,Late Detraining Customers Avg Min Late mean,Total On Time Detraining Customers sum
0,Long Distance,9391,10585139,1127.1578,400.0,2542.7741,5108705,544.0001,188.0,1346.9626,0.4826,88.3944,5476434
1,Northeast Corridor,26353,33736114,1280.1622,379.0,2526.2699,7243328,274.8578,52.0,684.2435,0.2147,41.4993,26492786
2,State Supported,32668,34009681,1041.0702,260.0,2307.465,7073116,216.5151,28.0,616.6383,0.208,41.3533,26936565
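How `detrn.get_sum_stats_by_group()` works internally is not shown. The sketch below, which assumes pandas named aggregation on made-up toy data and uses hypothetical output column names, illustrates the general pattern. The ratios in the table above are consistent with late sum divided by total sum per group (e.g., 5108705 / 10585139 ≈ 0.4826 for Long Distance).

```python
import pandas as pd

# Toy data; values are made up
toy = pd.DataFrame(
    {
        "Service Line": ["Long Distance", "Long Distance", "State Supported"],
        "Total Detraining Customers": [400, 600, 500],
        "Late Detraining Customers": [200, 300, 100],
    }
)

# One output row per service line
grouped = (
    toy.groupby("Service Line")
    .agg(
        train_arrivals=("Total Detraining Customers", "count"),
        total_sum=("Total Detraining Customers", "sum"),
        late_sum=("Late Detraining Customers", "sum"),
    )
    .reset_index()
)

# Derived metrics computed from the group sums
grouped["late_ratio"] = (grouped["late_sum"] / grouped["total_sum"]).round(4)
grouped["on_time_sum"] = grouped["total_sum"] - grouped["late_sum"]
print(grouped)
```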


### 2.2 Services

Each Amtrak service line consists of one or more services. A service is a named train (e.g., the
[_California Zephyr_](https://www.amtrak.com/california-zephyr-train)) that operates on a specific
route.

Note that certain state-supported services
(e.g., [Michigan service](https://www.amtrak.com/michigan-services-train)) represent a roll up of
one or more named train sub services. The named trains that comprise these services are categorized
as sub services and are excluded from the services column.

In [10]:
serv = network.loc[:, COLS["svc"]].unique()
serv.sort()

print(f"services (n={serv.size})")
serv

services (n=36)


array(['Acela', 'Acela Express', 'Auto Train', 'Borealis',
       'California Zephyr', 'Capitol Corridor', 'Capitol Ltd', 'Cardinal',
       'Carolinian', 'Cascades', 'City Of New Orleans', 'Coast Starlight',
       'Crescent', 'Downeaster', 'Empire', 'Empire Builder',
       'Heartland Flyer', 'Hiawatha', 'Illinois', 'Keystone',
       'Lake Shore Ltd', 'Lincoln / Missouri', 'Michigan', 'Missouri',
       'Northeast Regional', 'Pacific Surfliner', 'Palmetto',
       'Pennsylvanian', 'Piedmont', 'San Joaquins', 'Silver Meteor',
       'Silver Star', 'Southwest Chief', 'Sunset Ltd', 'Texas Eagle',
       'Vermonter'], dtype=object)

Service-level summary statistics cover all available fiscal years and quarters and are derived by
aggregating metrics across each service.

In [11]:
svc_stats = detrn.get_sum_stats_by_group(network, COLS["svc"], AGG["columns"], AGG["funcs"])
svc_stats

Unnamed: 0,Service,Train Arrivals,Total Detraining Customers sum,Total Detraining Customers mean,Total Detraining Customers median,Total Detraining Customers std,Late Detraining Customers sum,Late Detraining Customers mean,Late Detraining Customers median,Late Detraining Customers std,Late to Total Detraining Customers Ratio,Late Detraining Customers Avg Min Late mean,Total On Time Detraining Customers sum
0,Acela,2902,4619992,1592.0028,513.5,2664.8921,892373,307.5028,75.0,587.717,0.1932,32.6096,3727619
1,Acela Express,2453,3310356,1349.5132,377.0,2404.6604,576649,235.0791,48.0,501.387,0.1742,30.6143,2733707
2,Auto Train,22,769068,34957.6364,33955.0,5709.5216,388045,17638.4091,18247.5,7461.718,0.5046,103.4091,381023
3,Borealis,24,26076,1086.5,315.0,1998.4487,13948,581.1667,151.5,1308.8852,0.5349,40.6667,12128
4,California Zephyr,924,879399,951.7305,296.5,1832.0447,580584,628.3377,180.0,1281.2414,0.6602,129.7636,298815
5,Capitol Corridor,6330,2344876,370.4385,152.0,596.52,326186,51.5302,14.0,110.6663,0.1391,38.1251,2018690
6,Capitol Ltd,330,410704,1244.5576,412.0,2515.8664,188242,570.4303,174.0,1413.6061,0.4583,63.0424,222462
7,Cardinal,677,231145,341.4254,132.0,613.9132,107414,158.6617,52.0,293.592,0.4647,70.2027,123731
8,Carolinian,512,1120174,2187.8398,1148.5,2713.9962,447639,874.2949,449.0,1276.108,0.3996,49.1932,672535
9,Cascades,998,1680195,1683.5621,518.5,3059.3645,686792,688.1683,172.0,1430.2456,0.4088,42.1682,993403


### 2.3 Sub services

The Amtrak sub service category is used to group named trains that operate as part of a service such
as the [Illinois](https://www.amtrak.com/illinois-services-train) or
[Michigan](https://www.amtrak.com/michigan-services-train) services (e.g., _Blue Water_,
_Carl Sandburg_, _Illinois Zephyr_, _Pere Marquette_, _Saluki_, and _Wolverine_). Note that
service-level named trains (e.g., _Acela_) are also included in the sub service category.

In [12]:
sub_serv = network.loc[:, COLS["sub_svc"]].unique()
sub_serv.sort()

print(f"subservices (n={sub_serv.size})")
sub_serv

subservices (n=48)


array(['Acela', 'Acela Express', 'Adirondack', 'Auto Train',
       'Berkshire Flyer', 'Blue Water', 'Borealis', 'California Zephyr',
       'Capitol Corridor', 'Capitol Ltd', 'Cardinal',
       'Carl Sandburg / Illinois Zephyr', 'Carolinian', 'Cascades',
       'City Of New Orleans', 'Coast Starlight', 'Crescent', 'Downeaster',
       'Empire Builder', 'Ethan Allen Express', 'Heartland Flyer',
       'Hiawatha', 'Illini / Saluki', 'Keystone', 'Lake Shore Ltd',
       'Lincoln / Missouri', 'Lincoln Service', 'Maple Leaf', 'Missouri',
       'New York - Albany', 'New York - Niagara Falls',
       'On Spine Northeast Regional', 'Pacific Surfliner', 'Palmetto',
       'Pennsylvanian', 'Pere Marquette', 'Piedmont',
       'Richmond / Newport News / Norfolk', 'Roanoke', 'San Joaquins',
       'Silver Meteor', 'Silver Star', 'Southwest Chief',
       'Springfield Shuttles', 'Sunset Ltd', 'Texas Eagle', 'Vermonter',
       'Wolverine'], dtype=object)

Sub service-level summary statistics cover all available fiscal years and quarters and are derived
by aggregating metrics across each sub service.

In [13]:
sub_svc_stats = detrn.get_sum_stats_by_group(network, COLS["sub_svc"], AGG["columns"], AGG["funcs"])
sub_svc_stats

Unnamed: 0,Sub Service,Train Arrivals,Total Detraining Customers sum,Total Detraining Customers mean,Total Detraining Customers median,Total Detraining Customers std,Late Detraining Customers sum,Late Detraining Customers mean,Late Detraining Customers median,Late Detraining Customers std,Late to Total Detraining Customers Ratio,Late Detraining Customers Avg Min Late mean,Total On Time Detraining Customers sum
0,Acela,2902,4619992,1592.0028,513.5,2664.8921,892373,307.5028,75.0,587.717,0.1932,32.6096,3727619
1,Acela Express,2453,3310356,1349.5132,377.0,2404.6604,576649,235.0791,48.0,501.387,0.1742,30.6143,2733707
2,Adirondack,189,198905,1052.4074,155.0,2482.1179,54967,290.8307,42.0,966.4512,0.2763,45.7814,143938
3,Auto Train,22,769068,34957.6364,33955.0,5709.5216,388045,17638.4091,18247.5,7461.718,0.5046,103.4091,381023
4,Berkshire Flyer,83,18308,220.5783,22.0,532.6021,8778,105.759,9.0,398.0666,0.4795,57.5385,9530
5,Blue Water,220,436721,1985.0955,527.0,4256.8445,139527,634.2136,150.0,1592.8122,0.3195,51.7512,297194
6,Borealis,24,26076,1086.5,315.0,1998.4487,13948,581.1667,151.5,1308.8852,0.5349,40.6667,12128
7,California Zephyr,924,879399,951.7305,296.5,1832.0447,580584,628.3377,180.0,1281.2414,0.6602,129.7636,298815
8,Capitol Corridor,6330,2344876,370.4385,152.0,596.52,326186,51.5302,14.0,110.6663,0.1391,38.1251,2018690
9,Capitol Ltd,330,410704,1244.5576,412.0,2515.8664,188242,570.4303,174.0,1413.6061,0.4583,63.0424,222462


### 2.4 Stations

Each `network` row represents Amtrak station arrival metrics, aggregated quarterly, for a given
train (e.g., _Wolverine_ Train 350 arrivals at [Ann Arbor](https://www.amtrak.com/stations/arb)
station (ARB) during Q3 2024).

Station-level summary statistics will be explored in a separate notebook.

In [14]:
stn_count = network.loc[:, COLS["station_code"]].nunique()
stn_count

533

### 2.5 Regions and divisions

The Amtrak stations in `network` can be grouped by region and/or division. These geographical
groupings are based on US Census Bureau
[categories](https://www.census.gov/programs-surveys/economic-census/guidance-geographies/levels.html)
rather than FRA or Amtrak designations.

Group stations by region. Ensure that regional station counts are consistent with the total station count.

#### 2.5.1 Regions [1 pt]

The `DataFrame` named `region_stn_counts` provides station counts grouped by region.

In [15]:
region_stn_counts = (
    network.groupby(COLS["region"])[COLS["station_code"]]
    .nunique()
    .reset_index()
    .sort_values(by=COLS["region"])
)
region_stn_counts

Unnamed: 0,Region,Arrival Station Code
0,Midwest,121
1,Northeast,115
2,South,150
3,West,147


In [16]:
#hidden tests are within this cell

#### 2.5.2 Divisions [1 pt]

The `DataFrame` named `div_stn_counts` provides station counts grouped by region and division.

In [17]:
div_stn_counts = (
    network.groupby([COLS["region"], COLS["division"]])[COLS["station_code"]]
    .nunique()
    .reset_index()
    .sort_values(by=[COLS["region"], COLS["division"]])
)
div_stn_counts

Unnamed: 0,Region,Division,Arrival Station Code
0,Midwest,East North Central,78
1,Midwest,West North Central,43
2,Northeast,Eastern Canada,4
3,Northeast,Middle Atlantic,57
4,Northeast,New England,54
5,South,East South Central,20
6,South,South Atlantic,93
7,South,West South Central,37
8,West,Mountain,44
9,West,Pacific,102
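The division counts above sum to 532, one fewer than the 533 stations counted by region (the West divisions total 146 versus 147 for the region). The cause is not established in this notebook. One possibility, offered only as an assumption, is a station with a missing division value, because `groupby` drops missing keys by default. The toy sketch below uses placeholder station codes to show the effect.

```python
import pandas as pd

# Toy data with one missing division; station codes are placeholders
toy = pd.DataFrame(
    {
        "Region": ["West", "West", "West"],
        "Division": ["Pacific", "Mountain", None],
        "Arrival Station Code": ["AAA", "BBB", "CCC"],
    }
)

by_region = toy.groupby("Region")["Arrival Station Code"].nunique()

# Default behavior: rows with a missing grouping key are dropped
by_division = toy.groupby(["Region", "Division"])["Arrival Station Code"].nunique()

# dropna=False keeps the missing key as its own group
by_division_all = toy.groupby(["Region", "Division"], dropna=False)[
    "Arrival Station Code"
].nunique()

print(by_region.sum(), by_division.sum(), by_division_all.sum())
```

Checking `network[COLS["division"]].isna().sum()` would be one way to confirm or rule out this explanation.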


In [18]:
#hidden tests are within this cell

## 3.0 On-time performance metrics (entire period)

Amtrak station performance data is a compilation of quarterly metrics that focus on late
detraining passengers. Detraining passengers are considered on-time if they arrive at their
destination no later than fifteen (`15`) minutes after their scheduled arrival time. All other
detraining passengers are considered late.

### 3.1 Mean late arrival times summary statistics [1 pt]

Review the central tendency, dispersion, and shape of mean late arrival times across the
network. Calling the custom function named `frm.describe_numeric_column()` will return an
"enriched" dictionary of summary statistics.

In [19]:
# Drop missing values
network_avg_mm_late = network[COLS["late_detrn_avg_mm_late"]].dropna().reset_index(drop=True)

# Call the function
network_avg_mm_late_describe = frm.describe_numeric_column(network_avg_mm_late)
network_avg_mm_late_describe

{'type': pandas.core.series.Series,
 'name': 'Late Detraining Customers Avg Min Late',
 'values': {'non_null': np.int64(57373),
  'missing': np.int64(0),
  'dtype': dtype('float64')},
 'center': {'mean': np.float64(48.78976173461384),
  'median': 39.0,
  'mode': np.float64(30.0)},
 'position': {'min': 2.0,
  '25%': np.float64(29.0),
  '50%': np.float64(39.0),
  '75%': np.float64(56.0),
  'max': 866.0},
 'spread': {'variance': 1334.5036970306662,
  'std': 36.53085951672457,
  'range': 864.0,
  'iqr': np.float64(27.0)},
 'shape': {'skewness': np.float64(4.902910630653301),
  'kurtosis': np.float64(49.51440917305096)}}

In [20]:
#hidden tests are within this cell

The skewness and kurtosis values returned suggest that the distribution of mean late arrival times
of Amtrak trains is positively skewed and features a sharper peak and heavier right tail than a
normal distribution. Let's confirm this visually by generating a histogram.
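To illustrate how these shape statistics behave, the sketch below computes them for a synthetic right-skewed sample using pandas. The sample is made up, and `frm.describe_numeric_column()` may use different estimators (for example, a different kurtosis convention), so treat this as an illustration rather than a reproduction.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample (lognormal); not Amtrak data
rng = np.random.default_rng(seed=0)
sample = pd.Series(rng.lognormal(mean=3.5, sigma=0.6, size=5_000))

skewness = sample.skew()  # > 0 indicates a longer right tail
excess_kurtosis = sample.kurt()  # > 0 indicates heavier tails than a normal distribution

# A right-skewed distribution typically has a mean above its median
print(round(skewness, 2), round(excess_kurtosis, 2), sample.mean() > sample.median())
```

The same pattern appears in the summary above, where the mean (about 48.8) exceeds the median (39.0).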

### 3.2 Visualize distribution of mean late arrival times

Visualize mean late arrival times for the entire period. The data is binned prior to plotting.

#### 3.2.1 Create the chart data

In [21]:
# Convert to DataFrame
network_avg_mm_late = network_avg_mm_late.to_frame(name=COLS["avg_mm_late"])

# Get mean and standard deviation
mu = network_avg_mm_late_describe["center"]["mean"]
sigma = network_avg_mm_late_describe["spread"]["std"]

# Get max value (for x-axis ticks); pad max value for chart display
max_val = int(network_avg_mm_late_describe["position"]["max"])
max_val_ceil = int((np.ceil(max_val / 10) * 10))

# Create bins
network_avg_mm_late, bins, num_bins, bin_width = frm.create_bins(
    network_avg_mm_late, COLS["avg_mm_late"], 15
)

# Bin the data
chrt_data = frm.bin_data(network_avg_mm_late, COLS["avg_mm_late"], bins)
# chrt_data
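The helpers `frm.create_bins()` and `frm.bin_data()` are not shown. The sketch below assumes that the third argument (`15`) is a bin width in minutes, which is a guess, and uses NumPy on made-up values to show one way to bin data before plotting.

```python
import numpy as np

values = np.array([2.0, 14.0, 15.0, 29.0, 30.0, 44.0, 61.0])  # toy minutes late
bin_width = 15  # assumed meaning of the third argument

# Edges run from 0 to the first multiple of bin_width above the max value
upper = (np.floor(values.max() / bin_width) + 1) * bin_width
edges = np.arange(0, upper + bin_width, bin_width)

# Count values per bin; every bin is half-open except the last
counts, edges = np.histogram(values, bins=edges)
bin_centers = (edges[:-1] + edges[1:]) / 2
print(dict(zip(bin_centers.tolist(), counts.tolist())))
```

Plotting one row per bin rather than one row per observation keeps the chart data small.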

#### 3.2.2 Generate the histogram

In [22]:
# Chart title
title_txt = "Amtrak Network Late Detraining Passengers"
title = ttl.format_title(network_stats, title_txt)

# Tooltips
tooltip_config = [
    {"shorthand": "bin_center:Q", "title": "Average Minutes Late", "format": None},
    {"shorthand": "count:Q", "title": "Late Arrivals Count", "format": None},
]

# Create and display the histogram
chart = hst.create_histogram(
    frame=chrt_data,
    x_shorthand="bin_center:Q",
    x_title="Average Minutes Late",
    y_shorthand="count:Q",
    y_title="Late Arrivals Count",
    y_stack=False,
    line_shorthand="Avg Min Late:Q",
    mu=mu,
    sigma=sigma,
    num_bins=num_bins,
    bin_width=bin_width,
    x_tick_count_max=max_val_ceil,
    bar_color=COLORS["amtk_blue"],
    mu_color=COLORS["amtk_red"],
    sigma_color=COLORS["anth_gray"],
    tooltip_config=tooltip_config,
    title=title,
    height=300,
    width=680,
)
chart.display()

## 4.0 On-time performance metrics (by fiscal year and quarter)

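Compute on-time performance (OTP) summary statistics per fiscal year and quarter. Passing the
network-wide train arrival and detraining totals adds the "Train Arrival Ratio" and
"Detraining Ratio" columns to the `DataFrame` named `network_qtr_stats`.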

In [23]:
# Get quarterly stats
network_qtr_stats = detrn.get_sum_stats_by_group(
    network,
    [COLS["year"], COLS["quarter"]],
    AGG["columns"],
    AGG["funcs"],
    network_trn_arrivals,
    network_detrn,
)
network_qtr_stats

Unnamed: 0,Fiscal Year,Fiscal Quarter,Train Arrivals,Train Arrival Ratio,Detraining Ratio,Total Detraining Customers sum,Total Detraining Customers mean,Total Detraining Customers median,Total Detraining Customers std,Late Detraining Customers sum,Late Detraining Customers mean,Late Detraining Customers median,Late Detraining Customers std,Late to Total Detraining Customers Ratio,Late Detraining Customers Avg Min Late mean,Total On Time Detraining Customers sum
0,2021,4,3860,0.056423,0.042569,3334492,863.858,232.0,1830.229,560163,145.1199,21.0,395.5343,0.168,40.8754,2774329
1,2022,1,5656,0.082676,0.070222,5500565,972.5186,274.0,2122.3845,1323483,233.9963,41.0,749.2579,0.2406,47.503,4177082
2,2022,2,5484,0.080161,0.052795,4135460,754.0956,195.0,1721.7037,994549,181.3547,31.0,618.5795,0.2405,51.7952,3140911
3,2022,3,5335,0.077983,0.080138,6277302,1176.6264,326.0,2514.7755,1623780,304.3636,48.0,946.0424,0.2587,49.6261,4653522
4,2022,4,5530,0.080834,0.088057,6897574,1247.3009,352.0,2612.6326,2045867,369.9579,65.0,953.5369,0.2966,48.4402,4851707
5,2023,1,5682,0.083056,0.088201,6908867,1215.9217,347.5,2574.4133,1758247,309.4416,62.0,830.6276,0.2545,51.2844,5150620
6,2023,2,5819,0.085058,0.075116,5883945,1011.1609,280.0,2150.7339,1188089,204.1741,33.0,591.8177,0.2019,56.9817,4695856
7,2023,3,6137,0.089706,0.09332,7309838,1191.1093,339.0,2524.0602,1915619,312.1426,58.0,810.8011,0.2621,45.0473,5394219
8,2023,4,6112,0.089341,0.10394,8141720,1332.0877,391.5,2739.8684,2368441,387.5067,81.0,929.0555,0.2909,50.2355,5773279
9,2024,1,6059,0.088566,0.106022,8304809,1370.6567,435.0,2742.9429,1839525,303.6021,66.0,711.0765,0.2215,46.3932,6465284
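The ratio columns can be checked by hand. The arithmetic below assumes that each ratio is the quarter's share of the corresponding network total, which the figures in the first row (FY 2021 Q4) are consistent with. The variable names prefixed with `qtr_` are illustrative.

```python
# Network totals copied from the outputs above
network_trn_arrivals = 68412
network_detrn = 78330934

# FY 2021 Q4 row copied from the table above
qtr_trn_arrivals = 3860
qtr_detrn = 3334492
qtr_detrn_late = 560163

train_arrival_ratio = round(qtr_trn_arrivals / network_trn_arrivals, 6)
detraining_ratio = round(qtr_detrn / network_detrn, 6)
late_ratio = round(qtr_detrn_late / qtr_detrn, 4)

print(train_arrival_ratio, detraining_ratio, late_ratio)
```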


### 4.1 Write to file [1 pt]

Write `network_qtr_stats` to a CSV file named `stu-amtk-network_qtr_stats.csv`. Store the file in the
`data/student` directory. Then compare it to the accompanying `fxt-amtk-network_qtr_stats.csv` file.
It must match line for line, character for character.

In [24]:
# Write the quarterly stats to CSV without the index column
filepath = parent_path.joinpath("data", "student", "stu-amtk-network_qtr_stats.csv")
network_qtr_stats.to_csv(filepath, index=False)
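One way to compare the student file with the fixture file is a line-by-line check. The helper below is a hypothetical sketch (`first_mismatch` is not part of the `fra_amtrak` package) and is demonstrated on throwaway files rather than the actual fixture.

```python
import itertools
import tempfile
from pathlib import Path


def first_mismatch(path_a, path_b):
    """Return the 1-based number of the first differing line, or None if identical."""
    with open(path_a, newline="", encoding="utf-8") as file_a, open(
        path_b, newline="", encoding="utf-8"
    ) as file_b:
        # zip_longest pads the shorter file with None, so extra lines also count
        pairs = itertools.zip_longest(file_a, file_b)
        for line_num, (line_a, line_b) in enumerate(pairs, start=1):
            if line_a != line_b:
                return line_num
    return None


# Demonstration with temporary files
with tempfile.TemporaryDirectory() as tmp_dir:
    stu = Path(tmp_dir, "stu.csv")
    fxt = Path(tmp_dir, "fxt.csv")
    stu.write_text("Fiscal Year,Fiscal Quarter\n2021,4\n", encoding="utf-8")
    fxt.write_text("Fiscal Year,Fiscal Quarter\n2021,4\n", encoding="utf-8")
    same = first_mismatch(stu, fxt)

    fxt.write_text("Fiscal Year,Fiscal Quarter\n2021,3\n", encoding="utf-8")
    different = first_mismatch(stu, fxt)

print(same, different)
```

Reporting the first differing line makes it easier to spot issues such as an unwanted index column or a rounding difference.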

In [25]:
#hidden tests are within this cell

## 5.0 Visualize detraining passengers (by fiscal year and quarter)

Visualize Amtrak's detraining passengers, both on-time and late, across all years and quarters with
a bar chart.

In [26]:
# Assemble the data for the chart
chrt_data = bar.create_detrain_chart_frame(network_qtr_stats, CHRT_BAR["columns"])

# Create chart title
title_text = "Amtrak Network Detraining Passengers"
title = ttl.format_title(network_stats, title_text)

# Grouped bar chart
chart = bar.create_grouped_bar_chart(
    chrt_data,
    "Fiscal Period:N",
    "Passengers:Q",
    "Arrival Status:N",
    CHRT_BAR["xoffset_sort"],
    CHRT_BAR["colors"],
    title,
)

chart.display()

## 6.0 Visualize distribution of mean late arrival times (by fiscal year and quarter)

Visualizing mean late arrival times grouped by fiscal year and quarter may reveal interesting
patterns.

The data is flattened prior to creating a series of box plots. The fiscal year and quarter column
values are combined (e.g., `< year >Q< quarter >`) by applying the function named
`detrn.format_year_quarter()` to each row to create a new column named "Fiscal Year Quarter". A
second column is also added to color code each quarter and its associated box plot.

### 6.1 Create the chart data [1 pt]

In [27]:
cols = [COLS["year"], COLS["quarter"], COLS["late_detrn_avg_mm_late"]]

# Group by fiscal year and quarter and flatten
chrt_data = network.groupby(cols[:2])[cols].apply(lambda x: x).reset_index(drop=True)

# Add 'Fiscal Year Quarter' column
chrt_data[COLS["year_quarter"]] = chrt_data[[COLS["year"], COLS["quarter"]]].apply(
    detrn.format_year_quarter, axis=1
)

# Drop columns and reorder
chrt_data.drop([COLS["year"], COLS["quarter"]], axis=1, inplace=True)
chrt_data.insert(0, COLS["year_quarter"], chrt_data.pop(COLS["year_quarter"]))

# Add color column
colors = [COLORS["amtk_blue"], COLORS["amtk_red"]]
chrt_data.loc[:, "Color"] = chrt_data[COLS["year_quarter"]].apply(detrn.assign_color, colors=colors)
chrt_data.head()

Unnamed: 0,Fiscal Year Quarter,Late Detraining Customers Avg Min Late,Color
0,2021Q4,100.0,#00537e
1,2021Q4,25.0,#00537e
2,2021Q4,51.0,#00537e
3,2021Q4,78.0,#00537e
4,2021Q4,53.0,#00537e
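The body of `detrn.format_year_quarter()` is not shown. Judging by the output above (e.g., `2021Q4`), a hypothetical stand-in might look like the sketch below, which also shows a vectorized alternative. The function name `format_year_quarter_sketch` is made up for illustration.

```python
import pandas as pd


def format_year_quarter_sketch(row: pd.Series) -> str:
    """Hypothetical stand-in for detrn.format_year_quarter()."""
    return f"{int(row['Fiscal Year'])}Q{int(row['Fiscal Quarter'])}"


toy = pd.DataFrame({"Fiscal Year": [2021, 2022, 2022], "Fiscal Quarter": [4, 1, 2]})

# Row-wise application, mirroring the cell above
toy["Fiscal Year Quarter"] = toy.apply(format_year_quarter_sketch, axis=1)

# Vectorized alternative that avoids a Python-level call per row
vectorized = toy["Fiscal Year"].astype(str) + "Q" + toy["Fiscal Quarter"].astype(str)
print(toy["Fiscal Year Quarter"].tolist())
```

On a frame with tens of thousands of rows, the vectorized form is usually noticeably faster than a row-wise `apply`.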


In [28]:
#hidden tests are within this cell

### 6.2 Preaggregate the chart data

Passing a dataset of more than `5000` rows to a Vega-Altair [`alt.Chart()`](https://altair-viz.github.io/user_guide/generated/toplevel/altair.Chart.html) triggers a `MaxRowsError` when the chart is rendered. You can disable the check by calling `alt.data_transformers.disable_max_rows()`. However, disabling the check may result in performance issues, including browser crashes.

The preferred approach when [working with large datasets](https://altair-viz.github.io/user_guide/large_datasets.html#large-datasets) is to _preaggregate_ the data before generating a plot. This can be achieved "manually" (the approach adopted in this notebook) or by [installing](https://altair-viz.github.io/user_guide/large_datasets.html#installing-vegafusion) and [enabling](https://altair-viz.github.io/user_guide/large_datasets.html#enabling-the-vegafusion-data-transformer) Altair's companion [vegafusion](https://vegafusion.io/) data transformer package.

In [29]:
# Compute aggregation statistics
cols = [COLS["year_quarter"], COLS["late_detrn_avg_mm_late"]]

# Pre-aggregate the data
chrt_data = frm.aggregate_data(chrt_data, cols)
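What `frm.aggregate_data()` computes is not shown. A common way to preaggregate box plot data, sketched below on made-up values, is to reduce each group to its quartiles, whisker ends, and outliers using the 1.5 × IQR rule. The function `box_stats` and the column name "Avg Min Late" are illustrative assumptions.

```python
import pandas as pd


def box_stats(values: pd.Series) -> pd.Series:
    """Quartiles, whisker ends, and outliers using the 1.5 * IQR rule."""
    q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = values[(values >= lower_fence) & (values <= upper_fence)]
    outliers = values[(values < lower_fence) | (values > upper_fence)]
    return pd.Series(
        {
            "lower": inside.min(),
            "q1": q1,
            "median": median,
            "q3": q3,
            "upper": inside.max(),
            "outliers": outliers.tolist(),
        }
    )


toy = pd.DataFrame(
    {
        "Fiscal Year Quarter": ["2021Q4"] * 5 + ["2022Q1"] * 5,
        "Avg Min Late": [20.0, 30.0, 40.0, 50.0, 400.0, 25.0, 35.0, 45.0, 55.0, 65.0],
    }
)

# One row of box statistics per period instead of one row per observation
groups = toy.groupby("Fiscal Year Quarter")["Avg Min Late"]
preagg = pd.DataFrame({name: box_stats(group) for name, group in groups}).T
print(preagg)
```

Ten observations collapse to two rows here; on the full dataset the reduction keeps the chart well under the row limit.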

### 6.3 Generate the box plots

In [30]:
# Create chart title
title_text = "Amtrak Network Late Detraining Passengers"
title = ttl.format_title(network_stats, title_text)

chart_horizontal = boxp.create_boxplot(
    data=chrt_data,
    x_shorthand="Late Detraining Customers Avg Min Late:Q",
    x_title="Average Minutes Late",
    y_shorthand="Fiscal Year Quarter:N",
    y_title="Period",
    box_size=20,
    outlier_shorthand="outliers:Q",
    color_shorthand="Color:N",
    chart_title=title,
    orient=boxp.Orient.HORIZONTAL,
    height=400,
    width=680,
)
chart_horizontal.display()

## 7.0 Distance traveled and late detraining passengers

A number of factors influence the likelihood of late detraining passengers. Distance traveled is
possibly one such factor. Does a linear relationship exist between a train's route miles and the
late arrival times experienced by late detraining passengers?

### 7.1 Data preparation

The first step involves preparing the data for the regression analysis.

In [31]:
lm_data = network[[COLS["route_miles"], COLS["late_detrn_avg_mm_late"]]].reset_index(drop=True)
lm_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68412 entries, 0 to 68411
Data columns (total 2 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   Route Miles                             68412 non-null  int64  
 1   Late Detraining Customers Avg Min Late  57373 non-null  float64
dtypes: float64(1), int64(1)
memory usage: 1.0 MB


### 7.2 Linear regression [1 pt]

The code below calculates a linear least-squares regression for two sets of measurements. The
dependent variable is "Late Detraining Customers Avg Min Late". The independent variable is
"Route Miles". The `result` object contains the regression attributes.

In [32]:
# Drop rows with missing mean late times, then regress minutes late (y) on route miles (x)
lm_data_clean = lm_data[["Route Miles", "Late Detraining Customers Avg Min Late"]].dropna()
result = stats.linregress(
    lm_data_clean["Route Miles"], lm_data_clean["Late Detraining Customers Avg Min Late"]
)
result

LinregressResult(slope=np.float64(0.033865823231644165), intercept=np.float64(28.48236724388004), rvalue=np.float64(0.5048653697334845), pvalue=np.float64(0.0), stderr=np.float64(0.000241741003417831), intercept_stderr=np.float64(0.1958177028456057))

In [33]:
#hidden tests are within this cell

The linear regression analysis suggests that there is a moderate positive relationship between a train's route miles and mean late arrival times for late detraining passengers. The slope suggests that with every additional route mile traveled, the average minutes late for late detraining passengers increases by approximately `0.0339` minutes. The *R*² value indicates that around `25.5%` of the variability in the average minutes late can be explained by route miles, with the remaining variability due to other factors or random noise. The very small standard errors for both the slope and intercept coefficients suggest a high level of precision in these estimates.
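The figures quoted in this interpretation can be reproduced from the `LinregressResult` shown above. The arithmetic below copies the reported slope and correlation coefficient; `delta_per_100_mi` is an illustrative name.

```python
# Values copied from the LinregressResult above
slope = 0.033865823231644165
rvalue = 0.5048653697334845

# Coefficient of determination is the square of the correlation coefficient
r_squared = rvalue**2

# Expected change in average minutes late per 100 additional route miles
delta_per_100_mi = slope * 100

print(f"R^2 = {r_squared:.3f}, +{delta_per_100_mi:.2f} min per 100 mi")
```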

### 7.3 Predictions

#### 7.3.1 Twenty-five mile intervals [1 pt]

Create a new `DataFrame` named `route_mi_intervals` comprising a single column named "Route Miles"
with values ranging from `0` to `2600` in increments of `25`.

Then apply the function `detrn.predict_avg_min_late_by_distance()` to each of the "Route Miles"
values to generate predicted average late times for every twenty-five (`25`) miles of rail travel
up to `2600` miles.

For the starting zero (`0`) route mile mark, assign the predicted average late time to `0.0`
minutes. Round each predicted value to two decimal places. Assign each predicted value to a new
column named "Predicted Avg Min Late".

In [35]:
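# Route miles from 0 to 2600 (inclusive) in 25-mile steps
route_mi_intervals = pd.DataFrame({"Route Miles": np.arange(0, 2601, 25)})
# Predict the average minutes late, rounded to two decimal places
route_mi_intervals["Predicted Avg Min Late"] = route_mi_intervals["Route Miles"].apply(
    lambda x: round(detrn.predict_avg_min_late_by_distance(result, x), 2)
)
# No distance traveled, so override the zero-mile prediction
route_mi_intervals.loc[0, "Predicted Avg Min Late"] = 0.0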
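The prediction helper in `detrn` is not shown. For a simple linear model it presumably evaluates `intercept + slope * x`, and the sketch below makes that assumption explicit using the coefficients reported in section 7.2. `FitSketch` and `predict_sketch` are hypothetical names.

```python
from typing import NamedTuple


class FitSketch(NamedTuple):
    """Stand-in exposing the two linregress result attributes used here."""

    slope: float
    intercept: float


def predict_sketch(fit: FitSketch, route_miles: float) -> float:
    """Assumed form of the prediction: y = intercept + slope * x."""
    return fit.intercept + fit.slope * route_miles


# Coefficients copied from the LinregressResult in section 7.2
fit = FitSketch(slope=0.033865823231644165, intercept=28.48236724388004)

predictions = {miles: round(predict_sketch(fit, miles), 2) for miles in (0, 25, 82, 2560)}
print(predictions)
```

At zero miles the model returns the intercept (about 28.48 minutes), which is why the zero-mile row is overridden with `0.0`.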

In [None]:
#hidden tests are within this cell

#### 7.3.2 Named train route miles

Create a `DataFrame` of predicted late times for named trains to combine with `route_mi_intervals`.
Retrieve each named train (i.e., the sub service) and its associated route miles from `network` and
store them in two columns named "Sub Service" and "Route Miles". Assign the new `DataFrame` to a
variable named `trn_route_mi`.

Next, generate predictions for each row in `trn_route_mi`. Round the predictions to the second
(`2nd`) decimal place. Assign the predictions to a new column named "Predicted Avg Min Late".

Note that these predicted average late times are relevant only for late detraining passengers who
travel the entire route.

In [39]:
trn_route_mi = network[["Sub Service", "Route Miles"]].drop_duplicates().reset_index(drop=True)
trn_route_mi["Predicted Avg Min Late"] = trn_route_mi["Route Miles"].apply(lambda x: round(detrn.predict_avg_min_late_by_distance(result, x),2))

In [None]:
#hidden tests are within this cell

### 7.4 Combine the data [1 pt]

Combine `route_mi_intervals` and `trn_route_mi`. Assign the new `DataFrame` to a variable named
`lm_predict`. Then sort the `DataFrame` rows by the route miles (ascending) and the sub service
(descending). Finally, reset the index.

In [41]:
# Columns in play
cols = [COLS["route_miles"], COLS["predict_avg_mm_late"], COLS["sub_svc"]]

# Concatenate DataFrames, sort, and reset the index
lm_predict = pd.concat([route_mi_intervals, trn_route_mi], ignore_index=True)
lm_predict.sort_values(by=[cols[0], cols[-1]], ascending=[True, False], inplace=True)
lm_predict.reset_index(drop=True, inplace=True)
lm_predict

Unnamed: 0,Route Miles,Predicted Avg Min Late,Sub Service
0,0,0.000000,
1,25,29.329013,
2,50,30.175658,
3,75,31.022304,
4,82,31.260000,Hiawatha
...,...,...,...
148,2525,113.993571,
149,2550,114.840216,
150,2560,115.180000,Empire Builder
151,2575,115.686862,


In [None]:
#hidden tests are within this cell

### 7.5 Write to file [1 pt]

Write `lm_predict` to a CSV file named `stu-amtk-avg_min_late_predict.csv`. Store the file in the
`data/student` directory. Then compare it to the accompanying `fxt-amtk-avg_min_late_predict.csv` file.
It must match line for line, character for character.

In [42]:
filepath = parent_path.joinpath("data", "student", "stu-amtk-avg_min_late_predict.csv")
lm_predict.to_csv(filepath, index=False)

In [None]:
#hidden tests are within this cell

## 8.0 Watermark

In [None]:
%load_ext watermark
%watermark -h -i -iv -m -v