# Explore: Amtrak Services

## Intercity Passenger Rail Service Station Performance Metrics

The Amtrak [network](https://www.amtrak.com/content/dam/projects/dotcom/english/public/documents/Maps/Amtrak-System-Map-020923.pdf)
provides intercity passenger rail service in the
continental United States and to select Canadian cities. The network is operated by the
[National Railroad Passenger Corporation](https://railroads.dot.gov/passenger-rail/amtrak/amtrak),
a federally chartered for-profit corporation that receives federal and state funding and offsets its
operating costs by selling tickets and providing other services.

This notebook commences exploration of the augmented quarterly
[Amtrak](https://www.amtrak.com/home.html) station performance metrics. The goal is to better
understand Amtrak's three service lines and identify potential areas for further analysis.

### Variable names

A number of variable names in this project leverage the following abbreviations. The naming
strategy is to strike a balance between brevity and readability:

* `amtk`: Amtrak (reporting mark)
* `chrt`: chart
* `cols`: columns
* `const`: constant
* `cwd`: current working directory
* `eb`: eastbound direction of travel
* `lm`: linear model
* `mi`: miles
* `mm`: minutes (ISO 8601)
* `nb`: northbound direction of travel
* `psgr`: passenger
* `qtr`: quarter
* `rte`: route
* `sb`: southbound direction of travel
* `stats`: summary statistics
* `stn`: station
* `stns`: stations
* `svc`: service
* `trn`: train
* `wb`: westbound direction of travel

In [1]:
import numpy as np
import pandas as pd
import pathlib as pl
import tomllib as tl

import fra_amtrak.amtk_detrain as detrn
import fra_amtrak.amtk_frame as frm
import fra_amtrak.amtk_network as ntwk
import fra_amtrak.chart_bar as bar
import fra_amtrak.chart_box_preagg as boxp
import fra_amtrak.chart_hist as hst
import fra_amtrak.chart_title as ttl

## 1.0 Read files

### 1.1 Resolve paths

In [2]:
parent_path = pl.Path.cwd()  # current working directory
parent_path

PosixPath('/home/jovyan/work/assignments/Course4')

### 1.2 Load constants

Load a companion [TOML](https://toml.io/en/) file containing constants.

In [3]:
filepath = parent_path.joinpath("notebook.toml")
with open(filepath, "rb") as file_obj:
    const = tl.load(file_obj)

# Access constants
AGG = const["agg"]
CHRT_BAR = const["chart"]["bar"]
COLORS = const["colors"]
COLS = const["columns"]
SVC_LINES = const["service_lines"]


### 1.3 Retrieve performance data

In [4]:
filepath = parent_path.joinpath("data", "processed", "station_performance_metrics-v1p2.csv")
network = pd.read_csv(
    filepath, dtype={"Address 02": "str", "ZIP Code": "str"}, low_memory=False
)  # avoid DtypeWarning
network.shape

(68412, 24)

## 2.0 Amtrak service lines

Provide service line summary statistics covering all available fiscal years and quarters.

In [5]:
svc_line_stats = detrn.get_sum_stats_by_group(
    network, COLS["svc_line"], AGG["columns"], AGG["funcs"]
)
svc_line_stats

Unnamed: 0,Service Line,Train Arrivals,Total Detraining Customers sum,Total Detraining Customers mean,Total Detraining Customers median,Total Detraining Customers std,Late Detraining Customers sum,Late Detraining Customers mean,Late Detraining Customers median,Late Detraining Customers std,Late to Total Detraining Customers Ratio,Late Detraining Customers Avg Min Late mean,Total On Time Detraining Customers sum
0,Long Distance,9391,10585139,1127.1578,400.0,2542.7741,5108705,544.0001,188.0,1346.9626,0.4826,88.3944,5476434
1,Northeast Corridor,26353,33736114,1280.1622,379.0,2526.2699,7243328,274.8578,52.0,684.2435,0.2147,41.4993,26492786
2,State Supported,32668,34009681,1041.0702,260.0,2307.465,7073116,216.5151,28.0,616.6383,0.208,41.3533,26936565
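
The ratio and on-time columns in the table above can be re-derived from the printed sums. The sketch below copies the totals from the table and assumes that the "Late to Total Detraining Customers Ratio" is the late sum divided by the total sum; the internals of `detrn.get_sum_stats_by_group()` are not shown here, so treat this as a consistency check rather than the function's implementation.

```python
import pandas as pd

# Totals copied from the service line summary table above
totals = pd.DataFrame(
    {
        "Service Line": ["Long Distance", "Northeast Corridor", "State Supported"],
        "Total Detraining Customers": [10585139, 33736114, 34009681],
        "Late Detraining Customers": [5108705, 7243328, 7073116],
    }
).set_index("Service Line")

# Late-to-total ratio (assumed definition) and on-time count per service line
totals["Late Ratio"] = (
    totals["Late Detraining Customers"] / totals["Total Detraining Customers"]
).round(4)
totals["On Time"] = totals["Total Detraining Customers"] - totals["Late Detraining Customers"]
print(totals[["Late Ratio", "On Time"]])
```

Under that assumption the derived ratios match the table: roughly 48% of Long Distance detraining passengers arrived late, versus about 21% for each of the other two service lines.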


## 3.0 Northeast Corridor (NEC)

In [6]:
nec = ntwk.by_service_line(network, SVC_LINES["nec"])
nec.shape

(26353, 24)

### 3.1 NEC: on-time performance metrics (entire period)

NEC station performance data is a compilation of quarterly metrics that focus on late
detraining passengers. Detraining passengers are considered on-time if they arrive at their
destination no later than fifteen (`15`) minutes after their scheduled arrival time. All other
detraining passengers are considered late.
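
The on-time rule can be expressed as a simple threshold test. The dataset used in this notebook is already aggregated per station and quarter, so the sketch below applies the rule to hypothetical per-passenger delays purely to illustrate the definition; the variable names are illustrative.

```python
import pandas as pd

ON_TIME_THRESHOLD_MIN = 15  # threshold stated in the text above

# Hypothetical arrival delays in minutes (negative values mean early arrivals)
delays = pd.Series([-3, 0, 15, 16, 42], name="minutes_late")

# A passenger is on-time when arriving no later than 15 minutes after schedule
status = delays.apply(lambda m: "on-time" if m <= ON_TIME_THRESHOLD_MIN else "late")
print(status.tolist())
```

Note that a delay of exactly 15 minutes still counts as on-time under this definition.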

In [7]:
# Total train arrivals
nec_trn_arrivals = nec.shape[0]

# Detraining totals
nec_detrn = nec[COLS["total_detrn"]].sum()
nec_detrn_late = nec[COLS["late_detrn"]].sum()
nec_detrn_on_time = nec_detrn - nec_detrn_late

print(
    f"Train Arrivals: {nec_trn_arrivals}",
    f"Total Detraining Customers: {nec_detrn}",
    f"Late Detraining Customers: {nec_detrn_late}",
    f"On-Time Detraining Customers: {nec_detrn_on_time}",
    sep="\n",
)

# Compute summary statistics
nec_stats = detrn.get_sum_stats(nec, AGG["columns"], AGG["funcs"])
nec_stats

Train Arrivals: 26353
Total Detraining Customers: 33736114
Late Detraining Customers: 7243328
On-Time Detraining Customers: 26492786


Unnamed: 0,Train Arrivals,Total Detraining Customers sum,Total Detraining Customers mean,Total Detraining Customers median,Total Detraining Customers std,Late Detraining Customers sum,Late Detraining Customers mean,Late Detraining Customers median,Late Detraining Customers std,Late to Total Detraining Customers Ratio,Late Detraining Customers Avg Min Late mean,Total On Time Detraining Customers sum
0,26353,33736114.0,1280.1622,379.0,2526.2699,7243328.0,274.8578,52.0,684.2435,0.2147,,26492786.0


### 3.2 NEC: mean late arrival times summary statistics [1 pt]

Review the central tendency, dispersion, and shape for the mean late arrival times of Northeast Corridor trains. Call the custom function named `frm.describe_numeric_column()` to return a dictionary of summary statistics.

In [8]:
# Drop missing values
nec_avg_mm_late = nec[COLS["late_detrn_avg_mm_late"]].dropna().reset_index(drop=True)

# Call the custom frm.describe_numeric_column() function
nec_avg_mm_late_describe = frm.describe_numeric_column(nec_avg_mm_late)
nec_avg_mm_late_describe

{'type': pandas.core.series.Series,
 'name': 'Late Detraining Customers Avg Min Late',
 'values': {'non_null': np.int64(21833),
  'missing': np.int64(0),
  'dtype': dtype('float64')},
 'center': {'mean': np.float64(41.49933586772317),
  'median': 36.0,
  'mode': np.float64(30.0)},
 'position': {'min': 9.0,
  '25%': np.float64(27.0),
  '50%': np.float64(36.0),
  '75%': np.float64(48.0),
  'max': 828.0},
 'spread': {'variance': 810.5346390788515,
  'std': 28.46989004332211,
  'range': 819.0,
  'iqr': np.float64(21.0)},
 'shape': {'skewness': np.float64(6.942480122762284),
  'kurtosis': np.float64(110.18247984981019)}}

In [9]:
#hidden tests are within this cell

The skewness and kurtosis values returned suggest that the distribution of mean late arrival times of NEC trains is positively skewed and features a sharper peak and heavier right tail than a normal distribution. Let's confirm this visually by generating a histogram.
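
As a rough illustration of how these shape statistics behave, the sketch below computes skewness and kurtosis for a small hypothetical sample using pandas. It assumes pandas' definitions (`Series.skew()` and `Series.kurt()`, the latter reporting excess kurtosis); whether `frm.describe_numeric_column()` uses the same estimators is not confirmed here.

```python
import pandas as pd

# Hypothetical right-skewed sample: most delays cluster low, one is extreme
sample = pd.Series([20, 25, 27, 30, 30, 36, 40, 48, 60, 300], dtype="float64")

skewness = sample.skew()  # > 0 indicates a longer right tail
kurtosis = sample.kurt()  # excess kurtosis; > 0 indicates heavier tails than a normal
iqr = sample.quantile(0.75) - sample.quantile(0.25)

print(f"mean={sample.mean():.1f}, median={sample.median():.1f}")
print(f"skewness={skewness:.2f}, kurtosis={kurtosis:.2f}, iqr={iqr:.2f}")
```

A single extreme value pulls the mean well above the median, the same pattern seen in the NEC figures (mean of about 41.5 minutes versus a median of 36).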

### 3.3 NEC: visualize distribution of mean late arrival times

Visualize mean late arrival times for the entire period. The data is binned prior to plotting.

#### 3.3.1 Create the chart data

In [10]:
# Convert to DataFrame
nec_avg_mm_late = nec_avg_mm_late.to_frame(name=COLS["avg_mm_late"])

# Get mean and standard deviation
mu = nec_avg_mm_late_describe["center"]["mean"]
sigma = nec_avg_mm_late_describe["spread"]["std"]

# Get max value (for x-axis ticks); pad max value for chart display
max_val = nec_avg_mm_late_describe["position"]["max"]
max_val_ceil = (np.ceil(max_val / 10) * 10).astype(int)

# Create bins
nec_mm_late, bins, num_bins, bin_width = frm.create_bins(nec_avg_mm_late, COLS["avg_mm_late"], 15)

# Bin the data
chrt_data = frm.bin_data(nec_mm_late, COLS["avg_mm_late"], bins)
# chrt_data
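
The helpers `frm.create_bins()` and `frm.bin_data()` are not shown in this notebook. The sketch below is one plausible way to bin values with NumPy, assuming the trailing argument (`15`) is a bin width in minutes; that interpretation and all names below are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical average-minutes-late values
values = pd.Series([9, 12, 27, 30, 30, 36, 41, 48, 52, 74], dtype="float64")
bin_width = 15

# Bin edges from 0 up to the first multiple of bin_width that covers the max
upper = int(np.ceil(values.max() / bin_width) * bin_width)
edges = np.arange(0, upper + bin_width, bin_width)

# Count observations per bin and compute bin centers for plotting
counts, edges = np.histogram(values, bins=edges)
binned = pd.DataFrame({"bin_center": (edges[:-1] + edges[1:]) / 2, "count": counts})
print(binned)
```

Plotting one row per bin rather than one row per observation keeps the chart data small, which matters for the Altair row limit discussed later in this notebook.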

#### 3.3.2 Generate the histogram

In [11]:
# Chart title
title_txt = f"Amtrak {SVC_LINES['nec']} (NEC) Late Detraining Passengers"
title = ttl.format_title(nec_stats, title_txt)

# Tooltips
tooltip_config = [
    {"shorthand": "bin_center:Q", "title": "Average Minutes Late", "format": None},
    {"shorthand": "count:Q", "title": "Late Arrivals Count", "format": None},
]

# Create and display the histogram
chart = hst.create_histogram(
    frame=chrt_data,
    x_shorthand="bin_center:Q",
    x_title="Average Minutes Late",
    y_shorthand="count:Q",
    y_title="Late Arrivals Count",
    y_stack=False,
    line_shorthand="Avg Min Late:Q",
    mu=mu,
    sigma=sigma,
    num_bins=num_bins,
    bin_width=bin_width,
    x_tick_count_max=max_val_ceil,
    bar_color=COLORS["amtk_blue"],
    mu_color=COLORS["amtk_red"],
    sigma_color=COLORS["anth_gray"],
    tooltip_config=tooltip_config,
    title=title,
)
chart.display()


### 3.4 NEC: on-time performance metrics (by fiscal year and quarter)

Compute OTP summary statistics per fiscal year and quarter. Add quarterly train arrival metrics to
the `DataFrame` named `nec_qtr_stats`.

In [12]:
# Get quarterly stats
nec_qtr_stats = detrn.get_sum_stats_by_group(
    nec,
    [COLS["year"], COLS["quarter"]],
    AGG["columns"],
    AGG["funcs"],
    nec_trn_arrivals,
    nec_detrn,
)
nec_qtr_stats

Unnamed: 0,Fiscal Year,Fiscal Quarter,Train Arrivals,Train Arrival Ratio,Detraining Ratio,Total Detraining Customers sum,Total Detraining Customers mean,Total Detraining Customers median,Total Detraining Customers std,Late Detraining Customers sum,Late Detraining Customers mean,Late Detraining Customers median,Late Detraining Customers std,Late to Total Detraining Customers Ratio,Late Detraining Customers Avg Min Late mean,Total On Time Detraining Customers sum
0,2021,4,1619,0.061435,0.047893,1615738,997.9852,284.0,2003.5881,288477,178.1822,24.0,477.4623,0.1785,41.5145,1327261
1,2022,1,2093,0.079422,0.064543,2177445,1040.3464,295.0,2122.5037,407495,194.6942,37.0,462.7494,0.1871,38.4606,1769950
2,2022,2,1995,0.075703,0.046809,1579142,791.5499,215.0,1635.6237,287811,144.2662,30.0,338.6386,0.1823,41.9328,1291331
3,2022,3,1998,0.075817,0.075699,2553803,1278.1797,362.0,2575.3038,495591,248.0435,44.0,625.5676,0.1941,42.6638,2058212
4,2022,4,2136,0.081053,0.086716,2925472,1369.603,410.5,2692.2553,746395,349.4359,65.0,835.7636,0.2551,38.6777,2179077
5,2023,1,2235,0.08481,0.088573,2988116,1336.9647,394.0,2645.7572,668139,298.9436,63.0,715.578,0.2236,44.1413,2319977
6,2023,2,2126,0.080674,0.074779,2522765,1186.6251,366.5,2261.8128,326660,153.65,26.0,389.6308,0.1295,43.0768,2196105
7,2023,3,2417,0.091716,0.094891,3201238,1324.4675,380.0,2621.7413,731038,302.4568,58.0,713.856,0.2284,39.7841,2470200
8,2023,4,2326,0.088263,0.10703,3610787,1552.359,512.5,2906.5942,998983,429.4854,101.0,965.4039,0.2767,44.0181,2611804
9,2024,1,2369,0.089895,0.105415,3556307,1501.1849,495.0,2836.5871,780873,329.6214,71.0,736.7398,0.2196,42.8054,2775434
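
The quarterly table can be approximated with a pandas `groupby`. The sketch below uses hypothetical records and illustrative column aliases; it assumes that "Train Arrival Ratio" is a quarter's row count divided by the overall row count and that "Detraining Ratio" is a quarter's passenger total divided by the overall total. Those assumptions are consistent with the figures above (for example, `1619 / 26353 ≈ 0.061435`), but the implementation of `detrn.get_sum_stats_by_group()` is not shown.

```python
import pandas as pd

# Hypothetical station-level records
records = pd.DataFrame(
    {
        "Fiscal Year": [2023, 2023, 2023, 2024],
        "Fiscal Quarter": [3, 3, 4, 1],
        "Total Detraining Customers": [100, 300, 200, 400],
        "Late Detraining Customers": [20, 60, 50, 40],
    }
)

# Aggregate per fiscal year and quarter
qtr_stats = (
    records.groupby(["Fiscal Year", "Fiscal Quarter"])
    .agg(
        arrivals=("Total Detraining Customers", "count"),
        total=("Total Detraining Customers", "sum"),
        late=("Late Detraining Customers", "sum"),
    )
    .reset_index()
)

# Each quarter's share of all arrivals and of all detraining passengers
qtr_stats["arrival_ratio"] = qtr_stats["arrivals"] / len(records)
qtr_stats["detrain_ratio"] = qtr_stats["total"] / records["Total Detraining Customers"].sum()
qtr_stats["late_ratio"] = qtr_stats["late"] / qtr_stats["total"]
print(qtr_stats)
```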


#### 3.4.1 Write to file [1 pt]

Write `nec_qtr_stats` to a CSV file named `stu-amtk-nec_qtr_stats.csv`. Store the file in the
`data/student` directory. Then compare it to the accompanying `fxt-amtk-nec_qtr_stats.csv` file.
It must match line for line, character for character.

In [13]:
# YOUR CODE HERE
filepath = parent_path.joinpath("data", "student", "stu-amtk-nec_qtr_stats.csv")
nec_qtr_stats.to_csv(filepath, index=True)
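
One way to check the "line for line, character for character" requirement is to compare the two files directly. The helper below is a hypothetical sketch (it is not part of the `fra_amtrak` package) that reports the first differing line; the demo uses throwaway files rather than the real student and fixture paths.

```python
import pathlib
import tempfile


def first_mismatch(path_a, path_b):
    """Return the 1-based number of the first differing line, or None if identical."""
    lines_a = pathlib.Path(path_a).read_text(encoding="utf-8").splitlines(keepends=True)
    lines_b = pathlib.Path(path_b).read_text(encoding="utf-8").splitlines(keepends=True)
    for num, (line_a, line_b) in enumerate(zip(lines_a, lines_b), start=1):
        if line_a != line_b:
            return num
    if len(lines_a) != len(lines_b):
        # One file is a prefix of the other; the first extra line is the mismatch
        return min(len(lines_a), len(lines_b)) + 1
    return None


# Demonstrate with two throwaway files that differ on the second line
with tempfile.TemporaryDirectory() as tmp:
    stu = pathlib.Path(tmp, "stu.csv")
    fxt = pathlib.Path(tmp, "fxt.csv")
    stu.write_text("a,b\n1,2\n", encoding="utf-8")
    fxt.write_text("a,b\n1,3\n", encoding="utf-8")
    print(first_mismatch(stu, fxt))
```

A return value of `None` means the files match; any integer points at the first line to inspect, which is often a difference in index handling or float formatting.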

In [14]:
#hidden tests are within this cell

### 3.5 NEC: visualize detraining passengers

Visualize Amtrak's Northeast Corridor (NEC) detraining passengers, both on-time and late, across
all years and quarters with a bar chart.

In [15]:
# Assemble the data for the chart
chrt_data = bar.create_detrain_chart_frame(nec_qtr_stats, CHRT_BAR["columns"])

# Create chart title
title_text = f"Amtrak {SVC_LINES['nec']} (NEC) Detraining Passengers"
title = ttl.format_title(nec_stats, title_text)

# Grouped bar chart
chart = bar.create_grouped_bar_chart(
    chrt_data,
    "Fiscal Period:N",
    "Passengers:Q",
    "Arrival Status:N",
    CHRT_BAR["xoffset_sort"],
    CHRT_BAR["colors"],
    title,
)

chart.display()

### 3.6 NEC: visualize distribution of mean late arrival times (by fiscal year and quarter)

Visualizing mean late arrival times grouped by fiscal year and quarter may reveal interesting
patterns.

The data is flattened prior to creating a series of box plots. The fiscal year and quarter columns
are combined (e.g., `< year >Q< quarter >`) to create a new column named "Fiscal Year Quarter" while
a second column is added to color code each quarter and its associated box plot.

#### 3.6.1 Create the chart data

In [16]:
cols = [COLS["year"], COLS["quarter"], COLS["late_detrn_avg_mm_late"]]

# Group by fiscal year and quarter, flatten, and reset index
chrt_data = nec.groupby(cols[:2])[cols].apply(lambda x: x).reset_index(drop=True)

# Add column
chrt_data.loc[:, COLS["year_quarter"]] = chrt_data.apply(detrn.format_year_quarter, axis=1)

# Drop columns and reorder
chrt_data.drop(cols[:2], axis=1, inplace=True)
chrt_data.dropna(inplace=True)
chrt_data.insert(0, COLS["year_quarter"], chrt_data.pop(COLS["year_quarter"]))

# Add alternating colors
colors = [COLORS["amtk_blue"], COLORS["amtk_red"]]
chrt_data.loc[:, "Color"] = chrt_data[COLS["year_quarter"]].apply(detrn.assign_color, colors=colors)
chrt_data.head()

Unnamed: 0,Fiscal Year Quarter,Late Detraining Customers Avg Min Late,Color
0,2021Q4,14.0,#00537e
1,2021Q4,20.0,#00537e
3,2021Q4,17.0,#00537e
4,2021Q4,20.0,#00537e
5,2021Q4,28.0,#00537e
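
The helpers `detrn.format_year_quarter()` and `detrn.assign_color()` are not shown in this notebook. The stand-ins below are hypothetical and only sketch the idea: build a `<year>Q<quarter>` label from each row, then pick one of two colors per label. Alternating by quarter parity is an assumption; the package may use different logic.

```python
import pandas as pd

AMTK_BLUE, AMTK_RED = "#00537e", "#ef3824"  # hex values as they appear in the outputs


def format_year_quarter(row):
    """Combine year and quarter into a label such as '2021Q4' (hypothetical stand-in)."""
    return f"{int(row['Fiscal Year'])}Q{int(row['Fiscal Quarter'])}"


def assign_color(year_quarter, colors):
    """Alternate colors by quarter parity (an assumption, not the package's logic)."""
    quarter = int(year_quarter.split("Q")[1])
    return colors[quarter % 2]


frame = pd.DataFrame({"Fiscal Year": [2021, 2022, 2022], "Fiscal Quarter": [4, 1, 2]})
frame["Fiscal Year Quarter"] = frame.apply(format_year_quarter, axis=1)
frame["Color"] = frame["Fiscal Year Quarter"].apply(
    assign_color, colors=[AMTK_BLUE, AMTK_RED]
)
print(frame)
```

Because adjacent quarters always differ in parity, neighboring box plots receive different colors, which makes period boundaries easier to read.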


#### 3.6.2 Preaggregate the chart data

Attempting to instantiate a Vega-Altair [`alt.Chart()`](https://altair-viz.github.io/user_guide/generated/toplevel/altair.Chart.html) object with a dataset comprising more than `5000` rows will trigger a `MaxRowsError`. You can disable the `MaxRows` check by calling the `alt.data_transformers.disable_max_rows()` method. However, disabling the check may result in performance issues, including browser crashes.

The preferred approach when [working with large datasets](https://altair-viz.github.io/user_guide/large_datasets.html#large-datasets) is to _preaggregate_ the data before generating a plot. This can be achieved "manually" (the approach adopted in this notebook) or by [installing](https://altair-viz.github.io/user_guide/large_datasets.html#installing-vegafusion) and [enabling](https://altair-viz.github.io/user_guide/large_datasets.html#enabling-the-vegafusion-data-transformer) Altair's companion [vegafusion](https://vegafusion.io/) data transformer package.

In [17]:
# Compute aggregation statistics
cols = [COLS["year_quarter"], COLS["late_detrn_avg_mm_late"]]

# Pre-aggregate the data
chrt_data = frm.aggregate_data(chrt_data, cols)
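
The internals of `frm.aggregate_data()` are not shown. As a minimal sketch of what "manual" preaggregation for a box plot can look like, the example below reduces each period to a five-number summary plus a list of outliers using Tukey's 1.5 × IQR rule. The function name, the whisker rule, and the output column names are assumptions (the `outliers` name simply mirrors the `outliers:Q` shorthand used when plotting).

```python
import pandas as pd


def boxplot_stats(values):
    """Summarize one group using Tukey's 1.5 * IQR whisker rule (assumed rule)."""
    q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = values[(values >= lower_fence) & (values <= upper_fence)]
    return {
        "lower": inside.min(),
        "q1": q1,
        "median": median,
        "q3": q3,
        "upper": inside.max(),
        "outliers": values[(values < lower_fence) | (values > upper_fence)].tolist(),
    }


# Hypothetical long-form data: one row per late arrival record
data = pd.DataFrame(
    {
        "Fiscal Year Quarter": ["2023Q4"] * 5 + ["2024Q1"] * 5,
        "Avg Min Late": [20.0, 30.0, 36.0, 48.0, 300.0, 25.0, 28.0, 33.0, 40.0, 45.0],
    }
)

# One summary row per period instead of one row per record
rows = {
    period: boxplot_stats(group)
    for period, group in data.groupby("Fiscal Year Quarter")["Avg Min Late"]
}
agg = pd.DataFrame.from_dict(rows, orient="index")
print(agg)
```

The chart then receives one row per period rather than thousands of raw records, which keeps the dataset under Altair's row limit.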

#### 3.6.3 Generate the box plots

In [18]:
# Create chart title
title_text = f"Amtrak {SVC_LINES['nec']} (NEC) Late Detraining Passengers"
title = ttl.format_title(nec_stats, title_text)

chart_horizontal = boxp.create_boxplot(
    data=chrt_data,
    x_shorthand="Late Detraining Customers Avg Min Late:Q",
    x_title="Average Minutes Late",
    y_shorthand="Fiscal Year Quarter:N",
    y_title="Period",
    box_size=20,
    outlier_shorthand="outliers:Q",
    color_shorthand="Color:N",
    chart_title=title,
    orient=boxp.Orient.HORIZONTAL,
    height=400,
    width=680,
)
chart_horizontal.display()

## 4.0 State Supported services

In [19]:
state = ntwk.by_service_line(network, SVC_LINES["state"])
state.shape

(32668, 24)

### 4.1 State Supported: on-time performance metrics (entire period) [1 pt]

State supported station performance data is a compilation of quarterly metrics that focus on late
detraining passengers. Detraining passengers are considered on-time if they arrive at their
destination no later than fifteen (`15`) minutes after their scheduled arrival time. All other
detraining passengers are considered late.

In [20]:
# Total train arrivals
state_trn_arrivals = state.shape[0]

# Detraining totals
state_detrn = state[COLS["total_detrn"]].sum()
state_detrn_late = state[COLS["late_detrn"]].sum()
state_detrn_on_time = state_detrn - state_detrn_late

print(
    f"Train Arrivals: {state_trn_arrivals}",
    f"Total Detraining Customers: {state_detrn}",
    f"Late Detraining Customers: {state_detrn_late}",
    f"On-Time Detraining Customers: {state_detrn_on_time}",
    sep="\n",
)

# Compute summary statistics
state_stats = detrn.get_sum_stats(state, AGG["columns"], AGG["funcs"])
state_stats

Train Arrivals: 32668
Total Detraining Customers: 34009681
Late Detraining Customers: 7073116
On-Time Detraining Customers: 26936565


Unnamed: 0,Train Arrivals,Total Detraining Customers sum,Total Detraining Customers mean,Total Detraining Customers median,Total Detraining Customers std,Late Detraining Customers sum,Late Detraining Customers mean,Late Detraining Customers median,Late Detraining Customers std,Late to Total Detraining Customers Ratio,Late Detraining Customers Avg Min Late mean,Total On Time Detraining Customers sum
0,32668,34009681.0,1041.0702,260.0,2307.465,7073116.0,216.5151,28.0,616.6383,0.208,,26936565.0


In [21]:
#hidden tests are within this cell

### 4.2 State Supported: mean late arrival times summary statistics

Review the central tendency, dispersion, and shape for the mean late arrival times of state
supported trains. Call the custom function named `frm.describe_numeric_column()` to return a
dictionary of summary statistics.

In [22]:
# Drop missing values
state_avg_mm_late = state[COLS["late_detrn_avg_mm_late"]].dropna().reset_index(drop=True)

# Call the custom frm.describe_numeric_column() function again
state_avg_mm_late_describe = frm.describe_numeric_column(state_avg_mm_late)
state_avg_mm_late_describe

{'type': pandas.core.series.Series,
 'name': 'Late Detraining Customers Avg Min Late',
 'values': {'non_null': np.int64(26538),
  'missing': np.int64(0),
  'dtype': dtype('float64')},
 'center': {'mean': np.float64(41.353304695154115),
  'median': 37.0,
  'mode': np.float64(33.0)},
 'position': {'min': 2.0,
  '25%': np.float64(28.0),
  '50%': np.float64(37.0),
  '75%': np.float64(48.0),
  'max': 866.0},
 'spread': {'variance': 569.7897055122363,
  'std': 23.870268232934382,
  'range': 864.0,
  'iqr': np.float64(20.0)},
 'shape': {'skewness': np.float64(7.4154966261985455),
  'kurtosis': np.float64(156.26897185850837)}}

The skewness and kurtosis values returned suggest that the distribution of mean late arrival times
of state supported trains is positively skewed and features a sharper peak and heavier right tail
than a normal distribution. Let's confirm this visually by generating a histogram.

### 4.3 State Supported: visualize distribution of mean late arrival times

Visualize mean late arrival times for the entire period. The data is binned prior to plotting.

#### 4.3.1 Create the chart data

In [23]:
# Convert to DataFrame
state_avg_mm_late = state_avg_mm_late.to_frame(name=COLS["avg_mm_late"])

# Get mean and standard deviation
mu = state_avg_mm_late_describe["center"]["mean"]
sigma = state_avg_mm_late_describe["spread"]["std"]

# Get max value (for x-axis ticks); pad max value for chart display
max_val = state_avg_mm_late_describe["position"]["max"]
max_val_ceil = (np.ceil(max_val / 10) * 10).astype(int)

# Create bins
state_min_late, bins, num_bins, bin_width = frm.create_bins(
    state_avg_mm_late, COLS["avg_mm_late"], 15
)

# Bin the data
chrt_data = frm.bin_data(state_min_late, COLS["avg_mm_late"], bins)


#### 4.3.2 Generate the histogram [1 pt]

In [24]:
# Chart title
title_txt = f"Amtrak {SVC_LINES['state']} Late Detraining Passengers"
title = ttl.format_title(state_stats, title_txt)

# Tooltips
tooltip_config = [
    {"shorthand": "bin_center:Q", "title": "Average Minutes Late", "format": None},
    {"shorthand": "count:Q", "title": "Late Arrivals Count", "format": None},
]

# Create and display the histogram
chart = hst.create_histogram(
    frame=chrt_data,
    x_shorthand="bin_center:Q",
    x_title="Average Minutes Late",
    y_shorthand="count:Q",
    y_title="Late Arrivals Count",
    y_stack=False,
    line_shorthand="Avg Min Late:Q",
    mu=mu,
    sigma=sigma,
    num_bins=num_bins,
    bin_width=bin_width,
    x_tick_count_max=max_val_ceil,
    bar_color=COLORS["amtk_blue"],
    mu_color=COLORS["amtk_red"],
    sigma_color=COLORS["anth_gray"],
    tooltip_config=tooltip_config,
    title=title,
    width=680,
)

chart.display()

In [25]:
#hidden tests are within this cell

### 4.4 State Supported: on-time performance metrics (by fiscal year and quarter) [1 pt]

Compute OTP summary statistics per fiscal year and quarter. Add quarterly train arrival metrics to
the `DataFrame` named `state_qtr_stats`.

In [26]:
# Get quarterly stats
state_qtr_stats = detrn.get_sum_stats_by_group(
    state,
    [COLS["year"], COLS["quarter"]],
    AGG["columns"],
    AGG["funcs"],
    state_trn_arrivals,
    state_detrn,
)
state_qtr_stats

Unnamed: 0,Fiscal Year,Fiscal Quarter,Train Arrivals,Train Arrival Ratio,Detraining Ratio,Total Detraining Customers sum,Total Detraining Customers mean,Total Detraining Customers median,Total Detraining Customers std,Late Detraining Customers sum,Late Detraining Customers mean,Late Detraining Customers median,Late Detraining Customers std,Late to Total Detraining Customers Ratio,Late Detraining Customers Avg Min Late mean,Total On Time Detraining Customers sum
0,2021,4,2140,0.065508,0.047953,1630859,762.0836,187.0,1698.2159,243362,113.7206,16.0,309.95,0.1492,39.343,1387497
1,2022,1,2768,0.084731,0.071157,2420004,874.2789,209.5,2018.3384,471803,170.4491,23.0,490.5807,0.195,42.0054,1948201
2,2022,2,2516,0.077017,0.055615,1891436,751.7631,185.5,1671.0329,367965,146.25,21.0,427.0863,0.1945,40.4508,1523471
3,2022,3,2604,0.079711,0.082085,2791673,1072.071,266.5,2346.6767,539226,207.076,23.0,577.202,0.1932,39.7649,2252447
4,2022,4,2611,0.079925,0.088708,3016893,1155.455,298.0,2514.2213,729924,279.5573,36.0,738.7811,0.2419,40.2078,2286969
5,2023,1,2634,0.080629,0.08624,2932984,1113.5095,265.0,2501.0491,635842,241.3979,34.0,636.15,0.2168,44.7644,2297142
6,2023,2,2729,0.083537,0.073508,2499957,916.0707,217.0,2002.6144,468393,171.6354,21.0,461.7221,0.1874,41.7219,2031564
7,2023,3,2929,0.08966,0.090739,3085975,1053.5934,271.0,2351.3805,693595,236.8027,29.0,674.7411,0.2248,38.561,2392380
8,2023,4,3001,0.091864,0.10302,3503648,1167.4935,290.0,2607.2907,857504,285.7394,38.0,787.6521,0.2447,43.5911,2646144
9,2024,1,2863,0.087639,0.107524,3656837,1277.2745,353.0,2697.72,696392,243.2386,39.0,654.1101,0.1904,41.1537,2960445


In [None]:
#hidden tests are within this cell

#### 4.4.1 Write to file [1 pt]

Write `state_qtr_stats` to a CSV file named `stu-amtk-state_qtr_stats.csv`. Store the file in the
`data/student` directory. Then compare it to the accompanying `fxt-amtk-state_qtr_stats.csv` file.
It must match line for line, character for character.

In [27]:
# YOUR CODE HERE
filepath = parent_path.joinpath("data", "student", "stu-amtk-state_qtr_stats.csv")
state_qtr_stats.to_csv(filepath, index=True)

In [None]:
#hidden tests are within this cell

### 4.5 State Supported: visualize detraining passengers

Visualize Amtrak's state-supported detraining passengers, both on-time and late, across all years
and quarters with a bar chart.

In [29]:
# Assemble the data for the chart
chrt_data = bar.create_detrain_chart_frame(state_qtr_stats, CHRT_BAR["columns"])

# Create chart title
title_text = f"Amtrak {SVC_LINES['state']} Detraining Passengers"
title = ttl.format_title(state_stats, title_text)

# Grouped bar chart
chart = bar.create_grouped_bar_chart(
    chrt_data,
    "Fiscal Period:N",
    "Passengers:Q",
    "Arrival Status:N",
    CHRT_BAR["xoffset_sort"],
    CHRT_BAR["colors"],
    title,
)

chart.display()

### 4.6 State Supported: distribution of mean late arrival times (by fiscal year and quarter)

Visualizing mean late arrival times grouped by fiscal year and quarter may reveal interesting
patterns.

The data is flattened prior to creating a series of box plots. The fiscal year and quarter columns
are combined (e.g., `< year >Q< quarter >`) to create a new column named "Fiscal Year Quarter" while
a second column is added to color code each quarter and its associated box plot.

#### 4.6.1 Create the chart data [1 pt]

In [30]:
cols = [COLS["year"], COLS["quarter"], COLS["late_detrn_avg_mm_late"]]

# Group by fiscal year and quarter, flatten, and reset index
state_avg_mm_late = state.groupby(cols[:2])[cols].apply(lambda x: x).reset_index(drop=True)

# Add column
state_avg_mm_late.loc[:, COLS["year_quarter"]] = state_avg_mm_late.apply(detrn.format_year_quarter, axis=1)

# Drop columns and reorder
state_avg_mm_late.drop(cols[:2], axis=1, inplace=True)
state_avg_mm_late.insert(0, COLS["year_quarter"], state_avg_mm_late.pop(COLS["year_quarter"]))

# Add alternating colors
colors = [COLORS["amtk_blue"], COLORS["amtk_red"]]
state_avg_mm_late.loc[:, "Color"] = state_avg_mm_late[COLS["year_quarter"]].apply(
    detrn.assign_color, colors=colors
)

state_avg_mm_late


Unnamed: 0,Fiscal Year Quarter,Late Detraining Customers Avg Min Late,Color
0,2021Q4,37.0,#00537e
1,2021Q4,,#00537e
2,2021Q4,41.0,#00537e
3,2021Q4,,#00537e
4,2021Q4,33.0,#00537e
...,...,...,...
32663,2024Q3,73.0,#ef3824
32664,2024Q3,40.0,#ef3824
32665,2024Q3,42.0,#ef3824
32666,2024Q3,18.0,#ef3824


In [None]:
#hidden tests are within this cell

#### 4.6.2 Preaggregate the data

In [31]:
# Compute aggregation statistics
cols = [COLS["year_quarter"], COLS["late_detrn_avg_mm_late"]]

# Pre-aggregate the data
agg_stats = frm.aggregate_data(state_avg_mm_late, cols)

#### 4.6.3 Generate the box plots

In [32]:
# Create chart title
title_text = f"Amtrak {SVC_LINES['state']} Late Detraining Passengers"
title = ttl.format_title(state_stats, title_text)

chart_horizontal = boxp.create_boxplot(
    data=agg_stats,
    x_shorthand="Late Detraining Customers Avg Min Late:Q",
    x_title="Average Minutes Late",
    y_shorthand="Fiscal Year Quarter:N",
    y_title="Period",
    box_size=20,
    outlier_shorthand="outliers:Q",
    color_shorthand="Color:N",
    chart_title=title,
    orient=boxp.Orient.HORIZONTAL,
    height=400,
    width=680,
)
chart_horizontal.display()

## 5.0 Long Distance services

In [33]:
long_dist = ntwk.by_service_line(network, SVC_LINES["long_dist"])
long_dist.shape

(9391, 24)

### 5.1 Long Distance: on-time performance metrics (entire period)

Long distance station performance data is a compilation of quarterly metrics that focus on late
detraining passengers. Detraining passengers are considered on-time if they arrive at their
destination no later than fifteen (`15`) minutes after their scheduled arrival time. All other
detraining passengers are considered late.

In [35]:
# Total train arrivals
long_dist_trn_arrivals = long_dist.shape[0]

# Detraining totals
long_dist_detrn = long_dist[COLS["total_detrn"]].sum()
long_dist_detrn_late = long_dist[COLS["late_detrn"]].sum()
long_dist_detrn_on_time = long_dist_detrn - long_dist_detrn_late

print(
    f"Train Arrivals: {long_dist_trn_arrivals}",
    f"Total Detraining Customers: {long_dist_detrn}",
    f"Late Detraining Customers: {long_dist_detrn_late}",
    f"On-Time Detraining Customers: {long_dist_detrn_on_time}",
    sep="\n",
)

# Compute summary statistics
long_dist_stats = detrn.get_sum_stats(long_dist, AGG["columns"], AGG["funcs"])
long_dist_stats


Train Arrivals: 9391
Total Detraining Customers: 10585139
Late Detraining Customers: 5108705
On-Time Detraining Customers: 5476434


Unnamed: 0,Train Arrivals,Total Detraining Customers sum,Total Detraining Customers mean,Total Detraining Customers median,Total Detraining Customers std,Late Detraining Customers sum,Late Detraining Customers mean,Late Detraining Customers median,Late Detraining Customers std,Late to Total Detraining Customers Ratio,Late Detraining Customers Avg Min Late mean,Total On Time Detraining Customers sum
0,9391,10585139.0,1127.1578,400.0,2542.7741,5108705.0,544.0001,188.0,1346.9626,0.4826,,5476434.0


### 5.2 Long Distance: mean late arrival times summary statistics

Review the central tendency, dispersion, and shape for the mean late arrival times of long distance
trains. Call the custom function named `frm.describe_numeric_column()` to return a dictionary of
summary statistics.

In [36]:
# Drop missing values
long_dist_avg_min_late = long_dist[COLS["late_detrn_avg_mm_late"]].dropna().reset_index(drop=True)

# Call the custom frm.describe_numeric_column() function again
long_dist_avg_min_late_describe = frm.describe_numeric_column(long_dist_avg_min_late)
long_dist_avg_min_late_describe

{'type': pandas.core.series.Series,
 'name': 'Late Detraining Customers Avg Min Late',
 'values': {'non_null': np.int64(9002),
  'missing': np.int64(0),
  'dtype': dtype('float64')},
 'center': {'mean': np.float64(88.39435680959787),
  'median': 75.0,
  'mode': np.float64(58.0)},
 'position': {'min': 10.0,
  '25%': np.float64(55.0),
  '50%': np.float64(75.0),
  '75%': np.float64(106.0),
  'max': 630.0},
 'spread': {'variance': 2999.5762730058777,
  'std': 54.76838753337437,
  'range': 620.0,
  'iqr': np.float64(51.0)},
 'shape': {'skewness': np.float64(3.0544918832115258),
  'kurtosis': np.float64(16.143866090902232)}}

The skewness and kurtosis values returned suggest that the distribution of mean late arrival times
of long distance trains is positively skewed and features a sharper peak and heavier right
tail than a normal distribution. Let's confirm this visually by generating a histogram.

### 5.3 Long Distance: visualize distribution of mean late arrival times

Visualize mean late arrival times for the entire period. The data is binned prior to plotting.

#### 5.3.1 Create the chart data

In [37]:
# Convert to DataFrame
long_dist_avg_min_late = long_dist_avg_min_late.to_frame(name=COLS["avg_mm_late"])

# Get mean and standard deviation
mu = long_dist_avg_min_late_describe["center"]["mean"]
sigma = long_dist_avg_min_late_describe["spread"]["std"]

# Get max value (for x-axis ticks); pad max value for chart display
max_val = long_dist_avg_min_late_describe["position"]["max"]
max_val_ceil = (np.ceil(max_val / 10) * 10).astype(int)

# Create bins
long_dist_min_late, bins, num_bins, bin_width = frm.create_bins(
    long_dist_avg_min_late, COLS["avg_mm_late"], 10
)

# Bin the data
chrt_data = frm.bin_data(long_dist_min_late, COLS["avg_mm_late"], bins)
# chrt_data

#### 5.3.2 Generate the histogram

In [38]:
# Chart title
title_txt = f"Amtrak {SVC_LINES['long_dist']} Service Late Detraining Passengers"
title = ttl.format_title(long_dist_stats, title_txt)

# Tooltips
tooltip_config = [
    {"shorthand": "bin_center:Q", "title": "Average Minutes Late", "format": None},
    {"shorthand": "count:Q", "title": "Late Arrivals Count", "format": None},
]

# Create and display the histogram
chart = hst.create_histogram(
    frame=chrt_data,
    x_shorthand="bin_center:Q",
    x_title="Average Minutes Late",
    y_shorthand="count:Q",
    y_title="Late Arrivals Count",
    y_stack=False,
    line_shorthand="Avg Min Late:Q",
    mu=mu,
    sigma=sigma,
    num_bins=num_bins,
    bin_width=bin_width,
    x_tick_count_max=max_val_ceil,
    bar_color=COLORS["amtk_blue"],
    mu_color=COLORS["amtk_red"],
    sigma_color=COLORS["anth_gray"],
    tooltip_config=tooltip_config,
    title=title,
    width=680,
)
chart.display()

### 5.4 Long Distance: on-time performance metrics (by fiscal year and quarter)

Compute OTP summary statistics per fiscal year and quarter. Add quarterly train arrival metrics to
the `DataFrame` named `long_dist_qtr_stats`.

In [40]:
# Get quarterly stats
long_dist_qtr_stats = detrn.get_sum_stats_by_group(
    long_dist,
    [COLS["year"], COLS["quarter"]],
    AGG["columns"],
    AGG["funcs"],
    long_dist_trn_arrivals,
    long_dist_detrn,
)
long_dist_qtr_stats


Unnamed: 0,Fiscal Year,Fiscal Quarter,Train Arrivals,Train Arrival Ratio,Detraining Ratio,Total Detraining Customers sum,Total Detraining Customers mean,Total Detraining Customers median,Total Detraining Customers std,Late Detraining Customers sum,Late Detraining Customers mean,Late Detraining Customers median,Late Detraining Customers std,Late to Total Detraining Customers Ratio,Late Detraining Customers Avg Min Late mean,Total On Time Detraining Customers sum
0,2021,4,101,0.010755,0.008304,87895,870.2475,354.0,1453.8185,28324,280.4356,114.0,486.3524,0.3222,59.1735,59571
1,2022,1,795,0.084656,0.085319,903116,1135.995,433.0,2437.9046,444185,558.7233,196.0,1572.151,0.4918,82.9495,458931
2,2022,2,973,0.10361,0.062813,664882,683.332,177.0,2000.0337,338773,348.1737,85.0,1190.536,0.5095,95.1462,326109
3,2022,3,733,0.078053,0.088032,931826,1271.2497,423.0,2886.3182,588963,803.4966,264.0,1994.3889,0.6321,93.2751,342863
4,2022,4,783,0.083378,0.090241,955209,1219.9349,402.0,2702.6445,569548,727.3921,242.0,1594.8463,0.5963,95.4587,385661
5,2023,1,813,0.086572,0.093316,987767,1214.9656,448.0,2598.5291,454266,558.7528,210.0,1424.2459,0.4599,86.2015,533501
6,2023,2,964,0.102651,0.081362,861223,893.3849,272.5,2277.9249,393036,407.7137,129.0,1061.6093,0.4564,116.7527,468187
7,2023,3,791,0.08423,0.09661,1022625,1292.8255,511.0,2799.1489,490986,620.7155,250.0,1320.4383,0.4801,78.4698,531639
8,2023,4,785,0.083591,0.09705,1027285,1308.6433,513.0,2683.0387,511954,652.1707,279.0,1214.5009,0.4984,89.3501,515331
9,2024,1,827,0.088063,0.103132,1091665,1320.0302,548.0,2610.5391,362260,438.0411,196.0,798.2051,0.3318,71.0511,729405


#### 5.4.1 Write to file [1 pt]

Write `long_dist_qtr_stats` to a CSV file named `stu-amtk-long_dist_qtr_stats.csv`. Store the file in the
`data/student` directory. Then compare it to the accompanying `fxt-amtk-long_dist_qtr_stats.csv` file.
It must match line for line, character for character.

In [41]:
# YOUR CODE HERE
filepath = parent_path.joinpath("data", "student", "stu-amtk-long_dist_qtr_stats.csv")
long_dist_qtr_stats.to_csv(filepath, index=True)

In [None]:
#hidden tests are within this cell

### 5.5 Long Distance: visualize detraining passengers

Visualize Amtrak's long distance detraining passengers, both on-time and late, across all years
and quarters with a bar chart.

In [42]:
# Assemble the data for the chart
chrt_data = bar.create_detrain_chart_frame(long_dist_qtr_stats, CHRT_BAR["columns"])

# Create chart title
title_text = f"Amtrak {SVC_LINES['long_dist']} Detraining Passengers"
title = ttl.format_title(long_dist_stats, title_text)

# Grouped bar chart
chart = bar.create_grouped_bar_chart(
    chrt_data,
    "Fiscal Period:N",
    "Passengers:Q",
    "Arrival Status:N",
    CHRT_BAR["xoffset_sort"],
    CHRT_BAR["colors"],
    title,
)

chart.display()

### 5.6 Long Distance: distribution of mean late arrival times (by fiscal year and quarter)

Visualizing mean late arrival times grouped by fiscal year and quarter may reveal interesting
patterns.

The data is flattened before the box plots are created. The fiscal year and quarter columns
are combined into a single label (`<year>Q<quarter>`, e.g., `2021Q4`) stored in a new column named
"Fiscal Year Quarter". A second column, "Color", assigns each quarter the color used for its box plot.
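
The two derived columns can be sketched as follows. The functions below are hypothetical stand-ins for `detrn.format_year_quarter` and `detrn.assign_color`, whose implementations are not shown in this notebook. The coloring rule (even-numbered quarters take the first color, odd-numbered quarters the second) is an assumption chosen to agree with the rows displayed in this section, where `2021Q4` is blue and `2024Q3` is red.

```python
import pandas as pd

# Hex values as they appear in the notebook output
AMTK_BLUE = "#00537e"
AMTK_RED = "#ef3824"


def format_year_quarter(row: pd.Series) -> str:
    """Combine fiscal year and quarter into a label such as '2021Q4'."""
    return f"{int(row['Fiscal Year'])}Q{int(row['Fiscal Quarter'])}"


def assign_color(year_quarter: str, colors: list[str]) -> str:
    """Alternate colors by quarter number (assumed rule: even quarters take colors[0])."""
    quarter = int(year_quarter.split("Q")[1])
    return colors[quarter % len(colors)]


# Hypothetical sample rows
frame = pd.DataFrame(
    {
        "Fiscal Year": [2021, 2022, 2022],
        "Fiscal Quarter": [4, 1, 2],
        "Late Detraining Customers Avg Min Late": [100.0, 25.0, 51.0],
    }
)

# Build the combined label, then derive the color from it
frame["Fiscal Year Quarter"] = frame.apply(format_year_quarter, axis=1)
frame["Color"] = frame["Fiscal Year Quarter"].apply(
    assign_color, colors=[AMTK_BLUE, AMTK_RED]
)
```

The `int()` casts matter because a row-wise `apply` upcasts integer columns to floats when the frame also contains a float column.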

#### 5.6.1 Create the chart data

In [43]:
cols = [COLS["year"], COLS["quarter"], COLS["late_detrn_avg_mm_late"]]

# Group by fiscal year and quarter, flatten, and reset index
long_dist_avg_min_late = long_dist.groupby(cols[:2])[cols].apply(lambda x: x).reset_index(drop=True)

# Add column
long_dist_avg_min_late.loc[:, COLS["year_quarter"]] = long_dist_avg_min_late.apply(
    detrn.format_year_quarter, axis=1
)

# Drop columns and reorder
long_dist_avg_min_late.drop(cols[:2], axis=1, inplace=True)
long_dist_avg_min_late.insert(
    0, COLS["year_quarter"], long_dist_avg_min_late.pop(COLS["year_quarter"])
)

# Add alternating colors
colors = [COLORS["amtk_blue"], COLORS["amtk_red"]]
long_dist_avg_min_late.loc[:, "Color"] = long_dist_avg_min_late[COLS["year_quarter"]].apply(
    detrn.assign_color, colors=colors
)
long_dist_avg_min_late

Unnamed: 0,Fiscal Year Quarter,Late Detraining Customers Avg Min Late,Color
0,2021Q4,100.0,#00537e
1,2021Q4,25.0,#00537e
2,2021Q4,51.0,#00537e
3,2021Q4,78.0,#00537e
4,2021Q4,53.0,#00537e
...,...,...,...
9386,2024Q3,156.0,#ef3824
9387,2024Q3,89.0,#ef3824
9388,2024Q3,124.0,#ef3824
9389,2024Q3,135.0,#ef3824


#### 5.6.2 Preaggregate the data

In [45]:
# Compute aggregation statistics
cols = [COLS["year_quarter"], COLS["late_detrn_avg_mm_late"]]

# Pre-aggregate the data
chrt_data = frm.aggregate_data(long_dist_avg_min_late, cols)
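
Preaggregation reduces each period's observations to the handful of values a box plot needs, so the chart does not have to embed every row. The sketch below is an assumption about what a helper such as `frm.aggregate_data` might compute; its actual implementation is not shown. It uses the conventional Tukey rule (whiskers at the most extreme points within 1.5 × IQR of the quartiles), and the function name `boxplot_stats`, the column name `Avg Min Late`, and the sample values are all hypothetical.

```python
import pandas as pd


def boxplot_stats(values: pd.Series) -> dict:
    """Compute quartiles, whisker ends, and outliers using the 1.5 x IQR rule."""
    q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = values[(values >= lower_fence) & (values <= upper_fence)]
    outliers = values[(values < lower_fence) | (values > upper_fence)]
    return {
        "lower": inside.min(),  # end of the lower whisker
        "q1": q1,
        "median": median,
        "q3": q3,
        "upper": inside.max(),  # end of the upper whisker
        "outliers": outliers.tolist(),
    }


# Hypothetical sample data for two periods
frame = pd.DataFrame(
    {
        "Fiscal Year Quarter": ["2021Q4"] * 5 + ["2022Q1"] * 5,
        "Avg Min Late": [10, 20, 30, 40, 500, 15, 25, 35, 45, 55],
    }
)

# One row of box plot statistics per period
rows = []
for period, values in frame.groupby("Fiscal Year Quarter")["Avg Min Late"]:
    rows.append({"Fiscal Year Quarter": period, **boxplot_stats(values)})
preagg = pd.DataFrame(rows)
```

Keeping the outliers as a list per period preserves the individual points for plotting while the rest of the distribution is summarized by five numbers.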

#### 5.6.3 Generate box plots [1 pt]

In [52]:
# Create chart title
title_text = f"Amtrak {SVC_LINES['long_dist']} Late Detraining Passengers"
title = ttl.format_title(long_dist_stats, title_text)

chart_horizontal = boxp.create_boxplot(
    data=chrt_data,
    x_shorthand="Late Detraining Customers Avg Min Late:Q",
    x_title="Average Minutes Late",
    y_shorthand="Fiscal Year Quarter:N",
    y_title="Period",
    box_size=20,
    outlier_shorthand="outliers:Q",
    color_shorthand="Color:N",
    chart_title=title,
    orient=boxp.Orient.HORIZONTAL,
    height=400,
    width=680,
)
chart_horizontal.display()

In [None]:
#hidden tests are within this cell

## 6.0 Watermark

In [47]:
%load_ext watermark
%watermark -h -i -iv -m -v

Python implementation: CPython
Python version       : 3.11.9
IPython version      : 8.26.0

Compiler    : GCC 12.3.0
OS          : Linux
Release     : 6.5.0-1020-aws
Machine     : x86_64
Processor   : x86_64
CPU cores   : 32
Architecture: 64bit

Hostname: 731aada350b5

numpy : 2.1.3
pandas: 2.2.3

