# Explore: Amtrak Stations

## Intercity Passenger Rail Service Station Performance Metrics

The Amtrak [network](https://www.amtrak.com/content/dam/projects/dotcom/english/public/documents/Maps/Amtrak-System-Map-020923.pdf)
provides intercity passenger rail service in the
continental United States and to select Canadian cities. The network is operated by the
[National Railroad Passenger Corporation](https://railroads.dot.gov/passenger-rail/amtrak/amtrak),
a federally chartered for-profit corporation that receives some state funding and covers its
operating costs by selling tickets and providing other services.

This notebook commences exploration of the augmented quarterly
[Amtrak](https://www.amtrak.com/home.html) station performance metrics. The goal is to better
understand individual Amtrak station performance and identify potential areas for further analysis.

### Variable names

A number of variable names in this project leverage the following abbreviations. The naming
strategy is to strike a balance between brevity and readability:

* `amtk`: Amtrak (reporting mark)
* `chrt`: chart
* `cols`: columns
* `const`: constant
* `cwd`: current working directory
* `eb`: eastbound direction of travel
* `lm`: linear model
* `mi`: miles
* `mm`: minutes (ISO 8601)
* `nb`: northbound direction of travel
* `psgr`: passenger
* `qtr`: quarter
* `rte`: route
* `sb`: southbound direction of travel
* `stats`: summary statistics
* `stn`: station
* `stns`: stations
* `svc`: service
* `trn`: train
* `wb`: westbound direction of travel

In [1]:
import numpy as np
import pandas as pd
import pathlib as pl
import tomllib as tl

import fra_amtrak.amtk_detrain as detrn
import fra_amtrak.amtk_frame as frm
import fra_amtrak.amtk_network as ntwk
import fra_amtrak.chart_bar as vis_bar
import fra_amtrak.chart_box as box
import fra_amtrak.chart_hist as hst
import fra_amtrak.chart_title as ttl


## 1.0 Read files

### 1.1 Resolve paths

In [2]:
parent_path = pl.Path.cwd() # current working directory
parent_path


PosixPath('/home/jovyan/work/assignments/Course4')

### 1.2 Load constants

Load a companion [TOML](https://toml.io/en/) file containing constants.

In [3]:
filepath = parent_path.joinpath("notebook.toml")
with open(filepath, "rb") as file_obj:
    const = tl.load(file_obj)

# Access constants
AGG = const["agg"]
CHRT_BAR = const["chart"]["bar"]
CHRT_BOX = const["chart"]["box"]
COLORS = const["colors"]
COLS = const["columns"]
STNS = const["stations"]


### 1.3 Retrieve performance data

In [4]:
filepath = parent_path.joinpath("data", "processed", "station_performance_metrics-v1p2.csv")
stations = pd.read_csv(
    filepath, dtype={"Address 02": "str", "ZIP Code": "str"}, low_memory=False
)  # avoid DtypeWarning
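
Besides silencing the warning, reading `ZIP Code` as text has a side effect worth knowing: leading zeros survive. The sketch below uses an invented two-row sample read from memory; only the column names come from this notebook.

```python
import io

import pandas as pd

# Invented two-row sample; only the column names come from the notebook
csv_text = "Arrival Station Code,ZIP Code\nBOS,02111\nNYP,10001\n"

as_inferred = pd.read_csv(io.StringIO(csv_text))
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"ZIP Code": "str"})

print(as_inferred["ZIP Code"].tolist())  # parsed as integers, leading zero lost
print(as_text["ZIP Code"].tolist())  # kept as text
```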

### 1.4 Review the `DataFrame`

In [5]:
stations.shape

(68412, 24)

In [6]:
stations.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68412 entries, 0 to 68411
Data columns (total 24 columns):
 #   Column                                    Non-Null Count  Dtype  
---  ------                                    --------------  -----  
 0   Fiscal Year                               68412 non-null  int64  
 1   Fiscal Quarter                            68412 non-null  int64  
 2   Service Line                              68412 non-null  object 
 3   Service                                   68412 non-null  object 
 4   Sub Service                               68412 non-null  object 
 5   Route Miles                               68412 non-null  int64  
 6   Train Number                              68412 non-null  int64  
 7   Arrival Station Code                      68412 non-null  object 
 8   Arrival Station                           68412 non-null  object 
 9   Arrival Station Type                      68386 non-null  object 
 10  City                              

In [7]:
stations.head(3)

Unnamed: 0,Fiscal Year,Fiscal Quarter,Service Line,Service,Sub Service,Route Miles,Train Number,Arrival Station Code,Arrival Station,Arrival Station Type,...,State,Division,Region,Country,Latitude,Longitude,Total Detraining Customers,Late Detraining Customers,Late to Total Detraining Customers Ratio,Late Detraining Customers Avg Min Late
0,2024,3,Long Distance,Auto Train,Auto Train,914,52,LOR,Lorton (Auto Train),Station Building (with waiting room),...,Virginia,South Atlantic,South,United States,38.708143,-77.220942,42445,23316,0.54932,95.0
1,2024,3,Long Distance,Auto Train,Auto Train,914,53,SFA,Sanford (Auto Train),Station Building (with waiting room),...,Florida,South Atlantic,South,United States,28.808544,-81.291274,28034,18439,0.65774,91.0
2,2024,3,Long Distance,California Zephyr,California Zephyr,2408,5,BRL,Burlington,Station Building (with waiting room),...,Iowa,West North Central,Midwest,United States,40.805788,-91.101951,557,223,0.40036,54.0


## 2.0 Passenger arrivals

### 2.1 Top 10 stations (entire period)

The function `ntwk.get_n_busiest_stations()` retrieves the `n` busiest stations. The
results can be filtered on a geographic unit (e.g., state, division, region) and/or a fiscal year
and its associated quarters.

The ten (`10`) busiest stations based on passenger arrivals for the entire period under review are listed below.

In [8]:
# Columns of interest (for display output only)
cols = [
    COLS["station_code"],
    COLS["station"],
    COLS["city"],
    COLS["state"],
    COLS["division"],
    COLS["region"],
    COLS["total_detrn"],
]

top_n_stns = ntwk.get_n_busiest_stations(stations, 10)[cols]
top_n_stns

Unnamed: 0,Arrival Station Code,Arrival Station,City,State,Division,Region,Total Detraining Customers
0,NYP,NY Moynihan Train Hall at Penn Station,New York,New York,Middle Atlantic,Northeast,14571687
1,WAS,Washington,Washington,District of Columbia,South Atlantic,South,6648302
2,PHL,Philadelphia (Gray 30th St Sta),Philadelphia,Pennsylvania,Middle Atlantic,Northeast,5852475
3,CHI,Chicago (Union Station),Chicago,Illinois,East North Central,Midwest,3728162
4,BOS,Boston (South Station),Boston,Massachusetts,New England,Northeast,2202779
5,BAL,Baltimore (Penn Station),Baltimore,Maryland,South Atlantic,South,1522995
6,LAX,Los Angeles,Los Angeles,California,Pacific,West,1383151
7,NHV,New Haven (Union Station),New Haven,Connecticut,New England,Northeast,1133913
8,BBY,Boston (Back Bay Station),Boston,Massachusetts,New England,Northeast,1121454
9,ALB,Albany-Rensselaer,Rensselaer,New York,Middle Atlantic,Northeast,1105524
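
The implementation of `ntwk.get_n_busiest_stations()` is not shown here. One plausible way to produce such a ranking is to sum detraining customers per station and keep the `n` largest totals. The helper name `busiest_stations` and the sample rows below are hypothetical.

```python
import pandas as pd

# Invented rows: one row per train per quarter, as in the stations DataFrame
sample = pd.DataFrame(
    {
        "Arrival Station Code": ["NYP", "NYP", "WAS", "PHL", "PHL", "CHI"],
        "Total Detraining Customers": [900, 800, 700, 300, 350, 500],
    }
)


def busiest_stations(frame: pd.DataFrame, n: int) -> pd.DataFrame:
    """Hypothetical stand-in: total arrivals per station, largest first."""
    grouped = frame.groupby("Arrival Station Code", as_index=False)
    totals = grouped["Total Detraining Customers"].sum()
    return totals.nlargest(n, "Total Detraining Customers").reset_index(drop=True)


top_2 = busiest_stations(sample, 2)
print(top_2)
```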


### 2.2 Top 10 stations (2023 Q1-Q2) [1 pt]

Top ten (`10`) busiest stations based on passenger arrivals for the year `2023`, quarters `01` and
`02`. This example demonstrates how to filter the data based on a fiscal year and its associated
quarters.

In [9]:
top_n_stns = ntwk.get_n_busiest_stations(stations, 10, None, 2023, 1, 2)[cols]
top_n_stns

Unnamed: 0,Arrival Station Code,Arrival Station,City,State,Division,Region,Total Detraining Customers
0,NYP,NY Moynihan Train Hall at Penn Station,New York,New York,Middle Atlantic,Northeast,2397999
1,WAS,Washington,Washington,District of Columbia,South Atlantic,South,1080289
2,PHL,Philadelphia (Gray 30th St Sta),Philadelphia,Pennsylvania,Middle Atlantic,Northeast,962038
3,CHI,Chicago (Union Station),Chicago,Illinois,East North Central,Midwest,603897
4,BOS,Boston (South Station),Boston,Massachusetts,New England,Northeast,353126
5,BAL,Baltimore (Penn Station),Baltimore,Maryland,South Atlantic,South,245041
6,NHV,New Haven (Union Station),New Haven,Connecticut,New England,Northeast,195653
7,ALB,Albany-Rensselaer,Rensselaer,New York,Middle Atlantic,Northeast,191058
8,BBY,Boston (Back Bay Station),Boston,Massachusetts,New England,Northeast,183284
9,LAX,Los Angeles,Los Angeles,California,Pacific,West,178735
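
The positional arguments `None, 2023, 1, 2` appear to mean no geographic grouping, fiscal year `2023`, and quarters `1` through `2`. Under that assumption, the period filter can be expressed as a boolean mask. The helper `filter_period` is hypothetical and the rows are invented.

```python
import pandas as pd

# Invented rows using the notebook's fiscal period column names
sample = pd.DataFrame(
    {
        "Fiscal Year": [2022, 2023, 2023, 2023, 2024],
        "Fiscal Quarter": [4, 1, 2, 3, 1],
        "Total Detraining Customers": [10, 20, 30, 40, 50],
    }
)


def filter_period(frame, year, first_qtr, last_qtr):
    """Hypothetical helper: one fiscal year, inclusive quarter range."""
    in_year = frame["Fiscal Year"] == year
    in_qtrs = frame["Fiscal Quarter"].between(first_qtr, last_qtr)
    return frame.loc[in_year & in_qtrs].reset_index(drop=True)


fy2023_h1 = filter_period(sample, 2023, 1, 2)
print(fy2023_h1)
```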


In [10]:
#hidden tests are within this cell

### 2.3 Top 3 stations (by region, entire period)

Top three (`3`) busiest stations in each US Census Bureau region based on passenger arrivals.

In [11]:
region_top_n_stns = ntwk.get_n_busiest_stations(stations, 3, COLS["region"])[cols]
region_top_n_stns

Unnamed: 0,Arrival Station Code,Arrival Station,City,State,Division,Region,Total Detraining Customers
0,CHI,Chicago (Union Station),Chicago,Illinois,East North Central,Midwest,3728162
1,MKE,Milwaukee (Downtown),Milwaukee,Wisconsin,East North Central,Midwest,647676
2,STL,St. Louis,St. Louis,Missouri,West North Central,Midwest,476491
3,NYP,NY Moynihan Train Hall at Penn Station,New York,New York,Middle Atlantic,Northeast,14571687
4,PHL,Philadelphia (Gray 30th St Sta),Philadelphia,Pennsylvania,Middle Atlantic,Northeast,5852475
5,BOS,Boston (South Station),Boston,Massachusetts,New England,Northeast,2202779
6,WAS,Washington,Washington,District of Columbia,South Atlantic,South,6648302
7,BAL,Baltimore (Penn Station),Baltimore,Maryland,South Atlantic,South,1522995
8,BWI,BWI Thurgood Marshall Airport Station,Baltimore,Maryland,South Atlantic,South,1012325
9,LAX,Los Angeles,Los Angeles,California,Pacific,West,1383151
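
A per-group ranking like this one can be sketched by sorting within each group and keeping the first `n` rows of each. This is an illustration with invented totals, not the code behind `ntwk.get_n_busiest_stations()`.

```python
import pandas as pd

# Invented station totals; region names follow the US Census Bureau regions
totals = pd.DataFrame(
    {
        "Arrival Station Code": ["CHI", "MKE", "STL", "NYP", "PHL", "BOS"],
        "Region": ["Midwest"] * 3 + ["Northeast"] * 3,
        "Total Detraining Customers": [500, 90, 60, 900, 400, 200],
    }
)

n = 2
top_by_region = (
    totals.sort_values(
        by=["Region", "Total Detraining Customers"], ascending=[True, False]
    )
    .groupby("Region")
    .head(n)  # first n rows of each group, i.e. the n largest after sorting
    .reset_index(drop=True)
)
print(top_by_region)
```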


### 2.4 Top 3 stations (by division, entire period)

Top three (`3`) busiest stations in each US Census Bureau division based on passenger arrivals.

In [12]:
div_top_n_stns = ntwk.get_n_busiest_stations(stations, 3, COLS["division"])[cols]
div_top_n_stns

Unnamed: 0,Arrival Station Code,Arrival Station,City,State,Division,Region,Total Detraining Customers
0,CHI,Chicago (Union Station),Chicago,Illinois,East North Central,Midwest,3728162
1,MKE,Milwaukee (Downtown),Milwaukee,Wisconsin,East North Central,Midwest,647676
2,BNL,Bloomington-Normal,Normal,Illinois,East North Central,Midwest,296534
3,MEM,Memphis,Memphis,Tennessee,East South Central,South,76373
4,JAN,Jackson,Jackson,Mississippi,East South Central,South,46701
5,BHM,Birmingham,Birmingham,Alabama,East South Central,South,41606
6,MTR,Montreal (Gare Centrale),Montreal,Quebec,Eastern Canada,Northeast,22430
7,NFS,Niagara Falls,Niagara Falls,Ontario,Eastern Canada,Northeast,21447
8,SLQ,St-Lambert,St-Lambert,Quebec,Eastern Canada,Northeast,859
9,NYP,NY Moynihan Train Hall at Penn Station,New York,New York,Middle Atlantic,Northeast,14571687


### 2.5 Top 3 stations (by state)

The top three (`3`) busiest stations in each state based on passenger arrivals.

In [13]:
state_top_n_stns = ntwk.get_n_busiest_stations(stations, 3, COLS["state"])[cols]
state_top_n_stns

Unnamed: 0,Arrival Station Code,Arrival Station,City,State,Division,Region,Total Detraining Customers
0,BHM,Birmingham,Birmingham,Alabama,East South Central,South,41606
1,TCL,Tuscaloosa,Tuscaloosa,Alabama,East South Central,South,11481
2,ATN,Anniston,Anniston,Alabama,East South Central,South,4731
3,FLG,Flagstaff (Amtrak Station),Flagstaff,Arizona,Mountain,West,40203
4,TUS,Tucson,Tucson,Arizona,Mountain,West,28936
...,...,...,...,...,...,...,...
135,HFY,Harpers Ferry,Harpers Ferry,West Virginia,South Atlantic,South,9517
136,CHW,Charleston,Charleston,West Virginia,South Atlantic,South,9325
137,MKE,Milwaukee (Downtown),Milwaukee,Wisconsin,East North Central,Midwest,647676
138,MKA,Milwaukee Airport (Trains),Milwaukee,Wisconsin,East North Central,Midwest,145344


## 3.0 Select Station metrics

### 3.1 Moynihan Train Hall at Penn Station (NYP), New York, NY

[Moynihan Train Hall](https://www.amtrak.com/stations/nyp) at Penn Station ([NYP](https://www.amtrak.com/stations/nyp)) is a major transportation hub and Amtrak's busiest station.

In [14]:
# All fiscal years and quarters
nyp = ntwk.by_station(stations, "NYP")
nyp.shape

(1719, 24)

### 3.2 NYP: on-time performance metrics (entire period)

NYP station performance data is a compilation of quarterly metrics that focus on late
detraining passengers. Detraining passengers are considered on-time if they arrive at their
destination no later than fifteen (`15`) minutes after their scheduled arrival time. All other
detraining passengers are considered late.
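
The fifteen-minute rule can be written as a small predicate. The quarterly dataset holds aggregated counts rather than timestamps, so the function `is_on_time` and its inputs are hypothetical and serve only to make the boundary case explicit.

```python
from datetime import datetime, timedelta

ON_TIME_TOLERANCE = timedelta(minutes=15)  # threshold stated in the text above


def is_on_time(scheduled: datetime, actual: datetime) -> bool:
    """Return True if arrival is no later than 15 minutes after schedule."""
    return actual - scheduled <= ON_TIME_TOLERANCE


scheduled = datetime(2024, 1, 15, 9, 0)
exactly_15_late = is_on_time(scheduled, datetime(2024, 1, 15, 9, 15))
sixteen_late = is_on_time(scheduled, datetime(2024, 1, 15, 9, 16))
print(exactly_15_late, sixteen_late)
```

An arrival exactly fifteen minutes behind schedule still counts as on-time, while sixteen minutes does not.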

In [15]:
# Train arrivals (total)
nyp_trn_arrivals = nyp.shape[0]

# Detraining totals
nyp_detrn = nyp[COLS["total_detrn"]].sum()
nyp_detrn_late = nyp[COLS["late_detrn"]].sum()
nyp_detrn_on_time = nyp_detrn - nyp_detrn_late

print(
    f"Train Arrivals: {nyp_trn_arrivals}",
    f"Total Detraining Customers: {nyp_detrn}",
    f"Late Detraining Customers: {nyp_detrn_late}",
    f"On-Time Detraining Customers: {nyp_detrn_on_time}",
    sep="\n",
)

nyp_stats = detrn.get_sum_stats(nyp, AGG["columns"], AGG["funcs"])
nyp_stats

Train Arrivals: 1719
Total Detraining Customers: 14571687
Late Detraining Customers: 2497718
On-Time Detraining Customers: 12073969


Unnamed: 0,Train Arrivals,Total Detraining Customers sum,Total Detraining Customers mean,Total Detraining Customers median,Total Detraining Customers std,Late Detraining Customers sum,Late Detraining Customers mean,Late Detraining Customers median,Late Detraining Customers std,Late to Total Detraining Customers Ratio,Late Detraining Customers Avg Min Late mean,Total On Time Detraining Customers sum
0,1719,14571687.0,8476.8394,7861.0,5390.6046,2497718.0,1453.0064,884.0,1724.1642,0.1714,,12073969.0


### 3.3 NYP: mean late arrival times summary statistics

Review the central tendency, dispersion, and shape for the mean late arrival times of NYP trains. Call the custom function named `frm.describe_numeric_column()` to return a dictionary of summary statistics.

In [16]:
# Drop missing values
nyp_avg_mm_late = nyp[COLS["late_detrn_avg_mm_late"]].dropna().reset_index(drop=True)

# Call the custom frm.describe_numeric_column() function
nyp_avg_mm_late_describe = frm.describe_numeric_column(nyp_avg_mm_late)
nyp_avg_mm_late_describe

{'type': pandas.core.series.Series,
 'name': 'Late Detraining Customers Avg Min Late',
 'values': {'non_null': np.int64(1579),
  'missing': np.int64(0),
  'dtype': dtype('float64')},
 'center': {'mean': np.float64(43.91703609879671),
  'median': 37.0,
  'mode': np.float64(28.0)},
 'position': {'min': 11.0,
  '25%': np.float64(27.0),
  '50%': np.float64(37.0),
  '75%': np.float64(51.0),
  'max': 620.0},
 'spread': {'variance': 993.9924789156773,
  'std': 31.52764626348877,
  'range': 609.0,
  'iqr': np.float64(24.0)},
 'shape': {'skewness': np.float64(6.254233955826336),
  'kurtosis': np.float64(82.83535199296716)}}

The skewness and kurtosis values returned suggest that the distribution of mean late arrival times of NYP trains is positively skewed and features a sharper peak and heavier right tail than a normal distribution. Let's confirm this visually by generating a histogram.
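
For reference, skewness and excess kurtosis are the standardized third and fourth central moments. The sketch below uses the population (biased) estimators with NumPy. `frm.describe_numeric_column()` may rely on a different, bias-corrected estimator, so its values need not match this calculation exactly. The sample is invented.

```python
import numpy as np


def moment_shape(values):
    """Population skewness and excess kurtosis from central moments."""
    x = np.asarray(values, dtype=float)
    dev = x - x.mean()
    m2 = np.mean(dev**2)  # population variance
    m3 = np.mean(dev**3)
    m4 = np.mean(dev**4)
    skewness = m3 / m2**1.5
    excess_kurtosis = m4 / m2**2 - 3.0  # a normal distribution scores 0
    return skewness, excess_kurtosis


# Invented sample with one extreme delay, loosely echoing the NYP range
skewness, excess_kurtosis = moment_shape([11, 27, 28, 28, 37, 51, 620])
print(f"skewness={skewness:.2f}, excess kurtosis={excess_kurtosis:.2f}")
```

A single extreme value is enough to push both measures well above zero, which is the pattern the NYP summary shows.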

### 3.4 NYP: visualize distribution of mean late arrival times

Visualize mean late arrival times for the entire period. The data is binned prior to plotting.

#### 3.4.1 Create chart data

In [17]:
# Convert to DataFrame
nyp_avg_mm_late = nyp_avg_mm_late.to_frame(name=COLS["avg_mm_late"])

# Get mean and standard deviation
mu = nyp_avg_mm_late_describe["center"]["mean"]
sigma = nyp_avg_mm_late_describe["spread"]["std"]

# Get max value (for x-axis ticks); pad max value for chart display
max_val = nyp_avg_mm_late_describe["position"]["max"]
max_val_ceil = (np.ceil(max_val / 10) * 10).astype(int)

# Create bins
nyp_min_late, bins, num_bins, bin_width = frm.create_bins(nyp_avg_mm_late, COLS["avg_mm_late"], 15)

# Bin the data
chrt_data = frm.bin_data(nyp_min_late, COLS["avg_mm_late"], bins)

# chrt_data
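
The third argument passed to `frm.create_bins()` (`15` here) is presumably the bin width in minutes; that reading is an assumption, as the function body is not shown. Fixed-width binning of this kind can be sketched with `numpy.histogram`, using invented values.

```python
import numpy as np

# Invented average-minutes-late values
values = np.array([11, 27, 28, 37, 51, 62], dtype=float)
bin_width = 15

# Edges run from 0 to the first multiple of bin_width at or above the maximum
upper = np.ceil(values.max() / bin_width) * bin_width
edges = np.arange(0, upper + bin_width, bin_width)

counts, _ = np.histogram(values, bins=edges)
bin_centers = (edges[:-1] + edges[1:]) / 2

for center, count in zip(bin_centers, counts):
    print(f"bin_center={center:5.1f}  count={count}")
```

Each bar in the histogram then corresponds to one `(bin_center, count)` pair.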

#### 3.4.2 Generate the histogram

In [18]:
# Chart title
title_txt = f"Late Detraining Passengers: {STNS['nyp']}"
title = ttl.format_title(nyp_stats, title_txt)

# Tooltips
tooltip_config = [
    {"shorthand": "bin_center:Q", "title": "Average Minutes Late", "format": None},
    {"shorthand": "count:Q", "title": "Late Arrivals Count", "format": None},
]

# Create and display the histogram
chart = hst.create_histogram(
    frame=chrt_data,
    x_shorthand="bin_center:Q",
    x_title="Average Minutes Late",
    y_shorthand="count:Q",
    y_title="Late Arrivals Count",
    y_stack=False,
    line_shorthand="Avg Min Late:Q",
    mu=mu,
    sigma=sigma,
    num_bins=num_bins,
    bin_width=bin_width,
    x_tick_count_max=max_val_ceil,
    bar_color=COLORS["amtk_blue"],
    mu_color=COLORS["amtk_red"],
    sigma_color=COLORS["anth_gray"],
    tooltip_config=tooltip_config,
    title=title,
)
chart.display()

### 3.5 NYP: on-time performance metrics (by fiscal year and quarter)

Compute on-time performance (OTP) summary statistics per fiscal year and quarter. Add quarterly
train arrival metrics to the `DataFrame` named `nyp_qtr_stats`.

In [19]:
nyp_qtr_stats = detrn.get_sum_stats_by_group(
    nyp,
    [COLS["year"], COLS["quarter"]],
    AGG["columns"],
    AGG["funcs"],
    nyp_trn_arrivals,
    nyp_detrn,
)
nyp_qtr_stats.sort_values(by=[COLS["year"], COLS["quarter"]], ascending=[True, True])

Unnamed: 0,Fiscal Year,Fiscal Quarter,Train Arrivals,Train Arrival Ratio,Detraining Ratio,Total Detraining Customers sum,Total Detraining Customers mean,Total Detraining Customers median,Total Detraining Customers std,Late Detraining Customers sum,Late Detraining Customers mean,Late Detraining Customers median,Late Detraining Customers std,Late to Total Detraining Customers Ratio,Late Detraining Customers Avg Min Late mean,Total On Time Detraining Customers sum
0,2021,4,103,0.059919,0.048781,710825,6901.2136,6243.0,4148.492,94408,916.5825,647.0,1147.1338,0.1328,46.0588,616417
1,2022,1,129,0.075044,0.069116,1007131,7807.2171,7300.0,4996.9533,133399,1034.1008,748.0,1000.3398,0.1325,39.7913,873732
2,2022,2,126,0.073298,0.050477,735530,5837.5397,5627.0,3910.4336,93814,744.5556,536.0,804.6166,0.1275,47.4554,641716
3,2022,3,130,0.075625,0.07959,1159760,8921.2308,8654.0,5247.1844,161258,1240.4462,724.5,1546.5171,0.139,43.4138,998502
4,2022,4,134,0.077952,0.086093,1254516,9362.0597,9357.5,5526.6458,266192,1986.5075,1233.5,2066.1894,0.2122,39.8485,988324
5,2023,1,146,0.084933,0.089987,1311263,8981.2534,8448.5,5901.8054,226136,1548.8767,1061.5,1783.114,0.1725,47.0074,1085127
6,2023,2,138,0.080279,0.074579,1086736,7874.8986,7700.0,4474.3212,117626,852.3623,489.5,989.5623,0.1082,44.1557,969110
7,2023,3,162,0.094241,0.093413,1361186,8402.3827,7659.0,5519.8425,265248,1637.3333,1124.0,1862.6475,0.1949,40.3642,1095938
8,2023,4,154,0.089587,0.103743,1511712,9816.3117,9211.0,5623.7868,336474,2184.8961,1387.0,2173.3855,0.2226,47.2039,1175238
9,2024,1,158,0.091914,0.104808,1527234,9666.038,8891.0,6194.7703,267687,1694.2215,1120.5,1769.6038,0.1753,41.7838,1259547
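
The `Train Arrival Ratio` and `Detraining Ratio` values are consistent with each quarter's share of the station totals (for example, 103 / 1719 ≈ 0.0599 for fiscal year 2021, quarter 4). A minimal sketch of that calculation follows. It uses invented rows and hypothetical output column names, and is not the body of `detrn.get_sum_stats_by_group()`.

```python
import pandas as pd

# Invented rows: one row per train per fiscal quarter
sample = pd.DataFrame(
    {
        "Fiscal Year": [2023, 2023, 2023, 2024],
        "Fiscal Quarter": [1, 1, 2, 1],
        "Total Detraining Customers": [100, 300, 200, 400],
    }
)

qtr = sample.groupby(["Fiscal Year", "Fiscal Quarter"], as_index=False).agg(
    train_arrivals=("Total Detraining Customers", "count"),
    detrn_sum=("Total Detraining Customers", "sum"),
)

# Each quarter's share of the station totals
qtr["train_arrival_ratio"] = qtr["train_arrivals"] / len(sample)
qtr["detraining_ratio"] = qtr["detrn_sum"] / sample["Total Detraining Customers"].sum()
print(qtr)
```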


#### 3.5.1 Write to file [1 pt]

Write `nyp_qtr_stats` to a CSV file named `stu-amtk-nyp_qtr_stats.csv`. Store the file in the
`data/student` directory. Then compare it to the accompanying `fxt-amtk-nyp_qtr_stats.csv` file. It
must match line for line, character for character.

In [20]:
filepath = parent_path.joinpath("data", "student", "stu-amtk-nyp_qtr_stats.csv")
nyp_qtr_stats.to_csv(filepath, index=True)
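
A strict line-for-line comparison can be sketched with the standard library alone. The helper `first_mismatch` is hypothetical, and the example writes two small temporary files rather than touching the real student and fixture files. Reading bytes keeps differences in line endings visible.

```python
import pathlib
import tempfile


def first_mismatch(path_a, path_b):
    """Return the 1-based number of the first differing line, or None."""
    lines_a = pathlib.Path(path_a).read_bytes().splitlines(keepends=True)
    lines_b = pathlib.Path(path_b).read_bytes().splitlines(keepends=True)
    for number, (a, b) in enumerate(zip(lines_a, lines_b), start=1):
        if a != b:
            return number
    if len(lines_a) != len(lines_b):
        return min(len(lines_a), len(lines_b)) + 1  # one file has extra lines
    return None


with tempfile.TemporaryDirectory() as tmp:
    student = pathlib.Path(tmp, "student.csv")
    fixture = pathlib.Path(tmp, "fixture.csv")
    student.write_bytes(b"a,b\n1,2\n3,4\n")
    fixture.write_bytes(b"a,b\n1,2\n3,5\n")
    mismatch = first_mismatch(student, fixture)
    same = first_mismatch(student, student)

print(mismatch, same)
```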

In [21]:
#hidden tests are within this cell

### 3.6 NYP: visualize detraining passengers

Visualize NYP detraining passengers, both on-time and late, across all years and quarters with a
bar chart.

In [22]:
# Assemble the data for the chart
chrt_data = vis_bar.create_detrain_chart_frame(nyp_qtr_stats, CHRT_BAR["columns"])

# Get station code, station name, city, and state to use in the chart title
text = frm.drop_dups_and_squeeze(
    nyp, [COLS["station_code"], COLS["station"], COLS["city"], COLS["state"]]
)

# Chart title
title_txt = (
    f"Detraining Passengers: {text['Arrival Station']} ({text['Arrival Station Code']}), "
    f"{text['City']}, {text['State']}"
)
title = ttl.format_title(nyp_stats, title_txt)

# Create and display grouped bar chart
chart = vis_bar.create_grouped_bar_chart(
    chrt_data,
    "Fiscal Period:N",
    "Passengers:Q",
    "Arrival Status:N",
    CHRT_BAR["xoffset_sort"],
    CHRT_BAR["colors"],
    title,
)

chart.display()

### 3.7 NYP: on-time performance metrics by service line

Group train arrivals by service line.

In [23]:
nyp_svc_trns = nyp.groupby(COLS["svc_line"]).size().reset_index()  # Includes rows with NaN
nyp_svc_trns.columns = [COLS["svc_line"], COLS["trn_arrivals"]]
nyp_svc_trns.sort_values(by=COLS["trn_arrivals"], ascending=False, inplace=True)
nyp_svc_trns.reset_index(drop=True, inplace=True)

# Add train arrival ratios (service line/total)
nyp_svc_trns.loc[:, COLS["trn_arrival_ratio"]] = (
    nyp_svc_trns[COLS["trn_arrivals"]] / nyp_trn_arrivals
)
nyp_svc_trns

Unnamed: 0,Service Line,Train Arrivals,Train Arrival Ratio
0,Northeast Corridor,1207,0.702152
1,State Supported,448,0.260617
2,Long Distance,64,0.037231
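
Counts and shares per category can also be obtained with `Series.value_counts()`, which sorts by frequency in descending order by default. This is an alternative sketch with invented labels, not a replacement for the cell above.

```python
import pandas as pd

# Invented service line labels, one per train arrival row
svc_line = pd.Series(
    ["Northeast Corridor"] * 7 + ["State Supported"] * 2 + ["Long Distance"],
    name="Service Line",
)

arrivals = svc_line.value_counts()  # sorted by count, descending
ratios = svc_line.value_counts(normalize=True)  # shares of the total

summary = pd.DataFrame({"Train Arrivals": arrivals, "Train Arrival Ratio": ratios})
print(summary)
```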


#### 3.7.1 NYP: compute on-time performance metrics by service line. [1 pt]

In [24]:
# Get summary stats by COLS["svc_line"]
nyp_svc_line_stats = detrn.get_sum_stats_by_group(
    nyp, COLS["svc_line"], AGG["columns"], AGG["funcs"]
)

# Merge train arrivals by service line
nyp_svc_line_stats = nyp_svc_line_stats.merge(nyp_svc_trns, on=COLS["svc_line"], how="inner")

# Move train arrival columns
cols = nyp_svc_line_stats.columns.tolist()
cols = [cols[0]] + cols[-2:] + cols[1:-2]
nyp_svc_line_stats = nyp_svc_line_stats[cols]

# Add service line detraining ratios
nyp_svc_line_stats.loc[:, "Service Line Detraining Ratio"] = (
    nyp_svc_line_stats["Total Detraining Customers sum"] / nyp_detrn
)

# Move service line detraining ratio column
nyp_svc_line_stats.insert(
    3, "Service Line Detraining Ratio", nyp_svc_line_stats.pop("Service Line Detraining Ratio")
)

# Sort by passengers detrained (descending order)
nyp_svc_line_stats.sort_values(by="Total Detraining Customers sum", ascending=False, inplace=True)

# Reset index
nyp_svc_line_stats.reset_index(drop=True, inplace=True)
nyp_svc_line_stats

Unnamed: 0,Service Line,Train Arrivals_y,Train Arrival Ratio,Service Line Detraining Ratio,Train Arrivals_x,Total Detraining Customers sum,Total Detraining Customers mean,Total Detraining Customers median,Total Detraining Customers std,Late Detraining Customers sum,Late Detraining Customers mean,Late Detraining Customers median,Late Detraining Customers std,Late to Total Detraining Customers Ratio,Late Detraining Customers Avg Min Late mean,Total On Time Detraining Customers sum
0,Northeast Corridor,1207,0.702152,0.651362,1207,9491436,7863.6587,7326.0,5113.7391,1706848,1414.1243,907.0,1699.6482,0.1798,41.6775,7784588
1,State Supported,448,0.260617,0.317086,448,4620479,10313.5692,9945.0,5816.6205,617090,1377.433,790.5,1685.7938,0.1336,42.8652,4003389
2,Long Distance,64,0.037231,0.031552,64,459772,7183.9375,6890.0,4345.8832,173780,2715.3125,2072.5,1978.382,0.378,89.3594,285992
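
The `Train Arrivals_x` and `Train Arrivals_y` headers in the output above are what `DataFrame.merge()` produces when both frames carry a non-key column with the same name. The sketch below reproduces the effect with invented frames and shows one way to avoid it. Whether the graded fixture expects the suffixed names cannot be determined from this notebook, so treat the second variant as an illustration only.

```python
import pandas as pd

stats = pd.DataFrame(
    {"Service Line": ["A", "B"], "Train Arrivals": [3, 1], "Late Ratio": [0.2, 0.4]}
)
arrivals = pd.DataFrame(
    {
        "Service Line": ["A", "B"],
        "Train Arrivals": [3, 1],
        "Train Arrival Ratio": [0.75, 0.25],
    }
)

# Overlapping non-key column: pandas appends the default suffixes _x and _y
suffixed = stats.merge(arrivals, on="Service Line", how="inner")

# Dropping the duplicate from one side first keeps a single, unsuffixed column
clean = stats.merge(
    arrivals.drop(columns="Train Arrivals"), on="Service Line", how="inner"
)

print(suffixed.columns.tolist())
print(clean.columns.tolist())
```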


In [None]:
#hidden tests are within this cell

#### 3.7.2 NYP: visualize distribution of mean late arrival times

Illustrate with box plots.

In [25]:
nyp_svc_lines = nyp.groupby(COLS["svc_line"])[[COLS["svc_line"], COLS["late_detrn_avg_mm_late"]]]
chrt_data = nyp_svc_lines.apply(lambda x: x).reset_index(drop=True)  # Flatten for Altair
chrt_data.head()

Unnamed: 0,Service Line,Late Detraining Customers Avg Min Late
0,Long Distance,103.0
1,Long Distance,82.0
2,Long Distance,40.0
3,Long Distance,75.0
4,Long Distance,110.0
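
A box plot summarizes each group by its quartiles. As a numeric companion to the chart, the sketch below computes the quartiles and interquartile range per service line with pandas. The delay values are invented; only the column names come from the chart data.

```python
import pandas as pd

late_col = "Late Detraining Customers Avg Min Late"

# Invented delays; column names match the chart data above
sample = pd.DataFrame(
    {
        "Service Line": ["Long Distance"] * 4 + ["State Supported"] * 4,
        late_col: [40.0, 80.0, 100.0, 120.0, 20.0, 30.0, 40.0, 50.0],
    }
)

quartiles = (
    sample.groupby("Service Line")[late_col]
    .quantile([0.25, 0.5, 0.75])
    .unstack()  # one row per service line, one column per quantile
)
quartiles["iqr"] = quartiles[0.75] - quartiles[0.25]
print(quartiles)
```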


In [26]:
title_txt = (
    f"Detraining Passengers: {text['Arrival Station']} ({text['Arrival Station Code']}), "
    f"{text['City']}, {text['State']}"
)
title = ttl.format_title(nyp_stats, title_txt)

# Create and display the box plots
chart = box.create_box_plot(
    chrt_data,
    "Late Detraining Customers Avg Min Late:Q",
    "Average Minutes Late",
    "Service Line:N",
    COLS["svc_line"],
    CHRT_BOX["y_axis"]["sort"],
    CHRT_BOX["colors"],
    title,
    CHRT_BOX["padding"],
)

chart.display()

### 3.8 Chicago Union Station (CHI), Chicago, IL

[Chicago Union Station](https://www.amtrak.com/stations/chi) ([CHI](https://www.amtrak.com/stations/chi)) is a key node in the Amtrak
network, supporting both regional services in the Midwest and long distance routes.

In [27]:
chi = ntwk.by_station(stations, "CHI")
chi.shape

(334, 24)

### 3.9 CHI: on-time performance metrics (entire period)

CHI station performance data is a compilation of quarterly metrics that focus on late
detraining passengers. Detraining passengers are considered on-time if they arrive at their
destination no later than fifteen (`15`) minutes after their scheduled arrival time. All other
detraining passengers are considered late.

In [28]:
# Train arrivals (total)
chi_trn_arrivals = chi.shape[0]

# Detraining totals
chi_detrn = chi[COLS["total_detrn"]].sum()
chi_detrn_late = chi[COLS["late_detrn"]].sum()
chi_detrn_on_time = chi_detrn - chi_detrn_late

print(
    f"Train Arrivals: {chi_trn_arrivals}",
    f"Total Detraining Customers: {chi_detrn}",
    f"Late Detraining Customers: {chi_detrn_late}",
    f"On-Time Detraining Customers: {chi_detrn_on_time}",
    sep="\n",
)

chi_stats = detrn.get_sum_stats(chi, AGG["columns"], AGG["funcs"])
chi_stats

Train Arrivals: 334
Total Detraining Customers: 3728162
Late Detraining Customers: 1068254
On-Time Detraining Customers: 2659908


Unnamed: 0,Train Arrivals,Total Detraining Customers sum,Total Detraining Customers mean,Total Detraining Customers median,Total Detraining Customers std,Late Detraining Customers sum,Late Detraining Customers mean,Late Detraining Customers median,Late Detraining Customers std,Late to Total Detraining Customers Ratio,Late Detraining Customers Avg Min Late mean,Total On Time Detraining Customers sum
0,334,3728162.0,11162.1617,11178.5,5128.1625,1068254.0,3198.3653,2343.0,2996.0464,0.2865,,2659908.0


### 3.10 CHI: mean late arrival times summary statistics

Review the central tendency, dispersion, and shape for the mean late arrival times of CHI trains. Call the custom function named `frm.describe_numeric_column()` to return a dictionary of summary statistics.

In [32]:
# Drop missing values
chi_avg_mm_late = chi[COLS["late_detrn_avg_mm_late"]].dropna().reset_index(drop=True)

# Call the custom frm.describe_numeric_column() function again
chi_avg_mm_late_describe = frm.describe_numeric_column(chi_avg_mm_late)
chi_avg_mm_late_describe

{'type': pandas.core.series.Series,
 'name': 'Late Detraining Customers Avg Min Late',
 'values': {'non_null': np.int64(329),
  'missing': np.int64(0),
  'dtype': dtype('float64')},
 'center': {'mean': np.float64(70.48936170212765),
  'median': 49.0,
  'mode': np.float64(36.0)},
 'position': {'min': 17.0,
  '25%': np.float64(37.0),
  '50%': np.float64(49.0),
  '75%': np.float64(75.0),
  'max': 516.0},
 'spread': {'variance': 4145.0128437986505,
  'std': 64.38177415851982,
  'range': 499.0,
  'iqr': np.float64(38.0)},
 'shape': {'skewness': np.float64(3.6196602828263766),
  'kurtosis': np.float64(17.251866900544588)}}

The skewness and kurtosis values returned suggest that the distribution of mean late arrival times of CHI trains is positively skewed and features a sharper peak and heavier right tail than a normal distribution. Let's confirm this visually by generating a histogram.

### 3.11 CHI: visualize distribution of mean late arrival times

Visualize mean late arrival times for the entire period. The data is binned prior to plotting.

#### 3.11.1 Create chart data [1 pt]

In [33]:
# Convert to DataFrame
chi_avg_mm_late = chi_avg_mm_late.to_frame(name=COLS["avg_mm_late"])

# Get mean and standard deviation
mu = chi_avg_mm_late_describe["center"]["mean"]
sigma = chi_avg_mm_late_describe["spread"]["std"]

# Get max value (for x-axis ticks); pad max value for chart display
max_val = chi_avg_mm_late_describe["position"]["max"]
max_val_ceil = (np.ceil(max_val / 10) * 10).astype(int)

# Create bins
chi_min_late, bins, num_bins, bin_width = frm.create_bins(chi_avg_mm_late, COLS["avg_mm_late"], 10)

# Bin the data
chrt_data = frm.bin_data(chi_min_late, COLS["avg_mm_late"], bins)
# chrt_data

In [None]:
#hidden tests are within this cell

#### 3.11.2 Generate the histogram

In [34]:
# Chart title
title_txt = f"Late Detraining Passengers: {STNS['chi']}"
title = ttl.format_title(chi_stats, title_txt)

# Tooltips
tooltip_config = [
    {"shorthand": "bin_center:Q", "title": "Average Minutes Late", "format": None},
    {"shorthand": "count:Q", "title": "Late Arrivals Count", "format": None},
]

# Create and display the histogram
chart = hst.create_histogram(
    frame=chrt_data,
    x_shorthand="bin_center:Q",
    x_title="Average Minutes Late",
    y_shorthand="count:Q",
    y_title="Late Arrivals Count",
    y_stack=False,
    line_shorthand="Avg Min Late:Q",
    mu=mu,
    sigma=sigma,
    num_bins=num_bins,
    bin_width=bin_width,
    x_tick_count_max=max_val_ceil,
    bar_color=COLORS["amtk_blue"],
    mu_color=COLORS["amtk_red"],
    sigma_color=COLORS["anth_gray"],
    tooltip_config=tooltip_config,
    title=title,
)
chart.display()

### 3.12 CHI: on-time performance metrics (by fiscal year and quarter) [1 pt]

Compute OTP summary statistics per fiscal year and quarter. Add quarterly train arrival metrics to
the `DataFrame` named `chi_qtr_stats`.

In [35]:
chi_qtr_stats = detrn.get_sum_stats_by_group(
    chi, [COLS["year"], COLS["quarter"]], AGG["columns"], AGG["funcs"], chi_trn_arrivals, chi_detrn
)
chi_qtr_stats.sort_values(by=[COLS["year"], COLS["quarter"]], ascending=[True, True])

Unnamed: 0,Fiscal Year,Fiscal Quarter,Train Arrivals,Train Arrival Ratio,Detraining Ratio,Total Detraining Customers sum,Total Detraining Customers mean,Total Detraining Customers median,Total Detraining Customers std,Late Detraining Customers sum,Late Detraining Customers mean,Late Detraining Customers median,Late Detraining Customers std,Late to Total Detraining Customers Ratio,Late Detraining Customers Avg Min Late mean,Total On Time Detraining Customers sum
0,2021,4,13,0.038922,0.029866,111344,8564.9231,8783.0,3040.1924,10687,822.0769,391.0,954.2883,0.096,55.0769,100657
1,2022,1,28,0.083832,0.079966,298127,10647.3929,10743.0,3874.3448,85085,3038.75,2406.5,2772.3757,0.2854,56.4286,213042
2,2022,2,31,0.092814,0.055993,208750,6733.871,7099.0,4129.6256,71142,2294.9032,1750.0,2131.3324,0.3408,68.6667,137608
3,2022,3,28,0.083832,0.087466,326087,11645.9643,11799.0,4485.1134,118808,4243.1429,2963.5,4076.5326,0.3643,64.8519,207279
4,2022,4,28,0.083832,0.096348,359202,12828.6429,13513.5,5632.3036,122908,4389.5714,3151.0,4049.1612,0.3422,72.0741,236294
5,2023,1,29,0.086826,0.088513,329989,11378.931,10888.0,5523.3084,92802,3200.069,2318.0,2625.2786,0.2812,82.1379,237187
6,2023,2,31,0.092814,0.07347,273908,8835.7419,9530.0,4431.7537,70810,2284.1935,1827.0,1978.7552,0.2585,99.5333,203098
7,2023,3,28,0.083832,0.097224,362468,12945.2857,13578.0,4478.6503,109215,3900.5357,2541.5,3641.166,0.3013,60.75,253253
8,2023,4,28,0.083832,0.104405,389237,13901.3214,14293.5,5139.271,125618,4486.3571,2867.0,3852.7373,0.3227,64.1071,263619
9,2024,1,29,0.086826,0.102328,381496,13155.0345,12472.0,5345.1562,85648,2953.3793,2563.0,1976.6944,0.2245,81.1034,295848


In [None]:
#hidden tests are within this cell

#### 3.12.1 Write to file [1 pt]

Write `chi_qtr_stats` to a CSV file named `stu-amtk-chi_qtr_stats.csv`. Store the file in the
`data/student` directory. Then compare it to the accompanying `fxt-amtk-chi_qtr_stats.csv` file. It
must match line for line, character for character.

In [36]:
filepath = parent_path.joinpath("data", "student", "stu-amtk-chi_qtr_stats.csv")
chi_qtr_stats.to_csv(filepath, index=True)

In [None]:
#hidden tests are within this cell

### 3.13 CHI: visualize detraining passengers

Visualize CHI detraining passengers, both on-time and late, across all years and quarters with a
bar chart.

In [37]:
# Assemble the data for the chart
chrt_data = vis_bar.create_detrain_chart_frame(chi_qtr_stats, CHRT_BAR["columns"])

# Get station code, station name, city, and state to use in the chart title
text = frm.drop_dups_and_squeeze(
    chi, [COLS["station_code"], COLS["station"], COLS["city"], COLS["state"]]
)

# Chart title
title_txt = (
    f"Detraining Passengers: {text['Arrival Station']} ({text['Arrival Station Code']}), "
    f"{text['City']}, {text['State']}"
)
title = ttl.format_title(chi_stats, title_txt)

# Create and display grouped bar chart
chart = vis_bar.create_grouped_bar_chart(
    chrt_data,
    "Fiscal Period:N",
    "Passengers:Q",
    "Arrival Status:N",
    CHRT_BAR["xoffset_sort"],
    CHRT_BAR["colors"],
    title,
)

chart.display()
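
A grouped bar chart encoded as `Fiscal Period:N`, `Passengers:Q`, and `Arrival Status:N` needs the data in long form, with one row per fiscal period and arrival status. The internals of `vis_bar.create_detrain_chart_frame()` are not shown in this notebook, so the following is only a sketch of how such a reshape might look using `DataFrame.melt()`. The two rows are the CHI 2022 Q4 and 2023 Q1 totals from the table above; the `2022Q4` label format and the `Late`/`On-time` labels are assumptions.

```python
import pandas as pd

# Two quarterly rows copied from the CHI quarterly statistics (wide form)
wide = pd.DataFrame(
    {
        "Fiscal Year": [2022, 2023],
        "Fiscal Quarter": [4, 1],
        "Late Detraining Customers sum": [122908, 92802],
        "Total On Time Detraining Customers sum": [236294, 237187],
    }
)

# One label per fiscal period (label format is an assumption)
wide["Fiscal Period"] = (
    wide["Fiscal Year"].astype(str) + "Q" + wide["Fiscal Quarter"].astype(str)
)

# Reshape to long form: one row per fiscal period and arrival status
long = wide.melt(
    id_vars="Fiscal Period",
    value_vars=["Late Detraining Customers sum", "Total On Time Detraining Customers sum"],
    var_name="Arrival Status",
    value_name="Passengers",
)

# Replace the source column names with short legend labels (labels are assumptions)
long["Arrival Status"] = long["Arrival Status"].map(
    {
        "Late Detraining Customers sum": "Late",
        "Total On Time Detraining Customers sum": "On-time",
    }
)
```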

### 3.14 CHI: on-time performance metrics by service line [1 pt]

Group train arrivals by service line.

In [40]:
chi_svc_trns = chi.groupby(COLS["svc_line"]).size().reset_index()  # Counts rows with NaN values too
chi_svc_trns.columns = [COLS["svc_line"], COLS["trn_arrivals"]]
chi_svc_trns.sort_values(by=COLS["trn_arrivals"], ascending=False, inplace=True)
chi_svc_trns.reset_index(drop=True, inplace=True)

# Add train arrival ratios (service line/total)
chi_svc_trns.loc[:, COLS["trn_arrival_ratio"]] = (
    chi_svc_trns[COLS["trn_arrivals"]] / chi_trn_arrivals
)
chi_svc_trns

Unnamed: 0,Service Line,Train Arrivals,Train Arrival Ratio
0,State Supported,233,0.697605
1,Long Distance,101,0.302395


In [None]:
#hidden tests are within this cell
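
Counting train arrivals with `size()` rather than `count()` matters when a column contains missing values. The toy frame below (values are illustrative, not drawn from the dataset) shows the difference: `size()` counts every row in a group, while `count()` skips rows where the selected column is `NaN`.

```python
import numpy as np
import pandas as pd

# Illustrative rows; one State Supported arrival has no average-minutes-late value
frame = pd.DataFrame(
    {
        "Service Line": ["Long Distance", "State Supported", "State Supported"],
        "Late Detraining Customers Avg Min Late": [137.0, np.nan, 42.0],
    }
)

grouped = frame.groupby("Service Line")

# size() counts all rows per group, including rows holding NaN values
sizes = grouped.size()

# count() counts only the non-missing values in the selected column
counts = grouped["Late Detraining Customers Avg Min Late"].count()

# Share of all train arrivals attributable to each service line
ratios = sizes / sizes.sum()
```

Applied to the CHI totals above, the same ratio works out to 233 / 334 ≈ 0.697605 for State Supported trains.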

#### 3.14.1 CHI: compute on-time performance metrics by service line

In [41]:
# Get summary stats by COLS["svc_line"]
chi_svc_line_stats = detrn.get_sum_stats_by_group(
    chi, COLS["svc_line"], AGG["columns"], AGG["funcs"]
)

# Merge train arrivals by service line
chi_svc_line_stats = chi_svc_line_stats.merge(chi_svc_trns, on=COLS["svc_line"], how="inner")

# Move train arrival columns
cols = chi_svc_line_stats.columns.tolist()
cols = [cols[0]] + cols[-2:] + cols[1:-2]
chi_svc_line_stats = chi_svc_line_stats[cols]

# Add service line detraining ratios
chi_svc_line_stats.loc[:, "Service Line Detraining Ratio"] = (
    chi_svc_line_stats["Total Detraining Customers sum"] / chi_detrn
)

# Move service line detraining ratio column
chi_svc_line_stats.insert(
    3, "Service Line Detraining Ratio", chi_svc_line_stats.pop("Service Line Detraining Ratio")
)

# Sort by passengers detrained (descending order)
chi_svc_line_stats.sort_values(by="Total Detraining Customers sum", ascending=False, inplace=True)

# Reset index
chi_svc_line_stats.reset_index(drop=True, inplace=True)
chi_svc_line_stats

Unnamed: 0,Service Line,Train Arrivals_y,Train Arrival Ratio,Service Line Detraining Ratio,Train Arrivals_x,Total Detraining Customers sum,Total Detraining Customers mean,Total Detraining Customers median,Total Detraining Customers std,Late Detraining Customers sum,Late Detraining Customers mean,Late Detraining Customers median,Late Detraining Customers std,Late to Total Detraining Customers Ratio,Late Detraining Customers Avg Min Late mean,Total On Time Detraining Customers sum
0,State Supported,233,0.697605,0.718483,233,2678621,11496.2275,11074.0,4863.7,552773,2372.4163,1899.0,2133.4779,0.2064,47.3609,2125848
1,Long Distance,101,0.302395,0.281517,101,1049541,10391.495,11940.0,5641.9565,515481,5103.7723,4467.0,3751.6213,0.4911,124.2222,534060
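
The `Train Arrivals_x` and `Train Arrivals_y` columns in the output above are consistent with how `DataFrame.merge()` handles a non-key column that exists in both frames: pandas appends the default `_x` and `_y` suffixes. The sketch below reproduces the effect with two small frames built from the CHI figures, then shows one possible way to avoid the duplicate by dropping the shared column before merging. Whether the notebook should do so is a judgment call, since the expected output may depend on the suffixed names.

```python
import pandas as pd

# Left frame: summary stats that already carry a "Train Arrivals" column
left = pd.DataFrame(
    {
        "Service Line": ["State Supported", "Long Distance"],
        "Train Arrivals": [233, 101],
        "Total Detraining Customers sum": [2678621, 1049541],
    }
)

# Right frame: train arrivals and ratios by service line
right = pd.DataFrame(
    {
        "Service Line": ["State Supported", "Long Distance"],
        "Train Arrivals": [233, 101],
        "Train Arrival Ratio": [0.697605, 0.302395],
    }
)

# Shared non-key column -> pandas adds "_x" (left) and "_y" (right) suffixes
merged = left.merge(right, on="Service Line", how="inner")

# Dropping the shared column from one side keeps a single "Train Arrivals" column
deduped = left.merge(right.drop(columns="Train Arrivals"), on="Service Line", how="inner")
```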


#### 3.14.2 CHI: visualize distribution of mean late arrival times

Illustrate with box plots.

In [42]:
chi_svc_lines = chi.groupby(COLS["svc_line"])[[COLS["svc_line"], COLS["late_detrn_avg_mm_late"]]]
chrt_data = chi_svc_lines.apply(lambda x: x).reset_index(drop=True)  # Flatten for Altair
chrt_data.head()

Unnamed: 0,Service Line,Late Detraining Customers Avg Min Late
0,Long Distance,137.0
1,Long Distance,75.0
2,Long Distance,123.0
3,Long Distance,145.0
4,Long Distance,114.0


In [43]:
# Chart title
title_txt = (
    f"Late Detraining Passengers: {text['Arrival Station']} ({text['Arrival Station Code']}), "
    f"{text['City']}, {text['State']}"
)
title = ttl.format_title(chi_stats, title_txt)

# Create and display the box plots
chart = box.create_box_plot(
    chrt_data,
    "Late Detraining Customers Avg Min Late:Q",
    "Average Minutes Late",
    "Service Line:N",
    COLS["svc_line"],
    CHRT_BOX["y_axis"]["sort"],
    CHRT_BOX["colors"],
    title,
    CHRT_BOX["padding"],
)

chart.display()

### 3.15 Los Angeles Union Station (LAX), Los Angeles, CA

[Los Angeles Union Station](https://www.amtrak.com/stations/lax) ([LAX](https://www.amtrak.com/stations/lax)) serves the West Coast with
connections to Amtrak's long-distance routes.

In [26]:
lax = ntwk.by_station(stations, "LAX")
lax.shape

(220, 24)

### 3.16 LAX: on-time performance metrics (entire period) [1 pt]

LAX station performance data is a compilation of quarterly metrics that focus on late
detraining passengers. Detraining passengers are considered on-time if they arrive at their
destination no later than fifteen (`15`) minutes after their scheduled arrival time. All other
detraining passengers are considered late.
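
The fifteen-minute rule can be expressed as a simple threshold check. The helper below is a hypothetical illustration (the dataset already supplies late and total counts, so the notebook never classifies individual passengers); it makes the boundary case explicit: an arrival exactly 15 minutes after schedule still counts as on-time.

```python
ON_TIME_THRESHOLD_MIN = 15  # minutes after the scheduled arrival time


def arrival_status(minutes_after_schedule: float) -> str:
    """Classify an arrival as 'On-time' or 'Late' using the 15-minute rule."""
    if minutes_after_schedule > ON_TIME_THRESHOLD_MIN:
        return "Late"
    return "On-time"


# Boundary check: 15 minutes is on-time, 16 minutes is late
statuses = [arrival_status(minutes) for minutes in (0, 15, 16)]
```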

In [27]:
# Train arrivals (total)
lax_trn_arrivals = lax.shape[0]

# Detraining totals
lax_detrn = lax[COLS["total_detrn"]].sum()
lax_detrn_late = lax[COLS["late_detrn"]].sum()
lax_detrn_on_time = lax_detrn - lax_detrn_late

print(
    f"Train Arrivals: {lax_trn_arrivals}",
    f"Total Detraining Customers: {lax_detrn}",
    f"Late Detraining Customers: {lax_detrn_late}",
    f"On-Time Detraining Customers: {lax_detrn_on_time}",
    sep="\n",
)

lax_stats = detrn.get_sum_stats(lax, AGG["columns"], AGG["funcs"])
lax_stats

Train Arrivals: 220
Total Detraining Customers: 1383151
Late Detraining Customers: 298156
On-Time Detraining Customers: 1084995


Unnamed: 0,Train Arrivals,Total Detraining Customers sum,Total Detraining Customers mean,Total Detraining Customers median,Total Detraining Customers std,Late Detraining Customers sum,Late Detraining Customers mean,Late Detraining Customers median,Late Detraining Customers std,Late to Total Detraining Customers Ratio,Late Detraining Customers Avg Min Late mean,Total On Time Detraining Customers sum
0,220,1383151.0,6287.05,5750.0,3558.478,298156.0,1355.2545,892.0,1460.4418,0.2156,,1084995.0


In [None]:
#hidden tests are within this cell
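
As a quick sanity check, the headline LAX figures can be reproduced from the printed totals with plain arithmetic. The values below are copied from the output above; only the variable names are new.

```python
# Totals copied from the printed LAX output
train_arrivals = 220
total_detrn = 1_383_151
late_detrn = 298_156

# On-time passengers are the remainder after removing late passengers
on_time_detrn = total_detrn - late_detrn

# Share of detraining passengers who arrived late, rounded to four places
late_to_total_ratio = round(late_detrn / total_detrn, 4)

# Mean detraining passengers per train arrival
mean_detrn = total_detrn / train_arrivals
```

The results agree with the summary table: 1,084,995 on-time passengers, a late-to-total ratio of 0.2156, and a mean of 6,287.05 detraining passengers per train arrival.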

### 3.17 LAX: mean late arrival times summary statistics

Review the central tendency, dispersion, and shape for the mean late arrival times of LAX trains. Call the custom function named `frm.describe_numeric_column()` to return a dictionary of summary statistics.

In [49]:
# Drop missing values
lax_avg_min_late = lax[COLS["late_detrn_avg_mm_late"]].dropna().reset_index(drop=True)

# Call the custom frm.describe_numeric_column() function again
lax_avg_min_late_describe = frm.describe_numeric_column(lax_avg_min_late)
lax_avg_min_late_describe


{'type': pandas.core.series.Series,
 'name': 'Late Detraining Customers Avg Min Late',
 'values': {'non_null': np.int64(213),
  'missing': np.int64(0),
  'dtype': dtype('float64')},
 'center': {'mean': np.float64(63.6150234741784),
  'median': 45.0,
  'mode': np.float64(34.0)},
 'position': {'min': 16.0,
  '25%': np.float64(34.0),
  '50%': np.float64(45.0),
  '75%': np.float64(65.0),
  'max': 383.0},
 'spread': {'variance': 2711.992603419258,
  'std': 52.076795249124714,
  'range': 367.0,
  'iqr': np.float64(31.0)},
 'shape': {'skewness': np.float64(2.536460931553542),
  'kurtosis': np.float64(8.204368446736076)}}
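
The implementation of `frm.describe_numeric_column()` is not shown here. As a rough sketch of how comparable statistics could be derived with pandas alone, the hypothetical `describe_series()` function below groups measures of center, position, spread, and shape in a similar dictionary layout. The five sample values are the minimum, quartiles, and maximum reported above, used purely as toy input; they are not the full LAX series.

```python
import pandas as pd


def describe_series(values: pd.Series) -> dict:
    """Return center, position, spread, and shape statistics for a numeric Series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    return {
        "center": {"mean": values.mean(), "median": values.median()},
        "position": {"min": values.min(), "25%": q1, "75%": q3, "max": values.max()},
        "spread": {
            "std": values.std(),  # sample standard deviation (ddof=1)
            "range": values.max() - values.min(),
            "iqr": q3 - q1,
        },
        "shape": {
            "skewness": values.skew(),
            "kurtosis": values.kurt(),  # excess kurtosis (normal distribution = 0)
        },
    }


# Toy input: the five position values reported for LAX
sample = pd.Series([16.0, 34.0, 45.0, 65.0, 383.0])
summary = describe_series(sample)
```

A large gap between mean and median, together with positive skewness, signals a long right tail: a few very late trains pull the mean upward.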

### 3.18 LAX: visualize distribution of mean late arrival times

Visualize mean late arrival times for the entire period. The data is binned prior to plotting.
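
To illustrate what binning involves, the sketch below groups a handful of made-up values into 10-minute bins with `numpy.histogram()` and builds a frame with `bin_center` and `count` columns, matching the column names the histogram expects. It is a generic approach under those assumptions and may differ from what `frm.create_bins()` and `frm.bin_data()` do internally.

```python
import numpy as np
import pandas as pd

# Illustrative average-minutes-late values (not drawn from the dataset)
values = pd.Series([16.0, 34.0, 34.0, 45.0, 65.0, 118.0])
bin_width = 10

# Align the outer bin edges to multiples of the bin width
start = np.floor(values.min() / bin_width) * bin_width
stop = np.ceil(values.max() / bin_width) * bin_width
edges = np.arange(start, stop + bin_width, bin_width)

# Count values per bin; every bin is half-open [left, right) except the last, which is closed
counts, edges = np.histogram(values, bins=edges)

# Use the midpoint of each bin as its x-axis position
centers = edges[:-1] + bin_width / 2

chart_frame = pd.DataFrame({"bin_center": centers, "count": counts})
```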

#### 3.18.1 Create chart data [1 pt]

In [50]:
# Convert to DataFrame
lax_avg_min_late = lax_avg_min_late.to_frame(name=COLS["avg_mm_late"])

# Get mean and standard deviation
mu = lax_avg_min_late_describe["center"]["mean"]
sigma = lax_avg_min_late_describe["spread"]["std"]

# Get max value (for x-axis ticks); pad max value for chart display
max_val = lax_avg_min_late_describe["position"]["max"]
max_val_ceil = (np.ceil(max_val / 10) * 10).astype(int)

# Create bins
lax_min_late, bins, num_bins, bin_width = frm.create_bins(lax_avg_min_late, COLS["avg_mm_late"], 10)

# Bin the data
chrt_data = frm.bin_data(lax_min_late, COLS["avg_mm_late"], bins)
# chrt_data


In [None]:
#hidden tests are within this cell

#### 3.18.2 Generate the histogram

In [51]:
# Chart title
title_txt = f"Late Detraining Passengers: {STNS['lax']}"
title = ttl.format_title(lax_stats, title_txt)

# Tooltips
tooltip_config = [
    {"shorthand": "bin_center:Q", "title": "Average Minutes Late", "format": None},
    {"shorthand": "count:Q", "title": "Late Arrivals Count", "format": None},
]

# Create and display the histogram
chart = hst.create_histogram(
    frame=chrt_data,
    x_shorthand="bin_center:Q",
    x_title="Average Minutes Late",
    y_shorthand="count:Q",
    y_title="Late Arrivals Count",
    y_stack=False,
    line_shorthand="Avg Min Late:Q",
    mu=mu,
    sigma=sigma,
    num_bins=num_bins,
    bin_width=bin_width,
    x_tick_count_max=max_val_ceil,
    bar_color=COLORS["amtk_blue"],
    mu_color=COLORS["amtk_red"],
    sigma_color=COLORS["anth_gray"],
    tooltip_config=tooltip_config,
    title=title,
)
chart.display()


### 3.19 LAX: on-time performance metrics (by fiscal year and quarter)

Compute OTP summary statistics per fiscal year and quarter. Add quarterly train arrival metrics to
the `DataFrame` named `lax_qtr_stats`.

In [52]:
lax_qtr_stats = detrn.get_sum_stats_by_group(
    lax,
    [COLS["year"], COLS["quarter"]],
    AGG["columns"],
    AGG["funcs"],
    lax_trn_arrivals,
    lax_detrn,
)
lax_qtr_stats.sort_values(by=[COLS["year"], COLS["quarter"]], ascending=[True, True])

Unnamed: 0,Fiscal Year,Fiscal Quarter,Train Arrivals,Train Arrival Ratio,Detraining Ratio,Total Detraining Customers sum,Total Detraining Customers mean,Total Detraining Customers median,Total Detraining Customers std,Late Detraining Customers sum,Late Detraining Customers mean,Late Detraining Customers median,Late Detraining Customers std,Late to Total Detraining Customers Ratio,Late Detraining Customers Avg Min Late mean,Total On Time Detraining Customers sum
0,2021,4,14,0.063636,0.054844,75858,5418.4286,5507.5,2317.0101,11456,818.2857,847.0,528.6046,0.151,43.2143,64402
1,2022,1,25,0.113636,0.07927,109642,4385.68,3756.0,3322.4681,23767,950.68,559.0,1110.7817,0.2168,72.9091,85875
2,2022,2,18,0.081818,0.065893,91140,5063.3333,4760.5,2846.3386,18051,1002.8333,910.5,840.6817,0.1981,68.9412,73089
3,2022,3,17,0.077273,0.098682,136492,8028.9412,7901.0,3622.6644,31514,1853.7647,1444.0,1828.8681,0.2309,74.5294,104978
4,2022,4,18,0.081818,0.108741,150405,8355.8333,8667.5,4234.1648,42910,2383.8889,1854.5,2174.2089,0.2853,62.8235,107495
5,2023,1,14,0.063636,0.070757,97868,6990.5714,7146.5,2650.9018,20476,1462.5714,831.0,1471.1409,0.2092,60.7857,77392
6,2023,2,16,0.072727,0.058466,80867,5054.1875,5103.0,2881.4689,18553,1159.5625,902.5,1051.6684,0.2294,62.0,62314
7,2023,3,20,0.090909,0.074609,103195,5159.75,5559.5,3146.4702,24261,1213.05,769.0,1524.168,0.2351,58.5263,78934
8,2023,4,21,0.095455,0.09895,136863,6517.2857,5709.0,3998.5819,33286,1585.0476,1351.0,1641.19,0.2432,63.2857,103577
9,2024,1,18,0.081818,0.10253,141815,7878.6111,7263.5,3403.255,24637,1368.7222,953.5,1236.8138,0.1737,54.8333,117178
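
The ratio columns in the table above follow directly from the station totals. For fiscal year 2021, quarter 4, the train arrival ratio is 14 / 220 ≈ 0.063636 and the detraining ratio is 75,858 / 1,383,151 ≈ 0.054844. The sketch below shows one way such a table could be assembled with a pandas `groupby()` and named aggregation; the input rows and the snake_case column names are made up for illustration, and `detrn.get_sum_stats_by_group()` may be implemented differently.

```python
import pandas as pd

# Illustrative train arrivals (not drawn from the dataset)
trains = pd.DataFrame(
    {
        "Fiscal Year": [2023, 2023, 2023, 2024],
        "Fiscal Quarter": [4, 4, 4, 1],
        "Total Detraining Customers": [6000, 4000, 5000, 5000],
        "Late Detraining Customers": [1500, 500, 1000, 1000],
    }
)

# Aggregate per fiscal year and quarter
by_qtr = (
    trains.groupby(["Fiscal Year", "Fiscal Quarter"])
    .agg(
        train_arrivals=("Total Detraining Customers", "size"),
        total_sum=("Total Detraining Customers", "sum"),
        total_mean=("Total Detraining Customers", "mean"),
        late_sum=("Late Detraining Customers", "sum"),
    )
    .reset_index()
)

# Quarterly share of all train arrivals and of all detraining passengers
by_qtr["train_arrival_ratio"] = by_qtr["train_arrivals"] / len(trains)
by_qtr["detraining_ratio"] = by_qtr["total_sum"] / trains["Total Detraining Customers"].sum()

# Share of each quarter's detraining passengers who arrived late
by_qtr["late_to_total_ratio"] = by_qtr["late_sum"] / by_qtr["total_sum"]
```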


#### 3.19.1 Write to file [1 pt]

Write `lax_qtr_stats` to a CSV file named `stu-amtk-lax_qtr_stats.csv`. Store the file in the
`data/student` directory. Then compare it to the accompanying `fxt-amtk-lax_qtr_stats.csv` file. It
must match line for line, character for character.

In [53]:
# YOUR CODE HERE
filepath = parent_path.joinpath("data", "student", "stu-amtk-lax_qtr_stats.csv")
lax_qtr_stats.to_csv(filepath, index=True)

In [None]:
#hidden tests are within this cell

### 3.20 LAX: visualize detraining passengers

Visualize LAX detraining passengers, both on-time and late, across all years and quarters with a
bar chart.

In [54]:
# Assemble the data for the chart
chrt_data = vis_bar.create_detrain_chart_frame(lax_qtr_stats, CHRT_BAR["columns"])

# Get station code, station name, city, and state to use in the chart title
text = frm.drop_dups_and_squeeze(
    lax, [COLS["station_code"], COLS["station"], COLS["city"], COLS["state"]]
)

# Chart title
title_txt = (
    f"Detraining Passengers: {text['Arrival Station']} ({text['Arrival Station Code']}), "
    f"{text['City']}, {text['State']}"
)
title = ttl.format_title(lax_stats, title_txt)

# Create and display grouped bar chart
chart = vis_bar.create_grouped_bar_chart(
    chrt_data,
    "Fiscal Period:N",
    "Passengers:Q",
    "Arrival Status:N",
    CHRT_BAR["xoffset_sort"],
    CHRT_BAR["colors"],
    title,
)

chart.display()

### 3.21 LAX: on-time performance metrics by service line [1 pt]

Group train arrivals by service line.

In [56]:
lax_svc_trains = lax.groupby(COLS["svc_line"]).size().reset_index()  # Counts rows with NaN values too
lax_svc_trains.columns = [COLS["svc_line"], COLS["trn_arrivals"]]
lax_svc_trains.sort_values(by=COLS["trn_arrivals"], ascending=False, inplace=True)
lax_svc_trains.reset_index(drop=True, inplace=True)

# Add train arrival ratios (service line/total)
lax_svc_trains.loc[:, COLS["trn_arrival_ratio"]] = (
    lax_svc_trains[COLS["trn_arrivals"]] / lax_trn_arrivals
)
lax_svc_trains

Unnamed: 0,Service Line,Train Arrivals,Train Arrival Ratio
0,State Supported,183,0.831818
1,Long Distance,37,0.168182


In [None]:
#hidden tests are within this cell

#### 3.21.1 LAX: compute on-time performance metrics by service line [1 pt]

In [28]:
# Get summary stats by COLS["svc_line"]
lax_svc_line_stats = detrn.get_sum_stats_by_group(
    lax, COLS["svc_line"], AGG["columns"], AGG["funcs"]
)

# Merge train arrivals by service line (requires lax_svc_trains from the 3.21 cell; run it first)
lax_svc_line_stats = lax_svc_line_stats.merge(lax_svc_trains, on=COLS["svc_line"], how="inner")

# Move train arrival columns
cols = lax_svc_line_stats.columns.tolist()
cols = [cols[0]] + cols[-2:] + cols[1:-2]
lax_svc_line_stats = lax_svc_line_stats[cols]

# Add service line detraining ratios
lax_svc_line_stats.loc[:, "Service Line Detraining Ratio"] = (
    lax_svc_line_stats["Total Detraining Customers sum"] / lax_detrn
)

# Move service line detraining ratio column
lax_svc_line_stats.insert(
    3, "Service Line Detraining Ratio", lax_svc_line_stats.pop("Service Line Detraining Ratio")
)

# Sort by passengers detrained (descending order)
lax_svc_line_stats.sort_values(by="Total Detraining Customers sum", ascending=False, inplace=True)

# Reset index
lax_svc_line_stats.reset_index(drop=True, inplace=True)
lax_svc_line_stats

NameError: name 'lax_svc_trains' is not defined

In [None]:
#hidden tests are within this cell

#### 3.21.2 LAX: visualize distribution of mean late arrival times

Illustrate with box plots.

In [59]:
lax_svc_lines = lax.groupby(COLS["svc_line"])[[COLS["svc_line"], COLS["late_detrn_avg_mm_late"]]]
chrt_data = lax_svc_lines.apply(lambda x: x).reset_index(drop=True)  # Flatten for Altair
chrt_data.head()

Unnamed: 0,Service Line,Late Detraining Customers Avg Min Late
0,Long Distance,134.0
1,Long Distance,165.0
2,Long Distance,239.0
3,Long Distance,80.0
4,Long Distance,157.0


In [60]:
# Chart title
title_txt = (
    f"Late Detraining Passengers: {text['Arrival Station']} ({text['Arrival Station Code']}), "
    f"{text['City']}, {text['State']}"
)
title = ttl.format_title(lax_stats, title_txt)

chart = box.create_box_plot(
    chrt_data,
    "Late Detraining Customers Avg Min Late:Q",
    "Average Minutes Late",
    "Service Line:N",
    COLS["svc_line"],
    CHRT_BOX["y_axis"]["sort"],
    CHRT_BOX["colors"],
    title,
    CHRT_BOX["padding"],
)

chart.display()

## 4.0 Watermark

In [None]:
%load_ext watermark
%watermark -h -i -iv -m -v