# Clean: Intercity Passenger Rail Service Station Performance Metrics

This notebook "cleans" the combined [Amtrak](https://www.amtrak.com/home.html) station performance
metrics, addressing issues involving missing values,string formatting, type conversion, and column
redundancy. The notebook also leverages each station's "State" value to add "Division" and "Region"
columns based on [US Census](https://www2.census.gov/geo/pdfs/maps-data/maps/reference/us_regdiv.pdf)
geographic groupings. The notebook then writes the updated dataset to a CSV file for follow up
cleaning, manipulation, and analysis.

### Variable names

A number of variable names in this project leverage the following abbreviations. The naming
strategy is to strike a balance between brevity and readability:

* `amtk`: Amtrak (reporting mark)
* `chrt`: chart
* `cols`: columns
* `const`: constant
* `cwd`: current working directory
* `eb`: eastbound direction of travel
* `lm`: linear model
* `mi`: miles
* `mm`: minutes (ISO 8601)
* `nb`: northbound direction of travel
* `psgr`: passenger
* `qtr`: quarter
* `rte`: route
* `sb`: southbound direction of travel
* `stats`: summary statistics
* `stn`: station
* `stns`: stations
* `svc`: service
* `trn`: train
* `wb`: westbound direction of travel

In [1]:
import json
import numpy as np
import pandas as pd
import pathlib as pl
import re
import tomllib as tl

import fra_amtrak.amtk_frame as frm
import fra_amtrak.amtk_network as ntwk

# Set random seed
rdg = np.random.default_rng(24)

## 1.0 Read files

### 1.1 Resolve paths

Instantiate instances of `pathlib.Path` to represent absolute paths to the `data/interim` and `data/processed` directories.

In [2]:
parent_path = pl.Path.cwd()  # current working directory
parent_path

data_interim_path = parent_path.joinpath("data", "interim")
data_processed_path = parent_path.joinpath("data", "processed")

### 1.2 Load constraints

Load a companion [TOML](https://toml.io/en/) file containing constants.

In [3]:
filepath = parent_path.joinpath("notebook.toml")
with open(filepath, "rb") as file_obj:
    const = tl.load(file_obj)

# Access constants
COLS = const["columns"]

### 1.3 Retrieve performance data (interim)

In [4]:
filepath = data_interim_path.joinpath("station_performance_metrics-v1p0.csv")
stations = pd.read_csv(filepath)

### 1.4 Review the `DataFrame`

In [5]:
stations.shape

(68412, 12)

In [6]:
stations.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68412 entries, 0 to 68411
Data columns (total 12 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Fiscal Year                 68412 non-null  int64  
 1   Fiscal Quarter              68412 non-null  int64  
 2   Service Line                68412 non-null  object 
 3   Service                     68412 non-null  object 
 4   Sub Service                 68412 non-null  object 
 5   Train Number                68412 non-null  int64  
 6   Arrival Station Code        68412 non-null  object 
 7   Arrival Station Name        68412 non-null  object 
 8   Total Detraining Customers  68412 non-null  int64  
 9   Late Detraining Customers   68412 non-null  int64  
 10  Avg Min Late (Lt CS)        52705 non-null  float64
 11  Avg Min Late (Lt C)         4668 non-null   float64
dtypes: float64(2), int64(5), object(5)
memory usage: 6.3+ MB


In [7]:
stations.head()

Unnamed: 0,Fiscal Year,Fiscal Quarter,Service Line,Service,Sub Service,Train Number,Arrival Station Code,Arrival Station Name,Total Detraining Customers,Late Detraining Customers,Avg Min Late (Lt CS),Avg Min Late (Lt C)
0,2024,3,Long Distance,Auto Train,Auto Train,52,LOR,"Lorton (Auto Train), Virginia",42445,23316,95.0,
1,2024,3,Long Distance,Auto Train,Auto Train,53,SFA,"Sanford (Auto Train), Florida",28034,18439,91.0,
2,2024,3,Long Distance,California Zephyr,California Zephyr,5,BRL,"Burlington, Iowa",557,223,54.0,
3,2024,3,Long Distance,California Zephyr,California Zephyr,5,COX,"Colfax, California",508,326,99.0,
4,2024,3,Long Distance,California Zephyr,California Zephyr,5,CRN,"Creston, Iowa",205,144,67.0,


## 2.0 Normalize strings

Trim each string value of leading/trailing spaces. Also search and remove unnecessary spaces in each string value based on the regular expression `re.Pattern` object. Call the function `frm.normalize_dataframe_strings()` to perform this operation.

### 2.1 Locate suspect strings

As is illustrated below, the regex pattern to employ is `"\s{2,}"`.

In [7]:
# Locate extra spaces in all string columns
extra_spaces = stations.select_dtypes(include="object").apply(
    lambda x: x.str.contains(r"\s{2,}").sum()
)
extra_spaces

Service Line             0
Service                  0
Sub Service              0
Arrival Station Code     0
Arrival Station Name    22
dtype: int64

### 2.2 Clean strings [1 pt]

In [8]:
stations["Service Line"] = stations["Service Line"].str.strip()
stations["Service"] = stations["Service"].str.strip()
stations["Sub Service"] = stations["Sub Service"].str.strip()
stations["Arrival Station Code"] = stations["Arrival Station Code"].str.strip()
stations["Arrival Station Name"] = stations["Arrival Station Name"].str.strip()
stations = frm.normalize_dataframe_strings(frame=stations, pattern="\s{2,}")

In [10]:
#hidden tests are within this cell

## 3.0 Manipulate data

### 3.1 Why two "average min late" columns?

The dataset contains two columns that appear to record the same information: average minutes late. The columns are: "Avg Min Late (Lt CS)" and "Avg Min Late (Lt C)". The "Lt CS" column is well-stocked with non-`NaN` values; in contrast "Lt C" column contains only `4668` numeric values. Perhaps this data can be moved to the "Avg Min Late (Lt CS)". Investigate.

#### 3.1.1 Compare "Avg Min Late (Lt CS)" and "Avg Min Late (Lt C)" values

First, return a `DataFrame` filtered on "Avg Min Late (Lt C)" non-NA values.

In [9]:
mask = stations[COLS["avg_mm_late_c"]].notna()
lt_c_notna = stations[mask].reset_index(drop=True)
lt_c_notna.shape

(4668, 12)

Check if `lt_c_notna` numeric values can be found throughout the dataset or are confined to a specific years and/or quarters.

In [10]:
years_qtrs = lt_c_notna[[COLS["year"], COLS["quarter"]]].drop_duplicates().reset_index(drop=True)
years_qtrs

Unnamed: 0,Fiscal Year,Fiscal Quarter
0,2022,1


Next, create a second `DataFrame` filtered on "Avg Min Late (Lt C)" non-NA values _and_ "Avg Min Late (Lt CS)" NA values.

In [11]:
mask = (stations[COLS["avg_mm_late_c"]].notna()) & (stations[COLS["avg_mm_late_cs"]].isna())
lt_c_notna_lt_cs_isna = stations[mask].reset_index(drop=True)
lt_c_notna_lt_cs_isna.shape

(4668, 12)

Check the two `DataFrames` for equality. If they are equal, the non-NA "Avg Min Late (Lt C)" values can be copied to the "Avg Min Late (Lt CS)" column.

In [12]:
assert lt_c_notna.equals(lt_c_notna_lt_cs_isna)

#### 3.1.2 Update the "Avg Min Late (Lt CS)" column with non-NA "Avg Min Late (Lt C)" values

The values are safe to transfer.

In [13]:
mask = stations[COLS["avg_mm_late_c"]].notna()
stations.loc[mask, COLS["avg_mm_late_cs"]] = stations.loc[mask, COLS["avg_mm_late_c"]]
stations[mask].head(3)

Unnamed: 0,Fiscal Year,Fiscal Quarter,Service Line,Service,Sub Service,Train Number,Arrival Station Code,Arrival Station Name,Total Detraining Customers,Late Detraining Customers,Avg Min Late (Lt CS),Avg Min Late (Lt C)
58896,2022,1,Long Distance,Auto Train,Auto Train,52,LOR,"Lorton (Auto Train), Virginia",26631,18819,145.0,145.0
58897,2022,1,Long Distance,Auto Train,Auto Train,53,SFA,"Sanford (Auto Train), Florida",39969,31652,176.0,176.0
58898,2022,1,Long Distance,California Zephyr,California Zephyr,5,BRL,"Burlington, Iowa",542,170,36.0,36.0


#### 3.1.3 Drop the "Avg Min Late (Lt C)" column [1 pt]

The column is now redundant.

In [16]:
stations.drop(columns=["Avg Min Late (Lt C)"], inplace=True)

In [17]:
stations.columns
# ['Fiscal Year', 'Fiscal Quarter', 'Service Line', 'Service', 'Sub Service', 
#  'Train Number', 'Arrival Station Code', 'Arrival Station Name', 
#  'Total Detraining Customers', 'Late Detraining Customers', 'Avg Min Late (Lt CS)']

Index(['Fiscal Year', 'Fiscal Quarter', 'Service Line', 'Service',
       'Sub Service', 'Train Number', 'Arrival Station Code',
       'Arrival Station Name', 'Total Detraining Customers',
       'Late Detraining Customers', 'Avg Min Late (Lt CS)'],
      dtype='object')

In [17]:
#hidden tests are within this cell

### 3.2 Split "Arrival Station Name" string into multiple columns [1 pt]

The "Arrival Station Name" column is overloaded with location information. The station name, state,
and country are usually resident in the string.

Split the column values and unpack the substrings into three new columns named "Arrival Station",
"State", and "Country". Use the available `COLS` constants to define the new column names.

In [19]:
temp = stations["Arrival Station Name"].str.split(pat=",", expand=True)
temp = temp.rename(columns={0: "Arrival Station", 1: "State", 2: "Country"})
stations = pd.concat([stations, temp], axis=1)
stations["Arrival Station"] = stations["Arrival Station"].str.strip()
stations["State"] = stations["State"].str.strip()
stations["Country"] = stations["Country"].str.strip()
stations = stations[['Fiscal Year', 'Fiscal Quarter', 'Service Line', 'Service', 'Sub Service', 
                     'Train Number', 'Arrival Station Code', 'Arrival Station Name', 
                     'Total Detraining Customers', 'Late Detraining Customers', 
                     'Avg Min Late (Lt CS)', 'Arrival Station', 'State', 'Country']]

In [None]:
#hidden tests are within this cell

#### 3.2.1 Review "State" column values

Compare values to jurisdictions contained in `states_provinces.json` file. The file contains a list of US states, the District of Columbia, and Canadian provinces. Update values as needed.

In [20]:
with open(data_processed_path.joinpath("states_provinces.json"), "r") as file:
    states_provinces = json.load(file)

# Combine US and Canadian jurisdictions
jurisdictions = states_provinces["United States"] + states_provinces["Canada"]

# Check for missing and/or incorrect values
mask = ~stations[COLS["state"]].isin(jurisdictions)  # negation
bad_values = stations[mask].loc[:, COLS["state"]].unique()
bad_values

array(['VT', None, 'CA'], dtype=object)

#### 3.2.2 Update "State" column CA and VT values [1 pt]

Update the "State" column, replacing the US state codes "CA" and "VT" with "California" and "Vermont", respectively.

In [21]:
# YOUR CODE HERE
stations["State"] = stations["State"].replace(to_replace="CA", value="California")
stations["State"] = stations["State"].replace(to_replace="VT", value="Vermont")

In [None]:
#hidden tests are within this cell

#### 3.2.3 Update "State" column `NaN` values

In [22]:
# Check "States" column for missing values
mask = stations[COLS["state"]].isna()
bad_values = (
    stations[mask]
    .loc[:, [COLS["station_code"], COLS["station"], COLS["state"]]]
    .drop_duplicates()
    .reset_index(drop=True)
)
bad_values

Unnamed: 0,Arrival Station Code,Arrival Station,State
0,CBN,Canadian Border New York,
1,NRG,Northridge Station,


The `NaN` values are associated with the following stations:

* CBN: [Canadian Border (Niagara Falls, NY)](https://www.amtrak.com/stations/cbn)
* NRG: [Northridge, CA](https://www.amtrak.com/stations/nrg)

Update the "State" column values for these stations.

In [23]:
# Update missing States and Country valuee
mapper = {"CBN": "New York", "NRG": "California"}
stations[COLS["state"]] = stations[COLS["station_code"]].map(mapper).fillna(stations["State"])

Sample to confirm that the "State" column values have been updated.

In [24]:
# Sample to confirm CBN and NRB stations have been updated
mask = (stations[COLS["station_code"]] == "CBN") | (stations[COLS["station_code"]] == "NRG")

# Apply weights to sample (CBN stations are fewer)
weights = stations[mask][COLS["station_code"]].apply(lambda x: 7 if x == "CBN" else 1)
stations[mask].sample(n=7, weights=weights, random_state=rdg)

Unnamed: 0,Fiscal Year,Fiscal Quarter,Service Line,Service,Sub Service,Train Number,Arrival Station Code,Arrival Station Name,Total Detraining Customers,Late Detraining Customers,Avg Min Late (Lt CS),Arrival Station,State,Country
22926,2023,4,State Supported,Empire,Maple Leaf,63,CBN,Canadian Border New York,9314,1363,72.0,Canadian Border New York,New York,
24172,2023,4,State Supported,Pacific Surfliner,Pacific Surfliner,761,NRG,Northridge Station,993,189,33.0,Northridge Station,California,
30496,2023,3,State Supported,Pacific Surfliner,Pacific Surfliner,784,NRG,Northridge Station,255,17,42.0,Northridge Station,California,
29101,2023,3,State Supported,Empire,Maple Leaf,63,CBN,Canadian Border New York,6071,861,87.0,Canadian Border New York,New York,
30473,2023,3,State Supported,Pacific Surfliner,Pacific Surfliner,777,NRG,Northridge Station,607,108,53.0,Northridge Station,California,
46339,2022,4,State Supported,Empire,Maple Leaf,63,CBN,Canadian Border New York,11,0,,Canadian Border New York,New York,
4319,2024,3,State Supported,Empire,Maple Leaf,63,CBN,Canadian Border New York,5786,512,85.0,Canadian Border New York,New York,


### 3.3 Update the "Country" column [1 pt]

Levarage the "State" column to update each "Country" column row value with either the "United States" or "Canada".

In [25]:
# Read states
filepath = data_processed_path.joinpath("states_provinces.json")
with open(filepath, "r") as file_obj:
    states_provinces = json.load(file_obj)

# Count US and Canadian stations
country_counts = stations[COLS["country"]].value_counts()
print(f"country_counts = {country_counts}")

country_counts = Country
United States    174
Canada            23
Name: count, dtype: int64


Update the "Country" column with "United States" and "Canada" values by applying the function `get_country()` to each row value.

In [26]:
# YOUR CODE HERE
# Build reverse mapping
state_to_country = {state: country for country, states in states_provinces.items() for state in states}

# Define the mapping function
def get_country(state):
    return state_to_country.get(state, np.nan)

stations["Country"] = stations["State"].apply(get_country)
stations

Unnamed: 0,Fiscal Year,Fiscal Quarter,Service Line,Service,Sub Service,Train Number,Arrival Station Code,Arrival Station Name,Total Detraining Customers,Late Detraining Customers,Avg Min Late (Lt CS),Arrival Station,State,Country
0,2024,3,Long Distance,Auto Train,Auto Train,52,LOR,"Lorton (Auto Train), Virginia",42445,23316,95.0,Lorton (Auto Train),Virginia,United States
1,2024,3,Long Distance,Auto Train,Auto Train,53,SFA,"Sanford (Auto Train), Florida",28034,18439,91.0,Sanford (Auto Train),Florida,United States
2,2024,3,Long Distance,California Zephyr,California Zephyr,5,BRL,"Burlington, Iowa",557,223,54.0,Burlington,Iowa,United States
3,2024,3,Long Distance,California Zephyr,California Zephyr,5,COX,"Colfax, California",508,326,99.0,Colfax,California,United States
4,2024,3,Long Distance,California Zephyr,California Zephyr,5,CRN,"Creston, Iowa",205,144,67.0,Creston,Iowa,United States
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
68407,2021,4,State Supported,Vermonter,Vermonter,57,WAS,"Washington, District of Columbia",5191,187,37.0,Washington,District of Columbia,United States
68408,2021,4,State Supported,Vermonter,Vermonter,57,WIL,"Wilmington, Delaware",464,45,28.0,Wilmington,Delaware,United States
68409,2021,4,State Supported,Vermonter,Vermonter,57,WNL,"Windsor Locks, Connecticut",21,12,35.0,Windsor Locks,Connecticut,United States
68410,2021,4,State Supported,Vermonter,Vermonter,57,WNM,"Windsor, Vermont",14,10,26.0,Windsor,Vermont,United States


In [None]:
#hidden tests are within this cell

Recheck the "Country" column values.

In [27]:
# Count US and Canadian stations
country_counts = stations[COLS["country"]].value_counts()
print(f"country_counts = {country_counts}")

country_counts = Country
United States    68376
Canada              36
Name: count, dtype: int64


### 3.3 Add region and division columns

Read the `regions_divisions.json` file to acquire region and division values. Then levarage the "State" column to add new "Region" and "Division" columns to the `DataFrame`.

In [28]:
filepath = data_processed_path.joinpath("regions_divisions.json")
with open(filepath, "r") as file_obj:
    regions_divisions = json.load(file_obj)

print(regions_divisions.keys())
print(regions_divisions["West"].keys())
print(regions_divisions["West"].items())

dict_keys(['Northeast', 'Midwest', 'South', 'West'])
dict_keys(['Mountain', 'Pacific', 'Western Canada'])
dict_items([('Mountain', ['Arizona', 'Colorado', 'Idaho', 'Montana', 'Nevada', 'New Mexico', 'Utah', 'Wyoming']), ('Pacific', ['Alaska', 'California', 'Hawaii', 'Oregon', 'Washington']), ('Western Canada', ['Alberta', 'British Columbia', 'Manitoba', 'Saskatchewan'])])


Apply the function `add_regions_divisions()` to each "Region" and "Division" row.

In [29]:
# Assign region to each state, province, and district
stations.loc[:, [COLS["region"], COLS["division"]]] = (
    stations.loc[:, COLS["state"]]
    .apply(lambda x: pd.Series(ntwk.get_region_division(regions_divisions, x)))
    .values
)
stations.head()

Unnamed: 0,Fiscal Year,Fiscal Quarter,Service Line,Service,Sub Service,Train Number,Arrival Station Code,Arrival Station Name,Total Detraining Customers,Late Detraining Customers,Avg Min Late (Lt CS),Arrival Station,State,Country,Region,Division
0,2024,3,Long Distance,Auto Train,Auto Train,52,LOR,"Lorton (Auto Train), Virginia",42445,23316,95.0,Lorton (Auto Train),Virginia,United States,South,South Atlantic
1,2024,3,Long Distance,Auto Train,Auto Train,53,SFA,"Sanford (Auto Train), Florida",28034,18439,91.0,Sanford (Auto Train),Florida,United States,South,South Atlantic
2,2024,3,Long Distance,California Zephyr,California Zephyr,5,BRL,"Burlington, Iowa",557,223,54.0,Burlington,Iowa,United States,Midwest,West North Central
3,2024,3,Long Distance,California Zephyr,California Zephyr,5,COX,"Colfax, California",508,326,99.0,Colfax,California,United States,West,Pacific
4,2024,3,Long Distance,California Zephyr,California Zephyr,5,CRN,"Creston, Iowa",205,144,67.0,Creston,Iowa,United States,Midwest,West North Central


### 3.4 Reorder columns [1 pt]

Reorder the columns as specified in the table below.

| Position | Column Name | Note |
| :----- | :------------- | :------------- |
| `0`-`1` | "Fiscal Year", "Fiscal Quarter" | &nbsp; |
| `2`-`5` | "Service Line", "Service", "Sub Service", "Train Number" | &nbsp; |
| `6-8` | "Arrival Station Code", "Arrival Station Name", "Arrival Station" | Drop "Arrival Station Name" after confirming column order. |
| `9`-`12` | "State", "Division", "Region", "Country" | &nbsp; |
| `13`-`14` | "Total Detraining Customers", "Late Detraining Customers" | &nbsp; |
| `15` | "Avg Min Late (Lt CS)" | &nbsp; |

In [30]:
stations = stations[["Fiscal Year", "Fiscal Quarter", "Service Line", "Service", "Sub Service", "Train Number", "Arrival Station Code", "Arrival Station Name", "Arrival Station",
             "State", "Division", "Region", "Country", "Total Detraining Customers", "Late Detraining Customers", "Avg Min Late (Lt CS)"]]
stations

Unnamed: 0,Fiscal Year,Fiscal Quarter,Service Line,Service,Sub Service,Train Number,Arrival Station Code,Arrival Station Name,Arrival Station,State,Division,Region,Country,Total Detraining Customers,Late Detraining Customers,Avg Min Late (Lt CS)
0,2024,3,Long Distance,Auto Train,Auto Train,52,LOR,"Lorton (Auto Train), Virginia",Lorton (Auto Train),Virginia,South Atlantic,South,United States,42445,23316,95.0
1,2024,3,Long Distance,Auto Train,Auto Train,53,SFA,"Sanford (Auto Train), Florida",Sanford (Auto Train),Florida,South Atlantic,South,United States,28034,18439,91.0
2,2024,3,Long Distance,California Zephyr,California Zephyr,5,BRL,"Burlington, Iowa",Burlington,Iowa,West North Central,Midwest,United States,557,223,54.0
3,2024,3,Long Distance,California Zephyr,California Zephyr,5,COX,"Colfax, California",Colfax,California,Pacific,West,United States,508,326,99.0
4,2024,3,Long Distance,California Zephyr,California Zephyr,5,CRN,"Creston, Iowa",Creston,Iowa,West North Central,Midwest,United States,205,144,67.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
68407,2021,4,State Supported,Vermonter,Vermonter,57,WAS,"Washington, District of Columbia",Washington,District of Columbia,South Atlantic,South,United States,5191,187,37.0
68408,2021,4,State Supported,Vermonter,Vermonter,57,WIL,"Wilmington, Delaware",Wilmington,Delaware,South Atlantic,South,United States,464,45,28.0
68409,2021,4,State Supported,Vermonter,Vermonter,57,WNL,"Windsor Locks, Connecticut",Windsor Locks,Connecticut,New England,Northeast,United States,21,12,35.0
68410,2021,4,State Supported,Vermonter,Vermonter,57,WNM,"Windsor, Vermont",Windsor,Vermont,New England,Northeast,United States,14,10,26.0


In [None]:
#hidden tests are within this cell

### 3.5 Drop "Arrival Station Name" column [1 pt]

Now redundant. Remove.

In [33]:
stations.drop(columns=["Arrival Station Name"], inplace=True)

KeyError: "['Arrival Station Name'] not found in axis"

In [None]:
#hidden tests are within this cell

### 3.6 Rename the "Avg Min Late (Lt CS)" column [1 pt]

The presence of parentheses `()` in the "Avg Min Late (Lt CS)" column name may cause issues in subsequent analysis. Rename the column to "Late Detraining Customers Avg Min Late".

In [37]:
# YOUR CODE HERE
stations.rename(columns={"Avg Min Late (Lt CS)": "Late Detraining Customers Avg Min Late"}, inplace=True)
stations

Unnamed: 0,Fiscal Year,Fiscal Quarter,Service Line,Service,Sub Service,Train Number,Arrival Station Code,Arrival Station Name,Arrival Station,State,Division,Region,Country,Total Detraining Customers,Late Detraining Customers,Late Detraining Customers Avg Min Late
0,2024,3,Long Distance,Auto Train,Auto Train,52,LOR,"Lorton (Auto Train), Virginia",Lorton (Auto Train),Virginia,South Atlantic,South,United States,42445,23316,95.0
1,2024,3,Long Distance,Auto Train,Auto Train,53,SFA,"Sanford (Auto Train), Florida",Sanford (Auto Train),Florida,South Atlantic,South,United States,28034,18439,91.0
2,2024,3,Long Distance,California Zephyr,California Zephyr,5,BRL,"Burlington, Iowa",Burlington,Iowa,West North Central,Midwest,United States,557,223,54.0
3,2024,3,Long Distance,California Zephyr,California Zephyr,5,COX,"Colfax, California",Colfax,California,Pacific,West,United States,508,326,99.0
4,2024,3,Long Distance,California Zephyr,California Zephyr,5,CRN,"Creston, Iowa",Creston,Iowa,West North Central,Midwest,United States,205,144,67.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
68407,2021,4,State Supported,Vermonter,Vermonter,57,WAS,"Washington, District of Columbia",Washington,District of Columbia,South Atlantic,South,United States,5191,187,37.0
68408,2021,4,State Supported,Vermonter,Vermonter,57,WIL,"Wilmington, Delaware",Wilmington,Delaware,South Atlantic,South,United States,464,45,28.0
68409,2021,4,State Supported,Vermonter,Vermonter,57,WNL,"Windsor Locks, Connecticut",Windsor Locks,Connecticut,New England,Northeast,United States,21,12,35.0
68410,2021,4,State Supported,Vermonter,Vermonter,57,WNM,"Windsor, Vermont",Windsor,Vermont,New England,Northeast,United States,14,10,26.0


In [None]:
#hidden tests are within this cell

## 4.0 Persist data

### 4.1 Recheck data.

In [38]:
stations.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68412 entries, 0 to 68411
Data columns (total 16 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   Fiscal Year                             68412 non-null  int64  
 1   Fiscal Quarter                          68412 non-null  int64  
 2   Service Line                            68412 non-null  object 
 3   Service                                 68412 non-null  object 
 4   Sub Service                             68412 non-null  object 
 5   Train Number                            68412 non-null  int64  
 6   Arrival Station Code                    68412 non-null  object 
 7   Arrival Station Name                    68412 non-null  object 
 8   Arrival Station                         68412 non-null  object 
 9   State                                   68412 non-null  object 
 10  Division                                68412 non-null  ob

### 4.2 Write to file [1 pt]

Write data to a CSV file.

In [39]:
filepath = data_interim_path.joinpath("station_performance_metrics-v1p1.csv")
stations.to_csv(filepath, index=False)

In [None]:
#hidden tests are within this cell

## 5.0 Watermark

In [40]:
%load_ext watermark
%watermark -h -i -iv -m -v

Python implementation: CPython
Python version       : 3.11.9
IPython version      : 8.26.0

Compiler    : GCC 12.3.0
OS          : Linux
Release     : 6.5.0-1020-aws
Machine     : x86_64
Processor   : x86_64
CPU cores   : 32
Architecture: 64bit

Hostname: e738d0a7f38b

pandas: 2.2.3
re    : 2.2.1
numpy : 2.1.3
json  : 2.0.9

