# State-Level Cost of Travel Index Calculation

This notebook loads CSV files exported from BigQuery (monthly data, January 2024 through October 2025) and calculates the cost of a three-day weekend trip for each state and month.

**Data Sources:**
- Accommodations: 2 nights (median per-transaction cost)
- Restaurants: 6 meals (4 breakfast/lunch at P35, 2 dinners at P65)
- Attractions: 2 days (median per-transaction cost)
- Retail: 1 day (median per-transaction cost, counted twice for a couple)

In [1]:
import pandas as pd
import numpy as np
import glob
import os
from pathlib import Path

## 1. Load and Concatenate Data Files

Use glob to find all CSV files in each category folder and concatenate them.

In [2]:
# Define base path
base_path = '../state_data'

# Load Accommodations - all CSV files in accommodations folder
acc_files = glob.glob(os.path.join(base_path, 'accommodations', '*.csv'))
print(f"Found {len(acc_files)} accommodation files:")
for f in acc_files:
    print(f"  - {os.path.basename(f)}")

acc_dfs = [pd.read_csv(f) for f in acc_files]
acc = pd.concat(acc_dfs, ignore_index=True)

print(f"\nTotal accommodation records: {len(acc)}")
print(f"Date range: {acc['month_date'].min()} to {acc['month_date'].max()}")
print(f"Unique months: {acc['month_date'].nunique()}")

Found 2 accommodation files:
  - cti_state_accommodations_2024.csv
  - cti_state_accommodations_2025.csv

Total accommodation records: 1128
Date range: 2024-01-01 to 2025-10-01
Unique months: 22


In [3]:
# Quick check of accommodations data
acc.head()

Unnamed: 0,merch_state,month_date,accommodation_cost,transaction_count,unique_visitors,min_cost,max_cost,q25,q75,data_quality_flag,period_start,period_end,calculation_timestamp
0,FL,2024-01-01,222.830002,3768824,14863,60.669998,3682.149902,124.879997,485.619995,SINGLE_MONTH,2024-01-01,2024-12-31,2025-11-18 17:28:50.264976 UTC
1,TX,2024-01-01,185.600006,3282305,12875,56.810001,1306.900024,116.589996,360.579987,SINGLE_MONTH,2024-01-01,2024-12-31,2025-11-18 17:28:50.264976 UTC
2,CA,2024-01-01,213.929993,3168776,13136,61.139999,1930.02002,118.150002,407.850006,SINGLE_MONTH,2024-01-01,2024-12-31,2025-11-18 17:28:50.264976 UTC
3,NV,2024-01-01,210.809998,3029201,9922,62.23,1644.430054,122.449997,390.559998,SINGLE_MONTH,2024-01-01,2024-12-31,2025-11-18 17:28:50.264976 UTC
4,TN,2024-01-01,183.119995,1329522,5185,56.389999,1451.5,119.139999,325.820007,SINGLE_MONTH,2024-01-01,2024-12-31,2025-11-18 17:28:50.264976 UTC


In [4]:
# Load Attractions - all CSV files in attractions folder
att_files = glob.glob(os.path.join(base_path, 'attractions', '*.csv'))
print(f"Found {len(att_files)} attraction files:")
for f in att_files:
    print(f"  - {os.path.basename(f)}")

att_dfs = [pd.read_csv(f) for f in att_files]
att = pd.concat(att_dfs, ignore_index=True)

print(f"\nTotal attraction records: {len(att)}")
print(f"Date range: {att['month_date'].min()} to {att['month_date'].max()}")
print(f"Unique months: {att['month_date'].nunique()}")

Found 2 attraction files:
  - cti_state_attractions_2024.csv
  - cti_state_attractions_2025.csv

Total attraction records: 1144
Date range: 2024-01-01 to 2025-10-01
Unique months: 22


In [5]:
# Load Restaurants - all CSV files in restaurants folder
rest_files = glob.glob(os.path.join(base_path, 'restaurants', '*.csv'))
print(f"Found {len(rest_files)} restaurant files:")
for f in rest_files:
    print(f"  - {os.path.basename(f)}")

rest_dfs = [pd.read_csv(f) for f in rest_files]
rest = pd.concat(rest_dfs, ignore_index=True)

print(f"\nTotal restaurant records: {len(rest)}")
print(f"Date range: {rest['month_date'].min()} to {rest['month_date'].max()}")
print(f"Unique months: {rest['month_date'].nunique()}")

Found 2 restaurant files:
  - cti_state_restaurants_2024.csv
  - cti_state_restaurants_2025.csv

Total restaurant records: 1144
Date range: 2024-01-01 to 2025-10-01
Unique months: 22


In [6]:
# Load Retail - all CSV files in retail folder
ret_files = glob.glob(os.path.join(base_path, 'retail', '*.csv'))
print(f"Found {len(ret_files)} retail files:")
for f in ret_files:
    print(f"  - {os.path.basename(f)}")

ret_dfs = [pd.read_csv(f) for f in ret_files]
ret = pd.concat(ret_dfs, ignore_index=True)

print(f"\nTotal retail records: {len(ret)}")
print(f"Date range: {ret['month_date'].min()} to {ret['month_date'].max()}")
print(f"Unique months: {ret['month_date'].nunique()}")

Found 2 retail files:
  - cti_state_retail_2025.csv
  - cti_state_retail_2024.csv

Total retail records: 1140
Date range: 2024-01-01 to 2025-10-01
Unique months: 22
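The four loading cells above repeat the same glob, read, and concatenate steps. A minimal sketch of a shared helper follows; `load_category` is a hypothetical name that does not appear in this notebook, and the demo writes throwaway files to a temporary folder instead of touching `../state_data`. Sorting the glob result is an added assumption about what is wanted: `glob.glob` does not guarantee an order, which is why the retail listing above shows 2025 before 2024.

```python
import glob
import os
import tempfile

import pandas as pd


def load_category(base_path, category):
    """Read every CSV under base_path/category and stack them into one frame.

    Hypothetical helper, not part of the original notebook.
    """
    pattern = os.path.join(base_path, category, "*.csv")
    files = sorted(glob.glob(pattern))  # sorted because glob order is not guaranteed
    if not files:
        raise FileNotFoundError(f"No CSV files match {pattern}")
    frames = [pd.read_csv(f) for f in files]
    return pd.concat(frames, ignore_index=True), files


# Demo on throwaway files so the sketch runs without the real exports.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "retail"))
    for year in (2025, 2024):  # written out of order on purpose
        pd.DataFrame({
            "merch_state": ["FL"],
            "month_date": [f"{year}-01-01"],
            "retail_cost": [39.62],
        }).to_csv(
            os.path.join(tmp, "retail", f"cti_state_retail_{year}.csv"),
            index=False,
        )

    demo, demo_files = load_category(tmp, "retail")
    demo_names = [os.path.basename(f) for f in demo_files]

print(demo_names)
print(f"Total demo records: {len(demo)}")
```

With the real data this would be called as `acc, acc_files = load_category(base_path, 'accommodations')`, and likewise for the other three folders.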


## 2. Data Quality Check

Verify that all datasets have consistent months and check for any data issues.
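Comparing long printed lists of months by eye is error-prone. One possible way to make the check programmatic is sketched here; `month_gaps` is a hypothetical helper and the two toy frames are made up for illustration.

```python
import pandas as pd


def month_gaps(frames):
    """For each dataset, list the months that appear in any dataset but not in it.

    Hypothetical helper, not part of the original notebook.
    """
    month_sets = {name: set(df["month_date"]) for name, df in frames.items()}
    all_months = set().union(*month_sets.values())
    return {name: sorted(all_months - months) for name, months in month_sets.items()}


# Toy frames: "ret" is missing February on purpose.
toy = {
    "acc": pd.DataFrame({"month_date": ["2024-01-01", "2024-02-01"]}),
    "ret": pd.DataFrame({"month_date": ["2024-01-01"]}),
}

gaps = month_gaps(toy)
print(gaps)
```

An empty list for every dataset means the month coverage is identical. With the real frames the call would be `month_gaps({'acc': acc, 'att': att, 'rest': rest, 'ret': ret})`.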

In [7]:
# Check unique months in each dataset
print("Unique months by category:")
print(f"Accommodations: {sorted(acc['month_date'].unique())}")
print(f"Attractions: {sorted(att['month_date'].unique())}")
print(f"Restaurants: {sorted(rest['month_date'].unique())}")
print(f"Retail: {sorted(ret['month_date'].unique())}")

Unique months by category:
Accommodations: ['2024-01-01', '2024-02-01', '2024-03-01', '2024-04-01', '2024-05-01', '2024-06-01', '2024-07-01', '2024-08-01', '2024-09-01', '2024-10-01', '2024-11-01', '2024-12-01', '2025-01-01', '2025-02-01', '2025-03-01', '2025-04-01', '2025-05-01', '2025-06-01', '2025-07-01', '2025-08-01', '2025-09-01', '2025-10-01']
Attractions: ['2024-01-01', '2024-02-01', '2024-03-01', '2024-04-01', '2024-05-01', '2024-06-01', '2024-07-01', '2024-08-01', '2024-09-01', '2024-10-01', '2024-11-01', '2024-12-01', '2025-01-01', '2025-02-01', '2025-03-01', '2025-04-01', '2025-05-01', '2025-06-01', '2025-07-01', '2025-08-01', '2025-09-01', '2025-10-01']
Restaurants: ['2024-01-01', '2024-02-01', '2024-03-01', '2024-04-01', '2024-05-01', '2024-06-01', '2024-07-01', '2024-08-01', '2024-09-01', '2024-10-01', '2024-11-01', '2024-12-01', '2025-01-01', '2025-02-01', '2025-03-01', '2025-04-01', '2025-05-01', '2025-06-01', '2025-07-01', '2025-08-01', '2025-09-01', '2025-10-01']
Reta

In [8]:
# Check for states with low visitor counts
print("\nStates with lowest visitor counts (Accommodations):")
print(acc.sort_values(by='unique_visitors', ascending=True)[['merch_state', 'month_date', 'unique_visitors', 'data_quality_flag']].head(10))


States with lowest visitor counts (Accommodations):
     merch_state  month_date  unique_visitors data_quality_flag
1127          XX  2025-10-01                1           EXCLUDE
973           XX  2025-07-01                1           EXCLUDE
358           XX  2024-07-01                1           EXCLUDE
819           XX  2025-04-01                1           EXCLUDE
410           XX  2024-08-01                2           EXCLUDE
51            XX  2024-01-01                2           EXCLUDE
716           RI  2025-02-01               86           EXCLUDE
665           RI  2025-01-01               87           EXCLUDE
614           RI  2024-12-01              111           EXCLUDE
818           RI  2025-04-01              124           EXCLUDE
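The output above shows rows flagged `EXCLUDE` for both `XX` and `RI`, yet the notebook later filters only on `merch_state != 'XX'`. Whether flagged rows should be dropped is an assumption this notebook does not confirm. The sketch below only shows how the flag could be summarized and applied; the four rows are copied from the outputs above, and `acc_demo`, `excluded_by_state`, and `acc_clean` are illustrative names.

```python
import pandas as pd

# A few rows copied from the accommodations outputs above.
acc_demo = pd.DataFrame({
    "merch_state": ["XX", "RI", "RI", "FL"],
    "month_date": ["2025-10-01", "2025-01-01", "2025-02-01", "2024-01-01"],
    "unique_visitors": [1, 87, 86, 14863],
    "data_quality_flag": ["EXCLUDE", "EXCLUDE", "EXCLUDE", "SINGLE_MONTH"],
})

flagged = acc_demo["data_quality_flag"] == "EXCLUDE"

# How many state-months are flagged, per state.
excluded_by_state = (
    acc_demo[flagged].groupby("merch_state").size().sort_values(ascending=False)
)
print(excluded_by_state)

# Assumption: drop flagged rows before building the index.
acc_clean = acc_demo[~flagged].reset_index(drop=True)
print(acc_clean)
```

Dropping flagged rows would leave gaps in the monthly series for small states such as Rhode Island, so the choice affects the trend shown in section 7.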


## 3. Merge Data Sources

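Join all four datasets on `merch_state` and `month_date` to create a unified dataset. Each step is a left join, so the result keeps only the state-month keys present in accommodations.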

In [9]:
# Merge accommodations and attractions
merge_1 = pd.merge(
    acc, 
    att, 
    how='left', 
    on=['merch_state', 'month_date'], 
    suffixes=('_acc', '_att')
)

print(f"After merging accommodations + attractions: {len(merge_1)} records")

After merging accommodations + attractions: 1128 records


In [10]:
# Merge restaurants and retail
merge_2 = pd.merge(
    rest, 
    ret, 
    how='left', 
    on=['merch_state', 'month_date'], 
    suffixes=('_rest', '_ret')
)

print(f"After merging restaurants + retail: {len(merge_2)} records")

After merging restaurants + retail: 1144 records


In [11]:
# Final merge to combine all four data sources
final = pd.merge(
    merge_1, 
    merge_2, 
    how='left', 
    on=['merch_state', 'month_date']
)

print(f"Final merged dataset: {len(final)} records")
print(f"Date range: {final['month_date'].min()} to {final['month_date'].max()}")
print(f"Unique states: {final['merch_state'].nunique()}")
print(f"Unique months: {final['month_date'].nunique()}")

Final merged dataset: 1128 records
Date range: 2024-01-01 to 2025-10-01
Unique states: 52
Unique months: 22
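The row counts (1128 for accommodations versus 1144 for attractions and restaurants) suggest that some state-month keys have no accommodations row and are silently dropped by the left joins. A sketch of one way to audit this uses the `indicator` and `validate` parameters of `pd.merge`. The frames are toy data: the `FL` and `TX` values are rounded from the outputs above, and the `PR` row is invented purely to produce an unmatched key.

```python
import pandas as pd

keys = ["merch_state", "month_date"]

left = pd.DataFrame({
    "merch_state": ["FL", "TX"],
    "month_date": ["2024-01-01", "2024-01-01"],
    "accommodation_cost": [222.83, 185.60],
})
right = pd.DataFrame({
    "merch_state": ["FL", "TX", "PR"],  # "PR" row is hypothetical
    "month_date": ["2024-01-01", "2024-01-01", "2024-01-01"],
    "attraction_cost": [32.34, 13.88, 10.00],
})

# An outer join keeps unmatched keys from both sides; indicator=True adds a
# "_merge" column; validate raises MergeError if keys repeat on either side.
audit = pd.merge(
    left, right, how="outer", on=keys, indicator=True, validate="one_to_one"
)

match_counts = audit["_merge"].value_counts()
right_only = audit.loc[audit["_merge"] == "right_only", keys]

print(match_counts)
print(right_only)
```

Keys reported as `right_only` are the ones a left join would discard. Running the same audit on `acc` against `att`, `rest`, and `ret` would show exactly which state-months are lost.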


In [12]:
# Check merged data structure
final.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1128 entries, 0 to 1127
Data columns (total 36 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   merch_state                 1128 non-null   object 
 1   month_date                  1128 non-null   object 
 2   accommodation_cost          1128 non-null   float64
 3   transaction_count_acc       1128 non-null   int64  
 4   unique_visitors_acc         1128 non-null   int64  
 5   min_cost                    1128 non-null   float64
 6   max_cost                    1128 non-null   float64
 7   q25                         1128 non-null   float64
 8   q75                         1128 non-null   float64
 9   data_quality_flag_acc       1128 non-null   object 
 10  period_start_acc            1128 non-null   object 
 11  period_end_acc              1128 non-null   object 
 12  calculation_timestamp_acc   1128 non-null   object 
 13  attraction_cost             1128 

## 4. Calculate Cost of Travel Index

Multiply each component cost by its basket quantity and sum the results:
- 2 nights accommodation
- 4 breakfast/lunch meals
- 2 dinner meals
- 2 days attractions
- 1 day retail (x2 for couple)

In [13]:
# Select and rename key columns for clarity
cti = final[[
    'month_date', 
    'merch_state', 
    'accommodation_cost', 
    'attraction_cost', 
    'breakfast_lunch_cost', 
    'dinner_cost', 
    'median_meal_cost', 
    'retail_cost'
]].copy()

cti.head()

Unnamed: 0,month_date,merch_state,accommodation_cost,attraction_cost,breakfast_lunch_cost,dinner_cost,median_meal_cost,retail_cost
0,2024-01-01,FL,222.830002,32.34,24.68,47.34,34.209999,39.619999
1,2024-01-01,TX,185.600006,13.88,19.889999,37.84,27.4,41.669998
2,2024-01-01,CA,213.929993,20.129999,19.690001,40.380001,27.959999,25.68
3,2024-01-01,NV,210.809998,34.73,24.1,47.389999,33.860001,25.85
4,2024-01-01,TN,183.119995,24.030001,22.549999,40.650002,31.35,42.580002


In [14]:
# Calculate total cost per trip (3-day weekend for a couple)
cti['cost_per_trip'] = (
    (cti['accommodation_cost'] * 2) +      # 2 nights
    (cti['attraction_cost'] * 2) +         # 2 days of attractions
    (cti['breakfast_lunch_cost'] * 4) +    # 4 breakfast/lunch meals
    (cti['dinner_cost'] * 2) +             # 2 dinner meals
    (cti['retail_cost'] * 2)               # 1 day of retail x 2 (couple)
)

print("Cost calculation complete!")
cti.head()

Cost calculation complete!


Unnamed: 0,month_date,merch_state,accommodation_cost,attraction_cost,breakfast_lunch_cost,dinner_cost,median_meal_cost,retail_cost,cost_per_trip
0,2024-01-01,FL,222.830002,32.34,24.68,47.34,34.209999,39.619999,782.980003
1,2024-01-01,TX,185.600006,13.88,19.889999,37.84,27.4,41.669998,637.540007
2,2024-01-01,CA,213.929993,20.129999,19.690001,40.380001,27.959999,25.68,678.999989
3,2024-01-01,NV,210.809998,34.73,24.1,47.389999,33.860001,25.85,733.959995
4,2024-01-01,TN,183.119995,24.030001,22.549999,40.650002,31.35,42.580002,670.959995
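As a sanity check, the first row can be recomputed by hand. The sketch below restates the basket quantities from the cell above as a dictionary and applies them to the Florida values for January 2024; `BASKET` and `trip_cost` are illustrative names that do not appear elsewhere in this notebook.

```python
# Basket quantities as used in the cost_per_trip cell above.
BASKET = {
    "accommodation_cost": 2,    # 2 nights
    "attraction_cost": 2,       # 2 days of attractions
    "breakfast_lunch_cost": 4,  # 4 breakfast/lunch meals
    "dinner_cost": 2,           # 2 dinner meals
    "retail_cost": 2,           # 1 day of retail x 2 (couple)
}

# Florida, 2024-01-01, copied from the preview above.
fl_jan_2024 = {
    "accommodation_cost": 222.830002,
    "attraction_cost": 32.34,
    "breakfast_lunch_cost": 24.68,
    "dinner_cost": 47.34,
    "retail_cost": 39.619999,
}


def trip_cost(row, basket=BASKET):
    """Sum each component cost times its basket quantity."""
    return sum(row[col] * qty for col, qty in basket.items())


fl_total = trip_cost(fl_jan_2024)
print(f"FL 2024-01 cost per trip: {fl_total:.2f}")
```

The result agrees with the `cost_per_trip` value of 782.98 shown for Florida, up to floating-point rounding in the displayed inputs.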


## 5. Data Cleaning and Filtering

In [15]:
# Remove 'XX' state (unknown/invalid state codes)
print(f"Records before filtering XX: {len(cti)}")
cti = cti[cti['merch_state'] != 'XX'].copy()
print(f"Records after filtering XX: {len(cti)}")

Records before filtering XX: 1128
Records after filtering XX: 1122


In [16]:
# Sort by state and date for better readability
cti = cti.sort_values(['merch_state', 'month_date']).reset_index(drop=True)

print("\nFinal dataset summary:")
print(f"Total records: {len(cti)}")
print(f"States: {cti['merch_state'].nunique()}")
print(f"Months: {cti['month_date'].nunique()}")
print(f"Date range: {cti['month_date'].min()} to {cti['month_date'].max()}")


Final dataset summary:
Total records: 1122
States: 51
Months: 22
Date range: 2024-01-01 to 2025-10-01


## 6. Summary Statistics and Validation

In [17]:
# Show cost range by component
print("Cost per trip statistics:")
print(cti['cost_per_trip'].describe())

print("\nComponent cost ranges:")
print(f"Accommodation (per night): ${cti['accommodation_cost'].min():.2f} - ${cti['accommodation_cost'].max():.2f}")
print(f"Attractions (per day): ${cti['attraction_cost'].min():.2f} - ${cti['attraction_cost'].max():.2f}")
print(f"Breakfast/Lunch (each): ${cti['breakfast_lunch_cost'].min():.2f} - ${cti['breakfast_lunch_cost'].max():.2f}")
print(f"Dinner (each): ${cti['dinner_cost'].min():.2f} - ${cti['dinner_cost'].max():.2f}")
print(f"Retail (per day): ${cti['retail_cost'].min():.2f} - ${cti['retail_cost'].max():.2f}")

Cost per trip statistics:
count    1122.000000
mean      718.914010
std        96.989814
min       495.919987
25%       661.899998
50%       703.400002
75%       753.320000
max      1291.679939
Name: cost_per_trip, dtype: float64

Component cost ranges:
Accommodation (per night): $136.16 - $525.22
Attractions (per day): $6.06 - $90.77
Breakfast/Lunch (each): $6.40 - $29.78
Dinner (each): $9.85 - $58.12
Retail (per day): $18.53 - $92.23


In [18]:
# Show most and least expensive states (using most recent month)
latest_month = cti['month_date'].max()
latest_data = cti[cti['month_date'] == latest_month].sort_values('cost_per_trip')

print(f"\nCost rankings for {latest_month}:\n")
print("Top 10 Most Expensive States:")
print(latest_data[['merch_state', 'cost_per_trip']].tail(10).to_string(index=False))

print("\nTop 10 Least Expensive States:")
print(latest_data[['merch_state', 'cost_per_trip']].head(10).to_string(index=False))


Cost rankings for 2025-10-01:

Top 10 Most Expensive States:
merch_state  cost_per_trip
         MD     785.359997
         NH     789.860012
         ME     793.180012
         OR     797.680000
         NY     802.580009
         HI     825.920021
         MA     829.879982
         RI     858.539970
         VT     904.219994
         DC    1028.919994

Top 10 Least Expensive States:
merch_state  cost_per_trip
         OK     577.740013
         MS     613.320007
         WV     639.959999
         KS     645.759998
         AR     648.939995
         NE     651.700005
         IA     656.860008
         CO     658.340000
         GA     659.999977
         VA     666.180008
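The ranking cell sorts ascending and then takes `tail(10)`, so the most expensive list reads from cheaper to pricier. If a list that starts with the most expensive state is preferred, `DataFrame.nlargest` and `DataFrame.nsmallest` are one alternative. The sketch uses six rows copied from the October 2025 output above; `latest_demo` is an illustrative name.

```python
import pandas as pd

# Six rows copied from the 2025-10-01 rankings above.
latest_demo = pd.DataFrame({
    "merch_state": ["DC", "VT", "RI", "OK", "MS", "WV"],
    "cost_per_trip": [
        1028.919994, 904.219994, 858.539970,
        577.740013, 613.320007, 639.959999,
    ],
})

# nlargest returns rows in descending order, nsmallest in ascending order.
most_expensive = latest_demo.nlargest(3, "cost_per_trip")
least_expensive = latest_demo.nsmallest(3, "cost_per_trip")

print(most_expensive.to_string(index=False))
print(least_expensive.to_string(index=False))
```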


In [19]:
# Check for any missing or null values
print("\nMissing value check:")
print(cti.isnull().sum())


Missing value check:
month_date              0
merch_state             0
accommodation_cost      0
attraction_cost         0
breakfast_lunch_cost    0
dinner_cost             0
median_meal_cost        0
retail_cost             0
cost_per_trip           0
dtype: int64


## 7. Sample State Analysis (Rhode Island)

In [20]:
# Focus on Rhode Island as an example
ri_data = cti[cti['merch_state'] == 'RI'].sort_values('month_date')
print("Rhode Island Cost of Travel Trend:")
print(ri_data[['month_date', 'cost_per_trip']].to_string(index=False))

Rhode Island Cost of Travel Trend:
month_date  cost_per_trip
2024-01-01     801.760029
2024-02-01     742.619987
2024-03-01     961.160004
2024-04-01    1038.140026
2024-05-01     864.299992
2024-06-01    1104.639977
2024-07-01    1232.340012
2024-08-01    1260.360008
2024-09-01     993.500008
2024-10-01     984.379974
2024-11-01     844.379993
2024-12-01     751.600010
2025-01-01     716.720013
2025-02-01     882.560028
2025-03-01     870.119991
2025-04-01     816.560005
2025-05-01     895.759987
2025-06-01     988.679985
2025-07-01     996.559994
2025-08-01     862.420021
2025-09-01     877.960014
2025-10-01     858.539970
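The monthly series is strongly seasonal, so comparing a month with the same month a year earlier can be more informative than reading the list top to bottom. A minimal sketch of a year-over-year comparison follows, using four Rhode Island values copied from the output above. Parsing `month_date` with `pd.to_datetime` is an added step (the column is stored as strings in this notebook), and `ri_demo`, `prior`, and `yoy` are illustrative names.

```python
import pandas as pd

# Four Rhode Island rows copied from the trend output above.
ri_demo = pd.DataFrame({
    "month_date": ["2024-01-01", "2024-10-01", "2025-01-01", "2025-10-01"],
    "cost_per_trip": [801.760029, 984.379974, 716.720013, 858.539970],
})
ri_demo["month_date"] = pd.to_datetime(ri_demo["month_date"])

# Shift each month forward one year so it lines up with the following year.
prior = ri_demo.assign(
    month_date=ri_demo["month_date"] + pd.DateOffset(years=1)
).rename(columns={"cost_per_trip": "cost_prior_year"})

# The inner join keeps only months that have a value one year earlier.
yoy = ri_demo.merge(prior, on="month_date", how="inner")
yoy["yoy_pct"] = (yoy["cost_per_trip"] / yoy["cost_prior_year"] - 1) * 100

print(yoy.to_string(index=False))
```

Several Rhode Island months carry the `EXCLUDE` quality flag in the accommodations data shown in section 2, so changes in this series may partly reflect small sample sizes.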


## 8. Export Final Dataset

In [21]:
# Export to CSV
output_filename = '../output/cti_state_2024_2025.csv'

# Create output directory if it doesn't exist
os.makedirs('../output', exist_ok=True)

cti.to_csv(output_filename, index=False)
print(f"\nData exported successfully to: {output_filename}")
print(f"Total records: {len(cti)}")
print(f"File size: {os.path.getsize(output_filename) / 1024:.2f} KB")


Data exported successfully to: ../output/cti_state_2024_2025.csv
Total records: 1122
File size: 153.09 KB


In [None]:
# Display final preview
print("\nFinal dataset preview (first 20 rows):")
cti.head(20)