Data from: https://www.ncei.noaa.gov/cdo-web/search?datasetid=GHCND

Daily Summaries were collected from three stations to estimate typical weather across regions of Minnesota:
- INTERNATIONAL FALLS INTERNATIONAL AIRPORT, MN US (N Region)
- MINNEAPOLIS ST. PAUL INTERNATIONAL AIRPORT, MN US (SE Region)
- ROCHESTER INTERNATIONAL AIRPORT, MN US (S/SE Region)

In [1]:
import sys
import os

project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
sys.path.insert(0, project_root)
import pandas as pd
import numpy as np
import geopandas as gpd
from shapely import wkt
import warnings

warnings.filterwarnings('ignore', category=UserWarning)
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('default', category=DeprecationWarning)

In [2]:
df = pd.read_csv('../data/raw/weather_data.csv')
df.tail(1)

Unnamed: 0,STATION,NAME,DATE,PRCP,SNOW,SNWD,TAVG,TMAX,TMIN,WT01,WT02,WT03,WT04,WT05,WT06,WT08,WT09
10682,USW00014922,"MINNEAPOLIS ST. PAUL INTERNATIONAL AIRPORT, MN US",2025-09-30,0.0,0.0,0.0,,84.0,66.0,,,,,,,,


Weather Type Codes
- WT01	Fog, ice fog, or freezing fog (or haze): Reduced visibility, increased slickness.
- WT02	Heavy fog or thick fog: Severely reduced visibility (major crash factor).
- WT03	Thunder: Often correlated with heavy rain and sudden visibility changes.
- WT04	Ice pellets, sleet, snow pellets, or small hail: Immediate increase in road slickness and difficulty controlling vehicles.
- WT05	Hail (larger): Can cause property damage and sudden driver maneuvers.
- WT06	Glaze or rime (freezing rain): The most dangerous condition for black ice formation.
- WT08	Smoke or ash: Reduced air quality and significant visibility reduction.
- WT09	Blowing or drifting snow: Reduced visibility and accumulation creating slick, uneven conditions.
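As a small illustration of how these flags can be read back, the sketch below maps each code to a short label and lists the conditions flagged on one record. The `WT_LABELS` dictionary, the `active_weather_types` helper, and the toy record are all hypothetical and are not part of the notebook's pipeline.

```python
import pandas as pd

# Short labels paraphrased from the code list above (hypothetical helper table).
WT_LABELS = {
    "WT01": "fog",
    "WT02": "heavy fog",
    "WT03": "thunder",
    "WT04": "ice pellets / sleet",
    "WT05": "hail",
    "WT06": "glaze or rime",
    "WT08": "smoke or ash",
    "WT09": "blowing or drifting snow",
}

def active_weather_types(row: pd.Series) -> list[str]:
    """Return the labels of every WT column flagged with 1 in this row."""
    return [label for code, label in WT_LABELS.items() if row.get(code) == 1]

# Toy record with made-up values: NaN means the condition was not reported.
toy = pd.Series({"WT01": 1.0, "WT02": float("nan"), "WT03": 1.0, "WT09": float("nan")})
print(active_weather_types(toy))
```

Codes that are absent from the record, or NaN, are simply treated as not flagged.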

## 1. Initial Weather Data 

In [3]:
df['DATE'] = pd.to_datetime(df['DATE'], format='%Y-%m-%d')
df = df.sort_values(by='DATE', ascending=True)

# WT columns hold 1 when a condition was reported and NaN otherwise, so fill NaN with 0
wt_cols = ['WT01', 'WT02', 'WT03', 'WT04', 'WT05', 'WT06', 'WT08', 'WT09']
df[wt_cols] = df[wt_cols].fillna(0)

# Coerce the numeric weather columns to numbers; unparseable entries become NaN
cols_to_check = ['TAVG', 'TMAX', 'TMIN', 'PRCP', 'SNOW']
for col in cols_to_check: 
    df[col] = pd.to_numeric(df[col], errors='coerce')

# Initial NaN counts
print(f'Initial NaN values in TAVG: {df["TAVG"].isna().sum()}, NaN values in PRCP: {df["PRCP"].isna().sum()}, NaN values in SNOW: {df["SNOW"].isna().sum()}')

Initial NaN values in TAVG: 134, NaN values in PRCP: 5, NaN values in SNOW: 9


## 2. Dealing with NaNs for Numerical Features



In [4]:
numerical_features = ['TAVG', 'PRCP', 'SNOW'] 

# First, where TMAX and TMIN are both available, estimate missing TAVG as their midpoint
df['TAVG_ESTIMATE'] = (df['TMAX'] + df['TMIN']) / 2
df['TAVG'] = df['TAVG'].fillna(df['TAVG_ESTIMATE'])
print(f"NaNs remaining in TAVG after TMAX/TMIN imputation: {df['TAVG'].isnull().sum()}")

# If NaN still present, fill with daily median across stations
for col in numerical_features:
    daily_median_across_stations = df.groupby('DATE')[col].transform('median')
    df[col] = df[col].fillna(daily_median_across_stations)
    print(f"NaNs remaining in {col} after peer median fill: {df[col].isnull().sum()}")

NaNs remaining in TAVG after TMAX/TMIN imputation: 30
NaNs remaining in TAVG after peer median fill: 0
NaNs remaining in PRCP after peer median fill: 0
NaNs remaining in SNOW after peer median fill: 0
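The two-step fill above can be checked on a toy frame. This is a minimal sketch with made-up stations and temperatures, assuming the same column names as the weather data. The first missing value is recovered from TMAX/TMIN, and the second has no TMAX/TMIN and falls back to the same-day median of the other stations.

```python
import numpy as np
import pandas as pd

# Made-up readings for three hypothetical stations over two days.
toy = pd.DataFrame({
    "DATE": pd.to_datetime(["2025-01-01"] * 3 + ["2025-01-02"] * 3),
    "STATION": ["A", "B", "C"] * 2,
    "TAVG": [np.nan, 10.0, 14.0, np.nan, 20.0, 30.0],
    "TMAX": [20.0, 15.0, 18.0, np.nan, 25.0, 35.0],
    "TMIN": [10.0, 5.0, 10.0, np.nan, 15.0, 25.0],
})

# Step 1: midpoint of TMAX and TMIN where both are present.
toy["TAVG"] = toy["TAVG"].fillna((toy["TMAX"] + toy["TMIN"]) / 2)

# Step 2: remaining gaps take the same-day median of the stations that reported.
toy["TAVG"] = toy["TAVG"].fillna(toy.groupby("DATE")["TAVG"].transform("median"))

print(toy[["DATE", "STATION", "TAVG"]])
```

If every station were missing a value on the same day, the median would itself be NaN and the gap would remain, so the remaining-NaN count is still worth printing after this step.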


## 3. Aggregate and Merge

In [5]:
daily_median_weather = df.groupby('DATE')[numerical_features].median().reset_index()
daily_median_weather.columns = ['DATE'] + [f'{col}_MEDIAN' for col in numerical_features]

daily_max_wt = df.groupby('DATE')[wt_cols].max().reset_index()
daily_max_wt.columns = ['DATE'] + [f'{col}_MAX' for col in wt_cols]

final_weather_df = pd.merge(daily_median_weather, daily_max_wt, on='DATE', how='inner') 
print(final_weather_df.sample(10))
print(final_weather_df.info())


           DATE  TAVG_MEDIAN  PRCP_MEDIAN  SNOW_MEDIAN  WT01_MAX  WT02_MAX  \
826  2018-04-06         23.0         0.00          0.0       0.0       0.0   
2045 2021-08-07         71.0         0.65          0.0       1.0       0.0   
3081 2024-06-08         63.0         0.04          0.0       1.0       0.0   
1988 2021-06-11         80.0         0.00          0.0       0.0       0.0   
112  2016-04-22         51.0         0.00          0.0       0.0       0.0   
3330 2025-02-12          2.0         0.02          0.1       0.0       0.0   
3216 2024-10-21         67.0         0.00          0.0       1.0       0.0   
1420 2019-11-21         39.0         0.43          0.1       1.0       0.0   
2291 2022-04-10         44.0         0.00          0.0       1.0       0.0   
3116 2024-07-13         75.0         0.43          0.0       1.0       0.0   

      WT03_MAX  WT04_MAX  WT05_MAX  WT06_MAX  WT08_MAX  WT09_MAX  
826        0.0       0.0       0.0       0.0       1.0       1.0  
2045   
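The aggregation pattern in this section can be seen on a toy frame: a median per day for numeric readings, a max per day for the 0/1 flags, then a merge on `DATE`. The values below are made up. The `validate="one_to_one"` argument is an optional addition not used in the cell above; it makes pandas raise an error if either side has duplicate dates.

```python
import pandas as pd

# Made-up readings: three stations per day.
toy = pd.DataFrame({
    "DATE": pd.to_datetime(["2025-01-01"] * 3 + ["2025-01-02"] * 3),
    "TAVG": [10.0, 14.0, 30.0, 20.0, 22.0, 21.0],
    "WT01": [0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
})

# One row per day: median temperature, and a flag set if any station reported it.
medians = toy.groupby("DATE")[["TAVG"]].median().add_suffix("_MEDIAN").reset_index()
flags = toy.groupby("DATE")[["WT01"]].max().add_suffix("_MAX").reset_index()

# validate raises MergeError if DATE is duplicated on either side.
daily = pd.merge(medians, flags, on="DATE", how="inner", validate="one_to_one")
print(daily)
```

Taking the max of a 0/1 flag across stations means a day is flagged when at least one station reported the condition.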

## 4. Save Output for Weather (Optional)
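An alternative to disabling a save cell by wrapping it in a string is a boolean switch. The sketch below is one possible shape, not the notebook's method: `SAVE_OUTPUTS` and `save_if_enabled` are hypothetical names, and a temporary directory stands in for `../data/processed`.

```python
import os
import tempfile
from typing import Optional

import pandas as pd

SAVE_OUTPUTS = False  # hypothetical switch; set to True to write files

def save_if_enabled(frame: pd.DataFrame, directory: str, filename: str,
                    enabled: bool) -> Optional[str]:
    """Write frame to directory/filename when enabled; return the path or None."""
    if not enabled:
        return None
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, filename)
    frame.to_csv(path, index=False)
    return path

# Made-up one-row frame standing in for the cleaned weather table.
toy = pd.DataFrame({"DATE": ["2025-01-01"], "TAVG_MEDIAN": [14.0]})

with tempfile.TemporaryDirectory() as tmp:
    skipped = save_if_enabled(toy, tmp, "cleaned_weather.csv", enabled=SAVE_OUTPUTS)
    written = save_if_enabled(toy, tmp, "cleaned_weather.csv", enabled=True)
    written_exists = os.path.exists(written)

print(skipped, written_exists)
```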

In [6]:
# Disabled by default: remove the surrounding triple quotes to run this cell.
"""
# Create processed data directory if it doesn't exist
import os
processed_dir = '../data/processed'
os.makedirs(processed_dir, exist_ok=True)

# Save the cleaned dataset
cleaned_file_path = os.path.join(processed_dir, 'cleaned_weather.csv')
final_weather_df.to_csv(cleaned_file_path, index=False)

print(f"Cleaned dataset saved to: {cleaned_file_path}")
"""

## 5. Initial Traffic Data

In [7]:
df = pd.read_csv('../data/raw/all_crashes.csv')

print("Dataset Shape:", df.shape)
print("\nColumns:", df.columns.tolist())
print("\nSample of first few rows:")
display(df.head())

print("\nDataset Info:")
df.info()

Dataset Shape: (642813, 40)

Columns: ['geom', 'Num_Occupants', 'Num_MotorVehicles', 'Fatalities_calc', 'SuspectedSeriousInjury_calc', 'NonMotorist_calc', 'Aggregated_Non_Motorist_txt', 'CityTownship', 'CountyNameTxt', 'Region', 'Zipcode_txtCARTO', 'TribalLand_txtCARTO', 'InjuryLevel', 'ExceedingSpeedLimitInd_boolCARTO', 'Seatbelt_boolCARTO', 'DistractedDrivingCde_boolCARTO', 'AlcoholSuspectedCde_txtCARTO', 'DrugSuspectedCde_txtCARTO', 'WeatherCde_txtCARTO', 'SurfaceConditionCde_txtCARTO', 'RdwyTypeCde_txtCARTO', 'RelativeLocIntersectCde', 'PostedSpeedNbr_binCARTO', 'WorkZoneInd_txtCARTO', 'DateOfIncident', 'AgencyIdNbr_txtCARTO', 'Injuries_calc', 'LatDec', 'LongDec', 'RdwyNameTxt', 'LocalCodeTxt', 'doi_date', 'doi_time', 'doi_dow_txtCARTO', 'HitAndRunInd_txtCARTO', 'VehicleMakeModelYearTxt_list', 'VehicleTypeCde_txtCARTO_list', 'doi_hour_txtCARTO', 'CrashType', 'CrashDetail']

Sample of first few rows:


Unnamed: 0,geom,Num_Occupants,Num_MotorVehicles,Fatalities_calc,SuspectedSeriousInjury_calc,NonMotorist_calc,Aggregated_Non_Motorist_txt,CityTownship,CountyNameTxt,Region,...,LocalCodeTxt,doi_date,doi_time,doi_dow_txtCARTO,HitAndRunInd_txtCARTO,VehicleMakeModelYearTxt_list,VehicleTypeCde_txtCARTO_list,doi_hour_txtCARTO,CrashType,CrashDetail
0,POINT(-93.38856788 45.28870554),7,2,1,1,0,Motorist Only,ANDOVER,ANOKA,Metro,...,25217514,"September 30, 2025",06:59:00 AM,Tuesday,"No, Did not Leave Scene","2022 GMC SIERRA, 2015 UNKNOWN 3000","Pickup, School Bus",6:00 AM - 6:59 AM,Collision w/ Non-Fixed Object,Motor Vehicle in Transport
1,POINT(-93.37732941 45.25945286),2,2,0,0,0,Motorist Only,ANDOVER,ANOKA,Metro,...,25217042,"September 30, 2025",02:55:00 PM,Tuesday,"No, Did not Leave Scene","2015 CHEVROLET MALIBU, 2004 CHEVROLET SILVERADO","Passenger Car, Pickup",2:00 PM - 2:59 PM,Collision w/ Non-Fixed Object,Motor Vehicle in Transport
2,POINT(-94.14949353 45.12032389),1,1,0,0,0,Motorist Only,COKATO,WRIGHT,East Central,...,25027310,"September 30, 2025",06:58:00 PM,Tuesday,"No, Did not Leave Scene",2017 CHEVROLET EQUINOX,Sport Utility Vehicle,6:00 PM - 6:59 PM,Non-Collision,Overturn/Rollover
3,POINT(-94.14932826 45.1203201),1,1,0,0,0,Motorist Only,COKATO,WRIGHT,East Central,...,WP25027349,"September 30, 2025",07:20:00 AM,Tuesday,"No, Did not Leave Scene",2017 CHEVROLET BOLT EV,Sport Utility Vehicle,7:00 AM - 7:59 AM,Collision w/ Fixed Object,Standing Tree/Shrubbery
4,POINT(-93.31679869 45.18055865),2,2,0,0,0,Motorist Only,COON RAPIDS,ANOKA,Metro,...,25217663,"September 30, 2025",10:52:00 AM,Tuesday,"No, Did not Leave Scene","2025 TESLA MODEL 3, 2014 BMW 535","Passenger Car, Passenger Car",10:00 AM - 10:59 AM,Collision w/ Non-Fixed Object,Motor Vehicle in Transport



Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 642813 entries, 0 to 642812
Data columns (total 40 columns):
 #   Column                            Non-Null Count   Dtype  
---  ------                            --------------   -----  
 0   geom                              642813 non-null  object 
 1   Num_Occupants                     642813 non-null  int64  
 2   Num_MotorVehicles                 642813 non-null  int64  
 3   Fatalities_calc                   642813 non-null  int64  
 4   SuspectedSeriousInjury_calc       642813 non-null  int64  
 5   NonMotorist_calc                  642813 non-null  int64  
 6   Aggregated_Non_Motorist_txt       642813 non-null  object 
 7   CityTownship                      642055 non-null  object 
 8   CountyNameTxt                     642813 non-null  object 
 9   Region                            642813 non-null  object 
 10  Zipcode_txtCARTO                  626752 non-null  float64
 11  TribalLand_txtCARTO               109

In [8]:
# Convert the 'geom' column from WKT to geometry objects
df['geometry'] = df['geom'].apply(wkt.loads)
gdf = gpd.GeoDataFrame(df, geometry='geometry', crs='EPSG:4326')

gdf['DateOfIncident'] = pd.to_datetime(gdf['DateOfIncident'])
gdf['DATE'] = gdf['DateOfIncident'].dt.date
gdf['day_of_week'] = gdf['DateOfIncident'].dt.day_name()
gdf['month'] = gdf['DateOfIncident'].dt.month_name()
display(gdf[['DateOfIncident', 'DATE','day_of_week', 'month']].head())

Unnamed: 0,DateOfIncident,DATE,day_of_week,month
0,2025-09-30 06:59:00,2025-09-30,Tuesday,September
1,2025-09-30 14:55:00,2025-09-30,Tuesday,September
2,2025-09-30 18:58:00,2025-09-30,Tuesday,September
3,2025-09-30 07:20:00,2025-09-30,Tuesday,September
4,2025-09-30 10:52:00,2025-09-30,Tuesday,September
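Note that `.dt.date` produces plain Python `date` objects in an object column, while the weather table's `DATE` is `datetime64`. Whether `build_master_dataset` reconciles the two is not shown here. As a hedged sketch with made-up values, `.dt.normalize()` is one way to drop the time of day while keeping the `datetime64` dtype, so that the `.dt` accessor and date-keyed merges keep working.

```python
import pandas as pd

crashes = pd.DataFrame({
    "DateOfIncident": pd.to_datetime(["2025-09-30 06:59:00", "2025-09-30 14:55:00"]),
})
weather = pd.DataFrame({
    "DATE": pd.to_datetime(["2025-09-30"]),
    "TAVG_MEDIAN": [75.0],  # made-up value
})

# normalize() resets the time to midnight but keeps the datetime64 dtype.
crashes["DATE"] = crashes["DateOfIncident"].dt.normalize()

# Both DATE columns now share a dtype, so every crash picks up that day's weather.
merged = crashes.merge(weather, on="DATE", how="left")
print(merged[["DateOfIncident", "DATE", "TAVG_MEDIAN"]])
```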


Keep the columns describing location, roadway type, weather and surface conditions, and date as candidate features.

In [9]:
selected_cols = ['geometry', 'CountyNameTxt', 'Region', 
                'WeatherCde_txtCARTO', 'SurfaceConditionCde_txtCARTO', 'RdwyTypeCde_txtCARTO', 'DATE']
new_df = gdf[selected_cols].copy()

new_df.head()

Unnamed: 0,geometry,CountyNameTxt,Region,WeatherCde_txtCARTO,SurfaceConditionCde_txtCARTO,RdwyTypeCde_txtCARTO,DATE
0,POINT (-93.38857 45.28871),ANOKA,Metro,Cloudy,Dry,County State Aid Highway - CSAH,2025-09-30
1,POINT (-93.37733 45.25945),ANOKA,Metro,Clear,Dry,County State Aid Highway - CSAH,2025-09-30
2,POINT (-94.14949 45.12032),WRIGHT,East Central,Clear,Dry,County State Aid Highway - CSAH,2025-09-30
3,POINT (-94.14933 45.12032),WRIGHT,East Central,Clear,Dry,County State Aid Highway - CSAH,2025-09-30
4,POINT (-93.3168 45.18056),ANOKA,Metro,Clear,Dry,County State Aid Highway - CSAH,2025-09-30


## 6. Handle Missing Values for Traffic

In [10]:
missing_values = new_df.isnull().sum()
print("Missing Values in Each Column:")
print(missing_values)

Missing Values in Each Column:
geometry                        0
CountyNameTxt                   0
Region                          0
WeatherCde_txtCARTO             0
SurfaceConditionCde_txtCARTO    0
RdwyTypeCde_txtCARTO            0
DATE                            0
dtype: int64


Find values that are null, missing, or recorded as unknown, and mark them all as 'unknown' for uniformity.
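The project's `standardize_unknown_values` lives in `src.preprocessing` and is not shown in this notebook. The sketch below is only a guess at the general idea, under stated assumptions: the function name `standardize_unknown_sketch`, the `UNKNOWN_TOKENS` list, and the toy values are all hypothetical.

```python
import pandas as pd

# Hypothetical placeholder spellings; the real list used by the project is not shown.
UNKNOWN_TOKENS = {"", "unknown", "not reported", "n/a", "none"}

def standardize_unknown_sketch(frame: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Replace nulls and placeholder spellings in the given columns with 'unknown'."""
    out = frame.copy()
    for col in columns:
        missing = out[col].isna()
        text = out[col].astype(str).str.strip().str.lower()
        is_unknown = missing | text.isin(UNKNOWN_TOKENS)
        out[col] = out[col].mask(is_unknown, "unknown")
    return out

toy = pd.DataFrame({
    "WeatherCde_txtCARTO": ["Clear", None, "UNKNOWN", " Not Reported ", "Snow"],
})
result = standardize_unknown_sketch(toy, ["WeatherCde_txtCARTO"])
print(result["WeatherCde_txtCARTO"].tolist())
```

Restricting the replacement to named text columns keeps geometry and date columns untouched.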


In [11]:
for col in new_df.columns:
    print(f"{col}: {new_df[col].unique()}")

geometry: <GeometryArray>
[<POINT (-93.389 45.289)>, <POINT (-93.377 45.259)>,  <POINT (-94.149 45.12)>,
  <POINT (-94.149 45.12)>, <POINT (-93.317 45.181)>, <POINT (-93.355 45.051)>,
  <POINT (-94.182 46.51)>, <POINT (-93.308 44.993)>, <POINT (-93.279 44.963)>,
 <POINT (-92.448 44.028)>,
 ...
 <POINT (-94.206 45.559)>,  <POINT (-94.219 45.55)>,  <POINT (-95.33 48.934)>,
 <POINT (-92.892 45.373)>, <POINT (-92.893 45.373)>, <POINT (-93.561 45.325)>,
 <POINT (-92.799 47.478)>, <POINT (-94.586 45.153)>,  <POINT (-92.845 45.39)>,
 <POINT (-96.768 46.862)>]
Length: 617291, dtype: geometry
CountyNameTxt: ['ANOKA' 'WRIGHT' 'HENNEPIN' 'CROW WING' 'OLMSTED' 'WINONA' 'DAKOTA'
 'WASHINGTON' 'RAMSEY' 'CARLTON' 'PINE' 'STEARNS' 'SAINT LOUIS' 'SIBLEY'
 'LAKE' 'RICE' 'CLAY' 'SHERBURNE' 'ISANTI' 'NICOLLET' 'BROWN' 'SCOTT'
 'RENVILLE' 'BLUE EARTH' 'CARVER' 'WATONWAN' 'CASS' 'KOOCHICHING' 'MEEKER'
 'BENTON' 'MILLE LACS' 'LINCOLN' 'LE SUEUR' 'WABASHA' 'GOODHUE' 'GRANT'
 'MCLEOD' 'LYON' 'OTTER TAIL' 'WADE

In [12]:
import os
import sys
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
sys.path.insert(0, project_root)
from src.preprocessing import standardize_unknown_values
    
# Standardize unknown/missing values
new_df = standardize_unknown_values(new_df)

# Calculate statistics for unknown values
total_records = len(new_df)
unknown_stats = {}

for col in new_df.columns:
    # Check every column, including geometry and DATE (both report 0 unknowns)
    unknown_count = (new_df[col] == 'unknown').sum()
    unknown_percentage = (unknown_count / total_records) * 100
    unknown_stats[col] = {
        'unknown_count': unknown_count,
        'unknown_percentage': unknown_percentage
    }

# Display the results
print(f"Total records in dataset: {total_records:,}\n")
print("Unknown Values Statistics:")
print("-" * 60)
print(f"{'Column':<30} {'Count':>10} {'Percentage':>12}")
print("-" * 60)
for col, stats in unknown_stats.items():
    print(f"{col:<30} {stats['unknown_count']:>10,} {stats['unknown_percentage']:>11.2f}%")

Total records in dataset: 642,813

Unknown Values Statistics:
------------------------------------------------------------
Column                              Count   Percentage
------------------------------------------------------------
geometry                                0        0.00%
CountyNameTxt                           0        0.00%
Region                                  0        0.00%
WeatherCde_txtCARTO                 5,002        0.78%
SurfaceConditionCde_txtCARTO        4,626        0.72%
RdwyTypeCde_txtCARTO                6,904        1.07%
DATE                                    0        0.00%
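The same counts and percentages can also be computed without an explicit loop. This is a minimal sketch on a made-up frame that reuses two of the column names above.

```python
import pandas as pd

# Made-up values for two of the categorical columns.
toy = pd.DataFrame({
    "WeatherCde_txtCARTO": ["Clear", "unknown", "Snow", "Clear"],
    "SurfaceConditionCde_txtCARTO": ["Dry", "Dry", "unknown", "unknown"],
})

# Boolean frame: True wherever a cell equals the 'unknown' marker.
is_unknown = toy.eq("unknown")

summary = pd.DataFrame({
    "unknown_count": is_unknown.sum(),
    "unknown_percentage": is_unknown.mean() * 100,
})
print(summary)
```

The mean of a boolean column is the fraction of True values, which is why multiplying it by 100 gives the percentage directly.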


In [13]:
display(new_df.describe(include='all'))
display(pd.DataFrame(new_df.columns.tolist(), columns=["Column Name"]))

Unnamed: 0,geometry,CountyNameTxt,Region,WeatherCde_txtCARTO,SurfaceConditionCde_txtCARTO,RdwyTypeCde_txtCARTO,DATE
count,642813,642813,642813,642813,642813,642813,642813
unique,617291,87,8,9,12,37,3588
top,POINT (-89.506587842973 48.004777071176),HENNEPIN,Metro,Clear,Dry,County State Aid Highway - CSAH,2017-12-28
freq,16056,188806,410911,418405,438905,138526,1045


Unnamed: 0,Column Name
0,geometry
1,CountyNameTxt
2,Region
3,WeatherCde_txtCARTO
4,SurfaceConditionCde_txtCARTO
5,RdwyTypeCde_txtCARTO
6,DATE


## 7. Save Cleaned Dataset for Traffic (Optional)

In [14]:
# Disabled by default: remove the surrounding triple quotes to run this cell.
"""
processed_dir = '../data/processed'
os.makedirs(processed_dir, exist_ok=True)

# Save the cleaned dataset
cleaned_file_path = os.path.join(processed_dir, 'cleaned_crashes.csv')
new_df.to_csv(cleaned_file_path, index=False)

print(f"Cleaned dataset saved to: {cleaned_file_path}")
"""

## 8. Combine Datasets

In [15]:
from src.preprocessing import build_master_dataset

# Testing on small sample first (2016-2017)
master_df_final2 = build_master_dataset('2016-01-01', '2017-12-31', new_df, final_weather_df, gdf)
master_df_final2.tail()

Merging crash and weather data...
Generating grid...
Generating cell_day_df from 2016-01-01 to 2018-01-01...
Filtering master_df to feature window and attaching cell_id...
Merging with cell_day_df to attach crash_tomorrow...
Final record count: 101662
crash_tomorrow value counts:
crash_tomorrow
0.0    94894
1.0     6768
Name: count, dtype: int64


Unnamed: 0,geometry,CountyNameTxt,Region,WeatherCde_txtCARTO,SurfaceConditionCde_txtCARTO,RdwyTypeCde_txtCARTO,DATE,TAVG_MEDIAN,PRCP_MEDIAN,SNOW_MEDIAN,WT01_MAX,WT02_MAX,WT03_MAX,WT04_MAX,WT05_MAX,WT06_MAX,WT08_MAX,WT09_MAX,cell_id,crash_tomorrow
148193,POINT (-93.52066 45.41451),SHERBURNE,East Central,Cloudy,Ice/Frost,Township Road,2017-12-31,-10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,137_197,0.0
148194,POINT (-93.37469 45.20425),ANOKA,Metro,Clear,Ice/Frost,U.S. Trunk Highway - USTH,2017-12-31,-10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,161_151,1.0
148195,POINT (-93.40051 44.98642),HENNEPIN,Metro,Clear,Ice/Frost,U.S. Trunk Highway - USTH,2017-12-31,-10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,159_102,0.0
148196,POINT (-93.40085 45.03769),HENNEPIN,Metro,Clear,Ice/Frost,U.S. Trunk Highway - USTH,2017-12-31,-10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,158_113,0.0
148199,POINT (-94.20524 45.55269),STEARNS,East Central,Clear,Ice/Frost,State Trunk Highway - MNTH,2017-12-31,-10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,30_225,0.0
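`build_master_dataset` is defined in `src.preprocessing` and its internals are not shown here. To make the `cell_id` and `crash_tomorrow` columns easier to read, the sketch below shows one way such columns could be derived. It assumes a regular latitude/longitude grid and a "same cell has a crash on the next day" label. The cell size, grid origin, `col_row` naming, and toy coordinates are all hypothetical.

```python
import pandas as pd

CELL_SIZE_DEG = 0.05                  # hypothetical cell size in degrees
ORIGIN_LON, ORIGIN_LAT = -97.5, 43.0  # hypothetical south-west grid origin

# Made-up crash records: two in the same place on consecutive days, one elsewhere.
toy = pd.DataFrame({
    "lon": [-93.38, -93.38, -94.16],
    "lat": [45.29, 45.29, 45.12],
    "DATE": pd.to_datetime(["2017-01-01", "2017-01-02", "2017-01-01"]),
})

# Integer column/row of the grid cell containing each point.
col = ((toy["lon"] - ORIGIN_LON) // CELL_SIZE_DEG).astype(int)
row = ((toy["lat"] - ORIGIN_LAT) // CELL_SIZE_DEG).astype(int)
toy["cell_id"] = col.astype(str) + "_" + row.astype(str)

# A record is positive when the same cell also has a crash on the following day.
crash_days = set(zip(toy["cell_id"], toy["DATE"]))
toy["crash_tomorrow"] = [
    float((cell, day + pd.Timedelta(days=1)) in crash_days)
    for cell, day in zip(toy["cell_id"], toy["DATE"])
]
print(toy)
```

Under this reading, labelling the last feature day requires crash data for one additional day.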


In [16]:
# Full dataset: features from 2016-01-01 through 2025-09-28 (cell_day_df extends one day further, to 2025-09-29, per the log below)
master_df = build_master_dataset('2016-01-01', '2025-09-28', new_df, final_weather_df, gdf)

pd.set_option('display.max_columns', None)
master_df = master_df.sort_values(by='DATE', ascending=True)



Merging crash and weather data...
Generating grid...
Generating cell_day_df from 2016-01-01 to 2025-09-29...
Filtering master_df to feature window and attaching cell_id...
Merging with cell_day_df to attach crash_tomorrow...
Final record count: 451145
crash_tomorrow value counts:
crash_tomorrow
0.0    425242
1.0     25903
Name: count, dtype: int64
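The value counts above show a strongly imbalanced label. The short calculation below uses only the printed counts to express that as a positive rate and a negatives-per-positive ratio, which is worth keeping in mind when reading accuracy-style metrics later.

```python
# Counts copied from the printed crash_tomorrow value counts above.
counts = {0: 425_242, 1: 25_903}

total = sum(counts.values())           # matches the printed final record count
positive_rate = counts[1] / total      # share of records followed by a crash
neg_to_pos = counts[0] / counts[1]     # negatives per positive

print(f"total: {total:,}")
print(f"positive rate: {positive_rate:.2%}")
print(f"negatives per positive: {neg_to_pos:.1f}")
```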


In [17]:
# Add date features BEFORE saving
master_df['day_of_week'] = master_df['DATE'].dt.dayofweek  # 0=Monday, 6=Sunday
master_df['day_of_year'] = master_df['DATE'].dt.dayofyear  # 1-365, or 1-366 in leap years
master_df['month'] = master_df['DATE'].dt.month  # 1-12

# Reorder columns so crash_tomorrow is last
cols = [c for c in master_df.columns if c != 'crash_tomorrow']
cols.append('crash_tomorrow')
master_df = master_df[cols]

master_df.describe()

# Save the master dataset (processed_dir is reused by the train/test split below)
processed_dir = '../data/processed'
os.makedirs(processed_dir, exist_ok=True)
final_df_path = os.path.join(processed_dir, 'master_df.csv')
master_df.to_csv(final_df_path, index=False)

In [18]:
master_df.tail(2)

Unnamed: 0,geometry,CountyNameTxt,Region,WeatherCde_txtCARTO,SurfaceConditionCde_txtCARTO,RdwyTypeCde_txtCARTO,DATE,TAVG_MEDIAN,PRCP_MEDIAN,SNOW_MEDIAN,WT01_MAX,WT02_MAX,WT03_MAX,WT04_MAX,WT05_MAX,WT06_MAX,WT08_MAX,WT09_MAX,cell_id,day_of_week,day_of_year,month,crash_tomorrow
638314,POINT (-93.16705 45.02108),RAMSEY,Metro,Clear,Dry,State Trunk Highway - MNTH,2025-09-28,61.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,195_111,6,271,9,0.0
638386,POINT (-93.29897 44.74644),DAKOTA,Metro,Clear,Dry,County State Aid Highway - CSAH,2025-09-28,61.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,176_49,6,271,9,0.0


In [19]:
cutoff = pd.Timestamp("2024-01-01")
df_training = master_df[master_df["DATE"] < cutoff]
df_testing = master_df[master_df["DATE"] >= cutoff]

df_training.to_csv(os.path.join(processed_dir, 'master_training.csv'), index=False)
df_testing.to_csv(os.path.join(processed_dir, 'master_testing.csv'), index=False)
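A quick sanity check for a date-based split is that the two parts cover every row and do not overlap in time. The sketch below runs that check on a made-up frame with the same cutoff.

```python
import pandas as pd

# Made-up rows straddling the cutoff.
toy = pd.DataFrame({
    "DATE": pd.to_datetime(["2023-12-30", "2023-12-31", "2024-01-01", "2024-01-02"]),
    "crash_tomorrow": [0.0, 1.0, 0.0, 0.0],
})

cutoff = pd.Timestamp("2024-01-01")
train = toy[toy["DATE"] < cutoff]
test = toy[toy["DATE"] >= cutoff]

# Every row lands in exactly one part, and all training dates precede all test dates.
assert len(train) + len(test) == len(toy)
assert train["DATE"].max() < test["DATE"].min()
print(len(train), len(test))
```

Because `crash_tomorrow` looks one day ahead, the label on the last training day (2023-12-31) describes the first test day. Whether that boundary matters depends on how the downstream model is evaluated.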