## Combine Cleaned Data

Finally, we combine the cleaned per-year files into a single dataset for modeling.

In [1]:
import pandas as pd
import glob
import os

In [2]:
path = "data/processed/atp_*.csv" # Glob pattern matching all cleaned per-year CSV files
all_files = sorted(glob.glob(path)) # Sort so rows are combined in a reproducible order

In [3]:
# Print how many files were found
print(f"Found {len(all_files)} cleaned CSV files.")

Found 9 cleaned CSV files.


In [4]:
df_list = [] # List to store each csv's dataframe

# Loop through each csv, read in as a dataframe, store in list
for filename in all_files:
    df = pd.read_csv(filename)
    df_list.append(df)

all_data = pd.concat(df_list, ignore_index=True) # Combine all dataframes into one
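
By default, `pd.concat` aligns frames on column names and fills any column missing from a file with `NaN`, so a renamed column in one year would go unnoticed. The sketch below shows one way to compare columns before combining. The helper `column_mismatches`, the toy file labels, and the `winner_name` / `winner` columns are hypothetical and only serve the illustration.

```python
import pandas as pd

def column_mismatches(frames):
    """Compare every frame's columns against the first frame.

    `frames` maps a label (e.g. a filename) to a DataFrame.
    Returns {label: (missing_columns, extra_columns)} for frames that differ.
    """
    labels = list(frames)
    reference = set(frames[labels[0]].columns)
    report = {}
    for label in labels[1:]:
        cols = set(frames[label].columns)
        missing = sorted(reference - cols)
        extra = sorted(cols - reference)
        if missing or extra:
            report[label] = (missing, extra)
    return report

# Toy frames standing in for two yearly files (hypothetical names and columns)
frames = {
    "atp_2016.csv": pd.DataFrame({"tourney_name": ["US Open"], "winner_name": ["A"]}),
    "atp_2017.csv": pd.DataFrame({"tourney_name": ["Us Open"], "winner": ["B"]}),
}

mismatches = column_mismatches(frames)
print(mismatches)
```

An empty report would suggest the files can be concatenated without introducing all-`NaN` columns.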

In [5]:
# Print total rows after combining
print(f"Combined dataset contains {all_data.shape[0]} rows and {all_data.shape[1]} columns.")

Combined dataset contains 22790 rows and 57 columns.


In [6]:
# Drop helper column
all_data = all_data.drop('Unnamed: 0', axis=1)

# Remove NextGen Finals matches (exhibition-style tournament)
all_data = all_data[all_data["tourney_name"] != "NextGen Finals"].copy()

# Fix inconsistency in naming of the US Open tournament
# (applied to the combined data, not to the last file read in the loop)
all_data['tourney_name'] = all_data['tourney_name'].replace({
    'Us Open': 'US Open',
    'U.S. Open': 'US Open'
})
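
The renaming is meant to leave a single spelling of the US Open in the combined data. A quick way to confirm this is to list the distinct names that still contain "open" after the replacement. The sketch below uses a toy frame in place of the real data, and the name `toy` is hypothetical.

```python
import pandas as pd

# Toy data with the three spellings handled in the cell above, plus one unrelated event
toy = pd.DataFrame({
    "tourney_name": ["US Open", "Us Open", "U.S. Open", "Wimbledon"]
})

# Same mapping as in the notebook: exact-match replacement of whole values
toy["tourney_name"] = toy["tourney_name"].replace({
    "Us Open": "US Open",
    "U.S. Open": "US Open",
})

# Case-insensitive search for any remaining "open" spellings
is_open = toy["tourney_name"].str.contains("open", case=False)
variants = sorted(toy.loc[is_open, "tourney_name"].unique())
print(variants)
```

On the real dataset the same search would also surface other tournaments with "Open" in their name, so the list is something to inspect by eye, not to assert against.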

In [7]:
# Save combined data
all_data.to_csv("data/processed/tennis_clean_2016_2024.csv", index=False)
print("Saved combined dataset: tennis_clean_2016_2024.csv")

Saved combined dataset: tennis_clean_2016_2024.csv
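
A column named `Unnamed: 0`, like the helper column dropped earlier, typically appears when a CSV is written with its index and then read back, which is why the save above passes `index=False`. The sketch below illustrates the difference on a toy frame written to a temporary directory. The file names and the `best_of` column are hypothetical.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"tourney_name": ["US Open", "Wimbledon"], "best_of": [5, 5]})

with tempfile.TemporaryDirectory() as tmp:
    with_index_path = os.path.join(tmp, "with_index.csv")
    without_index_path = os.path.join(tmp, "without_index.csv")

    toy.to_csv(with_index_path)                  # default: index written as an unlabeled first column
    toy.to_csv(without_index_path, index=False)  # index omitted, as in the notebook

    with_index = pd.read_csv(with_index_path)
    without_index = pd.read_csv(without_index_path)

print(list(with_index.columns))
print(list(without_index.columns))
print(without_index.shape == toy.shape)
```

Reading the saved file back and comparing its shape with the in-memory frame is a cheap round-trip check before moving on to modeling.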
