<center><h1><font size=6> Splitting the Data into Training and Testing Sets </font></h1></center>

At this point in the process, I need to separate my training and testing datasets. I want to do this before running any Exploratory Data Analysis (EDA) or advanced feature engineering/feature selection, so that the testing set stays completely unseen. This guards against data snooping bias, where patterns or behaviours I notice in the test set end up influencing my modelling decisions.

But before splitting the data into train and test sets, I want to prepare the data for two different models. This is because the extent of EPL data collection improved in the 2017-18 season, when variables like expected goals started to be collected. These could be important features, so I want to train one model using only the more recent data, which includes these variables, and compare it to a simpler model trained on the longer history without them.

### Load libraries and setup notebook configuration

In [1]:
# import packages
import pandas as pd 
import numpy as np
import os
from pathlib import Path
import warnings


# set pandas configurations
pd.set_option("display.precision", 2) # display to 2 decimal places
pd.set_option("display.max_columns", None) # display all columns so we can view the whole dataset
pd.set_option('display.float_format', '{:.2f}'.format) # Disable scientific notation for pandas
warnings.filterwarnings("ignore", category=pd.errors.SettingWithCopyWarning) # Disable setting with copy warnings


# set directories
os.chdir('..') # change current working directory to the parent directory to help access files/directories at a higher level
DATAPATH = Path(r'data') # set data path


# import from source directory
from src import constants

### Load data from local file

In [2]:
# load EPL match data
matches = pd.read_csv(f"{DATAPATH}/processed/matches_processed.csv")
matches['date'] = pd.to_datetime(matches['date'])
matches.head(5)

Unnamed: 0,unique_match_id,season,date,day_of_week,round,day,team,promoted,opponent,promoted_opponent,home,points,days_since_last_game,games_played_last_21_days,pl_total_points,pl_total_gf,pl_total_ga,pl_total_goal_diff,pl_position,last_h2h,last_h2h_form,last_h2h_venue,last_h2h_venue_form,prev_season_points,prev_season_gf,prev_season_ga,prev_season_goal_diff,points_pl_form,gf_pl_form,ga_pl_form,poss_pl_form,xg_pl_form,xga_pl_form,days_since_last_game_opponent,games_played_last_21_days_opponent,pl_total_points_opponent,pl_total_gf_opponent,pl_total_ga_opponent,pl_total_goal_diff_opponent,pl_position_opponent,points_pl_form_opponent,gf_pl_form_opponent,ga_pl_form_opponent,poss_pl_form_opponent,xg_pl_form_opponent,xga_pl_form_opponent,prev_season_points_opponent,prev_season_gf_opponent,prev_season_ga_opponent,prev_season_goal_diff_opponent
0,1993081414182,1994,1993-08-14,Sat,1,Sat,Ipswich Town,0,Oldham Athletic,0,0,3,,,0.0,0.0,0.0,0.0,7.0,0.0,0.0,0.0,0.0,52.0,50.0,55.0,-5.0,1.2,1.6,1.8,,,,,,0.0,0.0,0.0,0.0,14.0,2.0,2.0,2.0,,,,49.0,63.0,74.0,-11.0
1,199308141441,1994,1993-08-14,Sat,1,Sat,Wimbledon,0,West Ham United,1,0,3,,,0.0,0.0,0.0,0.0,20.0,,,,,54.0,56.0,55.0,1.0,1.0,1.2,1.6,,,,,,0.0,0.0,0.0,0.0,20.0,,,,,,,,,,
2,1993081411512,1994,1993-08-14,Sat,1,Sat,Everton,0,Southampton,0,0,3,,,0.0,0.0,0.0,0.0,6.0,1.0,2.0,1.0,1.0,53.0,53.0,55.0,-2.0,1.0,1.6,1.8,,,,,,0.0,0.0,0.0,0.0,18.0,0.8,1.2,2.0,,,,50.0,54.0,61.0,-7.0
3,1993081422234,1994,1993-08-14,Sat,1,Sat,Sheffield United,0,Swindon Town,1,1,3,,,0.0,0.0,0.0,0.0,16.0,,,,,52.0,54.0,53.0,1.0,2.0,2.0,1.2,,,,,,0.0,0.0,0.0,0.0,19.0,,,,,,,,,,
4,1993081413516,1994,1993-08-14,Sat,1,Sat,Blackburn Rovers,0,Chelsea,0,0,3,,,0.0,0.0,0.0,0.0,3.0,3.0,2.0,1.0,1.0,71.0,68.0,46.0,22.0,2.4,2.0,1.0,,,,,,0.0,0.0,0.0,0.0,4.0,1.2,1.6,2.2,,,,56.0,51.0,54.0,-3.0
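
The `Unnamed: 0` column in the output above usually appears when a DataFrame's index was written to the CSV and then read back as an ordinary column. Below is a minimal sketch of reading that column back as the index, so it does not end up among the features. It uses a small in-memory CSV as a stand-in for `matches_processed.csv`; the toy values are assumptions, not real data.

```python
import io
import pandas as pd

# Toy stand-in for a CSV that was saved with its index (hypothetical values)
csv_text = ",unique_match_id,season,points\n0,101,2018,3\n1,102,2018,0\n"

# Default read: the unnamed first column becomes "Unnamed: 0"
default_read = pd.read_csv(io.StringIO(csv_text))

# Treat the first column as the index instead
indexed_read = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_read.columns.tolist())
print(indexed_read.columns.tolist())
```

Whether this applies here depends on how the processed file was written. Dropping the column after loading would be an equally reasonable alternative.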


### Split the data into full-history and shorter (2017-18 onwards) datasets

In [3]:
# for the full dataset, drop the variables that are only recently collected
matches_full = matches.copy()
columns_to_drop = ['xg_pl_form', 'xga_pl_form', 'poss_pl_form', 'xg_pl_form_opponent', 'xga_pl_form_opponent', 'poss_pl_form_opponent']

matches_full = matches_full.drop(columns=columns_to_drop)

# expected goals data starts in the 2017-18 season (season == 2018)
# take an explicit copy so later edits don't act on a view of `matches`
matches_short = matches[matches['season'] >= 2018].copy()
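
The 2018 cutoff rests on expected goals first being recorded in the 2017-18 season. One way to sanity-check a cutoff like this is to measure how much of the xG column is populated in each season. The sketch below does that on a toy frame; the values are hypothetical and only the column names `season` and `xg_pl_form` come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `matches` (hypothetical values): xG is missing before 2018
toy = pd.DataFrame({
    "season": [2016, 2016, 2017, 2018, 2018, 2019],
    "xg_pl_form": [np.nan, np.nan, np.nan, 1.2, np.nan, 1.5],
})

# Share of non-missing xG values per season
xg_coverage = toy.groupby("season")["xg_pl_form"].apply(lambda s: s.notna().mean())

# First season with any xG data at all
first_xg_season = xg_coverage[xg_coverage > 0].index.min()

print(xg_coverage)
print(first_xg_season)
```

On the real data, coverage in the first xG season may be below 100% because form features need a few earlier matches before they can be computed.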

### Split datasets into training and test sets

In [4]:
from sklearn.model_selection import train_test_split

# Specify the features and target variable
features_full = matches_full.drop('points', axis=1)
target_full = matches_full[['unique_match_id', 'points']]
features_short = matches_short.drop('points', axis=1)
target_short = matches_short[['unique_match_id', 'points']]

# Perform train-test split with stratified sampling of the target variable to ensure we have a representative sample of all results
X_train_full, X_test_full, y_train_full, y_test_full = train_test_split(features_full, target_full, test_size=0.2, stratify=target_full['points'], random_state=42)
X_train_short, X_test_short, y_train_short, y_test_short = train_test_split(features_short, target_short, test_size=0.2, stratify=target_short['points'], random_state=42)
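
To confirm that stratifying on `points` behaves as intended, the class shares in each split can be compared with those of the whole dataset. This is a minimal sketch on a toy target (hypothetical values, same `test_size` and `random_state` as above), not a check run on the real data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the target (hypothetical values): points of 0, 1 or 3
toy = pd.DataFrame({
    "unique_match_id": range(100),
    "points": [3] * 40 + [1] * 25 + [0] * 35,
})

train, test = train_test_split(
    toy, test_size=0.2, stratify=toy["points"], random_state=42
)

# Class shares in the whole dataset and in each split, side by side
shares = pd.DataFrame({
    "all": toy["points"].value_counts(normalize=True),
    "train": train["points"].value_counts(normalize=True),
    "test": test["points"].value_counts(normalize=True),
}).sort_index()

print(shares)
```

If the three columns are close to each other, wins, draws and losses are represented in similar proportions in the training and testing sets.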

In [5]:
# store in local data file
# Define the output directory
output_dir = f"{DATAPATH}/processed/"

# Save the train and test sets as CSV files
X_train_full.to_csv(os.path.join(output_dir, 'X_train_full.csv'), index=False)
X_test_full.to_csv(os.path.join(output_dir, 'X_test_full.csv'), index=False)
y_train_full.to_csv(os.path.join(output_dir, 'y_train_full.csv'), index=False)
y_test_full.to_csv(os.path.join(output_dir, 'y_test_full.csv'), index=False)

X_train_short.to_csv(os.path.join(output_dir, 'X_train_short.csv'), index=False)
X_test_short.to_csv(os.path.join(output_dir, 'X_test_short.csv'), index=False)
y_train_short.to_csv(os.path.join(output_dir, 'y_train_short.csv'), index=False)
y_test_short.to_csv(os.path.join(output_dir, 'y_test_short.csv'), index=False)
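
Because the features and targets are saved to separate files with `index=False`, the only link between them after reloading is row order and `unique_match_id`. The sketch below shows a quick round-trip check that the identifiers still line up. It writes toy frames (hypothetical values) to a temporary directory rather than to `data/processed/`.

```python
import os
import tempfile
import pandas as pd

# Toy stand-ins for one saved feature/target pair (hypothetical values)
X_toy = pd.DataFrame({"unique_match_id": [11, 12, 13], "home": [1, 0, 1]})
y_toy = pd.DataFrame({"unique_match_id": [11, 12, 13], "points": [3, 0, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Save both files the same way as the cell above
    X_toy.to_csv(os.path.join(tmp_dir, "X_toy.csv"), index=False)
    y_toy.to_csv(os.path.join(tmp_dir, "y_toy.csv"), index=False)

    # Reload them as a later notebook would
    X_back = pd.read_csv(os.path.join(tmp_dir, "X_toy.csv"))
    y_back = pd.read_csv(os.path.join(tmp_dir, "y_toy.csv"))

# Rows should line up one-to-one on the match identifier
ids_aligned = X_back["unique_match_id"].equals(y_back["unique_match_id"])
print(ids_aligned)
```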