Step 1: 
    Data Prep 
    a. Download the training data (Let's start with train_0 - train_2) 
    b. Size the dataset down; something like df.groupby('target_col').sample(frac=f) should work, where f is the fraction of rows to keep from each class (e.g. 0.1 for 10%)
    c. Compare distributions of training data to that of test data; use the two-sample Kolmogorov-Smirnov test (scipy.stats.ks_2samp) to check whether each column follows the same distribution in both
    d. Determine proper types for each column. Assume all numerical columns are float. Create a dictionary, data_types, where each key is a column name and each value is the pandas dtype it should have
Step 2:
    Exploration
    a. Visualization
        i. For categorical, create histograms w/ KDE, color by class
        ii. For continuous, create scatter (or cat) plot, color by class
        iii. For continuous, create histograms w/ KDE, color by class
    b. Correlation
        i. Run correlation analysis on your columns (Spearman Correlation)
        ii. Plot the correlation heatmap
    c. Quality (flag a column if any of these hold)
        i. If column has a single unique value
        ii. If column is categorical, but has all unique values
        iii. If column is more than 50% NaN, null, or missing (Cabin)
        iv. If column is highly correlated with another
Step 3:
    Feature Engineering 
    a. Feature Removal
        i. Determine which features can be removed from above analysis
    b. Feature Addition
        i. Consider creation of "lagging" features (i.e. over the last X timestamps, taking averages of numerical values/counts of categorical)
        ii. Create indicators for any missing or null data (if it exists)
        iii. Consider imputation methods
Step 4:
    Modeling 
    a. Optuna Framework       
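
The feature-addition ideas in Step 3b can be sketched on a toy frame. This is only an illustration under assumptions, not the notebook's final feature set: the grouping column `game_num`, the ordering column `event_time`, the window of 3 rows, and the helper name `add_lag_and_missing_features` are all chosen for the example.

```python
import numpy as np
import pandas as pd


def add_lag_and_missing_features(df, group_col, order_col, value_col, window=3):
    """Add a missing-value indicator and a rolling mean over the last `window` rows."""
    out = df.sort_values([group_col, order_col]).copy()
    # Indicator: 1 where the original value is missing, else 0
    out[f"{value_col}_is_missing"] = out[value_col].isna().astype("int64")
    # Rolling mean per group over the last `window` rows (current row included);
    # min_periods=1 so the first rows of each group still get a value
    out[f"{value_col}_roll_mean_{window}"] = (
        out.groupby(group_col)[value_col]
        .transform(lambda s: s.rolling(window, min_periods=1).mean())
    )
    return out


# Toy data (hypothetical values) with one missing entry
toy = pd.DataFrame({
    "game_num": [1, 1, 1, 1, 2, 2],
    "event_time": [0.0, 0.1, 0.2, 0.3, 0.0, 0.1],
    "ball_pos_x": [1.0, 2.0, np.nan, 4.0, 10.0, 20.0],
})
features = add_lag_and_missing_features(toy, "game_num", "event_time", "ball_pos_x")
print(features)
```

Grouping before rolling keeps the window from reaching across group boundaries, and the indicator column preserves the fact that a value was missing even if it is imputed later.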

In [None]:
# Import statements
import pandas as pd
import numpy as np
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# Creating test/train splits from a 10% stratified sample of train_0
_d = pd.read_csv('train_0.csv')
train_sample = _d.groupby('team_scoring_next').sample(frac=0.1)

# 80% of each class goes to training; the remaining rows form the test set
x_train = train_sample.groupby('team_scoring_next').sample(frac=0.8)
idxs = x_train.index
x_test = train_sample.drop(idxs)

# Separate the target column from the features
y_train = x_train.pop('team_scoring_next')
y_test = x_test.pop('team_scoring_next')

In [None]:
print(len(x_train), len(y_train), len(x_test), len(y_test))
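
The split above draws a different sample on every run. A minimal sketch of a repeatable version follows, on toy data; the `feature` column and the seed value are assumptions made for the example, while `random_state` is a standard parameter of the pandas groupby `sample` method.

```python
import pandas as pd

# Toy data: 60 rows of class "A" and 40 rows of class "B"
toy = pd.DataFrame({
    "feature": range(100),
    "team_scoring_next": ["A"] * 60 + ["B"] * 40,
})

# Sample 80% of each class for training; random_state makes the split repeatable
train = toy.groupby("team_scoring_next").sample(frac=0.8, random_state=0)
# Everything not drawn for training becomes the hold-out set
test = toy.drop(train.index)

# Class proportions should match between the two splits
print(train["team_scoring_next"].value_counts(normalize=True))
print(test["team_scoring_next"].value_counts(normalize=True))
```

Because the sampling is done per class, both splits keep the original 60/40 class ratio.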

In [None]:
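# Two-sample KS test per numeric column (here: train split vs. test split)
ks2_results = {}
for col in x_train.select_dtypes(include='number').columns:
    # Drop NaNs so missing values do not distort the test
    ks2_results[col] = stats.ks_2samp(x_train[col].dropna(), x_test[col].dropna())
ks2_results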
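
A dictionary of raw test results is hard to scan. One possible way to summarize it is sketched below on synthetic data; the column names `same` and `shifted` and the 0.05 significance threshold are assumptions for the example, not values fixed by this notebook.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
# Toy data: `same` comes from one distribution in both frames, `shifted` does not
a = pd.DataFrame({"same": rng.normal(0, 1, 500), "shifted": rng.normal(0, 1, 500)})
b = pd.DataFrame({"same": rng.normal(0, 1, 500), "shifted": rng.normal(2, 1, 500)})

rows = []
for col in a.columns:
    # Drop NaNs first so missing values do not affect the test
    result = stats.ks_2samp(a[col].dropna(), b[col].dropna())
    rows.append({"column": col, "statistic": result.statistic, "pvalue": result.pvalue})

summary = pd.DataFrame(rows).set_index("column")
# alpha = 0.05 is an assumed threshold
summary["differs"] = summary["pvalue"] < 0.05
print(summary)
```

A small p-value suggests the two samples do not share a distribution. With many columns, some will fall below the threshold by chance, so the `differs` flag is a prompt for a closer look rather than a verdict.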

In [None]:
# Checking data_types
data_types = {}
for cols in x_train:
    data_types[cols] = x_train[cols].dtype.name
data_types
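
The cell above records the dtypes pandas inferred. Step 1d asks for the dtypes the columns should have, which could be sketched as follows. The toy columns and the hand-written `categorical_cols` list are assumptions for the example.

```python
import pandas as pd

toy = pd.DataFrame({
    "ball_pos_x": [1, 2, 3],            # integer-valued but numerical
    "ball_vel_x": [0.5, 1.5, 2.5],
    "player_scoring_next": [0, 3, -1],  # treated as categorical in this sketch
})

# Assumption: the categorical columns are listed by hand
categorical_cols = ["player_scoring_next"]

# Every column that is not categorical is assumed to be a float
data_types = {
    col: "category" if col in categorical_cols else "float64"
    for col in toy.columns
}
typed = toy.astype(data_types)
print(typed.dtypes)
```

The same dictionary can also be passed to `pd.read_csv` through its `dtype` argument so the types are applied at load time.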

In [None]:
# Changing values for column "team_scoring_next" from A/B to 0/1
# (astype returns a new Series, so the result has to be assigned back)
y_train = y_train.map(lambda x: 0 if x == "A" else 1).astype("int64")
y_test = y_test.map(lambda x: 0 if x == "A" else 1).astype("int64")

# STEP 2

### Visualizations

The only issue I have run into is creating scatter plots for continuous variables.

In [None]:
# Histogram for the categorical variable player_scoring_next, colored by class (y_train)
sns.histplot(data=x_train, x='player_scoring_next', kde=True, hue=y_train)

In [None]:
# Scatter plots for continuous variables, colored by class (y_train)
continuous_variables = x_train.drop(['player_scoring_next','event_id','team_A_scoring_within_10sec','team_B_scoring_within_10sec'], axis=1)
sns.scatterplot(data=continuous_variables, x='ball_pos_x', y='ball_pos_y', hue=y_train)

In [None]:
# Collect the names of columns that contain 'x' and 'y' (the coordinate components)
list_of_x_columns = [col for col in continuous_variables if 'x' in col]
list_of_y_columns = [col for col in continuous_variables if 'y' in col]

print(list_of_x_columns, list_of_y_columns)
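
For scatter plots, each x column needs to be matched with its y partner. A minimal sketch follows, assuming the columns follow a `<prefix>_x` / `<prefix>_y` naming scheme; the helper name `pair_xy_columns` and the example column list are hypothetical.

```python
def pair_xy_columns(columns):
    """Pair each '<prefix>_x' column with its '<prefix>_y' partner, if present."""
    names = set(columns)
    pairs = []
    for col in columns:
        if col.endswith("_x"):
            partner = col[:-2] + "_y"
            # Skip x columns that have no matching y column
            if partner in names:
                pairs.append((col, partner))
    return pairs


# Hypothetical column list for illustration
example_columns = ["ball_pos_x", "ball_pos_y", "ball_vel_x", "ball_vel_y", "p0_boost"]
pairs = pair_xy_columns(example_columns)
print(pairs)
```

Each resulting pair can then be passed to `sns.scatterplot(data=continuous_variables, x=x_col, y=y_col)` inside a loop, one figure per pair.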

In [None]:
# Histograms of continuous variables

for cols in continuous_variables:
    sns.histplot(data=continuous_variables, x=cols)
    plt.show()
    plt.close()

### Correlation

In [None]:
corr = x_train.corr(method='spearman')
corr

In [None]:
# Heatmap of Spearman Corr (reuses `corr` computed in the previous cell)
sns.heatmap(corr, annot=True)
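
With many columns the heatmap gets hard to read, so it can help to list the highly correlated pairs directly. The sketch below uses toy data; the 0.9 cut-off is an assumed threshold, not one set by this notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],   # monotonic in a, so Spearman rho = 1
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})
corr = toy.corr(method="spearman")

# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()

threshold = 0.9  # assumed cut-off
# Masked (NaN) entries never pass this comparison, so they are filtered out here
high = pairs[pairs.abs() > threshold]
print(high)
```

For each pair listed in `high`, one of the two columns is a candidate for removal in Step 3a.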

### Quality

In [None]:
# Number of unique values in each column of x_train
print(x_train.nunique(axis=0))

In [None]:
# number of NaN values in each column
print(x_train.isna().sum())
# At most, a column contains about 1% missing values
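
The first three quality checks from the outline can be gathered into one report. This is a sketch under assumptions: the helper name `quality_flags`, its parameters, and the toy columns are hypothetical, and the categorical columns have to be supplied by hand.

```python
import numpy as np
import pandas as pd


def quality_flags(df, categorical_cols=(), max_missing_frac=0.5):
    """Return one row per column with three quality checks from the outline."""
    flags = pd.DataFrame(index=df.columns)
    n_unique = df.nunique(dropna=True)
    # i. Column has at most one distinct non-missing value
    flags["single_value"] = n_unique <= 1
    # ii. Categorical column where every row has a different value
    flags["all_unique_categorical"] = [
        col in categorical_cols and n_unique[col] == len(df) for col in df.columns
    ]
    # iii. More than max_missing_frac of the values are missing
    flags["mostly_missing"] = df.isna().mean() > max_missing_frac
    return flags


toy = pd.DataFrame({
    "constant": [1, 1, 1, 1],
    "id_like": ["a", "b", "c", "d"],
    "sparse": [np.nan, np.nan, np.nan, 2.0],
    "ok": [0.1, 0.2, 0.2, 0.4],
})
report = quality_flags(toy, categorical_cols=["id_like"])
print(report)
```

Columns with any flag set are candidates for removal in Step 3a. The correlation check (iv) is left out here because it compares pairs of columns rather than single columns.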