## Author's notes
### 28.10.2025
#### Issues to solve
I am starting the model creation process. The files created in the prerocessing will be used, however, I do have some issues with them.
1) The data has not been split into train and test set during the preprocessing phase. I am aware that label encoding of categorical variables has been done and we should know all the categories for the teams, but it is still best practice to do encoding AFTER the split in order to avoid data leakage and it may be a problem in the assessment of our work by the teachers. Therefore, it would probably be for the best if this was corrected. All the preprocessing steps should be fitted to only the train data and only then used to transform the test data.
2) Only the market values were used in the bigger dataset. I am not saying this is wrong or right and I will trust Vojta on this one, BUT there should be a very detailed explanation for why exactly we didn't use the rest of the data.
3) There are missing values in the first observations in the derived column. I have dropped them for now, but I think a KNN imputer might do the trick, we could do some imputations and use everything.
4) I am having some doubts about the team encoding values because it's pretty big numbers compared to everything else. I know it's categorical but I'm wondering, whether it can mess with our results. Would like the team's opinion on that, I have no idea, just a thought.
5) There are 2 infinity values in model A1 and it's absolutely screwing me over, needs to be fixed. Idk how it happened, I'll just throw the observations out for now, BUT it would be better to fix it somehow. Check my code for finding the infinity values so you don't have to figure it out yourselves and deal with that along with the train test split. Feel free to steal my code, much of it can be just altered to make everything work. However, note that it is a working version and I didn't deal with a lot of stuff properly, but just enough to be able to make the models.

Because of the issues with the train test split, I will split the data here, but I would like to change it once it is fixed.

#### A0 models

The models for dataset A0 are done, along with hyperparameter tuning done thrugh RandomizedSearchCV and GridSearchCV. The data has been split chornologically, to use only historic data to predict future data. For this reason, TimeSeriesSplit has been used as the cross-validation method for the hyperparameter tuning, in order to prevent data leakage from the future into the past.
The models implemented so far are a random forest classifier and a tree based XGBoost classifier. Both have been tuned and the results of the grid and random searches are saved as text for convenience in future work on this project. However, it in important to note that if any changes are done on the data (like fixing the issues highlighted above), it will be necessary to recompute the results.
A Logistic regression model has been added, a time series cross validation is also used in it.

IF YOU WANT TO DO ONLY EVALUATION AND NOT TUNING, DON'T RUN THE CELLS WITH RANDOMIZED AND GRID SEARCH. USE THE COMMENT VERSION OF THE PARAMETERS THAT IS SAVED IN THE CELLS BELOW.

If I remember correctly what the teachers said in the first lesson, the persformance of our model will not be much better than flipping a coin, so keep that in mind when evaluating the models, they are unlikely to be outstandingly good. Honestly, if they are too good, maybe let the rest of the team now, because there may be some data leakage happening.

To enhance the performance, maybe playing around with the decision threshold could help, but I will leave that up to Zahra and her tuning efforts. Please, also check the feature importance for the models and it something appears suspiciously important, look into it (may be data leakage)

#### A1 models
There was an issue with the random forest right off the bat, because some infinity values ended up in the data. This must be fixed in preprocessing so please, go back to it and check it. It is written more in detail in the Issues section above.
The models for A0 are created analogously to the models for A0, so there is much less text there. If I use anything different, I always write it in the comments of markdown cells by the changes!
The searches are taking much longer on the big data so, once again, the results are written out so no one has to run it again unless the data has changed.

#### Disclaimers!!!
1)If you run the prepared parameters for the models, please put the cells back to being a comment after doing so. Otherwise, it could mess with the code. Also, I didn't actually try using it, since I ran all the grid searches, so if there is some issue, please don't be mad at me and fix it.
2)Remember that the code and results will change when the additional preprocessing is done, but it shouldn't be too much of an issue. Just keep in mind that it may be necessary to go back to all this.

## Libraries import

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier, plot_tree 
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_squared_error 

## Data import, train test split and shenanigans
The train test split should later be replaced by just loading the already split data after it has been done in preprocessing.
After this step, the models for A0 dataset will be made in the first chunk and A1 on the second chunk of this notebook.

In [7]:
from pathlib import Path
import os

FILE_PATH_a0 = Path("ready_data") / "data_a0_encoded.csv"
FILE_PATH_a1 = Path("ready_data") / "data_a1_encoded.csv"

data_a0 = pd.read_csv(FILE_PATH_a0)
data_a1 = pd.read_csv(FILE_PATH_a1)
    

In [14]:
#check that everything checks out
print(data_a0.info())
print(data_a0.head())

print(data_a1.info())
print(data_a1.head())

'''
The averaged data of the last 5 games contains missing values. 
This is because for the first 5 matches, it is always impossible to compute the average.
Because this is a derived column, this makes sense and should not be an issue for the data.
'''

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42593 entries, 0 to 42592
Data columns (total 26 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Time                           42593 non-null  int64  
 1   Target                         42593 non-null  int64  
 2   HomeTeam_enc                   42593 non-null  int64  
 3   avg_goals_in_last5_home        42115 non-null  float64
 4   avg_goals_conceded_last5_home  42115 non-null  float64
 5   AwayTeam_enc                   42593 non-null  int64  
 6   avg_goals_in_last5_away        42115 non-null  float64
 7   avg_goals_conceded_last5_away  42115 non-null  float64
 8   Year                           42593 non-null  int64  
 9   Month                          42593 non-null  int64  
 10  Dayofweek                      42593 non-null  int64  
 11  Is_weekend                     42593 non-null  int64  
 12  Season_of_year                 42593 non-null 

'\nThe averaged data of the last 5 games contains missing values. \nThis is because for the first 5 matches, it is always impossible to compute the average.\nBecause this is a derived column, this makes sense and should not be an issue for the data.\n'

The country and division dummies are booleans, change that into numerical.

In [69]:
data_a0[data_a0.select_dtypes(include='bool').columns]=data_a0[data_a0.select_dtypes(include='bool').columns].astype(int)
data_a1[data_a1.select_dtypes(include='bool').columns]=data_a1[data_a1.select_dtypes(include='bool').columns].astype(int)

In [70]:
#check that all columns have only numerical values
non_numeric_cols0 = data_a0.select_dtypes(exclude=[np.number]).columns
non_numeric_cols1 = data_a1.select_dtypes(exclude=[np.number]).columns

assert len(non_numeric_cols0)==0
assert len(non_numeric_cols1)==0

The data has been sorted chronologically in the preprocessing phase. Because this data is a time series and is likely time dependend, we will not be doing a random split of the data, but rather, a chronological one.

In [71]:
#just a check to see that we are good to keep working with the data and it's in the form we want
assert type(data_a0)==pd.core.frame.DataFrame
assert type(data_a1)==pd.core.frame.DataFrame

In [72]:
#train test split, 80/20 ratio
#for A0
split_index_0 = int(0.8 * len(data_a0))

train0 = data_a0.iloc[:split_index_0]
test0  = data_a0.iloc[split_index_0:]

X_train_0, y_train_0 = train0.drop(columns='Target'), train0['Target']
X_test_0,  y_test_0  = test0.drop(columns='Target'),  test0['Target']

#for A1
split_index_1 = int(0.8 * len(data_a1))

train1 = data_a1.iloc[:split_index_1]
test1  = data_a1.iloc[split_index_1:]

X_train_1, y_train_1 = train1.drop(columns='Target'), train1['Target']
X_test_1,  y_test_1  = test1.drop(columns='Target'),  test1['Target']

## Model creation A0

### RandomForestClassifier
Because this is a classification task and we aren't looking at a continuous target variable, we will use the RandomForestClassifier and not the RandomForestRegressor we have used in the lectures.

Because we have derived the avg variables, there is some missingness in the data. I will drop the observations with missing values for now but I think it can be fixed (check author's notes at the top of this markdown)

In [32]:
#temporary solution to missing values
X_train_0 = X_train_0.dropna()
y_train_0 = y_train_0.loc[X_train_0.index]

X_test_0 = X_test_0.dropna()
y_test_0 = y_test_0.loc[X_test_0.index]

In [34]:
#initiate the model
rf0_1 = RandomForestClassifier(random_state=42)

In [30]:
#because the data is chronological, we cannot do a randomized cross-validation when choosing the model
#we will use a rolling cross-validation instead

from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5) 

#this will split out data into 5 folds and we will always evaluate only based on the past, preventing leakage

In [None]:
#random search for some good values to use in gridsearch (hyperparameter tuning)
#parameters to go through
param_grid= {
    'max_depth':[i for i in range(1, 30)],
    'min_samples_split':[i for i in range(1,300)],
    'min_samples_leaf':[i for i in range(1, 200)],
    'criterion' :['gini', 'entropy', 'log_loss']
}


In [None]:
#the random search
params={
    'max_depth':[],
    'min_samples_split':[],
    'min_samples_leaf':[],
    'criterion' :[]
} #empty parameter grid to input the results of the random search
for state in [1, 20, 42, 200]: 
    #the loop numbers were randomly chosen, they are just here as random seeds to make the results reproducible
    random_search = RandomizedSearchCV(
        estimator=rf0_1,
        param_distributions=param_grid,
        cv=tscv,
        n_iter=100,
        random_state=state,
        n_jobs=-1
    )
    random_search.fit(X_train_0, y_train_0)
    new_params=random_search.best_params_
    for key, value in new_params.items():
        params[key].append(value)

print(params)

{'max_depth': [18, 16, 14, 18], 'min_samples_split': [268, 239, 291, 282], 'min_samples_leaf': [58, 43, 61, 66], 'criterion': ['entropy', 'entropy', 'gini', 'log_loss']}


In [37]:
#keeping only unique values in the parameter grid
for key in params:
    params[key] = list(set(params[key]))
print(params)

{'max_depth': [16, 18, 14], 'min_samples_split': [282, 291, 268, 239], 'min_samples_leaf': [66, 58, 43, 61], 'criterion': ['log_loss', 'gini', 'entropy']}


The random search is run multiple times to find some values that could be used in the grid search. As this takes some time, here are the values that it gave me when I ran the code (so they can be used immediately and without running the code):

params={'max_depth': [16, 18, 14],'min_samples_split': [282, 291, 268, 239], 'min_samples_leaf': [66, 58, 43, 61], 'criterion': ['log_loss', 'gini', 'entropy']}

In [38]:
#for convenience, the params output can be loaded here
'''
params={'max_depth': [16, 18, 14],
        'min_samples_split': [282, 291, 268, 239],
        'min_samples_leaf': [66, 58, 43, 61],
        'criterion': ['log_loss', 'gini', 'entropy']}
'''

"\nparams={'max_depth': [16, 18, 14],\n        'min_samples_split': [282, 291, 268, 239],\n        'min_samples_leaf': [66, 58, 43, 61],\n        'criterion': ['log_loss', 'gini', 'entropy']}\n"

In [39]:
#grid search
grid=GridSearchCV(estimator=rf0_1,
                  param_grid=params,
                  n_jobs=-1, 
                  cv=tscv)
grid.fit(X_train_0, y_train_0)

best_params=grid.best_params_
print(best_params)

rf_0=grid.best_estimator_

{'criterion': 'log_loss', 'max_depth': 14, 'min_samples_leaf': 61, 'min_samples_split': 282}


As grid search takes a lot of time to be computed, here are the hyperparameter values for the best estimator for future convenience.

best_params={'criterion': 'log_loss', 'max_depth': 14, 'min_samples_leaf': 61, 'min_samples_split': 282}

In [None]:
#for convenience, the best estimator can be loaded here
'''
best_params={'criterion': 'log_loss', 'max_depth': 14, 'min_samples_leaf': 61, 'min_samples_split': 282}
rf_0=RandomForestClassifier(best_params, random_state=42)
'''

In [42]:
#Random forest model predictions
pred_rf_0=rf_0.predict(X_test_0)

### Tree boosting

This process is mostly analogous to the random forest, only using a different classifier and different hyperparameters (since we need different things for the two). Otherwise, the model creation has the same steps and uses the time series cross validation that has already been used before.

In [53]:
from xgboost import XGBClassifier
import scipy.stats as st

In [54]:
xgb = XGBClassifier(
    objective='binary:logistic',
    eval_metric='logloss',
    use_label_encoder=False,
    random_state=42
)

In [None]:
xgb_parameter_grid= {
    'n_estimators': st.randint(2, 300),
    'max_depth': st.randint(3, 8),
    'learning_rate': st.uniform(0.01, 0.25),
    'subsample': st.uniform(0.8, 0.2),
    'colsample_bytree': st.uniform(0.8, 0.2)
} #hyperparameters for xgboost, some are from a uniform distribution because they are floats and not integers

In [57]:
#the random search
xgb_params={
    'n_estimators': [],
    'max_depth': [],
    'learning_rate': [],
    'subsample': [],
    'colsample_bytree': []
} #empty parameter grid to input the results of the random search
for state in [1, 20, 42, 200]:
    random_search = RandomizedSearchCV(
        estimator=xgb,
        param_distributions=xgb_parameter_grid,
        n_iter=100,
        cv=tscv,
        n_jobs=-1,
        random_state=state
        )
    random_search.fit(X_train_0, y_train_0)
    new_params=random_search.best_params_
    for key, value in new_params.items():
        xgb_params[key].append(value)

print(xgb_params)

{'n_estimators': [285, 71, 102, 154], 'max_depth': [3, 4, 5, 3], 'learning_rate': [np.float64(0.030563726508461113), np.float64(0.059119574314573674), np.float64(0.019337047187303606), np.float64(0.021098111168009435)], 'subsample': [np.float64(0.8834765592635814), np.float64(0.9215534158836575), np.float64(0.9044486520109609), np.float64(0.8412233846481453)], 'colsample_bytree': [np.float64(0.8509405508246773), np.float64(0.993252878676858), np.float64(0.8061000499878099), np.float64(0.999332288418085)]}


random search results:

{'n_estimators': [285, 71, 102, 154], 'max_depth': [3, 4, 5, 3], 'learning_rate': [np.float64(0.030563726508461113), np.float64(0.059119574314573674), np.float64(0.019337047187303606), np.float64(0.021098111168009435)], 'subsample': [np.float64(0.8834765592635814), np.float64(0.9215534158836575), np.float64(0.9044486520109609), np.float64(0.8412233846481453)], 'colsample_bytree': [np.float64(0.8509405508246773), np.float64(0.993252878676858), np.float64(0.8061000499878099), np.float64(0.999332288418085)]}

In [None]:
#for convenience, the params output can be loaded here
'''
xgb_params={'n_estimators': [285, 71, 102, 154], 
            'max_depth': [3, 4, 5, 3], 
            'learning_rate': [np.float64(0.030563726508461113), np.float64(0.059119574314573674), 
                              np.float64(0.019337047187303606), np.float64(0.021098111168009435)], 
            'subsample': [np.float64(0.8834765592635814), np.float64(0.9215534158836575), 
                          np.float64(0.9044486520109609), np.float64(0.8412233846481453)],
            'colsample_bytree': [np.float64(0.8509405508246773), np.float64(0.993252878676858),
                                  np.float64(0.8061000499878099), np.float64(0.999332288418085)]}
'''

In [58]:
xgb_grid = GridSearchCV(
    estimator=xgb,
    param_grid=xgb_params,
    cv=tscv,
    n_jobs=-1,
)

xgb_grid.fit(X_train_0, y_train_0)

best_params=xgb_grid.best_params_
print(best_params)

xgb_0=xgb_grid.best_estimator_

{'colsample_bytree': np.float64(0.8061000499878099), 'learning_rate': np.float64(0.021098111168009435), 'max_depth': 5, 'n_estimators': 154, 'subsample': np.float64(0.9044486520109609)}


Best estimator parameters from GridSearchCV
{'colsample_bytree': np.float64(0.8061000499878099), 'learning_rate': np.float64(0.021098111168009435), 'max_depth': 5, 'n_estimators': 154, 'subsample': np.float64(0.9044486520109609)}

In [None]:
#for convenience, the best estimator can be loaded here
'''
best_params={'colsample_bytree': np.float64(0.8061000499878099),
             'learning_rate': np.float64(0.021098111168009435),
             'max_depth': 5, 'n_estimators': 154, 'subsample': np.float64(0.9044486520109609)}
xgb_0=RandomForestClassifier(best_params, random_state=42)
'''

In [59]:
#Random forest model predictions
pred_xgb_0=xgb_0.predict(X_test_0)

### Logistic Regression
Because we are predicting a binary outcome, we can use a logistic regression model, as it uses the probability of the outcome

In [None]:
from sklearn.linear_model import LogisticRegressionCV

logreg_0=LogisticRegressionCV(max_iter=10000, cv=tscv) 
#logistic regression with time series Cross Validation

logreg_0.fit(X_train_0, y_train_0)
pred_logreg=logreg_0.predict(X_test_0) #predictions for the logistic regression

## Model creation A1

### Random Forest Classifier
Analogous to model creation 1

In [91]:
#temporary solution to missing values
X_train_1 = X_train_1.dropna()
y_train_1 = y_train_1.loc[X_train_1.index]

X_test_1 = X_test_1.dropna()
y_test_1 = y_test_1.loc[X_test_1.index]

In [None]:
#There were some issues with the randomized search that claims there are missing or infinite values in the data
#assert that this is not happening, if yes, fix it, if not, look for a different cause of the problem
'''
assert np.isfinite(X_train_1.to_numpy()).all(), "there are NaNs or infinite numbers in X_train_1"
assert np.isfinite(y_train_1.to_numpy()).all(), "there are NaNs or infinite numbers in y_train_1"
'''

AssertionError: there are NaNs or infinite numbers in X_train_1

In [89]:
print(X_train_1.isna().sum().sum()) #no missing values so it must be infinity
print(X_test_1.isna().sum().sum())

0
0


In [None]:
(X_train_1 == np.inf).sum() + (X_train_1 == -np.inf).sum() #the issue is infinity values


Time                             0
HomeTeam_enc                     0
avg_goals_in_last5_home          0
avg_goals_conceded_last5_home    0
AwayTeam_enc                     0
avg_goals_in_last5_away          0
avg_goals_conceded_last5_away    0
market_decisiveness              0
expected_total_goals             0
Norm_Ah_P_home                   0
Norm_Ah_P_away                   0
ah_imbalance                     1
ah_market_confidence             1
Year                             0
Month                            0
Dayofweek                        0
Is_weekend                       0
Season_of_year                   0
Country_D                        0
Country_E                        0
Country_F                        0
Country_G                        0
Country_I                        0
Country_N                        0
Country_P                        0
Country_SC                       0
Country_SP                       0
Country_T                        0
Division_1          

In [None]:
(X_test_1 == np.inf).sum() + (X_test_1 == -np.inf).sum() 
#no infinities in the test set, but I'll write a code to remove it from everywhere, just to be sure

Time                             0
HomeTeam_enc                     0
avg_goals_in_last5_home          0
avg_goals_conceded_last5_home    0
AwayTeam_enc                     0
avg_goals_in_last5_away          0
avg_goals_conceded_last5_away    0
market_decisiveness              0
expected_total_goals             0
Norm_Ah_P_home                   0
Norm_Ah_P_away                   0
ah_imbalance                     0
ah_market_confidence             0
Year                             0
Month                            0
Dayofweek                        0
Is_weekend                       0
Season_of_year                   0
Country_D                        0
Country_E                        0
Country_F                        0
Country_G                        0
Country_I                        0
Country_N                        0
Country_P                        0
Country_SC                       0
Country_SP                       0
Country_T                        0
Division_1          

In [92]:
#removing problematic infinity values (temporary solution)
#replace with nans
X_train_1 = X_train_1.replace([np.inf, -np.inf], np.nan)
X_test_1 = X_test_1.replace([np.inf, -np.inf], np.nan)
#drop
X_train_1 = X_train_1.dropna()
y_train_1 = y_train_1.loc[X_train_1.index]

X_test_1 = X_test_1.dropna()
y_test_1 = y_test_1.loc[X_test_1.index]

In [93]:
#run the asserts to make sure the issue is no longer here
assert np.isfinite(X_train_1.to_numpy()).all(), "there are NaNs or infinite numbers in X_train_1"
assert np.isfinite(y_train_1.to_numpy()).all(), "there are NaNs or infinite numbers in y_train_1"

In [94]:
#initiate the model
rf1_1 = RandomForestClassifier(random_state=42)

In [95]:
#the random search
params_1={
    'max_depth':[],
    'min_samples_split':[],
    'min_samples_leaf':[],
    'criterion' :[]

} #empty parameter grid to input the results of the random search
for state in [1, 20, 42, 200]:
    random_search = RandomizedSearchCV(
        estimator=rf1_1,
        param_distributions=param_grid, #we use the same parameter grid for this as in model of A0
        cv=tscv,
        n_iter=100,
        random_state=state,
        n_jobs=-1
    )
    random_search.fit(X_train_1, y_train_1)
    new_params=random_search.best_params_
    for key, value in new_params.items():
        params_1[key].append(value)

print(params_1)

{'max_depth': [15, 22, 10, 7], 'min_samples_split': [297, 190, 174, 144], 'min_samples_leaf': [128, 69, 98, 167], 'criterion': ['log_loss', 'entropy', 'gini', 'gini']}


In [96]:
#keeping only unique values in the parameter grid
for key in params_1:
    params_1[key] = list(set(params_1[key]))
print(params_1)

{'max_depth': [10, 7, 22, 15], 'min_samples_split': [144, 297, 174, 190], 'min_samples_leaf': [128, 98, 69, 167], 'criterion': ['log_loss', 'gini', 'entropy']}


params_1={'max_depth': [10, 7, 22, 15], 'min_samples_split': [144, 297, 174, 190], 'min_samples_leaf': [128, 98, 69, 167], 'criterion': ['log_loss', 'gini', 'entropy']}

In [None]:
#for convenience, the params output can be loaded here
'''
params_1={
    'max_depth': [10, 7, 22, 15],
    'min_samples_split': [144, 297, 174, 190],
    'min_samples_leaf': [128, 98, 69, 167],
    'criterion': ['log_loss', 'gini', 'entropy']}
'''

In [97]:
#grid search
grid=GridSearchCV(estimator=rf1_1,
                  param_grid=params_1,
                  n_jobs=-1, 
                  cv=tscv)
grid.fit(X_train_1, y_train_1)

best_params=grid.best_params_
print(best_params)

rf_1=grid.best_estimator_

{'criterion': 'log_loss', 'max_depth': 15, 'min_samples_leaf': 128, 'min_samples_split': 297}


In [None]:
#for convenience, the best estimator can be loaded here
'''
best_params={'criterion': 'log_loss', 'max_depth': 15, 'min_samples_leaf': 128, 'min_samples_split': 297}
rf_1=RandomForestClassifier(best_params, random_state=42)
'''

In [98]:
pred_rf_1=rf_1.predict(X_test_1)

### Tree boosting

Analogous to model A0

In [99]:
xgb1_1 = XGBClassifier(
    objective='binary:logistic',
    eval_metric='logloss',
    use_label_encoder=False,
    random_state=42
)

In [100]:
#the random search
xgb_params={
    'n_estimators': [],
    'max_depth': [],
    'learning_rate': [],
    'subsample': [],
    'colsample_bytree': []
} #empty parameter grid to input the results of the random search
for state in [1, 20, 42, 200]:
    random_search = RandomizedSearchCV(
        estimator=xgb1_1,
        param_distributions=xgb_parameter_grid, #use the same grid as before
        n_iter=100,
        cv=tscv,
        n_jobs=-1,
        random_state=state
        )
    random_search.fit(X_train_1, y_train_1)
    new_params=random_search.best_params_
    for key, value in new_params.items():
        xgb_params[key].append(value)

print(xgb_params)

{'n_estimators': [32, 50, 202, 154], 'max_depth': [3, 4, 3, 3], 'learning_rate': [np.float64(0.06955272909778563), np.float64(0.02068706309628359), np.float64(0.03257244251360208), np.float64(0.021098111168009435)], 'subsample': [np.float64(0.8562433791337292), np.float64(0.8405358828849236), np.float64(0.8641560129943472), np.float64(0.8412233846481453)], 'colsample_bytree': [np.float64(0.9411364273423346), np.float64(0.9048980247717036), np.float64(0.9071549368149517), np.float64(0.999332288418085)]}


Best parameters:

{'n_estimators': [32, 50, 202, 154], 'max_depth': [3, 4, 3, 3], 'learning_rate': [np.float64(0.06955272909778563), np.float64(0.02068706309628359), np.float64(0.03257244251360208), np.float64(0.021098111168009435)], 'subsample': [np.float64(0.8562433791337292), np.float64(0.8405358828849236), np.float64(0.8641560129943472), np.float64(0.8412233846481453)], 'colsample_bytree': [np.float64(0.9411364273423346), np.float64(0.9048980247717036), np.float64(0.9071549368149517), np.float64(0.999332288418085)]}

In [None]:
# load parameters for convenience
'''
xgb_params={'n_estimators': [32, 50, 202, 154],
             'max_depth': [3, 4, 3, 3],
             'learning_rate': [np.float64(0.06955272909778563), np.float64(0.02068706309628359),
                               np.float64(0.03257244251360208), np.float64(0.021098111168009435)],
             'subsample': [np.float64(0.8562433791337292), np.float64(0.8405358828849236),
                           np.float64(0.8641560129943472), np.float64(0.8412233846481453)],
             'colsample_bytree': [np.float64(0.9411364273423346), np.float64(0.9048980247717036),
                                  np.float64(0.9071549368149517), np.float64(0.999332288418085)]}

'''

In [101]:
xgb_grid = GridSearchCV(
    estimator=xgb1_1,
    param_grid=xgb_params,
    cv=tscv,
    n_jobs=-1,
)

xgb_grid.fit(X_train_1, y_train_1)

best_params=xgb_grid.best_params_
print(best_params)

xgb_1=xgb_grid.best_estimator_

{'colsample_bytree': np.float64(0.9048980247717036), 'learning_rate': np.float64(0.03257244251360208), 'max_depth': 4, 'n_estimators': 32, 'subsample': np.float64(0.8405358828849236)}


In [None]:
#for convenience, best parameters and model
'''
best_params={'colsample_bytree': np.float64(0.9048980247717036),
             'learning_rate': np.float64(0.03257244251360208),
             'max_depth': 4, 'n_estimators': 32, 'subsample': np.float64(0.8405358828849236)}
xgb_1=RandomForestClassifier(best_params, random_state=42)
'''

In [102]:
#Random forest model predictions
pred_xgb_1=xgb_1.predict(X_test_1)

### Logistic Regression

Analogous to model A0

In [103]:
logreg_1=LogisticRegressionCV(max_iter=10000, cv=tscv) 
#logistic regression with time series Cross Validation

logreg_1.fit(X_train_1, y_train_1)
pred_logreg_1=logreg_1.predict(X_test_1) #predictions for the logistic regression