## Plain logistic regression isn't looking promising

In [1]:
import pandas as pd
import numpy as np
import statsmodels.api as sm


# Load data
data = pd.read_csv("./out/d_fight_level_dataset_1line.csv", index_col = 0)

# Recode Winner as binary: -1 becomes 0, everything else becomes 1
data["Winner"] = np.where(data["Winner"] == -1, 0, 1)

# Initial features and target
features = pd.Series(data.columns, index = data.columns)
target = "Winner"

# Remove Referee, date, location, Winner, title_bout, weight_class, no_of_rounds
features.drop(index = ["Referee", "date", "location", "Winner", "title_bout",
                       "weight_class", "no_of_rounds"], inplace = True)

# Diff_draw is mostly NA/0
features.drop(index = "Diff_draw", inplace = True)

# Drop the many Diff_win_by_* columns
features.drop(index = ["Diff_win_by_Decision_Majority",
                       "Diff_win_by_Decision_Split",
                       "Diff_win_by_Decision_Unanimous",
                       "Diff_win_by_KO/TKO",
                       "Diff_win_by_Submission",
                       "Diff_win_by_TKO_Doctor_Stoppage"], inplace = True)

If we just throw everything in, we get lots of bad p-values and a pseudo R-squared close to zero (0.009).

In [2]:
# Naive approach: include every remaining feature
logit_model = sm.Logit(data[target], data[features])
result = logit_model.fit()
print(result.summary2())


Optimization terminated successfully.
         Current function value: 0.652126
         Iterations 5
                                Results: Logit
Model:                   Logit                 Pseudo R-squared:      0.009    
Dependent Variable:      Winner                AIC:                   3090.0410
Date:                    2019-11-17 14:06      BIC:                   3296.9234
No. Observations:        2314                  Log-Likelihood:        -1509.0  
Df Model:                35                    LL-Null:               -1522.0  
Df Residuals:            2278                  LLR p-value:           0.86368  
Converged:               1.0000                Scale:                 1.0000   
No. Iterations:          5.0000                                                
-------------------------------------------------------------------------------
                                 Coef.  Std.Err.    z    P>|z|   [0.025  0.975]
---------------------------------------------------
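
As a quick sanity check on the summary above: statsmodels reports McFadden's pseudo R-squared, which is one minus the ratio of the model log-likelihood to the null log-likelihood. The sketch below reproduces the printed value from the rounded log-likelihoods in that summary; the helper name `mcfadden_r2` is hypothetical and only used for illustration.

```python
def mcfadden_r2(ll_model: float, ll_null: float) -> float:
    """McFadden's pseudo R-squared: 1 - LL(model) / LL(null)."""
    return 1.0 - ll_model / ll_null


# Rounded log-likelihoods copied from the summary above
r2_full = mcfadden_r2(-1509.0, -1522.0)
print(round(r2_full, 3))  # 0.009
```

A value this close to zero means the fitted model barely improves on the null log-likelihood.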

Remove most of the insignificant features (p-value above 0.15) to see if something looks better. The only problem is that there isn't much predictive value regardless: the pseudo R-squared drops to 0.001.

In [3]:
# Drop every feature whose p-value exceeds 0.15
pvalues = result.pvalues
features_adj = features.drop(index = pvalues.index[pvalues > .15])
logit_model = sm.Logit(data[target], data[features_adj])
result = logit_model.fit()
print(result.summary2())

Optimization terminated successfully.
         Current function value: 0.656807
         Iterations 5
                                Results: Logit
Model:                   Logit                 Pseudo R-squared:      0.001    
Dependent Variable:      Winner                AIC:                   3069.7042
Date:                    2019-11-17 14:06      BIC:                   3155.9052
No. Observations:        2314                  Log-Likelihood:        -1519.9  
Df Model:                14                    LL-Null:               -1522.0  
Df Residuals:            2299                  LLR p-value:           0.99272  
Converged:               1.0000                Scale:                 1.0000   
No. Iterations:          5.0000                                                
-------------------------------------------------------------------------------
                                 Coef.  Std.Err.    z    P>|z|   [0.025  0.975]
---------------------------------------------------
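
The LLR p-value in the summary above comes from a likelihood-ratio test: twice the gap between the model and null log-likelihoods is compared with a chi-square distribution on `Df Model` degrees of freedom. The sketch below redoes that arithmetic from the rounded values printed above. The helper `chi2_sf_even_df` is hypothetical; it uses a closed form that only holds for even degrees of freedom, chosen here to avoid extra dependencies.

```python
import math


def chi2_sf_even_df(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variable with even df.

    For df = 2k this equals exp(-x/2) * sum over i < k of (x/2)**i / i!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only covers positive even df")
    half = x / 2.0
    return math.exp(-half) * sum(
        half**i / math.factorial(i) for i in range(df // 2)
    )


# Rounded values copied from the summary above
ll_model, ll_null, df_model = -1519.9, -1522.0, 14

llr = 2.0 * (ll_model - ll_null)  # likelihood-ratio statistic
p_value = chi2_sf_even_df(llr, df_model)
print(round(llr, 1), round(p_value, 2))
```

With these rounded inputs the result is about 0.99, in line with the reported 0.99272; the small gap comes from rounding in the printed log-likelihoods. `scipy.stats.chi2.sf` computes the same tail probability for any degrees of freedom.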

Maybe the effects differ by weight class? Fitting the same reduced model separately for each weight class with at least 100 fights shows more promise in some cases (though probably not enough for a betting strategy).

In [4]:
# Keep only weight classes with at least 100 fights
counts = data.weight_class.value_counts()
classes = counts[counts >= 100]

# Fit the reduced model separately for each remaining weight class
for weight_class in classes.index:
    df = data.loc[data.weight_class == weight_class]
    print("Class: " + weight_class)
    logit_model = sm.Logit(df[target], df[features_adj])
    result = logit_model.fit()
    print(result.summary2())

Class: Lightweight
Optimization terminated successfully.
         Current function value: 0.634156
         Iterations 5
                                Results: Logit
Model:                   Logit                 Pseudo R-squared:      0.031   
Dependent Variable:      Winner                AIC:                   642.5950
Date:                    2019-11-17 14:06      BIC:                   705.2952
No. Observations:        483                   Log-Likelihood:        -306.30 
Df Model:                14                    LL-Null:               -316.24 
Df Residuals:            468                   LLR p-value:           0.13382 
Converged:               1.0000                Scale:                 1.0000  
No. Iterations:          5.0000                                               
------------------------------------------------------------------------------
                                 Coef.  Std.Err.    z    P>|z|   [0.025 0.975]
------------------------------------------

                                Results: Logit
Model:                    Logit                Pseudo R-squared:     0.131     
Dependent Variable:       Winner               AIC:                  272.3590  
Date:                     2019-11-17 14:06     BIC:                  322.4220  
No. Observations:         208                  Log-Likelihood:       -121.18   
Df Model:                 14                   LL-Null:              -139.49   
Df Residuals:             193                  LLR p-value:          0.00084439
Converged:                1.0000               Scale:                1.0000    
No. Iterations:           7.0000                                               
-------------------------------------------------------------------------------
                                 Coef.  Std.Err.    z    P>|z|   [0.025  0.975]
-------------------------------------------------------------------------------
Diff_age                         0.0017   0.0419  0.0414 0.9670 -0.0803  