## Plain logistic regression isn't looking promising

In [20]:
import pandas as pd
import numpy as np
import statsmodels.api as sm

# read functions.
import os
for f in os.listdir('../fun/'): exec(open('../fun/'+f).read())
del f

# Load data
load( '../out/d3-fight-level-standardize-normalize.pkl' )

X = pd.DataFrame(X)
X.columns = cols

# Change winner to binary 1/0:
y[ y == -1 ] = 0
X.shape

(2305, 155)

If we just throw everything in, we get lots of bad p-values (and a pretty bad pseudo R-squared).

In [21]:
# Naive throw everything in
logit_model = sm.Logit( y, X )
result = logit_model.fit()
print(result.summary2())

Optimization terminated successfully.
         Current function value: 0.559665
         Iterations 13
                                          Results: Logit
Model:                         Logit                       Pseudo R-squared:            0.149     
Dependent Variable:            Winner                      AIC:                         2886.0542 
Date:                          2019-11-20 20:16            BIC:                         3764.7081 
No. Observations:              2305                        Log-Likelihood:              -1290.0   
Df Model:                      152                         LL-Null:                     -1516.3   
Df Residuals:                  2152                        LLR p-value:                 1.2890e-31
Converged:                     1.0000                      Scale:                       1.0000    
No. Iterations:                13.0000                                                            
------------------------------------------------

  return 1/(1+np.exp(-X))
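The headline numbers in the summary can be checked by hand. The sketch below assumes that the "Current function value" is the average negative log-likelihood per observation, that the pseudo R-squared is McFadden's (1 - LL / LL-Null), and that the AIC counts "Df Model" plus one parameters; the variable names are made up for illustration. Under those assumptions the arithmetic agrees with the printed values up to rounding.

```python
# Numbers copied from the Logit summary above.
n_obs = 2305
mean_neg_loglik = 0.559665  # "Current function value"
ll_null = -1516.3           # "LL-Null"
n_params = 152 + 1          # "Df Model" plus one (assumed parameter count for AIC)

# Assumption: the optimizer reports the average negative log-likelihood,
# so the total log-likelihood is -n_obs times that value.
log_lik = -n_obs * mean_neg_loglik

# McFadden's pseudo R-squared: 1 - LL / LL-Null.
pseudo_r2 = 1 - log_lik / ll_null

# AIC = 2k - 2 LL.
aic = 2 * n_params - 2 * log_lik

print("log-likelihood:", round(log_lik, 1))
print("pseudo R-squared:", round(pseudo_r2, 3))
print("AIC:", round(aic, 1))
```

A pseudo R-squared of about 0.15 means the fitted log-likelihood is only about 15% closer to zero than that of the null model.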


In [22]:
# make predictions and check recall, precision, f1 score.

from sklearn.metrics import confusion_matrix, classification_report, f1_score, precision_score, recall_score

pred = result.predict()
print( 
    'Mean wins: %s \nMean predict: %s\n' % ( 
    y.mean(),
    pred.mean()
))

# what is our base level if we predict the majority?
print( 'Accuracy predicting all wins:\n')
print( classification_report( 
    y, 
    [ 1 for x in pred ]
))

# what is the outcome of different cutoffs?
print( 'Accuracy with varying cutoffs:\n' )
for i in range(11): 
    
    icutoff = i/10
    
    # predicted wins, and the mirrored labels with losses as the positive class
    predwin = [ 1 if x > icutoff else 0 for x in pred ]
    predloss = [ 0 if x > icutoff else 1 for x in pred ]
    
    fscorewin = f1_score( y, predwin )
    fscoreloss = f1_score( ( y == 0 ) * 1, predloss )    
    prec = precision_score( y, predwin )
    recall = recall_score( y, predwin )
    
    # the printed f1-score is the unweighted mean of the win and loss f1-scores
    print(
        '%s: \t f1-score: %s   \t precision %s   \t recall: %s' % ( 
            icutoff, 
            round( (fscorewin + fscoreloss) / 2, 2 ),
            round( prec, 2 ),
            round( recall, 2 )
    ))

Mean wins: 0.6321041214750542 
Mean predict: 0.6306435377214473

Accuracy predicting all wins:

              precision    recall  f1-score   support

           0       0.00      0.00      0.00       848
           1       0.63      1.00      0.77      1457

    accuracy                           0.63      2305
   macro avg       0.32      0.50      0.39      2305
weighted avg       0.40      0.63      0.49      2305

Accuracy with varying cutoffs:

0.0: 	 f1-score: 0.39   	 precision 0.63   	 recall: 1.0
0.1: 	 f1-score: 0.4   	 precision 0.63   	 recall: 1.0
0.2: 	 f1-score: 0.43   	 precision 0.64   	 recall: 0.99
0.3: 	 f1-score: 0.53   	 precision 0.67   	 recall: 0.97
0.4: 	 f1-score: 0.6   	 precision 0.69   	 recall: 0.93
0.5: 	 f1-score: 0.66   	 precision 0.73   	 recall: 0.83
0.6: 	 f1-score: 0.68   	 precision 0.78   	 recall: 0.72
0.7: 	 f1-score: 0.65   	 precision 0.84   	 recall: 0.56
0.8: 	 f1-score: 0.55   	 precision 0.89   	 recall: 0.35
0.9: 	 f1-score: 0.4   	 pr
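The all-wins baseline and the 0.0 cutoff row can be reproduced from the class counts alone (848 losses, 1457 wins). This is a minimal sketch with a hypothetical helper, `precision_recall_f1`, that applies the standard definitions to raw counts. It is not how scikit-learn computes them internally.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and f1 from raw counts; 0.0 where undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Always predicting a win: every win is a true positive, every loss a false positive.
win_precision, win_recall, win_f1 = precision_recall_f1(tp=1457, fp=848, fn=0)

# The loss class is never predicted, so all 848 losses are false negatives.
loss_precision, loss_recall, loss_f1 = precision_recall_f1(tp=0, fp=0, fn=848)

# Unweighted mean over both classes, as in the cutoff loop.
macro_f1 = (win_f1 + loss_f1) / 2

print(round(win_precision, 2), round(win_recall, 2), round(win_f1, 2))
print(round(macro_f1, 2))
```

The averaged f1-score of 0.39 at a cutoff of 0.0 is therefore just half of the win-class f1-score of 0.77, because the loss class contributes zero.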

A cutoff of around 0.5 gives a win rate well above the 63% base rate
while still participating in a large number of fights.
At that cutoff we capture 83% of the wins (recall) and win 73% of the time (precision).
Strangely though, we could win 63% of the time and capture 100% of the wins by
always betting to win.
I guess we need to think about betting and what makes the most sense.
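One way to start thinking about betting is to compare precision with the break-even win rate implied by the odds. The sketch below is only illustrative. The decimal odds of 1.5 are a made-up value, both function names are hypothetical, and it assumes a flat one-unit stake at the same odds on every bet, which real fights will not offer.

```python
def expected_profit_per_unit(p_win, decimal_odds):
    """Expected profit of a one-unit bet that pays decimal_odds with probability p_win."""
    return p_win * decimal_odds - 1.0

def break_even_probability(decimal_odds):
    """Win rate at which a bet at these odds has zero expected profit."""
    return 1.0 / decimal_odds

hypothetical_odds = 1.5  # assumption for illustration, not taken from the data

# Precision at the 0.5 cutoff versus always betting on a win.
for label, precision in [("cutoff 0.5", 0.73), ("always bet win", 0.63)]:
    profit = expected_profit_per_unit(precision, hypothetical_odds)
    print(label, round(profit, 3))

print("break-even win rate:", round(break_even_probability(hypothetical_odds), 3))
```

Under these assumed odds, precision matters more than recall for profit per bet, while recall controls how many bets are placed.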

In [23]:
# train-test split.
from sklearn.model_selection import train_test_split
X_train , X_test, y_train, y_test = train_test_split(
    X, y, 
    random_state = 729,
    test_size = 0.25
)

# fit on train and get in-model accuracy.
m = sm.Logit( y_train, X_train ).fit()

Optimization terminated successfully.
         Current function value: 0.545037
         Iterations 14


In [24]:
print( 
    classification_report( 
        y_train, [ 1 if x > 0.5 else 0 for x in m.predict(X_train) ] 
))

              precision    recall  f1-score   support

           0       0.63      0.50      0.56       644
           1       0.74      0.83      0.78      1084

    accuracy                           0.71      1728
   macro avg       0.68      0.66      0.67      1728
weighted avg       0.70      0.71      0.70      1728



In [25]:
print( 
    classification_report( 
        y_test, [ 1 if x > 0.5 else 0 for x in m.predict(X_test) ] 
))

              precision    recall  f1-score   support

           0       0.49      0.43      0.45       204
           1       0.71      0.75      0.73       373

    accuracy                           0.64       577
   macro avg       0.60      0.59      0.59       577
weighted avg       0.63      0.64      0.63       577



Test accuracy (0.64) is within about seven points of in-model accuracy (0.71), so there doesn't seem to be severe overfitting, although the gap is not negligible.

The f1-score is lower than the test f1-score of the decision tree, though.
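The train and test reports can also be set against the majority-class baseline of each split, using the support counts printed above. This is a small sketch. The accuracies are the two-decimal values from the reports, so the comparison is only approximate.

```python
# Support counts and accuracies copied from the two classification reports.
support = {
    "train": {"loss": 644, "win": 1084},
    "test": {"loss": 204, "win": 373},
}
accuracy = {"train": 0.71, "test": 0.64}

baseline = {}
for split, counts in support.items():
    # Accuracy obtained by always predicting the more common class.
    baseline[split] = max(counts.values()) / sum(counts.values())
    lift = accuracy[split] - baseline[split]
    print(split, "baseline:", round(baseline[split], 3), "lift:", round(lift, 3))
```

On these numbers the model beats the baseline by roughly eight points in-sample but is about level with it on the test split.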

Remove most of the insignificant features to see if something looks better. The only problem is that there isn't much predictive value regardless.

In [7]:
# prior model with incomplete dataset:

data = pd.read_csv("../out/d_fight_level_dataset_1line.csv", index_col = 0)

# Change winner to binary 1/0:
data.Winner = data.Winner.apply(lambda x: np.where(x == -1, 0, 1))

# Initial features and target
features = pd.Series(data.columns, index = data.columns)
target = "Winner"

# Remove referee, date, location, winner, title_bout, weight_class, no_of_rounds
features.drop(index = ["Referee", "date", "location", "Winner", "title_bout",
                       "weight_class", "no_of_rounds"], inplace = True)

# Diff_draw is mostly NA/0
features.drop(index = "Diff_draw", inplace = True)

# Lots of win columns
features.drop(index = ["Diff_win_by_Decision_Majority",
                       "Diff_win_by_Decision_Split",
                       "Diff_win_by_Decision_Unanimous",
                       "Diff_win_by_KO/TKO",
                       "Diff_win_by_Submission",
                       "Diff_win_by_TKO_Doctor_Stoppage"], inplace = True)

# Naive throw everything in
logit_model = sm.Logit( data[target], data[features])
result = logit_model.fit()
print(result.summary2())

Optimization terminated successfully.
         Current function value: 0.652126
         Iterations 5
                                Results: Logit
Model:                   Logit                 Pseudo R-squared:      0.009    
Dependent Variable:      Winner                AIC:                   3090.0410
Date:                    2019-11-20 19:43      BIC:                   3296.9234
No. Observations:        2314                  Log-Likelihood:        -1509.0  
Df Model:                35                    LL-Null:               -1522.0  
Df Residuals:            2278                  LLR p-value:           0.86368  
Converged:               1.0000                Scale:                 1.0000   
No. Iterations:          5.0000                                                
-------------------------------------------------------------------------------
                                 Coef.  Std.Err.    z    P>|z|   [0.025  0.975]
---------------------------------------------------

In [8]:
# drop features whose p-value in the full model exceeds 0.15
coef_table = result.summary2().tables[1]
features_adj = features.drop(index = coef_table.index[coef_table["P>|z|"] > .15])
logit_model = sm.Logit(data[target], data[features_adj])
result = logit_model.fit()
print(result.summary2())

Optimization terminated successfully.
         Current function value: 0.656807
         Iterations 5
                                Results: Logit
Model:                   Logit                 Pseudo R-squared:      0.001    
Dependent Variable:      Winner                AIC:                   3069.7042
Date:                    2019-11-20 19:43      BIC:                   3155.9052
No. Observations:        2314                  Log-Likelihood:        -1519.9  
Df Model:                14                    LL-Null:               -1522.0  
Df Residuals:            2299                  LLR p-value:           0.99272  
Converged:               1.0000                Scale:                 1.0000   
No. Iterations:          5.0000                                                
-------------------------------------------------------------------------------
                                 Coef.  Std.Err.    z    P>|z|   [0.025  0.975]
---------------------------------------------------

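Maybe the effects differ by weight class? The per-class fits have a higher pseudo R-squared in some cases (0.031 for Lightweight and 0.071 for Heavyweight, versus 0.001 overall), but their LLR p-values (0.13 and 0.25) are still not significant, so this is probably not effective enough for a betting strategy.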

In [9]:
# keep only weight classes with at least 100 fights
counts = data.weight_class.value_counts()
classes = counts[counts >= 100]

for weight_class in classes.index:
    df = data.loc[data.weight_class == weight_class]
    print("Class: " + weight_class)
    logit_model = sm.Logit(df[target], df[features_adj])
    result = logit_model.fit()
    print(result.summary2())

Class: Lightweight
Optimization terminated successfully.
         Current function value: 0.634156
         Iterations 5
                                Results: Logit
Model:                   Logit                 Pseudo R-squared:      0.031   
Dependent Variable:      Winner                AIC:                   642.5950
Date:                    2019-11-20 19:43      BIC:                   705.2952
No. Observations:        483                   Log-Likelihood:        -306.30 
Df Model:                14                    LL-Null:               -316.24 
Df Residuals:            468                   LLR p-value:           0.13382 
Converged:               1.0000                Scale:                 1.0000  
No. Iterations:          5.0000                                               
------------------------------------------------------------------------------
                                 Coef.  Std.Err.    z    P>|z|   [0.025 0.975]
------------------------------------------


Class: Heavyweight
Optimization terminated successfully.
         Current function value: 0.596961
         Iterations 6
                                Results: Logit
Model:                    Logit                 Pseudo R-squared:      0.071   
Dependent Variable:       Winner                AIC:                   253.2636
Date:                     2019-11-20 19:43      BIC:                   301.7302
No. Observations:         187                   Log-Likelihood:        -111.63 
Df Model:                 14                    LL-Null:               -120.15 
Df Residuals:             172                   LLR p-value:           0.25423 
Converged:                1.0000                Scale:                 1.0000  
No. Iterations:           6.0000                                               
-------------------------------------------------------------------------------
                                 Coef.  Std.Err.    z    P>|z|   [0.025  0.975]
-------------------------------
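The LLR p-values in the 14-feature summaries can be approximately reproduced from the printed log-likelihoods. The sketch below assumes the test statistic is 2 * (LL - LL-Null), compared against a chi-square distribution with "Df Model" degrees of freedom. To avoid extra dependencies it uses the closed-form chi-square survival function, which only holds for even degrees of freedom. The helper names are hypothetical, and the inputs are the rounded values from the summaries, so small differences from the printed p-values are expected.

```python
import math

def chi2_sf_even_df(x, df):
    """P(X > x) for a chi-square variable with a positive, even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    # Sum of (x/2)^k / k! for k = 0 .. df/2 - 1.
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

def llr_p_value(log_lik, ll_null, df_model):
    """Likelihood-ratio test of the fitted model against the null model."""
    stat = 2.0 * (log_lik - ll_null)
    return chi2_sf_even_df(stat, df_model)

# (Log-Likelihood, LL-Null) copied from the summaries with Df Model = 14.
summaries = {
    "All classes (filtered)": (-1519.9, -1522.0),
    "Lightweight": (-306.30, -316.24),
    "Heavyweight": (-111.63, -120.15),
}
p_values = {name: llr_p_value(ll, ll0, 14) for name, (ll, ll0) in summaries.items()}
for name, p in p_values.items():
    print(name, round(p, 3))
```

None of the three p-values falls below a conventional 0.05 threshold, so the higher pseudo R-squared in the smaller weight classes may reflect fitting 14 features to fewer fights.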