In this notebook, we will attempt to predict the outcome of fights before they happen based on the previous records of the fighters.

TODO:  use train_test_split to split the data into training and testing sets.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
import fighters_cleanser
import constants

In [131]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [22]:
fights = fighters_cleanser.load_cleanse_and_merge(constants.DEFAULT_FIGHTERS_FILE_NAME, constants.DEFAULT_FIGHTS_FILE_NAME)
fights.shape

(4393, 142)

In [23]:
fights.head()

Unnamed: 0,r_fighter,b_fighter,r_kd,b_kd,r_sig_str,b_sig_str,r_sig_str_pct,b_sig_str_pct,r_total_str,b_total_str,...,age_diff,height_diff,weight_diff,reach_diff,r_prior_wins,r_prior_losses,r_prior_ties,b_prior_wins,b_prior_losses,b_prior_ties
0,Henry Cejudo,Marlon Moraes,0,0,90 of 171,57 of 119,52,47,99 of 182,59 of 121,...,1.210959,-2,0,-3,12,6,0,8,5,0
1,Jimmie Rivera,Marlon Moraes,0,1,0 of 3,7 of 9,0,77,0 of 3,7 of 9,...,-1.175342,-2,0,1,9,4,0,6,5,0
2,John Dodson,Marlon Moraes,1,0,43 of 105,45 of 131,40,34,47 of 109,45 of 131,...,3.583562,-3,0,-1,12,7,0,4,5,0
3,Raphael Assuncao,Marlon Moraes,0,1,2 of 12,10 of 23,16,43,3 of 13,12 of 25,...,5.775342,-1,0,-1,15,6,0,7,5,0
4,Raphael Assuncao,Marlon Moraes,0,0,43 of 134,44 of 150,32,29,43 of 134,44 of 150,...,5.775342,-1,0,-1,12,6,0,4,4,0


A few ways to do it.

1. Feed it into a DNN model, the hot thing in ML these days.

2. Logistic regression.

3. Just look at the record and go with the one with a higher percentage.

4. A slightly more advanced version of 3, with some sort of Bayesian calculation.

The first two are the two things I have done elsewhere in this project.  They have some potential to work, but I don't like them as much.  I guess at least part of that is that they are not "explainable."

3 is pretty shallow and probably won't work, because someone with little to no previous record can't realistically be predicted this way.

4 is my favorite, but will require some thinking about how to go about assigning probabilities to people with little to no past record.

I think I will start with 4 actually.

The approach for 4 will be that everyone starts with some fictitious record, with an even number of wins and losses and possibly a tie.  Something like 4-4, but we will experiment with it.  This way, if the red fighter is 10-0 and the blue fighter is 0-10, the model will still show some small probability that blue will win.  But the more fights each fighter has had, the less the fictitious record matters.
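
The fictitious-record idea amounts to additive smoothing of each fighter's win rate.  Here is a minimal sketch of it.  The helper name `smoothed_win_rate` and the 4-4 default are assumptions for illustration, not something defined elsewhere in this project.

```python
def smoothed_win_rate(wins, losses, ties=0, prior_wins=4, prior_losses=4):
    """Win rate after adding a fictitious even record to the real one.

    prior_wins and prior_losses are assumed values meant to be experimented with.
    """
    total = wins + losses + ties + prior_wins + prior_losses
    if total == 0:
        # No real fights and no fictitious ones: fall back to a coin flip.
        return 0.5
    return (wins + prior_wins) / total


# A 10-0 fighter is pulled toward 50%, and a 0-10 fighter keeps some chance.
red = smoothed_win_rate(10, 0)     # (10 + 4) / 18
blue = smoothed_win_rate(0, 10)    # (0 + 4) / 18

# With a longer record the fictitious fights matter less.
veteran = smoothed_win_rate(100, 0)  # (100 + 4) / 108
```

Whether 4-4 is a good starting record is an open question.  The point of the sketch is only that the starting record shrinks in importance as real fights accumulate.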

In [24]:
# remove some columns we don't need
#list(fights.columns)
columns_to_use = ['r_fighter','b_fighter','r_prior_wins', 'r_prior_losses', 'r_prior_ties', 'b_prior_wins', 'b_prior_losses', 'b_prior_ties', 'winner', 'loser', 'r_b_winner']
fights = fights[columns_to_use]
fights.head()

Unnamed: 0,r_fighter,b_fighter,r_prior_wins,r_prior_losses,r_prior_ties,b_prior_wins,b_prior_losses,b_prior_ties,winner,loser,r_b_winner
0,Henry Cejudo,Marlon Moraes,12,6,0,8,5,0,Henry Cejudo,Marlon Moraes,r
1,Jimmie Rivera,Marlon Moraes,9,4,0,6,5,0,Marlon Moraes,Jimmie Rivera,b
2,John Dodson,Marlon Moraes,12,7,0,4,5,0,Marlon Moraes,John Dodson,b
3,Raphael Assuncao,Marlon Moraes,15,6,0,7,5,0,Marlon Moraes,Raphael Assuncao,b
4,Raphael Assuncao,Marlon Moraes,12,6,0,4,4,0,Raphael Assuncao,Marlon Moraes,r


In [25]:
#fights.to_csv('fights_backup.csv', index=False)

In [26]:
#fights_backup = pd.read_csv('fights_backup.csv')
#fights_backup.head()

In [27]:
def add_probs_for_row(row):
    total_prior_fights = row[['r_prior_wins', 'r_prior_losses', 'r_prior_ties', 'b_prior_wins', 'b_prior_losses', 'b_prior_ties']].sum()
    row['r_win_prob'] = row.r_prior_wins / total_prior_fights
    row['r_loss_prob'] = row.r_prior_losses / total_prior_fights
    row['r_tie_prob'] = row.r_prior_ties / total_prior_fights
    row['b_win_prob'] = row.b_prior_wins / total_prior_fights
    row['b_loss_prob'] = row.b_prior_losses / total_prior_fights
    row['b_tie_prob'] = row.b_prior_ties / total_prior_fights
    return row
    

In [28]:
def add_probs(fights_df):
    fights_df['r_win_prob'] = 0
    fights_df['r_loss_prob'] = 0
    fights_df['r_tie_prob'] = 0
    fights_df['b_win_prob'] = 0
    fights_df['b_loss_prob'] = 0
    fights_df['b_tie_prob'] = 0
    fights_df = fights_df.apply(lambda row: add_probs_for_row(row), axis=1)
    return fights_df

In [29]:
fights = add_probs(fights)

In [30]:
fights.head()

Unnamed: 0,r_fighter,b_fighter,r_prior_wins,r_prior_losses,r_prior_ties,b_prior_wins,b_prior_losses,b_prior_ties,winner,loser,r_b_winner,r_win_prob,r_loss_prob,r_tie_prob,b_win_prob,b_loss_prob,b_tie_prob
0,Henry Cejudo,Marlon Moraes,12,6,0,8,5,0,Henry Cejudo,Marlon Moraes,r,0.387097,0.193548,0.0,0.258065,0.16129,0.0
1,Jimmie Rivera,Marlon Moraes,9,4,0,6,5,0,Marlon Moraes,Jimmie Rivera,b,0.375,0.166667,0.0,0.25,0.208333,0.0
2,John Dodson,Marlon Moraes,12,7,0,4,5,0,Marlon Moraes,John Dodson,b,0.428571,0.25,0.0,0.142857,0.178571,0.0
3,Raphael Assuncao,Marlon Moraes,15,6,0,7,5,0,Marlon Moraes,Raphael Assuncao,b,0.454545,0.181818,0.0,0.212121,0.151515,0.0
4,Raphael Assuncao,Marlon Moraes,12,6,0,4,4,0,Raphael Assuncao,Marlon Moraes,r,0.461538,0.230769,0.0,0.153846,0.153846,0.0


In [31]:
add_probs_for_row(fights.iloc[0])

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until

r_fighter          Henry Cejudo
b_fighter         Marlon Moraes
r_prior_wins                 12
r_prior_losses                6
r_prior_ties                  0
b_prior_wins                  8
b_prior_losses                5
b_prior_ties                  0
winner             Henry Cejudo
loser             Marlon Moraes
r_b_winner                    r
r_win_prob             0.387097
r_loss_prob            0.193548
r_tie_prob                    0
b_win_prob             0.258065
b_loss_prob             0.16129
b_tie_prob                    0
Name: 0, dtype: object

In [32]:
12 + 6 + 8 + 5

31

In [33]:
12/31, 6/31, 8/31,5/31

(0.3870967741935484,
 0.1935483870967742,
 0.25806451612903225,
 0.16129032258064516)

In [34]:
fights['r_win_prob'] = 0
fights['r_loss_prob'] = 0
fights['r_tie_prob'] = 0
fights['b_win_prob'] = 0
fights['b_loss_prob'] = 0
fights['b_tie_prob'] = 0

In [35]:
fights.head()

Unnamed: 0,r_fighter,b_fighter,r_prior_wins,r_prior_losses,r_prior_ties,b_prior_wins,b_prior_losses,b_prior_ties,winner,loser,r_b_winner,r_win_prob,r_loss_prob,r_tie_prob,b_win_prob,b_loss_prob,b_tie_prob
0,Henry Cejudo,Marlon Moraes,12,6,0,8,5,0,Henry Cejudo,Marlon Moraes,r,0,0,0,0,0,0
1,Jimmie Rivera,Marlon Moraes,9,4,0,6,5,0,Marlon Moraes,Jimmie Rivera,b,0,0,0,0,0,0
2,John Dodson,Marlon Moraes,12,7,0,4,5,0,Marlon Moraes,John Dodson,b,0,0,0,0,0,0
3,Raphael Assuncao,Marlon Moraes,15,6,0,7,5,0,Marlon Moraes,Raphael Assuncao,b,0,0,0,0,0,0
4,Raphael Assuncao,Marlon Moraes,12,6,0,4,4,0,Raphael Assuncao,Marlon Moraes,r,0,0,0,0,0,0


In [36]:
add_probs_for_row(fights.iloc[0])

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until

r_fighter          Henry Cejudo
b_fighter         Marlon Moraes
r_prior_wins                 12
r_prior_losses                6
r_prior_ties                  0
b_prior_wins                  8
b_prior_losses                5
b_prior_ties                  0
winner             Henry Cejudo
loser             Marlon Moraes
r_b_winner                    r
r_win_prob             0.387097
r_loss_prob            0.193548
r_tie_prob                    0
b_win_prob             0.258065
b_loss_prob             0.16129
b_tie_prob                    0
Name: 0, dtype: object

In [37]:
fights = fights.apply(lambda row: add_probs_for_row(row), axis=1)

In [38]:
fights.head()

Unnamed: 0,r_fighter,b_fighter,r_prior_wins,r_prior_losses,r_prior_ties,b_prior_wins,b_prior_losses,b_prior_ties,winner,loser,r_b_winner,r_win_prob,r_loss_prob,r_tie_prob,b_win_prob,b_loss_prob,b_tie_prob
0,Henry Cejudo,Marlon Moraes,12,6,0,8,5,0,Henry Cejudo,Marlon Moraes,r,0.387097,0.193548,0.0,0.258065,0.16129,0.0
1,Jimmie Rivera,Marlon Moraes,9,4,0,6,5,0,Marlon Moraes,Jimmie Rivera,b,0.375,0.166667,0.0,0.25,0.208333,0.0
2,John Dodson,Marlon Moraes,12,7,0,4,5,0,Marlon Moraes,John Dodson,b,0.428571,0.25,0.0,0.142857,0.178571,0.0
3,Raphael Assuncao,Marlon Moraes,15,6,0,7,5,0,Marlon Moraes,Raphael Assuncao,b,0.454545,0.181818,0.0,0.212121,0.151515,0.0
4,Raphael Assuncao,Marlon Moraes,12,6,0,4,4,0,Raphael Assuncao,Marlon Moraes,r,0.461538,0.230769,0.0,0.153846,0.153846,0.0


In [39]:
fights.r_won

AttributeError: 'DataFrame' object has no attribute 'r_won'

In [43]:
fights['r_guess'] = fights.r_win_prob > fights.b_win_prob

In [44]:
fights['guess'] = fights.r_guess.apply(lambda x: 'r' if x else 'b')

In [45]:
fights.guess

0       r
1       r
2       r
3       r
4       r
       ..
4388    r
4389    b
4390    b
4391    b
4392    b
Name: guess, Length: 4393, dtype: object

In [46]:
(fights.r_b_winner == fights.guess).sum()

2147

In [47]:
fights.shape[0]

4393

In [48]:
(fights.r_b_winner == fights.guess).sum() / fights.shape[0]

0.4887320737536991

In [49]:
fights.head()

Unnamed: 0,r_fighter,b_fighter,r_prior_wins,r_prior_losses,r_prior_ties,b_prior_wins,b_prior_losses,b_prior_ties,winner,loser,r_b_winner,r_win_prob,r_loss_prob,r_tie_prob,b_win_prob,b_loss_prob,b_tie_prob,r_guess,guess
0,Henry Cejudo,Marlon Moraes,12,6,0,8,5,0,Henry Cejudo,Marlon Moraes,r,0.387097,0.193548,0.0,0.258065,0.16129,0.0,True,r
1,Jimmie Rivera,Marlon Moraes,9,4,0,6,5,0,Marlon Moraes,Jimmie Rivera,b,0.375,0.166667,0.0,0.25,0.208333,0.0,True,r
2,John Dodson,Marlon Moraes,12,7,0,4,5,0,Marlon Moraes,John Dodson,b,0.428571,0.25,0.0,0.142857,0.178571,0.0,True,r
3,Raphael Assuncao,Marlon Moraes,15,6,0,7,5,0,Marlon Moraes,Raphael Assuncao,b,0.454545,0.181818,0.0,0.212121,0.151515,0.0,True,r
4,Raphael Assuncao,Marlon Moraes,12,6,0,4,4,0,Raphael Assuncao,Marlon Moraes,r,0.461538,0.230769,0.0,0.153846,0.153846,0.0,True,r


So this has a less than 50% accuracy.  Not what I was hoping for.

How often is the one with a greater prior win percentage the one who is labeled "Red"?

In [53]:
(fights['r_win_prob'] > fights['b_win_prob']).sum() / fights.shape[0]

0.516958798087867

So only a little over half the time.

So...how do they determine who is Red and who is Blue?  We already know that Red wins something like 2/3 of the time, so the corner assignment itself must have some predictive value.
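
Since Red wins more often than not, any record-based rule has to beat the baseline of always picking Red.  A small sketch of that baseline follows.  The function name `always_red_accuracy` and the toy labels are made up for illustration.

```python
import pandas as pd


def always_red_accuracy(winner_labels):
    """Share of fights won by the red corner, i.e. the accuracy of always guessing 'r'."""
    labels = pd.Series(winner_labels)
    return (labels == 'r').mean()


# Toy labels for illustration only (not taken from the data set).
toy_labels = ['r', 'r', 'b', 'r', 'b', 'r']
baseline = always_red_accuracy(toy_labels)  # 4 of the 6 toy fights
```

In this notebook the same call would presumably be `always_red_accuracy(fights.r_b_winner)`, which gives the number the other guesses should be compared against.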

The way I did things above is, on second look, probably not the best way to calculate probabilities of winning.  It simply adds up the wins, losses and ties of both fighters and divides each number by that combined total.  Let's say for example that fighter R was 12-6-0 and B was 6-0-0.  With a fictitious 4-4 added, that would first become 16-10-0 and 10-4-0, and the probabilities would be .4, .25, 0 for R and .25, .1, 0 for B.  This means R has a greater estimated probability of winning, even though B was 6 and 0.

The thing that clued me in that something in this calculation was weird was seeing a row up there (row 2, John Dodson vs Marlon Moraes) where R had both a higher chance of winning...and of losing.  What this method calculates is, if you were to randomly select a prior fight that either of them had been in, the probability of each outcome.  It also does not factor in any previous matches between the two.  I don't think it calculates anything like the probability of winning the next match.

Probabilities for both winning and losing are weighted toward the one who has more fights in their record.
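
To make the problem concrete, here is a sketch that compares the pooled denominator with a per-fighter denominator on the 16-10-0 vs 10-4-0 example.  The index labels and variable names are only for illustration.

```python
import pandas as pd

r_rec = pd.Series([16, 10, 0], index=['wins', 'losses', 'ties'])
b_rec = pd.Series([10, 4, 0], index=['wins', 'losses', 'ties'])

# Pooled denominator (what add_probs_for_row does): all six shares sum to 1,
# so the fighter with more fights gets bigger numbers across the board.
pooled_total = r_rec.sum() + b_rec.sum()
r_pooled, b_pooled = r_rec / pooled_total, b_rec / pooled_total

# Per-fighter denominator: each fighter's own three shares sum to 1.
r_own, b_own = r_rec / r_rec.sum(), b_rec / b_rec.sum()

print(r_pooled['wins'], b_pooled['wins'])  # pooled: R has the larger win share
print(r_own['wins'], b_own['wins'])        # per fighter: B has the larger win share
```

The two denominators rank the fighters in opposite order here, which is the inconsistency described above.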

What can be done about the case where one person has more fights than the other?  Which is most fights.  I still like the idea of adding fictitious fights to the record to smooth out the randomness.

1. Not worry about it.  Once the set of fictitious fights has been added, just use the win percentage as is, even if one person has a bigger record.
2. Add a different number of fictitious fights to the one with the smaller record so that they both have the same number of fights, then do the calculation above.  This is of course an estimate and could be totally wrong, but might be a good enough estimate to work.
3. Some other more complicated math that takes into account a margin of uncertainty.
4. Abandon the fictitious fights idea and look at how people do with a 0-0-0 record, a 1-0-0 record, and so forth.  I suspect the sample size is too small and each record combination will only appear a few times.

Another way is to stop looking only at past record numbers and also look at who each person has fought - each other? opponents with good records? opponents with bad records? etc.  A lot more complicated.

In [72]:
r0 = pd.Series([12, 6, 0])
b0 = pd.Series([6, 0, 0])

In [98]:
r = pd.Series([16, 10, 0])
b = pd.Series([10, 4, 0])
r, b

(0    16
 1    10
 2     0
 dtype: int64,
 0    10
 1     4
 2     0
 dtype: int64)

In [74]:
s = r.sum() + b.sum()
s

40

In [75]:
r / s, b/s

(0    0.40
 1    0.25
 2    0.00
 dtype: float64,
 0    0.25
 1    0.10
 2    0.00
 dtype: float64)

Let's look at 2 above.

In [77]:
r0, b0

(0    12
 1     6
 2     0
 dtype: int64,
 0    6
 1    0
 2    0
 dtype: int64)

In [79]:
r0.sum(), b0.sum(), r.sum(), b.sum()

(18, 6, 26, 14)

We could add fictitious fights to B only, or add additional ones after adding the same amount to everyone.  Let's look at both.

First option

In [81]:
r0.sum() - b0.sum()

12

In [82]:
# => need to add 12 fictitious fights to B.  Just do 6 and 6.  Of course, this won't work if there is an odd number of fights.
b2 = b0 + [6, 6, 0]
b2

0    12
1     6
2     0
dtype: int64

In [83]:
# If this were the true record of both, the calculation above would work.  In this case, both would have an equal chance of winning, which I still find suspicious because I still suspect B has a greater chance.

In [84]:
# using extra fictitious fights
r, b

(0    16
 1    10
 2     0
 dtype: int64,
 0    10
 1     4
 2     0
 dtype: int64)

In [85]:
r.sum() - b.sum()

12

In [86]:
# again, give B 12 more fictitious fights

In [87]:
b3 = b + [6, 6, 0]
b3

0    16
1    10
2     0
dtype: int64

Again, they have the same chances.  I think I picked a bad example by having R have twice as many wins as B, so that when I added the fictitious fights the numbers matched up too well.  I don't know how much that matters.
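
Here is a sketch of the padding idea on a less symmetric example, one where the gap is odd.  The helper `pad_to_match` is hypothetical, and splitting an odd gap into half-fights is an assumption of this sketch rather than a settled choice.

```python
import pandas as pd


def pad_to_match(record, target_total):
    """Add evenly split fictitious wins and losses until record has target_total fights.

    record is a Series ordered [wins, losses, ties].  An odd gap is split into
    half-fights (e.g. 4.5 wins and 4.5 losses), which is an assumption of this sketch.
    """
    gap = target_total - record.sum()
    if gap <= 0:
        # Already at least as many fights as the target: nothing to add.
        return record.astype(float)
    return record + pd.Series([gap / 2, gap / 2, 0.0])


r_rec = pd.Series([12, 7, 0])  # 19 fights (made-up record)
b_rec = pd.Series([8, 2, 0])   # 10 fights, so the gap is 9

b_padded = pad_to_match(b_rec, r_rec.sum())  # becomes 12.5-6.5-0
r_pct = r_rec.iloc[0] / r_rec.sum()
b_pct = b_padded.iloc[0] / b_padded.sum()
```

With these made-up records B still comes out slightly ahead after padding, so the tie seen above does look like an artifact of the particular numbers I picked.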

Now look at option 1 above, where the initial number of fictitious fights is enough to give us confidence in their winning percentages.

In [89]:
fights

Unnamed: 0,r_fighter,b_fighter,r_prior_wins,r_prior_losses,r_prior_ties,b_prior_wins,b_prior_losses,b_prior_ties,winner,loser,r_b_winner,r_win_prob,r_loss_prob,r_tie_prob,b_win_prob,b_loss_prob,b_tie_prob,r_guess,guess
0,Henry Cejudo,Marlon Moraes,12,6,0,8,5,0,Henry Cejudo,Marlon Moraes,r,0.387097,0.193548,0.0,0.258065,0.161290,0.0,True,r
1,Jimmie Rivera,Marlon Moraes,9,4,0,6,5,0,Marlon Moraes,Jimmie Rivera,b,0.375000,0.166667,0.0,0.250000,0.208333,0.0,True,r
2,John Dodson,Marlon Moraes,12,7,0,4,5,0,Marlon Moraes,John Dodson,b,0.428571,0.250000,0.0,0.142857,0.178571,0.0,True,r
3,Raphael Assuncao,Marlon Moraes,15,6,0,7,5,0,Marlon Moraes,Raphael Assuncao,b,0.454545,0.181818,0.0,0.212121,0.151515,0.0,True,r
4,Raphael Assuncao,Marlon Moraes,12,6,0,4,4,0,Raphael Assuncao,Marlon Moraes,r,0.461538,0.230769,0.0,0.153846,0.153846,0.0,True,r
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4388,Tim Sylvia,Jeff Monson,6,6,0,5,5,0,Tim Sylvia,Jeff Monson,r,0.272727,0.272727,0.0,0.227273,0.227273,0.0,True,r
4389,Eric Schafer,Rob MacDonald,4,4,0,4,5,0,Eric Schafer,Rob MacDonald,r,0.235294,0.235294,0.0,0.235294,0.294118,0.0,False,b
4390,Jason Lambert,Rob MacDonald,4,4,0,4,4,0,Jason Lambert,Rob MacDonald,r,0.250000,0.250000,0.0,0.250000,0.250000,0.0,False,b
4391,Dale Hartt,Corey Hill,4,5,0,4,5,0,Dale Hartt,Corey Hill,r,0.222222,0.277778,0.0,0.222222,0.277778,0.0,False,b


=> Calculate prior win, loss, tie percentages.  Go with the one with the higher win percentage, or greatest difference between win and loss percentages (since ties can mean using only win percentage might be a little wrong).
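
A quick sketch of why ties matter here: with made-up records, the win percentage and the win-minus-loss difference can pick different fighters.  The two helper functions are for illustration only.

```python
def win_pct(wins, losses, ties):
    """Share of prior fights that were wins."""
    return wins / (wins + losses + ties)


def wl_diff_pct(wins, losses, ties):
    """Win share minus loss share, so ties count against neither side."""
    return (wins - losses) / (wins + losses + ties)


red = (5, 5, 0)   # made-up record with no ties
blue = (5, 3, 3)  # made-up record with several ties

pick_by_win = 'r' if win_pct(*red) > win_pct(*blue) else 'b'
pick_by_diff = 'r' if wl_diff_pct(*red) > wl_diff_pct(*blue) else 'b'
```

Here the win percentage favors red (5/10 vs 5/11) while the difference favors blue (0 vs 2/11), so the two rules only disagree when ties are involved.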

In [92]:
r_total_fights = fights.r_prior_wins + fights.r_prior_losses + fights.r_prior_ties
b_total_fights = fights.b_prior_wins + fights.b_prior_losses + fights.b_prior_ties

In [96]:
fights['r_total_fights'] = r_total_fights
fights['b_total_fights'] = b_total_fights

In [97]:
fights.head()

Unnamed: 0,r_fighter,b_fighter,r_prior_wins,r_prior_losses,r_prior_ties,b_prior_wins,b_prior_losses,b_prior_ties,winner,loser,...,r_win_prob,r_loss_prob,r_tie_prob,b_win_prob,b_loss_prob,b_tie_prob,r_guess,guess,r_total_fights,b_total_fights
0,Henry Cejudo,Marlon Moraes,12,6,0,8,5,0,Henry Cejudo,Marlon Moraes,...,0.387097,0.193548,0.0,0.258065,0.16129,0.0,True,r,18,13
1,Jimmie Rivera,Marlon Moraes,9,4,0,6,5,0,Marlon Moraes,Jimmie Rivera,...,0.375,0.166667,0.0,0.25,0.208333,0.0,True,r,13,11
2,John Dodson,Marlon Moraes,12,7,0,4,5,0,Marlon Moraes,John Dodson,...,0.428571,0.25,0.0,0.142857,0.178571,0.0,True,r,19,9
3,Raphael Assuncao,Marlon Moraes,15,6,0,7,5,0,Marlon Moraes,Raphael Assuncao,...,0.454545,0.181818,0.0,0.212121,0.151515,0.0,True,r,21,12
4,Raphael Assuncao,Marlon Moraes,12,6,0,4,4,0,Raphael Assuncao,Marlon Moraes,...,0.461538,0.230769,0.0,0.153846,0.153846,0.0,True,r,18,8


In [118]:
fights['r_win_pct'] = fights.r_prior_wins / fights.r_total_fights
fights['r_loss_pct'] = fights.r_prior_losses / fights.r_total_fights
fights['r_tie_pct'] = fights.r_prior_ties / fights.r_total_fights

fights['b_win_pct'] = fights.b_prior_wins / fights.b_total_fights
fights['b_loss_pct'] = fights.b_prior_losses / fights.b_total_fights
fights['b_tie_pct'] = fights.b_prior_ties / fights.b_total_fights

In [119]:
fights.head()

Unnamed: 0,r_fighter,b_fighter,r_prior_wins,r_prior_losses,r_prior_ties,b_prior_wins,b_prior_losses,b_prior_ties,winner,loser,...,r_guess,guess,r_total_fights,b_total_fights,r_win_pct,r_loss_pct,r_tie_pct,b_win_pct,b_loss_pct,b_tie_pct
0,Henry Cejudo,Marlon Moraes,12,6,0,8,5,0,Henry Cejudo,Marlon Moraes,...,True,r,18,13,0.666667,0.333333,0.0,0.615385,0.384615,0.0
1,Jimmie Rivera,Marlon Moraes,9,4,0,6,5,0,Marlon Moraes,Jimmie Rivera,...,True,r,13,11,0.692308,0.307692,0.0,0.545455,0.454545,0.0
2,John Dodson,Marlon Moraes,12,7,0,4,5,0,Marlon Moraes,John Dodson,...,True,r,19,9,0.631579,0.368421,0.0,0.444444,0.555556,0.0
3,Raphael Assuncao,Marlon Moraes,15,6,0,7,5,0,Marlon Moraes,Raphael Assuncao,...,True,r,21,12,0.714286,0.285714,0.0,0.583333,0.416667,0.0
4,Raphael Assuncao,Marlon Moraes,12,6,0,4,4,0,Raphael Assuncao,Marlon Moraes,...,True,r,18,8,0.666667,0.333333,0.0,0.5,0.5,0.0


In [102]:
fights.columns

Index(['r_fighter', 'b_fighter', 'r_prior_wins', 'r_prior_losses',
       'r_prior_ties', 'b_prior_wins', 'b_prior_losses', 'b_prior_ties',
       'winner', 'loser', 'r_b_winner', 'r_win_prob', 'r_loss_prob',
       'r_tie_prob', 'b_win_prob', 'b_loss_prob', 'b_tie_prob', 'r_guess',
       'guess', 'r_total_fights', 'b_total_fights', 'r_win_pct', 'r_loss_pct',
       'r_tie_pct', 'b_win_pct', 'b_loss_pct', 'b_tie_pct'],
      dtype='object')

In [103]:
fights = fights.drop(['r_win_prob', 'r_loss_prob', 'r_tie_prob', 'b_win_prob', 'b_loss_prob', 'b_tie_prob'], axis=1)

In [104]:
fights.head()

Unnamed: 0,r_fighter,b_fighter,r_prior_wins,r_prior_losses,r_prior_ties,b_prior_wins,b_prior_losses,b_prior_ties,winner,loser,...,r_guess,guess,r_total_fights,b_total_fights,r_win_pct,r_loss_pct,r_tie_pct,b_win_pct,b_loss_pct,b_tie_pct
0,Henry Cejudo,Marlon Moraes,12,6,0,8,5,0,Henry Cejudo,Marlon Moraes,...,True,r,18,13,0.666667,0.333333,0.0,0.615385,0.461538,0.0
1,Jimmie Rivera,Marlon Moraes,9,4,0,6,5,0,Marlon Moraes,Jimmie Rivera,...,True,r,13,11,0.692308,0.307692,0.0,0.545455,0.363636,0.0
2,John Dodson,Marlon Moraes,12,7,0,4,5,0,Marlon Moraes,John Dodson,...,True,r,19,9,0.631579,0.368421,0.0,0.444444,0.777778,0.0
3,Raphael Assuncao,Marlon Moraes,15,6,0,7,5,0,Marlon Moraes,Raphael Assuncao,...,True,r,21,12,0.714286,0.285714,0.0,0.583333,0.5,0.0
4,Raphael Assuncao,Marlon Moraes,12,6,0,4,4,0,Raphael Assuncao,Marlon Moraes,...,True,r,18,8,0.666667,0.333333,0.0,0.5,0.75,0.0


In [120]:
fights[fights.r_tie_pct > 0].shape[0]

340

In [121]:
fights[fights.b_tie_pct > 0].shape[0]

292

In [122]:
fights.shape[0]

4393

In [123]:
fights[fights.r_tie_pct != fights.b_tie_pct].shape[0]

583

In [124]:
fights[(fights.r_tie_pct > 0) | (fights.b_tie_pct > 0)].head(80)

Unnamed: 0,r_fighter,b_fighter,r_prior_wins,r_prior_losses,r_prior_ties,b_prior_wins,b_prior_losses,b_prior_ties,winner,loser,...,r_guess,guess,r_total_fights,b_total_fights,r_win_pct,r_loss_pct,r_tie_pct,b_win_pct,b_loss_pct,b_tie_pct
10,Alex Caceres,Sergio Pettis,8,7,1,5,4,0,Alex Caceres,Sergio Pettis,...,True,r,16,9,0.500000,0.437500,0.062500,0.555556,0.444444,0.000000
15,Yaotzin Meza,Sergio Pettis,5,5,1,5,5,0,Sergio Pettis,Yaotzin Meza,...,False,b,11,10,0.454545,0.454545,0.090909,0.500000,0.500000,0.000000
19,Demetrious Johnson,Wilson Reis,17,5,1,10,6,0,Demetrious Johnson,Wilson Reis,...,True,r,23,16,0.739130,0.217391,0.043478,0.625000,0.375000,0.000000
24,Henry Cejudo,Chico Camus,6,4,0,7,6,1,Henry Cejudo,Chico Camus,...,False,b,10,14,0.600000,0.400000,0.000000,0.500000,0.428571,0.071429
25,Brad Pickett,Chico Camus,8,8,0,6,6,1,Chico Camus,Brad Pickett,...,True,r,16,13,0.500000,0.500000,0.000000,0.461538,0.461538,0.076923
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
401,Charles Oliveira,Christos Giagos,14,12,1,5,6,0,Charles Oliveira,Christos Giagos,...,True,r,27,11,0.518519,0.444444,0.037037,0.454545,0.545455,0.000000
403,Mizuto Hirota,Christos Giagos,5,8,1,5,7,0,Christos Giagos,Mizuto Hirota,...,False,b,14,12,0.357143,0.571429,0.071429,0.416667,0.583333,0.000000
414,Stevie Ray,Leonardo Santos,10,7,0,9,4,1,Leonardo Santos,Stevie Ray,...,True,r,17,14,0.588235,0.411765,0.000000,0.642857,0.285714,0.071429
417,Darren Till,Jessin Ayari,5,4,1,5,4,0,Darren Till,Jessin Ayari,...,False,b,10,9,0.500000,0.400000,0.100000,0.555556,0.444444,0.000000


In [125]:
fights['r_wldiff_pct'] = fights.r_win_pct - fights.r_loss_pct
fights['b_wldiff_pct'] = fights.b_win_pct - fights.b_loss_pct

In [126]:
fights.head()

Unnamed: 0,r_fighter,b_fighter,r_prior_wins,r_prior_losses,r_prior_ties,b_prior_wins,b_prior_losses,b_prior_ties,winner,loser,...,r_total_fights,b_total_fights,r_win_pct,r_loss_pct,r_tie_pct,b_win_pct,b_loss_pct,b_tie_pct,r_wldiff_pct,b_wldiff_pct
0,Henry Cejudo,Marlon Moraes,12,6,0,8,5,0,Henry Cejudo,Marlon Moraes,...,18,13,0.666667,0.333333,0.0,0.615385,0.384615,0.0,0.333333,0.230769
1,Jimmie Rivera,Marlon Moraes,9,4,0,6,5,0,Marlon Moraes,Jimmie Rivera,...,13,11,0.692308,0.307692,0.0,0.545455,0.454545,0.0,0.384615,0.090909
2,John Dodson,Marlon Moraes,12,7,0,4,5,0,Marlon Moraes,John Dodson,...,19,9,0.631579,0.368421,0.0,0.444444,0.555556,0.0,0.263158,-0.111111
3,Raphael Assuncao,Marlon Moraes,15,6,0,7,5,0,Marlon Moraes,Raphael Assuncao,...,21,12,0.714286,0.285714,0.0,0.583333,0.416667,0.0,0.428571,0.166667
4,Raphael Assuncao,Marlon Moraes,12,6,0,4,4,0,Raphael Assuncao,Marlon Moraes,...,18,8,0.666667,0.333333,0.0,0.5,0.5,0.0,0.333333,0.0


In [129]:
fights['guess_opt1'] = fights.apply(lambda row: 'r' if row.r_wldiff_pct > row.b_wldiff_pct else 'b', axis=1)

In [130]:
fights.head()

Unnamed: 0,r_fighter,b_fighter,r_prior_wins,r_prior_losses,r_prior_ties,b_prior_wins,b_prior_losses,b_prior_ties,winner,loser,...,b_total_fights,r_win_pct,r_loss_pct,r_tie_pct,b_win_pct,b_loss_pct,b_tie_pct,r_wldiff_pct,b_wldiff_pct,guess_opt1
0,Henry Cejudo,Marlon Moraes,12,6,0,8,5,0,Henry Cejudo,Marlon Moraes,...,13,0.666667,0.333333,0.0,0.615385,0.384615,0.0,0.333333,0.230769,r
1,Jimmie Rivera,Marlon Moraes,9,4,0,6,5,0,Marlon Moraes,Jimmie Rivera,...,11,0.692308,0.307692,0.0,0.545455,0.454545,0.0,0.384615,0.090909,r
2,John Dodson,Marlon Moraes,12,7,0,4,5,0,Marlon Moraes,John Dodson,...,9,0.631579,0.368421,0.0,0.444444,0.555556,0.0,0.263158,-0.111111,r
3,Raphael Assuncao,Marlon Moraes,15,6,0,7,5,0,Marlon Moraes,Raphael Assuncao,...,12,0.714286,0.285714,0.0,0.583333,0.416667,0.0,0.428571,0.166667,r
4,Raphael Assuncao,Marlon Moraes,12,6,0,4,4,0,Raphael Assuncao,Marlon Moraes,...,8,0.666667,0.333333,0.0,0.5,0.5,0.0,0.333333,0.0,r


In [134]:
print(accuracy_score(fights.r_b_winner, fights.guess_opt1))

0.5219667653084452


In [135]:
(fights.r_b_winner == fights.guess_opt1).sum() / fights.shape[0]

0.5219667653084452

So, hardly better than the first try.  This is disappointing, and also perplexing since it uses the difference in win and loss percentages.  One option is to experiment with the number of fictitious fights (0, 5, 10, etc.).  If I do that, I'll need to put this process into a function so I don't have to keep repeating the same manual steps.
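
A rough sketch of what such a function could look like.  The name `accuracy_with_prior` is hypothetical, and the sketch assumes the `*_prior_*` columns hold raw counts; if a starting record has already been added to them upstream, it would need to be subtracted first.

```python
import numpy as np
import pandas as pd


def accuracy_with_prior(df, n_prior):
    """Accuracy of picking the larger win-minus-loss share after adding
    n_prior fictitious wins and n_prior fictitious losses to every fighter."""
    def wl_diff(prefix):
        wins = df[f'{prefix}_prior_wins'] + n_prior
        losses = df[f'{prefix}_prior_losses'] + n_prior
        total = wins + losses + df[f'{prefix}_prior_ties']
        # A fighter with no fights at all gets a difference of 0 instead of NaN.
        return ((wins - losses) / total.where(total != 0)).fillna(0.0)

    # Exact ties in the difference go to blue, same as the manual steps above.
    guess = np.where(wl_diff('r') > wl_diff('b'), 'r', 'b')
    return (df['r_b_winner'] == guess).mean()


# Toy data for illustration only.
toy = pd.DataFrame({
    'r_prior_wins': [10, 1, 0], 'r_prior_losses': [0, 0, 0], 'r_prior_ties': [0, 0, 0],
    'b_prior_wins': [0, 6, 0], 'b_prior_losses': [10, 2, 0], 'b_prior_ties': [0, 0, 0],
    'r_b_winner': ['r', 'b', 'r'],
})

for n in (0, 4):
    print(n, accuracy_with_prior(toy, n))
```

On the real data this would be called as `accuracy_with_prior(fights, n)` for each candidate `n`.  In the toy data, the 1-0 fighter stops outranking the 6-2 fighter once a starting record is added, which is the effect the fictitious fights are meant to have.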

In [139]:
fights[['r_prior_wins', 'r_prior_losses',
       'r_prior_ties', 'b_prior_wins', 'b_prior_losses', 'b_prior_ties', 'r_total_fights',
       'b_total_fights', 'r_win_pct', 'r_loss_pct', 'r_tie_pct', 'b_win_pct',
       'b_loss_pct', 'b_tie_pct', 'r_wldiff_pct', 'b_wldiff_pct',
       'guess_opt1', 'r_b_winner']].head(40)

Unnamed: 0,r_prior_wins,r_prior_losses,r_prior_ties,b_prior_wins,b_prior_losses,b_prior_ties,r_total_fights,b_total_fights,r_win_pct,r_loss_pct,r_tie_pct,b_win_pct,b_loss_pct,b_tie_pct,r_wldiff_pct,b_wldiff_pct,guess_opt1,r_b_winner
0,12,6,0,8,5,0,18,13,0.666667,0.333333,0.0,0.615385,0.384615,0.0,0.333333,0.230769,r,r
1,9,4,0,6,5,0,13,11,0.692308,0.307692,0.0,0.545455,0.454545,0.0,0.384615,0.090909,r,b
2,12,7,0,4,5,0,19,9,0.631579,0.368421,0.0,0.444444,0.555556,0.0,0.263158,-0.111111,r,b
3,15,6,0,7,5,0,21,12,0.714286,0.285714,0.0,0.583333,0.416667,0.0,0.428571,0.166667,r,b
4,12,6,0,4,4,0,18,8,0.666667,0.333333,0.0,0.5,0.5,0.0,0.333333,0.0,r,r
5,11,6,0,16,7,0,17,23,0.647059,0.352941,0.0,0.695652,0.304348,0.0,0.294118,0.391304,b,r
6,10,4,0,14,7,0,14,21,0.714286,0.285714,0.0,0.666667,0.333333,0.0,0.428571,0.333333,r,b
7,11,4,0,9,6,0,15,15,0.733333,0.266667,0.0,0.6,0.4,0.0,0.466667,0.2,r,b
8,8,5,0,8,5,0,13,13,0.615385,0.384615,0.0,0.615385,0.384615,0.0,0.230769,0.230769,b,r
9,9,6,0,11,6,0,15,17,0.6,0.4,0.0,0.647059,0.352941,0.0,0.2,0.294118,b,r
