# Purpose
This notebook details the data preparation and modelling for the most recent model iteration.

1. Feature Engineering

2. Modelling

In [3]:
%load_ext autoreload
%autoreload 2

import os
import sys
module_path = os.path.abspath(os.path.join(os.pardir))
if module_path not in sys.path:
    sys.path.append(module_path)

import pandas as pd
from sqlalchemy import create_engine
from src import local
from src import functions

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# 1. Feature engineering
### Features
The model in this notebook uses differentials and per-minute rates for the following technique categories:
- Significant Strikes
    - Total
    - on the Ground
    - in the Clinch
    - at a Distance
- Takedowns

#### Differentials
A differential is the difference between one fighter's metric and their opponent's. In this model, the differentials are calculated from
the per-15-minute or per-1-minute rates in a single round.
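
As a rough illustration of the definition above, the sketch below computes a per-minute rate and a differential in plain Python. The function names and the raw strike counts are hypothetical; the counts were chosen so that the rates match the round 1 `sig_str_a_p1m` values (7.8 and 2.2) that appear in the data preview in this notebook, assuming a full 5-minute round.

```python
def per_minute_rate(count, seconds):
    """Convert a raw count over `seconds` of fight time into a per-minute rate."""
    return count / (seconds / 60)


def differential(fighter_rate, opponent_rate):
    """Difference between a fighter's rate and their opponent's rate."""
    return fighter_rate - opponent_rate


# Hypothetical raw counts for one 5-minute (300 second) round.
fighter_rate = per_minute_rate(39, 300)   # attempts per minute for fighter A
opponent_rate = per_minute_rate(11, 300)  # attempts per minute for fighter B

# Each fighter's differential is the mirror image of the other's.
fighter_di = differential(fighter_rate, opponent_rate)
opponent_di = differential(opponent_rate, fighter_rate)
print(round(fighter_di, 2), round(opponent_di, 2))
```

Because the two differentials in a round always sum to zero, the two rows for a bout carry the same information with opposite signs, which matches the mirrored values in the preview rows.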

### Target
The target is the Combined Significant Strike Attempts per-1-minute of both fighters, which allows us to measure the amount of striking action in a fight.

#### Load the data

In [4]:
# Set up tables
# Credentials
USER = local.user 
PASS = local.password
HOST = local.host
PORT = local.port

# Create engine
engine = create_engine(f'postgresql://{USER}:{PASS}@{HOST}:{PORT}/match_finder')

#### Join the 5 advanced statistics tables with the bouts and events tables to get the date of each bout

In [5]:
query = """
SELECT distance_ss_stats.bout_link, distance_ss_stats.fighter_link, distance_ss_stats.round, "Date", 
sig_str_a_p1m, td_s_p15m_di, sig_str_s_p1m_di, sig_str_a_p1m_di, 
ground_s_p15m_di, ground_a_p15m_di, clinch_s_p15m_di,
clinch_a_p15m_di, distance_a_p1m_di, distance_s_p1m_di
FROM distance_ss_stats
    JOIN clinch_ss_stats ON 
    CONCAT(distance_ss_stats.bout_link, distance_ss_stats.fighter_link, CAST(distance_ss_stats.round AS CHAR)) =
    CONCAT(clinch_ss_stats.bout_link, clinch_ss_stats.fighter_link, CAST(clinch_ss_stats.round AS CHAR))

    JOIN ground_ss_stats ON 
    CONCAT(distance_ss_stats.bout_link, distance_ss_stats.fighter_link, CAST(distance_ss_stats.round AS CHAR)) =
    CONCAT(ground_ss_stats.bout_link, ground_ss_stats.fighter_link, CAST(ground_ss_stats.round AS CHAR))

    JOIN sig_str_stats ON 
    CONCAT(distance_ss_stats.bout_link, distance_ss_stats.fighter_link, CAST(distance_ss_stats.round AS CHAR)) =
    CONCAT(sig_str_stats.bout_link, sig_str_stats.fighter_link, CAST(sig_str_stats.round AS CHAR))
    
    JOIN takedown_stats ON 
    CONCAT(distance_ss_stats.bout_link, distance_ss_stats.fighter_link, CAST(distance_ss_stats.round AS CHAR)) =
    CONCAT(takedown_stats.bout_link, takedown_stats.fighter_link, CAST(takedown_stats.round AS CHAR))

JOIN bouts ON bouts.link = distance_ss_stats.bout_link
JOIN events ON events.link = bouts.event_link
"""

data = pd.read_sql(query, engine)

In [6]:
data = functions.format_data(data, event=False)

data.head()

Unnamed: 0,bout_link,fighter_link,round,Date,sig_str_a_p1m,td_s_p15m_di,sig_str_s_p1m_di,sig_str_a_p1m_di,ground_s_p15m_di,ground_a_p15m_di,clinch_s_p15m_di,clinch_a_p15m_di,distance_a_p1m_di,distance_s_p1m_di,date,fighter_id,bout_id
0,http://www.ufcstats.com/fight-details/000da315...,http://www.ufcstats.com/fighter-details/6da991...,1,"July 08, 2006",7.8,3.0,1.4,5.6,0.0,3.0,9.0,27.0,3.6,0.8,2006-07-08,6da99156486ed6c2,000da3152b7b5ab1
1,http://www.ufcstats.com/fight-details/000da315...,http://www.ufcstats.com/fighter-details/6da991...,2,"July 08, 2006",5.2,0.0,1.2,1.4,-12.0,-18.0,21.0,27.0,0.8,0.6,2006-07-08,6da99156486ed6c2,000da3152b7b5ab1
2,http://www.ufcstats.com/fight-details/000da315...,http://www.ufcstats.com/fighter-details/6da991...,3,"July 08, 2006",4.6,6.0,0.6,2.8,6.0,18.0,3.0,3.0,1.4,0.0,2006-07-08,6da99156486ed6c2,000da3152b7b5ab1
3,http://www.ufcstats.com/fight-details/000da315...,http://www.ufcstats.com/fighter-details/d1a131...,1,"July 08, 2006",2.2,-3.0,-1.4,-5.6,0.0,-3.0,-9.0,-27.0,-3.6,-0.8,2006-07-08,d1a1314976c50bef,000da3152b7b5ab1
4,http://www.ufcstats.com/fight-details/000da315...,http://www.ufcstats.com/fighter-details/d1a131...,2,"July 08, 2006",3.8,0.0,-1.2,-1.4,12.0,18.0,-21.0,-27.0,-0.8,-0.6,2006-07-08,d1a1314976c50bef,000da3152b7b5ab1


## Create fighter-bout instance dataframe

A fighter-bout instance represents one fighter in one bout.
 - A fighter has exactly one fighter-bout instance for every bout they have fought in.
 - Every bout has exactly two fighter-bout instances, one for each fighter in the bout. 
  
In this case a fighter-bout instance is assigned a unique identifier composed of the bout_id concatenated with the fighter_id.
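
The identifier and the per-bout target can be sketched as follows. This is a minimal illustration in plain Python, not the code inside `functions.create_fighter_bout_instance_table`. It assumes the target is the mean of the metric across the fighter's rounds in that bout, which is consistent with the values displayed in this notebook (rounds of 7.8, 5.2 and 4.6 give a target of 5.866667). The helper name `build_instances` is hypothetical.

```python
from collections import defaultdict


def build_instances(rows, metric):
    """Collapse per-round rows into one value per (bout, fighter) pair."""
    per_instance = defaultdict(list)
    for row in rows:
        # The instance key is the bout_id followed by the fighter_id.
        key = row["bout_id"] + row["fighter_id"]
        per_instance[key].append(row[metric])
    # Assumption: the target is the mean of the metric over the rounds.
    return {key: sum(vals) / len(vals) for key, vals in per_instance.items()}


# Per-round values for one fighter, taken from the data preview above.
rounds = [
    {"bout_id": "000da3152b7b5ab1", "fighter_id": "6da99156486ed6c2",
     "round": 1, "sig_str_a_p1m": 7.8},
    {"bout_id": "000da3152b7b5ab1", "fighter_id": "6da99156486ed6c2",
     "round": 2, "sig_str_a_p1m": 5.2},
    {"bout_id": "000da3152b7b5ab1", "fighter_id": "6da99156486ed6c2",
     "round": 3, "sig_str_a_p1m": 4.6},
]

instances = build_instances(rounds, "sig_str_a_p1m")
for key, target in instances.items():
    print(key, round(target, 6))
```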

In [25]:
fighter_bout_inst = functions.create_fighter_bout_instance_table(data, 'sig_str_a_p1m')

In [29]:
fighter_bout_inst

fighter_bout_inst,bout_id,fighter_id,date,target
000da3152b7b5ab16da99156486ed6c2,000da3152b7b5ab1,6da99156486ed6c2,2006-07-08,5.866667
000da3152b7b5ab1d1a1314976c50bef,000da3152b7b5ab1,d1a1314976c50bef,2006-07-08,2.600000
0019ec81fd706ade326f94d6cfb1bf25,0019ec81fd706ade,326f94d6cfb1bf25,2019-10-18,6.466667
0019ec81fd706ade85073dbd1be65ed9,0019ec81fd706ade,85073dbd1be65ed9,2019-10-18,7.000000
0027e179b743c86c3aa794cbe1e3484b,0027e179b743c86c,3aa794cbe1e3484b,2015-03-14,2.550000
...,...,...,...,...
ffe629a5232a878bb361180739bed4b0,ffe629a5232a878b,b361180739bed4b0,2003-06-06,0.000000
ffea776913451b6d22a92d7f62195791,ffea776913451b6d,22a92d7f62195791,2015-02-28,14.522293
ffea776913451b6d75e5fec9f72910ef,ffea776913451b6d,75e5fec9f72910ef,2015-02-28,7.643312
fffa21388cdd78b75d7bdab5e03e3216,fffa21388cdd78b7,5d7bdab5e03e3216,2013-10-19,7.866667


#### Remove bouts that took place before 2012

Mixed Martial Arts has changed over the years, and stylistic matchups from a decade ago might not be analogous to matchups today.
Almost every fighter nowadays has at least some exposure to Wrestling, Brazilian Jiu Jitsu, Boxing, and Muay Thai, so stylistic
differences tend to reflect personal choices by the fighter rather than the limits of their martial arts discipline. For now,
we choose 2012 as the cutoff point, but further analysis is needed to determine the appropriate date.

In [30]:
fighter_bout_inst = fighter_bout_inst[fighter_bout_inst['date']>pd.to_datetime('12-31-2011')]

In [31]:
fighter_bout_inst

fighter_bout_inst,bout_id,fighter_id,date,target
0019ec81fd706ade326f94d6cfb1bf25,0019ec81fd706ade,326f94d6cfb1bf25,2019-10-18,6.466667
0019ec81fd706ade85073dbd1be65ed9,0019ec81fd706ade,85073dbd1be65ed9,2019-10-18,7.000000
0027e179b743c86c3aa794cbe1e3484b,0027e179b743c86c,3aa794cbe1e3484b,2015-03-14,2.550000
0027e179b743c86c91ea901c458e95dd,0027e179b743c86c,91ea901c458e95dd,2015-03-14,3.412500
002921976d27b7dab4ad3a06ee4d660c,002921976d27b7da,b4ad3a06ee4d660c,2014-12-13,5.928854
...,...,...,...,...
ffd3e3d37cba32da92a9aa9c93192871,ffd3e3d37cba32da,92a9aa9c93192871,2014-10-25,10.266667
ffea776913451b6d22a92d7f62195791,ffea776913451b6d,22a92d7f62195791,2015-02-28,14.522293
ffea776913451b6d75e5fec9f72910ef,ffea776913451b6d,75e5fec9f72910ef,2015-02-28,7.643312
fffa21388cdd78b75d7bdab5e03e3216,fffa21388cdd78b7,5d7bdab5e03e3216,2013-10-19,7.866667


## Calculate metrics

The metrics are calculated over each fighter's entire career and over their last 3 fights, the latter to give an idea of their current form.
Career averages are prefixed with 'ca_' and 3-fight averages with '3fa_'.

In [33]:
list_of_metrics = ['sig_str_a_p1m',
       'td_s_p15m_di', 'sig_str_s_p1m_di', 'sig_str_a_p1m_di',
       'ground_s_p15m_di', 'ground_a_p15m_di', 'clinch_s_p15m_di',
       'clinch_a_p15m_di', 'distance_a_p1m_di', 'distance_s_p1m_di']

#### How these metrics are calculated
The following cell iterates through each row in the fighter-bout instance table. It takes the fighter_id and the date from that row and calculates the fighter's metrics up until that date. This reflects the fact that the model will only have prior knowledge of the fighters when making its predictions.


This cell takes about 5-10 minutes to run.
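
The look-back logic described above can be sketched in plain Python. This is not the implementation of `functions.calculate_metric_average` or `functions.calculate_3_fight_average`; the function names, the bout-level history format, the strict "before the bout date" comparison, and returning `None` when fewer than 3 prior bouts exist are all assumptions made for illustration.

```python
from datetime import date


def prior_values(history, fighter_id, cutoff):
    """Metric values from the fighter's bouts strictly before `cutoff`, oldest first."""
    bouts = [
        (bout_date, value)
        for fid, bout_date, value in history
        if fid == fighter_id and bout_date < cutoff
    ]
    return [value for _, value in sorted(bouts)]


def career_average(history, fighter_id, cutoff):
    """Mean over all prior bouts, or None for a debut."""
    values = prior_values(history, fighter_id, cutoff)
    return sum(values) / len(values) if values else None


def last_n_average(history, fighter_id, cutoff, n=3):
    """Mean over the n most recent prior bouts, or None if there are fewer than n."""
    values = prior_values(history, fighter_id, cutoff)
    if len(values) < n:
        return None
    return sum(values[-n:]) / n


# Hypothetical bout-level history: (fighter_id, bout date, metric value).
history = [
    ("f1", date(2012, 3, 1), 4.0),
    ("f1", date(2013, 6, 1), 6.0),
    ("f1", date(2014, 2, 1), 8.0),
    ("f1", date(2015, 1, 1), 10.0),
]

print(career_average(history, "f1", date(2016, 1, 1)))   # uses all four bouts
print(last_n_average(history, "f1", date(2016, 1, 1)))   # uses the latest three
print(last_n_average(history, "f1", date(2014, 2, 1)))   # only two prior bouts
```

Filtering on the bout date before averaging is what keeps information from the bout being predicted (and from later bouts) out of the features.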

In [34]:
for metric in list_of_metrics:
    print(metric+'\n')
    values = fighter_bout_inst.apply(lambda row: functions.calculate_metric_average(metric, 
                                                                                    row['fighter_id'], 
                                                                                    row['date'], 
                                                                                    data), 
                                     axis=1)
    fighter_bout_inst['ca_'+metric] = values
    values = fighter_bout_inst.apply(lambda row: functions.calculate_3_fight_average(metric, 
                                                                                     row['fighter_id'], 
                                                                                     row['date'], 
                                                                                     data), 
                                     axis=1)
    fighter_bout_inst['3fa_'+metric] = values

sig_str_a_p1m



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  fighter_bout_inst['ca_'+metric] = values
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  fighter_bout_inst['3fa_'+metric] = values


td_s_p15m_di

sig_str_s_p1m_di

sig_str_a_p1m_di

ground_s_p15m_di

ground_a_p15m_di

clinch_s_p15m_di

clinch_a_p15m_di

distance_a_p1m_di

distance_s_p1m_di
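
The warning in the output above is pandas' `SettingWithCopyWarning`: `fighter_bout_inst` was produced by boolean-indexing another DataFrame, and new columns were then assigned to it. One common way to avoid the warning is to take an explicit copy when filtering. The sketch below uses a small hypothetical frame rather than the notebook's data.

```python
import pandas as pd

# Hypothetical stand-in for the fighter-bout instance table.
frame = pd.DataFrame(
    {
        "fighter_id": ["a", "b", "c"],
        "date": pd.to_datetime(["2011-05-01", "2012-03-10", "2014-07-04"]),
    }
)

# .copy() makes the filtered frame independent of `frame`, so adding
# columns to it is unambiguous.
recent = frame[frame["date"] > pd.Timestamp("2011-12-31")].copy()
recent["ca_example"] = [1.0, 2.0]

print(recent.shape)
print("ca_example" in frame.columns)
```

Applied to this notebook, that would mean appending `.copy()` to the date filter (and to the later null-value filter) before new columns are assigned.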



### Debut fights and inexperienced fighters
Many fights include fighters who have never fought in the UFC before, and some fighters do not have long enough records to calculate 3-fight averages. These fighters would have null values in their career and 3-fight averages, so I drop all of those rows here.

In [37]:
mask = fighter_bout_inst['3fa_sig_str_a_p1m'].notnull()
fighter_bout_inst = fighter_bout_inst[mask]

In [38]:
fighter_bout_inst.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6069 entries, 0019ec81fd706ade85073dbd1be65ed9 to fffa21388cdd78b7c80095f6092271a7
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   bout_id                6069 non-null   object        
 1   fighter_id             6069 non-null   object        
 2   date                   6069 non-null   datetime64[ns]
 3   target                 6069 non-null   float64       
 4   ca_sig_str_a_p1m       6069 non-null   float64       
 5   3fa_sig_str_a_p1m      6069 non-null   object        
 6   ca_td_s_p15m_di        6069 non-null   float64       
 7   3fa_td_s_p15m_di       6069 non-null   object        
 8   ca_sig_str_s_p1m_di    6069 non-null   float64       
 9   3fa_sig_str_s_p1m_di   6069 non-null   object        
 10  ca_sig_str_a_p1m_di    6069 non-null   float64       
 11  3fa_sig_str_a_p1m_di   6069 non-null   object        
 12  ca_groun

## Create the final dataframe

The current fighter-bout instance table has two rows for each fight (one row for each fighter). In order to create a table where each row represents one fight, I need to get both fighters onto the same row.

In [39]:
model_df = functions.merge_fighter_instances(fighter_bout_inst)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  instances_df['inst_id'] = instances_df['bout_id'] + instances_df['fighter_id']
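
The reshaping step above can be sketched in plain Python. This is an illustration only, not the body of `functions.merge_fighter_instances`: the helper name `pair_instances` is hypothetical, the assignment of fighters to the `_0` and `_1` suffixes is arbitrary here, and skipping bouts with only one remaining fighter record is an assumption. The two real records use target values that appear in this notebook's tables; the third record is made up to show the skip.

```python
from collections import defaultdict


def pair_instances(instances):
    """Turn two fighter-bout records per bout into a single row per bout."""
    by_bout = defaultdict(list)
    for inst in instances:
        by_bout[inst["bout_id"]].append(inst)

    rows = []
    for bout_id, pair in by_bout.items():
        if len(pair) != 2:
            # Assumption: bouts with a missing fighter record are skipped.
            continue
        row = {"bout_id": bout_id}
        for i, inst in enumerate(pair):
            for col, value in inst.items():
                if col != "bout_id":
                    row[f"{col}_{i}"] = value
        # Combined target: both fighters' attempts per minute added together.
        row["c_sig_str_a_p1m"] = row["target_0"] + row["target_1"]
        rows.append(row)
    return rows


instances = [
    {"bout_id": "0027e179b743c86c", "fighter_id": "91ea901c458e95dd", "target": 3.4125},
    {"bout_id": "0027e179b743c86c", "fighter_id": "3aa794cbe1e3484b", "target": 2.55},
    # Hypothetical record whose opponent was dropped earlier.
    {"bout_id": "hypothetical_bout", "fighter_id": "hypothetical_fighter", "target": 4.0},
]

rows = pair_instances(instances)
print(len(rows), round(rows[0]["c_sig_str_a_p1m"], 4))
```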


## Creating Combined Significant Strike Attempts Per 1 Minute (c_sig_str_a_p1m)

In [42]:
model_df['c_sig_str_a_p1m'] = model_df['target_0'] + model_df['target_1']
model_df

Unnamed: 0,bout_id,fighter_id_0,date_0,target_0,ca_sig_str_a_p1m_0,3fa_sig_str_a_p1m_0,ca_td_s_p15m_di_0,3fa_td_s_p15m_di_0,ca_sig_str_s_p1m_di_0,3fa_sig_str_s_p1m_di_0,...,ca_clinch_s_p15m_di_1,3fa_clinch_s_p15m_di_1,ca_clinch_a_p15m_di_1,3fa_clinch_a_p15m_di_1,ca_distance_a_p1m_di_1,3fa_distance_a_p1m_di_1,ca_distance_s_p1m_di_1,3fa_distance_s_p1m_di_1,inst_id_1,c_sig_str_a_p1m
0,0027e179b743c86c,91ea901c458e95dd,2015-03-14,3.412500,5.235224,6.02175,1.500000,0.857143,1.599104,1.59872,...,-5.000000,-5,-3.000000,-3,-5.733333,-5.73333,-2.533333,-2.53333,0027e179b743c86c3aa794cbe1e3484b,5.962500
1,002921976d27b7da,ebc1f40e00e0c481,2014-12-13,1.185771,8.788065,10.8811,-0.433461,-0.6,-0.916931,0.864547,...,3.825798,-15.2929,-7.405384,-35.4358,-3.095198,-2.87071,-1.265419,-1.43001,002921976d27b7dab4ad3a06ee4d660c,7.114625
2,002c1562708ac307,44470bfd9483c7ad,2014-05-24,10.731707,3.200000,3.2,3.000000,3,0.466667,0.466667,...,2.628510,-1.00714,3.107143,-0.407143,1.837749,1.71,0.249837,-0.0128571,002c1562708ac30722a92d7f62195791,27.073171
3,00494c77d2a88f8c,7ea1f74cef32f906,2016-11-05,5.200000,9.009428,8.42222,0.000000,-0.333333,2.353199,2.35556,...,2.859965,2.70283,4.733184,5.09721,-0.790958,-1.12187,0.216110,0.686228,00494c77d2a88f8c08af939f41b5a57b,13.200000
4,0051d7fbb7893d27,b6c4451cb13c9303,2013-10-26,3.400000,4.900000,4.9,1.000000,1,0.633333,0.633333,...,1.094035,-0.5,0.711377,-2.58333,1.030239,1.84205,-0.239637,-0.100849,0051d7fbb7893d27282fa667ff9c51ed,6.066667
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2525,ff4cef86bef2d0aa,db1f2ed63b54b9a7,2017-09-16,10.790850,6.728064,6.23423,0.610672,-0.223602,-1.297033,-4.06294,...,3.772551,1.71429,3.311594,0.857143,1.325490,0.877551,0.728171,0.152041,ff4cef86bef2d0aad4c9dcd330403612,20.699346
2526,ff64fc34065565d0,6fb1ba67bef41b37,2015-05-30,6.133333,6.486257,6.44444,1.429825,-0.333333,0.573392,0.955556,...,5.199375,6.25735,6.996808,6.25735,-0.600820,0.109339,-0.009273,0.565661,ff64fc34065565d007225ba28ae309b6,13.923810
2527,ffd3e3d37cba32da,92a9aa9c93192871,2014-10-25,10.266667,9.590387,5.06667,3.018444,2.66667,0.320330,1.48889,...,-2.625000,-3,-3.750000,-4.28571,0.427614,0.759949,-0.747097,-0.419829,ffd3e3d37cba32da7413b80dbb0f8f9f,15.733333
2528,ffea776913451b6d,75e5fec9f72910ef,2015-02-28,7.643312,5.866690,4.37778,3.846692,5.33333,-0.431887,-0.711111,...,1.852669,0.540941,1.276528,1.54094,1.106705,5.03862,0.508914,2.59808,ffea776913451b6d22a92d7f62195791,22.165605


#### Drop unnecessary columns

In [44]:
model_df = model_df.loc[:,['ca_sig_str_a_p1m_0', '3fa_sig_str_a_p1m_0', 'ca_td_s_p15m_di_0',
                           '3fa_td_s_p15m_di_0', 'ca_sig_str_s_p1m_di_0', '3fa_sig_str_s_p1m_di_0',
                           'ca_sig_str_a_p1m_di_0', '3fa_sig_str_a_p1m_di_0',
                           'ca_ground_s_p15m_di_0', '3fa_ground_s_p15m_di_0',
                           'ca_ground_a_p15m_di_0', '3fa_ground_a_p15m_di_0',
                           'ca_clinch_s_p15m_di_0', '3fa_clinch_s_p15m_di_0',
                           'ca_clinch_a_p15m_di_0', '3fa_clinch_a_p15m_di_0',
                           'ca_distance_a_p1m_di_0', '3fa_distance_a_p1m_di_0',
                           'ca_distance_s_p1m_di_0', '3fa_distance_s_p1m_di_0',
                           'ca_sig_str_a_p1m_1',
                           '3fa_sig_str_a_p1m_1', 'ca_td_s_p15m_di_1', '3fa_td_s_p15m_di_1',
                           'ca_sig_str_s_p1m_di_1', '3fa_sig_str_s_p1m_di_1',
                           'ca_sig_str_a_p1m_di_1', '3fa_sig_str_a_p1m_di_1',
                           'ca_ground_s_p15m_di_1', '3fa_ground_s_p15m_di_1',
                           'ca_ground_a_p15m_di_1', '3fa_ground_a_p15m_di_1',
                           'ca_clinch_s_p15m_di_1', '3fa_clinch_s_p15m_di_1',
                           'ca_clinch_a_p15m_di_1', '3fa_clinch_a_p15m_di_1',
                           'ca_distance_a_p1m_di_1', '3fa_distance_a_p1m_di_1',
                           'ca_distance_s_p1m_di_1', '3fa_distance_s_p1m_di_1',
                           'c_sig_str_a_p1m']]

In [45]:
model_df

Unnamed: 0,ca_sig_str_a_p1m_0,3fa_sig_str_a_p1m_0,ca_td_s_p15m_di_0,3fa_td_s_p15m_di_0,ca_sig_str_s_p1m_di_0,3fa_sig_str_s_p1m_di_0,ca_sig_str_a_p1m_di_0,3fa_sig_str_a_p1m_di_0,ca_ground_s_p15m_di_0,3fa_ground_s_p15m_di_0,...,3fa_ground_a_p15m_di_1,ca_clinch_s_p15m_di_1,3fa_clinch_s_p15m_di_1,ca_clinch_a_p15m_di_1,3fa_clinch_a_p15m_di_1,ca_distance_a_p1m_di_1,3fa_distance_a_p1m_di_1,ca_distance_s_p1m_di_1,3fa_distance_s_p1m_di_1,c_sig_str_a_p1m
0,5.235224,6.02175,1.500000,0.857143,1.599104,1.59872,2.388358,2.38337,13.204478,6.86354,...,0,-5.000000,-5,-3.000000,-3,-5.733333,-5.73333,-2.533333,-2.53333,5.962500
1,8.788065,10.8811,-0.433461,-0.6,-0.916931,0.864547,-1.198873,3.37951,-1.755793,-0.0156793,...,24.518,3.825798,-15.2929,-7.405384,-35.4358,-3.095198,-2.87071,-1.265419,-1.43001,7.114625
2,3.200000,3.2,3.000000,3,0.466667,0.466667,-1.000000,-1,3.000000,3,...,1.60714,2.628510,-1.00714,3.107143,-0.407143,1.837749,1.71,0.249837,-0.0128571,27.073171
3,9.009428,8.42222,0.000000,-0.333333,2.353199,2.35556,3.319697,3.46667,5.270202,4,...,29.7697,2.859965,2.70283,4.733184,5.09721,-0.790958,-1.12187,0.216110,0.686228,13.200000
4,4.900000,4.9,1.000000,1,0.633333,0.633333,0.933333,0.933333,14.000000,14,...,4.54052,1.094035,-0.5,0.711377,-2.58333,1.030239,1.84205,-0.239637,-0.100849,6.066667
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2525,6.728064,6.23423,0.610672,-0.223602,-1.297033,-4.06294,-2.203561,-6.50476,-10.786923,-37.2174,...,-7.28571,3.772551,1.71429,3.311594,0.857143,1.325490,0.877551,0.728171,0.152041,20.699346
2526,6.486257,6.44444,1.429825,-0.333333,0.573392,0.955556,1.119006,0.755556,5.789474,10,...,0.0808824,5.199375,6.25735,6.996808,6.25735,-0.600820,0.109339,-0.009273,0.565661,13.923810
2527,9.590387,5.06667,3.018444,2.66667,0.320330,1.48889,4.057003,2.55556,6.073953,18.6667,...,23.8957,-2.625000,-3,-3.750000,-4.28571,0.427614,0.759949,-0.747097,-0.419829,15.733333
2528,5.866690,4.37778,3.846692,5.33333,-0.431887,-0.711111,-0.736226,-1.4,1.829027,4.66667,...,2.67857,1.852669,0.540941,1.276528,1.54094,1.106705,5.03862,0.508914,2.59808,22.165605


# 2. Modelling

In [47]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import PoissonRegressor
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score

## Split

In [46]:
X = model_df.drop('c_sig_str_a_p1m', axis=1)
y = model_df.c_sig_str_a_p1m

In [48]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

## Preprocessing

In [49]:
ss = StandardScaler()
X_train_ss = ss.fit_transform(X_train)
X_test_ss = ss.transform(X_test)

## Cross-Validation
Performance was evaluated using D² (the fraction of Poisson deviance explained, which is the default score of `PoissonRegressor`), as well as R-squared, so that the model can be compared easily with non-Poisson models.
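
For reference, the mean Poisson deviance and the D² score derived from it can be computed by hand. The sketch below is a plain-Python illustration of the standard formulas, with hypothetical targets and predictions; it is not how the scores in this notebook were produced.

```python
import math


def mean_poisson_deviance(y_true, y_pred):
    """Mean Poisson deviance: 2 * mean(y * log(y / mu) - y + mu)."""
    total = 0.0
    for y, mu in zip(y_true, y_pred):
        # y * log(y / mu) is taken to be 0 when y == 0.
        term = y * math.log(y / mu) if y > 0 else 0.0
        total += 2.0 * (term - y + mu)
    return total / len(y_true)


def d2_score(y_true, y_pred):
    """Fraction of deviance explained relative to always predicting mean(y)."""
    baseline = [sum(y_true) / len(y_true)] * len(y_true)
    model_dev = mean_poisson_deviance(y_true, y_pred)
    null_dev = mean_poisson_deviance(y_true, baseline)
    return 1.0 - model_dev / null_dev


# Hypothetical targets and predictions, for illustration only.
y_true = [5.0, 10.0, 20.0, 0.0]
y_pred = [6.0, 9.0, 18.0, 1.0]
print(mean_poisson_deviance(y_true, y_pred))
print(d2_score(y_true, y_pred))
```

A lower deviance is better, while a higher D² is better (1.0 is a perfect fit and 0.0 is no better than predicting the mean). In scikit-learn, the raw deviance is available as `sklearn.metrics.mean_poisson_deviance` and as the `'neg_mean_poisson_deviance'` scoring string for `cross_val_score`.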

#### R-squared cross-val scores

In [50]:
pr = PoissonRegressor()
cross_val_score(pr, X_train_ss, y_train, scoring='r2')

array([0.11226211, 0.12476193, 0.14353579, 0.05537271, 0.12823143])

#### D² (Poisson deviance explained) cross-val scores

In [51]:
cross_val_score(pr, X_train_ss, y_train)

array([0.11451101, 0.14070609, 0.14833022, 0.06588771, 0.12988321])

## Evaluation on Test Set

In [52]:
pr.fit(X_train_ss, y_train)

PoissonRegressor()

#### R-squared

In [53]:
y_hat = pr.predict(X_test_ss)

In [54]:
r2_score(y_test, y_hat)

0.12757670886530392

#### D² (Poisson deviance explained)

In [55]:
pr.score(X_test_ss, y_test)

0.1431795182807215

Our goal is to predict at least 95% of the matches to within 5 strikes of the actual result (the target is measured in strikes per minute), so this metric is also included. The table created below has one column for the predictions and one for the actual results, with each row representing one observation.

In [55]:
results = pd.DataFrame({'model_predictions': list(y_hat), 'actual_results': list(y_test)})

In [56]:
def compare_within_window(row):
    """
    Given a row from the dataframe above, returns True if the prediction is
    within 5 strikes of the actual result.
    """
    pred = row['model_predictions']
    true = row['actual_results']
    return pred>=(true-5) and pred<=(true+5)

In [57]:
accuracy_within_window = results.apply(compare_within_window, axis=1)

In [58]:
accuracy_within_window.mean()

0.47709320695102686

The model predicts almost half of the fights (about 48%) within a 5-strike window. When it does get it right, what is it predicting?

In [59]:
within_5 = results[accuracy_within_window]
within_5.describe()

Unnamed: 0,model_predictions,actual_results
count,302.0,302.0
mean,17.265488,17.007273
std,3.07754,4.216742
min,10.710051,8.466667
25%,15.178987,13.919718
50%,16.91663,17.119298
75%,18.983574,19.689868
max,31.039112,29.090909


It looks like the model can handle fights where the target lies between roughly 8 and 30 strikes. Here is a description of the matches it gets wrong.

In [60]:
results[~accuracy_within_window].describe()

Unnamed: 0,model_predictions,actual_results
count,331.0,331.0
mean,17.812987,18.290402
std,3.85132,12.886663
min,11.131357,1.139241
25%,15.199922,8.511671
50%,17.165392,14.210526
75%,19.468335,26.192381
max,39.842291,82.5


Among these observations, the actual results have a much wider spread, with a standard deviation more than 3 times that of the predictions (12.89 versus 3.85).
The model consistently predicts between 11 and 39 strikes, even when the actual values range from about 1 to 82.5.