# Purpose
This notebook details the data preparation and modelling for the most recent model iteration.

1. Feature Engineering

2. Modelling

In [1]:
%load_ext autoreload
%autoreload 2

import os
import sys
module_path = os.path.abspath(os.path.join(os.pardir))
if module_path not in sys.path:
    sys.path.append(module_path)

import pandas as pd
from sqlalchemy import create_engine
from src import local
from src import functions

# 1. Feature engineering
### Features
The model in this notebook uses differentials and per-15-minute rates for the following technique categories:
- Significant Strikes
    - Total
    - on the Ground
    - in the Clinch
    - at a Distance
- Takedowns

#### Differentials
Differentials are the difference between one fighter's metric and their opponent's. In this model, the differentials are calculated from
the per-15-minute rates in a single round.
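As a minimal sketch of the idea, the differential for one round can be computed by subtracting the opponent's rate from the fighter's own rate. The two rates below mirror round 1 of the first bout in the loaded data; the DataFrame layout and the helper variables are assumptions for illustration, not the project's actual pipeline.

```python
import pandas as pd

# Per-round attempt rates for the two fighters in one bout (illustrative layout).
rounds = pd.DataFrame({
    "bout_id": ["b1", "b1"],
    "fighter_id": ["f1", "f2"],
    "round": [1, 1],
    "ss_a_p15m": [117.0, 33.0],
})

# Sum of both fighters' rates within each bout and round.
totals = rounds.groupby(["bout_id", "round"])["ss_a_p15m"].transform("sum")

# Opponent's rate = total - own rate; differential = own rate - opponent's rate.
opponent = totals - rounds["ss_a_p15m"]
rounds["ss_a_p15m_di"] = rounds["ss_a_p15m"] - opponent

print(rounds[["fighter_id", "ss_a_p15m", "ss_a_p15m_di"]])
```

Each fighter's differential is the negative of their opponent's, so the two rows of a round always sum to zero.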

### Target
The target is the Combined Significant Strike Attempts per 15 minutes of both fighters (`c_ss_a_p15m`), which allows us to measure the amount of striking action in a fight.

#### Load the data

In [2]:
# Set up tables
# Credentials
USER = local.user 
PASS = local.password
HOST = local.host
PORT = local.port

#create engine
engine = create_engine(f'postgresql://{USER}:{PASS}@{HOST}:{PORT}/match_finder')

#### Join our three advanced statistics tables with the bouts and events tables to get dates for each bout

In [3]:
query = """
SELECT striking_adv.bout_id, striking_adv.fighter_id, striking_adv.round, date, 
ss_a_p15m, td_s_p15m_di, ss_s_p15m_di, ss_a_p15m_di, 
g_ss_s_p15m_di, g_ss_a_p15m_di, c_ss_s_p15m_di,
c_ss_a_p15m_di, d_ss_a_p15m_di, d_ss_s_p15m_di
FROM striking_adv
    JOIN striking_position_adv ON 
    CONCAT(striking_adv.bout_id, striking_adv.fighter_id, CAST(striking_adv.round AS CHAR)) =
    CONCAT(striking_position_adv.bout_id, striking_position_adv.fighter_id, CAST(striking_position_adv.round AS CHAR))
    
    JOIN grappling_adv ON 
    CONCAT(striking_adv.bout_id, striking_adv.fighter_id, CAST(striking_adv.round AS CHAR)) =
    CONCAT(grappling_adv.bout_id, grappling_adv.fighter_id, CAST(grappling_adv.round AS CHAR))

JOIN bouts ON bouts.id = striking_adv.bout_id
JOIN events ON events.id = bouts.event_id
"""

data = pd.read_sql(query, engine)

In [4]:
data.head()

Unnamed: 0,bout_id,fighter_id,round,date,ss_a_p15m,td_s_p15m_di,ss_s_p15m_di,ss_a_p15m_di,g_ss_s_p15m_di,g_ss_a_p15m_di,c_ss_s_p15m_di,c_ss_a_p15m_di,d_ss_a_p15m_di,d_ss_s_p15m_di
0,000da3152b7b5ab1,6da99156486ed6c2,1,"July 08, 2006",117.0,3.0,21.0,84.0,0.0,3.0,9.0,27.0,54.0,12.0
1,000da3152b7b5ab1,6da99156486ed6c2,2,"July 08, 2006",78.0,0.0,18.0,21.0,-12.0,-18.0,21.0,27.0,12.0,9.0
2,000da3152b7b5ab1,6da99156486ed6c2,3,"July 08, 2006",69.0,6.0,9.0,42.0,6.0,18.0,3.0,3.0,21.0,0.0
3,000da3152b7b5ab1,d1a1314976c50bef,1,"July 08, 2006",33.0,-3.0,-21.0,-84.0,0.0,-3.0,-9.0,-27.0,-54.0,-12.0
4,000da3152b7b5ab1,d1a1314976c50bef,2,"July 08, 2006",57.0,0.0,-18.0,-21.0,12.0,18.0,-21.0,-27.0,-12.0,-9.0


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25772 entries, 0 to 25771
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   bout_id         25772 non-null  object 
 1   fighter_id      25772 non-null  object 
 2   round           25772 non-null  object 
 3   date            25772 non-null  object 
 4   ss_a_p15m       25772 non-null  float64
 5   td_s_p15m_di    25772 non-null  float64
 6   ss_s_p15m_di    25772 non-null  float64
 7   ss_a_p15m_di    25772 non-null  float64
 8   g_ss_s_p15m_di  25772 non-null  float64
 9   g_ss_a_p15m_di  25772 non-null  float64
 10  c_ss_s_p15m_di  25772 non-null  float64
 11  c_ss_a_p15m_di  25772 non-null  float64
 12  d_ss_a_p15m_di  25772 non-null  float64
 13  d_ss_s_p15m_di  25772 non-null  float64
dtypes: float64(10), object(4)
memory usage: 2.8+ MB


In [6]:
data['date'] = pd.to_datetime(data['date'])
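The dates arrive as strings such as "July 08, 2006". As a small sketch, passing an explicit `format` to `pd.to_datetime` makes the expected layout visible and avoids format inference. The sample series is illustrative, and the format string assumes English month names.

```python
import pandas as pd

# Sample date strings in the same layout as the `date` column.
dates = pd.Series(["July 08, 2006", "October 18, 2019"])

# %B = full month name, %d = zero-padded day, %Y = four-digit year.
parsed = pd.to_datetime(dates, format="%B %d, %Y")
print(parsed)
```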

## Create fighter-bout instance dataframe

A fighter-bout instance represents one fighter in one bout.
 - The same fighter has exactly one fighter-bout instance for every single bout he has been in. 
 - Every bout has exactly two fighter-bout instances, one for each fighter in the bout. 
  
In this case a fighter-bout instance is assigned a unique identifier formed by concatenating the bout_id and the fighter_id.
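The implementation of `functions.create_fighter_bout_instance_table` is not shown in this notebook. The sketch below is one way such a table could be built, under the assumption that the target is the mean of the chosen column over a fighter's rounds in the bout; `build_fighter_bout_instances` is a hypothetical name.

```python
import pandas as pd

def build_fighter_bout_instances(rounds, target_col):
    """Hypothetical sketch: one row per (bout_id, fighter_id), target averaged over rounds."""
    inst = (
        rounds.groupby(["bout_id", "fighter_id"], as_index=False)
        .agg(date=("date", "first"), target=(target_col, "mean"))
    )
    # The identifier is the bout_id concatenated with the fighter_id.
    inst["fighter_bout_inst"] = inst["bout_id"] + inst["fighter_id"]
    return inst.set_index("fighter_bout_inst")

# Three rounds for one fighter in one bout, as in the data preview.
rounds = pd.DataFrame({
    "bout_id": ["000da3152b7b5ab1"] * 3,
    "fighter_id": ["6da99156486ed6c2"] * 3,
    "round": ["1", "2", "3"],
    "date": pd.to_datetime(["2006-07-08"] * 3),
    "ss_a_p15m": [117.0, 78.0, 69.0],
})
instances = build_fighter_bout_instances(rounds, "ss_a_p15m")
print(instances)
```

The averaging assumption is at least consistent with the first fighter in the data preview, whose three round values (117, 78, 69) average to 88.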

In [7]:
fighter_bout_inst = functions.create_fighter_bout_instance_table(data, 'ss_a_p15m')

In [8]:
fighter_bout_inst

fighter_bout_inst,bout_id,fighter_id,date,target
000da3152b7b5ab16da99156486ed6c2,000da3152b7b5ab1,6da99156486ed6c2,2006-07-08,88.000000
000da3152b7b5ab1d1a1314976c50bef,000da3152b7b5ab1,d1a1314976c50bef,2006-07-08,39.000000
0019ec81fd706ade326f94d6cfb1bf25,0019ec81fd706ade,326f94d6cfb1bf25,2019-10-18,97.000000
0019ec81fd706ade85073dbd1be65ed9,0019ec81fd706ade,85073dbd1be65ed9,2019-10-18,105.000000
0027e179b743c86c3aa794cbe1e3484b,0027e179b743c86c,3aa794cbe1e3484b,2015-03-14,38.250000
...,...,...,...,...
ffe629a5232a878bb361180739bed4b0,ffe629a5232a878b,b361180739bed4b0,2003-06-06,0.000000
ffea776913451b6d22a92d7f62195791,ffea776913451b6d,22a92d7f62195791,2015-02-28,217.834395
ffea776913451b6d75e5fec9f72910ef,ffea776913451b6d,75e5fec9f72910ef,2015-02-28,114.649682
fffa21388cdd78b75d7bdab5e03e3216,fffa21388cdd78b7,5d7bdab5e03e3216,2013-10-19,118.000000


#### Remove bouts that took place before 2012

Mixed Martial Arts has changed over the years, and stylistic matchups from a decade ago might not be analogous to matchups today.
Almost every fighter nowadays has at least some exposure to Wrestling, Brazilian Jiu Jitsu, Boxing, and Muay Thai, so stylistic
differences tend to reflect personal choices by the fighter rather than the limits of their martial arts discipline. For now,
we choose the start of 2012 as the cutoff point, but further analysis is needed to determine the appropriate date.

In [9]:
fighter_bout_inst = fighter_bout_inst[fighter_bout_inst['date']>pd.to_datetime('12-31-2011')]
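Filtering with a boolean mask returns an object that pandas may treat as a copy of a slice, and assigning new columns to it later can emit a `SettingWithCopyWarning`. A minimal sketch of one common way to avoid this is to take an explicit `.copy()` after filtering. The small DataFrame and the column name `ca_example` are made up for illustration.

```python
import pandas as pd

# Two illustrative rows, one before and one after the cutoff.
df = pd.DataFrame({
    "date": pd.to_datetime(["2011-06-01", "2013-10-19"]),
    "target": [39.0, 118.0],
})

cutoff = pd.Timestamp("2011-12-31")

# .copy() makes the filtered frame independent of the original,
# so later column assignments are unambiguous.
recent = df[df["date"] > cutoff].copy()
recent["ca_example"] = 0.0  # hypothetical new column

print(recent)
```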

In [10]:
fighter_bout_inst

fighter_bout_inst,bout_id,fighter_id,date,target
0019ec81fd706ade326f94d6cfb1bf25,0019ec81fd706ade,326f94d6cfb1bf25,2019-10-18,97.000000
0019ec81fd706ade85073dbd1be65ed9,0019ec81fd706ade,85073dbd1be65ed9,2019-10-18,105.000000
0027e179b743c86c3aa794cbe1e3484b,0027e179b743c86c,3aa794cbe1e3484b,2015-03-14,38.250000
0027e179b743c86c91ea901c458e95dd,0027e179b743c86c,91ea901c458e95dd,2015-03-14,51.187500
002921976d27b7dab4ad3a06ee4d660c,002921976d27b7da,b4ad3a06ee4d660c,2014-12-13,88.932806
...,...,...,...,...
ffd3e3d37cba32da92a9aa9c93192871,ffd3e3d37cba32da,92a9aa9c93192871,2014-10-25,154.000000
ffea776913451b6d22a92d7f62195791,ffea776913451b6d,22a92d7f62195791,2015-02-28,217.834395
ffea776913451b6d75e5fec9f72910ef,ffea776913451b6d,75e5fec9f72910ef,2015-02-28,114.649682
fffa21388cdd78b75d7bdab5e03e3216,fffa21388cdd78b7,5d7bdab5e03e3216,2013-10-19,118.000000


## Calculate metrics

The metrics will be calculated from each fighter's entire career and from their last 3 fights, to give an idea of their current form.
Career averages are prefixed with 'ca_' and 3-fight averages with '3fa_'.

In [11]:
list_of_metrics = ['ss_a_p15m', 'td_s_p15m_di', 'ss_s_p15m_di', 'ss_a_p15m_di', 
                    'g_ss_s_p15m_di', 'g_ss_a_p15m_di', 'c_ss_s_p15m_di',
                    'c_ss_a_p15m_di', 'd_ss_a_p15m_di', 'd_ss_s_p15m_di']

#### How these metrics are calculated
The following cell iterates through each row in the fighter-bout instance table. It takes the fighter_id and the date from that row and calculates the fighter's metrics up until that date. This reflects the fact that our model will only have prior knowledge of the fighters when making its predictions.


This cell takes about 5-10 minutes to run.

In [12]:
for metric in list_of_metrics:
    print(metric+'\n')
    values = fighter_bout_inst.apply(lambda row: functions.calculate_metric_average(metric, 
                                                                                    row['fighter_id'], 
                                                                                    row['date'], 
                                                                                    data), 
                                     axis=1)

    fighter_bout_inst['ca_'+metric] = values.map(lambda x: x[0])
    fighter_bout_inst['3fa_'+metric] = values.map(lambda x: x[1])

ss_a_p15m



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  fighter_bout_inst['ca_'+metric] = values.map(lambda x: x[0])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  fighter_bout_inst['3fa_'+metric] = values.map(lambda x: x[1])


td_s_p15m_di

ss_s_p15m_di

ss_a_p15m_di

g_ss_s_p15m_di

g_ss_a_p15m_di

c_ss_s_p15m_di

c_ss_a_p15m_di

d_ss_a_p15m_di

d_ss_s_p15m_di
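The body of `functions.calculate_metric_average` is not shown here. The per-row calculation described above could be sketched as follows. The name `metric_averages`, the choice to average rounds within a bout before averaging across bouts, and the handling of fighters with fewer than three prior bouts (averaging whatever exists) are all assumptions.

```python
import pandas as pd

def metric_averages(metric, fighter_id, date, rounds, n_recent=3):
    """Hypothetical sketch: (career average, last-n-bouts average) from bouts strictly before `date`."""
    prior = rounds[(rounds["fighter_id"] == fighter_id) & (rounds["date"] < date)]
    if prior.empty:
        # Debut: no history is available, so both averages are missing.
        return (float("nan"), float("nan"))
    # Average rounds within each bout, then order the bouts chronologically.
    per_bout = (
        prior.groupby("bout_id")
        .agg(date=("date", "first"), value=(metric, "mean"))
        .sort_values("date")
    )
    career = per_bout["value"].mean()
    recent = per_bout["value"].tail(n_recent).mean()
    return (career, recent)

# Four made-up single-round bouts for one fighter.
history = pd.DataFrame({
    "bout_id": ["b1", "b2", "b3", "b4"],
    "fighter_id": ["f1"] * 4,
    "date": pd.to_datetime(["2012-01-01", "2013-01-01", "2014-01-01", "2015-01-01"]),
    "ss_a_p15m": [60.0, 90.0, 120.0, 150.0],
})

career, recent = metric_averages("ss_a_p15m", "f1", pd.Timestamp("2016-01-01"), history)
debut = metric_averages("ss_a_p15m", "f1", pd.Timestamp("2012-01-01"), history)
```

Using a strict `<` on the date keeps the bout being predicted out of its own features, which is the property the paragraph above describes.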



### Debut fights and inexperienced fighters
Many fights include fighters who have never fought in the UFC before, and some fighters do not have long enough records to calculate 3-fight averages. These fighters have null values in their career or 3-fight averages, so I drop every bout that involves such a fighter here.

In [13]:
fighter_mask = fighter_bout_inst['3fa_ss_a_p15m'].isnull()
debut_bouts = fighter_bout_inst[fighter_mask].bout_id

In [14]:
debut_bouts

fighter_bout_inst
0019ec81fd706ade326f94d6cfb1bf25    0019ec81fd706ade
003c84fc7c3fa028873626e5547b5235    003c84fc7c3fa028
004885f7983f46582aa49b3766a59bcd    004885f7983f4658
0067048aaa2aa2da494b0bfdbac74502    0067048aaa2aa2da
00731068c3195f7fc0eecd851dbf3146    00731068c3195f7f
                                          ...       
fe6b45e7210bfa8edef8166ff24bd237    fe6b45e7210bfa8e
fe9a58e79d54695bb4192a975027aab6    fe9a58e79d54695b
febbd4eb209b4579e93b04e308913c2e    febbd4eb209b4579
fee873593f588437ed9d8ee3a4239b1c    fee873593f588437
ffbc12e4f821ec687a703c565ccaa18f    ffbc12e4f821ec68
Name: bout_id, Length: 1683, dtype: object

In [16]:
bout_mask = fighter_bout_inst.bout_id.isin(debut_bouts)
fighter_bout_inst = fighter_bout_inst[~bout_mask]

## Create the final dataframe

The current fighter-bout instance table has two rows for each fight (one row for each fighter). To create a table where each row represents one fight, I need to get both fighters on the same row.

In [17]:
model_df = functions.merge_fighter_instances(fighter_bout_inst)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  instances_df['inst_id'] = instances_df['bout_id'] + instances_df['fighter_id']
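The source of `functions.merge_fighter_instances` is not included in this notebook. As a rough sketch of the reshaping it performs, the two rows of each bout can be numbered 0 and 1 and then merged on `bout_id`, which yields `_0` and `_1` suffixed columns like those in the output below. `pair_fighters` and the `slot` column are hypothetical names.

```python
import pandas as pd

def pair_fighters(instances):
    """Hypothetical sketch: put both fighters of a bout on one row with _0/_1 suffixes."""
    inst = instances.reset_index(drop=True).copy()
    # Number the fighters within each bout 0 and 1, in row order.
    inst["slot"] = inst.groupby("bout_id").cumcount()
    left = inst[inst["slot"] == 0].drop(columns="slot")
    right = inst[inst["slot"] == 1].drop(columns="slot")
    # An inner merge drops any bout that does not have both fighters.
    return left.merge(right, on="bout_id", suffixes=("_0", "_1"))

# The two fighter-bout instances of one bout, with targets from the table above.
instances = pd.DataFrame({
    "bout_id": ["0027e179b743c86c", "0027e179b743c86c"],
    "fighter_id": ["91ea901c458e95dd", "3aa794cbe1e3484b"],
    "target": [51.1875, 38.25],
})

paired = pair_fighters(instances)
paired["c_ss_a_p15m"] = paired["target_0"] + paired["target_1"]
print(paired)
```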


## Creating Combined Significant Strike Attempts per 15 Minutes (c_ss_a_p15m)

In [18]:
model_df['c_ss_a_p15m'] = model_df['target_0'] + model_df['target_1']
model_df

Unnamed: 0,bout_id,fighter_id_0,date_0,target_0,ca_ss_a_p15m_0,3fa_ss_a_p15m_0,ca_td_s_p15m_di_0,3fa_td_s_p15m_di_0,ca_ss_s_p15m_di_0,3fa_ss_s_p15m_di_0,...,ca_c_ss_s_p15m_di_1,3fa_c_ss_s_p15m_di_1,ca_c_ss_a_p15m_di_1,3fa_c_ss_a_p15m_di_1,ca_d_ss_a_p15m_di_1,3fa_d_ss_a_p15m_di_1,ca_d_ss_s_p15m_di_1,3fa_d_ss_s_p15m_di_1,inst_id_1,c_ss_a_p15m
0,0027e179b743c86c,91ea901c458e95dd,2015-03-14,51.187500,78.528358,67.1834,1.500000,1.71429,23.986567,27.838,...,-5.000000,-5,-3.000000,-3,-86.000000,-86,-38.000000,-38,0027e179b743c86c3aa794cbe1e3484b,89.437500
1,002921976d27b7da,ebc1f40e00e0c481,2014-12-13,17.786561,131.820971,136.544,-0.433461,-1,-13.753972,-21.3637,...,3.825798,22.1577,-7.405384,30.1415,-46.427974,-12.943,-18.981288,11.4666,002921976d27b7dab4ad3a06ee4d660c,106.719368
2,002c1562708ac307,44470bfd9483c7ad,2014-05-24,160.975610,48.000000,48,3.000000,3,7.000000,7,...,2.628510,1.42347,3.107143,2.70918,27.566239,36.3214,3.747558,-3.9949,002c1562708ac30722a92d7f62195791,406.097561
3,00494c77d2a88f8c,7ea1f74cef32f906,2016-11-05,78.000000,135.141414,155.855,0.000000,0.333333,35.297980,37.7306,...,2.859965,5.47656,4.733184,8.82031,-11.864374,-15.3281,3.241653,15.4453,00494c77d2a88f8c08af939f41b5a57b,198.000000
4,0051d7fbb7893d27,b6c4451cb13c9303,2013-10-26,51.000000,73.500000,73.5,1.000000,1,9.500000,9.5,...,1.094035,2.57143,0.711377,2.14286,15.453581,38.3433,-3.594559,3.13362,0051d7fbb7893d27282fa667ff9c51ed,91.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2525,ff4cef86bef2d0aa,db1f2ed63b54b9a7,2017-09-16,161.862745,100.920958,99.2524,0.610672,-0.570652,-19.455496,-74.2916,...,3.772551,6.03383,3.311594,6.46739,19.882345,23.1807,10.922561,15.5133,ff4cef86bef2d0aad4c9dcd330403612,310.490196
2526,ff64fc34065565d0,6fb1ba67bef41b37,2015-05-30,92.000000,97.293860,104.333,1.429825,-0.333333,8.600877,10.3333,...,5.199375,8.58364,6.996808,11.8649,-9.012294,-10.3594,-0.139091,-4.54688,ff64fc34065565d007225ba28ae309b6,208.857143
2527,ffd3e3d37cba32da,92a9aa9c93192871,2014-10-25,154.000000,143.855810,175.261,3.018444,0,4.804949,-10.3109,...,-2.625000,-3,-3.750000,-4.28571,6.414204,11.3992,-11.206462,-6.29744,ffd3e3d37cba32da7413b80dbb0f8f9f,236.000000
2528,ffea776913451b6d,75e5fec9f72910ef,2015-02-28,114.649682,88.000347,98,3.846692,3.66667,-6.478311,-12.3333,...,1.852669,0.688907,1.276528,-1.46939,16.600578,0.152086,7.633712,13.4629,ffea776913451b6d22a92d7f62195791,332.484076


#### Drop unnecessary columns

In [19]:
model_df.columns

Index(['bout_id', 'fighter_id_0', 'date_0', 'target_0', 'ca_ss_a_p15m_0',
       '3fa_ss_a_p15m_0', 'ca_td_s_p15m_di_0', '3fa_td_s_p15m_di_0',
       'ca_ss_s_p15m_di_0', '3fa_ss_s_p15m_di_0', 'ca_ss_a_p15m_di_0',
       '3fa_ss_a_p15m_di_0', 'ca_g_ss_s_p15m_di_0', '3fa_g_ss_s_p15m_di_0',
       'ca_g_ss_a_p15m_di_0', '3fa_g_ss_a_p15m_di_0', 'ca_c_ss_s_p15m_di_0',
       '3fa_c_ss_s_p15m_di_0', 'ca_c_ss_a_p15m_di_0', '3fa_c_ss_a_p15m_di_0',
       'ca_d_ss_a_p15m_di_0', '3fa_d_ss_a_p15m_di_0', 'ca_d_ss_s_p15m_di_0',
       '3fa_d_ss_s_p15m_di_0', 'inst_id_0', 'fighter_id_1', 'date_1',
       'target_1', 'ca_ss_a_p15m_1', '3fa_ss_a_p15m_1', 'ca_td_s_p15m_di_1',
       '3fa_td_s_p15m_di_1', 'ca_ss_s_p15m_di_1', '3fa_ss_s_p15m_di_1',
       'ca_ss_a_p15m_di_1', '3fa_ss_a_p15m_di_1', 'ca_g_ss_s_p15m_di_1',
       '3fa_g_ss_s_p15m_di_1', 'ca_g_ss_a_p15m_di_1', '3fa_g_ss_a_p15m_di_1',
       'ca_c_ss_s_p15m_di_1', '3fa_c_ss_s_p15m_di_1', 'ca_c_ss_a_p15m_di_1',
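       '3fa_c_ss_a_p15m_di_1', 'ca_d_ss_a_p15m_di_1', '3fa_d_ss_a_p15m_di_1',
       'ca_d_ss_s_p15m_di_1', '3fa_d_ss_s_p15m_di_1', 'inst_id_1',
       'c_ss_a_p15m'],
      dtype='object')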

In [22]:
model_df = model_df.loc[:,['ca_ss_a_p15m_0',
       '3fa_ss_a_p15m_0', 'ca_td_s_p15m_di_0', '3fa_td_s_p15m_di_0',
       'ca_ss_s_p15m_di_0', '3fa_ss_s_p15m_di_0', 'ca_ss_a_p15m_di_0',
       '3fa_ss_a_p15m_di_0', 'ca_g_ss_s_p15m_di_0', '3fa_g_ss_s_p15m_di_0',
       'ca_g_ss_a_p15m_di_0', '3fa_g_ss_a_p15m_di_0', 'ca_c_ss_s_p15m_di_0',
       '3fa_c_ss_s_p15m_di_0', 'ca_c_ss_a_p15m_di_0', '3fa_c_ss_a_p15m_di_0',
       'ca_d_ss_a_p15m_di_0', '3fa_d_ss_a_p15m_di_0', 'ca_d_ss_s_p15m_di_0',
       '3fa_d_ss_s_p15m_di_0', 'ca_ss_a_p15m_1', '3fa_ss_a_p15m_1', 'ca_td_s_p15m_di_1',
       '3fa_td_s_p15m_di_1', 'ca_ss_s_p15m_di_1', '3fa_ss_s_p15m_di_1',
       'ca_ss_a_p15m_di_1', '3fa_ss_a_p15m_di_1', 'ca_g_ss_s_p15m_di_1',
       '3fa_g_ss_s_p15m_di_1', 'ca_g_ss_a_p15m_di_1', '3fa_g_ss_a_p15m_di_1',
       'ca_c_ss_s_p15m_di_1', '3fa_c_ss_s_p15m_di_1', 'ca_c_ss_a_p15m_di_1',
       '3fa_c_ss_a_p15m_di_1', 'ca_d_ss_a_p15m_di_1', '3fa_d_ss_a_p15m_di_1',
       'ca_d_ss_s_p15m_di_1', '3fa_d_ss_s_p15m_di_1','c_ss_a_p15m']]

In [23]:
model_df

Unnamed: 0,ca_ss_a_p15m_0,3fa_ss_a_p15m_0,ca_td_s_p15m_di_0,3fa_td_s_p15m_di_0,ca_ss_s_p15m_di_0,3fa_ss_s_p15m_di_0,ca_ss_a_p15m_di_0,3fa_ss_a_p15m_di_0,ca_g_ss_s_p15m_di_0,3fa_g_ss_s_p15m_di_0,...,3fa_g_ss_a_p15m_di_1,ca_c_ss_s_p15m_di_1,3fa_c_ss_s_p15m_di_1,ca_c_ss_a_p15m_di_1,3fa_c_ss_a_p15m_di_1,ca_d_ss_a_p15m_di_1,3fa_d_ss_a_p15m_di_1,ca_d_ss_s_p15m_di_1,3fa_d_ss_s_p15m_di_1,c_ss_a_p15m
0,78.528358,67.1834,1.500000,1.71429,23.986567,27.838,35.825373,36.6077,13.204478,15.435,...,0,-5.000000,-5,-3.000000,-3,-86.000000,-86,-38.000000,-38,89.437500
1,131.820971,136.544,-0.433461,-1,-13.753972,-21.3637,-17.983095,-1.4005,-1.755793,0,...,33.9381,3.825798,22.1577,-7.405384,30.1415,-46.427974,-12.943,-18.981288,11.4666,106.719368
2,48.000000,48,3.000000,3,7.000000,7,-15.000000,-15,3.000000,3,...,1.14796,2.628510,1.42347,3.107143,2.70918,27.566239,36.3214,3.747558,-3.9949,406.097561
3,135.141414,155.855,0.000000,0.333333,35.297980,37.7306,49.795455,56.0606,5.270202,6.6936,...,15.8203,2.859965,5.47656,4.733184,8.82031,-11.864374,-15.3281,3.241653,15.4453,198.000000
4,73.500000,73.5,1.000000,1,9.500000,9.5,14.000000,14,14.000000,14,...,-4.34601,1.094035,2.57143,0.711377,2.14286,15.453581,38.3433,-3.594559,3.13362,91.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2525,100.920958,99.2524,0.610672,-0.570652,-19.455496,-74.2916,-33.053421,-106.927,-10.786923,-39.7764,...,-16.6799,3.772551,6.03383,3.311594,6.46739,19.882345,23.1807,10.922561,15.5133,310.490196
2526,97.293860,104.333,1.429825,-0.333333,8.600877,10.3333,16.785088,35,5.789474,7.66667,...,-6.07996,5.199375,8.58364,6.996808,11.8649,-9.012294,-10.3594,-0.139091,-4.54688,208.857143
2527,143.855810,175.261,3.018444,0,4.804949,-10.3109,60.855043,69.0756,6.073953,1.78992,...,23.8957,-2.625000,-3,-3.750000,-4.28571,6.414204,11.3992,-11.206462,-6.29744,236.000000
2528,88.000347,98,3.846692,3.66667,-6.478311,-12.3333,-11.043388,-30.3333,1.829027,2.66667,...,-7.42085,1.852669,0.688907,1.276528,-1.46939,16.600578,0.152086,7.633712,13.4629,332.484076


# 2. Modelling

In [24]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import PoissonRegressor
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score

### Target
The target variable for all current model iterations is the Combined Significant Strike Attempts per 15 minutes of a match (`c_ss_a_p15m`). It is the sum of both fighters' significant strike attempts per 15 minutes.

#### First Simple Model
- Model: Poisson Regressor
- Features: Career Average Significant Strike Attempts per 15 minutes
- Preprocessing: Standard Scaler

#### Split

In [25]:
model_df.ca_ss_a_p15m_0

0        78.528358
1       131.820971
2        48.000000
3       135.141414
4        73.500000
           ...    
2525    100.920958
2526     97.293860
2527    143.855810
2528     88.000347
2529     94.686768
Name: ca_ss_a_p15m_0, Length: 2530, dtype: float64

In [27]:
X = model_df.loc[:,['ca_ss_a_p15m_0', 'ca_ss_a_p15m_1']]
y = model_df.c_ss_a_p15m

In [28]:
X_train, X_test, y_train, y_test = train_test_split(X, y,random_state=1)

#### Preprocessing

In [29]:
ss = StandardScaler()
X_train_ss = ss.fit_transform(X_train)
X_test_ss = ss.transform(X_test)

#### Cross-Validation
Performance was evaluated using D² (the fraction of Poisson deviance explained), which is the default score of `PoissonRegressor`, as well as R-squared, so that it can be compared easily with other non-Poisson models.

##### R-squared cross-val scores

In [30]:
pr = PoissonRegressor()
cross_val_score(pr, X_train_ss, y_train, scoring='r2')

array([0.13798269, 0.09213633, 0.09666464, 0.0394901 , 0.11371957])

##### Poisson Deviance (D²) cross-val scores

In [31]:
cross_val_score(pr, X_train_ss, y_train)

array([0.14262972, 0.10545858, 0.109673  , 0.04396868, 0.11916607])
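When no `scoring` argument is given, `cross_val_score` falls back to the estimator's own `score` method. For `PoissonRegressor` that method returns D², the fraction of Poisson deviance explained, rather than the mean deviance itself. The NumPy sketch below shows how the two quantities relate; the function names are my own, and scikit-learn also provides `sklearn.metrics.mean_poisson_deviance` for the first of them.

```python
import numpy as np

def mean_poisson_deviance(y_true, y_pred):
    """Mean Poisson deviance: 2 * mean(y*log(y/mu) - y + mu), taking y*log(y/mu) as 0 when y == 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Replace the ratio by 1 where y == 0 so that the log term contributes 0.
    ratio = np.where(y_true > 0, y_true / y_pred, 1.0)
    return 2.0 * np.mean(y_true * np.log(ratio) - y_true + y_pred)

def d2_score(y_true, y_pred):
    """Fraction of deviance explained relative to always predicting the mean of y_true."""
    y_true = np.asarray(y_true, dtype=float)
    baseline = np.full_like(y_true, y_true.mean())
    return 1.0 - mean_poisson_deviance(y_true, y_pred) / mean_poisson_deviance(y_true, baseline)

# Made-up targets: a perfect prediction and a constant mean prediction.
y = [100.0, 200.0, 300.0]
perfect = d2_score(y, y)
constant = d2_score(y, [200.0, 200.0, 200.0])
```

Like R-squared, D² is 1 for a perfect fit and 0 for a model that always predicts the mean, which is why the two sets of cross-validation scores above are on a similar scale.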

## Evaluation on Test Set

In [32]:
pr.fit(X_train_ss, y_train)

PoissonRegressor()

#### R-squared

In [33]:
y_hat = pr.predict(X_test_ss)

In [34]:
r2_score(y_test, y_hat)

0.12123274259025885

#### Poisson Deviance Explained (D²)

In [35]:
pr.score(X_test_ss, y_test)

0.12886329003133157

Our goal is to predict at least 95% of the matches to within 5 strikes of the actual result, so this metric is also included. The table created below has one column for the predictions and one for the actual results, with each row representing one observation.

In [36]:
results = pd.DataFrame({'model_predictions': list(y_hat), 'actual_results': list(y_test)})

In [37]:
def compare_within_window(row):
    """
    Given a row of the results dataframe above, returns True if the prediction
    is within 5 strikes of the actual result.
    """
    pred = row['model_predictions']
    true = row['actual_results']
    return (true - 5) <= pred <= (true + 5)
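The same check can be written without a row-wise `apply` by comparing the absolute error of every row to the window at once. This is a sketch with made-up numbers, not the notebook's test-set results.

```python
import pandas as pd

# Made-up predictions and outcomes, in strike attempts per 15 minutes.
example = pd.DataFrame({
    "model_predictions": [254.2, 260.0, 230.0],
    "actual_results": [256.0, 17.1, 235.0],
})

# True where the absolute error is at most 5 (boundaries included).
within = (example["model_predictions"] - example["actual_results"]).abs() <= 5

# The mean of a boolean Series is the share of True values.
share_within = within.mean()
```

Both versions include the boundaries, so a prediction that is exactly 5 away from the actual result counts as inside the window.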

In [38]:
accuracy_within_window = results.apply(compare_within_window, axis=1)

In [39]:
accuracy_within_window.mean()

0.0315955766192733

The model predicts only about 3% of the test fights within a 5-strike window, far short of the 95% goal. When it does get it right, what is it guessing?

In [40]:
within_5 = results[accuracy_within_window]
within_5.describe()

Unnamed: 0,model_predictions,actual_results
count,20.0,20.0
mean,254.249929,254.934066
std,40.033294,39.627806
min,179.720789,182.681818
25%,225.350084,223.724299
50%,253.58243,256.1875
75%,275.124765,276.25
max,363.84271,362.068966


It looks like it only gets fights right where the target lies roughly between 180 and 365 strike attempts per 15 minutes. Here's a description of the matches it gets wrong.

In [41]:
results[~accuracy_within_window].describe()

Unnamed: 0,model_predictions,actual_results
count,613.0,613.0
mean,260.246112,265.507521
std,43.358574,148.857945
min,165.116156,17.088608
25%,230.180349,164.0
50%,254.583513,248.0
75%,284.702238,333.0
max,522.325849,1237.5


Of these observations, the actual results have a much higher spread, with a standard deviation over 3 times higher than that of the predictions.
The model's predictions stay between about 165 and 522 strike attempts per 15 minutes, with half of them between 230 and 285, even when the data has a much greater spread.

#### Latest model
- Model: Poisson Regressor
- Features: 
    - Career Average Significant Strike Attempts per 15 minutes
    - Differentials for both Attempts and Successes (3-fight and career averages):
        - Ground Strikes
        - Clinch Strikes
        - Distance Strikes
        - Significant Strikes
        - Takedowns
- Preprocessing: Standard Scaler

#### Split

In [43]:
X = model_df.drop('c_ss_a_p15m', axis=1)
y = model_df.c_ss_a_p15m

In [44]:
X_train, X_test, y_train, y_test = train_test_split(X, y,random_state=1)

#### Preprocessing

In [45]:
ss = StandardScaler()
X_train_ss = ss.fit_transform(X_train)
X_test_ss = ss.transform(X_test)

#### Cross-Validation
Performance was evaluated using D² (the fraction of Poisson deviance explained), which is the default score of `PoissonRegressor`, as well as R-squared, so that it can be compared easily with other non-Poisson models.

##### R-squared cross-val scores

In [47]:
pr = PoissonRegressor(max_iter=200)
cross_val_score(pr, X_train_ss, y_train, scoring='r2')

array([0.10380242, 0.12178651, 0.1472482 , 0.01400341, 0.11080822])

##### Poisson Deviance (D²) cross-val scores

In [48]:
cross_val_score(pr, X_train_ss, y_train)

array([0.11883002, 0.13560646, 0.15118895, 0.02439766, 0.11346774])

## Evaluation on Test Set

In [49]:
pr.fit(X_train_ss, y_train)

PoissonRegressor(max_iter=200)

#### R-squared

In [50]:
y_hat = pr.predict(X_test_ss)

In [51]:
r2_score(y_test, y_hat)

0.10473800471537609

#### Poisson Deviance Explained (D²)

In [52]:
pr.score(X_test_ss, y_test)

0.120981379628327

Our goal is to predict at least 95% of the matches to within 5 strikes of the actual result, so this metric is also included. The table created below has one column for the predictions and one for the actual results, with each row representing one observation.

In [53]:
results = pd.DataFrame({'model_predictions': list(y_hat), 'actual_results': list(y_test)})

In [54]:
def compare_within_window(row):
    """
    Given a row of the results dataframe above, returns True if the prediction
    is within 5 strikes of the actual result.
    """
    pred = row['model_predictions']
    true = row['actual_results']
    return (true - 5) <= pred <= (true + 5)

In [55]:
accuracy_within_window = results.apply(compare_within_window, axis=1)

In [56]:
accuracy_within_window.mean()

0.045813586097946286

The model predicts fewer than 5% of the test fights within a 5-strike window, still far short of the 95% goal. When it does get it right, what is it guessing?

In [57]:
within_5 = results[accuracy_within_window]
within_5.describe()

Unnamed: 0,model_predictions,actual_results
count,29.0,29.0
mean,265.899585,265.443675
std,40.986773,41.145494
min,192.960352,193.0
25%,249.483763,247.0
50%,264.330104,264.454976
75%,283.358228,285.0
max,369.521294,368.0


It looks like it only gets fights right where the target lies roughly between 190 and 370 strike attempts per 15 minutes. Here's a description of the matches it gets wrong.

In [58]:
results[~accuracy_within_window].describe()

Unnamed: 0,model_predictions,actual_results
count,604.0,604.0
mean,262.561047,265.160472
std,55.602406,149.879522
min,157.315581,17.088608
25%,224.877003,159.876561
50%,255.00026,246.965704
75%,286.996049,334.5
max,624.925491,1237.5


Of these observations, the actual results have a much higher spread, with a standard deviation about 2.7 times that of the predictions.
The model's predictions stay between about 157 and 625 strike attempts per 15 minutes, with half of them between 225 and 287, even when the data has a much greater spread.