# Purpose
Add stat differentials.

### Target
The target variable for all current model iterations is the Combined Average Distance Significant Strike Attempts per Minute of a match: the sum of both fighters' distance significant strike attempts per minute. Distance strikes exclude clinch and ground strikes.
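
As a minimal sketch of how such a combined target could be derived, assume a bout-level table with one per-fighter rate column for each corner. The column names `dist_ss_a_pm_0` and `dist_ss_a_pm_1` and the values are hypothetical and not taken from the project data:

```python
import pandas as pd

# Hypothetical per-bout rates for each fighter; names and values are illustrative only.
bouts = pd.DataFrame({
    "bout_id": ["a", "b"],
    "dist_ss_a_pm_0": [4.0, 6.5],  # fighter 0: distance sig. strike attempts per minute
    "dist_ss_a_pm_1": [5.0, 3.5],  # fighter 1: same stat
})

# The combined target is the sum of both fighters' rates in the bout.
bouts["combined_target"] = bouts["dist_ss_a_pm_0"] + bouts["dist_ss_a_pm_1"]
print(bouts[["bout_id", "combined_target"]])
```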

#### Model
- Model: Linear Regression
- Features: 
    - Career Average Significant Strike Attempts per 15 Minutes
    - Career Average Takedown Attempts per 15 Minutes
    - Career Average Significant Strike Attempts per 15 Minutes Differentials
    - Career Average Takedown Attempts per 15 Minutes Differentials
- Preprocessing: Standard Scaler

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler

#### Import and Split

In [3]:
data = pd.read_csv('../../data/modelling_data/model_3_data.csv', index_col=0)
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3971 entries, 0 to 3970
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   date_0               3971 non-null   object 
 1   bout_id              3971 non-null   object 
 2   fighter_id_0         3971 non-null   object 
 3   ca_s_ss_a_p15m_0     3971 non-null   float64
 4   ca_td_a_p15m_0       3971 non-null   float64
 5   ca_s_ss_a_p15m_di_0  3971 non-null   float64
 6   ca_td_a_p15m_di_0    3971 non-null   float64
 7   fighter_id_1         3971 non-null   object 
 8   ca_s_ss_a_p15m_1     3971 non-null   float64
 9   ca_td_a_p15m_1       3971 non-null   float64
 10  ca_s_ss_a_p15m_di_1  3971 non-null   float64
 11  ca_td_a_p15m_di_1    3971 non-null   float64
 12  c_s_ss_a_p15m        3971 non-null   float64
dtypes: float64(9), object(4)
memory usage: 434.3+ KB


In [4]:
X = data.loc[:,['ca_s_ss_a_p15m_0', 'ca_td_a_p15m_0', 'ca_s_ss_a_p15m_di_0', 'ca_td_a_p15m_di_0',
                'ca_s_ss_a_p15m_1', 'ca_td_a_p15m_1', 'ca_s_ss_a_p15m_di_1', 'ca_td_a_p15m_di_1',]]
y = data.c_s_ss_a_p15m

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

#### Preprocessing

In [6]:
ss = StandardScaler()
X_train_ss = ss.fit_transform(X_train)
X_test_ss = ss.transform(X_test)
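
One design note on this step: scaling the whole training set before cross-validation means each validation fold has already influenced the scaler's mean and variance. A possible alternative, sketched here on synthetic data rather than the project data, is to wrap the scaler and the regressor in a pipeline so the scaler is refit inside every fold. The names `X_demo`, `y_demo`, and `pipe` are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train / y_train (8 features, like the model above).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 8))
y_demo = X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(size=200)

# Inside cross_val_score, the scaler is fit on the training portion of each fold only.
pipe = make_pipeline(StandardScaler(), LinearRegression())
scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="r2")
print(scores.mean(), scores.std())
```

For an ordinary least squares fit, standardizing the features does not change the predictions, so this mainly matters if a regularized model is swapped in later.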

#### Cross-Validation
Performance was evaluated using r-squared, the default score that `cross_val_score` reports for a `LinearRegression`, over the default 5 folds.

##### R-squared cross-val scores

In [7]:
lr = LinearRegression()
cross_val_score(lr, X_train_ss, y_train)

array([0.13961657, 0.15111629, 0.15742948, 0.1313611 , 0.12593388])

These scores range from .126 to .157, showing a small amount of variance across folds but still a high level of bias.

## Evaluation on Test Set

In [8]:
lr.fit(X_train_ss, y_train)

LinearRegression()

In [9]:
lr.score(X_test_ss, y_test)

0.17978855843279706

This model achieves an r-squared of .180 on the test set, an improvement over the .157 from the first simple model. However, it includes 6 more features, which could by itself be responsible for this increase.
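
One way to check whether extra features alone explain a gain is adjusted r-squared, which penalizes r-squared for the number of predictors. The sketch below is not part of the original analysis. It assumes the default 25% test split of the 3,971 rows (993 test rows) and the 8 features listed above; `adjusted_r2` is a helper defined here for illustration.

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Test-set R^2 reported above; n_test is an assumption based on the default 25% split.
r2_test = 0.17978855843279706
n_test = 993
adj = adjusted_r2(r2_test, n_samples=n_test, n_features=8)
print(round(adj, 3))
```

Comparing this against the same quantity for the first simple model would require that model's feature count and test-set size, which are not shown here.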

## Polynomial Test

#### Preprocessing

In [10]:
from sklearn.preprocessing import PolynomialFeatures

In [11]:
pf = PolynomialFeatures() # Default degree = 2
X_train_pf = pf.fit_transform(X_train)
X_test_pf = pf.transform(X_test)


ss = StandardScaler()
X_train_ss = ss.fit_transform(X_train_pf)
X_test_ss = ss.transform(X_test_pf)
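
To get a sense of how much the feature space grows, the number of columns produced by a full polynomial expansion of degree d over n inputs (including the bias column, which `PolynomialFeatures` adds by default) is C(n + d, d). A quick check for the 8 features used here, with `n_poly_features` as an illustrative helper:

```python
from math import comb

def n_poly_features(n_inputs: int, degree: int) -> int:
    """Number of polynomial terms up to `degree`, including the constant term."""
    return comb(n_inputs + degree, degree)

print(n_poly_features(8, 2))  # 45 columns: 1 bias + 8 linear + 36 degree-2 terms
```

Fitting 45 columns instead of 8 on the same number of rows gives the model more room to fit noise in individual folds.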

#### Cross-Validation
As before, performance was evaluated using r-squared over 5 cross-validation folds.

##### R-squared cross-val scores

In [12]:
lr = LinearRegression()
cross_val_score(lr, X_train_ss, y_train)

array([0.12115948, 0.17669153, 0.16985228, 0.02188464, 0.1214572 ])

These scores range from .022 to .177, a much higher amount of variance across folds than the model without polynomial features, and still a high level of bias.
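
The difference in fold-to-fold spread can be quantified directly from the two cross-validation score arrays printed above. This small check uses only those printed scores and the standard library:

```python
from statistics import mean, stdev

# Cross-validation R^2 scores copied from the two outputs above.
linear_scores = [0.13961657, 0.15111629, 0.15742948, 0.1313611, 0.12593388]
poly_scores = [0.12115948, 0.17669153, 0.16985228, 0.02188464, 0.1214572]

for name, fold_scores in [("linear", linear_scores), ("polynomial", poly_scores)]:
    # Sample standard deviation across the 5 folds.
    print(f"{name}: mean={mean(fold_scores):.3f}, std={stdev(fold_scores):.3f}")
```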

## Evaluation on Test Set

In [13]:
lr.fit(X_train_ss, y_train)

LinearRegression()

In [14]:
lr.score(X_test_ss, y_test)

0.1820330973394524

This model performs slightly better than the previous model on the test set, with a .002 improvement in r-squared. However, the cross-validation scores show that polynomial features introduce a much higher degree of variance, so this small gain is unlikely to be reliable and polynomial features would likely decrease the model's performance on new data.

## Results

Polynomial features reduced a small amount of bias but introduced a large amount of variance. The first model, without polynomial features, achieves a .180 r-squared, which is a slight but noticeable improvement over the .157 of the first simple model (FSM). This is likely due to the increase in the number of features rather than to any added predictive ability of the fighters' stats.

#### Next steps
It's possible that career-long averages don't capture a fighter's current abilities, because the fighter could have gotten better or worse since their first few fights. This data set also includes fighters with only 1 or 2 fights, which likely does not provide enough information to make reliable predictions. We will include 3-fight averages and exclude all fighters with fewer than 3 fights.