# Purpose
Add stat differentials to the model's features.

### Target
The target variable for all current model iterations is the Combined Average Distance Significant Strike Attempts per Minute of a match: the sum of both fighters' distance significant strike attempts per minute. Distance strikes exclude clinch and ground strikes.
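
As a worked illustration of the target definition above, the sketch below sums two per-minute rates for a single made-up bout. The function name, its parameters, and the numbers are hypothetical; they are not taken from the modelling data.

```python
def combined_attempts_per_minute(attempts_0, attempts_1, match_minutes):
    """Sum both fighters' distance significant strike attempts per minute.

    The function and parameter names are hypothetical, chosen for this
    illustration only; they are not columns of the modelling data.
    """
    if match_minutes <= 0:
        raise ValueError("match_minutes must be positive")
    rate_0 = attempts_0 / match_minutes  # fighter 0 attempts per minute
    rate_1 = attempts_1 / match_minutes  # fighter 1 attempts per minute
    return rate_0 + rate_1


# Made-up bout: 90 and 60 distance attempts over a 15-minute match.
example_target = combined_attempts_per_minute(90, 60, 15.0)
print(example_target)
```

Clinch and ground strikes would be left out of both attempt counts before this calculation.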

#### Model
- Model: Linear Regressor
- Features: 
    - Career Average Significant Strike Attempts per 15 Minutes
    - Career Average Takedown Attempts per 15 Minutes
    - Career Average Significant Strike Attempts per 15 Minutes Differentials
    - Career Average Takedown Attempts per 15 Minutes Differentials
- Preprocessing: Standard Scaler

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler

#### Import and Split

In [2]:
data = pd.read_csv('../../data/modelling_data/model_4_data.csv')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3316 entries, 0 to 3315
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   date_0                3316 non-null   object 
 1   bout_id               3316 non-null   object 
 2   fighter_id_0          3316 non-null   object 
 3   ca_s_ss_a_p15m_0      3316 non-null   float64
 4   ca_td_a_p15m_0        3316 non-null   float64
 5   ca_s_ss_a_p15m_di_0   3316 non-null   float64
 6   ca_td_a_p15m_di_0     3316 non-null   float64
 7   3fa_s_ss_a_p15m_0     3316 non-null   float64
 8   3fa_td_a_p15m_0       3316 non-null   float64
 9   3fa_s_ss_a_p15m_di_0  3316 non-null   float64
 10  3fa_td_a_p15m_di_0    3316 non-null   float64
 11  fighter_id_1          3316 non-null   object 
 12  ca_s_ss_a_p15m_1      3316 non-null   float64
 13  ca_td_a_p15m_1        3316 non-null   float64
 14  ca_s_ss_a_p15m_di_1   3316 non-null   float64
 15  ca_td_a_p15m_di_1    

In [3]:
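# Career-average features and their differentials for both fighters (suffixes _0 and _1).
X = data.loc[:, ['ca_s_ss_a_p15m_0', 'ca_td_a_p15m_0', 'ca_s_ss_a_p15m_di_0', 'ca_td_a_p15m_di_0',
                 'ca_s_ss_a_p15m_1', 'ca_td_a_p15m_1', 'ca_s_ss_a_p15m_di_1', 'ca_td_a_p15m_di_1']]
# Target column.
y = data['c_s_ss_a_p15m']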

In [4]:
# With no test_size given, train_test_split holds out 25% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

#### Preprocessing

In [5]:
ss = StandardScaler()
X_train_ss = ss.fit_transform(X_train)
X_test_ss = ss.transform(X_test)
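
To make the scaling step concrete, here is a minimal sketch of what standardization does to a single column, written with the standard library instead of `StandardScaler`. The helper names and the values are made up for illustration. The point it shows is that the mean and standard deviation are learned from the training values only and then reused unchanged on the test values.

```python
from statistics import fmean, pstdev


def fit_scaler(train_column):
    """Return the mean and population standard deviation of the training values."""
    return fmean(train_column), pstdev(train_column)


def apply_scaler(column, mean, std):
    """Standardize values with statistics learned elsewhere (here: the training set)."""
    return [(value - mean) / std for value in column]


# Made-up values standing in for one feature column.
train_values = [2.0, 4.0, 6.0, 8.0]
test_values = [5.0, 10.0]

mean, std = fit_scaler(train_values)                       # learned from train only
train_scaled = apply_scaler(train_values, mean, std)
test_scaled = apply_scaler(test_values, mean, std)         # reuses the train statistics

print(train_scaled)
print(test_scaled)
```

The population standard deviation (`pstdev`) is used because `StandardScaler` also divides by the number of samples rather than by one less.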

#### Cross-Validation
Performance is evaluated using R-squared, the default score that `cross_val_score` reports for a `LinearRegression`, so that it can be compared easily with the other models.

##### R-squared cross-val scores

In [6]:
lr = LinearRegression()
cross_val_score(lr, X_train_ss, y_train)

array([0.18271794, 0.15796221, 0.21848821, 0.22119189, 0.19118267])

These scores range from about .16 to .22 (mean of about .19), with a small amount of variance across folds and still a high level of bias.
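
The five fold scores printed above can be summarized directly. The sketch below copies them from the cell output and reports their mean, spread, and range; it uses only the standard library, and the variable names are illustrative.

```python
from statistics import fmean, pstdev

# R-squared scores copied from the cross_val_score output above.
cv_scores = [0.18271794, 0.15796221, 0.21848821, 0.22119189, 0.19118267]

mean_score = fmean(cv_scores)   # average R-squared across the five folds
spread = pstdev(cv_scores)      # fold-to-fold standard deviation

print(f"mean R-squared: {mean_score:.3f}")
print(f"std across folds: {spread:.3f}")
print(f"range: {min(cv_scores):.3f} to {max(cv_scores):.3f}")
```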

## Evaluation on Test Set

In [7]:
lr.fit(X_train_ss, y_train)

LinearRegression()

In [8]:
lr.score(X_test_ss, y_test)

0.14931252032779718

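This model achieves an R-squared of .149 on the test set, which is not an improvement over the .157 from the first simple model, even though its cross-validation scores average about .19. The model includes 6 more features than the first one, and adding features can by itself raise R-squared, so any gain should be read with caution.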
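
One common way to account for the number of features is adjusted R-squared, which penalizes each added predictor. This is a swapped-in check, not something the notebook computes. The sketch below applies the standard formula to the test score printed above, assuming 829 test rows (the default 25% split of the 3316 rows) and the 8 features selected for `X`.

```python
def adjusted_r_squared(r_squared, n_samples, n_features):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    denominator = n_samples - n_features - 1
    if denominator <= 0:
        raise ValueError("need more samples than features plus one")
    return 1.0 - (1.0 - r_squared) * (n_samples - 1) / denominator


test_r_squared = 0.14931252032779718  # lr.score(X_test_ss, y_test) from above
n_test = 829      # assumption: default test_size of 0.25 applied to 3316 rows
n_features = 8    # the 8 columns selected for X

adjusted = adjusted_r_squared(test_r_squared, n_test, n_features)
print(f"adjusted R-squared: {adjusted:.3f}")
```

With this many rows relative to features the penalty is small, so the adjustment mainly matters when comparing models whose feature counts differ a lot.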