# Purpose
Extend the first simple model (FSM) by adding takedown statistics to its features.

### Target
The target variable for all current model iterations is the Combined Average Distance Significant Strike Attempts per Minute of a match: the sum of both fighters' distance significant strike attempts per minute. Distance strikes exclude clinch and ground strikes. In this notebook the target is read from the `c_s_ss_a_p15m` column.
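
As a minimal sketch of how a combined target like this is assembled, the snippet below sums the two fighters' rates for each bout. The column names (`rate_0`, `rate_1`, `combined_rate`) and all values are hypothetical and made up for illustration; they are not taken from the project's data.

```python
import pandas as pd

# Hypothetical per-bout strike-attempt rates for fighter 0 and fighter 1
# (illustrative values only, not real fight data).
bouts = pd.DataFrame({
    "rate_0": [5.0, 3.5, 8.0],
    "rate_1": [4.0, 6.5, 2.0],
})

# Combined target: the sum of both fighters' rates in the same bout.
bouts["combined_rate"] = bouts["rate_0"] + bouts["rate_1"]
print(bouts)
```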

#### Second Model (FSM plus Takedowns)
- Model: Linear Regression (`LinearRegression`)
- Features (one value per fighter):
    - Career Average Standing Significant Strike Attempts per 15 Minutes
    - Career Average Takedown Attempts per 15 Minutes
- Preprocessing: Standard Scaler (`StandardScaler`)
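
The specification above can also be expressed as a single scikit-learn pipeline. The sketch below is one possible arrangement, not the code this notebook runs: it uses synthetic stand-in data (`X_demo`, `y_demo` are assumptions, generated at random) and chains the scaler and the regressor so that the scaler is refit on the training portion of every cross-validation fold.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the four career-average features (assumption:
# random data, used only so the example runs on its own).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 4))
y_demo = X_demo @ np.array([0.5, 0.1, 0.5, 0.1]) + rng.normal(scale=5.0, size=200)

# Scaler and regressor chained into one estimator.
model = make_pipeline(StandardScaler(), LinearRegression())

# cross_val_score refits the whole pipeline on each fold and reports
# r-squared, the default score for regressors.
scores = cross_val_score(model, X_demo, y_demo, cv=5)
print(scores)
```

For ordinary least squares, standardizing the features does not change r-squared; its main benefit here is that the fitted coefficients become comparable across features.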

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler

#### Import and Split

In [2]:
data = pd.read_csv('../../data/modelling_data/model_2_data.csv', index_col=0)
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3971 entries, 0 to 3970
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   date_0            3971 non-null   object 
 1   bout_id           3971 non-null   object 
 2   fighter_id_0      3971 non-null   object 
 3   ca_s_ss_a_p15m_0  3971 non-null   float64
 4   ca_td_a_p15m_0    3971 non-null   float64
 5   fighter_id_1      3971 non-null   object 
 6   ca_s_ss_a_p15m_1  3971 non-null   float64
 7   ca_td_a_p15m_1    3971 non-null   float64
 8   c_s_ss_a_p15m     3971 non-null   float64
dtypes: float64(5), object(4)
memory usage: 310.2+ KB


In [3]:
X = data.loc[:,['ca_s_ss_a_p15m_0', 'ca_td_a_p15m_0', 
                'ca_s_ss_a_p15m_1', 'ca_td_a_p15m_1']]
y = data.c_s_ss_a_p15m

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

#### Preprocessing

In [5]:
ss = StandardScaler()
X_train_ss = ss.fit_transform(X_train)
X_test_ss = ss.transform(X_test)

#### Cross-Validation
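Performance is evaluated with r-squared, the default score that `cross_val_score` reports for a regressor, so that this model can be compared easily with other models, including non-Poisson ones. Mean Poisson deviance, the standard metric for Poisson regression, is not computed for this linear model.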

##### R-squared cross-val scores

In [6]:
lr = LinearRegression()
cross_val_score(lr, X_train_ss, y_train)

array([0.11658963, 0.14081928, 0.1341381 , 0.09672126, 0.12571413])

The scores range from about .10 to .14 and are spread fairly evenly across the five folds. The model appears to have a moderate amount of variance and, given how low the scores are, a high level of bias as well.
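
To put a number on the fold-to-fold spread, the sketch below summarizes the five scores printed by `cross_val_score` above with their mean and sample standard deviation. The variable names are illustrative; the values are copied from that output.

```python
import numpy as np

# Fold scores copied from the cross_val_score output above.
cv_scores = np.array([0.11658963, 0.14081928, 0.1341381, 0.09672126, 0.12571413])

cv_mean = cv_scores.mean()
cv_std = cv_scores.std(ddof=1)  # sample standard deviation across folds

print(f"mean r-squared: {cv_mean:.3f}, std across folds: {cv_std:.3f}")
```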

## Evaluation on Test Set

In [7]:
lr.fit(X_train_ss, y_train)

LinearRegression()

In [8]:
lr.score(X_test_ss, y_test)

0.1503168263877489

## Results

This model's test r-squared (.150) is .007 lower than that of the first simple model (FSM), a difference small enough that it could be due to random chance.
The addition of takedown stats did not meaningfully change the predictive ability of the model.
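
One way to judge whether a gap this small exceeds split-to-split noise is to score both feature sets on the same repeated cross-validation splits and look at the paired differences. The sketch below illustrates the idea on synthetic data only: `strikes`, `takedowns`, and `y_demo` are randomly generated stand-ins (assumptions), and the "takedown" columns are pure noise by construction, so it says nothing about the real features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic stand-ins (assumption: random data for illustration only).
rng = np.random.default_rng(1)
n = 1000
strikes = rng.normal(size=(n, 2))    # stand-in for the two striking features
takedowns = rng.normal(size=(n, 2))  # stand-in for the two takedown features
y_demo = strikes.sum(axis=1) + rng.normal(scale=3.0, size=n)

# A fixed random_state makes both calls use identical splits,
# so the scores can be compared split by split.
cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=1)
base = cross_val_score(LinearRegression(), strikes, y_demo, cv=cv)
full = cross_val_score(LinearRegression(), np.hstack([strikes, takedowns]), y_demo, cv=cv)

diff = full - base  # paired per-split differences in r-squared
print(f"mean difference: {diff.mean():.4f}, std across splits: {diff.std(ddof=1):.4f}")
```

If the mean difference is small relative to its spread across splits, the two feature sets are hard to tell apart on this metric.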

#### Next steps
We will add differentials for both takedowns and standing strikes. A differential measures how much more or less often a fighter attempts a technique than their opponent.
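
A minimal sketch of how such differentials might be computed is shown below, assuming a differential is simply fighter 0's career average minus fighter 1's. The input column names match this notebook's data; the `_diff` column names and the sample values are hypothetical.

```python
import pandas as pd

# Small illustrative frame using this notebook's feature column names
# (values are made up, not real fight data).
demo = pd.DataFrame({
    "ca_s_ss_a_p15m_0": [120.0, 80.0],
    "ca_s_ss_a_p15m_1": [100.0, 95.0],
    "ca_td_a_p15m_0": [2.0, 5.0],
    "ca_td_a_p15m_1": [4.0, 1.0],
})

# Differential = fighter 0's rate minus fighter 1's rate (assumed definition).
demo["ca_s_ss_a_p15m_diff"] = demo["ca_s_ss_a_p15m_0"] - demo["ca_s_ss_a_p15m_1"]
demo["ca_td_a_p15m_diff"] = demo["ca_td_a_p15m_0"] - demo["ca_td_a_p15m_1"]
print(demo[["ca_s_ss_a_p15m_diff", "ca_td_a_p15m_diff"]])
```

Under this definition each differential is a linear combination of two existing columns, so if the original columns are kept alongside it, a plain linear regression's predictions would not change; the differentials matter more when they replace the raw columns or feed a non-linear model.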