# Purpose
Create the first simple model.

### Target
The target variable for all current model iterations is the Combined Average Distance Significant Strike Attempts per Minute of a match: the sum of both fighters' distance significant strike attempts per minute. Distance strikes exclude clinch and ground strikes.
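
As a rough illustration of how such a combined target is formed, the sketch below sums two per-fighter rates for each bout. The column names (`rate_0`, `rate_1`, `combined_rate`) and the values are hypothetical and are not taken from the project's data.

```python
import pandas as pd

# Hypothetical per-fighter rates for two bouts (illustrative values only).
bouts = pd.DataFrame({
    "bout_id": ["a", "b"],
    "rate_0": [6.0, 9.5],  # fighter 0: distance significant strike attempts rate
    "rate_1": [4.0, 7.0],  # fighter 1: distance significant strike attempts rate
})

# The combined target is the sum of the two fighters' rates.
bouts["combined_rate"] = bouts["rate_0"] + bouts["rate_1"]
print(bouts)
```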

#### First Simple Model
- Model: Linear Regressor
- Features: Career Average Significant Strike Attempts per Minute
- Preprocessing: Standard Scaler

In [7]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler

#### Import and Split

In [8]:
data = pd.read_csv('../../data/modelling_data/model_1_data.csv')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3971 entries, 0 to 3970
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   date_0            3971 non-null   object 
 1   bout_id           3971 non-null   object 
 2   fighter_id_0      3971 non-null   object 
 3   ca_s_ss_a_p15m_0  3971 non-null   float64
 4   fighter_id_1      3971 non-null   object 
 5   ca_s_ss_a_p15m_1  3971 non-null   float64
 6   c_s_ss_a_p15m     3971 non-null   float64
dtypes: float64(3), object(4)
memory usage: 217.3+ KB


In [9]:
X = data.loc[:,['ca_s_ss_a_p15m_0', 'ca_s_ss_a_p15m_1']]
y = data.c_s_ss_a_p15m

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

#### Preprocessing

In [11]:
ss = StandardScaler()
X_train_ss = ss.fit_transform(X_train)
X_test_ss = ss.transform(X_test)
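
The fit-on-train, transform-on-test pattern used above can be checked on synthetic data. The sketch below is only an illustration, assuming randomly generated features (`X_train_demo`, `X_test_demo` are made-up names): the training columns end up with mean 0 and standard deviation 1, while the test columns are only approximately standardized because they reuse the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the two feature columns (not the project's data).
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 2))
X_test_demo = rng.normal(loc=50.0, scale=10.0, size=(50, 2))

scaler = StandardScaler()
# Learn the mean and standard deviation from the training split only.
X_train_scaled = scaler.fit_transform(X_train_demo)
# Reuse the training statistics on the test split to avoid leakage.
X_test_scaled = scaler.transform(X_test_demo)

print(X_train_scaled.mean(axis=0))  # approximately 0 per column
print(X_train_scaled.std(axis=0))   # approximately 1 per column
print(X_test_scaled.mean(axis=0))   # close to 0, but generally not exactly 0
```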

#### Cross-Validation
Performance across model iterations is evaluated using mean Poisson deviance, the standard metric for Poisson regression, as well as R-squared, so that Poisson and non-Poisson models can be compared easily. Only R-squared is computed for this linear model.
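
For reference, both metrics are available in `sklearn.metrics`. The sketch below computes them on a small set of made-up predictions (the arrays are illustrative assumptions, not results from this notebook). Note that `mean_poisson_deviance` requires strictly positive predictions, which a linear model does not guarantee.

```python
import numpy as np
from sklearn.metrics import mean_poisson_deviance, r2_score

# Made-up targets and predictions, for illustration only.
y_true = np.array([20.0, 35.0, 50.0, 65.0])
y_pred = np.array([25.0, 30.0, 55.0, 60.0])

# R-squared: share of the target's variance explained (higher is better).
r2 = r2_score(y_true, y_pred)
# Mean Poisson deviance: lower is better, 0 for a perfect prediction.
dev = mean_poisson_deviance(y_true, y_pred)

print(f"R-squared: {r2:.4f}")
print(f"Mean Poisson deviance: {dev:.4f}")
```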

##### R-squared cross-val scores

In [12]:
lr = LinearRegression()
cross_val_score(lr, X_train_ss, y_train)

array([0.10260482, 0.14386408, 0.13496557, 0.08444902, 0.12156636])

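The five fold scores range from about 0.08 to 0.14, with a mean of about 0.12. The small spread across folds suggests relatively low variance, while the uniformly low scores suggest very high bias (underfitting).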
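
A quick way to summarize the fold scores is to take their mean and spread. The sketch below copies the five values from the output above into an array; the variable name `fold_scores` is just for illustration.

```python
import numpy as np

# The five R-squared fold scores reported by cross_val_score above.
fold_scores = np.array(
    [0.10260482, 0.14386408, 0.13496557, 0.08444902, 0.12156636]
)

print(f"mean: {fold_scores.mean():.3f}")
print(f"std:  {fold_scores.std():.3f}")
print(f"min:  {fold_scores.min():.3f}, max: {fold_scores.max():.3f}")
```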

## Evaluation on Test Set

In [13]:
lr.fit(X_train_ss, y_train)

LinearRegression()

In [14]:
lr.score(X_test_ss, y_test)

0.15760826543630468

## Results

This model currently has a very high level of bias: its test-set R-squared of 0.158 means the two features explain only about 16% of the variance in the target.
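
To put that score in context, an R-squared of 0 corresponds to always predicting the mean of the target. The sketch below shows this reference point with scikit-learn's `DummyRegressor` on made-up values (`X_demo` and `y_demo` are illustrative, not the project's data).

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score

# Made-up target values; the dummy model ignores the features entirely.
y_demo = np.array([10.0, 20.0, 30.0, 40.0])
X_demo = np.zeros((4, 1))

# A baseline that always predicts the mean of the targets it was fitted on.
baseline = DummyRegressor(strategy="mean").fit(X_demo, y_demo)
baseline_r2 = r2_score(y_demo, baseline.predict(X_demo))

print(f"Baseline R-squared: {baseline_r2:.3f}")
```

Any useful model should score above this baseline, so 0.158 is an improvement over predicting the mean, though a modest one.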

#### Next steps
We will add takedown attempts per 15 minutes as a feature, because fighters who want to avoid a striking match often rely on takedowns.