# Win Prediction Using Four-Factors
## Introduction
In this notebook, I will try to predict the number of wins for each team in the college basketball dataset using Dean Oliver's "Four Factors of Basketball Success". What are the four factors? In simple terms, they are the four most important strategies for winning a basketball game, as analyzed by Dean Oliver. The strategies are:
1. Score Every Possession
2. Pick Up All Rebounds
3. Get to the Foul Line
4. Protect the Basketball

For an explanation of each strategy, I recommend reading one of the many sports blogs that cover them (such as this [blog](https://squared2020.com/2017/09/05/introduction-to-olivers-four-factors/), where I got these four strategies). These strategies can be represented by four stats:
1. Effective field goals 
2. Turnovers percentage 
3. Rebounding percentage 
4. Free Throws rate 

Note that we must consider both offense and defense: scoring many points is not enough to win a game if we do not also limit the opponent's scoring. Thus, we consider eight factors in total:
* Offensive Factors
    * Effective Field Goal Percentage
    * Turnover Percentage
    * Offensive Rebound Percentage
    * Free Throw Rate

* Defensive Factors
    * Opponent’s Effective Field Goal Percentage
    * Opponent’s Turnover Percentage
    * Defensive Rebound Percentage
    * Opponent’s Free Throw Rate

(Again, thanks to the author of this [blog](https://squared2020.com/2017/09/05/introduction-to-olivers-four-factors/) for the clear explanation.)
Luckily for us, the college basketball dataset provides all of the stats above.
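
For reference, the four stats can be computed from box score totals. The sketch below uses commonly cited definitions. The function names are hypothetical, the 0.44 free-throw weight is a convention that varies by source, and free throw rate is sometimes defined as FT/FGA instead of FTA/FGA, so the dataset's own columns may follow slightly different formulas.

```python
def effective_fg_pct(fg, fg3, fga):
    """Effective field goal %: a made three counts as 1.5 made shots."""
    return (fg + 0.5 * fg3) / fga


def turnover_pct(tov, fga, fta, ft_weight=0.44):
    """Turnovers per estimated possession (ft_weight is an assumed convention)."""
    return tov / (fga + ft_weight * fta + tov)


def offensive_rebound_pct(orb, opp_drb):
    """Share of available offensive rebounds that the team collected."""
    return orb / (orb + opp_drb)


def free_throw_rate(fta, fga):
    """Free throw attempts per field goal attempt (one common definition)."""
    return fta / fga


# Hypothetical single-game totals, for illustration only
print(effective_fg_pct(fg=25, fg3=8, fga=60))
print(turnover_pct(tov=12, fga=60, fta=20))
print(offensive_rebound_pct(orb=10, opp_drb=25))
print(free_throw_rate(fta=20, fga=60))
```

These functions return fractions; multiply by 100 to express them as percentages.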

## Brief Exploratory Data Analysis
First we import some libraries that will be useful for EDA

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

Then we load the dataset into a dataframe. We only use the combined dataset (cbb.csv), which contains all data from 2015-2019.

In [None]:
df = pd.read_csv("../input/college-basketball-dataset/cbb.csv")

Let's see the info of this dataset

In [None]:
df.info()

One might expect the four (eight) factors analyzed by Oliver to be the features most correlated with the target, which is the number of wins ('W' in our dataset). We can test this hypothesis by looking at the correlation of each feature with the number of wins.

In [None]:
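# restrict to numeric columns so corr() does not fail on text columns such as team names
df.select_dtypes(include='number').corr()['W'].sort_values()[:-1]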

Our eight factors are 'EFG_O', 'EFG_D', 'TOR', 'TORD', 'ORB', 'DRB', 'FTR', and 'FTRD'. The three features most correlated with wins, however, are 'WAB' (Wins Above Bubble), 'BARTHAG' (the power rating), and 'ADJOE' (the adjusted offensive efficiency). Among the eight factors, 'EFG_O' has the highest positive correlation with the number of wins, and 'EFG_D' has the most negative one.
By the way, Oliver also identified an approximate weight for each factor:
* Shooting (40%)
* Turnovers (25%)
* Rebounding (20%)
* Free Throws (15%)

which means that shooting is the most important factor, followed by turnovers, rebounding, and free throws. The correlation values suggest that this order is roughly right, but the weights in our model will probably differ.

Next let's make a subdataframe that will contain the four factors and the target

In [None]:
df_ff = df[['EFG_O','EFG_D','TOR','TORD','ORB','DRB','FTR','FTRD','W']]

Then let's see the distribution of each feature

In [None]:
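# one density plot per column; the 9 columns fit a 3x3 grid
fig, axes = plt.subplots(ncols=3, nrows=3, figsize=(12, 9))
for col, ax in zip(df_ff.columns, axes.flat):
    # kdeplot replaces the deprecated distplot(hist=False)
    sns.kdeplot(df_ff[col], ax=ax)
plt.tight_layout()
plt.show()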

All of the features appear to be roughly normally distributed. I suspect that 'FTRD' is somewhat skewed, but let it be for now.
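
The skew suspicion can also be checked numerically rather than by eye. Below is a minimal sketch using pandas' `skew()`. The helper name `skew_report` and the synthetic demo frame are assumptions for illustration; in this notebook the call would be `skew_report(df_ff)`.

```python
import numpy as np
import pandas as pd


def skew_report(frame):
    """Return the sample skewness of each numeric column, most skewed first."""
    skew = frame.skew(numeric_only=True)
    order = skew.abs().sort_values(ascending=False).index
    return skew.reindex(order)


# Synthetic stand-in data: one symmetric column and one right-skewed column
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    'symmetric': rng.normal(size=1000),
    'right_skewed': rng.exponential(size=1000),
})
report = skew_report(demo)
print(report)
```

Values near 0 indicate a roughly symmetric distribution, while clearly positive or negative values indicate a right or left tail.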

Let's see the descriptive stats for all of those features

In [None]:
df_ff.describe()

Before we train the model, we also want to see the relationship between the factors and the number of wins. We have already seen it in the correlation values, but I will visualize some of the factors to make the relationship easier to see.

First, we know that EFG_O has the highest correlation among the factors, so let's see its scatterplot against the number of wins

In [None]:
sns.regplot(x='W',y = 'EFG_O',data = df_ff,scatter= True, fit_reg=True)

We see a linear relationship here, which makes good ol' linear regression a reasonable choice. Let's see the next factor, which is the turnover percentage

In [None]:
sns.regplot(x='W',y = 'TOR',data = df_ff,scatter= True, fit_reg=True)

This is a somewhat linear relationship with a negative correlation, which makes sense: if we minimize turnovers, we keep the ball more often, which results in more wins. Next, let's look at the offensive rebound percentage

In [None]:
sns.regplot(x='W',y = 'ORB',data = df_ff,scatter= True, fit_reg=True)

Here the linearity is somewhat questionable for offensive rebounds. What about the free throw rate?

In [None]:
sns.regplot(x='W',y = 'FTR',data = df_ff,scatter= True, fit_reg=True)

As with the offensive rebounds, the linearity is almost missing. That corresponds to the low correlation of both features with the number of wins. Lastly, let's see the plots for the defensive factors

In [None]:
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(nrows=2, ncols=2, figsize=(14,10))
sns.regplot(x='W',y = 'EFG_D',data = df_ff,scatter= True, fit_reg=True,ax=ax1)
sns.regplot(x='W',y = 'TORD',data = df_ff,scatter= True, fit_reg=True,ax=ax2)
sns.regplot(x='W',y = 'DRB',data = df_ff,scatter= True, fit_reg=True,ax=ax3)
sns.regplot(x='W',y = 'FTRD',data = df_ff,scatter= True, fit_reg=True,ax=ax4)

## Train the model

I will use linear regression as our learning model. This is a somewhat questionable choice because, as we saw, some factors are not linearly related to the number of wins. But as this is my first Kaggle notebook, I say why not :) Maybe in the future, as my knowledge of statistics and machine learning grows (I study statistics and programming by myself), I will provide a more accurate model to predict wins using the four factors. Without further ado, let's train our model!

In [None]:
# prepare the dataset
df_ff = df[['EFG_O','EFG_D','TOR','TORD','ORB','DRB','FTR','FTRD']]
df_ff_y = df['W']

We'll import some libraries for training the model and measuring our predictions

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

We'll split the train and test data and create the model

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df_ff, df_ff_y, test_size=0.25, random_state=21)
reg = LinearRegression()
reg = reg.fit(X_train,y_train)

Let's analyze the model first by evaluating the parameters

In [None]:
print('Intercept: ', reg.intercept_)
print('R^2 score: ',reg.score(X_train,y_train))

In [None]:
coeff_df = pd.DataFrame(reg.coef_, df_ff.columns, columns=['Coefficient'])
coeff_df

All right, so our R^2 score (on the training set) is around 86%, which is pretty good, but not **that** good. From the coefficients, we see that the order of the "most important" factors holds, but what about the weights? Do they obey the 40/25/20/15 rule? We can find out by first averaging the absolute coefficient values of the offensive and defensive counterparts (for example, the average absolute coefficient of EFG_O and EFG_D is (0.951573+0.890253)/2=0.92091), then dividing each average by the sum of the four averages.

In [None]:
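# absolute coefficients in column order: EFG_O, EFG_D, TOR, TORD, ORB, DRB, FTR, FTRD
abs_coef = coeff_df['Coefficient'].abs().to_numpy()
# average each offensive coefficient with its defensive counterpart
cof = (abs_coef[0::2] + abs_coef[1::2]) / 2
# normalize so that the four weights sum to 1
print(cof / cof.sum())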

We find that shooting gets around 40.2%, turnovers 38.3%, rebounding 15.7%, and free throws 5.7%. Shooting therefore carries about the same weight as in Oliver's analysis, but turnovers carry more weight, at the expense of both rebounding and free throws. I expected the shooting weight to increase due to the recent trend in three-pointers, but perhaps that is not the case in college basketball. Perhaps the different play style in college basketball also contributes to these differences in weight. Now, let's make some predictions with our model
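
One caveat about these weights: raw regression coefficients depend on the scale of each feature, so a factor measured on a wider scale gets a smaller coefficient for the same effect. A possible alternative, which is not what this notebook does, is to standardize the features before fitting. The sketch below assumes a hypothetical helper `factor_weights` and uses synthetic data. In this notebook it could be called with `X_train`, `y_train`, and pairs such as `{'Shooting': ('EFG_O', 'EFG_D'), 'Turnovers': ('TOR', 'TORD'), 'Rebounding': ('ORB', 'DRB'), 'Free throws': ('FTR', 'FTRD')}`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler


def factor_weights(X, y, pairs):
    """Share of the total |standardized coefficient| for each offense/defense pair."""
    X_std = StandardScaler().fit_transform(X)  # zero mean, unit variance per column
    model = LinearRegression().fit(X_std, y)
    abs_coef = pd.Series(np.abs(model.coef_), index=X.columns)
    pair_avg = pd.Series({name: abs_coef[list(cols)].mean()
                          for name, cols in pairs.items()})
    return pair_avg / pair_avg.sum()


# Synthetic stand-in: factor B_O lives on a 10x wider scale than the others
rng = np.random.default_rng(0)
demo_X = pd.DataFrame(rng.normal(size=(200, 4)),
                      columns=['A_O', 'A_D', 'B_O', 'B_D'])
demo_X['B_O'] *= 10
demo_y = (3 * demo_X['A_O'] - 3 * demo_X['A_D']
          + 0.1 * demo_X['B_O'] - 1 * demo_X['B_D'])
demo_pairs = {'A': ('A_O', 'A_D'), 'B': ('B_O', 'B_D')}

weights = factor_weights(demo_X, demo_y, demo_pairs)
print(weights)
```

In the synthetic data, the raw coefficient of `B_O` is only 0.1, yet after standardization it counts about as much as `B_D`, so factor B ends up with roughly a quarter of the total weight.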

## The Prediction

In [None]:
# predict from the test set
y_pred = reg.predict(X_test)

In [None]:
# analyze the prediction
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Our MAE is 1.94, which means that our model is off by around 2 wins on average (for example, a team with 18 actual wins is predicted to have 20). The RMSE is 2.41, which is higher than the MAE because it penalizes large errors more heavily, such as those from the outliers that we have neglected throughout this notebook. Is this model good? It depends, but I would say it is not that good, as we neglected some linear regression assumptions. Now, let's see some good and bad predictions made by this model
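
As a side illustration of why RMSE sits above MAE when a few predictions miss badly, here is a tiny worked example. The error values are hypothetical and are not taken from our model.

```python
import numpy as np

# Hypothetical absolute errors in wins: three small misses and one large miss
errors = np.array([1.0, 1.0, 1.0, 5.0])

mae = np.mean(np.abs(errors))          # (1 + 1 + 1 + 5) / 4
rmse = np.sqrt(np.mean(errors ** 2))   # sqrt((1 + 1 + 1 + 25) / 4)

print('MAE :', mae)
print('RMSE:', rmse)
```

The single 5-win miss contributes 25 of the 28 squared-error units, which is why the RMSE (about 2.65) ends up above the MAE (2.0).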

In [None]:
dfd = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred, 'AbsDiff': abs(y_test-y_pred)})
dfd.sort_values(by=['AbsDiff'], inplace=True, ascending=True)
dfd[:10]

These are the 10 most accurate predictions from our model. Let's look at some of the teams that we predicted correctly

In [None]:
df[df.index == 21]

In [None]:
df[df.index == 937]

So here we correctly predicted the fate of 2018 Texas Tech, who got 27 wins, and St. Francis, who only got 4 wins. Let's see some of the bad predictions we made

In [None]:
dfd.sort_values(by=['AbsDiff'], inplace=True, ascending=False)
dfd[:10]

In [None]:
df[df.index == 1174]

In [None]:
df[df.index == 1338]

We predicted that 2017 Western Carolina would get around 3 wins, but in reality they got 9 (hooray?). I can't say whether they overachieved or our model is simply bad. The hilarious one is 2015 Grambling St., who failed to win any games and whom we predicted to win... **-6 games**. I can't even imagine what -6 wins means, but it is safe to say that their four factors were **really** bad if our model thinks they can't win a positive number of games.
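
Since a plain linear regression is unbounded, nothing stops it from predicting a negative number of wins. One simple post-processing option, which is not part of the model above, is to clip predictions to a valid range. The helper name `clip_win_predictions` and the sample values below are assumptions for illustration; in this notebook it could be applied to `y_pred`.

```python
import numpy as np


def clip_win_predictions(y_pred, max_games=None):
    """Force predicted wins into [0, max_games]; no upper bound if max_games is None."""
    upper = np.inf if max_games is None else max_games
    return np.clip(np.asarray(y_pred, dtype=float), 0.0, upper)


# Hypothetical predictions, including a negative one
clipped = clip_win_predictions([-6.2, 3.1, 27.4])
print(clipped)

# With an assumed cap on the number of games played
capped = clip_win_predictions([-6.2, 3.1, 27.4], max_games=25)
print(capped)
```

Clipping only hides the symptom. It does not address the underlying fit, so the error for such teams stays large.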

## Conclusion
We have made some predictions using linear regression, which in my opinion gives a fairly accurate, but not great, model because we neglected some linear regression assumptions. In the future, maybe I can improve the model by pre-processing the data to meet those assumptions, by using PCA to address dimensionality problems, or even by using a better machine learning method. All of the code, minus some visualizations, can be found on my GitHub: https://github.com/thomasoca. I am still learning how to become a good data scientist, and I hope my first notebook does not disappoint you. Comments, advice, or anything else are welcome.
