# Linear Regression and Decision Tree Regressor using Soccer Dataset:

We will be using the open European Soccer dataset from <a href="https://www.kaggle.com">Kaggle</a>.

This <a href="https://www.kaggle.com/hugomathien/soccer">European Soccer Database</a> has more than 25,000 matches and more than 10,000 players for European professional soccer seasons from 2008 to 2016. 

### Import Libraries

In [1]:
import sqlite3
import pandas as pd 
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from math import sqrt

### Read Data from the Database into pandas

In [2]:
# Create your connection.
cnx = sqlite3.connect('database.sqlite')
df = pd.read_sql_query("SELECT * FROM Player_Attributes", cnx)
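
As a minimal sketch of the same loading step, the example below builds a small in-memory SQLite database (a hypothetical stand-in for `database.sqlite`, with made-up rows), lists its tables, and reads only selected columns. The table contents and the names `tables` and `sample_df` are assumptions for illustration only.

```python
import sqlite3
import pandas as pd

# Hypothetical in-memory database standing in for database.sqlite.
cnx = sqlite3.connect(":memory:")
cnx.execute(
    "CREATE TABLE Player_Attributes (id INTEGER, overall_rating REAL, potential REAL)"
)
cnx.executemany(
    "INSERT INTO Player_Attributes VALUES (?, ?, ?)",
    [(1, 67.0, 71.0), (2, 62.0, 66.0), (3, None, 65.0)],
)
cnx.commit()

# List the tables available in the database.
tables = pd.read_sql_query(
    "SELECT name FROM sqlite_master WHERE type='table'", cnx
)

# Read only the columns of interest instead of SELECT *.
sample_df = pd.read_sql_query(
    "SELECT id, overall_rating, potential FROM Player_Attributes", cnx
)

# Close the connection once the data is in pandas.
cnx.close()

print(tables["name"].tolist())
print(sample_df.shape)
```

Querying `sqlite_master` is a convenient way to discover which tables a SQLite file contains before deciding what to load.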

In [3]:
df.head()

| | id | player_fifa_api_id | player_api_id | date | overall_rating | potential | preferred_foot | attacking_work_rate | defensive_work_rate | crossing | ... | vision | penalties | marking | standing_tackle | sliding_tackle | gk_diving | gk_handling | gk_kicking | gk_positioning | gk_reflexes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 218353 | 505942 | 2016-02-18 00:00:00 | 67.0 | 71.0 | right | medium | medium | 49.0 | ... | 54.0 | 48.0 | 65.0 | 69.0 | 69.0 | 6.0 | 11.0 | 10.0 | 8.0 | 8.0 |
| 1 | 2 | 218353 | 505942 | 2015-11-19 00:00:00 | 67.0 | 71.0 | right | medium | medium | 49.0 | ... | 54.0 | 48.0 | 65.0 | 69.0 | 69.0 | 6.0 | 11.0 | 10.0 | 8.0 | 8.0 |
| 2 | 3 | 218353 | 505942 | 2015-09-21 00:00:00 | 62.0 | 66.0 | right | medium | medium | 49.0 | ... | 54.0 | 48.0 | 65.0 | 66.0 | 69.0 | 6.0 | 11.0 | 10.0 | 8.0 | 8.0 |
| 3 | 4 | 218353 | 505942 | 2015-03-20 00:00:00 | 61.0 | 65.0 | right | medium | medium | 48.0 | ... | 53.0 | 47.0 | 62.0 | 63.0 | 66.0 | 5.0 | 10.0 | 9.0 | 7.0 | 7.0 |
| 4 | 5 | 218353 | 505942 | 2007-02-22 00:00:00 | 61.0 | 65.0 | right | medium | medium | 48.0 | ... | 53.0 | 47.0 | 62.0 | 63.0 | 66.0 | 5.0 | 10.0 | 9.0 | 7.0 | 7.0 |


In [4]:
df.shape

(183978, 42)

In [5]:
df.columns

Index(['id', 'player_fifa_api_id', 'player_api_id', 'date', 'overall_rating',
       'potential', 'preferred_foot', 'attacking_work_rate',
       'defensive_work_rate', 'crossing', 'finishing', 'heading_accuracy',
       'short_passing', 'volleys', 'dribbling', 'curve', 'free_kick_accuracy',
       'long_passing', 'ball_control', 'acceleration', 'sprint_speed',
       'agility', 'reactions', 'balance', 'shot_power', 'jumping', 'stamina',
       'strength', 'long_shots', 'aggression', 'interceptions', 'positioning',
       'vision', 'penalties', 'marking', 'standing_tackle', 'sliding_tackle',
       'gk_diving', 'gk_handling', 'gk_kicking', 'gk_positioning',
       'gk_reflexes'],
      dtype='object')

### Declare the Columns You Want to Use as Features

In [6]:
features = [
       'potential', 'crossing', 'finishing', 'heading_accuracy',
       'short_passing', 'volleys', 'dribbling', 'curve', 'free_kick_accuracy',
       'long_passing', 'ball_control', 'acceleration', 'sprint_speed',
       'agility', 'reactions', 'balance', 'shot_power', 'jumping', 'stamina',
       'strength', 'long_shots', 'aggression', 'interceptions', 'positioning',
       'vision', 'penalties', 'marking', 'standing_tackle', 'sliding_tackle',
       'gk_diving', 'gk_handling', 'gk_kicking', 'gk_positioning',
       'gk_reflexes']

### Specify the Prediction Target

In [7]:
target = ['overall_rating']

### Clean the Data

In [8]:
df = df.dropna()

In [9]:
df.shape

(180354, 42)
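
Comparing the two shapes, `dropna()` removed 183978 - 180354 = 3624 rows (about 2% of the data). The sketch below shows the same cleaning step on a small made-up frame; the values and the names `toy`, `missing_per_column`, and `cleaned` are illustrative assumptions, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Made-up frame with a few missing values.
toy = pd.DataFrame({
    "overall_rating": [67.0, np.nan, 61.0, 70.0],
    "potential": [71.0, 66.0, np.nan, 75.0],
    "crossing": [49.0, 49.0, 48.0, 55.0],
})

# Count missing values per column before dropping anything.
missing_per_column = toy.isnull().sum()

# dropna() removes every row that has at least one missing value.
cleaned = toy.dropna()
rows_dropped = len(toy) - len(cleaned)

print(missing_per_column)
print(rows_dropped)
```

Checking `isnull().sum()` first makes it easier to judge whether dropping rows is acceptable or whether a single sparse column is responsible for most of the loss.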

### Extract Features and Target ('overall_rating') Values into Separate Dataframes

In [10]:
X = df[features]

In [11]:
y = df[target]

Let us look at a typical row from our features: 

In [12]:
X.iloc[2]

potential             66.0
crossing              49.0
finishing             44.0
heading_accuracy      71.0
short_passing         61.0
volleys               44.0
dribbling             51.0
curve                 45.0
free_kick_accuracy    39.0
long_passing          64.0
ball_control          49.0
acceleration          60.0
sprint_speed          64.0
agility               59.0
reactions             47.0
balance               65.0
shot_power            55.0
jumping               58.0
stamina               54.0
strength              76.0
long_shots            35.0
aggression            63.0
interceptions         41.0
positioning           45.0
vision                54.0
penalties             48.0
marking               65.0
standing_tackle       66.0
sliding_tackle        69.0
gk_diving              6.0
gk_handling           11.0
gk_kicking            10.0
gk_positioning         8.0
gk_reflexes            8.0
Name: 2, dtype: float64

Let us also display our target values: 

In [13]:
y.head()

| | overall_rating |
|---|---|
| 0 | 67.0 |
| 1 | 67.0 |
| 2 | 62.0 |
| 3 | 61.0 |
| 4 | 61.0 |
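
Because `target` is a list, `df[target]` returns a one-column DataFrame rather than a Series, which is why the target prints as a table here. The following sketch, using a made-up frame and the illustrative names `y_frame` and `y_series`, shows the difference in shape between the two ways of selecting the column.

```python
import pandas as pd

# Made-up frame for illustration.
toy = pd.DataFrame({
    "overall_rating": [67.0, 62.0, 61.0],
    "potential": [71.0, 66.0, 65.0],
})

# Selecting with a list of names returns a DataFrame of shape (n, 1).
y_frame = toy[["overall_rating"]]

# Selecting with a single name returns a Series of shape (n,).
y_series = toy["overall_rating"]

print(type(y_frame).__name__, y_frame.shape)
print(type(y_series).__name__, y_series.shape)
```

Either form can be passed to scikit-learn regressors; the shape of the target can affect the shape of the predictions a fitted model returns.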


## Split the Dataset into Training and Test Datasets

### sklearn.model_selection.train_test_split:
Split arrays or matrices into random train and test subsets.

**Parameters**:
* *test_size*(float, int, or None (default is None)): If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is automatically set to the complement of the train size. If train size is also None, test size is set to 0.25.
* *train_size*(float, int, or None (default is None)): If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.
* *random_state*(int or RandomState): Pseudo-random number generator state used for random sampling.

**Returns**: list containing train-test split of inputs.

Link: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=324)
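
To make the effect of `test_size` and `random_state` concrete, here is a small sketch on made-up arrays (`X_toy`, `y_toy` are illustrative names). With a float `test_size`, scikit-learn rounds the number of test samples up, which is consistent with the 59517 test rows reported later in this notebook for 180354 cleaned rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples with 2 features each.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

# Hold out 33% of the samples for testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=324
)

# Repeating the call with the same random_state reproduces the same split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=324
)

print(len(X_tr), len(X_te))
print(np.array_equal(X_te, X_te2))
```

Fixing `random_state` keeps the split reproducible, so the two models trained below are evaluated on exactly the same test rows.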

# 1. Linear Regression:

### sklearn.linear_model.LinearRegression:
Ordinary least squares Linear Regression. *LinearRegression(fit_intercept=True, normalize=False, copy_X=True, n_jobs=1)*

**Parameters**:
* *fit_intercept*(boolean, optional): Whether to calculate the intercept for this model. If set to false, no intercept will be used in calculations (e.g. data is expected to be already centered).
* *normalize*(boolean, optional, default False): If True, the regressors X will be normalized before regression. This parameter is ignored when fit_intercept is set to False. Normalizing the regressors makes the learnt hyperparameters more robust and almost independent of the number of samples; the same property does not hold for standardized data. If you wish to standardize instead, use preprocessing.StandardScaler before calling fit on an estimator with normalize=False. Note that this parameter was deprecated in scikit-learn 1.0 and removed in 1.2, so on recent versions any scaling has to be done as a separate preprocessing step.
* *copy_X*(boolean, optional, default True): If True, X will be copied; else, it may be overwritten.

**Methods**:
* ***fit(X, y[, sample_weight])***: Fit linear model. X (training data) is a NumPy array or sparse matrix of shape [n_samples, n_features]. y (target values) is a NumPy array of shape [n_samples, n_targets]. Returns an instance of self.
* *get_params([deep])*: Get parameters for this estimator. Returns parameter names mapped to their values.
* ***predict(X)***: Predict using the linear model. Returns predicted values.
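
The `fit` and `predict` methods described above can be sketched on made-up, noiseless data where the true coefficients are known, so that the fitted `coef_` and `intercept_` can be checked directly. The data and the names `X_toy`, `y_toy`, and `model` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data generated from y = 3 + 0.5 * x0 + 0.25 * x1 (no noise).
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 100, size=(50, 2))
y_toy = 3.0 + 0.5 * X_toy[:, 0] + 0.25 * X_toy[:, 1]

# Fit ordinary least squares; fit() returns the estimator itself.
model = LinearRegression()
model.fit(X_toy, y_toy)

# The learned parameters should recover the generating coefficients.
print(model.coef_)
print(model.intercept_)

# Predict for a new sample: 3 + 0.5 * 10 + 0.25 * 20 = 13.
pred = model.predict(np.array([[10.0, 20.0]]))
print(pred)
```

On real data such as the player attributes, the coefficients will not be recovered exactly, but inspecting `coef_` in the same way shows how much each feature contributes to the predicted rating.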

In [15]:
regressor = LinearRegression()
regressor.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

### Perform Prediction using Linear Regression Model

In [16]:
y_prediction = regressor.predict(X_test)
y_prediction

array([[ 66.51284879],
       [ 79.77234615],
       [ 66.57371825],
       ..., 
       [ 69.23780133],
       [ 64.58351696],
       [ 73.6881185 ]])

### Finding the mean of the expected target value in the test set

In [17]:
y_test.describe()

| | overall_rating |
|---|---|
| count | 59517.0 |
| mean | 68.635818 |
| std | 7.041297 |
| min | 33.0 |
| 25% | 64.0 |
| 50% | 69.0 |
| 75% | 73.0 |
| max | 94.0 |


## Evaluate Linear Regression Accuracy using Root Mean Square Error

### sklearn.metrics.mean_squared_error:
Mean squared error regression loss. *mean_squared_error(y_true, y_pred, sample_weight=None, multioutput='uniform_average')*

**Parameters**:
* ***y_true***(array-like of shape = (n_samples) or (n_samples, n_outputs)): Ground truth (correct) target values.
* ***y_pred***(array-like of shape = (n_samples) or (n_samples, n_outputs)): Estimated target values.

**Returns**: 
* ***loss***(float or ndarray of floats): A non-negative floating point value (the best value is 0.0), or an array of floating point values, one for each individual target.
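
As a small worked example, the root mean square error is the square root of the average squared difference between true and predicted values. The sketch below computes it by hand and with `mean_squared_error` on four made-up ratings; the arrays `y_true_toy` and `y_pred_toy` are illustrative, not taken from the dataset.

```python
from math import sqrt

import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true and predicted ratings.
y_true_toy = np.array([67.0, 62.0, 61.0, 70.0])
y_pred_toy = np.array([66.0, 64.0, 61.0, 73.0])

# Errors are 1, -2, 0, -3, so the squared errors are 1, 4, 0, 9.
mse_manual = np.mean((y_true_toy - y_pred_toy) ** 2)

# The library call should give the same mean squared error (14 / 4 = 3.5).
mse_sklearn = mean_squared_error(y_true_toy, y_pred_toy)

# RMSE is in the same units as the target (rating points).
rmse = sqrt(mse_sklearn)

print(mse_manual, mse_sklearn)
print(round(rmse, 4))
```

Because RMSE is expressed in the units of the target, it can be read directly against the spread of `overall_rating` shown by `y_test.describe()`.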


In [18]:
RMSE = sqrt(mean_squared_error(y_true = y_test, y_pred = y_prediction))

In [19]:
print(RMSE)

2.805303046855208


# 2. Decision Tree Regressor:

### sklearn.tree.DecisionTreeRegressor:
A decision tree regressor. *DecisionTreeRegressor(criterion='mse', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_split=1e-07, presort=False)*

**Parameters**:
* *criterion*(string, optional (default="mse")): The function to measure the quality of a split. Supported criteria are "mse" for the mean squared error, which is equal to variance reduction as feature selection criterion, and "mae" for the mean absolute error. (From scikit-learn 1.0 these are named "squared_error" and "absolute_error"; the old names were removed in 1.2.)
* ***max_depth***(int or None, optional (default=None)): The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.
* *min_samples_split*(int, float, optional (default=2)): The minimum number of samples required to split an internal node. If int, then min_samples_split is the minimum number. If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) is the minimum number of samples for each split.
* *min_impurity_split*(float, optional (default=1e-7)): Threshold for early stopping in tree growth. If the impurity of a node is below the threshold, the node is a leaf. (This parameter was deprecated in favor of min_impurity_decrease and removed in scikit-learn 1.0.)

**Attributes**:
* *feature_importances_*(array of shape = [n_features]): The feature importances. The higher the value, the more important the feature. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance.
* *n_features_*(int): The number of features when fit is performed.
* *n_outputs_*(int): The number of outputs when fit is performed.
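
To illustrate `feature_importances_`, the sketch below fits a tree on made-up data where the target depends only on the first of three features, so that feature should receive nearly all of the importance. The data and names (`X_toy`, `y_toy`, `tree`) are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up data: the target depends only on the first feature.
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 100, size=(200, 3))
y_toy = 2.0 * X_toy[:, 0]

tree = DecisionTreeRegressor(max_depth=5, random_state=0)
tree.fit(X_toy, y_toy)

# Importances are normalized, so they sum to 1 across features.
importances = tree.feature_importances_
print(importances)
print(importances.sum())

# Number of outputs seen during fit (one target column here).
print(tree.n_outputs_)
```

Applied to the regressor trained in this notebook, pairing `feature_importances_` with the `features` list would show which player attributes drive the predicted rating most.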

**Methods**:
* ***fit(X, y[, sample_weight, check_input, ...])***: Build a decision tree regressor from the training set (X, y). X is an array-like or sparse matrix with shape = [n_samples, n_features]. y is an array-like with shape = [n_samples] or [n_samples, n_outputs]. Returns self.
* *fit_transform(X, y=None, \*\*fit_params)*: Fit to data, then transform it. Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X. X is the training set, a NumPy array of shape [n_samples, n_features]. y is the target values, a NumPy array of shape [n_samples]. Returns the transformed array.
* ***predict(X[, check_input])***: Predict class or regression value for X. Returns y (the predicted classes, or the predicted values), which is an array of shape = [n_samples] or [n_samples, n_outputs].

In [20]:
regressor = DecisionTreeRegressor(max_depth=20)
regressor.fit(X_train, y_train)

DecisionTreeRegressor(criterion='mse', max_depth=20, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
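
A deep tree such as `max_depth=20` can fit the training data very closely, so it can be worth comparing training and test error at several depths. The following is a minimal sketch on made-up noisy data, not the soccer dataset; the depths tried and the names `results`, `X_toy`, and `y_toy` are illustrative assumptions.

```python
from math import sqrt

import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Made-up data: a noisy sine curve with a single feature.
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(500, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(0, 0.3, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=324
)

# Record (train RMSE, test RMSE) for each depth.
results = {}
for depth in (2, 5, 20):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    train_rmse = sqrt(mean_squared_error(y_tr, tree.predict(X_tr)))
    test_rmse = sqrt(mean_squared_error(y_te, tree.predict(X_te)))
    results[depth] = (train_rmse, test_rmse)
    print(depth, round(train_rmse, 3), round(test_rmse, 3))
```

A large gap between training and test RMSE at high depth suggests the tree is memorizing noise; whether that happens on the player data would need to be checked with the same comparison on `X_train` and `X_test`.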

### Perform Prediction using Decision Tree Regressor model:

In [21]:
y_prediction = regressor.predict(X_test)
y_prediction

array([ 62.        ,  84.        ,  62.38666667, ...,  69.        ,
        62.        ,  72.        ])

### For comparison: What is the mean of the expected target value in the test set?

In [22]:
y_test.describe()

| | overall_rating |
|---|---|
| count | 59517.0 |
| mean | 68.635818 |
| std | 7.041297 |
| min | 33.0 |
| 25% | 64.0 |
| 50% | 69.0 |
| 75% | 73.0 |
| max | 94.0 |


### Evaluate Decision Tree Regression Accuracy using Root Mean Square Error

In [23]:
RMSE = sqrt(mean_squared_error(y_true = y_test, y_pred = y_prediction))

In [24]:
print(RMSE)

1.4593703951502015
