<a href="https://colab.research.google.com/github/potdarjs/Python-Codes/blob/master/Predicting_players_rating.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PROJECT 2 - PREDICTING PLAYER RATING

## Objective: predict the overall rating of a soccer player

### About Dataset

The dataset is the European Soccer Database. It contains statistics on about 25,000 football matches from the top football leagues of 11 European countries. It covers the 2008 to 2016 seasons and includes match statistics (e.g. scores, corners, fouls) as well as team formations, with player names and a pair of coordinates indicating each player's position on the pitch.

- 25,000+ matches
- 10,000+ players
- 11 European countries with their lead championship
- Seasons 2008 to 2016
- Players' and teams' attributes sourced from EA Sports' FIFA video game series, including the weekly updates
- Team line-ups with squad formation (X, Y coordinates)
- Betting odds from up to 10 providers
- Detailed match events (goal types, possession, corners, crosses, fouls, cards, etc.) for 10,000+ matches

The dataset also has a set of about 35 statistics for each player, derived from EA Sports' FIFA video games. These include not only the stats that ship with each new version of the game but also the weekly updates.

### Import Libraries

In [0]:
# Core Libraries - Data manipulation and analysis
import pandas as pd
import numpy as np
import math
from math import sqrt
import matplotlib.pyplot as plt
import seaborn as sns

# Core Libraries - loading data from sqlite database
import sqlite3
 
# Core Libraries - Machine Learning
import sklearn

# Importing train_test_split, cross_val_score, GridSearchCV, KFold, RandomizedSearchCV - Validation and Optimization
from sklearn.model_selection import ShuffleSplit, train_test_split,cross_val_score,GridSearchCV,KFold, RandomizedSearchCV

# Importing Regressors - Modelling
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, AdaBoostRegressor

# Importing Regression Metrics - Performance Evaluation
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

import pickle

# Warnings Library - Ignore warnings
import warnings
warnings.filterwarnings('ignore')

In [0]:
%matplotlib inline

In [0]:
# Create your connection
cnx = sqlite3.connect('database.sqlite')

In [4]:
# Loading the dataframe with the data from the Player_Attributes Table
player_attrib = pd.read_sql_query("SELECT * FROM Player_Attributes", cnx)

DatabaseError: ignored
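
A `DatabaseError` at this step can occur when `database.sqlite` is not in the working directory: `sqlite3.connect` creates a new, empty database file if none exists, so the `Player_Attributes` table is simply missing. The sketch below lists the tables before querying. The helper name `list_tables` is illustrative, and an in-memory database stands in for the real file.

```python
import sqlite3

def list_tables(conn):
    """Return the names of all tables in the connected SQLite database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]

# In-memory stand-in for database.sqlite (assumption for illustration)
demo_cnx = sqlite3.connect(":memory:")
empty_tables = list_tables(demo_cnx)  # a freshly created database has no tables

demo_cnx.execute("CREATE TABLE Player_Attributes (id INTEGER, overall_rating REAL)")
tables = list_tables(demo_cnx)
print(empty_tables, tables)
```

If the list comes back empty for the real connection, the file path is the first thing to check.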

# Understand Dataset and Data

__Basic information about the dataframe: its columns, shape, first and last 5 rows, column types, and null (and non-null) counts__

In [0]:
print(player_attrib.info())

In [0]:
player_attrib.head()

In [0]:
player_attrib.tail()

___There are null values in the dataset which need to be removed or imputed___

In [0]:
player_attrib.dtypes.value_counts()  # get_dtype_counts() was removed in pandas 1.0

# Data Cleaning

__Find rows containing null values or zeros and then either impute or remove them__

___Checking for columns containing null values___

In [0]:
player_attrib.isna().any() # True for every column that has a null in at least one row

***All columns in the dataframe have null values except the id, player_fifa_api_id, player_api_id, date columns***

In [0]:
# Checking number of null values in each column  
null_info_df = pd.DataFrame(player_attrib.isna().sum())  # Identifying the number of nulls in each column
#player_attrib.isnull().sum()
null_info_df.columns = ["total_null_values"]
null_info_df

In [0]:
# Checking percentage of null values in each column  
null_info_df["null_percentage"] = (player_attrib.isna().sum()/player_attrib.shape[0])*100
null_info_df

___Since fewer than 1.5% of the values in any column are null, dropping the affected rows should have little bearing on the regression model. It is also better not to impute, because we have insufficient information about the data.___
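
Per-column null percentages can understate how many rows `dropna` removes, because nulls in different columns may fall on different rows. A minimal sketch on a made-up frame (the column names `a` and `b` are placeholders) that compares the two measures:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [1.0, 2.0, np.nan, 4.0],
})

# Share of nulls per column (what null_percentage reports)
col_null_pct = demo.isna().mean() * 100

# Share of rows containing at least one null (what dropna() actually removes)
row_loss_pct = demo.isna().any(axis=1).mean() * 100
print(col_null_pct.to_dict(), row_loss_pct)
```

Here each column is 25% null, yet half of the rows would be dropped, so it is worth checking the row-level figure as well.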

In [0]:
# Dropping rows containing null values in the dataframe
player_attrib.dropna(axis = 0, inplace = True)

In [0]:
player_attrib.shape

_3624 rows containing one or more null values removed_

In [0]:
# Cross check if the rows containing null values are removed
player_attrib.isna().sum() 

___Checking whether any rows consist entirely of zeros___

In [0]:
player_attrib.loc[(player_attrib==0).all(axis=1)].shape  # .all(axis=1) matches only rows where every value is 0

___No rows consisting entirely of zeros were found___
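
The zero check uses `.all(axis=1)`, which only matches rows where every value is zero. A small sketch on toy data (column names are placeholders) showing how this differs from `.any(axis=1)` and from a per-column zero count:

```python
import pandas as pd

demo = pd.DataFrame({"x": [0, 1, 0], "y": [0, 5, 3]})

all_zero_rows = int((demo == 0).all(axis=1).sum())  # rows where every value is 0
any_zero_rows = int((demo == 0).any(axis=1).sum())  # rows with at least one 0
zeros_per_column = (demo == 0).sum()                # zero count per column
print(all_zero_rows, any_zero_rows, zeros_per_column.to_dict())
```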

In [0]:
player_attrib.head()

In [0]:
# Moving overall_rating column to the end of the dataframe
cols = list(player_attrib.columns.values) 
cols.pop(cols.index('overall_rating'))  
player_attrib = player_attrib[cols+['overall_rating']]  


In [0]:
player_attrib.columns.values

## Cleaning Categorical Variables

In [0]:
# Get list of the categorical Variables
categorical_cols = player_attrib.select_dtypes(include='object').columns.values
categorical_cols
# OR
# categorical_cols = player_attrib.dtypes[player_attrib.dtypes == 'object'].index
# categorical_cols

In [0]:
# Counting the dtypes of the categorical columns
player_attrib[categorical_cols].dtypes.value_counts()  # get_dtype_counts() was removed in pandas 1.0

In [0]:
# Checking the number of unique values in the categorical columns
player_attrib[categorical_cols].nunique()

In [0]:
# Checking the distribution of the values in the preferred_foot column
player_attrib["preferred_foot"].value_counts() 

*The preferred_foot column doesn't need cleaning*

In [0]:
# Checking the distribution of date column
player_attrib["date"].value_counts()

The date column item values don't need cleaning


In [0]:
# Checking the distribution of the values in the attacking_work_rate column
player_attrib["attacking_work_rate"].value_counts()

The attacking_work_rate column item values need to be set to medium, low or high as those are the only possible values for attacking_work_rate. 

Reference: http://www.fifplay.com/encyclopedia/work-rate/


In [0]:
# Plotting the distribution of the values in the attacking_work_rate column
player_attrib["attacking_work_rate"].value_counts().plot.bar()

__We can either drop the rows whose categorical values do not make sense, or replace those values with one of the three categories: medium, high, low__

In [0]:
# Drop the rows whose attacking_work_rate is one of the invalid values in the list
cleaned = player_attrib[~(player_attrib.attacking_work_rate.isin(['None','norm','y','stoc','le']))] 
               
(1- cleaned.shape[0]/player_attrib.shape[0])*100

2.15% Data Loss

### To replace gibberish values with medium, low, high

In [0]:
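# Choosing to replace the invalid values only with 'low', since this can increase the variance of the column
# The replacement is scoped to attacking_work_rate so other columns are left untouched
player_attrib["attacking_work_rate"] = player_attrib["attacking_work_rate"].replace(['None','norm','y','stoc','le'], 'low')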
print(player_attrib["attacking_work_rate"].value_counts())
player_attrib["attacking_work_rate"].value_counts().plot.bar()
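
An alternative to listing every invalid value is to whitelist the valid ones and send everything else to the fallback. This is only a sketch on a toy Series, with `valid_levels` and the `"low"` fallback taken from the choice made in this notebook:

```python
import pandas as pd

valid_levels = ["low", "medium", "high"]
work_rate = pd.Series(["medium", "high", "norm", "None", "low", "y"])

# Keep valid labels; everything else becomes the chosen fallback ("low")
cleaned_rate = work_rate.where(work_rate.isin(valid_levels), "low")
print(cleaned_rate.value_counts().to_dict())
```

The whitelist form also catches invalid values that were not spotted when reading `value_counts()`.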

In [0]:
# Checking the distribution of the values in the defensive_work_rate column
player_attrib["defensive_work_rate"].value_counts()

The defensive_work_rate column items need to be set into medium, low or high as those are the only possible values for defensive_work_rate. 

Reference: http://www.fifplay.com/encyclopedia/work-rate/

In [0]:
# Plotting the distribution of the values in the defensive_work_rate column
player_attrib["defensive_work_rate"].value_counts().plot.bar()

__We can either drop the rows whose categorical values do not make sense, or re-organize those values into the three categories: medium, high, low__

### Dropping rows with gibberish values in defensive_work_rate

In [0]:
# To delete the rows which have the gibberish values
cleaned1 = player_attrib[~(player_attrib.defensive_work_rate.isin(['o', '1', '2', 'ormal', '3', '5', '7', '0', 
                                                         '6', '9', '4', 'es', 'tocky', 'ean', '8']))] 

In [0]:
(1- cleaned1.shape[0]/player_attrib.shape[0])*100

2.2% Data Loss

 __OR__

### To replace gibberish values with medium, low, high

In [0]:
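# The replacements are scoped to defensive_work_rate so other columns are left untouched
player_attrib["defensive_work_rate"] = player_attrib["defensive_work_rate"].replace(['o', '1', '2', 'ormal', '3', '0', 'es', 'tocky', 'ean'], 'low')
player_attrib["defensive_work_rate"] = player_attrib["defensive_work_rate"].replace(['5', '6', '4'], 'medium')
player_attrib["defensive_work_rate"] = player_attrib["defensive_work_rate"].replace(['7', '9', '8'], 'high')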
print(player_attrib["defensive_work_rate"].value_counts())
player_attrib["defensive_work_rate"].value_counts().plot.bar()

# Basic Statistical Information

In [0]:
# Getting basic statistical information about the numerical columns
player_attrib.describe() # Only numerical columns

In [0]:
# Getting correlation between various numerical columns
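player_attrib.select_dtypes(include='number').corr()  # numeric columns only; object columns cannot be correlated

In [0]:
# Checking for correlations using HEATMAP
plt.figure(figsize=(20,20))
sns.heatmap(player_attrib.select_dtypes(include='number').corr(), cmap="PRGn")

In [0]:
player_attrib.select_dtypes(include='number').corr().loc['overall_rating']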

***overall_rating is highly correlated with the reactions and potential columns (correlation > 0.7). It is moderately correlated with short_passing, long_passing, ball_control, shot_power and vision (correlation > 0.4)***
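
Reading correlations off a full matrix is error-prone, so it can help to rank them against the target directly. A minimal sketch on synthetic data (the columns `strong`, `noise` and `target` are invented for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
strong = rng.normal(size=n)
demo = pd.DataFrame({
    "strong": strong,
    "noise": rng.normal(size=n),
    "target": 2.0 * strong + 0.1 * rng.normal(size=n),
})

# Correlation of every feature with the target, strongest first
target_corr = demo.corr()["target"].drop("target").sort_values(ascending=False)
print(target_corr)
```

Applied to this notebook, the same pattern with `overall_rating` as the target would list the columns in the order discussed above.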
 


# Exploratory Data Analysis

## Univariate - Visual Analysis - Distribution and countplots etc.

### Univariate Analysis of Categorical Data

In [0]:
categorical_cols

In [0]:
player_attrib[categorical_cols].head()

In [0]:
print(player_attrib["preferred_foot"].value_counts())
print(player_attrib["attacking_work_rate"].value_counts())
print(player_attrib["defensive_work_rate"].value_counts())

___The majority of players prefer their right foot___

___Majority of the players' attacking work rate is medium___

___Majority of the players' defensive work rate is medium___

In [0]:
player_attrib['overall_rating'].hist(bins=60)

___Players' overall rating is approximately normally distributed___

### Univariate Analysis of Numerical Data

In [0]:
# Plotting the histograms of numerical columns to understand their distribution
player_attrib.hist(bins=100,figsize=(20,40),layout=(10,4))
plt.show() 

***The interceptions, marking, standing_tackle and sliding_tackle column values follow a bimodal distribution***

***The gk_diving, gk_reflexes, gk_positioning, gk_kicking and gk_handling column values also follow a bimodal distribution, but the two modes are imbalanced***

***All other player attribute columns roughly follow a normal distribution. This is to be expected, as the majority of players have reasonable attributes and only some have exceptional ones***

## Bi-variate -  Statistical and Visual Analysis

__Plotting: overall_rating vs the reactions and potential columns (correlation > 0.7) and short_passing, long_passing, ball_control, shot_power, vision (correlation > 0.4)__

In [0]:
sns.jointplot(x=player_attrib["reactions"], y=player_attrib["overall_rating"], kind='hex', height=7)  # 'size' was renamed to 'height' in seaborn 0.9

In [0]:
sns.jointplot(x=player_attrib["potential"], y=player_attrib["overall_rating"], kind='hex', height=7)

In [0]:
sns.jointplot(x=player_attrib["short_passing"], y=player_attrib["overall_rating"], kind='hex', height=7)

In [0]:
sns.jointplot(x=player_attrib["long_passing"], y=player_attrib["overall_rating"], kind='hex', height=7)

In [0]:
sns.jointplot(x=player_attrib["ball_control"], y=player_attrib["overall_rating"], kind='hex', height=7)

In [0]:
sns.jointplot(x=player_attrib["shot_power"], y=player_attrib["overall_rating"], kind='hex', height=7)

In [0]:
sns.jointplot(x=player_attrib["vision"], y=player_attrib["overall_rating"], kind='hex', height=7)

# Feature Engineering - Preparing Data for Modeling

## Preparing the input vector X


In [0]:
X = player_attrib.drop("overall_rating",axis = 1)
X.shape, X.columns

## Dropping the various ids in the dataset as they do not contribute to the regression model


In [0]:
X.drop("id",axis = 1, inplace = True)
X.drop("player_fifa_api_id",axis = 1, inplace = True)
X.drop("player_api_id",axis = 1, inplace = True)

## Modifying the date column in the input vector


In [0]:
X['year'] = pd.DatetimeIndex(X.date).year
X['month'] = pd.DatetimeIndex(X.date).month
X['day'] = pd.DatetimeIndex(X.date).day
X.drop('date',axis=1, inplace=True)
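
The same date split can be done by parsing the column once and reading the parts through the `.dt` accessor. A minimal sketch, assuming the dates are stored as `YYYY-MM-DD HH:MM:SS` strings (the two sample values are made up):

```python
import pandas as pd

demo = pd.DataFrame({"date": ["2016-02-18 00:00:00", "2007-02-22 00:00:00"]})

parsed = pd.to_datetime(demo["date"])  # parse once, reuse for each part
demo["year"] = parsed.dt.year
demo["month"] = parsed.dt.month
demo["day"] = parsed.dt.day
demo = demo.drop(columns="date")
print(demo)
```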

## Selecting columns for label encoding and encoding them

In [0]:
X_cat_cols = X.select_dtypes(include='object').columns.tolist()
X_cat_cols

In [0]:
# LabelEncoding the preferred_foot, attacking_work_rate, defensive_work_rate
from sklearn.preprocessing import LabelEncoder
for i in X_cat_cols:
    lbl_enc = LabelEncoder()
    X[i] = lbl_enc.fit_transform(X[i])
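
One thing to keep in mind with `LabelEncoder` is that it numbers the classes in sorted order, so the work-rate levels come out as high=0, low=1, medium=2 rather than in their natural order. The sketch below shows this and, as a hypothetical alternative that is not part of the original pipeline, an explicit ordinal mapping:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

rates = pd.Series(["low", "medium", "high", "medium"])

# LabelEncoder assigns integer codes in sorted (alphabetical) class order
enc = LabelEncoder()
label_codes = enc.fit_transform(rates)
print(list(enc.classes_), label_codes.tolist())

# Explicit mapping that preserves the natural low < medium < high order
order = {"low": 0, "medium": 1, "high": 2}
ordinal_codes = rates.map(order)
print(ordinal_codes.tolist())
```

Tree-based models are largely indifferent to the numbering, while linear models may benefit from the ordered mapping.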

In [0]:
# Checking the columns and the shape of the input vector after encoding
X.columns, X.shape

In [0]:
X.head()

## Preparing the Output Y

In [0]:
Y = player_attrib["overall_rating"]
Y.shape

In [0]:
Y.head()

# Splitting the data into Train and Test

In [0]:
# Note: test_size=0.75 holds out 75% of the rows for testing and trains on the remaining 25%
x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.75, random_state = 100)
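
To make the effect of `test_size` concrete, here is a small sketch on 100 dummy rows (the arrays are placeholders, not the player data) that prints the resulting split sizes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

xtr, xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.75, random_state=100)
print(len(xtr), len(xte))
```

With `test_size=0.75` only a quarter of the rows are used for fitting, which is worth keeping in mind when comparing scores with notebooks that use the more common 0.2 to 0.3 range.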

# Fitting the models and collecting the metrics

## Linear Regression

In [0]:
lm  = LinearRegression()
model  = lm.fit(x_train,y_train)
y_train_pred = model.predict(x_train)
y_test_pred = model.predict(x_test)
    
print('Linear Regression -', 'RMSE Train:', math.sqrt(mean_squared_error(y_train, y_train_pred)))
print('Linear Regression -', 'RMSE Test:', math.sqrt(mean_squared_error(y_test, y_test_pred)))
# r2_score expects (y_true, y_pred); unlike MSE, the argument order changes the result
print('Linear Regression -', 'R2_score Train:', r2_score(y_train, y_train_pred))
print('Linear Regression -', 'R2_score Test:', r2_score(y_test, y_test_pred))
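
The scikit-learn metrics take their arguments as `(y_true, y_pred)`. Mean squared error is symmetric, so the order does not affect RMSE, but R2 divides by the variance of whichever array is passed first. A tiny sketch with made-up numbers:

```python
from sklearn.metrics import mean_squared_error, r2_score

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [2.0, 2.0, 3.0, 3.0]

mse_a = mean_squared_error(y_true, y_pred)
mse_b = mean_squared_error(y_pred, y_true)   # same value: MSE is symmetric

r2_correct = r2_score(y_true, y_pred)        # variance of the true values in the denominator
r2_swapped = r2_score(y_pred, y_true)        # variance of the predictions in the denominator
print(mse_a, mse_b, r2_correct, r2_swapped)
```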

## Other Regressors

In [0]:
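# normalize=True was removed from the linear models in scikit-learn 1.2;
# scaling in a pipeline replaces it here (alpha values are not directly comparable)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

regressors = [
            ("Linear - ", make_pipeline(StandardScaler(), LinearRegression())),
            ("Ridge - ",  make_pipeline(StandardScaler(), Ridge(alpha=0.5))),
            ("Lasso - ",  make_pipeline(StandardScaler(), Lasso(alpha=0.5))),
            ("ElasticNet - ",  make_pipeline(StandardScaler(), ElasticNet(alpha=0.5, l1_ratio=0.5))),
            ("Decision Tree - ",  DecisionTreeRegressor(max_depth=5)),
            ("Random Forest - ",  RandomForestRegressor(n_estimators=100)),
            ("AdaBoost - ",  AdaBoostRegressor(n_estimators=100)),
            ("GBM - ", GradientBoostingRegressor(n_estimators=100))]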

In [0]:
for reg in regressors:
    reg[1].fit(x_train, y_train)
    y_test_pred= reg[1].predict(x_test)
    print(reg[0],"\n\t R2-Score:", reg[1].score(x_test, y_test),
                 "\n\t RMSE:", math.sqrt(mean_squared_error(y_test_pred, y_test)),"\n")

# Feature Selection

___Feature Selection using feature importances from RandomForestRegressor model___


In [0]:
rndf = RandomForestRegressor(n_estimators=150)
rndf.fit(x_train, y_train)
importance = pd.DataFrame.from_dict({'cols':x_train.columns, 'importance': rndf.feature_importances_})
importance = importance.sort_values(by='importance', ascending=False)
plt.figure(figsize=(20,15))
sns.barplot(x=importance.cols, y=importance.importance)  # recent seaborn versions require keyword arguments
plt.xticks(rotation=90)

In [0]:
imp_cols = importance[importance.importance >= 0.005].cols.values
imp_cols
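
The thresholding step can be sketched end to end on synthetic data. Everything here (the `f0`..`f5` column names, the 0.05 cut-off, the forest size) is chosen for illustration only and is not tuned for the player data:

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X_arr, y_demo = make_regression(n_samples=300, n_features=6, n_informative=2, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Impurity-based importances are normalized to sum to 1
importances = pd.Series(forest.feature_importances_, index=X_demo.columns)
importances = importances.sort_values(ascending=False)

threshold = 0.05  # arbitrary cut-off for this toy example
selected = importances[importances >= threshold].index.tolist()
print(importances.round(3).to_dict(), selected)
```

Because the importances sum to 1, a threshold such as 0.005 or 0.001 can be read as a minimum share of the total importance.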

In [0]:
# Fitting models with columns where feature importance>=0.005
x_train, x_test, y_train, y_test = train_test_split(X[imp_cols],Y,test_size=0.75, random_state = 100)
for reg in regressors:
    reg[1].fit(x_train, y_train)
    y_test_pred= reg[1].predict(x_test)
    print(reg[0],"\n\t R2-Score:", reg[1].score(x_test, y_test),
                 "\n\t RMSE:", math.sqrt(mean_squared_error(y_test_pred, y_test)),"\n")

In [0]:
imp_cols = importance[importance.importance >= 0.001].cols.values
imp_cols

In [0]:
# Fitting models with columns where feature importance>=0.001
x_train, x_test, y_train, y_test = train_test_split(X[imp_cols],Y,test_size=0.75, random_state = 100)
for reg in regressors:
    reg[1].fit(x_train, y_train)
    y_test_pred= reg[1].predict(x_test)
    print(reg[0],"\n\t R2-Score:", reg[1].score(x_test, y_test),
                 "\n\t RMSE:", math.sqrt(mean_squared_error(y_test_pred, y_test)),"\n")

___RandomForest and GBM provide us with the best RMSE and R2-Score when selecting columns with feature importance >= 0.001___

# Validation of the Models

___Validating our models using K-Fold Cross Validation for Robustness___

In [0]:
scoring = 'neg_mean_squared_error'
results=[]
names=[]
for modelname, model in regressors:
    kfold = KFold(n_splits=10, shuffle=True, random_state=7)  # random_state requires shuffle=True in recent scikit-learn
    cv_results = cross_val_score(model, x_train,y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(modelname)
    print(modelname,"\n\t CV-Mean:", cv_results.mean(),
                    "\n\t CV-Std. Dev:",  cv_results.std(),"\n")
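
With `scoring='neg_mean_squared_error'`, `cross_val_score` returns negated MSE values, so scores closer to zero are better. To compare them with the RMSE figures printed earlier, negate and take the square root. A minimal sketch on synthetic data, with `LinearRegression` standing in for any of the regressors:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
kfold = KFold(n_splits=5, shuffle=True, random_state=7)

neg_mse = cross_val_score(LinearRegression(), X_demo, y_demo,
                          cv=kfold, scoring="neg_mean_squared_error")
rmse_per_fold = np.sqrt(-neg_mse)  # flip the sign, then take the root
print(rmse_per_fold.mean(), rmse_per_fold.std())
```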

___RandomForest and GBM provide us with the best validation score, both w.r.t. CV-Mean and CV-Std. Dev___

__Therefore we choose these two models to optimize. We do this by finding best hyper-parameter values which give us even better R2-Score and RMSE values__

# Tuning Model for better Performance -- Hyper-Parameter Optimization

***Tuning the RandomForestRegressor, GradientBoostingRegressor Hyper-Parameters using GridSearchCV***

In [0]:
regressors

***Warning: The following grid searches are computationally expensive. scikit-learn runs them on the CPU, so run them only on a machine with a powerful multi-core processor. Even then they may take more than 3 - 4 hours to complete.***

## Random Forest Regressor

In [0]:
RF_Regressor =  RandomForestRegressor(n_estimators=100, n_jobs = -1, random_state = 100)

CV = ShuffleSplit(test_size=0.25, random_state=100)

param_grid = {"max_depth": [5, None],
              "n_estimators": [50, 100, 150, 200],
              "min_samples_split": [2, 4, 5],
              "min_samples_leaf": [2, 4, 6]
             }
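
The cost of a grid search follows from the number of parameter combinations times the number of cross-validation splits. The sketch below counts them for the Random Forest grid; `n_splits = 5` is an assumption (the default in recent scikit-learn versions when no `cv` is given) and should be replaced by the split count of whatever `cv` object is passed:

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "max_depth": [5, None],
    "n_estimators": [50, 100, 150, 200],
    "min_samples_split": [2, 4, 5],
    "min_samples_leaf": [2, 4, 6],
}

n_candidates = len(ParameterGrid(param_grid))  # 2 * 4 * 3 * 3 combinations
n_splits = 5                                   # assumed number of CV splits
total_fits = n_candidates * n_splits           # forests trained, excluding the final refit
print(n_candidates, total_fits)
```

Each of those fits trains up to 200 trees, which is why the search is slow.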

In [0]:
rscv_grid = GridSearchCV(RF_Regressor, param_grid=param_grid, cv=CV, verbose=1)  # use the ShuffleSplit defined above

In [0]:
rscv_grid.fit(x_train, y_train)

In [0]:
rscv_grid.best_params_

In [0]:
model = rscv_grid.best_estimator_
model.fit(x_train, y_train)

In [0]:
model.score(x_test, y_test)

In [0]:
RF_reg = pickle.dumps(rscv_grid)
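
`pickle.dumps` returns the serialized model as a bytes object held in memory, so it is lost when the kernel restarts unless it is also written to disk. A small round-trip sketch, with a `LinearRegression` on toy data standing in for the fitted grid search:

```python
import pickle
import numpy as np
from sklearn.linear_model import LinearRegression

X_demo = np.array([[0.0], [1.0], [2.0]])
y_demo = np.array([1.0, 3.0, 5.0])
fitted = LinearRegression().fit(X_demo, y_demo)

blob = pickle.dumps(fitted)       # bytes held in memory
restored = pickle.loads(blob)     # an independent copy of the fitted model

same_predictions = np.allclose(fitted.predict(X_demo), restored.predict(X_demo))
print(type(blob).__name__, same_predictions)
```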

## Gradient Boosting Regressor

In [0]:
GB_Regressor =  GradientBoostingRegressor(n_estimators=100)

CV = ShuffleSplit(test_size=0.25, random_state=100)

param_grid = {'max_depth': [5, 7, 9],
              'learning_rate': [0.1, 0.3, 0.5]
             }

In [0]:
rscv_grid = GridSearchCV(GB_Regressor, param_grid=param_grid, cv=CV, verbose=1)  # use the ShuffleSplit defined above

In [0]:
rscv_grid.fit(x_train, y_train)

In [0]:
rscv_grid.best_params_

In [0]:
model = rscv_grid.best_estimator_
model.fit(x_train, y_train)

In [0]:
model.score(x_test, y_test)

In [0]:
GB_reg = pickle.dumps(rscv_grid)

# Comparing performance metric of the different models

In [0]:
RF_regressor = pickle.loads(RF_reg)
GB_regressor = pickle.loads(GB_reg)

In [0]:
print("RandomForestRegressor - \n\t R2-Score:", RF_regressor.score(x_test, y_test),
                 "\n\t RMSE:", math.sqrt(mean_squared_error(RF_regressor.predict(x_test), y_test)),"\n")
      
print("GradientBoostingRegressor - \n\t R2-Score:", GB_regressor.score(x_test, y_test),
                 "\n\t RMSE:", math.sqrt(mean_squared_error(GB_regressor.predict(x_test), y_test)),"\n")

# Choosing the model

***We can see that the Gradient Boosting Regressor gives the better result, with an R2-Score of more than 97% while keeping the RMSE low (= 1.1370474). So, the Gradient Boosting Regressor should be used as the regression model for this dataset.***