# Predicting footballer wages from FIFA / EA FC player stats

The problem we are addressing is how to correctly evaluate a player's wage based on their stats. With an ever-growing pool of talent in the sport, it is vital for a football club to price its players accurately to stay competitive. Our goal is to study and build machine learning models that accurately predict a footballer's wage from a dataset of player wages and stats.

# Importing data

We will use the [EA Sports FC 24 complete player dataset](https://www.kaggle.com/datasets/stefanoleone992/ea-sports-fc-24-complete-player-dataset/data) from Kaggle for player stats and valuations.

Required Python modules:
* Pandas - importing and manipulating data

In [None]:
import pandas as pd
from copy import deepcopy
# instructing pandas to not truncate column widths when displaying data in interactive mode
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)

random_state = 42 # Set random state to ensure all random variables can be reproduced
datasets_fp = 'datasets/' # Datasets folder path
df = pd.read_csv(datasets_fp + 'male_players.csv', encoding='unicode_escape')
df

Let's have a quick look at the dataframe

In [None]:
df.head(5) # first 5 rows

Let's do a quick analysis of the data and its columns using the pandas describe() method

In [None]:
df.describe(include='all') # using include='all' to show us all columns

Let's find out exactly how many columns we have

In [None]:
len(df.columns) # length of columns list

# Initial Data Preparation

Now that we have successfully imported the dataset, we need to prepare it for the task at hand.

We will be focusing on Premier League data only, so let's filter the dataset to satisfy this requirement

In [None]:
# filter league_name to 'Premier League'; .copy() avoids SettingWithCopyWarning on later in-place edits
df = df[df['league_name'] == 'Premier League'].copy()
df

The current dataset has too many features with a large portion being irrelevant to the task at hand. We will drop the columns that aren't relevant using the .drop() function.

In [None]:
df.drop([
    'player_id',
    'wage_eur',
    'player_url',
    'update_as_of',
    'short_name',
    'long_name',
    'dob',
    'club_name',
    'league_name',
    'club_position',
    'club_loaned_from',
    'club_joined_date',
    'nationality_id',
    'nationality_name',
    'nation_position',
    'body_type',
    'real_face',
    'player_tags',
    'player_traits',
    'fifa_update',
    'league_level',
    'club_team_id',
    'league_id',
    'club_jersey_number',
    'club_contract_valid_until_year',
    'nation_team_id',
    'nation_jersey_number',
    'release_clause_eur',
    'height_cm',
    'weight_kg',
    'ls',
    'st',
    'rs',
    'lw',
    'lf',
    'cf',
    'rf',
    'rw',
    'lam',
    'cam',
    'ram',
    'lm',
    'lcm',
    'cm',
    'rcm',
    'rm',
    'lwb',
    'ldm',
    'cdm',
    'rdm',
    'rwb',
    'lb',
    'lcb',
    'cb',
    'rcb',
    'rb',
    'gk',
    'work_rate',
    'attacking_crossing',
    'attacking_finishing',
    'attacking_heading_accuracy',
    'attacking_short_passing',
    'attacking_volleys',
    'skill_dribbling',
    'skill_curve',
    'skill_fk_accuracy',
    'skill_long_passing',
    'skill_ball_control',
    'movement_acceleration',
    'movement_sprint_speed',
    'movement_agility',
    'movement_reactions',
    'movement_balance',
    'power_shot_power',
    'power_jumping',
    'power_stamina',
    'power_strength',
    'power_long_shots',
    'mentality_aggression',
    'mentality_interceptions',
    'mentality_positioning',
    'mentality_vision',
    'mentality_penalties',
    'mentality_composure',
    'defending_marking_awareness',
    'defending_standing_tackle',
    'defending_sliding_tackle',
    'goalkeeping_diving',
    'goalkeeping_handling',
    'goalkeeping_kicking',
    'goalkeeping_positioning',
    'goalkeeping_reflexes',
    'goalkeeping_speed'
], axis=1, inplace=True)
df

Next, we extract each player's main position from the player_positions column

In [None]:
#----------------------------------
# player_positions -> main_position
#----------------------------------

# list of all outfield positions can be found at https://www.fifplay.com/encyclopedia/position/
forward_positions = [
    'ST', 
    'CF',
    'RF',
    'LF',
    'RW',
    'LW',
]

midfielder_positions = [
    'CM',
    'CDM',
    'CAM',
    'RM',
    'LM',
]

defender_positions = [
    'CB',
    'RB',
    'LB',
    'RWB',
    'LWB',
]

def calc_main_position(positions: str):

    primary_position = positions.split(',')[0] # using first position mentioned as primary position

    if primary_position in forward_positions:
        return 'FW'
    elif primary_position in midfielder_positions:
        return 'MD'
    elif primary_position in defender_positions:
        return 'DF'
    elif primary_position == 'GK':
        return 'GK'
    else:
        return None # no valid position found

# Create new column with function applied to each value
df['main_position'] = df['player_positions'].apply(calc_main_position)
# Show new column
print(df[['player_positions', 'main_position']])
# Check for nulls
print('\nNulls: ', df['main_position'].isnull().sum())
# Drop old column
df.drop(['player_positions'], axis=1, inplace=True)
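
The same grouping can also be written as a dictionary lookup. The following is a minimal sketch on a few made-up position strings; the names `POSITION_GROUPS` and `main_position_of` are illustrative and not part of the notebook.

```python
import pandas as pd

# Hypothetical lookup table built from the same position lists as above
POSITION_GROUPS = {
    **dict.fromkeys(['ST', 'CF', 'RF', 'LF', 'RW', 'LW'], 'FW'),
    **dict.fromkeys(['CM', 'CDM', 'CAM', 'RM', 'LM'], 'MD'),
    **dict.fromkeys(['CB', 'RB', 'LB', 'RWB', 'LWB'], 'DF'),
    'GK': 'GK',
}

def main_position_of(positions: str):
    """Map the first listed position to its group, or None if it is unknown."""
    primary = positions.split(',')[0].strip()
    return POSITION_GROUPS.get(primary)

# Made-up examples in the same comma-separated format as player_positions
sample = pd.Series(['ST, LW', 'CDM, CM', 'RWB', 'GK', 'XX'])
groups = sample.map(main_position_of)
print(groups.tolist())
```

An unknown code such as `'XX'` yields a missing value, so the null check used above still catches unmapped positions.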



Filter data for non-goalkeeper players only

In [None]:
df = df[df['main_position'] != 'GK'].copy() # filter out goalkeepers (copy so later in-place edits are safe)

In [None]:
df

Check for nulls

In [None]:
for col in df.columns:
    n_nulls = df[col].isnull().sum() # get count of column null values
    print(f"{col} - {n_nulls} nulls")

Drop rows with null value_eur

In [None]:
df.dropna(subset=['value_eur'], inplace=True)
df['value_eur'].isnull().sum()

# Splitting the dataset

In [None]:
# import train_test_split function from sklearn
# Doc: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
from sklearn.model_selection import train_test_split

# Set target dataset to value_eur column
y = df['value_eur']
# Set features dataset to dataset minus value_eur column using .drop() function
X = df.drop(['value_eur'], axis=1)

# Split dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=.2, random_state=random_state
)

# Test if split was as expected by calculating percentage of each set
# This should give us a total split of 80%-20% (train, test)
train_percent = X_train.shape[0] / X.shape[0] * 100
test_percent = X_test.shape[0] / X.shape[0]  * 100
total_percent = train_percent+test_percent

print(f'Train: {train_percent}%\
      \nTest: {test_percent}%\
      \n------------------------\
      \nTotal: {total_percent}%')


# EDA


In [None]:
# Import libraries for plots and visualisations
import seaborn as sns
import matplotlib as mpl

#### Univariate

In [None]:
sns.displot(df['value_eur'],  kind='kde')

In [None]:
sns.displot(df[df['value_eur'] > 100000000]['value_eur'], binwidth=10000000)

In [None]:
sns.displot(df['age'], binwidth=1) # binwidth = width of each bar

In [None]:
sns.boxplot(df, x='age', orient='h') # horizontal orientation

In [None]:
#sns.displot(df['overall'], binwidth=5)
#sns.boxplot(df['overall'], orient='h')

In [None]:
#sns.displot(df['potential'], binwidth=5)
#sns.boxplot(df['potential'], orient='h')

In [None]:
sns.catplot(df, x='fifa_version', kind='count')

In [None]:
sns.displot(df['preferred_foot'])

In [None]:
sns.catplot(df, x='skill_moves', kind='count')

In [None]:
sns.catplot(df, x='weak_foot', kind='count')

In [None]:
sns.catplot(df, x='international_reputation', kind='count')

In [None]:
n_records = df.shape[0] # number of records in our dataframe

#loop 5 times (1 for each rating from 1-5)
for i in range(5):
    rating = i+1 # index starts from 0 so add 1 to get rating
    
    # get number of records with this rating
    count = df[df['international_reputation']==rating].shape[0]
    percent = count / n_records * 100
    
    print(f"Intl rep rating {rating}: {percent}%") 
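
The same percentages can be obtained without an explicit loop over ratings. This is a minimal sketch on a made-up series standing in for `df['international_reputation']`; it assumes ratings run from 1 to 5, as in the loop above.

```python
import pandas as pd

# Toy stand-in for df['international_reputation'] (values are made up)
ratings = pd.Series([1, 1, 1, 2, 2, 3, 1, 1, 4, 1])

# normalize=True returns proportions instead of raw counts
percentages = (
    ratings.value_counts(normalize=True)
    .reindex(range(1, 6), fill_value=0.0)  # include ratings that no player has
    .mul(100)
)

for rating, percent in percentages.items():
    print(f"Intl rep rating {rating}: {percent:.1f}%")
```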

In [None]:
sns.displot(df['pace'], binwidth=5, kde=True)
#sns.boxplot(df, x='pace')

In [None]:
sns.displot(df, x='shooting', binwidth=5, kde=True)

In [None]:
sns.displot(df['dribbling'], binwidth=5, kde=True)

In [None]:
sns.displot(df['passing'], binwidth=5, kde=True)

In [None]:
sns.displot(df['physic'], binwidth=5, kde=True)

In [None]:
sns.catplot(df, x='main_position', kind='count')

#### Bivariate

In [None]:
## Pair Plot

sns.pairplot(
    df, 
    vars=[    # 2d grid of these columns
        'value_eur',
        'age',
        'overall',
        'international_reputation',
        'skill_moves',
        'shooting',
        'pace',
        'main_position'
    ], 
)

In [None]:
sns.barplot(df, x='fifa_version', y='value_eur')

In [None]:
sns.lineplot(df, x='age', y='value_eur')

In [None]:
sns.lineplot(df, x='overall', y='value_eur')

In [None]:
sns.lineplot(df, x='potential', y='value_eur')

In [None]:
sns.barplot(df, x='main_position', y='value_eur')

In [None]:
sns.barplot(df, x='international_reputation', y='value_eur')

In [None]:
sns.barplot(df, x='preferred_foot', y='value_eur')

In [None]:
sns.lineplot(df, x='weak_foot', y='value_eur')

In [None]:
sns.lineplot(df, x='shooting', y='value_eur')

In [None]:
sns.lineplot(df, x='dribbling', y='value_eur')

In [None]:
sns.lineplot(df, x='passing', y='value_eur')

In [None]:
sns.lineplot(df, x='physic', y='value_eur')

In [None]:
sns.barplot(df, x='fifa_version', y='overall')

# Data cleansing / pre-processing

### Missing data

Check each dataset for null values

In [None]:
datasets = {    # using dictionary so we can reference the name as string
    "Train": X_train, 
    "Test": X_test, 
}

for dataset in datasets.items():
    name, _df = dataset

    print("\n", name,"\n============")
    for col in _df.columns:
        n_nulls = _df[col].isnull().sum() # get count of column null values
        print(f"{col} - {n_nulls} nulls")

### Data encoding / scaling

Dummy encoding function to transform categorical data into numerical data

In [None]:
def apply_dummy_encoding(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a copy of df with the given column(s) replaced by dummy variables"""

    # get_dummies returns a new dataframe in which each listed column is
    # replaced by its dummy columns, so no concat or manual drop is needed
    # (concatenating its output with df would duplicate every other column)
    return pd.get_dummies(df, columns=columns, drop_first=True)
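
To illustrate what dummy encoding with `drop_first=True` produces, here is a small sketch on a made-up three-row dataframe. The column names mirror the notebook's, but the values are invented.

```python
import pandas as pd

# Made-up rows using the same categorical columns as the notebook
toy = pd.DataFrame({
    'age': [24, 31, 19],
    'main_position': ['FW', 'DF', 'MD'],
    'preferred_foot': ['Right', 'Left', 'Right'],
})

# Each categorical column becomes one indicator column per category,
# minus the first category, which is implied when all indicators are 0
encoded = pd.get_dummies(
    toy, columns=['main_position', 'preferred_foot'], drop_first=True
)
print(encoded.columns.tolist())
```

The dropped category of each column (here `DF` and `Left`, the first in sorted order) becomes the baseline against which the remaining indicators are read.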


Scale data function using the RobustScaler class from sklearn

In [None]:
from sklearn.preprocessing import RobustScaler

def apply_robust_scalar(df: pd.DataFrame) -> pd.DataFrame:
    """Scales data using RobustScaler, keeping column names and index"""
    scaler = RobustScaler() # create RobustScaler object
    scaled = scaler.fit_transform(df) # fit to the dataframe and transform it
    # wrap the resulting array so the column names survive for later cells
    return pd.DataFrame(scaled, columns=df.columns, index=df.index)
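
For reference, with its default settings RobustScaler subtracts each column's median and divides by its interquartile range (IQR). The sketch below checks that formula by hand on a made-up column containing one extreme value.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# One made-up feature column with an outlier in the last row
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Manual robust scaling: (x - median) / (Q3 - Q1)
median = np.median(values)
q1, q3 = np.percentile(values, [25, 75])
manual = (values - median) / (q3 - q1)

# The same transformation via scikit-learn's default RobustScaler
scaled = RobustScaler().fit_transform(values)
print(np.allclose(manual, scaled))
```

Because the median and IQR are barely affected by the outlier, the four ordinary values keep a sensible spread after scaling, which is the usual reason to prefer this scaler for skewed monetary data.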

Function to combine both preprocessing steps

In [None]:

def apply_preprocessing(df: pd.DataFrame, scale: bool=True) -> pd.DataFrame:
    """Apply preprocessing (encoding & scaling)"""
    df = apply_dummy_encoding(df, columns=['main_position', 'preferred_foot'])
    if scale:
        df = apply_robust_scalar(df)

    return df

Apply preprocessing to the train and test feature sets

In [None]:
# Apply processing to each feature dataset
#X_train.drop(['overall', 'potential'], axis=1, inplace=True)
#X_test.drop(['overall', 'potential'], axis=1, inplace=True)

X_train = apply_preprocessing(X_train, scale=True)
X_test = apply_preprocessing(X_test, scale=True)
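
The cell above encodes and scales each set independently. A common alternative is to learn the dummy columns and scaling statistics from the training set only and reuse them on the test set, so both sets share the same columns and scale. The sketch below shows that pattern on made-up frames. It is not the notebook's method, and all variable names in it are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Made-up frames standing in for X_train and X_test
train = pd.DataFrame({'age': [20, 25, 30, 35],
                      'main_position': ['FW', 'MD', 'DF', 'FW']})
test = pd.DataFrame({'age': [22, 33],
                     'main_position': ['MD', 'MD']})

# Encode the training set, then force the test set onto the same columns;
# dummy columns that are absent from the test set are filled with 0
train_enc = pd.get_dummies(train, columns=['main_position'],
                           drop_first=True, dtype=float)
test_enc = (
    pd.get_dummies(test, columns=['main_position'], dtype=float)
    .reindex(columns=train_enc.columns, fill_value=0.0)
)

# Fit the scaler on the training data only, then transform both sets
scaler = RobustScaler().fit(train_enc)
train_scaled = pd.DataFrame(scaler.transform(train_enc),
                            columns=train_enc.columns, index=train_enc.index)
test_scaled = pd.DataFrame(scaler.transform(test_enc),
                           columns=train_enc.columns, index=test_enc.index)
print(test_scaled)
```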

# Model development

In [None]:
#from sklearn.ensemble import RandomForestRegressor
#from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from math import sqrt

Create the model with the best hyperparameters, found by automated grid search with 3-fold cross-validation

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_squared_error, mean_absolute_error, r2_score
from math import sqrt


# Set machine learning algorithm
algo = LinearRegression()

# Define the hyperparameter grid for tuning
param_grid = {
    'fit_intercept': [True, False],
    'positive': [True, False]
}

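# Create a dictionary of the scorers to use
# Error metrics are marked greater_is_better=False so that the search
# (and refit='rmse') prefers the lowest error; their scores are reported negated
scorers = {
    'r2_score': make_scorer(r2_score),
    'mae': make_scorer(mean_absolute_error, greater_is_better=False),
    'mse': make_scorer(mean_squared_error, greater_is_better=False),
    'rmse': make_scorer(
        lambda y_true, y_pred: sqrt(mean_squared_error(y_true, y_pred)),
        greater_is_better=False,
    ),
}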

# Create the GridSearchCV object
random_search = GridSearchCV(
    estimator=algo, 
    param_grid=param_grid, 
    scoring=scorers, 
    refit='rmse', 
    cv=3, # number of K-Folds
    verbose=1, 
    n_jobs=-1, 
)

# Fit the GridSearchCV object to the data
random_search.fit(X_train, y_train)

# Get the best model from the grid search
best_rfr = random_search.best_estimator_
# Show hyperparameters of best model
random_search.best_params_
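
scikit-learn's model selection treats every score as "higher is better", which is why its built-in error metrics are exposed in negated form such as `neg_root_mean_squared_error`. The sketch below shows the convention on synthetic data. The data and the reduced parameter grid are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

# Synthetic regression data with a clear non-zero intercept
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(60, 3))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5]) + 10.0 + rng.normal(scale=0.1, size=60)

search = GridSearchCV(
    LinearRegression(),
    param_grid={'fit_intercept': [True, False]},
    scoring='neg_root_mean_squared_error',  # negated so that higher is better
    cv=3,
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(-search.best_score_)  # flip the sign back to read it as an RMSE
```

Since the synthetic target has an intercept of 10, the search should favour `fit_intercept=True`, and negating `best_score_` recovers the cross-validated RMSE.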

Evaluation on test dataset

In [None]:
# Use the best model to make predictions
y_prediction = best_rfr.predict(X_test)

# Evaluate the performance of the best model on test dataset
r2 = r2_score(y_test, y_prediction)
mae = mean_absolute_error(y_test, y_prediction)
mse = mean_squared_error(y_test, y_prediction)
rmse = sqrt(mse)

print(f"R²: {r2}")
print(f"MAE: {mae}")
print(f"MSE: {mse}")
print(f"RMSE: {rmse}")
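
As a reminder of what these four metrics measure, the sketch below computes each by hand on a tiny made-up example and compares the results with scikit-learn's functions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up true values and predictions
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 37.0])

errors = y_true - y_pred
mae_manual = np.mean(np.abs(errors))   # average absolute error
mse_manual = np.mean(errors ** 2)      # average squared error
rmse_manual = np.sqrt(mse_manual)      # back in the units of the target
# R²: 1 minus residual sum of squares over total sum of squares
r2_manual = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(np.isclose(mae_manual, mean_absolute_error(y_true, y_pred)))
print(np.isclose(mse_manual, mean_squared_error(y_true, y_pred)))
print(np.isclose(r2_manual, r2_score(y_true, y_pred)))
```

MAE and RMSE are expressed in the same units as the target (euros here), while R² is unitless, so R² is the easier one to compare across differently scaled targets.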

Visualise model prediction performance

In [None]:
sns.scatterplot(x=y_test, y=y_prediction)

Plot feature importance

In [None]:
# LinearRegression exposes coef_ rather than feature_importances_, so use the
# absolute size of each coefficient as a rough measure of importance
# (only comparable across features when the features share a common scale)
feature_importances = abs(best_rfr.coef_)

# Use the feature names stored by the model if it was fitted on a DataFrame;
# otherwise fall back to positional names
feature_names = getattr(
    best_rfr, 'feature_names_in_',
    [f'feature_{i}' for i in range(len(feature_importances))]
)

# Create a DataFrame for easier plotting
importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importances})

# Sort the DataFrame by importance in descending order
importance_df = importance_df.sort_values(by='Importance', ascending=False)

# Plot the feature importances using Seaborn
sns.barplot(x='Importance', y='Feature', data=importance_df)
