# Classification

Classification is a supervised learning task in which a model assigns one of a fixed set of class labels to each input sample.

In other words, we predict which class a given data point belongs to, where the set of possible classes is discrete and finite.

Common classification algorithms include:
- Decision Trees
- Random Forest
- Support Vector Machines
- K-Nearest Neighbors
- Naive Bayes
- Logistic Regression

We can either have binary classification (two classes) or multi-class classification (more than two classes).
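In scikit-learn, the algorithms listed above share the same `fit` / `predict` / `score` interface, so swapping one for another is usually a one-line change. The sketch below illustrates this on synthetic three-class data generated with `make_classification`. The data, the variable names ending in `_demo`, and the hyperparameters are assumptions chosen for illustration, not part of the football dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data, used only for illustration (not the football data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Support Vector Machine": SVC(),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# Every model is trained and scored through the same two calls
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = model.score(X_te, y_te)  # mean accuracy on the held-out set
    print(f"{name}: {scores[name]:.2f}")
```

The scores on synthetic data say nothing about which model suits the football data; the point is only the shared interface.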

Let's use FBref data to predict a player's position from their stats. Because there are four possible positions (DF, MF, FW, GK), this is a multi-class problem.

In [26]:
import pandas as pd

from sklearn.model_selection import train_test_split

In [18]:
df = pd.read_csv('~/Documents/GitHub/complete-football-analytics/Module 6/classification_data.csv')
df.shape

(2865, 27)

In [19]:
df.head()

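| | Rk | Player | Nation | Pos | Squad | Comp | Age | Born | MP | Starts | ... | PKatt | CrdY | CrdR | xG | npxG | xAG | npxG+xAG | PrgC | PrgP | PrgR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Max Aarons | eng ENG | DF | Bournemouth | eng Premier League | 24-100 | 2000 | 15 | 12 | ... | 0 | 1 | 0 | 0.0 | 0.0 | 0.8 | 0.8 | 19 | 40 | 22 |
| 1 | 2 | Brenden Aaronson | us USA | MF,FW | Union Berlin | de Bundesliga | 23-174 | 2000 | 25 | 10 | ... | 0 | 3 | 1 | 1.7 | 1.7 | 1.2 | 2.9 | 26 | 31 | 56 |
| 2 | 3 | Paxten Aaronson | us USA | MF | Eint Frankfurt | de Bundesliga | 20-231 | 2003 | 7 | 1 | ... | 0 | 0 | 0 | 0.1 | 0.1 | 0.1 | 0.2 | 2 | 5 | 7 |
| 3 | 4 | Yunis Abdelhamid | ma MAR | DF | Reims | fr Ligue 1 | 36-198 | 1987 | 25 | 25 | ... | 0 | 4 | 0 | 2.4 | 2.4 | 0.3 | 2.7 | 33 | 117 | 7 |
| 4 | 5 | Salis Abdul Samed | gh GHA | MF | Lens | fr Ligue 1 | 24-018 | 2000 | 25 | 17 | ... | 0 | 2 | 0 | 0.8 | 0.8 | 0.5 | 1.3 | 8 | 77 | 20 |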


In [20]:
# Let's check the unique values in the 'Pos' column which is the target variable
df['Pos'].unique()

array(['DF', 'MF,FW', 'MF', 'FW', 'FW,MF', 'Pos', 'DF,FW', 'GK', 'DF,MF',
       'MF,DF', 'FW,DF'], dtype=object)

In [21]:
# Some players have several positions listed (e.g. 'MF,FW'). To keep one label per player, we take the first position listed.
df['Pos'] = df['Pos'].apply(lambda x: x.split(',')[0])

# We also need to remove the rows where the position is the literal string 'Pos': these are repeated header rows embedded in the data, not players
df = df[df['Pos'] != 'Pos']

df['Pos'].unique()

array(['DF', 'MF', 'FW', 'GK'], dtype=object)
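The same cleanup can also be written with pandas' vectorized string methods instead of `apply` with a lambda. This is a minimal sketch on a toy DataFrame; the `toy` frame and its values are made up for illustration. Filtering out the header rows first and calling `.copy()` gives an independent frame to assign into.

```python
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only)
toy = pd.DataFrame({"Pos": ["DF", "MF,FW", "Pos", "FW,MF", "GK"]})

# Drop the repeated header row, then keep the first listed position
toy = toy[toy["Pos"] != "Pos"].copy()
toy["Pos"] = toy["Pos"].str.split(",").str[0]

print(toy["Pos"].tolist())  # ['DF', 'MF', 'FW', 'GK']
```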

In [22]:
# Let's check for missing values
df.isnull().sum()

Rk           0
Player       0
Nation       3
Pos          0
Squad        0
Comp         0
Age          3
Born         3
MP           0
Starts       0
Min          0
90s          0
Gls          0
Ast          0
G+A          0
G-PK         0
PK           0
PKatt        0
CrdY         0
CrdR         0
xG          14
npxG        14
xAG         14
npxG+xAG    14
PrgC        14
PrgP        14
PrgR        14
dtype: int64

In [23]:
# Let's drop rows with missing values since there are only a few
df = df.dropna()

print(df.shape)

(2737, 27)
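One thing worth checking at this stage, assuming the repeated header rows were still in the file when it was read: a column that contains a header string such as 'Gls' is loaded by pandas as text rather than numbers. The sketch below shows one way to convert such columns with `pd.to_numeric` on a toy frame; the `toy` data and the `numeric_cols` list are hypothetical.

```python
import pandas as pd

# Toy frame mimicking a CSV with an embedded header row (illustrative values)
toy = pd.DataFrame({
    "Gls": ["3", "Gls", "0", None],
    "xG": ["2.4", "xG", "0.1", "1.0"],
})

# Remove the embedded header row
toy = toy[toy["Gls"] != "Gls"].copy()

# Convert text columns to numbers; anything unparseable becomes NaN
numeric_cols = ["Gls", "xG"]
for col in numeric_cols:
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

# Drop rows that still have missing values
toy = toy.dropna()

print(toy.dtypes)
print(toy)
```

Running `df.dtypes` on the real data would confirm whether this conversion is needed there.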


In [31]:
# Since the 'Pos' column is the target variable, let's check the distribution of the classes
df['Pos'].value_counts()

Pos
DF    977
MF    870
FW    711
GK    179
Name: count, dtype: int64

In [32]:
# To get a baseline accuracy, divide the count of the most frequent class by the total number of samples
df['Pos'].value_counts().max() / df['Pos'].value_counts().sum()

# If we predicted the most frequent class (DF) for every sample, we would be correct about 35.7% of the time. Any useful model should beat this benchmark.

0.35696017537449765
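The same majority-class baseline can be expressed as a model with scikit-learn's `DummyClassifier`, which makes it easy to score with the same metrics as a real classifier. The sketch below rebuilds the label vector from the class counts shown above; the placeholder feature matrix `X_dummy` is an assumption, since the dummy model ignores the features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Class counts taken from the value_counts() output above
counts = {"DF": 977, "MF": 870, "FW": 711, "GK": 179}
y_all = np.repeat(list(counts.keys()), list(counts.values()))

# The dummy model ignores the features, so a placeholder column is enough
X_dummy = np.zeros((len(y_all), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_dummy, y_all)

baseline_accuracy = baseline.score(X_dummy, y_all)
print(round(baseline_accuracy, 3))  # 0.357
```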

In [33]:
# The Pos column is categorical, so we encode each label as an integer. LabelEncoder assigns codes in sorted order of the labels: DF=0, FW=1, GK=2, MF=3
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

df['Pos'] = le.fit_transform(df['Pos'])

print(df['Pos'].unique())

# We can use the inverse_transform method to get the original values
print(le.inverse_transform(df['Pos'].unique()))


[0 3 1 2]
['DF' 'MF' 'FW' 'GK']
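The codes appear in the order `[0 3 1 2]` because `unique()` returns values in order of first appearance, while `LabelEncoder` numbers the labels in sorted order. A small sketch that makes the mapping explicit through the encoder's `classes_` attribute (the names `le_demo` and `mapping` are only for illustration):

```python
from sklearn.preprocessing import LabelEncoder

le_demo = LabelEncoder()
le_demo.fit(["DF", "MF", "FW", "GK"])

# classes_ holds the labels in sorted order; the position is the integer code
mapping = {str(label): code for code, label in enumerate(le_demo.classes_)}
print(mapping)  # {'DF': 0, 'FW': 1, 'GK': 2, 'MF': 3}

# Round trip: labels -> codes -> labels
codes = le_demo.transform(["GK", "DF"])
print(codes.tolist())                              # [2, 0]
print(le_demo.inverse_transform(codes).tolist())   # ['GK', 'DF']
```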


In [34]:
df.columns

Index(['Rk', 'Player', 'Nation', 'Pos', 'Squad', 'Comp', 'Age', 'Born', 'MP',
       'Starts', 'Min', '90s', 'Gls', 'Ast', 'G+A', 'G-PK', 'PK', 'PKatt',
       'CrdY', 'CrdR', 'xG', 'npxG', 'xAG', 'npxG+xAG', 'PrgC', 'PrgP',
       'PrgR'],
      dtype='object')

In [35]:
# Let's split the data into features and target
X = df[[
    'Gls', 'Ast', 'G+A', 'G-PK', 'PK', 'PKatt', 'CrdY', 'CrdR', 'xG', 'npxG', 'xAG', 'npxG+xAG', 'PrgC', 'PrgP', 'PrgR'
]]

y = df['Pos']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
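Because the classes are imbalanced (goalkeepers are much rarer than defenders), a plain random split can leave the test set with a slightly different class mix than the full data. `train_test_split` accepts a `stratify` argument that keeps the class proportions the same in both parts. The sketch below shows the effect on made-up labels; `y_toy` and `X_toy` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 50% class 0, 40% class 1, 10% class 2
y_toy = np.array([0] * 50 + [1] * 40 + [2] * 10)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# The 20-sample test set keeps the 50/40/10 proportions
print(np.bincount(y_te))  # [10  8  2]
```

On the football data this would amount to adding `stratify=y` to the existing call, which would also change the reported scores slightly.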

In [36]:
# Let's use a Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()

clf.fit(X_train, y_train)

In [37]:
# Let's evaluate the model
from sklearn.metrics import accuracy_score

y_pred = clf.predict(X_test)

accuracy_score(y_test, y_pred)


0.614963503649635
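An accuracy of about 61% is well above the 35.7% majority-class baseline, but it comes from a single train/test split and from a tree created without a `random_state`, so the exact figure can change between runs. Cross-validation gives a more stable estimate. The following is a sketch on synthetic four-class data rather than the football data; the `max_depth=5` setting and the `_demo` names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-class data, used only for illustration
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=5, n_classes=4, random_state=0
)

# Fixing random_state makes the tree reproducible; max_depth limits overfitting
tree_demo = DecisionTreeClassifier(max_depth=5, random_state=42)

# 5-fold cross-validation: five accuracy scores, one per held-out fold
cv_scores = cross_val_score(tree_demo, X_demo, y_demo, cv=5)
print(cv_scores.round(2))
print(f"mean accuracy: {cv_scores.mean():.2f} (+/- {cv_scores.std():.2f})")
```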

In [41]:
# Confusion Matrix
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred)

# The rows represent the actual classes and the columns represent the predicted classes, both in encoded order (0=DF, 1=FW, 2=GK, 3=MF)
# Diagonal entries are correct predictions; every off-diagonal entry counts one kind of mistake (e.g. 57 actual defenders were predicted as midfielders)

array([[121,  11,  12,  57],
       [ 14,  87,   5,  28],
       [  0,   0,  29,   2],
       [ 42,  31,   9, 100]])
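The raw array is easier to read with the position names attached. This sketch copies the matrix from the output above into a labelled DataFrame, recomputes the overall accuracy from the diagonal, and finds the most common mistake. The label order assumes the `LabelEncoder` mapping (sorted labels: DF, FW, GK, MF).

```python
import numpy as np
import pandas as pd

# Confusion matrix copied from the output above
cm = np.array([
    [121, 11, 12, 57],
    [14, 87, 5, 28],
    [0, 0, 29, 2],
    [42, 31, 9, 100],
])
# LabelEncoder sorts labels, so codes 0..3 correspond to these names
labels = ["DF", "FW", "GK", "MF"]

cm_df = pd.DataFrame(
    cm,
    index=[f"actual {name}" for name in labels],
    columns=[f"pred {name}" for name in labels],
)
print(cm_df)

# Accuracy = correct predictions (diagonal) / all predictions
accuracy = np.trace(cm) / cm.sum()
print(round(accuracy, 4))  # 0.615

# Largest off-diagonal cell = most frequent misclassification
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)
i, j = np.unravel_index(off_diag.argmax(), off_diag.shape)
print(labels[i], "->", labels[j], off_diag[i, j])  # DF -> MF 57
```

Most errors are confusions between defenders and midfielders in both directions, while goalkeepers are almost always recognised.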

In [42]:
# Classification Report
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.68      0.60      0.64       201
           1       0.67      0.65      0.66       134
           2       0.53      0.94      0.67        31
           3       0.53      0.55      0.54       182

    accuracy                           0.61       548
   macro avg       0.61      0.68      0.63       548
weighted avg       0.62      0.61      0.61       548
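Each row of the report can be derived from the confusion matrix: precision divides the diagonal by the column totals (everything predicted as that class), recall divides it by the row totals (everything that actually is that class), and F1 is their harmonic mean. The sketch below recomputes the per-class figures from the matrix shown earlier, again assuming the label order DF, FW, GK, MF.

```python
import numpy as np

# Confusion matrix from the earlier output (rows = actual, columns = predicted)
cm = np.array([
    [121, 11, 12, 57],
    [14, 87, 5, 28],
    [0, 0, 29, 2],
    [42, 31, 9, 100],
])
labels = ["DF", "FW", "GK", "MF"]  # codes 0..3 in LabelEncoder order

true_pos = np.diag(cm)
precision = true_pos / cm.sum(axis=0)  # correct / all predicted as the class
recall = true_pos / cm.sum(axis=1)     # correct / all actual members of the class
f1 = 2 * precision * recall / (precision + recall)

for name, p, r, f in zip(labels, precision, recall, f1):
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

The goalkeeper row (code 2) stands out: recall is 0.94 but precision only 0.53, meaning the tree finds nearly all goalkeepers but also labels a number of outfield players as goalkeepers. To print position names instead of codes in the report itself, `classification_report` accepts a `target_names` argument, for example `target_names=le.classes_`.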
