# Visualizing Player Embeddings

With the model trained and predicting the next subevent with reasonable accuracy, the player entity embeddings learned inside it should, in principle, encode a meaningful representation of each player and their playing style. In this notebook we extract these embeddings and check whether they carry that information. First, we fit a simple logistic regression that classifies each player's position (Goalkeeper, Defender, Midfielder, Forward) from their embedding. Then, we visualize the embeddings in a lower-dimensional space and assess similarity by measuring pairwise distances for a few key players.

In [1]:
# This mounts your Google Drive to the Colab VM.
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

# TODO: Enter the foldername in your Drive where you have saved the Wyscout data.
FOLDERNAME = 'real_madrid/'
assert FOLDERNAME is not None, "[!] Enter the foldername."

# Now that we've mounted your Drive, this ensures that
# the Python interpreter of the Colab VM can load
# python files from within it.
import sys
sys.path.append('/content/drive/MyDrive/{}'.format(FOLDERNAME))

%cd drive/MyDrive/real_madrid_new

Mounted at /content/drive
/content/drive/MyDrive/real_madrid_new


In [None]:
import csv
import numpy as np
import pandas as pd
import pickle
import plotly.express as px
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import multilabel_confusion_matrix, classification_report
import tensorflow as tf
import plotly.figure_factory as ff
import matplotlib.pyplot as plt
import matplotlib.cm as cm

In [None]:
#read in only relevant players
with open('players_clean_quantile_02_23_24.csv', newline='') as f:
    reader = csv.reader(f)
    relevant_players = list(reader)[0]
    relevant_players = [int(i) for i in relevant_players]

# 0 indexing of players for categorical encoding
playerID0index = {0:0}
for i in range(len(relevant_players)):
  playerID0index[int(relevant_players[i])] = i+1
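
The mapping built above keeps index 0 for the placeholder "0" player and assigns indices 1..N to real players in file order. A minimal sketch of the same construction with `enumerate` follows; the player IDs are made up for illustration and are not taken from the Wyscout data, and the inverse lookup is an optional extra rather than part of the original pipeline:

```python
# Hypothetical player IDs standing in for the contents of the CSV file.
relevant_players = [3359, 8278, 25413]

# Index 0 stays reserved; real players get 1..N in list order.
player_id_to_index = {0: 0}
player_id_to_index.update(
    {player_id: idx for idx, player_id in enumerate(relevant_players, start=1)}
)

# Inverse lookup, handy when going from an embedding row back to a player ID.
index_to_player_id = {idx: pid for pid, idx in player_id_to_index.items()}

print(player_id_to_index)
```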

In [None]:
players = pd.read_csv('players_clean.csv')
players = players[players['wyId'].isin(relevant_players)]
players['0index'] = players['wyId'].map(playerID0index)
players = players.sort_values(by ='0index')

In [None]:
model = tf.keras.models.load_model('my_model_02_23_24/', compile=False)

### Get the Player Embeddings

In [None]:
def get_embeddings(model, category_size, category_name):
    """ Return category embeddings such that embeddings[i] returns the category
    embedding for the ith object.

    Input:
        model (tf.keras.Model): a trained LSTM model.
        category_size (int): number of classes of this category
        category_name (str): name of the category stored in the model
    Output:
        embeddings (np.array[float,float]): with shape (category_size,
                                            embedding_dim)
    """
    weights = model.get_layer('embedding_{}'.format(category_name)).get_weights()[0]
    dim1, embedding_dim = weights.shape
    embeddings = np.zeros((category_size, embedding_dim))
    for i in range(category_size):
        embeddings[i] = weights[i]
    return embeddings

In [None]:
player_embeddings = get_embeddings(model, len(players)+1, 'player')
# Do not include the embedding for the "0" player since it is meaningless
player_embeddings = player_embeddings[1:]

In [None]:
# should be (num_players, embedding_dim)
player_embeddings.shape

(2277, 20)
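
Because the embedding layer's weight matrix already has one row per category index, the copy loop in `get_embeddings` amounts to a row slice. The sketch below illustrates this with a random NumPy array standing in for the Keras weights; the array, its sizes, and the helper name `slice_embeddings` are assumptions for illustration, and no model is loaded:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for model.get_layer(...).get_weights()[0]:
# row 0 is the reserved "0" player, rows 1..4 are real players.
weights = rng.normal(size=(5, 20))

def slice_embeddings(weights, category_size):
    """Return the first `category_size` rows as a float64 copy."""
    return np.array(weights[:category_size], dtype=float)

embeddings = slice_embeddings(weights, 5)
player_embeddings = embeddings[1:]  # drop the reserved row, as in the cell above

print(player_embeddings.shape)  # (4, 20)
```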

### Classify players by position

Now, train a simple classifier to try to classify players as goalkeepers, defenders, midfielders, or forwards using their player embeddings.

In [None]:
positions_classes = {"Goalkeeper":0, "Defender":1, "Midfielder":2, "Forward":3}
# these are the ground truth values
y = list(players['role.name'].map(positions_classes))

In [None]:
clf = LogisticRegression(random_state=0).fit(player_embeddings, y)
y_pred = clf.predict(player_embeddings)

Evaluate the classifier using a confusion matrix and precision, recall, and F1 score.

In [None]:
def confusion_matrix_scorer(clf, X, y, positions_classes):
    """Create a confusion matrix for all categorical classes"""
    y_pred = clf.predict(X)
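    # One 2x2 matrix per class, laid out as [[TN, FP], [FN, TP]].
    # Named `matrices` to avoid shadowing the `matplotlib.cm` import.
    matrices = multilabel_confusion_matrix(y, y_pred)
    for position, i in positions_classes.items():
        matrix = matrices[i]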
        print("Confusion Matrix for {}:".format(position))
        print("TN: {} \t FP: {}".format(matrix[0][0], matrix[0][1]))
        print("FN: {} \t TP: {}\n".format(matrix[1][0], matrix[1][1]))

In [None]:
confusion_matrix_scorer(clf, player_embeddings, y, positions_classes)

Confusion Matrix for Goalkeeper:
TN: 2128 	 FP: 1
FN: 27 	 TP: 121

Confusion Matrix for Defender:
TN: 1256 	 FP: 193
FN: 139 	 TP: 689

Confusion Matrix for Midfielder:
TN: 1146 	 FP: 290
FN: 228 	 TP: 613

Confusion Matrix for Forward:
TN: 1726 	 FP: 91
FN: 181 	 TP: 279



In [None]:
print(classification_report(y, y_pred, target_names=positions_classes.keys()))

              precision    recall  f1-score   support

  Goalkeeper       0.99      0.82      0.90       148
    Defender       0.78      0.83      0.81       828
  Midfielder       0.68      0.73      0.70       841
     Forward       0.75      0.61      0.67       460

    accuracy                           0.75      2277
   macro avg       0.80      0.75      0.77      2277
weighted avg       0.75      0.75      0.75      2277
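
As a cross-check, the precision, recall, and F1 values in this report can be re-derived from the per-class counts printed by the confusion-matrix cell. The sketch below copies those counts by hand; the helper name `prf` is introduced here for illustration only:

```python
# (TP, FP, FN) per class, copied from the confusion matrices printed earlier.
counts = {
    "Goalkeeper": (121, 1, 27),
    "Defender": (689, 193, 139),
    "Midfielder": (613, 290, 228),
    "Forward": (279, 91, 181),
}

def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, f1

for position, (tp, fp, fn) in counts.items():
    p, r, f1 = prf(tp, fp, fn)
    print(f"{position:<11} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")

# Each player is either a TP or an FN for their true class, so
# TP + FN summed over classes gives the number of players.
total = sum(tp + fn for tp, _, fn in counts.values())
accuracy = sum(tp for tp, _, _ in counts.values()) / total
print(f"accuracy={accuracy:.2f} over {total} players")
```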



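Goalkeepers are classified best (F1 of 0.90), which is consistent with how different their role is from that of outfield players. Overall, a plain logistic regression reaches about 75% accuracy. Note that the classifier is scored on the same players it was trained on, so this is training accuracy rather than a held-out estimate.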
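
Since the classifier is fit and scored on the same players, a held-out estimate may be a useful complement. The sketch below shows one way to get it with stratified cross-validation. It runs on synthetic embeddings and labels (random clusters, not the real `player_embeddings`), and the 5-fold setup and `max_iter=1000` are assumptions rather than part of the original analysis:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in: 4 classes, 100 points each, 20-dimensional.
n_per_class, dim = 100, 20
centers = rng.normal(scale=2.0, size=(4, dim))
X = np.vstack([c + rng.normal(size=(n_per_class, dim)) for c in centers])
y = np.repeat(np.arange(4), n_per_class)

# Stratified folds keep the class proportions similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print("fold accuracies:", np.round(scores, 3))
print("mean accuracy:", round(float(scores.mean()), 3))
```

With the real data, `X` would be `player_embeddings` and `y` the position labels built earlier.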

## Visualize the player embeddings

Using t-SNE to project the player embeddings into 2D and 3D lets us inspect how players relate to one another. If the embeddings capture playing style, we would expect clusters of players with similar positions or tendencies, for instance distinct groups of goalkeepers, defenders, midfielders, and forwards. Such a view could also help identify players with similar playing styles, or players who sit between clusters because they operate across multiple positions.

### 2D visualization

In [None]:
# get the 2 dimensional representation of the embeddings
X_embedded = TSNE(n_components=2, learning_rate='auto', init='random',
                  perplexity=50).fit_transform(player_embeddings)
players['x_2d'] = X_embedded[:,0]
players['y_2d'] = X_embedded[:,1]

In [None]:
# Get the row for each player of interest
ronaldo = players.loc[players['FullName'] == "Cristiano Ronaldo dos Santos Aveiro"]
benzema = players.loc[players['FullName'].str.contains("Benzema")]
modric = players.loc[players['FullName'].str.contains("Luka Modri")]
messi = players.loc[players['FullName'].str.contains("Messi")]
suarez = players.loc[players['FullName'].str.contains("Luis Alberto Su")]
courtois = players.loc[players['FullName'].str.contains("Courtois")]
debruyne = players.loc[players['FullName'].str.contains("De Bruyne")]
ramos = players.loc[players['FullName'].str.contains("Sergio Ramos")]
neymar = players.loc[players['FullName'].str.contains("Neymar")]
vandijk = players.loc[players['FullName'].str.contains("van Dijk")]
griezmann = players.loc[players['FullName'].str.contains("Griezmann")]
bale = players.loc[players['FullName'].str.contains("Gareth Frank Bale")]
lewandowski = players.loc[players['FullName'].str.contains("Lewandowski")]
mbappe = players.loc[players['FullName'].str.contains("Kylian")]
aguero = players.loc[players['FullName'].str.contains("Sergio Leonel")]
kroos = players.loc[players['FullName'].str.contains("Kroos")]
marcelo = players.loc[players['FullName'].str.contains("Marcelo Vieira")]

In [None]:
def add_annotation(fig, rows, labels):
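    """Adds an arrow and label annotation for each player to a 2D plotly
    graph, then displays the figure.

    Input:
        fig (plotly figure): the figure to annotate and show
        rows (df): rows of the data frame with necessary player information,
                   including the x_2d and y_2d columns
        labels (list): strings representing player names for labeling
    """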
    for i in range(len(labels)):
        fig.add_annotation(x=float(rows.iloc[i,:].x_2d),
                           y=float(rows.iloc[i,:].y_2d),
                           text=labels[i],
                           showarrow=True,
                           arrowhead=1,
                           arrowsize=1,
                           arrowwidth=2,
                           ax=-20,
                           ay=-30,
                           font=dict(color="black", size=6),
                           align="right",
                           bordercolor="black",
                           borderwidth=1,
                           borderpad=.5,
                           bgcolor="lemonchiffon")
    fig.show()

In [None]:
def twod_plot(players, rows, names):
    """Plots a 2D scatter plot.

    Input:
        players (df): containing all relevant player information
        rows (list): rows of the data frame with necessary player information
        names (list): strings representing player names for labeling
    """
    fig = px.scatter(players,
                     x='x_2d',
                     y='y_2d',
                     color="role.name",
                     title="2D TSNE Player Embeddings",
                     hover_data=["FullName"])

    fig.update_layout(autosize=False,
                      width=600,
                      height=600)

    add_annotation(fig, rows, names)

In [None]:
rows = pd.concat([ronaldo, benzema, modric, messi, suarez, courtois, debruyne,
                  ramos, neymar, vandijk, griezmann, bale, lewandowski, mbappe],
                 axis=0)
names = ["Ronaldo", "Benzema", "Modric", "Messi", "Suarez", "Courtois",
         "De Bruyne", "Ramos", "Neymar", "van Dijk", "Griezmann", "Bale",
         "Lewandowski", "Mbappé"]
twod_plot(players, rows, names)

### 3D visualization

Similarly, we can plot an interactive 3D visualization of our player embeddings:

In [None]:
X_embedded = TSNE(n_components=3,
                  learning_rate='auto',
                  init='random',
                  perplexity=50).fit_transform(player_embeddings)
players['x_3d'] = X_embedded[:,0]
players['y_3d'] = X_embedded[:,1]
players['z_3d'] = X_embedded[:,2]

In [None]:
# reinitialize rows with the 3d TSNE values
ronaldo = players.loc[players['FullName'] == "Cristiano Ronaldo dos Santos Aveiro"]
benzema = players.loc[players['FullName'].str.contains("Benzema")]
modric = players.loc[players['FullName'].str.contains("Luka Modri")]
messi = players.loc[players['FullName'].str.contains("Messi")]
suarez = players.loc[players['FullName'].str.contains("Luis Alberto Su")]
courtois = players.loc[players['FullName'].str.contains("Courtois")]
debruyne = players.loc[players['FullName'].str.contains("De Bruyne")]
ramos = players.loc[players['FullName'].str.contains("Sergio Ramos")]
neymar = players.loc[players['FullName'].str.contains("Neymar")]
vandijk = players.loc[players['FullName'].str.contains("van Dijk")]
griezmann = players.loc[players['FullName'].str.contains("Griezmann")]
bale = players.loc[players['FullName'].str.contains("Gareth Frank Bale")]
lewandowski = players.loc[players['FullName'].str.contains("Lewandowski")]
mbappe = players.loc[players['FullName'].str.contains("Kylian")]
aguero = players.loc[players['FullName'].str.contains("Sergio Leonel")]
kroos = players.loc[players['FullName'].str.contains("Kroos")]
marcelo = players.loc[players['FullName'].str.contains("Marcelo Vieira")]

rows = pd.concat([ronaldo, benzema, modric, messi, suarez, courtois, debruyne,
                  ramos, neymar, vandijk, griezmann, bale, lewandowski, mbappe],
                  axis=0)

In [None]:
def add_annotation_3d(rows, labels):
    """Adds an arrow and label annotation to a 3D plotly graph.

    Input:
        rows (list): rows of the data frame with necessary player information
        labels (list): strings representing player names for labeling
    """
    dict_list = []
    for i in range(len(labels)):
        dict_list.append(dict(showarrow=True,
                              x=float(rows.iloc[i,:].x_3d),
                              y=float(rows.iloc[i,:].y_3d),
                              z=float(rows.iloc[i,:].z_3d),
                              text=labels[i],
                              arrowhead=1,
                              arrowsize=1,
                              arrowwidth=2,
                              ax=-20,
                              ay=-30,
                              font=dict(color="black", size=6),
                              align="right",
                              bordercolor="black",
                              borderwidth=1,
                              borderpad=.5,
                              bgcolor="lemonchiffon"))
    return dict_list

In [None]:
def threed_plot(players, rows, names):
    """Plots a 3D scatter plot.

    Input:
        players (df): containing all relevant player information
        rows (list): rows of the data frame with necessary player information
        names (list): strings representing player names for labeling
    """
    fig = px.scatter_3d(players, x='x_3d',
                        y='y_3d',
                        z='z_3d',
                        color="role.name",
                        opacity=.5,
                        hover_data=["FullName"])

    dict_list = add_annotation_3d(rows, names)

    fig.update_layout(scene=dict(annotations=dict_list),
                      title="3D TSNE Player Embeddings")

    fig.update_traces(marker_size = 3)

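    # The 3D annotations are already attached through the scene layout above;
    # add_annotation() places 2D annotations (x_2d, y_2d), so just show the plot.
    fig.show()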

In [None]:
threed_plot(players, rows, names)

Visually, players with similar tendencies and positions cluster near one another in the t-SNE plots. This suggests that the model has learned not only player position but also finer distinctions in playing style (e.g., left wingers near other left wingers, full backs near other full backs).

# Calculate Most Similar Players

Now, let's find the numerically closest players to some players of interest.

In [None]:
from scipy.spatial import distance

In [None]:
def find_closest(player, players, embeddings, num_closest, euclidean=True):
    """ Find the num_closest players to the given player.

    Input:
        player (df row): row within players df that represents the player of
                         interest
        players (df): df of all players
        embeddings (arr): all player embeddings
        num_closest (int): number of players closest to the player
        euclidean (bool): if true, use euclidean distance. if false, use L1
                          distance
    Output:
        closest_players (df): the player and num_closest players
    """
    players_with_distance = players.copy()
    players_with_distance['distance'] = np.zeros(len(players_with_distance))
    # `player` is a single-row DataFrame, so take its first value explicitly.
    player_0idx = int(player['0index'].iloc[0])

    #reduce dimensionality of embeddings
    pca = PCA(n_components=3)
    embeddings_lower_dim = pca.fit_transform(embeddings)

    #calculate and store distance between all players
    for idx in range(len(players)):
        other_player = players.iloc[idx,:]
        other_player_0idx = int(other_player['0index'])
        if euclidean:
            dist = distance.euclidean(embeddings_lower_dim[player_0idx-1],
                                      embeddings_lower_dim[other_player_0idx-1])
        else: #L1 distance
            dist = sum(abs(value1 - value2) for value1,
                       value2 in zip(embeddings_lower_dim[player_0idx-1],
                       embeddings_lower_dim[other_player_0idx-1]))
        players_with_distance.loc[players_with_distance['0index'] == other_player_0idx, ['distance']] = dist
    # sort by distance
    players_with_distance = players_with_distance.sort_values(by ='distance')
    closest_players = players_with_distance.iloc[1:num_closest+1,:]
    closest_players = closest_players.drop('Unnamed: 0', axis = 1)
    return closest_players[['wyId', 'role.name', 'birthArea.name', 'FullName', '0index', 'distance']]

In [None]:
find_closest(modric, players, player_embeddings, 5, True)

Unnamed: 0,wyId,role.name,birthArea.name,FullName,0index,distance
485,4969,Midfielder,Spain,Pedro Tanausú Domínguez Placeres,90,0.052044
1732,284315,Midfielder,Croatia,Marko Rog,1101,0.062889
1874,24964,Forward,Italy,Manuel Pucciarelli,1043,0.064307
678,7936,Midfielder,France,Paul Pogba,1419,0.067848
3325,20531,Midfielder,Netherlands,Wesley Sneijder,2103,0.074365


In [None]:
find_closest(benzema, players, player_embeddings, 5, True)

Unnamed: 0,wyId,role.name,birthArea.name,FullName,0index,distance
1499,20475,Midfielder,Chile,Arturo Erasmo Vidal Pardo,122,0.019208
210,3384,Midfielder,Mexico,José Andrés Guardado Hernández,661,0.024056
3361,28122,Forward,France,Yassine Benzia,2071,0.025945
728,8326,Forward,Italy,Mario Barwuah Balotelli,1996,0.02968
2618,61988,Midfielder,Northern Ireland,Steven Davis,1498,0.033329


In [None]:
find_closest(ronaldo, players, player_embeddings, 5, True)

Unnamed: 0,wyId,role.name,birthArea.name,FullName,0index,distance
160,3286,Midfielder,Spain,Daniel Parejo Muñoz,80,0.009604
2018,25942,Midfielder,Belgium,Gilbert Gianelli Imbula Wanga,2201,0.015059
21,118,Forward,Netherlands,Memphis Depay,1902,0.027532
271,3548,Forward,United States,Giuseppe Rossi,1260,0.030425
616,269189,Midfielder,Poland,Bartosz Kapustka,526,0.039701


In [None]:
find_closest(messi, players, player_embeddings, 5, True)

Unnamed: 0,wyId,role.name,birthArea.name,FullName,0index,distance
1843,253784,Forward,Belgium,Baptiste Guillaume,1985,0.066611
1512,20523,Midfielder,Argentina,Ricardo Gabriel Álvarez,1196,0.073746
182,3319,Midfielder,Germany,Mesut Özil,1320,0.074596
422,4498,Forward,Spain,Lucas Vázquez Iglesias,644,0.075671
335,397046,Midfielder,France,Houssem Aouar,2179,0.079747


In [None]:
find_closest(courtois, players, player_embeddings, 5, True)

Unnamed: 0,wyId,role.name,birthArea.name,FullName,0index,distance
451,135747,Goalkeeper,Croatia,Danijel Subašić,1768,0.01278
1229,114617,Defender,Peru,Christian Guillermo Martín Ramos Garagay,1777,0.052911
1891,25382,Goalkeeper,France,Anthony Lopes,1906,0.123711
656,7849,Goalkeeper,Poland,Wojciech Szczęsny,976,0.124038
1957,25632,Goalkeeper,France,Benoît Costil,1925,0.124124


In [None]:
find_closest(ramos, players, player_embeddings, 5, True)

Unnamed: 0,wyId,role.name,birthArea.name,FullName,0index,distance
1013,14778,Defender,Germany,Ömer Toprak,57,0.034007
438,70120,Midfielder,Spain,Francisco Javier García Fernández,671,0.035838
1448,3341,Defender,Spain,Gerard Piqué Bernabéu,532,0.042421
154,3276,Defender,France,Adil Rami,1415,0.049046
1895,25397,Defender,Cameroon,Samuel Yves Umtiti,547,0.052317


In [None]:
find_closest(mbappe, players, player_embeddings, 5, True)

Unnamed: 0,wyId,role.name,birthArea.name,FullName,0index,distance
1809,56424,Forward,United States,Aron Jóhannsson,298,0.040496
2972,8959,Midfielder,England,Junior Stanislas,1603,0.044876
1719,120353,Forward,Egypt,Mohamed Salah Ghaly,1536,0.046069
1633,21177,Forward,Macedonia FYR,Goran Pandev,878,0.051634
2152,354997,Forward,Venezuela,Sergio Duvan Córdova Lezama,266,0.052498


In the similarity analysis above, we measure the Euclidean distance between embeddings after projecting them onto their first three principal components with PCA. We do not compute distances between the full embeddings directly because distances tend to become less discriminative and less stable as dimensionality grows (the curse of dimensionality). The results show that players with similar positions and tendencies are grouped together. For example, the five players closest to Sergio Ramos are Ömer Toprak (center back), Javi García (defensive midfielder and central defender), Gerard Piqué (center back), Adil Rami (center back), and Samuel Yves Umtiti (center back).

Not only is Ramos grouped with defensive players, he is closest to those who play key central positions. This example further suggests that the model captures not only player position but also similarities in player tendencies.
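
The distance computation inside `find_closest` can also be written in vectorized form. The sketch below is a simplified illustration on synthetic embeddings, not a drop-in replacement: `nearest_rows` is a hypothetical helper that works on row positions (row `i` corresponds to `0index` `i + 1` in this notebook's convention) and additionally reports how much variance the three principal components retain, which may help judge how much information the projection discards:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for player_embeddings (rows = players).
fake_embeddings = rng.normal(size=(50, 20))

def nearest_rows(embeddings, query_row, num_closest, n_components=3):
    """Indices and distances of the rows closest to `query_row` in PCA space."""
    pca = PCA(n_components=n_components)
    reduced = pca.fit_transform(embeddings)
    # Euclidean distance from the query row to every row, in one call.
    dists = np.linalg.norm(reduced - reduced[query_row], axis=1)
    order = np.argsort(dists)
    order = order[order != query_row][:num_closest]  # drop the query itself
    return order, dists[order], pca.explained_variance_ratio_.sum()

idx, dist, kept = nearest_rows(fake_embeddings, query_row=0, num_closest=5)
print(idx, np.round(dist, 3))
print(f"variance retained by 3 components: {kept:.2f}")
```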