## Topological Analysis of Premier League Players

One of the more frequent use cases for topological analysis is identifying related clusters and subgroups within a complex, high-dimensional dataset. Taking inspiration from [this paper](https://www.nature.com/articles/srep01236), I'll make use of the mapper algorithm to try to identify different archetypes of Premier League players.

The dataset has been pulled from the Fantasy Premier League website and contains averages per 90 minutes for each player across a number of measures for last season. Some pre-processing and cleaning has already been performed on this data; in particular, matches were only included if the player was involved for more than 25 minutes.

I'm going to use the [kepler mapper](https://kepler-mapper.scikit-tda.org/) library for performing the analysis, along with [plotly](https://plot.ly/python/) to keep the network plots interactive. So, let's get started by importing all the packages needed and loading the data set.

### Data and code setup

In [1]:
import pandas as pd
import numpy as np
import kmapper as km
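import sklearn.cluster   # submodules are not guaranteed to load with a bare `import sklearn`
import sklearn.manifold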
from plotly.offline import init_notebook_mode, iplot
import igraph as ig

np.random.seed(1234)
init_notebook_mode(connected=True)

In [2]:
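# 'ANSI' is only recognised as a codec name on Windows; cp1252 is assumed to be the equivalent here
df = pd.read_csv("PlayerAverages.csv", encoding='cp1252')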
df.columns

Index(['Identifier', 'Points', 'Goals', 'Assists', 'Bonus', 'YellowCards',
       'RedCards', 'Crosses', 'BigChancesCreated',
       'ClearancesBlocksIntercepts', 'Recoveries', 'KeyPasses', 'Tackles',
       'AttemptedPasses', 'PassesCompleted', 'BigChancesMissed',
       'ErrorsToGoal', 'ErrorsToGoalAttempt', 'Tackled', 'Offside', 'Fouls',
       'Dribbles', 'PassAccuracy'],
      dtype='object')

Looking at the above list we have a wide range of stats that we can use for grouping each player. Since the data was taken from the fantasy football game we also have the average points that each player received per game. This will serve as a proxy for how effective that player is, but for grouping the players together I'd rather focus just on their in-game performance, so we'll drop it out for now.

Admittedly this isn't perfect - defenders should be judged on, well, defence, yet goalscoring defenders will have higher points scores than others. Still, it will give a useful, high-level way to compare the different clusters that mapper will generate.

In [3]:
%%capture
# Drop fantasy related columns - we only want to cluster players based on their in-game performance
X = df.drop(columns=['Points', 'Bonus'])

# Extract the list of player names and then drop from the data
names = X['Identifier'].values
X = X.drop(columns='Identifier')  # reassign rather than modify in place, which avoids pandas' SettingWithCopyWarning

# Get the overall averages for each stat for comparison later
means = np.mean(X.values, axis=0)
std_dev = np.std(X.values, axis=0)

Now that the data is ready I'm going to apply the mapper algorithm and create a network graph. The first step will be to use t-SNE to project the dataset into 2 dimensions, then look at a covering of the pullback to generate the graph. 

For a nice overview of what the mapper algorithm is doing, one of the kepler mapper contributors has a clear explanation [here](https://sauln.github.io/blog/mapper-intro/). For anyone looking to really get into the maths behind this, Gunnar Carlsson's [paper](http://www.ayasdi.com/wp-content/uploads/2015/02/Topology_and_Data.pdf) is the ideal place to start.
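
To make the steps a bit more concrete, here's a toy sketch of the node-building half of mapper in plain Python. It is not how kepler mapper is implemented: the 1-D lens values, the hand-picked intervals and the gap-based `cluster_by_gap` helper are all invented for illustration, standing in for t-SNE, the cube cover and KMeans respectively.

```python
# Toy sketch of mapper's node-building step (illustrative only, not kepler mapper's code).
# Assumptions: a 1-D lens, hand-picked overlapping intervals, and a naive
# gap-based clusterer standing in for KMeans.

def cluster_by_gap(indices, values, max_gap):
    """Split indices into clusters wherever consecutive sorted values differ by more than max_gap."""
    ordered = sorted(indices, key=lambda i: values[i])
    clusters = []
    for i in ordered:
        if clusters and values[i] - values[clusters[-1][-1]] <= max_gap:
            clusters[-1].append(i)
        else:
            clusters.append([i])
    return clusters


def build_nodes(lens, data, intervals, max_gap):
    """Pull each interval back to the points whose lens value lies in it, then cluster those points."""
    nodes = {}
    for c, (lo, hi) in enumerate(intervals):
        members = [i for i, v in enumerate(lens) if lo <= v <= hi]
        for k, cluster in enumerate(cluster_by_gap(members, data, max_gap)):
            nodes[f"cube{c}_cluster{k}"] = sorted(cluster)
    return nodes


# Hypothetical data: one original feature per point and a 1-D lens value per point
data = [0.1, 0.2, 5.0, 5.1, 0.3, 5.2]
lens = [0.0, 1.0, 1.5, 2.5, 2.0, 3.0]
intervals = [(0.0, 2.0), (1.5, 3.0)]  # the two intervals overlap on [1.5, 2.0]

nodes = build_nodes(lens, data, intervals, max_gap=1.0)
for name, members in nodes.items():
    print(name, members)
```

Points whose lens value falls in the overlap are clustered once per interval, so they can end up in more than one node. That shared membership is what mapper uses to connect nodes.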

In [4]:
# Initialise mapper and create lens using TSNE
mapper = km.KeplerMapper(verbose=0)
lens = mapper.fit_transform(X.values, projection=sklearn.manifold.TSNE(), scaler=None)

# Create the graph of the nerve of the corresponding pullback
graph = mapper.map(lens, X.values, clusterer=sklearn.cluster.KMeans(n_clusters=2, random_state=1234),
                   nr_cubes=20, overlap_perc=0.5)

Kepler Mapper allows for plotting in HTML using the D3 JavaScript library and generates a nice interactive webpage. In order to allow for interactive plots within this notebook I'll use plotly instead (there is an update in progress to allow this natively within kepler mapper, but as of writing it's not finished yet). Let's first define some functions that will be used for extracting information about each node and then replicating the network plot using igraph and plotly.

This is pretty lengthy for a notebook code block but will make it easy later on to regenerate the plots with different parameters.

In [5]:
def get_cluster_summary(player_list, average_mean, average_std, dataset, columns):
    # Compare players against the average and list the attributes that are above and below the average

    cluster_mean = np.mean(dataset.iloc[player_list].values, axis=0)
    diff = cluster_mean - average_mean
    std_m = np.abs(diff) / average_std  # absolute deviation in units of the overall std

    stats = sorted(zip(columns, cluster_mean, average_mean, diff, std_m), key=lambda x: x[4], reverse=True)
    above_stats = [a[0] + ': ' + f'{a[1]:.2f}' for a in stats if a[3] > 0]
    below_stats = [a[0] + ': ' + f'{a[1]:.2f}' for a in stats if a[3] < 0]
    below_stats.reverse()

    # Create a string summary for the tooltips
    cluster_summary = 'Above Mean:<br>' + '<br>'.join(above_stats[:5]) + \
                      '<br><br>Below Mean:<br>' + '<br>'.join(below_stats[-5:])

    return cluster_summary

def make_igraph_plot(graph, data, X, player_names, layout, mean_list, std_dev_list, title, line_color='rgb(200,200,200)'):
    # Extract node information for the plot
    div = '<br>-------<br>'
    node_list = []
    cluster_sizes = []
    avg_points = []
    tooltip = []
    for node in graph['nodes']:
        node_list.append(node)
        players = graph['nodes'][node]
        cluster_sizes.append(2 * int(np.log(len(players) + 1) + 1))
        avg_points.append(np.average([data.iloc[i]['Points'] for i in players]))
        node_info = node + div + '<br>'.join([player_names[i] for i in players]) + div + \
                    get_cluster_summary(players, mean_list, std_dev_list, X, X.columns)
        tooltip.append(node_info)

    # Add the edges to a list for passing into iGraph:
    edge_list = []
    for node in graph['links']:
        for nbr in graph['links'][node]:
            # Need to base everything on indices for igraph
            edge_list.append((node_list.index(node), node_list.index(nbr)))

    # Make the igraph plot
    g = ig.Graph(len(node_list))
    g.add_edges(edge_list)

    links = g.get_edgelist()
    plot_layout = g.layout(layout)

    n = len(plot_layout)
    x_nodes = [plot_layout[k][0] for k in range(n)]  # x-coordinates of nodes
    y_nodes = [plot_layout[k][1] for k in range(n)]  # y-coordinates of nodes

    x_edges = []
    y_edges = []
    for e in links:
        x_edges.extend([plot_layout[e[0]][0], plot_layout[e[1]][0], None])
        y_edges.extend([plot_layout[e[0]][1], plot_layout[e[1]][1], None])

    edges_trace = dict(type='scatter', x=x_edges, y=y_edges, mode='lines', line=dict(color=line_color, width=0.3),
                       hoverinfo='none')

    nodes_trace = dict(type='scatter', x=x_nodes, y=y_nodes, mode='markers', opacity=0.8,
                       marker=dict(symbol='circle', colorscale='Viridis', showscale=True, reversescale=False,
                                   color=avg_points, size=cluster_sizes,
                                   line=dict(color=line_color, width=0.3),
                                   colorbar=dict(thickness=20, ticklen=4)),
                       text=tooltip, hoverinfo='text')

    axis = dict(showline=False, zeroline=False, showgrid=False, showticklabels=False, title='')

    layout = dict(title=title, font=dict(size=12), showlegend=False, autosize=False, width=700, height=700,
                  xaxis=dict(axis), yaxis=dict(axis), hovermode='closest', plot_bgcolor='rgba(20,20,20, 0.8)')

    iplot(dict(data=[edges_trace, nodes_trace], layout=layout))

### Player clustering using mapper

Now that the functions we need are in place we're finally in a position to do some analysis. Let's use them to produce the graph output from mapper.

In [6]:
make_igraph_plot(graph, df, X, names, 'kk', means, std_dev, 'Player data - resolution=20')

Each node represents a group of players with some characteristics in common. A single player can appear in multiple nodes, and this shared membership is the criterion for linking two nodes with an edge.
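
As a small illustration of that linking rule, the sketch below builds edges from node membership lists. The node names and player indices are invented, and `shared_member_edges` is a hypothetical helper rather than anything taken from kepler mapper.

```python
from itertools import combinations


def shared_member_edges(nodes):
    """Return (node_a, node_b, shared_members) for every pair of nodes with a member in common."""
    edges = []
    for a, b in combinations(sorted(nodes), 2):
        shared = sorted(set(nodes[a]) & set(nodes[b]))
        if shared:
            edges.append((a, b, shared))
    return edges


# Hypothetical nodes: each maps a node id to the row indices of the players it contains
nodes = {
    "node_a": [0, 1, 2],
    "node_b": [2, 3],
    "node_c": [4, 5],
    "node_d": [3, 5],
}

edges = shared_member_edges(nodes)
for a, b, shared in edges:
    print(f"{a} -- {b} (shared players: {shared})")
```

Nodes with no players in common, such as `node_a` and `node_c` here, stay unconnected, which is why the plot breaks up into separate components.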

Note that it's possible to zoom and pan on the above plot. Hovering over any node will show a tooltip with the list of players in the node along with the stats where they sit furthest above and below the average, in other words the characteristics that these players share and the reason they're grouped together.
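
The tooltip ranking boils down to measuring how far a node's mean sits from the overall mean, in units of the overall standard deviation. Here's a rough standalone sketch of that idea; the stat names come from the dataset but the numbers are made up, and `rank_stats` is a hypothetical helper rather than the function used above.

```python
from statistics import mean, pstdev

# Hypothetical per-90 values for six players (invented numbers, not the real data)
league = {
    "PassesCompleted": [20.0, 30.0, 40.0, 50.0, 60.0, 70.0],
    "Tackles": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    "Dribbles": [2.0, 1.0, 2.0, 1.0, 0.0, 0.0],
}
cluster_rows = [4, 5]  # row indices of the players in one node


def rank_stats(league, rows):
    """Rank stats by |cluster mean - overall mean| / overall std, largest first."""
    ranked = []
    for stat, values in league.items():
        overall_mean = mean(values)
        overall_std = pstdev(values)  # population std, matching np.std's default
        cluster_mean = mean(values[i] for i in rows)
        diff = cluster_mean - overall_mean
        ranked.append((stat, cluster_mean, diff, abs(diff) / overall_std))
    return sorted(ranked, key=lambda r: r[3], reverse=True)


ranking = rank_stats(league, cluster_rows)
for stat, cluster_mean, diff, score in ranking:
    direction = "above" if diff > 0 else "below"
    print(f"{stat}: {cluster_mean:.2f} ({direction} the mean, {score:.2f} std)")
```

Dividing by the standard deviation matters because the stats live on very different scales: a gap of 20 completed passes and a gap of 1 dribble can be comparably unusual.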

So, what exactly is being conveyed in this plot? Let's start with the top left, where there is a block of 3 linked diamond shapes. The clusters in here represent the "pass masters" - looking at each node we can see that completed passes and pass accuracy are both above the average. We can also see a nice transition across the spectrum of high-passing players. At the bottom we have clusters representing 'ball-winning passers' - a mixture of defenders who can pass well (like Robertson and Mertesacker), high-pressing creators like Eriksen and destroyer midfielders like Wanyama and Kanté. They recycle possession regularly, so the combination of high passing and high recovery fits their role well.

As we move towards the tip of that cluster (the pentagram shape at the top) we have the pure creators. These are the players whose high passing is for the purpose of making goal-scoring chances. Right at the top we have the players who embody this the most - De Bruyne, Ozil and Fabregas. Looking at the colours of the nodes we can also see the transition from darker at the bottom to yellow at the top - these creators are the ones who score the most points from a fantasy football perspective.

Speaking of nodes with a high points score, two linked nodes at the bottom really stand out as being the brightest. Hazard and Sanchez stand out alone as being high chance creators but through individual skill (i.e. dribbling) rather than incisive passing. Interestingly the cluster to the right of these two also contains chance creating dribblers (Zaha, Lookman, Bolasie). They have a similar playing style to Sanchez and Hazard but are far less effective at it.

To give a quick summary of the rest of the graph, the bottom contains the cluster of goalkeepers. The large block on the lower right is primarily defenders / defensive midfielders - from high blocking passers to pure tacklers to dribbling crossers. The top right holds the attackers, while the largest cluster in the middle has a little bit of everything. There are still some interesting sub-groups within this though - by hovering across different parts of the graph you can use the tooltips to understand the criteria that caused each group of players to be grouped together.

### Increasing the resolution

One of the really useful things about mapper is the ability to customise the scale at which you're looking at the data. Intuitively this is like zooming in on the data - by looking at it with a higher resolution some of those larger clusters could be teased apart into distinct groups.

Technically, this is done by increasing the number of cubes that are used to partition the projected points. I'll increase this from 20 to 30 (and also raise the overlap between neighbouring cubes from 0.5 to 0.7) and see what impact it has on the graph.
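
For a rough sense of scale, here's a back-of-the-envelope sketch of what the change in resolution does to the cover. It assumes the cube count applies to each of the 2 lens dimensions and that a cube's base width is simply the lens range divided by that count; the lens range of 100 is made up, and `cover_summary` is a hypothetical helper. How the overlap then widens each cube depends on the library's cover implementation, so it's left out here.

```python
def cover_summary(span, n_cubes, lens_dims=2, clusters_per_cube=2):
    """Return (base width per cube, total cubes, upper bound on nodes).

    Assumes n_cubes intervals per lens dimension and at most
    clusters_per_cube clusters found inside each cube.
    """
    base_width = span / n_cubes
    total_cubes = n_cubes ** lens_dims
    max_nodes = total_cubes * clusters_per_cube
    return base_width, total_cubes, max_nodes


span = 100.0  # hypothetical range of one lens dimension

for n_cubes in (20, 30):
    width, cubes, max_nodes = cover_summary(span, n_cubes)
    print(f"{n_cubes} per dimension: base width {width:.2f}, {cubes} cubes, at most {max_nodes} nodes")
```

Most cubes will be empty or too sparse to cluster, so the real node count is far below the upper bound, but the trend is the point: smaller cubes mean each clustering step sees a narrower slice of the projection.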

In [7]:
graph = mapper.map(lens, X.values, clusterer=sklearn.cluster.KMeans(n_clusters=2, random_state=1234),
                   nr_cubes=30, overlap_perc=0.7)

make_igraph_plot(graph, df, X, names, 'kk', means, std_dev, 'Player data - resolution=30')

There's still one large cluster, and in addition the large cluster of forwards has remained largely intact. There are far more small clusters now though, dominating the right-hand side of the graph.

Let's take goalkeepers as an example. At the top right of the graph there is a cluster containing a lot of goalkeepers. Looking at each node, one thing they have in common is that their pass accuracy is below the mean in each case - it's at about 50% for most of them. Considering keepers tend to boot the ball down the pitch when they get it, this seems pretty fair. Those goalkeepers whose pass accuracy is around the overall average (which is probably high for a goalkeeper) are now in their own cluster, just below centre on the right. Naturally this contains the keepers from teams who like to pass out from the back rather than play long - Lloris, Vorm, Ederson, Mignolet and Karius. The fact that Vorm and Lloris are grouped closely, as are Karius and Mignolet, shows that this is more an artefact of the team's play style than something purely individual.

It's also worth pointing out again the beacon of yellow that catches the eye in the bottom right. The highest scoring node contains Aguero and Salah, whose play style seems to be all about taking risks. They both dribble a lot yet get tackled a lot, and they miss a lot of big chances yet still create and score plenty. What's also interesting is the other players who are grouped with them - Deulofeu, Ibe and Bojan - players whose talent and potential have been touted before but who would not normally be associated with a Salah or Aguero. It really highlights how fine the margins are at the top - start missing a few more chances and player of the year Salah could have been Jordan Ibe...