### Graph Neural Networks 

In [3]:
import pandas as pd

# base URL of the raw CSV files (pandas reads URLs directly, so wget is not needed)
base_url = "https://raw.githubusercontent.com/batuhan-demirci/fifa21_dataset/master/data/"

# loading data
player_df = pd.read_csv(base_url + "tbl_player.csv")
skill_df = pd.read_csv(base_url + "tbl_player_skill.csv")
team_df = pd.read_csv(base_url + "tbl_team.csv")

# extract subsets
player_df = player_df[["int_player_id", "str_player_name", "str_positions", "int_overall_rating", "int_team_id"]]
skill_df = skill_df[["int_player_id", "int_long_passing", "int_ball_control", "int_dribbling"]]
team_df = team_df[["int_team_id", "str_team_name", "int_overall"]]

# merging data
player_df = player_df.merge(skill_df, on='int_player_id')
fifa_df = player_df.merge(team_df, on='int_team_id')

# sorting the dataframe
fifa_df = fifa_df.sort_values(by="int_overall_rating", ascending=False)
print("Players: ", fifa_df.shape[0])
fifa_df.head()

Players:  18767


Unnamed: 0,int_player_id,str_player_name,str_positions,int_overall_rating,int_team_id,int_long_passing,int_ball_control,int_dribbling,str_team_name,int_overall
0,1,Lionel Andrés Messi Cuccittini,"RW, ST, CF",93,5.0,91,96,96,FC Barcelona,84
33,2,Cristiano Ronaldo dos Santos Aveiro,"ST, LW",92,6.0,77,92,88,Juventus,83
57,3,Jan Oblak,GK,91,8.0,40,30,12,Atlético Madrid,83
121,5,Neymar da Silva Santos Júnior,"LW, CAM",91,7.0,81,95,95,Paris Saint-Germain,83
89,4,Kevin De Bruyne,"CAM, CM",91,2.0,93,92,88,Manchester City,85


Let's first identify the graph-specific things we need:

1. Nodes - Football players (by ID)
2. Edges - If they play for the same team (see explanation below)
3. Node Features - The football player's position, specialities, ball control, etc.
4. Labels - The football player's overall rating (node-level regression task)
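
To make these four ingredients concrete, here is a minimal sketch of a toy graph in plain Python. All values are made up for illustration and are not taken from the FIFA dataset.

```python
# Toy graph with three players; every value below is hypothetical.
node_ids = [0, 1, 2]  # 1. nodes: one index per player

# 2. edges: players 0 and 1 share a team, player 2 has no teammate.
#    Row 0 holds the source nodes, row 1 the target nodes.
edge_index = [[0, 1],
              [1, 0]]

# 3. node features: one row per node,
#    e.g. [long passing, ball control, dribbling]
x = [[80, 85, 83],
     [70, 78, 75],
     [40, 30, 12]]

# 4. labels: one regression target (overall rating) per node
y = [[84], [77], [69]]

# every node needs exactly one feature row and one label
assert len(x) == len(y) == len(node_ids)
```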

Nodes are usually straightforward to identify; here we even have IDs. If your data has no unique identifier, create one, because every edge must refer unambiguously to the two nodes it connects.

The most challenging task is typically deciding how to link these nodes through edges. Here we connect players who belong to the same team. The assumption is that team membership has a relational effect on a player's rating, so the graph could be used to predict the expected rating when a player switches to a new team or when a new player is observed. Of course, there are many other ways to define the edges, such as:

1. How many times two players played together (edge weight) --> Synergies
2. How many times a player has won/lost 1:1 duels (edge weight)
3. Started their career in the same football club
4. Temporal edges: "Played together in the last 2 weeks"
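
As a rough sketch of the first alternative, weighted edges could be derived by counting how often two players appeared in the same lineup. The `lineups` data below is hypothetical; the dataset used in this notebook contains no match information.

```python
from collections import Counter
from itertools import combinations

# Hypothetical lineups: each inner list holds the node indices of players
# who appeared together in one match (made-up data for illustration).
lineups = [
    [0, 1, 2],
    [0, 1],
    [1, 2, 3],
]

# Count how often each unordered pair of players shared a lineup.
pair_counts = Counter()
for lineup in lineups:
    for a, b in combinations(sorted(lineup), 2):
        pair_counts[(a, b)] += 1

# Split into a COO-style edge list plus one weight per edge.
pairs = sorted(pair_counts)
edge_index = [[a for a, _ in pairs], [b for _, b in pairs]]
edge_weight = [pair_counts[p] for p in pairs]

print(edge_index)   # [[0, 0, 1, 1, 2], [1, 2, 2, 3, 3]]
print(edge_weight)  # [2, 1, 2, 1, 1]
```

The weight list has the same length and order as the columns of the edge list, which is how per-edge attributes are usually aligned with a COO edge index.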


In [5]:
# checking for duplicate nodes
max(fifa_df["int_player_id"].value_counts())

1

Each football player ID occurs only once in our dataset.

In [6]:
# sorting to define the order of nodes
sorted_df = fifa_df.sort_values(by="int_player_id")
# selecting node features (copy so that we do not modify a view of sorted_df)
node_features = sorted_df[["str_positions", "int_long_passing", "int_ball_control", "int_dribbling"]].copy()
# converting non-numeric columns: keep only the first listed position
node_features["first_position"] = node_features["str_positions"].str.split(",").str[0].str.strip()
# one-hot encoding (dtype=int yields 0/1 columns instead of booleans)
node_features = pd.concat([node_features, pd.get_dummies(node_features["first_position"], dtype=int)], axis=1)
node_features = node_features.drop(columns=["str_positions", "first_position"])
node_features.head()

Unnamed: 0,int_long_passing,int_ball_control,int_dribbling,CAM,CB,CDM,CF,CM,GK,LB,LM,LW,LWB,RB,RM,RW,RWB,ST
0,91,96,96,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
33,77,92,88,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
57,40,30,12,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
89,93,92,88,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
121,81,95,95,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0


The number of nodes and their ordering are implicitly defined by the shape and row order of this feature matrix. Each row corresponds to one node in our final graph.
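
Because the row position acts as the node index, any original identifier has to be translated into that position before it can be used in an edge. A minimal sketch with made-up IDs (the names `player_ids` and `id_to_index` are only illustrative):

```python
# Hypothetical player IDs with gaps, in arbitrary order.
player_ids = [17, 3, 42, 8]

# Sorting fixes the node order; the position in the sorted list is the node index.
id_to_index = {pid: idx for idx, pid in enumerate(sorted(player_ids))}

# An edge given in terms of original IDs ...
edge_by_id = (42, 3)
# ... is translated into row positions of the feature matrix.
edge_by_index = tuple(id_to_index[pid] for pid in edge_by_id)

print(id_to_index)    # {3: 0, 8: 1, 17: 2, 42: 3}
print(edge_by_index)  # (3, 0)
```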

In [7]:
# converting to numpy
x = node_features.to_numpy()
x.shape # [num_nodes x num_features]

(18767, 18)

In [8]:
# sorting to define the order of nodes (same order as the node features)
sorted_df = fifa_df.sort_values(by="int_player_id")
# selecting labels: the player's overall rating
labels = sorted_df[["int_overall_rating"]]
labels.head()

Unnamed: 0,int_overall_rating
0,93
33,92
57,91
89,91
121,91


In [9]:
# convert to numpy
y = labels.to_numpy()
y.shape # [num_nodes, 1] --> node regression

(18767, 1)

In [10]:
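# remapping player IDs to consecutive node indices (0..N-1),
# in the same order as the rows of the feature matrix x and the labels y
fifa_df = fifa_df.sort_values(by="int_player_id").reset_index(drop=True)
fifa_df["int_player_id"] = fifa_df.index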

In [11]:
# how many players we need to connect
fifa_df["str_team_name"].value_counts()

Everton                   36
Valencia CF               34
FC Nantes                 34
Villarreal CF             34
Real Valladolid CF        34
                          ..
Wellington Phoenix        19
Central Coast Mariners    19
Melbourne Victory         19
Brisbane Roar             19
Adelaide United           19
Name: str_team_name, Length: 681, dtype: int64

We now need to build all pairs of players within one team, which corresponds to a fully connected graph within each team subgroup. We use the column int_player_id as indices for the edges. If, for example, the edge index contains [0, 1], the first and second nodes (in the order of the previously defined node feature matrix) are connected.
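
As a small illustration of what this means for a single team, the sketch below builds the pairs for a hypothetical four-player team. A team of n players yields n·(n−1)/2 pairs, so the largest team in this dataset (36 players) contributes 630 pairs.

```python
from itertools import combinations

# Hypothetical node indices of the players of one team.
team = [0, 7, 32, 45]

# Every unordered pair of teammates becomes one edge.
pairs = list(combinations(team, 2))
sources = [s for s, _ in pairs]
targets = [t for _, t in pairs]

print(pairs)    # [(0, 7), (0, 32), (0, 45), (7, 32), (7, 45), (32, 45)]
print(sources)  # [0, 0, 0, 7, 7, 32]
print(targets)  # [7, 32, 45, 32, 45, 45]

# Number of pairs in a fully connected team of n players.
n = len(team)
assert len(pairs) == n * (n - 1) // 2
```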

In [12]:
import itertools
import numpy as np

teams = fifa_df["str_team_name"].unique()
team_edge_list = []
for team in teams:
    team_df = fifa_df[fifa_df["str_team_name"] == team]
    players = team_df["int_player_id"].values
    # building all unordered pairs, as all players of a team are connected
    pairs = list(itertools.combinations(players, 2))
    # shape [num_pairs, 2]: source in column 0, target in column 1
    team_edges = np.array(pairs, dtype=np.int64).reshape(-1, 2)
    team_edge_list.append(team_edges)
# stacking once is cheaper than growing the array inside the loop
all_edges = np.vstack(team_edge_list)
# converting to PyTorch Geometric format
edge_index = all_edges.transpose()
edge_index # [2, num_edges]

array([[    0,     0,     0, ..., 18704, 18704, 18719],
       [    7,    32,    45, ..., 18719, 18751, 18751]], dtype=int64)

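The result is a list of source/target edge pairs. You can model directed or undirected edges by including just one or both directions of each pair. Note that itertools.combinations yields every pair only once, so the edge index above contains a single direction per pair; for an undirected graph, the reversed pairs have to be added as well. This COO (coordinate) format is usually chosen because, for sparse graphs, it is more memory-efficient than a dense N×N adjacency matrix.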
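
A minimal sketch of how the reverse direction could be added to a one-directional COO edge list, written in plain Python for illustration. The helper name `add_reverse_edges` is made up for this example.

```python
def add_reverse_edges(edge_index):
    """Return a COO edge list that contains every edge in both directions."""
    sources, targets = edge_index
    # A set removes duplicates in case some reverse edges already exist.
    pairs = set(zip(sources, targets)) | set(zip(targets, sources))
    ordered = sorted(pairs)
    return [[s for s, _ in ordered], [t for _, t in ordered]]


# Fully connected toy team of three nodes, one direction per pair.
one_way = [[0, 0, 1],
           [1, 2, 2]]
both_ways = add_reverse_edges(one_way)
print(both_ways)  # [[0, 0, 1, 1, 2, 2], [1, 2, 0, 2, 0, 1]]
```

With PyTorch Geometric installed, `torch_geometric.utils.to_undirected` serves the same purpose on tensors.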

In [13]:
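import torch
from torch_geometric.data import Data
# PyTorch Geometric expects tensors: float features/labels, int64 edge indices
data = Data(x=torch.tensor(x.astype("float32")), edge_index=torch.tensor(edge_index, dtype=torch.long), y=torch.tensor(y, dtype=torch.float))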

This data object represents one single graph.
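
Before training on such a graph, it can help to verify that the pieces fit together. The following is a sketch of a few consistency checks in plain Python; `check_graph` is a hypothetical helper, not part of PyTorch Geometric.

```python
def check_graph(num_nodes, num_labels, edge_index):
    """Return a list of problems; an empty list means the inputs are consistent."""
    problems = []
    if num_labels != num_nodes:
        problems.append("x and y must have one row per node")
    if len(edge_index) != 2 or len(edge_index[0]) != len(edge_index[1]):
        problems.append("edge_index must have shape [2, num_edges]")
    else:
        for row in edge_index:
            if any(i < 0 or i >= num_nodes for i in row):
                problems.append("edge_index refers to a node that does not exist")
                break
    return problems


# A consistent toy graph passes without findings.
print(check_graph(3, 3, [[0, 1], [1, 2]]))  # []
# Node index 5 does not exist in a graph with 3 nodes.
print(check_graph(3, 3, [[0], [5]]))
```

For the graph built above, the corresponding call would be `check_graph(x.shape[0], y.shape[0], edge_index.tolist())`.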

In [14]:
from torch_geometric.loader import DataLoader
# a dataset may hold many graphs; here the list contains only our single graph
data_list = [data]
loader = DataLoader(data_list, batch_size=32)

In [15]:
# shows the Data class; evaluate `data` (lowercase) to print the graph summary
Data

torch_geometric.data.data.Data