# [CDAF] Activity 5

## Name and student ID
Name: Luís Felipe Ramos Ferreira
Student ID: 2019022553

## Objectives
- In this activity, I am providing the full VAEP pipeline, implemented for the Wyscout data of the Top 5 leagues.
- For each subheading below, you must explain what was done and which section/subsection/equation of the paper "Actions Speak Louder than Goals: Valuing Actions by Estimating Probabilities" it corresponds to. Justify your answers.
- In addition, some parts of the code are followed by questions that you must answer, possibly by lightly exploring what is already implemented.
- Finally, you must draw a diagram of the flow of functions/tasks across the entire VAEP pipeline below. This diagram must be submitted as a file in the Moodle submission, in addition to this notebook.

## References
- [1] https://tomdecroos.github.io/reports/kdd19_tomd.pdf
- [2] https://socceraction.readthedocs.io/en/latest/api/vaep.html

### Loading the data

In [45]:
import numpy as np
import pandas as pd

In [46]:
def load_matches(path):
    matches = pd.read_json(path_or_buf=path)
    # each match stores its team information as a dictionary in the 'teamsData' column, so we unpack it into one row per team
    team_matches = []
    for i in range(len(matches)):
        match = pd.DataFrame(matches.loc[i, "teamsData"]).T
        match["matchId"] = matches.loc[i, "wyId"]
        team_matches.append(match)
    team_matches = pd.concat(team_matches).reset_index(drop=True)

    return team_matches
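
To make the reshaping in `load_matches` concrete, here is a minimal sketch of the same idea using plain Python dictionaries instead of pandas. The match records are invented for illustration; only the field names `wyId`, `teamsData`, `teamId`, and `side` come from the code in this notebook, and the helper `flatten_team_matches` is hypothetical.

```python
# Toy match records; ids and values are made up for illustration.
toy_matches = [
    {
        "wyId": 1001,
        "teamsData": {
            "10": {"teamId": 10, "side": "home"},
            "20": {"teamId": 20, "side": "away"},
        },
    },
    {
        "wyId": 1002,
        "teamsData": {
            "30": {"teamId": 30, "side": "home"},
            "10": {"teamId": 10, "side": "away"},
        },
    },
]


def flatten_team_matches(matches):
    """Return one row per (match, team), mirroring what load_matches builds."""
    rows = []
    for match in matches:
        for team_info in match["teamsData"].values():
            row = dict(team_info)           # copy the per-team fields
            row["matchId"] = match["wyId"]  # tag the row with its match id
            rows.append(row)
    return rows


team_rows = flatten_team_matches(toy_matches)

# Same lookup that spadl_transform performs later: the home team of a match.
home_team_of_1002 = next(
    r["teamId"] for r in team_rows if r["matchId"] == 1002 and r["side"] == "home"
)
print(len(team_rows), home_team_of_1002)
```

Each match contributes two rows, which is why the home team can later be selected with a filter on `matchId` and `side`.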

In [47]:
def load_players(path):
    players = pd.read_json(path_or_buf=path)
    players["player_name"] = players["firstName"] + ' ' + players["lastName"]
    players = players[["wyId", "player_name"]].rename(columns={"wyId": "player_id"})

    return players

In [48]:
def load_events(path):
    events = pd.read_json(path_or_buf=path)
    # preprocess columns of the events table to make the conversion to SPADL easier
    events = events.rename(columns={
        "id": "event_id",
        "eventId": "type_id",
        "subEventId": "subtype_id",
        "teamId": "team_id",
        "playerId": "player_id",
        "matchId": "game_id"
    })
    events["milliseconds"] = events["eventSec"] * 1000
    events["period_id"] = events["matchPeriod"].replace({"1H": 1, "2H": 2})

    return events

In [49]:
def load_minutes_played_per_game(path):
    minutes = pd.read_json(path_or_buf=path)
    minutes = minutes.rename(columns={
        "playerId": "player_id",
        "matchId": "game_id",
        "teamId": "team_id",
        "minutesPlayed": "minutes_played"
    })
    minutes = minutes.drop(["shortName", "teamName", "red_card"], axis=1)

    return minutes

In [50]:
leagues = ["England", "Spain"]
events = {}
matches = {}
minutes = {}
for league in leagues:
    path = f"../data/atv03/matches/matches_{league}.json"
    matches[league] = load_matches(path)
    path = f"../data/atv03/events/events_{league}.json"
    events[league] = load_events(path)
    path = f"../data/atv03/minutes_played/minutes_played_per_game_{league}.json"
    minutes[league] = load_minutes_played_per_game(path)

In [51]:
path = "../data/atv03/players/players.json"
players = load_players(path)
players["player_name"] = players["player_name"].str.decode("unicode-escape")

### SPADL

In [52]:
from tqdm import tqdm
import socceraction.spadl as spd

In [53]:
def spadl_transform(events, matches):
    spadl = []
    game_ids = events.game_id.unique().tolist()
    for g in tqdm(game_ids):
        match_events = events.loc[events.game_id == g]
        match_home_id = matches.loc[(matches.matchId == g) & (matches.side == "home"), "teamId"].values[0]
        match_actions = spd.wyscout.convert_to_actions(events=match_events, home_team_id=match_home_id)
        match_actions = spd.play_left_to_right(actions=match_actions, home_team_id=match_home_id)
        match_actions = spd.add_names(match_actions)
        spadl.append(match_actions)
    spadl = pd.concat(spadl).reset_index(drop=True)

    return spadl
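
The `play_left_to_right` step re-expresses the away team's coordinates so that every action reads as if the acting team were attacking from left to right. A minimal sketch of that idea follows, assuming SPADL's 105 x 68 metre pitch and a simplified action record. The helper `mirror_away_actions` and the toy actions are hypothetical and are not part of socceraction.

```python
FIELD_LENGTH = 105.0  # SPADL pitch length in metres
FIELD_WIDTH = 68.0    # SPADL pitch width in metres


def mirror_away_actions(actions, home_team_id):
    """Flip the coordinates of every action not performed by the home team."""
    mirrored = []
    for action in actions:
        flipped = dict(action)
        if action["team_id"] != home_team_id:
            flipped["start_x"] = FIELD_LENGTH - action["start_x"]
            flipped["end_x"] = FIELD_LENGTH - action["end_x"]
            flipped["start_y"] = FIELD_WIDTH - action["start_y"]
            flipped["end_y"] = FIELD_WIDTH - action["end_y"]
        mirrored.append(flipped)
    return mirrored


# Toy actions: team 1 is the home team, team 2 the away team.
toy_actions = [
    {"team_id": 1, "start_x": 80.0, "start_y": 30.0, "end_x": 95.0, "end_y": 34.0},
    {"team_id": 2, "start_x": 25.0, "start_y": 38.0, "end_x": 10.0, "end_y": 34.0},
]
ltr_actions = mirror_away_actions(toy_actions, home_team_id=1)
print(ltr_actions[1])
```

After mirroring, a large `end_x` always means "close to the opponent's goal", regardless of which team acted, so a single model can be trained on both teams' actions.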

In [54]:
spadl = {}
# The conversion below is disabled; the precomputed SPADL files are loaded in the next cell instead.
# for league in leagues:
#     spadl[league] = spadl_transform(events=events[league], matches=matches[league])

In [55]:
spadl["England"] = pd.read_json("../data/atv05/spadl_England.json")
spadl["Spain"] = pd.read_json("../data/atv05/spadl_Spain.json")

### Features

In [56]:
from socceraction.vaep import features as ft

In [57]:
def features_transform(spadl):
    spadl.loc[spadl.result_id.isin([2, 3]), ["result_id"]] = 0
    spadl.loc[spadl.result_name.isin(["offside", "owngoal"]), ["result_name"]] = "fail"

    xfns = [
        ft.actiontype_onehot,
        ft.bodypart_onehot,
        ft.result_onehot,
        ft.goalscore,
        ft.startlocation,
        ft.endlocation,
        ft.team,
        ft.time,
        ft.time_delta
    ]

    features = []
    for game in tqdm(np.unique(spadl.game_id).tolist()):
        match_actions = spadl.loc[spadl.game_id == game].reset_index(drop=True)
        match_states = ft.gamestates(actions=match_actions)
        match_feats = pd.concat([fn(match_states) for fn in xfns], axis=1)
        features.append(match_feats)
    features = pd.concat(features).reset_index(drop=True)

    return features

1- What do the first and second lines of the function above do? What is your hypothesis about the purpose of these transformations? How do you think this could affect the final model?

In [58]:
features = {}
for league in ["England", "Spain"]:
    features[league] = features_transform(spadl[league])

100%|██████████| 380/380 [00:06<00:00, 54.82it/s]
100%|██████████| 380/380 [00:06<00:00, 55.32it/s]
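
The feature functions in `features_transform` operate on game states rather than on single actions: each action is paired with the actions that immediately preceded it. The sketch below illustrates that pairing on a toy list. The helper `toy_gamestates` is hypothetical; the default of three actions per state follows socceraction's `gamestates`, while padding the start of the sequence with the first action is an assumption of this sketch.

```python
def toy_gamestates(actions, nb_prev_actions=3):
    """Pair each action with the actions that preceded it.

    Returns a list of length nb_prev_actions. Element k holds, for every
    position i, the action k steps before i. Positions without enough
    history are padded with the first action (an assumption of this sketch).
    """
    states = []
    for k in range(nb_prev_actions):
        shifted = [actions[max(i - k, 0)] for i in range(len(actions))]
        states.append(shifted)
    return states


toy_actions = ["pass", "dribble", "cross", "shot"]
a0, a1, a2 = toy_gamestates(toy_actions)
print(a0)  # the actions themselves
print(a1)  # one step back
print(a2)  # two steps back
```

Every feature function is then applied to each of the shifted copies, so the feature matrix describes the current action and its recent context in a single row.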


### Labels

In [59]:
import socceraction.vaep.labels as lab

In [60]:
def labels_transform(spadl):
    yfns = [lab.scores, lab.concedes]

    labels = []
    for game in tqdm(np.unique(spadl.game_id).tolist()):
        match_actions = spadl.loc[spadl.game_id == game].reset_index(drop=True)
        labels.append(pd.concat([fn(actions=match_actions) for fn in yfns], axis=1))

    labels = pd.concat(labels).reset_index(drop=True)

    return labels
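
The two label functions look ahead in the action sequence: `scores` marks an action as positive when the acting team scores within a short window of upcoming actions, and `concedes` does the same for goals by the opponent. The sketch below reproduces that logic on toy data. The helper `toy_labels` and the `goal` flag are hypothetical; the window of 10 actions is socceraction's default, shortened to 3 here so the example stays readable, and own goals are ignored for simplicity.

```python
def toy_labels(actions, nr_actions=10):
    """Label each action with (scores, concedes) over a look-ahead window.

    The window covers the action itself and the following nr_actions - 1
    actions. Own goals are ignored to keep the sketch short.
    """
    labels = []
    for i, action in enumerate(actions):
        window = actions[i : i + nr_actions]
        scores = any(a["goal"] and a["team_id"] == action["team_id"] for a in window)
        concedes = any(a["goal"] and a["team_id"] != action["team_id"] for a in window)
        labels.append((scores, concedes))
    return labels


toy_actions = [
    {"team_id": 1, "goal": False},  # pass by team 1
    {"team_id": 2, "goal": False},  # interception by team 2
    {"team_id": 1, "goal": False},  # team 1 regains the ball
    {"team_id": 1, "goal": True},   # team 1 scores
    {"team_id": 2, "goal": False},  # kick-off by team 2
]
labels = toy_labels(toy_actions, nr_actions=3)
for action, label in zip(toy_actions, labels):
    print(action["team_id"], label)
```

Note that one goal produces several positive labels, one for each action in the window leading up to it.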

In [61]:
labels = {}
for league in ["England", "Spain"]:
    labels[league] = labels_transform(spadl[league])

100%|██████████| 380/380 [00:09<00:00, 41.42it/s]
100%|██████████| 380/380 [00:09<00:00, 41.93it/s]


In [62]:
labels["England"]["scores"].sum()

7553

In [63]:
labels["England"]["concedes"].sum()

2313

2- Explain why the number of positive `scores` labels is much larger than the number of positive `concedes` labels. How do you think this could affect the final model?

### Training Model

In [64]:
import xgboost as xgb
import sklearn.metrics as mt

In [65]:
def train_vaep(X_train, y_train, X_test, y_test):
    models = {}
    for m in ["scores", "concedes"]:
        models[m] = xgb.XGBClassifier(random_state=0, n_estimators=50, max_depth=3)

        print("training " + m + " model")
        models[m].fit(X_train, y_train[m])

        p = sum(y_train[m]) / len(y_train[m])
        base = [p] * len(y_train[m])
        y_train_pred = models[m].predict_proba(X_train)[:, 1]
        train_brier = mt.brier_score_loss(y_train[m], y_train_pred) / mt.brier_score_loss(y_train[m], base)
        print(m + " Train NBS: " + str(train_brier))
        print()

        p = sum(y_test[m]) / len(y_test[m])
        base = [p] * len(y_test[m])
        y_test_pred = models[m].predict_proba(X_test)[:, 1]
        test_brier = mt.brier_score_loss(y_test[m], y_test_pred) / mt.brier_score_loss(y_test[m], base)
        print(m + " Test NBS: " + str(test_brier))
        print()

        print("----------------------------------------")

    return models

In [66]:
models = train_vaep(X_train=features["England"], y_train=labels["England"], X_test=features["Spain"], y_test=labels["Spain"])

training scores model
scores Train NBS: 0.8452154331687597

scores Test NBS: 0.850366923253325

----------------------------------------
training concedes model
concedes Train NBS: 0.964463215550682

concedes Test NBS: 0.9745272575372074

----------------------------------------
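
The NBS values printed by `train_vaep` are normalized Brier scores: the model's Brier score divided by the Brier score of a baseline that always predicts the positive-class frequency. A minimal standard-library sketch of that ratio follows; the function names and the toy predictions are illustrative and not taken from the notebook.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def normalized_brier_score(y_true, y_prob):
    """Brier score relative to a constant prediction of the base rate."""
    base_rate = sum(y_true) / len(y_true)
    baseline = [base_rate] * len(y_true)
    return brier_score(y_true, y_prob) / brier_score(y_true, baseline)


y_true = [0, 0, 0, 1]
perfect = [0.0, 0.0, 0.0, 1.0]       # always right and fully confident
constant = [0.25, 0.25, 0.25, 0.25]  # identical to the baseline
informed = [0.1, 0.1, 0.1, 0.7]      # better than the baseline, not perfect

nbs_perfect = normalized_brier_score(y_true, perfect)
nbs_constant = normalized_brier_score(y_true, constant)
nbs_informed = normalized_brier_score(y_true, informed)
print(nbs_perfect, nbs_constant, nbs_informed)
```

Under this definition 0 is a perfect score and 1 means the model is no better than predicting the base rate, so lower is better.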


3- Why do we train two different models? Why does their performance differ?

### Predictions

In [67]:
def generate_predictions(features, models):
    preds = {}
    for m in ["scores", "concedes"]:
        preds[m] = models[m].predict_proba(features)[:, 1]
    preds = pd.DataFrame(preds)

    return preds

In [68]:
preds = {}
preds["England"] = generate_predictions(features=features["England"], models=models)
preds

{'England':           scores  concedes
 0       0.002992  0.000412
 1       0.003928  0.000329
 2       0.002779  0.000345
 3       0.002234  0.000298
 4       0.005827  0.000308
 ...          ...       ...
 482896  0.076417  0.001592
 482897  0.023226  0.003552
 482898  0.005620  0.068251
 482899  0.082877  0.003011
 482900  0.034658  0.003071
 
 [482901 rows x 2 columns]}

### Action Values

In [69]:
import socceraction.vaep.formula as fm

In [70]:
def calculate_action_values(spadl, predictions):
    action_values = fm.value(actions=spadl, Pscores=predictions["scores"], Pconcedes=predictions["concedes"])
    action_values = pd.concat([
        spadl[["original_event_id", "player_id", "action_id", "game_id", "start_x", "start_y", "end_x", "end_y", "type_name", "result_name"]],
        predictions.rename(columns={"scores": "Pscores", "concedes": "Pconcedes"}),
        action_values
    ], axis=1)

    return action_values

In [71]:
action_values = {}
action_values["England"] = calculate_action_values(spadl=spadl["England"], predictions=preds["England"])
action_values["England"]

| | original_event_id | player_id | action_id | game_id | start_x | start_y | end_x | end_y | type_name | result_name | Pscores | Pconcedes | offensive_value | defensive_value | vaep_value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 177959171.0 | 25413 | 0 | 2499719 | 51.45 | 34.68 | 32.55 | 14.96 | pass | success | 0.002992 | 0.000412 | 0.000000 | -0.000000 | 0.000000 |
| 1 | 177959172.0 | 370224 | 1 | 2499719 | 32.55 | 14.96 | 53.55 | 17.00 | pass | success | 0.003928 | 0.000329 | 0.000935 | 0.000083 | 0.001018 |
| 2 | 177959173.0 | 3319 | 2 | 2499719 | 53.55 | 17.00 | 36.75 | 19.72 | pass | success | 0.002779 | 0.000345 | -0.001149 | -0.000016 | -0.001164 |
| 3 | 177959174.0 | 120339 | 3 | 2499719 | 36.75 | 19.72 | 43.05 | 3.40 | pass | success | 0.002234 | 0.000298 | -0.000545 | 0.000047 | -0.000498 |
| 4 | 177959175.0 | 167145 | 4 | 2499719 | 43.05 | 3.40 | 75.60 | 8.16 | pass | success | 0.005827 | 0.000308 | 0.003593 | -0.000010 | 0.003583 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 482896 | 251596226.0 | 20620 | 1139 | 2500098 | 55.65 | 7.48 | 103.95 | 19.04 | pass | success | 0.076417 | 0.001592 | 0.066197 | 0.000381 | 0.066578 |
| 482897 | 251596229.0 | 14703 | 1140 | 2500098 | 103.95 | 19.04 | 103.95 | 19.04 | cross | fail | 0.023226 | 0.003552 | -0.053191 | -0.001960 | -0.055151 |
| 482898 | 251596408.0 | 8239 | 1141 | 2500098 | 2.10 | 46.92 | 0.00 | 46.24 | interception | success | 0.005620 | 0.068251 | 0.002068 | -0.045026 | -0.042958 |
| 482899 | 251596232.0 | 70965 | 1142 | 2500098 | 105.00 | 0.00 | 92.40 | 36.04 | corner_crossed | success | 0.082877 | 0.003011 | 0.036377 | -0.003011 | 0.033366 |
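
The value columns in this output can be reproduced by hand from the two probability columns. The sketch below is a simplified reading of the computation: each action is credited with the change in scoring probability and the negated change in conceding probability relative to the previous action, with the two probabilities swapped when possession changes. The helper `toy_vaep_values` is hypothetical, the team labels "A" and "B" are inferred from the numbers rather than shown in the output, and special cases that the library may treat differently (for example set pieces or actions right after a goal) are left out.

```python
def toy_vaep_values(team_ids, p_scores, p_concedes):
    """Return (offensive, defensive, total) values for each action."""
    values = []
    for i in range(len(team_ids)):
        j = max(i - 1, 0)  # the first action is compared with itself
        if team_ids[j] == team_ids[i]:
            prev_scores, prev_concedes = p_scores[j], p_concedes[j]
        else:
            # Possession changed: the previous team's chance of conceding
            # is the current team's chance of scoring, and vice versa.
            prev_scores, prev_concedes = p_concedes[j], p_scores[j]
        offensive = p_scores[i] - prev_scores
        defensive = -(p_concedes[i] - prev_concedes)
        values.append((offensive, defensive, offensive + defensive))
    return values


# Probabilities copied from rows 482896 to 482898 of the output above.
team_ids = ["A", "A", "B"]
p_scores = [0.076417, 0.023226, 0.005620]
p_concedes = [0.001592, 0.003552, 0.068251]

values = toy_vaep_values(team_ids, p_scores, p_concedes)
for offensive, defensive, total in values[1:]:
    print(round(offensive, 6), round(defensive, 6), round(total, 6))
```

Up to rounding of the displayed probabilities, the last two results agree with the `offensive_value`, `defensive_value`, and `vaep_value` columns of the failed cross and the interception. The first entry is zero only because this toy sequence has no earlier action to compare against.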


4- Explore the actions with Pscores >= 0.95. Why is their value so high? Compare them with actions of the same type and the opposite result. Does the model learn that this combination of action type and result is directly related to the target variable y that we are trying to predict?

5- Which formula in the paper corresponds to the 'offensive_value' column of the action_values dataframe? And to the 'defensive_value' column?

### Player Ratings

In [72]:
def calculate_minutes_per_season(minutes_per_game):
    minutes_per_season = minutes_per_game.groupby("player_id", as_index=False)["minutes_played"].sum()

    return minutes_per_season

In [73]:
minutes_per_season = {}
minutes_per_season["England"] = calculate_minutes_per_season(minutes["England"])
minutes_per_season["England"]

| | player_id | minutes_played |
|---|---|---|
| 0 | 36 | 1238 |
| 1 | 38 | 382 |
| 2 | 48 | 3343 |
| 3 | 54 | 3348 |
| 4 | 56 | 266 |
| ... | ... | ... |
| 510 | 448708 | 21 |
| 511 | 450826 | 35 |
| 512 | 486252 | 649 |
| 513 | 531655 | 28 |


In [74]:
def calculate_player_ratings(action_values, minutes_per_season, players):
    player_ratings = action_values.groupby(by="player_id", as_index=False).agg({"vaep_value": "sum"}).rename(columns={"vaep_value": "vaep_total"})
    player_ratings = player_ratings.merge(minutes_per_season, on=["player_id"], how="left")
    player_ratings["vaep_p90"] = player_ratings["vaep_total"] / player_ratings["minutes_played"] * 90
    player_ratings = player_ratings[player_ratings["minutes_played"] >= 600].sort_values(by="vaep_p90", ascending=False).reset_index(drop=True)
    player_ratings = player_ratings.merge(players, on=["player_id"], how="left")
    player_ratings = player_ratings[["player_id", "player_name", "minutes_played", "vaep_total", "vaep_p90"]]

    return player_ratings

In [75]:
player_ratings = {}
player_ratings["England"] = calculate_player_ratings(action_values=action_values["England"], minutes_per_season=minutes_per_season["England"], players=players)
player_ratings["England"]

| | player_id | player_name | minutes_played | vaep_total | vaep_p90 |
|---|---|---|---|---|---|
| 0 | 120353 | Mohamed Salah Ghaly | 2995.0 | 28.516333 | 0.856918 |
| 1 | 3802 | Philippe Coutinho Correia | 1134.0 | 8.896437 | 0.706066 |
| 2 | 8325 | Sergio Leonel Agüero del Castillo | 2038.0 | 14.206033 | 0.627352 |
| 3 | 8717 | Harry Kane | 3201.0 | 20.985924 | 0.590045 |
| 4 | 25867 | Pierre-Emerick Aubameyang | 1098.0 | 7.095187 | 0.581573 |
| ... | ... | ... | ... | ... | ... |
| 366 | 8391 | Julián Speroni | 1041.0 | 0.371623 | 0.032129 |
| 367 | 8242 | Shane Duffy | 3411.0 | 1.149650 | 0.030334 |
| 368 | 8903 | Troy Deeney | 1940.0 | 0.609467 | 0.028274 |
| 369 | 38031 | Christian Benteke Liolo | 2344.0 | -0.403160 | -0.015480 |
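
The `vaep_p90` column rescales each player's summed action values to a 90-minute basis and keeps only players with at least 600 minutes. The sketch below repeats that arithmetic with plain dictionaries. The totals and minutes for players 120353 and 8717 are copied from the output above; player 999999 is a made-up entry added to show the minutes filter, and the helper names are hypothetical.

```python
def vaep_per_90(vaep_total, minutes_played):
    """Scale a season total to a per-90-minutes rate."""
    return vaep_total / minutes_played * 90


def rank_players(totals, minutes, min_minutes=600):
    """Rank players by VAEP per 90, dropping those with too few minutes."""
    ratings = []
    for player_id, total in totals.items():
        played = minutes.get(player_id)
        if played is None or played < min_minutes:
            continue  # unknown or insufficient playing time
        ratings.append((player_id, vaep_per_90(total, played)))
    return sorted(ratings, key=lambda rating: rating[1], reverse=True)


totals = {8717: 20.985924, 120353: 28.516333, 999999: 1.5}
minutes = {8717: 3201, 120353: 2995, 999999: 120}

ranking = rank_players(totals, minutes)
for player_id, rating in ranking:
    print(player_id, round(rating, 6))
```

The minutes threshold matters because a per-90 rate computed from very little playing time is unstable: the made-up player would otherwise top this ranking with 1.5 / 120 * 90 = 1.125.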


6- Do you think the Top 5 of the list is representative? Compare this VAEP ranking with the xT ranking from Activity 4. Which one do you think is more representative?