# [CDAF] Activity 5

## Name and student ID
Name: Júlio Guerra Domingues
Student ID: 2022431280

Name: Leandro Luiz Duarte Teixeira
Student ID: 2024006099

## Introduction
In this notebook, we implement:
- Loading the data in SPADL format
- An Expected Threat (xT) model
- A VAEP model with a complete pipeline

## Data
https://figshare.com/collections/Soccer_match_event_dataset/4415000

### Loading the data

In [1]:
import numpy as np
import pandas as pd

In [3]:
# prompt: import a zip from a link and extract it into the root folder

import requests
import zipfile
import io
import os
import IPython

def download_and_extract_zip(url, extract_to='.'):
    """Downloads a zip file from a URL and extracts it to a specified directory.

    Args:
        url: The URL of the zip file.
        extract_to: The directory to extract the zip file to. Defaults to the current directory.
    """
    try:
        response = requests.get(url, stream=True)
        response.raise_for_status()  # Raise an exception for non-200 status codes

        with zipfile.ZipFile(io.BytesIO(response.content)) as zip_ref:
            zip_ref.extractall(extract_to)
        print(f"Successfully downloaded and extracted zip file from {url} to {extract_to}")
    except requests.exceptions.RequestException as e:
        print(f"Error downloading zip file: {e}")
    except zipfile.BadZipFile as e:
        print(f"Error extracting zip file: {e}")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")

zip_events = 'https://figshare.com/ndownloader/files/14464685'
zip_other =
download_and_extract_zip(zip_events)
download_and_extract_zip(zip_other)

Successfully downloaded and extracted zip file from https://github.com/juliogdomingues/cdaf_ufmg/raw/refs/heads/main/At3/all_files_World_Cup.zip to .


In [4]:
def load_matches(path):
    matches = pd.read_json(path_or_buf=path)
    # the team information for each match is stored as a dictionary in the 'teamsData' column, so we split it out into one row per team
    team_matches = []
    for i in range(len(matches)):
        match = pd.DataFrame(matches.loc[i, 'teamsData']).T
        match['matchId'] = matches.loc[i, 'wyId']
        team_matches.append(match)
    team_matches = pd.concat(team_matches).reset_index(drop=True)

    return team_matches

In [5]:
def load_players(path):
    players = pd.read_json(path_or_buf=path)
    players['player_name'] = players['firstName'] + ' ' + players['lastName']
    players = players[['wyId', 'player_name']].rename(columns={'wyId': 'player_id'})

    return players

In [6]:
def load_events(path):
    events = pd.read_json(path_or_buf=path)
    # pre-process columns of the events table to ease the conversion to SPADL
    events = events.rename(columns={
        'id': 'event_id',
        'eventId': 'type_id',
        'subEventId': 'subtype_id',
        'teamId': 'team_id',
        'playerId': 'player_id',
        'matchId': 'game_id'
    })
    events['milliseconds'] = events['eventSec'] * 1000
    events['period_id'] = events['matchPeriod'].replace({'1H': 1, '2H': 2})

    return events

In [7]:
def load_minutes_played_per_game(path):
    minutes = pd.read_json(path_or_buf=path)
    minutes = minutes.rename(columns={
        'playerId': 'player_id',
        'matchId': 'game_id',
        'teamId': 'team_id',
        'minutesPlayed': 'minutes_played'
    })
    minutes = minutes.drop(['shortName', 'teamName', 'red_card'], axis=1)

    return minutes

In [10]:
leagues = ['England', 'Spain']
events = {}
matches = {}
minutes = {}
for league in leagues:
    path = r'matches_{}.json'.format(league)
    matches[league] = load_matches(path)
    path = r'events_{}.json'.format(league)
    events[league] = load_events(path)
    path = r'minutes_played_per_game_{}.json'.format(league)
    minutes[league] = load_minutes_played_per_game(path)

FileNotFoundError: File minutes_played_per_game_World_Cup.json does not exist

In [None]:
path = r'players.json'
players = load_players(path)
players['player_name'] = players['player_name'].str.decode('unicode-escape')

### SPADL

In [None]:
from tqdm import tqdm
import socceraction.spadl as spd

In [None]:
def spadl_transform(events, matches):
    spadl = []
    game_ids = events.game_id.unique().tolist()
    for g in tqdm(game_ids):
        match_events = events.loc[events.game_id == g]
        match_home_id = matches.loc[(matches.matchId == g) & (matches.side == 'home'), 'teamId'].values[0]
        match_actions = spd.wyscout.convert_to_actions(events=match_events, home_team_id=match_home_id)
        match_actions = spd.play_left_to_right(actions=match_actions, home_team_id=match_home_id)
        match_actions = spd.add_names(match_actions)
        spadl.append(match_actions)
    spadl = pd.concat(spadl).reset_index(drop=True)

    return spadl

In [None]:
spadl = {}
for league in leagues:
    spadl[league] = spadl_transform(events=events[league], matches=matches[league])
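
To make the direction-of-play normalization in `spadl_transform` concrete, the sketch below mirrors the coordinates of away-team actions so that every team attacks from left to right. It is a simplified, pure-Python stand-in for what `spd.play_left_to_right` is assumed to do on a 105 x 68 m SPADL pitch; the function name `mirror_away_actions` and the toy dictionaries are hypothetical and not part of the library.

```python
FIELD_LENGTH, FIELD_WIDTH = 105.0, 68.0  # assumed SPADL pitch size in meters


def mirror_away_actions(actions, home_team_id):
    """Return a copy of the actions with away-team coordinates mirrored."""
    mirrored = []
    for action in actions:
        action = dict(action)  # copy, so the input list is not modified
        if action["team_id"] != home_team_id:
            for key in ("start_x", "end_x"):
                action[key] = FIELD_LENGTH - action[key]
            for key in ("start_y", "end_y"):
                action[key] = FIELD_WIDTH - action[key]
        mirrored.append(action)
    return mirrored


# Two identical toy actions: one by the home team (id 1), one by the away team (id 2).
toy_actions = [
    {"team_id": 1, "start_x": 30.0, "start_y": 20.0, "end_x": 50.0, "end_y": 34.0},
    {"team_id": 2, "start_x": 30.0, "start_y": 20.0, "end_x": 50.0, "end_y": 34.0},
]
flipped = mirror_away_actions(toy_actions, home_team_id=1)
print(flipped[1])
```

Only the away-team action changes; the home-team action keeps its original coordinates.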

## Part I
- We will implement an xT model using the Socceraction library, referenced below

## References
- [1] https://socceraction.readthedocs.io/en/latest/api/generated/socceraction.xthreat.ExpectedThreat.html#socceraction.xthreat.ExpectedThreat
- [2] https://socceraction.readthedocs.io/en/latest/api/generated/socceraction.xthreat.get_successful_move_actions.html#socceraction.xthreat.get_successful_move_actions
- [3] https://socceraction.readthedocs.io/en/latest/documentation/valuing_actions/xT.html

### Question 1

- Instantiate an ExpectedThreat object [1] with parameters l=25 and w=16.
- Fit the ExpectedThreat model on the dataframe "spadl".
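
As background for what fitting an xT model estimates, the sketch below iterates the usual xT recursion, xT(i) = s(i) * g(i) + m(i) * sum_j T(i, j) * xT(j), on a toy pitch with two zones instead of the 25 x 16 grid. The probabilities are made up for illustration, and `solve_xt` is a hypothetical helper, not the library's implementation.

```python
def solve_xt(shoot_p, goal_p, move_p, transition, n_iter=50):
    """Iterate xT = s*g + m * sum_j T[i][j] * xT[j], starting from zeros."""
    n = len(shoot_p)
    xt = [0.0] * n
    for _ in range(n_iter):
        xt = [
            shoot_p[i] * goal_p[i]
            + move_p[i] * sum(transition[i][j] * xt[j] for j in range(n))
            for i in range(n)
        ]
    return xt


# Zone 0 is far from goal (players always move), zone 1 is close (players always shoot).
shoot_p = [0.0, 1.0]   # probability of shooting from each zone
goal_p = [0.0, 0.2]    # probability that a shot from each zone scores
move_p = [1.0, 0.0]    # probability of moving the ball from each zone
# transition[i][j]: probability that a move from zone i successfully ends in zone j.
# Row 0 sums to 0.5, meaning half of the moves from zone 0 fail.
transition = [[0.0, 0.5], [0.0, 0.0]]

xt_values = solve_xt(shoot_p, goal_p, move_p, transition)
print(xt_values)
```

Zone 1 is worth its direct scoring chance, and zone 0 inherits half of that value through successful moves into zone 1.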

In [None]:
from socceraction import xthreat as xt

In [None]:
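xt_object = xt.ExpectedThreat(l=25, w=16)
# `spadl` is a dict of per-league DataFrames, so concatenate them before fitting
spadl_all = pd.concat(list(spadl.values()), ignore_index=True)
model = xt_object.fit(spadl_all)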

### Question 2
- Create a dataframe "prog_actions" from the dataframe "spadl", containing only the progression actions that are successful [3].
- Use the rate method of the ExpectedThreat object to compute the value of each progression action in "prog_actions", in a column named "action_value".
- Group "prog_actions" by "player_name" and report the sum of "action_value".
- Report the 10 players with the highest "action_value".
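
The steps in this question can be sketched on toy data without the library. The example below rates a successful move as the grid value at its end location minus the value at its start location, then sums the ratings per player. It is a simplified illustration: the 2 x 3 grid values are invented, `cell_index` and `rate_move` are hypothetical helpers, and the sketch ignores details such as interpolation or the library's grid orientation. With socceraction, the corresponding calls would be `get_successful_move_actions` [2] and the model's `rate` method [1].

```python
from collections import defaultdict

FIELD_LENGTH, FIELD_WIDTH = 105.0, 68.0  # assumed SPADL pitch size in meters


def cell_index(x, y, l, w):
    """Map pitch coordinates to (row, col) in a w-by-l grid, clipping to the edges."""
    col = min(max(int(x / FIELD_LENGTH * l), 0), l - 1)
    row = min(max(int(y / FIELD_WIDTH * w), 0), w - 1)
    return row, col


def rate_move(grid, start, end):
    """Value of a successful move: grid value at the end minus value at the start."""
    l, w = len(grid[0]), len(grid)
    r0, c0 = cell_index(start[0], start[1], l, w)
    r1, c1 = cell_index(end[0], end[1], l, w)
    return grid[r1][c1] - grid[r0][c0]


# Toy grid with w=2 rows and l=3 columns; values grow toward the opponent's goal.
toy_grid = [[0.01, 0.05, 0.20],
            [0.01, 0.05, 0.20]]

# (player_name, (start_x, start_y), (end_x, end_y)) for successful moves only.
moves = [
    ("Player A", (10.0, 10.0), (50.0, 10.0)),   # column 0 -> 1
    ("Player A", (50.0, 40.0), (100.0, 40.0)),  # column 1 -> 2
    ("Player B", (100.0, 10.0), (50.0, 10.0)),  # column 2 -> 1 (backward move)
]

totals = defaultdict(float)
for player, start, end in moves:
    totals[player] += rate_move(toy_grid, start, end)

# Sort players by total action value, highest first, and keep the top 10.
ranking = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:10]
print(ranking)
```

Backward moves receive negative values in this scheme, which is why Player B ends up below Player A.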

## Part II
- In this activity, the entire VAEP pipeline [1] is already implemented for the Wyscout data of the Top 5 leagues.
- [2] is the documentation of the VAEP functions in the socceraction API.
- [3] explains the framework with a mix of intuition, mathematics, and code.
- [4] contains public notebooks that implement VAEP for another dataset.

## References
- [1] https://tomdecroos.github.io/reports/kdd19_tomd.pdf
- [2] https://socceraction.readthedocs.io/en/latest/api/vaep.html
- [3] https://socceraction.readthedocs.io/en/latest/documentation/valuing_actions/vaep.html
- [4] https://github.com/ML-KULeuven/socceraction/tree/master/public-notebooks

## Instructions
- For each header of the notebook below, you must explain what was done and which section/subsection/equation of the paper "Actions Speak Louder than Goals: Valuing Actions by Estimating Probabilities" it corresponds to. Justify your answers.
- In addition, some parts of the code are followed by questions that you must answer, possibly with some light exploration of what is already implemented.
- Finally, you must build a diagram of the flow of functions/tasks across the whole VAEP pipeline below. This diagram must be submitted as a separate file in the Moodle submission, in addition to this notebook.

### Features

In [None]:
from socceraction.vaep import features as ft

In [None]:
def features_transform(spadl):
    spadl.loc[spadl.result_id.isin([2, 3]), ['result_id']] = 0
    spadl.loc[spadl.result_name.isin(['offside', 'owngoal']), ['result_name']] = 'fail'

    xfns = [
        ft.actiontype_onehot,
        ft.bodypart_onehot,
        ft.result_onehot,
        ft.goalscore,
        ft.startlocation,
        ft.endlocation,
        ft.team,
        ft.time,
        ft.time_delta
    ]

    features = []
    for game in tqdm(np.unique(spadl.game_id).tolist()):
        match_actions = spadl.loc[spadl.game_id == game].reset_index(drop=True)
        match_states = ft.gamestates(actions=match_actions)
        match_feats = pd.concat([fn(match_states) for fn in xfns], axis=1)
        features.append(match_feats)
    features = pd.concat(features).reset_index(drop=True)

    return features

1- What do the first and second lines of the function above do? What is your hypothesis about the purpose of these transformations? How do you think they might affect the final model?
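
As a starting point for this question, the sketch below reproduces the remapping on a plain list of result ids. It assumes the SPADL result ids follow the order `fail, success, offside, owngoal, yellow_card, red_card`; `collapse_result` is a hypothetical helper. Because `features_transform` assigns through `spadl.loc`, the change is applied to the DataFrame that is passed in, so later cells that reuse `spadl[league]` see the collapsed results as well.

```python
# Assumed order of SPADL result ids (index = result_id).
RESULT_NAMES = ["fail", "success", "offside", "owngoal", "yellow_card", "red_card"]


def collapse_result(result_id):
    """Map offside (2) and owngoal (3) to fail (0); leave other ids unchanged."""
    return 0 if result_id in (2, 3) else result_id


original = [0, 1, 2, 3, 4, 5]
collapsed = [collapse_result(r) for r in original]
print([RESULT_NAMES[r] for r in collapsed])
```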

In [None]:
features = {}
for league in ['England', 'Spain']:
    features[league] = features_transform(spadl[league])

### Labels

In [None]:
import socceraction.vaep.labels as lab

In [None]:
def labels_transform(spadl):
    yfns = [lab.scores, lab.concedes]

    labels = []
    for game in tqdm(np.unique(spadl.game_id).tolist()):
        match_actions = spadl.loc[spadl.game_id == game].reset_index(drop=True)
        labels.append(pd.concat([fn(actions=match_actions) for fn in yfns], axis=1))

    labels = pd.concat(labels).reset_index(drop=True)

    return labels

In [None]:
labels = {}
for league in ['England', 'Spain']:
    labels[league] = labels_transform(spadl[league])

In [None]:
labels['England']['scores'].sum()

In [None]:
labels['England']['concedes'].sum()

2- Explain why the number of positive labels of type scores is much larger than that of concedes. How do you think this might affect the final model?
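
To explore this question, it helps to see how the two labels are built. The sketch below is a simplified version of the look-ahead labelling described in the paper: an action gets a positive `scores` label if its team scores within the next `k` actions, and a positive `concedes` label if the opponent does. The helper `label_scores_concedes`, the toy sequence, and the small window `k=3` are illustrative assumptions, and own goals are ignored.

```python
def label_scores_concedes(actions, k=10):
    """Label each action by looking at itself and the following k-1 actions.

    `actions` is a list of dicts with keys 'team' and 'goal'
    ('goal' is True when the action is a shot that scores).
    """
    scores, concedes = [], []
    for i, action in enumerate(actions):
        window = actions[i:i + k]
        scores.append(any(a["goal"] and a["team"] == action["team"] for a in window))
        concedes.append(any(a["goal"] and a["team"] != action["team"] for a in window))
    return scores, concedes


# Toy sequence: the home team (H) builds up, loses the ball once to the
# away team (A), recovers it, and scores with the last action.
toy_actions = [
    {"team": "H", "goal": False},
    {"team": "H", "goal": False},
    {"team": "A", "goal": False},
    {"team": "H", "goal": False},
    {"team": "H", "goal": True},
]
scores, concedes = label_scores_concedes(toy_actions, k=3)
print(scores)
print(concedes)
```

In this toy sequence two actions receive a positive `scores` label and only one receives a positive `concedes` label, because most actions right before a goal belong to the team that scores.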

### Training Model

In [None]:
import xgboost as xgb
import sklearn.metrics as mt

In [None]:
def train_vaep(X_train, y_train, X_test, y_test):
    models = {}
    for m in ['scores', 'concedes']:
        models[m] = xgb.XGBClassifier(random_state=0, n_estimators=50, max_depth=3)

        print('training ' + m + ' model')
        models[m].fit(X_train, y_train[m])

        p = sum(y_train[m]) / len(y_train[m])
        base = [p] * len(y_train[m])
        y_train_pred = models[m].predict_proba(X_train)[:, 1]
        train_brier = mt.brier_score_loss(y_train[m], y_train_pred) / mt.brier_score_loss(y_train[m], base)
        print(m + ' Train NBS: ' + str(train_brier))
        print()

        p = sum(y_test[m]) / len(y_test[m])
        base = [p] * len(y_test[m])
        y_test_pred = models[m].predict_proba(X_test)[:, 1]
        test_brier = mt.brier_score_loss(y_test[m], y_test_pred) / mt.brier_score_loss(y_test[m], base)
        print(m + ' Test NBS: ' + str(test_brier))
        print()

        print('----------------------------------------')

    return models

In [None]:
models = train_vaep(X_train=features['England'], y_train=labels['England'], X_test=features['Spain'], y_test=labels['Spain'])

3- Why do we train two different models? Why does their performance differ?
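
The NBS values printed by `train_vaep` divide the model's Brier score by the Brier score of a constant prediction equal to the base rate of the label. The sketch below recomputes that ratio in plain Python on made-up labels and probabilities; `brier` and `normalized_brier` are hypothetical helper names. Values below 1 mean the model beats the base-rate baseline.

```python
def brier(y_true, y_prob):
    """Mean squared difference between predicted probabilities and outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def normalized_brier(y_true, y_prob):
    """Brier score divided by the score of always predicting the base rate."""
    base_rate = sum(y_true) / len(y_true)
    baseline = brier(y_true, [base_rate] * len(y_true))
    return brier(y_true, y_prob) / baseline


y_true = [0, 0, 0, 1]           # toy labels with a 25% base rate
y_prob = [0.1, 0.1, 0.1, 0.7]   # toy model predictions

nbs_model = normalized_brier(y_true, y_prob)
nbs_baseline = normalized_brier(y_true, [0.25] * 4)
print(nbs_model, nbs_baseline)
```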

### Predictions

In [None]:
def generate_predictions(features, models):
    preds = {}
    for m in ['scores', 'concedes']:
        preds[m] = models[m].predict_proba(features)[:, 1]
    preds = pd.DataFrame(preds)

    return preds

In [None]:
preds = {}
preds['Spain'] = generate_predictions(features=features['Spain'], models=models)

### Action Values

In [None]:
import socceraction.vaep.formula as fm

In [None]:
def calculate_action_values(spadl, predictions):
    action_values = fm.value(actions=spadl, Pscores=predictions['scores'], Pconcedes=predictions['concedes'])
    action_values = pd.concat([
        spadl[['original_event_id', 'action_id', 'game_id', 'player_id', 'start_x', 'start_y', 'end_x', 'end_y', 'type_name', 'result_name']],
        predictions.rename(columns={'scores': 'Pscores', 'concedes': 'Pconcedes'}),
        action_values
    ], axis=1)

    return action_values

In [None]:
action_values = {}
action_values['Spain'] = calculate_action_values(spadl=spadl['Spain'], predictions=preds['Spain'])

4- Explore the actions with Pscores >= 0.95. Why do they have such a high value? Compare them with actions of the same type and the opposite result. Does the model learn that this combination of action type and result is directly related to the variable y that we are trying to predict?

5- Which formula in the paper corresponds to the 'offensive_value' column of the action_values dataframe? And the 'defensive_value' column?
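
As a reference for question 5, the value definitions from the paper [1] can be written as below, where $S_{i-1}$ and $S_i$ are the game states before and after action $a_i$, $x$ is the team performing the action, and $k$ is the look-ahead window. The mapping to columns is an assumption to verify against [2]: `offensive_value` appears to correspond to the change in scoring probability, `defensive_value` to the negated change in conceding probability, and `vaep_value` to their sum.

```latex
\begin{align}
\Delta P_{\text{scores}}(a_i, x) &= P^{k}_{\text{scores}}(S_i, x) - P^{k}_{\text{scores}}(S_{i-1}, x) \\
\Delta P_{\text{concedes}}(a_i, x) &= P^{k}_{\text{concedes}}(S_i, x) - P^{k}_{\text{concedes}}(S_{i-1}, x) \\
V(a_i) &= \Delta P_{\text{scores}}(a_i, x) - \Delta P_{\text{concedes}}(a_i, x)
\end{align}
```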

### Player Ratings

In [None]:
def calculate_minutes_per_season(minutes_per_game):
    minutes_per_season = minutes_per_game.groupby('player_id', as_index=False)['minutes_played'].sum()

    return minutes_per_season

In [None]:
minutes_per_season = {}
minutes_per_season['Spain'] = calculate_minutes_per_season(minutes['Spain'])

In [None]:
def calculate_player_ratings(action_values, minutes_per_season, players):
    player_ratings = action_values.groupby(by='player_id', as_index=False).agg({'vaep_value': 'sum'}).rename(columns={'vaep_value': 'vaep_total'})
    player_ratings = player_ratings.merge(minutes_per_season, on=['player_id'], how='left')
    player_ratings['vaep_p90'] = player_ratings['vaep_total'] / player_ratings['minutes_played'] * 90
    player_ratings = player_ratings[player_ratings['minutes_played'] >= 600].sort_values(by='vaep_p90', ascending=False).reset_index(drop=True)
    player_ratings = player_ratings.merge(players, on=['player_id'], how='left')
    player_ratings = player_ratings[['player_id', 'player_name', 'minutes_played', 'vaep_total', 'vaep_p90']]

    return player_ratings

In [None]:
player_ratings = {}
player_ratings['Spain'] = calculate_player_ratings(action_values=action_values['Spain'], minutes_per_season=minutes_per_season['Spain'], players=players)

6- Do you think the Top 5 of the list is representative? Compare this VAEP ranking with the xT ranking from Part I. Which one do you think is more representative?
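
When judging the ranking, keep in mind how the per-90 normalization and the 600-minute threshold in `calculate_player_ratings` interact. The sketch below repeats that logic in plain Python on invented totals; `rate_players` and the toy numbers are hypothetical. Without the minutes filter, a player with very few minutes and one valuable action could top the list.

```python
def rate_players(vaep_totals, minutes, min_minutes=600):
    """Return (player, total, per-90 value) tuples, sorted by per-90 value."""
    ratings = []
    for player_id, total in vaep_totals.items():
        played = minutes.get(player_id, 0)
        if played < min_minutes:
            continue  # too few minutes for a stable per-90 estimate
        ratings.append((player_id, total, total / played * 90))
    return sorted(ratings, key=lambda r: r[2], reverse=True)


vaep_totals = {"A": 12.0, "B": 9.0, "C": 2.0}  # summed action values per player
minutes = {"A": 2700, "B": 900, "C": 90}       # minutes played in the season

ranking = rate_players(vaep_totals, minutes)
for player_id, total, per_90 in ranking:
    print(player_id, total, round(per_90, 2))
```

Player C would have the highest per-90 value (2.0) but is excluded by the threshold, and player B ranks above player A despite a lower total.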