# [CDAF] Atividade 4

## Nome e matrícula
Nome: João Antonio Oliveira Pedrosa
Matrícula: 2019006752

## Referências
- [1] https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
- [2] https://socceraction.readthedocs.io/en/latest/api/generated/socceraction.xthreat.ExpectedThreat.html#socceraction.xthreat.ExpectedThreat
- [3] https://socceraction.readthedocs.io/en/latest/api/generated/socceraction.xthreat.get_successful_move_actions.html#socceraction.xthreat.get_successful_move_actions
- [4] https://socceraction.readthedocs.io/en/latest/documentation/valuing_actions/xT.html

In [1]:
# Importando bibliotecas
from tqdm import tqdm
import numpy as np
import pandas as pd
import socceraction.spadl as spd
from socceraction import xthreat as xt

### LaLiga  p/ SPADL com pré-processamentos

In [2]:
# carregando os eventos
path = '../atv3/events/events_Spain.json'
events = pd.read_json(path_or_buf=path)

In [3]:
# pré processamento em colunas da tabela de eventos para facilitar a conversão p/ SPADL
events = events.rename(columns={'id': 'event_id', 'eventId': 'type_id', 'subEventId': 'subtype_id',
                                'teamId': 'team_id', 'playerId': 'player_id', 'matchId': 'game_id'})
events['milliseconds'] = events['eventSec'] * 1000
events['period_id'] = events['matchPeriod'].replace({'1H': 1, '2H': 2})

In [4]:
# carregando as partidas, pois vamos saber quais times jogam em casa e fora p/ usar como parametro do SPADL
path = r'../atv3/matches/matches_Spain.json'
matches = pd.read_json(path_or_buf=path)

In [5]:
# as informações dos times de cada partida estão em um dicionário dentro da coluna 'teamsData', então vamos separar essas informações
team_matches = []
for i in tqdm(range(len(matches))):
    match = pd.DataFrame(matches.loc[i, 'teamsData']).T
    match['matchId'] = matches.loc[i, 'wyId']
    team_matches.append(match)
team_matches = pd.concat(team_matches).reset_index(drop=True)

100%|███████████████████████████████████████| 380/380 [00:00<00:00, 2291.68it/s]


In [6]:
# fazendo a conversão p/ SPADL, padronizando a direção de jogo da esquerda p/ a direita e adicionando os nomes dos tipos de ações
spadl = []
game_ids = events.game_id.unique().tolist()
for g in tqdm(game_ids):
    match_events = events.loc[events.game_id == g]
    match_home_id = team_matches.loc[(team_matches.matchId == g) & (team_matches.side == 'home'), 'teamId'].values[0]
    match_actions = spd.wyscout.convert_to_actions(events=match_events, home_team_id=match_home_id)
    match_actions = spd.play_left_to_right(actions=match_actions, home_team_id=match_home_id)
    match_actions = spd.add_names(match_actions)
    spadl.append(match_actions)
spadl = pd.concat(spadl).reset_index(drop=True)

100%|█████████████████████████████████████████| 380/380 [01:50<00:00,  3.44it/s]


In [7]:
# adicionando o nome dos jogadores
path = '../atv3/players.json'
players = pd.read_json(path_or_buf=path)
players['player_name'] = players['firstName'] + ' ' + players['lastName']
players = players[['wyId', 'player_name']].rename(columns={'wyId': 'player_id'})
spadl = spadl.merge(players, on='player_id', how='left')

## Questão 1
- Crei um dataframe "shots" à partir do dataframe "spadl", contendo apenas os chutes.
- Crie 4 colunas no dataframe "shots" a serem usadas como features de um modelo de xG.
- Justifique a escolha das features.

As features que eu irei utilizar são:

- Ângulo do Chute
- Distância do Chute
- Parte do Corpo
- Tipo da Ação

Irei utilizar ângulo e distância ao invés de utilizar as coordenadas pois acredito que será mais fácil para o modelo fazer previsões baseado nessas duas informações. Elas são equivalentes, então nenhuma informação será perdida e, dessa forma, expressamos o posicionamento da ação em relação ao gol.

A parte do corpo é obviamente uma feature interessante. Finalizações de longa distância feitas com a cabeça por exemplo, tem baixas chances de sucesso. Não irei adicionar essa coluna ao dataset pois a coluna bodypart_id já possui essa informação exatamente como irei utilizá-la no modelo. A mesma lógica se aplica para a feature de tipo da ação.

In [68]:
shots = spadl[(spadl['type_name'] == 'shot') | (spadl['type_name'] == 'shot_freekick') | (spadl['type_name'] == 'shot_penalty') ]
# Consertando os Player Names
shots.loc[:,'player_name'] = shots.loc[:,'player_name'].str.decode('unicode-escape')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  shots.loc[:,'player_name'] = shots.loc[:,'player_name'].str.decode('unicode-escape')


In [69]:
import math

def calculate_angle_distance(point1, point2):
    x1, y1 = point1
    x2, y2 = point2

    # Calculate the angle
    dx = x2 - x1
    dy = y2 - y1
    angle = math.atan2(dy, dx)  # Angle in radians

    # Calculate the distance
    distance = math.sqrt(dx**2 + dy**2)

    # Convert the angle to degrees
    angle_degrees = math.degrees(angle)

    return angle_degrees, distance

In [71]:
# Já vou calcular a coluna binária de sucesso aqui

actions = list(spadl['type_name'].unique())
goal_point = [105, 34] # Coordenadas do Gol

angle = [0 for i in range(len(shots))]
dist  = [0 for i in range(len(shots))]
goal  = [0 for i in range(len(shots))]

shots.reset_index(inplace=True)
for i in shots.index:
    x = shots['start_x'][i]
    y = shots['start_y'][i]
    if(shots['result_name'][i] == 'success'):
        goal[i] = 1
    angle[i], dist[i] = calculate_angle_distance(goal_point, [x,y])
    
shots.loc[:,'angle'] = angle
shots.loc[:,'dist']  = dist
shots.loc[:,'goal']  = goal

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  shots.loc[:,'angle'] = angle
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  shots.loc[:,'dist']  = dist
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  shots.loc[:,'goal']  = goal


Quanto à esses warnings, eu tentei fazer exatamente como eles recomendaram e eles continuam, então só colapsei eles.

## Questão 2
- Crie uma coluna numérica binária "goal" no dataframe "shots" indicando se o chute resultou em gol ou não.
- Use regressão logística [1] p/ treinar (.fit(X_train, y_train)) um modelo de xG usando as features criadas na questão 1.
- Use 70% dos dados para treino e 30% para teste.
- Reporte a acurácia do modelo para os conjuntos de treino (.score(X_train, y_train)) e teste (.score(X_test, y_test)).

In [60]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X = shots[['bodypart_id', 'type_id', 'angle', 'dist']]  # Features
y = shots['goal']  # Objetivo

# Dividindo entre treino e teste
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Treinando
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

print("Score de Treino:")
print(logreg.score(X_train, y_train))

print("Score de Teste:")
print(logreg.score(X_test, y_test))

Score de Treino:
0.8899849523491055
Score de Teste:
0.8841653666146646


## Questão 3
- Use o modelo treinado na questão 2 p/ prever a probabilidade de gol de todos os chutes do dataframe "shots". Reporte essas probabilidades no dataframe "shots" em uma coluna "xG".
- Agrupe o dataframe "shots" por "player_name" e reporte a soma dos "goal" e "xG".
- Reporte os 10 jogadores com maior xG.
- Reporte os 10 jogadores com maior diferença de Gols e xG.

In [72]:
preds = logreg.predict(X)
shots.loc[:, 'xG'] = preds

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  shots.loc[:, 'xG'] = preds


In [100]:
shots_player = shots.groupby(['player_name']).sum()
shots_player = shots_player[['goal', 'xG']]
shots_player['diff'] = [shots_player['goal'][i] - shots_player['xG'][i] for i in shots_player.index]

10 jogadores com maior xG:

In [101]:
display(shots_player.sort_values(by="xG", ascending=False).head(10))

Unnamed: 0_level_0,goal,xG,diff
player_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Cristiano Ronaldo dos Santos Aveiro,26,5,21
Emmanuel Okyere Boateng,6,4,2
Luis Alberto Suárez Díaz,25,4,21
José Paulo Bezzera Maciel Júnior,9,3,6
Iago Aspas Juncal,22,3,19
Santiago Mina Lorenzo,12,3,9
Wissam Ben Yedder,9,3,6
Karim Benzema,5,3,2
Gabriel Appelt Pires,5,2,3
Carlos Arturo Bacca Ahumada,15,2,13


10 jogadores com maior diferença de Gol e xG:

In [102]:
display(shots_player.sort_values(by="diff", ascending=False).head(10))

Unnamed: 0_level_0,goal,xG,diff
player_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Lionel Andrés Messi Cuccittini,34,1,33
Luis Alberto Suárez Díaz,25,4,21
Cristiano Ronaldo dos Santos Aveiro,26,5,21
Cristhian Ricardo Stuani Curbelo,21,0,21
Iago Aspas Juncal,22,3,19
Antoine Griezmann,19,1,18
Maximiliano Gómez González,18,1,17
Gareth Frank Bale,16,1,15
Willian José da Silva,15,0,15
Rodrigo Moreno Machado,16,2,14


Esses 10 primeiros jogadores dão uma falsa noção de que o modelo na verdade é muito ruim mas eu chequei a acurácia manualmente e a acurácia reportada está correta.

## Questão 4 [4]
- Instancie um objeto ExpectedThreat [2] com parâmetros l=25 e w=16.
- Faça o fit do modelo ExpectedThreat com o dataframe "spadl".

In [106]:
def correct_error(a, b):
    return np.nan_to_num(a/b)

In [111]:
import socceraction.xthreat as xthreat
xthreat._safe_divide = correct_error
xTModel = xthreat.ExpectedThreat(l=25, w=16)
xTModel.fit(spadl)

  return np.nan_to_num(a/b)


# iterations:  30


<socceraction.xthreat.ExpectedThreat at 0x7f504d8e1160>

## Questão 5
- Crie um dataframe "prog_actions" à partir do dataframe "spadl", contendo apenas as ações de progressão e que são bem-sucedidas [3].
- Use o método rate do objeto ExpectedThreat p/ calcular o valor de cada ação de progressão do dataframe "prog_actions", em uma coluna chamada "action_value".
- Agrupe o dataframe "prog_actions" por "player_name" e reporte a soma dos "action_value".
- Reporte os 10 jogadores com maior "action_value".

In [112]:
prog_actions = xthreat.get_successful_move_actions(spadl)
prog_actions["action_value"] = xTModel.rate(prog_actions)
prog_actions['player_name'] = prog_actions['player_name'].str.decode('unicode-escape')

In [119]:
prog_actions_players = prog_actions.groupby('player_name').sum()
prog_actions_players

Unnamed: 0_level_0,game_id,period_id,time_seconds,team_id,player_id,start_x,start_y,end_x,end_y,bodypart_id,type_id,result_id,action_id,action_value
player_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
Aarón Martín Caricol,2468213376,1402,1.293765e+06,664742,271705356,51378.60,56806.0984,53069.10,49846.72,44,2731,962,581574,5.578629
Achraf Hakimi Mouh,1069923190,624,5.679448e+05,281475,165330075,23482.20,3345.6000,23990.40,6433.48,7,1205,417,289922,1.369938
Adalberto Peñaranda Maestre,679901692,372,3.208078e+05,180995,68885690,17405.85,8347.0000,17175.90,8311.64,2,1660,265,142549,0.251660
Adnan Januzaj,1444521348,838,7.299608e+05,386781,147014501,38528.70,16244.5200,39756.15,17880.60,0,3651,563,344545,4.142670
Adrián González Morales,2370742044,1383,1.271137e+06,631092,3423420,48683.25,36403.8000,51000.60,36999.48,64,2020,924,569638,1.053695
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Óscar David Romero Villamayor,130844162,90,8.737077e+04,35496,5816805,2909.55,1049.9200,2949.45,1146.48,7,253,51,40836,0.088822
Óscar Esau Duarte Gaitán,992956972,548,4.917586e+05,267417,48471750,13215.30,13556.4800,16352.70,14122.92,24,337,387,219093,1.011419
Óscar Melendo Jiménez,659415050,417,3.702676e+05,177587,115851488,15377.25,7828.1600,15527.40,8402.76,5,800,257,184289,0.317399
Óscar de Marcos Arana,1657517693,977,8.631285e+05,437988,2209966,38771.25,6366.1600,39805.50,10733.80,34,1497,646,391853,2.742232


10 jogadores com maior "action_value"

In [120]:
prog_actions_players.sort_values(by="action_value", ascending=False).head(10)

Unnamed: 0_level_0,game_id,period_id,time_seconds,team_id,player_id,start_x,start_y,end_x,end_y,bodypart_id,type_id,result_id,action_id,action_value
player_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
Lionel Andrés Messi Cuccittini,4705538769,2778,2577345.0,1239784,6160406,124407.15,52640.16,132844.95,59429.28,14,7463,1834,1317903,10.650189
Marcelo Vieira da Silva Júnior,3979456219,2338,2161144.0,1046925,5133810,93085.65,90968.36,101014.2,79523.28,27,3528,1551,1087920,10.264535
Álvaro Odriozola Arzallus,3753649494,2169,1934327.0,1005081,407136807,86065.35,11475.68,85872.15,21035.8,38,4993,1463,904555,8.708854
José Luis Morales Nogales,2242465244,1344,1238712.0,607430,196727786,56079.45,27190.48,59961.3,28331.52,11,5481,874,550756,7.81904
Hugo Mallo Novegil,3915316318,2268,2024664.0,1055992,5840002,77569.8,16677.0,82930.05,28213.88,97,3421,1526,944856,7.431915
Juan Francisco Moreno Fuertes,3474013972,2053,1867470.0,916658,4502050,71543.85,12923.4,79160.55,21790.6,68,4254,1354,875393,7.281309
Éver Maximiliano David Banega,5123809747,3023,2685928.0,1357960,6550160,105353.85,72857.24,118285.65,70945.08,23,4606,1997,1324517,7.01516
Lucas Vázquez Iglesias,2339977112,1469,1397931.0,615600,4102176,59982.3,13984.88,59226.3,20553.68,15,2642,912,723666,6.908507
Jordi Alba Ramos,4895416848,2785,2491619.0,1289808,6237252,107337.3,111709.04,108414.6,95475.4,41,2209,1908,1246334,6.824937
José Luis Gayá Peña,3289278923,1855,1684123.0,864068,6271544,66810.45,74368.2,70629.3,64261.36,41,2652,1282,765194,6.81135


Estatisticamente comprovado Messi GOAT