Hello everyone, I would like to show you how I prepared football statistics that I downloaded from fbref.com. The data covers every player in every game this season. I also downloaded some additional data to get more information. You can download the data from https://drive.google.com/file/d/1T5MpGyDHwBH_0Zr8IPNFtZprgPN6_FXO/view?usp=sharing.

We will prepare the data by creating 6 new tables, which will be an excellent basis for further analysis of players, teams, matches and referees.

Let us get our hands dirty!

In [1]:
import pandas as pd 
import matplotlib.pyplot as plt 
import numpy as np
import glob
import seaborn as sns

In [2]:
player_match_data = pd.read_csv('players_Premier-League_2023-2024_01_tillnow.csv')

In [3]:
player_match_data.head()

Unnamed: 0,dayofweek,date,start_time,home_team,home_xg,score,away_xg,away_team,attendance,venue,...,fouls,fouled,offsides,pens_won,pens_conceded,own_goals,ball_recoveries,aerials_won,aerials_lost,aerials_won_pct
0,Fri,2023-08-11,20:00,Burnley,0.3,0–3,1.9,Manchester City,21572,Turf Moor,...,0,2,0,0.0,0.0,0,4.0,1.0,2.0,33.3
1,Fri,2023-08-11,20:00,Burnley,0.3,0–3,1.9,Manchester City,21572,Turf Moor,...,1,1,0,0.0,0.0,0,1.0,0.0,0.0,
2,Fri,2023-08-11,20:00,Burnley,0.3,0–3,1.9,Manchester City,21572,Turf Moor,...,1,0,0,0.0,0.0,0,5.0,2.0,3.0,40.0
3,Fri,2023-08-11,20:00,Burnley,0.3,0–3,1.9,Manchester City,21572,Turf Moor,...,0,0,0,0.0,0.0,0,0.0,0.0,0.0,
4,Fri,2023-08-11,20:00,Burnley,0.3,0–3,1.9,Manchester City,21572,Turf Moor,...,0,2,0,0.0,0.0,0,5.0,0.0,0.0,


In [4]:
#Let's check the dtype of each column

pd.set_option('display.max_rows', 119)
player_match_data.dtypes

dayofweek                        object
date                             object
start_time                       object
home_team                        object
home_xg                         float64
score                            object
away_xg                         float64
away_team                        object
attendance                       object
venue                            object
referee                          object
league                           object
gameweek                          int64
manager                          object
team                             object
formation                        object
name                             object
x                               float64
y                               float64
shirtnumber                       int64
nationality                      object
position                         object
age                             float64
minutes                           int64
goals                             int64


In [5]:
#Let's convert the attendance column to integers

player_match_data['attendance'] = player_match_data['attendance'].str.replace(',', '', regex=True)
player_match_data['attendance'] = player_match_data['attendance'].astype(float).astype(pd.Int64Dtype())

player_match_data['attendance']

0       21572
1       21572
2       21572
3       21572
4       21572
        ...  
2705    61286
2706    61286
2707    61286
2708    61286
2709    61286
Name: attendance, Length: 2710, dtype: Int64
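
The same conversion can also be sketched with `pd.to_numeric`, which skips the intermediate `astype(float)` step. The sample strings below are an assumption about what the raw column looks like (thousands separated by commas, with possible missing values); they are not taken from the file.

```python
import pandas as pd

# Hypothetical sample mimicking the raw "attendance" strings (not from the file).
raw_attendance = pd.Series(["21,572", "61,286", None])

# Remove the thousands separator as a literal comma, then convert to the
# nullable integer dtype so that missing values stay missing (<NA>).
attendance = pd.to_numeric(
    raw_attendance.str.replace(",", "", regex=False)
).astype("Int64")

print(attendance)
```

The nullable `Int64` dtype matters here because a plain `int64` column cannot hold missing values.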

In [6]:
#Let's delete all percentage columns (those with "_pct" in the name).
#They are easy to recalculate in later stages of the analysis if needed.

pct_cols = [x for x in player_match_data.columns if '_pct' in x]
player_match_data.drop(pct_cols, axis = 1, inplace = True)
player_match_data.columns.values

array(['dayofweek', 'date', 'start_time', 'home_team', 'home_xg', 'score',
       'away_xg', 'away_team', 'attendance', 'venue', 'referee', 'league',
       'gameweek', 'manager', 'team', 'formation', 'name', 'x', 'y',
       'shirtnumber', 'nationality', 'position', 'age', 'minutes',
       'goals', 'assists', 'pens_made', 'pens_att', 'shots',
       'shots_on_target', 'cards_yellow', 'cards_red', 'touches',
       'tackles', 'interceptions', 'blocks', 'xg', 'npxg', 'xg_assist',
       'sca', 'gca', 'passes_completed', 'passes', 'progressive_passes',
       'carries', 'progressive_carries', 'take_ons', 'take_ons_won',
       'passes_total_distance', 'passes_progressive_distance',
       'passes_completed_short', 'passes_short',
       'passes_completed_medium', 'passes_medium',
       'passes_completed_long', 'passes_long', 'pass_xa',
       'assisted_shots', 'passes_into_final_third',
       'passes_into_penalty_area', 'crosses_into_penalty_area',
       'passes_live', 'passes_dead',
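
As a rough sketch of how a dropped percentage could be recalculated later, here is "aerials_won_pct" rebuilt from "aerials_won" and "aerials_lost". The formula won / (won + lost) is an assumption; it agrees with the three rows of the first head() output (33.3, empty, 40.0), but I have not checked it against every row.

```python
import numpy as np
import pandas as pd

# Aerial-duel counts from the first three rows of the head() output.
sample = pd.DataFrame({
    "aerials_won": [1.0, 0.0, 2.0],
    "aerials_lost": [2.0, 0.0, 3.0],
})

total = sample["aerials_won"] + sample["aerials_lost"]

# A zero denominator is replaced with NaN, so players without any aerial
# duel get a missing percentage instead of a division-by-zero result.
sample["aerials_won_pct"] = (
    100 * sample["aerials_won"] / total.replace(0, np.nan)
).round(1)

print(sample)
```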

In [7]:
#Let's create columns which will be useful in the analysis 
#and are not present in the current dataframe.
#First, we will create the "starter" column, 
#which will be 0 if the player started on the bench and 
#1 if the player started the match on the field.
#In the raw data, the name of a player who started on the bench begins with "\xa0\xa0\xa0"

player_match_data['starter'] = player_match_data.apply(lambda row: 0 if row['name'].startswith('\xa0\xa0\xa0') else 1, axis = 1)
player_match_data.name = player_match_data.name.str.strip()

player_match_data[['name', 'starter']]

Unnamed: 0,name,starter
0,Zeki Amdouni,1
1,Anass Zaroury,0
2,Lyle Foster,1
3,Nathan Redmond,0
4,Josh Cullen,1
...,...,...
2705,Antonee Robinson,1
2706,Tim Ream,1
2707,Calvin Bassey,1
2708,Timothy Castagne,1
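
The row-wise `apply` works fine, but the same flag can be built with vectorized string methods, which are usually faster on larger tables. This is a minimal sketch on two names from the output above; the leading non-breaking spaces on the second name are added by hand to imitate the raw data.

```python
import pandas as pd

# Bench players carry three leading non-breaking spaces in the raw "name" column.
names = pd.Series(["Zeki Amdouni", "\xa0\xa0\xa0Anass Zaroury"])

# True where the prefix is present, inverted and cast so that starters get 1.
starter = (~names.str.startswith("\xa0\xa0\xa0")).astype(int)

# strip() also removes non-breaking spaces, so the names come out clean.
clean_names = names.str.strip()

print(pd.DataFrame({"name": clean_names, "starter": starter}))
```

Note that the flag has to be computed before the names are stripped, otherwise the prefix is already gone.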


In [8]:
#Next, we will extract the home and away scores 
#as integers from the "score" column.

player_match_data[['home_score', 'away_score']] = player_match_data.apply(lambda row: pd.Series([int(x) for x in row['score'].split('–')]), axis = 1)

player_match_data[['home_team', 'away_team', 'score', 'home_score', 'away_score']]

Unnamed: 0,home_team,away_team,score,home_score,away_score
0,Burnley,Manchester City,0–3,0,3
1,Burnley,Manchester City,0–3,0,3
2,Burnley,Manchester City,0–3,0,3
3,Burnley,Manchester City,0–3,0,3
4,Burnley,Manchester City,0–3,0,3
...,...,...,...,...,...
2705,Tottenham,Fulham,2–0,2,0
2706,Tottenham,Fulham,2–0,2,0
2707,Tottenham,Fulham,2–0,2,0
2708,Tottenham,Fulham,2–0,2,0
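
A vectorized sketch of the same split, using `str.split` with `expand=True`. One detail that is easy to miss: the separator in the "score" column is an en dash (U+2013), not a regular hyphen, so splitting on "-" would leave the string unsplit and the conversion to integers would fail.

```python
import pandas as pd

# Two scores from the output above; the separator is an en dash (U+2013).
scores = pd.DataFrame({"score": ["0–3", "2–0"]})

# expand=True returns one column per part; both parts are cast to int.
parts = scores["score"].str.split("–", expand=True).astype(int)
scores["home_score"] = parts[0]
scores["away_score"] = parts[1]

print(scores)
```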


In [9]:
#Now, we will create the columns "home_win", "draw" and "away_win" 
#from the columns "home_score" and "away_score".
#Each column will be 1 if the match ended with that result 
#and 0 otherwise.

player_match_data['hda_winner'] = player_match_data.apply(lambda row: 'h' if row.home_score>row.away_score else 'a' if row.home_score<row.away_score else 'd', axis = 1)

player_match_data['home_win'] = (player_match_data['hda_winner'] == 'h').astype(int)
player_match_data['draw'] = (player_match_data['hda_winner'] == 'd').astype(int)
player_match_data['away_win'] = (player_match_data['hda_winner'] == 'a').astype(int)

player_match_data.drop('hda_winner', inplace = True, axis = 1)

player_match_data[['home_team', 'away_team', 'score', 'home_score', 'away_score', 'home_win', 'draw', 'away_win']]

Unnamed: 0,home_team,away_team,score,home_score,away_score,home_win,draw,away_win
0,Burnley,Manchester City,0–3,0,3,0,0,1
1,Burnley,Manchester City,0–3,0,3,0,0,1
2,Burnley,Manchester City,0–3,0,3,0,0,1
3,Burnley,Manchester City,0–3,0,3,0,0,1
4,Burnley,Manchester City,0–3,0,3,0,0,1
...,...,...,...,...,...,...,...,...
2705,Tottenham,Fulham,2–0,2,0,1,0,0
2706,Tottenham,Fulham,2–0,2,0,1,0,0
2707,Tottenham,Fulham,2–0,2,0,1,0,0
2708,Tottenham,Fulham,2–0,2,0,1,0,0


In [10]:
#The dataframe does not yet say which team won the match. Let's create the column "team_winner", 
#which will hold the name of the winning team, or the value 'draw' if the match was drawn.

player_match_data['team_winner'] = player_match_data.apply(lambda row: row.home_team if row.home_score>row.away_score else row.away_team if row.home_score<row.away_score else 'draw', axis = 1)
player_match_data[['home_team', 'away_team', 'home_score', 'away_score', 'home_win', 'draw', 'away_win', 'team_winner']]

Unnamed: 0,home_team,away_team,home_score,away_score,home_win,draw,away_win,team_winner
0,Burnley,Manchester City,0,3,0,0,1,Manchester City
1,Burnley,Manchester City,0,3,0,0,1,Manchester City
2,Burnley,Manchester City,0,3,0,0,1,Manchester City
3,Burnley,Manchester City,0,3,0,0,1,Manchester City
4,Burnley,Manchester City,0,3,0,0,1,Manchester City
...,...,...,...,...,...,...,...,...
2705,Tottenham,Fulham,2,0,1,0,0,Tottenham
2706,Tottenham,Fulham,2,0,1,0,0,Tottenham
2707,Tottenham,Fulham,2,0,1,0,0,Tottenham
2708,Tottenham,Fulham,2,0,1,0,0,Tottenham
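
The result flags and the "team_winner" column can also be derived in one pass from two boolean masks, without the helper column and without `apply`. Below is a sketch using `np.select` on three results that appear in this notebook's outputs; it is an alternative formulation, not the code that produced the tables above.

```python
import numpy as np
import pandas as pd

# Three results taken from the outputs in this notebook.
matches = pd.DataFrame({
    "home_team": ["Burnley", "Tottenham", "Crystal Palace"],
    "away_team": ["Manchester City", "Fulham", "Fulham"],
    "home_score": [0, 2, 0],
    "away_score": [3, 0, 0],
})

home_win = matches["home_score"] > matches["away_score"]
away_win = matches["home_score"] < matches["away_score"]

matches["home_win"] = home_win.astype(int)
matches["away_win"] = away_win.astype(int)
matches["draw"] = (~home_win & ~away_win).astype(int)

# np.select picks the first matching condition per row, else the default.
matches["team_winner"] = np.select(
    [home_win, away_win],
    [matches["home_team"], matches["away_team"]],
    default="draw",
)

print(matches)
```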


In [11]:
#We have to create the columns "player_win" and "player_lose", 
#which will indicate whether the player was part of the winning or the losing team.

player_match_data['winner'] = player_match_data.apply(lambda row: 'w' if row.team==row.team_winner else 'd' if row.team_winner=='draw' else 'l', axis = 1)

player_match_data['player_win'] = (player_match_data['winner'] == 'w').astype(int)
player_match_data['player_lose'] = (player_match_data['winner'] == 'l').astype(int)

player_match_data.drop('winner', inplace = True, axis = 1)

player_match_data[['home_team', 'away_team', 'home_score', 'away_score', 'team_winner', 'player_win', 'player_lose']]

Unnamed: 0,home_team,away_team,home_score,away_score,team_winner,player_win,player_lose
0,Burnley,Manchester City,0,3,Manchester City,0,1
1,Burnley,Manchester City,0,3,Manchester City,0,1
2,Burnley,Manchester City,0,3,Manchester City,0,1
3,Burnley,Manchester City,0,3,Manchester City,0,1
4,Burnley,Manchester City,0,3,Manchester City,0,1
...,...,...,...,...,...,...,...
2705,Tottenham,Fulham,2,0,Tottenham,0,1
2706,Tottenham,Fulham,2,0,Tottenham,0,1
2707,Tottenham,Fulham,2,0,Tottenham,0,1
2708,Tottenham,Fulham,2,0,Tottenham,0,1


In [12]:
#Let's add the column "points", which will indicate 
#whether the player's team earned 3, 1 or 0 points.

player_match_data['points'] = player_match_data.apply(lambda row: 3 if row.player_win==1 else 0 if row.player_lose==1 else 1, axis = 1)

player_match_data[['home_team', 'away_team', 'home_score', 'away_score', 'team_winner', 'player_win', 'player_lose', 'points']]

Unnamed: 0,home_team,away_team,home_score,away_score,team_winner,player_win,player_lose,points
0,Burnley,Manchester City,0,3,Manchester City,0,1,0
1,Burnley,Manchester City,0,3,Manchester City,0,1,0
2,Burnley,Manchester City,0,3,Manchester City,0,1,0
3,Burnley,Manchester City,0,3,Manchester City,0,1,0
4,Burnley,Manchester City,0,3,Manchester City,0,1,0
...,...,...,...,...,...,...,...,...
2705,Tottenham,Fulham,2,0,Tottenham,0,1,0
2706,Tottenham,Fulham,2,0,Tottenham,0,1,0
2707,Tottenham,Fulham,2,0,Tottenham,0,1,0
2708,Tottenham,Fulham,2,0,Tottenham,0,1,0
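
For comparison, here is a sketch that derives "player_win", "player_lose" and "points" directly from "team" and "team_winner" with boolean masks. The three rows are a small hand-made sample (a loser, a winner and a draw), chosen to cover all three cases.

```python
import numpy as np
import pandas as pd

# One losing row, one winning row and one drawn row.
rows = pd.DataFrame({
    "team": ["Burnley", "Manchester City", "Fulham"],
    "team_winner": ["Manchester City", "Manchester City", "draw"],
})

player_win = rows["team"] == rows["team_winner"]
is_draw = rows["team_winner"] == "draw"

rows["player_win"] = player_win.astype(int)
rows["player_lose"] = (~player_win & ~is_draw).astype(int)

# 3 points for a win, 1 for a draw, 0 otherwise.
rows["points"] = np.select([player_win, is_draw], [3, 1], default=0)

print(rows)
```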


In [13]:
#Let's assign the home and away xg values 
#to the player's team and the opposing team.

player_match_data['team_xg'] = player_match_data.apply(lambda row: row.home_xg if row.team==row.home_team else row.away_xg, axis = 1)
player_match_data['against_xg'] = player_match_data.apply(lambda row: row.away_xg if row.team==row.home_team else row.home_xg, axis = 1)

player_match_data[['home_team', 'away_team', 'home_score', 'away_score', 'home_xg', 'away_xg', 'team_xg', 'against_xg']]

Unnamed: 0,home_team,away_team,home_score,away_score,home_xg,away_xg,team_xg,against_xg
0,Burnley,Manchester City,0,3,0.3,1.9,0.3,1.9
1,Burnley,Manchester City,0,3,0.3,1.9,0.3,1.9
2,Burnley,Manchester City,0,3,0.3,1.9,0.3,1.9
3,Burnley,Manchester City,0,3,0.3,1.9,0.3,1.9
4,Burnley,Manchester City,0,3,0.3,1.9,0.3,1.9
...,...,...,...,...,...,...,...,...
2705,Tottenham,Fulham,2,0,1.5,1.0,1.0,1.5
2706,Tottenham,Fulham,2,0,1.5,1.0,1.0,1.5
2707,Tottenham,Fulham,2,0,1.5,1.0,1.0,1.5
2708,Tottenham,Fulham,2,0,1.5,1.0,1.0,1.5
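
The home/away switch is a good fit for `Series.where`, which keeps a value where a condition holds and takes the replacement elsewhere. A small sketch with one home player (Burnley) and one away player (Fulham), using the xg values from the output above:

```python
import pandas as pd

# One player of the home team and one player of the away team.
rows = pd.DataFrame({
    "team": ["Burnley", "Fulham"],
    "home_team": ["Burnley", "Tottenham"],
    "home_xg": [0.3, 1.5],
    "away_xg": [1.9, 1.0],
})

is_home = rows["team"] == rows["home_team"]

# Keep home_xg for home players, otherwise fall back to away_xg (and vice versa).
rows["team_xg"] = rows["home_xg"].where(is_home, rows["away_xg"])
rows["against_xg"] = rows["away_xg"].where(is_home, rows["home_xg"])

print(rows)
```

The same pattern would apply to "team_scored" and "team_conceded" with the score columns.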


In [14]:
#Let's assign the home and away goals 
#to the goals scored and conceded by the player's team.

player_match_data['team_scored'] = player_match_data.apply(lambda row: row.home_score if row.team==row.home_team else row.away_score, axis = 1)
player_match_data['team_conceded'] = player_match_data.apply(lambda row: row.home_score if row.team==row.away_team else row.away_score, axis = 1)

player_match_data[['home_team', 'away_team', 'home_score', 'away_score', 'team_scored', 'team_conceded']]

Unnamed: 0,home_team,away_team,home_score,away_score,team_scored,team_conceded
0,Burnley,Manchester City,0,3,0,3
1,Burnley,Manchester City,0,3,0,3
2,Burnley,Manchester City,0,3,0,3
3,Burnley,Manchester City,0,3,0,3
4,Burnley,Manchester City,0,3,0,3
...,...,...,...,...,...,...
2705,Tottenham,Fulham,2,0,0,2
2706,Tottenham,Fulham,2,0,0,2
2707,Tottenham,Fulham,2,0,0,2
2708,Tottenham,Fulham,2,0,0,2


In [15]:
#Let's check what the dataframe looks like now

pd.set_option('display.max_columns', 150)
player_match_data.head(10)

Unnamed: 0,dayofweek,date,start_time,home_team,home_xg,score,away_xg,away_team,attendance,venue,referee,league,gameweek,manager,team,formation,name,x,y,shirtnumber,nationality,position,age,minutes,goals,assists,pens_made,pens_att,shots,shots_on_target,cards_yellow,cards_red,touches,tackles,interceptions,blocks,xg,npxg,xg_assist,sca,gca,passes_completed,passes,progressive_passes,carries,progressive_carries,take_ons,take_ons_won,passes_total_distance,passes_progressive_distance,passes_completed_short,passes_short,passes_completed_medium,passes_medium,passes_completed_long,passes_long,pass_xa,assisted_shots,passes_into_final_third,passes_into_penalty_area,crosses_into_penalty_area,passes_live,passes_dead,passes_free_kicks,through_balls,passes_switches,crosses,throw_ins,corner_kicks,corner_kicks_in,corner_kicks_out,corner_kicks_straight,passes_offsides,passes_blocked,tackles_won,tackles_def_3rd,tackles_mid_3rd,tackles_att_3rd,challenge_tackles,challenges,challenges_lost,blocked_shots,blocked_passes,tackles_interceptions,clearances,errors,touches_def_pen_area,touches_def_3rd,touches_mid_3rd,touches_att_3rd,touches_att_pen_area,touches_live_ball,take_ons_tackled,carries_distance,carries_progressive_distance,carries_into_final_third,carries_into_penalty_area,miscontrols,dispossessed,passes_received,progressive_passes_received,cards_yellow_red,fouls,fouled,offsides,pens_won,pens_conceded,own_goals,ball_recoveries,aerials_won,aerials_lost,starter,home_score,away_score,home_win,draw,away_win,team_winner,player_win,player_lose,points,team_xg,against_xg,team_scored,team_conceded
0,Fri,2023-08-11,20:00,Burnley,0.3,0–3,1.9,Manchester City,21572,Turf Moor,Craig Pawson,Premier-League,1,Vincent Kompany,Burnley,5-4-1,Zeki Amdouni,33.33,15.0,25,SUI,FW,22.685,60,0,0,0,0,1,1,0,0,23.0,1.0,0,0.0,0.0,0.0,0.0,1.0,0.0,10.0,11.0,0.0,11.0,1.0,4.0,2.0,142.0,48.0,6.0,7.0,1.0,1.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,11.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,3.0,9.0,11.0,3.0,23.0,2.0,63.0,17.0,0.0,1.0,3.0,1.0,15.0,3.0,0,0,2,0,0.0,0.0,0,4.0,1.0,2.0,1,0,3,0,0,1,Manchester City,0,1,0,0.3,1.9,0,3
1,Fri,2023-08-11,20:00,Burnley,0.3,0–3,1.9,Manchester City,21572,Turf Moor,Craig Pawson,Premier-League,1,Vincent Kompany,Burnley,5-4-1,Anass Zaroury,9999.0,9999.0,19,MAR,FW,22.759,29,0,0,0,0,1,0,0,1,13.0,0.0,0,0.0,0.0,0.0,0.0,1.0,0.0,10.0,12.0,0.0,6.0,0.0,0.0,0.0,203.0,52.0,4.0,4.0,3.0,3.0,3.0,5.0,0.1,1.0,0.0,0.0,0.0,7.0,5.0,0.0,0.0,0.0,4,0.0,4.0,2.0,2.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0,7.0,0.0,13.0,0.0,13.0,4.0,0.0,0.0,0.0,0.0,7.0,1.0,0,1,1,0,0.0,0.0,0,1.0,0.0,0.0,0,0,3,0,0,1,Manchester City,0,1,0,0.3,1.9,0,3
2,Fri,2023-08-11,20:00,Burnley,0.3,0–3,1.9,Manchester City,21572,Turf Moor,Craig Pawson,Premier-League,1,Vincent Kompany,Burnley,5-4-1,Lyle Foster,33.33,25.0,17,RSA,LM,22.937,89,0,0,0,0,2,0,0,0,37.0,2.0,0,2.0,0.1,0.1,0.0,3.0,0.0,14.0,20.0,2.0,14.0,2.0,3.0,1.0,247.0,43.0,5.0,7.0,7.0,8.0,1.0,2.0,0.0,1.0,1.0,1.0,0.0,16.0,4.0,1.0,0.0,1.0,0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1,1.0,1.0,0.0,0.0,0.0,0.0,0.0,2.0,2.0,2.0,0.0,3.0,5.0,23.0,10.0,4.0,37.0,2.0,103.0,51.0,1.0,1.0,3.0,2.0,18.0,3.0,0,1,0,0,0.0,0.0,0,5.0,2.0,3.0,1,0,3,0,0,1,Manchester City,0,1,0,0.3,1.9,0,3
3,Fri,2023-08-11,20:00,Burnley,0.3,0–3,1.9,Manchester City,21572,Turf Moor,Craig Pawson,Premier-League,1,Vincent Kompany,Burnley,5-4-1,Nathan Redmond,9999.0,9999.0,15,ENG,LM,29.433,1,0,0,0,0,0,0,0,0,2.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,0.0,2.0,0.0,0.0,0.0,31.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,1.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,2.0,0.0,10.0,2.0,0.0,0.0,0.0,0.0,2.0,0.0,0,0,0,0,0.0,0.0,0,0.0,0.0,0.0,0,0,3,0,0,1,Manchester City,0,1,0,0.3,1.9,0,3
4,Fri,2023-08-11,20:00,Burnley,0.3,0–3,1.9,Manchester City,21572,Turf Moor,Craig Pawson,Premier-League,1,Vincent Kompany,Burnley,5-4-1,Josh Cullen,33.33,30.0,24,IRL,CM,27.345,90,0,0,0,0,0,0,0,0,46.0,1.0,0,1.0,0.0,0.0,0.0,0.0,0.0,37.0,43.0,2.0,16.0,0.0,0.0,0.0,659.0,112.0,17.0,18.0,18.0,19.0,1.0,3.0,0.0,0.0,1.0,0.0,0.0,41.0,2.0,1.0,0.0,0.0,2,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,3.0,21.0,19.0,6.0,0.0,46.0,0.0,48.0,18.0,0.0,0.0,0.0,0.0,30.0,1.0,0,0,2,0,0.0,0.0,0,5.0,0.0,0.0,1,0,3,0,0,1,Manchester City,0,1,0,0.3,1.9,0,3
5,Fri,2023-08-11,20:00,Burnley,0.3,0–3,1.9,Manchester City,21572,Turf Moor,Craig Pawson,Premier-League,1,Vincent Kompany,Burnley,5-4-1,Sander Berge,66.67,25.0,16,NOR,CM,25.488,89,0,0,0,0,0,0,0,0,33.0,0.0,0,1.0,0.0,0.0,0.0,0.0,0.0,23.0,28.0,1.0,12.0,0.0,0.0,0.0,413.0,57.0,12.0,13.0,8.0,10.0,3.0,4.0,0.0,0.0,1.0,0.0,0.0,27.0,1.0,1.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,4.0,12.0,18.0,3.0,1.0,33.0,0.0,35.0,17.0,0.0,0.0,1.0,0.0,24.0,1.0,0,2,0,0,0.0,0.0,0,3.0,2.0,0.0,1,0,3,0,0,1,Manchester City,0,1,0,0.3,1.9,0,3
6,Fri,2023-08-11,20:00,Burnley,0.3,0–3,1.9,Manchester City,21572,Turf Moor,Craig Pawson,Premier-League,1,Vincent Kompany,Burnley,5-4-1,Josh Brownhill,9999.0,9999.0,8,ENG,CM,27.644,1,0,0,0,0,0,0,0,0,2.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,43.0,0.0,0.0,0.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,2.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0,0,0,0,0.0,0.0,0,0.0,0.0,0.0,0,0,3,0,0,1,Manchester City,0,1,0,0.3,1.9,0,3
7,Fri,2023-08-11,20:00,Burnley,0.3,0–3,1.9,Manchester City,21572,Turf Moor,Craig Pawson,Premier-League,1,Vincent Kompany,Burnley,5-4-1,Luca Koleosho,83.33,30.0,30,ITA,RM,18.904,60,0,0,0,0,1,0,0,0,23.0,0.0,0,0.0,0.1,0.1,0.1,2.0,0.0,8.0,14.0,0.0,18.0,1.0,2.0,0.0,71.0,8.0,6.0,6.0,1.0,6.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,13.0,1.0,0.0,0.0,0.0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,11.0,11.0,2.0,23.0,2.0,90.0,26.0,2.0,0.0,3.0,1.0,16.0,1.0,0,1,0,0,0.0,0.0,0,5.0,0.0,1.0,1,0,3,0,0,1,Manchester City,0,1,0,0.3,1.9,0,3
8,Fri,2023-08-11,20:00,Burnley,0.3,0–3,1.9,Manchester City,21572,Turf Moor,Craig Pawson,Premier-League,1,Vincent Kompany,Burnley,5-4-1,Jacob Bruun Larsen,9999.0,9999.0,34,DEN,RM,24.893,30,0,0,0,0,1,0,0,0,9.0,0.0,0,0.0,0.0,0.0,0.0,1.0,0.0,4.0,6.0,2.0,4.0,3.0,0.0,0.0,55.0,21.0,2.0,3.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,5.0,1.0,0.0,0.0,0.0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,2.0,7.0,2.0,9.0,0.0,53.0,37.0,1.0,1.0,1.0,0.0,6.0,3.0,0,2,0,0,0.0,0.0,0,0.0,2.0,0.0,0,0,3,0,0,1,Manchester City,0,1,0,0.3,1.9,0,3
9,Fri,2023-08-11,20:00,Burnley,0.3,0–3,1.9,Manchester City,21572,Turf Moor,Craig Pawson,Premier-League,1,Vincent Kompany,Burnley,5-4-1,Vitinho,16.67,15.0,22,BRA,LB,24.052,90,0,0,0,0,0,0,0,0,31.0,0.0,1,2.0,0.0,0.0,0.1,2.0,0.0,20.0,25.0,0.0,14.0,1.0,1.0,0.0,245.0,40.0,16.0,16.0,4.0,5.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,21.0,4.0,1.0,0.0,0.0,0,3.0,0.0,0.0,0.0,0.0,0.0,2.0,0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,11.0,15.0,5.0,1.0,31.0,1.0,78.0,29.0,0.0,0.0,0.0,0.0,17.0,2.0,0,1,1,0,0.0,0.0,0,2.0,0.0,0.0,1,0,3,0,0,1,Manchester City,0,1,0,0.3,1.9,0,3


In [16]:
#Since the dataframe contains data for each player in each match, 
#we will export it under the name "player_match_data.csv".

player_match_data.to_csv('player_match_data.csv', index = False)

In [17]:
player_match_data.columns.values

array(['dayofweek', 'date', 'start_time', 'home_team', 'home_xg', 'score',
       'away_xg', 'away_team', 'attendance', 'venue', 'referee', 'league',
       'gameweek', 'manager', 'team', 'formation', 'name', 'x', 'y',
       'shirtnumber', 'nationality', 'position', 'age', 'minutes',
       'goals', 'assists', 'pens_made', 'pens_att', 'shots',
       'shots_on_target', 'cards_yellow', 'cards_red', 'touches',
       'tackles', 'interceptions', 'blocks', 'xg', 'npxg', 'xg_assist',
       'sca', 'gca', 'passes_completed', 'passes', 'progressive_passes',
       'carries', 'progressive_carries', 'take_ons', 'take_ons_won',
       'passes_total_distance', 'passes_progressive_distance',
       'passes_completed_short', 'passes_short',
       'passes_completed_medium', 'passes_medium',
       'passes_completed_long', 'passes_long', 'pass_xa',
       'assisted_shots', 'passes_into_final_third',
       'passes_into_penalty_area', 'crosses_into_penalty_area',
       'passes_live', 'passes_dead',

In [27]:
#Let's create data for each team in each match. 
#We will sum the values for each team in each match, 
#except for the "age" column, where the mean will be applied.

team_match_columns = ['dayofweek', 'date', 'start_time', 'home_team', 'home_xg',
       'away_xg', 'away_team', 'attendance', 'venue', 'referee', 'league',
       'gameweek', 'manager', 'team', 'formation','home_score', 'away_score', 'home_win',
       'draw', 'away_win', 'team_winner', 'player_win', 'player_lose',
       'points', 'team_xg', 'against_xg', 'team_scored', 'team_conceded']

numeric_columns = player_match_data.select_dtypes(include=[np.number]).columns.tolist()
redundant_numeric_columns = ['starter', 'minutes', 'x', 'y','shirtnumber']
numeric_columns = [col for col in numeric_columns if col not in redundant_numeric_columns]

agg_dict = {col: 'sum' for col in numeric_columns if col != 'age' and col not in team_match_columns}
agg_dict['age'] = 'mean'

team_match_data = player_match_data.groupby(team_match_columns).agg(agg_dict).reset_index()

#Let's check whether we created a proper dataframe with the right columns and values. 
#We will explore which team fielded the oldest squad this season.
team_match_data.sort_values('age', ascending = False)

Unnamed: 0,dayofweek,date,start_time,home_team,home_xg,away_xg,away_team,attendance,venue,referee,league,gameweek,manager,team,formation,home_score,away_score,home_win,draw,away_win,team_winner,player_win,player_lose,points,team_xg,against_xg,team_scored,team_conceded,goals,assists,pens_made,pens_att,shots,shots_on_target,cards_yellow,cards_red,touches,tackles,interceptions,blocks,xg,npxg,xg_assist,sca,gca,passes_completed,passes,progressive_passes,carries,progressive_carries,take_ons,take_ons_won,passes_total_distance,passes_progressive_distance,passes_completed_short,passes_short,passes_completed_medium,passes_medium,passes_completed_long,passes_long,pass_xa,assisted_shots,passes_into_final_third,passes_into_penalty_area,crosses_into_penalty_area,passes_live,passes_dead,passes_free_kicks,through_balls,passes_switches,crosses,throw_ins,corner_kicks,corner_kicks_in,corner_kicks_out,corner_kicks_straight,passes_offsides,passes_blocked,tackles_won,tackles_def_3rd,tackles_mid_3rd,tackles_att_3rd,challenge_tackles,challenges,challenges_lost,blocked_shots,blocked_passes,tackles_interceptions,clearances,errors,touches_def_pen_area,touches_def_3rd,touches_mid_3rd,touches_att_3rd,touches_att_pen_area,touches_live_ball,take_ons_tackled,carries_distance,carries_progressive_distance,carries_into_final_third,carries_into_penalty_area,miscontrols,dispossessed,passes_received,progressive_passes_received,cards_yellow_red,fouls,fouled,offsides,pens_won,pens_conceded,own_goals,ball_recoveries,aerials_won,aerials_lost,age
24,Sat,2023-08-12,15:00,Everton,2.7,1.5,Fulham,39940,Goodison Park,Stuart Attwell,Premier-League,1,Marco Silva,Fulham,4-3-3,0,1,0,0,1,Fulham,1,0,3,1.5,2.7,1,0,1,0,0,0,9,2,2,0,691.0,10.0,4,10.0,1.5,1.5,0.7,16.0,2.0,498.0,600.0,38.0,507.0,14.0,11.0,6.0,8460.0,2970.0,221.0,248.0,221.0,252.0,42.0,67.0,0.4,7.0,33.0,9.0,3.0,545.0,52.0,18.0,0.0,2.0,11,22.0,4.0,1.0,1.0,0.0,3.0,12.0,4,5.0,4.0,1.0,7.0,14.0,7.0,4.0,6.0,14.0,31.0,0.0,93.0,284.0,295.0,119.0,18.0,691.0,4.0,2441.0,1270.0,14.0,2.0,8.0,9.0,495.0,38.0,0,6,12,3,0.0,0.0,0,43.0,14.0,9.0,29.752786
76,Sat,2023-09-23,15:00,Crystal Palace,0.3,0.6,Fulham,25072,Selhurst Park,Paul Tierney,Premier-League,6,Marco Silva,Fulham,4-3-3,0,0,0,1,0,draw,0,0,1,0.6,0.3,0,0,0,0,0,0,10,5,2,0,620.0,27.0,9,9.0,0.5,0.5,0.3,19.0,0.0,404.0,502.0,49.0,327.0,24.0,13.0,7.0,7060.0,2285.0,185.0,211.0,165.0,193.0,42.0,72.0,0.4,8.0,35.0,11.0,2.0,469.0,32.0,12.0,1.0,6.0,18,15.0,2.0,0.0,0.0,0.0,1.0,8.0,11,16.0,8.0,3.0,12.0,27.0,15.0,2.0,7.0,36.0,20.0,0.0,51.0,204.0,270.0,149.0,14.0,620.0,5.0,2416.0,1425.0,15.0,3.0,16.0,9.0,396.0,47.0,0,15,10,1,0.0,0.0,0,66.0,16.0,13.0,29.520500
14,Mon,2023-10-02,20:00,Fulham,1.1,1.7,Chelsea,24445,Craven Cottage,Tim Robinson,Premier-League,7,Marco Silva,Fulham,4-3-3,0,2,0,0,1,Chelsea,0,1,0,1.1,1.7,0,2,0,0,0,0,10,3,1,0,710.0,16.0,13,10.0,1.1,1.1,1.0,19.0,0.0,530.0,617.0,48.0,432.0,21.0,10.0,3.0,9648.0,3340.0,248.0,270.0,219.0,235.0,59.0,90.0,0.9,9.0,42.0,10.0,3.0,562.0,53.0,15.0,1.0,6.0,22,21.0,8.0,2.0,4.0,0.0,2.0,12.0,7,8.0,5.0,3.0,7.0,15.0,8.0,1.0,9.0,29.0,5.0,1.0,62.0,262.0,276.0,178.0,26.0,710.0,6.0,2429.0,1420.0,8.0,7.0,15.0,18.0,525.0,48.0,0,15,11,2,0.0,0.0,0,46.0,9.0,8.0,29.520467
17,Mon,2023-10-23,20:00,Tottenham,1.5,1.0,Fulham,61286,Tottenham Hotspur Stadium,Anthony Taylor,Premier-League,9,Marco Silva,Fulham,4-2-3-1,2,0,1,0,0,Tottenham,0,1,0,1.0,1.5,0,2,0,0,0,0,10,3,0,0,618.0,19.0,12,24.0,1.1,1.1,1.2,18.0,0.0,419.0,513.0,27.0,385.0,10.0,24.0,12.0,8011.0,2609.0,169.0,183.0,170.0,194.0,66.0,97.0,0.8,9.0,29.0,6.0,2.0,453.0,54.0,14.0,3.0,6.0,19,22.0,5.0,2.0,3.0,0.0,6.0,13.0,11,9.0,7.0,3.0,7.0,16.0,9.0,6.0,18.0,31.0,13.0,2.0,72.0,241.0,258.0,124.0,12.0,618.0,10.0,1575.0,682.0,8.0,3.0,14.0,9.0,417.0,27.0,0,14,10,6,0.0,0.0,0,57.0,4.0,5.0,29.463250
66,Sat,2023-09-16,15:00,Fulham,1.0,1.1,Luton Town,24467,Craven Cottage,Michael Salisbury,Premier-League,5,Marco Silva,Fulham,4-3-3,1,0,1,0,0,Fulham,1,0,3,1.0,1.1,1,0,1,0,0,0,13,2,2,0,947.0,17.0,7,6.0,1.0,1.0,0.5,25.0,2.0,765.0,858.0,68.0,634.0,19.0,14.0,4.0,13420.0,3603.0,325.0,349.0,372.0,403.0,53.0,78.0,0.9,10.0,68.0,10.0,4.0,809.0,49.0,13.0,3.0,6.0,24,23.0,6.0,2.0,2.0,0.0,0.0,6.0,9,8.0,5.0,4.0,7.0,11.0,4.0,1.0,5.0,24.0,10.0,0.0,31.0,140.0,573.0,240.0,21.0,947.0,8.0,3051.0,1457.0,19.0,2.0,19.0,12.0,760.0,68.0,0,6,13,0,0.0,0.0,0,51.0,11.0,18.0,29.450937
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
105,Sat,2023-10-07,15:00,Burnley,0.7,1.9,Chelsea,21654,Turf Moor,Stuart Attwell,Premier-League,8,Vincent Kompany,Burnley,4-3-3,1,4,0,0,1,Chelsea,0,1,0,0.7,1.9,1,4,1,1,0,0,10,3,2,0,532.0,24.0,4,6.0,0.7,0.7,0.6,19.0,2.0,353.0,432.0,13.0,333.0,11.0,24.0,8.0,6275.0,2149.0,162.0,182.0,153.0,172.0,35.0,61.0,0.6,8.0,9.0,3.0,1.0,382.0,46.0,9.0,1.0,1.0,9,20.0,7.0,4.0,0.0,0.0,4.0,7.0,15,14.0,4.0,6.0,14.0,22.0,8.0,2.0,4.0,28.0,10.0,0.0,78.0,250.0,190.0,96.0,8.0,532.0,12.0,1710.0,887.0,7.0,2.0,13.0,10.0,350.0,12.0,0,11,8,4,0.0,1.0,1,41.0,10.0,9.0,24.134750
4,Fri,2023-08-25,20:00,Chelsea,2.2,0.4,Luton Town,39893,Stamford Bridge,Robert Jones,Premier-League,3,Mauricio Pochettino,Chelsea,3-4-3,3,0,1,0,0,Chelsea,1,0,3,2.2,0.4,3,0,3,2,0,0,19,8,2,0,789.0,10.0,9,12.0,2.4,2.4,1.6,35.0,6.0,621.0,684.0,42.0,488.0,24.0,14.0,7.0,10405.0,3382.0,290.0,307.0,265.0,283.0,47.0,67.0,1.6,14.0,33.0,12.0,0.0,642.0,42.0,11.0,4.0,3.0,16,13.0,6.0,0.0,4.0,0.0,0.0,3.0,6,7.0,1.0,2.0,4.0,11.0,7.0,5.0,7.0,19.0,11.0,0.0,81.0,262.0,361.0,170.0,37.0,789.0,7.0,2140.0,1140.0,12.0,8.0,19.0,11.0,619.0,42.0,0,15,11,0,0.0,0.0,0,47.0,14.0,6.0,24.098214
133,Sun,2023-08-13,16:30,Chelsea,1.4,1.3,Liverpool,40096,Stamford Bridge,Anthony Taylor,Premier-League,1,Mauricio Pochettino,Chelsea,3-4-3,1,1,0,1,0,draw,0,0,1,1.4,1.3,1,1,1,1,0,0,10,4,3,0,862.0,13.0,7,14.0,1.4,1.4,1.4,19.0,2.0,645.0,755.0,57.0,631.0,22.0,19.0,10.0,10545.0,3694.0,331.0,364.0,236.0,265.0,53.0,75.0,1.1,9.0,51.0,13.0,4.0,700.0,52.0,17.0,4.0,3.0,26,23.0,4.0,1.0,1.0,0.0,3.0,20.0,9,6.0,5.0,2.0,6.0,16.0,10.0,5.0,9.0,20.0,18.0,0.0,57.0,234.0,401.0,232.0,32.0,862.0,5.0,2843.0,1360.0,18.0,6.0,25.0,14.0,640.0,57.0,0,5,13,3,0.0,0.0,0,63.0,11.0,6.0,24.080533
137,Sun,2023-08-20,16:30,West Ham,1.8,2.5,Chelsea,62451,London Stadium,John Brooks,Premier-League,2,Mauricio Pochettino,Chelsea,3-4-3,3,1,1,0,0,West Ham,0,1,0,2.5,1.8,1,3,1,0,0,1,16,3,3,0,872.0,16.0,3,8.0,2.4,1.6,1.3,30.0,2.0,661.0,773.0,57.0,587.0,37.0,31.0,15.0,9656.0,2808.0,391.0,424.0,219.0,249.0,31.0,51.0,1.5,11.0,44.0,13.0,5.0,723.0,48.0,14.0,5.0,3.0,34,16.0,9.0,6.0,1.0,0.0,2.0,13.0,11,5.0,8.0,3.0,3.0,8.0,5.0,3.0,5.0,19.0,13.0,1.0,53.0,181.0,390.0,307.0,49.0,871.0,6.0,3247.0,1842.0,33.0,14.0,12.0,8.0,657.0,57.0,0,9,12,2,1.0,1.0,0,53.0,19.0,11.0,24.015733
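
The aggregation pattern itself is easier to see on a toy table. The sketch below uses made-up player rows (the "shots" and "age" values are illustrative, not from the file) and groups on only two keys. The `dropna=False` argument is my addition: by default `groupby` silently drops rows whose key columns contain missing values, which could matter if a key such as "attendance" is missing for some match.

```python
import pandas as pd

# Hypothetical player rows; values are illustrative, not from the file.
players = pd.DataFrame({
    "date": ["2023-08-11"] * 4,
    "team": ["Burnley", "Burnley", "Manchester City", "Manchester City"],
    "shots": [1, 2, 3, 4],
    "age": [22.0, 24.0, 23.0, 29.0],
})

# Sum the counting stats, average the age; keep groups with missing keys.
team_rows = (
    players.groupby(["date", "team"], dropna=False)
    .agg({"shots": "sum", "age": "mean"})
    .reset_index()
)

print(team_rows)
```

Keep in mind that the mean age here is an unweighted average over all players with a row for the match, regardless of minutes played.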


In [19]:
#Since the dataframe contains data for each team in each match, 
#we will export it under the name "team_match_data.csv".

team_match_data.to_csv('team_match_data.csv', index = False)

In [20]:
#Let's create data for each match. 
#We will sum the values for each match, 
#except for the "age" column, where the mean will be applied, 
#and the "draw" column, where the first value will be kept.

match_columns = ['dayofweek', 'date', 'start_time', 'home_team', 'home_xg',
       'away_xg', 'away_team', 'attendance', 'venue', 'referee', 'league',
       'gameweek','home_score', 'away_score', 'team_winner']

numeric_columns = team_match_data.select_dtypes(include=[np.number]).columns.tolist()
redundant_numeric_columns = ['player_win', 'player_lose','points', 'against_xg', 'team_conceded']
numeric_columns = [col for col in numeric_columns if col not in redundant_numeric_columns]

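#Note: "home_win" and "away_win" are summed over the two team rows of each match, 
#so a decided match shows 2 instead of 1 in these columns of the result.
agg_dict = {col: 'sum' for col in numeric_columns if col != 'age' and col != 'draw' and col not in match_columns}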
agg_dict['age'] = 'mean'
agg_dict['draw'] = 'first'

match_data = team_match_data.groupby(match_columns).agg(agg_dict).reset_index()

match_data.columns
match_data.sort_values('team_scored', ascending = False)


Unnamed: 0,dayofweek,date,start_time,home_team,home_xg,away_xg,away_team,attendance,venue,referee,league,gameweek,home_score,away_score,team_winner,home_win,away_win,team_xg,team_scored,goals,assists,pens_made,pens_att,shots,shots_on_target,cards_yellow,cards_red,touches,tackles,interceptions,blocks,xg,npxg,xg_assist,sca,gca,passes_completed,passes,progressive_passes,carries,progressive_carries,take_ons,take_ons_won,passes_total_distance,passes_progressive_distance,passes_completed_short,passes_short,passes_completed_medium,passes_medium,passes_completed_long,passes_long,pass_xa,assisted_shots,passes_into_final_third,passes_into_penalty_area,crosses_into_penalty_area,passes_live,passes_dead,passes_free_kicks,through_balls,passes_switches,crosses,throw_ins,corner_kicks,corner_kicks_in,corner_kicks_out,corner_kicks_straight,passes_offsides,passes_blocked,tackles_won,tackles_def_3rd,tackles_mid_3rd,tackles_att_3rd,challenge_tackles,challenges,challenges_lost,blocked_shots,blocked_passes,tackles_interceptions,clearances,errors,touches_def_pen_area,touches_def_3rd,touches_mid_3rd,touches_att_3rd,touches_att_pen_area,touches_live_ball,take_ons_tackled,carries_distance,carries_progressive_distance,carries_into_final_third,carries_into_penalty_area,miscontrols,dispossessed,passes_received,progressive_passes_received,cards_yellow_red,fouls,fouled,offsides,pens_won,pens_conceded,own_goals,ball_recoveries,aerials_won,aerials_lost,age,draw
81,Sun,2023-09-24,16:30,Sheffield Utd,0.9,3.9,Newcastle Utd,31127,Bramall Lane,Stuart Attwell,Premier-League,6,0,8,Newcastle Utd,0,2,4.8,8,8,6,0,0,31,16,4,0,1353.0,38.0,19,24.0,5.0,5.0,3.3,53.0,14.0,918.0,1116.0,86.0,762.0,35.0,35.0,20.0,14818.0,5231.0,471.0,528.0,358.0,402.0,63.0,128.0,1.9,24.0,47.0,19.0,3.0,1026.0,86.0,24.0,6.0,7.0,23,33.0,7.0,3.0,2.0,1.0,4.0,16.0,20,19.0,15.0,4.0,12.0,32.0,20.0,7.0,17.0,57.0,32.0,2.0,126.0,437.0,655.0,271.0,54.0,1353.0,12.0,3677.0,1647.0,25.0,13.0,30.0,26.0,913.0,86.0,0,20,19,4,0.0,0.0,0,119.0,20.0,20.0,26.265692,0
27,Sat,2023-09-02,15:00,Burnley,1.3,2.2,Tottenham,21750,Turf Moor,Darren England,Premier-League,4,2,5,Tottenham,0,2,3.5,7,7,6,0,0,37,15,7,0,1347.0,46.0,21,24.0,3.5,3.5,3.0,64.0,12.0,918.0,1101.0,75.0,880.0,51.0,54.0,23.0,15092.0,5114.0,450.0,495.0,359.0,401.0,84.0,148.0,2.4,26.0,47.0,24.0,5.0,1003.0,95.0,26.0,5.0,6.0,19,33.0,11.0,5.0,1.0,1.0,3.0,15.0,29,19.0,18.0,9.0,26.0,49.0,23.0,11.0,13.0,67.0,30.0,3.0,220.0,526.0,502.0,329.0,80.0,1347.0,26.0,3992.0,1959.0,37.0,26.0,24.0,20.0,913.0,75.0,0,24,24,3,0.0,0.0,0,118.0,17.0,17.0,25.586688,0
43,Sat,2023-09-30,12:30,Aston Villa,1.6,1.7,Brighton,40636,Villa Park,Andy Madley,Premier-League,7,6,1,Aston Villa,2,0,3.3,7,6,5,0,0,30,12,8,0,1101.0,40.0,12,25.0,3.1,3.1,2.7,52.0,11.0,748.0,898.0,68.0,691.0,37.0,43.0,19.0,12314.0,4090.0,347.0,379.0,331.0,375.0,47.0,94.0,2.8,22.0,54.0,18.0,1.0,787.0,105.0,45.0,6.0,4.0,16,34.0,4.0,2.0,2.0,0.0,6.0,15.0,26,9.0,29.0,2.0,18.0,37.0,19.0,11.0,14.0,52.0,25.0,1.0,109.0,340.0,519.0,250.0,53.0,1101.0,18.0,3711.0,1841.0,15.0,15.0,19.0,22.0,740.0,67.0,0,40,39,6,0.0,0.0,1,73.0,13.0,13.0,25.673174,0
29,Sat,2023-09-02,15:00,Manchester City,2.2,1.4,Fulham,52899,Etihad Stadium,Michael Oliver,Premier-League,4,5,1,Manchester City,2,0,3.6,6,6,3,1,1,12,8,6,0,1285.0,28.0,15,12.0,4.2,3.4,1.8,23.0,10.0,971.0,1133.0,55.0,731.0,25.0,27.0,16.0,17226.0,5008.0,420.0,468.0,419.0,458.0,102.0,151.0,0.6,9.0,54.0,11.0,1.0,1048.0,83.0,25.0,4.0,6.0,20,36.0,9.0,1.0,7.0,0.0,2.0,14.0,19,13.0,13.0,2.0,10.0,26.0,16.0,1.0,11.0,43.0,23.0,1.0,104.0,415.0,667.0,213.0,31.0,1284.0,10.0,3780.0,2028.0,15.0,6.0,19.0,18.0,956.0,55.0,0,24,22,2,1.0,1.0,0,93.0,14.0,14.0,27.193406,0
14,Sat,2023-08-12,17:30,Newcastle Utd,3.3,1.8,Aston Villa,52207,St James' Park,Andy Madley,Premier-League,1,5,1,Newcastle Utd,2,0,5.1,6,6,5,0,0,33,18,8,0,1234.0,29.0,13,20.0,5.4,5.4,4.2,61.0,12.0,864.0,1027.0,70.0,794.0,37.0,36.0,18.0,14092.0,4964.0,427.0,465.0,361.0,413.0,55.0,99.0,3.0,27.0,66.0,13.0,2.0,925.0,100.0,30.0,12.0,4.0,27,38.0,11.0,4.0,4.0,0.0,2.0,17.0,19,15.0,11.0,3.0,8.0,26.0,18.0,6.0,14.0,42.0,29.0,1.0,126.0,434.0,531.0,278.0,52.0,1234.0,8.0,3831.0,1870.0,28.0,11.0,27.0,21.0,858.0,70.0,0,29,28,2,0.0,0.0,0,93.0,11.0,11.0,27.167531,0
72,Sun,2023-09-03,14:00,Crystal Palace,2.1,1.2,Wolves,24741,Selhurst Park,Robert Jones,Premier-League,4,3,2,Crystal Palace,2,0,3.3,5,5,5,0,0,28,14,4,0,1221.0,50.0,18,15.0,3.4,3.4,3.0,49.0,10.0,802.0,1005.0,82.0,624.0,27.0,36.0,14.0,13035.0,4647.0,396.0,443.0,320.0,367.0,56.0,141.0,1.5,22.0,59.0,21.0,7.0,905.0,97.0,26.0,2.0,6.0,39,38.0,6.0,3.0,2.0,0.0,3.0,9.0,31,21.0,19.0,10.0,20.0,34.0,14.0,4.0,11.0,68.0,34.0,1.0,125.0,380.0,545.0,308.0,48.0,1221.0,20.0,3518.0,2014.0,25.0,9.0,26.0,30.0,796.0,82.0,0,24,20,3,0.0,0.0,0,116.0,25.0,25.0,26.877582,0
11,Sat,2023-08-12,15:00,Brighton,4.0,1.5,Luton Town,31872,The American Express Community Stadium,David Coote,Premier-League,1,4,1,Brighton,2,0,5.5,5,5,2,2,2,34,13,4,0,1164.0,30.0,13,18.0,5.4,3.9,3.5,64.0,7.0,768.0,927.0,91.0,746.0,42.0,30.0,7.0,13572.0,5480.0,368.0,406.0,307.0,338.0,75.0,137.0,3.9,30.0,57.0,23.0,8.0,831.0,91.0,25.0,1.0,6.0,45,27.0,13.0,2.0,8.0,0.0,5.0,13.0,22,14.0,12.0,4.0,16.0,23.0,7.0,7.0,11.0,43.0,42.0,2.0,160.0,363.0,499.0,308.0,68.0,1162.0,16.0,3831.0,2003.0,29.0,16.0,32.0,14.0,761.0,91.0,0,23,21,5,1.0,2.0,0,91.0,33.0,33.0,27.069,0
17,Sat,2023-08-19,15:00,Wolves,2.1,2.2,Brighton,31317,Molineux Stadium,Andy Madley,Premier-League,2,1,4,Brighton,0,2,4.3,5,5,5,0,0,32,13,11,1,1314.0,45.0,16,21.0,4.6,4.6,4.0,60.0,10.0,920.0,1094.0,76.0,818.0,47.0,58.0,37.0,14120.0,4842.0,499.0,541.0,325.0,361.0,59.0,118.0,3.1,27.0,58.0,13.0,1.0,989.0,94.0,35.0,8.0,2.0,25,30.0,8.0,7.0,0.0,0.0,11.0,16.0,32,26.0,17.0,2.0,18.0,55.0,37.0,7.0,14.0,61.0,28.0,0.0,160.0,458.0,604.0,272.0,63.0,1314.0,18.0,4428.0,2228.0,26.0,23.0,28.0,27.0,910.0,76.0,1,25,24,11,0.0,0.0,0,113.0,10.0,10.0,27.268773,0
23,Sat,2023-08-26,15:00,Manchester Utd,2.8,1.2,Nott'ham Forest,73595,Old Trafford,Stuart Attwell,Premier-League,3,3,2,Manchester Utd,2,0,4.0,5,5,3,1,1,26,12,8,1,1255.0,34.0,21,17.0,4.0,3.2,2.9,52.0,10.0,813.0,1034.0,90.0,774.0,45.0,38.0,17.0,13274.0,4866.0,422.0,479.0,299.0,359.0,67.0,139.0,2.1,22.0,73.0,15.0,1.0,939.0,90.0,28.0,4.0,2.0,35,29.0,14.0,2.0,9.0,0.0,5.0,13.0,22,23.0,8.0,3.0,15.0,32.0,17.0,5.0,12.0,55.0,56.0,0.0,136.0,380.0,514.0,379.0,48.0,1254.0,15.0,3779.0,2183.0,34.0,12.0,27.0,19.0,801.0,86.0,0,24,24,5,1.0,1.0,0,120.0,32.0,32.0,27.198186,0
52,Sat,2023-10-07,15:00,Burnley,0.7,1.9,Chelsea,21654,Turf Moor,Stuart Attwell,Premier-League,8,1,4,Chelsea,0,2,2.6,5,4,3,1,1,18,7,6,0,1331.0,46.0,13,16.0,2.6,1.8,1.7,37.0,8.0,985.0,1130.0,50.0,818.0,23.0,48.0,16.0,17563.0,5211.0,418.0,455.0,459.0,493.0,91.0,141.0,1.3,16.0,51.0,12.0,2.0,1038.0,86.0,23.0,7.0,9.0,22,38.0,10.0,5.0,1.0,0.0,6.0,12.0,28,20.0,10.0,16.0,26.0,42.0,16.0,6.0,10.0,59.0,20.0,0.0,157.0,535.0,565.0,242.0,35.0,1330.0,26.0,3502.0,1689.0,12.0,9.0,31.0,20.0,978.0,48.0,0,19,18,6,1.0,1.0,1,92.0,19.0,19.0,24.338625,0


In [21]:
#Since the dataframe contains one row of data for each match, 
#we will export it as "match_data.csv".

match_data.to_csv('match_data.csv', index = False)

In [22]:
#Let's create data for each team. 
#We will calculate sum of values for each team, 
#except for "age" column where calculation of mean will be applied.

team_columns = ['team']

numeric_columns = team_match_data.select_dtypes(include=[np.number]).columns.tolist()
redundant_numeric_columns = ['home_xg', 'away_xg', 'gameweek', 'home_score', 'away_score', 'home_win', 'away_win', 'player_win', 'player_lose']
numeric_columns = [col for col in numeric_columns if col not in redundant_numeric_columns]

agg_dict = {col: 'sum' for col in numeric_columns if col != 'age' and col not in team_columns}
agg_dict['age'] = 'mean'


grouped_data = team_match_data.groupby(team_columns)
team_data = grouped_data.agg(agg_dict).reset_index()
team_data['matches_played'] = grouped_data.size().values

#Let's check that the dataframe has the proper columns and values. 
#We will show the league table with total values for each column, except age, which is shown as the mean value.
team_data.sort_values('points', ascending = False)

Unnamed: 0,team,attendance,draw,points,team_xg,against_xg,team_scored,team_conceded,goals,assists,pens_made,pens_att,shots,shots_on_target,cards_yellow,cards_red,touches,tackles,interceptions,blocks,xg,npxg,xg_assist,sca,gca,passes_completed,passes,progressive_passes,carries,progressive_carries,take_ons,take_ons_won,passes_total_distance,passes_progressive_distance,passes_completed_short,passes_short,passes_completed_medium,passes_medium,passes_completed_long,passes_long,pass_xa,assisted_shots,passes_into_final_third,passes_into_penalty_area,crosses_into_penalty_area,passes_live,passes_dead,passes_free_kicks,through_balls,passes_switches,crosses,throw_ins,corner_kicks,corner_kicks_in,corner_kicks_out,corner_kicks_straight,passes_offsides,passes_blocked,tackles_won,tackles_def_3rd,tackles_mid_3rd,tackles_att_3rd,challenge_tackles,challenges,challenges_lost,blocked_shots,blocked_passes,tackles_interceptions,clearances,errors,touches_def_pen_area,touches_def_3rd,touches_mid_3rd,touches_att_3rd,touches_att_pen_area,touches_live_ball,take_ons_tackled,carries_distance,carries_progressive_distance,carries_into_final_third,carries_into_penalty_area,miscontrols,dispossessed,passes_received,progressive_passes_received,cards_yellow_red,fouls,fouled,offsides,pens_won,pens_conceded,own_goals,ball_recoveries,aerials_won,aerials_lost,age,matches_played
17,Tottenham,366732,2,23,16.4,11.9,20,8,18,16,0,0,168,61,31,1,6657.0,179.0,78,109.0,16.5,16.5,13.9,295.0,34.0,4775.0,5577.0,513.0,4355.0,250.0,230.0,105.0,74555.0,25439.0,2525.0,2710.0,1827.0,2059.0,262.0,469.0,13.3,129.0,378.0,116.0,12.0,5107.0,451.0,150.0,28.0,18.0,151,170.0,56.0,22.0,16.0,0.0,19.0,118.0,113,85.0,62.0,32.0,88.0,166.0,78.0,33.0,76.0,257.0,173.0,2.0,680.0,2045.0,2759.0,1939.0,354.0,6657.0,99.0,23294.0,11931.0,162.0,94.0,121.0,90.0,4737.0,510.0,1,102,120,19,0.0,2.0,1,488.0,82.0,102.0,25.440241,9
0,Arsenal,414848,3,21,15.9,7.8,18,8,17,10,5,5,126,42,17,1,6403.0,158.0,73,115.0,16.1,12.2,8.9,236.0,29.0,4717.0,5509.0,448.0,4141.0,186.0,173.0,78.0,79839.0,25010.0,2244.0,2461.0,2040.0,2256.0,344.0,570.0,8.8,100.0,366.0,108.0,15.0,5107.0,383.0,97.0,21.0,23.0,162,145.0,76.0,57.0,1.0,0.0,19.0,72.0,89,61.0,67.0,30.0,79.0,136.0,57.0,32.0,83.0,231.0,98.0,5.0,510.0,1776.0,2729.0,1943.0,289.0,6398.0,75.0,20562.0,10540.0,134.0,57.0,110.0,91.0,4662.0,441.0,1,90,92,19,4.0,1.0,0,429.0,99.0,108.0,25.239386,9
10,Liverpool,420911,2,20,19.0,11.7,20,9,18,11,3,4,151,41,18,4,6424.0,151.0,79,110.0,18.7,15.4,12.8,278.0,29.0,4568.0,5408.0,417.0,3904.0,207.0,181.0,78.0,76788.0,26706.0,2181.0,2373.0,1919.0,2146.0,358.0,630.0,10.1,116.0,300.0,92.0,12.0,4976.0,415.0,103.0,25.0,36.0,141,167.0,59.0,23.0,31.0,0.0,17.0,85.0,96,62.0,60.0,29.0,77.0,187.0,110.0,28.0,82.0,230.0,181.0,5.0,604.0,1963.0,2948.0,1568.0,275.0,6420.0,88.0,18778.0,10229.0,141.0,63.0,124.0,80.0,4538.0,413.0,1,108,93,17,3.0,0.0,1,503.0,117.0,86.0,26.433474,9
1,Aston Villa,359905,1,19,17.0,14.1,23,13,22,15,3,3,134,51,22,0,5335.0,145.0,60,97.0,17.3,14.9,11.9,235.0,38.0,3673.0,4425.0,335.0,3292.0,186.0,172.0,75.0,64575.0,22126.0,1587.0,1769.0,1694.0,1915.0,316.0,550.0,8.2,102.0,300.0,92.0,21.0,3979.0,438.0,151.0,18.0,32.0,158,151.0,55.0,33.0,4.0,1.0,8.0,76.0,83,71.0,61.0,13.0,69.0,151.0,82.0,28.0,69.0,205.0,135.0,1.0,636.0,1859.0,2264.0,1258.0,248.0,5332.0,65.0,18161.0,9691.0,108.0,56.0,122.0,72.0,3607.0,333.0,0,103,118,8,3.0,0.0,1,424.0,62.0,80.0,26.546437,9
12,Manchester City,366809,0,18,14.7,5.9,18,7,18,13,1,2,125,51,18,2,6183.0,116.0,48,66.0,15.1,13.5,10.8,234.0,32.0,4830.0,5475.0,381.0,4173.0,216.0,134.0,69.0,81321.0,22458.0,2282.0,2426.0,2050.0,2226.0,369.0,567.0,10.6,101.0,391.0,89.0,11.0,5146.0,323.0,95.0,26.0,20.0,136,124.0,50.0,12.0,23.0,0.0,6.0,80.0,68,51.0,45.0,20.0,55.0,99.0,44.0,13.0,53.0,164.0,88.0,3.0,447.0,1480.0,3023.0,1711.0,232.0,6181.0,50.0,21918.0,12624.0,169.0,56.0,91.0,58.0,4761.0,376.0,1,66,92,6,1.0,0.0,1,357.0,64.0,59.0,26.185154,8
14,Newcastle Utd,385187,1,16,19.4,8.4,24,8,24,16,2,2,118,54,18,0,5417.0,121.0,42,101.0,19.5,17.9,13.5,213.0,41.0,3836.0,4603.0,338.0,3236.0,159.0,144.0,78.0,65279.0,21747.0,1814.0,2017.0,1630.0,1859.0,315.0,531.0,8.6,89.0,242.0,79.0,15.0,4237.0,361.0,116.0,17.0,35.0,126,136.0,40.0,13.0,12.0,1.0,5.0,64.0,78,56.0,51.0,14.0,54.0,128.0,74.0,27.0,74.0,163.0,117.0,3.0,443.0,1705.0,2529.0,1228.0,223.0,5415.0,43.0,16357.0,7918.0,114.0,53.0,109.0,83.0,3814.0,336.0,0,93,101,5,2.0,0.0,0,409.0,100.0,88.0,27.349229,8
4,Brighton,357380,1,16,17.4,14.6,22,18,21,16,1,1,135,56,25,0,6379.0,147.0,84,102.0,17.2,16.5,13.4,251.0,38.0,4703.0,5404.0,389.0,4180.0,203.0,190.0,79.0,77094.0,25237.0,2331.0,2522.0,1891.0,2088.0,345.0,534.0,11.3,114.0,280.0,93.0,18.0,4983.0,396.0,137.0,17.0,23.0,130,121.0,49.0,18.0,13.0,0.0,25.0,69.0,88,69.0,61.0,17.0,60.0,171.0,111.0,33.0,69.0,231.0,136.0,4.0,734.0,2102.0,2873.0,1455.0,259.0,6378.0,87.0,20834.0,11194.0,124.0,62.0,132.0,79.0,4630.0,386.0,0,111,125,25,1.0,2.0,1,426.0,77.0,63.0,26.680794,9
13,Manchester Utd,542664,0,15,13.8,13.4,11,13,11,9,1,1,143,42,20,0,6101.0,151.0,62,109.0,13.5,12.7,10.9,258.0,21.0,4209.0,5129.0,454.0,3813.0,198.0,175.0,86.0,70047.0,22529.0,2078.0,2329.0,1596.0,1840.0,378.0,651.0,10.0,116.0,367.0,108.0,13.0,4697.0,406.0,95.0,18.0,40.0,178,172.0,62.0,20.0,26.0,2.0,26.0,87.0,88,69.0,55.0,27.0,81.0,177.0,96.0,31.0,78.0,213.0,161.0,1.0,574.0,1750.0,2621.0,1791.0,254.0,6100.0,65.0,19138.0,10573.0,133.0,68.0,120.0,93.0,4159.0,450.0,0,93,89,26,1.0,1.0,1,517.0,92.0,110.0,26.838641,9
18,West Ham,395283,2,14,13.3,18.3,16,16,16,14,1,1,104,39,22,1,4574.0,151.0,93,129.0,13.1,12.3,9.4,188.0,31.0,2699.0,3527.0,256.0,2441.0,131.0,145.0,79.0,47165.0,19270.0,1293.0,1455.0,996.0,1189.0,303.0,612.0,8.7,78.0,190.0,56.0,14.0,3110.0,395.0,96.0,26.0,38.0,132,139.0,43.0,26.0,17.0,0.0,22.0,94.0,99,81.0,53.0,17.0,71.0,146.0,75.0,51.0,78.0,244.0,246.0,1.0,676.0,1768.0,1890.0,958.0,198.0,4573.0,49.0,13538.0,6358.0,91.0,47.0,125.0,78.0,2676.0,254.0,1,100,91,22,1.0,3.0,0,421.0,137.0,140.0,28.578322,9
6,Chelsea,318203,3,12,16.4,9.1,13,9,12,7,2,3,119,39,29,1,6763.0,150.0,60,97.0,16.7,14.3,9.9,213.0,21.0,4976.0,5790.0,397.0,4543.0,227.0,191.0,84.0,81248.0,26930.0,2477.0,2692.0,1979.0,2195.0,364.0,570.0,9.4,85.0,354.0,99.0,15.0,5353.0,410.0,121.0,26.0,30.0,167,167.0,44.0,14.0,17.0,0.0,27.0,95.0,87,70.0,51.0,29.0,63.0,127.0,64.0,36.0,61.0,210.0,154.0,3.0,650.0,2145.0,2998.0,1678.0,293.0,6760.0,75.0,21488.0,11216.0,135.0,74.0,163.0,93.0,4934.0,394.0,0,95,107,27,2.0,1.0,0,483.0,119.0,83.0,24.183063,9


In [23]:
#Since the dataframe contains one row of data for each team, 
#we will export it as "team_data.csv".

team_data.to_csv('team_data.csv', index = False)
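
The aggregation pattern used above (sum for every numeric column, mean for "age", plus a match count from the group sizes) can be tried on a small made-up table. This is only a sketch: the dataframe `toy_team_match` and its values are invented for illustration and are not part of the downloaded data.

```python
import pandas as pd

# Toy stand-in for team_match_data: one row per team per match (made-up values).
toy_team_match = pd.DataFrame({
    "team": ["A", "A", "B", "B", "B"],
    "goals": [2, 1, 0, 3, 1],
    "points": [3, 1, 0, 3, 1],
    "age": [26.0, 27.0, 24.0, 25.0, 26.0],
})

# Sum every numeric column except age, which is averaged instead.
numeric_cols = toy_team_match.select_dtypes(include="number").columns.tolist()
toy_agg = {col: "sum" for col in numeric_cols if col != "age"}
toy_agg["age"] = "mean"

grouped = toy_team_match.groupby("team")
toy_team = grouped.agg(toy_agg).reset_index()

# groupby sorts the keys by default, so size() is in the same order as the aggregated rows.
toy_team["matches_played"] = grouped.size().values

print(toy_team.sort_values("points", ascending=False))
```

Because `agg` and `size` both come from the same grouped object, the match counts line up with the aggregated rows without an extra merge.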

In [26]:
#Let's create data for each player. 
#We will calculate the sum of values for each player across all matches, 
#except for the "age" column, where the last value will be applied.

player_columns = ['name']

numeric_columns = player_match_data.select_dtypes(include=[np.number]).columns.tolist()
redundant_numeric_columns = ['x', 'y','shirtnumber', 'home_xg', 'away_xg', 'gameweek', 'home_score', 'away_score', 'home_win', 'away_win', 'player_win', 'player_lose']
numeric_columns = [col for col in numeric_columns if col not in redundant_numeric_columns]

agg_dict = {col: 'sum' for col in numeric_columns if col != 'age' and col not in player_columns}
agg_dict['age'] = 'last'

player_data = player_match_data.groupby(player_columns).agg(agg_dict).reset_index()

#Let's check if we created proper dataframe with proper columns and values. 
#We will explore which player is currently oldest in the league.
player_data.sort_values('age', ascending = False)

Unnamed: 0,name,attendance,minutes,goals,assists,pens_made,pens_att,shots,shots_on_target,cards_yellow,cards_red,touches,tackles,interceptions,blocks,xg,npxg,xg_assist,sca,gca,passes_completed,passes,progressive_passes,carries,progressive_carries,take_ons,take_ons_won,passes_total_distance,passes_progressive_distance,passes_completed_short,passes_short,passes_completed_medium,passes_medium,passes_completed_long,passes_long,pass_xa,assisted_shots,passes_into_final_third,passes_into_penalty_area,crosses_into_penalty_area,passes_live,passes_dead,passes_free_kicks,through_balls,passes_switches,crosses,throw_ins,corner_kicks,corner_kicks_in,corner_kicks_out,corner_kicks_straight,passes_offsides,passes_blocked,tackles_won,tackles_def_3rd,tackles_mid_3rd,tackles_att_3rd,challenge_tackles,challenges,challenges_lost,blocked_shots,blocked_passes,tackles_interceptions,clearances,errors,touches_def_pen_area,touches_def_3rd,touches_mid_3rd,touches_att_3rd,touches_att_pen_area,touches_live_ball,take_ons_tackled,carries_distance,carries_progressive_distance,carries_into_final_third,carries_into_penalty_area,miscontrols,dispossessed,passes_received,progressive_passes_received,cards_yellow_red,fouls,fouled,offsides,pens_won,pens_conceded,own_goals,ball_recoveries,aerials_won,aerials_lost,starter,draw,points,team_xg,against_xg,team_scored,team_conceded,age
406,Thiago Silva,318203,810,0,0,0,0,2,1,2,0,837.0,12.0,7,10.0,0.1,0.1,0.3,8.0,1.0,734.0,768.0,31.0,601.0,1.0,1.0,1.0,13342.0,5124.0,301.0,312.0,363.0,374.0,64.0,73.0,0.1,4.0,42.0,5.0,0.0,740.0,28.0,18.0,0.0,2.0,1,1.0,0.0,0.0,0.0,0.0,0.0,2.0,5,10.0,1.0,1.0,9.0,13.0,4.0,8.0,2.0,19.0,40.0,0.0,119.0,405.0,409.0,26.0,7.0,837.0,0.0,2408.0,1318.0,6.0,0.0,2.0,0.0,628.0,1.0,0,5,3,0,0.0,0.0,0,53.0,20.0,7.0,9,3,12,16.4,9.1,13,9,39.079
43,Ashley Young,336112,739,0,0,0,0,4,1,5,1,507.0,15.0,6,8.0,0.2,0.2,0.4,19.0,0.0,308.0,441.0,34.0,230.0,15.0,10.0,3.0,5483.0,2467.0,134.0,151.0,133.0,160.0,35.0,102.0,0.5,8.0,25.0,8.0,1.0,322.0,116.0,20.0,2.0,1.0,43,78.0,18.0,10.0,7.0,1.0,3.0,12.0,10,9.0,4.0,2.0,10.0,19.0,9.0,3.0,5.0,21.0,13.0,2.0,24.0,141.0,233.0,136.0,1.0,507.0,6.0,1165.0,680.0,9.0,1.0,2.0,3.0,247.0,10.0,1,9,8,0,0.0,0.0,0,44.0,6.0,6.0,9,1,7,14.7,12.5,9,14,38.285
188,James Milner,253375,274,0,0,0,0,0,0,2,0,204.0,7.0,4,3.0,0.0,0.0,0.1,4.0,0.0,142.0,170.0,18.0,122.0,2.0,1.0,0.0,2204.0,758.0,80.0,88.0,53.0,57.0,6.0,14.0,0.2,1.0,12.0,3.0,0.0,147.0,21.0,3.0,2.0,0.0,7,18.0,0.0,0.0,0.0,0.0,2.0,5.0,6,4.0,3.0,0.0,2.0,3.0,1.0,0.0,3.0,11.0,9.0,0.0,13.0,49.0,94.0,62.0,1.0,204.0,1.0,345.0,134.0,2.0,0.0,3.0,1.0,126.0,15.0,0,7,4,0,0.0,0.0,0,13.0,2.0,6.0,4,0,12,11.5,9.6,16,9,37.795
412,Tim Ream,276706,693,1,0,0,0,1,1,3,1,639.0,14.0,7,8.0,1.0,1.0,0.0,3.0,0.0,517.0,574.0,41.0,414.0,10.0,3.0,2.0,10423.0,3695.0,162.0,172.0,289.0,314.0,62.0,77.0,0.1,0.0,50.0,1.0,0.0,551.0,22.0,11.0,0.0,6.0,1,0.0,0.0,0.0,0.0,0.0,1.0,4.0,8,10.0,3.0,1.0,4.0,9.0,5.0,3.0,5.0,21.0,25.0,2.0,66.0,276.0,343.0,23.0,6.0,639.0,1.0,2470.0,1679.0,4.0,1.0,4.0,3.0,452.0,2.0,1,5,2,0,0.0,1.0,0,51.0,13.0,10.0,8,1,10,9.3,13.8,6,13,36.049
218,Jonny Evans,186781,268,0,1,0,0,0,0,1,0,209.0,3.0,3,1.0,0.0,0.0,0.1,4.0,1.0,156.0,179.0,9.0,115.0,1.0,0.0,0.0,2433.0,596.0,71.0,80.0,74.0,81.0,5.0,10.0,0.1,3.0,11.0,2.0,0.0,164.0,13.0,3.0,1.0,0.0,0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,2,3.0,0.0,0.0,3.0,6.0,3.0,1.0,0.0,6.0,15.0,0.0,32.0,104.0,94.0,11.0,1.0,209.0,0.0,468.0,259.0,1.0,0.0,1.0,1.0,125.0,0.0,0,2,5,0,0.0,0.0,0,22.0,9.0,4.0,3,0,9,4.4,5.7,6,5,35.797
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
138,Facundo Buonanotte,31617,45,0,0,0,0,1,1,0,0,21.0,2.0,2,0.0,0.0,0.0,0.0,0.0,0.0,10.0,14.0,0.0,13.0,0.0,1.0,0.0,155.0,18.0,5.0,7.0,4.0,6.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,13.0,1.0,0.0,0.0,0.0,1,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,2.0,7.0,11.0,4.0,0.0,21.0,1.0,55.0,16.0,0.0,0.0,2.0,3.0,11.0,2.0,0,2,4,0,0.0,0.0,0,4.0,0.0,0.0,1,0,3,1.9,1.1,3,1,18.753
173,Jack Hinshelwood,40636,85,0,0,0,0,0,0,0,0,55.0,0.0,3,1.0,0.0,0.0,0.1,2.0,0.0,45.0,51.0,3.0,34.0,1.0,0.0,0.0,500.0,74.0,30.0,31.0,11.0,15.0,0.0,0.0,0.0,2.0,2.0,0.0,0.0,50.0,1.0,1.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,3.0,3.0,0.0,1.0,3.0,0.0,0.0,1.0,14.0,30.0,11.0,1.0,55.0,0.0,120.0,47.0,1.0,0.0,0.0,0.0,42.0,1.0,0,2,1,0,0.0,0.0,0,5.0,0.0,0.0,1,0,0,1.7,1.6,1,6,18.471
270,Luke Harris,52899,75,0,0,0,0,1,1,0,0,19.0,1.0,0,0.0,0.3,0.3,0.0,0.0,0.0,11.0,14.0,1.0,9.0,0.0,2.0,0.0,172.0,14.0,5.0,8.0,4.0,4.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,11.0,3.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,8.0,6.0,5.0,1.0,19.0,2.0,39.0,6.0,1.0,0.0,0.0,0.0,11.0,0.0,0,2,2,0,0.0,0.0,0,3.0,0.0,0.0,0,0,0,1.4,2.2,1,5,18.416
105,David Ozoh,52189,4,0,0,0,0,0,0,0,0,7.0,1.0,0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,6.0,0.0,5.0,0.0,0.0,0.0,84.0,14.0,1.0,2.0,4.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,6.0,0.0,0.0,7.0,0.0,19.0,3.0,0.0,0.0,0.0,0.0,5.0,0.0,0,0,0,0,0.0,0.0,0,0.0,1.0,0.0,0,0,0,1.2,2.4,0,4,18.392


In [30]:
#Since the dataframe contains one row of data for each player, 
#we will export it as "player_data.csv".

player_data.to_csv('player_data.csv', index = False)
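
One thing worth keeping in mind with the 'last' aggregation is that it depends on the row order: it returns the most recent age only if the rows are in chronological order. The sketch below sorts by date before grouping to make that explicit. The dataframe `toy_player_match` and its values are made up for illustration, and it assumes dates are stored as ISO strings (YYYY-MM-DD), as in the "date" column shown earlier.

```python
import pandas as pd

# Toy stand-in for player_match_data, deliberately NOT in chronological order.
toy_player_match = pd.DataFrame({
    "name": ["P1", "P2", "P1", "P2"],
    "date": ["2023-09-02", "2023-08-12", "2023-08-12", "2023-09-02"],
    "minutes": [90, 45, 80, 90],
    "age": [30.06, 21.00, 30.00, 21.06],
})

toy_player = (
    toy_player_match
    .sort_values("date")  # ISO date strings sort chronologically
    .groupby("name")
    .agg({"minutes": "sum", "age": "last"})  # 'last' now means the latest match
    .reset_index()
)

print(toy_player)
```

Without the `sort_values` call, P1 would get the age from the earlier match, because that row happens to come last in the toy table.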

In [33]:
#Let's create data for each referee. 
#We will calculate the sum of values across all matches officiated by each referee, 
#except for the "age" column, where the first value will be applied.

referee_columns = ['referee']

numeric_columns = match_data.select_dtypes(include=[np.number]).columns.tolist()
redundant_numeric_columns = ['x', 'y','shirtnumber', 'home_xg', 'away_xg', 'gameweek']
numeric_columns = [col for col in numeric_columns if col not in redundant_numeric_columns]

agg_dict = {col: 'sum' for col in numeric_columns if col != 'age' and col not in referee_columns}
agg_dict['age'] = 'first'

grouped_data = match_data.groupby(referee_columns)
referee_data = grouped_data.agg(agg_dict).reset_index()
referee_data['number_of_matches'] = grouped_data.size().values

#Let's check that the dataframe has the proper columns and values. 
#We will explore which referee has officiated the most matches.
referee_data.sort_values('number_of_matches', ascending = False)

Unnamed: 0,referee,attendance,home_score,away_score,home_win,away_win,team_xg,team_scored,goals,assists,pens_made,pens_att,shots,shots_on_target,cards_yellow,cards_red,touches,tackles,interceptions,blocks,xg,npxg,xg_assist,sca,gca,passes_completed,passes,progressive_passes,carries,progressive_carries,take_ons,take_ons_won,passes_total_distance,passes_progressive_distance,passes_completed_short,passes_short,passes_completed_medium,passes_medium,passes_completed_long,passes_long,pass_xa,assisted_shots,passes_into_final_third,passes_into_penalty_area,crosses_into_penalty_area,passes_live,passes_dead,passes_free_kicks,through_balls,passes_switches,crosses,throw_ins,corner_kicks,corner_kicks_in,corner_kicks_out,corner_kicks_straight,passes_offsides,passes_blocked,tackles_won,tackles_def_3rd,tackles_mid_3rd,tackles_att_3rd,challenge_tackles,challenges,challenges_lost,blocked_shots,blocked_passes,tackles_interceptions,clearances,errors,touches_def_pen_area,touches_def_3rd,touches_mid_3rd,touches_att_3rd,touches_att_pen_area,touches_live_ball,take_ons_tackled,carries_distance,carries_progressive_distance,carries_into_final_third,carries_into_penalty_area,miscontrols,dispossessed,passes_received,progressive_passes_received,cards_yellow_red,fouls,fouled,offsides,pens_won,pens_conceded,own_goals,ball_recoveries,aerials_won,aerials_lost,draw,age,number_of_matches
1,Anthony Taylor,359230,16,9,8,4,27.1,25,25,17,2,2,209,64,38,1,10194.0,258.0,138,233.0,27.9,26.3,20.4,382.0,43.0,6901.0,8416.0,647.0,6268.0,290.0,310.0,142.0,118153.0,40739.0,3284.0,3640.0,2831.0,3230.0,604.0,1076.0,16.5,165.0,548.0,161.0,26.0,7653.0,728.0,191.0,28.0,48.0,272,283.0,95.0,40.0,25.0,0.0,35.0,179.0,156,132.0,92.0,34.0,132.0,274.0,142.0,72.0,161.0,396.0,318.0,8.0,1056.0,3192.0,4389.0,2696.0,420.0,10192.0,132.0,30768.0,16230.0,231.0,89.0,235.0,126.0,6824.0,639.0,0,168,167,35,2.0,2.0,0,850.0,175.0,175.0,2,27.089906,8
11,Michael Oliver,314807,13,10,8,6,18.8,23,21,16,2,2,170,56,31,0,9012.0,255.0,113,156.0,19.1,17.5,12.7,300.0,39.0,6178.0,7513.0,522.0,5492.0,252.0,270.0,118.0,105667.0,35127.0,2891.0,3217.0,2583.0,2954.0,526.0,951.0,12.3,130.0,437.0,110.0,14.0,6889.0,600.0,154.0,29.0,44.0,191,257.0,62.0,34.0,23.0,0.0,24.0,126.0,161,122.0,96.0,37.0,126.0,244.0,118.0,41.0,115.0,368.0,235.0,4.0,918.0,2957.0,4040.0,2090.0,325.0,9010.0,126.0,27361.0,14888.0,187.0,75.0,185.0,129.0,6094.0,519.0,0,137,131,24,1.0,2.0,2,726.0,168.0,168.0,0,25.708737,7
15,Robert Jones,255705,14,9,6,0,18.7,23,22,17,2,2,176,65,38,3,8427.0,230.0,118,143.0,19.1,17.5,13.7,305.0,41.0,5784.0,7017.0,477.0,5004.0,239.0,224.0,105.0,97438.0,32627.0,2774.0,3054.0,2416.0,2738.0,446.0,875.0,10.3,122.0,417.0,100.0,20.0,6377.0,620.0,193.0,19.0,39.0,233,223.0,65.0,35.0,20.0,2.0,20.0,105.0,132,103.0,85.0,42.0,101.0,206.0,105.0,45.0,98.0,348.0,263.0,5.0,991.0,2841.0,3591.0,2076.0,324.0,8425.0,101.0,25507.0,13420.0,162.0,83.0,177.0,129.0,5724.0,470.0,2,181,171,20,1.0,2.0,1,678.0,151.0,151.0,4,25.462013,7
14,Peter Bankes,192771,9,8,4,2,14.5,17,17,14,0,0,162,47,35,1,7110.0,205.0,98,154.0,14.4,14.4,10.3,285.0,33.0,4365.0,5708.0,442.0,3824.0,208.0,236.0,102.0,78737.0,30200.0,1995.0,2298.0,1789.0,2135.0,478.0,950.0,9.4,126.0,358.0,100.0,24.0,5082.0,601.0,154.0,20.0,44.0,226,245.0,77.0,34.0,25.0,3.0,25.0,134.0,115,100.0,74.0,31.0,106.0,208.0,102.0,44.0,110.0,303.0,248.0,4.0,717.0,2348.0,2955.0,1876.0,296.0,7110.0,106.0,20214.0,10132.0,141.0,60.0,188.0,99.0,4317.0,438.0,1,137,130,25,0.0,0.0,0,616.0,256.0,256.0,3,26.478208,6
0,Andy Madley,291212,17,12,6,4,23.2,29,27,23,0,0,191,83,40,1,7357.0,200.0,106,145.0,24.2,24.2,19.8,344.0,49.0,4985.0,6055.0,453.0,4602.0,253.0,227.0,115.0,82251.0,28799.0,2458.0,2712.0,1941.0,2225.0,412.0,750.0,15.8,151.0,367.0,105.0,19.0,5440.0,588.0,182.0,32.0,38.0,187,227.0,60.0,26.0,21.0,1.0,27.0,115.0,133,102.0,85.0,13.0,82.0,197.0,115.0,49.0,96.0,306.0,236.0,3.0,838.0,2433.0,3149.0,1837.0,338.0,7357.0,82.0,23819.0,12338.0,160.0,89.0,135.0,118.0,4941.0,450.0,1,163,159,27,0.0,0.0,2,559.0,113.0,113.0,1,27.167531,6
3,Craig Pawson,218726,5,5,6,4,11.9,10,9,5,2,2,149,43,29,2,7407.0,210.0,91,124.0,12.1,10.5,7.7,257.0,15.0,4972.0,6124.0,419.0,4258.0,213.0,218.0,100.0,85272.0,29063.0,2334.0,2593.0,2038.0,2347.0,444.0,860.0,8.1,108.0,388.0,97.0,19.0,5576.0,537.0,135.0,15.0,31.0,195,230.0,61.0,25.0,25.0,1.0,11.0,81.0,130,112.0,67.0,31.0,89.0,189.0,100.0,50.0,74.0,301.0,219.0,3.0,831.0,2440.0,3153.0,1879.0,262.0,7405.0,89.0,23162.0,12423.0,188.0,58.0,164.0,121.0,4919.0,418.0,1,130,128,11,1.0,2.0,1,620.0,169.0,169.0,1,25.279187,6
9,John Brooks,187766,7,6,4,6,15.3,13,12,8,1,2,143,42,24,3,6018.0,197.0,90,133.0,15.0,13.4,10.5,248.0,19.0,3822.0,4807.0,332.0,3444.0,176.0,203.0,95.0,64584.0,23371.0,1876.0,2102.0,1486.0,1719.0,354.0,711.0,8.0,106.0,258.0,75.0,20.0,4296.0,491.0,140.0,17.0,32.0,164,192.0,55.0,29.0,19.0,0.0,20.0,101.0,110,86.0,72.0,39.0,83.0,178.0,95.0,41.0,92.0,287.0,208.0,5.0,646.0,2031.0,2564.0,1491.0,261.0,6016.0,83.0,18390.0,9447.0,135.0,60.0,166.0,114.0,3778.0,328.0,2,128,125,20,2.0,2.0,1,516.0,127.0,127.0,0,27.227604,5
19,Stuart Attwell,197936,7,16,4,6,18.1,23,22,15,2,2,127,54,28,1,6426.0,167.0,84,104.0,18.2,16.6,12.9,238.0,41.0,4449.0,5370.0,359.0,3984.0,168.0,190.0,89.0,76168.0,25652.0,2080.0,2306.0,1857.0,2083.0,401.0,710.0,7.7,104.0,287.0,77.0,12.0,4880.0,461.0,138.0,20.0,28.0,137,173.0,53.0,20.0,19.0,1.0,29.0,80.0,92,86.0,49.0,32.0,76.0,165.0,89.0,30.0,74.0,251.0,181.0,3.0,725.0,2230.0,2828.0,1437.0,228.0,6424.0,76.0,19252.0,9523.0,124.0,57.0,133.0,91.0,4410.0,353.0,0,112,110,29,2.0,2.0,1,506.0,111.0,111.0,0,28.75947,5
13,Paul Tierney,136014,5,7,0,4,12.2,12,12,10,1,1,120,39,25,3,6368.0,181.0,78,124.0,12.0,11.2,8.8,214.0,20.0,4150.0,5219.0,413.0,3503.0,169.0,194.0,86.0,71240.0,24714.0,1995.0,2249.0,1703.0,1995.0,364.0,717.0,10.4,91.0,335.0,99.0,22.0,4748.0,452.0,112.0,20.0,34.0,205,188.0,58.0,36.0,8.0,2.0,19.0,104.0,109,94.0,62.0,25.0,84.0,170.0,86.0,34.0,90.0,259.0,237.0,4.0,617.0,1972.0,2721.0,1735.0,245.0,6367.0,84.0,19433.0,9942.0,132.0,45.0,158.0,97.0,4114.0,408.0,2,98,93,19,1.0,1.0,0,565.0,160.0,160.0,3,27.618808,5
6,David Coote,147031,11,3,6,2,17.4,14,14,7,4,4,152,48,19,1,5898.0,152.0,68,109.0,17.1,14.0,10.2,279.0,22.0,3904.0,4779.0,403.0,3629.0,237.0,187.0,90.0,69695.0,25899.0,1782.0,1978.0,1632.0,1862.0,394.0,700.0,11.4,111.0,315.0,100.0,22.0,4294.0,468.0,141.0,21.0,28.0,201,161.0,60.0,31.0,13.0,2.0,17.0,77.0,85,80.0,52.0,20.0,64.0,154.0,90.0,44.0,65.0,220.0,208.0,5.0,738.0,1957.0,2360.0,1635.0,305.0,5894.0,64.0,19699.0,10769.0,128.0,77.0,146.0,88.0,3859.0,401.0,1,134,129,17,3.0,4.0,0,456.0,130.0,130.0,1,25.926374,5


In [31]:
#Since the dataframe contains one row of data for each referee, 
#we will export it as "referee_data.csv".

referee_data.to_csv('referee_data.csv', index = False)
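
Since the referee totals grow with the number of matches officiated, comparing referees on fouls or cards is usually fairer on a per-match basis. A minimal sketch of that idea follows. The dataframe `toy_referee` and its values are made up for illustration; only the column names "fouls", "cards_yellow" and "number_of_matches" are taken from the table above.

```python
import pandas as pd

# Toy stand-in for referee_data (made-up totals).
toy_referee = pd.DataFrame({
    "referee": ["R1", "R2", "R3"],
    "fouls": [160, 90, 120],
    "cards_yellow": [32, 27, 18],
    "number_of_matches": [8, 4, 6],
})

# Divide each total by the number of matches to get a per-match rate.
for col in ["fouls", "cards_yellow"]:
    toy_referee[f"{col}_per_match"] = toy_referee[col] / toy_referee["number_of_matches"]

# Rank referees by fouls called per match rather than by raw totals.
ranked = toy_referee.sort_values("fouls_per_match", ascending=False)
print(ranked[["referee", "fouls", "number_of_matches", "fouls_per_match"]])
```

In the toy data, R1 has the highest foul total but R2 has the highest rate, which is the kind of difference a per-match column makes visible.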

We have prepared the data for the analysis process. We have also extracted 6 tables which will be a great basis for further analysis. In the future, we will look at players, teams, matches and referees to gain insights.

I hope you enjoyed this notebook and learned something new. Please write to me if you have any concerns, suggestions or ideas.