# Project: European Soccer Database

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling/Cleaning</a></li>
</ul>

<a id='intro'></a>
## Introduction

The European Soccer Database contains data on soccer matches, players, and teams from 2008 to 2016 for the top leagues of 11 European countries. The dataset includes:
<ul>
    <li>+25,000 matches</li>
    <li>+10,000 players</li>
    <li>11 European Countries with their lead championship</li>
    <li>Seasons 2008 to 2016</li>
    <li>Players and Teams' attributes* sourced from EA Sports' FIFA video game series, including the weekly updates</li>
    <li>Team line up with squad formation (X, Y coordinates)</li>
    <li>Betting odds from up to 10 providers</li>
    <li>Detailed match events (goal types, possession, corner, cross, fouls, cards etc…) for +10,000 matches</li>
    </ul>

For more details or to download the dataset, see this [link](https://www.kaggle.com/hugomathien/soccer).

In [1]:
pip install xmltodict

Collecting xmltodict
  Downloading xmltodict-0.12.0-py2.py3-none-any.whl (9.2 kB)
Installing collected packages: xmltodict
Successfully installed xmltodict-0.12.0
Note: you may need to restart the kernel to use updated packages.


In [2]:
#Data comes in a sqlite format, and can be exported in csv format using DB browser (https://sqlitebrowser.org/)

In [3]:
cd ../input/european-soccer-csv-files/

/kaggle/input/european-soccer-csv-files


In [4]:
#Essential Imports
import pandas as pd
import numpy as np
import xmltodict
import collections

In [5]:
#reading data csv files
country_df = pd.read_csv('Country.csv')
league_df = pd.read_csv('League.csv')
match_df = pd.read_csv('Match.csv')
player_df = pd.read_csv('Player.csv')
player_attr_df = pd.read_csv('Player_Attributes.csv')
team_df = pd.read_csv('Team.csv')
team_attr_df = pd.read_csv('Team_Attributes.csv')
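
The CSV export step can also be skipped by querying the SQLite file directly with pandas. The following is a minimal sketch, not the approach used in this notebook: the helper name `load_tables` is hypothetical, and the file name `database.sqlite` in the comment is an assumption about the Kaggle download. The demo runs against a small in-memory database so that it is self-contained.

```python
import sqlite3
import pandas as pd

def load_tables(conn, table_names):
    """Read each named table from an open SQLite connection into a DataFrame."""
    return {name: pd.read_sql_query(f'SELECT * FROM "{name}"', conn)
            for name in table_names}

# In-memory database standing in for the real file.
# With the real data: conn = sqlite3.connect('database.sqlite')  # assumed file name
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE Country (id INTEGER, name TEXT)')
conn.executemany('INSERT INTO Country VALUES (?, ?)',
                 [(1, 'Belgium'), (1729, 'England')])
tables = load_tables(conn, ['Country'])
conn.close()

print(tables['Country'])
```

Reading straight from the database keeps the column types declared in the schema and avoids a manual export for each table.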

<a id='wrangling'></a>
## Data Wrangling/Cleaning

#### Country Table

In [6]:
country_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      11 non-null     int64 
 1   name    11 non-null     object
dtypes: int64(1), object(1)
memory usage: 304.0+ bytes


In [7]:
type(country_df.name.iloc[0])

str

In [8]:
country_df

Unnamed: 0,id,name
0,1,Belgium
1,1729,England
2,4769,France
3,7809,Germany
4,10257,Italy
5,13274,Netherlands
6,15722,Poland
7,17642,Portugal
8,19694,Scotland
9,21518,Spain


>This simple table lists the 11 countries of interest along with the ids that link them to the other tables. It will be useful when focusing on a specific league.
>
> It has two attributes {id: int64, name: string} with no null values, so no wrangling is needed.

#### League Table

In [9]:
league_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          11 non-null     int64 
 1   country_id  11 non-null     int64 
 2   name        11 non-null     object
dtypes: int64(2), object(1)
memory usage: 392.0+ bytes


In [10]:
type(league_df.name.iloc[0])

str

In [11]:
league_df.head(5)

Unnamed: 0,id,country_id,name
0,1,1,Belgium Jupiler League
1,1729,1729,England Premier League
2,4769,4769,France Ligue 1
3,7809,7809,Germany 1. Bundesliga
4,10257,10257,Italy Serie A


The league table is straightforward: it contains the league name, its id, and the id of the corresponding country. In the rows shown, the two ids are identical, and the league name already indicates the country. So I will drop the country_id column and delete the country dataframe.
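
Since `head(5)` only shows part of the table, the equality of the two id columns can be checked over every row before dropping one of them. A small sketch on a toy frame built from the rows printed above (`league_demo` is a stand-in name, not a variable from this notebook):

```python
import pandas as pd

# Toy stand-in for league_df, using the first rows shown above.
league_demo = pd.DataFrame({
    'id': [1, 1729, 4769],
    'country_id': [1, 1729, 4769],
    'name': ['Belgium Jupiler League', 'England Premier League', 'France Ligue 1'],
})

# True only if the two columns agree on every row.
ids_match = bool((league_demo['id'] == league_demo['country_id']).all())
print(ids_match)

# Drop the redundant column only when the check holds.
if ids_match:
    league_demo = league_demo.drop(columns='country_id')
```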

In [12]:
#country_id column drop
league_df.drop('country_id', axis= 1, inplace= True)

In [13]:
#country df deletion
del country_df

#### Match Table

In [14]:
match_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25979 entries, 0 to 25978
Columns: 115 entries, id to BSA
dtypes: float64(96), int64(9), object(10)
memory usage: 22.8+ MB


In [15]:
#columns names
for i, col in enumerate(match_df.columns):
    print(i, col)

0 id
1 country_id
2 league_id
3 season
4 stage
5 date
6 match_api_id
7 home_team_api_id
8 away_team_api_id
9 home_team_goal
10 away_team_goal
11 home_player_X1
12 home_player_X2
13 home_player_X3
14 home_player_X4
15 home_player_X5
16 home_player_X6
17 home_player_X7
18 home_player_X8
19 home_player_X9
20 home_player_X10
21 home_player_X11
22 away_player_X1
23 away_player_X2
24 away_player_X3
25 away_player_X4
26 away_player_X5
27 away_player_X6
28 away_player_X7
29 away_player_X8
30 away_player_X9
31 away_player_X10
32 away_player_X11
33 home_player_Y1
34 home_player_Y2
35 home_player_Y3
36 home_player_Y4
37 home_player_Y5
38 home_player_Y6
39 home_player_Y7
40 home_player_Y8
41 home_player_Y9
42 home_player_Y10
43 home_player_Y11
44 away_player_Y1
45 away_player_Y2
46 away_player_Y3
47 away_player_Y4
48 away_player_Y5
49 away_player_Y6
50 away_player_Y7
51 away_player_Y8
52 away_player_Y9
53 away_player_Y10
54 away_player_Y11
55 home_player_1
56 home_player_2
57 home_player_3
58 home

The match table has 115 columns, but the names are descriptive enough.
>country_id can be dropped as it is the same as league_id
>
>match_api_id can be dropped as it only refers to the data source
>
>Columns 11 to 54 hold the players' X, Y coordinates. I can't find a use for them right now, so they will be dropped unless needed in future work
>
>Columns 85 to the end are betting odds: each name is the bookmaker's initials followed by H, D, or A (home win, draw, away win)
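
Positional indices are easy to get wrong on a table this wide, so the same columns can also be selected by name. A hedged sketch on an empty toy frame that carries only a few of the column names listed above (`demo`, `coord_cols` and `trimmed` are illustrative names):

```python
import pandas as pd

# Toy frame with a few of the Match column names listed above.
demo = pd.DataFrame(columns=[
    'id', 'country_id', 'league_id', 'match_api_id',
    'home_player_X1', 'away_player_Y11', 'home_player_1', 'B365H',
])

# Coordinate columns look like home_player_X1 ... away_player_Y11.
coord_cols = demo.filter(regex=r'^(home|away)_player_[XY]\d+$').columns
to_drop = ['country_id', 'match_api_id', *coord_cols]
trimmed = demo.drop(columns=to_drop)

print(list(trimmed.columns))
```

The regular expression requires an `X` or `Y` before the number, so the lineup columns such as `home_player_1` are left untouched.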

In [16]:
(match_df.country_id == match_df.league_id).all()

True

In [17]:
#drop country_id (1), match_api_id (6), and the coordinate columns (11 to 54)
cols = np.r_[1, 6, 11:55]
match_df.drop(match_df.columns[cols],axis=1, inplace=True)
match_df.columns

Index(['id', 'league_id', 'season', 'stage', 'date', 'home_team_api_id',
       'away_team_api_id', 'home_team_goal', 'away_team_goal', 'home_player_1',
       'home_player_2', 'home_player_3', 'home_player_4', 'home_player_5',
       'home_player_6', 'home_player_7', 'home_player_8', 'home_player_9',
       'home_player_10', 'home_player_11', 'away_player_1', 'away_player_2',
       'away_player_3', 'away_player_4', 'away_player_5', 'away_player_6',
       'away_player_7', 'away_player_8', 'away_player_9', 'away_player_10',
       'away_player_11', 'goal', 'shoton', 'shotoff', 'foulcommit', 'card',
       'cross', 'corner', 'possession', 'B365H', 'B365D', 'B365A', 'BWH',
       'BWD', 'BWA', 'IWH', 'IWD', 'IWA', 'LBH', 'LBD', 'LBA', 'PSH', 'PSD',
       'PSA', 'WHH', 'WHD', 'WHA', 'SJH', 'SJD', 'SJA', 'VCH', 'VCD', 'VCA',
       'GBH', 'GBD', 'GBA', 'BSH', 'BSD', 'BSA'],
      dtype='object')

It is more convenient for the analysis to split this table into three tables:
>One with the match statistics
>
>One with the lineup player ids
>
>One for the betting odds

In [18]:
#columns names
for i, col in enumerate(match_df.columns):
    print(i, col)

0 id
1 league_id
2 season
3 stage
4 date
5 home_team_api_id
6 away_team_api_id
7 home_team_goal
8 away_team_goal
9 home_player_1
10 home_player_2
11 home_player_3
12 home_player_4
13 home_player_5
14 home_player_6
15 home_player_7
16 home_player_8
17 home_player_9
18 home_player_10
19 home_player_11
20 away_player_1
21 away_player_2
22 away_player_3
23 away_player_4
24 away_player_5
25 away_player_6
26 away_player_7
27 away_player_8
28 away_player_9
29 away_player_10
30 away_player_11
31 goal
32 shoton
33 shotoff
34 foulcommit
35 card
36 cross
37 corner
38 possession
39 B365H
40 B365D
41 B365A
42 BWH
43 BWD
44 BWA
45 IWH
46 IWD
47 IWA
48 LBH
49 LBD
50 LBA
51 PSH
52 PSD
53 PSA
54 WHH
55 WHD
56 WHA
57 SJH
58 SJD
59 SJA
60 VCH
61 VCD
62 VCA
63 GBH
64 GBD
65 GBA
66 BSH
67 BSD
68 BSA


In [19]:
#match info and event statistics
cols1 = np.r_[0:9, 31:39]
match_stats_df = match_df.iloc[:, cols1].reset_index(drop=True)
#match id plus the 30 betting odds columns
cols2 = np.r_[0, 39:69]
match_bets_df = match_df.iloc[:, cols2].reset_index(drop=True)
#match id plus the 22 lineup player ids
cols3 = np.r_[0, 9:31]
match_lineup_df = match_df.iloc[:, cols3].reset_index(drop=True)
del match_df

I will focus on fixing the datatypes and parsing the necessary info. Null values will be handled later, separately for each question posed.

In [20]:
#change date from string to datetime
match_stats_df.date = pd.to_datetime(match_stats_df.date)

In [21]:
#inspecting goal values
match_stats_df.goal.unique()[1]

'<goal><value><comment>n</comment><stats><goals>1</goals><shoton>1</shoton></stats><event_incident_typefk>406</event_incident_typefk><elapsed>22</elapsed><player2>38807</player2><subtype>header</subtype><player1>37799</player1><sortorder>5</sortorder><team>10261</team><id>378998</id><n>295</n><type>goal</type><goal_type>n</goal_type></value><value><comment>n</comment><stats><goals>1</goals><shoton>1</shoton></stats><event_incident_typefk>393</event_incident_typefk><elapsed>24</elapsed><player2>24154</player2><subtype>shot</subtype><player1>24148</player1><sortorder>4</sortorder><team>10260</team><id>379019</id><n>298</n><type>goal</type><goal_type>n</goal_type></value></goal>'

In [22]:
#inspecting possession values
match_stats_df.possession.unique()[1]

'<possession><value><comment>56</comment><event_incident_typefk>352</event_incident_typefk><elapsed>25</elapsed><subtype>possession</subtype><sortorder>1</sortorder><awaypos>44</awaypos><homepos>56</homepos><n>68</n><type>special</type><id>379029</id></value><value><comment>54</comment><elapsed_plus>1</elapsed_plus><event_incident_typefk>352</event_incident_typefk><elapsed>45</elapsed><subtype>possession</subtype><sortorder>4</sortorder><awaypos>46</awaypos><homepos>54</homepos><n>117</n><type>special</type><id>379251</id></value><value><comment>54</comment><event_incident_typefk>352</event_incident_typefk><elapsed>70</elapsed><subtype>possession</subtype><sortorder>0</sortorder><awaypos>46</awaypos><homepos>54</homepos><n>190</n><type>special</type><id>379443</id></value><value><comment>55</comment><elapsed_plus>5</elapsed_plus><event_incident_typefk>352</event_incident_typefk><elapsed>90</elapsed><subtype>possession</subtype><sortorder>1</sortorder><awaypos>45</awaypos><homepos>55</h

In [23]:
# o = xmltodict.parse(match_stats_df.goal.unique()[1])
# o

In [24]:
def parse_goal(goal, home_id, away_id):
    '''
    Parse the goal value (XML text) into a more convenient tuple.
    Args:
        goal -> xml text with multiple tags and goal info
        home_id -> the id of the home team of the match that goal was scored in
        away_id -> the id of the away team (unused: any non-home entry is treated as away)
    Returns:
        a tuple of two lists: the first one is the home goals list and the second is the away goals list,
        or None if the value is missing or has no goal entries.
        each list consists of tuples, one per goal.
        tuple format: (time of the goal in mins-int, scorer id-int, assistant id-int, goal type-string)
    '''
    if pd.isna(goal):
        return None
    parsed = xmltodict.parse(goal)['goal']
    if parsed is None:
        return None
    goal_dict = parsed['value']
    #a single goal is parsed as one dict rather than a list of length one
    if isinstance(goal_dict, dict):
        goal_dict = [goal_dict]
    home_goals = list()
    away_goals = list()
    for g in goal_dict:
        #player ids are sometimes missing; 0 marks an unknown player
        try:
            p1 = int(g['player1'])
        except (KeyError, TypeError, ValueError):
            p1 = 0
        try:
            p2 = int(g['player2'])
        except (KeyError, TypeError, ValueError):
            p2 = 0
        g_info = (int(g['elapsed']), p1, p2, g['comment'])
        #entries carrying a 'del' tag are skipped
        if 'del' not in g.keys():
            if int(g['team']) == home_id:
                home_goals.append(g_info)
            else:
                away_goals.append(g_info)
    return home_goals, away_goals

Several conditions were added to the function logic to handle variations in the goal values:
<br/> not all attributes are present in every entry
<br/> the parser normally returns a list of dicts, one per goal, but for a single goal it returns one dict rather than a list of length one
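
The list-versus-single-dict ambiguity is specific to how the XML is mapped onto dictionaries. As an illustration only, the sketch below swaps in the standard library's `xml.etree.ElementTree` instead of `xmltodict`: `findall` always returns a list, whether there are zero, one, or many `<value>` elements. The function name `goal_events` is hypothetical, and the sample XML is a shortened version of the value shown above.

```python
import xml.etree.ElementTree as ET

def goal_events(goal_xml):
    """Return a list of (minute, team_id, goal_type), one per <value> element."""
    root = ET.fromstring(goal_xml)
    events = []
    for value in root.findall('value'):  # always a list, even for 0 or 1 goals
        events.append((
            int(value.findtext('elapsed')),
            int(value.findtext('team')),
            value.findtext('comment'),
        ))
    return events

one_goal = ('<goal><value><comment>n</comment><elapsed>22</elapsed>'
            '<team>10261</team></value></goal>')
print(goal_events(one_goal))    # [(22, 10261, 'n')]
print(goal_events('<goal />'))  # []
```

This sketch assumes `elapsed` and `team` are present in every entry; a production version would need the same missing-tag handling as `parse_goal`.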

In [25]:
match_stats_df["goals_info"] = match_stats_df.apply(lambda row: parse_goal(row.goal, row.home_team_api_id, row.away_team_api_id), axis= 1)

In [26]:
odd_goals = match_stats_df[match_stats_df.goals_info.notnull()]

In [27]:
#confirmation step
odd_goals[odd_goals.home_team_goal+odd_goals.away_team_goal != odd_goals.goals_info.apply(lambda x: len(x[0])+len(x[1]))][['goals_info', 'home_team_goal', 'away_team_goal']].head(10)

Unnamed: 0,goals_info,home_team_goal,away_team_goal
1734,"([(32, 24166, 0, dg), (71, 24166, 35608, n), (...",2,1
1748,"([(3, 30893, 27430, n), (29, 34944, 27430, n),...",4,3
1758,"([(22, 26181, 0, n), (48, 26181, 30613, n)], [...",2,1
1769,"([(70, 31291, 0, o)], [(20, 23354, 0, npm), (8...",0,2
1775,"([(45, 23783, 0, dg)], [(28, 37139, 34222, n),...",0,2
1787,"([(55, 30630, 0, dg)], [])",0,0
1789,"([(11, 33418, 38802, n), (30, 33418, 0, dg)], ...",1,4
1792,"([(83, 35608, 0, dg)], [])",0,0
1810,"([], [(54, 30893, 0, dg)])",0,0
1821,"([(84, 36012, 0, dg)], [])",0,0


It seems that some goal types are recorded but do not count toward the score:
<br/> n -> normal goal that can be a shot or a header, etc. (sometimes this type has a subtype)
<br/> p -> penalty kick
<br/> o -> own goal (listed under the team that conceded it, i.e. the scorer's own team)
<br/> dg, npm -> entries that are not counted in the score, possibly disallowed goals (dg) and missed penalties (npm), though this is a guess from the abbreviations
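
To make these rules concrete, the sketch below rebuilds the score from the two parsed goal lists. It is an illustration under assumptions: `score_from_goals` and `NON_SCORING` are hypothetical names, only the types observed above are handled, and other codes may exist in the full data. The three sample rows are taken from the outputs in this section.

```python
# Goal types observed above that do not change the score (an assumption
# drawn from the mismatched rows, not from dataset documentation).
NON_SCORING = {'dg', 'npm'}

def score_from_goals(home_goals, away_goals):
    """Rebuild (home score, away score) from the two parsed goal lists."""
    home_score = away_score = 0
    for listed_home, goals in ((True, home_goals), (False, away_goals)):
        for _minute, _scorer, _assistant, goal_type in goals:
            if goal_type in NON_SCORING:
                continue
            # An own goal is listed under the conceding team,
            # so it is credited to the opposite side.
            credited_home = listed_home != (goal_type == 'o')
            if credited_home:
                home_score += 1
            else:
                away_score += 1
    return home_score, away_score

print(score_from_goals([(4, 26181, 39297, 'n')], []))  # (1, 0)
print(score_from_goals([(41, 181211, 0, 'o')], []))    # (0, 1)
print(score_from_goals([(55, 30630, 0, 'dg')], []))    # (0, 0)
```

Comparing this rebuilt score with `home_team_goal` and `away_team_goal` per side is a stricter check than comparing totals, because it also catches goals credited to the wrong team.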

In [28]:
def parse_poss(poss):
    '''
    Parse the possession value (XML text) into a more convenient list of tuples.
    Args:
        poss -> xml text with multiple tags and possession info
    Returns:
        a list of tuples, each one holding the essential possession info,
        or None if the value is missing or has no possession entries.
        tuple format: (time at which the possession was recorded in mins-int, home possession-int, away possession-int)
    '''
    if pd.isna(poss):
        return None
    parsed = xmltodict.parse(poss)['possession']
    if parsed is None:
        return None
    poss_dict = parsed['value']
    #a single record is parsed as one dict rather than a list of length one
    if isinstance(poss_dict, dict):
        poss_dict = [poss_dict]
    poss_info = list()
    for p in poss_dict:
        #skip records that lack any of the three fields
        try:
            poss_info.append((int(p['elapsed']), int(p['homepos']), int(p['awaypos'])))
        except (KeyError, TypeError, ValueError):
            pass
    return poss_info

In [29]:
match_stats_df['poss_info'] = match_stats_df.possession.apply(parse_poss)

In [30]:
match_stats_df[match_stats_df.goals_info.notnull()]

Unnamed: 0,id,league_id,season,stage,date,home_team_api_id,away_team_api_id,home_team_goal,away_team_goal,goal,shoton,shotoff,foulcommit,card,cross,corner,possession,goals_info,poss_info
1728,1729,1729,2008/2009,1,2008-08-17,10260,10261,1,1,<goal><value><comment>n</comment><stats><goals...,<shoton><value><stats><blocked>1</blocked></st...,<shotoff><value><stats><shotoff>1</shotoff></s...,<foulcommit><value><stats><foulscommitted>1</f...,<card><value><comment>y</comment><stats><ycard...,<cross><value><stats><crosses>1</crosses></sta...,<corner><value><stats><corners>1</corners></st...,<possession><value><comment>56</comment><event...,"([(24, 24148, 24154, n)], [(22, 37799, 38807, ...","[(25, 56, 44), (45, 54, 46), (70, 54, 46), (90..."
1729,1730,1729,2008/2009,1,2008-08-16,9825,8659,1,0,<goal><value><comment>n</comment><stats><goals...,<shoton><value><stats><blocked>1</blocked></st...,<shotoff><value><stats><shotoff>1</shotoff></s...,<foulcommit><value><stats><foulscommitted>1</f...,<card />,<cross><value><stats><crosses>1</crosses></sta...,<corner><value><stats><corners>1</corners></st...,<possession><value><comment>65</comment><event...,"([(4, 26181, 39297, n)], [])","[(27, 65, 35), (45, 61, 39), (74, 65, 35), (90..."
1730,1731,1729,2008/2009,1,2008-08-16,8472,8650,0,1,<goal><value><comment>n</comment><stats><goals...,<shoton><value><stats><blocked>1</blocked></st...,<shotoff><value><stats><shotoff>1</shotoff></s...,<foulcommit><value><stats><foulscommitted>1</f...,<card><value><comment>y</comment><stats><ycard...,<cross><value><stats><crosses>1</crosses></sta...,<corner><value><stats><corners>1</corners></st...,<possession><value><comment>45</comment><event...,"([], [(83, 30853, 30889, n)])","[(25, 45, 55), (45, 43, 57), (70, 48, 52), (90..."
1731,1732,1729,2008/2009,1,2008-08-16,8654,8528,2,1,<goal><value><comment>n</comment><stats><goals...,<shoton><value><stats><shoton>1</shoton></stat...,<shotoff><value><stats><shotoff>1</shotoff></s...,<foulcommit><value><stats><foulscommitted>1</f...,<card><value><comment>y</comment><stats><ycard...,<cross><value><stats><crosses>1</crosses></sta...,<corner><value><stats><corners>1</corners></st...,<possession><value><comment>50</comment><event...,"([(4, 23139, 36394, n), (10, 23139, 37277, n)]...","[(25, 50, 50), (45, 56, 44), (69, 41, 59), (90..."
1732,1733,1729,2008/2009,1,2008-08-17,10252,8456,4,2,<goal><value><comment>n</comment><stats><goals...,<shoton><value><stats><blocked>1</blocked></st...,<shotoff><value><stats><shotoff>1</shotoff></s...,<foulcommit><value><stats><foulscommitted>1</f...,<card><value><comment>y</comment><stats><ycard...,<cross><value><stats><corners>1</corners></sta...,<corner><value><stats><corners>1</corners></st...,<possession><value><comment>51</comment><event...,"([(47, 26165, 23354, n), (69, 23264, 24658, n)...","[(25, 51, 49), (45, 54, 46), (70, 49, 51), (90..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25944,25945,24558,2015/2016,36,2016-05-25,9931,9956,0,1,<goal><value><comment>o</comment><stats><owngo...,<shoton />,<shotoff />,<foulcommit />,<card><value><comment>y</comment><stats><ycard...,<cross />,<corner />,<possession />,"([(41, 181211, 0, o)], [])",
25945,25946,24558,2015/2016,36,2016-05-25,7896,10190,3,0,<goal><value><comment>n</comment><stats><goals...,<shoton />,<shotoff />,<foulcommit />,<card><value><comment>y</comment><stats><ycard...,<cross />,<corner />,<possession />,"([(4, 340790, 0, n), (11, 178142, 0, p), (75, ...",
25946,25947,24558,2015/2016,36,2016-05-25,10199,10179,2,2,<goal><value><comment>n</comment><stats><goals...,<shoton />,<shotoff />,<foulcommit />,<card><value><comment>y</comment><stats><ycard...,<cross />,<corner />,<possession />,"([(90, 34082, 0, n)], [(35, 38601, 0, n), (60,...",
25947,25948,24558,2015/2016,36,2016-05-25,10191,10192,0,3,<goal><value><comment>n</comment><stats><goals...,<shoton />,<shotoff />,<foulcommit />,<card><value><comment>y</comment><stats><ycard...,<cross />,<corner />,<possession />,"([], [(19, 25843, 0, n), (58, 245161, 0, n), (...",


Typically, possession is recorded 4 times during the match (25', 45', 70' and 90'), but sometimes the recording times or the number of records vary.

Now we can drop the original goal and possession columns.
<br/> Additionally, the other stats columns vary greatly in their tags and often contain only an empty tag (like <corner />), so they will be dropped for now unless needed for future analysis.
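
Because the number and timing of the records vary, a single figure per match is easier to work with than the full list. One possible reduction, sketched below, is to take the record with the largest minute. Treating that record as the full-match figure is an assumption about how the source reports possession, and `final_possession` is a hypothetical helper name.

```python
def final_possession(poss_info):
    """Return (home %, away %) from the latest recorded minute, or None."""
    if not poss_info:
        return None
    _minute, home, away = max(poss_info, key=lambda rec: rec[0])
    return home, away

# Values taken from the possession XML shown earlier in this section.
print(final_possession([(25, 56, 44), (45, 54, 46), (70, 54, 46), (90, 55, 45)]))  # (55, 45)
print(final_possession(None))  # None
```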

In [31]:
match_stats_df.drop(['goal', 'shoton', 'shotoff', 'foulcommit', 'card', 'cross', 'corner', 'possession'], axis=1, inplace= True)

#### Match Lineup Table

We can drop the lineups that have fewer than 20 non-null values out of 23 columns (the match id plus 22 player ids), i.e. more than 3 missing player ids.
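
The `thresh` argument of `dropna` counts non-null values per row across all columns, the id column included: a row is kept when it has at least `thresh` non-null values. A toy example with four columns (`demo` and `kept` are illustrative names):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'id': [1, 2, 3],
    'p1': [10.0, np.nan, np.nan],
    'p2': [11.0, 12.0, np.nan],
    'p3': [13.0, 14.0, np.nan],
})

# thresh=3: keep rows with at least 3 non-null values (id included).
kept = demo.dropna(thresh=3)
print(kept['id'].tolist())  # [1, 2]
```

Row 2 survives with one missing player because the id plus two players gives three non-null values; row 3 has only its id and is dropped.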

In [32]:
match_lineup_df.dropna(thresh=20, inplace= True)

In [33]:
match_lineup_df.reset_index(inplace=True)

In [34]:
match_lineup_df.drop('index', axis=1, inplace=True)

In [35]:
#max number of missing values per lineup
match_lineup_df.isnull().sum(axis= 1).max()

3

#### Match Bets Table

In [36]:
#drop matches that are missing any of the betting odds
match_bets_df.dropna(subset=match_bets_df.columns.drop('id'), inplace=True)

In [37]:
match_bets_df.reset_index(inplace= True)

In [38]:
match_bets_df.drop('index', inplace= True, axis= 1)

The bet values are decimal odds. I will convert them to implied probabilities by taking the reciprocal, which is easier to interpret. <br/> An important thing to note is that the three probabilities for a match sum to slightly more than 100%, not exactly 100%. This excess is the bookmaker's margin, known as the [Overround](http://betting.football-data.co.uk/overround.php).
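
As a worked example, the sketch below converts one set of decimal odds to implied probabilities, measures the overround, and rescales the probabilities so they sum to 1. The odds 2.38, 3.25 and 3.0 correspond to the B365 values in the first row of the converted table in this section. Proportional rescaling is only one common way to remove the margin and is an assumption here, not a step this notebook performs; `implied_probabilities` is a hypothetical helper.

```python
def implied_probabilities(home_odds, draw_odds, away_odds):
    """Return (normalized probabilities, overround) for one bookmaker's odds."""
    raw = [1 / home_odds, 1 / draw_odds, 1 / away_odds]
    total = sum(raw)
    overround = total - 1  # bookmaker margin above 100%
    normalized = [p / total for p in raw]
    return normalized, overround

probs, margin = implied_probabilities(2.38, 3.25, 3.0)
print([round(p, 3) for p in probs], round(margin, 3))
```

Here the raw reciprocals add up to roughly 1.06, so the margin is about 6%.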

In [39]:
cols = list(match_bets_df.columns)
cols.remove('id')

In [40]:
#vectorized reciprocal: decimal odds -> implied probabilities
match_bets_df[cols] = 1 / match_bets_df[cols]

In [41]:
match_bets_df

Unnamed: 0,id,B365H,B365D,B365A,BWH,BWD,BWA,IWH,IWD,IWA,...,SJA,VCH,VCD,VCA,GBH,GBD,GBA,BSH,BSD,BSA
0,998,0.420168,0.307692,0.333333,0.408163,0.303030,0.370370,0.434783,0.322581,0.384615,...,0.347222,0.416667,0.294118,0.344828,0.408163,0.303030,0.370370,0.416667,0.307692,0.370370
1,999,0.555556,0.277778,0.230947,0.571429,0.270270,0.238095,0.555556,0.312500,0.270270,...,0.222222,0.555556,0.277778,0.222222,0.571429,0.270270,0.238095,0.555556,0.285714,0.250000
2,1000,0.606061,0.263158,0.200000,0.598802,0.277778,0.200000,0.606061,0.303030,0.227273,...,0.181818,0.588235,0.270270,0.200000,0.598802,0.277778,0.200000,0.588235,0.285714,0.222222
3,1001,0.666667,0.250000,0.142857,0.653595,0.256410,0.166667,0.666667,0.270270,0.200000,...,0.166667,0.636943,0.256410,0.166667,0.653595,0.256410,0.166667,0.666667,0.263158,0.166667
4,1002,0.775194,0.190476,0.100000,0.800000,0.181818,0.100000,0.800000,0.222222,0.111111,...,0.090909,0.800000,0.181818,0.083333,0.800000,0.181818,0.100000,0.800000,0.181818,0.111111
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2751,23413,0.500000,0.294118,0.266667,0.500000,0.312500,0.256410,0.540541,0.285714,0.253165,...,0.263158,0.487805,0.285714,0.256410,0.500000,0.312500,0.256410,0.500000,0.294118,0.285714
2752,23414,0.333333,0.307692,0.420168,0.303030,0.312500,0.425532,0.344828,0.303030,0.434783,...,0.420168,0.312500,0.294118,0.420168,0.303030,0.312500,0.425532,0.347222,0.303030,0.420168
2753,23415,0.125000,0.222222,0.714286,0.114286,0.210526,0.714286,0.131579,0.208333,0.740741,...,0.751880,0.111111,0.190476,0.735294,0.114286,0.210526,0.714286,0.117647,0.222222,0.735294
2754,23416,0.523560,0.285714,0.250000,0.526316,0.294118,0.250000,0.500000,0.303030,0.277778,...,0.263158,0.512821,0.277778,0.238095,0.526316,0.294118,0.250000,0.523560,0.285714,0.266667


#### Player Table

In [42]:
player_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11060 entries, 0 to 11059
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   id                  11060 non-null  int64  
 1   player_api_id       11060 non-null  int64  
 2   player_name         11060 non-null  object 
 3   player_fifa_api_id  11060 non-null  int64  
 4   birthday            11060 non-null  object 
 5   height              11060 non-null  float64
 6   weight              11060 non-null  int64  
dtypes: float64(1), int64(4), object(2)
memory usage: 605.0+ KB


In [43]:
print(type(player_df.player_name[0]))
print(type(player_df.birthday[0]))

<class 'str'>
<class 'str'>


In [44]:
player_df.birthday = pd.to_datetime(player_df.birthday)

#### Player Attributes Table

In [45]:
player_attr_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 183978 entries, 0 to 183977
Data columns (total 42 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   id                   183978 non-null  int64  
 1   player_fifa_api_id   183978 non-null  int64  
 2   player_api_id        183978 non-null  int64  
 3   date                 183978 non-null  object 
 4   overall_rating       183142 non-null  float64
 5   potential            183142 non-null  float64
 6   preferred_foot       183142 non-null  object 
 7   attacking_work_rate  180748 non-null  object 
 8   defensive_work_rate  183142 non-null  object 
 9   crossing             183142 non-null  float64
 10  finishing            183142 non-null  float64
 11  heading_accuracy     183142 non-null  float64
 12  short_passing        183142 non-null  float64
 13  volleys              181265 non-null  float64
 14  dribbling            183142 non-null  float64
 15  curve            

In [46]:
player_attr_df.date = pd.to_datetime(player_attr_df.date)

I will fill the null values in the numeric columns with the column means, on the assumption that a typical player in FIFA has roughly average stats.

In [47]:
player_attr_df.describe()

Unnamed: 0,id,player_fifa_api_id,player_api_id,overall_rating,potential,crossing,finishing,heading_accuracy,short_passing,volleys,...,vision,penalties,marking,standing_tackle,sliding_tackle,gk_diving,gk_handling,gk_kicking,gk_positioning,gk_reflexes
count,183978.0,183978.0,183978.0,183142.0,183142.0,183142.0,183142.0,183142.0,183142.0,181265.0,...,181265.0,183142.0,183142.0,183142.0,181265.0,183142.0,183142.0,183142.0,183142.0,183142.0
mean,91989.5,165671.524291,135900.617324,68.600015,73.460353,55.086883,49.921078,57.266023,62.429672,49.468436,...,57.87355,55.003986,46.772242,50.351257,48.001462,14.704393,16.063612,20.998362,16.132154,16.441439
std,53110.01825,53851.094769,136927.84051,7.041139,6.592271,17.242135,19.038705,16.488905,14.194068,18.256618,...,15.144086,15.546519,21.227667,21.483706,21.598778,16.865467,15.867382,21.45298,16.099175,17.198155
min,1.0,2.0,2625.0,33.0,39.0,1.0,1.0,1.0,3.0,1.0,...,1.0,2.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0
25%,45995.25,155798.0,34763.0,64.0,69.0,45.0,34.0,49.0,57.0,35.0,...,49.0,45.0,25.0,29.0,25.0,7.0,8.0,8.0,8.0,8.0
50%,91989.5,183488.0,77741.0,69.0,74.0,59.0,53.0,60.0,65.0,52.0,...,60.0,57.0,50.0,56.0,53.0,10.0,11.0,12.0,11.0,11.0
75%,137983.75,199848.0,191080.0,73.0,78.0,68.0,65.0,68.0,72.0,64.0,...,69.0,67.0,66.0,69.0,67.0,13.0,15.0,15.0,15.0,15.0
max,183978.0,234141.0,750584.0,94.0,97.0,95.0,97.0,98.0,97.0,93.0,...,97.0,96.0,96.0,95.0,95.0,94.0,93.0,97.0,96.0,96.0


In [48]:
old_mean = player_attr_df.describe().loc['mean']
old_std = player_attr_df.describe().loc['std']

In [49]:
#fill nulls in every float column with that column's mean
float_cols = player_attr_df.select_dtypes(include='float64').columns
player_attr_df[float_cols] = player_attr_df[float_cols].fillna(player_attr_df[float_cols].mean())

In [50]:
player_attr_df.isnull().sum()

id                        0
player_fifa_api_id        0
player_api_id             0
date                      0
overall_rating            0
potential                 0
preferred_foot          836
attacking_work_rate    3230
defensive_work_rate     836
crossing                  0
finishing                 0
heading_accuracy          0
short_passing             0
volleys                   0
dribbling                 0
curve                     0
free_kick_accuracy        0
long_passing              0
ball_control              0
acceleration              0
sprint_speed              0
agility                   0
reactions                 0
balance                   0
shot_power                0
jumping                   0
stamina                   0
strength                  0
long_shots                0
aggression                0
interceptions             0
positioning               0
vision                    0
penalties                 0
marking                   0
standing_tackle     

I will compare the mean and std of the numerical columns before and after the fill-null-with-mean step to observe the effect

In [51]:
player_attr_df.describe().loc['mean'] - old_mean

id                    0.000000e+00
player_fifa_api_id    0.000000e+00
player_api_id         0.000000e+00
overall_rating       -1.449507e-12
potential            -1.435296e-12
crossing             -1.264766e-12
finishing             9.734435e-13
heading_accuracy     -5.471179e-13
short_passing         1.165290e-12
volleys               2.444267e-12
dribbling             9.023893e-13
curve                 3.836931e-13
free_kick_accuracy    1.705303e-13
long_passing          5.968559e-13
ball_control          2.842171e-14
acceleration         -5.258016e-13
sprint_speed         -9.947598e-13
agility               4.121148e-12
reactions             7.531753e-13
balance               5.684342e-13
shot_power            5.186962e-13
jumping               1.577405e-12
stamina              -9.947598e-13
strength             -1.236344e-12
long_shots            4.476419e-13
aggression            9.237056e-13
interceptions        -3.552714e-14
positioning           4.192202e-13
vision              

In [52]:
player_attr_df.describe().loc['std'] - old_std

id                    0.000000
player_fifa_api_id    0.000000
player_api_id         0.000000
overall_rating       -0.016016
potential            -0.014995
crossing             -0.039219
finishing            -0.043306
heading_accuracy     -0.037506
short_passing        -0.032286
volleys              -0.135110
dribbling            -0.040362
curve                -0.135104
free_kick_accuracy   -0.040560
long_passing         -0.032742
ball_control         -0.034566
acceleration         -0.029532
sprint_speed         -0.028591
agility              -0.095872
reactions            -0.020825
balance              -0.096675
shot_power           -0.036701
jumping              -0.081456
stamina              -0.029946
strength             -0.027460
long_shots           -0.041778
aggression           -0.036597
interceptions        -0.044241
positioning          -0.041963
vision               -0.112075
penalties            -0.035362
marking              -0.048285
standing_tackle      -0.048867
sliding_

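We can see that the effect is negligible: the means change only by floating-point error (differences of order 10 to the power -12 or smaller), and the standard deviations shrink slightly (by roughly 0.01 to 0.14 in the columns shown), as expected when nulls are replaced by the mean.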
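
The size of the shrinkage can be predicted. Filling k nulls with the mean of the n observed values leaves the sum of squared deviations unchanged but raises the denominator of the sample variance from n - 1 to n + k - 1, so the standard deviation is multiplied by sqrt((n - 1) / (n + k - 1)). The toy check below uses made-up ratings, not values from the dataset:

```python
import math
import statistics

observed = [60.0, 65.0, 70.0, 75.0, 80.0]  # toy ratings, not from the dataset
n_missing = 3
mean = statistics.mean(observed)
filled = observed + [mean] * n_missing  # mean imputation

std_before = statistics.stdev(observed)  # sample std, like pandas describe()
std_after = statistics.stdev(filled)

# Predicted shrink factor: sqrt((n - 1) / (n + k - 1))
n, k = len(observed), n_missing
expected_ratio = math.sqrt((n - 1) / (n + k - 1))

print(statistics.mean(filled) == mean)
print(std_after / std_before, expected_ratio)
```

For overall_rating, with 183142 observed values out of 183978 rows and a standard deviation of about 7.04, the same factor predicts a drop of about 0.016, in line with the difference printed above.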

**inspecting the remaining attributes: (preferred_foot, attacking_work_rate, defensive_work_rate)**

In [53]:
player_attr_df.preferred_foot.unique()

array(['right', 'left', nan], dtype=object)

We can replace (right, left, nan) with (1, -1, 0), where 0 means that no info is available for this player.

In [54]:
player_attr_df['preferred_foot'] = player_attr_df['preferred_foot'].fillna(0)

In [55]:
#restrict the replacement to the preferred_foot column
player_attr_df['preferred_foot'] = player_attr_df['preferred_foot'].replace({'right': 1, 'left': -1})

In [56]:
player_attr_df.preferred_foot.unique()

array([ 1, -1,  0])

In [57]:
player_attr_df.attacking_work_rate.value_counts()

medium    125070
high       42823
low         8569
None        3639
norm         348
y            106
le           104
stoc          89
Name: attacking_work_rate, dtype: int64

In [58]:
player_attr_df.defensive_work_rate.value_counts()

medium    130846
high       27041
low        18432
_0          2394
o           1550
1            441
ormal        348
2            342
3            258
5            234
7            217
6            197
0            197
9            152
4            116
es           106
ean          104
tocky         89
8             78
Name: defensive_work_rate, dtype: int64

Those two columns are odd! Besides low, medium, and high, they contain values that don't look meaningful. They can be dropped for now, but in future work their values could be inferred from other attributes such as stamina, strength, long_shots, aggression.
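
A less drastic alternative to dropping the columns is to keep only the recognizable categories and turn everything else into missing values. The sketch below assumes that low, medium and high are the only valid work rates, which is an interpretation of the counts above rather than something the dataset states; `VALID_RATES`, `rates` and `cleaned` are illustrative names.

```python
import pandas as pd

VALID_RATES = ['low', 'medium', 'high']  # assumed valid categories
rates = pd.Series(['medium', 'high', 'norm', 'low', 'y', None, 'stoc'])

# Values outside the valid set become NaN; valid ones are kept unchanged.
cleaned = rates.where(rates.isin(VALID_RATES))
print(cleaned.value_counts(dropna=False))
```

The resulting nulls can then be handled like any other missing value, for example when a specific question needs the work rate.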

#### Team Table

In [59]:
team_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   id                299 non-null    int64  
 1   team_api_id       299 non-null    int64  
 2   team_fifa_api_id  288 non-null    float64
 3   team_long_name    299 non-null    object 
 4   team_short_name   299 non-null    object 
dtypes: float64(1), int64(2), object(2)
memory usage: 11.8+ KB


Some team_fifa_api_id values are missing. I will make sure that 0 does not already appear in the column, then fill the missing values with it.

In [60]:
0 in team_df.team_fifa_api_id.unique()

False

In [61]:
team_df['team_fifa_api_id'] = team_df['team_fifa_api_id'].fillna(0)

#### Team Attributes Table

In [62]:
team_attr_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1458 entries, 0 to 1457
Data columns (total 25 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              1458 non-null   int64  
 1   team_fifa_api_id                1458 non-null   int64  
 2   team_api_id                     1458 non-null   int64  
 3   date                            1458 non-null   object 
 4   buildUpPlaySpeed                1458 non-null   int64  
 5   buildUpPlaySpeedClass           1458 non-null   object 
 6   buildUpPlayDribbling            489 non-null    float64
 7   buildUpPlayDribblingClass       1458 non-null   object 
 8   buildUpPlayPassing              1458 non-null   int64  
 9   buildUpPlayPassingClass         1458 non-null   object 
 10  buildUpPlayPositioningClass     1458 non-null   object 
 11  chanceCreationPassing           1458 non-null   int64  
 12  chanceCreationPassingClass      14

In [63]:
team_attr_df.date = pd.to_datetime(team_attr_df.date)

In [64]:
for c in zip(team_attr_df.columns, team_attr_df.dtypes):
    if c[1] == 'object':
        print(team_attr_df[c[0]].unique())

['Balanced' 'Fast' 'Slow']
['Little' 'Normal' 'Lots']
['Mixed' 'Long' 'Short']
['Organised' 'Free Form']
['Normal' 'Risky' 'Safe']
['Normal' 'Lots' 'Little']
['Normal' 'Lots' 'Little']
['Organised' 'Free Form']
['Medium' 'Deep' 'High']
['Press' 'Double' 'Contain']
['Normal' 'Wide' 'Narrow']
['Cover' 'Offside Trap']


buildUpPlayDribbling is the only column with null values

In [65]:
team_attr_df.buildUpPlayDribbling.isnull().sum()

969

In [66]:
team_attr_df.loc[team_attr_df.buildUpPlayDribbling.isnull(), 'buildUpPlayDribblingClass'].value_counts()

Little    969
Name: buildUpPlayDribblingClass, dtype: int64

All the missing values are in the 'Little' class.

In [67]:
little_vals = dict(team_attr_df.loc[team_attr_df.buildUpPlayDribblingClass == 'Little', 'buildUpPlayDribbling'].value_counts())
print(little_vals)

{32.0: 12, 33.0: 6, 31.0: 4, 29.0: 4, 28.0: 3, 24.0: 2, 30.0: 2, 27.0: 1, 26.0: 1}


In [68]:
#calculate average for buildUpPlayDribbling values in 'little' buildUpPlayDribblingClass
avg = sum(list(little_vals.keys()))/len(little_vals)

#calculate weighted average
summation = 0
total = 0
for k, v in little_vals.items():
    summation = summation + (k*v)
    total = total + v
weighted_avg = summation/total

print(f'average= {avg}')
print(f'weighted_avg= {weighted_avg}')

average= 28.88888888888889
weighted_avg= 30.485714285714284


The two values are close, so I will choose the weighted average, which is simply the mean of the observed values in the 'Little' class.
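
The weighted average computed from the value counts is the ordinary mean of the individual observations, which can be confirmed by expanding the counts back into a list. The dictionary below is copied from the output printed above; `observations` is an illustrative name.

```python
import math
import statistics

# Value counts printed above for the 'Little' class: {value: occurrences}
little_vals = {32.0: 12, 33.0: 6, 31.0: 4, 29.0: 4, 28.0: 3,
               24.0: 2, 30.0: 2, 27.0: 1, 26.0: 1}

# Expand the counts back into individual observations.
observations = [value for value, count in little_vals.items() for _ in range(count)]
weighted_avg = sum(v * c for v, c in little_vals.items()) / sum(little_vals.values())

print(len(observations))
print(statistics.mean(observations), weighted_avg)
```

The unweighted average of the distinct values ignores how often each value occurs, which is why it comes out lower.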

In [69]:
team_attr_df['buildUpPlayDribbling'] = team_attr_df['buildUpPlayDribbling'].fillna(weighted_avg)
team_attr_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1458 entries, 0 to 1457
Data columns (total 25 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   id                              1458 non-null   int64         
 1   team_fifa_api_id                1458 non-null   int64         
 2   team_api_id                     1458 non-null   int64         
 3   date                            1458 non-null   datetime64[ns]
 4   buildUpPlaySpeed                1458 non-null   int64         
 5   buildUpPlaySpeedClass           1458 non-null   object        
 6   buildUpPlayDribbling            1458 non-null   float64       
 7   buildUpPlayDribblingClass       1458 non-null   object        
 8   buildUpPlayPassing              1458 non-null   int64         
 9   buildUpPlayPassingClass         1458 non-null   object        
 10  buildUpPlayPositioningClass     1458 non-null   object        
 11  chan

All cleaned tables are saved for further analysis
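
The saving step is not shown in this notebook. One way it could look is sketched below, assuming CSV output: the helper `save_tables`, the output directory, and the file names are all hypothetical. The demo writes a toy frame to a temporary directory and reads it back.

```python
import tempfile
from pathlib import Path

import pandas as pd

def save_tables(tables, out_dir):
    """Write each DataFrame in `tables` to <out_dir>/<name>.csv and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, df in tables.items():
        path = out_dir / f'{name}.csv'
        df.to_csv(path, index=False)  # index=False avoids an extra unnamed column
        paths.append(path)
    return paths

# Toy frame; in the notebook the dict would hold league_df, match_stats_df, etc.
demo_tables = {'league': pd.DataFrame({
    'id': [1, 1729],
    'name': ['Belgium Jupiler League', 'England Premier League'],
})}

with tempfile.TemporaryDirectory() as tmp:
    written = save_tables(demo_tables, tmp)
    reloaded = pd.read_csv(written[0])

print(reloaded)
```

Columns that hold Python objects, such as the lists of tuples in goals_info and poss_info, are written to CSV as plain strings. A format like pickle (`DataFrame.to_pickle`) preserves them if they are needed later.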