# __Web Scraping Football Leagues__
---
## Introduction

![EPL_Understat](https://raw.githubusercontent.com/tuanspjain/Soccer_Leagues_Statistics/main/EPL_Understat.png)

Data science is becoming increasingly important in the world of football. As a data person who loves the game, I will scrape football data myself and analyse it to better understand the role data science plays in the sport. This work is also a humble tribute to the useful [blog post](https://towardsdatascience.com/web-scraping-advanced-football-statistics-11cace1d863a) by Sergi Lehkyi.

In this notebook I describe the process of scraping data from the web portal understat.com, which has statistical information about all games in the top 5 European football leagues.

According to understat.com home page:
* Expected goals (xG) is the new revolutionary football metric, which allows you to evaluate team and player performance. It calculates how many goals a team should have scored based on the quality of the chances created.
* In a low-scoring game such as football, final match score does not provide a clear picture of performance.
* This is why more and more sports analytics turn to the advanced models like xG, which is a statistical measure of the quality of chances created and conceded.
* Our goal was to create the most precise method for shot quality evaluation.
* For this case, we trained neural network prediction algorithms with the large dataset (>100,000 shots, over 10 parameters for each).
* On this site, you will find our detailed xG statistics for the top European leagues.

The site offers much more than the xG metric, which makes it a great source of statistical data about football games.

We will start by importing libraries that will be used in this project.


In [1]:
#import modules and packages
import requests
from bs4 import BeautifulSoup
import json
import pandas as pd

## __Website Research__

On the home page I can see that the site contains data for the top 5 European leagues:
* EPL
* La Liga
* Bundesliga
* Serie A
* Ligue 1

We can also see that the statistics start from the 2014/2015 season. Another important observation is the structure of the URL: 'https://understat.com/league' + '/name_of_the_league' + '/year_start_of_the_season'.

Therefore, we create global variables with this data so that we can select any league and season.


In [2]:
#Create urls for all seasons in Europe's Top 5 Leagues
base_url = 'https://understat.com/league'
leagues = ['EPL', 'La_liga', 'Bundesliga' , 'Serie_A', 'Ligue_1']
seasons = ['2014', '2015', '2016', '2017', '2018', '2019', '2020']
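
The URL pattern described above can be expanded into the full list of pages to scrape. The sketch below is only an illustration: `build_urls` is a hypothetical helper that does not appear in the original notebook, and the lists are repeated so that the block runs on its own.

```python
from itertools import product

# Same values as the notebook's global variables, repeated so the block is self-contained
base_url = 'https://understat.com/league'
leagues = ['EPL', 'La_liga', 'Bundesliga', 'Serie_A', 'Ligue_1']
seasons = ['2014', '2015', '2016', '2017', '2018', '2019', '2020']

def build_urls(base, league_names, season_years):
    """Hypothetical helper: return one URL per (league, season) pair."""
    return ['/'.join([base, league, season])
            for league, season in product(league_names, season_years)]

urls = build_urls(base_url, leagues, seasons)
print(len(urls))   # one URL for each of the 5 leagues x 7 seasons
print(urls[0])
```

Building the list up front makes it easy to check how many requests the scraper is going to send before any of them is made.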


After going through the web page we find that the data is stored under a “script” tag and is JSON-encoded. That is why we need to look for this tag, get the JSON from it and convert it into a Python-readable data structure.

However, the JSON only contains per-game statistics. We will have to manipulate the data to make it look like the tables in the original source.

In [3]:
#Starting with getting the data of La Liga in season 2014-2015
url = base_url + '/' + leagues[1] + '/' + seasons[0]
res = requests.get(url) #getting HTML codes from the url

soup = BeautifulSoup(res.content, "lxml")

#Based on the web page's structure, data can be found in JSON variable, under the 'script' tags
scripts = soup.find_all('script')

## __Working with JSON__

The data that interests us is stored in the teamsData variable. After parsing the HTML into a soup, each script tag can be converted to a string, so we find that text and extract the JSON from it.

In [1]:
#Find data for teams
for el in scripts:
    if 'teamsData' in str(el):
        teamData = str(el).strip()

#Strip unnecessary symbols and get only JSON data
ind_start = teamData.index("('") + 2
ind_end = teamData.index("')")
json_data = teamData[ind_start:ind_end]

#Turn the escape sequences in the string into the characters they stand for
json_data = json_data.encode('utf8').decode('unicode_escape')


Once we have extracted and cleaned up our JSON, we can convert it into Python dictionaries.
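
To make the extraction step easier to follow, here is a minimal sketch that runs on a hand-made string instead of the live page. It assumes the script text has the form `var teamsData = JSON.parse('...')` with `\xNN` escapes inside the quotes, which is inferred from the index lookups and the `unicode_escape` step above rather than confirmed here. `extract_json` and `sample_script` are hypothetical names.

```python
import json
import re

# Hand-made stand-in for the text of one <script> tag (assumed format, not fetched from the site)
sample_script = r"""<script>
var teamsData = JSON.parse('\x7B\x22137\x22\x3A\x7B\x22title\x22\x3A\x22Malaga\x22\x7D\x7D');
</script>"""

def extract_json(script_text, var_name):
    """Hypothetical helper: pull the JSON.parse('...') payload for var_name and decode it."""
    pattern = re.escape(var_name) + r"\s*=\s*JSON\.parse\('(.*?)'\)"
    match = re.search(pattern, script_text, flags=re.DOTALL)
    if match is None:
        raise ValueError('variable {} not found in script text'.format(var_name))
    # Same decoding trick as in the notebook: turn \xNN escapes into real characters
    decoded = match.group(1).encode('utf8').decode('unicode_escape')
    return json.loads(decoded)

parsed = extract_json(sample_script, 'teamsData')
print(parsed)
```

Raising an error when the variable is missing avoids silently reusing a value left over from an earlier page, which matters once the code runs inside a loop.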

## __Understanding the data__

In [6]:
#Convert JSON data into dictionaries to better understand how data looks
data = json.loads(json_data)
print('IDs are: ',data.keys())
print()
print('Attributes are: ', data['137'].keys())
print()
print('Name of the club with ID137: ', data['137']['title'])
print()
print('Data regarding the first match of this above club: ', data['137']['history'][0])

IDs are:  dict_keys(['137', '138', '139', '140', '141', '142', '143', '145', '146', '147', '148', '150', '151', '152', '154', '155', '156', '206', '207', '208'])

Attributes are:  dict_keys(['id', 'title', 'history'])

Name of the club with ID137:  Malaga

Data regarding the first match of this above club:  {'h_a': 'h', 'xG': 1.32107, 'xGA': 1.14151, 'npxG': 0.438073, 'npxGA': 1.14151, 'ppda': {'att': 338, 'def': 28}, 'ppda_allowed': {'att': 189, 'def': 30}, 'deep': 4, 'deep_allowed': 5, 'scored': 1, 'missed': 0, 'xpts': 1.5303, 'result': 'w', 'date': '2014-08-23 18:00:00', 'wins': 1, 'draws': 0, 'loses': 0, 'pts': 3, 'npxGD': -0.7034370000000001}


When we explore the data further we see that it is a dictionary of dictionaries: the first layer uses team ids as keys, and each inner dictionary has 3 keys: *id*, *title* and *history*.

We also see that *history* holds data for every single match the team played in its own league.

We can gather team names by their respective ids while going over the first-layer dictionary.

In [7]:
#Get team names by their respective IDs and put them into separate dictionary
teams = {}
for id in data.keys():
    teams[id] = data[id]['title']

In [8]:
teams

{'137': 'Malaga',
 '138': 'Sevilla',
 '139': 'Deportivo La Coruna',
 '140': 'Real Sociedad',
 '141': 'Espanyol',
 '142': 'Getafe',
 '143': 'Atletico Madrid',
 '145': 'Rayo Vallecano',
 '146': 'Valencia',
 '147': 'Athletic Club',
 '148': 'Barcelona',
 '150': 'Real Madrid',
 '151': 'Levante',
 '152': 'Celta Vigo',
 '154': 'Villarreal',
 '155': 'Granada',
 '156': 'Eibar',
 '206': 'Cordoba',
 '207': 'Elche',
 '208': 'Almeria'}
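
As a side note, the same mapping can be written as a dictionary comprehension, and inverting it gives a lookup from team name to id. The sketch below uses a hand-made two-team sample in the same shape as the parsed data; `sample_data`, `teams_by_id` and `ids_by_name` are illustrative names, not part of the original notebook.

```python
# Hand-made sample in the same shape as the parsed teamsData (history left empty)
sample_data = {
    '137': {'id': '137', 'title': 'Malaga', 'history': []},
    '148': {'id': '148', 'title': 'Barcelona', 'history': []},
}

# id -> team name, equivalent to the loop above
teams_by_id = {team_id: team['title'] for team_id, team in sample_data.items()}

# team name -> id, handy when we want to look up one specific club
ids_by_name = {title: team_id for team_id, title in teams_by_id.items()}
print(ids_by_name['Barcelona'])
```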

*history* is a list of dictionaries whose keys are metric names and whose values are the metric values.

Column names (metrics) repeat over and over again, so we put them in a separate list. We will also check what the sample values look like.

In [9]:
#Check what sample values look like in each column (first match of the first team)
first_id = next(iter(data))
columns = list(data[first_id]['history'][0].keys())
values = list(data[first_id]['history'][0].values())
    
print(columns)
print(values)

['h_a', 'xG', 'xGA', 'npxG', 'npxGA', 'ppda', 'ppda_allowed', 'deep', 'deep_allowed', 'scored', 'missed', 'xpts', 'result', 'date', 'wins', 'draws', 'loses', 'pts', 'npxGD']
['h', 1.32107, 1.14151, 0.438073, 1.14151, {'att': 338, 'def': 28}, {'att': 189, 'def': 30}, 4, 5, 1, 0, 1.5303, 'w', '2014-08-23 18:00:00', 1, 0, 0, 3, -0.7034370000000001]


Having found that FC Barcelona has *id*=148, I will get all the data for this team and then repeat the same steps for all teams in the league.

In [10]:
#Getting data from FC Barcelona
barca_data = []
for row in data['148']['history']:
    barca_data.append(list(row.values()))
df = pd.DataFrame(barca_data, columns=columns)
df.head()


Unnamed: 0,h_a,xG,xGA,npxG,npxGA,ppda,ppda_allowed,deep,deep_allowed,scored,missed,xpts,result,date,wins,draws,loses,pts,npxGD
0,h,1.54124,0.10804,1.54124,0.10804,"{'att': 216, 'def': 33}","{'att': 515, 'def': 28}",12,0,3,0,2.605,w,2014-08-24 20:00:00,1,0,0,3,1.4332
1,a,3.12545,1.10836,3.12545,1.10836,"{'att': 120, 'def': 32}","{'att': 321, 'def': 15}",11,5,1,0,2.6874,w,2014-08-31 18:00:00,1,0,0,3,2.01709
2,h,2.1772,0.097971,2.1772,0.097971,"{'att': 262, 'def': 31}","{'att': 386, 'def': 34}",14,3,2,0,2.8197,w,2014-09-13 15:00:00,1,0,0,3,2.079229
3,a,3.8229,0.44198,3.07962,0.44198,"{'att': 154, 'def': 22}","{'att': 429, 'def': 18}",14,0,5,0,2.9336,w,2014-09-21 20:00:00,1,0,0,3,2.63764
4,a,0.646364,0.278657,0.646364,0.278657,"{'att': 96, 'def': 21}","{'att': 293, 'def': 31}",7,4,0,0,1.6659,d,2014-09-24 21:00:00,0,1,0,1,0.367707
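
Since every entry in *history* is a dictionary with the same keys, pandas can also build the table directly from the list of dictionaries, taking the column names from the keys. The sketch below shows this on two shortened, hand-copied records (only four metrics kept); `sample_history` is an illustrative name.

```python
import pandas as pd

# Two shortened match records in the same shape as the 'history' entries
sample_history = [
    {'h_a': 'h', 'xG': 1.54124, 'scored': 3, 'result': 'w'},
    {'h_a': 'a', 'xG': 3.12545, 'scored': 1, 'result': 'w'},
]

# The DataFrame constructor accepts a list of dicts and uses the keys as column names
sample_df = pd.DataFrame(sample_history)
print(sample_df.columns.tolist())
```

This is an alternative to collecting `row.values()` by hand; it does not depend on every record listing its keys in the same order.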


In [11]:
#Getting data for all teams in La Liga
dataframes = {}
for id, team in teams.items():
    teams_data = []
    for row in data[id]['history']:
        teams_data.append(list(row.values()))
    
    df = pd.DataFrame(teams_data, columns=columns)
    dataframes[team] = df
    print('Added data for {}.'.format(team))


Added data for Malaga.
Added data for Sevilla.
Added data for Deportivo La Coruna.
Added data for Real Sociedad.
Added data for Espanyol.
Added data for Getafe.
Added data for Atletico Madrid.
Added data for Rayo Vallecano.
Added data for Valencia.
Added data for Athletic Club.
Added data for Barcelona.
Added data for Real Madrid.
Added data for Levante.
Added data for Celta Vigo.
Added data for Villarreal.
Added data for Granada.
Added data for Eibar.
Added data for Cordoba.
Added data for Elche.
Added data for Almeria.


In [12]:
# Sample check of our new DataFrame
dataframes['Real Madrid'].head(2)

Unnamed: 0,h_a,xG,xGA,npxG,npxGA,ppda,ppda_allowed,deep,deep_allowed,scored,missed,xpts,result,date,wins,draws,loses,pts,npxGD
0,h,0.612645,0.37841,0.612645,0.37841,"{'att': 212, 'def': 25}","{'att': 345, 'def': 16}",4,4,2,0,1.5211,w,2014-08-25 19:00:00,1,0,0,3,0.234235
1,a,2.31574,2.74099,2.31574,2.74099,"{'att': 197, 'def': 29}","{'att': 223, 'def': 24}",8,7,2,4,1.0946,l,2014-08-31 20:00:00,0,0,1,0,-0.42525



Now we have a dictionary of DataFrames where the keys are team names and the values are DataFrames with all games of that team.


## __Manipulating data as tables in the original source__

Notice that metrics such as PPDA and OPPDA (ppda and ppda_allowed) are stored as total counts of attacking and defensive actions, whereas the original table shows them as coefficients (passes allowed per defensive action in the opposition half, and opponent passes allowed per defensive action in the opposition half). Let's fix that!

In [14]:
for team,df in dataframes.items():
    dataframes[team]['ppda_coef'] = df['ppda'].apply(lambda x: x['att']/x['def'] if x['def'] != 0 else 0)
    dataframes[team]['oppda_coef'] = df['ppda_allowed'].apply(lambda x: x['att']/x['def'] if x['def'] != 0 else 0)

#Check what our new DataFrames look like
dataframes['Sevilla'].head(2)
    

Unnamed: 0,h_a,xG,xGA,npxG,npxGA,ppda,ppda_allowed,deep,deep_allowed,scored,...,xpts,result,date,wins,draws,loses,pts,npxGD,ppda_coef,oppda_coef
0,h,1.17197,1.74903,1.17197,1.74903,"{'att': 226, 'def': 24}","{'att': 213, 'def': 20}",8,4,1,...,0.8858,d,2014-08-23 20:00:00,0,1,0,1,-0.57706,9.416667,10.65
1,a,1.36011,0.945745,1.36011,0.945745,"{'att': 167, 'def': 25}","{'att': 253, 'def': 36}",1,4,2,...,1.6688,w,2014-08-30 22:00:00,1,0,0,3,0.414365,6.68,7.027778
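
To check the arithmetic by hand, the sketch below recomputes both coefficients for Sevilla's first match using the counts shown in the table above. `ratio_or_zero` is a hypothetical helper that mirrors the lambda used in the cell.

```python
def ratio_or_zero(actions):
    """Return att/def for a dict like {'att': 226, 'def': 24}, or 0 when def is 0."""
    return actions['att'] / actions['def'] if actions['def'] != 0 else 0

# Counts copied from Sevilla's first row in the output above
sevilla_first_match = {
    'ppda': {'att': 226, 'def': 24},
    'ppda_allowed': {'att': 213, 'def': 20},
}

ppda_coef = ratio_or_zero(sevilla_first_match['ppda'])
oppda_coef = ratio_or_zero(sevilla_first_match['ppda_allowed'])
print(round(ppda_coef, 6), round(oppda_coef, 2))
```

The results agree with the ppda_coef and oppda_coef values in the first row of the table.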


We now have all the numbers, but for every single game. What we want are the totals for each team. Let's work out which columns to sum. Looking at the original table, only PPDA and OPPDA are averaged; everything else is summed.

In [15]:
cols_to_sum = ['xG', 'xGA', 'npxG', 'npxGA', 'deep', 'deep_allowed', 'scored', 'missed', 'xpts', 'wins', 'draws', 'loses',
              'pts', 'npxGD']
cols_to_mean = ['ppda_coef','oppda_coef']


Let's calculate our sums and means. We loop through the dictionary of DataFrames and call the .sum() and .mean() DataFrame methods. Both return a Series, so we wrap each result in pd.DataFrame and call .transpose() to get a one-row DataFrame. We put these new DataFrames into a list and then concatenate them into a new DataFrame, full_stat.

In [16]:
frames = []

#Calculate mean and sum in specific columns
for team,df in dataframes.items():
    sum_data = pd.DataFrame(df[cols_to_sum].sum()).transpose()
    mean_data = pd.DataFrame(df[cols_to_mean].mean()).transpose()
    
    #Join sum_data and mean_data, add the team and matches columns, then append the frame to the list
    final_df = sum_data.join(mean_data)
    final_df['team'] = team
    final_df['matches'] = len(df)
    frames.append(final_df)
    
#Concat the frames list into the final DataFrames
full_stat = pd.concat(frames)
full_stat


Unnamed: 0,xG,xGA,npxG,npxGA,deep,deep_allowed,scored,missed,xpts,wins,draws,loses,pts,npxGD,ppda_coef,oppda_coef,team,matches
0,46.221008,54.130818,40.878338,49.515437,184.0,184.0,42.0,48.0,48.5128,14.0,8.0,16.0,50.0,-8.637099,7.792069,7.019068,Malaga,38
0,69.526624,47.862742,62.094599,41.916529,305.0,168.0,71.0,45.0,67.3867,23.0,7.0,8.0,76.0,20.17807,8.276148,9.477805,Sevilla,38
0,37.871238,50.979304,32.505887,46.519761,133.0,188.0,35.0,60.0,43.5249,7.0,14.0,17.0,35.0,-14.013874,9.87252,7.921498,Deportivo La Coruna,38
0,33.485197,51.158118,31.255368,46.536026,146.0,168.0,44.0,51.0,38.7818,11.0,13.0,14.0,46.0,-15.280658,8.614235,8.829567,Real Sociedad,38
0,43.979193,48.303916,41.006215,47.560642,173.0,205.0,47.0,51.0,50.385,13.0,10.0,15.0,49.0,-6.554427,9.381754,7.158217,Espanyol,38
0,33.968581,53.673382,32.482021,47.728197,114.0,221.0,33.0,64.0,40.1031,10.0,7.0,21.0,37.0,-15.246177,10.145486,7.23566,Getafe,38
0,57.04767,29.069107,52.588008,26.839271,197.0,123.0,67.0,29.0,73.1353,23.0,9.0,6.0,78.0,25.748737,8.982028,9.237091,Atletico Madrid,38
0,47.790696,70.43346,45.560868,65.9738,147.0,219.0,46.0,68.0,43.5455,15.0,4.0,19.0,49.0,-20.412932,6.157978,9.735109,Rayo Vallecano,38
0,55.0625,39.392572,49.703978,33.446477,203.0,172.0,70.0,32.0,63.7068,22.0,11.0,5.0,77.0,16.257501,8.709827,7.870225,Valencia,38
0,45.542151,44.106707,41.826151,41.737161,183.0,171.0,42.0,41.0,53.3585,15.0,10.0,13.0,55.0,0.08899,7.462406,9.403965,Athletic Club,38
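
A possible alternative to the separate .sum() and .mean() calls is a single `DataFrame.agg` call with a column-to-function mapping. This is a sketch on a tiny made-up table, not the notebook's method; `matches`, `how`, `summary` and `one_row` are illustrative names.

```python
import pandas as pd

# Tiny made-up per-match table with the same kind of columns as above
matches = pd.DataFrame({
    'xG': [1.0, 2.0],
    'pts': [3, 1],
    'ppda_coef': [8.0, 10.0],
})

# Sum the counting metrics, average the coefficient
how = {'xG': 'sum', 'pts': 'sum', 'ppda_coef': 'mean'}
summary = matches.agg(how)        # Series indexed by column name
one_row = summary.to_frame().T    # one-row DataFrame, similar to sum_data.join(mean_data)
print(one_row)
```

Keeping the aggregation rules in one dictionary makes it harder to forget a column when the list of metrics changes.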


Next we reorder the columns for better readability, sort the rows by points, reset the index and add a new 'position' column.

In [17]:

#Reorder the columns
full_stat = full_stat[['team', 'matches', 'wins', 'draws', 'loses', 'scored', 'missed', 'pts', 'xG', 'npxG',
                       'xGA', 'npxGA', 'npxGD', 'ppda_coef', 'oppda_coef', 'deep', 'deep_allowed', 'xpts']]
#Sort rows based on points and reset the index (reassigning avoids modifying a column slice in place)
full_stat = full_stat.sort_values(by='pts', ascending=False).reset_index(drop=True)

#Add column position
full_stat['position'] = range(1,len(full_stat)+1)




In [18]:
full_stat


Unnamed: 0,team,matches,wins,draws,loses,scored,missed,pts,xG,npxG,xGA,npxGA,npxGD,ppda_coef,oppda_coef,deep,deep_allowed,xpts,position
0,Barcelona,38,30.0,4.0,4.0,110.0,21.0,94.0,102.980152,97.777212,28.444293,24.727907,73.049305,5.683535,16.367593,489.0,114.0,94.0813,1
1,Real Madrid,38,30.0,2.0,6.0,118.0,38.0,92.0,95.766243,86.103895,42.607198,38.890805,47.21309,10.209085,12.92951,351.0,153.0,81.7489,2
2,Atletico Madrid,38,23.0,9.0,6.0,67.0,29.0,78.0,57.04767,52.588008,29.069107,26.839271,25.748737,8.982028,9.237091,197.0,123.0,73.1353,3
3,Valencia,38,22.0,11.0,5.0,70.0,32.0,77.0,55.0625,49.703978,39.392572,33.446477,16.257501,8.709827,7.870225,203.0,172.0,63.7068,4
4,Sevilla,38,23.0,7.0,8.0,71.0,45.0,76.0,69.526624,62.094599,47.862742,41.916529,20.17807,8.276148,9.477805,305.0,168.0,67.3867,5
5,Villarreal,38,16.0,12.0,10.0,48.0,37.0,60.0,56.767999,55.281438,40.701813,38.471977,16.809461,10.072085,8.67966,242.0,171.0,62.7363,6
6,Athletic Club,38,15.0,10.0,13.0,42.0,41.0,55.0,45.542151,41.826151,44.106707,41.737161,0.08899,7.462406,9.403965,183.0,171.0,53.3585,7
7,Celta Vigo,38,13.0,12.0,13.0,47.0,44.0,51.0,58.887332,54.427664,51.777138,46.574205,7.853459,6.056173,10.882769,287.0,207.0,55.0488,8
8,Malaga,38,14.0,8.0,16.0,42.0,48.0,50.0,46.221008,40.878338,54.130818,49.515437,-8.637099,7.792069,7.019068,184.0,184.0,48.5128,9
9,Rayo Vallecano,38,15.0,4.0,19.0,46.0,68.0,49.0,47.790696,45.560868,70.43346,65.9738,-20.412932,6.157978,9.735109,147.0,219.0,43.5455,10


The original table also shows the differences between expected and actual metrics. Let's add those too.

In [19]:
full_stat['xG_diff'] = full_stat['xG'] - full_stat['scored']
full_stat['xGA_diff'] = full_stat['xGA'] - full_stat['missed']
full_stat['xpts_diff'] = full_stat['xpts'] - full_stat['pts']




Next we convert floats to integers in the appropriate columns.

In [20]:
cols_to_int = ['wins','draws','loses','scored', 'missed', 'pts', 'deep', 'deep_allowed']
full_stat[cols_to_int] = full_stat[cols_to_int].astype(int)

Let's prettify the final view of our DataFrame.

In [21]:
col_order = ['position','team', 'matches', 'wins', 'draws', 'loses', 'scored', 'missed', 'pts', 'xG', 'xG_diff', 'npxG', 
             'xGA', 'xGA_diff', 'npxGA', 'npxGD', 'ppda_coef', 'oppda_coef', 'deep', 'deep_allowed', 'xpts', 'xpts_diff']
full_stat = full_stat[col_order]
pd.options.display.float_format = '{:,.2f}'.format
full_stat

Unnamed: 0,position,team,matches,wins,draws,loses,scored,missed,pts,xG,...,xGA,xGA_diff,npxGA,npxGD,ppda_coef,oppda_coef,deep,deep_allowed,xpts,xpts_diff
0,1,Barcelona,38,30,4,4,110,21,94,102.98,...,28.44,7.44,24.73,73.05,5.68,16.37,489,114,94.08,0.08
1,2,Real Madrid,38,30,2,6,118,38,92,95.77,...,42.61,4.61,38.89,47.21,10.21,12.93,351,153,81.75,-10.25
2,3,Atletico Madrid,38,23,9,6,67,29,78,57.05,...,29.07,0.07,26.84,25.75,8.98,9.24,197,123,73.14,-4.86
3,4,Valencia,38,22,11,5,70,32,77,55.06,...,39.39,7.39,33.45,16.26,8.71,7.87,203,172,63.71,-13.29
4,5,Sevilla,38,23,7,8,71,45,76,69.53,...,47.86,2.86,41.92,20.18,8.28,9.48,305,168,67.39,-8.61
5,6,Villarreal,38,16,12,10,48,37,60,56.77,...,40.7,3.7,38.47,16.81,10.07,8.68,242,171,62.74,2.74
6,7,Athletic Club,38,15,10,13,42,41,55,45.54,...,44.11,3.11,41.74,0.09,7.46,9.4,183,171,53.36,-1.64
7,8,Celta Vigo,38,13,12,13,47,44,51,58.89,...,51.78,7.78,46.57,7.85,6.06,10.88,287,207,55.05,4.05
8,9,Malaga,38,14,8,16,42,48,50,46.22,...,54.13,6.13,49.52,-8.64,7.79,7.02,184,184,48.51,-1.49
9,10,Rayo Vallecano,38,15,4,19,46,68,49,47.79,...,70.43,2.43,65.97,-20.41,6.16,9.74,147,219,43.55,-5.45


**_Original Table_**

![Original table](https://raw.githubusercontent.com/tuanspjain/Soccer_Leagues_Statistics/main/La_Liga%20Table.PNG)

## __Scraping data for all teams of all leagues of all seasons__

Before getting data for all leagues in all seasons, we test the nested structure with the La Liga 2014-2015 data.

In [22]:
season_data = dict()
season_data[seasons[0]] = full_stat

full_data = dict()
full_data[leagues[1]] = season_data
print(full_data)

{'2014':     position                 team  matches  wins  draws  loses  scored  \
0          1            Barcelona       38    30      4      4     110   
1          2          Real Madrid       38    30      2      6     118   
2          3      Atletico Madrid       38    23      9      6      67   
3          4             Valencia       38    22     11      5      70   
4          5              Sevilla       38    23      7      8      71   
5          6           Villarreal       38    16     12     10      48   
6          7        Athletic Club       38    15     10     13      42   
7          8           Celta Vigo       38    13     12     13      47   
8          9               Malaga       38    14      8     16      42   
9         10       Rayo Vallecano       38    15      4     19      46   
10        11             Espanyol       38    13     10     15      47   
11        12        Real Sociedad       38    11     13     14      44   
12        13                E

We finally got there! Let's put all the previous code into loops to get all the data we need.

In [24]:
#Putting all the code above in loops to get all the data

full_data = dict()
for league in leagues:
    season_data = dict()
    for season in seasons:      
        url = base_url + '/' + league + '/' + season
        res = requests.get(url) #getting HTML codes from the url

        soup = BeautifulSoup(res.content, "lxml")

        #Based on the web page's structure, data can be found in JSON variable, under the 'script' tags
        scripts = soup.find_all('script')
        
        #Find data for teams
        for el in scripts:
            if 'teamsData' in str(el):
                teamData = str(el).strip()

        #Strip unnecessary symbols and get only JSON data
        ind_start = teamData.index("('") + 2
        ind_end = teamData.index("')")
        json_data = teamData[ind_start:ind_end]

        json_data = json_data.encode('utf8').decode('unicode_escape')
        
        #Convert data into Python dictionaries
        data = json.loads(json_data)
        
        #Get team names by their respective IDs and put them into separate dictionary
        teams = {}
        for id in data.keys():
            teams[id] = data[id]['title']
            
        #Take the column names (metrics) from the first match of the first team
        first_id = next(iter(data))
        columns = list(data[first_id]['history'][0].keys())
            
        #Getting data for all teams
        dataframes = {}
        for id, team in teams.items():
            teams_data = []
            for row in data[id]['history']:
                teams_data.append(list(row.values()))

            df = pd.DataFrame(teams_data, columns=columns)
            dataframes[team] = df
            #print('Added data for {}.'.format(team))
    
        for team,df in dataframes.items():
            dataframes[team]['ppda_coef'] = df['ppda'].apply(lambda x: x['att']/x['def'] if x['def'] != 0 else 0)
            dataframes[team]['oppda_coef'] = df['ppda_allowed'].apply(lambda x: x['att']/x['def'] if x['def'] != 0 else 0)
        
        cols_to_sum = ['xG', 'xGA', 'npxG', 'npxGA', 'deep', 'deep_allowed', 'scored', 'missed', 'xpts',
                       'wins', 'draws', 'loses', 'pts', 'npxGD']
        cols_to_mean = ['ppda_coef','oppda_coef']

        frames = []
        #Calculate sum and mean in specific columns
        for team,df in dataframes.items():
            sum_data = pd.DataFrame(df[cols_to_sum].sum()).transpose()
            mean_data = pd.DataFrame(df[cols_to_mean].mean()).transpose()

            #Join sum_data and mean_data, add the team and matches columns, then append the frame to the list
            final_df = sum_data.join(mean_data)
            final_df['team'] = team
            final_df['matches'] = len(df)
            frames.append(final_df)

        #Concat the frames list into the final DataFrames
        full_stat = pd.concat(frames)
        
        #Reorder the columns
        full_stat = full_stat[['team', 'matches', 'wins', 'draws', 'loses', 'scored', 'missed', 'pts', 'xG', 'npxG',
                               'xGA', 'npxGA', 'npxGD', 'ppda_coef', 'oppda_coef', 'deep', 'deep_allowed', 'xpts']]
        #Sort rows based on points and reset the index (reassigning avoids modifying a column slice in place)
        full_stat = full_stat.sort_values(by='pts', ascending=False).reset_index(drop=True)
        
        #Add column position
        full_stat['position'] = range(1,len(full_stat)+1)
        
        #Add columns of differences between expected and real metrics
        full_stat['xG_diff'] = full_stat['xG'] - full_stat['scored']
        full_stat['xGA_diff'] = full_stat['xGA'] - full_stat['missed']
        full_stat['xpts_diff'] = full_stat['xpts'] - full_stat['pts']
        
        #Converting floats to integers in appropriate columns
        cols_to_int = ['wins','draws','loses','scored', 'missed', 'pts', 'deep', 'deep_allowed']
        full_stat[cols_to_int] = full_stat[cols_to_int].astype(int)
        
        #Prettify final view of DataFrame
        col_order = ['position','team', 'matches', 'wins', 'draws', 'loses', 'scored', 'missed', 'pts', 'xG', 'xG_diff', 'npxG', 'xGA', 'xGA_diff', 'npxGA', 'npxGD', 'ppda_coef', 'oppda_coef', 'deep', 'deep_allowed', 'xpts', 'xpts_diff']
        full_stat = full_stat[col_order]
        full_stat.set_index('position', inplace=True)
        
        season_data[season] = full_stat

    df_season = pd.concat(season_data)
    full_data[league] = df_season
    
pd.options.display.float_format = '{:,.2f}'.format
final_data = pd.concat(full_data)

final_data.head()

Unnamed: 0,Unnamed: 1,position,team,matches,wins,draws,loses,scored,missed,pts,xG,xG_diff,...,xGA,xGA_diff,npxGA,npxGD,ppda_coef,oppda_coef,deep,deep_allowed,xpts,xpts_diff
EPL,2014,1,Chelsea,38,26,9,3,73,32,87,68.64,-4.36,...,31.52,-0.48,29.24,35.5,10.94,13.42,407,171,75.32,-11.68
EPL,2014,2,Manchester City,38,24,7,7,83,38,79,75.82,-7.18,...,40.5,2.5,37.45,32.15,7.98,15.08,575,144,73.1,-5.9
EPL,2014,3,Arsenal,38,22,9,7,71,36,75,69.8,-1.2,...,35.72,-0.28,33.44,31.04,8.66,13.25,398,171,75.17,0.17
EPL,2014,4,Manchester United,38,20,10,8,62,37,70,54.21,-7.79,...,39.84,2.84,36.8,13.6,7.65,15.52,267,194,63.03,-6.97
EPL,2014,5,Tottenham,38,19,7,12,58,53,64,52.39,-5.61,...,57.04,4.04,51.6,-3.17,8.0,11.3,210,232,48.94,-15.06
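
Passing a dictionary to pd.concat turns its keys into an extra index level, which is why the final table is indexed by league, season and position. The sketch below rebuilds that structure in miniature, using a few team and points values copied from the outputs above. The level names `league` and `season`, and the helper `tiny_table`, are assumptions added for illustration; the notebook leaves the first two levels unnamed.

```python
import pandas as pd

def tiny_table(teams, pts):
    """Hypothetical helper: a miniature league table indexed by position."""
    index = pd.RangeIndex(1, len(teams) + 1, name='position')
    return pd.DataFrame({'team': teams, 'pts': pts}, index=index)

# One dict level per league, one per season, like full_data and season_data
nested = {
    'EPL': pd.concat({'2014': tiny_table(['Chelsea', 'Manchester City'], [87, 79])}),
    'La_liga': pd.concat({'2014': tiny_table(['Barcelona', 'Real Madrid'], [94, 92])}),
}
combined = pd.concat(nested)

# Naming the levels makes selections easier to read
combined.index.names = ['league', 'season', 'position']
epl_2014 = combined.xs(('EPL', '2014'), level=['league', 'season'])
print(epl_2014)
```

With named levels, a single league table can be pulled out of the combined data without remembering the order of the index levels.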


## __Exporting data to CSV file__

In [25]:
final_data.to_csv('FootballLeagues.csv')
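
When the CSV is read back later, the three index levels arrive as ordinary columns unless we tell pandas otherwise. The sketch below shows a round trip through an in-memory buffer instead of a file, on a two-row sample with values copied from the output above. The level names `league` and `season` are assumptions for illustration, since the notebook leaves them unnamed.

```python
import io
import pandas as pd

# Two-row sample with a three-level index like final_data (level names are assumed)
index = pd.MultiIndex.from_tuples(
    [('EPL', '2014', 1), ('EPL', '2014', 2)],
    names=['league', 'season', 'position'])
table = pd.DataFrame({'team': ['Chelsea', 'Manchester City'], 'pts': [87, 79]},
                     index=index)

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
table.to_csv(buffer)
buffer.seek(0)

# index_col rebuilds the MultiIndex from the first three columns
restored = pd.read_csv(buffer, index_col=['league', 'season', 'position'])
print(restored.index.names)
```

Note that pandas infers types again on reading, so the season labels may come back as integers rather than strings.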