# Introduction

In this notebook we retrieve soccer player statistics (mainly goals and assists) from the following websites:
- ligue1.com (Ligue 1 goals and assists, Coupe de la Ligue goals)

Trophee des Champions (Champions Cup):
The match between the previous season's league champion and the cup winner.
As there is only one such game each season, its data is entered manually.

# Import

In [15]:
import lxml.html as lh
import lxml.etree as et
import urllib.request as ulib
import pandas as pd
from selenium import webdriver
import time

# Params

## Ligue 1 current week

In [16]:
ligue1_current_week_goals = 38
ligue1_current_week_assists = 38
season_id = 100

## URLs

In [26]:
ligue1_goals_url = "http://www.ligue1.com/ligue1/classementButeurs#sai={0}&journee1=1&journee2={1}&cat=G&poste=Tous&viewAll=true".format(season_id,ligue1_current_week_goals)
ligue1_assists_url = "http://www.ligue1.com/ligue1/classementPasseurs#sai={0}&journee1=1&journee2={1}&cat=G&poste=Tous&viewAll=true".format(season_id,ligue1_current_week_assists)
cl_goals_url = "http://www.ligue1.com/coupeLigue/classementButeurs#sai={0}&journee1=47&journee2=60&cat=G&poste=Tous&viewAll=true".format(season_id)

print(ligue1_goals_url)
print(ligue1_assists_url)
print(cl_goals_url)

http://www.ligue1.com/ligue1/classementButeurs#sai=100&journee1=1&journee2=38&cat=G&poste=Tous&viewAll=true
http://www.ligue1.com/ligue1/classementPasseurs#sai=100&journee1=1&journee2=38&cat=G&poste=Tous&viewAll=true
http://www.ligue1.com/coupeLigue/classementButeurs#sai=100&journee1=47&journee2=60&cat=G&poste=Tous&viewAll=true
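Everything after `#` in these URLs is a fragment, which a browser keeps on the client side instead of sending to the server, so the filters are presumably applied by the page's JavaScript. A minimal sketch, using only the standard library, to inspect that fragment (the `fragment_params` helper is hypothetical and not part of the notebook):

```python
from urllib.parse import urlsplit, parse_qs

def fragment_params(url):
    """Return the key/value pairs stored in the URL fragment (after '#')."""
    parts = urlsplit(url)
    # parse_qs returns a list of values per key; keep the first one
    return {key: values[0] for key, values in parse_qs(parts.fragment).items()}

url = ("http://www.ligue1.com/ligue1/classementButeurs"
       "#sai=100&journee1=1&journee2=38&cat=G&poste=Tous&viewAll=true")
params = fragment_params(url)

print(repr(urlsplit(url).query))  # '' : there is no query string at all
print(params["journee2"])         # 38
```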


## Dataframe columns

### Ligue 1 goals columns
Legend:<br>
%competition%_G = Goals<br>
%competition%_PK = On penalty<br>
%competition%_A = Assists<br>
%competition%_DB = Dead ball assists<br>
%competition%_TP = Time played in minutes<br>

%competition%:<br>
L1: Ligue 1<br>
CL: Coupe de la Ligue<br>
TC: Trophee des Champions<br>
CF: Coupe de France<br>
UCL: UEFA Champions League<br>
EL: Europa League<br>

Columns that will be dropped later because they are not needed:<br>
Pld = The number of games played in which the player scored<br>
Pts = Number of points earned by the team<br>
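As an illustration of the naming convention in the legend, the sketch below expands a `<competition>_<stat>` column name into a readable label. The `describe_column` helper and the two lookup dictionaries are hypothetical; they only restate the legend and do not cover the `Tot_*` columns.

```python
COMPETITIONS = {
    "L1": "Ligue 1", "CL": "Coupe de la Ligue", "TC": "Trophee des Champions",
    "CF": "Coupe de France", "UCL": "UEFA Champions League", "EL": "Europa League",
}
STATS = {
    "G": "Goals", "PK": "On penalty", "A": "Assists",
    "DB": "Dead ball assists", "TP": "Time played in minutes",
}

def describe_column(name):
    """Expand a '<competition>_<stat>' column name using the legend."""
    competition, stat = name.split("_", 1)
    return "{} ({})".format(STATS[stat], COMPETITIONS[competition])

print(describe_column("CL_TP"))  # Time played in minutes (Coupe de la Ligue)
```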

In [27]:
columns_l1_goals = ['#', 'Player', 'Team', 'L1_G', 'L1_PK', 'L1_A', 'Pld', 'L1_TP', 'Pts']
columns_l1_assists = ['#', 'Player', 'Team', 'L1_A', 'L1_DB', 'Pld', 'L1_TP']
columns_cl_goals = ['#', 'Player', 'Team', 'CL_G', 'CL_PK', 'Pld', 'CL_TP']

# Trophee des Champions (TC) will be treated manually so we gather goals/assists in the same data table
columns_tc = ['Player', 'Team', 'TC_G', 'TC_PK', 'TC_A', 'TC_DB', 'TC_TP']

# final output
columns_final = [
    'Player', 'Team',
    'L1_G', 'L1_PK', 'L1_A', 'L1_DB', 'L1_TP',
    'CL_G', 'CL_PK', 'CL_TP',
    'TC_G', 'TC_PK', 'TC_A', 'TC_DB', 'TC_TP',
    'Tot_G', 'Tot_A', 'Tot', 'Tot_TP'
]

# quick display
columns_quick = [
    'Player', 'Team',
    'L1_G', 'L1_A',
    'CL_G', 
    'Tot_G', 'Tot_A', 'Tot'
]

### Numeric columns

In [28]:
numeric_cols = [
    'L1_G', 'L1_PK', 'L1_TP_x', 'L1_A', 'L1_DB', 'L1_TP_y',
    'CL_G', 'CL_PK', 'CL_TP',
    'TC_G', 'TC_PK', 'TC_A', 'TC_DB', 'TC_TP'
]

## Dictionaries

In [29]:
urls = {
    'l1_goals': ligue1_goals_url,
    'l1_assists': ligue1_assists_url,
    'cl_goals': cl_goals_url
}

cols = {
    'l1_goals': columns_l1_goals, 
    'l1_assists': columns_l1_assists,
    'cl_goals': columns_cl_goals,
    'tc' : columns_tc
}

## Tree dictionary

We want to retrieve the HTML of each URL as an ElementTree object so that we can parse it later. The problem is that these pages use JavaScript to generate part of the HTML, so a plain `urlopen` call will not return the final content.

To work around that we use the selenium package, which opens a Firefox window and lets us retrieve the final version of the HTML page (after the JavaScript has run).

In [30]:
# get url html as elementtree using selenium.webdriver
def selenium_url_to_tree(driver, url):
    driver.get(url)
    time.sleep(5)
    htmlSource = driver.page_source
    tree = lh.fromstring(htmlSource)
    return tree

# open driver: this opens a Firefox window
driver = webdriver.Firefox()

try:
    # we construct the 'trees' dictionary by loading the related urls
    # so the urls are loaded only once, only trees will be manipulated going forward
    trees = {key: selenium_url_to_tree(driver, url) for key, url in urls.items()}
finally:
    # close the Firefox window once done, even if a page failed to load
    driver.quit()
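A fixed `time.sleep(5)` either wastes time or is too short on a slow connection. One alternative is to poll until the table appears. The sketch below shows the idea with the standard library only; `wait_until` and `FakeDriver` are hypothetical names introduced for illustration (selenium itself provides `WebDriverWait` in `selenium.webdriver.support.ui` for the same purpose).

```python
import time

def wait_until(predicate, timeout=10.0, interval=0.5):
    """Poll `predicate` until it returns a truthy value or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within {} s".format(timeout))
        time.sleep(interval)

class FakeDriver:
    """Stand-in for a selenium driver: the table only appears on the third read."""
    def __init__(self):
        self.reads = 0

    @property
    def page_source(self):
        self.reads += 1
        return "<table></table>" if self.reads >= 3 else "<div>loading</div>"

fake = FakeDriver()
# returns as soon as the page contains a table instead of always sleeping 5 s
wait_until(lambda: "<table" in fake.page_source, timeout=5.0, interval=0.01)
html = fake.page_source
```

With a real driver, the same predicate could be passed `driver.page_source` in place of the fake object.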

# Functions

## [extract_data_from_url] load url and extract table html element

In [31]:
def extract_data_from_url(url):
    
    # load url as etree and get table (static HTML only: JavaScript is not executed)
    tree = lh.parse(ulib.urlopen(url))
    table = tree.findall('.//table')[0]
    
    # convert table element to list
    data = [[td.text_content().strip().lower() for td in row.findall('.//td')] for row in table.findall('.//tr')]
    
    # remove first empty line in the list
    data = data[1:]
    
    return data
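To see why the first row comes out empty, the sketch below runs the same list comprehension on a small made-up table. It assumes the header row uses `<th>` cells, which is not confirmed by the source, and it swaps in the standard library's `xml.etree.ElementTree` for `lxml` so that it runs without extra packages (`cell_text` is a hypothetical helper standing in for `text_content()`).

```python
import xml.etree.ElementTree as ET

# made-up markup: the real page structure may differ
html = """
<table>
  <tr><th>#</th><th>Player</th><th>Goals</th></tr>
  <tr><td>1</td><td> Edinson CAVANI </td><td>35</td></tr>
  <tr><td>2</td><td>Alexandre <b>LACAZETTE</b></td><td>28</td></tr>
</table>
"""
table = ET.fromstring(html.strip())

def cell_text(td):
    # itertext() also walks nested tags, similar in spirit to lxml's text_content()
    return "".join(td.itertext()).strip().lower()

rows = [[cell_text(td) for td in tr.findall(".//td")] for tr in table.findall(".//tr")]

print(rows[0])   # [] : the header row contains no <td> cell
print(rows[1:])  # the data rows, which is what data[1:] keeps
```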

## [get_ligue1_com_stats] retrieve goals/assists stats from ligue1.com website

### params:
stat_type: dictionaries lookup key<br>
team_filter: filter on a particular team<br>

In [32]:
def get_ligue1_com_stats(stat_type, team_filter=None):
    
    # get table html element from the tree
    tree = trees[stat_type]
    table = tree.findall('.//table')[0]
    
    # convert table element to list
    data = [[td.text_content().strip().lower() for td in row.findall('.//td')] for row in table.findall('.//tr')]
    
    # remove first empty line in the list
    data = data[1:]
    
    # convert list to pandas dataframe
    df = pd.DataFrame(data, columns=cols[stat_type])
    
    # filter on team name (copy so that later assignments do not target a view)
    if team_filter:
        df = df.loc[df['Team']==team_filter].copy()
        
    # remove special characters (ie: accent on 'e' or 'a')
    df['Player'] = df['Player'].map(handle_accent)
    df['Team'] = df['Team'].map(handle_accent)
    
    # reset index without keeping the old one as a column
    df = df.reset_index(drop=True)
    
    # delete useless column
    del df['#']
    
    # specific columns for ligue1/Coupe de la Ligue top goals
    if 'Pts' in df:
        del df['Pts']
    
    if 'Pld' in df:
        del df['Pld']
    
    # drop assist column for Ligue 1 goals df (redundant)
    if stat_type=='l1_goals':
        del df['L1_A']
        
    return df

## [handle_accent] handle special character like accent on 'e' or 'a'

In [33]:
# we need to find a much better way to handle all that (using unicode)
def handle_accent(s):
    s = s.replace("ã¨", "e")
    s = s.replace("ã©", "e")
    s = s.replace("ã¯", "i")
    s = s.replace("ã", "a")
    s = s.replace("a«", "e")
    s = s.replace("a¢", "a")
    s = s.replace("a§", "c")
    return s
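As the comment above suggests, Unicode normalization is a more general way to strip accents. The sketch below is one possible approach using the standard library's `unicodedata` module; `strip_accents` is a hypothetical name. It assumes the text has been decoded correctly, so it does not repair the mis-decoded sequences that `handle_accent` targets.

```python
import unicodedata

def strip_accents(s):
    """Return `s` with combining marks removed (e.g. 'é' becomes 'e')."""
    # NFKD splits an accented letter into its base letter plus combining marks
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("jesé rodriguez"))          # jese rodriguez
print(strip_accents("montpellier hérault sc"))  # montpellier herault sc
```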

## [merge_dfs] merge goal and assist dataframes for all competitions

In [34]:
def merge_dfs(df_l1_goals, df_l1_assists, df_cl_goals, df_tc):
    
    # get union of all dfs Player/Team columns for the lookup
    df1 = df_l1_goals[['Player','Team']]
    df2 = df_l1_assists[['Player','Team']]
    df3 = df_cl_goals[['Player','Team']]
    df4 = df_tc[['Player','Team']]
    df = pd.concat([df1,df2,df3,df4])
    
    # remove duplicate
    df = df.drop_duplicates()
    
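    # drop the duplicated 'Team' column (on copies, so the caller's dataframes stay intact)
    df_l1_goals = df_l1_goals.drop('Team', axis=1)
    df_l1_assists = df_l1_assists.drop('Team', axis=1)
    df_cl_goals = df_cl_goals.drop('Team', axis=1)
    df_tc = df_tc.drop('Team', axis=1)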
    
    # reset index and remove index col
    df = df.reset_index(drop=True)
    
    # merge
    df = df.merge(df_l1_goals, on='Player', how='left')
    df = df.merge(df_l1_assists, on='Player', how='left')
    df = df.merge(df_cl_goals, on='Player', how='left')
    df = df.merge(df_tc, on='Player', how='left')
    
    # replace NaN by zero
    df = df.fillna(0)
    
    # convert columns to numeric
    df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric)
    
    # handle L1_TP columns as it is now in double (L1_TP_x and L1_TP_y)
    # we want to take the max of the two
    df['L1_TP'] = df[['L1_TP_x','L1_TP_y']].max(axis=1)
    del df['L1_TP_x']
    del df['L1_TP_y']      
    
    # add total columns for goals, assists, time played in minutes
    df['Tot_G'] = df['L1_G'] + df['CL_G'] + df['TC_G']
    df['Tot_A'] = df['L1_A'] + df['TC_A']
    df['Tot'] = df['Tot_G'] + df['Tot_A']
    df['Tot_TP'] = df['L1_TP'] + df['CL_TP'] + df['TC_TP']
    
    # arrange columns order
    df = df[columns_final]
    
    return df
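The `L1_TP_x` and `L1_TP_y` columns come from pandas' default merge suffixes: when both sides of a merge carry a non-key column with the same name, the left one gets `_x` and the right one `_y`. A minimal sketch with made-up players and values, assuming pandas is installed:

```python
import pandas as pd

# made-up data: 'a' only scored, 'c' only assisted, 'b' did both
goals = pd.DataFrame({"Player": ["a", "b"], "L1_G": [3, 1], "L1_TP": [900, 450]})
assists = pd.DataFrame({"Player": ["b", "c"], "L1_A": [2, 4], "L1_TP": [450, 700]})
players = pd.DataFrame({"Player": ["a", "b", "c"]})

merged = players.merge(goals, on="Player", how="left")
merged = merged.merge(assists, on="Player", how="left")
print(list(merged.columns))  # ['Player', 'L1_G', 'L1_TP_x', 'L1_A', 'L1_TP_y']

# a player missing from one table has NaN there, so fill with 0 and keep the max
merged = merged.fillna(0)
merged["L1_TP"] = merged[["L1_TP_x", "L1_TP_y"]].max(axis=1)
merged = merged.drop(["L1_TP_x", "L1_TP_y"], axis=1)
print(merged)
```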

# Test the API

## get stats for a given team

In [35]:
team_filter = "paris saint-germain"

In [36]:
get_ligue1_com_stats('l1_goals', team_filter)

| | Player | Team | L1_G | L1_PK | L1_TP |
|---|---|---|---|---|---|
| 0 | edinson cavani | paris saint-germain | 35 | 7 | 2965 |
| 1 | lucas moura | paris saint-germain | 12 | 1 | 2421 |
| 2 | angel di maria | paris saint-germain | 6 | 0 | 2035 |
| 3 | blaise matuidi | paris saint-germain | 4 | 0 | 2415 |
| 4 | julian draxler | paris saint-germain | 4 | 0 | 1012 |
| 5 | marco verratti | paris saint-germain | 3 | 0 | 2148 |
| 6 | adrien rabiot | paris saint-germain | 3 | 0 | 1935 |
| 7 | thiago silva | paris saint-germain | 3 | 0 | 2385 |
| 8 | marquinhos | paris saint-germain | 3 | 0 | 2501 |
| 9 | layvin kurzawa | paris saint-germain | 2 | 0 | 1527 |


In [37]:
get_ligue1_com_stats('l1_assists', team_filter)

| | Player | Team | L1_A | L1_DB | L1_TP |
|---|---|---|---|---|---|
| 0 | angel di maria | paris saint-germain | 7 | 2 | 2035 |
| 1 | maxwell | paris saint-germain | 6 | 0 | 1713 |
| 2 | marco verratti | paris saint-germain | 5 | 0 | 2148 |
| 3 | lucas moura | paris saint-germain | 5 | 2 | 2421 |
| 4 | blaise matuidi | paris saint-germain | 4 | 0 | 2415 |
| 5 | edinson cavani | paris saint-germain | 4 | 0 | 2965 |
| 6 | javier pastore | paris saint-germain | 4 | 0 | 864 |
| 7 | thomas meunier | paris saint-germain | 4 | 1 | 1695 |
| 8 | serge aurier | paris saint-germain | 3 | 0 | 1828 |
| 9 | giovani lo celso | paris saint-germain | 2 | 0 | 82 |


In [38]:
get_ligue1_com_stats('cl_goals', team_filter)

| | Player | Team | CL_G | CL_PK | CL_TP |
|---|---|---|---|---|---|
| 0 | edinson cavani | paris saint-germain | 4 | 0 | 251 |
| 1 | angel di maria | paris saint-germain | 3 | 0 | 270 |
| 2 | thiago silva | paris saint-germain | 2 | 0 | 270 |
| 3 | lucas moura | paris saint-germain | 2 | 1 | 163 |
| 4 | julian draxler | paris saint-germain | 1 | 0 | 54 |
| 5 | jesé rodriguez | paris saint-germain | 1 | 0 | 90 |


## Trophee des Champions dataframe: manual input

In [39]:
# Trophee des Champions dataframe
data = [
    ['lucas moura', 'paris saint-germain', '1', '0', '0', '0', '90'],
    ['javier pastore', 'paris saint-germain', '1', '2', '0', '0', '90'],
    ['angel di maria', 'paris saint-germain', '0', '1', '0', '0', '65'],
    ['layvin kurzawa', 'paris saint-germain', '1', '1', '0', '0', '75'],
    ['hatem ben arfa', 'paris saint-germain', '1', '0', '0', '0', '90'],
    ['corentin tolisso', 'olympique lyonnais', '1', '0', '0', '0', '90'],
    ['christophe jallet', 'olympique lyonnais', '0', '1', '0', '0', '26']
]

df_tc = pd.DataFrame(data, columns=cols['tc'])

## construct the dataframe that gathers all the data

In [40]:
df = merge_dfs(get_ligue1_com_stats('l1_goals'),
               get_ligue1_com_stats('l1_assists'),
               get_ligue1_com_stats('cl_goals'),
               df_tc)
df[:20]

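| | Player | Team | L1_G | L1_PK | L1_A | L1_DB | L1_TP | CL_G | CL_PK | CL_TP | TC_G | TC_PK | TC_A | TC_DB | TC_TP | Tot_G | Tot_A | Tot | Tot_TP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | edinson cavani | paris saint-germain | 35 | 7 | 4 | 0 | 2965 | 4 | 0 | 251 | 0 | 0 | 0 | 0 | 0 | 39 | 4 | 43 | 3216 |
| 1 | alexandre lacazette | olympique lyonnais | 28 | 10 | 3 | 0 | 2405 | 1 | 0 | 27 | 0 | 0 | 0 | 0 | 0 | 29 | 3 | 32 | 2432 |
| 2 | radamel falcao | as monaco | 21 | 4 | 5 | 0 | 1927 | 1 | 0 | 98 | 0 | 0 | 0 | 0 | 0 | 22 | 5 | 27 | 2025 |
| 3 | bafetimbi gomis | olympique de marseille | 20 | 3 | 3 | 0 | 2599 | 1 | 0 | 132 | 0 | 0 | 0 | 0 | 0 | 21 | 3 | 24 | 2731 |
| 4 | kylian mbappe | as monaco | 15 | 0 | 8 | 0 | 1498 | 3 | 0 | 267 | 0 | 0 | 0 | 0 | 0 | 18 | 8 | 26 | 1765 |
| 5 | florian thauvin | olympique de marseille | 15 | 1 | 9 | 3 | 2956 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15 | 9 | 24 | 2956 |
| 6 | ivan santini | sm caen | 15 | 3 | 2 | 0 | 2874 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15 | 2 | 17 | 2874 |
| 7 | mario balotelli | ogc nice | 15 | 3 | 1 | 0 | 1739 | 1 | 1 | 90 | 0 | 0 | 0 | 0 | 0 | 16 | 1 | 17 | 1829 |
| 8 | steve mounie | montpellier hérault sc | 14 | 0 | 2 | 0 | 2824 | 1 | 0 | 120 | 0 | 0 | 0 | 0 | 0 | 15 | 2 | 17 | 2944 |
| 9 | nicolas de preville | losc | 14 | 5 | 1 | 1 | 2059 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 14 | 1 | 15 | 2059 |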


## trick to inspect data type in the whole dataframe

In [41]:
dtypeCount = [df.iloc[:, i].apply(type).value_counts() for i in range(df.shape[1])]
dtypeCount

[<class 'str'>    373
 Name: Player, dtype: int64, <class 'str'>    373
 Name: Team, dtype: int64, <class 'int'>    373
 Name: L1_G, dtype: int64, <class 'int'>    373
 Name: L1_PK, dtype: int64, <class 'int'>    373
 Name: L1_A, dtype: int64, <class 'int'>    373
 Name: L1_DB, dtype: int64, <class 'int'>    373
 Name: L1_TP, dtype: int64, <class 'int'>    373
 Name: CL_G, dtype: int64, <class 'int'>    373
 Name: CL_PK, dtype: int64, <class 'int'>    373
 Name: CL_TP, dtype: int64, <class 'int'>    373
 Name: TC_G, dtype: int64, <class 'int'>    373
 Name: TC_PK, dtype: int64, <class 'int'>    373
 Name: TC_A, dtype: int64, <class 'int'>    373
 Name: TC_DB, dtype: int64, <class 'int'>    373
 Name: TC_TP, dtype: int64, <class 'int'>    373
 Name: Tot_G, dtype: int64, <class 'int'>    373
 Name: Tot_A, dtype: int64, <class 'int'>    373
 Name: Tot, dtype: int64, <class 'int'>    373
 Name: Tot_TP, dtype: int64]
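The reason for counting Python types per column is that a column's `dtype` alone can hide mixed values: an `object` column may hold both strings and integers. A small sketch on made-up data, assuming pandas is installed (`df_demo` and `type_counts` are illustrative names):

```python
import pandas as pd

# made-up column mixing a string and an integer
df_demo = pd.DataFrame({"Player": ["a", "b"], "L1_G": ["3", 4]})

# the dtype is 'object', which says nothing about the individual values
print(df_demo["L1_G"].dtype)

# counting the Python type of each value reveals the mix
type_counts = df_demo["L1_G"].map(lambda v: type(v).__name__).value_counts()
print(type_counts.to_dict())
```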

## order results by total number of decisive plays (goals + assists)

In [42]:
df.sort_values(by='Tot', ascending=False).reset_index(drop=True)[columns_quick][:20]

| | Player | Team | L1_G | L1_A | CL_G | Tot_G | Tot_A | Tot |
|---|---|---|---|---|---|---|---|---|
| 0 | edinson cavani | paris saint-germain | 35 | 4 | 4 | 39 | 4 | 43 |
| 1 | alexandre lacazette | olympique lyonnais | 28 | 3 | 1 | 29 | 3 | 32 |
| 2 | radamel falcao | as monaco | 21 | 5 | 1 | 22 | 5 | 27 |
| 3 | kylian mbappe | as monaco | 15 | 8 | 3 | 18 | 8 | 26 |
| 4 | bafetimbi gomis | olympique de marseille | 20 | 3 | 1 | 21 | 3 | 24 |
| 5 | florian thauvin | olympique de marseille | 15 | 9 | 0 | 15 | 9 | 24 |
| 6 | thomas lemar | as monaco | 9 | 10 | 1 | 10 | 10 | 20 |
| 7 | ryad boudebouz | montpellier hérault sc | 11 | 9 | 0 | 11 | 9 | 20 |
| 8 | lucas moura | paris saint-germain | 12 | 5 | 2 | 15 | 5 | 20 |
| 9 | emiliano sala | fc nantes | 12 | 4 | 3 | 15 | 4 | 19 |


## order results by total number of assists

In [43]:
df.sort_values(by='Tot_A', ascending=False).reset_index(drop=True)[columns_quick][:20]

| | Player | Team | L1_G | L1_A | CL_G | Tot_G | Tot_A | Tot |
|---|---|---|---|---|---|---|---|---|
| 0 | morgan sanson | olympique de marseille | 4 | 12 | 0 | 4 | 12 | 16 |
| 1 | jean michael seri | ogc nice | 7 | 10 | 0 | 7 | 10 | 17 |
| 2 | thomas lemar | as monaco | 9 | 10 | 1 | 10 | 10 | 20 |
| 3 | ryad boudebouz | montpellier hérault sc | 11 | 9 | 0 | 11 | 9 | 20 |
| 4 | florian thauvin | olympique de marseille | 15 | 9 | 0 | 15 | 9 | 24 |
| 5 | bernardo silva | as monaco | 8 | 9 | 0 | 8 | 9 | 17 |
| 6 | kylian mbappe | as monaco | 15 | 8 | 3 | 18 | 8 | 26 |
| 7 | angel di maria | paris saint-germain | 6 | 7 | 3 | 9 | 7 | 16 |
| 8 | memphis depay | olympique lyonnais | 5 | 7 | 0 | 5 | 7 | 12 |
| 9 | thomas mangani | angers sco | 3 | 7 | 0 | 3 | 7 | 10 |
