Objective: Predict the result of a game played between any two teams.

Intro: 
    By analyzing and visualizing each team's performance, we try to predict the result of a future game between any two teams from their previous match statistics.
   
Libraries:
 

In [None]:
import os 
import colorlover as cl
import lightgbm as lgbm
import plotly.offline as py
import pandas as pd
import numpy as np
import seaborn as sns
from scipy.stats import skew
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import r2_score, classification_report, confusion_matrix, precision_recall_curve
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')



In [None]:
data = pd.read_csv('../input/FIFA 2018 Statistics.csv')

In [None]:
numerical_features   = data.select_dtypes(include = [np.number]).columns
categorical_features = data.select_dtypes(include=['object']).columns  # np.object was removed from recent NumPy releases

In [None]:
numerical_features

In [None]:
categorical_features

Checking the data: 

In [None]:
dtype_counts = data.dtypes.value_counts()

fig, ax = plt.subplots(1, 1, figsize=[7, 4])
sns.barplot(y=dtype_counts.index.astype(str), x=dtype_counts, ax=ax, 
            palette=sns.color_palette("cubehelix", 8))

for side in ['top', 'right', 'left']:
    ax.spines[side].set_visible(False)
ax.grid(axis='x', linestyle='--')
ax.set_xlabel('Variable count')

plt.suptitle('data types', ha='left', fontsize=20, x=.125, y=1)
plt.show()

Checking for nulls: 

In [None]:
null_sums = data.isnull().sum()
null_sums = null_sums[null_sums > 0].sort_values(ascending=False)

fig, ax = plt.subplots(1, 1, figsize=[7, 4])
sns.barplot(y=null_sums.index, x=100 * null_sums / len(data), 
            ax=ax, palette=sns.color_palette("cubehelix", 8))

for side in ['top', 'right', 'left']:
    ax.spines[side].set_visible(False)
ax.grid(axis='x', linestyle='--')
ax.set_xlabel('Null %')

plt.suptitle('Null % of columns', ha='left', fontsize=16, x=.125, y=1)
plt.title('Only columns with at least one null value plotted', ha='left', x=0)
plt.show()

Three columns contain null values. This is understandable, since a couple of games were goalless and even more had no own goals.

Checking the Skewness of Data: 

In [None]:
skew_values = skew(data[numerical_features], nan_policy = 'omit')

pd.concat([pd.DataFrame(list(numerical_features), columns=['Features']),pd.DataFrame(list(skew_values), columns=['Skewness degree'])], axis = 1)

Skewness measures the asymmetry of a distribution, which is one indication of how far it departs from normality. A skewness larger than 0 means the distribution is right-tailed. In this case, Goal Scored has a skewness of 1.132232, which is far from 0, so at least some features are clearly not normally distributed.
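
To make the skewness number concrete, here is a minimal sketch of the biased Fisher-Pearson coefficient, which is the formula scipy.stats.skew uses by default. The helper name sample_skewness and both sample lists are made up for illustration and are not taken from the dataset.

```python
def sample_skewness(values):
    """Biased Fisher-Pearson skewness: g1 = m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical goal counts, not taken from the dataset.
symmetric = [1, 2, 3, 4, 5]
right_tailed = [0, 0, 0, 1, 1, 1, 2, 2, 6]

print(sample_skewness(symmetric))     # 0.0 for a symmetric sample
print(sample_skewness(right_tailed))  # positive: the long tail is on the right
```

A single high-scoring game is enough to pull the coefficient well above 0, which is roughly what happens with goal counts in a tournament.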

Before we model and make predictions, it is a good idea to remove outlier events, such as games with an abnormal amount of violence. And because events like own goals and red cards are rare, we remove them to keep the integrity of the data.

Removing outliers:

In [None]:
var_numerical = ['Goal Scored', 'On-Target', 'Corners', 'Attempts', 'Free Kicks', 'Yellow Card', 'Red', 'Fouls Committed']
temp_data = data[var_numerical]
plt.figure(figsize=(40,20))
sns.boxplot(data = temp_data)
plt.show()

We can visually observe outliers: points that lie more than 1.5 interquartile ranges beyond the first or third quartile.
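
The 1.5 x IQR rule that box plots use to flag outliers can be sketched in a few lines. This is only an illustration: the helper iqr_outliers and the fouls list are hypothetical, and the quartiles are computed with the standard library rather than with the plotting library.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR beyond the quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical fouls-per-match values, not taken from the dataset.
fouls = [5, 9, 10, 11, 12, 13, 14, 15, 16, 17, 30]
print(iqr_outliers(fouls))  # only the 30-foul match falls outside the fences
```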

That means: clean up time!

In [None]:
data.drop(['Own goal Time', 'Own goals', '1st Goal'], axis = 1, inplace= True)

Now we also have to observe and clean up categorical data. 

Before we are able to do that, we need to encode the categorical data.
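
As a rough picture of what encoding means here, the sketch below one-hot encodes a single categorical column by hand. It assumes one-hot encoding is the scheme we want; the function one_hot and the two sample rows are hypothetical and only mimic the shape of the data.

```python
def one_hot(rows, column):
    """Replace `column` with one 0/1 indicator column per distinct value."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every field except the categorical one being encoded.
        new_row = {k: v for k, v in row.items() if k != column}
        for cat in categories:
            new_row[f"{column}_{cat}"] = int(row[column] == cat)
        encoded.append(new_row)
    return encoded

# Hypothetical rows; 'Round' only mimics a categorical column.
rows = [
    {"Team": "A", "Round": "Group Stage"},
    {"Team": "B", "Round": "Final"},
]
encoded = one_hot(rows, "Round")
print(encoded[0])
```

Each distinct category becomes its own indicator column, so a column with many unique values (such as team names) widens the table considerably.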



In [None]:
categorical_features

In [None]:
data = pd.read_csv('../input/FIFA 2018 Statistics.csv')
def uniqueCategories(x):
    columns = list(x.columns).copy()
    for col in columns:
        print('Feature {} has {} unique values: {}'.format(col, len(x[col].unique()), x[col].unique()))
        print('\n')
uniqueCategories(data[categorical_features].drop('Date', axis = 1))

We remove the date and round, since they have little relevance to the outcome.



In [None]:
data = pd.read_csv('../input/FIFA 2018 Statistics.csv')
data.drop('Date', axis = 1, inplace=True)


Now we are good to go!

Let's look at our cleaned up data!

In [None]:
print(data.shape)
data.head()

In [None]:
cleaned_data  = pd.get_dummies(data)

To better understand how each variable affects the others, we use one of the most common measurements: the Pearson product-moment correlation coefficient. It measures the strength and direction of the linear relationship between two variables.
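
For reference, the coefficient is the covariance of the two variables divided by the product of their standard deviations. A minimal sketch, with a hypothetical helper pearson_r and made-up numbers:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical values: two exactly linear relationships with opposite signs.
attempts = [5, 8, 10, 13, 17]
rising = [2 * a + 1 for a in attempts]
falling = [-a for a in attempts]

print(pearson_r(attempts, rising))   # close to +1
print(pearson_r(attempts, falling))  # close to -1
```

Values near +1 or -1 indicate a strong linear relationship, and values near 0 indicate little or no linear relationship.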



In [None]:
numerical_features

In [None]:
plt.figure(figsize=(60,10))
data = pd.read_csv('../input/FIFA 2018 Statistics.csv')
sns.heatmap(data[numerical_features].corr(), square=True, annot=True,robust=True, yticklabels=1)

So which aspects of the game have the biggest impact on the outcome? In other words, what affects the number of goals scored?

From a glance at the bivariate analysis above, and from years of experience as a soccer fan, it is not hard to boil the game down to several factors, such as possession, distance covered, passing accuracy and attempts.

Since Spain won the 2010 World Cup and the 2012 European Championship with a high-possession style, more coaches have come to believe that controlling the pace of the game is essential to winning it.

We will use a rather simple linear regression model to validate our intuition. 

In [None]:
df = data
match_df = df.merge(df, left_on=['Date', 'Team'], right_on=['Date', 'Opponent'], 
                    how='inner', suffixes=[' Team', ' Opponent'])

fig, (ax, ax1) = plt.subplots(1, 2, figsize=[14, 5])
match_df.loc[match_df['Goal Scored Team'] > match_df['Goal Scored Opponent'], 'Result'] = 'Team win'
match_df.loc[match_df['Goal Scored Team'] < match_df['Goal Scored Opponent'], 'Result'] = 'Opponent win'
match_df.loc[match_df['Goal Scored Team'] == match_df['Goal Scored Opponent'], 'Result'] = 'Draw'
match_df.loc[(match_df['Goal Scored Team'] == match_df['Goal Scored Opponent']) &
             (match_df['Goals in PSO Team'] < match_df['Goals in PSO Opponent']), 'Result'] = 'Opponent win (Pens)'
match_df.loc[(match_df['Goal Scored Team'] == match_df['Goal Scored Opponent']) &
             (match_df['Goals in PSO Team'] > match_df['Goals in PSO Opponent']), 'Result'] = 'Team win (Pens)'
for res in match_df['Result'].unique():
    sns.kdeplot(match_df.loc[match_df['Result'] == res, 'Ball Possession % Team'], 
                ax=ax, label=res, shade=True)
ax.set_title('Effect of possession of ball', 
             ha='left', fontsize=16, x=0, y=1.02)
ax.yaxis.set_visible(False)
ax.set_ylim([ax.get_ylim()[0], .04])
    
sns.barplot(y='Result', x='Ball Possession % Team', data=match_df, ax=ax1)
ax1.grid(axis='x', linestyle='--')
ax1.set_ylabel('')

for a in [ax, ax1]:
    for spine in ['top', 'left', 'right']:
        a.spines[spine].set_visible(False)
    a.set_xlabel('Team possession (%)')
    a.legend(frameon=False)

plt.autoscale()
plt.show()

Does running more give a team the edge? Legend has it that if you work harder than everyone else, you have a better chance of winning the game.

But do you?

In [None]:

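# The self-merge lists every match twice (once per team); keep one row per match.
keep = []
for i, row in match_df.iterrows():
    if i > 0 and row['Team Team'] == match_df.loc[i - 1, 'Opponent Team'] \
            and row['Date'] == match_df.loc[i - 1, 'Date']:
        continue  # mirror image of the previous row
    keep.append(i)  # also keeps row 0, which the i > 0 guard used to drop

match_df = match_df.loc[keep, :]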
match_df['Distance vs. Opponent'] = match_df['Distance Covered (Kms) Team'] - match_df['Distance Covered (Kms) Opponent']
match_df['Goal Difference'] = match_df['Goal Scored Team'] - match_df['Goal Scored Opponent']

# Both models use the distance differential as their only feature,
# so predictions must be made on that same feature.
X_dist = match_df['Distance vs. Opponent'].values.reshape(-1, 1)
extremes = np.array([match_df['Distance vs. Opponent'].min(), match_df['Distance vs. Opponent'].max()]).reshape(-1, 1)

lm1 = LinearRegression().fit(X_dist, match_df['Ball Possession % Team'])
poss_pred = lm1.predict(X_dist)
poss_pred_plot = lm1.predict(extremes)

lm2 = LinearRegression().fit(X_dist, match_df['Goal Difference'])
gd_pred = lm2.predict(X_dist)
gd_pred_plot = lm2.predict(extremes)

fig, (ax, ax1) = plt.subplots(1, 2, figsize=[16, 4])
ax.scatter(match_df['Distance vs. Opponent'], match_df['Ball Possession % Team'], edgecolors='blue', alpha=.3,
           s=100, c='red')
ax.plot(extremes, poss_pred_plot, color='g', linestyle='--', 
        label='Linear fit (R2 = {:.2f})'.format(r2_score(match_df['Ball Possession % Team'], poss_pred)))
ax.set_ylabel('Possession (%)')
ax.set_title('On possession', ha='left', fontsize=16, x=0, y=1.05)

ax1.scatter(match_df['Distance vs. Opponent'], match_df['Goal Difference'], edgecolors='green', alpha=.3,
           s=100, c='yellow')
ax1.plot(extremes, gd_pred_plot, color='g', linestyle='--', 
        label='Linear fit (R2 = {:.2f})'.format(r2_score(match_df['Goal Difference'], gd_pred)))
ax1.set_ylabel('Match goal difference')
ax1.set_title('On goal difference', ha='left', fontsize=16, x=0, y=1.05)

for a in (ax, ax1):
    a.legend(frameon=False)
    a.spines['right'].set_visible(False)
    a.spines['top'].set_visible(False)
    a.set_xlabel('Distance run further than opponent (km)')
    
plt.suptitle('Effects of Distance Covered', ha='left', x=.125, fontsize=16, y=1.05)
plt.show()

Judging from the graph above, covering more distance than the opponent has a negative relationship with possession. That makes sense: a team without the ball has to run harder to win it back.

Distance covered has a less clear relationship with goal difference. Despite the positive slope, the low R2 score suggests there is no strong support for the claim that running more wins games.
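
To see why a low R2 undercuts a fitted slope, here is a small sketch that fits a least-squares line by hand and computes R2 as 1 - SS_res / SS_tot. The helpers fit_line and r_squared and the sample values are hypothetical and are not drawn from the match data.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def r_squared(ys, preds):
    """Share of the variance in ys explained by the predictions."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical distance differentials (km) and goal differences.
distance_diff = [-6, -3, -1, 0, 2, 4, 7]
goal_diff = [1, -2, 0, 2, -1, 1, 0]

slope, intercept = fit_line(distance_diff, goal_diff)
preds = [slope * x + intercept for x in distance_diff]
r2 = r_squared(goal_diff, preds)
print(slope > 0, round(r2, 3))  # a positive slope can still explain almost nothing
```

In this made-up sample the fitted line rises, yet it accounts for well under 10% of the variance, which is the same pattern as a positive relationship paired with a low R2 score.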

In [None]:

country_dict = {
    'Afghanistan': 'AFG',
     'Albania': 'ALB',
     'Algeria': 'DZA',
     'American Samoa': 'ASM',
     'Andorra': 'AND',
     'Angola': 'AGO',
     'Anguilla': 'AIA',
     'Antigua and Barbuda': 'ATG',
     'Argentina': 'ARG',
     'Armenia': 'ARM',
     'Aruba': 'ABW',
     'Australia': 'AUS',
     'Austria': 'AUT',
     'Azerbaijan': 'AZE',
     'Bahamas, The': 'BHM',
     'Bahrain': 'BHR',
     'Bangladesh': 'BGD',
     'Barbados': 'BRB',
     'Belarus': 'BLR',
     'Belgium': 'BEL',
     'Belize': 'BLZ',
     'Benin': 'BEN',
     'Bermuda': 'BMU',
     'Bhutan': 'BTN',
     'Bolivia': 'BOL',
     'Bosnia and Herzegovina': 'BIH',
     'Botswana': 'BWA',
     'Brazil': 'BRA',
     'British Virgin Islands': 'VGB',
     'Brunei': 'BRN',
     'Bulgaria': 'BGR',
     'Burkina Faso': 'BFA',
     'Burma': 'MMR',
     'Burundi': 'BDI',
     'Cabo Verde': 'CPV',
     'Cambodia': 'KHM',
     'Cameroon': 'CMR',
     'Canada': 'CAN',
     'Cayman Islands': 'CYM',
     'Central African Republic': 'CAF',
     'Chad': 'TCD',
     'Chile': 'CHL',
     'China': 'CHN',
     'Colombia': 'COL',
     'Comoros': 'COM',
     'Congo, Democratic Republic of the': 'COD',
     'Congo, Republic of the': 'COG',
     'Cook Islands': 'COK',
     'Costa Rica': 'CRI',
     "Cote d'Ivoire": 'CIV',
     'Croatia': 'HRV',
     'Cuba': 'CUB',
     'Curacao': 'CUW',
     'Cyprus': 'CYP',
     'Czech Republic': 'CZE',
     'Denmark': 'DNK',
     'Djibouti': 'DJI',
     'Dominica': 'DMA',
     'Dominican Republic': 'DOM',
     'Ecuador': 'ECU',
     'Egypt': 'EGY',
     'El Salvador': 'SLV',
     'Equatorial Guinea': 'GNQ',
     'Eritrea': 'ERI',
     'Estonia': 'EST',
     'Ethiopia': 'ETH',
     'Falkland Islands (Islas Malvinas)': 'FLK',
     'Faroe Islands': 'FRO',
     'Fiji': 'FJI',
     'Finland': 'FIN',
     'France': 'FRA',
     'French Polynesia': 'PYF',
     'Gabon': 'GAB',
     'Gambia, The': 'GMB',
     'Georgia': 'GEO',
     'Germany': 'DEU',
     'Ghana': 'GHA',
     'Gibraltar': 'GIB',
     'Greece': 'GRC',
     'Greenland': 'GRL',
     'Grenada': 'GRD',
     'Guam': 'GUM',
     'Guatemala': 'GTM',
     'Guernsey': 'GGY',
     'Guinea': 'GIN',
     'Guinea-Bissau': 'GNB',
     'Guyana': 'GUY',
     'Haiti': 'HTI',
     'Honduras': 'HND',
     'Hong Kong': 'HKG',
     'Hungary': 'HUN',
     'Iceland': 'ISL',
     'India': 'IND',
     'Indonesia': 'IDN',
     'Iran': 'IRN',
     'Iraq': 'IRQ',
     'Ireland': 'IRL',
     'Isle of Man': 'IMN',
     'Israel': 'ISR',
     'Italy': 'ITA',
     'Jamaica': 'JAM',
     'Japan': 'JPN',
     'Jersey': 'JEY',
     'Jordan': 'JOR',
     'Kazakhstan': 'KAZ',
     'Kenya': 'KEN',
     'Kiribati': 'KIR',
     'Korea, North': 'PRK',
     'Korea, South': 'KOR',
     'Kosovo': 'KSV',
     'Kuwait': 'KWT',
     'Kyrgyzstan': 'KGZ',
     'Laos': 'LAO',
     'Latvia': 'LVA',
     'Lebanon': 'LBN',
     'Lesotho': 'LSO',
     'Liberia': 'LBR',
     'Libya': 'LBY',
     'Liechtenstein': 'LIE',
     'Lithuania': 'LTU',
     'Luxembourg': 'LUX',
     'Macau': 'MAC',
     'Macedonia': 'MKD',
     'Madagascar': 'MDG',
     'Malawi': 'MWI',
     'Malaysia': 'MYS',
     'Maldives': 'MDV',
     'Mali': 'MLI',
     'Malta': 'MLT',
     'Marshall Islands': 'MHL',
     'Mauritania': 'MRT',
     'Mauritius': 'MUS',
     'Mexico': 'MEX',
     'Micronesia, Federated States of': 'FSM',
     'Moldova': 'MDA',
     'Monaco': 'MCO',
     'Mongolia': 'MNG',
     'Montenegro': 'MNE',
     'Morocco': 'MAR',
     'Mozambique': 'MOZ',
     'Namibia': 'NAM',
     'Nepal': 'NPL',
     'Netherlands': 'NLD',
     'New Caledonia': 'NCL',
     'New Zealand': 'NZL',
     'Nicaragua': 'NIC',
     'Niger': 'NER',
     'Nigeria': 'NGA',
     'Niue': 'NIU',
     'Northern Mariana Islands': 'MNP',
     'Norway': 'NOR',
     'Oman': 'OMN',
     'Pakistan': 'PAK',
     'Palau': 'PLW',
     'Panama': 'PAN',
     'Papua New Guinea': 'PNG',
     'Paraguay': 'PRY',
     'Peru': 'PER',
     'Philippines': 'PHL',
     'Poland': 'POL',
     'Portugal': 'PRT',
     'Puerto Rico': 'PRI',
     'Qatar': 'QAT',
     'Romania': 'ROU',
     'Russia': 'RUS',
     'Rwanda': 'RWA',
     'Saint Kitts and Nevis': 'KNA',
     'Saint Lucia': 'LCA',
     'Saint Martin': 'MAF',
     'Saint Pierre and Miquelon': 'SPM',
     'Saint Vincent and the Grenadines': 'VCT',
     'Samoa': 'WSM',
     'San Marino': 'SMR',
     'Sao Tome and Principe': 'STP',
     'Saudi Arabia': 'SAU',
     'Senegal': 'SEN',
     'Serbia': 'SRB',
     'Seychelles': 'SYC',
     'Sierra Leone': 'SLE',
     'Singapore': 'SGP',
     'Sint Maarten': 'SXM',
     'Slovakia': 'SVK',
     'Slovenia': 'SVN',
     'Solomon Islands': 'SLB',
     'Somalia': 'SOM',
     'South Africa': 'ZAF',
     'South Sudan': 'SSD',
     'Spain': 'ESP',
     'Sri Lanka': 'LKA',
     'Sudan': 'SDN',
     'Suriname': 'SUR',
     'Swaziland': 'SWZ',
     'Sweden': 'SWE',
     'Switzerland': 'CHE',
     'Syria': 'SYR',
     'Taiwan': 'TWN',
     'Tajikistan': 'TJK',
     'Tanzania': 'TZA',
     'Thailand': 'THA',
     'Timor-Leste': 'TLS',
     'Togo': 'TGO',
     'Tonga': 'TON',
     'Trinidad and Tobago': 'TTO',
     'Tunisia': 'TUN',
     'Turkey': 'TUR',
     'Turkmenistan': 'TKM',
     'Tuvalu': 'TUV',
     'Uganda': 'UGA',
     'Ukraine': 'UKR',
     'United Arab Emirates': 'ARE',
     'United Kingdom': 'GBR',
     'United States': 'USA',
     'Uruguay': 'URY',
     'Uzbekistan': 'UZB',
     'Vanuatu': 'VUT',
     'Venezuela': 'VEN',
     'Vietnam': 'VNM',
     'Virgin Islands': 'VGB',
     'West Bank': 'WBG',
     'Yemen': 'YEM',
     'Zambia': 'ZMB',
     'Zimbabwe': 'ZWE'
}
results_to_points_home = {
    'Team win': 3,
    'Opponent win': 0,
    'Draw': 1,
    'Opponent win (Pens)': 0,
    'Team win (Pens)': 3
}
results_to_points_away = {
    'Team win': 0,
    'Opponent win': 3,
    'Draw': 1,
    'Opponent win (Pens)': 3,
    'Team win (Pens)': 0
}
continent_dict={
    'Russia': 'Europe',
    'Saudi Arabia': 'Asia',
    'Egypt': 'Africa',
    'Uruguay': 'South America',
    'Morocco': 'Africa',
    'Iran': 'Asia',
    'Portugal': 'Europe',
    'Spain': 'Europe',
    'France': 'Europe',
    'Australia': 'Asia',
    'Argentina': 'South America',
    'Iceland': 'Europe',
    'Peru': 'South America',
    'Denmark': 'Europe',
    'Croatia': 'Europe',
    'Nigeria': 'Africa',
    'Costa Rica': 'North & Central America',
    'Serbia': 'Europe',
    'Germany': 'Europe',
    'Mexico': 'North & Central America',
    'Brazil': 'South America',
    'Switzerland': 'Europe',
    'Sweden': 'Europe',
    'Korea Republic': 'Asia',
    'Belgium': 'Europe',
    'Panama': 'North & Central America',
    'Tunisia': 'Africa',
    'England': 'Europe',
    'Colombia': 'South America',
    'Japan': 'Asia',
    'Poland': 'Europe',
    'Senegal': 'Africa'
}

country_dict['England'] = 'GBR'  
country_dict['Korea Republic'] = 'KOR'

match_df['Home Team Points'] = match_df['Result'].map(results_to_points_home)
match_df['Away Team Points'] = match_df['Result'].map(results_to_points_away)

country_performance_home = match_df.groupby('Team Team')['Home Team Points'].sum().reset_index()
country_performance_away = match_df.groupby('Opponent Team')['Away Team Points'].sum().reset_index()

country_performance = country_performance_home.merge(country_performance_away, 
                                                     left_on='Team Team', right_on='Opponent Team')
country_performance['Total Points'] = country_performance['Home Team Points'] + \
    country_performance['Away Team Points']

country_performance['Team Plotly Code'] = country_performance['Team Team'].map(country_dict)



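# The choropleth trace and layout definitions are missing from the source;
# the minimal versions below are an assumed reconstruction, not the original.
map_data = [dict(
    type='choropleth',
    locations=country_performance['Team Plotly Code'],
    z=country_performance['Total Points'],
    text=country_performance['Team Team'],
    colorbar=dict(title='Total points')
)]
layout = dict(geo=dict(showframe=False, showcoastlines=False))

fig = dict(data=map_data, layout=layout)
py.iplot(fig, validate=False, filename='d3-world-map')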
df['Shot Accuracy %'] = 100 * df['On-Target'] / (df['On-Target'] + df['Off-Target'])
team_precision = df.groupby('Team')[['Pass Accuracy %', 'Shot Accuracy %']].mean().reset_index()
team_precision = \
    team_precision.merge(country_performance[['Team Team', 'Total Points']], left_on='Team', right_on='Team Team')

In [None]:
fig, ax = plt.subplots(1, 1, figsize = [12, 8])
ax.scatter(team_precision['Pass Accuracy %'], team_precision['Shot Accuracy %'],
           s=100 * team_precision['Total Points'], alpha=.7)
ax.set_xlabel('Pass Accuracy (%)')
ax.set_ylabel('Shot Accuracy (%)')

for spine in ['top', 'left', 'right', 'bottom']:
    ax.spines[spine].set_visible(False)

ax.grid(linestyle='--', alpha=.7)
for i, row in team_precision.iterrows():
    ax.annotate(row['Team'], xy=(row['Pass Accuracy %']+.3, row['Shot Accuracy %']+.5))

plt.show()

It turns out that passing and shooting better do not guarantee a win, but there is a positive relationship between passing accuracy and the size of the circle, which encodes total points. With more data, we have reason to believe that a simple linear regression model would be enough to support this claim.