# Football Scores Predictor

# Introduction

## Data Scientist

- Ryan Mburu

## Problem Definition

-> This project aims at predicting results of a football game between two teams. 

## Context

The predictions will be based on the home team's rank, the away team's rank and the tournament they are participating in.

## Metric of Success

- Understanding the problem and datasets provided
- Performing data cleaning
- Performing EDA to obtain statistical visualizations and the correlations between features
- Performing regression
- Model evaluation through cross-validation, RMSE scores, residual plots and heteroscedasticity checks

## Data relevance

The rankings are accurate according to the FIFA World Rankings page on Wikipedia:
- https://en.wikipedia.org/wiki/FIFA_World_Rankings

# Data Understanding

In [84]:
# Load the libraries to be used

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# ML libraries
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import confusion_matrix

## Reading the data

### Rankings Dataset

In [85]:
# First Dataset
rank = pd.read_csv('/Users/RyanMburu/Documents/DS-Projects/Supervised-Learning/FIFA-rankings/Datasets/fifa_ranking.csv')
rank.head()

Unnamed: 0,rank,country_full,country_abrv,total_points,previous_points,rank_change,cur_year_avg,cur_year_avg_weighted,last_year_avg,last_year_avg_weighted,two_year_ago_avg,two_year_ago_weighted,three_year_ago_avg,three_year_ago_weighted,confederation,rank_date
0,1,Germany,GER,0.0,57,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,UEFA,1993-08-08
1,2,Italy,ITA,0.0,57,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,UEFA,1993-08-08
2,3,Switzerland,SUI,0.0,50,9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,UEFA,1993-08-08
3,4,Sweden,SWE,0.0,55,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,UEFA,1993-08-08
4,5,Argentina,ARG,0.0,51,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,CONMEBOL,1993-08-08


In [86]:
# View last 5 records
rank.tail()

Unnamed: 0,rank,country_full,country_abrv,total_points,previous_points,rank_change,cur_year_avg,cur_year_avg_weighted,last_year_avg,last_year_avg_weighted,two_year_ago_avg,two_year_ago_weighted,three_year_ago_avg,three_year_ago_weighted,confederation,rank_date
57788,206,Anguilla,AIA,0.0,0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,CONCACAF,2018-06-07
57789,206,Bahamas,BAH,0.0,0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,CONCACAF,2018-06-07
57790,206,Eritrea,ERI,0.0,0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,CAF,2018-06-07
57791,206,Somalia,SOM,0.0,0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,CAF,2018-06-07
57792,206,Tonga,TGA,0.0,0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,OFC,2018-06-07


In [87]:
# View the number of records and columns
rank.shape

(57793, 16)

### Results Dataset

In [88]:
# Load the dataset & Preview first 5 records
results = pd.read_csv('/Users/RyanMburu/Documents/DS-Projects/Supervised-Learning/FIFA-rankings/Datasets/results.csv')
results.head()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral
0,1872-11-30,Scotland,England,0,0,Friendly,Glasgow,Scotland,False
1,1873-03-08,England,Scotland,4,2,Friendly,London,England,False
2,1874-03-07,Scotland,England,2,1,Friendly,Glasgow,Scotland,False
3,1875-03-06,England,Scotland,2,2,Friendly,London,England,False
4,1876-03-04,Scotland,England,3,0,Friendly,Glasgow,Scotland,False


In [89]:
# Last 5 records
results.tail()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral
40834,2019-07-18,American Samoa,Tahiti,8,1,Pacific Games,Apia,Samoa,True
40835,2019-07-18,Fiji,Solomon Islands,4,4,Pacific Games,Apia,Samoa,True
40836,2019-07-19,Senegal,Algeria,0,1,African Cup of Nations,Cairo,Egypt,True
40837,2019-07-19,Tajikistan,North Korea,0,1,Intercontinental Cup,Ahmedabad,India,True
40838,2019-07-20,Papua New Guinea,Fiji,1,1,Pacific Games,Apia,Samoa,True


In [90]:
# Records and features
results.shape

(40839, 9)

The rankings dataset has 57,793 records and 16 columns.

The results dataset has 40,839 records and 9 columns.

# Data Cleaning

## Null Values

In [91]:
# Null values
rank.isna().sum()

rank                       0
country_full               0
country_abrv               0
total_points               0
previous_points            0
rank_change                0
cur_year_avg               0
cur_year_avg_weighted      0
last_year_avg              0
last_year_avg_weighted     0
two_year_ago_avg           0
two_year_ago_weighted      0
three_year_ago_avg         0
three_year_ago_weighted    0
confederation              0
rank_date                  0
dtype: int64

In [92]:
results.isna().sum()

date          0
home_team     0
away_team     0
home_score    0
away_score    0
tournament    0
city          0
country       0
neutral       0
dtype: int64

## Duplicate Values

In [93]:
# Duplicates
rank.duplicated().sum()

# Duplicates will not be dropped, as a country can hold the same rank over multiple years,
# e.g. Belgium has been ranked No. 1 from 2018 up to the time of writing


37
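
Before deciding to keep the 37 flagged rows, it may help to look at them directly. `duplicated()` compares every column, so a row is flagged only when it matches another row in all fields, `rank_date` included. The following is a minimal sketch of that inspection; the `demo` frame holds made-up illustrative values and is not taken from `fifa_ranking.csv`.

```python
import pandas as pd

# Hypothetical miniature of the rankings table (illustrative values only).
demo = pd.DataFrame({
    "rank": [1, 1, 1, 2],
    "country_full": ["Belgium", "Belgium", "Belgium", "France"],
    "rank_date": ["2018-09-20", "2018-10-25", "2018-10-25", "2018-10-25"],
})

# The same country and rank on different dates is NOT a duplicate row.
# Only rows that are identical in every column are flagged.
n_dupes = demo.duplicated().sum()

# keep=False marks every member of a duplicate group, so the repeated
# rows can be viewed side by side.
dupe_rows = demo[demo.duplicated(keep=False)]

print(n_dupes)
print(dupe_rows)
```

Running the same two calls on `rank` would show whether the flagged rows are repeated entries for the same country and ranking date.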

In [94]:
results.duplicated().sum()

0

## Datatypes

-> Convert the date columns to datetime and extract the year, since it is the part of the date that matters here

In [95]:
# datatypes

rank.dtypes

rank                         int64
country_full                object
country_abrv                object
total_points               float64
previous_points              int64
rank_change                  int64
cur_year_avg               float64
cur_year_avg_weighted      float64
last_year_avg              float64
last_year_avg_weighted     float64
two_year_ago_avg           float64
two_year_ago_weighted      float64
three_year_ago_avg         float64
three_year_ago_weighted    float64
confederation               object
rank_date                   object
dtype: object

In [96]:
# Convert the rank_date column to datetime, stored in a new 'date' column

rank['date'] = pd.to_datetime(rank['rank_date'])
rank.head()

Unnamed: 0,rank,country_full,country_abrv,total_points,previous_points,rank_change,cur_year_avg,cur_year_avg_weighted,last_year_avg,last_year_avg_weighted,two_year_ago_avg,two_year_ago_weighted,three_year_ago_avg,three_year_ago_weighted,confederation,rank_date,date
0,1,Germany,GER,0.0,57,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,UEFA,1993-08-08,1993-08-08
1,2,Italy,ITA,0.0,57,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,UEFA,1993-08-08,1993-08-08
2,3,Switzerland,SUI,0.0,50,9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,UEFA,1993-08-08,1993-08-08
3,4,Sweden,SWE,0.0,55,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,UEFA,1993-08-08,1993-08-08
4,5,Argentina,ARG,0.0,51,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,CONMEBOL,1993-08-08,1993-08-08


In [97]:
# Extract year from the new column
rank['year'] = rank['date'].dt.year
rank.head()

Unnamed: 0,rank,country_full,country_abrv,total_points,previous_points,rank_change,cur_year_avg,cur_year_avg_weighted,last_year_avg,last_year_avg_weighted,two_year_ago_avg,two_year_ago_weighted,three_year_ago_avg,three_year_ago_weighted,confederation,rank_date,date,year
0,1,Germany,GER,0.0,57,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,UEFA,1993-08-08,1993-08-08,1993
1,2,Italy,ITA,0.0,57,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,UEFA,1993-08-08,1993-08-08,1993
2,3,Switzerland,SUI,0.0,50,9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,UEFA,1993-08-08,1993-08-08,1993
3,4,Sweden,SWE,0.0,55,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,UEFA,1993-08-08,1993-08-08,1993
4,5,Argentina,ARG,0.0,51,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,CONMEBOL,1993-08-08,1993-08-08,1993


In [98]:
# New rank DataFrame with only rank, country name and year, for merging

new_rank = rank[['rank', 'country_full', 'year']]
new_rank

Unnamed: 0,rank,country_full,year
0,1,Germany,1993
1,2,Italy,1993
2,3,Switzerland,1993
3,4,Sweden,1993
4,5,Argentina,1993
...,...,...,...
57788,206,Anguilla,2018
57789,206,Bahamas,2018
57790,206,Eritrea,2018
57791,206,Somalia,2018
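
The rankings table holds 57,793 rows for the years 1993 to 2018, so each country appears to have several rankings within a single year. If one rank per country per year is wanted before merging, one option is to aggregate first. The sketch below uses the mean rank. That choice, the `yearly_rank` name and the sample values are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd

# Hypothetical sample shaped like new_rank (values are illustrative).
sample_rank = pd.DataFrame({
    "rank": [1, 3, 2, 10, 12],
    "country_full": ["Germany", "Germany", "Germany", "Ghana", "Ghana"],
    "year": [1993, 1993, 1994, 1993, 1993],
})

# Collapse to one row per (country, year) using the mean rank.
# Taking the last published rank of the year would be an equally valid choice.
yearly_rank = (
    sample_rank
    .groupby(["country_full", "year"], as_index=False)["rank"]
    .mean()
)

print(yearly_rank)
```

With one row per country and year, a later join on team name and year would not multiply match rows.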


In [80]:
# Do the same for the results table

results['new_date'] = pd.to_datetime(results['date'])
results.head()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,new_date
0,1872-11-30,Scotland,England,0,0,Friendly,Glasgow,Scotland,False,1872-11-30
1,1873-03-08,England,Scotland,4,2,Friendly,London,England,False,1873-03-08
2,1874-03-07,Scotland,England,2,1,Friendly,Glasgow,Scotland,False,1874-03-07
3,1875-03-06,England,Scotland,2,2,Friendly,London,England,False,1875-03-06
4,1876-03-04,Scotland,England,3,0,Friendly,Glasgow,Scotland,False,1876-03-04


In [81]:
# Year column
results['year'] = results['new_date'].dt.year
results.tail()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,new_date,year
40834,2019-07-18,American Samoa,Tahiti,8,1,Pacific Games,Apia,Samoa,True,2019-07-18,2019
40835,2019-07-18,Fiji,Solomon Islands,4,4,Pacific Games,Apia,Samoa,True,2019-07-18,2019
40836,2019-07-19,Senegal,Algeria,0,1,African Cup of Nations,Cairo,Egypt,True,2019-07-19,2019
40837,2019-07-19,Tajikistan,North Korea,0,1,Intercontinental Cup,Ahmedabad,India,True,2019-07-19,2019
40838,2019-07-20,Papua New Guinea,Fiji,1,1,Pacific Games,Apia,Samoa,True,2019-07-20,2019


### Filter the results table to keep only matches played from 1993 onwards

In [82]:
results = results[results['year'] > 1992]
results

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,new_date,year
17361,1993-01-01,Ghana,Mali,1,1,Friendly,Libreville,Gabon,True,1993-01-01,1993
17362,1993-01-02,Gabon,Burkina Faso,1,1,Friendly,Libreville,Gabon,False,1993-01-02,1993
17363,1993-01-02,Kuwait,Lebanon,2,0,Friendly,Kuwait City,Kuwait,False,1993-01-02,1993
17364,1993-01-03,Burkina Faso,Mali,1,0,Friendly,Libreville,Gabon,True,1993-01-03,1993
17365,1993-01-03,Gabon,Ghana,2,3,Friendly,Libreville,Gabon,False,1993-01-03,1993
...,...,...,...,...,...,...,...,...,...,...,...
40834,2019-07-18,American Samoa,Tahiti,8,1,Pacific Games,Apia,Samoa,True,2019-07-18,2019
40835,2019-07-18,Fiji,Solomon Islands,4,4,Pacific Games,Apia,Samoa,True,2019-07-18,2019
40836,2019-07-19,Senegal,Algeria,0,1,African Cup of Nations,Cairo,Egypt,True,2019-07-19,2019
40837,2019-07-19,Tajikistan,North Korea,0,1,Intercontinental Cup,Ahmedabad,India,True,2019-07-19,2019
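
The year filter keeps every match from 1993, while the first ranking shown in `rank.head()` is dated 1993-08-08, so matches from the first months of 1993 come before any ranking. If that matters, a possible alternative is to filter on the full date. This is only a sketch under that assumption: `sample_results`, `first_ranking` and the placeholder team names are illustrative, and the original notebook filters on the year.

```python
import pandas as pd

# Hypothetical sample shaped like the results table (illustrative rows).
sample_results = pd.DataFrame({
    "date": ["1992-12-20", "1993-01-01", "1993-08-08", "2019-07-19"],
    "home_team": ["Team A", "Ghana", "Team B", "Senegal"],
    "away_team": ["Team C", "Mali", "Team D", "Algeria"],
})
sample_results["new_date"] = pd.to_datetime(sample_results["date"])
sample_results["year"] = sample_results["new_date"].dt.year

# Year-based filter, equivalent to the notebook's `year > 1992`.
by_year = sample_results[sample_results["year"] >= 1993]

# Date-based filter: keep matches on or after the first ranking date.
# 1993-08-08 is the earliest rank_date visible in rank.head() (assumed to be the minimum).
first_ranking = pd.Timestamp("1993-08-08")
by_date = sample_results[sample_results["new_date"] >= first_ranking]

print(len(by_year), len(by_date))
```

On the real data, `rank['date'].min()` could replace the hard-coded timestamp.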


In [83]:
# New DataFrame with the important columns: year, home team, away team, scores and tournament

new_results = results.drop(['date', 'city', 'country', 'neutral', 'new_date'], axis=1)
new_results

Unnamed: 0,home_team,away_team,home_score,away_score,tournament,year
17361,Ghana,Mali,1,1,Friendly,1993
17362,Gabon,Burkina Faso,1,1,Friendly,1993
17363,Kuwait,Lebanon,2,0,Friendly,1993
17364,Burkina Faso,Mali,1,0,Friendly,1993
17365,Gabon,Ghana,2,3,Friendly,1993
...,...,...,...,...,...,...
40834,American Samoa,Tahiti,8,1,Pacific Games,2019
40835,Fiji,Solomon Islands,4,4,Pacific Games,2019
40836,Senegal,Algeria,0,1,African Cup of Nations,2019
40837,Tajikistan,North Korea,0,1,Intercontinental Cup,2019


## Preview New Clean Datasets

In [100]:
# Rank table
new_rank

Unnamed: 0,rank,country_full,year
0,1,Germany,1993
1,2,Italy,1993
2,3,Switzerland,1993
3,4,Sweden,1993
4,5,Argentina,1993
...,...,...,...
57788,206,Anguilla,2018
57789,206,Bahamas,2018
57790,206,Eritrea,2018
57791,206,Somalia,2018


In [101]:
# Results table
new_results

Unnamed: 0,home_team,away_team,home_score,away_score,tournament,year
17361,Ghana,Mali,1,1,Friendly,1993
17362,Gabon,Burkina Faso,1,1,Friendly,1993
17363,Kuwait,Lebanon,2,0,Friendly,1993
17364,Burkina Faso,Mali,1,0,Friendly,1993
17365,Gabon,Ghana,2,3,Friendly,1993
...,...,...,...,...,...,...
40834,American Samoa,Tahiti,8,1,Pacific Games,2019
40835,Fiji,Solomon Islands,4,4,Pacific Games,2019
40836,Senegal,Algeria,0,1,African Cup of Nations,2019
40837,Tajikistan,North Korea,0,1,Intercontinental Cup,2019


## Merging the two datasets