# World Cup Champion Prediction
The workflow of this notebook consists of the following steps:
   * Define the objective
   * Exploratory data analysis
   * Data visualization and summary statistics
   * Feature engineering
   * Build a regression model
   * Predict all possible games among the 32 teams
   * Simulate the tournament for the 2018 World Cup

In [14]:
import pandas as pd
matches = pd.read_csv('./data/matches.csv')
teams = pd.read_csv('./data/teams.csv')

## Objective

* The goal is to predict the goal difference (GD) of a match between two given teams.
* If the GD is positive, the first team wins; if it is negative, the second team wins; a GD of zero is a draw.
* We treat the goal difference as a continuous variable, so we can predict it with a regression model.
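
A regression model returns a real number, so a predicted GD will rarely be exactly zero. One way to turn the prediction back into a result is to treat small values as a draw. The sketch below illustrates that idea only; `outcome_from_gd` and its `draw_margin` tolerance are hypothetical names and values, not part of this notebook.

```python
def outcome_from_gd(predicted_gd, draw_margin=0.5):
    """Map a predicted goal difference (team1 minus team2) to a match result.

    draw_margin is a hypothetical tolerance chosen for illustration only.
    """
    if predicted_gd > draw_margin:
        return "team1 wins"
    if predicted_gd < -draw_margin:
        return "team2 wins"
    # anything within [-draw_margin, draw_margin] is treated as a draw
    return "draw"


print(outcome_from_gd(1.8))
print(outcome_from_gd(-0.2))
```

A sensible value for the margin would have to be tuned against historical results rather than assumed.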

## Data 
* Our dataset comes from https://github.com/neaorin/PredictTheWorldCup/tree/master/input
* It includes about 30,000 matches from international class A tournaments, from 1950 to 2017.

In [15]:
matches.head()

Unnamed: 0,date,team1,team1Text,team2,team2Text,venue,IdCupSeason,CupName,team1Score,team2Score,statText,resText,team1PenScore,team2PenScore
0,19500308,WAL,Wales,NIR,Northern Ireland,"Cardiff, Wales",6,FIFA competition team qualification,0.0,0.0,,0-0,,
1,19500402,ESP,Spain,POR,Portugal,"Madrid, Spain",6,FIFA competition team qualification,5.0,1.0,,5-1,,
2,19500409,POR,Portugal,ESP,Spain,"Lisbon, Portugal",6,FIFA competition team qualification,2.0,2.0,,2-2,,
3,19500415,SCO,Scotland,ENG,England,"Glasgow, Scotland",6,FIFA competition team qualification,0.0,1.0,,0-1,,
4,19500624,BRA,Brazil,MEX,Mexico,"Rio De Janeiro, Brazil",7,FIFA competition team final,4.0,0.0,,4-0,,


In [16]:
teams.head()

Unnamed: 0,confederation,name,fifa_code,ioc_code
0,CAF,Algeria,ALG,ALG
1,CAF,Angola,ANG,ANG
2,CAF,Benin,BEN,BEN
3,CAF,Botswana,BOT,BOT
4,CAF,Burkina Faso,BFA,BUR


In [17]:
set(matches['CupName'].values)

{'Confederation competition team final',
 'FIFA competition team final',
 'FIFA competition team qualification',
 'Friendly'}

In [18]:
# remove duplicate matches
matches = matches.drop_duplicates()
# convert the integer yyyymmdd date into a 'yyyy-mm-dd' string
from datetime import datetime
matches['date'] = matches['date'].apply(lambda x: datetime.strptime(str(x), '%Y%m%d').strftime('%Y-%m-%d'))
matches.shape

(31825, 14)
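
The cell above leaves `date` as a formatted string. If real datetime values are wanted, for example to filter by year, a vectorized alternative is `pd.to_datetime`. The sketch below uses two dates from the preview above in a toy Series; `raw_dates`, `parsed` and `as_text` are illustrative names only.

```python
import pandas as pd

# Toy dates in the same integer yyyymmdd layout as matches['date'].
raw_dates = pd.Series([19500308, 19500402])

# Parse the whole column at once into a datetime64 Series.
parsed = pd.to_datetime(raw_dates.astype(str), format='%Y%m%d')

# If strings are still wanted, format them afterwards.
as_text = parsed.dt.strftime('%Y-%m-%d')

print(parsed.dt.year.tolist())
print(as_text.tolist())
```

Keeping the parsed column gives access to the `.dt` accessor, for example `parsed.dt.year`, without re-parsing strings.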

Some domain knowledge of soccer helps here. Whether a team is playing at home can make a significant difference to its performance, so we want to record which team, if any, is the home side.

- We add two new variables to indicate whether each team is playing at home:
    - ```team1Home``` 
    - ```team2Home```
- A team counts as playing at home when its name appears in the ```venue``` string.
- If only the first team is playing at home, then ```team1Home``` is ```True``` and ```team2Home``` is ```False```, and vice versa.
- If both team names appear in the venue, both ```team1Home``` and ```team2Home``` are ```True```; if neither does (a neutral venue), both are ```False```. 

In [21]:
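# a team is flagged as home when its name is a case-insensitive substring of the venue
# note: the outer str(...) stores the flags as the strings 'True'/'False', not booleans
matches['team1Home'] = matches.apply(lambda x: str(x['team1Text'].lower() in str(x['venue']).lower()), axis=1)
matches['team2Home'] = matches.apply(lambda x: str(x['team2Text'].lower() in str(x['venue']).lower()), axis=1)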
matches.head()

Unnamed: 0,date,team1,team1Text,team2,team2Text,venue,IdCupSeason,CupName,team1Score,team2Score,statText,resText,team1PenScore,team2PenScore,team1Home,team2Home
0,1950-03-08,WAL,Wales,NIR,Northern Ireland,"Cardiff, Wales",6,FIFA competition team qualification,0.0,0.0,,0-0,,,True,False
1,1950-04-02,ESP,Spain,POR,Portugal,"Madrid, Spain",6,FIFA competition team qualification,5.0,1.0,,5-1,,,True,False
2,1950-04-09,POR,Portugal,ESP,Spain,"Lisbon, Portugal",6,FIFA competition team qualification,2.0,2.0,,2-2,,,True,False
3,1950-04-15,SCO,Scotland,ENG,England,"Glasgow, Scotland",6,FIFA competition team qualification,0.0,1.0,,0-1,,,True,False
4,1950-06-24,BRA,Brazil,MEX,Mexico,"Rio De Janeiro, Brazil",7,FIFA competition team final,4.0,0.0,,4-0,,,True,False
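
A plain substring test can flag a team as home when its name merely occurs inside a longer country name. A stricter variant compares the team name with the last comma-separated part of the venue. The sketch below assumes venues follow the `City, Country` layout seen in the preview. `is_home` is a hypothetical helper, and the `Lagos, Nigeria` venue is an invented example, not a row from the dataset.

```python
import pandas as pd


def is_home(team_name, venue):
    """Return True if the last comma-separated part of venue equals team_name.

    Comparison is case-insensitive; missing values yield False.
    """
    if pd.isna(team_name) or pd.isna(venue):
        return False
    country = str(venue).split(',')[-1].strip().lower()
    return country == str(team_name).strip().lower()


print(is_home('Wales', 'Cardiff, Wales'))   # exact country match
print(is_home('Niger', 'Lagos, Nigeria'))   # invented venue: 'niger' != 'nigeria'
print('niger' in 'lagos, nigeria')          # the substring test would match here
```

The trade-off is that the stricter rule misses any venue that does not end with the country name exactly as it appears in `team1Text` or `team2Text`, so it would need checking against the actual venue values before use.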


Another factor that may influence a team's performance is the type of match it is playing. Teams often treat a friendly match as an opportunity to try out young, talented players, so they tend to be less concerned with the score. For the actual prediction, we may want to exclude friendly matches if they turn out to make a large difference. 

In [23]:
import copy
# keep a deep copy of the matches that still includes the friendly matches
matches_with_friendly = copy.deepcopy(matches)
# exclude the friendly matches from our analysis
matches = matches[matches['CupName'] != 'Friendly']
# also drop outliers, i.e. matches in which either team scored more than
# 10 goals; rows with a missing score fail the comparison and are dropped too
matches = matches[(matches['team1Score'] <= 10) & (matches['team2Score'] <= 10)]
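
To check whether friendly matches really behave differently, one option is to compare the goal-difference distribution of the two groups before discarding anything. The sketch below is one possible way to do that; `gd_summary_by_match_type` is a hypothetical helper, and the toy rows are invented purely to show the call.

```python
import pandas as pd


def gd_summary_by_match_type(df):
    """Summarise goal difference (team1 minus team2) for friendly vs. competitive matches."""
    out = df.copy()
    out['gd'] = out['team1Score'] - out['team2Score']
    # label each row by match type using the CupName column
    out['match_type'] = out['CupName'].eq('Friendly').map(
        {True: 'friendly', False: 'competitive'}
    )
    return out.groupby('match_type')['gd'].agg(['count', 'mean', 'std'])


# Toy rows, invented for illustration only; not real results.
toy = pd.DataFrame({
    'CupName': ['Friendly', 'Friendly',
                'FIFA competition team final', 'FIFA competition team final'],
    'team1Score': [3.0, 1.0, 2.0, 0.0],
    'team2Score': [0.0, 1.0, 1.0, 1.0],
})
summary = gd_summary_by_match_type(toy)
print(summary)
```

In the notebook, the same call could be made on `matches_with_friendly`, which still contains both kinds of match.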