## World Cup 2022: Knockout stage prediction

This notebook walks through the main steps of developing a model that predicts the results of the knockout stage of the 2022 World Cup in Qatar.

With this solution I achieved an accuracy of 62.5% in the 2022 World Cup.

This project was motivated by the World Cup's Data Science Contest challenge, offered by SigmaGeek.
If you want to take part in other challenges, follow this link:
https://sigmageek.com/?ref=04QS0KSQ7V8J9VH

This notebook's solution is based on the following notebook:
https://www.kaggle.com/code/brunosoaresdossantos/fifa-world-cup-prediction

## Index
1. Importing the libraries
2. Reading the files
3. Looking briefly at the data
4. Feature Engineering
5. Working only with World Cup data - "Training is training, and a game is a game"
6. Split into training, validation and testing.
7. Feature Engineering with Cup data
8. Training the model - RandomForestClassifier
9. Training the model - XGBClassifier
10. Training the model - LogisticRegression
11. Training the test file for submission - XGBClassifier
12. Predictions of the round of 8
13. Predicting the semifinal
14. Predicting the final and third place
15. Creating the file for submission

## 1. Importing the libraries

This step imports all the libraries used in the notebook.



In [1]:
# Importing the Standard Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Library for working with dates
from datetime import datetime

# Importing the library to load files from the drive
from google.colab import drive

# Importing the libraries to train the model
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from sklearn.linear_model import LogisticRegression

# Importing the metrics libraries
from sklearn.metrics import  accuracy_score
from sklearn.metrics import classification_report

In [2]:
# Mount Google Drive
drive.mount('/content/drive')

Mounted at /content/drive


## 2. Reading the files
In this step the data is read. The file "international_matches.csv" comes from Kaggle:
https://www.kaggle.com/datasets/brenda89/fifa-world-cup-2022

I created the "oitavas de final.csv" (round of 16) file myself, entering the FIFA ranking data and filling in each team's scores from the "international_matches.csv" file.

To reflect the effect of injuries on the teams, I applied a kind of "penalty" to their scores.  
France: mainly because of the injuries to Benzema and Kanté (-2 points in the attack score, -2 in the midfield score and -1 in the defence score).  
Senegal: the team's star player, Mané, was injured (-2 points in the attack score).
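The penalty idea can be sketched as follows. This is only an illustration: the baseline numbers are made up, the function name `apply_penalties` is hypothetical, and mapping "attack" to an `offense` entry is an assumption. In the notebook itself the adjusted values were typed directly into the CSV file.

```python
# Hypothetical baseline scores: illustrative numbers, not values from the dataset.
baseline = {
    "France": {"offense": 88.0, "midfield": 86.0, "defense": 84.0},
    "Senegal": {"offense": 80.0, "midfield": 79.0, "defense": 79.0},
}

# Penalties as described in the text above.
penalties = {
    "France": {"offense": -2.0, "midfield": -2.0, "defense": -1.0},
    "Senegal": {"offense": -2.0},
}

def apply_penalties(scores, penalties):
    """Return a new dict with each team's scores shifted by its penalties."""
    adjusted = {}
    for team, team_scores in scores.items():
        team_penalties = penalties.get(team, {})
        adjusted[team] = {
            line: value + team_penalties.get(line, 0.0)
            for line, value in team_scores.items()
        }
    return adjusted

adjusted = apply_penalties(baseline, penalties)
print(adjusted["France"])
print(adjusted["Senegal"])
```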


In [3]:
# Path to get files from drive
caminho = '/content/drive/MyDrive/Dataset - projetos dados/Copa do mundo 2022/international_matches.csv'
oitavas = '/content/drive/MyDrive/Dataset - projetos dados/Copa do mundo 2022/oitavas de final.csv'

In [4]:
df_copa= pd.read_csv(caminho, delimiter=';')
df_copa_2022 = pd.read_csv(oitavas, delimiter=';')

## 3. Looking briefly at the data




In [5]:
# Looking at the first lines
df_copa.head()

Unnamed: 0,date,home_team,away_team,home_team_continent,away_team_continent,home_team_fifa_rank,away_team_fifa_rank,home_team_total_fifa_points,away_team_total_fifa_points,home_team_score,...,shoot_out,home_team_result,home_team_goalkeeper_score,away_team_goalkeeper_score,home_team_mean_defense_score,home_team_mean_offense_score,home_team_mean_midfield_score,away_team_mean_defense_score,away_team_mean_offense_score,away_team_mean_midfield_score
0,08/08/1993,Bolivia,Uruguay,South America,South America,59.0,22.0,0.0,0.0,3.0,...,No,Win,,,,,,,,
1,08/08/1993,Brazil,Mexico,South America,North America,8.0,14.0,0.0,0.0,1.0,...,No,Draw,,,,,,,,
2,08/08/1993,Ecuador,Venezuela,South America,South America,35.0,94.0,0.0,0.0,5.0,...,No,Win,,,,,,,,
3,08/08/1993,Guinea,Sierra Leone,Africa,Africa,65.0,86.0,0.0,0.0,1.0,...,No,Win,,,,,,,,
4,08/08/1993,Paraguay,Argentina,South America,South America,67.0,5.0,0.0,0.0,1.0,...,No,Lose,,,,,,,,


In [6]:
# Looking at column type
df_copa.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 23969 entries, 0 to 23968
Data columns (total 25 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   date                           23969 non-null  object 
 1   home_team                      23969 non-null  object 
 2   away_team                      23969 non-null  object 
 3   home_team_continent            23921 non-null  object 
 4   away_team_continent            23921 non-null  object 
 5   home_team_fifa_rank            23923 non-null  float64
 6   away_team_fifa_rank            23922 non-null  float64
 7   home_team_total_fifa_points    23921 non-null  float64
 8   away_team_total_fifa_points    23921 non-null  float64
 9   home_team_score                23921 non-null  float64
 10  away_team_score                23921 non-null  float64
 11  tournament                     23921 non-null  object 
 12  city                         

In [7]:
# Transforming object "date" column to datetime
df_copa['date'] = pd.to_datetime(df_copa['date'])

In [8]:
# Looking at null values
df_copa.isnull().sum().sort_values(ascending=False)

away_team_mean_defense_score     16405
home_team_mean_defense_score     16182
away_team_mean_midfield_score    15990
away_team_goalkeeper_score       15873
home_team_mean_midfield_score    15807
away_team_mean_offense_score     15657
home_team_goalkeeper_score       15588
home_team_mean_offense_score     15459
country                             48
home_team_result                    48
shoot_out                           48
city                                48
tournament                          48
away_team_score                     48
home_team_score                     48
away_team_total_fifa_points         48
home_team_total_fifa_points         48
away_team_continent                 48
home_team_continent                 48
away_team_fifa_rank                 47
home_team_fifa_rank                 46
home_team                            0
neutral_location                     0
away_team                            0
date                                 0
dtype: int64

In [9]:
# Looking at the number of each result, we see that most games do not end in a draw.
df_copa['home_team_result'].value_counts().sort_values(ascending=False)

Win     11761
Lose     6771
Draw     5389
Name: home_team_result, dtype: int64
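To put the counts above in proportion, the short sketch below turns them into shares. The numbers are copied from the `value_counts()` output; the variable names are only illustrative.

```python
# Counts copied from the value_counts() output above.
counts = {"Win": 11761, "Lose": 6771, "Draw": 5389}

total = sum(counts.values())
shares = {result: n / total for result, n in counts.items()}

for result, share in shares.items():
    print(f"{result}: {share:.1%}")
```

Draws make up a little under a quarter of the games, which supports treating the knockout predictions as a win/lose problem.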

## 4. Feature Engineering
In this step we will create new features and eliminate others.



In [10]:
# Creating a function to change object values to numeric in the 'home_team_result' column.
def transform_resultado(result):
  if result == "Win":
    return 1
  elif result == "Lose":
    return -1
  else:
    return 0
df_copa['home_team_result'] = df_copa['home_team_result'].map(transform_resultado)

In [11]:
# Attack score = midfield + offense; defense score = goalkeeper + defense
df_copa['home_attack'] = df_copa['home_team_mean_midfield_score'] + df_copa['home_team_mean_offense_score']
df_copa['home_defense'] = df_copa['home_team_goalkeeper_score'] + df_copa['home_team_mean_defense_score']

df_copa['away_attack'] = df_copa['away_team_mean_midfield_score'] + df_copa['away_team_mean_offense_score']
df_copa['away_defense'] = df_copa['away_team_goalkeeper_score'] + df_copa['away_team_mean_defense_score']

# Applying in df_copa_2022 as well
df_copa_2022['home_attack'] = df_copa_2022['home_team_mean_midfield_score'] + df_copa_2022['home_team_mean_offense_score']
df_copa_2022['home_defense'] = df_copa_2022['home_team_goalkeeper_score'] + df_copa_2022['home_team_mean_defense_score']

df_copa_2022['away_attack'] = df_copa_2022['away_team_mean_midfield_score'] + df_copa_2022['away_team_mean_offense_score']
df_copa_2022['away_defense'] = df_copa_2022['away_team_goalkeeper_score'] + df_copa_2022['away_team_mean_defense_score']

In [12]:
# Creating a variable with the ranking of the home team minus the away team
df_copa['rank_difference'] = df_copa['home_team_fifa_rank'] - df_copa['away_team_fifa_rank']

df_copa_2022['rank_difference'] = df_copa_2022['home_team_fifa_rank'] - df_copa_2022['away_team_fifa_rank']

In [13]:
# Let's delete the columns that are not very important

# The city and country do not matter; what matters is whether the location is neutral
# 'shoot_out' is not used as a feature, so it is dropped as well

df_copa = df_copa.drop(['home_team_continent', 'away_team_continent', 'city','country', 
              'shoot_out', 'home_team_mean_midfield_score', 'home_team_mean_offense_score', 
              'home_team_goalkeeper_score', 'home_team_mean_defense_score', 
              'away_team_mean_midfield_score', 'away_team_mean_offense_score', 
              'away_team_goalkeeper_score',  'away_team_mean_defense_score'], axis=1)

## 5. Working only with World Cup data - "Training is training, and a game is a game"




In [14]:
df_copa = df_copa[df_copa['tournament']== 'FIFA World Cup']

In [15]:
copa_2006 = df_copa[df_copa['date'].dt.year ==2006]
copa_2010 = df_copa[df_copa['date'].dt.year ==2010]
copa_2014 = df_copa[df_copa['date'].dt.year ==2014]
copa_2018 = df_copa[df_copa['date'].dt.year ==2018]
copa_2022 = df_copa_2022

In [16]:
# Selecting only the knockout stage of each World Cup (the rows after the 48 group-stage games), because that is what we want to predict.
copa_2006= copa_2006[48:]
copa_2010 = copa_2010[48:]
copa_2014 = copa_2014[48:]
copa_2018 = copa_2018[48:]
copa_2022 = copa_2022
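The `[48:]` slice relies on the tournament format used in the 2006 to 2018 editions: 32 teams in 8 groups of 4, each group played as a single round robin. The arithmetic is sketched below. The slice also assumes that each year's rows are in chronological order and that all 64 matches are present, which is not checked here.

```python
from itertools import combinations

N_GROUPS = 8          # World Cup format used from 2006 to 2018
TEAMS_PER_GROUP = 4

# Each group is a round robin: every pair of teams meets once.
matches_per_group = len(list(combinations(range(TEAMS_PER_GROUP), 2)))
group_stage_matches = N_GROUPS * matches_per_group

# Round of 16, quarterfinals, semifinals, third place + final.
knockout_matches = 8 + 4 + 2 + 2

print(group_stage_matches, knockout_matches)
```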

## 6. Split into training, validation and testing.
We found a lot of missing data before 2005, so we use World Cup data from 2006 onwards: 2006, 2010 and 2014 for training, 2018 for validation and 2022 for testing.


In [17]:
treino = pd.concat([copa_2006, copa_2010, copa_2014])
valid_1 = copa_2018.copy()  # copy so that adding columns later does not raise SettingWithCopyWarning
teste = copa_2022

## 7. Feature Engineering with Cup data



In [18]:
# Filling in null values (usually from teams with less tradition) with the lowest score in each column

treino['away_defense'] = treino['away_defense'].fillna(treino['away_defense'].min())
treino['home_defense'] = treino['home_defense'].fillna(treino['home_defense'].min())
treino['home_attack'] = treino['home_attack'].fillna(treino['home_attack'].min())
treino['away_attack'] = treino['away_attack'].fillna(treino['away_attack'].min())

valid_1['away_defense'] = valid_1['away_defense'].fillna(valid_1['away_defense'].min())
valid_1['home_defense'] = valid_1['home_defense'].fillna(valid_1['home_defense'].min())
valid_1['home_attack'] = valid_1['home_attack'].fillna(valid_1['home_attack'].min())
valid_1['away_attack'] = valid_1['away_attack'].fillna(valid_1['away_attack'].min())

In [19]:
# Creating attack minus defense feature
treino['home_attack_goal'] = treino['home_attack'] - treino['away_defense']
treino['away_attack_goal'] = treino['away_attack'] - treino['home_defense']

valid_1['home_attack_goal'] = valid_1['home_attack'] - valid_1['away_defense']
valid_1['away_attack_goal'] = valid_1['away_attack'] - valid_1['home_defense']

teste['home_attack_goal'] = teste['home_attack'] - teste['away_defense']
teste['away_attack_goal'] = teste['away_attack'] - teste['home_defense']

In [20]:
# Creating a feature that roughly normalizes the gap between the two previous features
treino['diff_goal'] = (treino['home_attack_goal'] - treino['away_attack_goal'])/treino['home_attack_goal']

valid_1['diff_goal'] = (valid_1['home_attack_goal'] - valid_1['away_attack_goal'])/valid_1['home_attack_goal']

teste['diff_goal'] = (teste['home_attack_goal'] - teste['away_attack_goal'])/teste['home_attack_goal']
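The `diff_goal` ratio divides by `home_attack_goal`, so it is undefined when that value is 0, and its sign flips when the denominator is negative. The sketch below reproduces the formula for single values. The function name `diff_goal` and the choice to return `None` are illustrative assumptions. The notebook instead removes `inf` and `NaN` rows later, before training.

```python
import math

def diff_goal(home_attack_goal, away_attack_goal):
    """Relative gap between the two attack-minus-defense features.

    Returns None instead of inf/NaN when the ratio is undefined.
    """
    if math.isnan(home_attack_goal) or math.isnan(away_attack_goal):
        return None
    if home_attack_goal == 0:
        return None
    return (home_attack_goal - away_attack_goal) / home_attack_goal

# The first two calls use values from rows 9701 and 9715 of the training table
# shown later in the notebook.
print(round(diff_goal(13.0, -18.0), 6))
print(round(diff_goal(-2.5, -6.2), 6))
print(diff_goal(0.0, 3.0))
```

In the second call the home side has the better matchup (-2.5 against -6.2), yet the ratio is negative because the denominator is negative. This is worth keeping in mind when interpreting the feature.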

In [21]:
# The team score is the sum of the attack and defense scores.
treino['home_score'] = treino['home_attack'] + treino['home_defense']
treino['away_score'] = treino['away_attack'] + treino['away_defense']

valid_1['home_score'] = valid_1['home_attack'] + valid_1['home_defense']
valid_1['away_score'] = valid_1['away_attack'] + valid_1['away_defense']

teste['home_score'] = teste['home_attack'] + teste['home_defense']
teste['away_score'] = teste['away_attack'] + teste['away_defense']

In [22]:
# Creating a feature that roughly normalizes the FIFA rank difference by the home team's rank
treino['dif_fifa_rank'] = (treino['home_team_fifa_rank'] - treino['away_team_fifa_rank'])/treino['home_team_fifa_rank'] 

valid_1['dif_fifa_rank'] = (valid_1['home_team_fifa_rank'] - valid_1['away_team_fifa_rank'])/valid_1['home_team_fifa_rank']

teste['dif_fifa_rank'] = (teste['home_team_fifa_rank'] - teste['away_team_fifa_rank'])/teste['home_team_fifa_rank']

In [23]:
#xvariaveis = ['home_team_fifa_rank',	'away_team_fifa_rank', 'home_attack',	'home_defense',	'away_attack',	'away_defense', 'rank_difference', 'home_attack_goal', 'away_attack_goal', 'diff_goal', 'home_score','away_score' ]

In [24]:
# Selecting model features
xvariaveis = ['rank_difference', 'home_attack_goal', 'away_attack_goal', 'diff_goal', 'home_score']

In [25]:
treino_var = ['rank_difference', 'home_attack_goal', 'away_attack_goal', 'diff_goal', 'home_score', 'home_team_result']

In [26]:
treino = treino[treino_var]

In [27]:
# Creating filter
df_filter = treino.isin([np.nan, np.inf, -np.inf])
  
# Masking df with the filter
treino = treino[~df_filter]
  
# Dropping rows with nan values
treino.dropna(inplace=True)
  
# Printing df
treino

Unnamed: 0,rank_difference,home_attack_goal,away_attack_goal,diff_goal,home_score,home_team_result
9701,5.0,13.0,-18.0,2.384615,346.5,1
9702,3.0,11.8,-16.7,2.415254,350.2,1
9705,-29.0,40.4,-33.6,1.831683,354.7,1
9709,-29.0,20.0,-30.1,2.505,361.0,1
9711,-10.0,16.9,-4.6,1.272189,308.5,-1
9712,-47.0,46.9,-28.7,1.61194,363.7,1
9713,-3.0,0.4,-2.5,7.25,360.4,-1
9715,10.0,-2.5,-6.2,-1.48,350.2,1
9716,-32.0,38.0,-36.0,1.947368,361.0,1
9718,-7.0,7.2,1.0,0.861111,363.7,-1


In [28]:
valid_1 = valid_1[treino_var]

In [29]:
# Creating filter
df_filter = valid_1.isin([np.nan, np.inf, -np.inf])
  
# Masking df with the filter
valid_1 = valid_1[~df_filter]
  
# Dropping rows with nan values
valid_1.dropna(inplace=True)
  
# Printing df
valid_1

Unnamed: 0,rank_difference,home_attack_goal,away_attack_goal,diff_goal,home_score,home_team_result
20433,2.0,6.5,2.2,0.661538,344.5,1
20434,10.0,-1.7,3.7,3.176471,327.8,1
20435,43.0,-17.3,14.2,1.820809,317.0,1
20436,-3.0,2.6,-7.5,3.884615,330.9,1
20437,-6.0,14.0,-8.8,1.628571,340.6,1
20438,-36.0,25.9,-17.0,1.656371,345.6,1
20439,13.0,-2.2,1.3,1.590909,314.3,1
20440,10.0,-2.3,8.3,4.608696,320.3,-1
20442,12.0,-8.7,8.0,1.91954,327.8,-1
20443,-1.0,-1.7,3.3,2.941176,340.6,-1


In [30]:
# split x and y
y_treino = treino['home_team_result']
X_treino= treino[xvariaveis]

y_valid_1 = valid_1['home_team_result']
X_valid_1= valid_1[xvariaveis]

y_teste = teste['home_team_result']
X_teste= teste[xvariaveis]

In [31]:
# Creating training data with previous training data + previous validation
y_treino_2 = pd.concat([y_treino, y_valid_1], axis=0)
X_treino_2 = pd.concat([X_treino, X_valid_1], axis=0)

## 8. Training the model - RandomForestClassifier
In this step we actually train the models: RandomForestClassifier, XGBClassifier and LogisticRegression.

Starting with RandomForestClassifier:


In [32]:
modelo = RandomForestClassifier(n_estimators=200, min_samples_leaf= 2, max_depth= 6, n_jobs=-1, random_state=0)
modelo.fit(X_treino, y_treino)
p1 = modelo.predict(X_valid_1)

acc1 = accuracy_score(y_true=y_valid_1, y_pred=p1)
print('Acc: {:.4f}'.format(acc1))

# Classification report (precision, recall and F1 per class)
print(classification_report(y_valid_1, p1))

Acc: 0.6667

              precision    recall  f1-score   support

          -1       0.50      0.80      0.62         5
           1       0.86      0.60      0.71        10

    accuracy                           0.67        15
   macro avg       0.68      0.70      0.66        15
weighted avg       0.74      0.67      0.68        15
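Since only two classes appear in the validation games, the counts behind the report above can be recovered from the support and recall columns. The sketch below does that arithmetic. The variable names are illustrative, and the inputs are the rounded values printed in the report.

```python
# Support and recall per class, read from the report above.
support = {-1: 5, 1: 10}
recall = {-1: 0.80, 1: 0.60}

# Correct predictions per class = recall * support.
correct = {label: round(recall[label] * support[label]) for label in support}

# With two classes, the misclassified games of one class are the
# false positives of the other.
false_pos = {-1: support[1] - correct[1], 1: support[-1] - correct[-1]}

precision = {
    label: correct[label] / (correct[label] + false_pos[label])
    for label in support
}
accuracy = sum(correct.values()) / sum(support.values())

print(correct, precision, round(accuracy, 4))
```

The recovered precision values (0.50 and about 0.86) and the accuracy (10 of 15 games) agree with the printed report.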




## 9. Training the model - XGBClassifier



In [33]:
modelo = xgb.XGBClassifier(random_state=0,
                          n_estimators=100,use_label_encoder=False)
modelo.fit(X_treino, y_treino)
p1 = modelo.predict(X_valid_1)

acc1 = accuracy_score(y_true=y_valid_1, y_pred=p1)
print('Acc: {:.4f}'.format(acc1))

# Classification report (precision, recall and F1 per class)
print(classification_report(y_valid_1, p1))

Acc: 0.8000

              precision    recall  f1-score   support

          -1       0.67      0.80      0.73         5
           1       0.89      0.80      0.84        10

    accuracy                           0.80        15
   macro avg       0.78      0.80      0.78        15
weighted avg       0.81      0.80      0.80        15




## 10. Training the model - LogisticRegression



In [34]:
modelo = LogisticRegression(C=10)
modelo.fit(X_treino, y_treino)
p1 = modelo.predict(X_valid_1)

acc1 = accuracy_score(y_true=y_valid_1, y_pred=p1)
print('Acc: {:.4f}'.format(acc1))

# Classification report (precision, recall and F1 per class)
print(classification_report(y_valid_1, p1))

Acc: 0.7333

              precision    recall  f1-score   support

          -1       0.57      0.80      0.67         5
           1       0.88      0.70      0.78        10

    accuracy                           0.73        15
   macro avg       0.72      0.75      0.72        15
weighted avg       0.77      0.73      0.74        15




## 11. Training the test file for submission - XGBClassifier
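XGBClassifier was selected because it had the highest validation accuracy (0.80, against 0.73 for LogisticRegression and 0.67 for RandomForestClassifier). It is retrained here on the training and validation data together.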
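The selection step can be written out as a small sketch. The accuracies are the ones printed in sections 8 to 10; the variable names are illustrative. With only 15 validation games, the gap between the models is one or two games, so the ranking should be read with caution.

```python
# Validation accuracies printed in sections 8 to 10.
val_accuracy = {
    "RandomForestClassifier": 0.6667,
    "XGBClassifier": 0.8000,
    "LogisticRegression": 0.7333,
}
n_valid = 15  # support reported for the 2018 validation games

best_model = max(val_accuracy, key=val_accuracy.get)
correct_games = {name: round(acc * n_valid) for name, acc in val_accuracy.items()}

print(best_model)
print(correct_games)
```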

In [35]:
modelo = xgb.XGBClassifier(random_state=0,
                          n_estimators=100,use_label_encoder=False)
modelo.fit(X_treino_2, y_treino_2)
p1 = modelo.predict(X_teste)

In [36]:
p1

array([ 1,  1, -1,  1,  1,  1, -1,  1])

Under the challenge rules, each prediction had to be submitted as: the name of one team, the number of goals you predicted it would score, the name of the other team, the number of goals you predicted it would score, and the name of the team you predicted would win.

In [37]:
placar = []
for i in range(0,len(p1)):
  if p1[i]==1:
    home='1'
    visitor='0'
    game= df_copa_2022['home_team'][i]+','+home+','+df_copa_2022['away_team'][i]+','+visitor+','+df_copa_2022['home_team'][i]
  else:
    home='0'
    visitor='1'
    game= df_copa_2022['home_team'][i]+','+home+','+df_copa_2022['away_team'][i]+','+visitor+','+df_copa_2022['away_team'][i]
  placar.append(game)

In [38]:
# looking at the predictions of the round of 16
placar

['NED,1,USA,0,NED',
 'ARG,1,AUS,0,ARG',
 'JPN,0,CRO,1,CRO',
 'BRA,1,COR,0,BRA',
 'FRA,1,POL,0,FRA',
 'ENG,1,SEN,0,ENG',
 'MAR,0,ESP,1,ESP',
 'POR,1,SUI,0,POR']
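The loop above is repeated for every round, so it could be factored into a small helper. The sketch below is one possible version; the name `format_prediction` is hypothetical and not part of the notebook. Like the loop, it always writes the score as 1-0 for the predicted winner.

```python
def format_prediction(home_team, away_team, prediction):
    """Build one submission line: home, home goals, away, away goals, winner.

    A prediction of 1 means a home win; anything else is treated as an
    away win, mirroring the if/else in the loop above.
    """
    if prediction == 1:
        return f"{home_team},1,{away_team},0,{home_team}"
    return f"{home_team},0,{away_team},1,{away_team}"

matches = [("NED", "USA"), ("JPN", "CRO")]
predictions = [1, -1]
lines = [format_prediction(h, a, p) for (h, a), p in zip(matches, predictions)]
print(lines)
```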

In [39]:
sub1 = pd.Series(placar)

## 12. Predictions of the round of 8


Knowing the predictions for the round of 16, I created a CSV file with the quarterfinal games and their respective data.  
I did this for every round of the knockout stage.  
The solution was developed exactly as in the round of 16.
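Because the same feature steps are repeated for each round, they can be summarised in one function. The sketch below works on a plain dictionary for a single match rather than on a DataFrame. The name `build_features` and the example numbers are illustrative assumptions, not values from the CSV files.

```python
def build_features(match):
    """Derive the model features for one match from its raw scores."""
    home_attack = match["home_team_mean_midfield_score"] + match["home_team_mean_offense_score"]
    home_defense = match["home_team_goalkeeper_score"] + match["home_team_mean_defense_score"]
    away_attack = match["away_team_mean_midfield_score"] + match["away_team_mean_offense_score"]
    away_defense = match["away_team_goalkeeper_score"] + match["away_team_mean_defense_score"]

    home_attack_goal = home_attack - away_defense
    away_attack_goal = away_attack - home_defense
    return {
        "rank_difference": match["home_team_fifa_rank"] - match["away_team_fifa_rank"],
        "home_attack_goal": home_attack_goal,
        "away_attack_goal": away_attack_goal,
        # Same ratio as the notebook; undefined when home_attack_goal is 0.
        "diff_goal": (home_attack_goal - away_attack_goal) / home_attack_goal,
        "home_score": home_attack + home_defense,
    }

# Illustrative values only, not taken from the CSV files.
example = {
    "home_team_fifa_rank": 8, "away_team_fifa_rank": 3,
    "home_team_mean_midfield_score": 84.0, "home_team_mean_offense_score": 86.0,
    "home_team_goalkeeper_score": 81.0, "home_team_mean_defense_score": 85.0,
    "away_team_mean_midfield_score": 84.0, "away_team_mean_offense_score": 89.0,
    "away_team_goalkeeper_score": 84.0, "away_team_mean_defense_score": 82.0,
}
features = build_features(example)
print(features)
```

The returned keys match the `xvariaveis` list used by the model, so the same function could serve every round.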

In [40]:
# Path to get files from drive
quartas = '/content/drive/MyDrive/Dataset - projetos dados/Copa do mundo 2022/quartas de final.csv'

In [41]:
# Reading the files
df_copa_2022_quartas = pd.read_csv(quartas, delimiter=';')

In [42]:
# Attack score = midfield + offense; defense score = goalkeeper + defense

df_copa_2022_quartas['home_attack'] = df_copa_2022_quartas['home_team_mean_midfield_score'] + df_copa_2022_quartas['home_team_mean_offense_score']
df_copa_2022_quartas['home_defense'] = df_copa_2022_quartas['home_team_goalkeeper_score'] + df_copa_2022_quartas['home_team_mean_defense_score']

df_copa_2022_quartas['away_attack'] = df_copa_2022_quartas['away_team_mean_midfield_score'] + df_copa_2022_quartas['away_team_mean_offense_score']
df_copa_2022_quartas['away_defense'] = df_copa_2022_quartas['away_team_goalkeeper_score'] + df_copa_2022_quartas['away_team_mean_defense_score']

In [43]:
# Creating a variable with the ranking of the home team minus the away team
df_copa_2022_quartas['rank_difference'] = df_copa_2022_quartas['home_team_fifa_rank'] - df_copa_2022_quartas['away_team_fifa_rank']

In [44]:
teste = df_copa_2022_quartas

In [45]:
# Creating attack minus defense feature

teste['home_attack_goal'] = teste['home_attack'] - teste['away_defense']
teste['away_attack_goal'] = teste['away_attack'] - teste['home_defense']

In [46]:
# Creating feature with practically a normalization of the goals differences
teste['diff_goal'] = (teste['home_attack_goal'] - teste['away_attack_goal'])/teste['home_attack_goal']

In [47]:
# The team score is the sum of the attack and defense scores.
teste['home_score'] = teste['home_attack'] + teste['home_defense']
teste['away_score'] = teste['away_attack'] + teste['away_defense']

In [48]:
# Creating a feature that roughly normalizes the FIFA rank difference by the home team's rank
teste['dif_fifa_rank'] = (teste['home_team_fifa_rank'] - teste['away_team_fifa_rank'])/teste['home_team_fifa_rank']

In [49]:
# Selecting model features
xvariaveis = ['rank_difference', 'home_attack_goal', 'away_attack_goal', 'diff_goal', 'home_score']

In [50]:
# split x and y
y_teste = teste['home_team_result']
X_teste= teste[xvariaveis]

In [51]:
# Training the test file for submission - XGBClassifier
modelo = xgb.XGBClassifier(random_state=0,
                          n_estimators=100,use_label_encoder=False)
modelo.fit(X_treino_2, y_treino_2)
p2 = modelo.predict(X_teste)

As in the round of 16, the challenge rules require each prediction in the form: home team, its predicted goals, away team, its predicted goals, predicted winner.

In [52]:
placar2 = []
for i in range(0,len(p2)):
  if p2[i]==1:
    home='1'
    visitor='0'
    game= df_copa_2022_quartas['home_team'][i]+','+home+','+df_copa_2022_quartas['away_team'][i]+','+visitor+','+df_copa_2022_quartas['home_team'][i]
  else:
    home='0'
    visitor='1'
    game= df_copa_2022_quartas['home_team'][i]+','+home+','+df_copa_2022_quartas['away_team'][i]+','+visitor+','+df_copa_2022_quartas['away_team'][i]
  placar2.append(game)

In [53]:
# looking at the predictions
placar2

['NED,0,ARG,1,ARG', 'CRO,0,BRA,1,BRA', 'FRA,1,ENG,0,FRA', 'ESP,1,POR,0,ESP']

In [54]:
sub2 = pd.Series(placar2)

## 13. Predicting the semifinal


Knowing the predictions for the round of 8, I created a CSV file with the semifinal games and their respective data.  
I did this for every round of the knockout stage.  
The solution was developed exactly as in the round of 16.

In [55]:
# Path to get files from drive
semi = '/content/drive/MyDrive/Dataset - projetos dados/Copa do mundo 2022/semifinal.csv'

In [56]:
# Reading the files
df_copa_2022_semi = pd.read_csv(semi, delimiter=';')

In [57]:
# Attack score = midfield + offense; defense score = goalkeeper + defense

df_copa_2022_semi['home_attack'] = df_copa_2022_semi['home_team_mean_midfield_score'] + df_copa_2022_semi['home_team_mean_offense_score']
df_copa_2022_semi['home_defense'] = df_copa_2022_semi['home_team_goalkeeper_score'] + df_copa_2022_semi['home_team_mean_defense_score']

df_copa_2022_semi['away_attack'] = df_copa_2022_semi['away_team_mean_midfield_score'] + df_copa_2022_semi['away_team_mean_offense_score']
df_copa_2022_semi['away_defense'] = df_copa_2022_semi['away_team_goalkeeper_score'] + df_copa_2022_semi['away_team_mean_defense_score']

In [58]:
# Creating a variable with the ranking of the home team minus the away team
df_copa_2022_semi['rank_difference'] = df_copa_2022_semi['home_team_fifa_rank'] - df_copa_2022_semi['away_team_fifa_rank']

In [59]:
teste = df_copa_2022_semi

In [60]:
# Creating attack minus defense feature

teste['home_attack_goal'] = teste['home_attack'] - teste['away_defense']
teste['away_attack_goal'] = teste['away_attack'] - teste['home_defense']

In [61]:
# Creating feature with practically a normalization of the goals differences
teste['diff_goal'] = (teste['home_attack_goal'] - teste['away_attack_goal'])/teste['home_attack_goal']

In [62]:
# The team score is the sum of the attack and defense scores.
teste['home_score'] = teste['home_attack'] + teste['home_defense']
teste['away_score'] = teste['away_attack'] + teste['away_defense']

In [63]:
# Creating a feature that roughly normalizes the FIFA rank difference by the home team's rank

teste['dif_fifa_rank'] = (teste['home_team_fifa_rank'] - teste['away_team_fifa_rank'])/teste['home_team_fifa_rank']

In [64]:
# Selecting model features
xvariaveis = ['rank_difference', 'home_attack_goal', 'away_attack_goal', 'diff_goal', 'home_score']

In [65]:
# split x and y
y_teste = teste['home_team_result']
X_teste= teste[xvariaveis]

In [66]:
# Training the test file for submission - XGBClassifier
modelo = xgb.XGBClassifier(random_state=0,
                          n_estimators=100,use_label_encoder=False)
modelo.fit(X_treino_2, y_treino_2)
p3 = modelo.predict(X_teste)

As in the round of 16, the challenge rules require each prediction in the form: home team, its predicted goals, away team, its predicted goals, predicted winner.

In [67]:
placar3 = []
for i in range(0,len(p3)):
  if p3[i]==1:
    home='1'
    visitor='0'
    game= df_copa_2022_semi['home_team'][i]+','+home+','+df_copa_2022_semi['away_team'][i]+','+visitor+','+df_copa_2022_semi['home_team'][i]
  else:
    home='0'
    visitor='1'
    game= df_copa_2022_semi['home_team'][i]+','+home+','+df_copa_2022_semi['away_team'][i]+','+visitor+','+df_copa_2022_semi['away_team'][i]
  placar3.append(game)

In [68]:
# looking at the predictions
placar3

['ARG,0,BRA,1,BRA', 'FRA,1,ESP,0,FRA']

In [69]:
sub3 = pd.Series(placar3)

## 14. Predicting the final and third place


Knowing the predictions for the semifinals, I created a CSV file with the final and third-place games and their respective data.  
I did this for every round of the knockout stage.  
The solution was developed exactly as in the round of 16.

In [70]:
# Path to get files from drive
finais = '/content/drive/MyDrive/Dataset - projetos dados/Copa do mundo 2022/finais.csv'

In [71]:
# Reading the files
df_copa_2022_finais = pd.read_csv(finais, delimiter=';')

In [72]:
# Attack score = midfield + offense; defense score = goalkeeper + defense

df_copa_2022_finais['home_attack'] = df_copa_2022_finais['home_team_mean_midfield_score'] + df_copa_2022_finais['home_team_mean_offense_score']
df_copa_2022_finais['home_defense'] = df_copa_2022_finais['home_team_goalkeeper_score'] + df_copa_2022_finais['home_team_mean_defense_score']

df_copa_2022_finais['away_attack'] = df_copa_2022_finais['away_team_mean_midfield_score'] + df_copa_2022_finais['away_team_mean_offense_score']
df_copa_2022_finais['away_defense'] = df_copa_2022_finais['away_team_goalkeeper_score'] + df_copa_2022_finais['away_team_mean_defense_score']

In [73]:
# Creating a variable with the ranking of the home team minus the away team
df_copa_2022_finais['rank_difference'] = df_copa_2022_finais['home_team_fifa_rank'] - df_copa_2022_finais['away_team_fifa_rank']

In [74]:
teste = df_copa_2022_finais

In [75]:
# Creating attack minus defense feature

teste['home_attack_goal'] = teste['home_attack'] - teste['away_defense']
teste['away_attack_goal'] = teste['away_attack'] - teste['home_defense']

In [76]:
# Creating feature with practically a normalization of the goals differences
teste['diff_goal'] = (teste['home_attack_goal'] - teste['away_attack_goal'])/teste['home_attack_goal']

In [77]:
# The team score is the sum of the attack and defense scores.

teste['home_score'] = teste['home_attack'] + teste['home_defense']
teste['away_score'] = teste['away_attack'] + teste['away_defense']

In [78]:
# Creating a feature that roughly normalizes the FIFA rank difference by the home team's rank

teste['dif_fifa_rank'] = (teste['home_team_fifa_rank'] - teste['away_team_fifa_rank'])/teste['home_team_fifa_rank']

In [79]:
# Selecting model features
xvariaveis = ['rank_difference', 'home_attack_goal', 'away_attack_goal', 'diff_goal', 'home_score']

In [80]:
# split x and y
y_teste = teste['home_team_result']
X_teste= teste[xvariaveis]

In [81]:
# Training the test file for submission - XGBClassifier
modelo = xgb.XGBClassifier(random_state=0,
                          n_estimators=100,use_label_encoder=False)
modelo.fit(X_treino_2, y_treino_2)
p4 = modelo.predict(X_teste)

As in the round of 16, the challenge rules require each prediction in the form: home team, its predicted goals, away team, its predicted goals, predicted winner.

In [82]:
placar4 = []
for i in range(0,len(p4)):
  if p4[i]==1:
    home='1'
    visitor='0'
    game= df_copa_2022_finais['home_team'][i]+','+home+','+df_copa_2022_finais['away_team'][i]+','+visitor+','+df_copa_2022_finais['home_team'][i]
  else:
    home='0'
    visitor='1'
    game= df_copa_2022_finais['home_team'][i]+','+home+','+df_copa_2022_finais['away_team'][i]+','+visitor+','+df_copa_2022_finais['away_team'][i]
  placar4.append(game)

In [83]:
# looking at the predictions
placar4

['BRA,0,FRA,1,FRA', 'ARG,1,ESP,0,ARG']

In [84]:
sub4 = pd.Series(placar4)

## 15. Creating the file for submission
Merging all knockout stage predictions into a single CSV file.

In [85]:
sub = pd.concat([sub1, sub2, sub3, sub4])

In [86]:
sub

0    NED,1,USA,0,NED
1    ARG,1,AUS,0,ARG
2    JPN,0,CRO,1,CRO
3    BRA,1,COR,0,BRA
4    FRA,1,POL,0,FRA
5    ENG,1,SEN,0,ENG
6    MAR,0,ESP,1,ESP
7    POR,1,SUI,0,POR
0    NED,0,ARG,1,ARG
1    CRO,0,BRA,1,BRA
2    FRA,1,ENG,0,FRA
3    ESP,1,POR,0,ESP
0    ARG,0,BRA,1,BRA
1    FRA,1,ESP,0,FRA
0    BRA,0,FRA,1,FRA
1    ARG,1,ESP,0,ARG
dtype: object

In [87]:
sub.to_csv('copasigmatamata.csv', header= False)
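As written, `to_csv` also writes the Series index as a first column, and because each prediction string contains commas, pandas wraps it in quotes. If the challenge expected bare lines (an assumption, since the required file layout is not shown here), a plain-text write is one alternative. The sketch below uses an in-memory buffer in place of the real file, and the helper name `write_predictions` is hypothetical.

```python
import io

# Hypothetical subset of the predictions; the real list has 16 entries.
predictions = ["NED,1,USA,0,NED", "ARG,1,AUS,0,ARG"]

def write_predictions(lines, handle):
    """Write one prediction per line, with no index column and no quoting."""
    for line in lines:
        handle.write(line + "\n")

buffer = io.StringIO()  # stands in for open('copasigmatamata.csv', 'w')
write_predictions(predictions, buffer)
print(buffer.getvalue())
```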