# Paul the Octopus

In this notebook, I will analyze data from previous FIFA World Cup matches and try using it to predict the outcome of future matches.

First, we import the pandas library.

In [2]:
import pandas as pd

We then read the datasets we will analyze. These datasets were obtained from [Kaggle](https://www.kaggle.com/abecklas/fifa-world-cup/) and contain information about the matches, the players and the cups.

In [3]:
matches = pd.read_csv('data/WorldCupMatches.csv')
players = pd.read_csv('data/WorldCupPlayers.csv')
world_cups = pd.read_csv('data/WorldCups.csv')

Let's have a look at the first rows of each imported dataset.

In [5]:
matches.head(3)

| | Year | Datetime | Stage | Stadium | City | Home Team Name | Home Team Goals | Away Team Goals | Away Team Name | Win conditions | Attendance | Half-time Home Goals | Half-time Away Goals | Referee | Assistant 1 | Assistant 2 | RoundID | MatchID | Home Team Initials | Away Team Initials |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1930.0 | 13 Jul 1930 - 15:00 | Group 1 | Pocitos | Montevideo | France | 4.0 | 1.0 | Mexico | | 4444.0 | 3.0 | 0.0 | LOMBARDI Domingo (URU) | CRISTOPHE Henry (BEL) | REGO Gilberto (BRA) | 201.0 | 1096.0 | FRA | MEX |
| 1 | 1930.0 | 13 Jul 1930 - 15:00 | Group 4 | Parque Central | Montevideo | USA | 3.0 | 0.0 | Belgium | | 18346.0 | 2.0 | 0.0 | MACIAS Jose (ARG) | MATEUCCI Francisco (URU) | WARNKEN Alberto (CHI) | 201.0 | 1090.0 | USA | BEL |
| 2 | 1930.0 | 14 Jul 1930 - 12:45 | Group 2 | Parque Central | Montevideo | Yugoslavia | 2.0 | 1.0 | Brazil | | 24059.0 | 2.0 | 0.0 | TEJADA Anibal (URU) | VALLARINO Ricardo (URU) | BALWAY Thomas (FRA) | 201.0 | 1093.0 | YUG | BRA |


In [6]:
players.head(3)

| | RoundID | MatchID | Team Initials | Coach Name | Line-up | Shirt Number | Player Name | Position | Event |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 201 | 1096 | FRA | CAUDRON Raoul (FRA) | S | 0 | Alex THEPOT | GK | |
| 1 | 201 | 1096 | MEX | LUQUE Juan (MEX) | S | 0 | Oscar BONFIGLIO | GK | |
| 2 | 201 | 1096 | FRA | CAUDRON Raoul (FRA) | S | 0 | Marcel LANGILLER | | G40' |


In [8]:
world_cups.head(3)

| | Year | Country | Winner | Runners-Up | Third | Fourth | GoalsScored | QualifiedTeams | MatchesPlayed | Attendance |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1930 | Uruguay | Uruguay | Argentina | USA | Yugoslavia | 70 | 13 | 18 | 590.549 |
| 1 | 1934 | Italy | Italy | Czechoslovakia | Germany | Austria | 70 | 16 | 17 | 363.0 |
| 2 | 1938 | France | Italy | Hungary | Brazil | Sweden | 84 | 15 | 18 | 375.7 |


As you can see, each row of the `matches` dataset has an index and some information about a match played at a FIFA World Cup, such as teams, score, location, attendance, referees, etc.

Below, the name and data type of each column of that dataset are listed.

In [8]:
df1 = matches  # the rest of the notebook refers to the matches dataframe as df1
print(df1.dtypes)

Year                    float64
Datetime                 object
Stage                    object
Stadium                  object
City                     object
Home Team Name           object
Home Team Goals         float64
Away Team Goals         float64
Away Team Name           object
Win conditions           object
Attendance              float64
Half-time Home Goals    float64
Half-time Away Goals    float64
Referee                  object
Assistant 1              object
Assistant 2              object
RoundID                 float64
MatchID                 float64
Home Team Initials       object
Away Team Initials       object
dtype: object
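
Columns such as `Year` and `MatchID` hold whole numbers but are reported as `float64`. A common reason in pandas is that the column contains missing values, which are stored as the float `NaN`. Whether that is what happens in this CSV is an assumption; the sketch below only illustrates the mechanism on a made-up Series (`years` and `clean` are illustrative names).

```python
import pandas as pd

# Toy column, invented for illustration: whole numbers plus one missing value.
years = pd.Series([1930, 1934, None])
print(years.dtype)  # float64, because the missing value is stored as NaN

# Dropping the missing entries allows a conversion back to integers.
clean = years.dropna().astype(int)
print(clean.tolist())
```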


So, let's say we would like to know how many World Cup matches the Brazilian national team has played since the first edition, in 1930.

We must first filter the original dataframe **df1** by team, and then get the size of the filtered dataframe.

In [10]:
df_br = df1[df1['Home Team Name'] == 'Brazil']

print('The Brazilian team has played %d matches since the 1930 FIFA World Cup.' % len(df_br))

The Brazilian team has played 82 matches since the 1930 FIFA World Cup.


82 matches seems like a lot, yet this number only counts the matches where Brazil is listed as the home team.

To obtain the real number of matches, we must also count the matches where Brazil is listed as the away team. Let's do it.

In [13]:
df_br = df1[(df1['Home Team Name'] == 'Brazil') | (df1['Away Team Name'] == 'Brazil')]

print('The real number of matches played by the Brazilian team is %d.' % len(df_br))

The real number of matches played by the Brazilian team is 108.


The operator `|` is an element-wise logical **or**, meaning that we keep every row that has Brazil in the `Home Team Name` or the `Away Team Name` field.

Note: each condition is wrapped in `()`. This is required when combining conditions, because in Python `|` has higher precedence than `==`.
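
To see the filter in isolation, here is a minimal sketch on a made-up three-row dataframe (the `toy` frame and the `is_home` / `is_away` names are illustrative, not part of the dataset). Storing each condition in its own variable is an alternative to the parentheses: every comparison is evaluated first, and only then are the boolean Series combined.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    'Home Team Name': ['Brazil', 'France', 'Chile'],
    'Away Team Name': ['Mexico', 'Brazil', 'France'],
})

# Each comparison yields a boolean Series; | combines them element-wise.
is_home = toy['Home Team Name'] == 'Brazil'
is_away = toy['Away Team Name'] == 'Brazil'
played = toy[is_home | is_away]

print(len(played))  # 2
```

Writing the two comparisons on one line without parentheses would group the expression around `|` first, which fails instead of producing the intended filter.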

## Highest and Lowest Attendances

We are now interested in finding the matches with the highest and the lowest attendance. To do so, we get the maximum and minimum values in the `Attendance` column of our dataframe.

In [23]:
max_att = df1['Attendance'].max()
min_att = df1['Attendance'].min()

print('The highest attendance on a WC match was %d people, whilst the '
      'lowest attendance was only %d people' % (max_att, min_att))

The highest attendance on a WC match was 173850 people, whilst the lowest attendance was only 2000 people


However, we would also like to know in which matches those attendances occurred. First, let's find the rows of our original dataframe where they are located.

In [36]:
id_max_att = df1['Attendance'].idxmax()
id_min_att = df1['Attendance'].idxmin()

print('The highest and lowest attendances appear on rows #%d and #%d, respectively' % (id_max_att, id_min_att))

The highest and lowest attendances appear on rows #74 and #9, respectively


Knowing the row's index, we can access its data using the `loc` or `iloc` indexers. The difference between them is how rows and columns are addressed: `loc` takes labels (here, the full name of the column), while `iloc` takes integer positions (here, the position of the column of interest, counting from 0).
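
One detail worth keeping in mind: `idxmax` and `idxmin` return index labels, not positions. Passing them to `iloc` works in this notebook on the assumption that `df1` still has the default 0, 1, 2, ... index, where labels and positions coincide. The sketch below uses a made-up dataframe with a different index (`toy` and its values are illustrative) to show the two lookups side by side.

```python
import pandas as pd

# Toy data with a non-default index, invented for illustration only.
toy = pd.DataFrame(
    {'Stadium': ['A', 'B', 'C'], 'Attendance': [300.0, 900.0, 100.0]},
    index=[10, 20, 30],
)

label = toy['Attendance'].idxmax()     # index label of the maximum: 20
by_label = toy.loc[label, 'Stadium']   # label-based lookup

position = toy.index.get_loc(label)    # integer position of that label: 1
by_position = toy.iloc[position, 0]    # position-based lookup; column 0 is 'Stadium'

print(label, position, by_label, by_position)  # 20 1 B B
```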

In [48]:
date_max = df1.iloc[id_max_att, 1]
venue_max = df1.iloc[id_max_att, 3]
city_max = df1.iloc[id_max_att, 4]
home_team_max = df1.iloc[id_max_att, 5]
away_team_max = df1.iloc[id_max_att, 8]
score_home_max = df1.iloc[id_max_att, 6]
score_away_max = df1.iloc[id_max_att, 7]
short_home_max = df1.iloc[id_max_att, 18]
short_away_max = df1.iloc[id_max_att, 19]

print('The highest attendance in a FIFA World Cup match was registered in %s at %s in %s.' 
      % (date_max, venue_max, city_max))
print('The match was between %s and %s and the final score was (%s) %d x %d (%s)' 
      % (home_team_max, away_team_max, short_home_max, score_home_max, score_away_max, short_away_max))

The highest attendance in a FIFA World Cup match was registered in 16 Jul 1950 - 15:00  at Maracanã - Estádio Jornalista Mário Filho in Rio De Janeiro .
The match was between Uruguay and Brazil and the final score was (URU) 2 x 1 (BRA)


In [52]:
date_min = df1.loc[id_min_att, 'Datetime']
venue_min = df1.loc[id_min_att, 'Stadium']
city_min = df1.loc[id_min_att, 'City']
home_team_min = df1.loc[id_min_att, 'Home Team Name']
away_team_min = df1.loc[id_min_att, 'Away Team Name']
score_home_min = df1.loc[id_min_att, 'Home Team Goals']
score_away_min = df1.loc[id_min_att, 'Away Team Goals']
short_home_min = df1.loc[id_min_att, 'Home Team Initials']
short_away_min = df1.loc[id_min_att, 'Away Team Initials']

print('The lowest attendance in a FIFA World Cup match was registered in %s at %s in %s.' 
      % (date_min, venue_min, city_min))
print('The match was between %s and %s and the final score was (%s) %d x %d (%s)' 
      % (home_team_min, away_team_min, short_home_min, score_home_min, score_away_min, short_away_min))

The lowest attendance in a FIFA World Cup match was registered in 19 Jul 1930 - 12:50  at Estadio Centenario in Montevideo .
The match was between Chile and France and the final score was (CHI) 1 x 0 (FRA)
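
The two cells above repeat the same nine lookups. As a possible simplification, the sketch below fetches the whole row once with `loc` and formats it in a helper. The function name `describe_match` and the one-row `sample` dataframe are hypothetical; the sample values are copied from the output above, with whitespace trimmed.

```python
import pandas as pd

def describe_match(df, row_label):
    """Return a two-line summary of the match stored at `row_label`."""
    row = df.loc[row_label]  # one row as a Series, indexed by column name
    where = '%s at %s in %s' % (row['Datetime'], row['Stadium'], row['City'])
    score = '(%s) %d x %d (%s)' % (
        row['Home Team Initials'], row['Home Team Goals'],
        row['Away Team Goals'], row['Away Team Initials'],
    )
    return '%s\n%s vs %s: %s' % (
        where, row['Home Team Name'], row['Away Team Name'], score)

# One-row stand-in for the matches dataframe.
sample = pd.DataFrame([{
    'Datetime': '19 Jul 1930 - 12:50', 'Stadium': 'Estadio Centenario',
    'City': 'Montevideo', 'Home Team Name': 'Chile', 'Away Team Name': 'France',
    'Home Team Goals': 1.0, 'Away Team Goals': 0.0,
    'Home Team Initials': 'CHI', 'Away Team Initials': 'FRA',
}])

print(describe_match(sample, 0))
```

With the real data, the same helper could be called as `describe_match(df1, id_max_att)` and `describe_match(df1, id_min_att)`.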


<kbd>Alt<\