# CMSC 320 FINAL PROJECT: WHAT COUNTRY HAS HAD THE GREATEST NATIONAL FOOTBALL(SOCCER) TEAM IN THE LAST 22 YEARS?
Owners: Nathaniel Bekele and Mikias Atnafu

## Introduction:
First and foremost, we establish what we believe to be the most important boundary for this project: we will refer to the sport exclusively as football, not soccer.

The data we will be using was compiled by Mart Jürisoo and is available on Kaggle: https://www.kaggle.com/datasets/martj42/international-football-results-from-1872-to-2017?resource=download.

## Factors to be considered:
### Time period
This dataset has the results of international football fixtures from 1872 to 2022. While it would be amazing to consider the data from 1872 onwards, that would be rather unfair on the many countries that were colonized at the time and hence did not have a national team to compete with. We also wanted to limit the data to 2000 onwards so that we can find the greatest national team of the current millennium. Thus we will only be considering data from 2000 to 2022.

In [132]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Data Preprocessing
Before we start analysing which national football team is the greatest, we need to preprocess the data. So, let us first read the CSV file into a pandas dataframe. We use two datasets for this project. The first is the international football results dataset, which has general match results going back to the 1870s. The second is the football World Cup summary, which has World Cup data showing who reached the final and who won each tournament, starting from 1930.

### Reading in the first dataset, *The international football results*

In [133]:
football_df = pd.read_csv('results.csv')


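# Create the 'year' column from the four-digit year in the date column.
# A raw string avoids an invalid-escape warning, and expand=False returns a Series.
football_df['year'] = football_df['date'].str.extract(r'(\d{4})', expand=False)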


# Change column Year to type int
football_df['year'] = football_df['year'].astype(int)

# filter only the data with year >= 2000 
football_df = football_df[football_df['year'] >= 2000]

football_df.head() 


Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,year
22617,2000-01-04,Egypt,Togo,2,1,Friendly,Aswan,Egypt,False,2000
22618,2000-01-07,Tunisia,Togo,7,0,Friendly,Tunis,Tunisia,False,2000
22619,2000-01-08,Trinidad and Tobago,Canada,0,0,Friendly,Port of Spain,Trinidad and Tobago,False,2000
22620,2000-01-09,Burkina Faso,Gabon,1,1,Friendly,Ouagadougou,Burkina Faso,False,2000
22621,2000-01-09,Guatemala,Armenia,1,1,Friendly,Los Angeles,United States,True,2000
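
As an alternative to the regular-expression approach in the cell above, the year can also be read by parsing the date column as datetimes. The sketch below is only an illustration: it uses a small hand-built frame (three rows from the output above plus one hypothetical 1999 row) instead of results.csv, and the names `sample_df` and `recent_df` are our own.

```python
import pandas as pd

# Small stand-in for results.csv. The 1999 row is hypothetical and is included
# only so that the filter has something to remove.
sample_df = pd.DataFrame({
    'date': ['1999-12-31', '2000-01-04', '2000-01-07', '2000-01-08'],
    'home_team': ['Hypothetical A', 'Egypt', 'Tunisia', 'Trinidad and Tobago'],
    'away_team': ['Hypothetical B', 'Togo', 'Togo', 'Canada'],
})

# Parse the ISO-formatted strings and read the year directly as an integer
sample_df['year'] = pd.to_datetime(sample_df['date'], format='%Y-%m-%d').dt.year

# Keep only matches from 2000 onwards
recent_df = sample_df[sample_df['year'] >= 2000]
print(recent_df)
```

Parsing the dates this way also raises an error on malformed dates instead of silently producing missing years.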


### Reading in the second dataset, *World Cup Summary*

In [134]:
# read the FIFA world cup summary and display the first 5 rows 
world_cup_summary = pd.read_csv('FIFA-World Cup Summary.csv')

# keep only the World Cups held from 2000 onwards
world_cup_summary = world_cup_summary[world_cup_summary['YEAR'] >= 2000]
world_cup_summary.head()

Unnamed: 0,YEAR,HOST,CHAMPION,RUNNER UP,THIRD PLACE,TEAMS,MATCHES PLAYED,GOALS SCORED,AVG GOALS PER GAME
16,2002,"South Korea, Japan",Brazil,Germany,Turkey,32,64,161,2.5
17,2006,Germany,Italy,France,Germany,32,64,147,2.3
18,2010,South Africa,Spain,Netherlands,Germany,32,64,145,2.3
19,2014,Brazil,Germany,Argentina,Netherlands,32,64,171,2.7
20,2018,Russia,France,Croatia,Belgium,32,64,169,2.6
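
To show how this summary can feed into the ranking, here is a minimal sketch that counts titles and final appearances per country. It re-types the five rows displayed above so that it runs on its own; the names `summary`, `titles`, `finals` and `finals_df` are our own and are not part of the dataset.

```python
import pandas as pd

# The five rows displayed above, re-typed by hand so the sketch is self-contained
summary = pd.DataFrame({
    'YEAR': [2002, 2006, 2010, 2014, 2018],
    'CHAMPION': ['Brazil', 'Italy', 'Spain', 'Germany', 'France'],
    'RUNNER UP': ['Germany', 'France', 'Netherlands', 'Argentina', 'Croatia'],
})

# Number of titles per country
titles = summary['CHAMPION'].value_counts()

# Number of finals reached per country (as champion or as runner-up)
finals = pd.concat([summary['CHAMPION'], summary['RUNNER UP']]).value_counts()

# Align both counts on country name; countries without a title get 0
finals_df = pd.DataFrame({'Finals': finals, 'Titles': titles}).fillna(0).astype(int)
finals_df = finals_df.sort_values(['Finals', 'Titles'], ascending=False)
print(finals_df)
```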


## Tidying the data
- The tournament column contains many kinds of competitions, including qualifiers and even friendlies (177 to be specific). For this project we will focus only on the World Cup, so we will work only with the rows whose tournament is the FIFA World Cup.

- We chose to focus specifically on the World Cup because it is the biggest sports competition on the planet, with qualifiers played by national teams from every continent. Unless a country is specifically banned by FIFA, it has a chance to reach the World Cup. This makes the World Cup the best tournament to use as a metric for the best football team.

- Additional information on how qualification for the world cup works here: https://www.espn.com/soccer/fifa-world-cup/story/3860182/2022-world-cup-how-qualifying-works-around-the-world

- Also, to make the analysis convenient, we will add several columns to our dataframe, such as winning percentage, number of losses, number of matches played, and a few others. Using these columns, we can better analyse the best national team. The code below will do just that, adding the necessary columns.
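
As a small illustration of the derived columns listed above, before they are built from the real data, the sketch below computes losses and a winning percentage from hypothetical tallies. The definition used here (wins divided by matches played, times 100) and the `Draws` column are our assumptions; the team names and numbers are made up.

```python
import pandas as pd

# Hypothetical tallies, used only to illustrate the calculation
example = pd.DataFrame({
    'Countries': ['Team A', 'Team B', 'Team C'],
    'Wins': [6, 2, 0],
    'Draws': [1, 1, 0],
    'Matches Played': [8, 4, 0],
})

# Losses are whatever is left after wins and draws
example['Losses'] = example['Matches Played'] - example['Wins'] - example['Draws']

# Mask teams with zero matches so the division yields NaN rather than an error
played = example['Matches Played'].where(example['Matches Played'] > 0)
example['Winning Percentage'] = (example['Wins'] / played * 100).round(1)
print(example)
```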

In [135]:
# Filter only the FIFA World Cup tournament
FIFA_WC = football_df[football_df['tournament'] == 'FIFA World Cup'].copy()
FIFA_WC.head()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,year
24966,2002-05-31,France,Senegal,0,1,FIFA World Cup,Seoul,South Korea,True,2002
24967,2002-06-01,Germany,Saudi Arabia,8,0,FIFA World Cup,Sapporo,Japan,True,2002
24968,2002-06-01,Republic of Ireland,Cameroon,1,1,FIFA World Cup,Niigata,Japan,True,2002
24969,2002-06-01,Uruguay,Denmark,1,2,FIFA World Cup,Ulsan,South Korea,True,2002
24970,2002-06-02,Argentina,Nigeria,1,0,FIFA World Cup,Kashima,Japan,True,2002


- In the code below, we group the dataset by team and create columns for the goals scored and goals conceded by each team.

- The goals scored column will help us see the attacking strength of a team, while the goals conceded column will tell us about its defensive strength.

- The challenge we face now is that grouping by team is a bit difficult. We can't group only by home team or only by away team, because a team can appear on either side. For example, Kenya could play two matches against Ethiopia, one as the home team and another as the away team. So we will group the data both by home team and by away team.
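
One possible way around the home/away problem described above is to reshape the matches into a "long" table with one row per team per match, so that a single groupby on a team column covers both sides. This is a sketch of that alternative, not the approach used in the cells that follow. It uses three World Cup rows copied from the output above, and the column names `team`, `opponent`, `goals_for` and `goals_against` are our own.

```python
import pandas as pd

# Three World Cup rows copied from the output above
matches = pd.DataFrame({
    'home_team': ['France', 'Germany', 'Republic of Ireland'],
    'away_team': ['Senegal', 'Saudi Arabia', 'Cameroon'],
    'home_score': [0, 8, 1],
    'away_score': [1, 0, 1],
})

# View each match from the home side...
home_view = matches.rename(columns={
    'home_team': 'team', 'away_team': 'opponent',
    'home_score': 'goals_for', 'away_score': 'goals_against'})
# ...and from the away side
away_view = matches.rename(columns={
    'away_team': 'team', 'home_team': 'opponent',
    'away_score': 'goals_for', 'home_score': 'goals_against'})

# Stack both views; concat aligns the columns by name
long_df = pd.concat([home_view, away_view], ignore_index=True)

# A single groupby now covers home and away appearances together
goals = long_df.groupby('team')[['goals_for', 'goals_against']].sum()
print(goals)
```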

In [136]:
# group the data by home_team
team_group_home = FIFA_WC.groupby(by='home_team')
# group the data by away_team
team_group_away = FIFA_WC.groupby(by='away_team')


Now we will combine data from both groups to create a dataframe in which we aggregate most of the data for each team. First, let's create the necessary columns.

In [137]:
# create a list of the unique teams that appear as home_team or away_team
teams = list(set(team_group_home.groups.keys()) | set(team_group_away.groups.keys()))
WorldCupTeamsdf = pd.DataFrame(teams, columns=['Countries'])
WorldCupTeamsdf['Wins'] = 0
WorldCupTeamsdf['Matches Played'] = 0
# create column goals_scored and goals_conceded
WorldCupTeamsdf['Goals Scored'] = 0
WorldCupTeamsdf['Goals Conceded'] = 0
# # create a list with name of the home_teams
# home_teams = team_group_home.groups.keys()
# # create a list with name of the away_teams
# away_teams = team_group_away.groups.keys()
# # create a pandas dataframe with the home_teams as rows
# home_teams_df = pd.DataFrame(home_teams, columns=['Countries'])
# # empty column wins and matches played
# home_teams_df['Wins'] = 0
# home_teams_df['Matches Played'] = 0
# # create a pandas dataframe with the away_teams as rows
# away_teams_df = pd.DataFrame(away_teams, columns=['Countries'])
# # empty column wins and matches played
# away_teams_df['Wins'] = 0
# away_teams_df['Matches Played'] = 0
# # create column goals_scored and goals_conceded
# home_teams_df['Goals Scored'] = 0
# home_teams_df['Goals Conceded'] = 0
# away_teams_df['Goals Scored'] = 0
# away_teams_df['Goals Conceded'] = 0

In [141]:

# Loop over the list of teams. For each team, pull its rows from the FIFA_WC dataframe
# and add its goals scored and goals conceded to the corresponding columns of WorldCupTeamsdf.
# A win means scoring more goals than the opponent; the 'Wins' column is not filled in
# by this cell and keeps its initial value of 0.
for team in teams:
    # get the team's home matches from the FIFA_WC dataframe
    team_data = FIFA_WC[FIFA_WC['home_team'] == team]
    # get the goals scored
    goals_scored = team_data['home_score'].sum()
    # get the goals conceded
    goals_conceded = team_data['away_score'].sum()
    # add the goals scored to the corresponding column in the dataframe
    WorldCupTeamsdf.loc[WorldCupTeamsdf['Countries'] == team, 'Goals Scored'] = goals_scored
    # add the goals conceded to the corresponding column in the dataframe
    WorldCupTeamsdf.loc[WorldCupTeamsdf['Countries'] == team, 'Goals Conceded'] = goals_conceded
    # earlier attempt at counting wins, kept commented out; 'Wins' stays at 0
    # WorldCupTeamsdf.loc[WorldCupTeamsdf['Countries'] == team, 'Wins'] = WorldCupTeamsdf.loc[WorldCupTeamsdf['Countries'] == team, 'Wins'] + team_data['home_score'].apply(lambda x: 1 if x > team_data['away_score'].apply(lambda x: x) else 0)
    # add 1 to 'Matches Played' for this team (once per run of this cell, not once per match)
    WorldCupTeamsdf.loc[WorldCupTeamsdf['Countries'] == team, 'Matches Played'] = WorldCupTeamsdf.loc[WorldCupTeamsdf['Countries'] == team, 'Matches Played'] + 1
    # get the team's away matches from the FIFA_WC dataframe
    team_data = FIFA_WC[FIFA_WC['away_team'] == team]
    # get the goals scored
    goals_scored = team_data['away_score'].sum()
    # get the goals conceded
    goals_conceded = team_data['home_score'].sum()
    # add the goals scored to the corresponding column in the dataframe
    WorldCupTeamsdf.loc[WorldCupTeamsdf['Countries'] == team, 'Goals Scored'] = WorldCupTeamsdf.loc[WorldCupTeamsdf['Countries'] == team, 'Goals Scored'] + goals_scored
    # add the goals conceded to the corresponding column in the dataframe
    WorldCupTeamsdf.loc[WorldCupTeamsdf['Countries'] == team, 'Goals Conceded'] = WorldCupTeamsdf.loc[WorldCupTeamsdf['Countries'] == team, 'Goals Conceded'] + goals_conceded

# Sort dataframe by country column
WorldCupTeamsdf = WorldCupTeamsdf.sort_values(by='Countries')
WorldCupTeamsdf.head()

Unnamed: 0,Countries,Wins,Matches Played,Goals Scored,Goals Conceded
41,Algeria,0,4,7,9
44,Angola,0,4,1,2
35,Argentina,0,4,37,24
20,Australia,0,4,13,26
52,Belgium,0,4,28,16
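
In the table above, the Wins column is still 0 and Matches Played does not count individual matches. One way these two columns could be filled in is sketched below, using the five World Cup rows displayed earlier instead of the full FIFA_WC dataframe. The names `wc`, `played`, `wins` and `tally` are our own. A match that ends level on goals counts as a draw here, so neither side is credited with a win.

```python
import pandas as pd

# The five World Cup rows displayed earlier, re-typed so the sketch is self-contained
wc = pd.DataFrame({
    'home_team': ['France', 'Germany', 'Republic of Ireland', 'Uruguay', 'Argentina'],
    'away_team': ['Senegal', 'Saudi Arabia', 'Cameroon', 'Denmark', 'Nigeria'],
    'home_score': [0, 8, 1, 1, 1],
    'away_score': [1, 0, 1, 2, 0],
})

# Matches played = appearances as home team + appearances as away team
played = wc['home_team'].value_counts().add(wc['away_team'].value_counts(), fill_value=0)

# Name of the winning team for each match that was not level on goals
home_wins = wc.loc[wc['home_score'] > wc['away_score'], 'home_team']
away_wins = wc.loc[wc['away_score'] > wc['home_score'], 'away_team']
wins = pd.concat([home_wins, away_wins]).value_counts()

# Align both counts on team name; teams without a win get 0
tally = pd.DataFrame({'Matches Played': played, 'Wins': wins}).fillna(0).astype(int)
tally = tally.sort_index()
print(tally)
```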
