# CMSC 320 FINAL PROJECT: WHAT COUNTRY HAS HAD THE GREATEST NATIONAL FOOTBALL (SOCCER) TEAM IN THE LAST 22 YEARS?
Owners: Nathaniel Bekele and Mikias Atnafu

## Introduction:
First and foremost, we want to establish what we believe is the most important ground rule for this project: we will refer to the sport exclusively as football, not soccer.

The data we will be using was compiled by Mart Jürisoo and is available on Kaggle: https://www.kaggle.com/datasets/martj42/international-football-results-from-1872-to-2017?resource=download.

## Factors to be considered:
### Time period
This dataset contains the results of international football fixtures from 1872 to 2022. While it would be amazing to consider the data all the way back to 1872, doing so would be rather unfair to the many countries that were colonized at the time and therefore had no national team to compete with. We also wanted to limit the data to 2000 onwards so that we can find the greatest national team of the current millennium. Thus, we will only be considering data from 2000 to 2022.

In [19]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Data Preprocessing
Before we start analysing which national football team is the greatest, we need to preprocess the data. So, let us first read the CSV file into a pandas DataFrame. We use two datasets for this project. The first is the international football results dataset, which has general match result data going back to the 1870s. The second is the FIFA World Cup summary, which has World Cup data showing who reached the final and who won each World Cup since 1930.

### Reading in the first dataset, *The international football results*

In [20]:
football_df = pd.read_csv('results.csv')


# Create a 'year' column from the four-digit year in the date column
# (raw string so that \d is not treated as a string escape sequence)
football_df['year'] = football_df['date'].str.extract(r'(\d{4})', expand=False)


# Convert the 'year' column to type int
football_df['year'] = football_df['year'].astype(int)

# Keep only the rows with year >= 2000
football_df = football_df[football_df['year'] >= 2000]

football_df.head() 


Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,year
22617,2000-01-04,Egypt,Togo,2,1,Friendly,Aswan,Egypt,False,2000
22618,2000-01-07,Tunisia,Togo,7,0,Friendly,Tunis,Tunisia,False,2000
22619,2000-01-08,Trinidad and Tobago,Canada,0,0,Friendly,Port of Spain,Trinidad and Tobago,False,2000
22620,2000-01-09,Burkina Faso,Gabon,1,1,Friendly,Ouagadougou,Burkina Faso,False,2000
22621,2000-01-09,Guatemala,Armenia,1,1,Friendly,Los Angeles,United States,True,2000
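
As an alternative to extracting the year with a regular expression, the date column can be parsed as a datetime when the file is read. The snippet below is a minimal sketch, assuming the dates are in ISO `YYYY-MM-DD` form as in the output above. The names `csv_text`, `sample_df` and `recent_df` are hypothetical, and the three rows are copied from that output rather than read from `results.csv`.

```python
import io

import pandas as pd

# A few rows copied from the output above, used in place of results.csv
csv_text = """date,home_team,away_team,home_score,away_score
2000-01-04,Egypt,Togo,2,1
2000-01-07,Tunisia,Togo,7,0
2000-01-08,Trinidad and Tobago,Canada,0,0
"""

# parse_dates turns the 'date' column into datetime64 values
sample_df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# The .dt accessor exposes the year as an integer column
sample_df["year"] = sample_df["date"].dt.year

# Same filter as in the cell above
recent_df = sample_df[sample_df["year"] >= 2000]
print(recent_df[["date", "home_team", "year"]])
```

Parsing the dates once also makes later steps such as sorting by date or grouping by month possible without further string handling.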


### Reading in the second dataset, *World Cup Summary*

In [21]:
# read the FIFA world cup summary and display the first 5 rows 
world_cup_summary = pd.read_csv('FIFA-World Cup Summary.csv')

# Keep only the World Cups held from the year 2000 onwards
world_cup_summary = world_cup_summary[world_cup_summary['YEAR'] >= 2000]
world_cup_summary.head()

Unnamed: 0,YEAR,HOST,CHAMPION,RUNNER UP,THIRD PLACE,TEAMS,MATCHES PLAYED,GOALS SCORED,AVG GOALS PER GAME
10,1978,Argentina,Argentina,Netherlands,Brazil,16,38,102,2.7
11,1982,Spain,Italy,West Germany,Poland,24,52,146,2.8
12,1986,Mexico,Argentina,West Germany,France,24,52,132,2.5
13,1990,Italy,West Germany,Argentina,Italy,24,52,115,2.2
14,1994,United States,Brazil,Italy,Sweden,24,52,141,2.7
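
Since the summary table records who won and who reached the final, one way it could be used is to tally titles and final appearances per country. The snippet below is only a sketch: the `summary` frame is built by hand from the five rows shown in the output above, and the names `titles`, `runner_ups` and `tally` are hypothetical.

```python
import pandas as pd

# Rows copied from the output above (only the columns needed here)
summary = pd.DataFrame({
    "YEAR": [1978, 1982, 1986, 1990, 1994],
    "CHAMPION": ["Argentina", "Italy", "Argentina", "West Germany", "Brazil"],
    "RUNNER UP": ["Netherlands", "West Germany", "West Germany", "Argentina", "Italy"],
})

# Count how often each country appears as champion and as runner-up
titles = summary["CHAMPION"].value_counts()
runner_ups = summary["RUNNER UP"].value_counts()

# Align both counts on the country name; countries missing from one side get 0
tally = pd.DataFrame({"titles": titles, "runner_ups": runner_ups}).fillna(0).astype(int)
tally["finals"] = tally["titles"] + tally["runner_ups"]

# Rank by titles first, then by final appearances
tally = tally.sort_values(["titles", "finals"], ascending=False)
print(tally)
```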


## Tidying the data
- The tournament column contains many types of competitions, including qualifiers and even friendlies (177 distinct values, to be specific). For this project we focus only on the World Cup, so we will work only with the rows from the FIFA World Cup tournament.

- We chose to focus specifically on the World Cup because it is the biggest sports competition on the planet, with qualifiers played by national teams from every continent. Unless a country is specifically banned by FIFA, every country has a chance to qualify. This makes the World Cup the best tournament to use as a metric for the best football team.

- Additional information on how qualification for the World Cup works is available here: https://www.espn.com/soccer/fifa-world-cup/story/3860182/2022-world-cup-how-qualifying-works-around-the-world

- Also, to make the analysis convenient, we will add several columns to our dataframe, such as winning percentage, number of losses, number of matches played and a few others. Using these columns, we can better analyse which national team is the best. The code below will do just that, adding the necessary columns.

In [22]:
# Keep only the FIFA World Cup tournament
FIFA_WC = football_df[football_df['tournament'] == 'FIFA World Cup'].copy()
FIFA_WC.head()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,year
24966,2002-05-31,France,Senegal,0,1,FIFA World Cup,Seoul,South Korea,True,2002
24967,2002-06-01,Germany,Saudi Arabia,8,0,FIFA World Cup,Sapporo,Japan,True,2002
24968,2002-06-01,Republic of Ireland,Cameroon,1,1,FIFA World Cup,Niigata,Japan,True,2002
24969,2002-06-01,Uruguay,Denmark,1,2,FIFA World Cup,Ulsan,South Korea,True,2002
24970,2002-06-02,Argentina,Nigeria,1,0,FIFA World Cup,Kashima,Japan,True,2002
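
The summary columns listed above (matches played, wins, losses, winning percentage) can be sketched by reshaping the matches so that each team gets one row per match, whether it was listed as the home or the away side. This is a minimal illustration under stated assumptions, not necessarily the approach taken in the rest of this notebook: the `matches` frame holds the five rows from the output above, and `long_df` and `record` are hypothetical names.

```python
import pandas as pd

# The five World Cup matches shown in the output above
matches = pd.DataFrame({
    "home_team": ["France", "Germany", "Republic of Ireland", "Uruguay", "Argentina"],
    "away_team": ["Senegal", "Saudi Arabia", "Cameroon", "Denmark", "Nigeria"],
    "home_score": [0, 8, 1, 1, 1],
    "away_score": [1, 0, 1, 2, 0],
})

# View every match once from the home side and once from the away side
home = matches.rename(columns={
    "home_team": "team", "away_team": "opponent",
    "home_score": "goals_for", "away_score": "goals_against",
})
away = matches.rename(columns={
    "away_team": "team", "home_team": "opponent",
    "away_score": "goals_for", "home_score": "goals_against",
})
long_df = pd.concat([home, away], ignore_index=True)

# Boolean flags for the result from each team's point of view
long_df["win"] = long_df["goals_for"] > long_df["goals_against"]
long_df["loss"] = long_df["goals_for"] < long_df["goals_against"]

# One row per team with its overall record
record = long_df.groupby("team").agg(
    matches_played=("win", "size"),
    wins=("win", "sum"),
    losses=("loss", "sum"),
)
record["win_percentage"] = record["wins"] / record["matches_played"]
print(record)
```

Because most World Cup matches are played at neutral venues, counting both sides of each fixture avoids tying a team's record to whether it happened to be listed first.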


In the code below, we group the dataset by team so that we can build per-team columns.
Goals scored will help us see which teams are good at attacking, whereas goals conceded tells us about a team's defensive strength.

In [23]:
# group the data by home_team
team_group = FIFA_WC.groupby(by='home_team')

In [24]:

# Flag the matches won by the team listed as home_team
home_win = FIFA_WC['home_score'] > FIFA_WC['away_score']

# Winning percentage of each team, counting only the matches in which it is
# listed as home_team; transform() repeats the team's value on each of its rows
FIFA_WC['win_percentage'] = home_win.groupby(FIFA_WC['home_team']).transform('mean')
FIFA_WC.head()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,year,win_percentage
24966,2002-05-31,France,Senegal,0,1,FIFA World Cup,Seoul,South Korea,True,2002,0.538462
24967,2002-06-01,Germany,Saudi Arabia,8,0,FIFA World Cup,Sapporo,Japan,True,2002,0.636364
24968,2002-06-01,Republic of Ireland,Cameroon,1,1,FIFA World Cup,Niigata,Japan,True,2002,0.0
24969,2002-06-01,Uruguay,Denmark,1,2,FIFA World Cup,Ulsan,South Korea,True,2002,0.363636
24970,2002-06-02,Argentina,Nigeria,1,0,FIFA World Cup,Kashima,Japan,True,2002,0.733333
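
The goals scored and goals conceded columns described earlier could be computed along the following lines. This is a hedged sketch on the five matches shown above, with hypothetical names (`matches`, `scored`, `conceded`, `goals`). It sums each team's goals from both its home and its away fixtures.

```python
import pandas as pd

# The five World Cup matches shown in the output above
matches = pd.DataFrame({
    "home_team": ["France", "Germany", "Republic of Ireland", "Uruguay", "Argentina"],
    "away_team": ["Senegal", "Saudi Arabia", "Cameroon", "Denmark", "Nigeria"],
    "home_score": [0, 8, 1, 1, 1],
    "away_score": [1, 0, 1, 2, 0],
})

# Goals scored: own goals as home side plus own goals as away side.
# fill_value=0 covers teams that only appear on one side of the fixtures.
scored = matches.groupby("home_team")["home_score"].sum().add(
    matches.groupby("away_team")["away_score"].sum(), fill_value=0
)

# Goals conceded: the opponent's goals, again from both sides
conceded = matches.groupby("home_team")["away_score"].sum().add(
    matches.groupby("away_team")["home_score"].sum(), fill_value=0
)

goals = pd.DataFrame({"goals_scored": scored, "goals_conceded": conceded}).astype(int)
goals["goal_difference"] = goals["goals_scored"] - goals["goals_conceded"]
print(goals.sort_values("goal_difference", ascending=False))
```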
