# Team 6 - World Cup

![](https://img.fifa.com/image/upload/t_l4/v1543921822/ex1ksdevyxwsgu7rzdv6.jpg)

_For more information about the dataset, read [here](https://www.kaggle.com/abecklas/fifa-world-cup)._

## Your tasks
- Name your team!
- Read the source and do some quick research to understand more about the dataset and its topic
- Clean the data
- Perform Exploratory Data Analysis on the dataset
- Analyze the data more deeply and extract insights
- Visualize your analysis on Google Data Studio
- Present your work to the class and guests next Monday

## Submission Guide
- Create a GitHub repository for your project
- Upload the dataset (.csv file) and the Jupyter Notebook to your GitHub repository. In the Jupyter Notebook, **include the link to your Google Data Studio report**.
- Submit your work through this [Google Form](https://forms.gle/oxtXpGfS8JapVj3V8).

## Tips for Data Cleaning, Manipulation & Visualization
- See [our tips for Data Cleaning, Manipulation & Visualization](https://hackmd.io/cBNV7E6TT2WMliQC-GTw1A).

_____________________________

## Some Hints for This Dataset:
- Is there a way to integrate the data from all 3 datasets?
- The `winners` dataset doesn't seem to include the 2018 World Cup. Can you Google the relevant information and add it to the dataset using `pandas`?
- The format of some numeric columns in the `matches` dataset doesn't look right.
- Can you separate the date and the time in the `Datetime` column of the `matches` dataset?
- And more...
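
The second hint, adding the missing 2018 tournament with `pandas`, might look like the sketch below. The toy `winners` frame and its `Year` column are assumptions for illustration; the real file has more columns (attendance among them) that still need to be looked up and filled in.

```python
import pandas as pd

# Toy stand-in for the winners dataset (the column subset is an assumption)
winners = pd.DataFrame({
    "Year": [2010, 2014],
    "Country": ["South Africa", "Brazil"],
    "Winner": ["Spain", "Germany"],
    "Runners-Up": ["Netherlands", "Argentina"],
    "Third": ["Germany", "Netherlands"],
    "Fourth": ["Uruguay", "Brazil"],
})

# 2018 tournament: hosted by Russia, France beat Croatia in the final
row_2018 = pd.DataFrame([{
    "Year": 2018,
    "Country": "Russia",
    "Winner": "France",
    "Runners-Up": "Croatia",
    "Third": "Belgium",
    "Fourth": "England",
}])

# Append the new row and rebuild a clean 0..n-1 index
winners = pd.concat([winners, row_2018], ignore_index=True)
print(winners.tail(1))
```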

### Import libraries

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

sns.set_style("whitegrid")

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

### Create Data Frames

In [None]:
# Colab path (uncomment when running on Google Colab)
# DATA_DIR = '/content/gdrive/My Drive/PROJECTS/CoderSchool_Fansipan/github_repo/world-cup-da/data'
# Local path
DATA_DIR = '/Users/jodythai/Google Drive/PROJECTS/CoderSchool_Fansipan/github_repo/world-cup-da/data'

df_matches_raw = pd.read_csv(f'{DATA_DIR}/matches.csv', encoding='utf-8')
df_players_raw = pd.read_csv(f'{DATA_DIR}/players.csv', encoding='utf-8')
df_winners_raw = pd.read_csv(f'{DATA_DIR}/winners.csv', encoding='utf-8')

# Clean Data

In [None]:
df_matches_raw.tail()

In [None]:
df_players_raw.head(2)

In [None]:
df_winners_raw.head(5)

#### Remove NaN

In [None]:
# Remove null rows of Matches dataset
# Keep only the rows where both RoundID and MatchID are present
df_matches = df_matches_raw[df_matches_raw["RoundID"].notnull() & df_matches_raw["MatchID"].notnull()].copy()

# Find NaN values from the data points
df_matches.isnull().sum()

In [None]:
# Get the rows where Attendance is NaN
df_matches[df_matches["Attendance"].isnull()]

In [None]:
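# Clean NaN value
# Assign the result back instead of using inplace on a column selection,
# which may not update df_matches itself
df_matches["Attendance"] = df_matches["Attendance"].fillna(0)
df_matches = df_matches.fillna('')
df_players = df_players_raw.fillna('')
df_winners = df_winners_raw.fillna('')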

In [None]:
df_matches.sample(5)

#### Convert Data Types

In [None]:
df_matches.info()

In [None]:
# Convert data type of Matches dataset
category_columns = [
  "Year", "Datetime", "Stage", "Stadium", "City",
  "Home Team Name", "Away Team Name", "Win conditions",
  "Referee", "Assistant 1", "Assistant 2",
  "Home Team Initials", "Away Team Initials",
]
int_columns = [
  "Home Team Goals", "Away Team Goals", "Attendance",
  "Half-time Home Goals", "Half-time Away Goals",
  "RoundID", "MatchID",
]
df_matches = df_matches.astype({col: "category" for col in category_columns})
df_matches = df_matches.astype({col: "int" for col in int_columns})
df_matches.info()

In [None]:
df_players.head(2)

In [None]:
# Convert data type of Players dataset
df_players["Team Initials"] = df_players["Team Initials"].astype("category")
df_players["Coach Name"] = df_players["Coach Name"].astype("category")
df_players["Line-up"] = df_players["Line-up"].astype("category")
df_players["Player Name"] = df_players["Player Name"].astype("category")
df_players["Position"] = df_players["Position"].astype("category")
df_players["Event"] = df_players["Event"].astype("category")

df_players.info()

In [None]:
# Process Attendance values
df_winners = df_winners_raw.fillna('')
# regex=False makes '.' a literal dot (thousands separator), not a regex wildcard
df_winners["Attendance"] = df_winners["Attendance"].str.replace('.', '', regex=False)
df_winners.head()

In [None]:
# Convert data type of Winners dataset
df_winners["Country"] = df_winners["Country"].astype("category")
df_winners["Winner"] = df_winners["Winner"].astype("category")
df_winners["Runners-Up"] = df_winners["Runners-Up"].astype("category")
df_winners["Third"] = df_winners["Third"].astype("category")
df_winners["Fourth"] = df_winners["Fourth"].astype("category")
df_winners["Attendance"] = df_winners["Attendance"].astype("int")

df_winners.info()

#### Separate Datetime column

In [None]:
# Separate DateTime column
df_matches.insert(loc = 2, column="Time", value="")
df_matches.head(2)

In [None]:
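# Split on the first '-' into date and time; assign whole columns because
# .loc may refuse to write new values into a categorical column
datetime_parts = df_matches["Datetime"].astype(str).str.split('-', n=1, expand=True)
df_matches["Time"] = datetime_parts[1].str.strip()
df_matches["Datetime"] = datetime_parts[0].str.strip()
df_matches.head(2)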

In [None]:
# Rename Datetime column
df_matches.rename(columns={'Datetime': 'Date'}, inplace=True)
df_matches.sample(5)

In [None]:
# Convert data types for Date & Time columns
df_matches["Date"] = df_matches["Date"].astype("category")
df_matches["Time"] = df_matches["Time"].astype("category")
df_matches.info()
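
As an alternative to keeping `Date` and `Time` as categories, the raw string can be parsed into a real timestamp, which makes sorting and filtering by date possible. The sketch below assumes the raw values follow a `DD Mon YYYY - HH:MM` layout; the sample strings are made up, so check the actual format before reusing it.

```python
import pandas as pd

# Made-up samples in the assumed "DD Mon YYYY - HH:MM" layout
raw = pd.Series(["13 Jul 1930 - 15:00", "08 Jul 2014 - 17:00 "])

# Strip stray whitespace, then parse into datetime64 values.
# %b expects English month abbreviations under the default locale.
parsed = pd.to_datetime(raw.str.strip(), format="%d %b %Y - %H:%M")

dates = parsed.dt.date   # datetime.date objects
times = parsed.dt.time   # datetime.time objects
print(dates.tolist(), times.tolist())
```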

#### Check data duplication

In [None]:
# Check duplication
df_matches["MatchID"].nunique() == df_matches["MatchID"].count()

In [None]:
# Find duplicated rows from the dataset
def get_duplicated_data(df, key):
  """
    Return a DataFrame of duplicated data points of a given dataset
  """
  
  df_key = df[key]
  return df[df_key.isin(df_key[df_key.duplicated()])].sort_values(key)
  
get_duplicated_data(df_matches, "MatchID")

In [None]:
df_matches.drop_duplicates(keep = 'first', inplace = True)

In [None]:
df_players.drop_duplicates(keep = 'first', inplace = True)
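
The duplicate check above can be illustrated on a toy frame. `duplicated(keep=False)` flags every copy of a repeated key, which is a compact alternative to the `isin` approach in `get_duplicated_data`. The data here is invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "MatchID": [1, 2, 2, 3],
    "Stage": ["Final", "Group 1", "Group 1", "Group 2"],
})

# keep=False marks all occurrences of a duplicated MatchID, not just the later ones
dupes = toy[toy["MatchID"].duplicated(keep=False)].sort_values("MatchID")

# drop_duplicates compares whole rows and keeps the first copy
deduped = toy.drop_duplicates(keep="first")
print(dupes)
print(len(deduped))
```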

# EDA & Feature Engineering

In [None]:
# Merge Players and Matches datasets

df_matches_players = pd.merge( df_matches, df_players, how='outer', on="MatchID")

df_matches_players[df_matches_players['Player Name'] == 'Alex THEPOT']
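
Because the merge uses `how='outer'`, rows that exist in only one dataset are kept, with empty values on the other side. A minimal sketch with invented frames shows how the `indicator` option of `pd.merge` can surface those unmatched rows:

```python
import pandas as pd

matches = pd.DataFrame({"MatchID": [1, 2], "Stage": ["Final", "Group 1"]})
players = pd.DataFrame({"MatchID": [1, 1, 3], "Player Name": ["A", "B", "C"]})

merged = pd.merge(matches, players, how="outer", on="MatchID", indicator=True)

# "_merge" is "both", "left_only", or "right_only" for each row
unmatched = merged[merged["_merge"] != "both"]
print(unmatched[["MatchID", "_merge"]])
```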

In [None]:
# Input a player name and return the country name
def get_country_name_from_player(df, player_name):
  # Use the first merged row found for this player
  row = df[df["Player Name"] == player_name].iloc[0]

  # Compare initials directly; a substring test on the team name fails
  # whenever the initials are not the first letters of the name
  if row["Team Initials"] == row["Home Team Initials"]:
    return row["Home Team Name"]
  else:
    return row["Away Team Name"]
  
get_country_name_from_player(df_matches_players, "Juan CARRENO")

#### Split the Event data in order to extract more valuable insights

In [None]:
# Raw string so that \s reaches the regex engine as a whitespace class
df_players = df_players.assign(Event=df_players["Event"].str.split(r'\s')).explode('Event').reset_index(drop=True)

df_players[df_players["MatchID"] == 3079]
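
Once each event sits on its own row, the event code and the minute can be pulled apart. The sketch below assumes events look like a letter code followed by a minute and an apostrophe (for example `G43'`); the sample values and the output column names `EventType` and `Minute` are illustrative only.

```python
import pandas as pd

events = pd.Series(["G43'", "Y12'", "OG90'", ""], name="Event")

# Named groups become column names; rows that do not match get NaN
parsed = events.str.extract(r"^(?P<EventType>[A-Z]+)(?P<Minute>\d+)'$")
parsed["Minute"] = pd.to_numeric(parsed["Minute"])
print(parsed)
```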

#### Other Features

In [None]:
# Add column to indicate if a player played in a winning team
df_players['Cup Won'] = ''

def get_total_cup_by_country(df, country):
  # Use the DataFrame passed in rather than the global df_winners
  return df[df["Winner"] == country]["Winner"].count()
  
# get_total_cup_by_country(df_winners, "Brazil")

# TODO: fill 'Cup Won' from the player's country; the lookup below is not wired up yet
# df_players['Cup Won'] = df_players['Player Name'].apply(lambda name: get_country_name_from_player(df_matches_players, name))
df_players.head()
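
One possible way to fill `Cup Won` without a per-player lookup is to count titles per country once and map the result onto each player's team. This is only a sketch on invented frames: it assumes team names are spelled identically in both datasets, and the `Team Name` column is hypothetical (the players data here only carries `Team Initials`).

```python
import pandas as pd

winners = pd.DataFrame({"Winner": ["Brazil", "Italy", "Brazil"]})
players = pd.DataFrame({
    "Player Name": ["P1", "P2", "P3"],
    "Team Name": ["Brazil", "Italy", "Chile"],  # hypothetical column
})

# Titles per country, computed once
cups_by_country = winners["Winner"].value_counts()

# Teams that never won are missing from the counts, so fill them with 0
players["Cup Won"] = players["Team Name"].map(cups_by_country).fillna(0).astype(int)
print(players)
```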

In [None]:
df_matches.head()
df_winners

# TODO

### Get data by Events Type

### Make Corr() Diagrams
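
A starting point for this step could be `DataFrame.corr()` on the numeric match columns. The numbers below are invented; the column names follow the Matches dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "Home Team Goals": [4, 3, 2, 1],
    "Away Team Goals": [1, 0, 1, 3],
    "Attendance": [4444, 18346, 24059, 2549],
})

# Pairwise Pearson correlation between the numeric columns
corr = toy.corr()
print(corr.round(2))
```

The resulting matrix can then be passed to `sns.heatmap(corr, annot=True)` to draw the diagram.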