# JM0250 Data Visualization 
### Academic year 2022-2023

## FIFA World Cup 2022 Data Exploration
Data sources:

- FIFA World Cup 2022 Player Data (https://www.kaggle.com/datasets/swaptr/fifa-world-cup-2022-player-data)
- FIFA World Cup 2022 Match Data (https://www.kaggle.com/datasets/swaptr/fifa-world-cup-2022-match-data)
- FIFA World Cup 2022 Team Data (https://www.kaggle.com/datasets/swaptr/fifa-world-cup-2022-statistics)
- FIFA World Cup 2022 Twitter Dataset (https://www.kaggle.com/datasets/kumari2000/fifa-world-cup-twitter-dataset-2022)
- FIFA World Cup 2022 Prediction (https://www.kaggle.com/datasets/shilongzhuang/soccer-world-cup-challenge)
- FIFA World Cup 2022 Player Images (https://www.kaggle.com/datasets/soumendraprasad/fifa-2022-all-players-image-dataset)
- FIFA World Cup Historic (https://www.kaggle.com/datasets/piterfm/fifa-football-world-cup)
- FIFA World Cup Penalty Shootouts (https://www.kaggle.com/datasets/pablollanderos33/world-cup-penalty-shootouts, https://www.kaggle.com/datasets/jandimovski/world-cup-penalty-shootouts-2022)

Data dictionaries and additional info can be found in the respective data folders.

In [2]:
%pip install matplotlib

Collecting matplotlib
  Downloading matplotlib-3.8.2-cp39-cp39-macosx_11_0_arm64.whl.metadata (5.8 kB)
Collecting contourpy>=1.0.1 (from matplotlib)
  Downloading contourpy-1.2.0-cp39-cp39-macosx_11_0_arm64.whl.metadata (5.8 kB)
Collecting cycler>=0.10 (from matplotlib)
  Using cached cycler-0.12.1-py3-none-any.whl.metadata (3.8 kB)
Collecting fonttools>=4.22.0 (from matplotlib)
  Downloading fonttools-4.47.2-cp39-cp39-macosx_10_9_universal2.whl.metadata (157 kB)
Collecting kiwisolver>=1.3.1 (from matplotlib)
  Downloading kiwisolver-1.4.5-cp39-cp39-macosx_11_0_arm64.whl.metadata (6.4 kB)
Collecting pillow>=8 (from matplotlib)
  Downloading pillow-10.2.0-cp39-cp39-macosx_11_0_arm64.whl.metadata (9.7 kB)
Collecting pyparsing>=2.3.1 (from matplotlib)
  Using cached pyparsing-3.1.1-py3-none-any.whl.metadata (5.1 kB)
Collecting importlib-resources>=3.2.0 (from matp

In [3]:
# Import libraries
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
import os

# Do not truncate tables
pd.set_option('display.max_columns', None)

In [4]:
# Load the data

# Match data
match = pd.read_csv('../Data/FIFA World Cup 2022 Match Data/data.csv', delimiter=',')

# Prediction data
groups  = pd.read_csv('../Data/FIFA World Cup 2022 Prediction/2022_world_cup_groups.csv', delimiter=',')
matches = pd.read_csv('../Data/FIFA World Cup 2022 Prediction/2022_world_cup_matches.csv', delimiter=',')
international_matches = pd.read_csv('../Data/FIFA World Cup 2022 Prediction/international_matches.csv', delimiter=',')
world_cup_matches = pd.read_csv('../Data/FIFA World Cup 2022 Prediction/world_cup_matches.csv', delimiter=',')
world_cups = pd.read_csv('../Data/FIFA World Cup 2022 Prediction/world_cups.csv', delimiter=',')

def list_full_paths(directory):
    """Return the full path of every entry (file or folder) in `directory`."""
    return [os.path.join(directory, file) for file in os.listdir(directory)]
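
The helper above is not called in the cells shown here. As a hedged sketch of how it might be used, for instance to collect the files in an image folder, the variant below sorts the entries and keeps only files with a given extension. The name `list_files_with_extension` and the default extension tuple are assumptions for illustration and are not part of the notebook.

```python
import os

def list_files_with_extension(directory, extensions=(".png", ".jpg", ".jpeg")):
    """Hypothetical variant of list_full_paths: sorted full paths of the files
    in `directory` whose names end with one of `extensions` (case-insensitive)."""
    paths = []
    for name in sorted(os.listdir(directory)):
        full_path = os.path.join(directory, name)
        # Skip sub-directories and files with other extensions
        if os.path.isfile(full_path) and name.lower().endswith(tuple(extensions)):
            paths.append(full_path)
    return paths
```

Sorting the names makes the result reproducible, since `os.listdir` returns entries in arbitrary order.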

    

In [5]:
datasets = [match, groups, matches, international_matches, world_cup_matches, world_cups]

In [6]:
for i, dataset in enumerate(datasets, start=1):
    print(f"Dataset {i} columns: {', '.join(dataset.columns)}")

Dataset 1 columns: match, dayofweek, match_time, home_team, away_team, home_xg, away_xg, score, attendance, venue, referee, home_formation, away_formation, home_captain, away_captain, home_manager, away_manager, home_possession, away_possession, home_completed_passes, home_attempted_pases, away_completed_passes, away_attempted_pases, home_sot, away_sot, home_total_shots, away_total_shots, home_saves, away_saves, home_fouls, away_fouls, home_corners, away_corners, home_crosses, away_crosses, home_touches, away_touches, home_tackles, away_tackles, home_interceptions, away_interceptions, home_aerials_won, away_aerials_won, home_clearances, away_clearances, home_offsides, away_offsides, home_gks, away_gks, home_throw_ins, away_throw_ins, home_long_balls, away_long_balls
Dataset 2 columns: Group, Team, FIFA Ranking
Dataset 3 columns: ID, Year, Date, Stage, Home Team, Away Team, Host Team
Dataset 4 columns: ID, Tournament, Date, Home Team, Home Goals, Away Goals, Away Team, Winning Team, Los
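
The column listing above can be complemented by a compact overview of each dataset's size and missing values. The following is a minimal sketch under stated assumptions: `summarize_datasets` is a hypothetical helper, and the two small frames are made-up stand-ins, not the Kaggle data. In the notebook one would pass a dictionary built from the loaded frames instead.

```python
import pandas as pd

def summarize_datasets(named_datasets):
    """Return one row per dataset with its row count, column count,
    and total number of missing cells."""
    rows = []
    for name, df in named_datasets.items():
        rows.append({
            "dataset": name,
            "rows": df.shape[0],
            "columns": df.shape[1],
            "missing_cells": int(df.isna().sum().sum()),
        })
    return pd.DataFrame(rows).set_index("dataset")

# Tiny stand-in frames (made-up values, for illustration only)
example = {
    "groups": pd.DataFrame({"Group": ["A", "A"], "Team": ["X", "Y"]}),
    "matches": pd.DataFrame({"ID": [1, 2, 3], "Stage": ["Group", None, "Final"]}),
}
summary = summarize_datasets(example)
print(summary)
```

Keying the datasets by name rather than by position makes the printed overview easier to read than "Dataset 1", "Dataset 2", and so on.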

In [7]:
import pandas as pd

def print_attribute_distribution(df):
    """
    This function takes a DataFrame and prints the distribution of values for each column.

    For categorical data, it prints the frequency of each category.
    For numerical data, it prints a basic histogram-like distribution.
    """
    for column in df.columns:
        print(f"Distribution for {column}:")

        # Check if the column is numeric or categorical
        if pd.api.types.is_numeric_dtype(df[column]):
            # For numeric columns, print a histogram-like distribution
            print(df[column].value_counts(bins=10, sort=False))
        else:
            # For categorical columns, print the frequency of each category
            print(df[column].value_counts())

        print("\n")  # Print a newline for better readability between columns

# Print the value distribution of every column in each loaded dataset
for dataset in datasets:
    print_attribute_distribution(dataset)

Distribution for match:
(0.936, 7.3]    7
(7.3, 13.6]     6
(13.6, 19.9]    6
(19.9, 26.2]    7
(26.2, 32.5]    6
(32.5, 38.8]    6
(38.8, 45.1]    7
(45.1, 51.4]    6
(51.4, 57.7]    6
(57.7, 64.0]    7
Name: count, dtype: int64


Distribution for dayofweek:
dayofweek
Tue    11
Fri    10
Mon     9
Wed     9
Sat     9
Sun     8
Thu     8
Name: count, dtype: int64


Distribution for match_time:
match_time
2022-11-29 18:00:00    2
2022-11-29 22:00:00    2
2022-11-30 18:00:00    2
2022-11-30 22:00:00    2
2022-12-01 18:00:00    2
2022-12-01 22:00:00    2
2022-12-02 18:00:00    2
2022-12-02 22:00:00    2
2022-11-20 19:00:00    1
2022-11-21 16:00:00    1
2022-11-28 19:00:00    1
2022-11-28 22:00:00    1
2022-12-03 18:00:00    1
2022-12-03 22:00:00    1
2022-12-04 18:00:00    1
2022-12-04 22:00:00    1
2022-12-05 18:00:00    1
2022-12-05 22:00:00    1
2022-12-06 18:00:00    1
2022-12-06 22:00:00    1
2022-12-09 18:00:00    1
2022-12-09 22:00:00    1
2022-12-10 18:00:00    1
2022-12-10 22:00: