# Data exploration 

---

Group name: Gruppe C

---


## Introduction

*This section includes a short description of the data* 

## Setup

In [1]:
import pandas as pd
import altair as alt
import warnings

warnings.simplefilter(action='ignore', category=FutureWarning)
alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

## Data

### Import data

In [2]:
# Load the FiveThirtyEight 2022 World Cup forecasts (pandas is imported in Setup)
df = pd.read_csv("https://projects.fivethirtyeight.com/soccer-api/international/2022/wc_forecasts.csv")

# Save a local copy of the raw data
df.to_csv("data.csv")

df.head(3)

|   | forecast_timestamp | team | group | spi | global_o | global_d | sim_wins | sim_ties | sim_losses | sim_goal_diff | ... | group_1 | group_2 | group_3 | group_4 | make_round_of_16 | make_quarters | make_semis | make_final | win_league | timestamp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022-12-14 20:55:37 UTC | Argentina | C | 88.85631 | 2.69895 | 0.37464 | 2.0 | 0.0 | 1.0 | 3.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.53049 | 2022-12-14 20:56:18 UTC |
| 1 | 2022-12-14 20:55:37 UTC | France | D | 88.41321 | 2.89548 | 0.49957 | 2.0 | 0.0 | 1.0 | 3.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.46951 | 2022-12-14 20:56:18 UTC |
| 2 | 2022-12-14 20:55:37 UTC | Morocco | F | 73.92282 | 1.73737 | 0.50047 | 2.0 | 1.0 | 0.0 | 3.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 2022-12-14 20:56:18 UTC |


### Data structure

In [3]:
# Inspect column names, non-null counts and dtypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 224 entries, 0 to 223
Data columns (total 22 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   forecast_timestamp  224 non-null    object 
 1   team                224 non-null    object 
 2   group               224 non-null    object 
 3   spi                 224 non-null    float64
 4   global_o            224 non-null    float64
 5   global_d            224 non-null    float64
 6   sim_wins            224 non-null    float64
 7   sim_ties            224 non-null    float64
 8   sim_losses          224 non-null    float64
 9   sim_goal_diff       224 non-null    float64
 10  goals_scored        224 non-null    float64
 11  goals_against       224 non-null    float64
 12  group_1             224 non-null    float64
 13  group_2             224 non-null    float64
 14  group_3             224 non-null    float64
 15  group_4             224 non-null    float64
 16  make_rou

### Data corrections

In [4]:
df['forecast_timestamp'] = pd.to_datetime(df['forecast_timestamp'])
df['timestamp'] = pd.to_datetime(df['timestamp'])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 224 entries, 0 to 223
Data columns (total 22 columns):
 #   Column              Non-Null Count  Dtype              
---  ------              --------------  -----              
 0   forecast_timestamp  224 non-null    datetime64[ns, UTC]
 1   team                224 non-null    object             
 2   group               224 non-null    object             
 3   spi                 224 non-null    float64            
 4   global_o            224 non-null    float64            
 5   global_d            224 non-null    float64            
 6   sim_wins            224 non-null    float64            
 7   sim_ties            224 non-null    float64            
 8   sim_losses          224 non-null    float64            
 9   sim_goal_diff       224 non-null    float64            
 10  goals_scored        224 non-null    float64            
 11  goals_against       224 non-null    float64            
 12  group_1             224 non-null    
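
With the timestamp columns parsed as datetimes, rows can be ordered by forecast time. The frame has 224 rows, which suggests several forecast snapshots per team. The following is a minimal sketch, assuming that is the case, of keeping only the most recent snapshot per team. The helper name `latest_forecast` and the toy frame are illustrative and not part of the original notebook.

```python
import pandas as pd


def latest_forecast(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per team, taken from its most recent forecast."""
    # Sort chronologically, then keep the last (newest) row of each team.
    ordered = df.sort_values("forecast_timestamp")
    return ordered.drop_duplicates(subset="team", keep="last").reset_index(drop=True)


# Toy stand-in for the real data (values are made up for illustration only).
toy = pd.DataFrame({
    "forecast_timestamp": pd.to_datetime(
        ["2022-12-01 10:00", "2022-12-02 10:00", "2022-12-01 10:00"], utc=True
    ),
    "team": ["A", "A", "B"],
    "win_league": [0.10, 0.20, 0.05],
})
print(latest_forecast(toy))
```

In the notebook, `latest_forecast(df)` could be passed to the charts instead of `df` if each team should appear only once.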

## Exploratory data analysis

### Correlation between Soccer Power Index and winning the World Cup

In [5]:
# define outcome variable as y_label
y_label = 'win_league'

# select features
X = df[["spi"]]

# create response
y = df[y_label]

In [10]:
# Visualize the data (Altair is imported in Setup)
alt.Chart(df).mark_circle(size=100).encode(
    x=alt.X('spi',
            axis=alt.Axis(title='ESPN Soccer Power Index',
                          labelAngle=0,
                          titleAnchor='start',
                          grid=False)),
    y=alt.Y('win_league',
            axis=alt.Axis(title='Probability of winning the World Cup',
                          titleAnchor='end',
                          grid=False)),
    color=alt.Color('team', legend=None),
    tooltip=['team', 'spi', 'win_league']

).properties(
    title={'text': 'Correlation between SPI and the probability of winning the World Cup', 'subtitle': 'For the 2022 FIFA World Cup'},
    width=500,
    height=300
).configure_view(
    strokeWidth=0
).configure_title( 
    fontSize=16,
    font='Arial',
    color='black',
    anchor='start'
).interactive()

This visualization shows the relationship between the SPI and the probability of winning the World Cup.
The higher a team's SPI, the higher its probability of winning. Each color represents one team.
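
The relationship described above can also be quantified rather than only read off the chart. The following is a minimal sketch, not part of the original notebook: the helper name `correlation_summary` and the toy values are assumptions for illustration. It reports Pearson's coefficient (linear association) and Spearman's coefficient (monotonic association), the latter computed as Pearson's coefficient on the ranks.

```python
import pandas as pd


def correlation_summary(df: pd.DataFrame, x: str, y: str) -> dict:
    """Pearson (linear) and Spearman (rank) correlation between two columns."""
    pearson = df[x].corr(df[y])
    # Spearman's coefficient equals Pearson's coefficient applied to the ranks.
    spearman = df[x].rank().corr(df[y].rank())
    return {"pearson": pearson, "spearman": spearman}


# Toy stand-in (made-up values); in the notebook, pass the real `df` instead.
toy = pd.DataFrame({
    "spi": [60.0, 70.0, 80.0, 90.0],
    "win_league": [0.00, 0.01, 0.05, 0.40],
})
print(correlation_summary(toy, "spi", "win_league"))
```

A Spearman value well above the Pearson value would indicate a relationship that is monotonic but not linear, which is plausible when win probabilities are concentrated among a few top teams.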

### Correlation between goal difference and Soccer Power Index

In [12]:
# define outcome variable as y_label
y_label = 'spi'

# select features
X = df[["sim_goal_diff"]]

# create response
y = df[y_label]

In [13]:
# Visualize the data
alt.Chart(df).mark_circle(size=100).encode(
    x='sim_goal_diff',
    y='spi',
    color=alt.Color('team', legend=None),
    tooltip=['team', 'sim_goal_diff', 'spi']

).interactive()

This visualization shows the relationship between goal difference and Soccer Power Index. Teams with a high goal difference also tend to have a higher SPI.

### Correlation between finishing first in the group and winning the World Cup

In [14]:
# define outcome variable as y_label
y_label = 'win_league'

# select features
X = df[["group_1"]]

# create response
y = df[y_label]

In [15]:
# Visualize the data
alt.Chart(df).mark_circle(size=100).encode(
    x='group_1',
    y='win_league',
    color=alt.Color('team', legend=None),
    tooltip=['team', 'group_1', 'win_league']

).interactive()

This chart relates the probability of finishing first in the group to the probability of winning the World Cup. Very high (85%) and very low (<30%) probabilities of finishing first go along with a correspondingly high or low chance of winning the World Cup. In the middle range, no clear statement can be made.
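
The three ranges named above can be checked by binning `group_1` and summarizing `win_league` per bin. The following is a minimal sketch under stated assumptions: the helper name `summarize_by_bin`, the bin labels, and the toy values are illustrative, and the bin edges simply reuse the 30% and 85% thresholds from the text.

```python
import pandas as pd


def summarize_by_bin(df: pd.DataFrame) -> pd.DataFrame:
    """Bin `group_1` into low / middle / high and summarize `win_league` per bin."""
    # Edges follow the thresholds named in the text: 30% and 85%.
    bins = [0.0, 0.30, 0.85, 1.0]
    labels = ["low", "middle", "high"]
    binned = pd.cut(df["group_1"], bins=bins, labels=labels, include_lowest=True)
    return df.groupby(binned, observed=False)["win_league"].agg(
        ["count", "mean", "min", "max"]
    )


# Toy stand-in (made-up values); in the notebook, pass the real `df` instead.
toy = pd.DataFrame({
    "group_1": [0.05, 0.20, 0.50, 0.60, 0.90, 1.00],
    "win_league": [0.00, 0.01, 0.02, 0.10, 0.20, 0.30],
})
print(summarize_by_bin(toy))
```

A wide gap between `min` and `max` in the middle bin would support the observation that no clear statement is possible there.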

### Correlation between global offensive rating and winning the World Cup

In [18]:
# define outcome variable as y_label
y_label = 'win_league'

# select features
X = df[["global_o"]]

# create response
y = df[y_label]

In [19]:
# Visualize the data
alt.Chart(df).mark_circle(size=100).encode(
    x='global_o',
    y='win_league',
    color=alt.Color('team', legend=None),
    tooltip=['team', 'global_o', 'win_league']

).interactive()

This chart relates the global offensive rating to the probability of winning the World Cup. It suggests that other factors must also play a role: Germany, for example, has a high global offensive rating, yet its probability of becoming world champion is rather low.
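
To see which other factors might matter, several columns can be compared against `win_league` at once. The following is a minimal sketch, not part of the original notebook: the helper name `rank_features` and the toy values are assumptions, while `global_o`, `global_d`, and `win_league` are columns of the dataset. It uses `DataFrame.corrwith` and orders the features by the absolute value of their Pearson correlation with the target.

```python
import pandas as pd


def rank_features(df: pd.DataFrame, features: list[str], target: str) -> pd.Series:
    """Pearson correlation of each feature with the target, strongest first."""
    corr = df[features].corrwith(df[target])
    # Order by absolute value so strong negative relationships rank high too.
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Toy stand-in (made-up values); in the notebook, pass the real `df` instead.
toy = pd.DataFrame({
    "global_o": [1.0, 2.0, 3.0, 4.0],
    "global_d": [1.0, 0.8, 0.9, 0.3],
    "win_league": [0.0, 0.1, 0.2, 0.3],
})
print(rank_features(toy, ["global_o", "global_d"], "win_league"))
```

Correlation coefficients only describe pairwise association, so a ranking like this is a starting point for further exploration rather than evidence of which factor drives the forecast.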