# Quick comparison of heights between FIFA and MLB players

In this analysis, we take two datasets containing observations about MLB and FIFA players. The goal is to compare height statistics across the two leagues and answer two questions:

1. Which league has the **highest average height**?
2. Which league has the **greatest height difference across players**?

In [1]:
import pandas as pd

## 1. Importing data and initial exploration

In [2]:
baseball = pd.read_csv("baseball.csv")
fifa = pd.read_csv('fifa.csv')

In [3]:
baseball.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1015 entries, 0 to 1014
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Name         1015 non-null   object 
 1   Team         1015 non-null   object 
 2   Position     1015 non-null   object 
 3   Height       1015 non-null   int64  
 4   Weight       1015 non-null   int64  
 5   Age          1015 non-null   float64
 6   PosCategory  1015 non-null   object 
dtypes: float64(1), int64(2), object(4)
memory usage: 55.6+ KB


In [4]:
fifa.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8847 entries, 0 to 8846
Data columns (total 19 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   id            8847 non-null   int64  
 1    name         8847 non-null   object 
 2    rating       8847 non-null   int64  
 3    position     8847 non-null   object 
 4    height       8847 non-null   int64  
 5    foot         8847 non-null   object 
 6    rare         8847 non-null   int64  
 7    pace         8847 non-null   object 
 8    shooting     8847 non-null   object 
 9    passing      8847 non-null   object 
 10   dribbling    8847 non-null   object 
 11   defending    8847 non-null   object 
 12   heading      8847 non-null   object 
 13   diving       8847 non-null   object 
 14   handling     8847 non-null   object 
 15   kicking      8847 non-null   object 
 16   reflexes     8847 non-null   object 
 17   speed        8847 non-null   object 
 18   positioning  930 non-null  

### 1.1. Cleaning fifa's column labels

Looking at the fifa dataset, there is a leading space before every column name from index 1 to index 18. This has to be cleaned up.

In [5]:
fifa_columns = list(fifa.columns)
fifa_columns

['id',
 ' name',
 ' rating',
 ' position',
 ' height',
 ' foot',
 ' rare',
 ' pace',
 ' shooting',
 ' passing',
 ' dribbling',
 ' defending',
 ' heading',
 ' diving',
 ' handling',
 ' kicking',
 ' reflexes',
 ' speed',
 ' positioning']

In [6]:
fifa_columns = [label.strip() for label in fifa_columns]
fifa_columns

['id',
 'name',
 'rating',
 'position',
 'height',
 'foot',
 'rare',
 'pace',
 'shooting',
 'passing',
 'dribbling',
 'defending',
 'heading',
 'diving',
 'handling',
 'kicking',
 'reflexes',
 'speed',
 'positioning']

Now that the column labels have been cleaned they can be assigned to the dataset.

In [7]:
fifa.columns = fifa_columns
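The three cells above can also be collapsed into a single step. The following is a minimal sketch on a small made-up frame (`demo` and its values are hypothetical, not the actual data), using the `.str.strip()` accessor on the column index:

```python
import pandas as pd

# Hypothetical frame that mimics the leading spaces seen in fifa's labels
demo = pd.DataFrame({"id": [1, 2], " name": ["a", "b"], " height": [180, 175]})

# Index.str.strip() removes leading and trailing whitespace from every label
demo.columns = demo.columns.str.strip()

print(list(demo.columns))
```

Another option, assuming the spaces only follow the delimiter in the CSV file, is to pass `skipinitialspace=True` to `pd.read_csv`. Note that this would also affect the cell values, not just the labels.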

## 2. Exploring the datasets

### 2.1. MLB data

In [8]:
baseball_height = baseball['Height']
baseball_height.describe()

count    1015.000000
mean       73.689655
std         2.313932
min        67.000000
25%        72.000000
50%        74.000000
75%        75.000000
max        83.000000
Name: Height, dtype: float64

The height observations in the baseball dataset are expressed in inches. However, for this analysis we'll work with the metric system, so we convert them to centimeters (1 inch = 2.54 cm).

In [9]:
baseball_height = baseball_height * 2.54
baseball_height_stats = baseball_height.describe()
baseball_height_stats

count    1015.000000
mean      187.171724
std         5.877387
min       170.180000
25%       182.880000
50%       187.960000
75%       190.500000
max       210.820000
Name: Height, dtype: float64
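As a sanity check on the conversion, every summary statistic except `count` should scale by the same factor of 2.54. The sketch below illustrates this on a handful of made-up heights (the values in `heights_in` are for illustration only and are not taken from the dataset):

```python
import pandas as pd

CM_PER_INCH = 2.54  # exact by definition of the international inch

# Made-up heights in inches, for illustration only
heights_in = pd.Series([67, 72, 74, 75, 83], name="Height")
heights_cm = heights_in * CM_PER_INCH

stats_in = heights_in.describe()
stats_cm = heights_cm.describe()

# Every statistic except 'count' scales by the same factor
scaled = stats_in.drop("count") * CM_PER_INCH
max_gap = stats_cm.drop("count").sub(scaled).abs().max()

print(max_gap)  # expected to be zero up to floating-point error
```

The same relationship holds for the real data: the mean of 73.689655 inches times 2.54 gives the 187.171724 cm reported above.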

### 2.2. FIFA data

In [10]:
fifa_height = fifa['height']
fifa_height_stats = fifa_height.describe()
fifa_height_stats

count    8847.000000
mean      181.750424
std         6.454356
min       158.000000
25%       178.000000
50%       182.000000
75%       186.000000
max       208.000000
Name: height, dtype: float64

The heights are already expressed in centimeters, so no conversion is needed.

## 3. Problem: samples of different sizes

Looking at the summary statistics for the datasets, we can see that the fifa dataset has many more observations than the baseball dataset.

In [11]:
fifa_size = fifa_height_stats['count']
baseball_size = baseball_height_stats['count']

In [12]:
observations_difference = fifa_size - baseball_size
observations_difference_multiplier = fifa_size / baseball_size

In [13]:
print(f"The fifa dataset has {observations_difference:,} more observations. This means that it is {observations_difference_multiplier:.1f} times bigger")

The fifa dataset has 7,832.0 more observations. This means that it is 8.7 times bigger


To compare players across the two leagues on samples of equal size, we downsize the fifa dataset. The subset is created by simple random sampling, so that the selection itself does not introduce bias.

In [14]:
fifa_height = fifa_height.sample(int(baseball_size))
fifa_height_stats = fifa_height.describe()
assert fifa_height_stats['count'] == baseball_size
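Because the sample is random, rerunning the cell above produces a different subset each time. A minimal sketch of how the draw could be made reproducible, assuming a fixed seed is acceptable for this analysis (the `population` series and the `SEED` value are made up for illustration):

```python
import pandas as pd

# Made-up heights in cm, standing in for fifa['height']
population = pd.Series(range(150, 210), name="height")

SEED = 42  # arbitrary; any fixed integer works

# Series.sample draws without replacement by default;
# the same random_state yields the same rows on every run
sample_a = population.sample(n=20, random_state=SEED)
sample_b = population.sample(n=20, random_state=SEED)

print(sample_a.equals(sample_b))
```

Applied to this notebook, that would mean passing `random_state` to the `fifa_height.sample(...)` call.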

## 4. Comparing the data

The final part of this analysis involves creating a table to compare the statistics needed to answer the two questions we set at the beginning.

In [15]:
target_data = ['mean','std', 'min', 'max']

In [17]:
comparison_frame = {'MLB': baseball_height_stats[target_data], 'FIFA': fifa_height_stats[target_data]}
comparison_frame = pd.DataFrame(comparison_frame)
comparison_frame['Difference'] = baseball_height_stats - fifa_height_stats
comparison_frame

             MLB        FIFA  Difference
mean  187.171724  181.763547    5.408177
std     5.877387    6.366150   -0.488762
min   170.180000  160.000000   10.180000
max   210.820000  202.000000    8.820000
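Besides the standard deviation, the range (tallest minus shortest player) is another simple way to look at the height difference across players. The sketch below computes it from the `min` and `max` values copied from the comparison table above. The `table` frame is rebuilt by hand here so the example is self-contained:

```python
import pandas as pd

# Values copied from the comparison table above (cm)
table = pd.DataFrame(
    {"MLB": [170.18, 210.82], "FIFA": [160.0, 202.0]},
    index=["min", "max"],
)

# Range = tallest minus shortest player in each group
height_range = table.loc["max"] - table.loc["min"]

print(height_range.round(2))
```

This gives about 40.6 cm for MLB and 42.0 cm for the FIFA sample. The range depends only on the two most extreme players, so it is more sensitive to the particular sample drawn than the standard deviation is.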


## 5. Conclusion

Looking at the table above, we can reach the following conclusions:

1. MLB players are taller on average than the FIFA players (187.2 cm vs. 181.8 cm, a difference of about 5.4 cm).
2. Looking at the standard deviation, FIFA players show more variation in height than MLB players (6.4 cm vs. 5.9 cm).

*NOTE*: We used random sampling to avoid bias when selecting FIFA players. However, the FIFA figures above depend on the sample that was drawn, so a different sample could give slightly different results.
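One way to gauge how much the results could move from sample to sample is to repeat the draw many times and look at the spread of the resulting means. The following is a sketch only: `resampled_means` is a hypothetical helper, and `demo_heights` is synthetic data generated with arbitrary parameters, not the FIFA dataset.

```python
import numpy as np
import pandas as pd

def resampled_means(heights, n, repeats=200, seed=0):
    """Means of `repeats` random samples of size `n`, each drawn without replacement."""
    rng = np.random.default_rng(seed)
    means = [
        heights.sample(n=n, random_state=int(rng.integers(0, 2**32 - 1))).mean()
        for _ in range(repeats)
    ]
    return pd.Series(means)

# Synthetic heights for illustration only (arbitrary mean and spread)
demo_rng = np.random.default_rng(1)
demo_heights = pd.Series(demo_rng.normal(180, 6, size=5000))

means = resampled_means(demo_heights, n=500)
print(means.describe()[["mean", "std", "min", "max"]])
```

If the spread of the resampled means is small compared with the gap between the two leagues, the conclusion about the average height is unlikely to depend on the particular sample.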