In [16]:
import pandas as pd
import numpy as np

# 0. `pandas` and preliminary knowledge

In this exercise we will be using the [`pandas`](https://pandas.pydata.org/docs/) library.

You can use a [cheat sheet of the basic functions](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf) if you are not familiar with `pandas`.

`pandas` is a popular data manipulation library for Python. It provides the `Series` and `DataFrame` data structures, together with the functionality needed for cleaning, aggregating, transforming, and visualizing data. `pandas` is built on top of the [`numpy`](https://numpy.org/) library and offers a higher-level, more intuitive interface for data analysis and manipulation. It is especially suitable for structured data, such as datasets from Excel or CSV files and SQL tables. Typical tasks include handling missing data, merging and joining datasets, and filtering and reshaping data, which makes `pandas` a standard tool for data scientists, analysts, and researchers working in Python.


# 1. Load the data
Load the data into a `pandas` DataFrame. You can find the data here: https://osf.io/fv8c3.

There is a function in `pandas` that can read this file. If you can't find it on your own or you feel unsure, you can click [here](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) to find the function to use.

In [17]:
# Forward slashes work on every operating system; '\C' is not a valid escape sequence.
df = pd.read_csv('data/CrowdstormingDataJuly1st.csv')
df.head()

Unnamed: 0,playerShort,player,club,leagueCountry,birthday,height,weight,position,games,victories,...,rater2,refNum,refCountry,Alpha_3,meanIAT,nIAT,seIAT,meanExp,nExp,seExp
0,lucas-wilchez,Lucas Wilchez,Real Zaragoza,Spain,31.08.1983,177.0,72.0,Attacking Midfielder,1,0,...,0.5,1,1,GRC,0.326391,712.0,0.000564,0.396,750.0,0.002696
1,john-utaka,John Utaka,Montpellier HSC,France,08.01.1982,179.0,82.0,Right Winger,1,0,...,0.75,2,2,ZMB,0.203375,40.0,0.010875,-0.204082,49.0,0.061504
2,abdon-prats,Abdón Prats,RCD Mallorca,Spain,17.12.1992,181.0,79.0,,1,0,...,,3,3,ESP,0.369894,1785.0,0.000229,0.588297,1897.0,0.001002
3,pablo-mari,Pablo Marí,RCD Mallorca,Spain,31.08.1993,191.0,87.0,Center Back,1,1,...,,3,3,ESP,0.369894,1785.0,0.000229,0.588297,1897.0,0.001002
4,ruben-pena,Rubén Peña,Real Valladolid,Spain,18.07.1991,172.0,70.0,Right Midfielder,1,1,...,,3,3,ESP,0.369894,1785.0,0.000229,0.588297,1897.0,0.001002


In [18]:
df

Unnamed: 0,playerShort,player,club,leagueCountry,birthday,height,weight,position,games,victories,...,rater2,refNum,refCountry,Alpha_3,meanIAT,nIAT,seIAT,meanExp,nExp,seExp
0,lucas-wilchez,Lucas Wilchez,Real Zaragoza,Spain,31.08.1983,177.0,72.0,Attacking Midfielder,1,0,...,0.50,1,1,GRC,0.326391,712.0,0.000564,0.396000,750.0,0.002696
1,john-utaka,John Utaka,Montpellier HSC,France,08.01.1982,179.0,82.0,Right Winger,1,0,...,0.75,2,2,ZMB,0.203375,40.0,0.010875,-0.204082,49.0,0.061504
2,abdon-prats,Abdón Prats,RCD Mallorca,Spain,17.12.1992,181.0,79.0,,1,0,...,,3,3,ESP,0.369894,1785.0,0.000229,0.588297,1897.0,0.001002
3,pablo-mari,Pablo Marí,RCD Mallorca,Spain,31.08.1993,191.0,87.0,Center Back,1,1,...,,3,3,ESP,0.369894,1785.0,0.000229,0.588297,1897.0,0.001002
4,ruben-pena,Rubén Peña,Real Valladolid,Spain,18.07.1991,172.0,70.0,Right Midfielder,1,1,...,,3,3,ESP,0.369894,1785.0,0.000229,0.588297,1897.0,0.001002
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
146023,tomas-rosicky,Tomáš Rosický,Arsenal FC,England,04.10.1980,178.0,67.0,Attacking Midfielder,1,1,...,0.00,3147,21,HUN,0.376127,574.0,0.000714,0.498350,606.0,0.002968
146024,winston-reid,Winston Reid,West Ham United,England,03.07.1988,190.0,87.0,Center Back,1,0,...,0.50,3147,21,HUN,0.376127,574.0,0.000714,0.498350,606.0,0.002968
146025,xherdan-shaqiri,Xherdan Shaqiri,Bayern München,Germany,10.10.1991,169.0,72.0,Left Midfielder,1,1,...,0.25,3147,21,HUN,0.376127,574.0,0.000714,0.498350,606.0,0.002968
146026,yassine-el-ghanassi,Yassine El Ghanassi,West Bromwich Albion,England,12.07.1990,173.0,,Left Winger,1,0,...,0.50,3147,21,HUN,0.376127,574.0,0.000714,0.498350,606.0,0.002968


To look at the columns, we can use `.columns`.

In [19]:
df.columns

Index(['playerShort', 'player', 'club', 'leagueCountry', 'birthday', 'height',
       'weight', 'position', 'games', 'victories', 'ties', 'defeats', 'goals',
       'yellowCards', 'yellowReds', 'redCards', 'photoID', 'rater1', 'rater2',
       'refNum', 'refCountry', 'Alpha_3', 'meanIAT', 'nIAT', 'seIAT',
       'meanExp', 'nExp', 'seExp'],
      dtype='object')

`df.shape` returns a tuple that reads as (rows, columns).
You can also get the number of rows with `len()`, or read it off the notebook output when printing the DataFrame.

In [20]:
df.shape

(146028, 28)

# 2. Clean the data
Here we use a very simple approach to clean the data. We remove all the rows that contain missing values. You can try a more sophisticated approach if you want.


In `pandas`, `NaN` (Not a Number) values represent missing or undefined data. `NaN` values can occur when loading datasets, during data manipulation, or when performing calculations. They are often used as a placeholder for missing or unrepresentable data.

Handling `NaN` values is essential as they can affect the results of data analysis and machine learning algorithms. Here are a few ways to handle `NaN` values in pandas:
- You can use the [`dropna()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html) method to remove any rows or columns that contain `NaN` values from a DataFrame.
- You can use the [`fillna()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html) method to replace `NaN` values with a specific value or a method (like forward fill or backward fill).

For simplicity's sake we want to just remove all rows that have `NaN` values.

Look up how we could do backfill and forward fill and what doing that would mean in our case. Can you construct a case (on another dataset, or an imagined dataset) where filling is useful?
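
One imagined case where filling is useful is a sensor that only reports a reading when it gets one, leaving gaps in between. The sketch below uses a made-up daily temperature series (not part of our dataset) with the `ffill()` and `bfill()` methods:

```python
import numpy as np
import pandas as pd

# Made-up daily temperature readings; NaN marks days without a report.
temperature = pd.Series(
    [20.0, np.nan, np.nan, 22.0, np.nan],
    index=pd.date_range('2024-01-01', periods=5, freq='D'),
)

forward = temperature.ffill()   # carry the last known value forward
backward = temperature.bfill()  # pull the next known value backward

print(pd.DataFrame({'raw': temperature, 'ffill': forward, 'bfill': backward}))
```

Filling like this only makes sense when neighbouring rows describe the same thing over time. In our dataset neighbouring rows belong to different players, so a forward fill would copy one player's `height` or `rater1` value onto another player.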

In [21]:
df = df.dropna()

# confirm that the size went down
df.shape

(115457, 28)

In [22]:
len(df)

115457

## A Note about `NaN` and `NaT`

Note that `NaN != NaN`!

You can find a detailed explanation of why this is the case [here](https://stackoverflow.com/a/1573715/5320601).

In short:

- `a == b` should hold if `(a - b) == 0`
- but then what is `(a - NaN)`?

The way to check for `NaN` values is to use [`.isna()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isna.html).

There is also [`NaT`](https://pandas.pydata.org/docs/reference/api/pandas.NaT.html), which is used for missing time values (like a datetime column), and [`NA`](https://pandas.pydata.org/docs/user_guide/missing_data.html#missing-data-na) which is still experimental and whose behaviour could change without warning between versions of `pandas`.

In [23]:
a = np.nan
b = np.nan

a == b

False

In [24]:
print(f'{"NaN and None equality:":<30} {np.nan == None}')
print(f'{"None and None equality:":<30} {None == None}')
print(f'{"NaN and NaN equality:":<30} {np.nan == np.nan}')

NaN and None equality:         False
None and None equality:        True
NaN and NaN equality:          False


Keep in mind that this can behave unexpectedly if you are not aware of it.
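
As a small sketch of that pitfall, using a made-up Series: comparing with `== np.nan` never matches anything, while `.isna()` finds the missing entries.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, np.nan, 3.0, np.nan])

# Element-wise comparison with NaN is always False, so nothing is found.
found_by_equality = int((values == np.nan).sum())

# .isna() is the reliable way to detect missing values.
found_by_isna = int(values.isna().sum())

print(found_by_equality, found_by_isna)
```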


Here we create a DataFrame with integers, and add a row with a missing value. For the missing value we use `None`.

In [25]:
_numbers = pd.DataFrame(
    [[0, 1, 2], [3, 4, 5]],
    columns=['a', 'b', 'c']
)
_numbers

Unnamed: 0,a,b,c
0,0,1,2
1,3,4,5


Next we add the row. In the result, the `None` shows up as `NaN`: an integer column cannot store a missing value, so `pandas` switches to `float64` and represents `None` as `NaN` (Not a Number).

In [26]:
_numbers.loc[2] = [6, 7, None]
_numbers

Unnamed: 0,a,b,c
0,0.0,1.0,2.0
1,3.0,4.0,5.0
2,6.0,7.0,


In [27]:
_numbers.dtypes

a    float64
b    float64
c    float64
dtype: object

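When we do the same with strings, the result is different: an `object` column can hold `None` directly, so no dtype conversion is needed. `dropna()` still treats the `None` as a missing value.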

In [28]:
_strings = pd.DataFrame(
    [['Apple', 'red'], ['Banana', 'yellow']],
    columns=['Fruit', 'Color']
)
_strings

Unnamed: 0,Fruit,Color
0,Apple,red
1,Banana,yellow


In [29]:
_strings.loc[2] = ['Cherry', None]
_strings

Unnamed: 0,Fruit,Color
0,Apple,red
1,Banana,yellow
2,Cherry,


In [30]:
_strings.dtypes

Fruit    object
Color    object
dtype: object

In [31]:
_strings

Unnamed: 0,Fruit,Color
0,Apple,red
1,Banana,yellow
2,Cherry,


In [32]:
_strings.dropna()

Unnamed: 0,Fruit,Color
0,Apple,red
1,Banana,yellow


# 3. Simple statistics
Calculate the mean, median, min and maximum values for all columns.

*Hint*: You can find this in the cheat sheet, in the section 'Summarize Data'.

If you already know how to do this the simple/automatic way, try to calculate these values by hand.
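
A minimal sketch of the "by hand" route, using plain Python on a short sample list of heights rather than the full dataset. The helper name `summarize` is hypothetical and only serves this illustration.

```python
def summarize(values):
    """Return mean, median, min and max of a non-empty list of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    mean = sum(ordered) / n
    middle = n // 2
    # Even length: average the two middle values; odd length: take the middle one.
    if n % 2 == 0:
        median = (ordered[middle - 1] + ordered[middle]) / 2
    else:
        median = ordered[middle]
    return {'mean': mean, 'median': median, 'min': ordered[0], 'max': ordered[-1]}


heights = [177.0, 179.0, 182.0, 187.0, 180.0]
print(summarize(heights))
```

The same function could be applied to a single column with `summarize(df['height'].tolist())`, assuming the column contains no missing values.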


In [33]:
df.describe()

Unnamed: 0,height,weight,games,victories,ties,defeats,goals,yellowCards,yellowReds,redCards,rater1,rater2,refNum,refCountry,meanIAT,nIAT,seIAT,meanExp,nExp,seExp
count,115457.0,115457.0,115457.0,115457.0,115457.0,115457.0,115457.0,115457.0,115457.0,115457.0,115457.0,115457.0,115457.0,115457.0,115457.0,115457.0,115457.0,115457.0,115457.0,115457.0
mean,182.176135,76.517413,3.033761,1.371506,0.721134,0.941121,0.360351,0.404592,0.01229,0.012801,0.261946,0.300796,1532.497363,29.367124,0.348564,17725.43,0.0006292873,0.466222,18372.61,0.002993
std,6.855077,7.18721,3.641059,1.918978,1.155027,1.433641,0.960867,0.831051,0.112127,0.114175,0.294666,0.291061,916.310481,27.981717,0.032004,126078.8,0.004801956,0.21935,129533.9,0.019733
min,161.0,55.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,-0.047254,2.0,2.235373e-07,-1.375,2.0,1e-06
25%,178.0,72.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,652.0,7.0,0.334684,1785.0,5.454025e-05,0.335967,1897.0,0.000225
50%,183.0,76.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.25,0.25,1579.0,15.0,0.336628,2882.0,0.0001508847,0.356446,3011.0,0.000586
75%,187.0,81.0,3.0,2.0,1.0,1.0,0.0,1.0,0.0,0.0,0.25,0.5,2337.0,45.0,0.369894,7749.0,0.0002294896,0.588297,7974.0,0.001002
max,203.0,100.0,47.0,29.0,14.0,18.0,23.0,14.0,3.0,2.0,1.0,1.0,3147.0,161.0,0.573793,1975803.0,0.2862871,1.8,2029548.0,1.06066


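When we try to calculate the median on all columns, we run into a problem: the DataFrame also contains non-numeric columns such as `player` and `club`, for which a median is not defined. Depending on the `pandas` version, this produces a warning or an error.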

In [34]:
df.median()

  df.median()


height          183.000000
weight           76.000000
games             1.000000
victories         1.000000
ties              0.000000
defeats           1.000000
goals             0.000000
yellowCards       0.000000
yellowReds        0.000000
redCards          0.000000
rater1            0.250000
rater2            0.250000
refNum         1579.000000
refCountry       15.000000
meanIAT           0.336628
nIAT           2882.000000
seIAT             0.000151
meanExp           0.356446
nExp           3011.000000
seExp             0.000586
dtype: float64

`df.describe()` already reports the median as the `50%` row, but let's also calculate it directly, restricted to the numeric columns.
First, find all numeric columns.

In [35]:
numeric_columns = df.select_dtypes(include=[np.number]).columns
numeric_columns

Index(['height', 'weight', 'games', 'victories', 'ties', 'defeats', 'goals',
       'yellowCards', 'yellowReds', 'redCards', 'rater1', 'rater2', 'refNum',
       'refCountry', 'meanIAT', 'nIAT', 'seIAT', 'meanExp', 'nExp', 'seExp'],
      dtype='object')

Next, select the numeric columns.

In [36]:
df[numeric_columns].head()

Unnamed: 0,height,weight,games,victories,ties,defeats,goals,yellowCards,yellowReds,redCards,rater1,rater2,refNum,refCountry,meanIAT,nIAT,seIAT,meanExp,nExp,seExp
0,177.0,72.0,1,0,0,1,0,0,0,0,0.25,0.5,1,1,0.326391,712.0,0.000564,0.396,750.0,0.002696
1,179.0,82.0,1,0,0,1,0,1,0,0,0.75,0.75,2,2,0.203375,40.0,0.010875,-0.204082,49.0,0.061504
5,182.0,71.0,1,0,0,1,0,0,0,0,0.25,0.0,4,4,0.325185,127.0,0.003297,0.538462,130.0,0.013752
6,187.0,80.0,1,1,0,0,0,0,0,0,0.0,0.25,4,4,0.325185,127.0,0.003297,0.538462,130.0,0.013752
7,180.0,68.0,1,0,0,1,0,0,0,0,1.0,1.0,4,4,0.325185,127.0,0.003297,0.538462,130.0,0.013752


Finally, calculate the median.

In [37]:
df[numeric_columns].median()  # by default column-wise!

height          183.000000
weight           76.000000
games             1.000000
victories         1.000000
ties              0.000000
defeats           1.000000
goals             0.000000
yellowCards       0.000000
yellowReds        0.000000
redCards          0.000000
rater1            0.250000
rater2            0.250000
refNum         1579.000000
refCountry       15.000000
meanIAT           0.336628
nIAT           2882.000000
seIAT             0.000151
meanExp           0.356446
nExp           3011.000000
seExp             0.000586
dtype: float64
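
As a shorter alternative, `median()` accepts a `numeric_only` argument that skips non-numeric columns. A sketch on a small made-up frame:

```python
import pandas as pd

# Made-up data with one text column and two numeric columns.
players = pd.DataFrame({
    'player': ['A', 'B', 'C'],
    'height': [177.0, 179.0, 182.0],
    'games': [1, 3, 2],
})

# Only 'height' and 'games' take part in the calculation.
medians = players.median(numeric_only=True)
print(medians)
```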

# 4. Average cards per game
Calculate the average number of yellow and red cards per game for each player. Then print out the 5 players with the highest average number of cards per game.

## 4.1 Count the number of cards each player has gotten.
As an intermediate step, let's first calculate the number of cards each player has gotten.

In [38]:
df['total_cards'] = df['yellowCards'] + df['redCards']
df['total_cards'].head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['total_cards'] = df['yellowCards'] + df['redCards']


0    0
1    1
5    0
6    0
7    0
Name: total_cards, dtype: int64
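
The warning printed above is the `pandas` `SettingWithCopyWarning`. It likely appears here because `df` is the result of `dropna()`, and whether it shows up at all depends on the `pandas` version. One common way to avoid it, sketched on a made-up frame, is to take an explicit copy after filtering and before adding columns:

```python
import numpy as np
import pandas as pd

# Made-up data with one missing height.
raw = pd.DataFrame({
    'yellowCards': [0, 1, 2],
    'redCards': [0, 0, 1],
    'height': [177.0, np.nan, 183.0],
})

# An explicit copy makes clear that `clean` is an independent DataFrame.
clean = raw.dropna().copy()
clean['total_cards'] = clean['yellowCards'] + clean['redCards']
print(clean)
```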

## 4.2 Calculate the average number of cards per game for each player.
Next, we use this column to calculate the average number of cards per game. Note that this value is computed for each row of the DataFrame.

In [39]:
df['avg_cards_per_game'] = df['total_cards'] / df['games']
df['avg_cards_per_game'].head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['avg_cards_per_game'] = df['total_cards'] / df['games']


0    0.0
1    1.0
5    0.0
6    0.0
7    0.0
Name: avg_cards_per_game, dtype: float64

## 4.3 Sort the players by the average number of cards per game.
Then we sort by this column.

In [40]:
avg_cards_per_game_df = df.sort_values(by='avg_cards_per_game', ascending=False)
avg_cards_per_game_df

Unnamed: 0,playerShort,player,club,leagueCountry,birthday,height,weight,position,games,victories,...,refCountry,Alpha_3,meanIAT,nIAT,seIAT,meanExp,nExp,seExp,total_cards,avg_cards_per_game
103688,jean-pascal-mignot,Jean-Pascal Mignot,AS Saint-Étienne,France,26.02.1981,183.0,75.0,Center Back,1,1,...,72,PRT,0.396803,1079.0,0.000392,0.790366,1121.0,0.001798,3,3.0
59371,barragan_3,Barragán,Valencia CF,Spain,12.06.1987,187.0,83.0,Right Fullback,1,0,...,44,ENGL,0.326690,44791.0,0.000010,0.356446,46916.0,0.000037,2,2.0
115519,ricardo-costa,Ricardo Costa,Valencia CF,Spain,16.05.1981,183.0,80.0,Center Back,1,0,...,7,FRA,0.334684,2882.0,0.000151,0.336101,3011.0,0.000586,2,2.0
83849,david-villa,David Villa,FC Barcelona,Spain,03.12.1981,175.0,69.0,Center Forward,1,0,...,3,ESP,0.369894,1785.0,0.000229,0.588297,1897.0,0.001002,2,2.0
68897,apono,Apoño,Real Zaragoza,Spain,13.02.1984,173.0,72.0,Defensive Midfielder,1,1,...,3,ESP,0.369894,1785.0,0.000229,0.588297,1897.0,0.001002,2,2.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
56737,tolgay-arslan,Tolgay Arslan,Hamburger SV,Germany,16.08.1990,180.0,77.0,Attacking Midfielder,2,0,...,8,DEU,0.336628,7749.0,0.000055,0.335967,7974.0,0.000225,0,0.0
56734,timo-hildebrand,Timo Hildebrand,FC Schalke 04,Germany,05.04.1979,186.0,80.0,Goalkeeper,1,0,...,8,DEU,0.336628,7749.0,0.000055,0.335967,7974.0,0.000225,0,0.0
56732,timmy-simons,Timmy Simons,1. FC Nürnberg,Germany,11.12.1976,186.0,79.0,Defensive Midfielder,8,6,...,8,DEU,0.336628,7749.0,0.000055,0.335967,7974.0,0.000225,0,0.0
56726,thomas-kleine,Thomas Kleine,SpVgg Greuther Fürth,Germany,28.12.1977,191.0,82.0,Center Back,8,6,...,8,DEU,0.336628,7749.0,0.000055,0.335967,7974.0,0.000225,0,0.0


## 4.4 Print out the top 5 players.
This is now very easy to do. Instead of `.head()`, we use slicing this time.

In [41]:
avg_cards_per_game_df[:5]

Unnamed: 0,playerShort,player,club,leagueCountry,birthday,height,weight,position,games,victories,...,refCountry,Alpha_3,meanIAT,nIAT,seIAT,meanExp,nExp,seExp,total_cards,avg_cards_per_game
103688,jean-pascal-mignot,Jean-Pascal Mignot,AS Saint-Étienne,France,26.02.1981,183.0,75.0,Center Back,1,1,...,72,PRT,0.396803,1079.0,0.000392,0.790366,1121.0,0.001798,3,3.0
59371,barragan_3,Barragán,Valencia CF,Spain,12.06.1987,187.0,83.0,Right Fullback,1,0,...,44,ENGL,0.32669,44791.0,1e-05,0.356446,46916.0,3.7e-05,2,2.0
115519,ricardo-costa,Ricardo Costa,Valencia CF,Spain,16.05.1981,183.0,80.0,Center Back,1,0,...,7,FRA,0.334684,2882.0,0.000151,0.336101,3011.0,0.000586,2,2.0
83849,david-villa,David Villa,FC Barcelona,Spain,03.12.1981,175.0,69.0,Center Forward,1,0,...,3,ESP,0.369894,1785.0,0.000229,0.588297,1897.0,0.001002,2,2.0
68897,apono,Apoño,Real Zaragoza,Spain,13.02.1984,173.0,72.0,Defensive Midfielder,1,1,...,3,ESP,0.369894,1785.0,0.000229,0.588297,1897.0,0.001002,2,2.0
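
The ranking above is computed per row. If a player appears in several rows (for example one row per referee, which is an assumption about this dataset), a per-player average needs those rows combined first. A sketch on made-up data, grouping by `playerShort`:

```python
import pandas as pd

# Made-up rows; player 'a' appears twice.
rows = pd.DataFrame({
    'playerShort': ['a', 'a', 'b', 'c'],
    'games': [1, 9, 2, 4],
    'yellowCards': [1, 0, 1, 0],
    'redCards': [1, 0, 0, 0],
})

# Sum games and cards over all rows belonging to the same player.
per_player = rows.groupby('playerShort')[['games', 'yellowCards', 'redCards']].sum()
per_player['avg_cards_per_game'] = (
    (per_player['yellowCards'] + per_player['redCards']) / per_player['games']
)

# nlargest returns the rows with the highest values, already sorted.
top = per_player.nlargest(2, 'avg_cards_per_game')
print(top)
```

In this toy example player `a` has a row with 2 cards in 1 game, but only 2 cards in 10 games overall, so the per-player ranking puts `b` first.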


# 5. Average number of cards per country
Do the same as in section 4, but this time for each country. This means we need to group the data by country!

## 5.1 Group the data by country.
Grouping is our first step. Without it, calculating the average for each country would be tedious.

In [45]:
grouped_by_country = df.groupby('leagueCountry')
grouped_by_country

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002C60AF4F4F0>

We can also use this to check what countries we have.

In [46]:
grouped_by_country.groups.keys()


## 5.2 Calculate the average number of cards per game for each country.

In [47]:
(grouped_by_country['yellowCards'].sum() + grouped_by_country['redCards'].sum()) / grouped_by_country['games'].sum()

leagueCountry
England    0.115345
France     0.132275
Germany    0.128539
Spain      0.176037
dtype: float64
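
The calculation above divides summed cards by summed games, which weights every game equally. Averaging the per-row ratios instead gives a different number. A sketch on made-up data that shows both:

```python
import pandas as pd

# Made-up rows for two countries.
rows = pd.DataFrame({
    'leagueCountry': ['Spain', 'Spain', 'France'],
    'games': [1, 9, 5],
    'yellowCards': [1, 0, 1],
    'redCards': [0, 0, 0],
})

# Ratio of sums: total cards divided by total games per country.
totals = rows.groupby('leagueCountry')[['yellowCards', 'redCards', 'games']].sum()
cards_per_game = (totals['yellowCards'] + totals['redCards']) / totals['games']

# Mean of ratios: average the per-row cards-per-game values per country.
rows['ratio'] = (rows['yellowCards'] + rows['redCards']) / rows['games']
mean_of_ratios = rows.groupby('leagueCountry')['ratio'].mean()

print(cards_per_game)
print(mean_of_ratios)
```

For the made-up Spanish rows, the ratio of sums is 1 card in 10 games, while the mean of ratios is pulled up by the single-game row.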