# Cleaning Data

## About the Data
In this notebook, we will be working with FIFA player data for 2022 obtained from [Kaggle](https://www.kaggle.com/datasets/stefanoleone992/fifa-22-complete-player-dataset).

## Setup
We will be working with the `players_22.csv` file, so we need to handle our imports and read it in.

In [38]:
import pandas as pd

In [39]:
players = pd.read_csv(
    'players_22.csv', 
    usecols=['short_name', 'wage_eur', 'age', 'dob', 'height_cm', 'weight_kg', 'club_name', 'nationality_name', 'preferred_foot']
)
players.head()

Unnamed: 0,short_name,wage_eur,age,dob,height_cm,weight_kg,club_name,nationality_name,preferred_foot
0,L. Messi,320000.0,34,6/24/1987,170,72,Paris Saint-Germain,Argentina,Left
1,R. Lewandowski,270000.0,32,8/21/1988,185,81,FC Bayern München,Poland,Right
2,Cristiano Ronaldo,270000.0,36,2/5/1985,187,83,Manchester United,Portugal,Right
3,Neymar Jr,270000.0,29,2/5/1992,175,68,Paris Saint-Germain,Brazil,Right
4,K. De Bruyne,350000.0,30,6/28/1991,181,70,Manchester City,Belgium,Right


## Renaming Columns
We start out with the columns shown in the preview above. We want to rename the `club_name` and `nationality_name` columns. For this task, we use the `rename()` method and pass in a dictionary mapping the current column names to their new names. We pass `inplace=True` to modify our original dataframe instead of getting a new one back:

In [40]:
players.rename(
    columns={
        'club_name': 'club',
        'nationality_name': 'nationality'
    }, inplace=True
)

Those columns have been successfully renamed:

In [41]:
players.columns

Index(['short_name', 'wage_eur', 'age', 'dob', 'height_cm', 'weight_kg',
       'club', 'nationality', 'preferred_foot'],
      dtype='object')
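
If we would rather keep the original dataframe untouched, we can leave out `inplace=True` and capture the return value instead. The following is a minimal sketch on a toy dataframe; the `toy` and `renamed` names and the values are made up for illustration and are not part of the dataset:

```python
import pandas as pd

# Toy frame standing in for the real data (values are invented).
toy = pd.DataFrame({'club_name': ['A', 'B'], 'nationality_name': ['X', 'Y']})

# Without inplace=True, rename() returns a renamed copy and leaves `toy` as it was.
renamed = toy.rename(
    columns={'club_name': 'club', 'nationality_name': 'nationality'}
)

print(list(toy.columns))      # the original names are still in place
print(list(renamed.columns))  # the copy carries the new names
```

Working with the returned copy makes it easier to rerun a cell without side effects, at the cost of one more variable to keep track of.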

We can use the `assign()` method to create several new columns (or overwrite existing ones) at once. It returns a new dataframe and leaves the original unchanged:

In [42]:
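new_players = players.assign(
    # Convert euros to dollars using a fixed rate of 1.07 (hard-coded here)
    wage_dol = players.wage_eur * 1.07,
    # Convert centimeters to feet (1 ft = 30.48 cm)
    height_ft = players.height_cm / 30.48,
)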

In [43]:
new_players.head()

Unnamed: 0,short_name,wage_eur,age,dob,height_cm,weight_kg,club,nationality,preferred_foot,wage_dol,height_ft
0,L. Messi,320000.0,34,6/24/1987,170,72,Paris Saint-Germain,Argentina,Left,342400.0,5.577428
1,R. Lewandowski,270000.0,32,8/21/1988,185,81,FC Bayern München,Poland,Right,288900.0,6.069554
2,Cristiano Ronaldo,270000.0,36,2/5/1985,187,83,Manchester United,Portugal,Right,288900.0,6.135171
3,Neymar Jr,270000.0,29,2/5/1992,175,68,Paris Saint-Germain,Brazil,Right,288900.0,5.74147
4,K. De Bruyne,350000.0,30,6/28/1991,181,70,Manchester City,Belgium,Right,374500.0,5.93832

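When one new column depends on another column created in the same `assign()` call, we can pass functions instead of values. The sketch below uses a toy dataframe with invented values; the `height_m` and `bmi` columns are hypothetical and only serve to illustrate the pattern:

```python
import pandas as pd

# Toy frame with invented values.
toy = pd.DataFrame({'height_cm': [170, 185], 'weight_kg': [72, 81]})

converted = toy.assign(
    # Each lambda receives the dataframe as built so far in this call.
    height_m=lambda df: df.height_cm / 100,
    # This column can therefore use height_m, which was created just above.
    bmi=lambda df: df.weight_kg / df.height_m ** 2,
)

print(converted.columns.tolist())
```

The original `toy` dataframe does not gain the new columns; only `converted` has them.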

## Reordering and Sorting
Let's say we want to get the 10 highest-paid players. We can sort the rows by the `wage_eur` column and pass `ascending=False` to put the largest values on top:

In [57]:
players.sort_values(by='wage_eur', ascending=False).head(10)

Unnamed: 0,short_name,wage_eur,age,dob,height_cm,weight_kg,club,nationality,preferred_foot
4,K. De Bruyne,350000.0,30,6/28/1991,181,70,Manchester City,Belgium,Right
11,K. Benzema,350000.0,33,12/19/1987,185,81,Real Madrid CF,France,Right
0,L. Messi,320000.0,34,6/24/1987,170,72,Paris Saint-Germain,Argentina,Left
14,Casemiro,310000.0,29,2/23/1992,185,84,Real Madrid CF,Brazil,Right
24,T. Kroos,310000.0,31,1/4/1990,183,76,Real Madrid CF,Germany,Right
27,R. Sterling,290000.0,26,12/8/1994,170,69,Manchester City,England,Right
2,Cristiano Ronaldo,270000.0,36,2/5/1985,187,83,Manchester United,Portugal,Right
3,Neymar Jr,270000.0,29,2/5/1992,175,68,Paris Saint-Germain,Brazil,Right
1,R. Lewandowski,270000.0,32,8/21/1988,185,81,FC Bayern München,Poland,Right
17,M. Salah,270000.0,29,6/15/1992,175,71,Liverpool,Egypt,Left

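Several players above share the same wage, and sorting on a single column does not say how such ties should be ordered. One option is to sort on a second column as well. This is a minimal sketch on a toy dataframe with placeholder names and invented values, assuming we want the younger player first among equal wages:

```python
import pandas as pd

# Toy frame with invented values; P1..P4 are placeholder names.
toy = pd.DataFrame({
    'short_name': ['P1', 'P2', 'P3', 'P4'],
    'wage_eur': [270000.0, 350000.0, 270000.0, 350000.0],
    'age': [36, 30, 29, 33],
})

# Highest wage first; among equal wages, youngest first.
ranked = toy.sort_values(by=['wage_eur', 'age'], ascending=[False, True])

print(ranked.short_name.tolist())  # ['P2', 'P4', 'P3', 'P1']
```

Passing a list to `ascending` lets each sort column have its own direction.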

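When we only need the n largest values, rather than a full sort of the data, we can use `nlargest()`. Note that players tied on `wage_eur` may appear in a different order, or be cut off differently, than in the `sort_values()` result: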

In [47]:
players.nlargest(n=10, columns='wage_eur')

Unnamed: 0,short_name,wage_eur,age,dob,height_cm,weight_kg,club,nationality,preferred_foot
4,K. De Bruyne,350000.0,30,6/28/1991,181,70,Manchester City,Belgium,Right
11,K. Benzema,350000.0,33,12/19/1987,185,81,Real Madrid CF,France,Right
0,L. Messi,320000.0,34,6/24/1987,170,72,Paris Saint-Germain,Argentina,Left
14,Casemiro,310000.0,29,2/23/1992,185,84,Real Madrid CF,Brazil,Right
24,T. Kroos,310000.0,31,1/4/1990,183,76,Real Madrid CF,Germany,Right
27,R. Sterling,290000.0,26,12/8/1994,170,69,Manchester City,England,Right
1,R. Lewandowski,270000.0,32,8/21/1988,185,81,FC Bayern München,Poland,Right
2,Cristiano Ronaldo,270000.0,36,2/5/1985,187,83,Manchester United,Portugal,Right
3,Neymar Jr,270000.0,29,2/5/1992,175,68,Paris Saint-Germain,Brazil,Right
16,S. Mané,270000.0,29,4/10/1992,175,69,Liverpool,Senegal,Right

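When ties fall on the cutoff, `nlargest()` keeps the first occurrences by default. If we would rather see every row that ties with the last place, the `keep` parameter can be set to `'all'`. A small sketch on a toy dataframe with invented values:

```python
import pandas as pd

# Toy frame with invented values; P2 and P3 tie for second place.
toy = pd.DataFrame({
    'short_name': ['P1', 'P2', 'P3', 'P4'],
    'wage_eur': [350000.0, 270000.0, 270000.0, 100000.0],
})

# Default keep='first': exactly n rows, ties resolved by original order.
top_first = toy.nlargest(n=2, columns='wage_eur')

# keep='all': rows tied with the last selected value are all included.
top_all = toy.nlargest(n=2, columns='wage_eur', keep='all')

print(len(top_first), len(top_all))  # 2 3
```

With `keep='all'` the result can contain more than `n` rows, so it is worth checking its length before relying on it.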

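We use `nsmallest()` for the n smallest values. When we pass a list of columns, ties in the first column are broken using the next one, so here the lowest-paid players are further ordered by age: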

In [49]:
players.nsmallest(n=5, columns=['wage_eur', 'age'])

Unnamed: 0,short_name,wage_eur,age,dob,height_cm,weight_kg,club,nationality,preferred_foot
13331,V. Barco,500.0,16,7/23/2004,172,68,Boca Juniors,Argentina,Left
14281,A. Kalogeropoulos,500.0,16,7/26/2004,187,75,Olympiacos CFP,Greece,Right
15929,Yayo,500.0,16,7/30/2004,175,71,Real Oviedo,Spain,Right
16303,R. van den Berg,500.0,16,7/7/2004,190,80,PEC Zwolle,Netherlands,Right
16982,T. Small,500.0,16,8/1/2004,175,68,Southampton,England,Left

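To see the tie-breaking at work on something small, here is a sketch on a toy dataframe with invented values, where three rows share the lowest wage and the second column decides which two are kept:

```python
import pandas as pd

# Toy frame with invented values; three players share the lowest wage.
toy = pd.DataFrame({
    'short_name': ['P1', 'P2', 'P3', 'P4'],
    'wage_eur': [500.0, 500.0, 500.0, 1000.0],
    'age': [18, 16, 17, 20],
})

# wage_eur alone cannot pick two rows out of three, so age breaks the tie.
lowest = toy.nsmallest(n=2, columns=['wage_eur', 'age'])

print(lowest.short_name.tolist())
```

The two youngest of the tied players (`P2` and `P3`) are selected.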

The `sample()` method returns rows (or columns, with `axis=1`) at random. We can provide a seed (`random_state`) to make the result reproducible. The sampled rows keep their original index labels, so the index comes back out of order:

In [54]:
players.sample(5, random_state=3)

Unnamed: 0,short_name,wage_eur,age,dob,height_cm,weight_kg,club,nationality,preferred_foot
11790,S. Nago,2000.0,25,4/17/1996,168,64,Shonan Bellmare,Japan,Right
1450,A. Hložek,500.0,18,7/25/2002,188,86,AC Sparta Praha,Czech Republic,Right
6083,A. Santamaría,7000.0,29,1/10/1992,184,79,Club Atlas,Peru,Right
1135,N. Madueke,11000.0,19,3/10/2002,182,70,PSV,England,Left
7219,I. Erquiaga,5000.0,23,3/26/1998,176,72,Club Atlético Huracán,Argentina,Left

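The following sketch, on a toy dataframe with invented values, checks that the same seed gives the same sample twice and shows column sampling with `axis=1`. The column names `a` through `d` are placeholders:

```python
import pandas as pd

# Toy frame with invented values and placeholder column names.
toy = pd.DataFrame({
    'a': range(5), 'b': range(5, 10), 'c': range(10, 15), 'd': range(15, 20),
})

# The same random_state yields the same rows on every call.
first = toy.sample(n=3, random_state=3)
second = toy.sample(n=3, random_state=3)
print(first.equals(second))  # True

# With axis=1 we sample columns instead of rows.
some_columns = toy.sample(n=2, axis=1, random_state=3)
print(some_columns.columns.tolist())
```

Which rows or columns a given seed selects depends on the data shape, so the sketch does not assume any particular selection.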

We can use `sort_index()` to put the rows back in index order:

In [55]:
players.sample(5, random_state=3).sort_index()

Unnamed: 0,short_name,wage_eur,age,dob,height_cm,weight_kg,club,nationality,preferred_foot
1135,N. Madueke,11000.0,19,3/10/2002,182,70,PSV,England,Left
1450,A. Hložek,500.0,18,7/25/2002,188,86,AC Sparta Praha,Czech Republic,Right
6083,A. Santamaría,7000.0,29,1/10/1992,184,79,Club Atlas,Peru,Right
7219,I. Erquiaga,5000.0,23,3/26/1998,176,72,Club Atlético Huracán,Argentina,Left
11790,S. Nago,2000.0,25,4/17/1996,168,64,Shonan Bellmare,Japan,Right

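`sort_index()` also accepts `ascending=False` when we want the highest index labels first. A minimal sketch on a toy dataframe whose rows have been put out of order by hand (all values invented):

```python
import pandas as pd

# Toy frame with invented values, reordered by hand to jumble the index.
toy = pd.DataFrame({'value': ['a', 'b', 'c', 'd']})
jumbled = toy.loc[[2, 0, 3, 1]]

# Ascending index order (the default) restores the original row order.
restored = jumbled.sort_index()

# Descending index order puts the last row first.
reversed_rows = jumbled.sort_index(ascending=False)

print(restored.index.tolist())       # [0, 1, 2, 3]
print(reversed_rows.index.tolist())  # [3, 2, 1, 0]
```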

With `axis=1`, the `sort_index()` method sorts the columns alphabetically by name instead of sorting the rows:

In [56]:
players.sort_index(axis=1).head()

Unnamed: 0,age,club,dob,height_cm,nationality,preferred_foot,short_name,wage_eur,weight_kg
0,34,Paris Saint-Germain,6/24/1987,170,Argentina,Left,L. Messi,320000.0,72
1,32,FC Bayern München,8/21/1988,185,Poland,Right,R. Lewandowski,270000.0,81
2,36,Manchester United,2/5/1985,187,Portugal,Right,Cristiano Ronaldo,270000.0,83
3,29,Paris Saint-Germain,2/5/1992,175,Brazil,Right,Neymar Jr,270000.0,68
4,30,Manchester City,6/28/1991,181,Belgium,Right,K. De Bruyne,350000.0,70

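Alphabetical order is not always what we want. To put columns in a specific order, one common approach is to select them with a list in the desired order. A small sketch on a toy dataframe with invented values:

```python
import pandas as pd

# Toy frame with invented values.
toy = pd.DataFrame({'short_name': ['P1'], 'wage_eur': [500.0], 'age': [16]})

# Alphabetical column order via sort_index(axis=1).
alphabetical = toy.sort_index(axis=1)

# Custom column order: select the columns with a list in the order we want.
custom = toy[['short_name', 'age', 'wage_eur']]

print(alphabetical.columns.tolist())  # ['age', 'short_name', 'wage_eur']
print(custom.columns.tolist())        # ['short_name', 'age', 'wage_eur']
```

Any column left out of the list is dropped from the result, so the list should name every column we want to keep.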