# Pandas

Pandas is a powerful library for data manipulation and analysis in Python. It is widely used in a range of fields, including data science, finance, and statistics.

## 004. Iteration

### 004.000 Assets

Shared assets and setup for the exercises below, to avoid too much typing.

| Name        | Age|
|-------------|----|
| Mbappé      | 23 |
| De Bruyne   | 31 |
| Lewandowski | 33 |
| Benzema     | 34 |
| Messi       | 35 |

In [2]:
import sys
from pathlib import Path

current_dir = Path().resolve()
while current_dir != current_dir.parent and current_dir.name != "katas":
    current_dir = current_dir.parent
if current_dir != current_dir.parent:
    sys.path.append(current_dir.as_posix())

In [3]:
import pandas as pd
from lib.utils import fresh_df
from IPython.core.interactiveshell import InteractiveShell

pd.set_option('display.max_rows', None)
InteractiveShell.ast_node_interactivity = "all"

### 004.001 Using Itertuples

1. Use itertuples to print each row


In [4]:
df = fresh_df("002.tsv", id="Name")
# solution

for row in df.itertuples():
    print(row)

Pandas(Index='Mbappé', DOB='1998-12-20')
Pandas(Index='De Bruyne', DOB='1991-06-28')
Pandas(Index='Lewandowski', DOB='1988-08-21')
Pandas(Index='Benzema', DOB='1987-12-19')
Pandas(Index='Messi', DOB='1987-06-24')
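
The rows printed above are namedtuples, so each field can be read by attribute. The following is a minimal sketch, assuming the same five players built inline rather than loaded through `fresh_df`; it also shows the `index` and `name` parameters of `itertuples`.

```python
import pandas as pd

# Same five players as in the exercise output, built inline so the sketch runs on its own.
df = pd.DataFrame(
    {
        "Name": ["Mbappé", "De Bruyne", "Lewandowski", "Benzema", "Messi"],
        "DOB": ["1998-12-20", "1991-06-28", "1988-08-21", "1987-12-19", "1987-06-24"],
    }
).set_index("Name")

# Default behaviour: namedtuples called "Pandas", with the index as the first field.
first = next(df.itertuples())
print(first.Index, first.DOB)

# index=False drops the index field; name=None yields plain tuples instead of namedtuples.
plain = list(df.itertuples(index=False, name=None))
print(plain[0])
```

Plain tuples are slightly cheaper to build, at the cost of losing access by field name.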


### 004.002 Using iterrows

1. Use iterrows to print the index of each row, and then the row as a Series

In [5]:
df = fresh_df("002.tsv")
# solution

for index, row in df.iterrows():
    index
    row

0

Name        Mbappé
DOB     1998-12-20
Name: 0, dtype: object

1

Name     De Bruyne
DOB     1991-06-28
Name: 1, dtype: object

2

Name    Lewandowski
DOB      1988-08-21
Name: 2, dtype: object

3

Name       Benzema
DOB     1987-12-19
Name: 3, dtype: object

4

Name         Messi
DOB     1987-06-24
Name: 4, dtype: object
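
Because `iterrows` packs each row into a single Series, values from columns with different dtypes are converted to one common dtype. The sketch below illustrates this with a small frame whose column names (`goals`, `rating`) are hypothetical and not part of the exercise data.

```python
import pandas as pd

# Hypothetical columns: one integer column and one float column.
mixed = pd.DataFrame({"goals": [3, 5], "rating": [7.5, 8.1]})

# iterrows: the row is a Series, so the integer value is upcast to float.
_, row = next(mixed.iterrows())
print(row.dtype)

# itertuples: each field keeps the value type of its own column.
tup = next(mixed.itertuples(index=False))
print(tup.goals, tup.rating)
```

When per-column types matter, `itertuples` is usually the safer choice of the two.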

### 004.003 Using Items

1. Use items to print each column as a Series


In [6]:
df = fresh_df("002.tsv")
# solution

for index, col in df.items():
    index
    col

'Name'

0         Mbappé
1      De Bruyne
2    Lewandowski
3        Benzema
4          Messi
Name: Name, dtype: object

'DOB'

0    1998-12-20
1    1991-06-28
2    1988-08-21
3    1987-12-19
4    1987-06-24
Name: DOB, dtype: object
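
Since `items` yields `(label, Series)` pairs, it fits naturally into a comprehension that computes one value per column. This is a minimal sketch, assuming the same player data built inline; the "longest string" summary is only an illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Mbappé", "De Bruyne", "Lewandowski", "Benzema", "Messi"],
        "DOB": ["1998-12-20", "1991-06-28", "1988-08-21", "1987-12-19", "1987-06-24"],
    }
)

# One summary value per column: the length of the longest string it contains.
longest = {label: int(col.str.len().max()) for label, col in df.items()}
print(longest)
```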

### 004.004 Using Query

1. Add an 'Age' column with each player's age in years
1. Use query to find all players older than 30
1. Make another query for all players older than 30 whose name is longer than 8 characters


In [7]:
from datetime import datetime
df = fresh_df("002.tsv")
# solution

1
# Approximate age in years: days since birth // 365 (ignores leap days)
df["Age"] = df.apply(lambda row: (datetime.now() - datetime.strptime(row['DOB'], '%Y-%m-%d')).days // 365, axis=1)

2
df.query("Age > 30")

3
df.query("Age > 30 and Name.str.len() > 8")

1

2

|   | Name        | DOB        | Age |
|---|-------------|------------|-----|
| 1 | De Bruyne   | 1991-06-28 | 31  |
| 2 | Lewandowski | 1988-08-21 | 34  |
| 3 | Benzema     | 1987-12-19 | 35  |
| 4 | Messi       | 1987-06-24 | 35  |


3

|   | Name        | DOB        | Age |
|---|-------------|------------|-----|
| 1 | De Bruyne   | 1991-06-28 | 31  |
| 2 | Lewandowski | 1988-08-21 | 34  |
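
The age can also be computed without a row-wise `apply`. The sketch below is one possible vectorized variant, assuming a fixed reference date of 2022-12-31 (an assumption made here so the result is reproducible; the solution above uses the current date). It also shows how `query` can reference a local Python variable with `@`.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Mbappé", "De Bruyne", "Lewandowski", "Benzema", "Messi"],
        "DOB": ["1998-12-20", "1991-06-28", "1988-08-21", "1987-12-19", "1987-06-24"],
    }
)

# Assumption: a fixed reference date, so the ages do not change from day to day.
reference = pd.Timestamp("2022-12-31")
dob = pd.to_datetime(df["DOB"], format="%Y-%m-%d")

# Subtract one year when the birthday has not yet occurred in the reference year.
birthday_pending = (dob.dt.month > reference.month) | (
    (dob.dt.month == reference.month) & (dob.dt.day > reference.day)
)
df["Age"] = reference.year - dob.dt.year - birthday_pending.astype(int)

# "@min_age" refers to the local variable defined here.
min_age = 30
older = df.query("Age > @min_age")
print(older)
```

Unlike `days // 365`, comparing month and day gives exact calendar ages.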


### 004.003 Aggregation

1. Load the minimal_player_list file
1. Find the mean, median and max for Age and Wage in the complete dataset

In [11]:
1
datafile = "minimal_player_list.csv"
df = fresh_df(datafile, id="ID", sep=",")
df.head()
# solution

2
df[["Age", "Wage"]].agg(['mean', 'median', 'max'])


1

| ID     | Name               | Age | Nationality | Overall | Potential | Club                | Value     | Wage     | Position |
|--------|--------------------|-----|-------------|---------|-----------|---------------------|-----------|----------|----------|
| 231747 | Kylian Mbappé      | 23  | France      | 91      | 95        | Paris Saint Germain | 190500000 | 23000000 | ST       |
| 192985 | Kevin De Bruyne    | 31  | Belgium     | 91      | 91        | Manchester City     | 107500000 | 35000000 | CM       |
| 188545 | Robert Lewandowski | 33  | Poland      | 91      | 91        | FC Barcelona        | 84000000  | 42000000 | ST       |
| 165153 | Karim Benzema      | 34  | France      | 91      | 91        | Real Madrid         | 64000000  | 45000000 | CF       |
| 158023 | Lionel Messi       | 35  | Argentina   | 91      | 91        | Paris Saint Germain | 54000000  | 19500000 | RW       |


2

|        | Age       | Wage       |
|--------|-----------|------------|
| mean   | 26.341552 | 1308528.0  |
| median | 26.0      | 500000.0   |
| max    | 44.0      | 45000000.0 |
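
`agg` also accepts a dict, which allows different statistics per column. The following is a minimal sketch using only the five rows shown by `df.head()` above (not the complete dataset, so the numbers differ from the output above).

```python
import pandas as pd

# The five players shown by df.head(), built inline.
players = pd.DataFrame(
    {
        "Name": ["Kylian Mbappé", "Kevin De Bruyne", "Robert Lewandowski",
                 "Karim Benzema", "Lionel Messi"],
        "Age": [23, 31, 33, 34, 35],
        "Wage": [23_000_000, 35_000_000, 42_000_000, 45_000_000, 19_500_000],
    }
)

# Same list of statistics applied to every selected column.
stats = players[["Age", "Wage"]].agg(["mean", "median", "max"])
print(stats)

# A dict picks statistics per column; combinations that were not requested are NaN.
per_column = players.agg({"Age": ["mean", "max"], "Wage": ["median"]})
print(per_column)
```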


### 004.005 Using Grouping

1. Generate a DataFrame with the rows grouped by nationality and, for each nationality, the maximum wage. Use `reset_index` to generate the DataFrame. Show its head
1. Merge it back into the original DataFrame, so that each aggregate is paired with its original row. Show its head
1. Extract the "Name", "Nationality", and "Wage" fields. Show its head
1. Sort by "Wage" descending. Show its head
1. Format "Wage" with comma separators and no decimals. Show its head


In [12]:
datafile = "minimal_player_list.csv"
df = fresh_df(datafile, id="ID", sep=",")
group_by_field = "Nationality"
# solution
df.head()

1
max_wages = df.groupby([group_by_field])["Wage"].agg('max').reset_index()
max_wages.head()

2
agg = max_wages.merge(df, on=[group_by_field, "Wage"])
agg.head()

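3
# Note: this selects from `df`, not from the merged `agg`, so the outputs of the
# remaining steps list all players. Use `agg[[...]]` to keep only each nationality's top earner.
agg = df[["Name", group_by_field, "Wage"]]
agg.head()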

4
agg = agg.sort_values(by="Wage", ascending=False)
agg.head()

5
agg.style.format("{:,.0f}", subset=["Wage"]).hide(axis="index")


| ID     | Name               | Age | Nationality | Overall | Potential | Club                | Value     | Wage     | Position |
|--------|--------------------|-----|-------------|---------|-----------|---------------------|-----------|----------|----------|
| 231747 | Kylian Mbappé      | 23  | France      | 91      | 95        | Paris Saint Germain | 190500000 | 23000000 | ST       |
| 192985 | Kevin De Bruyne    | 31  | Belgium     | 91      | 91        | Manchester City     | 107500000 | 35000000 | CM       |
| 188545 | Robert Lewandowski | 33  | Poland      | 91      | 91        | FC Barcelona        | 84000000  | 42000000 | ST       |
| 165153 | Karim Benzema      | 34  | France      | 91      | 91        | Real Madrid         | 64000000  | 45000000 | CF       |
| 158023 | Lionel Messi       | 35  | Argentina   | 91      | 91        | Paris Saint Germain | 54000000  | 19500000 | RW       |


1

|   | Nationality | Wage     |
|---|-------------|----------|
| 0 | Afghanistan | 100000   |
| 1 | Albania     | 6400000  |
| 2 | Algeria     | 20000000 |
| 3 | Andorra     | 300000   |
| 4 | Angola      | 3400000  |


2

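|   | Nationality | Wage     | Name                              | Age | Overall | Potential | Club            | Value    | Position |
|---|-------------|----------|-----------------------------------|-----|---------|-----------|-----------------|----------|----------|
| 0 | Afghanistan | 100000   | Rahmat Akbari                     | 22  | 63      | 68        | Brisbane Roar   | 72500000 | CM       |
| 1 | Albania     | 6400000  | Armando Broja                     | 20  | 75      | 83        | Chelsea         | 12500000 | ST       |
| 2 | Algeria     | 20000000 | Riyad Mahrez                      | 31  | 85      | 85        | Manchester City | 44500000 | RW       |
| 3 | Andorra     | 300000   | Iker Álvarez de Eulate            | 20  | 63      | 74        | Villarreal      | 90000000 | GK       |
| 4 | Angola      | 3400000  | Hélder Wander Sousa Azevedo Costa | 28  | 74      | 74        | Al Ittihad      | 4200000  | RM       |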


3

| ID     | Name               | Nationality | Wage     |
|--------|--------------------|-------------|----------|
| 231747 | Kylian Mbappé      | France      | 23000000 |
| 192985 | Kevin De Bruyne    | Belgium     | 35000000 |
| 188545 | Robert Lewandowski | Poland      | 42000000 |
| 165153 | Karim Benzema      | France      | 45000000 |
| 158023 | Lionel Messi       | Argentina   | 19500000 |


4

| ID     | Name               | Nationality | Wage     |
|--------|--------------------|-------------|----------|
| 165153 | Karim Benzema      | France      | 45000000 |
| 188545 | Robert Lewandowski | Poland      | 42000000 |
| 192985 | Kevin De Bruyne    | Belgium     | 35000000 |
| 182521 | Toni Kroos         | Germany     | 31000000 |
| 209331 | Mohamed Salah      | Egypt       | 27000000 |


5

| Name                              | Nationality | Wage     |
|-----------------------------------|-------------|----------|
| Karim Benzema                     | France      | 45000000 |
| Robert Lewandowski                | Poland      | 42000000 |
| Kevin De Bruyne                   | Belgium     | 35000000 |
| Toni Kroos                        | Germany     | 31000000 |
| Mohamed Salah                     | Egypt       | 27000000 |
| Bernardo Mota Carvalho e Silva    | Portugal    | 26000000 |
| Antonio Rüdiger                   | Germany     | 25000000 |
| Thibaut Courtois                  | Belgium     | 25000000 |
| João Pedro Cavaco Cancelo         | Portugal    | 25000000 |
| Carlos Henrique Venancio Casimiro | Brazil      | 24000000 |
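
As an alternative to aggregating and merging back, `groupby(...).transform("max")` returns the group maximum aligned with every original row, so the top earner per nationality can be selected with a boolean mask. This is a sketch of that swapped-in technique, not the approach used in the solution above, and it uses only the five players shown by `df.head()`. The wage formatting is done with plain string formatting instead of the Styler.

```python
import pandas as pd

players = pd.DataFrame(
    {
        "Name": ["Kylian Mbappé", "Kevin De Bruyne", "Robert Lewandowski",
                 "Karim Benzema", "Lionel Messi"],
        "Nationality": ["France", "Belgium", "Poland", "France", "Argentina"],
        "Wage": [23_000_000, 35_000_000, 42_000_000, 45_000_000, 19_500_000],
    }
)

# transform("max") broadcasts each group's maximum wage back onto its rows.
group_max = players.groupby("Nationality")["Wage"].transform("max")

# Keep only the rows whose wage equals their nationality's maximum, highest first.
top = players[players["Wage"] == group_max].sort_values("Wage", ascending=False)

# Thousands separators and no decimals, as plain strings.
formatted = top.assign(Wage=top["Wage"].map("{:,.0f}".format))
print(formatted.to_string(index=False))
```

Mbappé is dropped because Benzema has the higher wage among the French players in this sample.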


### 004.006 Using Grouping

1. Repeat 004.005, but applied to "Club" instead of "Nationality"
1. Also, it should only apply to players with "Wage > 2000000"

In [10]:
datafile = "minimal_player_list.csv"
df = fresh_df(datafile, id="ID", sep=",")
group_by_field = "Club"
# solution

max_wages = df.query("Wage > 2000000").groupby([group_by_field])["Wage"].max().reset_index()
agg = max_wages.merge(df, on=[group_by_field, "Wage"])
agg = agg[["Name", group_by_field, "Wage"]]
agg = agg.sort_values(by="Wage", ascending=False)
agg.style.format("{:,.0f}", subset=["Wage"]).hide(axis="index")


| Name                              | Club                | Wage     |
|-----------------------------------|---------------------|----------|
| Karim Benzema                     | Real Madrid         | 45000000 |
| Robert Lewandowski                | FC Barcelona        | 42000000 |
| Kevin De Bruyne                   | Manchester City     | 35000000 |
| Mohamed Salah                     | Liverpool           | 27000000 |
| Carlos Henrique Venancio Casimiro | Manchester United   | 24000000 |
| Heung Min Son                     | Tottenham Hotspur   | 24000000 |
| Harry Kane                        | Tottenham Hotspur   | 24000000 |
| Kylian Mbappé                     | Paris Saint Germain | 23000000 |
| N'Golo Kanté                      | Chelsea             | 21000000 |
| Romelu Lukaku                     | Inter               | 20000000 |
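
Tottenham Hotspur appears twice in the output above because merging on both the group key and the wage keeps every player tied for the maximum. The sketch below reproduces that effect on four rows taken from the output and contrasts it with `idxmax`, which keeps one row per group. The threshold of 22,000,000 is an illustrative assumption chosen so that the filter removes a row from this small sample; the exercise itself uses 2,000,000.

```python
import pandas as pd

players = pd.DataFrame(
    {
        "Name": ["Mohamed Salah", "Heung Min Son", "Harry Kane", "N'Golo Kanté"],
        "Club": ["Liverpool", "Tottenham Hotspur", "Tottenham Hotspur", "Chelsea"],
        "Wage": [27_000_000, 24_000_000, 24_000_000, 21_000_000],
    }
)

# Illustrative threshold (assumption), referenced inside query with "@".
min_wage = 22_000_000
eligible = players.query("Wage > @min_wage")

# Aggregate, then merge back on (Club, Wage): tied players are all kept.
max_wages = eligible.groupby("Club")["Wage"].max().reset_index()
with_ties = max_wages.merge(players, on=["Club", "Wage"])
print(with_ties[["Name", "Club", "Wage"]])

# idxmax returns the first row label per group, so each club appears once.
one_per_club = eligible.loc[eligible.groupby("Club")["Wage"].idxmax()]
print(one_per_club[["Name", "Club", "Wage"]])
```

Which behaviour is preferable depends on whether ties should be reported or collapsed.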
