# Reshaping data with Pandas

## Wide vs. long format

### Wide format
* Each feature is a separate column
* Each row contains many features of the same player
* No repeated player rows, but potentially a large number of missing values
* Convenient for simple statistics and imputation

| name | age | nationality | club |
|------|-----|-------------|------|
| Messi | 31 | Argentina | Barcelona |
| Ronaldo | **NaN** | Portugal | Juventus |
| Neymar | 26 | Brazil | PSG |
| Mbappe | 19 | France | PSG |

### Long format
* Each row represents one feature
* Multiple rows for each player
* A column (name) identifies which player each row belongs to
* Tidy data:
  * Better to summarize data
  * Key-value pairs
  * Preferred for analysis and graphing

| name | variable | value |
|------|----------|-------|
| Messi | age | 31 |
| Ronaldo | age | **NaN** |
| Neymar | age | 26 |
| Mbappe | age | 19 |
| Messi | club | Barcelona |
| Ronaldo | club | Juventus |
| Neymar | club | PSG |
| Mbappe | club | PSG |
| Messi | nationality | Argentina |
| Ronaldo | nationality | Portugal |
| Neymar | nationality | Brazil |
| Mbappe | nationality | France |
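The two tables above hold the same information, which a short round trip can make concrete. The sketch below rebuilds the wide table by hand (the DataFrame literal and the variable names are only for illustration), melts it into the long layout, and pivots it back.

```python
import numpy as np
import pandas as pd

# Wide table from the example above (Ronaldo's age is missing)
wide = pd.DataFrame({
    "name": ["Messi", "Ronaldo", "Neymar", "Mbappe"],
    "age": [31, np.nan, 26, 19],
    "nationality": ["Argentina", "Portugal", "Brazil", "France"],
    "club": ["Barcelona", "Juventus", "PSG", "PSG"],
})

# Wide -> long: one row per (player, feature) pair
long = wide.melt(id_vars="name")
print(long.shape)  # (12, 3)

# Long -> wide again; the feature columns come back in sorted order
wide_again = long.pivot(index="name", columns="variable", values="value")
print(list(wide_again.columns))  # ['age', 'club', 'nationality']
```

Because numbers and strings end up in the same value column, the long table stores them with a generic dtype; this is one cost of the key-value layout.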



## Reshaping with Pandas

In [9]:
import pandas as pd

# Read the data from file using read_csv
fifa_players = pd.read_csv("files/fifa_players.csv")

fifa_players

,name,age,height,weight,nationality,club
0,Lionel Messi,32,170,72,Argentina,FC Barcelona
1,Cristiano Ronaldo,34,187,83,Portugal,Juventus
2,Neymar da Silva,27,175,68,Brazil,Paris Saint-Germain
3,Jan Oblak,26,188,87,Slovenia,Atlético Madrid
4,Eden Hazard,28,175,74,Belgium,Real Madrid


In [10]:
# Set name as index
fifa_players = fifa_players.set_index("name")

# Select only the columns height and weight from the fifa_players
fifa_players = fifa_players[["height", "weight"]]

# Transpose the data
fifa_players = fifa_players.transpose()

fifa_players

name,Lionel Messi,Cristiano Ronaldo,Neymar da Silva,Jan Oblak,Eden Hazard
height,170,187,175,188,175
weight,72,83,68,87,74


### Pivot method

In [20]:
import pandas as pd

# Read the data from file using read_csv
fifa_movement = pd.read_csv("files/fifa_movement.csv")

fifa_movement

,name,movement,overall,attacking
0,L. Messi,shooting,92,70
1,Cristiano Ronaldo,shooting,93,89
2,L. Messi,passing,92,92
3,Cristiano Ronaldo,passing,82,83
4,L. Messi,dribbling,96,88
5,Cristiano Ronaldo,dribbling,89,84


In [21]:
# Pivot fifa_movement to get overall scores indexed by name and identified by movement
fifa_overall = fifa_movement.pivot(index='name', columns='movement', values='overall')

fifa_overall

movement,dribbling,passing,shooting
name,,,
Cristiano Ronaldo,89,82,93
L. Messi,96,92,92


In [23]:
# Use the pivot method to get overall scores indexed by movement and identified by name
fifa_names = fifa_movement.pivot(index='movement', columns='name', values='overall')

fifa_names

name,Cristiano Ronaldo,L. Messi
movement,,
dribbling,89,96
passing,82,92
shooting,93,92


In [24]:
# Pivot fifa_movement to get overall and attacking scores indexed by name and identified by movement
fifa_overall_attacking = fifa_movement.pivot(index='name', columns='movement', values=['overall', 'attacking'])

fifa_overall_attacking

,overall,overall,overall,attacking,attacking,attacking
movement,dribbling,passing,shooting,dribbling,passing,shooting
name,,,,,,
Cristiano Ronaldo,89,82,93,84,83,89
L. Messi,96,92,92,88,92,70
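Passing a list to values produces two-level column labels, and it may help to see how those are addressed. The sketch below rebuilds fifa_movement inline from the rows shown earlier (so it does not depend on the CSV file) and selects first a sub-table, then a single cell; the variable names are illustrative.

```python
import pandas as pd

# Rows copied from the fifa_movement output above
fifa_movement = pd.DataFrame({
    "name": ["L. Messi", "Cristiano Ronaldo"] * 3,
    "movement": ["shooting", "shooting", "passing", "passing",
                 "dribbling", "dribbling"],
    "overall": [92, 93, 92, 82, 96, 89],
    "attacking": [70, 89, 92, 83, 88, 84],
})

wide = fifa_movement.pivot(index="name", columns="movement",
                           values=["overall", "attacking"])

# The top column level holds one sub-table per value column
overall = wide["overall"]
print(list(overall.columns))  # ['dribbling', 'passing', 'shooting']

# A single cell is addressed with a (value column, movement) tuple
messi_dribbling = wide.loc["L. Messi", ("overall", "dribbling")]
print(messi_dribbling)  # 96
```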


In [25]:
# Drop the row with index label 4 (L. Messi, dribbling)
another_fifa = fifa_movement.drop(4, axis=0)

another_fifa

,name,movement,overall,attacking
0,L. Messi,shooting,92,70
1,Cristiano Ronaldo,shooting,93,89
2,L. Messi,passing,92,92
3,Cristiano Ronaldo,passing,82,83
5,Cristiano Ronaldo,dribbling,89,84
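The notes stop after dropping the row, so the follow-up below is an assumption about what the step is meant to show: when an index/column pair is absent from the long data, pivot still builds the full grid and fills the missing cell with NaN. The DataFrame is rebuilt inline from the rows shown above.

```python
import pandas as pd

# Rows copied from the fifa_movement output above
fifa_movement = pd.DataFrame({
    "name": ["L. Messi", "Cristiano Ronaldo"] * 3,
    "movement": ["shooting", "shooting", "passing", "passing",
                 "dribbling", "dribbling"],
    "overall": [92, 93, 92, 82, 96, 89],
    "attacking": [70, 89, 92, 83, 88, 84],
})

# Remove Messi's dribbling row, as in the cell above
another_fifa = fifa_movement.drop(4, axis=0)

# Pivoting still works; the missing (L. Messi, dribbling) cell becomes NaN
pivoted = another_fifa.pivot(index="name", columns="movement", values="overall")
print(pivoted)
print(pivoted.isna().sum().sum())  # 1
```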


### Pivot table method

#### Pivot method limitations

* General-purpose pivoting only, with no aggregation step
* Each index/column pair must be unique
* Cannot aggregate values

In [26]:
import pandas as pd

# Read the data from file using read_csv
fifa_players_long = pd.read_csv("files/fifa_players_long.csv")

fifa_players_long

,name,variable,metric_system,imperial_system
0,Cristiano Ronaldo,weight,83,183.0
1,J. Oblak,weight,87,191.0
2,Cristiano Ronaldo,height,187,6.13
3,J. Oblak,height,188,6.16
4,Cristiano Ronaldo,height,187,6.14


In [28]:
fifa_players_long.pivot(index="name", columns="variable")

# Raises ValueError: Index contains duplicate entries, cannot reshape
# (Cristiano Ronaldo has two height rows)

In [29]:
fifa_players_long.pivot_table(index="name", columns="variable", aggfunc="mean")

,imperial_system,imperial_system,metric_system,metric_system
variable,height,weight,height,weight
name,,,,
Cristiano Ronaldo,6.135,183.0,187,83
J. Oblak,6.16,191.0,188,87


In [30]:
# Add margins: an extra "All" row and column holding the aggregate (here, the mean) over the underlying values
fifa_players_long.pivot_table(index="name", columns="variable", aggfunc="mean", margins=True)

,imperial_system,imperial_system,imperial_system,metric_system,metric_system,metric_system
variable,height,weight,All,height,weight,All
name,,,,,,
Cristiano Ronaldo,6.135,183.0,65.09,187.0,83,152.333333
J. Oblak,6.16,191.0,98.58,188.0,87,137.5
All,6.143333,187.0,78.486,187.333333,85,146.4
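The All values mix heights and weights, so they are not meaningful here, but they do show how margins are computed. The sketch below rebuilds fifa_players_long inline from the rows shown above and checks that the row margin is the mean of the raw values, not the mean of the two cell means.

```python
import pandas as pd

# Rows copied from the fifa_players_long output above
fifa_players_long = pd.DataFrame({
    "name": ["Cristiano Ronaldo", "J. Oblak", "Cristiano Ronaldo",
             "J. Oblak", "Cristiano Ronaldo"],
    "variable": ["weight", "weight", "height", "height", "height"],
    "metric_system": [83, 87, 187, 188, 187],
    "imperial_system": [183.0, 191.0, 6.13, 6.16, 6.14],
})

table = fifa_players_long.pivot_table(index="name", columns="variable",
                                      aggfunc="mean", margins=True)
margin = table.loc["Cristiano Ronaldo", ("imperial_system", "All")]

# Mean of Ronaldo's three raw imperial values: (183 + 6.13 + 6.14) / 3
ronaldo = fifa_players_long[fifa_players_long["name"] == "Cristiano Ronaldo"]
manual = ronaldo["imperial_system"].mean()

print(round(margin, 2), round(manual, 2))  # 65.09 65.09
```

Averaging the two displayed cells instead, (6.135 + 183.0) / 2, would give about 94.57, which is not what the table reports.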


### Pivot or pivot table?

* Does the DataFrame have more than one value for each index/column pair?
* Do you need to have a multi-index in your resulting pivoted DataFrame?
* Do you need summary statistics of your large DataFrame?

If the answer to any of these questions is yes, use .pivot_table().
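The first question can be checked programmatically. The sketch below uses a hypothetical helper, has_duplicate_pairs (not part of pandas), built on DataFrame.duplicated, and picks the method accordingly; the data is rebuilt inline from fifa_players_long above.

```python
import pandas as pd

def has_duplicate_pairs(df, index, columns):
    """Return True if any (index, columns) combination appears more than once."""
    return bool(df.duplicated(subset=[index, columns]).any())

# Rows copied from the fifa_players_long output above
fifa_players_long = pd.DataFrame({
    "name": ["Cristiano Ronaldo", "J. Oblak", "Cristiano Ronaldo",
             "J. Oblak", "Cristiano Ronaldo"],
    "variable": ["weight", "weight", "height", "height", "height"],
    "metric_system": [83, 87, 187, 188, 187],
    "imperial_system": [183.0, 191.0, 6.13, 6.16, 6.14],
})

if has_duplicate_pairs(fifa_players_long, "name", "variable"):
    # Duplicates present: aggregate them (the mean is an arbitrary choice here)
    result = fifa_players_long.pivot_table(index="name", columns="variable",
                                           aggfunc="mean")
else:
    # Every pair is unique: a plain pivot is enough
    result = fifa_players_long.pivot(index="name", columns="variable")

print(has_duplicate_pairs(fifa_players_long, "name", "variable"))  # True
```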

## Reshaping with melt

### Wide to long transformation
* Perform analytics
* Plot different variables in the same graph

In [50]:
import pandas as pd

# Read the data from file using read_csv
books = pd.read_csv("files/books.csv")

books.head()

,bookID,title,authors,average_rating,isbn,isbn13,language_code,num_pages,ratings_count,text_reviews_count,publication_date,publisher
0,1,Harry Potter and the Half-Blood Prince (Harry ...,J.K. Rowling/Mary GrandPré,4.57,0439785960,9780439785969,eng,652,2095690,27591,9/16/2006,Scholastic Inc.
1,2,Harry Potter and the Order of the Phoenix (Har...,J.K. Rowling/Mary GrandPré,4.49,0439358078,9780439358071,eng,870,2153167,29221,9/1/2004,Scholastic Inc.
2,4,Harry Potter and the Chamber of Secrets (Harry...,J.K. Rowling,4.42,0439554896,9780439554893,eng,352,6333,244,11/1/2003,Scholastic
3,5,Harry Potter and the Prisoner of Azkaban (Harr...,J.K. Rowling/Mary GrandPré,4.56,043965548X,9780439655484,eng,435,2339585,36325,5/1/2004,Scholastic Inc.
4,8,Harry Potter Boxed Set Books 1-5 (Harry Potte...,J.K. Rowling/Mary GrandPré,4.78,0439682584,9780439682589,eng,2690,41428,164,9/13/2004,Scholastic


In [39]:
books.melt(id_vars='title')

,title,variable,value
0,Harry Potter and the Half-Blood Prince (Harry ...,bookID,1
1,Harry Potter and the Order of the Phoenix (Har...,bookID,2
2,Harry Potter and the Chamber of Secrets (Harry...,bookID,4
3,Harry Potter and the Prisoner of Azkaban (Harr...,bookID,5
4,Harry Potter Boxed Set Books 1-5 (Harry Potte...,bookID,8
...,...,...,...
122348,Expelled from Eden: A William T. Vollmann Reader,publisher,Da Capo Press
122349,You Bright and Risen Angels,publisher,Penguin Books
122350,The Ice-Shirt (Seven Dreams #1),publisher,Penguin Books
122351,Poor People,publisher,Ecco


#### Specifying values to melt and naming values and variables

In [51]:
books.melt(id_vars='title', value_vars=['language_code', 'num_pages'], var_name='feature', value_name='value')

,title,feature,value
0,Harry Potter and the Half-Blood Prince (Harry ...,language_code,eng
1,Harry Potter and the Order of the Phoenix (Har...,language_code,eng
2,Harry Potter and the Chamber of Secrets (Harry...,language_code,eng
3,Harry Potter and the Prisoner of Azkaban (Harr...,language_code,eng
4,Harry Potter Boxed Set Books 1-5 (Harry Potte...,language_code,eng
...,...,...,...
22241,Expelled from Eden: A William T. Vollmann Reader,num_pages,512
22242,You Bright and Risen Angels,num_pages,635
22243,The Ice-Shirt (Seven Dreams #1),num_pages,415
22244,Poor People,num_pages,434
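A small self-contained version of the same call may make the row arithmetic easier to follow: the result has one row per title for each column listed in value_vars. The two-row frame below is illustrative only; the titles and page counts are taken from the output above, while the language codes are assumed.

```python
import pandas as pd

# Illustrative subset; language_code values are an assumption
books_small = pd.DataFrame({
    "title": ["You Bright and Risen Angels", "Poor People"],
    "language_code": ["eng", "eng"],
    "num_pages": [635, 434],
})

long = books_small.melt(id_vars="title",
                        value_vars=["language_code", "num_pages"],
                        var_name="feature", value_name="value")
print(long.shape)  # (4, 3): 2 titles x 2 melted columns

# Filter one feature back out of the long table
pages = long[long["feature"] == "num_pages"]
print(list(pages["value"]))  # [635, 434]
```

Since text and numbers share the value column, numeric features usually need to be converted back (for example with astype) after filtering.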


### Reshaping with the wide_to_long function

In [53]:
import pandas as pd

book_stubs = pd.read_csv("files/book_stubs.csv")

book_stubs

,title,ratings2019,sold2019,ratings2020,sold2020
0,Mostly Harmless,4.2,456,4.3,436
1,The Hitchhiker's Guide,4.8,980,4.9,998
2,El restaurante del fin del mundo,4.5,678,4.6,638


In [54]:
pd.wide_to_long(book_stubs, stubnames=['ratings', 'sold'], i='title', j='year')

,,ratings,sold
title,year,,
Mostly Harmless,2019,4.2,456
The Hitchhiker's Guide,2019,4.8,980
El restaurante del fin del mundo,2019,4.5,678
Mostly Harmless,2020,4.3,436
The Hitchhiker's Guide,2020,4.9,998
El restaurante del fin del mundo,2020,4.6,638
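The result is indexed by both title and year. The sketch below rebuilds book_stubs inline from the rows shown above, looks up one row by its (title, year) key, and flattens the index with reset_index when ordinary columns are preferred.

```python
import pandas as pd

# Rows copied from the book_stubs output above
book_stubs = pd.DataFrame({
    "title": ["Mostly Harmless", "The Hitchhiker's Guide",
              "El restaurante del fin del mundo"],
    "ratings2019": [4.2, 4.8, 4.5],
    "sold2019": [456, 980, 678],
    "ratings2020": [4.3, 4.9, 4.6],
    "sold2020": [436, 998, 638],
})

# Column names are split into a stub ("ratings", "sold") and a numeric suffix
long = pd.wide_to_long(book_stubs, stubnames=["ratings", "sold"],
                       i="title", j="year")

# Look up one (title, year) combination
print(long.loc[("Mostly Harmless", 2020), "sold"])  # 436

# Turn the two index levels back into ordinary columns
flat = long.reset_index()
print(list(flat.columns))  # ['title', 'year', 'ratings', 'sold']
```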


#### Configuring the separator and suffix

In [55]:
import pandas as pd

books_brown = pd.read_csv("files/books_brown.csv")

books_brown

,title,author,language_code,language_name,publisher_code,publisher_name
0,The Da Vinci Code,Dan Brown,0,english,12,Random House
1,Angels & Demons,Dan Brown,0,english,34,Pocket Books
2,La fortaleza digital,Dan Brown,84,spanish,43,Umbriel


In [57]:
# Use a raw string for the regex so "\w" is not treated as a string escape
pd.wide_to_long(books_brown, stubnames=['language', 'publisher'], i=['author', 'title'], j='code', sep='_', suffix=r'\w+')

,,,language,publisher
author,title,code,,
Dan Brown,The Da Vinci Code,code,0,12
Dan Brown,The Da Vinci Code,name,english,Random House
Dan Brown,Angels & Demons,code,0,34
Dan Brown,Angels & Demons,name,english,Pocket Books
Dan Brown,La fortaleza digital,code,84,43
Dan Brown,La fortaleza digital,name,spanish,Umbriel
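For a version that runs without the CSV file, the sketch below rebuilds books_brown inline from the rows shown above. It names the new index level field instead of code (a choice made here for readability, since the level holds both "code" and "name") and looks up individual cells by their three-part key.

```python
import pandas as pd

# Rows copied from the books_brown output above
books_brown = pd.DataFrame({
    "title": ["The Da Vinci Code", "Angels & Demons", "La fortaleza digital"],
    "author": ["Dan Brown"] * 3,
    "language_code": [0, 0, 84],
    "language_name": ["english", "english", "spanish"],
    "publisher_code": [12, 34, 43],
    "publisher_name": ["Random House", "Pocket Books", "Umbriel"],
})

# sep="_" splits "language_code" into stub "language" and suffix "code";
# suffix=r"\w+" allows non-numeric suffixes (the default only matches digits)
long = pd.wide_to_long(books_brown, stubnames=["language", "publisher"],
                       i=["author", "title"], j="field",
                       sep="_", suffix=r"\w+")

key = ("Dan Brown", "La fortaleza digital", "name")
print(long.loc[key, "publisher"])  # Umbriel
```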
