# Reshaping data with Pandas

## Wide vs. long format

### Wide format
* Each feature is a separate column
* Each row contains many features of the same player
* No repetition, but potentially a large number of missing values
* Convenient for simple statistics and imputation

| name | age | nationality | club |
|------|-----|-------------|------|
| Messi | 31 | Argentina | Barcelona |
| Ronaldo | **NaN** | Portugal | Juventus |
| Neymar | 26 | Brazil | PSG |
| Mbappe | 19 | France | PSG |

### Long format
* Each row represents one feature
* Multiple rows for each player
* A column (`name`) identifies the player
* Tidy data:
  * Better for summarizing data
  * Key-value pairs
  * Preferred for analysis and graphing

| name | variable | value |
|------|----------|-------|
| Messi | age | 31 |
| Ronaldo | age | **NaN** |
| Neymar | age | 26 |
| Mbappe | age | 19 |
| Messi | club | Barcelona |
| Ronaldo | club | Juventus |
| Neymar | club | PSG |
| Mbappe | club | PSG |
| Messi | nationality | Argentina |
| Ronaldo | nationality | Portugal |
| Neymar | nationality | Brazil |
| Mbappe | nationality | France |
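
The two layouts above can be converted into each other. The following is a minimal sketch that rebuilds the small player table by hand (the variable names `wide`, `long_df` and `back_to_wide` are chosen here for illustration) and moves it from wide to long with `melt` and back with `pivot`.

```python
import pandas as pd

# Wide table from the example above (Ronaldo's age is missing)
wide = pd.DataFrame({
    "name": ["Messi", "Ronaldo", "Neymar", "Mbappe"],
    "age": [31, None, 26, 19],
    "nationality": ["Argentina", "Portugal", "Brazil", "France"],
    "club": ["Barcelona", "Juventus", "PSG", "PSG"],
})

# Wide -> long: one row per (player, feature) pair
long_df = wide.melt(id_vars="name", var_name="variable", value_name="value")

# Long -> wide: each distinct value of "variable" becomes a column again
back_to_wide = long_df.pivot(index="name", columns="variable", values="value")

print(long_df.shape)       # 4 players x 3 features, in 3 columns
print(back_to_wide.shape)  # 4 players, 3 feature columns
```

The missing age survives the round trip as a single missing entry in the `value` column of the long table.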



## Reshaping with Pandas

In [2]:
import pandas as pd

# Read the data from file using read_csv
fifa_players = pd.read_csv("files/fifa_players.csv")

fifa_players

Unnamed: 0,name,age,height,weight,nationality,club
0,Lionel Messi,32,170,72,Argentina,FC Barcelona
1,Cristiano Ronaldo,34,187,83,Portugal,Juventus
2,Neymar da Silva,27,175,68,Brazil,Paris Saint-Germain
3,Jan Oblak,26,188,87,Slovenia,Atlético Madrid
4,Eden Hazard,28,175,74,Belgium,Real Madrid


In [3]:
# Set name as index
fifa_players = fifa_players.set_index("name")

# Select only the columns height and weight from the fifa_players
fifa_players = fifa_players[["height", "weight"]]

# Transpose the data
fifa_players = fifa_players.transpose()

fifa_players

name,Lionel Messi,Cristiano Ronaldo,Neymar da Silva,Jan Oblak,Eden Hazard
height,170,187,175,188,175
weight,72,83,68,87,74


### Pivot method

In [4]:
import pandas as pd

# Read the data from file using read_csv
fifa_movement = pd.read_csv("files/fifa_movement.csv")

fifa_movement

Unnamed: 0,name,movement,overall,attacking
0,L. Messi,shooting,92,70
1,Cristiano Ronaldo,shooting,93,89
2,L. Messi,passing,92,92
3,Cristiano Ronaldo,passing,82,83
4,L. Messi,dribbling,96,88
5,Cristiano Ronaldo,dribbling,89,84


In [5]:
# Pivot fifa_movement to get overall scores indexed by name and identified by movement
fifa_overall = fifa_movement.pivot(index='name', columns='movement', values='overall')

fifa_overall

movement,dribbling,passing,shooting
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Cristiano Ronaldo,89,82,93
L. Messi,96,92,92


In [6]:
# Use the pivot method to get overall scores indexed by movement and identified by name
fifa_names = fifa_movement.pivot(index='movement', columns='name', values='overall')

fifa_names

name,Cristiano Ronaldo,L. Messi
movement,Unnamed: 1_level_1,Unnamed: 2_level_1
dribbling,89,96
passing,82,92
shooting,93,92


In [7]:
# Pivot fifa_movement to get overall and attacking scores indexed by name and identified by movement
fifa_overall_attacking = fifa_movement.pivot(index='name', columns='movement', values=['overall', 'attacking'])

fifa_overall_attacking

Unnamed: 0_level_0,overall,overall,overall,attacking,attacking,attacking
movement,dribbling,passing,shooting,dribbling,passing,shooting
name,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
Cristiano Ronaldo,89,82,93,84,83,89
L. Messi,96,92,92,88,92,70
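
When a list is passed to `values`, the result has two column levels. The sketch below rebuilds the movement data inline (so it does not depend on the CSV file) and shows one way to select from that column MultiIndex; the names `movement_df` and `wide` are illustrative only.

```python
import pandas as pd

movement_df = pd.DataFrame({
    "name": ["L. Messi", "Cristiano Ronaldo"] * 3,
    "movement": ["shooting", "shooting", "passing", "passing", "dribbling", "dribbling"],
    "overall": [92, 93, 92, 82, 96, 89],
    "attacking": [70, 89, 92, 83, 88, 84],
})

wide = movement_df.pivot(index="name", columns="movement", values=["overall", "attacking"])

# The columns form a two-level MultiIndex: (value column, movement)
print(wide.columns.nlevels)

# Select a whole top-level block, or a single cell by its (value, movement) tuple
overall_only = wide["overall"]
messi_dribbling = wide.loc["L. Messi", ("overall", "dribbling")]
print(messi_dribbling)
```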


In [8]:
# Dropping a row
another_fifa = fifa_movement.drop(4, axis=0)

another_fifa

Unnamed: 0,name,movement,overall,attacking
0,L. Messi,shooting,92,70
1,Cristiano Ronaldo,shooting,93,89
2,L. Messi,passing,92,92
3,Cristiano Ronaldo,passing,82,83
5,Cristiano Ronaldo,dribbling,89,84


### Pivot table method

#### Pivot method limitations

* General purpose pivoting
* Index/column pair must be unique
* Cannot aggregate values
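
The uniqueness limitation can be reproduced on a small hand-built frame. This sketch mirrors the structure of the data used in this section but is constructed inline; `players_long` and `table` are illustrative names.

```python
import pandas as pd

players_long = pd.DataFrame({
    "name": ["Cristiano Ronaldo", "J. Oblak", "Cristiano Ronaldo", "J. Oblak", "Cristiano Ronaldo"],
    "variable": ["weight", "weight", "height", "height", "height"],
    "metric_system": [83, 87, 187, 188, 187],
})

# (Cristiano Ronaldo, height) appears twice, so pivot cannot place both values in one cell
try:
    players_long.pivot(index="name", columns="variable", values="metric_system")
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print(f"pivot failed: {err}")

# pivot_table resolves the duplicates by aggregating them
table = players_long.pivot_table(
    index="name", columns="variable", values="metric_system", aggfunc="mean"
)
print(table)
```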

In [9]:
import pandas as pd

# Read the data from file using read_csv
fifa_players_long = pd.read_csv("files/fifa_players_long.csv")

fifa_players_long

Unnamed: 0,name,variable,metric_system,imperial_system
0,Cristiano Ronaldo,weight,83,183.0
1,J. Oblak,weight,87,191.0
2,Cristiano Ronaldo,height,187,6.13
3,J. Oblak,height,188,6.16
4,Cristiano Ronaldo,height,187,6.14


In [11]:
# fifa_players_long.pivot(index="name", columns="variable")

# ! Returns an error: ValueError: Index contains duplicate entries, cannot reshape

In [12]:
fifa_players_long.pivot_table(index="name", columns="variable", aggfunc="mean")

Unnamed: 0_level_0,imperial_system,imperial_system,metric_system,metric_system
variable,height,weight,height,weight
name,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
Cristiano Ronaldo,6.135,183.0,187,83
J. Oblak,6.16,191.0,188,87


In [13]:
# Add margins to include the aggregate (here, the mean) over all rows and all columns
fifa_players_long.pivot_table(index="name", columns="variable", aggfunc="mean", margins=True)

Unnamed: 0_level_0,imperial_system,imperial_system,imperial_system,metric_system,metric_system,metric_system
variable,height,weight,All,height,weight,All
name,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
Cristiano Ronaldo,6.135,183.0,65.09,187.0,83,152.333333
J. Oblak,6.16,191.0,98.58,188.0,87,137.5
All,6.143333,187.0,78.486,187.333333,85,146.4


### Pivot or pivot table?

* Does the DataFrame have more than one value for each index/column pair?
* Do you need to have a multi-index in your resulting pivoted DataFrame?
* Do you need summary statistics of your large DataFrame?

If yes to any question, use .pivot_table()
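
The first question can be checked programmatically. The helper below, `safe_pivot`, is a hypothetical convenience function written for this note (it is not part of pandas): it falls back to `pivot_table` only when an index/column pair is duplicated.

```python
import pandas as pd

def safe_pivot(df, index, columns, values, aggfunc="mean"):
    """Hypothetical helper: pivot when pairs are unique, otherwise aggregate."""
    has_duplicates = df.duplicated(subset=[index, columns]).any()
    if has_duplicates:
        return df.pivot_table(index=index, columns=columns, values=values, aggfunc=aggfunc)
    return df.pivot(index=index, columns=columns, values=values)

scores = pd.DataFrame({
    "name": ["A", "A", "B", "B", "B"],
    "skill": ["passing", "shooting", "passing", "shooting", "shooting"],
    "score": [80, 90, 70, 60, 80],
})

# (B, shooting) occurs twice, so the two scores are averaged
result = safe_pivot(scores, "name", "skill", "score")
print(result)
```

This only covers the duplicate check; the other two questions (a multi-index result, summary statistics) remain a matter of what the analysis needs.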

## Reshaping with melt

### Wide to long transformation
* Perform analytics
* Plot different variables in the same graph

In [14]:
import pandas as pd

# Read the data from file using read_csv
books = pd.read_csv("files/books.csv")

books.head()

Unnamed: 0,bookID,title,authors,average_rating,isbn,isbn13,language_code,num_pages,ratings_count,text_reviews_count,publication_date,publisher
0,1,Harry Potter and the Half-Blood Prince (Harry ...,J.K. Rowling/Mary GrandPré,4.57,0439785960,9780439785969,eng,652,2095690,27591,9/16/2006,Scholastic Inc.
1,2,Harry Potter and the Order of the Phoenix (Har...,J.K. Rowling/Mary GrandPré,4.49,0439358078,9780439358071,eng,870,2153167,29221,9/1/2004,Scholastic Inc.
2,4,Harry Potter and the Chamber of Secrets (Harry...,J.K. Rowling,4.42,0439554896,9780439554893,eng,352,6333,244,11/1/2003,Scholastic
3,5,Harry Potter and the Prisoner of Azkaban (Harr...,J.K. Rowling/Mary GrandPré,4.56,043965548X,9780439655484,eng,435,2339585,36325,5/1/2004,Scholastic Inc.
4,8,Harry Potter Boxed Set Books 1-5 (Harry Potte...,J.K. Rowling/Mary GrandPré,4.78,0439682584,9780439682589,eng,2690,41428,164,9/13/2004,Scholastic


In [15]:
books.melt(id_vars='title')

Unnamed: 0,title,variable,value
0,Harry Potter and the Half-Blood Prince (Harry ...,bookID,1
1,Harry Potter and the Order of the Phoenix (Har...,bookID,2
2,Harry Potter and the Chamber of Secrets (Harry...,bookID,4
3,Harry Potter and the Prisoner of Azkaban (Harr...,bookID,5
4,Harry Potter Boxed Set Books 1-5 (Harry Potte...,bookID,8
...,...,...,...
122348,Expelled from Eden: A William T. Vollmann Reader,publisher,Da Capo Press
122349,You Bright and Risen Angels,publisher,Penguin Books
122350,The Ice-Shirt (Seven Dreams #1),publisher,Penguin Books
122351,Poor People,publisher,Ecco


#### Specifying values to melt and naming values and variables

In [16]:
books.melt(id_vars='title', value_vars=['language_code', 'num_pages'], var_name='feature', value_name='value')

Unnamed: 0,title,feature,value
0,Harry Potter and the Half-Blood Prince (Harry ...,language_code,eng
1,Harry Potter and the Order of the Phoenix (Har...,language_code,eng
2,Harry Potter and the Chamber of Secrets (Harry...,language_code,eng
3,Harry Potter and the Prisoner of Azkaban (Harr...,language_code,eng
4,Harry Potter Boxed Set Books 1-5 (Harry Potte...,language_code,eng
...,...,...,...
22241,Expelled from Eden: A William T. Vollmann Reader,num_pages,512
22242,You Bright and Risen Angels,num_pages,635
22243,The Ice-Shirt (Seven Dreams #1),num_pages,415
22244,Poor People,num_pages,434


### Reshaping with wide to long function

In [17]:
import pandas as pd

book_stubs = pd.read_csv("files/book_stubs.csv")

book_stubs

Unnamed: 0,title,ratings2019,sold2019,ratings2020,sold2020
0,Mostly Harmless,4.2,456,4.3,436
1,The Hitchhiker's Guide,4.8,980,4.9,998
2,El restaurante del fin del mundo,4.5,678,4.6,638


In [18]:
pd.wide_to_long(book_stubs, stubnames=['ratings', 'sold'], i='title', j='year')

Unnamed: 0_level_0,Unnamed: 1_level_0,ratings,sold
title,year,Unnamed: 2_level_1,Unnamed: 3_level_1
Mostly Harmless,2019,4.2,456
The Hitchhiker's Guide,2019,4.8,980
El restaurante del fin del mundo,2019,4.5,678
Mostly Harmless,2020,4.3,436
The Hitchhiker's Guide,2020,4.9,998
El restaurante del fin del mundo,2020,4.6,638


Configuring the separator and suffix

In [19]:
import pandas as pd

books_brown = pd.read_csv("files/books_brown.csv")

books_brown

Unnamed: 0,title,author,language_code,language_name,publisher_code,publisher_name
0,The Da Vinci Code,Dan Brown,0,english,12,Random House
1,Angels & Demons,Dan Brown,0,english,34,Pocket Books
2,La fortaleza digital,Dan Brown,84,spanish,43,Umbriel


In [20]:
# suffix is a regular expression; a raw string avoids Python treating "\w" as an escape sequence
pd.wide_to_long(books_brown, stubnames=['language', 'publisher'], i=['author', 'title'], j='code', sep='_', suffix=r'\w+')

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,language,publisher
author,title,code,Unnamed: 3_level_1,Unnamed: 4_level_1
Dan Brown,The Da Vinci Code,code,0,12
Dan Brown,The Da Vinci Code,name,english,Random House
Dan Brown,Angels & Demons,code,0,34
Dan Brown,Angels & Demons,name,english,Pocket Books
Dan Brown,La fortaleza digital,code,84,43
Dan Brown,La fortaleza digital,name,spanish,Umbriel
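
To see how `sep` and `suffix` work together, here is a small sketch on made-up price data (the frame, its column names and the `currency` label are assumptions for illustration). Each column is read as stub + separator + suffix, and `suffix` is a regular expression describing what may follow the separator.

```python
import pandas as pd

prices = pd.DataFrame({
    "title": ["Book A", "Book B"],
    "price_usd": [10.0, 12.5],
    "price_eur": [9.0, 11.5],
})

# "price_usd" is read as stub "price" + sep "_" + suffix "usd".
# The default suffix only matches digits, so a letter pattern is supplied.
long_prices = pd.wide_to_long(
    prices, stubnames="price", i="title", j="currency", sep="_", suffix=r"[a-z]+"
)
print(long_prices)
```

Writing the pattern as a raw string (`r"..."`) keeps backslash sequences such as `\w` or `\d` intact for the regex engine.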


### DataFrame with index

In [21]:
import pandas as pd

books_with_index = pd.read_csv("files/books_with_index.csv", index_col=0)

books_with_index

Unnamed: 0_level_0,author,ratings2019,sold2019
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
To Kill a Mockingbird,Harper Lee,4.7,456
The Hitchhiker's Guide,Douglas Adams,4.8,980
The Black Cat Edgar,Alan Poe,4.5,678


The index of the DataFrame holds the book titles. `pd.wide_to_long()` works only with columns, so reshaping the DataFrame as it stands drops the index, and the titles are lost with it:

In [22]:
pd.wide_to_long(books_with_index, stubnames=['ratings', 'sold'], i='author', j='year')

Unnamed: 0_level_0,Unnamed: 1_level_0,ratings,sold
author,year,Unnamed: 2_level_1,Unnamed: 3_level_1
Harper Lee,2019,4.7,456
Douglas Adams,2019,4.8,980
Alan Poe,2019,4.5,678


In [23]:
# Reset the index
books_with_index.reset_index(drop=False, inplace=True)

# Reshape using title and author as index
pd.wide_to_long(books_with_index, stubnames=['ratings', 'sold'], i=['title', 'author'], j='year')

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,ratings,sold
title,author,year,Unnamed: 3_level_1,Unnamed: 4_level_1
To Kill a Mockingbird,Harper Lee,2019,4.7,456
The Hitchhiker's Guide,Douglas Adams,2019,4.8,980
The Black Cat Edgar,Alan Poe,2019,4.5,678


## Working with string columns

In [24]:
books_list = [
    ['title','ratings_2015','sold_2015','ratings_2016','sold_2016'],
    ['The Civil War:Vol. 1',4.3,234,4.2,254],
    ['The Civil War:Vol. 2',4.5,525,4.3,515],
    ['The Civil War:Vol. 3',4.1,242,4.2,251],
]

books = pd.DataFrame(books_list[1:], columns=books_list[0])

books

Unnamed: 0,title,ratings_2015,sold_2015,ratings_2016,sold_2016
0,The Civil War:Vol. 1,4.3,234,4.2,254
1,The Civil War:Vol. 2,4.5,525,4.3,515
2,The Civil War:Vol. 3,4.1,242,4.2,251


In [25]:
books['title'].dtypes

dtype('O')

### Splitting into two columns

In [26]:
books['title'].str.split(':')

0    [The Civil War, Vol. 1]
1    [The Civil War, Vol. 2]
2    [The Civil War, Vol. 3]
Name: title, dtype: object

In [27]:
# Split the title column on the : character to new columns
books[['main_title', 'subtitle']] = books['title'].str.split(':', expand=True)

books

Unnamed: 0,title,ratings_2015,sold_2015,ratings_2016,sold_2016,main_title,subtitle
0,The Civil War:Vol. 1,4.3,234,4.2,254,The Civil War,Vol. 1
1,The Civil War:Vol. 2,4.5,525,4.3,515,The Civil War,Vol. 2
2,The Civil War:Vol. 3,4.1,242,4.2,251,The Civil War,Vol. 3


In [28]:
# Drop original title column
books.drop('title', axis=1, inplace=True)

# Reshape books to long format; the stubnames must match the column prefixes exactly
pd.wide_to_long(books, stubnames=['ratings', 'sold'], i=['main_title', 'subtitle'], j='year', sep='_')

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,ratings,sold
main_title,subtitle,year,Unnamed: 3_level_1,Unnamed: 4_level_1
The Civil War,Vol. 1,2015,4.3,234
The Civil War,Vol. 1,2016,4.2,254
The Civil War,Vol. 2,2015,4.5,525
The Civil War,Vol. 2,2016,4.3,515
The Civil War,Vol. 3,2015,4.1,242
The Civil War,Vol. 3,2016,4.2,251


### Concatenate two columns

In [29]:
books_new_lists = [
    ['name_author','lastname_author','nationality','number_books'],
    ['Virginia','Wolf','British',50],
    ['Margaret','Atwood','Canadian',40],
    ['Harper','Lee','American',2],
]

books_new = pd.DataFrame(books_new_lists[1:], columns=books_new_lists[0])

books_new

Unnamed: 0,name_author,lastname_author,nationality,number_books
0,Virginia,Wolf,British,50
1,Margaret,Atwood,Canadian,40
2,Harper,Lee,American,2


In [30]:
books_new['author'] = books_new['name_author'].str.cat(books_new['lastname_author'], sep=' ')

books_new

Unnamed: 0,name_author,lastname_author,nationality,number_books,author
0,Virginia,Wolf,British,50,Virginia Wolf
1,Margaret,Atwood,Canadian,40,Margaret Atwood
2,Harper,Lee,American,2,Harper Lee


In [31]:
books_new.melt(id_vars='author', value_vars=['nationality', 'number_books'], var_name='feature', value_name='value')

Unnamed: 0,author,feature,value
0,Virginia Wolf,nationality,British
1,Margaret Atwood,nationality,Canadian
2,Harper Lee,nationality,American
3,Virginia Wolf,number_books,50
4,Margaret Atwood,number_books,40
5,Harper Lee,number_books,2


### Concatenate index

In [32]:
comics_marvel_list = [
    ['main_title','subtitle','year','ratings','sold'],
    ['Avengers','Next',1992,4.5,234],
    ['Avengers','Forever',1998,4.6,224],
    ['Avengers','2099',1999,4.8,141]
]

comics_marvel = pd.DataFrame(comics_marvel_list[1:], columns=comics_marvel_list[0])

# Set main_title as index
comics_marvel = comics_marvel.set_index('main_title')

comics_marvel

Unnamed: 0_level_0,subtitle,year,ratings,sold
main_title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Avengers,Next,1992,4.5,234
Avengers,Forever,1998,4.6,224
Avengers,2099,1999,4.8,141


In [33]:
comics_marvel.index = comics_marvel.index.str.cat(comics_marvel['subtitle'], sep='-')

comics_marvel

Unnamed: 0_level_0,subtitle,year,ratings,sold
main_title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Avengers-Next,Next,1992,4.5,234
Avengers-Forever,Forever,1998,4.6,224
Avengers-2099,2099,1999,4.8,141


In [34]:
# Split index
comics_marvel.index = comics_marvel.index.str.split('-', expand=True)

comics_marvel

Unnamed: 0,Unnamed: 1,subtitle,year,ratings,sold
Avengers,Next,Next,1992,4.5,234
Avengers,Forever,Forever,1998,4.6,224
Avengers,2099,2099,1999,4.8,141
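
Splitting an index with `expand=True` produces a MultiIndex rather than new columns, and the output above suggests its levels come back unnamed. A minimal sketch, on a cut-down version of the comics data, that labels the levels afterwards with `set_names`:

```python
import pandas as pd

comics = pd.DataFrame(
    {"year": [1992, 1998, 1999]},
    index=["Avengers-Next", "Avengers-Forever", "Avengers-2099"],
)

# Splitting the index with expand=True yields a two-level MultiIndex
comics.index = comics.index.str.split("-", expand=True)

# Give the new levels descriptive names
comics.index = comics.index.set_names(["main_title", "subtitle"])
print(comics)
```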


### Concatenate Series

In [35]:
books_new_lists = [
    ['name_author','lastname_author','nationality','number_books'],
    ['Virginia','Wolf','British',50],
    ['Margaret','Atwood','Canadian',40],
    ['Harper','Lee','American',2],
]

books_new = pd.DataFrame(books_new_lists[1:], columns=books_new_lists[0])

books_new['name_author']

0    Virginia
1    Margaret
2      Harper
Name: name_author, dtype: object

In [36]:
new_list = ['Wolf', 'Atwood', 'Lee']

books_new['name_author'].str.cat(new_list, sep=' ')

0      Virginia Wolf
1    Margaret Atwood
2         Harper Lee
Name: name_author, dtype: object
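
One thing to watch when concatenating is missing data. The sketch below uses a made-up series with one missing last name to show the default behaviour of `str.cat` and the effect of its `na_rep` parameter.

```python
import pandas as pd

first = pd.Series(["Virginia", "Margaret", "Harper"])
last = pd.Series(["Wolf", None, "Lee"])

# By default, a missing value on either side makes the whole result missing
default = first.str.cat(last, sep=" ")

# na_rep substitutes a placeholder so the other part survives;
# strip() removes the separator left dangling next to the empty placeholder
filled = first.str.cat(last, sep=" ", na_rep="").str.strip()

print(default)
print(filled)
```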

In [37]:
dystopia = [
    ['title', 'year', 'num_pages', 'average_rating', 'ratings_count'],
    ['Fahrenheit 451-1953', 1953.0, 186.0, 4.1, 23244.0],
    ['1984-1949', 1949.0, 268.0, 4.31, 14353.0],
    ['Brave New World-1932', 1932.0, 123.0, 4.3, 23535.0]
]

books_dys = pd.DataFrame(dystopia[1:], columns=dystopia[0])

# Set title as index
books_dys = books_dys.set_index('title')

books_dys

Unnamed: 0_level_0,year,num_pages,average_rating,ratings_count
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Fahrenheit 451-1953,1953.0,186.0,4.1,23244.0
1984-1949,1949.0,268.0,4.31,14353.0
Brave New World-1932,1932.0,123.0,4.3,23535.0


In [38]:
author_list = ['Ray Bradbury', 'George Orwell', 'Aldous Huxley']

author_list

['Ray Bradbury', 'George Orwell', 'Aldous Huxley']

In [39]:
# Get the first element after splitting the index of books_dys

books_dys.index = books_dys.index.str.split('-').str.get(0)

books_dys

Unnamed: 0_level_0,year,num_pages,average_rating,ratings_count
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Fahrenheit 451,1953.0,186.0,4.1,23244.0
1984,1949.0,268.0,4.31,14353.0
Brave New World,1932.0,123.0,4.3,23535.0


In [40]:
hp_books_list = [
    ['title', 'subtitle', 'authors', 'goodreads', 'amazon'],
    ['Harry Potter', 'the Half-Blood Prince ','J.K. Rowling/Mary GrandPré', 4.57, 4.5200000000000005],
    ['Harry Potter', 'the Order of the Phoenix ', 'J.K. Rowling/Mary GrandPré', 4.49, 4.44],
    ['Harry Potter', 'the Chamber of Secrets ', 'J.K. Rowling', 4.42, 4.37],
    ['Harry Potter', 'the Prisoner of Azkaban ', 'J.K. Rowling/Mary GrandPré', 4.56, 4.51],
    ['Harry Potter', 'The Deathly Hallows', 'J.K. Rowling/Mary GrandPré', 4.42, 4.37],
    ['Harry Potter', "the Sorcerer's Stone ", 'J.K. Rowling/Mary GrandPré', 4.47, 4.42],
    ['Harry Potter', 'the Goblet of Fire ', 'J.K. Rowling', 4.56, 4.51]
]

hp_books = pd.DataFrame(hp_books_list[1:], columns=hp_books_list[0])

hp_books

Unnamed: 0,title,subtitle,authors,goodreads,amazon
0,Harry Potter,the Half-Blood Prince,J.K. Rowling/Mary GrandPré,4.57,4.52
1,Harry Potter,the Order of the Phoenix,J.K. Rowling/Mary GrandPré,4.49,4.44
2,Harry Potter,the Chamber of Secrets,J.K. Rowling,4.42,4.37
3,Harry Potter,the Prisoner of Azkaban,J.K. Rowling/Mary GrandPré,4.56,4.51
4,Harry Potter,The Deathly Hallows,J.K. Rowling/Mary GrandPré,4.42,4.37
5,Harry Potter,the Sorcerer's Stone,J.K. Rowling/Mary GrandPré,4.47,4.42
6,Harry Potter,the Goblet of Fire,J.K. Rowling,4.56,4.51


In [41]:
# Concatenate the title and subtitle separated by "and" surrounded by spaces
hp_books['full_title'] = hp_books['title'].str.cat(hp_books['subtitle'], sep =' and ') 

hp_books

Unnamed: 0,title,subtitle,authors,goodreads,amazon,full_title
0,Harry Potter,the Half-Blood Prince,J.K. Rowling/Mary GrandPré,4.57,4.52,Harry Potter and the Half-Blood Prince
1,Harry Potter,the Order of the Phoenix,J.K. Rowling/Mary GrandPré,4.49,4.44,Harry Potter and the Order of the Phoenix
2,Harry Potter,the Chamber of Secrets,J.K. Rowling,4.42,4.37,Harry Potter and the Chamber of Secrets
3,Harry Potter,the Prisoner of Azkaban,J.K. Rowling/Mary GrandPré,4.56,4.51,Harry Potter and the Prisoner of Azkaban
4,Harry Potter,The Deathly Hallows,J.K. Rowling/Mary GrandPré,4.42,4.37,Harry Potter and The Deathly Hallows
5,Harry Potter,the Sorcerer's Stone,J.K. Rowling/Mary GrandPré,4.47,4.42,Harry Potter and the Sorcerer's Stone
6,Harry Potter,the Goblet of Fire,J.K. Rowling,4.56,4.51,Harry Potter and the Goblet of Fire


In [42]:
# Split the authors into writer and illustrator columns
hp_books[['writer', 'illustrator']] = hp_books['authors'].str.split('/', expand=True) 

hp_books

Unnamed: 0,title,subtitle,authors,goodreads,amazon,full_title,writer,illustrator
0,Harry Potter,the Half-Blood Prince,J.K. Rowling/Mary GrandPré,4.57,4.52,Harry Potter and the Half-Blood Prince,J.K. Rowling,Mary GrandPré
1,Harry Potter,the Order of the Phoenix,J.K. Rowling/Mary GrandPré,4.49,4.44,Harry Potter and the Order of the Phoenix,J.K. Rowling,Mary GrandPré
2,Harry Potter,the Chamber of Secrets,J.K. Rowling,4.42,4.37,Harry Potter and the Chamber of Secrets,J.K. Rowling,
3,Harry Potter,the Prisoner of Azkaban,J.K. Rowling/Mary GrandPré,4.56,4.51,Harry Potter and the Prisoner of Azkaban,J.K. Rowling,Mary GrandPré
4,Harry Potter,The Deathly Hallows,J.K. Rowling/Mary GrandPré,4.42,4.37,Harry Potter and The Deathly Hallows,J.K. Rowling,Mary GrandPré
5,Harry Potter,the Sorcerer's Stone,J.K. Rowling/Mary GrandPré,4.47,4.42,Harry Potter and the Sorcerer's Stone,J.K. Rowling,Mary GrandPré
6,Harry Potter,the Goblet of Fire,J.K. Rowling,4.56,4.51,Harry Potter and the Goblet of Fire,J.K. Rowling,
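
Assigning the split result to exactly two columns relies on every value containing at most one `/`. As a hedged sketch on a made-up series (the third entry is invented to show the edge case), passing `n=1` limits the split to the first separator so the result never has more than two columns:

```python
import pandas as pd

authors = pd.Series(["J.K. Rowling/Mary GrandPré", "J.K. Rowling", "A/B/C"])

# n=1 splits only on the first "/", so there are at most two parts per row
parts = authors.str.split("/", n=1, expand=True)
parts.columns = ["writer", "illustrator"]

print(parts)
```

Rows without a separator get a missing value in the second column, which matches the empty `illustrator` cells in the output above.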


In [43]:
# Melt goodreads and amazon columns into a single column
hp_melt = hp_books.melt(id_vars=['full_title', 'writer'], value_vars=['goodreads', 'amazon'], var_name='source', value_name='rating')

hp_melt

Unnamed: 0,full_title,writer,source,rating
0,Harry Potter and the Half-Blood Prince,J.K. Rowling,goodreads,4.57
1,Harry Potter and the Order of the Phoenix,J.K. Rowling,goodreads,4.49
2,Harry Potter and the Chamber of Secrets,J.K. Rowling,goodreads,4.42
3,Harry Potter and the Prisoner of Azkaban,J.K. Rowling,goodreads,4.56
4,Harry Potter and The Deathly Hallows,J.K. Rowling,goodreads,4.42
5,Harry Potter and the Sorcerer's Stone,J.K. Rowling,goodreads,4.47
6,Harry Potter and the Goblet of Fire,J.K. Rowling,goodreads,4.56
7,Harry Potter and the Half-Blood Prince,J.K. Rowling,amazon,4.52
8,Harry Potter and the Order of the Phoenix,J.K. Rowling,amazon,4.44
9,Harry Potter and the Chamber of Secrets,J.K. Rowling,amazon,4.37


In [44]:
import pandas as pd

sherlock_books = [
    ['main_title', 'version', 'number_pages', 'number_ratings'],
    ['Sherlock Holmes: The Complete Novels', 'Vol I', 1059, 24087],
    ['Sherlock Holmes: The Complete Novels', 'Vol II', 709, 26794],
    ['Adventures of Sherlock Holmes: Memoirs', 'Vol I', 334, 2184],
    ['Adventures of Sherlock Holmes: Memoirs', 'Vol II', 238, 1884],
]

books_sh = pd.DataFrame(sherlock_books[1:], columns=sherlock_books[0])

books_sh

Unnamed: 0,main_title,version,number_pages,number_ratings
0,Sherlock Holmes: The Complete Novels,Vol I,1059,24087
1,Sherlock Holmes: The Complete Novels,Vol II,709,26794
2,Adventures of Sherlock Holmes: Memoirs,Vol I,334,2184
3,Adventures of Sherlock Holmes: Memoirs,Vol II,238,1884


In [45]:
# Split main_title by a colon and assign it to two columns named title and subtitle 
books_sh[['title', 'subtitle']] = books_sh['main_title'].str.split(':', expand=True)

books_sh

Unnamed: 0,main_title,version,number_pages,number_ratings,title,subtitle
0,Sherlock Holmes: The Complete Novels,Vol I,1059,24087,Sherlock Holmes,The Complete Novels
1,Sherlock Holmes: The Complete Novels,Vol II,709,26794,Sherlock Holmes,The Complete Novels
2,Adventures of Sherlock Holmes: Memoirs,Vol I,334,2184,Adventures of Sherlock Holmes,Memoirs
3,Adventures of Sherlock Holmes: Memoirs,Vol II,238,1884,Adventures of Sherlock Holmes,Memoirs


In [46]:
# Split version by a space and assign the second element to the column named volume 
books_sh['volume'] = books_sh['version'].str.split(' ').str.get(1)

books_sh

Unnamed: 0,main_title,version,number_pages,number_ratings,title,subtitle,volume
0,Sherlock Holmes: The Complete Novels,Vol I,1059,24087,Sherlock Holmes,The Complete Novels,I
1,Sherlock Holmes: The Complete Novels,Vol II,709,26794,Sherlock Holmes,The Complete Novels,II
2,Adventures of Sherlock Holmes: Memoirs,Vol I,334,2184,Adventures of Sherlock Holmes,Memoirs,I
3,Adventures of Sherlock Holmes: Memoirs,Vol II,238,1884,Adventures of Sherlock Holmes,Memoirs,II


In [47]:
# Drop the main_title and version columns modifying books_sh
books_sh.drop(['main_title', 'version'], axis=1, inplace=True)

books_sh

Unnamed: 0,number_pages,number_ratings,title,subtitle,volume
0,1059,24087,Sherlock Holmes,The Complete Novels,I
1,709,26794,Sherlock Holmes,The Complete Novels,II
2,334,2184,Adventures of Sherlock Holmes,Memoirs,I
3,238,1884,Adventures of Sherlock Holmes,Memoirs,II


In [48]:
# Reshape using title, subtitle and volume as index.
# Columns start with "number", are separated by an underscore and end in a word;
# the new variable holding that word is named feature.
sh_long = pd.wide_to_long(books_sh, stubnames='number', i=['title', 'subtitle', 'volume'], j='feature', sep='_', suffix=r'\w+')

sh_long

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,number
title,subtitle,volume,feature,Unnamed: 4_level_1
Sherlock Holmes,The Complete Novels,I,pages,1059
Sherlock Holmes,The Complete Novels,I,ratings,24087
Sherlock Holmes,The Complete Novels,II,pages,709
Sherlock Holmes,The Complete Novels,II,ratings,26794
Adventures of Sherlock Holmes,Memoirs,I,pages,334
Adventures of Sherlock Holmes,Memoirs,I,ratings,2184
Adventures of Sherlock Holmes,Memoirs,II,pages,238
Adventures of Sherlock Holmes,Memoirs,II,ratings,1884


## Stacking DataFrames

### Rows with multi-indices

Setting the index

In [49]:
import pandas as pd

churn_list = [
    ['credit_score', 'age', 'country', 'num_products', 'exited'],
    [619, 43, 'France', 1, 'Yes'],
    [608, 34, 'Germany', 0, 'No'],
    [502, 23, 'France', 1, 'Yes'],
]

churn = pd.DataFrame(churn_list[1:], columns=churn_list[0])

churn

Unnamed: 0,credit_score,age,country,num_products,exited
0,619,43,France,1,Yes
1,608,34,Germany,0,No
2,502,23,France,1,Yes


In [50]:
churn.set_index(['country', 'age'], inplace=True)

churn

Unnamed: 0_level_0,Unnamed: 1_level_0,credit_score,num_products,exited
country,age,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
France,43,619,1,Yes
Germany,34,608,0,No
France,23,502,1,Yes


MultiIndex from array

In [51]:
new_array = [['yes', 'no', 'yes'], ['no', 'yes', 'yes']]

churn.index = pd.MultiIndex.from_arrays(new_array, names=['member', 'credit_card'])

churn

Unnamed: 0_level_0,Unnamed: 1_level_0,credit_score,num_products,exited
member,credit_card,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
yes,no,619,1,Yes
no,yes,608,0,No
yes,yes,502,1,Yes
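
`from_arrays` takes one list per level. For comparison, this short sketch builds the same index from one tuple per row with `from_tuples`; which form is more convenient depends on how the labels are stored.

```python
import pandas as pd

members = ["yes", "no", "yes"]
cards = ["no", "yes", "yes"]

# from_arrays: one list per level
by_arrays = pd.MultiIndex.from_arrays([members, cards], names=["member", "credit_card"])

# from_tuples: one tuple per row
by_tuples = pd.MultiIndex.from_tuples(list(zip(members, cards)), names=["member", "credit_card"])

print(by_arrays.equals(by_tuples))
```

Either way, the index needs one entry per row of the DataFrame it is assigned to.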


MultiIndex DataFrames

In [52]:
data = [
    [25, 68, 26, 72],
    [31, 72, 32, 73],
    [41, 68, 42, 69],
    [32, 75, 33, 74],
]

index = pd.MultiIndex.from_arrays(
    [['Wick', 'Wick', 'Shelley', 'Shelley'],
     ['John', 'Julien', 'Mary', 'Frank']],
    names=['last', 'first'],
)

columns = pd.MultiIndex.from_arrays(
    [['2019', '2019', '2020', '2020'],
     ['age', 'weight', 'age', 'weight']],
    names=['year', 'feature'],
)

patients = pd.DataFrame(data, index=index, columns=columns)

patients

Unnamed: 0_level_0,year,2019,2019,2020,2020
Unnamed: 0_level_1,feature,age,weight,age,weight
last,first,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
Wick,John,25,68,26,72
Wick,Julien,31,72,32,73
Shelley,Mary,41,68,42,69
Shelley,Frank,32,75,33,74


**The .stack() method**

Rearranges a level of the columns into the rows, producing a reshaped DataFrame whose innermost row-index level holds the former column labels.
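
As a minimal sketch of that definition, using a small made-up `scores` frame (not part of the original data): with a single column level, stacking moves every column label into the index and returns a Series.

```python
import pandas as pd

scores = pd.DataFrame(
    {"math": [90, 80], "art": [70, 60]},
    index=["ann", "bob"],
)

# The only column level moves into the index, so the result is a Series
stacked = scores.stack()

print(stacked)
print(stacked.index.nlevels)
```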

In [53]:
import pandas as pd

churn_list = [
    ['credit_score', 'age', 'country', 'num_products', 'exited'],
    [619, 43, 'France', 1, 'Yes'],
    [608, 34, 'Germany', 0, 'No'],
    [502, 23, 'France', 1, 'Yes'],
]

churn = pd.DataFrame(churn_list[1:], columns=churn_list[0])

churn.index = pd.MultiIndex.from_arrays(new_array, names=['member', 'credit_card'])

churn

Unnamed: 0_level_0,Unnamed: 1_level_0,credit_score,age,country,num_products,exited
member,credit_card,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
yes,no,619,43,France,1,Yes
no,yes,608,34,Germany,0,No
yes,yes,502,23,France,1,Yes


In [54]:
churned_stacked = churn.stack()

churned_stacked.head(10)

member  credit_card              
yes     no           credit_score        619
                     age                  43
                     country          France
                     num_products          1
                     exited              Yes
no      yes          credit_score        608
                     age                  34
                     country         Germany
                     num_products          0
                     exited               No
dtype: object

In [55]:
patients

Unnamed: 0_level_0,year,2019,2019,2020,2020
Unnamed: 0_level_1,feature,age,weight,age,weight
last,first,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
Wick,John,25,68,26,72
Wick,Julien,31,72,32,73
Shelley,Mary,41,68,42,69
Shelley,Frank,32,75,33,74


In [56]:
patients_stacked = patients.stack()

patients_stacked

Unnamed: 0_level_0,Unnamed: 1_level_0,year,2019,2020
last,first,feature,Unnamed: 3_level_1,Unnamed: 4_level_1
Wick,John,age,25,26
Wick,John,weight,68,72
Wick,Julien,age,31,32
Wick,Julien,weight,72,73
Shelley,Mary,age,41,42
Shelley,Mary,weight,68,69
Shelley,Frank,age,32,33
Shelley,Frank,weight,75,74


Stack a level by number

In [57]:
patients.stack(level=0)

Unnamed: 0_level_0,Unnamed: 1_level_0,feature,age,weight
last,first,year,Unnamed: 3_level_1,Unnamed: 4_level_1
Wick,John,2019,25,68
Wick,John,2020,26,72
Wick,Julien,2019,31,72
Wick,Julien,2020,32,73
Shelley,Mary,2019,41,68
Shelley,Mary,2020,42,69
Shelley,Frank,2019,32,75
Shelley,Frank,2020,33,74


Stack a level by name

In [66]:
patients.stack(level='year')

Unnamed: 0_level_0,Unnamed: 1_level_0,feature,age,weight
last,first,year,Unnamed: 3_level_1,Unnamed: 4_level_1
Wick,John,2019,25,68
Wick,John,2020,26,72
Wick,Julien,2019,31,72
Wick,Julien,2020,32,73
Shelley,Mary,2019,41,68
Shelley,Mary,2020,42,69
Shelley,Frank,2019,32,75
Shelley,Frank,2020,33,74
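
The two outputs above are identical because level `0` and level `'year'` refer to the same column level. A short sketch on a reduced version of the patients data (two rows, plain index) that checks this directly:

```python
import pandas as pd

columns = pd.MultiIndex.from_product(
    [["2019", "2020"], ["age", "weight"]], names=["year", "feature"]
)
patients_small = pd.DataFrame(
    [[25, 68, 26, 72], [31, 72, 32, 73]],
    index=["John", "Julien"],
    columns=columns,
)

by_number = patients_small.stack(level=0)
by_name = patients_small.stack(level="year")

# Both calls move the same level into the index
same = by_number.equals(by_name)
print(same)
```

Referring to levels by name tends to be more robust, since it keeps working if the order of the levels changes.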


## Unstacking DataFrames

### The .unstack() method

In [67]:
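# Stack the year level so that patients_stacked matches the output below
patients_stacked = patients.stack(level='year')

patients_stacked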

Unnamed: 0_level_0,Unnamed: 1_level_0,feature,age,weight
last,first,year,Unnamed: 3_level_1,Unnamed: 4_level_1
Wick,John,2019,25,68
Wick,John,2020,26,72
Wick,Julien,2019,31,72
Wick,Julien,2020,32,73
Shelley,Mary,2019,41,68
Shelley,Mary,2020,42,69
Shelley,Frank,2019,32,75
Shelley,Frank,2020,33,74


In [65]:
patients_stacked.unstack()

Unnamed: 0_level_0,feature,age,age,weight,weight
Unnamed: 0_level_1,year,2019,2020,2019,2020
last,first,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
Shelley,Frank,32,33,75,74
Shelley,Mary,41,42,68,69
Wick,John,25,26,68,72
Wick,Julien,31,32,72,73
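
`unstack` is the inverse of `stack`. The sketch below, on a small made-up frame whose index and columns are already in sorted order, stacks and then unstacks and checks that the original is recovered.

```python
import pandas as pd

scores = pd.DataFrame(
    {"art": [70, 60], "math": [90, 80]},
    index=["ann", "bob"],
)

# stack moves the columns into the index; unstack moves them back
round_trip = scores.stack().unstack()

print(round_trip)
print(round_trip.equals(scores))
```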


#### Sorting the index

In [70]:
patients_stacked.unstack().sort_index(ascending=False)

Unnamed: 0_level_0,feature,age,age,weight,weight
Unnamed: 0_level_1,year,2019,2020,2019,2020
last,first,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
Wick,Julien,31,32,72,73
Wick,John,25,26,68,72
Shelley,Mary,41,42,68,69
Shelley,Frank,32,33,75,74


#### Rearrange levels

In [73]:
patients_stacked = patients.stack()

patients_stacked

Unnamed: 0_level_0,Unnamed: 1_level_0,year,2019,2020
last,first,feature,Unnamed: 3_level_1,Unnamed: 4_level_1
Wick,John,age,25,26
Wick,John,weight,68,72
Wick,Julien,age,31,32
Wick,Julien,weight,72,73
Shelley,Mary,age,41,42
Shelley,Mary,weight,68,69
Shelley,Frank,age,32,33
Shelley,Frank,weight,75,74


In [74]:
patients_stacked.unstack(level=1).stack(level=0)

Unnamed: 0_level_0,Unnamed: 1_level_0,first,Frank,John,Julien,Mary
last,feature,year,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Shelley,age,2019,32.0,,,41.0
Shelley,age,2020,33.0,,,42.0
Shelley,weight,2019,75.0,,,68.0
Shelley,weight,2020,74.0,,,69.0
Wick,age,2019,,25.0,31.0,
Wick,age,2020,,26.0,32.0,
Wick,weight,2019,,68.0,72.0,
Wick,weight,2020,,72.0,73.0,
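
The missing values above appear because not every combination of `last` and `first` exists (there is no "Wick, Mary", for example). As a sketch on a reduced series of ages, `unstack` accepts a `fill_value` for such gaps; whether a placeholder like 0 is meaningful depends on the data.

```python
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [["Wick", "Wick", "Shelley", "Shelley"], ["John", "Julien", "Mary", "Frank"]],
    names=["last", "first"],
)
ages = pd.Series([25, 31, 41, 32], index=index, name="age")

# Combinations that do not exist become NaN...
with_nan = ages.unstack(level="first")

# ...unless a replacement value is supplied
filled = ages.unstack(level="first", fill_value=0)

print(with_nan)
print(filled)
```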


### Working with multiple levels

#### Swap levels

In [1]:
import pandas as pd

cars_data = [
    ['','','','2019','2020'],
    ['price','Golf','VW',25,26],
    ['sold','Golf','VW',68,72],
    ['price','Passat','VW',31,32],
    ['sold','Passat','VW',72,73],
    ['price','A-class','Mercedes',41,42],
    ['sold','A-class','Mercedes',68,69],
    ['price','C-class','Mercedes',32,33],
    ['sold','C-class','Mercedes',75,74]
]


rows = cars_data[1:]

# Build a two-level row index from the second and third field of each row
index = pd.MultiIndex.from_arrays([[row[1] for row in rows], [row[2] for row in rows]])

df = pd.DataFrame(rows, columns=cars_data[0], index=index)

print(df)

                                            2019  2020
Golf    VW        price     Golf        VW    25    26
        VW         sold     Golf        VW    68    72
Passat  VW        price   Passat        VW    31    32
        VW         sold   Passat        VW    72    73
A-class Mercedes  price  A-class  Mercedes    41    42
        Mercedes   sold  A-class  Mercedes    68    69
C-class Mercedes  price  C-class  Mercedes    32    33
        Mercedes   sold  C-class  Mercedes    75    74
