# Data Analysis in Python - XIII: Pivoting and Unpivoting Data

## Introduction


In this lesson, we will learn how to pivot and unpivot data using pandas. Pivoting and unpivoting are operations that allow us to change the format of a dataset to make it easier to perform subsequent analyses. 



## Different data formats

The data that we encounter and import will typically be organized in a long format, a wide format, or a combination of both. Each format suits different kinds of analysis, so it is often useful to convert data from one format to the other before analyzing it.

### Long format

The long format is the more detailed representation of the data. Each row represents a single observation, and one column (here, Exam) identifies what the observation refers to. As an example, consider the table below, which is in a long format.

|Student|Exam|Score|
|-|-|-|
|Amos|Exam 1|90|
|Amos|Exam 2|99|
|Betty|Exam 1|98|
|Betty|Exam 2|92|


### Wide format

The wide format summarizes the long format data by spreading the distinct values of one column (here, Exam) across separate columns, one new column per distinct value. Each row then holds all the observations for one entity (here, a student). The table below shows the same data in the wide format.

|Student|Exam 1|Exam 2|
|-|-|-|
|Amos|90|99|
|Betty|98|92|
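To make the two layouts concrete, the two tables above can be written out by hand as pandas data frames. This is only an illustrative sketch; the variable names `long_scores` and `wide_scores` are not part of the lesson's data.

```python
import pandas as pd

# Long format: one row per (student, exam) observation.
long_scores = pd.DataFrame({
    "Student": ["Amos", "Amos", "Betty", "Betty"],
    "Exam": ["Exam 1", "Exam 2", "Exam 1", "Exam 2"],
    "Score": [90, 99, 98, 92],
})

# Wide format: one row per student, one column per exam.
wide_scores = pd.DataFrame({
    "Student": ["Amos", "Betty"],
    "Exam 1": [90, 98],
    "Exam 2": [99, 92],
})

print(long_scores.shape)  # (4, 3)
print(wide_scores.shape)  # (2, 3)
```

Both frames hold the same four scores; only the arrangement differs.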

## Pivoting data

Pivoting converts data from a long format to a wide format. It takes the distinct values of a specified column, turns each of them into a new column, and fills those columns by summarizing or aggregating the corresponding values. If you have used pivot tables in Excel, you have performed pivoting before.

We can use the `pivot_table()` method of a data frame to perform pivoting. Let's try that with an example.

In [2]:
import pandas as pd

# import numpy to use various statistical aggregation functions.
import numpy as np

# read monthly product sales unpivoted data
longSales = pd.read_csv('../scratch/monthly_product_sales_unpivoted.csv', sep=',')

In [3]:
longSales.head()

,Month,Product,Sales
0,1/1/2001,Shampoo,266.0
1,1/1/2001,Shampoo,4.0
2,3/1/2001,Shampoo,183.1
3,4/1/2001,Shampoo,119.3
4,5/1/2001,Shampoo,180.3


In [4]:
longSales.tail()

,Month,Product,Sales
58,8/1/2003,Conditioner,1745.0
59,9/1/2003,Conditioner,7027.0
60,10/1/2003,Conditioner,5635.0
61,11/1/2003,Conditioner,7839.0
62,12/1/2003,Conditioner,8593.0


In [5]:
# pivot the data to create one column per product

# index: column or list of columns whose values label the rows of the result
# columns: column or list of columns whose distinct values become the new column headers
# values: column or list of columns holding the values to aggregate
# aggfunc: the aggregation function; defaults to mean when omitted

salesPivot = longSales.pivot_table(index='Month', columns='Product', values='Sales')
salesPivot.head()

Month,Conditioner,Shampoo
1/1/2001,11813.0,135.0
1/1/2002,1517.0,192004.3
1/1/2003,,332009.7
10/1/2001,2421.0,122.9
10/1/2002,10082.0,
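The Shampoo value of 135.0 for 1/1/2001 comes from the default mean aggregation: the long data has two Shampoo rows for that month (266.0 and 4.0), and their mean is 135.0. A minimal sketch that reproduces this on a three-row frame, using the first Shampoo rows shown in `longSales.head()` (the name `demo` is only for illustration):

```python
import pandas as pd

# Three rows copied from the head of the long sales data;
# note the two entries for 1/1/2001.
demo = pd.DataFrame({
    "Month": ["1/1/2001", "1/1/2001", "3/1/2001"],
    "Product": ["Shampoo", "Shampoo", "Shampoo"],
    "Sales": [266.0, 4.0, 183.1],
})

# No aggfunc given, so duplicate (Month, Product) pairs are averaged.
demo_pivot = demo.pivot_table(index="Month", columns="Product", values="Sales")

print(demo_pivot.loc["1/1/2001", "Shampoo"])  # 135.0
```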


In [6]:
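# create the same pivot table, but aggregate with sum instead of mean.
# the string 'sum' selects the same aggregation as np.sum; recent pandas versions may warn when np.sum is passed directly.
salesPivot = longSales.pivot_table(index='Month', columns='Product', values='Sales', aggfunc='sum')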
salesPivot.head()

Month,Conditioner,Shampoo
1/1/2001,11813.0,270.0
1/1/2002,1517.0,192004.3
1/1/2003,,332009.7
10/1/2001,2421.0,122.9
10/1/2002,10082.0,
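In the outputs above, the rows are sorted as text, which is why 10/1/2001 follows directly after the 1/1 rows. If chronological order matters, one option is to parse the Month column into dates before pivoting. The sketch below assumes the dates are month/day/year; the sales figures in `demo` are made up for illustration.

```python
import pandas as pd

# Hypothetical rows; the Sales values are illustrative only.
demo = pd.DataFrame({
    "Month": ["10/1/2001", "1/1/2002", "1/1/2001", "2/1/2001"],
    "Product": ["Shampoo"] * 4,
    "Sales": [10.0, 20.0, 30.0, 40.0],
})

# As plain strings, the months sort alphabetically.
text_order = sorted(demo["Month"])
print(text_order)  # ['1/1/2001', '1/1/2002', '10/1/2001', '2/1/2001']

# Assumption: the format is month/day/year.
demo["Month"] = pd.to_datetime(demo["Month"], format="%m/%d/%Y")

date_pivot = demo.pivot_table(
    index="Month", columns="Product", values="Sales", aggfunc="sum"
)

# With real dates as the index, the rows come out in chronological order.
print(list(date_pivot.index.month))  # [1, 2, 10, 1]
```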


Other aggregation functions can be used in the same way. Check the [list of statistical functions available in numpy](https://numpy.org/doc/stable/reference/routines.statistics.html).
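As a rough sketch of using other aggregations, `aggfunc` also accepts a list of function names, in which case the result has one group of columns per function. The example reuses the first Shampoo rows from the long data; `demo` and `demo_stats` are illustrative names.

```python
import pandas as pd

demo = pd.DataFrame({
    "Month": ["1/1/2001", "1/1/2001", "3/1/2001"],
    "Product": ["Shampoo", "Shampoo", "Shampoo"],
    "Sales": [266.0, 4.0, 183.1],
})

# Passing a list produces a two-level column header:
# the aggregation name on top, the product underneath.
demo_stats = demo.pivot_table(
    index="Month", columns="Product", values="Sales", aggfunc=["min", "max"]
)

print(demo_stats["min"].loc["1/1/2001", "Shampoo"])  # 4.0
print(demo_stats["max"].loc["1/1/2001", "Shampoo"])  # 266.0
```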

## Unpivoting data

Sometimes the data you read into a data frame is in a wide format, but your analysis needs it in a long format. Converting wide data to long data is called unpivoting, also known as melting.

We can use the `melt()` method of a data frame to melt or unpivot data.

Let's try doing so with an example. 

In [8]:
# read monthly product sales pivoted data
wideSales = pd.read_csv('../scratch/monthly_product_sales_pivoted.csv', sep=',')

In [9]:
wideSales.head()

,Month,Conditioner_Sales,Shampoo_Sales
0,1/1/2001,11813.0,300.0
1,2/1/2001,4953.0,
2,3/1/2001,2170.0,183.1
3,4/1/2001,4054.0,119.3
4,5/1/2001,,180.3


In [10]:
wideSales.tail()

,Month,Conditioner_Sales,Shampoo_Sales
31,8/1/2003,1745.0,402007.6
32,9/1/2003,7027.0,682.0
33,10/1/2003,5635.0,475.3
34,11/1/2003,7839.0,581.3
35,12/1/2003,8593.0,646.9


In [19]:
# unpivot the sales data: product names go in the Product column and the sales figures in the Sales column

# id_vars: column or list of columns to keep as identifiers
# value_vars: column or list of columns to unpivot; their names are combined into one column and their values into another.
#             omit it to unpivot every column that is not in id_vars.
# var_name: name of the column that will hold the former column names
# value_name: name of the column that will hold the values

salesUnpivot = wideSales.melt(id_vars='Month', value_vars=['Conditioner_Sales', 'Shampoo_Sales'], var_name='Product', value_name='Sales')
# alternative that omits value_vars:
#salesUnpivot = wideSales.melt(id_vars='Month', var_name='Product', value_name='Sales')
salesUnpivot.head()

,Month,Product,Sales
0,1/1/2001,Conditioner_Sales,11813.0
1,2/1/2001,Conditioner_Sales,4953.0
2,3/1/2001,Conditioner_Sales,2170.0
3,4/1/2001,Conditioner_Sales,4054.0
4,5/1/2001,Conditioner_Sales,
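To illustrate the `value_vars` parameter, the sketch below melts a two-row frame (the first two rows of `wideSales.head()`) with and without `value_vars`. When every column other than `id_vars` should be unpivoted, the two calls give the same result. The names `wide_demo`, `explicit`, and `implicit` are only for illustration.

```python
import pandas as pd

wide_demo = pd.DataFrame({
    "Month": ["1/1/2001", "2/1/2001"],
    "Conditioner_Sales": [11813.0, 4953.0],
    "Shampoo_Sales": [300.0, None],  # missing value, as in the source data
})

# Name the columns to unpivot explicitly.
explicit = wide_demo.melt(
    id_vars="Month",
    value_vars=["Conditioner_Sales", "Shampoo_Sales"],
    var_name="Product",
    value_name="Sales",
)

# Omit value_vars: every column except Month is unpivoted.
implicit = wide_demo.melt(id_vars="Month", var_name="Product", value_name="Sales")

print(explicit.equals(implicit))  # True
print(len(implicit))              # 4 (2 months x 2 products)
```

Note that `melt()` keeps the row with the missing Shampoo value; call `dropna(subset=["Sales"])` on the result if such rows should be discarded.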


**Note:** There can be data loss if you pivot and then unpivot data, because pivoting aggregates rows that share the same index and column values. For example, if the original data had two rows for shampoo sales for a given month, you can't retrieve the two separate rows by unpivoting the pivoted data.
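The data loss described in the note can be seen in a small round trip. The sketch below pivots three Shampoo rows taken from the head of the long data (two of them for 1/1/2001) and then melts the result; the two rows for that month come back as one aggregated row. The variable names are illustrative.

```python
import pandas as pd

long_demo = pd.DataFrame({
    "Month": ["1/1/2001", "1/1/2001", "3/1/2001"],
    "Product": ["Shampoo", "Shampoo", "Shampoo"],
    "Sales": [266.0, 4.0, 183.1],
})

# Pivot with sum, then turn the Month index back into a regular column.
wide_demo = long_demo.pivot_table(
    index="Month", columns="Product", values="Sales", aggfunc="sum"
).reset_index()

# Unpivot again.
round_trip = wide_demo.melt(id_vars="Month", var_name="Product", value_name="Sales")

print(len(long_demo), len(round_trip))  # 3 2
print(round_trip.loc[round_trip["Month"] == "1/1/2001", "Sales"].iloc[0])  # 270.0
```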

We can improve the result further by removing the string '_Sales' from the Product column so that it contains only the product name.

In [25]:
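# remove the _Sales suffix from the product names (regex=False treats '_Sales' as literal text)
salesUnpivot['Product'] = salesUnpivot['Product'].str.replace('_Sales', '', regex=False)
# alternative: keep only the text before the underscore
#salesUnpivot['Product'] = salesUnpivot['Product'].str.split('_').str[0]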

In [22]:
salesUnpivot.head()

,Month,Product,Sales
0,1/1/2001,Conditioner,11813.0
1,2/1/2001,Conditioner,4953.0
2,3/1/2001,Conditioner,2170.0
3,4/1/2001,Conditioner,4054.0
4,5/1/2001,Conditioner,


In [18]:
salesUnpivot.tail()

,Month,Product,Sales
67,8/1/2003,Shampoo,402007.6
68,9/1/2003,Shampoo,682.0
69,10/1/2003,Shampoo,475.3
70,11/1/2003,Shampoo,581.3
71,12/1/2003,Shampoo,646.9
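The suffix removal above can be done in more than one way. The sketch below compares `str.replace` with splitting on the underscore, using a two-element series as a stand-in for the Product column. It also shows why indexing the split result with `[0][0]` is not a substitute: it returns a single string rather than a whole column.

```python
import pandas as pd

# Stand-in for the Product column after melting.
products = pd.Series(["Conditioner_Sales", "Shampoo_Sales"])

# Option 1: delete the literal text '_Sales'.
by_replace = products.str.replace("_Sales", "", regex=False)

# Option 2: split on '_' and keep the first piece of every row.
by_split = products.str.split("_").str[0]

print(by_replace.tolist())          # ['Conditioner', 'Shampoo']
print(by_replace.equals(by_split))  # True

# Plain [0][0] selects row 0, then the first piece of that row only.
first_only = products.str.split("_")[0][0]
print(first_only)  # Conditioner
```

The split approach would cut a product name that itself contains an underscore, so the replace approach is the safer of the two for such names.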
