# pivot versus pivot_table

In [1]:
import numpy as np
import pandas as pd

pandas has two methods to restructure dataframes, `pivot` and `pivot_table`.  Although they are similar, each has its own applications.

## Data

To experiment with them, we use the patient data set, consisting of the experimental numerical data, and the categorical metadata.

In [2]:
experiment = pd.read_excel('data/patient_experiment.xlsx',
                           dtype={'dose': np.float32,
                                  'temperature': np.float32})

In [3]:
metadata = pd.read_excel('data/patient_metadata.xlsx',
                         dtype={'gender': 'category',
                                'condition': 'category'})

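We merge the dataframes with a left join, so every row of the experimental data is kept. As the summary below shows, all columns except `'patient'` and `'date'` contain missing values.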

In [4]:
data = pd.merge(experiment, metadata, how='left', on='patient')
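
To see which rows end up without metadata after a left join, the `indicator` argument of `pd.merge` can help. The following is a minimal sketch on made-up data; the two small dataframes and their values are assumptions for illustration, not the patient data set itself.

```python
import pandas as pd

# Toy stand-ins for the experiment and metadata tables (hypothetical values).
experiment = pd.DataFrame({'patient': [1, 1, 2, 3],
                           'temperature': [38.3, 38.5, 37.9, 39.1]})
metadata = pd.DataFrame({'patient': [1, 2],
                         'gender': ['M', 'F']})

# A left join keeps every experiment row; indicator=True adds a '_merge'
# column that tells whether a matching metadata row was found.
merged = pd.merge(experiment, metadata, how='left', on='patient',
                  indicator=True)

# Rows flagged 'left_only' have no metadata, so 'gender' is missing there.
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched['patient'].unique())
```

Here patient 3 has measurements but no metadata, so its `'gender'` value is missing after the merge.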

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62 entries, 0 to 61
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   patient      62 non-null     int64         
 1   dose         61 non-null     float32       
 2   date         62 non-null     datetime64[ns]
 3   temperature  61 non-null     float32       
 4   gender       55 non-null     category      
 5   condition    55 non-null     category      
dtypes: category(2), datetime64[ns](1), float32(2), int64(1)
memory usage: 1.8 KB


In [6]:
data.head()

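|   | patient | dose | date | temperature | gender | condition |
|---|---------|------|------|-------------|--------|-----------|
| 0 | 1 | 0.0 | 2012-10-02 10:00:00 | 38.299999 | M | A |
| 1 | 1 | 2.0 | 2012-10-02 11:00:00 | 38.5 | M | A |
| 2 | 1 | 2.0 | 2012-10-02 12:00:00 | 38.099998 | M | A |
| 3 | 1 | 2.0 | 2012-10-02 13:00:00 | 37.299999 | M | A |
| 4 | 1 | 0.0 | 2012-10-02 14:00:00 | 37.5 | M | A |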


## pivot

The `pivot` method takes all columns into account. Using the `'date'` column as the new index and `'patient'` as the second column level, we get a new dataframe with $4 \times 9 = 36$ columns. The first-level columns are `'dose'`, `'temperature'`, `'gender'` and `'condition'`; the second level is the `'patient'` ID.

In [7]:
time_series = data.pivot(index='date', columns='patient')

In [8]:
time_series.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 7 entries, 2012-10-02 10:00:00 to 2012-10-02 16:00:00
Data columns (total 36 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 0   (dose, 1)         7 non-null      float32 
 1   (dose, 2)         7 non-null      float32 
 2   (dose, 3)         7 non-null      float32 
 3   (dose, 4)         6 non-null      float32 
 4   (dose, 5)         7 non-null      float32 
 5   (dose, 6)         6 non-null      float32 
 6   (dose, 7)         7 non-null      float32 
 7   (dose, 8)         7 non-null      float32 
 8   (dose, 9)         7 non-null      float32 
 9   (temperature, 1)  7 non-null      float32 
 10  (temperature, 2)  7 non-null      float32 
 11  (temperature, 3)  6 non-null      float32 
 12  (temperature, 4)  7 non-null      float32 
 13  (temperature, 5)  7 non-null      float32 
 14  (temperature, 6)  6 non-null      float32 
 15  (temperature, 7)  7 non-null      float
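
The resulting two-level column structure is easier to inspect on a small example. The sketch below uses a hypothetical four-row dataframe (two points in time, two patients) rather than the patient data set.

```python
import pandas as pd

# Hypothetical long-format data: one row per (date, patient) pair.
long_format = pd.DataFrame({
    'date': ['10:00', '10:00', '11:00', '11:00'],
    'patient': [1, 2, 1, 2],
    'dose': [0.0, 0.0, 2.0, 5.0],
    'temperature': [38.3, 37.9, 38.5, 38.2],
})

# Without `values`, every remaining column becomes a first-level column,
# and the patient IDs form the second level.
wide = long_format.pivot(index='date', columns='patient')
print(wide.columns.tolist())

# Selecting a first-level label returns one sub-table per quantity.
print(wide['temperature'])

# A single cell is addressed by row label and (quantity, patient) tuple.
print(wide.loc['11:00', ('dose', 2)])
```

With two value columns and two patients, this gives $2 \times 2$ columns, analogous to the $4 \times 9$ columns above.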

For each patient, the `'gender'` and `'condition'` columns in this dataframe contain the same value in every row.

The optional `values` argument can be used to select only the columns of interest, e.g., we can discard `'dose'` and `'condition'`.

In [9]:
temp_gender_data = data.pivot(index='date', columns='patient',
                              values=['temperature', 'gender'])

In [10]:
temp_gender_data.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 7 entries, 2012-10-02 10:00:00 to 2012-10-02 16:00:00
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   (temperature, 1)  7 non-null      object
 1   (temperature, 2)  7 non-null      object
 2   (temperature, 3)  6 non-null      object
 3   (temperature, 4)  7 non-null      object
 4   (temperature, 5)  7 non-null      object
 5   (temperature, 6)  6 non-null      object
 6   (temperature, 7)  7 non-null      object
 7   (temperature, 8)  7 non-null      object
 8   (temperature, 9)  7 non-null      object
 9   (gender, 1)       7 non-null      object
 10  (gender, 2)       7 non-null      object
 11  (gender, 3)       7 non-null      object
 12  (gender, 4)       0 non-null      object
 13  (gender, 5)       7 non-null      object
 14  (gender, 6)       6 non-null      object
 15  (gender, 7)       7 non-null      object
 16  (gender, 8)       7 non-nul
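
In the output above, the temperature columns are reported as `object` rather than `float32`. This appears to happen when the columns selected with `values` have different dtypes. One possible workaround, sketched below on hypothetical data, is to pivot a single column at a time so that its numerical dtype is kept.

```python
import pandas as pd

# Hypothetical long-format data with one numerical and one text column.
long_format = pd.DataFrame({
    'date': ['10:00', '10:00', '11:00', '11:00'],
    'patient': [1, 2, 1, 2],
    'temperature': [38.3, 37.9, 38.5, 38.2],
    'gender': ['M', 'F', 'M', 'F'],
})

# Pivoting columns of different dtypes together; inspect the dtypes to see
# whether the numerical column kept its type.
mixed = long_format.pivot(index='date', columns='patient',
                          values=['temperature', 'gender'])
print(mixed.dtypes)

# Pivoting only the numerical column keeps it as a float column.
temperature_only = long_format.pivot(index='date', columns='patient',
                                     values='temperature')
print(temperature_only.dtypes)
```

Note that passing a single column name instead of a list also yields a single-level column index, here just the patient IDs.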

## pivot_table

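The `pivot_table` method, on the other hand, aggregates the data, by default with the mean, so it only makes sense for numerical columns.  Hence it cannot be applied to all columns of this dataframe, since it contains categorical data as well; we select the numerical columns with the `values` argument.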

In [11]:
time_series_table = data.pivot_table(index='date', columns='patient', values=['dose', 'temperature'])

This dataframe has just $2 \times 9$ columns, two top level columns `'dose'` and `'temperature'`, and the `'patient'` ID as second level.

In [12]:
time_series_table.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 7 entries, 2012-10-02 10:00:00 to 2012-10-02 16:00:00
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   (dose, 1)         7 non-null      float32
 1   (dose, 2)         7 non-null      float32
 2   (dose, 3)         7 non-null      float32
 3   (dose, 4)         6 non-null      float32
 4   (dose, 5)         7 non-null      float32
 5   (dose, 6)         6 non-null      float32
 6   (dose, 7)         7 non-null      float32
 7   (dose, 8)         7 non-null      float32
 8   (dose, 9)         7 non-null      float32
 9   (temperature, 1)  7 non-null      float32
 10  (temperature, 2)  7 non-null      float32
 11  (temperature, 3)  6 non-null      float32
 12  (temperature, 4)  7 non-null      float32
 13  (temperature, 5)  7 non-null      float32
 14  (temperature, 6)  6 non-null      float32
 15  (temperature, 7)  7 non-null      float32
 16  (temperat
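
A practical consequence of this difference shows up when an index/column pair occurs more than once: `pivot` refuses to reshape such data, whereas `pivot_table` aggregates the duplicates. The sketch below illustrates this on hypothetical readings in which patient 1 was measured twice at 10:00.

```python
import pandas as pd

# Hypothetical data: patient 1 has two readings at 10:00.
readings = pd.DataFrame({
    'date': ['10:00', '10:00', '10:00', '11:00'],
    'patient': [1, 1, 2, 1],
    'temperature': [38.0, 39.0, 37.5, 38.5],
})

# pivot needs unique (index, columns) pairs and raises a ValueError otherwise.
try:
    readings.pivot(index='date', columns='patient', values='temperature')
    pivot_failed = False
except ValueError as error:
    pivot_failed = True
    print(f'pivot failed: {error}')

# pivot_table aggregates the duplicates, by default with the mean.
averaged = readings.pivot_table(index='date', columns='patient',
                                values='temperature')
print(averaged)
```

The two readings of patient 1 at 10:00 are averaged, and the missing combination (11:00, patient 2) is filled with `NaN`.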

The motivation for this implementation is that `pivot_table` is mainly intended to aggregate data.  For instance, the total dose can be computed.

In [13]:
dose_table = data.pivot_table(index='date',
                              values=['dose'],
                              columns='patient',
                              aggfunc='sum',
                              margins=True,)

In [14]:
dose_table

| date | (dose, 1) | (dose, 2) | (dose, 3) | (dose, 4) | (dose, 5) | (dose, 6) | (dose, 7) | (dose, 8) | (dose, 9) | (dose, All) |
|------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-------------|
| 2012-10-02 10:00:00 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2012-10-02 11:00:00 | 2.0 | 5.0 | 2.0 | 5.0 | 3.0 | 2.0 | 10.0 | 0.0 | 10.0 | 39.0 |
| 2012-10-02 12:00:00 | 2.0 | 5.0 | 5.0 | 5.0 | 7.0 | 3.0 | 5.0 | 0.0 | 12.0 | 44.0 |
| 2012-10-02 13:00:00 | 2.0 | 5.0 | 2.0 | 0.0 | 5.0 | 2.0 | 8.0 | 0.0 | 4.0 | 28.0 |
| 2012-10-02 14:00:00 | 0.0 | 0.0 | 2.0 | 0.0 | 9.0 | 1.0 | 3.0 | 0.0 | 4.0 | 19.0 |
| 2012-10-02 15:00:00 | 0.0 | 0.0 | 2.0 | 0.0 | 3.0 | 0.0 | 3.0 | 0.0 | 0.0 | 8.0 |
| 2012-10-02 16:00:00 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |  | 1.0 | 0.0 | 0.0 | 1.0 |
| All | 6.0 | 15.0 | 13.0 | 10.0 | 27.0 | 8.0 | 30.0 | 0.0 | 30.0 | 139.0 |


Note that the `margins` argument results in the computation of totals for rows and columns (according to the aggregation function).
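
The margins can be checked by hand on a small example. The following sketch uses hypothetical doses for two patients at two points in time, and additionally uses the `margins_name` argument to rename the default `'All'` label.

```python
import pandas as pd

# Hypothetical doses for two patients at two points in time.
doses = pd.DataFrame({
    'date': ['10:00', '10:00', '11:00', '11:00'],
    'patient': [1, 2, 1, 2],
    'dose': [0.0, 1.0, 2.0, 5.0],
})

# margins=True appends a row and a column holding the aggregate (here the
# sum) over all patients and all dates; margins_name sets their label.
table = doses.pivot_table(index='date', columns='patient', values='dose',
                          aggfunc='sum', margins=True, margins_name='total')
print(table)

# The bottom-right cell is the aggregate over the entire dataframe.
print(table.loc['total', 'total'] == doses['dose'].sum())
```

Since `values` is a single column name here rather than a list, the columns are just the patient IDs plus the margin label.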

Compute the maximum temperature for each gender/condition.

In [15]:
data.pivot_table(index=['gender', 'condition'],
                 values='temperature',
                 aggfunc='max',)

| gender | condition | temperature |
|--------|-----------|-------------|
| F | A | 39.400002 |
| F | B | 38.099998 |
| M | A | 39.5 |
| M | B | 40.700001 |
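
An aggregation like this one, without a `columns` argument, can also be written as a `groupby`. The sketch below compares both on hypothetical data with plain string columns; it is meant as an illustration of the correspondence, not as a statement about how the patient data behaves.

```python
import pandas as pd

# Hypothetical measurements with two grouping columns.
patients = pd.DataFrame({
    'gender': ['F', 'F', 'M', 'M', 'M'],
    'condition': ['A', 'A', 'A', 'B', 'B'],
    'temperature': [38.1, 39.4, 39.5, 40.7, 38.0],
})

# Maximum temperature per gender/condition using pivot_table ...
table = patients.pivot_table(index=['gender', 'condition'],
                             values='temperature', aggfunc='max')

# ... and the same aggregation expressed as a groupby.
grouped = patients.groupby(['gender', 'condition'])['temperature'].max()

# Both approaches yield the same values for the same groups.
print(table['temperature'].tolist() == grouped.tolist())
```

The main difference is the return type: `pivot_table` returns a dataframe, while this `groupby` returns a series.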


Compute the total dose and the maximum temperature for each patient grouped by gender.

In [16]:
data.pivot_table(index=['gender', 'patient'],
                 values=['temperature', 'dose'],
                 aggfunc={
                     'temperature': 'max',
                     'dose': 'sum',
                 },)

| gender | patient | dose | temperature |
|--------|---------|------|-------------|
| F | 1 | 0.0 |  |
| F | 2 | 15.0 | 39.400002 |
| F | 3 | 0.0 |  |
| F | 4 | 0.0 |  |
| F | 5 | 0.0 |  |
| F | 6 | 8.0 | 38.099998 |
| F | 7 | 0.0 |  |
| F | 8 | 0.0 | 37.900002 |
| F | 9 | 0.0 |  |
| M | 1 | 6.0 | 38.5 |
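
Rows such as (F, 1), with a dose of 0.0 and no temperature, look like gender/patient combinations that do not occur in the data. This is presumably because `'gender'` is a categorical column, in which case grouping can include unobserved combinations. A minimal sketch on hypothetical data, assuming this is indeed the cause, uses the `observed` argument of `pivot_table` to restrict the result to combinations that are present.

```python
import pandas as pd

# Hypothetical records; 'gender' is categorical, as in the metadata above.
records = pd.DataFrame({
    'gender': pd.Categorical(['M', 'M', 'F', 'M'], categories=['F', 'M']),
    'patient': [1, 1, 2, 3],
    'dose': [2.0, 4.0, 5.0, 1.0],
    'temperature': [38.5, 38.1, 39.4, 37.9],
})
aggregations = {'temperature': 'max', 'dose': 'sum'}

# observed=False may include gender/patient pairs that never occur.
all_combinations = records.pivot_table(index=['gender', 'patient'],
                                       values=['temperature', 'dose'],
                                       aggfunc=aggregations, observed=False)
print(all_combinations)

# observed=True keeps only the pairs that are present in the data.
observed_only = records.pivot_table(index=['gender', 'patient'],
                                    values=['temperature', 'dose'],
                                    aggfunc=aggregations, observed=True)
print(observed_only)
```

The default value of `observed` has changed between pandas versions, so passing it explicitly makes the behaviour independent of the installed version.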
