# 1. Data 📦

In [1]:
# Import packages
import pandas as pd
from seaborn import load_dataset
# Import data 
df = load_dataset('tips').rename(columns={'sex': 'gender'})
df

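| | total_bill | tip | gender | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 |
| 240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 |
| 241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 |
| 242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 |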


# 2. Tips 🌟

## 📍 Tip #1: Use `crosstab()` for multi-variable counts/percentages

You are probably already familiar with the Series method `value_counts()`. Running `df['day'].value_counts()` gives us the counts of unique values in the *day* variable. If we specify `normalize=True` inside the method, it gives us percentages instead. This is useful for a single variable, but sometimes we need to see counts by multiple variables. For instance, if we wanted to get counts by day and time, one way to get this is to use `groupby()` + `size()` + `unstack()`:

In [2]:
df.groupby(['time', 'day']).size().unstack()

| time \ day | Thur | Fri | Sat | Sun |
|---|---|---|---|---|
| Lunch | 61 | 7 | 0 | 0 |
| Dinner | 1 | 12 | 87 | 76 |

Alternatively, `crosstab()` returns the same table with a single function call:


In [3]:
pd.crosstab(df['time'], df['day'])

| time \ day | Thur | Fri | Sat | Sun |
|---|---|---|---|---|
| Lunch | 61 | 7 | 0 | 0 |
| Dinner | 1 | 12 | 87 | 76 |
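
To make the comparison concrete, here is a minimal sketch on a small made-up DataFrame (the `toy` data is hypothetical, not the tips dataset). It shows `value_counts()` with and without `normalize=True`, and checks that the `groupby()` route and `crosstab()` produce the same counts. Because the toy columns are plain strings rather than categoricals, `fill_value=0` is passed to `unstack()` so that missing combinations show as 0.

```python
import pandas as pd

# Hypothetical toy data standing in for the tips dataset
toy = pd.DataFrame({
    'time': ['Lunch', 'Lunch', 'Dinner', 'Dinner', 'Dinner'],
    'day': ['Thur', 'Fri', 'Fri', 'Sat', 'Sat'],
})

# Single variable: counts, then proportions
day_counts = toy['day'].value_counts()
day_shares = toy['day'].value_counts(normalize=True)

# Two variables: both routes give the same table of counts
via_groupby = toy.groupby(['time', 'day']).size().unstack(fill_value=0)
via_crosstab = pd.crosstab(toy['time'], toy['day'])

print(day_counts)
print(day_shares)
print(via_crosstab)
```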


Using `crosstab()` has some advantages. Firstly, it’s easy to get row and column subtotals - we just add `margins=True`:

In [4]:
pd.crosstab(df['time'], df['day'], margins=True)

| time \ day | Thur | Fri | Sat | Sun | All |
|---|---|---|---|---|---|
| Lunch | 61 | 7 | 0 | 0 | 68 |
| Dinner | 1 | 12 | 87 | 76 | 176 |
| All | 62 | 19 | 87 | 76 | 244 |


Secondly, we can easily get percentages instead of counts by tweaking the `normalize` argument:

In [5]:
pd.crosstab(df['time'], df['day'], margins=True, normalize=True)

| time \ day | Thur | Fri | Sat | Sun | All |
|---|---|---|---|---|---|
| Lunch | 0.25 | 0.028689 | 0.0 | 0.0 | 0.278689 |
| Dinner | 0.004098 | 0.04918 | 0.356557 | 0.311475 | 0.721311 |
| All | 0.254098 | 0.077869 | 0.356557 | 0.311475 | 1.0 |


In this example, we got table percentages by setting `normalize=True`, which is equivalent to `normalize='all'`. For row percentages, we use `normalize='index'`; for column percentages, we use `normalize='columns'`. We could further extend the variable sets for columns and rows too:

In [6]:
pd.crosstab([df['time'], df['gender']], [df['day'], df['smoker']], 
            margins=True)

| time | gender | Thur / Yes | Thur / No | Fri / Yes | Fri / No | Sat / Yes | Sat / No | Sun / Yes | Sun / No | All |
|---|---|---|---|---|---|---|---|---|---|---|
| Lunch | Male | 10 | 20 | 3 | 0 | 0 | 0 | 0 | 0 | 33 |
| Lunch | Female | 7 | 24 | 3 | 1 | 0 | 0 | 0 | 0 | 35 |
| Dinner | Male | 0 | 0 | 5 | 2 | 27 | 32 | 15 | 43 | 124 |
| Dinner | Female | 0 | 1 | 4 | 1 | 15 | 13 | 4 | 14 | 52 |
| All | | 17 | 45 | 15 | 4 | 42 | 45 | 19 | 57 | 244 |
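
The row and column percentage options mentioned earlier are not demonstrated above, so here is a rough sketch of them on a small made-up DataFrame (the `toy` data is hypothetical). With `normalize='index'` each row sums to 1, and with `normalize='columns'` each column sums to 1.

```python
import pandas as pd

# Hypothetical toy data standing in for the tips dataset
toy = pd.DataFrame({
    'time': ['Lunch', 'Lunch', 'Dinner', 'Dinner', 'Dinner'],
    'day': ['Thur', 'Fri', 'Fri', 'Sat', 'Sat'],
})

# Row percentages: each row (time) sums to 1
row_pct = pd.crosstab(toy['time'], toy['day'], normalize='index')

# Column percentages: each column (day) sums to 1
col_pct = pd.crosstab(toy['time'], toy['day'], normalize='columns')

print(row_pct)
print(col_pct)
```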


## 📍 Tip #2: Use `groupby()` with `describe()` for group summary statistics

You may already know about both `groupby()` and `describe()`. But have you used them together? By using these two together, we can check the summary statistics of a numeric variable by unique values in a categorical column with just one line like this:

In [7]:
df.groupby('day')['tip'].describe()

| day | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Thur | 62.0 | 2.771452 | 1.240223 | 1.25 | 2.0 | 2.305 | 3.3625 | 6.7 |
| Fri | 19.0 | 2.734737 | 1.019577 | 1.0 | 1.96 | 3.0 | 3.365 | 4.73 |
| Sat | 87.0 | 2.993103 | 1.631014 | 1.0 | 2.0 | 2.75 | 3.37 | 10.0 |
| Sun | 76.0 | 3.255132 | 1.23488 | 1.01 | 2.0375 | 3.15 | 4.0 | 6.5 |
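
Since the result of `groupby()` + `describe()` is an ordinary DataFrame, we can pick out just the statistics we care about. The sketch below illustrates this on a small made-up DataFrame (the `toy` data and the `subset` name are hypothetical).

```python
import pandas as pd

# Hypothetical toy data standing in for the tips dataset
toy = pd.DataFrame({
    'day': ['Fri', 'Fri', 'Sat', 'Sat', 'Sat'],
    'tip': [1.0, 3.0, 2.0, 4.0, 6.0],
})

# One row per day, one column per summary statistic
summary = toy.groupby('day')['tip'].describe()

# Keep only a few of the statistics
subset = summary[['count', 'mean', '50%']]
print(subset)
```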


## 📍 Tip #3: Use `agg()`/`aggregate()` for flexible aggregations

*In this post, we will use `agg()`, the alias of `aggregate()`. However, both can be used interchangeably.*

You probably know the basic aggregation syntax, like this one:

In [8]:
df.groupby('day')[['tip']].mean()

| day | tip |
|---|---|
| Thur | 2.771452 |
| Fri | 2.734737 |
| Sat | 2.993103 |
| Sun | 3.255132 |


Here are some alternative ways to get the same output with `agg()`:

In [9]:
df.groupby('day')[['tip']].agg('mean')

| day | tip |
|---|---|
| Thur | 2.771452 |
| Fri | 2.734737 |
| Sat | 2.993103 |
| Sun | 3.255132 |


In [10]:
df.groupby('day').agg({'tip': 'mean'})

| day | tip |
|---|---|
| Thur | 2.771452 |
| Fri | 2.734737 |
| Sat | 2.993103 |
| Sun | 3.255132 |


In this simple example, there is no clear advantage to using `agg()` over the first approach. However, `agg()` gives us more flexibility when we want to look at the output of multiple aggregate functions. For instance, we can get both the mean and the standard deviation in one go by passing either a list or a dictionary to `agg()`.

In [11]:
df.groupby('day')[['tip']].agg(['mean', 'std']) # list

| day | tip / mean | tip / std |
|---|---|---|
| Thur | 2.771452 | 1.240223 |
| Fri | 2.734737 | 1.019577 |
| Sat | 2.993103 | 1.631014 |
| Sun | 3.255132 | 1.23488 |


In [12]:
df.groupby(['day']).agg({'tip': ['mean', 'std']}) # dictionary

| day | tip / mean | tip / std |
|---|---|---|
| Thur | 2.771452 | 1.240223 |
| Fri | 2.734737 | 1.019577 |
| Sat | 2.993103 | 1.631014 |
| Sun | 3.255132 | 1.23488 |


If we ever had to rename the output columns, instead of doing this:

In [13]:
df.groupby('day')[['tip']].agg(['mean', 'std']).rename(
    columns={'mean': 'avg', 'std': 'sd'}
)

| day | tip / avg | tip / sd |
|---|---|---|
| Thur | 2.771452 | 1.240223 |
| Fri | 2.734737 | 1.019577 |
| Sat | 2.993103 | 1.631014 |
| Sun | 3.255132 | 1.23488 |


we could do either of these more succinctly:

In [14]:
df.groupby(['day'])[['tip']].agg([('avg', 'mean'), ('sd', 'std')])

| day | tip / avg | tip / sd |
|---|---|---|
| Thur | 2.771452 | 1.240223 |
| Fri | 2.734737 | 1.019577 |
| Sat | 2.993103 | 1.631014 |
| Sun | 3.255132 | 1.23488 |


In [15]:
df.groupby(['day']).agg({'tip': [('avg', 'mean'), ('sd', 'std')]})

| day | tip / avg | tip / sd |
|---|---|---|
| Thur | 2.771452 | 1.240223 |
| Fri | 2.734737 | 1.019577 |
| Sat | 2.993103 | 1.631014 |
| Sun | 3.255132 | 1.23488 |
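
A further option, which this post does not otherwise use, is pandas' named aggregation: keyword arguments of the form `new_name=(column, function)`. Here is a minimal sketch on a small made-up DataFrame (the `toy` data is hypothetical). Note that, unlike the outputs above, the resulting columns are flat rather than two-level.

```python
import pandas as pd

# Hypothetical toy data standing in for the tips dataset
toy = pd.DataFrame({
    'day': ['Fri', 'Fri', 'Sat', 'Sat', 'Sat'],
    'tip': [1.0, 3.0, 2.0, 4.0, 6.0],
})

# Named aggregation: output column name = (input column, function)
renamed = toy.groupby('day').agg(
    avg=('tip', 'mean'),
    sd=('tip', 'std'),
)
print(renamed)
```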


So far, using a list or a dictionary has worked equally well. However, a list is more concise if we want to inspect the same set of summary statistics for multiple variables.

In [16]:
df.groupby('day')[['tip', 'size']].agg(['mean', 'std'])

| day | tip / mean | tip / std | size / mean | size / std |
|---|---|---|---|---|
| Thur | 2.771452 | 1.240223 | 2.451613 | 1.066285 |
| Fri | 2.734737 | 1.019577 | 2.105263 | 0.567131 |
| Sat | 2.993103 | 1.631014 | 2.517241 | 0.819275 |
| Sun | 3.255132 | 1.23488 | 2.842105 | 1.007341 |


On the other hand, sometimes a dictionary is the way to go. With a dictionary, we can specify a different set of aggregate functions for each variable:

In [17]:
df.groupby(['day']).agg(
    {
        'tip': ['mean', 'std'],
        'size': ['median']
    }
)

| day | tip / mean | tip / std | size / median |
|---|---|---|---|
| Thur | 2.771452 | 1.240223 | 2.0 |
| Fri | 2.734737 | 1.019577 | 2.0 |
| Sat | 2.993103 | 1.631014 | 2.0 |
| Sun | 3.255132 | 1.23488 | 2.0 |


There are many aggregate functions to use:

- Frequency / Counts: `size()`, `count()`

- Central tendency: `mean()`, `median()`

- Dispersion: `std()`, `var()`

- Others: `min()`, `max()`, `sum()`, `prod()`, `quantile()` and many more.

On top of these, many other Series methods can be called directly on a grouped column. For instance, to see the highest two tips by day, we call `nlargest()` on the grouped column rather than passing it to `agg()`:

In [18]:
df.groupby('day')['tip'].nlargest(2)

day      
Thur  141     6.70
      88      5.85
Fri   95      4.73
      93      4.30
Sat   170    10.00
      212     9.00
Sun   183     6.50
      47      6.00
Name: tip, dtype: float64

In addition, we could use lambda functions too:

In [19]:
df.groupby(['day']).agg(
    {
        'tip': [
            ('range', lambda x: x.max() - x.min()),
            ('IQR', lambda x: x.quantile(.75) - x.quantile(.25))
        ]
    }
)

| day | tip / range | tip / IQR |
|---|---|---|
| Thur | 5.45 | 1.3625 |
| Fri | 3.73 | 1.405 |
| Sat | 9.0 | 1.37 |
| Sun | 5.49 | 1.9625 |
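
If the lambdas start to feel cluttered, one alternative is to define small named functions and mix them with the built-in function names in a list. In that case pandas uses each function's name as the output column name. This is only a sketch on a small made-up DataFrame; the `toy` data and the helper names `value_range` and `iqr` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data standing in for the tips dataset
toy = pd.DataFrame({
    'day': ['Fri', 'Fri', 'Sat', 'Sat', 'Sat'],
    'tip': [1.0, 3.0, 2.0, 4.0, 6.0],
})

def value_range(x):
    """Difference between the largest and smallest value."""
    return x.max() - x.min()

def iqr(x):
    """Interquartile range: 75th percentile minus 25th percentile."""
    return x.quantile(0.75) - x.quantile(0.25)

# Built-in names and custom functions side by side
spread = toy.groupby('day')['tip'].agg(['min', 'max', value_range, iqr])
print(spread)
```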


## 📍 Tip #4: Take advantage of `pivot_table()`

Let’s say we needed to get the mean tip by 2 variables. A common approach is to use `groupby()`:

In [20]:
df.groupby(['time', 'day'])['tip'].mean().unstack()

| time \ day | Thur | Fri | Sat | Sun |
|---|---|---|---|---|
| Lunch | 2.767705 | 2.382857 | NaN | NaN |
| Dinner | 3.0 | 2.94 | 2.993103 | 3.255132 |


But a slightly better way to do this is to use `pivot_table()`:

In [21]:
df.pivot_table(values='tip', index='time', columns='day')

| time \ day | Thur | Fri | Sat | Sun |
|---|---|---|---|---|
| Lunch | 2.767705 | 2.382857 | NaN | NaN |
| Dinner | 3.0 | 2.94 | 2.993103 | 3.255132 |


I think readability won’t be compromised if we omit the first argument name to be slightly more concise:

In [22]:
df.pivot_table('tip', index='time', columns='day')

| time \ day | Thur | Fri | Sat | Sun |
|---|---|---|---|---|
| Lunch | 2.767705 | 2.382857 | NaN | NaN |
| Dinner | 3.0 | 2.94 | 2.993103 | 3.255132 |


By default, `pivot_table()` gives us mean values. However, we can easily switch to our preferred function, such as `sum()`, by passing it to the `aggfunc` argument:

In [23]:
df.pivot_table('tip', index='time', columns='day', aggfunc='sum')

| time \ day | Thur | Fri | Sat | Sun |
|---|---|---|---|---|
| Lunch | 168.83 | 16.68 | 0.0 | 0.0 |
| Dinner | 3.0 | 35.28 | 260.4 | 247.39 |


Similar to `crosstab()`, it's also easy to get subtotals with `pivot_table()`.

In [24]:
df.pivot_table('tip', index='time', columns='day', aggfunc='sum', 
               margins=True)

| time \ day | Thur | Fri | Sat | Sun | All |
|---|---|---|---|---|---|
| Lunch | 168.83 | 16.68 | 0.0 | 0.0 | 185.51 |
| Dinner | 3.0 | 35.28 | 260.4 | 247.39 | 546.07 |
| All | 171.83 | 51.96 | 260.4 | 247.39 | 731.58 |


We can also pass a dictionary to `aggfunc` to customise the aggregate function for each variable passed to the `values` argument. Another useful argument is `fill_value`, where we specify the value to show wherever the output would otherwise be missing. Let’s look at an example illustrating these points:

In [25]:
df.pivot_table(
    ['tip', 'size'], 
    index=['time', 'smoker'], 
    columns='day', 
    fill_value=0, 
    margins=True,
    aggfunc={'tip': 'sum', 'size': 'max'}
)

| time | smoker | size / Thur | size / Fri | size / Sat | size / Sun | size / All | tip / Thur | tip / Fri | tip / Sat | tip / Sun | tip / All |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Lunch | Yes | 4 | 2 | 0 | 0 | 4 | 51.51 | 13.68 | 0.0 | 0.0 | 65.19 |
| Lunch | No | 6 | 3 | 0 | 0 | 6 | 117.32 | 3.0 | 0.0 | 0.0 | 120.32 |
| Dinner | Yes | 0 | 4 | 5 | 5 | 5 | 0.0 | 27.03 | 120.77 | 66.82 | 214.62 |
| Dinner | No | 2 | 2 | 4 | 6 | 6 | 3.0 | 8.25 | 139.63 | 180.57 | 331.45 |
| All | | 6 | 4 | 5 | 6 | 6 | 171.83 | 51.96 | 260.4 | 247.39 | 731.58 |


With `pivot_table()`, we know exactly which variables end up in the rows and columns, and no reshaping of the data is necessary.
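
As a quick sanity check of the ideas in this tip, the sketch below uses a small made-up DataFrame (the `toy` data is hypothetical) to confirm that `pivot_table()` matches the `groupby()` + `unstack()` route, and to show `fill_value` together with a dictionary passed to `aggfunc`.

```python
import pandas as pd

# Hypothetical toy data standing in for the tips dataset
toy = pd.DataFrame({
    'time': ['Lunch', 'Lunch', 'Dinner', 'Dinner', 'Dinner'],
    'day': ['Thur', 'Fri', 'Fri', 'Sat', 'Sat'],
    'tip': [2.0, 3.0, 4.0, 1.0, 5.0],
    'size': [2, 3, 2, 4, 2],
})

# Mean tip by time and day, two ways
via_groupby = toy.groupby(['time', 'day'])['tip'].mean().unstack()
via_pivot = toy.pivot_table('tip', index='time', columns='day')
same = via_groupby.equals(via_pivot)  # missing cells (NaN) line up too
print(same)

# Different aggregate function per variable, with missing cells shown as 0
filled = toy.pivot_table(
    ['tip', 'size'],
    index='time',
    columns='day',
    aggfunc={'tip': 'sum', 'size': 'max'},
    fill_value=0,
)
print(filled)
```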

## 📍 Tip #5: Add aggregate statistics to the data with `transform()`

This tip is useful when we want to append the group aggregate measures back to the ungrouped data. Here is an example to clarify this:

In [26]:
df['avg_tip_by_gender'] = df.groupby('gender')['tip'].transform('mean')
df.head()

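| | total_bill | tip | gender | smoker | day | time | size | avg_tip_by_gender |
|---|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 2.833448 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 3.089618 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 | 3.089618 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 3.089618 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 2.833448 |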


In this example, the newly created variable *avg_tip_by_gender* shows the average *tip* by *gender*. In other words, the mean *tip* by *gender* shown below has been added back to the ungrouped data.

In [27]:
df.groupby('gender')['tip'].agg(['mean', 'std'])

| gender | mean | std |
|---|---|---|
| Male | 3.089618 | 1.489102 |
| Female | 2.833448 | 1.159495 |


Now, let’s take a slightly more advanced example:

In [28]:
df['n_sd_from_gender_avg_tip'] = df.groupby('gender')['tip'].transform(
    lambda x: (x-x.mean())/x.std()
)
df.head()

| | total_bill | tip | gender | smoker | day | time | size | avg_tip_by_gender | n_sd_from_gender_avg_tip |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 2.833448 | -1.572623 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 3.089618 | -0.960054 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 | 3.089618 | 0.27559 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 3.089618 | 0.147997 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 2.833448 | 0.669733 |


Here, using a `lambda` function, we did three things for the variable *tip*:

- `x.mean()`: Find the mean by gender.

- `x-x.mean()`: Find the distance from the mean by gender.

- `(x-x.mean())/x.std()`: Express that distance in units of standard deviation.

Let’s take the first record (index=0) as an example and round the numbers to 2 decimal places for simplicity: `x=1.01`, `x.mean()=2.83`, `x.std()=1.16`

Then, *n_sd_from_gender_avg_tip* = (1.01 - 2.83)/ 1.16 = -1.57

That’s what we find in the first row of *n_sd_from_gender_avg_tip*. For this record, the *tip* amount was about 1.57 standard deviations lower than the average *tip* among female customers.
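
The same hand calculation can be repeated in code. The sketch below uses a small made-up DataFrame (the `toy` data is hypothetical, so the numbers differ from the tips dataset) and checks that the `lambda` version agrees with a manual computation built from the group mean and group standard deviation.

```python
import pandas as pd

# Hypothetical toy data standing in for the tips dataset
toy = pd.DataFrame({
    'gender': ['Female', 'Female', 'Male', 'Male', 'Male'],
    'tip': [1.0, 3.0, 2.0, 4.0, 6.0],
})

# Group mean and group standard deviation, aligned to the original rows
toy['avg_tip_by_gender'] = toy.groupby('gender')['tip'].transform('mean')
group_sd = toy.groupby('gender')['tip'].transform('std')

# Distance from the group mean in units of group standard deviation
toy['n_sd_from_gender_avg_tip'] = toy.groupby('gender')['tip'].transform(
    lambda x: (x - x.mean()) / x.std()
)

# Manual version of the same calculation
manual = (toy['tip'] - toy['avg_tip_by_gender']) / group_sd
print(toy)
```

Because `transform()` returns a result with the same index as the input, the new columns line up with the original rows without any merging.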