# Pandas Introduction - DataFrame Part 5 - CSVs, Data Aggregation, and Grouping

This notebook supplements the notebooks from Chapter 7.14 in ***Intro to Python for Computer Science and Data Science*** with more information on Pandas DataFrames. You should study both sets of notebooks.

The examples here were extracted and adapted from ***Python for Data Analysis*** by Wes McKinney (ISBN: 9781449319793).

In [111]:
import numpy as np
from pandas import Series, DataFrame
import pandas as pd

### Reading Data from Comma Separated Value (CSV) Files

In [112]:
data_in = 'tips.csv'

Read the data file into a DataFrame.  CSV files often have a header row (this one does), and by default Pandas uses that row for the column names.  You are also free to define the column names explicitly.

If the CSV does not have a header row, pass *header=None* to *read_csv* and supply the column names through its *names* parameter.  See the Pandas documentation for the other options.
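As a minimal sketch of the headerless case, the block below reads CSV text from an in-memory buffer instead of a file. The two data rows and the `raw` and `no_header` names are made up for illustration; only the column labels are borrowed from the tips data.

```python
import io
import pandas as pd

# Hypothetical headerless CSV text standing in for a file on disk.
raw = "16.99,1.01,No,Sun,Dinner,2\n10.34,1.66,No,Sun,Dinner,3\n"

# header=None tells pandas the first line is data, not column names;
# names supplies the labels to use instead.
columns = ['total_bill', 'tip', 'smoker', 'day', 'time', 'size']
no_header = pd.read_csv(io.StringIO(raw), header=None, names=columns)

print(no_header)
```

Without *header=None*, the first data row would be consumed as the header.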

In [113]:
tips = pd.read_csv(data_in)

Let's take a look at the data.  Note that Pandas conveniently abbreviates the display in Jupyter, showing only the first and last rows, when a DataFrame is large.

In [114]:
tips

Unnamed: 0,total_bill,tip,smoker,day,time,size
0,16.99,1.01,No,Sun,Dinner,2
1,10.34,1.66,No,Sun,Dinner,3
2,21.01,3.50,No,Sun,Dinner,3
3,23.68,3.31,No,Sun,Dinner,2
4,24.59,3.61,No,Sun,Dinner,4
...,...,...,...,...,...,...
239,29.03,5.92,No,Sat,Dinner,3
240,27.18,2.00,Yes,Sat,Dinner,2
241,22.67,2.00,Yes,Sat,Dinner,2
242,17.82,1.75,No,Sat,Dinner,2


It is just as easy to write a DataFrame to a CSV file.  By default, *to_csv* also writes the index as the first column.

In [115]:
data_out = 'tips_from_df.csv'

In [116]:
tips.to_csv(data_out)
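The sketch below illustrates how the index affects a round trip through CSV. It writes to in-memory buffers rather than to disk, and the small `sample` frame is invented for the example. Passing *index=False* is one way to keep the index out of the file.

```python
import io
import pandas as pd

# Small made-up frame standing in for the tips data.
sample = pd.DataFrame({'total_bill': [16.99, 10.34], 'tip': [1.01, 1.66]})

# Default behavior: the index is written as an unnamed first column.
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
reread_with_index = pd.read_csv(with_index)

# index=False: only the data columns are written.
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
reread_without_index = pd.read_csv(without_index)

print(list(reread_with_index.columns))
print(list(reread_without_index.columns))
```

When the file is read back, the unnamed index column comes back as an ordinary column that Pandas labels `Unnamed: 0`.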

### Exploring DataFrame Contents

Here are some more ways to look at the data.  With very large data sets, being able to look at just a portion of the data is useful.

In [117]:
tips.head(10)

Unnamed: 0,total_bill,tip,smoker,day,time,size
0,16.99,1.01,No,Sun,Dinner,2
1,10.34,1.66,No,Sun,Dinner,3
2,21.01,3.5,No,Sun,Dinner,3
3,23.68,3.31,No,Sun,Dinner,2
4,24.59,3.61,No,Sun,Dinner,4
5,25.29,4.71,No,Sun,Dinner,4
6,8.77,2.0,No,Sun,Dinner,2
7,26.88,3.12,No,Sun,Dinner,4
8,15.04,1.96,No,Sun,Dinner,2
9,14.78,3.23,No,Sun,Dinner,2


In [118]:
tips.tail(10)

Unnamed: 0,total_bill,tip,smoker,day,time,size
234,15.53,3.0,Yes,Sat,Dinner,2
235,10.07,1.25,No,Sat,Dinner,2
236,12.6,1.0,Yes,Sat,Dinner,2
237,32.83,1.17,Yes,Sat,Dinner,2
238,35.83,4.67,No,Sat,Dinner,3
239,29.03,5.92,No,Sat,Dinner,3
240,27.18,2.0,Yes,Sat,Dinner,2
241,22.67,2.0,Yes,Sat,Dinner,2
242,17.82,1.75,No,Sat,Dinner,2
243,18.78,3.0,No,Thur,Dinner,2


In [119]:
tips[['total_bill', 'tip', 'size']].head(10)

Unnamed: 0,total_bill,tip,size
0,16.99,1.01,2
1,10.34,1.66,3
2,21.01,3.5,3
3,23.68,3.31,2
4,24.59,3.61,4
5,25.29,4.71,4
6,8.77,2.0,2
7,26.88,3.12,4
8,15.04,1.96,2
9,14.78,3.23,2


In [120]:
tips[:6]

Unnamed: 0,total_bill,tip,smoker,day,time,size
0,16.99,1.01,No,Sun,Dinner,2
1,10.34,1.66,No,Sun,Dinner,3
2,21.01,3.5,No,Sun,Dinner,3
3,23.68,3.31,No,Sun,Dinner,2
4,24.59,3.61,No,Sun,Dinner,4
5,25.29,4.71,No,Sun,Dinner,4


In [121]:
tips[5:10]

Unnamed: 0,total_bill,tip,smoker,day,time,size
5,25.29,4.71,No,Sun,Dinner,4
6,8.77,2.0,No,Sun,Dinner,2
7,26.88,3.12,No,Sun,Dinner,4
8,15.04,1.96,No,Sun,Dinner,2
9,14.78,3.23,No,Sun,Dinner,2


In [122]:
tips[tips['smoker'] == 'Yes'].head(10)

Unnamed: 0,total_bill,tip,smoker,day,time,size
56,38.01,3.0,Yes,Sat,Dinner,4
58,11.24,1.76,Yes,Sat,Dinner,2
60,20.29,3.21,Yes,Sat,Dinner,2
61,13.81,2.0,Yes,Sat,Dinner,2
62,11.02,1.98,Yes,Sat,Dinner,2
63,18.29,3.76,Yes,Sat,Dinner,4
67,3.07,1.0,Yes,Sat,Dinner,1
69,15.01,2.09,Yes,Sat,Dinner,2
72,26.86,3.14,Yes,Sat,Dinner,2
73,25.28,5.0,Yes,Sat,Dinner,2


In [123]:
tips[(tips['smoker'] == 'No') & (tips['time'] == 'Dinner') & (tips['size'] > 3)]

Unnamed: 0,total_bill,tip,smoker,day,time,size
4,24.59,3.61,No,Sun,Dinner,4
5,25.29,4.71,No,Sun,Dinner,4
7,26.88,3.12,No,Sun,Dinner,4
11,35.26,5.0,No,Sun,Dinner,4
13,18.43,3.0,No,Sun,Dinner,4
23,39.42,7.58,No,Sat,Dinner,4
25,17.81,2.34,No,Sat,Dinner,4
31,18.35,2.5,No,Sat,Dinner,4
33,20.69,2.45,No,Sat,Dinner,4
44,30.4,5.6,No,Sun,Dinner,4


Hmmm...it looks like we don't have data for every day of the week.  Let's check with *value_counts*...

In [124]:
tips['day'].value_counts()

Sat     87
Sun     76
Thur    62
Fri     19
Name: day, dtype: int64

What about some of the other columns?

In [125]:
tips['smoker'].value_counts()

No     151
Yes     93
Name: smoker, dtype: int64

In [126]:
tips['size'].value_counts()

2    156
3     38
4     37
5      5
1      4
6      4
Name: size, dtype: int64

### Grouping and Aggregating

What are the most popular days to eat?  Are they the same for smokers and nonsmokers?

In [127]:
tips.groupby(['smoker', 'day']).size()

smoker  day 
No      Fri      4
        Sat     45
        Sun     57
        Thur    45
Yes     Fri     15
        Sat     42
        Sun     19
        Thur    17
dtype: int64

Use *unstack* to "unstack" the hierarchical index into a table.

In [128]:
tips.groupby(['smoker', 'day']).size().unstack()

day,Fri,Sat,Sun,Thur
smoker,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
No,4,45,57,45
Yes,15,42,19,17
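Here is a small, self-contained sketch of the same group-count-unstack pattern. The `sample` data is made up, and the *fill_value=0* argument is an optional addition not used in the cell above: it fills combinations that never occur with zero instead of NaN.

```python
import pandas as pd

# Made-up data: note there is no ('No', 'Sun') or ('Yes', 'Fri') row.
sample = pd.DataFrame({
    'smoker': ['No', 'No', 'Yes', 'No', 'Yes'],
    'day':    ['Fri', 'Sat', 'Sat', 'Sat', 'Sun'],
})

# size() yields a Series with a two-level (smoker, day) index.
counts = sample.groupby(['smoker', 'day']).size()

# unstack() moves the innermost index level (day) into the columns.
table = counts.unstack(fill_value=0)

print(table)
```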


Let's look at some descriptive statistics for the data while grouping the data.

In [129]:
tips.groupby(['smoker', 'day']).describe()

Unnamed: 0_level_0,Unnamed: 1_level_0,total_bill,total_bill,total_bill,total_bill,total_bill,total_bill,total_bill,total_bill,tip,tip,tip,tip,tip,size,size,size,size,size,size,size,size
Unnamed: 0_level_1,Unnamed: 1_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
smoker,day,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2,Unnamed: 22_level_2
No,Fri,4.0,18.42,5.059282,12.46,15.1,19.235,22.555,22.75,4.0,2.8125,...,3.3125,3.5,4.0,2.25,0.5,2.0,2.0,2.0,2.25,3.0
No,Sat,45.0,19.661778,8.939181,7.25,14.73,17.82,20.65,48.33,45.0,3.102889,...,3.39,9.0,45.0,2.555556,0.78496,1.0,2.0,2.0,3.0,4.0
No,Sun,57.0,20.506667,8.130189,8.77,14.78,18.43,25.0,48.17,57.0,3.167895,...,3.92,6.0,57.0,2.929825,1.032674,2.0,2.0,3.0,4.0,6.0
No,Thur,45.0,17.113111,7.721728,7.51,11.69,15.95,20.27,41.19,45.0,2.673778,...,3.0,6.7,45.0,2.488889,1.179796,1.0,2.0,2.0,2.0,6.0
Yes,Fri,15.0,16.813333,9.086388,5.75,11.69,13.42,18.665,40.17,15.0,2.714,...,3.24,4.73,15.0,2.066667,0.593617,1.0,2.0,2.0,2.0,4.0
Yes,Sat,42.0,21.276667,10.069138,3.07,13.405,20.39,26.7925,50.81,42.0,2.875476,...,3.1975,10.0,42.0,2.47619,0.862161,1.0,2.0,2.0,3.0,5.0
Yes,Sun,19.0,24.12,10.442511,7.25,17.165,23.1,32.375,45.35,19.0,3.516842,...,4.0,6.5,19.0,2.578947,0.901591,2.0,2.0,2.0,3.0,5.0
Yes,Thur,17.0,19.190588,8.355149,10.34,13.51,16.47,19.81,43.11,17.0,3.03,...,4.0,5.0,17.0,2.352941,0.701888,2.0,2.0,2.0,2.0,4.0


That was a little overwhelming.  Let's look at just the descriptive statistics for the total bill.

In [130]:
tips.groupby(['smoker', 'day'])['total_bill'].describe()

Unnamed: 0_level_0,Unnamed: 1_level_0,count,mean,std,min,25%,50%,75%,max
smoker,day,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
No,Fri,4.0,18.42,5.059282,12.46,15.1,19.235,22.555,22.75
No,Sat,45.0,19.661778,8.939181,7.25,14.73,17.82,20.65,48.33
No,Sun,57.0,20.506667,8.130189,8.77,14.78,18.43,25.0,48.17
No,Thur,45.0,17.113111,7.721728,7.51,11.69,15.95,20.27,41.19
Yes,Fri,15.0,16.813333,9.086388,5.75,11.69,13.42,18.665,40.17
Yes,Sat,42.0,21.276667,10.069138,3.07,13.405,20.39,26.7925,50.81
Yes,Sun,19.0,24.12,10.442511,7.25,17.165,23.1,32.375,45.35
Yes,Thur,17.0,19.190588,8.355149,10.34,13.51,16.47,19.81,43.11


We can go deeper on our grouping...

In [131]:
tips.groupby(['smoker', 'day', 'time']).size()

smoker  day   time  
No      Fri   Dinner     3
              Lunch      1
        Sat   Dinner    45
        Sun   Dinner    57
        Thur  Dinner     1
              Lunch     44
Yes     Fri   Dinner     9
              Lunch      6
        Sat   Dinner    42
        Sun   Dinner    19
        Thur  Lunch     17
dtype: int64

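Unstack this to get a cross-tabulation.  Combinations that never occur in the data show up as NaN, which is also why the counts are displayed as floats.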

In [132]:
tips.groupby(['smoker', 'day', 'time']).size().unstack().unstack()

time,Dinner,Dinner,Dinner,Dinner,Lunch,Lunch,Lunch,Lunch
day,Fri,Sat,Sun,Thur,Fri,Sat,Sun,Thur
smoker,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
No,3.0,45.0,57.0,1.0,1.0,,,44.0
Yes,9.0,42.0,19.0,,6.0,,,17.0


In [133]:
tips.groupby(['smoker', 'day', 'time']).size().unstack()

Unnamed: 0_level_0,time,Dinner,Lunch
smoker,day,Unnamed: 2_level_1,Unnamed: 3_level_1
No,Fri,3.0,1.0
No,Sat,45.0,
No,Sun,57.0,
No,Thur,1.0,44.0
Yes,Fri,9.0,6.0
Yes,Sat,42.0,
Yes,Sun,19.0,
Yes,Thur,,17.0


We could also create a new dataframe using the results of the groupby and aggregation.

In [134]:
size_df = tips.groupby(['smoker', 'day', 'time']).size().reset_index(name='counts')
size_df

Unnamed: 0,smoker,day,time,counts
0,No,Fri,Dinner,3
1,No,Fri,Lunch,1
2,No,Sat,Dinner,45
3,No,Sun,Dinner,57
4,No,Thur,Dinner,1
5,No,Thur,Lunch,44
6,Yes,Fri,Dinner,9
7,Yes,Fri,Lunch,6
8,Yes,Sat,Dinner,42
9,Yes,Sun,Dinner,19


It looks like there were 3 nonsmoking parties at Friday dinner.  Let's find those rows.

In [135]:
tips[(tips['smoker'] == 'No') & (tips['day'] == 'Fri') & (tips['time'] == 'Dinner')]

Unnamed: 0,total_bill,tip,smoker,day,time,size
91,22.49,3.5,No,Fri,Dinner,2
94,22.75,3.25,No,Fri,Dinner,2
99,12.46,1.5,No,Fri,Dinner,2


What do the bills and tips average on each day?

In [136]:
tips.groupby('day')[['total_bill', 'tip']].mean()

Unnamed: 0_level_0,total_bill,tip
day,Unnamed: 1_level_1,Unnamed: 2_level_1
Fri,17.151579,2.734737
Sat,20.441379,2.993103
Sun,21.41,3.255132
Thur,17.682742,2.771452


Compute an interesting subset of the results.  Note the use of the *agg* method to invoke a set of operations upon specific columns for each of the groups.

In [137]:
tips.groupby(['day', 'size'])[['total_bill', 'tip']].agg(['mean', 'std', 'min', 'max'])

Unnamed: 0_level_0,Unnamed: 1_level_0,total_bill,total_bill,total_bill,total_bill,tip,tip,tip,tip
Unnamed: 0_level_1,Unnamed: 1_level_1,mean,std,min,max,mean,std,min,max
day,size,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2
Fri,1,8.58,,8.58,8.58,1.92,,1.92,1.92
Fri,2,16.321875,6.45553,5.75,28.97,2.644375,0.96145,1.0,4.3
Fri,3,15.98,,15.98,15.98,3.0,,3.0,3.0
Fri,4,40.17,,40.17,40.17,4.73,,4.73,4.73
Sat,1,5.16,2.955706,3.07,7.25,1.0,0.0,1.0,1.0
Sat,2,16.83717,5.514486,7.74,32.83,2.517547,0.977048,1.0,5.0
Sat,3,25.509444,10.158213,15.69,50.81,3.797778,2.017869,2.0,10.0
Sat,4,29.876154,11.368102,17.81,48.33,4.123846,2.267695,2.0,9.0
Sat,5,28.15,,28.15,28.15,3.0,,3.0,3.0
Sun,2,17.56,7.367284,7.25,40.55,2.816923,1.040557,1.01,5.65
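When each column needs a different statistic, *agg* also accepts keyword arguments of the form *new_name=(column, function)*, often called named aggregation. The following is a sketch on made-up data; the `mean_bill` and `max_tip` names are chosen for illustration only.

```python
import pandas as pd

# Made-up data standing in for the tips DataFrame.
sample = pd.DataFrame({
    'day': ['Fri', 'Fri', 'Sat', 'Sat'],
    'total_bill': [10.0, 20.0, 30.0, 50.0],
    'tip': [1.0, 3.0, 4.0, 10.0],
})

# Each keyword names an output column; the tuple gives the source
# column and the aggregation to apply to it.
summary = sample.groupby('day').agg(
    mean_bill=('total_bill', 'mean'),
    max_tip=('tip', 'max'),
)

print(summary)
```

This form produces flat column names rather than the two-level column header shown above.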


Let's add a column for tip percentage.  Note that the value is stored as a fraction of the total bill, so 0.15 means 15%.

In [138]:
tips['tip_percent'] = tips['tip'] / tips['total_bill']

In [139]:
tips.head(10)

Unnamed: 0,total_bill,tip,smoker,day,time,size,tip_percent
0,16.99,1.01,No,Sun,Dinner,2,0.059447
1,10.34,1.66,No,Sun,Dinner,3,0.160542
2,21.01,3.5,No,Sun,Dinner,3,0.166587
3,23.68,3.31,No,Sun,Dinner,2,0.13978
4,24.59,3.61,No,Sun,Dinner,4,0.146808
5,25.29,4.71,No,Sun,Dinner,4,0.18624
6,8.77,2.0,No,Sun,Dinner,2,0.22805
7,26.88,3.12,No,Sun,Dinner,4,0.116071
8,15.04,1.96,No,Sun,Dinner,2,0.130319
9,14.78,3.23,No,Sun,Dinner,2,0.218539


Show some averages by day...

In [140]:
tips.groupby('day')[['total_bill', 'tip', 'tip_percent']].mean()

Unnamed: 0_level_0,total_bill,tip,tip_percent
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Fri,17.151579,2.734737,0.169913
Sat,20.441379,2.993103,0.153152
Sun,21.41,3.255132,0.166897
Thur,17.682742,2.771452,0.161276


Another grouping looking at means...

In [141]:
tips.groupby(['smoker', 'day', 'time'])[['total_bill', 'tip', 'tip_percent']].mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,total_bill,tip,tip_percent
smoker,day,time,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
No,Fri,Dinner,19.233333,2.75,0.139622
No,Fri,Lunch,15.98,3.0,0.187735
No,Sat,Dinner,19.661778,3.102889,0.158048
No,Sun,Dinner,20.506667,3.167895,0.160113
No,Thur,Dinner,18.78,3.0,0.159744
No,Thur,Lunch,17.075227,2.666364,0.160311
Yes,Fri,Dinner,19.806667,3.003333,0.165347
Yes,Fri,Lunch,12.323333,2.28,0.188937
Yes,Sat,Dinner,21.276667,2.875476,0.147906
Yes,Sun,Dinner,24.12,3.516842,0.18725


What is the overall average tip percentage?

In [142]:
tips['tip_percent'].mean()

0.16080258172250478

That is a lot of decimal places.  Make things a bit more readable.

In [143]:
f'Average tip = {tips["tip_percent"].mean()*100:.2f}%'

'Average tip = 16.08%'
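The figure above is the mean of the per-row tip fractions, so every bill counts equally regardless of its size. A different question is what fraction of all money billed was tipped, which weights large bills more heavily. The sketch below contrasts the two on made-up numbers; the variable names are illustrative.

```python
import pandas as pd

# Made-up data: one small bill with a generous tip, one large bill
# with a modest tip.
sample = pd.DataFrame({'total_bill': [10.0, 40.0], 'tip': [2.0, 4.0]})

# Mean of the per-row ratios (what tips['tip_percent'].mean() computes).
mean_of_ratios = (sample['tip'] / sample['total_bill']).mean()

# Ratio of the column totals (total tips over total bills).
ratio_of_sums = sample['tip'].sum() / sample['total_bill'].sum()

print(f'Mean of ratios = {mean_of_ratios * 100:.2f}%')
print(f'Ratio of sums  = {ratio_of_sums * 100:.2f}%')
```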

Let's chain several methods to show the daily averages, sorted from the highest average tip percentage to the lowest.

In [144]:
tips.groupby('day')[['total_bill', 'tip', 'tip_percent']].mean().sort_values(by='tip_percent', ascending=False)

Unnamed: 0_level_0,total_bill,tip,tip_percent
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Fri,17.151579,2.734737,0.169913
Sun,21.41,3.255132,0.166897
Thur,17.682742,2.771452,0.161276
Sat,20.441379,2.993103,0.153152


What were the biggest tip percentages?

In [145]:
tips.sort_values(by='tip_percent', ascending=False)[:10]

Unnamed: 0,total_bill,tip,smoker,day,time,size,tip_percent
172,7.25,5.15,Yes,Sun,Dinner,2,0.710345
178,9.6,4.0,Yes,Sun,Dinner,2,0.416667
67,3.07,1.0,Yes,Sat,Dinner,1,0.325733
232,11.61,3.39,No,Sat,Dinner,2,0.29199
183,23.17,6.5,Yes,Sun,Dinner,4,0.280535
109,14.31,4.0,Yes,Sat,Dinner,2,0.279525
149,7.51,2.0,No,Thur,Lunch,2,0.266312
93,16.32,4.3,Yes,Fri,Dinner,2,0.26348
221,13.42,3.48,Yes,Fri,Lunch,2,0.259314
51,10.29,2.6,No,Sun,Dinner,2,0.252672


Do smokers or nonsmokers give better tip percentages on average?

In [146]:
tips.groupby('smoker')['tip_percent'].mean()

smoker
No     0.159328
Yes    0.163196
Name: tip_percent, dtype: float64

### Applying User Defined Functions

The term *split-apply-combine* is sometimes used to describe the process of splitting a data set into groups, applying some function to each group independently, then combining the results back.

Pandas supports this pattern through the *apply* method of a grouped DataFrame.
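To make the three steps concrete, here is a hand-written sketch of split-apply-combine on made-up data, compared with the built-in grouped mean. It is meant only to illustrate the idea, not to show how Pandas implements it internally.

```python
import pandas as pd

# Made-up data standing in for the tips DataFrame.
sample = pd.DataFrame({
    'smoker': ['No', 'Yes', 'No', 'Yes'],
    'tip': [1.0, 2.0, 3.0, 6.0],
})

# Split: iterating over a groupby yields (key, sub-DataFrame) pairs.
# Apply: compute the mean tip of each piece independently.
pieces = {key: group['tip'].mean() for key, group in sample.groupby('smoker')}

# Combine: assemble the per-group results into one Series.
manual = pd.Series(pieces, name='tip')
manual.index.name = 'smoker'

# The built-in grouped mean performs all three steps in one call.
builtin = sample.groupby('smoker')['tip'].mean()

print(manual)
print(builtin)
```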

Define a function to return the top (largest) values in a particular column.

In [147]:
def top(df, n=5, column='tip_percent'):
    return df.sort_values(by=column, ascending=False)[:n]

Let's try it on the entire dataframe.

In [148]:
top(tips, n=10)

Unnamed: 0,total_bill,tip,smoker,day,time,size,tip_percent
172,7.25,5.15,Yes,Sun,Dinner,2,0.710345
178,9.6,4.0,Yes,Sun,Dinner,2,0.416667
67,3.07,1.0,Yes,Sat,Dinner,1,0.325733
232,11.61,3.39,No,Sat,Dinner,2,0.29199
183,23.17,6.5,Yes,Sun,Dinner,4,0.280535
109,14.31,4.0,Yes,Sat,Dinner,2,0.279525
149,7.51,2.0,No,Thur,Lunch,2,0.266312
93,16.32,4.3,Yes,Fri,Dinner,2,0.26348
221,13.42,3.48,Yes,Fri,Lunch,2,0.259314
51,10.29,2.6,No,Sun,Dinner,2,0.252672


Now, use *apply* to run *top* on each subgroup.

In [149]:
tips.groupby('smoker').apply(top)

Unnamed: 0_level_0,Unnamed: 1_level_0,total_bill,tip,smoker,day,time,size,tip_percent
smoker,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
No,232,11.61,3.39,No,Sat,Dinner,2,0.29199
No,149,7.51,2.0,No,Thur,Lunch,2,0.266312
No,51,10.29,2.6,No,Sun,Dinner,2,0.252672
No,185,20.69,5.0,No,Sun,Dinner,5,0.241663
No,88,24.71,5.85,No,Thur,Lunch,2,0.236746
Yes,172,7.25,5.15,Yes,Sun,Dinner,2,0.710345
Yes,178,9.6,4.0,Yes,Sun,Dinner,2,0.416667
Yes,67,3.07,1.0,Yes,Sat,Dinner,1,0.325733
Yes,183,23.17,6.5,Yes,Sun,Dinner,4,0.280535
Yes,109,14.31,4.0,Yes,Sat,Dinner,2,0.279525


Passing parameters to *apply* looks like this...

In [150]:
tips.groupby('smoker').apply(top, n=10, column='total_bill')

Unnamed: 0_level_0,Unnamed: 1_level_0,total_bill,tip,smoker,day,time,size,tip_percent
smoker,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
No,212,48.33,9.0,No,Sat,Dinner,4,0.18622
No,59,48.27,6.73,No,Sat,Dinner,4,0.139424
No,156,48.17,5.0,No,Sun,Dinner,6,0.103799
No,142,41.19,5.0,No,Thur,Lunch,5,0.121389
No,23,39.42,7.58,No,Sat,Dinner,4,0.192288
No,112,38.07,4.0,No,Sun,Dinner,3,0.10507
No,238,35.83,4.67,No,Sat,Dinner,3,0.130338
No,11,35.26,5.0,No,Sun,Dinner,4,0.141804
No,85,34.83,5.17,No,Thur,Lunch,4,0.148435
No,52,34.81,5.2,No,Sun,Dinner,4,0.149382
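Extra positional and keyword arguments given to *apply* are forwarded to the function after the group itself. The sketch below, on made-up data, shows that this is equivalent to wrapping the call in a lambda. The *top* helper is repeated from above so the block runs on its own.

```python
import pandas as pd

def top(df, n=5, column='tip_percent'):
    # Same helper as defined earlier in the notebook.
    return df.sort_values(by=column, ascending=False)[:n]

# Made-up data standing in for the tips DataFrame.
sample = pd.DataFrame({
    'smoker': ['No', 'No', 'Yes', 'Yes'],
    'total_bill': [10.0, 25.0, 30.0, 12.0],
    'tip': [2.0, 3.0, 4.0, 1.0],
})

grouped = sample.groupby('smoker')[['total_bill', 'tip']]

# Keyword arguments after the function are passed through to top().
by_keyword = grouped.apply(top, n=1, column='total_bill')

# Equivalent call with the arguments bound inside a lambda.
by_lambda = grouped.apply(lambda g: top(g, n=1, column='total_bill'))

print(by_keyword)
```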


Remove some of the columns that don't interest us.

In [151]:
tips.groupby('smoker')[['total_bill', 'tip', 'tip_percent']].apply(top, n=10, column='total_bill')

Unnamed: 0_level_0,Unnamed: 1_level_0,total_bill,tip,tip_percent
smoker,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
No,212,48.33,9.0,0.18622
No,59,48.27,6.73,0.139424
No,156,48.17,5.0,0.103799
No,142,41.19,5.0,0.121389
No,23,39.42,7.58,0.192288
No,112,38.07,4.0,0.10507
No,238,35.83,4.67,0.130338
No,11,35.26,5.0,0.141804
No,85,34.83,5.17,0.148435
No,52,34.81,5.2,0.149382


In [152]:
tips.groupby(['smoker', 'day', 'time'])[['total_bill', 'tip', 'tip_percent']].apply(top, n=5, column='total_bill')

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,total_bill,tip,tip_percent
smoker,day,time,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
No,Fri,Dinner,94,22.75,3.25,0.142857
No,Fri,Dinner,91,22.49,3.5,0.155625
No,Fri,Dinner,99,12.46,1.5,0.120385
No,Fri,Lunch,223,15.98,3.0,0.187735
No,Sat,Dinner,212,48.33,9.0,0.18622
No,Sat,Dinner,59,48.27,6.73,0.139424
No,Sat,Dinner,23,39.42,7.58,0.192288
No,Sat,Dinner,238,35.83,4.67,0.130338
No,Sat,Dinner,39,31.27,5.0,0.159898
No,Sun,Dinner,156,48.17,5.0,0.103799


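What happens when the output is long?  Pandas abbreviates the display once the number of rows exceeds the *display.max_rows* option, so the cells below save the current limit, raise it, rerun the query, and then restore the original limit.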

In [153]:
tips.groupby(['smoker', 'day', 'time'])[['total_bill', 'tip', 'tip_percent']].apply(top, n=10, column='total_bill')

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,total_bill,tip,tip_percent
smoker,day,time,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
No,Fri,Dinner,94,22.75,3.25,0.142857
No,Fri,Dinner,91,22.49,3.50,0.155625
No,Fri,Dinner,99,12.46,1.50,0.120385
No,Fri,Lunch,223,15.98,3.00,0.187735
No,Sat,Dinner,212,48.33,9.00,0.186220
...,...,...,...,...,...,...
Yes,Thur,Lunch,80,19.44,3.00,0.154321
Yes,Thur,Lunch,200,18.71,4.00,0.213789
Yes,Thur,Lunch,194,16.58,4.00,0.241255
Yes,Thur,Lunch,205,16.47,3.23,0.196114


In [154]:
old_limit = pd.options.display.max_rows
old_limit

60

In [155]:
pd.options.display.max_rows = 200

In [156]:
tips.groupby(['smoker', 'day', 'time'])[['total_bill', 'tip', 'tip_percent']].apply(top, n=10, column='total_bill')

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,total_bill,tip,tip_percent
smoker,day,time,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
No,Fri,Dinner,94,22.75,3.25,0.142857
No,Fri,Dinner,91,22.49,3.5,0.155625
No,Fri,Dinner,99,12.46,1.5,0.120385
No,Fri,Lunch,223,15.98,3.0,0.187735
No,Sat,Dinner,212,48.33,9.0,0.18622
No,Sat,Dinner,59,48.27,6.73,0.139424
No,Sat,Dinner,23,39.42,7.58,0.192288
No,Sat,Dinner,238,35.83,4.67,0.130338
No,Sat,Dinner,39,31.27,5.0,0.159898
No,Sat,Dinner,239,29.03,5.92,0.203927


In [157]:
pd.options.display.max_rows = old_limit
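As an alternative to saving and restoring the option by hand, *pd.option_context* changes an option only inside a *with* block and restores it automatically afterwards. A minimal sketch (the variable names are illustrative):

```python
import pandas as pd

# Remember the limit in effect before the block, for comparison.
default_limit = pd.get_option('display.max_rows')

# Inside the with block, the option is temporarily set to 200.
with pd.option_context('display.max_rows', 200):
    inside = pd.get_option('display.max_rows')

# On exit, the previous value is restored, even if an exception occurred.
after = pd.get_option('display.max_rows')

print(default_limit, inside, after)
```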

### Pivot Tables

Pivot tables aggregate data by multiple keys and arrange the results as a table.

Pandas DataFrames can build pivot tables.  These are, in essence, a particular case of grouping and aggregation.
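To illustrate the connection, the sketch below builds the same table two ways on made-up data: once with *pivot_table* and once with *groupby*, *mean*, and *unstack*. The *columns* parameter used here is one way to place a key along the column axis directly.

```python
import pandas as pd

# Made-up data standing in for the tips DataFrame.
sample = pd.DataFrame({
    'smoker': ['No', 'No', 'Yes', 'Yes', 'No'],
    'day': ['Fri', 'Sat', 'Fri', 'Sat', 'Fri'],
    'total_bill': [10.0, 20.0, 30.0, 40.0, 14.0],
})

# Pivot table: rows keyed by smoker, columns keyed by day, mean by default.
via_pivot = sample.pivot_table('total_bill', index='smoker', columns='day')

# The same result from explicit grouping, aggregation, and unstacking.
via_groupby = sample.groupby(['smoker', 'day'])['total_bill'].mean().unstack()

print(via_pivot)
print(via_groupby)
```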

In [158]:
tips

Unnamed: 0,total_bill,tip,smoker,day,time,size,tip_percent
0,16.99,1.01,No,Sun,Dinner,2,0.059447
1,10.34,1.66,No,Sun,Dinner,3,0.160542
2,21.01,3.50,No,Sun,Dinner,3,0.166587
3,23.68,3.31,No,Sun,Dinner,2,0.139780
4,24.59,3.61,No,Sun,Dinner,4,0.146808
...,...,...,...,...,...,...,...
239,29.03,5.92,No,Sat,Dinner,3,0.203927
240,27.18,2.00,Yes,Sat,Dinner,2,0.073584
241,22.67,2.00,Yes,Sat,Dinner,2,0.088222
242,17.82,1.75,No,Sat,Dinner,2,0.098204


In [159]:
tips.pivot_table(['total_bill'], index=['smoker', 'day'])

Unnamed: 0_level_0,Unnamed: 1_level_0,total_bill
smoker,day,Unnamed: 2_level_1
No,Fri,18.42
No,Sat,19.661778
No,Sun,20.506667
No,Thur,17.113111
Yes,Fri,16.813333
Yes,Sat,21.276667
Yes,Sun,24.12
Yes,Thur,19.190588


Applying the *unstack* method puts the results in a more familiar pivot table format.

In [160]:
tips.pivot_table(['total_bill'], index=['smoker', 'day']).unstack()

Unnamed: 0_level_0,total_bill,total_bill,total_bill,total_bill
day,Fri,Sat,Sun,Thur
smoker,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
No,18.42,19.661778,20.506667,17.113111
Yes,16.813333,21.276667,24.12,19.190588


By default, the aggregation function for pivots is 'mean', but you can change that.  Here, we compute the sum of the bills.

In [161]:
tips.pivot_table(['total_bill'], index=['smoker', 'day'], aggfunc='sum').unstack()

Unnamed: 0_level_0,total_bill,total_bill,total_bill,total_bill
day,Fri,Sat,Sun,Thur
smoker,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
No,73.68,884.78,1168.88,770.09
Yes,252.2,893.62,458.28,326.24


Pivot tables can aggregate on multiple columns.

In [162]:
tips.pivot_table(['total_bill', 'size'], index=['smoker', 'day'], aggfunc=['sum', 'mean'])

Unnamed: 0_level_0,Unnamed: 1_level_0,sum,sum,mean,mean
Unnamed: 0_level_1,Unnamed: 1_level_1,size,total_bill,size,total_bill
smoker,day,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
No,Fri,9,73.68,2.25,18.42
No,Sat,115,884.78,2.555556,19.661778
No,Sun,167,1168.88,2.929825,20.506667
No,Thur,112,770.09,2.488889,17.113111
Yes,Fri,31,252.2,2.066667,16.813333
Yes,Sat,104,893.62,2.47619,21.276667
Yes,Sun,49,458.28,2.578947,24.12
Yes,Thur,40,326.24,2.352941,19.190588


In [163]:
tips.pivot_table(['total_bill', 'size'], index=['smoker', 'day'], aggfunc=['sum', 'mean']).unstack()

Unnamed: 0_level_0,sum,sum,sum,sum,sum,sum,sum,sum,mean,mean,mean,mean,mean,mean,mean,mean
Unnamed: 0_level_1,size,size,size,size,total_bill,total_bill,total_bill,total_bill,size,size,size,size,total_bill,total_bill,total_bill,total_bill
day,Fri,Sat,Sun,Thur,Fri,Sat,Sun,Thur,Fri,Sat,Sun,Thur,Fri,Sat,Sun,Thur
smoker,Unnamed: 1_level_3,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3,Unnamed: 11_level_3,Unnamed: 12_level_3,Unnamed: 13_level_3,Unnamed: 14_level_3,Unnamed: 15_level_3,Unnamed: 16_level_3
No,9,115,167,112,73.68,884.78,1168.88,770.09,2.25,2.555556,2.929825,2.488889,18.42,19.661778,20.506667,17.113111
Yes,31,104,49,40,252.2,893.62,458.28,326.24,2.066667,2.47619,2.578947,2.352941,16.813333,21.276667,24.12,19.190588
