In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#### Data Aggregation
Data aggregation is the process of combining multiple values into a single summary statistic. For example, you might calculate:

+ Mean, Median, Sum: The average, median, or total value for a group.
+ Count: How many records belong to a group.
+ Minimum and Maximum: The smallest or largest value in each group.
+ Standard Deviation, Variance: Measures of variability within groups.

#### Group Operations
Group operations (using Pandas’ groupby method) let you split your dataset into subsets based on one or more keys, perform operations on each subset independently, and then combine the results. This follows the “split-apply-combine” strategy:

1. Split: Divide the data into groups based on some criteria (e.g., day of the week, region, project type).
2. Apply: Apply an aggregation function (or transformation) to each group.
3. Combine: Combine the results into a summary DataFrame.
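The three steps above can be sketched by hand and compared with a single `groupby` call. This is a minimal illustration on a made-up `sales` table (the column names `region` and `amount` are assumptions, not part of this notebook's data):

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the three steps
sales = pd.DataFrame({
    "region": ["north", "south", "north", "south", "north"],
    "amount": [10, 20, 30, 40, 80],
})

# Split: one sub-DataFrame per distinct region
pieces = {key: sales[sales["region"] == key] for key in sales["region"].unique()}

# Apply: reduce each piece to a single number
means = {key: piece["amount"].mean() for key, piece in pieces.items()}

# Combine: assemble the per-group results into one Series
manual = pd.Series(means, name="amount").sort_index()

# groupby performs the same three steps in one call
with_groupby = sales.groupby("region")["amount"].mean()

print(manual)
print(with_groupby)
```

Both results hold the same per-region means; the `groupby` version is shorter and is generally the one to prefer.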

## 1. Group operation

### The groupby Method
```grouped = df.groupby('grouping_column')```  
Let's see a simple example.

In [2]:
data = {
    'cat': ['A', 'B', 'A', 'B', 'A', 'C', 'C', 'B', 'C', 'A'],
    'Values1': [10, 20, 15, 25, 10, 30, 35, 20, 40, 15],
    'Values2': [5, 10, 8, 12, 6, 15, 18, 9, 20, 7],
    'grp': ['X', 'X', 'Y', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'Y']
}

df = pd.DataFrame(data)

In [3]:
df

Unnamed: 0,cat,Values1,Values2,grp
0,A,10,5,X
1,B,20,10,X
2,A,15,8,Y
3,B,25,12,Y
4,A,10,6,X
5,C,30,15,Y
6,C,35,18,X
7,B,20,9,Y
8,C,40,20,X
9,A,15,7,Y


In [4]:
grouped = df['Values1'].groupby(df['cat'])

In [5]:
grouped.mean()

cat
A    12.500000
B    21.666667
C    35.000000
Name: Values1, dtype: float64

In [6]:
# We can pass more than one grouping key
data = df['Values1'].groupby([df['cat'], df['grp']]).mean()

In [7]:
data

cat  grp
A    X      10.0
     Y      15.0
B    X      20.0
     Y      22.5
C    X      37.5
     Y      30.0
Name: Values1, dtype: float64

In [8]:
data.unstack()

grp,X,Y
cat,Unnamed: 1_level_1,Unnamed: 2_level_1
A,10.0,15.0
B,20.0,22.5
C,37.5,30.0


In the above example, the group keys are all Series, though they could be any arrays of the right length:

In [9]:
arr = np.array(['USA', 'UK', 'CHINA', 'USA', 'UK', 'CHINA', 'USA', 'UK', 'CHINA', 'UK'])
lst = [2020, 2020, 2020, 2021, 2021, 2021, 2021, 2022, 2022, 2022]

In [10]:
# So let's group by the above array
df['Values1'].groupby(arr).mean()

CHINA    28.333333
UK       16.250000
USA      23.333333
Name: Values1, dtype: float64

In [11]:
# Group by list
df['Values1'].groupby(lst).mean()

2020    15.0
2021    25.0
2022    25.0
Name: Values1, dtype: float64

In [12]:
# Group by both arr and lst
df['Values1'].groupby([arr, lst]).mean()

CHINA  2020    15.0
       2021    30.0
       2022    40.0
UK     2020    20.0
       2021    10.0
       2022    17.5
USA    2020    10.0
       2021    30.0
Name: Values1, dtype: float64

In [13]:
# If the grouping key is a column of the DataFrame, we can pass just the column name
df.groupby("cat").mean(numeric_only = True)

Unnamed: 0_level_0,Values1,Values2
cat,Unnamed: 1_level_1,Unnamed: 2_level_1
A,12.5,6.5
B,21.666667,10.333333
C,35.0,17.666667


In [14]:
# Using another key
df.groupby("grp").mean(numeric_only = True)

Unnamed: 0_level_0,Values1,Values2
grp,Unnamed: 1_level_1,Unnamed: 2_level_1
X,23.0,11.8
Y,21.0,10.2


In [15]:
df.groupby(["cat", "grp"]).mean(numeric_only = True)

Unnamed: 0_level_0,Unnamed: 1_level_0,Values1,Values2
cat,grp,Unnamed: 2_level_1,Unnamed: 3_level_1
A,X,10.0,5.5
A,Y,15.0,7.5
B,X,20.0,10.0
B,Y,22.5,10.5
C,X,37.5,19.0
C,Y,30.0,15.0


To count how many rows fall into each group, we can use "size()":

In [16]:
df.groupby('cat').size().reset_index(name = 'Count')

Unnamed: 0,cat,Count
0,A,4
1,B,3
2,C,3


A group function similar in spirit to size is count, which computes the number of nonnull values in each group:

In [17]:
df.groupby('cat').count()

Unnamed: 0_level_0,Values1,Values2,grp
cat,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
A,4,4,4
B,3,3,3
C,3,3,3
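In the example above size and count agree because the data has no missing values. A small sketch with a made-up table (the `score` column and its NaN entries are assumptions for illustration) shows where they differ:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing scores
toy = pd.DataFrame({
    "cat": ["A", "A", "B", "B", "B"],
    "score": [1.0, np.nan, 3.0, np.nan, np.nan],
})

sizes = toy.groupby("cat").size()             # rows per group, NaN included
counts = toy.groupby("cat")["score"].count()  # non-null values per group

print(sizes)
print(counts)
```

Here size reports 2 and 3 rows for groups A and B, while count reports 1 non-null score in each.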


### Iterating over Groups
The object returned by groupby supports iteration, generating a sequence of 2-tuples
 containing the group name along with the chunk of data.

In [18]:
for name, data in df.groupby('cat'):
    print(name)
    print(data)

A
  cat  Values1  Values2 grp
0   A       10        5   X
2   A       15        8   Y
4   A       10        6   X
9   A       15        7   Y
B
  cat  Values1  Values2 grp
1   B       20       10   X
3   B       25       12   Y
7   B       20        9   Y
C
  cat  Values1  Values2 grp
5   C       30       15   Y
6   C       35       18   X
8   C       40       20   X


In the case of multiple keys, the group name is a tuple of key values:

In [19]:
for name, data in df.groupby(['cat', 'grp']):
    print(name)
    print(data)

('A', 'X')
  cat  Values1  Values2 grp
0   A       10        5   X
4   A       10        6   X
('A', 'Y')
  cat  Values1  Values2 grp
2   A       15        8   Y
9   A       15        7   Y
('B', 'X')
  cat  Values1  Values2 grp
1   B       20       10   X
('B', 'Y')
  cat  Values1  Values2 grp
3   B       25       12   Y
7   B       20        9   Y
('C', 'X')
  cat  Values1  Values2 grp
6   C       35       18   X
8   C       40       20   X
('C', 'Y')
  cat  Values1  Values2 grp
5   C       30       15   Y


We may want to store the groups in a dictionary:

In [20]:
d = {name: data for name, data in df.groupby('cat')}

In [21]:
d

{'A':   cat  Values1  Values2 grp
 0   A       10        5   X
 2   A       15        8   Y
 4   A       10        6   X
 9   A       15        7   Y,
 'B':   cat  Values1  Values2 grp
 1   B       20       10   X
 3   B       25       12   Y
 7   B       20        9   Y,
 'C':   cat  Values1  Values2 grp
 5   C       30       15   Y
 6   C       35       18   X
 8   C       40       20   X}

In [22]:
# Accessing a group by its key in the dictionary
d['B']

Unnamed: 0,cat,Values1,Values2,grp
1,B,20,10,X
3,B,25,12,Y
7,B,20,9,Y
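If only one group is needed, building the whole dictionary may be unnecessary. As a sketch on a made-up table (the names `toy` and `val` are assumptions), the `get_group` method of the GroupBy object returns the same chunk directly:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"cat": ["A", "B", "A", "B"], "val": [1, 2, 3, 4]})
by_cat = toy.groupby("cat")

# Direct lookup of a single group
group_b = by_cat.get_group("B")

# The same chunk, obtained through the dictionary approach
from_dict = {name: chunk for name, chunk in by_cat}["B"]

print(sorted(by_cat.groups))  # the available group names
print(group_b)
```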


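By default, groupby groups along the rows (axis="index"). To group the columns instead,
 transpose the DataFrame and group its rows, as below; the axis argument of groupby is deprecated in recent pandas versions.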

In [23]:
for idx, data in df.T.groupby({'cat': 'key', 'grp': 'key', 'Values1': 'value', 'Values2': 'value'}):
    print(idx)
    print(data)

key
     0  1  2  3  4  5  6  7  8  9
cat  A  B  A  B  A  C  C  B  C  A
grp  X  X  Y  Y  X  Y  X  Y  X  Y
value
          0   1   2   3   4   5   6   7   8   9
Values1  10  20  15  25  10  30  35  20  40  15
Values2   5  10   8  12   6  15  18   9  20   7


Let's apply the above concepts to a real-world dataset, "tips":

In [24]:
tips = sns.load_dataset('tips')

In [25]:
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [26]:
tips.groupby('sex', observed = True).size().reset_index(name = 'count')

Unnamed: 0,sex,count
0,Male,157
1,Female,87


In [27]:
tips.groupby('smoker', observed = True).size().reset_index(name = 'count')

Unnamed: 0,smoker,count
0,Yes,93
1,No,151


In [28]:
tips.groupby('day', observed = True).size().reset_index(name = 'count')

Unnamed: 0,day,count
0,Thur,62
1,Fri,19
2,Sat,87
3,Sun,76


In [29]:
tips.groupby('time', observed = True).size().reset_index(name = 'count')

Unnamed: 0,time,count
0,Lunch,68
1,Dinner,176


In [30]:
tips.groupby(['sex', 'day'], observed = True).size()

sex     day 
Male    Thur    30
        Fri     10
        Sat     59
        Sun     58
Female  Thur    32
        Fri      9
        Sat     28
        Sun     18
dtype: int64

In [31]:
tips.groupby(['day', 'time'], observed = True).size()

day   time  
Thur  Lunch     61
      Dinner     1
Fri   Lunch      7
      Dinner    12
Sat   Dinner    87
Sun   Dinner    76
dtype: int64

Let's apply some aggregation functions:

In [32]:
tips['total_bill'].groupby(tips['day'], observed = True).mean().reset_index(name = 'Average Bill')

Unnamed: 0,day,Average Bill
0,Thur,17.682742
1,Fri,17.151579
2,Sat,20.441379
3,Sun,21.41


In [33]:
tips['total_bill'].groupby(tips['sex'], observed = True).mean().reset_index(name = 'Average Bill')

Unnamed: 0,sex,Average Bill
0,Male,20.744076
1,Female,18.056897


In [34]:
tips['total_bill'].groupby(tips['smoker'], observed = True).mean().reset_index(name = 'Average Bill')

Unnamed: 0,smoker,Average Bill
0,Yes,20.756344
1,No,19.188278


In [35]:
tips['total_bill'].groupby(tips['time'], observed = True).mean().reset_index(name = 'Average Bill')

Unnamed: 0,time,Average Bill
0,Lunch,17.168676
1,Dinner,20.797159


In [36]:
tips['tip'].groupby(tips['day'], observed = True).mean().reset_index(name = 'Average tip')

Unnamed: 0,day,Average tip
0,Thur,2.771452
1,Fri,2.734737
2,Sat,2.993103
3,Sun,3.255132


In [37]:
tips['tip'].groupby(tips['sex'], observed = True).mean().reset_index(name = 'Average tip')

Unnamed: 0,sex,Average tip
0,Male,3.089618
1,Female,2.833448


In [38]:
tips['tip'].groupby(tips['smoker'], observed = True).mean().reset_index(name = 'Average tip')

Unnamed: 0,smoker,Average tip
0,Yes,3.00871
1,No,2.991854


In [39]:
tips['tip'].groupby(tips['time'], observed = True).mean().reset_index(name = 'Average tip')

Unnamed: 0,time,Average tip
0,Lunch,2.728088
1,Dinner,3.10267


In [40]:
tips.groupby('sex', observed = True).mean(numeric_only = True)

Unnamed: 0_level_0,total_bill,tip,size
sex,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Male,20.744076,3.089618,2.630573
Female,18.056897,2.833448,2.45977


### Selecting a Column or Subset of Columns

Indexing a GroupBy object created from a DataFrame with a column name or array
 of column names has the effect of column subsetting for aggregation. For example, let's look at the 'tips' dataset:

In [41]:
# A single column name gives a 'SeriesGroupBy' object
tips.groupby('sex', observed = True)['total_bill']
# A list of column names gives a 'DataFrameGroupBy' object (the value displayed below)
tips.groupby('sex', observed = True)[['total_bill', 'tip']]

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001C78602A950>

The above first groups all columns by the grouping key and then selects a column or subset of columns, which is equivalent to:

In [42]:
tips['total_bill'].groupby(tips['sex'], observed = True)
tips[['total_bill', 'tip']].groupby(tips['sex'], observed = True)

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001C786057CD0>

In [43]:
tips[['total_bill', 'tip']].groupby(tips['sex'], observed = True).mean()

Unnamed: 0_level_0,total_bill,tip
sex,Unnamed: 1_level_1,Unnamed: 2_level_1
Male,20.744076,3.089618
Female,18.056897,2.833448
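Whether the selection behaves like a grouped Series or a grouped DataFrame depends on whether a single name or a list of names is passed. A minimal check on a made-up table (the data values are assumptions; the column names mirror the tips dataset):

```python
import pandas as pd

# Hypothetical toy data with tips-like column names
toy = pd.DataFrame({
    "sex": ["M", "F", "M"],
    "total_bill": [10.0, 20.0, 30.0],
    "tip": [1.0, 2.0, 3.0],
})

single = toy.groupby("sex")["total_bill"]             # one column name
multiple = toy.groupby("sex")[["total_bill", "tip"]]  # list of column names

kinds = (type(single).__name__, type(multiple).__name__)
print(kinds)

# Aggregating follows the same pattern: Series for one name, DataFrame for a list
print(type(single.mean()).__name__, type(multiple.mean()).__name__)
```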


### Grouping with Dictionaries and Series

In [44]:
df = pd.DataFrame(np.random.randint(30, 101, size = (4, 6)),
                  index = ['math', 'history', 'science', 'economics'],
                  columns = ['Helen', 'Ron', 'Mely', 'John', 'Randy', 'Julia'])

In [45]:
df

Unnamed: 0,Helen,Ron,Mely,John,Randy,Julia
math,98,35,72,67,77,48
history,40,66,63,37,69,40
science,91,46,77,90,69,41
economics,54,89,96,93,47,65


In [46]:
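# Note: 'Ashley' and 'Melissa' are not columns of df, and 'Helen' and 'Mely' have no
# entry here, so only 'Julia' ends up in the Female group below
mapping = {'Ashley': 'Female', 'Ron': 'Male', 'Melissa': 'Female',
          'John': 'Male', 'Randy': 'Male', 'Julia':'Female'}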

In [47]:
for key, data in df.T.groupby(mapping):
    print(key)
    print(data)

Female
       math  history  science  economics
Julia    48       40       41         65
Male
       math  history  science  economics
Ron      35       66       46         89
John     67       37       90         93
Randy    77       69       69         47


In [48]:
df.T.groupby(mapping).mean()

Unnamed: 0,math,history,science,economics
Female,48.0,40.0,41.0,65.0
Male,59.666667,57.333333,68.333333,76.333333


In [49]:
# A Series can serve as the mapping too: its index plays the role of the dictionary keys
pd.Series(mapping)

Ashley     Female
Ron          Male
Melissa    Female
John         Male
Randy        Male
Julia      Female
dtype: object
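The sketch below, on a made-up `scores` table (all names are assumptions), illustrates two points about mappings: labels with no entry in the mapping are left out of the result, unused mapping keys are harmless, and a Series used as the grouping key gives the same groups as the dictionary.

```python
import pandas as pd

# Hypothetical toy data: one row per student
scores = pd.DataFrame(
    {"math": [90, 60, 75], "art": [70, 80, 65]},
    index=["Ann", "Bob", "Cy"],
)

# 'Cy' has no entry and 'Zed' is not in the index
mapping = {"Ann": "F", "Bob": "M", "Zed": "M"}

by_dict = scores.groupby(mapping).mean()
by_series = scores.groupby(pd.Series(mapping)).mean()

print(by_dict)    # Cy's row does not contribute to any group
print(by_series)  # same groups as the dictionary version
```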

###  Grouping with Functions

In [50]:
df = df.T

In [51]:
df

Unnamed: 0,math,history,science,economics
Helen,98,40,91,54
Ron,35,66,46,89
Mely,72,63,77,96
John,67,37,90,93
Randy,77,69,69,47
Julia,48,40,41,65


Let's use Python's built-in len function as the grouping key. It is called on each index label, so the rows are grouped by the length of the names:

In [52]:
df.groupby(len)

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001C786066290>

In [53]:
for key, data in df.groupby(len):
    print(key)
    print(data)

3
     math  history  science  economics
Ron    35       66       46         89
4
      math  history  science  economics
Mely    72       63       77         96
John    67       37       90         93
5
       math  history  science  economics
Helen    98       40       91         54
Randy    77       69       69         47
Julia    48       40       41         65


In [54]:
df.groupby(len).mean()

Unnamed: 0,math,history,science,economics
3,35.0,66.0,46.0,89.0
4,69.5,50.0,83.5,94.5
5,74.333333,49.666667,67.0,55.333333


###  Grouping by Index Levels

A final convenience for hierarchically indexed datasets is the ability to aggregate
 using one of the levels of an axis index.

In [55]:
columns = pd.MultiIndex.from_arrays([["US", "US", "US", "JP", "JP"],
                                    [1, 3, 5, 1, 3]],
                                    names=["cty", "tenor"])
df = pd.DataFrame(np.random.standard_normal((4, 5)), columns=columns)

In [56]:
df

cty,US,US,US,JP,JP
tenor,1,3,5,1,3
0,0.130093,-1.008944,-2.291357,-0.27344,-0.191727
1,-0.956626,1.18956,-1.466489,-1.217648,-0.296265
2,0.890127,-0.849051,0.535697,0.61905,-0.907685
3,1.244851,-0.611971,0.256006,-1.273652,0.779602


In [57]:
df.T.groupby(level = 'cty').mean().T

cty,JP,US
0,-0.232583,-1.056736
1,-0.756956,-0.411185
2,-0.144318,0.192258
3,-0.247025,0.296295


# Data Aggregation
_Aggregations refer to any data transformation that produces scalar values from arrays._

Some examples of aggregation methods using the "tips" dataset:

In [58]:
tips = sns.load_dataset('tips')

In [59]:
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


#### 1. any() and all()

In [60]:
# Check whether every "total_bill" is greater than 5 on each day
tips.groupby('day', observed = True)['total_bill'].apply(lambda x: (x > 5).all())

day
Thur     True
Fri      True
Sat     False
Sun      True
Name: total_bill, dtype: bool

In [61]:
# Check if any "total_bill" greater than 40 in each day
tips.groupby('day', observed = True)['total_bill'].apply(lambda x: (x > 40).any())

day
Thur    True
Fri     True
Sat     True
Sun     True
Name: total_bill, dtype: bool

In [62]:
tips['total_bill > 40'] = tips['total_bill'] > 40

In [63]:
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,total_bill > 40
0,16.99,1.01,Female,No,Sun,Dinner,2,False
1,10.34,1.66,Male,No,Sun,Dinner,3,False
2,21.01,3.5,Male,No,Sun,Dinner,3,False
3,23.68,3.31,Male,No,Sun,Dinner,2,False
4,24.59,3.61,Female,No,Sun,Dinner,4,False


In [64]:
tips.groupby('day', observed = True)['total_bill > 40'].any()

day
Thur    True
Fri     True
Sat     True
Sun     True
Name: total_bill > 40, dtype: bool

In [65]:
tips = tips.drop('total_bill > 40', axis = 1)

#### 2. first(), last() and nth()

In [66]:
# The first customer's data from each day
tips.groupby('day', observed = True).first()

Unnamed: 0_level_0,total_bill,tip,sex,smoker,time,size
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Thur,27.2,4.0,Male,No,Lunch,4
Fri,28.97,3.0,Male,Yes,Dinner,2
Sat,20.65,3.35,Male,No,Dinner,3
Sun,16.99,1.01,Female,No,Dinner,2


In [67]:
# The last customer data from each day
tips.groupby('day', observed = True).last()

Unnamed: 0_level_0,total_bill,tip,sex,smoker,time,size
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Thur,18.78,3.0,Female,No,Dinner,2
Fri,10.09,2.0,Female,Yes,Lunch,2
Sat,17.82,1.75,Male,No,Dinner,2
Sun,15.69,1.5,Male,Yes,Dinner,2


In [68]:
# The third customer's data in each day (nth is zero-based, so nth(2) is the third row)
tips.groupby('day', observed = True).nth(2)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
2,21.01,3.5,Male,No,Sun,Dinner,3
21,20.29,2.75,Female,No,Sat,Dinner,2
79,17.29,2.71,Male,No,Thur,Lunch,2
92,5.75,1.0,Female,Yes,Fri,Dinner,2


#### 3. quantile() and rank()

In [69]:
tips.groupby('day', observed = True)['total_bill'].quantile(0.25).reset_index(name = '25th percentile')

Unnamed: 0,day,25th percentile
0,Thur,12.4425
1,Fri,12.095
2,Sat,13.905
3,Sun,14.9875


In [70]:
# Overall rank across the whole dataset (ties get the lowest rank)
tips['rank'] = tips['total_bill'].rank(method = 'min')

In [71]:
tips.head(10)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,rank
0,16.99,1.01,Female,No,Sun,Dinner,2,113.0
1,10.34,1.66,Male,No,Sun,Dinner,3,25.0
2,21.01,3.5,Male,No,Sun,Dinner,3,162.0
3,23.68,3.31,Male,No,Sun,Dinner,2,179.0
4,24.59,3.61,Female,No,Sun,Dinner,4,187.0
5,25.29,4.71,Male,No,Sun,Dinner,4,192.0
6,8.77,2.0,Male,No,Sun,Dinner,2,12.0
7,26.88,3.12,Male,No,Sun,Dinner,4,199.0
8,15.04,1.96,Male,No,Sun,Dinner,2,82.0
9,14.78,3.23,Male,No,Sun,Dinner,2,79.0


In [72]:
# Rank within each day (default method: ties get the average rank)
tips['group_rank'] = tips.groupby('day', observed = True)['total_bill'].rank()

In [73]:
tips.head(10)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,rank,group_rank
0,16.99,1.01,Female,No,Sun,Dinner,2,113.0,28.0
1,10.34,1.66,Male,No,Sun,Dinner,3,25.0,9.0
2,21.01,3.5,Male,No,Sun,Dinner,3,162.0,43.0
3,23.68,3.31,Male,No,Sun,Dinner,2,179.0,50.0
4,24.59,3.61,Female,No,Sun,Dinner,4,187.0,54.0
5,25.29,4.71,Male,No,Sun,Dinner,4,192.0,56.0
6,8.77,2.0,Male,No,Sun,Dinner,2,12.0,2.0
7,26.88,3.12,Male,No,Sun,Dinner,4,199.0,59.0
8,15.04,1.96,Male,No,Sun,Dinner,2,82.0,20.0
9,14.78,3.23,Male,No,Sun,Dinner,2,79.0,18.0


 You can use aggregations of your own devising and additionally call any method
 that is also defined on the object being grouped. For example, the nsmallest
 Series method selects the smallest requested number of values from the data.
 While nsmallest is not explicitly implemented for GroupBy, we can still use it
 with a nonoptimized implementation. Internally, GroupBy slices up the Series, calls
 piece.nsmallest(n) for each piece, and then assembles those results into the result
 object:

In [74]:
tips.groupby('day', observed = True)['tip'].nsmallest(2)

day      
Thur  135    1.25
      146    1.36
Fri   92     1.00
      97     1.50
Sat   67     1.00
      111    1.00
Sun   0      1.01
      43     1.32
Name: tip, dtype: float64

### Applying Custom Functions with GroupBy

#### Using .apply() with Custom Functions

In [75]:
def myfun(x):
    return x.max() - x.min()

In [76]:
tips.groupby('day', observed = True)['total_bill'].apply(myfun)

day
Thur    35.60
Fri     34.42
Sat     47.74
Sun     40.92
Name: total_bill, dtype: float64

#### Using .agg() with Custom Functions

In [77]:
tips.groupby('day', observed = True)['total_bill'].agg(myfun)

day
Thur    35.60
Fri     34.42
Sat     47.74
Sun     40.92
Name: total_bill, dtype: float64

In [78]:
tips = sns.load_dataset('tips')

In [79]:
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


We can also use the describe method on a grouped Series or DataFrame:

In [80]:
tips.groupby('day', observed = True).describe()

Unnamed: 0_level_0,total_bill,total_bill,total_bill,total_bill,total_bill,total_bill,total_bill,total_bill,tip,tip,tip,tip,tip,size,size,size,size,size,size,size,size
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
day,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
Thur,62.0,17.682742,7.88617,7.51,12.4425,16.2,20.155,43.11,62.0,2.771452,...,3.3625,6.7,62.0,2.451613,1.066285,1.0,2.0,2.0,2.0,6.0
Fri,19.0,17.151579,8.30266,5.75,12.095,15.38,21.75,40.17,19.0,2.734737,...,3.365,4.73,19.0,2.105263,0.567131,1.0,2.0,2.0,2.0,4.0
Sat,87.0,20.441379,9.480419,3.07,13.905,18.24,24.74,50.81,87.0,2.993103,...,3.37,10.0,87.0,2.517241,0.819275,1.0,2.0,2.0,3.0,5.0
Sun,76.0,21.41,8.832122,7.25,14.9875,19.63,25.5975,48.17,76.0,3.255132,...,4.0,6.5,76.0,2.842105,1.007341,2.0,2.0,2.0,4.0,6.0


###  Column-Wise and Multiple Function Application

Apply different aggregate functions to different columns.  
First, let's add a 'tip_pct' column to the tips dataset: the tip as a fraction of the total bill.

In [81]:
tips['tip_pct'] = tips['tip']/tips['total_bill']

In [82]:
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,tip_pct
0,16.99,1.01,Female,No,Sun,Dinner,2,0.059447
1,10.34,1.66,Male,No,Sun,Dinner,3,0.160542
2,21.01,3.5,Male,No,Sun,Dinner,3,0.166587
3,23.68,3.31,Male,No,Sun,Dinner,2,0.13978
4,24.59,3.61,Female,No,Sun,Dinner,4,0.146808


In [83]:
# Let's group by "day" and "sex"
grouped = tips.groupby(['day', 'sex'], observed = True)

In [84]:
# Average of tip
grouped['tip'].mean()

day   sex   
Thur  Male      2.980333
      Female    2.575625
Fri   Male      2.693000
      Female    2.781111
Sat   Male      3.083898
      Female    2.801786
Sun   Male      3.220345
      Female    3.367222
Name: tip, dtype: float64

In [85]:
# median of bill
grouped['total_bill'].median()

day   sex   
Thur  Male      16.975
      Female    13.785
Fri   Male      17.215
      Female    15.380
Sat   Male      18.240
      Female    18.360
Sun   Male      20.725
      Female    17.410
Name: total_bill, dtype: float64

Passing a list of functions to agg:

In [86]:
tips.groupby(['day', 'sex'], observed = True)['total_bill'].agg(['mean', 'max', 'min', myfun])

Unnamed: 0_level_0,Unnamed: 1_level_0,mean,max,min,myfun
day,sex,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Thur,Male,18.714667,41.19,7.51,33.68
Thur,Female,16.715312,43.11,8.35,34.76
Fri,Male,19.857,40.17,8.58,31.59
Fri,Female,14.145556,22.75,5.75,17.0
Sat,Male,20.802542,50.81,7.74,43.07
Sat,Female,19.680357,44.3,3.07,41.23
Sun,Male,21.887241,48.17,7.25,40.92
Sun,Female,19.872222,35.26,9.6,25.66


 You don’t need to accept the names that GroupBy gives to the columns; notably,
 lambda functions have the name "<lambda>", which makes them hard to identify
 (you can see for yourself by looking at a function’s `__name__` attribute). Thus, if you
 pass a list of (name, function) tuples, the first element of each tuple will be used
 as the DataFrame column names (you can think of a list of 2-tuples as an ordered
 mapping):

In [87]:
tips.groupby(['day', 'sex'], observed = True)['total_bill'].agg([('Average', 'mean'), 
                                                                 ('Maximum','max'), 
                                                                 ('Minimum','min'), 
                                                                 ('Range', myfun)
                                                                ]
                                                               )

Unnamed: 0_level_0,Unnamed: 1_level_0,Average,Maximum,Minimum,Range
day,sex,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Thur,Male,18.714667,41.19,7.51,33.68
Thur,Female,16.715312,43.11,8.35,34.76
Fri,Male,19.857,40.17,8.58,31.59
Fri,Female,14.145556,22.75,5.75,17.0
Sat,Male,20.802542,50.81,7.74,43.07
Sat,Female,19.680357,44.3,3.07,41.23
Sun,Male,21.887241,48.17,7.25,40.92
Sun,Female,19.872222,35.26,9.6,25.66
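pandas also supports naming the output columns through keyword arguments to agg ("named aggregation"), where each keyword is the new column name and its value is a (column, function) pair. The sketch below uses a made-up `bills` table and a hypothetical helper `value_range` that plays the role of myfun:

```python
import pandas as pd

def value_range(x):
    # Hypothetical helper: spread between the largest and smallest value
    return x.max() - x.min()

# Hypothetical toy data with tips-like column names
bills = pd.DataFrame({
    "day": ["Sat", "Sat", "Sun", "Sun", "Sun"],
    "total_bill": [10.0, 30.0, 20.0, 25.0, 45.0],
    "tip": [1.0, 4.0, 3.0, 3.0, 6.0],
})

summary = bills.groupby("day").agg(
    average_bill=("total_bill", "mean"),
    bill_range=("total_bill", value_range),
    max_tip=("tip", "max"),
)
print(summary)
```

This form can mix several source columns in one call while keeping the result's columns flat.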


Passing a dictionary that maps column names to functions applies a different function to each column:

In [88]:
tips.groupby('sex', observed = True).agg({'tip': 'mean', 'total_bill': 'max'})

Unnamed: 0_level_0,tip,total_bill
sex,Unnamed: 1_level_1,Unnamed: 2_level_1
Male,3.089618,50.81
Female,2.833448,44.3


###  Returning Aggregated Data Without Row Indexes

In [89]:
tips.groupby(['day', 'sex'], observed = True).mean(numeric_only = True)

Unnamed: 0_level_0,Unnamed: 1_level_0,total_bill,tip,size,tip_pct
day,sex,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Thur,Male,18.714667,2.980333,2.433333,0.165276
Thur,Female,16.715312,2.575625,2.46875,0.157525
Fri,Male,19.857,2.693,2.1,0.143385
Fri,Female,14.145556,2.781111,2.111111,0.199388
Sat,Male,20.802542,3.083898,2.644068,0.151577
Sat,Female,19.680357,2.801786,2.25,0.15647
Sun,Male,21.887241,3.220345,2.810345,0.162344
Sun,Female,19.872222,3.367222,2.944444,0.181569


We can drop the index by passing "as_index = False" in groupby method

In [90]:
tips.groupby(['day', 'sex'], observed = True, as_index = False).mean(numeric_only = True)

Unnamed: 0,day,sex,total_bill,tip,size,tip_pct
0,Thur,Male,18.714667,2.980333,2.433333,0.165276
1,Thur,Female,16.715312,2.575625,2.46875,0.157525
2,Fri,Male,19.857,2.693,2.1,0.143385
3,Fri,Female,14.145556,2.781111,2.111111,0.199388
4,Sat,Male,20.802542,3.083898,2.644068,0.151577
5,Sat,Female,19.680357,2.801786,2.25,0.15647
6,Sun,Male,21.887241,3.220345,2.810345,0.162344
7,Sun,Female,19.872222,3.367222,2.944444,0.181569
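The same flat layout can also be reached by aggregating with the default index and then calling reset_index. A brief sketch on a made-up table (the data values are assumptions) compares the two routes:

```python
import pandas as pd

# Hypothetical toy data with tips-like column names
toy = pd.DataFrame({
    "day": ["Sat", "Sat", "Sun"],
    "sex": ["M", "F", "M"],
    "tip": [1.0, 2.0, 3.0],
})

# Route 1: ask groupby not to build an index from the keys
flat = toy.groupby(["day", "sex"], as_index=False).mean(numeric_only=True)

# Route 2: build the hierarchical index, then move it back into columns
via_reset = toy.groupby(["day", "sex"]).mean(numeric_only=True).reset_index()

print(flat)
print(flat.equals(via_reset))
```

Using as_index=False avoids constructing the intermediate index, which can matter on large results.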


## Apply: General split-apply-combine

In [91]:
def ntop(df, n = 5, column = 'tip_pct'):
    return df.sort_values(column, ascending = False)[:n]

In [92]:
ntop(tips)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,tip_pct
172,7.25,5.15,Male,Yes,Sun,Dinner,2,0.710345
178,9.6,4.0,Female,Yes,Sun,Dinner,2,0.416667
67,3.07,1.0,Female,Yes,Sat,Dinner,1,0.325733
232,11.61,3.39,Male,No,Sat,Dinner,2,0.29199
183,23.17,6.5,Male,Yes,Sun,Dinner,4,0.280535


In [93]:
ntop(tips, 3, 'tip')

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,tip_pct
170,50.81,10.0,Male,Yes,Sat,Dinner,3,0.196812
212,48.33,9.0,Male,No,Sat,Dinner,4,0.18622
23,39.42,7.58,Male,No,Sat,Dinner,4,0.192288


Let's apply this function on grouped DataFrame

In [94]:
tips.groupby('sex', observed = True).apply(ntop, include_groups = False)

Unnamed: 0_level_0,Unnamed: 1_level_0,total_bill,tip,smoker,day,time,size,tip_pct
sex,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Male,172,7.25,5.15,Yes,Sun,Dinner,2,0.710345
Male,232,11.61,3.39,No,Sat,Dinner,2,0.29199
Male,183,23.17,6.5,Yes,Sun,Dinner,4,0.280535
Male,149,7.51,2.0,No,Thur,Lunch,2,0.266312
Male,181,23.33,5.65,Yes,Sun,Dinner,2,0.242177
Female,178,9.6,4.0,Yes,Sun,Dinner,2,0.416667
Female,67,3.07,1.0,Yes,Sat,Dinner,1,0.325733
Female,109,14.31,4.0,Yes,Sat,Dinner,2,0.279525
Female,93,16.32,4.3,Yes,Fri,Dinner,2,0.26348
Female,221,13.42,3.48,Yes,Fri,Lunch,2,0.259314


In [95]:
tips.groupby('sex', observed = True).apply(ntop, 2, 'total_bill', include_groups = False)

Unnamed: 0_level_0,Unnamed: 1_level_0,total_bill,tip,smoker,day,time,size,tip_pct
sex,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Male,170,50.81,10.0,Yes,Sat,Dinner,3,0.196812
Male,212,48.33,9.0,No,Sat,Dinner,4,0.18622
Female,102,44.3,2.5,Yes,Sat,Dinner,3,0.056433
Female,197,43.11,5.0,Yes,Thur,Lunch,4,0.115982


## Suppressing the Group Keys

In the preceding examples, you see that the resulting object has a hierarchical index
 formed from the group keys, along with the indexes of each piece of the original
 object. You can disable this by passing group_keys=False to groupby:

In [96]:
tips.groupby('sex', observed = True, group_keys = False).apply(ntop, include_groups = False)

Unnamed: 0,total_bill,tip,smoker,day,time,size,tip_pct
172,7.25,5.15,Yes,Sun,Dinner,2,0.710345
232,11.61,3.39,No,Sat,Dinner,2,0.29199
183,23.17,6.5,Yes,Sun,Dinner,4,0.280535
149,7.51,2.0,No,Thur,Lunch,2,0.266312
181,23.33,5.65,Yes,Sun,Dinner,2,0.242177
178,9.6,4.0,Yes,Sun,Dinner,2,0.416667
67,3.07,1.0,Yes,Sat,Dinner,1,0.325733
109,14.31,4.0,Yes,Sat,Dinner,2,0.279525
93,16.32,4.3,Yes,Fri,Dinner,2,0.26348
221,13.42,3.48,Yes,Fri,Lunch,2,0.259314


##  Quantile and Bucket Analysis

Some practice on the tips dataset:

In [97]:
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,tip_pct
0,16.99,1.01,Female,No,Sun,Dinner,2,0.059447
1,10.34,1.66,Male,No,Sun,Dinner,3,0.160542
2,21.01,3.5,Male,No,Sun,Dinner,3,0.166587
3,23.68,3.31,Male,No,Sun,Dinner,2,0.13978
4,24.59,3.61,Female,No,Sun,Dinner,4,0.146808


In [98]:
# The 25th percentile of tip
tips['tip'].quantile(0.25)

np.float64(2.0)

In [99]:
# Let's bucket the tips into 4 quantile-based buckets (quartiles)
tips['Quant'] = pd.qcut(tips['tip'], 4, labels = ['Q1', 'Q2', 'Q3', 'Q4'])

In [100]:
tips.head(10)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,tip_pct,Quant
0,16.99,1.01,Female,No,Sun,Dinner,2,0.059447,Q1
1,10.34,1.66,Male,No,Sun,Dinner,3,0.160542,Q1
2,21.01,3.5,Male,No,Sun,Dinner,3,0.166587,Q3
3,23.68,3.31,Male,No,Sun,Dinner,2,0.13978,Q3
4,24.59,3.61,Female,No,Sun,Dinner,4,0.146808,Q4
5,25.29,4.71,Male,No,Sun,Dinner,4,0.18624,Q4
6,8.77,2.0,Male,No,Sun,Dinner,2,0.22805,Q1
7,26.88,3.12,Male,No,Sun,Dinner,4,0.116071,Q3
8,15.04,1.96,Male,No,Sun,Dinner,2,0.130319,Q1
9,14.78,3.23,Male,No,Sun,Dinner,2,0.218539,Q3


Let's group the dataset by 'Quant' and apply several aggregation functions:

In [101]:
tips.groupby('Quant', observed = True)['tip'].agg(['mean', 'max', 'min'])

Unnamed: 0_level_0,mean,max,min
Quant,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Q1,1.711538,2.0,1.0
Q2,2.416818,2.88,2.01
Q3,3.189836,3.55,2.92
Q4,4.871475,10.0,3.6
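qcut chooses bucket edges from sample quantiles, so each bucket holds roughly the same number of rows. Its counterpart pd.cut splits the value range into equal-width intervals instead. The sketch below contrasts the two on a made-up Series with one outlier (the values and the labels B1 to B4 are assumptions):

```python
import pandas as pd

# Hypothetical toy data with one large outlier
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

# Quantile-based buckets: edges placed so that counts are balanced
by_quantile = pd.qcut(values, 4, labels=["Q1", "Q2", "Q3", "Q4"])

# Equal-width buckets: edges evenly spaced between min and max
by_width = pd.cut(values, 4, labels=["B1", "B2", "B3", "B4"])

quantile_counts = by_quantile.value_counts().sort_index().tolist()
width_counts = by_width.value_counts().sort_index().tolist()

print(quantile_counts)  # balanced bucket sizes
print(width_counts)     # the outlier sits alone in the last bucket
```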


#### Example: Filling Missing Values with Group-Specific Values

Let's simulate missing values in the tip column of the Tips dataset and then fill these missing values with the median tip for each day.

In [102]:
tips = sns.load_dataset('tips')

In [103]:
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [104]:
# Let's simulate missing values: mark roughly 10% of the rows at random
np.random.seed(42)
mask = np.random.rand(len(tips)) < 0.1

In [105]:
tips.loc[mask, 'tip'] = np.nan

In [106]:
# Checking number of missing values
tips['tip'].isna().sum()

np.int64(27)

In [107]:
tips['filled_tip'] = tips.groupby('day', observed = True)['tip'].transform(lambda x: x.fillna(x.median()))

In [108]:
tips.head(30)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,filled_tip
0,16.99,1.01,Female,No,Sun,Dinner,2,1.01
1,10.34,1.66,Male,No,Sun,Dinner,3,1.66
2,21.01,3.5,Male,No,Sun,Dinner,3,3.5
3,23.68,3.31,Male,No,Sun,Dinner,2,3.31
4,24.59,3.61,Female,No,Sun,Dinner,4,3.61
5,25.29,4.71,Male,No,Sun,Dinner,4,4.71
6,8.77,,Male,No,Sun,Dinner,2,3.31
7,26.88,3.12,Male,No,Sun,Dinner,4,3.12
8,15.04,1.96,Male,No,Sun,Dinner,2,1.96
9,14.78,3.23,Male,No,Sun,Dinner,2,3.23


In [109]:
# Check that "filled_tip" has no missing values left
tips['filled_tip'].isna().sum()

np.int64(0)

The same fill can be written with a named function passed to transform, this time overwriting the 'tip' column:

In [110]:
def fill_median(x):
    return x.fillna(x.median())

In [111]:
tips['tip'] = tips.groupby('day', observed = True)['tip'].transform(fill_median)

In [112]:
tips.head(30)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,filled_tip
0,16.99,1.01,Female,No,Sun,Dinner,2,1.01
1,10.34,1.66,Male,No,Sun,Dinner,3,1.66
2,21.01,3.5,Male,No,Sun,Dinner,3,3.5
3,23.68,3.31,Male,No,Sun,Dinner,2,3.31
4,24.59,3.61,Female,No,Sun,Dinner,4,3.61
5,25.29,4.71,Male,No,Sun,Dinner,4,4.71
6,8.77,3.31,Male,No,Sun,Dinner,2,3.31
7,26.88,3.12,Male,No,Sun,Dinner,4,3.12
8,15.04,1.96,Male,No,Sun,Dinner,2,1.96
9,14.78,3.23,Male,No,Sun,Dinner,2,3.23
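As an alternative sketch (not the approach used in the cells above), the per-group median can be broadcast once with transform("median") and then handed to fillna. The toy table below is an assumption for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing tips
toy = pd.DataFrame({
    "day": ["Sat", "Sat", "Sat", "Sun", "Sun"],
    "tip": [1.0, np.nan, 3.0, 10.0, np.nan],
})

# One median per row, aligned with the original index (NaN is skipped)
group_median = toy.groupby("day")["tip"].transform("median")

# Fill each missing tip with the median of its own day
toy["tip_filled"] = toy["tip"].fillna(group_median)

print(toy)
```

Passing the function name as a string lets pandas use its built-in implementation rather than calling a Python function per group.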


#### Example: Random Sampling and Permutation

Let's create a deck of playing cards

In [113]:
card = ['A', 2, 3, 4, 5, 6, 7, 8, 9, 10, 'J', 'Q', 'K']
value = list(range(1, 14)) * 4
suits = ['H', 'S', 'C', 'D']

In [114]:
decks = []
for suit in suits:
    decks.extend(str(c) + suit for c in card)

In [115]:
len(decks)

52

In [116]:
deck = pd.Series(value, index = decks)

In [117]:
deck.head(13)

AH      1
2H      2
3H      3
4H      4
5H      5
6H      6
7H      7
8H      8
9H      9
10H    10
JH     11
QH     12
KH     13
dtype: int64

In [118]:
def draw(card, n = 5):
    return card.sample(n)

In [119]:
draw(deck)

2C     2
2S     2
5H     5
KC    13
KD    13
dtype: int64

In [120]:
draw(deck, 3)

QD     12
KS     13
10C    10
dtype: int64

Now let's draw two cards from each suit

In [121]:
deck.groupby(lambda card: card[-1]).apply(draw, 2)

C  7C     7
   2C     2
D  6D     6
   8D     8
H  QH    12
   9H     9
S  6S     6
   JS    11
dtype: int64

Alternatively, we could pass group_keys=False to drop the outer suit index, leaving just the selected cards:

In [122]:
deck.groupby(lambda card: card[-1], group_keys = False).apply(draw, 2)

7C     7
9C     9
8D     8
5D     5
8H     8
QH    12
3S     3
QS    12
dtype: int64
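
The draws above change on every run. As a sketch of one way to make them repeatable, assuming pandas 1.1 or later, the GroupBy object's own `sample` method accepts a `random_state` and keeps the original card labels as the index. The deck is rebuilt here so the block runs on its own, and the names `labels` and `hand` are introduced only for this illustration:

```python
import pandas as pd

# Rebuild the 52-card deck from the cells above
card = ['A', 2, 3, 4, 5, 6, 7, 8, 9, 10, 'J', 'Q', 'K']
suits = ['H', 'S', 'C', 'D']
labels = [str(c) + s for s in suits for c in card]
deck = pd.Series(list(range(1, 14)) * 4, index=labels)

# Group by the last character of each label (the suit) and
# sample two cards per suit with a fixed seed
hand = deck.groupby(lambda label: label[-1]).sample(n=2, random_state=0)
print(hand)
```

Rerunning the cell with the same `random_state` returns the same eight cards, which is convenient when a notebook's output needs to be reproducible.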

#### Example: Group Weighted Average

Let's take the tips dataset and compute an overall weighted mean and a per-group weighted mean. Reloading the dataset also discards the missing values simulated earlier.

In [123]:
tips = sns.load_dataset('tips')

In [124]:
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


Let's find the weighted mean of tip, using total_bill as the weight

In [125]:
np.average(tips['tip'], weights = tips['total_bill'])

np.float64(3.4172321382335937)

Now let's calculate the weighted mean within each group

In [126]:
tips.groupby('day', observed = True).apply(lambda g: np.average(g['tip'], weights = g['total_bill']), 
                                           include_groups=False).reset_index(name='weighted_tip_mean')

Unnamed: 0,day,weighted_tip_mean
0,Thur,3.213559
1,Fri,3.095674
2,Sat,3.519242
3,Sun,3.50737
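
The group weighted mean can also be computed without `apply`, since a weighted mean is just sum(value × weight) / sum(weight). The sketch below uses a hypothetical toy frame (values invented for illustration) and two grouped sums:

```python
import pandas as pd

# Hypothetical toy data, not the tips dataset
toy = pd.DataFrame({
    'day': ['Sat', 'Sat', 'Sun', 'Sun'],
    'tip': [2.0, 4.0, 1.0, 3.0],
    'total_bill': [10.0, 30.0, 20.0, 20.0],
})

# Numerator: sum of tip * weight within each day
weighted_sum = (toy['tip'] * toy['total_bill']).groupby(toy['day']).sum()

# Denominator: sum of the weights within each day
weight_sum = toy.groupby('day')['total_bill'].sum()

# Both Series are indexed by day, so the division aligns group by group
weighted_tip_mean = weighted_sum / weight_sum
print(weighted_tip_mean)
```

For Sat this gives (2×10 + 4×30) / 40 = 3.5, and for Sun (1×20 + 3×20) / 40 = 2.0, which matches what `np.average` with `weights` would return for each group.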


## Group Transforms and “Unwrapped” GroupBys

When you perform a groupby in Pandas, you "split" your data into groups based on a key, then "apply" a function to each group, and finally "combine" the results. This is the famous split-apply-combine strategy.
+ apply():  
Applies a function to each group and combines the results. The output can be of different shapes or types than the original data. This method is very flexible but sometimes returns a Series or DataFrame with a different index (often a MultiIndex).
+ transform():  
Applies a function to each group and returns a Series that has the same size and index as the original DataFrame. This is useful when you want to "broadcast" a group-level calculation back to the original DataFrame.

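Key Difference:
transform returns a result that aligns with the original data (i.e., one value per row), whereas apply can return a reduced or altered shape. This makes transform ideal for imputations or adding new columns based on group-level calculations. An "unwrapped" group operation goes one step further: instead of wrapping the logic in a custom function, it combines the outputs of built-in group methods (such as a transform with 'mean') using ordinary arithmetic, which is often faster.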
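
To make the shape difference concrete, here is a minimal sketch on a hypothetical toy frame (values invented for illustration). It also passes the string alias `'mean'` to `transform` and uses the result directly in row-wise arithmetic:

```python
import pandas as pd

# Hypothetical toy data, not the tips dataset
toy = pd.DataFrame({
    'day': ['Sat', 'Sun', 'Sat', 'Sun', 'Sun'],
    'tip': [2.0, 1.0, 4.0, 3.0, 8.0],
})
grouped = toy.groupby('day')['tip']

# apply with a reducing function: one value per group, indexed by day
per_group = grouped.apply(lambda s: s.mean())

# transform: one value per original row, indexed like toy
per_row = grouped.transform('mean')

# Because per_row shares toy's index, plain arithmetic lines up row by row
toy['tip_minus_day_mean'] = toy['tip'] - per_row

print(per_group.shape, per_row.shape)
print(toy)
```

`per_group` has two entries (Sat and Sun) while `per_row` has five, one for each row of `toy`.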

In [127]:
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [128]:
tips.groupby('day', observed = True)['tip'].apply(lambda g: g.mean()).reset_index(name = 'mean_tip')

Unnamed: 0,day,mean_tip
0,Thur,2.771452
1,Fri,2.734737
2,Sat,2.993103
3,Sun,3.255132


In [129]:
tips.groupby('day', observed = True)['tip'].transform(lambda g: g.mean()).reset_index(name = 'mean_tip')

Unnamed: 0,index,mean_tip
0,0,3.255132
1,1,3.255132
2,2,3.255132
3,3,3.255132
4,4,3.255132
...,...,...
239,239,2.993103
240,240,2.993103
241,241,2.993103
242,242,2.993103


In [130]:
# Using apply to compute mean total_bill per day
mean_total_bill_apply = tips.groupby('day', observed = True)['total_bill'].apply(lambda x: x.mean())
print("Mean total_bill by day using apply:")
print(mean_total_bill_apply)

# Using transform to compute and broadcast the mean total_bill back to each row
tips['mean_total_bill'] = tips.groupby('day', observed = True)['total_bill'].transform(lambda x: x.mean())
print("\nDataFrame with mean_total_bill column using transform:")
print(tips[['day', 'total_bill', 'mean_total_bill']].head(10))

Mean total_bill by day using apply:
day
Thur    17.682742
Fri     17.151579
Sat     20.441379
Sun     21.410000
Name: total_bill, dtype: float64

DataFrame with mean_total_bill column using transform:
   day  total_bill  mean_total_bill
0  Sun       16.99            21.41
1  Sun       10.34            21.41
2  Sun       21.01            21.41
3  Sun       23.68            21.41
4  Sun       24.59            21.41
5  Sun       25.29            21.41
6  Sun        8.77            21.41
7  Sun       26.88            21.41
8  Sun       15.04            21.41
9  Sun       14.78            21.41


## Pivot Tables and Cross-Tabulation

In [131]:
tips = sns.load_dataset('tips')

In [132]:
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [133]:
tips.pivot_table(index = ['day', 'smoker'], values = ['total_bill', 'tip', 'size'], observed = True)

Unnamed: 0_level_0,Unnamed: 1_level_0,size,tip,total_bill
day,smoker,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Thur,Yes,2.352941,3.03,19.190588
Thur,No,2.488889,2.673778,17.113111
Fri,Yes,2.066667,2.714,16.813333
Fri,No,2.25,2.8125,18.42
Sat,Yes,2.47619,2.875476,21.276667
Sat,No,2.555556,3.102889,19.661778
Sun,Yes,2.578947,3.516842,24.12
Sun,No,2.929825,3.167895,20.506667


In [134]:
tips.groupby(['day', 'smoker'], observed = True).mean(numeric_only = True)

Unnamed: 0_level_0,Unnamed: 1_level_0,total_bill,tip,size
day,smoker,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Thur,Yes,19.190588,3.03,2.352941
Thur,No,17.113111,2.673778,2.488889
Fri,Yes,16.813333,2.714,2.066667
Fri,No,18.42,2.8125,2.25
Sat,Yes,21.276667,2.875476,2.47619
Sat,No,19.661778,3.102889,2.555556
Sun,Yes,24.12,3.516842,2.578947
Sun,No,20.506667,3.167895,2.929825


In [135]:
tips.pivot_table(index = ['day', 'time'], columns = 'smoker',
                 values = ['total_bill', 'tip'], observed = True)

Unnamed: 0_level_0,Unnamed: 1_level_0,tip,tip,total_bill,total_bill
Unnamed: 0_level_1,smoker,Yes,No,Yes,No
day,time,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
Thur,Lunch,3.03,2.666364,19.190588,17.075227
Thur,Dinner,,3.0,,18.78
Fri,Lunch,2.28,3.0,12.323333,15.98
Fri,Dinner,3.003333,2.75,19.806667,19.233333
Sat,Dinner,2.875476,3.102889,21.276667,19.661778
Sun,Dinner,3.516842,3.167895,24.12,20.506667


We could augment this table to include partial totals by passing margins=True. This adds an All row and an All column, whose values are the group statistic (here the mean) computed over all the data in that row or column:

In [136]:
tips.pivot_table(index = ['day', 'time'], columns = 'smoker',
                 values = ['total_bill', 'tip'], observed = True, margins = True)

Unnamed: 0_level_0,Unnamed: 1_level_0,tip,tip,tip,total_bill,total_bill,total_bill
Unnamed: 0_level_1,smoker,Yes,No,All,Yes,No,All
day,time,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
Thur,Lunch,3.03,2.666364,2.767705,19.190588,17.075227,17.664754
Thur,Dinner,,3.0,3.0,,18.78,18.78
Fri,Lunch,2.28,3.0,2.382857,12.323333,15.98,12.845714
Fri,Dinner,3.003333,2.75,2.94,19.806667,19.233333,19.663333
Sat,Dinner,2.875476,3.102889,2.993103,21.276667,19.661778,20.441379
Sun,Dinner,3.516842,3.167895,3.255132,24.12,20.506667,21.41
All,,3.00871,2.991854,2.998279,20.756344,19.188278,19.785943
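
Note that an All entry is computed from the underlying rows, not by averaging the cell means next to it. A small sketch on a hypothetical toy frame (values invented for illustration) with unequal group sizes shows the difference:

```python
import pandas as pd

# Hypothetical toy data: Sat has two smoker rows but only one non-smoker row
toy = pd.DataFrame({
    'day': ['Sat', 'Sat', 'Sat', 'Sun', 'Sun'],
    'smoker': ['Yes', 'Yes', 'No', 'Yes', 'No'],
    'tip': [2.0, 4.0, 6.0, 1.0, 3.0],
})

# Default aggregation is the mean; margins=True adds the All row and column
table = toy.pivot_table(index='day', columns='smoker', values='tip',
                        margins=True)
print(table)
```

For Sat the cell means are 3.0 (Yes) and 6.0 (No), but the All value is 4.0, the mean of the three Sat tips 2.0, 4.0, and 6.0, rather than 4.5.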


#### Cross-Tabulation in Pandas
```
cross_tab = pd.crosstab(
    data['row_variable'],
    data['column_variable'],
    margins=True,         # Optionally add row/column totals
    normalize=False       # Optionally normalize to proportions
)
```

In [137]:
pd.crosstab(tips['day'], tips['time'], margins = True)

time,Lunch,Dinner,All
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Thur,61,1,62
Fri,7,12,19
Sat,0,87,87
Sun,0,76,76
All,68,176,244


In [138]:
crosstab_tips_norm = pd.crosstab(
    tips['day'],
    tips['time'],
    margins=True,
    normalize='index'  # Normalize by row
)

In [139]:
crosstab_tips_norm

time,Lunch,Dinner
day,Unnamed: 1_level_1,Unnamed: 2_level_1
Thur,0.983871,0.016129
Fri,0.368421,0.631579
Sat,0.0,1.0
Sun,0.0,1.0
All,0.278689,0.721311
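
Besides `'index'`, the `normalize` argument of `pd.crosstab` also accepts `'columns'` and `'all'`. The sketch below compares the three on a hypothetical toy frame (values invented for illustration):

```python
import pandas as pd

# Hypothetical toy data, not the tips dataset
toy = pd.DataFrame({
    'day': ['Thur', 'Thur', 'Fri', 'Fri', 'Fri', 'Sat'],
    'time': ['Lunch', 'Dinner', 'Lunch', 'Dinner', 'Dinner', 'Dinner'],
})

# Each row sums to 1: share of Lunch vs Dinner within a day
by_row = pd.crosstab(toy['day'], toy['time'], normalize='index')

# Each column sums to 1: share of each day within Lunch or Dinner
by_col = pd.crosstab(toy['day'], toy['time'], normalize='columns')

# The whole table sums to 1: share of each (day, time) pair overall
overall = pd.crosstab(toy['day'], toy['time'], normalize='all')

print(by_row, by_col, overall, sep='\n\n')
```

Which option to use depends on the question: `'index'` answers "what happens within each day", `'columns'` answers "which days make up each meal time", and `'all'` gives each cell's share of the full dataset.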
