<p style="color: red; font-size: 16pt; font-weight: bold; text-align:center;">Change the name of this notebook before you edit!</p>

# Group and Aggregate Data

With the expressiveness of Python and pandas, we can perform quite complex group operations by expressing them as custom Python functions that manipulate the data associated with each group:

- **slicing/filtering**: Split a pandas object into pieces using one or more keys (in the form of functions, arrays, or DataFrame column names)

- **groupby**: Calculate group summary statistics, like count, mean, or standard deviation, or a user-defined function.

- **split-apply-combine**: Apply within-group transformations or other manipulations, like normalization, linear regression, rank, or subset selection.

- **pivot_table**: Compute pivot tables and cross-tabulations.

- **summary statistics**: Perform quantile analysis and other statistical group analyses.



# Setup Libraries and Functions

In [1]:
%reload_ext autoreload
%autoreload

In [2]:
import numpy as np
import pandas as pd

### 10.1 Group Operations

**split-apply-combine** concept:

- Data contained in a pandas Series or DataFrame is split into groups based on one or more keys that you provide. 

- The splitting is performed on a particular axis of an object. 

- A DataFrame can be grouped on its rows (axis="index") or its columns (axis="columns"). 

- Once this is done, a function is applied to each group, producing a new value. 

- Finally, the results of all those function applications are combined into a result object. 
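
The three steps above can be sketched by hand before handing the work to `.groupby()`. This is only an illustration: the DataFrame `toy` and the names `pieces`, `applied`, `manual` and `auto` are made up for this sketch and do not appear elsewhere in the notebook.

```python
import pandas as pd

# Hypothetical toy data, not taken from the notebook
toy = pd.DataFrame({"key": ["a", "b", "a", "b", "a"],
                    "value": [1.0, 2.0, 3.0, 4.0, 5.0]})

# Split: one sub-Series of values per unique key
pieces = {k: toy.loc[toy["key"] == k, "value"] for k in toy["key"].unique()}

# Apply: reduce each piece to a single value
applied = {k: piece.mean() for k, piece in pieces.items()}

# Combine: assemble the per-group results into one result object
manual = pd.Series(applied, name="value").sort_index()

# The same computation in one step with groupby
auto = toy.groupby("key")["value"].mean()
```

Both `manual` and `auto` hold one mean per key; `.groupby()` simply performs the split, apply and combine steps for you.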


Each grouping key can take many forms, and the keys do not have to be all of the same type:

- A list or array of values that is the same length as the axis being grouped

- A value indicating a column name in a DataFrame

- A dictionary or Series giving a correspondence between the values on the axis being grouped and the group names

- A function to be invoked on the axis index or the individual labels in the index
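
As a rough sketch, the four kinds of grouping keys listed above could look as follows. The DataFrame `toy`, its labels and the mapping `label_to_region` are hypothetical and only serve this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame for illustration
toy = pd.DataFrame({"city": ["Rome", "Oslo", "Rome", "Oslo"],
                    "sales": [10, 20, 30, 40]},
                   index=["r1", "o1", "r2", "o2"])

# 1. An array of values with the same length as the axis being grouped
by_array = toy["sales"].groupby(np.array([2023, 2023, 2024, 2024])).sum()

# 2. A column name in the DataFrame
by_column = toy.groupby("city")["sales"].sum()

# 3. A dictionary mapping index labels to group names
label_to_region = {"r1": "south", "r2": "south", "o1": "north", "o2": "north"}
by_dict = toy["sales"].groupby(label_to_region).sum()

# 4. A function that is called on each index label (here: its first character)
by_func = toy["sales"].groupby(lambda label: label[0]).sum()
```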




![image.png](attachment:pda3_1001.png)


Example: Compute the mean of the data1 column using the labels from key1. 

In [73]:
df = pd.DataFrame({
        "key1" : ["a", "a", None, "b", "b", "a", None],
        "key2" : pd.Series([1, 2, 1, 2, 1, None, 1], dtype="Int64"),
        "data1" : np.random.standard_normal(7),
        "data2" : np.random.standard_normal(7)
})
df

Unnamed: 0,key1,key2,data1,data2
0,a,1.0,0.473414,-1.366382
1,a,2.0,0.814117,0.112157
2,,1.0,1.800812,0.98194
3,b,2.0,0.881388,0.926022
4,b,1.0,-0.002598,1.505829
5,a,,1.162324,-0.436305
6,,1.0,-0.481928,0.311539


Create a new Python object **grouped** that groups all the values in column *data1* based on the unique labels in column *key1* of the DataFrame *df*:

In [74]:
grouped = df["data1"].groupby(df["key1"])
grouped

<pandas.core.groupby.generic.SeriesGroupBy object at 0x7fb62d4c8190>

You can now apply methods like *.mean()* to compute the mean value for each group of column *data1*. The result is a new Series (not a DataFrame) whose index is named *key1* and holds the unique labels found in *key1*. The values are the group means, and the Series keeps the name *data1* of the original column on which the grouping and aggregation was done. Rows with a missing value in *key1* are dropped by default, so the result has fewer rows than *df*:

In [75]:
grouped.mean()

key1
a    0.816618
b    0.439395
Name: data1, dtype: float64


You can also group on several columns, e.g. on column *key1* and then on column *key2*. Pass the two Series *df["key1"]* and *df["key2"]* as a list to the `.groupby()` call on the Series *df["data1"]*. The data in column *data1* is then grouped by the unique pairs of values in *key1* and *key2*, and the result has a hierarchical (two-level) row index:

In [76]:
means = df["data1"].groupby([df["key1"], df["key2"]]).mean()
means

key1  key2
a     1       0.473414
      2       0.814117
b     1      -0.002598
      2       0.881388
Name: data1, dtype: float64

In order to disentangle the result you can switch from the hierarchical row index with *key1* and *key2* to an 'unstacked' representation with *key1* as row index (first column used in the grouping) and *key2* as column index:

In [77]:
means.unstack()

key2,1,2
key1,Unnamed: 1_level_1,Unnamed: 2_level_1
a,0.473414,0.814117
b,-0.002598,0.881388


Note: You can also use any other list of arrays or lists to do the grouping in the `.groupby()` method call (not just Series).
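
A minimal sketch of the note above, assuming two hypothetical arrays `states` and `years` that are not columns of any DataFrame but have the same length as the data:

```python
import numpy as np
import pandas as pd

# Hypothetical data and grouping arrays, for illustration only
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
states = np.array(["CA", "OH", "CA", "OH", "CA"])
years = np.array([2005, 2005, 2006, 2005, 2006])

# Group the Series by the pairs (state, year) taken from the two arrays
means = values.groupby([states, years]).mean()
```

The arrays only need to line up position by position with the data being grouped.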

Shorter notation if you are only using columns in the same DataFrame for grouping: Just use the column name.
 
The example below groups by *key1* and calculates the mean of all the other columns of the DataFrame.

Important: The mean can only be computed on numeric columns. If a non-numeric column such as *key1* is among the columns to aggregate (as when grouping on *key2* below), `.mean()` raises a TypeError unless you pass the flag *numeric_only=True*.

In [78]:
df.groupby("key1").mean()

Unnamed: 0_level_0,key2,data1,data2
key1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1.5,0.816618,-0.56351
b,1.5,0.439395,1.215925


In [79]:
df.groupby("key2").mean()

TypeError: can only concatenate str (not "int") to str

In [80]:
df.groupby("key2").mean(numeric_only=True)

Unnamed: 0_level_0,data1,data2
key2,Unnamed: 1_level_1,Unnamed: 2_level_1
1,0.447425,0.358232
2,0.847752,0.51909


In [81]:
df.groupby(["key1", "key2"]).mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,data1,data2
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1,0.473414,-1.366382
a,2,0.814117,0.112157
b,1,-0.002598,1.505829
b,2,0.881388,0.926022


Useful methods or additional flags:

In [12]:
df.groupby(["key1", "key2"]).size()

key1  key2
a     1       1
      2       1
b     1       1
      2       1
dtype: int64

In [13]:
df.groupby("key1", dropna=False).size()

key1
a      3
b      2
NaN    2
dtype: int64

In [14]:
df.groupby(["key1", "key2"], dropna=False).size()

key1  key2
a     1       1
      2       1
      <NA>    1
b     1       1
      2       1
NaN   1       2
dtype: int64

In [15]:
df.groupby("key1").count()

Unnamed: 0_level_0,key2,data1,data2
key1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
a,2,3,3
b,2,2,2


You can iterate over groups (groups are just another Python object):

In [16]:
for name, group in df.groupby("key1"):
        print(name)
        print(group)

a
  key1  key2     data1     data2
0    a     1  0.370441 -0.349011
1    a     2  0.123473  1.970984
5    a  <NA>  0.997768  0.203423
b
  key1  key2     data1     data2
3    b     2 -0.384046  0.728813
4    b     1  0.206900 -1.833244


In [17]:
for (k1, k2), group in df.groupby(["key1", "key2"]):
        print((k1, k2))
        print(group)

('a', 1)
  key1  key2     data1     data2
0    a     1  0.370441 -0.349011
('a', 2)
  key1  key2     data1     data2
1    a     2  0.123473  1.970984
('b', 1)
  key1  key2   data1     data2
4    b     1  0.2069 -1.833244
('b', 2)
  key1  key2     data1     data2
3    b     2 -0.384046  0.728813


Selecting a Column or Subset of Columns:

In [18]:
df.groupby(["key1", "key2"])[["data2"]].mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,data2
key1,key2,Unnamed: 2_level_1
a,1,-0.349011
a,2,1.970984
b,1,-1.833244
b,2,0.728813


In [19]:
df.groupby(["key1", "key2"])[["data1", "data2"]].mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,data1,data2
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1,0.370441,-0.349011
a,2,0.123473,1.970984
b,1,0.2069,-1.833244
b,2,-0.384046,0.728813


You can also group based on a dictionary:

Example: Perform a mapping of column labels onto certain new values to create groups and perform grouping on those groups using a Python dictionary. (Similar when using a pandas Series instead.) 

Important Update: You need to transpose the dataframe with *.T* if you want to do the grouping on columns. Otherwise .groupby() will try to do grouping on the rows.

In [82]:
people = pd.DataFrame(np.random.standard_normal((5, 5)),
                      columns=["a", "b", "c", "d", "e"],
                      index=["Joe", "Steve", "Wanda", "Jill", "Trey"])
people.iloc[2:3, [1, 2]] = np.nan # Add a few NA values
people

Unnamed: 0,a,b,c,d,e
Joe,-2.491642,-0.194876,0.535224,-1.805949,-0.12395
Steve,0.358106,-0.644409,-0.417606,0.123906,0.936589
Wanda,-0.607237,,,-0.334701,0.231399
Jill,-0.316258,1.122168,0.4615,0.297188,-0.259514
Trey,-0.407495,0.286012,-0.390774,1.440359,-1.345791


In [83]:
people.T

Unnamed: 0,Joe,Steve,Wanda,Jill,Trey
a,-2.491642,0.358106,-0.607237,-0.316258,-0.407495
b,-0.194876,-0.644409,,1.122168,0.286012
c,0.535224,-0.417606,,0.4615,-0.390774
d,-1.805949,0.123906,-0.334701,0.297188,1.440359
e,-0.12395,0.936589,0.231399,-0.259514,-1.345791


In [21]:
mapping = {"a": "red", "b": "red", "c": "blue", "d": "blue", "e": "red", "f" : "orange"}

In [84]:
by_column = people.T.groupby(mapping)
by_column.sum()

Unnamed: 0,Joe,Steve,Wanda,Jill,Trey
blue,-1.270725,-0.2937,-0.334701,0.758688,1.049585
red,-2.810468,0.650286,-0.375838,0.546396,-1.467274


In [23]:
new_mapping = {"Joe": "Friend", "Steve": "Stranger", "Wanda": "Stranger", "Jill": "Friend", "Trey": "Stranger"}

In [24]:
by_row = people.groupby(new_mapping)
by_row.sum()

Unnamed: 0,a,b,c,d,e
Friend,-0.887675,-0.271378,-1.875904,1.104165,-2.673337
Stranger,-0.960652,-0.036037,0.73943,1.733807,4.230376


You can group with functions as well:

Example:
- Take the length of each label in the row index and group the data based on those lengths via the method call *.groupby(len)*.
- Then aggregate the data across all the rows e.g. using *.sum()*.
- If you want to group using the column name lengths and then aggregate you need to use *.groupby(len, axis="columns")* (deprecated) or better transpose and then groupby via *.T.groupby(len)*.

In [25]:
people.groupby(len).sum()

Unnamed: 0,a,b,c,d,e
3,-0.706563,0.162762,-1.043686,-1.46571,-2.310257
4,0.515745,-1.388313,-0.59407,4.160559,1.550295
5,-1.657508,0.918136,0.501282,0.143123,2.317001


In [26]:
people.groupby(len, axis="columns").sum()

Unnamed: 0,1
Joe,-5.363454
Steve,1.604077
Wanda,0.617957
Jill,0.759325
Trey,3.484891


In [27]:
people.T.groupby(len).sum()

Unnamed: 0,Joe,Steve,Wanda,Jill,Trey
1,-5.363454,1.604077,0.617957,0.759325,3.484891


In [28]:
people.T.groupby(len).sum().T

Unnamed: 0,1
Joe,-5.363454
Steve,1.604077
Wanda,0.617957
Jill,0.759325
Trey,3.484891


### 10.2 Data Aggregation

Examples of aggregation methods:

Table 10.1: Optimized groupby methods

|  Function name | Description |
|---------------:|:------------|
|       any, all | Return True if any (one or more values) or all non-NA values are "truthy" |
|          count | Number of non-NA values |
| cummin, cummax | Cumulative minimum and maximum of non-NA values |
|         cumsum | Cumulative sum of non-NA values |
|        cumprod | Cumulative product of non-NA values |
|    first, last | First and last non-NA values |
|           mean | Mean of non-NA values |
|         median | Arithmetic median of non-NA values |
|       min, max | Minimum and maximum of non-NA values |
|            nth | Retrieve value that would appear at position n with the data in sorted order |
|           ohlc | Compute four "open-high-low-close" statistics for time series-like data |
|           prod | Product of non-NA values |
|       quantile | Compute sample quantile |
|           rank | Ordinal ranks of non-NA values, like calling Series.rank |
|           size | Compute group sizes, returning result as a Series |
|            sum | Sum of non-NA values |
|       std, var | Sample standard deviation and variance |
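
To get a feel for a few of the methods in Table 10.1, here is a small sketch on a hypothetical DataFrame `toy` with one missing value. It is meant to show the difference between NA-aware methods such as `count` and `first` and the row-based `size`.

```python
import pandas as pd

# Hypothetical toy data with one missing value in group "a"
toy = pd.DataFrame({"key": ["a", "a", "a", "b", "b"],
                    "value": [1.0, None, 3.0, 4.0, 6.0]})
g = toy.groupby("key")["value"]

counts = g.count()          # non-NA values per group
sizes = g.size()            # rows per group, missing values included
firsts = g.first()          # first non-NA value per group
medians = g.quantile(0.5)   # sample quantile (here: the median) per group
running = g.cumsum()        # cumulative sum, same length as the input
```

Note that `cumsum` is not a reduction: it returns one value per input row rather than one value per group.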


You can apply your own aggregation with **.agg** (short for *.aggregate*). Pass either the name of a built-in method as a string or a function that reduces the values of each group to a single value:

In [None]:
# All three calls compute the same mean of column data1
df['data1'].mean()
df['data1'].agg('mean')
df['data1'].aggregate('mean')

In [29]:
df

Unnamed: 0,key1,key2,data1,data2
0,a,1.0,0.370441,-0.349011
1,a,2.0,0.123473,1.970984
2,,1.0,0.620048,2.068026
3,b,2.0,-0.384046,0.728813
4,b,1.0,0.2069,-1.833244
5,a,,0.997768,0.203423
6,,1.0,0.587645,-0.967694


In [30]:
def peak_to_peak(arr):
    return arr.max() - arr.min()

In [31]:
grouped = df.groupby("key1")

In [32]:
grouped.agg(peak_to_peak)

Unnamed: 0_level_0,key2,data1,data2
key1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1,0.874295,2.319994
b,1,0.590946,2.562057


Note: Missing values in *key1* are again ignored for the grouping and aggregation.

You can also use *.describe()* as a (quasi-) aggregation method:

In [33]:
grouped.describe()

Unnamed: 0_level_0,key2,key2,key2,key2,key2,key2,key2,key2,data1,data1,data1,data1,data1,data2,data2,data2,data2,data2,data2,data2,data2
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
key1,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
a,2.0,1.5,0.707107,1.0,1.25,1.5,1.75,2.0,3.0,0.497227,...,0.684104,0.997768,3.0,0.608465,1.211874,-0.349011,-0.072794,0.203423,1.087204,1.970984
b,2.0,1.5,0.707107,1.0,1.25,1.5,1.75,2.0,2.0,-0.088573,...,0.059163,0.2069,2.0,-0.552215,1.811648,-1.833244,-1.19273,-0.552215,0.088299,0.728813


You can do several aggregations at once by simply listing them. You can also re-name them if you like:

In [34]:
grouped.agg(["mean", "std", peak_to_peak])

Unnamed: 0_level_0,key2,key2,key2,data1,data1,data1,data2,data2,data2
Unnamed: 0_level_1,mean,std,peak_to_peak,mean,std,peak_to_peak,mean,std,peak_to_peak
key1,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2
a,1.5,0.707107,1,0.497227,0.450726,0.874295,0.608465,1.211874,2.319994
b,1.5,0.707107,1,-0.088573,0.417862,0.590946,-0.552215,1.811648,2.562057


In [35]:
grouped.agg([("average", "mean"), ("stdev", "std")])

Unnamed: 0_level_0,key2,key2,data1,data1,data2,data2
Unnamed: 0_level_1,average,stdev,average,stdev,average,stdev
key1,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
a,1.5,0.707107,0.497227,0.450726,0.608465,1.211874
b,1.5,0.707107,-0.088573,0.417862,-0.552215,1.811648


You can also perform different aggregations on different columns. Pass a dictionary that maps column names to aggregation functions as the argument to *.agg()*:

In [36]:
grouped.agg({"data1" : "sum", "data2" : "mean"})

Unnamed: 0_level_0,data1,data2
key1,Unnamed: 1_level_1,Unnamed: 2_level_1
a,1.491682,0.608465
b,-0.177146,-0.552215


In [37]:
grouped.agg({"data1" : ["sum", peak_to_peak], "data2" : ["mean", peak_to_peak]})

Unnamed: 0_level_0,data1,data1,data2,data2
Unnamed: 0_level_1,sum,peak_to_peak,mean,peak_to_peak
key1,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1.491682,0.874295,0.608465,2.319994
b,-0.177146,0.590946,-0.552215,2.562057


Transform (hierarchical) index back into columns with **as_index=False** in the `.groupby()` argument: 

Example below: *key1* and *key2* turn from row index levels back into regular columns (the row labels become column values), and a new integer index is attached as the row index. Alternative: Use *.reset_index()*.

In [85]:
grouped = df.groupby(["key1", "key2"])
grouped.agg({"data1" : "sum", "data2" : "mean"})

Unnamed: 0_level_0,Unnamed: 1_level_0,data1,data2
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1,0.473414,-1.366382
a,2,0.814117,0.112157
b,1,-0.002598,1.505829
b,2,0.881388,0.926022


In [86]:
grouped.agg({"data1" : "sum", "data2" : "mean"}).reset_index()

Unnamed: 0,key1,key2,data1,data2
0,a,1,0.473414,-1.366382
1,a,2,0.814117,0.112157
2,b,1,-0.002598,1.505829
3,b,2,0.881388,0.926022


In [87]:
grouped.agg({"data1" : "sum", "data2" : "mean"})

Unnamed: 0_level_0,Unnamed: 1_level_0,data1,data2
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1,0.473414,-1.366382
a,2,0.814117,0.112157
b,1,-0.002598,1.505829
b,2,0.881388,0.926022


In [88]:
grouped.agg({"data1" : "sum", "data2" : "mean"}).unstack()

Unnamed: 0_level_0,data1,data1,data2,data2
key2,1,2,1,2
key1,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,0.473414,0.814117,-1.366382,0.112157
b,-0.002598,0.881388,1.505829,0.926022


In [40]:
grouped = df.groupby(["key1", "key2"], as_index=False)
grouped.agg({"data1" : "sum", "data2" : "mean"})

Unnamed: 0,key1,key2,data1,data2
0,a,1,0.370441,-0.349011
1,a,2,0.123473,1.970984
2,b,1,0.2069,-1.833244
3,b,2,-0.384046,0.728813


### 10.3 Apply: General split-apply-combine

General purpose GroupBy method: `.apply()`

- `.apply()` splits the object being manipulated into pieces, one per group.
- It invokes the function that is passed as argument on each piece.
- It then concatenates the results.
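
The three steps can be imitated by hand with a loop over the groups and `pd.concat`. This is only a sketch of the idea, not how pandas implements `.apply()` internally; the DataFrame `toy` and the function `top_two` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame({"team": ["x", "y", "x", "y", "x"],
                    "score": [5, 9, 7, 3, 6]})

def top_two(frame, column="score"):
    # Return the two rows with the largest values in `column`
    return frame.sort_values(column, ascending=False).head(2)

# Split the frame and call the function on each piece ...
pieces = {name: top_two(group[["score"]]) for name, group in toy.groupby("team")}

# ... then concatenate the results, labeled with the group names
manual = pd.concat(pieces, names=["team"])

# The same result with .apply()
via_apply = toy.groupby("team")[["score"]].apply(top_two)
```

Both results carry a hierarchical index: the outer level holds the group names, the inner level the row labels from the original DataFrame.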

In [41]:
# tips = pd.read_csv("/Users/markjack/GSU_Fall2024/IFI8410/DataScienceProgramming/07-Text-Processing/tips.csv")
tips = pd.read_csv("tips.csv")
tips["tip_pct"] = tips["tip"] / tips["total_bill"]
print(tips.columns)
tips.head()

Index(['total_bill', 'tip', 'smoker', 'day', 'time', 'size', 'tip_pct'], dtype='object')


Unnamed: 0,total_bill,tip,smoker,day,time,size,tip_pct
0,16.99,1.01,No,Sun,Dinner,2,0.059447
1,10.34,1.66,No,Sun,Dinner,3,0.160542
2,21.01,3.5,No,Sun,Dinner,3,0.166587
3,23.68,3.31,No,Sun,Dinner,2,0.13978
4,24.59,3.61,No,Sun,Dinner,4,0.146808


In [42]:
def top(df, n=5, column="tip_pct"):
    return df.sort_values(column, ascending=False)[:n]

In [43]:
top(tips, n=6)

Unnamed: 0,total_bill,tip,smoker,day,time,size,tip_pct
172,7.25,5.15,Yes,Sun,Dinner,2,0.710345
178,9.6,4.0,Yes,Sun,Dinner,2,0.416667
67,3.07,1.0,Yes,Sat,Dinner,1,0.325733
232,11.61,3.39,No,Sat,Dinner,2,0.29199
183,23.17,6.5,Yes,Sun,Dinner,4,0.280535
109,14.31,4.0,Yes,Sat,Dinner,2,0.279525


Example:

- Split the **tips** DataFrame using `.groupby("smoker")`. 
- Then call the `top()` function on each group.
- The results of each function call are glued together using `pandas.concat` (internally).
- The pieces are labeled with the group names. 
- The result has a hierarchical index with an inner level that contains index values from the original DataFrame.

In [44]:
tips.groupby("smoker").apply(top)

Unnamed: 0_level_0,Unnamed: 1_level_0,total_bill,tip,smoker,day,time,size,tip_pct
smoker,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
No,232,11.61,3.39,No,Sat,Dinner,2,0.29199
No,149,7.51,2.0,No,Thur,Lunch,2,0.266312
No,51,10.29,2.6,No,Sun,Dinner,2,0.252672
No,185,20.69,5.0,No,Sun,Dinner,5,0.241663
No,88,24.71,5.85,No,Thur,Lunch,2,0.236746
Yes,172,7.25,5.15,Yes,Sun,Dinner,2,0.710345
Yes,178,9.6,4.0,Yes,Sun,Dinner,2,0.416667
Yes,67,3.07,1.0,Yes,Sat,Dinner,1,0.325733
Yes,183,23.17,6.5,Yes,Sun,Dinner,4,0.280535
Yes,109,14.31,4.0,Yes,Sat,Dinner,2,0.279525


In [45]:
tips.groupby("smoker")[['total_bill', 'tip', 'smoker', 'day', 'time', 'size', 'tip_pct']].apply(top)

Unnamed: 0_level_0,Unnamed: 1_level_0,total_bill,tip,smoker,day,time,size,tip_pct
smoker,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
No,232,11.61,3.39,No,Sat,Dinner,2,0.29199
No,149,7.51,2.0,No,Thur,Lunch,2,0.266312
No,51,10.29,2.6,No,Sun,Dinner,2,0.252672
No,185,20.69,5.0,No,Sun,Dinner,5,0.241663
No,88,24.71,5.85,No,Thur,Lunch,2,0.236746
Yes,172,7.25,5.15,Yes,Sun,Dinner,2,0.710345
Yes,178,9.6,4.0,Yes,Sun,Dinner,2,0.416667
Yes,67,3.07,1.0,Yes,Sat,Dinner,1,0.325733
Yes,183,23.17,6.5,Yes,Sun,Dinner,4,0.280535
Yes,109,14.31,4.0,Yes,Sat,Dinner,2,0.279525


You can pass the arguments of the `top()` function as arguments into the `.apply()` method. These will then be passed into and used with the `top()` function:

In [46]:
tips.groupby("smoker")[['total_bill', 'tip', 'smoker', 'day', 'time', 'size', 'tip_pct']].apply(top, n=1, column="total_bill")

Unnamed: 0_level_0,Unnamed: 1_level_0,total_bill,tip,smoker,day,time,size,tip_pct
smoker,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
No,212,48.33,9.0,No,Sat,Dinner,4,0.18622
Yes,170,50.81,10.0,Yes,Sat,Dinner,3,0.196812


Example of more advanced applications of `.apply()` with `.groupby()`: 

Group Weighted Average and Correlation

In [47]:
df = pd.DataFrame({"category": ["a", "a", "a", "a",
                                "b", "b", "b", "b"],
                   "data": np.random.standard_normal(8),
                   "weights": np.random.uniform(size=8)})
df

Unnamed: 0,category,data,weights
0,a,-0.300344,0.525225
1,a,-1.420755,0.935274
2,a,-0.245391,0.879226
3,a,0.292409,0.489396
4,b,0.970709,0.530736
5,b,1.426995,0.539652
6,b,0.438492,0.099593
7,b,0.110262,0.241187


In [48]:
grouped = df.groupby("category")

In [49]:
def get_wavg(group):
    return np.average(group["data"], weights=group["weights"])

In [50]:
grouped[["data", "weights"]].apply(get_wavg)

category
a   -0.551123
b    0.960577
dtype: float64
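
The heading of this example also mentions correlation. A comparable sketch for a group-wise correlation could look like this, assuming a hypothetical DataFrame `toy` with two numeric columns `x` and `y` and a helper `corr_with_x` that is not part of the notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical data: y depends on x plus some noise
rng = np.random.default_rng(0)
x = rng.standard_normal(20)
toy = pd.DataFrame({"category": ["a"] * 10 + ["b"] * 10,
                    "x": x,
                    "y": 2 * x + rng.standard_normal(20)})

def corr_with_x(group):
    # Pearson correlation between the two columns within one group
    return group["y"].corr(group["x"])

# One correlation coefficient per category
corrs = toy.groupby("category")[["x", "y"]].apply(corr_with_x)
```

As with `get_wavg`, the function returns a scalar for each group, so the combined result is a Series indexed by the group labels.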

### 10.4 Group Transforms and "Unwrapped" GroupBys

Use the `.transform()` method if you want to 'unwrap' the results of your aggregation and add those back as a column to your DataFrame. 

There are some constraints in using `.transform()`:

- It can produce a scalar value to be broadcast to the shape of the group.

- It can produce an object of the same shape as the input group.

- It must not mutate its input.

In [51]:
df = pd.DataFrame({'key': ['a', 'b', 'c'] * 4,
                   'value': np.arange(12.)})
df

Unnamed: 0,key,value
0,a,0.0
1,b,1.0
2,c,2.0
3,a,3.0
4,b,4.0
5,c,5.0
6,a,6.0
7,b,7.0
8,c,8.0
9,a,9.0


In [52]:
g = df.groupby('key')['value']
g.mean()

key
a    4.5
b    5.5
c    6.5
Name: value, dtype: float64

In [53]:
def get_mean(group):
    return group.mean()

In [54]:
g.transform(get_mean)

0     4.5
1     5.5
2     6.5
3     4.5
4     5.5
5     6.5
6     4.5
7     5.5
8     6.5
9     4.5
10    5.5
11    6.5
Name: value, dtype: float64

In [55]:
g.transform("mean")

0     4.5
1     5.5
2     6.5
3     4.5
4     5.5
5     6.5
6     4.5
7     5.5
8     6.5
9     4.5
10    5.5
11    6.5
Name: value, dtype: float64

In [56]:
df["value_mean"] = df.groupby('key')['value'].transform("mean")
df

Unnamed: 0,key,value,value_mean
0,a,0.0,4.5
1,b,1.0,5.5
2,c,2.0,6.5
3,a,3.0,4.5
4,b,4.0,5.5
5,c,5.0,6.5
6,a,6.0,4.5
7,b,7.0,5.5
8,c,8.0,6.5
9,a,9.0,4.5


In [57]:
def normalize(x):
    return (x - x.mean()) / x.std()

In [58]:
df["value_normalized"] = df.groupby('key')['value'].transform(normalize)
df

Unnamed: 0,key,value,value_mean,value_normalized
0,a,0.0,4.5,-1.161895
1,b,1.0,5.5,-1.161895
2,c,2.0,6.5,-1.161895
3,a,3.0,4.5,-0.387298
4,b,4.0,5.5,-0.387298
5,c,5.0,6.5,-0.387298
6,a,6.0,4.5,0.387298
7,b,7.0,5.5,0.387298
8,c,8.0,6.5,0.387298
9,a,9.0,4.5,1.161895


Fast unwrapped group operations:

In [59]:
normalized = (df['value'] - g.transform('mean')) / g.transform('std')
normalized 

0    -1.161895
1    -1.161895
2    -1.161895
3    -0.387298
4    -0.387298
5    -0.387298
6     0.387298
7     0.387298
8     0.387298
9     1.161895
10    1.161895
11    1.161895
Name: value, dtype: float64
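
As a quick check, the function-based `.transform(normalize)` and the "unwrapped" arithmetic on built-in transforms should give the same values. The sketch below rebuilds the same kind of DataFrame under the hypothetical name `toy` so that it runs on its own.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"key": ["a", "b", "c"] * 4,
                    "value": np.arange(12.0)})
g = toy.groupby("key")["value"]

def normalize(x):
    return (x - x.mean()) / x.std()

# Function-based version: normalize is called once per group
slow = g.transform(normalize)

# Unwrapped version: built-in aggregations plus vectorized arithmetic
fast = (toy["value"] - g.transform("mean")) / g.transform("std")

same = np.allclose(slow, fast)
```

The unwrapped version runs several group aggregations, but each of them uses an optimized built-in method, which is usually faster than calling a Python function per group.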

### 10.5 Pivot Tables and Cross-Tabulation

A pivot table is a data summarization tool:
 
pandas function `pd.pivot_table()`:

- Aggregate a table of data by one or more keys. By default, `pd.pivot_table()` calculates the mean of each group.

- Arrange the data in a rectangle with some of the group keys along the rows and some along the columns. 


In [89]:
# tips = pd.read_csv("/Users/markjack/GSU_Fall2024/IFI8410/DataScienceProgramming/07-Text-Processing/tips.csv")
tips = pd.read_csv("tips.csv")
tips.head()

Unnamed: 0,total_bill,tip,smoker,day,time,size
0,16.99,1.01,No,Sun,Dinner,2
1,10.34,1.66,No,Sun,Dinner,3
2,21.01,3.5,No,Sun,Dinner,3
3,23.68,3.31,No,Sun,Dinner,2
4,24.59,3.61,No,Sun,Dinner,4


In [90]:
tips.pivot_table(index=["day", "smoker"],
                 values=["size", "tip", "total_bill"])

Unnamed: 0_level_0,Unnamed: 1_level_0,size,tip,total_bill
day,smoker,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Fri,No,2.25,2.8125,18.42
Fri,Yes,2.066667,2.714,16.813333
Sat,No,2.555556,3.102889,19.661778
Sat,Yes,2.47619,2.875476,21.276667
Sun,No,2.929825,3.167895,20.506667
Sun,Yes,2.578947,3.516842,24.12
Thur,No,2.488889,2.673778,17.113111
Thur,Yes,2.352941,3.03,19.190588


This is equivalent to:

In [91]:
tips.groupby(["day", "smoker"])[["size", "tip", "total_bill"]].mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,size,tip,total_bill
day,smoker,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Fri,No,2.25,2.8125,18.42
Fri,Yes,2.066667,2.714,16.813333
Sat,No,2.555556,3.102889,19.661778
Sat,Yes,2.47619,2.875476,21.276667
Sun,No,2.929825,3.167895,20.506667
Sun,Yes,2.578947,3.516842,24.12
Thur,No,2.488889,2.673778,17.113111
Thur,Yes,2.352941,3.03,19.190588


You can define other aggregation functions by adding another argument to `pd.pivot_table()`:

In [92]:
tips.pivot_table(index=["day", "smoker"],
                 values=["size", "tip", "total_bill"], aggfunc="count")

Unnamed: 0_level_0,Unnamed: 1_level_0,size,tip,total_bill
day,smoker,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Fri,No,4,4,4
Fri,Yes,15,15,15
Sat,No,45,45,45
Sat,Yes,42,42,42
Sun,No,57,57,57
Sun,Yes,19,19,19
Thur,No,45,45,45
Thur,Yes,17,17,17


In [93]:
tips.pivot_table(index=["time", "smoker", "day"],
                 values="tip", aggfunc=len)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,tip
time,smoker,day,Unnamed: 3_level_1
Dinner,No,Fri,3
Dinner,No,Sat,45
Dinner,No,Sun,57
Dinner,No,Thur,1
Dinner,Yes,Fri,9
Dinner,Yes,Sat,42
Dinner,Yes,Sun,19
Lunch,No,Fri,1
Lunch,No,Thur,44
Lunch,Yes,Fri,6


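With *aggfunc=len* every row of a group is counted, including rows with missing values (unlike *aggfunc="count"*). Combinations of row and column keys that do not occur in the data show up as NaN in the table. You can fill them with *fill_value*: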

In [94]:
tips.pivot_table(index=["time", "smoker"], columns="day",
                 values="tip", aggfunc=len)

Unnamed: 0_level_0,day,Fri,Sat,Sun,Thur
time,smoker,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Dinner,No,3.0,45.0,57.0,1.0
Dinner,Yes,9.0,42.0,19.0,
Lunch,No,1.0,,,44.0
Lunch,Yes,6.0,,,17.0


In [65]:
tips.pivot_table(index=["time", "smoker"], columns="day",
                 values="tip", aggfunc=len, fill_value=0)

Unnamed: 0_level_0,day,Fri,Sat,Sun,Thur
time,smoker,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Dinner,No,3,45,57,1
Dinner,Yes,9,42,19,0
Lunch,No,1,0,0,44
Lunch,Yes,6,0,0,17


You can include the result of the aggregations across all the row labels and all the column labels with additional argument **margins=True**:

In [66]:
tips.pivot_table(index=["time", "day"], 
                 values=["size"])

Unnamed: 0_level_0,Unnamed: 1_level_0,size
time,day,Unnamed: 2_level_1
Dinner,Fri,2.166667
Dinner,Sat,2.517241
Dinner,Sun,2.842105
Dinner,Thur,2.0
Lunch,Fri,2.0
Lunch,Thur,2.459016


In [67]:
tips.pivot_table(index=["time", "day"],
                 values=["size"], margins=True)

Unnamed: 0_level_0,Unnamed: 1_level_0,size
time,day,Unnamed: 2_level_1
Dinner,Fri,2.166667
Dinner,Sat,2.517241
Dinner,Sun,2.842105
Dinner,Thur,2.0
Lunch,Fri,2.0
Lunch,Thur,2.459016
All,,2.569672


Note: If you want to further group the column(s) **size** with other columns you include the additional argument **columns=...**

In [68]:
tips.pivot_table(index=["time", "day"], columns="smoker",
                 values=["size"])

Unnamed: 0_level_0,Unnamed: 1_level_0,size,size
Unnamed: 0_level_1,smoker,No,Yes
time,day,Unnamed: 2_level_2,Unnamed: 3_level_2
Dinner,Fri,2.0,2.222222
Dinner,Sat,2.555556,2.47619
Dinner,Sun,2.929825,2.578947
Dinner,Thur,2.0,
Lunch,Fri,3.0,1.833333
Lunch,Thur,2.5,2.352941


In [69]:
tips.pivot_table(index=["time", "day"], columns="smoker",
                 values=["size"], margins=True)

Unnamed: 0_level_0,Unnamed: 1_level_0,size,size,size
Unnamed: 0_level_1,smoker,No,Yes,All
time,day,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
Dinner,Fri,2.0,2.222222,2.166667
Dinner,Sat,2.555556,2.47619,2.517241
Dinner,Sun,2.929825,2.578947,2.842105
Dinner,Thur,2.0,,2.0
Lunch,Fri,3.0,1.833333,2.0
Lunch,Thur,2.5,2.352941,2.459016
All,,2.668874,2.408602,2.569672


Cross-Tabulations: Crosstab

Pivot a table and count frequencies of groups (i.e. number of occurrences of unique value combinations of groups) with a pandas function `pd.crosstab()`.

In [70]:
import pandas as pd

from io import StringIO

data = """Sample  Nationality  Handedness
    1   USA  Right-handed
    2   Japan    Left-handed
    3   USA  Right-handed
    4   Japan    Right-handed
    5   Japan    Left-handed
    6   Japan    Right-handed
    7   USA  Right-handed
    8   USA  Left-handed
    9   Japan    Right-handed
    10  USA  Right-handed"""

data = pd.read_table(StringIO(data), sep=r"\s+")  # raw string avoids an invalid-escape warning
data

Unnamed: 0,Sample,Nationality,Handedness
0,1,USA,Right-handed
1,2,Japan,Left-handed
2,3,USA,Right-handed
3,4,Japan,Right-handed
4,5,Japan,Left-handed
5,6,Japan,Right-handed
6,7,USA,Right-handed
7,8,USA,Left-handed
8,9,Japan,Right-handed
9,10,USA,Right-handed


In [71]:
pd.crosstab(data["Nationality"], data["Handedness"], margins=True)

Handedness,Left-handed,Right-handed,All
Nationality,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Japan,2,3,5
USA,1,4,5
All,3,7,10
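
A frequency table like this can also be built with `.groupby()`, which may help to see what `pd.crosstab()` does. The sketch below uses a small hypothetical DataFrame `toy` instead of the data above and leaves out the margins.

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame({"Nationality": ["USA", "Japan", "USA", "Japan", "Japan"],
                    "Handedness": ["Right", "Left", "Right", "Right", "Left"]})

# Frequency table with crosstab
ct = pd.crosstab(toy["Nationality"], toy["Handedness"])

# Same counts: group sizes per pair of values, then move one key to the columns
by_groupby = (toy.groupby(["Nationality", "Handedness"])
                 .size()
                 .unstack(fill_value=0))
```

Pairs that never occur (here USA and Left) get a count of 0 in both tables; in the groupby version this comes from `fill_value=0`.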
