In [2]:
import numpy as np
import pandas as pd

In [3]:
df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                   'key2' : ['one', 'two', 'one', 'two', 'one'],
                   'data1' : np.random.randn(5),
                   'data2' : np.random.randn(5)})
df

  key1 key2     data1     data2
0    a  one  0.936615 -0.839326
1    a  two -0.190790  0.195303
2    b  one  0.202993  2.092143
3    b  two  0.859372  1.066229
4    a  one -0.291592 -0.135164


Suppose you wanted to compute the mean of the data1 column using the labels from
key1. There are a number of ways to do this. One is to access data1 and call groupby
with the column (a Series) at key1:

In [4]:
grouped = df['data1'].groupby(df['key1'])
grouped

<pandas.core.groupby.generic.SeriesGroupBy object at 0x00000219C3AE43D0>

This grouped variable is now a GroupBy object. It has not actually computed anything
yet except for some intermediate data about the group key df['key1']. The idea is
that this object has all of the information needed to then apply some operation to
each of the groups. For example, to compute group means we can call the GroupBy’s
mean method:

In [5]:
grouped.mean()

key1
a    0.151411
b    0.531183
Name: data1, dtype: float64

The important thing here is that the data (a Series) has been aggregated according to
the group key, producing a new Series indexed by the unique values in the key1 column.
The result index has the name 'key1' because the DataFrame column df['key1'] has that
name.
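
To make the aggregation step concrete, here is a minimal sketch that computes the same group means twice: once with groupby and once by hand with a boolean mask per key. The frame `df_demo` and its values are hypothetical, chosen so the means are easy to check.

```python
import pandas as pd

# Hypothetical frame with fixed values (not the random data used above)
df_demo = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                        'data1': [1.0, 2.0, 3.0, 5.0, 6.0]})

# Split-apply-combine in a single call
grouped_means = df_demo['data1'].groupby(df_demo['key1']).mean()

# The same numbers assembled by hand: one boolean mask per unique key
manual_means = {key: df_demo.loc[df_demo['key1'] == key, 'data1'].mean()
                for key in df_demo['key1'].unique()}

print(grouped_means)
print(manual_means)
```

The groupby version returns a Series whose index is named key1, while the manual version is a plain dict; the values agree.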

In [6]:
# If instead we had passed multiple arrays as a list, we’d get something different:
means = df['data1'].groupby([df['key1'], df['key2']]).mean()
means

key1  key2
a     one     0.322511
      two    -0.190790
b     one     0.202993
      two     0.859372
Name: data1, dtype: float64

In [7]:
# Here we grouped the data using two keys, and the resulting Series now has a
# hierarchical index consisting of the unique pairs of keys observed.
# unstack() moves the inner level (key2) into the columns:
means.unstack()

key2       one       two
key1                    
a     0.322511 -0.190790
b     0.202993  0.859372
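
As a small illustration of the hierarchical index, the sketch below groups by two keys on a hypothetical frame with fixed values, then looks up one entry in both the long (Series) and wide (unstacked) forms.

```python
import pandas as pd

# Hypothetical frame with fixed values
df_demo = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                        'key2': ['one', 'two', 'one', 'two', 'one'],
                        'data1': [1.0, 2.0, 3.0, 5.0, 7.0]})

means_demo = df_demo['data1'].groupby([df_demo['key1'], df_demo['key2']]).mean()

# Long form: index entries are (key1, key2) pairs
a_one_long = means_demo.loc[('a', 'one')]

# Wide form: key1 labels the rows, key2 labels the columns
wide = means_demo.unstack()
a_one_wide = wide.loc['a', 'one']

print(wide)
```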


The group keys do not have to be Series; they can be any arrays of the right length:

In [8]:
states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])
years = np.array([2005, 2005, 2006, 2005, 2006])
df['data1'].groupby([states, years]).mean()

California  2005   -0.190790
            2006    0.202993
Ohio        2005    0.897994
            2006   -0.291592
Name: data1, dtype: float64

Frequently the grouping information is found in the same DataFrame as the data you
want to work on. In that case, you can pass column names (whether those are strings,
numbers, or other Python objects) as the group keys:

In [9]:
# In recent pandas versions (2.0 and later), mean() no longer silently drops
# non-numeric columns. Calling df.groupby('key1').mean() on this frame raises
#     TypeError: Could not convert onetwoone to numeric
# because key2 holds strings. Passing numeric_only=True restores the behavior
# described in the book:
df.groupby('key1').mean(numeric_only=True)

# The book, written for an older pandas, says:
# You may have noticed in the first case df.groupby('key1').mean() that there is no
# key2 column in the result. Because df['key2'] is not numeric data, it is said to be a
# nuisance column, which is therefore excluded from the result. By default, all of the
# numeric columns are aggregated, though it is possible to filter down to a subset, as
# you’ll see soon.
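
The TypeError mentioned above can be avoided in at least two ways. The sketch below, on a hypothetical frame with fixed values, shows both: asking mean() to skip non-numeric columns, and selecting the numeric columns explicitly before aggregating.

```python
import pandas as pd

# Hypothetical frame with fixed values; key2 is a string column
df_demo = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                        'key2': ['one', 'two', 'one', 'two', 'one'],
                        'data1': [1.0, 2.0, 3.0, 5.0, 6.0],
                        'data2': [10.0, 20.0, 30.0, 50.0, 60.0]})

# Option 1: let mean() ignore non-numeric columns
means_numeric = df_demo.groupby('key1').mean(numeric_only=True)

# Option 2: name the columns to aggregate
means_selected = df_demo.groupby('key1')[['data1', 'data2']].mean()

print(means_numeric)
```

Option 2 is more explicit about what is being aggregated, which can help when a frame gains new columns later.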

In [10]:
df.groupby(['key1', 'key2']).mean()
# Take note that any missing values in a group key will be excluded from the result.

              data1     data2
key1 key2                    
a    one   0.322511 -0.487245
     two  -0.190790  0.195303
b    one   0.202993  2.092143
     two   0.859372  1.066229
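
The note about missing values in a group key can be checked with a short sketch. The frame `df_na` is hypothetical. The `dropna=False` argument of groupby (available since pandas 1.1) keeps rows whose key is missing as their own group.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: the second row has a missing group key
df_na = pd.DataFrame({'key': ['a', np.nan, 'a', 'b'],
                      'value': [1.0, 2.0, 3.0, 4.0]})

# Default: the row with the missing key is left out of every group
default_sums = df_na.groupby('key')['value'].sum()

# dropna=False: the missing key forms a group of its own
with_na_sums = df_na.groupby('key', dropna=False)['value'].sum()

print(default_sums)
print(with_na_sums)
```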


### Iterating Over Groups

The GroupBy object supports iteration, generating a sequence of 2-tuples containing
the group name along with the chunk of data. Consider the following:

In [13]:
for name, group in df.groupby('key1'):
    print("__"+name+"__")
    print(group)

__a__
  key1 key2     data1     data2
0    a  one  0.936615 -0.839326
1    a  two -0.190790  0.195303
4    a  one -0.291592 -0.135164
__b__
  key1 key2     data1     data2
2    b  one  0.202993  2.092143
3    b  two  0.859372  1.066229


In the case of multiple keys, the first element in the tuple will be a tuple of key values:

In [15]:
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print("__",(k1, k2),"__")
    print(group)

__ ('a', 'one') __
  key1 key2     data1     data2
0    a  one  0.936615 -0.839326
4    a  one -0.291592 -0.135164
__ ('a', 'two') __
  key1 key2    data1     data2
1    a  two -0.19079  0.195303
__ ('b', 'one') __
  key1 key2     data1     data2
2    b  one  0.202993  2.092143
__ ('b', 'two') __
  key1 key2     data1     data2
3    b  two  0.859372  1.066229
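
Iteration is handy when you want to collect something per group yourself. As a minimal sketch on a hypothetical frame, the loop below records the number of rows for each (key1, key2) pair and compares the result with the built-in size method.

```python
import pandas as pd

# Hypothetical frame with fixed values
df_demo = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                        'key2': ['one', 'two', 'one', 'two', 'one'],
                        'data1': [1.0, 2.0, 3.0, 5.0, 6.0]})

# Collect the row count of each group, keyed by the (key1, key2) tuple
group_sizes = {}
for (k1, k2), group in df_demo.groupby(['key1', 'key2']):
    group_sizes[(k1, k2)] = len(group)

# The same counts from the GroupBy object directly
builtin_sizes = df_demo.groupby(['key1', 'key2']).size().to_dict()

print(group_sizes)
```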


Of course, you can choose to do whatever you want with the pieces of data. A recipe
you may find useful is computing a dict of the data pieces as a one-liner:

In [19]:
pieces = dict(list(df.groupby('key1')))
pieces['b']

  key1 key2     data1     data2
2    b  one  0.202993  2.092143
3    b  two  0.859372  1.066229
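
If only one piece is needed, building the whole dict is not required. The sketch below, on a hypothetical frame, compares the dict recipe with the GroupBy method get_group, which returns the rows for a single key.

```python
import pandas as pd

# Hypothetical frame with fixed values
df_demo = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                        'data1': [1.0, 2.0, 3.0, 5.0, 6.0]})

# All pieces at once, as in the recipe above
pieces = dict(list(df_demo.groupby('key1')))

# A single piece, without materializing the others
b_direct = df_demo.groupby('key1').get_group('b')

print(b_direct)
```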


By default groupby groups on axis=0, but you can group on any of the other axes.
For example, we could group the columns of our example df here by dtype like so:

In [20]:
df.dtypes

key1      object
key2      object
data1    float64
data2    float64
dtype: object

In [21]:
grouped = df.groupby(df.dtypes, axis=1)

We can print out the groups like so:

In [22]:
for dtype, group in grouped:
    print(dtype)
    print(group)

float64
      data1     data2
0  0.936615 -0.839326
1 -0.190790  0.195303
2  0.202993  2.092143
3  0.859372  1.066229
4 -0.291592 -0.135164
object
  key1 key2
0    a  one
1    a  two
2    b  one
3    b  two
4    a  one
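
Recent pandas releases (2.1 and later, as far as I know) warn that axis=1 in groupby is deprecated, so the cell above may not run unchanged on newer versions. One possible workaround, sketched here on a hypothetical frame, is to group the column names by dtype and then select each set of columns. The dtypes are converted to strings first so the group keys are plain text.

```python
import pandas as pd

# Hypothetical frame with fixed values
df_demo = pd.DataFrame({'key1': ['a', 'a', 'b'],
                        'key2': ['one', 'two', 'one'],
                        'data1': [1.0, 2.0, 3.0],
                        'data2': [10.0, 20.0, 30.0]})

# Series mapping column name -> dtype name, e.g. 'data1' -> 'float64'
dtype_names = df_demo.dtypes.astype(str)

# Group the column names by dtype name, then slice the frame per group
pieces_by_dtype = {name: df_demo[cols.index.tolist()]
                   for name, cols in dtype_names.groupby(dtype_names)}

for name, piece in pieces_by_dtype.items():
    print(name)
    print(piece)
```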


### Selecting a Column or Subset of Columns

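Indexing a GroupBy object created from a DataFrame with a column name or a list of
column names selects those columns for aggregation. This means that:

In [24]: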
df.groupby('key1')['data1']
df.groupby('key1')[['data2']]

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000219C45DF290>

are syntactic sugar for:

In [25]:
df['data1'].groupby(df['key1'])
df[['data2']].groupby(df['key1'])

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000219C80F2810>

Especially for large datasets, it may be desirable to aggregate only a few columns. For
example, in the preceding dataset, to compute means for just the data2 column and
get the result as a DataFrame, we could write:

In [26]:
df.groupby(['key1', 'key2'])[['data2']].mean()

              data2
key1 key2          
a    one  -0.487245
     two   0.195303
b    one   2.092143
     two   1.066229


The object returned by this indexing operation is a grouped DataFrame if a list or
array is passed or a grouped Series if only a single column name is passed as a scalar:

In [27]:
s_grouped = df.groupby(['key1', 'key2'])['data2']
s_grouped

<pandas.core.groupby.generic.SeriesGroupBy object at 0x00000219C80B9A10>

In [28]:
s_grouped.mean()

key1  key2
a     one    -0.487245
      two     0.195303
b     one     2.092143
      two     1.066229
Name: data2, dtype: float64
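
The difference between indexing with a scalar and with a list shows up in the type of the aggregated result. A minimal sketch on a hypothetical frame:

```python
import pandas as pd

# Hypothetical frame with fixed values
df_demo = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                        'data2': [10.0, 20.0, 30.0, 50.0, 60.0]})

# Scalar column name -> grouped Series -> Series result
series_result = df_demo.groupby('key1')['data2'].mean()

# List of column names -> grouped DataFrame -> DataFrame result
frame_result = df_demo.groupby('key1')[['data2']].mean()

print(type(series_result).__name__)
print(type(frame_result).__name__)
```

The values are the same in both cases; only the container differs.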

### Grouping with Dicts and Series

Grouping information may exist in a form other than an array. Consider another
example DataFrame:

In [30]:
people = pd.DataFrame(np.random.randn(5, 5),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
people.iloc[2:3, [1, 2]] = np.nan

people

               a         b         c         d         e
Joe     0.408914 -0.854035  0.312512 -0.662532  1.393661
Steve   0.489954  0.438924  0.048247  0.320244 -1.810197
Wes    -0.586837       NaN       NaN -1.379964 -0.305980
Jim    -0.363691  0.456427 -1.707374 -0.949644 -0.026215
Travis -0.106621 -0.224403  0.521800 -1.847634 -1.374127


Now, suppose I have a group correspondence for the columns and want to sum
together the columns by group:

In [31]:
mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
           'd': 'blue', 'e': 'red', 'f' : 'orange'}

Now, you could construct an array from this dict to pass to groupby, but instead we
can just pass the dict (I included the key 'f' to highlight that unused grouping keys
are OK):

In [33]:
by_column = people.groupby(mapping, axis=1)
by_column.sum()

            blue       red
Joe    -0.350020  0.948540
Steve   0.368491 -0.881318
Wes    -1.379964 -0.892818
Jim    -2.657018  0.066521
Travis -1.325834 -1.705151
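
Because axis=1 in groupby is, to my understanding, deprecated in newer pandas releases, it may be useful to know an equivalent that groups rows instead. The sketch below transposes a hypothetical all-numeric frame, groups its index with the dict, and transposes back. The unused key 'f' is ignored, as described above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with fixed values: Joe = 0..4, Steve = 5..9
people_demo = pd.DataFrame(np.arange(10.0).reshape(2, 5),
                           columns=['a', 'b', 'c', 'd', 'e'],
                           index=['Joe', 'Steve'])

mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
           'd': 'blue', 'e': 'red', 'f': 'orange'}

# Transpose so the column labels become the index, group them with the
# dict, sum within each color, then transpose back
by_color = people_demo.T.groupby(mapping).sum().T

print(by_color)
```

Transposing works cleanly here because every column has the same dtype; with mixed dtypes the transposed frame would fall back to object columns.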


The same functionality holds for Series, which can be viewed as a fixed-size mapping:

In [34]:
map_series = pd.Series(mapping)
map_series

a       red
b       red
c      blue
d      blue
e       red
f    orange
dtype: object

In [35]:
people.groupby(map_series, axis=1).count()

        blue  red
Joe        2    3
Steve      2    3
Wes        1    2
Jim        2    3
Travis     2    3


### Grouping with Functions

Using Python functions is a more generic way of defining a group mapping compared
with a dict or Series. Any function passed as a group key will be called once per index
value, with the return values being used as the group names. More concretely,
consider the example DataFrame from the previous section, which has people’s first
names as index values. Suppose you wanted to group by the length of the names;
while you could compute an array of string lengths, it’s simpler to just pass the len
function:

In [36]:
people.groupby(len).sum()

          a         b         c         d         e
3 -0.541615 -0.397608 -1.394862 -2.992139  1.061466
5  0.489954  0.438924  0.048247  0.320244 -1.810197
6 -0.106621 -0.224403  0.521800 -1.847634 -1.374127
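
To show that passing a function is equivalent to passing the array of its return values, the sketch below groups a hypothetical frame both ways: with len directly, and with an explicit array of name lengths.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with fixed values: row i holds 5*i .. 5*i + 4
people_demo = pd.DataFrame(np.arange(25.0).reshape(5, 5),
                           columns=['a', 'b', 'c', 'd', 'e'],
                           index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])

# groupby calls len once per index value
by_len = people_demo.groupby(len).sum()

# The same grouping with the lengths computed up front
name_lengths = np.array([len(name) for name in people_demo.index])
by_array = people_demo.groupby(name_lengths).sum()

print(by_len)
```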


Mixing functions with arrays, dicts, or Series is not a problem as everything gets
converted to arrays internally:

In [37]:
key_list = ['one', 'one', 'one', 'two', 'two']
people.groupby([len, key_list]).min()

              a         b         c         d         e
3 one -0.586837 -0.854035  0.312512 -1.379964 -0.305980
  two -0.363691  0.456427 -1.707374 -0.949644 -0.026215
5 one  0.489954  0.438924  0.048247  0.320244 -1.810197
6 two -0.106621 -0.224403  0.521800 -1.847634 -1.374127


### Grouping by Index Levels

A final convenience for hierarchically indexed datasets is the ability to aggregate
using one of the levels of an axis index. Let’s look at an example:

In [38]:
columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                     [1, 3, 5, 1, 3]],
                                     names=['cty', 'tenor'])
hier_df = pd.DataFrame(np.random.randn(4, 5), columns=columns)
hier_df

cty          US                            JP
tenor         1         3         5         1         3
0     -0.352683  0.746576  0.327959  0.172212  2.532965
1     -0.984878 -0.627170  1.099452 -0.016977  0.914127
2     -0.933432 -1.027103 -0.061841  0.251250  0.609953
3      0.087532  0.422311 -0.649562 -1.430165  1.407262


To group by level, pass the level number or name using the level keyword:

In [39]:
hier_df.groupby(level='cty', axis=1).count()

cty  JP  US
0     2   3
1     2   3
2     2   3
3     2   3
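
The same count can be obtained without the axis argument, which may matter on newer pandas releases where axis=1 in groupby is (as far as I know) deprecated. This sketch uses a hypothetical frame with fixed values and groups the transposed frame by the 'cty' level.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                     [1, 3, 5, 1, 3]],
                                    names=['cty', 'tenor'])

# Hypothetical frame with fixed values instead of random numbers
hier_demo = pd.DataFrame(np.arange(20.0).reshape(4, 5), columns=columns)

# After transposing, the column MultiIndex becomes the row index, so the
# 'cty' level can be grouped along the default axis
counts = hier_demo.T.groupby(level='cty').count().T

print(counts)
```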
