In [None]:
import numpy as np
import pandas as pd

- pandas.Series.groupby
   - Series.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=NoDefault.no_default, observed=False, dropna=True)
      - (signature as of pandas 1.x; later releases removed squeeze and deprecated axis)
   - Group Series using a mapper or by a Series of columns.
   - A groupby operation involves some combination of splitting the object, applying a function, and combining the results. 
   - This can be used to group large amounts of data and compute operations on these groups.
- Parameters: by:
   - mapping, function, label, or list of labels
   - Used to determine the groups for the groupby
   - If by is a function, it’s called on each value of the object’s index. 
   - If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups  
      - (the Series’ values are first aligned; see .align() method). 
   - If a list or ndarray of equal length is passed, the values are used as-is to determine the groups.

In [None]:
ser = pd.Series([390., 350., 30., 20.],
                index=['Falcon', 'Falcon', 'Parrot', 'Parrot'], name="Max Speed")
ser

In [None]:
ser.groupby(["a", "b", "a", "b"]).mean()

In [None]:
ser.groupby(level=0).mean()

In [None]:
ser.groupby(ser > 100).mean()
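
The cells above pass a list, an index level, and a boolean Series as by. As a minimal sketch of the two remaining forms described earlier, the cell below groups by a function and by a dict. The names demo_speeds, by_initial, by_family, and label_to_family, as well as the family labels, are made up for illustration.

```python
import pandas as pd

# Hypothetical data mirroring the Max Speed example.
demo_speeds = pd.Series(
    [390.0, 350.0, 30.0, 20.0],
    index=["Falcon", "Falcon", "Parrot", "Parrot"],
    name="Max Speed",
)

# A function is called on each index label; here the first letter becomes the group key.
by_initial = demo_speeds.groupby(lambda label: label[0]).mean()

# A dict maps each index label to a group name (the labels are illustrative only).
label_to_family = {"Falcon": "raptor", "Parrot": "psittacine"}
by_family = demo_speeds.groupby(label_to_family).mean()

print(by_initial)
print(by_family)
```

Both results contain one row per group: 370.0 for the Falcon rows and 25.0 for the Parrot rows.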

Grouping by Indexes
- We can group by different levels of a hierarchical index using the level parameter:

In [None]:
arrays = [['Falcon', 'Falcon', 'Parrot', 'Parrot'],
          ['Captive', 'Wild', 'Captive', 'Wild']]
index = pd.MultiIndex.from_arrays(arrays, names=('Animal', 'Type'))
ser = pd.Series([390., 350., 30., 20.], index=index, name="Max Speed")
ser

In [None]:
ser.groupby(level=0).mean()

In [None]:
ser.groupby(level="Type").mean()

- The dropna parameter controls whether NA values are kept as group keys. The default is dropna=True, which excludes them during the groupby operation.
- To include NA values in the group keys, pass dropna=False.

In [None]:
ser = pd.Series([1, 2, 3, 3], index=["a", 'a', 'b', np.nan])
ser.groupby(level=0).sum()

In [None]:
ser.groupby(level=0, dropna=False).sum()

In [None]:
arrays = ['Falcon', 'Falcon', 'Parrot', 'Parrot']
ser = pd.Series([390., 350., 30., 20.], index=arrays, name="Max Speed")
ser.groupby(["a", "b", "a", np.nan]).mean()

In [None]:
ser.groupby(["a", "b", "a", np.nan], dropna=False).mean()
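
The same dropna behaviour applies when the keys come from a DataFrame column. The following is a small sketch under that assumption; demo_frame and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose key column contains one missing value.
demo_frame = pd.DataFrame({"key": ["a", "a", "b", np.nan], "value": [1, 2, 3, 3]})

# Default (dropna=True): the row with the NaN key is left out of every group.
na_dropped = demo_frame.groupby("key")["value"].sum()

# dropna=False: NaN becomes a group key of its own.
na_kept = demo_frame.groupby("key", dropna=False)["value"].sum()

print(na_dropped)
print(na_kept)
```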

In [None]:
df = pd.DataFrame(
    [
        ("bird", "Falconiformes", 389.0),
        ("bird", "Psittaciformes", 24.0),
        ("mammal", "Carnivora", 80.2),
        ("mammal", "Primates", np.nan),
        ("mammal", "Carnivora", 58),
    ],
    index=["falcon", "parrot", "lion", "monkey", "leopard"],
    columns=("class", "order", "max_speed"),
)
df

The mapping can be specified in many different ways:

- A Python function, to be called on each of the axis labels.

- A list or NumPy array of the same length as the selected axis.

- A dict or Series, providing a label -> group name mapping.

- For DataFrame objects, a string indicating either a column name or an index level name to be used to group.

- df.groupby('A') is just syntactic sugar for df.groupby(df['A']).

- A list of any of the above things.

In [None]:
grouped = df.groupby(df["class"])  # groups rows (the default); the axis argument is deprecated in recent pandas, so it is omitted
grouped.count()

In [None]:
grouped = df.groupby("order")
grouped.count()

In [None]:
grouped = df.groupby(["class", "order"])
grouped.count()
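
To illustrate the "syntactic sugar" remark, the sketch below compares grouping by a column name with grouping by the column itself. The frame demo_animals and the result names are hypothetical.

```python
import pandas as pd

# Hypothetical, smaller version of the animal table.
demo_animals = pd.DataFrame(
    {"class": ["bird", "bird", "mammal"], "max_speed": [389.0, 24.0, 80.2]}
)

# Grouping by a column name ...
mean_by_name = demo_animals.groupby("class")["max_speed"].mean()

# ... gives the same result as grouping by the column (a Series) itself.
mean_by_series = demo_animals["max_speed"].groupby(demo_animals["class"]).mean()

print(mean_by_name.equals(mean_by_series))
```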

In [None]:
df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
        "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
        "C": np.random.randn(8),
        "D": np.random.randn(8),
    }
)
df

GroupBy sorting
- By default the group keys are sorted during the groupby operation. 
- You may however pass sort=False for potential speedups:

In [None]:
df2 = df.set_index(["A", "B"]) # create a MultiIndex; set_index returns a new DataFrame
grouped = df2.groupby(level=0, sort=False) # level=0 means 'A', level=1 means 'B', default sort=True
grouped.count()
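
Because the frame above holds random values, the effect of sort is easier to see on fixed data. This is a minimal sketch; demo_sort and the key values are made up.

```python
import pandas as pd

# Hypothetical frame whose keys first appear in the order "b", then "a".
demo_sort = pd.DataFrame({"key": ["b", "a", "b", "a"], "value": [1, 2, 3, 4]})

# sort=True (the default): group keys come back in sorted order.
sorted_keys = list(demo_sort.groupby("key")["value"].sum().index)

# sort=False: group keys keep the order of first appearance.
unsorted_keys = list(demo_sort.groupby("key", sort=False)["value"].sum().index)

print(sorted_keys)
print(unsorted_keys)
```

Only the order of the groups changes; the aggregated values are the same either way.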

In [None]:
grouped.first() # first non-null value of each column within each group

In [None]:
grouped.last() # last non-null value of each column within each group

In [None]:
grouped = df2.groupby(level=df2.index.names)
grouped.count()

In [None]:
# The groups attribute is a dict whose keys are the computed unique groups 
# and corresponding values being the axis labels belonging to each group.
df.groupby("A").groups

In [None]:
# Selecting a group
# A single group can be selected using get_group():
df.groupby('A').get_group('bar')
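
As a rough check of what groups and get_group() return, the sketch below uses a small fixed frame; demo_groups and the other names are hypothetical.

```python
import pandas as pd

# Hypothetical frame with a default integer index (0, 1, 2).
demo_groups = pd.DataFrame({"A": ["foo", "bar", "foo"], "C": [1, 2, 3]})
demo_grouped = demo_groups.groupby("A")

# groups maps each key to the index labels of the rows in that group.
foo_labels = list(demo_grouped.groups["foo"])

# get_group returns the rows of one group, like a boolean filter on the key column.
foo_rows = demo_grouped.get_group("foo")
foo_filtered = demo_groups[demo_groups["A"] == "foo"]

print(foo_labels)
print(foo_rows.equals(foo_filtered))
```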

Grouping DataFrame with Index levels and columns
- A DataFrame may be grouped by a combination of columns and index levels by specifying the column names as strings and the index levels as pd.Grouper objects.
 

In [None]:
arrays = [
    ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
    ["one", "two", "one", "two", "one", "two", "one", "two"],
]
index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])
df = pd.DataFrame({"A": [1, 1, 1, 1, 2, 2, 3, 3], "B": np.arange(8)}, index=index)
df

In [None]:
# Group df by the second index level and the A column.
df.groupby([pd.Grouper(level=1), "A"]).sum()
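
pandas also accepts an index level name directly as a key, next to column names. The sketch below rebuilds the same data under a different name (demo_levels, chosen here so the notebook's df is left alone) and compares the two spellings.

```python
import numpy as np
import pandas as pd

level_values = [
    ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
    ["one", "two", "one", "two", "one", "two", "one", "two"],
]
demo_index = pd.MultiIndex.from_arrays(level_values, names=["first", "second"])
demo_levels = pd.DataFrame(
    {"A": [1, 1, 1, 1, 2, 2, 3, 3], "B": np.arange(8)}, index=demo_index
)

# Index level selected through pd.Grouper, by name instead of position.
via_grouper = demo_levels.groupby([pd.Grouper(level="second"), "A"]).sum()

# Index level name passed directly as a key alongside the column name.
via_name = demo_levels.groupby(["second", "A"]).sum()

print(via_grouper.equals(via_name))
```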

DataFrame column selection in GroupBy
- Once you have created the GroupBy object from a DataFrame, you might want to do something different for each of the columns. 
- To do so, index the GroupBy object with [], just as you would to get a column from a DataFrame:

In [None]:
df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
        "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
        "C": np.random.randn(8),
        "D": np.random.randn(8),
    }
)
df

In [None]:
grouped = df.groupby("A")  # a string key (not a one-element list) lets get_group take a plain label
grouped_C = grouped["C"]
# Equivalent to df["C"].groupby(df["A"]),
# i.e. first take column "C" as a Series, then group it by the matching values of column "A".
grouped_C.get_group('foo')

In [None]:
grouped_D = grouped["D"]
grouped_D.get_group('foo')
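
Selecting columns this way lets one GroupBy object feed a different computation per column. A minimal sketch with fixed numbers follows; demo_cols and the result names are hypothetical.

```python
import pandas as pd

demo_cols = pd.DataFrame(
    {"A": ["foo", "bar", "foo"], "C": [1.0, 2.0, 3.0], "D": [10.0, 20.0, 30.0]}
)
demo_by_a = demo_cols.groupby("A")

# The grouping is set up once and reused for each selected column.
c_total = demo_by_a["C"].sum()   # total of C per group
d_average = demo_by_a["D"].mean()  # average of D per group

print(c_total)
print(d_average)
```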

Iterating through groups
- With the GroupBy object in hand, iterating through the grouped data is very natural and functions similarly to itertools.groupby().
- In the case of grouping by multiple keys, the group name will be a tuple.

In [None]:
grouped = df.groupby('A', sort=False)

for name, group in grouped:
    print(name)
    print(group)


In [None]:
for name, group in df.groupby(['A', 'B']):
    print(name)
    print(group)
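
A common use of iteration is to collect something per group into a dict. The sketch below does this with two keys, so every group name is a tuple; demo_iter and group_sizes are hypothetical names.

```python
import pandas as pd

demo_iter = pd.DataFrame(
    {"A": ["foo", "bar", "foo"], "B": ["one", "one", "two"], "C": [1, 2, 3]}
)

# With two grouping keys, each name is an (A, B) tuple and each group is a sub-DataFrame.
group_sizes = {name: len(group) for name, group in demo_iter.groupby(["A", "B"])}

print(group_sizes)
```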

Aggregation
- Once the GroupBy object has been created, several methods are available to perform a computation on the grouped data. These operations are similar to the aggregating API, window API, and resample API.
- Aggregation:  the aggregate() or equivalently agg() method
   - The result of the aggregation will have the group names as the new index along the grouped axis. 
   - In the case of multiple keys, the result is a MultiIndex by default; pass as_index=False to keep the group keys as ordinary columns instead.
      - Alternatively, call reset_index() on the result to get the same output, since the key names are stored as the level names of the resulting MultiIndex:



In [None]:
grouped = df.groupby("A")
grouped.aggregate("sum")  # string alias; recent pandas versions warn when np.sum is passed

In [None]:
grouped = df.groupby(["A", "B"])
grouped.aggregate("sum") # the result of the aggregation will have the group names as the new index along the grouped axis.

In [None]:
grouped = df.groupby(["A", "B"], as_index=False)
grouped.aggregate("sum")

In [None]:
df.groupby("A", as_index=False).sum()

In [None]:
df.groupby(["A", "B"]).aggregate("sum").reset_index()
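
On fixed data the two routes to a flat result can be compared directly. This is a sketch only; demo_flat and the result names are hypothetical.

```python
import pandas as pd

demo_flat = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar"],
        "B": ["one", "one", "one", "two"],
        "C": [1.0, 2.0, 3.0, 4.0],
    }
)

# Group keys kept as ordinary columns from the start.
flat_direct = demo_flat.groupby(["A", "B"], as_index=False).sum()

# Group keys first stored in a MultiIndex, then moved back into columns.
flat_via_reset = demo_flat.groupby(["A", "B"]).sum().reset_index()

print(flat_direct.equals(flat_via_reset))
```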

Another simple aggregation example is to compute the size of each group. 
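- This is included in GroupBy as the size method. It returns a Series whose index is the group names and whose values are the sizes of each group; when the GroupBy object was created with as_index=False, as grouped was above, it returns a DataFrame with a size column instead.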

In [None]:
grouped.size()

In [None]:
grouped.describe()

- The aggregating functions exclude NA values. 
- Any function that reduces a Series to a scalar value is an aggregation function and will work.
   - A trivial example is df.groupby('A').agg(lambda ser: 1). 

In [None]:
grouped['C'].agg(lambda ser : ser.size)
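
The NA handling mentioned above is where size and count differ. The sketch below shows this on a small hypothetical frame, demo_na: size counts every row in a group, while count and mean skip missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: group "a" has one real value and one missing value.
demo_na = pd.DataFrame({"key": ["a", "a", "b"], "value": [1.0, np.nan, 3.0]})
demo_values = demo_na.groupby("key")["value"]

row_counts = demo_values.size()     # rows per group, NA included
valid_counts = demo_values.count()  # non-NA values per group
group_means = demo_values.mean()    # NA values are excluded from the mean

print(row_counts)
print(valid_counts)
print(group_means)
```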

Applying multiple functions at once
- With a grouped Series you can also pass a list of functions to aggregate with; the output is a DataFrame with one column per function.


In [None]:
grouped = df.groupby("A")
grouped["C"].agg(["sum", "mean", "std"]) # string aliases are preferred over np.sum, np.mean, np.std in recent pandas

In [None]:
grouped[["C", "D"]].agg(['sum','mean','std'])

In [None]:
grouped["C"].agg(["sum", "mean", "std"]).rename(columns={"sum": "foo", "mean": "bar", "std": "baz"})

In [None]:
animals = pd.DataFrame(
    {
        "kind": ["cat", "dog", "cat", "dog"],
        "height": [9.1, 6.0, 9.5, 34.0],
        "weight": [7.9, 7.5, 9.9, 198.0],
    }
)
animals

Named aggregation
- To support column-specific aggregation with control over the output column names, pandas accepts the special syntax in GroupBy.agg(), known as “named aggregation”, where
   - The keywords are the output column names
   - The values are tuples whose first element is the column to select and the second element is the aggregation to apply to that column. 
   - pandas provides the pandas.NamedAgg namedtuple with the fields ['column', 'aggfunc'] to make it clearer what the arguments are. 
   - As usual, the aggregation can be a callable or a string alias.

In [None]:
animals.groupby("kind").agg(
    min_height=pd.NamedAgg(column="height", aggfunc="min"),
    max_height=pd.NamedAgg(column="height", aggfunc="max"),
    average_weight=pd.NamedAgg(column="weight", aggfunc="mean"),
)

In [None]:
animals.groupby("kind").agg(
    min_height=("height","min"),
    max_height=("height","max"),
    average_weight=("weight", "mean"),
)
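
Named aggregation is also available on a grouped Series. In that case no column has to be chosen, so each keyword maps straight to a function or string alias. The sketch below assumes the same animal data, rebuilt as the hypothetical demo_pets.

```python
import pandas as pd

demo_pets = pd.DataFrame(
    {
        "kind": ["cat", "dog", "cat", "dog"],
        "height": [9.1, 6.0, 9.5, 34.0],
        "weight": [7.9, 7.5, 9.9, 198.0],
    }
)

# Keywords become the output column names; values are the aggregations to apply.
height_summary = demo_pets.groupby("kind")["height"].agg(
    min_height="min",
    max_height="max",
)

print(height_summary)
```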

- If your desired output column names are not valid Python identifiers (for example, they contain a space), construct a dictionary and unpack it as keyword arguments.

In [None]:
animals.groupby("kind").agg(
    **{
        # "total weight" contains a space, so it cannot be written as a keyword directly
        "total weight": pd.NamedAgg(column="weight", aggfunc="sum")
    }
)

Applying different functions to DataFrame columns
- By passing a dict to aggregate you can apply a different aggregation to the columns of a DataFrame

In [None]:
grouped.agg({"C": "sum", "D": lambda x: np.std(x, ddof=1)})
# Equivalent with string aliases only: grouped.agg({"C": "sum", "D": "std"})
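
Since C and D above are random, here is the same dict form on fixed numbers. It is a sketch only; demo_zoo repeats the animal data under a hypothetical name.

```python
import pandas as pd

demo_zoo = pd.DataFrame(
    {
        "kind": ["cat", "dog", "cat", "dog"],
        "height": [9.1, 6.0, 9.5, 34.0],
        "weight": [7.9, 7.5, 9.9, 198.0],
    }
)

# Each dict key is a column; each value is the aggregation applied to that column.
per_column = demo_zoo.groupby("kind").agg({"height": "max", "weight": "mean"})

print(per_column)
```

The output keeps the original column names, which is the main difference from named aggregation.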

Aggregations with User-Defined Functions
- Users can also provide their own functions for custom aggregations.
- When aggregating with a User-Defined Function (UDF), the UDF should not mutate the provided Series.

In [None]:
animals.groupby("kind")[["height"]].agg(lambda x: set(x))

In [None]:
animals.groupby("kind")[["height"]].agg(lambda x: x.astype(int).sum())
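
A named function works the same way as a lambda and is easier to reuse. The sketch below defines a hypothetical UDF, height_range, that only reads the Series it receives and returns a scalar; demo_heights is likewise a made-up name.

```python
import pandas as pd

demo_heights = pd.DataFrame(
    {"kind": ["cat", "dog", "cat", "dog"], "height": [9.1, 6.0, 9.5, 34.0]}
)


def height_range(values: pd.Series) -> float:
    """Return max minus min for one group, without modifying the input Series."""
    return values.max() - values.min()


ranges = demo_heights.groupby("kind")["height"].agg(height_range)

print(ranges)
```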