Readme:


For more on these topics, see 'Python for Data Analysis, 3E' by Wes McKinney, Chapter 10: 'Data Aggregation and Group Operations'.</br>
Link: https://wesmckinney.com/book/data-aggregation

In [1]:
import numpy as np
import pandas as pd

<p>
Given the DataFrame below, sum the values in 'data1', grouping them by 'key1' and 'key2'. </br>
Then call unstack() on the result and examine what it does.
</p>


In [None]:
df = pd.DataFrame({"key1" : ["a", "a", None, "b", "b", "a", None],
                    "key2" : pd.Series([1, 2, 1, 2, 1, None, 1],
                                       dtype="Int64"),
                    "data1" : np.random.standard_normal(7),
                    "data2" : np.random.standard_normal(7)})

# Note: df.groupby("key1")["data1"] is equivalent to df["data1"].groupby(df["key1"])
grouped = df['data1'].groupby([df['key1'], df['key2']]).sum()
print(grouped)
print(grouped.unstack()) # optional task

key1  key2
a     1       0.787710
      2       0.577167
b     1      -0.807945
      2       1.747757
Name: data1, dtype: float64
key2         1         2
key1                    
a     0.787710  0.577167
b    -0.807945  1.747757
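
<p>
The following is a minimal sketch of the same exercise on a small hand-written frame, so the numbers can be checked by eye. The frame 'demo' and its values are illustrative assumptions, not the random data above. It shows that the two groupby spellings give the same result and that unstack() moves the inner index level ('key2') into the columns. </br>
</p>

```python
import pandas as pd

# Small deterministic frame (illustrative values, not the random data above).
demo = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": [1, 2, 1, 2, 1],
    "data1": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Two spellings of the same grouped sum.
by_series = demo["data1"].groupby([demo["key1"], demo["key2"]]).sum()
by_names = demo.groupby(["key1", "key2"])["data1"].sum()

# unstack() pivots the innermost index level (key2) into columns,
# turning the two-level Series into a key1-by-key2 table.
wide = by_names.unstack()
print(wide)
```

In the exercise above, the rows whose key is missing do not appear in the output because groupby drops NA keys by default.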


<p>
Can you sum the values in 'data1' of the DataFrame above, grouping by the arrays below instead of by its own columns? </br>
</p>


In [3]:
states = np.array(["OH", "CA", "CA", "OH", "OH", "CA", "OH"])
years = [2005, 2005, 2006, 2005, 2006, 2005, 2006]

df['data1'].groupby([states, years]).sum()

CA  2005    1.043502
    2006   -1.054570
OH  2005    2.535467
    2006   -0.289361
Name: data1, dtype: float64
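
<p>
As a sketch of how external keys are matched, the example below groups a short hand-written Series by two arrays. The names 'values' and 'totals' and the numbers are assumptions made for illustration. The keys are aligned with the rows by position, so each key array needs one entry per row. </br>
</p>

```python
import numpy as np
import pandas as pd

# Illustrative values chosen so the sums are easy to verify.
values = pd.Series([10.0, 20.0, 30.0, 40.0], name="data1")
states = np.array(["OH", "CA", "CA", "OH"])
years = [2005, 2005, 2006, 2005]

# Each key has len(values) entries; row i belongs to group (states[i], years[i]).
totals = values.groupby([states, years]).sum()
print(totals)
```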

<p>
What if you want to sum every numeric column, grouping by 'key2'? </br>
</p>


In [4]:
df.groupby('key2').sum(numeric_only=True)

         data1     data2
key2                    
1    -0.556221  3.179345
2     2.324924  0.664142
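
<p>
A small sketch of what numeric_only=True does, using a hand-written frame ('demo' and its values are assumptions for illustration). The string column 'key1' is left out of the result, which matches selecting the numeric columns explicitly. </br>
</p>

```python
import pandas as pd

# Illustrative frame with one string column and two numeric columns.
demo = pd.DataFrame({
    "key1": ["a", "b", "a", "b"],
    "key2": [1, 1, 2, 2],
    "data1": [1.0, 2.0, 3.0, 4.0],
    "data2": [10.0, 20.0, 30.0, 40.0],
})

# Sum only the numeric columns within each key2 group.
numeric_sums = demo.groupby("key2").sum(numeric_only=True)

# Same result, written by naming the columns to aggregate.
selected = demo.groupby("key2")[["data1", "data2"]].sum()
print(numeric_sums)
```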


<p>
Group the DataFrame by 'key1', keeping the rows whose key is NA as their own group, and compute both the size and the count of each group. </br>
</p>


In [6]:
print(df.groupby('key1', dropna=False).size())
print(df.groupby('key1', dropna=False).count())

key1
a      3
b      2
NaN    2
dtype: int64
      key2  data1  data2
key1                    
a        2      3      3
b        2      2      2
NaN      2      2      2
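
<p>
The difference between size() and count() is easier to see on a frame that has missing values in a data column. The sketch below uses a hand-written frame ('demo' is an assumption for illustration): size() counts rows per group, while count() counts the non-null values in each column. </br>
</p>

```python
import pandas as pd

# Illustrative frame: one missing key and one missing data value.
demo = pd.DataFrame({
    "key1": ["a", "a", None, "b"],
    "data1": [1.0, None, 3.0, 4.0],
})

grouped = demo.groupby("key1", dropna=False)
sizes = grouped.size()    # rows per group, including the NA-key group
counts = grouped.count()  # non-null values per column in each group

# With the default dropna=True, the NA-key group is omitted.
default_sizes = demo.groupby("key1").size()
print(sizes)
print(counts)
```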


<p>
Grouping with Dictionaries and Series.</br>
Can you sum the values in the 'people' DataFrame, grouping its columns according to the 'mapping' dictionary? </br>
</p>


In [7]:
people = pd.DataFrame(np.random.standard_normal((5, 5)),
                       columns=["a", "b", "c", "d", "e"],
                       index=["Joe", "Steve", "Wanda", "Jill", "Trey"])

mapping = {"a": "red", "b": "red", "c": "blue",
            "d": "blue", "e": "red", "f" : "orange"}

people.groupby(mapping, axis="columns").sum()

           blue       red
Joe   -0.950720  3.461382
Steve -0.909448  0.579547
Wanda  1.501712  1.234363
Jill  -0.108392 -0.690616
Trey   0.721913 -0.021113
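
<p>
Recent pandas releases (2.1 and later) deprecate the axis argument of groupby. A possible alternative, sketched below on a small hand-written frame ('people_demo' and its values are assumptions for illustration), is to transpose, group the rows, and transpose back. Note that the unused key "f" in the mapping is simply ignored. </br>
</p>

```python
import pandas as pd

# Illustrative integer values so the group sums are easy to check.
people_demo = pd.DataFrame(
    [[1, 2, 3, 4, 5], [10, 20, 30, 40, 50]],
    columns=["a", "b", "c", "d", "e"],
    index=["Joe", "Steve"],
)

mapping = {"a": "red", "b": "red", "c": "blue",
           "d": "blue", "e": "red", "f": "orange"}  # "f" matches no column

# Transpose so the column labels become the index, group by the mapping,
# sum within each color, then transpose back to the original orientation.
by_color = people_demo.T.groupby(mapping).sum().T
print(by_color)
```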


<p>
Grouping with Functions.</br>
Any function passed as a group key will be called once per index value (or once per column value if using axis="columns"), with the return values being used as the group names. </br>
More concretely, consider the example DataFrame from the previous section, which has people’s first names as index values. </br>
Suppose you wanted to group by name length. While you could compute an array of string lengths, it's simpler to just pass the 'len' function.</br>
Run the code below and analyze the result. </br>
</p>


In [8]:
people.groupby(len).sum()

          a         b         c         d         e
3  1.269084  2.188731 -0.018958 -0.931763  0.003567
4  0.091710  0.714022  2.119459 -1.505938 -1.517462
5 -0.128651 -0.115196  0.700535 -0.108270  2.057757
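
<p>
To make the "called once per index value" behavior concrete, the sketch below compares passing 'len' directly with building the array of name lengths by hand. The frame 'people_demo' and its values are assumptions for illustration. </br>
</p>

```python
import numpy as np
import pandas as pd

# Illustrative single-column frame indexed by first names.
people_demo = pd.DataFrame(
    {"a": [1, 2, 3, 4, 5]},
    index=["Joe", "Steve", "Wanda", "Jill", "Trey"],
)

# len is applied to each index label; the results become the group keys.
by_len = people_demo.groupby(len).sum()

# The same grouping with the keys computed explicitly.
name_lengths = np.array([len(name) for name in people_demo.index])
by_array = people_demo.groupby(name_lengths).sum()
print(by_len)
```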


<p>
Aggregations refer to any data transformation that produces scalar values from arrays.</br>
To use your own aggregation functions, pass any function that aggregates an array to the aggregate method or its short alias 'agg'.</br>
Run the code below and analyze the result.</br>
Note: Custom aggregation functions are generally much slower than the optimized built-in functions. This is because there is some extra overhead (function calls, data rearrangement) in constructing the intermediate group data chunks. </br>
</p>


In [9]:
grouped = df.groupby("key1")

def peak_to_peak(arr):
    return arr.max() - arr.min()

grouped.agg(peak_to_peak)

      key2     data1     data2
key1                          
a        1  0.321375  0.620617
b        1  2.555702  0.100310
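
<p>
Where a custom aggregation can be expressed with built-in methods, the built-in route is usually the faster choice. The sketch below, on a hand-written frame ('demo' is an assumption for illustration), computes the same peak-to-peak range both ways. </br>
</p>

```python
import pandas as pd

# Illustrative values: group "a" spans 1..4, group "b" spans 10..12.
demo = pd.DataFrame({
    "key1": ["a", "a", "b", "b"],
    "data1": [1.0, 4.0, 10.0, 12.0],
})
grouped = demo.groupby("key1")["data1"]

def peak_to_peak(arr):
    # Range of the values in one group.
    return arr.max() - arr.min()

# Custom function: called once per group.
custom = grouped.agg(peak_to_peak)

# Same quantity from the optimized built-in aggregations.
builtin = grouped.max() - grouped.min()
print(custom)
```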

