# How to use _DataFrame.groupby()_

[Pandas user guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#dataframe-column-selection-in-groupby)

## Imports

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
import scipy.stats  # import the submodule explicitly so scipy.stats.gmean is available

## Settings

In [2]:
pd.set_option("display.precision", 2)

## Load the data

The dataset consists of measurements from a fictitious vinyl record pressing company.

There's a company with two locations:
* London, UK
* Cracow, Poland

The London plant manufactures vinyl records made of 3 different materials.

The Cracow plant manufactures vinyl records made of 2 different materials.

We've collected a few batches of measurements for various materials.

The measurements are:
* weight of a vinyl record in grams
* diameter of a vinyl record in millimeters

In [3]:
path = Path("datasets/vinyl_records.csv")

In [4]:
df = pd.read_csv(path)

### Inspect the data

In [5]:
df.shape

(40, 6)

**Comment:** The dataset comprises:
- 40 rows
- 6 columns

Let's preview the data. It's not big so it can be displayed in full.

In [6]:
df

,material,plant,batch,sample,weight_g,diameter_mm
0,PVC 1,London,A,1,132.64,302.02
1,PVC 1,London,A,2,131.25,300.83
2,PVC 1,London,A,3,131.49,301.84
3,PVC 1,London,A,4,132.63,301.76
4,PVC 1,London,A,5,132.2,301.35
5,PVC 1,London,B,1,148.14,301.41
6,PVC 1,London,B,2,148.12,301.62
7,PVC 1,London,B,3,148.11,301.5
8,PVC 1,London,B,4,150.33,301.77
9,PVC 1,London,B,5,147.57,301.7


Summary of the dataset.

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40 entries, 0 to 39
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   material     40 non-null     object 
 1   plant        40 non-null     object 
 2   batch        40 non-null     object 
 3   sample       40 non-null     int64  
 4   weight_g     40 non-null     float64
 5   diameter_mm  40 non-null     float64
dtypes: float64(2), int64(1), object(3)
memory usage: 2.0+ KB


**Comment:** *material*, *plant* and *batch* are categorical variables.

They will be used for grouping the data. Let's check the number of unique values in each (object) column.

In [8]:
for col in df:
    if df[col].dtype == "O":
        print(col,"\nunique values:", len(df[col].unique()),"\n", df[col].unique(), "\n")

material 
unique values: 3 
 ['PVC 1' 'PVC 2' 'PVC 3'] 

plant 
unique values: 2 
 ['London' 'Cracow'] 

batch 
unique values: 3 
 ['A' 'B' 'C'] 
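
The same check can be written without an explicit loop. The sketch below is one possible alternative, not the notebook's own method: it builds a small made-up `DataFrame` (the rows are illustrative, not taken from `vinyl_records.csv`), drops the numeric columns with `select_dtypes()` and counts distinct values with `nunique()`.

```python
import pandas as pd

# Made-up rows that mimic the layout of the vinyl records dataset.
df = pd.DataFrame(
    {
        "material": ["PVC 1", "PVC 1", "PVC 2", "PVC 3"],
        "plant": ["London", "Cracow", "London", "London"],
        "batch": ["A", "B", "A", "A"],
        "weight_g": [132.6, 141.4, 181.7, 197.6],
    }
)

# Drop the numeric columns, then count distinct values in each remaining column.
unique_counts = df.select_dtypes(exclude="number").nunique()
print(unique_counts)
```

The result is a `Series` indexed by column name, which is handy when there are many categorical columns to scan.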



## Create `.groupby()` object

Let's group the data *by material*.

In [9]:
grouped = df.groupby("material")

In [10]:
grouped

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000028AECC86150>

In [11]:
type(grouped)

pandas.core.groupby.generic.DataFrameGroupBy

### See the groups

Use the `.groups` attribute to see the index labels of the rows that make up each group.

In [12]:
grouped.groups

{'PVC 1': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 27, 28, 29, 30, 31, 32, 33, 34, 35], 'PVC 2': [14, 15, 16, 17, 18, 19, 20, 21, 36, 37, 38, 39], 'PVC 3': [22, 23, 24, 25, 26]}

In [13]:
type(grouped.groups)

pandas.io.formats.printing.PrettyDict

In [14]:
isinstance(grouped.groups, dict)

True

The `.groups` attribute returns a `dict` with:
- *keys*: the name of each group
- *values*: the index labels of all rows that belong to the group
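
As a small illustration of that key/value structure, the sketch below uses a made-up five-row `DataFrame` (not the real dataset) and turns each group's index labels into a plain list. It also shows `ngroups`, which reports the number of groups directly.

```python
import pandas as pd

# Made-up rows; only the column names follow the vinyl records dataset.
df = pd.DataFrame(
    {
        "material": ["PVC 1", "PVC 2", "PVC 1", "PVC 3", "PVC 2"],
        "weight_g": [132.6, 181.7, 141.4, 197.6, 182.9],
    }
)
grouped = df.groupby("material")

# Convert each group's index labels to a plain list for easy inspection.
group_labels = {name: list(labels) for name, labels in grouped.groups.items()}
print(group_labels)

# ngroups gives the number of groups without inspecting the dict.
print(grouped.ngroups)
```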

**Q: How many groups have been created?**

In [15]:
len(grouped.groups)

3

To display all of the *keys* use the following:

In [16]:
grouped.groups.keys()

dict_keys(['PVC 1', 'PVC 2', 'PVC 3'])

### Get a single group

To get a specific group as a `DataFrame`, call the `.get_group()` method with one of the *keys*.

In [17]:
grouped.get_group("PVC 3")

,material,plant,batch,sample,weight_g,diameter_mm
22,PVC 3,London,A,1,202.48,301.9
23,PVC 3,London,A,2,191.32,301.65
24,PVC 3,London,A,3,199.05,301.95
25,PVC 3,London,A,4,196.81,301.57
26,PVC 3,London,A,5,198.42,301.49


In [18]:
type(grouped.get_group("PVC 3"))

pandas.core.frame.DataFrame
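
One way to think about `.get_group()` is as a shortcut for filtering the original `DataFrame` on the grouping column. The sketch below checks this on made-up data (the rows are illustrative): both approaches select the same rows.

```python
import pandas as pd

# Made-up rows; only the column names follow the vinyl records dataset.
df = pd.DataFrame(
    {
        "material": ["PVC 1", "PVC 3", "PVC 2", "PVC 3"],
        "weight_g": [132.6, 197.6, 181.7, 198.4],
    }
)
grouped = df.groupby("material")

# Rows of one group, fetched through the groupby object.
via_group = grouped.get_group("PVC 3")

# The same rows, selected with a boolean mask on the original DataFrame.
via_mask = df[df["material"] == "PVC 3"]

print(list(via_group.index) == list(via_mask.index))
```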

### Get all the groups

[Pandas doc: iterating-through-groups](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#iterating-through-groups)

In [19]:
for name, group in grouped:
    print(name, '\n')
    print(group, '\n\n')

PVC 1 

   material   plant batch  sample  weight_g  diameter_mm
0     PVC 1  London     A       1    132.64       302.02
1     PVC 1  London     A       2    131.25       300.83
2     PVC 1  London     A       3    131.49       301.84
3     PVC 1  London     A       4    132.63       301.76
4     PVC 1  London     A       5    132.20       301.35
5     PVC 1  London     B       1    148.14       301.41
6     PVC 1  London     B       2    148.12       301.62
7     PVC 1  London     B       3    148.11       301.50
8     PVC 1  London     B       4    150.33       301.77
9     PVC 1  London     B       5    147.57       301.70
10    PVC 1  London     C       1    137.75       301.28
11    PVC 1  London     C       2    138.36       301.25
12    PVC 1  London     C       3    139.77       301.42
13    PVC 1  London     C       4    138.51       301.73
27    PVC 1  Cracow     A       1    141.44       301.34
28    PVC 1  Cracow     A       2    146.49       301.77
29    PVC 1  Cracow    

### Carry out calculations on groups

You can use one of the many implemented aggregation functions and apply it to the `groupby` object:

[Pandas doc: aggregation](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#aggregation)

In [20]:
grouped.count()

material,plant,batch,sample,weight_g,diameter_mm
PVC 1,23,23,23,23,23
PVC 2,12,12,12,12,12
PVC 3,5,5,5,5,5


The full list of built-in aggregation methods is available [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#built-in-aggregation-methods).

**Note:** Not every aggregation works on every column. Our dataset contains both categorical and numerical variables, and we need to take this into account.

In [21]:
%%script python --no-raise-error
grouped.mean()

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'grouped' is not defined


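The `NameError` above is a side effect of the `%%script python` cell magic, which runs the cell in a separate Python process where `grouped` is not defined. Called directly in the notebook, `grouped.mean()` fails on recent pandas versions (2.0 and later) with a `TypeError`, because a mean cannot be computed for the text columns *plant* and *batch*.

To fix this, select the columns you want to aggregate before calling the aggregation function.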

**Q: What's the average measurement per material?**

In [22]:
grouped[["weight_g", "diameter_mm"]].mean()

material,weight_g,diameter_mm
PVC 1,140.04,301.54
PVC 2,182.08,301.53
PVC 3,197.62,301.71
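
Selecting columns is not the only option. As a hedged alternative, the sketch below uses the `numeric_only=True` argument of `.mean()`, which asks pandas to skip non-numeric columns. The data is made up for illustration.

```python
import pandas as pd

# Made-up rows; only the column names follow the vinyl records dataset.
df = pd.DataFrame(
    {
        "material": ["PVC 1", "PVC 1", "PVC 2", "PVC 2"],
        "plant": ["London", "Cracow", "London", "Cracow"],
        "weight_g": [130.0, 140.0, 180.0, 184.0],
        "diameter_mm": [301.0, 302.0, 301.5, 301.5],
    }
)
grouped = df.groupby("material")

# Option 1: select the numeric columns explicitly.
means_selected = grouped[["weight_g", "diameter_mm"]].mean()

# Option 2: let pandas skip the non-numeric columns.
means_numeric_only = grouped.mean(numeric_only=True)

print(means_selected.equals(means_numeric_only))
```

Explicit selection is more verbose, but it documents which columns the result depends on.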


**Q: How many samples were collected per material?**

In [23]:
grouped["batch"].count()

material
PVC 1    23
PVC 2    12
PVC 3     5
Name: batch, dtype: int64
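
Keep in mind that `.count()` counts non-missing values only, so it answers "how many samples" correctly only when the chosen column has no gaps. The sketch below, on made-up data with two missing weights, contrasts it with `.size()`, which counts rows.

```python
import numpy as np
import pandas as pd

# Made-up rows with two missing weights to show the difference.
df = pd.DataFrame(
    {
        "material": ["PVC 1", "PVC 1", "PVC 2", "PVC 2", "PVC 2"],
        "weight_g": [132.6, np.nan, 181.7, 182.9, np.nan],
    }
)
grouped = df.groupby("material")

# size() counts rows in each group, missing values included.
rows_per_group = grouped.size()

# count() counts only the non-missing values of the selected column.
values_per_group = grouped["weight_g"].count()

print(rows_per_group)
print(values_per_group)
```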

## Create multilevel `.groupby()` object

Let's create a grouping by *material* and *plant*.

In [24]:
grouped = df.groupby(["material", "plant"])

Let's see the groups:

In [25]:
grouped.groups

{('PVC 1', 'Cracow'): [27, 28, 29, 30, 31, 32, 33, 34, 35], ('PVC 1', 'London'): [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13], ('PVC 2', 'Cracow'): [36, 37, 38, 39], ('PVC 2', 'London'): [14, 15, 16, 17, 18, 19, 20, 21], ('PVC 3', 'London'): [22, 23, 24, 25, 26]}

In [26]:
len(grouped.groups)

5

In [27]:
for name, group in grouped:
    print(name, '\n')
    print(group, '\n\n')

('PVC 1', 'Cracow') 

   material   plant batch  sample  weight_g  diameter_mm
27    PVC 1  Cracow     A       1    141.44       301.34
28    PVC 1  Cracow     A       2    146.49       301.77
29    PVC 1  Cracow     A       3    140.02       301.82
30    PVC 1  Cracow     A       4    136.82       301.29
31    PVC 1  Cracow     B       1    141.25       301.36
32    PVC 1  Cracow     B       2    139.59       301.52
33    PVC 1  Cracow     B       3    141.36       301.40
34    PVC 1  Cracow     B       4    141.44       301.72
35    PVC 1  Cracow     B       5    135.60       301.72 


('PVC 1', 'London') 

   material   plant batch  sample  weight_g  diameter_mm
0     PVC 1  London     A       1    132.64       302.02
1     PVC 1  London     A       2    131.25       300.83
2     PVC 1  London     A       3    131.49       301.84
3     PVC 1  London     A       4    132.63       301.76
4     PVC 1  London     A       5    132.20       301.35
5     PVC 1  London     B       1    148.

When grouping by multiple columns, the name of each group is stored as a *tuple*.
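
Because the group names are tuples, a tuple is also what `.get_group()` expects as a key, and the name can be unpacked while iterating. A minimal sketch on made-up data:

```python
import pandas as pd

# Made-up rows; only the column names follow the vinyl records dataset.
df = pd.DataFrame(
    {
        "material": ["PVC 1", "PVC 1", "PVC 1", "PVC 2"],
        "plant": ["London", "London", "Cracow", "London"],
        "weight_g": [130.0, 134.0, 140.0, 180.0],
    }
)
grouped = df.groupby(["material", "plant"])

# Pass the key as a tuple: one element per grouping column.
london_pvc1 = grouped.get_group(("PVC 1", "London"))
print(london_pvc1)

# Unpack the tuple name while iterating over the groups.
group_sizes = {}
for (material, plant), group in grouped:
    group_sizes[(material, plant)] = len(group)
print(group_sizes)
```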

**Q: What's the average measurement per material per plant?**

In [28]:
grouped[["weight_g", "diameter_mm"]].mean()

material,plant,weight_g,diameter_mm
PVC 1,Cracow,140.45,301.55
PVC 1,London,139.78,301.53
PVC 2,Cracow,182.93,301.49
PVC 2,London,181.66,301.55
PVC 3,London,197.62,301.71


In [29]:
print(grouped[["weight_g", "diameter_mm"]].mean())

                 weight_g  diameter_mm
material plant                        
PVC 1    Cracow    140.45       301.55
         London    139.78       301.53
PVC 2    Cracow    182.93       301.49
         London    181.66       301.55
PVC 3    London    197.62       301.71
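
The result of a multilevel grouping is indexed by a `MultiIndex` built from *material* and *plant*. If that is inconvenient for later steps, two common ways to reshape it are sketched below on made-up data: `reset_index()` turns the index levels back into columns, and `unstack()` pivots one level into columns.

```python
import pandas as pd

# Made-up rows; only the column names follow the vinyl records dataset.
df = pd.DataFrame(
    {
        "material": ["PVC 1", "PVC 1", "PVC 1", "PVC 2"],
        "plant": ["London", "London", "Cracow", "London"],
        "weight_g": [130.0, 134.0, 140.0, 180.0],
    }
)
means = df.groupby(["material", "plant"])[["weight_g"]].mean()

# Turn the index levels back into ordinary columns.
flat = means.reset_index()
print(flat)

# Pivot the "plant" level into columns: one row per material.
# Combinations that do not occur in the data become NaN.
wide = means["weight_g"].unstack("plant")
print(wide)
```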


**Q: How many samples were collected per material per plant?**

In [30]:
grouped["batch"].count()

material  plant 
PVC 1     Cracow     9
          London    14
PVC 2     Cracow     4
          London     8
PVC 3     London     5
Name: batch, dtype: int64

## Use `.agg()` to carry out multiple calculations at once

[Pandas doc: aggregate](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.aggregate.html)

* `.agg()` is an alias for `.aggregate()` so they are the same thing

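It accepts functions as:
* function names (the function object itself, e.g. `np.mean`)
* function string names (e.g. `"mean"`)
* a `list` of functions
* a `dict` of functions

The `.agg()` method gives us the flexibility to apply any function that reduces a group to a single value.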

### Use function string name

In [31]:
grouped[["weight_g", "diameter_mm"]].agg("mean")

material,plant,weight_g,diameter_mm
PVC 1,Cracow,140.45,301.55
PVC 1,London,139.78,301.53
PVC 2,Cracow,182.93,301.49
PVC 2,London,181.66,301.55
PVC 3,London,197.62,301.71


### Use function name

* calculate the geometric mean using SciPy ([doc](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mstats.gmean.html))

In [32]:
grouped[["weight_g", "diameter_mm"]].agg(scipy.stats.gmean)

material,plant,weight_g,diameter_mm
PVC 1,Cracow,140.41,301.55
PVC 1,London,139.6,301.53
PVC 2,Cracow,182.92,301.49
PVC 2,London,181.65,301.55
PVC 3,London,197.58,301.71


### Use `list` of functions

Pass a `list` of functions to `.agg()` to compute several aggregations at once.

Note that this creates a multi-level index for the columns.

In [33]:
grouped[["weight_g", "diameter_mm"]].agg(["min", "max", "mean"])

,,weight_g,weight_g,weight_g,diameter_mm,diameter_mm,diameter_mm
material,plant,min,max,mean,min,max,mean
PVC 1,Cracow,135.6,146.49,140.45,301.29,301.82,301.55
PVC 1,London,131.25,150.33,139.78,300.83,302.02,301.53
PVC 2,Cracow,180.8,185.56,182.93,301.2,301.83,301.49
PVC 2,London,180.2,183.67,181.66,301.3,302.13,301.55
PVC 3,London,191.32,202.48,197.62,301.49,301.95,301.71


You can use a combination of function names and function string names:

In [34]:
grouped[["weight_g", "diameter_mm"]].agg(["mean", scipy.stats.gmean])

,,weight_g,weight_g,diameter_mm,diameter_mm
material,plant,mean,gmean,mean,gmean
PVC 1,Cracow,140.45,140.41,301.55,301.55
PVC 1,London,139.78,139.6,301.53,301.53
PVC 2,Cracow,182.93,182.92,301.49,301.49
PVC 2,London,181.66,181.65,301.55,301.55
PVC 3,London,197.62,197.58,301.71,301.71
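
The multi-level column index produced by a list of functions can be awkward to export or to select from. One common workaround, sketched here on made-up data, is to join the two levels into single column names. The `column_function` naming pattern is just a convention chosen for this example.

```python
import pandas as pd

# Made-up rows; only the column names follow the vinyl records dataset.
df = pd.DataFrame(
    {
        "material": ["PVC 1", "PVC 1", "PVC 2"],
        "weight_g": [130.0, 140.0, 180.0],
        "diameter_mm": [301.0, 302.0, 301.5],
    }
)
summary = df.groupby("material")[["weight_g", "diameter_mm"]].agg(["min", "max"])

# Each column label is a (column, function) tuple; join it into one string.
summary.columns = ["_".join(col) for col in summary.columns]
print(summary)
```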


### Use `dict` of functions

Use `dict` if you want to apply different calculations to different columns:
- *key*: column name
- *value*: the function, or a `list` of functions, to apply to that column

In [35]:
grouped[["weight_g", "diameter_mm"]].agg(
    {
        "weight_g": ["min", "max", "mean"],
        "diameter_mm": "mean" 
    }
)

,,weight_g,weight_g,weight_g,diameter_mm
material,plant,min,max,mean,mean
PVC 1,Cracow,135.6,146.49,140.45,301.55
PVC 1,London,131.25,150.33,139.78,301.53
PVC 2,Cracow,180.8,185.56,182.93,301.49
PVC 2,London,180.2,183.67,181.66,301.55
PVC 3,London,191.32,202.48,197.62,301.71
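
A related option, not used elsewhere in this notebook, is pandas' *named aggregation*: each keyword argument of `.agg()` names an output column and maps it to a `(column, function)` pair. This gives different calculations per column, like the `dict` form, but with flat column names. The sketch uses made-up data, and the output names (`weight_min` and so on) are arbitrary.

```python
import pandas as pd

# Made-up rows; only the column names follow the vinyl records dataset.
df = pd.DataFrame(
    {
        "material": ["PVC 1", "PVC 1", "PVC 2"],
        "weight_g": [130.0, 140.0, 180.0],
        "diameter_mm": [301.0, 302.0, 301.5],
    }
)

# keyword = (input column, aggregation); the keyword becomes the output column.
summary = df.groupby("material").agg(
    weight_min=("weight_g", "min"),
    weight_max=("weight_g", "max"),
    diameter_mean=("diameter_mm", "mean"),
)
print(summary)
```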


### Using lambda

You can use `lambda` functions inside `.agg()` too.

This is especially convenient if your function accepts some parameters.

**Example**

Let's have a look at *standard deviation*.

Standard deviation accepts a *delta degrees of freedom* (`ddof`) parameter.

If you don't want to change the `ddof` parameter, you can use the function's string name directly.

In [36]:
grouped[["weight_g", "diameter_mm"]].agg("std")

material,plant,weight_g,diameter_mm
PVC 1,Cracow,3.11,0.21
PVC 1,London,7.28,0.31
PVC 2,Cracow,2.01,0.31
PVC 2,London,1.12,0.27
PVC 3,London,4.08,0.2


The above is the same as running the function with `ddof=1` (delta degrees of freedom), which is the pandas default.

This is what you would typically use when your data is only a sample of the population (by far the most common case).

In [37]:
grouped[["weight_g", "diameter_mm"]].agg(lambda x: np.std(x, ddof=1))

material,plant,weight_g,diameter_mm
PVC 1,Cracow,3.11,0.21
PVC 1,London,7.28,0.31
PVC 2,Cracow,2.01,0.31
PVC 2,London,1.12,0.27
PVC 3,London,4.08,0.2


But if you know your data is complete and contains measurements for the whole population, you'll want to use `ddof=0`. A `lambda` is ideal for this kind of situation:

In [38]:
grouped[["weight_g", "diameter_mm"]].agg(lambda x: np.std(x, ddof=0))

material,plant,weight_g,diameter_mm
PVC 1,Cracow,2.94,0.2
PVC 1,London,7.02,0.29
PVC 2,Cracow,1.74,0.27
PVC 2,London,1.05,0.25
PVC 3,London,3.65,0.18
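
To make the effect of `ddof` concrete, the sketch below uses made-up numbers chosen so the results are easy to verify by hand. With `ddof=0` the sum of squared deviations is divided by `n`, with `ddof=1` by `n - 1`. It also shows that the groupby `.std()` method accepts `ddof` directly, which is an alternative to the `lambda`.

```python
import numpy as np
import pandas as pd

# Made-up weights: for "PVC 1" the squared deviations from the mean (5.0) sum to 32.
df = pd.DataFrame(
    {
        "material": ["PVC 1"] * 8 + ["PVC 2"] * 2,
        "weight_g": [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0, 1.0, 3.0],
    }
)
weights = df.groupby("material")["weight_g"]

# Sample standard deviation (pandas default, ddof=1): sqrt(32 / 7) for "PVC 1".
sample_std = weights.std()

# Population standard deviation (ddof=0): sqrt(32 / 8) for "PVC 1".
population_std = weights.std(ddof=0)

# The lambda form gives the same population values.
via_lambda = weights.agg(lambda x: np.std(x, ddof=0))

print(sample_std)
print(population_std)
```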


### Use user defined functions

A named function works the same way as a `lambda`. Extra keyword arguments given to `.agg()` are forwarded to the function, so `ddof` can be set at call time.

In [39]:
def my_std(x, ddof=1):
    return np.std(x, ddof=ddof)

In [40]:
grouped[["weight_g", "diameter_mm"]].agg(my_std)

material,plant,weight_g,diameter_mm
PVC 1,Cracow,3.11,0.21
PVC 1,London,7.28,0.31
PVC 2,Cracow,2.01,0.31
PVC 2,London,1.12,0.27
PVC 3,London,4.08,0.2


In [41]:
grouped[["weight_g", "diameter_mm"]].agg(my_std, ddof=0)

material,plant,weight_g,diameter_mm
PVC 1,Cracow,2.94,0.2
PVC 1,London,7.02,0.29
PVC 2,Cracow,1.74,0.27
PVC 2,London,1.05,0.25
PVC 3,London,3.65,0.18
