# Data analysis with Pandas

This is a guide to data analysis with Pandas. It aims to explain the most important concepts and usage patterns in a way that sticks with you. I try to make things understandable and memorable by adhering to some principles:

- Identify and explore important general concepts:
    - the basic Pandas data structures: series, dataframes, indexes and multi-indexes
    - wide classes of operations on data: selection, grouping, aggregations and transformations, sorting
    - shapes that datasets come in and how to convert between them: long vs wide
- Use well chosen examples:
    - preferably real datasets
    - understandable for a broad audience
    - interesting
    - distinct from one another in structure
- Stay concise, avoid minutiae that can be easily looked up in the documentation, provide pointers to references instead


Before we begin we need to import all the libraries we will use. We also change some options that I almost always find useful to change:

In [1]:
# Import libraries
import numpy as np
import pandas as pd

# Make Pandas less eager to hide data in dataframes with "..."
pd.options.display.max_rows = 999
pd.options.display.max_columns = 100
pd.options.display.max_colwidth = 200

# Sensible floating point precision for basic use cases
pd.options.display.precision = 3

# Short exception tracebacks
%xmode Plain

Exception reporting mode: Plain


## Pandas data structures

We begin by looking into the fundamental high-level Pandas data structures: series, dataframes and indexes. If you try to use Pandas without understanding the data structures and the relations between them, things get confusing very quickly. On the other hand, learning the data structures can feel dry compared to working with real datasets. We cover only the most important basics here to keep things interesting and revisit them in more depth later, when the need arises.

### Series and basic indexes

A series holds an array of values:

In [2]:
primes = pd.Series([2, 3, 5, 7, 11])
primes

0     2
1     3
2     5
3     7
4    11
dtype: int64

The right column of the output simply shows the values in the series. `dtype: int64` refers to the data type of the values.

The left column of the output shows the labels corresponding to the values in the series. The labels belong to the *index*, which is a part of every series. By default a series is indexed simply by the position of each value in the series:

In [62]:
primes = pd.Series([2, 3, 5, 7, 11])
primes.index

RangeIndex(start=0, stop=5, step=1)

`.loc[]` on a series takes an index label as its argument and looks up the corresponding value in the series. With the default index it ends up working like basic array indexing, except that negative indexes are not supported, since `.loc[]` does label lookup:

In [4]:
primes = pd.Series([2, 3, 5, 7, 11])
primes.loc[2]

5
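
Because `.loc[]` looks up labels rather than positions, a negative number is treated as just another label, and the default index does not contain it. Here is a minimal sketch of the difference, using `.iloc[]` (positional selection, covered later in this guide) for comparison:

```python
import pandas as pd

primes = pd.Series([2, 3, 5, 7, 11])

# Positional lookup: -1 means "the last element"
last_by_position = primes.iloc[-1]

# Label lookup: the default index has no label -1, so this raises KeyError
try:
    primes.loc[-1]
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True

print(last_by_position, label_lookup_failed)
```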

The index is what makes a series more than a simple array. It makes it possible to refer to the values in the series by whatever label is appropriate: a string, a date, a pair of numbers, ... It also plays a role similar to a database index, hence the name: it speeds up operations like joins that have to look up values by label repeatedly. Here is how you construct a series with an explicit index that does not simply use the position in the series as the label of each value:

In [5]:
primes = pd.Series([2, 3, 5, 7, 11], index=["p0", "p1", "p2", "p3", "p4"])
primes

p0     2
p1     3
p2     5
p3     7
p4    11
dtype: int64

Now you can look up values by a label or a set of labels:

In [6]:
primes = pd.Series([2, 3, 5, 7, 11], index=["p0", "p1", "p2", "p3", "p4"])
primes.loc["p0"]

2

In [7]:
primes = pd.Series([2, 3, 5, 7, 11], index=["p0", "p1", "p2", "p3", "p4"])
primes.loc[["p0", "p2"]]

p0    2
p2    5
dtype: int64
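
Labels do not have to be strings. As a small sketch of the "by a date" case mentioned above, here is a series indexed by dates (the `temperatures` name and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical daily measurements, labelled by date instead of by position
dates = pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"])
temperatures = pd.Series([1.5, -0.5, 3.0], index=dates)

# Look up a single value by its date label
value = temperatures.loc[pd.Timestamp("2024-01-02")]
print(value)
```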

### Data frames

Conceptually a dataframe is a collection of column series that share a common row index:

In [8]:
df = pd.DataFrame({"a": primes, "b": primes})
df

Unnamed: 0,a,b
p0,2,2
p1,3,3
p2,5,5
p3,7,7
p4,11,11


In [9]:
df = pd.DataFrame({"a": primes, "b": primes})
display(id(df.index))
display(id(df["a"].index))
display(id(df["b"].index))

140215916018064

140215916018064

140215916018064
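
One consequence of the shared row index is worth seeing once: when a dataframe is built from series whose indexes differ, Pandas aligns the values by label and fills the gaps with `NaN`. A minimal sketch with two made-up series:

```python
import pandas as pd

a = pd.Series([2, 3, 5], index=["p0", "p1", "p2"])
b = pd.Series([7, 11], index=["p2", "p3"])

# The row index of the dataframe is the union of the labels of both series
df = pd.DataFrame({"a": a, "b": b})
print(df)

# Both column series now share the dataframe's row index
print(df["a"].index.equals(df.index))
```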

Columns get labels through an additional index object, called the column index:

In [10]:
df = pd.DataFrame({"a": primes, "b": primes})
df.columns

Index(['a', 'b'], dtype='object')

Many Pandas methods work with either the row index or column index depending on the value of the `axis` keyword argument. For example removing rows by label and removing columns using a label or a set of labels are both done using the `drop` method:

In [11]:
df = pd.DataFrame({"a": primes, "b": primes})
df.drop(["p0", "p1"])

Unnamed: 0,a,b
p2,5,5
p3,7,7
p4,11,11


In [12]:
df = pd.DataFrame({"a": primes, "b": primes})
df.drop("a", axis=1)

Unnamed: 0,b
p0,2
p1,3
p2,5
p3,7
p4,11
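
As a rough sketch of how `axis` behaves elsewhere: `axis=0` refers to the row index and `axis=1` to the column index. `drop` additionally accepts the `index=` and `columns=` keywords, which some find easier to read than `axis`:

```python
import pandas as pd

primes = pd.Series([2, 3, 5, 7, 11], index=["p0", "p1", "p2", "p3", "p4"])
df = pd.DataFrame({"a": primes, "b": primes})

# Two equivalent ways of dropping a column
by_axis = df.drop("a", axis=1)
by_keyword = df.drop(columns="a")

# Dropping rows with the index= keyword instead of the default axis=0
rows_dropped = df.drop(index=["p0", "p1"])

# Reductions take axis too: axis=0 collapses rows, axis=1 collapses columns
col_sums = df.sum(axis=0)  # one value per column
row_sums = df.sum(axis=1)  # one value per row
```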


### Jupyter representation of the data structures

It is good to learn how the Jupyter cell output corresponds to the underlying Pandas data structures and their attributes. Often we will not construct a `pd.Series` or `pd.DataFrame` directly, but receive one as a result of some sequence of Pandas operations. In this case we do not know up front whether it is a `pd.Series` or `pd.DataFrame`, what the index is, whether the index has a name, etc. You can inspect the result, for example call `type(result)`, but it is more efficient to simply learn what is shown where and how.

A `pd.Series` is displayed in monospaced font (unlike `pd.DataFrame`) and as the below example illustrates its Jupyter representation has at most four parts:

- top left corner: index name, only shown if present
- left column: index labels
- right column: series values
- bottom line: series name, if present, and dtype of series values

In [13]:
index = pd.Index(["p0", "p1", "p2", "p3", "p4"], name="key")
series = pd.Series([2, 3, 5, 7, 11], index=index, name="primes")
series

key
p0     2
p1     3
p2     5
p3     7
p4    11
Name: primes, dtype: int64

A dataframe with a single column is not the same thing as a series:

In [14]:
index = pd.Index(["p0", "p1", "p2", "p3", "p4"], name="key")
series = pd.Series([2, 3, 5, 7, 11], index=index, name="primes")
series.to_frame()

Unnamed: 0_level_0,primes
key,Unnamed: 1_level_1
p0,2
p1,3
p2,5
p3,7
p4,11
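
When the display alone leaves you unsure, the difference can also be checked in code. A small sketch that converts the series to a one-column dataframe and back:

```python
import pandas as pd

index = pd.Index(["p0", "p1", "p2", "p3", "p4"], name="key")
series = pd.Series([2, 3, 5, 7, 11], index=index, name="primes")
frame = series.to_frame()

print(type(series).__name__)  # Series
print(type(frame).__name__)   # DataFrame
print(frame.shape)            # (5, 1): five rows, one column

# Selecting the single column by its label gives the series back
back = frame["primes"]
```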


## Dataset: mtcars

We will use the mtcars dataset as the default basic example in the following sections. Its contents are interesting and easy to understand, and it comes in a common simple form: each row describes a single unique unit of observation (a car) and each column is a different measurement for the same unit of observation (miles/gallon, number of cylinders, horsepower etc.):

In [15]:
CARS = pd.read_csv("mtcars.csv")

def cars():
    # Return a fresh copy every time so that examples do not affect each other
    return CARS.copy()

## Exploring datasets

`.head(n)` returns the first `n` rows, `.tail(n)` the last `n` rows, and `.sample(n)` returns `n` random rows:

In [16]:
df = cars()
display(df.head(2))
display(df.tail(2))
display(df.sample(2))

Unnamed: 0,model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
0,Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
1,Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4


Unnamed: 0,model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
30,Maserati Bora,15.0,8,301.0,335,3.54,3.57,14.6,0,1,5,8
31,Volvo 142E,21.4,4,121.0,109,4.11,2.78,18.6,1,1,4,2


Unnamed: 0,model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
20,Toyota Corona,21.5,4,120.1,97,3.7,2.465,20.01,1,0,3,1
28,Ford Pantera L,15.8,8,351.0,264,4.22,3.17,14.5,0,1,5,4


`info()` prints a summary of a dataframe that contains a lot of useful information: the number of entries in the index (which is equal to the total number of rows), a list of columns with the dtype and the number of non-null values of each, and the overall memory usage of the dataframe. The downside is that all this information is only printed and nothing is returned:

In [17]:
df = cars()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 12 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   model   32 non-null     object 
 1   mpg     32 non-null     float64
 2   cyl     32 non-null     int64  
 3   disp    32 non-null     float64
 4   hp      32 non-null     int64  
 5   drat    32 non-null     float64
 6   wt      32 non-null     float64
 7   qsec    32 non-null     float64
 8   vs      32 non-null     int64  
 9   am      32 non-null     int64  
 10  gear    32 non-null     int64  
 11  carb    32 non-null     int64  
dtypes: float64(5), int64(6), object(1)
memory usage: 3.1+ KB
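
If you need this information as values rather than printed text, one option is to pass a text buffer through the `buf` argument of `info()`, or to compute the individual pieces yourself. A minimal sketch on a tiny stand-in dataframe (two rows taken from the mtcars output above):

```python
import io

import pandas as pd

df = pd.DataFrame({
    "model": ["Mazda RX4", "Datsun 710"],
    "mpg": [21.0, 22.8],
    "cyl": [6, 4],
})

# info() writes into the buffer instead of printing and still returns None
buffer = io.StringIO()
result = df.info(buf=buffer)
summary_text = buffer.getvalue()

# The individual pieces are also available as regular Pandas objects
non_null = df.notna().sum()  # number of non-null values per column
dtypes = df.dtypes           # dtype of each column
```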


`len(df)` returns the number of rows in a dataframe:

In [18]:
df = cars()
len(df)

32

`count()` returns the number of rows with a non-null value. Since this differs for each column series, when called on a dataframe it returns a series with a single value for each column label:

In [19]:
df = cars()
df.count()

model    32
mpg      32
cyl      32
disp     32
hp       32
drat     32
wt       32
qsec     32
vs       32
am       32
gear     32
carb     32
dtype: int64

`df.dtypes` returns a series indexed by column name that contains the dtype of each column of `df`:

In [20]:
df = cars()
df.dtypes

model     object
mpg      float64
cyl        int64
disp     float64
hp         int64
drat     float64
wt       float64
qsec     float64
vs         int64
am         int64
gear       int64
carb       int64
dtype: object

`describe()` returns a dataframe that contains basic summary statistics and the number of non-null rows for each column:

In [21]:
df = cars()
df.describe()

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
count,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0
mean,20.091,6.188,230.722,146.688,3.597,3.217,17.849,0.438,0.406,3.688,2.812
std,6.027,1.786,123.939,68.563,0.535,0.978,1.787,0.504,0.499,0.738,1.615
min,10.4,4.0,71.1,52.0,2.76,1.513,14.5,0.0,0.0,3.0,1.0
25%,15.425,4.0,120.825,96.5,3.08,2.581,16.892,0.0,0.0,3.0,2.0
50%,19.2,6.0,196.3,123.0,3.695,3.325,17.71,0.0,0.0,4.0,2.0
75%,22.8,8.0,326.0,180.0,3.92,3.61,18.9,1.0,1.0,4.0,4.0
max,33.9,8.0,472.0,335.0,4.93,5.424,22.9,1.0,1.0,5.0,8.0
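
Note that the `model` column is missing from the output above: by default `describe()` only summarizes numeric columns. Passing `include="all"` also covers the other columns. A small sketch on a stand-in dataframe built from the first three mtcars rows:

```python
import pandas as pd

df = pd.DataFrame({
    "model": ["Mazda RX4", "Mazda RX4 Wag", "Datsun 710"],
    "mpg": [21.0, 21.0, 22.8],
    "cyl": [6, 6, 4],
})

numeric_only = df.describe()             # numeric columns only
everything = df.describe(include="all")  # text columns as well

print(list(numeric_only.columns))
print(list(everything.columns))
```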


`value_counts()` on the given column series returns a series describing how many rows take each unique value:

In [22]:
df = cars()
df["cyl"].value_counts()

cyl
8    14
4    11
6     7
Name: count, dtype: int64

`value_counts(normalize=True)` returns a series describing what fraction of all rows have each unique value, rather than the raw counts:

In [23]:
df = cars()
df["cyl"].value_counts(normalize=True)

cyl
8    0.438
4    0.344
6    0.219
Name: proportion, dtype: float64

`unique()` returns only the unique values themselves:

In [24]:
df["cyl"].unique()

array([6, 4, 8])

`nunique()` returns the number of unique values:

In [25]:
df["cyl"].nunique()

3

`nlargest(n)` returns `n` largest values and `nsmallest(n)` `n` smallest (`n` defaults to 5):

In [26]:
df["hp"].nlargest()

30    335
28    264
6     245
23    245
16    230
Name: hp, dtype: int64

In [27]:
df["hp"].nsmallest()

18    52
7     62
19    65
17    66
25    66
Name: hp, dtype: int64

To see the cars with the largest horsepower, use the fact that the index of the series returned by `nlargest` is a subset of the dataframe index and hence can be used to subset the dataframe:

In [28]:
df.loc[df["hp"].nlargest().index]

Unnamed: 0,model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
30,Maserati Bora,15.0,8,301.0,335,3.54,3.57,14.6,0,1,5,8
28,Ford Pantera L,15.8,8,351.0,264,4.22,3.17,14.5,0,1,5,4
6,Duster 360,14.3,8,360.0,245,3.21,3.57,15.84,0,0,3,4
23,Camaro Z28,13.3,8,350.0,245,3.73,3.84,15.41,0,0,3,4
16,Chrysler Imperial,14.7,8,440.0,230,3.23,5.345,17.42,0,0,3,4
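
Dataframes also have their own `nlargest(n, columns)` method that returns the full rows directly. A short sketch comparing the two approaches on a few rows copied from the output above:

```python
import pandas as pd

df = pd.DataFrame({
    "model": ["Mazda RX4", "Datsun 710", "Duster 360", "Ford Pantera L", "Maserati Bora"],
    "hp": [110, 93, 245, 264, 335],
})

# Subset the dataframe using the index of the series returned by nlargest
via_index = df.loc[df["hp"].nlargest(2).index]

# Ask the dataframe directly for the rows with the largest "hp"
direct = df.nlargest(2, "hp")

print(direct)
```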


## Selecting rows and columns

### Selecting with []

Select rows with `[]`:

In [29]:
df = cars()
df[df["gear"] == 5]

Unnamed: 0,model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
26,Porsche 914-2,26.0,4,120.3,91,4.43,2.14,16.7,0,1,5,2
27,Lotus Europa,30.4,4,95.1,113,3.77,1.513,16.9,1,1,5,2
28,Ford Pantera L,15.8,8,351.0,264,4.22,3.17,14.5,0,1,5,4
29,Ferrari Dino,19.7,6,145.0,175,3.62,2.77,15.5,0,1,5,6
30,Maserati Bora,15.0,8,301.0,335,3.54,3.57,14.6,0,1,5,8


Select a single column as a `pd.Series` with `[]`:

In [30]:
df = cars()
df["gear"].head(5)

0    4
1    4
2    4
3    3
4    3
Name: gear, dtype: int64

Select one or more columns as a `pd.DataFrame` by passing a list to `[]`. Note how a dataframe with one column is of a different type than a series and is displayed in a different way:

In [31]:
df = cars()
df[["gear"]].head(5)

Unnamed: 0,gear
0,4
1,4
2,4
3,3
4,3


Selection of rows and of columns can be combined:

In [56]:
df = cars()
df[df["gear"] == 5]["cyl"]

26    4
27    4
28    8
29    6
30    8
Name: cyl, dtype: int64

This form of row and column selection does not work for the purpose of modifying or inserting data:

In [33]:
df = cars()
df[df["gear"] == 5]["cyl"] = 3

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[df["gear"] == 5]["cyl"] = 3


Let's break down how the `df[df["gear"] == 5]["cyl"] = 3` statement translates to method calls on the underlying objects: `df[df["gear"] == 5]` translates to a `df.__getitem__(df["gear"] == 5)` call on the dataframe, and the `["cyl"] = 3` part then translates to a `.__setitem__("cyl", 3)` call on the resulting object. The problem is that the `__getitem__` call might return either a view or a copy of the dataframe, so the original dataframe might or might not be modified.

Instead, `df.loc[]` can be used to select rows and columns at the same time. `df.loc[]` will return a view or a copy just like `df[]`, but `df.loc[]=` is just a single `__setitem__` method call on the object stored in the `loc` attribute of the original dataframe, free of the ambiguity of `[][]=`, so that it will always correctly modify the dataframe.
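
To make the "single `__setitem__` call" point concrete, here is a sketch that performs the same assignment twice on a made-up three-row dataframe: once with the usual syntax and once by spelling out the method call that Python makes behind the scenes. You would never write the second form in practice; it is only here for illustration:

```python
import pandas as pd

df = pd.DataFrame({"gear": [4, 5, 5], "cyl": [6, 4, 8]})
df2 = df.copy()

# The usual syntax: row selector and column selector in one [] on .loc
df.loc[df["gear"] == 5, "cyl"] = 3

# What Python actually calls: one __setitem__ with a (rows, columns) tuple as the key
df2.loc.__setitem__((df2["gear"] == 5, "cyl"), 3)

print(df)
```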

### Selecting with loc[]

Select rows:

In [34]:
df = cars()
df.loc[df["gear"] == 5]

Unnamed: 0,model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
26,Porsche 914-2,26.0,4,120.3,91,4.43,2.14,16.7,0,1,5,2
27,Lotus Europa,30.4,4,95.1,113,3.77,1.513,16.9,1,1,5,2
28,Ford Pantera L,15.8,8,351.0,264,4.22,3.17,14.5,0,1,5,4
29,Ferrari Dino,19.7,6,145.0,175,3.62,2.77,15.5,0,1,5,6
30,Maserati Bora,15.0,8,301.0,335,3.54,3.57,14.6,0,1,5,8


Select a single column as a `pd.Series`:

In [35]:
df = cars()
df.loc[:, "gear"].head(5)

0    4
1    4
2    4
3    3
4    3
Name: gear, dtype: int64

Select one or more columns as a `pd.DataFrame`:

In [36]:
df = cars()
df.loc[:, ["gear"]].head(5)

Unnamed: 0,gear
0,4
1,4
2,4
3,3
4,3


Modify a subpart of a dataframe:

In [57]:
df = cars()
df.loc[df["cyl"] == 6, "hp"] = 200
df.loc[df["cyl"] == 6, :]

Unnamed: 0,model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
0,Mazda RX4,21.0,6,160.0,200,3.9,2.62,16.46,0,1,4,4
1,Mazda RX4 Wag,21.0,6,160.0,200,3.9,2.875,17.02,0,1,4,4
3,Hornet 4 Drive,21.4,6,258.0,200,3.08,3.215,19.44,1,0,3,1
5,Valiant,18.1,6,225.0,200,2.76,3.46,20.22,1,0,3,1
9,Merc 280,19.2,6,167.6,200,3.92,3.44,18.3,1,0,4,4
10,Merc 280C,17.8,6,167.6,200,3.92,3.44,18.9,1,0,4,4
29,Ferrari Dino,19.7,6,145.0,200,3.62,2.77,15.5,0,1,5,6


### Boolean expressions in [] and loc[]

When selecting rows with `df[df["gear"] == 5]`, `df["gear"] == 5` is a `pd.Series` wrapping a boolean vector:

In [38]:
df = cars()
(df["gear"] == 5).head(5)

0    False
1    False
2    False
3    False
4    False
Name: gear, dtype: bool

Boolean operators like `&`, `|` and `~` (negation) can be used on those boolean vectors to build compound filtering conditions. Individual conditions have to be enclosed in parentheses, since `&` and `|` have higher precedence in Python than operators like `==`, `>=`, etc.:

In [39]:
df = cars()
df[(df["cyl"] >= 4) & (df["cyl"] <= 6) & (df["gear"] == 3)]

Unnamed: 0,model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
3,Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
5,Valiant,18.1,6,225.0,105,2.76,3.46,20.22,1,0,3,1
20,Toyota Corona,21.5,4,120.1,97,3.7,2.465,20.01,1,0,3,1
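
To see why the parentheses matter, here is a sketch of what happens without them on a small made-up dataframe. Python evaluates `&` first, which turns the condition into a chained comparison that Pandas cannot reduce to a single `True` or `False`, so a `ValueError` is raised:

```python
import pandas as pd

df = pd.DataFrame({"cyl": [6, 4, 8, 4], "gear": [3, 3, 3, 4]})

# With parentheses: each comparison yields a boolean series, & combines them
selected = df[(df["cyl"] <= 6) & (df["gear"] == 3)]

# Without parentheses Python reads this as: df["cyl"] <= (6 & df["gear"]) == 3
try:
    df[df["cyl"] <= 6 & df["gear"] == 3]
    raised = False
except ValueError:
    raised = True

print(selected)
print(raised)
```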


Use the `isin()` series method to select rows whose value belongs to a given set:

In [40]:
df = cars()
df[df["cyl"].isin([4, 6]) & (df["gear"] == 3)]

Unnamed: 0,model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
3,Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
5,Valiant,18.1,6,225.0,105,2.76,3.46,20.22,1,0,3,1
20,Toyota Corona,21.5,4,120.1,97,3.7,2.465,20.01,1,0,3,1
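
Two related patterns, sketched on a small made-up dataframe: `~` negates an `isin()` condition, and the `between()` series method expresses a range check (both endpoints are included by default):

```python
import pandas as pd

df = pd.DataFrame({"cyl": [6, 4, 8, 4], "gear": [3, 3, 3, 4]})

# Rows whose "cyl" lies in the closed range [4, 6]
in_range = df[df["cyl"].between(4, 6)]

# Rows whose "cyl" is NOT one of the listed values
not_in_set = df[~df["cyl"].isin([4, 6])]

print(in_range)
print(not_in_set)
```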


### Selecting with iloc[]

`iloc[]` selects rows and columns by integer position, like array indexing. Its argument is a positive or negative offset (or a slice of offsets) of a value in a series, or of the rows and optionally the columns of a dataframe:

In [41]:
df = cars()
df.iloc[2]

model    Datsun 710
mpg            22.8
cyl               4
disp          108.0
hp               93
drat           3.85
wt             2.32
qsec          18.61
vs                1
am                1
gear              4
carb              1
Name: 2, dtype: object

In [42]:
df = cars()
df.iloc[-2]

model    Maserati Bora
mpg               15.0
cyl                  8
disp             301.0
hp                 335
drat              3.54
wt                3.57
qsec              14.6
vs                   0
am                   1
gear                 5
carb                 8
Name: 30, dtype: object

In [43]:
df = cars()
df.iloc[0:3]

Unnamed: 0,model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
0,Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
1,Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
2,Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
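
Two details are easy to trip over, so here is a small sketch using the first four mtcars rows: `iloc[]` can take a second argument to select columns by position, and its slices exclude the end position, whereas `loc[]` slices include the end label:

```python
import pandas as pd

df = pd.DataFrame({
    "model": ["Mazda RX4", "Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive"],
    "mpg": [21.0, 21.0, 22.8, 21.4],
    "cyl": [6, 6, 4, 6],
})

# Rows at positions 0 and 1, columns at positions 0 and 1
corner = df.iloc[0:2, 0:2]

# Positional slice: end position 2 is excluded
by_position = df.iloc[0:2]

# Label slice: end label 2 is included
by_label = df.loc[0:2]

print(len(by_position), len(by_label))
```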


## Grouping

### Splitting into groups

### Aggregating using predefined methods

The series groupby and dataframe groupby objects both support calls like `min()`, `max()`, `mean()`, `std()`, `var()`, `quantile()` etc. that are groupwise versions of the respective series/dataframe operation:

In [44]:
df = cars()
df.groupby(["cyl", "gear"])["hp"].mean()

cyl  gear
4    3        97.000
     4        76.000
     5       102.000
6    3       107.500
     4       116.500
     5       175.000
8    3       194.167
     5       299.500
Name: hp, dtype: float64
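
It can help to look at the groupby objects themselves before aggregating. The sketch below, on a handful of rows copied from mtcars, shows the two kinds of groupby objects and how to peek at the groups by iterating or by calling `get_group()`:

```python
import pandas as pd

df = pd.DataFrame({
    "model": ["Mazda RX4", "Datsun 710", "Hornet 4 Drive", "Valiant", "Maserati Bora"],
    "cyl": [6, 4, 6, 6, 8],
    "hp": [110, 93, 110, 105, 335],
})

frame_groups = df.groupby("cyl")    # dataframe groupby
series_groups = frame_groups["hp"]  # series groupby

print(type(frame_groups).__name__, type(series_groups).__name__)

# Iterating yields (group key, sub-dataframe) pairs
for key, group in frame_groups:
    print(key, list(group["model"]))

# Fetch the rows of a single group by its key
six = frame_groups.get_group(6)
```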

To count the number of rows in each group call `.size()` on the groupby object:

In [45]:
df = cars()
df.groupby(["cyl", "gear"]).size()

cyl  gear
4    3        1
     4        8
     5        2
6    3        2
     4        4
     5        1
8    3       12
     5        2
dtype: int64

Confusingly, to get the number of rows in the whole dataframe you need to call `len(df)` or `len(df.index)`: on a dataframe `.size` is an attribute, not a method, and it holds the number of cells in the dataframe, not the number of rows. For example, to group cars by the number of cylinders and the number of gears and see what percentage of all cars has each setup, call `.size()` on the groupby object returned by `groupby(["cyl", "gear"])`, divide by `len(df)` and multiply the result by `100.0`:

In [46]:
df = cars()
(df.groupby(["cyl", "gear"]).size() / len(df)) * 100.0

cyl  gear
4    3        3.125
     4       25.000
     5        6.250
6    3        6.250
     4       12.500
     5        3.125
8    3       37.500
     5        6.250
dtype: float64
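
The different meanings of "size" are easiest to remember side by side. A quick sketch on a made-up dataframe with four rows and three columns:

```python
import pandas as pd

df = pd.DataFrame({
    "cyl": [6, 4, 6, 8],
    "gear": [4, 4, 3, 5],
    "hp": [110, 93, 110, 335],
})

n_rows = len(df)   # number of rows
n_cells = df.size  # number of cells: rows * columns (an attribute, no parentheses)
shape = df.shape   # (rows, columns)

# On a groupby object size() is a method and counts the rows in each group
group_sizes = df.groupby("cyl").size()

print(n_rows, n_cells, shape)
print(group_sizes)
```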

To count the groups themselves use the `.ngroups` attribute:

In [47]:
df = cars()
df.groupby(["cyl", "gear"]).ngroups

8

To compute a compound expression involving group-level aggregates, for example the range of values within each group (group max - group min), store a reference to the groupby object and reuse it:

In [48]:
df = cars()
df_groupby = df.groupby(["cyl", "gear"])["hp"]
df_groupby.max() - df_groupby.min()

cyl  gear
4    3        0
     4       57
     5       22
6    3        5
     4       13
     5        0
8    3       95
     5       71
Name: hp, dtype: int64

### Aggregating using generic agg()

`agg(func)` calls `func(series)` once for each `series` of every group. `func` should return a scalar. `func` can be passed in as a function or as a function name (a string). The result is a series or a dataframe, depending on whether the aggregation is done on a single series or on a dataframe, and also on whether one aggregation is done or several. There are also multiple ways of passing the arguments that specify which aggregations to do. This adds up to many cases, which we now walk through.

#### Aggregating a single series

The simplest case is a single aggregation on a single series. The result is a series:

In [49]:
df = cars()
df.groupby("cyl")["hp"].agg("mean")

cyl
4     82.636
6    122.286
8    209.214
Name: hp, dtype: float64
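
The example above passes the aggregation as a function name. As a sketch of the other option, here is a custom function passed to `agg()`; the `value_range` helper is made up for illustration and computes the group max minus the group min on a few mtcars-like rows:

```python
import pandas as pd

df = pd.DataFrame({
    "cyl": [6, 4, 6, 6, 8, 8],
    "hp": [110, 93, 110, 105, 245, 335],
})

def value_range(series):
    """Hypothetical helper: reduce the values of one group to a single scalar."""
    return series.max() - series.min()

# agg calls value_range once per group, passing that group's "hp" values as a series
by_function = df.groupby("cyl")["hp"].agg(value_range)

# The same result built from two predefined aggregations
groups = df.groupby("cyl")["hp"]
by_name = groups.agg("max") - groups.agg("min")

print(by_function)
```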

To do multiple aggregations pass a list of functions or function names to `agg`. The result will be a dataframe, since in general there is more than one group each of which becomes a row of the result and for each group we compute more than one aggregation each of which becomes a column of the result:

In [50]:
df = cars()
df.groupby("cyl")["hp"].agg(["size", "mean", "std"])

Unnamed: 0_level_0,size,mean,std
cyl,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
4,11,82.636,20.935
6,7,122.286,24.26
8,14,209.214,50.977


`agg()` can also be called with keyword arguments, in which case the name of the argument specifies the name of the column for the aggregated data in the resulting dataframe. The value of each keyword argument should again be a function or a function name to perform the aggregation:

In [51]:
df = cars()
df.groupby("cyl")["hp"].agg(count="size", average="mean", stddev="std")

Unnamed: 0_level_0,count,average,stddev
cyl,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
4,11,82.636,20.935
6,7,122.286,24.26
8,14,209.214,50.977


#### Aggregating a dataframe

The next case is the aggregation of multiple series. When a single aggregation is applied, the result is a simple dataframe whose column names are the same as the names of the columns that were aggregated:

In [52]:
df = cars()
df.groupby("cyl")[["hp", "wt"]].agg("mean")

Unnamed: 0_level_0,hp,wt
cyl,Unnamed: 1_level_1,Unnamed: 2_level_1
4,82.636,2.286
6,122.286,3.117
8,209.214,3.999


When multiple aggregations are applied to a dataframe, the result is a dataframe with a column multi-index:

In [53]:
df = cars()
df.groupby("cyl")[["hp", "wt"]].agg(["mean", "std"])

Unnamed: 0_level_0,hp,hp,wt,wt
Unnamed: 0_level_1,mean,std,mean,std
cyl,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
4,82.636,20.935,2.286,0.57
6,122.286,24.26,3.117,0.356
8,209.214,50.977,3.999,0.759
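
Working with a column multi-index takes some getting used to. The sketch below, on four mtcars-like rows, shows how to select from such a result and one common way to flatten the column labels into plain strings (the `hp_mean` style of naming is just a convention chosen here):

```python
import pandas as pd

df = pd.DataFrame({
    "cyl": [6, 6, 4, 4],
    "hp": [110, 105, 93, 97],
    "wt": [2.62, 3.46, 2.32, 2.465],
})
result = df.groupby("cyl")[["hp", "wt"]].agg(["mean", "std"])

# Each column label is a tuple: (original column, aggregation)
print(list(result.columns))

hp_stats = result["hp"]           # dataframe with columns "mean" and "std"
hp_mean = result[("hp", "mean")]  # a single series

# Flatten the two levels into single string labels
flat = result.copy()
flat.columns = ["_".join(pair) for pair in result.columns]
print(list(flat.columns))
```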


You can avoid the column multi-index by using keyword arguments to `agg()`. The name of each keyword argument again specifies the name of the aggregated series in the resulting dataframe, but the values of the keyword arguments now have to be tuples of the form `(name_of_column_to_aggregate, aggregation)`, where `aggregation` is, as always, a function or a function name:

In [54]:
df = cars()
df.groupby("cyl")[["hp", "wt"]].agg(
    hp_mean=("hp", "mean"),
    hp_std=("hp", "std"),
    wt_mean=("wt", "mean"),
    wt_std=("wt", "std"),
)

Unnamed: 0_level_0,hp_mean,hp_std,wt_mean,wt_std
cyl,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
4,82.636,20.935,2.286,0.57
6,122.286,24.26,3.117,0.356
8,209.214,50.977,3.999,0.759


Finally, when aggregating a dataframe there is yet another way of specifying arguments for `agg`: to do different aggregations for different columns you can pass a dict as the argument. The result will be a dataframe, and if any column is aggregated using more than one function, it will have a column multi-index:

In [55]:
df = cars()
df.groupby("cyl")[["hp", "wt"]].agg({
    "hp": ["mean", "std"],
    "wt": "mean"
})

Unnamed: 0_level_0,hp,hp,wt
Unnamed: 0_level_1,mean,std,mean
cyl,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
4,82.636,20.935,2.286
6,122.286,24.26,3.117
8,209.214,50.977,3.999
