# Data analysis with Pandas

This is a guide to data analysis with Pandas, written to help you understand the most important concepts and usage patterns in a way that sticks with you. I try to make things understandable and memorable by adhering to some principles:

- Identify and explore important general concepts:
    - the Pandas data model: series, dataframes, indexes and multi-indexes
    - wide classes of operations on data: selection, grouping, aggregations and transformations, sorting
    - shapes that datasets come in and how to convert between them: long vs wide
- Use well chosen examples:
    - preferably real datasets
    - understandable for a broad audience
    - interesting
    - distinct from one another in structure
- Stay concise: avoid minutiae that can easily be looked up in the documentation and provide pointers to references instead


Before we begin we need to import all the libraries we will use. We also change some options that I almost always find useful to change:

In [1]:
# Import libraries
import numpy as np
import pandas as pd

# Make Pandas less eager to hide data in dataframes with "..."
pd.options.display.max_rows = 999
pd.options.display.max_columns = 100
pd.options.display.max_colwidth = 200

# Sensible floating point precision for basic use cases
pd.options.display.precision = 3

# Short exception tracebacks
%xmode Plain

Exception reporting mode: Plain


## Pandas data model

We begin by exploring the basic Pandas data structures: series, indexes and dataframes. Our goal in this section is to show how things fit together in Pandas conceptually, rather than to be comprehensive about the specific features. We use many simple examples to build a solid foundation of understanding for analysing real datasets afterwards.

### Series and basic indexes

The basic Pandas data type is a series, which is an ordered collection of data points. For our first example, we look at the number of points scored in the 2022/2023 NBA season by individual players. The totals for the five players who scored the most are as follows:

In [2]:
points_by_pos = pd.Series([2225, 2183, 2138, 2135, 1959])
points_by_pos

0    2225
1    2183
2    2138
3    2135
4    1959
dtype: int64

The right column of the output shows the values in the series. `dtype: int64` refers to the data type of the values.

The left column of the output shows the labels corresponding to the values in the series. The labels belong to the *index* that is a part of every series. By default a series is indexed simply by the position of each value in the series. The index object in this case looks like this:

In [3]:
points_by_pos.index

RangeIndex(start=0, stop=5, step=1)

The series `loc[]` indexer takes an index label as its argument and looks up the corresponding value in the series. With the default index it works like basic list indexing in the simplest case, except that negative indexes are not supported, because `loc[]` looks up labels rather than positions:

In [4]:
points_by_pos.loc[2]

2138
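The difference between label lookup and positional lookup shows up with negative numbers. Here is a minimal sketch that reuses the values from above and compares `loc[]` with the positional indexer `iloc[]`, which is covered in the section on selecting data. The names `last_by_position` and `loc_raised` are made up for this illustration:

```python
import pandas as pd

points_by_pos = pd.Series([2225, 2183, 2138, 2135, 1959])

# iloc[] is positional, so -1 means "last element", as with a Python list
last_by_position = points_by_pos.iloc[-1]

# loc[] looks for the *label* -1, which the default index does not contain
try:
    points_by_pos.loc[-1]
    loc_raised = False
except KeyError:
    loc_raised = True

print(last_by_position, loc_raised)
```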

The index is what makes a series something more than a simple list. It makes it possible to refer to the values in the series by whatever label is appropriate: a string, a date, a pair of numbers, ... It also plays a role similar to a database index, hence the name: it speeds up operations like joins that have to look up values by label repeatedly. We can usefully index each number in this series with the name of the player who scored that many points:

In [5]:
points_by_player = pd.Series({
    "Jayson Tatum": 2225,
    "Joel Embiid": 2183,
    "Luka Dončić": 2138,
    "Shai Gilgeous-Alexander": 2135,
    "Giannis Antetokounmpo": 1959
})
points_by_player

Jayson Tatum               2225
Joel Embiid                2183
Luka Dončić                2138
Shai Gilgeous-Alexander    2135
Giannis Antetokounmpo      1959
dtype: int64

The index object itself now looks like this:

In [6]:
points_by_player.index

Index(['Jayson Tatum', 'Joel Embiid', 'Luka Dončić', 'Shai Gilgeous-Alexander',
       'Giannis Antetokounmpo'],
      dtype='object')

#### Selecting data

With the index in place, we can now look up the score for a player using the label:

In [7]:
points_by_player.loc["Jayson Tatum"]

2225

We can always look up scores by list-like position with `iloc[]`, regardless of what the series index is:

In [8]:
points_by_player.iloc[0]

2225

Both `loc[]` and `iloc[]` also accept a list as an argument, which selects a subset of the series. A one-element list selects a single data point as a single-element series rather than as a scalar:

In [9]:
points_by_player

Jayson Tatum               2225
Joel Embiid                2183
Luka Dončić                2138
Shai Gilgeous-Alexander    2135
Giannis Antetokounmpo      1959
dtype: int64

In [10]:
points_by_player.loc[["Jayson Tatum"]]

Jayson Tatum    2225
dtype: int64

In [11]:
points_by_player.loc[["Jayson Tatum", "Joel Embiid"]]

Jayson Tatum    2225
Joel Embiid     2183
dtype: int64

In [12]:
points_by_player.iloc[[0]]

Jayson Tatum    2225
dtype: int64

In [13]:
points_by_player.iloc[[0, 1]]

Jayson Tatum    2225
Joel Embiid     2183
dtype: int64

Both `loc[]` and `iloc[]` also accept a slice as an argument. With `iloc[]` this works much like slicing a Python list, with the right endpoint not included in the result:

In [14]:
points_by_player.iloc[1:4]

Joel Embiid                2183
Luka Dončić                2138
Shai Gilgeous-Alexander    2135
dtype: int64

With `loc[]`, in contrast to Python lists and to `iloc[]`, the right endpoint of the slice is included in the result:

In [15]:
points_by_player.loc["Joel Embiid":"Shai Gilgeous-Alexander"]

Joel Embiid                2183
Luka Dončić                2138
Shai Gilgeous-Alexander    2135
dtype: int64

Finally, you can also select elements using `[]`:

In [16]:
points_by_player["Joel Embiid"]

2183

However, `[]` can be quite confusing: whether it looks up elements by label or by position depends on the data type of the argument and on the data type of the labels:
  - `[]` with a single integer argument selects by label when the series labels are integers; otherwise it selects by position, although this fallback is deprecated and will be removed
  - `[]` with a slice with integer endpoints selects by position regardless of the data type of the labels

There are additional caveats when using `[]` with dataframes, which we cover later. This is why we suggest explicitly selecting by label with `loc[]` or by position with `iloc[]`, except for some very basic uses (such as selecting a column from a dataframe).
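The two rules for `[]` can be seen side by side on a series whose integer labels deliberately disagree with the positions. This is a small sketch with made-up values, not part of the NBA example:

```python
import pandas as pd

# Integer labels that are out of step with the positions 0, 1, 2
s = pd.Series([10, 20, 30], index=[2, 1, 0])

by_bracket = s[0]        # integer labels: [] looks up the label 0
by_label = s.loc[0]      # always label-based
by_position = s.iloc[0]  # always position-based
sliced = s[0:2]          # integer slice: positional, right endpoint excluded

print(by_bracket, by_label, by_position, sliced.tolist())
```

With `loc[]` and `iloc[]` the intent is visible in the code, which is the main argument for preferring them.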

#### Sorting by label and by value

The ability to select data in a series by label using a slice brings up an important point: the data points in a series have a definite order, but that order is arbitrary, and it is up to the user to sort the series in a way that is convenient for the task at hand. Consider a series that uses dates as labels. It is very tempting to expect `series.loc[pd.to_datetime("2023-01-01"):pd.to_datetime("2023-12-31")]` to return a) only data from 2023 and b) all the data from 2023 present in the series. The sneaky thing is that this is only guaranteed if the series is sorted by label in calendar order, either because the data was already in the right order in the source it was imported from or because it was explicitly sorted. We can check whether this is the case using attributes of the series index:

In [17]:
points_by_player.index.is_monotonic_decreasing # True if each label is <= the one before it

False

In [18]:
points_by_player.index.is_monotonic_increasing # True if each label is >= the one before it

False

If the series is not sorted by label and we use `loc[]` with a slice argument, Pandas first looks for the element whose label equals the left endpoint of the slice and then includes consecutive elements of the series in the result until it reaches the element whose label equals the right endpoint. In this case `series.loc[pd.to_datetime("2023-01-01"):pd.to_datetime("2023-12-31")]` can both include data points that are not from 2023 and leave out data points that are in fact from 2023. If no label equal to the left endpoint or no label equal to the right endpoint can be found in the series, Pandas raises a `KeyError`.

To avoid confusion of this kind, it is a good habit to put the series and dataframes that are inputs for data analysis in a definite order that suits how the data will be analysed: if you intend to slice a series by date, sort it by date. `sort_index()` without arguments puts the series labels in monotonic increasing order; with the `ascending=False` keyword argument it puts them in monotonic decreasing order:

In [19]:
points_by_player

Jayson Tatum               2225
Joel Embiid                2183
Luka Dončić                2138
Shai Gilgeous-Alexander    2135
Giannis Antetokounmpo      1959
dtype: int64

In [20]:
points_by_player_sorted = points_by_player.sort_index()
points_by_player_sorted

Giannis Antetokounmpo      1959
Jayson Tatum               2225
Joel Embiid                2183
Luka Dončić                2138
Shai Gilgeous-Alexander    2135
dtype: int64

In [21]:
points_by_player_sorted.index.is_monotonic_increasing

True

In [22]:
points_by_player_sorted = points_by_player.sort_index(ascending=False)
points_by_player_sorted

Shai Gilgeous-Alexander    2135
Luka Dončić                2138
Joel Embiid                2183
Jayson Tatum               2225
Giannis Antetokounmpo      1959
dtype: int64

In [23]:
points_by_player_sorted.index.is_monotonic_decreasing

True

Note that labels and corresponding values are inseparable during sorting. `sort_index()` compares labels but reorders whole pairs composed of label and value. `sort_values()`, the second main way to sort a series, compares values and reorders pairs of label and value:

In [24]:
points_by_player.sort_values()

Giannis Antetokounmpo      1959
Shai Gilgeous-Alexander    2135
Luka Dončić                2138
Joel Embiid                2183
Jayson Tatum               2225
dtype: int64
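To see why slicing an unsorted series by label can surprise, here is a small sketch. It uses made-up integer labels instead of dates to keep it short; the variable names are illustrative only:

```python
import pandas as pd

# Integer labels in a non-monotonic order
s = pd.Series(["a", "b", "c", "d"], index=[3, 1, 4, 2])

# Unsorted: everything positioned between label 1 and label 2 is returned,
# including label 4, which lies outside the requested range
unsorted_labels = s.loc[1:2].index.tolist()

# Sorted by label: only labels within [1, 2] are returned
sorted_labels = s.sort_index().loc[1:2].index.tolist()

# Unsorted with an endpoint (5) that is not a label: KeyError
try:
    s.loc[1:5]
    missing_endpoint_raised = False
except KeyError:
    missing_endpoint_raised = True

# Sorted: a missing endpoint is fine, the slice is resolved by comparison
sorted_missing = s.sort_index().loc[1:5].index.tolist()

print(unsorted_labels, sorted_labels, missing_endpoint_raised, sorted_missing)
```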

#### Calculating with series

Moving on, series have many methods for doing basic calculations:

In [25]:
(points_by_player.min(), points_by_player.max())

(1959, 2225)

In [26]:
points_by_player.mean()

2128.0

In [27]:
points_by_player.sum()

10640

We can add, subtract, multiply or divide a series by a constant (or a constant by a series). The arithmetic operation is applied to each element of the series, each time with the same constant as the other operand, producing a series of the same length. For example, we can express each number of points as a fraction of the points scored by the top scorer of the season. To do so we divide the series by a constant (the top score given by `points_by_player.iloc[0]`):

In [28]:
points_by_player / points_by_player.iloc[0]

Jayson Tatum               1.000
Joel Embiid                0.981
Luka Dončić                0.961
Shai Gilgeous-Alexander    0.960
Giannis Antetokounmpo      0.880
dtype: float64

We can also compute what fraction of the five players' combined total each player's points constitute:

In [29]:
points_by_player / points_by_player.sum()

Jayson Tatum               0.209
Joel Embiid                0.205
Luka Dončić                0.201
Shai Gilgeous-Alexander    0.201
Giannis Antetokounmpo      0.184
dtype: float64

Series can be compared to a constant using standard comparison operators:

In [30]:
points_by_player >= 2000

Jayson Tatum                True
Joel Embiid                 True
Luka Dončić                 True
Shai Gilgeous-Alexander     True
Giannis Antetokounmpo      False
dtype: bool
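A boolean series like this one is useful in two ways: it can be summed to count the matches, and it can be passed to `loc[]` to keep only the matching elements. A minimal sketch using the same data (the names `mask`, `num_above` and `above` are chosen for illustration):

```python
import pandas as pd

points_by_player = pd.Series({
    "Jayson Tatum": 2225,
    "Joel Embiid": 2183,
    "Luka Dončić": 2138,
    "Shai Gilgeous-Alexander": 2135,
    "Giannis Antetokounmpo": 1959,
})

mask = points_by_player >= 2000

# True counts as 1 and False as 0, so the sum is the number of matches
num_above = int(mask.sum())

# loc[] with a boolean series keeps the elements where the mask is True
above = points_by_player.loc[mask]

print(num_above)
print(above)
```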

Let's look at another series so that we can discuss things like adding two series. Here is how many points the same five players scored one season earlier:

In [31]:
points_by_player_prev_season = pd.Series({
    "Giannis Antetokounmpo": 2002,
    "Jayson Tatum": 2046,
    "Shai Gilgeous-Alexander": 1371,
    "Luka Dončić": 1847,
    "Joel Embiid": 2079
})
points_by_player_prev_season

Giannis Antetokounmpo      2002
Jayson Tatum               2046
Shai Gilgeous-Alexander    1371
Luka Dončić                1847
Joel Embiid                2079
dtype: int64

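We can now sum the points from both seasons. The two series list the players in different orders, but the addition matches values by label rather than by position: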

In [32]:
points_by_player + points_by_player_prev_season

Giannis Antetokounmpo      3961
Jayson Tatum               4271
Joel Embiid                4262
Luka Dončić                3985
Shai Gilgeous-Alexander    3506
dtype: int64
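Label alignment also determines what happens when the two series do not share all their labels. The following sketch uses two tiny made-up series; `Series.add()` with `fill_value` is one way to treat a missing label as zero:

```python
import pandas as pd

a = pd.Series({"x": 1, "y": 2})
b = pd.Series({"y": 10, "z": 20})

# Labels present in only one operand produce NaN in the result
total = a + b

# add() with fill_value substitutes 0 for the missing side
total_filled = a.add(b, fill_value=0)

print(total)
print(total_filled)
```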

### Dataframes

For the purposes of understanding the Pandas API and not its implementation, a dataframe is a two-dimensional table of values with labels attached to both rows and columns, like a series of column series. All the column series share the same row index, or to put it differently, all the columns share the same row labels. Another index, the column index, assigns labels to the columns themselves. The dataframe API generalizes the series API, with additional complexity introduced by the second dimension and by the fact that different columns can hold different types of values. Let's see what this looks like. We start by putting the example series from the previous section in a dataframe:

In [33]:
points_by_player_by_season = pd.DataFrame({
    "2021/2022": points_by_player_prev_season,
    "2022/2023": points_by_player
})
points_by_player_by_season

|  | 2021/2022 | 2022/2023 |
|---|---|---|
| Giannis Antetokounmpo | 2002 | 1959 |
| Jayson Tatum | 2046 | 2225 |
| Joel Embiid | 2079 | 2183 |
| Luka Dončić | 1847 | 2138 |
| Shai Gilgeous-Alexander | 1371 | 2135 |


You can access the column series with `[]` in simple cases, though as with series `loc[]` is often preferred:

In [34]:
points_by_player_by_season["2021/2022"]

Giannis Antetokounmpo      2002
Jayson Tatum               2046
Joel Embiid                2079
Luka Dončić                1847
Shai Gilgeous-Alexander    1371
Name: 2021/2022, dtype: int64

The `index` attribute of a dataframe holds the row index, which is shared by all the column series:

In [35]:
points_by_player_by_season.index

Index(['Giannis Antetokounmpo', 'Jayson Tatum', 'Joel Embiid', 'Luka Dončić',
       'Shai Gilgeous-Alexander'],
      dtype='object')

In [36]:
points_by_player_by_season["2021/2022"].index is points_by_player_by_season.index

True

In [37]:
points_by_player_by_season["2022/2023"].index is points_by_player_by_season.index

True

The index labelling the columns looks like this:

In [38]:
points_by_player_by_season.columns

Index(['2021/2022', '2022/2023'], dtype='object')

#### Selecting data

## Cars dataset

We will use the mtcars dataset as the default basic example in the following sections. Its contents are interesting and easy to understand, and it comes in a common simple form: each row describes a single unique unit of observation (a car) and each column is a different measurement for the same unit of observation (miles/gallon, number of cylinders, horsepower etc.):

In [39]:
CARS = pd.read_csv("mtcars.csv")

def cars():
    # Return a fresh copy every time so that examples do not affect each other
    return CARS.copy()

## Exploring datasets

`.head(n)` returns the first `n` rows, `.tail(n)` the last `n` rows, and `.sample(n)` `n` randomly chosen rows:

In [40]:
df = cars()
display(df.head(2))
display(df.tail(2))
display(df.sample(2))

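|  | model | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.9 | 2.62 | 16.46 | 0 | 1 | 4 | 4 |
| 1 | Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.9 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |

|  | model | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 30 | Maserati Bora | 15.0 | 8 | 301.0 | 335 | 3.54 | 3.57 | 14.6 | 0 | 1 | 5 | 8 |
| 31 | Volvo 142E | 21.4 | 4 | 121.0 | 109 | 4.11 | 2.78 | 18.6 | 1 | 1 | 4 | 2 |

|  | model | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | Valiant | 18.1 | 6 | 225.0 | 105 | 2.76 | 3.46 | 20.22 | 1 | 0 | 3 | 1 |
| 7 | Merc 240D | 24.4 | 4 | 146.7 | 62 | 3.69 | 3.19 | 20.0 | 1 | 0 | 4 | 2 |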


`info()` prints a summary description of a dataframe that contains a lot of useful information: the number of entries in the index (which is equal to the total number of rows), a list of columns along with the data type and the number of non-null values of each column, and the overall memory usage of the dataframe. The downside is that all this information is simply printed and nothing is returned:

In [41]:
df = cars()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 12 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   model   32 non-null     object 
 1   mpg     32 non-null     float64
 2   cyl     32 non-null     int64  
 3   disp    32 non-null     float64
 4   hp      32 non-null     int64  
 5   drat    32 non-null     float64
 6   wt      32 non-null     float64
 7   qsec    32 non-null     float64
 8   vs      32 non-null     int64  
 9   am      32 non-null     int64  
 10  gear    32 non-null     int64  
 11  carb    32 non-null     int64  
dtypes: float64(5), int64(6), object(1)
memory usage: 3.1+ KB
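If the summary is needed as a string, for example for logging, one option is the `buf` parameter of `info()`, which redirects the output to any writable text buffer. A minimal sketch on a two-row stand-in dataframe (the values are copied from the mtcars rows shown above; `buffer`, `returned` and `info_text` are illustrative names):

```python
import io

import pandas as pd

df = pd.DataFrame({"model": ["Mazda RX4", "Datsun 710"], "mpg": [21.0, 22.8]})

buffer = io.StringIO()
returned = df.info(buf=buffer)  # writes into the buffer instead of stdout
info_text = buffer.getvalue()

print(returned)
print(info_text)
```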


`len(df)` returns the number of rows in a dataframe:

In [42]:
df = cars()
len(df)

32

`count()` returns the number of rows with a non-null value. Since this can differ for each column series, when called on a dataframe it returns a series with a single value for each column label:

In [43]:
df = cars()
df.count()

model    32
mpg      32
cyl      32
disp     32
hp       32
drat     32
wt       32
qsec     32
vs       32
am       32
gear     32
carb     32
dtype: int64
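The mtcars dataset has no missing values, so `count()` and `len()` agree for every column. The difference becomes visible on a small made-up dataframe with one missing value:

```python
import pandas as pd

# Column "a" has one missing value, column "b" has none
df = pd.DataFrame({"a": [1.0, None, 3.0], "b": ["x", "y", "z"]})

num_rows = len(df)                  # counts all rows
non_null = df.count().to_dict()     # counts non-null values per column

print(num_rows, non_null)
```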

`df.dtypes` returns a series indexed by column name that contains the dtype of each column of `df`:

In [44]:
df = cars()
df.dtypes

model     object
mpg      float64
cyl        int64
disp     float64
hp         int64
drat     float64
wt       float64
qsec     float64
vs         int64
am         int64
gear       int64
carb       int64
dtype: object

`describe()` returns a dataframe that contains basic summary statistics and the number of non-null rows for each numeric column:

In [45]:
df = cars()
df.describe()

|  | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 32.0 | 32.0 | 32.0 | 32.0 | 32.0 | 32.0 | 32.0 | 32.0 | 32.0 | 32.0 | 32.0 |
| mean | 20.091 | 6.188 | 230.722 | 146.688 | 3.597 | 3.217 | 17.849 | 0.438 | 0.406 | 3.688 | 2.812 |
| std | 6.027 | 1.786 | 123.939 | 68.563 | 0.535 | 0.978 | 1.787 | 0.504 | 0.499 | 0.738 | 1.615 |
| min | 10.4 | 4.0 | 71.1 | 52.0 | 2.76 | 1.513 | 14.5 | 0.0 | 0.0 | 3.0 | 1.0 |
| 25% | 15.425 | 4.0 | 120.825 | 96.5 | 3.08 | 2.581 | 16.892 | 0.0 | 0.0 | 3.0 | 2.0 |
| 50% | 19.2 | 6.0 | 196.3 | 123.0 | 3.695 | 3.325 | 17.71 | 0.0 | 0.0 | 4.0 | 2.0 |
| 75% | 22.8 | 8.0 | 326.0 | 180.0 | 3.92 | 3.61 | 18.9 | 1.0 | 1.0 | 4.0 | 4.0 |
| max | 33.9 | 8.0 | 472.0 | 335.0 | 4.93 | 5.424 | 22.9 | 1.0 | 1.0 | 5.0 | 8.0 |


`value_counts()` on the given column series returns a series describing how many rows take each unique value:

In [46]:
df = cars()
df["cyl"].value_counts()

cyl
8    14
4    11
6     7
Name: count, dtype: int64

`value_counts(normalize=True)` returns a series describing what fraction of all rows have each unique value, rather than the raw counts:

In [47]:
df = cars()
df["cyl"].value_counts(normalize=True)

cyl
8    0.438
4    0.344
6    0.219
Name: proportion, dtype: float64

`unique()` returns only the unique values themselves:

In [48]:
df["cyl"].unique()

array([6, 4, 8])

`nunique()` returns the number of unique values:

In [49]:
df["cyl"].nunique()

3

`nlargest(n)` returns `n` largest values and `nsmallest(n)` `n` smallest (`n` defaults to 5):

In [50]:
df["hp"].nlargest()

30    335
28    264
6     245
23    245
16    230
Name: hp, dtype: int64

In [51]:
df["hp"].nsmallest()

18    52
7     62
19    65
17    66
25    66
Name: hp, dtype: int64

To see the cars with the most horsepower, use the fact that the index of the series returned by `nlargest` is a subset of the dataframe index and hence can be used to subset the dataframe:

In [52]:
df.loc[df["hp"].nlargest().index]

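|  | model | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 30 | Maserati Bora | 15.0 | 8 | 301.0 | 335 | 3.54 | 3.57 | 14.6 | 0 | 1 | 5 | 8 |
| 28 | Ford Pantera L | 15.8 | 8 | 351.0 | 264 | 4.22 | 3.17 | 14.5 | 0 | 1 | 5 | 4 |
| 6 | Duster 360 | 14.3 | 8 | 360.0 | 245 | 3.21 | 3.57 | 15.84 | 0 | 0 | 3 | 4 |
| 23 | Camaro Z28 | 13.3 | 8 | 350.0 | 245 | 3.73 | 3.84 | 15.41 | 0 | 0 | 3 | 4 |
| 16 | Chrysler Imperial | 14.7 | 8 | 440.0 | 230 | 3.23 | 5.345 | 17.42 | 0 | 0 | 3 | 4 |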
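Dataframes also have their own `nlargest(n, columns)` method, which returns the whole rows directly. The sketch below checks that both routes give the same result on a few rows copied from the outputs above, standing in for the full `mtcars.csv`:

```python
import pandas as pd

# A handful of mtcars rows, used here instead of reading mtcars.csv
df = pd.DataFrame({
    "model": ["Mazda RX4", "Duster 360", "Ford Pantera L", "Maserati Bora"],
    "hp": [110, 245, 264, 335],
})

# Route 1: take the index of the largest values and subset the dataframe
top_via_index = df.loc[df["hp"].nlargest(2).index]

# Route 2: ask the dataframe for the rows with the largest "hp" directly
top_direct = df.nlargest(2, "hp")

print(top_direct)
```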


## Selecting rows and columns

### Selecting with []

Select rows with `[]`:

In [53]:
df = cars()
df[df["gear"] == 5]

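|  | model | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 26 | Porsche 914-2 | 26.0 | 4 | 120.3 | 91 | 4.43 | 2.14 | 16.7 | 0 | 1 | 5 | 2 |
| 27 | Lotus Europa | 30.4 | 4 | 95.1 | 113 | 3.77 | 1.513 | 16.9 | 1 | 1 | 5 | 2 |
| 28 | Ford Pantera L | 15.8 | 8 | 351.0 | 264 | 4.22 | 3.17 | 14.5 | 0 | 1 | 5 | 4 |
| 29 | Ferrari Dino | 19.7 | 6 | 145.0 | 175 | 3.62 | 2.77 | 15.5 | 0 | 1 | 5 | 6 |
| 30 | Maserati Bora | 15.0 | 8 | 301.0 | 335 | 3.54 | 3.57 | 14.6 | 0 | 1 | 5 | 8 |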


Select a single column as a `pd.Series` with `[]`:

In [54]:
df = cars()
df["gear"].head(5)

0    4
1    4
2    4
3    3
4    3
Name: gear, dtype: int64

Select one or more columns as a `pd.DataFrame` by passing a list to `[]` - note how a dataframe with one column is of different type than a series and is displayed in a different way:

In [55]:
df = cars()
df[["gear"]].head(5)

|  | gear |
|---|---|
| 0 | 4 |
| 1 | 4 |
| 2 | 4 |
| 3 | 3 |
| 4 | 3 |


Selection of rows and of columns can be combined:

In [56]:
df = cars()
df[df["gear"] == 5]["cyl"]

26    4
27    4
28    8
29    6
30    8
Name: cyl, dtype: int64

This form of row and column selection does not work for the purpose of modifying or inserting data:

In [57]:
df = cars()
df[df["gear"] == 5]["cyl"] = 3

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[df["gear"] == 5]["cyl"] = 3


Let's break down how the `df[df["gear"] == 5]["cyl"] = 3` statement translates to method calls on the underlying objects: `df[df["gear"] == 5]` translates to a `df.__getitem__(df["gear"] == 5)` call on the dataframe, and the `["cyl"] = 3` part then translates to a `.__setitem__("cyl", 3)` call on the resulting object. The problem is that the `__getitem__` call might return either a view or a copy of the dataframe, so the original dataframe might or might not be modified.

Instead, `df.loc[]` can be used to select rows and columns at the same time. Reading from `df.loc[]` returns a view or a copy just like `df[]`, but an assignment through `df.loc[]=` is a single `__setitem__` method call on the object stored in the `loc` attribute of the original dataframe. It is free of the ambiguity of `[][]=`, so it always modifies the dataframe itself.
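The difference can be checked on a tiny made-up dataframe. In this sketch the chained assignment is wrapped in `warnings.catch_warnings()` only to keep the output quiet; with a boolean mask the intermediate object is a copy, so the original stays untouched:

```python
import warnings

import pandas as pd

df = pd.DataFrame({"gear": [4, 5, 5], "cyl": [6, 4, 8]})

# Chained [][]=: the assignment lands in a temporary copy and is lost
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df[df["gear"] == 5]["cyl"] = 3
after_chained = df["cyl"].tolist()

# Single loc[]= call: the original dataframe is modified
df.loc[df["gear"] == 5, "cyl"] = 3
after_loc = df["cyl"].tolist()

print(after_chained, after_loc)
```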

### Selecting with loc[]

Select rows:

In [58]:
df = cars()
df.loc[df["gear"] == 5]

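|  | model | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 26 | Porsche 914-2 | 26.0 | 4 | 120.3 | 91 | 4.43 | 2.14 | 16.7 | 0 | 1 | 5 | 2 |
| 27 | Lotus Europa | 30.4 | 4 | 95.1 | 113 | 3.77 | 1.513 | 16.9 | 1 | 1 | 5 | 2 |
| 28 | Ford Pantera L | 15.8 | 8 | 351.0 | 264 | 4.22 | 3.17 | 14.5 | 0 | 1 | 5 | 4 |
| 29 | Ferrari Dino | 19.7 | 6 | 145.0 | 175 | 3.62 | 2.77 | 15.5 | 0 | 1 | 5 | 6 |
| 30 | Maserati Bora | 15.0 | 8 | 301.0 | 335 | 3.54 | 3.57 | 14.6 | 0 | 1 | 5 | 8 |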


Select a single column as a `pd.Series`:

In [59]:
df = cars()
df.loc[:, "gear"].head(5)

0    4
1    4
2    4
3    3
4    3
Name: gear, dtype: int64

Select one or more columns as a `pd.DataFrame`:

In [60]:
df = cars()
df.loc[:, ["gear"]].head(5)

|  | gear |
|---|---|
| 0 | 4 |
| 1 | 4 |
| 2 | 4 |
| 3 | 3 |
| 4 | 3 |


Modify a subpart of a dataframe:

In [61]:
df = cars()
df.loc[df["cyl"] == 6, "hp"] = 200
df.loc[df["cyl"] == 6, :]

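|  | model | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Mazda RX4 | 21.0 | 6 | 160.0 | 200 | 3.9 | 2.62 | 16.46 | 0 | 1 | 4 | 4 |
| 1 | Mazda RX4 Wag | 21.0 | 6 | 160.0 | 200 | 3.9 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| 3 | Hornet 4 Drive | 21.4 | 6 | 258.0 | 200 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| 5 | Valiant | 18.1 | 6 | 225.0 | 200 | 2.76 | 3.46 | 20.22 | 1 | 0 | 3 | 1 |
| 9 | Merc 280 | 19.2 | 6 | 167.6 | 200 | 3.92 | 3.44 | 18.3 | 1 | 0 | 4 | 4 |
| 10 | Merc 280C | 17.8 | 6 | 167.6 | 200 | 3.92 | 3.44 | 18.9 | 1 | 0 | 4 | 4 |
| 29 | Ferrari Dino | 19.7 | 6 | 145.0 | 200 | 3.62 | 2.77 | 15.5 | 0 | 1 | 5 | 6 |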


### Boolean expressions in [] and loc[]

When selecting rows with `df[df["gear"] == 5]`, `df["gear"] == 5` is a `pd.Series` wrapping a boolean vector:

In [62]:
df = cars()
(df["gear"] == 5).head(5)

0    False
1    False
2    False
3    False
4    False
Name: gear, dtype: bool

Boolean operators like `&`, `|` and `~` (negation) can be used on those boolean vectors to represent compound filtering conditions. Individual conditions have to be enclosed in parentheses since `&` and `|` have higher precedence in Python than comparison operators like `==`, `>=`, etc.:

In [63]:
df = cars()
df[(df["cyl"] >= 4) & (df["cyl"] <= 6) & (df["gear"] == 3)]

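|  | model | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| 5 | Valiant | 18.1 | 6 | 225.0 | 105 | 2.76 | 3.46 | 20.22 | 1 | 0 | 3 | 1 |
| 20 | Toyota Corona | 21.5 | 4 | 120.1 | 97 | 3.7 | 2.465 | 20.01 | 1 | 0 | 3 | 1 |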
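To illustrate why the parentheses matter, the sketch below runs the same kind of filter on a four-row made-up dataframe with and without them. Without parentheses Python groups the expression as `df["cyl"] >= (4 & df["gear"]) == 3`, a chained comparison that tries to convert a whole series to a single boolean and fails. `Series.between()`, which is inclusive on both ends by default, is shown as one way to write the range condition with fewer parentheses:

```python
import pandas as pd

df = pd.DataFrame({"cyl": [6, 6, 4, 8], "gear": [4, 3, 3, 3]})

# Each condition in its own parentheses
with_parens = df[(df["cyl"] >= 4) & (df["cyl"] <= 6) & (df["gear"] == 3)]

# The same range condition expressed with between()
with_between = df[df["cyl"].between(4, 6) & (df["gear"] == 3)]

# Missing parentheses: the expression is grouped differently and raises
try:
    df[df["cyl"] >= 4 & df["gear"] == 3]
    unparenthesised_raised = False
except ValueError:
    unparenthesised_raised = True

print(with_parens.index.tolist(), unparenthesised_raised)
```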


Use the `isin()` series method to test whether each value belongs to a given set of values:

In [64]:
df = cars()
df[df["cyl"].isin([4, 6]) & (df["gear"] == 3)]

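|  | model | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| 5 | Valiant | 18.1 | 6 | 225.0 | 105 | 2.76 | 3.46 | 20.22 | 1 | 0 | 3 | 1 |
| 20 | Toyota Corona | 21.5 | 4 | 120.1 | 97 | 3.7 | 2.465 | 20.01 | 1 | 0 | 3 | 1 |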


### Selecting with iloc[]

`iloc[]` selects rows, and for a dataframe optionally columns, by list-like position: a non-negative offset counts from the start and a negative offset counts from the end:

In [65]:
df = cars()
df.iloc[2]

model    Datsun 710
mpg            22.8
cyl               4
disp          108.0
hp               93
drat           3.85
wt             2.32
qsec          18.61
vs                1
am                1
gear              4
carb              1
Name: 2, dtype: object

In [66]:
df = cars()
df.iloc[-2]

model    Maserati Bora
mpg               15.0
cyl                  8
disp             301.0
hp                 335
drat              3.54
wt                3.57
qsec              14.6
vs                   0
am                   1
gear                 5
carb                 8
Name: 30, dtype: object

In [67]:
df = cars()
df.iloc[0:3]

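|  | model | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.9 | 2.62 | 16.46 | 0 | 1 | 4 | 4 |
| 1 | Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.9 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| 2 | Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.32 | 18.61 | 1 | 1 | 4 | 1 |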
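The examples above select rows only. Positions for columns go in a second argument, after a comma. A minimal sketch on the first three mtcars rows shown above, reduced to three columns for brevity:

```python
import pandas as pd

df = pd.DataFrame({
    "model": ["Mazda RX4", "Mazda RX4 Wag", "Datsun 710"],
    "mpg": [21.0, 21.0, 22.8],
    "cyl": [6, 6, 4],
})

# Single cell: last row, first column
cell = df.iloc[-1, 0]

# Block: first two rows, columns at positions 0 and 2
block = df.iloc[0:2, [0, 2]]

print(cell)
print(block)
```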


## Groups and aggregations

### Basics of groups

To form groups you call `.groupby` on a series or on a dataframe, supplying the group key in one of several supported ways.

In the simplest case you supply the name of the column to use as the group key:

In [68]:
df.groupby("cyl")

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f264009d5b0>

This `DataFrameGroupBy` object is the starting point from which various groupwise operations can be done:

In [69]:
df.groupby("cyl").mean(numeric_only=True)

| cyl | mpg | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 26.664 | 105.136 | 82.636 | 4.071 | 2.286 | 19.137 | 0.909 | 0.727 | 4.091 | 1.545 |
| 6 | 19.743 | 183.314 | 122.286 | 3.586 | 3.117 | 17.977 | 0.571 | 0.429 | 3.857 | 3.429 |
| 8 | 15.1 | 353.1 | 209.214 | 3.229 | 3.999 | 16.772 | 0.0 | 0.143 | 3.286 | 3.5 |


You can also select a single series from the dataframe groupby object which results in a `SeriesGroupBy` object:

In [70]:
df.groupby("cyl")["mpg"]

<pandas.core.groupby.generic.SeriesGroupBy object at 0x7f263ef39e20>

Aggregations then produce a single value per group and result in a `pd.Series`:

In [71]:
df.groupby("cyl")["mpg"].mean()

cyl
4    26.664
6    19.743
8    15.100
Name: mpg, dtype: float64

### Aggregating using predefined aggregations

The `SeriesGroupBy` and `DataFrameGroupBy` objects both support calls like `min()`, `max()`, `mean()`, `std()`, `var()`, `quantile()` etc. that are groupwise versions of the respective series/dataframe operation:

In [72]:
df = cars()
df.groupby(["cyl", "gear"])["hp"].mean()

cyl  gear
4    3        97.000
     4        76.000
     5       102.000
6    3       107.500
     4       116.500
     5       175.000
8    3       194.167
     5       299.500
Name: hp, dtype: float64

To count the number of rows in each group call `.size()` on the groupby object:

In [73]:
df = cars()
df.groupby(["cyl", "gear"]).size()

cyl  gear
4    3        1
     4        8
     5        2
6    3        2
     4        4
     5        1
8    3       12
     5        2
dtype: int64

Confusingly, to get the number of rows in the whole dataframe you have to call `len(df)` or `len(df.index)` rather than `df.size()`. On a dataframe, `.size` is an attribute, not a method, and it holds the number of cells in the dataframe, not the number of rows. For example, to see what fraction of all cars have which setup in terms of number of cylinders and number of gears, you call `.size()` on the groupby object returned by `groupby(["cyl", "gear"])` but divide by `len(df)`:

In [74]:
df = cars()
df.groupby(["cyl", "gear"]).size() / len(df)

cyl  gear
4    3       0.031
     4       0.250
     5       0.062
6    3       0.062
     4       0.125
     5       0.031
8    3       0.375
     5       0.062
dtype: float64
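The different meanings of "size" are easy to mix up, so here they are side by side on a made-up dataframe with three rows and two columns:

```python
import pandas as pd

df = pd.DataFrame({"cyl": [4, 6, 6], "hp": [93, 110, 105]})

num_rows = len(df)            # number of rows
num_cells = df.size           # attribute: rows * columns
shape = df.shape              # (rows, columns)
rows_per_group = df.groupby("cyl").size().to_dict()  # method: rows in each group

print(num_rows, num_cells, shape, rows_per_group)
```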

To count the groups themselves use the `.ngroups` attribute:

In [75]:
df = cars()
df.groupby(["cyl", "gear"]).ngroups

8

To compute a compound expression involving group-level aggregates, for example the range of values within each group (group max - group min), store a reference to the groupby object and reuse it:

In [76]:
df = cars()
df_groupby = df.groupby(["cyl", "gear"])["hp"]
df_groupby.max() - df_groupby.min()

cyl  gear
4    3        0
     4       57
     5       22
6    3        5
     4       13
     5        0
8    3       95
     5       71
Name: hp, dtype: int64

### Aggregating using generic agg()

`agg(func)` calls `func(series)` once for each `series` of every group. `func` can be a function name, a function or a lambda expression, and it should return a scalar value. The result of `agg` is a series or a dataframe, depending on whether the aggregation is done on a single series or on a dataframe and on whether one aggregation is done or many. There are also multiple ways of providing the arguments that specify the aggregations to do. Hence there are many cases, which we now try to outline.

#### Aggregating a single series

The simplest aggregation is done on a single series and results in a series:

In [77]:
df = cars()
df.groupby("cyl")["hp"].agg("mean")

cyl
4     82.636
6    122.286
8    209.214
Name: hp, dtype: float64
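Besides a function name like `"mean"`, `agg` accepts any callable that reduces a series to a scalar. As a sketch, the within-group range (max - min) can be written as a lambda. The dataframe below holds a few `cyl`/`hp` pairs copied from the mtcars rows shown earlier, standing in for the full dataset:

```python
import pandas as pd

df = pd.DataFrame({
    "cyl": [6, 6, 6, 4, 4, 8, 8],
    "hp": [110, 110, 105, 93, 97, 245, 335],
})

grouped = df.groupby("cyl")["hp"]

# The lambda receives the "hp" series of one group at a time
hp_range = grouped.agg(lambda s: s.max() - s.min())

# Same result from two predefined aggregations
hp_range_builtin = grouped.max() - grouped.min()

print(hp_range)
```

Predefined aggregations are usually faster than a Python callable, so the lambda form is mostly worth it when no built-in combination expresses the computation.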

When you pass a list as the argument to `agg` the result will be a dataframe, since then in general there can be more than one group and more than one aggregation - the groups become rows and the aggregations become columns of the result:

In [78]:
df = cars()
df.groupby("cyl")["hp"].agg(["size", "mean", "std"])

| cyl | size | mean | std |
|---|---|---|---|
| 4 | 11 | 82.636 | 20.935 |
| 6 | 7 | 122.286 | 24.26 |
| 8 | 14 | 209.214 | 50.977 |


This might be useful also with only one aggregation in the list, just to force the aggregation result to be a dataframe:

In [79]:
df = cars()
df.groupby("cyl")["hp"].agg(["mean"])

| cyl | mean |
|---|---|
| 4 | 82.636 |
| 6 | 122.286 |
| 8 | 209.214 |


`agg()` can also be called with keyword arguments, in which case the name of the argument specifies the name of the column for the aggregated data in the resulting dataframe. The value of each keyword argument should again be a function or a function name to perform the aggregation:

In [80]:
df = cars()
df.groupby("cyl")["hp"].agg(count="size", average="mean", stddev="std")

| cyl | count | average | stddev |
|---|---|---|---|
| 4 | 11 | 82.636 | 20.935 |
| 6 | 7 | 122.286 | 24.26 |
| 8 | 14 | 209.214 | 50.977 |


#### Aggregating a dataframe

The next case is the aggregation of multiple series. When a single aggregation is applied, the result is a simple dataframe whose column names are the same as the columns that were aggregated:

In [81]:
df = cars()
df.groupby("cyl")[["hp", "wt"]].agg("mean")

| cyl | hp | wt |
|---|---|---|
| 4 | 82.636 | 2.286 |
| 6 | 122.286 | 3.117 |
| 8 | 209.214 | 3.999 |


When multiple aggregations are applied to a dataframe, the result is a dataframe with a column multi-index:

In [82]:
df = cars()
df.groupby("cyl")[["hp", "wt"]].agg(["mean", "std"])

|  | hp | hp | wt | wt |
|---|---|---|---|---|
| cyl | mean | std | mean | std |
| 4 | 82.636 | 20.935 | 2.286 | 0.57 |
| 6 | 122.286 | 24.26 | 3.117 | 0.356 |
| 8 | 209.214 | 50.977 | 3.999 | 0.759 |


You can avoid the column multi-index by using keyword arguments to `agg()`. The name of each keyword argument again specifies the name of the aggregated series in the resulting dataframe, but the values of the keyword arguments now have to be tuples of the form `(name_of_column_to_aggregate, aggregation)`, where `aggregation` is as always a function or a function name:

In [83]:
df = cars()
df.groupby("cyl")[["hp", "wt"]].agg(
    hp_mean=("hp", "mean"),
    hp_std=("hp", "std"),
    wt_mean=("wt", "mean"),
    wt_std=("wt", "std"),
)

| cyl | hp_mean | hp_std | wt_mean | wt_std |
|---|---|---|---|---|
| 4 | 82.636 | 20.935 | 2.286 | 0.57 |
| 6 | 122.286 | 24.26 | 3.117 | 0.356 |
| 8 | 209.214 | 50.977 | 3.999 | 0.759 |


Finally, when aggregating a dataframe there is yet another way of specifying arguments for `agg`: to do different aggregations for different columns you can pass a dict as an argument. The result will be a dataframe, and if any column is aggregated using more than one function, it will have a column multi-index:

In [84]:
df = cars()
df.groupby("cyl")[["hp", "wt"]].agg({
    "hp": ["mean", "std"],
    "wt": "mean"
})

|  | hp | hp | wt |
|---|---|---|---|
| cyl | mean | std | mean |
| 4 | 82.636 | 20.935 | 2.286 |
| 6 | 122.286 | 24.26 | 3.117 |
| 8 | 209.214 | 50.977 | 3.999 |
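If the dict form is convenient but the column multi-index is not, one common workaround is to flatten the column labels afterwards by joining the two levels. This is a sketch of that idea, not a dedicated Pandas feature, and the `_` separator is an arbitrary choice. The data is again a few rows copied from the mtcars outputs above:

```python
import pandas as pd

df = pd.DataFrame({
    "cyl": [6, 6, 4, 4, 8, 8],
    "hp": [110, 105, 93, 97, 245, 335],
    "wt": [2.62, 3.46, 2.32, 2.465, 3.57, 3.57],
})

result = df.groupby("cyl")[["hp", "wt"]].agg({
    "hp": ["mean", "std"],
    "wt": "mean",
})
has_multiindex = isinstance(result.columns, pd.MultiIndex)

# Each column label is a tuple like ("hp", "mean"); join it into "hp_mean"
flat = result.copy()
flat.columns = ["_".join(label) for label in result.columns]

print(flat)
```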
