# Data analysis with Pandas

This is a guide to data analysis with Pandas written with a particular goal: to help you understand the most important concepts and usage patterns in a way that sticks, as opposed to covering every possible application. We try to achieve this by laying out and following some specific principles:

- Build deep understanding of general concepts:
    - the basic Pandas data structures: series, dataframes and indexes
    - wide classes of operations on data: selection, grouping, aggregations and transformations
    - shapes that datasets come in and how to convert between them: long vs wide
- Use well chosen examples:
    - preferably real datasets
    - understandable for a broad audience
    - interesting
    - distinct from one another in structure
- Stay concise: avoid minutiae that can easily be looked up in the documentation and provide pointers to references instead


Before we begin, we need to import all the libraries we will use. We also change some display options that are almost always worth changing:

In [1]:
# Import libraries
import numpy as np
import pandas as pd

# Make Pandas less eager to hide data in dataframes with "..."
pd.options.display.max_rows = 999
pd.options.display.max_columns = 100
pd.options.display.max_colwidth = 200

# Sensible floating point precision for basic use cases
pd.options.display.precision = 3

# Short exception tracebacks
%xmode Plain

Exception reporting mode: Plain


## Pandas data structures

We begin by looking into the fundamental high-level Pandas data structures: series, data frames and indexes. If you try to use Pandas without understanding the data structures and the relations between them, it gets confusing very easily. On the other hand, learning the data structures can feel dry compared to working with real datasets. We cover only the most important basics here to keep things interesting and revisit them in more depth later, when the need arises.

### Series and basic indexes

A series holds an array of values:

In [2]:
primes = pd.Series([2, 3, 5, 7, 11])
primes

0     2
1     3
2     5
3     7
4    11
dtype: int64

The right column of the output simply shows the values in the series. `dtype: int64` refers to the data type of the values.

The left column of the output shows labels corresponding to the values in the series. The labels are part of the *index* that is a part of every series. By default the series is indexed simply using position of each value in the series:

In [3]:
primes = pd.Series([2, 3, 5, 7, 11])
primes.index

RangeIndex(start=0, stop=5, step=1)

A `.loc[]` lookup on a series takes an index label as its argument and returns the corresponding value. With the default index it ends up working like basic array indexing:

In [4]:
primes = pd.Series([2, 3, 5, 7, 11])
primes.loc[0]

2

The index is what makes a series something more than a simple array. It makes it possible to refer to the values in the series by whatever label is appropriate: by a string, by a date, by a pair of numbers, ... It also plays a role similar to a database index, hence the name: it speeds up operations like joins that have to look up values by this label repeatedly. Here is how you construct a series with an explicit index that does not simply use the position in the series as the label of each value:

In [5]:
primes = pd.Series([2, 3, 5, 7, 11], index=["p0", "p1", "p2", "p3", "p4"])
primes

p0     2
p1     3
p2     5
p3     7
p4    11
dtype: int64

Now you can look up values by label:

In [6]:
primes = pd.Series([2, 3, 5, 7, 11], index=["p0", "p1", "p2", "p3", "p4"])
primes.loc["p0"]

2

Lookup by position is nevertheless always possible using `.iloc[]`:

In [7]:
primes = pd.Series([2, 3, 5, 7, 11], index=["p0", "p1", "p2", "p3", "p4"])
primes.iloc[2]

5
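
The difference between `.loc[]` and `.iloc[]` is easiest to see when the labels are integers that do not coincide with positions. A minimal sketch, where the series `squares` and its labels are made up for illustration:

```python
import pandas as pd

# Hypothetical series whose integer labels differ from positions
squares = pd.Series([100, 400, 900], index=[10, 20, 30])

by_label = squares.loc[10]      # value stored under the label 10
by_position = squares.iloc[0]   # value stored at position 0

print(by_label, by_position)

# Position 1 exists, but there is no label 1 in the index
print(squares.iloc[1])
print(1 in squares.index)
```

Here `squares.loc[1]` would raise a `KeyError`, because `.loc[]` only ever consults the labels.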

You can explicitly construct a `pd.Index` instance and pass it in the `pd.Series` constructor:

In [8]:
index = pd.Index(["p0", "p1", "p2", "p3", "p4"])
series = pd.Series([2, 3, 5, 7, 11], index=index)
series

p0     2
p1     3
p2     5
p3     7
p4    11
dtype: int64
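
One practical consequence of labels is that operations between two series match values by label rather than by position. The sketch below uses two made-up series (`north` and `south` are illustrative names) that list the same labels in a different order:

```python
import pandas as pd

# Hypothetical stock levels in two warehouses, listed in different order
north = pd.Series([5, 3, 8], index=["apples", "pears", "plums"])
south = pd.Series([1, 2, 4], index=["plums", "apples", "pears"])

# Addition aligns the two series on their index labels first
total = north + south
print(total)
```

With plain arrays the same addition would pair up values purely by position.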

### Data frames

Conceptually a dataframe is a collection of column series that share a common row index:

In [9]:
df = pd.DataFrame({"a": series, "b": series})
df

Unnamed: 0,a,b
p0,2,2
p1,3,3
p2,5,5
p3,7,7
p4,11,11

The row index is shared quite literally: the dataframe and each of its column series report the same index object, as comparing their `id()` values shows:


In [10]:
df = pd.DataFrame({"a": series, "b": series})
display(id(df.index))
display(id(df["a"].index))
display(id(df["b"].index))

139723360226192

139723360226192

139723360226192

Columns get labels through an additional index object, called the column index:

In [11]:
df = pd.DataFrame({"a": series, "b": series})
df.columns

Index(['a', 'b'], dtype='object')

Many Pandas methods work with either the row index or column index depending on the value of the `axis` keyword argument. For example removing rows by label and removing columns using a label or a set of labels are both done using the `drop` method:

In [12]:
df = pd.DataFrame({"a": series, "b": series})
df.drop(["p0", "p1"])

Unnamed: 0,a,b
p2,5,5
p3,7,7
p4,11,11


In [13]:
df = pd.DataFrame({"a": series, "b": series})
df.drop("a", axis=1)

Unnamed: 0,b
p0,2
p1,3
p2,5
p3,7
p4,11
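
As a sketch of how `axis` works more generally, the example below uses a small made-up dataframe. It also shows the `index=` and `columns=` keyword arguments of `drop`, which say the same thing as `axis=0` and `axis=1` but are often easier to read:

```python
import pandas as pd

# Small made-up dataframe
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]}, index=["x", "y", "z"])

# Two equivalent ways to drop a row
rows_dropped = df.drop("x", axis=0)
rows_dropped_kw = df.drop(index="x")

# Two equivalent ways to drop a column
cols_dropped = df.drop("a", axis=1)
cols_dropped_kw = df.drop(columns="a")

# axis also controls the direction of reductions
col_sums = df.sum(axis=0)   # one value per column
row_sums = df.sum(axis=1)   # one value per row
print(col_sums)
print(row_sums)
```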


### Jupyter representation of the data structures

It is good to learn how the Jupyter cell output corresponds to the underlying Pandas data structures and their attributes. Often we will not construct a `pd.Series` or `pd.DataFrame` directly, but receive one as a result of some sequence of Pandas operations. In this case we do not know up front whether it is a `pd.Series` or a `pd.DataFrame`, what the index is, whether the index has a name, etc. You can inspect the result, for example by calling `type(result)`, but it is more efficient to simply learn what is shown where and how.

A `pd.Series` is displayed in monospaced font (unlike `pd.DataFrame`) and as the below example illustrates its Jupyter representation has at most four parts:

- top left corner: index name, only shown if present
- left column: index labels  
- right column: series values
- bottom line: series name, if present, and dtype of series values

In [14]:
index = pd.Index(["p0", "p1", "p2", "p3", "p4"], name="key")
series = pd.Series([2, 3, 5, 7, 11], index=index, name="primes")
series

key
p0     2
p1     3
p2     5
p3     7
p4    11
Name: primes, dtype: int64

A dataframe with one column is something different than a series:

In [15]:
index = pd.Index(["p0", "p1", "p2", "p3", "p4"], name="key")
series = pd.Series([2, 3, 5, 7, 11], index=index, name="primes")
series.to_frame()

Unnamed: 0_level_0,primes
key,Unnamed: 1_level_1
p0,2
p1,3
p2,5
p3,7
p4,11
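
When the display alone does not settle it, the type and shape do. The following sketch, using a shortened version of the series above, converts in both directions:

```python
import pandas as pd

index = pd.Index(["p0", "p1", "p2"], name="key")
series = pd.Series([2, 3, 5], index=index, name="primes")

frame = series.to_frame()   # one-column dataframe, column named after the series
back = frame["primes"]      # selecting that column gives a series again

print(type(series).__name__, series.shape)
print(type(frame).__name__, frame.shape)
print(back.equals(series))
```

A series has a one-dimensional shape, while even a single-column dataframe has two dimensions.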


## Dataset: mtcars

We will use the mtcars dataset as the default basic example in the following sections. Its contents are easy to understand and fairly interesting, while it also is provided in one of the simplest and most commonly encountered ways: each row describes a single unique unit of observation (a car) and each column is a different measurement for the same unit of observation (miles/gallon, number of cylinders, horsepower etc.):

In [16]:
CARS = pd.read_csv("mtcars.csv")

def cars():
    # Return a fresh copy every time so that examples do not affect each other
    return CARS.copy()

## Examining datasets

`.head(n)` returns the first `n` rows, `.tail(n)` the last `n` rows, and `.sample(n)` returns `n` randomly chosen rows:

In [17]:
df = cars()
display(df.head(2))
display(df.tail(2))
display(df.sample(2))

Unnamed: 0,model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
0,Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
1,Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4


Unnamed: 0,model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
30,Maserati Bora,15.0,8,301.0,335,3.54,3.57,14.6,0,1,5,8
31,Volvo 142E,21.4,4,121.0,109,4.11,2.78,18.6,1,1,4,2


Unnamed: 0,model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
18,Honda Civic,30.4,4,75.7,52,4.93,1.615,18.52,1,1,4,2
23,Camaro Z28,13.3,8,350.0,245,3.73,3.84,15.41,0,0,3,4


`len(df)` returns the number of rows in a dataframe:

In [18]:
df = cars()
len(df)

32

`count()` returns the number of rows with a non-null value. Since this can differ between columns, when called on a dataframe it returns a series with one value per column label:

In [19]:
df = cars()
df.count()

model    32
mpg      32
cyl      32
disp     32
hp       32
drat     32
wt       32
qsec     32
vs       32
am       32
gear     32
carb     32
dtype: int64
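
In mtcars every column is fully populated, so `count()` and `len()` agree. The difference shows up as soon as values are missing. A small sketch with made-up data:

```python
import numpy as np
import pandas as pd

# Made-up data with missing values in two columns
df = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "score": [1.5, np.nan, 3.0, np.nan],
    "group": ["x", None, "y", "y"],
})

n_rows = len(df)          # counts all rows, missing values included
non_null = df.count()     # counts non-null values per column
print(n_rows)
print(non_null)
```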

`describe()` also shows the number of non-null rows for each column, along with basic summary statistics. By default it covers only the numeric columns:

In [20]:
df = cars()
df.describe()

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
count,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0
mean,20.091,6.188,230.722,146.688,3.597,3.217,17.849,0.438,0.406,3.688,2.812
std,6.027,1.786,123.939,68.563,0.535,0.978,1.787,0.504,0.499,0.738,1.615
min,10.4,4.0,71.1,52.0,2.76,1.513,14.5,0.0,0.0,3.0,1.0
25%,15.425,4.0,120.825,96.5,3.08,2.581,16.892,0.0,0.0,3.0,2.0
50%,19.2,6.0,196.3,123.0,3.695,3.325,17.71,0.0,0.0,4.0,2.0
75%,22.8,8.0,326.0,180.0,3.92,3.61,18.9,1.0,1.0,4.0,4.0
max,33.9,8.0,472.0,335.0,4.93,5.424,22.9,1.0,1.0,5.0,8.0


`value_counts()` on the given column series returns a series describing how many rows take each unique value:

In [21]:
df = cars()
df["cyl"].value_counts()

cyl
8    14
4    11
6     7
Name: count, dtype: int64

`value_counts(normalize=True)` returns a series describing what fraction of all rows have each unique value, rather than the raw counts:

In [23]:
df = cars()
df["cyl"].value_counts(normalize=True)

cyl
8    0.438
4    0.344
6    0.219
Name: proportion, dtype: float64

`unique()` returns only the unique values themselves:

In [76]:
df["cyl"].unique()

array([6, 4, 8])

`nunique()` returns the number of unique values:

In [24]:
df["cyl"].nunique()

3
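
These three methods treat missing values differently by default, which is worth keeping in mind on datasets less tidy than mtcars. A sketch on a made-up series:

```python
import numpy as np
import pandas as pd

s = pd.Series([4, 4, 6, np.nan, 8, 8, 8])

counts = s.value_counts()                        # NaN is left out by default
counts_with_nan = s.value_counts(dropna=False)   # NaN gets its own row
n_unique = s.nunique()                           # NaN is not counted by default
n_unique_with_nan = s.nunique(dropna=False)
unique_values = s.unique()                       # NaN is included

print(counts)
print(n_unique, n_unique_with_nan)
print(unique_values)
```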

## Selecting rows and columns

### Select with []

Select rows with `[]`:

In [19]:
df = cars()
df[df["gear"] == 5]

Unnamed: 0,model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
26,Porsche 914-2,26.0,4,120.3,91,4.43,2.14,16.7,0,1,5,2
27,Lotus Europa,30.4,4,95.1,113,3.77,1.513,16.9,1,1,5,2
28,Ford Pantera L,15.8,8,351.0,264,4.22,3.17,14.5,0,1,5,4
29,Ferrari Dino,19.7,6,145.0,175,3.62,2.77,15.5,0,1,5,6
30,Maserati Bora,15.0,8,301.0,335,3.54,3.57,14.6,0,1,5,8


`[]` will accept a callable as argument for cases where a reference to the dataframe is not available, for example when chaining method calls on the dataframe:

In [20]:
df = cars()
df[lambda df: df["gear"] == 5][lambda df: df["cyl"] == 4]

Unnamed: 0,model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
26,Porsche 914-2,26.0,4,120.3,91,4.43,2.14,16.7,0,1,5,2
27,Lotus Europa,30.4,4,95.1,113,3.77,1.513,16.9,1,1,5,2


Select a single column as a `pd.Series` with `[]`:

In [21]:
df = cars()
df["gear"].head(5)

0    4
1    4
2    4
3    3
4    3
Name: gear, dtype: int64

Select one or more columns as a `pd.DataFrame` by passing a list to `[]`:

In [22]:
df = cars()
df[["gear"]].head(5)

Unnamed: 0,gear
0,4
1,4
2,4
3,3
4,3


Selection of rows and of columns can be combined:

In [23]:
df = cars()
df[df["gear"] == 5]["cyl"]

26    4
27    4
28    8
29    6
30    8
Name: cyl, dtype: int64

Note that chaining `[]` does not work for the purpose of modifying or inserting data:

In [24]:
df = cars()
df[df["gear"] == 5]["cyl"] = 3

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[df["gear"] == 5]["cyl"] = 3


`df[][]=` translates to a `df.__getitem__()` call on the data frame and then a `.__setitem__()` call on the resulting object. The problem is that the `df.__getitem__()` call might return either a view or a copy of the dataframe, so the dataframe might or might not be modified.

Instead, `df.loc[]` can be used to select rows and columns at the same time. `df.loc[]` will return a view or a copy just like `df[]`, but `df.loc[]=` is just a single method call on the `loc` attribute of the original dataframe, free of the ambiguity of `[][]=`, so that it will always correctly modify the dataframe.
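
A minimal sketch of the two safe patterns, on a made-up dataframe that reuses the `gear` and `cyl` column names: assign through a single `.loc[]` call to change the original, or take an explicit `.copy()` when the subset should be modified independently.

```python
import pandas as pd

# Made-up data
df = pd.DataFrame({"gear": [4, 5, 5], "cyl": [6, 4, 8]})

# Rows and column are chosen in one .loc[] call,
# so the assignment modifies the original dataframe
df.loc[df["gear"] == 5, "cyl"] = 3

# An explicit copy can be modified without touching the original
subset = df[df["gear"] == 5].copy()
subset["cyl"] = 0

print(df)
print(subset)
```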

### Select with loc[]

Select rows:

In [25]:
df = cars()
df.loc[df["gear"] == 5]

Unnamed: 0,model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
26,Porsche 914-2,26.0,4,120.3,91,4.43,2.14,16.7,0,1,5,2
27,Lotus Europa,30.4,4,95.1,113,3.77,1.513,16.9,1,1,5,2
28,Ford Pantera L,15.8,8,351.0,264,4.22,3.17,14.5,0,1,5,4
29,Ferrari Dino,19.7,6,145.0,175,3.62,2.77,15.5,0,1,5,6
30,Maserati Bora,15.0,8,301.0,335,3.54,3.57,14.6,0,1,5,8


`loc[]` will accept a callable as argument for cases where a reference to the dataframe is not available, for example when chaining method calls on the dataframe:

In [26]:
df = cars()
df.loc[lambda df: df["gear"] == 5].loc[lambda df: df["cyl"] == 4]

Unnamed: 0,model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
26,Porsche 914-2,26.0,4,120.3,91,4.43,2.14,16.7,0,1,5,2
27,Lotus Europa,30.4,4,95.1,113,3.77,1.513,16.9,1,1,5,2


Select a single column as a `pd.Series`:

In [27]:
df = cars()
df.loc[:, "gear"].head(5)

0    4
1    4
2    4
3    3
4    3
Name: gear, dtype: int64

Select one or more columns as a `pd.DataFrame`:

In [28]:
df = cars()
df.loc[:, ["gear"]].head(5)

Unnamed: 0,gear
0,4
1,4
2,4
3,3
4,3


Modify a subpart of a dataframe:

In [29]:
df = cars()
df.loc[df["cyl"] == 6, "hp"] = 200
df.loc[df["cyl"] == 6, :]

Unnamed: 0,model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
0,Mazda RX4,21.0,6,160.0,200,3.9,2.62,16.46,0,1,4,4
1,Mazda RX4 Wag,21.0,6,160.0,200,3.9,2.875,17.02,0,1,4,4
3,Hornet 4 Drive,21.4,6,258.0,200,3.08,3.215,19.44,1,0,3,1
5,Valiant,18.1,6,225.0,200,2.76,3.46,20.22,1,0,3,1
9,Merc 280,19.2,6,167.6,200,3.92,3.44,18.3,1,0,4,4
10,Merc 280C,17.8,6,167.6,200,3.92,3.44,18.9,1,0,4,4
29,Ferrari Dino,19.7,6,145.0,200,3.62,2.77,15.5,0,1,5,6


### Boolean expressions in [] and loc[]

When selecting rows with `df[df["gear"] == 5]`, `df["gear"] == 5` is a `pd.Series` wrapping a boolean vector:

In [30]:
df = cars()
(df["gear"] == 5).head(5)

0    False
1    False
2    False
3    False
4    False
Name: gear, dtype: bool

Boolean operators like `&`, `|` and `~` (negation) can be used on those boolean vectors to express compound filtering conditions. Individual conditions have to be enclosed in parentheses, since `&` and `|` have higher precedence in Python than comparison operators like `==`, `>=`, etc.:

In [31]:
df = cars()
df[(df["cyl"] >= 4) & (df["cyl"] <= 6) & (df["gear"] == 3)]

Unnamed: 0,model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
3,Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
5,Valiant,18.1,6,225.0,105,2.76,3.46,20.22,1,0,3,1
20,Toyota Corona,21.5,4,120.1,97,3.7,2.465,20.01,1,0,3,1


Use the `isin()` series method to test membership in a set of values:

In [32]:
df = cars()
df[df["cyl"].isin([4, 6]) & (df["gear"] == 3)]

Unnamed: 0,model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
3,Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
5,Valiant,18.1,6,225.0,105,2.76,3.46,20.22,1,0,3,1
20,Toyota Corona,21.5,4,120.1,97,3.7,2.465,20.01,1,0,3,1
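
Boolean masks are ordinary series, so they can be built up step by step and stored in variables. The sketch below uses made-up data and also shows `between()`, a series method that is inclusive on both ends by default and can replace a pair of comparisons:

```python
import pandas as pd

# Made-up data
df = pd.DataFrame({"cyl": [4, 6, 8, 6], "gear": [3, 3, 3, 5]})

# Parenthesised comparisons combined with &
mask_explicit = (df["cyl"] >= 4) & (df["cyl"] <= 6) & (df["gear"] == 3)

# The same range check written with between()
mask_between = df["cyl"].between(4, 6) & (df["gear"] == 3)

# ~ negates a mask
mask_negated = ~mask_between

print(df[mask_explicit])
print(df[mask_negated])
```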


## Grouping

### Forming groups

### Aggregating group data

#### Aggregate with specific provided aggregations

`df.groupby()` returns a `DataFrameGroupBy` that supports calls like `min()`, `max()`, `mean()`, `std()`, `var()`, `quantile()` etc.:

In [28]:
df = cars()
df.groupby(["cyl", "gear"])["hp"].mean()

cyl  gear
4    3        97.000
     4        76.000
     5       102.000
6    3       107.500
     4       116.500
     5       175.000
8    3       194.167
     5       299.500
Name: hp, dtype: float64

To count the number of items in each group use `.size()`:

In [26]:
df = cars()
df.groupby(["cyl", "gear"]).size()

cyl  gear
4    3        1
     4        8
     5        2
6    3        2
     4        4
     5        1
8    3       12
     5        2
dtype: int64

Note that, confusingly, to get the number of rows in the whole dataframe you have to use `len(df)` or `len(df.index)`: on a dataframe `.size` is an attribute, not a method, and it holds the number of cells in the dataframe, not the number of rows. For example, to group cars by the number of cylinders and the number of gears and see what percentage of all cars has each setup, call `.size()` on the `DataFrameGroupBy` object returned by `groupby(["cyl", "gear"])`, divide by `len(df)` and multiply the result by `100.0`:

In [30]:
df = cars()
(df.groupby(["cyl", "gear"]).size() / len(df)) * 100.0

cyl  gear
4    3        3.125
     4       25.000
     5        6.250
6    3        6.250
     4       12.500
     5        3.125
8    3       37.500
     5        6.250
dtype: float64
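
The three different notions of size can be compared side by side on a tiny made-up dataframe (the column names `k`, `x` and `y` are illustrative):

```python
import pandas as pd

# Made-up data: 3 rows, 3 columns
df = pd.DataFrame({"k": ["a", "a", "b"], "x": [1, 2, 3], "y": [4, 5, 6]})

n_rows = len(df)                       # number of rows
n_cells = df.size                      # attribute: rows * columns
group_sizes = df.groupby("k").size()   # method: number of rows per group
shares = group_sizes / n_rows * 100.0  # percentage of rows per group

print(n_rows, n_cells)
print(group_sizes)
print(shares)
```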

To count the groups themselves use the `.ngroups` attribute:

In [31]:
df = cars()
df.groupby(["cyl", "gear"]).ngroups

8

To compute a compound expression involving group-level aggregates, for example the range of values within each group (group max - group min), keep a reference to the groupby object and reuse it:

In [32]:
df = cars()
df_groupby = df.groupby(["cyl", "gear"])["hp"]
df_groupby.max() - df_groupby.min()

cyl  gear
4    3        0
     4       57
     5       22
6    3        5
     4       13
     5        0
8    3       95
     5       71
Name: hp, dtype: int64

`.pipe()` will accomplish the same thing in a way that can be placed in the middle of a method chain:

In [33]:
(
    cars()
    .groupby(["cyl", "gear"])["hp"]
    .pipe(lambda group: group.max() - group.min())
    .mean()
)

32.875

#### Aggregate with generic aggregate() method (series-by-series)

`df.groupby().aggregate(func)` will call `func(series)` once for each series of every group.

`func` should return a scalar.

In [70]:
df = cars()
df.groupby("cyl")[["hp"]].aggregate("mean")

Unnamed: 0_level_0,hp
cyl,Unnamed: 1_level_1
4,82.636
6,122.286
8,209.214


Multiple aggregations can be specified:

In [69]:
df = cars()
df.groupby("cyl")[["hp"]].aggregate(["size", "mean", "std"])

Unnamed: 0_level_0,hp,hp,hp
Unnamed: 0_level_1,size,mean,std
cyl,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
4,11,82.636,20.935
6,7,122.286,24.26
8,14,209.214,50.977


Use keyword arguments to rename the resulting columns:

In [41]:
df = cars()
df.groupby("cyl").aggregate(average=("hp", "mean"))

Unnamed: 0_level_0,average
cyl,Unnamed: 1_level_1
4,82.636
6,122.286
8,209.214


This keyword-argument form of `aggregate()` has a slightly different syntax when dealing with a single series:

In [42]:
df = cars()
df.groupby("cyl")["hp"].agg(average="mean")

Unnamed: 0_level_0,average
cyl,Unnamed: 1_level_1
4,82.636
6,122.286
8,209.214
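
The keyword form also scales to several output columns drawn from different input columns. A sketch on made-up data that borrows the mtcars column names; the output names `n`, `hp_mean` and `mpg_max` are arbitrary:

```python
import pandas as pd

# Made-up data using mtcars-like column names
df = pd.DataFrame({
    "cyl": [4, 4, 6, 6],
    "hp": [60, 100, 110, 130],
    "mpg": [30.0, 26.0, 21.0, 19.0],
})

# Each keyword is an output column: (input column, aggregation)
summary = df.groupby("cyl").agg(
    n=("hp", "count"),
    hp_mean=("hp", "mean"),
    mpg_max=("mpg", "max"),
)
print(summary)
```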


#### Aggregate with generic apply() method (dataframe-by-dataframe)

`df.groupby().apply(func)` will call `func(group)` once for each group, where `group` is a dataframe containing the rows within each group. The form of the result depends on the return type of `func`.

**Case 1:** `func` returns a scalar - the result of `apply(func)` is a series indexed by the group key:

In [43]:
df = cars()
df.groupby("cyl")[["hp"]].apply(lambda df: np.mean(df.values))

cyl
4     82.636
6    122.286
8    209.214
dtype: float64

**Case 2:** `func` returns a series - the result of `apply(func)` is a dataframe indexed by the group key, with columns given by the index of the returned series:

In [44]:
df = cars()
df.groupby("cyl")[["hp", "mpg"]].apply(lambda df: df.mean())

Unnamed: 0_level_0,hp,mpg
cyl,Unnamed: 1_level_1,Unnamed: 2_level_1
4,82.636,26.664
6,122.286,19.743
8,209.214,15.1


**Case 3:** `func` returns a dataframe - the result of `apply(func)` is a dataframe with a multi-index and with same columns as the dataframe returned by `func`. The multi-index consists of a group key level concatenated with levels of the index of the dataframes returned by `func`:

In [45]:
df = cars()
df.groupby("cyl")[["model", "hp"]].apply(lambda df: df.set_index("model").nlargest(2, "hp"))
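
The three cases can be compared side by side on a small made-up dataframe (the names `k`, `x` and `y` are illustrative, not taken from mtcars). This is only a sketch; the value columns are selected explicitly so that `func` never sees the grouping column:

```python
import pandas as pd

# Made-up data
df = pd.DataFrame({
    "k": ["a", "a", "b", "b"],
    "x": [1, 3, 10, 30],
    "y": [2, 4, 20, 40],
})
groups = df.groupby("k")[["x", "y"]]

# Case 1: scalar per group -> series indexed by the group key
as_scalar = groups.apply(lambda g: g["x"].sum() + g["y"].sum())

# Case 2: series per group -> dataframe, one column per series label
as_series = groups.apply(lambda g: g.mean())

# Case 3: dataframe per group -> dataframe with a multi-index
# (group key level followed by the index of the returned dataframe)
as_frame = groups.apply(lambda g: g.agg(["min", "max"]))

print(as_scalar)
print(as_series)
print(as_frame)
```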

### Transforming group data

In Pandas terminology, transformations differ from aggregations in that the result of a transformation has the same dimensions as the original series or dataframe before the `.groupby()`. A common use case is to do something with each row of a dataframe using data computed in the context of that row's group, like the group mean of some attribute.

#### Transform with specific provided transformations

#### Transform rows one-by-one with generic transform() method

`df.groupby().transform(func)` will call `func(series_in_group)` once for each series in each group. In contrast to `apply()`, the result of `transform()` is of the same dimensions as the original dataframe.

`func(series_in_group)` should either return a series of the same dimensions as `series_in_group` or a scalar, in which case pandas will take care of making a series of length `len(series_in_group)` out of it.

In [None]:
df = cars()
df.groupby("cyl")[["hp", "mpg"]].transform(lambda series: series.mean())
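
A minimal sketch of the typical use case, on made-up data that borrows the `cyl` and `hp` column names: compute the group mean for every row, then compare each row against it. The column name `hp_vs_group` is invented for this example.

```python
import pandas as pd

# Made-up data using mtcars-like column names
df = pd.DataFrame({
    "cyl": [4, 4, 6, 6, 6],
    "hp": [60, 100, 110, 120, 130],
})

# Group mean repeated for every row of the group
group_mean = df.groupby("cyl")["hp"].transform("mean")

# Same result with a callable that returns one scalar per group
group_mean_callable = df.groupby("cyl")["hp"].transform(lambda s: s.mean())

# Deviation of each row from the mean of its own group
df["hp_vs_group"] = df["hp"] - group_mean
print(df)
```

Because the result has one value per original row and keeps the original index, it can be assigned straight back as a new column.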