# Chapter 12. Advanced pandas

## 12.1 Categorical data

This section introduces the pandas `Categorical` type.
Using it can often improve performance and reduce memory use compared with the equivalent string representation.

### Background and motivation

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(0)

In [2]:
%matplotlib inline

In [3]:
values = pd.Series(['apple', 'orange', 'apple', 'apple'] * 2)
values

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
dtype: object

In [4]:
pd.unique(values)

array(['apple', 'orange'], dtype=object)

In [8]:
values.value_counts()

apple     6
orange    2
dtype: int64

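A good solution for storing categorical data is to use a dimension table: the distinct values are stored once, and each observation holds an integer key that refers to one of them.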

In [9]:
values = pd.Series([0, 1, 0, 0] * 2)
dim = pd.Series(['apple', 'orange'])
values

0    0
1    1
2    0
3    0
4    0
5    1
6    0
7    0
dtype: int64

In [10]:
dim[values]

0     apple
1    orange
0     apple
0     apple
0     apple
1    orange
0     apple
0     apple
dtype: object

In [11]:
dim.take(values)

0     apple
1    orange
0     apple
0     apple
0     apple
1    orange
0     apple
0     apple
dtype: object
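
The dimension table above was written by hand. As a minimal sketch, assuming the raw strings are already in a Series, `pd.factorize()` can derive both the integer keys and the table of distinct values; the variable names here are illustrative only.

```python
import pandas as pd

raw = pd.Series(['apple', 'orange', 'apple', 'apple'] * 2)

# factorize() returns integer codes plus the distinct values,
# listed in order of first appearance
codes, uniques = pd.factorize(raw)

# Rebuild the original strings by looking each code up in the distinct values
restored = pd.Series(uniques.take(codes))

print(codes)
print(list(uniques))
print(restored.equals(raw))
```

Here `uniques` plays the role of `dim` and `codes` the role of `values`.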

### Categorical type in pandas

The `Categorical` type uses an integer-based encoding.

In [14]:
fruits = ['apple', 'orange', 'apple', 'apple'] * 2
N = len(fruits)
df = pd.DataFrame({
        'fruit': fruits,
        'basket_id': np.arange(N),
        'count': np.random.randint(3, 15, size=N),
        'weight': np.random.uniform(0, 4, size=N)
    }, columns = ['basket_id', 'fruit', 'count', 'weight'])
df

   basket_id   fruit  count    weight
0          0   apple      8  2.583576
1          1  orange      3  1.750349
2          2   apple      6  3.567092
3          3   apple     14  3.854651
4          4   apple      6  1.533766
5          5  orange     10  3.166900
6          6   apple     12  2.115580
7          7   apple      6  2.272178


In [16]:
# Turn the fruit data into a Categorical
fruit_cat = df['fruit'].astype('category')
fruit_cat

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): [apple, orange]

In [18]:
c = fruit_cat.values
type(c)

pandas.core.arrays.categorical.Categorical

This Categorical object has `categories` and `codes` attributes.

In [19]:
c.categories

Index(['apple', 'orange'], dtype='object')

In [20]:
c.codes

array([0, 1, 0, 0, 0, 1, 0, 0], dtype=int8)
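
To make the relationship between the two attributes concrete, the following sketch builds a code-to-label mapping and decodes the codes by hand. It constructs its own small Categorical so that it runs on its own; `code_to_label` and `decoded` are names chosen for illustration.

```python
import pandas as pd

c = pd.Categorical(['apple', 'orange', 'apple', 'apple'] * 2)

# Each position in `categories` is the integer code for that label
code_to_label = dict(enumerate(c.categories))

# Decode by hand; the result should match the values the Categorical reports
decoded = [code_to_label[code] for code in c.codes]

print(code_to_label)
print(decoded == list(c))
```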

In [23]:
# Convert the DataFrame column into a Categorical
df['fruit'] = df['fruit'].astype('category')
df

   basket_id   fruit  count    weight
0          0   apple      8  2.583576
1          1  orange      3  1.750349
2          2   apple      6  3.567092
3          3   apple     14  3.854651
4          4   apple      6  1.533766
5          5  orange     10  3.166900
6          6   apple     12  2.115580
7          7   apple      6  2.272178


A Categorical object can be created from the codes and categories using the `from_codes()` method.

In [25]:
categories = ['foo', 'bar', 'baz']
codes = [0, 1, 2, 0, 0, 1]
my_cats_2 = pd.Categorical.from_codes(codes, categories)
my_cats_2

[foo, bar, baz, foo, foo, bar]
Categories (3, object): [foo, bar, baz]
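
One detail worth knowing when building codes by hand: a code of `-1` stands for a missing value. The sketch below is a small illustration of that convention, with made-up codes.

```python
import pandas as pd

categories = ['foo', 'bar', 'baz']
codes_with_gap = [0, -1, 2, 1]  # -1 marks a missing observation

cats = pd.Categorical.from_codes(codes_with_gap, categories)

# The -1 entry surfaces as NaN rather than as a category
missing = pd.isna(cats)

print(cats)
print(missing)
```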

By default, no order is assumed for the categories, though one can be imposed by setting `ordered=True`.
The ordering can later be removed with the `as_unordered()` method.

In [27]:
ordered_cat = pd.Categorical.from_codes(codes, categories, ordered=True)
ordered_cat

[foo, bar, baz, foo, foo, bar]
Categories (3, object): [foo < bar < baz]

In [28]:
ordered_cat.as_unordered()

[foo, bar, baz, foo, foo, bar]
Categories (3, object): [foo, bar, baz]

In [32]:
# as_ordered() returns a new Categorical; pandas 2.0+ no longer accepts inplace=True
my_cats_2 = my_cats_2.as_ordered()

In [33]:
my_cats_2

[foo, bar, baz, foo, foo, bar]
Categories (3, object): [foo < bar < baz]

In [34]:
my_cats_2.ordered

True
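
Ordering matters because it enables comparisons and order-aware sorting. The following sketch uses a hypothetical `sizes` Series, not data from this chapter, to show what an ordered Categorical allows.

```python
import pandas as pd

sizes = pd.Series(pd.Categorical(
    ['small', 'large', 'medium', 'small'],
    categories=['small', 'medium', 'large'],
    ordered=True,
))

# Comparisons follow the category order, not alphabetical order
bigger_than_small = sizes > 'small'

# Sorting and min/max also respect the category order
sorted_sizes = sizes.sort_values()
smallest, largest = sizes.min(), sizes.max()

print(bigger_than_small.tolist())
print(sorted_sizes.tolist())
print(smallest, largest)
```

Alphabetically, 'large' would sort before 'medium'; with the declared order it sorts last.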

### Computations with categoricals

Most uses of Categorical will behave as if the data were still an unencoded structure (such as an array of strings).
Some pandas operations, such as `groupby()`, take advantage of the integer codes behind a Categorical to run faster.
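
As a small illustration of both points, the sketch below compares a categorical Series against a plain string and then groups by it. The data is made up, and `observed=False` is passed explicitly so that categories with no rows still appear in the result.

```python
import pandas as pd

keys = pd.Series(['a', 'b', 'a'],
                 dtype=pd.CategoricalDtype(['a', 'b', 'c']))
vals = pd.Series([1, 2, 3])

# Element-wise comparison works as it would on plain strings
is_a = keys == 'a'

# Grouping by a categorical key; the unused category 'c' gets an empty group
totals = vals.groupby(keys, observed=False).sum()

print(is_a.tolist())
print(totals)
```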

The `qcut()` and `cut()` pandas functions return Categoricals.

In [37]:
draws = np.random.randn(1000)
draws[:5]

array([ 0.3356357 ,  0.04034169,  1.67106186, -1.00292551,  1.40244569])

In [44]:
bins = pd.qcut(draws, 4)
bins

[(0.036, 0.668], (0.036, 0.668], (0.668, 2.912], (-3.4419999999999997, -0.629], (0.668, 2.912], ..., (0.036, 0.668], (-3.4419999999999997, -0.629], (-0.629, 0.036], (0.036, 0.668], (-0.629, 0.036]]
Length: 1000
Categories (4, interval[float64]): [(-3.4419999999999997, -0.629] < (-0.629, 0.036] < (0.036, 0.668] < (0.668, 2.912]]

In [45]:
# The same quartiles but with more helpful names.
bins = pd.qcut(draws, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
bins

[Q3, Q3, Q4, Q1, Q4, ..., Q3, Q1, Q2, Q3, Q2]
Length: 1000
Categories (4, object): [Q1 < Q2 < Q3 < Q4]

In [51]:
bins = pd.Series(bins, name='quartile')
results = (
    pd.Series(draws)
        .groupby(bins)
        .agg(['count', 'min', 'max'])
        .reset_index()
)

results

  quartile  count       min       max
0       Q1    250 -3.440574 -0.629541
1       Q2    250 -0.629398  0.035599
2       Q3    250  0.036325  0.666579
3       Q4    250  0.673752  2.911616


In [52]:
results.quartile

0    Q1
1    Q2
2    Q3
3    Q4
Name: quartile, dtype: category
Categories (4, object): [Q1 < Q2 < Q3 < Q4]
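
The examples above use `qcut()`, which picks bin edges from sample quantiles. For comparison, here is a minimal sketch of `cut()` with explicit bin edges; the values, edges, and labels are invented for illustration.

```python
import pandas as pd

ages = [1, 7, 5, 12]

# Bins are (0, 5], (5, 10], (10, 15]; the right edge is inclusive by default
groups = pd.cut(ages, bins=[0, 5, 10, 15], labels=['low', 'mid', 'high'])

print(list(groups))
print(groups.ordered)
```

Because the right edge is inclusive, the value 5 lands in the first bin.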

Using a Categorical in a DataFrame can speed up computations and reduce the memory the DataFrame consumes.

In [56]:
N = 1000000

draws = pd.Series(np.random.randn(N))

labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (N // 4))

categories = labels.astype('category')

In [57]:
labels.memory_usage()

8000128

In [58]:
categories.memory_usage()

1000320
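
Note that `memory_usage()` without arguments counts only the 8-byte pointers of an object-dtype Series, not the strings they refer to. The following sketch, which assumes a smaller sample and forces `dtype=object`, passes `deep=True` to include the string objects as well; exact byte counts depend on the platform and pandas version, so none are shown.

```python
import pandas as pd

n = 100_000
labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (n // 4), dtype=object)
categories = labels.astype('category')

shallow = labels.memory_usage()          # pointers only
deep = labels.memory_usage(deep=True)    # pointers plus the string objects
cat_deep = categories.memory_usage(deep=True)

print(shallow, deep, cat_deep)
print(categories.cat.codes.dtype)        # small integer codes
```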

### Categorical methods

There are a few extra convenience methods provided for Categorical objects.
On a Series with a categorical dtype, they are accessed via the `cat` accessor.

In [61]:
s = pd.Series(['a', 'b', 'c', 'd'] * 2)
cat_s = s.astype('category')
cat_s

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (4, object): [a, b, c, d]

In [62]:
cat_s.cat.codes

0    0
1    1
2    2
3    3
4    0
5    1
6    2
7    3
dtype: int8

In [63]:
cat_s.cat.categories

Index(['a', 'b', 'c', 'd'], dtype='object')

The categories can be changed, even extended beyond the actual values used.

In [64]:
actual_categories = ['a', 'b', 'c', 'd', 'e']
cat_s2 = cat_s.cat.set_categories(actual_categories)
cat_s2

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (5, object): [a, b, c, d, e]

In [65]:
cat_s.value_counts()

d    2
c    2
b    2
a    2
dtype: int64

In [66]:
cat_s2.value_counts()

d    2
c    2
b    2
a    2
e    0
dtype: int64
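
`set_categories()` can also narrow the set of categories. As a hedged sketch of what happens in that case, values whose category is dropped become missing:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')

# Keep only 'a' and 'b'; rows that held 'c' or 'd' turn into NaN
narrowed = s.cat.set_categories(['a', 'b'])

print(narrowed)
print(narrowed.isna().sum())
```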

Unused categories can be removed with the `remove_unused_categories()` method.

In [67]:
cat_s3 = cat_s[cat_s.isin(['a', 'b'])]
cat_s3

0    a
1    b
4    a
5    b
dtype: category
Categories (4, object): [a, b, c, d]

In [69]:
cat_s3.cat.remove_unused_categories()

0    a
1    b
4    a
5    b
dtype: category
Categories (2, object): [a, b]

One final example use-case for Categorical is to create dummy variables for modeling.

In [71]:
cat_s = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')
pd.get_dummies(cat_s)

   a  b  c  d
0  1  0  0  0
1  0  1  0  0
2  0  0  1  0
3  0  0  0  1
4  1  0  0  0
5  0  1  0  0
6  0  0  1  0
7  0  0  0  1
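
Recent pandas versions return boolean columns from `get_dummies()` by default. A minimal sketch, assuming integer 0/1 indicators are wanted for modeling, passes `dtype=int`; the `prefix` value is an arbitrary choice for illustration.

```python
import pandas as pd

cat_s = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')

# One indicator column per category, named '<prefix>_<category>'
dummies = pd.get_dummies(cat_s, prefix='letter', dtype=int)

print(dummies.columns.tolist())
print(dummies.head(4))
```

Each row contains exactly one 1, in the column for that row's category.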


## 12.2 Advanced groupby use
