## Chapter 12. Advanced pandas
<a id='index'></a>


## Table of Contents
- [12.1 Categorical Data](#121)
    - [12.1.1 Background and Motivation](#1211)
    - [12.1.2 Categorical Type in pandas](#1212)
    - [12.1.3 Computations with Categoricals](#1213)
        - [12.1.3.1 Better performance with categoricals](#12131)
    - [12.1.4 Categorical Methods](#1214)

<hr>

In [2]:
import pandas as pd
import numpy as np

<hr>

## 12.1 Categorical Data
<a id='121'></a>

### 12.1.1 Background and Motivation
<a id='1211'></a>

In [3]:
values = pd.Series(['apple', 'orange', 'apple', 'apple'] * 2)
values

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
dtype: object

In [4]:
pd.unique(values)

array(['apple', 'orange'], dtype=object)

In [5]:
values.value_counts()

apple     6
orange    2
dtype: int64

In data warehousing, a best practice is to use so-called ***dimension tables***, which contain the distinct values, and to store the primary observations as integer keys referencing the dimension table:

In [6]:
values = pd.Series([0, 1, 0, 0] * 2)
dim = pd.Series(['apple', 'orange'])
values

0    0
1    1
2    0
3    0
4    0
5    1
6    0
7    0
dtype: int64

In [7]:
dim

0     apple
1    orange
dtype: object

In [9]:
# We can use the *take* method to restore the original Series of strings:
dim.take(values)

0     apple
1    orange
0     apple
0     apple
0     apple
1    orange
0     apple
0     apple
dtype: object

This representation as integers is called the **categorical** or **dictionary-encoded** representation. The array of distinct values can be called the **categories**, **dictionary**, or **levels** of the data. In this book we will use the terms **categorical** and **categories**. The integer values that reference the categories are called the **category codes** or simply **codes**.
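As a minimal sketch of this encoding, `pd.factorize` can derive the codes and the categories from the raw strings in one step, and `take` maps the codes back. The variable names (`raw`, `restored`) are illustrative only and not from the original notebook.

```python
import pandas as pd

raw = pd.Series(['apple', 'orange', 'apple', 'apple'] * 2)

# factorize returns the integer codes plus the distinct values,
# listed in order of first appearance
codes, categories = pd.factorize(raw)

# take() looks up each code in the categories, rebuilding the strings
restored = pd.Series(categories.take(codes))

print(codes)
print(list(categories))
print(restored.tolist() == raw.tolist())
```

Here `categories` plays the role of the dimension table and `codes` the role of the integer keys.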

The categorical representation can yield significant performance improvements when you are doing analytics. You can also perform transformations on the categories while leaving the codes unmodified. Some example transformations that can be made at relatively low cost are:

* Renaming categories
* Appending a new category without changing the order or position of the existing categories
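The two transformations listed above can be sketched with the `rename_categories` and `add_categories` methods of the `cat` accessor. The sample data and the new category `'pear'` are assumptions made for illustration; the point is that the codes stay the same in both cases.

```python
import pandas as pd

fruit = pd.Series(['apple', 'orange', 'apple', 'apple']).astype('category')
codes_before = fruit.cat.codes.tolist()

# Rename one category; only the categories index is rewritten
renamed = fruit.cat.rename_categories({'apple': 'APPLE'})

# Append a new category at the end; existing positions are kept
extended = fruit.cat.add_categories(['pear'])

print(renamed.cat.categories.tolist())
print(extended.cat.categories.tolist())
print(renamed.cat.codes.tolist() == codes_before)
print(extended.cat.codes.tolist() == codes_before)
```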

### 12.1.2 Categorical Type in pandas
<a id='1212'></a>

In [10]:
fruits = ['apple', 'orange', 'apple', 'apple'] * 2
N = len(fruits)
df = pd.DataFrame({'fruit': fruits,
                   'basket_id': np.arange(N),
                   'count': np.random.randint(3, 15, size=N),
                   'weight': np.random.uniform(0, 4, size=N)},
                  columns=['basket_id', 'fruit', 'count', 'weight'])

df

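| | basket_id | fruit | count | weight |
|---|---|---|---|---|
| 0 | 0 | apple | 3 | 2.532026 |
| 1 | 1 | orange | 6 | 1.659295 |
| 2 | 2 | apple | 7 | 3.49287 |
| 3 | 3 | apple | 14 | 2.656042 |
| 4 | 4 | apple | 3 | 3.047737 |
| 5 | 5 | orange | 12 | 3.022176 |
| 6 | 6 | apple | 7 | 3.015722 |
| 7 | 7 | apple | 7 | 2.605955 |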


In [12]:
# Here, df['fruit'] is an array of Python string objects. We can convert it to categorical by calling:
fruit_cat = df['fruit'].astype('category')

# The values for fruit_cat are not a NumPy array, but an instance of pandas.Categorical:
fruit_cat

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): [apple, orange]

In [14]:
c = fruit_cat.values
type(c)

pandas.core.categorical.Categorical

In [15]:
c.categories

Index(['apple', 'orange'], dtype='object')

In [16]:
c.codes

array([0, 1, 0, 0, 0, 1, 0, 0], dtype=int8)

In [17]:
# You can convert a DataFrame column to categorical by assigning the converted result:
df['fruit'] = df['fruit'].astype('category')
df.fruit

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): [apple, orange]

In [18]:
# You can also create pandas.Categorical directly from other types of Python sequences:
my_categories = pd.Categorical(['foo', 'bar', 'baz', 'foo', 'bar'])
my_categories

[foo, bar, baz, foo, bar]
Categories (3, object): [bar, baz, foo]

In [21]:
categories = ['foo', 'bar', 'baz']
codes = [0, 1, 2, 0, 0, 1]
my_cats_2 = pd.Categorical.from_codes(codes, categories)
my_cats_2

[foo, bar, baz, foo, foo, bar]
Categories (3, object): [foo, bar, baz]
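One detail of `from_codes` that may be worth knowing: a code of `-1` is treated as a missing value rather than as an index into the categories. The following is a small sketch with made-up codes, not part of the original notebook.

```python
import pandas as pd

categories = ['foo', 'bar', 'baz']

# -1 is the sentinel code for a missing value
with_missing = pd.Categorical.from_codes([0, -1, 2, 1], categories)

print(with_missing)
print(with_missing.isna().tolist())
```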

In [23]:
ordered_cat = pd.Categorical.from_codes(codes, categories, ordered=True)

# The output [foo < bar < baz] indicates that 'foo' precedes 'bar' in the ordering, and so on.
ordered_cat

[foo, bar, baz, foo, foo, bar]
Categories (3, object): [foo < bar < baz]

In [24]:
my_cats_2.as_ordered()

[foo, bar, baz, foo, foo, bar]
Categories (3, object): [foo < bar < baz]

As a last note, categorical data need not be strings, even though I have only shown string examples. A categorical array can consist of any immutable value type.
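For instance, a categorical can be built from integers. This is a minimal sketch with invented values; the categories are inferred as the sorted distinct values.

```python
import pandas as pd

sizes = pd.Categorical([10, 20, 10, 30, 20])

print(sizes.categories.tolist())
print(sizes.codes.tolist())
```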

### 12.1.3 Computations with Categoricals
<a id='1213'></a>
A pandas Categorical generally behaves the same way as the non-encoded version (like an array of strings). Some parts of pandas, like the groupby function, perform better when working with categoricals. There are also some functions that can utilize the ordered flag.
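As a rough illustration of what the ordered flag enables, the sketch below uses a hypothetical `sizes` Series whose category order is `low < medium < high`. Sorting, comparisons, and `min`/`max` then follow that order instead of alphabetical order.

```python
import pandas as pd

sizes = pd.Series(pd.Categorical(
    ['medium', 'low', 'high', 'low'],
    categories=['low', 'medium', 'high'],
    ordered=True))

# Sorting follows the category order, not the alphabet
sorted_sizes = sizes.sort_values()

# Comparisons against a category are only allowed when ordered=True
above_low = sizes > 'low'

smallest, largest = sizes.min(), sizes.max()

print(sorted_sizes.tolist())
print(above_low.tolist())
print(smallest, largest)
```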

In [25]:
np.random.seed(12345)
draws = np.random.randn(1000)

draws[:5]

array([-0.20470766,  0.47894334, -0.51943872, -0.5557303 ,  1.96578057])

In [26]:
bins = pd.qcut(draws, 4)
bins

[(-0.684, -0.0101], (-0.0101, 0.63], (-0.684, -0.0101], (-0.684, -0.0101], (0.63, 3.928], ..., (-0.0101, 0.63], (-0.684, -0.0101], (-2.95, -0.684], (-0.0101, 0.63], (0.63, 3.928]]
Length: 1000
Categories (4, interval[float64]): [(-2.95, -0.684] < (-0.684, -0.0101] < (-0.0101, 0.63] < (0.63, 3.928]]

In [27]:
bins = pd.qcut(draws, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
bins

[Q2, Q3, Q2, Q2, Q4, ..., Q3, Q2, Q1, Q3, Q4]
Length: 1000
Categories (4, object): [Q1 < Q2 < Q3 < Q4]

In [29]:
bins.codes[:10]

array([1, 2, 1, 1, 3, 3, 2, 2, 3, 3], dtype=int8)

In [31]:
bins = pd.Series(bins, name='quartile')
results = (pd.Series(draws)
           .groupby(bins)
           .agg(['count', 'min', 'max'])
           .reset_index())
results

| | quartile | count | min | max |
|---|---|---|---|---|
| 0 | Q1 | 250 | -2.949343 | -0.685484 |
| 1 | Q2 | 250 | -0.683066 | -0.010115 |
| 2 | Q3 | 250 | -0.010032 | 0.628894 |
| 3 | Q4 | 250 | 0.634238 | 3.927528 |


In [32]:
results['quartile']

0    Q1
1    Q2
2    Q3
3    Q4
Name: quartile, dtype: category
Categories (4, object): [Q1 < Q2 < Q3 < Q4]

#### 12.1.3.1 Better performance with categoricals
<a id='12131'></a>
If you do a lot of analytics on a particular dataset, converting to categorical can yield substantial overall performance gains. A categorical version of a DataFrame column will often use significantly less memory, too.

***GroupBy*** operations can be significantly faster with categoricals because the underlying algorithms use the ***integer-based*** codes array instead of an array of strings.

In [38]:
N = 10000000

draws = pd.Series(np.random.randn(N))
labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (N // 4))

categories = labels.astype('category')

In [39]:
labels.memory_usage()

80000080

In [40]:
categories.memory_usage()

10000272

In [42]:
# The conversion to category is not free, of course, but it is a one-time cost:
%time _ = labels.astype('category')

CPU times: user 388 ms, sys: 75.3 ms, total: 463 ms
Wall time: 461 ms
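The GroupBy claim can be checked with a rough timing sketch like the one below. It is an illustration under assumptions: the `timed` helper is hypothetical, `N` is reduced to keep the run short, and the actual timings depend on the machine and the pandas version, so no numbers are shown. It also prints `memory_usage(deep=True)`, which counts the string contents themselves rather than only the per-element references.

```python
import time

import numpy as np
import pandas as pd

N = 1_000_000  # smaller than the notebook's N, to keep the sketch quick
rng = np.random.default_rng(12345)
draws = pd.Series(rng.standard_normal(N))
labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (N // 4))
categories = labels.astype('category')


def timed(func):
    """Run func once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start


# Same aggregation, grouped by strings and by the categorical version
by_label, t_label = timed(lambda: draws.groupby(labels).mean())
by_cat, t_cat = timed(lambda: draws.groupby(categories, observed=True).mean())

print(f'string labels: {t_label:.3f}s, categorical: {t_cat:.3f}s')

# deep=True includes the memory held by the string values
print(labels.memory_usage(deep=True), categories.memory_usage(deep=True))
```

Both groupings produce the same group means; only the key representation differs.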


### 12.1.4 Categorical Methods
<a id='1214'></a>

In [43]:
s = pd.Series(['a', 'b', 'c', 'd'] * 2)
cat_s = s.astype('category')
cat_s

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (4, object): [a, b, c, d]

In [45]:
# The special attribute cat provides access to categorical methods:
cat_s.cat.codes

0    0
1    1
2    2
3    3
4    0
5    1
6    2
7    3
dtype: int8

In [52]:
# Suppose the actual set of categories extends beyond the four values observed in the data.
# The set_categories method changes the set of categories:
actual_categories = ['a', 'b', 'c', 'd', 'e']
cat_s2 = cat_s.cat.set_categories(actual_categories)
cat_s2

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (5, object): [a, b, c, d, e]

In [53]:
cat_s.value_counts()

d    2
c    2
b    2
a    2
dtype: int64

In [54]:
cat_s2.value_counts()

d    2
c    2
b    2
a    2
e    0
dtype: int64

In [55]:
cat_s3 = cat_s[cat_s.isin(['a', 'b'])]
cat_s3

0    a
1    b
4    a
5    b
dtype: category
Categories (4, object): [a, b, c, d]

In [56]:
cat_s3.cat.remove_unused_categories()

0    a
1    b
4    a
5    b
dtype: category
Categories (2, object): [a, b]
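One consequence of `remove_unused_categories` is that the codes may be renumbered, since they index into the trimmed categories. The sketch below filters on `'c'` and `'d'` instead of `'a'` and `'b'` to make this visible; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'd'] * 2).astype('category')

# Filtering keeps all four categories, so the codes still refer to them
subset = s[s.isin(['c', 'd'])]

# Trimming drops 'a' and 'b', and the codes are remapped accordingly
trimmed = subset.cat.remove_unused_categories()

print(subset.cat.codes.tolist())
print(trimmed.cat.codes.tolist())
print(trimmed.cat.categories.tolist())
```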

<hr>

[Back to top](#index)