## Chapter 12. Advanced pandas
<a id='index'></a>


## Table of Contents
- [12.1 Categorical Data](#121)
    - [12.1.1 Background and Motivation](#1211)
    - [12.1.2 Categorical Type in pandas](#1212)
    - [12.1.3 Computations with Categoricals](#1213)
        - [12.1.3.1 Better performance with categoricals](#12131)
    - [12.1.4 Categorical Methods](#1214)
        - [12.1.4.1 Creating dummy variables for modeling](#12141)
- [12.2 Advanced GroupBy Use](#122)
    - [12.2.1 Group Transforms and "Unwrapped" GroupBys](#1221)
    - [12.2.2 Grouped Time Resampling](#1222)
- [12.3 Techniques for Method Chaining](#123)
    - [12.3.1 The pipe Method](#1231)

<hr>

In [2]:
import pandas as pd
import numpy as np

<hr>

## 12.1 Categorical Data
<a id='121'></a>

### 12.1.1 Background and Motivation
<a id='1211'></a>

In [3]:
values = pd.Series(['apple', 'orange', 'apple', 'apple'] * 2)
values

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
dtype: object

In [4]:
pd.unique(values)

array(['apple', 'orange'], dtype=object)

In [5]:
# The top-level pd.value_counts is deprecated in recent pandas; the Series method is equivalent:
values.value_counts()

apple     6
orange    2
dtype: int64

In data warehousing, a best practice is to use so-called ***dimension tables*** that contain the distinct values, and to store the primary observations as integer keys referencing the dimension table:

In [6]:
values = pd.Series([0, 1, 0, 0] * 2)
dim = pd.Series(['apple', 'orange'])
values

0    0
1    1
2    0
3    0
4    0
5    1
6    0
7    0
dtype: int64

In [7]:
dim

0     apple
1    orange
dtype: object

In [8]:
# We can use the *take* method to restore the original Series of strings:
dim.take(values)

0     apple
1    orange
0     apple
0     apple
0     apple
1    orange
0     apple
0     apple
dtype: object

This representation as integers is called the **categorical** or **dictionary-encoded** representation. The array of distinct values can be called the **categories**, **dictionary**, or **levels** of the data. In this book we will use the terms **categorical** and **categories**. The integer values that reference the categories are called the **category codes** or simply **codes**.

The categorical representation can yield significant performance improvements when you are doing analytics. You can also perform transformations on the categories while leaving the codes unmodified. Some example transformations that can be made at relatively low cost are:

* Renaming categories
* Appending a new category without changing the order or position of the existing categories
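
Both transformations can be sketched on a small `pandas.Categorical`. The fruit labels and the extra `'pear'` category are illustrative assumptions; the point is that only the categories change while the integer codes stay as they were:

```python
import pandas as pd

fruit = pd.Categorical(['apple', 'orange', 'apple', 'apple'])
codes_before = fruit.codes.copy()

# Renaming replaces the labels in the categories index only.
renamed = fruit.rename_categories(['APPLE', 'ORANGE'])

# Appending a category adds a new label after the existing ones.
extended = fruit.add_categories(['pear'])

print(renamed.categories.tolist())
print(extended.categories.tolist())

# The integer codes are identical before and after both operations.
print((renamed.codes == codes_before).all())
print((extended.codes == codes_before).all())
```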

### 12.1.2 Categorical Type in pandas
<a id='1212'></a>

In [9]:
fruits = ['apple', 'orange', 'apple', 'apple'] * 2
N = len(fruits)
df = pd.DataFrame({'fruit': fruits,
                   'basket_id': np.arange(N),
                   'count': np.random.randint(3, 15, size=N),
                   'weight': np.random.uniform(0, 4, size=N)},
                  columns=['basket_id', 'fruit', 'count', 'weight'])

df

   basket_id   fruit  count    weight
0          0   apple     11  1.089168
1          1  orange      3  0.422349
2          2   apple     13  2.587918
3          3   apple      9  3.868012
4          4   apple     14  3.100831
5          5  orange      3  0.402638
6          6   apple      9  1.373505
7          7   apple     12  0.027570


In [10]:
# Here, df['fruit'] is an array of Python string objects. We can convert it to categorical by calling:
fruit_cat = df['fruit'].astype('category')

# The values for fruit_cat are not a NumPy array, but an instance of pandas.Categorical:
fruit_cat

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): [apple, orange]

In [11]:
c = fruit_cat.values
type(c)

pandas.core.categorical.Categorical

In [12]:
c.categories

Index(['apple', 'orange'], dtype='object')

In [13]:
c.codes

array([0, 1, 0, 0, 0, 1, 0, 0], dtype=int8)

In [14]:
# You can convert a DataFrame column to categorical by assigning the converted result:
df['fruit'] = df['fruit'].astype('category')
df.fruit

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): [apple, orange]

In [15]:
# You can also create pandas.Categorical directly from other types of Python sequences:
my_categories = pd.Categorical(['foo', 'bar', 'baz', 'foo', 'bar'])
my_categories

[foo, bar, baz, foo, bar]
Categories (3, object): [bar, baz, foo]

In [16]:
categories = ['foo', 'bar', 'baz']
codes = [0, 1, 2, 0, 0, 1]
my_cats_2 = pd.Categorical.from_codes(codes, categories)
my_cats_2

[foo, bar, baz, foo, foo, bar]
Categories (3, object): [foo, bar, baz]

In [17]:
ordered_cat = pd.Categorical.from_codes(codes, categories, ordered=True)

# The output [foo < bar < baz] indicates that 'foo' precedes 'bar' in the ordering, and so on.
ordered_cat

[foo, bar, baz, foo, foo, bar]
Categories (3, object): [foo < bar < baz]

In [18]:
my_cats_2.as_ordered()

[foo, bar, baz, foo, foo, bar]
Categories (3, object): [foo < bar < baz]

As a last note, categorical data need not be strings, even though I have only shown string examples. A categorical array can consist of any immutable value type.
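
As a small illustration of non-string categories, here is a sketch that uses integers. The `ratings` values are made up for this example:

```python
import pandas as pd

# Integer ratings as an ordered categorical; the categories are ints, not strings.
ratings = pd.Categorical([3, 1, 2, 3, 3], categories=[1, 2, 3], ordered=True)

print(ratings.categories.tolist())
print(ratings.codes.tolist())
```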

### 12.1.3 Computations with Categoricals
<a id='1213'></a>
Using a Categorical in pandas generally behaves the same way as using the non-encoded version (like an array of strings). Some parts of pandas, like the groupby function, perform better when working with categoricals. There are also some functions that can utilize the ordered flag.
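
To make the role of the ordered flag concrete, here is a minimal sketch. The `sizes` data and its category order are assumptions for illustration; with `ordered=True`, `min`, `max`, sorting, and comparisons follow the category order rather than alphabetical order:

```python
import pandas as pd

sizes = pd.Series(pd.Categorical(['medium', 'small', 'large', 'small'],
                                 categories=['small', 'medium', 'large'],
                                 ordered=True))

# min and max respect the declared category order
print(sizes.min(), sizes.max())

# sorting uses the category order, not the alphabetical order of the labels
print(sizes.sort_values().tolist())

# comparisons against a category are allowed for ordered categoricals
print((sizes > 'small').tolist())
```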

In [19]:
np.random.seed(12345)
draws = np.random.randn(1000)

draws[:5]

array([-0.20470766,  0.47894334, -0.51943872, -0.5557303 ,  1.96578057])

In [20]:
bins = pd.qcut(draws, 4)
bins

[(-0.684, -0.0101], (-0.0101, 0.63], (-0.684, -0.0101], (-0.684, -0.0101], (0.63, 3.928], ..., (-0.0101, 0.63], (-0.684, -0.0101], (-2.95, -0.684], (-0.0101, 0.63], (0.63, 3.928]]
Length: 1000
Categories (4, interval[float64]): [(-2.95, -0.684] < (-0.684, -0.0101] < (-0.0101, 0.63] < (0.63, 3.928]]

In [21]:
bins = pd.qcut(draws, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
bins

[Q2, Q3, Q2, Q2, Q4, ..., Q3, Q2, Q1, Q3, Q4]
Length: 1000
Categories (4, object): [Q1 < Q2 < Q3 < Q4]

In [22]:
bins.codes[:10]

array([1, 2, 1, 1, 3, 3, 2, 2, 3, 3], dtype=int8)

In [23]:
bins = pd.Series(bins, name='quartile')
results = (pd.Series(draws)
           .groupby(bins)
           .agg(['count', 'min', 'max'])
           .reset_index())
results

  quartile  count       min       max
0       Q1    250 -2.949343 -0.685484
1       Q2    250 -0.683066 -0.010115
2       Q3    250 -0.010032  0.628894
3       Q4    250  0.634238  3.927528


In [24]:
results['quartile']

0    Q1
1    Q2
2    Q3
3    Q4
Name: quartile, dtype: category
Categories (4, object): [Q1 < Q2 < Q3 < Q4]

#### 12.1.3.1 Better performance with categoricals
<a id='12131'></a>
If you do a lot of analytics on a particular dataset, converting to categorical can yield substantial overall performance gains. A categorical version of a DataFrame column will often use significantly less memory, too.

***GroupBy*** operations can be significantly faster with categoricals because the underlying algorithms use the ***integer-based*** codes array instead of an array of strings.

In [25]:
N = 10000000

draws = pd.Series(np.random.randn(N))
labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (N // 4))

categories = labels.astype('category')

In [26]:
labels.memory_usage()

80000080

In [27]:
categories.memory_usage()

10000272

In [28]:
# The conversion to category is not free, of course, but it is a one-time cost:
%time _ = labels.astype('category')

CPU times: user 271 ms, sys: 61.2 ms, total: 332 ms
Wall time: 340 ms
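
The effect can be checked on a smaller dataset. The following is a rough sketch, not a benchmark: the size `n` and the variable names are assumptions, and timings will vary by machine and pandas version. It compares deep memory usage and confirms that grouping by strings and grouping by categoricals give the same sums:

```python
import time

import numpy as np
import pandas as pd

n = 200_000  # illustrative size, much smaller than the example above
values = pd.Series(np.arange(n, dtype='float64'))
labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (n // 4))
cats = labels.astype('category')

# deep=True also counts the memory held by the Python string objects
label_bytes = labels.memory_usage(deep=True)
cat_bytes = cats.memory_usage(deep=True)
print(label_bytes, cat_bytes)

start = time.perf_counter()
by_label = values.groupby(labels).sum()
label_seconds = time.perf_counter() - start

start = time.perf_counter()
by_cat = values.groupby(cats, observed=True).sum()
cat_seconds = time.perf_counter() - start

print(label_seconds, cat_seconds)
print(by_label.tolist() == by_cat.tolist())
```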


### 12.1.4 Categorical Methods
<a id='1214'></a>

In [29]:
s = pd.Series(['a', 'b', 'c', 'd'] * 2)
cat_s = s.astype('category')
cat_s

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (4, object): [a, b, c, d]

In [30]:
# The special attribute cat provides access to categorical methods:
cat_s.cat.codes

0    0
1    1
2    2
3    3
4    0
5    1
6    2
7    3
dtype: int8

In [31]:
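# Suppose we know that the actual set of categories extends beyond the four values observed in the data.
# We can use the set_categories method to change the categories: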
actual_categories = ['a', 'b', 'c', 'd', 'e']
cat_s2 = cat_s.cat.set_categories(actual_categories)
cat_s2

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (5, object): [a, b, c, d, e]

In [32]:
cat_s.value_counts()

d    2
c    2
b    2
a    2
dtype: int64

In [33]:
cat_s2.value_counts()

d    2
c    2
b    2
a    2
e    0
dtype: int64

In [34]:
cat_s3 = cat_s[cat_s.isin(['a', 'b'])]
cat_s3

0    a
1    b
4    a
5    b
dtype: category
Categories (4, object): [a, b, c, d]

In [35]:
cat_s3.cat.remove_unused_categories()

0    a
1    b
4    a
5    b
dtype: category
Categories (2, object): [a, b]
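
A few other methods available through the `cat` accessor can be sketched the same way. The data mirrors `cat_s` above; the new labels are illustrative assumptions:

```python
import pandas as pd

cat_s = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')

# Append a new category at the end of the existing ones
with_e = cat_s.cat.add_categories(['e'])

# Replace the category labels, position by position
upper = cat_s.cat.rename_categories(['A', 'B', 'C', 'D'])

# Change the category order and mark the result as ordered
reordered = cat_s.cat.reorder_categories(['d', 'c', 'b', 'a'], ordered=True)

# Remove a category; values that used it become missing
without_d = cat_s.cat.remove_categories(['d'])

print(with_e.cat.categories.tolist())
print(upper.tolist())
print(reordered.min())
print(without_d.isna().sum())
```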

#### 12.1.4.1 Creating dummy variables for modeling
<a id='12141'></a>
When you’re using statistics or machine learning tools, you’ll often transform categorical data into dummy variables, also known as ***one-hot encoding***.

In [36]:
cat_s = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')

pd.get_dummies(cat_s)

   a  b  c  d
0  1  0  0  0
1  0  1  0  0
2  0  0  1  0
3  0  0  0  1
4  1  0  0  0
5  0  1  0  0
6  0  0  1  0
7  0  0  0  1
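
In a modeling workflow the dummy columns are usually joined back onto the other features. Here is a minimal sketch, assuming a hypothetical frame with a categorical `key` column and a numeric `x` column. `dtype=int` is passed because the default dtype of the dummy columns differs between pandas versions:

```python
import pandas as pd

df = pd.DataFrame({'key': pd.Categorical(['a', 'b', 'a', 'c']),
                   'x': [1.0, 2.0, 3.0, 4.0]})

# One indicator column per category, named key_a, key_b, key_c
dummies = pd.get_dummies(df['key'], prefix='key', dtype=int)

# Combine the numeric feature with the indicator columns
model_frame = df[['x']].join(dummies)
print(model_frame)
```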


<hr>

## 12.2 Advanced GroupBy Use
<a id='122'></a>
### 12.2.1 Group Transforms and "Unwrapped" GroupBys
<a id='1221'></a>
There is another built-in method called ***transform***, which is similar to apply but imposes more constraints on the kind of function you can use:
* It can produce a scalar value to be broadcast to the shape of the group
* It can produce an object of the same shape as the input group
* It must not mutate its input

In [37]:
df = pd.DataFrame({'key': ['a', 'b', 'c'] * 4,
                   'value': np.arange(12.)})
df

    key  value
0     a    0.0
1     b    1.0
2     c    2.0
3     a    3.0
4     b    4.0
5     c    5.0
6     a    6.0
7     b    7.0
8     c    8.0
9     a    9.0
10    b   10.0
11    c   11.0


In [38]:
g = df.groupby('key').value
g.mean()

key
a    4.5
b    5.5
c    6.5
Name: value, dtype: float64

In [39]:
# Suppose instead we wanted to produce a Series of the same shape as df['value'] 
# but with values replaced by the average grouped by 'key'. 
# We can pass the function lambda x: x.mean() to transform:

g.transform(lambda x: x.mean())

0     4.5
1     5.5
2     6.5
3     4.5
4     5.5
5     6.5
6     4.5
7     5.5
8     6.5
9     4.5
10    5.5
11    6.5
Name: value, dtype: float64

In [40]:
# For built-in aggregation functions, we can pass a string alias as with the GroupBy *agg* method:
g.transform('mean')

0     4.5
1     5.5
2     6.5
3     4.5
4     5.5
5     6.5
6     4.5
7     5.5
8     6.5
9     4.5
10    5.5
11    6.5
Name: value, dtype: float64

In [41]:
# Like apply, transform works with functions that return Series, but the result must be the same size as the input.
g.transform(lambda x: x*2)

0      0.0
1      2.0
2      4.0
3      6.0
4      8.0
5     10.0
6     12.0
7     14.0
8     16.0
9     18.0
10    20.0
11    22.0
Name: value, dtype: float64

In [42]:
g.transform(lambda x: x.rank(ascending=False))

0     4.0
1     4.0
2     4.0
3     3.0
4     3.0
5     3.0
6     2.0
7     2.0
8     2.0
9     1.0
10    1.0
11    1.0
Name: value, dtype: float64

In [43]:
def normalize(x):
    return (x - x.mean()) / x.std()

g.transform(normalize)

0    -1.161895
1    -1.161895
2    -1.161895
3    -0.387298
4    -0.387298
5    -0.387298
6     0.387298
7     0.387298
8     0.387298
9     1.161895
10    1.161895
11    1.161895
Name: value, dtype: float64

In [44]:
g.apply(normalize)

0    -1.161895
1    -1.161895
2    -1.161895
3    -0.387298
4    -0.387298
5    -0.387298
6     0.387298
7     0.387298
8     0.387298
9     1.161895
10    1.161895
11    1.161895
Name: value, dtype: float64

Built-in aggregate functions like 'mean' or 'sum' are often much faster than a general apply function. These also have a “***fast path***” when used with transform. This allows us to perform a so-called ***unwrapped*** group operation:

In [45]:
g.transform('mean')

0     4.5
1     5.5
2     6.5
3     4.5
4     5.5
5     6.5
6     4.5
7     5.5
8     6.5
9     4.5
10    5.5
11    6.5
Name: value, dtype: float64

In [47]:
normalized = (df['value'] - g.transform('mean')) / g.transform('std')
normalized

0    -1.161895
1    -1.161895
2    -1.161895
3    -0.387298
4    -0.387298
5    -0.387298
6     0.387298
7     0.387298
8     0.387298
9     1.161895
10    1.161895
11    1.161895
Name: value, dtype: float64
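
As a quick check, the unwrapped arithmetic and the function-based transform can be compared directly. This sketch rebuilds the same `df` and `normalize` used above so that it runs on its own:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'c'] * 4,
                   'value': np.arange(12.)})
g = df.groupby('key')['value']

def normalize(x):
    return (x - x.mean()) / x.std()

# One Python function call per group
via_function = g.transform(normalize)

# Two built-in group aggregations plus vectorized arithmetic
unwrapped = (df['value'] - g.transform('mean')) / g.transform('std')

print(np.allclose(via_function, unwrapped))  # True
```

The unwrapped version runs several group aggregations, but each one uses the optimized built-in implementation, which often outweighs the extra passes.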

### 12.2.2 Grouped Time Resampling
<a id='1222'></a>

In [58]:
# For time series data, the resample method is semantically a group operation based on a time intervalization. 
N = 15
times = pd.date_range('2017-05-20 00:00', periods=N, freq='1min')
df = pd.DataFrame({'time': times,
                  'value': np.arange(N)})

df

                  time  value
0  2017-05-20 00:00:00      0
1  2017-05-20 00:01:00      1
2  2017-05-20 00:02:00      2
3  2017-05-20 00:03:00      3
4  2017-05-20 00:04:00      4
5  2017-05-20 00:05:00      5
6  2017-05-20 00:06:00      6
7  2017-05-20 00:07:00      7
8  2017-05-20 00:08:00      8
9  2017-05-20 00:09:00      9
10 2017-05-20 00:10:00     10
11 2017-05-20 00:11:00     11
12 2017-05-20 00:12:00     12
13 2017-05-20 00:13:00     13
14 2017-05-20 00:14:00     14


In [61]:
df.set_index('time').resample('5min').count()

                     value
time
2017-05-20 00:00:00      5
2017-05-20 00:05:00      5
2017-05-20 00:10:00      5


In [63]:
df2 = pd.DataFrame({'time': times.repeat(3),
                    'key': np.tile(['a', 'b', 'c'], N),
                    'value': np.arange(N * 3)})

df2[:7]

  key                time  value
0   a 2017-05-20 00:00:00      0
1   b 2017-05-20 00:00:00      1
2   c 2017-05-20 00:00:00      2
3   a 2017-05-20 00:01:00      3
4   b 2017-05-20 00:01:00      4
5   c 2017-05-20 00:01:00      5
6   a 2017-05-20 00:02:00      6


In [67]:
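# To do the same resampling for each value of 'key', we can use the pandas.Grouper object.
# One constraint when Grouper is given only freq is that the time must be the index of the Series or DataFrame.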
time_key = pd.Grouper(freq='5min')
resampled = (df2.set_index('time')
            .groupby(['key', time_key])
            .sum())

resampled

                         value
key time
a   2017-05-20 00:00:00     30
    2017-05-20 00:05:00    105
    2017-05-20 00:10:00    180
b   2017-05-20 00:00:00     35
    2017-05-20 00:05:00    110
    2017-05-20 00:10:00    185
c   2017-05-20 00:00:00     40
    2017-05-20 00:05:00    115
    2017-05-20 00:10:00    190


In [66]:
resampled.reset_index()

  key                time  value
0   a 2017-05-20 00:00:00     30
1   a 2017-05-20 00:05:00    105
2   a 2017-05-20 00:10:00    180
3   b 2017-05-20 00:00:00     35
4   b 2017-05-20 00:05:00    110
5   b 2017-05-20 00:10:00    185
6   c 2017-05-20 00:00:00     40
7   c 2017-05-20 00:05:00    115
8   c 2017-05-20 00:10:00    190
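
As an alternative to setting the index first, `pd.Grouper` also accepts a `key` argument that names the column holding the timestamps. The sketch below rebuilds `df2` from above so that it runs on its own, and compares the two approaches:

```python
import numpy as np
import pandas as pd

N = 15
times = pd.date_range('2017-05-20 00:00', periods=N, freq='1min')
df2 = pd.DataFrame({'time': times.repeat(3),
                    'key': np.tile(['a', 'b', 'c'], N),
                    'value': np.arange(N * 3)})

# Grouper reads the timestamps from the 'time' column
by_column = (df2.groupby(['key', pd.Grouper(key='time', freq='5min')])
             ['value'].sum())

# Grouper reads the timestamps from the index
by_index = (df2.set_index('time')
            .groupby(['key', pd.Grouper(freq='5min')])
            ['value'].sum())

print(by_column.tolist() == by_index.tolist())
```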


<hr>

## 12.3 Techniques for Method Chaining
<a id='123'></a>
When applying a sequence of transformations to a dataset, you may find yourself creating numerous temporary variables that are never used in your analysis. Consider this example, for instance:
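
A minimal sketch of such a workflow with temporary variables. Here `load_data` is a hypothetical stand-in that builds a small DataFrame, and the column names `key`, `col1`, and `col2` are assumptions chosen for illustration:

```python
import pandas as pd

def load_data():
    # Hypothetical loader; a real one would read from a file or database.
    return pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'a', 'b'],
                         'col1': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                         'col2': [-1, -1, 1, -1, -1, -1]})

df = load_data()

# Temporary variable 1: the filtered rows (copied so the next line can add a column safely)
df2 = df[df['col2'] < 0].copy()

# Temporary column: col1 with its overall mean subtracted
df2['col1_demeaned'] = df2['col1'] - df2['col1'].mean()

# Final result: standard deviation of the demeaned values within each key
result = df2.groupby('key')['col1_demeaned'].std()
print(result)
```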

While we’re not using any real data here, this example highlights some new methods. First, the DataFrame.assign method is a functional alternative to column assignments of the form ***df[k] = v***. Rather than modifying the object in-place, it returns a new DataFrame with the indicated modifications. So these statements are equivalent:
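
The equivalence can be sketched with a small frame. The DataFrame `df`, the column name `k`, and the values `v` are illustrative assumptions:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})
v = df['a'] * 10

# Usual non-functional way: copy, then assign the column in place
df2 = df.copy()
df2['k'] = v

# Functional way: assign returns a new DataFrame and leaves df untouched
df3 = df.assign(k=v)

print(df2.equals(df3))    # True
print('k' in df.columns)  # False
```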

Assigning in-place may execute faster than using assign, but assign enables easier method chaining:
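
A sketch of the chained form, again assuming a hypothetical `load_data` and illustrative column names. The outer parentheses let the chain span several lines:

```python
import pandas as pd

def load_data():
    # Hypothetical loader; a real one would read from a file or database.
    return pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'a', 'b'],
                         'col1': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                         'col2': [-1, -1, 1, -1, -1, -1]})

df = load_data()
df2 = df[df['col2'] < 0]

# assign returns a new DataFrame, so groupby can be chained directly onto it
result = (df2.assign(col1_demeaned=df2['col1'] - df2['col1'].mean())
          .groupby('key')['col1_demeaned']
          .std())
print(result)
```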

To show callables in action, consider a fragment of the example from before, in which the result of load_data is assigned to a variable and then filtered with a boolean condition.

This can be rewritten so that the result of load_data is not assigned to a variable. The function passed into [] is then bound to the object at that stage of the method chain:
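
A minimal sketch of both forms, followed by the whole sequence written as one chained expression. As before, `load_data` and the column names are hypothetical:

```python
import pandas as pd

def load_data():
    # Hypothetical loader; a real one would read from a file or database.
    return pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'a', 'b'],
                         'col1': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                         'col2': [-1, -1, 1, -1, -1, -1]})

# Fragment with a temporary variable
df = load_data()
df2 = df[df['col2'] < 0]

# Rewritten with a callable: x is the DataFrame returned by load_data()
df3 = load_data()[lambda x: x['col2'] < 0]
print(df2.equals(df3))  # True

# Entire sequence as a single chain; each lambda receives the frame at that stage
result = (load_data()
          [lambda x: x['col2'] < 0]
          .assign(col1_demeaned=lambda x: x['col1'] - x['col1'].mean())
          .groupby('key')['col1_demeaned']
          .std())
print(result)
```

Whether to write code this way is a matter of taste; splitting the expression into several steps may be easier to read and debug.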

### 12.3.1 The pipe Method
<a id='1231'></a>
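
`DataFrame.pipe` lets your own functions take part in a method chain: `f(df, arg)` and `df.pipe(f, arg)` are equivalent. A minimal sketch, where the helper functions `keep_negative` and `demean` and the data are hypothetical:

```python
import pandas as pd

def keep_negative(df, column):
    # Keep only the rows where the given column is negative
    return df[df[column] < 0]

def demean(df, column):
    # Add a '<column>_demeaned' column with the column mean subtracted
    new_name = column + '_demeaned'
    return df.assign(**{new_name: df[column] - df[column].mean()})

df = pd.DataFrame({'key': ['a', 'b', 'a', 'b'],
                   'col1': [1.0, 2.0, 3.0, 6.0],
                   'col2': [-1, -1, 1, -1]})

# Nested function calls read inside-out
nested = demean(keep_negative(df, 'col2'), 'col1')

# pipe expresses the same steps in the order they are applied
piped = (df.pipe(keep_negative, 'col2')
         .pipe(demean, column='col1'))

print(nested.equals(piped))  # True
```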

<hr>

[Back to top](#index)