# CH 12 - Advanced Pandas

## 12.1 - Categorical Data 

In [1]:
import numpy as np
import pandas as pd

In [2]:
values = pd.Series(['apple', 'oranges', 'apple', 'apple'] * 2)
values

0      apple
1    oranges
2      apple
3      apple
4      apple
5    oranges
6      apple
7      apple
dtype: object

In [3]:
pd.unique(values)

array(['apple', 'oranges'], dtype=object)

In [4]:
values.value_counts()

apple      6
oranges    2
dtype: int64

### Categorical Type in pandas

In [5]:
fruits = ['apple', 'orange', 'apple', 'apple'] * 2

N = len(fruits)

df = pd.DataFrame({'fruit': fruits,
                   'basket_id': np.arange(N),
                   'count': np.random.randint(3, 15, size=N),
                   'weight': np.random.uniform(0, 4, size=N)},
                  columns=['basket_id', 'fruit', 'count', 'weight'])
df

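|   | basket_id | fruit | count | weight |
|---|---|---|---|---|
| 0 | 0 | apple | 5 | 2.939915 |
| 1 | 1 | orange | 5 | 1.236828 |
| 2 | 2 | apple | 10 | 1.494231 |
| 3 | 3 | apple | 6 | 0.798839 |
| 4 | 4 | apple | 11 | 3.518871 |
| 5 | 5 | orange | 8 | 3.571194 |
| 6 | 6 | apple | 12 | 0.189354 |
| 7 | 7 | apple | 3 | 3.653015 |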


In [6]:
fruit_cat = df['fruit'].astype('category')
fruit_cat

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): [apple, orange]

#### The values for fruit_cat are not a NumPy array, but an instance of pandas.Categorical

In [7]:
c = fruit_cat.values
type(c)

pandas.core.arrays.categorical.Categorical

In [8]:
c.categories

Index(['apple', 'orange'], dtype='object')

In [9]:
c.codes

array([0, 1, 0, 0, 0, 1, 0, 0], dtype=int8)
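
A small sketch of how the two pieces fit together: each code is an integer position into `categories`, so the original labels can be rebuilt with `take`. The variable `restored` is a name introduced here for illustration only.

```python
import pandas as pd

fruits = ['apple', 'orange', 'apple', 'apple'] * 2
c = pd.Categorical(fruits)

# Each code is a position in c.categories, so take() maps codes back to labels.
restored = c.categories.take(c.codes)

print(list(restored) == fruits)
```

This is why a categorical can be stored compactly: the distinct labels are kept once, and the data itself is a small integer array.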

#### You can also create pandas.Categorical directly from other types of Python sequences

In [10]:
my_categories = pd.Categorical(['foo', 'bar', 'baz', 'foo', 'bar'])
my_categories

[foo, bar, baz, foo, bar]
Categories (3, object): [bar, baz, foo]

#### If you have obtained categorical encoded data from another source, you can use the alternative from_codes constructor

In [11]:
categories = ['foo', 'bar', 'baz']
codes = [0, 1, 2, 0, 0, 1]

my_cats_2 = pd.Categorical.from_codes(codes, categories)
my_cats_2

[foo, bar, baz, foo, foo, bar]
Categories (3, object): [foo, bar, baz]

#### Unless explicitly specified, categorical conversions assume no specific ordering of the categories. 

* When using from_codes or any of the other constructors, you can indicate that the categories have a meaningful ordering

In [12]:
ordered_cat = pd.Categorical.from_codes(codes, categories, ordered=True)
ordered_cat

[foo, bar, baz, foo, foo, bar]
Categories (3, object): [foo < bar < baz]

####  An unordered categorical instance can be made ordered with as_ordered

In [13]:
my_cats_2.as_ordered()

[foo, bar, baz, foo, foo, bar]
Categories (3, object): [foo < bar < baz]
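
As a rough illustration of what the ordering buys you, the sketch below builds an ordered categorical from hypothetical size labels (not part of the notebook's data) and uses `min`, `max`, a comparison, and sorting, all of which follow the category order rather than alphabetical order.

```python
import pandas as pd

# Hypothetical labels; the category order is given explicitly.
sizes = pd.Categorical(['small', 'large', 'medium', 'small'],
                       categories=['small', 'medium', 'large'],
                       ordered=True)

smallest = sizes.min()            # lowest category present in the data
largest = sizes.max()             # highest category present in the data
bigger_than_small = sizes > 'small'

# Sorting follows the category order, not the alphabet.
sorted_sizes = pd.Series(sizes).sort_values().tolist()

print(smallest, largest)
print(bigger_than_small)
print(sorted_sizes)
```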

* Categorical data need not be strings
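
For instance, a minimal sketch with integer values (the `ratings` data is made up for illustration) shows the same categories-plus-codes layout:

```python
import pandas as pd

# Hypothetical integer ratings converted to a categorical Series.
ratings = pd.Series([3, 1, 2, 3, 3, 1]).astype('category')

print(ratings.cat.categories)   # the distinct integers, sorted
print(ratings.cat.codes)        # positions into the categories
```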

### Computations with Categoricals

* A `Categorical` generally behaves the same way as the non-encoded version (such as an array of strings). Some parts of pandas, like the `groupby` function, perform better when working with categoricals, and some functions can make use of the `ordered` flag

* Let’s consider some random numeric data and bin it with the `pandas.qcut` function, which returns a `pandas.Categorical`

In [14]:
np.random.seed(12345)

In [15]:
draws = np.random.randn(1000)

In [16]:
draws[:5]

array([-0.20470766,  0.47894334, -0.51943872, -0.5557303 ,  1.96578057])

In [17]:
bins = pd.qcut(draws, 4)
bins[:5]

[(-0.684, -0.0101], (-0.0101, 0.63], (-0.684, -0.0101], (-0.684, -0.0101], (0.63, 3.928]]
Categories (4, interval[float64]): [(-2.9499999999999997, -0.684] < (-0.684, -0.0101] < (-0.0101, 0.63] < (0.63, 3.928]]

####  We can name the categories with the `labels` argument to `qcut`:

In [18]:
bins = pd.qcut(draws, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
bins

[Q2, Q3, Q2, Q2, Q4, ..., Q3, Q2, Q1, Q3, Q4]
Length: 1000
Categories (4, object): [Q1 < Q2 < Q3 < Q4]

In [19]:
bins.codes[:10]

array([1, 2, 1, 1, 3, 3, 2, 2, 3, 3], dtype=int8)

* The labeled bins categorical does not contain information about the bin edges in the data, so we can use groupby to extract some summary statistics

In [20]:
bins = pd.Series(bins, name='quartiles')

results = pd.Series(draws).groupby(bins).agg(['count', 'min', 'max']).reset_index()
results

|   | quartiles | count | min | max |
|---|---|---|---|---|
| 0 | Q1 | 250 | -2.949343 | -0.685484 |
| 1 | Q2 | 250 | -0.683066 | -0.010115 |
| 2 | Q3 | 250 | -0.010032 | 0.628894 |
| 3 | Q4 | 250 | 0.634238 | 3.927528 |


### Better performance with categoricals

* If you do a lot of analytics on a particular dataset, converting to categorical can yield substantial overall performance gains. A categorical version of a DataFrame column will often use significantly less memory, too. 

In [21]:
N = 10000000

draws = pd.Series(np.random.randn(N))

labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (N // 4))

# Now we convert labels to categorical
categories = labels.astype('category')

In [22]:
# Now we note that labels uses significantly more memory than categories:
labels.memory_usage()

80000128

In [23]:
categories.memory_usage()

10000320
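
One caveat worth sketching: for an object-dtype column, `memory_usage()` without `deep=True` counts only the 8-byte pointers, not the Python strings they point to. The example below uses a smaller, hypothetical `N` so it runs quickly, and forces `dtype=object` so the comparison does not depend on the pandas version's default string dtype.

```python
import pandas as pd

N = 100_000  # smaller than the notebook's N, purely to keep the sketch fast

labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (N // 4), dtype=object)
categories = labels.astype('category')

shallow = labels.memory_usage()            # pointers only
deep = labels.memory_usage(deep=True)      # pointers plus the string objects
cat_deep = categories.memory_usage(deep=True)

print(shallow, deep, cat_deep)
```

With `deep=True` the gap between the two representations is typically even larger than the shallow figures suggest.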

#### The conversion to category is not free, of course, but it is a one-time cost:

In [24]:
%time _= labels.astype('category')

CPU times: user 451 ms, sys: 142 ms, total: 593 ms
Wall time: 594 ms


### Categorical Methods

* Series containing categorical data have several special methods, similar to the specialized string methods of `Series.str`. These methods also provide convenient access to the categories and codes.

In [25]:
s = pd.Series(['a', 'b', 'c', 'd'] * 2)

cat_s = s.astype('category')

cat_s

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (4, object): [a, b, c, d]

#### The special attribute cat provides access to categorical methods:

In [26]:
cat_s.cat.codes

0    0
1    1
2    2
3    3
4    0
5    1
6    2
7    3
dtype: int8

In [27]:
cat_s.cat.categories

Index(['a', 'b', 'c', 'd'], dtype='object')

* Suppose that we know the actual set of categories for this data extends beyond the four values observed in the data. We can use the set_categories method to change them

In [28]:
actual_categories = ['a', 'b', 'c', 'd', 'e']

cat_s2 = cat_s.cat.set_categories(actual_categories)

In [29]:
cat_s2.value_counts()

d    2
c    2
b    2
a    2
e    0
dtype: int64
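
`set_categories` can also narrow the set. As a hedged sketch (the variable names are illustrative), values whose category is dropped become missing rather than raising an error:

```python
import pandas as pd

cat_s = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')

# Keep only 'a' and 'b'; rows that held 'c' or 'd' turn into NaN.
narrowed = cat_s.cat.set_categories(['a', 'b'])
n_missing = narrowed.isna().sum()

print(narrowed.cat.categories)
print(n_missing)
```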

#### We can use the `remove_unused_categories` method to trim unobserved categories

In [30]:
cat_s3 = cat_s[cat_s.isin(['a', 'b'])]

cat_s3

0    a
1    b
4    a
5    b
dtype: category
Categories (4, object): [a, b, c, d]

In [31]:
cat_s3.cat.remove_unused_categories()

0    a
1    b
4    a
5    b
dtype: category
Categories (2, object): [a, b]

### Creating dummy variables for modeling

In [32]:
cat_s = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')

In [33]:
pd.get_dummies(cat_s2)

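|   | a | b | c | d | e |
|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 | 0 |
| 2 | 0 | 0 | 1 | 0 | 0 |
| 3 | 0 | 0 | 0 | 1 | 0 |
| 4 | 1 | 0 | 0 | 0 | 0 |
| 5 | 0 | 1 | 0 | 0 | 0 |
| 6 | 0 | 0 | 1 | 0 | 0 |
| 7 | 0 | 0 | 0 | 1 | 0 |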
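
The `e` column above exists only because the Series carries `e` as a category. A minimal sketch of the difference, assuming a `CategoricalDtype` named `full` introduced here for illustration (`dtype=int` is passed so the output is 0/1 regardless of the pandas version's default):

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'd'] * 2)
full = pd.CategoricalDtype(['a', 'b', 'c', 'd', 'e'])

# Plain strings: one column per value actually observed.
plain = pd.get_dummies(s, dtype=int)

# Categorical: one column per declared category, observed or not.
with_all = pd.get_dummies(s.astype(full), dtype=int)

print(list(plain.columns))
print(list(with_all.columns))
```

Declaring the full category set up front can help keep the dummy matrix the same shape across different samples of the data.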


## 12.2 - Advanced GroupBy Use

### Group Transforms and “Unwrapped” GroupBys

`transform` is similar to apply but imposes more constraints on the kind of function you can use:
* It can produce a scalar value to be broadcast to the shape of the group
* It can produce an object of the same shape as the input group
* It must not mutate its input

In [34]:
df = pd.DataFrame({'key': ['a', 'b', 'c'] * 4, 'value': np.arange(12.)})
df

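|   | key | value |
|---|---|---|
| 0 | a | 0.0 |
| 1 | b | 1.0 |
| 2 | c | 2.0 |
| 3 | a | 3.0 |
| 4 | b | 4.0 |
| 5 | c | 5.0 |
| 6 | a | 6.0 |
| 7 | b | 7.0 |
| 8 | c | 8.0 |
| 9 | a | 9.0 |
| 10 | b | 10.0 |
| 11 | c | 11.0 |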


In [37]:
g = df.groupby('key').value
g.mean()

key
a    4.5
b    5.5
c    6.5
Name: value, dtype: float64

#### Suppose instead we wanted to produce a Series of the same shape as `df['value']` but with values replaced by the average grouped by 'key'. 

In [39]:
g.transform('mean')

0     4.5
1     5.5
2     6.5
3     4.5
4     5.5
5     6.5
6     4.5
7     5.5
8     6.5
9     4.5
10    5.5
11    6.5
Name: value, dtype: float64

In [40]:
g.transform(lambda x: x*2)

0      0.0
1      2.0
2      4.0
3      6.0
4      8.0
5     10.0
6     12.0
7     14.0
8     16.0
9     18.0
10    20.0
11    22.0
Name: value, dtype: float64

#### As a more complicated example, we can compute the ranks in descending order for each group

In [42]:
g.transform(lambda x: x.rank(ascending=False))

0     4.0
1     4.0
2     4.0
3     3.0
4     3.0
5     3.0
6     2.0
7     2.0
8     2.0
9     1.0
10    1.0
11    1.0
Name: value, dtype: float64

#### Consider a group transformation function composed from simple aggregations

#### We can obtain equivalent results in this case either using `transform` or `apply`

In [44]:
def normalize(x):
    return (x - x.mean()) / x.std()

In [45]:
g.transform(normalize)

0    -1.161895
1    -1.161895
2    -1.161895
3    -0.387298
4    -0.387298
5    -0.387298
6     0.387298
7     0.387298
8     0.387298
9     1.161895
10    1.161895
11    1.161895
Name: value, dtype: float64

In [46]:
g.apply(normalize)

0    -1.161895
1    -1.161895
2    -1.161895
3    -0.387298
4    -0.387298
5    -0.387298
6     0.387298
7     0.387298
8     0.387298
9     1.161895
10    1.161895
11    1.161895
Name: value, dtype: float64
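
The "unwrapped" form named in the section title can be sketched as follows: instead of passing a Python function to `transform`, compute the built-in aggregations `'mean'` and `'std'` per group and do the arithmetic on whole columns. The names `wrapped` and `unwrapped` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'c'] * 4, 'value': np.arange(12.)})
g = df.groupby('key')['value']

def normalize(x):
    return (x - x.mean()) / x.std()

# Wrapped: the Python function is called once per group.
wrapped = g.transform(normalize)

# Unwrapped: built-in group aggregations, then plain vectorized arithmetic.
unwrapped = (df['value'] - g.transform('mean')) / g.transform('std')

print(np.allclose(wrapped, unwrapped))
```

Built-in aggregations such as `'mean'` take an optimized path, so the unwrapped form is often faster on large data even though it makes several passes over the groups.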

### Grouped Time Resampling

* For time series data, the `resample` method is semantically a group operation based on a time intervalization. 

In [80]:
N = 15

times = pd.date_range('2017-05-20 00:00', freq='1min', periods=N)

df = pd.DataFrame({'time': times,
                   'value': np.arange(N)})
df

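|   | time | value |
|---|---|---|
| 0 | 2017-05-20 00:00:00 | 0 |
| 1 | 2017-05-20 00:01:00 | 1 |
| 2 | 2017-05-20 00:02:00 | 2 |
| 3 | 2017-05-20 00:03:00 | 3 |
| 4 | 2017-05-20 00:04:00 | 4 |
| 5 | 2017-05-20 00:05:00 | 5 |
| 6 | 2017-05-20 00:06:00 | 6 |
| 7 | 2017-05-20 00:07:00 | 7 |
| 8 | 2017-05-20 00:08:00 | 8 |
| 9 | 2017-05-20 00:09:00 | 9 |
| 10 | 2017-05-20 00:10:00 | 10 |
| 11 | 2017-05-20 00:11:00 | 11 |
| 12 | 2017-05-20 00:12:00 | 12 |
| 13 | 2017-05-20 00:13:00 | 13 |
| 14 | 2017-05-20 00:14:00 | 14 |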


#### We can index by 'time' and then resample

In [81]:
df.set_index('time').resample('5min').count()

| time | value |
|---|---|
| 2017-05-20 00:00:00 | 5 |
| 2017-05-20 00:05:00 | 5 |
| 2017-05-20 00:10:00 | 5 |


#### Suppose that a DataFrame contains multiple time series, marked by an additional group key column:

In [82]:
df2 = pd.DataFrame({'time': times.repeat(3),
                    'key': np.tile(['a', 'b', 'c'], N),
                    'value': np.arange(N * 3.)})
df2[:7]

|   | time | key | value |
|---|---|---|---|
| 0 | 2017-05-20 00:00:00 | a | 0.0 |
| 1 | 2017-05-20 00:00:00 | b | 1.0 |
| 2 | 2017-05-20 00:00:00 | c | 2.0 |
| 3 | 2017-05-20 00:01:00 | a | 3.0 |
| 4 | 2017-05-20 00:01:00 | b | 4.0 |
| 5 | 2017-05-20 00:01:00 | c | 5.0 |
| 6 | 2017-05-20 00:02:00 | a | 6.0 |


#### To do the same resampling for each value of 'key', we introduce the `pandas.Grouper` object

#### We can then group by `'key'` and `time_key` and aggregate. Because the `Grouper` names its column with `key='time'`, the time column does not need to be set as the index:

In [83]:
time_key = pd.Grouper(key='time', freq='5min')

resampled = (df2.groupby(['key', time_key])
             .sum())
resampled

| key | time | value |
|---|---|---|
| a | 2017-05-20 00:00:00 | 30.0 |
| a | 2017-05-20 00:05:00 | 105.0 |
| a | 2017-05-20 00:10:00 | 180.0 |
| b | 2017-05-20 00:00:00 | 35.0 |
| b | 2017-05-20 00:05:00 | 110.0 |
| b | 2017-05-20 00:10:00 | 185.0 |
| c | 2017-05-20 00:00:00 | 40.0 |
| c | 2017-05-20 00:05:00 | 115.0 |
| c | 2017-05-20 00:10:00 | 190.0 |
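
One way to sanity-check the grouped result is to resample a single key by hand and compare. The sketch below rebuilds the same data and does this for key `'a'`; `only_a` and `grouped` are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

N = 15
times = pd.date_range('2017-05-20 00:00', freq='1min', periods=N)
df2 = pd.DataFrame({'time': times.repeat(3),
                    'key': np.tile(['a', 'b', 'c'], N),
                    'value': np.arange(N * 3.)})

# By hand: filter one key, index by time, resample.
only_a = (df2[df2['key'] == 'a']
          .set_index('time')['value']
          .resample('5min')
          .sum())

# With Grouper: all keys in one pass.
time_key = pd.Grouper(key='time', freq='5min')
grouped = df2.groupby(['key', time_key])['value'].sum()

print(only_a.tolist())
print(grouped.loc['a'].tolist())
```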


## 12.3 - Techniques for Method Chaining

* The DataFrame.assign method is a functional alternative to column assignments of the form `df[k] = v`. Rather than modifying the object in-place, it returns a new DataFrame with the indicated modifications. 

In [84]:
# Usual non-functional way
df2 = df.copy()
df2['k'] = df['value'] * 2
df2[:5]

|   | time | value | k |
|---|---|---|---|
| 0 | 2017-05-20 00:00:00 | 0 | 0 |
| 1 | 2017-05-20 00:01:00 | 1 | 2 |
| 2 | 2017-05-20 00:02:00 | 2 | 4 |
| 3 | 2017-05-20 00:03:00 | 3 | 6 |
| 4 | 2017-05-20 00:04:00 | 4 | 8 |


In [85]:
# Functional assign way
df2 = df.assign(k=df['value'] * 2)
df2[:5]

|   | time | value | k |
|---|---|---|---|
| 0 | 2017-05-20 00:00:00 | 0 | 0 |
| 1 | 2017-05-20 00:01:00 | 1 | 2 |
| 2 | 2017-05-20 00:02:00 | 2 | 4 |
| 3 | 2017-05-20 00:03:00 | 3 | 6 |
| 4 | 2017-05-20 00:04:00 | 4 | 8 |
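
A quick sketch of the "returns a new DataFrame" point, using a small stand-in frame rather than the time series above: after `assign`, the original object has no `k` column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': np.arange(5)})

# assign returns a modified copy; df itself is left untouched.
df2 = df.assign(k=df['value'] * 2)

print('k' in df.columns)    # original
print('k' in df2.columns)   # result of assign
```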


#### Assigning in-place may execute faster than using assign, but assign enables easier method chaining

In [87]:
result = df2.assign(col1_prod=df2.value * df2.k.mean()) \
            .groupby('value') \
            .col1_prod.sum()
result[:5]

value
0     0.0
1    14.0
2    28.0
3    42.0
4    56.0
Name: col1_prod, dtype: float64

#### One thing to keep in mind when doing method chaining is that you may need to refer to temporary objects. 

#### To help with this, `assign` and many other pandas functions accept function-like arguments, also known as *callables*.

In [74]:
import seaborn as sns

df = sns.load_dataset('iris')
df2 = df[df['sepal_length'] < 5]

df2[:5]

|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | setosa |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | setosa |


In [77]:
# This can be rewritten as:

df = sns.load_dataset('iris') \
        [lambda x: x['sepal_length'] < 5]
df[:5]

|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | setosa |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | setosa |


#### Here, the result of `load_dataset` is not assigned to a variable, so the function passed into `[]` is then *bound* to the object at that stage of the method chain.

#### We can continue, then, and write the entire sequence as a single chained expression:

In [92]:
result = sns.load_dataset('iris') \
         [lambda x: x['sepal_length'] < 5] \
         .assign(col1_demeaned=lambda x: x.sepal_length - x.sepal_length.mean()) \
         .groupby('species') \
         .col1_demeaned.std()

* Whether you prefer to write code in this style is a matter of taste, and splitting up the
expression into multiple steps may make your code more readable.
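
For readers without network access to the iris data, here is a hedged, offline sketch of the same chain. The `flowers` frame is a made-up stand-in, and the chain is wrapped in parentheses instead of using backslash continuations, which is a common stylistic alternative.

```python
import pandas as pd

# Hypothetical stand-in for the iris data so the example runs offline.
flowers = pd.DataFrame({
    'species': ['setosa', 'setosa', 'setosa',
                'virginica', 'virginica', 'virginica'],
    'sepal_length': [4.9, 4.7, 5.1, 4.9, 4.8, 6.3],
})

result = (
    flowers[lambda x: x['sepal_length'] < 5]       # filter via a callable
    .assign(col1_demeaned=lambda x: x['sepal_length']
                                    - x['sepal_length'].mean())
    .groupby('species')['col1_demeaned']
    .std()
)

print(result)
```

Each lambda receives the DataFrame as it exists at that point in the chain, so the mean in `assign` is computed over the filtered rows only.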

### The pipe Method

You can accomplish a lot with built-in pandas functions and the approaches to method chaining with callables that we just looked at. However, sometimes you need to use your own functions or functions from third-party libraries. This is where the pipe method comes in.     

Consider a sequence of function calls:     

``` 
 a = f(df, arg1=v1)     
 b = g(a, v2, arg3=v3)       
 c = h(b, arg4=v4)
```

When using functions that accept and return Series or DataFrame objects, you can rewrite this using calls to pipe:

```
result = (df.pipe(f, arg1=v1)       
            .pipe(g, v2, arg3=v3)      
            .pipe(h, arg4=v4))
```
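
A small runnable sketch of this pattern, using two hypothetical functions (`add_constant` and `scale` are invented here and are not pandas functions), shows that the nested and piped forms produce the same result:

```python
import pandas as pd

def add_constant(df, value=0):
    """Hypothetical step: add a constant to every cell."""
    return df + value

def scale(df, factor, clip_at=None):
    """Hypothetical step: multiply, then optionally cap the values."""
    out = df * factor
    return out if clip_at is None else out.clip(upper=clip_at)

df = pd.DataFrame({'x': [1, 2, 3]})

# Nested calls read inside-out.
nested = scale(add_constant(df, value=1), 10, clip_at=35)

# pipe passes the DataFrame as the first argument and reads top-down.
piped = (df.pipe(add_constant, value=1)
           .pipe(scale, 10, clip_at=35))

print(piped.equals(nested))
```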

The statements `f(df)` and `df.pipe(f)` are equivalent, but `pipe` makes chained invocation easier.     
A potentially useful pattern for `pipe` is to generalize sequences of operations into reusable functions. As an example, let’s consider subtracting group means from a column:

```
g = df.groupby(['key1', 'key2'])
df['col1'] = df['col1'] - g['col1'].transform('mean')
```

Suppose that you wanted to be able to demean more than one column and easily change the group keys. Additionally, you might want to perform this transformation in a method chain. Here is an example implementation:

```
def group_demean(df, by, cols):
    result = df.copy()
    g = df.groupby(by)
    for c in cols:
        result[c] = df[c] - g[c].transform('mean')
    return result
```

Then it is possible to write:
```
result = (df[df.col1 < 0]
 .pipe(group_demean, ['key1', 'key2'], ['col1']))
```
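
To see the pieces working end to end, here is a self-contained sketch with made-up data (the values in `df` are purely illustrative). After demeaning, each group's `col1` values are centered on zero.

```python
import pandas as pd

def group_demean(df, by, cols):
    result = df.copy()
    g = df.groupby(by)
    for c in cols:
        # Subtract each row's group mean from its value.
        result[c] = df[c] - g[c].transform('mean')
    return result

# Hypothetical sample data.
df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['x', 'x', 'y', 'y', 'x'],
    'col1': [-1.0, -3.0, -2.0, 4.0, -5.0],
})

result = (df[df.col1 < 0]
          .pipe(group_demean, ['key1', 'key2'], ['col1']))

print(result)
```

Note that the filter runs first, so the group means are computed over the remaining negative rows only.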