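# Categorical Data
> A kind of statistical data that records which category an observation belongs to. A categorical attribute takes a finite (but possibly large) number of distinct values, and those values have no inherent order.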

Frequently, a column in a table contains repeated instances of a smaller set of distinct values. Functions like `unique` and `value_counts` let us extract the distinct values from an array and compute their frequencies.

In [1]:
import numpy as np
import pandas as pd
np.random.seed(12345)
PREVIOUS_MAX_ROWS = pd.options.display.max_rows
pd.options.display.max_rows = 20
np.set_printoptions(precision=4, suppress=True)

In [2]:
values = pd.Series(["apple", 'orange', 'apple', 'apple'] * 2)

In [3]:
values

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
dtype: object

`pd.unique` returns the unique values in `values` in order of appearance; it does not sort them. For long sequences it is typically faster than `np.unique`.

In [4]:
pd.unique(values)

array(['apple', 'orange'], dtype=object)
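
A minimal sketch of the ordering difference, using hypothetical sample data (`fruits` is not part of the notes above) chosen so that appearance order and sorted order differ:

```python
import numpy as np
import pandas as pd

# Hypothetical data: "orange" appears first but sorts last.
fruits = pd.Series(["orange", "apple", "orange", "banana"])

by_appearance = pd.unique(fruits)            # keeps first-seen order
sorted_unique = np.unique(fruits.to_numpy()) # sorts the distinct values

print(list(by_appearance))  # ['orange', 'apple', 'banana']
print(list(sorted_unique))  # ['apple', 'banana', 'orange']
```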

In [5]:
values.value_counts()

apple     6
orange    2
dtype: int64

Many data systems (for data warehousing, statistical computing, or other uses) have developed specialized approaches for representing data with repeated values for more efficient storage and computation.

In data warehousing, a best practice is to use so-called **dimension tables** containing the distinct values and storing the primary observations as integer keys referencing the dimension table:

In [6]:
values = pd.Series([0, 1, 0, 0] * 2)

In [7]:
dim = pd.Series(['apple', 'orange'])

In [8]:
values

0    0
1    1
2    0
3    0
4    0
5    1
6    0
7    0
dtype: int64

In [9]:
dim

0     apple
1    orange
dtype: object

Use the `take` method to restore the original Series of strings:

In [10]:
dim.take(values)

0     apple
1    orange
0     apple
0     apple
0     apple
1    orange
0     apple
0     apple
dtype: object

> `Series.take(indices, axis=0, *args, **kwargs)`  
> Return the elements at the given *positional* indices along the axis.

This representation as integers is called the **categorical** or **dictionary-encoded** representation.  
The array of distinct values can be called the **categories, dictionary, or levels** of the data. *Python for Data Analysis* uses the terms **categorical** and **categories**.  
The integer values that reference the categories are called the **category codes** or simply **codes**.
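
The encoding step itself can be sketched with `pd.factorize`, which is one way (not necessarily the one the notes have in mind) to split a column into codes and categories. The variable names here are illustrative:

```python
import pandas as pd

values = pd.Series(["apple", "orange", "apple", "apple"] * 2)

# factorize returns integer codes plus the distinct values in order of appearance
codes, categories = pd.factorize(values)

# take() maps each code back to its category, restoring the original strings
restored = pd.Series(categories.take(codes))

print(codes.tolist())    # [0, 1, 0, 0, 0, 1, 0, 0]
print(list(categories))  # ['apple', 'orange']
```

Here `codes` plays the role of the integer keys and `categories` the role of the dimension table.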

### 1 Categorical Type in pandas

pandas has a special `Categorical` type for holding data that uses the integer-based categorical representation or *encoding*.

In [11]:
fruits = ['apple', 'orange', 'apple', 'apple'] * 2

In [12]:
N = len(fruits)

In [13]:
df = pd.DataFrame({"fruit": fruits, 'basket_id': np.arange(N),
                   'count': np.random.randint(3, 15, size=N),
                   'weight': np.random.uniform(0, 4, size=N)},
                  columns=['basket_id', 'fruit', 'count', 'weight'])

In [14]:
df

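| | basket_id | fruit | count | weight |
| ---: | ---: | :--- | ---: | ---: |
| 0 | 0 | apple | 5 | 3.858058 |
| 1 | 1 | orange | 8 | 2.612708 |
| 2 | 2 | apple | 4 | 2.995627 |
| 3 | 3 | apple | 7 | 2.614279 |
| 4 | 4 | apple | 12 | 2.990859 |
| 5 | 5 | orange | 8 | 3.845227 |
| 6 | 6 | apple | 5 | 0.033553 |
| 7 | 7 | apple | 4 | 0.425778 |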


`df['fruit']` is an array of Python string objects. Convert it to categorical by calling:

In [15]:
fruit_cat = df['fruit'].astype('category')
fruit_cat

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): [apple, orange]

In [16]:
df['fruit']

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: object

The values for `fruit_cat` are not a NumPy array, but an instance of `pandas.Categorical`:

In [17]:
c = fruit_cat.values

In [18]:
c

[apple, orange, apple, apple, apple, orange, apple, apple]
Categories (2, object): [apple, orange]

In [19]:
type(c)

pandas.core.categorical.Categorical

The `Categorical` object has `categories` and `codes` attributes:

In [20]:
c.categories

Index(['apple', 'orange'], dtype='object')

In [21]:
c.codes

array([0, 1, 0, 0, 0, 1, 0, 0], dtype=int8)
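
A small sketch of how `codes` and `categories` fit together. The `int8` above is, as far as I know, pandas picking the smallest signed integer type that can hold the codes; the 200-category example is a made-up illustration of that assumption:

```python
import numpy as np
import pandas as pd

c = pd.Categorical(["apple", "orange", "apple", "apple"] * 2)

# Mapping from each integer code to its category label
code_to_label = dict(enumerate(c.categories))
print(code_to_label)  # {0: 'apple', 1: 'orange'}

# Decoding: look up every code in the categories index
decoded = c.categories.take(c.codes)

# With more distinct values than int8 can index, a wider code type is used
many = pd.Categorical(range(200))
print(many.codes.dtype)  # int16
```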

**Convert a DataFrame column to categorical** by assigning the converted result:

In [22]:
df['fruit'] = df['fruit'].astype('category')
df['fruit']

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): [apple, orange]

**Create `pandas.Categorical` directly from other types of Python sequences:**

In [23]:
my_categories = pd.Categorical(['foo', 'bar', 'baz', 'foo', 'bar'])
my_categories

[foo, bar, baz, foo, bar]
Categories (3, object): [bar, baz, foo]

If you **have obtained categorical encoded data from another source**, you can **use the alternative `from_codes` constructor**:

In [24]:
categories = ['foo', 'bar', 'baz']

In [25]:
codes = [0, 1, 2, 0, 0, 1]

In [26]:
my_cat_2 = pd.Categorical.from_codes(codes, categories)
my_cat_2

[foo, bar, baz, foo, foo, bar]
Categories (3, object): [foo, bar, baz]
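
When codes come from an external system, two details are worth checking. The sketch below assumes the usual pandas convention that `-1` marks a missing value and that any other out-of-range code is rejected; the input lists are hypothetical:

```python
import pandas as pd

categories = ["foo", "bar", "baz"]
codes_with_missing = [0, -1, 2, 1]  # -1 marks a missing observation

cat = pd.Categorical.from_codes(codes_with_missing, categories)
print(cat.isna().tolist())  # [False, True, False, False]

# A code that does not point into `categories` raises an error
try:
    pd.Categorical.from_codes([0, 3], categories)
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True
```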

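Unless explicitly specified, categorical conversions assume no particular ordering of the categories, so the `categories` array may be in a different order depending on the ordering of the input data. When using `from_codes` or any of the other constructors, you can indicate that the categories have a meaningful ordering: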

In [27]:
order_cat = pd.Categorical.from_codes(codes, categories, ordered=True)

In [28]:
order_cat

[foo, bar, baz, foo, foo, bar]
Categories (3, object): [foo < bar < baz]

> The output `[foo < bar < baz]` indicates that 'foo' precedes 'bar' in the ordering, and so on.

An unordered categorical instance can be made ordered with `as_ordered`:

In [29]:
my_cat_2.as_ordered()

[foo, bar, baz, foo, foo, bar]
Categories (3, object): [foo < bar < baz]

Conversely, an ordered categorical can be made unordered with `as_unordered`:

In [30]:
order_cat.as_unordered()

[foo, bar, baz, foo, foo, bar]
Categories (3, object): [foo, bar, baz]
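
What the ordered flag buys in practice can be sketched as follows. The `sizes` data is hypothetical; the point is that sorting, `min`/`max`, and comparisons follow the declared category order rather than alphabetical order:

```python
import pandas as pd

sizes = pd.Categorical(
    ["medium", "small", "large", "small"],
    categories=["small", "medium", "large"],  # declared order, not alphabetical
    ordered=True,
)
s = pd.Series(sizes)

sorted_values = s.sort_values().tolist()
smallest, largest = s.min(), s.max()
bigger_than_small = (s > "small").tolist()

print(sorted_values)      # ['small', 'small', 'medium', 'large']
print(smallest, largest)  # small large
print(bigger_than_small)  # [True, False, True, False]
```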

***Categorical data need not be strings. A categorical array can consist of any immutable value types.***
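
For instance, integers and timestamps work as categories too. A brief sketch with made-up values:

```python
import pandas as pd

# Integer-valued categories (hypothetical ratings)
ratings = pd.Categorical([3, 1, 2, 3, 3])
print(ratings.categories.tolist())  # [1, 2, 3]
print(ratings.codes.tolist())       # [2, 0, 1, 2, 2]

# Timestamp-valued categories (hypothetical dates)
dates = pd.Categorical(pd.to_datetime(["2020-01-02", "2020-01-01", "2020-01-02"]))
print(dates.codes.tolist())         # [1, 0, 1]
```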

### 2 Computations with Categorical

Using `Categorical` in pandas generally behaves the same way as the non-encoded version (such as an array of strings). Some parts of pandas, like the `groupby` function, perform better when working with categoricals. There are also some functions that can utilize the *ordered* flag.

Consider some random numeric data, binned with the `pandas.qcut` function, which returns a `pandas.Categorical`:

In [34]:
np.random.seed(12345)
draws = np.random.randn(1000)

In [35]:
draws[:5]

array([-0.2047,  0.4789, -0.5194, -0.5557,  1.9658])

In [36]:
bins = pd.qcut(draws, 4)

In [37]:
bins

[(-0.684, -0.0101], (-0.0101, 0.63], (-0.684, -0.0101], (-0.684, -0.0101], (0.63, 3.928], ..., (-0.0101, 0.63], (-0.684, -0.0101], (-2.9499999999999997, -0.684], (-0.0101, 0.63], (0.63, 3.928]]
Length: 1000
Categories (4, interval[float64]): [(-2.9499999999999997, -0.684] < (-0.684, -0.0101] < (-0.0101, 0.63] < (0.63, 3.928]]

While useful, the exact sample quartiles may be less convenient for a report than quartile names. Pass the `labels` argument to `qcut` to get named bins:

In [38]:
bins = pd.qcut(draws, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])

In [39]:
bins

[Q2, Q3, Q2, Q2, Q4, ..., Q3, Q2, Q1, Q3, Q4]
Length: 1000
Categories (4, object): [Q1 < Q2 < Q3 < Q4]

In [41]:
bins.codes[:10]

array([1, 2, 1, 1, 3, 3, 2, 2, 3, 3], dtype=int8)

In [42]:
bins.categories

Index(['Q1', 'Q2', 'Q3', 'Q4'], dtype='object')

The labeled `bins` categorical does not contain information about the bin edges in the data, so we can use `groupby` to extract some summary statistics:

In [43]:
bins = pd.Series(bins, name='quartile')

In [45]:
results = (pd.Series(draws)
          .groupby(bins)
          .agg(['count', 'min', 'max'])
          .reset_index())

In [46]:
results

| | quartile | count | min | max |
| ---: | :--- | ---: | ---: | ---: |
| 0 | Q1 | 250 | -2.949343 | -0.685484 |
| 1 | Q2 | 250 | -0.683066 | -0.010115 |
| 2 | Q3 | 250 | -0.010032 | 0.628894 |
| 3 | Q4 | 250 | 0.634238 | 3.927528 |


In [51]:
(pd.Series(draws)
          .groupby(bins)
          .agg(['count', 'min', 'max']))

| quartile | count | min | max |
| :--- | ---: | ---: | ---: |
| Q1 | 250 | -2.949343 | -0.685484 |
| Q2 | 250 | -0.683066 | -0.010115 |
| Q3 | 250 | -0.010032 | 0.628894 |
| Q4 | 250 | 0.634238 | 3.927528 |


The 'quartile' column in the result retains the original categorical information, including ordering, from `bins`:

In [53]:
results['quartile']

0    Q1
1    Q2
2    Q3
3    Q4
Name: quartile, dtype: category
Categories (4, object): [Q1 < Q2 < Q3 < Q4]
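
If the bin edges themselves are needed alongside the labels, one option is the `retbins=True` argument of `qcut`, which returns the edges as a second value. This is a sketch with its own random data (the seed and the names `rng` and `edges` are arbitrary choices), not a replacement for the `groupby` summary above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed for this sketch
draws = rng.standard_normal(1000)

labels = ["Q1", "Q2", "Q3", "Q4"]
bins, edges = pd.qcut(draws, 4, labels=labels, retbins=True)

# Four bins are delimited by five edges
print(len(edges))  # 5
print(edges)
```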

#### Better performance with categorical

If you do a lot of analytics on a particular dataset, converting to categorical can **yield substantial overall performance gains**. A categorical version of a DataFrame column will often **use significantly less memory**, too.

In [54]:
# consider some Series with 10 million elements and a smaller number of distinct categories:
N = 10000000
draws = pd.Series(np.random.randn(N))
labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (N // 4))

Convert `labels` to categorical:

In [55]:
categories = labels.astype('category')

`labels` uses significantly more memory than `categories`:

In [57]:
labels.memory_usage()

80000080

In [58]:
categories.memory_usage()

10000272
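
One caveat when reading these numbers: for an object-dtype Series, `memory_usage()` without `deep=True` counts, as I understand it, only the per-element references and not the Python string objects they point to. The sketch below repeats the comparison on a smaller, hypothetical `N` with `deep=True`; exact byte counts depend on the pandas version and platform, so none are shown:

```python
import pandas as pd

N = 100_000  # smaller than the example above so this runs quickly
labels = pd.Series(["foo", "bar", "baz", "qux"] * (N // 4))
categories = labels.astype("category")

shallow = labels.memory_usage()            # per-element storage only
deep = labels.memory_usage(deep=True)      # also counts the string objects
cat_deep = categories.memory_usage(deep=True)

print(f"labels (shallow): {shallow:>10,d} bytes")
print(f"labels (deep):    {deep:>10,d} bytes")
print(f"categories:       {cat_deep:>10,d} bytes")
```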

The conversion to category is not free, but it is a one-time cost:

In [59]:
%time _ = labels.astype('category')

Wall time: 443 ms


GroupBy operations can be significantly faster with categoricals because the underlying algorithms use the integer-based codes array instead of an array of strings.
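
A rough way to check this on your own machine is sketched below. The helper `timed_group_mean` is hypothetical, the data size is arbitrary, and a single `perf_counter` measurement is noisy, so treat the timings as indicative only:

```python
import time

import numpy as np
import pandas as pd

N = 400_000
rng = np.random.default_rng(0)  # arbitrary seed
draws = pd.Series(rng.standard_normal(N))
labels = pd.Series(["foo", "bar", "baz", "qux"] * (N // 4))
categories = labels.astype("category")


def timed_group_mean(keys):
    """Group `draws` by `keys`, returning the group means and elapsed seconds."""
    start = time.perf_counter()
    result = draws.groupby(keys, observed=True).mean()
    return result, time.perf_counter() - start


by_string, t_string = timed_group_mean(labels)
by_category, t_category = timed_group_mean(categories)

print(f"string keys:      {t_string:.4f} s")
print(f"categorical keys: {t_category:.4f} s")
```

Both calls produce the same group means; only the key representation differs.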

### 3 Categorical Methods

Series containing categorical data have several special methods similar to the `Series.str` specialized string methods. These also provide convenient access to the categories and codes.

In [60]:
s = pd.Series(list('abcd') * 2)

In [61]:
cat_s = s.astype('category')

In [62]:
cat_s

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (4, object): [a, b, c, d]

In [64]:
cat_s.values.codes

array([0, 1, 2, 3, 0, 1, 2, 3], dtype=int8)

In [65]:
cat_s.values.categories

Index(['a', 'b', 'c', 'd'], dtype='object')

The special attribute `cat` provides access to categorical methods:

In [66]:
cat_s.cat.codes

0    0
1    1
2    2
3    3
4    0
5    1
6    2
7    3
dtype: int8

In [67]:
cat_s.cat.categories

Index(['a', 'b', 'c', 'd'], dtype='object')

In [68]:
cat_s.cat

<pandas.core.categorical.CategoricalAccessor object at 0x000001AD6D52E438>

Suppose that we know the actual set of categories for this data extends beyond the four values observed in the data. We can use the `set_categories` method to change them:

In [69]:
actual_categories = list('abcde')

In [70]:
cat_s_2 = cat_s.cat.set_categories(actual_categories)

In [71]:
cat_s_2

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (5, object): [a, b, c, d, e]

In [75]:
cat_s_2.values.codes

array([0, 1, 2, 3, 0, 1, 2, 3], dtype=int8)

In [76]:
cat_s_2.values.categories

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

While it appears that the data is unchanged, the new categories will be reflected in operations that use them. For example, `value_counts` respects the categories, if present:

In [77]:
cat_s.value_counts()

d    2
c    2
b    2
a    2
dtype: int64

In [78]:
cat_s_2.value_counts()

d    2
c    2
b    2
a    2
e    0
dtype: int64
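
`set_categories` can also shrink the set of categories. A minimal sketch of that direction (the name `narrowed` is illustrative): values whose category is dropped become missing.

```python
import pandas as pd

cat_s = pd.Series(list("abcd") * 2, dtype="category")

# Keep only 'a' and 'b'; the 'c' and 'd' observations become NaN
narrowed = cat_s.cat.set_categories(["a", "b"])

print(list(narrowed.cat.categories))  # ['a', 'b']
print(int(narrowed.isna().sum()))     # 4
```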

In large datasets, categoricals are often used as a convenient tool for **memory savings** and **better performance**. After you filter a large DataFrame or Series, many of the categories may not appear in the data. To help with this, we can use the `remove_unused_categories` method to trim unobserved categories:

In [79]:
cat_s_3 = cat_s[cat_s.isin(['a', 'b'])]
cat_s_3

0    a
1    b
4    a
5    b
dtype: category
Categories (4, object): [a, b, c, d]

In [80]:
cat_s_3.cat.remove_unused_categories()

0    a
1    b
4    a
5    b
dtype: category
Categories (2, object): [a, b]

#### Categorical methods for Series in pandas

| Method | Description |
| :----: | :---- |
|__add_categories__|Append new (unused) categories at the end of the existing categories|
|__as_ordered__|Make categories ordered|
|__as_unordered__|Make categories unordered|
|__remove_categories__|Remove categories, setting any removed values to null|
|__remove_unused_categories__|Remove any category values which do not appear in the data|
|__rename_categories__|Replace categories with the indicated set of new category names; cannot change the number of categories|
|__reorder_categories__|Reorder the existing categories as specified (the new list must contain exactly the same categories); can also mark the result as ordered|
|__set_categories__|Replace the categories with the indicated set of new categories; can add or remove categories|
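
A few of these methods in action, as a short sketch on the same `abcd` data. Each call returns a new Series and leaves `s` unchanged; the variable names are illustrative:

```python
import pandas as pd

s = pd.Series(list("abcd") * 2, dtype="category")

added = s.cat.add_categories(["e"])                    # 'e' exists but is unused
renamed = s.cat.rename_categories(["w", "x", "y", "z"])  # positional renaming
reordered = s.cat.reorder_categories(["d", "c", "b", "a"], ordered=True)
removed = s.cat.remove_categories(["d"])               # 'd' values become NaN

print(list(added.cat.categories))  # ['a', 'b', 'c', 'd', 'e']
print(renamed.tolist()[:4])        # ['w', 'x', 'y', 'z']
print(reordered.min())             # d
print(int(removed.isna().sum()))   # 2
```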

In statistics and machine learning, categorical data is often converted to *dummy variables*, also known as *one-hot encoding*. This involves creating a DataFrame with a column for each distinct category; these columns contain 1s for occurrences of a given category and 0s otherwise.

In [81]:
cat_s = pd.Series(list('abcd') * 2, dtype='category')

In [84]:
cat_s

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (4, object): [a, b, c, d]

In [85]:
pd.get_dummies(cat_s)

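| | a | b | c | d |
| ---: | ---: | ---: | ---: | ---: |
| 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 |
| 2 | 0 | 0 | 1 | 0 |
| 3 | 0 | 0 | 0 | 1 |
| 4 | 1 | 0 | 0 | 0 |
| 5 | 0 | 1 | 0 | 0 |
| 6 | 0 | 0 | 1 | 0 |
| 7 | 0 | 0 | 0 | 1 |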
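
Because `get_dummies` works from the categories rather than the observed values, a declared but unobserved category should still get its own (all-zero) column. A sketch of that behavior, reusing the `abcde` category set from earlier and passing `dtype=int` so the columns hold 0/1 integers:

```python
import pandas as pd

cat_s = pd.Series(list("abcd") * 2, dtype="category")
cat_s_2 = cat_s.cat.set_categories(list("abcde"))  # 'e' is never observed

dummies = pd.get_dummies(cat_s_2, dtype=int)

print(list(dummies.columns))    # ['a', 'b', 'c', 'd', 'e']
print(int(dummies["e"].sum()))  # 0
```

This keeps the column layout stable across subsets of the data, which can matter when the same encoding feeds a model.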
