# Using Category in Pandas


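`Categoricals` are a `Pandas` data type that corresponds to categorical variables in statistics.

A `Categorical` takes its values from a fixed, finite set of categories, for example gender, social class, blood type, nationality, observation period, or rating level. In other words, it represents a <span class="burk">discrete variable</span>.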
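To make the idea concrete, here is a minimal sketch of how a categorical column stores its data. The blood-type values are hypothetical sample data, not part of the example that follows.

```python
import pandas as pd

# Hypothetical sample data: blood types for five people.
blood = pd.Series(["A", "B", "O", "A", "AB"], dtype="category")

# The distinct values are stored once, as the categories
# (sorted when they are inferred from the data)...
categories = blood.cat.categories.tolist()
# ...and each row holds a small integer code pointing into them.
codes = blood.cat.codes.tolist()

print(categories)
print(codes)
```

Because each row is only an integer code, a column with few distinct values usually takes less memory as `category` than as `object`.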


References:

Official documentation: http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Categorical.html

Usage of pd.Categorical: https://blog.csdn.net/weixin_38656890/article/details/81348539

https://www.jianshu.com/p/20169d7f60bc

In [12]:
# First, create a simple DataFrame
# Garsol, Hardon, Bill, Duran, James and Barter play a three-on-three game, West vs. East
# One set of data records each player's score

import pandas as pd
import numpy as np

In [13]:
players = ['Garsol', 'Hardon', 'Bill', 'Duran', 'James', 'Barter']
scores = [22, 34, 12, 31, 26, 19]
teams = ['West', 'West', 'East', 'West', 'East', 'East']

Create the DataFrame from a dict whose keys are the column names and whose values are the column data.

In [14]:
df = pd.DataFrame({'player': players, 'score': scores, 'team': teams})
df

|   | player | score | team |
|---|--------|-------|------|
| 0 | Garsol | 22 | West |
| 1 | Hardon | 34 | West |
| 2 | Bill | 12 | East |
| 3 | Duran | 31 | West |
| 4 | James | 26 | East |
| 5 | Barter | 19 | East |


In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 3 columns):
player    6 non-null object
score     6 non-null int64
team      6 non-null object
dtypes: int64(1), object(2)
memory usage: 224.0+ bytes


<div class="alert alert-success">
Here is the key point:
</div>

In [16]:
df['team']

0    West
1    West
2    East
3    West
4    East
5    East
Name: team, dtype: object

In [17]:
df['team'].astype('category')

0    West
1    West
2    East
3    West
4    East
5    East
Name: team, dtype: category
Categories (2, object): [East, West]

In [18]:
df.dtypes

player    object
score      int64
team      object
dtype: object
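Note that `df.dtypes` above still does not report `category` for `team`: `astype` returns a new Series and leaves `df` untouched. A minimal sketch of persisting the conversion, rebuilding the same `df` so that the snippet runs on its own:

```python
import pandas as pd

df = pd.DataFrame({
    'player': ['Garsol', 'Hardon', 'Bill', 'Duran', 'James', 'Barter'],
    'score': [22, 34, 12, 31, 26, 19],
    'team': ['West', 'West', 'East', 'West', 'East', 'East'],
})

# astype() returns a new Series; the result is discarded here,
# so the column in df keeps its original dtype.
df['team'].astype('category')
dtype_before = str(df['team'].dtype)

# Assign the result back to actually change the column.
df['team'] = df['team'].astype('category')
dtype_after = str(df['team'].dtype)

print(dtype_before, '->', dtype_after)
```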

---

Now let's look at another example:

In [17]:
scores

[22, 34, 12, 31, 26, 19]

In [18]:
d = pd.Series(scores).describe()
d 

count     6.000000
mean     24.000000
std       8.074652
min      12.000000
25%      19.750000
50%      24.000000
75%      29.750000
max      34.000000
dtype: float64

---

## Using the cut function

In [19]:
pd.Series(scores)

0    22
1    34
2    12
3    31
4    26
5    19
dtype: int64

In [12]:
d = pd.Series(scores).describe()
score_ranges = [d['min'] - 1, d['mean'], d['max'] + 1]
score_labels = ['Role', 'Star']
# Use pd.cut(ori_data, bins, labels=...)
# bins sets the boundaries used to bucket ori_data; each bucket is named by the matching entry in labels
df['level'] = pd.cut(df['score'], score_ranges, labels=score_labels)

In [13]:
df['level']

0    Role
1    Star
2    Role
3    Star
4    Star
5    Role
Name: level, dtype: category
Categories (2, object): [Role < Star]
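To see what the bins actually mean, it can help to call `pd.cut` without `labels`. The sketch below hard-codes the bin edges `[11, 24, 35]`, which are the values of `min - 1`, `mean` and `max + 1` for these scores; by default each bucket is closed on the right, so a score equal to the mean would fall into the lower bucket.

```python
import pandas as pd

scores = pd.Series([22, 34, 12, 31, 26, 19])
bins = [11, 24, 35]  # min - 1, mean, max + 1 for the scores above

# Without labels, each bucket is named after its interval: (11, 24] and (24, 35].
intervals = pd.cut(scores, bins)
# With labels, the same buckets get readable names instead.
levels = pd.cut(scores, bins, labels=['Role', 'Star'])

print(intervals.cat.categories)
print(levels.tolist())
```

The result of `pd.cut` is an ordered categorical, which is why the output above shows `Role < Star`.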

In [14]:
df

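|   | player | score | team | level |
|---|--------|-------|------|-------|
| 0 | Garsol | 22 | West | Role |
| 1 | Hardon | 34 | West | Star |
| 2 | Bill | 12 | East | Role |
| 3 | Duran | 31 | West | Star |
| 4 | James | 26 | East | Star |
| 5 | Barter | 19 | East | Role |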


In [15]:
df['team']

0    West
1    West
2    East
3    West
4    East
5    East
Name: team, dtype: object

In [18]:
# get_values() was deprecated in pandas 0.25 and removed in 1.0; use to_numpy() instead
df['team'].to_numpy()

array(['West', 'West', 'East', 'West', 'East', 'East'], dtype=object)

In [16]:
df['level']

0    Role
1    Star
2    Role
3    Star
4    Star
5    Role
Name: level, dtype: category
Categories (2, object): [Role < Star]

In [17]:
# to_numpy() replaces get_values(), which was removed in pandas 1.0
df['level'].to_numpy()

array(['Role', 'Star', 'Role', 'Star', 'Star', 'Role'], dtype=object)

In [20]:
cg = pd.Categorical(['Role', 'Role', 'Star', 'Role',
                     'Killer', 'Star'], categories=['Role', 'Star'])
cg

[Role, Role, Star, Role, NaN, Star]
Categories (2, object): [Role, Star]
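In the output above, `'Killer'` became `NaN` because it is not one of the declared categories. A small sketch of how this shows up in the underlying integer codes (the variable names `codes` and `n_missing` are only for illustration):

```python
import pandas as pd

cg = pd.Categorical(['Role', 'Role', 'Star', 'Role', 'Killer', 'Star'],
                    categories=['Role', 'Star'])

# 'Killer' is not a declared category, so it is stored as missing,
# which is represented by the code -1.
codes = cg.codes.tolist()
n_missing = int(cg.isna().sum())

print(codes, n_missing)
```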

In [21]:
s = pd.Series(["a","b","c","a"], dtype="category")
print(s)

0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [a, b, c]


In [24]:
cat = pd.Categorical(['spring', 'a', 'hibernate', 'b', 'c', 'a', 'b', 'c'])
print(cat)

[spring, a, hibernate, b, c, a, b, c]
Categories (5, object): [a, b, c, hibernate, spring]


In [26]:
cat = pd.Categorical(
    ['a', 'b', 'c', 'a', 'b', 'c', 'd'], categories=['c', 'b', 'a'])
print(cat)

[a, b, c, a, b, c, NaN]
Categories (3, object): [c, b, a]


In [28]:
cat = pd.Categorical(
    ['a', 'b', 'c', 'a', 'b', 'c', 'd'], categories=['c', 'b', 'a'], ordered=True)
print(cat)

[a, b, c, a, b, c, NaN]
Categories (3, object): [c < b < a]
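With `ordered=True`, the order of the `categories` list defines how values compare. A minimal sketch of what that enables, using the same categories but leaving out the `'d'` value so that there are no missing entries:

```python
import pandas as pd

cat = pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c'],
                     categories=['c', 'b', 'a'], ordered=True)

# Order follows the categories list (c < b < a), not alphabetical order.
smallest, largest = cat.min(), cat.max()

# Comparisons against a category are only allowed on ordered categoricals.
above_c = (cat > 'c').tolist()

# Sorting also follows the category order.
sorted_values = pd.Series(cat).sort_values().tolist()

print(smallest, largest)
print(above_c)
print(sorted_values)
```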
