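pandas' `qcut` splits a set of numbers into bins by quantile, so each value is assigned to an interval according to where it ranks in the data.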

In [6]:
import pandas as pd
import numpy as np

In [2]:
data = pd.Series([0,8,1,5,3,7,2,6,10,4,9])

In [3]:
data

0      0
1      8
2      1
3      5
4      3
5      7
6      2
7      6
8     10
9      4
10     9
dtype: int64

In [4]:
# Split the data at the median: values in the lower half are labelled 'small number', values in the upper half 'large number':
print(pd.qcut(data,[0,0.5,1],labels=['small number','large number']))

0     small number
1     large number
2     small number
3     small number
4     small number
5     large number
6     small number
7     large number
8     large number
9     small number
10    large number
dtype: category
Categories (2, object): [small number < large number]


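The first argument of `qcut()` is the data; the second defines how the quantile range is split. To split the numbers in half, pass `[0, 0.5, 1]`; to split them into four parts, pass `[0, 0.25, 0.5, 0.75, 1]`. The split does not have to be even: `[0, 0.1, 0.2, 0.3, 1]` distributes the values in a ratio of roughly 1:1:1:7.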

In [5]:
data1 = pd.Series([0,8,1,5,3,7,2,6,10,4,9])
print(pd.qcut(data1,[0, 0.1, 0.2, 0.3, 1],labels=['first 10%','second 10%','third 10%','70%']))

0      first 10%
1            70%
2      first 10%
3            70%
4      third 10%
5            70%
6     second 10%
7            70%
8            70%
9            70%
10           70%
dtype: category
Categories (4, object): [first 10% < second 10% < third 10% < 70%]
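
For evenly sized parts, the list of quantile edges can also be written as a plain integer. The following is a minimal sketch, reusing the series from above, that checks whether `q=4` and `[0, 0.25, 0.5, 0.75, 1]` assign every value to the same bin. The variable names `by_list`, `by_int` and `same` are illustrative only.

```python
import pandas as pd

data = pd.Series([0, 8, 1, 5, 3, 7, 2, 6, 10, 4, 9])

# Explicit quantile edges for four equal-probability bins.
by_list = pd.qcut(data, [0, 0.25, 0.5, 0.75, 1])
# Integer shorthand: q=4 asks for quartiles.
by_int = pd.qcut(data, 4)

# Compare the integer bin codes of the two results element by element.
same = (by_list.cat.codes == by_int.cat.codes).all()
print(same)

# Eleven values cannot be split into four equal groups, so the counts differ slightly.
counts = by_int.value_counts().sort_index()
print(counts)
```

With 11 values the bins cannot all be the same size, which is why the ratios given for a quantile list are only approximate on small samples.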


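#### The third argument of `qcut()`, `labels`, gives the value that replaces each interval. The labels follow the order of the intervals, and there must be exactly one label per interval, no more and no fewer.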
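
The rule that the number of labels must match the number of intervals can be checked directly. This is a small sketch, assuming the same series as above; the three-label list and the flag `mismatch_raised` are made up for illustration. pandas is expected to reject the mismatched call with a `ValueError`.

```python
import pandas as pd

data = pd.Series([0, 8, 1, 5, 3, 7, 2, 6, 10, 4, 9])
edges = [0, 0.5, 1]  # three edges define two intervals

try:
    # Three labels for two intervals: one label too many.
    pd.qcut(data, edges, labels=['small', 'medium', 'large'])
    mismatch_raised = False
except ValueError as exc:
    mismatch_raised = True
    print('ValueError:', exc)

# Two labels for two intervals works.
ok = pd.qcut(data, edges, labels=['small number', 'large number'])
n_small = int((ok == 'small number').sum())
n_large = int((ok == 'large number').sum())
print(n_small, n_large)
```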

## Details

In [8]:
factors = np.random.randn(9)
factors

array([-0.74219399, -1.85494203,  1.29540879,  1.48205625,  0.04482996,
        0.39032538,  1.42227708,  1.27902157, -0.89085492])

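#### pd.qcut()
`qcut` chooses its bin edges from the sample quantiles rather than at evenly spaced values, so each bin holds the same number of values (or nearly the same when the count does not divide evenly).

Passing the `q` argument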

In [9]:
pd.qcut(factors, 3)  # returns the bin each value falls into

[(-1.8559999999999999, -0.218], (-1.8559999999999999, -0.218], (1.284, 1.482], (1.284, 1.482], (-0.218, 1.284], (-0.218, 1.284], (1.284, 1.482], (-0.218, 1.284], (-1.8559999999999999, -0.218]]
Categories (3, interval[float64]): [(-1.8559999999999999, -0.218] < (-0.218, 1.284] < (1.284, 1.482]]

In [10]:
pd.qcut(factors, 3).value_counts()  # counts the values in each bin

(-1.8559999999999999, -0.218]    3
(-0.218, 1.284]                  3
(1.284, 1.482]                   3
dtype: int64
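
To see what "equal counts per bin" means in practice, it can help to compare `qcut` with `pd.cut`, which splits the value range into equal-width bins instead. The sketch below uses a small made-up array with one large outlier (not data from this notebook); the names `equal_freq` and `equal_width` are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed data: eight small values and one outlier.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100])

equal_freq = pd.qcut(values, 3)   # edges at the 1/3 and 2/3 quantiles
equal_width = pd.cut(values, 3)   # edges split [min, max] into equal-width spans

# value_counts() on a Categorical reports every bin, including empty ones.
qcut_counts = equal_freq.value_counts().sort_index().tolist()
cut_counts = equal_width.value_counts().sort_index().tolist()

print('qcut:', qcut_counts)
print('cut: ', cut_counts)
```

On this data `qcut` keeps the bins balanced, while `cut` puts almost everything into the first bin because the outlier stretches the range.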

In [11]:
# Passing labels
pd.qcut(factors, 3, labels=["a", "b", "c"])  # returns the bin of each value, with the bins named by labels

[a, a, c, c, b, b, c, b, a]
Categories (3, object): [a < b < c]

In [12]:
pd.qcut(factors, 3, labels=False)  # returns only the integer index of each value's bin

array([0, 0, 2, 2, 1, 1, 2, 1, 0], dtype=int64)
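
One possible use of the integer codes from `labels=False` is as a grouping key. The sketch below is only an illustration: `values` holds the `factors` array from above rounded to two decimals (the original was random, so it cannot be regenerated exactly), and `group_means` is a made-up name.

```python
import pandas as pd

# The factors array shown above, rounded to two decimals.
values = pd.Series([-0.74, -1.85, 1.30, 1.48, 0.04, 0.39, 1.42, 1.28, -0.89])

# With a Series as input, labels=False returns a Series of integer bin indices.
codes = pd.qcut(values, 3, labels=False)
print(codes.tolist())

# Use the bin index as a group key, e.g. to get the mean of each tercile.
group_means = values.groupby(codes).mean()
print(group_means)
```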

In [13]:
# Passing the retbins argument
pd.qcut(factors, 3, retbins=True)  # returns the bin of each value plus bins, the array of bin edges

([(-1.8559999999999999, -0.218], (-1.8559999999999999, -0.218], (1.284, 1.482], (1.284, 1.482], (-0.218, 1.284], (-0.218, 1.284], (1.284, 1.482], (-0.218, 1.284], (-1.8559999999999999, -0.218]]
 Categories (3, interval[float64]): [(-1.8559999999999999, -0.218] < (-0.218, 1.284] < (1.284, 1.482]],
 array([-1.85494203, -0.21751136,  1.28448398,  1.48205625]))
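
The edges returned by `retbins=True` can be reused to bin other data with the same boundaries, for example by passing them to `pd.cut`. This is a sketch under assumptions: `train` is the `factors` array rounded to two decimals, and `new_values` is a hypothetical set of later observations that does not appear in the original notebook.

```python
import numpy as np
import pandas as pd

# The factors array shown above, rounded to two decimals.
train = np.array([-0.74, -1.85, 1.30, 1.48, 0.04, 0.39, 1.42, 1.28, -0.89])
# Hypothetical new observations to place into the same bins.
new_values = np.array([-1.0, 0.0, 1.4])

# Keep only the edges; the first return value is the binned training data.
_, bins = pd.qcut(train, 3, retbins=True)

# include_lowest=True makes the first interval closed on the left,
# so a value equal to the smallest edge is not dropped.
new_codes = pd.cut(new_values, bins, labels=False, include_lowest=True)
print(bins)
print(new_codes)
```

Values that fall outside the range of the original edges would come back as missing rather than being assigned to a bin, so this approach assumes the new data stays within the observed range.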