Data Analysis with Python——08 #88

hsipeng opened this issue Aug 23, 2019 · 0 comments

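GroupBy mechanics

"split-apply-combine"

In the first stage of a group operation, the data in a pandas object (a Series, a DataFrame, or anything else) is split into groups according to one or more keys that you provide. The split is performed along a particular axis of the object; for example, a DataFrame can be grouped on its rows (axis=0) or its columns (axis=1). A function is then applied to each group, producing a new value. Finally, the results of all those function calls are combined into a result object, whose form usually depends on the operation performed on the data.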
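The three stages can be spelled out by hand. This is only a minimal sketch: the toy frame and its column names `key` and `value` are made up for illustration, and the manual loop mirrors the idea of `groupby`, not how pandas implements it.

```python
import pandas as pd

# Hypothetical toy data (not part of the notes).
toy = pd.DataFrame({"key": ["a", "b", "a", "b"], "value": [1.0, 2.0, 3.0, 4.0]})

# Split: one sub-frame per distinct key.
pieces = {k: toy[toy["key"] == k] for k in toy["key"].unique()}

# Apply: reduce each piece to a single value.
applied = {k: piece["value"].mean() for k, piece in pieces.items()}

# Combine: assemble the per-group results into one object.
manual = pd.Series(applied, name="value").sort_index()

# The same computation expressed with groupby.
via_groupby = toy.groupby("key")["value"].mean()

print(manual)
print(via_groupby)
```

Both results hold one mean per key; `groupby` simply does the three steps in a single call.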

[Figure: mr914s.png]

from pandas import DataFrame, Series
import numpy as np
import pandas as pd

df = DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': np.random.randn(5),
    'data2': np.random.randn(5)
})

df


grouped = df['data1'].groupby(df['key1'])

grouped

grouped.mean()
# key1
# a   -1.179357
# b    0.356868
# Name: data1, dtype: float64

means = df['data1'].groupby([df['key1'], df['key2']]).mean()
means

# key1  key2
# a     one    -0.877566
#       two    -1.782940
# b     one     0.471518
#       two     0.242218
# Name: data1, dtype: float64

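Selecting a column or a subset of columns

For a GroupBy object created from a DataFrame, indexing it with a column name (a single string) or a set of column names (an array of strings) selects just those columns for aggregation. That is: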

df.groupby('key1')['data1']
df.groupby('key1')[['data2']]

is syntactic sugar for:

df['data1'].groupby(df['key1'])
df[['data2']].groupby(df['key1'])
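A quick way to convince yourself the two spellings agree is to aggregate both and compare. The sketch below uses fixed numbers instead of `np.random.randn` so the result is reproducible; the values themselves are arbitrary.

```python
import pandas as pd

df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "data1": [1.0, 2.0, 3.0, 4.0, 6.0],
    "data2": [10.0, 20.0, 30.0, 40.0, 60.0],
})

# A single string selects one column: the result is a Series.
short = df.groupby("key1")["data1"].mean()
long_ = df["data1"].groupby(df["key1"]).mean()

# A list of strings keeps the result a DataFrame.
short_df = df.groupby("key1")[["data2"]].mean()
long_df = df[["data2"]].groupby(df["key1"]).mean()

print(short.equals(long_), short_df.equals(long_df))
```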

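Data aggregation

Aggregation is any data transformation that produces a scalar value from an array.

Common examples: mean, count, min, sum, quantile (sample quantile).

To use your own aggregation function, pass it to the aggregate or agg method.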

def peak_to_peak(arr):
    return arr.max() - arr.min()

grouped.agg(peak_to_peak)
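Here is a self-contained version of the same idea with fixed, made-up numbers, so the result can be checked by hand. Passing a list to `agg` mixes built-in names with your own function.

```python
import pandas as pd

df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "data1": [1.0, 4.0, 2.0, 7.0, 3.0],  # arbitrary illustrative values
})

def peak_to_peak(arr):
    # Range of the group: largest value minus smallest value.
    return arr.max() - arr.min()

grouped = df.groupby("key1")["data1"]

custom = grouped.agg(peak_to_peak)             # one scalar per group
several = grouped.agg(["mean", peak_to_peak])  # one column per function

print(custom)
print(several)
```

The column produced by a custom function is named after the function itself.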

Group-wise transformation
transform applies a function to each group and returns a result indexed like the original object
GroupBy transform

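def demean(arr):
    return arr - arr.mean()  # subtract the group mean

# people (a DataFrame) and key (the grouping key) are not defined in these notes
demeaned = people.groupby(key).transform(demean)

# Check: every group mean should now be (approximately) zero
demeaned.groupby(key).mean()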
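Since `people` and `key` are not defined in these notes, here is a runnable sketch with stand-ins. The frame, its column names, and the key list are all assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for `people` and `key`.
people = pd.DataFrame(
    {"x": [1.0, 2.0, 3.0, 4.0], "y": [10.0, 20.0, 30.0, 50.0]},
    index=["Joe", "Steve", "Wes", "Jim"],
)
key = ["one", "two", "one", "two"]  # one label per row

def demean(arr):
    return arr - arr.mean()  # subtract the group mean

# transform keeps the original shape and index.
demeaned = people.groupby(key).transform(demean)

# Group means of the demeaned data should be (approximately) zero.
check = demeaned.groupby(key).mean()

print(demeaned)
print(check)
```

Unlike an aggregation, the result has one row per input row, not one row per group.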

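Apply: general "split-apply-combine"

apply splits the object being processed into pieces, calls the passed function on each piece, and then tries to combine the pieces back together.

The function must return a pandas object or a scalar value.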

def top(df, n=5, column='tip_pct'):
    # sort ascending by column, then keep the last n rows
    return df.sort_values(by=column)[-n:]

tips.groupby('smoker').apply(top)  # the 5 highest tip_pct rows per group
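The tips data set is not included in these notes, so the sketch below uses a tiny made-up frame with the same column names. Selecting `[["tip_pct"]]` before `apply` is a choice made here to keep the grouping column out of each piece; extra keyword arguments such as `n` are forwarded to the function.

```python
import pandas as pd

# Hypothetical miniature of the tips data.
tips = pd.DataFrame({
    "smoker": ["No", "No", "No", "Yes", "Yes", "Yes"],
    "tip_pct": [0.10, 0.25, 0.15, 0.30, 0.05, 0.20],
})

def top(group, n=5, column="tip_pct"):
    # Sort ascending, then keep the last n rows (the largest values).
    return group.sort_values(by=column)[-n:]

# Keyword arguments after the function are passed through to top().
result = tips.groupby("smoker")[["tip_pct"]].apply(top, n=2)

print(result)
```

The result carries a hierarchical index: the group key on the outer level and the original row labels on the inner level.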

Calling describe()

result = tips.groupby('smoker')['tip_pct'].describe()
result

## Equivalent to

f = lambda x: x.describe()
grouped = tips.groupby('smoker')['tip_pct']
grouped.apply(f)
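In recent pandas versions the two spellings differ in layout: `describe()` on the GroupBy object gives one row per group, while the `apply` version gives a long Series with a two-level index. A minimal sketch with a made-up stand-in for the tips data:

```python
import pandas as pd

# Hypothetical miniature of the tips data.
tips = pd.DataFrame({
    "smoker": ["No", "No", "No", "Yes", "Yes", "Yes"],
    "tip_pct": [0.10, 0.25, 0.15, 0.30, 0.05, 0.20],
})

grouped = tips.groupby("smoker")["tip_pct"]

wide = grouped.describe()                          # one row per group
long_form = grouped.apply(lambda x: x.describe())  # (group, statistic) index
rewide = long_form.unstack()                       # pivot statistics into columns

print(wide)
print(rewide)
```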

Quantile and bucket analysis

pd.cut and pd.qcut

frame = DataFrame({
    'data1': np.random.randn(1000),
    'data2': np.random.randn(1000)
})


factor =  pd.cut(frame.data1, 4)

factor[:10]

# 0       (0.346, 1.84]
# 1     (-1.148, 0.346]
# 2       (0.346, 1.84]
# 3     (-1.148, 0.346]
# 4       (0.346, 1.84]
# 5       (0.346, 1.84]
# 6     (-1.148, 0.346]
# 7     (-1.148, 0.346]
# 8    (-2.648, -1.148]
# 9       (0.346, 1.84]
# Name: data1, dtype: category
# Categories (4, interval[float64]): [(-2.648, -1.148] < (-1.148, 0.346] < (0.346, 1.84] < (1.84, 3.333]]


def get_stats(group):
    return {
        'min': group.min(),
        'max': group.max(),
        'count': group.count(),
        'mean': group.mean()
    }

grouped = frame.data2.groupby(factor)

grouped.apply(get_stats).unstack()


#                   count       max      mean       min
# data1
# (-2.648, -1.148]  126.0  2.366660 -0.089391 -2.189044
# (-1.148, 0.346]   505.0  3.010882 -0.024205 -3.120295
# (0.346, 1.84]     341.0  2.903806  0.003853 -3.949120
# (1.84, 3.333]      28.0  1.769149 -0.033710 -1.478182

grouping = pd.qcut(frame.data1, 10, labels=False)

grouped = frame.data2.groupby(grouping)

grouped.apply(get_stats).unstack()


#        count       max      mean       min
# data1
# 0      100.0  2.140218 -0.087916 -2.189044
# 1      100.0  2.399182 -0.039417 -3.120295
# 2      100.0  3.010882 -0.089202 -2.500128
# 3      100.0  2.480194  0.161557 -2.701930
# 4      100.0  2.335100 -0.027526 -2.578736
# 5      100.0  2.448807 -0.214196 -2.832098
# 6      100.0  2.714845  0.096643 -2.711761
# 7      100.0  2.239721 -0.049126 -3.949120
# 8      100.0  1.843936  0.028906 -2.134242
# 9      100.0  2.903806 -0.010891 -2.362576
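The two outputs above differ in their count column: `pd.cut` makes buckets of equal width, `pd.qcut` makes buckets of (roughly) equal size. The sketch below shows this on deterministic, skewed data that is made up for the purpose. It also swaps the dict-returning `get_stats` for `agg` with a list of function names, which yields the same kind of table without `unstack`.

```python
import numpy as np
import pandas as pd

# Deterministic, skewed data: the squares 0, 1, 4, ..., 9801.
values = pd.Series(np.arange(100) ** 2, dtype="float64")

equal_width = pd.cut(values, 4)                 # 4 intervals of equal length
equal_size = pd.qcut(values, 4, labels=False)   # 4 quartile buckets, coded 0..3

width_counts = equal_width.value_counts().sort_index()
size_counts = equal_size.value_counts().sort_index()

# Per-bucket statistics, using agg instead of a dict-returning function.
stats = values.groupby(equal_size).agg(["min", "max", "count", "mean"])

print(width_counts)
print(size_counts)
print(stats)
```

With skewed data the equal-width buckets are unevenly filled, while each quartile bucket holds the same number of rows.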

Pivot tables and cross-tabulation

pivot_table

Building a pivot table with reshape operations

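tips.pivot_table(index=['sex', 'smoker'])
# Computes group means by default (mean is pivot_table's default aggregation type).
# Note: recent pandas versions may require naming the numeric columns via values=[...].
# Suppose we only want to aggregate tip_pct and size, and also group by day.
# Put smoker on the columns and day on the rows.

tips.pivot_table(['tip_pct', 'size'], index=['sex', 'day'], columns='smoker', margins=True)

# margins=True adds partial totals (an All row and an All column)


# To use a different aggregation function, pass it as aggfunc
tips.pivot_table(['tip_pct', 'size'], index=['sex', 'day'], columns='smoker', margins=True, aggfunc=len)
# count or len gives a cross-tabulation of group sizes


tips.pivot_table(['tip_pct', 'size'], index=['sex', 'day'], columns='smoker', margins=True, aggfunc='sum', fill_value=0)
# fill_value supplies a value for empty combinations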
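A small runnable sketch of the same options, using a made-up miniature of the tips data (the rows and values are assumptions chosen so that one day/smoker combination is empty):

```python
import pandas as pd

# Hypothetical miniature of the tips data; there is no smoker on Sunday.
tips = pd.DataFrame({
    "day": ["Fri", "Fri", "Sat", "Sat", "Sun"],
    "smoker": ["No", "Yes", "No", "Yes", "No"],
    "tip_pct": [0.10, 0.20, 0.30, 0.40, 0.50],
})

# Default aggregation is the mean; margins=True adds the "All" row and column.
means = tips.pivot_table(values="tip_pct", index="day", columns="smoker",
                         margins=True)

# aggfunc=len counts rows per cell; fill_value=0 replaces the empty cell.
counts = tips.pivot_table(values="tip_pct", index="day", columns="smoker",
                          aggfunc=len, margins=True, fill_value=0)

print(means)
print(counts)
```

In `means` the empty Sun/Yes cell stays NaN, while in `counts` it is filled with 0.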

[Figure: mr9lNj.png]

Cross-tabulation

crosstab

pd.crosstab(data.Gender, data.Handedness, margins=True)
# Summarize the data by gender and handedness

# Each of the first two arguments can be an array, a Series, or a list of arrays

pd.crosstab([tips.time, tips.day], tips.smoker, margins=True)
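`data` is not defined in these notes, so here is a sketch with a made-up frame using the same column names. It also shows that a crosstab is a table of group sizes, which you could build yourself with `groupby(...).size().unstack()`.

```python
import pandas as pd

# Hypothetical survey data.
data = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male", "Male", "Female"],
    "Handedness": ["Right", "Left", "Right", "Right", "Left", "Left"],
})

# Frequency table with row and column totals.
table = pd.crosstab(data["Gender"], data["Handedness"], margins=True)

# The same counts (without totals) from groupby.
by_groupby = data.groupby(["Gender", "Handedness"]).size().unstack(fill_value=0)

print(table)
print(by_groupby)
```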