# Data Processing with pandas

## 1. Removing Duplicate Rows

In [1]:
import numpy as np
import pandas as pd
from pandas import Series,DataFrame

Use duplicated() to detect duplicate rows. It returns a boolean Series with one element per row; an element is True if that row is not the first occurrence

In [3]:
df.duplicated()

0    False
1    False
2     True
3    False
dtype: bool

Use drop_duplicates() to remove the duplicate rows

In [4]:
df.drop_duplicates()

Unnamed: 0,color,size
0,red,10
1,white,20
3,green,30
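The cell that builds df is not shown in this notebook. The following is a minimal sketch that rebuilds a similar frame (the construction is an assumption inferred from the outputs above) and shows how the keep parameter changes which rows are flagged:

```python
import pandas as pd

# Assumed reconstruction of the sample frame; the original cell is not shown.
df = pd.DataFrame({'color': ['red', 'white', 'red', 'green'],
                   'size': [10, 20, 10, 30]})

# True marks a row that repeats an earlier row (keep='first' is the default).
dup_first = df.duplicated()

# keep='last' flags every occurrence except the last one instead.
dup_last = df.duplicated(keep='last')

# keep=False flags all members of a duplicated group.
dup_all = df.duplicated(keep=False)

# drop_duplicates() keeps the rows where duplicated() is False.
deduped = df.drop_duplicates()
print(deduped)
```

Both methods also accept a subset argument when only some columns should decide whether two rows count as duplicates.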


If a new DataFrame is built with pd.concat([df1,df2],axis = 1) and the result has repeated column names, both duplicated() and drop_duplicates() fail (at least in the pandas version used here)

In [7]:
# When column names are repeated, removing duplicate rows fails
df2 = pd.concat((df,df),axis = 1)
df2

Unnamed: 0,color,size,color.1,size.1
0,red,10,red,10
1,white,20,white,20
2,red,10,red,10
3,green,30,green,30


In [8]:
df2.duplicated()

ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

In [9]:
df2.drop_duplicates()

ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
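One possible workaround, which is not part of the original notebook, is to make the column labels unique before looking for duplicate rows. The sketch below assumes the same sample frame as before; the new labels color_2 and size_2 are made up for illustration:

```python
import pandas as pd

# Assumed sample frame, concatenated with itself side by side.
df = pd.DataFrame({'color': ['red', 'white', 'red', 'green'],
                   'size': [10, 20, 10, 30]})
df2 = pd.concat((df, df), axis=1)   # column labels: color, size, color, size

# Option A: keep only the first column for each repeated label.
unique_cols = df2.loc[:, ~df2.columns.duplicated()]
print(unique_cols.drop_duplicates())

# Option B: keep all four columns but give them distinct (hypothetical) labels.
renamed = df2.copy()
renamed.columns = ['color', 'size', 'color_2', 'size_2']
print(renamed.duplicated())
```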

In [16]:
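# drop_duplicates() = drop rows according to duplicated()

# Is df.drop_duplicates() the same as df.drop(du)?

du = df.duplicated()
# du = [0,0,1,0]
display(du)
# Pitfall: drop() expects index labels, not a boolean mask. Here False/True are
# read as the labels 0/1, so rows 0 and 1 are dropped instead of row 2.
df.drop(du)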

0    False
1    False
2     True
3    False
dtype: bool

Unnamed: 0,color,size
2,red,10
3,green,30


In [20]:
np.logical_not(du)

0     True
1     True
2    False
3     True
dtype: bool

In [21]:
df[np.logical_not(du)]

Unnamed: 0,color,size
0,red,10
1,white,20
3,green,30
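The same filtering can be written more compactly with the ~ operator. This sketch, again on an assumed copy of the sample frame, also shows how to turn the mask into labels so that drop() does what was intended:

```python
import pandas as pd

# Assumed sample frame matching the outputs above.
df = pd.DataFrame({'color': ['red', 'white', 'red', 'green'],
                   'size': [10, 20, 10, 30]})
du = df.duplicated()

# ~ negates a boolean Series, the same result as np.logical_not(du).
kept_by_mask = df[~du]

# drop() expects index labels, so select the labels where du is True first.
kept_by_drop = df.drop(du[du].index)

print(kept_by_mask.equals(df.drop_duplicates()))
```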


In [11]:
df

Unnamed: 0,color,size
0,red,10
1,white,20
2,red,10
3,green,30


## 2. Mapping

Mapping means building a lookup table that binds each element of values to a specific label or string

This requires a dictionary:

`map = {
    'label1':'value1',
    'label2':'value2',
    ...
    }
`

Three operations are covered:

- replace(): replace elements
- Most important: map(): create a new column
- rename(): replace index labels
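As a quick orientation, the three operations can be sketched side by side. The frame, the codes dictionary and the new labels below are hypothetical, chosen only to illustrate the calls:

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'white', 'red', 'green'],
                   'size': [10, 20, 10, 30]})

# Hypothetical lookup table for illustration.
codes = {'red': 1, 'white': 2, 'green': 3}

# replace(): substitute matching values wherever they occur in the frame.
replaced = df.replace({'red': 'crimson'})

# map(): derive a new column from one existing column.
df['code'] = df['color'].map(codes)

# rename(): relabel the index or the columns; the data is unchanged.
renamed = df.rename(columns={'size': 'size_cm'}, index={0: 'first'})
```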

### 1) replace(): replacing elements

In [23]:
df*2

Unnamed: 0,color,size
0,redred,20
1,whitewhite,40
2,redred,20
3,greengreen,60


Use replace() to substitute values

In [24]:
#red = 10
#green = 20
color = {'red':10,'green':20}

First define a dictionary (above)

Then call .replace()

In [27]:
df.replace(color,inplace=True)

replace() is also often used to replace NaN elements

In [29]:
df.loc[1] = np.nan

In [31]:
v = {np.nan:0.1}
df.replace(v)

Unnamed: 0,color,size
0,10.0,10.0
1,0.1,0.1
2,10.0,10.0
3,20.0,30.0
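For the NaN case specifically, fillna() is the dedicated alternative. A minimal sketch on an assumed frame shaped like the output above; the per-column fill values are arbitrary:

```python
import numpy as np
import pandas as pd

# Assumed frame with one all-NaN row, mirroring the output above.
df = pd.DataFrame({'color': [10.0, np.nan, 10.0, 20.0],
                   'size': [10.0, np.nan, 10.0, 30.0]})

# replace() with a dict swaps every NaN in the frame.
by_replace = df.replace({np.nan: 0.1})

# fillna() is the dedicated method for the same job.
by_fillna = df.fillna(0.1)

# A dict keyed by column name fills each column with its own value.
per_column = df.fillna({'color': 0.0, 'size': -1.0})
```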


============================================

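Exercise 19:

    Suppose the score table for 张三 and 李四 contains perfect scores. The teacher treats these as cheating and wants every perfect score (both 150 and 300) recorded as 0. How can this be done?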

============================================

### 2) map(): creating a new column

Use map() to derive a new column from an existing one

map() is suited to processing a single column.

In [33]:
df = DataFrame(np.random.randint(0,150,size = (4,4)),columns=['Python','Java','PHP','HTML'],
               index = ['张三','旭日','阳刚','木兰'])
df

Unnamed: 0,Python,Java,PHP,HTML
张三,75,100,52,3
旭日,80,136,53,132
阳刚,146,106,51,72
木兰,35,126,2,0


In [34]:
# Go
# map also uses a mapping: it adds a new column derived from an existing one
v = {75:90,80:100,146:166,35:55}
df['Go'] = df['Python'].map(v)

In [35]:
df

Unnamed: 0,Python,Java,PHP,HTML,Go
张三,75,100,52,3,90
旭日,80,136,53,132,100
阳刚,146,106,51,72,166
木兰,35,126,2,0,55
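One detail worth knowing when mapping with a dictionary: values that have no matching key become NaN. A small sketch with made-up index labels and a deliberately incomplete dictionary:

```python
import pandas as pd

scores = pd.Series([75, 80, 146, 35], index=['a', 'b', 'c', 'd'])

# Only two of the four values appear as keys.
partial = {75: 90, 80: 100}
mapped = scores.map(partial)          # unmatched values become NaN

# One way to keep the original value when there is no match.
mapped_keep = scores.map(partial).fillna(scores)
```

This is why the dictionary in the cell above lists every value that occurs in the Python column.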


Again, this starts from a dictionary

map() also accepts a lambda function

In [36]:
#C
df['C'] = df['Go'].map(lambda x : x -40)

In [37]:
df

Unnamed: 0,Python,Java,PHP,HTML,Go,C
张三,75,100,52,3,90,50
旭日,80,136,53,132,100,60
阳刚,146,106,51,72,166,126
木兰,35,126,2,0,55,15


In [39]:
def mp(x):
    # a more complex condition
    if x <51:
        return '不及格'
    else:
        return '优秀'

In [40]:
df['score'] = df['C'].map(mp)
df

Unnamed: 0,Python,Java,PHP,HTML,Go,C,score
张三,75,100,52,3,90,50,不及格
旭日,80,136,53,132,100,60,优秀
阳刚,146,106,51,72,166,126,优秀
木兰,35,126,2,0,55,15,不及格


In [43]:
#'int' object is not iterable
max(10)

TypeError: 'int' object is not iterable

In [42]:
# map() calls max() on each element, and max(int) raises: 'int' object is not iterable
df['score2'] = df['C'].map(max)

TypeError: 'int' object is not iterable
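The error arises because map() works element by element. A short sketch of the difference between an element-wise function and an aggregation, using the values of column C from above; the cap of 100 is an arbitrary illustration:

```python
import pandas as pd

s = pd.Series([50, 60, 126, 15], name='C')

# map() calls the function once per element, so max() receives a single
# number, which is not iterable, and raises TypeError.
try:
    s.map(max)
    failed = False
except TypeError:
    failed = True

# An aggregation takes the whole Series at once.
largest = s.max()

# A function that accepts a single scalar works with map().
capped = s.map(lambda x: min(x, 100))
```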

transform() works much like map()

In [44]:
df['score2'] = df['C'].transform(mp)

In [45]:
df

Unnamed: 0,Python,Java,PHP,HTML,Go,C,score,score2
张三,75,100,52,3,90,50,不及格,不及格
旭日,80,136,53,132,100,60,优秀,优秀
阳刚,146,106,51,72,166,126,优秀,优秀
木兰,35,126,2,0,55,15,不及格,不及格


Use map() to create a new column

In [46]:
# map can also be used to overwrite an existing column
df['C'] = df['C'].map(lambda x : x*2)

In [47]:
df

Unnamed: 0,Python,Java,PHP,HTML,Go,C,score,score2
张三,75,100,52,3,90,100,不及格,不及格
旭日,80,136,53,132,100,120,优秀,优秀
阳刚,146,106,51,72,166,252,优秀,优秀
木兰,35,126,2,0,55,30,不及格,不及格


============================================

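Exercise 20:

    Add two new columns holding the score status for 张三 and 李四: "failed" if the score is below 90, "excellent" if it is above 120, and "pass" otherwise
    
    [Hint] Pass a function as the argument to map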

============================================

### 3) rename(): replacing index labels

Again, this starts from a dictionary

In [48]:
df

Unnamed: 0,Python,Java,PHP,HTML,Go,C,score,score2
张三,75,100,52,3,90,100,不及格,不及格
旭日,80,136,53,132,100,120,优秀,优秀
阳刚,146,106,51,72,166,252,优秀,优秀
木兰,35,126,2,0,55,30,不及格,不及格


In [52]:
def cols(x):
    if x == 'PHP':
        return 'php'
    if x == 'Python':
        return '大蟒蛇'
    else:
        return x

In [53]:
inds = {'张三':'Zhang Sir','木兰':'MissLan'}
# index, columns : scalar, list-like, dict-like or function, optional
#     Scalar or list-like will alter the ``Series.name`` attribute,
#     and raise on DataFrame or Panel.
#     dict-like or functions are transformations to apply to
#     that axis' values
df.rename(index = inds,columns=cols)

Unnamed: 0,大蟒蛇,Java,php,HTML,Go,C,score,score2
Zhang Sir,75,100,52,3,90,100,不及格,不及格
旭日,80,136,53,132,100,120,优秀,优秀
阳刚,146,106,51,72,166,252,优秀,优秀
MissLan,35,126,2,0,55,30,不及格,不及格


Use rename() to replace row index labels
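A compact sketch of the two argument styles on a made-up two-row frame. It also shows that rename() returns a new object and leaves the original untouched:

```python
import pandas as pd

df = pd.DataFrame([[75, 100], [35, 126]],
                  index=['张三', '木兰'], columns=['Python', 'PHP'])

# A dict renames only the listed labels; a function is applied to every label.
renamed = df.rename(index={'张三': 'Zhang Sir'}, columns=str.lower)

# rename() returns a new object; df itself is unchanged unless reassigned.
print(list(df.columns))
print(list(renamed.columns))
```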

## 3. Detecting and Filtering Outliers

Use describe() to view descriptive statistics for each column

In [55]:
df.describe()

Unnamed: 0,Python,Java,PHP,HTML,Go,C
count,4.0,4.0,4.0,4.0,4.0,4.0
mean,84.0,117.0,39.5,51.75,102.75,125.5
std,45.978256,16.8523,25.01333,62.994047,46.370788,92.741576
min,35.0,100.0,2.0,0.0,55.0,30.0
25%,65.0,104.5,38.75,2.25,81.25,82.5
50%,77.5,116.0,51.5,37.5,95.0,110.0
75%,96.5,128.5,52.25,87.0,116.5,153.0
max,146.0,136.0,53.0,132.0,166.0,252.0


Use std() to get the standard deviation of each column of a DataFrame

In [59]:
df.std()

Python    45.978256
Java      16.852300
PHP       25.013330
HTML      62.994047
Go        46.370788
C         92.741576
dtype: float64

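Filter the DataFrame elements using the standard deviation of each column.

any() tests whether any value is True: it returns True if one or more are, and False otherwise

Apply the filter condition to every column to remove values that lie too many standard deviations out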
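The idea can be sketched on a tiny hand-made frame in which one value is an obvious outlier. Note one deliberate difference from the cells in this section: the sketch measures distance from the column mean rather than the raw absolute value, a swapped-in variant that suits data not centered on zero.

```python
import pandas as pd

# Small hand-made frame; the value 100 in column 'a' is the intended outlier.
df = pd.DataFrame({'a': [1, 2, 1, 2, 1, 2, 1, 2, 1, 100],
                   'b': [5, 6, 5, 6, 5, 6, 5, 6, 5, 6]})

# Distance of each value from its column mean, compared with 2 column stds.
cond = (df - df.mean()).abs() > 2 * df.std()

# any(axis=1) is True for a row if at least one column flags it.
outlier_rows = cond.any(axis=1)
cleaned = df[~outlier_rows]
print(cleaned.shape)
```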

In [63]:
df.drop(['score','score2'],axis = 1,inplace=True)

In [2]:
# Transpose via stack/unstack: subjects become rows, students become columns
df2 = df.stack().unstack(level = 0)

In [95]:
cond = np.abs(df2) < df2.std()*2
cond

Unnamed: 0,张三,旭日,阳刚,木兰
Python,True,False,False,True
Java,False,False,True,False
PHP,True,True,True,True
HTML,True,False,True,True
Go,False,False,False,True
C,False,False,False,True


In [85]:
df.std(axis = 1)

张三    37.517996
旭日    32.420672
阳刚    72.923019
木兰    46.431311
dtype: float64

In [87]:
# Filter the data
# std is the standard deviation; a small value means the data is stable
cond = np.abs(df) < df.std(axis = 1)*5
cond

TypeError: 'axis' is an invalid keyword to ufunc 'absolute'

In [72]:
df[cond].dropna(axis = 1)

Unnamed: 0,Java
张三,100
旭日,136
阳刚,106
木兰,126
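When the threshold is a row-wise statistic such as df.std(axis = 1), the comparison has to be aligned with the row index explicitly, because a plain comparison between a DataFrame and a Series matches the Series labels against the column names. A minimal sketch on a made-up frame; the factor 1.5 is arbitrary:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 30], [4, 5, 6]],
                  index=['r1', 'r2'], columns=['x', 'y', 'z'])

row_std = df.std(axis=1)          # one value per row, indexed by r1, r2

# `df < row_std` would align row_std's labels with the COLUMNS of df.
# lt(..., axis=0) aligns them with the row index instead.
cond = df.abs().lt(row_std * 1.5, axis=0)
print(cond)
```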


In [76]:
df.mean()

Python     84.00
Java      117.00
PHP        39.50
HTML       51.75
Go        102.75
C         125.50
dtype: float64

In [77]:
df

Unnamed: 0,Python,Java,PHP,HTML,Go,C
张三,75,100,52,3,90,100
旭日,80,136,53,132,100,120
阳刚,146,106,51,72,166,252
木兰,35,126,2,0,55,30


In [78]:
# scores that are high enough
cond2 = np.abs(df) > df.mean()*1.2

In [81]:
cond&cond2

Unnamed: 0,Python,Java,PHP,HTML,Go,C
张三,False,False,True,False,False,False
旭日,False,False,True,True,False,False
阳刚,True,False,True,True,True,True
木兰,False,False,False,False,False,False


Drop specific index labels with df.drop(labels,inplace = True)

============================================

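Exercise 21:

    Create a DataFrame of shape 10000*3 drawn from the standard normal distribution (np.random.randn), then remove every row in which any element has an absolute value greater than 3 standard deviations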

============================================

In [96]:
n = np.random.randn(10000,3)

In [98]:
df = DataFrame(n)
df

Unnamed: 0,0,1,2
0,-1.126422,-0.919361,0.668098
1,-0.585425,-0.261040,-0.490388
2,-1.014050,0.249440,0.316363
3,0.729656,0.329345,-0.583070
4,0.716626,-0.176091,-0.319415
5,-0.974091,-0.458677,0.203684
6,-0.444945,0.433427,0.269263
7,-1.417488,-0.561283,1.196632
8,0.560407,-0.159346,0.340378
9,-0.000639,2.152312,-0.696502


In [108]:
cond = np.abs(df) >df.std()*3
cond

Unnamed: 0,0,1,2
0,False,False,False
1,False,False,False
2,False,False,False
3,False,False,False
4,False,False,False
5,False,False,False
6,False,False,False
7,False,False,False
8,False,False,False
9,False,False,False


In [118]:
drop_index = df[cond.any(axis = 1)].index

In [120]:
df2 = df.drop(drop_index)

In [121]:
df2.shape

(9917, 3)

In [122]:
cond2 = np.abs(df2) > df.std()*3

In [126]:
cond2.any(axis = 1).sum()

0
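The steps above can also be written with a boolean mask instead of drop(). This is a sketch under assumptions rather than the notebook's own solution: it uses a seeded generator (np.random.default_rng) purely so the run is repeatable.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)          # seed chosen only for repeatability
df = pd.DataFrame(rng.standard_normal((10000, 3)))

# Flag values beyond 3 column standard deviations, then keep unflagged rows.
mask = (df.abs() > 3 * df.std()).any(axis=1)
filtered = df[~mask]

print(len(df) - len(filtered), "rows removed")
```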

In [129]:
df2

Unnamed: 0,0,1,2
0,-1.126422,-0.919361,0.668098
1,-0.585425,-0.261040,-0.490388
2,-1.014050,0.249440,0.316363
3,0.729656,0.329345,-0.583070
4,0.716626,-0.176091,-0.319415
5,-0.974091,-0.458677,0.203684
6,-0.444945,0.433427,0.269263
7,-1.417488,-0.561283,1.196632
8,0.560407,-0.159346,0.340378
9,-0.000639,2.152312,-0.696502


In [131]:
# mean of the row-wise standard deviations
row_std_mean = df2.std(axis = 1).mean()

In [140]:
cond3 = df2.std(axis = 1) > row_std_mean*2.5

In [144]:
# Filter out the rows whose standard deviation exceeds 2.5 times the mean row standard deviation
large_std_index = df2[cond3].index

In [146]:
df3 = df2.drop(large_std_index)

In [147]:
df3.shape

(9866, 3)

## 4. Sorting

Use take() to reorder rows by position

np.random.permutation() can be used to put them in random order

In [149]:
df = DataFrame(np.random.randint(0,150,size = (4,4)),columns=['Python','Java','PHP','HTML'],
               index = ['张三','旭日','阳刚','木兰'])
df

Unnamed: 0,Python,Java,PHP,HTML
张三,142,50,147,128
旭日,17,29,24,60
阳刚,48,130,14,110
木兰,14,35,101,53


In [156]:
df.take([3,2,0])

Unnamed: 0,Python,Java,PHP,HTML
木兰,14,35,101,53
阳刚,48,130,14,110
张三,142,50,147,128


In [168]:
indices = np.random.permutation(4)

In [169]:
# the rows now come back in the shuffled order
df.take(indices)

Unnamed: 0,Python,Java,PHP,HTML
张三,142,50,147,128
木兰,14,35,101,53
阳刚,48,130,14,110
旭日,17,29,24,60
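A repeatable version of the shuffle, sketched with a seeded generator (the seed and the single score column are assumptions for illustration). DataFrame.sample(frac=1) is shown as a built-in alternative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'score': [142, 17, 48, 14]},
                  index=['张三', '旭日', '阳刚', '木兰'])

rng = np.random.default_rng(42)         # seeded only so the run is repeatable
order = rng.permutation(len(df))        # each position 0..3 exactly once
shuffled = df.take(order)

# sample(frac=1) is a built-in way to get the same kind of shuffle.
shuffled_too = df.sample(frac=1, random_state=42)
```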


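### Random sampling

When the DataFrame is large enough, np.random.randint() combined with take() gives random sampling (with replacement, since randint() can return the same position more than once)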

In [170]:
df2 = DataFrame(np.random.randn(10000,3))
df2

Unnamed: 0,0,1,2
0,0.276699,0.419327,0.086632
1,-1.163483,-0.470376,-0.455511
2,-0.005259,1.080291,-0.417497
3,0.910627,-0.294704,0.430604
4,-0.903356,-1.956186,0.252204
5,-1.103340,1.105152,-1.684500
6,-2.895341,0.852676,0.843543
7,-0.484271,-0.347173,-1.758386
8,-0.190423,1.536464,0.248842
9,1.929386,1.455515,-0.686408


In [173]:
indices = np.random.randint(0,10000,size = 10)
df2.take(indices)

Unnamed: 0,0,1,2
4385,-0.172053,0.708154,-1.023976
9711,1.008797,-0.190119,1.300144
2613,-1.222124,-1.244814,-0.485013
1152,-0.009653,-0.176749,0.931804
4680,0.360396,-0.029017,-0.928102
3478,0.326191,1.277691,2.622012
8641,-0.145429,-2.258976,0.84213
3712,-2.190806,0.657491,0.823239
1442,0.263346,0.137804,1.14036
5137,1.57266,0.590736,0.25479
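Whether a row may be drawn twice depends on how the positions are generated. The sketch below contrasts three options; the seeded generator is an assumption made for repeatability:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df2 = pd.DataFrame(rng.standard_normal((10000, 3)))

# integers() may return the same position twice: sampling WITH replacement.
with_repl = df2.take(rng.integers(0, len(df2), size=10))

# choice(..., replace=False) guarantees distinct positions.
without_repl = df2.take(rng.choice(len(df2), size=10, replace=False))

# DataFrame.sample() draws without replacement by default.
sampled = df2.sample(n=10, random_state=1)
```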


============================================
Exercise 22:

   Suppose ddd2 holds the midterm scores of 张三, 李四 and 王老五; put these three students in random order

============================================

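## 5. Data Aggregation [Key Topic]

Data aggregation is typically the final step of data processing; it reduces each array to a single value.

Grouped processing of data:

 - Split: first divide the data into groups
 - Apply: apply a function to the data of each group to transform it
 - Combine: merge the results from the groups
 
The core of grouped processing:
     the groupby() function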
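The three steps can be spelled out by hand and compared with the one-line groupby form. The frame below is a made-up example, not the data used later in this section:

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'white', 'red', 'white'],
                   'price': [1, 6, 5, 3]})

# Split: iterate over the groups to get one sub-frame per color.
pieces = {color: part for color, part in df.groupby('color')}

# Apply: reduce each piece to a single number.
means = {color: part['price'].mean() for color, part in pieces.items()}

# Combine: assemble the per-group results into one Series.
by_hand = pd.Series(means)

# groupby performs all three steps in one expression.
direct = df.groupby('color')['price'].mean()
```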

In [176]:
df.std()

Python    59.840761
Java      46.840154
PHP       63.595073
HTML      36.935755
dtype: float64

In [174]:
df.mean()

Python    55.25
Java      61.00
PHP       71.50
HTML      87.75
dtype: float64

In [175]:
df2.max()

0    4.602212
1    3.747098
2    3.589108
dtype: float64

To compute the mean of price grouped by the color column, first select the price column, then call groupby on it, passing the color column as the argument

In [177]:
# groupby() classifies the data by one attribute or by several
df = DataFrame({'color':['red','white','red','cyan','cyan','green','white','cyan'],
                'price':np.random.randint(0,8,size = 8),
                'weight':np.random.randint(50,55,size = 8)})
df

Unnamed: 0,color,price,weight
0,red,1,52
1,white,6,52
2,red,5,51
3,cyan,1,53
4,cyan,1,51
5,green,2,53
6,white,3,52
7,cyan,2,53


Use the .groups attribute to see which rows belong to each group:

In [200]:
# Group the data by color so that like items fall together, then compute sums and means
df_sum_weight = df.groupby(['color'])[['weight']].sum()

df_price_mean = df.groupby(['color'])[['price']].mean()
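The cell above aggregates directly and never displays .groups itself. A small sketch of what that attribute holds, on a hypothetical four-row frame, together with the select-the-column-first form of groupby:

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'white', 'red', 'cyan'],
                   'price': [1, 6, 5, 1]})

grouped = df.groupby('color')

# .groups maps each group key to the index labels of its rows.
for key, labels in grouped.groups.items():
    print(key, list(labels))

# Select the column first, then group it by the color column.
price_mean = df['price'].groupby(df['color']).mean()
```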

In [199]:
df_sum_weight

Unnamed: 0_level_0,weight
color,Unnamed: 1_level_1
cyan,157
green,53
red,103
white,104


In [187]:
df_sum_weight

color
cyan     157
green     53
red      103
white    104
Name: weight, dtype: int64

In [201]:
df_price_mean

Unnamed: 0_level_0,price
color,Unnamed: 1_level_1
cyan,1.333333
green,2.0
red,3.0
white,4.5


In [None]:
# Combining data in pandas: concat / append; merge

In [190]:
pd.concat([df,df_sum_weight],axis=1)

Unnamed: 0,color,price,weight,weight.1
0,red,1.0,52.0,
1,white,6.0,52.0,
2,red,5.0,51.0,
3,cyan,1.0,53.0,
4,cyan,1.0,51.0,
5,green,2.0,53.0,
6,white,3.0,52.0,
7,cyan,2.0,53.0,
cyan,,,,157.0
green,,,,53.0


In [194]:
type(df_sum_weight)

pandas.core.series.Series

In [203]:
df_sum = df.merge(df_sum_weight,left_on='color',right_index=True,suffixes=['','_sum'])

In [205]:
# merge in the mean price as well
df_r = df_sum.merge(df_price_mean,left_on='color',right_index=True,suffixes=['','_平均'])

In [206]:
df_r

Unnamed: 0,color,price,weight,weight_sum,price_平均
0,red,1,52,103,3.0
2,red,5,51,103,3.0
1,white,6,52,104,4.5
6,white,3,52,104,4.5
3,cyan,1,53,157,1.333333
4,cyan,1,51,157,1.333333
7,cyan,2,53,157,1.333333
5,green,2,53,53,2.0


In [None]:
# take retrieves part of the data by position according to its argument; it does not sort the data itself

In [214]:
df_r.index

Int64Index([0, 2, 1, 6, 3, 4, 7, 5], dtype='int64')

In [212]:
df_r.take([2,3])

Unnamed: 0,color,price,weight,weight_sum,price_平均
1,white,6,52,104,4.5
6,white,3,52,104,4.5


In [211]:
df_r.sort_index()

Unnamed: 0,color,price,weight,weight_sum,price_平均
0,red,1,52,103,3.0
1,white,6,52,104,4.5
2,red,5,51,103,3.0
3,cyan,1,53,157,1.333333
4,cyan,1,51,157,1.333333
5,green,2,53,53,2.0
6,white,3,52,104,4.5
7,cyan,2,53,157,1.333333


============================================

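Exercise 23:

   Suppose 张大妈 sells vegetables at the market, with the following attributes:
   
   Item (item): radish, cabbage, pepper, winter melon
   
   Color (color): white, green, red
   
   Weight (weight)
   
   Price (price)
   
1. Create a DataFrame named ddd with these attributes as the column index
2. Aggregate ddd to get the total price of the white items
3. Aggregate ddd to get the total weight of all radishes (white radish, carrot, green radish) and their mean price
4. Use merge to combine the total weight and the mean price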

============================================

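## 6. Advanced Data Aggregation

pd.merge() can add the results of an aggregation to every row of df  
After grouping with groupby and applying a function such as sum, add_prefix() can then be called to rename the resulting columns

### transform and apply can achieve the same result

Simply pass a function to transform or apply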
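A minimal sketch of that combination on a made-up frame; the prefix sum_ is an arbitrary choice:

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'white', 'red', 'white'],
                   'price': [1, 6, 5, 3],
                   'weight': [52, 52, 51, 52]})

# Per-color totals; add_prefix renames the columns to sum_price, sum_weight.
totals = df.groupby('color')[['price', 'weight']].sum().add_prefix('sum_')

# Attach each group's totals to every row of that group.
merged = pd.merge(df, totals, left_on='color', right_index=True)
print(merged.sort_index())
```

Prefixing the aggregated columns up front avoids the name clash that the suffixes argument of merge otherwise has to resolve.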

In [219]:
sum([10])

10

In [None]:
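# Left unfinished: map() requires a dict or function argument, so this cell does not run as written
df['columns'] = df['color'].map()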

In [217]:
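# Pass a function. Unlike map() earlier (which failed on non-iterable elements), transform passes each group's column to the function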
df.groupby('color').transform(sum)

Unnamed: 0,price,weight
0,6,103
1,9,104
2,6,103
3,4,157
4,4,157
5,2,53
6,9,104
7,4,157


In [221]:
df

Unnamed: 0,color,price,weight
0,red,1,52
1,white,6,52
2,red,5,51
3,cyan,1,53
4,cyan,1,51
5,green,2,53
6,white,3,52
7,cyan,2,53


In [227]:
df.groupby('color')[['price','weight']].apply(sum)

Unnamed: 0_level_0,price,weight
color,Unnamed: 1_level_1,Unnamed: 2_level_1
cyan,4,157
green,2,53
red,6,103
white,9,104
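The practical difference between the two outputs above is their shape: transform returns one value per original row, while an aggregation returns one value per group. A short sketch on a made-up frame, using the string name 'sum' rather than the built-in function:

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'white', 'red', 'white'],
                   'price': [1, 6, 5, 3]})

g = df.groupby('color')['price']

# transform returns one value per ORIGINAL row, aligned with df.
per_row = g.transform('sum')

# sum() (an aggregation) returns one value per GROUP.
per_group = g.sum()

# So transform output can be assigned straight back as a new column.
df['price_sum'] = per_row
```

Because the transform result already lines up with df, it gives the same per-row totals as the groupby-then-merge approach without a separate merge step.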


============================================

Exercise 24:

   Use transform and apply to reproduce Exercise 23

============================================