# I. Getting Started with pandas

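pandas is a Python tool for working with data. Its data structures and manipulation tools make data cleaning and analysis in Python fast and convenient, and it adopts much of NumPy's coding style.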

- pandas is designed for tabular or heterogeneous data
- NumPy is better suited to homogeneous numerical array data

The two most commonly used pandas data structures are Series and DataFrame.

In [1]:
import pandas as pd
import numpy as np

## 1. Series

**A Series is a one-dimensional array-like object made up of two parts:**

- a sequence of values
- data labels, i.e. the index

In [2]:
obj = pd.Series([4,7,-5,3])
obj

0    4
1    7
2   -5
3    3
dtype: int64

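**The index is on the left (none was specified, so a default index from 0 to N-1 is created) and the values are on the right. Both are accessible through the Series's `values` and `index` attributes.**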

In [3]:
obj.values  # the values attribute

array([ 4,  7, -5,  3], dtype=int64)

In [4]:
obj.index  # the index attribute

RangeIndex(start=0, stop=4, step=1)

**A Series is usually created with an explicit index, and those labels can then be used to select data.**

In [5]:
obj = pd.Series([4,7,-5,3], index = ['d','b','a','c'])  # pass the index to the constructor; it can also be replaced later by assignment, e.g. obj.index = [...]
obj

d    4
b    7
a   -5
c    3
dtype: int64
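The comment in the cell above notes that the index can also be replaced by assignment. A minimal sketch of that, using a fresh Series named `s` (a name chosen for this example, not one of the original cells); the new index has to be the same length as the Series:

```python
import pandas as pd

s = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# Replace all labels at once; the values keep their positions.
s.index = ['w', 'x', 'y', 'z']
print(s)

# Assigning an index of the wrong length raises ValueError.
try:
    s.index = ['only', 'three', 'labels']
except ValueError as exc:
    print('ValueError:', exc)
```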

In [6]:
obj['a']

-5

In [7]:
obj[['a','b','c']]  # index with a list of labels

a   -5
b    7
c    3
dtype: int64

In [8]:
'a' in obj

True

**NumPy functions and NumPy-style operations can be applied to a Series, such as filtering with a boolean array, scalar multiplication, or math functions.**

In [9]:
obj[obj > 0]

d    4
b    7
c    3
dtype: int64

In [10]:
obj ** 2

d    16
b    49
a    25
c     9
dtype: int64

In [11]:
np.sort(obj)

array([-5,  3,  4,  7], dtype=int64)

**A Series can be thought of as a fixed-length, ordered dict, since it pairs index labels with data values by position. A Series can be created from a Python dict.**

In [12]:
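sdata = {'Ohio':3500,'Texas':7100,'Oregon':1600,'Utah':5000}  # named sdata to avoid shadowing the built-in dict
sdata

{'Ohio': 3500, 'Oregon': 1600, 'Texas': 7100, 'Utah': 5000}

In [13]:
obj = pd.Series(sdata)  # pass a Python dict to the Series constructor; in the pandas version used here the index is the sorted dict keys (newer versions keep insertion order)
obj

Ohio      3500
Oregon    1600
Texas     7100
Utah      5000
dtype: int64

In [14]:
# The dict keys can be passed to the constructor in whatever order you want
s = ['Cal','Ohio','Oregon','Texas']
obj = pd.Series(sdata, index = s)
obj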

Cal          NaN
Ohio      3500.0
Oregon    1600.0
Texas     7100.0
dtype: float64

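**The values from the Python dict are placed at the positions of the matching keys. Because the dict has no 'Cal' key, the corresponding value is NaN (not a number), which is how pandas marks missing or NA values. 'Utah' is not in the requested index, so it is dropped.**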

**pandas provides the `isnull` and `notnull` functions to detect missing data.**

In [15]:
pd.isnull(obj)

Cal        True
Ohio      False
Oregon    False
Texas     False
dtype: bool

In [16]:
pd.notnull(obj)

Cal       False
Ohio       True
Oregon     True
Texas      True
dtype: bool
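Both checks are also available as methods on the Series itself, which is handy for building a boolean mask. A small sketch, assuming a Series `s` that reproduces the values shown above (the names `s`, `missing`, and `present` are made up for this example):

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 3500.0, 1600.0, 7100.0],
              index=['Cal', 'Ohio', 'Oregon', 'Texas'])

missing = s.isnull()      # same result as pd.isnull(s)
present = s[s.notnull()]  # the boolean mask keeps only non-missing entries
print(missing)
print(present)
```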

**Both the Series itself and its index have a `name` attribute.**

In [17]:
obj.name = 'population'

In [18]:
obj.index.name = 'state'

In [19]:
obj

state
Cal          NaN
Ohio      3500.0
Oregon    1600.0
Texas     7100.0
Name: population, dtype: float64

## 2. DataFrame

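A DataFrame represents a rectangular table of data. It contains an ordered collection of columns, each of which can have a different value type (numeric, string, boolean, and so on). It has both a row index and a column index, and can be thought of as a dict of Series that all share the same index. Internally, the data is stored as one or more two-dimensional blocks rather than as a list, dict, or other collection of one-dimensional arrays.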
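The "dict of Series" view can be tried without any data file. The sketch below builds a small DataFrame from a dict of equal-length lists; the data and the names `data` and `frame` are invented for illustration and are unrelated to the Yelp file read next:

```python
import pandas as pd

# Hypothetical sample data; each dict key becomes a column.
data = {
    'state': ['Ohio', 'Ohio', 'Nevada'],
    'year': [2000, 2001, 2001],
    'pop': [1.5, 1.7, 2.4],
}
frame = pd.DataFrame(data, index=['r1', 'r2', 'r3'])
print(frame)

# Each column is a Series that shares the frame's row index.
print(frame['pop'])
```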

In [20]:
df = pd.read_csv('yelp_business.csv',encoding = 'latin-1',nrows = 8)  # read the first 8 rows of the data

In [21]:
df

Unnamed: 0,business_id,name,neighborhood,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,categories
0,FYWN1wneV18bWNgQjJ2GNg,"""Dental by Design""",,"""4855 E Warner Rd, Ste B9""",Ahwatukee,AZ,85044,33.33069,-111.978599,4.0,22,1,Dentists;General Dentistry;Health & Medical;Or...
1,He-G7vWjzVUysIKrfNbPUQ,"""Stephen Szabo Salon""",,"""3101 Washington Rd""",McMurray,PA,15317,40.291685,-80.1049,3.0,11,1,Hair Stylists;Hair Salons;Men's Hair Salons;Bl...
2,KQPW8lFf1y5BT2MxiSZ3QA,"""Western Motor Vehicle""",,"""6025 N 27th Ave, Ste 1""",Phoenix,AZ,85017,33.524903,-112.11531,1.5,18,1,Departments of Motor Vehicles;Public Services ...
3,8DShNS-LuFqpEWIp0HxijA,"""Sports Authority""",,"""5000 Arizona Mills Cr, Ste 435""",Tempe,AZ,85282,33.383147,-111.964725,3.0,9,0,Sporting Goods;Shopping
4,PfOCPjBrlQAnz__NXj9h_w,"""Brick House Tavern + Tap""",,"""581 Howe Ave""",Cuyahoga Falls,OH,44221,41.119535,-81.47569,3.5,116,1,American (New);Nightlife;Bars;Sandwiches;Ameri...
5,o9eMRCWt5PkpLDE0gOPtcQ,"""Messina""",,"""Richterstr. 11""",Stuttgart,BW,70567,48.7272,9.14795,4.0,5,1,Italian;Restaurants
6,kCoE3jvEtg6UVz5SOD3GVw,"""BDJ Realty""",Summerlin,"""2620 Regatta Dr, Ste 102""",Las Vegas,NV,89128,36.20743,-115.26846,4.0,5,1,Real Estate Services;Real Estate;Home Services...
7,OD2hnuuTJI9uotcKycxg1A,"""Soccer Zone""",,"""7240 W Lake Mead Blvd, Ste 4""",Las Vegas,NV,89128,36.197484,-115.24966,1.5,9,1,Shopping;Sporting Goods


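- DataFrame attributes
    - columns: the column names
    - index: the row index
    - values: the data in the DataFrame

- Index objects

Index objects in pandas store the axis labels and other metadata, such as axis names.

Index objects are immutable; their contents cannot be modified in place.

An Index also behaves like a fixed-size set, except that it may contain duplicate labels.  

Every Index has methods and properties for set logic: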
![微信图片_20190328170224.png](https://i.loli.net/2019/03/28/5c9c8dad46b2b.png)
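The properties described above (immutability, duplicate labels, set logic) can be checked directly. This is an illustrative sketch on hand-made Index objects; `idx`, `dup`, and `other` are example names, and only standard `Index` methods are used:

```python
import pandas as pd

idx = pd.Index(['a', 'b', 'c'])

# Index objects are immutable: item assignment raises TypeError.
try:
    idx[1] = 'z'
except TypeError as exc:
    print('TypeError:', exc)

# Unlike a Python set, an Index may hold duplicate labels.
dup = pd.Index(['a', 'a', 'b'])
print(dup.is_unique)

# Set-style operations return new Index objects.
other = pd.Index(['b', 'c', 'd'])
print(idx.union(other))
print(idx.intersection(other))
print(idx.difference(other))
print(idx.isin(['a', 'c']))
```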

In [22]:
index = df.index
index

RangeIndex(start=0, stop=8, step=1)

In [23]:
name = df.columns
name   

Index(['business_id', 'name', 'neighborhood', 'address', 'city', 'state',
       'postal_code', 'latitude', 'longitude', 'stars', 'review_count',
       'is_open', 'categories'],
      dtype='object')

In [24]:
values = df.values
values

array([['FYWN1wneV18bWNgQjJ2GNg', '"Dental by Design"', nan,
        '"4855 E Warner Rd, Ste B9"', 'Ahwatukee', 'AZ', 85044, 33.3306902,
        -111.9785992, 4.0, 22, 1,
        'Dentists;General Dentistry;Health & Medical;Oral Surgeons;Cosmetic Dentists;Orthodontists'],
       ['He-G7vWjzVUysIKrfNbPUQ', '"Stephen Szabo Salon"', nan,
        '"3101 Washington Rd"', 'McMurray', 'PA', 15317, 40.2916853,
        -80.1048999, 3.0, 11, 1,
        "Hair Stylists;Hair Salons;Men's Hair Salons;Blow Dry/Out Services;Hair Extensions;Beauty & Spas"],
       ['KQPW8lFf1y5BT2MxiSZ3QA', '"Western Motor Vehicle"', nan,
        '"6025 N 27th Ave, Ste 1"', 'Phoenix', 'AZ', 85017, 33.5249025,
        -112.1153098, 1.5, 18, 1,
        'Departments of Motor Vehicles;Public Services & Government'],
       ['8DShNS-LuFqpEWIp0HxijA', '"Sports Authority"', nan,
        '"5000 Arizona Mills Cr, Ste 435"', 'Tempe', 'AZ', 85282,
        33.3831468, -111.96472539999999, 3.0, 9, 0,
        'Sporting Goods;Shoppin

**Working with rows and columns**

- Row and column indexing

In [25]:
col = df['name']  # select a column
col

0            "Dental by Design"
1         "Stephen Szabo Salon"
2       "Western Motor Vehicle"
3            "Sports Authority"
4    "Brick House Tavern + Tap"
5                     "Messina"
6                  "BDJ Realty"
7                 "Soccer Zone"
Name: name, dtype: object

In [26]:
row = df.loc[3]  # rows are selected by label through df.loc[]
row

business_id               8DShNS-LuFqpEWIp0HxijA
name                          "Sports Authority"
neighborhood                                 NaN
address         "5000 Arizona Mills Cr, Ste 435"
city                                       Tempe
state                                         AZ
postal_code                                85282
latitude                                 33.3831
longitude                               -111.965
stars                                          3
review_count                                   9
is_open                                        0
categories               Sporting Goods;Shopping
Name: 3, dtype: object

`df[column]` works for any column name, whereas the attribute form `df.column` works only when the column name is a valid Python identifier.
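A short sketch of the difference, using a hypothetical frame whose second column name contains a space (the column names here are invented for the example and do not come from the Yelp data):

```python
import pandas as pd

# 'review count' contains a space, so it is not a valid Python identifier.
frame = pd.DataFrame({'stars': [4.0, 3.0], 'review count': [22, 11]})

print(frame['stars'])         # bracket access always works
print(frame.stars)            # attribute access works for identifier-like names
print(frame['review count'])  # this column is reachable only with brackets
```

Attribute access also cannot be used to create a new column, so the bracket form is the safer default.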

- Modifying rows and columns (assigning, adding, and deleting columns)

In [27]:
df['stars'] = np.arange(8.)  # a column can also be assigned through indexing

In [28]:
df.stars   

0    0.0
1    1.0
2    2.0
3    3.0
4    4.0
5    5.0
6    6.0
7    7.0
Name: stars, dtype: float64

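When assigning a list or array to a column, its length must match the length of the DataFrame.

When assigning a Series to a column, its labels are realigned to the DataFrame's index, and missing values (NaN) are inserted wherever a label has no match. If the assigned column does not exist, a new column is created.

A column selected from a DataFrame is a view on the underlying data, not a copy, so in the pandas version used here in-place modifications to the Series are reflected in the DataFrame (with Copy-on-Write in newer pandas releases they no longer propagate). To get an independent copy, call the Series's `copy` method explicitly.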
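Two of the rules above can be checked on a toy frame. This is only a sketch: `frame` and `stars_copy` are example names, and the view behaviour itself is not tested here because it depends on the pandas version:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'stars': np.arange(4.0)})

# A list whose length differs from the number of rows is rejected.
try:
    frame['bad'] = [1, 2, 3]
except ValueError as exc:
    print('ValueError:', exc)

# An explicit copy is independent of the frame in every pandas version.
stars_copy = frame['stars'].copy()
stars_copy.iloc[0] = 99.0
print(frame['stars'].iloc[0])  # the frame still holds the original value
```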

In [29]:
ser = pd.Series([18,25,32,6],index = [1,3,5,6])

In [30]:
df['popularity'] = ser
df

Unnamed: 0,business_id,name,neighborhood,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,categories,popularity
0,FYWN1wneV18bWNgQjJ2GNg,"""Dental by Design""",,"""4855 E Warner Rd, Ste B9""",Ahwatukee,AZ,85044,33.33069,-111.978599,0.0,22,1,Dentists;General Dentistry;Health & Medical;Or...,
1,He-G7vWjzVUysIKrfNbPUQ,"""Stephen Szabo Salon""",,"""3101 Washington Rd""",McMurray,PA,15317,40.291685,-80.1049,1.0,11,1,Hair Stylists;Hair Salons;Men's Hair Salons;Bl...,18.0
2,KQPW8lFf1y5BT2MxiSZ3QA,"""Western Motor Vehicle""",,"""6025 N 27th Ave, Ste 1""",Phoenix,AZ,85017,33.524903,-112.11531,2.0,18,1,Departments of Motor Vehicles;Public Services ...,
3,8DShNS-LuFqpEWIp0HxijA,"""Sports Authority""",,"""5000 Arizona Mills Cr, Ste 435""",Tempe,AZ,85282,33.383147,-111.964725,3.0,9,0,Sporting Goods;Shopping,25.0
4,PfOCPjBrlQAnz__NXj9h_w,"""Brick House Tavern + Tap""",,"""581 Howe Ave""",Cuyahoga Falls,OH,44221,41.119535,-81.47569,4.0,116,1,American (New);Nightlife;Bars;Sandwiches;Ameri...,
5,o9eMRCWt5PkpLDE0gOPtcQ,"""Messina""",,"""Richterstr. 11""",Stuttgart,BW,70567,48.7272,9.14795,5.0,5,1,Italian;Restaurants,32.0
6,kCoE3jvEtg6UVz5SOD3GVw,"""BDJ Realty""",Summerlin,"""2620 Regatta Dr, Ste 102""",Las Vegas,NV,89128,36.20743,-115.26846,6.0,5,1,Real Estate Services;Real Estate;Home Services...,6.0
7,OD2hnuuTJI9uotcKycxg1A,"""Soccer Zone""",,"""7240 W Lake Mead Blvd, Ste 4""",Las Vegas,NV,89128,36.197484,-115.24966,7.0,9,1,Shopping;Sporting Goods,


The `del` statement removes a column from `df`.

In [31]:
del df['popularity']
df

Unnamed: 0,business_id,name,neighborhood,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,categories
0,FYWN1wneV18bWNgQjJ2GNg,"""Dental by Design""",,"""4855 E Warner Rd, Ste B9""",Ahwatukee,AZ,85044,33.33069,-111.978599,0.0,22,1,Dentists;General Dentistry;Health & Medical;Or...
1,He-G7vWjzVUysIKrfNbPUQ,"""Stephen Szabo Salon""",,"""3101 Washington Rd""",McMurray,PA,15317,40.291685,-80.1049,1.0,11,1,Hair Stylists;Hair Salons;Men's Hair Salons;Bl...
2,KQPW8lFf1y5BT2MxiSZ3QA,"""Western Motor Vehicle""",,"""6025 N 27th Ave, Ste 1""",Phoenix,AZ,85017,33.524903,-112.11531,2.0,18,1,Departments of Motor Vehicles;Public Services ...
3,8DShNS-LuFqpEWIp0HxijA,"""Sports Authority""",,"""5000 Arizona Mills Cr, Ste 435""",Tempe,AZ,85282,33.383147,-111.964725,3.0,9,0,Sporting Goods;Shopping
4,PfOCPjBrlQAnz__NXj9h_w,"""Brick House Tavern + Tap""",,"""581 Howe Ave""",Cuyahoga Falls,OH,44221,41.119535,-81.47569,4.0,116,1,American (New);Nightlife;Bars;Sandwiches;Ameri...
5,o9eMRCWt5PkpLDE0gOPtcQ,"""Messina""",,"""Richterstr. 11""",Stuttgart,BW,70567,48.7272,9.14795,5.0,5,1,Italian;Restaurants
6,kCoE3jvEtg6UVz5SOD3GVw,"""BDJ Realty""",Summerlin,"""2620 Regatta Dr, Ste 102""",Las Vegas,NV,89128,36.20743,-115.26846,6.0,5,1,Real Estate Services;Real Estate;Home Services...
7,OD2hnuuTJI9uotcKycxg1A,"""Soccer Zone""",,"""7240 W Lake Mead Blvd, Ste 4""",Las Vegas,NV,89128,36.197484,-115.24966,7.0,9,1,Shopping;Sporting Goods


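### 2.1 Basic Functionality: Reindexing

`reindex` is an important method on pandas objects; it creates a new object whose data conforms to a new index.

Parameters of the `reindex` method: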

![微信图片_20190328171347.png](https://i.loli.net/2019/03/28/5c9c90588fb49.png)

In [32]:
ser = pd.Series([4,7,-5,3], index = ['d','b','a','c'])
ser

d    4
b    7
a   -5
c    3
dtype: int64

In [33]:
ser2 = ser.reindex(['a','b','c','d','e'])
ser2

a   -5.0
b    7.0
c    3.0
d    4.0
e    NaN
dtype: float64

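For ordered data, you may want to interpolate or fill values when reindexing. The `method` parameter allows this with options such as `ffill`, which fills forward, and `bfill`, which fills backward.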
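A minimal sketch of filling while reindexing, on a made-up Series with gaps in its integer index. The `fill_value` argument shown at the end is another standard `reindex` parameter that the text above does not discuss:

```python
import pandas as pd

s = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# Forward fill: each new label takes the value of the closest preceding label.
filled = s.reindex(range(6), method='ffill')
print(filled)

# fill_value supplies a constant for labels that were not in the original index.
constant = s.reindex(range(6), fill_value='missing')
print(constant)
```

Filling with `method` requires the original index to be monotonically increasing or decreasing.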

- For a DataFrame, reindex can change the row index, the column index, or both

In [34]:
df = pd.DataFrame(np.arange(9).reshape((3,3)), index = ['a','c','d'],columns = ['Ohio','Texas','California'])
df

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [35]:
df.reindex(index = ['a','b','c','d'])  # reindex the rows

Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
b,,,
c,3.0,4.0,5.0
d,6.0,7.0,8.0


In [36]:
df.reindex(columns = ['Texas','Ohio','Utah','California'])

Unnamed: 0,Texas,Ohio,Utah,California
a,1,0,,2
c,4,3,,5
d,7,6,,8


- `loc`: the label-based indexer (used with square brackets, not called like a method)
    
    Series: `s.loc[indexer]`
    
    DataFrame: `df.loc[row_indexer, column_indexer]`
    
    Each indexer can be one of the following:

    - A single label, e.g. 5 or 'a' (note that 5 is interpreted as a label of the index, not as an integer position along the index).

    - A list or array of labels, e.g. ['a', 'b', 'c'].

    - A slice object with labels, e.g. 'a':'f' (note that, contrary to usual Python slices, both the start and the stop are included when present in the index).

    - A boolean array.

    - A callable with one argument (the calling Series or DataFrame) that returns valid output for indexing (one of the above).
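The boolean-array, callable, and label-slice forms in the list above can be sketched as follows. The frame reproduces the small state table used in this section; `frame`, `by_mask`, `by_callable`, and `by_slice` are names chosen for this example:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])

# Boolean array: keep rows where Texas > 1, then pick two columns by label.
by_mask = frame.loc[frame['Texas'] > 1, ['Ohio', 'California']]
print(by_mask)

# Callable: receives the frame and returns a valid indexer (here a boolean Series).
by_callable = frame.loc[lambda d: d['Ohio'] >= 3]
print(by_callable)

# Label slice: both endpoints are included.
by_slice = frame.loc['a':'c', 'Ohio':'Texas']
print(by_slice)
```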

In [37]:
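# Note: 'b' and 'Utah' are not in df. Passing missing labels in a list only works on older
# pandas; pandas 1.0+ raises KeyError here, so use df.reindex(index=..., columns=...) instead.
df.loc[['a','b','c','d'], ['Texas','Ohio','Utah','California']]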

Unnamed: 0,Texas,Ohio,Utah,California
a,1.0,0.0,,2.0
b,,,,
c,4.0,3.0,,5.0
d,7.0,6.0,,8.0


In [38]:
df.loc['a']

Ohio          0
Texas         1
California    2
Name: a, dtype: int32

### 2.2 Dropping Entries from an Axis

- Dropping from a Series

In [39]:
ser = pd.Series(np.arange(5.),index = ['a','b','c','d','e'])
ser

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [40]:
ser.drop('b')

a    0.0
c    2.0
d    3.0
e    4.0
dtype: float64

- Dropping from a DataFrame

In [41]:
df

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [42]:
df.drop('c')

Unnamed: 0,Ohio,Texas,California
a,0,1,2
d,6,7,8


In [43]:
df.drop('Ohio',axis = 1)

Unnamed: 0,Texas,California
a,1,2
c,4,5
d,7,8


In [44]:
df.drop(['Ohio','Texas'],axis = 1)

Unnamed: 0,California
a,2
c,5
d,8


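### 2.3 Indexing, Selection, and Filtering

- Series indexing

Series indexing works much like NumPy array indexing, except that you can also use the Series's index labels (such as strings) and lists of labels, not only integers.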

In [45]:
ser

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [46]:
ser['e']  # a single label

4.0

In [47]:
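ser[4]  # a single integer position; recent pandas versions deprecate positional access through [], and ser.iloc[4] is the explicit form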

4.0

In [48]:
ser[1:4]  # integer slice

b    1.0
c    2.0
d    3.0
dtype: float64

In [49]:
ser['b':'d']  # label slice: ordinary Python slices exclude the endpoint, but a label slice on a Series includes it

b    1.0
c    2.0
d    3.0
dtype: float64

In [50]:
ser[['b','c','d']]  # list of labels: pick exactly the labels you want

b    1.0
c    2.0
d    3.0
dtype: float64

In [51]:
ser[ser<2]  # boolean indexing

a    0.0
b    1.0
dtype: float64

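- DataFrame indexing

One or more columns can be selected by indexing. The `In [n]` counters show that some of the following cells were run out of order, so a few outputs reflect later modifications to `df`.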

In [62]:
df.loc['s'] = [16,18,20]

In [64]:
df

Unnamed: 0,Ohio,Texas,California
a,0,0,0
c,0,0,5
d,6,7,8
s,16,18,20


In [65]:
df['Texas']

a     0
c     0
d     7
s    18
Name: Texas, dtype: int64

In [53]:
df[['Ohio','Texas']]  # select several columns

Unnamed: 0,Ohio,Texas
a,0,1
c,3,4
d,6,7


In [54]:
df[:4]  # an integer slice selects rows

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [55]:
df[:'e']  # a label slice selects rows

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [56]:
df[df['Texas'] > 5]  # boolean indexing

Unnamed: 0,Ohio,Texas,California
d,6,7,8


In [57]:
df[df<5] = 0
df

Unnamed: 0,Ohio,Texas,California
a,0,0,0
c,0,0,5
d,6,7,8


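- Selecting data with loc and iloc

For label-based indexing on the rows of a DataFrame, the special indexers `loc` and `iloc` select a subset of rows and columns using NumPy-style syntax, with axis labels (`loc`) or integer positions (`iloc`). Both support:

- label / list indexing
- slicing
    
![微信图片_20190329115247.png](https://i.loli.net/2019/03/29/5c9d96a50e999.png)  

`loc` indexes by label, and `loc[x:y]` includes the endpoint `y`. `iloc[x:y]` indexes by integer position and, like NumPy and Python slicing, excludes the endpoint.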

In [58]:
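# Label-based indexing. Note: 'e' and 's' are not in the index at this point; selecting missing
# labels with a list only works on older pandas (1.0+ raises KeyError; use reindex instead).
df.loc[['a','e','s'],['Texas','California']]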

Unnamed: 0,Texas,California
a,0.0,0.0
e,,
s,,


In [67]:
df.iloc[[0,1,3],[1,2]]  # integer-position indexing

Unnamed: 0,Texas,California
a,0,0
c,0,5
s,18,20


In [60]:
df.loc['c':'e','Texas':'California']  # label slices

Unnamed: 0,Texas,California
c,0,5
d,7,8


In [68]:
df

Unnamed: 0,Ohio,Texas,California
a,0,0,0
c,0,0,5
d,6,7,8
s,16,18,20


In [71]:
df.iloc[1:4,1:3]  # integer slices

Unnamed: 0,Texas,California
c,0,5
d,7,8
s,18,20


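### 2.4 Arithmetic and Data Alignment

Arithmetic between different DataFrame/Series objects aligns on index labels: when objects are added and some index pairs differ, the index of the result is the union of the index pairs.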

In [72]:
d1 = pd.Series([7.3,-2.5,3.4,1.5],index = ['a','c','d','e'])
d1

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

In [73]:
d2 = pd.Series([-2.1,3.6,-1.5,4,3.1], index = ['a','c','e','f','g'])
d2

a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

In [75]:
d1 + d2   # alignment between Series objects

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

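Internal data alignment introduces missing values at label positions that do not overlap, and these missing values propagate through subsequent arithmetic.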

In [78]:
df1 = pd.DataFrame(np.arange(9.).reshape(3,3),index = ['Ohio','Texas','Colorado'],columns = ['a','b','c'])
df1

Unnamed: 0,a,b,c
Ohio,0.0,1.0,2.0
Texas,3.0,4.0,5.0
Colorado,6.0,7.0,8.0


In [81]:
df2 = pd.DataFrame(np.arange(12.).reshape(4,3), index = ['Utah','Ohio','Texas','Oregon'], columns = ['b','c','e'])
df2

Unnamed: 0,b,c,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [90]:
(df1 + df2)

Unnamed: 0,a,b,c,e
Colorado,,,,
Ohio,,4.0,6.0,
Oregon,,,,
Texas,,10.0,12.0,
Utah,,,,


#### 2.4.1 Arithmetic Methods with Fill Values

In [106]:
df1.add(df2, fill_value = 2)

Unnamed: 0,a,b,c,e
Colorado,8.0,9.0,10.0,
Ohio,2.0,4.0,6.0,7.0
Oregon,,11.0,12.0,13.0
Texas,5.0,10.0,12.0,10.0
Utah,,2.0,3.0,4.0


#### 2.4.2 Operations Between DataFrame and Series

In [107]:
df = pd.DataFrame(np.arange(12.).reshape(4,3),columns = ['b','d','e'],index = ['Utah','Ohio','Texas','Oregon'])
df

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [108]:
ser = df.iloc[0]
ser

b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

**In arithmetic between a DataFrame and a Series, the Series index is matched against the DataFrame's columns, and the operation is broadcast down the rows.**

In [109]:
df - ser   

Unnamed: 0,b,d,e
Utah,0.0,0.0,0.0
Ohio,3.0,3.0,3.0
Texas,6.0,6.0,6.0
Oregon,9.0,9.0,9.0


In [110]:
ser2 =pd.Series(range(3),index = ['d','e','s'])
ser2

d    0
e    1
s    2
dtype: int32

In [111]:
df - ser2

Unnamed: 0,b,d,e,s
Utah,,1.0,1.0,
Ohio,,4.0,4.0,
Texas,,7.0,7.0,
Oregon,,10.0,10.0,


Broadcasting across the columns while matching on the rows is also possible, but it requires pandas' arithmetic methods:
```
add / +
sub / -
div / /
floordiv / //
mul / *
pow / **
```
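Each of these methods also has a reversed counterpart prefixed with `r` (for example `rdiv`), which swaps the operands; that detail is an addition to the list above. A small sketch on a made-up frame, also showing row-wise matching with `axis='index'`:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(1.0, 7.0).reshape(2, 3), columns=['b', 'd', 'e'])

# rdiv flips the operands: frame.rdiv(1) computes 1 / frame.
inverse = frame.rdiv(1)
print(inverse.equals(1 / frame))

# floordiv matches the // operator.
halves = frame.floordiv(2)
print(halves.equals(frame // 2))

# With axis='index', a Series is matched on row labels and broadcast across columns.
row_totals = frame.sum(axis=1)
shares = frame.div(row_totals, axis='index')
print(shares.sum(axis=1))  # each row sums to 1, up to floating-point rounding
```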

In [112]:
ser3 = df['e']
ser3

Utah       2.0
Ohio       5.0
Texas      8.0
Oregon    11.0
Name: e, dtype: float64

In [113]:
df.sub(ser3, axis = 'index')  # axis is the axis to match on; 'index' (or 0) matches on the row labels

Unnamed: 0,b,d,e
Utah,-2.0,-1.0,0.0
Ohio,-2.0,-1.0,0.0
Texas,-2.0,-1.0,0.0
Oregon,-2.0,-1.0,0.0


### 2.5 Function Application and Mapping

In [114]:
df

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


NumPy functions also work on pandas objects.

In [115]:
np.square(df)

Unnamed: 0,b,d,e
Utah,0.0,1.0,4.0
Ohio,9.0,16.0,25.0
Texas,36.0,49.0,64.0
Oregon,81.0,100.0,121.0


A DataFrame's `apply` method applies a function to the one-dimensional arrays that make up each column or each row.

In [121]:
f = lambda x: x.max() - x.min()

In [122]:
df.apply(f)

b    9.0
d    9.0
e    9.0
dtype: float64

The function `f` is called once for each column of the DataFrame. If you pass `axis = 'columns'` to `apply`, `f` is called once for each row instead.
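The two directions side by side, as a sketch on a frame with the same data as the `df` in this section (`frame`, `spread`, `per_column`, and `per_row` are names chosen for this example):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.0).reshape(4, 3),
                     columns=['b', 'd', 'e'],
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

def spread(x):
    """Range (max minus min) of a one-dimensional Series."""
    return x.max() - x.min()

per_column = frame.apply(spread)                 # called once per column
per_row = frame.apply(spread, axis='columns')    # called once per row
print(per_column)
print(per_row)
```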

The function passed to `apply` need not return a scalar; it can also return a Series with multiple values.

In [125]:
def f(x):
    return pd.Series([x.min(),x.max()],index = ['min','max'])

In [126]:
df.apply(f)

Unnamed: 0,b,d,e
min,0.0,1.0,2.0
max,9.0,10.0,11.0


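Element-wise functions can be used too, through the DataFrame `applymap()` method. It is named `applymap()` because Series already has a `map` method for element-wise application. (Newer pandas releases rename `DataFrame.applymap` to `DataFrame.map`.)
```
df.apply()      # along the rows or columns of a DataFrame
df.applymap()   # every element of a DataFrame
ser.map()       # every element of a Series
```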

In [129]:
f = lambda x: x+8

In [130]:
df.applymap(f)  # apply f to every element

Unnamed: 0,b,d,e
Utah,8.0,9.0,10.0
Ohio,11.0,12.0,13.0
Texas,14.0,15.0,16.0
Oregon,17.0,18.0,19.0


In [131]:
df['e'].map(f)  # apply f to every element

Utah      10.0
Ohio      13.0
Texas     16.0
Oregon    19.0
Name: e, dtype: float64

### 2.6 Sorting and Ranking

- Sort by index labels: sort_index()
- Sort by values: sort_values()

    - axis: the direction of the sort (rows or columns)
    - ascending: ascending or descending order

In [133]:
# Sorting a Series
ser = pd.Series(range(4),index = ['d','a','s','e'])
ser

d    0
a    1
s    2
e    3
dtype: int32

In [134]:
ser.sort_index()

a    1
d    0
e    3
s    2
dtype: int32

In [135]:
# Sorting a DataFrame
df = pd.DataFrame(np.arange(8).reshape(2,4),index = ['three','one'],columns = ['d','a','e','c'])
df

Unnamed: 0,d,a,e,c
three,0,1,2,3
one,4,5,6,7


**Sorting by index labels**

In [140]:
df.sort_index()  # sorts by row labels by default

Unnamed: 0,d,a,e,c
one,4,5,6,7
three,0,1,2,3


In [141]:
df.sort_index(axis = 1)  # axis selects whether row or column labels are sorted

Unnamed: 0,a,c,d,e
three,1,3,0,2
one,5,7,4,6


In [142]:
df.sort_index(axis = 1, ascending = False)  # ascending selects ascending or descending order

Unnamed: 0,e,d,c,a
three,2,0,3,1
one,6,4,7,5


**Sorting by values**  

One or more columns can be used as sort keys through the `by` parameter.

In [144]:
df.loc['five'] = [3,8,10,6]
df

Unnamed: 0,d,a,e,c
three,0,1,2,3
one,4,5,6,7
five,3,8,10,6


In [145]:
df.sort_values(by = 'd')

Unnamed: 0,d,a,e,c
three,0,1,2,3
five,3,8,10,6
one,4,5,6,7


In [146]:
df.sort_values(by = ['d','c'])

Unnamed: 0,d,a,e,c
three,0,1,2,3
five,3,8,10,6
one,4,5,6,7


In [149]:
df.sort_values(by = 'five',axis = 1)

Unnamed: 0,d,c,a,e
three,0,3,1,2
one,4,7,5,6
five,3,6,8,10
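The section title also covers ranking, which the cells above do not demonstrate. A brief sketch using the standard `rank` method on a made-up Series; ranks start at 1, and the `method` argument decides how ties are handled:

```python
import pandas as pd

s = pd.Series([7, -5, 7, 4, 2, 0, 4])

# Default: tied values share the average of the ranks they would occupy.
average_rank = s.rank()
print(average_rank)

# method='first' breaks ties by order of appearance in the data.
first_rank = s.rank(method='first')
print(first_rank)

# ascending=False ranks the largest value first; method='max' gives ties the highest rank of the group.
descending_rank = s.rank(ascending=False, method='max')
print(descending_rank)
```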


### 2.7 Axis Indexes with Duplicate Labels

pandas allows duplicate index labels.

In [5]:
ser = pd.Series(range(5),index = ['a','a','c','e','e'])
ser

a    0
a    1
c    2
e    3
e    4
dtype: int32

The index's `is_unique` property tells you whether its labels are unique.

In [6]:
ser.index.is_unique

False

In [7]:
ser['e']

e    3
e    4
dtype: int32

The same applies to a DataFrame.
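A minimal sketch of the DataFrame case, using a made-up frame with repeated row labels:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(8).reshape(4, 2),
                     index=['a', 'a', 'b', 'b'],
                     columns=['x', 'y'])

print(frame.index.is_unique)

# A duplicated label selects every matching row, so the result is a DataFrame.
rows_b = frame.loc['b']
print(rows_b)
```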

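## 3. Summarizing and Computing Descriptive Statistics

pandas objects come with a set of mathematical and statistical methods. Compared with the similar methods in NumPy, they have built-in handling for missing values.   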

![微信图片_20190331192004.png](https://i.loli.net/2019/03/31/5ca0a3fc47530.png)  

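Commonly used optional parameters of these functions:  

```
axis: the axis the function reduces over; axis=0 works down the rows (one result per column), axis=1 works across the columns (one result per row)
skipna: whether to exclude missing values; defaults to True
level: for a multi-level (hierarchical) index, reduce grouped by the given level (later pandas releases deprecate this in favor of groupby(level=...))
```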
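The effect of `axis` and `skipna` can be sketched as follows. The frame holds the same data as the `df` used in this section; `frame`, `col_sums`, `row_sums`, and `strict` are names chosen for this example:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[1, 4, np.nan], [7, 5, 8], [np.nan, np.nan, np.nan], [6, 3, 9]],
                     index=['a', 'b', 'c', 'd'],
                     columns=['one', 'two', 'three'])

col_sums = frame.sum()                      # axis=0 (default): one sum per column
row_sums = frame.sum(axis=1)                # axis=1: one sum per row
strict = frame.mean(axis=1, skipna=False)   # any NaN in a row makes that row's mean NaN
print(col_sums)
print(row_sums)
print(strict)
```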

In [2]:
df = pd.DataFrame([[1,4,np.nan],[7,5,8],[np.nan,np.nan,np.nan],[6,3,9]],index = ['a','b','c','d'], columns = ['one','two','three'])
df

Unnamed: 0,one,two,three
a,1.0,4.0,
b,7.0,5.0,8.0
c,,,
d,6.0,3.0,9.0


In [4]:
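df.sum()  # 1) sums each column by default; the axis parameter changes the direction. 2) NA values are skipped automatically unless an entire row or column is NA; skipna controls this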

one      14.0
two      12.0
three    17.0
dtype: float64

In [5]:
df.count()  # number of non-NA values

one      3
two      3
three    2
dtype: int64

In [10]:
df.describe()  # summary statistics for each column of the DataFrame; the result is also a DataFrame

Unnamed: 0,one,two,three
count,3.0,3.0,2.0
mean,4.666667,4.0,8.5
std,3.21455,1.0,0.707107
min,1.0,3.0,8.0
25%,3.5,3.5,8.25
50%,6.0,4.0,8.5
75%,6.5,4.5,8.75
max,7.0,5.0,9.0


In [22]:
ser = pd.Series(['a','a','b','c'] * 4)

In [24]:
ser.describe()  # for non-numeric data, describe produces a different set of summary statistics

count     16
unique     3
top        a
freq       8
dtype: object

### 3.1 Correlation and Covariance

```
corr: correlation
cov: covariance
```

In [30]:
df = pd.DataFrame([[3,2,6],[6,8,10],[2,9,5],[15,1,7]],index = ['a','c','e','s'], columns = ['IBM','APL','Goole'])
df

Unnamed: 0,IBM,APL,Goole
a,3,2,6
c,6,8,10
e,2,9,5
s,15,1,7


In [31]:
# Correlation and covariance between two Series (for example, two columns of a DataFrame)
df.IBM.corr(df.Goole)

0.28690229202651563

In [34]:
df.IBM.cov(df.Goole)

3.6666666666666665

In [35]:
# A DataFrame's corr and cov methods return the full correlation and covariance matrices as DataFrames
df.corr()

Unnamed: 0,IBM,APL,Goole
IBM,1.0,-0.593456,0.286902
APL,-0.593456,1.0,0.151186
Goole,0.286902,0.151186,1.0


In [36]:
df.cov()

Unnamed: 0,IBM,APL,Goole
IBM,35.0,-14.333333,3.666667
APL,-14.333333,16.666667,1.333333
Goole,3.666667,1.333333,4.666667


In [37]:
# corrwith computes the correlation of a DataFrame's rows or columns with another Series or DataFrame
df.corrwith(df.IBM)

IBM      1.000000
APL     -0.593456
Goole    0.286902
dtype: float64

### 3.2 Unique Values, Value Counts, and Membership

In [38]:
ser = pd.Series(['a','c','c','e','a','d','e','c','b','e','a','e'])

In [40]:
ser.unique()

array(['a', 'c', 'e', 'd', 'b'], dtype=object)

In [41]:
ser.value_counts()

e    4
a    3
c    3
d    1
b    1
dtype: int64
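The cells above cover unique values and counts; membership can be checked with the standard `isin` method. A short sketch on the same values, with `s` and `mask` as example names:

```python
import pandas as pd

s = pd.Series(['a', 'c', 'c', 'e', 'a', 'd', 'e', 'c', 'b', 'e', 'a', 'e'])

mask = s.isin(['b', 'c'])   # True where the value is 'b' or 'c'
matches = s[mask]           # keep only the matching entries
print(mask)
print(matches)
```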