
# **The DataFrame Data Structure**


In [0]:
import pandas as pd
import numpy as np

# **1. Creating a DataFrame**

## Creating from Series

In [0]:
purchase_1 = pd.Series({'Name':'Chris',
                       'Item Purchased':'Dog Food',
                       'Cost':22.50})
purchase_2 = pd.Series({'Name':'kevyn',
                       'Item Purchased':'Kitty Litter',
                       'Cost':2.50})
purchase_3 = pd.Series({'Name':'Vinod',
                       'Item Purchased':'Bird Seed',
                       'Cost':5.00})
df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index = ['Store 1', 'Store 1', 'Store 2'])
df

Unnamed: 0,Name,Item Purchased,Cost
Store 1,Chris,Dog Food,22.5
Store 1,kevyn,Kitty Litter,2.5
Store 2,Vinod,Bird Seed,5.0


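# **2. Querying / Selecting**
The row labels (index) and column labels of a DataFrame exist to make lookups convenient.<br>
Because index and column labels may repeat, a lookup that matches more than one entry returns a DataFrame; otherwise it returns a Series.<br>
**Most DataFrame queries should go through loc and iloc.**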
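
As a minimal sketch of the rule above, the following uses a small hypothetical frame (`df_demo` and its values are illustrative, not part of the notebook's data) to compare label-based `loc` with position-based `iloc`, and to show how a repeated label changes the return type.

```python
import pandas as pd

# Hypothetical toy data; the label 'Store 1' appears twice on purpose.
df_demo = pd.DataFrame(
    {'Name': ['Chris', 'Kevyn', 'Vinod'], 'Cost': [22.5, 2.5, 5.0]},
    index=['Store 1', 'Store 1', 'Store 2'],
)

by_label_many = df_demo.loc['Store 1']  # two matching rows -> DataFrame
by_label_one = df_demo.loc['Store 2']   # one matching row  -> Series
by_position = df_demo.iloc[0]           # first row by position -> Series

print(type(by_label_many).__name__)
print(type(by_label_one).__name__)
print(by_position['Name'])
```

`iloc` ignores the labels entirely, so it always addresses exactly one row per integer position, even when labels repeat.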

## Selecting rows (index)

In [0]:
# Returns a Series
print(type(df.loc['Store 2']))
print()
df.loc['Store 2']

<class 'pandas.core.series.Series'>



Name                  Vinod
Item Purchased    Bird Seed
Cost                      5
Name: Store 2, dtype: object

In [0]:
# Returns a DataFrame
print(type(df.loc['Store 1']))
print()
df.loc['Store 1']

<class 'pandas.core.frame.DataFrame'>



Unnamed: 0,Name,Item Purchased,Cost
Store 1,Chris,Dog Food,22.5
Store 1,kevyn,Kitty Litter,2.5


## Selecting columns



In [0]:
# Method 1: select all the data in one column
df['Cost']

Store 1    22.5
Store 1     2.5
Store 2     5.0
Name: Cost, dtype: float64

In [0]:
# Method 2: transpose the frame so that columns become rows
print(df.T)
print()

df.T.loc['Cost']


                 Store 1       Store 1    Store 2
Name               Chris         kevyn      Vinod
Item Purchased  Dog Food  Kitty Litter  Bird Seed
Cost                22.5           2.5          5



Store 1    22.5
Store 1     2.5
Store 2       5
Name: Cost, dtype: object

## Selecting a single item

In [0]:
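# Method 1: index by row label, then by column name; this is a chained call
# Drawback: chaining can be slower, so it is best avoided
# The chained call may return a copy, so it is not a reliable way to change the data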
df.loc['Store 1']['Cost']

Store 1    22.5
Store 1     2.5
Name: Cost, dtype: float64

In [0]:
# Method 2: select one column of one row label in a single loc call
df.loc['Store 1', 'Cost']

# Slicing is supported as well
#df.loc[:, ['Name', 'Cost']]

Store 1    22.5
Store 1     2.5
Name: Cost, dtype: float64
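
The practical consequence of the two methods shows up when writing. A chained expression such as `df.loc[row][col] = value` may assign into a temporary object, whereas a single `loc[row, col]` call addresses the original frame. The sketch below only demonstrates the single-call form; `df_demo` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with a repeated row label.
df_demo = pd.DataFrame(
    {'Name': ['Chris', 'Kevyn', 'Vinod'], 'Cost': [22.5, 2.5, 5.0]},
    index=['Store 1', 'Store 1', 'Store 2'],
)

# One indexing call on the original frame: the write lands in df_demo itself.
df_demo.loc['Store 1', 'Cost'] = 0.0

print(df_demo['Cost'].tolist())
```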

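## Boolean Mask
A Boolean mask can be a one-dimensional Series or a two-dimensional DataFrame in which every element is True or False.<br>
Plain square brackets (df[...]) are used to apply a Boolean mask.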
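
To make the one-dimensional and two-dimensional cases concrete, here is a small sketch on hypothetical data (`scores` is illustrative only). A Series mask selects whole rows, while a DataFrame mask keeps the shape and blanks out individual cells.

```python
import pandas as pd

scores = pd.DataFrame({'a': [1, 5, 3], 'b': [4, 2, 6]})

row_mask = scores['a'] > 2      # 1-D mask: one True/False per row
cell_mask = scores > 2          # 2-D mask: one True/False per cell

rows_kept = scores[row_mask]    # keeps whole rows where the mask is True
cells_kept = scores[cell_mask]  # keeps the shape; False cells become NaN

print(rows_kept)
print(cells_kept)
```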

In [0]:
# Data preparation
url = 'https://raw.githubusercontent.com/irJERAD/Intro-to-Data-Science-in-Python/master/MyNotebooks/olympics.csv'
df = pd.read_csv(url, index_col=0, skiprows=1)
for col in df.columns:
  if col[:2] == '01':
    df.rename(columns={col:'Gold'+col[4:]}, inplace=True)
  if col[:2] == '02':
    df.rename(columns={col : 'Silver'+col[4:]}, inplace=True)
  if col[:2] == '03':
    df.rename(columns={col : 'Bronze' + col[4:]}, inplace=True)
  if col[:1] == '№':
    df.rename(columns={col: '#' + col[2:]}, inplace=True)
df.head()

Unnamed: 0,#Summer,Gold,Silver,Bronze,Total,#Winter,Gold.1,Silver.1,Bronze.1,Total.1,#Games,Gold.2,Silver.2,Bronze.2,Combined total
Afghanistan (AFG),13,0,0,2,2,0,0,0,0,0,13,0,0,2,2
Algeria (ALG),12,5,2,8,15,3,0,0,0,0,15,5,2,8,15
Argentina (ARG),23,18,24,28,70,18,0,0,0,0,41,18,24,28,70
Armenia (ARM),5,1,2,9,12,6,0,0,0,0,11,1,2,9,12
Australasia (ANZ) [ANZ],2,3,4,5,12,0,0,0,0,0,2,3,4,5,12


In [0]:
# Boolean mask of countries that have won gold at the Summer Olympics
df['Gold'] > 0

Afghanistan (AFG)                               False
Algeria (ALG)                                    True
Argentina (ARG)                                  True
Armenia (ARM)                                    True
Australasia (ANZ) [ANZ]                          True
                                                ...  
Independent Olympic Participants (IOP) [IOP]    False
Zambia (ZAM) [ZAM]                              False
Zimbabwe (ZIM) [ZIM]                             True
Mixed team (ZZX) [ZZX]                           True
Totals                                           True
Name: Gold, Length: 147, dtype: bool

### Method 1: where() (similar to WHERE in SQL)
Every row that fails the condition is filled entirely with NaN.

In [0]:
# Retrieve the countries that have won gold at the Summer Olympics
only_gold = df.where(df['Gold'] > 0)
only_gold.head()

Unnamed: 0,#Summer,Gold,Silver,Bronze,Total,#Winter,Gold.1,Silver.1,Bronze.1,Total.1,#Games,Gold.2,Silver.2,Bronze.2,Combined total
Afghanistan (AFG),,,,,,,,,,,,,,,
Algeria (ALG),12.0,5.0,2.0,8.0,15.0,3.0,0.0,0.0,0.0,0.0,15.0,5.0,2.0,8.0,15.0
Argentina (ARG),23.0,18.0,24.0,28.0,70.0,18.0,0.0,0.0,0.0,0.0,41.0,18.0,24.0,28.0,70.0
Armenia (ARM),5.0,1.0,2.0,9.0,12.0,6.0,0.0,0.0,0.0,0.0,11.0,1.0,2.0,9.0,12.0
Australasia (ANZ) [ANZ],2.0,3.0,4.0,5.0,12.0,0.0,0.0,0.0,0.0,0.0,2.0,3.0,4.0,5.0,12.0


In [0]:
# count() counts only the non-NaN entries
only_gold['Gold'].count()

100

In [0]:
# Compare with the count above: part of the data has been filtered out
df['Gold'].count()

147

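### Method 2: index directly with the mask
Returns a copy of the original data. This approach leaves no NaN rows, because pandas filters out the non-matching rows for you.<br>
Syntactically, think of it as putting the where condition inside the square brackets.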
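
The difference between the two methods can be sketched on a tiny hypothetical frame (`medals` and its values are made up for illustration): `where` keeps every row and fills the non-matching ones with NaN, while bracket indexing removes them.

```python
import pandas as pd

# Hypothetical medal counts, not taken from olympics.csv.
medals = pd.DataFrame({'Gold': [0, 5, 18]}, index=['AFG', 'ALG', 'ARG'])
mask = medals['Gold'] > 0

masked = medals.where(mask)   # same shape; non-matching rows become NaN
filtered = medals[mask]       # non-matching rows are removed

# Dropping the NaN rows from the where() result yields the same row labels.
same_rows = masked.dropna().index.equals(filtered.index)
print(len(masked), len(filtered), same_rows)
```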

In [0]:
only_gold = df[df['Gold'] > 0]
len(only_gold)

100

In [0]:
# Number of countries that have won gold at the Summer or the Winter Olympics
len(df[(df['Gold'] > 0) | (df['Gold.1'] > 0)])

101

In [0]:
df[(df['Gold'] == 0) & (df['Gold.1'] > 0)]

Unnamed: 0,#Summer,Gold,Silver,Bronze,Total,#Winter,Gold.1,Silver.1,Bronze.1,Total.1,#Games,Gold.2,Silver.2,Bronze.2,Combined total
Liechtenstein (LIE),16,0,0,0,0,18,2,2,5,9,34,2,2,5,9


# **3. Common Operations**
Some of these operations modify the original DataFrame.

## Inserting

In [0]:
purchase_1 = pd.Series({'Name':'Chris',
                       'Item Purchased':'Dog Food',
                       'Cost':22.50})
purchase_2 = pd.Series({'Name':'kevyn',
                       'Item Purchased':'Kitty Litter',
                       'Cost':2.50})
purchase_3 = pd.Series({'Name':'Vinod',
                       'Item Purchased':'Bird Seed',
                       'Cost':5.00})
df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index = ['Store 1', 'Store 1', 'Store 2'])

### Inserting a column

In [0]:
df['Delivered'] = True
df

Unnamed: 0,Name,Item Purchased,Cost,Delivered
Store 1,Chris,Dog Food,22.5,True
Store 1,kevyn,Kitty Litter,2.5,True
Store 2,Vinod,Bird Seed,5.0,True


In [0]:
# The list length must match the number of rows
df['Date'] = ['December 1', 'January 1', 'mid_may']
print(df)

#df['Feedback'] = ['Positive', 'Negative']  # raises an error: length mismatch
df['Feedback'] = ['Positive', None, 'Negative']
df

          Name Item Purchased  Cost  Delivered        Date
Store 1  Chris       Dog Food  22.5       True  December 1
Store 1  kevyn   Kitty Litter   2.5       True   January 1
Store 2  Vinod      Bird Seed   5.0       True     mid_may


Unnamed: 0,Name,Item Purchased,Cost,Delivered,Date,Feedback
Store 1,Chris,Dog Food,22.5,True,December 1,Positive
Store 1,kevyn,Kitty Litter,2.5,True,January 1,
Store 2,Vinod,Bird Seed,5.0,True,mid_may,Negative


In [0]:
# With a Series, the length does not have to match
# pandas aligns on the index and fills the missing entries with NaN
adf = df.reset_index()
adf['Date'] = pd.Series({0:'December 1', 2:'mid-May'})
adf

Unnamed: 0,index,Name,Item Purchased,Cost,Delivered,Date,Feedback
0,Store 1,Chris,Dog Food,22.5,True,December 1,Positive
1,Store 1,kevyn,Kitty Litter,2.5,True,,
2,Store 2,Vinod,Bird Seed,5.0,True,mid-May,Negative


## Modifying
If you do not want to modify the original, call copy() first.
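
A minimal sketch of that advice, using hypothetical names (`original`, `working`): the copy is an independent object, so an in-place update on it leaves the source frame unchanged.

```python
import pandas as pd

original = pd.DataFrame({'Cost': [22.5, 2.5, 5.0]})

working = original.copy()   # independent copy (deep by default)
working['Cost'] *= 0.8      # only the copy changes

print(original['Cost'].tolist())
print(working['Cost'].tolist())
```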

In [0]:
purchase_1 = pd.Series({'Name': 'Chris',
                        'Item Purchased': 'Dog Food',
                        'Cost': 22.50})
purchase_2 = pd.Series({'Name': 'Kevyn',
                        'Item Purchased': 'Kitty Litter',
                        'Cost': 2.50})
purchase_3 = pd.Series({'Name': 'Vinod',
                        'Item Purchased': 'Bird Seed',
                        'Cost': 5.00})

df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=['Store 1', 'Store 1', 'Store 2'])

### Method 1: modify a whole column

In [0]:
# Apply a 20% discount to the original prices
df['Cost'] *= 0.8
df

Unnamed: 0,Name,Item Purchased,Cost
Store 1,Chris,Dog Food,18.0
Store 1,Kevyn,Kitty Litter,2.0
Store 2,Vinod,Bird Seed,4.0


### Method 2: assign a list of values


In [0]:
# Assign one value per row
df['Cost'] = [2,4,6]
df

Unnamed: 0,Name,Item Purchased,Cost
Store 1,Chris,Dog Food,2
Store 1,Kevyn,Kitty Litter,4
Store 2,Vinod,Bird Seed,6


### Method 3: conditional modification (**to be completed**)
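
The source leaves this method unfinished. One common way to modify values conditionally, offered here as a sketch rather than the author's intended content, is to pass a Boolean mask and a column name to `loc`. The frame `df_demo`, its unique store labels, and the halving rule are all hypothetical.

```python
import pandas as pd

# Hypothetical data; labels are unique here to keep the assignment unambiguous.
df_demo = pd.DataFrame(
    {'Name': ['Chris', 'Kevyn', 'Vinod'], 'Cost': [22.5, 2.5, 5.0]},
    index=['Store 1', 'Store 2', 'Store 3'],
)

# Hypothetical rule: halve the cost of every item priced above 4.
expensive = df_demo['Cost'] > 4
df_demo.loc[expensive, 'Cost'] = df_demo.loc[expensive, 'Cost'] * 0.5

print(df_demo)
```

Rows where the mask is False are left untouched, which is what distinguishes this from the whole-column update in Method 1.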

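## Deleting
When using drop, two parameters deserve attention: inplace=True modifies the DataFrame in place instead of returning a copy; axis=0 drops rows and axis=1 drops columns.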
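
The two parameters can be sketched on a small hypothetical frame (`df_demo` is illustrative only). Note that with `inplace=True` the call returns `None`, so its result should not be assigned back.

```python
import pandas as pd

df_demo = pd.DataFrame({'Name': ['Chris', 'Vinod'], 'Cost': [22.5, 5.0]},
                       index=['Store 1', 'Store 2'])

without_row = df_demo.drop('Store 1', axis=0)  # new object, row removed
without_col = df_demo.drop('Cost', axis=1)     # new object, column removed

# In-place variant: df_demo itself loses the column and None is returned.
returned = df_demo.drop('Cost', axis=1, inplace=True)

print(returned)
print(list(df_demo.columns))
```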

### Deleting rows
Pass the index label(s) of the rows to drop.

In [0]:
# Method 1: does not modify the original data; returns a copy holding the result
result = df.drop('Store 1')

print(result)
print()
df

          Name Item Purchased  Cost Location
Store 2  Vinod      Bird Seed   5.0     None



Unnamed: 0,Name,Item Purchased,Cost,Location
Store 1,Chris,Dog Food,22.5,
Store 1,kevyn,Kitty Litter,2.5,
Store 2,Vinod,Bird Seed,5.0,


In [0]:
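# Method 2: copy, then drop
# The purpose of copying first is not clear to me yet, since drop already returns a new object by default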
copy_df = df.copy()
copy_df = copy_df.drop('Store 1')
copy_df

Unnamed: 0,Name,Item Purchased,Cost,Location
Store 2,Vinod,Bird Seed,5.0,


In [0]:
df

Unnamed: 0,Name,Item Purchased,Cost,Location
Store 1,Chris,Dog Food,22.5,
Store 1,kevyn,Kitty Litter,2.5,
Store 2,Vinod,Bird Seed,5.0,


### Deleting columns

In [0]:
# Method 1: set the axis parameter
result = df.drop('Cost', axis = 1)
print(df)
print()
result

          Name Item Purchased  Cost Location
Store 1  Chris       Dog Food  22.5     None
Store 1  kevyn   Kitty Litter   2.5     None
Store 2  Vinod      Bird Seed   5.0     None



Unnamed: 0,Name,Item Purchased,Location
Store 1,Chris,Dog Food,
Store 1,kevyn,Kitty Litter,
Store 2,Vinod,Bird Seed,


In [0]:
# Method 2: the del keyword takes effect in place and does not return a copy
del copy_df['Name']
copy_df

### Dropping NaN values
dropna() does not modify the original DataFrame.

In [0]:
import numpy as np
df_na = pd.DataFrame(np.random.randint(1,10, (3,5)))
df_na

Unnamed: 0,0,1,2,3,4
0,6,9,8,3,7
1,5,9,6,9,5
2,2,9,9,3,5


In [0]:
# Drop the rows that contain NaN
df_na.loc[1,1] = np.nan
print(df_na)

df_na.dropna(axis=0)


   0    1  2  3  4
0  6  9.0  8  3  7
1  5  NaN  6  9  5
2  2  9.0  9  3  5


Unnamed: 0,0,1,2,3,4
0,6,9.0,8,3,7
2,2,9.0,9,3,5


In [0]:
# Drop the columns that contain NaN
df_na.loc[2,2] = np.nan
print(df_na)
df_na.dropna(axis=1)

   0    1    2  3  4
0  6  1.0  8.0  4  5
1  9  NaN  3.0  5  5
2  2  6.0  NaN  4  7


Unnamed: 0,0,3,4
0,6,4,5
1,9,5,5
2,2,4,7


## Index operations
**Function summary**<br>
Sorting: sort_index()<br>
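
sort_index() is listed above but not demonstrated in the cells that follow, so here is a minimal sketch on hypothetical data (`df_demo` is illustrative only). It returns a sorted copy and leaves the original order untouched.

```python
import pandas as pd

df_demo = pd.DataFrame({'Total': [70, 2, 15]},
                       index=['Argentina', 'Afghanistan', 'Algeria'])

ascending = df_demo.sort_index()                   # sorted copy, A to Z
descending = df_demo.sort_index(ascending=False)   # sorted copy, Z to A

print(ascending.index.tolist())
print(descending.index.tolist())
```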

In [0]:
url = 'https://raw.githubusercontent.com/irJERAD/Intro-to-Data-Science-in-Python/master/MyNotebooks/olympics.csv'
df = pd.read_csv(url, index_col=0, skiprows=1)
df.head()

Unnamed: 0,№ Summer,01 !,02 !,03 !,Total,№ Winter,01 !.1,02 !.1,03 !.1,Total.1,№ Games,01 !.2,02 !.2,03 !.2,Combined total
Afghanistan (AFG),13,0,0,2,2,0,0,0,0,0,13,0,0,2,2
Algeria (ALG),12,5,2,8,15,3,0,0,0,0,15,5,2,8,15
Argentina (ARG),23,18,24,28,70,18,0,0,0,0,41,18,24,28,70
Armenia (ARM),5,1,2,9,12,6,0,0,0,0,11,1,2,9,12
Australasia (ANZ) [ANZ],2,3,4,5,12,0,0,0,0,0,2,3,4,5,12


### Method 1: use another column as the index
set_index is destructive: it discards the current index, so back the index up first if you still need it.
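
An alternative to copying the index into a column by hand, sketched here under the assumption that the index has a usable name (`country` is a hypothetical name, as is `df_demo`), is to call reset_index() first. That moves the current index into a regular column before set_index replaces it.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'Total': [2, 15]},
    index=pd.Index(['Afghanistan', 'Algeria'], name='country'),
)

# reset_index() turns the current index into a column named 'country',
# so it survives the following set_index call.
reindexed = df_demo.reset_index().set_index('Total')

print(reindexed)
```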

In [0]:
# Back up the current index
df['country'] = df.index
df.head()

Unnamed: 0,№ Summer,01 !,02 !,03 !,Total,№ Winter,01 !.1,02 !.1,03 !.1,Total.1,№ Games,01 !.2,02 !.2,03 !.2,Combined total,country
Afghanistan (AFG),13,0,0,2,2,0,0,0,0,0,13,0,0,2,2,Afghanistan (AFG)
Algeria (ALG),12,5,2,8,15,3,0,0,0,0,15,5,2,8,15,Algeria (ALG)
Argentina (ARG),23,18,24,28,70,18,0,0,0,0,41,18,24,28,70,Argentina (ARG)
Armenia (ARM),5,1,2,9,12,6,0,0,0,0,11,1,2,9,12,Armenia (ARM)
Australasia (ANZ) [ANZ],2,3,4,5,12,0,0,0,0,0,2,3,4,5,12,Australasia (ANZ) [ANZ]


In [0]:
# Make Total the new index
df = df.set_index('Total')
df.head()

Unnamed: 0_level_0,№ Summer,01 !,02 !,03 !,№ Winter,01 !.1,02 !.1,03 !.1,Total.1,№ Games,01 !.2,02 !.2,03 !.2,Combined total,country
Total,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
2,13,0,0,2,0,0,0,0,0,13,0,0,2,2,Afghanistan (AFG)
15,12,5,2,8,3,0,0,0,0,15,5,2,8,15,Algeria (ALG)
70,23,18,24,28,18,0,0,0,0,41,18,24,28,70,Argentina (ARG)
12,5,1,2,9,6,0,0,0,0,11,1,2,9,12,Armenia (ARM)
12,2,3,4,5,0,0,0,0,0,2,3,4,5,12,Australasia (ANZ) [ANZ]


In [0]:
# Clean up the header row
# set_index gives the index a name
# If you do not want it, assign None to it
df.index.name = None
df.head()

Unnamed: 0,№ Summer,01 !,02 !,03 !,№ Winter,01 !.1,02 !.1,03 !.1,Total.1,№ Games,01 !.2,02 !.2,03 !.2,Combined total,country
2,13,0,0,2,0,0,0,0,0,13,0,0,2,2,Afghanistan (AFG)
15,12,5,2,8,3,0,0,0,0,15,5,2,8,15,Algeria (ALG)
70,23,18,24,28,18,0,0,0,0,41,18,24,28,70,Argentina (ARG)
12,5,1,2,9,6,0,0,0,0,11,1,2,9,12,Armenia (ARM)
12,2,3,4,5,0,0,0,0,0,2,3,4,5,12,Australasia (ANZ) [ANZ]


### Method 2: use sequential integers as the index (like a primary key in a database)

In [0]:
# Restore the index name, because it was removed earlier
df.index.name = 'Total'
df = df.reset_index()
df.head()

Unnamed: 0,Total,№ Summer,01 !,02 !,03 !,№ Winter,01 !.1,02 !.1,03 !.1,Total.1,№ Games,01 !.2,02 !.2,03 !.2,Combined total,country
0,2,13,0,0,2,0,0,0,0,0,13,0,0,2,2,Afghanistan (AFG)
1,15,12,5,2,8,3,0,0,0,0,15,5,2,8,15,Algeria (ALG)
2,70,23,18,24,28,18,0,0,0,0,41,18,24,28,70,Argentina (ARG)
3,12,5,1,2,9,6,0,0,0,0,11,1,2,9,12,Armenia (ARM)
4,12,2,3,4,5,0,0,0,0,0,2,3,4,5,12,Australasia (ANZ) [ANZ]


### Method 3: MultiIndex
Similar to a composite primary key in a database.
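
Before working with the full census file, the idea can be sketched on a few hypothetical rows (the `BIRTHS` figures below are made up, not taken from census.csv). Two columns together form the key, and `loc` accepts a tuple for one row, an outer label for a group, or a list of tuples for several rows.

```python
import pandas as pd

# Hypothetical figures used only to illustrate the indexing forms.
df_demo = pd.DataFrame({
    'STNAME': ['Alabama', 'Alabama', 'Michigan'],
    'CTYNAME': ['Autauga County', 'Bibb County', 'Washtenaw County'],
    'BIRTHS': [10, 20, 30],
}).set_index(['STNAME', 'CTYNAME'])

one_row = df_demo.loc[('Alabama', 'Bibb County')]   # full key -> Series
one_state = df_demo.loc['Alabama']                  # outer level only -> DataFrame
several = df_demo.loc[[('Alabama', 'Autauga County'),
                       ('Michigan', 'Washtenaw County')]]  # list of keys -> DataFrame

print(one_row['BIRTHS'])
print(one_state)
print(several)
```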

In [0]:
url = 'https://raw.githubusercontent.com/irJERAD/Intro-to-Data-Science-in-Python/master/MyNotebooks/census.csv'
df = pd.read_csv(url)
df.head()

Unnamed: 0,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,POPESTIMATE2013,POPESTIMATE2014,POPESTIMATE2015,NPOPCHG_2010,NPOPCHG_2011,NPOPCHG_2012,NPOPCHG_2013,NPOPCHG_2014,NPOPCHG_2015,BIRTHS2010,BIRTHS2011,BIRTHS2012,BIRTHS2013,BIRTHS2014,BIRTHS2015,DEATHS2010,DEATHS2011,DEATHS2012,DEATHS2013,DEATHS2014,DEATHS2015,NATURALINC2010,NATURALINC2011,NATURALINC2012,NATURALINC2013,NATURALINC2014,NATURALINC2015,INTERNATIONALMIG2010,...,RESIDUAL2013,RESIDUAL2014,RESIDUAL2015,GQESTIMATESBASE2010,GQESTIMATES2010,GQESTIMATES2011,GQESTIMATES2012,GQESTIMATES2013,GQESTIMATES2014,GQESTIMATES2015,RBIRTH2011,RBIRTH2012,RBIRTH2013,RBIRTH2014,RBIRTH2015,RDEATH2011,RDEATH2012,RDEATH2013,RDEATH2014,RDEATH2015,RNATURALINC2011,RNATURALINC2012,RNATURALINC2013,RNATURALINC2014,RNATURALINC2015,RINTERNATIONALMIG2011,RINTERNATIONALMIG2012,RINTERNATIONALMIG2013,RINTERNATIONALMIG2014,RINTERNATIONALMIG2015,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
0,40,3,6,1,0,Alabama,Alabama,4779736,4780127,4785161,4801108,4816089,4830533,4846411,4858979,5034,15947,14981,14444,15878,12568,14226,59689,59062,57938,58334,58305,11089,48811,48357,50843,50228,50330,3137,10878,10705,7095,8106,7975,1357,...,677,-573,1135,116185,116212,115560,115666,116963,119088,119599,12.45302,12.282581,12.01208,12.056286,12.014973,10.183524,10.05636,10.541099,10.380963,10.371556,2.269496,2.22622,1.470981,1.675322,1.643417,1.02772,1.01984,1.002216,1.142716,1.179963,0.002295,-0.193196,0.381066,0.582002,-0.467369,1.030015,0.826644,1.383282,1.724718,0.712594
1,50,3,6,1,1,Alabama,Autauga County,54571,54571,54660,55253,55175,55038,55290,55347,89,593,-78,-137,252,57,151,636,615,574,623,600,152,507,558,583,504,467,-1,129,57,-9,119,133,33,...,22,-10,45,455,455,455,455,455,455,455,11.572789,11.138479,10.416194,11.293597,10.846281,9.225478,10.106133,10.579514,9.136393,8.442022,2.347311,1.032347,-0.16332,2.157204,2.404259,0.363924,0.289782,0.290347,0.3263,0.343466,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.59227,-2.187333
2,50,3,6,1,3,Alabama,Baldwin County,182265,182265,183193,186659,190396,195126,199713,203709,928,3466,3737,4730,4587,3996,517,2187,2092,2160,2186,2240,532,1825,1879,1902,2044,1992,-15,362,213,258,142,248,69,...,91,434,58,2307,2307,2307,2249,2304,2308,2309,11.826352,11.096524,11.205586,11.072868,11.104997,9.868812,9.966716,9.867141,10.353587,9.875515,1.95754,1.129809,1.338445,0.719281,1.229482,1.011215,0.912334,0.881921,1.073855,1.095627,14.83296,17.647293,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499
3,50,3,6,1,5,Alabama,Barbour County,27457,27457,27341,27226,27159,26973,26815,26489,-116,-115,-67,-186,-158,-326,70,335,300,283,260,269,128,319,291,294,310,309,-58,16,9,-11,-50,-40,2,...,19,-1,-5,3193,3193,3382,3388,3389,3353,3352,12.278483,11.032454,10.455923,9.667584,10.093051,11.692048,10.70148,10.862337,11.526735,11.593877,0.586435,0.330974,-0.406414,-1.859151,-1.500825,-0.146609,-0.257424,-0.11084,-0.074366,0.0,-4.728132,-2.50069,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299
4,50,3,6,1,7,Alabama,Bibb County,22915,22919,22861,22733,22642,22512,22549,22583,-58,-128,-91,-130,37,34,44,266,245,259,247,253,34,278,237,281,211,223,10,-12,8,-22,36,30,2,...,14,-16,-21,2224,2224,2224,2224,2224,2233,2236,11.668202,10.798898,11.471852,10.962917,11.211557,12.194587,10.446281,12.446295,9.365083,9.882124,-0.526385,0.352617,-0.974443,1.597834,1.329434,0.438654,0.705234,0.797272,0.93207,0.930604,-5.527043,-5.068871,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861


In [0]:
df['SUMLEV'].unique()

array([40, 50])

In [0]:
# Keep only the rows whose SUMLEV is 50
df = df[df['SUMLEV'] == 50]
df.head()

Unnamed: 0,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,POPESTIMATE2013,POPESTIMATE2014,POPESTIMATE2015,NPOPCHG_2010,NPOPCHG_2011,NPOPCHG_2012,NPOPCHG_2013,NPOPCHG_2014,NPOPCHG_2015,BIRTHS2010,BIRTHS2011,BIRTHS2012,BIRTHS2013,BIRTHS2014,BIRTHS2015,DEATHS2010,DEATHS2011,DEATHS2012,DEATHS2013,DEATHS2014,DEATHS2015,NATURALINC2010,NATURALINC2011,NATURALINC2012,NATURALINC2013,NATURALINC2014,NATURALINC2015,INTERNATIONALMIG2010,...,RESIDUAL2013,RESIDUAL2014,RESIDUAL2015,GQESTIMATESBASE2010,GQESTIMATES2010,GQESTIMATES2011,GQESTIMATES2012,GQESTIMATES2013,GQESTIMATES2014,GQESTIMATES2015,RBIRTH2011,RBIRTH2012,RBIRTH2013,RBIRTH2014,RBIRTH2015,RDEATH2011,RDEATH2012,RDEATH2013,RDEATH2014,RDEATH2015,RNATURALINC2011,RNATURALINC2012,RNATURALINC2013,RNATURALINC2014,RNATURALINC2015,RINTERNATIONALMIG2011,RINTERNATIONALMIG2012,RINTERNATIONALMIG2013,RINTERNATIONALMIG2014,RINTERNATIONALMIG2015,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
1,50,3,6,1,1,Alabama,Autauga County,54571,54571,54660,55253,55175,55038,55290,55347,89,593,-78,-137,252,57,151,636,615,574,623,600,152,507,558,583,504,467,-1,129,57,-9,119,133,33,...,22,-10,45,455,455,455,455,455,455,455,11.572789,11.138479,10.416194,11.293597,10.846281,9.225478,10.106133,10.579514,9.136393,8.442022,2.347311,1.032347,-0.16332,2.157204,2.404259,0.363924,0.289782,0.290347,0.3263,0.343466,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.59227,-2.187333
2,50,3,6,1,3,Alabama,Baldwin County,182265,182265,183193,186659,190396,195126,199713,203709,928,3466,3737,4730,4587,3996,517,2187,2092,2160,2186,2240,532,1825,1879,1902,2044,1992,-15,362,213,258,142,248,69,...,91,434,58,2307,2307,2307,2249,2304,2308,2309,11.826352,11.096524,11.205586,11.072868,11.104997,9.868812,9.966716,9.867141,10.353587,9.875515,1.95754,1.129809,1.338445,0.719281,1.229482,1.011215,0.912334,0.881921,1.073855,1.095627,14.83296,17.647293,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499
3,50,3,6,1,5,Alabama,Barbour County,27457,27457,27341,27226,27159,26973,26815,26489,-116,-115,-67,-186,-158,-326,70,335,300,283,260,269,128,319,291,294,310,309,-58,16,9,-11,-50,-40,2,...,19,-1,-5,3193,3193,3382,3388,3389,3353,3352,12.278483,11.032454,10.455923,9.667584,10.093051,11.692048,10.70148,10.862337,11.526735,11.593877,0.586435,0.330974,-0.406414,-1.859151,-1.500825,-0.146609,-0.257424,-0.11084,-0.074366,0.0,-4.728132,-2.50069,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299
4,50,3,6,1,7,Alabama,Bibb County,22915,22919,22861,22733,22642,22512,22549,22583,-58,-128,-91,-130,37,34,44,266,245,259,247,253,34,278,237,281,211,223,10,-12,8,-22,36,30,2,...,14,-16,-21,2224,2224,2224,2224,2224,2233,2236,11.668202,10.798898,11.471852,10.962917,11.211557,12.194587,10.446281,12.446295,9.365083,9.882124,-0.526385,0.352617,-0.974443,1.597834,1.329434,0.438654,0.705234,0.797272,0.93207,0.930604,-5.527043,-5.068871,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861
5,50,3,6,1,9,Alabama,Blount County,57322,57322,57373,57711,57776,57734,57658,57673,51,338,65,-42,-76,15,183,744,710,646,618,603,133,570,592,585,589,590,50,174,118,61,29,13,5,...,-22,-14,53,489,489,489,489,489,489,489,12.929686,12.295756,11.185179,10.711314,10.456859,9.905808,10.252236,10.128993,10.20868,10.231421,3.023878,2.04352,1.056186,0.502634,0.225438,0.052136,0.329041,0.34629,0.485302,0.485559,1.807375,-1.177622,-1.748766,-2.062535,-1.36997,1.859511,-0.84858,-1.402476,-1.577232,-0.884411


In [0]:
columns_to_keep = ['STNAME',
                  'CTYNAME',
                  'BIRTHS2010',
                  'BIRTHS2011',
                  'BIRTHS2012',
                  'BIRTHS2013',
                  'BIRTHS2014',
                  'BIRTHS2015',
                  'POPESTIMATE2010',
                  'POPESTIMATE2011',
                  'POPESTIMATE2012',
                  'POPESTIMATE2013',
                  'POPESTIMATE2014',
                  'POPESTIMATE2015']
df = df[columns_to_keep]
df.head()

Unnamed: 0,STNAME,CTYNAME,BIRTHS2010,BIRTHS2011,BIRTHS2012,BIRTHS2013,BIRTHS2014,BIRTHS2015,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,POPESTIMATE2013,POPESTIMATE2014,POPESTIMATE2015
1,Alabama,Autauga County,151,636,615,574,623,600,54660,55253,55175,55038,55290,55347
2,Alabama,Baldwin County,517,2187,2092,2160,2186,2240,183193,186659,190396,195126,199713,203709
3,Alabama,Barbour County,70,335,300,283,260,269,27341,27226,27159,26973,26815,26489
4,Alabama,Bibb County,44,266,245,259,247,253,22861,22733,22642,22512,22549,22583
5,Alabama,Blount County,183,744,710,646,618,603,57373,57711,57776,57734,57658,57673


In [0]:
# Create the MultiIndex
df = df.set_index(['STNAME', 'CTYNAME'])
df.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,BIRTHS2010,BIRTHS2011,BIRTHS2012,BIRTHS2013,BIRTHS2014,BIRTHS2015,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,POPESTIMATE2013,POPESTIMATE2014,POPESTIMATE2015
STNAME,CTYNAME,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
Alabama,Autauga County,151,636,615,574,623,600,54660,55253,55175,55038,55290,55347
Alabama,Baldwin County,517,2187,2092,2160,2186,2240,183193,186659,190396,195126,199713,203709
Alabama,Barbour County,70,335,300,283,260,269,27341,27226,27159,26973,26815,26489
Alabama,Bibb County,44,266,245,259,247,253,22861,22733,22642,22512,22549,22583
Alabama,Blount County,183,744,710,646,618,603,57373,57711,57776,57734,57658,57673


In [0]:
# Selecting from a MultiIndex
df.loc['Alabama', 'Autauga County']

BIRTHS2010           151
BIRTHS2011           636
BIRTHS2012           615
BIRTHS2013           574
BIRTHS2014           623
BIRTHS2015           600
POPESTIMATE2010    54660
POPESTIMATE2011    55253
POPESTIMATE2012    55175
POPESTIMATE2013    55038
POPESTIMATE2014    55290
POPESTIMATE2015    55347
Name: (Alabama, Autauga County), dtype: int64

In [0]:
df.loc[[('Alabama', 'Baldwin County'), ('Alabama', 'Bibb County'), ('Michigan', 'Washtenaw County')]]

## CSV operations

In [0]:
url = 'https://raw.githubusercontent.com/irJERAD/Intro-to-Data-Science-in-Python/master/MyNotebooks/olympics.csv'
df = pd.read_csv(url)
print(df.head())
print()

# Compare the columns and index with the output above
df = pd.read_csv(url, index_col=0, skiprows=1)
df.head()

                   0         1     2     3  ...    12    13    14              15
0                NaN  № Summer  01 !  02 !  ...  01 !  02 !  03 !  Combined total
1  Afghanistan (AFG)        13     0     0  ...     0     0     2               2
2      Algeria (ALG)        12     5     2  ...     5     2     8              15
3    Argentina (ARG)        23    18    24  ...    18    24    28              70
4      Armenia (ARM)         5     1     2  ...     1     2     9              12

[5 rows x 16 columns]



Unnamed: 0,№ Summer,01 !,02 !,03 !,Total,№ Winter,01 !.1,02 !.1,03 !.1,Total.1,№ Games,01 !.2,02 !.2,03 !.2,Combined total
Afghanistan (AFG),13,0,0,2,2,0,0,0,0,0,13,0,0,2,2
Algeria (ALG),12,5,2,8,15,3,0,0,0,0,15,5,2,8,15
Argentina (ARG),23,18,24,28,70,18,0,0,0,0,41,18,24,28,70
Armenia (ARM),5,1,2,9,12,6,0,0,0,0,11,1,2,9,12
Australasia (ANZ) [ANZ],2,3,4,5,12,0,0,0,0,0,2,3,4,5,12


### Selecting columns and rows from the raw file

In [0]:
# index_col: which column to use as the index (comparable to a primary key in a database)
# skiprows: the first line of the raw data is junk, so skip it
df = pd.read_csv(url, index_col=0, skiprows=1)
df.head()

Unnamed: 0,№ Summer,01 !,02 !,03 !,Total,№ Winter,01 !.1,02 !.1,03 !.1,Total.1,№ Games,01 !.2,02 !.2,03 !.2,Combined total
Afghanistan (AFG),13,0,0,2,2,0,0,0,0,0,13,0,0,2,2
Algeria (ALG),12,5,2,8,15,3,0,0,0,0,15,5,2,8,15
Argentina (ARG),23,18,24,28,70,18,0,0,0,0,41,18,24,28,70
Armenia (ARM),5,1,2,9,12,6,0,0,0,0,11,1,2,9,12
Australasia (ANZ) [ANZ],2,3,4,5,12,0,0,0,0,0,2,3,4,5,12


### Renaming columns

In [0]:
df.columns

Index(['№ Summer', '01 !', '02 !', '03 !', 'Total', '№ Winter', '01 !.1',
       '02 !.1', '03 !.1', 'Total.1', '№ Games', '01 !.2', '02 !.2', '03 !.2',
       'Combined total'],
      dtype='object')

In [0]:
# Iterate over df.columns and rename with df.rename
for col in df.columns:
    if col[:2] == '01':
        df.rename(columns={col: 'Gold' + col[4:]}, inplace=True)
    if col[:2] == '02':
        df.rename(columns={col: 'Silver' + col[4:]}, inplace=True)
    if col[:2] == '03':
        df.rename(columns={col: 'Bronze' + col[4:]}, inplace=True)
    if col[:1] == '№':
        df.rename(columns={col: '#' + col[2:]}, inplace=True)
df.head()

Unnamed: 0,#Summer,Gold,Silver,Bronze,Total,#Winter,Gold.1,Silver.1,Bronze.1,Total.1,#Games,Gold.2,Silver.2,Bronze.2,Combined total
Afghanistan (AFG),13,0,0,2,2,0,0,0,0,0,13,0,0,2,2
Algeria (ALG),12,5,2,8,15,3,0,0,0,0,15,5,2,8,15
Argentina (ARG),23,18,24,28,70,18,0,0,0,0,41,18,24,28,70
Armenia (ARM),5,1,2,9,12,6,0,0,0,0,11,1,2,9,12
Australasia (ANZ) [ANZ],2,3,4,5,12,0,0,0,0,0,2,3,4,5,12


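## Handling Missing Values
Most of pandas' mathematical and statistical functions ignore NaN.

### Handling missing values when reading a file

read_csv(na_values): additional strings that should be treated as missing values<br>
read_csv(na_filter): if blanks are meaningful in your data set you can turn missing-value detection off, but this is rarely needed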
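
A minimal sketch of both parameters, reading from an in-memory string instead of a file. The CSV text and the marker `missing` are hypothetical and only stand in for a project-specific placeholder.

```python
import io
import pandas as pd

# Hypothetical CSV text: one custom marker and one empty field.
raw = "user,volume\ncheryl,10\nbob,missing\nsue,\n"

default = pd.read_csv(io.StringIO(raw))                       # only the empty field is NaN
custom = pd.read_csv(io.StringIO(raw), na_values=['missing'])  # 'missing' is NaN as well
unfiltered = pd.read_csv(io.StringIO(raw), na_filter=False)    # nothing is converted to NaN

print(default['volume'].isna().sum())
print(custom['volume'].isna().sum())
print(unfiltered['volume'].isna().sum())
```

With `na_filter=False` the empty field is kept as an empty string, which is the behaviour to choose when a blank carries meaning of its own.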

In [0]:
url = 'https://raw.githubusercontent.com/irJERAD/Intro-to-Data-Science-in-Python/950bb9291107265bb66cbde3584ffe52b82ae254/MyNotebooks/log.txt'
df = pd.read_csv(url)
df.head()

Unnamed: 0,time,user,video,playback position,paused,volume
0,1469974424,cheryl,intro.html,5,False,10.0
1,1469974454,cheryl,intro.html,6,,
2,1469974544,cheryl,intro.html,9,,
3,1469974574,cheryl,intro.html,10,,
4,1469977514,bob,intro.html,1,,


### Handling NaN
**Three functions**<br>
bfill, ffill, fillna
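
The three functions differ only in where the replacement value comes from. The sketch below applies them to a short hypothetical Series (`volume` is illustrative and not read from log.txt).

```python
import numpy as np
import pandas as pd

volume = pd.Series([10.0, np.nan, np.nan, 5.0])

forward = volume.ffill()      # carry the last valid value forward
backward = volume.bfill()     # pull the next valid value backward
constant = volume.fillna(0)   # replace every NaN with a fixed value

print(forward.tolist())
print(backward.tolist())
print(constant.tolist())
```

ffill and bfill depend on row order, so for data such as a log it is usually worth sorting by the relevant key (for example a timestamp) before filling.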

In [0]:
# Replace every NaN with 'xxx'
df.fillna('xxx')

Unnamed: 0,time,user,video,playback position,paused,volume
0,1469974424,cheryl,intro.html,5,False,10
1,1469974454,cheryl,intro.html,6,xxx,xxx
2,1469974544,cheryl,intro.html,9,xxx,xxx
3,1469974574,cheryl,intro.html,10,xxx,xxx
4,1469977514,bob,intro.html,1,xxx,xxx
5,1469977544,bob,intro.html,1,xxx,xxx
6,1469977574,bob,intro.html,1,xxx,xxx
7,1469977604,bob,intro.html,1,xxx,xxx
8,1469974604,cheryl,intro.html,11,xxx,xxx
9,1469974694,cheryl,intro.html,14,xxx,xxx


## Other operations

In [0]:
purchase_1 = pd.Series({'Name':'Chris',
                       'Item Purchased':'Dog Food',
                       'Cost':22.50})
purchase_2 = pd.Series({'Name':'kevyn',
                       'Item Purchased':'Kitty Litter',
                       'Cost':2.50})
purchase_3 = pd.Series({'Name':'Vinod',
                       'Item Purchased':'Bird Seed',
                       'Cost':5.00})
df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index = ['Store 1', 'Store 1', 'Store 2'])

Unnamed: 0,Name,Item Purchased,Cost
Store 1,Chris,Dog Food,22.5
Store 1,kevyn,Kitty Litter,2.5
Store 2,Vinod,Bird Seed,5.0


### unique()
Similar to DISTINCT in SQL


In [0]:
print(df.index.unique())
print(df['Name'].unique())

Index(['Store 1', 'Store 2'], dtype='object')
['Chris' 'kevyn' 'Vinod']


# **4. Merge (Joins)**

In [0]:
staff_df = pd.DataFrame([{'Name': 'Kelly', 'Role': 'Director of HR'},
             {'Name': 'Sally', 'Role': 'Course liasion'},
             {'Name': 'James', 'Role': 'Grader'}])
staff_df = staff_df.set_index('Name')

student_df = pd.DataFrame([{'Name': 'James', 'School': 'Business'},
                           {'Name': 'Mike', 'School': 'Law'},
                           {'Name': 'Sally', 'School': 'Engineering'}])
student_df = student_df.set_index('Name')

display(staff_df)
print()
student_df

Unnamed: 0_level_0,Role
Name,Unnamed: 1_level_1
Kelly,Director of HR
Sally,Course liasion
James,Grader





Unnamed: 0_level_0,School
Name,Unnamed: 1_level_1
James,Business
Mike,Law
Sally,Engineering


## Outer join

In [0]:
pd.merge(staff_df, student_df, how='outer', left_index=True, right_index=True)

Unnamed: 0_level_0,Role,School
Name,Unnamed: 1_level_1,Unnamed: 2_level_1
James,Grader,Business
Kelly,Director of HR,
Mike,,Law
Sally,Course liasion,Engineering


## Inner join

In [0]:
pd.merge(staff_df, student_df, how='inner', left_index=True, right_index=True)

Unnamed: 0_level_0,Role,School
Name,Unnamed: 1_level_1,Unnamed: 2_level_1
Sally,Course liasion,Engineering
James,Grader,Business


## Left join

In [0]:
pd.merge(staff_df, student_df, how='left', left_index=True, right_index=True)

Unnamed: 0_level_0,Role,School
Name,Unnamed: 1_level_1,Unnamed: 2_level_1
Kelly,Director of HR,
Sally,Course liasion,Engineering
James,Grader,Business


## Right join

In [0]:
pd.merge(staff_df, student_df, how='right', left_index=True, right_index=True)

Unnamed: 0_level_0,Role,School
Name,Unnamed: 1_level_1,Unnamed: 2_level_1
James,Grader,Business
Mike,,Law
Sally,Course liasion,Engineering


## Joining on column names

In [0]:
staff_df = staff_df.reset_index()
student_df = student_df.reset_index()
display(staff_df)
print()
display(student_df)

new_df = pd.merge(staff_df, student_df, how='outer', left_on='Name', right_on='Name')
new_df

Unnamed: 0,Name,Role
0,Kelly,Director of HR
1,Sally,Course liasion
2,James,Grader





Unnamed: 0,Name,School
0,James,Business
1,Mike,Law
2,Sally,Engineering


Unnamed: 0,Name,Role,School
0,Kelly,Director of HR,
1,Sally,Course liasion,Engineering
2,James,Grader,Business
3,Mike,,Law


## Handling column name conflicts

In [0]:
staff_df = pd.DataFrame([{'Name': 'Kelly', 'Role': 'Director of HR', 'Location': 'State Street'},
                         {'Name': 'Sally', 'Role': 'Course liasion', 'Location': 'Washington Avenue'},
                         {'Name': 'James', 'Role': 'Grader', 'Location': 'Washington Avenue'}])
student_df = pd.DataFrame([{'Name': 'James', 'School': 'Business', 'Location': '1024 Billiard Avenue'},
                           {'Name': 'Mike', 'School': 'Law', 'Location': 'Fraternity House #22'},
                           {'Name': 'Sally', 'School': 'Engineering', 'Location': '512 Wilson Crescent'}])
pd.merge(staff_df, student_df, how='left', left_on='Name', right_on='Name')

Unnamed: 0,Name,Role,Location_x,School,Location_y
0,Kelly,Director of HR,State Street,,
1,Sally,Course liasion,Washington Avenue,Engineering,512 Wilson Crescent
2,James,Grader,Washington Avenue,Business,1024 Billiard Avenue
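
By default pandas resolves the clash by appending `_x` and `_y`. If more descriptive names are wanted, the `suffixes` parameter of `pd.merge` accepts a pair of strings. The sketch below uses a reduced, hypothetical version of the two frames and the made-up suffixes `_staff` and `_student`.

```python
import pandas as pd

staff = pd.DataFrame([{'Name': 'Sally', 'Location': 'Washington Avenue'}])
students = pd.DataFrame([{'Name': 'Sally', 'Location': '512 Wilson Crescent'}])

# on='Name' is shorthand for left_on='Name', right_on='Name'.
merged = pd.merge(staff, students, how='left', on='Name',
                  suffixes=('_staff', '_student'))

print(list(merged.columns))
```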


## Joining on multiple columns

In [0]:
staff_df = pd.DataFrame([{'First Name': 'Kelly', 'Last Name': 'Desjardins', 'Role': 'Director of HR'},
                         {'First Name': 'Sally', 'Last Name': 'Brooks', 'Role': 'Course liasion'},
                         {'First Name': 'James', 'Last Name': 'Wilde', 'Role': 'Grader'}])
student_df = pd.DataFrame([{'First Name': 'James', 'Last Name': 'Hammond', 'School': 'Business'},
                           {'First Name': 'Mike', 'Last Name': 'Smith', 'School': 'Law'},
                           {'First Name': 'Sally', 'Last Name': 'Brooks', 'School': 'Engineering'}])
display(staff_df)
print()
display(student_df)
pd.merge(staff_df, student_df, how='inner', left_on=['First Name','Last Name'], right_on=['First Name','Last Name'])

Unnamed: 0,First Name,Last Name,Role
0,Kelly,Desjardins,Director of HR
1,Sally,Brooks,Course liasion
2,James,Wilde,Grader





Unnamed: 0,First Name,Last Name,School
0,James,Hammond,Business
1,Mike,Smith,Law
2,Sally,Brooks,Engineering


Unnamed: 0,First Name,Last Name,Role,School
0,Sally,Brooks,Course liasion,Engineering


# **5. Group by**

In [0]:
url = 'https://raw.githubusercontent.com/irJERAD/Intro-to-Data-Science-in-Python/master/MyNotebooks/census.csv'
df = pd.read_csv(url)
df = df[df['SUMLEV'] == 50]
df.head()

Unnamed: 0,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,POPESTIMATE2013,POPESTIMATE2014,POPESTIMATE2015,NPOPCHG_2010,NPOPCHG_2011,NPOPCHG_2012,NPOPCHG_2013,NPOPCHG_2014,NPOPCHG_2015,BIRTHS2010,BIRTHS2011,BIRTHS2012,BIRTHS2013,BIRTHS2014,BIRTHS2015,DEATHS2010,DEATHS2011,DEATHS2012,DEATHS2013,DEATHS2014,DEATHS2015,NATURALINC2010,NATURALINC2011,NATURALINC2012,NATURALINC2013,NATURALINC2014,NATURALINC2015,INTERNATIONALMIG2010,...,RESIDUAL2013,RESIDUAL2014,RESIDUAL2015,GQESTIMATESBASE2010,GQESTIMATES2010,GQESTIMATES2011,GQESTIMATES2012,GQESTIMATES2013,GQESTIMATES2014,GQESTIMATES2015,RBIRTH2011,RBIRTH2012,RBIRTH2013,RBIRTH2014,RBIRTH2015,RDEATH2011,RDEATH2012,RDEATH2013,RDEATH2014,RDEATH2015,RNATURALINC2011,RNATURALINC2012,RNATURALINC2013,RNATURALINC2014,RNATURALINC2015,RINTERNATIONALMIG2011,RINTERNATIONALMIG2012,RINTERNATIONALMIG2013,RINTERNATIONALMIG2014,RINTERNATIONALMIG2015,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
1,50,3,6,1,1,Alabama,Autauga County,54571,54571,54660,55253,55175,55038,55290,55347,89,593,-78,-137,252,57,151,636,615,574,623,600,152,507,558,583,504,467,-1,129,57,-9,119,133,33,...,22,-10,45,455,455,455,455,455,455,455,11.572789,11.138479,10.416194,11.293597,10.846281,9.225478,10.106133,10.579514,9.136393,8.442022,2.347311,1.032347,-0.16332,2.157204,2.404259,0.363924,0.289782,0.290347,0.3263,0.343466,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.59227,-2.187333
2,50,3,6,1,3,Alabama,Baldwin County,182265,182265,183193,186659,190396,195126,199713,203709,928,3466,3737,4730,4587,3996,517,2187,2092,2160,2186,2240,532,1825,1879,1902,2044,1992,-15,362,213,258,142,248,69,...,91,434,58,2307,2307,2307,2249,2304,2308,2309,11.826352,11.096524,11.205586,11.072868,11.104997,9.868812,9.966716,9.867141,10.353587,9.875515,1.95754,1.129809,1.338445,0.719281,1.229482,1.011215,0.912334,0.881921,1.073855,1.095627,14.83296,17.647293,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499
3,50,3,6,1,5,Alabama,Barbour County,27457,27457,27341,27226,27159,26973,26815,26489,-116,-115,-67,-186,-158,-326,70,335,300,283,260,269,128,319,291,294,310,309,-58,16,9,-11,-50,-40,2,...,19,-1,-5,3193,3193,3382,3388,3389,3353,3352,12.278483,11.032454,10.455923,9.667584,10.093051,11.692048,10.70148,10.862337,11.526735,11.593877,0.586435,0.330974,-0.406414,-1.859151,-1.500825,-0.146609,-0.257424,-0.11084,-0.074366,0.0,-4.728132,-2.50069,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299
4,50,3,6,1,7,Alabama,Bibb County,22915,22919,22861,22733,22642,22512,22549,22583,-58,-128,-91,-130,37,34,44,266,245,259,247,253,34,278,237,281,211,223,10,-12,8,-22,36,30,2,...,14,-16,-21,2224,2224,2224,2224,2224,2233,2236,11.668202,10.798898,11.471852,10.962917,11.211557,12.194587,10.446281,12.446295,9.365083,9.882124,-0.526385,0.352617,-0.974443,1.597834,1.329434,0.438654,0.705234,0.797272,0.93207,0.930604,-5.527043,-5.068871,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861
5,50,3,6,1,9,Alabama,Blount County,57322,57322,57373,57711,57776,57734,57658,57673,51,338,65,-42,-76,15,183,744,710,646,618,603,133,570,592,585,589,590,50,174,118,61,29,13,5,...,-22,-14,53,489,489,489,489,489,489,489,12.929686,12.295756,11.185179,10.711314,10.456859,9.905808,10.252236,10.128993,10.20868,10.231421,3.023878,2.04352,1.056186,0.502634,0.225438,0.052136,0.329041,0.34629,0.485302,0.485559,1.807375,-1.177622,-1.748766,-2.062535,-1.36997,1.859511,-0.84858,-1.402476,-1.577232,-0.884411


## group by column

In [0]:
%%timeit -n 10
for group, frame in df.groupby('STNAME'):
  avg = np.average(frame['CENSUS2010POP'])
  print('Counties in state ' + group + ' have an average population of ' + str(avg))
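
The loop above computes one average per state. A minimal sketch, assuming a tiny made-up frame in place of the census data, shows the same computation done by the groupby object itself, without an explicit Python loop:

```python
import pandas as pd

# Toy stand-in for the census data; the numbers are invented for illustration.
toy = pd.DataFrame({
    'STNAME': ['Alabama', 'Alabama', 'Wyoming'],
    'CENSUS2010POP': [100, 300, 50],
})

# Loop form: iterate over (group name, sub-frame) pairs.
loop_result = {state: frame['CENSUS2010POP'].mean()
               for state, frame in toy.groupby('STNAME')}

# Vectorized form: select the column on the groupby object, then aggregate.
vectorized = toy.groupby('STNAME')['CENSUS2010POP'].mean()
print(vectorized)
```

Both forms return the same averages; the second returns them as a Series indexed by state name.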

## group by function

In [0]:
df = df.set_index('STNAME')

def fun(item):
  # item is an index label (a state name); bucket it by its first letter
  if item[0] < 'M':
    return 0
  if item[0] < 'Q':
    return 1
  return 2

for group, frame in df.groupby(fun):
  print('There are ' + str(len(frame)) + ' records in group ' + str(group) + ' for processing') 

There are 1177 records in group 0 for processing
There are 1134 records in group 1 for processing
There are 831 records in group 2 for processing
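
When `groupby` receives a function, it calls that function on each index label and groups rows by the returned value. A small sketch with a made-up four-row frame (the helper name `bucket` and the `POP` column are hypothetical) makes the mapping visible:

```python
import pandas as pd

# Toy frame indexed by state name; values are invented for illustration.
toy = pd.DataFrame({'POP': [1, 2, 3, 4]},
                   index=['Alabama', 'Montana', 'Ohio', 'Texas'])

def bucket(state_name):
    # Same rule as in the cell above: split the alphabet into three ranges.
    if state_name[0] < 'M':
        return 0
    if state_name[0] < 'Q':
        return 1
    return 2

# The function is applied to each index label, not to the row contents.
sizes = toy.groupby(bucket).size()
print(sizes)
```

Here `Alabama` falls in group 0, `Montana` and `Ohio` in group 1, and `Texas` in group 2.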


## Operations after groupby

In [0]:
df = pd.read_csv(url)
df = df[df['SUMLEV'] == 50]
df.head()

In [0]:
# Compute the average after groupby
df.groupby('STNAME').agg({'CENSUS2010POP':np.average})

In [0]:
# After groupby, aggregate one specific column
print(type(df.groupby(level=0)['POPESTIMATE2010']))
# df.set_index('STNAME').groupby(level=0)['CENSUS2010POP'].agg({'avg': np.average, 'sum': np.sum})  # deprecated syntax

# Group by STNAME, select the CENSUS2010POP column, then compute each group's average, sum, and non-zero count
df.set_index('STNAME').groupby(level=0)['CENSUS2010POP'].agg(avg=np.average, sum=np.sum, count=np.count_nonzero)

In [0]:
# After groupby, select multiple columns (pass a list; tuple-style selection is no longer supported in recent pandas)
print(type(df.groupby(level=0)[['POPESTIMATE2010','POPESTIMATE2011']]))
# Compute both the average and the sum of POPESTIMATE2010 and POPESTIMATE2011
display((df.set_index('STNAME').groupby(level=0)[['POPESTIMATE2010','POPESTIMATE2011']].agg([np.average, np.sum])).head())
# Choose which aggregation to compute for each column
(df.set_index('STNAME').groupby(level=0).agg(POPESTIMATE2010_AVG=('POPESTIMATE2010',np.average), POPESTIMATE2011_SUM=('POPESTIMATE2011', np.sum))).head()
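
The two aggregation styles in this section produce differently shaped results. The sketch below uses a made-up three-row frame (not the census data) and string aggregation names such as `'mean'`; the output column names `pop2010_avg` and `pop2011_sum` are illustrative.

```python
import pandas as pd

# Toy stand-in for the census data; the numbers are invented for illustration.
toy = pd.DataFrame({
    'STNAME': ['Alabama', 'Alabama', 'Wyoming'],
    'POPESTIMATE2010': [10, 30, 5],
    'POPESTIMATE2011': [20, 40, 6],
})

# Named aggregation: one output column per (input column, function) pair.
summary = toy.groupby('STNAME').agg(
    pop2010_avg=('POPESTIMATE2010', 'mean'),
    pop2011_sum=('POPESTIMATE2011', 'sum'),
)

# List of functions: every function is applied to every selected column,
# giving two-level column labels such as ('POPESTIMATE2010', 'mean').
wide = toy.groupby('STNAME')[['POPESTIMATE2010', 'POPESTIMATE2011']].agg(['mean', 'sum'])

print(summary)
print(wide)
```

Named aggregation gives flat column labels chosen by the caller, while the list form is shorter when every column needs the same set of statistics.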

# **6. Scales**


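## Ordinal and nominal scales
The difference is whether the categorical type is created with `ordered=True` (ordinal) or `ordered=False` (nominal).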

In [0]:
df = pd.DataFrame(['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'D+', 'D'],
          index=['excellent', 'excellent', 'excellent', 'good', 'good', 'good', 'ok', 'ok', 'ok', 'poor', 'poor'])
df.rename(columns={0: 'Grades'}, inplace=True)
df

Unnamed: 0,Grades
excellent,A+
excellent,A
excellent,A-
good,B+
good,B
good,B-
ok,C+
ok,C
ok,C-
poor,D+


In [0]:
df['Grades'].astype('category').head()

excellent    A+
excellent     A
excellent    A-
good         B+
good          B
Name: Grades, dtype: category
Categories (11, object): [D, D+, C-, C, ..., B+, A-, A, A+]

In [0]:
from pandas.api.types import CategoricalDtype
grades = df['Grades'].astype(CategoricalDtype(categories=['D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+'],ordered=True))
grades.head()

excellent    A+
excellent     A
excellent    A-
good         B+
good          B
Name: Grades, dtype: category
Categories (11, object): [D < D+ < C- < C ... B+ < A- < A < A+]

In [0]:
grades > 'C'

excellent     True
excellent     True
excellent     True
good          True
good          True
good          True
ok            True
ok           False
ok           False
poor         False
poor         False
Name: Grades, dtype: bool
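
The comparison above works only because the categories are ordered. A minimal sketch with a shortened, made-up grade list shows what ordering enables (comparison, `min`/`max`, sorting by category order) and that the same comparison on an unordered categorical raises `TypeError`:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Shortened grade scale for illustration, listed from lowest to highest.
grade_type = CategoricalDtype(categories=['D', 'C', 'B', 'A'], ordered=True)
grades = pd.Series(['B', 'A', 'D', 'C']).astype(grade_type)

above_c = grades[grades > 'C'].tolist()      # boolean mask from an ordered comparison
lowest, highest = grades.min(), grades.max()
in_order = grades.sort_values().tolist()     # sorts by category order, not alphabetically

# A nominal (unordered) categorical supports equality checks but not '<' or '>'.
nominal = pd.Series(['B', 'A', 'D', 'C']).astype('category')
try:
    nominal > 'C'
    nominal_supports_order = True
except TypeError:
    nominal_supports_order = False

print(above_c, lowest, highest, in_order, nominal_supports_order)
```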

# **7. Dates and times**
- Timestamp
- DatetimeIndex
- Period
- PeriodIndex

## Timestamp
`pd.Timestamp` is a subclass of Python's `datetime.datetime`.

In [0]:
now=pd.Timestamp.now()
display(now)

now_shanghai = now.tz_localize("Asia/Shanghai")
now_shanghai

Timestamp('2019-10-28 09:36:04.745839')

Timestamp('2019-10-28 09:36:04.745839+0800', tz='Asia/Shanghai')
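
`tz_localize` attaches a time zone to a naive timestamp without changing the wall-clock reading, whereas `tz_convert` re-expresses an already time-zone-aware timestamp in another zone. A small sketch, using a fixed timestamp instead of `Timestamp.now()` so the result is reproducible:

```python
import pandas as pd

naive = pd.Timestamp('2019-10-28 09:36:04')

# Declare that the wall-clock time is Shanghai time (UTC+8).
shanghai = naive.tz_localize('Asia/Shanghai')

# Re-express the same instant in UTC; the clock reading shifts by 8 hours.
utc = shanghai.tz_convert('UTC')

print(shanghai)
print(utc)
```

The two aware timestamps compare equal because they describe the same instant.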

In [0]:
# This year-month-day format matches the date order commonly used in China
pd.Timestamp('2018-03-16 21:01:34')

Timestamp('2018-03-16 21:01:34')

## DatetimeIndex

In [0]:
t1 = pd.Series(list('abc'), index=[pd.Timestamp('2016-09-01'), pd.Timestamp('2016-09-02'), pd.Timestamp('2016-09-03')])
t1

2016-09-01    a
2016-09-02    b
2016-09-03    c
dtype: object

In [0]:
type(t1.index)

pandas.core.indexes.datetimes.DatetimeIndex

## Period
Represents a span of time<br>
Common `freq` values: Y (year), M (month), D (day), H (hour), T or `min` (minute), S (second)

In [0]:
p = pd.Period('2016-1-11', freq='S')
display(p)
display(p.start_time)
display(p.end_time)

Period('2016-01-11 00:00:00', 'S')

Timestamp('2016-01-11 00:00:00')

Timestamp('2016-01-11 00:00:00.999999999')
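
Because a `Period` covers a whole span, integer arithmetic moves it by whole units of its frequency. A minimal sketch with monthly and daily periods (the dates are arbitrary examples):

```python
import pandas as pd

p = pd.Period('2016-01', freq='M')

# Adding an integer shifts the period by that many months.
next_month = p + 1

# start_time and end_time are the Timestamps that bound the span.
first_instant = p.start_time
last_instant = p.end_time

# With a daily frequency the same arithmetic counts days.
later_day = pd.Period('2016-01-11', freq='D') + 5

print(next_month, first_instant, last_instant, later_day)
```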

## PeriodIndex

In [0]:
t2 = pd.Series(list('def'), index=[pd.Period('2016-09'), pd.Period('2016-10'), pd.Period('2016-11')])
t2

2016-09    d
2016-10    e
2016-11    f
Freq: M, dtype: object

In [0]:
type(t2.index)

pandas.core.indexes.period.PeriodIndex
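
Listing every `Timestamp` or `Period` by hand does not scale. As a sketch, `pd.date_range` and `pd.period_range` build the same two kinds of index from a start value, a count, and a frequency:

```python
import pandas as pd

# Three consecutive days -> DatetimeIndex
days = pd.date_range('2016-09-01', periods=3, freq='D')

# Three consecutive months -> PeriodIndex
months = pd.period_range('2016-09', periods=3, freq='M')

t1 = pd.Series(list('abc'), index=days)
t2 = pd.Series(list('def'), index=months)

print(type(t1.index).__name__, type(t2.index).__name__)
```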

## Converting to Datetime

In [0]:
#demo1
d1 = ['2 June 2013', 'Aug 29, 2014', '2015-06-26', '7/12/16']
t3 = pd.DataFrame(np.random.randint(10, 100, (4,2)), index=d1, columns=list('ab'))
t3

Unnamed: 0,a,b
2 June 2013,76,88
"Aug 29, 2014",75,26
2015-06-26,20,60
7/12/16,75,15


In [0]:
t3.index = pd.to_datetime(t3.index)
t3

Unnamed: 0,a,b
2013-06-02,76,88
2014-08-29,75,26
2015-06-26,20,60
2016-07-12,75,15
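
Some newer pandas releases infer one format from the first element and may raise an error when the remaining strings are written differently, as they are in `d1`. A version-independent sketch, assuming the same four strings, is to parse each string on its own and then build the index:

```python
import pandas as pd

d1 = ['2 June 2013', 'Aug 29, 2014', '2015-06-26', '7/12/16']

# Parse the strings one at a time so each gets its own format inference.
parsed = pd.DatetimeIndex([pd.to_datetime(s) for s in d1])

print(parsed)
```

This is slower than a single vectorized call, so it is best kept for short, irregular inputs like this one.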


In [0]:
#demo2
pd.to_datetime('4.7.12', dayfirst=True)

Timestamp('2012-07-04 00:00:00')
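
Ambiguous strings such as `04/07/2012` are read month-first by default, which is why `dayfirst=True` matters for European-style dates. The sketch below contrasts the two readings and also shows an explicit `format` string, which removes the guesswork entirely (the example strings are arbitrary):

```python
import pandas as pd

# Default: month first -> 7 April 2012
month_first = pd.to_datetime('04/07/2012')

# dayfirst=True: day first -> 4 July 2012
day_first = pd.to_datetime('04/07/2012', dayfirst=True)

# An explicit format leaves nothing to inference.
explicit = pd.to_datetime('4.7.12', format='%d.%m.%y')

print(month_first, day_first, explicit)
```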