### Excel Tasks Demonstrated in Pandas - Part 2
Why - This article focuses on some common selection and filtering tasks and shows how to do the same things in pandas.

What - 

In [24]:
import pandas as pd
import numpy as np

In [25]:
df = pd.read_excel("data/df-sample-sales3.xlsx")
df.head()

Unnamed: 0,account number,name,sku,quantity,unit price,ext price,date
0,740150,Barton LLC,B1-20000,39,86.69,3380.91,2014-01-01 07:21:51
1,714466,Trantow-Barrows,S2-77896,-1,63.16,-63.16,2014-01-01 10:00:47
2,218895,Kulas Inc,B1-69924,23,90.7,2086.1,2014-01-01 13:24:58
3,307599,"Kassulke, Ondricka and Metz",S1-65481,41,21.05,863.05,2014-01-01 15:05:22
4,412290,Jerde-Hilpert,S2-34077,6,83.21,499.26,2014-01-01 23:26:55


Notice that the date column is shown as a generic object. We will convert it to a datetime object to make some of the later selections easier.

In [26]:
df.dtypes

account number      int64
name               object
sku                object
quantity            int64
unit price        float64
ext price         float64
date               object
dtype: object

In [43]:
df['date'] = pd.to_datetime(df['date'])
df.head(3)

Unnamed: 0,account number,name,sku,quantity,unit price,ext price,date
0,740150,Barton LLC,B1-20000,39,86.69,3380.91,2014-01-01 07:21:51
1,714466,Trantow-Barrows,S2-77896,-1,63.16,-63.16,2014-01-01 10:00:47
2,218895,Kulas Inc,B1-69924,23,90.7,2086.1,2014-01-01 13:24:58
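
As a minimal sketch of what the conversion does, the snippet below rebuilds two rows from the output above as a small hypothetical frame (not the real spreadsheet) and checks the column type before and after `pd.to_datetime`. Once converted, the column exposes the `.dt` accessor for date parts.

```python
import pandas as pd

# Two rows copied from the sample output, with dates stored as plain strings.
# This small frame is a stand-in for the spreadsheet, for illustration only.
df = pd.DataFrame({
    "name": ["Barton LLC", "Trantow-Barrows"],
    "date": ["2014-01-01 07:21:51", "2014-01-01 10:00:47"],
})

was_datetime = pd.api.types.is_datetime64_any_dtype(df["date"])  # False: still strings
df["date"] = pd.to_datetime(df["date"])
is_datetime = pd.api.types.is_datetime64_any_dtype(df["date"])   # True after conversion

# Datetime columns expose the .dt accessor for date parts such as the hour.
hours = df["date"].dt.hour.tolist()
print(was_datetime, is_datetime, hours)
```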


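### Filtering the Data
I think one of the handiest features in Excel is the filter. I imagine that almost any time someone gets an Excel file of any size and wants to filter the data, they use this function.

Similar to the Filter function in Excel, you can use pandas to filter and select certain subsets of data. For instance, if we only want to see one specific account number, we can easily do that with Excel or with pandas.

The Excel filter example is also relatively simple to do in pandas. Note that I am going to use the head function to show the top results. This is purely to keep the article short.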

In [44]:
df[df["account number"]==307599].head(3)

Unnamed: 0,account number,name,sku,quantity,unit price,ext price,date
3,307599,"Kassulke, Ondricka and Metz",S1-65481,41,21.05,863.05,2014-01-01 15:05:22
13,307599,"Kassulke, Ondricka and Metz",S2-10342,17,12.44,211.48,2014-01-04 07:53:01
34,307599,"Kassulke, Ondricka and Metz",S2-78676,35,33.04,1156.4,2014-01-10 05:26:31


In [45]:
df[df["quantity"] > 22].head(3)

Unnamed: 0,account number,name,sku,quantity,unit price,ext price,date
0,740150,Barton LLC,B1-20000,39,86.69,3380.91,2014-01-01 07:21:51
2,218895,Kulas Inc,B1-69924,23,90.7,2086.1,2014-01-01 13:24:58
3,307599,"Kassulke, Ondricka and Metz",S1-65481,41,21.05,863.05,2014-01-01 15:05:22
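
To make the mechanics of these filters a little more visible, here is a small sketch on a hypothetical four-row frame built from the sample rows. The comparison inside the brackets produces a boolean Series, and indexing with that Series keeps only the rows marked `True`.

```python
import pandas as pd

# Hypothetical mini-frame using values from the sample output.
df = pd.DataFrame({
    "account number": [740150, 714466, 218895, 307599],
    "name": ["Barton LLC", "Trantow-Barrows", "Kulas Inc",
             "Kassulke, Ondricka and Metz"],
    "quantity": [39, -1, 23, 41],
})

# The comparison yields a boolean Series aligned with the DataFrame's index.
mask = df["quantity"] > 22

# Indexing with the mask keeps only the rows where it is True.
filtered = df[mask]

print(mask.tolist())
print(filtered["name"].tolist())
```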


If we want to do more complex filtering, we can use map to filter on various criteria. In this example, let's look for items whose SKUs start with B1.

In [46]:
df[df["sku"].map(lambda x: x.startswith('B1'))].head(3)

Unnamed: 0,account number,name,sku,quantity,unit price,ext price,date
0,740150,Barton LLC,B1-20000,39,86.69,3380.91,2014-01-01 07:21:51
2,218895,Kulas Inc,B1-69924,23,90.7,2086.1,2014-01-01 13:24:58
6,218895,Kulas Inc,B1-65551,2,31.1,62.2,2014-01-02 10:57:23


In [47]:
df[df["sku"].map(lambda x: x.startswith('B1')) & (df["quantity"] > 22)].head(3)

Unnamed: 0,account number,name,sku,quantity,unit price,ext price,date
0,740150,Barton LLC,B1-20000,39,86.69,3380.91,2014-01-01 07:21:51
2,218895,Kulas Inc,B1-69924,23,90.7,2086.1,2014-01-01 13:24:58
14,737550,"Fritsch, Russel and Anderson",B1-53102,23,71.56,1645.88,2014-01-04 08:57:48
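
The sketch below, on a hypothetical four-row frame, compares the `map` plus `lambda` approach with the vectorized `str.startswith` method, which should give the same mask for plain string columns. It also shows why each condition gets its own parentheses when combined with `&`.

```python
import pandas as pd

# Hypothetical mini-frame using SKUs and quantities from the sample output.
df = pd.DataFrame({
    "sku": ["B1-20000", "S2-77896", "B1-69924", "B1-65551"],
    "quantity": [39, -1, 23, 2],
})

# Element-wise Python function, as in the cells above.
by_map = df["sku"].map(lambda x: x.startswith("B1"))

# Vectorized string method; it produces the same mask for this data.
by_str = df["sku"].str.startswith("B1")

# & binds more tightly than >, so the comparison must be parenthesized.
both = df[by_str & (df["quantity"] > 22)]

print(by_map.tolist(), by_str.tolist())
print(both)
```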


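Another useful function that pandas supports is called isin. It allows us to define a list of values we want to look for.

In this case, we look for all records that include one of two specific account numbers.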

In [48]:
df[df["account number"].isin([714466,218895])].head(3)

Unnamed: 0,account number,name,sku,quantity,unit price,ext price,date
1,714466,Trantow-Barrows,S2-77896,-1,63.16,-63.16,2014-01-01 10:00:47
2,218895,Kulas Inc,B1-69924,23,90.7,2086.1,2014-01-01 13:24:58
5,714466,Trantow-Barrows,S2-77896,17,87.63,1489.71,2014-01-02 10:07:15
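
A small sketch of `isin` on a hypothetical four-row frame follows. It also shows the `~` operator, which inverts the mask to select everything except the listed accounts. The `accounts` variable name is only illustrative.

```python
import pandas as pd

# Hypothetical mini-frame using account numbers from the sample output.
df = pd.DataFrame({
    "account number": [740150, 714466, 218895, 714466],
    "name": ["Barton LLC", "Trantow-Barrows", "Kulas Inc", "Trantow-Barrows"],
})

accounts = [714466, 218895]

# Rows whose account number appears in the list.
selected = df[df["account number"].isin(accounts)]

# ~ negates the boolean mask, keeping rows NOT in the list.
excluded = df[~df["account number"].isin(accounts)]

print(selected)
print(excluded)
```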


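pandas supports another function called query, which lets you efficiently select subsets of data. The original instructions call for numexpr to be installed before trying this step; as far as I know, recent pandas versions fall back to a slower pure-Python engine when numexpr is missing.

If you want to get a list of customers by name, you can do that with a query, similar to the Python syntax shown above.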

In [49]:
df.query('name == ["Kulas Inc","Barton LLC"]').head(3)

Unnamed: 0,account number,name,sku,quantity,unit price,ext price,date
0,740150,Barton LLC,B1-20000,39,86.69,3380.91,2014-01-01 07:21:51
2,218895,Kulas Inc,B1-69924,23,90.7,2086.1,2014-01-01 13:24:58
6,218895,Kulas Inc,B1-65551,2,31.1,62.2,2014-01-02 10:57:23
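
Two query features are worth a quick sketch, assuming a reasonably recent pandas (backtick quoting appeared around version 0.25): `@` refers to a local Python variable, and backticks let you reference column names that contain spaces, such as `account number`. The frame and the `wanted` variable below are hypothetical.

```python
import pandas as pd

# Hypothetical mini-frame using values from the sample output.
df = pd.DataFrame({
    "account number": [740150, 714466, 218895],
    "name": ["Barton LLC", "Trantow-Barrows", "Kulas Inc"],
    "quantity": [39, -1, 23],
})

# @wanted pulls the list in from the surrounding Python scope.
wanted = ["Kulas Inc", "Barton LLC"]
by_name = df.query("name in @wanted")

# Backticks quote a column name that contains a space.
by_account = df.query("`account number` == 218895 and quantity > 22")

print(by_name)
print(by_account)
```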


### Working with Dates

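With pandas, you can do complex filtering on dates. Before doing anything with dates, I encourage you to sort by the date column to make sure the results return what you expect.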

In [37]:
df.dtypes

account number             int64
name                      object
sku                       object
quantity                   int64
unit price               float64
ext price                float64
date              datetime64[ns]
dtype: object

In [34]:
df = df.sort_values(by=['date'])
df.head()

Unnamed: 0,account number,name,sku,quantity,unit price,ext price,date
0,740150,Barton LLC,B1-20000,39,86.69,3380.91,2014-01-01 07:21:51
1,714466,Trantow-Barrows,S2-77896,-1,63.16,-63.16,2014-01-01 10:00:47
2,218895,Kulas Inc,B1-69924,23,90.7,2086.1,2014-01-01 13:24:58
3,307599,"Kassulke, Ondricka and Metz",S1-65481,41,21.05,863.05,2014-01-01 15:05:22
4,412290,Jerde-Hilpert,S2-34077,6,83.21,499.26,2014-01-01 23:26:55


The Python filtering syntax shown earlier also works for dates.

In [40]:
df[df['date'] >='2014-03'].head(3)

Unnamed: 0,account number,name,sku,quantity,unit price,ext price,date
242,163416,Purdy-Kunde,S1-30248,19,65.03,1235.57,2014-03-01 16:07:40
243,527099,Sanford and Sons,S2-82423,3,76.21,228.63,2014-03-01 17:18:01
244,527099,Sanford and Sons,B1-50809,8,70.78,566.24,2014-03-01 18:53:09


In [41]:
df[(df['date'] >='20140701') & (df['date'] <= '20140715')].head(3)

Unnamed: 0,account number,name,sku,quantity,unit price,ext price,date
778,737550,"Fritsch, Russel and Anderson",S1-65481,35,70.51,2467.85,2014-07-01 00:21:58
779,218895,Kulas Inc,S1-30248,9,16.56,149.04,2014-07-01 00:52:38
780,163416,Purdy-Kunde,S2-82423,44,68.27,3003.88,2014-07-01 08:15:52
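
One subtlety with a date-only upper bound: a string such as `'20140715'` is parsed as midnight at the start of that day, so rows later on July 15 are left out by `<=`. The sketch below uses a hypothetical four-row frame (only the first timestamp comes from the sample data) to show the difference and one way to keep the whole day.

```python
import pandas as pd

# Hypothetical timestamps; only the first one appears in the sample output.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2014-07-01 00:21:58",
        "2014-07-15 00:00:00",
        "2014-07-15 09:30:00",
        "2014-07-16 08:00:00",
    ]),
})

# '20140715' means 2014-07-15 00:00:00, so 09:30 that day is excluded.
up_to_midnight = df[(df["date"] >= "20140701") & (df["date"] <= "20140715")]

# Comparing against the start of the next day keeps all of July 15.
whole_day = df[(df["date"] >= "20140701") & (df["date"] < "20140716")]

print(up_to_midnight)
print(whole_day)
```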


Because pandas understands date columns, you can express the date value in several formats and still get the results you expect.

In [42]:
df[df['date'] >= 'Oct-2014'].head(3)

Unnamed: 0,account number,name,sku,quantity,unit price,ext price,date
1141,307599,"Kassulke, Ondricka and Metz",B1-50809,25,56.63,1415.75,2014-10-01 10:56:32
1142,737550,"Fritsch, Russel and Anderson",S2-82423,38,45.17,1716.46,2014-10-01 16:17:24
1143,737550,"Fritsch, Russel and Anderson",S1-47412,6,68.68,412.08,2014-10-01 22:28:49


In [51]:
df[df['date'] >= '10-10-2014'].head(3)

Unnamed: 0,account number,name,sku,quantity,unit price,ext price,date
1174,257198,"Cronin, Oberbrunner and Spencer",S2-34077,13,12.24,159.12,2014-10-10 02:59:06
1175,740150,Barton LLC,S1-65481,28,53.0,1484.0,2014-10-10 15:08:53
1176,146832,Kiehn-Spinka,S1-27722,15,64.39,965.85,2014-10-10 18:24:01
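
A string like `'10-10-2014'` is harmless because day and month are the same, but other numeric dates can be ambiguous. As a quick sketch, pandas reads `'10-11-2014'` month-first by default, and `dayfirst=True` flips that interpretation. Writing dates in ISO form (`YYYY-MM-DD`) avoids the question.

```python
import pandas as pd

# Default parsing treats the first number as the month: October 11.
month_first = pd.Timestamp("10-11-2014")

# dayfirst=True reads the same string as 10 November.
day_first = pd.to_datetime("10-11-2014", dayfirst=True)

# ISO format is unambiguous.
iso = pd.Timestamp("2014-10-11")

print(month_first, day_first, iso)
```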


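When working with time series data, converting the data to use the date as the index lets us do some more filtering variations. Use set_index to set the new index.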

In [54]:
df2 = df.set_index(['date'])
df2.head(3)

date,account number,name,sku,quantity,unit price,ext price
2014-01-01 07:21:51,740150,Barton LLC,B1-20000,39,86.69,3380.91
2014-01-01 10:00:47,714466,Trantow-Barrows,S2-77896,-1,63.16,-63.16
2014-01-01 13:24:58,218895,Kulas Inc,B1-69924,23,90.7,2086.1


In [56]:
df2["20140101":"20140201"].head(3)

date,account number,name,sku,quantity,unit price,ext price
2014-01-01 07:21:51,740150,Barton LLC,B1-20000,39,86.69,3380.91
2014-01-01 10:00:47,714466,Trantow-Barrows,S2-77896,-1,63.16,-63.16
2014-01-01 13:24:58,218895,Kulas Inc,B1-69924,23,90.7,2086.1


Once again, we can use various date representations to remove any ambiguity around date naming conventions.

In [57]:
df2["2014-Jan-1":"2014-Feb-1"].head(3)
#df2["2014-Jan-1":"2014-Feb-1"].tail(3)

date,account number,name,sku,quantity,unit price,ext price
2014-01-01 07:21:51,740150,Barton LLC,B1-20000,39,86.69,3380.91
2014-01-01 10:00:47,714466,Trantow-Barrows,S2-77896,-1,63.16,-63.16
2014-01-01 13:24:58,218895,Kulas Inc,B1-69924,23,90.7,2086.1


In [62]:
df2.loc["2014"].head(3)


date,account number,name,sku,quantity,unit price,ext price
2014-01-01 07:21:51,740150,Barton LLC,B1-20000,39,86.69,3380.91
2014-01-01 10:00:47,714466,Trantow-Barrows,S2-77896,-1,63.16,-63.16
2014-01-01 13:24:58,218895,Kulas Inc,B1-69924,23,90.7,2086.1


In [61]:
df2.loc["2014-Dec"].head(3)


date,account number,name,sku,quantity,unit price,ext price
2014-12-01 20:15:34,714466,Trantow-Barrows,S1-82801,3,77.97,233.91
2014-12-02 20:00:04,146832,Kiehn-Spinka,S2-23246,37,57.81,2138.97
2014-12-03 04:43:53,218895,Kulas Inc,S2-77896,30,77.44,2323.2
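
Selecting rows with a partial date string can be sketched with `.loc`, which makes row selection explicit. To my knowledge, plain brackets with a single date string are treated as a column lookup in recent pandas versions, so `.loc` is the safer spelling there. The frame below is hypothetical, with a sorted datetime index. Note that a slice endpoint given at day resolution includes every timestamp on that day.

```python
import pandas as pd

# Hypothetical frame with a sorted DatetimeIndex named "date".
df2 = pd.DataFrame(
    {"quantity": [39, 23, 5, 30]},
    index=pd.to_datetime([
        "2014-01-01 07:21:51",
        "2014-02-01 13:24:58",
        "2014-02-02 09:00:00",
        "2014-12-03 04:43:53",
    ]),
)
df2.index.name = "date"

# A year-month string selects every row in that month.
december = df2.loc["2014-12"]

# The end of the slice covers all of 2014-02-01, including 13:24:58.
jan_to_feb1 = df2.loc["2014-01-01":"2014-02-01"]

print(december)
print(jan_to_feb1)
```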


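### Additional String Functions
pandas also supports vectorized string functions.

If we want to identify all the SKUs that contain a certain value, we can use str.contains. In this case, we know that the SKU is always represented in the same way, so B1 only shows up at the front of the SKU. You need to understand your data to make sure you are getting back what you expect.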

In [63]:
df[df['sku'].str.contains('B1')].head()

Unnamed: 0,account number,name,sku,quantity,unit price,ext price,date
0,740150,Barton LLC,B1-20000,39,86.69,3380.91,2014-01-01 07:21:51
2,218895,Kulas Inc,B1-69924,23,90.7,2086.1,2014-01-01 13:24:58
6,218895,Kulas Inc,B1-65551,2,31.1,62.2,2014-01-02 10:57:23
14,737550,"Fritsch, Russel and Anderson",B1-53102,23,71.56,1645.88,2014-01-04 08:57:48
17,239344,Stokes LLC,B1-50809,14,16.23,227.22,2014-01-04 22:14:32
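
To illustrate why knowing your data matters here, the sketch below adds one hypothetical SKU (`S1-B1000`, not in the real data) where `B1` appears in the middle. `str.contains` matches it, while `str.startswith` does not. Passing `regex=False` treats the pattern as a literal string rather than a regular expression.

```python
import pandas as pd

# The last SKU is made up to show a match that is not at the front.
df = pd.DataFrame({"sku": ["B1-20000", "S2-77896", "S1-B1000"]})

# Matches "B1" anywhere in the string; regex=False means a literal match.
anywhere = df[df["sku"].str.contains("B1", regex=False)]

# Matches only SKUs that begin with "B1".
prefix_only = df[df["sku"].str.startswith("B1")]

print(anywhere["sku"].tolist())
print(prefix_only["sku"].tolist())
```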


We can string queries together and use sort to control how the data is ordered.

In [65]:
df[(df['sku'].str.contains('B1-531'))
   & (df['quantity'] > 40)].sort_values(by=['quantity', 'name'],
                                        ascending=[False, True])

Unnamed: 0,account number,name,sku,quantity,unit price,ext price,date
684,642753,Pollich LLC,B1-53102,46,26.07,1199.22,2014-06-08 19:33:33
792,688981,Keeling LLC,B1-53102,45,41.19,1853.55,2014-07-04 21:42:22
176,383080,Will LLC,B1-53102,45,89.22,4014.9,2014-02-11 04:14:09
1213,604255,"Halvorson, Crona and Champlin",B1-53102,41,55.05,2257.05,2014-10-18 19:27:01
1215,307599,"Kassulke, Ondricka and Metz",B1-53102,41,93.7,3841.7,2014-10-18 23:25:10
1128,714466,Trantow-Barrows,B1-53102,41,55.68,2282.88,2014-09-27 10:42:48
1001,424914,White-Trantow,B1-53102,41,81.25,3331.25,2014-08-26 11:44:30
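
The `ascending` list lines up with the `by` list, one flag per column. As a small sketch on a hypothetical four-row frame, quantity is sorted from high to low and ties are broken by name from A to Z, which is why Keeling LLC comes before Will LLC at quantity 45.

```python
import pandas as pd

# Hypothetical mini-frame using names and quantities from the output above.
df = pd.DataFrame({
    "name": ["Will LLC", "Keeling LLC", "Pollich LLC", "White-Trantow"],
    "quantity": [45, 45, 46, 41],
})

# quantity descending, then name ascending to break ties.
ordered = df.sort_values(by=["quantity", "name"], ascending=[False, True])

print(ordered)
```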


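### Unique Values
I frequently find myself trying to get a list of unique items from a long list in Excel. It is a multi-step process in Excel but fairly simple in pandas. In Excel, one way to do it is with the Advanced Filter; in pandas, call unique on a column.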

In [66]:
df["name"].unique()

array(['Barton LLC', 'Trantow-Barrows', 'Kulas Inc',
       'Kassulke, Ondricka and Metz', 'Jerde-Hilpert', 'Koepp Ltd',
       'Fritsch, Russel and Anderson', 'Kiehn-Spinka', 'Keeling LLC',
       'Frami, Hills and Schmidt', 'Stokes LLC', 'Kuhn-Gusikowski',
       'Herman LLC', 'White-Trantow', 'Sanford and Sons', 'Pollich LLC',
       'Will LLC', 'Cronin, Oberbrunner and Spencer',
       'Halvorson, Crona and Champlin', 'Purdy-Kunde'], dtype=object)
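
Alongside `unique`, two related Series methods are often handy: `nunique` returns the number of distinct values, and `value_counts` returns how often each one occurs, most frequent first. A minimal sketch on a hypothetical Series:

```python
import pandas as pd

# Hypothetical list of customer names with repeats.
names = pd.Series(
    ["Barton LLC", "Kulas Inc", "Barton LLC", "Kulas Inc", "Kulas Inc"],
    name="name",
)

distinct = list(names.unique())   # distinct values in order of first appearance
how_many = names.nunique()        # number of distinct values
counts = names.value_counts()     # occurrences per value, largest first

print(distinct, how_many)
print(counts)
```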

If we want to include the account number, we can use drop_duplicates.

In [67]:
df.drop_duplicates(subset=["account number","name"]).head()

Unnamed: 0,account number,name,sku,quantity,unit price,ext price,date
0,740150,Barton LLC,B1-20000,39,86.69,3380.91,2014-01-01 07:21:51
1,714466,Trantow-Barrows,S2-77896,-1,63.16,-63.16,2014-01-01 10:00:47
2,218895,Kulas Inc,B1-69924,23,90.7,2086.1,2014-01-01 13:24:58
3,307599,"Kassulke, Ondricka and Metz",S1-65481,41,21.05,863.05,2014-01-01 15:05:22
4,412290,Jerde-Hilpert,S2-34077,6,83.21,499.26,2014-01-01 23:26:55


Obviously, we are pulling in more data than we need and getting some useless information, so select only the first and second columns using iloc.

In [68]:
df.drop_duplicates(subset=["account number","name"]).iloc[:,[0,1]]

Unnamed: 0,account number,name
0,740150,Barton LLC
1,714466,Trantow-Barrows
2,218895,Kulas Inc
3,307599,"Kassulke, Ondricka and Metz"
4,412290,Jerde-Hilpert
7,729833,Koepp Ltd
9,737550,"Fritsch, Russel and Anderson"
10,146832,Kiehn-Spinka
11,688981,Keeling LLC
12,786968,"Frami, Hills and Schmidt"
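
An alternative sketch, on a hypothetical three-row frame: select the two columns by label first and then drop duplicates. This gives the same kind of result as the iloc version but does not depend on column positions, which can help if the spreadsheet layout changes.

```python
import pandas as pd

# Hypothetical mini-frame; Barton LLC appears twice with different SKUs.
df = pd.DataFrame({
    "account number": [740150, 714466, 740150],
    "name": ["Barton LLC", "Trantow-Barrows", "Barton LLC"],
    "sku": ["B1-20000", "S2-77896", "S1-65481"],
})

# Pick the columns by name, then keep the first occurrence of each pair.
customers = df[["account number", "name"]].drop_duplicates()

print(customers)
```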
