### Pivot_Table - Quick Data Analysis with Pandas
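Why - Import sales data, run a pivot-table analysis on it, and write the pivot results back out to an Excel file.

What - The goal of this article is to give you a basic understanding of some interactive Python tools and of how you can use them to do fairly complex analysis in a fast, repeatable way. I plan to spend more time on examples like this to show how useful this tool set is, and to keep reminding people that there are better options than Excel for complex data analysis.

Pandas excels at processing large amounts of data and summarizing it in a variety of textual and visual forms. With little effort, pandas can write output to CSV, Excel, HTML, JSON, and more. Before reading on, I recommend reviewing the earlier article on pandas pivot tables and the follow-up article on generating Excel reports from those tables. They explain the data set used here and how pivot tables work.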
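
As a rough illustration of the output formats mentioned above, here is a minimal sketch. The three-row `sales` frame is a made-up stand-in, not the article's data set, and the variable names are only for illustration. Excel output is left out because it needs an extra writer package to be installed.

```python
import pandas as pd

# Hypothetical three-row sample, not the article's sales file
sales = pd.DataFrame({
    "category": ["Belt", "Shirt", "Shoes"],
    "quantity": [1, 9, 12],
})

# With no path argument, each method returns the output as a string
csv_text = sales.to_csv(index=False)          # comma-separated text
html_text = sales.to_html(index=False)        # an HTML <table> element
json_text = sales.to_json(orient="records")   # one JSON object per row

print(csv_text)
print(json_text)
```

Passing a file path as the first argument writes to disk instead of returning a string.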

In [18]:
import pandas as pd
import numpy as np

dt=pd.read_csv("data/df-sample-sales.csv")
dt.head()

Unnamed: 0,Account Number,Account Name,sku,category,quantity,unit price,ext price,date
0,803666,Fritsch-Glover,HX-24728,Belt,1,98.98,98.98,2014-09-28 11:56:02
1,64898,O'Conner Inc,LK-02338,Shirt,9,34.8,313.2,2014-04-24 16:51:22
2,423621,Beatty and Sons,ZC-07383,Shirt,12,60.24,722.88,2014-09-17 17:26:22
3,137865,"Gleason, Bogisich and Franecki",QS-76400,Shirt,5,15.25,76.25,2014-01-30 07:34:02
4,435433,Morissette-Heathcote,RU-25060,Shirt,19,51.83,984.77,2014-08-24 06:18:12


In [19]:
dt.describe()

Unnamed: 0,Account Number,quantity,unit price,ext price
count,1000.0,1000.0,1000.0,1000.0
mean,480941.809,10.565,54.06643,570.17994
std,291330.331287,5.887311,26.068011,443.949007
min,510.0,1.0,10.01,11.13
25%,217002.75,5.0,31.1875,203.765
50%,461305.0,11.0,53.24,456.34
75%,734587.0,16.0,75.1,849.1075
max,998940.0,20.0,100.0,1958.6


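The describe command alone already tells us some fairly useful things:

- Customers buy about 10.57 items per transaction on average (the mean of quantity).
- The average transaction total is $570.18 (the mean of ext price).
- The minimum and maximum of each column show the range of the data.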
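
To pull individual figures like these out of the summary rather than reading them off the printed table, something along these lines should work. The frame below reuses only the five rows shown by `dt.head()` above, so the numbers differ from the full 1000-row summary; `orders`, `avg_quantity`, and `price_range` are names chosen for this sketch.

```python
import pandas as pd

# The five rows displayed by dt.head(), two columns only
orders = pd.DataFrame({
    "quantity": [1, 9, 12, 5, 19],
    "ext price": [98.98, 313.2, 722.88, 76.25, 984.77],
})

summary = orders.describe()

# describe() returns a DataFrame indexed by statistic name,
# so single values can be looked up with .loc[row, column]
avg_quantity = summary.loc["mean", "quantity"]
price_range = summary.loc["max", "ext price"] - summary.loc["min", "ext price"]

print(avg_quantity)
print(round(price_range, 2))
```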

In [20]:
report = dt.pivot_table(index=['Account Name'],  # rows
                        columns=['category'],    # columns
                        values=['quantity'],     # values to aggregate
                        fill_value=0,            # replace NaN with 0
                        aggfunc=np.sum)          # aggregate the values by summing
report.head(10)

Unnamed: 0_level_0,quantity,quantity,quantity
category,Belt,Shirt,Shoes
Account Name,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
Abbott PLC,0,0,19
"Abbott, Rogahn and Bednar",0,18,0
Abshire LLC,0,18,2
"Altenwerth, Stokes and Paucek",0,13,0
Ankunding-McCullough,0,2,0
"Armstrong, Champlin and Ratke",7,36,0
"Armstrong, McKenzie and Greenholt",0,0,4
Armstrong-Williamson,19,0,0
Aufderhar and Sons,0,0,2
Aufderhar-O'Hara,0,0,11


In [21]:
report = dt.pivot_table(index=['Account Name'],
                           columns=['category'], 
                           values=['ext price','quantity'],
                           fill_value=0,
                           aggfunc=np.sum)
report.head()

Unnamed: 0_level_0,ext price,ext price,ext price,quantity,quantity,quantity
category,Belt,Shirt,Shoes,Belt,Shirt,Shoes
Account Name,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
Abbott PLC,0.0,0.0,755.44,0,0,19
"Abbott, Rogahn and Bednar",0.0,615.6,0.0,0,18,0
Abshire LLC,0.0,720.18,90.34,0,18,2
"Altenwerth, Stokes and Paucek",0.0,843.31,0.0,0,13,0
Ankunding-McCullough,0.0,132.3,0.0,0,2,0


In [22]:
report.to_excel('data/df-sample-sales-out.xlsx', sheet_name='Sheet1')

### Pandas Pivot Table Explained

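Why - Most people probably have experience with pivot tables in Excel. Pandas provides a similar function called (appropriately enough) pivot_table. While it is very useful, I often find it hard to remember the syntax needed to format the output the way I want. This article focuses on explaining the pandas pivot_table function and how to use it for your data analysis.

What - One of the challenges with using pandas pivot_table is making sure you understand your data and what questions you are trying to answer with the pivot table. It is a seemingly simple function, but it can produce very powerful analysis very quickly. In this scenario, I am going to track a sales pipeline (also called a funnel). The basic problem is that some sales cycles are very long (think enterprise software, capital equipment, and so on), and management wants to understand them in more detail throughout the year.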

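Typical questions include:
- How much revenue is in the pipeline?
- Which products are in the pipeline?
- Who has what products at what stage?
- How likely are we to close deals by year end?

Many companies have CRM tools or other software that the sales team uses to track this process. While those may offer useful tools for analyzing the data, inevitably someone exports the data to Excel and uses a PivotTable to summarize it.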

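Using a pandas pivot table can be a good alternative because it is:
- Quicker (once it is set up)
- Self-documenting (look at the code and you know what it does)
- Easy to use for generating a report or email
- More flexible, because you can define custom aggregation functions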
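
To make the last point concrete, here is a small sketch of a custom aggregation function. The `deals` frame and the `price_spread` helper are hypothetical and only illustrate the idea; they are not part of the funnel data set.

```python
import pandas as pd

# Hypothetical deals, four rows for two reps
deals = pd.DataFrame({
    "Rep": ["Craig Booker", "Craig Booker", "Daniel Hilton", "Daniel Hilton"],
    "Price": [30000, 10000, 65000, 40000],
})

def price_spread(prices):
    """Custom aggregation: gap between the largest and smallest deal."""
    return prices.max() - prices.min()

# Any callable that reduces a group of values to one value can be an aggfunc
spread = pd.pivot_table(deals, index=["Rep"], values=["Price"],
                        aggfunc=price_spread)
print(spread)
```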

- https://pbpython.com/pandas-pivot-table-explained.html

In [1]:
import pandas as pd
import numpy as np
df = pd.read_excel("data/funnel.xlsx")
df.head(2)

Unnamed: 0,Account,Name,Rep,Manager,Product,Quantity,Price,Status
0,714466,Trantow-Barrows,Craig Booker,Debra Henley,CPU,1,30000,presented
1,714466,Trantow-Barrows,Craig Booker,Debra Henley,Software,1,10000,presented


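For convenience, let's define the Status column as a category and set the order we want to view it in.
This isn't strictly required, but it helps us keep the order we want as we work through the analysis.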

In [2]:
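df["Status"] = df["Status"].astype("category")
# set_categories returns a new Series; assign it back rather than using the
# deprecated inplace=True argument, which is not available in pandas 2.0 and later
df["Status"] = df["Status"].cat.set_categories(["won","pending","presented","declined"])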


In [3]:
df.dtypes

Account        int64
Name          object
Rep           object
Manager       object
Product       object
Quantity       int64
Price          int64
Status      category
dtype: object
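
To see what the category order buys us, compare sorting the same labels as plain text and as a category. This is a minimal sketch on a made-up four-element Series; only the category list is taken from the cell above.

```python
import pandas as pd

# Hypothetical status labels in arbitrary order
status = pd.Series(["presented", "won", "declined", "pending"])

# Plain strings sort alphabetically
as_text = status.sort_values().tolist()

# A categorical sorts in the order its categories were declared
ordered = status.astype("category").cat.set_categories(
    ["won", "pending", "presented", "declined"])
as_category = ordered.sort_values().tolist()

print(as_text)      # ['declined', 'pending', 'presented', 'won']
print(as_category)  # ['won', 'pending', 'presented', 'declined']
```

The same ordering carries over when Status is used as a pivot table index, which is why the rows in the later tables run from won to declined.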

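### Pivot the data
As we build up the pivot table, I think it is easiest to take it one step at a time. Add items and check each step to verify you are getting the results you expect. Don't be afraid to play with the order and the variables to see what presentation makes the most sense for your needs.

The simplest pivot table must have a DataFrame and an index. In this case, let's use Name as our index.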

In [4]:
df.head()

Unnamed: 0,Account,Name,Rep,Manager,Product,Quantity,Price,Status
0,714466,Trantow-Barrows,Craig Booker,Debra Henley,CPU,1,30000,presented
1,714466,Trantow-Barrows,Craig Booker,Debra Henley,Software,1,10000,presented
2,714466,Trantow-Barrows,Craig Booker,Debra Henley,Maintenance,2,5000,pending
3,737550,"Fritsch, Russel and Anderson",Craig Booker,Debra Henley,CPU,1,35000,declined
4,146832,Kiehn-Spinka,Daniel Hilton,Debra Henley,CPU,2,65000,won


In [5]:
pd.pivot_table(df,index=["Name"])  # numeric columns (Account, Price, Quantity) are averaged by default

Unnamed: 0_level_0,Account,Price,Quantity
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Barton LLC,740150,35000,1.0
"Fritsch, Russel and Anderson",737550,35000,1.0
Herman LLC,141962,65000,2.0
Jerde-Hilpert,412290,5000,2.0
"Kassulke, Ondricka and Metz",307599,7000,3.0
Keeling LLC,688981,100000,5.0
Kiehn-Spinka,146832,65000,2.0
Koepp Ltd,729833,35000,2.0
Kulas Inc,218895,25000,1.5
Purdy-Kunde,163416,30000,1.0


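The index can hold more than one column. When Manager and Rep are both in the index, the pivot table is smart enough to aggregate the data and summarize it by grouping the reps under their managers. Now we start to get a glimpse of what a pivot table can do for us. For this purpose, the Account and Quantity columns are not really useful. Let's drop them by explicitly listing the columns we care about in the values parameter.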

In [6]:
pd.pivot_table(df,index=["Manager","Rep","Name"],values=["Price"])  # default aggfunc is the mean

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Price
Manager,Rep,Name,Unnamed: 3_level_1
Debra Henley,Craig Booker,"Fritsch, Russel and Anderson",35000
Debra Henley,Craig Booker,Trantow-Barrows,15000
Debra Henley,Daniel Hilton,Kiehn-Spinka,65000
Debra Henley,Daniel Hilton,Kulas Inc,25000
Debra Henley,John Smith,Barton LLC,35000
Debra Henley,John Smith,Jerde-Hilpert,5000
Fred Anderson,Cedric Moss,Herman LLC,65000
Fred Anderson,Cedric Moss,Purdy-Kunde,30000
Fred Anderson,Cedric Moss,Stokes LLC,7500
Fred Anderson,Wendy Yule,"Kassulke, Ondricka and Metz",7000


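The Price column is averaged by default, but we can do a count or a sum instead. Adding the values up is simple with aggfunc and np.sum. aggfunc can also take a list of functions, so let's try numpy's mean function together with len to get both a mean and a count.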

In [7]:
pd.pivot_table(df,index=["Manager","Rep"],values=["Price"],aggfunc=np.sum)

Unnamed: 0_level_0,Unnamed: 1_level_0,Price
Manager,Rep,Unnamed: 2_level_1
Debra Henley,Craig Booker,80000
Debra Henley,Daniel Hilton,115000
Debra Henley,John Smith,40000
Fred Anderson,Cedric Moss,110000
Fred Anderson,Wendy Yule,177000


In [8]:
pd.pivot_table(df,index=["Manager","Rep"],values=["Price"],aggfunc=[np.mean,len])

Unnamed: 0_level_0,Unnamed: 1_level_0,mean,len
Unnamed: 0_level_1,Unnamed: 1_level_1,Price,Price
Manager,Rep,Unnamed: 2_level_2,Unnamed: 3_level_2
Debra Henley,Craig Booker,20000.0,4
Debra Henley,Daniel Hilton,38333.333333,3
Debra Henley,John Smith,20000.0,2
Fred Anderson,Cedric Moss,27500.0,4
Fred Anderson,Wendy Yule,44250.0,4
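
It may help to think of an index-only pivot table as a groupby followed by an aggregation. The sketch below compares the two on a made-up four-row `deals` frame (not the funnel file); the string `"sum"` is used in place of `np.sum`, which pandas also accepts.

```python
import pandas as pd

# Hypothetical deals, not the funnel data set
deals = pd.DataFrame({
    "Manager": ["Debra Henley", "Debra Henley", "Debra Henley", "Fred Anderson"],
    "Rep": ["Craig Booker", "Craig Booker", "Daniel Hilton", "Cedric Moss"],
    "Price": [30000, 10000, 65000, 30000],
})

# Pivot table with an index and one value column
pivoted = pd.pivot_table(deals, index=["Manager", "Rep"],
                         values=["Price"], aggfunc="sum")

# The same totals via groupby
grouped = deals.groupby(["Manager", "Rep"])[["Price"]].sum()

print(pivoted)
print(grouped)
```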


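If we want to see sales broken down by product, the columns parameter lets us define one or more columns.

Columns vs. values
I think one of the confusing points with pivot_table is the use of columns and values. Remember that columns are optional: they provide an additional way to segment the actual values you care about. The aggregation functions are applied to the values you list.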

In [9]:
pd.pivot_table(df,index=["Manager","Rep"],
               columns=["Product"],
               values=["Price"],
               aggfunc=[np.sum],fill_value=0)

Unnamed: 0_level_0,Unnamed: 1_level_0,sum,sum,sum,sum
Unnamed: 0_level_1,Unnamed: 1_level_1,Price,Price,Price,Price
Unnamed: 0_level_2,Product,CPU,Maintenance,Monitor,Software
Manager,Rep,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3
Debra Henley,Craig Booker,65000,5000,0,10000
Debra Henley,Daniel Hilton,105000,0,0,10000
Debra Henley,John Smith,35000,5000,0,0
Fred Anderson,Cedric Moss,95000,5000,0,10000
Fred Anderson,Wendy Yule,165000,7000,5000,0
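
One way to convince yourself that columns only segments the values is to check that the per-product cells add back up to the totals you get without columns. A minimal sketch on a made-up three-row frame (the `deals` data and variable names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical deals, not the funnel data set
deals = pd.DataFrame({
    "Rep": ["Craig Booker", "Craig Booker", "Daniel Hilton"],
    "Product": ["CPU", "Software", "CPU"],
    "Price": [30000, 10000, 65000],
})

# Without columns: one total per rep
no_columns = pd.pivot_table(deals, index=["Rep"], values=["Price"],
                            aggfunc="sum")

# With columns: the same totals, split by product
by_product = pd.pivot_table(deals, index=["Rep"], columns=["Product"],
                            values=["Price"], aggfunc="sum", fill_value=0)

print(no_columns)
print(by_product)
print(by_product.sum(axis=1))  # row sums match the totals above
```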


I think it would be useful to add the quantity as well. Add Quantity to the values list.

In [10]:
pd.pivot_table(df,index=["Manager","Rep"],
               values=["Price","Quantity"],
               columns=["Product"],
               aggfunc=[np.sum],fill_value=0)

Unnamed: 0_level_0,Unnamed: 1_level_0,sum,sum,sum,sum,sum,sum,sum,sum
Unnamed: 0_level_1,Unnamed: 1_level_1,Price,Price,Price,Price,Quantity,Quantity,Quantity,Quantity
Unnamed: 0_level_2,Product,CPU,Maintenance,Monitor,Software,CPU,Maintenance,Monitor,Software
Manager,Rep,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3
Debra Henley,Craig Booker,65000,5000,0,10000,2,2,0,1
Debra Henley,Daniel Hilton,105000,0,0,10000,4,0,0,1
Debra Henley,John Smith,35000,5000,0,0,1,2,0,0
Fred Anderson,Cedric Moss,95000,5000,0,10000,3,1,0,1
Fred Anderson,Wendy Yule,165000,7000,5000,0,7,3,2,0


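What's interesting is that you can move items into the index to get a different visual representation. Remove Product from the columns and add it to the index.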

In [11]:
pd.pivot_table(df,index=["Manager","Rep","Product"],
               values=["Price","Quantity"],
               aggfunc=[np.sum],fill_value=0)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,sum,sum
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Price,Quantity
Manager,Rep,Product,Unnamed: 3_level_2,Unnamed: 4_level_2
Debra Henley,Craig Booker,CPU,65000,2
Debra Henley,Craig Booker,Maintenance,5000,2
Debra Henley,Craig Booker,Software,10000,1
Debra Henley,Daniel Hilton,CPU,105000,4
Debra Henley,Daniel Hilton,Software,10000,1
Debra Henley,John Smith,CPU,35000,1
Debra Henley,John Smith,Maintenance,5000,2
Fred Anderson,Cedric Moss,CPU,95000,3
Fred Anderson,Cedric Moss,Maintenance,5000,1
Fred Anderson,Cedric Moss,Software,10000,1


For this data set, this representation makes more sense. Now, what if I want to see some totals? margins=True does that for us.

In [12]:
pd.pivot_table(df,index=["Manager","Rep","Product"],
               values=["Price","Quantity"],
               aggfunc=[np.sum,np.mean],fill_value=0,
               margins=True)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,sum,sum,mean,mean
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Price,Quantity,Price,Quantity
Manager,Rep,Product,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
Debra Henley,Craig Booker,CPU,65000,2,32500.0,1.0
Debra Henley,Craig Booker,Maintenance,5000,2,5000.0,2.0
Debra Henley,Craig Booker,Software,10000,1,10000.0,1.0
Debra Henley,Daniel Hilton,CPU,105000,4,52500.0,2.0
Debra Henley,Daniel Hilton,Software,10000,1,10000.0,1.0
Debra Henley,John Smith,CPU,35000,1,35000.0,1.0
Debra Henley,John Smith,Maintenance,5000,2,5000.0,2.0
Fred Anderson,Cedric Moss,CPU,95000,3,47500.0,1.5
Fred Anderson,Cedric Moss,Maintenance,5000,1,5000.0,1.0
Fred Anderson,Cedric Moss,Software,10000,1,10000.0,1.0
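
The totals row sits at the bottom of the result, so it is easy to miss in a truncated display. Here is a small sketch on a made-up three-row frame (the `deals` data is hypothetical) that shows where it ends up:

```python
import pandas as pd

# Hypothetical deals, not the funnel data set
deals = pd.DataFrame({
    "Rep": ["Craig Booker", "Craig Booker", "Daniel Hilton"],
    "Price": [30000, 10000, 65000],
    "Quantity": [1, 1, 2],
})

# margins=True appends a summary row, labelled "All" by default
totals = pd.pivot_table(deals, index=["Rep"], values=["Price", "Quantity"],
                        aggfunc="sum", margins=True)

print(totals)
print(totals.loc["All"])  # grand totals across all reps
```

The label can be changed with the `margins_name` parameter if "All" clashes with a real value in the index.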


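A really handy feature is the ability to pass a dictionary to aggfunc, so you can apply a different function to each of the values you select. As a side effect, this makes the labels a little cleaner.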

In [13]:
pd.pivot_table(df,index=["Manager","Status"],
               columns=["Product"],
               values=["Quantity","Price"],
               aggfunc={"Quantity":len,"Price":np.sum},fill_value=0)

Unnamed: 0_level_0,Unnamed: 1_level_0,Price,Price,Price,Price,Quantity,Quantity,Quantity,Quantity
Unnamed: 0_level_1,Product,CPU,Maintenance,Monitor,Software,CPU,Maintenance,Monitor,Software
Manager,Status,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2
Debra Henley,won,65000,0,0,0,1,0,0,0
Debra Henley,pending,40000,10000,0,0,1,2,0,0
Debra Henley,presented,30000,0,0,20000,1,0,0,2
Debra Henley,declined,70000,0,0,0,2,0,0,0
Fred Anderson,won,165000,7000,0,0,2,1,0,0
Fred Anderson,pending,0,5000,0,0,0,1,0,0
Fred Anderson,presented,30000,0,5000,10000,1,0,1,1
Fred Anderson,declined,65000,0,0,0,1,0,0,0


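You can also provide a list of aggregate functions to apply to each value. Pulling all of this together at once can look daunting, but as soon as you start playing with the data and slowly adding items, you get a feel for how it works. My general rule of thumb is that once you find yourself using multiple groupings, you should evaluate whether a pivot table is a useful approach.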

In [14]:
pd.pivot_table(df,index=["Manager","Status"],
               columns=["Product"],
               values=["Quantity","Price"],
               aggfunc={"Quantity":len,"Price":[np.sum,np.mean]},fill_value=0)

Unnamed: 0_level_0,Unnamed: 1_level_0,Price,Price,Price,Price,Price,Price,Price,Price,Quantity,Quantity,Quantity,Quantity
Unnamed: 0_level_1,Unnamed: 1_level_1,mean,mean,mean,mean,sum,sum,sum,sum,len,len,len,len
Unnamed: 0_level_2,Product,CPU,Maintenance,Monitor,Software,CPU,Maintenance,Monitor,Software,CPU,Maintenance,Monitor,Software
Manager,Status,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3,Unnamed: 11_level_3,Unnamed: 12_level_3,Unnamed: 13_level_3
Debra Henley,won,65000,0,0,0,65000,0,0,0,1,0,0,0
Debra Henley,pending,40000,5000,0,0,40000,10000,0,0,1,2,0,0
Debra Henley,presented,30000,0,0,10000,30000,0,0,20000,1,0,0,2
Debra Henley,declined,35000,0,0,0,70000,0,0,0,2,0,0,0
Fred Anderson,won,82500,7000,0,0,165000,7000,0,0,2,1,0,0
Fred Anderson,pending,0,5000,0,0,0,5000,0,0,0,1,0,0
Fred Anderson,presented,30000,0,5000,10000,30000,0,5000,10000,1,0,1,1
Fred Anderson,declined,65000,0,0,0,65000,0,0,0,1,0,0,0


### Advanced Pivot Table Filtering
Once you have generated your data, it is in a DataFrame, so you can filter it with the standard DataFrame functions. For example, pull out a single manager.

In [15]:
# Same pivot table as the previous cell, stored in a variable
table = pd.pivot_table(df,index=["Manager","Status"],columns=["Product"],values=["Quantity","Price"],
               aggfunc={"Quantity":len,"Price":[np.sum,np.mean]},fill_value=0)

In [16]:
table.query('Manager == ["Debra Henley"]')

Unnamed: 0_level_0,Unnamed: 1_level_0,Price,Price,Price,Price,Price,Price,Price,Price,Quantity,Quantity,Quantity,Quantity
Unnamed: 0_level_1,Unnamed: 1_level_1,mean,mean,mean,mean,sum,sum,sum,sum,len,len,len,len
Unnamed: 0_level_2,Product,CPU,Maintenance,Monitor,Software,CPU,Maintenance,Monitor,Software,CPU,Maintenance,Monitor,Software
Manager,Status,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3,Unnamed: 11_level_3,Unnamed: 12_level_3,Unnamed: 13_level_3
Debra Henley,won,65000,0,0,0,65000,0,0,0,1,0,0,0
Debra Henley,pending,40000,5000,0,0,40000,10000,0,0,1,2,0,0
Debra Henley,presented,30000,0,0,10000,30000,0,0,20000,1,0,0,2
Debra Henley,declined,35000,0,0,0,70000,0,0,0,2,0,0,0


This is a powerful feature of pivot_table, so don't forget that once your data is in the pivot_table shape you need, the full power of pandas is available to you.

In [17]:
table.query('Status == ["pending","won"]')

Unnamed: 0_level_0,Unnamed: 1_level_0,Price,Price,Price,Price,Price,Price,Price,Price,Quantity,Quantity,Quantity,Quantity
Unnamed: 0_level_1,Unnamed: 1_level_1,mean,mean,mean,mean,sum,sum,sum,sum,len,len,len,len
Unnamed: 0_level_2,Product,CPU,Maintenance,Monitor,Software,CPU,Maintenance,Monitor,Software,CPU,Maintenance,Monitor,Software
Manager,Status,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3,Unnamed: 11_level_3,Unnamed: 12_level_3,Unnamed: 13_level_3
Debra Henley,won,65000,0,0,0,65000,0,0,0,1,0,0,0
Debra Henley,pending,40000,5000,0,0,40000,10000,0,0,1,2,0,0
Fred Anderson,won,82500,7000,0,0,165000,7000,0,0,2,1,0,0
Fred Anderson,pending,0,5000,0,0,0,5000,0,0,0,1,0,0
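
Since the result is an ordinary DataFrame with a MultiIndex, the same filters can also be written with label-based indexing instead of query. The sketch below is one possible alternative on a made-up four-row frame; the `deals` data and variable names are assumptions, not part of the funnel file.

```python
import pandas as pd

# Hypothetical deals, not the funnel data set
deals = pd.DataFrame({
    "Manager": ["Debra Henley", "Debra Henley", "Fred Anderson", "Fred Anderson"],
    "Status": ["won", "pending", "won", "declined"],
    "Price": [65000, 40000, 165000, 65000],
})

table = pd.pivot_table(deals, index=["Manager", "Status"],
                       values=["Price"], aggfunc="sum")

# Select one manager; passing a list keeps both index levels
debra = table.loc[["Debra Henley"]]

# Filter on the second index level with a boolean mask
is_open_or_won = table.index.get_level_values("Status").isin(["pending", "won"])
open_or_won = table[is_open_or_won]

print(debra)
print(open_or_won)
```

Status is left as plain text here, so the rows sort alphabetically rather than in the won-to-declined order used in the cells above.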
