### Create an Excel Diff - Simple Graphing with IPython and Pandas
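Why - To show, through a real-world example, how pandas can automate a process that is hard to do in Excel.

What - My business problem is that I have two Excel files with a similar structure but different data, and I want an easy way to see what changed between them. Basically, I want an Excel diff tool.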

In [1]:
import pandas as pd

# Read in the two files as `old` and `new`, and add a version column to track each row's origin
old = pd.read_excel('data/df-sample-address-1.xlsx', 'Sheet1', na_values=['NA'])
new = pd.read_excel('data/df-sample-address-2.xlsx', 'Sheet1', na_values=['NA'])
old['version'] = "old"
new['version'] = "new"

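In this example, I have two customer address lists, and I want to know:
- Which customers are new
- Which customers were removed
- Which customers have information that changed between the two files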

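You can imagine this being quite useful when auditing changes in a system, or for producing a list of changes so that your sales team can contact new customers.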

In [2]:
old_accts_all = set(old['account number'])
new_accts_all = set(new['account number'])

dropped_accts = old_accts_all - new_accts_all
added_accts = new_accts_all - old_accts_all
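
The set arithmetic above can be illustrated on a tiny example. This is only a sketch: the account numbers below are made up and do not come from the sample files.

```python
# Hypothetical account numbers, chosen only for illustration
old_accts = {101, 102, 103}
new_accts = {102, 103, 104}

dropped = old_accts - new_accts      # present in old, missing from new
added = new_accts - old_accts        # present in new, missing from old
in_both = old_accts & new_accts      # candidates for "changed" rows

print(dropped)  # {101}
print(added)    # {104}
```

Accounts in `in_both` still need a field-by-field comparison, which is what the remaining steps handle.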

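Next, we concatenate all the data and use drop_duplicates to get a clean list of unique rows. Rows that are identical in both files collapse into a single row, while every row that changed is kept.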

In [3]:
all_data = pd.concat([old,new],ignore_index=True)
changes = all_data.drop_duplicates(subset=["account number",
                                           "name", "street",
                                           "city","state",
                                           "postal code"], keep='last')
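
To see why this keeps the changed rows, here is a minimal sketch on two toy frames. The data and the reduced column set (`city` only) are assumptions made for brevity.

```python
import pandas as pd

# Toy data: account 1 is unchanged, account 2 moved from city B to city C
old = pd.DataFrame({"account number": [1, 2], "city": ["A", "B"], "version": "old"})
new = pd.DataFrame({"account number": [1, 2], "city": ["A", "C"], "version": "new"})

all_data = pd.concat([old, new], ignore_index=True)

# "version" is left out of the subset, so identical old/new rows count as duplicates.
# keep="last" keeps the copy that came from `new`.
changes = all_data.drop_duplicates(subset=["account number", "city"], keep="last")
print(changes)
```

Account 1 survives once (its `new` copy), while account 2 survives twice because its two rows differ. That repeated account number is the signal used in the next step.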

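Next, we need to figure out which account numbers have duplicate entries. A duplicated account number means that a value in at least one field changed, so we need to flag it. We can use the duplicated function to get a list of those account numbers and then filter the data down to just those accounts.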

In [4]:
dupe_accts = changes[changes['account number'].duplicated()]['account number'].tolist()
dupes = changes[changes["account number"].isin(dupe_accts)]
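
As a small illustration of how `duplicated` behaves here, the sketch below uses made-up rows. It also shows `duplicated(keep=False)`, which marks every occurrence and might serve as a one-step alternative; that is a suggestion, not what the notebook does.

```python
import pandas as pd

# Toy frame: account 2 appears twice, account 1 once
changes = pd.DataFrame({"account number": [2, 1, 2], "city": ["B", "A", "C"]})

# By default only the second and later occurrences are marked True
mask = changes["account number"].duplicated()
dupe_accts = changes.loc[mask, "account number"].tolist()

# Two-step filter, as in the notebook
dupes = changes[changes["account number"].isin(dupe_accts)]

# One-step alternative: keep=False marks all occurrences of a repeated value
dupes_alt = changes[changes["account number"].duplicated(keep=False)]
```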

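Now we split the old and new data apart, drop the version column we no longer need, and set the account number as the index. These steps set up the data for the final comparison.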

In [5]:
# Pull out the old and new data into separate dataframes
change_new = dupes[(dupes["version"] == "new")]
change_old = dupes[(dupes["version"] == "old")]

# Drop the temp columns - we don't need them now
change_new = change_new.drop(['version'], axis=1)
change_old = change_old.drop(['version'], axis=1)

# Index on the account numbers
change_new.set_index('account number', inplace=True)
change_old.set_index('account number', inplace=True)

# Combine all the changes together
df_all_changes = pd.concat([change_old, change_new],
                            axis='columns',
                            keys=['old', 'new'],
                            join='outer')
df_all_changes

| account number | old: name | old: street | old: city | old: state | old: postal code | new: name | new: street | new: city | new: state | new: postal code |
|---|---|---|---|---|---|---|---|---|---|---|
| 595932 | Kuhic, Eichmann and West | 4059 Tobias Inlet | New Rylanfurt | Illinois | 89271 | Kuhic, Eichmann and West | 4059 Tobias St | New Rylanfurt | Illinois | 89271 |
| 558879 | Watsica Group | 95616 Enos Grove Suite 139 | West Atlas | Iowa | 47419 | Watsica Group | 829 Big street | Smithtown | Ohio | 47919 |
| 880043 | Beatty Inc | 3641 Schaefer Isle Suite 171 | North Gardnertown | Wyoming | 64318 | Beatty Inc | 3641 Schaefer Isle Suite 171 | North Gardnertown | Wyoming | 64918 |
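
The two-level column header in this output comes from the `keys` argument of `pd.concat`. A minimal sketch on a single account, using two of the fields from the row for account 558879, shows the resulting column labels:

```python
import pandas as pd

idx = pd.Index([558879], name="account number")
change_old = pd.DataFrame({"city": ["West Atlas"], "state": ["Iowa"]}, index=idx)
change_new = pd.DataFrame({"city": ["Smithtown"], "state": ["Ohio"]}, index=idx)

# keys adds an outer column level, so the same field name can appear under
# both "old" and "new" without clashing
side_by_side = pd.concat([change_old, change_new], axis="columns", keys=["old", "new"])

print(side_by_side.columns.tolist())
# [('old', 'city'), ('old', 'state'), ('new', 'city'), ('new', 'state')]
```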


Before the final combination step, we need to define a diff function that shows what changed between each pair of columns.

In [6]:
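def report_diff(x):
    # x holds one old/new pair; use positional access so the lookup does not
    # depend on the Series labels
    return x.iloc[0] if x.iloc[0] == x.iloc[1] else '{} ---> {}'.format(*x)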
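
One caveat worth illustrating: NaN never compares equal to itself, so a cell that is blank in both files would be reported as `nan ---> nan`. The sketch below is a possible NaN-aware variant. The name `report_diff_nan_aware` is hypothetical and not part of the original notebook.

```python
import pandas as pd

def report_diff_nan_aware(pair):
    """Return the value if unchanged, else 'old ---> new'; treats NaN == NaN."""
    old_val, new_val = pair.iloc[0], pair.iloc[1]
    both_missing = pd.isna(old_val) and pd.isna(new_val)
    if both_missing or old_val == new_val:
        return old_val
    return "{} ---> {}".format(old_val, new_val)

changed = report_diff_nan_aware(pd.Series(["Iowa", "Ohio"], index=["old", "new"]))
same = report_diff_nan_aware(pd.Series(["Iowa", "Iowa"], index=["old", "new"]))
blank = report_diff_nan_aware(pd.Series([float("nan"), float("nan")], index=["old", "new"]))

print(changed)  # Iowa ---> Ohio
print(same)     # Iowa
```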

We now use the swaplevel function to place the old and new columns for each field next to each other.

In [7]:
df_all_changes = df_all_changes.swaplevel(axis='columns')[change_new.columns]
df_all_changes

| account number | name: old | name: new | street: old | street: new | city: old | city: new | state: old | state: new | postal code: old | postal code: new |
|---|---|---|---|---|---|---|---|---|---|---|
| 595932 | Kuhic, Eichmann and West | Kuhic, Eichmann and West | 4059 Tobias Inlet | 4059 Tobias St | New Rylanfurt | New Rylanfurt | Illinois | Illinois | 89271 | 89271 |
| 558879 | Watsica Group | Watsica Group | 95616 Enos Grove Suite 139 | 829 Big street | West Atlas | Smithtown | Iowa | Ohio | 47419 | 47919 |
| 880043 | Beatty Inc | Beatty Inc | 3641 Schaefer Isle Suite 171 | 3641 Schaefer Isle Suite 171 | North Gardnertown | North Gardnertown | Wyoming | Wyoming | 64318 | 64918 |
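
The reordering happens in two parts: `swaplevel` flips the two header levels, and selecting by the field names then groups each field's old and new columns together. A minimal sketch with a hand-built two-field frame (values taken from account 558879) shows each stage:

```python
import pandas as pd

columns = pd.MultiIndex.from_tuples(
    [("old", "city"), ("old", "state"), ("new", "city"), ("new", "state")]
)
side_by_side = pd.DataFrame(
    [["West Atlas", "Iowa", "Smithtown", "Ohio"]],
    index=pd.Index([558879], name="account number"),
    columns=columns,
)

# Stage 1: the field name becomes the outer level, but column order is unchanged
swapped = side_by_side.swaplevel(axis="columns")

# Stage 2: selecting by outer-level labels pulls each field's columns together
paired = swapped[["city", "state"]]

print(paired.columns.tolist())
# [('city', 'old'), ('city', 'new'), ('state', 'old'), ('state', 'new')]
```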


Use groupby on the columns, then apply our custom report_diff function to compare each pair of corresponding old and new columns.

In [8]:
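# Group the columns by field name (level 0) and diff each old/new pair row by row.
# Note: groupby(..., axis=1) is deprecated as of pandas 2.1.
df_changed = df_all_changes.groupby(level=0, axis=1).apply(lambda frame: frame.apply(report_diff, axis=1))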
df_changed = df_changed.reset_index()
df_changed

| | account number | city | name | postal code | state | street |
|---|---|---|---|---|---|---|
| 0 | 595932 | New Rylanfurt | Kuhic, Eichmann and West | 89271 | Illinois | 4059 Tobias Inlet ---> 4059 Tobias St |
| 1 | 558879 | West Atlas ---> Smithtown | Watsica Group | 47419 ---> 47919 | Iowa ---> Ohio | 95616 Enos Grove Suite 139 ---> 829 Big street |
| 2 | 880043 | North Gardnertown | Beatty Inc | 64318 ---> 64918 | Wyoming | 3641 Schaefer Isle Suite 171 |
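
Because grouping along columns with `axis=1` is deprecated in recent pandas releases, a per-field loop is one possible way to get the same kind of result. The sketch below is an assumption about how that could look, not the notebook's method. It uses two fields from the rows shown above, and `report_diff` is restated with positional access so the block runs on its own. Unlike groupby, it keeps the fields in their original order rather than sorting them.

```python
import pandas as pd

def report_diff(pair):
    old_val, new_val = pair.iloc[0], pair.iloc[1]
    return old_val if old_val == new_val else "{} ---> {}".format(old_val, new_val)

columns = pd.MultiIndex.from_product([["city", "state"], ["old", "new"]])
paired = pd.DataFrame(
    [["West Atlas", "Smithtown", "Iowa", "Ohio"],
     ["North Gardnertown", "North Gardnertown", "Wyoming", "Wyoming"]],
    index=pd.Index([558879, 880043], name="account number"),
    columns=columns,
)

# paired[field] is a two-column (old, new) frame; diff it one row at a time
fields = paired.columns.get_level_values(0).unique()
diffed = pd.DataFrame(
    {field: paired[field].apply(report_diff, axis=1) for field in fields}
)
print(diffed.reset_index())
```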


The final analysis step is to figure out what was removed and what was added.

In [9]:
df_removed = changes[changes["account number"].isin(dropped_accts)]
df_added = changes[changes["account number"].isin(added_accts)]
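
As a cross-check on the removed and added sets, an outer merge with `indicator=True` labels each account by which file it came from. This is an alternative sketch on made-up data, not the approach used in this notebook.

```python
import pandas as pd

# Hypothetical toy data: 101 was removed, 103 was added, 102 is in both files
old = pd.DataFrame({"account number": [101, 102], "name": ["A", "B"]})
new = pd.DataFrame({"account number": [102, 103], "name": ["B", "C"]})

merged = old.merge(
    new,
    on="account number",
    how="outer",
    suffixes=("_old", "_new"),
    indicator=True,  # adds a "_merge" column: left_only, right_only, or both
)

removed = merged.loc[merged["_merge"] == "left_only", "account number"].tolist()
added = merged.loc[merged["_merge"] == "right_only", "account number"].tolist()

print(removed)  # [101]
print(added)    # [103]
```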

We can write everything to a single Excel file, with separate tabs for the changes, the removals, and the additions.

In [10]:
output_columns = ["account number", "name", "street", "city", "state", "postal code"]
# Use the writer as a context manager so the file is saved and closed on exit
# (writer.save() was removed in pandas 2.0)
with pd.ExcelWriter("data/dt-ExcellDiff.xlsx") as writer:
    df_changed.to_excel(writer, sheet_name="changed", index=False, columns=output_columns)
    df_removed.to_excel(writer, sheet_name="removed", index=False, columns=output_columns)
    df_added.to_excel(writer, sheet_name="added", index=False, columns=output_columns)