### Excel Tasks Demonstrated
Why - This article walks through some common Excel tasks and shows how to perform similar tasks in pandas.

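What - Some of the examples are fairly trivial, but I think it is important to show the simple functions alongside the more complex ones you can find elsewhere. As an added bonus, I will do some fuzzy string matching to show a small twist on the process, and to show how pandas can draw on modules from the wider Python ecosystem to do something that is simple in Python but would be complex in Excel.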

In [1]:
import pandas as pd
import numpy as np
df = pd.read_excel("data/df-excel-comp-data.xlsx")
df.head()

Unnamed: 0,account,name,street,city,state,postal-code,Jan,Feb,Mar
0,211829,"Kerluke, Koepp and Hilpert",34456 Sean Highway,New Jaycob,Texas,28752,10000,62000,35000
1,320563,Walter-Trantow,1311 Alvis Tunnel,Port Khadijah,NorthCarolina,38365,95000,45000,35000
2,648336,"Bashirian, Kunde and Price",62184 Schamberger Underpass Apt. 231,New Lilianland,Iowa,76517,91000,120000,35000
3,109996,"D'Amore, Gleichner and Bode",155 Fadel Crescent Apt. 144,Hyattburgh,Maine,46021,45000,120000,10000
4,121213,Bauch-Goldner,7274 Marissa Common,Shanahanchester,California,49681,162000,120000,35000


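We want to add a total column showing the combined sales for Jan, Feb, and Mar. This is straightforward in both Excel and pandas. In Excel, you would add the formula SUM(G2:I2) in a new column; in pandas, adding a new column is just as easy.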

In [2]:
df["total"] = df["Jan"] + df["Feb"] + df["Mar"]
df.head()

Unnamed: 0,account,name,street,city,state,postal-code,Jan,Feb,Mar,total
0,211829,"Kerluke, Koepp and Hilpert",34456 Sean Highway,New Jaycob,Texas,28752,10000,62000,35000,107000
1,320563,Walter-Trantow,1311 Alvis Tunnel,Port Khadijah,NorthCarolina,38365,95000,45000,35000,175000
2,648336,"Bashirian, Kunde and Price",62184 Schamberger Underpass Apt. 231,New Lilianland,Iowa,76517,91000,120000,35000,246000
3,109996,"D'Amore, Gleichner and Bode",155 Fadel Crescent Apt. 144,Hyattburgh,Maine,46021,45000,120000,10000,175000
4,121213,Bauch-Goldner,7274 Marissa Common,Shanahanchester,California,49681,162000,120000,35000,317000
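
With more than a handful of month columns, spelling out each addition gets tedious. As a minimal sketch, the same total can be computed with a row-wise sum. The small frame below just re-types three rows from the output above, and `month_cols` is a name introduced here for illustration.

```python
import pandas as pd

# Three rows re-typed from the output above, for illustration only.
sample = pd.DataFrame({
    "account": [211829, 320563, 648336],
    "Jan": [10000, 95000, 91000],
    "Feb": [62000, 45000, 120000],
    "Mar": [35000, 35000, 35000],
})

month_cols = ["Jan", "Feb", "Mar"]  # hypothetical helper name

# axis=1 sums across the columns of each row, like SUM(G2:I2) filled down.
sample["total"] = sample[month_cols].sum(axis=1)
print(sample)
```

Listing the columns once also makes it easy to add a month later without touching the formula.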


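Next, we want a total for each month plus a grand total. This is where pandas and Excel diverge a little. In Excel, adding a total cell under each month's column is very simple. Because pandas needs to maintain the integrity of the entire DataFrame, a few more steps are required. First, sum the columns and transpose the result into a one-row DataFrame.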

In [3]:
sum_row = df[["Jan","Feb","Mar","total"]].sum()
df_sum  = pd.DataFrame(data=sum_row).T  #transpose 
df_sum

Unnamed: 0,Jan,Feb,Mar,total
0,1462000,1507000,717000,3686000


Other aggregations on a column work the same way:

In [4]:
df["Jan"].sum() , df["Jan"].mean() , df["Jan"].min() , df["Jan"].max()

(1462000, 97466.66666666667, 10000, 162000)
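
Several summary statistics can also be requested in a single call. A small sketch, assuming only the five Jan values visible in the first output above (so the numbers differ from the full-data result): `Series.agg` accepts a list of function names.

```python
import pandas as pd

# The five Jan values from the df.head() output, re-typed for illustration.
jan = pd.Series([10000, 95000, 91000, 45000, 162000], name="Jan")

# One call returns a Series indexed by the statistic name.
stats = jan.agg(["sum", "mean", "min", "max"])
print(stats)
```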

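The last thing we need to do before adding the totals back is to add the missing columns. We use reindex to do this for us. The trick is to pass in all of the original columns and let pandas fill in the missing values with NaN.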

In [5]:
df_sum=df_sum.reindex(columns=df.columns)
df_sum

Unnamed: 0,account,name,street,city,state,postal-code,Jan,Feb,Mar,total
0,,,,,,,1462000,1507000,717000,3686000


In [6]:
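# Append the totals row to the original data.
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
df_final = pd.concat([df, df_sum], ignore_index=True)
df_final.tail()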

Unnamed: 0,account,name,street,city,state,postal-code,Jan,Feb,Mar,total
11,231907.0,Hahn-Moore,18115 Olivine Throughway,Norbertomouth,NorthDakota,31415.0,150000,10000,162000,322000
12,242368.0,"Frami, Anderson and Donnelly",182 Bertie Road,East Davian,Iowa,72686.0,162000,120000,35000,317000
13,268755.0,Walsh-Haley,2624 Beatty Parkways,Goodwinmouth,RhodeIsland,31919.0,55000,120000,35000,210000
14,273274.0,McDermott PLC,8917 Bergstrom Meadow,Kathryneborough,Delaware,27933.0,150000,120000,70000,340000
15,,,,,,,1462000,1507000,717000,3686000
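
The three steps above (sum, reindex, append) can be wrapped into one function. This is only a sketch: `append_totals_row` is a hypothetical helper, not part of pandas, and the two-row frame re-types values from the outputs above.

```python
import pandas as pd

def append_totals_row(frame, value_cols):
    """Return a new frame with one extra row holding the column sums.

    Hypothetical helper for illustration; not part of pandas.
    """
    totals = frame[value_cols].sum()
    totals_row = pd.DataFrame(data=totals).T             # one-row DataFrame
    totals_row = totals_row.reindex(columns=frame.columns)  # missing columns become NaN
    return pd.concat([frame, totals_row], ignore_index=True)

sample = pd.DataFrame({
    "name": ["Walter-Trantow", "Bauch-Goldner"],
    "Jan": [95000, 162000],
    "Feb": [45000, 120000],
})
with_totals = append_totals_row(sample, ["Jan", "Feb"])
print(with_totals)
```

The NaN placeholders are also why account and postal-code show up as floats (for example 231907.0) in the output above: an integer column that has to hold NaN is converted to float.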


## FuzzyWuzzy - Vlookup

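For another example, let's try to add a state abbreviation to the data set. From an Excel perspective, the easiest way is probably to add a new column, do a vlookup on the state name, and fill in the abbreviation, for example Texas = TX. With that approach, some values would not come through correctly, because some of the state names are misspelled (NorthCarolina in the data above, for instance). Handling this in Excel would be very challenging, especially on a large data set.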
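
To see the problem concretely, here is a small sketch of an exact-match lookup in pandas, which behaves much like a vlookup with exact matching. The state names are taken from the data above; `exact_lookup` is a cut-down mapping made up for this illustration.

```python
import pandas as pd

# State names as they appear in the data above, plus a tiny lookup table.
states = pd.Series(["Texas", "NorthCarolina", "Iowa", "NorthDakota"])
exact_lookup = {"TEXAS": "TX", "NORTH CAROLINA": "NC", "IOWA": "IA", "NORTH DAKOTA": "ND"}

# Series.map with a dict only matches identical keys;
# anything not found comes back as NaN.
abbrev = states.str.upper().map(exact_lookup)
print(abbrev.tolist())
```

The names with the missing space get no abbreviation, which is the gap fuzzy matching is meant to close.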

In [7]:
pip install fuzzywuzzy

Note: you may need to restart the kernel to use updated packages.


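The fuzzywuzzy library (a standalone Python package, not part of pandas) has some quite useful functions for this kind of situation. Make sure it is installed. This is where having the full power of the Python ecosystem at our disposal pays off. When thinking about how to clean up this kind of messy data, I thought about trying some fuzzy text matching to determine the correct value. Fortunately, someone else has already done a lot of work in this area. The other piece of code we need is a mapping from state name to abbreviation. Rather than typing it out myself, a quick Google search turned up this code. We start by importing the appropriate fuzzywuzzy functions and defining our state map dictionary.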

In [8]:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
state_to_code = {"VERMONT": "VT", "GEORGIA": "GA", "IOWA": "IA", "Armed Forces Pacific": "AP", "GUAM": "GU",
                 "KANSAS": "KS", "FLORIDA": "FL", "AMERICAN SAMOA": "AS", "NORTH CAROLINA": "NC", "HAWAII": "HI",
                 "NEW YORK": "NY", "CALIFORNIA": "CA", "ALABAMA": "AL", "IDAHO": "ID", "FEDERATED STATES OF MICRONESIA": "FM",
                 "Armed Forces Americas": "AA", "DELAWARE": "DE", "ALASKA": "AK", "ILLINOIS": "IL",
                 "Armed Forces Africa": "AE", "SOUTH DAKOTA": "SD", "CONNECTICUT": "CT", "MONTANA": "MT", "MASSACHUSETTS": "MA",
                 "PUERTO RICO": "PR", "Armed Forces Canada": "AE", "NEW HAMPSHIRE": "NH", "MARYLAND": "MD", "NEW MEXICO": "NM",
                 "MISSISSIPPI": "MS", "TENNESSEE": "TN", "PALAU": "PW", "COLORADO": "CO", "Armed Forces Middle East": "AE",
                 "NEW JERSEY": "NJ", "UTAH": "UT", "MICHIGAN": "MI", "WEST VIRGINIA": "WV", "WASHINGTON": "WA",
                 "MINNESOTA": "MN", "OREGON": "OR", "VIRGINIA": "VA", "VIRGIN ISLANDS": "VI", "MARSHALL ISLANDS": "MH",
                 "WYOMING": "WY", "OHIO": "OH", "SOUTH CAROLINA": "SC", "INDIANA": "IN", "NEVADA": "NV", "LOUISIANA": "LA",
                 "NORTHERN MARIANA ISLANDS": "MP", "NEBRASKA": "NE", "ARIZONA": "AZ", "WISCONSIN": "WI", "NORTH DAKOTA": "ND",
                 "Armed Forces Europe": "AE", "PENNSYLVANIA": "PA", "OKLAHOMA": "OK", "KENTUCKY": "KY", "RHODE ISLAND": "RI",
                 "DISTRICT OF COLUMBIA": "DC", "ARKANSAS": "AR", "MISSOURI": "MO", "TEXAS": "TX", "MAINE": "ME"}



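A few parameters control the matching. scorer is the function that computes the similarity of two strings; the default is fuzz.WRatio. limit is the number of matches to return. With process.extract, the output is a list of tuples in which the first element is the matched string and the second is an int score, sorted by score.

score_cutoff is a threshold: matches whose score falls below it are not returned. process.extractWithoutOrder returns a generator that yields every match meeting the cutoff, in the order of the choices rather than sorted by score. process.extractOne, used below, returns only the best match, or None when no match meets score_cutoff.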

In [9]:
process.extractOne("Minnesotta",choices=state_to_code.keys())

('MINNESOTA', 95)

In [10]:
process.extractOne("AlaBAMMazzz",choices=state_to_code.keys(),score_cutoff=80)

In [11]:
process.extractOne("AlaBAMMazzz",choices=state_to_code.keys())

('ALABAMA', 78)
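
For readers who want to experiment with the cutoff idea without installing anything, here is a rough analogue using difflib from the standard library. This is a swapped-in technique, not fuzzywuzzy's WRatio scorer: difflib scores on a 0 to 1 scale, so its numbers will not match the scores above. `best_match` and the short `choices` list are hypothetical names introduced for the sketch.

```python
import difflib

choices = ["ALABAMA", "ALASKA", "MINNESOTA", "MONTANA"]

def best_match(query, choices, cutoff):
    """Return the closest choice, or None if nothing reaches the cutoff."""
    # n=1 keeps only the best candidate; cutoff is a ratio between 0 and 1.
    matches = difflib.get_close_matches(query.upper(), choices, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(best_match("Minnesotta", choices, cutoff=0.8))
print(best_match("AlaBAMMazzz", choices, cutoff=0.8))
print(best_match("AlaBAMMazzz", choices, cutoff=0.6))
```

As with extractOne, a stricter cutoff rejects the badly mangled input while a looser one accepts it, so the threshold is worth tuning against real data.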

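Now that we know how this works, we create a function that takes the state column and converts it to a valid abbreviation. We use a score_cutoff of 80 for this data. You can experiment to see what number works for your data. You will notice that we return either a valid abbreviation or np.nan, so the field always holds a well-defined value.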

In [12]:
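def convert_state(row):
    """Return the two-letter code for row['state'], or NaN if no close match."""
    if pd.notnull(row['state']):
        # extractOne returns (match, score), or None when no score reaches the cutoff
        abbrev = process.extractOne(row["state"],choices=state_to_code.keys(),score_cutoff=80)
        if abbrev:
            return state_to_code[abbrev[0]]
    return np.nan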

In [13]:
# Add the column in the location we want and fill it with NaN values
df_final.insert(6, "abbrev", np.nan)
df_final.head()

Unnamed: 0,account,name,street,city,state,postal-code,abbrev,Jan,Feb,Mar,total
0,211829.0,"Kerluke, Koepp and Hilpert",34456 Sean Highway,New Jaycob,Texas,28752.0,,10000,62000,35000,107000
1,320563.0,Walter-Trantow,1311 Alvis Tunnel,Port Khadijah,NorthCarolina,38365.0,,95000,45000,35000,175000
2,648336.0,"Bashirian, Kunde and Price",62184 Schamberger Underpass Apt. 231,New Lilianland,Iowa,76517.0,,91000,120000,35000,246000
3,109996.0,"D'Amore, Gleichner and Bode",155 Fadel Crescent Apt. 144,Hyattburgh,Maine,46021.0,,45000,120000,10000,175000
4,121213.0,Bauch-Goldner,7274 Marissa Common,Shanahanchester,California,49681.0,,162000,120000,35000,317000


In [14]:
# Use apply to fill in the abbreviation for each row
df_final['abbrev'] = df_final.apply(convert_state, axis=1)
df_final.tail()

Unnamed: 0,account,name,street,city,state,postal-code,abbrev,Jan,Feb,Mar,total
11,231907.0,Hahn-Moore,18115 Olivine Throughway,Norbertomouth,NorthDakota,31415.0,ND,150000,10000,162000,322000
12,242368.0,"Frami, Anderson and Donnelly",182 Bertie Road,East Davian,Iowa,72686.0,IA,162000,120000,35000,317000
13,268755.0,Walsh-Haley,2624 Beatty Parkways,Goodwinmouth,RhodeIsland,31919.0,RI,55000,120000,35000,210000
14,273274.0,McDermott PLC,8917 Bergstrom Meadow,Kathryneborough,Delaware,27933.0,DE,150000,120000,70000,340000
15,,,,,,,,1462000,1507000,717000,3686000


### Subtotals

Next, let's get some subtotals by state. In Excel, we would use the Subtotal tool to do this for us.

In [15]:
df_sub=df_final[["abbrev","Jan","Feb","Mar","total"]].groupby('abbrev').sum()
df_sub

abbrev,Jan,Feb,Mar,total
AR,150000,120000,35000,305000
CA,162000,120000,35000,317000
DE,150000,120000,70000,340000
IA,253000,240000,70000,563000
ID,70000,120000,35000,225000
ME,45000,120000,10000,175000
MS,62000,120000,70000,252000
NC,95000,45000,35000,175000
ND,150000,10000,162000,322000
PA,70000,95000,35000,200000
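
One detail worth noting: the totals row appended earlier has no abbreviation, and groupby leaves out rows whose key is NaN by default, so that row is not counted a second time in the subtotals. A small sketch, using the two Iowa rows and the Delaware row from the outputs above plus a stand-in totals row:

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "abbrev": ["IA", "IA", "DE", np.nan],   # last row mimics the totals row
    "Jan": [91000, 162000, 150000, 403000],
    "Feb": [120000, 120000, 120000, 360000],
})

# Rows with a NaN key are dropped from the groups (dropna=True is the default).
subtotals = sample.groupby("abbrev").sum()
print(subtotals)
```

Summing the subtotals therefore gives back the grand total rather than double it.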


In [16]:
def money(x):
    return "${:,.0f}".format(x)

formatted_df = df_sub.applymap(money)
formatted_df

abbrev,Jan,Feb,Mar,total
AR,"$150,000","$120,000","$35,000","$305,000"
CA,"$162,000","$120,000","$35,000","$317,000"
DE,"$150,000","$120,000","$70,000","$340,000"
IA,"$253,000","$240,000","$70,000","$563,000"
ID,"$70,000","$120,000","$35,000","$225,000"
ME,"$45,000","$120,000","$10,000","$175,000"
MS,"$62,000","$120,000","$70,000","$252,000"
NC,"$95,000","$45,000","$35,000","$175,000"
ND,"$150,000","$10,000","$162,000","$322,000"
PA,"$70,000","$95,000","$35,000","$200,000"
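
Keep in mind that formatting turns the numbers into strings, which is why the totals in the following cells are computed from df_sub rather than from formatted_df. A minimal sketch of what the format specification in money does, using a few values that appear earlier in this notebook:

```python
import pandas as pd

def money(x):
    # "," inserts thousands separators; ".0f" rounds to zero decimal places
    return "${:,.0f}".format(x)

values = pd.Series([150000, 1462000, 97466.67])
formatted = values.map(money)
print(formatted.tolist())
```

Because the results are text, any arithmetic should happen before this step.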


In [17]:
sum_row=df_sub[["Jan","Feb","Mar","total"]].sum()
sum_row

Jan      1462000
Feb      1507000
Mar       717000
total    3686000
dtype: int64

In [18]:
df_sub_sum=pd.DataFrame(data=sum_row).T
df_sub_sum=df_sub_sum.applymap(money)
df_sub_sum

Unnamed: 0,Jan,Feb,Mar,total
0,"$1,462,000","$1,507,000","$717,000","$3,686,000"


In [20]:
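# Add the formatted totals row to the formatted subtotals.
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
final_table = pd.concat([formatted_df, df_sub_sum])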
final_table = final_table.rename(index={0:"Total"})
final_table

Unnamed: 0,Jan,Feb,Mar,total
AR,"$150,000","$120,000","$35,000","$305,000"
CA,"$162,000","$120,000","$35,000","$317,000"
DE,"$150,000","$120,000","$70,000","$340,000"
IA,"$253,000","$240,000","$70,000","$563,000"
ID,"$70,000","$120,000","$35,000","$225,000"
ME,"$45,000","$120,000","$10,000","$175,000"
MS,"$62,000","$120,000","$70,000","$252,000"
NC,"$95,000","$45,000","$35,000","$175,000"
ND,"$150,000","$10,000","$162,000","$322,000"
PA,"$70,000","$95,000","$35,000","$200,000"
