# Pandas module
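- pandas is an efficient, easy-to-use data processing and analysis library for Python, often described as a programmable counterpart to Excel.
- It provides data structures (Series, DataFrame) and operations on them, so spreadsheet-style data can be manipulated from Python.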


![pandas dataframe and series](https://www.altexsoft.com/static/blog-post/2024/2/a2b6d6bd-898e-424f-98a8-50b3bdf775eb.webp)

# Import library

In [None]:
import numpy as np
import pandas as pd
import matplotlib
from matplotlib.font_manager import fontManager

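# Creating a Series

- A Series is a one-dimensional data structure with a single index.
- pd.Series(data [, index = index])
  - data can be a list, dict, tuple, or NumPy array.
  - index is optional; the default is the integer positions 0, 1, 2, ... (a RangeIndex).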

In [None]:
# Create a Series from a list
foo = ['a', 'c', 'x', 'y']
my_ser = pd.Series(foo)
print(my_ser)           # show the Series
print(my_ser.values)    # show the values
print(my_ser.index)     # show the index

In [None]:
# Create a Series from a list with a custom index
company = ['聯電', '台積電', '聯發科']
stock_price = [42, 510, 694]
stock = pd.Series(stock_price, index=company)
print(stock)
print(stock.values)
print(stock.index)

In [None]:
# Create a Series from a dict
dict1 = {'Taiwan': '台北', 'US': 'New York', 'Japan': 'Tokyo'}
city = pd.Series(dict1)
print(city)
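
The cells above build a Series from a list and from a dict. As a minimal sketch, assuming the same stock prices as sample data, a tuple or a NumPy array can be passed in the same way (the variable names here are illustrative only):

```python
import numpy as np
import pandas as pd

# A tuple behaves like a list: values keep their order and get the default index 0..n-1.
prices_from_tuple = pd.Series((42, 510, 694))

# A NumPy array works the same way; an explicit index replaces the default one.
prices_from_array = pd.Series(np.array([42, 510, 694]), index=['a', 'b', 'c'])

print(prices_from_tuple.index)   # the default RangeIndex
print(prices_from_array['b'])    # look up a value by its custom label
```

Whatever the input container, the result is the same kind of object, so the later examples apply to all of them.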

# Reading a Series

In [None]:
print(city['Taiwan'])  # get a value by label (key)
print(city.iloc[0])    # get a value by position
print(city.index[0])  # get key by index
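
A short sketch of the difference between label-based and position-based access, reusing the city data from above. It assumes nothing beyond standard pandas indexers; the 'France' key is a made-up example of a missing label.

```python
import pandas as pd

city = pd.Series({'Taiwan': '台北', 'US': 'New York', 'Japan': 'Tokyo'})

by_label = city.loc['US']                 # explicit label-based lookup
by_position = city.iloc[-1]               # explicit position-based lookup (last item)
label_slice = city.loc['Taiwan':'US']     # label slices include the end label
position_slice = city.iloc[0:1]           # position slices exclude the end position
missing = city.get('France', 'unknown')   # get() returns a default instead of raising KeyError

print(by_label, by_position, missing)
```

Using .loc and .iloc explicitly makes it clear to the reader whether a label or a position is meant.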

# Lab: Series

(1) Create a Series (city_revenues) from the table below, then answer the questions.

city | revenues
-----|---------
Amsterdam|4200
Toronto|8000
Tokyo|6500

- What are the revenues for Toronto?
- What are the revenues of the first record (position 0)?
- What are the revenues of the last record?
- Compute the sum of revenues.

(2) Create a Series (city_employee_count) from the table below, then answer the question.

city | employee_count
-----|----------------
Amsterdam|5
Tokyo|8

- Is Tokyo in city_employee_count?

(3) Combine city_revenues and city_employee_count into a DataFrame (city_data), then answer the questions.
- Look up the row for Amsterdam.
- Look up the row at position 0.
- Look up the revenues from Amsterdam through Tokyo.



In [None]:
import pandas as pd

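city_revenues = pd.Series([4200, 8000, 6500],
                          index=['Amsterdam', 'Toronto', 'Tokyo'])
print(city_revenues['Toronto'])   # revenues for Toronto
print(city_revenues.iloc[0])      # first record (position 0)
print(city_revenues.iloc[-1])     # last record
print(city_revenues.sum())        # total revenues

city_employee_count = pd.Series([5, 8], index=['Amsterdam', 'Tokyo'])
print('Tokyo' in city_employee_count.index)

# Rows are aligned on the city labels; Toronto has no employee count, so it becomes NaN.
city_data = pd.DataFrame({'revenues': city_revenues,
                          'employee_count': city_employee_count})
print(city_data.loc['Amsterdam'])                      # row by label
print(city_data.iloc[0])                               # row at position 0
print(city_data.loc['Amsterdam':'Tokyo', 'revenues'])  # revenues from Amsterdam through Tokyo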

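# Creating a DataFrame

- A DataFrame is a two-dimensional data structure with a row index and column labels.
- Each column of a DataFrame can be treated as a Series.
- pd.DataFrame(data [, index = row index, columns = column labels])
- data can be a list, dict, NumPy array, tuple, or Series.
- The row index labels the rows; it is optional and defaults to the integer positions 0, 1, 2, ...
- The column labels are the field names.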

### Creating a DataFrame from a list

In [None]:
foo=  [[65,92,78,83,70],
       [90,72,76,93,56],
       [81,85,91,89,77],
       [79,53,47,94,80],
       ]
df = pd.DataFrame(foo)
print(df)
print(df.values)
print(df.index)
# No row index or column labels were given, so both default to integers

In [None]:
# Set the row index and column labels
df = pd.DataFrame(foo,
                   index=['王小明','李小美','陳大同','林小玉'],
                   columns=['國文','英文','數學','自然','社會'])
df

### Creating a DataFrame from a dict

In [None]:
# Column-oriented dict: outer keys are columns, inner keys are row labels
scores = {'國文':{'王小明':65,'李小美':90,'陳大同':81,'林小玉':79},
          '英文':{'王小明':92,'李小美':72,'陳大同':85,'林小玉':53},
          '數學':{'王小明':78,'李小美':76,'陳大同':91,'林小玉':47},
          '自然':{'王小明':83,'李小美':93,'陳大同':89,'林小玉':94},
          '社會':{'王小明':70,'李小美':56,'陳大同':77,'林小玉':80}}
pd.DataFrame(scores)
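
The dict above is column-oriented. When the data arrives row by row instead, one option is DataFrame.from_dict with orient='index'. The sketch below uses a hypothetical two-student, two-subject subset of the scores to keep it short.

```python
import pandas as pd

# Hypothetical row-oriented data: outer keys are students, inner keys are subjects.
rows = {'王小明': {'國文': 65, '英文': 92},
        '李小美': {'國文': 90, '英文': 72}}

# orient='index' tells pandas that the outer keys are row labels.
df_rows = pd.DataFrame.from_dict(rows, orient='index')

# Without it the outer keys become columns; transposing gives the same layout.
df_transposed = pd.DataFrame(rows).T

print(df_rows)
```

Both forms describe the same table; which one is more convenient depends on how the source data is organized.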

### Creating a DataFrame from a NumPy array

In [None]:
mydata = np.random.randn(4, 3)
print(mydata)
df = pd.DataFrame(mydata, columns=list("ABC"), index=list("甲乙丙丁"))
df

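## Inspecting the structure of a DataFrame

In [None]:
print(df.shape)
print(df.dtypes)
df.info()  # info() is a method and prints its own summary; print(df.info) would show only the bound method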
print(df.describe())
df

## Combining two DataFrames

In [None]:
scores = {'國文':{'王小明':65,'李小美':90,'陳大同':81,'林小玉':79},
          '英文':{'王小明':92,'李小美':72,'陳大同':85,'林小玉':53},
          '數學':{'王小明':78,'李小美':76,'陳大同':91,'林小玉':47},
          '自然':{'王小明':83,'李小美':93,'陳大同':89,'林小玉':94},
          '社會':{'王小明':70,'李小美':56,'陳大同':77,'林小玉':80}}
df1 = pd.DataFrame(scores)
scores_others = {'體育':{'王小明':90,'李小美':93,'陳大同':95,'林小玉':80},
          '家政':{'王小明':70,'李小美':80,'陳大同':75,'林小玉':90},}

df2 = pd.DataFrame(scores_others)
df_all = pd.concat([df1, df2], axis=1)
df_all

In [None]:
mydata = np.random.randn(4, 3)
df1 = pd.DataFrame(mydata, columns=list("ABC"))
df1

In [None]:
df2 = pd.DataFrame(np.random.randn(3, 3), columns=list("ABC"))
df2

In [None]:
# Stacking vertically works without trouble
df3 = pd.concat([df1, df2], axis=0)
df3

In [None]:
# Renumber the index
df3.index = range(7)
df3

In [None]:
# Combining side by side also works, but the shapes differ, so NaN appears.
df4 = pd.concat([df1, df2], axis=1)
df4
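
The concat behaviour shown above can be condensed into one small, deterministic sketch. The frames top and bottom are made-up data; ignore_index and join are regular pd.concat parameters.

```python
import pandas as pd

top = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
bottom = pd.DataFrame({'A': [5], 'B': [6]})

# Stacking vertically keeps each frame's own index, so labels repeat: 0, 1, 0.
stacked = pd.concat([top, bottom], axis=0)

# ignore_index=True renumbers the rows in one step.
renumbered = pd.concat([top, bottom], axis=0, ignore_index=True)

# Side by side, rows are aligned on index labels; label 1 has no match in `bottom`.
side_by_side = pd.concat([top, bottom], axis=1)

# join='inner' keeps only the labels present in both frames, so no NaN appears.
inner = pd.concat([top, bottom], axis=1, join='inner')

print(side_by_side)
```

Passing ignore_index=True is an alternative to assigning a new range to the index afterwards.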

# Reading DataFrame data

## Selecting by column: df[column]

In [None]:
scores = {'國文':{'王小明':65,'李小美':90,'陳大同':81,'林小玉':79},
          '英文':{'王小明':92,'李小美':72,'陳大同':85,'林小玉':53},
          '數學':{'王小明':78,'李小美':76,'陳大同':91,'林小玉':47},
          '自然':{'王小明':83,'李小美':93,'陳大同':89,'林小玉':94},
          '社會':{'王小明':70,'李小美':56,'陳大同':77,'林小玉':80}}
df = pd.DataFrame(scores)
df

In [None]:
# Read one column
print(df["自然"])
print(type(df["自然"]))
print(df["自然"].dtype)

In [None]:
# Read several columns
# df[['國文','英文','數學']]  # DataFrame
# df['國文']                  # Series
# df[['國文']]                # DataFrame (a one-item list still returns a DataFrame)
# df['王小明']  # KeyError: df['xxx'] looks up a column, and '王小明' is a row label, not a column

## Selecting by index label and column name: df.loc[]

In [None]:
scores = {'國文':{'王小明':65,'李小美':90,'陳大同':81,'林小玉':79},
          '英文':{'王小明':92,'李小美':72,'陳大同':85,'林小玉':53},
          '數學':{'王小明':78,'李小美':76,'陳大同':91,'林小玉':47},
          '自然':{'王小明':83,'李小美':93,'陳大同':89,'林小玉':94},
          '社會':{'王小明':70,'李小美':56,'陳大同':77,'林小玉':80}}
df = pd.DataFrame(scores)

In [None]:
print(df.loc["林小玉", "社會"]) # a single value (a NumPy integer, not a plain int)
print(type(df.loc["林小玉", "社會"]))

In [None]:
print(df.loc["王小明", ["國文","社會"]]) # Series
print(type(df.loc["王小明", ["國文","社會"]]))

In [None]:
df.loc[["王小明", "李小美"], ["數學", "自然"]] # DataFrame

In [None]:
df.loc["王小明":"陳大同", "數學":"社會"] # DataFrame

In [None]:
df.loc["陳大同", :] # Series

In [None]:
df.loc[:"李小美", "數學":"社會"] # DataFrame

In [None]:
df.loc["李小美":, "數學":"社會"]

## Selecting by row and column position: df.iloc[]

In [None]:
df

In [None]:
df.iloc[3, 4]

In [None]:
df.iloc[0, [0, 4]]
#type(df.iloc[0, [0, 4]])

In [None]:
df.iloc[[0, 1], [2, 3]]
#type(df.iloc[[0, 1], [2, 3]])

In [None]:
df.iloc[0:3, 2:5]

In [None]:
df.iloc[2, :]

In [None]:
df.iloc[:2, 2:5]

In [None]:
df.iloc[1:, 2:5]
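
One detail worth checking when switching between the two indexers: a label slice with loc includes its end label, while a position slice with iloc excludes its end position. A minimal sketch on a two-column subset of the scores (df_demo is an illustrative name):

```python
import pandas as pd

df_demo = pd.DataFrame({'數學': [78, 76, 91, 47], '自然': [83, 93, 89, 94]},
                       index=['王小明', '李小美', '陳大同', '林小玉'])

by_label = df_demo.loc['王小明':'陳大同']   # end label included: 3 rows
by_position = df_demo.iloc[0:2]            # end position excluded: 2 rows
same_rows = df_demo.iloc[0:3]              # the positional equivalent of by_label

print(len(by_label), len(by_position))
```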

## First or last few rows

In [None]:
df.head(3)

In [None]:
df.tail(2)

## Sorting

In [None]:
df.sort_values(by="國文", ascending=False, inplace=True)
df

## Conditional selection

In [None]:
df

df["數學"] < 60 produces a boolean Series with one True/False per row (True only for 林小玉, whose 數學 score is 47). Passing that Series to df[...] keeps only the rows marked True.

In [None]:
# Select rows by a condition on one column
df[df["數學"] < 60]
#boolean indexing: Using boolean array (series) to index select rows
# c1 = pd.Series([False, True, True, False])
# c1 = df[[False, True, True, False]]  # skips rows 0 and 3, keeps rows 1 and 2; returns a DataFrame
# df["國文"] >= 80
# c2 = df['國文'] >= 80                 #series
# c3 = df[df['國文'] >= 80]            # dataframe
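
Boolean masks can also be combined. The sketch below assumes a two-column subset of the scores; note that pandas uses &, | and ~ (not and, or, not) and that each condition needs its own parentheses.

```python
import pandas as pd

df_demo = pd.DataFrame({'國文': [65, 90, 81, 79], '數學': [78, 76, 91, 47]},
                       index=['王小明', '李小美', '陳大同', '林小玉'])

# AND: both conditions must hold.
both = df_demo[(df_demo['國文'] >= 80) & (df_demo['數學'] >= 80)]

# OR: either condition may hold.
either = df_demo[(df_demo['國文'] < 70) | (df_demo['數學'] < 60)]

# NOT: invert a mask with ~.
not_failing = df_demo[~(df_demo['數學'] < 60)]

# loc accepts a mask for the rows and a column name at the same time.
failing_chinese = df_demo.loc[df_demo['數學'] < 60, '國文']

print(both)
```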

## Getting the underlying values

In [None]:
d1 = df.values              #Numpy ndarray
d1

# Modifying DataFrame data

In [None]:
scores = {'國文':{'王小明':65,'李小美':90,'陳大同':81,'林小玉':79},
          '英文':{'王小明':92,'李小美':72,'陳大同':85,'林小玉':53},
          '數學':{'王小明':78,'李小美':76,'陳大同':91,'林小玉':47},
          '自然':{'王小明':83,'李小美':93,'陳大同':89,'林小玉':94},
          '社會':{'王小明':70,'李小美':56,'陳大同':77,'林小玉':80}}
df = pd.DataFrame(scores)

In [None]:
dfcopy = df.copy()  # keep a copy of the original data

In [None]:
df.loc["王小明", "數學"] = 95  # update a single cell
df

In [None]:
df.loc["王小明", :] = 80  # set every column of this row to 80
df

In [None]:
df.info()

In [None]:
import numpy as np
df["國文"] = df["國文"].astype(np.int8)
df.info()

In [None]:
df.rename(columns = {"英文":"外語"}, inplace=True)
df

In [None]:
df["總分"] = df["國文"] + df["外語"] + df["數學"] + df["自然"] + df["社會"]
df

In [None]:
df = dfcopy.copy()  # restore the original data (copy again so the backup itself stays untouched)
df
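
When backing up and restoring a DataFrame, plain assignment does not copy anything; it only gives the same object a second name. A minimal sketch with made-up variable names:

```python
import pandas as pd

original = pd.DataFrame({'數學': [78, 76]}, index=['王小明', '李小美'])

alias = original           # the same object under a second name
backup = original.copy()   # an independent copy of the data

original.loc['王小明', '數學'] = 95

print(alias is original)               # the alias is the very same object
print(alias.loc['王小明', '數學'])      # so it sees the change
print(backup.loc['王小明', '數學'])     # the copy does not

restored = backup.copy()   # restore from the backup without giving it away
```

Restoring with backup.copy() rather than plain assignment means later in-place edits cannot leak into the backup.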

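# Adding DataFrame data

In [None]:
df = dfcopy.copy()  # restore the original data
print(df)
# Add a row in place; the list needs one value per existing column (five here)
df.loc['陳彼得'] = [30, 35, 40, 45, 50]
# A new row that also carries an extra column ('體育'); existing rows get NaN there after concat
new_row = pd.DataFrame({'國文':{'張阿華':30},
                         '英文':{'張阿華':35},
                         '數學':{'張阿華':40},
                         '自然':{'張阿華':45},
                         '社會':{'張阿華':50},
                         '體育':{'張阿華':70},
                         })
df1 = pd.concat([df, new_row], ignore_index=False)  # ignore_index=True would discard the labels and renumber with integers
print(df1)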

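# Deleting DataFrame data

In [None]:
dfcopy = df.copy()  # keep a copy of the original data
# The "Delete" in CRUD (Create, Read, Update, Delete)
df.drop("王小明", axis=0)  # drop a row; returns a new DataFrame and leaves df unchanged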

In [None]:
df = dfcopy
df.drop("數學", axis=1)

In [None]:
dfcopy=df.copy()
dfcopy.drop(["數學", "自然"], axis=1)

In [None]:
df = dfcopy
df.drop(df.index[1:4])

In [None]:
dfcopy=df.copy()
dfcopy.drop(dfcopy.columns[1:4], axis=1)
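
Because drop returns a new DataFrame by default, the original is only changed if the result is assigned back. A small sketch, using the index= and columns= keywords as an alternative to axis (df_demo is illustrative data):

```python
import pandas as pd

df_demo = pd.DataFrame({'國文': [65, 90], '數學': [78, 76]},
                       index=['王小明', '李小美'])

without_row = df_demo.drop(index='王小明')    # same as drop('王小明', axis=0)
without_col = df_demo.drop(columns='數學')    # same as drop('數學', axis=1)

print(df_demo.shape)   # the original still has every row and column
```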

# Reading and writing external files

In [None]:
df

In [None]:
# Write to a CSV file
from pathlib import Path
target_csv_path = Path.cwd() / '..' / 'files' / 'csv' / 'scores.csv'
df.to_csv(target_csv_path, encoding='utf-8-sig')

In [None]:
# Read a CSV file
from pathlib import Path
source_csv_path = Path.cwd() / '..' / 'files' / 'csv' / 'covid19.csv'
df = pd.read_csv(source_csv_path)
df

In [None]:
# Read a JSON file
from pathlib import Path
source_json_path = Path.cwd() / '..' / 'files' / 'json' / 'covid19.json'
df = pd.read_json(source_json_path)
df

In [None]:
# Read an Excel spreadsheet (.xlsx files need an Excel engine such as openpyxl installed)
from pathlib import Path
source_excel_path = Path.cwd() / '..' / 'files' / 'xls' / 'covid19.xlsx'
df = pd.read_excel(source_excel_path)
df
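
The data files used above are not included here, so the following sketch writes to a temporary directory instead. It illustrates one point about the CSV round trip: to_csv writes the row index as the first column, and read_csv only turns it back into an index when index_col=0 is given. The file name scores_demo.csv is hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

df_demo = pd.DataFrame({'國文': [65, 90], '數學': [78, 76]},
                       index=['王小明', '李小美'])

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / 'scores_demo.csv'   # hypothetical file name
    df_demo.to_csv(csv_path, encoding='utf-8-sig')

    # Without index_col the saved index comes back as an ordinary first column.
    plain = pd.read_csv(csv_path, encoding='utf-8-sig')
    # index_col=0 turns the first column back into the row index.
    restored = pd.read_csv(csv_path, encoding='utf-8-sig', index_col=0)

print(plain.shape, restored.shape)
```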

# Data cleaning (Customer.csv)

## Finding missing values: isnull()


In [None]:
import pandas as pd
from pathlib import Path

source_csv_path = Path.cwd() / '..' / 'files' / 'csv' / 'customer.csv'
# Load the data
df = pd.read_csv(source_csv_path)
df

In [None]:
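# Handling missing values
print('Missing values per cell:')
print(df.isnull())
print(df.isnull().sum())            # missing count per column
print(df.isnull().any(axis=1))      # True for each row that contains a missing value
print('Rows with missing values:', df.isnull().any(axis=1).sum())
print(df.isnull().any(axis=0))      # True for each column that contains a missing value
print('Columns with missing values:', df.isnull().any(axis=0).sum())
print('Rows where age is missing:')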
print(df[df['age'].isnull()])
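
Since customer.csv is not reproduced here, the counting logic can be checked on a tiny made-up frame. The column names follow the ones used above; the values are invented for illustration.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'id': [1, 2, 3],
                        'age': [25, np.nan, 40],
                        'gender': ['Male', None, None]})

missing_per_column = df_demo.isnull().sum()             # one count per column
rows_with_missing = df_demo.isnull().any(axis=1).sum()  # rows with at least one gap
columns_with_missing = df_demo.isnull().any(axis=0).sum()

print(missing_per_column)
print(rows_with_missing, columns_with_missing)
```

Here axis=1 collapses across the columns of each row, while axis=0 collapses down the rows of each column.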

## Filling missing values: fillna()

In [None]:
df

In [None]:
# Fill missing age values with 0
df_sample = df.copy()
df_sample['age'] = df_sample['age'].fillna(value=0)
df_sample.head()

In [None]:
# Fill missing age values with the mean age
df_sample = df.copy()
df_sample['age'] = df_sample['age'].fillna(
                    value=df_sample['age'].mean())
df_sample.head()

In [None]:
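# Fill downward from the previous value (ffill) or upward from the next value (bfill)
# fillna(method='ffill') is deprecated in recent pandas releases; call ffill() directly
df_sample['gender'] = df_sample['gender'].ffill()
df_sample['area'] = df_sample['area'].ffill()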
df_sample.head(10)

In [None]:
# Drop incomplete records
df_sample = df.copy()
df_no_na= df_sample.dropna()
df_no_na
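
A small sketch of how forward and backward filling treat the ends of a column, plus a narrower dropna. The data is invented; subset= is a regular dropna parameter that limits the check to the listed columns.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 'Male', np.nan, 'Female', np.nan])

forward = s.ffill()    # a leading gap has no previous value, so it stays missing
backward = s.bfill()   # a trailing gap has no next value, so it stays missing

df_demo = pd.DataFrame({'age': [25, np.nan, 40], 'area': ['A', 'B', None]})

# Drop a row only when 'age' is missing; the gap in 'area' is tolerated.
age_known = df_demo.dropna(subset=['age'])

print(forward.tolist())
print(backward.tolist())
```

Whether filling from a neighbouring record is meaningful depends on the data; for unordered customer records it is a convenience rather than an estimate.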

## Removing duplicates

In [None]:
df_sample = df.copy()
df_sample

In [None]:
# Remove duplicate records (same id), keeping the first occurrence
df_sample.drop_duplicates(subset='id', keep='first', inplace=True)
df_sample
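
To see which rows drop_duplicates would remove before removing them, duplicated() returns the corresponding mask. The sketch below uses made-up ids and names and compares keep='first' with keep='last'.

```python
import pandas as pd

df_demo = pd.DataFrame({'id': [1, 2, 2, 3], 'name': ['a', 'b', 'b2', 'c']})

# True for the second and later occurrences of an id.
dup_mask = df_demo.duplicated(subset='id')

first_kept = df_demo.drop_duplicates(subset='id', keep='first')
last_kept = df_demo.drop_duplicates(subset='id', keep='last')

print(dup_mask.tolist())
print(first_kept['name'].tolist(), last_kept['name'].tolist())
```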

## Replacing data content


In [None]:
# Remove whitespace from a column
df_sample = df.copy()
df_sample['job'] = df_sample['job'].str.strip()           # leading and trailing whitespace
df_sample['job'] = df_sample['job'].str.replace(' ', '')  # any remaining spaces inside the text
df_sample

## Adjusting data types

In [None]:
# Convert the value type (missing ages are filled first because NaN cannot be cast to an integer)
df_sample = df.copy()
print(df_sample.info())
df_sample['age'] = df_sample['age'].fillna(value=0)
df_sample['age'] = df_sample['age'].astype('int8')
print(df_sample.info())
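
Filling missing ages with 0 makes the integer cast possible but also changes statistics such as the mean. One alternative, sketched here on made-up values, is the nullable 'Int64' extension dtype (capital I), which stores integers and keeps the gap as a missing value.

```python
import numpy as np
import pandas as pd

age = pd.Series([25.0, np.nan, 40.0])

# Casting to a plain NumPy integer type fails because NaN has no integer form.
try:
    age.astype('int8')
    cast_failed = False
except (ValueError, TypeError):
    cast_failed = True

# The nullable dtype keeps the missing entry instead of requiring a fill value.
age_nullable = age.astype('Int64')

print(cast_failed)
print(age_nullable)
print(age_nullable.mean())   # the missing entry is skipped, not counted as 0
```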

# Filtering data

In [None]:
# Select the female customers
df_sample = df.copy()
df_female = df_sample[(df_sample['gender'] == 'Female')]
df_female

In [None]:
# Select male customers older than 50
# print(df_sample[(df_sample['gender'] == 'Male') & (df_sample['age'] > 50)])

# Select customers living in 新北市三重區 or 基隆市中正區
df_sample = df.copy()
print(df_sample[(df_sample['area'] == '新北市三重區') | (df_sample['area'] == '基隆市中正區')])

In [None]:
df

In [None]:
df['area'].unique()

In [None]:
df.value_counts('area')

In [None]:
df_sample = df.copy()
def get_last_name(full_name):
    return full_name[0]
df_sample['last_name'] = df_sample['name'].apply(get_last_name)
df_sample
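
Two possible shortcuts for the cells above, shown on invented rows: isin() replaces a chain of == comparisons joined with |, and the .str accessor takes the first character without a helper function while passing missing names through as missing.

```python
import pandas as pd

df_demo = pd.DataFrame({'name': ['王小明', '李小美', None],
                        'area': ['新北市三重區', '台北市大安區', '基隆市中正區']})

# isin() checks membership in a list of accepted values.
wanted = ['新北市三重區', '基隆市中正區']
in_areas = df_demo[df_demo['area'].isin(wanted)]

# .str[0] takes the first character of each name and leaves missing names missing.
first_char = df_demo['name'].str[0]

print(in_areas)
print(first_char.tolist())
```

An apply-based helper that indexes full_name[0] would raise a TypeError on a missing name, so the accessor is the more forgiving choice when the column has gaps.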

# Grouped computation: groupby, agg

In [None]:
# Average age of male and female customers
df_sample = df.copy()
print('mean of all ages:', df_sample['age'].mean())
print('mean of age by gender:\n', df_sample.groupby('gender')['age'].mean())

In [None]:
# Number of customers living in each area
df_sample.groupby('area')['id'].count()

In [None]:
# Summary statistics with agg(): mean, oldest, and youngest age by gender
df_sample.groupby('gender')['age'].agg(['mean', 'max', 'min'])
# df_sample.groupby('gender')['age'].mean()
# df_sample.groupby('gender')['age'].max()
# df_sample.groupby('gender')['age'].min()
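
agg() also accepts named aggregations, which let one call summarise several columns and give each result column a readable name. The sketch assumes made-up customer rows; the output names mean_age, oldest and customers are illustrative.

```python
import pandas as pd

df_demo = pd.DataFrame({'gender': ['Male', 'Female', 'Male', 'Female'],
                        'age': [30, 40, 50, 20],
                        'id': [1, 2, 3, 4]})

# Each keyword is an output column: name=(source column, aggregation).
summary = df_demo.groupby('gender').agg(
    mean_age=('age', 'mean'),
    oldest=('age', 'max'),
    customers=('id', 'count'),
)

print(summary)
```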



# Plotting (annual sales)

## Configuring Chinese text in Matplotlib
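
One possible way to enable Chinese text, sketched under the assumption that at least one of the listed fonts is installed. The candidate names are examples rather than a definitive list; fontManager.ttflist is Matplotlib's registry of the fonts it has found.

```python
from matplotlib import pyplot as plt
from matplotlib.font_manager import fontManager

# Candidate font names are assumptions; adjust them to what your system provides.
candidates = ['Heiti TC', 'Microsoft JhengHei', 'Noto Sans CJK TC', 'PingFang TC']

# Names of every font Matplotlib has registered.
installed = {font.name for font in fontManager.ttflist}

# Pick the first candidate that is actually available, if any.
chosen = next((name for name in candidates if name in installed), None)

if chosen is not None:
    # Put the chosen font first and keep the existing fallbacks after it.
    plt.rcParams['font.sans-serif'] = [chosen] + list(plt.rcParams['font.sans-serif'])

# Draw minus signs with the ASCII hyphen, which CJK fonts reliably contain.
plt.rcParams['axes.unicode_minus'] = False

print('Using font:', chosen)
```

If chosen is None, no candidate was found and Chinese labels will most likely render as empty boxes until a suitable font is installed.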

## Bar, horizontal bar, and stacked bar charts


In [None]:
df

In [None]:
import pandas as pd
from matplotlib import pyplot as plt
# Set a Chinese-capable font (Heiti TC ships with macOS; use another installed CJK font elsewhere)
plt.rcParams['font.sans-serif']=['Heiti TC']
df = pd.DataFrame([[250,320,300,312,280],
                   [280,300,280,290,310],
                   [220,280,250,305,250]],
                  index=['北部','中部','南部'],
                  columns=[2015,2016,2017,2018,2019])

g1 = df.plot(kind='bar', title='長條圖', figsize=[10,5])
# g2 = df.plot(kind='barh', title='橫條圖', figsize=[10,5])
# g3 = df.plot(kind='bar', stacked=True, title='堆疊圖', figsize=[10,5])

## Line charts

In [None]:
import pandas as pd
from matplotlib import pyplot as plt
# Set a Chinese-capable font (Heiti TC ships with macOS; use another installed CJK font elsewhere)
plt.rcParams['font.sans-serif']=['Heiti TC']

df = pd.DataFrame([[250,320,300,312,280],
                   [280,300,280,290,310],
                   [220,280,250,305,250]],
                  index=['北部','中部','南部'],
                  columns=[2015,2016,2017,2018,2019])

# Draw one line per region (row) on the same axes
g1 = df.iloc[0].plot(kind='line', legend=True,
                     xticks=range(2015,2020),
                     title='公司分區年度銷售表',
                     figsize=[10,5])
g1 = df.iloc[1].plot(kind='line',
                     legend=True,
                     xticks=range(2015,2020))
g1 = df.iloc[2].plot(kind='line',
                     legend=True,
                     xticks=range(2015,2020))
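
The same chart can probably be produced with a single call: DataFrame.plot draws one line per column, so transposing the table turns each region into a column. The sketch uses English labels (an assumption made here to avoid the font dependency) and the same sales figures.

```python
import pandas as pd
from matplotlib import pyplot as plt

sales = pd.DataFrame([[250, 320, 300, 312, 280],
                      [280, 300, 280, 290, 310],
                      [220, 280, 250, 305, 250]],
                     index=['North', 'Central', 'South'],   # English labels, no CJK font needed
                     columns=[2015, 2016, 2017, 2018, 2019])

# After transposing, years form the x axis and each region becomes one line.
ax = sales.T.plot(kind='line', xticks=range(2015, 2020),
                  title='Annual sales by region', figsize=(10, 5))
```

The returned Axes object (ax) can be used for further tweaks such as axis labels.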

## Pie charts

In [None]:
import pandas as pd
from matplotlib import pyplot as plt
# Set a Chinese-capable font (Heiti TC ships with macOS; use another installed CJK font elsewhere)
plt.rcParams['font.sans-serif']=['Heiti TC']

df = pd.DataFrame([[250,320,300,312,280],
                   [280,300,280,290,310],
                   [220,280,250,305,250]],
                  index=['北部','中部','南部'],
                  columns=[2015,2016,2017,2018,2019])
df.plot(kind='pie', subplots=True, figsize=[20,20])