## Study Summary
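1. DataFrame alignment
2. Misaligned DataFrame columns produce NaN results
3. DataFrame broadcasting
4. DataFrame boolean masking
5. Sorting a DataFrame
6. Applying a custom row-wise or column-wise function with apply(), e.g. taking the square root and multiplying by 10 with a lambda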

In [2]:
import pandas as pd

### 1. DataFrame alignment

In [8]:
x = pd.DataFrame([[1,2,3]], columns=['A','B','C'])
y = pd.DataFrame([[1,1,1]], columns=['A','B','C'])
print(x+y)

   A  B  C
0  2  3  4

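Alignment works on labels, not positions. The sketch below uses a made-up row label `r1` and deliberately reorders the columns of `y` to show that values are still paired by name.

```python
import pandas as pd

# Same column labels in a different order; 'r1' is a hypothetical row label.
x = pd.DataFrame([[1, 2, 3]], columns=['A', 'B', 'C'], index=['r1'])
y = pd.DataFrame([[10, 20, 30]], columns=['C', 'B', 'A'], index=['r1'])

# Values are matched by label, so A pairs with A even though positions differ.
result = x + y
print(result)
```

Here `A` is 1 + 30, not 1 + 10, because `y` stores its `A` value in the last position.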

### 2. Misaligned DataFrame columns produce NaN results

In [9]:
x = pd.DataFrame([[1,2,3]], columns=['A','B','C'])
y = pd.DataFrame([[1,1,1]], columns=['B','C','E'])
print(x+y) # only B and C align, so the other columns become NaN

    A  B  C   E
0 NaN  3  4 NaN

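If the NaN values are not wanted, one option is the method form `add()` with `fill_value`, which treats a label missing from one side as the given value. A minimal sketch using the same two frames:

```python
import pandas as pd

x = pd.DataFrame([[1, 2, 3]], columns=['A', 'B', 'C'])
y = pd.DataFrame([[1, 1, 1]], columns=['B', 'C', 'E'])

# fill_value=0 substitutes 0 for a label that is missing from ONE side.
# A cell missing from both sides would still be NaN.
filled = x.add(y, fill_value=0)
print(filled)
```

Whether 0 is a sensible stand-in depends on the data; for sums it usually is, for ratios usually not.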

### 3. DataFrame broadcasting

In [26]:
x = pd.DataFrame([[1,2,3]], columns=['A','B','C'])
print(x+5)

   A  B  C
0  6  7  8


In [27]:
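# DataFrame + scalar broadcasts; DataFrame + DataFrame aligns on labels instead
x = pd.DataFrame([[1,2,3]])
print(x+5) # works: 5 is added to every cell

print(x + pd.DataFrame([1])) # meant to add 1 to every cell, but only the aligned cell (row 0, column 0) is added; the rest become NaN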

   0  1  2
0  6  7  8
   0   1   2
0  2 NaN NaN

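Besides scalars, a Series can also be combined with a DataFrame: its index is matched against the column labels by default, or against the row index when `axis=0` is passed to the method form. A small sketch with made-up offsets:

```python
import pandas as pd

x = pd.DataFrame([[1, 2, 3], [4, 5, 6]])

# The Series index (0, 1, 2) is matched against the column labels,
# so the same offsets are applied to every row.
row_offsets = pd.Series([10, 20, 30])
by_columns = x + row_offsets

# With axis=0 the Series index (0, 1) is matched against the row index instead.
col_offsets = pd.Series([100, 200])
by_rows = x.add(col_offsets, axis=0)

print(by_columns)
print(by_rows)
```

This is still label alignment, so a Series whose index does not match the labels would produce NaN just like in section 2.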

In [21]:
# NumPy arrays broadcast by shape, so a smaller array also works
import numpy as np
a = np.array([[1,2,3]])
print(a+1)

print(a+np.array([5])) # adds 5 to every element

[[2 3 4]]
[[6 7 8]]

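NumPy's shape-based rule can be illustrated with a row vector and a column vector: every dimension of size 1 is stretched to match the other operand. The arrays below are illustrative only.

```python
import numpy as np

row = np.array([[1, 2, 3]])      # shape (1, 3)
col = np.array([[10], [20]])     # shape (2, 1)

# Size-1 dimensions are stretched, so (1, 3) + (2, 1) gives shape (2, 3).
grid = row + col
print(grid.shape)
print(grid)
```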

### 4. DataFrame boolean masking

In [32]:
x = pd.DataFrame([[1,2,3],[7,8,9]])
print(x > 1)

print(x[x>5]) # cells that fail the mask become NaN, which turns the integer columns into floats

       0     1     2
0  False  True  True
1   True  True  True
     0    1    2
0  NaN  NaN  NaN
1  7.0  8.0  9.0


In [42]:
y = pd.DataFrame([[1,2,3],[7,8,9]], columns = ['A','B','C'])
print(y[['A']] > 2)

print(y[y['A'] > 2])

       A
0  False
1   True
   A  B  C
1  7  8  9

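Masks can be combined, and `where()` offers a replacement value other than NaN. A short sketch, assuming a third row is added to the frame for illustration:

```python
import pandas as pd

y = pd.DataFrame([[1, 2, 3], [7, 8, 9], [4, 5, 6]], columns=['A', 'B', 'C'])

# Each condition needs its own parentheses; use & (and), | (or), ~ (not).
both = y[(y['A'] > 2) & (y['C'] < 9)]

# where() keeps the shape and replaces non-matching cells with a chosen value.
clipped = y.where(y > 5, other=0)

print(both)
print(clipped)
```

Python's `and` / `or` do not work element-wise on a Series, which is why the bitwise operators are used here.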

### 5. Sorting a DataFrame

In [45]:
z = pd.DataFrame ({
    'col1':['A','a','B','b'],
    'col2':[2,1,9,8]
})
z

  col1  col2
0    A     2
1    a     1
2    B     9
3    b     8


In [46]:
i = z.sort_values(by=['col1']) # strings sort by code point, so uppercase letters come before lowercase ones
i

  col1  col2
0    A     2
2    B     9
1    a     1
3    b     8


In [51]:
i = z.sort_values(by=['col2'], ascending = False) # i = z.sort_values(by='col2', ascending = False) also works
i

  col1  col2
2    B     9
3    b     8
0    A     2
1    a     1

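Sorting `col1` puts all uppercase letters ahead of lowercase ones. If a case-insensitive order is wanted, one possibility is the `key` argument of `sort_values` (available in pandas 1.1 and later). A sketch under that assumption:

```python
import pandas as pd

z = pd.DataFrame({'col1': ['A', 'a', 'B', 'b'], 'col2': [2, 1, 9, 8]})

# key receives the whole column as a Series and must return a Series
# of the same length; here the comparison is done on lowercased text.
# kind='mergesort' is a stable sort, so equal keys keep their original order.
case_insensitive = z.sort_values(
    by='col1',
    key=lambda s: s.str.lower(),
    kind='mergesort',
)
print(case_insensitive)
```

The stored values are unchanged; only the ordering uses the lowercased text.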

### 6. Applying a custom row-wise or column-wise function with apply(), e.g. taking the square root and multiplying by 10 with a lambda

In [52]:
x = pd.DataFrame([[1,2,3],[7,8,9]])
x.apply(lambda col: col**0.5 * 10) # the lambda receives one column at a time; renamed so it no longer shadows x

           0          1          2
0  10.000000  14.142136  17.320508
1  26.457513  28.284271  30.000000

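For a purely arithmetic transform like this one, `apply()` is not strictly required: the same expression can be written on the whole frame. The sketch below compares the two forms and is only meant to show that they agree.

```python
import numpy as np
import pandas as pd

x = pd.DataFrame([[1, 2, 3], [7, 8, 9]])

via_apply = x.apply(lambda col: col ** 0.5 * 10)   # called once per column
vectorized = np.sqrt(x) * 10                       # one expression on the whole frame

print(np.allclose(via_apply, vectorized))
```

`apply()` becomes more useful when the function cannot be expressed as a single array expression.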

### 6-1. apply() combined with sum

In [55]:
p = pd.DataFrame([[1,2,3],[7,8,9]])
p.sum(axis=1)

0     6
1    24
dtype: int64

In [56]:
p.apply(sum, axis = 1)

0     6
1    24
dtype: int64

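The `axis` argument decides what the function receives: a column for `axis=0` (the default) or a row for `axis=1`. The helper `value_range` below is a hypothetical example function, not part of pandas.

```python
import pandas as pd

p = pd.DataFrame([[1, 2, 3], [7, 8, 9]])

def value_range(values):
    """Hypothetical helper: spread between the largest and smallest value."""
    return values.max() - values.min()

per_column = p.apply(value_range)           # axis=0: one result per column
per_row = p.apply(value_range, axis=1)      # axis=1: one result per row

print(per_column)
print(per_row)
```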
## Homework
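1. Using the DataFrame given in the exercise, complete the following:
   - Compute the mean age for each kind of animal
   - Sort the data by the age column in ascending order, then by the visitors column in descending order
   - Replace the yes and no strings in the priority column with the booleans True and False

2. Given a DataFrame with two columns, subtract from every number
   - the mean of its column
   - the mean of its row (a single record)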

### 1. Using the DataFrame given in the exercise, complete the following:

In [None]:
import numpy as np
import pandas as pd

In [72]:
data = {
    'animal':['cat','cat','snake','dog','dog','cat','snake','cat','dog','dog'],
    'age':[2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
    'visitors':[1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
    'priority':['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']
}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(data, index=labels) # build a DataFrame from the dict, using labels as the row index
df

  animal  age  visitors priority
a    cat  2.5         1      yes
b    cat  3.0         3      yes
c  snake  0.5         2       no
d    dog  NaN         3      yes
e    dog  5.0         2       no
f    cat  2.0         3       no
g  snake  4.5         1       no
h    cat  NaN         1      yes
i    dog  7.0         2       no
j    dog  3.0         1       no


In [75]:
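# Compute the mean age for each kind of animal

animal_type = df.groupby("animal")
animal_type.mean(numeric_only=True) # numeric_only=True skips the string column priority, which pandas 2.x refuses to average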

        age  visitors
animal
cat     2.5       2.0
dog     5.0       2.0
snake   2.5       1.5

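Since the exercise only asks for the mean age, selecting the column before aggregating is a slightly tighter alternative. A sketch using just the two relevant columns of the exercise data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
    'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
})

# Selecting the column first returns a Series holding only the age means.
# Missing ages (NaN) are skipped by default.
mean_age = df.groupby('animal')['age'].mean()
print(mean_age)
```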

In [77]:
# Sort by age, then by visitors (first attempt: both keys ascending by default, so visitors is not yet descending)

df.sort_values(by = ['age','visitors'])

  animal  age  visitors priority
c  snake  0.5         2       no
f    cat  2.0         3       no
a    cat  2.5         1      yes
j    dog  3.0         1       no
b    cat  3.0         3      yes
g  snake  4.5         1       no
e    dog  5.0         2       no
i    dog  7.0         2       no
h    cat  NaN         1      yes
d    dog  NaN         3      yes


In [79]:
# Continuing from the previous cell: age ascending, then visitors descending

df.sort_values(by = ['age','visitors'], ascending = [True, False])

  animal  age  visitors priority
c  snake  0.5         2       no
f    cat  2.0         3       no
a    cat  2.5         1      yes
b    cat  3.0         3      yes
j    dog  3.0         1       no
g  snake  4.5         1       no
e    dog  5.0         2       no
i    dog  7.0         2       no
d    dog  NaN         3      yes
h    cat  NaN         1      yes

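Rows with a missing age land at the bottom in both results because `sort_values` defaults to `na_position='last'`. The sketch below, on a four-row subset chosen for illustration, moves them to the top instead.

```python
import numpy as np
import pandas as pd

# Small illustrative subset of the exercise data.
df = pd.DataFrame(
    {'age': [3.0, np.nan, 0.5, 3.0], 'visitors': [1, 3, 2, 3]},
    index=['j', 'd', 'c', 'b'],
)

# na_position='first' places rows with a missing age before all others.
nan_first = df.sort_values(
    by=['age', 'visitors'],
    ascending=[True, False],
    na_position='first',
)
print(nan_first)
```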

In [83]:
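# Replace the yes and no strings in the priority column with the booleans True and False

df['priority'].replace(['yes','no'], [True, False]) # returns a new Series; df is unchanged unless the result is assigned back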

a     True
b     True
c    False
d     True
e    False
f    False
g    False
h     True
i    False
j    False
Name: priority, dtype: bool

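An alternative that some readers may find more explicit is `map()` with a dict, assigned back to the column so the change persists. A sketch on a three-row stand-in for the exercise data:

```python
import pandas as pd

# Three-row stand-in for the priority column.
df = pd.DataFrame({'priority': ['yes', 'no', 'yes']}, index=['a', 'b', 'c'])

# map() looks each value up in the dict; values absent from the dict become NaN.
# Assigning back to the column makes the change stick.
df['priority'] = df['priority'].map({'yes': True, 'no': False})
print(df['priority'])
```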
### 2. Given a DataFrame with two columns, subtract from every number
       - the mean of its column
       - the mean of its row (a single record)

In [58]:
x = pd.DataFrame(np.random.random(size = (5,3)))
x

          0         1         2
0  0.416483  0.377249  0.156512
1  0.652221  0.448398  0.928402
2  0.200630  0.400900  0.384765
3  0.050071  0.076102  0.982886
4  0.774345  0.225799  0.594749


In [62]:
x - x.mean()

          0         1         2
0 -0.002267  0.071559 -0.452951
1  0.233471  0.142709  0.318939
2 -0.218120  0.095211 -0.224698
3 -0.368679 -0.229588  0.373423
4  0.355595 -0.079890 -0.014714


In [65]:
x.sub(x.mean(axis=1), axis=0)

          0         1         2
0  0.099735  0.060501 -0.160236
1 -0.024119 -0.227942  0.252061
2 -0.128135  0.072135  0.056000
3 -0.319615 -0.293585  0.613200
4  0.242714 -0.305832  0.063118

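The row case needs `sub(..., axis=0)` because a plain `x - x.mean(axis=1)` would match the row means against the column labels. One way to check both results is to confirm that the relevant means are numerically zero afterwards. The seed below is an arbitrary choice made only to keep the sketch repeatable.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, not from the original notebook
x = pd.DataFrame(rng.random(size=(5, 3)))

by_column = x - x.mean()                    # column means align with the column labels
by_row = x.sub(x.mean(axis=1), axis=0)      # row means are aligned with the row index

# After centering, the corresponding means should be zero up to rounding error.
print(np.allclose(by_column.mean(), 0))
print(np.allclose(by_row.mean(axis=1), 0))
```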

### Following the previous question,
     1. Which row has the smallest sum
     2. Which column has the smallest sum

In [66]:
x

          0         1         2
0  0.416483  0.377249  0.156512
1  0.652221  0.448398  0.928402
2  0.200630  0.400900  0.384765
3  0.050071  0.076102  0.982886
4  0.774345  0.225799  0.594749


In [67]:
x.sum().argmin() # column with the smallest sum (argmin, not argmax, since the question asks for the minimum)

1

In [70]:
x.sum(axis=1).argmin() # row with the smallest sum

0
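
When the frame has non-default labels, `idxmin()` returns the label while `argmin()` returns the integer position. The sketch below reuses the values shown above with hypothetical labels (`r0`..`r4`, `c0`..`c2`) to make the difference visible.

```python
import pandas as pd

# Values copied from the frame above; the labels are hypothetical.
x = pd.DataFrame(
    [[0.416483, 0.377249, 0.156512],
     [0.652221, 0.448398, 0.928402],
     [0.200630, 0.400900, 0.384765],
     [0.050071, 0.076102, 0.982886],
     [0.774345, 0.225799, 0.594749]],
    index=['r0', 'r1', 'r2', 'r3', 'r4'],
    columns=['c0', 'c1', 'c2'],
)

smallest_row = x.sum(axis=1).idxmin()   # label of the row with the smallest sum
smallest_col = x.sum().idxmin()         # label of the column with the smallest sum
row_position = x.sum(axis=1).argmin()   # integer position of that row

print(smallest_row, smallest_col, row_position)
```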