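## Introduction
This chapter mainly covers the `apply` family of functions. The goal is to process data as whole vectors wherever possible, reduce the number of explicit loops, and keep the work running at the speed of the underlying C functions.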
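
To illustrate the point about replacing loops with vector operations, here is a minimal sketch. The array `values` is hypothetical sample data, not something from the original notebook.

```python
import numpy as np

values = np.arange(5)  # hypothetical sample data: 0, 1, 2, 3, 4

# Explicit Python loop: one interpreter-level iteration per element.
squares_loop = np.array([v * v for v in values])

# Vectorized form: a single expression; the loop runs inside NumPy's compiled code.
squares_vec = values * values

print(np.array_equal(squares_loop, squares_vec))
```

Both forms produce the same numbers; the vectorized one simply moves the iteration out of Python.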

In [1]:
import numpy as np
import matplotlib as mp
import pandas as pd
%matplotlib inline
import sklearn
import os
import sys

## Categories in Pandas
These behave like factors in R; see [this Stack Overflow question](https://stackoverflow.com/questions/15124439/closest-equivalent-of-a-factor-variable-in-python-pandas).

In [3]:
s = pd.Series(["a","b","c","a"], dtype="category")
s

0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [a, b, c]
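
To make the comparison with R factors more concrete, the following sketch inspects the same Series through the `.cat` accessor. The R analogies in the comments are loose comparisons, not exact equivalences.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# The distinct labels, roughly comparable to levels() of an R factor.
print(list(s.cat.categories))

# The integer code stored for each element (its position in the categories).
print(s.cat.codes.tolist())

# Frequency of each category, roughly comparable to table() in R.
print(s.value_counts().sort_index())
```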

## 5.1 Splitting a vector into groups
For the full usage of categorical types, see the [reference documentation](http://pandas-docs.github.io/pandas-docs-travis/categorical.html).

In [19]:
# Assign 10 people to train (code 0) or test (code 1), each with probability 0.5
splitter = np.random.choice([0,1], 10, p=[0.5,0.5])
splitter

array([1, 1, 1, 1, 1, 0, 1, 0, 0, 0])

In [20]:
pd.Series(pd.Categorical.from_codes(splitter, categories=["train", "test"]))

0     test
1     test
2     test
3     test
4     test
5    train
6     test
7    train
8    train
9    train
dtype: category
Categories (2, object): [train, test]
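
Once the labels exist, the values themselves can be split by label. The sketch below is one possible way to do it with `groupby`; the fixed seed and the `people` Series of IDs are assumptions added for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # assumed seed, only so the sketch is repeatable
codes = rng.choice([0, 1], size=10, p=[0.5, 0.5])
labels = pd.Series(pd.Categorical.from_codes(codes, categories=["train", "test"]))

people = pd.Series(np.arange(10))  # hypothetical IDs for the 10 people

# Collect the IDs that fall under each label.
# observed=True skips a category if no element happened to receive it.
groups = {name: grp.tolist() for name, grp in people.groupby(labels, observed=True)}
print(groups)
```

Every ID ends up in exactly one of the groups, which is the same idea as `split()` in R.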

## 5.2 Applying a function to each element

In [29]:
df = pd.DataFrame(np.arange(12).reshape(3,4))
df

   0  1   2   3
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11


In [26]:
df.apply(np.sqrt) 

          0         1         2         3
0  0.000000  1.000000  1.414214  1.732051
1  2.000000  2.236068  2.449490  2.645751
2  2.828427  3.000000  3.162278  3.316625
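
`np.sqrt` is a NumPy ufunc, so it can also be called on the DataFrame directly. For a plain Python function that works on one scalar at a time, one option is to map it over each column. The function `parity` below is a hypothetical example, not part of the original notebook.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4))

# NumPy ufuncs accept a DataFrame directly and keep its index and columns.
roots = np.sqrt(df)

def parity(x):
    """Hypothetical scalar function with no ready-made vectorized form."""
    return "even" if x % 2 == 0 else "odd"

# Apply the scalar function to every element, one column at a time.
labels = df.apply(lambda col: col.map(parity))
print(labels)
```

The column-wise `map` calls `parity` once per element in Python, so it is slower than a ufunc and best kept for cases where no vectorized form exists.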


## 5.3 Applying a function to each row

In [37]:
def mysum(k):
    return np.sum(k)-1
df.apply(mysum,axis=1)

0     5
1    21
2    37
dtype: int64

## 5.4 Applying a function to each column

In [36]:
# The result is a Series with one value per column
df.apply(mysum,axis=0)

0    11
1    14
2    17
3    20
dtype: int64
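
For a simple reduction like `mysum`, the row-wise and column-wise `apply` calls in 5.3 and 5.4 can also be written with built-in reductions. This is a sketch of that alternative, not a replacement for `apply` in general.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4))

def mysum(k):
    return np.sum(k) - 1

# Built-in reductions give the same numbers as apply(mysum, ...) here,
# without calling a Python function once per row or column.
row_result = df.sum(axis=1) - 1   # one value per row
col_result = df.sum(axis=0) - 1   # one value per column

print(row_result.tolist(), df.apply(mysum, axis=1).tolist())
print(col_result.tolist(), df.apply(mysum, axis=0).tolist())
```

`apply` remains useful when the per-row or per-column logic cannot be expressed as a built-in reduction.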

## 5.5 Applying a function to groups of data
See [pandas.pivot_table](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.pivot_table.html).

In [41]:
# Read the sample table (shown below) from the clipboard
a = pd.read_clipboard()

     A    B      C  D
0  foo  one  small  1
1  foo  one  large  2
2  foo  one  large  2
3  foo  two  small  3
4  foo  two  small  3
5  bar  one  large  4
6  bar  one  small  5
7  bar  two  small  6
8  bar  two  large  7
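
Because `pd.read_clipboard()` depends on whatever is currently on the clipboard, the table cannot be reproduced from the notebook alone. A minimal sketch that rebuilds the same nine rows by hand, assuming strings for columns A, B, C and integers for D:

```python
import pandas as pd

# Same nine rows as the table above, typed out by hand.
a = pd.DataFrame({
    "A": ["foo"] * 5 + ["bar"] * 4,
    "B": ["one", "one", "one", "two", "two", "one", "one", "two", "two"],
    "C": ["small", "large", "large", "small", "small",
          "large", "small", "small", "large"],
    "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
})
print(a)
```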


In [42]:
# Use a as the data, D as the values, A and B as the row groups, and C as the column groups
pd.pivot_table(a, values='D', index=['A', 'B'],columns=['C'], aggfunc=mysum)

C        large  small
A   B
bar one    3.0    4.0
    two    6.0    5.0
foo one    3.0    0.0
    two    NaN    5.0


In [43]:
pd.pivot_table(a, values='D', columns=['C'], aggfunc=mysum)

C
large    14
small    17
Name: D, dtype: int64

In [46]:
pd.pivot_table(a, values=['D'], index=['A'],columns=['C'], aggfunc=np.sum)

        D
C   large small
A
bar    11    11
foo     4     7
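
The same grouped sum can also be expressed with `groupby` followed by `unstack`, which may be easier to read when coming from R's `tapply` or `aggregate`. The sketch below rebuilds the sample table by hand (an assumption, since the notebook loads it from the clipboard) and compares the two approaches.

```python
import pandas as pd

# Sample table rebuilt by hand; values copied from the table shown earlier.
a = pd.DataFrame({
    "A": ["foo"] * 5 + ["bar"] * 4,
    "B": ["one", "one", "one", "two", "two", "one", "one", "two", "two"],
    "C": ["small", "large", "large", "small", "small",
          "large", "small", "small", "large"],
    "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
})

# Sum D within each (A, C) group, then move C from the row index to the columns.
by_group = a.groupby(["A", "C"])["D"].sum().unstack("C")
print(by_group)

# The same table built with pivot_table, for comparison.
pivoted = pd.pivot_table(a, values="D", index="A", columns="C", aggfunc="sum")
print((by_group.values == pivoted.values).all())
```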


## 5.6 Applying a function to parallel vectors
I have not found a good example yet.
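
As a tentative candidate, assuming "parallel vectors" means pairing elements position by position the way `mapply()` does in R, a greatest-common-divisor calculation could serve. The input lists `xs` and `ys` are hypothetical sample data.

```python
import math
import numpy as np

xs = [12, 18, 100]   # hypothetical sample inputs
ys = [8, 27, 75]

# Pair the elements up and call the scalar function once per pair,
# similar in spirit to mapply() in R.
pairwise = [math.gcd(x, y) for x, y in zip(xs, ys)]

# When a NumPy ufunc exists for the operation, it handles both arrays at once.
vectorized = np.gcd(xs, ys)

print(pairwise)
print(vectorized.tolist())
```

The `zip` form works with any scalar function; the ufunc form is preferable when one is available, since the pairing then happens in compiled code.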

## 5.7 Further comparison: R vs Python
[Reference documentation](https://pandas.pydata.org/pandas-docs/stable/comparison_with_r.html)