## Pandas: Additional Topics

This material was prepared with reference to [The Python Tutorial](https://docs.python.org/3.6/tutorial/index.html#the-python-tutorial)
    ([Japanese edition](https://docs.python.jp/3/tutorial/)) and [Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython](http://shop.oreilly.com/product/0636920050896.do).  

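## Hierarchical Indexing
MultiIndex is one of the key features of pandas.
By arranging index labels in several levels, a MultiIndex lets you store and process higher-dimensional data in a lower-dimensional structure such as a `pd.Series` or a `pd.DataFrame`.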

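As a minimal sketch of the idea, the example below builds a two-level index explicitly with `pd.MultiIndex.from_product` and shows that a one-dimensional Series can hold what is effectively a two-dimensional table. The level names `group` and `year` and the values are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical labels: every combination of group and year.
index = pd.MultiIndex.from_product(
    [["a", "b"], ["2000", "2010"]], names=["group", "year"]
)
ser = pd.Series([1.0, 2.0, 3.0, 4.0], index=index)

print(ser.index.nlevels)    # number of index levels
print(ser.index.names)      # names of the levels
print(ser.unstack("year"))  # the same data viewed as a 2-D table
```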
### MultiIndex on `pd.Series`
A `pd.Series` gets a MultiIndex when a list of label arrays, one per level, is passed as `index`:

In [10]:
import numpy as np
import pandas as pd
ser = pd.Series(np.random.randn(8), 
                 index=[["a", "a", "b", "c", "d", "d", "e", "e"],
                        ["2000","2010","2000","2000","2000","2010","2000","2010"]])
ser

a  2000    0.608605
   2010    0.117970
b  2000    0.165885
c  2000   -0.709062
d  2000   -0.562112
   2010   -0.277552
e  2000    0.171508
   2010   -1.111910
dtype: float64

Selecting by index label works as follows:

In [11]:
ser["a"]

2000    0.608605
2010    0.117970
dtype: float64

In [12]:
ser["b":"c"]

b  2000    0.165885
c  2000   -0.709062
dtype: float64

In [13]:
ser.loc[["a", "d"]]

a  2000    0.608605
   2010    0.117970
d  2000   -0.562112
   2010   -0.277552
dtype: float64

Position-based selection with `iloc[]` works exactly as it does without a MultiIndex:

In [14]:
ser.iloc[2:4]

b  2000    0.165885
c  2000   -0.709062
dtype: float64

To specify a label on the inner level as well, separate the labels with a comma:

In [15]:
ser["a","2000"]

0.6086054366042096

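Selecting on the inner level alone is also possible. The sketch below uses a small deterministic Series as a stand-in for the random one above (the values are assumptions) and shows two ways to pick every entry whose inner label is `"2010"`.

```python
import pandas as pd

# Deterministic stand-in for the random series used above.
ser = pd.Series(
    [1.0, 2.0, 3.0, 4.0, 5.0],
    index=[["a", "a", "b", "c", "c"], ["2000", "2010", "2000", "2000", "2010"]],
)

# Cross-section: all entries whose inner label is "2010".
# The inner level is dropped from the result.
by_year = ser.xs("2010", level=1)
print(by_year)

# The same selection with .loc: any outer label, inner label "2010".
same = ser.loc[:, "2010"]
print(same)
```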
To reshape the indexed data into a table, use `Series.unstack()`, which moves the inner index level into the columns:

In [16]:
ser.unstack()

| | 2000 | 2010 |
|---|---|---|
| a | 0.608605 | 0.117970 |
| b | 0.165885 | NaN |
| c | -0.709062 | NaN |
| d | -0.562112 | -0.277552 |
| e | 0.171508 | -1.111910 |

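By default `unstack()` pivots the innermost level. A minimal sketch with assumed values shows how the `level` argument selects a different level to move into the columns:

```python
import pandas as pd

ser = pd.Series(
    [1.0, 2.0, 3.0],
    index=[["a", "a", "b"], ["2000", "2010", "2000"]],
)

# Default: the innermost level (the years) becomes the columns.
by_year = ser.unstack()
# level=0: the outermost level (the letters) becomes the columns instead.
by_letter = ser.unstack(level=0)

print(by_year)
print(by_letter)
```

Combinations that do not exist in the original index, such as `("b", "2010")` here, appear as `NaN` in the result.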

`DataFrame.stack()` does the reverse and moves the columns back into the inner index level:

In [17]:
ser.unstack().stack()

a  2000    0.608605
   2010    0.117970
b  2000    0.165885
c  2000   -0.709062
d  2000   -0.562112
   2010   -0.277552
e  2000    0.171508
   2010   -1.111910
dtype: float64

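The round trip can be checked on a small example. This is a sketch with assumed values. Whether `stack()` keeps or drops the `NaN` cells by default may depend on the installed pandas version, so the example calls `dropna()` explicitly to make the result independent of that behavior.

```python
import pandas as pd

ser = pd.Series(
    [1.0, 2.0, 3.0],
    index=[["a", "a", "b"], ["2000", "2010", "2000"]],
)

wide = ser.unstack()  # ("b", "2010") does not exist, so that cell is NaN
# dropna() removes the NaN cell regardless of how stack() treats it.
restored = wide.stack().dropna()

print(wide.isna().sum().sum())              # number of NaN cells in the table
print(restored.to_dict() == ser.to_dict())  # same labels and values as before
```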
### MultiIndex on `pd.DataFrame`
In a `pd.DataFrame`, a MultiIndex can be used for the rows, the columns, or both:

In [18]:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(24).reshape(6,4),
                 index =[["a", "a", "b", "b", "c", "c"],["2000", "2010", "2000", "2010", "2000", "2010"]],
                 columns = [["Tokyo", "Tokyo", "Saitama", "Nagano"],["Mainland", "Isrands", "Mainland", "Mainland"]])
df

| | | Tokyo | Tokyo | Saitama | Nagano |
|---|---|---|---|---|---|
| | | Mainland | Isrands | Mainland | Mainland |
| a | 2000 | 0.069428 | 2.546802 | 1.457338 | -0.569397 |
| a | 2010 | 0.460472 | 0.031973 | -0.063079 | -0.608645 |
| b | 2000 | 1.841005 | -0.667141 | -0.110929 | -0.327752 |
| b | 2010 | -0.277205 | 0.300749 | -2.123452 | 1.265280 |
| c | 2000 | -0.454824 | 1.104094 | -1.671802 | -0.148097 |
| c | 2010 | 0.583424 | -1.297794 | -0.512486 | -0.485872 |


Label selection works as it does for `pd.Series`, except that `df[...]` selects columns and `df.loc[...]` selects rows:

In [19]:
df["Tokyo"]

| | | Mainland | Isrands |
|---|---|---|---|
| a | 2000 | 0.069428 | 2.546802 |
| a | 2010 | 0.460472 | 0.031973 |
| b | 2000 | 1.841005 | -0.667141 |
| b | 2010 | -0.277205 | 0.300749 |
| c | 2000 | -0.454824 | 1.104094 |
| c | 2010 | 0.583424 | -1.297794 |


In [20]:
df["Tokyo","Mainland"]

a  2000    0.069428
   2010    0.460472
b  2000    1.841005
   2010   -0.277205
c  2000   -0.454824
   2010    0.583424
Name: (Tokyo, Mainland), dtype: float64

In [21]:
df.loc["a"]

| | Tokyo | Tokyo | Saitama | Nagano |
|---|---|---|---|---|
| | Mainland | Isrands | Mainland | Mainland |
| 2000 | 0.069428 | 2.546802 | 1.457338 | -0.569397 |
| 2010 | 0.460472 | 0.031973 | -0.063079 | -0.608645 |

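When both axes carry a MultiIndex, a full key on either axis is a tuple with one label per level. The sketch below uses a small deterministic frame (the values and the reduced set of labels are assumptions) to select one row and one cell with `.loc`.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in with two-level row and column labels.
df = pd.DataFrame(
    np.arange(8).reshape(4, 2),
    index=[["a", "a", "b", "b"], ["2000", "2010", "2000", "2010"]],
    columns=[["Saitama", "Tokyo"], ["Mainland", "Mainland"]],
)

# One row as a Series: the row key is the tuple (outer, inner).
row = df.loc[("a", "2010"), :]
# One cell: row tuple first, column tuple second.
cell = df.loc[("b", "2000"), ("Saitama", "Mainland")]

print(row)
print(cell)
```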

Note, however, that on a `pd.DataFrame` a slice inside `[]` is applied to the rows, not the columns:

In [None]:
df["a":"b"]

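The same row slice can be written more explicitly with `.loc`, and `pd.IndexSlice` extends it to the inner level. This is a sketch on a deterministic frame; the column names `x` and `y` and the values are assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(12).reshape(6, 2),
    index=[["a", "a", "b", "b", "c", "c"], ["2000", "2010"] * 3],
    columns=["x", "y"],  # hypothetical column names
)

# Label slice on the outer level; both endpoints are included.
outer = df.loc["a":"b"]

# Slice on the inner level: every outer label, inner label "2010" only.
idx = pd.IndexSlice
inner = df.loc[idx[:, "2010"], :]

print(outer)
print(inner)
```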
## Reading Data with Pandas

Pandas can read a variety of formats such as CSV, XLS, and JSON much more easily than hand-written, low-level file I/O.

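As a quick illustration of the JSON case, the sketch below parses a small in-memory payload with `pd.read_json()`. The field names and values are assumptions; a real file path could be passed instead of the `StringIO` object.

```python
import io
import pandas as pd

# Hypothetical JSON payload: a list of records, one object per row.
text = (
    '[{"city": "Tokyo", "year": 2000, "value": 1.5},'
    ' {"city": "Nagano", "year": 2010, "value": 2.5}]'
)

# Wrapping the text in StringIO lets pandas read it like a file.
records = pd.read_json(io.StringIO(text))
print(records)
```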
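### CSV format
To read CSV files, use `pd.read_csv()`.  
The example below reads [BTC-USD exchange rate data downloaded from Yahoo! Finance](https://finance.yahoo.com/quote/BTC-USD/history/).  
By default, `pd.read_csv()` uses the first row as the column names and numbers the rows from 0. Passing `index_col=0` makes the first column the row index, and `parse_dates=True` parses that index as dates: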

In [35]:
import pandas as pd
df = pd.read_csv("BTC-USD.csv", index_col = 0, parse_dates=True)
df

| Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|
| 2010-07-16 | 0.049510 | 0.049510 | 0.049510 | 0.049510 | 0.049510 | 0 |
| 2010-07-17 | 0.049510 | 0.085850 | 0.059410 | 0.085840 | 0.085840 | 5 |
| 2010-07-18 | 0.085840 | 0.093070 | 0.077230 | 0.080800 | 0.080800 | 49 |
| 2010-07-19 | 0.080800 | 0.081810 | 0.074260 | 0.074740 | 0.074740 | 20 |
| 2010-07-20 | 0.074740 | 0.079210 | 0.066340 | 0.079210 | 0.079210 | 42 |
| 2010-07-21 | 0.079210 | 0.081810 | 0.050500 | 0.050500 | 0.050500 | 129 |
| 2010-07-22 | 0.050500 | 0.067670 | 0.050500 | 0.062620 | 0.062620 | 141 |
| 2010-07-23 | 0.062620 | 0.061610 | 0.050490 | 0.054540 | 0.054540 | 26 |
| 2010-07-24 | 0.054540 | 0.059410 | 0.050500 | 0.050500 | 0.050500 | 85 |
| 2010-07-25 | 0.050500 | 0.056000 | 0.050000 | 0.056000 | 0.056000 | 46 |

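The effect of `index_col` and `parse_dates` can be seen without the downloaded file. The sketch below reads a three-row in-memory excerpt whose layout is assumed to resemble `BTC-USD.csv` (the column subset is an assumption) and compares the resulting indexes.

```python
import io
import pandas as pd

# Hypothetical three-row excerpt in a layout similar to BTC-USD.csv.
csv_text = """Date,Open,Close,Volume
2010-07-16,0.04951,0.04951,0
2010-07-17,0.04951,0.08584,5
2010-07-18,0.08584,0.08080,49
"""

# Without index_col, the rows are numbered 0, 1, 2 and Date stays a column.
plain = pd.read_csv(io.StringIO(csv_text))

# index_col=0 promotes the first column to the index;
# parse_dates=True parses that index as dates.
dated = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

print(plain.index)
print(dated.index)
# Select a single value by its date.
print(dated.loc[pd.Timestamp("2010-07-17"), "Close"])
```

With a date index in place, rows can be selected by `pd.Timestamp` values rather than by position.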

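### Microsoft Excel XLS format
To read Microsoft Excel XLS files, use `pd.read_excel()`.  
The example below reads [population dynamics data based on the Basic Resident Register](http://www.soumu.go.jp/menu_news/s-news/01gyosei02_02000148.html), published by the Ministry of Internal Affairs and Communications:
- Rows 2-4 of the sheet are used as a MultiIndex for the columns
- The row index is replaced with the prefecture names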


In [38]:
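import pandas as pd

# Reading the legacy .xls format requires the xlrd package.
df = pd.read_excel("000494956.xls", sheet_name=0, header=[1,2,3], skiprows=[4])
# Replace the auto-generated "Unnamed: ..." header labels with empty strings.
# The columns are rebuilt because MultiIndex.set_levels() no longer accepts inplace=True in pandas 2.x.
df.columns = pd.MultiIndex.from_tuples(
    [tuple("" if str(c).startswith("Unnamed") else c for c in col) for col in df.columns],
    names=df.columns.names,
)
# Use the prefecture name column as the row index.
df.set_index("都道府県名", inplace=True)
df.index.name="都道府県名"
df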

| 団体コード | 平成29年 | 平成29年 | 平成29年 | 平成29年 | 平成28年 | 平成28年 | 平成28年 | 平成28年 | 平成28年 | 平成28年 | 平成28年 | 平成28年 | 平成28年 | 平成28年 | 平成28年 | 平成28年 | 平成28年 | 平成28年 | 平成28年 | 平成28年 | 平成28年 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | 人口 | 人口 | 人口 | 世帯数 | 住民票記載数 | 住民票記載数 | 住民票記載数 | 住民票記載数 | 住民票記載数 | 住民票記載数 | ... | 住民票消除数 | 住民票消除数 | 住民票消除数 | 住民票消除数 | 増減数(Ａ)-(Ｂ) | 増減率 | 自然増減数 | 自然増減率 | 社会増減数 | 社会増減率 |
| | 男 | 女 | 計 | | 転入者数（国内） | 転入者数（国外） | 転入者数（計） | 出生者数 | その他（計） | 計（Ａ） | ... | 転出者数（計） | 死亡者数 | その他（計） | 計（Ｂ） | | | | | | |
| 都道府県名 | | | | | | | | | | | | | | | | | | | | | |
| 北海道 | 2537340 | 2833467 | 5370807 | 2761826 | 245573 | 13047 | 258620 | 35452 | 2519 | 296591 | ... | 261537 | 62131 | 3326 | 326994 | -30403 | -0.562892 | -26679 | -0.493945 | -3724 | -0.068948 |
| 青森県 | 627006 | 696855 | 1323861 | 589887 | 34255 | 1880 | 36135 | 8684 | 519 | 45338 | ... | 41994 | 17366 | 582 | 59942 | -14604 | -1.091101 | -8682 | -0.648653 | -5922 | -0.442447 |
| 岩手県 | 613838 | 663433 | 1277271 | 523065 | 36207 | 1870 | 38077 | 8363 | 325 | 46765 | ... | 41338 | 16995 | 631 | 58964 | -12199 | -0.946048 | -8632 | -0.669422 | -3567 | -0.276625 |
| 宮城県 | 1131759 | 1187679 | 2319438 | 980808 | 106795 | 5049 | 111844 | 17569 | 1158 | 130571 | ... | 110200 | 23579 | 1820 | 135599 | -5028 | -0.216308 | -6010 | -0.258554 | 982 | 0.042246 |
| 秋田県 | 485257 | 543939 | 1029196 | 426020 | 21489 | 1270 | 22759 | 5692 | 330 | 28781 | ... | 26717 | 15263 | 620 | 42600 | -13819 | -1.324909 | -9571 | -0.917628 | -4248 | -0.407281 |
| 山形県 | 538338 | 580130 | 1118468 | 411919 | 28062 | 1356 | 29418 | 7578 | 240 | 37236 | ... | 32692 | 15222 | 414 | 48328 | -11092 | -0.981975 | -7644 | -0.676724 | -3448 | -0.305252 |
| 福島県 | 950430 | 988129 | 1938559 | 779244 | 54421 | 2781 | 57202 | 13810 | 837 | 71849 | ... | 61875 | 24325 | 789 | 86989 | -15140 | -0.774940 | -10515 | -0.538210 | -4625 | -0.236730 |
| 茨城県 | 1482072 | 1478386 | 2960458 | 1221978 | 98859 | 15798 | 114657 | 21383 | 1325 | 137365 | ... | 111301 | 31551 | 4286 | 147138 | -9773 | -0.329032 | -10168 | -0.342330 | 395 | 0.013299 |
| 栃木県 | 993019 | 998578 | 1991597 | 817370 | 58970 | 10576 | 69546 | 14904 | 806 | 85256 | ... | 68582 | 21500 | 2441 | 92523 | -7267 | -0.363557 | -6596 | -0.329987 | -671 | -0.033569 |
| 群馬県 | 988955 | 1009320 | 1998275 | 831970 | 60784 | 8721 | 69505 | 14201 | 1038 | 84744 | ... | 65958 | 22212 | 3619 | 91789 | -7045 | -0.351316 | -8011 | -0.399487 | 966 | 0.048172 |

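The header cleanup step can be tried without the spreadsheet. The sketch below builds a small frame whose two-level column labels mimic the `Unnamed: ...` placeholders that pandas generates for empty header cells, then blanks them out. The column labels and values are assumptions for illustration, not the contents of the real file.

```python
import pandas as pd

# Hypothetical two-level header that mimics what a multi-row header
# produces when some header cells are empty.
columns = pd.MultiIndex.from_tuples(
    [
        ("Name", "Unnamed: 0_level_1"),
        ("Population", "Male"),
        ("Population", "Female"),
        ("Households", "Unnamed: 3_level_1"),
    ]
)
df = pd.DataFrame([["X", 10, 12, 9], ["Y", 20, 21, 15]], columns=columns)

# Rebuild every column tuple, replacing placeholder labels with "".
df.columns = pd.MultiIndex.from_tuples(
    [
        tuple("" if str(label).startswith("Unnamed") else label for label in col)
        for col in df.columns
    ],
    names=df.columns.names,
)

print(df.columns.tolist())
```

Rebuilding the whole index from tuples avoids having to keep the labels within each level unique, which matters when several placeholders in one level all collapse to the same empty string.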