## Commonly used pandas operations

In [1]:
import numpy as np
import pandas as pd

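### Series as a dictionary
A Series maps typed keys to typed values, which makes it much more efficient than a Python dict for certain operations.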

When the index (and therefore the order) is specified explicitly

In [2]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

When the order is not specified (the dict keys become the index)

In [3]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [4]:
population['California']

38332521

In [5]:
# Slicing works too
population['California':'Illinois']  # unlike NumPy slices, the end label is included

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64
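
To make the inclusive end point concrete, here is a minimal sketch that contrasts label-based slicing with position-based slicing through `iloc`. The Series `s` is made-up example data, not something from the cells above.

```python
import pandas as pd

# Hypothetical example data, chosen only for illustration
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

by_label = s['a':'c']       # label slice: the end label 'c' is included
by_position = s.iloc[0:2]   # positional slice: the end position 2 is excluded

print(by_label)
print(by_position)
```

The label slice returns three rows and the positional slice returns two, even though both look like "start to third element" at first glance.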

Naturally, the usual Python dict operations work as well

In [6]:
population.keys()

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

In [7]:
print("New York" in population)
print(38332521 in population)

True
False


In [8]:
list(population.items())

[('California', 38332521),
 ('Texas', 26448193),
 ('New York', 19651127),
 ('Florida', 19552860),
 ('Illinois', 12882135)]
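
As a rough illustration of where a Series goes beyond a dict, the sketch below applies the same arithmetic to both. The two-state subset and the variable names are assumptions made for brevity.

```python
import pandas as pd

population_dict = {'California': 38332521, 'Texas': 26448193}
population = pd.Series(population_dict)

# With a plain dict, every value has to be visited in a Python-level loop
millions_dict = {k: v / 1e6 for k, v in population_dict.items()}

# With a Series, the same operation is a single vectorized expression
millions_series = population / 1e6

print(millions_series)
```

Both give the same numbers; the Series version keeps the labels attached and avoids the explicit loop.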

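### Index as an ordered set
Both Series and DataFrame carry an Index, and it also behaves like an ordered set. This is convenient because there is no need to convert to a Python set every time.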

In [9]:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])
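# Recent pandas versions treat &, | and ^ on an Index as element-wise operators,
# so the set-operation methods are used here instead
print(indA.intersection(indB))          # intersection
print(indA.union(indB))                 # union
print(indA.symmetric_difference(indB))  # symmetric difference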

Int64Index([3, 5, 7], dtype='int64')
Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')
Int64Index([1, 2, 9, 11], dtype='int64')
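
One place these set operations tend to be useful is lining up two labelled objects. The following is only a sketch with made-up Series `s1` and `s2`; it uses the `intersection` and `difference` methods of `Index`.

```python
import pandas as pd

# Hypothetical Series, used only for illustration
s1 = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
s2 = pd.Series([10.0, 20.0, 30.0], index=['b', 'c', 'd'])

common = s1.index.intersection(s2.index)   # labels present in both
only_s1 = s1.index.difference(s2.index)    # labels present only in s1

# Restrict both Series to the shared labels before adding them
total = s1.loc[common] + s2.loc[common]
print(total)
```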


### DataFrameの作り方

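The pattern seen most often online: a dict of lists. On older setups (roughly Python before 3.7 or pandas before 0.23) the column order could get shuffled.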

In [10]:
pd.DataFrame({
    "first":[5,2],
    "second":[3,4]
})

   first  second
0      5       3
1      2       4
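
If the column order matters, one option is to state it explicitly with the `columns` argument. A minimal sketch, where `raw` and `df_ordered` are names introduced here for illustration:

```python
import pandas as pd

raw = {"first": [5, 2], "second": [3, 4]}

# Passing `columns` selects and orders the columns explicitly
df_ordered = pd.DataFrame(raw, columns=["second", "first"])
print(df_ordered)
```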


Build it from a list of dicts. This seems neater than the previous approach.

In [11]:
data = [{'a': i, 'b': 2 * i}
        for i in range(3)]
data

[{'a': 0, 'b': 0}, {'a': 1, 'b': 2}, {'a': 2, 'b': 4}]

In [12]:
pd.DataFrame(data)

   a  b
0  0  0
1  1  2
2  2  4
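
A side effect worth knowing about with a list of dicts: the records do not all need the same keys. The sketch below uses made-up records to show that missing entries come out as NaN.

```python
import pandas as pd

# Hypothetical records whose keys do not match
records = [{'a': 1, 'b': 2}, {'b': 3, 'c': 4}]

df_sparse = pd.DataFrame(records)  # missing keys are filled with NaN
print(df_sparse)
```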


Build it from a NumPy array. This looks handy for things like converting deep learning output into pandas.

In [13]:
data = np.random.rand(3, 2)
print(data)
pd.DataFrame(data,
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])

[[0.16218446 0.32180006]
 [0.52940368 0.34842539]
 [0.99582134 0.49462118]]


        foo       bar
a  0.162184  0.321800
b  0.529404  0.348425
c  0.995821  0.494621


### Adding a new column to a DataFrame

When the column is computed from existing data

In [14]:
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})
data

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
Florida     170312  19552860
Illinois    149995  12882135


In [15]:
data['density'] = data['pop'] / data['area']  # it really is this simple
data

              area       pop     density
California  423967  38332521   90.413926
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763
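
Assigning to `data['density']` modifies the DataFrame in place. When the original should stay untouched, `DataFrame.assign` is one alternative; the sketch below uses a two-state subset of the data, and `with_density` is a name introduced here.

```python
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662})
pop = pd.Series({'California': 38332521, 'Texas': 26448193})
data = pd.DataFrame({'area': area, 'pop': pop})

# assign returns a new DataFrame; `data` itself keeps its two columns
with_density = data.assign(density=data['pop'] / data['area'])
print(with_density)
```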


### Filtering data by a condition
`loc` also accepts a boolean mask

In [16]:
data.density>100

California    False
Texas         False
New York       True
Florida        True
Illinois      False
Name: density, dtype: bool

In [17]:
data.loc[data.density>100]

            area       pop     density
New York  141297  19651127  139.076746
Florida   170312  19552860  114.806121


In [18]:
data.loc[data.density>100,["area", "pop"]]  # columns can be specified at the same time

            area       pop
New York  141297  19651127
Florida   170312  19552860
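
Masks can also be combined. The following sketch rebuilds the same table and keeps rows that satisfy two conditions at once; the area threshold of 150000 is an arbitrary value picked for the example.

```python
import pandas as pd

data = pd.DataFrame(
    {'area': [423967, 695662, 141297, 170312, 149995],
     'pop': [38332521, 26448193, 19651127, 19552860, 12882135]},
    index=['California', 'Texas', 'New York', 'Florida', 'Illinois'])
data['density'] = data['pop'] / data['area']

# Each comparison needs its own parentheses because & binds tighter than >
dense_and_large = data.loc[(data['density'] > 100) & (data['area'] > 150000)]
print(dense_and_large)
```

Use `&` and `|` here rather than `and` / `or`, since the latter do not work element-wise on a Series.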


### Applying the same transformation to pandas data
It turns out you can simply pass the data straight to a NumPy function

In [19]:
np.sqrt(data)  # the return value is still a pandas object

                  area          pop    density
California  651.127484  6191.326271   9.508624
Texas       834.063547  5142.780668   6.165934
New York    375.894932  4432.959170  11.793080
Florida     412.688745  4421.861599  10.714762
Illinois    387.291880  3589.169124   9.267349


### Handling missing values

Detecting null values

In [20]:
data = pd.Series([1, np.nan, 'hello', None])
data

0        1
1      NaN
2    hello
3     None
dtype: object

In [21]:
data.isnull(), data.notnull()

(0    False
 1     True
 2    False
 3     True
 dtype: bool, 0     True
 1    False
 2     True
 3    False
 dtype: bool)
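
On a DataFrame, the boolean result of `isnull` can be summed to count missing values per column. A small sketch, using a 3x3 example frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])

# True counts as 1, so summing the mask gives the number of nulls per column
null_counts = df.isnull().sum()
print(null_counts)
```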

Dropping missing values

In [22]:
df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df

     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6


In [23]:
df.dropna()

     0    1  2
1  2.0  3.0  5


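`dropna` takes the options `how`, `axis`, and `thresh`. `how="all"` drops only the rows (or columns) that are entirely null; the default is `how="any"`. `axis` selects the axis to drop along. `thresh` sets the minimum number of non-null values a row or column needs in order to be kept.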

In [24]:
df.dropna(axis='columns')

   2
0  2
1  5
2  6


In [25]:
df.dropna(axis="rows", thresh=3)  # only row 1 has three or more non-null values

     0    1  2
1  2.0  3.0  5
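
The `how="all"` case is not shown above, so here is a minimal sketch of it. The frame `df_gaps` is made up for the example and contains one row that is entirely null.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 is entirely null, rows 0 and 2 are partly null
df_gaps = pd.DataFrame([[1,      np.nan, 2],
                        [np.nan, np.nan, np.nan],
                        [np.nan, 4,      6]])

dropped_any = df_gaps.dropna()            # default how="any": every row has a null
dropped_all = df_gaps.dropna(how="all")   # only the all-null row is removed

print(dropped_any)
print(dropped_all)
```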


Filling in missing values

In [26]:
df  # this is what the data looks like

     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6


Fill with 0

In [27]:
df.fillna(0)  # any other value can be used as well

     0    1  2
0  1.0  0.0  2
1  2.0  3.0  5
2  0.0  4.0  6


Fill with the previous value

In [28]:
df.ffill()  # forward fill; bfill() fills backward (fillna(method=...) is deprecated in recent pandas)

     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  2.0  4.0  6


In [29]:
df.bfill(axis="columns")  # backward fill along each row

     0    1    2
0  1.0  2.0  2.0
1  2.0  3.0  5.0
2  4.0  4.0  6.0
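
Besides a constant or a neighbouring value, a per-column statistic can be used as the fill value. This sketch fills each column with its own mean; treat it as one possible strategy rather than a recommendation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])

# df.mean() is a Series indexed by column label, so fillna matches it column by column
filled = df.fillna(df.mean())
print(filled)
```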


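### Hierarchical indexing
Until now I handled this kind of data with Panel and the like, but hierarchical indexing is probably far more convenient. (Panel itself has since been removed from pandas.)

For example: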

In [30]:
index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]
index = pd.MultiIndex.from_tuples(index)
index  # it really holds nothing but the index

MultiIndex(levels=[['California', 'New York', 'Texas'], [2000, 2010]],
           labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])

In [31]:
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]
pop = pd.Series(populations,index=index)
pop

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [32]:
pop.loc["New York":"Texas", 2000]  # filtering on several index levels at once also works

New York  2000    18976457
Texas     2000    20851820
dtype: int64
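
Selecting on a single level is also common. The sketch below rebuilds the same `pop` Series and picks one state through the outer level, then one year through the inner level using `xs`; the variable names `california` and `year_2010` are introduced here.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('California', 2000), ('California', 2010),
     ('New York', 2000), ('New York', 2010),
     ('Texas', 2000), ('Texas', 2010)])
pop = pd.Series([33871648, 37253956,
                 18976457, 19378102,
                 20851820, 25145561], index=index)

california = pop['California']     # outer level: a Series indexed by year
year_2010 = pop.xs(2010, level=1)  # inner level: a Series indexed by state

print(california)
print(year_2010)
```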

A Series with a MultiIndex can be turned into a DataFrame with `unstack`, and back again with `stack`

In [33]:
pop.unstack()

                2000      2010
California  33871648  37253956
New York    18976457  19378102
Texas       20851820  25145561


In [34]:
pop.unstack().stack()

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

Computations on a DataFrame work just as before

In [35]:
pop_df = pd.DataFrame({'total': pop,
                       'under18': [9267089, 9284094,
                                   4687374, 4318033,
                                   5906301, 6879014]})
pop_df

                    total  under18
California 2000  33871648  9267089
           2010  37253956  9284094
New York   2000  18976457  4687374
           2010  19378102  4318033
Texas      2000  20851820  5906301
           2010  25145561  6879014


In [36]:
f_u18 = pop_df['under18'] / pop_df['total']
f_u18.unstack()

                2000      2010
California  0.273594  0.249211
New York    0.247010  0.222831
Texas       0.283251  0.273568


Ways to create a MultiIndex

Creating one implicitly when constructing a DataFrame: just pass a list of lists as `index`

In [37]:
np.arange(8).reshape(4,2)

array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7]])

In [38]:
df = pd.DataFrame(np.arange(8).reshape(4, 2),
                  index=[['a', 'a', 'b', 'b'], 
                         [1, 2, 1, 2]],
                  columns=['data1', 'data2'])
df

     data1  data2
a 1      0      1
  2      2      3
b 1      4      5
  2      6      7


One can also be created implicitly from a dict, which is a bit neater. The key point is to use tuples as the keys.

In [39]:
data = {('California', 2000): 33871648,
        ('California', 2010): 37253956,
        ('Texas', 2000): 20851820,
        ('Texas', 2010): 25145561,
        ('New York', 2000): 18976457,
        ('New York', 2010): 19378102}
pd.Series(data)

California  2000    33871648
            2010    37253956
Texas       2000    20851820
            2010    25145561
New York    2000    18976457
            2010    19378102
dtype: int64

Creating a MultiIndex explicitly
There are several ways to do it

In [42]:
# from arrays
pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], 
                           [1, 2, 1, 2]])

MultiIndex(levels=[['a', 'b'], [1, 2]],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

In [43]:
# from tuples
pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])

MultiIndex(levels=[['a', 'b'], [1, 2]],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

In [44]:
# from a Cartesian product
pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

MultiIndex(levels=[['a', 'b'], [1, 2]],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

Naming the index levels

In [56]:
pop.index.names = ['state', 'year']  # a list names several levels at once

In [60]:
pop

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
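
Once the levels have names, those names can be used instead of level numbers. A minimal sketch on a two-state subset, assuming the same `state` / `year` names; `by_state` and `flat` are names introduced here.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('California', 2000), ('California', 2010),
     ('Texas', 2000), ('Texas', 2010)],
    names=['state', 'year'])
pop = pd.Series([33871648, 37253956, 20851820, 25145561], index=index)

by_state = pop.unstack(level='state')       # states become columns
flat = pop.reset_index(name='population')   # levels become ordinary columns

print(by_state)
print(flat)
```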

Using a MultiIndex on both the rows and the columns of a DataFrame

In [65]:
# hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])

# mock some data
data = np.round(np.random.randn(4, 6), decimals=1)
data[:, ::2] *= 10
data += 37

# create the DataFrame
# pass both MultiIndex objects when constructing the DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data

subject      Bob       Guido         Sue
type          HR  Temp    HR  Temp    HR  Temp
year visit
2013 1      51.0  36.8  44.0  38.5  21.0  37.3
     2      29.0  36.1  49.0  37.1  32.0  39.5
2014 1      46.0  37.0  39.0  35.7  45.0  38.0
     2      29.0  36.8  42.0  37.7  36.0  36.4


Selecting data

In [87]:
health_data["Bob","HR"]

year  visit
2013  1        51.0
      2        29.0
2014  1        46.0
      2        29.0
Name: (Bob, HR), dtype: float64

In [80]:
health_data.loc[:,"Bob":"Guido"]

subject      Bob       Guido
type          HR  Temp    HR  Temp
year visit
2013 1      51.0  36.8  44.0  38.5
     2      29.0  36.1  49.0  37.1
2014 1      46.0  37.0  39.0  35.7
     2      29.0  36.8  42.0  37.7


In [89]:
health_data.loc[:,("Bob","Temp"):("Guido","HR")]

subject      Bob Guido
type        Temp    HR
year visit
2013 1      36.8  44.0
     2      36.1  49.0
2014 1      37.0  39.0
     2      36.8  42.0


In [90]:
idx = pd.IndexSlice  # without IndexSlice this raises an error (plain Python slices inside a tuple do not work)
health_data.loc[idx[:, 1], idx[:, 'HR']]

subject      Bob Guido   Sue
type          HR    HR    HR
year visit
2013 1      51.0  44.0  21.0
2014 1      46.0  39.0  45.0


Sometimes the index has to be sorted first

In [91]:
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
data = pd.Series(np.random.rand(6), index=index)
data.index.names = ['char', 'int']
data

char  int
a     1      0.452000
      2      0.305514
c     1      0.626629
      2      0.325553
b     1      0.604230
      2      0.555147
dtype: float64

In [103]:
data.loc["a":"b"]

UnsortedIndexError: 'Key length (1) was greater than MultiIndex lexsort depth (0)'

In [104]:
data = data.sort_index()
data

char  int
a     1      0.452000
      2      0.305514
b     1      0.604230
      2      0.555147
c     1      0.626629
      2      0.325553
dtype: float64

In [105]:
data.loc["a":"b"]

char  int
a     1      0.452000
      2      0.305514
b     1      0.604230
      2      0.555147
dtype: float64
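
To avoid running into the error in the first place, one option is to check whether the index is sorted and sort only when needed. The sketch below uses `is_monotonic_increasing` for that check; the integer values are placeholders chosen so the result is easy to follow.

```python
import pandas as pd

index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]],
                                   names=['char', 'int'])
data = pd.Series(range(6), index=index)  # placeholder values 0..5

# Sort only if the index is not already in lexicographic order
if not data.index.is_monotonic_increasing:
    data = data.sort_index()

sliced = data.loc['a':'b']
print(sliced)
```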