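## pandas and Data Analysis
pandas is an essential tool that comes up in almost every data analysis task. Whether you know how to use it, and whether you can get to work right away without searching Google for each operation, directly affects your effectiveness as an engineer. Using a concrete dataset, this article explains the methods and usage patterns that I consider important based on my own experience. I will keep adding to it as I come across other important uses.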


### GitHub
- The Jupyter notebook file is available [here](https://github.com/hiroshi0530/wa-src/blob/master/article/library/pandas/pandas_nb.ipynb)

### Google Colaboratory
- To run it on Google Colaboratory, go [here](https://colab.research.google.com/github/hiroshi0530/wa-src/blob/master/article/library/pandas/pandas_nb.ipynb)

### Author's environment
The author's OS is macOS. Note that some command options differ from their Linux and Unix counterparts.

In [1]:
!sw_vers

ProductName:	Mac OS X
ProductVersion:	10.14.6
BuildVersion:	18G95


In [2]:
!python -V

Python 3.5.5 :: Anaconda, Inc.


Import the basic libraries and check their versions.

In [3]:
%matplotlib inline
%config InlineBackend.figure_format = 'svg'

import matplotlib
import matplotlib.pyplot as plt
import scipy
import numpy as np

print('matplotlib version :', matplotlib.__version__)
print('scipy version :', scipy.__version__)
print('numpy version :', np.__version__)

matplotlib version : 2.2.2
scipy version : 1.4.1
numpy version : 1.18.1


### Import and version check

In [4]:
import pandas as pd

print('pandas version :', pd.__version__)

pandas version : 0.24.2


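## Basic operations

### Loading and displaying data

The data comes from [Daniel Chen's GitHub repository](https://github.com/chendaniely/pandas_for_everyone). He is the well-known author of a book on pandas, published in Japanese as [Pythonデータ分析／機械学習のための基本コーディング！ pandasライブラリ活用入門](https://www.amazon.co.jp/dp/B07NZP6V29/ref=dp-kindle-redirect?_encoding=UTF8&btkr=1). I own a copy and find it very instructive.

The data appears to record the number of Ebola hemorrhagic fever cases (`Cases_*`) and deaths (`Deaths_*`) for each country.

Use `read_csv` to load the CSV and display the first five rows.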

In [5]:
import pandas as pd

df = pd.read_csv('./country_timeseries.csv', sep=',')
df.head()

Unnamed: 0,Date,Day,Cases_Guinea,Cases_Liberia,Cases_SierraLeone,Cases_Nigeria,Cases_Senegal,Cases_UnitedStates,Cases_Spain,Cases_Mali,Deaths_Guinea,Deaths_Liberia,Deaths_SierraLeone,Deaths_Nigeria,Deaths_Senegal,Deaths_UnitedStates,Deaths_Spain,Deaths_Mali
0,1/5/2015,289,2776.0,,10030.0,,,,,,1786.0,,2977.0,,,,,
1,1/4/2015,288,2775.0,,9780.0,,,,,,1781.0,,2943.0,,,,,
2,1/3/2015,287,2769.0,8166.0,9722.0,,,,,,1767.0,3496.0,2915.0,,,,,
3,1/2/2015,286,,8157.0,,,,,,,,3496.0,,,,,,
4,12/31/2014,284,2730.0,8115.0,9633.0,,,,,,1739.0,3471.0,2827.0,,,,,
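
For readers without the CSV at hand, the same `read_csv` and `head` flow can be sketched on a tiny inline sample. The rows are copied from the output above but trimmed to three columns, and `io.StringIO` stands in for the file path; both are assumptions made for illustration.

```python
import io
import pandas as pd

# Three rows copied from the output above, trimmed to three columns (illustrative only)
csv_text = """Date,Day,Cases_Guinea
1/5/2015,289,2776.0
1/4/2015,288,2775.0
1/2/2015,286,
"""

# StringIO behaves like an open file, so read_csv accepts it in place of a path
sample = pd.read_csv(io.StringIO(csv_text), sep=',')

print(sample.head(2))  # first two rows
print(sample.dtypes)   # Date is read as text, not as a date
```

The empty field in the last row is read as `NaN`, which is why the case and death columns in the real data are floats.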


Display the last five rows.

In [6]:
df.tail()

Unnamed: 0,Date,Day,Cases_Guinea,Cases_Liberia,Cases_SierraLeone,Cases_Nigeria,Cases_Senegal,Cases_UnitedStates,Cases_Spain,Cases_Mali,Deaths_Guinea,Deaths_Liberia,Deaths_SierraLeone,Deaths_Nigeria,Deaths_Senegal,Deaths_UnitedStates,Deaths_Spain,Deaths_Mali
117,3/27/2014,5,103.0,8.0,6.0,,,,,,66.0,6.0,5.0,,,,,
118,3/26/2014,4,86.0,,,,,,,,62.0,,,,,,,
119,3/25/2014,3,86.0,,,,,,,,60.0,,,,,,,
120,3/24/2014,2,86.0,,,,,,,,59.0,,,,,,,
121,3/22/2014,0,49.0,,,,,,,,29.0,,,,,,,


### Inspecting the data

#### Getting data types and other information

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 122 entries, 0 to 121
Data columns (total 18 columns):
Date                   122 non-null object
Day                    122 non-null int64
Cases_Guinea           93 non-null float64
Cases_Liberia          83 non-null float64
Cases_SierraLeone      87 non-null float64
Cases_Nigeria          38 non-null float64
Cases_Senegal          25 non-null float64
Cases_UnitedStates     18 non-null float64
Cases_Spain            16 non-null float64
Cases_Mali             12 non-null float64
Deaths_Guinea          92 non-null float64
Deaths_Liberia         81 non-null float64
Deaths_SierraLeone     87 non-null float64
Deaths_Nigeria         38 non-null float64
Deaths_Senegal         22 non-null float64
Deaths_UnitedStates    18 non-null float64
Deaths_Spain           16 non-null float64
Deaths_Mali            12 non-null float64
dtypes: float64(16), int64(1), object(1)
memory usage: 17.2+ KB


#### Checking the size (number of rows and columns)

In [8]:
df.shape

(122, 18)

#### Checking the index

In [9]:
df.index

RangeIndex(start=0, stop=122, step=1)

#### Checking the column names

In [10]:
df.columns

Index(['Date', 'Day', 'Cases_Guinea', 'Cases_Liberia', 'Cases_SierraLeone',
       'Cases_Nigeria', 'Cases_Senegal', 'Cases_UnitedStates', 'Cases_Spain',
       'Cases_Mali', 'Deaths_Guinea', 'Deaths_Liberia', 'Deaths_SierraLeone',
       'Deaths_Nigeria', 'Deaths_Senegal', 'Deaths_UnitedStates',
       'Deaths_Spain', 'Deaths_Mali'],
      dtype='object')

#### Getting data by column name

Specify column names to display only those columns.

In [11]:
df[['Cases_UnitedStates','Deaths_UnitedStates']].head()

Unnamed: 0,Cases_UnitedStates,Deaths_UnitedStates
0,,
1,,
2,,
3,,
4,,
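
The double brackets above matter: a single label returns a Series, while a list of labels returns a DataFrame. A minimal sketch on a toy frame (`toy` and its values are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

one_col = toy['a']          # single label -> Series
two_cols = toy[['a', 'c']]  # list of labels -> DataFrame

print(type(one_col).__name__)
print(type(two_cols).__name__)
```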


#### Getting data by row and column position

In [12]:
df.iloc[[6,7],[0,3]]

Unnamed: 0,Date,Cases_Liberia
6,12/27/2014,
7,12/24/2014,7977.0
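
`iloc` selects by integer position, whereas `loc` selects by label. The sketch below picks the same cells both ways on a toy frame; the frame and its labels `r0` to `r2` are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {'Day': [289, 288, 287], 'Cases': [2776.0, 2775.0, 2769.0]},
    index=['r0', 'r1', 'r2'],
)

by_position = toy.iloc[[0, 1], [1]]          # rows 0-1, column 1, by integer position
by_label = toy.loc[['r0', 'r1'], ['Cases']]  # the same cells, selected by label

print(by_position.equals(by_label))
```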


#### Getting rows that satisfy a condition

In [13]:
df[df['Deaths_Liberia'] > 3000][['Deaths_Liberia']]

Unnamed: 0,Deaths_Liberia
2,3496.0
3,3496.0
4,3471.0
5,3423.0
7,3413.0
9,3384.0
10,3376.0
12,3290.0
14,3177.0
16,3145.0
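
Several conditions can be combined in the same way. A minimal sketch, assuming a toy frame with made-up values; note that each condition needs its own parentheses and that comparisons against `NaN` evaluate to `False`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Deaths_Liberia': [3496.0, 3471.0, np.nan, 2413.0],
    'Deaths_Guinea': [1767.0, 1739.0, 1697.0, 847.0],
})

# Combine boolean Series with & (and), | (or), ~ (not)
mask = (toy['Deaths_Liberia'] > 3000) & (toy['Deaths_Guinea'] < 1750)
picked = toy[mask]

print(picked)
```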


#### Getting summary statistics

`describe()` returns summary statistics for each column. It is useful when you want a quick overview of the data.

In [14]:
df.describe()

Unnamed: 0,Day,Cases_Guinea,Cases_Liberia,Cases_SierraLeone,Cases_Nigeria,Cases_Senegal,Cases_UnitedStates,Cases_Spain,Cases_Mali,Deaths_Guinea,Deaths_Liberia,Deaths_SierraLeone,Deaths_Nigeria,Deaths_Senegal,Deaths_UnitedStates,Deaths_Spain,Deaths_Mali
count,122.0,93.0,83.0,87.0,38.0,25.0,18.0,16.0,12.0,92.0,81.0,87.0,38.0,22.0,18.0,16.0,12.0
mean,144.778689,911.064516,2335.337349,2427.367816,16.736842,1.08,3.277778,1.0,3.5,563.23913,1101.209877,693.701149,6.131579,0.0,0.833333,0.1875,3.166667
std,89.31646,849.108801,2987.966721,3184.803996,5.998577,0.4,1.178511,0.0,2.746899,508.511345,1297.208568,869.947073,2.781901,0.0,0.383482,0.403113,2.405801
min,0.0,49.0,3.0,0.0,0.0,1.0,1.0,1.0,1.0,29.0,2.0,0.0,0.0,0.0,0.0,0.0,1.0
25%,66.25,236.0,25.5,64.5,15.0,1.0,3.0,1.0,1.0,157.75,12.0,6.0,4.0,0.0,1.0,0.0,1.0
50%,150.0,495.0,516.0,783.0,20.0,1.0,4.0,1.0,2.5,360.5,294.0,334.0,8.0,0.0,1.0,0.0,2.0
75%,219.5,1519.0,4162.5,3801.0,20.0,1.0,4.0,1.0,6.25,847.75,2413.0,1176.0,8.0,0.0,1.0,0.0,6.0
max,289.0,2776.0,8166.0,10030.0,22.0,3.0,4.0,1.0,7.0,1786.0,3496.0,2977.0,8.0,0.0,1.0,1.0,6.0


## Changing the index to datetime type

Set the `Date` column as the index, overwriting the DataFrame in place.

In [15]:
df.set_index('Date', inplace=True)
df.index

Index(['1/5/2015', '1/4/2015', '1/3/2015', '1/2/2015', '12/31/2014',
       '12/28/2014', '12/27/2014', '12/24/2014', '12/21/2014', '12/20/2014',
       ...
       '4/4/2014', '4/1/2014', '3/31/2014', '3/29/2014', '3/28/2014',
       '3/27/2014', '3/26/2014', '3/25/2014', '3/24/2014', '3/22/2014'],
      dtype='object', name='Date', length=122)

In [16]:
df.head()

Date,Day,Cases_Guinea,Cases_Liberia,Cases_SierraLeone,Cases_Nigeria,Cases_Senegal,Cases_UnitedStates,Cases_Spain,Cases_Mali,Deaths_Guinea,Deaths_Liberia,Deaths_SierraLeone,Deaths_Nigeria,Deaths_Senegal,Deaths_UnitedStates,Deaths_Spain,Deaths_Mali
1/5/2015,289,2776.0,,10030.0,,,,,,1786.0,,2977.0,,,,,
1/4/2015,288,2775.0,,9780.0,,,,,,1781.0,,2943.0,,,,,
1/3/2015,287,2769.0,8166.0,9722.0,,,,,,1767.0,3496.0,2915.0,,,,,
1/2/2015,286,,8157.0,,,,,,,,3496.0,,,,,,
12/31/2014,284,2730.0,8115.0,9633.0,,,,,,1739.0,3471.0,2827.0,,,,,


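While we are at it, let's try changing the index name `Date` with the `rename` function. Note that `rename(index=...)` maps index labels, not the name of the index, so the call below leaves the name unchanged (the later outputs still show `name='Date'`). To change the name itself, use `df.rename_axis('YYYYMMDD')` or assign to `df.index.name`.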

In [26]:
df.rename(index={'Date':'YYYYMMDD'}, inplace=True)

In [27]:
df.columns
# df.sort_values(by="YYYYMMDD", ascending=True).head()

Index(['Day', 'Cases_Guinea', 'Cases_Liberia', 'Cases_SierraLeone',
       'Cases_Nigeria', 'Cases_Senegal', 'Cases_UnitedStates', 'Cases_Spain',
       'Cases_Mali', 'Deaths_Guinea', 'Deaths_Liberia', 'Deaths_SierraLeone',
       'Deaths_Nigeria', 'Deaths_Senegal', 'Deaths_UnitedStates',
       'Deaths_Spain', 'Deaths_Mali'],
      dtype='object')
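
The outputs that follow still show `name='Date'`, which suggests the `rename` call above did not change the index name. The sketch below contrasts `rename(index=...)` with `rename_axis` on a toy frame; the frame itself is an assumption for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {'Day': [289, 288]},
    index=pd.Index(['1/5/2015', '1/4/2015'], name='Date'),
)

# rename(index=...) maps index labels; no label equals 'Date', so nothing changes
unchanged = toy.rename(index={'Date': 'YYYYMMDD'})

# rename_axis changes the name of the index itself
renamed = toy.rename_axis('YYYYMMDD')

print(unchanged.index.name)
print(renamed.index.name)
```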

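Sort by the index. Because the dates are still strings, however, they are compared character by character, and the result is not the intended chronological order.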

In [30]:
df.sort_index(ascending=True).head()

Date,Day,Cases_Guinea,Cases_Liberia,Cases_SierraLeone,Cases_Nigeria,Cases_Senegal,Cases_UnitedStates,Cases_Spain,Cases_Mali,Deaths_Guinea,Deaths_Liberia,Deaths_SierraLeone,Deaths_Nigeria,Deaths_Senegal,Deaths_UnitedStates,Deaths_Spain,Deaths_Mali
9/9/2014,171,,2407.0,,,,,,,,,,,,,,
9/7/2014,169,861.0,2081.0,1424.0,21.0,3.0,,,,557.0,1137.0,524.0,8.0,0.0,,,
9/5/2014,167,812.0,1871.0,1261.0,22.0,1.0,,,,517.0,1089.0,491.0,8.0,,,,
9/28/2014,190,1157.0,3696.0,2304.0,20.0,1.0,,,,710.0,1998.0,622.0,8.0,0.0,,,
9/23/2014,185,1074.0,3458.0,2021.0,20.0,1.0,,,,648.0,1830.0,605.0,8.0,0.0,,,


Convert the index to datetime type. First, check the current index.

In [31]:
df.index

Index(['1/5/2015', '1/4/2015', '1/3/2015', '1/2/2015', '12/31/2014',
       '12/28/2014', '12/27/2014', '12/24/2014', '12/21/2014', '12/20/2014',
       ...
       '4/4/2014', '4/1/2014', '3/31/2014', '3/29/2014', '3/28/2014',
       '3/27/2014', '3/26/2014', '3/25/2014', '3/24/2014', '3/22/2014'],
      dtype='object', name='Date', length=122)

In [33]:
df.index = pd.to_datetime(df.index, format='%m/%d/%Y')
df.index

DatetimeIndex(['2015-01-05', '2015-01-04', '2015-01-03', '2015-01-02',
               '2014-12-31', '2014-12-28', '2014-12-27', '2014-12-24',
               '2014-12-21', '2014-12-20',
               ...
               '2014-04-04', '2014-04-01', '2014-03-31', '2014-03-29',
               '2014-03-28', '2014-03-27', '2014-03-26', '2014-03-25',
               '2014-03-24', '2014-03-22'],
              dtype='datetime64[ns]', name='Date', length=122, freq=None)
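
The difference between sorting date strings and sorting parsed dates can be seen on three values taken from the index above. This is only a small sketch; the variable names are made up for illustration.

```python
import pandas as pd

dates = pd.Index(['9/9/2014', '12/31/2014', '1/5/2015'])

# As strings, the order is decided character by character
as_text = sorted(dates)

# Parsed with an explicit format, the order is chronological
parsed = pd.to_datetime(dates, format='%m/%d/%Y').sort_values()

print(as_text)
print(parsed)
```

Passing `format` explicitly avoids any ambiguity between month-first and day-first dates.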

The dtype has changed from `dtype='object'` to `dtype='datetime64[ns]'`, so the index is now a datetime type. Let's sort again.

In [35]:
df.sort_index(ascending=True).head(10)

Date,Day,Cases_Guinea,Cases_Liberia,Cases_SierraLeone,Cases_Nigeria,Cases_Senegal,Cases_UnitedStates,Cases_Spain,Cases_Mali,Deaths_Guinea,Deaths_Liberia,Deaths_SierraLeone,Deaths_Nigeria,Deaths_Senegal,Deaths_UnitedStates,Deaths_Spain,Deaths_Mali
2014-03-22,0,49.0,,,,,,,,29.0,,,,,,,
2014-03-24,2,86.0,,,,,,,,59.0,,,,,,,
2014-03-25,3,86.0,,,,,,,,60.0,,,,,,,
2014-03-26,4,86.0,,,,,,,,62.0,,,,,,,
2014-03-27,5,103.0,8.0,6.0,,,,,,66.0,6.0,5.0,,,,,
2014-03-28,6,112.0,3.0,2.0,,,,,,70.0,3.0,2.0,,,,,
2014-03-29,7,112.0,7.0,,,,,,,70.0,2.0,,,,,,
2014-03-31,9,122.0,8.0,2.0,,,,,,80.0,4.0,2.0,,,,,
2014-04-01,10,127.0,8.0,2.0,,,,,,83.0,5.0,2.0,,,,,
2014-04-04,13,143.0,18.0,2.0,,,,,,86.0,7.0,2.0,,,,,


In [36]:
df.sort_index(ascending=True).tail(10)

Date,Day,Cases_Guinea,Cases_Liberia,Cases_SierraLeone,Cases_Nigeria,Cases_Senegal,Cases_UnitedStates,Cases_Spain,Cases_Mali,Deaths_Guinea,Deaths_Liberia,Deaths_SierraLeone,Deaths_Nigeria,Deaths_Senegal,Deaths_UnitedStates,Deaths_Spain,Deaths_Mali
2014-12-20,272,2571.0,7862.0,8939.0,,,,,,1586.0,3384.0,2556.0,,,,,
2014-12-21,273,2597.0,,9004.0,,,,,,1607.0,,2582.0,,,,,
2014-12-24,277,2630.0,7977.0,9203.0,,,,,,,3413.0,2655.0,,,,,
2014-12-27,280,2695.0,,9409.0,,,,,,1697.0,,2732.0,,,,,
2014-12-28,281,2706.0,8018.0,9446.0,,,,,,1708.0,3423.0,2758.0,,,,,
2014-12-31,284,2730.0,8115.0,9633.0,,,,,,1739.0,3471.0,2827.0,,,,,
2015-01-02,286,,8157.0,,,,,,,,3496.0,,,,,,
2015-01-03,287,2769.0,8166.0,9722.0,,,,,,1767.0,3496.0,2915.0,,,,,
2015-01-04,288,2775.0,,9780.0,,,,,,1781.0,,2943.0,,,,,
2015-01-05,289,2776.0,,10030.0,,,,,,1786.0,,2977.0,,,,,


The rows are now sorted in the expected chronological order.

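A datetime index also makes dates easier to work with. For example, the rows for 2015 can be selected with a partial date string. On pandas 0.24.2 `df['2015']` gives the same result, but newer pandas releases no longer accept that form for row selection, so `.loc` is used here:

In [41]:
df.loc['2015']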

Date,Day,Cases_Guinea,Cases_Liberia,Cases_SierraLeone,Cases_Nigeria,Cases_Senegal,Cases_UnitedStates,Cases_Spain,Cases_Mali,Deaths_Guinea,Deaths_Liberia,Deaths_SierraLeone,Deaths_Nigeria,Deaths_Senegal,Deaths_UnitedStates,Deaths_Spain,Deaths_Mali
2015-01-05,289,2776.0,,10030.0,,,,,,1786.0,,2977.0,,,,,
2015-01-04,288,2775.0,,9780.0,,,,,,1781.0,,2943.0,,,,,
2015-01-03,287,2769.0,8166.0,9722.0,,,,,,1767.0,3496.0,2915.0,,,,,
2015-01-02,286,,8157.0,,,,,,,,3496.0,,,,,,
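
Partial date strings also work at month granularity and for ranges. A minimal sketch on a toy frame, using `.loc` and a sorted index; the dates are taken from the data above, but the frame itself is an assumption for illustration.

```python
import pandas as pd

idx = pd.to_datetime(['2015-01-05', '2015-01-02', '2014-12-31', '2014-12-28'])
toy = pd.DataFrame({'Day': [289, 286, 284, 281]}, index=idx).sort_index()

year_2015 = toy.loc['2015']                # every row in 2015
dec_2014 = toy.loc['2014-12']              # every row in December 2014
span = toy.loc['2014-12-30':'2015-01-03']  # date range, both ends inclusive

print(year_2015)
print(dec_2014)
print(span)
```

Sorting the index first keeps range slicing well defined even when the endpoints are not present in the index.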


This returns only the rows from 2015. Date components such as the year are also available directly from the index.

In [38]:
df.index.year

Int64Index([2015, 2015, 2015, 2015, 2014, 2014, 2014, 2014, 2014, 2014,
            ...
            2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014],
           dtype='int64', name='Date', length=122)
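
The values from `df.index.year` can also serve as a grouping key. The sketch below takes the maximum per year on a toy frame; the column name `Cases` and the values are assumptions for illustration.

```python
import pandas as pd

idx = pd.to_datetime(['2015-01-05', '2015-01-02', '2014-12-31'])
toy = pd.DataFrame({'Cases': [2776.0, 2769.0, 2730.0]}, index=idx)

# Group rows by the year taken from the DatetimeIndex, then take the maximum per year
max_by_year = toy.groupby(toy.index.year)['Cases'].max()

print(max_by_year)
```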


## Using query and where (and sorting)
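
A minimal sketch of `query`, `where`, and `sort_values` on toy data. The frame `toy`, its values, and the variable `threshold` are assumptions for illustration, not part of the Ebola dataset workflow above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'Day': [289, 288, 287], 'Cases': [2776.0, np.nan, 2769.0]})

# query: filter rows with a string expression; @name refers to a Python variable
threshold = 287
recent = toy.query('Day > @threshold').sort_values(by='Day')

# where: keep the original length, replacing values that fail the condition with NaN
masked = toy['Cases'].where(toy['Day'] > threshold)

print(recent)
print(masked)
```

`query` drops the rows that do not match, while `where` keeps every position and blanks out the ones that fail the condition.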

## Renaming columns and the index
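
A minimal sketch of `rename` with a mapping, with a function, and on index labels. The toy frame and the new names are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {'Cases_Guinea': [2776.0], 'Deaths_Guinea': [1786.0]},
    index=['1/5/2015'],
)

by_mapping = toy.rename(columns={'Cases_Guinea': 'cases'})  # rename selected columns
by_function = toy.rename(columns=str.lower)                 # apply a function to every column name
by_index = toy.rename(index={'1/5/2015': '2015-01-05'})     # rename an index label

print(list(by_mapping.columns))
print(list(by_function.columns))
print(by_index.index.tolist())
```

Each call returns a new DataFrame; pass `inplace=True` to modify the original instead.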

## Handling null values
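
A minimal sketch of counting, filling, and dropping missing values. The toy frame mimics the sparse case and death columns seen earlier, but its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Cases': [2776.0, np.nan, 2769.0],
    'Deaths': [np.nan, np.nan, 1767.0],
})

missing_per_column = toy.isnull().sum()  # number of NaN values in each column
filled = toy.fillna(0)                   # replace NaN with a constant
complete_rows = toy.dropna()             # keep only rows without any NaN
any_value_rows = toy.dropna(how='all')   # drop only rows where every value is NaN

print(missing_per_column)
print(complete_rows)
print(any_value_rows)
```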

## Using get_dummies
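
A minimal sketch of one-hot encoding with `pd.get_dummies`. The column `country` and its values are assumptions for illustration; the dataset above has no such categorical column.

```python
import pandas as pd

toy = pd.DataFrame({'country': ['Guinea', 'Liberia', 'Guinea']})

# One indicator column per distinct value. The dtype of the indicators
# differs between pandas versions (integer in older ones, bool in newer ones).
dummies = pd.get_dummies(toy['country'])

print(dummies.astype(int))
```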

## Frequently used commands
As an overview, the commands I use most often are listed below.

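#### Filter rows with a query string
```python
df.query()
```

#### List the unique values of a column
```python
df['xxx'].unique()
```

#### Drop duplicate rows
```python
df.drop_duplicates()
```

#### Summary statistics
```python
df.describe()
```

#### Set a column as the index
```python
df.set_index()
```

#### Rename column or index labels
```python
df.rename()
```

#### Apply a function along rows or columns
```python
df.apply()
```

#### Bin values into intervals
```python
pd.cut()
```

#### Detect missing values
```python
df.isnull()
```

#### Check whether any element is true
```python
df.any()
```

#### Fill missing values
```python
df.fillna()
```

#### Drop missing values
```python
df.dropna()
```

#### Replace values
```python
df.replace()
```

#### Replace values where a condition holds
```python
df.mask()
```

#### Drop rows or columns
```python
df.drop()
```

#### Count occurrences of each value in a column
```python
df['xxx'].value_counts()
```

#### Group rows
```python
df.groupby()
```

#### Difference from the previous row
```python
df.diff()
```

#### Rolling window calculations
```python
df.rolling()
```

#### Percentage change from the previous row
```python
df.pct_change()
```

#### Plot
```python
df.plot()
```

#### Reshape into a pivot layout
```python
df.pivot()
```

#### One-hot encode categorical values
```python
pd.get_dummies()
```

#### Write to CSV
```python
df.to_csv()
```

#### Show all columns when displaying a DataFrame
```python
pd.options.display.max_columns = None
```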


## Commonly used functions

As a final summary, here are the functions I use most often.

#### Changing the index (to an existing column)

```python
df.set_index('xxxx')
```

#### Renaming a column

```python
df.rename(columns={'before': 'after'}, inplace=True)
```

#### Sorting by a column

```python
df.sort_values(by='xxx', ascending=True)
```

#### Sorting by the index

```python
df.sort_index()
```

#### Converting to datetime type
```python
# to_datetime is a top-level pandas function, not a DataFrame method
pd.to_datetime(df['xxx'])
```

#### Number of NaN values per column
```python
df.isnull().sum()
```




## References
- [Cheat sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf)
- [An explanation of every read_csv argument](https://own-search-and-study.xyz/2015/09/03/pandas%E3%81%AEread_csv%E3%81%AE%E5%85%A8%E5%BC%95%E6%95%B0%E3%82%92%E4%BD%BF%E3%81%84%E3%81%93%E3%81%AA%E3%81%99/)