## 2-4 Sampling

In [3]:
import pandas as pd
import os


def load_hotel_reserve():
  customer_tb = pd.read_csv('../../../data/customer.csv')
  hotel_tb = pd.read_csv('../../../data/hotel.csv')
  reserve_tb = pd.read_csv('../../../data/reserve.csv')
  return customer_tb, hotel_tb, reserve_tb

customer_tb, hotel_tb, reserve_tb = load_hotel_reserve()


### Series
https://pandas.pydata.org/pandas-docs/stable/dsintro.html

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call:

`>>> s = pd.Series(data, index=index)`

Here, data can be many different things:

- a Python dict
- an ndarray
- a scalar value (like 5)

The passed index is a list of axis labels. Thus, this separates into a few cases depending on what data is:

#### From ndarray

If data is an ndarray, index must be the same length as data. If no index is passed, one will be created having values [0, ..., len(data) - 1].

```
In [3]: s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [4]: s
Out[4]: 
a    0.4691
b   -0.2829
c   -1.5091
d   -1.1356
e    1.2121
dtype: float64
```
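The other cases (no explicit index, a dict, and a scalar) can be sketched the same way. This is a minimal illustration with made-up values; the variable names are not from the pandas documentation.

```python
import numpy as np
import pandas as pd

# From an ndarray without an explicit index: labels default to 0..len(data)-1.
s_default = pd.Series(np.array([10.0, 20.0, 30.0]))

# From a dict: the keys become the index labels.
s_dict = pd.Series({'a': 1, 'b': 2, 'c': 3})

# From a scalar: the value is repeated for every label in the index.
s_scalar = pd.Series(5, index=['x', 'y'])

print(list(s_default.index))  # [0, 1, 2]
print(list(s_dict.index))     # ['a', 'b', 'c']
print(s_scalar.tolist())      # [5, 5]
```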

In [8]:
unique_ids = reserve_tb['customer_id'].unique()
print('size:', len(unique_ids))
pd.Series(unique_ids).head(10)

size: 888


0     c_1
1     c_2
2     c_3
3     c_4
4     c_5
5     c_6
6     c_7
7     c_8
8     c_9
9    c_10
dtype: object

In [9]:
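# The book's code starts from the line below
# reserve_tb['customer_id'].unique() returns the customer_id values with duplicates removed
# Convert to pandas.Series (the pandas list-like object) so that the sample function can be used
# Sample the customer IDs with the sample function
target = pd.Series(reserve_tb['customer_id'].unique()).sample(frac=0.2)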
target

580    c_660
343    c_390
862    c_972
573    c_652
872    c_983
772    c_869
416    c_472
26      c_30
19      c_22
438    c_502
377    c_430
362    c_413
462    c_528
865    c_976
349    c_397
637    c_723
168    c_196
795    c_897
331    c_377
361    c_411
572    c_651
290    c_333
813    c_916
209    c_241
55      c_64
567    c_646
780    c_878
208    c_240
524    c_597
851    c_961
       ...  
520    c_593
102    c_114
550    c_625
263    c_304
75      c_86
49      c_58
231    c_265
23      c_27
127    c_147
870    c_981
715    c_810
554    c_629
575    c_654
247    c_284
414    c_470
771    c_868
551    c_626
74      c_85
364    c_415
93     c_105
367    c_419
330    c_376
291    c_334
687    c_779
192    c_224
110    c_126
784    c_882
139    c_163
191    c_222
696    c_791
Length: 178, dtype: object
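
Because `sample` draws at random, the IDs shown above differ from run to run. The following is a minimal sketch, using made-up IDs rather than `reserve.csv`, of how `frac` determines the sample size and how the `random_state` parameter makes the draw repeatable. The seed value `0` is an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical customer IDs standing in for reserve_tb['customer_id'].unique().
ids = pd.Series(['c_%d' % i for i in range(1, 11)])

# frac=0.2 of 10 IDs gives 2 IDs; sampling is without replacement by default.
first = ids.sample(frac=0.2, random_state=0)
second = ids.sample(frac=0.2, random_state=0)

print(len(first))            # 2
print(first.equals(second))  # True: same seed, same sample
```

With the 888 unique IDs above, 20% is 177.6, which is consistent with the `Length: 178` reported in the output.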

### pandas.DataFrame.isin
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.isin.html

`DataFrame.isin(values)`

    Return boolean DataFrame showing whether each element in the DataFrame is contained in values.


#### Examples

When values is a list:

```
>>> df = DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']})
>>> df.isin([1, 3, 12, 'a'])
       A      B
0   True   True
1  False  False
2   True  False

```

When values is a dict:

```
>>> df = DataFrame({'A': [1, 2, 3], 'B': [1, 4, 7]})
>>> df.isin({'A': [1, 3], 'B': [4, 7, 12]})
       A      B
0   True  False  # Note that B didn't match the 1 here.
1  False   True
2   True   True
```

When values is a Series or DataFrame:

```
>>> df = DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']})
>>> other = DataFrame({'A': [1, 3, 3, 2], 'B': ['e', 'f', 'f', 'e']})
>>> df.isin(other)
       A      B
0   True  False
1  False  False  # Column A in `other` has a 3, but not at index 1.
2   True   True
```

In [10]:
# Use the isin function to extract the rows whose customer_id matches any of the sampled customer IDs
# (target is a pandas Series)
reserve_tb[reserve_tb['customer_id'].isin(target)]

|  | reserve_id | hotel_id | customer_id | reserve_datetime | checkin_date | checkin_time | checkout_date | people_num | total_price |
|---|---|---|---|---|---|---|---|---|---|
| 38 | r39 | h_232 | c_7 | 2016-03-21 16:14:52 | 2016-04-10 | 10:00:00 | 2016-04-11 | 2 | 57000 |
| 39 | r40 | h_102 | c_7 | 2016-09-01 01:19:57 | 2016-09-17 | 11:30:00 | 2016-09-18 | 3 | 18000 |
| 40 | r41 | h_23 | c_7 | 2016-10-14 02:41:13 | 2016-10-15 | 12:00:00 | 2016-10-16 | 2 | 130200 |
| 41 | r42 | h_63 | c_7 | 2016-11-11 12:42:32 | 2016-11-26 | 09:00:00 | 2016-11-29 | 1 | 44700 |
| 42 | r43 | h_224 | c_7 | 2017-04-22 20:03:46 | 2017-05-22 | 11:00:00 | 2017-05-25 | 3 | 36000 |
| 43 | r44 | h_145 | c_7 | 2017-06-23 23:56:10 | 2017-07-18 | 09:30:00 | 2017-07-21 | 2 | 112800 |
| 44 | r45 | h_104 | c_7 | 2017-09-28 11:44:02 | 2017-10-05 | 09:00:00 | 2017-10-07 | 1 | 84400 |
| 102 | r103 | h_118 | c_18 | 2016-05-24 10:42:22 | 2016-05-31 | 09:00:00 | 2016-06-03 | 4 | 129600 |
| 104 | r105 | h_233 | c_21 | 2016-02-28 09:18:25 | 2016-03-24 | 09:30:00 | 2016-03-26 | 3 | 53400 |
| 105 | r106 | h_266 | c_21 | 2016-07-21 11:10:16 | 2016-07-26 | 10:30:00 | 2016-07-28 | 4 | 92800 |
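
One consequence of sampling customer IDs first and filtering with `isin` afterwards is that every reservation of a chosen customer is kept, as with the seven `c_7` rows above. The following is a minimal sketch on a hypothetical toy table; the data, the use of `n=1`, and `random_state=0` are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical toy reservation table; not the book's reserve.csv.
toy = pd.DataFrame({
    'reserve_id': ['r1', 'r2', 'r3', 'r4', 'r5', 'r6'],
    'customer_id': ['c_1', 'c_1', 'c_2', 'c_3', 'c_3', 'c_3'],
})

# Sample customers (not rows), then keep every row of the sampled customers.
sampled_ids = pd.Series(toy['customer_id'].unique()).sample(n=1, random_state=0)
subset = toy[toy['customer_id'].isin(sampled_ids)]

# Each sampled customer keeps their full reservation history.
full_counts = toy['customer_id'].value_counts()
subset_counts = subset['customer_id'].value_counts()
print((subset_counts == full_counts[subset_counts.index]).all())  # True
```

Sampling rows directly, for example with `toy.sample(frac=0.2)`, could instead keep only some of a customer's reservations.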


In [59]:
# Example code: filter the rows of df by testing one column against a Series
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']})
print('DataFrame df :\n', df)
print()
other = pd.Series([1, 3])
print('Series other :\n', other)
print()
print('A column: \n', df['A'])
print()
# Series.isin returns a boolean mask: True where the value of A appears in `other`
print('find values in column A that matches in `other`: \n', df['A'].isin(other))
# Use the mask to keep only the matching rows
df[df['A'].isin(other)]

DataFrame df :
    A  B
0  1  a
1  2  b
2  3  f

Series other :
 0    1
1    3
dtype: int64

A column: 
 0    1
1    2
2    3
Name: A, dtype: int64

find values in column A that matches in `other`: 
 0     True
1    False
2     True
Name: A, dtype: bool


|  | A | B |
|---|---|---|
| 0 | 1 | a |
| 2 | 3 | f |


In [32]:
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']})
print('DataFrame df :\n', df)
other = pd.DataFrame({'A': [1, 3, 3, 2], 'B': ['e', 'f', 'f', 'e']})
print('DataFrame other :\n', other)
df.isin(other)

DataFrame df :
    A  B
0  1  a
1  2  b
2  3  f
DataFrame other :
    A  B
0  1  e
1  3  f
2  3  f
3  2  e


|  | A | B |
|---|---|---|
| 0 | True | False |
| 1 | False | False |
| 2 | True | True |
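
The `False` at row 1, column A shows that `DataFrame.isin` with a DataFrame argument compares cells at matching index and column labels. The following short sketch contrasts that with `Series.isin`, which only tests membership in the given values, ignores the index of its argument, and is what the customer ID filter relies on. The names `aligned` and `by_value` are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']})
other = pd.DataFrame({'A': [1, 3, 3, 2], 'B': ['e', 'f', 'f', 'e']})

# DataFrame.isin(DataFrame): a cell is True only if the value matches
# the cell with the same index and column label in `other`.
aligned = df.isin(other)

# Series.isin: True if the value appears anywhere in the given values.
by_value = df['A'].isin(other['A'])

print(aligned['A'].tolist())  # [True, False, True]
print(by_value.tolist())      # [True, True, True]
```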
