# README
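- [Building features from a DataFrame by hand is way too tedious, so let's auto-generate them with featuretools - Qiita](https://qiita.com/Hyperion13fleet/items/4eaca365f28049fe11c7)
- I ran the article above by copy-paste, so this notebook is me redoing it by hand to understand it better
- In particular, I really can't stand the abbreviated names!!!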


# Preparing the data

In [1]:
import featuretools
import pandas as pd
import copy


data = featuretools.demo.load_mock_customer()

In [2]:
# Typing the keys every time is tedious and the types are hard to see, so bind them to variables
df_customers = data['customers']
df_sessions = data['sessions']

# Inspecting the data
- Take a good look at how Customers and Sessions relate to each other

In [3]:
df_customers

Unnamed: 0,customer_id,zip_code,join_date,date_of_birth
0,1,60091,2011-04-17 10:48:33,1994-07-18
1,2,13244,2012-04-15 23:31:04,1986-08-18
2,3,13244,2011-08-13 15:42:34,2003-11-21
3,4,60091,2011-04-08 20:08:14,2006-08-15
4,5,60091,2010-07-17 05:27:50,1984-07-28


In [4]:
df_sessions.head()

Unnamed: 0,session_id,customer_id,device,session_start
0,1,2,desktop,2014-01-01 00:00:00
1,2,5,mobile,2014-01-01 00:17:20
2,3,4,mobile,2014-01-01 00:28:10
3,4,1,mobile,2014-01-01 00:44:25
4,5,4,mobile,2014-01-01 01:11:30


# Building features with featuretools

## Creating the EntitySet and registering the entities

In [5]:
entity_set = featuretools.EntitySet(id='data')

entity_set.entity_from_dataframe(entity_id='customer',
                                 dataframe=df_customers,
                                 index='customer_id')

entity_set.entity_from_dataframe(entity_id='session',
                                 dataframe=df_sessions,
                                 index='session_id')

entity_set

Entityset: data
  Entities:
    customer [Rows: 5, Columns: 4]
    session [Rows: 35, Columns: 4]
  Relationships:
    No relationships

## Creating the Relationship and registering it

In [6]:
relation_customer_and_session = featuretools.Relationship(parent_variable=entity_set['customer']['customer_id'],
                                                          child_variable=entity_set['session']['customer_id'])

In [7]:
entity_set.add_relationship(relationship=relation_customer_and_session)

Entityset: data
  Entities:
    customer [Rows: 5, Columns: 4]
    session [Rows: 35, Columns: 4]
  Relationships:
    session.customer_id -> customer.customer_id
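
To keep the parent/child direction straight, here is a minimal plain-Python sketch (not featuretools itself) that groups the first five session rows shown earlier by their `customer_id`. The helper name `children_by_parent` is hypothetical, made up for this note.

```python
from collections import defaultdict

# First five rows of df_sessions as (session_id, customer_id) pairs
sessions = [(1, 2), (2, 5), (3, 4), (4, 1), (5, 4)]
customer_ids = [1, 2, 3, 4, 5]


def children_by_parent(parent_ids, child_rows):
    """Map each parent id to the list of child ids that point at it."""
    grouped = defaultdict(list)
    for child_id, parent_id in child_rows:
        grouped[parent_id].append(child_id)
    # Parents without any children still get an (empty) entry
    return {pid: grouped.get(pid, []) for pid in parent_ids}


sessions_per_customer = children_by_parent(customer_ids, sessions)
print(sessions_per_customer)
```

One customer (the parent) can own many sessions (the children), which is why `customer` goes on the parent side of the Relationship.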

### Watch out for wrong Relationships

In [8]:
# Caution! A nonsensical Relationship does not raise an error
dummy_entity_set = copy.deepcopy(entity_set)

bad_relation = featuretools.Relationship(parent_variable=entity_set['customer']['customer_id'],
                                         child_variable=entity_set['session']['session_id'])

dummy_entity_set.add_relationship(relationship=bad_relation)

Entityset: data
  Entities:
    customer [Rows: 5, Columns: 4]
    session [Rows: 35, Columns: 4]
  Relationships:
    session.customer_id -> customer.customer_id
    session.session_id -> customer.customer_id
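
Since the library accepts the bad Relationship silently, a cheap check before registering one might help. The following is a minimal sketch in plain Python, assuming session ids run from 1 to 35 (the output only tells us there are 35 rows, so the exact ids are an assumption). `orphan_keys` is a hypothetical helper, not a featuretools function.

```python
def orphan_keys(parent_index, child_values):
    """Return child key values that have no matching row in the parent index."""
    return sorted(set(child_values) - set(parent_index))


customer_ids = [1, 2, 3, 4, 5]

# customer_id column of the first five session rows shown earlier
session_customer_ids = [2, 5, 4, 1, 4]

# Assumption: session_id runs 1..35 (the entity reports 35 rows)
session_ids = list(range(1, 36))

good = orphan_keys(customer_ids, session_customer_ids)
bad = orphan_keys(customer_ids, session_ids)

print(good)      # no orphans: every session points at a real customer
print(len(bad))  # many orphans: most session ids are not customer ids
```

A non-empty result is a strong hint that the child column is not really a foreign key to the parent.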

### Specifying a nonexistent column raises a KeyError, so that kind of mistake is easy to catch

In [9]:
# KeyError: 'Variable: HOGE not found in entity'
# featuretools.Relationship(parent_variable=entity_set['customer']['HOGE'],
#                                          child_variable=entity_set['session']['HOGE'])

# Running DFS

## Trying it without applying any aggregation functions

In [10]:
df_feature_0, feature_defs_0 = featuretools.dfs(entityset=entity_set,
                                                target_entity='session',
                                                agg_primitives=None,
                                                trans_primitives=None, # first, try without specifying any
                                                max_depth=1)

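### Comparing the DataFrames
1. DAY(session_start)
    1. 'session_start' (datetime type) seems to get decomposed automatically → is some function being applied by default??
    2. The function name shows up in uppercase in the column name
    3. The original column name goes inside the parentheses
2. The tables have been joined
    1. session_id ends up in the index
3. 'zip_code' from Customers was included in the DataFrame
4. 'join_date' and 'date_of_birth' from Customers were not included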
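
To make observation 1 concrete, here is a minimal sketch using only the standard library's `datetime` that reproduces the four derived values and the `NAME(column)` naming pattern for the first session row. The `decompose` helper is hypothetical and only imitates the naming; it is not how featuretools computes anything internally.

```python
from datetime import datetime


def decompose(column_name, value):
    """Split a datetime into day/year/month/weekday with NAME(column) keys."""
    return {
        f"DAY({column_name})": value.day,
        f"YEAR({column_name})": value.year,
        f"MONTH({column_name})": value.month,
        # datetime.weekday() counts Monday as 0
        f"WEEKDAY({column_name})": value.weekday(),
    }


# session_start of session_id 1
features = decompose("session_start", datetime(2014, 1, 1, 0, 0, 0))
print(features)
```

2014-01-01 was a Wednesday, so the weekday comes out as 2, which lines up with the `WEEKDAY(session_start)` column in the DFS output.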

In [11]:
df_feature_0.head(2)

session_id,customer_id,device,DAY(session_start),YEAR(session_start),MONTH(session_start),WEEKDAY(session_start),customer.zip_code
1,2,desktop,1,2014,1,2,13244
2,5,mobile,1,2014,1,2,60091


In [12]:
df_sessions.head(2)

Unnamed: 0,session_id,customer_id,device,session_start
0,1,2,desktop,2014-01-01 00:00:00
1,2,5,mobile,2014-01-01 00:17:20


In [13]:
df_customers.head(2)

Unnamed: 0,customer_id,zip_code,join_date,date_of_birth
0,1,60091,2011-04-17 10:48:33,1994-07-18
1,2,13244,2012-04-15 23:31:04,1986-08-18
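
The `customer.zip_code` column looks like a plain lookup from each session to its parent customer. A minimal sketch of that idea in plain Python, using values from the tables in this notebook (zip codes kept as strings here, which is my own choice); `add_parent_columns` is a hypothetical helper, not part of featuretools:

```python
# Parent rows keyed by customer_id (values taken from df_customers)
customers = {
    1: {"zip_code": "60091"},
    2: {"zip_code": "13244"},
    5: {"zip_code": "60091"},
}

# First two child rows from df_sessions
sessions = [
    {"session_id": 1, "customer_id": 2, "device": "desktop"},
    {"session_id": 2, "customer_id": 5, "device": "mobile"},
]


def add_parent_columns(child_rows, parents, key, parent_name, columns):
    """Copy selected parent columns onto each child row as 'parent.column'."""
    enriched = []
    for row in child_rows:
        parent = parents[row[key]]  # look up the parent via the foreign key
        new_row = dict(row)
        for col in columns:
            new_row[f"{parent_name}.{col}"] = parent[col]
        enriched.append(new_row)
    return enriched


joined = add_parent_columns(sessions, customers, "customer_id", "customer", ["zip_code"])
for row in joined:
    print(row["session_id"], row["customer.zip_code"])
```

The results match the `customer.zip_code` values in `df_feature_0.head(2)`.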


## Applying a transform primitive to the datetime column
1. Discovered that `trans_primitives` has a default value!
    - `Default: ["day", "year", "month", "weekday", "haversine", "num_words", "num_characters"]`
2. Same as item 1: `agg_primitives` also has defaults, but nothing in Customers and Sessions was eligible for them
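
The "`None` means defaults" behaviour can be sketched as follows. This is only an imitation written for this note, assuming the default list quoted above and assuming that only the four date primitives apply to a datetime column; `datetime_feature_names` is a hypothetical function and not featuretools' real implementation.

```python
# Default list as quoted from the documentation above
DEFAULT_TRANS_PRIMITIVES = [
    "day", "year", "month", "weekday",
    "haversine", "num_words", "num_characters",
]

# Assumption: only these primitives are applicable to a datetime column
DATETIME_PRIMITIVES = {"day", "year", "month", "weekday"}


def datetime_feature_names(column, trans_primitives=None):
    """Return the feature names a datetime column would get."""
    if trans_primitives is None:
        # None does not mean "nothing"; it falls back to the defaults
        trans_primitives = DEFAULT_TRANS_PRIMITIVES
    return [
        f"{name.upper()}({column})"
        for name in trans_primitives
        if name in DATETIME_PRIMITIVES
    ]


default_names = datetime_feature_names("session_start")
year_only = datetime_feature_names("session_start", ["year"])
none_at_all = datetime_feature_names("session_start", [])

print(default_names)
print(year_only)
print(none_at_all)
```

In this sketch an empty list is how "no primitives at all" is expressed. Whether `featuretools.dfs` treats `[]` the same way is not verified in this notebook, so check the documentation for the installed version.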

In [14]:
agg_trans = ['year'] # try changing this to see what happens

df_feature_1, feature_defs_1 = featuretools.dfs(entityset=entity_set,
                                                target_entity='session',
                                                agg_primitives=None,
                                                trans_primitives=agg_trans, 
                                                max_depth=1)

In [15]:
df_feature_1.head(3)

session_id,customer_id,device,YEAR(session_start),customer.zip_code
1,2,desktop,2014,13244
2,5,mobile,2014,60091
3,4,mobile,2014,60091


In [16]:
df_feature_0.head(3)

session_id,customer_id,device,DAY(session_start),YEAR(session_start),MONTH(session_start),WEEKDAY(session_start),customer.zip_code
1,2,desktop,1,2014,1,2,13244
2,5,mobile,1,2014,1,2,60091
3,4,mobile,1,2014,1,2,60091


## Wrapping up
I'll stop here for now. This level of granularity feels about right.