## About

This notebook shows how to use [`tsfresh`](https://tsfresh.readthedocs.io/en/latest/index.html) to generate a large number of features from the `spectrum` data.

## Preparation

`tsfresh` expects its input in a specific format, so we convert the spectra from one file per `spectrum_filename` (for example `hogehoge.dat`) into a single combined table.

### Libraries

In [1]:
import pandas as pd

from pathlib import Path

from fastprogress import progress_bar, master_bar

### Put every spectrum data in one DataFrame

In [2]:
data_dir = Path("../input/atma5/")
spectrum_dir = data_dir / "spectrum"

In [3]:
mb = master_bar(["train", "test"])
spectrum_dfs = {}

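for phase in mb:
    # train.csv / test.csv list the spectrum file that belongs to each sample
    df = pd.read_csv(data_dir / (phase + ".csv"))
    dfs = []

    for filename in progress_bar(df.spectrum_filename, parent=mb):
        # Each .dat file is tab-separated with no header: wavelength, intensity
        spec = pd.read_csv(
            spectrum_dir / filename,
            sep="\t",
            header=None,
            names=["wl", "intensity"])
        # Tag every row with its source file so it can serve as the series id
        spec["spectrum_filename"] = filename
        dfs.append(spec)

    # Stack all spectra into one long-format DataFrame per phase
    spectrums = pd.concat(dfs, axis=0).reset_index(drop=True)
    spectrum_dfs[phase] = spectrums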

In [4]:
spectrum_dfs["train"].head()

,wl,intensity,spectrum_filename
0,1032.836,1751.0,b2e223339f4abce9b400.dat
1,1033.886,1493.0,b2e223339f4abce9b400.dat
2,1034.936,1299.0,b2e223339f4abce9b400.dat
3,1035.986,1120.0,b2e223339f4abce9b400.dat
4,1037.036,900.0,b2e223339f4abce9b400.dat


In [5]:
spectrum_dfs["test"].head()

,wl,intensity,spectrum_filename
0,1032.836,30.0,fe0fb0a5d966d574c98b.dat
1,1033.886,-91.0,fe0fb0a5d966d574c98b.dat
2,1034.936,-148.0,fe0fb0a5d966d574c98b.dat
3,1035.986,71.0,fe0fb0a5d966d574c98b.dat
4,1037.036,36.0,fe0fb0a5d966d574c98b.dat


## Feature extraction with `tsfresh`

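`tsfresh` is a library for extracting features from time-series data.
The data here is not a time series, but each spectrum is a sequence of `wl` (wavelength) and `intensity` pairs, so we treat it like a time series and use `tsfresh` to generate many features.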

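`tsfresh` works on data shaped like the tables above: one column identifying each series (the `id`, here `spectrum_filename`), one column representing time (here `wl`, the wavelength, stands in for time), and one column holding the values to extract features from (here `intensity`). From such data it generates a large number of features per `id`.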
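To make the roles of the three columns concrete, here is a minimal, dependency-free sketch of what "features per `id`" means. It is not how `tsfresh` is implemented; the rows are toy values and `series_by_id` is a hypothetical helper. The two calculators follow the usual definitions of `abs_energy` (sum of squares) and `absolute_sum_of_changes` (sum of absolute consecutive differences).

```python
from collections import defaultdict

# Long-format rows: (id, sort key, value). Toy values, not taken from the dataset.
rows = [
    ("a.dat", 3.0, 5.0),
    ("a.dat", 1.0, 1.0),
    ("a.dat", 2.0, 4.0),
    ("b.dat", 1.0, 2.0),
    ("b.dat", 2.0, 2.0),
]


def series_by_id(rows):
    """Group values by id, ordered by the sort key (the role of column_sort)."""
    grouped = defaultdict(list)
    for id_, key, value in rows:
        grouped[id_].append((key, value))
    return {id_: [v for _, v in sorted(pairs)] for id_, pairs in grouped.items()}


def abs_energy(x):
    """Sum of squared values; does not depend on ordering."""
    return sum(v * v for v in x)


def absolute_sum_of_changes(x):
    """Sum of |x[i+1] - x[i]|; depends on ordering."""
    return sum(abs(b - a) for a, b in zip(x, x[1:]))


# One feature row per id, which is the shape extract_features returns.
features = {
    id_: {
        "abs_energy": abs_energy(x),
        "absolute_sum_of_changes": absolute_sum_of_changes(x),
    }
    for id_, x in series_by_id(rows).items()
}
print(features)
```

Order-dependent features such as `absolute_sum_of_changes` are the reason a sort column matters: without sorting `a.dat` by its key, the result would differ.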

### Libraries

In [6]:
from tsfresh import extract_features, extract_relevant_features
from tsfresh.feature_extraction import settings

### Basic Approach

Here we run feature extraction in the simplest possible way. Because it takes a little while, the example uses a reduced subset of the data.

In [7]:
TEST = True

In [8]:
if TEST:
    spec_train = spectrum_dfs["train"]
    uniq_filenames = spec_train.spectrum_filename.unique()
    df = spec_train[spec_train.spectrum_filename.isin(uniq_filenames[:500])]
else:
    df = spectrum_dfs["train"]

In [9]:
df

,wl,intensity,spectrum_filename
0,1032.836,1751.0,b2e223339f4abce9b400.dat
1,1033.886,1493.0,b2e223339f4abce9b400.dat
2,1034.936,1299.0,b2e223339f4abce9b400.dat
3,1035.986,1120.0,b2e223339f4abce9b400.dat
4,1037.036,900.0,b2e223339f4abce9b400.dat
...,...,...,...
255995,1560.748,-27.0,de627b44a98fafd3ac9d.dat
255996,1561.781,-64.0,de627b44a98fafd3ac9d.dat
255997,1562.813,149.0,de627b44a98fafd3ac9d.dat
255998,1563.845,107.0,de627b44a98fafd3ac9d.dat


In [10]:
X = extract_features(df, column_id="spectrum_filename", column_sort="wl", n_jobs=8)
X.head()

Feature Extraction: 100%|██████████| 39/39 [00:29<00:00,  1.30it/s]


id,intensity__abs_energy,intensity__absolute_sum_of_changes,"intensity__agg_autocorrelation__f_agg_""mean""__maxlag_40","intensity__agg_autocorrelation__f_agg_""median""__maxlag_40","intensity__agg_autocorrelation__f_agg_""var""__maxlag_40","intensity__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""max""","intensity__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""mean""","intensity__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""min""","intensity__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""var""","intensity__agg_linear_trend__attr_""intercept""__chunk_len_50__f_agg_""max""",...,intensity__symmetry_looking__r_0.9500000000000001,intensity__time_reversal_asymmetry_statistic__lag_1,intensity__time_reversal_asymmetry_statistic__lag_2,intensity__time_reversal_asymmetry_statistic__lag_3,intensity__value_count__value_-1,intensity__value_count__value_0,intensity__value_count__value_1,intensity__variance,intensity__variance_larger_than_standard_deviation,intensity__variation_coefficient
0048906881fc43eae1e2.dat,133032900.0,63954.888926,0.40111,0.327185,0.074556,301.696662,123.873682,-59.791727,18392.02335,506.227273,...,1.0,7494.391719,-112439.2,-360579.6,1.0,0.0,0.0,223398.816127,1.0,2.476308
0088433f301fbf03d626.dat,28620880.0,56772.0,0.280247,0.199171,0.056259,336.605225,162.427705,-17.444122,13420.349631,596.818182,...,1.0,104748.455295,136550.9,506529.2,2.0,0.0,1.0,42578.204033,1.0,1.787762
0155974b49445f2528c6.dat,24188530.0,55213.5556,0.168302,0.144878,0.026952,384.144494,203.034164,43.061684,13366.317087,679.772727,...,1.0,19935.733286,-61548.49,-246015.6,0.0,1.0,1.0,21346.874141,1.0,0.907921
016b8a8e5b23b4c05d72.dat,230881200.0,61523.5556,0.219614,0.057999,0.093169,810.412192,436.086091,131.764151,200343.669469,1727.090909,...,1.0,711812.658264,3656261.0,11687820.0,3.0,0.0,0.0,403375.159374,1.0,2.912143
018abb6ef8f19ab8fbb9.dat,33991630.0,57867.7778,0.180222,0.058144,0.06124,362.72948,154.647452,-42.560958,18525.124487,686.5,...,1.0,-158960.289622,53662.02,318659.6,1.0,4.0,0.0,59404.75383,1.0,2.916235


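763 features were generated automatically. How they are computed is described in detail in the [official documentation](https://tsfresh.readthedocs.io/en/latest/api/tsfresh.feature_extraction.html#module-tsfresh.feature_extraction.feature_calculators). In short, `tsfresh` ships many predefined feature calculator functions and applies them automatically. Some calculators take parameters, and once every parameter combination is counted the number of features becomes very large.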
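A minimal sketch of how parameter combinations multiply for a single calculator. The grid below is hypothetical: its values echo the `agg_linear_trend` column names visible in this notebook's outputs, but the full grid `tsfresh` uses may differ.

```python
from itertools import product

# Hypothetical parameter grid for one calculator (illustrative values only).
grid = {
    "attr": ["intercept", "stderr"],
    "chunk_len": [5, 10, 50],
    "f_agg": ["max", "mean", "min", "var"],
}

# One dict per combination: the Cartesian product of all parameter values.
names = sorted(grid)
param_sets = [
    dict(zip(names, combo))
    for combo in product(*(grid[name] for name in names))
]

print(len(param_sets))  # 2 * 3 * 4 combinations
print(param_sets[0])
```

Each dictionary in `param_sets` would become one feature column, so a single calculator with three small parameter lists already contributes 24 features.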

Generated features are named following the pattern `<value column name>__<feature calculator name>__<parameter name>_<parameter value>`.
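As an illustration of this naming rule, the sketch below splits a column name back into its parts. `parse_feature_name` is a hypothetical helper written for this article; it assumes the value column name contains no double underscore and that each parameter value is whatever follows the last underscore of its token. `tsfresh` provides its own `from_columns` function in `tsfresh.feature_extraction.settings` for turning column names back into settings, which is likely the better choice in practice.

```python
def parse_feature_name(name):
    """Split a tsfresh-style column name into (column, calculator, params)."""
    column, calculator, *param_tokens = name.split("__")
    params = {}
    for token in param_tokens:
        # Assumption: the value is everything after the last underscore.
        key, _, raw = token.rpartition("_")
        # String values are wrapped in double quotes in the column name.
        params[key] = raw.strip('"')
    return column, calculator, params


print(parse_feature_name('intensity__agg_autocorrelation__f_agg_"median"__maxlag_40'))
print(parse_feature_name("intensity__abs_energy"))
```

All parameter values come back as strings here; converting them to numbers or booleans is left out to keep the sketch short.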

### Changing the set of generated features

Above, a single function call produced 763 features at once, but which features to build can be customized (within the set of predefined feature calculators).

As an example, we build features using the calculators `abs_energy` and `agg_autocorrelation`. See the [documentation](https://tsfresh.readthedocs.io/en/latest/api/tsfresh.feature_extraction.html#module-tsfresh.feature_extraction.feature_calculators) for the available functions.

In [11]:
fc_parameters = {
    "abs_energy": None,
    "agg_autocorrelation": [
        {"f_agg": "median", "maxlag": 40},
        {"f_agg": "var", "maxlag": 40}
    ]
}

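Feature calculators are specified with a dictionary of the form `calculator name`: `parameters`, as above.

`abs_energy` takes no parameters, so its value is `None`.

`agg_autocorrelation` does take parameters, so we pass a list of dictionaries, each describing one parameter combination. Here we want to try two combinations, so the list holds two dictionaries. In general the list needs one dictionary per combination you want to try.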

The documentation lists which parameters each calculator requires.
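To connect the dictionary format with the naming rule, here is a small sketch that predicts the column names a settings dictionary should produce. `expected_feature_names` is a hypothetical helper, not part of `tsfresh`; the alphabetical parameter order and the double quotes around string values are inferred from the column names shown in this notebook.

```python
def expected_feature_names(value_column, fc_parameters):
    """Predict output column names for a settings dict (naming inferred from this notebook)."""

    def fmt(value):
        # String parameter values appear wrapped in double quotes.
        return f'"{value}"' if isinstance(value, str) else str(value)

    names = []
    for calculator, param_sets in fc_parameters.items():
        if not param_sets:
            # Calculators without parameters (value None) yield one column.
            names.append(f"{value_column}__{calculator}")
            continue
        for params in param_sets:
            suffix = "__".join(f"{key}_{fmt(params[key])}" for key in sorted(params))
            names.append(f"{value_column}__{calculator}__{suffix}")
    return names


fc_parameters = {
    "abs_energy": None,
    "agg_autocorrelation": [
        {"f_agg": "median", "maxlag": 40},
        {"f_agg": "var", "maxlag": 40},
    ],
}

for name in expected_feature_names("intensity", fc_parameters):
    print(name)
```

One column per parameterless calculator plus one per parameter dictionary gives three columns for this example.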

Now let's generate features with these settings.

In [12]:
X = extract_features(df, 
                     column_id="spectrum_filename", 
                     column_sort="wl",
                     n_jobs=8,
                     default_fc_parameters=fc_parameters)
X.head()

Feature Extraction: 100%|██████████| 39/39 [00:00<00:00, 434.75it/s]


id,intensity__abs_energy,"intensity__agg_autocorrelation__f_agg_""median""__maxlag_40","intensity__agg_autocorrelation__f_agg_""var""__maxlag_40"
0048906881fc43eae1e2.dat,133032900.0,0.327185,0.074556
0088433f301fbf03d626.dat,28620880.0,0.199171,0.056259
0155974b49445f2528c6.dat,24188530.0,0.144878,0.026952
016b8a8e5b23b4c05d72.dat,230881200.0,0.057999,0.093169
018abb6ef8f19ab8fbb9.dat,33991630.0,0.058144,0.06124


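The generated features are exactly the ones we specified.

### Specifying feature sets in predefined bundles

Building the dictionary for `default_fc_parameters` yourself gives a lot of control, but it is tedious when you want many features at once. Fortunately, `tsfresh` provides several ready-made dictionaries that bundle feature settings.

They live in `tsfresh.feature_extraction.settings`.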

In [13]:
dir(settings)

['ComprehensiveFCParameters',
 'EfficientFCParameters',
 'IndexBasedFCParameters',
 'MinimalFCParameters',
 'TimeBasedFCParameters',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 'feature_calculators',
 'from_columns',
 'get_config_from_string',
 'getfullargspec',
 'pd',
 'product',
 'range']

The first few entries, whose names end in `FCParameters`, are the dictionaries that bundle feature-generation settings. Let's try them.

In [14]:
X = extract_features(df,
                     column_id="spectrum_filename",
                     column_sort="wl",
                     n_jobs=8,
                     default_fc_parameters=settings.MinimalFCParameters())
X.head()

Feature Extraction: 100%|██████████| 39/39 [00:00<00:00, 484.36it/s]


id,intensity__length,intensity__maximum,intensity__mean,intensity__median,intensity__minimum,intensity__standard_deviation,intensity__sum_values,intensity__variance
0048906881fc43eae1e2.dat,512.0,3400.0,190.869141,76.0,-361.0,472.65084,97724.999997,223398.816127
0088433f301fbf03d626.dat,512.0,1399.0,115.42079,85.0,-209.0,206.344867,59095.44441,42578.204033
0155974b49445f2528c6.dat,512.0,1271.0,160.923394,149.5,-172.0,146.105695,82392.7778,21346.874141
016b8a8e5b23b4c05d72.dat,512.0,5208.0,218.093099,102.5,-229.0,635.118225,111663.666689,403375.159374
018abb6ef8f19ab8fbb9.dat,512.0,2130.0,83.577257,47.0,-299.0,243.730905,42791.55557,59404.75383
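The eight columns above are plain summary statistics of each series. As a rough sketch of what they represent, the function below computes the same quantities for a single list of values with the standard library. `minimal_features` is a hypothetical name, and the use of population (divide by n) standard deviation and variance is an assumption based on NumPy's defaults, not something this notebook confirms.

```python
import statistics


def minimal_features(x):
    """Summary statistics matching the column names of the minimal feature set."""
    return {
        "length": len(x),
        "maximum": max(x),
        "mean": statistics.fmean(x),
        "median": statistics.median(x),
        "minimum": min(x),
        # Assumption: population variants (divide by n), as NumPy computes by default.
        "standard_deviation": statistics.pstdev(x),
        "sum_values": sum(x),
        "variance": statistics.pvariance(x),
    }


print(minimal_features([1.0, 2.0, 3.0, 6.0]))
```

Because the set is so small, it is a convenient way to check that the `id` and sort columns are wired up correctly before running a larger extraction.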


In [15]:
X = extract_features(df,
                     column_id="spectrum_filename",
                     column_sort="wl",
                     n_jobs=8,
                     default_fc_parameters=settings.EfficientFCParameters())
X.head()

Feature Extraction: 100%|██████████| 39/39 [00:13<00:00,  2.81it/s]


id,intensity__abs_energy,intensity__absolute_sum_of_changes,"intensity__agg_autocorrelation__f_agg_""mean""__maxlag_40","intensity__agg_autocorrelation__f_agg_""median""__maxlag_40","intensity__agg_autocorrelation__f_agg_""var""__maxlag_40","intensity__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""max""","intensity__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""mean""","intensity__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""min""","intensity__agg_linear_trend__attr_""intercept""__chunk_len_10__f_agg_""var""","intensity__agg_linear_trend__attr_""intercept""__chunk_len_50__f_agg_""max""",...,intensity__symmetry_looking__r_0.9500000000000001,intensity__time_reversal_asymmetry_statistic__lag_1,intensity__time_reversal_asymmetry_statistic__lag_2,intensity__time_reversal_asymmetry_statistic__lag_3,intensity__value_count__value_-1,intensity__value_count__value_0,intensity__value_count__value_1,intensity__variance,intensity__variance_larger_than_standard_deviation,intensity__variation_coefficient
0048906881fc43eae1e2.dat,133032900.0,63954.888926,0.40111,0.327185,0.074556,301.696662,123.873682,-59.791727,18392.02335,506.227273,...,1.0,7494.391719,-112439.2,-360579.6,1.0,0.0,0.0,223398.816127,1.0,2.476308
0088433f301fbf03d626.dat,28620880.0,56772.0,0.280247,0.199171,0.056259,336.605225,162.427705,-17.444122,13420.349631,596.818182,...,1.0,104748.455295,136550.9,506529.2,2.0,0.0,1.0,42578.204033,1.0,1.787762
0155974b49445f2528c6.dat,24188530.0,55213.5556,0.168302,0.144878,0.026952,384.144494,203.034164,43.061684,13366.317087,679.772727,...,1.0,19935.733286,-61548.49,-246015.6,0.0,1.0,1.0,21346.874141,1.0,0.907921
016b8a8e5b23b4c05d72.dat,230881200.0,61523.5556,0.219614,0.057999,0.093169,810.412192,436.086091,131.764151,200343.669469,1727.090909,...,1.0,711812.658264,3656261.0,11687820.0,3.0,0.0,0.0,403375.159374,1.0,2.912143
018abb6ef8f19ab8fbb9.dat,33991630.0,57867.7778,0.180222,0.058144,0.06124,362.72948,154.647452,-42.560958,18525.124487,686.5,...,1.0,-158960.289622,53662.02,318659.6,1.0,4.0,0.0,59404.75383,1.0,2.916235


This produces fewer features than the 763 generated earlier.

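### Generating features that are likely to help training

So far we have covered automatic feature extraction, but what we actually want are features that are useful for our own task. `tsfresh` can also, after generating features automatically, use statistical tests to judge whether each one is likely to help training and keep only those that are.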

Finally, we introduce `extract_relevant_features`, which keeps only the features that look useful for training.
Because the statistical tests use the task labels, this time we also need to pass the target.
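When hundreds of features are each tested against the target, some will look significant by chance, so the per-feature p-values are typically combined with a multiple-testing correction. The sketch below shows the Benjamini-Hochberg procedure as one simple example of such a correction. It is an illustration only: the p-values are made up, `benjamini_hochberg` is a hypothetical helper, and the exact tests and correction `tsfresh` applies may differ from this.

```python
def benjamini_hochberg(p_values, fdr_level=0.05):
    """Return a keep/drop flag per p-value under the Benjamini-Hochberg rule."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k (1-based) with p_(k) <= k / m * fdr_level.
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * fdr_level:
            cutoff = rank

    # Keep every feature ranked at or below the cutoff.
    keep = [False] * m
    for i in order[:cutoff]:
        keep[i] = True
    return keep


# Made-up p-values for four hypothetical features.
p_values = {"feat_a": 0.001, "feat_b": 0.04, "feat_c": 0.03, "feat_d": 0.6}
flags = benjamini_hochberg(list(p_values.values()))
selected = [name for name, kept in zip(p_values, flags) if kept]
print(selected)
```

Note that `feat_b` and `feat_c` would pass a naive 0.05 threshold on their own but are dropped once the correction accounts for the number of tests.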

In [16]:
train = pd.read_csv(data_dir / "train.csv")
train.head()

,spectrum_id,spectrum_filename,chip_id,exc_wl,layout_a,layout_x,layout_y,pos_x,target
0,000da4633378740f1ee8,b2e223339f4abce9b400.dat,79ad4647da6de6425abf,850,2,36,140,1313.081,0
1,000ed1a5a9fe0ad2b7dd,e2f150a503244145e7ce.dat,79ad4647da6de6425abf,780,3,0,168,159.415,0
2,0016e3322c4ce0700f9a,3d58b7ccaee157979cf0.dat,c695a1e61e002b34e556,780,1,34,29,-610.7688,0
3,00256bd0f8c6cf5f59c8,ed3641184d3b7c0ae703.dat,c695a1e61e002b34e556,780,2,32,139,1214.618,0
4,003483ee5ae313d37590,4c63418d39f86dfab9bb.dat,c695a1e61e002b34e556,780,0,45,85,-257.6161,0


In [17]:
y = df.merge(
    train, 
    how="left", 
    on="spectrum_filename").set_index("spectrum_filename").target
y

spectrum_filename
b2e223339f4abce9b400.dat    0
b2e223339f4abce9b400.dat    0
b2e223339f4abce9b400.dat    0
b2e223339f4abce9b400.dat    0
b2e223339f4abce9b400.dat    0
                           ..
de627b44a98fafd3ac9d.dat    0
de627b44a98fafd3ac9d.dat    0
de627b44a98fafd3ac9d.dat    0
de627b44a98fafd3ac9d.dat    0
de627b44a98fafd3ac9d.dat    0
Name: target, Length: 256000, dtype: int64

In [18]:
y = y.groupby("spectrum_filename").mean()
y

spectrum_filename
0048906881fc43eae1e2.dat    0
0088433f301fbf03d626.dat    0
0155974b49445f2528c6.dat    0
016b8a8e5b23b4c05d72.dat    0
018abb6ef8f19ab8fbb9.dat    0
                           ..
fe2b5f1ac937379dfc30.dat    0
fe9018a12792fd8073ca.dat    0
fee08b41362960ac7716.dat    0
ff534a07ee95f77f8347.dat    0
ff78d191f2e597a24c92.dat    0
Name: target, Length: 500, dtype: int64
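The same per-`id` target can also be built without merging the labels onto every row of the long table. The sketch below uses toy stand-ins for `train` and `df`, and assumes each `spectrum_filename` appears exactly once in `train`; that assumption is checked explicitly rather than taken from the data.

```python
import pandas as pd

# Toy stand-ins for `train` and the long-format `df` (values are made up).
train = pd.DataFrame({
    "spectrum_filename": ["a.dat", "b.dat", "c.dat"],
    "target": [0, 1, 0],
})
df = pd.DataFrame({
    "spectrum_filename": ["a.dat", "a.dat", "c.dat", "c.dat"],
    "wl": [1.0, 2.0, 1.0, 2.0],
    "intensity": [10.0, 12.0, 7.0, 9.0],
})

# Assumption: one label row per spectrum file.
assert train["spectrum_filename"].is_unique

# Select the labels for exactly the ids present in the long table.
ids = df["spectrum_filename"].unique()
y = train.set_index("spectrum_filename")["target"].loc[ids]
print(y)
```

The result is a Series indexed by `spectrum_filename` with one entry per series, which is the shape the target needs when it is passed alongside `column_id`.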

In [19]:
X = extract_relevant_features(df, y, column_id="spectrum_filename", column_sort="wl")
X.head()

Feature Extraction: 100%|██████████| 20/20 [00:42<00:00,  2.14s/it]


id,intensity__ratio_beyond_r_sigma__r_10,intensity__large_standard_deviation__r_0.1,intensity__ar_coefficient__coeff_2__k_10,"intensity__change_quantiles__f_agg_""var""__isabs_True__qh_1.0__ql_0.8","intensity__change_quantiles__f_agg_""var""__isabs_True__qh_1.0__ql_0.6","intensity__change_quantiles__f_agg_""var""__isabs_True__qh_1.0__ql_0.4","intensity__change_quantiles__f_agg_""var""__isabs_True__qh_1.0__ql_0.0","intensity__fft_coefficient__attr_""abs""__coeff_71","intensity__fft_coefficient__attr_""abs""__coeff_65",intensity__ar_coefficient__coeff_3__k_10,...,"intensity__change_quantiles__f_agg_""var""__isabs_False__qh_1.0__ql_0.6","intensity__change_quantiles__f_agg_""var""__isabs_False__qh_1.0__ql_0.4",intensity__cid_ce__normalize_False,"intensity__change_quantiles__f_agg_""var""__isabs_False__qh_1.0__ql_0.0","intensity__fft_coefficient__attr_""abs""__coeff_66","intensity__fft_coefficient__attr_""abs""__coeff_69","intensity__fft_coefficient__attr_""abs""__coeff_62","intensity__change_quantiles__f_agg_""var""__isabs_False__qh_1.0__ql_0.2","intensity__agg_linear_trend__attr_""stderr""__chunk_len_5__f_agg_""var""","intensity__fft_coefficient__attr_""abs""__coeff_84"
0048906881fc43eae1e2.dat,0.0,1.0,0.347879,14235.523995,12277.364869,9458.323561,10168.320914,2592.688388,1078.907887,0.228412,...,25970.911131,19040.217668,3633.231579,25832.308619,2359.171688,2846.221785,308.987554,17391.21567,83.23386,1608.024037
0088433f301fbf03d626.dat,0.0,1.0,0.217381,7890.735802,6515.647189,5709.793194,6450.462576,2383.147242,726.314094,0.212065,...,13536.938447,13189.71326,3098.95861,18793.579322,1486.146588,5060.673141,4077.910466,14231.367854,24.433602,2742.067413
0155974b49445f2528c6.dat,0.0,1.0,0.220527,11169.405432,6544.395861,5696.296542,7443.017747,4073.50831,5402.81704,0.030992,...,14534.422198,13164.149862,3125.573124,19117.701277,1495.751339,2569.862761,3257.044224,13587.930214,41.231494,617.120819
016b8a8e5b23b4c05d72.dat,0.0,1.0,0.403621,56050.758347,33263.829576,21229.481169,13185.526208,3176.617065,3657.233739,0.110019,...,51039.317092,33394.470692,3761.002303,27681.28706,2803.717945,1887.559495,1053.157653,25369.521874,395.746642,2184.878369
018abb6ef8f19ab8fbb9.dat,0.0,1.0,0.284026,12394.693827,7893.055032,6068.34948,7772.445969,1953.374134,1051.395185,0.150964,...,18214.251835,14056.164434,3244.211636,20596.652783,775.08389,2233.583044,1616.237658,13736.697299,72.239506,1343.663353


Since we did not specify `default_fc_parameters`, 763 features are generated at first, and the subsequent statistical tests reduce them to 23.

**FIN**