<a href="https://colab.research.google.com/github/maskot1977/PythonCourse2019/blob/master/Feature_Selection_Datasets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Feature Selection Datasets

Feature Selection Datasets is a collection of datasets that appears to have been assembled for studying machine learning and benchmarking methods.

http://featureselection.asu.edu/datasets.php

The collection is large, so I ran a quick analysis to get an overview of its contents and find datasets that suit my needs.

Besides downloading each dataset and inspecting its structure, I also used scikit-learn's RandomForestClassifier to gauge how hard each classification problem is.

In [1]:
import os
import timeit
from scipy import io
import pandas as pd
import urllib.request
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

Here is the list of datasets to download. I corrected a couple of URLs that were wrong.

In [2]:
dataset_url = [
        "http://featureselection.asu.edu/files/datasets/BASEHOCK.mat",
        "http://featureselection.asu.edu/files/datasets/PCMAC.mat",
        "http://featureselection.asu.edu/files/datasets/RELATHE.mat",
        "http://featureselection.asu.edu/files/datasets/COIL20.mat",
        "http://featureselection.asu.edu/files/datasets/ORL.mat",
        "http://featureselection.asu.edu/files/datasets/orlraws10P.mat",
        "http://featureselection.asu.edu/files/datasets/pixraw10P.mat",
        "http://featureselection.asu.edu/files/datasets/warpAR10P.mat",
        "http://featureselection.asu.edu/files/datasets/warpPIE10P.mat",
        "http://featureselection.asu.edu/files/datasets/Yale.mat",
        "http://featureselection.asu.edu/files/datasets/USPS.mat",
        "http://featureselection.asu.edu/files/datasets/ALLAML.mat",
        "http://featureselection.asu.edu/files/datasets/Carcinom.mat",
        "http://featureselection.asu.edu/files/datasets/CLL_SUB_111.mat",
        "http://featureselection.asu.edu/files/datasets/colon.mat",
        "http://featureselection.asu.edu/files/datasets/GLI_85.mat",
        "http://featureselection.asu.edu/files/datasets/GLIOMA.mat",
        "http://featureselection.asu.edu/files/datasets/leukemia.mat",
        "http://featureselection.asu.edu/files/datasets/lung.mat",
        "http://featureselection.asu.edu/files/datasets/lung_discrete.mat",
        "http://featureselection.asu.edu/files/datasets/lymphoma.mat",
        "http://featureselection.asu.edu/files/datasets/nci9.mat",
        "http://featureselection.asu.edu/files/datasets/Prostate_GE.mat",
        "http://featureselection.asu.edu/files/datasets/SMK_CAN_187.mat",
        "http://featureselection.asu.edu/files/datasets/TOX_171.mat",
        "http://featureselection.asu.edu/files/datasets/arcene.mat",
        "http://featureselection.asu.edu/files/datasets/gisette.mat",
        "http://featureselection.asu.edu/files/datasets/Isolet.mat",
        "http://featureselection.asu.edu/files/datasets/madelon.mat"
]
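The loop in the next cell saves every download to the same file name, so each run fetches all files again. As a minimal sketch, the files could instead be cached under their own names. The helpers `local_path_for` and `fetch_if_missing` and the `datasets` cache directory are hypothetical names, not part of the original notebook.

```python
import os
import urllib.request
from urllib.parse import urlparse


def local_path_for(url, cache_dir="datasets"):
    """Map a dataset URL to a local path inside cache_dir (hypothetical helper)."""
    name = os.path.basename(urlparse(url).path)
    return os.path.join(cache_dir, name)


def fetch_if_missing(url, cache_dir="datasets"):
    """Download url only if it is not cached yet, and return the local path."""
    path = local_path_for(url, cache_dir)
    if not os.path.exists(path):
        os.makedirs(cache_dir, exist_ok=True)
        urllib.request.urlretrieve(url, path)
    return path
```

With this in place, `filename = fetch_if_missing(url)` could stand in for the fixed `'dataset.mat'` name, assuming enough disk space to keep every file.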

In [3]:
result = {
    'dataset':[],
    'byte':[],
    'X.shape':[],
    'X_type':[],
    'y.shape':[],
    'n_class':[],
    'RF_max':[],
    'RF_mean':[],
    'RF_min':[],
    'sec':[],
    }

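for url in dataset_url:
    # Use the file name at the end of the URL as the dataset name
    result['dataset'].append(url.split("/")[-1])

    # Download to a temporary file (overwritten on each iteration) and record its size
    filename = 'dataset.mat'
    urllib.request.urlretrieve(url, filename)
    result['byte'].append(os.path.getsize(filename))

    # Each .mat file stores the features under 'X' and the labels under 'Y'
    matdata = io.loadmat(filename, squeeze_me=True)
    X = matdata['X']
    y = matdata['Y'].flatten()
    result['X.shape'].append(X.shape)
    # Number of distinct values in the first feature column
    result['X_type'].append(pd.DataFrame(X).nunique()[0])

    result['y.shape'].append(y.shape)
    # Number of distinct labels, i.e. the number of classes
    result['n_class'].append(pd.DataFrame(y).nunique()[0])

    # Repeat a random train/test split 10 times with default settings
    scores = []
    times = []
    for _ in range(10):
        X_train, X_test, y_train, y_test = train_test_split(X, y)
        model = RandomForestClassifier()
        # Time only the call to fit()
        times.append(timeit.timeit(lambda: model.fit(X_train, y_train), number=1))
        # Accuracy on the held-out split
        scores.append(model.score(X_test,y_test))

    result['RF_max'].append(max(scores))
    result['RF_mean'].append(sum(scores) / len(scores))
    result['RF_min'].append(min(scores))

    result['sec'].append(sum(times) / len(times))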
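The splits and forests in the loop above are unseeded, so the scores change from run to run. The following is a minimal sketch of the same repeated hold-out evaluation with fixed seeds. The function name `repeated_holdout`, the choice of seeds, and the synthetic data from `make_classification` are assumptions for illustration, not part of the original notebook.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def repeated_holdout(X, y, n_repeats=10):
    """Fit a random forest on n_repeats seeded splits; return accuracies and fit times."""
    scores, times = [], []
    for seed in range(n_repeats):
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed)
        model = RandomForestClassifier(random_state=seed)
        start = time.perf_counter()
        model.fit(X_train, y_train)  # only the fit is timed, as in the loop above
        times.append(time.perf_counter() - start)
        scores.append(model.score(X_test, y_test))
    return scores, times


# Synthetic stand-in data, not one of the datasets listed above
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)
scores, times = repeated_holdout(X_demo, y_demo, n_repeats=3)
print(max(scores), sum(scores) / len(scores), min(scores))
```

Seeding makes the maximum, mean, and minimum accuracy repeatable across runs, while the fit times still depend on the machine.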

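# Results

* dataset : name of the dataset
* byte : size of the dataset file (bytes)
* X.shape : shape of the explanatory variables
* X_type : number of distinct values in the explanatory variables (the code counts them in the first feature column only). A value of 2 means only two distinct values occur, so the feature can be treated as discrete. A sufficiently large value means it can be treated as effectively continuous.
* y.shape : shape of the target variable
* n_class : number of distinct values in the target variable, i.e. the number of classes
* RF_max, RF_mean, RF_min : maximum, mean, and minimum accuracy of the random forest over 10 random train/test splits
* sec : mean time (sec) taken to fit the model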
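To make the X_type and n_class columns concrete, here is a small sketch on a made-up array. It reuses the expressions from the loop above, which look only at the first feature column, and also shows `np.unique` over the whole matrix as one possible alternative. The toy data and variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy feature matrix (made up for illustration): 4 samples, 3 features
X_toy = np.array([
    [0, 1.5, 7],
    [1, 2.5, 7],
    [0, 3.5, 8],
    [1, 4.5, 9],
])
y_toy = np.array([1, 2, 2, 1])

# Same expressions as in the loop: distinct values in column 0
x_type = pd.DataFrame(X_toy).nunique()[0]
n_class = pd.DataFrame(y_toy).nunique()[0]

# Alternative: distinct values over the whole matrix
x_type_all = np.unique(X_toy).size

print(x_type, n_class, x_type_all)
```

The first column looks binary here even though the matrix as a whole holds more distinct values, so X_type is best read as a quick hint, not a full description of the features.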

In [4]:
pd.DataFrame(result).sort_values("RF_max")

| | dataset | byte | X.shape | X_type | y.shape | n_class | RF_max | RF_mean | RF_min | sec |
|---|---|---|---|---|---|---|---|---|---|---|
| 21 | nci9.mat | 169288 | (60, 9712) | 3 | (60,) | 9 | 0.666667 | 0.433333 | 0.266667 | 0.183649 |
| 23 | SMK_CAN_187.mat | 11861244 | (187, 19993) | 171 | (187,) | 2 | 0.723404 | 0.655319 | 0.574468 | 0.670948 |
| 28 | madelon.mat | 1496573 | (2600, 500) | 40 | (2600,) | 2 | 0.733846 | 0.707385 | 0.68 | 2.456003 |
| 13 | CLL_SUB_111.mat | 5875157 | (111, 11340) | 111 | (111,) | 3 | 0.75 | 0.657143 | 0.464286 | 0.326307 |
| 24 | TOX_171.mat | 3470586 | (171, 5748) | 169 | (171,) | 4 | 0.813953 | 0.772093 | 0.697674 | 0.405085 |
| 16 | GLIOMA.mat | 1462087 | (50, 4434) | 50 | (50,) | 4 | 0.846154 | 0.669231 | 0.538462 | 0.154852 |
| 9 | Yale.mat | 161021 | (165, 1024) | 77 | (165,) | 15 | 0.857143 | 0.769048 | 0.595238 | 0.306511 |
| 25 | arcene.mat | 1900005 | (200, 10000) | 82 | (200,) | 2 | 0.9 | 0.788 | 0.68 | 0.417719 |
| 20 | lymphoma.mat | 110185 | (96, 4026) | 3 | (96,) | 9 | 0.916667 | 0.829167 | 0.708333 | 0.169875 |
| 2 | RELATHE.mat | 226918 | (1427, 4322) | 2 | (1427,) | 2 | 0.921569 | 0.89888 | 0.876751 | 1.218853 |


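Solving problems that are too easy would be boring, so I sorted the results in ascending order of RF_max, which puts the hardest datasets first.

I hope this helps when choosing a dataset.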
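As a rough sketch of how such a summary can be used to pick a dataset, the snippet below filters three rows copied from the results table. The selection criteria (binary problem, at least 1000 samples, mean accuracy below 0.8) and the names `summary` and `picked` are hypothetical examples, not recommendations from the original notebook.

```python
import pandas as pd

# Three rows copied from the results table above
summary = pd.DataFrame({
    "dataset": ["nci9.mat", "madelon.mat", "RELATHE.mat"],
    "X.shape": [(60, 9712), (2600, 500), (1427, 4322)],
    "n_class": [9, 2, 2],
    "RF_mean": [0.433333, 0.707385, 0.89888],
})

# Number of samples is the first element of each shape tuple
n_samples = summary["X.shape"].map(lambda shape: shape[0])

# Hypothetical criteria: binary, at least 1000 samples, mean accuracy below 0.8
picked = summary[
    (summary["n_class"] == 2) & (n_samples >= 1000) & (summary["RF_mean"] < 0.8)
]
print(picked["dataset"].tolist())
```

The same filter can be applied to `pd.DataFrame(result)` from the notebook, since it carries the same column names.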