## Task: Kickstarter Projects, a classification problem with `state` as the target variable

Name: 中田 敦也

Kaggle:https://www.kaggle.com/kemical/kickstarter-projects

csv:ks-projects-201801.csv

In [1]:
from sklearn.model_selection import train_test_split
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import log_loss, accuracy_score, confusion_matrix
from sklearn.preprocessing import StandardScaler
from time import time
from datetime import datetime,timedelta
from sklearn.feature_selection import RFECV
from sklearn.utils.class_weight import compute_class_weight
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier


|Column|Description|Meaning|
|:-|:-|:-|
|id|internal kickstarter id|Internal ID|
|name|name of project|Project name|
|category|category|Category|
|main_category|category of campaign|Campaign type|
|currency|currency used to support|Currency of support|
|deadline|deadline for crowdfunding|Deadline|
|goal|fundraising goal|Funding goal|
|launched|date launched|Launch date and time|
|pledged|amount pledged by "crowd"|Amount pledged|
|state|Current condition the project is in|Project state|
|backers|number of backers|Number of backers|
|country|country pledged from|Country of origin|
|usd pledged|Pledged amount in USD (conversion made by KS)|Pledged amount (USD, converted by KS)|
|usd_pledged_real|Pledged amount in USD (conversion made by fixer.io api)|Pledged amount (USD, converted via fixer.io API)|
|usd_goal_real|Goal amount in USD|Goal amount (USD)|

In [2]:
# Load the data
df = pd.read_csv("ks-projects-201801.csv")
# Look at the first few rows
df.head()

Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_real,usd_goal_real
0,1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,2015-10-09,1000.0,2015-08-11 12:12:28,0.0,failed,0,GB,0.0,0.0,1533.95
1,1000003930,Greeting From Earth: ZGAC Arts Capsule For ET,Narrative Film,Film & Video,USD,2017-11-01,30000.0,2017-09-02 04:43:57,2421.0,failed,15,US,100.0,2421.0,30000.0
2,1000004038,Where is Hank?,Narrative Film,Film & Video,USD,2013-02-26,45000.0,2013-01-12 00:20:50,220.0,failed,3,US,220.0,220.0,45000.0
3,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,2012-04-16,5000.0,2012-03-17 03:24:11,1.0,failed,1,US,1.0,1.0,5000.0
4,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,USD,2015-08-29,19500.0,2015-07-04 08:35:03,1283.0,canceled,14,US,1283.0,1283.0,19500.0


In [3]:
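# Dates are stored as strings, so convert them to numeric features.
# The time of day is probably irrelevant, so only year, month, day and weekday are kept.
# deadline is replaced by the number of days elapsed between launch and deadline.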
df["launched"]=df["launched"].map(lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))
df["year"] = df["launched"].map(lambda x: x.year)
df["month"] = df["launched"].map(lambda x: x.month)
df["day"] = df["launched"].map(lambda x: x.day)
df["weekday"] = df["launched"].map(lambda x: x.weekday())
df["deadline"]=df["deadline"].map(lambda x: datetime.strptime(x, '%Y-%m-%d'))
df["deadline"]=(df["deadline"]-df["launched"]).map(lambda x: x/timedelta(days=1))
df.head()

Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_real,usd_goal_real,year,month,day,weekday
0,1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,58.491343,1000.0,2015-08-11 12:12:28,0.0,failed,0,GB,0.0,0.0,1533.95,2015,8,11,1
1,1000003930,Greeting From Earth: ZGAC Arts Capsule For ET,Narrative Film,Film & Video,USD,59.802813,30000.0,2017-09-02 04:43:57,2421.0,failed,15,US,100.0,2421.0,30000.0,2017,9,2,5
2,1000004038,Where is Hank?,Narrative Film,Film & Video,USD,44.985532,45000.0,2013-01-12 00:20:50,220.0,failed,3,US,220.0,220.0,45000.0,2013,1,12,5
3,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,29.858206,5000.0,2012-03-17 03:24:11,1.0,failed,1,US,1.0,1.0,5000.0,2012,3,17,5
4,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,USD,55.642326,19500.0,2015-07-04 08:35:03,1283.0,canceled,14,US,1283.0,1283.0,19500.0,2015,7,4,5
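
The same date features can also be derived without per-row lambdas, using pandas' vectorized datetime parsing and the `.dt` accessor. The following is a minimal sketch on a two-row toy frame whose values are copied from the first two rows above; the names `toy` and `duration_days` are hypothetical and do not appear in the notebook.

```python
import pandas as pd

# Toy frame holding the date columns of the first two rows shown above
toy = pd.DataFrame({
    "launched": ["2015-08-11 12:12:28", "2017-09-02 04:43:57"],
    "deadline": ["2015-10-09", "2017-11-01"],
})

# Parse each column in a single vectorized call
launched = pd.to_datetime(toy["launched"], format="%Y-%m-%d %H:%M:%S")
deadline = pd.to_datetime(toy["deadline"], format="%Y-%m-%d")

# Calendar features via the .dt accessor
toy["year"] = launched.dt.year
toy["month"] = launched.dt.month
toy["day"] = launched.dt.day
toy["weekday"] = launched.dt.weekday  # Monday=0 ... Sunday=6

# Campaign length in fractional days between launch and deadline
toy["duration_days"] = (deadline - launched) / pd.Timedelta(days=1)

print(toy[["year", "month", "day", "weekday", "duration_days"]])
```

Storing the result under a separate name such as `duration_days`, instead of overwriting `deadline`, would keep the column's meaning clear; this is a naming suggestion only.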


In [4]:
# Check for missing values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 378661 entries, 0 to 378660
Data columns (total 19 columns):
ID                  378661 non-null int64
name                378657 non-null object
category            378661 non-null object
main_category       378661 non-null object
currency            378661 non-null object
deadline            378661 non-null float64
goal                378661 non-null float64
launched            378661 non-null datetime64[ns]
pledged             378661 non-null float64
state               378661 non-null object
backers             378661 non-null int64
country             378661 non-null object
usd pledged         374864 non-null float64
usd_pledged_real    378661 non-null float64
usd_goal_real       378661 non-null float64
year                378661 non-null int64
month               378661 non-null int64
day                 378661 non-null int64
weekday             378661 non-null int64
dtypes: datetime64[ns](1), float64(6), int64(6), object(6)
memor
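
In the output above, `name` has 378661 - 378657 = 4 missing values and `usd pledged` has 378661 - 374864 = 3797. A more direct way to get these counts is `isna().sum()`. The sketch below uses a hypothetical toy frame (`toy`) with gaps in the same two columns, not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in the same two columns as above
toy = pd.DataFrame({
    "name": ["A", None, "C", "D"],
    "usd pledged": [10.0, 5.0, np.nan, np.nan],
    "backers": [1, 2, 3, 4],
})

# Missing values per column, showing only the columns that have any
missing = toy.isna().sum()
print(missing[missing > 0])
```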

In [5]:
# List of categories
df.groupby(["main_category","category"])["ID"].count()

main_category  category         
Art            Art                  8253
               Ceramics              305
               Conceptual Art       1030
               Digital Art          1346
               Illustration         3175
               Installations         482
               Mixed Media          2757
               Painting             3294
               Performance Art      2154
               Public Art           3077
               Sculpture            1810
               Textiles              276
               Video Art             194
Comics         Anthologies           405
               Comic Books          2743
               Comics               4996
               Events                163
               Graphic Novels       1864
               Webcomics             648
Crafts         Candles               429
               Crafts               4664
               Crochet               162
               DIY                  1173
               Embroider

In [6]:
# List of countries
df.groupby("country")["ID"].count()

country
AT         597
AU        7839
BE         617
CA       14756
CH         761
DE        4171
DK        1113
ES        2276
FR        2939
GB       33672
HK         618
IE         811
IT        2878
JP          40
LU          62
MX        1752
N,0"      3797
NL        2868
NO         708
NZ        1447
SE        1757
SG         555
US      292627
Name: ID, dtype: int64

In [7]:
# List of states
state_sum = df.groupby("state")["ID"].count()
state_sum

state
canceled       38779
failed        197719
live            2799
successful    133956
suspended       1846
undefined       3562
Name: ID, dtype: int64
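
The counts above show a strong class imbalance. As a rough reference point for the accuracy figures later in the notebook, the sketch below turns the counts into class shares and computes the accuracy of always predicting the most frequent class. The names `state_counts`, `share` and `baseline_acc` are hypothetical.

```python
import pandas as pd

# Class counts copied from the output above
state_counts = pd.Series({
    "canceled": 38779, "failed": 197719, "live": 2799,
    "successful": 133956, "suspended": 1846, "undefined": 3562,
})

# Share of each class, largest first
share = (state_counts / state_counts.sum()).sort_values(ascending=False)
print(share.round(3))

# Accuracy obtained by always predicting the most frequent class
baseline_acc = share.iloc[0]
print(f"majority-class baseline accuracy: {baseline_acc:.3f}")
```

With these counts, `failed` accounts for about 52% of the rows, so any model should be judged against that baseline and not against zero.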

pd.plotting.scatter_matrix(df[["backers","usd_pledged_real","usd_goal_real"]], figsize=(10,10))
plt.show()

In [8]:
# One-hot encode main_category and country
# Drop the columns that are not used this time
category_ohe = pd.get_dummies(df["main_category"])
df.drop("main_category",axis=1,inplace=True)
# category is too fine-grained and prone to overfitting, so drop it
df.drop("category",axis=1,inplace=True)
country_ohe = pd.get_dummies(df["country"])
df = pd.concat([df,category_ohe,country_ohe],axis=1)
df.drop("country",axis=1,inplace=True)
# currency is mostly redundant with country, so drop it
df.drop("currency",axis=1,inplace=True)
# name and ID are not used this time, so drop them
df.drop("ID",axis=1,inplace=True)
df.drop("name",axis=1,inplace=True)
# usd_pledged_real is used instead, so drop these
df.drop("pledged",axis=1,inplace=True)
df.drop("usd pledged",axis=1,inplace=True)
# usd_goal_real is used instead, so drop this
df.drop("goal",axis=1,inplace=True)
# Already converted to numeric features, so drop it
df.drop("launched",axis=1,inplace=True)
df.head()

Unnamed: 0,deadline,state,backers,usd_pledged_real,usd_goal_real,year,month,day,weekday,Art,...,JP,LU,MX,"N,0""",NL,NO,NZ,SE,SG,US
0,58.491343,failed,0,0.0,1533.95,2015,8,11,1,0,...,0,0,0,0,0,0,0,0,0,0
1,59.802813,failed,15,2421.0,30000.0,2017,9,2,5,0,...,0,0,0,0,0,0,0,0,0,1
2,44.985532,failed,3,220.0,45000.0,2013,1,12,5,0,...,0,0,0,0,0,0,0,0,0,1
3,29.858206,failed,1,1.0,5000.0,2012,3,17,5,0,...,0,0,0,0,0,0,0,0,0,1
4,55.642326,canceled,14,1283.0,19500.0,2015,7,4,5,0,...,0,0,0,0,0,0,0,0,0,1
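
The encode, concatenate and drop steps above can be collapsed into one `pd.get_dummies` call by passing the frame together with a `columns=` list. This is a minimal sketch on a hypothetical toy frame; unlike the cell above, it keeps the default column prefixes, so the resulting names differ (for example `country_US`, not `US`).

```python
import pandas as pd

# Hypothetical toy frame with the two categorical columns used above
toy = pd.DataFrame({
    "main_category": ["Publishing", "Film & Video", "Music"],
    "country": ["GB", "US", "US"],
    "backers": [0, 15, 1],
})

# Encode both columns at once; the originals are replaced by indicator columns
encoded = pd.get_dummies(toy, columns=["main_category", "country"])
print(encoded.columns.tolist())
```

Keeping the prefixes avoids any collision between a category name and a country code, at the cost of longer column names.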


In [9]:
# Convert state to integer codes
y = df["state"]
df.drop("state",axis=1,inplace=True)
y.head()

index = list(y.value_counts().index)
dic = {index[i]:i for i in range(len(index))}
y = y.map(lambda x:dic[x])
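
The mapping above assigns codes in order of descending class frequency, because `value_counts()` sorts that way. Given the state counts listed earlier, that would give failed=0, successful=1, canceled=2, undefined=3, live=4, suspended=5. The sketch below shows the same idea on hypothetical toy labels (`y_toy`, `label_to_code` and `y_codes` are illustrative names).

```python
import pandas as pd

# Hypothetical toy labels: "failed" is most frequent, then "successful"
y_toy = pd.Series(["failed", "successful", "failed", "canceled",
                   "failed", "successful"])

# Classes ordered by descending frequency, as value_counts() returns them
classes = list(y_toy.value_counts().index)

# Most frequent class -> 0, next most frequent -> 1, and so on
label_to_code = {label: code for code, label in enumerate(classes)}
y_codes = y_toy.map(label_to_code)

print(label_to_code)
print(y_codes.tolist())
```

Keeping `label_to_code` around makes it possible to translate predicted codes back into state names later.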

In [10]:
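# The classes are imbalanced, so compute balanced class weights
# (classes and y are keyword-only in scikit-learn 1.0 and later)
weight = compute_class_weight('balanced', classes=np.unique(y), y=y)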
weight

array([ 0.31919121,  0.47112609,  1.62743151, 17.71762119, 22.54739788,
       34.18752257])
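
The `'balanced'` weights follow the formula `n_samples / (n_classes * count_of_class)`. The sketch below reproduces the array above directly from the state counts, ordered by the integer codes assigned earlier. Note that the notebook computes `weight` but no later cell passes it to a model; the `class_weight` dict at the end is a hypothetical next step, not something the notebook does.

```python
import numpy as np

# Class counts from the state summary, ordered by integer code
# (failed, successful, canceled, undefined, live, suspended)
counts = np.array([197719, 133956, 38779, 3562, 2799, 1846])

# "balanced" weighting: n_samples / (n_classes * count_of_class)
weights = counts.sum() / (len(counts) * counts)
print(weights.round(4))

# Hypothetical next step: a {class_code: weight} dict is one form accepted by
# the class_weight parameter of estimators such as RandomForestClassifier
class_weight = {code: float(w) for code, w in enumerate(weights)}
```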

In [11]:
# Standardize
X = df.values
std_scaler = StandardScaler() 
X = std_scaler.fit_transform(X)

In [12]:
# Set the model used as the estimator
# RandomForest is used this time
estimator = RandomForestClassifier(n_estimators=50, max_depth=3, criterion="gini",
                                                 min_samples_leaf=2, min_samples_split=2, random_state=1234)

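# RFECV performs stepwise (recursive) feature elimination with cross-validation
# cv sets the number of folds (groups); scoring sets the evaluation metric
# This is a classification task, so accuracy is used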
rfecv = RFECV(estimator, cv=10, scoring="accuracy")

In [13]:
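# Run the feature selection with fit
# Only a subset of the data is used this time
# (10,000 row indices drawn with replacement; unseeded, so the subset changes on each run)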
index = np.random.randint(0,X.shape[0],10000)

rfecv.fit(X[index], y[index])

RFECV(cv=10,
   estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=3, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=2, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=None,
            oob_score=False, random_state=1234, verbose=0,
            warm_start=False),
   min_features_to_select=1, n_jobs=None, scoring='accuracy', step=1,
   verbose=0)
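
After fitting, `RFECV` exposes which features it kept through `support_`, `ranking_` and `n_features_`. The sketch below runs the same kind of selection on synthetic stand-in data from `make_classification`, not on the Kickstarter data. It draws the subsample with a seeded generator and without replacement, which is one way to make the selection reproducible. All names (`X_demo`, `y_demo`, `idx`, `selector`) are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# Synthetic stand-in data: 8 features, only 3 of them informative
X_demo, y_demo = make_classification(n_samples=400, n_features=8,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)

# Reproducible subsample drawn without replacement
rng = np.random.default_rng(1234)
idx = rng.choice(X_demo.shape[0], size=200, replace=False)

selector = RFECV(
    RandomForestClassifier(n_estimators=20, max_depth=3, random_state=1234),
    cv=3, scoring="accuracy",
)
selector.fit(X_demo[idx], y_demo[idx])

print(selector.support_)     # boolean mask of the kept features
print(selector.ranking_)     # 1 = selected; larger = eliminated earlier
print(selector.n_features_)  # number of features kept
```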

In [14]:
# Get the names of the features to remove
remove_idx = ~rfecv.support_
remove_feature = df.columns[remove_idx]
df = df.drop(remove_feature, axis=1)
remove_feature

Index(['deadline', 'year', 'month', 'day', 'weekday', 'Art', 'Comics',
       'Crafts', 'Dance', 'Design', 'Fashion', 'Film & Video', 'Food', 'Games',
       'Journalism', 'Music', 'Photography', 'Publishing', 'Technology',
       'Theater', 'AT', 'AU', 'BE', 'CA', 'CH', 'DE', 'DK', 'ES', 'FR', 'GB',
       'HK', 'IE', 'IT', 'JP', 'LU', 'MX', 'NL', 'NO', 'NZ', 'SE', 'SG', 'US'],
      dtype='object')

In [15]:
X = df.values

In [16]:
def eval_model(clf,iskeras = False):
    
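    n_split = 5  # number of folds (5-fold split here)
    split_num = 1
    cross_valid_acc = 0
    start = time()
    # Rotate the held-out fold, repeating training and evaluation
    for train_idx, test_idx in KFold(n_splits=n_split, shuffle=True, random_state=1234).split(X, y):
        X_train, y_train = X[train_idx], y[train_idx]  # training data
        X_test, y_test = X[test_idx], y[test_idx]      # test data

        # Standardize, fitting the scaler on the training fold only
        std_scaler = StandardScaler()
        X_train = std_scaler.fit_transform(X_train)
        X_test = std_scaler.transform(X_test)

        # Keras needs different preprocessing and fit arguments, so handle it separately.
        # Note: the same Keras model object is passed to every fold, so its weights
        # carry over from fold to fold rather than being reset.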
        if iskeras:
            from tensorflow.keras.utils import to_categorical
            y_train = to_categorical(y_train)
            y_test = to_categorical(y_test)
            fit = clf.fit(X_train, y_train,
                            epochs=5,
                            batch_size=100,validation_data=(X_test, y_test))
            y_test = np.argmax(y_test,axis=1)
            y_est = np.argmax(clf.predict(X_test),axis=1)
        else:
            clf.fit(X_train, y_train)
            # Predict on the test fold
            y_est = clf.predict(X_test)


        # Compute accuracy on the test fold
        acc =  accuracy_score(y_test, y_est)
        print("Fold %s"%split_num)
        print("ACC = %s"%round(acc, 3))
        print()

        cross_valid_acc += acc  # accumulate accuracy to average later
        split_num+=1
    # Use the mean accuracy across folds as the final generalization estimate
    final_acc = cross_valid_acc / n_split
    print("Cross Validation ACC = %s"%round(final_acc, 3))
    
    end = time()
    print(f"learning time:{int(end-start)}s")
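
For scikit-learn estimators, the manual loop in `eval_model` can be approximated with a `Pipeline` and `cross_val_score`. The pipeline refits the scaler inside each training fold, which matches what the loop does by hand. This is a sketch under stated assumptions: it uses synthetic data and `LogisticRegression` as a stand-in classifier, and the names `X_demo`, `y_demo`, `pipe` and `scores` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's X and y
X_demo, y_demo = make_classification(n_samples=300, n_features=5,
                                     random_state=0)

# The pipeline fits the scaler on each training fold only
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Same splitting scheme as eval_model: 5 shuffled folds with a fixed seed
cv = KFold(n_splits=5, shuffle=True, random_state=1234)
scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="accuracy")

for fold, acc in enumerate(scores, start=1):
    print(f"Fold {fold}: ACC = {acc:.3f}")
print(f"Cross Validation ACC = {scores.mean():.3f}")
```

Because `cross_val_score` clones the estimator for every fold, no fitted state leaks from one fold to the next.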

In [17]:
# Evaluate with SGDClassifier
# (loss='log' is logistic regression; scikit-learn 1.1 and later spell it 'log_loss')
clf = SGDClassifier(loss='log', penalty='none', max_iter=10000,
                fit_intercept=True, random_state=1234, tol=1e-3)
eval_model(clf)

Fold 1
ACC = 0.743

Fold 2
ACC = 0.74

Fold 3
ACC = 0.736

Fold 4
ACC = 0.742

Fold 5
ACC = 0.744

Cross Validation ACC = 0.741
learning time:9s


In [18]:
clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=5,
                                                 min_samples_leaf=2,
                                                 min_samples_split=2, 
                                                 random_state=1234,
                                                 criterion="gini"),
                                                 n_estimators=10, random_state=1234)
eval_model(clf)

Fold 1
ACC = 0.862

Fold 2
ACC = 0.844

Fold 3
ACC = 0.842

Fold 4
ACC = 0.608

Fold 5
ACC = 0.789

Cross Validation ACC = 0.789
learning time:19s
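
The AdaBoost folds vary much more than those of the other models, with Fold 4 dropping to 0.608. Reporting the spread alongside the mean makes this visible. The sketch below only post-processes the fold accuracies printed above; `ada_folds` is a hypothetical name.

```python
import numpy as np

# Per-fold accuracies copied from the AdaBoost output above
ada_folds = np.array([0.862, 0.844, 0.842, 0.608, 0.789])

print(f"mean = {ada_folds.mean():.3f}")
print(f"std  = {ada_folds.std(ddof=1):.3f}")  # sample standard deviation
print(f"min  = {ada_folds.min():.3f} (fold {ada_folds.argmin() + 1})")
```

A sample standard deviation of roughly 0.1 across folds suggests this configuration is unstable on this data, so the mean of 0.789 alone understates the uncertainty.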


In [19]:
clf = RandomForestClassifier(n_estimators=100, max_depth=5, criterion="gini",
                                                 min_samples_leaf=2, min_samples_split=2, random_state=1234)
eval_model(clf)

Fold 1
ACC = 0.86

Fold 2
ACC = 0.859

Fold 3
ACC = 0.86

Fold 4
ACC = 0.86

Fold 5
ACC = 0.858

Cross Validation ACC = 0.859
learning time:65s


In [21]:
from tensorflow import keras  
import tensorflow as tf
from tensorflow.compat.v1.keras import Sequential
from tensorflow.compat.v1.keras.layers import Dense, Dropout, Activation
from tensorflow.compat.v1.keras.optimizers import SGD,RMSprop, Adagrad, Adadelta, Adam

model = Sequential()
model.add(Dense(10, activation='relu', input_dim=X.shape[1]))
model.add(Dense(5, activation='relu'))
model.add(Dense(np.max(y)+1, activation='softmax'))  # do not change the final layer's activation

# ------ Optimizer ------
#sgd = SGD(lr=0.01, momentum=0.9, nesterov=False)
# rms = RMSprop(lr=0.01)
# adag = Adagrad(lr=0.01)
# adad = Adadelta(lr=0.01)
adam = Adam(lr=0.01)
# -----------------------------

model.compile(loss='categorical_crossentropy',
              optimizer=adam,
              metrics=['accuracy'])
eval_model(model,iskeras=True)

Train on 302928 samples, validate on 75733 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Fold 1
ACC = 0.873

Train on 302929 samples, validate on 75732 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Fold 2
ACC = 0.877

Train on 302929 samples, validate on 75732 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Fold 3
ACC = 0.864

Train on 302929 samples, validate on 75732 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Fold 4
ACC = 0.85

Train on 302929 samples, validate on 75732 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Fold 5
ACC = 0.855

Cross Validation ACC = 0.864
learning time:167s
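
To compare the four models side by side, the reported cross-validation accuracies and training times can be collected into one table. The sketch below only tabulates the numbers printed above; the short model labels and the `results` name are hypothetical. The Keras figure should be read with some caution, since `eval_model` reuses the same model object across folds without resetting its weights.

```python
import pandas as pd

# Cross-validation accuracy and wall-clock time as reported above
results = pd.DataFrame(
    {
        "model": ["SGDClassifier", "AdaBoost", "RandomForest", "Keras MLP"],
        "cv_acc": [0.741, 0.789, 0.859, 0.864],
        "time_s": [9, 19, 65, 167],
    }
).sort_values("cv_acc", ascending=False)

print(results.to_string(index=False))
```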
