<a href="https://colab.research.google.com/github/jeffrey82221/cc_fraud_delection/blob/main/FraudDetectionTrainModulized.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Strategies:
## I.  Use the computers of all team members to speed up the experiments: 
- Ask all members to upload train.csv to Google Drive
- Ask all members to adopt my colab notebook. 

## II. 大家持續想出新的特徵因子加到下面的表中
- 需要有一個人當BA
  - 負責激發大家發想特徵因子(可以參考下面III的「前N次交易與當前交易行為的差異」)。
  - 蒐集大家的特徵因子。
  - 對特徵因子做排序提供給Engineer。
- 需要有一個ppt的leader
  - 負責構思投影片架構
  - 分配投影片工作給大家
  - 和Engineer詢問建模流程
  - 和Engineer要求圖表的生成。

## III. Implement following methods:
- 前處理的調整:
  - For the importance features, try to use other more reasonable value to replace NULL value rather than simply using -1 because some features might encounter negative values. 
  - Apply Logscale to the positive, currency-related features.
  - 有無綁定行動支付(合併成一大項)

- 特徵的增減 
  - Remove MCC as we do not use MCHNO too 
  - Add AGE 
  - Now we use 100 features, why not try 100+N and change N instead? 
  - For each feature with NULL, create a new feature describe wether the original feature is NULL or not.
  - ID類特徵的加入 (Add ID-like feature as embeddings)
  - 前N次交易與當前交易行為的差異:
    - 同一個消費者或同一個卡，前1次2次3次...N次刷卡的金額與當前刷卡金額的差異。
    - 同一個消費者或同一個卡，前1次2次3次...N次刷卡的金額距離當前刷卡的時間。
    - 同一個消費者或同一個卡，前1次2次3次...N次刷卡的特店是否和當前刷卡的特店是一樣的? (或是類似的--利用EMBEDDING來做計算)。
    - 同一個消費者或同一個卡，前1次2次3次...N次刷卡的特店是否和當前刷卡的特店是一樣的國家、城市或一樣的MCC? (或是類似的--利用EMBEDDING來做計算)。
    - 同一個消費者或同一個卡，前1次2次3次...N次刷卡的和當前刷卡的交易型態、交易類別、網路交易註記、行動支付註記等等是否相同? (或是類似的--利用EMBEDDING來做計算)。
  - 訓練方法的調整:
    - Upsample rather than downsample 
    - Tune the sampling rate 
    - Tune the hyperparameters 
- Q: 
  - LabelEncoder是甚麼原理??



# Import Packages 

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime
from sklearn.preprocessing import LabelEncoder
import lightgbm as lgb
from sklearn.metrics import recall_score, precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Data Preprocessing

In [24]:
import copy
############################ Preprocessing ###################################
def extend_with_detailed_time(data, weekday = True, hour = True):
  '''
  Add WEEKDAY and HOUR and convert DATETIME into strptime format. 
  '''
  c_data = copy.copy(data)
  c_data["DATETIME"] = c_data["DATETIME"].apply(lambda x: datetime.strptime(x, "%Y-%m-%d %H:%M:%S"))
  if weekday:
    c_data["WEEKDAY"] = c_data["DATETIME"].apply(lambda x: x.weekday() + 1)
  if hour:
    c_data["HOUR"] = c_data["DATETIME"].apply(lambda x: x.hour + 1)
  return c_data 

def extend_with_time_difference_features(data, max_time_shift = 5, pivot_feature = 'CHID'):
  # CHID: 卡人ID
  # CANO: 交易卡號
  c_data = copy.copy(data)
  assert max_time_shift > 2
  def date_diff(data, time_shift, pivot_feature):
    df = copy.copy(data)
    df["shift"] = df.groupby([pivot_feature])["DATETIME"].shift(time_shift)
    name = pivot_feature + '_DIF' + str(time_shift)
    df[name] = (df["DATETIME"] - df['shift']).dt.total_seconds().fillna(0)
    # 
    df = df.drop("shift", 1)
    return df
  for time_shift in range(1, max_time_shift + 1):
    print("add time difference between current and " + str(time_shift) + "th-last transaction")
    c_data = date_diff(c_data, time_shift, pivot_feature)
  return c_data

def preprocess_null_values(data):
  # 將空值填補
  c_data = copy.copy(data)
  c_data[
        c_data.select_dtypes(include=['object']).columns
      ] = c_data[
        c_data.select_dtypes(include=['object']).columns
      ].fillna("NULL")
  c_data[
      c_data.select_dtypes(include=['float64', 'int64']).columns
    ] = c_data[
      c_data.select_dtypes(include=['float64', 'int64']).columns
    ].fillna(-1)
  return c_data


def encode_labels(data):
  #將object欄位使用Label Encoder
  c_data = copy.copy(data)
  labelencoder = LabelEncoder()
  obj_col = c_data.select_dtypes(include=['object']).columns.to_list()
  for col in obj_col:
      c_data[col] = labelencoder.fit_transform(c_data[col])
  return c_data
def preprocessing(data):
  r_data = preprocess_null_values(data)
  return encode_labels(r_data)
############################ Training Preprocess ############################
def resample(data, sampling_rate=0.7, sample_type='downsample'):
  # note that testing data should not be re-sampled. 
  assert sample_type == 'downsample' or sample_type == 'upsample'
  c_data = copy.copy(data) 
  #將資料切分為train&test
  if sample_type == 'downsample': 
    df_fraud = c_data[c_data["FRAUD_IND"] == 1]
    df_not_fraud = c_data[c_data["FRAUD_IND"] != 1].sample(frac=sampling_rate, random_state=42)
  elif sample_type == 'upsample':
    df_fraud = c_data[c_data["FRAUD_IND"] == 1].sample(frac=1./sampling_rate, replace = True, random_state=42)
    df_not_fraud = c_data[c_data["FRAUD_IND"] != 1]
  df_train = pd.concat([df_fraud, df_not_fraud], 0)
  return df_train
def create_X_y(data, drop_list = ['FRAUD_IND']):
  X = data.drop(drop_list, 1)
  y = data["FRAUD_IND"]
  return X,y

############################ Model Build ####################################
def train_lgb(x_train, x_test, y_train, y_test, max_depth = 8, learning_rate = 0.05, n_estimators = 1000):
  # n_estimators: number of trees 
  lgb_train = lgb.Dataset(x_train, y_train)
  lgb_test = lgb.Dataset(x_test, y_test)
  params = {
      "boosting_type": "gbdt",
      "objective": "binary",
      "metric": "binary_logloss",
      "max_depth": max_depth,
      "learning_rate": learning_rate,
      "n_estimators": n_estimators,
  }
  trained_model = lgb.train(
      params,
      lgb_train,
      num_boost_round=5000,
      valid_sets=[lgb_train, lgb_test],
      early_stopping_rounds=30,
      verbose_eval=50
  )
  return trained_model
##### Get Result Generated from Model #####################################
def evaluate(clf, x_test, y_test):
  y_pred = clf.predict(x_test)
  precision, recall, threshold = precision_recall_curve(y_test, y_pred)
  performance = {"precision": precision[0:-1],
                "recall": recall[0:-1],
                "threshold": threshold
                }
  performance["f1"] = 2 * (performance["precision"] * performance["recall"]) / (performance["precision"] + performance["recall"])
  performance = pd.DataFrame(performance)
  thr = performance[performance["f1"] == max(performance["f1"])]["threshold"].values[0]
  recall = performance[performance["f1"] == max(performance["f1"])]["recall"].values[0]
  precision = performance[performance["f1"] == max(performance["f1"])]["precision"].values[0]
  print("Recall Score:", recall)
  print("Precision Score:", precision)
  print("F1 Score:", 2 * (precision * recall) / (precision + recall))
  print("Threshold: ", thr)
def get_important_feature_table(clf, x_train):
  importance = {
  "col": np.array(x_train.columns),
  "imp": lgb.Booster.feature_importance(clf)
  }
  df_imp = pd.DataFrame(importance).sort_values(by='imp', ascending=False)
  return df_imp

In [3]:
from google.colab import drive
drive.mount('/content/drive')
#匯入資料
train_data = pd.read_csv('/content/drive/MyDrive/智金輪習Kaggle/train.csv')
test_data = "先不給你們"
#查看資料筆數
print("shape of train data:" , train_data.shape)
#print("shape of test data:" , test_data.shape)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
shape of train data: (533202, 59)


In [55]:
train_data['FRAUD_IND'].mean()

0.14388730724941018

# Reduced Run and Generate Important Feature Table 

In [27]:
from google.colab import drive
drive.mount('/content/drive')
#匯入資料
train_data = pd.read_csv('/content/drive/MyDrive/智金輪習Kaggle/train.csv')
test_data = "先不給你們"
#查看資料筆數
print("shape of train data:" , train_data.shape)
#print("shape of test data:" , test_data.shape)
# add AGE 
# remove weekday and hour 
tmp_train_data = extend_with_detailed_time(train_data, 
  weekday = False, hour = False)
preprocessed_train_data = preprocessing(tmp_train_data)
resampled_train_data = resample(preprocessed_train_data, 
  sampling_rate=0.7, sample_type='downsample')
X, y = create_X_y(resampled_train_data, 
  drop_list = ["FRAUD_IND", "TXKEY", "DATETIME", "CANO", "CHID", "ACQIC", "MCHNO", "AGE"])
val_percentage = 0.33
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=val_percentage, 
  shuffle=True, random_state=42)
clf = train_lgb(x_train, x_test, y_train, y_test, 
  max_depth = 8, learning_rate = 0.05, n_estimators = 1000)
evaluate(clf, x_test, y_test)
important_feature_table = get_important_feature_table(clf, x_train)
important_feature_table

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
shape of train data: (533202, 59)




Training until validation scores don't improve for 30 rounds.
[50]	training's binary_logloss: 0.162201	valid_1's binary_logloss: 0.162768
[100]	training's binary_logloss: 0.13457	valid_1's binary_logloss: 0.13604
[150]	training's binary_logloss: 0.12292	valid_1's binary_logloss: 0.125427
[200]	training's binary_logloss: 0.115362	valid_1's binary_logloss: 0.118842
[250]	training's binary_logloss: 0.109307	valid_1's binary_logloss: 0.113742
[300]	training's binary_logloss: 0.103638	valid_1's binary_logloss: 0.108953
[350]	training's binary_logloss: 0.0989313	valid_1's binary_logloss: 0.105065
[400]	training's binary_logloss: 0.0948469	valid_1's binary_logloss: 0.101827
[450]	training's binary_logloss: 0.0910325	valid_1's binary_logloss: 0.0986804
[500]	training's binary_logloss: 0.0874282	valid_1's binary_logloss: 0.0957317
[550]	training's binary_logloss: 0.0840682	valid_1's binary_logloss: 0.0930062
[600]	training's binary_logloss: 0.081057	valid_1's binary_logloss: 0.0906978
[650]	tra

Unnamed: 0,col,imp
27,CC_VINTAGE,2721
0,MCC,2437
10,SCITY,1885
8,FLAM1,1810
37,BONUS_POINTS,1545
9,STOCN,1483
43,CURRENT_FEE,1266
14,AGNO,1214
46,CURRENT_PURCH_AMT,1190
35,ACCT_VINTAGE,1190


# First Run 

In [6]:
tmp_train_data = extend_with_detailed_time(train_data, 
  weekday = True, hour = True)
preprocessed_train_data = preprocessing(tmp_train_data)
resampled_train_data = resample(preprocessed_train_data, 
  sampling_rate=0.7, sample_type='downsample')
X, y = create_X_y(resampled_train_data, 
  drop_list = ["FRAUD_IND", "TXKEY", "DATETIME", "CANO", "CHID", "ACQIC", "MCHNO", "AGE"])
val_percentage = 0.33
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=val_percentage, 
  shuffle=True, random_state=42)
clf = train_lgb(x_train, x_test, y_train, y_test, 
  max_depth = 8, learning_rate = 0.05, n_estimators = 1000)
evaluate(clf, x_test, y_test)
#important_feature_table = get_important_feature_table(clf, x_train)
#important_feature_table



Training until validation scores don't improve for 30 rounds.
[50]	training's binary_logloss: 0.162279	valid_1's binary_logloss: 0.162807
[100]	training's binary_logloss: 0.134768	valid_1's binary_logloss: 0.136324
[150]	training's binary_logloss: 0.122707	valid_1's binary_logloss: 0.125311
[200]	training's binary_logloss: 0.114838	valid_1's binary_logloss: 0.118302
[250]	training's binary_logloss: 0.108433	valid_1's binary_logloss: 0.112826
[300]	training's binary_logloss: 0.103261	valid_1's binary_logloss: 0.108748
[350]	training's binary_logloss: 0.0987153	valid_1's binary_logloss: 0.104977
[400]	training's binary_logloss: 0.0947501	valid_1's binary_logloss: 0.101738
[450]	training's binary_logloss: 0.0907433	valid_1's binary_logloss: 0.0984731
[500]	training's binary_logloss: 0.0871949	valid_1's binary_logloss: 0.0957686
[550]	training's binary_logloss: 0.0840042	valid_1's binary_logloss: 0.0932831
[600]	training's binary_logloss: 0.0807209	valid_1's binary_logloss: 0.0907092
[650]

# Second Run 

In [7]:
from google.colab import drive
drive.mount('/content/drive')
#匯入資料
train_data = pd.read_csv('/content/drive/MyDrive/智金輪習Kaggle/train.csv')
test_data = "先不給你們"
#查看資料筆數
print("shape of train data:" , train_data.shape)
#print("shape of test data:" , test_data.shape)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
shape of train data: (533202, 59)


In [8]:
tmp_train_data = extend_with_detailed_time(train_data, 
  weekday = True, hour = True)
train_tmp_data = extend_with_time_difference_features(tmp_train_data, 
  max_time_shift = 5, pivot_feature = 'CHID')
preprocessed_train_data = preprocessing(train_tmp_data)
resampled_train_data = resample(preprocessed_train_data, 
  sampling_rate=0.7, sample_type='downsample')
X, y = create_X_y(resampled_train_data, 
  drop_list = ["FRAUD_IND", "TXKEY", "DATETIME", "CANO", "CHID", "ACQIC", "MCHNO", "AGE"])
val_percentage = 0.33
x_train, x_test, y_train, y_test = train_test_split(X, y, 
  test_size=val_percentage, shuffle=True, random_state=42)
clf = train_lgb(x_train, x_test, y_train, y_test, 
  max_depth = 8, learning_rate = 0.05, n_estimators = 1000)
evaluate(clf, x_test, y_test)
#important_feature_table = get_important_feature_table(clf, x_train)
#important_feature_table

add time difference between current and 1th-last transaction
add time difference between current and 2th-last transaction
add time difference between current and 3th-last transaction
add time difference between current and 4th-last transaction




Training until validation scores don't improve for 30 rounds.
[50]	training's binary_logloss: 0.103183	valid_1's binary_logloss: 0.102447
[100]	training's binary_logloss: 0.0733892	valid_1's binary_logloss: 0.0742938
[150]	training's binary_logloss: 0.0644384	valid_1's binary_logloss: 0.0664961
[200]	training's binary_logloss: 0.0586591	valid_1's binary_logloss: 0.0616757
[250]	training's binary_logloss: 0.0537821	valid_1's binary_logloss: 0.0575237
[300]	training's binary_logloss: 0.0496841	valid_1's binary_logloss: 0.0543032
[350]	training's binary_logloss: 0.0459843	valid_1's binary_logloss: 0.0514022
[400]	training's binary_logloss: 0.042668	valid_1's binary_logloss: 0.0487851
[450]	training's binary_logloss: 0.0398339	valid_1's binary_logloss: 0.0465592
[500]	training's binary_logloss: 0.0376665	valid_1's binary_logloss: 0.0449057
[550]	training's binary_logloss: 0.0352642	valid_1's binary_logloss: 0.043089
[600]	training's binary_logloss: 0.0332634	valid_1's binary_logloss: 0.041

# Extending Features 

In [11]:
from google.colab import drive
drive.mount('/content/drive')
#匯入資料
train_data = pd.read_csv('/content/drive/MyDrive/智金輪習Kaggle/train.csv')
test_data = "先不給你們"
#查看資料筆數
print("shape of train data:" , train_data.shape)
#print("shape of test data:" , test_data.shape)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
shape of train data: (533202, 59)


In [None]:
tmp_train_data = extend_with_detailed_time(train_data, 
  weekday = True, hour = True)
train_tmp_data = extend_with_time_difference_features(tmp_train_data, 
  max_time_shift = 5, pivot_feature = 'CHID')
train_tmp_data = extend_with_time_difference_features(tmp_train_data, 
  max_time_shift = 5, pivot_feature = 'CANO')
preprocessed_train_data = preprocessing(train_tmp_data)
resampled_train_data = resample(preprocessed_train_data, 
  sampling_rate=0.7, sample_type='downsample')
X, y = create_X_y(resampled_train_data, 
  drop_list = ["FRAUD_IND", "TXKEY", "DATETIME", "CANO", "CHID", "ACQIC", "MCHNO", "AGE"])
val_percentage = 0.33
x_train, x_test, y_train, y_test = train_test_split(X, y, 
  test_size=val_percentage, shuffle=True, random_state=42)
clf = train_lgb(x_train, x_test, y_train, y_test, 
  max_depth = 8, learning_rate = 0.05, n_estimators = 1000)
evaluate(clf, x_test, y_test)
#important_feature_table = get_important_feature_table(clf, x_train)
#important_feature_table

# Increase Transaction Count 

## Increase Transaction count for time difference from 5 to 10 

In [25]:
from google.colab import drive
drive.mount('/content/drive')
#匯入資料
train_data = pd.read_csv('/content/drive/MyDrive/智金輪習Kaggle/train.csv')
test_data = "先不給你們"
#查看資料筆數
print("shape of train data:" , train_data.shape)
#print("shape of test data:" , test_data.shape)
tmp_train_data = extend_with_detailed_time(train_data, 
  weekday = True, hour = True)
train_tmp_data = extend_with_time_difference_features(tmp_train_data, 
  max_time_shift = 10, pivot_feature = 'CHID')
#train_tmp_data = extend_with_time_difference_features(tmp_train_data, 
#  max_time_shift = 5, pivot_feature = 'CANO')
preprocessed_train_data = preprocessing(train_tmp_data)
resampled_train_data = resample(preprocessed_train_data, 
  sampling_rate=0.7, sample_type='downsample')
removed_unimportant_feature_count = 2
X, y = create_X_y(resampled_train_data, 
  drop_list = list(set(["FRAUD_IND", "TXKEY", "DATETIME", "CANO", "CHID", "ACQIC", "MCHNO", "AGE"] + \
  important_feature_table.set_index('col').index[-(removed_unimportant_feature_count):].tolist()))
)
val_percentage = 0.33
x_train, x_test, y_train, y_test = train_test_split(X, y, 
  test_size=val_percentage, shuffle=True, random_state=42)
clf = train_lgb(x_train, x_test, y_train, y_test, 
  max_depth = 8, learning_rate = 0.05, n_estimators = 1000)
evaluate(clf, x_test, y_test)
#important_feature_table = get_important_feature_table(clf, x_train)
#important_feature_table

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
shape of train data: (533202, 59)
add time difference between current and 1th-last transaction
add time difference between current and 2th-last transaction
add time difference between current and 3th-last transaction
add time difference between current and 4th-last transaction
add time difference between current and 5th-last transaction
add time difference between current and 6th-last transaction
add time difference between current and 7th-last transaction
add time difference between current and 8th-last transaction
add time difference between current and 9th-last transaction
add time difference between current and 10th-last transaction




Training until validation scores don't improve for 30 rounds.
[50]	training's binary_logloss: 0.0892547	valid_1's binary_logloss: 0.0893711
[100]	training's binary_logloss: 0.0582204	valid_1's binary_logloss: 0.0598786
[150]	training's binary_logloss: 0.0492312	valid_1's binary_logloss: 0.0519083
[200]	training's binary_logloss: 0.0444188	valid_1's binary_logloss: 0.0478067
[250]	training's binary_logloss: 0.0402722	valid_1's binary_logloss: 0.0444012
[300]	training's binary_logloss: 0.0367237	valid_1's binary_logloss: 0.0414856
[350]	training's binary_logloss: 0.0333854	valid_1's binary_logloss: 0.0388164
[400]	training's binary_logloss: 0.0306773	valid_1's binary_logloss: 0.0366652
[450]	training's binary_logloss: 0.0281007	valid_1's binary_logloss: 0.0347014
[500]	training's binary_logloss: 0.0258136	valid_1's binary_logloss: 0.0329116
[550]	training's binary_logloss: 0.0238282	valid_1's binary_logloss: 0.0314154
[600]	training's binary_logloss: 0.0219955	valid_1's binary_logloss: 0

Unnamed: 0,col,imp
25,CC_VINTAGE,2041
0,MCC,1614
8,SCITY,1464
6,FLAM1,1352
33,ACCT_VINTAGE,1276
...,...,...
4,INSFG,50
9,OVRLT,35
11,FALLBACK_IND,33
5,ITERM,13


## Increase Transaction count for time difference from 5 to 20 

In [37]:
from google.colab import drive
drive.mount('/content/drive')
#匯入資料
train_data = pd.read_csv('/content/drive/MyDrive/智金輪習Kaggle/train.csv')
test_data = "先不給你們"
#查看資料筆數
print("shape of train data:" , train_data.shape)
#print("shape of test data:" , test_data.shape)
tmp_train_data = extend_with_detailed_time(train_data, 
  weekday = True, hour = True)
train_tmp_data = extend_with_time_difference_features(tmp_train_data, 
  max_time_shift = 20, pivot_feature = 'CHID')
#train_tmp_data = extend_with_time_difference_features(tmp_train_data, 
#  max_time_shift = 5, pivot_feature = 'CANO')
preprocessed_train_data = preprocessing(train_tmp_data)
resampled_train_data = resample(preprocessed_train_data, 
  sampling_rate=0.7, sample_type='downsample')
removed_unimportant_feature_count = 5
X, y = create_X_y(resampled_train_data, 
  drop_list = list(set(["FRAUD_IND", "TXKEY", "DATETIME", "CANO", "CHID", "ACQIC", "MCHNO", "AGE"] + \
  important_feature_table.set_index('col').index[-(removed_unimportant_feature_count):].tolist()))
)
val_percentage = 0.33
x_train, x_test, y_train, y_test = train_test_split(X, y, 
  test_size=val_percentage, shuffle=True, random_state=42)
clf = train_lgb(x_train, x_test, y_train, y_test, 
  max_depth = 8, learning_rate = 0.05, n_estimators = 1000)
evaluate(clf, x_test, y_test)
#important_feature_table = get_important_feature_table(clf, x_train)
#important_feature_table

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
shape of train data: (533202, 59)
add time difference between current and 1th-last transaction
add time difference between current and 2th-last transaction
add time difference between current and 3th-last transaction
add time difference between current and 4th-last transaction
add time difference between current and 5th-last transaction
add time difference between current and 6th-last transaction
add time difference between current and 7th-last transaction
add time difference between current and 8th-last transaction
add time difference between current and 9th-last transaction
add time difference between current and 10th-last transaction
add time difference between current and 11th-last transaction
add time difference between current and 12th-last transaction
add time difference between current and 13th-last transaction
add time difference between current and 



Training until validation scores don't improve for 30 rounds.
[50]	training's binary_logloss: 0.083666	valid_1's binary_logloss: 0.0839237
[100]	training's binary_logloss: 0.0522189	valid_1's binary_logloss: 0.0540179
[150]	training's binary_logloss: 0.0434371	valid_1's binary_logloss: 0.0461507
[200]	training's binary_logloss: 0.0390603	valid_1's binary_logloss: 0.0424071
[250]	training's binary_logloss: 0.0351947	valid_1's binary_logloss: 0.0391872
[300]	training's binary_logloss: 0.0323212	valid_1's binary_logloss: 0.036856
[350]	training's binary_logloss: 0.0296248	valid_1's binary_logloss: 0.0347479
[400]	training's binary_logloss: 0.027224	valid_1's binary_logloss: 0.0327181
[450]	training's binary_logloss: 0.0251559	valid_1's binary_logloss: 0.0310547
[500]	training's binary_logloss: 0.0229929	valid_1's binary_logloss: 0.0294841
[550]	training's binary_logloss: 0.0211482	valid_1's binary_logloss: 0.0281075
[600]	training's binary_logloss: 0.0193928	valid_1's binary_logloss: 0.02

In [26]:
from google.colab import drive
drive.mount('/content/drive')
#匯入資料
train_data = pd.read_csv('/content/drive/MyDrive/智金輪習Kaggle/train.csv')
test_data = "先不給你們"
#查看資料筆數
print("shape of train data:" , train_data.shape)
#print("shape of test data:" , test_data.shape)
tmp_train_data = extend_with_detailed_time(train_data, 
  weekday = True, hour = True)
train_tmp_data = extend_with_time_difference_features(tmp_train_data, 
  max_time_shift = 10, pivot_feature = 'CHID')
#train_tmp_data = extend_with_time_difference_features(tmp_train_data, 
#  max_time_shift = 5, pivot_feature = 'CANO')
preprocessed_train_data = preprocessing(train_tmp_data)
resampled_train_data = resample(preprocessed_train_data, 
  sampling_rate=0.7, sample_type='downsample')
removed_unimportant_feature_count = 10
X, y = create_X_y(resampled_train_data, 
  drop_list = list(set(["FRAUD_IND", "TXKEY", "DATETIME", "CANO", "CHID", "ACQIC", "MCHNO", "AGE"] + \
  important_feature_table.set_index('col').index[-(removed_unimportant_feature_count):].tolist()))
)
val_percentage = 0.33
x_train, x_test, y_train, y_test = train_test_split(X, y, 
  test_size=val_percentage, shuffle=True, random_state=42)
clf = train_lgb(x_train, x_test, y_train, y_test, 
  max_depth = 8, learning_rate = 0.05, n_estimators = 1000)
evaluate(clf, x_test, y_test)
important_feature_table = get_important_feature_table(clf, x_train)
important_feature_table

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
shape of train data: (533202, 59)
add time difference between current and 1th-last transaction
add time difference between current and 2th-last transaction
add time difference between current and 3th-last transaction
add time difference between current and 4th-last transaction
add time difference between current and 5th-last transaction
add time difference between current and 6th-last transaction
add time difference between current and 7th-last transaction
add time difference between current and 8th-last transaction
add time difference between current and 9th-last transaction
add time difference between current and 10th-last transaction




Training until validation scores don't improve for 30 rounds.
[50]	training's binary_logloss: 0.0892277	valid_1's binary_logloss: 0.0892831
[100]	training's binary_logloss: 0.0584096	valid_1's binary_logloss: 0.0598766
[150]	training's binary_logloss: 0.0491892	valid_1's binary_logloss: 0.0515599
[200]	training's binary_logloss: 0.0438654	valid_1's binary_logloss: 0.0471535
[250]	training's binary_logloss: 0.0398491	valid_1's binary_logloss: 0.0438675
[300]	training's binary_logloss: 0.0364154	valid_1's binary_logloss: 0.0410122
[350]	training's binary_logloss: 0.0331525	valid_1's binary_logloss: 0.0383715
[400]	training's binary_logloss: 0.0301294	valid_1's binary_logloss: 0.0359607
[450]	training's binary_logloss: 0.0276933	valid_1's binary_logloss: 0.0340566
[500]	training's binary_logloss: 0.0254671	valid_1's binary_logloss: 0.0323919
[550]	training's binary_logloss: 0.0237251	valid_1's binary_logloss: 0.0310851
[600]	training's binary_logloss: 0.0218714	valid_1's binary_logloss: 0

Unnamed: 0,col,imp
20,CC_VINTAGE,2089
0,MCC,1668
8,SCITY,1495
6,FLAM1,1407
37,CURRENT_PURCH_AMT,1280
27,ACCT_VINTAGE,1244
7,STOCN,1227
29,BONUS_POINTS,1199
28,AVAILABLE_LIMIT_AMT,1085
33,CREDIT_USE_RATE,1044


# Testing time difference based on Card ID

## Using Customer ID only (with 30 unimportance feature removed) 

In [28]:
from google.colab import drive
drive.mount('/content/drive')
#匯入資料
train_data = pd.read_csv('/content/drive/MyDrive/智金輪習Kaggle/train.csv')
test_data = "先不給你們"
#查看資料筆數
print("shape of train data:" , train_data.shape)
#print("shape of test data:" , test_data.shape)
tmp_train_data = extend_with_detailed_time(train_data, 
  weekday = True, hour = True)
train_tmp_data = extend_with_time_difference_features(tmp_train_data, 
  max_time_shift = 10, pivot_feature = 'CHID')
#train_tmp_data = extend_with_time_difference_features(tmp_train_data, 
#  max_time_shift = 5, pivot_feature = 'CANO')
preprocessed_train_data = preprocessing(train_tmp_data)
resampled_train_data = resample(preprocessed_train_data, 
  sampling_rate=0.7, sample_type='downsample')
removed_unimportant_feature_count = 30
X, y = create_X_y(resampled_train_data, 
  drop_list = list(set(["FRAUD_IND", "TXKEY", "DATETIME", "CANO", "CHID", "ACQIC", "MCHNO", "AGE"] + \
  important_feature_table.set_index('col').index[-(removed_unimportant_feature_count):].tolist()))
)
val_percentage = 0.33
x_train, x_test, y_train, y_test = train_test_split(X, y, 
  test_size=val_percentage, shuffle=True, random_state=42)
clf = train_lgb(x_train, x_test, y_train, y_test, 
  max_depth = 8, learning_rate = 0.05, n_estimators = 1000)
evaluate(clf, x_test, y_test)
#important_feature_table = get_important_feature_table(clf, x_train)
#important_feature_table

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
shape of train data: (533202, 59)
add time difference between current and 1th-last transaction
add time difference between current and 2th-last transaction
add time difference between current and 3th-last transaction
add time difference between current and 4th-last transaction
add time difference between current and 5th-last transaction
add time difference between current and 6th-last transaction
add time difference between current and 7th-last transaction
add time difference between current and 8th-last transaction
add time difference between current and 9th-last transaction
add time difference between current and 10th-last transaction




Training until validation scores don't improve for 30 rounds.
[50]	training's binary_logloss: 0.0911569	valid_1's binary_logloss: 0.091662
[100]	training's binary_logloss: 0.0612414	valid_1's binary_logloss: 0.0630541
[150]	training's binary_logloss: 0.0519155	valid_1's binary_logloss: 0.0548417
[200]	training's binary_logloss: 0.0466882	valid_1's binary_logloss: 0.0504925
[250]	training's binary_logloss: 0.0422205	valid_1's binary_logloss: 0.0467794
[300]	training's binary_logloss: 0.0383521	valid_1's binary_logloss: 0.0435877
[350]	training's binary_logloss: 0.0353123	valid_1's binary_logloss: 0.0410903
[400]	training's binary_logloss: 0.0324583	valid_1's binary_logloss: 0.0387583
[450]	training's binary_logloss: 0.0299694	valid_1's binary_logloss: 0.0368855
[500]	training's binary_logloss: 0.0279558	valid_1's binary_logloss: 0.0353859
[550]	training's binary_logloss: 0.0260104	valid_1's binary_logloss: 0.0339417
[600]	training's binary_logloss: 0.0243362	valid_1's binary_logloss: 0.

## Using Card ID too (with 30 unimportance feature removed) 

In [29]:
from google.colab import drive
drive.mount('/content/drive')
#匯入資料
train_data = pd.read_csv('/content/drive/MyDrive/智金輪習Kaggle/train.csv')
test_data = "先不給你們"
#查看資料筆數
print("shape of train data:" , train_data.shape)
#print("shape of test data:" , test_data.shape)
tmp_train_data = extend_with_detailed_time(train_data, 
  weekday = True, hour = True)
train_tmp_data = extend_with_time_difference_features(tmp_train_data, 
  max_time_shift = 10, pivot_feature = 'CHID')
train_tmp_data = extend_with_time_difference_features(tmp_train_data, 
  max_time_shift = 5, pivot_feature = 'CANO')
preprocessed_train_data = preprocessing(train_tmp_data)
resampled_train_data = resample(preprocessed_train_data, 
  sampling_rate=0.7, sample_type='downsample')
removed_unimportant_feature_count = 30
X, y = create_X_y(resampled_train_data, 
  drop_list = list(set(["FRAUD_IND", "TXKEY", "DATETIME", "CANO", "CHID", "ACQIC", "MCHNO", "AGE"] + \
  important_feature_table.set_index('col').index[-(removed_unimportant_feature_count):].tolist()))
)
val_percentage = 0.33
x_train, x_test, y_train, y_test = train_test_split(X, y, 
  test_size=val_percentage, shuffle=True, random_state=42)
clf = train_lgb(x_train, x_test, y_train, y_test, 
  max_depth = 8, learning_rate = 0.05, n_estimators = 1000)
evaluate(clf, x_test, y_test)
#important_feature_table = get_important_feature_table(clf, x_train)
#important_feature_table

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
shape of train data: (533202, 59)
add time difference between current and 1th-last transaction
add time difference between current and 2th-last transaction
add time difference between current and 3th-last transaction
add time difference between current and 4th-last transaction
add time difference between current and 5th-last transaction
add time difference between current and 6th-last transaction
add time difference between current and 7th-last transaction
add time difference between current and 8th-last transaction
add time difference between current and 9th-last transaction
add time difference between current and 10th-last transaction
add time difference between current and 1th-last transaction
add time difference between current and 2th-last transaction
add time difference between current and 3th-last transaction
add time difference between current and 4th



Training until validation scores don't improve for 30 rounds.
[50]	training's binary_logloss: 0.106206	valid_1's binary_logloss: 0.106895
[100]	training's binary_logloss: 0.0771259	valid_1's binary_logloss: 0.0794538
[150]	training's binary_logloss: 0.0675126	valid_1's binary_logloss: 0.0709598
[200]	training's binary_logloss: 0.0617478	valid_1's binary_logloss: 0.0660632
[250]	training's binary_logloss: 0.0566549	valid_1's binary_logloss: 0.0619732
[300]	training's binary_logloss: 0.0524137	valid_1's binary_logloss: 0.0586813
[350]	training's binary_logloss: 0.0487457	valid_1's binary_logloss: 0.0559055
[400]	training's binary_logloss: 0.0455101	valid_1's binary_logloss: 0.0534152
[450]	training's binary_logloss: 0.0426535	valid_1's binary_logloss: 0.0512436
[500]	training's binary_logloss: 0.0404259	valid_1's binary_logloss: 0.0495829
[550]	training's binary_logloss: 0.0382912	valid_1's binary_logloss: 0.047974
[600]	training's binary_logloss: 0.0358686	valid_1's binary_logloss: 0.04

# Reducing Feature Count

## remove 5 unimportant features  

In [30]:
from google.colab import drive
drive.mount('/content/drive')
#匯入資料
train_data = pd.read_csv('/content/drive/MyDrive/智金輪習Kaggle/train.csv')
test_data = "先不給你們"
#查看資料筆數
print("shape of train data:" , train_data.shape)
#print("shape of test data:" , test_data.shape)
tmp_train_data = extend_with_detailed_time(train_data, 
  weekday = True, hour = True)
train_tmp_data = extend_with_time_difference_features(tmp_train_data, 
  max_time_shift = 10, pivot_feature = 'CHID')
#train_tmp_data = extend_with_time_difference_features(tmp_train_data, 
#  max_time_shift = 5, pivot_feature = 'CANO')
preprocessed_train_data = preprocessing(train_tmp_data)
resampled_train_data = resample(preprocessed_train_data, 
  sampling_rate=0.7, sample_type='downsample')
removed_unimportant_feature_count = 5
X, y = create_X_y(resampled_train_data, 
  drop_list = list(set(["FRAUD_IND", "TXKEY", "DATETIME", "CANO", "CHID", "ACQIC", "MCHNO", "AGE"] + \
  important_feature_table.set_index('col').index[-(removed_unimportant_feature_count):].tolist()))
)
val_percentage = 0.33
x_train, x_test, y_train, y_test = train_test_split(X, y, 
  test_size=val_percentage, shuffle=True, random_state=42)
clf = train_lgb(x_train, x_test, y_train, y_test, 
  max_depth = 8, learning_rate = 0.05, n_estimators = 1000)
evaluate(clf, x_test, y_test)
#important_feature_table = get_important_feature_table(clf, x_train)
#important_feature_table

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
shape of train data: (533202, 59)
add time difference between current and 1th-last transaction
add time difference between current and 2th-last transaction
add time difference between current and 3th-last transaction
add time difference between current and 4th-last transaction
add time difference between current and 5th-last transaction
add time difference between current and 6th-last transaction
add time difference between current and 7th-last transaction
add time difference between current and 8th-last transaction
add time difference between current and 9th-last transaction
add time difference between current and 10th-last transaction




Training until validation scores don't improve for 30 rounds.
[50]	training's binary_logloss: 0.0892547	valid_1's binary_logloss: 0.0893711
[100]	training's binary_logloss: 0.0583426	valid_1's binary_logloss: 0.0599961
[150]	training's binary_logloss: 0.0491506	valid_1's binary_logloss: 0.0518282
[200]	training's binary_logloss: 0.0445937	valid_1's binary_logloss: 0.0480088
[250]	training's binary_logloss: 0.0403131	valid_1's binary_logloss: 0.0444603
[300]	training's binary_logloss: 0.0367402	valid_1's binary_logloss: 0.0415238
[350]	training's binary_logloss: 0.0336079	valid_1's binary_logloss: 0.0389672
[400]	training's binary_logloss: 0.0310796	valid_1's binary_logloss: 0.0369998
[450]	training's binary_logloss: 0.0281554	valid_1's binary_logloss: 0.0346962
[500]	training's binary_logloss: 0.0261544	valid_1's binary_logloss: 0.0331458
[550]	training's binary_logloss: 0.0244262	valid_1's binary_logloss: 0.0318157
[600]	training's binary_logloss: 0.0226625	valid_1's binary_logloss: 0

## remove 15 unimportant features  

In [36]:
from google.colab import drive
drive.mount('/content/drive')
#匯入資料
train_data = pd.read_csv('/content/drive/MyDrive/智金輪習Kaggle/train.csv')
test_data = "先不給你們"
#查看資料筆數
print("shape of train data:" , train_data.shape)
#print("shape of test data:" , test_data.shape)
tmp_train_data = extend_with_detailed_time(train_data, 
  weekday = True, hour = True)
train_tmp_data = extend_with_time_difference_features(tmp_train_data, 
  max_time_shift = 10, pivot_feature = 'CHID')
#train_tmp_data = extend_with_time_difference_features(tmp_train_data, 
#  max_time_shift = 5, pivot_feature = 'CANO')
preprocessed_train_data = preprocessing(train_tmp_data)
resampled_train_data = resample(preprocessed_train_data, 
  sampling_rate=0.7, sample_type='downsample')
removed_unimportant_feature_count = 15
X, y = create_X_y(resampled_train_data, 
  drop_list = list(set(["FRAUD_IND", "TXKEY", "DATETIME", "CANO", "CHID", "ACQIC", "MCHNO", "AGE"] + \
  important_feature_table.set_index('col').index[-(removed_unimportant_feature_count):].tolist()))
)
val_percentage = 0.33
x_train, x_test, y_train, y_test = train_test_split(X, y, 
  test_size=val_percentage, shuffle=True, random_state=42)
clf = train_lgb(x_train, x_test, y_train, y_test, 
  max_depth = 8, learning_rate = 0.05, n_estimators = 1000)
evaluate(clf, x_test, y_test)
#important_feature_table = get_important_feature_table(clf, x_train)
#important_feature_table

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
shape of train data: (533202, 59)
add time difference between current and 1th-last transaction
add time difference between current and 2th-last transaction
add time difference between current and 3th-last transaction
add time difference between current and 4th-last transaction
add time difference between current and 5th-last transaction
add time difference between current and 6th-last transaction
add time difference between current and 7th-last transaction
add time difference between current and 8th-last transaction
add time difference between current and 9th-last transaction
add time difference between current and 10th-last transaction




Training until validation scores don't improve for 30 rounds.
[50]	training's binary_logloss: 0.0895316	valid_1's binary_logloss: 0.0896586
[100]	training's binary_logloss: 0.0586199	valid_1's binary_logloss: 0.0603031
[150]	training's binary_logloss: 0.0492577	valid_1's binary_logloss: 0.051909
[200]	training's binary_logloss: 0.0440942	valid_1's binary_logloss: 0.0475568
[250]	training's binary_logloss: 0.0401819	valid_1's binary_logloss: 0.044337
[300]	training's binary_logloss: 0.0364765	valid_1's binary_logloss: 0.0414256
[350]	training's binary_logloss: 0.0330621	valid_1's binary_logloss: 0.0386661
[400]	training's binary_logloss: 0.0302222	valid_1's binary_logloss: 0.0363878
[450]	training's binary_logloss: 0.0278304	valid_1's binary_logloss: 0.0345247
[500]	training's binary_logloss: 0.0254633	valid_1's binary_logloss: 0.0326841
[550]	training's binary_logloss: 0.0237583	valid_1's binary_logloss: 0.0314527
[600]	training's binary_logloss: 0.0219331	valid_1's binary_logloss: 0.0

## remove 30 unimportant features  

In [31]:
from google.colab import drive
drive.mount('/content/drive')
#匯入資料
train_data = pd.read_csv('/content/drive/MyDrive/智金輪習Kaggle/train.csv')
test_data = "先不給你們"
#查看資料筆數
print("shape of train data:" , train_data.shape)
#print("shape of test data:" , test_data.shape)
tmp_train_data = extend_with_detailed_time(train_data, 
  weekday = True, hour = True)
train_tmp_data = extend_with_time_difference_features(tmp_train_data, 
  max_time_shift = 10, pivot_feature = 'CHID')
#train_tmp_data = extend_with_time_difference_features(tmp_train_data, 
#  max_time_shift = 5, pivot_feature = 'CANO')
preprocessed_train_data = preprocessing(train_tmp_data)
resampled_train_data = resample(preprocessed_train_data, 
  sampling_rate=0.7, sample_type='downsample')
removed_unimportant_feature_count = 30
X, y = create_X_y(resampled_train_data, 
  drop_list = list(set(["FRAUD_IND", "TXKEY", "DATETIME", "CANO", "CHID", "ACQIC", "MCHNO", "AGE"] + \
  important_feature_table.set_index('col').index[-(removed_unimportant_feature_count):].tolist()))
)
val_percentage = 0.33
x_train, x_test, y_train, y_test = train_test_split(X, y, 
  test_size=val_percentage, shuffle=True, random_state=42)
clf = train_lgb(x_train, x_test, y_train, y_test, 
  max_depth = 8, learning_rate = 0.05, n_estimators = 1000)
evaluate(clf, x_test, y_test)
#important_feature_table = get_important_feature_table(clf, x_train)
#important_feature_table

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
shape of train data: (533202, 59)
add time difference between current and 1th-last transaction
add time difference between current and 2th-last transaction
add time difference between current and 3th-last transaction
add time difference between current and 4th-last transaction
add time difference between current and 5th-last transaction
add time difference between current and 6th-last transaction
add time difference between current and 7th-last transaction
add time difference between current and 8th-last transaction
add time difference between current and 9th-last transaction
add time difference between current and 10th-last transaction




Training until validation scores don't improve for 30 rounds.
[50]	training's binary_logloss: 0.0911569	valid_1's binary_logloss: 0.091662
[100]	training's binary_logloss: 0.0612414	valid_1's binary_logloss: 0.0630541
[150]	training's binary_logloss: 0.0519155	valid_1's binary_logloss: 0.0548417
[200]	training's binary_logloss: 0.0466882	valid_1's binary_logloss: 0.0504925
[250]	training's binary_logloss: 0.0422205	valid_1's binary_logloss: 0.0467794
[300]	training's binary_logloss: 0.0383521	valid_1's binary_logloss: 0.0435877
[350]	training's binary_logloss: 0.0353123	valid_1's binary_logloss: 0.0410903
[400]	training's binary_logloss: 0.0324583	valid_1's binary_logloss: 0.0387583
[450]	training's binary_logloss: 0.0299694	valid_1's binary_logloss: 0.0368855
[500]	training's binary_logloss: 0.0279558	valid_1's binary_logloss: 0.0353859
[550]	training's binary_logloss: 0.0260104	valid_1's binary_logloss: 0.0339417
[600]	training's binary_logloss: 0.0243362	valid_1's binary_logloss: 0.

# Generate Testing Result 

# Copy the training pipeline of the best performing model here  

In [None]:
from google.colab import drive
drive.mount('/content/drive')
#匯入資料
train_data = pd.read_csv('/content/drive/MyDrive/智金輪習Kaggle/train.csv')
test_data = "先不給你們"
#查看資料筆數
print("shape of train data:" , train_data.shape)
#print("shape of test data:" , test_data.shape)
tmp_train_data = extend_with_detailed_time(train_data, 
  weekday = True, hour = True)
train_tmp_data = extend_with_time_difference_features(tmp_train_data, 
  max_time_shift = 20, pivot_feature = 'CHID')
#train_tmp_data = extend_with_time_difference_features(tmp_train_data, 
#  max_time_shift = 5, pivot_feature = 'CANO')
preprocessed_train_data = preprocessing(train_tmp_data)
resampled_train_data = resample(preprocessed_train_data, 
  sampling_rate=0.7, sample_type='downsample')
removed_unimportant_feature_count = 5
X, y = create_X_y(resampled_train_data, 
  drop_list = list(set(["FRAUD_IND", "TXKEY", "DATETIME", "CANO", "CHID", "ACQIC", "MCHNO", "AGE"] + \
  important_feature_table.set_index('col').index[-(removed_unimportant_feature_count):].tolist()))
)
val_percentage = 0.33
x_train, x_test, y_train, y_test = train_test_split(X, y, 
  test_size=val_percentage, shuffle=True, random_state=42)
clf = train_lgb(x_train, x_test, y_train, y_test, 
  max_depth = 8, learning_rate = 0.05, n_estimators = 1000)
evaluate(clf, x_test, y_test)
#important_feature_table = get_important_feature_table(clf, x_train)
#important_feature_table

## Generate testing result 

In [40]:
test_data = pd.read_csv('/content/drive/MyDrive/智金輪習Kaggle/test.csv')
print("shape of test data:" , test_data.shape)
tmp_test_data = extend_with_detailed_time(test_data, 
  weekday = True, hour = True)
tmp_test_data = extend_with_time_difference_features(tmp_test_data, 
  max_time_shift = 20, pivot_feature = 'CHID')

preprocessed_test_data = preprocessing(tmp_test_data)

removed_unimportant_feature_count = 5

def create_X(data, drop_list = []):
  if drop_list:
    return data.drop(drop_list, 1)
  else:
    return data

X = create_X(preprocessed_test_data, 
  drop_list = list(set(["TXKEY", "DATETIME", "CANO", "CHID", "ACQIC", "MCHNO", "AGE"] + \
  important_feature_table.set_index('col').index[-(removed_unimportant_feature_count):].tolist()))
)

y_pred = clf.predict(X)
threshold = 0.4981942164802951
y_result = (y_pred > threshold).astype(int) 

shape of test data: (472335, 58)
add time difference between current and 1th-last transaction
add time difference between current and 2th-last transaction
add time difference between current and 3th-last transaction
add time difference between current and 4th-last transaction
add time difference between current and 5th-last transaction
add time difference between current and 6th-last transaction
add time difference between current and 7th-last transaction
add time difference between current and 8th-last transaction
add time difference between current and 9th-last transaction
add time difference between current and 10th-last transaction
add time difference between current and 11th-last transaction
add time difference between current and 12th-last transaction
add time difference between current and 13th-last transaction
add time difference between current and 14th-last transaction
add time difference between current and 15th-last transaction
add time difference between current and 16th-l

In [54]:
y_pred = clf.predict(X)
threshold = 0.4981942164802951
y_result = (y_pred > threshold).astype(int).T
result_table = pd.DataFrame([test_data['TXKEY'], y_result]).T
result_table.columns = ['TXKEY', 'FRAUD_IND']
#result_table = result_table.set_index('TXKEY')
result_table.to_csv('submission.csv')

In [56]:
print("imbalance rate of train data:", train_data['FRAUD_IND'].mean())

imbalance_rate of train data: 0.14388730724941018


In [59]:
for threshold in [0.9, 0.95, 1.0]:
  print('threshold:', threshold)
  y_result = (y_pred > threshold).astype(int).T
  result_table = pd.DataFrame([test_data['TXKEY'], y_result]).T
  result_table.columns = ['TXKEY', 'FRAUD_IND']
  print("imbalance rate of test data:", result_table['FRAUD_IND'].mean())

threshold: 0.9
imbalance rate of test data: 0.2434373908348947
threshold: 0.95
imbalance rate of test data: 0.17955264801465062
threshold: 1.0
imbalance rate of test data: 0.0


imbalance rate of test data: 0.4797103750516053


In [53]:
result_table.to_csv('submission.csv')

Unnamed: 0,TXKEY,FRAUD_IND
0,ES0100920180318AABPO,0
1,ES0100A20180318AABQO,0
2,VS0I00120180302AABVX,1
3,VS0I00120180302AACO5,1
4,VS0I00120180319AABMI,1
...,...,...
472330,NC0100F20180320AAHOX,0
472331,NC0101020180321AAG1V,0
472332,NC0100620180330AAIX1,0
472333,NC0100720180331AADJU,0
