In [1]:
%reload_ext autoreload
%autoreload 2

In [2]:
# default_exp eda

# IEEE-CIS Fraud Detection
https://www.kaggle.com/c/ieee-fraud-detection/

Imagine standing at the check-out counter at the grocery store with a long line behind you and the cashier not-so-quietly announces that your card has been declined. In this moment, you probably aren’t thinking about the data science that determined your fate.

Embarrassed, and certain you have the funds to cover everything needed for an epic nacho party for 50 of your closest friends, you try your card again. Same result. As you step aside and allow the cashier to tend to the next customer, you receive a text message from your bank. “Press 1 if you really tried to spend $500 on cheddar cheese.”

While perhaps cumbersome (and often embarrassing) in the moment, this fraud prevention system is actually saving consumers millions of dollars per year. Researchers from the IEEE Computational Intelligence Society (IEEE-CIS) want to improve this figure, while also improving the customer experience. With higher accuracy fraud detection, you can get on with your chips without the hassle.

IEEE-CIS works across a variety of AI and machine learning areas, including deep neural networks, fuzzy systems, evolutionary computation, and swarm intelligence. Today they’re partnering with the world’s leading payment service company, Vesta Corporation, seeking the best solutions for fraud prevention industry, and now you are invited to join the challenge.

In this competition, you’ll benchmark machine learning models on a challenging large-scale dataset. The data comes from Vesta's real-world e-commerce transactions and contains a wide range of features from device type to product features. You also have the opportunity to create new features to improve your results.

If successful, you’ll improve the efficacy of fraudulent transaction alerts for millions of people around the world, helping hundreds of thousands of businesses reduce their fraud loss and increase their revenue. And of course, you will save party people just like you the hassle of false positives.

Acknowledgements:

Vesta Corporation provided the dataset for this competition. Vesta Corporation is the forerunner in guaranteed e-commerce payment solutions. Founded in 1995, Vesta pioneered the process of fully guaranteed card-not-present (CNP) payment transactions for the telecommunications industry. Since then, Vesta has firmly expanded data science and machine learning capabilities across the globe and solidified its position as the leader in guaranteed ecommerce payments. Today, Vesta guarantees more than $18B in transactions annually.

Header Photo by Tim Evans on Unsplash

## Evaluation
Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.
Submission File

For each TransactionID in the test set, you must predict a probability for the isFraud variable. The file should contain a header and have the following format:

    TransactionID,isFraud
    3663549,0.5
    3663550,0.5
    3663551,0.5
    etc.

## Prizes

    1st Prize: $10,000
    2nd Prize: $7,000
    3rd Prize: $3,000

Winners will be required to submit a write-up for the IEEE CIS Conference, to which they are invited and highly encouraged to attend and present their work.


## Timeline
UPDATE: The below timeline has been updated according to this post. Please see that post and the competition rules for more details.

    September 24, 2019 - Entry deadline. You must accept the competition rules before this date in order to compete.

    September 24, 2019 - Team Merger deadline. This is the last day participants may join or merge teams.

    September 24, 2019 - External Data Disclosure deadline. This is the last day to disclose any used external data to the competition forums

    October 3, 2019 - Final submission deadline. After this date, we will not be taking any more submissions. Remember to select your two best submissions for final scoring.




## Data Description

In this competition you are predicting the probability that an online transaction is fraudulent, as denoted by the binary target isFraud.

The data is broken into two files identity and transaction, which are joined by TransactionID. Not all transactions have corresponding identity information.

### Categorical Features - Transaction

    ProductCD
    card1 - card6
    addr1, addr2
    P_emaildomain
    R_emaildomain
    M1 - M9

### Categorical Features - Identity

    DeviceType
    DeviceInfo
    id_12 - id_38

The TransactionDT feature is a timedelta from a given reference datetime (not an actual timestamp).

You can read more about the data from this post by the competition host.
### Files

    train_{transaction, identity}.csv - the training set
    test_{transaction, identity}.csv - the test set (you must predict the isFraud value for these observations)
    sample_submission.csv - a sample submission file in the correct format

# lib导入

In [2]:
# export
import os
from code.config import * 
from loguru import logger
import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 100)  # 设置显示数据的最大列数，防止出现省略号…，导致数据显示不全
pd.set_option('expand_frame_repr', False)  # 当列太多时不自动换行



In [3]:
import sys
sys.path.append('..')
import seaborn as sns
sns.set(font='Arial Unicode MS')  # 解决Seaborn中文显示问题
from mylib.utils.pickle import PickleWrapper

In [4]:
args.DATA_DIR

'../../data/contest/ieee-fraud-detection/'

In [5]:
!ls ../../data/contest/ieee-fraud-detection/

[31msample_submission.csv[m[m [31mtest_transaction.csv[m[m  [31mtrain_transaction.csv[m[m
[31mtest_identity.csv[m[m     [31mtrain_identity.csv[m[m


# 1st Place Solution - Part 1
https://www.kaggle.com/c/ieee-fraud-detection/discussion/111284

## Magic Feature
我们不会预测欺诈交易。 根据比赛的主持人林恩在这里。 客户（信用卡）欺诈后，其整个帐户将转换为isFraud = 1。 因此，我们正在预测欺诈性客户（信用卡）。

我们标记的逻辑是将卡上报告的费用分摊定义为欺诈交易（isFraud = 1），并将其后面与用户帐户，电子邮件地址或帐单地址直接关联到这些属性的交易也称为欺诈。 如果没有报告并发现超过120天以上，则我们将其定义为合法交易（isFraud = 0）。

您可能认为120天后，卡片变成isFraud = 0。 我们很少在训练数据中看到这一点。 （也许欺诈性信用卡会被终止而不是重新使用）。 下面的饼图显示了这一点。 训练数据集有73838个客户（信用卡），交易量为2个或更多。 其中，71575（96.9％）始终为isFraud = 0，2134（2.9％）始终为isFraud = 1。 只有129（0.2％）具有isFraud = 0和isFraud = 1的混合。

## Fraudulent Clients
以下是2134个示例中的一个示例，我可以向您展示isFraud = 1的客户端。 

这是client = 2988694标识客户端的关键是card1，addr1和D1这三列。 

D1列是“自客户（信用卡）开始以来的天数”。

因此，如果我们创建D1n = TransactionDay减去D1，则得出卡开始的日期（其中TransactionDay = TransactionDT /（24 * 60 * 60））。 

在下面的示例中，此信用卡的开始日期为-81（大约为2017年9月10日，因为据认为数据集将于2017年12月1日开始）。 

我们还看到D3n = TransactionDay减去D3是指向每个客户的最后一笔交易的日期的指针。 （还有更多列可帮助我们识别客户）。

## Preventing Overfitting
为了避免过度拟合训练和过度拟合公共测试数据集，我们一定不能直接使用客户端UID，也不能使用数据集中的数十列来帮助识别客户端的列，例如某些D，V和ID列。 （您知道哪些列在此处说明的训练/测试对抗性验证期间通过回顾LGBM功能重要性来标识客户）。

我们无法将UID添加为新列，因为私人测试数据集中68.2％的客户端不在训练数据集中。 相反，我们必须创建聚合组特征。 例如，我们可以获取所有C，M列，并执行new_features = df.groupby（'uid'）[CM_columns] .agg（['mean']）。 然后我们删除列uid。 现在，我们的模型具有对客户进行分类的能力，这是以前从未见过的。

Model Details

There is so much more to say. Namely, I know you would like to know the specifics of our EDA, our models, validation methodology, and stacking/ensembling methodology. Konstantin has provided a brief summary here. I will begin working on "1st Place Solution - Part 2" that describes the technical aspects in more detail. Also we will begin sharing our code as we clean it up.

I look forward to reading everyone's solutions. Please share. 

# 1st Place Solution - Part 2
https://www.kaggle.com/c/ieee-fraud-detection/discussion/111308

In "1st Place Solution - Part 1" posted here, we discussed the benefits of classifying clients (credit cards) instead of transactions in Kaggle's Fraud competition. Here we will discuss the technical details. 

## Final Model

Our final model was a combination of 3 high scoring single models. 
* CatBoost (Public/Private LB of 0.9639/0.9408), 
* LGBM (0.9617/0.9384), and 
* XGB (0.9602/0.9324). 

These models were diversified because Konstantin built the CAT and LGB while I built the XGB and NN. And we engineered features independently. (In the end we didn't use the NN which had LB 0.9432). XGB notebook posted here.

One final submission was a stack where LGBM was trained on top of the predictions of CAT and XGB and the other final submission was an ensemble with equal weights. Both submissions were post processed by taking all predictions from a single client (credit card) and replacing them with that client's average prediction. This PP increased LB by 0.001.

## How to Find UIDs

We found UIDs in two different ways. (Specific details here).

    Wrote a script that finds UIDs here
    Train our models to find UIDs here and here

If you remember, Konstantin's original public FE kernel here without UIDs achieves local validation AUC = 0.9245 and public LB 0.9485. His new FE kernel here achieves local validation AUC = 0.9377 and public LB 0.9617 by finding and using UIDs. Soon I will post my XGB kernel which finds UIDs with even less human assistance and proves to beat all other methods of finding UIDs. (XGB posted here). The purpose of producing UIDs by a script was for EDA, special validation tests, and post process. We did not add the script's UIDs to our models. Machine learning did better finding them on its own.
EDA

EDA was daunting in this competition. There were so many columns to analyze and their meanings were obscured. For the first 150 columns, we used Alijs's great EDA here. For the remaining 300 columns, we used my V and ID EDA here. We reduced the number of V columns with 3 tricks. First groups of V columns were found that shared similar NAN structure, next we used 1 of 3 methods:

    We applied PCA on each group individually
    We selected a maximum sized subset of uncorrelated columns from each group
    We replaced the entire group with all columns averaged.

Afterward, these reduced groups were further evaluated using feature selection techniques below. For example, the block V322-V339 failed "time consistency" and was removed from our models.
Feature Selection

Feature selection was important because we had many columns and preferred to keep our models efficient. My XGB had 250 features and would train 6 folds in 10 minutes. Konstantin will need to say what his models had. We used every trick we knew to select our features:

    forward feature selection (using single or groups of features)
    recursive feature elimination (using single or groups of features)
    permutation importance
    adversarial validation
    correlation analysis
    time consistency
    client consistency
    train/test distribution analysis

One interesting trick called "time consistency" is to train a single model using a single feature (or small group of features) on the first month of train dataset and predict isFraud for the last month of train dataset. This evaluates whether a feature by itself is consistent over time. 95% were but we found 5% of columns hurt our models. They had training AUC around 0.60 and validation AUC 0.40. In other words some features found patterns in the present that did not exist in the future. Of course the possible of interactions complicates things but we double checked every test with other tests.
Validation Strategy

We never trusted a single validation strategy so we used lots of validation strategies. Train on first 4 months of train, skip a month, predict last month. We also did train 2, skip 2, predict 2. We did train 1 skip 4 predict 1. We reviewed LB scores (which is just train 6, skip 1, predict 1 and no less valid than other holdouts). We did a CV GroupKFold using month as the group. We also analyzed models by how well they classified known versus unknown clients using our script's UIDs.

For example when training on the first 5 months and predicting the last month, we found that our

    XGB model did best predicting known UIDs with AUC = 0.99723
    LGBM model did best predicting unknown UIDs with AUC = 0.92117
    CAT model did best predicting questionable UIDs with AUC = 0.98834

Questionable UIDs are transactions that our script could not confidently link to other transactions. When we ensembled and/or stacked our models we found that the resultant model excelled in all three categories. It could predict known, unknown, and questionable UIDs forward in time with great accuracy !!

# code-xgb Fraud with Magic scores LB 0.96
https://www.kaggle.com/cdeotte/xgb-fraud-with-magic-0-9600

## load data

### train_identity

In [4]:
train_identity = pd.read_csv(os.path.join(args.DATA_DIR, 'train_identity.csv'))

train_identity.head()

Unnamed: 0,TransactionID,id_01,id_02,id_03,id_04,id_05,id_06,id_07,id_08,id_09,id_10,id_11,id_12,id_13,id_14,id_15,id_16,id_17,id_18,id_19,id_20,id_21,id_22,id_23,id_24,id_25,id_26,id_27,id_28,id_29,id_30,id_31,id_32,id_33,id_34,id_35,id_36,id_37,id_38,DeviceType,DeviceInfo
0,2987004,0.0,70787.0,,,,,,,,,100.0,NotFound,,-480.0,New,NotFound,166.0,,542.0,144.0,,,,,,,,New,NotFound,Android 7.0,samsung browser 6.2,32.0,2220x1080,match_status:2,T,F,T,T,mobile,SAMSUNG SM-G892A Build/NRD90M
1,2987008,-5.0,98945.0,,,0.0,-5.0,,,,,100.0,NotFound,49.0,-300.0,New,NotFound,166.0,,621.0,500.0,,,,,,,,New,NotFound,iOS 11.1.2,mobile safari 11.0,32.0,1334x750,match_status:1,T,F,F,T,mobile,iOS Device
2,2987010,-5.0,191631.0,0.0,0.0,0.0,0.0,,,0.0,0.0,100.0,NotFound,52.0,,Found,Found,121.0,,410.0,142.0,,,,,,,,Found,Found,,chrome 62.0,,,,F,F,T,T,desktop,Windows
3,2987011,-5.0,221832.0,,,0.0,-6.0,,,,,100.0,NotFound,52.0,,New,NotFound,225.0,,176.0,507.0,,,,,,,,New,NotFound,,chrome 62.0,,,,F,F,T,T,desktop,
4,2987016,0.0,7460.0,0.0,0.0,1.0,0.0,,,0.0,0.0,100.0,NotFound,,-300.0,Found,Found,166.0,15.0,529.0,575.0,,,,,,,,Found,Found,Mac OS X 10_11_6,chrome 62.0,24.0,1280x800,match_status:2,T,F,T,T,desktop,MacOS


In [5]:
train_identity.shape

(144233, 41)

### train_transaction

In [9]:
train_transaction = pd.read_csv(os.path.join(args.DATA_DIR, 'train_transaction.csv'))

train_transaction.head()

Unnamed: 0,TransactionID,isFraud,TransactionDT,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,card6,addr1,addr2,dist1,dist2,P_emaildomain,R_emaildomain,C1,C2,C3,C4,C5,C6,C7,C8,C9,C10,C11,C12,C13,C14,D1,D2,D3,D4,D5,D6,D7,D8,D9,D10,D11,D12,D13,D14,D15,M1,M2,M3,M4,...,V290,V291,V292,V293,V294,V295,V296,V297,V298,V299,V300,V301,V302,V303,V304,V305,V306,V307,V308,V309,V310,V311,V312,V313,V314,V315,V316,V317,V318,V319,V320,V321,V322,V323,V324,V325,V326,V327,V328,V329,V330,V331,V332,V333,V334,V335,V336,V337,V338,V339
0,2987000,0,86400,68.5,W,13926,,150.0,discover,142.0,credit,315.0,87.0,19.0,,,,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,2.0,0.0,1.0,1.0,14.0,,13.0,,,,,,,13.0,13.0,,,,0.0,T,T,T,M2,...,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,117.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,117.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,
1,2987001,0,86401,29.0,W,2755,404.0,150.0,mastercard,102.0,credit,325.0,87.0,,,gmail.com,,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,,,0.0,,,,,,0.0,,,,,0.0,,,,M0,...,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,
2,2987002,0,86469,59.0,W,4663,490.0,150.0,visa,166.0,debit,330.0,87.0,287.0,,outlook.com,,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,,,0.0,,,,,,0.0,315.0,,,,315.0,T,T,T,M0,...,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,
3,2987003,0,86499,50.0,W,18132,567.0,150.0,mastercard,117.0,debit,476.0,87.0,,,yahoo.com,,2.0,5.0,0.0,0.0,0.0,4.0,0.0,0.0,1.0,0.0,1.0,0.0,25.0,1.0,112.0,112.0,0.0,94.0,0.0,,,,,84.0,,,,,111.0,,,,M0,...,1.0,1.0,1.0,1.0,38.0,24.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,50.0,1758.0,925.0,0.0,354.0,0.0,135.0,0.0,0.0,0.0,50.0,1404.0,790.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,
4,2987004,0,86506,50.0,H,4497,514.0,150.0,mastercard,102.0,credit,420.0,87.0,,,gmail.com,,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,,,,,,,,,,,,,,,,,,,...,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [10]:
train_transaction.isFraud.value_counts()

0    569877
1     20663
Name: isFraud, dtype: int64

In [6]:
BUILD95 = True
BUILD96 = True

import numpy as np, pandas as pd, os, gc
from sklearn.model_selection import GroupKFold
from sklearn.metrics import roc_auc_score
import matplotlib.pyplot as plt
import seaborn as sns

# COLUMNS WITH STRINGS
str_type = ['ProductCD', 'card4', 'card6', 'P_emaildomain', 'R_emaildomain','M1', 'M2', 'M3', 'M4','M5',
            'M6', 'M7', 'M8', 'M9', 'id_12', 'id_15', 'id_16', 'id_23', 'id_27', 'id_28', 'id_29', 'id_30', 
            'id_31', 'id_33', 'id_34', 'id_35', 'id_36', 'id_37', 'id_38', 'DeviceType', 'DeviceInfo']
str_type += ['id-12', 'id-15', 'id-16', 'id-23', 'id-27', 'id-28', 'id-29', 'id-30', 
            'id-31', 'id-33', 'id-34', 'id-35', 'id-36', 'id-37', 'id-38']

# FIRST 53 COLUMNS
cols = ['TransactionID', 'TransactionDT', 'TransactionAmt',
       'ProductCD', 'card1', 'card2', 'card3', 'card4', 'card5', 'card6',
       'addr1', 'addr2', 'dist1', 'dist2', 'P_emaildomain', 'R_emaildomain',
       'C1', 'C2', 'C3', 'C4', 'C5', 'C6', 'C7', 'C8', 'C9', 'C10', 'C11',
       'C12', 'C13', 'C14', 'D1', 'D2', 'D3', 'D4', 'D5', 'D6', 'D7', 'D8',
       'D9', 'D10', 'D11', 'D12', 'D13', 'D14', 'D15', 'M1', 'M2', 'M3', 'M4',
       'M5', 'M6', 'M7', 'M8', 'M9']

# V COLUMNS TO LOAD DECIDED BY CORRELATION EDA
# https://www.kaggle.com/cdeotte/eda-for-columns-v-and-id
v =  [1, 3, 4, 6, 8, 11]
v += [13, 14, 17, 20, 23, 26, 27, 30]
v += [36, 37, 40, 41, 44, 47, 48]
v += [54, 56, 59, 62, 65, 67, 68, 70]
v += [76, 78, 80, 82, 86, 88, 89, 91]

#v += [96, 98, 99, 104] #relates to groups, no NAN 
v += [107, 108, 111, 115, 117, 120, 121, 123] # maybe group, no NAN
v += [124, 127, 129, 130, 136] # relates to groups, no NAN

# LOTS OF NAN BELOW
v += [138, 139, 142, 147, 156, 162] #b1
v += [165, 160, 166] #b1
v += [178, 176, 173, 182] #b2
v += [187, 203, 205, 207, 215] #b2
v += [169, 171, 175, 180, 185, 188, 198, 210, 209] #b2
v += [218, 223, 224, 226, 228, 229, 235] #b3
v += [240, 258, 257, 253, 252, 260, 261] #b3
v += [264, 266, 267, 274, 277] #b3
v += [220, 221, 234, 238, 250, 271] #b3

v += [294, 284, 285, 286, 291, 297] # relates to grous, no NAN
v += [303, 305, 307, 309, 310, 320] # relates to groups, no NAN
v += [281, 283, 289, 296, 301, 314] # relates to groups, no NAN
#v += [332, 325, 335, 338] # b4 lots NAN

cols += ['V'+str(x) for x in v]
dtypes = {}
for c in cols+['id_0'+str(x) for x in range(1,10)]+['id_'+str(x) for x in range(10,34)]+\
    ['id-0'+str(x) for x in range(1,10)]+['id-'+str(x) for x in range(10,34)]:
        dtypes[c] = 'float32'
for c in str_type: dtypes[c] = 'category'

In [8]:
%%time
# LOAD TRAIN
X_train = pd.read_csv(os.path.join(args.DATA_DIR, 'train_transaction.csv'),index_col='TransactionID', dtype=dtypes, usecols=cols+['isFraud'])
train_id = pd.read_csv(os.path.join(args.DATA_DIR, 'train_identity.csv'),index_col='TransactionID', dtype=dtypes)
X_train = X_train.merge(train_id, how='left', left_index=True, right_index=True)
# LOAD TEST
X_test = pd.read_csv(os.path.join(args.DATA_DIR, 'train_transaction.csv'),index_col='TransactionID', dtype=dtypes, usecols=cols)
test_id = pd.read_csv(os.path.join(args.DATA_DIR, 'train_identity.csv'),index_col='TransactionID', dtype=dtypes)
fix = {o:n for o, n in zip(test_id.columns, train_id.columns)}
test_id.rename(columns=fix, inplace=True)
X_test = X_test.merge(test_id, how='left', left_index=True, right_index=True)
# TARGET
y_train = X_train['isFraud'].copy()
del train_id, test_id, X_train['isFraud']; x = gc.collect()
# PRINT STATUS
print('Train shape',X_train.shape,'test shape',X_test.shape)

Train shape (590540, 213) test shape (590540, 213)
CPU times: user 22.7 s, sys: 2.01 s, total: 24.7 s
Wall time: 25.5 s
