
# <p style="padding:15px;font-family:newtimeroman;color:#9900cc;text-align:center;font-size:110%;border-radius:40px 5px;"> (|| American Express Default Prediction - EDA ||)</p>
<img src="https://blog.bankbazaar.com/wp-content/uploads/2016/03/Surviving-a-Credit-Card-Default.png" style="border-radius:5px;width:100%;height:500px">

# <p style="background-color:#9900cc;padding:15px;font-family:newtimeroman;color:#ffff80;font-size:110%;border-radius:40px 5px;">1 | Competition overview</p>
Whether out at a restaurant or buying tickets to a concert, modern life counts on the convenience of a credit card to make daily purchases. It saves us from carrying large amounts of cash and also can advance a full purchase that can be paid over time. How do card issuers know we’ll pay back what we charge? That’s a complex problem with many existing solutions—and even more potential improvements, to be explored in this competition.

Credit default prediction is central to managing risk in a consumer lending business. Credit default prediction allows lenders to optimize lending decisions, which leads to a better customer experience and sound business economics. Current models exist to help manage risk. But it's possible to create better models that can outperform those currently in use.

American Express is a globally integrated payments company. The largest payment card issuer in the world, they provide customers with access to products, insights, and experiences that enrich lives and build business success.

In this competition, you’ll apply your machine learning skills to predict credit default. Specifically, you will leverage an industrial scale data set to build a machine learning model that challenges the current model in production. Training, validation, and testing datasets include time-series behavioral data and anonymized customer profile information. You're free to explore any technique to create the most powerful model, from creating features to using the data in a more organic way within a model.

If successful, you'll help create a better customer experience for cardholders by making it easier to be approved for a credit card. Top solutions could challenge the credit default prediction model used by the world's largest payment card issuer—earning you cash prizes, the opportunity to interview with American Express, and potentially a rewarding new career.

## Data overview
The dataset contains aggregated profile features for each customer at each statement date. Features are anonymized and normalized, and fall into the following general categories:

- D_* = Delinquency variables
- S_* = Spend variables
- P_* = Payment variables
- B_* = Balance variables
- R_* = Risk variables

with the following features being categorical:

['B_30', 'B_38', 'D_114', 'D_116', 'D_117', 'D_120', 'D_126', 'D_63', 'D_64', 'D_66', 'D_68']

Your task is to predict, for each customer_ID, the probability of a future payment default (target = 1).

Note that the negative class has been subsampled for this dataset at 5%, and thus receives a 20x weighting in the scoring metric.
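
To make the 20x weighting concrete, here is a minimal sketch that reweights a subsampled label vector to estimate the default rate before subsampling. The toy counts are invented for illustration; only the 5% subsampling rate and the weight of 20 come from the description above.

```python
import numpy as np

# Hypothetical toy sample: 250 defaults and 750 non-defaults after subsampling.
target = np.array([1] * 250 + [0] * 750)

# Each retained negative stands in for 20 original negatives (1 / 0.05).
weight = np.where(target == 0, 20, 1)

observed_rate = target.mean()
population_rate = (target * weight).sum() / weight.sum()

print(f"default rate in the subsampled data: {observed_rate:.3f}")
print(f"estimated rate before subsampling:   {population_rate:.4f}")
```

The gap between the two rates is why the scoring metric weights the negative class rather than treating every row equally.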

## Objective
The objective of this competition is to predict the probability that a customer will fail to pay back their credit card balance in the future, based on their monthly customer profile. The binary target is calculated by observing an 18-month performance window after the latest credit card statement; if the customer does not pay the amount due within 120 days of their latest statement date, it is considered a default event.

## Evaluation
The evaluation metric, M, for this competition is the mean of two measures of rank ordering: the Normalized Gini Coefficient, G, and the default rate captured at 4%, D.

M = 0.5(G + D)

The default rate captured at 4% is the percentage of the positive labels (defaults) captured within the highest-ranked 4% of the predictions, and represents a Sensitivity/Recall statistic.

For both of the sub-metrics G and D, the negative labels are given a weight of 20 to adjust for downsampling.

This metric has a maximum value of 1.0.
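
As a worked illustration of the second sub-metric, the sketch below computes the default rate captured at 4% on a toy ranking with NumPy. The function name `default_rate_captured` and the toy scores are assumptions made for this example; the logic follows the description above (weight 20 for negatives, cutoff at 4% of the total weight).

```python
import numpy as np

def default_rate_captured(target, prediction, top=0.04, neg_weight=20):
    """Share of all defaults found in the top `top` fraction of weighted rows."""
    target = np.asarray(target)
    # Sort rows from the highest to the lowest predicted score.
    order = np.argsort(-np.asarray(prediction), kind="stable")
    target = target[order]
    weight = np.where(target == 0, neg_weight, 1)
    cutoff = int(top * weight.sum())
    in_top = np.cumsum(weight) <= cutoff
    return target[in_top].sum() / target.sum()

# Hypothetical toy data: 10 defaults followed by 100 non-defaults.
target = np.array([1] * 10 + [0] * 100)

good = np.array([0.9] * 10 + [0.1] * 100)            # all defaults ranked first
half = np.array([0.9] * 5 + [0.1] * 5 + [0.5] * 100)  # only 5 defaults ranked first
bad = np.array([0.1] * 10 + [0.9] * 100)             # all defaults ranked last

for name, pred in [("good", good), ("half", half), ("bad", bad)]:
    print(name, default_rate_captured(target, pred))
```

Because the cutoff is taken on cumulative weight, a single highly ranked negative uses up as much of the 4% budget as 20 defaults.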

Python code for calculating this metric can be found in this Notebook.

# <p style="background-color:#9900cc;padding:15px;font-family:newtimeroman;color:#ffff80;font-size:110%;border-radius:40px 5px;">2 | Basics -- imports, settings, reading</p>

In [None]:
import pandas as pd
import numpy as np

# visualization tools
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# kaggle utils
import kaggle_utils_py as kaggle_utils

# garbage collector
import gc

# modeling
import optuna
from sklearn.model_selection import StratifiedKFold 
from sklearn.metrics import roc_auc_score, roc_curve, auc
from lightgbm import LGBMClassifier, early_stopping

In [None]:
# some basic settings for me
pd.set_option('display.max_columns', None)

# set the warning off
import warnings
warnings.filterwarnings("ignore")

In [None]:
%%time
train = pd.read_feather('../input/amex-default-prediction-feather/train.feather')
test = pd.read_feather('../input/amex-default-prediction-feather/test.feather')
train_labels = pd.read_csv("../input/amex-default-prediction/train_labels.csv")
sub = pd.read_csv('../input/amex-default-prediction/sample_submission.csv')

In [None]:
print("shape of the data --->", train.shape)
print("shape of the test data --->", test.shape)

In [None]:
train.head()

In [None]:
d_feats = [c for c in train.columns if c.startswith('D_')]
s_feats = [c for c in train.columns if c.startswith('S_')]
p_feats = [c for c in train.columns if c.startswith('P_')]
b_feats = [c for c in train.columns if c.startswith('B_')]
r_feats = [c for c in train.columns if c.startswith('R_')]
print(f'Number of Delinquency variables: {len(d_feats)}')
print(f'Number of Spend variables: {len(s_feats)}')
print(f'Number of Payment variables: {len(p_feats)}')
print(f'Number of Balance variables: {len(b_feats)}')
print(f'Number of Risk variables: {len(r_feats)}')


# <p style="background-color:#9900cc;padding:15px;font-family:newtimeroman;color:#ffff80;font-size:110%;border-radius:40px 5px;"> 3 | Understanding Customer data</p>

In [None]:
unique_customer_count = train["customer_ID"].nunique()
print("unique customers in training data -->", unique_customer_count)
unique_customer_count_test = test["customer_ID"].nunique()
print("unique customers in test data -->", unique_customer_count_test)

In [None]:
# number of statements (rows) per customer
train.groupby("customer_ID").size()

In [None]:
# checking one customer data
train[train["customer_ID"] == "0000099d6bd597052cdcda90ffabf56573fe9d7c79be5fbac11a8ed792feb62a"]

In [None]:
y = train.groupby("customer_ID")['customer_ID'].count().values
y_test = test.groupby("customer_ID")['customer_ID'].count().values

In [None]:
fig = go.Figure()
fig.add_trace(go.Histogram(
    y = y,
    ybins = dict(size = 0.5),
    marker_color= '#9900cc'))
fig.update_layout(
    template = "plotly_dark",
    title = "Customer profile count -- training data",
    yaxis_title = "Number of months",
    bargap = 0.2
)
fig.show()

fig = go.Figure()
fig.add_trace(go.Histogram(
    y = y_test,
    ybins = dict(size = 0.5),
    marker_color= '#9900cc'))
fig.update_layout(
    template = "plotly_dark",
    title = "Customer profile count -- test data",
    yaxis_title = "Number of months"
)
fig.show()

- The distribution of profile length (number of statements per customer) is similar in the train and test data.
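
The two histograms can also be compared numerically. The following is a small sketch, assuming only a `customer_ID` column; the helper name `profile_length_share` and the toy frames are made up for illustration and stand in for `train` and `test`.

```python
import pandas as pd

def profile_length_share(df: pd.DataFrame) -> pd.Series:
    """Fraction of customers having each number of statements."""
    lengths = df.groupby("customer_ID").size()
    return lengths.value_counts(normalize=True).sort_index()

# Hypothetical toy frames standing in for train and test.
toy_train = pd.DataFrame({"customer_ID": ["a"] * 13 + ["b"] * 13 + ["c"] * 5})
toy_test = pd.DataFrame({"customer_ID": ["x"] * 13 + ["y"] * 7})

# Side-by-side shares; lengths missing from one set are filled with 0.
comparison = pd.concat(
    {"train": profile_length_share(toy_train), "test": profile_length_share(toy_test)},
    axis=1,
).fillna(0.0)
print(comparison)
```

Working with shares rather than raw counts keeps the comparison meaningful even though the train and test sets differ in size.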


In [None]:
del y
del y_test
gc.collect()

In [None]:
# connection between the profile length and target output
count = train.groupby("customer_ID")['customer_ID'].count()
con_check_df = pd.DataFrame({"customer_ID":count.index, "count": count.values})
# merge the data with the label data frame
con_check_df = con_check_df.merge(train_labels, on='customer_ID', how='left')

In [None]:
con_check_df.head(3)

In [None]:

sns.countplot(data = con_check_df,y='count',hue='target', orient='h')


- Across profile lengths, roughly 30-50% of customers have target 1 (default).
- There is no strong correlation between profile length and target, but keeping this information as a feature may still help.
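
One way to check this reading of the count plot is to tabulate the default rate per profile length. The sketch below uses a made-up toy frame with the same columns as `con_check_df` (`customer_ID`, `count`, `target`); the values are illustrative only.

```python
import pandas as pd

# Hypothetical toy data: statement count per customer plus the label.
toy = pd.DataFrame({
    "customer_ID": list("abcdef"),
    "count": [13, 13, 13, 13, 5, 5],
    "target": [0, 0, 0, 1, 1, 0],
})

# Number of customers and share of defaults for each profile length.
rate_by_length = (
    toy.groupby("count")["target"]
       .agg(customers="size", default_rate="mean")
       .sort_index()
)
print(rate_by_length)
```

Reporting the customer count next to the rate helps avoid over-reading rates that come from very few customers.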

In [None]:
del con_check_df
gc.collect()


# <p style="background-color:#9900cc;padding:15px;font-family:newtimeroman;color:#ffff80;font-size:110%;border-radius:40px 5px;"> 4 | Common data Analysis</p>

In [None]:
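# keep only the latest statement per customer (assumes rows are ordered by date
# within each customer), then attach the labels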
train = train.groupby('customer_ID').tail(1).set_index('customer_ID')
data = train.merge(train_labels, on='customer_ID', how='left')

In [None]:
columns, categorical_col, numerical_col,missing_value_df = kaggle_utils.Common_data_analysis(data, missing_value_highlight_threshold=5.0, display_df = False,
                                                                                              only_show_missing=False)


In [None]:
# by the dataset definition, the discrete (categorical) columns are
descrete_cols=['B_30', 'B_38', 'D_63', 'D_64', 'D_66', 'D_68',
          'D_114', 'D_116', 'D_117', 'D_120', 'D_126', 'target']

# so the numerical columns we need to check are
numerical_col = [c for c in numerical_col if c not in descrete_cols]

target_col = 'target'

In [None]:
# null value analysis
print("shape of missing value df", missing_value_df.shape)
missing_value_df.head()

In [None]:
print("Features having missing values -->",missing_value_df[missing_value_df['% of Missing value(NA)'] > 0.00].shape[0])

In [None]:

missing_value_df = missing_value_df[missing_value_df["% of Missing value(NA)"] > 0.00]
missing_value_df = missing_value_df.sort_values(ascending=True, by = '% of Missing value(NA)')
fig = go.Figure()

#fig.add_trace()

# draw one horizontal line per feature that has missing values
for i in range(len(missing_value_df)):
    fig.add_shape(dict(type = 'line', 
                  x0= 0 , y0 = i,
                  x1 = missing_value_df["% of Missing value(NA)"].iloc[i], y1 = i,
                  line = {'color': '#9900cc', 'width' : 3}))
fig.add_trace(go.Scatter(x = missing_value_df["% of Missing value(NA)"], y = missing_value_df.index, 
                         mode='markers', 
                         marker_color='#ffff80', marker_size=8))
fig.update_layout(template='plotly_dark',
                  title = "Feature with missing values :(",
                  xaxis = dict(title = "Missing value percentage", zeroline=False),
                  yaxis_showgrid=False,
                  width = 1000,
                  height = 1500,)

In [None]:
del missing_value_df
gc.collect()


# <p style="background-color:#9900cc;padding:15px;font-family:newtimeroman;color:#ffff80;font-size:110%;border-radius:40px 5px;"> 5 | Distribution Analysis</p>

In [None]:
def plot_hist(data, columns, nrow, ncol, figsize, hue_value=None):
    """Plot a grid of KDE curves, one panel per column, optionally split by `hue_value`."""
    sns.set_style('dark')
    # squeeze=False keeps `ax` two-dimensional even when nrow or ncol is 1
    fig, ax = plt.subplots(nrow, ncol, figsize=figsize, squeeze=False)
    col_count = 0
    for r in range(nrow):
        for c in range(ncol):
            if col_count >= len(columns):
                # more panels than columns: mark the spare panel
                ax[r,c].text(0.5, 0.5, "no data")
            else:
                sns.kdeplot(data=data, x=columns[col_count], hue=hue_value, ax=ax[r, c], palette=['#9900cc','#99ff99'],
                                fill = True, hue_order=[1,0], legend = True)
                ax[r,c].set(xlabel = columns[col_count], ylabel=("Density" if c==0 else ''))
                col_count +=1
            

## Delinquency variable

In [None]:
# Find the distribution of Delinquency variables
d_feats = [c for c in d_feats if c not in descrete_cols]
plot_hist(data, d_feats, 15, 6, (50,100),hue_value=target_col)

## Spend variables

In [None]:
# Find the distribution of Spend variables
s_feats = [c for c in s_feats if c not in descrete_cols]
s_feats.remove('S_2')
plot_hist(data, s_feats, 7, 3, (50,50),hue_value=target_col)

## Payment and Balance variables

In [None]:
# Find the distribution of Payment and Balance variables
p_feats = [c for c in p_feats if c not in descrete_cols]
b_feats = [c for c in b_feats if c not in descrete_cols]
plot_hist(data, p_feats + b_feats, 11, 4, (50,50),hue_value=target_col)

## Risk variables


In [None]:
# Find the distribution of Risk variables
r_feats = [c for c in r_feats if c not in descrete_cols]
plot_hist(data, r_feats, 8, 4, (50,50),hue_value=target_col)

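- None of the features appears to follow a normal distribution.
- The features have widely differing distribution shapes.
- Models that assume normally distributed inputs are a poor fit here without transformation, so non-parametric models (for example tree-based ones) are a more natural choice.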
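
The visual impression from the density plots can be backed with numbers such as sample skewness and excess kurtosis, both of which are close to 0 for normally distributed data. This is a sketch on synthetic data; the column names and distributions are assumptions chosen only to show the contrast, not features from the competition.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical toy features: one roughly normal, one heavily right-skewed.
toy = pd.DataFrame({
    "normal_like": rng.normal(0.0, 1.0, 10_000),
    "right_skewed": rng.exponential(1.0, 10_000),
})

# Both statistics are near 0 for a normal distribution.
shape_stats = pd.DataFrame({
    "skew": toy.skew(),
    "excess_kurtosis": toy.kurt(),
})
print(shape_stats.round(2))
```

The same two calls could be applied to the numerical feature columns to rank them by how far they depart from normality.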

# <p style="background-color:#9900cc;padding:15px;font-family:newtimeroman;color:#ffff80;font-size:110%;border-radius:40px 5px;"> 6 | Correlation Analysis
</p>

In [None]:
# correlation with target (numeric columns only; object and category columns cannot be correlated)
corr = data.select_dtypes('number').corrwith(data[target_col])
val = [f'{v:.0%}' for v in corr.values]

fig = go.Figure()
fig.add_trace(go.Bar(y=corr.index, x= corr.values,
                     orientation='h',
                     marker_color = '#9900cc',
                     text = val,
                     textposition = 'outside',
                     textfont_color = '#ffff80'))
fig.update_layout(template = 'plotly_dark',
                  title = "Correlation with Target",
                  width = 800,
                  height = 3000)
fig.update_xaxes(range=[-2,2])

In [None]:
del val,corr
gc.collect()

Many features are correlated with the target. B_2 is the most negatively correlated feature, with a correlation of -0.56, while D_68 is the most positively correlated, with a correlation of 0.61.
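
To find the strongest relationships without scanning a long bar chart, the correlations can be ranked by absolute value. The sketch below does this on synthetic data; the feature names `pos_feat`, `neg_feat` and `noise` are hypothetical and only illustrate the ranking step.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000
target = rng.integers(0, 2, size=n)

# Hypothetical toy features with known relationships to the target.
toy = pd.DataFrame({
    "pos_feat": target + rng.normal(0, 0.5, n),   # rises with target
    "neg_feat": -target + rng.normal(0, 1.0, n),  # falls with target
    "noise": rng.normal(0, 1.0, n),               # unrelated to target
    "target": target,
})

corr = toy.drop(columns="target").corrwith(toy["target"])
# Order by strength of the relationship while keeping the sign visible.
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(ranked.round(2))
```

Ranking by absolute value puts strong negative and strong positive correlations side by side, which a signed sort would place at opposite ends.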


# <p style="background-color:#9900cc;padding:15px;font-family:newtimeroman;color:#ffff80;font-size:110%;border-radius:40px 5px;"> 7 | Target value distribution</p>

In [None]:
# plot the target
count = data[target_col].value_counts()
print(count)
print("fraction of class 0 (paid) --->", count[0]/data.shape[0])
print("fraction of class 1 (default) --->", count[1]/data.shape[0])

In [None]:
fig = go.Figure()
fig.add_trace(go.Bar(x= ['Paid', "Default"],y=[count[0], count[1]],
                     marker_color = ['#9900cc','#ffff80'],
                     text = [f'{count[0]/data.shape[0]:.0%}', f'{count[1]/data.shape[0]:.0%}']))
fig.update_layout(template = 'plotly_dark',
                  title = "target value distribution",
                  width = 500,
                  height = 500)

In [None]:
# delete some unwanted variables
del d_feats,s_feats,p_feats,b_feats,r_feats
gc.collect()

# <p style="background-color:#9900cc;padding:15px;font-family:newtimeroman;color:#ffff80;font-size:110%;border-radius:40px 5px;"> 8 | Prediction</p>

In [None]:
# drop the identifier and date columns, which are not used as model features
needed_col = [c for c in data.columns if c not in ['customer_ID','S_2']]
data = data[needed_col]
test.drop('S_2', inplace = True, axis = 1)
test.drop('customer_ID', inplace = True, axis = 1)

In [None]:
X=data.drop(['target'],axis=1)
y=data['target']

del data
gc.collect()

In [None]:
def amex_metric(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:

    def top_four_percent_captured(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:
        df = (pd.concat([y_true, y_pred], axis='columns')
              .sort_values('prediction', ascending=False))
        df['weight'] = df['target'].apply(lambda x: 20 if x==0 else 1)
        four_pct_cutoff = int(0.04 * df['weight'].sum())
        df['weight_cumsum'] = df['weight'].cumsum()
        df_cutoff = df.loc[df['weight_cumsum'] <= four_pct_cutoff]
        return (df_cutoff['target'] == 1).sum() / (df['target'] == 1).sum()
        
    def weighted_gini(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:
        df = (pd.concat([y_true, y_pred], axis='columns')
              .sort_values('prediction', ascending=False))
        df['weight'] = df['target'].apply(lambda x: 20 if x==0 else 1)
        df['random'] = (df['weight'] / df['weight'].sum()).cumsum()
        total_pos = (df['target'] * df['weight']).sum()
        df['cum_pos_found'] = (df['target'] * df['weight']).cumsum()
        df['lorentz'] = df['cum_pos_found'] / total_pos
        df['gini'] = (df['lorentz'] - df['random']) * df['weight']
        return df['gini'].sum()

    def normalized_weighted_gini(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:
        y_true_pred = y_true.rename(columns={'target': 'prediction'})
        return weighted_gini(y_true, y_pred) / weighted_gini(y_true, y_true_pred)

    g = normalized_weighted_gini(y_true, y_pred)
    d = top_four_percent_captured(y_true, y_pred)

    return 0.5 * (g + d)

In [None]:
# # parameter tuning
# from lightgbm import log_evaluation
# def objective(trial, X, y):
#     # every hyperparameter gets its own unique trial name
#     param = {
#         "n_estimators": trial.suggest_int("n_estimators", 5000, 20000, step=10),
#         "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.1),
#         "lambda_l1": trial.suggest_float("lambda_l1", 1.0, 50.0, log=True),  # L1 regularization
#         "lambda_l2": trial.suggest_float("lambda_l2", 1.0, 50.0, log=True),  # L2 regularization
#         "max_depth": trial.suggest_int("max_depth", 5, 20),  # max depth of each tree
#         "num_leaves": trial.suggest_int("num_leaves", 30, 2000, step=10),
#         # fraction of rows sampled for each tree (alias: subsample)
#         "bagging_fraction": trial.suggest_float("bagging_fraction", 0.1, 1.0, log=True),
#         "bagging_freq": trial.suggest_int("bagging_freq", 0, 10),
#         # fraction of features sampled for each tree (alias: colsample_bytree)
#         "feature_fraction": trial.suggest_float("feature_fraction", 0.2, 0.95, step=0.1),
#         "min_data_in_leaf": trial.suggest_int("min_data_in_leaf", 20, 200),
#         "boosting_type": "gbdt",
#         "objective": "binary",
#         "random_state": 23,
#         "max_bin": 500,
#     }

#     # cross - validation
#     cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=32)
#     cross_val_score = []
#     for fold_index, (train_id, val_id) in enumerate(cv.split(X,y)):
#         # get the train and val set for this cross validation
#         print("="*20, end=" ")
#         print("Fold ", fold_index, end = " ")
#         print("="*20, )
#         X_train, X_val = X.iloc[train_id], X.iloc[val_id]
#         y_train, y_val = y.iloc[train_id], y.iloc[val_id]

#         # define the model
#         model = LGBMClassifier(**param)
#         # fit the model; early stopping and logging are passed as callbacks
#         model.fit(X_train, y_train, eval_set=[(X_val, y_val)], eval_metric=["auc"],
#                   callbacks=[early_stopping(100), log_evaluation(200)])

#         # predict 
#         y_pred = model.predict_proba(X_val)[:,1]

#         y_pred=pd.DataFrame(data={'prediction':y_pred})
#         y_true=pd.DataFrame(data={'target':y_val.reset_index(drop=True)})
#         gini_score=amex_metric(y_true = y_true, y_pred = y_pred)

#         # the study maximizes the mean validation AUC
#         cross_val_score.append(roc_auc_score(y_val, y_pred))

#         del X_train, X_val,y_train, y_val
#         gc.collect()

#     return np.mean(np.array(cross_val_score))


# # start the study
# study = optuna.create_study(study_name="LGBM classifier", direction="maximize")
# fun = lambda trial: objective(trial, X, y)
# study.optimize(fun, n_trials=50)

In [None]:
# best param
# Trial 26 finished with value: 0.9596085950724972 and parameters: {'n_estimators': 13470, 'learning_rate': 0.03334140860049416, 'reg_lambda': 3.7629073371138517, 'lambda_l2': 24.60526923347014, 'max_depth': 16, 'subsample': 0.3297124328760659, 'bagging_freq': 3, 'feature_fraction': 0.2}. Best is trial 17 with value: 0.9601344000765331.
best_param= {"n_estimators":1500,
            "learning_rate":0.04,
            #"lambda_l2":24.60526923347014,
            "max_depth":16,
            "subsample":0.32,
             "bagging_freq": 3,
             #"feature_fraction":0.2,
             "random_state": 37,
             "boosting_type":'gbdt',
             "min_child_samples": 2000,
             'objective': 'binary'
            }

In [None]:
# prediction
from lightgbm import log_evaluation  # logging callback; replaces the `verbose` fit argument

gbm_test_preds, gini=[],[]
ft_importance=pd.DataFrame(index=X.columns)
# cross - validation
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=32)
cross_val_score = []
for fold_index, (train_id, val_id) in enumerate(cv.split(X,y)):
    # get the train and val set for this cross validation
    print("="*20, end=" ")
    print("Fold ", fold_index, end = " ")
    print("="*20, )
    X_train, X_val = X.iloc[train_id], X.iloc[val_id]
    # positional indexing, so this works whatever the index of y is
    y_train, y_val = y.iloc[train_id], y.iloc[val_id]

    # define the model
    model = LGBMClassifier(**best_param)
    # fit the model; early stopping and logging go through callbacks,
    # since newer LightGBM releases no longer accept them as fit() keyword arguments
    model.fit(X_train, y_train, eval_set=[(X_val, y_val)], eval_metric=["auc"],
              callbacks=[early_stopping(100), log_evaluation(200)])

    # predict 
    y_pred = model.predict_proba(X_val)[:,1]

    y_pred=pd.DataFrame(data={'prediction':y_pred})
    y_true=pd.DataFrame(data={'target':y_val.reset_index(drop=True)})
    amex_score=amex_metric(y_true = y_true, y_pred = y_pred)
    gini.append(amex_score)

    cross_val_score.append(roc_auc_score(y_val, y_pred))
    print("Amex metric {} --- AUC {}".format(amex_score, cross_val_score[-1]))
    
    # gbm_test_preds.append(model.predict_proba(test)[:,1])
    

    del X_train, X_val,y_train, y_val
    gc.collect()
del X, y
gc.collect()