## Data Dictionary

The variables in the dataset fall into three categories:

### Demographic information about customers

<b>customer_id</b> - Customer id

<b>vintage</b> - Number of days the customer has been with the bank

<b>age</b> - Age of customer

<b>gender</b> - Gender of customer

<b>dependents</b> - Number of dependents

<b>occupation</b> - Occupation of the customer 

<b>city</b> - City of customer (anonymised)


### Customer Bank Relationship


<b>customer_nw_category</b> - Net worth of customer (3: Low, 2: Medium, 1: High)

<b>branch_code</b> - Branch Code for customer account

<b>days_since_last_transaction</b> - Number of days since the last credit in the last year


### Transactional Information

<b>current_balance</b> - Balance as of today

<b>previous_month_end_balance</b> - End of Month Balance of previous month


<b>average_monthly_balance_prevQ</b> - Average monthly balances (AMB) in Previous Quarter

<b>average_monthly_balance_prevQ2</b> - Average monthly balances (AMB) in the quarter before the previous quarter

<b>current_month_credit</b> - Total Credit Amount current month

<b>previous_month_credit</b> - Total Credit Amount previous month

<b>current_month_debit</b> - Total Debit Amount current month

<b>previous_month_debit</b> - Total Debit Amount previous month

<b>current_month_balance</b> - Average Balance of current month

<b>previous_month_balance</b> - Average Balance of previous month

<b>churn</b> - Whether the customer's average balance falls below the minimum balance in the next quarter (1/0)

In [None]:
import pandas as pd
train=pd.read_csv('../input/churn-prediction/churn_prediction.csv')
train.head()

In [None]:
!pip install flaml
from sklearn.metrics import mean_absolute_percentage_error

In [None]:
train.nunique()

In [None]:
train.info()

In [None]:
train=train.drop(['customer_id'],axis=1)
train.head()

In [None]:
train.city.value_counts()

In [None]:
train.branch_code.value_counts()

In [None]:
train=train.drop(['branch_code','city'],axis=1) 
train.head()

In [None]:
train.info()

In [None]:
train.occupation.value_counts()

In [None]:
train.customer_nw_category=train.customer_nw_category.replace({3:'Low', 2:'Medium', 1:'High'})
train.customer_nw_category.value_counts()

In [None]:
train.customer_nw_category=train.customer_nw_category.replace({'Low':0,'Medium':1,'High':2})
train.customer_nw_category.value_counts()
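
The two recoding cells above first turn the raw codes into labels and then turn the labels into ordinal ranks. As a minimal sketch on a toy series (not the real data), the same result can be reached with one mapping from the raw codes. The sketch uses `Series.map` rather than `replace`, which is a swapped-in choice for illustration.

```python
import pandas as pd

# Toy stand-in for the customer_nw_category column (3: Low, 2: Medium, 1: High).
nw = pd.Series([3, 2, 1, 2, 3], name="customer_nw_category")

# Two-step route, mirroring the cells above: codes -> labels -> ordinal ranks.
labels = nw.map({3: "Low", 2: "Medium", 1: "High"})
two_step = labels.map({"Low": 0, "Medium": 1, "High": 2})

# Equivalent single mapping from the raw codes to ordinal ranks.
one_step = nw.map({3: 0, 2: 1, 1: 2})

print(one_step.tolist())  # [0, 1, 2, 1, 0]
```

Keeping the intermediate labels is still useful when the value counts need to be read by a person.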

In [None]:
round(train.isnull().sum()*100/len(train),2)
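
The cell above reports the percentage of missing values per column. A small sketch on a made-up frame shows what it computes, together with an equivalent form based on `isna().mean()`. The column values here are invented for illustration.

```python
import numpy as np
import pandas as pd

# Invented toy frame with a few missing entries.
toy = pd.DataFrame({
    "gender": ["Male", None, "Female", "Male"],
    "dependents": [0.0, np.nan, np.nan, 2.0],
    "age": [30, 41, 52, 63],
})

# Same expression as in the cell above: missing count as a percentage of rows.
pct_sum = round(toy.isnull().sum() * 100 / len(toy), 2)

# Equivalent form: the mean of a boolean mask is the fraction of True values.
pct_mean = (toy.isna().mean() * 100).round(2)

print(pct_mean.to_dict())
```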

In [None]:
from sklearn.model_selection import train_test_split
train, test = train_test_split( train, test_size=0.2, random_state=42,shuffle=True, stratify=train.churn)
train.shape, test.shape
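
Passing `stratify` keeps the share of churners roughly the same in both parts of the split. The sketch below illustrates this on an invented frame with a 20% positive rate. The frame and its column `x` are assumptions for the example only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented toy data: 20 churners out of 100 customers.
toy = pd.DataFrame({"x": range(100), "churn": [1] * 20 + [0] * 80})

# Same arguments as the split above, applied to the toy frame.
tr, te = train_test_split(
    toy, test_size=0.2, random_state=42, shuffle=True, stratify=toy.churn
)

# Both parts keep the 20% churn rate of the full frame.
print(tr.churn.mean(), te.churn.mean())
```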

In [None]:
train.isnull().sum().sort_values(ascending=False)

In [None]:
test.isnull().sum().sort_values(ascending=False)

In [None]:
train[train.isna().any(axis=1)]

In [None]:
test[test.isna().any(axis=1)]

In [None]:
train=pd.get_dummies(train,prefix_sep='__')
train.head()

In [None]:
test=pd.get_dummies(test,prefix_sep='__')
test.head()
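
Encoding the two frames separately produces the same columns only if every category appears in both. Whether that holds for this dataset is not checked above, so the following is a hedged sketch of one way to align the test columns to the training columns with `reindex`. The frames and category names are invented.

```python
import pandas as pd

# Invented frames whose categories do not fully overlap.
tr = pd.DataFrame({"occupation": ["salaried", "student", "retired"], "age": [30, 20, 70]})
te = pd.DataFrame({"occupation": ["salaried", "company"], "age": [40, 50]})

tr_d = pd.get_dummies(tr, prefix_sep="__")
te_d = pd.get_dummies(te, prefix_sep="__")

# Force the test frame onto the training columns: categories unseen in
# training are dropped, categories missing from test are added as zeros.
te_aligned = te_d.reindex(columns=tr_d.columns, fill_value=0)

print(list(te_aligned.columns))
```

A quick `set(train.columns) ^ set(test.columns)` after encoding shows whether any alignment is needed at all.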

In [None]:
# !rm -r kuma_utils
!git clone https://github.com/analokmaus/kuma_utils.git

In [None]:
import sys
sys.path.append("kuma_utils/")
from kuma_utils.preprocessing.imputer import LGBMImputer

In [None]:
col=train.columns.tolist()
col.remove('churn')
col[:5]

In [None]:
%%time
lgbm_imtr = LGBMImputer(n_iter=500)

train_iterimp = lgbm_imtr.fit_transform(train[col])
test_iterimp = lgbm_imtr.transform(test[col])

# Create train test imputed dataframe
train_ = pd.DataFrame(train_iterimp, columns=col)
test_ = pd.DataFrame(test_iterimp, columns=col)

In [None]:
# Assign by position so the labels line up even if the imputed frame has a fresh index
train_['churn'] = train['churn'].to_numpy()
train_.head()

In [None]:
# Assign by position so the labels line up even if the imputed frame has a fresh index
test_['churn'] = test['churn'].to_numpy()
test_.head()
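
Assigning a Series to a DataFrame column aligns on index labels, not on row position. After a shuffled split the source index is no longer `0..n-1`, so a frame rebuilt from a bare array can end up with misaligned or missing labels. Whether the imputer output keeps the original index is not confirmed in this notebook, so the sketch below only illustrates the general pandas behaviour on invented data.

```python
import pandas as pd

# Shuffled index, like the one left behind by train_test_split.
src = pd.DataFrame({"x": [1.0, 2.0, 3.0], "churn": [1, 0, 1]}, index=[7, 0, 2])

# A frame rebuilt from a bare array gets a fresh 0..n-1 index.
rebuilt = pd.DataFrame(src[["x"]].to_numpy(), columns=["x"])

by_label = rebuilt.copy()
by_label["churn"] = src["churn"]             # aligned on index labels

by_pos = rebuilt.copy()
by_pos["churn"] = src["churn"].to_numpy()    # assigned by row position

print(by_label["churn"].tolist())  # label 1 has no match, so it becomes NaN
print(by_pos["churn"].tolist())    # original row order is preserved
```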

In [None]:
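def undummify(df, prefix_sep="__"):
    # Map each base column name to whether it was one-hot encoded.
    cols2collapse = {
        item.split(prefix_sep)[0]: (prefix_sep in item) for item in df.columns
    }
    series_list = []
    for col, needs_to_collapse in cols2collapse.items():
        if needs_to_collapse:
            # Select only this variable's dummy columns by exact prefix;
            # filter(like=col) would also match any column containing col as a substring.
            dummy_cols = [c for c in df.columns if c.startswith(col + prefix_sep)]
            undummified = (
                df[dummy_cols]
                .idxmax(axis=1)  # dummy column with the largest value in each row
                .apply(lambda x: x.split(prefix_sep, maxsplit=1)[1])
                .rename(col)
            )
            series_list.append(undummified)
        else:
            # Columns that were never encoded pass through unchanged.
            series_list.append(df[col])
    undummified_df = pd.concat(series_list, axis=1)
    return undummified_df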

In [None]:
train=undummify(train_)
train.head()

In [None]:
test=undummify(test_)
test.head()
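
To make the round trip concrete, here is a minimal sketch that one-hot encodes a toy frame and collapses it back. `undummify_strict` is a hypothetical, compact helper written for this example, and the toy values are invented.

```python
import pandas as pd

def undummify_strict(df, prefix_sep="__"):
    """Hypothetical helper: collapse one-hot columns back to one column per variable."""
    out = {}
    for name in df.columns:
        if prefix_sep not in name:
            out[name] = df[name]  # plain column, keep as-is
            continue
        prefix = name.split(prefix_sep)[0]
        if prefix in out:
            continue  # this variable was already collapsed
        group = [c for c in df.columns if c.startswith(prefix + prefix_sep)]
        # Pick the dummy column with the largest value, then strip the prefix.
        winner = df[group].astype(int).idxmax(axis=1)
        out[prefix] = winner.str.split(prefix_sep, n=1).str[1]
    return pd.DataFrame(out)

toy = pd.DataFrame({"age": [30, 45, 60], "gender": ["Male", "Female", "Male"]})
dummies = pd.get_dummies(toy, prefix_sep="__")
restored = undummify_strict(dummies)

print(restored["gender"].tolist())  # ['Male', 'Female', 'Male']
```

One caveat: `idxmax` returns the first column when all dummies in a row are equal, so a row whose category was missing before encoding is silently given the first category.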

In [None]:
y = train.pop('churn')
X = train

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=42,shuffle=True, stratify=y)

In [None]:
from flaml import AutoML
automl = AutoML()

In [None]:
automl.fit(X_train, y_train, task="classification",metric='roc_auc',time_budget=900)

In [None]:
print('Best ML learner:', automl.best_estimator)
print('Best hyperparameter config:', automl.best_config)
print('Best roc_auc on validation data: {0:.4g}'.format(1-automl.best_loss))
print('Training duration of best run: {0:.4g} s'.format(automl.best_config_train_time))
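
The summary above prints `1 - automl.best_loss`, which treats the reported loss as one minus the ROC AUC. As a small sketch of that relationship, computed with scikit-learn on invented scores rather than on the model's output:

```python
from sklearn.metrics import roc_auc_score

# Invented labels and predicted scores for four customers.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_true, y_score)  # 3 of the 4 positive/negative pairs are ranked correctly
loss = 1 - auc                        # the quantity a minimiser would work with

print(auc, loss)  # 0.75 0.25
```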

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_train, automl.predict(X_train)))

In [None]:
print(classification_report(y_test, automl.predict(X_test)))

In [None]:
test_=test.drop('churn',axis=1)
test_.head()

In [None]:
y_pred = automl.predict(test_)
y_pred[:5]

In [None]:
df = pd.DataFrame(y_pred,columns=['churn'])
df.head()

In [None]:
print(classification_report(test.churn, df.churn))