# 1) Using the DataProcessor class to preprocess the data

Preprocessing takes just two method calls: .remove_columns() and .preprocess_data().

However, the class also offers more fine-grained functionality (see code).

In [1]:
import os


In [2]:
# Build the path to the sample file (adjust data_dir to your local copy)
data_dir = r"/Users/xingyuehuang/Downloads/Tradeteq-adversarial-attacks-on-credit-score-main 2"
co_file = os.path.join(data_dir, "client_start_folder", "Co_600K_Jul2019_6M.pkl")


Create a DataProcessor object, which loads the data from the given file.

In [3]:
from processing.DataProcessor import DataProcessor
data_proc = DataProcessor(co_file)

Define columns to be removed

In [4]:
zero_info_features = ["CompanyId", "CompanyNumber", "CompanyName", "imd"]
only_one_value_features = ["Filled", "LimitedPartnershipsNumGenPartners", "LimitedPartnershipsNumLimPartners",
                           "Status20190701", "CompanyStatus"]
complicated_features = ["RegAddressAddressLine1", "RegAddressAddressLine2", "RegAddressCareOf", "RegAddressCounty",
                        "RegAddressPOBox", "RegAddressPostCode", "RegAddressPostTown", "oa11", "PreviousName_1CompanyName"]
to_remove_cols = zero_info_features + only_one_value_features + complicated_features

In [5]:
data_proc.remove_columns(to_remove_cols)
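
The only_one_value_features list above names columns that carry a single value. The notebook does not say how that list was built. As a rough illustration, here is one way such columns could be found with pandas. The toy frame and the helper name constant_columns are hypothetical and are not part of DataProcessor.

```python
import pandas as pd


def constant_columns(df: pd.DataFrame) -> list:
    """Return the names of columns holding at most one distinct value.

    NaN is counted as a value (dropna=False), so a column that mixes
    one real value with missing entries is NOT reported as constant.
    """
    counts = df.nunique(dropna=False)
    return counts[counts <= 1].index.tolist()


# Toy data: column names borrowed from the notebook, values invented.
toy = pd.DataFrame({
    "Filled": [1, 1, 1],
    "CompanyStatus": ["Active", "Active", "Active"],
    "oac1": [3, 5, 3],
})

constant = constant_columns(toy)
reduced = toy.drop(columns=constant)  # plain-pandas analogue of dropping columns
```

In this toy frame, Filled and CompanyStatus are reported as constant and only oac1 remains.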

Define columns to be converted to numerical/string type

In [6]:
to_num_cols = ["AccountsAccountRefDay", "AccountsAccountRefMonth", "oac1"]
to_str_cols = ["ru11ind"]
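
What preprocess_data does with these two lists is not shown in the notebook. A minimal sketch of the kind of conversion the names suggest, written with plain pandas on invented values (the coercion of unparseable entries to NaN is an assumption, not confirmed behaviour of DataProcessor):

```python
import pandas as pd

# Toy data: column names from the notebook, values invented for illustration.
toy = pd.DataFrame({
    "AccountsAccountRefDay": ["31", "30", "n/a"],
    "ru11ind": [1, 2, 1],
})

to_num_cols = ["AccountsAccountRefDay"]
to_str_cols = ["ru11ind"]

# Assumption: entries that cannot be parsed as numbers become NaN.
for col in to_num_cols:
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

# Casting a coded column to strings lets later steps treat it as categorical.
for col in to_str_cols:
    toy[col] = toy[col].astype(str)
```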

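Define the date processing as a list of triples: [["NewDuration Name", "Post Name", "Prev Name"], ...]. Each triple names a new duration column, followed by the later ("post") and the earlier ("prev") date columns it is derived from.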

In [7]:
date_convert = [["dAccountsTimeGap", "dAccountsNextDueDate", "dAccountsLastMadeUpDate"],
                ["dConfStmtTimeGap", "dConfStmtNextDueDate", "dConfStmtLastMadeUpDate"],
                ["dReturnsTimeGap", "dReturnsNextDueDate", "dReturnsLastMadeUpDate"]]

In [8]:
X_train, X_test, y_train, y_test = data_proc.preprocess_data(to_num_cols, to_str_cols, date_convert)
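
The date_convert triples passed to preprocess_data above each pair a new duration column with two date columns. The exact computation inside DataProcessor is not shown here. The sketch below assumes the duration is the gap in whole days between the later and the earlier date, and uses invented dates:

```python
import pandas as pd

# Toy data: column names from the notebook, dates invented for illustration.
toy = pd.DataFrame({
    "dAccountsNextDueDate": ["2020-03-31", "2019-12-31"],
    "dAccountsLastMadeUpDate": ["2019-06-30", "2018-12-31"],
})

date_convert = [["dAccountsTimeGap", "dAccountsNextDueDate", "dAccountsLastMadeUpDate"]]

for new_col, post_col, prev_col in date_convert:
    post = pd.to_datetime(toy[post_col])
    prev = pd.to_datetime(toy[prev_col])
    # Assumption: the duration is measured in whole days.
    toy[new_col] = (post - prev).dt.days
```

A numeric gap like this can be fed to a tree model directly, whereas raw date strings cannot.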

# 2) Evaluating models with the ModelEvaluator class

The ModelEvaluator class is abstract, which keeps it general enough to work with different types of models. If your model's syntax differs, you may have to build your own child class (see the SKEvaluator example).
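
To illustrate the pattern, here is a minimal sketch of an abstract evaluator with one child class for models that follow the fit/predict convention. All names in it (BaseEvaluator, SklearnStyleEvaluator, MajorityModel) are hypothetical stand-ins. The real ModelEvaluator and SKEvaluator interfaces may differ.

```python
from abc import ABC, abstractmethod


class BaseEvaluator(ABC):
    """Hypothetical stand-in for an abstract evaluator base class."""

    def __init__(self, model, name):
        self.model = model
        self.name = name

    @abstractmethod
    def fit(self, X, y):
        """Train the wrapped model; the syntax depends on the model library."""

    @abstractmethod
    def predict(self, X):
        """Return one predicted label per row of X."""

    def accuracy(self, X, y):
        # Shared logic lives in the base class and relies only on predict().
        preds = self.predict(X)
        return sum(p == t for p, t in zip(preds, y)) / len(y)


class MajorityModel:
    """Toy model with a fit/predict interface: always predicts the most common label."""

    def fit(self, X, y):
        self.label_ = max(set(y), key=list(y).count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class SklearnStyleEvaluator(BaseEvaluator):
    """Child class for any model exposing fit(X, y) and predict(X)."""

    def fit(self, X, y):
        self.model.fit(X, y)
        return self

    def predict(self, X):
        return self.model.predict(X)


evaluator = SklearnStyleEvaluator(MajorityModel(), "majority")
evaluator.fit([[0], [1], [2]], [0, 0, 1])
acc = evaluator.accuracy([[3], [4]], [0, 1])
```

A model with a different training syntax would only need its own child class overriding fit and predict. The shared evaluation code stays untouched.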

In [9]:
import xgboost as xgb
from ModelEvaluationTools.SKEvaluator import SKEvaluator

In [10]:
xg = xgb.XGBClassifier(learning_rate=0.3, max_depth=10, subsample=0.5, objective='binary:logistic', verbosity=3)

Create an instance of SKEvaluator, a child of the ModelEvaluator class

In [11]:
xg_model = SKEvaluator(xg, 'xg')

In [12]:
xg_model.fit(X_train, y_train)



[17:26:42] DEBUG: /Users/travis/build/dmlc/xgboost/src/gbm/gbtree.cc:154: Using tree method: 2
[17:26:43] INFO: /Users/travis/build/dmlc/xgboost/src/tree/updater_prune.cc:101: tree pruning end, 62 extra nodes, 0 pruned nodes, max_depth=10
[17:26:44] INFO: /Users/travis/build/dmlc/xgboost/src/tree/updater_prune.cc:101: tree pruning end, 62 extra nodes, 0 pruned nodes, max_depth=10
[17:26:45] INFO: /Users/travis/build/dmlc/xgboost/src/tree/updater_prune.cc:101: tree pruning end, 124 extra nodes, 0 pruned nodes, max_depth=10
[17:26:45] INFO: /Users/travis/build/dmlc/xgboost/src/tree/updater_prune.cc:101: tree pruning end, 120 extra nodes, 0 pruned nodes, max_depth=10
[17:26:46] INFO: /Users/travis/build/dmlc/xgboost/src/tree/updater_prune.cc:101: tree pruning end, 188 extra nodes, 0 pruned nodes, max_depth=10
[17:26:47] INFO: /Users/travis/build/dmlc/xgboost/src/tree/updater_prune.cc:101: tree pruning end, 158 extra nodes, 0 pruned nodes, max_depth=10
[17:26:48] INFO: /Users/travis/build/

[17:27:47] INFO: /Users/travis/build/dmlc/xgboost/src/tree/updater_prune.cc:101: tree pruning end, 92 extra nodes, 0 pruned nodes, max_depth=10
[17:27:48] INFO: /Users/travis/build/dmlc/xgboost/src/tree/updater_prune.cc:101: tree pruning end, 110 extra nodes, 0 pruned nodes, max_depth=10
[17:27:49] INFO: /Users/travis/build/dmlc/xgboost/src/tree/updater_prune.cc:101: tree pruning end, 116 extra nodes, 0 pruned nodes, max_depth=10
[17:27:50] INFO: /Users/travis/build/dmlc/xgboost/src/tree/updater_prune.cc:101: tree pruning end, 102 extra nodes, 0 pruned nodes, max_depth=10
[17:27:52] INFO: /Users/travis/build/dmlc/xgboost/src/tree/updater_prune.cc:101: tree pruning end, 242 extra nodes, 0 pruned nodes, max_depth=10
[17:27:53] INFO: /Users/travis/build/dmlc/xgboost/src/tree/updater_prune.cc:101: tree pruning end, 214 extra nodes, 0 pruned nodes, max_depth=10
[17:27:54] INFO: /Users/travis/build/dmlc/xgboost/src/tree/updater_prune.cc:101: tree pruning end, 212 extra nodes, 0 pruned nodes,

In [13]:
test_auc = xg_model.evaluate(X_test, y_test)

Accuracy - Test: 0.9982166666666666
AUC - Test: 0.9197329276626238
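
The two numbers above measure different things: accuracy counts correct hard labels, while AUC measures how well the scores rank positives above negatives. A large gap between them often appears when one class is rare, although the notebook does not state the class balance of this data. A minimal sketch on synthetic labels (all values invented for illustration), using scikit-learn's accuracy_score and roc_auc_score:

```python
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic, heavily imbalanced labels: 990 negatives, 10 positives.
y_true = [0] * 990 + [1] * 10

# A useless model: the same score for every row, so every row is labelled negative.
scores = [0.1] * 1000
labels = [int(s >= 0.5) for s in scores]

acc = accuracy_score(y_true, labels)  # fraction of correct hard labels
auc = roc_auc_score(y_true, scores)   # ranking quality of the scores
```

Here the accuracy is 0.99 while the AUC is 0.5, which is no better than chance at ranking, so AUC is the more informative of the two numbers on imbalanced data.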


In [14]:
# Save the model to the fitted_models folder
xg_model.save_model()
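
How save_model stores the model is not shown in the notebook. As an assumption-laden sketch, persisting a fitted object to a fitted_models folder could look like the following with the standard-library pickle module. The helpers save_fitted and load_fitted are hypothetical, and a plain dict stands in for a fitted model.

```python
import pickle
import tempfile
from pathlib import Path


def save_fitted(model, name, folder):
    """Pickle `model` to <folder>/<name>.pkl, creating the folder if needed."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{name}.pkl"
    with path.open("wb") as fh:
        pickle.dump(model, fh)
    return path


def load_fitted(path):
    """Load a previously pickled object (only unpickle files you trust)."""
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


# A plain dict stands in for a fitted model in this illustration.
toy_model = {"name": "xg", "weights": [0.1, 0.2]}

with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_fitted(toy_model, "xg", Path(tmp) / "fitted_models")
    restored = load_fitted(saved_path)
```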