# 1) Using the DataProcessor class to preprocess the data

Preprocessing takes two method calls: .remove_columns() and .preprocess_data().

The class also exposes finer-grained functionality (see the code).

In [1]:
import os

In [2]:
# Build the path to the sample file
data_dir = r"C:\Users\oscar\Documents\Group Design Practical\Tradeteq-adversarial-attacks-on-credit-score"
co_file = os.path.join(data_dir, "client_start_folder","Co_600K_Jul2019_6M.pkl")

Create a DataProcessor object, which loads the data from co_file.

In [3]:
from processing.DataProcessor import DataProcessor
data_proc = DataProcessor(co_file)

Define columns to be removed

In [4]:
zero_info_features = ["CompanyId", "CompanyNumber"]
only_one_value_features = ["Filled", "LimitedPartnershipsNumGenPartners", "LimitedPartnershipsNumLimPartners",
                           "Status20190701"]
complicated_features = ["RegAddressAddressLine1", "RegAddressAddressLine2", "RegAddressCareOf", "RegAddressCounty",
                        "RegAddressPOBox", "RegAddressPostCode", "RegAddressPostTown", "pcd", "oa11", "PreviousName_1CompanyName"]
to_remove_cols = zero_info_features+only_one_value_features+complicated_features

In [5]:
data_proc.remove_columns(to_remove_cols)

Define columns to be converted to numerical/string type

In [6]:
to_num_cols = ["AccountsAccountRefDay", "AccountsAccountRefMonth", "oac1"]
to_str_cols = ["ru11ind"]

In [7]:
X_train, X_test, y_train, y_test = data_proc.preprocess_data(to_num_cols, to_str_cols)
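
To make the column handling above concrete, here is a minimal sketch of what dropping columns and converting types can look like in plain pandas. It is an assumption about the general idea, not the DataProcessor implementation; the toy frame and its values are invented, and only the column names are borrowed from the cells above.

```python
import pandas as pd

# Toy frame; the values are invented for illustration only.
df = pd.DataFrame({
    "CompanyId": [1, 2, 3],
    "AccountsAccountRefDay": ["31", "30", "n/a"],
    "ru11ind": [1, 2, 1],
})

# Drop identifier-like columns that carry no predictive information.
df = df.drop(columns=["CompanyId"])

# Convert columns that should be numeric; unparseable entries become NaN.
for col in ["AccountsAccountRefDay"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

# Convert categorical codes to strings so they are not treated as ordinal numbers.
for col in ["ru11ind"]:
    df[col] = df[col].astype(str)
```

Whether DataProcessor coerces bad values to NaN or raises on them is not shown in this notebook; errors="coerce" is just one possible choice.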

# 2) Evaluating models with the ModelEvaluator class

The ModelEvaluator class is abstract, so it is general enough to work with different types of models. If your model's syntax differs, you might have to write your own child class (see the SKEvaluator example).
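
The abstract-parent, concrete-child pattern can be sketched with Python's abc module. All names below (BaseEvaluator, MajorityModel, MajorityEvaluator) are hypothetical and chosen for illustration; the real ModelEvaluator may define different methods.

```python
from abc import ABC, abstractmethod


class BaseEvaluator(ABC):
    """Hypothetical stand-in for an abstract evaluator; not the repository's class."""

    def __init__(self, model, name):
        self.model = model
        self.name = name

    @abstractmethod
    def fit(self, X, y):
        """Train the wrapped model; the syntax depends on the library."""

    @abstractmethod
    def predict(self, X):
        """Return one predicted label per row of X."""

    def evaluate(self, X, y):
        # Shared logic lives in the base class and relies only on predict().
        preds = self.predict(X)
        return sum(p == t for p, t in zip(preds, y)) / len(y)


class MajorityModel:
    """Toy model that always predicts the most common training label."""

    def train(self, y):
        self.label = max(set(y), key=list(y).count)

    def guess(self, n):
        return [self.label] * n


class MajorityEvaluator(BaseEvaluator):
    # The child class adapts the toy model's method names to the common interface.
    def fit(self, X, y):
        self.model.train(y)

    def predict(self, X):
        return self.model.guess(len(X))


ev = MajorityEvaluator(MajorityModel(), "majority")
ev.fit([[0], [1], [2]], [0, 0, 1])
accuracy = ev.evaluate([[3], [4]], [0, 1])
```

Because evaluate() is written once against the abstract interface, a model with unusual method names only needs a thin child class rather than a new evaluation routine.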

In [33]:
import xgboost as xgb
from ModelEvaluationTools.SKEvaluator import SKEvaluator

In [34]:
xg = xgb.XGBClassifier(learning_rate=0.3, max_depth=10, subsample=0.5, objective='binary:logistic', verbosity=3)

Create an instance of SKEvaluator, a child of the ModelEvaluator class

In [35]:
xg_model = SKEvaluator(xg, 'xg')

In [36]:
xg_model.fit(X_train, y_train)



[17:30:27] DEBUG: C:/Users/Administrator/workspace/xgboost-win64_release_1.3.0/src/gbm/gbtree.cc:154: Using tree method: 2
[17:30:28] INFO: C:/Users/Administrator/workspace/xgboost-win64_release_1.3.0/src/tree/updater_prune.cc:101: tree pruning end, 66 extra nodes, 0 pruned nodes, max_depth=10
[17:30:28] INFO: C:/Users/Administrator/workspace/xgboost-win64_release_1.3.0/src/tree/updater_prune.cc:101: tree pruning end, 58 extra nodes, 0 pruned nodes, max_depth=10
[17:30:29] INFO: C:/Users/Administrator/workspace/xgboost-win64_release_1.3.0/src/tree/updater_prune.cc:101: tree pruning end, 120 extra nodes, 0 pruned nodes, max_depth=10
[17:30:29] INFO: C:/Users/Administrator/workspace/xgboost-win64_release_1.3.0/src/tree/updater_prune.cc:101: tree pruning end, 98 extra nodes, 0 pruned nodes, max_depth=10
[17:30:30] INFO: C:/Users/Administrator/workspace/xgboost-win64_release_1.3.0/src/tree/updater_prune.cc:101: tree pruning end, 158 extra nodes, 0 pruned nodes, max_depth=10
[17:30:30] INFO

In [37]:
test_auc = xg_model.evaluate(X_test, y_test)

Accuracy - Test: 0.99825
AUC - Test: 0.9204472796834738
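
Accuracy and AUC measure different things, which is one reason the two numbers above can sit far apart. A common cause is class imbalance, although the class balance of this dataset is not stated here. The sketch below uses invented labels and scores, with hand-written helper functions rather than the evaluator's own code, to show accuracy staying high while AUC is much lower.

```python
def accuracy(y_true, y_pred):
    # Fraction of labels predicted exactly right.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    # Probability that a random positive is scored above a random negative
    # (ties count as half a win).
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented data: 98 negatives and 2 positives.
y_true = [0] * 98 + [1] * 2
# One positive gets the same score as every negative, the other a higher one.
scores = [0.1] * 98 + [0.1, 0.4]
# Thresholding at 0.5 predicts the negative class for every row.
y_pred = [int(s >= 0.5) for s in scores]

acc = accuracy(y_true, y_pred)  # 0.98
auc = roc_auc(y_true, scores)   # 0.75
```

Here the classifier never predicts a positive, yet it is right 98% of the time; AUC ignores the threshold and looks only at how well the scores rank positives above negatives.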


In [38]:
# Save the model to the fitted_models folder
xg_model.save_model()