**Data Preprocessing**

I conducted the exploratory data analysis (EDA) in R (see eda.Rmd). This part of the notebook prepares the data based on what I found in the EDA.

In [1]:
from src.data_main import Data

data = Data();

Creating an instance of the Data class reads the data CSV.

In [2]:
data._Data__convert_response();

  self.data["response"] = self.data["y"].replace({"no": 0, "yes": 1});


I convert the response variable to boolean.
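As a minimal sketch of this step on a toy frame (the `toy` DataFrame is hypothetical, and this is one possible way to get a boolean column, not necessarily what the Data class does internally):

```python
import pandas as pd

# Hypothetical toy frame; "y" holds the raw "yes"/"no" labels.
toy = pd.DataFrame({"y": ["no", "yes", "no", "no"]})

# Comparing against "yes" yields a boolean column directly,
# without an intermediate 0/1 replacement.
toy["response"] = toy["y"].eq("yes")

print(toy["response"].tolist())
```

Building the column from a comparison keeps the dtype boolean from the start, which matches the False/True labels that value_counts reports later in the notebook.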

In [3]:
data._Data__create_day_ids();

The EDA revealed that this data has a significant temporal component. Because the records are in chronological order and each one has a day of week, I create a helper column, "new_day", that flags every row where the day of week changes, and take its cumulative sum to get "day_id": the number of days elapsed since data collection started.
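The idea can be sketched on a toy frame as follows. The column name `day_of_week` and the toy values are assumptions for illustration; the actual implementation lives in the Data class and is not shown here.

```python
import pandas as pd

# Hypothetical, chronologically ordered records.
toy = pd.DataFrame(
    {"day_of_week": ["mon", "mon", "tue", "tue", "tue", "wed", "mon", "mon"]}
)

# new_day is True wherever the weekday differs from the previous row.
# The first row is compared against an empty string, so it is flagged too.
previous_day = toy["day_of_week"].shift(fill_value="")
toy["new_day"] = toy["day_of_week"] != previous_day

# The cumulative sum of the flags is a running day counter starting at 1.
toy["day_id"] = toy["new_day"].cumsum()

print(toy["day_id"].tolist())
```

Note that the counter advances only when the weekday label changes, so a gap of exactly one week with no records in between would go unnoticed.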

In [4]:
data._Data__convert_to_categorical();

I convert all string columns to categorical values. The EDA identified that some numerical variables (e.g. previous) would be inappropriate to represent as continuous, so I convert them to categorical as well.
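A minimal sketch of the conversion, assuming a hypothetical toy frame and a hand-written list of columns to convert (the Data class may select the columns differently):

```python
import pandas as pd

# Hypothetical toy frame: one string column, one count-like numeric
# column, and one genuinely continuous column.
toy = pd.DataFrame(
    {"job": ["admin.", "student", "admin."], "previous": [0, 1, 0], "age": [30, 22, 41]}
)

# Columns to treat as categorical: the string column plus the numeric
# column that should not be modelled as continuous.
categorical_cols = ["job", "previous"]
for col in categorical_cols:
    toy[col] = toy[col].astype("category")

print(toy.dtypes)
```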

In [5]:
data._Data__merge("loan", "housing");
data._Data__merge("poutcome", "previous");

The EDA identified that these pairs of columns need to be merged for linear models, because some of their levels are perfectly multicollinear with each other.
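To illustrate the problem and one possible fix, here is a sketch on hypothetical toy data in which "unknown" always occurs in both columns together. Combining the two columns into a single interaction category is an assumption about what the merge does; the actual `__merge` method is not shown in this notebook.

```python
import pandas as pd

# Hypothetical toy data: "unknown" appears in both columns on the same rows.
toy = pd.DataFrame(
    {
        "loan": ["no", "yes", "unknown", "no"],
        "housing": ["yes", "yes", "unknown", "no"],
    }
)

# Encoded separately, the two "unknown" dummies are identical columns,
# which makes them perfectly collinear in a linear model.
separate = pd.get_dummies(toy, columns=["loan", "housing"])
duplicated = bool((separate["loan_unknown"] == separate["housing_unknown"]).all())

# Merging the pair into one category removes the duplicated dummy.
toy["loan_housing"] = toy["loan"] + "_" + toy["housing"]
merged = pd.get_dummies(toy["loan_housing"], prefix="loan_housing")

print(duplicated)
print(list(merged.columns))
```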

In [6]:
data._Data__bin_continuous();
data._Data__bin_categorical();

  self.data["default_group"] = self.data["default"].replace({"unknown": "unknown_or_yes", "yes": "unknown_or_yes"});


The EDA showed that the class densities of certain non-linear continuous variables (age, pdays, campaign) are neatly separated at some thresholds, so binning at those thresholds lets linear models use them to separate the classes. Similarly, certain categoricals (default) have unnecessary levels that can be merged to reduce their complexity and standard error.
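Both kinds of binning can be sketched on toy data. The "9+" top bin for campaign is inferred from the campaign_group column names that appear later in this notebook, and the default mapping follows the replace call shown above; the toy values themselves are hypothetical.

```python
import pandas as pd

# Hypothetical toy records.
toy = pd.DataFrame(
    {
        "campaign": [1, 2, 9, 15, 4],
        "default": ["no", "unknown", "yes", "no", "unknown"],
    }
)

# Continuous -> bins: cap campaign at 9 and label the top bin "9+".
toy["campaign_group"] = (
    toy["campaign"].clip(upper=9).astype(str).replace({"9": "9+"})
)

# Categorical -> fewer levels: fold "unknown" and "yes" into one level.
toy["default_group"] = toy["default"].replace(
    {"unknown": "unknown_or_yes", "yes": "unknown_or_yes"}
)

print(toy[["campaign_group", "default_group"]])
```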

In [7]:
data._Data__encode_categorical();

I one-hot encode all categorical variables so that models can use them as numeric inputs.
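A minimal sketch of one-hot encoding with `pd.get_dummies`, on a hypothetical toy frame. The column names in the training sets shown later follow the same `<column>_<level>` pattern, but whether the Data class uses `get_dummies` is an assumption.

```python
import pandas as pd

# Hypothetical toy frame with one categorical and one numeric column.
toy = pd.DataFrame(
    {"marital": pd.Categorical(["single", "married", "single"]), "age": [25, 40, 31]}
)

# Each level of "marital" becomes its own indicator column;
# numeric columns pass through unchanged.
encoded = pd.get_dummies(toy, columns=["marital"])

print(list(encoded.columns))
```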

In [8]:
data._Data__remove_unfair_predictors();

Some predictors (duration) are only known after a recording is complete, so they would not be available at prediction time and need to be removed.

In [9]:
data._Data__split_data();

The EDA identified a significant temporal component to the data. To produce genuine predictions in the future, a successful model needs to be able to infer temporal patterns in the data using day_id, so I split the data on day_id. The idea is to train the model on a large amount of historical data covering different temporal patterns (e.g. the strategy shifts identified in the EDA), so that it learns these patterns, and then test it on the most recent data. Because the most recent days contain the fewest records, a balance needs to be struck between showing the model enough of the most recent temporal pattern and restricting the size of the training data to reduce overfitting.
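A temporal split of this kind can be sketched as below. The toy data and the cut-off value are hypothetical, and whether the split day itself falls into the training or the testing set is an assumption (here it goes to testing).

```python
import pandas as pd

# Hypothetical toy data: later days have fewer records.
toy = pd.DataFrame({"day_id": [1, 1, 1, 2, 2, 2, 3, 3, 4, 5], "x": range(10)})

split_day = 4  # hypothetical cut-off, not the notebook's value

# Everything before the cut-off is training data; the rest is held out.
train = toy[toy["day_id"] < split_day]
test = toy[toy["day_id"] >= split_day]

train_prop = len(train) / len(toy)
test_prop = len(test) / len(toy)

print(train_prop, test_prop)
```

Unlike a random split, no testing record precedes any training record in time, so the evaluation mimics predicting on genuinely unseen future days.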

In [10]:
data.split_day

np.float64(260.0)

In [11]:
data.test_prop

0.06958337379819365

In [12]:
data.train_prop

0.9304166262018063

This training proportion is higher than the usual 0.8, so models trained with this split have a tendency to overfit to the training data, but it is necessary for the model to learn about the most recent data.

In [13]:
data.data["response"].value_counts()

response
False    36548
True      4640
Name: count, dtype: int64

The data as a whole has a class imbalance, but the EDA identified that the classes become more balanced in more recent observations. Data split using the above schema will therefore have different class balances in the training and testing sets, so I oversample the True class in the training data so that a) its class balance resembles that of the testing data and b) the model learns the characteristics of True at the same level as False.
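Random oversampling of the minority class can be sketched as follows. The toy training set is hypothetical, and the 1:1 target ratio is an assumption for illustration; the ratio actually used by the Data class is not stated here.

```python
import pandas as pd

# Hypothetical imbalanced training set: 2 True rows, 8 False rows.
train = pd.DataFrame({"x": range(10), "response": [True] * 2 + [False] * 8})

positives = train[train["response"]]
negatives = train[~train["response"]]

# Draw True rows with replacement until they match the False count
# (assumed 1:1 target), then shuffle the combined frame.
upsampled = positives.sample(n=len(negatives), replace=True, random_state=0)
balanced = (
    pd.concat([negatives, upsampled])
    .sample(frac=1, random_state=0)
    .reset_index(drop=True)
)

print(balanced["response"].value_counts())
```

Only the training data is resampled; the testing data keeps its natural class balance so that evaluation metrics remain honest.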

Linear models (e.g. the GAM fitted during the EDA) sometimes work best when non-linear continuous variables are binned, and they need some adjustments to avoid perfect multicollinearity, whereas machine learning models (e.g. DecisionTree, RandomForest, SGD) work best by inferring the bins themselves and need no such adjustments. I therefore create two sets of data: "sensitive" for linear models and "insensitive" for ML.

In [18]:
data.insensitive_train_X.columns

Index(['age', 'campaign', 'pdays', 'emp.var.rate', 'cons.price.idx',
       'cons.conf.idx', 'euribor3m', 'nr.employed', 'day_id', 'job_admin.',
       'job_blue-collar', 'job_entrepreneur', 'job_housemaid',
       'job_management', 'job_retired', 'job_self-employed', 'job_services',
       'job_student', 'job_technician', 'job_unemployed', 'job_unknown',
       'marital_divorced', 'marital_married', 'marital_single',
       'marital_unknown', 'education_basic.4y', 'education_basic.6y',
       'education_basic.9y', 'education_high.school', 'education_illiterate',
       'education_professional.course', 'education_university.degree',
       'education_unknown', 'default_no', 'default_unknown', 'default_yes',
       'housing_no', 'housing_unknown', 'housing_yes', 'loan_no',
       'loan_unknown', 'loan_yes', 'contact_cellular', 'contact_telephone',
       'month_apr', 'month_aug', 'month_dec', 'month_jul', 'month_jun',
       'month_mar', 'month_may', 'month_nov', 'month_oct', 'month_sep

In [17]:
data.sensitive_train_X.columns

Index(['emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m',
       'nr.employed', 'day_id', 'job_admin.', 'job_blue-collar',
       'job_entrepreneur', 'job_housemaid',
       ...
       'campaign_group_2', 'campaign_group_3', 'campaign_group_4',
       'campaign_group_5', 'campaign_group_6', 'campaign_group_7',
       'campaign_group_8', 'campaign_group_9+', 'default_group_no',
       'default_group_unknown_or_yes'],
      dtype='object', length=127)

Finally, I create validation sets from a random 20% sample of the training data, which I use to tune hyperparameters and decision thresholds.
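A minimal sketch of drawing such a validation set with pandas. The toy frame is hypothetical, and removing the sampled rows from the data used for fitting is an assumption about how the validation set is used.

```python
import pandas as pd

# Hypothetical training frame with 100 rows.
train = pd.DataFrame({"x": range(100)})

# Randomly sample 20% of the training rows for validation.
valid = train.sample(frac=0.2, random_state=42)

# Keep the remaining rows for fitting so the two sets do not overlap.
fit = train.drop(valid.index)

print(len(fit), len(valid))
```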

**Modelling and Evaluation**

In [19]:
from src.models_main import Models

models = Models(data);

ModuleNotFoundError: No module named 'models'