# Pre-Processing

For pre-processing, I use the `sklearn` pipeline approach and build custom transformers for all feature engineering.

This ensures that new data passed to the eventual model for prediction receives the same transformations as the training data.

The big benefit of using transformers is the OOP approach, which allows us to store state directly in the transformer for custom mappings and feature engineering.
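To make the idea of "state stored in the transformer" concrete, here is a minimal sketch of a stateful transformer. The class name `NegativeToMedianTransformer` and the median-based rule are assumptions for illustration only; the actual `trainer_lib` transformers may work differently.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class NegativeToMedianTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical example: replace negative sentinel values with a learned median."""

    def __init__(self, column="DaysPassed"):
        self.column = column

    def fit(self, X, y=None):
        # Learn the replacement value from the data passed to fit (the training set).
        valid = X.loc[X[self.column] >= 0, self.column]
        self.median_ = float(valid.median())
        return self

    def transform(self, X):
        # Apply the stored value; nothing is re-learned here.
        X = X.copy()
        X[self.column] = X[self.column].where(X[self.column] >= 0, self.median_)
        return X


# Toy data, not taken from the real dataset.
train = pd.DataFrame({"DaysPassed": [-1, 100, 200, -1, 300]})
test = pd.DataFrame({"DaysPassed": [-1, 5]})

transformer = NegativeToMedianTransformer().fit(train)
train_out = transformer.transform(train)
test_out = transformer.transform(test)
```

Because the median is learned in `fit` and only applied in `transform`, the test frame is filled with the value learned from the training frame rather than its own statistics.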

In [3]:
%load_ext autoreload
%autoreload 2
import warnings
from trainer_lib import DataManager
from trainer_lib.utils.notebook_config import DATA_DIR, REPORT_DIR
print(f"Data directory: {DATA_DIR}")
print(f"Report directory: {REPORT_DIR}")
warnings.filterwarnings('ignore')

# The instantiation will fetch the data and documentation if not already fetched
mngr = DataManager(save_path=DATA_DIR, report_path=REPORT_DIR)
X,y = mngr.train

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
Data directory: /app/data/raw/
Report directory: /app/reports/


## Custom transformers

The following transformers are part of the `trainer_lib` package in this project, under the `transformers` namespace.

They generally deal with the feature engineering and quality checks outlined in the EDA notebook.
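As a rough illustration of what a stateless feature-engineering transformer can look like, the sketch below derives a call duration in whole minutes from `CallStart` and `CallEnd`. The class `CallDurationSketch` is a hypothetical stand-in, not the `trainer_lib` implementation, and rounding down to whole minutes is an assumption. The sample rows are the first three rows of the raw data in this notebook.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class CallDurationSketch(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in; not the trainer_lib CallDurationTransformer."""

    def fit(self, X, y=None):
        # Stateless: there is nothing to learn from the training data.
        return self

    def transform(self, X):
        X = X.copy()
        # "HH:MM:SS" strings parse directly as offsets from midnight.
        start = pd.to_timedelta(X["CallStart"])
        end = pd.to_timedelta(X["CallEnd"])
        # Whole minutes, rounded down (an assumption of this sketch).
        X["CallDurationMins"] = ((end - start).dt.total_seconds() // 60).astype(int)
        return X


calls = pd.DataFrame({
    "CallStart": ["13:45:20", "14:49:03", "16:30:24"],
    "CallEnd": ["13:46:30", "14:52:08", "16:36:04"],
})
durations = CallDurationSketch().fit_transform(calls)["CallDurationMins"].tolist()
```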

In [18]:
from trainer_lib.modelling.model_config import ALL_COLUMNS, ONE_HOT_CATEGORICAL_COLUMNS, SCALABLE_NUMERIC_COLUMNS
from sklearn.pipeline import Pipeline
from trainer_lib.transformers import SelectFeaturesTransfomer
from trainer_lib.transformers import CallDurationTransformer
from trainer_lib.transformers import TimeOfDayTransformer
from trainer_lib.transformers import MonthNameTransformer
from trainer_lib.transformers import EducationTransformer
from trainer_lib.transformers import OutcomeTransformer
from trainer_lib.transformers import JobTransformer
from trainer_lib.transformers import DaysPassedTransformer
from trainer_lib.transformers import DatasetCleanerPipeline

## Create a pipeline

In [19]:
PRE_PROCESSING_STEPS = [
    ("add_time_duration", CallDurationTransformer()),
    ("add_time_of_day", TimeOfDayTransformer()),
    ("convert_job", JobTransformer()),
    ("convert_month", MonthNameTransformer()),
    ("convert_education", EducationTransformer()),
    ("convert_outcome", OutcomeTransformer()),
    ("replace_negative_days_passed", DaysPassedTransformer()),
    ("impute_missing", DatasetCleanerPipeline()),
    ("column_selection", SelectFeaturesTransfomer(features=ALL_COLUMNS))
]
feature_engineering = Pipeline(PRE_PROCESSING_STEPS)

## Raw data

In [20]:
X.info()
X.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Id                4000 non-null   int64 
 1   Age               4000 non-null   int64 
 2   Job               3981 non-null   object
 3   Marital           4000 non-null   object
 4   Education         3831 non-null   object
 5   Default           4000 non-null   int64 
 6   Balance           4000 non-null   int64 
 7   HHInsurance       4000 non-null   int64 
 8   CarLoan           4000 non-null   int64 
 9   Communication     3098 non-null   object
 10  LastContactDay    4000 non-null   int64 
 11  LastContactMonth  4000 non-null   object
 12  NoOfContacts      4000 non-null   int64 
 13  DaysPassed        4000 non-null   int64 
 14  PrevAttempts      4000 non-null   int64 
 15  Outcome           958 non-null    object
 16  CallStart         4000 non-null   object
 17  CallEnd       

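|   | Id | Age | Job | Marital | Education | Default | Balance | HHInsurance | CarLoan | Communication | LastContactDay | LastContactMonth | NoOfContacts | DaysPassed | PrevAttempts | Outcome | CallStart | CallEnd |
|---|----|-----|-----|---------|-----------|---------|---------|-------------|---------|---------------|----------------|------------------|--------------|------------|--------------|---------|-----------|---------|
| 0 | 1 | 32 | management | single | tertiary | 0 | 1218 | 1 | 0 | telephone | 28 | jan | 2 | -1 | 0 |  | 13:45:20 | 13:46:30 |
| 1 | 2 | 32 | blue-collar | married | primary | 0 | 1156 | 1 | 0 |  | 26 | may | 5 | -1 | 0 |  | 14:49:03 | 14:52:08 |
| 2 | 3 | 29 | management | single | tertiary | 0 | 637 | 1 | 0 | cellular | 3 | jun | 1 | 119 | 1 | failure | 16:30:24 | 16:36:04 |
| 3 | 4 | 25 | student | single | primary | 0 | 373 | 1 | 0 | cellular | 11 | may | 2 | -1 | 0 |  | 12:06:43 | 12:20:22 |
| 4 | 5 | 30 | management | married | tertiary | 0 | 2694 | 0 | 0 | cellular | 3 | jun | 1 | -1 | 0 |  | 14:35:44 | 14:38:56 |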


In [21]:
processed = feature_engineering.fit_transform(X)
processed.info()
processed.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Job               4000 non-null   object 
 1   Marital           4000 non-null   object 
 2   Communication     4000 non-null   object 
 3   CallTimeOfDay     4000 non-null   object 
 4   Age               4000 non-null   int64  
 5   Balance           4000 non-null   int64  
 6   LastContactDay    4000 non-null   int64  
 7   LastContactMonth  4000 non-null   int64  
 8   NoOfContacts      4000 non-null   int64  
 9   DaysPassed        4000 non-null   int64  
 10  PrevAttempts      4000 non-null   int64  
 11  CallDurationMins  4000 non-null   int64  
 12  Education         4000 non-null   float64
 13  Default           4000 non-null   int64  
 14  HHInsurance       4000 non-null   int64  
 15  CarLoan           4000 non-null   int64  
 16  Outcome           4000 non-null   float64


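|   | Job | Marital | Communication | CallTimeOfDay | Age | Balance | LastContactDay | LastContactMonth | NoOfContacts | DaysPassed | PrevAttempts | CallDurationMins | Education | Default | HHInsurance | CarLoan | Outcome |
|---|-----|---------|---------------|---------------|-----|---------|----------------|------------------|--------------|------------|--------------|------------------|-----------|---------|-------------|---------|---------|
| 0 | professional | single | telephone | afternoon | 32 | 1218 | 28 | 1 | 2 | 182 | 0 | 1 | 3.0 | 0 | 1 | 0 | 0.0 |
| 1 | skilled | married | other | afternoon | 32 | 1156 | 26 | 5 | 5 | 182 | 0 | 3 | 1.0 | 0 | 1 | 0 | 0.0 |
| 2 | professional | single | cellular | afternoon | 29 | 637 | 3 | 6 | 1 | 119 | 1 | 5 | 3.0 | 0 | 1 | 0 | 0.0 |
| 3 | other | single | cellular | afternoon | 25 | 373 | 11 | 5 | 2 | 182 | 0 | 13 | 1.0 | 0 | 1 | 0 | 0.0 |
| 4 | professional | married | cellular | afternoon | 30 | 2694 | 3 | 6 | 1 | 182 | 0 | 3 | 3.0 | 0 | 0 | 0 | 0.0 |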


Now that the pipeline is fitted to our training data, it can be used to transform the test data directly, which will make predictions straightforward.

In [26]:
X_test, _ = mngr.test
X_test_processed = feature_engineering.transform(X_test)
X_test_processed.info()
X_test_processed.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Job               1000 non-null   object 
 1   Marital           1000 non-null   object 
 2   Communication     1000 non-null   object 
 3   CallTimeOfDay     1000 non-null   object 
 4   Age               1000 non-null   int64  
 5   Balance           1000 non-null   int64  
 6   LastContactDay    1000 non-null   int64  
 7   LastContactMonth  1000 non-null   int64  
 8   NoOfContacts      1000 non-null   int64  
 9   DaysPassed        1000 non-null   int64  
 10  PrevAttempts      1000 non-null   int64  
 11  CallDurationMins  1000 non-null   int64  
 12  Education         1000 non-null   float64
 13  Default           1000 non-null   int64  
 14  HHInsurance       1000 non-null   int64  
 15  CarLoan           1000 non-null   int64  
 16  Outcome           1000 non-null   float64

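|   | Job | Marital | Communication | CallTimeOfDay | Age | Balance | LastContactDay | LastContactMonth | NoOfContacts | DaysPassed | PrevAttempts | CallDurationMins | Education | Default | HHInsurance | CarLoan | Outcome |
|---|-----|---------|---------------|---------------|-----|---------|----------------|------------------|--------------|------------|--------------|------------------|-----------|---------|-------------|---------|---------|
| 0 | workforce | single | other | afternoon | 25 | 1 | 12 | 5 | 12 | 182 | 0 | 0 | 2.0 | 0 | 1 | 1 | 0.0 |
| 1 | professional | married | cellular | morning | 40 | 0 | 24 | 7 | 1 | 182 | 0 | 0 | 3.0 | 0 | 1 | 1 | 0.0 |
| 2 | professional | single | cellular | afternoon | 44 | -1313 | 15 | 5 | 10 | 182 | 0 | 1 | 3.0 | 0 | 1 | 1 | 0.0 |
| 3 | skilled | single | cellular | morning | 27 | 6279 | 9 | 11 | 1 | 182 | 0 | 4 | 2.0 | 0 | 1 | 0 | 0.0 |
| 4 | skilled | married | cellular | afternoon | 53 | 7984 | 2 | 2 | 1 | 182 | 0 | 2 | 2.0 | 0 | 1 | 0 | 0.0 |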
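
The `info()` outputs for the processed training and test data list the same 17 columns with the same dtypes. A small helper can make that check explicit instead of relying on reading the output by eye. The function name `schema_mismatches` and the toy frames below are assumptions for illustration; in the notebook the same call could be made on `processed` and `X_test_processed`.

```python
import pandas as pd


def schema_mismatches(train_df, test_df):
    """Return a list of human-readable differences in columns or dtypes."""
    problems = []
    if list(train_df.columns) != list(test_df.columns):
        problems.append("column names or order differ")
    # Compare dtypes only for the columns both frames share.
    for col in train_df.columns.intersection(test_df.columns):
        if train_df[col].dtype != test_df[col].dtype:
            problems.append(f"{col}: {train_df[col].dtype} vs {test_df[col].dtype}")
    return problems


# Toy frames standing in for the processed train and test data.
train_like = pd.DataFrame({"Age": [32, 29], "Education": [3.0, 1.0]})
test_like = pd.DataFrame({"Age": [25, 40], "Education": [2.0, 3.0]})
drifted = test_like.astype({"Education": "int64"})

ok = schema_mismatches(train_like, test_like)
bad = schema_mismatches(train_like, drifted)
```

An empty list means the two frames agree on column names, order, and dtypes.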


Nice!