## Preprocessing
In this notebook, we do some feature engineering, which includes:
1. Filtering features
2. Encoding categorical features for use in common ML algorithms
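
As a rough illustration of the filtering step, the sketch below drops identifier-like columns from a tiny sample frame. Which columns to filter is an assumption made here for illustration (`cols_to_drop` is a hypothetical name); the notebook itself does not state which features are removed.

```python
import pandas as pd

# Tiny sample frame that mimics the shape of the training data
sample = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "jobId": ["JOB1362684407687", "JOB1362684407688"],
    "jobType": ["CFO", "CEO"],
    "salary": [130, 101],
})

# Assumption: identifier-like columns carry no predictive signal,
# so they are the ones filtered out in this sketch
cols_to_drop = ["Unnamed: 0", "jobId"]
filtered = sample.drop(columns=cols_to_drop)

print(list(filtered.columns))
```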

We want to be able to change preprocessing choices on the fly during the model-building process. For that reason, we build a reusable preprocessing tool rather than a static one-off script, which speeds up iterations and produces a preprocessed data file.

In [1]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Import our helper functions
import sys
sys.path.insert(0, './helpers')
from preprocessor import PreProcessor

import warnings
warnings.filterwarnings('ignore')

In [3]:
# Load the training data set
df = pd.read_csv('./derived_data/train_data_after_EDA.csv')
df.head()

Unnamed: 0,jobId,companyId,jobType,degree,major,industry,yearsExperience,milesFromMetropolis,salary
0,JOB1362684407687,COMP37,CFO,MASTERS,MATH,HEALTH,10,83,130
1,JOB1362684407688,COMP19,CEO,HIGH_SCHOOL,NONE,WEB,3,73,101
2,JOB1362684407689,COMP52,VICE_PRESIDENT,DOCTORAL,PHYSICS,HEALTH,10,38,137
3,JOB1362684407690,COMP38,MANAGER,DOCTORAL,CHEMISTRY,AUTO,8,17,142
4,JOB1362684407691,COMP7,VICE_PRESIDENT,BACHELORS,PHYSICS,FINANCE,8,16,163


### Call our PreProcessor class to transform the columns we want to label encode

In [4]:
# List out all the columns we want to be label encoded
cols_to_label = ['companyId','jobType', 'degree', 'major', 'industry']

In [5]:
# The PreProcessor class label encodes the given columns of the data frame
# Let's inspect the class to get an idea of what the module does

PreProcessor??

Init signature: PreProcessor(columns=None)
Docstring:      <no docstring>
Source:
class PreProcessor:
    def __init__(self,columns = None):
        self.columns = columns # array of column names to encode

    def fit(self,X,y=None):
        return self # not relevant here

    def transform(self,X):

In [6]:
# Now using this PreProcessor class we will label encode the columns in cols_to_label
label_enc = PreProcessor(cols_to_label)
df = label_enc.fit_transform(df)
df.head()

Encoding complete


Unnamed: 0,jobId,companyId,jobType,degree,major,industry,yearsExperience,milesFromMetropolis,salary
0,JOB1362684407687,31,1,3,6,3,10,83,130
1,JOB1362684407688,11,0,2,7,6,3,73,101
2,JOB1362684407689,48,7,1,8,3,10,38,137
3,JOB1362684407690,32,5,1,2,0,8,17,142
4,JOB1362684407691,60,7,0,8,2,8,16,163


The training data is now ready for modeling. Based on feedback from modeling, we may come back and do further feature engineering on this data set.
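
To end up with a preprocessed data file, the encoded frame still has to be written out. The sketch below shows one way this might be done with `DataFrame.to_csv`. It writes to an in-memory buffer so it runs anywhere; in the notebook a file path would be passed instead, and that path is not given in the source. Passing `index=False` avoids the extra `Unnamed: 0` column seen in the tables above, which pandas typically produces when a CSV saved with its index is reloaded without `index_col`.

```python
import io

import pandas as pd

# Small stand-in for the encoded training frame
encoded = pd.DataFrame({
    "jobId": ["JOB1362684407687", "JOB1362684407688"],
    "jobType": [1, 0],
    "salary": [130, 101],
})

# In-memory buffer used in place of a real output path (assumption)
buffer = io.StringIO()
encoded.to_csv(buffer, index=False)  # index=False keeps the row index out of the file

# Reload to confirm that no "Unnamed: 0" column appears
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(list(reloaded.columns))
```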