In [1]:
import pandas as pd

In [8]:
train_data = pd.read_csv('data_assignment2/train_fact.csv')
ext_df = pd.read_csv('data_assignment2/external_data.csv')
prev_df = pd.read_csv('data_assignment2/prev_filtered.csv')

# Feature engineering

Feature engineering is all about creating more information from the information you have. Not all raw data can be used as is.

For example categorical variables cannot be used "as is" therefore we have to one hot encode these variables to use them. Here is one example for gender.


We usually have three types of variables, continuous, ordinal and categorical

Continuous variables are basically numeric features, it can take any numerical value. In this dataset age is an example of a continuous feature. These features are probably a good starting point since we can use them directly in a machine learning model.

Categorical variables are different as in that a variable is categorized in two or more categories. In this dataset sex is a categorical variable. It has just two categories: male, female. Often categorical variables are often strings and have to be transformed in some way as they cannot be used directly in a machine learning model. A machine learning model works with numbers, not with strings.

Ordinal variables are similar to categorical features, however there is some order in it. Eg. small, medium, tall is a categorical feature but there is still order in the features. In this dataset the Pclass is an ordinal variable, there are just 3 classes however there is order in the classes.

# The desired state of the data before we can apply machine learning

Before we can train a machine learning we have to transform the dataset into a format that a machine learning model can use.

As mentioned earlier, a machine learning model expects numbers, there should be no string columns.

Q1: how are we going to transform the string columns to a numerical column(s)?

We want to extract as much data from our dataset as possible.

Q2: how can we extract more information from the dataset then the data already provides?

Eventually we want to get to a situation where each column is a feature and the target variable in the end.

Eg:

feat_1, feat_2, feat_3, 'TARGET'


In [23]:
train_data = pd.get_dummies(train_data, columns=['CODE_GENDER'])

Here is an example of how we can add more information such as "income per child" 

In [6]:
train_data['income_per_child'] = train_data['AMT_INCOME_TOTAL'] / (train_data['CNT_CHILDREN'] + 1)

The historical data is more complex, since we only have current records in our prediction table, we have to transform the history to a single row of variables to add to our prediction table.

In [9]:
prev_agg = prev_df.groupby('SK_ID_CURR').agg({
    'AMT_CREDIT': ['min', 'max', 'mean', 'sum']
})
prev_agg.columns = pd.Index(['PREV_' + e[0] + "_" + e[1].upper() for e in prev_agg.columns.tolist()])

In [15]:
ext_df = pd.read_csv('data_assignment2/external_data.csv')
ext_agg = ext_df.groupby('SK_ID_CURR').agg({
    'AMT_CREDIT_SUM': ['min', 'max', 'mean', 'sum']
})
ext_agg.columns = pd.Index(
    ['EXT_' + e[0] + "_" + e[1].upper() for e in ext_agg.columns.tolist()])

We can now merge the aggregates with our prediction table.

In [16]:
train_data = pd.merge(train_data, prev_agg, how='left', on='SK_ID_CURR')
train_data = pd.merge(train_data, ext_df, how='left', on='SK_ID_CURR')

In [17]:
train_data

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY_x,...,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,CREDIT_TYPE,DAYS_CREDIT_UPDATE,AMT_ANNUITY_y
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,-1038.0,,0.0,40761.000,,,0.0,Credit card,-1038.0,0.000
1,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,-48.0,,0.0,0.000,0.00,,0.0,Credit card,-47.0,
2,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,-1185.0,0.000,0.0,135000.000,0.00,0.000,0.0,Consumer credit,-1185.0,0.000
3,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,-911.0,3321.000,0.0,19071.000,,,0.0,Consumer credit,-906.0,0.000
4,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,-36.0,5043.645,0.0,120735.000,0.00,0.000,0.0,Consumer credit,-34.0,0.000
5,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,,40.500,0.0,31988.565,0.00,31988.565,0.0,Credit card,-24.0,0.000
6,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,,,0.0,450000.000,245781.00,0.000,0.0,Consumer credit,-7.0,0.000
7,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,-967.0,0.000,0.0,67500.000,,,0.0,Credit card,-758.0,0.000
8,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,-2131.0,0.000,0.0,22248.000,0.00,0.000,0.0,Consumer credit,-2131.0,
9,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,-540.0,0.000,0.0,112500.000,0.00,0.000,0.0,Credit card,-540.0,


# Your turn!

What other features can you come up with?

- What features have to be transformed, eg categorical variables, transformations over the income etc.
- How can we extract more value from our history?
- What other creative features can you come up with?

Some ideas:

- What's the term of the loan?
- How many times did someone pay late in the past?
- 