# CrossColumnMappingTransformer
This notebook demonstrates the CrossColumnMappingTransformer class. This transformer changes the values of one column based on the values in other columns. <br>

In [1]:
import pandas as pd
import numpy as np
from collections import OrderedDict

In [2]:
import tubular
from tubular.mapping import CrossColumnMappingTransformer

In [3]:
tubular.__version__

'0.2.8'

## Load Boston house price dataset from sklearn
Note, the prepare_boston_df function used below modifies the original Boston dataset to include null values and pandas categorical dtypes.

In [4]:
boston_df = tubular.testing.test_data.prepare_boston_df()
boston_df.shape

(506, 17)

In [5]:
boston_df.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,target,ZN_cat,CHAS_cat,RAD_cat
0,0.00632,18.0,2.31,0.0,0.538,6.575,,4.09,,296.0,15.3,396.9,4.98,24.0,18.0,0.0,
1,0.02731,,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6,,0.0,2.0
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,,17.8,392.83,4.03,34.7,0.0,0.0,2.0
3,,,2.18,0.0,0.458,,45.8,6.0622,3.0,222.0,18.7,,,33.4,,0.0,3.0
4,0.06905,0.0,2.18,0.0,0.458,,,6.0622,3.0,222.0,18.7,396.9,5.33,36.2,0.0,0.0,3.0


In [6]:
boston_df.dtypes

CRIM         float64
ZN            object
INDUS        float64
CHAS          object
NOX          float64
RM           float64
AGE          float64
DIS          float64
RAD           object
TAX          float64
PTRATIO      float64
B            float64
LSTAT        float64
target       float64
ZN_cat      category
CHAS_cat    category
RAD_cat     category
dtype: object

## Simple usage

### Initialising CrossColumnMappingTransformer

The user must pass in a dict of mappings, in which each item is itself a dict of mappings for a specific column. <br>
The user also specifies the column to be adjusted. <br>
As shown below, values of a column that do not need a mapping can be left out of the dictionary. <br>

In [7]:
mappings = {
    'RAD': {
        '1.0': 1.1,
        '2.0': 0.5,
        '3.0': 4,        
    }
}

adjust_column = "target"

In [8]:
map_1 = CrossColumnMappingTransformer(adjust_column = adjust_column, mappings = mappings, copy = True, verbose = True)


BaseTransformer.__init__() called
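
The nested structure described above (an outer dict keyed by column name, each value an inner dict from column value to replacement) can be checked before the transformer is built. The following is a minimal sketch only: `check_mappings` is a hypothetical helper written for this notebook, not part of tubular, and the checks the library itself performs may differ.

```python
def check_mappings(mappings, adjust_column):
    """Validate the nested-dict structure (illustrative helper, not a tubular function)."""
    if not isinstance(adjust_column, str):
        raise TypeError("adjust_column should be a string")
    if not isinstance(mappings, dict) or len(mappings) == 0:
        raise ValueError("mappings should be a non-empty dict")
    for column, column_mappings in mappings.items():
        # Each outer value must itself be a dict of {column value: replacement}.
        if not isinstance(column_mappings, dict):
            raise TypeError(f"mappings for column {column!r} should be a dict")
    # Return the columns that drive the mapping, in the order given.
    return list(mappings)


columns = check_mappings({"RAD": {"1.0": 1.1, "2.0": 0.5, "3.0": 4}}, "target")
print(columns)  # ['RAD']
```
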


### CrossColumnMappingTransformer fit
There is no fit method for the CrossColumnMappingTransformer, as the user sets the mappings dictionary when initialising the object.

### CrossColumnMappingTransformer transform
Mappings were specified for only one column when creating map_1, so only this column will be used to map the values of the adjust_column when the transform method is run.

In [9]:
boston_df[['RAD','target']].head(10)

Unnamed: 0,RAD,target
0,,24.0
1,2.0,21.6
2,2.0,34.7
3,3.0,33.4
4,3.0,36.2
5,3.0,28.7
6,5.0,22.9
7,5.0,27.1
8,5.0,16.5
9,5.0,18.9


In [10]:
boston_df[boston_df['RAD'].isin(['1.0', '2.0','3.0'])]['target'].groupby(boston_df['RAD']).mean()

RAD
1.0    24.122222
2.0    27.125000
3.0    27.931429
Name: target, dtype: float64

In [11]:
boston_df_2 = map_1.transform(boston_df)

BaseTransformer.transform() called


In [12]:
boston_df_2[['RAD','target']].head(10)

Unnamed: 0,RAD,target
0,,24.0
1,2.0,0.5
2,2.0,0.5
3,3.0,4.0
4,3.0,4.0
5,3.0,4.0
6,5.0,22.9
7,5.0,27.1
8,5.0,16.5
9,5.0,18.9


In [13]:
boston_df_2[boston_df_2['RAD'].isin(['1.0', '2.0','3.0'])]['target'].groupby(boston_df_2['RAD']).mean()

RAD
1.0    1.1
2.0    0.5
3.0    4.0
Name: target, dtype: float64
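
The behaviour demonstrated in this section can be approximated in plain pandas. The sketch below is an assumption about the observable behaviour only, not a copy of the tubular implementation; `cross_column_map` and the small `toy` frame are hypothetical names used for illustration.

```python
import pandas as pd


def cross_column_map(df, adjust_column, mappings):
    """Return a copy of df with adjust_column overwritten where a mapped value matches."""
    out = df.copy()
    for column, column_mappings in mappings.items():
        for value, replacement in column_mappings.items():
            # Rows where `column` equals `value` take the replacement;
            # all other rows keep their current adjust_column value.
            out[adjust_column] = out[adjust_column].mask(out[column] == value, replacement)
    return out


toy = pd.DataFrame(
    {"RAD": [None, "2.0", "3.0", "5.0"], "target": [24.0, 21.6, 33.4, 22.9]}
)
result = cross_column_map(toy, "target", {"RAD": {"1.0": 1.1, "2.0": 0.5, "3.0": 4}})
print(result["target"].tolist())  # [24.0, 0.5, 4.0, 22.9]
```

Rows whose 'RAD' value is null or has no entry in the mapping keep their original 'target' value, matching the output shown above.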

## Column dtype conversion
If all levels of a column are included in a mapping, and the mapping converts between data types, the pandas dtype of the adjusted column will be converted.

In [14]:
mappings_2 = {
    'CHAS': {
        '0.0': 'a',
        '1.0': 'b'
    }
}

adjust_column_2 = "target"

In [15]:
map_2 = CrossColumnMappingTransformer(adjust_column = adjust_column_2, mappings=mappings_2, copy = True, verbose = True)

BaseTransformer.__init__() called


In [16]:
boston_df['target'].dtype

dtype('float64')

In [17]:
boston_df_3 = map_2.transform(boston_df)

BaseTransformer.transform() called


In [18]:
boston_df_3['target'].dtype

dtype('O')

In [19]:
boston_df_3['target'].value_counts(dropna = False)

a    471
b     35
Name: target, dtype: int64
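
The same dtype change can be reproduced on a small frame with plain pandas. This is a sketch under the assumption that every row is covered by the mapping; it uses `Series.mask` rather than the transformer, and the `toy` frame is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"CHAS": ["0.0", "1.0", "0.0"], "target": [24.0, 21.6, 34.7]})

target = toy["target"]
for value, replacement in {"0.0": "a", "1.0": "b"}.items():
    # Replace target wherever CHAS matches; a str replacement forces an upcast.
    target = target.mask(toy["CHAS"] == value, replacement)

print(toy["target"].dtype)  # float64
print(target.dtype)         # object
print(target.tolist())      # ['a', 'b', 'a']
```
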

## Unexpected dtype conversions
Take special care when a mapping covers only a subset of levels, so that the replacement does not introduce an unintended data type conversion. Any conversions that do happen follow the pandas dtype conversion rules. <br>
The example below shows how the dtype of the column 'target' changes when a particular value is mapped to a str, following pandas dtype conversions.

In [20]:
mappings_3 = {
    'RM': {
        6.405: 'zzz'
    }
}

adjust_column_3 = "target"


In [21]:
map_3 = CrossColumnMappingTransformer(adjust_column = adjust_column_3, mappings = mappings_3, copy = True, verbose = True)

BaseTransformer.__init__() called


In [22]:
boston_df['target'].dtype

dtype('float64')

In [23]:
boston_df_4 = map_3.transform(boston_df)

BaseTransformer.transform() called


In [24]:
boston_df_4['target'].dtype

dtype('O')
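
One way to catch this kind of change is to compare dtypes before and after and to count the Python types held in the column. The sketch below uses plain pandas on a made-up `toy` frame; it illustrates the check itself, and the exact contents of the column produced by the transformer may differ from what `Series.mask` gives here.

```python
import pandas as pd

toy = pd.DataFrame({"RM": [6.575, 6.405, 7.185], "target": [24.0, 21.6, 34.7]})

# Only one RM value is mapped, and it maps to a str.
mixed = toy["target"].mask(toy["RM"] == 6.405, "zzz")

if mixed.dtype != toy["target"].dtype:
    print("dtype changed:", toy["target"].dtype, "->", mixed.dtype)

print(mixed.tolist())  # [24.0, 'zzz', 34.7]

# Count the Python types held in the column to spot a mixed-type result.
type_counts = mixed.map(lambda v: type(v).__name__).value_counts().to_dict()
print(type_counts)  # {'float': 2, 'str': 1}
```
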

## Using ordered dicts

If more than one column is used to define the mappings, then an ordered dict must be used to specify these. The mappings are applied in the order specified to ensure reproducibility.

In [25]:
mappings_4 = OrderedDict()

mappings_4['RAD'] = {
        '2.0': 5,
        '3.0': 3,        
    }
mappings_4['ZN'] = {
        '0.0': -2,
    }


adjust_column_4 = "target"

print(mappings_4)

OrderedDict([('RAD', {'2.0': 5, '3.0': 3}), ('ZN', {'0.0': -2})])


In [26]:
map_4 = CrossColumnMappingTransformer(adjust_column = adjust_column_4, mappings = mappings_4, copy = True, verbose = True)

BaseTransformer.__init__() called


In [27]:
boston_df[['ZN','RAD','target']].head()

Unnamed: 0,ZN,RAD,target
0,18.0,,24.0
1,,2.0,21.6
2,0.0,2.0,34.7
3,,3.0,33.4
4,0.0,3.0,36.2


In the above example, as 'ZN' follows 'RAD' in the ordered dict, any replacements made based on the value of 'RAD' may be overridden by replacements made based on the value of 'ZN'.

In [28]:
boston_df_5 = map_4.transform(boston_df)

BaseTransformer.transform() called


In [29]:
boston_df_5[['ZN','RAD','target']].head()

Unnamed: 0,ZN,RAD,target
0,18.0,,24.0
1,,2.0,5.0
2,0.0,2.0,-2.0
3,,3.0,3.0
4,0.0,3.0,-2.0
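
To see why the order matters, the two orderings can be compared on a frame that mirrors the five rows above. This is a plain pandas sketch, not the tubular implementation; `apply_in_order` is a hypothetical helper and the `toy` frame is typed in by hand.

```python
from collections import OrderedDict

import pandas as pd


def apply_in_order(df, adjust_column, mappings):
    """Apply each column's mappings in dict order; later columns overwrite earlier ones."""
    out = df[adjust_column]
    for column, column_mappings in mappings.items():
        for value, replacement in column_mappings.items():
            out = out.mask(df[column] == value, replacement)
    return out.tolist()


toy = pd.DataFrame(
    {
        "ZN": ["18.0", None, "0.0", None, "0.0"],
        "RAD": [None, "2.0", "2.0", "3.0", "3.0"],
        "target": [24.0, 21.6, 34.7, 33.4, 36.2],
    }
)

rad_then_zn = OrderedDict([("RAD", {"2.0": 5, "3.0": 3}), ("ZN", {"0.0": -2})])
zn_then_rad = OrderedDict([("ZN", {"0.0": -2}), ("RAD", {"2.0": 5, "3.0": 3})])

print(apply_in_order(toy, "target", rad_then_zn))  # [24.0, 5.0, -2.0, 3.0, -2.0]
print(apply_in_order(toy, "target", zn_then_rad))  # [24.0, 5.0, 5.0, 3.0, 3.0]
```

Rows 2 and 4 match both a 'RAD' mapping and the 'ZN' mapping, so whichever column comes last in the ordered dict decides their final value.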
