# MappingTransformer
This notebook demonstrates the MappingTransformer class. This transformer maps column values to other values, using the `pandas.DataFrame.replace` function. <br>

In [1]:
import pandas as pd
import numpy as np

In [2]:
import tubular
from tubular.mapping import MappingTransformer

In [3]:
tubular.__version__

'0.2.8'

## Load Boston house price dataset from sklearn
Note, the data preparation function used below (`prepare_boston_df`) modifies the original Boston dataset to include null values and pandas categorical dtypes.

In [4]:
boston_df = tubular.testing.test_data.prepare_boston_df()
boston_df.shape

(506, 17)

In [5]:
boston_df.head()

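| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target | ZN_cat | CHAS_cat | RAD_cat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | NaN | 4.09 | NaN | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 | 18.0 | 0.0 | NaN |
| 1 | 0.02731 | NaN | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 | NaN | 0.0 | 2.0 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | NaN | 17.8 | 392.83 | 4.03 | 34.7 | 0.0 | 0.0 | 2.0 |
| 3 | NaN | NaN | 2.18 | 0.0 | 0.458 | NaN | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | NaN | NaN | 33.4 | NaN | 0.0 | 3.0 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | NaN | NaN | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 | 0.0 | 0.0 | 3.0 |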


In [6]:
boston_df.dtypes

CRIM         float64
ZN            object
INDUS        float64
CHAS          object
NOX          float64
RM           float64
AGE          float64
DIS          float64
RAD           object
TAX          float64
PTRATIO      float64
B            float64
LSTAT        float64
target       float64
ZN_cat      category
CHAS_cat    category
RAD_cat     category
dtype: object

## Simple usage

### Initialising MappingTransformer

The user must pass in a dict of mappings, where each item is itself a dict of mappings for a specific column. <br>
In the mapping transformer the user does not specify columns, as with most other transformers; instead, the columns are picked up from the keys of mappings. <br>
In the case of RAD, there are null values. If the user wishes to treat these, they should use the imputation transformers in the package.

In [7]:
column_mappings = {
    'RAD': {
        '1.0': 'a',
        '2.0': 'b',
        '3.0': 'c',
        '4.0': 'd',
        '5.0': 'e',
        '6.0': 'f',
        '7.0': 'g',
        '8.0': 'h',
        '24.0': 'i'
    },
    'CHAS': {
        '0.0': 'aa',
        '1.0': 'bb'
    }
}

In [8]:
map_1 = MappingTransformer(mappings = column_mappings, copy = True, verbose = True)

BaseTransformer.__init__() called


In [9]:
map_1.mappings

{'RAD': {'1.0': 'a',
  '2.0': 'b',
  '3.0': 'c',
  '4.0': 'd',
  '5.0': 'e',
  '6.0': 'f',
  '7.0': 'g',
  '8.0': 'h',
  '24.0': 'i'},
 'CHAS': {'0.0': 'aa', '1.0': 'bb'}}
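
The structure described above (a dict keyed by column name, whose values are themselves dicts) can be checked with a few lines of plain Python. The sketch below is only an illustration: `columns_from_mappings` is a hypothetical helper written for this notebook, not part of tubular, and it does not claim to mirror the checks the library performs internally.

```python
def columns_from_mappings(mappings):
    """Hypothetical helper: validate a dict of dicts and return its column names."""
    if not isinstance(mappings, dict):
        raise TypeError("mappings must be a dict")
    for column, column_mapping in mappings.items():
        if not isinstance(column_mapping, dict):
            raise TypeError(f"mapping for column {column!r} must be a dict")
    # The columns to transform are simply the top-level keys.
    return list(mappings.keys())


example_mappings = {
    "RAD": {"1.0": "a", "2.0": "b"},
    "CHAS": {"0.0": "aa", "1.0": "bb"},
}
columns = columns_from_mappings(example_mappings)
print(columns)
```

Deriving the columns from the keys means a mapping and its target column cannot drift apart, which is presumably why no separate columns argument is needed.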

### MappingTransformer fit
There is no fit method for the MappingTransformer, as the user sets the mappings when initialising the object.

### MappingTransformer transform
Multiple column mappings were specified when creating map_1 so these columns will be mapped when the transform method is run.

In [10]:
boston_df['RAD'].dtype

dtype('O')

In [11]:
boston_df['RAD'].value_counts(dropna = False)

24.0    124
5.0     103
4.0      88
NaN      62
3.0      35
6.0      22
8.0      21
2.0      20
1.0      18
7.0      13
Name: RAD, dtype: int64

In [12]:
boston_df['CHAS'].dtype

dtype('O')

In [13]:
boston_df['CHAS'].value_counts(dropna = False)

0.0    471
1.0     35
Name: CHAS, dtype: int64

In [14]:
boston_df_2 = map_1.transform(boston_df)

BaseTransformer.transform() called


In [15]:
boston_df_2['RAD'].dtype

dtype('O')

In [16]:
boston_df_2['RAD'].value_counts(dropna = False)

i      124
e      103
d       88
NaN     62
c       35
f       22
h       21
b       20
a       18
g       13
Name: RAD, dtype: int64

In [17]:
boston_df_2['CHAS'].dtype

dtype('O')

In [18]:
boston_df_2['CHAS'].value_counts(dropna = False)

aa    471
bb     35
Name: CHAS, dtype: int64
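
Since the introduction states that the transformer relies on `pandas.DataFrame.replace`, the result above can be approximated in plain pandas with a nested dict. This is a minimal sketch on a small made-up frame (`toy_df` and `toy_mappings` are illustrative names, not the Boston data), and it is not a reproduction of the transformer's actual implementation.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the RAD and CHAS columns, including one null.
toy_df = pd.DataFrame(
    {
        "RAD": ["1.0", "2.0", np.nan, "24.0"],
        "CHAS": ["0.0", "1.0", "0.0", "0.0"],
    }
)

# Outer keys are column names, inner dicts map old values to new values.
toy_mappings = {
    "RAD": {"1.0": "a", "2.0": "b", "24.0": "i"},
    "CHAS": {"0.0": "aa", "1.0": "bb"},
}

# replace returns a new frame; toy_df itself is left unchanged.
mapped_df = toy_df.replace(toy_mappings)
print(mapped_df)
```

As in the transformer output, the null in RAD is not touched by the mapping and remains null.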

## Transforming only certain levels
If only certain levels of a column are to be mapped then just these levels can be supplied in the mapping dict. 

In [19]:
column_mappings_2 = {
    'ZN': {
        '0.0': 'zzz',
        '20.0': '10.0'
    }
}

In [20]:
map_2 = MappingTransformer(mappings = column_mappings_2, copy = True, verbose = False)

In [21]:
boston_df['ZN'].dtype

dtype('O')

In [22]:
boston_df['ZN'].value_counts(dropna = False).head()

0.0     330
NaN      62
20.0     16
80.0     13
25.0     10
Name: ZN, dtype: int64

In [23]:
boston_df_3 = map_2.transform(boston_df)

In [24]:
boston_df_3['ZN'].value_counts(dropna = False).head()

zzz     330
NaN      62
10.0     16
80.0     13
25.0     10
Name: ZN, dtype: int64
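
The fact that unmapped levels such as 80.0 survive is a property of replace-style mapping. As a hedged illustration in plain pandas (toy values only, independent of tubular), the sketch below contrasts `Series.replace`, which leaves unmapped values alone, with `Series.map`, which turns values missing from the dict into NaN.

```python
import pandas as pd

zn = pd.Series(["0.0", "20.0", "80.0"], name="ZN")
partial_mapping = {"0.0": "zzz", "20.0": "10.0"}

# replace: values absent from the mapping ("80.0") are kept as they are.
replaced = zn.replace(partial_mapping)

# map: values absent from the mapping become NaN.
mapped = zn.map(partial_mapping)

print(replaced.tolist())
print(mapped.isna().tolist())
```

This difference is worth keeping in mind when porting a partial mapping between the two pandas methods.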

## Column dtype conversion
If all levels of a column are included in a mapping, and the mapping converts between data types, the pandas dtype will be converted. 

In [25]:
column_mappings_3 = {
    'CHAS': {
        '0.0': 1000,
        '1.0': 2000
    }
}

In [26]:
map_3 = MappingTransformer(mappings = column_mappings_3, copy = True, verbose = False)

In [27]:
boston_df['CHAS'].dtype

dtype('O')

In [28]:
boston_df['CHAS'].value_counts(dropna = False).head()

0.0    471
1.0     35
Name: CHAS, dtype: int64

In [29]:
boston_df_4 = map_3.transform(boston_df)

In [30]:
boston_df_4['CHAS'].dtype

dtype('int64')

In [31]:
boston_df_4['CHAS'].value_counts(dropna = False)

1000    471
2000     35
Name: CHAS, dtype: int64

## Unexpected dtype conversions
Special care should be taken when specifying only a subset of levels in a mapping, to ensure that the mapping does not introduce an unintended data type conversion. Any conversions that do happen follow the pandas dtype conversion rules, as this transformer uses `pandas.DataFrame.replace`. <br>
The example below shows how the dtype of the column 'RM' changes from float64 to object when a particular value is mapped to a str.

In [32]:
column_mappings_4 = {
    'RM': {
        6.405: 'zzz'
    }
}

In [33]:
map_4 = MappingTransformer(mappings = column_mappings_4, copy = True, verbose = False)

In [34]:
boston_df['RM'].dtype

dtype('float64')

In [35]:
(boston_df['RM'] == 6.405).sum()

3

In [36]:
boston_df_5 = map_4.transform(boston_df)

In [37]:
boston_df_5['RM'].dtype

dtype('O')

In [38]:
(boston_df_5['RM'] == 'zzz').sum()

3
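
One way to catch this kind of silent conversion is to compare dtypes before and after mapping. The sketch below does so in plain pandas on toy values; `dtype_changed` is a hypothetical helper introduced here for illustration and is not part of tubular.

```python
import pandas as pd


def dtype_changed(before, after):
    """Hypothetical helper: report whether a mapping altered the dtype."""
    return before.dtype != after.dtype


rm = pd.Series([6.405, 6.575, 7.185], name="RM")

# Mapping a single float to a str forces the whole column away from float64.
rm_mapped = rm.replace({6.405: "zzz"})
print(dtype_changed(rm, rm_mapped))

# Mapping a float to another float keeps the column as float64.
rm_same = rm.replace({6.405: 0.0})
print(dtype_changed(rm, rm_same))
```

A check like this could be run after a transform step wherever a partial mapping is applied to a numeric column.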