# MappingTransformer
This notebook shows the functionality in the MappingTransformer class. This transformer maps column values to other values, using the pandas.DataFrame.replace function. <br>

In [1]:
import pandas as pd
import numpy as np

In [2]:
import tubular
from tubular.mapping import MappingTransformer

In [3]:
tubular.__version__

'0.3.0'

## Create dummy dataset

In [4]:
df = pd.DataFrame(
    {
        "factor1": [
            np.nan,
            "1.0",
            "2.0",
            "1.0",
            "3.0",
            "3.0",
            "2.0",
            "2.0",
            "1.0",
            "3.0",
        ],
        "factor2": ["z", "z", "x", "y", "x", "x", "z", "y", "x", "y"],
        "target": [18.5, 21.2, 33.2, 53.3, 24.7, 19.2, 31.7, 42.0, 25.7, 33.9],
        "target_int": [2, 1, 3, 4, 5, 6, 5, 8, 9, 8],
        "target_binary": [0, 0, 1, 0, 1, 0, 0, 1, 0, 0],
    }
)

In [5]:
df.head()

  factor1 factor2  target  target_int  target_binary
0     NaN       z    18.5           2              0
1     1.0       z    21.2           1              0
2     2.0       x    33.2           3              1
3     1.0       y    53.3           4              0
4     3.0       x    24.7           5              1


In [6]:
df.dtypes

factor1           object
factor2           object
target           float64
target_int         int64
target_binary      int64
dtype: object

## Simple usage

### Initialising MappingTransformer

The user must pass in a dict of mappings, where each item is itself a dict of mappings for a specific column. <br>
Unlike most other transformers, the user does not specify columns for the mapping transformer; instead these are picked up from the keys of mappings. <br>
The factor1 column contains null values. If the user wishes to treat these, they should use the imputation transformers in the package.

In [7]:
column_mappings = {
    "factor1": {
        "1.0": "a",
        "2.0": "b",
        "3.0": "c",
    },
    "factor2": {"x": "aa", "y": "bb", "z": "cc"},
}

In [8]:
map_1 = MappingTransformer(mappings=column_mappings, copy=True, verbose=True)

BaseTransformer.__init__() called


In [9]:
map_1.mappings

{'factor1': {'1.0': 'a', '2.0': 'b', '3.0': 'c'},
 'factor2': {'x': 'aa', 'y': 'bb', 'z': 'cc'}}
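
Because the columns are taken from the keys of the mappings dict, a quick structural check before constructing the transformer can catch mistakes early. The following is a minimal sketch in plain Python; `check_mappings` is a hypothetical helper written for this illustration and is not part of tubular.

```python
def check_mappings(mappings):
    """Return the column names implied by a dict-of-dicts mapping.

    Hypothetical helper for illustration only; not part of tubular.
    """
    if not isinstance(mappings, dict):
        raise TypeError("mappings must be a dict")
    for column, column_mapping in mappings.items():
        # Each value must itself be a dict of {old_value: new_value}.
        if not isinstance(column_mapping, dict):
            raise TypeError(f"mapping for column {column!r} must be a dict")
    return list(mappings.keys())


column_mappings = {
    "factor1": {"1.0": "a", "2.0": "b", "3.0": "c"},
    "factor2": {"x": "aa", "y": "bb", "z": "cc"},
}

columns = check_mappings(column_mappings)
print(columns)  # ['factor1', 'factor2']
```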

### MappingTransformer fit
There is no fit method for the MappingTransformer, as the user sets the mappings when initialising the object.

### MappingTransformer transform
Mappings for multiple columns were specified when creating map_1, so all of these columns are mapped when the transform method is run.

In [10]:
df["factor1"].dtype

dtype('O')

In [11]:
df["factor1"].value_counts(dropna=False)

1.0    3
2.0    3
3.0    3
NaN    1
Name: factor1, dtype: int64

In [12]:
df["factor2"].dtype

dtype('O')

In [13]:
df["factor2"].value_counts(dropna=False)

x    4
z    3
y    3
Name: factor2, dtype: int64

In [14]:
df_2 = map_1.transform(df)

BaseTransformer.transform() called


In [15]:
df_2["factor1"].dtype

dtype('O')

In [16]:
df_2["factor1"].value_counts(dropna=False)

a      3
b      3
c      3
NaN    1
Name: factor1, dtype: int64

In [17]:
df_2["factor2"].dtype

dtype('O')

In [18]:
df_2["factor2"].value_counts(dropna=False)

aa    4
cc    3
bb    3
Name: factor2, dtype: int64

## Transforming only certain levels
If only certain levels of a column are to be mapped, then just these levels can be supplied in the mapping dict; levels that are not listed are left unchanged.

In [19]:
column_mappings_2 = {"factor1": {"1.0": "0.0", "3.0": "10.0"}}

In [20]:
map_2 = MappingTransformer(mappings=column_mappings_2, copy=True, verbose=False)

In [21]:
df["factor1"].dtype

dtype('O')

In [22]:
df["factor1"].value_counts(dropna=False).head()

1.0    3
2.0    3
3.0    3
NaN    1
Name: factor1, dtype: int64

In [23]:
df_3 = map_2.transform(df)

In [24]:
df_3["factor1"].value_counts(dropna=False).head()

0.0     3
2.0     3
10.0    3
NaN     1
Name: factor1, dtype: int64
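
The same "unlisted levels are untouched" behaviour can be seen with `pandas.DataFrame.replace` directly, using a nested dict of the same shape as the mappings above. This is a sketch in plain pandas on a small made-up frame (`demo` is an illustrative name), not a reproduction of the transformer's internals.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; values mirror the levels of factor1 above.
demo = pd.DataFrame({"factor1": [np.nan, "1.0", "2.0", "3.0"]})

# Nested dict: {column: {old_value: new_value}}.
partial = {"factor1": {"1.0": "0.0", "3.0": "10.0"}}

replaced = demo.replace(partial)

# "2.0" and the null value are not in the mapping, so they are left as they were.
print(replaced["factor1"].tolist())
```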

## Column dtype conversion
If all levels of a column are included in a mapping, and the mapping converts between data types, the pandas dtype of the column will be converted (in the example below, from int64 to bool).

In [25]:
column_mappings_3 = {"target_binary": {0: False, 1: True}}

In [26]:
map_3 = MappingTransformer(mappings=column_mappings_3, copy=True, verbose=False)

In [27]:
df["target_binary"].dtype

dtype('int64')

In [28]:
df["target_binary"].value_counts(dropna=False).head()

0    7
1    3
Name: target_binary, dtype: int64

In [29]:
df_4 = map_3.transform(df)

In [30]:
df_4["target_binary"].dtype

dtype('bool')

In [31]:
df_4["target_binary"].value_counts(dropna=False)

False    7
True     3
Name: target_binary, dtype: int64
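
The dtype change for a complete mapping can be illustrated with plain pandas. The sketch below uses `pandas.Series.map` on a small made-up Series (`flags` is an illustrative name); it is an assumption-level illustration of pandas dtype inference, and the transformer may apply the mapping differently internally.

```python
import pandas as pd

flags = pd.Series([0, 0, 1, 0, 1], name="target_binary")

# Every level present in the Series (0 and 1) has an entry in the mapping,
# so the result contains only bools and pandas stores it with a bool dtype.
full = flags.map({0: False, 1: True})

print(full.dtype)
print(full.tolist())
```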

## Unexpected dtype conversions
Special care should be taken when specifying only a subset of levels in a mapping, to ensure that the mapping does not introduce an unintended data type conversion. Any conversions that do happen follow the pandas dtype conversion rules, as the transformer relies on pandas to apply the mapping. <br>
The example below shows how the dtype of the column target_binary is changed from int64 to object when only the value 1 is mapped to a bool (True), following pandas dtype conversions.

In [32]:
column_mappings_4 = {"target_binary": {1: True}}

In [33]:
map_4 = MappingTransformer(mappings=column_mappings_4, copy=True, verbose=False)

In [34]:
df["target_binary"].dtype

dtype('int64')

In [35]:
(df["target_binary"] == 1).sum()

3

In [36]:
df_5 = map_4.transform(df)

In [37]:
df_5["target_binary"].dtype

dtype('O')

In [38]:
df_5["target_binary"].value_counts()

0       7
True    3
Name: target_binary, dtype: int64
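
One way to guard against this kind of surprise is to check, before transforming, which levels of a column are not covered by the mapping. The sketch below is plain pandas; `unmapped_levels` is a hypothetical helper that is not part of tubular, and the lambda-based `map` call only emulates "leave unmapped values unchanged" for illustration.

```python
import pandas as pd


def unmapped_levels(series, mapping):
    """Return the non-null values in `series` that have no key in `mapping`.

    Hypothetical helper for illustration only; not part of tubular.
    """
    return set(series.dropna().unique().tolist()) - set(mapping)


flags = pd.Series([0, 0, 1, 0, 1], name="target_binary")
partial = {1: True}

# Levels that would keep their original (int) value after mapping.
missing = unmapped_levels(flags, partial)
print(missing)

# Emulate a partial mapping: mapped values become bool, the rest stay int,
# so pandas falls back to the object dtype for the mixed column.
mixed = flags.map(lambda value: partial.get(value, value))
print(mixed.dtype)
```

An empty set from `unmapped_levels` indicates that every level is covered, in which case the resulting dtype is determined by the mapped values alone.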