# CrossColumnMappingTransformer
This notebook demonstrates the CrossColumnMappingTransformer class. This transformer changes the values of one column based on the values in other columns.

In [1]:
import pandas as pd
import numpy as np
from collections import OrderedDict

In [2]:
import tubular
from tubular.mapping import CrossColumnMappingTransformer

In [3]:
tubular.__version__

'0.3.0'

## Create dummy dataset

In [4]:
df = pd.DataFrame(
    {
        "factor1": [
            np.nan,
            "1.0",
            "2.0",
            "1.0",
            "3.0",
            "3.0",
            "2.0",
            "2.0",
            "1.0",
            "3.0",
        ],
        "factor2": ["z", "z", "x", "y", "x", "x", "z", "y", "x", "y"],
        "target": [18.5, 21.2, 33.2, 53.3, 24.7, 19.2, 31.7, 42.0, 25.7, 33.9],
        "target_int": [2, 1, 3, 4, 5, 6, 5, 8, 9, 8],
    }
)

In [5]:
df.head()

,factor1,factor2,target,target_int
0,,z,18.5,2
1,1.0,z,21.2,1
2,2.0,x,33.2,3
3,1.0,y,53.3,4
4,3.0,x,24.7,5


In [6]:
df.dtypes

factor1        object
factor2        object
target        float64
target_int      int64
dtype: object

## Simple usage

### Initialising CrossColumnMappingTransformer

The user must pass in a dict of mappings, where each key is a column name and each value is a dict mapping levels of that column to replacement values. <br>
The user also specifies the column to be adjusted. <br>
As shown below, levels of a column that do not need a mapping can be left out of the dictionary.
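
The expected shape of the mappings argument can be sketched as a small check. The helper name `check_mappings` is hypothetical and this is not tubular's own validation; it only illustrates the dict-of-dicts structure described above.

```python
def check_mappings(mappings):
    """Hypothetical helper: confirm mappings is a dict of per-column dicts.

    Returns the column names in the order they appear in mappings.
    """
    if not isinstance(mappings, dict):
        raise TypeError("mappings must be a dict")
    for column, column_mapping in mappings.items():
        if not isinstance(column_mapping, dict):
            raise TypeError(
                f"mappings[{column!r}] must be a dict of level -> new value"
            )
    return list(mappings)


# Only the levels that need a replacement have to be listed.
partial = {"factor1": {"1.0": 1.1}}
print(check_mappings(partial))
```

A check like this could be run before constructing the transformer, so that a malformed mapping is reported with a clear message.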

In [7]:
mappings = {
    "factor1": {
        "1.0": 1.1,
        "2.0": 0.5,
        "3.0": 4,
    }
}

adjust_column = "target"

In [8]:
map_1 = CrossColumnMappingTransformer(
    adjust_column=adjust_column, mappings=mappings, copy=True, verbose=True
)

BaseTransformer.__init__() called


### CrossColumnMappingTransformer fit
There is no fit method for the CrossColumnMappingTransformer, as the user sets the mappings dictionary when initialising the object.

### CrossColumnMappingTransformer transform
Only one column's mappings were specified when creating map_1, so only this column will be used to map the values of the adjust_column when the transform method is run.

In [9]:
df[["factor1", "target"]].head(10)

,factor1,target
0,,18.5
1,1.0,21.2
2,2.0,33.2
3,1.0,53.3
4,3.0,24.7
5,3.0,19.2
6,2.0,31.7
7,2.0,42.0
8,1.0,25.7
9,3.0,33.9


In [10]:
df[df["factor1"].isin(["1.0", "2.0", "3.0"])]["target"].groupby(df["factor1"]).mean()

factor1
1.0    33.400000
2.0    35.633333
3.0    25.933333
Name: target, dtype: float64

In [11]:
df_2 = map_1.transform(df)

BaseTransformer.transform() called


In [12]:
df_2[["factor1", "target"]].head(10)

,factor1,target
0,,18.5
1,1.0,1.1
2,2.0,0.5
3,1.0,1.1
4,3.0,4.0
5,3.0,4.0
6,2.0,0.5
7,2.0,0.5
8,1.0,1.1
9,3.0,4.0


In [13]:
df_2[df_2["factor1"].isin(["1.0", "2.0", "3.0"])]["target"].groupby(
    df_2["factor1"]
).mean()

factor1
1.0    1.1
2.0    0.5
3.0    4.0
Name: target, dtype: float64
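
For intuition, the replacement shown above can be approximated with plain pandas. This is a minimal sketch under the assumption that a row is overwritten whenever the mapped column equals a key; `cross_column_map` is a hypothetical helper and not tubular's implementation, and only the first five rows of the dummy data are used.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "factor1": [np.nan, "1.0", "2.0", "1.0", "3.0"],
        "target": [18.5, 21.2, 33.2, 53.3, 24.7],
    }
)
mappings = {"factor1": {"1.0": 1.1, "2.0": 0.5, "3.0": 4}}


def cross_column_map(data, adjust_column, mappings):
    """Hypothetical helper: overwrite adjust_column where a mapped column equals a key."""
    out = data.copy()  # leave the input DataFrame untouched
    for column, column_mapping in mappings.items():
        for key, new_value in column_mapping.items():
            # A null never equals a key, so rows with a missing level are skipped.
            matches = out[column] == key
            out.loc[matches, adjust_column] = new_value
    return out


result = cross_column_map(df, "target", mappings)
print(result)
```

The row with a missing 'factor1' keeps its original 'target' value, which matches the first row of the transformed output above.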

## Column dtype conversion
If every level of a column is included in a mapping, and the replacement values have a different data type from the column being adjusted, the pandas dtype of the adjusted column will change.

In [14]:
mappings_2 = {"factor2": {"x": "a", "y": "b", "z": "c"}}

adjust_column_2 = "target"

In [15]:
map_2 = CrossColumnMappingTransformer(
    adjust_column=adjust_column_2, mappings=mappings_2, copy=True, verbose=True
)

BaseTransformer.__init__() called


In [16]:
df["target"].dtype

dtype('float64')

In [17]:
df_3 = map_2.transform(df)

BaseTransformer.transform() called


In [18]:
df_3["target"].dtype

dtype('O')

In [19]:
df_3["target"].value_counts(dropna=False)

a    4
c    3
b    3
Name: target, dtype: int64
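
Because every level of 'factor2' has a replacement, every row of 'target' is overwritten. A rough way to anticipate this before transforming, assuming plain pandas and no tubular internals, is to check whether the mapping keys cover all observed levels; the variable names below are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "factor2": ["z", "z", "x", "y", "x", "x", "z", "y", "x", "y"],
        "target": [18.5, 21.2, 33.2, 53.3, 24.7, 19.2, 31.7, 42.0, 25.7, 33.9],
    }
)
mapping = {"x": "a", "y": "b", "z": "c"}

# True when every non-null level of factor2 appears as a key in the mapping.
fully_mapped = set(df["factor2"].dropna().unique()) <= set(mapping)

# With full coverage, the adjusted column is equivalent to a direct lookup.
new_target = df["factor2"].map(mapping)

print(fully_mapped)
print(df["target"].dtype, "->", new_target.dtype)
```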

## Unexpected dtype conversions
Special care should be taken when specifying only a subset of levels, to ensure that the replacement does not introduce an unintended data type conversion. Any conversions that do happen follow the pandas dtype conversions. <br>
The example below shows how the dtype of the column 'target' changes when a particular value is mapped to a str, following the pandas dtype conversions.

In [20]:
mappings_3 = {"factor1": {"1.0": "zzz"}}

adjust_column_3 = "target"

In [21]:
map_3 = CrossColumnMappingTransformer(
    adjust_column=adjust_column_3, mappings=mappings_3, copy=True, verbose=True
)

BaseTransformer.__init__() called


In [22]:
df["target"].dtype

dtype('float64')

In [23]:
df_4 = map_3.transform(df)

BaseTransformer.transform() called


In [24]:
df_4["target"].dtype

dtype('O')

In [25]:
df_4["target"]

0    18.5
1     zzz
2    33.2
3     zzz
4    24.7
5    19.2
6    31.7
7    42.0
8     zzz
9    33.9
Name: target, dtype: object
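
One way to catch this before calling transform is to scan the replacement values for anything that is not numeric. The sketch below is a hypothetical pre-check (the name `non_numeric_replacements` is not part of tubular) and assumes the column being adjusted is numeric.

```python
from numbers import Real


def non_numeric_replacements(mappings):
    """Hypothetical helper: list (column, key, value) triples whose value is not a real number."""
    return [
        (column, key, value)
        for column, column_mapping in mappings.items()
        for key, value in column_mapping.items()
        # bool is a subclass of int, so it is excluded explicitly.
        if isinstance(value, bool) or not isinstance(value, Real)
    ]


numeric_mappings = {"factor1": {"1.0": 1.1, "2.0": 0.5, "3.0": 4}}
mixed_mappings = {"factor1": {"1.0": "zzz"}}

print(non_numeric_replacements(numeric_mappings))
print(non_numeric_replacements(mixed_mappings))
```

An empty list suggests the replacements will not turn a numeric column into an object column; any entries point at the values to review.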

## Using ordered dicts

If more than one column is used to define the mappings, then an ordered dict must be used to specify them, and the mappings will be applied in the order specified to ensure reproducibility.

In [26]:
mappings_4 = OrderedDict()

mappings_4["factor1"] = {
    "1.0": 2,
    "2.0": 5,
    "3.0": 3,
}
mappings_4["factor2"] = {
    "x": -2,
}


adjust_column_4 = "target"

print(mappings_4)

OrderedDict([('factor1', {'1.0': 2, '2.0': 5, '3.0': 3}), ('factor2', {'x': -2})])


In [27]:
map_4 = CrossColumnMappingTransformer(
    adjust_column=adjust_column_4, mappings=mappings_4, copy=True, verbose=True
)

BaseTransformer.__init__() called


In [28]:
df[["factor1", "factor2", "target"]].head()

,factor1,factor2,target
0,,z,18.5
1,1.0,z,21.2
2,2.0,x,33.2
3,1.0,y,53.3
4,3.0,x,24.7


In the above example, as 'factor2' follows 'factor1' in the ordered dict, any replacements made based on the value of 'factor1' may be overridden by replacements made based on the value of 'factor2'.

In [29]:
df_5 = map_4.transform(df)

BaseTransformer.transform() called


In [30]:
df_5[["factor1", "factor2", "target"]].head()

,factor1,factor2,target
0,,z,18.5
1,1.0,z,2.0
2,2.0,x,-2.0
3,1.0,y,2.0
4,3.0,x,-2.0
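
The effect of ordering can be illustrated without pandas. The sketch below uses a hypothetical row-wise helper, `apply_in_order`, that assumes each matching column overwrites the value in turn; it is not how tubular is implemented, but it shows why swapping the order of the columns changes the result.

```python
from collections import OrderedDict


def apply_in_order(row, adjust_column, mappings):
    """Hypothetical helper: return the adjusted value for one row (a dict)."""
    value = row[adjust_column]
    for column, column_mapping in mappings.items():
        if row[column] in column_mapping:
            # A later column overwrites whatever an earlier column set.
            value = column_mapping[row[column]]
    return value


row = {"factor1": "2.0", "factor2": "x", "target": 33.2}

factor1_first = OrderedDict(
    [("factor1", {"1.0": 2, "2.0": 5, "3.0": 3}), ("factor2", {"x": -2})]
)
factor2_first = OrderedDict(
    [("factor2", {"x": -2}), ("factor1", {"1.0": 2, "2.0": 5, "3.0": 3})]
)

print(apply_in_order(row, "target", factor1_first))  # -2, as in row 2 above
print(apply_in_order(row, "target", factor2_first))  # 5
```

A row that matches only one of the columns gives the same value under either ordering, so the order only matters where the mappings overlap.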
