## Data Type Mapper Demonstration

## Problem

Machine learning packages often assume a particular encoding or data type for the data a model is learned from.

For instance, PyMC3 (a probabilistic programming package) requires that categorical data be represented by integers. This is due to how such a variable is modelled internally. However, it often contradicts the intuitive semantics of a variable or data attribute.

For instance, the value 1 carries no meaning on its own, whereas "Male" is easy to understand. Nevertheless, PyMC3 forces us to provide data encoded as 0 and 1.
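To make the encoding requirement concrete, here is a minimal sketch using plain pandas (it does not involve PyMC3 itself). The column name `sex` and its values are made up for illustration; the explicit category order is an assumption chosen so that the integer codes are predictable.

```python
import pandas as pd

# Hypothetical example column; the values are illustrative only.
sex = pd.Series(["Male", "Female", "Female", "Male"], name="sex")

# Fix the category order explicitly: "Female" -> 0, "Male" -> 1.
categories = ["Female", "Male"]
sex_cat = pd.Categorical(sex, categories=categories)

# Integer codes: the representation a model would consume.
codes = pd.Series(sex_cat.codes, name="sex")

# Decoding the codes recovers the readable labels.
labels = pd.Series(
    pd.Categorical.from_codes(codes.to_numpy(), categories=categories),
    name="sex",
)
```

The codes are convenient for the model, but only the labels are meaningful to a reader, which is exactly the tension described above.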

In Lumen, the values of variables/attributes are used directly in the visualizations. That is, if a variable is encoded as 1 and 0, then these values show up as categories in the front end. Obviously, this is not what we want. The user (in our scenario) does not care much about the technical details of encoding and modelling; they want to see the model!

Hence, we need an abstraction layer that transparently converts between the actual representation of data in *data space* and the required representation in *model space*.

This is what the _data type mapper_ is for.
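The idea of such a layer can be sketched in a few lines. The class below is a hypothetical stand-in written for illustration, not the implementation in `mb.modelbase`; it assumes that each column has one lookup table per direction and that unmapped columns pass through unchanged.

```python
import pandas as pd


class SimpleTypeMapper:
    """Hypothetical two-way lookup between data space and model space."""

    def __init__(self):
        self._forward = {}   # column name -> {data value: model value}
        self._backward = {}  # column name -> {model value: data value}

    def set_map(self, name, forward, backward):
        self._forward[name] = dict(forward)
        self._backward[name] = dict(backward)

    def forward(self, series):
        # Data space -> model space; columns without a map pass through.
        mapping = self._forward.get(series.name)
        return series if mapping is None else series.map(mapping)

    def backward(self, series):
        # Model space -> data space; columns without a map pass through.
        mapping = self._backward.get(series.name)
        return series if mapping is None else series.map(mapping)


mapper = SimpleTypeMapper()
mapper.set_map(
    "sex",
    forward={"Female": 0, "Male": 1},
    backward={0: "Female", 1: "Male"},
)

data = pd.Series(["Male", "Female"], name="sex")
encoded = mapper.forward(data)      # what the model sees
decoded = mapper.backward(encoded)  # what the front end shows
```

Keeping both directions in one object means the model code and the front end never need to know about each other's representation.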

## Simple demonstration

In [1]:
import pandas as pd
import numpy as np
import mb.modelbase
import mb.data

In [2]:
iris_data = pd.read_csv('example_models/iris.csv', index_col=None)

In [3]:
iris_data.head()

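| Unnamed: 0 | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |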


In [5]:
dtm = mb.modelbase.DataTypeMapper()
dtm.set_map('species', forward={'setosa': 1, 'virginica': 2, 'versicolor': 3}, backward='auto')
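The notebook does not show how `backward='auto'` is resolved. One plausible reading is that the backward map is the inverse of the forward dictionary, which only works if no two data values share the same model value. A sketch under that assumption follows; the helper `invert_mapping` is hypothetical and not part of `mb.modelbase`.

```python
def invert_mapping(forward):
    """Return the inverse of a dict, refusing non-injective mappings."""
    backward = {}
    for data_value, model_value in forward.items():
        if model_value in backward:
            # Two data values map to the same model value, so the
            # original value could not be recovered unambiguously.
            raise ValueError(
                f"{model_value!r} is the image of both "
                f"{backward[model_value]!r} and {data_value!r}"
            )
        backward[model_value] = data_value
    return backward


forward = {"setosa": 1, "virginica": 2, "versicolor": 3}
backward = invert_mapping(forward)
```

Rejecting duplicate target values up front avoids a silent loss of information on the way back to data space.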

In [10]:
iris_data.species.head()

0    setosa
1    setosa
2    setosa
3    setosa
4    setosa
Name: species, dtype: object

In [6]:
dtm.forward(iris_data.species, inplace=False)

0      1
1      1
2      1
3      1
4      1
5      1
6      1
7      1
8      1
9      1
10     1
11     1
12     1
13     1
14     1
15     1
16     1
17     1
18     1
19     1
20     1
21     1
22     1
23     1
24     1
25     1
26     1
27     1
28     1
29     1
      ..
120    2
121    2
122    2
123    2
124    2
125    2
126    2
127    2
128    2
129    2
130    2
131    2
132    2
133    2
134    2
135    2
136    2
137    2
138    2
139    2
140    2
141    2
142    2
143    2
144    2
145    2
146    2
147    2
148    2
149    2
Name: species, Length: 150, dtype: int64

In [7]:
dtm.backward(dtm.forward(iris_data.species, inplace=False), inplace=False)

0         setosa
1         setosa
2         setosa
3         setosa
4         setosa
5         setosa
6         setosa
7         setosa
8         setosa
9         setosa
10        setosa
11        setosa
12        setosa
13        setosa
14        setosa
15        setosa
16        setosa
17        setosa
18        setosa
19        setosa
20        setosa
21        setosa
22        setosa
23        setosa
24        setosa
25        setosa
26        setosa
27        setosa
28        setosa
29        setosa
         ...    
120    virginica
121    virginica
122    virginica
123    virginica
124    virginica
125    virginica
126    virginica
127    virginica
128    virginica
129    virginica
130    virginica
131    virginica
132    virginica
133    virginica
134    virginica
135    virginica
136    virginica
137    virginica
138    virginica
139    virginica
140    virginica
141    virginica
142    virginica
143    virginica
144    virginica
145    virginica
146    virginica
147    virginica

In [8]:
np.all(iris_data.species == dtm.backward(dtm.forward(iris_data.species)))

True

## Complex mappings

But what if the mapping is more than a string <-> int lookup, for example an arbitrary conversion function?

In [9]:
dtm.set_map('sepal_length', forward=lambda x: str(x), backward=lambda x: float(x))

NotImplementedError: callables as mappings not yet implemented
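Until callables are supported, one workaround outside the mapper is to apply the conversion with pandas directly. The sketch below relies on `Series.map`, which accepts both dictionaries and functions. The helper `apply_mapping` is hypothetical and says nothing about how the library may eventually implement this; the sample values are taken from the `sepal_length` column shown earlier.

```python
import pandas as pd


def apply_mapping(series, mapping):
    """Apply a dict lookup or an element-wise callable to a Series."""
    if not (callable(mapping) or isinstance(mapping, dict)):
        raise TypeError("mapping must be a dict or a callable")
    return series.map(mapping)


sepal_length = pd.Series([5.1, 4.9, 4.7], name="sepal_length")

as_text = apply_mapping(sepal_length, str)   # forward: float -> str
restored = apply_mapping(as_text, float)     # backward: str -> float
```

Unlike a dictionary, a callable cannot be inverted automatically, so with this approach the backward function always has to be given explicitly.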