# SetColumnDtype
This notebook shows the functionality of the SetColumnDtype class. This transformer changes the dtype of the specified columns to a given dtype.

In [1]:
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline

In [2]:
import tubular
from tubular.mapping import MappingTransformer
from tubular.misc import SetColumnDtype

In [3]:
tubular.__version__

'0.3.3'

## Example 1

### Create sample data

In [4]:
sample_data = pd.DataFrame(
    {
        "col1": ["a", "b", "c"],
        "col2": [1, 2, 3],
        "col3": [True, False, True],
        "col4": [0.1, 0.2, 0.3],
        "col5": ["a", "b", "c"],
    }
)
sample_data.shape

(3, 5)

In [5]:
sample_data

  col1  col2   col3  col4 col5
0    a     1   True   0.1    a
1    b     2  False   0.2    b
2    c     3   True   0.3    c


In [6]:
sample_data.dtypes

col1     object
col2      int64
col3       bool
col4    float64
col5     object
dtype: object

## Set dtypes using pipeline

### Initialising SetColumnDtype

Create two transformers in a pipeline. The first changes the dtype of col1 and col5 to string, and the second changes col2 to float. Note that the dtype can be given either as a dtype object or as a string that pandas.api.types.pandas_dtype can interpret.

In [7]:
set_dtypes_pipeline = Pipeline(
    [
        ("dtype_string", SetColumnDtype(["col1", "col5"], dtype="string")),
        ("dtype_float", SetColumnDtype("col2", dtype=float)),
    ]
)
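The note above about accepted dtype specifications can be checked directly with pandas. This is a minimal sketch that calls `pandas.api.types.pandas_dtype` on its own. It does not use tubular, and the list of specifications is only an illustrative selection.

```python
import numpy as np
import pandas as pd
from pandas.api.types import pandas_dtype

# A few illustrative specifications: strings, a Python type, and a numpy dtype object.
specs = ["string", float, "float64", np.dtype("int64")]

# pandas_dtype resolves each specification to a dtype object.
resolved = [pandas_dtype(spec) for spec in specs]
for spec, dtype in zip(specs, resolved):
    print(f"{spec!r} -> {dtype!r}")

# A string that pandas cannot interpret as a dtype raises TypeError.
try:
    pandas_dtype("not_a_dtype")
    invalid_rejected = False
except TypeError:
    invalid_rejected = True
print("invalid specification rejected:", invalid_rejected)
```

Any specification that resolves here should be a reasonable candidate for the `dtype` argument, given the note above.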

### SetColumnDtype transform

In [8]:
sdp = set_dtypes_pipeline.transform(sample_data)

In [9]:
sdp.dtypes

col1     string
col2    float64
col3       bool
col4    float64
col5     string
dtype: object
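For comparison, the same resulting dtypes can be obtained in plain pandas with `DataFrame.astype`. This is only a sketch of an equivalent result, assuming nothing about how SetColumnDtype is implemented internally.

```python
import pandas as pd

# Same sample data as in the example above.
sample_data = pd.DataFrame(
    {
        "col1": ["a", "b", "c"],
        "col2": [1, 2, 3],
        "col3": [True, False, True],
        "col4": [0.1, 0.2, 0.3],
        "col5": ["a", "b", "c"],
    }
)

# astype accepts a column -> dtype mapping and returns a new DataFrame;
# columns not listed keep their original dtype.
converted = sample_data.astype({"col1": "string", "col5": "string", "col2": float})
print(converted.dtypes)
```

The transformer form is useful when the conversion has to sit inside a scikit-learn Pipeline next to other steps, while `astype` is enough for a one-off conversion.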

## Example 2
This example shows how to handle the 'O' (object) dtype, which occurs when the mapping has missing values (some values in the column are not in the mapping dictionary).

In [10]:
data = pd.DataFrame([[1, "a"], [2, "b"]], columns=["numbers", "letters"])

In [11]:
column_mappings_5 = {
    "numbers": {1: "zzz", 2: "yyy", 3: "www"},
    "letters": {"a": "albatross"},
}

In [12]:
map_data = MappingTransformer(mappings=column_mappings_5, copy=True, verbose=False)

In [13]:
data["numbers"].dtype

dtype('int64')

In [14]:
data_transformed = map_data.transform(data)

In [15]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   numbers  2 non-null      int64 
 1   letters  2 non-null      object
dtypes: int64(1), object(1)
memory usage: 160.0+ bytes


The letters column has the 'object' dtype. We can use the SetColumnDtype transformer to convert it, along with numbers, to the string dtype.

In [17]:
set_dtypes = SetColumnDtype(["letters", "numbers"], "string")

In [18]:
data_types_changed = set_dtypes.transform(data_transformed)

In [19]:
data_types_changed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   numbers  2 non-null      string
 1   letters  2 non-null      string
dtypes: string(2)
memory usage: 160.0 bytes
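The flow in Example 2 can be sketched in plain pandas. The helper `apply_mapping` is hypothetical and is not part of tubular. It assumes that values missing from the mapping are left unchanged, which is consistent with the non-null counts shown above but is not taken from the MappingTransformer source.

```python
import pandas as pd

data = pd.DataFrame([[1, "a"], [2, "b"]], columns=["numbers", "letters"])

column_mappings = {
    "numbers": {1: "zzz", 2: "yyy", 3: "www"},
    "letters": {"a": "albatross"},
}


def apply_mapping(series, mapping):
    """Hypothetical helper: map values, keeping any value absent from the mapping."""
    return series.map(lambda value: mapping.get(value, value))


# Work on a copy so the original frame keeps its dtypes.
mapped = data.copy()
for column, mapping in column_mappings.items():
    mapped[column] = apply_mapping(mapped[column], mapping)

# Convert both mapped columns to the pandas string dtype.
converted = mapped.astype({"numbers": "string", "letters": "string"})
print(converted.dtypes)
```

Here "b" has no entry in the letters mapping, so it passes through unchanged before the dtype conversion.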
