# DataFrameMethodTransformer
This notebook shows the functionality in the `DataFrameMethodTransformer` class. This transformer applies a `pd.DataFrame` method to the input `X`. <br>
This generic transformer means that many `pd.DataFrame` methods are available for use within the package without having to directly implement a transformer for that specific function.
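
The idea can be sketched in plain pandas. The helper below is hypothetical and is not tubular's implementation; it only assumes that the method is looked up by name with `getattr` and then called on the selected columns.

```python
import pandas as pd


def apply_dataframe_method(X, columns, pd_method_name, new_column_name, pd_method_kwargs=None):
    """Hypothetical helper: call a pd.DataFrame method by name on a subset of columns."""
    pd_method_kwargs = pd_method_kwargs or {}
    X = X.copy()  # leave the caller's DataFrame untouched
    # Look the method up by name on the selected columns, then call it.
    result = getattr(X[columns], pd_method_name)(**pd_method_kwargs)
    X[new_column_name] = result
    return X


toy = pd.DataFrame({"a": [1.0, 2.0], "b": [10.0, 20.0]})
out = apply_dataframe_method(toy, ["a", "b"], "sum", "a_b_sum", {"axis": 1})
```

Because the method name is just a string, the same helper would cover `sum`, `abs`, `cumsum` and so on without a dedicated function for each.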

In [1]:
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing

In [2]:
import tubular
from tubular.base import DataFrameMethodTransformer

In [3]:
tubular.__version__

'0.3.0'

## Load California housing dataset from sklearn

In [4]:
cali = fetch_california_housing()
cali_df = pd.DataFrame(cali["data"], columns=cali["feature_names"])

In [5]:
cali_df.shape

(20640, 8)

In [6]:
cali_df.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25


## Simple usage

### Initialising DataFrameMethodTransformer

The user specifies the following: <br>
- `new_column_name`: the name or names of the columns that the output of the `pd.DataFrame` method is assigned to <br>
- `pd_method_name`: the name of the `pd.DataFrame` method to be called <br>
- `columns`: the columns of the `DataFrame` passed to the `transform` method that are to be transformed <br>
- `pd_method_kwargs`: a dictionary of keyword arguments passed to the `pd.DataFrame` method when it is called; it can be omitted when the method needs none, as in the absolute value example below <br>

Note that for `DataFrameMethodTransformer` the `columns` argument is mandatory. This differs from most of the other transformers in the package, which use all columns in the data if `columns` is not supplied. The reason is that it is very unlikely a user will want to run this transformer on all columns.

In [7]:
sum_transformer = DataFrameMethodTransformer(
    columns=["AveRooms", "AveBedrms"],
    pd_method_name="sum",
    new_column_name="AveRooms_AveBedrms_sum",
    pd_method_kwargs={"axis": 1},
)

### DataFrameMethodTransformer fit
There is no fit method for the DataFrameMethodTransformer as the methods that it can run do not 'learn' anything from the data.

### DataFrameMethodTransformer transform
When running `transform` with this configuration, a new column `AveRooms_AveBedrms_sum` is added to the input `X`, which is the row-wise sum of `AveRooms` and `AveBedrms`.

In [8]:
cali_df_2 = sum_transformer.transform(cali_df)

In [9]:
cali_df_2[["AveRooms", "AveBedrms", "AveRooms_AveBedrms_sum"]].head()

Unnamed: 0,AveRooms,AveBedrms,AveRooms_AveBedrms_sum
0,6.984127,1.02381,8.007937
1,6.238137,0.97188,7.210018
2,8.288136,1.073446,9.361582
3,5.817352,1.073059,6.890411
4,6.281853,1.081081,7.362934
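
The `axis=1` keyword is what makes this a per-row sum. As a small check in plain pandas (no tubular involved), the sketch below uses the first three rows of the two input columns shown above and compares `axis=1` with the default `axis=0`.

```python
import pandas as pd

# First three rows of AveRooms and AveBedrms from the output above.
sample = pd.DataFrame(
    {
        "AveRooms": [6.984127, 6.238137, 8.288136],
        "AveBedrms": [1.02381, 0.97188, 1.073446],
    }
)

# axis=1 adds across the columns within each row: one value per row.
row_sums = sample[["AveRooms", "AveBedrms"]].sum(axis=1)

# The default axis=0 adds down each column instead: one value per column.
column_totals = sample[["AveRooms", "AveBedrms"]].sum()
```

Only `row_sums` has the same length as the input, so only the `axis=1` result can be stored as a new column alongside the original rows.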


## Multiple column assignment

It is possible to assign the output of the `pd.DataFrame` method to multiple columns by passing a list of column names to `new_column_name`.

In [10]:
div_transformer = DataFrameMethodTransformer(
    columns=["AveRooms", "AveBedrms"],
    pd_method_name="div",
    new_column_name=["AveRooms_half", "AveBedrms_half"],
    pd_method_kwargs={"other": 2},
)

In [11]:
cali_df_3 = div_transformer.transform(cali_df)

In [12]:
cali_df_3[["AveRooms", "AveBedrms", "AveRooms_half", "AveBedrms_half"]].head()

Unnamed: 0,AveRooms,AveBedrms,AveRooms_half,AveBedrms_half
0,6.984127,1.02381,3.492063,0.511905
1,6.238137,0.97188,3.119069,0.48594
2,8.288136,1.073446,4.144068,0.536723
3,5.817352,1.073059,2.908676,0.53653
4,6.281853,1.081081,3.140927,0.540541
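
A minimal sketch of the same operation in plain pandas follows. It assumes the new names are matched to the method's output columns by position, which is consistent with the output above but is an assumption about the transformer, not its actual code.

```python
import pandas as pd

sample = pd.DataFrame({"AveRooms": [6.0, 8.0], "AveBedrms": [1.0, 2.0]})
new_names = ["AveRooms_half", "AveBedrms_half"]

# div returns a DataFrame with one output column per input column.
halved = sample[["AveRooms", "AveBedrms"]].div(2)

# Pair each output column with its new name by position.
for new_name, old_name in zip(new_names, halved.columns):
    sample[new_name] = halved[old_name]
```

The number of names in `new_column_name` therefore needs to match the number of columns the method returns.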


## Other examples 

Below are other examples of using the `DataFrameMethodTransformer` transformer with existing `pd.DataFrame` numerical methods. <br> 
It is possible to use any [pd.DataFrame method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.abs.html), although some may not work correctly. The transformer only checks that the supplied method is available from the `pd.DataFrame` class. There are many combinations of number of output columns, input columns, method and method keyword arguments that it cannot check.
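
To illustrate why an availability check alone is not enough, here is a rough sketch. The function `is_dataframe_method` is hypothetical and only approximates the kind of check described above; it is not taken from tubular.

```python
import pandas as pd


def is_dataframe_method(name):
    """Return True if pd.DataFrame exposes a callable attribute called `name`."""
    return callable(getattr(pd.DataFrame, name, None))


# "abs" passes the availability check...
abs_is_available = is_dataframe_method("abs")

# ...but abs takes no keyword arguments, so a bad kwarg only fails when called.
sample = pd.DataFrame({"a": [-1.0, 2.0]})
try:
    sample.abs(other=2)
    call_failed = False
except TypeError:
    call_failed = True
```

A mismatch like this would only surface when `transform` is run, so it is worth testing a new configuration on a few rows first.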

### Cumulative sum

In [13]:
cumsum_transformer = DataFrameMethodTransformer(
    columns=["AveRooms", "AveBedrms", "AveOccup"],
    pd_method_name="cumsum",
    new_column_name=[
        "AveRooms_duplicate",
        "AveRooms_AveBedrms",
        "AveRooms_AveBedrms_AveOccup",
    ],
    pd_method_kwargs={"axis": 1},
)

In [14]:
cali_df_4 = cumsum_transformer.transform(cali_df)

In [15]:
cali_df_4[
    [
        "AveRooms",
        "AveBedrms",
        "AveOccup",
        "AveRooms_duplicate",
        "AveRooms_AveBedrms",
        "AveRooms_AveBedrms_AveOccup",
    ]
].head()

Unnamed: 0,AveRooms,AveBedrms,AveOccup,AveRooms_duplicate,AveRooms_AveBedrms,AveRooms_AveBedrms_AveOccup
0,6.984127,1.02381,2.555556,6.984127,8.007937,10.563492
1,6.238137,0.97188,2.109842,6.238137,7.210018,9.319859
2,8.288136,1.073446,2.80226,8.288136,9.361582,12.163842
3,5.817352,1.073059,2.547945,5.817352,6.890411,9.438356
4,6.281853,1.081081,2.181467,6.281853,7.362934,9.544402
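
With `axis=1`, `cumsum` keeps a running total from left to right across the selected columns, so the first output column is just a copy of the first input. That is why it is named `AveRooms_duplicate` above. The plain pandas sketch below reproduces this using the first row of the table.

```python
import pandas as pd

# First row of the three input columns from the output above.
sample = pd.DataFrame(
    {"AveRooms": [6.984127], "AveBedrms": [1.02381], "AveOccup": [2.555556]}
)

# Running total across the columns of each row, in the order the columns are listed.
running = sample[["AveRooms", "AveBedrms", "AveOccup"]].cumsum(axis=1)
```

The result keeps the input column names, so the names passed in `new_column_name` are what give each running total a meaningful label.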


### Modulo

In [16]:
mod_transformer = DataFrameMethodTransformer(
    columns=["AveOccup"],
    pd_method_name="mod",
    new_column_name=["AveOccup_2"],
    pd_method_kwargs={"other": 2},
)

In [17]:
cali_df_5 = mod_transformer.transform(cali_df)

In [18]:
cali_df_5[["AveOccup", "AveOccup_2"]].head()

Unnamed: 0,AveOccup,AveOccup_2
0,2.555556,0.555556
1,2.109842,0.109842
2,2.80226,0.80226
3,2.547945,0.547945
4,2.181467,0.181467


### Less than

In [19]:
lt_transformer = DataFrameMethodTransformer(
    columns=["AveOccup"],
    pd_method_name="lt",
    new_column_name=["AveOccup_lt_3"],
    pd_method_kwargs={"other": 3},
)

In [20]:
cali_df_6 = lt_transformer.transform(cali_df)

In [21]:
cali_df_6[["AveOccup", "AveOccup_lt_3"]].head()

Unnamed: 0,AveOccup,AveOccup_lt_3
0,2.555556,True
1,2.109842,True
2,2.80226,True
3,2.547945,True
4,2.181467,True


### Absolute value

In [22]:
abs_transformer = DataFrameMethodTransformer(
    columns=["Longitude"], pd_method_name="abs", new_column_name=["Longitude_abs"]
)

In [23]:
cali_df_7 = abs_transformer.transform(cali_df)

In [24]:
cali_df_7[["Longitude", "Longitude_abs"]].head()

Unnamed: 0,Longitude,Longitude_abs
0,-122.23,122.23
1,-122.22,122.22
2,-122.24,122.24
3,-122.25,122.25
4,-122.25,122.25


### Power

In [25]:
power_transformer = DataFrameMethodTransformer(
    columns=["AveOccup"],
    pd_method_name="pow",
    new_column_name=["AveOccup_cubed"],
    pd_method_kwargs={"other": 3},
)

In [26]:
cali_df_8 = power_transformer.transform(cali_df)

In [27]:
cali_df_8[["AveOccup", "AveOccup_cubed"]].head()

Unnamed: 0,AveOccup,AveOccup_cubed
0,2.555556,16.689986
1,2.109842,9.391819
2,2.80226,22.005195
3,2.547945,16.541323
4,2.181467,10.381164


### Type setting

In [28]:
type_transformer = DataFrameMethodTransformer(
    columns=["AveRooms", "AveBedrms"],
    pd_method_name="astype",
    new_column_name=["AveRooms_str", "AveBedrms_str"],
    pd_method_kwargs={"dtype": "str"},
)

In [29]:
cali_df_9 = type_transformer.transform(cali_df)

In [30]:
cali_df_9[["AveRooms", "AveBedrms", "AveRooms_str", "AveBedrms_str"]].dtypes

AveRooms         float64
AveBedrms        float64
AveRooms_str      object
AveBedrms_str     object
dtype: object

## Dropping the original columns

The columns specified to be transformed can be dropped by using the `drop_original` argument when initialising the `DataFrameMethodTransformer` object. The `drop_original` argument is `False` by default.

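In [31]:
# Same configuration as sum_transformer above, but dropping the input columns
sum_transformer_and_drop = DataFrameMethodTransformer(
    columns=["AveRooms", "AveBedrms"],
    pd_method_name="sum",
    new_column_name="AveRooms_AveBedrms_sum",
    pd_method_kwargs={"axis": 1},
    drop_original=True,
)

In [32]:
cali_df_10 = sum_transformer_and_drop.transform(cali_df)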

In [33]:
cali_df_10.columns

Index(['MedInc', 'HouseAge', 'Population', 'AveOccup', 'Latitude', 'Longitude',
       'AveRooms_AveBedrms_sum'],
      dtype='object')
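
For comparison, the same result can be sketched in plain pandas. This assumes dropping the originals amounts to removing the input columns once the new column has been added, which matches the column order shown above.

```python
import pandas as pd

sample = pd.DataFrame(
    {"AveRooms": [6.0, 8.0], "AveBedrms": [1.0, 2.0], "AveOccup": [2.5, 3.0]}
)
input_columns = ["AveRooms", "AveBedrms"]

result = sample.copy()
result["AveRooms_AveBedrms_sum"] = result[input_columns].sum(axis=1)

# Rough equivalent of drop_original=True: remove the inputs after adding the new column.
result = result.drop(columns=input_columns)
```

Columns that were not passed to the transformer, such as `AveOccup` here, are left in place.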