
# Custom Transformers

### Introduction

We've now seen how to use `sklearn_pandas`, transformers, and pipelines to perform feature engineering on our dataset.  Sometimes, however, we'll need transformations beyond those provided out of the box.  In this lesson, we'll see how to write our own.

### Our transformations

Let's load data that describes different Airbnb listings in Germany.

In [0]:
import pandas as pd 
listings_url = "https://raw.githubusercontent.com/jigsawlabs-student/pipelines-and-transformers/master/listings_summary.csv"
# listings_shorter_url = "https://raw.githubusercontent.com/jigsawlabs-student/pipelines-and-transformers/master/listings_five_k.csv"
listings_df = pd.read_csv(listings_url)

Let's take a look at some of the columns.

In [0]:
prices_df = listings_df[['extra_people', 'price', 'weekly_price', 'monthly_price', 'security_deposit', 'cleaning_fee']]
prices_df[:3]

|   | extra_people | price | weekly_price | monthly_price | security_deposit | cleaning_fee |
|---|---|---|---|---|---|---|
| 0 | $28.00 | $60.00 | | | $200.00 | $30.00 |
| 1 | $0.00 | $17.00 | | | $0.00 | $0.00 |
| 2 | $20.00 | $90.00 | $520.00 | $1,900.00 | $200.00 | $50.00 |


We can see that each column has information that we would likely want to include in our model, but it is currently not numeric.

In [0]:
prices_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22552 entries, 0 to 22551
Data columns (total 6 columns):
extra_people        22552 non-null object
price               22552 non-null object
weekly_price        3681 non-null object
monthly_price       2659 non-null object
security_deposit    13191 non-null object
cleaning_fee        15406 non-null object
dtypes: object(6)
memory usage: 1.0+ MB


Let's get started coercing the data.

### Coercing the data

Let's start with the extra people column.  And let's just convert it without even worrying about pipelines or transformers.

In [0]:
extra_people_prices = prices_df['extra_people']

Let's select the first price from the column.

In [0]:
first_price = extra_people_prices[0]
first_price

'$28.00'

And convert it to a number.

In [0]:
first_price[1:]
# '28.00'

pd.to_numeric(first_price[1:])

28.0

### Custom Transformers

Now, to use this in a pipeline, we first wrap the procedure in a function.

In [0]:
def price_to_number(price):
    return pd.to_numeric(price[1:])

And we can use this function by passing it into a FunctionTransformer, which we use in our DataFrameMapper.  Let's see how it works.

In [0]:
from sklearn_pandas import DataFrameMapper, FunctionTransformer
mapper = DataFrameMapper([
    ('extra_people', FunctionTransformer(price_to_number))
], df_out = True)

In [0]:
transformed_prices = mapper.fit_transform(prices_df)
transformed_prices[:3]

|   | extra_people |
|---|---|
| 0 | 28.0 |
| 1 | 0.0 |
| 2 | 20.0 |
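
To build some intuition for what the mapper does with our function, here is a minimal sketch that applies `price_to_number` to each value of a small sample column using `np.vectorize`. It assumes the `FunctionTransformer` from `sklearn_pandas` applies the function element by element in a similar way. The sample values are copied from the table at the top of the lesson, and `sample_prices` is a name introduced only for this illustration.

```python
import numpy as np
import pandas as pd

def price_to_number(price):
    # Drop the leading "$" and parse the remainder as a number.
    return pd.to_numeric(price[1:])

# Hypothetical stand-in for the first three values of the 'extra_people' column.
sample_prices = np.array(['$28.00', '$0.00', '$20.00'], dtype=object)

# np.vectorize calls price_to_number once per element
# (assumed here to mirror what the transformer does internally).
converted = np.vectorize(price_to_number)(sample_prices)
print(converted)
```

Each call receives one string such as `'$28.00'`, which is why slicing with `[1:]` works inside the function.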


### Common Errors with Function Transformers

One thing to pay attention to with function transformers is that our function takes in a single value in the column.

In [0]:
def price_to_number(price):
    return pd.to_numeric(price[1:])

So notice that it is not the *entire column* that is passed to our function, but rather a single value from that column.

> If we instead write a function that expects the entire column, like the one below, we will get an error.

In [0]:
def price_col_to_num(price_col):
    # Expects a whole pandas Series, which is not what the transformer passes in
    return pd.to_numeric(price_col.str[1:])
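
A minimal sketch of why a column-style function fails, assuming the function is handed one string at a time: a plain Python string has no `.str` accessor, so the Series-style code raises an `AttributeError`. Here `single_value` is a hypothetical stand-in for one cell of the column.

```python
import pandas as pd

def price_col_to_num(price_col):
    # Written as if price_col were a whole pandas Series.
    return pd.to_numeric(price_col.str[1:])

single_value = '$28.00'  # hypothetical stand-in for one cell of the column

try:
    price_col_to_num(single_value)
    error_name = None
except AttributeError as err:
    # A plain str has no .str accessor, so the Series-style call fails.
    error_name = type(err).__name__

print(error_name)
```

The same function works when called directly on a Series, which is what makes this mistake easy to miss while testing outside the mapper.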

### Working with NA Columns

Another tricky component is writing transformers that handle missing values.  For example, let's take a look at the weekly price column.

In [0]:
weekly_price = prices_df['weekly_price']

In [0]:
weekly_price[:3]

0        NaN
1        NaN
2    $520.00
Name: weekly_price, dtype: object

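Let's try to use our `price_to_number` function.  Because larger prices contain a comma, as in `$1,137.00`, we first update the function to strip commas as well.

In [0]:
num = "$1,137.00"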

In [0]:
def price_to_number(price):
    return pd.to_numeric(price[1:].replace(',', ''))

In [0]:
# price_to_number(num)
# num.replace(',', '')
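
As a quick check of the comma handling, here is a small sketch that runs the updated function on `$1,900.00`, a monthly price taken from the table at the top of the lesson. The name `sample_price` is introduced only for this example.

```python
import pandas as pd

def price_to_number(price):
    # Drop the leading "$", remove thousands separators, then parse.
    return pd.to_numeric(price[1:].replace(',', ''))

sample_price = '$1,900.00'  # value copied from the monthly_price column shown earlier
result = price_to_number(sample_price)
print(result)
```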

In [0]:
mapper = DataFrameMapper([
    (['weekly_price'], FunctionTransformer(price_to_number))
], df_out = True)

In [0]:
# mapper.fit_transform(prices_df)
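
If we ran the commented-out cell, it would most likely fail on the first missing value. Here is a minimal sketch of the cause, assuming each missing cell reaches the function as a float `NaN`: slicing with `[1:]` only works on strings. The name `missing_value` is hypothetical.

```python
import numpy as np
import pandas as pd

def price_to_number(price):
    # Drop the leading "$", remove thousands separators, then parse.
    return pd.to_numeric(price[1:].replace(',', ''))

missing_value = np.nan  # how pandas represents the empty weekly_price cells

try:
    price_to_number(missing_value)
    error_name = None
except TypeError as err:
    # A float cannot be sliced, so price[1:] raises before any parsing happens.
    error_name = type(err).__name__

print(error_name)
```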

A problem we are running into is with our NaN values.  One fix is to add a `SimpleImputer` to replace the NaNs with an empty string or equivalent.  Another is to handle NaNs directly in the function.

Let's first replace our NaNs with empty strings.

In [0]:
from sklearn.impute import SimpleImputer
mapper = DataFrameMapper([
    (['weekly_price'], SimpleImputer(strategy = 'constant', fill_value = ''))
], df_out = True)

In [0]:
transformed_prices = mapper.fit_transform(prices_df)
transformed_prices[:3]

|   | weekly_price |
|---|---|
| 0 | |
| 1 | |
| 2 | $520.00 |


And from here, we can add in our function for converting our price.

In [0]:
from sklearn.impute import SimpleImputer
mapper = DataFrameMapper([
    (['weekly_price'], [SimpleImputer(strategy = 'constant', fill_value = ''), 
                        FunctionTransformer(price_to_number)])
], df_out = True)

In [0]:
transformed_to_num = mapper.fit_transform(prices_df)
transformed_to_num[:3]

|   | weekly_price |
|---|---|
| 0 | |
| 1 | |
| 2 | 520.0 |


In [0]:
transformed_to_num.dtypes

weekly_price    float64
dtype: object

And now we have our column properly converted into a number.
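
The other fix mentioned earlier, handling missing values inside the function itself, could look like the following sketch. The name `price_to_number_or_nan` is hypothetical, and the sketch assumes missing cells arrive as float `NaN` (or, after imputation, as empty strings).

```python
import numpy as np
import pandas as pd

def price_to_number_or_nan(price):
    # Pass missing values through as NaN instead of trying to slice them.
    if not isinstance(price, str) or price == '':
        return np.nan
    # Drop the leading "$", remove thousands separators, then parse.
    return float(pd.to_numeric(price[1:].replace(',', '')))

# First three weekly_price values shown earlier in the lesson.
sample_prices = [np.nan, np.nan, '$520.00']
converted = [price_to_number_or_nan(price) for price in sample_prices]
print(converted)
```

With a function like this, the imputer step would no longer be needed, at the cost of moving the missing-value logic into the function.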

### Summary

In this lesson, we saw how to write custom transformers to use with our `DataFrameMapper`.  The key to using the transformer is understanding that the function handles a single value in the column.  We can first test our function on a single value, and then pass it into our `FunctionTransformer` with `FunctionTransformer(function_name)`.