# sklearn identity transformer
This accompanies my blogpost, which you can find [here](), about writing an identity transformer for sklearn's `Pipeline` and `FeatureUnion`. Basically, the identity transformer lets a `FeatureUnion` concatenate a vector to (a transformed version of) itself. That way, a transformation earlier in the pipeline (here, the imputation) only has to be carried out once instead of once per branch.

## Data loading
Just the usual data loading, which I got from the blogpost at [Ultraviolet Analytics](http://www.ultravioletanalytics.com/2014/10/30/kaggle-titanic-competition-part-i-intro/). It reads the training and test sets and combines them into the `df` dataframe.

In [2]:
import pandas as pd
import numpy as np
 
# Read in the training and testing data into Pandas.DataFrame objects
# This assumes files are in the current directory
input_df = pd.read_csv('train.csv', header=0)
submit_df  = pd.read_csv('test.csv',  header=0)
 
# Merge the two DataFrames into one
df = pd.concat([input_df, submit_df])

# Re-number the combined data set so there aren't duplicate indexes
df.reset_index(inplace=True)
 
# Reset_index() generates a new column that we don't want
df.drop('index', axis=1, inplace=True)
 
# Restore the column order of the training set, which pd.concat may have changed
# (reindex_axis was removed from pandas; reindex(columns=...) does the same job)
df = df.reindex(columns=input_df.columns)
 
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1.0,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
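The merge-and-renumber steps above can be sketched more compactly. This is a minimal illustration on two tiny stand-in frames (the `train` and `test` data here are made up for the example, not read from the Titanic files), assuming a pandas version where `pd.concat` accepts `ignore_index=True`:

```python
import pandas as pd

# Hypothetical stand-ins for train.csv and test.csv
train = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1], "Cabin": [None, "C85"]})
test = pd.DataFrame({"PassengerId": [3], "Cabin": ["E46"]})

# ignore_index=True renumbers the rows 0..n-1, so no reset_index/drop is needed
combined = pd.concat([train, test], ignore_index=True)

# Put the columns back in the training-set order
combined = combined.reindex(columns=train.columns)

print(combined)
```

Rows that come from the test set have no `Survived` value, so that column is filled with `NaN` and becomes a float column, which is why `Survived` shows up as `0.0` and `1.0` in the output above.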


## Transformer classes
Here are the three classes that we will be using as transformers for the `Cabin` column:
* __Imputer transformer:__ This imputes the missing values in the Cabin column, replacing them with "U0"
* __Factorizer transformer:__ This takes the first letter of each cabin and factorizes the n distinct letters into integer codes running from 0 to n-1 (the factorising starts at 0)
* __Identity transformer:__ The identity transformation that will allow us to concatenate a vector to (a transformation of) itself.
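To make the factorizer step concrete, here is a small sketch of what `pd.factorize` returns. The `letters` series is invented for the example; in the real pipeline the letters come from the imputed `Cabin` column.

```python
import pandas as pd

# Hypothetical cabin letters, with "U" standing in for the imputed "U0" rows
letters = pd.Series(["U", "C", "U", "E", "C"])

# factorize returns one integer code per row plus the distinct values,
# numbered in order of first appearance
codes, uniques = pd.factorize(letters)

print(codes)          # one code per row
print(list(uniques))  # the distinct letters
```

With three distinct letters, the codes are 0, 1 and 2, so the largest code is always one less than the number of levels.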

In [8]:
from sklearn.base import BaseEstimator, TransformerMixin

class CabinImputer(BaseEstimator, TransformerMixin):
    
    def __init__(self):
        pass
    
    def fit(self, df, y=None):
        # Nothing to learn from the data, so simply return the transformer itself
        return self
    
    def transform(self, input_array, y=None):
        result = pd.DataFrame(input_array).fillna('U0')
        return result
    
class CabinLetterFactorized(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    
    def fit(self, input_array, y=None):
        return self
    
    def transform(self, input_array, y=None):
        single_cabin_letter = input_array[0].map( lambda c : c[0] )
        result = pd.factorize(single_cabin_letter)[0]
        return one_to_two(result)
    
class IdentityTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    
    def fit(self, input_array, y=None):
        return self
    
    def transform(self, input_array, y=None):
        # Identity: hand the input back unchanged
        return input_array
    
# Helper function, not really relevant to what I'm trying to show though
# Converts a 1D numpy array to a 2D one that can be concatenated properly
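def one_to_two(data):
    if isinstance(data, pd.Series):
        # Series -> column vector of shape (n, 1)
        return data.values.reshape(data.values.shape[0], -1)
    elif isinstance(data, np.ndarray):
        # 1D array -> column vector of shape (n, 1)
        return data.reshape(data.shape[0], -1)
    else:
        # Fail loudly instead of silently returning None
        raise TypeError("one_to_two expects a pandas Series or a numpy array")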
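The reshape matters because the outputs of the two branches get stacked side by side, and that only works when both are 2D with the same number of rows. A minimal sketch of the idea, using plain `np.hstack` on made-up values rather than the actual `FeatureUnion` internals:

```python
import numpy as np

# Hypothetical outputs of the two branches
labels = np.array([["U0"], ["C85"], ["U0"]], dtype=object)  # already 2D, shape (3, 1)
codes = np.array([0, 1, 0])                                 # 1D, shape (3,)

# Same reshape as one_to_two: keep the row count, let numpy infer the column count
column = codes.reshape(codes.shape[0], -1)                  # shape (3, 1)

# Now the two pieces can be placed next to each other
stacked = np.hstack([labels, column])
print(stacked.shape)
```

Because one piece holds strings and the other integers, the stacked array ends up with `dtype=object`.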

## Using `Pipeline` and `FeatureUnion`
Now that we have our transformers defined above, it's time to put the pipeline to work.

In [7]:
from sklearn.pipeline import Pipeline, FeatureUnion

pipe = Pipeline([
    ('imputer', CabinImputer()),
    ('unioniser', FeatureUnion([
        ('identity', IdentityTransformer()),
        ('factorizer', CabinLetterFactorized()),
    ]))
])

cabin_vector = one_to_two(df['Cabin'])
pipe.fit_transform(cabin_vector)

array([['U0', 0],
       ['C85', 1],
       ['U0', 0],
       ..., 
       ['U0', 0],
       ['U0', 0],
       ['U0', 0]], dtype=object)

That's it! You can play around with the configuration a bit to see what effect it has on the output. Don't mind my pandas DataFrame shenanigans too much; I'm still getting used to switching between DataFrames and numpy arrays, and how to properly get that to work.
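As a side note, the same idea can be sketched without a hand-written class. scikit-learn's `FunctionTransformer` acts as an identity when no function is passed, so it can stand in for `IdentityTransformer`. The numeric data and the `squared` branch below are just for illustration and are not part of the Titanic example:

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

# Hypothetical single-column input
X = np.array([[1.0], [2.0], [3.0]])

union = FeatureUnion([
    ("identity", FunctionTransformer()),          # func=None passes X through unchanged
    ("squared", FunctionTransformer(np.square)),  # a transformed copy of the same column
])

# Each row now holds the original value next to its square
out = union.fit_transform(X)
print(out)
```

Writing the class by hand, as above, is still a nice way to see what a transformer needs: a `fit` that returns `self` and a `transform` that returns the data.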