# SM08: Custom Transformers

### Classes

When transforming data, the predefined functions provided by libraries such as sklearn or category_encoders often cover only part of the transformation that needs to happen. The rest requires custom transformers that can be incorporated into an sklearn pipeline (more on the sklearn pipeline later).

For information on how to create a custom transformer, see the following tutorials:

- [ML Data Pipelines with Custom Transformers in Python](https://towardsdatascience.com/custom-transformers-and-ml-data-pipelines-with-python-20ea2a7adb65)
- [Creating custom scikit-learn Transformers](https://www.andrewvillazon.com/custom-scikit-learn-transformers/)
- [Pipelines & Custom Transformers in scikit-learn: The step-by-step guide (with Python code)](https://towardsdatascience.com/pipelines-custom-transformers-in-scikit-learn-the-step-by-step-guide-with-python-code-4a7d9b068156)

In [None]:
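import category_encoders as ce
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class OneHotTransformer(TransformerMixin, BaseEstimator):
    def __init__(self, feature_names):
        # Keep the attribute name equal to the argument name so get_params() and clone() work
        self.feature_names = feature_names

    def fit(self, ori_df, y=None):
        # Fit the encoder once, on training data, skipping columns that are entirely missing
        df = ori_df[list(self.feature_names)]
        self.col_names_ = list(df.dropna(axis=1, how='all').columns)
        self.encoder_ = ce.OneHotEncoder(cols=self.col_names_, use_cat_names=True, handle_missing='return_nan')
        self.encoder_.fit(df[self.col_names_])
        return self

    def transform(self, ori_df, y=None):
        print('Running OneHotTransformer')
        ce_one_hot = pd.DataFrame(self.encoder_.transform(ori_df[self.col_names_]), index=ori_df.index)
        # astype(int) raises on NaN, so this assumes the encoded columns have no missing values
        ce_one_hot = ce_one_hot.astype(int)
        df = ori_df.drop(list(self.feature_names), axis=1).merge(ce_one_hot, left_index=True, right_index=True, how='outer')
        return df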
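
The key property of a custom transformer is that `fit` learns state from the training data and `transform` only applies that state, so new data is handled the same way as training data. The following is a minimal sketch of that contract, not part of the original script: `RareCategoryGrouper`, its `min_count` parameter, and the toy data are hypothetical names invented for illustration.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class RareCategoryGrouper(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: categories seen fewer than min_count times in fit become 'other'."""

    def __init__(self, column, min_count=2):
        self.column = column
        self.min_count = min_count

    def fit(self, X, y=None):
        counts = X[self.column].value_counts()
        # Learned state carries a trailing underscore by sklearn convention
        self.frequent_ = set(counts[counts >= self.min_count].index)
        return self

    def transform(self, X, y=None):
        X = X.copy()
        keep = X[self.column].isin(self.frequent_)
        # Anything not learned as frequent during fit is replaced
        X[self.column] = X[self.column].where(keep, 'other')
        return X


train = pd.DataFrame({'color': ['red', 'red', 'blue', 'blue', 'green']})
new = pd.DataFrame({'color': ['red', 'purple']})

grouper = RareCategoryGrouper(column='color').fit(train)
train_out = grouper.transform(train)
new_out = grouper.transform(new)  # 'purple' was never seen in fit
print(new_out)
```

Because the frequent categories are stored in `fit`, a category that only appears at prediction time falls into the same bucket as a rare training category instead of creating a new column or value.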

## Create sklearn pipeline

It might seem silly to use an sklearn pipeline when we're already creating a SageMaker pipeline. However, these pipelines do different things. 

The SageMaker pipeline controls how the data moves through the workflow, from data pull to transformation to training and evaluation to deployment.

The sklearn pipeline strings together specific transformers and estimators so that the same data transformations can be replayed exactly. It is particularly useful when incorporated into a `preprocessing.py` script because the fitted pipeline can be serialized with joblib. Loading that file at prediction time applies the same transformations used in training, which makes it much easier to ensure the same code runs in both places.

For more information on sklearn pipelines, see the following:

- [6.1. Pipelines and composite estimators](https://scikit-learn.org/stable/modules/compose.html)
- [sklearn.pipeline.Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)

In [None]:
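from sklearn.pipeline import Pipeline

# cat_cols is assumed to be a dict keyed by categorical column name
preprocessor = Pipeline([
    ('onehot', OneHotTransformer(list(cat_cols.keys())))
])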
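
To make the "fit once, apply everywhere" idea concrete, here is a small sketch that chains two built-in sklearn transformers. The step names and toy arrays are made up for illustration; the point is that `fit_transform` learns the statistics on training data and `transform` reuses them on new data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Each step is a (name, transformer) pair; steps run in order
pipe = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', StandardScaler()),
])

train = np.array([[1.0], [np.nan], [3.0]])
new = np.array([[2.0]])

train_out = pipe.fit_transform(train)  # learns the mean for imputing and scaling
new_out = pipe.transform(new)          # reuses the learned values, no refitting

# The fitted steps stay accessible by name
learned_mean = pipe.named_steps['impute'].statistics_[0]
print(learned_mean, new_out)
```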

## Run script

The final step is to run the code. When using the script as a transformer, estimator, etc., put the code to run under `if __name__ == '__main__':`. Everything inside this block executes when the script is run on the EC2 instance, but not when the file is imported from other code.
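
As a rough illustration of the guard, the sketch below keeps reusable logic in functions and puts only the entry point under `if __name__ == '__main__':`. The `clean` and `main` functions are hypothetical stand-ins, not part of the original script.

```python
def clean(values):
    """Hypothetical helper: strip whitespace and drop empty strings."""
    return [v.strip() for v in values if v.strip()]


def main():
    print(clean([' a ', '', 'b']))


# Runs only when the file is executed as a script, not when it is imported
if __name__ == '__main__':
    main()
```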

### Input/Output

The `input_path` and `output_path` variables are unique to working with a SageMaker pipeline. By default SageMaker uses `input_path = '/opt/ml/processing/input'` and `output_path = '/opt/ml/processing/output'`. All this code really does is specify where data and files are located.

In the pipeline step (covered next), the input and output locations are defined in the step. It looks something like this:

```python
inputs=[
    ProcessingInput(source=input_data, destination="/opt/ml/processing/input"),
],
outputs=[
    ProcessingOutput(
        output_name="clean",
        source="/opt/ml/processing/output",
        destination=Join(
            on="/",
            values=[
                "s3://{}".format(bucket),
                prefix,
                "processed",
                "clean",
            ],
        ),
    ),
],
```

This makes it easy to change which files and folders are loaded to the EC2 instance that actually executes the Python script, as well as where the output is saved. Think of this process as an automated version of having to add code that saves things to S3.
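
For intuition about the output destination, `Join` with `on="/"` is resolved when the pipeline runs. Assuming all the values are plain strings, the result should match what ordinary `str.join` builds. The `bucket` and `prefix` values below are hypothetical placeholders.

```python
bucket = 'my-bucket'   # hypothetical bucket name
prefix = 'my-project'  # hypothetical key prefix

# Plain-Python equivalent of joining the same values with "/"
destination = '/'.join(['s3://{}'.format(bucket), prefix, 'processed', 'clean'])
print(destination)
```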

When specifying input, if all the files in a single folder are needed, the entire folder can be referenced, which will load the folder and all files in it to the `input_path`. Just make sure the references for file location include the `input_path` + `folder_name` instead of just the `input_path`.
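
A sketch of building those references, assuming the layout described above (a folder copied under `input_path`). A temporary directory stands in for `/opt/ml/processing/input` so the example runs locally; `folder_name` and the file names are hypothetical.

```python
import os
import tempfile

# Stand-in for '/opt/ml/processing/input' so the sketch runs outside SageMaker
input_path = tempfile.mkdtemp()
folder_name = 'raw'  # hypothetical folder name

# Simulate the folder and files that would be loaded to input_path
os.makedirs(os.path.join(input_path, folder_name))
for name in ('a.csv', 'b.csv'):
    with open(os.path.join(input_path, folder_name, name), 'w') as f:
        f.write('x\n1\n')

# File references include the folder, not just input_path
data_dir = os.path.join(input_path, folder_name)
files = sorted(os.path.join(data_dir, n) for n in os.listdir(data_dir))
print(files)
```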

*Note:* when saving things to subdirectories, those directories need to be created first.
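
The sketch below shows why: writing into a subdirectory that does not exist yet fails, and `os.makedirs(..., exist_ok=True)` fixes it without raising if the directory is already there. A temporary directory stands in for `output_path`.

```python
import os
import tempfile

# Stand-in for '/opt/ml/processing/output' so the sketch runs locally
output_path = tempfile.mkdtemp()
target = os.path.join(output_path, 'data', 'train_data.json')

# Writing before the 'data' subdirectory exists raises FileNotFoundError
try:
    with open(target, 'w') as f:
        f.write('{}')
    failed_without_dir = False
except FileNotFoundError:
    failed_without_dir = True

# Create the subdirectory first; exist_ok=True makes the call safe to repeat
os.makedirs(os.path.dirname(target), exist_ok=True)
with open(target, 'w') as f:
    f.write('{}')
```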

To use the same code for training and inference/prediction, save the pre-processed data (for use in training) and dump the fitted sklearn pipeline with joblib. In order to reference these later, they need to be saved to separate locations because they are used in different steps.

In [None]:
import os

import joblib
import pandas as pd

if __name__ == '__main__':
    input_path = '/opt/ml/processing/input'
    output_path = '/opt/ml/processing/output'
    # Hypothetical file name; use the name of the file loaded to input_path
    input_file = 'train_data.tsv'

    # Subdirectories must exist before files are written to them
    os.makedirs(os.path.join(output_path, 'data'), exist_ok=True)
    os.makedirs(os.path.join(output_path, 'encoder'), exist_ok=True)

    print('Reading data')
    df = pd.read_table(os.path.join(input_path, input_file), header=None)
    print('Preprocessing data')
    processed_df = pd.DataFrame(preprocessor.fit_transform(df))
    print('Saving dataframe')
    processed_df.to_json(os.path.join(output_path, 'data', 'train_data.json'))
    print('Saving joblib')
    joblib.dump(preprocessor, os.path.join(output_path, 'encoder', 'preprocess.joblib'))
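
On the prediction side, the saved pipeline can be loaded back and applied with `transform` only. The sketch below is a local round trip under stated assumptions: a temporary directory stands in for `output_path`, and a single `StandardScaler` step stands in for the real preprocessing pipeline.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for '/opt/ml/processing/output' so the sketch runs locally
output_path = tempfile.mkdtemp()
encoder_dir = os.path.join(output_path, 'encoder')
os.makedirs(encoder_dir, exist_ok=True)

# Training side: fit the pipeline and save it
preprocessor = Pipeline([('scale', StandardScaler())])
preprocessor.fit(np.array([[0.0], [10.0]]))
joblib_path = os.path.join(encoder_dir, 'preprocess.joblib')
joblib.dump(preprocessor, joblib_path)

# Prediction side: load the fitted pipeline and call transform, never fit
loaded = joblib.load(joblib_path)
new_out = loaded.transform(np.array([[5.0]]))
print(new_out)
```

Calling `transform` rather than `fit_transform` on the loaded object is what keeps prediction-time preprocessing identical to training.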

### References

- [Pipelines & Custom Transformers in Scikit-learn](https://towardsdatascience.com/pipelines-custom-transformers-in-scikit-learn-ef792bbb3260)
- [Pipelines & Custom Transformers in scikit-learn: The step-by-step guide (with Python code)](https://towardsdatascience.com/pipelines-custom-transformers-in-scikit-learn-the-step-by-step-guide-with-python-code-4a7d9b068156)
- [How to transform items using sklearn Pipeline?](https://stackoverflow.com/questions/33469633/how-to-transform-items-using-sklearn-pipeline)
- [6. Dataset transformations](https://scikit-learn.org/stable/data_transforms.html)
- [6.1. Pipelines and composite estimators](https://scikit-learn.org/stable/modules/compose.html#pipelines-and-composite-estimators)
- [6.3. Preprocessing data](https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-data)
- [6.3.8. Custom transformers](https://scikit-learn.org/stable/modules/preprocessing.html#custom-transformers)
- [sklearn.preprocessing.OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn-preprocessing-onehotencoder)
- [Entry 20: Scikit-Learn Pipeline](https://julielinx.github.io/blog/20_sklearn_pipeline/)
- [Entry 20 notebook - SciKit Learn Pipeline](https://github.com/julielinx/datascience_diaries/blob/master/02_model_eval/20a_nb_sklearn_pipeline.ipynb)