# [Module 2.0] Preprocessing the Data to Create Features



Next, we'll import the Python libraries we'll need for the remainder of the exercise.

In [27]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import io
import os
import sys
import time
import json
from IPython.display import display
from time import strftime, gmtime
import sagemaker
from sagemaker.predictor import csv_serializer

# Preprocessing data <a class="anchor" id="Pre-processing"></a>

Before training the prediction model, we need to carry out typical preprocessing tasks on the input data, including cleaning, feature transformation, and feature selection. For example:  
- `Phone` takes on too many unique values to be of any practical use. Parsing out the prefix might have some value, but without more context on how prefixes are allocated, we should avoid using it.
- `Area Code` shows up as a numeric feature; we should convert it to non-numeric.
- If we dig into the features and run a correlation analysis, we see several pairs of features that are highly correlated with one another. Including such pairs in some machine learning algorithms can create catastrophic problems, while in others it only introduces minor redundancy and bias. We should remove one feature from each highly correlated pair.
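
As a rough illustration of the last point, the sketch below drops one column from each highly correlated numeric pair. The helper name `drop_highly_correlated`, the 0.95 threshold, and the toy values are assumptions made for this example; they are not part of the workshop code, which instead omits the `Charge` columns from its numeric feature list.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df, threshold=0.95):
    """Drop the later column of every numeric pair whose |correlation| exceeds threshold."""
    corr = df.select_dtypes(include="number").corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Made-up toy data: the charge is an exact multiple of the minutes.
toy = pd.DataFrame({
    "Day Mins": [100.0, 150.0, 200.0, 250.0],
    "Day Charge": [17.0, 25.5, 34.0, 42.5],
    "Day Calls": [90, 120, 80, 110],
})
reduced, dropped = drop_highly_correlated(toy)
print(dropped)
print(list(reduced.columns))
```

Keeping the first column of each pair is an arbitrary choice; with domain knowledge you may prefer to keep whichever feature is easier to interpret.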

We will use the Amazon SageMaker built-in Scikit-learn library for preprocessing (and also postprocessing), and then use the Amazon SageMaker built-in XGBoost algorithm for predictions. We'll deploy both the library and the algorithm on the same endpoint using the Amazon SageMaker Inference Pipelines feature so you can pass raw input data directly to Amazon SageMaker. We'll also reuse the preprocessing code between training and inference to reduce development overhead and errors.

To run Scikit-learn on SageMaker, we use the `SKLearn` Estimator with a script as the entry point. The training script is very similar to one you might run outside of SageMaker. Because this data set is small, we use 'local' mode for preprocessing and upload the transformer and the transformed data to S3.

In [28]:
%store -r

## Creating the preprocessing model (Featurizer)
The cell below does the following:
- Creates an Estimator called SKLearn.
    - Provides the training data in s3_input_train as the input to SKLearn.
    - Specifies preprocessing.py as the source code that builds the preprocessing model (Featurizer).
    - Specifies train_instance_type = 'local' as the resource to use. (This uses Docker Compose, which is already installed on the notebook instance.)
        - A SageMaker cloud instance can be used instead of local mode. (e.g. ml.m4.xlarge)
- When the SKLearn preprocessing model finishes training, the resulting model artifact (model.tar.gz) is saved to s3://{bucket_name}/{job_name}/model.tar.gz. (e.g. s3://sagemaker-us-east-2-057716757052/sagemaker-scikit-learn-2020-07-15-08-39-41-035/model.tar.gz)

In [29]:
from sagemaker.sklearn.estimator import SKLearn
sagemaker_session = sagemaker.Session()
from sagemaker import get_execution_role

role = get_execution_role()

script_path = 'preprocessing.py'

sklearn_preprocessor = SKLearn(
    entry_point=script_path,
    role=role,
    train_instance_type="local")
sklearn_preprocessor.fit({'train': s3_input_train})

Creating tmpq6vbeyyd_algo-1-bogcm_1 ... done
Attaching to tmpq6vbeyyd_algo-1-bogcm_1
algo-1-bogcm_1  | 2020-07-15 13:22:47,930 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
algo-1-bogcm_1  | 2020-07-15 13:22:47,932 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
algo-1-bogcm_1  | 2020-07-15 13:22:47,941 sagemaker_sklearn_container.training INFO     Invoking user training script.
algo-1-bogcm_1  | 2020-07-15 13:22:48,042 sagemaker-containers INFO     Module preprocessing does not provide a setup.py. 
algo-1-bogcm_1  | Generating setup.py
algo-1-bogcm_1  | 2020-07-15 13:22:48,042 sagemaker-containers INFO     Generating setup.cfg
algo-1-bogcm_1  | 2020-07-15 13:22:48,042 sagemaker-containers INFO     Generating MANIFEST.in
algo-1-bogcm_1  | 2020-07-15 13:22:48,042 sagemaker-containers INFO     Installing module with the following command:


### Preparing the training and validation dataset <a class="anchor" id="preprocess_train_data"></a>
Now that our preprocessor is fitted, let's go ahead and preprocess our training and validation data. We'll use batch transform to preprocess the raw data directly and store the result back in S3.

### Creating the preprocessed training data (features)
![Transformer_Train](img/Fig2.1.transformer_train.png)

In [30]:
# Define a SKLearn Transformer from the trained SKLearn Estimator
transform_train_output_path = 's3://{}/{}/{}/'.format(bucket, prefix, 'transformtrain-train-output')
instance_type = 'local'

scikit_learn_inferencee_model = sklearn_preprocessor.create_model(env={'TRANSFORM_MODE': 'feature-transform'})
transformer_train = scikit_learn_inferencee_model.transformer(
    instance_count=1, 
    instance_type= instance_type,
    assemble_with = 'Line',
    output_path = transform_train_output_path,
    accept = 'text/csv')


# Preprocess training input
transformer_train.transform(s3_input_train.config['DataSource']['S3DataSource']['S3Uri'], 
                            content_type='text/csv')
print('Waiting for transform job: ' + transformer_train.latest_transform_job.job_name)
transformer_train.wait()
preprocessed_train_path = transformer_train.output_path + transformer_train.latest_transform_job.job_name
print(preprocessed_train_path)



Attaching to tmp_8felqip_algo-1-v2kbw_1
algo-1-v2kbw_1  | Processing /opt/ml/code
algo-1-v2kbw_1  | Building wheels for collected packages: preprocessing
algo-1-v2kbw_1  |   Building wheel for preprocessing (setup.py) ... done
algo-1-v2kbw_1  |   Created wheel for preprocessing: filename=preprocessing-1.0.0-py2.py3-none-any.whl size=9702 sha256=b97df0e391dac5647dc21c6e8756f0273274f8e2daab0877b6e8b7ed76249e38
algo-1-v2kbw_1  |   Stored in directory: /tmp/pip-ephem-wheel-cache-6ww0fgki/wheels/35/24/16/37574d11bf9bde50616c67372a334f94fa8356bc7164af8ca3
algo-1-v2kbw_1  | Successfully built preprocessing
algo-1-v2kbw_1  | Installing collected packages: preprocessing
algo-1-v2kbw_1  | Successfully installed preprocessing-1.0.0
algo-1-v2kbw_1  |   import imp
algo-1-v2kbw_1  | [2020-07-15 13:22:53 +0000] [72] [INFO] Starting gunicorn 19.9.0
algo-1-v2kbw_1  | [2020-07-15 13:22:53 +0000

### Checking the preprocessed training file

In [31]:
print(preprocessed_train_path)

s3://sagemaker-us-east-2-057716757052/sagemaker/customer-churn/transformtrain-train-output/sagemaker-scikit-learn-2020-07-15-13-22-2020-07-15-13-22-50-436


In [32]:
! aws s3 ls {preprocessed_train_path} --recursive

2020-07-15 13:22:57        291 sagemaker/customer-churn/transformtrain-train-output/sagemaker-scikit-learn-2020-07-15-13-22-2020-07-15-13-22-50-436/train.csv.out


In [33]:
preprocessed_train_path_file = os.path.join (preprocessed_train_path, 'train.csv.out')
df_pre_train = pd.read_csv(preprocessed_train_path_file)
df_pre_train.head()


Unnamed: 0,"<!DOCTYPE HTML PUBLIC ""-//W3C//DTD HTML 3.2 Final//EN"">"
0,<title>500 Internal Server Error</title>
1,<h1>Internal Server Error</h1>
2,<p>The server encountered an internal error an...
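
The output above is an HTML "500 Internal Server Error" page rather than feature rows, so the transform job wrote the container's error response into `train.csv.out`. A small check like the following sketch could catch this before training. The helper `looks_like_numeric_csv` is a hypothetical name introduced here, and the sample strings are made up for illustration.

```python
import csv
import io

def looks_like_numeric_csv(text):
    """Return True if every cell of every non-empty row parses as a float."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    if not rows:
        return False
    try:
        for row in rows:
            for cell in row:
                float(cell)
    except ValueError:
        return False
    return True

# Made-up samples: a valid feature file and the start of an HTML error page.
good = "1.0,0.5,-1.2\n0.0,2.5,0.3\n"
bad = ('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">\n'
       '<title>500 Internal Server Error</title>\n')

print(looks_like_numeric_csv(good))
print(looks_like_numeric_csv(bad))
```

In the notebook, the text to check could be the contents of `train.csv.out` after downloading it from S3.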


### Creating the preprocessed validation data (features)

In [34]:
# Define a SKLearn Transformer from the trained SKLearn Estimator

transform_validation_output_path = 's3://{}/{}/{}/'.format(bucket, prefix, 'transformtrain-validation-output')
transformer_validation = scikit_learn_inferencee_model.transformer(
    instance_count=1, 
    instance_type= instance_type,
    assemble_with = 'Line',
    output_path = transform_validation_output_path,
    accept = 'text/csv')
# Preprocess validation input
transformer_validation.transform(s3_input_validation.config['DataSource']['S3DataSource']['S3Uri'], content_type='text/csv')
print('Waiting for transform job: ' + transformer_validation.latest_transform_job.job_name)
transformer_validation.wait()
preprocessed_validation_path = transformer_validation.output_path+transformer_validation.latest_transform_job.job_name
print(preprocessed_validation_path)




Attaching to tmpkrnmnmwq_algo-1-lb0xf_1
algo-1-lb0xf_1  | Processing /opt/ml/code
algo-1-lb0xf_1  | Building wheels for collected packages: preprocessing
algo-1-lb0xf_1  |   Building wheel for preprocessing (setup.py) ... done
algo-1-lb0xf_1  |   Created wheel for preprocessing: filename=preprocessing-1.0.0-py2.py3-none-any.whl size=9700 sha256=20da9f13c0d073dca43bf16b8bdf0fc470f98dfa972bfe073964ac26e9d370e1
algo-1-lb0xf_1  |   Stored in directory: /tmp/pip-ephem-wheel-cache-_tjmvab8/wheels/35/24/16/37574d11bf9bde50616c67372a334f94fa8356bc7164af8ca3
algo-1-lb0xf_1  | Successfully built preprocessing
algo-1-lb0xf_1  | Installing collected packages: preprocessing
algo-1-lb0xf_1  | Successfully installed preprocessing-1.0.0
algo-1-lb0xf_1  |   import imp
algo-1-lb0xf_1  | [2020-07-15 13:23:01 +0000] [72] [INFO] Starting gunicorn 19.9.0
algo-1-lb0xf_1  | [2020-07-15 13:23:01 +0000

In [35]:
preprocessed_validation_path

's3://sagemaker-us-east-2-057716757052/sagemaker/customer-churn/transformtrain-validation-output/sagemaker-scikit-learn-2020-07-15-13-22-2020-07-15-13-22-57-979'

In [36]:
! aws s3 ls {preprocessed_validation_path} --recursive

2020-07-15 13:23:04        291 sagemaker/customer-churn/transformtrain-validation-output/sagemaker-scikit-learn-2020-07-15-13-22-2020-07-15-13-22-57-979/validation.csv.out


---
## Train

Moving onto training, first we'll need to specify the locations of the XGBoost algorithm containers.

In [37]:
import boto3

from sagemaker.amazon.amazon_estimator import get_image_uri
container = get_image_uri(boto3.Session().region_name, 'xgboost')

	get_image_uri(region, 'xgboost', '0.90-2').


Then, because we're training with the CSV file format, we'll create `s3_input`s that our training function can use as a pointer to the files in S3.

In [38]:
s3_input_train_processed = sagemaker.session.s3_input(
    preprocessed_train_path, 
    distribution='FullyReplicated',
    content_type='text/csv', 
    s3_data_type='S3Prefix')
print(s3_input_train_processed.config)
s3_input_validation_processed = sagemaker.session.s3_input(
    preprocessed_validation_path, 
    distribution='FullyReplicated',
    content_type='text/csv', 
    s3_data_type='S3Prefix')
print(s3_input_validation_processed.config)

{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-east-2-057716757052/sagemaker/customer-churn/transformtrain-train-output/sagemaker-scikit-learn-2020-07-15-13-22-2020-07-15-13-22-50-436', 'S3DataDistributionType': 'FullyReplicated'}}, 'ContentType': 'text/csv'}
{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-east-2-057716757052/sagemaker/customer-churn/transformtrain-validation-output/sagemaker-scikit-learn-2020-07-15-13-22-2020-07-15-13-22-57-979', 'S3DataDistributionType': 'FullyReplicated'}}, 'ContentType': 'text/csv'}


Now, we can specify a few parameters like what type of training instances we'd like to use and how many, as well as our XGBoost hyperparameters.  A few key hyperparameters are:
- `max_depth` controls how deep each tree within the algorithm can be built.  Deeper trees can lead to better fit, but are more computationally expensive and can lead to overfitting.  There is typically some trade-off in model performance that needs to be explored between a large number of shallow trees and a smaller number of deeper trees.
- `subsample` controls sampling of the training data.  This technique can help reduce overfitting, but setting it too low can also starve the model of data.
- `num_round` controls the number of boosting rounds.  This is essentially the subsequent models that are trained using the residuals of previous iterations.  Again, more rounds should produce a better fit on the training data, but can be computationally expensive or lead to overfitting.
- `eta` controls how aggressive each round of boosting is.  It is the step-size shrinkage applied to each new tree, so smaller values lead to more conservative boosting.
- `gamma` controls how aggressively trees are grown.  Larger values lead to more conservative models.
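
To make the roles of `eta` and `num_round` concrete, here is a minimal sketch of boosting on residuals with decision stumps and squared error. It is a toy written for this explanation, not XGBoost's algorithm; the function names and the step-shaped data are assumptions.

```python
import numpy as np

def fit_stump(x, residual):
    """Return (threshold, left_value, right_value) minimising squared error."""
    best = None
    xs = np.unique(x)
    # Candidate thresholds are midpoints between consecutive distinct x values.
    for threshold in (xs[:-1] + xs[1:]) / 2.0:
        left = residual[x <= threshold]
        right = residual[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    return best[1:]

def boost(x, y, eta, num_round):
    """Fit num_round stumps to successive residuals; return the training MSE."""
    pred = np.zeros_like(y, dtype=float)
    for _ in range(num_round):
        threshold, left_value, right_value = fit_stump(x, y - pred)
        # eta shrinks the contribution of every new stump.
        pred += eta * np.where(x <= threshold, left_value, right_value)
    return float(np.mean((y - pred) ** 2))

x = np.arange(8.0)
y = (x >= 4).astype(float)          # a single step, so one stump fits it exactly

loss_small_eta = boost(x, y, eta=0.2, num_round=5)
loss_large_eta = boost(x, y, eta=0.8, num_round=5)
print(loss_small_eta, loss_large_eta)
```

With the same number of rounds, the smaller `eta` leaves a larger training error, which is why a low `eta` is usually paired with a higher `num_round`.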

More detail on XGBoost's hyperparameters can be found on their GitHub [page](https://github.com/dmlc/xgboost/blob/master/doc/parameter.md).

In [39]:
sess = sagemaker.Session()

xgb = sagemaker.estimator.Estimator(container,
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    sagemaker_session=sess)
xgb.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        silent=0,
                        objective='binary:logistic',
                        num_round=100)

xgb.fit({'train': s3_input_train_processed, 'validation': s3_input_validation_processed}) 

2020-07-15 13:23:05 Starting - Starting the training job...
2020-07-15 13:23:07 Starting - Launching requested ML instances......
2020-07-15 13:24:10 Starting - Preparing the instances for training...
2020-07-15 13:25:01 Downloading - Downloading input data...
2020-07-15 13:25:33 Training - Downloading the training image...
2020-07-15 13:26:06 Uploading - Uploading generated training model
2020-07-15 13:26:06 Failed - Training job failed
Arguments: train
[2020-07-15:13:25:54:INFO] Running standalone xgboost training.
[2020-07-15:13:25:54:INFO] File size need to be processed in the node: 0.0mb. Available memory size in the node: 8496.47mb
[2020-07-15:13:25:54:ERROR] Customer Error: Non-numeric value 'D' found in the header line '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final/...' of file 'train.csv.out'. CSV format require no header line in it. If header line is already removed, XGBoost does not accept non-numeric value in the data.
Traceback 

UnexpectedStatusException: Error for Training job xgboost-2020-07-15-13-23-05-144: Failed. Reason: ClientError: Non-numeric value 'D' found in the header line '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final/...' of file 'train.csv.out'. CSV format require no header line in it. If header line is already removed, XGBoost does not accept non-numeric value in the data.

## Post-processing

In [None]:
# Define a SKLearn Transformer from the trained SKLearn Estimator
transform_postprocessor_path = 's3://{}/{}/{}/'.format(bucket, prefix, 'transformtrain-postprocessing-output')
scikit_learn_post_process_model = sklearn_preprocessor.create_model(env={'TRANSFORM_MODE': 'inverse-label-transform'})
transformer_post_processing = scikit_learn_post_process_model.transformer(
    instance_count=1, 
    instance_type='local',
    assemble_with = 'Line',
    output_path = transform_postprocessor_path,
    accept = 'text/csv')

In [None]:
sklearn_preprocessor.model_data

In [None]:
scikit_learn_post_process_model.model_data

## Inference Pipeline <a class="anchor" id="pipeline_setup"></a>
A machine learning pipeline can be set up with `create_model()`. In this example, we configure our pipeline model with the fitted Scikit-learn inference model, the fitted XGBoost model, and the postprocessing model.
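
Conceptually, the containers in an inference pipeline run in order, and each one receives the previous one's output. The sketch below imitates that chaining with plain Python functions. Every function here is a stand-in invented for illustration (the scoring formula in particular is made up and is not a trained XGBoost model); only the 0.5 threshold mirrors `predict_fn` in preprocessing.py.

```python
import math
from functools import reduce

def feature_transform(raw):
    # Stand-in for container 1: parse a CSV line into floats.
    return [float(value) for value in raw.split(",")]

def score(features):
    # Stand-in for container 2: a made-up logistic score.
    z = sum(features) - 1.0
    return 1.0 / (1.0 + math.exp(-z))

def inverse_label_transform(probability):
    # Stand-in for container 3: same 0.5 threshold as predict_fn.
    return probability > 0.5

def run_pipeline(stages, payload):
    """Feed the payload through each stage in order."""
    return reduce(lambda data, stage: stage(data), stages, payload)

stages = [feature_transform, score, inverse_label_transform]
print(run_pipeline(stages, "2.0,0.5"))
print(run_pipeline(stages, "0.1,0.2"))
```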

In [None]:
timestamp_prefix = strftime("%Y-%m-%d-%H-%M-%S", gmtime())
model_name = 'churn-inference-pipeline-' + timestamp_prefix
client = boto3.client('sagemaker')
response = client.create_model(
    ModelName=model_name,
    Containers=[
        {
            'Image': sklearn_preprocessor.image_name,
            'ModelDataUrl': sklearn_preprocessor.model_data,
            'Environment': {
                    "SAGEMAKER_ENABLE_CLOUDWATCH_METRICS": str(sklearn_preprocessor.enable_cloudwatch_metrics),
                    "SAGEMAKER_SUBMIT_DIRECTORY": sklearn_preprocessor.uploaded_code.s3_prefix,
                    "TRANSFORM_MODE": "feature-transform",
                    "SAGEMAKER_CONTAINER_LOG_LEVEL": str(sklearn_preprocessor.container_log_level),
                    "SAGEMAKER_REGION": sklearn_preprocessor.sagemaker_session.boto_region_name,
                    "SAGEMAKER_PROGRAM": sklearn_preprocessor.uploaded_code.script_name
                }
        },
        {
            'Image': xgb.image_name,
            'ModelDataUrl': xgb.model_data,
            "Environment": {}
        },
        {
            'Image': scikit_learn_post_process_model.image,
            'ModelDataUrl': scikit_learn_post_process_model.model_data,
            'Environment': {
                    "SAGEMAKER_ENABLE_CLOUDWATCH_METRICS": str(sklearn_preprocessor.enable_cloudwatch_metrics),
                    "SAGEMAKER_SUBMIT_DIRECTORY": sklearn_preprocessor.uploaded_code.s3_prefix,
                    "TRANSFORM_MODE": "inverse-label-transform",
                    "SAGEMAKER_CONTAINER_LOG_LEVEL": str(sklearn_preprocessor.container_log_level),
                    "SAGEMAKER_REGION": sklearn_preprocessor.sagemaker_session.boto_region_name,
                    "SAGEMAKER_PROGRAM": sklearn_preprocessor.uploaded_code.script_name
                }
        },
    ],
    ExecutionRoleArn = role,
)
model_name

In [None]:
%store model_name

## preprocessing.py code<a class="anchor" id="create_sklearn_script"></a>

```python
from __future__ import print_function

import time
import sys
from io import StringIO
import os
import shutil

import argparse
import csv
import json
import numpy as np
import pandas as pd
import logging

from sklearn.compose import ColumnTransformer
from sklearn.externals import joblib
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Binarizer, StandardScaler, OneHotEncoder

from sagemaker_containers.beta.framework import (
    content_types, encoders, env, modules, transformer, worker)

# Since we get a headerless CSV file we specify the column names here.
feature_columns_names = [
    'State',
    'Account Length',
    'Area Code',
    'Phone',
    "Int'l Plan",
    'VMail Plan',
    'VMail Message',
    'Day Mins',
    'Day Calls',
    'Day Charge',
    'Eve Mins',
    'Eve Calls',
    'Eve Charge',
    'Night Mins',
    'Night Calls',
    'Night Charge',
    'Intl Mins',
    'Intl Calls',
    'Intl Charge',
    'CustServ Calls'] 

label_column = 'Churn?'

feature_columns_dtype = {
    'State' :  str,
    'Account Length' :  np.int64,
    'Area Code' :  str,
    'Phone' :  str,
    "Int'l Plan" :  str,
    'VMail Plan' :  str,
    'VMail Message' :  np.int64,
    'Day Mins' :  np.float64,
    'Day Calls' :  np.int64,
    'Day Charge' :  np.float64,
    'Eve Mins' :  np.float64,
    'Eve Calls' :  np.int64,
    'Eve Charge' :  np.float64,
    'Night Mins' :  np.float64,
    'Night Calls' :  np.int64,
    'Night Charge' :  np.float64,
    'Intl Mins' :  np.float64,
    'Intl Calls' :  np.int64,
    'Intl Charge' :  np.float64,
    'CustServ Calls' :  np.int64}

label_column_dtype = {'Churn?': str}  

def merge_two_dicts(x, y):
    z = x.copy()   # start with x's keys and values
    z.update(y)    # modifies z with y's keys and values & returns None
    return z

def _is_inverse_label_transform():
    """Returns True if if it's running in inverse label transform."""
    return os.getenv('TRANSFORM_MODE') == 'inverse-label-transform'

def _is_feature_transform():
    """Returns True if it's running in feature transform mode."""
    return os.getenv('TRANSFORM_MODE') == 'feature-transform'


if __name__ == '__main__':

    parser = argparse.ArgumentParser()

    # Sagemaker specific arguments. Defaults are set in the environment variables.
    parser.add_argument('--output-data-dir', type=str, default=os.environ['SM_OUTPUT_DATA_DIR'])
    parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
    parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAIN'])


    args = parser.parse_args()

    input_files = [ os.path.join(args.train, file) for file in os.listdir(args.train) ]
    if len(input_files) == 0:
        raise ValueError(('There are no files in {}.\n' +
                          'This usually indicates that the channel ({}) was incorrectly specified,\n' +
                          'the data specification in S3 was incorrectly specified or the role specified\n' +
                          'does not have permission to access the data.').format(args.train, "train"))

    raw_data = [ pd.read_csv(
        file, 
        header=None, 
        names=feature_columns_names + [label_column],
        dtype=merge_two_dicts(feature_columns_dtype, label_column_dtype)) for file in input_files ]
    concat_data = pd.concat(raw_data)

    numeric_features = list([
    'Account Length',
    'VMail Message',
    'Day Mins',
    'Day Calls',
    'Eve Mins',
    'Eve Calls',
    'Night Mins',
    'Night Calls',
    'Intl Mins',
    'Intl Calls',
    'CustServ Calls'])


    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())])

    categorical_features = ['State','Area Code',"Int'l Plan",'VMail Plan']
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))])

    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features),
            ('cat', categorical_transformer, categorical_features)],
        remainder="drop")

    preprocessor.fit(concat_data)

    joblib.dump(preprocessor, os.path.join(args.model_dir, "model.joblib"))

    print("saved model!")
    
    
def input_fn(input_data, request_content_type):
    """Parse input data payload
    
    We currently only take csv input. Since we need to process both labelled
    and unlabelled data we first determine whether the label column is present
    by looking at how many columns were provided.
    """
    
    
    content_type = request_content_type.lower(
    ) if request_content_type else "text/csv"
    content_type = content_type.split(";")[0].strip()
    
    
    if isinstance(input_data, str):
        str_buffer = input_data
    else:
        str_buffer = str(input_data,'utf-8')
    

    if _is_feature_transform():
        if content_type == 'text/csv':
            # Read the decoded payload as CSV; StringIO needs str, not bytes.
            df = pd.read_csv(StringIO(str_buffer),  header=None)
            if len(df.columns) == len(feature_columns_names) + 1:
                # This is a labelled example, includes the  label
                df.columns = feature_columns_names + [label_column]
            elif len(df.columns) == len(feature_columns_names):
                # This is an unlabelled example.
                df.columns = feature_columns_names
            return df
        else:
            raise ValueError("{} not supported by script!".format(content_type))
    
    
    if _is_inverse_label_transform():
        if (content_type == 'text/csv' or content_type == 'text/csv; charset=utf-8'):
            # Read the raw input data as CSV.
            df = pd.read_csv(StringIO(str_buffer),  header=None)
            logging.info(f"Shape of the requested data: '{df.shape}'")
            return df
        else:
            raise ValueError("{} not supported by script!".format(content_type))
            
            
def output_fn(prediction, accept):
    """Format prediction output
    
    The default accept/content-type between containers for serial inference is JSON.
    We also want to set the ContentType or mimetype as the same value as accept so the next
    container can read the response payload correctly.
    """
    
    accept = 'text/csv'
    if type(prediction) is not np.ndarray:
        prediction=prediction.toarray()
    
   
    if accept == "application/json":
        instances = []
        for row in prediction.tolist():
            instances.append({"features": row})

        json_output = {"instances": instances}

        return worker.Response(json.dumps(json_output), mimetype=accept)
    elif accept == 'text/csv':
        return worker.Response(encoders.encode(prediction, accept), mimetype=accept)
    else:
        raise RuntimeError("{} accept type is not supported by this script.".format(accept))


def predict_fn(input_data, model):
    """Preprocess input data
    
    We implement this because the default predict_fn uses .predict(), but our model is a preprocessor
    so we want to use .transform().

    The output is returned in the following order:
    
        rest of features either one hot encoded or standardized
    """

    
    if _is_feature_transform():
        features = model.transform(input_data)


        if label_column in input_data:
            # Return the label (as the first column) and the set of features.
            return np.insert(features.toarray(), 0, pd.get_dummies(input_data[label_column])['True.'], axis=1)
        else:
            # Return only the set of features
            return features
    
    if _is_inverse_label_transform():
        features = input_data.iloc[:,0]>0.5
        features = features.values
        return features
    

def model_fn(model_dir):
    """Deserialize fitted model
    """
    if _is_feature_transform():
        preprocessor = joblib.load(os.path.join(model_dir, "model.joblib"))
        return preprocessor
```
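
The `input_fn` above allows for the payload arriving as either `str` or `bytes`, and it tells labelled from unlabelled rows by counting columns. The standalone sketch below isolates that logic so it can be tried outside the container. `parse_csv_payload` and `FEATURE_COUNT` are hypothetical names, and the all-zero rows are placeholder data rather than real records.

```python
from io import StringIO

import pandas as pd

FEATURE_COUNT = 20  # number of entries in feature_columns_names

def parse_csv_payload(payload):
    """Decode a str or bytes CSV payload and report whether it has a label column."""
    text = payload if isinstance(payload, str) else payload.decode("utf-8")
    df = pd.read_csv(StringIO(text), header=None)
    if df.shape[1] == FEATURE_COUNT + 1:
        return df, True
    if df.shape[1] == FEATURE_COUNT:
        return df, False
    raise ValueError("expected {} or {} columns, got {}".format(
        FEATURE_COUNT, FEATURE_COUNT + 1, df.shape[1]))

# Placeholder rows: 20 feature values, optionally followed by a label.
unlabelled = ",".join(["0"] * FEATURE_COUNT)
labelled = unlabelled + ",False."

df_bytes, has_label_bytes = parse_csv_payload(unlabelled.encode("utf-8"))
df_str, has_label_str = parse_csv_payload(labelled)
print(df_bytes.shape, has_label_bytes)
print(df_str.shape, has_label_str)
```

Passing raw bytes straight to `StringIO` raises a `TypeError`, which is why the payload is decoded first.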