# SM07: Data Preprocessing

Data preprocessing entails preparing data to be used to derive insight. I've seen the use of this output range from analytic reports to deep learning algorithms. For the purposes of this SageMaker series, I'm processing the data for use with machine learning algorithms.

When it comes down to it, preparing data for machine learning just means turning all features into numbers so I can do math on them. This is an oversimplification, especially considering all the different kinds of machine learning that can be applied, but it's a [mental model](https://jamesclear.com/mental-models#:~:text=A%20mental%20model%20is%20an,models%20help%20you%20understand%20life.) that works well for me.

## Prerequisites

Per usual, I'll assume that SageMaker Studio and an IAM role with the appropriate permissions have been set up. For more information on these two topics, see the Prerequisites section of [SM01]().

I'll also assume that the clean data is stored in the appropriate location in S3. To get this data, run the Python script and SageMaker Pipeline in [SM05]().

In [4]:
import pandas as pd
import numpy as np
import sagemaker.session

session = sagemaker.session.Session()
region = session.boto_region_name
role = sagemaker.get_execution_role()
bucket = session.default_bucket()
prefix = '1_ins_dataset'

df = pd.read_csv(f's3://{bucket}/{prefix}/clean/full_data.csv')

## Transformation types

When dealing with production data, I find myself doing the following kinds of transformations:

- True/False: ensure all features are represented as 0/1
    - Sometimes true/false values are stored as text values 'true' and 'false'
- [Category encoding](https://julielinx.github.io/blog/13_cat_prelims/): Turning [categories into numbers](https://julielinx.github.io/blog/14_encoding_cats/)
    - I [tend to favor](https://github.com/julielinx/datascience_diaries/blob/master/01_ml_process/14_nb_encoding_cats.ipynb) One Hot Encoding: turning each category into its own True/False (binary) column
    - Only applicable if there is only one value per observation for that feature
    - Only practical when there are a limited number of categories (if there are thousands, this will explode the number of features without necessarily adding any value)
- Dates: creating features to represent information about the date
- Floats: a kind of catchall for floating point numbers to ensure consistent representation
- Multivalue handling: turn lists of values into numeric features
    - Maximum value of a numeric list (mainly used to track the presence or absence of something)
    - Descriptive statistics of a numeric list (min, max, mean, etc.; generally used when the numeric values represent something that can be counted)
    - Count unique values in the list
    - Multilabel binarization: basically one hot encoding when there are multiple values for each observation (a list of values)
- [Standardization/normalization](https://julielinx.github.io/blog/08_center_scale_and_latex/): bring numeric values into a similar range
    - Only necessary for algorithms that are sensitive to the scale of the features (for example, distance-based or gradient-based methods)
    - Generally only applicable when the ranges of the feature values differ by a factor of 10 or more
- Delete: remove columns that aren't actual features or would cause data leakage
    - I often include multiple identifier columns in order to bring in and join the appropriate data from its original locations
    - Sometimes values that wouldn't be available at the time of prediction will sneak into the data set
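
To make a few of the items above concrete, here is a minimal sketch in `pandas`. The toy data and every column name in it (`is_active`, `signup_date`, `claim_amounts`, `products`) are made up for illustration and have nothing to do with the insurance dataset. It covers the true/false, date, and multivalue cases; it's one way to do each, not the only way.

```python
import pandas as pd

# Hypothetical toy data; none of these columns exist in the insurance dataset
raw = pd.DataFrame({
    "is_active": ["true", "false", "true"],
    "signup_date": ["2022-01-15", "2022-06-30", "2022-12-01"],
    "claim_amounts": [[100.0, 250.0], [], [75.0]],
    "products": [["car", "fire"], ["fire"], ["boat", "car"]],
})

features = pd.DataFrame(index=raw.index)

# True/False: map the text values to 0/1
features["is_active"] = raw["is_active"].map({"true": 1, "false": 0})

# Dates: derive numeric features that describe the date
dates = pd.to_datetime(raw["signup_date"])
features["signup_month"] = dates.dt.month
features["signup_dayofweek"] = dates.dt.dayofweek  # Monday is 0

# Multivalue handling: summarize a numeric list (an empty list becomes 0)
features["claim_max"] = raw["claim_amounts"].apply(lambda v: max(v) if v else 0.0)
features["claim_count"] = raw["claim_amounts"].apply(len)

# Multilabel binarization: one 0/1 column per value seen in any list
labels = sorted({p for row in raw["products"] for p in row})
for label in labels:
    features[f"products_{label}"] = raw["products"].apply(lambda v: int(label in v))

print(features)
```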

The only transformation from the list above that really applies to the [Insurance Company Benchmark (COIL 2000) dataset](https://archive.ics.uci.edu/ml/datasets/Insurance+Company+Benchmark+%28COIL+2000%29) is one hot encoding.

I was going to throw in standardization/normalization just to have a second transformation to apply, but from the EDA completed in [SM06]() I know that the range of values across all features is 0-12. There's really no reason to apply standardization/normalization in this case.

In [16]:
print('Smallest minimum:', df.describe().transpose()['min'].min())
print('Largest maximum:', df.describe().transpose()['max'].max())

Smallest minimum: 0.0
Largest maximum: 12.0
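
The 10x rule of thumb from the list above can be turned into a quick check. This is only a sketch: `scale_ratio` is a hypothetical helper (not part of any library), and the two example frames are made up. It compares the widest feature range to the narrowest one.

```python
import pandas as pd


def scale_ratio(frame: pd.DataFrame) -> float:
    """Ratio of the widest numeric feature range to the narrowest nonzero one."""
    numeric = frame.select_dtypes(include="number")
    ranges = numeric.max() - numeric.min()
    ranges = ranges[ranges > 0]  # constant columns say nothing about scale
    return float(ranges.max() / ranges.min())


# Hypothetical frames: one with similar ranges, one with very different ranges
similar = pd.DataFrame({"a": [0, 3, 9], "b": [0, 6, 12]})
mixed = pd.DataFrame({"age": [20, 40, 60], "income": [20_000, 60_000, 120_000]})

ratio_similar = scale_ratio(similar)  # ranges 9 and 12
ratio_mixed = scale_ratio(mixed)      # ranges 40 and 100,000

print("Similar ranges, scale?", ratio_similar >= 10)
print("Mixed ranges, scale?", ratio_mixed >= 10)
```

A ratio well under 10, as with features that all sit between 0 and 12, suggests scaling is unlikely to change much.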


## One Hot Encode

I prefer the `category_encoders` library over those available in `sklearn` or `pandas`. Benefits of `category_encoders` include:

- Easily retain column names
- Easily switch between encoders
- Still fully integratable with sklearn pipelines
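
For comparison, here is a small sketch of the `pandas` route using `pd.get_dummies`. The frames and the `segment` column are made up for illustration. Because `get_dummies` has no fitted state, data that is missing some categories comes back with fewer columns, and the alignment has to be done by hand.

```python
import pandas as pd

# Hypothetical data: the new batch only contains one of the three categories
train = pd.DataFrame({"segment": ["Family starters", "Stable family", "Mixed rurals"]})
new = pd.DataFrame({"segment": ["Stable family", "Stable family"]})

train_encoded = pd.get_dummies(train, columns=["segment"], dtype=int)
new_encoded = pd.get_dummies(new, columns=["segment"], dtype=int)

print(train_encoded.columns.tolist())  # three columns
print(new_encoded.columns.tolist())    # only one column

# Align the new batch to the training columns, filling absent categories with 0
new_aligned = new_encoded.reindex(columns=train_encoded.columns, fill_value=0)
print(new_aligned)
```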

In [20]:
import sys
!{sys.executable} -q -m pip install category_encoders


In [21]:
import category_encoders as ce

cat_cols = ['zip_agg Customer Subtype', 'zip_agg Customer main type']

encoder = ce.OneHotEncoder(cols=cat_cols, use_cat_names=True, handle_missing='return_nan')

In [24]:
trnsfrmd_df = encoder.fit_transform(df)

In [26]:
trnsfrmd_df

Unnamed: 0,zip_agg Customer Subtype_Lower class large families,zip_agg Customer Subtype_Mixed small town dwellers,"zip_agg Customer Subtype_Modern, complete families",zip_agg Customer Subtype_Large family farms,zip_agg Customer Subtype_Young and rising,zip_agg Customer Subtype_Large religous families,zip_agg Customer Subtype_Family starters,zip_agg Customer Subtype_Stable family,zip_agg Customer Subtype_Mixed rurals,zip_agg Customer Subtype_Traditional families,...,Nbr private accident ins policies,Nbr family accidents ins policies,Nbr disability ins policies,Nbr fire policies,Nbr surfboard policies,Nbr boat policies,Nbr bicycle policies,Nbr property ins policies,Nbr ss ins policies,Nbr mobile home policies
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,1,0,0,0,0,0,0
1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,1,0,0,0,0,0,0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,1,0,0,0,0,0,0
3,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,1,0,0,0,0,0,0
4,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9817,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,1,0,0,0,0,0,0
9818,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,1,0,0,0,0,0,1
9819,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,1,0,0,0,1,0,0
9820,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0


In [51]:
from sklearn.preprocessing import OneHotEncoder

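# Encode the categorical columns; sklearn returns a sparse matrix, so convert it to a dense array
sk_encoder = OneHotEncoder()
sk_one_hot = sk_encoder.fit_transform(df[cat_cols]).toarray()
# get_feature_names_out (scikit-learn >= 1.0) replaces the removed get_feature_names
# and builds '<column>_<category>' names directly from cat_cols
sk_one_hot = pd.DataFrame(sk_one_hot, columns=sk_encoder.get_feature_names_out(cat_cols), index=df.index)
# Rejoin the encoded columns with the remaining features
sk_transformed = sk_one_hot.merge(df.drop(columns=cat_cols), left_index=True, right_index=True)
sk_transformed.head()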

Unnamed: 0,zip_agg Customer Subtype_Affluent senior apartments,zip_agg Customer Subtype_Affluent young families,zip_agg Customer Subtype_Career and childcare,zip_agg Customer Subtype_Couples with teens 'Married with children',zip_agg Customer Subtype_Dinki's (double income no kids),zip_agg Customer Subtype_Etnically diverse,zip_agg Customer Subtype_Family starters,zip_agg Customer Subtype_Fresh masters in the city,"zip_agg Customer Subtype_High Income, expensive child",zip_agg Customer Subtype_High status seniors,...,Nbr private accident ins policies,Nbr family accidents ins policies,Nbr disability ins policies,Nbr fire policies,Nbr surfboard policies,Nbr boat policies,Nbr bicycle policies,Nbr property ins policies,Nbr ss ins policies,Nbr mobile home policies
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,1,0,0,0,0,0,0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,1,0,0,0,0,0,0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,1,0,0,0,0,0,0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,1,0,0,0,0,0,0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,1,0,0,0,0,0,0


## Next Steps

Applying one hot encoding as a function works great when applying it to the entire data set at once in a notebook. However, the point of this series is to emulate a production level pipeline.

# Notes

- Use functions for the preprocessing in this one. This is the code to develop the processing.
    - OneHotEncoding
- Classes/transformers and sklearn pipe go in the next one.
- Info re: updating packages and `os.makedirs` go in 09. Write the preprocess.py file in 09.

# Code Notes

## Import libraries

The next piece of code loads the libraries.

The `warnings.simplefilter("once")` code helps reduce clutter from warning messages, only returning each message type once. This makes finding actual trouble areas easier.

In [4]:
import numpy as np
import pandas as pd
import sys
import os
import warnings

from sklearn.base import BaseEstimator, TransformerMixin
import category_encoders as ce
from sklearn.pipeline import Pipeline

# Show each distinct warning message only once
warnings.simplefilter("once")
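
To show what the `warnings.simplefilter("once")` setting mentioned above does, here is a small standard-library sketch. The `noisy` function and its message are made up for illustration. It raises the same warning three times and records how many actually get through.

```python
import warnings


def noisy():
    # Hypothetical function that emits the same warning on every call
    warnings.warn("this option is going away", UserWarning)


# record=True collects the warnings in a list instead of printing them
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("once")
    for _ in range(3):
        noisy()

shown = len(caught)
print("Warnings shown:", shown)
```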

## Parameters and Functions

The next section holds the parameters (listed first as they are the most likely to need updating), functions (listed second as the functions are often used in the classes), and classes (defined last to take advantage of the parameters and functions already defined).

In [8]:
%%writefile preprocess.py
import subprocess
import sys


def install(package):
    # "-q" goes after "install" so it quiets pip rather than the Python interpreter
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", package])


def upgrade(package):
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "--upgrade", package])


# Install dependencies before importing them
upgrade('pandas==1.3.5')
upgrade('numpy')
upgrade('pyarrow')
install('category_encoders')

import os

import numpy as np
import pandas as pd
import category_encoders as ce
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

# Categorical columns to encode (same list as in the notebook cells above).
# Assumption: the input data carries these column names. A file read with header=None
# has integer column labels, so names must be assigned before the transformer runs.
cat_cols = ['zip_agg Customer Subtype', 'zip_agg Customer main type']


class OneHotTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, feature_names):
        # Stored under the same name as the argument so sklearn's get_params/clone work
        self.feature_names = feature_names

    def fit(self, ori_df, y=None):
        # Learn the categories here so transform() can be reused on new data
        df = ori_df[self.feature_names]
        # Skip columns that are entirely missing
        self.col_names_ = df.dropna(axis=1, how='all').columns.tolist()
        self.encoder_ = ce.OneHotEncoder(cols=self.col_names_, use_cat_names=True, handle_missing='return_nan')
        self.encoder_.fit(df[self.col_names_])
        return self

    def transform(self, ori_df, y=None):
        print('Running OneHotTransformer')
        ce_one_hot = pd.DataFrame(self.encoder_.transform(ori_df[self.col_names_]), index=ori_df.index)
        # Note: this cast raises if handle_missing='return_nan' produced any NaN values
        ce_one_hot = ce_one_hot.astype(int)
        df = ori_df.drop(self.feature_names, axis=1).merge(ce_one_hot, left_index=True, right_index=True, how='outer')
        # Remember the output columns for get_feature_names_out
        self.columns_out_ = df.columns.tolist()
        return df

    def get_feature_names_out(self, input_features=None):
        return self.columns_out_


preprocessor = Pipeline([
    ('onehot', OneHotTransformer(cat_cols))
    ])


if __name__ == '__main__':
    input_path = '/opt/ml/processing/input'
    output_path = '/opt/ml/processing/output'

    # exist_ok=True avoids an error if the directories are already there
    os.makedirs(os.path.join(output_path, 'data'), exist_ok=True)
    os.makedirs(os.path.join(output_path, 'encoder'), exist_ok=True)

    print('Reading data')
    df = pd.read_table(os.path.join(input_path, 'ticdata2000.txt'), header=None)
    print('Preprocessing data')
    processed_df = pd.DataFrame(preprocessor.fit_transform(df))
    print('Saving dataframe')
    # Save the transformed data, not the raw input
    processed_df.to_json(os.path.join(output_path, 'data', 'train_data.json'))
#     print('Saving joblib')
#     joblib.dump(preprocessor, os.path.join(output_path, 'encoder', 'preprocess.joblib'))

Overwriting preprocess.py
