# Data Preprocessing

---

## Contents

1. [Preparation](#Preparation)
    1. [Define S3 Parameters](#DefineS3Parameters)
    1. [Import Python Libraries](#ImportPythonLibraries)
    1. [Define Code to Export Data (Optional)](#ExportDataCode)
1. [Read Data](#ReadData)
1. [Transformation](#Transformation)
    1. [Initial Feature Selection](#FeatureSelection)
    1. [Handling missing values](#MissignValues)
    1. [Formatting Dataset](#FormattingDataset)
    1. [Feature Engineering](#FeatureEngineering)
    1. [Filter and Sample Data](#FilterSampleData)
    1. [Outlier Treatment](#OutlierTreatment)
1. [Data Augmentation](#DataAugmentation)
    1. [Oddly distributed data](#OddlyDistributedData)
1. [Data Conversion](#DataConversion)      
    1. [Converting categorical to numeric](#Categorical2Numeric) 
    1. [One Hot Encoding (Dummy variables)](#OneHotEncoding)
1. [Data Normalization](#DataNormalization) 
1. [preprocessing.py File](#preprocessing.pyFile)

---
<a id='Preparation'></a>
## Preparation

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data.  This should be within the same region as the Notebook Instance, training, and hosting.
- The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these.  Note, if more than one role is required for notebook instances, training, and/or hosting, please replace the boto regexp with the appropriate full IAM role ARN string(s).

<a id='DefineS3Parameters'></a>
### Define S3 Parameters
Define the IAM role, the S3 bucket, and the file paths used throughout this notebook.

In [None]:
# Define IAM role
import boto3
import re
from sagemaker import get_execution_role

role = get_execution_role()
s3_client = boto3.client('s3')                     # Create a connection to S3 using the default configuration

# Define File Paths and S3 Buckets
bucket_name = '<SiemensProjectBucket>'
prefix = 'preprocessing'

file_key = "<folder>/<InputFile>.csv"

s3_output_location = 's3://{}/{}'.format(bucket_name, prefix)
print('training artifacts will be uploaded to: {}'.format(s3_output_location))

<a id='ImportPythonLibraries'></a>
### Import Python Libraries

Now let's bring in the Python libraries that we'll use throughout the analysis.

In [None]:
from scipy import stats

import numpy as np                                        # For matrix operations and numerical processing
import pandas as pd                                       # For munging tabular data
import pickle
import matplotlib.pyplot as plt                           # For charts and visualizations
from IPython.display import Image                         # For displaying images in the notebook
from IPython.display import display                       # For displaying outputs in the notebook
from time import gmtime, strftime                         # For labeling SageMaker models, endpoints, etc.
#from imblearn.under_sampling import RandomUnderSampler   # For using undersampling in dataframes ---------> Library imblearn not Installed. Include in requirements.txt
#from imblearn.over_sampling import SMOTE                 # For using oversampling in dataframes  ---------> Library imblearn not Installed. Include in requirements.txt
import sys                                                # For writing outputs to notebook
import math                                               # For ceiling function
import json                                               # For parsing hosting outputs
import os                                                 # For manipulating filepath names

from io import StringIO                                   # For in-memory CSV buffers

class AwsUtility():
    @staticmethod
    def read_csv_df(file_key):
        # Read a CSV file from the project bucket into a data frame
        df = pd.read_csv('s3://{}/{}'.format(bucket_name, file_key))
        print('{} rows have been read in from AWS S3.'.format(len(df)))
        return df

    @staticmethod
    def read_pickle(file_key):
        # Download a pickled object from the project bucket and deserialize it
        obj = s3_client.get_object(Bucket=bucket_name, Key=file_key)
        data = pickle.loads(obj['Body'].read())
        print('{} rows have been read in from AWS S3.'.format(len(data)))
        return data

    @staticmethod
    def upload_csv_to_s3(df, file_key):
        # Serialize the data frame to CSV in memory and upload it under the prefix
        csv_buffer = StringIO()
        df.to_csv(csv_buffer, sep=',')
        obj = boto3.resource('s3').Object(bucket_name, '{}/{}'.format(prefix, file_key))
        obj.put(
            Body=csv_buffer.getvalue(),
            ServerSideEncryption='aws:kms')
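
As a rough illustration of what the upload helper sends as the object body, the sketch below serializes a small data frame to CSV in memory and reads it back, without touching S3. The `toy_df` frame and its columns are made up for this example, and `index=False` is an assumption: the helper above keeps the default, which also writes the row index as an extra column.

```python
from io import StringIO

import pandas as pd

# Toy frame standing in for the pre-processed dataset (hypothetical columns).
toy_df = pd.DataFrame({"col1": [1, 2], "col2": ["a", "b"]})

# Serialize to CSV in memory; index=False leaves out the row-index column.
csv_buffer = StringIO()
toy_df.to_csv(csv_buffer, sep=",", index=False)
csv_text = csv_buffer.getvalue()

# Reading the text back should give the same rows and columns.
round_trip = pd.read_csv(StringIO(csv_text))
print(csv_text)
```

Whether the index should be exported depends on whether the modelling step expects it as a column.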

---
<a id='ReadData'></a>
## Read Data
Let's start by reading the data previously stored in your project's S3 bucket and converting it into a pandas data frame.

In [None]:
# Load files (key) from S3 Buckets and write them into df format
initial_df_1 = AwsUtility.read_csv_df('<prefix/source_file_name>')
initial_df_2 = AwsUtility.read_csv_df('<prefix/source_file_name>')

# Load pickled objects (key) from S3 Buckets
data = AwsUtility.read_pickle('<prefix/source_file_name>')

# Combine Data Frames - if applicable
frames = [initial_df_1, initial_df_2]
df = pd.concat(frames)
df.head()
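
One detail worth checking when combining frames is the row index: `pd.concat` keeps each frame's original index unless told otherwise. The sketch below uses two made-up frames (`part_1`, `part_2`) to show the difference; whether a fresh index is wanted here is an assumption about the project data.

```python
import pandas as pd

# Two toy frames with the same columns (hypothetical data).
part_1 = pd.DataFrame({"id": [1, 2], "value": [10.0, 20.0]})
part_2 = pd.DataFrame({"id": [3, 4], "value": [30.0, 40.0]})

# Default behaviour: the original row labels are kept, so labels repeat.
stacked = pd.concat([part_1, part_2])

# ignore_index=True builds a fresh 0..n-1 index for the combined frame.
reindexed = pd.concat([part_1, part_2], ignore_index=True)

print(list(stacked.index))
print(list(reindexed.index))
```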

In [None]:
print("Columns in dataset are the Following:")
df.columns

Check for missing values in columns

In [None]:
print("Null values in <ColumnName> variable are the following:")
df.<ColumnName>.isna().value_counts()
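
To get an overview of all columns at once rather than one column at a time, something like the following could be used. It is a minimal sketch on a made-up frame; the column names are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame with some gaps (hypothetical columns).
toy_df = pd.DataFrame({
    "received_date": ["01/02/2020", None, "03/02/2020"],
    "amount": [10.0, np.nan, np.nan],
    "region": ["north", "south", "east"],
})

# Number of missing values per column.
missing_counts = toy_df.isna().sum()

# Share of missing values per column, between 0 and 1.
missing_share = toy_df.isna().mean()

print(missing_counts)
print(missing_share)
```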

---
<a id='Transformation'></a>
## Transformation

The goal in this section is to apply all transformations necessary to clean data, transform it, and filter if necessary.
Cleaning up data is part of nearly every advanced analytics project. It arguably presents the biggest risk if done incorrectly and is one of the more subjective aspects in the process.  

A key principle in this section is to apply transformations while keeping the changes to the data visible from a business point of view. That is:
- After every step, we should be able to export and visualize the data.
- The data should remain interpretable, so that we can see the business logic being applied to the data before it is fed into the modeling.



<a id='FeatureSelection'></a>
### Initial Feature Selection
From the initial dataset, select the features to be worked with.

In [None]:
def select_data_model(data):
    model_data = data.copy().rename(columns={    #rename columns
        'col_name_old_1':'col_name_new_1',
        'col_name_old_2':'col_name_new_2'
        }).reindex(                          #drop any unwanted columns
        columns=['col_name_new_1', 'col_name_new_2'])
    return model_data

In [None]:
#Basic Model with features preprocessed
model = select_data_model(data)
print("Data Point Count in this step is: {}".format(len(model)))

model.head() 
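
The rename-then-reindex pattern can be tried on a small made-up frame, as sketched below. One caveat it illustrates: `reindex` does not fail on a column name that does not exist, it creates an all-NaN column instead, so a typo in the column list can go unnoticed. The `unused` and `col_name_new_3` names are hypothetical.

```python
import pandas as pd

# Toy input with one column we do not want to keep (hypothetical data).
raw = pd.DataFrame({
    "col_name_old_1": [1, 2],
    "col_name_old_2": ["a", "b"],
    "unused": [0.1, 0.2],
})

renamed = raw.rename(columns={
    "col_name_old_1": "col_name_new_1",
    "col_name_old_2": "col_name_new_2",
})

# Keep only the listed columns, in the listed order.
selected = renamed.reindex(columns=["col_name_new_1", "col_name_new_2"])

# A misspelled name is not an error: it yields a column full of NaN.
with_typo = renamed.reindex(columns=["col_name_new_1", "col_name_new_3"])

print(selected)
print(with_typo)
```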

<a id='MissignValues'></a>
### Handling missing values
Some algorithms are capable of handling missing values, but most would rather not.  Options include:
 * Removing observations with missing values: This works well if only a very small fraction of observations have incomplete information.

In [None]:
def missing_treatment(data):
    model_data = pd.DataFrame(data)
    model_data = model_data.replace('-', np.nan)
    model_data = model_data.dropna()

    return model_data

In [None]:
model = missing_treatment(model)

print("Data Point Count in this step is: {}".format(len(model)))
model.head()

 * Imputing missing values: Entire [books](https://www.amazon.com/Flexible-Imputation-Missing-Interdisciplinary-Statistics/dp/1439868247) have been written on this topic, but common choices are replacing the missing value with the mode or mean of that column's non-missing values.

In [None]:
df['<column_name>'] = df['<column_name>'].fillna(df['<column_name>'].mode()[0]) # To replace NaN values by the column mode (mode() returns a Series, so take its first entry)

df['<column_name>'] = df['<column_name>'].fillna(df['<column_name>'].mean()) # To replace NaN values by the column mean
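
The two imputation choices can be checked on a small example. The sketch below uses a made-up frame with one categorical and one numeric column; which statistic is appropriate for a real column is a modelling decision and not something this example settles.

```python
import numpy as np
import pandas as pd

# Toy frame with one gap per column (hypothetical data).
toy_df = pd.DataFrame({
    "category": ["a", "b", "a", None],
    "amount": [1.0, np.nan, 3.0, 5.0],
})

# mode() returns a Series (there can be ties), so take the first entry.
mode_value = toy_df["category"].mode()[0]
toy_df["category"] = toy_df["category"].fillna(mode_value)

# mean() skips NaN by default, so it is the mean of the observed values.
toy_df["amount"] = toy_df["amount"].fillna(toy_df["amount"].mean())

print(toy_df)
```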

<a id='FormattingDataset'></a>
### Formatting Dataset
Some data needs to be formatted in the correct way for processing. This is especially true for Time variables. In some cases, we even need to order our datapoints, like in time series analysis.

In [None]:
def format_model(data):
    model_data = pd.DataFrame(data)
    model_data = model_data[1:]
    model_data['ReceivedDate'] = pd.to_datetime(model_data['ReceivedDate'], dayfirst=True, errors='coerce')
    model_data['Date'] = pd.to_datetime(model_data['Date'], dayfirst=True, errors='coerce')
    model_data['Col1'] = model_data['Col1'].astype(np.int64)
    model_data['DateMonth'] = model_data['Date'].dt.month
    
    return model_data

In [None]:
model = format_model(model)

print("Data Point Count in this step is: {}".format(len(model)))
model.head()
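
For time series use cases, formatting is usually followed by ordering the rows by date. The sketch below applies the same conversions to a made-up frame and then sorts it; dropping rows whose date could not be parsed is an assumption made for this example, not a rule.

```python
import pandas as pd

# Toy frame with day-first dates, one unparseable value, and numbers as text.
toy_df = pd.DataFrame({
    "Date": ["15/03/2021", "02/01/2021", "not a date"],
    "Col1": ["3", "1", "2"],
})

# errors="coerce" turns unparseable values into NaT instead of raising.
toy_df["Date"] = pd.to_datetime(toy_df["Date"], dayfirst=True, errors="coerce")
toy_df["Col1"] = toy_df["Col1"].astype("int64")
toy_df["DateMonth"] = toy_df["Date"].dt.month

# Drop rows without a valid date, then order chronologically.
ordered = (
    toy_df.dropna(subset=["Date"])
    .sort_values("Date")
    .reset_index(drop=True)
)

print(ordered)
```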

---
<a id='FeatureEngineering'></a>
### Feature Engineering
Another question to ask yourself before building a model is whether certain features will add value in your final use case.  For example, if your goal is to deliver the best prediction, then will you have access to that data at the moment of prediction? If not, can you somehow forecast them accurately?

In [None]:
######################## Pseudo Code Example ########################
def FeatureEngineeringFunction(model_data):
    ####### Combine variables
    ####### reduce categories in some variables
    ####### etc
    ####### etc
    
    ####### delete previous variables that will be obsolete after feature engineering
    ####### ...
    return model_data
######################## Pseudo Code Example ########################

In [None]:
df = FeatureEngineeringFunction(model)

df.columns
df.head()
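
To make the pseudo code more concrete, here is one possible shape such a function could take. Everything in it is hypothetical: the column names (`revenue`, `units`, `country`), the derived feature, and the frequency threshold of 2 are placeholders chosen only to illustrate combining variables, reducing categories, and dropping obsolete columns.

```python
import pandas as pd


def engineer_features(model_data):
    """Illustrative feature engineering on hypothetical columns."""
    out = model_data.copy()

    # Combine variables: derive a unit price from revenue and units.
    out["unit_price"] = out["revenue"] / out["units"]

    # Reduce categories: keep countries seen at least twice, group the rest.
    counts = out["country"].value_counts()
    frequent = counts[counts >= 2].index
    out["country_grouped"] = out["country"].where(
        out["country"].isin(frequent), "other"
    )

    # Drop the variables made obsolete by the new features.
    return out.drop(columns=["revenue", "country"])


toy_df = pd.DataFrame({
    "revenue": [100.0, 50.0, 30.0, 80.0],
    "units": [10, 5, 3, 4],
    "country": ["DE", "DE", "PT", "US"],
})
features = engineer_features(toy_df)
print(features)
```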

<a id='FilterSampleData'></a>
### Filter and Sample Data
There are commonly two reasons for filtering data:
* For development purposes, to feed the model with only part of the data, for example in test cases.
* For structural reasons, if we want the model to cover only part of the data, such as one region or any other dimension.

In [None]:
######################## Pseudo Code Example ########################
def DataFileterFunction(model_data):
    ### Filter some of your data, whether it will serve just for testing
    ### or if you just want to look at one region or any other dimension
    ###
    
    ###
    return model_data
######################## Pseudo Code Example ########################

In [None]:
data_updated = DataFileterFunction(df)

print("Data Filtered:")
data_updated.head()
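
Both kinds of filtering can be sketched with plain pandas. The example below uses a made-up frame with a hypothetical `region` column; the 50% sampling fraction and the seed are arbitrary choices for illustration.

```python
import pandas as pd

# Toy frame with a dimension we may want to filter on (hypothetical data).
toy_df = pd.DataFrame({
    "region": ["north", "south"] * 5,
    "value": range(10),
})

# Structural filter: keep a single region.
north_only = toy_df[toy_df["region"] == "north"]

# Development sample: a reproducible random half of the rows.
dev_sample = toy_df.sample(frac=0.5, random_state=42)

print(len(north_only), len(dev_sample))
```

Fixing `random_state` keeps the development sample identical between runs, which makes results easier to compare.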

<a id='OutlierTreatment'></a>
### Outlier Treatment
This step is focused on the treatment of outliers. That is, to detect and treat datapoints that deviate too far from the rest of the dataset.

 * Removing observations with extreme values: This works well if only a small fraction of observations deviate strongly from the rest, for example those with an absolute z-score of 2.5 or more in a given column.

In [None]:
def remove_outliers(data):
    z_scores = stats.zscore(data['col3'])        # Standard score of each observation in col3
    mask = np.abs(z_scores) < 2.5                # Keep observations within 2.5 standard deviations of the mean
    updated_data = data[mask]
    
    return updated_data

In [None]:
model_data = remove_outliers(data_updated)
model_data.head()
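
The z-score rule can be verified on a small example. The sketch below computes the scores with pandas only, using `ddof=0` so that it matches the default of `scipy.stats.zscore`; the data are made up. Note that with very few rows no point can reach a z-score of 2.5, so the rule only makes sense on reasonably sized samples.

```python
import pandas as pd

# Twelve toy observations, one of them clearly extreme (hypothetical data).
toy_df = pd.DataFrame(
    {"col3": [10, 11, 9, 10, 12, 10, 11, 9, 10, 11, 10, 100]}
)

# Population z-score (ddof=0), the same convention scipy.stats.zscore uses.
col = toy_df["col3"]
z_scores = (col - col.mean()) / col.std(ddof=0)

# Keep rows within 2.5 standard deviations of the mean.
cleaned = toy_df[z_scores.abs() < 2.5]

print(len(toy_df), len(cleaned))
```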

<a id='DataAugmentation'></a>
## Data Augmentation

Data Augmentation can be used in several ways, and thus, it can appear either in Data Preprocessing or Modelling Notebooks. The rationale is the following:

   **1 - Preprocessing:** When augmentation of data can explicitly be used to enhance the Dataset. Example: Dealing with imbalanced datasets (Under or Oversampling).
   
   **2 - Modelling:** When augmentation of data is done implicitly during the model training. Example: When data loaders are used as input for a CNN (in TensorFlow or PyTorch). These data loaders usually have the option of augmenting image data, for instance by mirroring the input images.


Naturally, as we are using the **Pre-processing Notebook**, this step is used to establish ways of augmenting data explicitly. Several techniques can be applied in this regard.


<a id='OddlyDistributedData'></a>
### Oddly distributed data
Although this has very limited implications for non-linear models like Gradient Boosted Trees, parametric models like regression can produce wildly inaccurate estimates when fed highly skewed data. Some of the options are:

- For the simplest cases, taking the natural log of the features is sufficient to produce more normally distributed data.

In [None]:
df['<new_column_name>'] = np.log(df['column_name'])   # Requires strictly positive values
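
The effect of the log transform on skewness can be checked numerically. The sketch below uses made-up values and `np.log1p` (the log of 1 + x), which is a swapped-in variant that also tolerates zeros; plain `np.log` would work equally well here since all values are positive.

```python
import numpy as np
import pandas as pd

# Right-skewed toy values (hypothetical data).
values = pd.Series([1, 2, 2, 3, 3, 4, 5, 8, 50, 400], dtype="float64")

# log1p(x) = log(1 + x): defined at zero, unlike the plain log.
log_values = np.log1p(values)

# Sample skewness before and after the transform.
skew_before = values.skew()
skew_after = log_values.skew()

print(round(skew_before, 2), round(skew_after, 2))
```

The transform is invertible with `np.expm1`, which matters if predictions need to be reported on the original scale.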

- In more complex scenarios, bucketing values into discrete ranges is helpful. These buckets can then be treated as categorical variables and included in the model once one hot encoded, as explained in the [Converting categorical to numeric](#Categorical2Numeric) section.

In [None]:
bin_labels = ['first_label', 'second_label', 'third_label', 'fourth_label']

df['quantile_qcut'] = pd.qcut(df['column_name'], q=<number of quantiles>, labels=bin_labels) # Quantile-based discretization function: to divide up the underlying data into equal sized bins (the number of quantiles must match the number of labels)

cut_bins = [<first_boundary>, <second_boundary>, <third_boundary>, <fourth_boundary>, <fifth_boundary>] # N labels require N + 1 boundaries
df['quantile_cut'] = pd.cut(df['column_name'], bins=cut_bins, labels=bin_labels) # Standard discretization function: to divide up the underlying data between manually specified bins
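
With concrete numbers in place of the placeholders, the two discretization functions behave as sketched below. The values and bin boundaries are made up; the point is that `pd.qcut` derives the boundaries from the data, while `pd.cut` uses the boundaries it is given.

```python
import pandas as pd

toy_df = pd.DataFrame({"column_name": [1, 2, 3, 4, 5, 6, 7, 8]})
bin_labels = ["first_label", "second_label", "third_label", "fourth_label"]

# Quantile-based bins: four bins with (roughly) equal numbers of rows.
toy_df["quantile_qcut"] = pd.qcut(toy_df["column_name"], q=4, labels=bin_labels)

# Manually specified bins: five boundaries define four intervals,
# each open on the left and closed on the right, e.g. (0, 2].
cut_bins = [0, 2, 4, 6, 8]
toy_df["fixed_cut"] = pd.cut(toy_df["column_name"], bins=cut_bins, labels=bin_labels)

print(toy_df)
```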

- If the previous options are not enough, we might need to do some downsampling or upsampling (depending on the data volume we have at our disposal) to obtain a sufficiently balanced training set.

In [None]:
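from imblearn.under_sampling import RandomUnderSampler   # Requires the imbalanced-learn package (see requirements.txt)
from imblearn.over_sampling import SMOTE

#Downsampling
rus = RandomUnderSampler(random_state=42)
X_rus, y_rus = rus.fit_resample(<features_dataset>, <output_vector>)
print('Kept indexes: ', rus.sample_indices_)

#Upsampling (via SMOTE)
smote = SMOTE(random_state=42)
X_sm, y_sm = smote.fit_resample(<features_dataset>, <output_vector>)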
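
If the imblearn library is not available, random downsampling can be approximated with pandas alone. The sketch below is not the imblearn implementation; it simply draws, for every class, as many rows as the smallest class has. The `feature` and `label` columns are made up.

```python
import pandas as pd

# Imbalanced toy data: eight rows of class 0, two rows of class 1.
toy_df = pd.DataFrame({
    "feature": range(10),
    "label": [0] * 8 + [1] * 2,
})

# Size of the smallest class.
minority_size = toy_df["label"].value_counts().min()

# Draw the same number of rows from every class, without replacement.
balanced = pd.concat(
    [
        group.sample(n=minority_size, random_state=42)
        for _, group in toy_df.groupby("label")
    ]
)

print(balanced["label"].value_counts())
```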

<a id='DataConversion'></a>
## Data Conversion
Depending on the model our data is going to feed, some variables need to be converted, namely from categorical to numerical.
This is necessary for Neural Networks or XGBoost, for instance. Some other models (like several decision-tree-based models) might not need this step.


<a id='Categorical2Numeric'></a>
### Converting categorical to numeric
Sometimes, categorical data can be converted into numerical data, if the categories reflect some ordinal notion. In this case, one categorical column is transformed into another with ordinal numbers reflecting the rank order of the categories.

Example:
- Terrible: 1
- Poor: 2
- Medium: 3
- Good: 4
- Excellent: 5
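
A minimal sketch of this mapping, assuming the five categories above and a hypothetical `rating` column, could look like this. Values that are not in the mapping become NaN, which makes unexpected categories easy to spot.

```python
import pandas as pd

# Rank order of the categories, as in the example above.
rating_order = {"Terrible": 1, "Poor": 2, "Medium": 3, "Good": 4, "Excellent": 5}

# Toy column including one value that is not part of the scale.
toy_df = pd.DataFrame({"rating": ["Good", "Poor", "Excellent", "Unknown"]})

# map() looks each value up in the dictionary; unknown values become NaN.
toy_df["rating_rank"] = toy_df["rating"].map(rating_order)

print(toy_df)
```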

In [None]:
######################## Pseudo Code Example ########################
def DataAugmentationFunction(model_data):
    ### Define ways to augment
    ### Example: use the Smote algorithm for imbalanced datasets
    ###

    ###
    return model_data

######################## Pseudo Code Example ########################

<a id='OneHotEncoding'></a>
### One Hot Encoding (Dummy variable Creation)
This is the most common way to prepare categorical data for modelling, and it is especially necessary for XGBoost and NN implementations.

For each categorical variable, every distinct category gets its own binary variable, with value 1 if that category applies and 0 otherwise.

In [None]:
# One Hot Encode Variables (Dummy Variable Creation)
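def OneHotEncode_AllCategVar(DataFrame):
    # Numeric and boolean columns are left untouched; everything else is treated as categorical
    List_NumColumns = DataFrame.select_dtypes(include=['number', 'bool']).columns
    List_CategColumns = [col for col in DataFrame.columns if col not in List_NumColumns]
    # drop_first=True drops one category per variable to avoid redundant columns
    NewDataFrame = pd.get_dummies(DataFrame, columns = List_CategColumns, drop_first = True)
    
    return NewDataFrame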


model_data = OneHotEncode_AllCategVar(model_data)

model_data.head()
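
On a small made-up frame, the encoding produces the columns shown by the sketch below. Because of `drop_first=True`, the first category in sorted order (here `blue`) gets no column of its own and is represented by all other dummies being 0. The `color` and `size` columns are hypothetical.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "size": [1, 2, 3, 4],
})

# One binary column per category, minus the first category ("blue").
encoded = pd.get_dummies(toy_df, columns=["color"], drop_first=True)

print(encoded)
```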

---
<a id='DataNormalization'></a>
## Data Normalization
Now that we have all the features we want to include, we need to normalize data, so our model can use it correctly:
* Especially in numerical data, normalization is essential so that some variables won't be dominant in relation to others.

In [None]:
def normalize(dataframe):
    # Min-max scaling: map every column of the given data frame to the range [0, 1]
    result = dataframe.copy()
    for feature_name in dataframe.columns:
        max_value = dataframe[feature_name].max()
        min_value = dataframe[feature_name].min()
        result[feature_name] = (dataframe[feature_name] - min_value) / (max_value - min_value)
    return result

In [None]:
model_data[['column3']] = normalize(model_data[['column3']])
model_data.head()
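
Min-max scaling divides by the column range, which is zero for a constant column. The sketch below is one way to guard against that, under the assumption that mapping a constant column to 0.0 is acceptable; the function name `min_max_normalize` and the toy data are made up for this example.

```python
import pandas as pd


def min_max_normalize(dataframe):
    """Scale each column to [0, 1]; constant columns are set to 0.0."""
    result = dataframe.copy()
    for feature_name in dataframe.columns:
        min_value = dataframe[feature_name].min()
        value_range = dataframe[feature_name].max() - min_value
        if value_range == 0:
            # Avoid dividing by zero when every value is identical.
            result[feature_name] = 0.0
        else:
            result[feature_name] = (dataframe[feature_name] - min_value) / value_range
    return result


toy_df = pd.DataFrame({"a": [0.0, 5.0, 10.0], "b": [7.0, 7.0, 7.0]})
scaled = min_max_normalize(toy_df)
print(scaled)
```

If the same scaling has to be applied to new data at prediction time, the minimum and maximum computed on the training data would need to be stored and reused.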

## Export Pre-processed Data to S3

Making it available for the modelling step.

(Same function defined at the beginning for data export)

In [None]:
#### Export Pre-processed data to S3, for further action on Modelling step ####
AwsUtility.upload_csv_to_s3(model_data, '<prefix/source_file_name>')

Remove local files at the end of export step

In [None]:
filename = '<local_file_name>'                     # Placeholder: local file created during this notebook
remove_command = "rm -rf ./{}".format(filename)
!$remove_command
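
As an alternative to the shell command, the clean-up can be done from Python, which only ever deletes the single path it is given. The sketch below creates a temporary file to stand in for the local export; `local_path` is a made-up name for this example.

```python
import os
import tempfile

# Create a throwaway file standing in for the local export.
fd, local_path = tempfile.mkstemp(suffix=".csv")
os.close(fd)
existed_before = os.path.exists(local_path)

# Remove the file only if it is actually there.
if os.path.exists(local_path):
    os.remove(local_path)

print(existed_before, os.path.exists(local_path))
```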

---
<a id='preprocessing.pyFile'></a>
## _preprocessing.py_ File

The preprocessing code used in the previous cells should be condensed and run in the magic cell below. After that, it should be committed and pushed to the project repo for further use in our data science pipeline.

In [None]:
!jupyter nbconvert --output-dir='../04-deployment/' --to script preprocessing.ipynb