# Data Preprocessing

## Importing Data
- Use the parameter 'usecols' to load only the columns that are needed from the raw data
- Use the parameter 'parse_dates' to have Pandas parse date columns automatically as the data is read
- Use the parameter 'index_col' to set the index to the datetime column if this is time series data
- Chain the .query() method onto the import to keep only the rows that satisfy a condition on another column's values
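
The four options above can be combined in a single call. The following is a minimal sketch, assuming a small in-memory CSV (`csv_text`, with a made-up `STATION` column) in place of a real weather file; only the column names `DATE`, `REPORT_TYPE`, `HourlyDryBulbTemperature`, and `HourlyPrecipitation` come from the time series example in this notebook.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a raw weather file.
csv_text = """DATE,REPORT_TYPE,HourlyDryBulbTemperature,HourlyPrecipitation,STATION
2020-01-01 00:00,FM-15,41,0.00,A1
2020-01-01 00:30,FM-16,40,0.00,A1
2020-01-01 01:00,FM-15,39,0.02,A1
"""

hourly = pd.read_csv(
    io.StringIO(csv_text),
    # Load only the needed columns (STATION is skipped).
    usecols=["DATE", "REPORT_TYPE", "HourlyDryBulbTemperature", "HourlyPrecipitation"],
    parse_dates=["DATE"],   # parse the DATE column as datetimes
    index_col="DATE",       # use the parsed dates as the index
).query("REPORT_TYPE == 'FM-15'")  # keep only rows with this report type

print(hourly.shape)
print(type(hourly.index).__name__)
```

Filtering with `.query()` happens after the file has been read, so it reduces the rows that are kept rather than the rows that are parsed.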

In [2]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

raw_data = pd.read_csv('./data/TrainingSet.csv')
if isinstance(raw_data, pd.DataFrame):
    print("Data successfully imported.")
else:
    print("Data failed to import.")


test_set = pd.read_csv('./data/TestSet.csv')
if isinstance(test_set, pd.DataFrame):
    print("Test data successfully imported.")
else:
    print("Test data failed to import.")
    
# Remove unneeded columns
del raw_data['timestamp']
del test_set['timestamp']

# Time series example
# hourly_weather_data = pd.read_csv(
#     './data/raw_weather_data.csv',
#     usecols=['DATE', 'REPORT_TYPE', 'HourlyDryBulbTemperature', 'HourlyPrecipitation'],
#     parse_dates=['DATE'],
#     index_col='DATE',
# ).query("REPORT_TYPE == 'FM-15'")

print("Data Shape:",raw_data.shape) 
print("Test Shape:",test_set.shape)

Data successfully imported.
Data Shape: (20000, 379)


## Separate Data into Training, Validation, Test, and Target Sets
##### It is very important that this step is done prior to data imputation, normalization, one hot encoding or other preprocessing steps.

You should NEVER do anything that leaks information about your testing data BEFORE the split. If you normalize before splitting, the range or distribution used for scaling is calculated from the testing data as well, so information about the testing data leaks into the training data (and vice versa). This "contaminates" your data and leads to over-optimistic performance estimates on the testing data. The same holds for every preprocessing step that changes data based on all data points, including feature selection.

What you SHOULD do instead is fit the normalization on the training data only and keep the fitted preprocessing model that results. That model can then be applied to the testing data like any other model: it changes the testing data based on the training data (which is OK), but never the other way around.
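
As a minimal sketch of that rule, the example below splits first and then fits a `MinMaxScaler` on the training rows only. The `features` array is synthetic and exists only for illustration; it is not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic feature matrix, used only to illustrate the order of operations.
rng = np.random.default_rng(0)
features = rng.normal(loc=50.0, scale=10.0, size=(100, 3))

# 1. Split BEFORE any preprocessing.
train, test = train_test_split(features, test_size=0.2, random_state=0)

# 2. Fit the scaler on the training rows only.
scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)

# 3. Reuse the training min/max on the test rows; nothing is refit here.
test_scaled = scaler.transform(test)

print(train_scaled.min(axis=0), train_scaled.max(axis=0))
```

Because the minimum and maximum come from the training rows alone, scaled test values can fall slightly outside the 0 to 1 range. That is expected and is the price of keeping the test data out of the fit.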

In [4]:
from sklearn.model_selection import train_test_split

# Split the labeled data into training and validation sets (the test set was loaded from its own file)
train_set, validation_set, train_targets, validation_targets = train_test_split(raw_data, raw_data['job_performance'], test_size=0.2)

# Set target and drop from training/test set data
del train_set['job_performance']
del validation_set['job_performance']
del test_set['job_performance']

print("Training Set Shape:",train_set.shape)
print("Validation Set Shape:",validation_set.shape)
print("Test Set Shape:",test_set.shape)
print(train_targets.head(5))
print(validation_targets.head(5))

Training Set Shape: (16000, 378)
Test Set Shape: (4000, 378)
7063     2839.728205
4309     2422.636497
8102     2276.712858
16617    2498.442852
18398    2487.473308
Name: job_performance, dtype: float64
6173     2540.483752
13687    2751.641890
16952    3156.473677
8445     2727.079721
14974    3783.485098
Name: job_performance, dtype: float64


# Cleaning Data
---
- <strong>Missing Value Ratio Filter:</strong> Used to drop features that have more than a certain percentage of their rows missing
- <strong>Imputation:</strong> The process of deciding how to fill missing values. This can be done by using the mean, median, or mode for numerical data or a constant for categorical data. There are also algorithms and machine learning libraries built solely for imputation that can be used.
- <strong>Normalization / Standardization:</strong> Used to scale and center data. Which one to use depends on the dimensionality reduction techniques to be used.
- <strong>Low Variance Filter:</strong> This can be used on numerical data to remove features that are constant or have very low variance.
- <strong>One Hot Encoding:</strong> Converts each categorical feature into a set of binary indicator columns, one per category.
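
The two filters in this list can be sketched in a few lines. This is an illustration under stated assumptions, not the `MissingValueRatioFilter` from `pipeline_functions` (its implementation is not shown here): the missing value ratio filter is written directly in pandas, and the toy frame `df` with its column names is made up.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy frame with hypothetical column names.
df = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "constant": [5.0, 5.0, 5.0, 5.0],
    "useful": [0.1, 0.9, 0.4, 0.7],
})

# Missing value ratio filter: drop columns with more than 50% missing rows.
max_missing_ratio = 0.5
missing_ratio = df.isna().mean()
kept = df.loc[:, missing_ratio <= max_missing_ratio]

# Low variance filter: keep only columns whose variance exceeds the threshold.
selector = VarianceThreshold(threshold=0.01)
selector.fit(kept)
remaining = list(kept.columns[selector.get_support()])

print(remaining)
```

In a real workflow both filters would be fit on the training set only, and the same column selection would then be applied to the validation and test sets.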

## Preprocessing Pipeline

In [None]:
import pipeline_functions
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import VarianceThreshold
from pipeline_functions import Print, MissingValueRatioFilter, StartTimer, ForceToNumerical, \
ConvertToDataFrame, HighCorrelationFilter, OutputRunTime, ChangeDType

X = train_set.copy(deep=True)
V = validation_set.copy(deep=True)
T = test_set.copy(deep=True)

# Numerical transformations
numerical_missing_ratio = 0.5
variance_threshold = 0.01
numerical_dtypes = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
numerical_features = list(X.select_dtypes(include=numerical_dtypes).columns)
numerical_transformer = Pipeline(steps=[
    ('print1', Print(message="  Preprocessing:")),
    ('print2', Print(message="    Numerical: Missing Value Ratio Filter (>"+str(numerical_missing_ratio)+")")),
    ('print3', Print(message="      Starting Numerical Features: ",columns=True)),
    ('missing_value_ratio_filter', MissingValueRatioFilter(ratio_missing=numerical_missing_ratio)),
    ('print4', Print(message="      Remaining Numerical Features:",columns=True)),
    ('print5', Print(message="    Numerical: Imputation")),
    ('imputer', SimpleImputer(strategy='mean')),
    ('print6', Print(message="    Numerical: Normalization")),
    ('scaler', MinMaxScaler()),
    ('print7', Print(message="    Numerical: Low Variance Filter (<="+str(variance_threshold)+")")),
    ('print8', Print(message="      Starting Numerical Features: ",columns=True)),
    ('variance_threshold', VarianceThreshold(threshold=variance_threshold)),
    ('print9', Print(message="      Remaining Numerical Features:",columns=True))
    ])

# Categorical transformations
categorical_missing_ratio = 0.5
categorical_variance_threshold = 0.01
categorical_features = X.select_dtypes(['object']).columns
categorical_transformer = Pipeline(steps=[
    ('print0', Print(message="    Categorical: Missing Value Ratio Filter (>"+str(categorical_missing_ratio)+")")),
    ('print1', Print(message="      Starting Categorical Features: ",columns=True)),
#     ('missing_value_ratio_filter', MissingValueRatioFilter(ratio_missing=categorical_missing_ratio)),
    ('print2', Print(message="      Remaining Categorical Features:",columns=True)),
    ('print3', Print(message="    Categorical: Imputation")),
    ('imputer', SimpleImputer(strategy='constant', fill_value=0)),
    ('print4', Print(message="    Categorical: Conversion of Ints to Strings")),
    ('change_dtype', ChangeDType()),
    ('print5', Print(message="    Categorical: One Hot Encoding")),
    ('print6', Print(message="      Starting Categorical Features:",columns=True)),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
    ('print7', Print(message="      Remaining Categorical Features:",columns=True)),
    ('print8', Print(message="    Categorical: Low Variance Filter (<="+str(categorical_variance_threshold)+")")),
    ('print9', Print(message="      Starting Categorical Features: ",columns=True)),
#     ('variance_threshold', VarianceThreshold(threshold=categorical_variance_threshold)),
    ('print10', Print(message="      Remaining Categorical Features:",columns=True))
    ])

# Combine numerical and categorical data back together
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Master pipeline
high_correlation_filter_decimal = 0.9
master_pipeline = Pipeline([
    ('start_timer', StartTimer()),
    ('print1', Print(message="\nStarting Shape: " + str(X.shape))),
    ('print2', Print(message="  Forcing column 'v71' to numerical data.")),
    ('force_to_numerical', ForceToNumerical()),
    ('preprocessor', preprocessor),
    ('print3', Print(message="  Recombined Numerical & Categorical Shape: ",return_shape=True)),
    ('print4', Print(message="  Dimensionality Reduction: ")),
    ('convert_to_dataframe', ConvertToDataFrame()),
    ('print5', Print(message="    High Correlation Filter (> " + str(high_correlation_filter_decimal) + ")")),
    ('high_correlation_filter', HighCorrelationFilter(correlation_decimal=high_correlation_filter_decimal)),
    ('print6', Print(message="Final Shape:",return_shape=True)),
#     ('output_run_time', OutputRunTime(start_time=master_pipeline.named_steps['start_timer'].start_time))
])

# Run numerical data only
  # X = pd.DataFrame(numerical_transformer.fit_transform(cleaned_train_set[numerical_features]))
  # X.head(10)


##### Run on train set
##### Last runtime = 7,094 seconds
train_set_processed = pd.DataFrame(master_pipeline.fit_transform(X))
validation_set_processed = pd.DataFrame(master_pipeline.transform(V))
test_set_processed = pd.DataFrame(master_pipeline.transform(T))
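
The pipeline above depends on custom helpers from `pipeline_functions`, which are not shown in this notebook. As a rough, self-contained sketch of the same fit-on-train, transform-on-the-rest pattern, the example below uses only scikit-learn classes. The `train` and `new_rows` frames and their column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical toy data: one numerical and one categorical column.
train = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "city": pd.Series(["Oslo", "Lima", np.nan, "Oslo"], dtype=object),
})
new_rows = pd.DataFrame({
    "age": [50.0],
    "city": pd.Series(["Cairo"], dtype=object),  # category never seen in training
})

numeric = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", MinMaxScaler()),
])
categorical = Pipeline([
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer(
    [("num", numeric, ["age"]), ("cat", categorical, ["city"])],
    sparse_threshold=0.0,  # always return a dense array
)

train_out = preprocess.fit_transform(train)  # statistics learned from train only
new_out = preprocess.transform(new_rows)     # reuses the fitted statistics

print(train_out.shape, new_out.shape)
```

With `handle_unknown="ignore"`, a category that was not present during fitting is encoded as all zeros instead of raising an error, which is why the unseen city does not break `transform`.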


## Save Pipeline Fit Values to File

In [None]:
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

joblib.dump(master_pipeline, 'master_pipeline.joblib')

## Load Pipeline File & Test

In [None]:
pipeline = joblib.load('master_pipeline.joblib') 
test_set_processed = pipeline.transform(T)
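
A quick way to check that a saved pipeline survives the round trip is to compare the output of the original and the reloaded object. The sketch below assumes a small stand-in pipeline (`pipe`) and a temporary directory rather than the notebook's `master_pipeline` and file name.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small stand-in pipeline fitted on made-up data.
data = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
pipe = Pipeline([("scaler", StandardScaler())]).fit(data)

# Save to a temporary location and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_pipeline.joblib")
    joblib.dump(pipe, path)
    restored = joblib.load(path)

# The reloaded pipeline should transform data exactly like the original.
same_output = np.allclose(pipe.transform(data), restored.transform(data))
print(same_output)
```

A pipeline that contains custom transformers can only be loaded in an environment where the module defining them (here `pipeline_functions`) is importable.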

## Export Preprocessed Data

In [114]:
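# Export the processed data (X and V are the unprocessed copies; y was never defined)
train_set_processed.to_csv(r'./data/1111_Preprocessed_TrainingSet.csv', index=False)
validation_set_processed.to_csv(r'./data/1111_Preprocessed_ValidationSet.csv', index=False)
pd.DataFrame(test_set_processed).to_csv(r'./data/1111_Preprocessed_TestSet.csv', index=False)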
# train_targets.to_csv(r'./data/Preprocessed_TrainingTargets.csv', header=['job_performance'], index=False)
# test_targets.to_csv(r'./data/Preprocessed_TestingTargets.csv', header=['job_performance'], index=False)