# SageMaker Feature Engineering Using Python Code

---

## Background
Direct marketing, whether through mail, email, phone, or another channel, is a common tactic to acquire customers.  Because resources and a customer's attention are limited, the goal is to target only the subset of prospects who are likely to engage with a specific offer.  Predicting those potential customers based on readily available information like demographics, past interactions, and environmental factors is a common machine learning problem.


---

## Preparation

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data.  This should be within the same region as the Notebook Instance, training, and hosting.
- The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these.


In [None]:
import sagemaker
bucket=sagemaker.Session().default_bucket()
prefix = 'mlops/activity-1'
 
# Define IAM role
import boto3
import re
from sagemaker import get_execution_role

role = get_execution_role()

Now let's bring in the Python libraries that we'll use throughout the analysis.

In [None]:
import sagemaker_datawrangler                     # For interactive data prep widget
import numpy as np                                # For matrix operations and numerical processing
import pandas as pd                               # For munging tabular data
import matplotlib.pyplot as plt                   # For charts and visualizations
from IPython.display import Image                 # For displaying images in the notebook
from IPython.display import display               # For displaying outputs in the notebook
from time import gmtime, strftime                 # For labeling SageMaker models, endpoints, etc.
import sys                                        # For writing outputs to notebook
import math                                       # For ceiling function
import json                                       # For parsing hosting outputs
import os                                         # For manipulating filepath names
import sagemaker 

In [None]:
pd.__version__

---

## Data
Let's start by downloading the dataset from GitHub.

\[Moro et al., 2014\] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014


In [None]:
!wget https://raw.githubusercontent.com/manifoldailearning/mlops-with-aws-dfscientists/main/Section-13-Feature-Engineering/dfset/bank-additional-full.csv

Now let's read this into a pandas DataFrame and take a look. Because we imported the `sagemaker_datawrangler` library, we will automatically be able to view distributions, issues with the data, and other helpful recommendations and built-in transformations.

In [None]:
df = pd.read_csv('./bank-additional-full.csv')
pd.set_option('display.max_columns', 500)     # Make sure we can see all of the columns
pd.set_option('display.max_rows', 20)         # Keep the output on one page
df

In [None]:
for column in df.select_dtypes(include=['object']).columns:
    if column != 'y':
        print(pd.crosstab(index=df[column], columns=df['y'], normalize='columns'))

for column in df.select_dtypes(exclude=['object']).columns:
    print(column)
    hist = df[[column, 'y']].hist(by='y', bins=30)
    plt.show()

In [None]:
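print(df.select_dtypes(include='number').corr())   # Correlate numeric columns only; string columns make corr() fail on recent pandas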
pd.plotting.scatter_matrix(df, figsize=(12, 12))
plt.show()

In [None]:
# Transformation
df['no_previous_contact'] = np.where(df['pdays'] == 999, 1, 0)                                 # Indicator variable to capture when pdays takes a value of 999
df['not_working'] = np.where(df['job'].isin(['student', 'retired', 'unemployed']), 1, 0)       # Indicator for individuals not actively employed
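
To see what these two indicators produce, here is a minimal sketch on a hypothetical four-row frame. The `toy` rows are made up for illustration; only the column names `pdays` and `job` come from the dataset. It assumes, judging by the variable name, that `pdays == 999` marks prospects with no previous contact.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; only the column names match the real dataset.
toy = pd.DataFrame({
    'pdays': [999, 3, 999, 6],
    'job': ['student', 'admin.', 'retired', 'technician'],
})

# 1 when pdays holds the value 999, else 0
toy['no_previous_contact'] = np.where(toy['pdays'] == 999, 1, 0)
# 1 when the job is one of the not-actively-employed categories, else 0
toy['not_working'] = np.where(toy['job'].isin(['student', 'retired', 'unemployed']), 1, 0)

print(toy)
```

Rows one and three get a 1 in both new columns, and the other two rows get 0.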

In [None]:
model_df = pd.get_dummies(df, dtype=int)   # Convert categorical variables to sets of 0/1 indicators (dtype=int avoids True/False values in the CSV on recent pandas)
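
The later cells refer to `y_yes` and `y_no`, which are columns that `pd.get_dummies` creates from the text column `y`. A minimal sketch on a hypothetical toy frame shows how this works (the rows are invented; `marital` and `y` are column names from the dataset). Passing `dtype=int` is an assumption of this sketch: it requests 0/1 integers, since pandas 2.0 and later return booleans by default.

```python
import pandas as pd

# Hypothetical toy rows for illustration only.
toy = pd.DataFrame({
    'age': [30, 45, 52],
    'marital': ['single', 'married', 'single'],
    'y': ['no', 'yes', 'no'],
})

# Each text column becomes one 0/1 column per distinct value,
# named <column>_<value>; numeric columns pass through unchanged.
encoded = pd.get_dummies(toy, dtype=int)

print(list(encoded.columns))
print(encoded)
```

Because the target `y` is itself a text column, it is expanded into two complementary columns, `y_no` and `y_yes`. Only one of them is needed as the label.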

In [None]:
model_df= model_df.drop(['duration', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed'], axis=1)

In [None]:
# Randomly shuffle the data, then split off the first 70%, the next 20%, and the last 10%
train_df, validation_df, test_df = np.split(
    model_df.sample(frac=1, random_state=1729),
    [int(0.7 * len(model_df)), int(0.9 * len(model_df))],
)
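
The shuffle-then-split step can also be written with plain positional slicing, which makes the 70/20/10 boundaries explicit. This is a sketch on a hypothetical 100-row frame, not the notebook's own method. The names `toy`, `train_toy`, `validation_toy`, and `test_toy` are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical 100-row frame standing in for the real feature table.
toy = pd.DataFrame({'x': np.arange(100)})

# Shuffle every row reproducibly, then cut at the 70% and 90% marks.
shuffled = toy.sample(frac=1, random_state=1729)
n = len(shuffled)
train_end, validation_end = int(0.7 * n), int(0.9 * n)

train_toy = shuffled.iloc[:train_end]
validation_toy = shuffled.iloc[train_end:validation_end]
test_toy = shuffled.iloc[validation_end:]

print(len(train_toy), len(validation_toy), len(test_toy))
```

Each row lands in exactly one of the three pieces, and fixing `random_state` makes the split repeatable across runs.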

Amazon SageMaker's XGBoost container expects data in the libSVM or CSV data format.  For this example, we'll stick to CSV.  Note that the first column must be the target variable and the CSV should not include headers.  Although it is repetitive, it's easiest to do this reordering after the train/validation/test split rather than before, which avoids any misalignment issues due to random reordering.

In [None]:
pd.concat([train_df['y_yes'], train_df.drop(['y_no', 'y_yes'], axis=1)], axis=1).to_csv('train.csv', index=False, header=False)
pd.concat([validation_df['y_yes'], validation_df.drop(['y_no', 'y_yes'], axis=1)], axis=1).to_csv('validation.csv', index=False, header=False)
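
To check the resulting layout (target first, no header row, no index column), the same reordering can be applied to a hypothetical two-row frame and written to an in-memory buffer instead of a file. The toy values are invented; only the column-handling pattern mirrors the cell above.

```python
import io
import pandas as pd

# Hypothetical already-encoded rows, with both label columns present.
toy = pd.DataFrame({
    'age': [30, 45],
    'y_no': [1, 0],
    'y_yes': [0, 1],
})

# Put the target (y_yes) first and remove both label columns from the features.
ordered = pd.concat([toy['y_yes'], toy.drop(['y_no', 'y_yes'], axis=1)], axis=1)

# Write without the index and without a header row.
buffer = io.StringIO()
ordered.to_csv(buffer, index=False, header=False)
csv_text = buffer.getvalue()

print(csv_text)
```

Each output line starts with the label, followed only by feature values. Dropping `y_no` along with `y_yes` matters, because leaving it in would leak the label into the features.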

Now we'll copy the files to S3 for Amazon SageMaker's managed training to pick up.

In [None]:
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation/validation.csv')).upload_file('validation.csv')
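
One detail about the object keys: S3 keys use forward slashes, while `os.path.join` uses the local operating system's separator. That is fine on a Linux notebook instance, but the same code would produce backslashes on Windows. A minimal sketch of a portable alternative uses `posixpath` from the standard library. The helpers `s3_key` and `s3_uri` and the bucket name `example-bucket` are hypothetical, introduced only for illustration.

```python
import posixpath

def s3_key(prefix, *parts):
    """Join key components with forward slashes, regardless of the local OS."""
    return posixpath.join(prefix, *parts)

def s3_uri(bucket, key):
    """Format a bucket and key as an s3:// URI string."""
    return f"s3://{bucket}/{key}"

# 'example-bucket' is a placeholder; the prefix matches the one defined earlier.
train_key = s3_key('mlops/activity-1', 'train', 'train.csv')
validation_key = s3_key('mlops/activity-1', 'validation', 'validation.csv')

print(s3_uri('example-bucket', train_key))
print(s3_uri('example-bucket', validation_key))
```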