# Assignment 1: local development in a notebook
In this assignment you write code to process data, engineer features, train a model, and evaluate the model on the test dataset. You do all processing in the local notebook, trading scalability and reproducibility for speed of deployment and fast iterations as you experiment with different feature engineering approaches and model types.

Optionally, you can run this notebook headlessly as a SageMaker on-demand or scheduled notebook job.

Refer to the notebook [`01-idea-development.ipynb`](../01-idea-development.ipynb) for code snippets and general guidance on the exercises in this assignment.

Feel free to implement your own use case with your own dataset and model.

## Install and import packages
You can use `%` magic commands such as `%pip install` to install packages in the notebook kernel.

In [2]:
%pip install --upgrade pip
%pip install -q  xgboost sagemaker-experiments

Note: you may need to restart the kernel to use updated packages.
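
Once the packages are installed and the kernel is restarted, one way to confirm which versions are active is to query the installed distribution metadata. This is a minimal sketch: the helper name `installed_versions` is hypothetical, and the package list is only an example.

```python
from importlib import metadata


def installed_versions(packages):
    """Return a {package: version or None} mapping for the given distribution names."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this kernel
    return versions


print(installed_versions(["xgboost", "sagemaker", "sagemaker-experiments"]))
```

A `None` entry means the package is not visible to the current kernel, which usually points to a missing install or a kernel that has not been restarted.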


In [3]:
import pandas as pd
import numpy as np 
import json
import joblib
import xgboost as xgb
import sagemaker
import boto3
import os
from time import gmtime, strftime, sleep
from sklearn.metrics import roc_auc_score
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent
from smexperiments.tracker import Tracker

sagemaker.__version__

'2.165.0'

In [4]:
session = sagemaker.Session()
sm = session.sagemaker_client

## Load data
- Create variables to keep literal constants, like file names and paths
- Load data from a local file to a Pandas dataframe
- Explore the data

In [5]:
# Load the semicolon-delimited CSV file into a Pandas dataframe
df_data = pd.read_csv("data/bank-additional/bank-additional-full.csv", sep=";")
df_data

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41183,73,retired,married,professional.course,no,yes,no,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,yes
41184,46,blue-collar,married,professional.course,no,no,no,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,no
41185,56,retired,married,university.degree,no,yes,no,cellular,nov,fri,...,2,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,no
41186,44,technician,married,professional.course,no,no,no,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,yes
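
For the "Explore the data" step, a short summary of shape, missing values, non-numeric columns, and class balance is often a useful starting point. The sketch below is only an illustration: the helper `summarize` is hypothetical, and the sample frame contains made-up values rather than rows from the dataset.

```python
import pandas as pd


def summarize(df, target):
    """Collect a few basic EDA facts about a dataframe."""
    non_numeric = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]
    return {
        "shape": df.shape,
        "missing_values": int(df.isna().sum().sum()),
        "non_numeric_columns": non_numeric,
        "target_share": df[target].value_counts(normalize=True).to_dict(),
    }


# Tiny stand-in for the loaded data (values are made up)
df_sample = pd.DataFrame(
    {
        "age": [56, 30, 41, 25],
        "job": ["housemaid", "student", "retired", "admin."],
        "y": ["no", "yes", "no", "no"],
    }
)
summary = summarize(df_sample, target="y")
print(summary)
```

In the notebook, the same helper could be called as `summarize(df_data, target="y")`; the share of `yes` labels shows how imbalanced the target is.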


## [Optional] create an experiment
Use the [Amazon SageMaker Experiments Python SDK](https://sagemaker-experiments.readthedocs.io/en/latest/) to create and manage your experiments.

In [33]:
# Write code to create an experiment
experiment_name = "mohammadexperiment"
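
The cell above only assigns a name. A minimal sketch of creating the experiment itself with the `smexperiments` package might look as follows. The helper names, the `assignment-1` prefix, and the description text are assumptions, and the `Experiment.create` call needs AWS credentials, so the sketch defines it without calling it.

```python
from time import gmtime, strftime


def make_experiment_name(prefix):
    """Append a UTC timestamp so that repeated runs do not collide on the name."""
    return f"{prefix}-{strftime('%d-%H-%M-%S', gmtime())}"


def create_experiment(name, sagemaker_client):
    """Create the experiment; requires AWS credentials and sagemaker-experiments."""
    from smexperiments.experiment import Experiment  # imported lazily on purpose

    return Experiment.create(
        experiment_name=name,
        description="Local notebook development runs",  # hypothetical description
        sagemaker_boto_client=sagemaker_client,
    )


demo_experiment_name = make_experiment_name("assignment-1")  # hypothetical prefix
print(demo_experiment_name)
```

In the notebook, `create_experiment(experiment_name, sm)` would use the `sm` client created earlier. Experiment names must be unique within an account and Region, which is why the sketch adds a timestamp.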

## Exercise 1: EDA and feature engineering
- Implement data processing
- Implement EDA
- Implement feature engineering

In [7]:
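# Exercise 1 - write code here
# pdays == 999 means the client was not previously contacted
df_data["no_previous_contact"] = np.where(df_data["pdays"] == 999, 1, 0)
# Flag clients who are not in active employment
df_data["not_working"] = np.where(df_data["job"].isin(["student", "retired", "unemployed"]), 1, 0)
# Drop the call duration and the macroeconomic indicator columns
df_model_data = df_data.drop(
    ["duration", "emp.var.rate", "cons.price.idx", "cons.conf.idx", "euribor3m", "nr.employed"],
    axis=1,
)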

In [8]:
df_model_data.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,campaign,pdays,previous,poutcome,y,no_previous_contact,not_working
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,1,999,0,nonexistent,no,1,0
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,1,999,0,nonexistent,no,1,0
2,37,services,married,high.school,no,yes,no,telephone,may,mon,1,999,0,nonexistent,no,1,0
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,1,999,0,nonexistent,no,1,0
4,56,services,married,high.school,no,no,yes,telephone,may,mon,1,999,0,nonexistent,no,1,0


In [9]:
df_model_data = pd.get_dummies(df_model_data)

In [10]:
df_model_data.head()

Unnamed: 0,age,campaign,pdays,previous,no_previous_contact,not_working,job_admin.,job_blue-collar,job_entrepreneur,job_housemaid,...,day_of_week_fri,day_of_week_mon,day_of_week_thu,day_of_week_tue,day_of_week_wed,poutcome_failure,poutcome_nonexistent,poutcome_success,y_no,y_yes
0,56,1,999,0,1,0,0,0,0,1,...,0,1,0,0,0,0,1,0,1,0
1,57,1,999,0,1,0,0,0,0,0,...,0,1,0,0,0,0,1,0,1,0
2,37,1,999,0,1,0,0,0,0,0,...,0,1,0,0,0,0,1,0,1,0
3,40,1,999,0,1,0,1,0,0,0,...,0,1,0,0,0,0,1,0,1,0
4,56,1,999,0,1,0,0,0,0,0,...,0,1,0,0,0,0,1,0,1,0


In [11]:
target_col = "y"

In [12]:
df_model_data = pd.concat(
    [
        df_model_data["y_yes"].rename(target_col),
        df_model_data.drop(["y_no", "y_yes"], axis=1),
    ],
    axis=1,
)

In [13]:
df_model_data.head()

Unnamed: 0,y,age,campaign,pdays,previous,no_previous_contact,not_working,job_admin.,job_blue-collar,job_entrepreneur,...,month_oct,month_sep,day_of_week_fri,day_of_week_mon,day_of_week_thu,day_of_week_tue,day_of_week_wed,poutcome_failure,poutcome_nonexistent,poutcome_success
0,0,56,1,999,0,1,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0
1,0,57,1,999,0,1,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0
2,0,37,1,999,0,1,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0
3,0,40,1,999,0,1,0,1,0,0,...,0,0,0,1,0,0,0,0,1,0
4,0,56,1,999,0,1,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0
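
The feature engineering cells above can be gathered into one function, which makes the steps easier to rerun on a fresh copy of the data. This is a sketch under assumptions: the function name `engineer_features` is hypothetical, the sample rows are made up, and the column-dropping step is left out for brevity. Recent pandas versions return boolean dummy columns by default, so the sketch passes `dtype=int` to keep the 0/1 encoding shown in the tables.

```python
import numpy as np
import pandas as pd


def engineer_features(df, target_col="y"):
    """Add indicator features, one-hot encode, and move the target to the first column."""
    df = df.copy()  # leave the caller's dataframe untouched
    df["no_previous_contact"] = np.where(df["pdays"] == 999, 1, 0)
    df["not_working"] = np.where(df["job"].isin(["student", "retired", "unemployed"]), 1, 0)
    encoded = pd.get_dummies(df, dtype=int)
    target = encoded[f"{target_col}_yes"].rename(target_col)
    features = encoded.drop([f"{target_col}_no", f"{target_col}_yes"], axis=1)
    return pd.concat([target, features], axis=1)


# Made-up rows that mimic the relevant columns
df_demo = pd.DataFrame(
    {
        "age": [56, 30, 41],
        "job": ["housemaid", "student", "retired"],
        "pdays": [999, 6, 999],
        "y": ["no", "yes", "no"],
    }
)
df_demo_model = engineer_features(df_demo)
print(df_demo_model.columns.tolist())
```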


## Exercise 2: Split data
- Prepare data for training, split the dataset
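
The shapes printed in the next cell are consistent with a shuffled 70/20/10 split of the 41,188 rows into training, validation, and test sets. The notebook does not show the split code, so the following is a sketch of one way to produce such a split; the function name `split_dataset`, the cut-point parameters, and the seed are assumptions.

```python
import numpy as np
import pandas as pd


def split_dataset(df, train_end=0.7, validation_end=0.9, seed=1729):
    """Shuffle the rows, then cut at the given cumulative fractions."""
    shuffled = df.sample(frac=1, random_state=seed)
    n = len(shuffled)
    i, j = int(train_end * n), int(validation_end * n)
    return shuffled.iloc[:i], shuffled.iloc[i:j], shuffled.iloc[j:]


# Stand-in frame with the same number of rows as the dataset
df_demo = pd.DataFrame({"y": np.zeros(41188, dtype=int), "age": np.arange(41188)})
train_demo, validation_demo, test_demo = split_dataset(df_demo)
print(train_demo.shape, validation_demo.shape, test_demo.shape)
# (28831, 2) (8238, 2) (4119, 2)
```

In the notebook, `train_data, validation_data, test_data = split_dataset(df_model_data)` would define the variables used in the following cells. Fixing the seed keeps the split reproducible between runs.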

In [15]:
print(train_data.shape,validation_data.shape,test_data.shape)

(28831, 60) (8238, 60) (4119, 60)


## Exercise 3: Model training
- Train the model
- Optional: track your model training runs as trials and trial components
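
For the optional tracking step, one possible approach with the `smexperiments` package is to log hyperparameters and final metrics through a `Tracker` and attach the resulting trial component to a trial. The sketch below assumes this workflow; the helper names `record_run` and `track_training_run` and the display name are hypothetical, and the second function needs AWS credentials, so it is defined but not called.

```python
def record_run(tracker, hyperparams, metrics):
    """Log hyperparameters and final metrics on a Tracker-like object."""
    tracker.log_parameters(hyperparams)
    for name, value in metrics.items():
        tracker.log_metric(metric_name=name, value=value)


def track_training_run(experiment_name, sagemaker_client, hyperparams, metrics):
    """Create a trial in an existing experiment and attach one tracked component."""
    from smexperiments.tracker import Tracker  # imported lazily on purpose
    from smexperiments.trial import Trial

    trial = Trial.create(
        experiment_name=experiment_name,
        sagemaker_boto_client=sagemaker_client,
    )
    with Tracker.create(
        display_name="training",  # hypothetical display name
        sagemaker_boto_client=sagemaker_client,
    ) as tracker:
        record_run(tracker, hyperparams, metrics)
    trial.add_trial_component(tracker.trial_component)
    return trial
```

In the notebook, a call such as `track_training_run(experiment_name, sm, hyperparams, {"test_auc": test_auc})` would be made after training and evaluation, once the experiment exists.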

In [16]:
# Exercise 3 - write code here
# Separate the features from the target column
X_train = train_data.drop(target_col, axis=1)
y_train = pd.DataFrame(train_data[target_col])

In [17]:
dtrain = xgb.DMatrix(X_train, label=y_train)

In [27]:
X_test = test_data.drop(target_col, axis=1)
y_test = pd.DataFrame(test_data[target_col])
dtest = xgb.DMatrix(X_test, label=y_test)

In [28]:
hyperparams = {
                "max_depth": 5,
                "eta": 0.5,
                "alpha": 2.5,
                "objective": "binary:logistic",
                "subsample" : 0.8,
                "colsample_bytree" : 0.8,
                "min_child_weight" : 3
              }

num_boost_round = 150
nfold = 3
early_stopping_rounds = 10

In [29]:
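# Early stopping monitors the last entry in evals, which here is the test set;
# passing validation_data there instead would keep the test set unseen.
model = xgb.train(
    params=hyperparams,
    dtrain=dtrain,
    evals=[(dtrain, "train"), (dtest, "eval")],
    num_boost_round=num_boost_round,
    early_stopping_rounds=early_stopping_rounds,
    verbose_eval=0,
)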


In [30]:
train_pred = model.predict(dtrain)

In [31]:
test_pred = model.predict(dtest)

## Exercise 4: Validate model
- Validate the model on the test dataset

In [32]:
# Exercise 4 - write code here

test_auc = roc_auc_score(y_test, test_pred)
train_auc = roc_auc_score(y_train, train_pred)
print(f"Train-auc:{train_auc:.2f}, Test-auc:{test_auc:.2f}")

Train-auc:0.81, Test-auc:0.76
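
To interpret these numbers, it can help to recall that ROC AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes this directly in plain Python; the function name `pairwise_auc` is hypothetical, and the quadratic loop is meant for illustration, not for large datasets.

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


print(pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

Under this reading, a test AUC of 0.76 means the model ranks a positive example above a negative one in roughly three out of four such pairs.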


## Exercise 5: [Optional] explore your experiment in Studio
Refer to the [View and Compare Amazon SageMaker Experiments, Trials, and Trial Components](https://docs.aws.amazon.com/sagemaker/latest/dg/experiments-view-compare.html) developer guide to learn how to work with experiments and trials.

## Exercise 6: [Optional] run the notebook as a SageMaker job
Adapt your notebook code and follow the instructions in [Notebook-based Workflows](https://docs.aws.amazon.com/sagemaker/latest/dg/notebook-auto-run.html) to run this notebook headlessly as an on-demand SageMaker job.

## Continue with assignment 2
Navigate to the [assignment 2](02-assignment-sagemaker-containers.ipynb) notebook.