# AutoGluon-Tabular in AWS Marketplace

This notebook originated as a Korean translation of the following AWS example:
- https://github.com/aws/amazon-sagemaker-examples/tree/master/aws_marketplace/using_algorithms/autogluon

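[AutoGluon](https://github.com/awslabs/autogluon) automates machine learning so that you can easily bring strong predictive performance to your applications. With just a few lines of code, you can train and deploy high-accuracy deep learning models on tabular, image, and text data. This notebook shows how to apply AutoGluon-Tabular from AWS Marketplace to tabular data.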


### Contents:
* [Step 1: Subscribe to AutoML algorithm from AWS Marketplace](#Step-1:-Subscribe-to-AutoML-algorithm-from-AWS-Marketplace)
* [Step 2: Set up environment](#Step-2-:-Set-up-environment)
* [Step 3: Get the data](#Step-3:-Get-the-data)
* [Step 4: Train a model](#Step-4:-Train-a-model)
* [Step 5: Deploy the model and perform a real-time inference](#Step-5:-Deploy-the-model-and-perform-a-real-time-inference)
* [Step 6: Use Batch Transform](#Step-6:-Use-Batch-Transform)
* [Step 7: Clean-up](#Step-7:-Clean-up)

### Step 1: Subscribe to AutoML algorithm from AWS Marketplace

1. Go to AWS Marketplace and open the [AutoGluon-Tabular](https://aws.amazon.com/marketplace/pp/prodview-n4zf5pmjt7ism) listing page.
2. Read the **Highlights** and **product overview** sections. (They describe the algorithm's purpose, behavior, and strengths.)
3. Review the **usage information** and **additional resources** sections. (They explain how to use the algorithm.)
4. Note the supported instance types. You will set one of them in a later cell of this notebook.
5. Click the **Continue to subscribe** button.
6. Read the **End user license agreement**, **support terms**, and **pricing information**.
7. If your organization agrees with the license, pricing, and support terms for the algorithm, click the **Accept offer** button.

**Notes**:
1. Once the **Continue to configuration** button becomes active, your account is subscribed to the listing.
2. Click **Continue to configuration** and select a region to see the `Product Arn`. This is the algorithm ARN to use in your training job. (In this notebook, the per-region ARN values are already stored in **src/algorithm_arns.py**, so you do not need to set the ARN yourself.)
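
The contents of **src/algorithm_arns.py** are not shown in this notebook. As a rough sketch only, a region-to-ARN lookup could look like the following. The function and dictionary names here are hypothetical, and the account ID and algorithm name are placeholders rather than real values; the actual `Product Arn` comes from the Marketplace configuration page.

```python
# Hypothetical sketch of a region -> algorithm ARN lookup.
# The account ID and algorithm name are placeholders, not real values.
PLACEHOLDER_ARNS = {
    "us-east-1": "arn:aws:sagemaker:us-east-1:<account-id>:algorithm/<algorithm-name>",
    "us-west-2": "arn:aws:sagemaker:us-west-2:<account-id>:algorithm/<algorithm-name>",
}


def get_algorithm_arn(region: str, arns: dict = PLACEHOLDER_ARNS) -> str:
    """Return the ARN registered for `region`, or fail with a clear message."""
    try:
        return arns[region]
    except KeyError:
        supported = ", ".join(sorted(arns))
        raise ValueError(
            f"No ARN configured for region {region!r}; supported: {supported}"
        ) from None


print(get_algorithm_arn("us-east-1"))
```

Failing loudly for an unknown region is a design choice of this sketch: it surfaces a missing subscription region before a training job is submitted.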


### Step 2 : Set up environment

In [10]:
# Import the latest sagemaker and boto3 SDKs
import sys
!{sys.executable} -m pip install --upgrade pip
!{sys.executable} -m pip install -qU awscli boto3 "sagemaker>=2.0.0" tqdm
!{sys.executable} -m pip show sagemaker

Requirement already up-to-date: pip in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (20.2.4)
Name: sagemaker
Version: 2.16.3
Summary: Open source library for training and deploying models on Amazon SageMaker.
Home-page: https://github.com/aws/sagemaker-python-sdk/
Author: Amazon Web Services
Author-email: None
License: Apache License 2.0
Location: /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages
Requires: boto3, smdebug-rulesconfig, protobuf, google-pasta, protobuf3-to-dict, importlib-metadata, numpy, packaging
Required-by: 


In [1]:
#Import necessary libraries.
import os
import boto3
import sagemaker
from time import sleep
from collections import Counter
import numpy as np
import pandas as pd
from sagemaker import get_execution_role, local, Model, utils, fw_utils, s3
from sagemaker import AlgorithmEstimator
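# SageMaker Python SDK v2 names (RealTimePredictor and csv_serializer are v1 names)
from sagemaker.predictor import Predictor
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import StringDeserializer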
from sklearn.metrics import accuracy_score, classification_report
from IPython.core.display import display, HTML
from IPython.core.interactiveshell import InteractiveShell

# Print settings
InteractiveShell.ast_node_interactivity = "all"
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 10)

# Account/s3 setup
session = sagemaker.Session()
bucket = session.default_bucket()
prefix = 'sagemaker/autogluon-tabular'
region = session.boto_region_name
role = get_execution_role()


In [2]:
compatible_training_instance_type='ml.m5.4xlarge' 
compatible_inference_instance_type='ml.m5.4xlarge' 

In [3]:
#Specify algorithm ARN for AutoGluon-Tabular from AWS Marketplace.  However, for this notebook, the algorithm ARN 
#has been specified in src/algorithm_arns.py file and you do not need to specify the same explicitly.

from src.algorithm_arns import AlgorithmArnProvider

algorithm_arn = AlgorithmArnProvider.get_algorithm_arn(region)

### Step 3: Get the data

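This sample uses a direct marketing dataset to build a binary classification model that predicts whether a customer will accept or reject a marketing offer.
First, download the data and split it into a training set and a test set. You do not need to create a separate validation set when using AutoGluon, because it handles validation internally (for example, through k-fold cross-validation).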
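
The split described above can be tried on a small made-up frame first. This is a minimal sketch using the same `sample` / `drop` pattern as the notebook; the toy columns and values are assumptions for illustration and are not part of the bank marketing dataset.

```python
import pandas as pd

# Toy frame standing in for bank-additional-full.csv (values are made up).
data = pd.DataFrame({"age": range(20, 30), "y": ["no", "yes"] * 5})

# Same pattern as the notebook: sample 70% for training, keep the rest for testing.
train = data.sample(frac=0.7, random_state=42)
test = data.drop(train.index)

# Separate features and labels for evaluation.
label = "y"
X_test = test.drop(columns=[label])
y_test = test[label]

# The two parts are disjoint and together cover every row.
assert set(train.index).isdisjoint(test.index)
assert len(train) + len(test) == len(data)
print(train.shape, test.shape, X_test.shape)
```

Because `drop(train.index)` relies on the index being unique, this pattern assumes the frame has not been concatenated from pieces with repeated index values.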


In [4]:
# Download and unzip the data
!aws s3 cp --region {region} s3://sagemaker-sample-data-{region}/autopilot/direct_marketing/bank-additional.zip .
!unzip -qq -o bank-additional.zip
!rm bank-additional.zip

local_data_path = './bank-additional/bank-additional-full.csv'
data = pd.read_csv(local_data_path)

# Split train/test data
train = data.sample(frac=0.7, random_state=42)
test = data.drop(train.index)

# Split test X/y
label = 'y'
y_test = test[label]
X_test = test.drop(columns=[label])

download: s3://sagemaker-sample-data-us-east-1/autopilot/direct_marketing/bank-additional.zip to ./bank-additional.zip


#### Check the data

- The `train` and `test` data include the label column `y`.
- The `X_test` data does not include the label column `y`.


In [5]:
train.head(3)
train.shape

test.head(3)
test.shape

X_test.head(3)
X_test.shape

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
32884,57,technician,married,high.school,no,no,yes,cellular,may,mon,371,1,999,1,failure,-1.8,92.893,-46.2,1.299,5099.1,no
3169,55,unknown,married,unknown,unknown,yes,no,telephone,may,thu,285,2,999,0,nonexistent,1.1,93.994,-36.4,4.86,5191.0,no
32206,33,blue-collar,married,basic.9y,no,no,no,cellular,may,fri,52,1,999,1,failure,-1.8,92.893,-46.2,1.313,5099.1,no


(28832, 21)

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
2,37,services,married,high.school,no,yes,no,telephone,may,mon,226,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
9,25,services,single,high.school,no,yes,no,telephone,may,mon,50,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
10,41,blue-collar,married,unknown,unknown,no,no,telephone,may,mon,55,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


(12356, 21)

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed
2,37,services,married,high.school,no,yes,no,telephone,may,mon,226,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0
9,25,services,single,high.school,no,yes,no,telephone,may,mon,50,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0
10,41,blue-collar,married,unknown,unknown,no,no,telephone,may,mon,55,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0


(12356, 20)

Upload the data to S3.

In [6]:
train_file = 'train.csv'
train.to_csv(train_file,index=False)
train_s3_path = session.upload_data(train_file, key_prefix='{}/data'.format(prefix))

test_file = 'test.csv'
test.to_csv(test_file,index=False)
test_s3_path = session.upload_data(test_file, key_prefix='{}/data'.format(prefix))

X_test_file = 'X_test.csv'
X_test.to_csv(X_test_file,index=False)
X_test_s3_path = session.upload_data(X_test_file, key_prefix='{}/data'.format(prefix))
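
For orientation, `upload_data` returns the S3 URI of the uploaded object, and those URIs are what the training and batch transform steps consume. The helper below is a hypothetical sketch of how such a URI is composed for a single file; the bucket name is made up, and the real value comes from `session.default_bucket()`.

```python
import posixpath


def expected_s3_uri(bucket: str, key_prefix: str, local_path: str) -> str:
    """Build the URI that a single file uploaded under key_prefix would get."""
    filename = posixpath.basename(local_path)
    return f"s3://{bucket}/{key_prefix}/{filename}"


prefix = "sagemaker/autogluon-tabular"
# "my-example-bucket" is a placeholder bucket name.
print(expected_s3_uri("my-example-bucket", f"{prefix}/data", "train.csv"))
```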

### Step 4: Train a model

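Now train the model.

**Note:** You may need to adjust the training volume size (`train_volume_size`, named `volume_size` in SageMaker Python SDK v2) so that enough disk space is allocated.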


In [7]:
# Define required label and optional additional parameters
fit_args = {
  'label': 'y',
  # Adding 'best_quality' to presets list will result in better performance (but longer runtime)
  'presets': ['optimize_for_deployment'],
}

# Pass fit_args to SageMaker estimator hyperparameters
hyperparameters = {
  'fit_args': fit_args,
  'feature_importance': True
}
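
Hyperparameters reach the training container as strings, which is why the training log for this job reports that `fit_args` and `feature_importance` could not be parsed as JSON and that the value itself is returned. The sketch below illustrates the likely cause under one assumption: that the nested dictionary is sent as `str(dict)`, whose single quotes are not valid JSON. The `parse_hyperparameter` fallback is hypothetical and is not the container's actual code.

```python
import ast
import json

fit_args = {"label": "y", "presets": ["optimize_for_deployment"]}

# Assumption: the value arrives as str(dict), which uses single quotes
# and is therefore not valid JSON.
as_sent = str(fit_args)


def parse_hyperparameter(value: str):
    """Try JSON first, then a Python literal, then fall back to the raw string."""
    try:
        return json.loads(value)
    except json.JSONDecodeError:
        pass
    try:
        return ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return value


print(parse_hyperparameter(as_sent))               # recovered by the literal fallback
print(parse_hyperparameter(json.dumps(fit_args)))  # parsed directly as JSON
print(parse_hyperparameter("True"))                # Python literal, not JSON
```

The log messages are informational, so they do not by themselves indicate that training failed.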

In [11]:
algo = AlgorithmEstimator(algorithm_arn=algorithm_arn, 
                          role=role, 
                          instance_count=1, 
                          instance_type=compatible_training_instance_type, 
                          sagemaker_session=session, 
                          base_job_name='autogluon',
                          hyperparameters=hyperparameters,
                          volume_size=100)  # SDK v2 name; called train_volume_size in SDK v1

inputs = {'training': train_s3_path}

algo.fit(inputs)

2020-11-16 11:34:31 Starting - Starting the training job...
2020-11-16 11:34:32 Starting - Launching requested ML instances......
2020-11-16 11:35:47 Starting - Preparing the instances for training...
2020-11-16 11:36:18 Downloading - Downloading input data...
2020-11-16 11:36:35 Training - Downloading the training image......
2020-11-16 11:37:52 Training - Training image download completed. Training in progress.
2020-11-16 11:37:52,078 sagemaker-training-toolkit INFO     Imported framework sagemaker_mxnet_container.training
2020-11-16 11:37:52,080 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2020-11-16 11:37:52,080 sagemaker-training-toolkit INFO     Failed to parse hyperparameter fit_args value {'label': 'y', 'presets': ['optimize_for_deployment']} to Json.
Returning the value itself
2020-11-16 11:37:52,080 sagemaker-training-toolkit INFO     Failed to parse hyperparameter feature_importance value True to J

### Step 5: Deploy the model and perform a real-time inference

#### Deploy an endpoint for inference

In [14]:
%%time

from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import StringDeserializer

predictor = algo.deploy(1, 
                        compatible_inference_instance_type, 
                        serializer=CSVSerializer(), 
                        deserializer=StringDeserializer())

..........
-------------!CPU times: user 228 ms, sys: 12.1 ms, total: 240 ms
Wall time: 7min 18s


#### Run predictions on the unlabeled test dataset

Endpoint creation and invocation results can also be monitored in CloudWatch.
- In the SageMaker console, open the Endpoints menu, click the endpoint, and then click `View Logs` to jump to CloudWatch.
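
The same logs can be reached programmatically. The sketch below builds the CloudWatch Logs group name that SageMaker uses for real-time endpoints and, optionally, fetches messages with boto3. The helper names and the example endpoint name are hypothetical, and `fetch_endpoint_messages` needs AWS credentials and an existing endpoint, so it is defined here but not called.

```python
def endpoint_log_group(endpoint_name: str) -> str:
    """CloudWatch Logs group that SageMaker uses for a real-time endpoint."""
    return f"/aws/sagemaker/Endpoints/{endpoint_name}"


def fetch_endpoint_messages(endpoint_name: str, limit: int = 20) -> list:
    """Fetch up to `limit` log messages for the endpoint (needs AWS credentials)."""
    import boto3  # imported lazily so the helper above works without boto3

    logs = boto3.client("logs")
    response = logs.filter_log_events(
        logGroupName=endpoint_log_group(endpoint_name), limit=limit
    )
    return [event["message"] for event in response["events"]]


# "autogluon-example-endpoint" is a made-up endpoint name.
print(endpoint_log_group("autogluon-example-endpoint"))
```

In this notebook, the endpoint name would come from `predictor.endpoint_name`.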

In [19]:
results = predictor.predict(X_test.to_csv(index=False)).splitlines()

# Check output
print(Counter(results))

Counter({'no': 11384, 'yes': 972})
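
The endpoint returns one predicted label per line, which is why the cell above calls `splitlines()` before counting. A minimal sketch of that parsing step, using a made-up response string in place of a real `predictor.predict(...)` call:

```python
from collections import Counter


def summarize_predictions(response: str) -> Counter:
    """Count predicted labels in a newline-separated response body."""
    labels = [line.strip() for line in response.splitlines() if line.strip()]
    return Counter(labels)


fake_response = "no\nno\nyes\nno\n"  # made-up stand-in for predictor.predict(...)
counts = summarize_predictions(fake_response)
share_yes = counts["yes"] / sum(counts.values())
print(counts, f"share of 'yes': {share_yes:.2f}")
```

Applied to the counts printed above, 972 of 12356 predictions are `yes`, or roughly 7.9%.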


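#### Run predictions on data that includes the label

When the input includes the label column, the endpoint log also shows performance metrics.
- Check that CloudWatch Logs contains entries similar to the following.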

```
...
2020-11-16 14:36:39,356 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - "precision": 0.9108522195098212,
2020-11-16 14:36:39,356 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - "recall": 0.9189057947555843,
2020-11-16 14:36:39,356 [INFO ] W-9000-model ACCESS_LOG - /127.0.0.1:50338 "POST /invocations HTTP/1.1" 200 2346
2020-11-16 14:36:39,356 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - "f1-score": 0.9124150939348481,
2020-11-16 14:36:39,356 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - "support": 12356
2020-11-16 14:36:39,356 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -         }
```

In [18]:
results = predictor.predict(test.to_csv(index=False)).splitlines()

# Check output
print(Counter(results))

Counter({'no': 11384, 'yes': 972})


#### Check the endpoint's performance metrics directly

Compare the predictions with the actual values to check the real classification performance.


In [17]:
y_results = np.array(results)

print("accuracy: {}".format(accuracy_score(y_true=y_test, y_pred=y_results)))
print(classification_report(y_true=y_test, y_pred=y_results, digits=6))

accuracy: 0.9189057947555843
              precision    recall  f1-score   support

          no   0.937368  0.973631  0.955156     10960
         yes   0.702675  0.489255  0.576858      1396

    accuracy                       0.918906     12356
   macro avg   0.820022  0.731443  0.766007     12356
weighted avg   0.910852  0.918906  0.912415     12356
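
To see where the `yes` row of the report comes from, the metrics can be recomputed from confusion-matrix counts. The counts below are not printed by the notebook; they are reconstructed from the reported figures (support of 1396 for `yes` and 10960 for `no`, and the 972 predicted `yes` labels), so treat them as an inference rather than notebook output.

```python
# Counts reconstructed from the report above, with 'yes' as the positive class.
tp, fp, fn, tn = 683, 289, 713, 10671

precision = tp / (tp + fp)  # of the predicted 'yes', how many were right
recall = tp / (tp + fn)     # of the actual 'yes', how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision={precision:.6f} recall={recall:.6f} f1={f1:.6f} accuracy={accuracy:.6f}")
```

The gap between the overall accuracy and the recall for `yes` reflects the class imbalance: most rows are `no`, so a model can score well overall while still missing about half of the `yes` cases.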



### Step 6: Use Batch Transform

This time, run inference as a batch job. Because the test dataset includes the label column, the predictive performance can be evaluated. (`test_s3_path` is passed as the parameter rather than `X_test_s3_path`.)

In [20]:
output_path = f's3://{bucket}/{prefix}/output/'

transformer = algo.transformer(instance_count=1, 
                               instance_type=compatible_inference_instance_type,
                               strategy='MultiRecord',
                               max_payload=6,
                               max_concurrent_transforms=1,                              
                               output_path=output_path)

transformer.transform(test_s3_path, content_type='text/csv', split_type='Line')
transformer.wait()

..........
............................
2020-11-16T14:53:27.608:[sagemaker logs]: MaxConcurrentTransforms=1, MaxPayloadInMB=6, BatchStrategy=MULTI_RECORD
2020-11-16 14:53:27,207 [INFO ] main com.amazonaws.ml.mms.ModelServer - 
MMS Home: /usr/local/lib/python3.6/site-packages
Current directory: /
Temp directory: /home/model-server/tmp
Number of GPUs: 0
Number of CPUs: 16
Max heap size: 13346 M
Python executable: /usr/local/bin/python3.6
Config file: /etc/sagemaker-mms.properties
Inference address: http://0.0.0.0:8080
Management address: http://0.0.0.0:8080
Model Store: /.sagemaker/mms/models
Initial Models: ALL
Log dir: /logs
Metrics dir: /logs
Netty threads: 0
Netty client threads: 0
Default workers per model: 16
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500

In [25]:
!aws s3 ls {output_path}

2020-11-16 14:53:35      38040 test.csv.out


In [28]:
!aws s3 cp {output_path}test.csv.out test.csv.out

download: s3://sagemaker-us-east-1-308961792850/sagemaker/autogluon-tabular/output/test.csv.out to ./test.csv.out


In [32]:
!head -n 5 test.csv.out

no
no
no
no
no
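
The output file holds one prediction per line; its size of 38040 bytes is consistent with 11384 `no` lines and 972 `yes` lines, the same counts as the real-time endpoint. A minimal sketch of scoring such a file against the true labels follows. It assumes the predictions are in the same order as the input rows, and it uses a made-up temporary file and labels in place of `test.csv.out` and `y_test`.

```python
import os
import tempfile


def read_predictions(path: str) -> list:
    """Read one predicted label per line, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def accuracy(y_true, y_pred) -> float:
    """Fraction of positions where prediction and label agree."""
    if len(y_true) != len(y_pred):
        raise ValueError("prediction count does not match label count")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up stand-in for test.csv.out and the matching labels.
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "test.csv.out")
    with open(out_path, "w", encoding="utf-8") as handle:
        handle.write("no\nno\nyes\nno\n")
    predictions = read_predictions(out_path)

labels = ["no", "yes", "yes", "no"]
print(accuracy(labels, predictions))
```

In the notebook, the equivalent call would be `accuracy(list(y_test), read_predictions('test.csv.out'))`.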


### Step 7: Clean-up

When you have finished running predictions, delete the endpoint to avoid further charges.


In [39]:
predictor.delete_endpoint()

In [40]:
#Finally, delete the model you created.
predictor.delete_model()

Finally, if you subscribed to the AWS Marketplace listing only for testing, you can unsubscribe once you are done.
Before you cancel the subscription, make sure that you have no deployable [model](https://console.aws.amazon.com/sagemaker/home#/models) created from the algorithm or model package. You can check this by looking at the container associated with each model.

To unsubscribe from AWS Marketplace:
1. Go to the __Machine Learning__ tab on [__Your Software subscriptions page__](https://aws.amazon.com/marketplace/ai/library?productType=ml&ref_=lbr_tab_ml).
1. Locate the listing whose subscription you want to cancel, and then click __Cancel Subscription__.
