## Preparing a custom algorithm container in ECR for SageMaker

## References
- [Amazon SageMaker Containers: a library for creating Docker containers](https://docs.aws.amazon.com/ja_jp/sagemaker/latest/dg/amazon-sagemaker-containers.html)
- [Environment variables used by Amazon SageMaker Containers to define entry points](https://docs.aws.amazon.com/ja_jp/sagemaker/latest/dg/docker-container-environmental-variables-entrypoint.html)
- [Example notebooks: using your own algorithm or model](https://docs.aws.amazon.com/ja_jp/sagemaker/latest/dg/adv-bring-own-examples.html)
- [ai/books-rec/item-vectors/train-model](https://git.dmm.com/ai/books-rec/tree/master/item-vectors/train-model)

### Building and pushing the container image
The cell below runs the same steps as `build-and-push.sh logistic-regression`.

In [1]:
%%sh

# The name of our algorithm
algorithm_name=sagemaker-logistic-regression-top3

cd ../container

chmod +x scripts/train
chmod +x scripts/serve

account=$(aws sts get-caller-identity --query Account --output text)

# Get the region defined in the current configuration (default to us-west-2 if none defined)
region=$(aws configure get region)
region=${region:-us-west-2}

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"

# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
fi

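# Log in to ECR. `aws ecr get-login` was removed in AWS CLI v2, so pipe the
# token from `get-login-password` into `docker login` instead.
aws ecr get-login-password --region "${region}" | docker login --username AWS --password-stdin "${account}.dkr.ecr.${region}.amazonaws.com"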

# Build the docker image locally with the image name and then push it to ECR
# with the full name.

docker build -t "${algorithm_name}" .
docker tag "${algorithm_name}" "${fullname}"

docker push ${fullname}
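
The image pushed above is addressed by the URI assembled in `fullname`. As a minimal sketch, the same layout can be built in Python, which is convenient later when the image is handed to the estimator. The helper name `ecr_image_uri` and the account ID in the usage line are made up for illustration and are not part of the original notebook.

```python
def ecr_image_uri(account, region, repository, tag="latest"):
    """Build an ECR image URI with the same layout as `fullname` in the shell cell."""
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


# Hypothetical account ID, for illustration only.
print(ecr_image_uri("123456789012", "us-west-2", "sagemaker-logistic-regression-top3"))
```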

## Uploading the training data

In [None]:
import datetime
from dateutil.relativedelta import relativedelta
import pandas as pd
import numpy as np
import pyarrow.parquet as pq
import s3fs

In [10]:
def s3_load(s3_path):
    # Load a Parquet dataset from S3 into a DataFrame
    fs = s3fs.S3FileSystem()
    dataset = pq.ParquetDataset(s3_path, filesystem=fs)
    table = dataset.read()
    df = table.to_pandas()
    return df

def s3_load_days(s3_dir_path, start_dt, end_dt):
    # Load one `dt=YYYY-MM-DD` partition per day from S3 and concatenate them
    dfs = []
    dt = start_dt
    while dt <= end_dt:
        dt_str = dt.strftime('%Y-%m-%d')
        print(dt_str)
        s3_dt_path = s3_dir_path + f'dt={dt_str}'
        df = s3_load(s3_dt_path)
        dfs.append(df)
        dt += relativedelta(days=1)
    dfs = pd.concat(dfs)
    return dfs
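
`s3_load_days` visits one `dt=YYYY-MM-DD` partition per day, with both end dates included. The sketch below reproduces only that path generation, so the date range can be checked without touching S3. The function name `partition_paths` and the bucket `s3://example-bucket/events/` are hypothetical. As in the cell above, the prefix is assumed to end with a slash.

```python
import datetime


def partition_paths(s3_dir_path, start_dt, end_dt):
    """Return one `dt=YYYY-MM-DD` path per day from start_dt to end_dt, inclusive."""
    paths = []
    dt = start_dt
    while dt <= end_dt:
        paths.append(s3_dir_path + "dt=" + dt.strftime("%Y-%m-%d"))
        dt += datetime.timedelta(days=1)
    return paths


# Hypothetical prefix, not the one used in this notebook.
for path in partition_paths(
    "s3://example-bucket/events/",
    datetime.datetime(2019, 12, 30),
    datetime.datetime(2020, 1, 1),
):
    print(path)
```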

In [2]:
s3_dir_path = 's3://{prefix}/'
start_dt = datetime.datetime(2019, 12, 1)
end_dt = datetime.datetime(2019, 12, 31)

df = s3_load_days(s3_dir_path, start_dt, end_dt)
# Write the training file without a header row or index column
df.to_csv('../data/train.csv', header=False, index=False)
df.head()

## Training with the pushed container image

In [3]:
# S3 prefix
prefix = 'hogehoge'

# Define IAM role
import boto3
import re
import os
from sagemaker import get_execution_role
import sagemaker as sage
from time import gmtime, strftime
role = get_execution_role()
sess = sage.Session()
sess.default_bucket()

In [4]:
# upload data
WORK_DIRECTORY = '../data'
data_location = sess.upload_data(WORK_DIRECTORY, key_prefix=prefix)
data_location

In [5]:
# Alternatively, point at data already in S3 (replace the placeholders before running)
data_location = 's3://sagemaker-us-west-2-{account_id}/{model-name}'


# training
account = sess.boto_session.client('sts').get_caller_identity()['Account']
region = sess.boto_session.region_name
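# `{container_name}` is a placeholder: substitute the repository name before running,
# because str.format raises KeyError for the unfilled named field.
image = '{}.dkr.ecr.{}.amazonaws.com/sagemaker-{container_name}:latest'.format(account, region)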

model = sage.estimator.Estimator(image,
                                 role, 
                                 1, 
                                 'ml.c4.2xlarge',
                                 output_path="s3://{}/output".format(sess.default_bucket()),
                                 #base_job_name=,
                                 sagemaker_session=sess)

model.fit(data_location)
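
When `fit` receives a single S3 URI, SageMaker exposes that data to the training container as the `training` channel, mounted at `/opt/ml/input/data/training`, and collects model artifacts from `/opt/ml/model`. The notebook does not show `scripts/train`, so the following is only a sketch of how such a script might read the header-less CSV uploaded earlier. The function name `load_training_rows` is an assumption, not the author's code.

```python
import csv
import os
import tempfile


def load_training_rows(channel_dir="/opt/ml/input/data/training"):
    """Read every CSV file in the channel directory as header-less rows of strings."""
    rows = []
    for name in sorted(os.listdir(channel_dir)):
        if not name.endswith(".csv"):
            continue
        with open(os.path.join(channel_dir, name), newline="") as f:
            rows.extend(csv.reader(f))
    return rows


# Local demonstration with a temporary directory standing in for the channel mount.
with tempfile.TemporaryDirectory() as channel_dir:
    with open(os.path.join(channel_dir, "train.csv"), "w", newline="") as f:
        f.write("1,0.5,0.2\n0,0.1,0.9\n")
    print(load_training_rows(channel_dir))
```

Inside the container the default argument applies, so the script would call `load_training_rows()` with no arguments.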

In [7]:
model.latest_training_job.name

In [None]:
# To deploy an endpoint directly from the trained model
#from sagemaker.predictor import csv_serializer
#predictor = model.deploy(1, 'ml.t2.medium', serializer=csv_serializer)