## Amazon SageMaker Processing jobs

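Amazon SageMaker Processing jobs let you run data pre-processing, post-processing, and model evaluation workloads on the Amazon SageMaker platform in a simplified, managed environment.

A processing job downloads its input from Amazon S3 (Simple Storage Service), then uploads its output to Amazon S3 during or after the job.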

<img src="Processing-1.jpg">

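This notebook does the following:

1. Runs a processing job that executes a scikit-learn script to clean, preprocess, and feature-engineer the input data and split it into training and test sets.
2. Runs a training job on the preprocessed training data to train a model.
3. Runs a processing job on the preprocessed test data to evaluate the trained model's performance.
4. Shows how to use a custom container to run processing jobs with your own Python libraries and dependencies.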

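This notebook uses the [Census-Income KDD Dataset](https://archive.ics.uci.edu/ml/datasets/Census-Income+%28KDD%29). You select features from this dataset, clean the data, transform it into features that a training algorithm can use to train a binary classification model, and split it into training and test sets. The task is to predict whether the income of the census respondent represented by each record is greater than or less than $50,000. The dataset is heavily imbalanced, with most records labeled as earning less than $50,000. After training a logistic regression model, you evaluate it against a hold-out test dataset and save classification evaluation metrics, including precision, recall, and F1 score for each label, as well as the model's accuracy and ROC AUC.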
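
Because most records fall into the under-$50,000 class, accuracy alone can look good even when the minority class is never predicted. The following is a minimal sketch in plain Python of how precision, recall, and F1 are derived from prediction counts. The labels are made up rather than taken from the census data, and the function name `binary_metrics` is hypothetical.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy plus precision, recall, and F1 for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up, imbalanced labels: nine negatives and one positive.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
# A model that always predicts the majority class.
y_pred = [0] * 10

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In this toy case the accuracy is high while recall for the positive class is zero, which is why the per-label metrics and ROC AUC are saved alongside accuracy.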

## Data pre-processing and feature engineering

To run the scikit-learn preprocessing script as a processing job, create an `SKLearnProcessor`, which lets you run scripts inside processing jobs using the provided scikit-learn image.

In [1]:
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.sklearn.processing import SKLearnProcessor

region = boto3.session.Session().region_name

role = get_execution_role()
sklearn_processor = SKLearnProcessor(
    framework_version="0.20.0", role=role, instance_type="ml.m5.xlarge", instance_count=1
)

Before introducing the script used for data cleaning, preprocessing, and feature engineering, inspect the first 10 rows of the dataset. The target to predict is the `income` category. The features selected from the dataset are `age`, `education`, `major industry code`, `class of worker`, `num persons worked for employer`, `capital gains`, `capital losses`, and `dividends from stocks`.

In [2]:
import pandas as pd

input_data = "s3://sagemaker-sample-data-{}/processing/census/census-income.csv".format(region)
df = pd.read_csv(input_data, nrows=10)
df.head(n=10)

Unnamed: 0,age,class of worker,detailed industry recode,detailed occupation recode,education,wage per hour,enroll in edu inst last wk,marital stat,major industry code,major occupation code,...,country of birth father,country of birth mother,country of birth self,citizenship,own business or self employed,fill inc questionnaire for veteran's admin,veterans benefits,weeks worked in year,year,income
0,73,Not in universe,0,0,High school graduate,0,Not in universe,Widowed,Not in universe or children,Not in universe,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,0,95,- 50000.
1,58,Self-employed-not incorporated,4,34,Some college but no degree,0,Not in universe,Divorced,Construction,Precision production craft & repair,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,52,94,- 50000.
2,18,Not in universe,0,0,10th grade,0,High school,Never married,Not in universe or children,Not in universe,...,Vietnam,Vietnam,Vietnam,Foreign born- Not a citizen of U S,0,Not in universe,2,0,95,- 50000.
3,9,Not in universe,0,0,Children,0,Not in universe,Never married,Not in universe or children,Not in universe,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,0,0,94,- 50000.
4,10,Not in universe,0,0,Children,0,Not in universe,Never married,Not in universe or children,Not in universe,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,0,0,94,- 50000.
5,48,Private,40,10,Some college but no degree,1200,Not in universe,Married-civilian spouse present,Entertainment,Professional specialty,...,Philippines,United-States,United-States,Native- Born in the United States,2,Not in universe,2,52,95,- 50000.
6,42,Private,34,3,Bachelors degree(BA AB BS),0,Not in universe,Married-civilian spouse present,Finance insurance and real estate,Executive admin and managerial,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,52,94,- 50000.
7,28,Private,4,40,High school graduate,0,Not in universe,Never married,Construction,Handlers equip cleaners etc,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,30,95,- 50000.
8,47,Local government,43,26,Some college but no degree,876,Not in universe,Married-civilian spouse present,Education,Adm support including clerical,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,52,95,- 50000.
9,34,Private,4,37,Some college but no degree,0,Not in universe,Married-civilian spouse present,Construction,Machine operators assmblrs & inspctrs,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,52,94,- 50000.


This notebook cell writes a file `preprocessing.py`, which contains the preprocessing script. You can update the script and rerun this cell to overwrite `preprocessing.py`. You run it as a processing job in the next cell. In this script, you:

* Remove duplicate rows and rows with missing values.
* Transform the target `income` column into a column containing two labels.
* Transform the `age` and `num persons worked for employer` numerical columns into categorical features by binning them.
* Scale the continuous `capital gains`, `capital losses`, and `dividends from stocks` columns so they're suitable for training.
* Encode the `education`, `major industry code`, and `class of worker` columns so they're suitable for training.
* Split the data into training and test datasets, and save the training features and labels and the test features and labels.

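The training script uses the preprocessed training features and labels to train a model, and the model evaluation script uses the trained model together with the preprocessed test features and labels to evaluate it.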
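
To illustrate two of the transformations listed above, here is a small plain-Python sketch of what scaling and one-hot encoding compute. It is not how the script does it (the script relies on scikit-learn's `StandardScaler` and `OneHotEncoder`); the helper names `standard_scale` and `one_hot` and the sample values are hypothetical. Like `StandardScaler`, the sketch uses the population standard deviation, and like `OneHotEncoder`, it orders categories by sorting them.

```python
from statistics import mean, pstdev


def standard_scale(values):
    """Center values to zero mean and scale them to unit variance."""
    mu, sigma = mean(values), pstdev(values)
    # A constant column has zero spread; map it to zeros instead of dividing by zero.
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


def one_hot(values):
    """Return the sorted category list and one indicator row per value."""
    categories = sorted(set(values))
    rows = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    return categories, rows


# Made-up sample values, not taken from the census data.
capital_gains = [1000, 1000, 3000, 3000]
worker_class = ["Private", "Local government", "Private"]

scaled = standard_scale(capital_gains)
categories, encoded = one_hot(worker_class)

print(scaled)
print(categories)
print(encoded)
```

Each categorical value becomes a row with a single 1.0 in the position of its category, which is why the preprocessed feature matrix ends up with many more columns than the original selection.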

In [3]:
%%writefile preprocessing.py

import argparse
import os
import warnings

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelBinarizer, KBinsDiscretizer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.compose import make_column_transformer

from sklearn.exceptions import DataConversionWarning

warnings.filterwarnings(action="ignore", category=DataConversionWarning)


columns = [
    "age",
    "education",
    "major industry code",
    "class of worker",
    "num persons worked for employer",
    "capital gains",
    "capital losses",
    "dividends from stocks",
    "income",
]
class_labels = [" - 50000.", " 50000+."]


def print_shape(df):
    negative_examples, positive_examples = np.bincount(df["income"])
    print(
        "Data shape: {}, {} positive examples, {} negative examples".format(
            df.shape, positive_examples, negative_examples
        )
    )


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--train-test-split-ratio", type=float, default=0.3)
    args, _ = parser.parse_known_args()

    print("Received arguments {}".format(args))

    input_data_path = os.path.join("/opt/ml/processing/input", "census-income.csv")

    print("Reading input data from {}".format(input_data_path))
    df = pd.read_csv(input_data_path)
    df = pd.DataFrame(data=df, columns=columns)
    df.dropna(inplace=True)
    df.drop_duplicates(inplace=True)
    df.replace(class_labels, [0, 1], inplace=True)

    negative_examples, positive_examples = np.bincount(df["income"])
    print(
        "Data after cleaning: {}, {} positive examples, {} negative examples".format(
            df.shape, positive_examples, negative_examples
        )
    )

    split_ratio = args.train_test_split_ratio
    print("Splitting data into train and test sets with ratio {}".format(split_ratio))
    X_train, X_test, y_train, y_test = train_test_split(
        df.drop("income", axis=1), df["income"], test_size=split_ratio, random_state=0
    )

    preprocess = make_column_transformer(
        (
            ["age", "num persons worked for employer"],
            KBinsDiscretizer(encode="onehot-dense", n_bins=10),
        ),
        (["capital gains", "capital losses", "dividends from stocks"], StandardScaler()),
        (["education", "major industry code", "class of worker"], OneHotEncoder(sparse=False)),
    )
    print("Running preprocessing and feature engineering transformations")
    train_features = preprocess.fit_transform(X_train)
    test_features = preprocess.transform(X_test)

    print("Train data shape after preprocessing: {}".format(train_features.shape))
    print("Test data shape after preprocessing: {}".format(test_features.shape))

    train_features_output_path = os.path.join("/opt/ml/processing/train", "train_features.csv")
    train_labels_output_path = os.path.join("/opt/ml/processing/train", "train_labels.csv")

    test_features_output_path = os.path.join("/opt/ml/processing/test", "test_features.csv")
    test_labels_output_path = os.path.join("/opt/ml/processing/test", "test_labels.csv")

    print("Saving training features to {}".format(train_features_output_path))
    pd.DataFrame(train_features).to_csv(train_features_output_path, header=False, index=False)

    print("Saving test features to {}".format(test_features_output_path))
    pd.DataFrame(test_features).to_csv(test_features_output_path, header=False, index=False)

    print("Saving training labels to {}".format(train_labels_output_path))
    y_train.to_csv(train_labels_output_path, header=False, index=False)

    print("Saving test labels to {}".format(test_labels_output_path))
    y_test.to_csv(test_labels_output_path, header=False, index=False)

Writing preprocessing.py


Run this script as a processing job using the `SKLearnProcessor.run()` method. You pass one `ProcessingInput` to the `run()` method, where the `source` is the census dataset in Amazon S3 and the `destination` is where the script reads this data from, in this case `/opt/ml/processing/input`. These local paths inside the processing container must begin with `/opt/ml/processing/`.

Also give the `run()` method `ProcessingOutput` objects, where the `source` is the path the script writes output data to. For outputs, the `destination` defaults to an S3 bucket that the Amazon SageMaker Python SDK creates for you, following the format `s3://sagemaker-<region>-<account_id>/<processing_job_name>/output/<output_name>/`. You also give each `ProcessingOutput` a value for `output_name`, which makes it easier to retrieve these output artifacts after the job has run.

The `arguments` parameter of the `run()` method passes command-line arguments to the `preprocessing.py` script.
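
As a sketch of how those command-line arguments reach the script, the snippet below mirrors the parser defined in `preprocessing.py` and feeds it an argument list directly. The wrapper function `parse_args` and the extra `--unused-flag` argument are hypothetical and exist only for illustration.

```python
import argparse


def parse_args(argv):
    """Parse the split ratio the same way preprocessing.py does."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--train-test-split-ratio", type=float, default=0.3)
    # parse_known_args returns unrecognized arguments instead of raising an error.
    return parser.parse_known_args(argv)


# The list passed as `arguments` arrives as strings and is converted to float.
args, unknown = parse_args(["--train-test-split-ratio", "0.2"])
print(args.train_test_split_ratio)

# Without the flag, the script falls back to its default ratio.
default_args, _ = parse_args([])
print(default_args.train_test_split_ratio)

# Unrecognized arguments are collected rather than rejected.
_, extras = parse_args(["--unused-flag", "value"])
print(extras)
```

Note that argument values are given as strings in the `arguments` list; the `type=float` declaration in the parser handles the conversion.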

In [None]:
from sagemaker.processing import ProcessingInput, ProcessingOutput

sklearn_processor.run(
    code="preprocessing.py",
    inputs=[ProcessingInput(source=input_data, destination="/opt/ml/processing/input")],
    outputs=[
        ProcessingOutput(output_name="train_data", source="/opt/ml/processing/train"),
        ProcessingOutput(output_name="test_data", source="/opt/ml/processing/test"),
    ],
    arguments=["--train-test-split-ratio", "0.2"],
)

preprocessing_job_description = sklearn_processor.jobs[-1].describe()

output_config = preprocessing_job_description["ProcessingOutputConfig"]
for output in output_config["Outputs"]:
    if output["OutputName"] == "train_data":
        preprocessed_training_data = output["S3Output"]["S3Uri"]
    if output["OutputName"] == "test_data":
        preprocessed_test_data = output["S3Output"]["S3Uri"]
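
The loop above can also be expressed as a dictionary lookup keyed by output name. The following sketch assumes the response has the same shape as the fields accessed in the loop; the function name `output_uris`, the bucket name, and the job name are placeholders, not values from a real job.

```python
def output_uris(job_description):
    """Map each processing output name to its S3 URI."""
    outputs = job_description["ProcessingOutputConfig"]["Outputs"]
    return {o["OutputName"]: o["S3Output"]["S3Uri"] for o in outputs}


# Hypothetical response fragment with placeholder bucket and job names.
sample_description = {
    "ProcessingOutputConfig": {
        "Outputs": [
            {
                "OutputName": "train_data",
                "S3Output": {"S3Uri": "s3://example-bucket/example-job/output/train_data"},
            },
            {
                "OutputName": "test_data",
                "S3Output": {"S3Uri": "s3://example-bucket/example-job/output/test_data"},
            },
        ]
    }
}

uris = output_uris(sample_description)
print(uris["train_data"])
print(uris["test_data"])
```

A lookup like this raises a `KeyError` for a missing output name, whereas the loop would silently leave the variable undefined.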

Now inspect the output of the preprocessing job, which consists of the processed features.

In [5]:
training_features = pd.read_csv(preprocessed_training_data + "/train_features.csv", nrows=10)
print("Training features shape: {}".format(training_features.shape))
training_features.head(n=10)

Training features shape: (10, 73)


Unnamed: 0,0.0,0.0.1,0.0.2,0.0.3,0.0.4,1.0,0.0.5,0.0.6,0.0.7,0.0.8,...,0.0.56,0.0.57,0.0.58,0.0.59,0.0.60,1.0.4,0.0.61,0.0.62,0.0.63,0.0.64
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
9,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


## Training using the pre-processed data

Create an `SKLearn` estimator instance. It uses the training script `train.py`.

In [6]:
from sagemaker.sklearn.estimator import SKLearn

sklearn = SKLearn(
    entry_point="train.py", framework_version="0.20.0", instance_type="ml.m5.xlarge", role=role
)

The training script `train.py` trains a logistic regression model on the training data and saves the model to the `/opt/ml/model` directory. Amazon SageMaker then packages that directory into a tar archive named `model.tar.gz` and uploads it to S3.
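
To make the packaging step concrete, the sketch below uses Python's `tarfile` module to build and inspect an archive laid out like `model.tar.gz`. It is an illustration under assumptions, not SageMaker's own implementation: a temporary directory stands in for `/opt/ml/model`, and the file content is a placeholder rather than a real joblib model.

```python
import os
import tarfile
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    # Stand-in for /opt/ml/model inside the training container.
    model_dir = os.path.join(workdir, "model")
    os.makedirs(model_dir)
    model_file = os.path.join(model_dir, "model.joblib")
    with open(model_file, "wb") as f:
        f.write(b"placeholder bytes, not a real model")

    # Package the model file at the root of a gzip-compressed tar archive.
    archive_path = os.path.join(workdir, "model.tar.gz")
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(model_file, arcname="model.joblib")

    # Reopen the archive and list its members.
    with tarfile.open(archive_path, "r:gz") as tar:
        archived_names = tar.getnames()

print(archived_names)
```

An evaluation script that consumes the archive would extract it first and then load `model.joblib` from the extracted files.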

In [7]:
%%writefile train.py

import os

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.externals import joblib

if __name__ == "__main__":
    training_data_directory = "/opt/ml/input/data/train"
    train_features_data = os.path.join(training_data_directory, "train_features.csv")
    train_labels_data = os.path.join(training_data_directory, "train_labels.csv")
    print("Reading input data")
    X_train = pd.read_csv(train_features_data, header=None)
    # Flatten the single-column label frame into the 1-D array that fit() expects.
    y_train = pd.read_csv(train_labels_data, header=None).values.ravel()

    model = LogisticRegression(class_weight="balanced", solver="lbfgs")
    print("Training LR model")
    model.fit(X_train, y_train)
    model_output_directory = os.path.join("/opt/ml/model", "model.joblib")
    print("Saving model to {}".format(model_output_directory))
    joblib.dump(model, model_output_directory)

Writing train.py


Train the model using the `train.py` script created above.

In [8]:
sklearn.fit({"train": preprocessed_training_data})
training_job_description = sklearn.jobs[-1].describe()
model_data_s3_uri = "{}{}/{}".format(
    training_job_description["OutputDataConfig"]["S3OutputPath"],
    training_job_description["TrainingJobName"],
    "output/model.tar.gz",
)

2021-06-01 11:57:42 Starting - Starting the training job...
2021-06-01 11:57:44 Starting - Launching requested ML instancesProfilerReport-1622548662: InProgress
......
2021-06-01 11:58:46 Starting - Preparing the instances for training......
2021-06-01 12:00:12 Downloading - Downloading input data
2021-06-01 12:00:12 Training - Downloading the training image..
2021-06-01 12:00:40 Uploading - Uploading generated training model
2021-06-01 12:00:40 Completed - Training job completed
2021-06-01 12:00:26,625 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2021-06-01 12:00:26,627 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2021-06-01 12:00:26,636 sagemaker_sklearn_container.training INFO     Invoking user training script.
2021-06-01 12:00:26,973 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2021-06-01 12:00:28,392 sagemaker-training-toolki