# End-to-End Example #1

1. [Introduction](#Introduction)
2. [Prerequisites and Preprocessing](#Prerequisites-and-Preprocessing)
  1. [Permissions and environment variables](#Permissions-and-environment-variables)
  2. [Data ingestion](#Data-ingestion)
  3. [Data inspection](#Data-inspection)
  4. [Data conversion](#Data-conversion)
3. [Training the K-Means model](#Training-the-K-Means-model)
4. [Set up hosting for the model](#Set-up-hosting-for-the-model)
5. [Validate the model for use](#Validate-the-model-for-use)


## Introduction

This example reads a CSV file and uses it as the input data for the K-Means algorithm.

## Prerequisites and Preprocessing

### Permissions and environment variables

Here we set up the linkage and authentication to AWS services. There are two parts to this:

1. The role(s) used to give training and hosting access to your data. Here we use the execution role you created earlier for your notebook. See the documentation if you want to specify a different role.
2. The S3 bucket name and locations that you want to use for training and model data.

In [5]:
from sagemaker import get_execution_role

role = get_execution_role()
bucket='<put your bucket name>'
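
The placeholder bucket name above must be replaced before the later cells can work. As a minimal sketch, the following helper rejects names that cannot be valid and builds `s3://` locations of the kind used later in this example. The function names `looks_like_bucket_name` and `s3_uri` and the bucket `my-example-bucket` are hypothetical, and the pattern is a simplified version of the S3 naming rules, not a complete validator.

```python
import re

# Simplified check: 3-63 characters; lowercase letters, digits, dots, and
# hyphens; starts and ends with a letter or digit. This is an assumption-level
# approximation and does not cover every S3 restriction.
_BUCKET_NAME = re.compile(r'^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$')


def looks_like_bucket_name(name):
    """Return True if `name` passes the simplified bucket-name check."""
    return _BUCKET_NAME.match(name) is not None


def s3_uri(bucket, prefix):
    """Build an s3:// URI from a bucket name and a key prefix."""
    if not looks_like_bucket_name(bucket):
        raise ValueError('not a plausible bucket name: {!r}'.format(bucket))
    return 's3://{}/{}'.format(bucket, prefix.strip('/'))


# 'my-example-bucket' is a made-up name used only for illustration.
print(s3_uri('my-example-bucket', 'kmeans_csv_example/output'))
print(looks_like_bucket_name('<put your bucket name>'))
```

Failing early on the unedited placeholder is cheaper than discovering the problem when the training job tries to write its output.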

### Data ingestion

For small datasets, such as this one, reading into memory isn't onerous, though it would be for larger datasets.
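
For a file too large to hold in memory, one option is to stream it in fixed-size batches. The sketch below is illustrative only: `iter_batches` is a hypothetical helper, not part of SageMaker, and the inline sample stands in for an open file.

```python
import csv
import io
from itertools import islice


def iter_batches(lines, batch_size):
    """Yield lists of numeric CSV rows, at most `batch_size` rows each."""
    reader = csv.reader(lines)
    next(reader, None)  # skip the header row
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield [[float(value) for value in row] for row in batch]


# Stand-in for an open file: the header and the first rows of the dataset.
sample = io.StringIO(
    'Channel,Region,Fresh,Milk,Grocery,Frozen,Detergents_Paper,Delicassen\n'
    '2,3,12669,9656,7561,214,2674,1338\n'
    '2,3,7057,9810,9568,1762,3293,1776\n'
    '2,3,6353,8808,7684,2405,3516,7844\n'
)

for batch in iter_batches(sample, batch_size=2):
    print(len(batch), 'rows, first row:', batch[0])
```

Only one batch is held in memory at a time, so peak memory depends on `batch_size` and not on the size of the file.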

In [32]:
%%time
import pickle, csv, numpy, urllib.request, json

# Load the dataset from https://archive.ics.uci.edu/ml/datasets/Wholesale+customers
# This data is found at http://www.learnbymarketing.com/tutorials/k-means-clustering-in-r-example/ 
sourceCSV = 'Wholesale customers data.csv'

urllib.request.urlretrieve('https://archive.ics.uci.edu/ml/machine-learning-databases/00292/Wholesale%20customers%20data.csv', sourceCSV)

# Verify the content by printing every row
with open(sourceCSV, newline='') as csvfile:
    reader = csv.reader(csvfile)  # the file is comma-delimited
    for row in reader:
        print(','.join(row))

Channel,Region,Fresh,Milk,Grocery,Frozen,Detergents_Paper,Delicassen
2,3,12669,9656,7561,214,2674,1338
2,3,7057,9810,9568,1762,3293,1776
2,3,6353,8808,7684,2405,3516,7844
1,3,13265,1196,4221,6404,507,1788
2,3,22615,5410,7198,3915,1777,5185
2,3,9413,8259,5126,666,1795,1451
2,3,12126,3199,6975,480,3140,545
2,3,7579,4956,9426,1669,3321,2566
1,3,5963,3648,6192,425,1716,750
2,3,6006,11093,18881,1159,7425,2098
2,3,3366,5403,12974,4400,5977,1744
2,3,13146,1124,4523,1420,549,497
2,3,31714,12319,11757,287,3881,2931
2,3,21217,6208,14982,3095,6707,602
2,3,24653,9465,12091,294,5058,2168
1,3,10253,1114,3821,397,964,412
2,3,1020,8816,12121,134,4508,1080
1,3,5876,6157,2933,839,370,4478
2,3,18601,6327,10099,2205,2767,3181
1,3,7780,2495,9464,669,2518,501
2,3,17546,4519,4602,1066,2259,2124
1,3,5567,871,2010,3383,375,569
1,3,31276,1917,4469,9408,2381,4334
2,3,26373,36423,22019,5154,4337,16523
2,3,22647,9776,13792,2915,4482,5778
2,3,16165,4230,7595,201,4003,57
1,3,9898,961,2861,3151,242,833
1,3,14276,803,

In [36]:
# SageMaker does not need the CSV header row, so write a copy without the first line
sourceDataCSV = 'kmeanssourcedata.csv'
with open(sourceCSV, 'r') as fin:
    data = fin.read().splitlines(True)
with open(sourceDataCSV, 'w') as fout:
    fout.writelines(data[1:])

# Load the numeric rows into a float32 NumPy array
from numpy import genfromtxt
my_data = genfromtxt(sourceDataCSV, delimiter=',', dtype='float32')
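
Before training, it can help to look at the range of each column, since the columns here differ widely in scale. The following sketch summarizes a few rows copied from the output above using only the standard library; `column_summary` is a hypothetical helper.

```python
import csv
import io


def column_summary(header, rows):
    """Return {column name: (min, max, mean)} for a list of numeric rows."""
    summary = {}
    for name, values in zip(header, zip(*rows)):
        summary[name] = (min(values), max(values), sum(values) / len(values))
    return summary


# The header and the first three rows of the dataset.
sample = io.StringIO(
    'Channel,Region,Fresh,Milk,Grocery,Frozen,Detergents_Paper,Delicassen\n'
    '2,3,12669,9656,7561,214,2674,1338\n'
    '2,3,7057,9810,9568,1762,3293,1776\n'
    '2,3,6353,8808,7684,2405,3516,7844\n'
)
reader = csv.reader(sample)
header = next(reader)
rows = [[float(value) for value in row] for row in reader]

for name, (low, high, mean) in column_summary(header, rows).items():
    print('{:<17} min={:<8} max={:<8} mean={:.1f}'.format(name, low, high, mean))
```

With the `my_data` array loaded above, NumPy gives the same figures for the full dataset through `my_data.min(axis=0)`, `my_data.max(axis=0)`, and `my_data.mean(axis=0)`.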

## Training the K-Means model

Once the data is preprocessed and available in the correct format, the next step is to train the model. This dataset is small, so it isn't meant to show off the performance of the k-means training algorithm. Amazon SageMaker's k-means has, however, been tested on, and scales well with, multi-terabyte datasets.

After setting the training parameters, we kick off training and poll for status until it completes. This typically takes several minutes (about 4 minutes of wall time in the run shown below).
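
The `kmeans.fit` call below blocks and prints progress while the job runs, so no extra polling code is needed in this notebook. For illustration, a generic polling loop might look like the sketch below. `wait_for_status` and its parameters are hypothetical; in practice `get_status` could wrap a call such as boto3's `describe_training_job`, which reports a `TrainingJobStatus` field, but that wiring is an assumption and is not shown.

```python
import time


def wait_for_status(get_status, terminal=('Completed', 'Failed', 'Stopped'),
                    poll_seconds=30, timeout_seconds=3600, sleep=time.sleep):
    """Call `get_status()` until it returns a terminal status, then return it."""
    waited = 0
    while True:
        status = get_status()
        if status in terminal:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError('still {!r} after {} s'.format(status, waited))
        sleep(poll_seconds)
        waited += poll_seconds


# Demonstration with a fake status source and no real sleeping.
fake_statuses = iter(['InProgress', 'InProgress', 'Completed'])
print(wait_for_status(lambda: next(fake_statuses), sleep=lambda seconds: None))
```

Passing `sleep` as a parameter keeps the loop testable without waiting, and the timeout prevents a stuck job from blocking forever.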

In [37]:
from sagemaker import KMeans

data_location = 's3://{}/kmeans_highlevel_csv_example/data'.format(bucket)
output_location = 's3://{}/kmeans_csv_example/output'.format(bucket)

print('training data will be uploaded to: {}'.format(data_location))
print('training artifacts will be uploaded to: {}'.format(output_location))

kmeans = KMeans(role=role,
                train_instance_count=1,
                train_instance_type='ml.c4.8xlarge',
                output_path=output_location,
                k=10,
                data_location=data_location)

training data will be uploaded to: s3://pilho-sagemaker-ai-workshop/kmeans_highlevel_csv_example/data
training artifacts will be uploaded to: s3://pilho-sagemaker-ai-workshop/kmeans_csv_example/output


In [38]:
%%time

kmeans.fit(kmeans.record_set(my_data))

INFO:sagemaker:Creating training-job with name: kmeans-2018-06-08-09-38-23-979


.....................
Docker entrypoint called with argument(s): train
[06/08/2018 09:41:47 INFO 139910262089536] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/default-input.json: {u'_tuning_objective_metric': u'', u'_num_gpus': u'auto', u'local_lloyd_num_trials': u'auto', u'_log_level': u'info', u'_kvstore': u'auto', u'local_lloyd_init_method': u'kmeans++', u'force_dense': u'true', u'epochs': u'1', u'init_method': u'random', u'local_lloyd_tol': u'0.0001', u'local_lloyd_max_iter': u'300', u'_disable_wait_to_read': u'false', u'extra_center_factor': u'auto', u'eval_metrics': u'["msd"]', u'_num_kv_servers': u'1', u'mini_batch_size': u'5000', u'half_life_time_size': u'0', u'_num_slices': u'1'}
[06/08/2018 09:41:47 INFO 139910262089536] Reading provided configuration from /opt/ml/input/config/hyperparameters.json: {u'feature_dim': u'8', u'k': u'10', u'force_dense': u'True'}
[06/08/2018 09:41:47 INFO 139910262089536] Fina

===== Job Complete =====
Billable seconds: 101
CPU times: user 384 ms, sys: 4 ms, total: 388 ms
Wall time: 4min 12s


## Set up hosting for the model
Now we can deploy the model we just trained behind a real-time hosted endpoint. This step typically takes several minutes (about 5 minutes of wall time in the run shown below).

In [39]:
%%time

kmeans_predictor = kmeans.deploy(initial_instance_count=1,
                                 instance_type='ml.m4.xlarge')

INFO:sagemaker:Creating model with name: kmeans-2018-06-08-09-42-36-136
INFO:sagemaker:Creating endpoint with name kmeans-2018-06-08-09-38-23-979


--------------------------------------------------------------!
CPU times: user 248 ms, sys: 0 ns, total: 248 ms
Wall time: 5min 14s


## Validate the model for use
Finally, we'll validate the model for use. Let's assign the first 10 observations to clusters using the endpoint we just created.

In [40]:
# Assign the first 10 rows to their closest clusters
result = kmeans_predictor.predict(my_data[0:10])
print(result)

[label {
  key: "closest_cluster"
  value {
    float32_tensor {
      values: 5.0
    }
  }
}
label {
  key: "distance_to_cluster"
  value {
    float32_tensor {
      values: 9302.55859375
    }
  }
}
, label {
  key: "closest_cluster"
  value {
    float32_tensor {
      values: 6.0
    }
  }
}
label {
  key: "distance_to_cluster"
  value {
    float32_tensor {
      values: 4948.421875
    }
  }
}
, label {
  key: "closest_cluster"
  value {
    float32_tensor {
      values: 6.0
    }
  }
}
label {
  key: "distance_to_cluster"
  value {
    float32_tensor {
      values: 8667.171875
    }
  }
}
, label {
  key: "closest_cluster"
  value {
    float32_tensor {
      values: 5.0
    }
  }
}
label {
  key: "distance_to_cluster"
  value {
    float32_tensor {
      values: 2902.704833984375
    }
  }
}
, label {
  key: "closest_cluster"
  value {
    float32_tensor {
      values: 1.0
    }
  }
}
label {
  key: "distance_to_cluster"
  value {
    float32_tensor {
      values: 3530.25
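
Each record in the response carries the index of the nearest cluster center (`closest_cluster`) and the distance to it (`distance_to_cluster`). The sketch below shows how such a pair can be computed, assuming Euclidean distance. The centroids are made-up values, `closest_cluster` here is a hypothetical local function, and this is not the service's implementation.

```python
import math


def closest_cluster(point, centroids):
    """Return (index, distance) of the centroid nearest to `point`."""
    distances = [
        math.sqrt(sum((p - c) ** 2 for p, c in zip(point, centroid)))
        for centroid in centroids
    ]
    index = min(range(len(centroids)), key=distances.__getitem__)
    return index, distances[index]


# Two hypothetical 2-D centroids; the real model has k=10 centers
# with 8 features each.
centroids = [(0.0, 0.0), (10.0, 0.0)]
print(closest_cluster((3.0, 4.0), centroids))
```

The large distances in the output above are plausible under this reading, because the spending columns are unscaled and run into the tens of thousands.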

### (Optional) Delete the Endpoint
If you're done with this notebook, make sure to run the cells below. They remove the hosted endpoint you created, so you avoid charges from a stray instance being left on.

In [None]:
print(kmeans_predictor.endpoint)

In [None]:
import sagemaker
sagemaker.Session().delete_endpoint(kmeans_predictor.endpoint)
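
In a standalone script, as opposed to a notebook, a `try`/`finally` block can guarantee the cleanup even when a later step fails. The sketch below uses stand-in `deploy` and `delete` callables, both hypothetical; with the objects in this notebook they would wrap `kmeans.deploy(...)` and `sagemaker.Session().delete_endpoint(...)`.

```python
from contextlib import contextmanager


@contextmanager
def temporary_endpoint(deploy, delete):
    """Create an endpoint, hand it to the caller, and always delete it."""
    endpoint = deploy()
    try:
        yield endpoint
    finally:
        delete(endpoint)  # runs on normal exit and on exceptions


# Demonstration with stand-ins that only record what happened.
events = []
with temporary_endpoint(lambda: 'demo-endpoint', events.append) as name:
    events.append('using ' + name)
print(events)
```

The context manager ties the endpoint's lifetime to a block of code, so a failed prediction step cannot leave a billable instance running.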