## Iris Training and Prediction with SageMaker Scikit-learn
This tutorial shows how to use Scikit-learn with SageMaker through the pre-built Scikit-learn container. Scikit-learn is a popular Python machine learning framework that includes algorithms for classification, regression, clustering, dimensionality reduction, and data/feature pre-processing.

The sagemaker-python-sdk module makes it easy to take existing Scikit-learn code and run it on SageMaker, which we will show by training a model on the Iris dataset and generating a set of predictions. For more information about the Scikit-learn container, see the sagemaker-scikit-learn-containers repository and the sagemaker-python-sdk repository.

For more on Scikit-learn, please visit the Scikit-learn website: http://scikit-learn.org/stable/.



## Libraries used

In [21]:
import sagemaker
from sagemaker import get_execution_role
import pandas as pd
import numpy as np
import boto3
import os
from sagemaker.sklearn.estimator import SKLearn
from sagemaker.predictor import csv_serializer, json_deserializer

## Permissions and environment variables


In [14]:
# S3 bucket and key prefix used throughout this notebook; replace with a bucket you own.
bucket = 'test-karan-02'
prefix = 'sagemaker_demo_scikit'
sagemaker_session = sagemaker.Session(default_bucket=bucket)
# Get a SageMaker-compatible role used by this Notebook Instance.
role = get_execution_role()

## Data ingestion


In [3]:
! wget https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data

--2020-03-02 21:48:48--  https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4551 (4.4K) [application/x-httpd-php]
Saving to: ‘iris.data.1’

2020-03-02 21:48:48 (107 MB/s) - ‘iris.data.1’ saved [4551/4551]


In [4]:
# iris.data has no header row, so name the columns explicitly
iris_df = pd.read_csv("iris.data", header=None)

iris_df.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']

## Data conversion

In [5]:
iris_df['class'] = pd.Categorical(iris_df['class'])

iris_df['code_class'] = iris_df['class'].cat.codes
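
The integer codes produced by `cat.codes` are only meaningful together with the category order, and the `class` column is deleted in the next cell. The sketch below shows how that mapping could be captured beforehand. It uses a small illustrative Series rather than the real `iris_df`, and the name `code_to_label` is hypothetical.

```python
import pandas as pd

# Illustrative stand-in for the 'class' column of iris.data
labels = pd.Series(
    ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"]
)
categorical = pd.Categorical(labels)

# Categories are sorted, and each code is the index of its category
code_to_label = dict(enumerate(categorical.categories))
print(code_to_label)
# {0: 'Iris-setosa', 1: 'Iris-versicolor', 2: 'Iris-virginica'}

print(categorical.codes.tolist())
# [0, 1, 2, 0]
```

Keeping a dictionary like this around makes it possible to translate the numeric predictions returned by the endpoint back into species names.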

In [6]:
del iris_df['class']

In [7]:
iris_df = iris_df.astype(np.float32)

In [8]:
iris_df.to_csv('train.csv', index=False)

## Upload training data


In [9]:

key = 'train.csv'
boto3.resource('s3').Bucket(bucket).Object(os.path.join(prefix, key)).upload_file('train.csv')
s3_train_data = 's3://{}/{}/{}'.format(bucket, prefix, key)
print('uploaded training data location: {}'.format(s3_train_data))

uploaded training data location: s3://test-karan-02/sagemaker_demo_scikit/train.csv


In [15]:
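# Uploads a second copy to the session's default bucket under the 'data' prefix.
# The training job below reads s3_train_data, not this copy.
train_input = sagemaker_session.upload_data("train.csv")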

In [10]:
output_location = 's3://{}/{}/output/'.format(bucket, prefix)
print('training artifacts will be uploaded to: {}'.format(output_location))

training artifacts will be uploaded to: s3://test-karan-02/sagemaker_demo_scikit/output/


In [16]:
train_input

's3://test-karan-02/data/train.csv'

## Training the model

In [None]:
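# Training script that runs inside the Scikit-learn container
script_path = 'scikit_iris_v2.py'
sklearn = SKLearn(
    entry_point=script_path,
    train_instance_type="ml.c4.xlarge",
    role=role,
    sagemaker_session=sagemaker_session,
    # Passed to the script as the command-line argument --max_leaf_nodes 30
    hyperparameters={'max_leaf_nodes': 30})

# Start the training job; the 'train' channel points at the uploaded CSV
sklearn.fit({'train': s3_train_data})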
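
The entry point `scikit_iris_v2.py` is not shown in this notebook. The following is a minimal sketch of what such a script might look like, not the author's actual file. It assumes a `DecisionTreeClassifier` (suggested by the `max_leaf_nodes` hyperparameter), the `train.csv` layout written earlier (header row, label in the last column), and the standalone `joblib` package; with older Scikit-learn versions, `from sklearn.externals import joblib` may be needed instead. The helper name `train` is hypothetical. `model_fn` is the hook the Scikit-learn container calls to load the model for hosting, and `SM_MODEL_DIR` and `SM_CHANNEL_TRAIN` are environment variables that SageMaker sets inside the training container.

```python
import argparse
import os

import joblib
import pandas as pd
from sklearn.tree import DecisionTreeClassifier


def train(train_dir, model_dir, max_leaf_nodes=None):
    """Fit a decision tree on every CSV file in train_dir and save it to model_dir."""
    files = sorted(
        os.path.join(train_dir, name)
        for name in os.listdir(train_dir)
        if name.endswith(".csv")
    )
    if not files:
        raise ValueError("No CSV files found in {}".format(train_dir))
    data = pd.concat([pd.read_csv(path) for path in files], ignore_index=True)

    # Assumed layout: feature columns first, integer class code in the last column
    features = data.iloc[:, :-1].values
    labels = data.iloc[:, -1].astype(int).values

    model = DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes)
    model.fit(features, labels)

    # Everything written to model_dir is packaged as the training artifact
    joblib.dump(model, os.path.join(model_dir, "model.joblib"))
    return model


def model_fn(model_dir):
    """Load the saved model; called by the container when the endpoint starts."""
    return joblib.load(os.path.join(model_dir, "model.joblib"))


# Run training only inside a SageMaker training job, where the channel
# variable is set, so the module can also be imported and tested locally.
if __name__ == "__main__" and "SM_CHANNEL_TRAIN" in os.environ:
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as command-line arguments
    parser.add_argument("--max_leaf_nodes", type=int, default=None)
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR"))
    parser.add_argument("--train", default=os.environ["SM_CHANNEL_TRAIN"])
    args = parser.parse_args()
    train(args.train, args.model_dir, args.max_leaf_nodes)
```

Keeping the fitting logic in a plain function makes it possible to exercise the script on a local copy of `train.csv` before paying for a training instance.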

## Set up hosting for the model


In [18]:
predictor = sklearn.deploy(initial_instance_count=1, instance_type="ml.m4.xlarge")

--------------------------!!

## Validate the model for use

In [22]:
# Send requests as CSV and parse the JSON response
predictor.content_type = 'text/csv'
predictor.serializer = csv_serializer
predictor.deserializer = json_deserializer

In [None]:
# Send the four feature columns of row 30 (label column excluded) to the endpoint
result = predictor.predict(iris_df.iloc[30, :4])
print(result)
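
The endpoint returns numeric class codes rather than species names. As a sketch, the helper below maps codes back to labels. It assumes the response is a flat list of numbers, and that the codes follow the sorted category order pandas assigns to the three Iris classes. The function name `decode_predictions` and the example response are hypothetical, not output from this notebook.

```python
def decode_predictions(result, code_to_label):
    """Map numeric class codes (for example [0.0, 2.0]) back to species names."""
    return [code_to_label[int(round(code))] for code in result]


# Sorted category order of the three Iris classes
code_to_label = {0: "Iris-setosa", 1: "Iris-versicolor", 2: "Iris-virginica"}

# Hypothetical endpoint response, used only to illustrate the mapping
example_result = [0.0, 2.0]
print(decode_predictions(example_result, code_to_label))
# ['Iris-setosa', 'Iris-virginica']
```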

In [25]:
# Delete the endpoint when finished so the hosting instance stops incurring charges
predictor.delete_endpoint()
