# Predict breast cancer type using SageMaker linear-learner

Contents
- [0. Setup](#setup)
- [1. Introduction](#introduction)
- [2. Load and explore data](#load)
- [3. Create logistic regression model using SageMaker linear-learner algorithm](#model)
- [4. Deploy the model in the AWS Cloud](#deployment)
- [5. Score the model](#score)

**Note:** This notebook works correctly with kernel `Python 3.5+`.

<a id="setup"></a>
## 0. Setup

Before you use the sample code in this notebook, you must perform the following setup tasks:

- Create a SageMaker service instance. The setup steps are described here: https://docs.aws.amazon.com/sagemaker/latest/dg/gs-set-up.html
- Install the required Python packages from the PyPI repository

### Package installation

In [1]:
!pip install boto3 | tail -n 1
!pip install sagemaker | tail -n 1
!pip install pandas | tail -n 1
!pip install scikit-learn | tail -n 1

Requirement not upgraded as not directly required: six>=1.5 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from python-dateutil<3.0.0,>=2.1; python_version >= "2.7"->botocore<1.13.0,>=1.12.69->boto3)
Requirement not upgraded as not directly required: docker-pycreds>=0.3.0 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from docker<4.0,>=3.6.0->docker-compose>=1.23.0->sagemaker)
Requirement not upgraded as not directly required: six>=1.5 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from python-dateutil>=2->pandas)
Requirement not upgraded as not directly required: scikit-learn in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages


<a id="introduction"></a>
## 1. Introduction

This notebook defines, trains, and deploys a model that predicts whether a breast tumor is malignant or benign.

<a id="load"></a>
## 2. Load and explore data

In this section you will load the data into a pandas DataFrame and perform a basic exploration. Next, you will upload the training data to Amazon S3 Object Storage.

### 2.1 Load data from webpage

In [2]:
import pandas as pd
from sklearn.utils import shuffle

In [3]:
data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data', header = None)

data.columns = ["id","diagnosis","radius_mean","texture_mean","perimeter_mean","area_mean","smoothness_mean",
                "compactness_mean","concavity_mean","concave points_mean","symmetry_mean","fractal_dimension_mean",
                "radius_se","texture_se","perimeter_se","area_se","smoothness_se","compactness_se","concavity_se",
                "concave points_se","symmetry_se","fractal_dimension_se","radius_worst","texture_worst",
                "perimeter_worst","area_worst","smoothness_worst","compactness_worst","concavity_worst",
                "concave points_worst","symmetry_worst","fractal_dimension_worst"] 

### 2.2 Explore data

In [4]:
print("Sample records:")
display(data.head())

print("Features columns summary:")
display(data.iloc[:, 2:].describe())

print("Label column summary:")
display(data.diagnosis.value_counts())

Sample records:


|  | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 842302 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


Features columns summary:


|  | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | fractal_dimension_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | ... | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 |
| mean | 14.127292 | 19.289649 | 91.969033 | 654.889104 | 0.09636 | 0.104341 | 0.088799 | 0.048919 | 0.181162 | 0.062798 | ... | 16.26919 | 25.677223 | 107.261213 | 880.583128 | 0.132369 | 0.254265 | 0.272188 | 0.114606 | 0.290076 | 0.083946 |
| std | 3.524049 | 4.301036 | 24.298981 | 351.914129 | 0.014064 | 0.052813 | 0.07972 | 0.038803 | 0.027414 | 0.00706 | ... | 4.833242 | 6.146258 | 33.602542 | 569.356993 | 0.022832 | 0.157336 | 0.208624 | 0.065732 | 0.061867 | 0.018061 |
| min | 6.981 | 9.71 | 43.79 | 143.5 | 0.05263 | 0.01938 | 0.0 | 0.0 | 0.106 | 0.04996 | ... | 7.93 | 12.02 | 50.41 | 185.2 | 0.07117 | 0.02729 | 0.0 | 0.0 | 0.1565 | 0.05504 |
| 25% | 11.7 | 16.17 | 75.17 | 420.3 | 0.08637 | 0.06492 | 0.02956 | 0.02031 | 0.1619 | 0.0577 | ... | 13.01 | 21.08 | 84.11 | 515.3 | 0.1166 | 0.1472 | 0.1145 | 0.06493 | 0.2504 | 0.07146 |
| 50% | 13.37 | 18.84 | 86.24 | 551.1 | 0.09587 | 0.09263 | 0.06154 | 0.0335 | 0.1792 | 0.06154 | ... | 14.97 | 25.41 | 97.66 | 686.5 | 0.1313 | 0.2119 | 0.2267 | 0.09993 | 0.2822 | 0.08004 |
| 75% | 15.78 | 21.8 | 104.1 | 782.7 | 0.1053 | 0.1304 | 0.1307 | 0.074 | 0.1957 | 0.06612 | ... | 18.79 | 29.72 | 125.4 | 1084.0 | 0.146 | 0.3391 | 0.3829 | 0.1614 | 0.3179 | 0.09208 |
| max | 28.11 | 39.28 | 188.5 | 2501.0 | 0.1634 | 0.3454 | 0.4268 | 0.2012 | 0.304 | 0.09744 | ... | 36.04 | 49.54 | 251.2 | 4254.0 | 0.2226 | 1.058 | 1.252 | 0.291 | 0.6638 | 0.2075 |


Label column summary:


B    357
M    212
Name: diagnosis, dtype: int64
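
The label counts above show a moderately imbalanced dataset. As a rough reference point for later scoring, the following minimal sketch computes each class's share and the accuracy of always predicting the majority class. The function name `majority_baseline` is hypothetical and not part of the original notebook.

```python
from collections import Counter

def majority_baseline(labels):
    """Return (class shares, accuracy of always predicting the most common class)."""
    counts = Counter(labels)
    total = sum(counts.values())
    shares = {label: count / total for label, count in counts.items()}
    return shares, max(shares.values())

# Counts taken from the value_counts() output above: 357 "B" and 212 "M".
labels = ["B"] * 357 + ["M"] * 212
shares, baseline = majority_baseline(labels)
print({label: round(share, 3) for label, share in shares.items()}, round(baseline, 3))
```

A classifier is only useful here if its accuracy clearly exceeds this baseline.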

### 2.3 Store training data in S3 Object Storage

You will use the SageMaker linear-learner built-in algorithm. When the training data is in `text/csv` format, this algorithm expects the first column to be the label.

The label column also has to be numeric, so you will recode it: `M` (malignant) becomes 1 and `B` (benign) becomes 0.
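
The required layout (numeric label first, then the features) can be illustrated without pandas. This is a minimal sketch using only the standard library; the helper `to_linear_learner_rows` and the two shortened records are made up for illustration, while the `M`/`B` mapping matches the one used in this notebook.

```python
import csv
import io

LABEL_CODES = {"M": 1, "B": 0}  # same recoding as in the notebook

def to_linear_learner_rows(rows, label_index, drop_indices=()):
    """Yield rows with the numeric label first and unwanted columns dropped."""
    skip = set(drop_indices) | {label_index}
    for row in rows:
        label = LABEL_CODES[row[label_index]]
        features = [value for i, value in enumerate(row) if i not in skip]
        yield [label] + features

# Two shortened, made-up records in the raw layout: id, diagnosis, features...
raw = [["101", "M", "17.99", "10.38"], ["102", "B", "13.54", "14.36"]]
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(
    to_linear_learner_rows(raw, label_index=1, drop_indices=[0])
)
print(buffer.getvalue())
```

The `id` column is dropped because it carries no predictive information and would otherwise be treated as a feature.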

#### Save prepared data to local filesystem

In [5]:
data_shuffled = shuffle(data)
data_shuffled.replace({'diagnosis': {'M': 1, 'B': 0}}, inplace = True)
display(data_shuffled.head())

|  | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 351 | 899667 | 1 | 15.75 | 19.22 | 107.1 | 758.6 | 0.1243 | 0.2364 | 0.2914 | 0.1242 | ... | 17.36 | 24.17 | 119.4 | 915.3 | 0.155 | 0.5046 | 0.6872 | 0.2135 | 0.4245 | 0.105 |
| 65 | 859283 | 1 | 14.78 | 23.94 | 97.4 | 668.3 | 0.1172 | 0.1479 | 0.1267 | 0.09029 | ... | 17.31 | 33.39 | 114.6 | 925.1 | 0.1648 | 0.3416 | 0.3024 | 0.1614 | 0.3321 | 0.08911 |
| 296 | 891936 | 0 | 10.91 | 12.35 | 69.14 | 363.7 | 0.08518 | 0.04721 | 0.01236 | 0.01369 | ... | 11.37 | 14.82 | 72.42 | 392.2 | 0.09312 | 0.07506 | 0.02884 | 0.03194 | 0.2143 | 0.06643 |
| 235 | 88249602 | 0 | 14.03 | 21.25 | 89.79 | 603.4 | 0.0907 | 0.06945 | 0.01462 | 0.01896 | ... | 15.33 | 30.28 | 98.27 | 715.5 | 0.1287 | 0.1513 | 0.06231 | 0.07963 | 0.2226 | 0.07617 |
| 542 | 921644 | 0 | 14.74 | 25.42 | 94.7 | 668.6 | 0.08275 | 0.07214 | 0.04105 | 0.03027 | ... | 16.51 | 32.29 | 107.4 | 826.4 | 0.106 | 0.1376 | 0.1611 | 0.1095 | 0.2722 | 0.06956 |


In [6]:
train_data_filename = 'breast_cancer.csv'
data_shuffled.iloc[:, 1:].to_csv(path_or_buf = train_data_filename, index = False, header = False)

**Note:** The header row has to be omitted.
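
Before uploading, it can help to confirm that the saved file really has no header and that every row starts with a 0/1 label. The check below is a minimal sketch under the assumption that labels are written as the integers `0` and `1`; the function `check_training_csv` is hypothetical and the demonstration file is made up.

```python
import csv
import os
import tempfile

def check_training_csv(path, n_features=30):
    """Return the row count, or raise ValueError on the first malformed row."""
    expected_fields = n_features + 1  # one label plus the features
    n_rows = 0
    with open(path, newline="") as f:
        for line_no, row in enumerate(csv.reader(f), start=1):
            if len(row) != expected_fields:
                raise ValueError("line {}: expected {} fields, got {}".format(
                    line_no, expected_fields, len(row)))
            if row[0] not in ("0", "1"):
                raise ValueError("line {}: label {!r} is not 0 or 1".format(
                    line_no, row[0]))
            n_rows += 1
    return n_rows

# Demonstration on a tiny made-up file with two features per record.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("1,17.99,10.38\n0,13.54,14.36\n")
demo_rows = check_training_csv(tmp.name, n_features=2)
os.remove(tmp.name)
print(demo_rows)
```

A file that still contains a header row fails this check on line 1, because the header's first field is not a 0/1 label.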

#### Upload data to S3 Object Storage

In [7]:
import time
import json
import boto3

In [8]:
aws_credentials = {'access_key': '***', 
                   'secret_key': '***', 
                   'region_name': '***'}

**Note:** Provide the access key, secret key, and region name for your own AWS account.
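
To keep secrets out of a shared notebook, one option is to read them from environment variables instead of typing them into a cell. The sketch below assumes the variables `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_DEFAULT_REGION` are set; the helper `load_aws_credentials` is hypothetical, and the demonstration uses fake values.

```python
import os

def load_aws_credentials(env=os.environ):
    """Build a dict shaped like the notebook's aws_credentials from env vars."""
    mapping = {
        "access_key": "AWS_ACCESS_KEY_ID",
        "secret_key": "AWS_SECRET_ACCESS_KEY",
        "region_name": "AWS_DEFAULT_REGION",
    }
    missing = [var for var in mapping.values() if not env.get(var)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in mapping.items()}

# Demonstration with a fake environment; real values are never hard-coded.
fake_env = {
    "AWS_ACCESS_KEY_ID": "AKIAEXAMPLE",
    "AWS_SECRET_ACCESS_KEY": "example-secret",
    "AWS_DEFAULT_REGION": "us-east-1",
}
aws_credentials_demo = load_aws_credentials(fake_env)
print(sorted(aws_credentials_demo))
```

The returned dict has the same keys as `aws_credentials`, so it could be passed to `boto3.Session` in the same way.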

In [9]:
# The code was removed by Watson Studio for sharing.

In [10]:
session = boto3.Session(
    aws_access_key_id = aws_credentials['access_key'],
    aws_secret_access_key = aws_credentials['secret_key'],
    region_name = aws_credentials['region_name']
)
s3 = session.resource('s3')

In [11]:
bucket_name = 'sagemaker-us-east-1-014862798213'
train_data_filename = 'breast_cancer.csv'
train_data_path = 'breast-cancer/train'
output_data_path = 's3://{}/breast-cancer/output'.format(bucket_name)
time_suffix = time.strftime("%Y-%m-%d-%H-%M", time.gmtime())

**Note:** Replace `bucket_name` with the name of a bucket in your S3 Object Storage.

To list all your buckets, run `[bkt.name for bkt in s3.buckets.all()]`.

In [12]:
s3_bucket = s3.Bucket(bucket_name)
s3_bucket.upload_file(Filename = train_data_filename, Key = '{}/{}'.format(train_data_path, train_data_filename))

Let's check that your data has been uploaded successfully.

In [13]:
# List only the objects stored under the training prefix and keep the URI of the uploaded file.
for s3_obj in s3_bucket.objects.filter(Prefix = train_data_path):
    train_data_uri = 's3://{}/{}'.format(s3_obj.bucket_name, s3_obj.key)
    print(train_data_uri)

s3://sagemaker-us-east-1-014862798213/breast-cancer/train/breast_cancer.csv


<a id="model"></a>
## 3. Create logistic regression model using SageMaker linear-learner algorithm

In this section you will learn how to:

- [3.1 Set up training parameters](#prep)
- [3.2 Start the training job](#train)

<a id="prep"></a>
### 3.1 Set up training parameters

In [14]:
from sagemaker.amazon.amazon_estimator import get_image_uri

sm_client = session.client('sagemaker')

In [15]:
training_image = get_image_uri(session.region_name, 'linear-learner')

iam_client = session.client('iam')
[role_arn, *_] = [role['Arn'] for role in iam_client.list_roles()['Roles'] if 'AmazonSageMaker-ExecutionRole' in role['RoleName']]

linear_job_name = 'Breast-cancer-linear-learner-' + time_suffix

In [16]:
linear_training_params = {
    "AlgorithmSpecification": {
        "TrainingImage": training_image,
        "TrainingInputMode": "File"
    },
    "HyperParameters": {
        "feature_dim": "30",
        "mini_batch_size": "100",
        "predictor_type": "binary_classifier",
        "epochs": "10",
        "num_models": "32",
        "loss": "auto"
    },
    "InputDataConfig": [{
        "ChannelName": "train",
        "ContentType": "text/csv", 
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": train_data_uri,
                "S3DataDistributionType": "ShardedByS3Key"
            }
        }
    }],
    "OutputDataConfig": {"S3OutputPath": output_data_path},
    "ResourceConfig": {
        "InstanceCount": 1,
        "InstanceType": "ml.c4.xlarge",
        "VolumeSizeInGB": 2
    },
    "RoleArn": role_arn,
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 6 * 60
    },
    "TrainingJobName": linear_job_name

}
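
The `HyperParameters` map above passes every value as a string, and `feature_dim` must match the number of feature columns in the training file (30 here, after the label). The following minimal sketch derives that value from a label-first CSV rather than hard-coding it. The helpers `infer_feature_dim` and `as_hyperparameters` are hypothetical, and the sample data is made up.

```python
import csv
import io

def infer_feature_dim(csv_file):
    """Count feature columns in a label-first CSV: first row width minus the label."""
    first_row = next(csv.reader(csv_file))
    return len(first_row) - 1

def as_hyperparameters(**params):
    """Convert every value to a string, as in the HyperParameters dict above."""
    return {name: str(value) for name, value in params.items()}

# Made-up sample with one label and three features per record.
sample = io.StringIO("1,17.99,10.38,122.8\n0,13.54,14.36,87.46\n")
hyper = as_hyperparameters(
    feature_dim=infer_feature_dim(sample),
    mini_batch_size=100,
    epochs=10,
)
print(hyper)
```

With the real `breast_cancer.csv` opened in place of `sample`, `feature_dim` should come out as `'30'`.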

<a id="train"></a>
### 3.2 Start training job

In [17]:
sm_client.create_training_job(**linear_training_params)

{'ResponseMetadata': {'HTTPHeaders': {'content-length': '120',
   'content-type': 'application/x-amz-json-1.1',
   'date': 'Thu, 20 Dec 2018 13:39:25 GMT',
   'x-amzn-requestid': '0e4d7df2-e2c3-481a-ac5f-5c6c969790a5'},
  'HTTPStatusCode': 200,
  'RequestId': '0e4d7df2-e2c3-481a-ac5f-5c6c969790a5',
  'RetryAttempts': 0},
 'TrainingJobArn': 'arn:aws:sagemaker:us-east-1:014862798213:training-job/breast-cancer-linear-learner-2018-12-20-13-38'}

In [18]:
try:
    sm_client.get_waiter('training_job_completed_or_stopped').wait(TrainingJobName = linear_job_name)
except Exception:
    print('Training job error.')
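
The waiter above repeatedly checks the job status until it reaches a terminal state. The loop below is a simplified sketch of that polling idea, not boto3's actual implementation; `wait_for_status` and the fake status sequence are hypothetical stand-ins for calls to `sm_client.describe_training_job`.

```python
import time

def wait_for_status(describe, terminal_states, delay=30, max_attempts=20, sleep=time.sleep):
    """Poll describe() until it returns a terminal status or attempts run out."""
    for attempt in range(max_attempts):
        status = describe()
        if status in terminal_states:
            return status
        if attempt < max_attempts - 1:
            sleep(delay)  # wait before the next status check
    raise TimeoutError("no terminal status after {} attempts".format(max_attempts))

# Stand-in for sm_client.describe_training_job(...)['TrainingJobStatus'].
fake_statuses = iter(["InProgress", "InProgress", "Completed"])
result = wait_for_status(
    lambda: next(fake_statuses),
    terminal_states={"Completed", "Failed", "Stopped"},
    sleep=lambda seconds: None,  # skip real waiting in this demonstration
)
print(result)
```

With the real waiter, the polling interval and attempt limit can be adjusted by passing `WaiterConfig={'Delay': ..., 'MaxAttempts': ...}` to `wait`.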

In [19]:
train_job_details = sm_client.describe_training_job(TrainingJobName = linear_job_name)
train_job_status = train_job_details['TrainingJobStatus']

if train_job_status == 'Failed':
    print(train_job_details['FailureReason'])
else:
    train_job_arn = train_job_details['TrainingJobArn']
    print(train_job_arn)
    trained_model_uri = train_job_details['ModelArtifacts']['S3ModelArtifacts']
    print(trained_model_uri)

arn:aws:sagemaker:us-east-1:014862798213:training-job/breast-cancer-linear-learner-2018-12-20-13-38
s3://sagemaker-us-east-1-014862798213/breast-cancer/output/Breast-cancer-linear-learner-2018-12-20-13-38/output/model.tar.gz
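
The model artifact URI printed above combines a bucket name and an object key. If the artifact needs to be located later (for example, to download `model.tar.gz`), the URI can be split with the standard library. This is a minimal sketch; `split_s3_uri` is a hypothetical helper.

```python
from urllib.parse import urlparse

def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key); raise ValueError otherwise."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: {!r}".format(uri))
    return parsed.netloc, parsed.path.lstrip("/")

# URI copied from the training job output above.
artifact_bucket, artifact_key = split_s3_uri(
    "s3://sagemaker-us-east-1-014862798213/breast-cancer/output/"
    "Breast-cancer-linear-learner-2018-12-20-13-38/output/model.tar.gz"
)
print(artifact_bucket)
print(artifact_key)
```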


<a id="deployment"></a>
## 4. Deploy the model in the AWS Cloud

### 4.1 Set up deployment parameters

In [21]:
linear_hosting_container = {'Image': training_image, 'ModelDataUrl': trained_model_uri}

create_model_details = sm_client.create_model(
    ModelName = linear_job_name,
    ExecutionRoleArn = role_arn,
    PrimaryContainer = linear_hosting_container)

print(create_model_details['ModelArn'])

arn:aws:sagemaker:us-east-1:014862798213:model/breast-cancer-linear-learner-2018-12-20-13-38


### 4.2 Create endpoint configuration

In [22]:
endpoint_config = 'Breast-cancer-linear-endpoint-config-' + time_suffix
print(endpoint_config)

create_endpoint_config_details = sm_client.create_endpoint_config(
    EndpointConfigName = endpoint_config,
    ProductionVariants = [{
        'InstanceType': 'ml.m4.xlarge',
        'InitialInstanceCount': 1,
        'ModelName': linear_job_name,
        'VariantName': 'AllTraffic'}])

Breast-cancer-linear-endpoint-config-2018-12-20-13-38


In [23]:
endpoint_config_details = sm_client.describe_endpoint_config(EndpointConfigName = endpoint_config)
print(endpoint_config_details)

{'ResponseMetadata': {'RetryAttempts': 0, 'HTTPHeaders': {'content-type': 'application/x-amz-json-1.1', 'x-amzn-requestid': '5f0680c6-6391-4ea5-b862-6c6101ea65d6', 'content-length': '437', 'date': 'Thu, 20 Dec 2018 13:44:56 GMT'}, 'HTTPStatusCode': 200, 'RequestId': '5f0680c6-6391-4ea5-b862-6c6101ea65d6'}, 'EndpointConfigArn': 'arn:aws:sagemaker:us-east-1:014862798213:endpoint-config/breast-cancer-linear-endpoint-config-2018-12-20-13-38', 'EndpointConfigName': 'Breast-cancer-linear-endpoint-config-2018-12-20-13-38', 'CreationTime': datetime.datetime(2018, 12, 20, 13, 44, 47, 151000, tzinfo=tzlocal()), 'ProductionVariants': [{'InitialInstanceCount': 1, 'ModelName': 'Breast-cancer-linear-learner-2018-12-20-13-38', 'VariantName': 'AllTraffic', 'InitialVariantWeight': 1.0, 'InstanceType': 'ml.m4.xlarge'}]}


### 4.3 Create scoring endpoint

In [24]:
scoring_endpoint = 'Breast-cancer-endpoint-scoring-' + time_suffix

create_endpoint_details = sm_client.create_endpoint(
    EndpointName = scoring_endpoint,
    EndpointConfigName = endpoint_config)

In [25]:
try:
    sm_client.get_waiter('endpoint_in_service').wait(EndpointName = scoring_endpoint)
except Exception:
    print('Create scoring endpoint error')

In [26]:
scoring_endpoint_details = sm_client.describe_endpoint(EndpointName = scoring_endpoint)
scoring_endpoint_status = scoring_endpoint_details['EndpointStatus']
if scoring_endpoint_status != 'InService':
    # FailureReason is only present when the endpoint has failed, so fall back to the status.
    print(scoring_endpoint_details.get('FailureReason', scoring_endpoint_status))
else:
    print(scoring_endpoint_details['EndpointArn'])

{'EndpointStatus': 'InService', 'EndpointConfigName': 'Breast-cancer-linear-endpoint-config-2018-12-20-13-38', 'LastModifiedTime': datetime.datetime(2018, 12, 20, 13, 51, 20, 90000, tzinfo=tzlocal()), 'ResponseMetadata': {'RetryAttempts': 0, 'HTTPHeaders': {'content-type': 'application/x-amz-json-1.1', 'x-amzn-requestid': '8053446e-cdde-43ca-a2e6-8e167cc0b4a8', 'content-length': '781', 'date': 'Thu, 20 Dec 2018 13:51:43 GMT'}, 'HTTPStatusCode': 200, 'RequestId': '8053446e-cdde-43ca-a2e6-8e167cc0b4a8'}, 'EndpointName': 'Breast-cancer-endpoint-scoring-2018-12-20-13-38', 'CreationTime': datetime.datetime(2018, 12, 20, 13, 45, 4, 177000, tzinfo=tzlocal()), 'ProductionVariants': [{'DeployedImages': [{'SpecifiedImage': '382416733822.dkr.ecr.us-east-1.amazonaws.com/linear-learner:1', 'ResolvedImage': '382416733822.dkr.ecr.us-east-1.amazonaws.com/linear-learner@sha256:1890548fee5880f79185e1d0180d21decdfad81175961135dfb44cd04504037f', 'ResolutionTime': datetime.datetime(2018, 12, 20, 13, 45, 5,

<a id="score"></a>
## 5. Score the model

### 5.1 Prepare sample data for scoring

You will use data in `csv` format as the scoring payload. The last 20 training records are selected, and the first column (the label) is removed from each.

In [27]:
scoring_data_filename = 'scoring_breast_cancer.csv'

In [28]:
with open(train_data_filename) as f_train:
    with open(scoring_data_filename, 'w') as f_score:
        f_score.writelines([','.join(line.split(',')[1:]) for line in f_train.readlines()[-20:]])
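
The cell above splits each line on commas by hand. An equivalent approach, sketched here with the standard `csv` module on made-up in-memory data, makes the two steps (take the last rows, drop the label) explicit. The helper `make_scoring_rows` is hypothetical.

```python
import csv
import io

def make_scoring_rows(train_file, n_last=20):
    """Return the last n_last rows of a label-first CSV with the label removed."""
    rows = list(csv.reader(train_file))
    return [row[1:] for row in rows[-n_last:]]

# Made-up training data: label first, then two features.
train = io.StringIO("1,17.99,10.38\n0,13.54,14.36\n0,12.45,15.70\n")
payload = io.StringIO()
csv.writer(payload, lineterminator="\n").writerows(make_scoring_rows(train, n_last=2))
print(payload.getvalue())
```

Because these records were also used for training, scores obtained on them are likely to look better than scores on unseen data.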

### 5.2 Send data for scoring

In [30]:
sm_runtime = session.client('runtime.sagemaker')

with open(scoring_data_filename) as f_payload:
    scoring_response = sm_runtime.invoke_endpoint(EndpointName = scoring_endpoint,
                                                  ContentType = 'text/csv',
                                                  Body = f_payload.read().encode())
    
    scored_records = scoring_response['Body'].read().decode()
    print(json.loads(scored_records))

{'predictions': [{'predicted_label': 0.0, 'score': 0.22364917397499084}, {'predicted_label': 1.0, 'score': 0.9998002648353577}, {'predicted_label': 1.0, 'score': 1.0}, {'predicted_label': 0.0, 'score': 0.023445170372724533}, {'predicted_label': 1.0, 'score': 0.968568742275238}, {'predicted_label': 0.0, 'score': 1.1004765809730088e-07}, {'predicted_label': 0.0, 'score': 0.0007473318837583065}, {'predicted_label': 0.0, 'score': 0.2242916375398636}, {'predicted_label': 0.0, 'score': 0.07657745480537415}, {'predicted_label': 0.0, 'score': 0.010532176122069359}, {'predicted_label': 0.0, 'score': 0.22209057211875916}, {'predicted_label': 1.0, 'score': 0.9999949932098389}, {'predicted_label': 1.0, 'score': 1.0}, {'predicted_label': 0.0, 'score': 0.04370381310582161}, {'predicted_label': 1.0, 'score': 1.0}, {'predicted_label': 1.0, 'score': 0.9999145269393921}, {'predicted_label': 0.0, 'score': 0.0780254602432251}, {'predicted_label': 0.0, 'score': 0.004816959146410227}, {'predicted_label': 0.
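
The response is a JSON document with one `predicted_label` and `score` per record. The sketch below extracts both and, when the true labels are available, computes the share of matching predictions. The helper `summarize_predictions`, the sample response, and the true labels are made up for illustration; only the response shape is taken from the output above.

```python
import json

def summarize_predictions(response_text, true_labels=None):
    """Return (labels, scores, accuracy); accuracy is None without true labels."""
    predictions = json.loads(response_text)["predictions"]
    labels = [int(p["predicted_label"]) for p in predictions]
    scores = [p["score"] for p in predictions]
    accuracy = None
    if true_labels is not None:
        if len(true_labels) != len(labels):
            raise ValueError("true_labels and predictions differ in length")
        matches = sum(1 for pred, true in zip(labels, true_labels) if pred == true)
        accuracy = matches / len(labels)
    return labels, scores, accuracy

# Made-up response in the same shape as the endpoint output above.
sample_response = json.dumps({"predictions": [
    {"predicted_label": 0.0, "score": 0.22},
    {"predicted_label": 1.0, "score": 0.99},
    {"predicted_label": 1.0, "score": 0.97},
]})
pred_labels, pred_scores, accuracy = summarize_predictions(sample_response, true_labels=[0, 1, 0])
print(pred_labels, accuracy)
```

In this notebook the true labels for the scored records are the first column of the last 20 lines of `breast_cancer.csv`.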

### Authors

Wojciech Sobala, Data Scientist at IBM