In [1]:
!wget http://archive.ics.uci.edu/ml/machine-learning-databases/00601/ai4i2020.csv

--2021-07-12 17:00:38--  http://archive.ics.uci.edu/ml/machine-learning-databases/00601/ai4i2020.csv
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 522048 (510K) [application/x-httpd-php]
Saving to: ‘ai4i2020.csv’


2021-07-12 17:00:39 (1.34 MB/s) - ‘ai4i2020.csv’ saved [522048/522048]



In [2]:
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

In [15]:
data_df = pd.read_csv('ai4i2020.csv')
data_df.head()

Unnamed: 0,UDI,Product ID,Type,Air temperature [K],Process temperature [K],Rotational speed [rpm],Torque [Nm],Tool wear [min],Machine failure,TWF,HDF,PWF,OSF,RNF
0,1,M14860,M,298.1,308.6,1551,42.8,0,0,0,0,0,0,0
1,2,L47181,L,298.2,308.7,1408,46.3,3,0,0,0,0,0,0
2,3,L47182,L,298.1,308.5,1498,49.4,5,0,0,0,0,0,0
3,4,L47183,L,298.2,308.6,1433,39.5,7,0,0,0,0,0,0
4,5,L47184,L,298.2,308.7,1408,40.0,9,0,0,0,0,0,0


In [16]:
data_columns = ['Machine failure',
                'Air temperature [K]', 
                'Process temperature [K]', 
                'Rotational speed [rpm]', 
                'Torque [Nm]', 
                'Tool wear [min]']

rename_columns = {'Machine failure': 'y',
                  'Air temperature [K]': 'air_temperature',
                  'Process temperature [K]': 'process_temperature',
                  'Rotational speed [rpm]': 'rotational_speed',
                  'Torque [Nm]': 'torque',
                  'Tool wear [min]': 'tool_wear',
                  'H': 'high',
                  'L': 'low',
                  'M': 'medium'}


 
feature_df = pd.concat([data_df[data_columns], pd.get_dummies(data_df['Type'])], axis=1)
feature_df.rename(columns=rename_columns, inplace=True)
feature_df.tail()

Unnamed: 0,y,air_temperature,process_temperature,rotational_speed,torque,tool_wear,high,low,medium
9995,0,298.8,308.4,1604,29.5,14,0,0,1
9996,0,298.9,308.4,1632,31.8,17,1,0,0
9997,0,299.0,308.6,1645,33.4,22,0,0,1
9998,0,299.0,308.7,1408,48.5,25,1,0,0
9999,0,299.0,308.7,1500,40.2,30,0,0,1


In [13]:
import os
import boto3
import sagemaker
import s3fs
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import io
import time
import json
import sagemaker.amazon.common as smac

role = sagemaker.get_execution_role()
sess = sagemaker.Session()

bucket = sess.default_bucket()
prefix = "marcus-machine-failure"

## Upload training data
Now that we've created our dataset, we'll need to upload it to S3, so that Amazon SageMaker training can use it.

In [11]:
import s3fs

s3 = s3fs.S3FileSystem(anon=False)

key = "ai4i2020_raw.csv"
with s3.open(f'{bucket}/{prefix}/{key}','w') as f:
    feature_df.to_csv(f)

Adapted from SageMaker example [Breast Cancer Prediction](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_applying_machine_learning/breast_cancer_prediction/Breast%20Cancer%20Prediction.html) notebook

# Create Features and Labels

## Split the data into 80% training, 10% validation and 10% testing.

In [18]:
rand_split = np.random.rand(len(feature_df))
train_list = rand_split < 0.8
val_list = (rand_split >= 0.8) & (rand_split < 0.9)
test_list = rand_split >= 0.9

data_train = feature_df[train_list]
data_val = feature_df[val_list]
data_test = feature_df[test_list]

train_y = ((data_train.iloc[:, 0] == 1) + 0).to_numpy()
train_X = data_train.iloc[:, 1:].to_numpy()

val_y = ((data_val.iloc[:, 0] == 1) + 0).to_numpy()
val_X = data_val.iloc[:, 1:].to_numpy()

test_y = ((data_test.iloc[:, 0] == 0) + 0).to_numpy()
test_X = data_test.iloc[:, 1:].to_numpy();

Now, we’ll convert the datasets to the recordIO-wrapped protobuf format used by the Amazon SageMaker algorithms, and then upload this data to S3.   We’ll start with training data.

In [19]:
train_file = "linear_train.data"

f = io.BytesIO()
smac.write_numpy_to_dense_tensor(f, train_X.astype("float32"), train_y.astype("float32"))
f.seek(0)

boto3.Session().resource("s3").Bucket(bucket).Object(
    os.path.join(prefix, "train", train_file)
).upload_fileobj(f)

Next we’ll convert and upload the validation dataset.

In [20]:
validation_file = "linear_validation.data"

f = io.BytesIO()
smac.write_numpy_to_dense_tensor(f, val_X.astype("float32"), val_y.astype("float32"))
f.seek(0)

boto3.Session().resource("s3").Bucket(bucket).Object(
    os.path.join(prefix, "validation", validation_file)
).upload_fileobj(f)

## Training the linear model
Once we have the data preprocessed and available in the correct format for training, the next step is to actually train the model using the data. Since this data is relatively small, it isn't meant to show off the performance of the Linear Learner training algorithm, although we have tested it on multi-terabyte datasets.

Again, we'll use the Amazon SageMaker Python SDK to kick off training, and monitor status until it is completed. In this example that takes between 7 and 11 minutes. Despite the dataset being small, provisioning hardware and loading the algorithm container take time upfront.

First, let's specify our containers. Since we want this notebook to run in all 4 of Amazon SageMaker's regions, we'll create a small lookup. More details on algorithm containers can be found in AWS documentation.

In [12]:
from sagemaker import image_uris

container = image_uris.retrieve(region=boto3.Session().region_name, framework="linear-learner")

In [23]:
linear_job = "BASELINE-linear-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

print("Job name is:", linear_job)

linear_training_params = {
    "RoleArn": role,
    "TrainingJobName": linear_job,
    "AlgorithmSpecification": {"TrainingImage": container, "TrainingInputMode": "File"},
    "ResourceConfig": {"InstanceCount": 1, "InstanceType": "ml.c4.2xlarge", "VolumeSizeInGB": 10},
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://{}/{}/train/".format(bucket, prefix),
                    "S3DataDistributionType": "ShardedByS3Key",
                }
            },
            "CompressionType": "None",
            "RecordWrapperType": "None",
        },
        {
            "ChannelName": "validation",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://{}/{}/validation/".format(bucket, prefix),
                    "S3DataDistributionType": "FullyReplicated",
                }
            },
            "CompressionType": "None",
            "RecordWrapperType": "None",
        },
    ],
    "OutputDataConfig": {"S3OutputPath": "s3://{}/{}/".format(bucket, prefix)},
    "HyperParameters": {
        "feature_dim": "8",
        "mini_batch_size": "100",
        "predictor_type": "regressor",
        "epochs": "10",
        "num_models": "32",
        "loss": "absolute_loss",
    },
    "StoppingCondition": {"MaxRuntimeInSeconds": 60 * 60},
}

Job name is: BASELINE-linear-2021-07-12-17-49-52


Now let’s kick off our training job in SageMaker’s distributed, managed training, using the parameters we just created. Because training is managed, we don’t have to wait for our job to finish to continue, but for this case, let’s use boto3’s ‘training_job_completed_or_stopped’ waiter so we can ensure that the job has been started.

In [24]:
%%time

region = boto3.Session().region_name
sm = boto3.client("sagemaker")

sm.create_training_job(**linear_training_params)

status = sm.describe_training_job(TrainingJobName=linear_job)["TrainingJobStatus"]
print(status)
sm.get_waiter("training_job_completed_or_stopped").wait(TrainingJobName=linear_job)
if status == "Failed":
    message = sm.describe_training_job(TrainingJobName=linear_job)["FailureReason"]
    print("Training failed with the following error: {}".format(message))
    raise Exception("Training job failed")

InProgress
CPU times: user 70.4 ms, sys: 8.74 ms, total: 79.1 ms
Wall time: 4min


## Host

Now that we’ve trained the linear algorithm on our data, let’s setup a model which can later be hosted. We will: 1. Point to the scoring container 1. Point to the model.tar.gz that came from training 1. Create the hosting model

In [25]:
linear_hosting_container = {
    "Image": container,
    "ModelDataUrl": sm.describe_training_job(TrainingJobName=linear_job)["ModelArtifacts"][
        "S3ModelArtifacts"
    ],
}

create_model_response = sm.create_model(
    ModelName=linear_job, ExecutionRoleArn=role, PrimaryContainer=linear_hosting_container
)

print(create_model_response["ModelArn"])

arn:aws:sagemaker:us-east-1:405147176623:model/baseline-linear-2021-07-12-17-49-52


Once we’ve setup a model, we can configure what our hosting endpoints should be. Here we specify: 1. EC2 instance type to use for hosting 1. Initial number of instances 1. Our hosting model name

In [26]:
linear_endpoint_config = "VALIDATION-linear-endpoint-config-" + time.strftime(
    "%Y-%m-%d-%H-%M-%S", time.gmtime()
)
print(linear_endpoint_config)
create_endpoint_config_response = sm.create_endpoint_config(
    EndpointConfigName=linear_endpoint_config,
    ProductionVariants=[
        {
            "InstanceType": "ml.m4.xlarge",
            "InitialInstanceCount": 1,
            "ModelName": linear_job,
            "VariantName": "AllTraffic",
        }
    ],
)

print("Endpoint Config Arn: " + create_endpoint_config_response["EndpointConfigArn"])

VALIDATION-linear-endpoint-config-2021-07-12-17-57-30
Endpoint Config Arn: arn:aws:sagemaker:us-east-1:405147176623:endpoint-config/validation-linear-endpoint-config-2021-07-12-17-57-30


Now that we’ve specified how our endpoint should be configured, we can create them. This can be done in the background, but for now let’s run a loop that updates us on the status of the endpoints so that we know when they are ready for use.

In [27]:
%%time

linear_endpoint = "VALIDATION-linear-endpoint-" + time.strftime("%Y%m%d%H%M", time.gmtime())
print(linear_endpoint)
create_endpoint_response = sm.create_endpoint(
    EndpointName=linear_endpoint, EndpointConfigName=linear_endpoint_config
)
print(create_endpoint_response["EndpointArn"])

resp = sm.describe_endpoint(EndpointName=linear_endpoint)
status = resp["EndpointStatus"]
print("Status: " + status)

sm.get_waiter("endpoint_in_service").wait(EndpointName=linear_endpoint)

resp = sm.describe_endpoint(EndpointName=linear_endpoint)
status = resp["EndpointStatus"]
print("Arn: " + resp["EndpointArn"])
print("Status: " + status)

if status != "InService":
    raise Exception("Endpoint creation did not succeed")

VALIDATION-linear-endpoint-202107121758
arn:aws:sagemaker:us-east-1:405147176623:endpoint/validation-linear-endpoint-202107121758
Status: Creating
Arn: arn:aws:sagemaker:us-east-1:405147176623:endpoint/validation-linear-endpoint-202107121758
Status: InService
CPU times: user 275 ms, sys: 10 ms, total: 285 ms
Wall time: 9min 2s


## Predict

Now that we have our hosted endpoint, we can generate statistical predictions from it. Let’s predict on our test dataset to understand how accurate our model is.

There are many metrics to measure classification accuracy. Common examples include include: - Precision - Recall - F1 measure - Area under the ROC curve - AUC - Total Classification Accuracy - Mean Absolute Error

For our example, we’ll keep things simple and use total classification accuracy as our metric of choice. We will also evaluate Mean Absolute Error (MAE) as the linear-learner has been optimized using this metric, not necessarily because it is a relevant metric from an application point of view. We’ll compare the performance of the linear-learner against a naive benchmark prediction which uses majority class observed in the training data set for prediction on the test data.

In [28]:
def np2csv(arr):
    csv = io.BytesIO()
    np.savetxt(csv, arr, delimiter=",", fmt="%g")
    return csv.getvalue().decode().rstrip()

Next, we’ll invoke the endpoint to get predictions.

In [29]:
runtime = boto3.client("runtime.sagemaker")

payload = np2csv(test_X)
response = runtime.invoke_endpoint(
    EndpointName=linear_endpoint, ContentType="text/csv", Body=payload
)
result = json.loads(response["Body"].read().decode())
test_pred = np.array([r["score"] for r in result["predictions"]])

Let’s compare linear learner based mean absolute prediction errors from a baseline prediction which uses majority class to predict every instance.

In [30]:
test_mae_linear = np.mean(np.abs(test_y - test_pred))
test_mae_baseline = np.mean(
    np.abs(test_y - np.median(train_y))
)  ## training median as baseline predictor

print("Test MAE Baseline :", round(test_mae_baseline, 3))
print("Test MAE Linear:", round(test_mae_linear, 3))

Test MAE Baseline : 0.96
Test MAE Linear: 0.96


Let’s compare predictive accuracy using a classification threshold of 0.5 for the predicted and compare against the majority class prediction from training data set.

In [31]:
test_pred_class = (test_pred > 0.5) + 0
test_pred_baseline = np.repeat(np.median(train_y), len(test_y))

prediction_accuracy = np.mean((test_y == test_pred_class)) * 100
baseline_accuracy = np.mean((test_y == test_pred_baseline)) * 100

print("Prediction Accuracy:", round(prediction_accuracy, 1), "%")
print("Baseline Accuracy:", round(baseline_accuracy, 1), "%")

Prediction Accuracy: 4.0 %
Baseline Accuracy: 4.0 %


In [49]:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(test_y, test_pred_baseline, normalize=True)
print(f'Accuracy: {accuracy:.2f}')

Accuracy: 0.04


In [44]:
from sklearn.metrics import precision_recall_fscore_support
(precision, recall, fbeta, support) = precision_recall_fscore_support(test_y, test_pred_baseline, average='micro')
print(f'Precision: {precision:.2f}')
print(f'Recall: {recall:.2f}')
print(f'F-Beta: {fbeta:.2f}')
print(f'Support: {support}')
    

Precision: 0.04
Recall: 0.04
F-Beta: 0.04
Support: None


https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html#sklearn.metrics.precision_recall_fscore_support

## Cleanup

In [50]:
sm.delete_endpoint(EndpointName=linear_endpoint)

{'ResponseMetadata': {'RequestId': 'f36e076c-192b-4dd2-b096-7f118f9acbad',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'f36e076c-192b-4dd2-b096-7f118f9acbad',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '0',
   'date': 'Mon, 12 Jul 2021 18:31:06 GMT'},
  'RetryAttempts': 0}}