# Machine Learning

You have already seen machine learning (ML) in a variety of settings throughout the curriculum. 
ML can be defined as a field of computer science that explores the construction of algorithms and models that learn from data (often referred to as ground truth in the case of supervised ML) to help you identify patterns and make data-driven decisions.


The following diagram is a simplified view of typical phases of the ML process in a production environment.


<img src="../images/ML.PNG">

From a logical perspective, the following is part of every ML process:

- **Obtain training data** - You either build your own dataset, or locate ready-to-use publicly available data. The training data is used as input for the training phase. You will address this in the next Lab Step. 


- **Train your model** - This is an offline phase that often requires a lot of time. This phase corresponds with the _Learning Processing_ part of the diagram above.


- **Store your model** - This is usually a large matrix of numbers, based on the model complexity and the number of input features. This phase corresponds with the _Model_ part of the diagram above.


- **Evaluate your model using part of your dataset** - This is needed to verify whether your model behaves well with new data or not. The evaluation phase can be iterative, and effectively sits between the _Learning Processing_ and _Model_ parts of the diagram above. 


- **Deploy and use your model for real-time predictions** - By this point in the process, several candidate models have usually been trained and evaluated on different data sets. You now narrow them down to one and deploy it. Prediction itself is often quite fast, and in some use cases can even run on low-power devices such as a smartphone. In this Amazon ML lab you will use AWS APIs. This phase corresponds to the _Predicting Processing_ part of the diagram above. 

<br>
In this notebook, we will create a machine learning model for the Iris data.

## Granting Amazon ML Permissions to Read Your Data from Amazon S3

To create a datasource object from your input data in Amazon S3, you must grant Amazon ML the following permissions to the S3 location where your input data is stored.

Use the get_bucket_acl() method to see the access control list attached to the bucket. Access control lists (ACLs) are one of the resource-based access policy options that you can use to manage access to your buckets and objects. You can use ACLs to grant basic read/write permissions to other AWS accounts. 

In [1]:
################################### SET THE FOLLOWING PARAMETERS ###################################################
#***********************************************************************************
#Set the AWS Region
region = 'us-east-1'

#Set the AWS Access ID (given to you by the DSA staff); never commit a real key to a shared notebook
access_id = 'YOUR_AWS_ACCESS_KEY_ID'

#Set the AWS Secret Access Key (given to you by the DSA staff)
access_key = 'YOUR_AWS_SECRET_ACCESS_KEY'

#Change the data source id to include your pawprint
datasource_id = 'irisdata_dsa_lcmhng'

# Change the pawprint portion of the output URL below to use  your actual pawprint
s3_output_url = "s3://dsabucket.module4/pawprint/"  # Give your pawprint here

In [2]:
import boto3
import base64
import json
import os
import sys
import time
import datetime
import random

data_s3_url = "s3://dsabucket.module4/iris_data/irisdata.csv"

# Call to S3 to retrieve the policy for the given bucket
s3 = boto3.client('s3', 
                   aws_access_key_id = access_id, 
                   aws_secret_access_key = access_key)
bucket_name = 'dsabucket.module4'
# get_bucket_acl() method gets the access control policy for the specified bucket.
# We want to use dsabucket.module4 for storing the data. Check and add more policies so AWS ML service can access the dataset
result = s3.get_bucket_acl(Bucket=bucket_name)
print(json.dumps(result, indent=2))

{
  "ResponseMetadata": {
    "RequestId": "S3NHDW8JHX7Z6E4G",
    "HostId": "avymbZRhZnzWCq47VVQmDVfqNCBlyNybdcGqS2f5oPFT3LJ1oc4eSC3EkCsv7AaaO0XbQ2Vw8qM=",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "x-amz-id-2": "avymbZRhZnzWCq47VVQmDVfqNCBlyNybdcGqS2f5oPFT3LJ1oc4eSC3EkCsv7AaaO0XbQ2Vw8qM=",
      "x-amz-request-id": "S3NHDW8JHX7Z6E4G",
      "date": "Wed, 10 Nov 2021 02:00:26 GMT",
      "content-type": "application/xml",
      "transfer-encoding": "chunked",
      "server": "AmazonS3"
    },
    "RetryAttempts": 0
  },
  "Owner": {
    "DisplayName": "dsamasters",
    "ID": "dd6b47de89624f8f1cdfe158faff7cba3652f079c0f8e9b3e8e637f300ebfe6f"
  },
  "Grants": [
    {
      "Grantee": {
        "DisplayName": "dsamasters",
        "ID": "dd6b47de89624f8f1cdfe158faff7cba3652f079c0f8e9b3e8e637f300ebfe6f",
        "Type": "CanonicalUser"
      },
      "Permission": "FULL_CONTROL"
    }
  ]
}
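
The `Grants` list in the response above can be summarized in a few lines. The following is a minimal sketch, assuming a dictionary shaped like the `get_bucket_acl()` output shown above; the helper name `summarize_grants` and the shortened sample dictionary are illustrative and not part of the lab.

```python
def summarize_grants(acl_response):
    """Return (grantee, permission) pairs from a get_bucket_acl()-style dict."""
    pairs = []
    for grant in acl_response.get("Grants", []):
        grantee = grant.get("Grantee", {})
        # Canonical users carry a DisplayName; other grantee types may only have a URI or an ID
        who = grantee.get("DisplayName") or grantee.get("URI") or grantee.get("ID", "unknown")
        pairs.append((who, grant.get("Permission")))
    return pairs


# Sample shaped like the output above (IDs omitted for readability)
sample_acl = {
    "Owner": {"DisplayName": "dsamasters"},
    "Grants": [
        {
            "Grantee": {"DisplayName": "dsamasters", "Type": "CanonicalUser"},
            "Permission": "FULL_CONTROL",
        }
    ],
}

for who, permission in summarize_grants(sample_acl):
    print("%s has %s" % (who, permission))
```

In the notebook, the same helper could be called on `result` to confirm that only the bucket owner holds a grant before the bucket policy is added.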


In the cell below we create a JSON object that defines the access rules for dsabucket.module4, that is, who can access the bucket and what they can do. 

The policy separates bucket-level permissions from object-level permissions because the ListBucket action requires permissions on the bucket while the other actions require permissions on the objects in the bucket. We used two different Amazon Resource Names (ARNs) to specify bucket-level and object-level permissions. The first Resource element specifies arn:aws:s3:::dsabucket.module4 for the ListBucket action so that applications can list all objects in the bucket. The remaining Resource elements specify arn:aws:s3:::dsabucket.module4/* for the PutObjectAcl, GetObject, PutObject, and DeleteObject actions so that applications can read, write, and delete any objects in the bucket.

We did not combine the two ARNs by using a wildcard, such as arn:aws:s3:::dsabucket.module4*. Even though this ARN would grant permissions for all actions in a single statement, it is broader and grants access to any bucket and objects in that bucket that begin with dsabucket.module4, like dsabucket.module4-bucket.
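
The bucket-level and object-level split described above can be sketched as a small helper. This is a simplified illustration, not the policy used in the lab: the function name `build_bucket_policy` is hypothetical, the PutObjectAcl statement is omitted, and `fnmatchcase` is used only as a rough stand-in for how a trailing `*` in an ARN matches names.

```python
import json
from fnmatch import fnmatchcase


def build_bucket_policy(bucket, service="machinelearning.amazonaws.com"):
    """Build a policy dict with separate bucket-level and object-level statements."""
    bucket_arn = "arn:aws:s3:::%s" % bucket
    principal = {"Service": service}
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # ListBucket acts on the bucket itself
                "Effect": "Allow",
                "Principal": principal,
                "Action": ["s3:ListBucket"],
                "Resource": [bucket_arn],
            },
            {   # Object actions act on keys inside the bucket
                "Effect": "Allow",
                "Principal": principal,
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": [bucket_arn + "/*"],
            },
        ],
    }


policy = build_bucket_policy("dsabucket.module4")
print(json.dumps(policy, indent=2))

# A single trailing wildcard also matches other bucket names with the same prefix
other_bucket = "arn:aws:s3:::dsabucket.module4-bucket"
too_broad = fnmatchcase(other_bucket, "arn:aws:s3:::dsabucket.module4*")
scoped = fnmatchcase(other_bucket, "arn:aws:s3:::dsabucket.module4/*")
print(too_broad, scoped)
```

The last two checks show why the combined ARN is avoided: the broad pattern matches the unrelated bucket name, while the object-level pattern does not.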

In [3]:
bucket_name = 'dsabucket.module4'

bucket_policy= {
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "machinelearning.amazonaws.com"},
      "Action": ["s3:ListBucket"],
      "Resource": ["arn:aws:s3:::dsabucket.module4"]
    },
    {
      "Effect": "Allow",
      "Principal": { "Service": "machinelearning.amazonaws.com"},
      "Action": "s3:PutObjectAcl",
      "Resource": "arn:aws:s3:::dsabucket.module4/*"
    },
    {
      "Effect": "Allow",
      "Principal": { "Service": "machinelearning.amazonaws.com"},
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:DeleteObject"
      ],
      "Resource": ["arn:aws:s3:::dsabucket.module4/*"]
    }
  ]
}


# Convert the policy to a JSON string
bucket_policy = json.dumps(bucket_policy)

# Set the new policy on the given bucket
s3.put_bucket_policy(Bucket=bucket_name, Policy=bucket_policy)


{'ResponseMetadata': {'RequestId': '58XZRDB1YNC2QBPW',
  'HostId': 'JfkA1Rfor1uv2mo4Pv//ut0/cZ39B3FCvpmX1XGBBo4lPJ8VIscJvhjriPuQL48KGRvbnT1Q4ms=',
  'HTTPStatusCode': 204,
  'HTTPHeaders': {'x-amz-id-2': 'JfkA1Rfor1uv2mo4Pv//ut0/cZ39B3FCvpmX1XGBBo4lPJ8VIscJvhjriPuQL48KGRvbnT1Q4ms=',
   'x-amz-request-id': '58XZRDB1YNC2QBPW',
   'date': 'Wed, 10 Nov 2021 02:02:52 GMT',
   'server': 'AmazonS3'},
  'RetryAttempts': 0}}

In general, s3:GetObject applies to the objects in a bucket, so its Resource takes the form "Resource": "arn:aws:s3:::my-bucket/*".

s3:ListBucket applies to the bucket itself, so its Resource takes the form "Resource": "arn:aws:s3:::my-bucket".

## Dataset selection and manipulation

Data manipulation prior to using machine learning is very common. The code cell below contains the steps to create a data source that an AWS ML model can use. Whether your input data comes from your own database or from an open dataset, most of the time you need to manipulate the raw data. There can be several reasons for this, including:

- **Data normalization (or feature scaling)** - Fortunately Amazon ML takes care of this for you, but you do want to avoid unbalanced contributions of your input features, in case their values span very different ranges.


- **Feature selection (or subset selection)** - Once again, you don't need to worry about this when using Amazon ML, although removing redundant or irrelevant features before training your model can make it faster and more accurate.


- **Data formatting** - Recall that models are big matrices of data. You always need to feed them with a suitable input format. In this Lab you will use a basic CSV file.
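
To make the normalization point concrete, here is a minimal sketch of min-max feature scaling in plain Python. Amazon ML handles this for you, so the code is only an illustration of the idea; the function name `min_max_scale` and the sample values are assumptions, not data from the lab.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to zeros
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical feature columns with different ranges
sepal_length = [5.0, 6.0, 7.0]
petal_width = [0.2, 1.3, 2.4]

print(min_max_scale(sepal_length))
print(min_max_scale(petal_width))
```

After scaling, both features span the same range, so neither dominates the model simply because its raw values are larger.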

In [4]:
client = boto3.client('machinelearning', region_name='us-east-1', 
                   aws_access_key_id = access_id, 
                   aws_secret_access_key = access_key)

response = client.create_data_source_from_s3(
    DataSourceId=datasource_id,
    DataSourceName='iris data',
    DataSpec={
        'DataLocationS3': 's3://dsabucket.module4/iris_data/irisdata.csv',
        'DataSchema': """{ "version": "1.0", 
                         "targetFieldName": "class",
                         "dataFormat": "CSV", 
                         "dataFileContainsHeader": "true", 
                         "attributes": 
                            [  
                                { "fieldName": "sepal_length", "fieldType": "NUMERIC" }, 
                                { "fieldName": "sepal_width",  "fieldType": "NUMERIC" },  
                                { "fieldName": "petal_length", "fieldType": "NUMERIC" }, 
                                { "fieldName": "petal_width",  "fieldType": "NUMERIC" }, 
                                { "fieldName": "class",        "fieldType": "CATEGORICAL" }
                            ]
                      }"""
              },
    ComputeStatistics=True
)

### Create an MLModel using the DataSource and the recipe as information sources.

CreateMLModel is an asynchronous operation. In response to CreateMLModel, Amazon Machine Learning (Amazon ML) immediately returns and sets the MLModel status to PENDING. After the MLModel has been created and is ready for use, Amazon ML sets the status to COMPLETED.

The GetMLModel operation is used to check the progress of the MLModel while it is being created.

CreateMLModel requires a DataSource with computed statistics, which can be created by setting ComputeStatistics to true in CreateDataSourceFromS3.

In [5]:
# Below are the schema and recipe file names, the model name, and placeholder IDs
schema_fn = "iris.schema"
recipe_fn = "recipe.json"
name = "iris classification"
created_model_id=""
train_ds_id=""
test_ds_id=""
eval_id=""


# create AWS machine learning client
ml = boto3.client('machinelearning', region_name='us-east-1', 
                   aws_access_key_id = access_id, 
                   aws_secret_access_key = access_key)


In [6]:
def build_model(data_s3_url, schema_fn, recipe_fn, name, train_percent=70):
    """Creates all the objects needed to build an ML Model & evaluate its quality.
    """
    # ml - aws machine learning object
    # data_s3_url - S3 location where input data is located
    # schema_fn - schema of IRIS dataset
    # train_percent - How much % data you want in training set
    # name - name of the ML model
    # recipe_fn - a JSON file that describes how the attributes in the dataset are processed
    
    # Create train and test data sources
    (train_ds_id, test_ds_id) = create_data_sources(ml, data_s3_url, schema_fn, train_percent, name)
    
    # Create the model using train dataset
    ml_model_id = create_model(ml, train_ds_id, recipe_fn, name)
    
    eval_id = create_evaluation(ml, ml_model_id, test_ds_id, name)

    return ml_model_id

In [7]:
import uuid


def create_data_sources(ml, data_s3_url, schema_fn, train_percent, name):
    """Create two data sources.  One with (train_percent)% of the data,
    which will be used for training.  The other one with the remainder of the data,
    which is commonly called the "test set" and will be used to evaluate the quality
    of the ML Model.
    """
    train_ds_id = 'ds-' + str(uuid.uuid4())
    spec = {
        "DataLocationS3": data_s3_url,
        "DataRearrangement": json.dumps({
            "splitting": {
                "percentBegin": 0,
                "percentEnd": train_percent,
                "strategy":"random"
            },
        }),
        "DataSchema": open(schema_fn).read(),
    }
    
    ml.create_data_source_from_s3(
        DataSourceId=train_ds_id,
        DataSpec=spec,
        DataSourceName=name + " - training split",
        ComputeStatistics=True
    )    
    
    print("Created training data set %s" % train_ds_id)

    test_ds_id = 'ds-' + str(uuid.uuid4())
    spec['DataRearrangement'] = json.dumps({
        "splitting": {
            "percentBegin": train_percent,
            "percentEnd": 100,
            "strategy":"random"
        }
    })
    ml.create_data_source_from_s3(
        DataSourceId=test_ds_id,
        DataSpec=spec,
        DataSourceName=name + " - testing split",
        ComputeStatistics=True
    )
    print("Created test data set %s" % test_ds_id)
    return (train_ds_id, test_ds_id)
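
The two `DataRearrangement` strings used above differ only in their percentage range. The following sketch isolates that logic so it can be checked without calling AWS; the helper name `split_rearrangements` is hypothetical, and the JSON keys simply mirror the ones used in `create_data_sources()`.

```python
import json


def split_rearrangements(train_percent):
    """Return (train_spec, test_spec) JSON strings for complementary splits."""
    if not 0 < train_percent < 100:
        raise ValueError("train_percent must be between 0 and 100 (exclusive)")

    def spec(begin, end):
        return json.dumps({
            "splitting": {
                "percentBegin": begin,
                "percentEnd": end,
                "strategy": "random",
            }
        })

    # Training covers [0, train_percent); testing covers the remainder
    return spec(0, train_percent), spec(train_percent, 100)


train_spec, test_spec = split_rearrangements(70)
print(train_spec)
print(test_spec)
```

Keeping the boundary in one place makes it harder for the training and test ranges to overlap or leave a gap.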


The create_model() function defined below calls the create_ml_model() method, which creates a new MLModel using the DataSource and the recipe as information sources.


Syntax: 

    response = client.create_ml_model(
        MLModelId='string',
        MLModelName='string',
        MLModelType='REGRESSION'|'BINARY'|'MULTICLASS',
        Parameters={
            'string': 'string'
        },
        TrainingDataSourceId='string',
        Recipe='string',
        RecipeUri='string'
    )
    


- **sgd.maxMLModelSizeInBytes** - The maximum allowed size of the model. Depending on the input data, the size of the model might affect its performance. The value is an integer that ranges from 100000 to 2147483648. The default value is 33554432.


- **sgd.maxPasses** - The number of times that the training process traverses the observations to build the MLModel. The value is an integer that ranges from 1 to 10000. The default value is 10.


- **sgd.l1RegularizationAmount** - The coefficient regularization L1 norm. It controls overfitting the data by penalizing large coefficients. This tends to drive coefficients to zero, resulting in a sparse feature set. If you use this parameter, start by specifying a small value, such as 1.0E-08. The value is a double that ranges from 0 to MAX_DOUBLE. The default is to not use L1 normalization. This parameter can't be used when L2 is specified. Use this parameter sparingly.


- **Recipe (string)** - The data recipe for creating the MLModel. You must specify either the recipe or its URI. If you don't specify a recipe or its URI, Amazon ML creates a default.


- **RecipeUri (string)** - The Amazon Simple Storage Service (Amazon S3) location and file name that contains the MLModel recipe. You must specify either the recipe or its URI. If you don't specify a recipe or its URI, Amazon ML creates a default.
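
The ranges listed above can be checked before a request is sent. This is a minimal sketch, assuming only the limits quoted in this section; the helper name `make_sgd_parameters` is hypothetical, and the `sgd.l2RegularizationAmount` key is the one used later in this notebook.

```python
def make_sgd_parameters(max_passes=10, max_model_size=33554432, l1=None, l2=None):
    """Validate training parameters and return them as a string-to-string map."""
    if not 1 <= max_passes <= 10000:
        raise ValueError("sgd.maxPasses must be between 1 and 10000")
    if not 100000 <= max_model_size <= 2147483648:
        raise ValueError("sgd.maxMLModelSizeInBytes must be between 100000 and 2147483648")
    if l1 is not None and l2 is not None:
        raise ValueError("L1 and L2 regularization cannot be specified together")

    # The Parameters argument expects string values
    params = {
        "sgd.maxPasses": str(max_passes),
        "sgd.maxMLModelSizeInBytes": str(max_model_size),
    }
    if l1 is not None:
        params["sgd.l1RegularizationAmount"] = str(l1)
    if l2 is not None:
        params["sgd.l2RegularizationAmount"] = str(l2)
    return params


params = make_sgd_parameters(max_passes=100, max_model_size=104857600, l2=1e-4)
print(params)
```

The returned dictionary could be passed as the `Parameters` argument of `create_ml_model()`.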


In [8]:
def create_model(ml, train_ds_id, recipe_fn, name):
    """Creates an ML Model object, which begins the training process.
The quality of the model that the training algorithm produces depends
primarily on the data, but also on the hyper-parameters specified
in the parameters map, and the feature-processing recipe.
    """
    created_model_id = 'ml-' + str(uuid.uuid4())
    ml.create_ml_model(
        MLModelId=created_model_id,
        MLModelName=name + " model",
        MLModelType="MULTICLASS",  # we're predicting one of three species
        Parameters={
            # Refer to the "Machine Learning Concepts" documentation
            # for guidelines on tuning your model
            "sgd.maxPasses": "100",
            "sgd.maxMLModelSizeInBytes": "104857600",  # 100 MiB
            "sgd.l2RegularizationAmount": "1e-4",
        },
        Recipe=open(recipe_fn).read(),
        TrainingDataSourceId=train_ds_id
    )
    print("Created ML Model %s" % created_model_id)
    return created_model_id


In [9]:
def create_evaluation(ml, model_id, test_ds_id, name):
    eval_id = 'ev-' + str(uuid.uuid4())
    ml.create_evaluation(
        EvaluationId=eval_id,
        EvaluationName=name + " evaluation",
        MLModelId=model_id,
        EvaluationDataSourceId=test_ds_id
    )
    print("Created Evaluation %s" % eval_id)
    return eval_id

### Create the model

<a id='creating_model'></a>

In [10]:
model_id = build_model(data_s3_url, schema_fn, recipe_fn, name=name)

Created training data set ds-92cdf442-294b-4e7b-8381-0af5c9d8f04a
Created test data set ds-286e924a-a5dc-48ca-b18d-93959c06728e
Created ML Model ml-76e72c9e-45b0-4727-8e81-66ae52453590
Created Evaluation ev-58ada962-809e-4f3a-87a5-f0ce98277210


We have created the model, but it takes some time for the model to be built and ready to use. The poll function below keeps polling until the status changes to one of ['COMPLETED', 'FAILED', 'INVALID'].

In [11]:
def poll_until_completed(ml, model_id):
    delay = 2
    while True:
        model = ml.get_ml_model(MLModelId=model_id)
        status = model['Status']
        message = model.get('Message', '')
        now = str(datetime.datetime.now().time())
        print("Model %s is %s (%s) at %s" % (model_id, status, message, now))
        if status in ['COMPLETED', 'FAILED', 'INVALID']:
            break

        # exponential backoff with jitter
        delay *= random.uniform(1.1, 2.0)
        time.sleep(delay)
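
The loop above runs until a terminal status appears. A variant with an upper bound on the total wait can be sketched as follows; it is an illustration only, with a hypothetical name `poll_with_timeout`, and it takes a status callable so it can be tried without an AWS connection.

```python
import random
import time


def poll_with_timeout(get_status, timeout_s=600.0, first_delay=2.0,
                      max_delay=60.0, sleep=time.sleep):
    """Poll get_status() until a terminal status or until timeout_s is exceeded."""
    waited = 0.0
    delay = first_delay
    while True:
        status = get_status()
        if status in ("COMPLETED", "FAILED", "INVALID"):
            return status
        if waited >= timeout_s:
            raise TimeoutError("still %s after %.0f seconds" % (status, waited))
        sleep(delay)
        waited += delay
        # Exponential backoff with jitter, capped at max_delay
        delay = min(max_delay, delay * random.uniform(1.1, 2.0))


# Simulated status sequence standing in for repeated get_ml_model() calls
fake_statuses = iter(["PENDING", "PENDING", "INPROGRESS", "COMPLETED"])
final_status = poll_with_timeout(lambda: next(fake_statuses), sleep=lambda s: None)
print(final_status)
```

With a real client, the callable could be `lambda: ml.get_ml_model(MLModelId=model_id)['Status']`.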

In [12]:
poll_until_completed(ml, model_id=model_id)  # Can't use it until it's COMPLETED

Model ml-76e72c9e-45b0-4727-8e81-66ae52453590 is PENDING () at 20:17:09.642442
Model ml-76e72c9e-45b0-4727-8e81-66ae52453590 is PENDING () at 20:17:12.206976
Model ml-76e72c9e-45b0-4727-8e81-66ae52453590 is PENDING () at 20:17:17.274315
Model ml-76e72c9e-45b0-4727-8e81-66ae52453590 is PENDING () at 20:17:25.830105
Model ml-76e72c9e-45b0-4727-8e81-66ae52453590 is PENDING () at 20:17:41.673346
Model ml-76e72c9e-45b0-4727-8e81-66ae52453590 is PENDING () at 20:18:10.478218
Model ml-76e72c9e-45b0-4727-8e81-66ae52453590 is PENDING () at 20:18:56.790081
Model ml-76e72c9e-45b0-4727-8e81-66ae52453590 is INPROGRESS () at 20:20:08.453534
Model ml-76e72c9e-45b0-4727-8e81-66ae52453590 is INPROGRESS (Current Step: TRAINING (1/1) Current Iteration: (100/100) 100%) at 20:21:48.077017
Model ml-76e72c9e-45b0-4727-8e81-66ae52453590 is COMPLETED () at 20:23:39.695900


Wait until the model is created. Once it is in COMPLETED status, you can run the create_realtime_endpoint() function in the code cell below to create an endpoint for the model, which is used for making real-time predictions on new test data. The endpoint contains the URI of the MLModel, that is, the location to send real-time prediction requests for the specified MLModel.

## Note:
Make sure the model ID in the cell below is the ID of the latest machine learning model created above before you create the endpoint. 

[Go to this cell to make sure the ML id is same](#creating_model)

In [20]:
response = client.create_realtime_endpoint(
    MLModelId=model_id
)

In [21]:
import pprint
pprint.pprint(response)

{'MLModelId': 'ml-76e72c9e-45b0-4727-8e81-66ae52453590',
 'RealtimeEndpointInfo': {'CreatedAt': datetime.datetime(2021, 11, 9, 20, 24, 21, 405000, tzinfo=tzlocal()),
                          'EndpointStatus': 'READY',
                          'EndpointUrl': 'https://realtime.machinelearning.us-east-1.amazonaws.com',
                          'PeakRequestsPerSecond': 200},
 'ResponseMetadata': {'HTTPHeaders': {'content-length': '235',
                                      'content-type': 'application/x-amz-json-1.1',
                                      'date': 'Wed, 10 Nov 2021 02:32:02 GMT',
                                      'x-amzn-requestid': '7780b59d-82e1-4c48-9bf1-3785e415780c'},
                      'HTTPStatusCode': 200,
                      'RequestId': '7780b59d-82e1-4c48-9bf1-3785e415780c',
                      'RetryAttempts': 0}}


### Evaluation:

Let's evaluate the model by checking its predictive ability. The test file [iris_record.csv](iris_record.csv) (present in the current working directory, /module3/labs/iris_record.csv) contains one record whose species is virginica. Let's test the model.

Wait for a minute before running the cell below. The create_realtime_endpoint() method in the cell above creates an endpoint for making predictions, and the endpoint status takes some time to update.

## Note:

Make sure the endpoint in the cell below matches the endpoint given in the response output of the create_realtime_endpoint() method.

In [15]:
# from boto3.session import Session
import json
 
# session = Session(aws_access_key_id=access_id, aws_secret_access_key=secret_key)
 
try:
    model = ml.get_ml_model(MLModelId=model_id)
    prediction_endpoint = 'https://realtime.machinelearning.us-east-1.amazonaws.com'
 
    with open('iris_record.csv') as f:
        record_str = f.readline().strip()  # drop the trailing newline
 
    record = {}
    for index,val in enumerate(record_str.split(',')):
        record['Var%03d' % (index+1)] = val
 
    response = ml.predict(MLModelId=model_id, Record=record, PredictEndpoint=prediction_endpoint)
    print(json.dumps(response, indent=2))
    label = response.get('Prediction').get('predictedLabel')
    print("*"*30)
    print("Its a %s." % label)
    
except Exception as e:
    print(e)

{
  "Prediction": {
    "predictedLabel": "virginica",
    "predictedScores": {
      "setosa": 0.17459960281848907,
      "versicolor": 0.24952912330627441,
      "virginica": 0.5758712887763977
    },
    "details": {
      "Algorithm": "SGD",
      "PredictiveModelType": "MULTICLASS"
    }
  },
  "ResponseMetadata": {
    "RequestId": "08aef466-ef8f-4ba4-913f-e31074520ab1",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "x-amzn-requestid": "08aef466-ef8f-4ba4-913f-e31074520ab1",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "223",
      "date": "Wed, 10 Nov 2021 02:25:49 GMT"
    },
    "RetryAttempts": 0
  }
}
******************************
Its a virginica.


The model predicted that the flower belongs to the species virginica, which matches the test record. 
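
The record construction and label selection used above can be tried offline. This is a minimal sketch: the helper name `build_record` and the sample measurements are hypothetical, while the scores are copied from the response shown above.

```python
def build_record(csv_line):
    """Map comma-separated values to Var001, Var002, ... keys."""
    values = csv_line.strip().split(",")
    return {"Var%03d" % (index + 1): value for index, value in enumerate(values)}


# Hypothetical measurements; the real values live in iris_record.csv
record = build_record("6.3,3.3,6.0,2.5\n")
print(record)

# Scores taken from the prediction response above
predicted_scores = {
    "setosa": 0.17459960281848907,
    "versicolor": 0.24952912330627441,
    "virginica": 0.5758712887763977,
}
# The predicted label is the class with the highest score
best_label = max(predicted_scores, key=predicted_scores.get)
print(best_label)
```

Note that the three scores sum to approximately 1, so each can be read as the model's estimated probability for that class.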

## Batch Predictions

The code cells below demonstrate how to use an ML Model to kick off a batch prediction job, which generates predictions on new data. The job takes the ML Model and a test data source as input. The create_batch_prediction() method writes the prediction results to the supplied S3 location.

In [16]:
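# Start a batch prediction job for an existing ML Model

def use_model(ml, model_id, output_s3, inputdatasource_id):
    """Waits until the ML Model is ready, then creates a batch prediction job
    that writes its results to the given S3 location.
    """

    poll_until_completed(ml, model_id)  # Can't use it until it's COMPLETED
    # The update_ml_model call is left commented out, so no threshold is actually
    # applied; `threshold` is a global that is set in a later cell.
#     ml.update_ml_model(MLModelId=model_id, ScoreThreshold=threshold)
    print("Set score threshold for %s to %.2f" % (model_id, threshold))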

    bp_id = 'bp-' + str(uuid.uuid4())
    ml.create_batch_prediction(
        BatchPredictionId=bp_id,
        BatchPredictionName="Batch Prediction for marketing sample",
        MLModelId=model_id,
        BatchPredictionDataSourceId=inputdatasource_id,
        OutputUri=output_s3
    )
    print("Created Batch Prediction %s" % bp_id)


Call the function use_model() to make the batch predictions. Results will be written to the location "s3://dsabucket.module4/<your pawprint>/".

## Note:
 
Make sure s3_output_url contains your pawprint, and set the test data source ID in the cell below where the comment indicates.

In [26]:
train_ds_id

''

In [24]:
import base64
import boto3
import datetime
import os
import random
import sys
import time
import urllib

threshold = 0.7

# NOTE!!!! - You Need to change this to the ID of the test data set shown below "Create The Model" several cells above
# UNSCORED_DATA_ID = "ds-ed65601e-3e60-439d-a1e7-f2327d3d3139" # Replace this Example-ID by the ID of the test data set
UNSCORED_DATA_ID = "ds-286e924a-a5dc-48ca-b18d-93959c06728e"

# parsed_url = urlparse.parse(s3_output_url)

use_model(ml, model_id, s3_output_url, UNSCORED_DATA_ID)

Model ml-76e72c9e-45b0-4727-8e81-66ae52453590 is COMPLETED () at 20:34:30.225967
Set score threshold for ml-76e72c9e-45b0-4727-8e81-66ae52453590 to 0.70
Created Batch Prediction bp-7f525300-825d-415e-a366-c262c866cccf


Wait for 2 minutes before running the cell below. List the objects in the S3 bucket dsabucket.module4 and look for the entries under "<your pawprint>/" to find the path of the results file. 

In [33]:

# s3 = boto3.client('s3')  # again assumes boto.cfg setup, assume AWS S3
for obj in s3.list_objects(Bucket=bucket_name)['Contents']:
    print(obj['Key']+'\n')

aca2zb/batch-prediction/bp-12b29cce-78bc-4ed3-a57c-63673a369cae.manifest

aca2zb/batch-prediction/result/bp-12b29cce-78bc-4ed3-a57c-63673a369cae-irisdata.csv.gz

ajky9b/batch-prediction/bp-d20414a6-31a8-4726-b532-f3085913db51.manifest

ajky9b/batch-prediction/result/bp-d20414a6-31a8-4726-b532-f3085913db51-irisdata.csv.gz

avgnzd/batch-prediction/bp-fd86a58d-9733-453b-be4c-00ee2984d8f2.manifest

avgnzd/batch-prediction/result/bp-fd86a58d-9733-453b-be4c-00ee2984d8f2-irisdata.csv.gz

bbb9hy/batch-prediction/bp-a8545530-7f6f-4f7b-9f8c-8e5e627ec590.manifest

bbb9hy/batch-prediction/result/bp-a8545530-7f6f-4f7b-9f8c-8e5e627ec590-irisdata.csv.gz

bmgwd9/batch-prediction/bp-e9410f78-1d62-4548-b8fc-1aba4e8dcad6.manifest

bmgwd9/batch-prediction/result/bp-e9410f78-1d62-4548-b8fc-1aba4e8dcad6-irisdata.csv.gz

bprh4/batch-prediction/bp-fc656a12-8c4e-4e7a-b69a-6512cc22012e.manifest

bprh4/batch-prediction/result/bp-fc656a12-8c4e-4e7a-b69a-6512cc22012e-irisdata.csv.gz

cjgwx7/batch-prediction/bp-f4f

The batch prediction results are stored under "<your pawprint>/batch-prediction/result/" in the bucket dsabucket.module4. The results are saved as a gzip-compressed file (.csv.gz). Read the object into a stream of bytes; GzipFile() takes the compressed stream and decompresses it. After decoding, the final output is a string. 
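
The decompression step can be tried without S3. The sketch below compresses a small, made-up CSV string in memory and then restores it the same way the results object is read; the sample rows are hypothetical.

```python
import gzip
from io import BytesIO

# Hypothetical CSV content standing in for the batch prediction results
csv_text = "trueLabel,setosa,versicolor,virginica\nsetosa,0.9,0.05,0.05\n"

# Simulate the bytes that would come back from the S3 object body
compressed = gzip.compress(csv_text.encode("utf-8"))

# Wrap the bytes in a stream, decompress, and decode to text
with gzip.GzipFile(fileobj=BytesIO(compressed), mode="rb") as gz:
    restored = gz.read().decode("utf-8")

print(type(restored).__name__)
print(restored == csv_text)
```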


## Note:

Fill in the key of your prediction results file in the cell below.  

It looks similar to this:

    <your pawprint>/batch-prediction/result/bp-335b03ff-fbad-4949-9123-b9564dd0094b-irisdata.csv.gz

In [34]:
## I'm not sure why my pawprint isn't being included, but the below matches the output of what I see above

from io import BytesIO
from gzip import GzipFile

# NOTE!!! - You need to change the key to your prediction results file name shown in the scrollbox above
retr = s3.get_object(Bucket=bucket_name, Key='pawprint/batch-prediction/result/bp-7f525300-825d-415e-a366-c262c866cccf-irisdata.csv.gz') # Get the file name from above list objects output

bytestream = BytesIO(retr['Body'].read())
got_text = GzipFile(None, 'rb', fileobj=bytestream).read().decode('utf-8')
type(got_text)

str

Write the string extracted in the cell above to a CSV file. These are the predictions from the model. 

In [35]:
import io

s = io.StringIO(got_text)
with open('iris_results.csv', 'w') as f:
    for line in s:
        f.write(line)

Read the csv file into a pandas dataframe. 

In [36]:
import pandas as pd

with open('iris_results.csv', 'r') as file:
    df = pd.read_csv(file)
#     df.reset_index(inplace=True)
    print(df.head())

  trueLabel    setosa  versicolor  virginica
0    setosa  0.999532    0.000201   0.000267
1    setosa  0.996606    0.002008   0.001386
2    setosa  0.966665    0.016052   0.017282
3    setosa  0.998670    0.000660   0.000670
4    setosa  0.992480    0.002195   0.005325


The results include a column of true labels. 

In [37]:
truelabels=df['trueLabel']
truelabels.head()

0    setosa
1    setosa
2    setosa
3    setosa
4    setosa
Name: trueLabel, dtype: object

The df.idxmax() function expects numerical inputs, but the "trueLabel" column holds strings. Since we already captured the true labels in the variable truelabels, delete the column trueLabel from the dataframe "df".

In [38]:
del df['trueLabel']

The predictions take the form of one probability per class. These probabilities indicate the chance that the row belongs to each species. Use these probabilities to label the predictions. In the next code cell, df.idxmax(axis=1) returns, for each row, the name of the column that holds the highest value. 

The predictions variable then holds the predicted species name for each row in the test dataset.

In [39]:
predictions=df.idxmax(axis=1)
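
The row-wise selection performed by `idxmax(axis=1)` can be mirrored in plain Python. This is a sketch for illustration only: the first row is copied from the `df.head()` output above, and the second row uses rounded scores from the real-time prediction earlier in the notebook.

```python
rows = [
    {"setosa": 0.999532, "versicolor": 0.000201, "virginica": 0.000267},
    {"setosa": 0.174600, "versicolor": 0.249529, "virginica": 0.575871},
]

# For each row, pick the column name whose probability is highest
predicted = [max(row, key=row.get) for row in rows]
print(predicted)
```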

Generate a confusion matrix using the true labels and the predicted labels.

In [40]:
from sklearn.metrics import confusion_matrix
confusion_matrix(truelabels, predictions)

array([[13,  0,  0],
       [ 0,  9,  4],
       [ 0,  0, 16]])

## Model Accuracy

In [41]:
from sklearn import metrics

print(metrics.accuracy_score(truelabels, predictions))

0.9047619047619048


There it is. The model built by the AWS machine learning service predicted the species with about 90% accuracy on the test split (38 of 42 rows correct). 
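
The accuracy can also be read directly from the confusion matrix shown earlier: the diagonal holds the correct predictions. The sketch below recomputes it in plain Python and adds per-class recall; it assumes the label order setosa, versicolor, virginica, which is the sorted order that `confusion_matrix` uses by default.

```python
matrix = [
    [13, 0, 0],
    [0, 9, 4],
    [0, 0, 16],
]
labels = ["setosa", "versicolor", "virginica"]

# Correct predictions sit on the diagonal
correct = sum(matrix[i][i] for i in range(len(matrix)))
total = sum(sum(row) for row in matrix)
accuracy = correct / total

# Recall per class: correct predictions divided by the true count of that class
recall = {label: matrix[i][i] / sum(matrix[i]) for i, label in enumerate(labels)}

print(correct, total, accuracy)
print(recall)
```

All of the errors come from versicolor rows that were predicted as virginica, which the single accuracy number does not reveal.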

# Save your notebook, then `File > Close and Halt`