# SageMaker Autopilot

Based on: https://aws.amazon.com/blogs/aws/amazon-sagemaker-autopilot-fully-managed-automatic-machine-learning/

## AWSCLI Setup
First, `awscli` was installed from the `conda-forge` channel. The access keys were copied from https://www.rosettahub.com/ (under "Show AWS API Keys") and saved as `~/.aws/credentials`. Because no other AWS configuration existed on my system, the line `[default]` was added at the top of the file so that the keys form the default profile.

The following configuration was saved as `~/.aws/config`:

```
[default]
region=eu-west-1
output=json
```
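Both files are INI-style, and a credentials file without any section header cannot be parsed as a profile. The following is a minimal sketch of that check using Python's standard `configparser`; the helper names `ensure_default_profile` and `read_profile` and the key values are placeholders made up for illustration, not part of the AWS tooling.

```python
import configparser


def ensure_default_profile(credentials_text: str) -> str:
    """Prepend a [default] header when the text has no section header at all."""
    has_section = any(
        line.strip().startswith("[") for line in credentials_text.splitlines()
    )
    return credentials_text if has_section else "[default]\n" + credentials_text


def read_profile(credentials_text: str, profile: str = "default") -> dict:
    """Parse INI-style credentials text and return one profile as a dict."""
    parser = configparser.ConfigParser()
    parser.read_string(credentials_text)
    return dict(parser[profile])


# Placeholder values, not real keys.
raw = "aws_access_key_id = EXAMPLEKEYID\naws_secret_access_key = examplesecret\n"
profile = read_profile(ensure_default_profile(raw))
print(sorted(profile))
```

Without the added header, `read_string` raises `configparser.MissingSectionHeaderError`, which mirrors why the `[default]` line was needed.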


The preprocessed data (see my TransmogrifAI project) was uploaded to the bucket. The test data must not contain the result field, so that column is cut out before uploading. The batch transformer (unlike the trainer) does not accept escaped quotes (`\"`), so they were replaced with spaces.
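The two clean-up steps can also be sketched in plain Python. This is only an illustration under the same naive assumption the shell tools make, namely that fields are separated by bare commas with no commas inside quoted values; the function names and the sample rows are invented for the example.

```python
def drop_field(line: str, field_number: int, sep: str = ",") -> str:
    """Remove the 1-based field `field_number`, like `cut --complement`."""
    fields = line.split(sep)
    del fields[field_number - 1]
    return sep.join(fields)


def unescape_quotes(line: str) -> str:
    """Replace every backslash-escaped double quote with a space."""
    return line.replace('\\"', " ")


def prepare_test_rows(lines, target_field: int):
    """Drop the target column and remove escaped quotes from each row."""
    return [unescape_quotes(drop_field(line, target_field)) for line in lines]


# Invented sample rows; the third column plays the role of the result field.
sample = ["name,score,label", 'The \\"Big\\" School,0.5,1']
print(prepare_test_rows(sample, target_field=3))
```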

In [76]:
%%bash
# One-time bucket creation (already done):
# aws s3 mb s3://lukasjautomlbuck
defaultbucket="lukasjautomlbuck"
repodir="$(git rev-parse --show-toplevel)/Hauptseminar-TransmogrifAI/LukasJansen"
echo "$defaultbucket"
# Drop the result column from the test sets (field 31 for college, 32 for phishing).
cut -d ',' -f 31 --complement "$repodir/college_test_headerfix.csv" > college_test_headerfix_nores.csv
cut -d ',' -f 32 --complement "$repodir/phishing_test_headerfix.csv" > phishing_test_headerfix_nores.csv
# Replace escaped quotes (\") with a space.
sed 's/\\"/ /g' college_test_headerfix_nores.csv  > college_test_headerfix_escapefix_nores.csv
sed 's/\\"/ /g' "$repodir/college_train_headerfix.csv"  > college_train_headerfix_escapefix.csv
# Upload training and test data, then list the bucket.
aws s3 cp college_train_headerfix_escapefix.csv s3://"$defaultbucket"/college/input_train
aws s3 cp college_test_headerfix_escapefix_nores.csv s3://"$defaultbucket"/college/input_test
aws s3 cp "$repodir/phishing_train_headerfix.csv" s3://"$defaultbucket"/phishing/input_train
aws s3 cp phishing_test_headerfix_nores.csv s3://"$defaultbucket"/phishing/input_test
aws s3 ls "$defaultbucket"


lukasjautomlbuck
2021-11-11 15:42:37     724520 college/input_test
2021-11-11 15:42:33    2246087 college/input_train
2021-11-12 11:54:21        790 college/output/automl-dm-12-11-11-50/preprocessed-data/header/headers.csv
2021-11-12 11:54:21      21455 college/output/automl-dm-12-11-11-50/preprocessed-data/tuning_data/train/chunk_20.csv
2021-11-12 11:54:21      21378 college/output/automl-dm-12-11-11-50/preprocessed-data/tuning_data/train/chunk_21.csv
2021-11-12 11:54:21      21234 college/output/automl-dm-12-11-11-50/preprocessed-data/tuning_data/train/chunk_22.csv
2021-11-12 11:54:21      21599 college/output/automl-dm-12-11-11-50/preprocessed-data/tuning_data/train/chunk_23.csv
2021-11-12 11:54:21      21481 college/output/automl-dm-12-11-11-50/preprocessed-data/tuning_data/train/chunk_24.csv
2021-11-12 11:54:21      21859 college/output/automl-dm-12-11-11-50/preprocessed-data/tuning_data/train/chunk_25.csv
2021-11-12 11:54:21      21010 college/output/automl-dm-12-11-11-50/preproc

shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
fatal: Unable to read current working directory: No such file or directory

The user-provided path /TransmogrifAI/LukasJansen//college_train_headerfix.csv does not exist.

The user-provided path /TransmogrifAI/LukasJansen//college_test_headerfix.csv does not exist.

The user-provided path /TransmogrifAI/LukasJansen//phishing_train_headerfix.csv does not exist.

The user-provided path /TransmogrifAI/LukasJansen//phishing_test_headerfix.csv does not exist.


Next, a new execution role with access to any S3 bucket was created at `https://eu-west-1.console.aws.amazon.com/sagemaker/home?region=eu-west-1#/studio/create-domain`. Note that this creates only an execution role, not a domain. The option is found under "User profile".
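The ARN of that role is needed for the API calls later on. As an alternative to grepping the CLI output, a Python sketch could filter the IAM role list directly. The function name `sagemaker_execution_roles` is hypothetical, and unlike `grep` it matches on the role name only, not on the description.

```python
def sagemaker_execution_roles(iam_client):
    """Return (name, arn) pairs for roles whose name contains 'SageMaker'."""
    found = []
    # list_roles is paginated, so walk through every page.
    paginator = iam_client.get_paginator("list_roles")
    for page in paginator.paginate():
        for role in page["Roles"]:
            if "SageMaker" in role["RoleName"]:
                found.append((role["RoleName"], role["Arn"]))
    return found


# Usage with real credentials (not executed here):
#   import boto3
#   print(sagemaker_execution_roles(boto3.client("iam")))
```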

In [1]:
%%bash
aws iam list-roles | grep SageMaker  

            "RoleName": "AmazonSageMaker-ExecutionRole-20211112T113938",
            "Arn": "arn:aws:iam::573849816758:role/service-role/AmazonSageMaker-ExecutionRole-20211112T113938",
            "Description": "SageMaker execution role created from the SageMaker AWS Management Console.",


## Running

In [274]:
import datetime
bucket="lukasjautomlbuck"
arn="arn:aws:iam::573849816758:role/service-role/AmazonSageMaker-ExecutionRole-20211112T113938"

def createjobconfig(problem:str, targetfield, minutes:int=10):
    """Build the job name and the config dicts for create_auto_ml_job.

    problem is the S3 prefix inside the bucket, targetfield the column to
    predict, minutes the runtime limit per training job.
    """
    input_data_config = [{
        'DataSource': {
            'S3DataSource': {
                'S3DataType': 'S3Prefix',
                'S3Uri': f's3://{bucket}/{problem}/input_train'
            }
        },
        'TargetAttributeName': targetfield
    }]
    output_data_config = {
        'S3OutputPath': f's3://{bucket}/{problem}/output'
    }
    jobconfig = {"CompletionCriteria":{"MaxRuntimePerTrainingJobInSeconds": 60*minutes}}
    # The timestamp (day-month-hour-minute) keeps job names unique between runs.
    name = 'automl-dm-' + problem + datetime.datetime.now().strftime("-%d-%m-%H-%M")
    return (name, input_data_config, output_data_config, jobconfig)

createjobconfig("college", "percent_pell_grant")

('automl-dm-college-13-11-14-03',
 [{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix',
     'S3Uri': 's3://lukasjautomlbuck/college/input_train'}},
   'TargetAttributeName': 'percent_pell_grant'}],
 {'S3OutputPath': 's3://lukasjautomlbuck/college/output'},
 {'CompletionCriteria': {'MaxRuntimePerTrainingJobInSeconds': 600}})

In [281]:
import boto3
sm = boto3.client('sagemaker')

name, input_data_config, output_data_config, jobconfig = createjobconfig("college", "percent_pell_grant")
sm.create_auto_ml_job(AutoMLJobName=name,
                      AutoMLJobConfig=jobconfig,
                      InputDataConfig=input_data_config,
                      OutputDataConfig=output_data_config,
                      RoleArn=arn)

{'AutoMLJobArn': 'arn:aws:sagemaker:eu-west-1:573849816758:automl-job/automl-dm-college-13-11-14-04',
 'ResponseMetadata': {'RequestId': 'e03ca8ce-9a65-4a9f-bdfc-04e9135a861c',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'e03ca8ce-9a65-4a9f-bdfc-04e9135a861c',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '100',
   'date': 'Sat, 13 Nov 2021 13:04:39 GMT'},
  'RetryAttempts': 0}}

In [282]:
name2, input_data_config, output_data_config, jobconfig = createjobconfig("phishing", "Result")
sm.create_auto_ml_job(AutoMLJobName=name2,
                      AutoMLJobConfig=jobconfig,
                      InputDataConfig=input_data_config,
                      OutputDataConfig=output_data_config,
                      RoleArn=arn)

{'AutoMLJobArn': 'arn:aws:sagemaker:eu-west-1:573849816758:automl-job/automl-dm-phishing-13-11-14-04',
 'ResponseMetadata': {'RequestId': 'bf2635a9-4953-487a-ad38-c9cb2dca7bf1',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'bf2635a9-4953-487a-ad38-c9cb2dca7bf1',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '101',
   'date': 'Sat, 13 Nov 2021 13:04:54 GMT'},
  'RetryAttempts': 1}}

In [23]:
def jobstat(name):
    interesting = ['AutoMLJobName','LastModifiedTime', 'AutoMLJobStatus', 'AutoMLJobSecondaryStatus']
    status = sm.describe_auto_ml_job(AutoMLJobName=name)
    return {key:str(status[key]) for key in interesting}

In [306]:
from pprint import pprint
pprint(jobstat(name))
pprint(jobstat(name2))

{'AutoMLJobName': 'automl-dm-college-13-11-14-04',
 'AutoMLJobSecondaryStatus': 'ModelTuning',
 'AutoMLJobStatus': 'InProgress',
 'LastModifiedTime': '2021-11-13 14:44:54.611000+01:00'}
{'AutoMLJobName': 'automl-dm-phishing-13-11-14-04',
 'AutoMLJobSecondaryStatus': 'ModelTuning',
 'AutoMLJobStatus': 'InProgress',
 'LastModifiedTime': '2021-11-13 14:44:50.788000+01:00'}
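Instead of re-running the status cell by hand, the check could be wrapped in a polling loop. This is a sketch only: `wait_for_automl_job` and its parameters are hypothetical, and the client is passed in so the loop can be tried without touching AWS. `Completed`, `Failed` and `Stopped` are treated as the final values of `AutoMLJobStatus`.

```python
import time

TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def wait_for_automl_job(sm_client, job_name, poll_seconds=60, max_polls=120,
                        sleep=time.sleep):
    """Poll describe_auto_ml_job until the job reaches a terminal state."""
    for _ in range(max_polls):
        response = sm_client.describe_auto_ml_job(AutoMLJobName=job_name)
        status = response["AutoMLJobStatus"]
        if status in TERMINAL_STATES:
            return status
        sleep(poll_seconds)  # still InProgress or Stopping, so wait and retry
    raise TimeoutError(f"{job_name} still running after {max_polls} polls")


# Usage with the client from the cells above (not executed here):
#   wait_for_automl_job(sm, name)
```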


In [307]:
# This cell was run 10 minutes after model tuning started.
#sm.stop_auto_ml_job(AutoMLJobName=name)
#sm.stop_auto_ml_job(AutoMLJobName=name2)

{'ResponseMetadata': {'RequestId': 'aa73bc82-53c2-40d5-90c1-21fac2b1e401',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'aa73bc82-53c2-40d5-90c1-21fac2b1e401',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '0',
   'date': 'Sat, 13 Nov 2021 13:45:02 GMT'},
  'RetryAttempts': 2}}

In [308]:
import pandas as pd

def getCandidates(name, order='Descending'):
    candidates = sm.list_candidates_for_auto_ml_job(AutoMLJobName=name, SortOrder=order,MaxResults=100, SortBy='FinalObjectiveMetricValue')['Candidates']
    res = [{"name": c['CandidateName'], 'metric':c['FinalAutoMLJobObjectiveMetric']['MetricName'], 'value':c['FinalAutoMLJobObjectiveMetric']['Value']} for c in candidates]
    return pd.DataFrame(res)

getCandidates(name, order='Ascending').head()


| | name | metric | value |
|---|---|---|---|
| 0 | automl-dm-college-13-11-14-046vl-063-6cef36c7 | validation:mse | 0.01578 |
| 1 | automl-dm-college-13-11-14-046vl-086-977a83a8 | validation:mse | 0.01596 |
| 2 | automl-dm-college-13-11-14-046vl-091-a182d1af | validation:mse | 0.01603 |
| 3 | automl-dm-college-13-11-14-046vl-075-4a38776e | validation:mse | 0.01622 |
| 4 | automl-dm-college-13-11-14-046vl-093-390d10a1 | validation:mse | 0.01629 |


In [309]:
getCandidates(name2, order='Descending').head()

| | name | metric | value |
|---|---|---|---|
| 0 | automl-dm-phishing-13-11-14-04UD-004-20cf060b | validation:f1_binary | 0.9608 |
| 1 | automl-dm-phishing-13-11-14-04UD-002-3c94abf8 | validation:f1_binary | 0.96061 |
| 2 | automl-dm-phishing-13-11-14-04UD-016-e7e0d16c | validation:f1_binary | 0.9582 |
| 3 | automl-dm-phishing-13-11-14-04UD-001-0e6808de | validation:f1_binary | 0.95659 |
| 4 | automl-dm-phishing-13-11-14-04UD-003-81d68dac | validation:f1_binary | 0.95429 |
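The sort order differs between the two jobs because a lower MSE is better while a higher F1 is better. `getCandidates` also returns at most 100 candidates per call. If a job produced more, the remaining pages could be fetched by following `NextToken`, roughly as sketched below. The function name `all_candidates` is hypothetical and the client is passed in as an argument.

```python
def all_candidates(sm_client, job_name, order="Descending"):
    """Collect every candidate of an AutoML job, following NextToken."""
    kwargs = {
        "AutoMLJobName": job_name,
        "SortOrder": order,
        "SortBy": "FinalObjectiveMetricValue",
        "MaxResults": 100,
    }
    candidates = []
    while True:
        page = sm_client.list_candidates_for_auto_ml_job(**kwargs)
        candidates.extend(page["Candidates"])
        token = page.get("NextToken")
        if not token:
            return candidates
        kwargs["NextToken"] = token  # request the next page


# Usage with the client from the cells above (not executed here):
#   len(all_candidates(sm, name, order="Ascending"))
```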


In [310]:
best_candidate=sm.describe_auto_ml_job(AutoMLJobName=name)["BestCandidate"]
{"name": best_candidate['CandidateName'], 'metric':best_candidate['FinalAutoMLJobObjectiveMetric']['MetricName'], 'value':best_candidate['FinalAutoMLJobObjectiveMetric']['Value']}

{'name': 'automl-dm-college-13-11-14-046vl-063-6cef36c7',
 'metric': 'validation:mse',
 'value': 0.015780000016093254}

In [311]:
best_candidate2=sm.describe_auto_ml_job(AutoMLJobName=name2)["BestCandidate"]
{"name": best_candidate2['CandidateName'], 'metric':best_candidate2['FinalAutoMLJobObjectiveMetric']['MetricName'], 'value':best_candidate2['FinalAutoMLJobObjectiveMetric']['Value']}

{'name': 'automl-dm-phishing-13-11-14-04UD-004-20cf060b',
 'metric': 'validation:f1_binary',
 'value': 0.9607999920845032}

In [312]:
model_name='automl-colmodel-' + datetime.datetime.now().strftime("%d-%m-%H-%M")
model = sm.create_model(Containers=best_candidate['InferenceContainers'],
                            ModelName=model_name,
                            ExecutionRoleArn=arn)

In [313]:

model_name2='automl-phimodel-' + datetime.datetime.now().strftime("%d-%m-%H-%M")
model2 = sm.create_model(Containers=best_candidate2['InferenceContainers'],
                            ModelName=model_name2,
                            ExecutionRoleArn=arn)                         

In [314]:
transform_name='automl-coltrans-' + datetime.datetime.now().strftime("%d-%m-%H-%M")
transformer = sm.create_transform_job(
    TransformJobName=transform_name,
    ModelName=model_name,
    DataProcessing={"JoinSource": "Input"},
    BatchStrategy="MultiRecord",
    TransformInput={
        "DataSource": {"S3DataSource": {"S3DataType": "S3Prefix",
                                        "S3Uri": f"s3://{bucket}/college/input_test"}},
        "ContentType": "text/csv",
        "SplitType": "Line",
        "CompressionType": "None",
    },
    TransformOutput={"S3OutputPath": f"s3://{bucket}/college/output_test4",
                     "Accept": "text/csv", "AssembleWith": "Line"},
    TransformResources={"InstanceType": "ml.m5.4xlarge", "InstanceCount": 1},
)


In [315]:
transform_name2='automl-phitrans-' + datetime.datetime.now().strftime("%d-%m-%H-%M")
transformer = sm.create_transform_job(
    TransformJobName=transform_name2,
    ModelName=model_name2,
    DataProcessing={"JoinSource": "Input"},
    BatchStrategy="MultiRecord",
    TransformInput={
        "DataSource": {"S3DataSource": {"S3DataType": "S3Prefix",
                                        "S3Uri": f"s3://{bucket}/phishing/input_test"}},
        "ContentType": "text/csv",
        "SplitType": "Line",
        "CompressionType": "None",
    },
    TransformOutput={"S3OutputPath": f"s3://{bucket}/phishing/output_test4",
                     "Accept": "text/csv", "AssembleWith": "Line"},
    TransformResources={"InstanceType": "ml.m5.4xlarge", "InstanceCount": 1},
)
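The two transform calls differ only in the problem name, the model and the job name, so the arguments could be built by one helper. The sketch below uses the same values as the cells above; the helper name `transform_job_args` and its parameters are hypothetical.

```python
def transform_job_args(bucket, problem, model_name, job_name,
                       output_prefix="output_test4",
                       instance_type="ml.m5.4xlarge"):
    """Build the keyword arguments for create_transform_job for one problem."""
    return {
        "TransformJobName": job_name,
        "ModelName": model_name,
        # Join each prediction onto its input row in the output file.
        "DataProcessing": {"JoinSource": "Input"},
        "BatchStrategy": "MultiRecord",
        "TransformInput": {
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": f"s3://{bucket}/{problem}/input_test",
            }},
            "ContentType": "text/csv",
            "SplitType": "Line",
            "CompressionType": "None",
        },
        "TransformOutput": {
            "S3OutputPath": f"s3://{bucket}/{problem}/{output_prefix}",
            "Accept": "text/csv",
            "AssembleWith": "Line",
        },
        "TransformResources": {"InstanceType": instance_type, "InstanceCount": 1},
    }


# Usage with the names from the cells above (not executed here):
#   sm.create_transform_job(**transform_job_args(bucket, "college", model_name, transform_name))
```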

In [317]:
print(transform_name)
print(sm.describe_transform_job(TransformJobName=transform_name)['TransformJobStatus'])
print(transform_name2)
print(sm.describe_transform_job(TransformJobName=transform_name2)['TransformJobStatus'])

automl-coltrans-13-11-14-45
Completed
automl-phitrans-13-11-14-45
Completed


In [318]:
from numpy import sqrt
from sklearn.metrics import f1_score,mean_squared_error
# Reading s3:// URLs with pandas requires the s3fs package.
# Because of JoinSource=Input, the prediction is the last column of each output row.
result = pd.read_csv(f's3://{bucket}/phishing/output_test4/input_test.out')
prediction = result.iloc[:,-1].astype('float32')
# The local test file still contains the true result column.
test = pd.read_csv("../../Hauptseminar-TransmogrifAI/LukasJansen/phishing_test_headerfix.csv")
shouldbe = test.loc[:,"Result"].astype('float32')
print(f1_score(y_pred=prediction, y_true=shouldbe, average="binary"))

# Same procedure for the regression problem, scored with RMSE.
result = pd.read_csv(f's3://{bucket}/college/output_test4/input_test.out')
prediction = result.iloc[:,-1].astype('float32')
test = pd.read_csv("../../Hauptseminar-TransmogrifAI/LukasJansen/college_test_headerfix.csv")
shouldbe = test.loc[:,"percent_pell_grant"].astype('float32')
print(sqrt(mean_squared_error(y_pred=prediction, y_true=shouldbe)))



0.9340341655716163
0.18708277
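To make clear what the two numbers measure, both metrics can be written out by hand. The sketch below uses invented toy data, not the notebook's predictions. It also converts the best validation MSE reported by Autopilot (0.01578) to an RMSE of about 0.126, so it can be compared with the test RMSE on the same scale.

```python
import math


def f1_binary(y_true, y_pred, positive=1):
    """F1 score of the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def rmse(y_true, y_pred):
    """Square root of the mean squared difference between truth and prediction."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Toy data, not taken from the notebook.
print(f1_binary([1, 1, 0, 0], [1, 0, 0, 1]))
print(rmse([0.2, 0.4], [0.1, 0.6]))

# Best college candidate: validation MSE 0.01578, converted to RMSE.
validation_rmse = math.sqrt(0.01578)
print(round(validation_rmse, 4))
```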


Results:
F1: 
0.9147609147609147
0.9248100602567461
0.9340341655716163

RMSE:
0.1916199
0.1888683
0.18708277
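The last value in each list matches the output of the evaluation cell above. Assuming the other two values come from earlier runs of the same notebook, which the notes do not state explicitly, the spread could be summarized as follows. The helper `summarize` is made up for this sketch.

```python
from statistics import mean

# Values copied from the results listed above.
f1_runs = [0.9147609147609147, 0.9248100602567461, 0.9340341655716163]
rmse_runs = [0.1916199, 0.1888683, 0.18708277]


def summarize(values):
    """Return the minimum, maximum and mean of a list of metric values."""
    return {"min": min(values), "max": max(values), "mean": mean(values)}


print("F1:", summarize(f1_runs))
print("RMSE:", summarize(rmse_runs))
```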