<img src="https://github.com/pmservice/ai-openscale-tutorials/raw/master/notebooks/images/banner.png" align="left" alt="banner">

# Credit risk using SageMaker linear-learner

Contents
- [0. Setup](#setup)
- [1. Introduction](#introduction)
- [2. Load and explore data](#load)
- [3. Create logistic regression model using SageMaker linear-learner algorithm](#model)
- [4. Deploy the SageMaker model in the AWS Cloud](#deployment)
- [5. Score the model](#score)

**Note:** This notebook works correctly with kernel `Python 3.5+`.

<a id="setup"></a>
## 0. Setup

Before you use the sample code in this notebook, you must perform the following setup tasks:

- Create a SageMaker service by following the setup steps described here: https://docs.aws.amazon.com/sagemaker/latest/dg/gs-set-up.html
- Install the required Python packages from the PyPI repository

### Package installation

In [1]:
!pip install -U boto3 | tail -n 1
!pip install -U sagemaker | tail -n 1
!pip install -U pandas | tail -n 1
!pip install -U scikit_learn | tail -n 1
!pip install -U category_encoders | tail -n 1

Requirement not upgraded as not directly required: six>=1.5 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from python-dateutil<3.0.0,>=2.1; python_version >= "2.7"->botocore<1.13.0,>=1.12.94->boto3)
Requirement not upgraded as not directly required: docker-pycreds>=0.4.0 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from docker<4.0,>=3.6.0->docker-compose>=1.23.0->sagemaker)
Requirement not upgraded as not directly required: six>=1.5 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from python-dateutil>=2.5.0->pandas)
Requirement not upgraded as not directly required: numpy>=1.8.2 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from scikit_learn)
Requirement not upgraded as not directly required: six in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from patsy>=0.4.1->category_encoders)


<a id="introduction"></a>
## 1. Introduction

This notebook defines, trains, and deploys a model that predicts credit risk.

<a id="load"></a>
## 2. Load and explore data

In this section you will prepare your data for training with the SageMaker linear-learner algorithm.

- [2.1 Load data from github repository](#download_github)
- [2.2 Explore data](#explore)
- [2.3 Store training data in S3 Object Storage](#upload_s3)

<a id="download_github"></a>
### 2.1 Load data from github repository

In [2]:
import numpy as np
import pandas as pd

In [3]:
data = pd.read_csv('https://raw.githubusercontent.com/pmservice/wml-sample-models/master/spark/credit-risk/data/credit_risk_training.csv')

<a id="explore"></a>
### 2.2 Explore data

In [4]:
print("Sample records:")
display(data.head())

print("Label column summary:")
display(data.Risk.value_counts())

Sample records:


|  | CheckingStatus | LoanDuration | CreditHistory | LoanPurpose | LoanAmount | ExistingSavings | EmploymentDuration | InstallmentPercent | Sex | OthersOnLoan | ... | OwnsProperty | Age | InstallmentPlans | Housing | ExistingCreditsCount | Job | Dependents | Telephone | ForeignWorker | Risk |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0_to_200 | 31 | credits_paid_to_date | other | 1889 | 100_to_500 | less_1 | 3 | female | none | ... | savings_insurance | 32 | none | own | 1 | skilled | 1 | none | yes | No Risk |
| 1 | less_0 | 18 | credits_paid_to_date | car_new | 462 | less_100 | 1_to_4 | 2 | female | none | ... | savings_insurance | 37 | stores | own | 2 | skilled | 1 | none | yes | No Risk |
| 2 | less_0 | 15 | prior_payments_delayed | furniture | 250 | less_100 | 1_to_4 | 2 | male | none | ... | real_estate | 28 | none | own | 2 | skilled | 1 | yes | no | No Risk |
| 3 | 0_to_200 | 28 | credits_paid_to_date | retraining | 3693 | less_100 | greater_7 | 3 | male | none | ... | savings_insurance | 32 | none | own | 1 | skilled | 1 | none | yes | No Risk |
| 4 | no_checking | 28 | prior_payments_delayed | education | 6235 | 500_to_1000 | greater_7 | 3 | male | none | ... | unknown | 57 | none | own | 2 | skilled | 1 | none | yes | Risk |


Label column summary:


No Risk    3330
Risk       1670
Name: Risk, dtype: int64

<a id="upload_s3"></a>
### 2.3 Store training data in S3 Object Storage

You will use the SageMaker built-in linear-learner algorithm. When the training data is in `text/csv` format, this algorithm expects the first column to be the label.

The label column must also be numeric, so you will recode it.

#### Save prepared data to local filesystem

In [5]:
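# Column positions of the string and numeric features; the label (last column) is
# excluded so that it is not one-hot encoded into the feature set.
feature_dtypes = data.dtypes.tolist()[:-1]
string_columns = [idx for idx, ty in enumerate(feature_dtypes) if ty == np.dtype('O')]
numeric_columns = [idx for idx, ty in enumerate(feature_dtypes) if ty != np.dtype('O')]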

In [6]:
data_recoded = pd.concat([data.iloc[:, -1], pd.get_dummies(data.iloc[:, string_columns]), data.iloc[:, numeric_columns]], axis=1)
data_recoded.replace({'Risk': {'Risk': 1, 'No Risk': 0}}, inplace = True)
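
To make the recoding step easier to follow, here is a minimal sketch on a tiny, made-up frame (the `demo` names and values are illustrative and not taken from the dataset). It is a simplified variant rather than the notebook's exact code: string features are one-hot encoded, the label is mapped to 0/1, and the label is placed first.

```python
import pandas as pd

# Hypothetical three-row sample with one string feature, one numeric feature and the label.
demo = pd.DataFrame({
    'Sex': ['female', 'male', 'male'],
    'LoanAmount': [1889, 250, 6235],
    'Risk': ['No Risk', 'No Risk', 'Risk'],
})

# Recode the label to numbers: Risk -> 1, No Risk -> 0.
demo_label = demo['Risk'].map({'Risk': 1, 'No Risk': 0})

# One-hot encode the string features only; numeric features pass through unchanged.
demo_features = pd.get_dummies(demo.drop(columns='Risk'), dtype=int)

# The label goes first, as linear-learner expects for text/csv training data.
demo_recoded = pd.concat([demo_label, demo_features], axis=1)
print(demo_recoded.columns.tolist())
```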

In [7]:
train_data_filename = 'credit_risk.csv'
data_recoded.to_csv(path_or_buf = train_data_filename, index = False, header = False)

**Note:** The header row has to be omitted, and the first column has to be the target.
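
One way to sanity-check these two requirements before uploading is to read the CSV back and inspect it. The sketch below does this on a small hypothetical frame written to an in-memory buffer (the `demo_*` names and values are assumptions for illustration); the same checks could be pointed at `credit_risk.csv`.

```python
import io
import pandas as pd

# Hypothetical miniature of the recoded frame: label first, features after.
demo_recoded = pd.DataFrame({
    'Risk': [0, 1, 0],
    'Sex_female': [1, 0, 1],
    'LoanAmount': [1889, 6235, 250],
})

# Write the data the same way the notebook does: no index, no header.
demo_buffer = io.StringIO()
demo_recoded.to_csv(demo_buffer, index=False, header=False)

# Read it back without assuming a header row.
demo_check = pd.read_csv(io.StringIO(demo_buffer.getvalue()), header=None)

# The first column should hold only 0/1 labels, and every column should be numeric.
first_column_is_binary = set(demo_check.iloc[:, 0].unique()) <= {0, 1}
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in demo_check.dtypes)
print(first_column_is_binary, all_numeric, demo_check.shape)
```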

In [29]:
print(data_recoded.columns.tolist())

['Risk', 'CheckingStatus_0_to_200', 'CheckingStatus_greater_200', 'CheckingStatus_less_0', 'CheckingStatus_no_checking', 'CreditHistory_all_credits_paid_back', 'CreditHistory_credits_paid_to_date', 'CreditHistory_no_credits', 'CreditHistory_outstanding_credit', 'CreditHistory_prior_payments_delayed', 'LoanPurpose_appliances', 'LoanPurpose_business', 'LoanPurpose_car_new', 'LoanPurpose_car_used', 'LoanPurpose_education', 'LoanPurpose_furniture', 'LoanPurpose_other', 'LoanPurpose_radio_tv', 'LoanPurpose_repairs', 'LoanPurpose_retraining', 'LoanPurpose_vacation', 'ExistingSavings_100_to_500', 'ExistingSavings_500_to_1000', 'ExistingSavings_greater_1000', 'ExistingSavings_less_100', 'ExistingSavings_unknown', 'EmploymentDuration_1_to_4', 'EmploymentDuration_4_to_7', 'EmploymentDuration_greater_7', 'EmploymentDuration_less_1', 'EmploymentDuration_unemployed', 'Sex_female', 'Sex_male', 'OthersOnLoan_co-applicant', 'OthersOnLoan_guarantor', 'OthersOnLoan_none', 'OwnsProperty_car_other', 'Owns

#### Upload data to S3 Object Storage

In [8]:
import time
import json
import boto3

In [9]:
aws_credentials = {'access_key': '***', 
                   'secret_key': '***', 
                   'region_name': '***'}

**Note:** You have to provide credentials from your Amazon account.
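
If you prefer not to type secrets into a notebook cell, one possible approach is to read them from the standard AWS environment variables. This is a sketch only: `load_aws_credentials` is a hypothetical helper, not part of boto3, and it assumes `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY` and `AWS_DEFAULT_REGION` are set in your environment.

```python
import os

def load_aws_credentials(env=os.environ):
    """Build the credentials dict from standard AWS environment variables."""
    required = {
        'access_key': 'AWS_ACCESS_KEY_ID',
        'secret_key': 'AWS_SECRET_ACCESS_KEY',
        'region_name': 'AWS_DEFAULT_REGION',
    }
    # Fail early with a clear message if any variable is missing or empty.
    missing = [var for var in required.values() if not env.get(var)]
    if missing:
        raise KeyError('Missing environment variables: ' + ', '.join(sorted(missing)))
    return {key: env[var] for key, var in required.items()}
```

Calling `aws_credentials = load_aws_credentials()` would then produce a dictionary with the same keys as the cell above.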

In [10]:
# The code was removed by Watson Studio for sharing.

In [11]:
session = boto3.Session(
    aws_access_key_id = aws_credentials['access_key'],
    aws_secret_access_key = aws_credentials['secret_key'],
    region_name = aws_credentials['region_name']
)
s3 = session.resource('s3')

**Note:** You have to replace `bucket_name` with the name of a bucket in your S3 Object Storage, and `train_data_path` with the path where the training data will be stored.

**Tip:** You can run the following code to list all your buckets: `[bkt.name for bkt in s3.buckets.all()]`.

In [12]:
bucket_name = '***'
train_data_path = '***'

In [13]:
# The code was removed by Watson Studio for sharing.

In [14]:
output_data_path = 's3://{}/credit-risk/output'.format(bucket_name)
time_suffix = time.strftime("%Y-%m-%d-%H-%M", time.gmtime())

In [15]:
s3_bucket = s3.Bucket(bucket_name)
s3_bucket.upload_file(Filename = train_data_filename, Key = '{}/{}'.format(train_data_path, train_data_filename))

Let's check whether your data has been uploaded successfully.

In [16]:
for s3_obj in s3_bucket.objects.all():
    if (s3_obj.bucket_name == bucket_name) and (train_data_path in s3_obj.key):
        train_data_uri = 's3://{}/{}'.format(s3_obj.bucket_name, s3_obj.key)
        print(train_data_uri)

s3://sagemaker-us-east-1-014862798213/credit-risk/train/credit_risk.csv


<a id="model"></a>
## 3. Create logistic regression model using SageMaker linear-learner algorithm

In this section you will learn how to:

- [3.1 Set up training parameters](#prep)
- [3.2 Start training job](#train)

<a id="prep"></a>
### 3.1 Set up training parameters

In [17]:
from sagemaker.amazon.amazon_estimator import get_image_uri

sm_client = session.client('sagemaker')

In [18]:
training_image = get_image_uri(session.region_name, 'linear-learner')

iam_client = session.client('iam')
[role_arn, *_] = [role['Arn'] for role in iam_client.list_roles()['Roles'] if 'AmazonSageMaker-ExecutionRole' in role['RoleName']]

linear_job_name = 'Credit-risk-linear-learner-' + time_suffix

In [19]:
linear_training_params = {
    "AlgorithmSpecification": {
        "TrainingImage": training_image,
        "TrainingInputMode": "File"
    },
    "HyperParameters": {
        "feature_dim": str(data_recoded.shape[1] - 1),
        "mini_batch_size": "100",
        "predictor_type": "binary_classifier",
        "epochs": "10",
        "num_models": "32",
        "loss": "auto"
    },
    "InputDataConfig": [{
        "ChannelName": "train",
        "ContentType": "text/csv", 
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": train_data_uri,
                "S3DataDistributionType": "ShardedByS3Key"
            }
        }
    }],
    "OutputDataConfig": {"S3OutputPath": output_data_path},
    "ResourceConfig": {
        "InstanceCount": 1,
        "InstanceType": "ml.c4.xlarge",
        "VolumeSizeInGB": 2
    },
    "RoleArn": role_arn,
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 6 * 60
    },
    "TrainingJobName": linear_job_name

}
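
Two details of the parameters above are easy to get wrong: the `HyperParameters` map of a `create_training_job` request takes string values, and `feature_dim` counts the feature columns only (all columns minus the label). The sketch below illustrates both with hypothetical helpers (`stringify_hyperparameters`, `feature_dim_from_shape`) and a made-up shape; neither helper is part of boto3 or the SageMaker SDK.

```python
def stringify_hyperparameters(hyperparameters):
    """Return a copy of the mapping with every value converted to str."""
    return {name: str(value) for name, value in hyperparameters.items()}

def feature_dim_from_shape(shape):
    """Number of feature columns: all columns minus the label column."""
    return shape[1] - 1

# Hypothetical shape: 5000 rows, 1 label column plus 3 feature columns.
demo_shape = (5000, 4)

demo_hyperparameters = stringify_hyperparameters({
    'feature_dim': feature_dim_from_shape(demo_shape),
    'mini_batch_size': 100,
    'predictor_type': 'binary_classifier',
    'epochs': 10,
})
print(demo_hyperparameters)
```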

<a id="train"></a>
### 3.2 Start training job

In [20]:
sm_client.create_training_job(**linear_training_params)

{'ResponseMetadata': {'HTTPHeaders': {'content-length': '118',
   'content-type': 'application/x-amz-json-1.1',
   'date': 'Thu, 14 Feb 2019 15:20:24 GMT',
   'x-amzn-requestid': 'f8ad3085-9ba5-4a93-b29a-049cfd0943dd'},
  'HTTPStatusCode': 200,
  'RequestId': 'f8ad3085-9ba5-4a93-b29a-049cfd0943dd',
  'RetryAttempts': 0},
 'TrainingJobArn': 'arn:aws:sagemaker:us-east-1:014862798213:training-job/credit-risk-linear-learner-2019-02-14-15-20'}

In [21]:
try:
    sm_client.get_waiter('training_job_completed_or_stopped').wait(TrainingJobName = linear_job_name)
except Exception:
    print('Training job error.')

train_job_details = sm_client.describe_training_job(TrainingJobName = linear_job_name)
train_job_status = train_job_details['TrainingJobStatus']

if train_job_status == 'Failed':
    print(train_job_details['FailureReason'])
else:
    train_job_arn = train_job_details['TrainingJobArn']
    print(train_job_arn)
    trained_model_uri = train_job_details['ModelArtifacts']['S3ModelArtifacts']
    print(trained_model_uri)

arn:aws:sagemaker:us-east-1:014862798213:training-job/credit-risk-linear-learner-2019-02-14-15-20
s3://sagemaker-us-east-1-014862798213/credit-risk/output/Credit-risk-linear-learner-2019-02-14-15-20/output/model.tar.gz
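
The waiter used above blocks silently until the job finishes. If you would rather see intermediate statuses, a polling loop around `describe_training_job` is one alternative. The following is a sketch under stated assumptions: `wait_for_training_job`, `poll_seconds` and `max_polls` are hypothetical names, and the client is expected to behave like the `sm_client` created earlier.

```python
import time

# Statuses after which a training job no longer changes.
TERMINAL_STATUSES = {'Completed', 'Failed', 'Stopped'}

def wait_for_training_job(client, job_name, poll_seconds=30, max_polls=40):
    """Poll describe_training_job until the job reaches a terminal status."""
    for _ in range(max_polls):
        details = client.describe_training_job(TrainingJobName=job_name)
        status = details['TrainingJobStatus']
        print('Training job status:', status)
        if status in TERMINAL_STATUSES:
            return details
        time.sleep(poll_seconds)
    raise TimeoutError('Training job did not finish within the polling budget')
```

With the objects from this notebook, the call might look like `train_job_details = wait_for_training_job(sm_client, linear_job_name)`.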


<a id="deployment"></a>
## 4. Deploy the SageMaker model in the AWS Cloud

In this section you will learn how to:

- [4.1 Set up deployment parameters](#setup_training)
- [4.2 Create deployment configuration endpoint](#deployment_configuration)
- [4.3 Create online scoring endpoint](#setup_online_scoring)

<a id="setup_training"></a>
### 4.1 Set up deployment parameters

In [22]:
linear_hosting_container = {'Image': training_image, 'ModelDataUrl': trained_model_uri}

create_model_details = sm_client.create_model(
    ModelName = linear_job_name,
    ExecutionRoleArn = role_arn,
    PrimaryContainer = linear_hosting_container)

print(create_model_details['ModelArn'])

arn:aws:sagemaker:us-east-1:014862798213:model/credit-risk-linear-learner-2019-02-14-15-20


<a id="deployment_configuration"></a>
### 4.2 Create deployment configuration endpoint

In [23]:
endpoint_config = 'Credit-risk-linear-endpoint-config-' + time_suffix
print(endpoint_config)

create_endpoint_config_details = sm_client.create_endpoint_config(
    EndpointConfigName = endpoint_config,
    ProductionVariants = [{
        'InstanceType': 'ml.m4.xlarge',
        'InitialInstanceCount': 1,
        'ModelName': linear_job_name,
        'VariantName': 'AllTraffic'}])

endpoint_config_details = sm_client.describe_endpoint_config(EndpointConfigName = endpoint_config)
print(endpoint_config_details)

Credit-risk-linear-endpoint-config-2019-02-14-15-20
{'ProductionVariants': [{'ModelName': 'Credit-risk-linear-learner-2019-02-14-15-20', 'VariantName': 'AllTraffic', 'InitialVariantWeight': 1.0, 'InstanceType': 'ml.m4.xlarge', 'InitialInstanceCount': 1}], 'EndpointConfigArn': 'arn:aws:sagemaker:us-east-1:014862798213:endpoint-config/credit-risk-linear-endpoint-config-2019-02-14-15-20', 'CreationTime': datetime.datetime(2019, 2, 14, 15, 24, 32, 30000, tzinfo=tzlocal()), 'EndpointConfigName': 'Credit-risk-linear-endpoint-config-2019-02-14-15-20', 'ResponseMetadata': {'RetryAttempts': 0, 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '6a98375e-4cd6-4fae-ae61-130f1fd5f777', 'content-length': '430', 'content-type': 'application/x-amz-json-1.1', 'date': 'Thu, 14 Feb 2019 15:24:31 GMT'}, 'RequestId': '6a98375e-4cd6-4fae-ae61-130f1fd5f777'}}


<a id="setup_online_scoring"></a>
### 4.3 Create online scoring endpoint

In [24]:
scoring_endpoint = 'Credit-risk-endpoint-scoring-' + time_suffix

create_endpoint_details = sm_client.create_endpoint(
    EndpointName = scoring_endpoint,
    EndpointConfigName = endpoint_config)

In [25]:
try:
    sm_client.get_waiter('endpoint_in_service').wait(EndpointName = scoring_endpoint)
except Exception:
    print('Create scoring endpoint error')

scoring_endpoint_details = sm_client.describe_endpoint(EndpointName = scoring_endpoint)
scoring_endpoint_status = scoring_endpoint_details['EndpointStatus']

if scoring_endpoint_status != 'InService':
    # FailureReason is only reported for failed endpoints, so fall back to the status.
    print(scoring_endpoint_details.get('FailureReason', scoring_endpoint_status))
else:
    print(scoring_endpoint_details['EndpointArn'])

arn:aws:sagemaker:us-east-1:014862798213:endpoint/credit-risk-endpoint-scoring-2019-02-14-15-20


<a id="score"></a>
## 5. Score the model

In this section you will learn how to score the deployed model.

- [5.1 Prepare sample data for scoring](#prepare_scoring)
- [5.2 Send payload for scoring](#score)

<a id="prepare_scoring"></a>
### 5.1 Prepare sample data for scoring

You will use data in `csv` format as the scoring payload. The first column (label) is removed from the data, and the last 20 training records are selected as the scoring payload.

In [26]:
scoring_data_filename = 'scoring_credit_risk.csv'

In [27]:
with open(train_data_filename) as f_train:
    with open(scoring_data_filename, 'w') as f_score:
        f_score.writelines([','.join(line.split(',')[1:]) for line in f_train.readlines()[-20:]])
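
As an alternative to re-reading the training file, the same kind of payload could be built directly from the recoded DataFrame. The sketch below uses a hypothetical helper, `build_scoring_payload`, and a small made-up frame; in this notebook it would be applied to `data_recoded`.

```python
import pandas as pd

def build_scoring_payload(frame, n_records=20):
    """Return the last n_records rows as headerless CSV text, without the first (label) column."""
    return frame.iloc[-n_records:, 1:].to_csv(header=False, index=False)

# Hypothetical miniature of the recoded frame: label first, features after.
demo_recoded = pd.DataFrame({
    'Risk': [0, 1, 0],
    'Sex_female': [1, 0, 1],
    'LoanAmount': [1889, 6235, 250],
})

demo_payload = build_scoring_payload(demo_recoded, n_records=2)
print(demo_payload)
```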

<a id="score"></a>
### 5.2 Send payload for scoring

In [28]:
sm_runtime = session.client('runtime.sagemaker')

with open(scoring_data_filename) as f_payload:
    scoring_response = sm_runtime.invoke_endpoint(EndpointName = scoring_endpoint,
                                                  ContentType = 'text/csv',
                                                  Body = f_payload.read().encode())
    
    scored_records = scoring_response['Body'].read().decode()
    print(json.loads(scored_records))

{'predictions': [{'score': 0.0002489904873073101, 'predicted_label': 0.0}, {'score': 0.9993849992752075, 'predicted_label': 1.0}, {'score': 5.1339444326004013e-05, 'predicted_label': 0.0}, {'score': 0.00041797070298343897, 'predicted_label': 0.0}, {'score': 0.00042378739453852177, 'predicted_label': 0.0}, {'score': 0.9998983144760132, 'predicted_label': 1.0}, {'score': 0.0001031524661812, 'predicted_label': 0.0}, {'score': 9.491942182648927e-05, 'predicted_label': 0.0}, {'score': 0.0007143352995626628, 'predicted_label': 0.0}, {'score': 0.00048081946442835033, 'predicted_label': 0.0}, {'score': 0.9991300702095032, 'predicted_label': 1.0}, {'score': 0.00019867101218551397, 'predicted_label': 0.0}, {'score': 0.0005870789173059165, 'predicted_label': 0.0}, {'score': 0.0005030444008298218, 'predicted_label': 0.0}, {'score': 0.00037679175147786736, 'predicted_label': 0.0}, {'score': 0.0006642662337981164, 'predicted_label': 0.0}, {'score': 0.00023503316333517432, 'predicted_label': 0.0}, {'
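
The response is easier to read as a table. The sketch below loads a hypothetical response with the same structure as the output above (only its first two predictions are reproduced) into a DataFrame and maps the numeric labels back to the original class names, following the recoding used earlier (`Risk` is 1, `No Risk` is 0).

```python
import pandas as pd

# Hypothetical response: same structure as the endpoint output, shortened to two records.
demo_response = {'predictions': [
    {'score': 0.0002489904873073101, 'predicted_label': 0.0},
    {'score': 0.9993849992752075, 'predicted_label': 1.0},
]}

# One row per scored record, with the 'score' and 'predicted_label' columns.
predictions = pd.DataFrame(demo_response['predictions'])

# Translate the numeric labels back to the class names used in the training data.
predictions['predicted_risk'] = predictions['predicted_label'].map({1.0: 'Risk', 0.0: 'No Risk'})
print(predictions)
```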

### Authors:

Wojciech Sobala, Data Scientist at IBM