# This notebook is used to configure RedShift ML

**Note:** Please set kernel to `Python 3 (Data Science)`

This notebook trains a model directly from Redshift using SageMaker Autopilot; the trained model then runs inside the Redshift cluster.

**Note:** Predictions in this notebook come from the model trained by Redshift ML, not from a SageMaker endpoint.

## Overview of RedShift ML 

Amazon Redshift ML makes it easy for data analysts and database developers to create, train, and apply machine learning models using familiar SQL commands in Amazon Redshift data warehouses. Simply use SQL statements to create and train Amazon SageMaker machine learning models using your Redshift data and then use these models to make predictions. Redshift ML makes the model available as a SQL function within your Redshift data warehouse so you can easily apply it directly in your queries and reports. For example, you can use customer retention data in Redshift to train a churn detection model and then apply that model to your dashboards for your marketing team to offer incentives to customers at risk of churning.

Redshift ML supports training a model with SageMaker Autopilot or bringing your own model (BYOM). BYOM supports both local and remote inference. Local inference runs a model created from a SageMaker training job or loaded from an S3 path where the model artifacts are stored. Remote inference uses a SageMaker endpoint. For more information on BYOM, please refer to https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_MODEL.html#r_byom_create_model . When using BYOM, your inputs must match what the model expects. For example, in notebook 04 the data had to be preprocessed before being sent to the model for predictions; the same preprocessing is required when using BYOM.
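
To make the two BYOM modes concrete, the sketch below builds the corresponding `CREATE MODEL` statements as strings. It follows the BYOM syntax in the documentation linked above as best understood here. The helper name `build_byom_statement` and every argument value (model name, training job name, endpoint name, role ARN, bucket, column types) are hypothetical placeholders, so check the clause order against the documentation before running the generated SQL.

```python
def build_byom_statement(model_name, function_name, input_types, return_type,
                         iam_role, model_source=None, endpoint_name=None,
                         s3_bucket=None):
    """Build a BYOM CREATE MODEL statement (hypothetical helper, not a library API).

    Pass model_source (a SageMaker training job name or an S3 path) for local
    inference, or endpoint_name for remote inference.
    """
    if (model_source is None) == (endpoint_name is None):
        raise ValueError("Provide exactly one of model_source or endpoint_name")

    lines = [f"CREATE MODEL {model_name}"]
    if model_source is not None:
        # Local inference: the model artifact is imported into Redshift.
        lines.append(f"FROM '{model_source}'")
    lines.append(f"FUNCTION {function_name} ({', '.join(input_types)})")
    lines.append(f"RETURNS {return_type}")
    if endpoint_name is not None:
        # Remote inference: predictions are served by a SageMaker endpoint.
        lines.append(f"SAGEMAKER '{endpoint_name}'")
    lines.append(f"IAM_ROLE '{iam_role}'")
    if model_source is not None and s3_bucket is not None:
        lines.append(f"SETTINGS (S3_BUCKET '{s3_bucket}')")
    return "\n".join(lines) + ";"


# All values below are placeholders for illustration only.
example_role = 'arn:aws:iam::123456789012:role/example-redshift-ml-role'

local_sql = build_byom_statement(
    'example_local_model', 'predict_example_local', ['int', 'varchar'], 'varchar',
    example_role, model_source='example-training-job', s3_bucket='example-bucket')

remote_sql = build_byom_statement(
    'example_remote_model', 'predict_example_remote', ['int', 'varchar'], 'varchar',
    example_role, endpoint_name='example-endpoint')

print(local_sql)
print(remote_sql)
```

In both modes the argument types listed after `FUNCTION` must match the inputs the model was trained on, which is why any preprocessing has to happen before the function is called.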

In this notebook, you will use RedShift ML which in turn uses SageMaker AutoPilot to train the model.

### Variables
This cell sets the name of the secret in AWS Secrets Manager, along with the Redshift ML model name and function name. The Redshift, Athena, and Glue connection details are stored in the secret.

In [1]:
secret_name='bankdm_redshift_login' 

model_name = 'dm01'
function_name = 'predict_dm01'

### Install and import libraries

In [2]:
!pip install -q SQLAlchemy==1.3.13
!pip install psycopg2-binary pyathena
!pip install -U pip
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from pyathena import connect
from botocore.exceptions import ClientError
import pandas as pd
import time
import json
import boto3
import sagemaker


### Create client session


In [3]:
# Get region 
session = boto3.session.Session()
region_name = session.region_name

# Get SageMaker session & default S3 bucket
sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()

s3 = boto3.client('s3')
redshift = boto3.client('redshift')
secretsmanager = boto3.client('secretsmanager')

# Alias for the region name obtained above
region = region_name

### Get credentials & connection information from Secrets Manager

In [4]:
import base64  # needed to decode a binary secret

try:
    get_secret_value_response = secretsmanager.get_secret_value(
        SecretId=secret_name
    )
    secret_arn = get_secret_value_response['ARN']

except ClientError as e:
    print("Error retrieving secret. Error: " + e.response['Error']['Message'])
    # Re-raise: the rest of the notebook cannot run without the secret.
    raise

else:
    # Depending on whether the secret is a string or binary, one of these fields will be populated.
    if 'SecretString' in get_secret_value_response:
        secret = get_secret_value_response['SecretString']
    else:
        secret = base64.b64decode(get_secret_value_response['SecretBinary'])

# Parse the secret payload into a dictionary of connection settings
secret_json = json.loads(secret)
master_user_name = secret_json['username']
master_user_pw = secret_json['password']
redshift_port = secret_json['port']
redshift_cluster_identifier = secret_json['dbClusterIdentifier']
redshift_endpoint_address = secret_json['host']

database_name_redshift = secret_json['database_name_redshift']
database_name_glue = secret_json['database_name_glue']

schema_redshift = secret_json['schema_redshift']
schema_athena = secret_json['schema_athena']

table_name_glue = secret_json['table_name_glue']
table_name_redshift = secret_json['table_name_redshift']

# print(master_user_name)
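
Because every later cell depends on the keys read above, it can help to check that the secret contains all of them before going further. The following is a minimal sketch: the key names are the ones read in the cell above, while the helper `find_missing_keys` and the example payload are hypothetical.

```python
import json

# Keys the notebook reads from the secret (taken from the cell above).
REQUIRED_KEYS = [
    'username', 'password', 'port', 'dbClusterIdentifier', 'host',
    'database_name_redshift', 'database_name_glue',
    'schema_redshift', 'schema_athena',
    'table_name_glue', 'table_name_redshift',
]


def find_missing_keys(secret_dict, required_keys=REQUIRED_KEYS):
    """Return the required keys that are absent from the parsed secret."""
    return [key for key in required_keys if key not in secret_dict]


# Hypothetical, deliberately incomplete payload for illustration.
example_secret = json.loads('{"username": "example_user", "host": "example-host"}')
missing = find_missing_keys(example_secret)
print("Missing keys:", missing)
```

In the notebook itself, calling `find_missing_keys(secret_json)` right after `json.loads(secret)` would surface a misconfigured secret as a clear list rather than a `KeyError` on the first missing field.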

## RedShift

### Connect to RedShift

In [5]:
response = redshift.describe_clusters(ClusterIdentifier=redshift_cluster_identifier)
iam_role = response['Clusters'][0]['IamRoles'][0]['IamRoleArn']

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(master_user_name, master_user_pw, redshift_endpoint_address, redshift_port, database_name_redshift))
session = sessionmaker()
session.configure(bind=engine)
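
One caveat with building the connection URL by plain string formatting: a password containing characters such as `@`, `/`, or `:` would be misread as part of the URL structure. A minimal sketch of one way to guard against that is to percent-encode the credentials with the standard library's `quote_plus`. The helper name `build_redshift_url` and the example values are hypothetical.

```python
from urllib.parse import quote_plus


def build_redshift_url(user, password, host, port, database):
    """Build a PostgreSQL-style connection URL with percent-encoded credentials."""
    return 'postgresql://{}:{}@{}:{}/{}'.format(
        quote_plus(user), quote_plus(password), host, port, database)


# Placeholder credentials; the password deliberately contains URL-special characters.
url = build_redshift_url('example_user', 'p@ss/w:rd', 'example-host', 5439, 'example_db')
print(url)
```

The resulting string can be passed to `create_engine` in place of the formatted URL used above.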


### Create model using SageMaker AutoPilot

To train a model, use the `CREATE MODEL` SQL command in Redshift and specify training data either as a table or SELECT statement. Redshift ML then compiles and imports the trained model inside the Redshift data warehouse and prepares a SQL inference function that can be immediately used in SQL queries. Redshift ML automatically handles all the steps needed to train and deploy a model.

After training with Autopilot, the model runs on the Redshift cluster itself, so there are no additional charges for hosting an endpoint (Autopilot training charges still apply).

The code below creates a model using data from the entire table, with `y` as the target column. The function name is also specified so that it can be referenced later.

For more information on the other parameters or on configuring RedShift ML, please take a look at the documentation https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_MODEL.html

In [None]:
statement = f"""
CREATE MODEL {model_name}
FROM {schema_redshift}.{table_name_redshift}
TARGET y
FUNCTION {function_name}
IAM_ROLE '{iam_role}'
SETTINGS (
  S3_BUCKET '{bucket}'
);
"""

# Other parameters you can set
# [ MODEL_TYPE { XGBOOST | MLP } ]              
# [ PROBLEM_TYPE ( REGRESSION | BINARY_CLASSIFICATION | MULTICLASS_CLASSIFICATION ) ]
# [ OBJECTIVE ( 'MSE' | 'Accuracy' | 'F1' | 'F1Macro' | 'AUC') ]

# print(statement)

# The code below temporarily changes the isolation level to autocommit (0) in order to create the model using code.
# When using the Redshift query editor, this is not required.
s = session()
s.connection().connection.set_isolation_level(0)
s.execute(statement)
s.commit()
s.connection().connection.set_isolation_level(1)
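
The optional parameters listed in the comments above can be added to the statement in the same way. The sketch below assembles a `CREATE MODEL` statement with an optional `PROBLEM_TYPE`, `OBJECTIVE`, and a `MAX_RUNTIME` setting. `MAX_RUNTIME` (a training time limit in seconds) is not used in this notebook and is assumed here from the `CREATE MODEL` documentation; the helper name `build_create_model` and all argument values are hypothetical.

```python
def build_create_model(model_name, table, target, function_name, iam_role,
                       bucket, problem_type=None, objective=None, max_runtime=None):
    """Assemble a CREATE MODEL statement with optional clauses (illustrative helper)."""
    lines = [
        f"CREATE MODEL {model_name}",
        f"FROM {table}",
        f"TARGET {target}",
        f"FUNCTION {function_name}",
        f"IAM_ROLE '{iam_role}'",
    ]
    if problem_type:
        lines.append(f"PROBLEM_TYPE {problem_type}")
    if objective:
        lines.append(f"OBJECTIVE '{objective}'")

    settings = [f"S3_BUCKET '{bucket}'"]
    if max_runtime is not None:
        # Assumed setting: upper bound on training time, in seconds.
        settings.append(f"MAX_RUNTIME {int(max_runtime)}")
    lines.append("SETTINGS (\n  " + ",\n  ".join(settings) + "\n);")
    return "\n".join(lines)


# Placeholder values for illustration only.
sql = build_create_model(
    'example_model', 'example_schema.example_table', 'y', 'predict_example',
    'arn:aws:iam::123456789012:role/example-redshift-ml-role', 'example-bucket',
    problem_type='BINARY_CLASSIFICATION', objective='F1', max_runtime=3600)
print(sql)
```

Stating the problem type and objective explicitly narrows what Autopilot has to search over, which can be useful when the target is already known to be a yes/no label as it is here.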


### Check the status of the SageMaker Autopilot training
This takes approximately 92 minutes. While it runs, you can also look at the 'processing jobs' and 'training jobs' in SageMaker.

In [None]:
statement = f"""
show model {model_name}
"""

# print(statement)
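df = pd.read_sql_query(statement, engine)
# df.head(50)
# Row 4 of the SHOW MODEL output holds the 'Model State' value
print(df.values[4][1])

# Poll every 10 seconds until the model is ready; training can take over an hour
while df.values[4][1] != 'READY':
    time.sleep(10)
    df = pd.read_sql_query(statement, engine)
    print(df.values[4][1])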
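
The loop above keeps polling until the state becomes `READY`, so it would never exit if training failed. A minimal sketch of a more defensive variant adds a timeout and stops on a failure state. The helper `wait_for_state` is hypothetical, and the `FAILED` state name is an assumption about what `SHOW MODEL` reports when training does not succeed.

```python
import time


def wait_for_state(get_state, ready='READY', failed=('FAILED',),
                   poll_seconds=10, timeout_seconds=3 * 60 * 60, sleep=time.sleep):
    """Poll get_state() until it returns `ready`; raise on failure or timeout."""
    waited = 0
    while True:
        state = get_state()
        print(state)
        if state == ready:
            return state
        if state in failed:
            raise RuntimeError(f"Model entered state {state}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"Model not {ready} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Demonstration with a fake state source and no real sleeping.
fake_states = iter(['TRAINING', 'TRAINING', 'READY'])
result = wait_for_state(lambda: next(fake_states), sleep=lambda seconds: None)
```

In this notebook the state source would be something like `lambda: pd.read_sql_query(statement, engine).values[4][1]`.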

### Check the details of the model
Double check to ensure the `Model State` is `READY`.

In [6]:
statement = f"""
show model {model_name}
"""

# print(statement)
df = pd.read_sql_query(statement, engine)
df.head()

|   | Key | Value |
|---|---|---|
| 0 | Model Name | dm01 |
| 1 | Schema Name | public |
| 2 | Owner | bankdm |
| 3 | Creation Time | Sat, 02.10.2021 03:35:41 |
| 4 | Model State | READY |
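
Reading the state by position (`df.values[4][1]`) relies on `Model State` always being the fifth row. A small sketch of a lookup by key instead is shown here; the helper `show_model_to_dict` is hypothetical, and the example rows are copied from the output above.

```python
def show_model_to_dict(rows):
    """Turn (key, value) rows from SHOW MODEL into a dictionary."""
    return {str(key).strip(): value for key, value in rows}


# Rows copied from the SHOW MODEL output above.
example_rows = [
    ['Model Name', 'dm01'],
    ['Schema Name', 'public'],
    ['Owner', 'bankdm'],
    ['Creation Time', 'Sat, 02.10.2021 03:35:41'],
    ['Model State', 'READY'],
]

model_info = show_model_to_dict(example_rows)
print(model_info['Model State'])
```

With the DataFrame from the cell above, the rows could be supplied as `df.values.tolist()`, assuming the result has exactly the two columns shown.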


### Use the RedShift ML function with a SQL query.
A SQL statement can be used to call the Redshift ML function that was created earlier. The statement selects all the columns in the table except the target column `y` and passes them to the Redshift ML function for prediction. Note that no preprocessing of the data is required here. The `y` column is also selected so that the predicted result (`predict_dm01`) can be compared with the actual result (`y`), as shown in the output below.



In [7]:
statement = f"""
SELECT {function_name}(
                   age, job, marital, education, defaulted, housing,
                   loan, contact, month, day_of_week, duration, campaign,
                   pdays, previous, poutcome, emp_var_rate, cons_price_idx,
                   cons_conf_idx, euribor3m, nr_employed), 
                   y
          FROM {schema_redshift}.{table_name_redshift}
"""

# print(statement)
df = pd.read_sql_query(statement, engine)
# Check the accuracy for the first 10 predictions
df.head(10)

|   | predict_dm01 | y |
|---|---|---|
| 0 | no | no |
| 1 | no | no |
| 2 | no | no |
| 3 | no | no |
| 4 | no | no |
| 5 | no | no |
| 6 | no | no |
| 7 | no | no |
| 8 | no | no |
| 9 | no | no |


#### Check the overall accuracy
The SQL statement below outputs a confusion-matrix-like table showing how many predictions were correct or incorrect for each combination of predicted and actual value.

In [8]:
statement = f"""
SELECT {function_name}, y, COUNT(*)
  FROM (SELECT {function_name}(
                   age, job, marital, education, defaulted, housing,
                   loan, contact, month, day_of_week, duration, campaign,
                   pdays, previous, poutcome, emp_var_rate, cons_price_idx,
                   cons_conf_idx, euribor3m, nr_employed), y
          FROM {schema_redshift}.{table_name_redshift})
 GROUP BY {function_name}, y;
"""

# print(statement)
df = pd.read_sql_query(statement, engine)
df.head()

|   | predict_dm01 | y | count |
|---|---|---|---|
| 0 | yes | yes | 4289 |
| 1 | no | no | 33276 |
| 2 | yes | no | 3272 |
| 3 | no | yes | 351 |
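
The counts above can be condensed into summary metrics. The following sketch computes accuracy, precision, and recall for the `yes` class from the four counts in the output; the helper `classification_metrics` is hypothetical, and `yes` is treated as the positive class by assumption.

```python
def classification_metrics(counts, positive='yes', negative='no'):
    """Compute accuracy, precision and recall from (predicted, actual) counts."""
    tp = counts.get((positive, positive), 0)  # predicted yes, actually yes
    tn = counts.get((negative, negative), 0)  # predicted no, actually no
    fp = counts.get((positive, negative), 0)  # predicted yes, actually no
    fn = counts.get((negative, positive), 0)  # predicted no, actually yes
    total = tp + tn + fp + fn

    def ratio(numerator, denominator):
        # Avoid dividing by zero when a class never occurs.
        return numerator / denominator if denominator else float('nan')

    return {
        'accuracy': ratio(tp + tn, total),
        'precision': ratio(tp, tp + fp),
        'recall': ratio(tp, tp + fn),
    }


# Keys are (predicted, actual); values are the counts from the output above.
counts = {
    ('yes', 'yes'): 4289,
    ('no', 'no'): 33276,
    ('yes', 'no'): 3272,
    ('no', 'yes'): 351,
}

metrics = classification_metrics(counts)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For these counts the result is roughly 0.912 accuracy, 0.567 precision, and 0.924 recall for the `yes` class. Because the model was trained on the same table it is scored against here, these figures likely overstate how it would perform on unseen data.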


---

## Next steps

This is the end of the demo. For the next steps, please proceed to notebook06 to clean up the resources created.