# Create the foundation

**Note:** Please set kernel to `Python 3 (Data Science)`

Before proceeding, please read the **README.md** and complete the prerequisites first.

---


## Overview of AWS services used in this notebook

Amazon SageMaker is a fully managed machine learning service. With SageMaker, data scientists and developers can quickly and easily build and train machine learning models, and then directly deploy them into a production-ready hosted environment.

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. With Redshift, you can query and combine exabytes of structured and semi-structured data across your data warehouse, operational database, and data lake using standard SQL. Redshift lets you easily save the results of your queries back to your S3 data lake using open formats, like Apache Parquet, so that you can do additional analytics from other analytics services like Amazon EMR, Amazon Athena, and Amazon SageMaker. Many customers use Redshift as their data warehouse, which makes it a common data source for machine learning.

AWS Secrets Manager helps you protect secrets needed to access your applications, services, and IT resources. The service enables you to easily rotate, manage, and retrieve database credentials, API keys, and other secrets throughout their lifecycle. Users and applications retrieve secrets with a call to Secrets Manager APIs, eliminating the need to hardcode sensitive information in plain text.

AWS Lambda is a serverless compute service that lets you run code without provisioning or managing servers, creating workload-aware cluster scaling logic, maintaining event integrations, or managing runtimes. With Lambda, you can run code for virtually any type of application or backend service - all with zero administration. Just upload your code as a ZIP file or container image, and Lambda automatically and precisely allocates compute execution power and runs your code based on the incoming request or event, for any scale of traffic.

AWS Identity and Access Management (IAM) enables you to manage access to AWS services and resources securely. Using IAM, you can create and manage AWS users and groups, and use permissions to allow and deny their access to AWS resources.

---

## Introduction

This series of notebooks demonstrates an MLOps workflow in which Redshift is the data source. It also shows Redshift ML, which lets you train and use a model directly from Redshift. More information about the setup can be found in the [README.md](README.md) file.

### High-level architecture diagram
The diagram below shows the architecture as built up to this point in the series; it is not the final architecture.

![diagram](img/diagram1.png)

In this notebook, you create the foundation components: IAM roles and policies, the Redshift cluster, a secret in Secrets Manager, and a Lambda function.

---

### Variables
Variable names for the secret, Redshift, Athena, and Glue.

Most of the information below is stored in the secret, and you will retrieve it in subsequent notebooks.

In [None]:
secret_name='bankdm_redshift_login' 

# Helper to generate a random password.
# Uses the `secrets` module because `random` is not designed for generating credentials.
import secrets
import string
def random_char(y):
    return ''.join(secrets.choice(string.ascii_letters) for _ in range(y))
    
# The variables below are only required for notebook 01
# The RedShift, Athena and Glue information are stored in Secrets Manager
subnet_name = 'Private subnet' # Change this if the private subnet name is different

database_name_redshift = 'bankdm'
database_name_glue = 'bankdm'

schema_redshift = 'dm'
schema_athena = 'athena' # must be 'athena'

table_name_glue = 'bankdm_glue'
table_name_redshift = 'data'


# Redshift configuration parameters
redshift_cluster_identifier = 'bankdm'
database_name = 'bankdm'
cluster_type = 'single-node' # or multi-node

master_user_name = 'bankdm'
master_user_pw = random_char(16) + '1' # the password requires a number

# Note that only some Instance Types support Redshift Query Editor 
# (https://docs.aws.amazon.com/redshift/latest/mgmt/query-editor.html)
node_type = 'dc2.large'
# number_nodes = '1' # for multi-node. Also uncomment this line below: NumberOfNodes=int(number_nodes),

# Set the security group ID if not using the default one
security_group_id = None
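
Redshift requires the master user password to contain at least one uppercase letter, one lowercase letter, and one number. A helper that draws only from letters and appends a fixed digit will almost always satisfy this, but it does not guarantee both letter cases. The following is a minimal sketch of a generator that checks all three rules explicitly; `generate_master_password` is a hypothetical name and is not used elsewhere in this notebook.

```python
import secrets
import string

def generate_master_password(length=17):
    """Return a random password with at least one lowercase letter,
    one uppercase letter, and one digit."""
    if length < 8:
        raise ValueError("length must be at least 8")
    alphabet = string.ascii_letters + string.digits
    while True:
        candidate = ''.join(secrets.choice(alphabet) for _ in range(length))
        # Retry until every character class is present
        if (any(c.islower() for c in candidate)
                and any(c.isupper() for c in candidate)
                and any(c.isdigit() for c in candidate)):
            return candidate

example_password = generate_master_password()
```

Restricting the alphabet to letters and digits avoids the punctuation characters that Redshift rejects in passwords.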


### Import the necessary libraries and create client sessions


In [None]:
import json
import boto3
from botocore.exceptions import ClientError
from botocore.config import Config
import time
import sagemaker
import zipfile

iam = boto3.client('iam')
sts = boto3.client('sts')
accountID = sts.get_caller_identity()["Account"]  
redshift = boto3.client('redshift')
sm = boto3.client('sagemaker')
ec2 = boto3.client('ec2')
secretsmanager = boto3.client('secretsmanager')

s3 = boto3.client('s3')
lambda_client = boto3.client('lambda')

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

bucket = sagemaker_session.default_bucket()

## IAM Roles and Policy
### Adding permissions to SageMaker Execution role
The CloudFormation template should already have added these permissions, but it does not hurt to check.


In [None]:
role_name = role.split("/")[-1]

print("Role name: {}".format(role_name))

In [None]:
pre_policies = iam.list_attached_role_policies(RoleName=role_name)["AttachedPolicies"]
attached_names = {policy["PolicyName"] for policy in pre_policies}

required_policies = ["IAMFullAccess"]

# Work out what is missing without removing items from a list while iterating over it
missing_policies = [name for name in required_policies if name not in attached_names]

for name in required_policies:
    if name in attached_names:
        print("Attached: {}".format(name))

if missing_policies:
    print(
        "*************** [ERROR] You need to attach the following policies in order to continue with this workshop *****************\n"
    )
    for required_policy in missing_policies:
        print("Not Attached: {}".format(required_policy))
else:
    print("[OK] You are all set to continue with this notebook!")


#### Create a function to add policy to the role

In [None]:
def addPolicy(policy, role_name):
    """Attach the AWS managed policy `policy` to the IAM role `role_name`."""
    try:
        iam.attach_role_policy(PolicyArn="arn:aws:iam::aws:policy/{}".format(policy), RoleName=role_name)
        print("Policy {} has been successfully attached to role: {}".format(policy, role_name))
    except ClientError as e:
        if e.response["Error"]["Code"] == "EntityAlreadyExists":
            print("[OK] Policy is already attached.")
        elif e.response["Error"]["Code"] == "LimitExceeded":
            # The role has reached its managed-policy quota, so the policy was NOT attached
            print("[WARNING] Policy {} was not attached: role {} has reached its policy limit.".format(policy, role_name))
        else:
            print("*************** [ERROR] {} *****************".format(e))


#### Add the following policies to the role.

In [None]:
addPolicy("AmazonRedshiftFullAccess", role_name)
addPolicy("SecretsManagerReadWrite", role_name)
addPolicy("AmazonAthenaFullAccess", role_name)
# The Lambda policy is needed to create the Lambda function below
addPolicy("AWSLambda_FullAccess", role_name)


### Add the following policies to SageMaker ServiceCatalog role

In [None]:
servicerole = 'AmazonSageMakerServiceCatalogProductsUseRole'
addPolicy("AmazonSageMakerPipelinesIntegrations", servicerole)
# The Lambda policy is required to create a Lambda function in the SageMaker Pipeline.
# However, this portion of the code is commented out.
addPolicy("AWSLambda_FullAccess", servicerole)

### Add permissions to BankDM role
#### Create AssumeRolePolicyDocument

In [None]:
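# Use a dedicated name so the SageMaker execution role ARN stored in `role` is not overwritten
service_catalog_role_arn = f"arn:aws:iam::{accountID}:role/service-role/AmazonSageMakerServiceCatalogProductsUseRole"
assume_role_policy_doc = {
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": service_catalog_role_arn,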
        "Service": ["sagemaker.amazonaws.com", "redshift.amazonaws.com"]
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

assume_role_policy_doc
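
The trust policy above names one IAM role and two services as principals that may assume the new role. If the same shape is needed for other roles, it can be produced by a small helper. The following is a minimal sketch; `build_trust_policy` is a hypothetical helper and the account ID `123456789012` is a placeholder, neither of which is part of this notebook.

```python
import json

def build_trust_policy(principal_arns, services):
    """Return a trust policy that lets the given IAM principals and
    AWS services assume a role."""
    principal = {}
    if principal_arns:
        principal["AWS"] = list(principal_arns)
    if services:
        principal["Service"] = list(services)
    if not principal:
        raise ValueError("at least one principal ARN or service is required")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Principal": principal, "Action": "sts:AssumeRole"}
        ],
    }

# Placeholder account ID, used only for illustration
example_doc = build_trust_policy(
    ["arn:aws:iam::123456789012:role/service-role/AmazonSageMakerServiceCatalogProductsUseRole"],
    ["sagemaker.amazonaws.com", "redshift.amazonaws.com"],
)
policy_json = json.dumps(example_doc)  # IAM expects the document as a JSON string
```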

#### Create Role

In [None]:
iam_redshift_role_name = 'BankDM-RedShift'

In [None]:
try:
    iam_role_redshift = iam.create_role(
        RoleName=iam_redshift_role_name,
        AssumeRolePolicyDocument=json.dumps(assume_role_policy_doc),
        Description='BankDM Redshift Role'
    )
except ClientError as e:
    if e.response['Error']['Code'] == 'EntityAlreadyExists':
        print("Role already exists")
    else:
        print("Unexpected error: %s" % e)

#### Get the Role ARN

In [None]:
# Use a separate name so the earlier `role` variable is not overwritten
redshift_role = iam.get_role(RoleName=iam_redshift_role_name)
iam_role_redshift_arn = redshift_role['Role']['Arn']
print(iam_role_redshift_arn)

### Attach AWS built-in policy to role
Note: The CloudFormation template should already have attached the policies below, but to be safe, the script attaches them again.


In [None]:
addPolicy("SecretsManagerReadWrite", iam_redshift_role_name)
addPolicy("AmazonRedshiftFullAccess", iam_redshift_role_name)
addPolicy("AmazonSageMakerFullAccess", iam_redshift_role_name)
addPolicy("AmazonS3FullAccess", iam_redshift_role_name)
addPolicy("AmazonAthenaFullAccess", iam_redshift_role_name)

## RedShift cluster
### Get Security Group ID 

* Make sure Redshift uses the same VPC that this notebook runs in.
* Make sure the VPC has the following two properties enabled:
  * DNS resolution = Enabled
  * DNS hostnames = Enabled
* These settings allow private, internal access to Redshift from this SageMaker notebook using the fully qualified endpoint name.
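
The two DNS properties in the list above can also be checked programmatically. The following is a minimal sketch that assumes an EC2 client is passed in; `vpc_dns_settings` is a hypothetical helper, not part of this notebook. It relies on the EC2 `describe_vpc_attribute` call, which returns one attribute per request.

```python
def vpc_dns_settings(ec2_client, vpc_id):
    """Return (dns_resolution_enabled, dns_hostnames_enabled) for a VPC."""
    # describe_vpc_attribute accepts a single attribute per call
    support = ec2_client.describe_vpc_attribute(
        VpcId=vpc_id, Attribute="enableDnsSupport"
    )
    hostnames = ec2_client.describe_vpc_attribute(
        VpcId=vpc_id, Attribute="enableDnsHostnames"
    )
    return (
        support["EnableDnsSupport"]["Value"],
        hostnames["EnableDnsHostnames"]["Value"],
    )
```

In this notebook it could be called as `vpc_dns_settings(ec2, vpc_id)` once `vpc_id` has been looked up; both values should be `True`.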

In [None]:
# Always look up the VPC: the subnet lookup below needs vpc_id
# even when security_group_id was set manually.
try:
    domain_id = sm.list_domains()['Domains'][0]['DomainId']
    vpc_id = sm.describe_domain(DomainId=domain_id)['VpcId']

    if security_group_id is None:
        # Fall back to the VPC's default security group
        security_groups = ec2.describe_security_groups(
            Filters=[{"Name": "vpc-id", "Values": [vpc_id]}]
        )['SecurityGroups']
        for sg in security_groups:
            if sg['GroupName'] == 'default':
                security_group_id = sg['GroupId']

    print(security_group_id)
except (ClientError, IndexError, KeyError) as e:
    print("Could not determine the VPC or security group: %s" % e)

### Subnet for RedShift

Get the subnet ID for the private subnet. 

In [None]:
sn_all = ec2.describe_subnets(Filters=[{"Name": "vpc-id", "Values": [vpc_id]}])
subnetId = ''
for sn in sn_all['Subnets']:
    # Untagged subnets have no 'Tags' key, so default to an empty list
    for tags in sn.get('Tags', []):
        if tags['Key'] == 'Name' and tags['Value'] == subnet_name:
            subnetId = sn['SubnetId']
subnetId
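
The lookup above leaves `subnetId` empty when no subnet matches, and keeps only the last match when several subnets share the same name. The following is a minimal sketch of a variant that returns every match so both cases can be detected. `find_subnet_ids` and the sample response are hypothetical and only mimic the shape of a `describe_subnets` response.

```python
def find_subnet_ids(describe_subnets_response, name):
    """Return the IDs of all subnets whose Name tag equals `name`."""
    matches = []
    for subnet in describe_subnets_response.get("Subnets", []):
        # Subnets without tags have no 'Tags' key in the response
        tags = {tag["Key"]: tag["Value"] for tag in subnet.get("Tags", [])}
        if tags.get("Name") == name:
            matches.append(subnet["SubnetId"])
    return matches

# Hypothetical response fragment, used only for illustration
sample_response = {
    "Subnets": [
        {"SubnetId": "subnet-aaa", "Tags": [{"Key": "Name", "Value": "Private subnet"}]},
        {"SubnetId": "subnet-bbb", "Tags": [{"Key": "Name", "Value": "Public subnet"}]},
        {"SubnetId": "subnet-ccc"},  # no tags at all
    ]
}
private_ids = find_subnet_ids(sample_response, "Private subnet")
```

An empty result means the subnet name should be corrected before the subnet group is created.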

In [None]:
sn_all['Subnets'][0]['Tags']

### Create Redshift Cluster
Create the RedShift subnet group and after that, create the RedShift cluster.

In [None]:
try:
    response = redshift.create_cluster_subnet_group(
        ClusterSubnetGroupName='bankdm-subnet',
        Description='Subnet group for the BankDM Redshift cluster',
        SubnetIds=[
            subnetId,
        ]
    )
    
except ClientError as e:
    if e.response['Error']['Code'] == 'ClusterSubnetGroupAlreadyExists':
        print("Cluster subnet group already exists. This is ok.")
    else:
        print("Unexpected error: %s" % e)

In [None]:
try:
    response = redshift.create_cluster(
            DBName=database_name,
            ClusterIdentifier=redshift_cluster_identifier,
            ClusterType=cluster_type,
            NodeType=node_type,
    #         NumberOfNodes=int(number_nodes),       # This is required if multi-node is specified
            ClusterSubnetGroupName='bankdm-subnet',
            MasterUsername=master_user_name,
            MasterUserPassword=master_user_pw,
            IamRoles=[iam_role_redshift_arn],
            VpcSecurityGroupIds=[security_group_id],
            Port=5439,
            PubliclyAccessible=False
    )
    
except ClientError as e:
    if e.response['Error']['Code'] == 'ClusterAlreadyExists':
        print("Cluster already exists. This is ok.")
    else:
        print("Unexpected error: %s" % e)

#### Please Wait for Cluster Status to change to `Available`

In [None]:
response = redshift.describe_clusters(ClusterIdentifier=redshift_cluster_identifier)
cluster_status = response['Clusters'][0]['ClusterStatus']
print(cluster_status)

while cluster_status != 'available':
    time.sleep(10)
    response = redshift.describe_clusters(ClusterIdentifier=redshift_cluster_identifier)
    cluster_status = response['Clusters'][0]['ClusterStatus']
    print(cluster_status)
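
The polling loop above runs forever if the cluster never reaches `available`, for example when creation fails. The following is a minimal sketch of a polling helper with a timeout. `wait_for_status` and its parameters are hypothetical names; it takes any zero-argument function that returns the current status. boto3 also provides a built-in `cluster_available` waiter on the Redshift client, which may be preferable in practice.

```python
import time

def wait_for_status(get_status, target="available",
                    timeout_seconds=1800, interval_seconds=10):
    """Poll get_status() until it returns `target`.

    Raises TimeoutError if the target status is not reached in time.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status == target:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"status is still {status!r} after {timeout_seconds} seconds"
            )
        time.sleep(interval_seconds)
```

In this notebook, `get_status` could be a lambda that calls `redshift.describe_clusters(ClusterIdentifier=redshift_cluster_identifier)` and returns `['Clusters'][0]['ClusterStatus']` from the response.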

In [None]:
response = redshift.describe_clusters(ClusterIdentifier=redshift_cluster_identifier)
host = response['Clusters'][0]['Endpoint']['Address']
port = response['Clusters'][0]['Endpoint']['Port']
print(host)

## Create Secret in Secrets Manager

Add RedShift, Athena and Glue information to the secret. 

Note: If the secret already exists and you are creating the Redshift cluster again, the secret will not be updated with the new password. This prevents accidental updates to the secret, so please update the password manually in Secrets Manager.

In [None]:
secretstring = f'"username":"{master_user_name}","password":"{master_user_pw}","engine":"redshift", \
"host":"{host}","port": "{port}","dbClusterIdentifier":"{redshift_cluster_identifier}", "db":"{database_name}", \
"database_name_redshift":"{database_name_redshift}","database_name_glue": "{database_name_glue}", \
"schema_redshift":"{schema_redshift}", "schema_athena":"{schema_athena}", \
"table_name_glue":"{table_name_glue}", "table_name_redshift":"{table_name_redshift}"'

secretstring 
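
Building the JSON by hand, as above, works as long as no value contains a double quote or backslash. The following is a minimal sketch of an alternative that lets `json.dumps` handle the escaping. `build_secret_string` is a hypothetical helper, and the host and password are placeholders rather than real values.

```python
import json

def build_secret_string(values):
    """Serialize connection details to a JSON string with proper escaping."""
    # The secret in this notebook stores every value as a string, including the port
    return json.dumps({key: str(value) for key, value in values.items()})

example_secret = build_secret_string({
    "username": "bankdm",
    "password": 'pa"ss\\word1',  # placeholder containing characters that need escaping
    "engine": "redshift",
    "host": "example-host",      # placeholder, not a real endpoint
    "port": 5439,
})
```

The result already includes the surrounding braces, so it could be passed to `create_secret` as `SecretString` directly.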

In [None]:
try:
    response = secretsmanager.create_secret(
        Name=secret_name,
        Description='BankDM Redshift Login',
        SecretString= '{' + secretstring + '}',
    )
except ClientError as e:
    if e.response['Error']['Code'] == 'ResourceExistsException':
        print("Secret already exists. If you are recreating the RedShift cluster, please update the password manually in Secrets Manager.")
    else:
        print("Unexpected error: %s" % e)
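
Subsequent notebooks read the connection details back from this secret. The following is a minimal sketch of that retrieval, assuming a Secrets Manager client is passed in; `load_secret` is a hypothetical helper and is not necessarily how the later notebooks do it.

```python
import json

def load_secret(secretsmanager_client, secret_id):
    """Fetch a secret and parse its JSON SecretString into a dict."""
    response = secretsmanager_client.get_secret_value(SecretId=secret_id)
    return json.loads(response["SecretString"])
```

With the names used in this notebook, `load_secret(secretsmanager, secret_name)["host"]` would return the cluster endpoint stored above.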

### Create Lambda IAM role and policy

In [None]:
def create_lambda_role(role_name):
    try:
        response = iam.create_role(
            RoleName = role_name,
            AssumeRolePolicyDocument = json.dumps({
                "Version": "2012-10-17",
                "Statement": [
                    {
                        "Effect": "Allow",
                        "Principal": {
                            "Service": "lambda.amazonaws.com"
                        },
                        "Action": "sts:AssumeRole"
                    }
                ]
            }),
            Description='Role for Lambda'
        )

        role_arn = response['Role']['Arn']

        response = iam.attach_role_policy(
            RoleName=role_name,
            PolicyArn='arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole'
        )

        addPolicy("SecretsManagerReadWrite", role_name)
        addPolicy("AmazonRedshiftFullAccess", role_name)
        addPolicy("AmazonSageMakerFullAccess", role_name)
        addPolicy("AmazonS3FullAccess", role_name)
        
        return role_arn

    except iam.exceptions.EntityAlreadyExistsException:
        print(f'Using ARN from existing role: {role_name}')
        response = iam.get_role(RoleName=role_name)
        return response['Role']['Arn']

lambda_role = create_lambda_role("BankDM-Lambda")

# Wait a few seconds for the new IAM role to propagate before creating the Lambda function
time.sleep(10)

In [None]:
# Zip up the Lambda code; the context manager closes the archive even if writing fails
with zipfile.ZipFile('lambda.zip', 'w') as archive:
    archive.write('lambda_redshift_dl.py', 'lambda_redshift_dl.py')

# Upload the file to S3
s3.upload_file('lambda.zip', bucket, 'bankdm/lambda.zip')

# Delete the lambda function if it exists
try:
    response = lambda_client.delete_function(
            FunctionName='bankdm-redshift-dl',
        )
except ClientError as e:
    if e.response['Error']['Code'] == 'ResourceNotFoundException':
        print("Lambda function not found. Creating it...")
    else:
        print("Unexpected error: %s" % e) 

# Create the lambda function
try:
    response = lambda_client.create_function(
                Code={
                    'S3Bucket': bucket,
                    'S3Key': 'bankdm/lambda.zip', 
                },
                FunctionName='bankdm-redshift-dl',
                Handler='lambda_redshift_dl.lambda_handler',
                Publish=True,
                Role=lambda_role,
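                Runtime='python3.8', # Lambda has ended support for python3.8; switch to a currently supported runtime if creation is rejected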
                Timeout=600, # Set to 10 minutes
                MemorySize=512,
            )
except ClientError as e:
    print("Unexpected error: %s" % e) 
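
The cell above writes the deployment package to disk and uploads it to S3. For a small, single-file function the archive can also be built in memory, because `create_function` accepts the zip bytes through `Code={'ZipFile': ...}`. The following is a minimal sketch of the packaging step only; `build_zip_bytes` is a hypothetical helper and the handler source is a stand-in for the real `lambda_redshift_dl.py`.

```python
import io
import zipfile

def build_zip_bytes(files):
    """Return a ZIP archive as bytes from a {archive_name: content} mapping."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for archive_name, content in files.items():
            archive.writestr(archive_name, content)
    return buffer.getvalue()

# Stand-in handler source, used only for illustration
package = build_zip_bytes({
    "lambda_redshift_dl.py": "def lambda_handler(event, context):\n    return 'ok'\n",
})
```

Uploading to S3, as the notebook does, remains the better choice for larger packages.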

---

## Next steps

Now that you have created the foundation layer (IAM roles and policies, the secret in Secrets Manager, the Redshift cluster, and the Lambda function), you can proceed to explore the data (notebook 02, optional) or insert data into Redshift (notebook 03).