## Redshift Airflow Setup

This code:
1. Loads AWS credentials
2. Creates Redshift serverless instance and retrieves connection details
3. Configures Airflow to have connections for AWS + for this Redshift instance

This is in support of `redshift_dag1.py` which then uses this setup to run Airflow jobs with Redshift.

---

### Module Import

---

In [1]:
import boto3
import configparser

---

### Loading Config Files + Credentials

---

In [2]:
#AWS Credentials
aws_path = "/home/rambino/.aws/credentials"
aws_cred = configparser.ConfigParser()
aws_cred.read(aws_path)

#Redshift Credentials
redshift_path = "/home/rambino/dev/DataEngineering_Udacity/04_AWS_DataWarehousing/redshift_credentials.cfg"
redshift_cred = configparser.ConfigParser()
redshift_cred.read(redshift_path)

# #ETL Config
# cfg_path = "/home/rambino/dev/DataEngineering_Udacity/Projects/DataWarehouseWithRedshift/dwh.cfg"
# cfg = configparser.ConfigParser()
# cfg.read(cfg_path)

['/home/rambino/dev/DataEngineering_Udacity/04_AWS_DataWarehousing/redshift_credentials.cfg']

---

#### Creating IAM role for Redshift

---

In [15]:
iam = boto3.client('iam',
    region_name             = "us-west-2",
    aws_access_key_id       = aws_cred['kevin_aws_account']['aws_access_key_id'],
    aws_secret_access_key   = aws_cred['kevin_aws_account']['aws_secret_access_key']
)

In [16]:
#Create IAM role:

#This policy is something about allowing Redshift to impersonate a user, but I don't fully understand it yet.
#Look more into what "sts:AssumeRole" really means.

import json

dwhRole = iam.create_role(
    Path = "/",
    RoleName =  "RedShift_Impersonation",
    Description = "Allows redshift to access S3",
    AssumeRolePolicyDocument=json.dumps(
        {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Action": 'sts:AssumeRole',
                    "Principal":{"Service": "redshift.amazonaws.com"}
                }
            ]
        }
    )
)

dwhRole

EntityAlreadyExistsException: An error occurred (EntityAlreadyExists) when calling the CreateRole operation: Role with name RedShift_Impersonation already exists.

In [17]:
role = iam.get_role(RoleName = "Redshift_Impersonation")
role_arn = role['Role']['Arn']
role_arn


'arn:aws:iam::544495716151:role/RedShift_Impersonation'

In [18]:
#Attaching IAM policy to the role (which actually gives permissions):

attach_response = iam.attach_role_policy(
    RoleName = "RedShift_Impersonation",
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3FullAccess"
)

attach_response

{'ResponseMetadata': {'RequestId': '7b1ea781-a7bc-40d2-a77d-e4698bb64e38',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '7b1ea781-a7bc-40d2-a77d-e4698bb64e38',
   'content-type': 'text/xml',
   'content-length': '212',
   'date': 'Mon, 11 Sep 2023 17:09:01 GMT'},
  'RetryAttempts': 0}}

---

#### Apply VPC Security Group rules to Redshift

---

In [3]:
#Defining PORT for Redshift + VPC security group
redshift_port = 5439

In [4]:
ec2 = boto3.client('ec2',
    region_name             = "us-west-2",
    aws_access_key_id       = aws_cred['kevin_aws_account']['aws_access_key_id'],
    aws_secret_access_key   = aws_cred['kevin_aws_account']['aws_secret_access_key']
)

In [20]:
response = ec2.create_security_group(
    Description = "Security Group for allowing all access to Redshift cluster",
    GroupName = "Redshift_secGroup"
)
response

ClientError: An error occurred (InvalidGroup.Duplicate) when calling the CreateSecurityGroup operation: The security group 'Redshift_secGroup' already exists for VPC 'vpc-0714907a778e89500'

In [5]:
sec_groups = ec2.describe_security_groups(
    GroupNames = [
        'Redshift_secGroup'
    ]
)

sec_groups
redshift_sg_id = sec_groups['SecurityGroups'][0]['GroupId']

In [6]:
vpc = ec2.authorize_security_group_ingress(
    CidrIp = '0.0.0.0/0', #Allowing permission to access from any IP
    FromPort = redshift_port, #Default port for Redshift
    ToPort = redshift_port,
    IpProtocol = 'TCP',
    GroupId = redshift_sg_id
)

ClientError: An error occurred (InvalidPermission.Duplicate) when calling the AuthorizeSecurityGroupIngress operation: the specified rule "peer: 0.0.0.0/0, TCP, from port: 5439, to port: 5439, ALLOW" already exists

---

#### Creating Redshift cluster

---

In [7]:
redshiftServerless = boto3.client('redshift-serverless',
    region_name             = "us-west-2",
    aws_access_key_id       = aws_cred['kevin_aws_account']['aws_access_key_id'],
    aws_secret_access_key   = aws_cred['kevin_aws_account']['aws_secret_access_key']
)

In [8]:
nameSpace = "udacity-course-namespace"
dbName = "default-db"

In [32]:
# 1. Create NameSpace (collection of Database objects and users)
#   1a. Database name
#   1b. Give IAM role to NameSpace
# 2. Create Workgroup
#   2a.Set capacity (used to process Data Warehouse loads - maybe a cap?)
#   2b. Set VPC
#   2c. Set VPC security group
#   2d. Specify subnets to be used within VPC

redshiftServerless.create_namespace(
    adminUserPassword=redshift_cred['redshift_credentials']['un'],
    adminUsername=redshift_cred['redshift_credentials']['pw'],
    namespaceName = nameSpace,
    dbName = dbName,
    iamRoles = [role_arn],
)



{'namespace': {'adminUsername': 'VictorCreme3',
  'creationDate': datetime.datetime(2023, 9, 11, 17, 27, 5, 519000, tzinfo=tzutc()),
  'dbName': 'default-db',
  'iamRoles': ['arn:aws:iam::544495716151:role/RedShift_Impersonation'],
  'kmsKeyId': 'AWS_OWNED_KMS_KEY',
  'logExports': [],
  'namespaceArn': 'arn:aws:redshift-serverless:us-west-2:544495716151:namespace/b69f3367-7c37-4ec3-bb7b-b94d81ca2a95',
  'namespaceId': 'b69f3367-7c37-4ec3-bb7b-b94d81ca2a95',
  'namespaceName': 'udacity-course-namespace',
  'status': 'AVAILABLE'},
 'ResponseMetadata': {'RequestId': 'b5d79e5e-6153-48da-91b2-ab348083274d',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'b5d79e5e-6153-48da-91b2-ab348083274d',
   'date': 'Mon, 11 Sep 2023 17:27:05 GMT',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '458',
   'connection': 'keep-alive'},
  'RetryAttempts': 0}}

In [33]:
workgroup_response = redshiftServerless.create_workgroup(
    workgroupName = "udacity-course",
    namespaceName = nameSpace,
    baseCapacity = 128,
    securityGroupIds = [redshift_sg_id],
    enhancedVpcRouting=True,
    publiclyAccessible=True
)
workgroup_response

{'workgroup': {'baseCapacity': 128,
  'configParameters': [{'parameterKey': 'auto_mv', 'parameterValue': 'true'},
   {'parameterKey': 'datestyle', 'parameterValue': 'ISO, MDY'},
   {'parameterKey': 'enable_case_sensitive_identifier',
    'parameterValue': 'false'},
   {'parameterKey': 'enable_user_activity_logging', 'parameterValue': 'true'},
   {'parameterKey': 'query_group', 'parameterValue': 'default'},
   {'parameterKey': 'search_path', 'parameterValue': '$user, public'},
   {'parameterKey': 'max_query_execution_time', 'parameterValue': '14400'}],
  'creationDate': datetime.datetime(2023, 9, 11, 17, 27, 9, 576000, tzinfo=tzutc()),
  'endpoint': {'address': 'udacity-course.544495716151.us-west-2.redshift-serverless.amazonaws.com',
   'port': 5439},
  'enhancedVpcRouting': True,
  'namespaceName': 'udacity-course-namespace',
  'publiclyAccessible': True,
  'securityGroupIds': ['sg-0112dfb8ffa5fbfa2'],
  'status': 'CREATING',
  'subnetIds': ['subnet-0c9dc89be92992e01',
   'subnet-080c

In [9]:
from time import sleep

res = redshiftServerless.list_workgroups()
while res['workgroups'][0]['status'] == "CREATING":
    print("Creating cluster...")
    sleep(10)
    res = redshiftServerless.list_workgroups()

res['workgroups'][0]

{'baseCapacity': 128,
 'configParameters': [{'parameterKey': 'auto_mv', 'parameterValue': 'true'},
  {'parameterKey': 'datestyle', 'parameterValue': 'ISO, MDY'},
  {'parameterKey': 'enable_case_sensitive_identifier',
   'parameterValue': 'false'},
  {'parameterKey': 'enable_user_activity_logging', 'parameterValue': 'true'},
  {'parameterKey': 'query_group', 'parameterValue': 'default'},
  {'parameterKey': 'search_path', 'parameterValue': '$user, public'},
  {'parameterKey': 'max_query_execution_time', 'parameterValue': '14400'}],
 'creationDate': datetime.datetime(2023, 9, 11, 17, 27, 9, 576000, tzinfo=tzutc()),
 'endpoint': {'address': 'udacity-course.544495716151.us-west-2.redshift-serverless.amazonaws.com',
  'port': 5439,
  'vpcEndpoints': [{'networkInterfaces': [{'availabilityZone': 'us-west-2c',
      'networkInterfaceId': 'eni-0eeccf3c6873a20f4',
      'privateIpAddress': '172.31.4.51',
      'subnetId': 'subnet-080c291f07f1150fb'}],
    'vpcEndpointId': 'vpce-0bcfb5acf0a71a82a'

In [10]:
#here is the resource address that we can use from Airflow:
redshift_workgroup = res['workgroups'][0]['endpoint']['address']
redshift_workgroup

'udacity-course.544495716151.us-west-2.redshift-serverless.amazonaws.com'

In [None]:
# OLD VERSION WHICH CREATES NON-SERVERLESS REDSHIFT:
# #Documentation: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/redshift.html#Redshift.Client.create_cluster
# redshift_response = redshiftServerless.create_cluster(
#     ClusterType = "multi-node",
#     NodeType = 'dc2.large',
#     NumberOfNodes = 2,
#     DBName = "my_redshift_db",
#     ClusterIdentifier = 'redshift-cluster-2',
#     MasterUsername = redshift_cred['redshift_credentials']['un'],
#     MasterUserPassword = redshift_cred['redshift_credentials']['pw'],
#     IamRoles = [role_arn],
#     PubliclyAccessible = True,
#     VpcSecurityGroupIds = [
#         redshift_sg_id
#     ],
#     Port = redshift_port
# )

# '''
# WARNING! After running this code, you WILL create a Redshift cluster. Be sure to delete it to not incur costs!!
# '''

# redshift_response

In [None]:
# from time import sleep

# #Cluster takes time to create. This loop iterates until redshift is finished and returns details:
# for i in range(20):
#     clusters = redshift.describe_clusters()
#     if(clusters['Clusters'] == []):
#         print("cluster still forming...")
#         sleep(5)
#         continue
#     else:
#         try:
#             redshift_host = clusters['Clusters'][0]['Endpoint']['Address']
#             redshift_port = str(clusters['Clusters'][0]['Endpoint']['Port'])
#             redshift_name = clusters['Clusters'][0]['DBName']
#             cluster_id = clusters['Clusters'][0]['ClusterIdentifier']

#             redshift_user = redshift_cred['redshift_credentials']['UN']
#             redshift_pw = redshift_cred['redshift_credentials']['PW']
#             print("---Variables Loaded Successfully---")
#             print(clusters)
#             break
#         except:
#             print("Error in outputting cluster metrics, trying again...")
#             sleep(10)

    

#     #if(clusters['Clusters'] == []):
#     #   print("No clusters")

In [None]:
# response = redshift.delete_cluster(
#     ClusterIdentifier = cluster_id,
#     SkipFinalClusterSnapshot=True
# )
# response

---

### Airflow Connnection: AWS credentials

---

In [11]:
import getpass
import os

In [12]:
#Note: Double curly braces ('{{') necessary when using string formatting
#Requires Airflow to be running in docker container on local machine

command = '''sudo -S docker-compose run airflow-worker connections add 'aws_credentials' \
    --conn-json '{{ \
        "conn_type": "aws", \
        "login":"{}", \
        "password":"{}", \
        "extra": {{ \
            "region_name": "us-west-2" \
        }} \
    }}'
'''.format(
    aws_cred['airflow_access']['aws_access_key_id']
    ,aws_cred['airflow_access']['aws_secret_access_key']
)

os.system('echo {} | {}'.format(getpass.getpass(),command))

[sudo] password for rambino: Sorry, try again.
[sudo] password for rambino: 
sudo: no password was provided
sudo: 1 incorrect password attempt


256

---

### Airflow Connnection: Redshift Credentials

---

Necessary:
1. ConnID = "redshift_connection" (or sth similar)
2. ConnType = "Postgres"
3. Host = redshift_host
4. Port = redshift_port
5. Schema = redshift_name
6. Login = redshift_user
7. Password = redshift_pw

In [14]:
#Note: Double curly braces ('{{') necessary when using string formatting

#Dummy connection with no real data:
command = '''sudo -S docker-compose run airflow-worker connections add 'redshift_connection' \
    --conn-json '{{ \
        "conn_type": "redshift", \
        "login": "{0}", \
        "password": "{1}", \
        "host": "{2}", \
        "port": {3}, \
        "schema": "{4}" \
    }}'
'''.format(
    redshift_cred['redshift_credentials']['UN'],
    redshift_cred['redshift_credentials']['PW'],
    redshift_workgroup,
    redshift_port,
    dbName
)

os.system('echo {} | {}'.format(getpass.getpass(),command))

[sudo] password for rambino: Starting docker_airflow-airflow-init-1 ... 
Starting docker_airflow-airflow-init-1 ... done
Creating docker_airflow_airflow-worker_run ... 
Creating docker_airflow_airflow-worker_run ... done



Successfully added `conn_id`=redshift_connection : redshift://adminUser1:******@udacity-course.544495716151.us-west-2.redshift-serverless.amazonaws.com:5439/default-db


0