## Redshift Airflow Setup

This notebook:
1. Loads AWS credentials
2. Creates a Redshift cluster (plus its IAM role and security group) and retrieves the connection details
3. Registers Airflow connections for AWS and for this Redshift cluster

It supports `redshift_dag1.py`, which uses these connections to run Airflow jobs against Redshift.

---

### Module Import

---

In [1]:
import boto3
import configparser

---

### Loading Config Files + Credentials

---

In [2]:
#AWS Credentials
aws_path = "/home/rambino/.aws/credentials"
aws_cred = configparser.ConfigParser()
aws_cred.read(aws_path)

#Redshift Credentials
redshift_path = "/home/rambino/dev/DataEngineering_Udacity/04_AWS_DataWarehousing/redshift_credentials.cfg"
redshift_cred = configparser.ConfigParser()
redshift_cred.read(redshift_path)

# #ETL Config
# cfg_path = "/home/rambino/dev/DataEngineering_Udacity/Projects/DataWarehouseWithRedshift/dwh.cfg"
# cfg = configparser.ConfigParser()
# cfg.read(cfg_path)

['/home/rambino/dev/DataEngineering_Udacity/04_AWS_DataWarehousing/redshift_credentials.cfg']
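Later cells read the same option as both `['un']` and `['UN']`. A minimal sketch of why both spellings resolve, assuming the default `ConfigParser` behaviour (option names are lowercased when read; section names stay case-sensitive). The file contents below are made-up placeholders, not the real credentials file.

```python
import configparser

# Hypothetical stand-in for redshift_credentials.cfg; values are placeholders.
sample_cfg = """
[redshift_credentials]
UN = example_user
PW = example_password
"""

parser = configparser.ConfigParser()
parser.read_string(sample_cfg)

section = parser["redshift_credentials"]
# Option names are normalised to lowercase, so both spellings resolve.
user_lower = section["un"]
user_upper = section["UN"]

# ConfigParser.read() returns the list of files it parsed successfully.
# A missing file is skipped silently, so an empty list signals a bad path.
files_read = parser.read("/nonexistent/path/credentials.cfg")
```

This is also why the list printed above is worth a glance: if it is empty, the path was wrong and every later lookup will fail with a `KeyError`.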

---

#### Creating IAM role for Redshift

---

In [3]:
iam = boto3.client('iam',
    region_name             = "us-west-2",
    aws_access_key_id       = aws_cred['default']['aws_access_key_id'],
    aws_secret_access_key   = aws_cred['default']['aws_secret_access_key']
)

In [5]:
#Create IAM role:

#The trust policy below (AssumeRolePolicyDocument) states who may assume this role:
#here the Redshift service, so a cluster can act with the role's permissions.
#"sts:AssumeRole" is the action of obtaining temporary credentials for a role.

import json

dwhRole = iam.create_role(
    Path = "/",
    RoleName =  "RedShift_Impersonation",
    Description = "Allows redshift to access S3",
    AssumeRolePolicyDocument=json.dumps(
        {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Action": 'sts:AssumeRole',
                    "Principal":{"Service": "redshift.amazonaws.com"}
                }
            ]
        }
    )
)

dwhRole

EntityAlreadyExistsException: An error occurred (EntityAlreadyExists) when calling the CreateRole operation: Role with name RedShift_Impersonation already exists.
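The error above appears on every re-run because the role already exists. One possible way to make the cell re-runnable is a get-or-create helper, sketched below. The function name `get_or_create_role_arn` is hypothetical; it only assumes an object with the boto3 IAM client's `create_role`, `get_role`, and `exceptions` interface.

```python
import json

# Trust policy from the cell above: lets the Redshift service assume the role.
REDSHIFT_TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "sts:AssumeRole",
            "Principal": {"Service": "redshift.amazonaws.com"},
        }
    ],
}


def get_or_create_role_arn(iam_client, role_name, description=""):
    """Return the role ARN, creating the role only if it does not exist yet.

    iam_client is expected to be a boto3 IAM client (or an object exposing
    the same create_role / get_role / exceptions interface).
    """
    try:
        created = iam_client.create_role(
            Path="/",
            RoleName=role_name,
            Description=description,
            AssumeRolePolicyDocument=json.dumps(REDSHIFT_TRUST_POLICY),
        )
        return created["Role"]["Arn"]
    except iam_client.exceptions.EntityAlreadyExistsException:
        # The role is already there: reuse it instead of failing.
        return iam_client.get_role(RoleName=role_name)["Role"]["Arn"]
```

With the `iam` client from this notebook, the call would look like `get_or_create_role_arn(iam, "RedShift_Impersonation", "Allows redshift to access S3")`.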

In [4]:
role = iam.get_role(RoleName = "RedShift_Impersonation")
role_arn = role['Role']['Arn']
role_arn


'arn:aws:iam::549653882425:role/RedShift_Impersonation'

In [5]:
#Attaching IAM policy to the role (which actually gives permissions):

attach_response = iam.attach_role_policy(
    RoleName = "RedShift_Impersonation",
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess"
)

attach_response

{'ResponseMetadata': {'RequestId': '16b291ef-469f-441f-bed7-bd8e0c83a7a4',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '16b291ef-469f-441f-bed7-bd8e0c83a7a4',
   'content-type': 'text/xml',
   'content-length': '212',
   'date': 'Mon, 17 Oct 2022 17:13:55 GMT'},
  'RetryAttempts': 0}}

---

#### Apply VPC Security Group rules to Redshift

---

In [6]:
#Defining PORT for Redshift + VPC security group
redshift_port = 5439

In [7]:
ec2 = boto3.client('ec2',
    region_name             = "us-west-2",
    aws_access_key_id       = aws_cred['default']['aws_access_key_id'],
    aws_secret_access_key   = aws_cred['default']['aws_secret_access_key']
)

In [10]:
response = ec2.create_security_group(
    Description = "Security Group for allowing all access to Redshift cluster",
    GroupName = "Redshift_secGroup"
)
response

ClientError: An error occurred (InvalidGroup.Duplicate) when calling the CreateSecurityGroup operation: The security group 'Redshift_secGroup' already exists for VPC 'vpc-0d64087a33995cf20'
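The duplicate-group error above is the same re-run problem as with the IAM role. A hedged sketch of a lookup-then-create helper follows; the name `get_or_create_security_group_id` is hypothetical, and it assumes the group lives in the account's default VPC, as in this notebook.

```python
def get_or_create_security_group_id(ec2_client, group_name, description):
    """Return the ID of the named security group, creating it if missing.

    ec2_client is expected to be a boto3 EC2 client (or an object exposing
    the same describe_security_groups / create_security_group interface).
    """
    # Filtering by name returns an empty list when nothing matches,
    # so no exception handling is needed for the "not found" case.
    found = ec2_client.describe_security_groups(
        Filters=[{"Name": "group-name", "Values": [group_name]}]
    )["SecurityGroups"]
    if found:
        return found[0]["GroupId"]

    created = ec2_client.create_security_group(
        GroupName=group_name,
        Description=description,
    )
    return created["GroupId"]
```

This would return `redshift_sg_id` directly and replace both the create cell and the describe cell that follows it.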

In [8]:
sec_groups = ec2.describe_security_groups(
    GroupNames = [
        'Redshift_secGroup'
    ]
)

redshift_sg_id = sec_groups['SecurityGroups'][0]['GroupId']

In [12]:
vpc = ec2.authorize_security_group_ingress(
    CidrIp = '0.0.0.0/0', #Allowing permission to access from any IP
    FromPort = redshift_port, #Default port for Redshift
    ToPort = redshift_port,
    IpProtocol = 'TCP',
    GroupId = redshift_sg_id
)

ClientError: An error occurred (InvalidPermission.Duplicate) when calling the AuthorizeSecurityGroupIngress operation: the specified rule "peer: 0.0.0.0/0, TCP, from port: 5439, to port: 5439, ALLOW" already exists
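The rule already exists, so the call above fails on re-runs. A small sketch of checking for the rule first, working on one entry of the `describe_security_groups` response. The helper name `has_ingress_rule` and the sample data are illustrative assumptions; the dictionary keys follow the shape boto3 returns for `IpPermissions`.

```python
def has_ingress_rule(security_group, port, cidr, protocol="tcp"):
    """Return True if the group already allows `cidr` on `port`.

    security_group is one element of
    describe_security_groups()['SecurityGroups'].
    """
    for permission in security_group.get("IpPermissions", []):
        same_ports = (
            permission.get("FromPort") == port
            and permission.get("ToPort") == port
        )
        # Compare protocols case-insensitively ('TCP' vs 'tcp').
        same_protocol = (
            str(permission.get("IpProtocol", "")).lower() == protocol.lower()
        )
        cidrs = {r.get("CidrIp") for r in permission.get("IpRanges", [])}
        if same_ports and same_protocol and cidr in cidrs:
            return True
    return False


# Illustrative response fragment (placeholder group ID).
sample_group = {
    "GroupId": "sg-00000000000000000",
    "IpPermissions": [
        {
            "FromPort": 5439,
            "ToPort": 5439,
            "IpProtocol": "tcp",
            "IpRanges": [{"CidrIp": "0.0.0.0/0"}],
        }
    ],
}

already_open = has_ingress_rule(sample_group, 5439, "0.0.0.0/0")
```

In this notebook the check could guard the `authorize_security_group_ingress` call, using `sec_groups['SecurityGroups'][0]` and the integer `redshift_port`.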

---

#### Creating Redshift cluster

---

In [9]:
redshift = boto3.client('redshift',
    region_name             = "us-west-2",
    aws_access_key_id       = aws_cred['default']['aws_access_key_id'],
    aws_secret_access_key   = aws_cred['default']['aws_secret_access_key']
)

In [11]:
#Documentation: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/redshift.html#Redshift.Client.create_cluster
redshift_response = redshift.create_cluster(
    ClusterType = "multi-node",
    NodeType = 'dc2.large',
    NumberOfNodes = 2,
    DBName = "my_redshift_db",
    ClusterIdentifier = 'redshift-cluster-2',
    MasterUsername = redshift_cred['redshift_credentials']['un'],
    MasterUserPassword = redshift_cred['redshift_credentials']['pw'],
    IamRoles = [role_arn],
    PubliclyAccessible = True,
    VpcSecurityGroupIds = [
        redshift_sg_id
    ],
    Port = redshift_port
)

#WARNING: running this cell creates a billable Redshift cluster.
#Be sure to delete it afterwards so it does not incur costs!

redshift_response

{'Cluster': {'ClusterIdentifier': 'redshift-cluster-2',
  'NodeType': 'dc2.large',
  'ClusterStatus': 'creating',
  'ClusterAvailabilityStatus': 'Modifying',
  'MasterUsername': 'dev',
  'DBName': 'my_redshift_db',
  'AutomatedSnapshotRetentionPeriod': 1,
  'ManualSnapshotRetentionPeriod': -1,
  'ClusterSecurityGroups': [],
  'VpcSecurityGroups': [{'VpcSecurityGroupId': 'sg-0e29a3f1bc12cd56e',
    'Status': 'active'}],
  'ClusterParameterGroups': [{'ParameterGroupName': 'default.redshift-1.0',
    'ParameterApplyStatus': 'in-sync'}],
  'ClusterSubnetGroupName': 'default',
  'VpcId': 'vpc-0d64087a33995cf20',
  'PreferredMaintenanceWindow': 'wed:10:00-wed:10:30',
  'PendingModifiedValues': {'MasterUserPassword': '****'},
  'ClusterVersion': '1.0',
  'AllowVersionUpgrade': True,
  'NumberOfNodes': 2,
  'PubliclyAccessible': True,
  'Encrypted': False,
  'Tags': [],
  'EnhancedVpcRouting': False,
  'IamRoles': [{'IamRoleArn': 'arn:aws:iam::549653882425:role/RedShift_Impersonation',
    'Ap

In [12]:
from time import sleep

#Cluster takes time to create. This loop polls until Redshift exposes an endpoint, then stores the connection details:
for i in range(20):
    clusters = redshift.describe_clusters()
    if not clusters['Clusters']:
        print("cluster still forming...")
        sleep(5)
        continue
    try:
        cluster = clusters['Clusters'][0]
        redshift_host = cluster['Endpoint']['Address']
        redshift_port = str(cluster['Endpoint']['Port'])
        redshift_name = cluster['DBName']
        cluster_id = cluster['ClusterIdentifier']

        redshift_user = redshift_cred['redshift_credentials']['un']
        redshift_pw = redshift_cred['redshift_credentials']['pw']
        print("---Variables Loaded Successfully---")
        print(clusters)
        break
    except KeyError:
        #Typically raised because 'Endpoint' is not in the response yet
        print("Error in outputting cluster metrics, trying again...")
        sleep(10)

Error in outputting cluster metrics, trying again...
Error in outputting cluster metrics, trying again...
Error in outputting cluster metrics, trying again...
Error in outputting cluster metrics, trying again...
Error in outputting cluster metrics, trying again...
Error in outputting cluster metrics, trying again...
Error in outputting cluster metrics, trying again...
Error in outputting cluster metrics, trying again...
Error in outputting cluster metrics, trying again...
Error in outputting cluster metrics, trying again...
Error in outputting cluster metrics, trying again...
---Variables Loaded Successfully---
{'Clusters': [{'ClusterIdentifier': 'redshift-cluster-2', 'NodeType': 'dc2.large', 'ClusterStatus': 'available', 'ClusterAvailabilityStatus': 'Available', 'MasterUsername': 'dev', 'DBName': 'my_redshift_db', 'Endpoint': {'Address': 'redshift-cluster-2.ci137bsnqj5n.us-west-2.redshift.amazonaws.com', 'Port': 5439}, 'ClusterCreateTime': datetime.datetime(2022, 10, 17, 17, 18, 36, 6
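The repeated error messages above suggest the endpoint only shows up once the cluster is ready. A sketch of a polling helper that checks `ClusterStatus` explicitly instead of relying on an exception; the name `wait_for_endpoint` and its parameters are hypothetical, and the `sleep` argument exists only so the wait can be swapped out.

```python
import time


def wait_for_endpoint(redshift_client, cluster_id, attempts=40, delay=15,
                      sleep=time.sleep):
    """Poll describe_clusters until the cluster is available.

    Returns a dict with host, port, db_name and cluster_id, or raises
    TimeoutError if the cluster is not ready after `attempts` polls.
    """
    for _ in range(attempts):
        cluster = redshift_client.describe_clusters(
            ClusterIdentifier=cluster_id
        )["Clusters"][0]
        # Treat a missing or empty 'Endpoint' as "not ready yet".
        endpoint = cluster.get("Endpoint") or {}
        if cluster.get("ClusterStatus") == "available" and "Address" in endpoint:
            return {
                "host": endpoint["Address"],
                "port": endpoint["Port"],
                "db_name": cluster["DBName"],
                "cluster_id": cluster["ClusterIdentifier"],
            }
        sleep(delay)
    raise TimeoutError(
        f"Cluster {cluster_id} not available after {attempts} attempts"
    )
```

Passing `ClusterIdentifier` also avoids picking up an unrelated cluster that happens to be first in the list. boto3 additionally ships a built-in waiter, `redshift.get_waiter('cluster_available')`, which may be a simpler option if no custom output is needed.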

In [13]:
response = redshift.delete_cluster(
    ClusterIdentifier = cluster_id,
    SkipFinalClusterSnapshot=True
)
response

{'Cluster': {'ClusterIdentifier': 'redshift-cluster-2',
  'NodeType': 'dc2.large',
  'ClusterStatus': 'deleting',
  'ClusterAvailabilityStatus': 'Modifying',
  'MasterUsername': 'dev',
  'DBName': 'my_redshift_db',
  'Endpoint': {'Address': 'redshift-cluster-2.ci137bsnqj5n.us-west-2.redshift.amazonaws.com',
   'Port': 5439},
  'ClusterCreateTime': datetime.datetime(2022, 10, 17, 17, 18, 36, 633000, tzinfo=tzutc()),
  'AutomatedSnapshotRetentionPeriod': 1,
  'ManualSnapshotRetentionPeriod': -1,
  'ClusterSecurityGroups': [],
  'VpcSecurityGroups': [{'VpcSecurityGroupId': 'sg-0e29a3f1bc12cd56e',
    'Status': 'active'}],
  'ClusterParameterGroups': [{'ParameterGroupName': 'default.redshift-1.0',
    'ParameterApplyStatus': 'in-sync'}],
  'ClusterSubnetGroupName': 'default',
  'VpcId': 'vpc-0d64087a33995cf20',
  'AvailabilityZone': 'us-west-2b',
  'PreferredMaintenanceWindow': 'wed:10:00-wed:10:30',
  'PendingModifiedValues': {},
  'ClusterVersion': '1.0',
  'AllowVersionUpgrade': True,
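Since a leftover cluster keeps costing money, it can be worth confirming that the deletion actually finished. A minimal sketch, assuming the boto3 Redshift client raises `ClusterNotFoundFault` once the cluster is gone; the helper name `cluster_exists` is hypothetical.

```python
def cluster_exists(redshift_client, cluster_id):
    """Return True while the cluster is still listed, False once it is gone.

    redshift_client is expected to be a boto3 Redshift client (or an object
    exposing the same describe_clusters / exceptions interface).
    """
    try:
        redshift_client.describe_clusters(ClusterIdentifier=cluster_id)
        return True
    except redshift_client.exceptions.ClusterNotFoundFault:
        # Raised when no cluster with this identifier exists any more.
        return False
```

A cluster in the `deleting` state still counts as existing here, so the check only turns `False` after the deletion has completed.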
 

---

### Airflow Connection: AWS credentials

---

In [33]:
import getpass
import os

In [37]:
#Note: literal braces in the JSON are doubled ('{{', '}}') because str.format() treats single braces as placeholders
#Requires Airflow to be running in a Docker container on the local machine

command = '''sudo -S docker-compose run airflow-worker connections add 'aws_credentials' \
    --conn-json '{{ \
        "conn_type": "aws", \
        "login":"{}", \
        "password":"{}", \
        "extra": {{ \
            "region_name": "us-west-2" \
        }} \
    }}'
'''.format(
    aws_cred['default']['aws_access_key_id']
    ,aws_cred['default']['aws_secret_access_key']
)

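import subprocess

#Feed the sudo password on stdin rather than echoing it through the shell,
#so special characters in the password cannot break or alter the command
subprocess.run(command, shell=True, input=getpass.getpass() + "\n", text=True).returncode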

[sudo] password for rambino: Starting docker_airflow_airflow-init_1 ... 
Starting docker_airflow_airflow-init_1 ... done
Creating docker_airflow_airflow-worker_run ... 
Creating docker_airflow_airflow-worker_run ... done







Successfully added `conn_id`=aws_credentials : aws://AKIAX76PMUY4U6WL7JEA:******@:


0
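The doubled braces in the command above are easy to get wrong. One alternative, sketched here under assumptions, is to build the JSON with `json.dumps` and pass the command as an argument list. The helper `build_connection_command` is hypothetical, the credentials are placeholders, and the sketch assumes `docker-compose` can be run without `sudo`.

```python
import json
import subprocess


def build_connection_command(conn_id, conn_fields):
    """Return the argument list for adding an Airflow connection."""
    return [
        "docker-compose", "run", "airflow-worker",
        "connections", "add", conn_id,
        "--conn-json", json.dumps(conn_fields),
    ]


aws_conn = {
    "conn_type": "aws",
    "login": "EXAMPLE_ACCESS_KEY_ID",          # placeholder, not a real key
    "password": "EXAMPLE_SECRET_ACCESS_KEY",   # placeholder
    "extra": {"region_name": "us-west-2"},
}
command_args = build_connection_command("aws_credentials", aws_conn)

# An argument list is passed to the program as-is (no shell involved),
# so no manual quoting or brace escaping is needed.
# Uncomment to actually run it:
# subprocess.run(command_args, check=True)
```

Because the JSON is generated rather than hand-written, quotes or backslashes inside a secret are escaped correctly.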

---

### Airflow Connection: Redshift Credentials

---

Required connection fields:
1. ConnID = "redshift_connection" (or something similar)
2. ConnType = "Postgres"
3. Host = redshift_host
4. Port = redshift_port
5. Schema = redshift_name
6. Login = redshift_user
7. Password = redshift_pw

In [35]:
#Note: literal braces in the JSON are doubled ('{{', '}}') because str.format() treats single braces as placeholders

#Connection built from the cluster details retrieved above:
command = '''sudo -S docker-compose run airflow-worker connections add 'redshift_connection' \
    --conn-json '{{ \
        "conn_type": "Postgres", \
        "login": "{0}", \
        "password": "{1}", \
        "host": "{2}", \
        "port": {3}, \
        "schema": "{4}" \
    }}'
'''.format(
    redshift_user,
    redshift_pw,
    redshift_host,
    redshift_port,
    redshift_name
)

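import subprocess

#Feed the sudo password on stdin rather than echoing it through the shell
subprocess.run(command, shell=True, input=getpass.getpass() + "\n", text=True).returncode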

[sudo] password for rambino: Starting docker_airflow_airflow-init_1 ... 
Starting docker_airflow_airflow-init_1 ... done
Creating docker_airflow_airflow-worker_run ... 
Creating docker_airflow_airflow-worker_run ... done







Successfully added `conn_id`=redshift_connection : Postgres://dev:******@redshift-cluster-2.ci137bsnqj5n.us-west-2.redshift.amazonaws.com:5439/my_redshift_db


0
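The success message above prints the connection in URI form (`type://login:password@host:port/schema`). Airflow can also read a connection from an environment variable named `AIRFLOW_CONN_<CONN_ID>` in that form. The sketch below builds such a URI; the helper `build_connection_uri` and all values are illustrative placeholders, and percent-encoding the credentials is an assumption about what a password with special characters would need.

```python
from urllib.parse import quote


def build_connection_uri(conn_type, login, password, host, port, schema):
    """Assemble a connection URI, percent-encoding the user-supplied parts."""
    return (
        f"{conn_type}://{quote(login, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{quote(schema, safe='')}"
    )


# Placeholder values standing in for redshift_user, redshift_pw, etc.
redshift_uri = build_connection_uri(
    conn_type="postgres",
    login="example_user",
    password="p@ss/word",
    host="example-cluster.example.us-west-2.redshift.amazonaws.com",
    port=5439,
    schema="my_redshift_db",
)
```

For a connection ID of `redshift_connection`, the variable would be `AIRFLOW_CONN_REDSHIFT_CONNECTION`, set in the environment of the Airflow containers.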