This lab is provided as part of AWS Innovate Data Edition and has been adapted from an AWS blog post.
Click here to explore the full list of hands-on labs.
ℹ️ You will run this lab in your own AWS account and running this lab will incur some costs. Please follow directions at the end of the lab to remove resources to avoid future costs.
- Overview
- Architecture
- Step 1 - Create Redshift Cluster
- Step 2 - Setup Student dataset
- Step 3 - Create VPC Endpoints
- Step 4 - Alter Security Group Rules
- Step 5 - Create an S3 Bucket
- Step 6 - Prepare data using DataBrew
- Summary
- Cleanup
- Survey
In this lab we will use AWS Glue DataBrew to prepare data from Amazon Redshift. We will explore a student dataset stored in Amazon Redshift containing details such as school ID, name, age, country, gender, number of study hours, and marks. We will use AWS Glue DataBrew to connect to the Redshift cluster and ingest the data. This data will then be prepared, cleaned, and made ready for a downstream machine learning process.
With AWS Glue DataBrew, users can evaluate the quality of their data by profiling it to understand data patterns and detect anomalies. They can also visualize, clean, and normalize data directly from their data lake, data warehouses, and databases, including Amazon S3, Amazon Redshift, Amazon Aurora, and Amazon RDS.
With added support for JDBC-accessible databases, DataBrew also supports additional data stores, including PostgreSQL, MySQL, Oracle, and Microsoft SQL Server. In this lab, we use DataBrew to clean data from an Amazon Redshift table, then transform it and apply different feature engineering techniques to prepare it for building a machine learning (ML) model. Finally, we store the transformed data in an S3 data lake to build the ML model in Amazon SageMaker.
In this lab we will set up the following architecture, which has several components, explained below.
- VPC - The Amazon Virtual Private Cloud (VPC) will contain the private subnet in which the Amazon Redshift cluster is hosted. In this lab, we will use the default VPC.
- Private Subnet - The private subnet will host the Elastic Network Interface for the Redshift cluster. In this lab we will use the default subnet. Note: the default subnet is attached to the internet gateway and is therefore not truly private; it is used here only for lab purposes.
- Elastic Network Interface (ENI) - The network interface through which the Redshift cluster is reached within the private subnet.
- Security Group - The security group will specify the inbound and outbound rules that secure the network traffic coming in and going out of the ENI. In this lab we will alter the existing security group.
- Amazon Redshift - Amazon Redshift will host the student dataset for this lab.
- AWS Glue - The connection to Amazon Redshift created by AWS Glue DataBrew will be housed in the AWS Glue service.
- Amazon S3 - Amazon S3 will store any Glue DataBrew intermediate logs and the outputs from the DataBrew recipes.
- Glue Interface Endpoint - An interface endpoint will make the AWS Glue service available within the VPC.
- S3 Gateway Endpoint - A gateway endpoint will make the Amazon S3 service available within the VPC.
- AWS Glue DataBrew - This service will connect to the student dataset in Amazon Redshift. It will allow users to prepare the data and create a reusable recipe that refines the data and makes it available to AI/ML services like Amazon SageMaker.
- Navigate to the Amazon Redshift service in the AWS Console.
- Click on Create cluster and name the cluster `student-cluster`.
- You could choose the Free Trial option, which creates a cluster with sample data. For this lab we will choose Production.
- Select the instance type as dc2.large and 2 nodes.
- Leave the checkbox for Load sample data unchecked.
- Use defaults for the VPC and Security group. Note the security group; we will alter its inbound and outbound rules at a later step.
- Select Create Cluster.
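If you prefer to script this step rather than use the console, a minimal boto3 sketch is shown below. The cluster name, node type and node count come from the lab; the region, master user name and password are placeholders you should replace.

```python
import boto3

# Sketch: create the lab cluster programmatically instead of via the console.
redshift = boto3.client("redshift", region_name="us-east-1")  # assumed region

redshift.create_cluster(
    ClusterIdentifier="student-cluster",
    NodeType="dc2.large",
    ClusterType="multi-node",
    NumberOfNodes=2,
    DBName="dev",                        # default database used later in the lab
    MasterUsername="awsuser",            # placeholder
    MasterUserPassword="ChangeMe123!",   # placeholder -- use your own secret
    PubliclyAccessible=False,
)

# Wait until the cluster is available before moving on to Step 2.
redshift.get_waiter("cluster_available").wait(ClusterIdentifier="student-cluster")
```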
You could also use your preferred SQL client to execute the SQL statements. See here for connecting using SQL Workbench/J.
- Connect to the dev database, which is created during the creation of the cluster.
- In the query editor, run the DDLSchemaTable.sql script.
- This will create the `student_schema` schema and a table `study_details` within the schema.
- Next, run the studentRecordsInsert.sql script to insert the sample student dataset for the lab.
- View the student data loaded in Redshift.
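The authoritative DDL and inserts are in DDLSchemaTable.sql and studentRecordsInsert.sql, which are supplied with the lab. If you would rather run them programmatically, the sketch below uses the Redshift Data API via boto3; the column list shown is only an assumption based on the dataset description, so substitute the real script contents.

```python
import boto3

rsd = boto3.client("redshift-data", region_name="us-east-1")  # assumed region

# Assumed shape of the table -- replace with the contents of DDLSchemaTable.sql.
statements = [
    "CREATE SCHEMA IF NOT EXISTS student_schema;",
    """CREATE TABLE IF NOT EXISTS student_schema.study_details (
           school_id   INT,
           first_name  VARCHAR(50),
           last_name   VARCHAR(50),
           school_name VARCHAR(100),
           age         INT,
           gender      VARCHAR(5),
           country     VARCHAR(50),
           health      VARCHAR(20),
           study_hours INT,
           marks       INT
       );""",
    # ...INSERT statements from studentRecordsInsert.sql go here...
]

for sql in statements:
    rsd.execute_statement(
        ClusterIdentifier="student-cluster",
        Database="dev",
        DbUser="awsuser",  # placeholder master user from Step 1
        Sql=sql,
    )
```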
- Navigate to the VPC service in the AWS Console.
- Select the Endpoints option from the left pane.
- Create a VPC endpoint by selecting 'Create Endpoint'.
- Set the service category as AWS services and search for the S3 service.
- Select the service with type Gateway.
- The default VPC should be selected by default; otherwise select the VPC where the Redshift cluster was created.
- Leave other options as is, scroll down, and click on 'Create Endpoint'.
- Go to the endpoint and make sure the route table is associated with all the subnets.
- Create another VPC endpoint by selecting 'Create Endpoint'.
- Set the service category as AWS services and search for the Glue service.
- Select the service with type Interface.
- The default VPC should be selected by default; otherwise select the VPC where the Redshift cluster was created.
- Leave other options as is, scroll down, and click on 'Create Endpoint'.
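The same two endpoints can be created with boto3, as sketched below; the VPC ID, route table ID, subnet ID, security group ID and region are placeholders to replace with the values from your account.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # assumed region

# Gateway endpoint so traffic to Amazon S3 stays inside the VPC.
ec2.create_vpc_endpoint(
    VpcId="vpc-xxxxxxxx",                        # VPC used by the Redshift cluster
    ServiceName="com.amazonaws.us-east-1.s3",
    VpcEndpointType="Gateway",
    RouteTableIds=["rtb-xxxxxxxx"],              # route table associated with the subnets
)

# Interface endpoint so the AWS Glue (DataBrew) service is reachable from the VPC.
ec2.create_vpc_endpoint(
    VpcId="vpc-xxxxxxxx",
    ServiceName="com.amazonaws.us-east-1.glue",
    VpcEndpointType="Interface",
    SubnetIds=["subnet-xxxxxxxx"],
    SecurityGroupIds=["sg-xxxxxxxx"],            # security group noted in Step 1
)
```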
- Go to the Security Groups feature in the EC2 Console.
- Fetch the security group noted in Step 1.
- Alter the inbound rules like so.
- Alter the outbound rules like so. While altering the outbound rules, ensure that the prefix list selected (pl-xxxxx) matches the prefix list created for the S3 VPC endpoint.
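The exact rules to apply are shown in the lab screenshots. As a hedged illustration only, the sketch below applies a pattern commonly used for Glue/DataBrew JDBC connectivity: a self-referencing all-TCP inbound rule plus HTTPS egress to the S3 gateway endpoint's prefix list. Treat both rules as assumptions and adjust them to match the screenshots.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # assumed region
sg_id = "sg-xxxxxxxx"        # security group noted in Step 1
s3_prefix_list = "pl-xxxxx"  # prefix list of the S3 gateway endpoint

# Assumed inbound rule: allow all TCP from the security group itself (self-referencing).
ec2.authorize_security_group_ingress(
    GroupId=sg_id,
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 0,
        "ToPort": 65535,
        "UserIdGroupPairs": [{"GroupId": sg_id}],
    }],
)

# Assumed outbound rule: HTTPS to the S3 prefix list used by the gateway endpoint.
ec2.authorize_security_group_egress(
    GroupId=sg_id,
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 443,
        "ToPort": 443,
        "PrefixListIds": [{"PrefixListId": s3_prefix_list}],
    }],
)
```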
- Go to the S3 Console and click on Create bucket.
- S3 bucket names have to be unique. Select a unique name and create the bucket.
- Create 2 folders, namely `AWSDatasetOutput` and `recipeJobOutput`.
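A boto3 sketch of this step is shown below; the bucket name is a placeholder and must be globally unique.

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")  # assumed region
bucket = "databrew-lab-your-unique-name"          # placeholder -- must be globally unique

# In regions other than us-east-1 you must also pass
# CreateBucketConfiguration={"LocationConstraint": "<region>"}.
s3.create_bucket(Bucket=bucket)

# "Folders" in S3 are simply zero-byte keys ending with a slash.
for prefix in ("AWSDatasetOutput/", "recipeJobOutput/"):
    s3.put_object(Bucket=bucket, Key=prefix)
```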
- Go to the IAM Policy console.
- Click on Create Policy.
- Navigate to the JSON sub-tab and paste the contents of the policy JSON.
- This policy is an extension of AwsGlueDataBrewDataResourcePolicy.
- In addition to the permissions in AwsGlueDataBrewDataResourcePolicy, this DataBrew setup also requires a few extra permissions, e.g. glue:GetConnection.
- Note: This policy can be further restricted to allow access to specific resources. An example of a more restricted policy is AwsGlueDataBrewDataResourcePolicy.json, in which the Resources section is narrowed to allow access only to the specific S3 bucket and the specific Glue connection.
- Click Next: Tags, then enter `AwsGlueDataBrewDataResourcePolicy` as the name of the policy and click on Create Policy.
- Go to the IAM Role console.
- Click on Create Role.
- Select DataBrew as the trusted entity. Click on Next: Permissions.
- Filter for AwsGlueDataBrewDataResourcePolicy and select the policy.
- Click on Next: Tags. Add any tags as appropriate.
- Click on Next: Review.
- Enter `AwsGlueDataBrewDataAccessRole` as the role name and click on Create Role.
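If you prefer to script the policy and role creation, a boto3 sketch follows. The policy document shown is only a trimmed illustration of the shape described above (S3 access plus extras such as glue:GetConnection); paste the JSON supplied with the lab in its place.

```python
import json
import boto3

iam = boto3.client("iam")

# Illustrative policy body only -- substitute the policy JSON provided with the lab.
policy_doc = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow",
         "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
         "Resource": "*"},                   # restrict to the lab bucket in practice
        {"Effect": "Allow",
         "Action": ["glue:GetConnection"],   # extra permission called out in the lab
         "Resource": "*"},
    ],
}

policy = iam.create_policy(
    PolicyName="AwsGlueDataBrewDataResourcePolicy",
    PolicyDocument=json.dumps(policy_doc),
)

# Trust policy that lets DataBrew assume the role.
trust = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "databrew.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="AwsGlueDataBrewDataAccessRole",
    AssumeRolePolicyDocument=json.dumps(trust),
)
iam.attach_role_policy(
    RoleName="AwsGlueDataBrewDataAccessRole",
    PolicyArn=policy["Policy"]["Arn"],
)
```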
- Go to AWS Glue DataBrew.
- In the left pane select Datasets and navigate to the Connections tab.
- Create a new connection.
- Enter the connection name as `students-connection`.
- Select Amazon Redshift as the connection type.
- Select the Redshift cluster, the database name, and the AWS user name and password that were used to create the cluster.
- Select the created connection and click on 'Create dataset with this connection'.
- Enter the dataset name as `studentrs-dataset`.
- The connection name should be auto-populated. Select the table `study_details`.
- Enter the S3 destination as `s3://bucketname/AWSDatasetOutput/`.
- Click on Create dataset.
- At this point, if you open the dataset and navigate to the Data profile overview subtab or the Column statistics subtab, you will see no information. This is because data profiling has not been completed yet. We will do this in a later step.
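Once the students-connection exists, the dataset can also be created through the DataBrew API. The sketch below mirrors the console inputs; the bucket name is a placeholder and the schema-qualified table name is an assumption based on Step 2.

```python
import boto3

databrew = boto3.client("databrew", region_name="us-east-1")  # assumed region
bucket = "databrew-lab-your-unique-name"                      # bucket created in Step 5

databrew.create_dataset(
    Name="studentrs-dataset",
    Input={
        "DatabaseInputDefinition": {
            "GlueConnectionName": "students-connection",
            "DatabaseTableName": "student_schema.study_details",  # assumed qualified name
            "TempDirectory": {
                "Bucket": bucket,
                "Key": "AWSDatasetOutput/",  # matches the S3 destination above
            },
        }
    },
)
```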
- Select the created dataset and click on 'Create project with this dataset'.
- Enter the project name as `studentrs-project`. The recipe name, the dataset name and the table name should be auto-populated.
- Select the role name `AwsGlueDataBrewDataAccessRole` created earlier.
- Click on Create Project. Once created, the project will run and provide a sample view of the dataset.
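A matching API sketch for the project is below. The console creates and names the recipe for you; via the API a recipe has to exist first, so the sketch creates an empty one with an assumed name, and the account ID in the role ARN is a placeholder.

```python
import boto3

databrew = boto3.client("databrew", region_name="us-east-1")  # assumed region

# Assumed recipe name -- the console would normally auto-generate this.
databrew.create_recipe(Name="studentrs-project-recipe", Steps=[])

databrew.create_project(
    Name="studentrs-project",
    DatasetName="studentrs-dataset",
    RecipeName="studentrs-project-recipe",
    RoleArn="arn:aws:iam::123456789012:role/AwsGlueDataBrewDataAccessRole",  # placeholder account
)
```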
- Navigate to Jobs in the left pane and go to the Profile jobs tab.
- Click on 'Create job' and enter the job name as `student-profile-job`.
- Select the 'Create a profile job' option for Job type.
- Enter `studentrs-dataset` as the job input.
- For the job output settings, enter the S3 location created in Step 5, i.e. `s3://bucketname/`.
- Click on 'Create and run job'.
- Select the created job and monitor its progress to make sure it has completed. This might take a few minutes depending on the size of the dataset.
- Navigate to the Data lineage sub-tab for the selected dataset to view a graphical representation of the data flow.
- Navigate to the dataset and view the column statistics. The data profiling job populates this data.
- The data profile provides insight into the data, e.g. missing data, outliers, etc. In the dataset used here, three records do not have the age field populated.
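The profile job can also be created and started from the API, as sketched below; the bucket name and account ID are placeholders.

```python
import time
import boto3

databrew = boto3.client("databrew", region_name="us-east-1")  # assumed region

databrew.create_profile_job(
    Name="student-profile-job",
    DatasetName="studentrs-dataset",
    OutputLocation={"Bucket": "databrew-lab-your-unique-name"},  # bucket from Step 5
    RoleArn="arn:aws:iam::123456789012:role/AwsGlueDataBrewDataAccessRole",  # placeholder
)

run = databrew.start_job_run(Name="student-profile-job")

# Poll until the run reaches a terminal state before checking column statistics.
while True:
    state = databrew.describe_job_run(Name="student-profile-job", RunId=run["RunId"])["State"]
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        break
    time.sleep(30)
print("Profile job finished with state:", state)
```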
- DataBrew provides a number of tools for refining data, which we will explore below. Using these tools, we can create DataBrew recipes to refine and prepare the data to be ingested by Amazon SageMaker to drive inferences.
- As part of the refining process, let's delete the first name, last name and school name columns.
- We know from the profiling report that the age value is missing in three records. Let's fill in the missing values with the median age of the other records. Choose Missing and choose Fill with numeric aggregate.
- The next step is to convert the categorical values in the gender column to numerical values:
- Choose Mapping and choose Categorical mapping.
- For Source column, choose gender.
- For Mapping options, select Map top 2 values.
- For Map values, select Map values to numeric values.
- For F, choose 1.
- For M, choose 2.
- ML algorithms often can't work on label data directly and require the input variables to be numeric. One-hot encoding is one technique that converts categorical data without an ordinal relationship into numeric data. To apply one-hot encoding, complete the following steps (a small illustration of the effect follows these steps):
- Choose Encode and choose One-hot encode column.
- For Column, select health.
- Click Apply.
- This step splits the health column into a number of columns.
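To see the effect of this step outside DataBrew, here is a small pandas illustration with made-up sample values for the health column.

```python
import pandas as pd

# Made-up sample of the health column, purely to illustrate the transformation.
df = pd.DataFrame({"health": ["good", "fair", "poor", "good"]})

# One-hot encoding replaces the single categorical column with one indicator
# column per category (health_fair, health_good, health_poor) -- the same
# effect as DataBrew's "One-hot encode column" step applied to health.
encoded = pd.get_dummies(df, columns=["health"])
print(encoded)
```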
- A number of similar changes can be made, e.g. deleting the original gender column and renaming the new gender_mapped column to gender.
- After all the desired refinements, a recipe containing all the changes to be applied is created. This can be viewed in the right-hand pane of the screen.
- The recipe can now be published so that it can be applied to the entire dataset. Select the Publish recipe option and leave the version description as is.
- The published recipes can be viewed by selecting the Recipes option in the left pane.
Now that the recipe is created, a recipe job can be run to apply it to the entire student dataset.
- For Job name, enter `student-performance`.
- Leave the job type as Create a recipe job.
- Enter `studentrs-dataset` as the dataset input.
- Set the output to Amazon S3 and select the S3 bucket we created in Step 5.
- For the IAM role name, select `AwsGlueDataBrewDataAccessRole`.
- Click on Create and run job.
- Navigate to the job and wait for it to finish. This should take a few minutes.
- Navigate to the output in the selected Amazon S3 bucket to view the results of the recipe job.
- This CSV file can now be fed into AI/ML services for further analysis as required.
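The same recipe job can be created and started through the API, as sketched below. It assumes the published recipe is named studentrs-project-recipe at version 1.0 (the console names the recipe after the project); the bucket name and account ID are placeholders.

```python
import boto3

databrew = boto3.client("databrew", region_name="us-east-1")  # assumed region

databrew.create_recipe_job(
    Name="student-performance",
    DatasetName="studentrs-dataset",
    RecipeReference={
        "Name": "studentrs-project-recipe",  # assumed recipe name
        "RecipeVersion": "1.0",              # first published version
    },
    Outputs=[{
        "Location": {"Bucket": "databrew-lab-your-unique-name", "Key": "recipeJobOutput/"},
        "Format": "CSV",
    }],
    RoleArn="arn:aws:iam::123456789012:role/AwsGlueDataBrewDataAccessRole",  # placeholder
)

databrew.start_job_run(Name="student-performance")
```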
In this lab, we created an Amazon Redshift cluster data warehouse and loaded a student dataset. We used a JDBC connection to create a DataBrew dataset for an Amazon Redshift table. We then performed data profiling followed by data transformation using DataBrew, preparing the data to be ingested by an ML model-building exercise.
Follow the steps below to clean up your account and prevent any additional charges:
- Navigate to Jobs and delete the recipe job and the profile job.
- Navigate to Projects and delete the project created in the lab.
- Navigate to Datasets and delete the dataset created in the lab.
- Navigate to the AWS Glue service, go to Connections, and delete the `students-connection` connection created in the lab.
- Navigate to the S3 console.
- Empty and then delete the bucket created in Step 5.
- You can also choose to remove the VPC endpoints, the IAM policy and role, and any security group alterations that were made.
- Navigate to the Redshift console.
- Open the student-cluster and delete the cluster. Uncheck the prompt to take a snapshot before deletion of the cluster.
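If you scripted the earlier steps, the resources can also be removed with boto3, as sketched below; the bucket name, region and resource names mirror the ones used in this lab and should be adjusted if you changed them.

```python
import boto3

region = "us-east-1"  # assumed region
databrew = boto3.client("databrew", region_name=region)
glue = boto3.client("glue", region_name=region)
redshift = boto3.client("redshift", region_name=region)
s3 = boto3.resource("s3", region_name=region)

# DataBrew jobs, project and dataset created in the lab.
for job in ("student-performance", "student-profile-job"):
    databrew.delete_job(Name=job)
databrew.delete_project(Name="studentrs-project")
databrew.delete_dataset(Name="studentrs-dataset")

# The DataBrew connection is stored as an AWS Glue connection.
glue.delete_connection(ConnectionName="students-connection")

# Empty and then delete the Step 5 bucket (placeholder name).
bucket = s3.Bucket("databrew-lab-your-unique-name")
bucket.objects.all().delete()
bucket.delete()

# Delete the Redshift cluster without taking a final snapshot.
redshift.delete_cluster(
    ClusterIdentifier="student-cluster",
    SkipFinalClusterSnapshot=True,
)
```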