Skip to content

Amazon QuickSight and Amazon Athena workshop. Workshop will focus on ingesting data into Athena, combining it with other data sources, and visualizaing it in QuickSight.

Notifications You must be signed in to change notification settings

jcataluna/QuicksightAthena01

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 

Repository files navigation

Quicksight and Athena Workshop - AWS & Slalom

Amazon QuickSight and Amazon Athena workshop. Workshop will focus on ingesting data into Athena, combining it with other data sources, and visualizing it in QuickSight.

Hands on workshop is broken up into 5 different sections to get you familiar with the Quicksight and Athena products:

Sign Up for AWS

Create your AWS Account

Navigate to Amazon AWS Free Tier. There are a variety of services that offer free tier to start building your solutions. Choose basic support plan.

Architecture and Permissions

Purpose of serverless components is to reduce the overhead of maintaining, provisioning, and managing servers to serve applications. AWS provides three compelling serverless services through AWS to store large amounts of data, manipulate data at scale, query data at scale and speed, and easily visualize it - namely AWS Glue, Amazon Athena, Amazon QuickSight.
alt text
To get these services working we need to allow these services to talk to one another. Following we will set up permissions for to accomplish this through AWS IAM.

Build Permissions and S3 Bucket

AWS provides a service to build resources out of predefined templates using CloudFormation. We will use a CloudFormation script to automate the creation of permissions, roles, and other elements we may require.

To create this we need to run a cloud formation template:

  1. Make sure you are in the N. Virginia Region
  2. Under services, click CloudFormation under Management Tools.
    alt text

  3. Click Create Stack
  4. Under "Choose a template", select the "Specify an Amazon S3 template URL" radio button option and enter this template url:
https://s3-us-west-2.amazonaws.com/slalom-seattle-ima/scripts/cloudformation/cf_QuickSightAthena_Workshop.template
  1. Click Next
  2. Enter the a name for your stack, like QuicksightAthena-Workshop
  3. Provide a unique name for your bucket to store your data - It needs to be globally unique name and the bucket name must contain only lowercase letters, numbers, periods (.), and dashes (-). No spaces!
  4. Hit Next
  5. Hit Next
  6. There is an acknowledge checkbox for you to review, and hit Create
  7. We will wait a couple minutes until the progress says CREATE_COMPLETE
    alt text


Query a file on S3

To get started with Athena and QuickSight, we need to provide data to query. This data may originate from a variety of sources into S3, but for this example we will upload a file into S3 manually.

  1. Open the S3 Console from the Services drop down menu
  2. Click your newly created bucket, by you or by our CloudFormation script.
  3. Hit Create folder and name it "B2B"
  4. Create another folder within B2B called "orders"
  5. Download sample dataset B2B Orders. Unzip the dataset files into a folder. Click on new folder and Upload the orders.csv.
  6. Make note of the folders you saved this file under.
  7. Open the Athena console from the Services dropdown.
  8. Create a table manually via DDL in the query window.
  9. Replace the location value to the folder location of your dataset. s3://your bucket name/B2B/orders/
CREATE DATABASE labs
CREATE EXTERNAL TABLE IF NOT EXISTS labs.orders (
  `row_id` int,
  `order_id` string,
  `order_date` date,
  `ship_date` date,
  `ship_mode_id` int,
  `customer_id` string,
  `segment` int,
  `product_id` string,
  `sales` double,
  `company_id` int,
  `quantity` int,
  `discount_pct` double,
  `profit_amt` double 
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = ',',
  'field.delim' = ','
) LOCATION 's3://<your bucket here>/B2B/orders/'
TBLPROPERTIES ('has_encrypted_data'='false')
  1. Hit Run Query

  2. Run the following SQL statement and make sure that your table is reading correctly:

SELECT * 
FROM labs.orders LIMIT 100
  1. Show Create Table statement helps you better understand what it going on behind the scenes when creating a table.
SHOW CREATE TABLE labs.orders

Alternate definitions, schema on read:

DROP TABLE labs.orders;
CREATE EXTERNAL TABLE IF NOT EXISTS labs.orders (
  `row_id` string,
  `order_id` string,
  `order_date` string,
  `ship_date` string,
  `ship_mode_id` string,
  `customer_id` string,
  `segment_id` string,
  `product_id` string,
  `sale` string,
  `company_id` string,
  `quantity` string,
  `discount_pct` string,
  `profit_amt` string 
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
   "separatorChar" = ",",
   "quoteChar"     = "\"",
   "escapeChar"    = "\\"
)  
LOCATION "s3://marioj-bucket-02/B2B/orders/"
TBLPROPERTIES ("skip.header.line.count"="1")

More resources:

Congratulations, you queried your first S3 file through Amazon Athena!



Introducing Glue and Athena

One of the many benefits of Glue, is its ability to discover and profile data from S3 Objects. This become handy in quickly creating a catalog of new and incoming data. To get started:

  1. In Athena, from the Database pane on the left hand side, click Create Table drop down and select Automatically
    alt text
  2. If this is your first time using Glue, you may be asked to upgrade your catalog, and get redirected. Make sure you are in the Crawlers section of Glue. On the left hand side there is a Crawlers link and hit Add Crawler
  3. Enter name your crawler "Taxi Crawler" and select the IAM role "GlueServiceRole". Click Next.
  4. Select the Specify path in another account radio button and enter s3://serverless-analytics/canonical/NY-Pub/ for the S3 path. Click Next.
  5. Do not add another data source and click Next.
  6. For frequency leave as Run on Demand and click Next.
  7. Select our labs database as a target
  8. In order to avoid table name collision Glue generates a unique table name so we'll need to provide a prefix, say taxi_ (include the underscore)
  9. Click Next
  10. Review the information is correct, specifically the "Include Path" field. Hit Finish when complete.
  11. Check the box next to your newly created crawler and click Run Crawler. It should take about a minute to run and create our table.

Exploring Glue Data Catalog

  1. On the left hand side, click Databases
  2. Find the labs database and click on it
  3. Click Tables in labs to view our newly created table alt text
  4. Click the table name and explore

Querying Taxi Data

  1. Switch back to the Athena console
  • You may need to replace the database and/or table names with ones shown in the Data Catalog.
  1. Enter SHOW PARTITIONS labs.taxi_ny_pub to verify all partitions were automatically added
  2. Try the SQL statement below to explore the data.
SELECT *
FROM labs.taxi_ny_pub
WHERE year BETWEEN '2013' AND '2016' AND type='yellow'
ORDER BY pickup_datetime desc
LIMIT 10;


alt text

SELECT 
  type,
  year, 
  count(*) fare_count, 
  avg(fare_amount) avg_fare, 
  lag(avg(fare_amount)) over (partition by type order by year asc) last_year_avg_fare
FROM labs.taxi_ny_pub
WHERE year is not null
GROUP BY year, type
ORDER BY year DESC, type DESC
  • Remember, you have the ability to Save a query for future re-use and reference.


Breakout Exercises

Breakout 1 - Load B2B Dataset

Now that we have learned about crawlers, lets put it to use to load the rest of our B2B Orders dataset.

  • Unzip the data, and upload it to your S3 Bucket remember, one folder represents one table.
  • Run a crawler through your bucket to discovery the dataset.
  • Add new tables to the labs database with prefix "b2b_"

important products table is not properly detected so manually reapair is needed. Edit table to chamge column names accordingly and add skip.header.line.count: 1 to the table properties

Make sure to check fields, and how Glue is parsing your data. Correct any mistakes. Once complete, you should be able to run this query:

SELECT
  year(date_parse(Order_Date,'%c/%e/%Y')) Order_Year,
  Company_Name,
  SUM(quantity) Quantity,
  SUM(sales) Total_Sales,
  SUM(sales)/revenue_billion Sales_to_Revenue_Ratio
FROM labs.b2b_orders o
  JOIN labs.b2b_company co on  co.company_id = o.company_id
  JOIN labs.b2b_customer cu on cu.customer_id = o.customer_id
  JOIN labs.b2b_product p on p.product_id = o.product_id
  JOIN labs.b2b_segment s on s.segment_id = o.segment_id
  JOIN labs.b2b_ship_mode sm on sm.ship_mode_id = o.ship_mode_id
  JOIN labs.b2b_company_financials cp on cp.company_id = co.company_id
  JOIN labs.b2b_industry i on i.industry_id = co.industry_id
GROUP BY
  year(date_parse(Order_Date,'%c/%e/%Y')),
  Company_Name,
  revenue_billion
ORDER BY
  SUM(sales)/revenue_billion DESC
LIMIT 100


alt text

Breakout 2 - Discover Instacart Data

In this section, we will break out and follow the same instructions, but while loading data from another public source, Instacart. Instacart is a company that operates as a same-day grocery delivery service. Customers select groceries through a web application from various retailers and delivered by a personal shopper. Instacart has published a public datasource to provide insight into consumer shopping trends for over 200,000 users. Data Instacart in May 2017 to look at Instacart's customers' shopping pattern. You can find the data dictionary for the data set here

  • Source s3 bucket: s3://royon-customer-public/instacart/
  • Database: labs
  • Prefix: instacart_

Expected output

alt text

Notes on best practices

  • Partition your data
  • Compress your data!
  • With large datasets, split your files into ~100MB files
  • Convert data to a columnar format, with large datasets.

For more great tips view this post on AWS Big Data blog.



Visualizing and Dashboarding with QuickSight

Exercise 1

1. Setting up your QuickSight Account

Go to your AWS console and search for QuickSight. You will first be presented with a screen to sign up:
alt text

You can choose either Standard or Enterprise Edition (the main difference today is that Enterprise edition can hook up to Active Directory, though there will be more functionality in the future added to Enterprise Edition). For purposes of our lab today Standard Edition is fine. With both editions you get one free user, forever.
alt text

Next you will create and name for you account (you can name the account whatever you'd like) and a notification email address (set it to be your own email address). You will also see some prompts about enabling autodiscovery of your data in AWS sources, as well as access to Athena, S3 buckets, and S3 Storage analytics. Check all the boxes.
alt text

Note - Make sure you launch QuickSight in the same region you have chosen for Athena.
alt text

Once you are finished your account will load with some sample datasets and dashboards.

alt text

Alright, now we are ready to roll!

Here is some documentation on getting familiar with the UI: Navigating the UI

2. Connecting to the Data

Documentation: Data Preparation, Table Joins

Open QuickSight and choose 'Manage Data' in the upper right hand corner:
alt text

Choose 'New Dataset' and then select Athena.
alt text

alt text

Give it a name and choose 'Create Data Source'. Find the database you created earlier which contains the B2B tables and select the b2b_orders table. Try to make sure you are choosing the orders table that was created automatically by Glue instead of the table that we created using the SQL statement (if you happen to pick the wrong one, no problem, you just won't need to do the step where we create a calculated field to change the order_date to a date field). Choose 'Edit/Preview Data'. (If you clicked 'Select' instead, it's OK, just choose 'Edit/Preview Data' on the next screen and leave it on 'Import to SPICE for quicker analytics'.)
alt text

Now we will join all the tables we had created in Athena by using the Glue data crawler. Some tables join directly to the Orders table and some join to the Company table. To join a table to something other than the first one we selected (Orders) drag and drop it on top of the table you want to join it to. You will then need to define the join clauses - to do this, click on the little venn diagrams in-between the tables you see on the screen. They will all be based on the key which is named after the dimension table you are trying to join. Set them all to 'Inner' joins and click 'Apply' after you finish each table. alt text alt text

When you are finished it should look something like this (we will skip the Segment and Product tables as the crawler didn't pick up the headers correctly - we can correct this using a Glue ETL job, but for purposes of this lab we can just leave these two tables out of our new dataset):

alt text

Before we start visualizing, let's also add a couple calculated fields to convert the date fields, order_date and ship_date to date fields rather than strings (normally we could just change the datatype in QuickSight in the data preview window, but Athena does not support this today. It will be supported soon, and you could do this for any other type of data source, but for Athena we will need to make calculated fields). On the left side choose 'New Field' and then use the parseDate() function to convert the string field to a date field. Use these formulas for each calculated field:

parseDate({order_date},'MM/dd/yyyy')
parseDate({ship_date},'MM/dd/yyyy')


alt text

Once you are finished preparing the dataset, choose Save & Visualize on the top of your screen.
alt text

3. Creating Our Dashboard

Documentation: Creating Your First Analysis, Modifying Visuals

Great, now we are ready to begin visualizing our data. By default AutoGraph is chosen as the visual type, which will pick an appropriate visual type depending on the types of fields choose to visualize. We can leave it like that for now, and later we will specify certain visual types.

First click on 'sales' and we will get a KPI visual type. Then click on the Field Wells on the top and use the pull down menu to choose 'Show As->Currency':
alt text

Now click on the 'Order Date' field. Notice how our visual type automatically is changed to a line chart. It will default to the Year level, but use the pull down menu on the Order Date field to choose 'Aggregate->Month', or you can do the same thing by clicking on the Order Date label on the x-axis:
alt text

Next click the pull down menu on the segment field in the list of measures and choose 'Convert to dimension'. Then find it in the list of dimensions and select it. Now we will have 3 lines in our line chart, one per segment. Expand the axis range on the bottom of the visual to see the whole trend:
alt text


alt text

Great, we have our first visual! Now let's add another visual using the '+' button in the upper left and selecting 'Add visual':
alt text

For our next visual, let's start by clicking 'industry_name' and 'sales'. We will get a bar chart sorted in descending order by sales:
alt text

Let's add a drill down capability for our end users by dragging the 'company_name' field just below the 'industry_name' field on the Y axis field well. You should see a notification that says 'Add drill-down layer':
alt text

Cool, now our end users will be able to drill down from Industry to the actual Companies in that industry. You can see how this works by clicking on one of the bars and selecting 'Drill down to company_name':
alt text

If you want to drill back up, you can either click the bars again or you can use the icons in the upper right to either drill one level back up or all the way back to the top (if you have more than one drill down built in):
alt text

Next let's change the visual type to a Treemap using the Visual Types selector in the bottom left:
alt text

Now add another visual to the dashboard. This one will be a very granular table of all the order details. First select the 'Pivot Table' visual type. Then click on the company_name dimension. Expand the Field Wells on top and drag the order_id to the Rows underneath the company_name:
alt text

Also click on the product_id, ship_mode, sales, profit, and quantity fields to add more detail to our visual. It should look something like this:
alt text

Great, our dashboard is starting to shape up. We can now add some KPI visuals across the top to provide some high-level summary information for our users. Add another visual and select the 'sales' field. Expand the Field Wells and drag the Order Date to the 'Trend group' field well. Let's also resize the visual by dragging the bottom right corner of the visual to make it smaller. Drag it to the top of the dashboard by grabbing the dotted area on the top of the visual. Once you have it on the top it should look like this:
alt text

Let's repeat this last step to add two more KPI's to the top of the dashboard. After you add another visual, select the KPI visual type in the lower left of the screen:
alt text

The second one will be a KPI for the number of unique orders YoY. To do this, select the KPI visual type and drag 'order_id' to the 'Value' field well and 'Order Date' to the 'Trend group' field well. Change the aggregation on 'order_id' from Count to Count Distinct:
alt text

For the third KPI, let's show a YoY trend of the average order size. Click 'sales' and then use the pull down menu on the field to change the aggregation to Average. Add the 'Order Date' to the 'Trend group' field well like we did for the first KPI:
alt text

You can optionally play around with the KPI formatting options. You can change the primary number that is displayed and the comparison type. You can also choose if you would like to show the trend arrows as well as the progress bar (which is displayed as a bullet chart on the bottom of the KPI).
alt text

Lastly let's edit the titles of the KPI's to be more user friendly. I chose 'Sales YoY', 'Avg Order Size YoY', and '# of Orders YoY' for my titles:
alt text

Awesome! Our dashboard is looking really good. We are almost ready to share it with the rest of our end users. Just before we do that, let's add a filter (or many) for our users to leverage. On the left, choose 'Filter' and then either click 'Create one' or the little filter icon on the top and choose 'Order Date'. I like to use the 'Relative dates' type of UI for my date filters. Set it to the 'Last 5 years' and hit 'Apply'. Lastly click on the top where it says 'Only this visual' and change it to 'All visuals' so that it applies to the entire dashboard:
alt text


alt text

4. Sharing

Documentation: Creating and Sharing Your First Dashboard

We are ready to share our dashboard with the rest of our users now! Click the 'Share' button in the upper right of the screen and select 'Create Dashboard'. Give it a name like 'Sales Dashboard' and choose 'Create Dashboard'.

alt text

On the next screen you will be able to share it with other users in your QuickSight account.

alt text

Once you add them you can click 'Share' and it will send them an email saying a dashboard has been shared with them. Also the next time they log into QuickSight they will see it in the list of dashboards they have access to.

Great job! You have just created your first dashboard to be shared with the rest of your team!


alt text

Exercise 2 - Visualizing NY Taxi Data

One of the most compelling reasons for using Athena to query data on S3 is that you can query some really really BIG datasets. In our next exercise we will use QuickSight and Athena to visualize 2.7 Billion records. That's right, billion.

1. Connect to the Dataset

Open QuickSight and choose 'Manage Data' in the upper right hand corner.

Choose 'New Dataset' and then select Athena.

Give it a name and choose 'Create Data Source'. Find the database you created earlier which contains the NY taxi data and select the appropriate table. Choose 'Edit/Preview Data'.

Before we start visualizing, let's add a calculated field to convert the date field. The date field in this dataset is in Epoch date format. Therefore we will use a function to convert it to a more usable format. On the left side choose 'New Field' and then use the epochDate() function to convert pickup_datetime field to a date field. It is measured down to the millisecond, so we will also divide the integer by 1000 to get it into seconds before converting. Use this formula:

epochDate({pickup_datetime}/1000)

alt text

Make sure we keep it set to 'Query' rather than SPICE, which is different from what we did in the first exercise (actually when doing table joins QuickSight forces you to use SPICE, but when connecting to individual tables we get this choice). Since we are going to be working with nearly 3 billion records, we will want to query the data directly in S3 using Athena.
alt text

2. Creating Our Dashboard

Great, now we are ready to begin visualizing our data. By default AutoGraph is chosen as the visual type, which will pick an appropriate visual type depending on the types of fields choose to visualize. We can leave it like that for now, and later we will specify certain visual types.

Select 'passenger_count' and then use the pull down menu to change the aggregation to Count. Then use the pull down menu again and choose 'Format->1234.57' to round to two decimal places. The KPI will show that we have 2.67 billion records in the dataset. Pretty impressive performance on a dataset of that size!
alt text

alt text

Let's add another visual. This time select 'Pickup Date' (the calculated field you created). You should get a line chart. Use the pull down menu and change the aggregation to Week. Then expand the axis range on the bottom of the visual.
alt text

Select the 'type' field and you should get three lines, one for each type of taxi:
alt text

Let's add another visual. This one will also be a time trend but we will look at the data YoY. First change the visual type to a Line Chart. Then drag the 'month' field to the X axis field well and the 'year' field to the Color field well.

Notice the months on the bottom are out of order. Since the field is a string data type the months are sorted in alphabetical order. To fix this we must edit the dataset and change the data types for these columns. Use the dropdown menu for the name of your dataset and choose 'Edit analysis data sets' and then click 'Edit' on the next screen:
alt text


alt text

Click on the 'a' icon underneath both of these fields in the data preview window and change them both to 'Int':
alt text

Choose 'Save & visualize'. Now the months on our line chart should be sorted in the correct order:
alt text

One of the first things you will notice is that there is a huge drop in Feb on the 2010 line. A quick google search for 'nyc feb 2010' will reveal that there was a huge blizzard in Feb 2010! Makes sense why there were less rides for that month.

Feel free to continue exploring this data. There aren't a ton more dimensions to play with - the dataset was meant to highlight the scale of how many records Athena + S3 can handle rather than analytical depth - but go wild with it!

3. (Optional) Create A Story

How to createa a Story

In addiition to creating and sharing dashboards, you can also create and share 'stories'. They are great if you have found something interesting in the data and you would like to lead you users to that particular finding. For instance, using our last finding you could capture a 'scene' of the YoY trend visual, then filter to 2010 and capture another 'scene' to highlight the drop in Feb 2010 due to the blizzard.

Conclusion

Congratulations on creating your first Glue Crawlers, Athena Databases & Tables, and QuickSight Analyses and Dashboards! You are now well versed in Serverless Analytics!

For more tips and information about what's new in QuickSight, check out the blog as well as the other resources on the website!

The End

About

Amazon QuickSight and Amazon Athena workshop. Workshop will focus on ingesting data into Athena, combining it with other data sources, and visualizaing it in QuickSight.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Shell 100.0%