
# 🧩 AWS Glue Crawler + Athena Lab
**Objective:** Learn how to catalog and query data in S3 using AWS Glue and Amazon Athena — without provisioning servers.

---

## 🧭 Lab Overview

This lab demonstrates how to perform **in-place SQL queries** directly on files stored in Amazon S3.  
You will learn how to catalog data using AWS Glue and query it interactively using Amazon Athena.

### Key Learning Outcomes
- Create and configure IAM roles for Glue.
- Set up an S3 bucket and upload data.
- Use AWS Glue Crawler to discover schema and create metadata.
- Query S3 data directly using Athena.



## ⚙️ Prerequisites

- AWS account with console access
- Region: **us-east-1 (N. Virginia)**, or any single region used consistently throughout the lab
- Dataset: `iris_all.csv` 
- Permissions to create:
  - IAM Role
  - S3 Bucket
  - AWS Glue Crawler
  - Athena queries



## 🪪 Step 1: Create IAM Role for AWS Glue

1. Open the **IAM Console** → *Roles* → **Create Role**  
2. Choose **AWS Service** → select **Glue** → **Next**  
3. Attach the managed policy **AWSGlueServiceRole**  
4. Continue to the review screen and name the role:  
   ```
   AWSGlueServiceRoleDefault
   ```
5. Click **Create Role**

> 💡 The `AWSGlueServiceRole` managed policy grants AWS Glue access to S3 buckets whose names start with `aws-glue-`, which is why the bucket in Step 2 uses that prefix.
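
The console steps above can also be scripted. The following is a minimal sketch, assuming the `boto3` SDK is installed and credentials are configured. The helper names `glue_trust_policy` and `create_glue_role` are illustrative and not part of any AWS API. The IAM client is passed in, so nothing touches AWS until you call the function yourself.

```python
import json

# ARN of the AWS-managed policy attached in step 3 above.
GLUE_SERVICE_POLICY_ARN = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"


def glue_trust_policy() -> dict:
    """Trust policy that lets the Glue service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "glue.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def create_glue_role(iam_client, role_name: str = "AWSGlueServiceRoleDefault") -> None:
    """Create the role, then attach the managed policy (hypothetical helper)."""
    iam_client.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(glue_trust_policy()),
    )
    iam_client.attach_role_policy(
        RoleName=role_name,
        PolicyArn=GLUE_SERVICE_POLICY_ARN,
    )


# Usage (assumes configured credentials):
#   import boto3
#   create_glue_role(boto3.client("iam"))
```

Choosing **Glue** as the trusted service in the console produces a trust relationship of this shape; the sketch writes it out explicitly because the API has no equivalent shortcut.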



## 🪣 Step 2: Create S3 Bucket and Upload Dataset

1. Open the **S3 Console** → **Create Bucket**  
2. Use this naming pattern:  
   ```
   aws-glue-yourname
   ```
   Example: `aws-glue-labdemo`  
3. Choose the region you will use for Glue and Athena (e.g., *us-east-1*); IAM roles are global, so the role needs no matching region  
4. Inside the bucket, create the folder structure:
   ```
   iris/
     └── csv/
   ```
5. Upload the file `iris_all.csv` to this location:
   ```
   s3://aws-glue-yourname/iris/csv/
   ```
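
S3 has no real folders: the `iris/csv/` structure is just a key prefix, so uploading an object under that prefix is enough to make the "folders" appear. A minimal sketch of the same step with `boto3`, assuming the SDK is installed and credentials are configured. The helper names are illustrative, and `aws-glue-yourname` remains a placeholder you must replace.

```python
def object_key(prefix: str, filename: str) -> str:
    """Join an S3 key prefix and a file name into an object key."""
    return prefix.strip("/") + "/" + filename


def create_bucket_kwargs(bucket: str, region: str) -> dict:
    """Arguments for s3.create_bucket.

    us-east-1 takes no location constraint; other regions require one.
    """
    kwargs = {"Bucket": bucket}
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


def upload_dataset(s3_client, bucket: str, region: str = "us-east-1",
                   local_path: str = "iris_all.csv") -> str:
    """Create the bucket and upload the dataset (hypothetical helper)."""
    s3_client.create_bucket(**create_bucket_kwargs(bucket, region))
    key = object_key("iris/csv", "iris_all.csv")
    s3_client.upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"


# Usage (assumes configured credentials and a local iris_all.csv):
#   import boto3
#   upload_dataset(boto3.client("s3", region_name="us-east-1"), "aws-glue-yourname")
```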



## 🧬 Step 3: Create and Configure AWS Glue Crawler

1. From the **AWS Glue Console**, go to **Crawlers** → **Add Crawler**  
2. Name the crawler:
   ```
   iris_csv_crawler
   ```
3. For *Source Type*, choose **Data Stores**  
4. Select **S3** as the data source  
5. Browse and select the folder path: `s3://aws-glue-yourname/iris/csv/`  
6. Choose **No** for “Add another data store”  
7. For the IAM role, choose the one you created earlier:  
   ```
   AWSGlueServiceRoleDefault
   ```
8. Set the crawler schedule to **Run on Demand**  
9. Under *Output*, add a new database named:
   ```
   demo_db
   ```
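10. Add the table prefix below. The console marks it optional, but later steps expect the table name `iris_csv` (the prefix `iris_` plus the folder name `csv`):
    ```
    iris_
    ```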
11. Review and click **Finish**.
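
The wizard choices above map onto a single crawler definition. A minimal sketch of that definition for `boto3`, assuming the SDK is installed and credentials are configured. The helper names are illustrative. Creating `demo_db` explicitly before the crawler is a cautious choice made here, not something the console steps require.

```python
def crawler_definition(bucket: str) -> dict:
    """Keyword arguments for glue.create_crawler, mirroring the wizard."""
    return {
        "Name": "iris_csv_crawler",
        "Role": "AWSGlueServiceRoleDefault",
        "DatabaseName": "demo_db",
        "Targets": {"S3Targets": [{"Path": f"s3://{bucket}/iris/csv/"}]},
        "TablePrefix": "iris_",
        # No "Schedule" key: the crawler then runs only on demand.
    }


def create_crawler(glue_client, bucket: str) -> None:
    """Create the output database and the crawler (hypothetical helper).

    create_database raises an error if demo_db already exists.
    """
    glue_client.create_database(DatabaseInput={"Name": "demo_db"})
    glue_client.create_crawler(**crawler_definition(bucket))


# Usage (assumes configured credentials):
#   import boto3
#   create_crawler(boto3.client("glue", region_name="us-east-1"), "aws-glue-yourname")
```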



## ▶️ Step 4: Run the Crawler

1. Select your crawler: **iris_csv_crawler**  
2. Click **Run Crawler**  
3. Wait for the crawler to complete. Once done, it should display:
   ```
   1 table added to the data catalog
   ```
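
"Wait for the crawler to complete" amounts to polling the crawler state until it returns to `READY`. A minimal sketch with `boto3`, assuming the SDK is installed and credentials are configured. The function name, polling interval, and timeout are illustrative choices.

```python
import time


def run_crawler_and_wait(glue_client, name: str = "iris_csv_crawler",
                         poll_seconds: float = 10.0,
                         timeout_seconds: float = 600.0) -> str:
    """Start the crawler and poll until it is READY again.

    Returns the status of the last crawl (for example "SUCCEEDED").
    """
    glue_client.start_crawler(Name=name)
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        crawler = glue_client.get_crawler(Name=name)["Crawler"]
        # States cycle RUNNING -> STOPPING -> READY for an on-demand run.
        if crawler["State"] == "READY":
            return crawler.get("LastCrawl", {}).get("Status", "UNKNOWN")
        time.sleep(poll_seconds)
    raise TimeoutError(f"crawler {name} did not finish in {timeout_seconds}s")


# Usage (assumes configured credentials):
#   import boto3
#   print(run_crawler_and_wait(boto3.client("glue", region_name="us-east-1")))
```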



## 🧾 Step 5: Verify Metadata in the Glue Catalog

1. In the **AWS Glue Console** → *Databases* → select **demo_db**  
2. Click on the table **iris_csv**  
3. Review the schema detected by the crawler:

| Column Name   | Data Type |
|----------------|-----------|
| sepal_length   | double    |
| sepal_width    | double    |
| petal_length   | double    |
| petal_width    | double    |
| class          | string    |

> 🧠 The crawler used a CSV classifier and automatically inferred column names and data types.
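
The same schema can be read back from the Data Catalog programmatically. A minimal sketch with `boto3`, assuming the SDK is installed and credentials are configured. The helper name `table_schema` and the `EXPECTED_SCHEMA` constant are illustrative; the constant simply restates the table above.

```python
from typing import List, Tuple

# Schema shown in the table above, as (column name, data type) pairs.
EXPECTED_SCHEMA = [
    ("sepal_length", "double"),
    ("sepal_width", "double"),
    ("petal_length", "double"),
    ("petal_width", "double"),
    ("class", "string"),
]


def table_schema(glue_client, database: str = "demo_db",
                 table: str = "iris_csv") -> List[Tuple[str, str]]:
    """Return the catalogued columns as (name, type) pairs."""
    response = glue_client.get_table(DatabaseName=database, Name=table)
    columns = response["Table"]["StorageDescriptor"]["Columns"]
    return [(col["Name"], col["Type"]) for col in columns]


# Usage (assumes configured credentials):
#   import boto3
#   assert table_schema(boto3.client("glue", region_name="us-east-1")) == EXPECTED_SCHEMA
```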



## 🧮 Step 6: Query the Data Using Amazon Athena

1. Open the **Amazon Athena Console**  
2. Ensure the console region matches the region where the crawler created the Glue Data Catalog table  
3. In *Settings*, configure a query result location, e.g.:
   ```
   s3://aws-glue-yourname/athena-results/
   ```
4. Select database **demo_db**  
5. Open table **iris_csv**  
6. Choose **Preview Table** — Athena will execute:
   ```sql
   SELECT * FROM "demo_db"."iris_csv" LIMIT 10;
   ```
7. Confirm that 10 rows are displayed.
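
Outside the console, an Athena query is asynchronous: you start it, poll its state, then fetch the results. A minimal sketch of that flow with `boto3`, assuming the SDK is installed and credentials are configured. The function name `run_query` is illustrative, and only the first page of results is read.

```python
import time


def run_query(athena_client, sql: str, database: str,
              output_location: str, poll_seconds: float = 1.0) -> list:
    """Run a query and return rows as lists of strings.

    For SELECT statements the first returned row holds the column names.
    """
    query_id = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]

    # Poll until the query reaches a terminal state.
    while True:
        status = athena_client.get_query_execution(
            QueryExecutionId=query_id
        )["QueryExecution"]["Status"]
        if status["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if status["State"] != "SUCCEEDED":
        raise RuntimeError(f"query {query_id} ended in state {status['State']}")

    rows = athena_client.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    return [[cell.get("VarCharValue") for cell in row["Data"]] for row in rows]


# Usage (assumes configured credentials; the bucket name is a placeholder):
#   import boto3
#   run_query(boto3.client("athena", region_name="us-east-1"),
#             'SELECT * FROM "demo_db"."iris_csv" LIMIT 10',
#             "demo_db", "s3://aws-glue-yourname/athena-results/")
```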



## 🧠 Step 7: Try Custom Queries

### a) Filter for a specific class
```sql
SELECT * 
FROM demo_db.iris_csv 
WHERE class = 'Iris-setosa';
```

### b) Count rows whose class matches a partial name
```sql
SELECT class, COUNT(*) AS total
FROM demo_db.iris_csv
WHERE class LIKE '%setosa%'
GROUP BY class;
```

### c) Compute a derived column
```sql
SELECT sepal_length * sepal_width AS sepal_area, class
FROM demo_db.iris_csv
LIMIT 10;
```
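
To see what shape each query returns before spending Athena scans, the three statements can be tried locally. This sketch uses Python's built-in `sqlite3` as a stand-in, which is an assumption made for illustration: SQLite is not Athena's engine, and its `LIKE` is case-insensitive for ASCII where Athena's is case-sensitive. The three rows are hand-typed sample values, not read from `iris_all.csv`.

```python
import sqlite3

# Hand-typed illustrative rows (assumption): NOT taken from iris_all.csv.
SAMPLE_ROWS = [
    (5.0, 3.0, 1.5, 0.25, "Iris-setosa"),
    (4.5, 2.0, 1.0, 0.25, "Iris-setosa"),
    (6.0, 3.0, 4.5, 1.5, "Iris-versicolor"),
]


def run_local_demo() -> dict:
    """Run queries a), b) and c) against an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    # Attach a second in-memory database so "demo_db.iris_csv" resolves.
    conn.execute("ATTACH DATABASE ':memory:' AS demo_db")
    conn.execute(
        "CREATE TABLE demo_db.iris_csv ("
        "sepal_length REAL, sepal_width REAL, "
        "petal_length REAL, petal_width REAL, class TEXT)"
    )
    conn.executemany(
        "INSERT INTO demo_db.iris_csv VALUES (?, ?, ?, ?, ?)", SAMPLE_ROWS
    )
    results = {
        "filter": conn.execute(
            "SELECT * FROM demo_db.iris_csv WHERE class = 'Iris-setosa'"
        ).fetchall(),
        "count": conn.execute(
            "SELECT class, COUNT(*) AS total FROM demo_db.iris_csv "
            "WHERE class LIKE '%setosa%' GROUP BY class"
        ).fetchall(),
        "area": conn.execute(
            "SELECT sepal_length * sepal_width AS sepal_area, class "
            "FROM demo_db.iris_csv LIMIT 10"
        ).fetchall(),
    }
    conn.close()
    return results


# Usage:
#   for name, rows in run_local_demo().items():
#       print(name, rows)
```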



## 🔍 Step 8: Reflection

- You queried data directly from S3 without loading it into a database.  
- The AWS Glue crawler inferred the schema and registered it as a table in the Glue Data Catalog.  
- Athena used that catalog to run serverless SQL queries.

✅ No servers to manage.  
✅ Pay only for the data your queries scan.  
✅ Ideal for data exploration in Data Lakes.



## 🧹 Step 9: Cleanup (Optional)

To avoid incurring charges:

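1. Delete the Athena query result folder (`athena-results/`) from S3.  
2. Delete the **Glue Crawler** and the **Database** (`demo_db`); deleting the database also removes the `iris_csv` table.  
3. Empty, then delete, the **S3 Bucket** used in this lab (a bucket must be empty before it can be deleted).  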
4. Optionally, remove the **IAM Role** if not reused.
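
The cleanup order can be sketched with `boto3` as well, assuming the SDK is installed and credentials are configured. The function name `cleanup_lab` is illustrative. The sketch assumes bucket versioning is off (the default), since deleting current objects does not remove old versions, and it leaves the IAM role in place.

```python
def cleanup_lab(glue_client, s3_bucket,
                crawler: str = "iris_csv_crawler",
                database: str = "demo_db") -> None:
    """Remove the lab's Glue and S3 resources (hypothetical helper).

    s3_bucket is a boto3 S3 Bucket resource, not a bucket name.
    """
    glue_client.delete_crawler(Name=crawler)
    # Deleting the database also removes the tables registered in it.
    glue_client.delete_database(Name=database)
    # A bucket must be empty before it can be deleted; this also clears
    # the athena-results/ prefix when it lives in the same bucket.
    s3_bucket.objects.all().delete()
    s3_bucket.delete()


# Usage (assumes configured credentials; the bucket name is a placeholder):
#   import boto3
#   cleanup_lab(boto3.client("glue", region_name="us-east-1"),
#               boto3.resource("s3").Bucket("aws-glue-yourname"))
```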



## 🪞 References

- [AWS Glue Documentation](https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html)  
- [Amazon Athena Documentation](https://docs.aws.amazon.com/athena/latest/ug/what-is.html)
