
# 🧩 AWS Athena Customer Reviews

**Objective:** Learn how to query large `Parquet` datasets in Amazon S3 using Amazon Athena.

---

## 🧭 Lab Overview

In this lab, you will:
1. Download the customer reviews dataset from the Git repository.
2. Upload it to your S3 bucket.
3. Create a table manually in Athena using a `CREATE EXTERNAL TABLE` script.
4. Run analytical queries to explore sentiment and star rating relationships.
5. Observe how Athena queries Parquet efficiently by reading only the columns a query needs.

---

## ⚙️ Prerequisites

- AWS Account with Athena and S3 permissions.
- Region: **us-east-1 (N. Virginia)**, or another Region used consistently for both the S3 bucket and Athena.
- Access to an S3 bucket (you may reuse your course bucket).
- Athena Query Editor access and an existing **database**, for example `demo_db`.
- Parquet file: `customer_reviews.parquet` (~30 MB, ~100K reviews).



## 🪣 Step 1: Download and Upload Dataset to S3

1. Navigate to your **Data Lake Git repository** → `customer_review/data/` folder.
2. Download the Parquet file (~30 MB) containing customer reviews.
3. Open **AWS S3 Console**.
4. In your course bucket (e.g., `aws-glue-yourname`), create a new folder:
   ```
   customer_review_parquet/
   ```
5. Upload the Parquet file to this folder.
6. Copy the **S3 URI** of this folder, for example:
   ```
   s3://aws-glue-yourname/customer_review_parquet/
   ```



## 🧾 Step 2: Create Table in Athena

We will manually define the schema and point Athena to our Parquet files using a `CREATE EXTERNAL TABLE` statement.

In the Athena Query Editor, make sure the **database** is set to `demo_db`.  
Then, paste and run the following script (update the `LOCATION` with your own S3 path):


```sql
CREATE EXTERNAL TABLE `amazon_reviews_parquet`(
  `marketplace` string,
  `customer_id` string,
  `review_id` string,
  `product_id` string,
  `product_parent` string,
  `product_title` string,
  `product_category` string,
  `star_rating` int,
  `helpful_votes` int,
  `total_votes` int,
  `vine` string,
  `verified_purchase` string,
  `review_headline` string,
  `review_body` string,
  `review_date` string,
  `sentiment` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://aws-glue-yourname/customer_review_parquet/';
```



✅ **Expected Outcome:**  
Athena should successfully create the table `amazon_reviews_parquet`.  
If a table with this name already exists, drop it first (`DROP TABLE IF EXISTS amazon_reviews_parquet;`) and re-run this script.
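
When re-running the lab, dropping the old table and checking the new one can be sketched as follows. This is a minimal sketch that assumes the table name `amazon_reviews_parquet` from the script above; run each statement separately in the Query Editor.

```sql
-- Drop any earlier definition of the table.
-- For an external table this removes only the catalog metadata;
-- the Parquet files in S3 are left untouched.
DROP TABLE IF EXISTS amazon_reviews_parquet;

-- After re-running the CREATE EXTERNAL TABLE script,
-- confirm that the columns and their types were registered.
DESCRIBE amazon_reviews_parquet;
```

`DESCRIBE` should list the 16 columns declared in the script, from `marketplace` through `sentiment`.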



## 🧮 Step 3: Preview and Explore Data

### Preview the first 10 rows


```sql
SELECT *
FROM "amazon_reviews_parquet"
LIMIT 10;
```



### Check distinct sentiment values


```sql
SELECT DISTINCT sentiment
FROM "amazon_reviews_parquet"
LIMIT 10;
```
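
Before moving on to the analytical queries, a quick sanity check of the two columns they rely on can be useful. The query below is a sketch and not part of the original lab; it assumes that `star_rating` is expected to fall between 1 and 5, which the dataset description does not state explicitly.

```sql
SELECT
    MIN(star_rating)              AS min_rating,         -- expected to be 1 (assumption)
    MAX(star_rating)              AS max_rating,         -- expected to be 5 (assumption)
    COUNT(*) - COUNT(star_rating) AS missing_ratings,    -- rows where star_rating is NULL
    COUNT(*) - COUNT(sentiment)   AS missing_sentiments  -- rows where sentiment is NULL
FROM "amazon_reviews_parquet";
```

`COUNT(column)` skips NULLs, so subtracting it from `COUNT(*)` gives the number of missing values in that column.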



## 📊 Step 4: Analytical Queries

### Count total number of reviews


```sql
SELECT COUNT(*) AS total_reviews
FROM "amazon_reviews_parquet";
```



### Count reviews per sentiment


```sql
SELECT sentiment, COUNT(*) AS total_reviews
FROM "amazon_reviews_parquet"
GROUP BY sentiment
ORDER BY total_reviews DESC;
```



### Analyze relationship between star rating and sentiment


```sql
SELECT star_rating, sentiment, COUNT(*) AS total_reviews
FROM "amazon_reviews_parquet"
GROUP BY star_rating, sentiment
ORDER BY star_rating, sentiment;
```
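
Raw counts are hard to compare when some star ratings have far more reviews than others. One possible extension, shown here as a sketch rather than as part of the original lab, is to express each count as a share of all reviews with the same star rating. The column name `pct_of_rating` is introduced here for illustration.

```sql
SELECT
    star_rating,
    sentiment,
    COUNT(*) AS total_reviews,
    -- The window function sums the grouped counts per star rating,
    -- so each row is divided by the total for its own rating.
    ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (PARTITION BY star_rating), 1) AS pct_of_rating
FROM "amazon_reviews_parquet"
GROUP BY star_rating, sentiment
ORDER BY star_rating, pct_of_rating DESC;
```

Within each star rating the percentages should add up to roughly 100, apart from rounding.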



### List reviews with the highest star ratings


```sql
SELECT
    product_title, star_rating,
    sentiment, review_headline, review_body
FROM "amazon_reviews_parquet"
ORDER BY star_rating DESC
LIMIT 10;
```
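
The query in this section returns ten individual reviews with the highest star rating; it does not rank products. To rank products instead, one option is to aggregate by `product_title`, as in the sketch below. The minimum of 20 reviews per product is an arbitrary threshold chosen for illustration; lower it if the query returns no rows for your dataset.

```sql
SELECT
    product_title,
    COUNT(*)                   AS review_count,
    ROUND(AVG(star_rating), 2) AS avg_rating
FROM "amazon_reviews_parquet"
GROUP BY product_title
HAVING COUNT(*) >= 20          -- hypothetical cutoff to ignore rarely reviewed products
ORDER BY avg_rating DESC, review_count DESC
LIMIT 10;
```

Without the `HAVING` filter, products with a single 5-star review would tie for first place and crowd out products with many reviews.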



### Find 5-star reviews that are not positive


```sql
SELECT
    product_title, star_rating,
    sentiment, review_headline, review_body
FROM "amazon_reviews_parquet"
WHERE star_rating = 5
  AND sentiment != 'POSITIVE'
LIMIT 10;
```



✅ **Observation:**  
Athena queries the Parquet files directly in S3, with no ETL job or separate database server to manage.  
Parquet’s columnar structure helps Athena scan only required columns, reducing data scanned and improving performance.
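
One way to see this effect yourself is to run two similar queries and compare the **Data scanned** value that the Query Editor reports for each. The pair below is a sketch: both queries aggregate over every row, but the second also has to read the long `review_body` text column. Run them one at a time. No specific figures are given here because they depend on how your file is encoded and compressed.

```sql
-- Query A: reads only the small integer column star_rating.
SELECT AVG(star_rating) AS avg_rating
FROM "amazon_reviews_parquet";

-- Query B: same aggregate, but also reads the review_body column.
SELECT
    AVG(star_rating)         AS avg_rating,
    AVG(LENGTH(review_body)) AS avg_body_length
FROM "amazon_reviews_parquet";
```

Query B should report noticeably more data scanned than Query A, even though both process the same number of rows.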



## 🧠 Step 5: Reflection

- Parquet's columnar format lets Athena read only the columns a query references, which reduces both data scanned and query time.  
- Athena integrates seamlessly with S3, enabling serverless querying.  
- Sentiment fields enrich analytical capabilities for customer insights.  
- Misclassifications in sentiment detection are possible; reading individual review examples helps interpret them.
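
To get a rough sense of how often the star rating and the sentiment label disagree, the reviews can be bucketed as in the sketch below. It assumes the `sentiment` column contains the labels `'POSITIVE'` and `'NEGATIVE'`; only `'POSITIVE'` appears in this lab's queries, so check the distinct values from Step 3 and adjust the labels if yours differ. The bucket names and the rating cutoffs (4 and above, 2 and below) are illustrative choices.

```sql
SELECT
    CASE
        WHEN star_rating >= 4 AND sentiment = 'NEGATIVE' THEN 'high rating, negative sentiment'
        WHEN star_rating <= 2 AND sentiment = 'POSITIVE' THEN 'low rating, positive sentiment'
        ELSE 'consistent or neutral'
    END AS agreement,
    COUNT(*) AS total_reviews
FROM "amazon_reviews_parquet"
GROUP BY 1                      -- group by the first SELECT expression (the CASE bucket)
ORDER BY total_reviews DESC;
```

A disagreement does not always mean the sentiment label is wrong: a reviewer may give 5 stars while describing a problem in the text, which is why reading the examples still matters.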

---

## 🧹 Cleanup (Optional)

1. Delete the Athena query results folder from S3.
2. Drop the table if not needed:
   ```sql
   DROP TABLE amazon_reviews_parquet;
   ```
3. Optionally, delete the uploaded data from your S3 bucket. Dropping an external table removes only its metadata and leaves the files in place.

---

## 🪞 References

- [Amazon Athena Documentation](https://docs.aws.amazon.com/athena/latest/ug/what-is.html)
- [AWS Parquet and Query Performance](https://docs.aws.amazon.com/athena/latest/ug/parquet.html)
