
# 🧩 AWS Athena Customer Reviews CSV Comparison Lab

**Objective:** Demonstrate how Athena scans significantly more data when querying a CSV table compared to a Parquet table.

---

## 🧭 Lab Overview

In this follow-up lab, we’ll use the same **Customer Reviews dataset** and:
1. Create a new **CSV table** in Athena.
2. Run identical queries on both Parquet and CSV tables.
3. Compare data scanned (MB) and performance.
4. Understand why **Parquet is more cost-efficient and faster**.

---

## ⚙️ Prerequisites

- Completed previous lab: `AWS_Athena_Customer_Reviews_Parquet_Lab.ipynb`
- Existing S3 bucket, e.g. `aws-glue-yourname`
- Database: `demo_db`
- Existing Parquet table: `amazon_reviews_parquet`



## 🪣 Step 1: Prepare Dataset in CSV Format

1. Open **Amazon Athena Console**.
2. Ensure your workgroup output location is configured (e.g., `s3://aws-glue-yourname/athena-results/`).
3. Run the following **CTAS (Create Table As Select)** query to convert Parquet → CSV:

```sql
CREATE TABLE amazon_reviews_csv
WITH (
    format = 'TEXTFILE',
    field_delimiter = ',',
    external_location = 's3://aws-glue-yourname/customer_review_csv/'
) AS
SELECT * FROM amazon_reviews_parquet;
```

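✅ **Result:** Athena writes a CSV copy of the dataset to the specified S3 folder and registers the table `amazon_reviews_csv` in the current database. The target folder must be empty before the query runs, and the output files contain no header row.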



## 🧾 Step 2: Create Table Definition for CSV (Manual Alternative)

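If you skipped the CTAS query and already have CSV files (with a header row) at the S3 location, define the table manually with the script below. Do not run both steps: the CTAS query in Step 1 already creates `amazon_reviews_csv`, and its output files have no header row, so `skip.header.line.count` would drop one data row from each file.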

```sql
CREATE EXTERNAL TABLE amazon_reviews_csv (
  marketplace string,
  customer_id string,
  review_id string,
  product_id string,
  product_parent string,
  product_title string,
  product_category string,
  star_rating int,
  helpful_votes int,
  total_votes int,
  vine string,
  verified_purchase string,
  review_headline string,
  review_body string,
  review_date string,
  sentiment string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = ',',
  'field.delim' = ','
)
LOCATION 's3://aws-glue-yourname/customer_review_csv/'
TBLPROPERTIES ('skip.header.line.count'='1');
```

✅ **Expected Outcome:** The table `amazon_reviews_csv` should now appear in Athena under your selected database.



## 🧮 Step 3: Run the Same Queries

Run each query below and observe the **Data Scanned (MB)** in the Athena **Query History** panel.
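
The same figure can also be read programmatically instead of from the console. The following is a minimal sketch, assuming a boto3 Athena client is passed in as `athena`. The function name `run_and_measure` and its parameters are hypothetical and not part of the original lab.

```python
import time


def run_and_measure(athena, sql, database, output_location, poll_seconds=1.0):
    """Run one query and return (bytes_scanned, engine_millis).

    `athena` is assumed to be a boto3 Athena client,
    e.g. boto3.client("athena").
    """
    query_id = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]

    # Poll until the query reaches a terminal state.
    while True:
        execution = athena.get_query_execution(
            QueryExecutionId=query_id
        )["QueryExecution"]
        state = execution["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if state != "SUCCEEDED":
        reason = execution["Status"].get("StateChangeReason", "no reason given")
        raise RuntimeError(f"Query {query_id} ended in state {state}: {reason}")

    stats = execution["Statistics"]
    return stats["DataScannedInBytes"], stats["EngineExecutionTimeInMillis"]


# Example usage (requires boto3 and AWS credentials; names follow this lab):
# import boto3
# scanned, millis = run_and_measure(
#     boto3.client("athena"),
#     "SELECT COUNT(*) AS total_reviews FROM amazon_reviews_csv",
#     database="demo_db",
#     output_location="s3://aws-glue-yourname/athena-results/",
# )
# print(f"{scanned / 1_000_000:.2f} MB scanned in {millis} ms")
```

Calling it once per table with the same SQL (swapping `amazon_reviews_csv` for `amazon_reviews_parquet`) gives the two numbers needed for each row of the comparison.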

### 1️⃣ Preview first 10 rows


```sql
SELECT * FROM amazon_reviews_csv LIMIT 10;
```



### 2️⃣ Count total number of reviews


```sql
SELECT COUNT(*) AS total_reviews FROM amazon_reviews_csv;
```



### 3️⃣ Count reviews by sentiment


```sql
SELECT sentiment, COUNT(*) AS total_reviews
FROM amazon_reviews_csv
GROUP BY sentiment
ORDER BY total_reviews DESC;
```



### 4️⃣ Analyze relationship between star rating and sentiment


```sql
SELECT star_rating, sentiment, COUNT(*) AS total_reviews
FROM amazon_reviews_csv
GROUP BY star_rating, sentiment
ORDER BY star_rating, sentiment;
```



### 5️⃣ Compare results between Parquet and CSV

| Query | Parquet scanned (MB) | CSV scanned (MB) | Reduction in data scanned |
|-------|----------------------|------------------|---------------------------|
| `COUNT(*)` | 0.02 | 31 | ~1,550x less |
| Group by sentiment | 0.06 | 32 | ~530x less |
| Star rating by sentiment | 0.07 | 33 | ~470x less |

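> 💡 **Explanation:**  
> Parquet is *columnar*, so Athena reads only the columns a query references, and `COUNT(*)` can be answered largely from file metadata.  
> CSV is *row-based*: Athena must read every byte of every row regardless of which columns the query uses, which is why each CSV scan is roughly the size of the full dataset.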
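
To make the difference concrete, here is a toy sketch in plain Python. It is not how Parquet actually encodes data (real Parquet adds compression, encodings, and metadata). It only compares how many bytes a reader must touch to count reviews by sentiment when values are stored row by row versus column by column. The sample rows are invented for illustration.

```python
# Invented sample rows: (review_id, star_rating, sentiment, review_body)
rows = [
    ("R1", 5, "positive", "Works great and arrived quickly"),
    ("R2", 1, "negative", "Stopped working after two days"),
    ("R3", 4, "positive", "Good value for the price"),
]
columns = ["review_id", "star_rating", "sentiment", "review_body"]

# Row layout (CSV-like): each line holds every field of one row.
row_layout = "\n".join(",".join(str(value) for value in row) for row in rows)

# Column layout (Parquet-like): each column's values are stored together.
column_layout = {
    name: "\n".join(str(row[i]) for row in rows)
    for i, name in enumerate(columns)
}

# Counting reviews by sentiment needs only the sentiment column.
bytes_read_row = len(row_layout.encode("utf-8"))                  # whole file
bytes_read_col = len(column_layout["sentiment"].encode("utf-8"))  # one column

print(f"row layout:    {bytes_read_row} bytes read")
print(f"column layout: {bytes_read_col} bytes read")
```

The gap widens as rows get wider: long free-text fields such as `review_body` dominate the row layout but are never touched when only `sentiment` is needed.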



## 🧠 Step 4: Reflection

- CSV forces Athena to scan the **entire dataset** for every query, regardless of which columns are selected.  
- Parquet, being **columnar**, reads only the referenced columns, minimizing data scanned.  
- Because Athena bills by the amount of data scanned, this means **lower cost**, and usually **faster queries**, in real-world analytics.

✅ Always prefer **Parquet or ORC** for optimized querying.
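
As a rough worked example, the sketch below derives the scan-reduction ratios from the comparison table in Step 3 and estimates a per-query cost. The price of $5 per TB scanned and the 10 MB per-query billing minimum are assumptions based on Athena's commonly quoted on-demand pricing; both can differ by Region and change over time, so check the current pricing page. The function name `estimate_cost_usd` is hypothetical.

```python
PRICE_PER_TB_USD = 5.0   # assumption: on-demand price per TB scanned
MIN_BILLED_MB = 10.0     # assumption: per-query billing minimum
MB_PER_TB = 1_000_000.0  # decimal units, good enough for a rough estimate


def estimate_cost_usd(mb_scanned):
    """Approximate cost of one query, applying the assumed billing minimum."""
    billed_mb = max(mb_scanned, MIN_BILLED_MB)
    return billed_mb / MB_PER_TB * PRICE_PER_TB_USD


# (query, Parquet MB scanned, CSV MB scanned) from the comparison table
results = [
    ("COUNT(*)", 0.02, 31),
    ("Group by sentiment", 0.06, 32),
    ("Star rating by sentiment", 0.07, 33),
]

for name, parquet_mb, csv_mb in results:
    ratio = csv_mb / parquet_mb
    print(
        f"{name}: {ratio:,.0f}x less data scanned, "
        f"~${estimate_cost_usd(parquet_mb):.6f} (Parquet) vs "
        f"~${estimate_cost_usd(csv_mb):.6f} (CSV) per query"
    )
```

Under these assumptions the billing minimum narrows the cost gap on a dataset this small (10 MB billed versus about 31 MB), even though the scan gap is several hundredfold. On datasets well above the minimum, the cost ratio approaches the scan ratio.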

---

## 🧹 Cleanup (Optional)

To remove the CSV table and data:

```sql
DROP TABLE amazon_reviews_csv;
```

Then delete the S3 folder:
```
s3://aws-glue-yourname/customer_review_csv/
```

---

## 🪞 References

- [AWS Athena CTAS Documentation](https://docs.aws.amazon.com/athena/latest/ug/create-table-as.html)
- [Athena Query Performance Tuning](https://docs.aws.amazon.com/athena/latest/ug/performance-tuning.html)
- [Columnar Data Formats in Athena](https://docs.aws.amazon.com/athena/latest/ug/convert-to-columnar.html)
