<a href="https://colab.research.google.com/github/lily-larson/MGMT-467-Analytics-Portfolio/blob/main/Labs/Lab_2_VertexAI_BigQuery_PromptsOnly.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab: Vertex AI–Assisted BigQuery Analytics — Example Prompts
**Goal:** Practice moving from simple SQL to complex analytics in BigQuery using *only* carefully engineered prompts with Vertex AI (Gemini).  
**Important:** This notebook contains **prompts only** (no starter code). Paste the prompts into **Vertex AI Studio**, **Vertex AI in Colab Enterprise**, or your chosen chat interface, and then run the generated SQL directly in **BigQuery**. If you decide to automate later, you can ask Vertex AI to convert the winning SQL into a Colab pipeline.

## How to use this prompts-only notebook
1. Open **Vertex AI Studio** (or Gemini in Colab Enterprise chat panel).  
2. Copy a prompt from this notebook and paste it into the model. Do **not** paste any code from here; let the model generate it.  
3. Run the generated SQL in **BigQuery** (Console → BigQuery Studio).  
4. Iterate: refine the prompt when results aren’t what you expect.  
5. Document: capture your final SQL, plus a one-sentence takeaway, in your notes/README.

## Dataset assumptions
Use one of these sources (adjust table paths accordingly):
- **Global Superstore (Kaggle)** loaded into BigQuery (e.g., `[YOUR_PROJECT].superstore_data.sales`)  
- **TheLook eCommerce** public dataset: `bigquery-public-data.thelook_ecommerce`  
If you are using *Global Superstore*, make sure column names match your schema (e.g., `Order_Date`, `Region`, `Category`, `Sub_Category`, `Sales`, `Profit`, `Discount`, `State`, `Customer_ID`, `Ship_Mode`).

---
## Prompting guardrails (quick checklist)
- **Be explicit**: table path, column names, filters, output columns, sort order, and limits.  
- **Ask for runnable SQL**: “Return a BigQuery SQL block only.”  
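- **Control cost**: ask for `LIMIT` during exploration and remove it for the final run. `LIMIT` keeps result sets small, but on its own it generally does not reduce the bytes BigQuery scans, so also name only the columns you need and filter early.  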
- **Validate**: request a brief explanation of why each clause is present and how you can sanity-check results.
---
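
The checklist above can be turned into a small template so that every prompt states the same fields in the same order. The following is only a sketch: the function name `build_sql_prompt` and its parameters are hypothetical helpers introduced for illustration, not part of Vertex AI or BigQuery.

```python
def build_sql_prompt(task, table, output_columns, filters=None, sort=None, limit=None):
    """Assemble a structured prompt in the style used throughout this lab."""
    lines = [
        "BigQuery SQL only. Return a single runnable SQL block.",
        f"Task: {task}",
        f"Table: `{table}`",
    ]
    if filters:
        lines.append("Filter: " + " AND ".join(filters))
    lines.append("Output: " + ", ".join(f"`{c}`" for c in output_columns))
    if sort:
        lines.append(f"Sort: {sort}")
    if limit is not None:
        lines.append(f"Add: `LIMIT {limit}` during exploration.")
    return "\n".join(lines)


# Example: the DISTINCT warm-up task, with a placeholder table path
prompt = build_sql_prompt(
    task="List all unique `Sub_Category` values sold in the 'West' region.",
    table="[YOUR_PROJECT].superstore_data.sales",
    output_columns=["Sub_Category"],
    filters=["`Region` = 'West'"],
    sort="alphabetically A to Z",
    limit=100,
)
print(prompt)
```

Keeping the fields explicit makes it easier to see which part of a prompt was ambiguous when the generated SQL is not what you expected.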

## Install Dependencies

In [6]:
# Install the Google Cloud BigQuery client library
#!pip install google-cloud-bigquery==3.17.0 pandas==2.1.4

# Authenticate your Colab environment
from google.colab import auth
auth.authenticate_user()
print('Authenticated')

Authenticated


## Copy Schema to a dataframe

In [None]:
from google.cloud import bigquery
import pandas as pd

# Replace these with your own Google Cloud project, dataset, and table IDs
project_id = 'directed-bongo-471119-d1'
dataset_id = 'lab1_helpme'
table_id = 'Global Superstore Sample Lab 1'

# Construct a BigQuery client object.
client = bigquery.Client(project=project_id)

# Get the table object (client.dataset() is deprecated; build the reference explicitly)
table_ref = bigquery.DatasetReference(project_id, dataset_id).table(table_id)
table = client.get_table(table_ref)

# Extract schema information
schema_list = []
for field in table.schema:
    schema_list.append({
        'name': field.name,
        'field_type': field.field_type,
        'mode': field.mode,
        'description': field.description
    })

# Convert to Pandas DataFrame
schema_df = pd.DataFrame(schema_list)

# Display the schema DataFrame (optional, for verification)
print("Schema DataFrame created:", schema_df)
# To see the output, run the code.


Schema DataFrame created:              name field_type      mode description
0          Row ID    INTEGER  NULLABLE        None
1        Order ID     STRING  NULLABLE        None
2      Order Date       DATE  NULLABLE        None
3       Ship Date       DATE  NULLABLE        None
4       Ship Mode     STRING  NULLABLE        None
5     Customer ID     STRING  NULLABLE        None
6   Customer Name     STRING  NULLABLE        None
7         Segment     STRING  NULLABLE        None
8         Country     STRING  NULLABLE        None
9            City     STRING  NULLABLE        None
10          State     STRING  NULLABLE        None
11    Postal Code    INTEGER  NULLABLE        None
12         Region     STRING  NULLABLE        None
13     Product ID     STRING  NULLABLE        None
14       Category     STRING  NULLABLE        None
15   Sub-Category     STRING  NULLABLE        None
16   Product Name     STRING  NULLABLE        None
17          Sales      FLOAT  NULLABLE        None
18   

## Clean Column Names

In [None]:
# --- 1. Clean the Column Names ---
# Create a 'clean_name' column with standard naming conventions:
# lowercase, with spaces and hyphens replaced by underscores.
schema_df['clean_name'] = schema_df['name'].str.lower().str.replace(' ', '_').str.replace('-', '_')


# --- 2. Generate the Aliases for the SELECT Clause ---
column_expressions = []
for index, row in schema_df.iterrows():
    original_name = row['name']
    clean_name = row['clean_name']

    # If the original name contains a space or special character, it needs to be
    # enclosed in backticks (`) in the SQL statement.
    if ' ' in original_name or '-' in original_name:
        expression = f'`{original_name}` AS {clean_name}'
    else:
        # If the name is already clean, we still alias it for consistency.
        expression = f'{original_name} AS {clean_name}'
    column_expressions.append(expression)

# Join all the individual column expressions into a single, formatted string.
select_clause = ",\n  ".join(column_expressions)


# --- 3. Construct the Final CREATE VIEW Statement ---
new_view_id = 'global_superstore_sample_clean' # You can change this if you like

create_view_sql = f"""
CREATE OR REPLACE VIEW `{project_id}.{dataset_id}.{new_view_id}` AS
SELECT
  {select_clause}
FROM
  `{project_id}.{dataset_id}.{table_id}`;
"""

# --- 4. Print the Final SQL ---
print("--- Copy the SQL below and run it in your BigQuery Console ---")
print(create_view_sql)

--- Copy the SQL below and run it in your BigQuery Console ---

CREATE OR REPLACE VIEW `directed-bongo-471119-d1.lab1_helpme.global_superstore_sample_clean` AS
SELECT
  `Row ID` AS row_id,
  `Order ID` AS order_id,
  `Order Date` AS order_date,
  `Ship Date` AS ship_date,
  `Ship Mode` AS ship_mode,
  `Customer ID` AS customer_id,
  `Customer Name` AS customer_name,
  Segment AS segment,
  Country AS country,
  City AS city,
  State AS state,
  `Postal Code` AS postal_code,
  Region AS region,
  `Product ID` AS product_id,
  Category AS category,
  `Sub-Category` AS sub_category,
  `Product Name` AS product_name,
  Sales AS sales,
  Quantity AS quantity,
  Discount AS discount,
  Profit AS profit
FROM
  `directed-bongo-471119-d1.lab1_helpme.Global Superstore Sample Lab 1`;
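
The cleaning step above only replaces spaces and hyphens. As a hedged sketch, a regular expression can cover other punctuation as well. The helper names `clean_identifier` and `alias_expression` are hypothetical, and `Profit (USD)` is a made-up column used only to show the behavior.

```python
import re


def clean_identifier(name: str) -> str:
    """Lowercase a column name and collapse any run of non-alphanumerics to '_'."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip()).strip("_").lower()
    # Unquoted identifiers must start with a letter or underscore
    if cleaned and cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned


def alias_expression(name: str) -> str:
    """Always backtick-quote the original name; quoting is harmless for clean names."""
    return f"`{name}` AS {clean_identifier(name)}"


for col in ["Row ID", "Sub-Category", "Sales", "Profit (USD)"]:
    print(alias_expression(col))
```

Quoting every original name removes the need for the `if`/`else` branch that decides when backticks are required.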



## Generate View with standard column naming convention

In [None]:
# Execute the CREATE VIEW SQL query
try:
    query_job = client.query(create_view_sql)  # API request
    query_job.result()  # Waits for the query to finish
    print(f"View '{new_view_id}' created/replaced successfully in dataset '{dataset_id}'.")
except Exception as e:
    print(f"An error occurred while creating the view: {e}")

# Now, let's print 10 rows from the newly created view to verify
print(f"\n--- First 10 rows from the new view '{new_view_id}' ---")
try:
    # Construct a reference to the new view
    view_query = f"SELECT * FROM `{project_id}.{dataset_id}.{new_view_id}` LIMIT 10"

    # Execute the query
    query_job = client.query(view_query)

    # Fetch the results into a DataFrame
    rows_df = query_job.to_dataframe()

    # Print the DataFrame
    display(rows_df)

except Exception as e:
    print(f"An error occurred while fetching rows from the view: {e}")

View 'global_superstore_sample_clean' created/replaced successfully in dataset 'lab1_helpme'.

--- First 10 rows from the new view 'global_superstore_sample_clean' ---


Unnamed: 0,row_id,order_id,order_date,ship_date,ship_mode,customer_id,customer_name,segment,country,city,...,postal_code,region,product_id,category,sub_category,product_name,sales,quantity,discount,profit
0,5769,CA-2015-154900,2015-02-25,2015-03-01,Standard Class,SS-20875,Sung Shariari,Consumer,United States,Leominster,...,1453,East,OFF-LA-10001641,Office Supplies,Labels,Avery 518,3.15,1,0.0,1.512
1,5770,CA-2015-154900,2015-02-25,2015-03-01,Standard Class,SS-20875,Sung Shariari,Consumer,United States,Leominster,...,1453,East,OFF-PA-10002377,Office Supplies,Paper,Adams Telephone Message Book W/Dividers/Space ...,22.72,4,0.0,10.224
2,9028,US-2016-152415,2016-09-17,2016-09-22,Standard Class,PO-18865,Patrick O'Donnell,Consumer,United States,Marlborough,...,1752,East,FUR-FU-10002597,Furniture,Furnishings,"C-Line Magnetic Cubicle Keepers, Clear Polypro...",14.82,3,0.0,6.2244
3,9029,US-2016-152415,2016-09-17,2016-09-22,Standard Class,PO-18865,Patrick O'Donnell,Consumer,United States,Marlborough,...,1752,East,FUR-FU-10004864,Furniture,Furnishings,"Howard Miller 14-1/2"" Diameter Chrome Round Wa...",191.82,3,0.0,61.3824
4,8332,CA-2016-153269,2016-03-09,2016-03-12,First Class,PS-18760,Pamela Stobb,Consumer,United States,Andover,...,1810,East,OFF-ST-10004634,Office Supplies,Storage,"Personal Folder Holder, Ebony",11.21,1,0.0,3.363
5,8333,CA-2016-153269,2016-03-09,2016-03-12,First Class,PS-18760,Pamela Stobb,Consumer,United States,Andover,...,1810,East,FUR-CH-10002647,Furniture,Chairs,"Situations Contoured Folding Chairs, 4/Set",354.9,5,0.0,88.725
6,8334,CA-2016-153269,2016-03-09,2016-03-12,First Class,PS-18760,Pamela Stobb,Consumer,United States,Andover,...,1810,East,OFF-PA-10001801,Office Supplies,Paper,Xerox 193,17.94,3,0.0,8.7906
7,8335,CA-2016-153269,2016-03-09,2016-03-12,First Class,PS-18760,Pamela Stobb,Consumer,United States,Andover,...,1810,East,OFF-BI-10004632,Office Supplies,Binders,GBC Binding covers,51.8,4,0.0,23.31
8,526,CA-2015-158792,2015-12-26,2016-01-02,Standard Class,BD-11605,Brian Dahlen,Consumer,United States,Lawrence,...,1841,East,OFF-FA-10002815,Office Supplies,Fasteners,Staples,22.2,5,0.0,10.434
9,1312,CA-2016-141082,2016-12-09,2016-12-13,Standard Class,FM-14380,Fred McMath,Consumer,United States,Lawrence,...,1841,East,OFF-LA-10001404,Office Supplies,Labels,Avery 517,3.69,1,0.0,1.7343


In [None]:
# This assumes your 'client' object from the previous cell is still active
# and correctly authenticated.

print("✅ Step 1: Defining the query string...")

query_string = """
SELECT
  order_id,
  customer_name,
  product_name,
  sales,
  profit
FROM
  `mgmt-467-47888.lab1_foundation.superstore_clean`
LIMIT 10;
"""

print("✅ Step 2: Sending the query to BigQuery. This may take a moment...")

# Use a try-except block to catch potential errors
try:
    query_job = client.query(query_string)

    print("✅ Step 3: Waiting for query to complete and fetching results...")
    results_df = query_job.to_dataframe()

    print(f"✅ Step 4: Query finished. Found {len(results_df)} rows.")

    if results_df.empty:
        print("\n⚠️ The query ran successfully but returned an empty result. Please double-check that your 'superstore_clean' view exists and the original table has data.")
    else:
        print("\n--- Displaying Results ---")
        display(results_df)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")

✅ Step 1: Defining the query string...
✅ Step 2: Sending the query to BigQuery. This may take a moment...
✅ Step 3: Waiting for query to complete and fetching results...
✅ Step 4: Query finished. Found 10 rows.

--- Displaying Results ---


Unnamed: 0,order_id,customer_name,product_name,sales,profit
0,CA-2015-154900,Sung Shariari,Avery 518,3.15,1.512
1,CA-2015-154900,Sung Shariari,Adams Telephone Message Book W/Dividers/Space ...,22.72,10.224
2,US-2016-152415,Patrick O'Donnell,"C-Line Magnetic Cubicle Keepers, Clear Polypro...",14.82,6.2244
3,US-2016-152415,Patrick O'Donnell,"Howard Miller 14-1/2"" Diameter Chrome Round Wa...",191.82,61.3824
4,CA-2016-153269,Pamela Stobb,"Personal Folder Holder, Ebony",11.21,3.363
5,CA-2016-153269,Pamela Stobb,"Situations Contoured Folding Chairs, 4/Set",354.9,88.725
6,CA-2016-153269,Pamela Stobb,Xerox 193,17.94,8.7906
7,CA-2016-153269,Pamela Stobb,GBC Binding covers,51.8,23.31
8,CA-2015-158792,Brian Dahlen,Staples,22.2,10.434
9,CA-2016-141082,Fred McMath,Avery 517,3.69,1.7343


## Part A — SQL Warm‑Up (SELECT, WHERE, ORDER BY, LIMIT, DISTINCT)
**Aim:** Build confidence with precise, unambiguous prompts that yield clean, runnable SQL.

### A1. Unique values (DISTINCT)
**Prompt (paste in Vertex AI):**
```
Act as a senior BigQuery analyst. Produce a **single runnable BigQuery SQL** (no commentary) for:
- Task: List all unique `Sub_Category` values sold in the 'West' region.
- Table: `mgmt-467-47888.lab1_foundation.superstore`
- Filter: `Region = 'West'`
- Output: a single column named `Sub_Category`
- Sort: alphabetically A→Z
- Add: `LIMIT 100` to control cost during exploration.
```
**Reflection:** Did the result match your expectations? If not, what ambiguity in your prompt might have caused the mismatch?

In [None]:
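query_string = """
SELECT
    DISTINCT sub_category AS Sub_Category
FROM
    `mgmt-467-47888.lab1_foundation.superstore_clean`
WHERE
    region = 'West'
ORDER BY
    Sub_Category ASC
LIMIT 100
"""
# Run this query; without this call, query_job still holds the previous cell's results
query_job = client.query(query_string)
results_df = query_job.to_dataframe()
display(results_df)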



### A2. Top‑N by metric (ORDER BY … DESC)
**Prompt:**
```
BigQuery SQL only.
Task: Return the top 10 customers by total profit.
Table: `mgmt-467-47888.lab1_foundation.superstore`
Columns used: `Customer_ID`, `Profit`
Output columns: `Customer_ID`, `total_profit`
Logic: SUM Profit per customer, order by `total_profit` DESC
Add `LIMIT 10`.
```
**Tip:** If your schema uses different identifiers (e.g., `Customer Name`), restate column names explicitly.

In [None]:
query_string = """
SELECT
    customer_id,
    SUM(profit) AS total_profit
FROM
    `directed-bongo-471119-d1.lab1_helpme.global_superstore_sample_clean`
GROUP BY
    customer_id
ORDER BY
    total_profit DESC
LIMIT 10;
"""

print("✅ Step 1: Defining the query string...")
print(query_string)

print("✅ Step 2: Sending the query to BigQuery. This may take a moment...")

# Use a try-except block to catch potential errors
try:
    query_job = client.query(query_string)

    print("✅ Step 3: Waiting for query to complete and fetching results...")
    results_df = query_job.to_dataframe()

    print(f"✅ Step 4: Query finished. Found {len(results_df)} rows.")

    if results_df.empty:
        print("\n⚠️ The query ran successfully but returned an empty result.")
    else:
        print("\n--- Displaying Results ---")
        display(results_df)

except Exception as e:
    print(f"\n❌ An error occurred: {e}")

✅ Step 1: Defining the query string...

SELECT
    customer_id,
    SUM(profit) AS total_profit
FROM
    `directed-bongo-471119-d1.lab1_helpme.global_superstore_sample_clean`
GROUP BY
    customer_id
ORDER BY
    total_profit DESC
LIMIT 10;

✅ Step 2: Sending the query to BigQuery. This may take a moment...
✅ Step 3: Waiting for query to complete and fetching results...
✅ Step 4: Query finished. Found 10 rows.

--- Displaying Results ---


Unnamed: 0,customer_id,total_profit
0,TC-20980,8981.3239
1,RB-19360,6976.0959
2,SC-20095,5757.4119
3,HL-15040,5622.4292
4,AB-10105,5444.8055
5,TA-21385,4703.7883
6,CM-12385,3899.8904
7,KD-16495,3038.6254
8,AR-10540,2884.6208
9,DR-12940,2869.076


### A3. Basic filtering (WHERE) + sanity checks
**Prompt:**
```
BigQuery SQL only.
Task: Count orders shipped with each `Ship_Mode`, but only for orders in the 'Technology' category.
Table: `[YOUR_PROJECT].superstore_data.sales`
Output: `Ship_Mode`, `order_count`
Logic: COUNT(*) grouped by `Ship_Mode`
Sort by `order_count` DESC
```
**Validation ask:** “Also list two quick sanity checks to verify the numbers.”

In [None]:
%%bigquery --project directed-bongo-471119-d1
SELECT
    ship_mode,
    COUNT(*) AS order_count
FROM
    `directed-bongo-471119-d1.lab1_helpme.global_superstore_sample_clean`
WHERE
    category = 'Technology'
GROUP BY
    ship_mode
ORDER BY
    order_count DESC

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,ship_mode,order_count
0,Standard Class,1082
1,Second Class,366
2,First Class,301
3,Same Day,98


In [None]:
%%bigquery --project directed-bongo-471119-d1
SELECT
    COUNT(*) AS total_technology_orders
FROM
    `directed-bongo-471119-d1.lab1_helpme.global_superstore_sample_clean`
WHERE
    category = 'Technology'

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,total_technology_orders
0,1847


**Sanity Checks:**
1. **Sum of `order_count`:** The per-ship-mode counts sum to 1082 + 366 + 301 + 98 = 1,847, which matches `total_technology_orders` from the second query. Note that `COUNT(*)` counts rows (order line items), not distinct orders; use `COUNT(DISTINCT order_id)` if you need one count per order.
2. **Check for other categories:** Running the query without `WHERE category = 'Technology'` returns a larger total, which confirms that the filter is actually restricting the rows.
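
The first sanity check can also be done mechanically. This is a minimal sketch that copies the counts from the two query outputs above into plain Python; the variable names are illustrative.

```python
# Counts copied from the two query outputs above
ship_mode_counts = {
    "Standard Class": 1082,
    "Second Class": 366,
    "First Class": 301,
    "Same Day": 98,
}
total_technology_rows = 1847

# The grouped counts must add up to the ungrouped total
grouped_total = sum(ship_mode_counts.values())
assert grouped_total == total_technology_rows, (grouped_total, total_technology_rows)

# Percentage share of each ship mode, as a quick plausibility check
shares = {mode: round(100 * n / grouped_total, 1) for mode, n in ship_mode_counts.items()}
print(grouped_total, shares)
```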

## Part B — Grouped Analytics (GROUP BY, HAVING)
**Aim:** Turn raw facts into grouped metrics and filtered aggregations.

### B1. KPI aggregation with WHERE + GROUP BY
**Prompt:**
```
BigQuery SQL only.
Task: Compute monthly revenue for the last 12 full months.
Table: `[YOUR_PROJECT].superstore_data.sales`
Assume: `Order_Date` is a DATE or TIMESTAMP column named exactly `Order_Date`.
Output: `year_month` (YYYY-MM format), `monthly_revenue`
Logic: Truncate date to month, SUM `Sales`, filter to last 12 full months.
Sort by `year_month` ascending.
Include a `LIMIT` safeguard for exploration.
```

In [4]:
from google.cloud import bigquery
import pandas as pd

# Replace these with your own Google Cloud project, dataset, and view IDs
project_id = 'directed-bongo-471119-d1'
dataset_id = 'lab1_helpme'
table_id = 'global_superstore_sample_clean'

# Construct a BigQuery client object.
client = bigquery.Client(project=project_id)

In [11]:
%%bigquery --project directed-bongo-471119-d1
SELECT
    FORMAT_DATE('%Y-%m', DATE_TRUNC(order_date, MONTH)) AS year_month,
    SUM(sales) AS monthly_revenue
FROM
    `directed-bongo-471119-d1.lab1_helpme.global_superstore_sample_clean`
WHERE
    order_date >= DATE_TRUNC(
        (SELECT MAX(order_date) FROM `directed-bongo-471119-d1.lab1_helpme.global_superstore_sample_clean`),
        MONTH
    ) - INTERVAL 11 MONTH
GROUP BY
    year_month
ORDER BY
    year_month ASC
LIMIT 100; -- Added LIMIT for exploration as requested

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,year_month,monthly_revenue
0,2017-01,43971.374
1,2017-02,20301.1334
2,2017-03,58872.3528
3,2017-04,36521.5361
4,2017-05,44261.1102
5,2017-06,52981.7257
6,2017-07,45264.416
7,2017-08,63120.888
8,2017-09,87866.652
9,2017-10,77776.9232


### B2. Post‑aggregation filter (HAVING)
**Prompt:**
```
BigQuery SQL only.
Task: Find sub-categories whose total profit over the entire dataset is negative.
Table: `[YOUR_PROJECT].superstore_data.sales`
Output: `Sub_Category`, `total_profit`
Logic: SUM `Profit` GROUP BY `Sub_Category`, HAVING SUM(Profit) < 0
Sort by `total_profit` ASC (most negative first).
```
**Why HAVING?** Ask the model to include a 1-sentence explanation of why HAVING is used instead of WHERE here.

In [12]:
%%bigquery --project directed-bongo-471119-d1
SELECT
    sub_category,
    SUM(profit) AS total_profit
FROM
    `directed-bongo-471119-d1.lab1_helpme.global_superstore_sample_clean`
GROUP BY
    sub_category
HAVING
    SUM(profit) < 0
ORDER BY
    total_profit ASC

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,sub_category,total_profit
0,Tables,-17725.4811
1,Bookcases,-3472.556
2,Supplies,-1189.0995


`HAVING` is used instead of `WHERE` because the condition depends on an aggregate (`SUM(profit)`). `WHERE` is evaluated on individual rows before grouping, so it cannot reference aggregate results; `HAVING` is evaluated after `GROUP BY`.
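
To see the difference on a small scale, the sketch below uses Python's built-in `sqlite3` as a local stand-in for BigQuery. The four rows are made-up toy data, and SQLite's dialect differs from BigQuery's in other respects; only the `WHERE` versus `HAVING` behavior is being illustrated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sub_category TEXT, profit REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("Tables", -50.0), ("Tables", 20.0), ("Phones", 40.0), ("Phones", -5.0)],
)

# WHERE filters individual rows before grouping: each group keeps only its loss-making rows
where_rows = conn.execute(
    "SELECT sub_category, SUM(profit) FROM sales WHERE profit < 0 "
    "GROUP BY sub_category ORDER BY sub_category"
).fetchall()

# HAVING filters groups after aggregation: only groups whose total is negative remain
having_rows = conn.execute(
    "SELECT sub_category, SUM(profit) FROM sales "
    "GROUP BY sub_category HAVING SUM(profit) < 0 ORDER BY sub_category"
).fetchall()

print(where_rows)
print(having_rows)
```

With `WHERE`, Phones appears even though its overall profit is positive; with `HAVING`, only Tables remains.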

## Part C — Joins (dimension enrichment)
**Aim:** Use joins to enhance facts with attributes.

### C1. Join facts to a small dimension
*(If you have a customer or product dimension in your schema, use it. Otherwise, request a synthetic example.)*  
**Prompt:**
```
BigQuery SQL only.
Task: Join the sales table to a product dimension to report `Product_ID`, `Product_Name`, and total sales.
Tables: `[YOUR_PROJECT].superstore_data.sales` as s, `[YOUR_PROJECT].superstore_data.products` as p
Join key: `s.Product_ID = p.Product_ID`
Output: `Product_ID`, `Product_Name`, `total_sales`
Sort by `total_sales` DESC
```
**If you lack a dimension table:** Ask the model how to simulate one temporarily via a CTE.
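
One way a dimension might be simulated is to derive it from the fact table inside a CTE. The sketch below is not the query used in this notebook; it runs against toy rows in Python's built-in `sqlite3` as a local stand-in for BigQuery, and the table and product values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product_id TEXT, product_name TEXT, sales REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("P1", "Widget", 10.0), ("P1", "Widget", 5.0), ("P2", "Gadget", 7.5)],
)

sql = """
WITH products AS (
    -- Simulated dimension: exactly one row per product_id
    SELECT product_id, MAX(product_name) AS product_name
    FROM sales
    GROUP BY product_id
)
SELECT s.product_id, p.product_name, SUM(s.sales) AS total_sales
FROM sales AS s
JOIN products AS p ON s.product_id = p.product_id
GROUP BY s.product_id, p.product_name
ORDER BY total_sales DESC
"""
rows = conn.execute(sql).fetchall()
print(rows)
```

Aggregating the name with `MAX` guarantees one dimension row per key, so the join cannot duplicate fact rows and inflate `total_sales`.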

In [13]:
%%bigquery --project directed-bongo-471119-d1

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,product_id,product_name,total_sales
0,TEC-CO-10004722,Canon imageCLASS 2200 Advanced Copier,61599.824
1,OFF-BI-10003527,Fellowes PB500 Electric Punch Plastic Comb Bin...,27453.384
2,TEC-MA-10002412,Cisco TelePresence System EX90 Videoconferenci...,22638.480
3,FUR-CH-10002024,HON 5400 Series Task Chairs for Big and Tall,21870.576
4,OFF-BI-10001359,GBC DocuBind TL300 Electric Binding System,19823.479
...,...,...,...
1889,OFF-AR-10003986,Avery Hi-Liter Pen Style Six-Color Fluorescent...,7.700
1890,OFF-EN-10001535,Grip Seal Envelopes,7.072
1891,OFF-PA-10000048,Xerox 20,6.480
1892,OFF-LA-10003388,Avery 5,5.760


## Part D — Common Table Expressions (CTEs)
**Aim:** Make complex logic readable and testable in steps.

### D1. Multi‑step ranking with CTEs
**Prompt:**
```
BigQuery SQL only.
Goal: Within each `Region`, rank states by total sales and return top 3 per region.
Table: `[YOUR_PROJECT].superstore_data.sales`
CTE 1 (`state_sales`): SUM(Sales) by `Region`, `State`
CTE 2 (`ranked_state_sales`): Add `RANK() OVER (PARTITION BY Region ORDER BY total_sales DESC)` as `sales_rank`
Final SELECT: rows where `sales_rank <= 3`
Output columns: `Region`, `State`, `total_sales`, `sales_rank`
Sort: by `Region`, then `sales_rank`
```
**Ask for**: a one-paragraph explanation of each step, then **provide only the final runnable SQL**.

In [14]:
%%bigquery --project directed-bongo-471119-d1

WITH state_sales AS (
    SELECT
        region,
        state,
        SUM(sales) AS total_sales
    FROM
        `directed-bongo-471119-d1.lab1_helpme.global_superstore_sample_clean`
    GROUP BY
        region,
        state
),
ranked_state_sales AS (
    SELECT
        region,
        state,
        total_sales,
        RANK() OVER (PARTITION BY region ORDER BY total_sales DESC) AS sales_rank
    FROM
        state_sales
)
SELECT
    region,
    state,
    total_sales,
    sales_rank
FROM
    ranked_state_sales
WHERE
    sales_rank <= 3
ORDER BY
    region,
    sales_rank

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,region,state,total_sales,sales_rank
0,Central,Texas,170188.0458,1
1,Central,Illinois,80166.101,2
2,Central,Michigan,76269.614,3
3,East,New York,310876.271,1
4,East,Pennsylvania,116511.914,2
5,East,Ohio,78258.136,3
6,South,Florida,89473.708,1
7,South,Virginia,70636.72,2
8,South,North Carolina,55603.164,3
9,West,California,457687.6315,1


### D2. Time‑boxed “most improved” analysis
**Prompt:**
```
BigQuery SQL only.
Goal: Identify the top 5 sub-categories with the largest YoY revenue increase from 2023 to 2024.
Table: `[YOUR_PROJECT].superstore_data.sales`
CTE `yr_sales`: SUM(Sales) by `Sub_Category` and `year` extracted from `Order_Date`
Final: pivot or self-join to compute delta (2024 minus 2023) as `yoy_delta`
Output: `Sub_Category`, `sales_2023`, `sales_2024`, `yoy_delta`
Order by `yoy_delta` DESC
Limit 5
```
**Validation:** Ask the model for two quick failure modes (e.g., missing years) and how to handle them.
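
The missing-year failure mode can be illustrated in plain Python. This is a sketch on made-up numbers: the function `yoy_delta` and the sub-categories `A`, `B`, and `C` are hypothetical. It skips any sub-category that lacks one of the two years, which mirrors an inner join between the yearly totals.

```python
def yoy_delta(yearly_sales, prev_year, curr_year):
    """Return {sub_category: (prev, curr, delta)} for sub-categories present in both years."""
    subcats = {sub for sub, _ in yearly_sales}
    result = {}
    for sub in sorted(subcats):
        prev = yearly_sales.get((sub, prev_year))
        curr = yearly_sales.get((sub, curr_year))
        if prev is None or curr is None:
            continue  # a year is missing, so no comparison is possible
        result[sub] = (prev, curr, curr - prev)
    return result


# Toy data keyed by (sub_category, year)
yearly_sales = {
    ("A", 2023): 100.0, ("A", 2024): 150.0,
    ("B", 2023): 80.0,                       # no 2024 sales
    ("C", 2024): 30.0,                       # no 2023 sales
}
deltas = yoy_delta(yearly_sales, 2023, 2024)
print(deltas)
```

Whether to drop such sub-categories or treat the missing year as zero is a business decision, so it is worth stating the choice in the prompt.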

In [54]:
%%bigquery --project directed-bongo-471119-d1

WITH yearly_sales AS (
    SELECT
        sub_category,
        EXTRACT(YEAR FROM order_date) AS sales_year,
        SUM(sales) AS yearly_revenue
    FROM
        `directed-bongo-471119-d1.lab1_helpme.global_superstore_sample_clean`
    GROUP BY
        sub_category,
        sales_year
),
ranked_yearly_sales AS (
    SELECT
        sub_category,
        sales_year,
        yearly_revenue,
        RANK() OVER (PARTITION BY sub_category ORDER BY sales_year DESC) as year_rank
    FROM yearly_sales
),
pivoted_sales AS (
    SELECT
        sub_category,
        MAX(CASE WHEN year_rank = 1 THEN yearly_revenue ELSE NULL END) AS sales_most_recent_year,
        MAX(CASE WHEN year_rank = 2 THEN yearly_revenue ELSE NULL END) AS sales_second_most_recent_year
    FROM ranked_yearly_sales
    WHERE year_rank <= 2
    GROUP BY sub_category
)
SELECT
    sub_category,
    sales_second_most_recent_year AS sales_year_1,
    sales_most_recent_year AS sales_year_2,
    (sales_most_recent_year - sales_second_most_recent_year) AS change
FROM pivoted_sales
WHERE sales_second_most_recent_year IS NOT NULL AND sales_most_recent_year IS NOT NULL
ORDER BY change DESC
LIMIT 5;

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,sub_category,sales_year_1,sales_year_2,change
0,Phones,78962.03,105340.516,26378.486
1,Binders,49683.325,72788.045,23104.72
2,Accessories,41895.854,59946.232,18050.378
3,Appliances,26050.315,42926.932,16876.617
4,Copiers,49599.41,62899.388,13299.978


## Part E — Window Functions (ROW_NUMBER, RANK, DENSE_RANK, LAG/LEAD, moving averages)
**Aim:** Compare rows across partitions and time; compute trends and ranks without collapsing rows.

### E1. Top product per region (ROW_NUMBER)
**Prompt:**
```
BigQuery SQL only.
Task: For each `Region`, return only the single highest-revenue `Sub_Category`.
Table: `[YOUR_PROJECT].superstore_data.sales`
CTE `subcat_sales`: SUM(Sales) by `Region`, `Sub_Category`
Add `ROW_NUMBER() OVER (PARTITION BY Region ORDER BY total_sales DESC)` as rn
Final: filter `rn = 1`
Output: `Region`, `Sub_Category`, `total_sales`
Sort by `Region`
```
**Why `ROW_NUMBER` instead of `RANK`?** Ask the model to add a 2-sentence contrast.

In [17]:
%%bigquery --project directed-bongo-471119-d1

WITH subcat_sales AS (
    SELECT
        region,
        sub_category,
        SUM(sales) AS total_sales
    FROM
        `directed-bongo-471119-d1.lab1_helpme.global_superstore_sample_clean`
    GROUP BY
        region,
        sub_category
),
ranked_subcat_sales AS (
    SELECT
        region,
        sub_category,
        total_sales,
        ROW_NUMBER() OVER (PARTITION BY region ORDER BY total_sales DESC) AS rn
    FROM
        subcat_sales
)
SELECT
    region,
    sub_category,
    total_sales
FROM
    ranked_subcat_sales
WHERE
    rn = 1
ORDER BY
    region

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,region,sub_category,total_sales
0,Central,Chairs,85230.646
1,East,Phones,100614.982
2,South,Phones,58304.438
3,West,Chairs,101781.328


`ROW_NUMBER()` assigns a unique sequential integer to each row within its partition, even when rows tie on the ordering column, so filtering on `rn = 1` always returns exactly one row per region (which tied row wins is arbitrary unless the `ORDER BY` includes a tiebreaker). `RANK()` assigns the same rank to tied rows and then skips the following rank numbers, so filtering on rank 1 can return several rows per region when there is a tie.
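
The three numbering schemes can be compared side by side in plain Python. The function `window_ranks` below is an illustrative re-implementation for a single partition ordered descending, run on made-up values; it is not how BigQuery computes them internally.

```python
def window_ranks(values):
    """Return (value, row_number, rank, dense_rank) tuples for values ordered descending."""
    ordered = sorted(values, reverse=True)
    out = []
    rank = dense = 0
    prev = None
    for row_number, v in enumerate(ordered, start=1):
        if v != prev:
            rank = row_number  # RANK jumps to the current position after a tie
            dense += 1         # DENSE_RANK advances by exactly one
            prev = v
        out.append((v, row_number, rank, dense))
    return out


ranked = window_ranks([500, 300, 500, 100])
for row in ranked:
    print(row)
```

The two tied values share rank 1 under `RANK` and `DENSE_RANK` but receive row numbers 1 and 2; the next value gets rank 3 under `RANK` and 2 under `DENSE_RANK`.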

### E2. YoY growth with LAG
**Prompt:**
```
BigQuery SQL only.
Task: Compute year-over-year revenue growth for 'Phones' sub-category.
Table: `[YOUR_PROJECT].superstore_data.sales`
Steps:
- Filter to `Sub_Category = 'Phones'`
- Aggregate yearly revenue using EXTRACT(YEAR FROM Order_Date)
- Add `LAG(yearly_revenue) OVER (ORDER BY year)` as `prev_revenue`
- Compute `yoy_pct = 100.0 * (yearly_revenue - prev_revenue) / prev_revenue`
Output: `year`, `yearly_revenue`, `prev_revenue`, `yoy_pct`
Sort by `year` ASC
```
**Ask for**: a guard against divide-by-zero or NULL previous year.

In [18]:
%%bigquery --project directed-bongo-471119-d1

WITH yearly_phone_sales AS (
    SELECT
        EXTRACT(YEAR FROM order_date) AS sales_year,
        SUM(sales) AS yearly_revenue
    FROM
        `directed-bongo-471119-d1.lab1_helpme.global_superstore_sample_clean`
    WHERE
        sub_category = 'Phones'
    GROUP BY
        sales_year
),
lagged_sales AS (
    SELECT
        sales_year,
        yearly_revenue,
        LAG(yearly_revenue) OVER (ORDER BY sales_year ASC) AS prev_revenue
    FROM
        yearly_phone_sales
)
SELECT
    sales_year AS year,
    yearly_revenue,
    prev_revenue,
    SAFE_DIVIDE((yearly_revenue - prev_revenue), prev_revenue) * 100.0 AS yoy_pct
FROM
    lagged_sales
ORDER BY
    year ASC

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,year,yearly_revenue,prev_revenue,yoy_pct
0,2014,77390.806,,
1,2015,68313.702,77390.806,-11.728918
2,2016,78962.03,68313.702,15.587397
3,2017,105340.516,78962.03,33.406545
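
The `yoy_pct` column can be reproduced outside BigQuery as a cross-check. In this sketch, `safe_divide` is a hypothetical Python helper that mimics `SAFE_DIVIDE` by returning `None` for a zero or missing denominator, and the yearly revenue values are copied from the output above.

```python
def safe_divide(numerator, denominator):
    """Mimic SAFE_DIVIDE: return None instead of raising on None or zero denominators."""
    if numerator is None or denominator is None or denominator == 0:
        return None
    return numerator / denominator


# Yearly 'Phones' revenue copied from the output above
yearly_revenue = {2014: 77390.806, 2015: 68313.702, 2016: 78962.03, 2017: 105340.516}

yoy_pct = {}
prev = None  # plays the role of LAG(yearly_revenue)
for year in sorted(yearly_revenue):
    curr = yearly_revenue[year]
    ratio = safe_divide(curr - prev, prev) if prev is not None else None
    yoy_pct[year] = None if ratio is None else round(ratio * 100.0, 6)
    prev = curr

print(yoy_pct)
```

The first year has no predecessor, so its growth stays empty, exactly as in the query output.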


### E3. 3‑month moving average (MA)
**Prompt:**
```
BigQuery SQL only.
Task: For the 'Corporate' segment, compute a 3-month moving average of monthly revenue.
Table: `[YOUR_PROJECT].superstore_data.sales`
Steps:
- Derive `month` via DATE_TRUNC(Order_Date, MONTH)
- SUM(Sales) per `month`
- Add `AVG(monthly_revenue) OVER (ORDER BY month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)` as `ma_3`
Output: `month`, `monthly_revenue`, `ma_3`
Sort by `month` ASC
```
**Tip:** Ask the model to include a 1‑line cost control note (e.g., restrict date range while iterating).

In [51]:
%%bigquery --project directed-bongo-471119-d1

WITH monthly_revenue AS (
  SELECT
    DATE_TRUNC(order_date, MONTH) AS month,
    SUM(sales) AS monthly_revenue
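  FROM `directed-bongo-471119-d1.lab1_helpme.global_superstore_sample_clean`
  -- Note: the prompt asks for the 'Corporate' segment, but no segment filter is applied here,
  -- so the output covers all segments. Add WHERE segment = 'Corporate' to match the prompt.
  GROUP BY month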
)
SELECT
  FORMAT_DATE('%Y-%m', mr.month) AS month,
  mr.monthly_revenue,
  ROUND(
    AVG(mr.monthly_revenue) OVER (
      ORDER BY mr.month
      ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ), 2
  ) AS moving_avg_3m
FROM monthly_revenue mr
ORDER BY mr.month ASC


Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,month,monthly_revenue,moving_avg_3m
0,2014-01,14236.895,14236.9
1,2014-02,4519.892,9378.39
2,2014-03,55691.009,24815.93
3,2014-04,28295.345,29502.08
4,2014-05,23648.287,35878.21
5,2014-06,34595.1276,28846.25
6,2014-07,33946.393,30729.94
7,2014-08,27909.4685,32150.33
8,2014-09,81777.3508,47877.74
9,2014-10,31453.393,47046.74
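
The window frame `ROWS BETWEEN 2 PRECEDING AND CURRENT ROW` can be checked by hand. The sketch below re-implements it in plain Python with a hypothetical helper, `trailing_average`, applied to the first four monthly values copied from the output above.

```python
def trailing_average(values, window=3):
    """Average each value with up to window-1 preceding values, rounded to 2 decimals."""
    out = []
    for i in range(len(values)):
        # The frame is shorter than `window` for the first rows
        frame = values[max(0, i - window + 1): i + 1]
        out.append(round(sum(frame) / len(frame), 2))
    return out


# First four monthly revenue values copied from the output above
monthly = [14236.895, 4519.892, 55691.009, 28295.345]
averages = trailing_average(monthly)
print(averages)
```

The first two rows average over one and two months respectively, so they are not true 3-month averages; exclude them if you need only full windows.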


## Part F — Debugging & Optimization Prompts
**Aim:** Use the model as a rubber duck for error handling and performance.

### F1. Explain the error, propose a fix
**Prompt:**
```
I ran this BigQuery SQL and got an error:
WITH monthly_revenue AS (
  SELECT
    DATE_TRUNC(order_date, MONTH) AS month,
    SUM(sales) AS monthly_revenue
  FROM `directed-bongo-471119-d1.lab1_helpme.global_superstore_sample_clean`
  GROUP BY month
)
SELECT
  FORMAT_DATE('%Y-%m', month) AS month,
  monthly_revenue,
  ROUND(
    AVG(monthly_revenue) OVER (
      ORDER BY month
      ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ), 2
  ) AS moving_avg_3m
FROM monthly_revenue
ORDER BY month ASC;

ERROR: 400 No matching signature for aggregate function AVG Argument types: STRUCT<month DATE, monthly_revenue FLOAT64> Signature: AVG(INT64) Argument 1: Unable to coerce type STRUCT<month DATE, monthly_revenue FLOAT64> to expected type INT64 Signature: AVG(FLOAT64) Argument 1: Unable to coerce type STRUCT<month DATE, monthly_revenue FLOAT64> to expected type FLOAT64 Signature: AVG(NUMERIC) Argument 1: Unable to coerce type STRUCT<month DATE, monthly_revenue FLOAT64> to expected type NUMERIC Signature: AVG(BIGNUMERIC) Argument 1: Unable to coerce type STRUCT<month DATE, monthly_revenue FLOAT64> to expected type BIGNUMERIC Signature: AVG(INTERVAL) Argument 1: Unable to coerce type STRUCT<month DATE, monthly_revenue FLOAT64> to expected type INTERVAL at [12:5]; reason: invalidQuery, location: query, message: No matching signature for aggregate function AVG Argument types: STRUCT<month DATE, monthly_revenue FLOAT64> Signature: AVG(INT64) Argument 1: Unable to coerce type STRUCT<month DATE, monthly_revenue FLOAT64> to expected type INT64 Signature: AVG(FLOAT64) Argument 1: Unable to coerce type STRUCT<month DATE, monthly_revenue FLOAT64> to expected type FLOAT64 Signature: AVG(NUMERIC) Argument 1: Unable to coerce type STRUCT<month DATE, monthly_revenue FLOAT64> to expected type NUMERIC Signature: AVG(BIGNUMERIC) Argument 1: Unable to coerce type STRUCT<month DATE, monthly_revenue FLOAT64> to expected type BIGNUMERIC Signature: AVG(INTERVAL) Argument 1: Unable to coerce type STRUCT<month DATE, monthly_revenue FLOAT64> to expected type INTERVAL at [12:5]

Act as a BigQuery trouble‑shooter.
1) Identify the root cause.
2) Propose the smallest possible fix.
3) Suggest a quick sanity check query to verify the fix.
Return only the corrected SQL and a 2‑sentence rationale.
```

In [None]:
"""
WITH monthly_revenue AS (
  SELECT
    DATE_TRUNC(order_date, MONTH) AS month,
    SUM(sales) AS monthly_revenue
  FROM `directed-bongo-471119-d1.lab1_helpme.global_superstore_sample_clean`
  GROUP BY month
)
SELECT
  FORMAT_DATE('%Y-%m', mr.month) AS month,
  mr.monthly_revenue,
  ROUND(
    AVG(mr.monthly_revenue) OVER (
      ORDER BY mr.month
      ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ), 2
  ) AS moving_avg_3m
FROM monthly_revenue mr
ORDER BY mr.month ASC;
"""
# Root cause: the CTE and its revenue column are both named monthly_revenue, so
# the unqualified name inside AVG() resolved to the whole row (a STRUCT) instead
# of the numeric column. The fix was to alias the CTE as mr and reference
# mr.monthly_revenue inside AVG() so BigQuery reads the FLOAT64 field.
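
The F1 prompt also asks for a quick sanity check. One option, sketched here in plain Python rather than SQL, is to recompute the trailing average by hand for a few months and compare it with the `moving_avg_3m` column. The function name `trailing_moving_avg` and the sample figures are made up for illustration; they are not from the dataset.

```python
from typing import List


def trailing_moving_avg(values: List[float], window: int = 3) -> List[float]:
    """Mirror AVG(x) OVER (ORDER BY month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW).

    `values` must already be sorted by month. The first rows average over
    however many rows exist so far, which is how a ROWS frame behaves.
    """
    result = []
    for i in range(len(values)):
        frame = values[max(0, i - window + 1): i + 1]
        result.append(round(sum(frame) / len(frame), 2))
    return result


# Made-up monthly revenue figures, for illustration only.
sample_revenue = [100.0, 200.0, 300.0, 400.0]
print(trailing_moving_avg(sample_revenue))  # [100.0, 150.0, 200.0, 300.0]
```

Because the frame counts rows rather than calendar months, a month with no orders would make the "3-month" average span a longer period, both here and in the SQL.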

### F2. Reduce cost / improve speed
**Prompt:**
```
Act as a BigQuery cost optimizer.
Given this query (below), list 3 ways to reduce scanned bytes and improve performance without changing the business logic.

WITH monthly_revenue AS (
  SELECT
    DATE_TRUNC(order_date, MONTH) AS month,
    SUM(sales) AS monthly_revenue
  FROM `directed-bongo-471119-d1.lab1_helpme.global_superstore_sample_clean`
  GROUP BY month
)
SELECT
  FORMAT_DATE('%Y-%m', mr.month) AS month,
  mr.monthly_revenue,
  ROUND(
    AVG(mr.monthly_revenue) OVER (
      ORDER BY mr.month
      ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ), 2
  ) AS moving_avg_3m
FROM monthly_revenue mr
ORDER BY mr.month ASC

Prioritize: partition filters, column pruning, pre-aggregations, and temporary results via CTEs.
```

**Apply Partition Filters:**
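Currently, the query scans every row in the base table. If the table is partitioned on `order_date` and you only need recent months or a specific range, add a `WHERE` filter on `order_date` inside the CTE so BigQuery can skip partitions outside that range.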

**Column Pruning:**
The base table likely has many columns (customer, region, product, etc.). BigQuery stores data by column and bills for the columns a query references, so keep the CTE limited to `order_date` and `sales`, as it is now, and avoid `SELECT *`.

**Use Pre-Aggregations or Temporary Results:**
If you run this query often, you could store pre-aggregated monthly revenue in a materialized view or a separate summary table keyed by month. Then the query would only scan the pre-aggregated dataset (far smaller), instead of recalculating sums from raw order_date and sales each time.
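
To check whether a suggestion actually reduces scanned bytes, a dry run can report the estimate without executing the query. The following is a minimal sketch, assuming the `google-cloud-bigquery` package is installed and credentials are configured. The helper names `build_monthly_revenue_sql` and `estimate_scanned_bytes` are hypothetical, and the date filter only reduces bytes if the table is partitioned (or clustered) on `order_date`, which this notebook does not confirm.

```python
import datetime


def build_monthly_revenue_sql(table: str) -> str:
    """Return the monthly-revenue query with a parameterized order_date filter."""
    return f"""
    SELECT
      DATE_TRUNC(order_date, MONTH) AS month,
      SUM(sales) AS monthly_revenue
    FROM `{table}`
    WHERE order_date BETWEEN @start_date AND @end_date
    GROUP BY month
    ORDER BY month
    """


def estimate_scanned_bytes(
    sql: str, start_date: datetime.date, end_date: datetime.date
) -> int:
    """Dry-run the query and return the bytes BigQuery estimates it would scan."""
    # Imported lazily so the SQL builder works without the client installed.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(
        dry_run=True,           # validate and estimate only; nothing is executed
        use_query_cache=False,  # avoid a cached result hiding the real cost
        query_parameters=[
            bigquery.ScalarQueryParameter("start_date", "DATE", start_date),
            bigquery.ScalarQueryParameter("end_date", "DATE", end_date),
        ],
    )
    job = client.query(sql, job_config=job_config)
    return job.total_bytes_processed
```

Calling `estimate_scanned_bytes` on the query with and without the `WHERE` clause shows whether the filter changes the estimate for your table.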

## Part G — Validation & Counter‑examples (DIVE: Validate)
**Aim:** Avoid “first‑answer fallacy” by testing alternatives.

### G1. Ask for counter‑queries
**Prompt:**
```
I concluded that 'Tables' is a high‑sales but negative‑profit sub-category due to high discounts.
Create two alternative BigQuery SQL queries that could falsify or nuance this finding:
- One that slices by region and time
- One that controls for order priority or ship mode
Return BigQuery SQL only, then a one-paragraph note on how to compare outcomes.
```

In [None]:
%%bigquery --project directed-bongo-471119-d1

-- Query 1: Slice by region and time
SELECT
    region,
    FORMAT_DATE('%Y-%m', DATE_TRUNC(order_date, MONTH)) AS year_month,
    SUM(sales) AS monthly_sales,
    SUM(profit) AS monthly_profit,
    AVG(discount) AS average_discount
FROM
    `directed-bongo-471119-d1.lab1_helpme.global_superstore_sample_clean`
WHERE
    sub_category = 'Tables'
GROUP BY
    region,
    year_month
ORDER BY
    region,
    year_month;

In [None]:
%%bigquery --project directed-bongo-471119-d1

-- Query 2: Control for ship mode
SELECT
    ship_mode,
    SUM(sales) AS total_sales,
    SUM(profit) AS total_profit,
    AVG(discount) AS average_discount
FROM
    `directed-bongo-471119-d1.lab1_helpme.global_superstore_sample_clean`
WHERE
    sub_category = 'Tables'
GROUP BY
    ship_mode
ORDER BY
    total_profit ASC;

To compare outcomes: Run these two queries alongside your original query for 'Tables'.
- **Query 1 (Region & Time):** Look for specific regions or time periods where 'Tables' might actually be profitable or have lower discounts. This would suggest the problem isn't universal and might be concentrated in certain markets or seasonal.
- **Query 2 (Ship Mode):** See if certain ship modes are associated with significantly lower profits or higher discounts for 'Tables'. This could indicate logistical or pricing issues tied to how these large items are shipped.
If either of these queries shows scenarios where 'Tables' is profitable or discounts aren't unusually high under specific conditions, it would nuance or potentially falsify the initial conclusion that high discounts are the *sole* or primary driver of negative profit across the board.
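
The comparison can also be made concrete once the Query 1 results are loaded into Python. The sketch below splits the slices into profitable and unprofitable groups and compares their mean discount. The function name `summarize_slices` and the sample rows are made up for illustration and are not results from the dataset; the column names match the aliases in Query 1.

```python
from statistics import mean
from typing import Dict, List


def summarize_slices(
    rows: List[Dict],
    profit_key: str = "monthly_profit",
    discount_key: str = "average_discount",
) -> Dict:
    """Count profitable vs. unprofitable slices and report each group's mean discount."""
    profitable = [r[discount_key] for r in rows if r[profit_key] > 0]
    unprofitable = [r[discount_key] for r in rows if r[profit_key] <= 0]
    return {
        "profitable_slices": len(profitable),
        "unprofitable_slices": len(unprofitable),
        # Unweighted mean of per-slice averages; None when a group is empty.
        "mean_discount_profitable": round(mean(profitable), 4) if profitable else None,
        "mean_discount_unprofitable": round(mean(unprofitable), 4) if unprofitable else None,
    }


# Made-up rows shaped like the Query 1 output, for illustration only.
sample_rows = [
    {"region": "A", "year_month": "2017-01", "monthly_profit": 50.0, "average_discount": 0.1},
    {"region": "A", "year_month": "2017-02", "monthly_profit": -120.0, "average_discount": 0.4},
    {"region": "B", "year_month": "2017-01", "monthly_profit": -30.0, "average_discount": 0.3},
]
print(summarize_slices(sample_rows))
```

If profitable slices exist and their mean discount is close to that of the unprofitable ones, discounts alone are unlikely to explain the losses.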

## Part H — Synthesis (DIVE: Extend)
**Aim:** Turn analysis into business‑ready insights.

### H1. Executive‑style summary
**Prompt:**
```
Act as a business strategist.
Based on the following metrics/figures (briefly summarize your results here), write a 4-sentence executive summary:
- 1 sentence: what changed and by how much
- 1 sentence: why it likely changed (drivers)
- 1 sentence: recommended action (who/what/when)
- 1 sentence: metric to monitor next
```

While overall trends show growth in some sub-categories like 'Phones' (with a 33.4% YOY increase in 2017), other sub-categories, like 'Tables', continue to show significant negative profit (-$17,725.48 total profit). This likely stems from a combination of factors, including high discounts, inefficient shipping, and regional performance issues. Product management teams should investigate the 'Tables' sub-category, focusing on discount strategies and logistics in underperforming regions to improve profitability. They should monitor the profitability and average discount of 'Tables' by region and ship mode on a monthly basis to track the impact of any changes.

### H2. Convert final SQL into an automated job (optional)
**Prompt (use only after your SQL is final):**
```
Convert my final BigQuery SQL into a Python script that can run as a scheduled job from Colab or Cloud Functions.
Requirements:
- Use python‑bigquery client
- Parameterize date range
- Write results to a destination table `[YOUR_PROJECT].analytics.outputs_kpi`
- Add basic error handling & logging
Return one complete runnable script.
```
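
For reference when reviewing what the model returns, here is a minimal sketch of the shape such a script might take, assuming the `google-cloud-bigquery` package and configured credentials. The names `validate_date_range` and `run_kpi_job` are hypothetical. It also assumes the final SQL references `@start_date` and `@end_date`, and that overwriting the destination table on each run is acceptable.

```python
import datetime
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("kpi_job")  # logger name is arbitrary


def validate_date_range(start_date: datetime.date, end_date: datetime.date) -> None:
    """Fail fast on an inverted range before any BigQuery call is made."""
    if start_date > end_date:
        raise ValueError(f"start_date {start_date} is after end_date {end_date}")


def run_kpi_job(
    sql: str,
    destination_table: str,
    start_date: datetime.date,
    end_date: datetime.date,
) -> None:
    """Run the parameterized query and write its results to destination_table."""
    validate_date_range(start_date, end_date)

    # Imported lazily so the validation helper works without the client installed.
    from google.api_core.exceptions import GoogleAPIError
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(
        destination=destination_table,  # "project.dataset.table"
        # Assumption: each run replaces the table; use WRITE_APPEND to accumulate.
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
        query_parameters=[
            bigquery.ScalarQueryParameter("start_date", "DATE", start_date),
            bigquery.ScalarQueryParameter("end_date", "DATE", end_date),
        ],
    )
    try:
        client.query(sql, job_config=job_config).result()  # blocks until the job finishes
    except GoogleAPIError:
        logger.exception("KPI job failed for %s to %s", start_date, end_date)
        raise
    logger.info("Wrote results to %s", destination_table)
```

A scheduler would call `run_kpi_job` with your final SQL, your own destination table path, and the date range for that run.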

---
## Submission checklist
- [ ] Kept prompts precise and reproducible  
- [ ] Captured at least **one** CTE query and **one** window function query  
- [ ] Documented **two** validation attempts (counter‑queries or alternate slice)  
- [ ] Wrote a 4‑sentence executive summary based on results  
- [ ] (Optional) Converted final query into a scheduled job
---