# DA4 - Module 5 - Data Warehouses in the Cloud
## Notebook 3: Loading into BigQuery

---

## What is BigQuery?

**BigQuery** is Google Cloud's fully managed, serverless data warehouse.

Key things to know:
- It is designed for **analytical queries** (OLAP) on very large datasets
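- No infrastructure to manage - Google handles the servers, storage and scaling
- It uses **standard SQL** (Google's dialect is called GoogleSQL) - very close to the SQL you already know from SSMS, with a few differences in syntax and functions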
- It is organised as: **Project → Dataset → Tables** (similar to Server → Database → Tables in SSMS)
- Today we are using project `ingwane-da4-608` and dataset `superstore_dw`

| SSMS Concept | BigQuery Equivalent |
|---|---|
| SQL Server instance | GCP Project |
| Database | Dataset |
| Table | Table |
| SSMS interface | BigQuery Console (web) |
| SQL query | Standard SQL query |
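
The Project → Dataset → Tables hierarchy shows up directly in SQL: a table is addressed as `project.dataset.table`, wrapped in backticks. Below is a minimal sketch of building that reference in Python. `table_ref` is a hypothetical helper name used only for illustration, not part of any library.

```python
PROJECT_ID = 'ingwane-da4-608'   # project used in this notebook
DATASET_ID = 'superstore_dw'     # dataset used in this notebook

def table_ref(table_name, project=PROJECT_ID, dataset=DATASET_ID):
    """Return a backtick-quoted, fully qualified BigQuery table reference."""
    return f"`{project}.{dataset}.{table_name}`"

print(table_ref('dimCustomer'))
```

The backticks keep the whole reference together as one name, which matters here because the project ID contains hyphens.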

---
## Step 0: Rebuild the Data Warehouse from Notebook 2



In [None]:
# Rebuild everything from Notebook 2 in one block
# We need these DataFrames ready before we can load them into BigQuery

import pandas as pd

GCS_URL = "https://storage.googleapis.com/ingwane-da4/Superstore.csv"
staging = pd.read_csv(GCS_URL, encoding='latin1')
staging['Order Date'] = pd.to_datetime(staging['Order Date'], dayfirst=True)
staging['Ship Date']  = pd.to_datetime(staging['Ship Date'],  dayfirst=True)

# dimCustomer
dimCustomer = staging[['Customer ID', 'Customer Name', 'Segment']].drop_duplicates().reset_index(drop=True)
dimCustomer.insert(0, 'Customer_SK', range(1, len(dimCustomer) + 1))
dimCustomer.columns = ['Customer_SK', 'CustomerID', 'CustomerName', 'Segment']

# dimProduct
dimProduct = staging[['Product ID', 'Product Name', 'Category', 'Sub-Category']].drop_duplicates(subset=['Product ID']).reset_index(drop=True)
dimProduct.insert(0, 'Product_SK', range(1, len(dimProduct) + 1))
dimProduct.columns = ['Product_SK', 'ProductID', 'ProductName', 'Category', 'SubCategory']

# dimGeography
dimGeography = staging[['Country', 'City', 'State', 'Postal Code', 'Region']].drop_duplicates().reset_index(drop=True)
dimGeography.insert(0, 'Geog_SK', range(1, len(dimGeography) + 1))
dimGeography.columns = ['Geog_SK', 'Country', 'City', 'State', 'PostalCode', 'Region']

# dimDate
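# Cover both order and ship dates so every Ship Date can find a matching Date_SK
date_cols = staging[['Order Date', 'Ship Date']]
all_dates = pd.date_range(start=date_cols.min().min(), end=date_cols.max().max(), freq='D')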
dimDate = pd.DataFrame({'DateValue': all_dates})
dimDate['Day']       = dimDate['DateValue'].dt.day
dimDate['DayOfWeek'] = dimDate['DateValue'].dt.day_name()
dimDate['Week']      = dimDate['DateValue'].dt.isocalendar().week.astype(int)
dimDate['Month']     = dimDate['DateValue'].dt.month
dimDate['MonthName'] = dimDate['DateValue'].dt.month_name()
dimDate['Quarter']   = dimDate['DateValue'].dt.quarter
dimDate['Year']      = dimDate['DateValue'].dt.year
dimDate.insert(0, 'Date_SK', range(1, len(dimDate) + 1))

# factOrderItem
fact = staging.merge(dimCustomer[['CustomerID', 'Customer_SK']], left_on='Customer ID', right_on='CustomerID', how='left')
fact = fact.merge(dimProduct[['ProductID', 'Product_SK']], left_on='Product ID', right_on='ProductID', how='left')
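# Join on the full geography key - City and State alone can match several Geog_SK rows and duplicate fact rows
fact = fact.merge(dimGeography, left_on=['Country', 'City', 'State', 'Postal Code', 'Region'], right_on=['Country', 'City', 'State', 'PostalCode', 'Region'], how='left')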
fact = fact.merge(dimDate[['DateValue', 'Date_SK']].rename(columns={'Date_SK': 'OrderDate_SK'}), left_on='Order Date', right_on='DateValue', how='left')
fact = fact.merge(dimDate[['DateValue', 'Date_SK']].rename(columns={'Date_SK': 'ShipDate_SK'}), left_on='Ship Date', right_on='DateValue', how='left')
factOrderItem = fact[['Row ID', 'Order ID', 'Customer_SK', 'Product_SK', 'Geog_SK', 'OrderDate_SK', 'ShipDate_SK', 'Sales', 'Quantity', 'Discount', 'Profit']].copy()

print("✅ Data warehouse rebuilt successfully")
print(f"   staging       : {len(staging):,} rows")
print(f"   dimCustomer   : {len(dimCustomer):,} rows")
print(f"   dimProduct    : {len(dimProduct):,} rows")
print(f"   dimGeography  : {len(dimGeography):,} rows")
print(f"   dimDate       : {len(dimDate):,} rows")
print(f"   factOrderItem : {len(factOrderItem):,} rows")
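
A useful sanity check after a rebuild like this: every lookup merge should leave the number of fact rows unchanged. The sketch below uses made-up rows (not the Superstore data) to show how joining on only part of a dimension's key can multiply rows, and how the `validate` argument of `merge` can guard against it.

```python
import pandas as pd

# Made-up rows: one city with two postal codes, as can happen in real data
orders = pd.DataFrame({
    'Order ID': ['A-1', 'A-2'],
    'City': ['Springfield', 'Springfield'],
    'State': ['Illinois', 'Illinois'],
    'Postal Code': [62701, 62704],
})

# Build a small geography dimension with a surrogate key
geography = orders[['City', 'State', 'Postal Code']].drop_duplicates().reset_index(drop=True)
geography.insert(0, 'Geog_SK', range(1, len(geography) + 1))

# Joining on City and State alone matches each order to both geography rows
loose = orders.merge(geography[['City', 'State', 'Geog_SK']], on=['City', 'State'], how='left')

# Joining on the full key keeps one row per order;
# validate='many_to_one' raises an error if the dimension key is not unique
strict = orders.merge(geography, on=['City', 'State', 'Postal Code'],
                      how='left', validate='many_to_one')

print(f"orders: {len(orders)}  loose join: {len(loose)}  strict join: {len(strict)}")
```

In the notebook itself, the equivalent check is that `len(factOrderItem)` equals `len(staging)` and that none of the `_SK` columns contain missing values.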

---
## Step 1: Authenticate to Google Cloud



In [None]:
# Authenticate to Google Cloud
# A popup will appear - sign in with the Gmail address your trainer has added to the project

from google.colab import auth
auth.authenticate_user()

print("✅ Authentication complete")

---
## Step 2: Connect to BigQuery



In [None]:
# Connect to BigQuery

from google.cloud import bigquery

PROJECT_ID = 'ingwane-da4-608'
DATASET_ID = 'superstore_dw'

client = bigquery.Client(project=PROJECT_ID)

print("✅ Connected to BigQuery")
print(f"   Project : {PROJECT_ID}")
print(f"   Dataset : {DATASET_ID}")

In [None]:
# Verify the connection by listing existing datasets in the project

datasets = list(client.list_datasets())
print("Datasets in this project:")
for ds in datasets:
    print(f"  - {ds.dataset_id}")
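
Rather than reading the printed list by eye, the same `list_datasets()` call can confirm that the target dataset is present. This is a minimal sketch: `dataset_exists` is a hypothetical helper, and `FakeClient` is a stand-in for `bigquery.Client` so the example runs without credentials.

```python
from types import SimpleNamespace

def dataset_exists(client, dataset_id):
    """Return True if dataset_id appears in client.list_datasets()."""
    return any(ds.dataset_id == dataset_id for ds in client.list_datasets())

class FakeClient:
    """Stand-in for bigquery.Client (hypothetical, for illustration only)."""
    def list_datasets(self):
        return [SimpleNamespace(dataset_id='superstore_dw')]

print(dataset_exists(FakeClient(), 'superstore_dw'))    # True
print(dataset_exists(FakeClient(), 'missing_dataset'))  # False
```

With the real client from this step, the call would be `dataset_exists(client, DATASET_ID)`.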

---
## Step 3: Load Tables into BigQuery



In [None]:
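# Helper function to load a DataFrame into BigQuery

import pandas_gbq

def load_to_bigquery(df, table_name):
    """Load df into PROJECT_ID.DATASET_ID.table_name, replacing any existing table."""
    destination = f"{PROJECT_ID}.{DATASET_ID}.{table_name}"
    # DataFrame.to_gbq is deprecated since pandas 2.2; pandas_gbq.to_gbq takes the same arguments
    pandas_gbq.to_gbq(
        df,
        destination_table=f"{DATASET_ID}.{table_name}",
        project_id=PROJECT_ID,
        if_exists='replace',   # overwrite, so this cell can be re-run safely
        progress_bar=True
    )
    print(f"✅ {table_name} loaded - {len(df):,} rows → {destination}")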

print("Helper function ready")
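
Some of the DataFrames still carry the original CSV headers, such as `Order Date` and `Sub-Category`. BigQuery's classic naming rule for columns is letters, digits and underscores, not starting with a digit. Whether other names load unchanged depends on the loader and on BigQuery's flexible column name support, so the sketch below is a fallback to reach for if a load is rejected. `clean_column_name` is a hypothetical helper, not a library function.

```python
import re

def clean_column_name(name):
    """Replace anything other than letters, digits and underscores with '_'."""
    cleaned = re.sub(r'[^0-9A-Za-z_]+', '_', name.strip()).strip('_')
    # Column names should not start with a digit, so prefix an underscore if needed
    if cleaned and cleaned[0].isdigit():
        cleaned = '_' + cleaned
    return cleaned

for raw in ['Order Date', 'Sub-Category', 'Postal Code']:
    print(raw, '->', clean_column_name(raw))
```

It could be applied to a copy of a DataFrame before loading with `df.rename(columns=clean_column_name)`.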

In [None]:
# Load the staging table first
load_to_bigquery(staging, 'SuperstoreStaging')

In [None]:
# Load the dimension tables
load_to_bigquery(dimCustomer,  'dimCustomer')
load_to_bigquery(dimProduct,   'dimProduct')
load_to_bigquery(dimGeography, 'dimGeography')
load_to_bigquery(dimDate,      'dimDate')

In [None]:
# Load the fact table
load_to_bigquery(factOrderItem, 'factOrderItem')
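
Once the loads finish, it is worth confirming that BigQuery holds the same number of rows as the DataFrames. The sketch below relies on `client.get_table(...)` and the table's `num_rows` attribute. `check_row_counts` is a hypothetical helper, and `FakeClient` with its made-up counts stands in for the real client so the example runs offline.

```python
from types import SimpleNamespace

def check_row_counts(client, expected, project, dataset):
    """Compare expected row counts with table.num_rows reported by BigQuery.

    expected: dict mapping table name -> expected number of rows.
    Returns a dict of table name -> (expected, actual) for mismatches only.
    """
    mismatches = {}
    for table_name, expected_rows in expected.items():
        table = client.get_table(f"{project}.{dataset}.{table_name}")
        if table.num_rows != expected_rows:
            mismatches[table_name] = (expected_rows, table.num_rows)
    return mismatches

class FakeClient:
    """Stand-in for bigquery.Client with made-up counts (hypothetical)."""
    _rows = {'p.d.dimCustomer': 3, 'p.d.factOrderItem': 12}

    def get_table(self, table_id):
        return SimpleNamespace(num_rows=self._rows[table_id])

problems = check_row_counts(FakeClient(), {'dimCustomer': 3, 'factOrderItem': 10}, 'p', 'd')
print(problems)   # {'factOrderItem': (10, 12)}
```

With the real client, the expected counts would come from the DataFrames, for example `{'dimCustomer': len(dimCustomer), 'factOrderItem': len(factOrderItem)}` together with `PROJECT_ID` and `DATASET_ID`.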

---
## Step 4: Query BigQuery from Colab



In [None]:
# Query BigQuery using SQL - this runs against the cloud, not our pandas DataFrames
# Note the fully qualified table name: project.dataset.table

sql = f"""
SELECT 
    COUNT(*) AS total_rows,
    ROUND(SUM(Sales), 2) AS total_sales,
    ROUND(SUM(Profit), 2) AS total_profit
FROM `{PROJECT_ID}.{DATASET_ID}.factOrderItem`
"""

result = client.query(sql).to_dataframe()
print("Query result from BigQuery:")
result

In [None]:
# A more interesting query - Sales by Category using a JOIN
# Same JOIN logic as SSMS - just different syntax for the table name

sql = f"""
SELECT 
    p.Category,
    COUNT(*) AS order_lines,
    ROUND(SUM(f.Sales), 2) AS total_sales,
    ROUND(SUM(f.Profit), 2) AS total_profit
FROM `{PROJECT_ID}.{DATASET_ID}.factOrderItem` f
JOIN `{PROJECT_ID}.{DATASET_ID}.dimProduct` p
    ON f.Product_SK = p.Product_SK
GROUP BY p.Category
ORDER BY total_sales DESC
"""

result = client.query(sql).to_dataframe()
print("Sales by Category:")
result
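
The JOIN and GROUP BY pattern above can be rehearsed locally before running it in the cloud. This sketch swaps BigQuery for Python's built-in `sqlite3` module and uses a few made-up rows, so the numbers are illustrative only. The table names are unqualified because SQLite has no project or dataset level.

```python
import sqlite3

conn = sqlite3.connect(':memory:')

# A tiny star schema with made-up data: one dimension, one fact table
conn.executescript("""
CREATE TABLE dimProduct (Product_SK INTEGER PRIMARY KEY, Category TEXT);
CREATE TABLE factOrderItem (Product_SK INTEGER, Sales REAL, Profit REAL);
INSERT INTO dimProduct VALUES (1, 'Furniture'), (2, 'Technology');
INSERT INTO factOrderItem VALUES
    (1, 100.0, 10.0),
    (1,  50.0, -5.0),
    (2, 300.0, 60.0);
""")

# Same shape as the Sales by Category query: join fact to dimension, then aggregate
rows = conn.execute("""
SELECT
    p.Category,
    COUNT(*) AS order_lines,
    ROUND(SUM(f.Sales), 2) AS total_sales,
    ROUND(SUM(f.Profit), 2) AS total_profit
FROM factOrderItem f
JOIN dimProduct p
    ON f.Product_SK = p.Product_SK
GROUP BY p.Category
ORDER BY total_sales DESC
""").fetchall()

for row in rows:
    print(row)
# ('Technology', 1, 300.0, 60.0)
# ('Furniture', 2, 150.0, 5.0)

conn.close()
```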

In [None]:
# Sales by Year - joining to dimDate

sql = f"""
SELECT 
    d.Year,
    COUNT(*) AS order_lines,
    ROUND(SUM(f.Sales), 2) AS total_sales
FROM `{PROJECT_ID}.{DATASET_ID}.factOrderItem` f
JOIN `{PROJECT_ID}.{DATASET_ID}.dimDate` d
    ON f.OrderDate_SK = d.Date_SK
GROUP BY d.Year
ORDER BY d.Year
"""

result = client.query(sql).to_dataframe()
print("Sales by Year:")
result

---
## 📝 TASK 1 - Query Your Data Warehouse


Write SQL queries against your BigQuery data warehouse to answer the following:

1. How many customers are in **dimCustomer**?
2. What are the **total Sales and Profit** for each **Region**? (Hint: join to dimGeography)
3. Who are the **top 5 customers** by total Sales?
4. **Stretch:** What is the best performing **Sub-Category** by Profit in each Region?

In [None]:
# TASK 1 - Question 1: How many customers are in dimCustomer?

sql = """

"""

# result = client.query(sql).to_dataframe()
# result

In [None]:
# TASK 1 - Question 2: Total Sales and Profit by Region

sql = """

"""

# result = client.query(sql).to_dataframe()
# result

In [None]:
# TASK 1 - Question 3: Top 5 customers by total Sales

sql = """

"""

# result = client.query(sql).to_dataframe()
# result

In [None]:
# TASK 1 - Question 4 (Stretch): Best Sub-Category by Profit in each Region

sql = """

"""

# result = client.query(sql).to_dataframe()
# result

### ✅ TASK 1 SOLUTIONS
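
One possible set of answers, offered as a sketch rather than the official solutions. It assumes the table and column names produced in Step 0 (`Geog_SK`, `Customer_SK`, `SubCategory` and so on). The variable names `DW` and `sql_q1` to `sql_q4` are illustrative.

```python
PROJECT_ID = 'ingwane-da4-608'
DATASET_ID = 'superstore_dw'
DW = f"{PROJECT_ID}.{DATASET_ID}"   # shared prefix for every table reference

# Question 1: how many customers are in dimCustomer?
sql_q1 = f"""
SELECT COUNT(*) AS customer_count
FROM `{DW}.dimCustomer`
"""

# Question 2: total Sales and Profit by Region
sql_q2 = f"""
SELECT
    g.Region,
    ROUND(SUM(f.Sales), 2) AS total_sales,
    ROUND(SUM(f.Profit), 2) AS total_profit
FROM `{DW}.factOrderItem` f
JOIN `{DW}.dimGeography` g
    ON f.Geog_SK = g.Geog_SK
GROUP BY g.Region
ORDER BY total_sales DESC
"""

# Question 3: top 5 customers by total Sales
sql_q3 = f"""
SELECT
    c.CustomerID,
    c.CustomerName,
    ROUND(SUM(f.Sales), 2) AS total_sales
FROM `{DW}.factOrderItem` f
JOIN `{DW}.dimCustomer` c
    ON f.Customer_SK = c.Customer_SK
GROUP BY c.CustomerID, c.CustomerName
ORDER BY total_sales DESC
LIMIT 5
"""

# Question 4 (Stretch): rank Sub-Categories by Profit within each Region, keep rank 1
sql_q4 = f"""
WITH ranked AS (
    SELECT
        g.Region,
        p.SubCategory,
        ROUND(SUM(f.Profit), 2) AS total_profit,
        ROW_NUMBER() OVER (PARTITION BY g.Region ORDER BY SUM(f.Profit) DESC) AS rn
    FROM `{DW}.factOrderItem` f
    JOIN `{DW}.dimProduct` p
        ON f.Product_SK = p.Product_SK
    JOIN `{DW}.dimGeography` g
        ON f.Geog_SK = g.Geog_SK
    GROUP BY g.Region, p.SubCategory
)
SELECT Region, SubCategory, total_profit
FROM ranked
WHERE rn = 1
ORDER BY Region
"""

# Run any of them the same way as in Step 4:
# result = client.query(sql_q1).to_dataframe()
# result
```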



---
## ☕ BREAK - 15 minutes



---
## Step 5: Explore the BigQuery Console


Open a new browser tab and go to:

**https://console.cloud.google.com/bigquery**

You should be able to see:
- Project `ingwane-da4-608` in the left panel
- Dataset `superstore_dw`
- All the tables you loaded from Colab

Try running a query directly in the BigQuery Console - no Python needed!

---
## 🔁 Day Recap


Today you have:

| What you did | Tool used |
|---|---|
| Loaded raw data from cloud storage | Google Colab + pandas |
| Explored a staging table | pandas |
| Built dimension tables | pandas |
| Built a fact table using joins | pandas |
| Validated your data warehouse | pandas |
| Authenticated to Google Cloud | Google Colab auth |
| Loaded tables into a cloud data warehouse | BigQuery |
| Queried a real data warehouse using SQL | BigQuery |

**The full ETL pipeline:**

```
Google Cloud Storage  →  Google Colab (Python)  →  BigQuery
      (Extract)               (Transform)            (Load)
```

---

## 👀 Thursday Preview

On Thursday you will complete the official Google course:
**Modernizing Data Lakes and Data Warehouses with Google Cloud**

You will go deeper into:
- BigQuery architecture and how it works at scale
- Loading data natively into BigQuery
- Data Lakes vs Data Warehouses on GCP
- Google Cloud Storage as a data lake

You have already seen BigQuery in action today, so Thursday's material should feel familiar.