# DA4 - Module 5 - Data Warehouses in the Cloud
## Notebook 1: Welcome to Google Colab & Staging Data

---

## Part 1: What is Google Colab?

Google Colaboratory (Colab) is a **free, cloud-based Python notebook environment** provided by Google.

Key things to know:
- It runs entirely in your **browser** - nothing to install
- It is part of the **Google Cloud ecosystem** - the same ecosystem your company uses
- Each notebook runs on a **virtual machine** hosted by Google
- Notebooks are saved to your **Google Drive** automatically
- It uses **Jupyter notebook** format - the industry standard for data work

### Two types of cell

| Cell Type | Purpose | How to run |
|-----------|---------|------------|
| **Text cell** | Documentation, notes, explanations | Nothing to run - it is for reading |
| **Code cell** | Python code | Click ▶ or press `Shift + Enter` |

### Useful keyboard shortcuts
- `Shift + Enter` - run current cell and move to next
- `Ctrl + Enter` - run current cell and stay
- `Ctrl + M B` - insert cell below
- `Ctrl + M A` - insert cell above

In [None]:
# This is a code cell. Press Shift+Enter to run it.

print("Welcome to Google Colab!")
print("You are running Python in the cloud.")
print("This virtual machine is hosted by Google.")

In [None]:
# This tells us about the machine Colab has given us

import platform
import os

print(f"Operating system : {platform.system()}")
print(f"Python version   : {platform.python_version()}")
print(f"Working directory: {os.getcwd()}")
print()
print("This machine exists in a Google data centre.")
print("Every time you open a new Colab session, you get a fresh machine.")

---
## Part 2: Loading the Superstore Dataset

The Superstore CSV file is stored in **Google Cloud Storage (GCS)** - Google's cloud file storage service.

Because the file is publicly accessible, we can load it directly using its URL - no login required.

In data warehouse terms, raw data loaded like this, before any transformation, is called a **staging table**.
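
As a minimal sketch of the staging idea, the snippet below loads a tiny inline CSV and keeps the raw copy separate from a working copy. The inline CSV is a hypothetical stand-in for a file at a URL, so the example runs without network access. The names `staging_demo` and `working`, and all the values, are assumptions made for illustration; they are not taken from the Superstore file.

```python
import io

import pandas as pd

# Hypothetical three-row CSV standing in for a real file behind a URL
raw_csv = io.StringIO(
    "Order ID,Region,Sales\n"
    "A-1,West,10.50\n"
    "A-1,West,4.25\n"
    "B-2,East,99.00\n"
)

# Staging table: the data exactly as it arrived, with no transformation
staging_demo = pd.read_csv(raw_csv)

# Transform a copy so the staging table itself stays untouched
working = staging_demo.copy()
working["Region"] = working["Region"].str.upper()

print(staging_demo["Region"].tolist())  # ['West', 'West', 'East']
print(working["Region"].tolist())       # ['WEST', 'WEST', 'EAST']
```

Keeping the staging copy unchanged means any later transformation can be re-run or checked against the original rows.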

In [None]:
# Import the pandas library

import pandas as pd

print("pandas imported successfully")
print(f"pandas version: {pd.__version__}")

In [None]:
# Load the Superstore CSV directly from Google Cloud Storage
# This is our STAGING TABLE - the raw data before any transformation

GCS_URL = "https://storage.googleapis.com/ingwane-da4/Superstore.csv"

staging = pd.read_csv(GCS_URL)

print("Data loaded successfully!")
print(f"Rows   : {len(staging):,}")
print(f"Columns: {len(staging.columns)}")

---
## Part 3: Exploring the Staging Table

In [None]:
# Look at the first 5 rows

staging.head()

In [None]:
# Look at all column names and their data types

staging.info()

In [None]:
# Check the shape - rows and columns

print(f"The staging table has {staging.shape[0]:,} rows and {staging.shape[1]} columns")

In [None]:
# Check for null values

null_counts = staging.isnull().sum()
print("Null values per column:")
print(null_counts[null_counts > 0] if null_counts.sum() > 0 else "No null values found - clean data!")

In [None]:
# Check for duplicate rows

duplicates = staging.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")
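
To make the duplicate check concrete, here is a small sketch on made-up data (the `demo` frame and its values are assumptions for illustration only). By default, `duplicated()` flags a row only when every column matches an earlier row, so the first occurrence is not counted.

```python
import pandas as pd

# Hypothetical line items; row 1 is an exact copy of row 0
demo = pd.DataFrame(
    {
        "Order ID": ["A-1", "A-1", "A-1", "B-2"],
        "Product ID": ["P-10", "P-10", "P-20", "P-10"],
        "Quantity": [2, 2, 2, 2],
    }
)

# Full-row check: only the exact repeat is flagged
flags = demo.duplicated()
print(flags.tolist())     # [False, True, False, False]
print(int(flags.sum()))   # 1

# Checking a subset of columns flags every repeat of that key
order_repeats = int(demo.duplicated(subset=["Order ID"]).sum())
print(order_repeats)      # 2
```

A count of zero from the full-row check therefore means no row is an exact copy of another. It does not mean that every Order ID appears only once.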

In [None]:
# Statistical summary of numeric columns

staging.describe()

---
## Part 4: Identifying Dimensions in the Data

In [None]:
# Count the unique values in each potential dimension column

dimension_columns = ['Segment', 'Ship Mode', 'Region', 'Category', 'Sub-Category']

for col in dimension_columns:
    unique_vals = staging[col].nunique()
    print(f"{col:20} - {unique_vals:4} unique values")

In [None]:
# How many unique customers, products and orders?

print(f"Unique customers : {staging['Customer ID'].nunique():,}")
print(f"Unique products  : {staging['Product ID'].nunique():,}")
print(f"Unique orders    : {staging['Order ID'].nunique():,}")
print(f"Total rows       : {len(staging):,}")
print()
print("Note: More rows than orders - each order can have multiple line items")
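
The relationship between orders and line items can be sketched on a small made-up frame (the `orders_demo` data below is hypothetical, not taken from the Superstore file). Grouping by `Order ID` and counting rows shows how many line items each order contains.

```python
import pandas as pd

# Hypothetical data: three orders spread across six line-item rows
orders_demo = pd.DataFrame(
    {
        "Order ID": ["A-1", "A-1", "B-2", "C-3", "C-3", "C-3"],
        "Product ID": ["P-10", "P-20", "P-10", "P-30", "P-40", "P-50"],
    }
)

# Each row is one line item; count the rows that belong to each order
items_per_order = orders_demo.groupby("Order ID").size()
print(items_per_order.tolist())            # [2, 1, 3]

print(len(orders_demo))                    # 6 rows in total
print(orders_demo["Order ID"].nunique())   # 3 distinct orders
```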

---
## 📝 TASK 1 - Explore the Staging Data

Using the `staging` dataframe, answer the following questions:

1. What are the four **Regions** in the dataset?
2. What is the **date range** of orders? (earliest and latest Order Date)
3. How many rows are in the **Technology** category?
4. What is the **total Sales** value across all rows?
5. Which **Customer Name** appears most frequently in the dataset?
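
One hint for the date question: a date column read from CSV without parsing is plain text, and text compares character by character rather than chronologically. The sketch below uses made-up dates in month/day/year form. That format is an assumption; check how `Order Date` actually looks in `staging` before relying on it.

```python
import pandas as pd

# Hypothetical date strings; the real file's format may differ
dates_demo = pd.DataFrame({"Order Date": ["11/8/2016", "2/3/2017", "12/25/2015"]})

# As plain text, min() compares character by character
print(dates_demo["Order Date"].min())   # 11/8/2016

# Parsed as dates, min() and max() compare chronologically
parsed = pd.to_datetime(dates_demo["Order Date"], format="%m/%d/%Y")
print(parsed.min().date())   # 2015-12-25
print(parsed.max().date())   # 2017-02-03
```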

In [None]:
# Question 1 - What are the four Regions?



In [None]:
# Question 2 - What is the date range of orders?



In [None]:
# Question 3 - How many rows are in the Technology category?



In [None]:
# Question 4 - What is the total Sales value?



In [None]:
# Question 5 - Which Customer Name appears most frequently?



---
## ☕ BREAK

After the break we move to **Notebook 2** - Building the Data Warehouse in Python.