In [1]:
# --- 0. FIX CONFIGURATION ---
# 1. Set your Project ID (Make sure this matches your GCP Project)
PROJECT_ID = "finalproject-480220"  # Replace if you are using a different project

# 2. Authenticate
# The Colab import sits inside the try block so a local run (no google.colab) does not crash
try:
    from google.colab import auth
    auth.authenticate_user()
    print(f"✅ Authenticated. Using Project ID: {PROJECT_ID}")
except Exception as e:
    print(f"⚠️ Authentication warning (ignore if running locally): {e}")

# 3. Configure BigQuery Magic to use this project
from google.cloud.bigquery import magics
magics.context.project = PROJECT_ID

✅ Authenticated. Using Project ID: finalproject-480220


In [2]:
from google.cloud.bigquery import magics

# This tells BigQuery Magics explicitly: "Do not show a progress bar"
magics.context.progress_bar_type = None

#  **1. DISCOVER: Initial Exploration**
**Goal:** Get a first look at the weather data streaming in.

**What we are looking for:**
* **Schema Check:** Do we have values for key metrics like `temperature`, `humidity`, and `wind_speed`?
* **Coverage:** Are all our target cities (London, Tokyo, New York, etc.) appearing in the stream?
* **Time:** Are the `observation_time` and system `timestamp` aligning?

> **Observation:** We expect to see a row for each city with varying weather conditions.
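
The same three checks can be run in plain Python once a sample of rows is in memory. This is a minimal sketch, not part of the pipeline: `check_sample` is a hypothetical helper, the rows are hand-written stand-ins rather than query results, and the target-city list passed in is an assumption you would replace with your own.

```python
REQUIRED_METRICS = ("temperature", "humidity", "wind_speed")

def check_sample(rows, target_cities):
    """Return (target cities absent from the sample, rows with a missing metric)."""
    seen = {row["location_name"] for row in rows}
    missing_cities = sorted(set(target_cities) - seen)
    incomplete_rows = [
        row for row in rows
        if any(row.get(metric) is None for metric in REQUIRED_METRICS)
    ]
    return missing_cities, incomplete_rows

# Hand-written stand-in rows (illustrative, not real query output)
sample_rows = [
    {"location_name": "London", "temperature": 13.0, "humidity": 88.0, "wind_speed": 30.0},
    {"location_name": "Tokyo", "temperature": 6.0, "humidity": None, "wind_speed": 9.0},
]

missing_cities, incomplete_rows = check_sample(sample_rows, ["London", "Tokyo", "New York"])
print("Missing cities:", missing_cities)
print("Rows with a missing metric:", len(incomplete_rows))
```

A non-empty `missing_cities` list points at a coverage gap; a non-empty `incomplete_rows` list points at a schema or sensor problem.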

In [3]:
%%bigquery
# 1. Sample the raw weather data
SELECT
  timestamp,
  location_name,
  country,
  temperature,
  weather_descriptions,
  humidity,
  wind_speed
FROM
  `finalproject-480220.weather_data_dataset.new_weather_data`
ORDER BY
  timestamp DESC
LIMIT 12

Unnamed: 0,timestamp,location_name,country,temperature,weather_descriptions,humidity,wind_speed
0,2025-12-09 20:51:21.689363+00:00,Los Angeles,United States of America,27.0,Sunny,18.0,4.0
1,2025-12-09 20:51:20.126719+00:00,Dubai,United Arab Emirates,24.0,Partly Cloudy,54.0,12.0
2,2025-12-09 20:51:18.567674+00:00,Singapore,Singapore,26.0,Partly cloudy,84.0,17.0
3,2025-12-09 20:51:16.973573+00:00,Toronto,Canada,-1.0,Light Snow,86.0,22.0
4,2025-12-09 20:51:15.692461+00:00,Chicago,United States of America,2.0,Overcast,75.0,19.0
5,2025-12-09 20:51:14.156570+00:00,Mumbai,India,25.0,Smoke,36.0,14.0
6,2025-12-09 20:51:12.605312+00:00,Sydney,Australia,20.0,Partly cloudy,78.0,28.0
7,2025-12-09 20:51:11.084660+00:00,Berlin,Germany,11.0,Light Rain Shower,94.0,13.0
8,2025-12-09 20:51:09.351588+00:00,Paris,France,14.0,Partly cloudy,72.0,18.0
9,2025-12-09 20:51:07.859425+00:00,Tokyo,Japan,6.0,Clear,56.0,9.0


#  **2. INVESTIGATE: Filtering for Significance**
**Goal:** Filter the data to find "High Impact" weather events.

**The Story:**
Clear skies are boring. We want to know where it is **actively precipitating** (Rain/Snow) or where the **wind is dangerous**. These are the conditions that cause shipping delays or power outages.

**The Filter:**
* `precip > 0`: It is currently raining/snowing.
* `wind_speed > 15`: Breezy to windy conditions. This is a low bar that catches more than just dangerous wind; raise it to narrow the results.
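
The filter logic can be sketched in Python to make the `OR` behaviour explicit: a row qualifies if either condition holds, so a dry but windy city still counts. `is_high_impact` and the sample readings below are hypothetical, written only for illustration.

```python
def is_high_impact(precip, wind_speed, wind_threshold=15):
    """Mirror the filter condition: precip > 0 OR wind_speed > wind_threshold."""
    return precip > 0 or wind_speed > wind_threshold

# Illustrative readings (made-up city names, not query output)
readings = [
    {"location_name": "Calm City", "precip": 0.0, "wind_speed": 9.0},
    {"location_name": "Rainy City", "precip": 0.2, "wind_speed": 5.0},
    {"location_name": "Windy City", "precip": 0.0, "wind_speed": 28.0},
]

high_impact = [
    r["location_name"] for r in readings
    if is_high_impact(r["precip"], r["wind_speed"])
]
print(high_impact)
```

One difference from SQL is worth keeping in mind: comparing `None` with a number raises a `TypeError` in Python, whereas a SQL comparison against `NULL` simply does not match, so missing values need handling before this helper is applied to real rows.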

In [4]:
%%bigquery
# 2. Investigate: Find cities with active precipitation or high winds
SELECT
  location_name,
  weather_descriptions,
  temperature,
  precip,
  wind_speed,
  humidity
FROM
  `finalproject-480220.weather_data_dataset.new_weather_data`
WHERE
  precip > 0 OR wind_speed > 15
ORDER BY
  wind_speed DESC
LIMIT 10

Unnamed: 0,location_name,weather_descriptions,temperature,precip,wind_speed,humidity
0,Sydney,Light Drizzle,19.0,0.1,33.0,88.0
1,Sydney,Overcast,19.0,0.1,33.0,88.0
2,Sydney,Overcast,20.0,0.1,33.0,83.0
3,Sydney,Overcast,20.0,0.0,31.0,78.0
4,London,"Light Rain, Rain",13.0,0.2,30.0,88.0
5,London,"Light Rain, Rain",13.0,0.2,30.0,88.0
6,Sydney,Partly cloudy,19.0,0.0,29.0,78.0
7,Sydney,Partly cloudy,20.0,0.0,28.0,78.0
8,London,Partly cloudy,14.0,0.0,28.0,88.0
9,London,"Light Rain, Rain",13.0,0.4,28.0,88.0


#  **3. VALIDATE: Critiquing the Results**
**Where could the data be wrong?**
1.  **Stale Cache:** Weather APIs often cache data. Even if we ping them every minute, they might return the *same* timestamp for 1 hour.
2.  **API Errors:** Sometimes APIs return `999` or `null` for temperature when sensors fail.
3.  **Pipeline Lag:** Is the `timestamp` column (when GCP received it) significantly later than the `observation_time` (when the sensor read it)?

**The Validation Fix:**
We will calculate the **"Data Age"** to ensure we aren't making decisions based on weather from 3 hours ago.
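
As a rough illustration of the "Data Age" idea, the sketch below computes the age of a reading in whole minutes and flags out-of-range values. The helper names and the 180-minute cutoff (taken from the "3 hours ago" remark) are assumptions for this example, and treating a missing value as suspicious is an addition that the SQL query in this section does not perform.

```python
from datetime import datetime, timezone

MAX_AGE_MINUTES = 180  # assumed cutoff: "weather from 3 hours ago"

def data_age_minutes(ingested_at, now):
    """Whole minutes elapsed between ingestion time and `now`."""
    return int((now - ingested_at).total_seconds() // 60)

def is_suspicious(temperature, humidity):
    """Flag missing values and readings outside plausible physical ranges."""
    if temperature is None or humidity is None:
        return True
    return temperature > 60 or temperature < -90 or humidity > 100

# A fixed "now" keeps the example reproducible; use datetime.now(timezone.utc) in practice
now = datetime(2025, 12, 9, 21, 0, 0, tzinfo=timezone.utc)
ingested_at = datetime(2025, 12, 9, 20, 51, 21, tzinfo=timezone.utc)

age = data_age_minutes(ingested_at, now)
is_stale = age > MAX_AGE_MINUTES
print(f"age={age} min, stale={is_stale}")
print("sentinel 999 flagged:", is_suspicious(999, 50))
```

Both timestamps must be timezone-aware (or both naive) for the subtraction to work; mixing the two raises a `TypeError`.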

In [5]:
%%bigquery
# 3. Validation Query: Check for Stale Data or Sensor Errors
SELECT
  location_name,
  timestamp AS ingestion_time,
  # Compare Ingestion Time vs. Current Time
  TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), timestamp, MINUTE) AS lag_minutes,
  temperature,
  humidity
FROM
  `finalproject-480220.weather_data_dataset.new_weather_data`
WHERE
  # Flag suspicious data
  temperature > 60   -- Earth's record is ~56.7°C; anything higher is an error
  OR temperature < -90
  OR humidity > 100
ORDER BY
  timestamp DESC
LIMIT 10

Unnamed: 0,location_name,ingestion_time,lag_minutes,temperature,humidity


#  **4. EXTEND: Actionable Insights for Stakeholders**
**Who cares about this?**
* **Stakeholder:** **Energy Grid Operators** (e.g., National Grid).
* **Action:** Temperature directly drives energy demand (AC in summer, Heating in winter).
* **The Extension:** We will aggregate the average temperature by **Country** to help grid operators predict regional load balancing needs for the next hour.
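
The grouping step can be sketched in plain Python to show what the aggregation does: collect every reading per country, then average. `average_temperature_by_country` and the sample rows are hypothetical, and Python's `round` may differ from BigQuery's `ROUND` on exact ties, so treat the output as illustrative.

```python
from collections import defaultdict

def average_temperature_by_country(rows):
    """Return {country: mean temperature rounded to one decimal place}."""
    temps = defaultdict(list)
    for row in rows:
        temps[row["country"]].append(row["temperature"])
    return {
        country: round(sum(values) / len(values), 1)
        for country, values in temps.items()
    }

# Illustrative rows (placeholder country names, not query output)
rows = [
    {"country": "Country A", "temperature": 10.0},
    {"country": "Country A", "temperature": 13.0},
    {"country": "Country B", "temperature": -1.0},
]

averages = average_temperature_by_country(rows)
print(averages)
```

Because the mean is taken over every row, a city that reports more often carries more weight in its country's average than one that reports rarely.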

In [6]:
%%bigquery
# 4. Extend: Regional Analysis for Energy Demand
SELECT
  country,
  COUNT(DISTINCT location_name) as cities_monitored,
  ROUND(AVG(temperature), 1) as avg_regional_temp,
  ROUND(AVG(wind_speed), 1) as avg_wind_speed,
  MAX(precip) as max_precip_in_region
FROM
  `finalproject-480220.weather_data_dataset.new_weather_data`
GROUP BY
  country
ORDER BY
  avg_regional_temp DESC

Unnamed: 0,country,cities_monitored,avg_regional_temp,avg_wind_speed,max_precip_in_region
0,India,1,26.8,15.3,0.0
1,Singapore,1,26.8,20.7,0.0
2,United Arab Emirates,1,26.0,10.3,0.0
3,Australia,1,19.5,31.2,0.1
4,France,1,14.5,17.3,0.0
5,United Kingdom,1,13.2,26.5,0.4
6,Germany,1,10.8,10.3,0.0
7,Japan,1,8.3,9.3,0.0
8,United States of America,3,4.6,12.8,0.0
9,Canada,1,-1.7,20.8,0.1


In [7]:
# --- STEP 7: VISUALIZE WITH PLOTLY ---
import plotly.express as px
from google.cloud import bigquery

# 1. Initialize Client
bq_client = bigquery.Client(project=PROJECT_ID)

# 2. Fetch the last 24 hours of records (matches the chart title)
# We order by timestamp so we get the 'history' of the stream
sql = f"""
SELECT
    location_name,
    temperature,
    humidity,
    wind_speed,
    timestamp
FROM `{PROJECT_ID}.weather_data_dataset.new_weather_data`
WHERE timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
ORDER BY timestamp ASC
"""

print("🎨 Fetching data for visualization...")
df = bq_client.query(sql).to_dataframe()

if df.empty:
    print("⚠️ No data found in the last 24 hours! Run the producer (Step 5) first.")
else:
    # 3. Create Interactive Line Chart
    fig = px.line(
        df,
        x="timestamp",
        y="temperature",
        color="location_name",
        title="🌡️ Real-Time Weather Stream (Last 24h)",
        labels={"timestamp": "Time (UTC)", "temperature": "Temperature (°C)", "location_name": "City"},
        markers=True,
        template="plotly_dark",
        # Pass hover fields here so each city's trace receives only its own rows
        custom_data=["humidity", "wind_speed"],
    )

    # Add hover details (Humidity & Wind)
    fig.update_traces(
        hovertemplate="<b>%{x}</b><br>Temp: %{y}°C<br>Humidity: %{customdata[0]}%<br>Wind: %{customdata[1]} km/h"
    )

    # Show the plot
    fig.show()

🎨 Fetching data for visualization...


# **Analysis & Visualization Prompts**

Use these prompts to generate the advanced analysis and visualization sections of your notebook.

---

### **Prompt 1: The DIVE Framework (SQL Analysis)**
"I need to audit my streaming weather data in BigQuery using the **'DIVE' framework**. Please write four separate SQL queries using the Python BigQuery client to:
1.  **Discover:** Sample the latest 12 rows to check the schema and current values.
2.  **Investigate:** Filter the data to find 'High Impact' events where it is precipitating (`precip > 0`) or windy (`wind_speed > 15`).
3.  **Validate:** Calculate the 'ingestion lag' (difference between `CURRENT_TIMESTAMP()` and `timestamp`) to ensure data is fresh, and flag any errors (e.g., Temp > 60°C).
4.  **Extend:** Aggregate the average temperature and wind speed by `country` to spot regional trends."

---

### **Prompt 2: Deep Dive into Validation (Specific Logic)**
"For the **'Validate'** step of the DIVE analysis, I want to be strict about data quality. Write a specific BigQuery SQL query that selects the `location_name`, `timestamp`, and a calculated column `lag_minutes` (using `TIMESTAMP_DIFF`). Filter the results to show only rows where the data is either **suspicious** (Temperature > 60°C or < -90°C) or **stale** (high lag), so I can debug the pipeline."

---

### **Prompt 3: Interactive Visualization (Plotly)**
"Now, visualize the heartbeat of this data stream using **Plotly Express**. Write a Python script that:
1.  Queries BigQuery for the last 24 hours of weather data, sorted by timestamp.
2.  Creates an interactive **Line Chart** of `temperature` vs. `timestamp`.
3.  Colors the lines by `location_name`.
4.  Uses the `plotly_dark` template.
5.  **Customizes the hover tooltip** to show Humidity and Wind Speed alongside the temperature."