Domain Question 1: How do temperature patterns in Chicago evolve during the winter-to-spring transition period?

Domain Question 2: How do environmental sensors in Chicago capture spatial variations in temperature?

Domain Question 3: What relationships exist between temperature, humidity, and precipitation in Chicago?

Domain Question 4: How does soil moisture respond to precipitation events in Chicago?

Domain Question 5: How do wind patterns vary across different parts of Chicago, and how do they correlate with temperature?

Domain Question 6: How do daily cycles of temperature and humidity change throughout the winter-to-spring transition?

In [2]:
!pip install pandas numpy matplotlib geopandas altair

Collecting pandas
  Using cached pandas-2.2.3-cp312-cp312-macosx_10_9_x86_64.whl.metadata (89 kB)
Collecting matplotlib
  Downloading matplotlib-3.10.1-cp312-cp312-macosx_10_13_x86_64.whl.metadata (11 kB)
Collecting geopandas
  Using cached geopandas-1.0.1-py3-none-any.whl.metadata (2.2 kB)
Collecting altair
  Downloading altair-5.5.0-py3-none-any.whl.metadata (11 kB)
Collecting tzdata>=2022.7 (from pandas)
  Using cached tzdata-2025.2-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting contourpy>=1.0.1 (from matplotlib)
  Downloading contourpy-1.3.2-cp312-cp312-macosx_10_13_x86_64.whl.metadata (5.5 kB)
Collecting cycler>=0.10 (from matplotlib)
  Downloading cycler-0.12.1-py3-none-any.whl.metadata (3.8 kB)
Collecting fonttools>=4.22.0 (from matplotlib)
  Downloading fonttools-4.57.0-cp312-cp312-macosx_10_13_x86_64.whl.metadata (102 kB)
Collecting kiwisolver>=1.3.1 (from matplotlib)
  Downloading kiwisolver-1.4.8-cp312-cp312-macosx_10_13_x86_64.whl.metadata (6.2 kB)
Collecting pillow>=8 (

In [3]:

# --- Block 1: Setup & Imports ---

# Import libraries
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import numpy as np
from datetime import datetime
from matplotlib.colors import LinearSegmentedColormap
from shapely.geometry import Point
import matplotlib.dates as mdates
import json # Needed for handling GeoJSON dict

# Set plot style
plt.style.use('ggplot')

# For better visualization (optional, mainly for matplotlib)
import matplotlib
matplotlib.rcParams['figure.figsize'] = (12, 8)

# Import Altair and enable rendering/fusion
import altair as alt
# Enable VegaFusion for potentially large datasets
alt.data_transformers.enable('vegafusion')
# For better rendering in Google Colab (if applicable)
try:
    alt.renderers.enable('colab')
except Exception as e:
    print(f"Colab renderer not available: {e}")
    alt.renderers.enable('default')

# Ensure necessary directories exist for saving data and specs
import os
os.makedirs('data', exist_ok=True)
os.makedirs('specs', exist_ok=True)

print("(Success) Block 1: Setup & Imports complete. Directories 'data' and 'specs' ensured.")

(Success) Block 1: Setup & Imports complete. Directories 'data' and 'specs' ensured.


In [11]:
# --- Block 2: Data Fetching from API ---

import urllib.parse
import pandas as pd # Ensure pandas is imported here as well

def fetch_data(url):
    """Fetches data from a given JSON API URL."""
    print(f"Fetching data from API: {url[:100]}...") # Print snippet of URL
    # After the fetch_data function calls:
    try:
        df = pd.read_json(url)
        print(f"Data successfully retrieved! Rows: {df.shape[0]}")
        return df
    except Exception as e:
        print(f"Error fetching data from {url[:50]}...: {e}")
        return None

# API URL 1 (Jan–Mar 2017 - broader selection)
api_url1 = ("https://data.cityofchicago.org/resource/ggws-77ih.json?"
    "$query=SELECT%20measurement_title,%20measurement_description,%20measurement_type,%20"
    "measurement_medium,%20measurement_time,%20measurement_value,%20units,%20"
    "units_abbreviation,%20measurement_period_type,%20data_stream_id,%20resource_id,%20"
    "measurement_id,%20record_id,%20latitude,%20longitude,%20location%20"
    "WHERE%20measurement_time%20BETWEEN%20%272017-01-01T00:00:00%27::floating_timestamp%20"
    "AND%20%272017-03-31T23:59:59%27::floating_timestamp%20"
    "ORDER%20BY%20measurement_time%20DESC%20NULL%20FIRST,%20data_stream_id%20ASC%20NULL%20LAST%20"
    "LIMIT%201000000")

# API URL 2 (Jan–Jun 2017 - focused on Temp, Atmosphere, Celsius)
# This URL is more specific and might pull less data but target relevant types
raw_query2 = """
SELECT measurement_title, measurement_description, measurement_type, measurement_medium,
       measurement_time, measurement_value, units, units_abbreviation,
       measurement_period_type, data_stream_id, resource_id, measurement_id, record_id,
       latitude, longitude, location
WHERE measurement_time BETWEEN '2017-01-01T00:00:00'::floating_timestamp
  AND '2017-06-30T23:45:00'::floating_timestamp
  AND caseless_contains(units, 'degrees Celsius')
  AND caseless_contains(measurement_type, 'Temperature')
  AND caseless_contains(measurement_medium, 'atmosphere')
ORDER BY data_stream_id ASC NULL LAST
LIMIT 400000
"""
encoded_query2 = urllib.parse.quote(raw_query2.strip(), safe='') # Use strip() to remove leading/trailing whitespace
api_url2 = f"https://data.cityofchicago.org/resource/ggws-77ih.json?$query={encoded_query2}"

# Fetch data
df1 = fetch_data(api_url1)
df2 = fetch_data(api_url2)

print(f"API URL 1 result: {'Success with ' + str(len(df1)) + ' rows' if df1 is not None else 'Failed'}")
print(f"API URL 2 result: {'Success with ' + str(len(df2)) + ' rows' if df2 is not None else 'Failed'}")

# Combine and use df as the base
if df1 is not None and df2 is not None:
    # The 'location' column from the API is likely a dictionary representation of a point.
    # pandas.drop_duplicates can struggle with unhashable types like dictionaries.
    # Drop the 'location' column before dropping duplicates.
    if 'location' in df1.columns:
        df1 = df1.drop(columns=['location'])
    if 'location' in df2.columns:
        df2 = df2.drop(columns=['location'])

    df = pd.concat([df1, df2], ignore_index=True).drop_duplicates().reset_index(drop=True)
elif df1 is not None:
    if 'location' in df1.columns:
        df1 = df1.drop(columns=['location'])
    df = df1.drop_duplicates().reset_index(drop=True)
elif df2 is not None:
    if 'location' in df2.columns:
        df2 = df2.drop(columns=['location'])
    df = df2.drop_duplicates().reset_index(drop=True)
else:
    raise Exception("Failed to fetch data from both API URLs. Exiting.") # Exit if no data is fetched

print(f"Block 2: Data Fetching complete. Combined dataset has {len(df)} rows.")

Fetching data from API: https://data.cityofchicago.org/resource/ggws-77ih.json?$query=SELECT%20measurement_title,%20measurem...
Data successfully retrieved! Rows: 605620
Fetching data from API: https://data.cityofchicago.org/resource/ggws-77ih.json?$query=SELECT%20measurement_title%2C%20measur...
Data successfully retrieved! Rows: 360895
API URL 1 result: Success with 605620 rows
API URL 2 result: Success with 360895 rows
Block 2: Data Fetching complete. Combined dataset has 916230 rows.


In [12]:
# --- Block 3: Initial Cleaning, Type Conversion, and Duplicate Removal ---

print("\nBlock 3: Starting initial data cleaning...")

# Ensure measurement_time is datetime
df['measurement_time'] = pd.to_datetime(df['measurement_time'], errors='coerce')

# Convert measurement_value, latitude, longitude to numeric
df['measurement_value'] = pd.to_numeric(df['measurement_value'], errors='coerce')
df['latitude'] = pd.to_numeric(df['latitude'], errors='coerce')
df['longitude'] = pd.to_numeric(df['longitude'], errors='coerce')

# Handle dictionary columns if any remain
# Check a sample to confirm if necessary before applying broadly
dict_columns = []
for col in df.columns:
    if df[col].dtype == 'object':
        if not df[col].dropna().empty:
             # Check first non-null value
             sample_val = df[col].dropna().iloc[0]
             if isinstance(sample_val, dict):
                dict_columns.append(col)
                print(f"  Column '{col}' contains dictionary values (sample: {sample_val}). Converting to string.")

if dict_columns:
    for col in dict_columns:
        # Use apply with isinstance check to handle potential non-dict NaNs safely
        df[col] = df[col].apply(lambda x: str(x) if isinstance(x, dict) else x)
    print("  Converted dictionary columns to strings.")
else:
    print("  No dictionary columns found or converted.")


# Drop duplicates again after type conversions
initial_rows = df.shape[0]
df = df.drop_duplicates().reset_index(drop=True)
print(f" (success) Block 3: Initial cleaning complete. Removed {initial_rows - df.shape[0]} duplicate rows. Dataset size: {len(df)}")

# Check for missing values after initial cleaning
print("\nMissing Values Summary after Block 3 Cleaning:")
print(df.isnull().sum())


Block 3: Starting initial data cleaning...
  No dictionary columns found or converted.
 (success) Block 3: Initial cleaning complete. Removed 0 duplicate rows. Dataset size: 916230

Missing Values Summary after Block 3 Cleaning:
measurement_title               0
measurement_type                0
measurement_medium              0
measurement_time                0
measurement_value               0
units                           0
units_abbreviation              0
measurement_period_type         0
data_stream_id                  0
resource_id                     0
measurement_id                  0
record_id                       0
latitude                        0
longitude                       0
measurement_description    204040
dtype: int64


In [13]:
# --- Block 4: Handle Critical Missing Values ---

print("\nBlock 4: Removing rows with missing critical values...")

initial_rows = df.shape[0]
df_clean = df.dropna(subset=['measurement_time', 'latitude', 'longitude', 'measurement_value', 'measurement_type']).reset_index(drop=True)
print(f"Block 4: Removed {initial_rows - df_clean.shape[0]} rows with missing critical values. Cleaned dataset size: {len(df_clean)}")

# Check for missing values again
print("(Success) \nMissing Values Summary after Block 4 Cleaning (df_clean):")
print(df_clean.isnull().sum())

if df_clean.empty:
    raise Exception("Cleaned DataFrame is empty after removing critical missing values. Cannot proceed.")


Block 4: Removing rows with missing critical values...
Block 4: Removed 0 rows with missing critical values. Cleaned dataset size: 916230
(Success) 
Missing Values Summary after Block 4 Cleaning (df_clean):
measurement_title               0
measurement_type                0
measurement_medium              0
measurement_time                0
measurement_value               0
units                           0
units_abbreviation              0
measurement_period_type         0
data_stream_id                  0
resource_id                     0
measurement_id                  0
record_id                       0
latitude                        0
longitude                       0
measurement_description    204040
dtype: int64


In [14]:
# --- Block 5: Temperature Sensor Correction ---

print("\nBlock 5: Applying temperature sensor corrections...")

def correct_temperature_values(df):
    """
    Apply appropriate scaling/conversion to measurement values for Temperature data based on sensor title.
    Heuristics and documented conversions are used to convert potential raw/scaled values to Celsius.
    """
    df_corrected = df.copy()
    # Ensure measurement_value is numeric before operations and drop NaNs if they appeared somehow
    df_corrected['measurement_value'] = pd.to_numeric(df_corrected['measurement_value'], errors='coerce')
    df_corrected = df_corrected.dropna(subset=['measurement_value', 'measurement_type']) # Ensure value and type exist

    # Use .loc with boolean masks
    temp_mask = df_corrected['measurement_type'] == 'Temperature'
    # Create a temporary DataFrame for temperature data to avoid SettingWithCopyWarning on the original df_corrected slice
    df_temp_only = df_corrected.loc[temp_mask].copy()

    if not df_temp_only.empty:
        # Ensure 'measurement_title' and 'units' are strings for comparison
        df_temp_only['measurement_title'] = df_temp_only['measurement_title'].astype(str)
        df_temp_only['units'] = df_temp_only['units'].astype(str)

        # Process MK-III Weather Station Temp sensors (assuming scaling issues)
        mk_mask = df_temp_only['measurement_title'].str.contains("MK-III Weather Station Temp", na=False)
        # Apply scaling heuristics based on value ranges
        df_temp_only.loc[mk_mask & (df_temp_only['measurement_value'] > 10000), 'measurement_value'] /= 10000.0
        df_temp_only.loc[mk_mask & (df_temp_only['measurement_value'] > 1000) & (df_temp_only['measurement_value'] <= 10000), 'measurement_value'] /= 1000.0
        df_temp_only.loc[mk_mask & (df_temp_only['measurement_value'] > 100) & (df_temp_only['measurement_value'] <= 1000), 'measurement_value'] /= 10.0
        # Also apply a division by 10 for values potentially indicating mV as per original Matplotlib code comment
        # This might overlap with or replace the above heuristics depending on observed data patterns.
        # Let's use the condition from the Matplotlib code's correction function, which was based on units containing 'mv'.
        mkiii_mv_mask = mk_mask & df_temp_only['units'].str.lower().str.contains('mv', na=False)
        # Apply *only* if it looks like an mV reading according to the Matplotlib code's logic
        df_temp_only.loc[mkiii_mv_mask, 'measurement_value'] = df_temp_only.loc[mkiii_mv_mask, 'measurement_value'] / 10.0
        if mkiii_mv_mask.any():
             df_temp_only.loc[mkiii_mv_mask, 'units'] = 'Celsius (Corrected)' # Update units


        # Process Cumulus Weather Station Air Temp sensors (assuming scaling issues)
        cumulus_mask = df_temp_only['measurement_title'].str.contains("Cumulus: Weather Station Air Temp", na=False)
        df_temp_only.loc[cumulus_mask & (df_temp_only['measurement_value'] > 1000), 'measurement_value'] /= 1000.0
        df_temp_only.loc[cumulus_mask & (df_temp_only['measurement_value'] > 100) & (df_temp_only['measurement_value'] <= 1000), 'measurement_value'] /= 10.0


        # Process TM1 Temp Sensors (assuming mV and convert to Celsius)
        tm1_mask = df_temp_only['measurement_title'].str.contains("TM1 Temp Sensor", na=False)
        tm1_mv_mask = tm1_mask & df_temp_only['units'].str.lower().str.contains('mv', na=False)
        # Apply conversion formula if it looks like an mV reading based on units string
        df_temp_only.loc[tm1_mv_mask, 'measurement_value'] = (df_temp_only.loc[tm1_mv_mask, 'measurement_value'] - 400.0) / 19.5
        if tm1_mv_mask.any():
             df_temp_only.loc[tm1_mv_mask, 'units'] = 'Celsius (Formula Applied)' # Update units

        # Re-integrate corrected temperature data into the full DataFrame
        df_corrected.loc[temp_mask, :] = df_temp_only.values

    # Verify temperature corrections - print a summary after correction
    temp_data_check = df_corrected[df_corrected['measurement_type'] == 'Temperature']
    if not temp_data_check.empty:
        print("\nTemperature ranges after correction:")
        for title in temp_data_check['measurement_title'].unique():
            # Ensure title is string for filtering and handle potential NaNs
            title_str = str(title) if pd.notna(title) else ''
            title_data = temp_data_check[temp_data_check['measurement_title'].astype(str) == title_str]
            if not title_data.empty and pd.api.types.is_numeric_dtype(title_data['measurement_value']):
                 min_temp = title_data['measurement_value'].min()
                 max_temp = title_data['measurement_value'].max()
                 mean_temp = title_data['measurement_value'].mean()
                 print(f"  {title_str}: Min={min_temp:.2f}°C, Max={max_temp:.2f}°C, Mean={mean_temp:.2f}°C")
                 if max_temp > 50 or min_temp < -40:
                     print(f"    WARNING: Temperatures for '{title_str}' may still need adjustment! Range ({min_temp:.2f}, {max_temp:.2f}) seems extreme.")
            elif pd.notna(title):
                 print(f"  {title_str}: No valid numeric temperature data after correction.")
    else:
         print("\nNo temperature data found after cleaning for correction check.")


    return df_corrected

# Apply temperature scaling correction
df_clean = correct_temperature_values(df_clean)
print("(Success) Block 5: Temperature values have been corrected based on sensor types.")


Block 5: Applying temperature sensor corrections...

Temperature ranges after correction:
  Langley - Cumulus: Weather Station Air Temp: Min=0.00°C, Max=0.00°C, Mean=0.00°C
  UI Labs Bioswale - Cumulus: Weather Station Air Temp: Min=0.00°C, Max=26.00°C, Mean=6.00°C
  Argyle - Cumulus: Weather Station Air Temp: Min=0.00°C, Max=27.00°C, Mean=5.99°C
  Langley - Thunder 1: TM1 Temp Sensor: Min=336.00°C, Max=537.00°C, Mean=438.66°C
  Langley - Thunder 1: MK-III Weather Station Temp: Min=0.00°C, Max=0.00°C, Mean=0.00°C
  UI Labs Bioswale - Thunder 1: TM1 Temp Sensor: Min=327.00°C, Max=600.00°C, Mean=448.77°C
  UI Labs Bioswale - Thunder 1: MK-III Weather Station Temp: Min=0.00°C, Max=100.00°C, Mean=35.48°C
  Argyle - Thunder 1: TM1 Temp Sensor: Min=0.00°C, Max=3298.00°C, Mean=3251.45°C
  Argyle - Thunder 1: MK-III Weather Station Temp: Min=0.00°C, Max=100.00°C, Mean=33.00°C
(Success) Block 5: Temperature values have been corrected based on sensor types.


In [15]:
# --- Block 6: Spatial Data Preparation and Saving Cleaned Data ---

print("\nBlock 6: Preparing spatial data and saving cleaned datasets...")

# Generate a GeoPandas DataFrame
# Ensure latitude/longitude are numeric and not NaN
df_geo_ready = df_clean.dropna(subset=['latitude', 'longitude']).copy()

if not df_geo_ready.empty:
    geometry = [Point(xy) for xy in zip(df_geo_ready['longitude'], df_geo_ready['latitude'])]
    # Assuming original lat/lon are WGS84 (EPSG:4326)
    gdf = gpd.GeoDataFrame(df_geo_ready, geometry=geometry, crs="EPSG:4326")
    print(f"  Created GeoDataFrame with {len(gdf)} entries.")

    # Save GeoDataFrame
    try:
        gdf.to_file('data/chicago_environmental_data.geojson', driver='GeoJSON')
        print("  Saved cleaned data to data/chicago_environmental_data.geojson")
    except Exception as e:
         print(f"  Could not save GeoDataFrame to file: {e}")
else:
    print("  Skipping GeoDataFrame creation due to missing or invalid latitude/longitude in cleaned data.")
    gdf = None


# Save the main cleaned data (CSV format is often useful)
try:
    df_clean.to_csv('data/chicago_environmental_data_clean.csv', index=False)
    print("  Saved cleaned data to data/chicago_environmental_data_clean.csv")
except Exception as e:
     print(f"  Could not save cleaned DataFrame to CSV: {e}")

print("(Success) Block 6: Spatial data prep and initial saving complete.")


Block 6: Preparing spatial data and saving cleaned datasets...
  Created GeoDataFrame with 916230 entries.
  Saved cleaned data to data/chicago_environmental_data.geojson
  Saved cleaned data to data/chicago_environmental_data_clean.csv
(Success) Block 6: Spatial data prep and initial saving complete.


 Data Preparation

In [16]:
# --- Block 7: Data Prep for V1 (Monthly Temperature Boxplots) ---

print("\nBlock 7: Preparing data for Monthly Temperature Boxplots (V1)...")

import calendar
month_names_map = {i: calendar.month_name[i] for i in range(1, 13)}

# Stricter filtering based on *plausible* monthly ranges after sensor correction
month_temp_ranges_strict = {
    1: (-25.0, 10.0), 2: (-22.0, 15.0), 3: (-15.0, 25.0),
    4: (-5.0, 30.0),  5: (0.0, 35.0),   6: (5.0, 38.0)
}

# Filter temperature data for relevant months
temp_data_filtered_v1 = df_clean[\
    (df_clean['measurement_type'] == 'Temperature') &\
    (df_clean['measurement_medium'].astype(str).str.lower() == 'atmosphere') &\
    (df_clean['units'].astype(str).str.lower().str.contains('celsius', na=False)) &\
    (df_clean['measurement_time'].dt.month >= 3) &\
    (df_clean['measurement_time'].dt.month <= 6)\
].copy()

if temp_data_filtered_v1.empty:
     print("  No temperature data found for months 3-6 after cleaning. Skipping V1 prep.")
     temp_data_for_boxplot = pd.DataFrame()
else:
    temp_data_filtered_v1['month'] = temp_data_filtered_v1['measurement_time'].dt.month
    temp_data_filtered_v1['month_name'] = temp_data_filtered_v1['month'].map(month_names_map)

    # Apply stricter range filtering per month
    initial_v1_rows = len(temp_data_filtered_v1)
    valid_indices = []
    # Apply filter using a boolean mask for performance
    month_mask = temp_data_filtered_v1['month'].isin(month_temp_ranges_strict.keys())
    # Create a range mask, default to True for months not in strict ranges
    range_mask = pd.Series(True, index=temp_data_filtered_v1.index)

    for m, (min_t, max_t) in month_temp_ranges_strict.items():
        month_specific_mask = temp_data_filtered_v1['month'] == m
        range_mask[month_specific_mask] = temp_data_filtered_v1.loc[month_specific_mask, 'measurement_value'].between(min_t, max_t)

    # Combine filters
    temp_data_for_boxplot = temp_data_filtered_v1.loc[range_mask].copy()

    print(f"  Filtered temperature data for boxplot: removed {initial_v1_rows - len(temp_data_for_boxplot)} outliers based on monthly ranges.")

    # Sample data for performance if necessary (Altair has limits on data size for inline/URL)
    max_boxplot_points = 200000 # Set a threshold
    if len(temp_data_for_boxplot) > max_boxplot_points:
        print(f"  Sampling boxplot data from {len(temp_data_for_boxplot)} rows...")
        # Stratified sample by month to keep distributions
        temp_data_for_boxplot = temp_data_for_boxplot.groupby('month_name').apply(
            lambda x: x.sample(min(len(x), int(max_boxplot_points / temp_data_for_boxplot['month_name'].nunique())), random_state=42)
        ).reset_index(drop=True)
        print(f"  Sampled down to {len(temp_data_for_boxplot)} rows.")


# Save data for V1
if not temp_data_for_boxplot.empty:
    try:
        # Only save the columns needed for the chart
        temp_data_for_boxplot[['measurement_time', 'measurement_value', 'month_name']].to_json('data/monthly_temp_for_boxplot.json', orient='records')
        print("  Saved data for V1 (Monthly Boxplots) to data/monthly_temp_for_boxplot.json")
    except Exception as e:
        print(f"  Could not save data for V1: {e}")
else:
     print("  No data to save for V1 after filtering.")


print("Block 7: Data prep for V1 complete.")


Block 7: Preparing data for Monthly Temperature Boxplots (V1)...


  range_mask[month_specific_mask] = temp_data_filtered_v1.loc[month_specific_mask, 'measurement_value'].between(min_t, max_t)


  Filtered temperature data for boxplot: removed 83553 outliers based on monthly ranges.
  Sampling boxplot data from 277342 rows...


  temp_data_for_boxplot = temp_data_for_boxplot.groupby('month_name').apply(


  Sampled down to 178906 rows.
  Saved data for V1 (Monthly Boxplots) to data/monthly_temp_for_boxplot.json
Block 7: Data prep for V1 complete.


Create Vega-Lite Spatial Visualization

In [None]:
# --- Block 8: Data Prep for V2 (Linked Choropleth & Bars) ---

print("\nBlock 8: Preparing data for Linked Choropleth & Bars (V2)...")

# Load neighborhoods GeoJSON
neighborhoods_geojson_path = "data/chicago_neighborhoods.json"
try:
    neighborhoods = gpd.read_file(neighborhoods_geojson_path)
    print(f"  Loaded local {neighborhoods_geojson_path}")
except Exception as e:
    print(f"  Could not load local {neighborhoods_geojson_path}: {e}")
    neighborhoods = None

neighborhood_temps_json_path = "data/neighborhood_temps.json"

if neighborhoods is not None and 'community' in neighborhoods.columns and not df_clean.empty:
    print("  Processing temperature data for neighborhoods...")
    
    # Create simple list for neighborhood temperatures
    all_neighborhood_temps = []
    
    # Define months for processing
    months = {1: "January", 2: "February", 3: "March", 4: "April", 5: "May", 6: "June"}
    
    # Filter temperature data - only include valid numeric measurements
    print("  Filtering temperature data from df_clean...")
    temp_data = df_clean[
        (df_clean['measurement_type'] == 'Temperature') & 
        (df_clean['measurement_medium'].astype(str).str.lower() == 'atmosphere') &
        (df_clean['latitude'].notna()) & 
        (df_clean['longitude'].notna())
    ].copy()
    
    temp_data['measurement_value'] = pd.to_numeric(temp_data['measurement_value'], errors='coerce')
    temp_data = temp_data.dropna(subset=['measurement_value'])
    
    print(f"  Found {len(temp_data)} valid temperature readings")
    
    if not temp_data.empty:
        # Convert measurement_time to datetime
        temp_data['measurement_time'] = pd.to_datetime(temp_data['measurement_time'])
        
        # Create geometry for spatial join
        temp_data['geometry'] = [Point(xy) for xy in zip(temp_data['longitude'], temp_data['latitude'])]
        temp_gdf = gpd.GeoDataFrame(temp_data, geometry='geometry', crs="EPSG:4326")
        
        # Prepare neighborhoods for join
        neighborhoods_crs = neighborhoods.to_crs(temp_gdf.crs)
        
        # Process each month
        for month, month_name in months.items():
            month_data = temp_gdf[temp_gdf['measurement_time'].dt.month == month]
            
            if len(month_data) > 0:
                print(f"  Processing {len(month_data)} readings for {month_name}")
                
                try:
                    joined = gpd.sjoin(month_data, neighborhoods_crs, how='inner', predicate='within')
                    
                    if not joined.empty:
                        # Group by community and calculate statistics
                        stats = joined.groupby('community')['measurement_value'].agg(['mean', 'min', 'max']).reset_index()
                        stats.columns = ['community', 'mean_temp', 'min_temp', 'max_temp']
                        
                        stats['month'] = month
                        stats['month_name'] = month_name
                        
                        stats['mean_temp'] = stats['mean_temp'].round(1)
                        stats['min_temp'] = stats['min_temp'].round(1)
                        stats['max_temp'] = stats['max_temp'].round(1)
                        

                        all_neighborhood_temps.extend(stats.to_dict('records'))
                        print(f"  Added temperature data for {len(stats)} neighborhoods in {month_name}")
                    else:
                        print(f"  No neighborhoods matched for {month_name} after spatial join")
                except Exception as e:
                    print(f"  Error in spatial join for {month_name}: {e}")
            else:
                print(f"  No temperature data for {month_name}")
    
    # Save results
    if all_neighborhood_temps:
        try:
            with open(neighborhood_temps_json_path, 'w') as f:
                json.dump(all_neighborhood_temps, f)
            print(f"  Successfully saved {len(all_neighborhood_temps)} records to {neighborhood_temps_json_path}")
            
            # Create DataFrame for future use
            neighborhood_temps_df = pd.DataFrame(all_neighborhood_temps)
        except Exception as e:
            print(f"  Error saving temperature data: {e}")
            neighborhood_temps_df = pd.DataFrame()
    else:
        print("  No neighborhood temperature data was generated")
        neighborhood_temps_df = pd.DataFrame()
else:
    print("  Missing neighborhoods data or df_clean is empty")
    neighborhood_temps_df = pd.DataFrame()

# Prepare neighborhoods GeoJSON dictionary for Altair
if neighborhoods is not None and not neighborhoods.empty:
    try:
        neighborhoods_geo_dict = json.loads(neighborhoods.to_json())
        print("  Prepared neighborhoods GeoJSON for Altair")
    except Exception as e:
        print(f"  Failed to prepare neighborhoods GeoJSON dict: {e}")
        neighborhoods_geo_dict = None
else:
    neighborhoods_geo_dict = None

# Print diagnostic information
print("\n--- DEBUG: Final Status ---")
print(f"neighborhoods is None? {neighborhoods is None}")
if neighborhoods is not None:
    print(f"Number of neighborhoods: {len(neighborhoods)}")
    print(f"'community' in neighborhoods.columns? {'community' in neighborhoods.columns}")

print(f"df_clean is empty? {df_clean.empty}")
if not df_clean.empty:
    temp_count = len(df_clean[df_clean['measurement_type'] == 'Temperature'])
    print(f"Number of temperature readings: {temp_count}")

print(f"neighborhood_temps_df exists? {'neighborhood_temps_df' in locals()}")
if 'neighborhood_temps_df' in locals():
    print(f"neighborhood_temps_df is empty? {neighborhood_temps_df.empty}")
    if not neighborhood_temps_df.empty:
        print(f"Number of rows in neighborhood_temps_df: {len(neighborhood_temps_df)}")

print(f"neighborhoods_geo_dict exists? {'neighborhoods_geo_dict' in locals()}")
if 'neighborhoods_geo_dict' in locals() and neighborhoods_geo_dict is not None:
    print(f"neighborhoods_geo_dict has features? {len(neighborhoods_geo_dict.get('features', []))}")

# Check if the JSON file was created
try:
    with open(neighborhood_temps_json_path, 'r') as f:
        check_data = json.load(f)
    print(f"  Final check: {neighborhood_temps_json_path} contains {len(check_data)} records")
except Exception as e:
    print(f"  Final check: Could not read {neighborhood_temps_json_path}: {e}")

print("Block 8: Data prep for V2 complete.")


Block 8: Preparing data for Linked Choropleth & Bars (V2)...
  Loaded local data/chicago_neighborhoods.json
  Processing temperature data for neighborhoods...
  Filtering temperature data from df_clean...
  Found 411232 valid temperature readings
  No temperature data for January
  No temperature data for February
  Processing 100622 readings for March
  Added temperature data for 2 neighborhoods in March
  Processing 113323 readings for April
  Added temperature data for 2 neighborhoods in April
  Processing 116866 readings for May
  Added temperature data for 2 neighborhoods in May
  Processing 80421 readings for June
  Added temperature data for 2 neighborhoods in June
  Successfully saved 8 records to data/neighborhood_temps.json
  Prepared neighborhoods GeoJSON for Altair

--- DEBUG: Final Status ---
neighborhoods is None? False
Number of neighborhoods: 77
'community' in neighborhoods.columns? True
df_clean is empty? False
Number of temperature readings: 461364
neighborhood_temps

 Create Linked View Implementation (Task 2)

In [33]:
# --- Block 9: Data Prep for Daily Aggregations (Temp, Humid, Precip, Soil) ---

print("\nBlock 9: Preparing daily aggregated data (Temp, Humid, Precip, Soil) for V3, V5...")

# Filter relevant data types from df_clean
temp_data_daily = df_clean[df_clean['measurement_type'] == 'Temperature'].copy()
humid_data_daily = df_clean[df_clean['measurement_type'] == 'RelativeHumidity'].copy()
precip_data_daily = df_clean[df_clean['measurement_type'] == 'CumulativePrecipitation'].copy()
soil_data_daily = df_clean[df_clean['measurement_type'] == 'SoilMoisture'].copy()

# Ensure measurement_value is numeric for all relevant subsets before grouping
for data_subset in [temp_data_daily, humid_data_daily, precip_data_daily, soil_data_daily]:
     if not data_subset.empty:
        data_subset['measurement_value'] = pd.to_numeric(data_subset['measurement_value'], errors='coerce')
        data_subset.dropna(subset=['measurement_value'], inplace=True) # Drop rows where conversion failed or was NaN

# Check if essential data subsets are empty
if temp_data_daily.empty: print("  Warning: No Temperature data after filtering.")
if humid_data_daily.empty: print("  Warning: No RelativeHumidity data after filtering.")
if precip_data_daily.empty: print("  Warning: No CumulativePrecipitation data after filtering.")
if soil_data_daily.empty: print("  Warning: No SoilMoisture data after filtering.")

# Group by day - Use dt.normalize() for consistency (removes time part)
if not temp_data_daily.empty: temp_data_daily['day'] = temp_data_daily['measurement_time'].dt.normalize()
if not humid_data_daily.empty: humid_data_daily['day'] = humid_data_daily['measurement_time'].dt.normalize()
if not precip_data_daily.empty: precip_data_daily['day'] = precip_data_daily['measurement_time'].dt.normalize()
if not soil_data_daily.empty: soil_data_daily['day'] = soil_data_daily['measurement_time'].dt.normalize()

# Aggregate daily values (mean) - handle potential empty dataframes
daily_temp_agg = temp_data_daily.groupby('day')['measurement_value'].mean().reset_index() if not temp_data_daily.empty else pd.DataFrame(columns=['day', 'measurement_value'])
daily_humid_agg = humid_data_daily.groupby('day')['measurement_value'].mean().reset_index() if not humid_data_daily.empty else pd.DataFrame(columns=['day', 'measurement_value'])
daily_soil_agg = soil_data_daily.groupby('day')['measurement_value'].mean().reset_index() if not soil_data_daily.empty else pd.DataFrame(columns=['day', 'measurement_value'])

# Precipitation Handling (Calculate daily change per sensor, then sum)
if not precip_data_daily.empty:
    sensor_daily_precip_agg = precip_data_daily.groupby(['data_stream_id', 'day'])['measurement_value'].max().reset_index()
    sensor_daily_precip_agg = sensor_daily_precip_agg.sort_values(by=['data_stream_id', 'day'])
    sensor_daily_precip_agg['daily_change'] = sensor_daily_precip_agg.groupby('data_stream_id')['measurement_value'].diff().fillna(0)
    # Handle resets: Set negative diff values to 0 (assuming resets are to 0 or a low value)
    sensor_daily_precip_agg.loc[sensor_daily_precip_agg['daily_change'] < 0, 'daily_change'] = 0

    # Aggregate daily change across sensors
    daily_precip_agg = sensor_daily_precip_agg.groupby('day')['daily_change'].sum().reset_index()
    daily_precip_agg['daily_change'] = daily_precip_agg['daily_change'] / 25.4 # mm to inches
    daily_precip_agg.loc[daily_precip_agg['daily_change'] > 3, 'daily_change'] = np.nan # Remove extreme spikes
else:
     daily_precip_agg = pd.DataFrame(columns=['day', 'daily_change'])
     print("  Skipping daily precipitation aggregation due to no precipitation data.")

# Convert 'day' columns to datetime for merging (ensure consistency)
daily_temp_agg['day'] = pd.to_datetime(daily_temp_agg['day'])
daily_humid_agg['day'] = pd.to_datetime(daily_humid_agg['day'])
daily_precip_agg['day'] = pd.to_datetime(daily_precip_agg['day'])
daily_soil_agg['day'] = pd.to_datetime(daily_soil_agg['day'])

# Merge the daily aggregated data into a single DataFrame
daily_env_combined = pd.merge(daily_temp_agg, daily_humid_agg, on='day', how='outer', suffixes=('_temp', '_humid'))
daily_env_combined = pd.merge(daily_env_combined, daily_precip_agg[['day', 'daily_change']], on='day', how='left')
# Merge soil data, ensuring the 'measurement_value' column from soil is kept and renamed
daily_env_combined = pd.merge(daily_env_combined, daily_soil_agg[['day', 'measurement_value']], on='day', how='left') # No suffixes needed if measurement_value is only column from soil_agg

daily_env_combined.rename(columns={
    'measurement_value_temp': 'temperature',
    'measurement_value_humid': 'humidity',
    'daily_change': 'precipitation',
    'measurement_value': 'soil_moisture' # Renaming the measurement_value from soil_agg merge
}, inplace=True)

daily_env_combined.sort_values(by='day', inplace=True)

# Handle potential missing values after outer merge
daily_env_combined['precipitation'].fillna(0, inplace=True) # Assume 0 precip if missing

# Handle soil moisture scaling/cleaning
if 'soil_moisture' in daily_env_combined.columns and not daily_env_combined['soil_moisture'].dropna().empty:
    # Optional: Scale down soil moisture if it exceeds 100 based on max observed value
    max_soil_val = daily_env_combined['soil_moisture'].max()
    if pd.notna(max_soil_val) and max_soil_val > 100:
        print(f"  Scaling soil moisture down from max {max_soil_val:.1f}")
        daily_env_combined['soil_moisture'] = daily_env_combined['soil_moisture'] * (100.0 / max_soil_val)
else:
    print("  Warning: No soil moisture data found for scaling.")
    if 'soil_moisture' not in daily_env_combined.columns: # Ensure column exists even if no data
         daily_env_combined['soil_moisture'] = np.nan

# Filter for March 2017 for V3 (Temp/Humid/Precip/Scatter)
daily_env_march = daily_env_combined[
    (daily_env_combined['day'] >= '2017-03-01') & (daily_env_combined['day'] <= '2017-03-31')
].copy()

# Save data for V3 (March Temp/Humid/Precip/Scatter)
if not daily_env_march.empty:
    try:
        # Only save the columns needed for V3
        daily_env_march[['day', 'temperature', 'humidity', 'precipitation']].to_json('data/daily_env_march.json', orient='records')
        print("  Saved data for V3 (Temp/Humid/Precip/Scatter) to data/daily_env_march.json")
    except Exception as e:
         print(f"  Could not save data for V3: {e}")
else:
     print("  No data for March after aggregation. Skipping V3 data save.")

# Save data for V5 (Soil/Precip Time Series). Filter out days with no soil moisture data.
daily_soil_precip_data = daily_env_combined[['day', 'soil_moisture', 'precipitation']].dropna(subset=['soil_moisture']).copy()
if not daily_soil_precip_data.empty:
    try:
         daily_soil_precip_data.to_json('data/daily_soil_precip.json', orient='records')
         print("  Saved data for V5 (Soil/Precip Time Series) to data/daily_soil_precip.json")
    except Exception as e:
          print(f"  Could not save data for V5: {e}")
else:
     print("  No valid soil moisture data after aggregation. Skipping V5 Time Series data save.")

print("Block 9: Daily aggregated data prep complete.")



Block 9: Preparing daily aggregated data (Temp, Humid, Precip, Soil) for V3, V5...
  Scaling soil moisture down from max 638.5
  Saved data for V3 (Temp/Humid/Precip/Scatter) to data/daily_env_march.json
  Saved data for V5 (Soil/Precip Time Series) to data/daily_soil_precip.json
Block 9: Daily aggregated data prep complete.


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  daily_env_combined['precipitation'].fillna(0, inplace=True) # Assume 0 precip if missing


In [34]:
# --- Block 10: Data Prep for V4 (Daily & Hourly Temp Cycles & Trend) ---

print("\nBlock 10: Preparing data for Daily Cycles & Trend (V4)...")

# Filter Temperature data for months 3-6 from the original df_clean
temp_data_v4 = df_clean[
    (df_clean['measurement_type'] == 'Temperature') &
    (df_clean['measurement_medium'].astype(str).str.lower() == 'atmosphere') &
    (df_clean['units'].astype(str).str.lower().str.contains('celsius', na=False)) &
    (df_clean['measurement_time'].dt.month >= 3) &
    (df_clean['measurement_time'].dt.month <= 6)
].copy()

if temp_data_v4.empty:
    print("  No temperature data found for months 3-6. Skipping V4 prep.")
    hourly_agg = pd.DataFrame()
    daily_temp_trend_data = pd.DataFrame() # Ensure variables are defined
else:
    # Extract time fields
    temp_data_v4['hour'] = temp_data_v4['measurement_time'].dt.hour
    temp_data_v4['month'] = temp_data_v4['measurement_time'].dt.month
    temp_data_v4['day'] = temp_data_v4['measurement_time'].dt.normalize() # Daily grouping

    # Aggregate data: Calculate hourly mean and standard deviation by month for Cycles chart
    hourly_agg = temp_data_v4.groupby(['month', 'hour'])['measurement_value'].agg(
        mean_temp='mean',
        std_temp='std'
    ).reset_index()

    if not hourly_agg.empty:
        hourly_agg['std_temp'] = hourly_agg['std_temp'].fillna(0) # Replace NaN std (single measurement hour) with 0
        hourly_agg['lower_band'] = hourly_agg['mean_temp'] - hourly_agg['std_temp']
        hourly_agg['upper_band'] = hourly_agg['mean_temp'] + hourly_agg['std_temp']
        hourly_agg['month_name'] = hourly_agg['month'].map(month_names_map)
        print(f"  Aggregated hourly temperature data for {len(hourly_agg)} month-hour pairs.")
    else:
         print("  No data to aggregate hourly temperature cycles.")

    # Aggregate data: Calculate daily mean, min, max for Trend chart
    # Using the same filtered temp_data_v4 as base
    daily_temp_trend_data = temp_data_v4.groupby('day')['measurement_value'].agg(['mean', 'min', 'max']).reset_index()
    if not daily_temp_trend_data.empty:
        daily_temp_trend_data.columns = ['day', 'mean_temp', 'min_temp', 'max_temp']
        daily_temp_trend_data['day'] = pd.to_datetime(daily_temp_trend_data['day'])
        daily_temp_trend_data['month'] = daily_temp_trend_data['day'].dt.month # Add month for filtering later
        daily_temp_trend_data['month_name'] = daily_temp_trend_data['month'].map(month_names_map)
        print(f"  Aggregated daily temperature data for {len(daily_temp_trend_data)} days.")
    else:
        print("  No data to aggregate daily temperature trend.")


# Save data for V4 Cycles
if 'hourly_agg' in globals() and not hourly_agg.empty:
    try:
        hourly_agg[['month', 'hour', 'mean_temp', 'std_temp', 'lower_band', 'upper_band', 'month_name']].to_json('data/hourly_temp_cycles.json', orient='records')
        print("  Saved data for V4 (Daily Cycles) to data/hourly_temp_cycles.json")
    except Exception as e:
         print(f"  Could not save data for V4 Daily Cycles: {e}")
else:
     print("  No data to save for V4 Daily Cycles.")

# Save data for V4 Trend
if 'daily_temp_trend_data' in globals() and not daily_temp_trend_data.empty:
    try:
        daily_temp_trend_data[['day', 'mean_temp', 'min_temp', 'max_temp', 'month_name']].to_json('data/daily_temp_trend.json', orient='records')
        print("  Saved data for V4 (Daily Trend) to data/daily_temp_trend.json")
    except Exception as e:
         print(f"  Could not save data for V4 Daily Trend: {e}")
else:
     print("  No data to save for V4 Daily Trend.")


print("Block 10: Data prep for V4 complete.")


Block 10: Preparing data for Daily Cycles & Trend (V4)...
  Aggregated hourly temperature data for 96 month-hour pairs.
  Aggregated daily temperature data for 106 days.
  Saved data for V4 (Daily Cycles) to data/hourly_temp_cycles.json
  Saved data for V4 (Daily Trend) to data/daily_temp_trend.json
Block 10: Data prep for V4 complete.


In [36]:
# --- Block 11: Data Prep for V5 Lag Correlation ---

print("\nBlock 11: Preparing data for Soil Moisture Lag Correlation (V5 Part 2)...")

# Ensure daily_env_combined exists and has necessary columns
if 'daily_env_combined' in globals() and \
   'soil_moisture' in daily_env_combined.columns and \
   'precipitation' in daily_env_combined.columns and \
   not daily_env_combined[['soil_moisture', 'precipitation']].dropna().empty:

    max_lag = 7 # Define maximum lag in days

    # Use a copy and ensure 'day' is datetime and set as index for reliable shifting
    lag_corr_df_base = daily_env_combined[['day', 'soil_moisture', 'precipitation']].dropna(subset=['soil_moisture', 'precipitation']).copy()
    if lag_corr_df_base.empty:
        print("  No overlapping soil moisture and precipitation data for lag correlation.")
        lag_corr_plot_df = pd.DataFrame() # Ensure df is empty
    else:
        lag_corr_df_base['day'] = pd.to_datetime(lag_corr_df_base['day'])
        lag_corr_df_base = lag_corr_df_base.set_index('day').sort_index()

        lag_correlations = []
        valid_lags = []

        for lag in range(max_lag + 1):
            # Shift precipitation by the lag amount
            corr_df = lag_corr_df_base.copy()
            corr_df['precip_lag'] = corr_df['precipitation'].shift(lag)

            # Drop any rows that now have NaN due to shifting or original missing values
            corr_df_cleaned = corr_df.dropna()

            # Need at least 2 data points to calculate correlation
            if len(corr_df_cleaned) > 1:
                try:
                    # Calculate Pearson correlation coefficient
                    correlation = np.corrcoef(corr_df_cleaned['soil_moisture'], corr_df_cleaned['precip_lag'])[0, 1]
                    lag_correlations.append(correlation)
                    valid_lags.append(lag)
                except Exception as e:
                     print(f"  Warning: Could not compute correlation for lag {lag}: {e}")
            else:
                 # print(f"  Not enough data points (>1) after cleaning to calculate correlation for lag {lag}. Found {len(corr_df_cleaned)}.")
                 pass # Skip lags with insufficient data silently unless debugging


        # Create DataFrame for plotting correlations
        lag_corr_plot_df = pd.DataFrame({
            'lag': valid_lags,
            'correlation': lag_correlations
        }).dropna() # Drop any NaN correlations that might have slipped through (unlikely with np.corrcoef if input is valid)

    # Save the correlation data
    if 'lag_corr_plot_df' in globals() and not lag_corr_plot_df.empty:
        try:
            lag_corr_plot_df.to_json('data/soil_precip_lag_correlation.json', orient='records')
            print("  Saved data for V5 Lag Correlation to data/soil_precip_lag_correlation.json")
        except Exception as e:
            print(f"  Could not save data for V5 Lag Correlation: {e}")
    else:
        print("  No data to save for V5 Lag Correlation after calculation.")

else:
     print("Block 11: Skipping V5 Lag Correlation data prep due to missing daily_env_combined or essential columns/data.")
     lag_corr_plot_df = pd.DataFrame() # Ensure variable is defined

print("Block 11: Data prep for V5 Lag Correlation complete.")


Block 11: Preparing data for Soil Moisture Lag Correlation (V5 Part 2)...
  Saved data for V5 Lag Correlation to data/soil_precip_lag_correlation.json
Block 11: Data prep for V5 Lag Correlation complete.


In [37]:
# --- Block 12: Data Prep for V6 (Temp vs Wind Speed) ---

print("\nBlock 12: Preparing data for Temperature vs Wind Speed Scatter Plot (V6)...")

# Filter Wind Speed and Temperature data from df_clean
wind_speed_v6 = df_clean[df_clean['measurement_type'] == 'WindSpeed'].copy()
temp_data_v6 = df_clean[df_clean['measurement_type'] == 'Temperature'].copy()

# Ensure measurement_value is numeric, coercing errors
wind_speed_v6['measurement_value'] = pd.to_numeric(wind_speed_v6['measurement_value'], errors='coerce')
temp_data_v6['measurement_value'] = pd.to_numeric(temp_data_v6['measurement_value'], errors='coerce')
wind_speed_v6.dropna(subset=['measurement_value', 'latitude', 'longitude'], inplace=True) # Need location for grouping
temp_data_v6.dropna(subset=['measurement_value', 'latitude', 'longitude'], inplace=True) # Need location for grouping

if wind_speed_v6.empty: print("  Warning: No WindSpeed data after filtering/cleaning.")
if temp_data_v6.empty: print("  Warning: No Temperature data after filtering/cleaning for V6.")

# Group by day AND location (average daily values per sensor/location)
if not wind_speed_v6.empty: wind_speed_v6['day'] = wind_speed_v6['measurement_time'].dt.normalize()
if not temp_data_v6.empty: temp_data_v6['day'] = temp_data_v6['measurement_time'].dt.normalize()

wind_speed_daily = wind_speed_v6.groupby(['day', 'latitude', 'longitude'])['measurement_value'].mean().reset_index() if not wind_speed_v6.empty else pd.DataFrame()
temp_daily_v6 = temp_data_v6.groupby(['day', 'latitude', 'longitude'])['measurement_value'].mean().reset_index() if not temp_data_v6.empty else pd.DataFrame()

# Convert 'day' columns to datetime
if not wind_speed_daily.empty: wind_speed_daily['day'] = pd.to_datetime(wind_speed_daily['day'])
if not temp_daily_v6.empty: temp_daily_v6['day'] = pd.to_datetime(temp_daily_v6['day'])

# Merge the daily wind speed and temperature data by day and location
# Use 'inner' merge to only keep days/locations with both measurements
wind_temp = pd.merge(
    wind_speed_daily,
    temp_daily_v6,
    on=['day', 'latitude', 'longitude'],
    how='inner', # Ensure we have both temp and wind for each location on each day
    suffixes=('_wind', '_temp')
)

if wind_temp.empty:
    print("  DataFrame is empty after merging temperature and wind speed data. Cannot proceed with V6 prep.")
else:
    print(f"  Merged daily Temp and Wind data: {len(wind_temp)} rows.")

    # --- IQR Filtering on the Merged Data ---
    initial_rows_wind_temp = len(wind_temp)
    # Filter temperature outliers
    if len(wind_temp['measurement_value_temp']) > 1: # Need at least 2 for quantile
        Q1_temp = wind_temp['measurement_value_temp'].quantile(0.25)
        Q3_temp = wind_temp['measurement_value_temp'].quantile(0.75)
        IQR_temp = Q3_temp - Q1_temp
        lower_bound_temp = Q1_temp - 1.5 * IQR_temp
        upper_bound_temp = Q3_temp + 1.5 * IQR_temp
        wind_temp = wind_temp[(wind_temp['measurement_value_temp'] >= lower_bound_temp) &\
                              (wind_temp['measurement_value_temp'] <= upper_bound_temp)].copy()
    else:
        print("  Not enough Temp data for IQR filtering.")

    # Filter wind speed outliers
    if len(wind_temp['measurement_value_wind']) > 1: # Need at least 2 for quantile
        Q1_wind = wind_temp['measurement_value_wind'].quantile(0.25)
        Q3_wind = wind_temp['measurement_value_wind'].quantile(0.75)
        IQR_wind = Q3_wind - Q1_wind
        lower_bound_wind = Q1_wind - 1.5 * IQR_wind
        upper_bound_wind = Q3_wind + 1.5 * IQR_wind
        wind_temp = wind_temp[(wind_temp['measurement_value_wind'] >= lower_bound_wind) &\
                              (wind_temp['measurement_value_wind'] <= upper_bound_wind)].copy()
    else:
        print("  Not enough Wind data for IQR filtering.")

    print(f"  Removed {initial_rows_wind_temp - len(wind_temp)} rows during IQR filtering.")

    # Check if data remains after IQR filtering
    if wind_temp.empty:
        print("  DataFrame is empty after IQR filtering. Cannot proceed with V6 prep.")
    else:
        # --- Filter Data to Only Focus on March ---
        wind_temp['month'] = wind_temp['day'].dt.month
        wind_temp = wind_temp[wind_temp['month'] == 3].copy() # Filter for March (month 3)

        # --- Add missing temperature range manually ---
        march_temp_range = (-8.3, 27.8)  # from your previous sensor_ranges_agg for March

        if march_temp_range:
            min_temp, max_temp = march_temp_range
            initial_rows_march = len(wind_temp)
            wind_temp = wind_temp[(wind_temp['measurement_value_temp'] >= min_temp) &\
                                  (wind_temp['measurement_value_temp'] <= max_temp)].copy()
            print(f"  Removed {initial_rows_march - len(wind_temp)} rows outside March temp range ({min_temp}°C to {max_temp}°C).")

        # Check if data remains for March
        if wind_temp.empty:
            print("  No data remaining for March after filtering. Cannot proceed with V6 prep.")
        else:
            # Create a field for the date labels directly in the dataframe
            wind_temp['date_label'] = wind_temp['day'].dt.strftime('%m/%d')

            # Save data for V6
            try:
                # Only save necessary columns
                wind_temp[['day', 'latitude', 'longitude', 'measurement_value_temp', 'measurement_value_wind', 'date_label']].to_json('data/temp_wind_daily.json', orient='records')
                print("  Saved data for V6 (Temp vs Wind) to data/temp_wind_daily.json")
            except Exception as e:
                print(f"  Could not save data for V6: {e}")

print("Block 12: Data prep for V6 complete.")



Block 12: Preparing data for Temperature vs Wind Speed Scatter Plot (V6)...
  Merged daily Temp and Wind data: 40 rows.
  Removed 0 rows during IQR filtering.
  Removed 40 rows outside March temp range (-8.3°C to 27.8°C).
  No data remaining for March after filtering. Cannot proceed with V6 prep.
Block 12: Data prep for V6 complete.


In [38]:
# --- Block 13: Generate Altair Spec for V1 (Monthly Boxplots) ---

print("\nBlock 13: Generating Altair spec for V1 (Monthly Boxplots)...")

# Data is loaded in JS from 'data/monthly_temp_for_boxplot.json'
# Define order and color scale based on *expected* months (3-6) for consistency,
# but make sure they are present in the data if possible for the sort.
# Using the list from Block 7's data prep.
month_order_v1 = [month_names_map[m] for m in range(3, 7) if m in range(3,7)] # Define full range
color_scheme_v1 = alt.Scale(domain=month_order_v1, range=['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728']) # Range length should match domain if possible

# Create the boxplot specification using a data URL
boxplot_v1 = alt.Chart(alt.Data(url='data/monthly_temp_for_boxplot.json')).mark_boxplot(
    extent='min-max', # equivalent to showfliers=False
    median={'color': 'white'} # White line for median
).encode(
    x=alt.X('month_name:N', title='Month', sort=month_order_v1), # Sort by defined order
    y=alt.Y('measurement_value:Q', title='Temperature (°C)'),
    color=alt.Color('month_name:N', scale=color_scheme_v1, legend=None), # Color by month, no legend needed
    tooltip=[
        alt.Tooltip('month_name:N', title='Month'),
        # Note: Tooltips on boxplots in Vega-Lite often show aggregated stats implicitly or need explicit aggregation in transform
        # Basic tooltip below might show raw value if data is not aggregated before plotting
        alt.Tooltip('measurement_value:Q', title='Temperature (°C)', format='.1f')
    ]
).properties(
    title='Monthly Temperature Distribution in Chicago (Mar-Jun 2017)'
)

# Save the chart specification
try:
    boxplot_v1.save('specs/monthly_boxplot.json')
    print("  Saved V1 (Monthly Boxplot) spec to specs/monthly_boxplot.json")
except Exception as e:
    print(f"  Could not save V1 spec: {e}")

print("Block 13: V1 spec generation complete.")


Block 13: Generating Altair spec for V1 (Monthly Boxplots)...
  Saved V1 (Monthly Boxplot) spec to specs/monthly_boxplot.json
Block 13: V1 spec generation complete.


In [43]:
# --- Block 14: Generate Altair Spec for V2 (Linked Choropleth & Bars) ---

print("\nBlock 14: Generating Altair spec for V2 (Linked Choropleth/Bars)...")

# Data sources are data/chicago_neighborhoods.json (for map shapes) and data/neighborhood_temps.json (for lookup)
# neighborhoods_geo_dict was prepared in Block 8

if 'neighborhoods_geo_dict' in globals() and neighborhoods_geo_dict and \
   'neighborhood_temps_df' in globals() and not neighborhood_temps_df.empty: # Ensure necessary data exists

    selection_v2 = alt.selection_single(fields=["community"], empty="none", on="click", name="sel_community_v2")

    # Choropleth Map (using InlineData for the GeoJSON features)
    choropleth_v2 = alt.Chart(alt.InlineData(values=neighborhoods_geo_dict['features'])).mark_geoshape(
        stroke='white',
        strokeWidth=0.5
    ).encode(
        color=alt.condition(
            selection_v2, # Condition based on the selection
            alt.Color('properties.mean_temp:Q',
                      title='Mean Temperature (°C)',
                      # Use a diverging scheme centered around a plausible April temp midpoint
                      scale=alt.Scale(scheme='redblue', reverse=True, domainMid=10)), # Adjust midpoint as needed
            alt.value('lightgray') # Color for non-selected neighborhoods
        ),
        opacity=alt.condition(selection_v2, alt.value(1.0), alt.value(0.7)), # Highlight selected
        tooltip=[
            alt.Tooltip('properties.community:N', title='Neighborhood'),
            alt.Tooltip('properties.mean_temp:Q', title='Mean Temp (°C)', format='.1f'),
            alt.Tooltip('properties.min_temp:Q', title='Min Temp (°C)', format='.1f'),
            alt.Tooltip('properties.max_temp:Q', title='Max Temp (°C)', format='.1f')
        ]
    ).transform_lookup(
        # Lookup from the *external* neighborhood_temps.json data source
        lookup='properties.community', # Match lookup field in GeoJSON features
        from_=alt.LookupData(
            data=alt.Data(url='data/neighborhood_temps.json'), # Use data source URL
            key='community', # Match key field in the lookup data
            fields=['mean_temp', 'min_temp', 'max_temp', 'month_name'] # Fields to bring into the GeoJSON data
        ),
        as_=['properties.mean_temp', 'properties.min_temp', 'properties.max_temp', 'properties.month_name'] # How to name the new fields
    ).transform_calculate(
        # Create top-level field 'community' from properties for selection (Vega-Lite specific need for selections)
        community="datum.properties.community"
    ).transform_filter(
        # Filter map to show only April data for this specific view (matches report's focus)
        alt.datum.properties.month_name == 'April'
    ).add_params(
        selection_v2 # Add the selection parameter to the chart
    ).properties(
        title='Mean Temperature by Chicago Neighborhood (April)'
    )


    # Bar Chart (using the same lookup data source but filtered by selection)
    bar_chart_v2 = alt.Chart(alt.Data(url='data/neighborhood_temps.json')).transform_filter(
         # Filter data source by Month AND filter by map selection
        (alt.datum.month_name == 'April') & selection_v2
    ).transform_fold(
        fold=['min_temp', 'mean_temp', 'max_temp'], # Metrics to fold
        as_=['temp_metric', 'temp_value'] # New columns for folded data
    ).mark_bar().encode(
        x=alt.X('temp_metric:N', title='Temperature Metric', sort=['min_temp', 'mean_temp', 'max_temp'], # Sort bars logically
                axis=alt.Axis(labelAngle=0, labelExpr="replace(datum.value, '_temp', '')")), # Clean up labels
        y=alt.Y('temp_value:Q', title='Temperature (°C)', scale=alt.Scale(zero=False)), # Scale adapted to data range
        color=alt.Color('temp_metric:N', legend=None, scale=alt.Scale(domain=['min_temp', 'mean_temp', 'max_temp'], range=['lightblue', 'orange', 'firebrick'])), # Specific colors
        tooltip=[
            alt.Tooltip('community:N', title='Neighborhood'), # 'community' field available after filter
            alt.Tooltip('temp_metric:N', title='Metric'),
            alt.Tooltip('temp_value:Q', title='Value (°C)', format='.1f')
        ]
    ).properties(
        height=220,
        # Dynamic title using expression from selection_v2 params
         title=alt.TitleParams(
            text=alt.expr(
                 # Check if the selection is valid (a neighborhood is selected)
                 'isValid(sel_community_v2.community) ? "Temperature Metrics for " + sel_community_v2.community[0] : "Temperature Metrics (Select a Neighborhood)"'
            ),
            subtitle="Min, Mean, and Max Temperature for April",
            anchor='middle',
            offset=10
        )
    )

    # Combine the charts vertically
    linked_view_v2 = alt.vconcat(
        choropleth_v2,
        bar_chart_v2,
        spacing=25 # Space between the two charts
    ).resolve_scale(
        color="independent" # Allow each chart to manage its own color scale
    ).configure_view(stroke=None) # Remove border around the concatenated view

    # Save the chart specification
    try:
        linked_view_v2.save('specs/choropleth_linked_bars.json')
        print("  Saved V2 (Linked Choropleth/Bars) spec to specs/choropleth_linked_bars.json")
    except Exception as e:
        print(f"  Could not save V2 spec: {e}")
else:
    print("Block 14: Skipping V2 spec generation due to missing data (neighborhoods GeoJSON or aggregated temperatures).")

print("Block 14: V2 spec generation complete.")


Block 14: Generating Altair spec for V2 (Linked Choropleth/Bars)...


Deprecated since `altair=5.0.0`. Use selection_point instead.
  selection_v2 = alt.selection_single(fields=["community"], empty="none", on="click", name="sel_community_v2")


AttributeError: 'GetAttrExpression' object has no attribute 'month_name'

In [42]:
# --- Block 15: Generate Altair Spec for V3 (Linked Time Series -> Scatter Filter) ---

print("\nBlock 15: Generating Altair spec for V3 (Linked Time Series -> Scatter Filter)...")

# Data source is data/daily_env_march.json (Daily data for March)
# Ensure daily_env_march has data (checked in Block 9)

# Define a selection that will capture a time interval on the x-axis
time_brush_scatter_link_v3 = alt.selection_interval(encodings=['x'], name='select_time_for_scatter_v3')

# Base chart for shared X-axis
base_ts_scatter_link_v3 = alt.Chart(alt.Data(url='data/daily_env_march.json')).encode(
    x=alt.X('day:T', title='Date (Brush to Select Range)', axis=alt.Axis(format='%a %d', labelAngle=0, grid=True))
)

# Temperature line (Left Y-axis, Red)
temp_line_scatter_link_v3 = base_ts_scatter_link_v3.mark_line(point=False, strokeWidth=2, color='red').encode(
    y=alt.Y('temperature:Q', title='Temperature (°C)', axis=alt.Axis(titleColor='red', titlePadding=10))
).properties(height=150, title='Temperature and Humidity Over Time (March 2017)') # Add title here

# Humidity line (Right Y-axis, Blue)
humid_line_scatter_link_v3 = base_ts_scatter_link_v3.mark_line(point=False, strokeWidth=2, color='blue').encode(
    y=alt.Y('humidity:Q', title='Relative Humidity (%)', axis=alt.Axis(orient='right', titleColor='blue', titlePadding=10))
)

# Layer the lines for the time series panel and add the brush selection
time_series_panel_scatter_link_v3 = alt.layer(
    temp_line_scatter_link_v3,
    humid_line_scatter_link_v3
).resolve_scale(
    y='independent' # Independent Y-axes for temperature and humidity
).add_params( # Add the interval selection parameter
    time_brush_scatter_link_v3
)

# Scatter Plot (Temp vs Humidity), filtered by the brush selection
scatter_panel_scatter_link_v3 = alt.Chart(alt.Data(url='data/daily_env_march.json')).mark_point(
    opacity=0.6,
    filled=True,
    color='green' # Use a distinct color for scatter points
).encode(
    x=alt.X('temperature:Q', title='Temperature (°C)', scale=alt.Scale(zero=False)), # Scale adapted
    y=alt.Y('humidity:Q', title='Relative Humidity (%)', scale=alt.Scale(zero=False)), # Scale adapted
    tooltip=[ # Add tooltips for hovering
        alt.Tooltip('day:T', format='%Y-%m-%d', title='Date'),
        alt.Tooltip('temperature:Q', format='.1f', title='Temp (°C)'),
        alt.Tooltip('humidity:Q', format='.0f', title='Humidity (%)'),
        alt.Tooltip('precipitation:Q', format='.2f', title='Precip (in)') # Include precip info
    ]
).transform_filter( # Filter data based on the brush selection
    time_brush_scatter_link_v3
).properties(
    height=300,
    title='Temperature vs. Humidity Relationship (for selected period)' # Dynamic title could be added here based on selection
)

# Combine the Time Series and Scatter Plot vertically
linked_view_v3 = alt.vconcat(
    time_series_panel_scatter_link_v3,
    scatter_panel_scatter_link_v3,
    spacing=15 # Space between the charts
).configure_axis( # Global axis configurations
    grid=True, gridColor='lightgray', labelFontSize=10, titleFontSize=12
).configure_title( # Chart title configuration
    fontSize=14, anchor='middle'
).configure_view( # Remove border around the concatenated view
    stroke=None
)

# Save the chart specification
try:
    linked_view_v3.save('specs/timeseries_scatter_linked.json')
    print("  Saved V3 (Time Series Brush -> Scatter Filter) spec to specs/timeseries_scatter_linked.json")
except Exception as e:
    print(f"  Could not save V3 spec: {e}")

print("Block 15: V3 spec generation complete.")


Block 15: Generating Altair spec for V3 (Linked Time Series -> Scatter Filter)...
  Saved V3 (Time Series Brush -> Scatter Filter) spec to specs/timeseries_scatter_linked.json
Block 15: V3 spec generation complete.


In [None]:
# --- Block 16: Generate Altair Spec for V4 (Linked Daily Cycles & Trend) ---

print("\nBlock 16: Generating Altair spec for V4 (Linked Daily Cycles & Trend)...")

# Data sources are data/hourly_temp_cycles.json and data/daily_temp_trend.json
# Ensure hourly_agg and daily_temp_trend_data had data during prep (checked in Block 10)

# Determine the order for months Mar-Jun based on expected months
month_order_v4 = [month_names_map[m] for m in range(3, 7)]
# Define color scheme for the months
color_scheme_v4 = alt.Scale(domain=month_order_v4, range=['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728']) # Blue, Orange, Green, Red

# Define the selection for clicking on a month in the cycles chart/legend
month_cycle_select_v4 = alt.selection_point(fields=['month_name'], empty='all', on='click', name='select_month_v4')

# Chart 1: Daily Cycles (shows average hourly patterns by month)
base_cycle_v4 = alt.Chart(alt.Data(url='data/hourly_temp_cycles.json')).encode(
    x=alt.X('hour:O', title='Hour of Day', # Treat hour as Ordinal for discrete steps
            axis=alt.Axis(labelAngle=0, values=list(range(0, 24, 2)), grid=True)), # Show labels every 2 hours
    color=alt.Color('month_name:N', # Color by month name
                    scale=color_scheme_v4, # Use the defined color scheme
                    sort=month_order_v4, # Ensure months are sorted correctly
                    legend=alt.Legend(title="Month (Click Legend/Line)")), # Add a legend with instructions
    opacity=alt.condition(month_cycle_select_v4, alt.value(0.7), alt.value(0.2)) # Fade non-selected months
)

# Layer 1.1: Error Bands using mark_area (+/- 1 Std Dev)
error_bands_cycle_v4 = base_cycle_v4.mark_area(opacity=0.2).encode(
    y=alt.Y('lower_band:Q', title='Temperature (°C)', axis=alt.Axis(titlePadding=10)), # Y for lower bound
    y2=alt.Y2('upper_band:Q') # Y2 for upper bound
)

# Layer 1.2: Mean Line
mean_line_cycle_v4 = base_cycle_v4.mark_line(point=False, strokeWidth=2).encode( # Point=False to avoid default points
    y=alt.Y('mean_temp:Q'), # Y for the mean line
    strokeWidth=alt.condition(month_cycle_select_v4, alt.value(4), alt.value(2)), # Make selected line thicker
    tooltip=[ # Tooltips for hovering over points/lines
        alt.Tooltip('month_name:N', title='Month'),
        alt.Tooltip('hour:O', title='Hour'),
        alt.Tooltip('mean_temp:Q', format='.1f', title='Avg Temp (°C)'),
        alt.Tooltip('std_temp:Q', format='.1f', title='Std Dev (°C)')
    ]
)

# Combine layers for the first chart (Daily Cycles)
daily_cycle_chart_linked_v4 = alt.layer(
    error_bands_cycle_v4,
    mean_line_cycle_v4
).add_params( # Add the selection parameter to make this chart interactive
    month_cycle_select_v4
).properties(
    height=300,
    title='Daily Temperature Cycles by Month (Select a Month Below)'
)


# Chart 2: Daily Trend (shows daily averages over the study period)
base_trend_filtered_v4 = alt.Chart(alt.Data(url='data/daily_temp_trend.json')).encode(
    x=alt.X('day:T', title='Date', axis=alt.Axis(format='%b %d', labelAngle=-45, grid=True)) # Format date nicely
).transform_filter( # Filter the data for this chart based on the selection from Chart 1
    month_cycle_select_v4
)

# Layer 2.1: Mean Trend Line
line_trend_filtered_v4 = base_trend_filtered_v4.mark_line(
    color='#1A759F', # Consistent color for the trend line
    strokeWidth=2
).encode(
    y=alt.Y('mean_temp:Q', title='Daily Avg Temp (°C)', axis=alt.Axis(titlePadding=10))
)

# Layer 2.2: Points on the Trend Line
points_trend_filtered_v4 = base_trend_filtered_v4.mark_point(
    filled=True, # Filled points
    color='#1A759F', # Same color as the line
    size=60 # Point size
).encode(
    y=alt.Y('mean_temp:Q'), # Y position matches the line
    tooltip=[ # Tooltips for points
        alt.Tooltip('day:T', title='Date', format='%b %d'),
        alt.Tooltip('month_name:N', title='Month'),
        alt.Tooltip('mean_temp:Q', title='Avg Temp (°C)', format='.1f'),
        alt.Tooltip('min_temp:Q', title='Min Temp (°C)', format='.1f'),
        alt.Tooltip('max_temp:Q', title='Max Temp (°C)', format='.1f')
    ]
)

# Combine layers for the second chart (Daily Trend)
daily_trend_chart_linked_v4 = alt.layer(
    line_trend_filtered_v4,
    points_trend_filtered_v4
).properties(
    height=250,
    title='Daily Average Temperatures for Selected Month' # Static title, context from upper chart
)

# Combine the two charts vertically
linked_view_v4 = alt.vconcat(
    daily_cycle_chart_linked_v4,
    daily_trend_chart_linked_v4,
    spacing=20 # Space between charts
).resolve_legend( # Ensure legends are handled correctly
    color="independent", # Color legend from the top chart
    strokeWidth="independent" # Stroke width legend (if used) is also independent
).configure_axis( # Global axis configurations
    grid=True, gridColor='lightgray', labelFontSize=10, titleFontSize=12
).configure_title( # Global title configurations
    fontSize=14, anchor='middle'
).configure_view( # Remove border around the concatenated view
    stroke=None
).interactive() # Enable zooming/panning on the combined view


# Save the chart specification
try:
    linked_view_v4.save('specs/cycles_trend_linked.json')
    print("  Saved V4 (Linked Cycles/Trend) spec to specs/cycles_trend_linked.json")
except Exception as e:
    print(f"  Could not save V4 spec: {e}")

print("Block 16: V4 spec generation complete.")


Block 16: Generating Altair spec for V4 (Linked Daily Cycles & Trend)...
  Saved V4 (Linked Cycles/Trend) spec to specs/cycles_trend_linked.json
Block 16: V4 spec generation complete.


In [27]:
# --- Block 17: Generate Altair Spec for V5 Time Series (Focus + Context) ---

print("\nBlock 17: Generating Altair spec for V5 Time Series (Focus + Context)...")

# Data source is data/daily_soil_precip.json
# Ensure daily_soil_precip_data had data during prep (checked in Block 9)

# Define a selection that will capture a time interval on the x-axis of the context chart
time_brush_v5 = alt.selection_interval(encodings=['x'], name='time_brush_v5')

# Base chart for shared X-axis property (even though scales might resolve independently)
base_v5_ts = alt.Chart(alt.Data(url='data/daily_soil_precip.json')).properties(width=700)

# Lower Chart: Precipitation Context (with Brush)
precip_context_v5 = base_v5_ts.mark_bar(color='steelblue', opacity=0.7).encode(
    x=alt.X('day:T', title='Date (Brush to Select Range)', axis=alt.Axis(format='%b %d', grid=True)), # X-axis for context
    y=alt.Y('precipitation:Q', title='Daily Precip (in)', axis=alt.Axis(titleColor='steelblue', titlePadding=10)), # Y-axis for precipitation
    tooltip=[ # Tooltip for interaction
        alt.Tooltip('day:T', format='%Y-%m-%d', title='Date'),
        alt.Tooltip('precipitation:Q', format='.2f', title='Precip (in)')
    ]
).add_params( # Add the brush selection to this chart
    time_brush_v5
).properties(
    height=80, # Make context chart shorter
    title="Precipitation (Select Date Range Below)"
)

# Upper Chart: Soil Moisture Detail (Filtered by Brush)
soil_detail_v5 = base_v5_ts.mark_line(point=True, color='saddlebrown', strokeWidth=2).encode(
    x=alt.X('day:T', title=None, axis=alt.Axis(labels=False, grid=True)), # Hide x-axis labels, but keep grid
    y=alt.Y('soil_moisture:Q', title='Soil Moisture (% Max Scaled)', axis=alt.Axis(titleColor='saddlebrown', titlePadding=10)), # Y-axis for soil moisture
    tooltip=[ # Tooltip for interaction
        alt.Tooltip('day:T', format='%Y-%m-%d', title='Date'),
        alt.Tooltip('soil_moisture:Q', format='.1f', title='Soil Moisture (%)')
    ]
).transform_filter( # Filter this chart based on the brush selection from the context chart
    time_brush_v5
).properties(
    height=300,
    title='Soil Moisture Response'
)

# Combine the Charts Vertically
viz1_interactive_v5 = alt.vconcat(
    soil_detail_v5,
    precip_context_v5,
    spacing=5 # Minimal spacing between focus and context
).resolve_scale(
    x='independent', # Allow independent x-axis zoom/pan (brush still links them)
    y='independent' # Allow independent Y scales for soil and precip
).configure_axis( # Global axis configuration
    grid=True, gridColor='lightgray', labelFontSize=10, titleFontSize=12
).configure_title( # Global title configuration
    fontSize=14, anchor='middle'
).configure_view( # Remove border around individual charts
    stroke=None
).interactive() # Enable interactivity (pan/zoom) on the combined view


# Save the chart specification
try:
    viz1_interactive_v5.save('specs/soil_precip_interactive.json')
    print("  Saved V5 (Soil/Precip Time Series) spec to specs/soil_precip_interactive.json")
except Exception as e:
    print(f"  Could not save V5 Time Series spec: {e}")

print("Block 17: V5 Time Series spec generation complete.")


Block 17: Generating Altair spec for V5 Time Series (Focus + Context)...
  Saved V5 (Soil/Precip Time Series) spec to specs/soil_precip_interactive.json
Block 17: V5 Time Series spec generation complete.


In [28]:
# --- Block 18: Generate Altair Spec for V5 Lag Correlation ---

print("\nBlock 18: Generating Altair spec for V5 Lag Correlation...")

# Data source is data/soil_precip_lag_correlation.json
# Ensure lag_corr_plot_df had data during prep (checked in Block 11)

if 'lag_corr_plot_df' in globals() and not lag_corr_plot_df.empty:

    # Find peak correlation details for annotation
    # Handle cases where max correlation might be negative (e.g., if most correlations are negative)
    # Find the index of the maximum *absolute* correlation if direction doesn't matter,
    # or just the maximum correlation value if positive correlation is expected.
    # Let's stick to finding the index of the overall maximum correlation value.
    if not lag_corr_plot_df.empty:
        peak_idx_v5 = lag_corr_plot_df['correlation'].idxmax()
        peak_lag_v5 = lag_corr_plot_df.loc[peak_idx_v5, 'lag']
        peak_corr_v5 = lag_corr_plot_df.loc[peak_idx_v5, 'correlation']
        peak_text_v5 = f'Peak correlation at {peak_lag_v5} day lag: {peak_corr_v5:.3f}'
        print(f"  Peak correlation calculated: {peak_text_v5}")
    else:
        peak_lag_v5 = 0 # Default if no data
        peak_corr_v5 = 0
        peak_text_v5 = 'No correlation data available'
        print("  No correlation data available to calculate peak.")


    # Base chart for the correlation plot
    base_corr_v5 = alt.Chart(alt.Data(url='data/soil_precip_lag_correlation.json')).properties(
         width=500, # Adjust width as needed
         height=300, # Adjust height as needed
         title='Correlation Between Soil Moisture and Lagged Precipitation'
    )

    # Line and points representing the correlation values by lag
    line_corr_v5 = base_corr_v5.mark_line(point=True, color='#B5179E', strokeWidth=2, size=80).encode(
        x=alt.X('lag:O', title='Lag (days)', axis=alt.Axis(labelAngle=0)), # Ordinal axis for discrete lags
        y=alt.Y('correlation:Q', title='Correlation Coefficient', scale=alt.Scale(zero=True)), # Ensure scale includes zero
        tooltip=[ # Tooltip for interaction
            alt.Tooltip('lag:O', title='Lag (days)'),
            alt.Tooltip('correlation:Q', format='.3f', title='Correlation')
        ]
    )

    # Horizontal rule at y=0 for reference
    zero_rule_v5 = alt.Chart(pd.DataFrame({'y': [0]})).mark_rule(color='black', strokeDash=[3,3], strokeWidth=1).encode(y='y')

    # Annotation Text displaying the peak correlation info
    # Create a dummy DataFrame for the annotation text position
    # Position it relative to the data range or at a fixed point.
    # Placing it slightly below the minimum correlation value seems reasonable.
    annotation_y_pos = lag_corr_plot_df['correlation'].min() # Start slightly below the minimum
    if annotation_y_pos > -0.5: annotation_y_pos = -0.5 # Ensure it's visible if all correlations are high

    annotation_df_v5 = pd.DataFrame({'lag': [peak_lag_v5], 'correlation': [annotation_y_pos], 'text': [peak_text_v5]}) # Place near the peak lag, adjusted y

    annotation_text_v5 = alt.Chart(annotation_df_v5).mark_text(
        align='center', # Center text at the specified point
        fontSize=11,
        color='black',
        dy=-10 # Nudge text up slightly from the point
    ).encode(
        x=alt.X('lag:O'), # Match encoding type of main plot's x-axis
        y=alt.Y('correlation:Q'),
        text='text:N'
    )

    # Layer the chart components
    viz2_correlation_v5 = alt.layer(
        line_corr_v5,
        zero_rule_v5,
        annotation_text_v5
    ).configure_axis( # Configure axes
        grid=True, gridColor='lightgray', labelFontSize=10, titleFontSize=12
    ).configure_title( # Configure title
        fontSize=14, anchor='middle'
    ).configure_view( # Remove border
        stroke=None
    ).interactive() # Enable interaction (pan/zoom)


    # Save the chart specification
    try:
        viz2_correlation_v5.save('specs/soil_precip_lag_correlation.json')
        print("  Saved V5 (Lag Correlation) spec to specs/soil_precip_lag_correlation.json")
    except Exception as e:
        print(f"  Could not save V5 Lag Correlation spec: {e}")

else:
    print("Block 18: Skipping V5 Lag Correlation spec generation due to no data.")
    viz2_correlation_v5 = None # Ensure variable is defined as None if skipped

print("Block 18: V5 Lag Correlation spec generation complete.")


Block 18: Generating Altair spec for V5 Lag Correlation...
  Peak correlation calculated: Peak correlation at 3 day lag: 0.239
  Saved V5 (Lag Correlation) spec to specs/soil_precip_lag_correlation.json
Block 18: V5 Lag Correlation spec generation complete.


In [29]:
# --- Block 19: Generate Altair Spec for V6 (Temp vs Wind Speed Scatter Plot) ---

print("\nBlock 19: Generating Altair spec for V6 (Temp vs Wind Speed Scatter Plot)...")

# Data source is data/temp_wind_daily.json
# Ensure wind_temp had data during prep (checked in Block 12)

if 'wind_temp' in globals() and not wind_temp.empty:

    # Calculate Regression and Correlation *before* building chart layers
    # Use the data loaded from the JSON file path for consistency if running blocks out of order
    try:
        wind_temp_data_for_calcs = pd.read_json('data/temp_wind_daily.json')
        if not wind_temp_data_for_calcs.empty and \
           pd.api.types.is_numeric_dtype(wind_temp_data_for_calcs['measurement_value_temp']) and \
           pd.api.types.is_numeric_dtype(wind_temp_data_for_calcs['measurement_value_wind']):

            # Ensure there's variance for regression
            if wind_temp_data_for_calcs['measurement_value_temp'].std() > 0 and wind_temp_data_for_calcs['measurement_value_wind'].std() > 0:
                # Fit the linear regression using numpy
                z_v6 = np.polyfit(wind_temp_data_for_calcs['measurement_value_temp'], wind_temp_data_for_calcs['measurement_value_wind'], 1)
                slope_v6 = z_v6[0]
                intercept_v6 = z_v6[1]
                trend_text_v6 = f"Trend: y={slope_v6:.2f}x+{intercept_v6:.2f}"

                # Calculate the correlation coefficient using pandas
                correlation_v6 = wind_temp_data_for_calcs['measurement_value_temp'].corr(wind_temp_data_for_calcs['measurement_value_wind'])
                corr_text_v6 = f'Corr: {correlation_v6:.2f}'
                print(f"  Calculated V6: {trend_text_v6}, {corr_text_v6}")
            else:
                slope_v6, intercept_v6, correlation_v6 = np.nan, np.nan, np.nan
                trend_text_v6 = "Trend: No variance"
                corr_text_v6 = "Corr: No variance"
                print("  Cannot calculate regression/correlation: insufficient variance.")
        else:
            slope_v6, intercept_v6, correlation_v6 = np.nan, np.nan, np.nan
            trend_text_v6 = "Trend: N/A"
            corr_text_v6 = "Corr: N/A"
            print("  Cannot calculate regression/correlation: data is not numeric or empty.")

    except FileNotFoundError:
        print("  Data file data/temp_wind_daily.json not found for V6 calculations.")
        slope_v6, intercept_v6, correlation_v6 = np.nan, np.nan, np.nan
        trend_text_v6 = "Trend: Data not found"
        corr_text_v6 = "Corr: Data not found"
    except Exception as e:
        print(f"  Could not calculate regression/correlation for V6: {e}")
        slope_v6, intercept_v6, correlation_v6 = np.nan, np.nan, np.nan
        trend_text_v6 = "Trend: Error"
        corr_text_v6 = "Corr: Error"


    # Base chart definition (using data URL)
    base_v6 = alt.Chart(alt.Data(url='data/temp_wind_daily.json')).encode(
        x=alt.X('measurement_value_temp:Q', title='Temperature (°C)'),
        y=alt.Y('measurement_value_wind:Q', title='Wind Speed (m/s)')
    )

    # Layer 1: Scatter points
    points_v6 = base_v6.mark_point(
        size=80,
        opacity=0.7,
        filled=True,
        color='darkgreen' # Set color directly
    ).encode(
        tooltip=[ # Add tooltips for interactivity
            alt.Tooltip('day:T', format='%Y-%m-%d', title='Date'),
            alt.Tooltip('latitude:Q', format='.2f'),
            alt.Tooltip('longitude:Q', format='.2f'),
            alt.Tooltip('measurement_value_temp:Q', format='.1f', title='Avg Temp (°C)'),
            alt.Tooltip('measurement_value_wind:Q', format='.1f', title='Avg Wind (m/s)'),
            alt.Tooltip('date_label:N', title='Date') # Use the pre-formatted label
        ]
    ).properties(
         title='Temperature vs. Wind Speed in March' # Main chart title
    )


    # Layer 2: Regression line using transform_regression
    # transform_regression works directly on the data source defined in the base chart
    if pd.notna(slope_v6): # Only add the regression line if calculation was successful
        regression_line_v6 = base_v6.mark_line(
            color="red",
            strokeDash=[3,3], # Dashed line
            strokeWidth=2
        ).transform_regression(
            'measurement_value_temp', # Independent variable (x)
            'measurement_value_wind',  # Dependent variable (y)
            method='linear'          # Regression method
        )
    else:
        regression_line_v6 = alt.Chart(pd.DataFrame()) # Create an empty chart if regression failed


    # Layer 3: Annotation text (Correlation and Trend)
    # Position annotations near top-left based on data range from the loaded data
    if 'wind_temp_data_for_calcs' in locals() and not wind_temp_data_for_calcs.empty:
        temp_min_v6, temp_max_v6 = wind_temp_data_for_calcs['measurement_value_temp'].min(), wind_temp_data_for_calcs['measurement_value_temp'].max()
        wind_min_v6, wind_max_v6 = wind_temp_data_for_calcs['measurement_value_wind'].min(), wind_temp_data_for_calcs['measurement_value_wind'].max()

        # Calculate positions relative to data range
        text_x_pos_v6 = temp_min_v6 + (temp_max_v6 - temp_min_v6) * 0.05 # 5% from left edge
        # Position correlation text near the top
        text_y_pos_corr_v6 = wind_max_v6 - (wind_max_v6 - wind_min_v6) * 0.05 # 5% from top edge
        # Position trend text slightly below correlation text
        text_y_pos_trend_v6 = wind_max_v6 - (wind_max_v6 - wind_min_v6) * 0.12 # 12% from top edge

        annotation_data_v6 = pd.DataFrame([
             {'x': text_x_pos_v6, 'y': text_y_pos_corr_v6, 'text': corr_text_v6},
             # Only include trend text if calculated successfully
             ] + ([{'x': text_x_pos_v6, 'y': text_y_pos_trend_v6, 'text': trend_text_v6}] if pd.notna(slope_v6) else [])
        )

        annotation_text_v6 = alt.Chart(annotation_data_v6).mark_text(
            align='left', # Align text to the left of the (x,y) point
            fontSize=12,
            color='black' # Explicit text color for readability
        ).encode(
            x='x:Q',
            y='y:Q',
            text='text:N'
        )
    else:
        annotation_text_v6 = alt.Chart(pd.DataFrame()) # Empty chart if no data

    # Layer 4: Dummy layer for Legend
    # Create data for legend items
    legend_data_v6_list = [
        {'label': 'March Daily Averages', 'color': 'darkgreen', 'shape': 'circle'},
    ]
    # Add regression line legend item only if regression succeeded
    if pd.notna(slope_v6):
         legend_data_v6_list.append({'label': trend_text_v6, 'color': 'red', 'shape': 'stroke'})

    legend_data_v6 = pd.DataFrame(legend_data_v6_list)


    # Need invisible marks linked to this data to generate the legend items
    dummy_legend_v6 = alt.Chart(legend_data_v6).mark_point(
        size=0, # Invisible points
        opacity=0
    ).encode(
         # Use a dummy Y encoding that isn't displayed
        y=alt.Y('label:N', axis=None),
        # Use the specified color and shape directly in the legend definition
        color=alt.Color('color:N', scale=None, legend=alt.Legend(title=None)), # Base legend config
        shape=alt.Shape('shape:N', scale=alt.Scale(domain=['circle', 'stroke'], range=['circle', 'stroke']), # Define shape mapping
                        legend=alt.Legend(title=None, symbolFillColor='color', stroke='color', symbolType='stroke')) # Customize legend appearance
    )


    # Combine all layers
    # Order matters for visibility: points -> regression line -> annotations -> dummy legend
    chart_layers_v6 = alt.layer(
        points_v6,
        regression_line_v6,
        annotation_text_v6,
        # dummy_legend_v6 # Layering dummy legends can be tricky. Often better managed with resolve_legend below.
    ).resolve_legend(
        color=alt.LegendResolveMap(color='independent', shape='independent') # Resolve color and shape legends independently
        # Configure the legend appearance explicitly outside resolve_legend
    ).properties(
        width=700, # Adjust size as needed
        height=500
    ).configure_axis( # Global axis configurations
        grid=True, gridColor='lightgray', labelFontSize=10, titleFontSize=12
    ).configure_legend( # Configure legend appearance
        orient='bottom', # Position legend at the bottom
        padding=10,
        labelFontSize=11,
        titleFontSize=12,
        # Create custom symbols if needed, or rely on the dummy layer approach
        # Using the dummy layer approach for text labels in legend
    )

    # Add the dummy legend as a separate layer *after* the main layers if needed,
    # but resolving legend should be sufficient if the main marks define the symbols.
    # Let's try defining the legend properties directly in configure_legend first.
    # If the dummy layer is needed to force legend items for things like regression lines:
    # chart_layers_v6 = alt.layer(chart_layers_v6, dummy_legend_v6) # Layer the main chart with the dummy legend

    # Enable interactivity (panning and zooming)
    chart_v6 = chart_layers_v6.interactive()


    # Save the chart specification
    try:
        chart_v6.save('specs/temp_wind_scatter.json')
        print("  Saved V6 (Temp vs Wind Scatter) spec to specs/temp_wind_scatter.json")
    except Exception as e:
        print(f"  Could not save V6 spec: {e}")

else:
    print("Block 19: Skipping V6 spec generation due to empty wind_temp data after filtering.")
    chart_v6 = None # Ensure variable is defined as None if skipped


print("Block 19: V6 spec generation complete.")


Block 19: Generating Altair spec for V6 (Temp vs Wind Speed Scatter Plot)...
Block 19: Skipping V6 spec generation due to empty wind_temp data after filtering.
Block 19: V6 spec generation complete.


1.) Temperature Evolution During Winter-to-Spring Transition

Daily temperature trend line with min/max range
Monthly violin plots showing distribution changes


2.) Spatial Temperature Variations

Monthly temperature maps showing geographic patterns
Temperature variability map highlighting areas with greatest fluctuations


3.) Environmental Measurement Relationships

Temperature-humidity correlation scatter plot with time progression
Combined time series of temperature, humidity, and precipitation


4.) Soil Moisture Response to Precipitation

Time series showing precipitation events and soil moisture response
Lag correlation analysis showing delayed response patterns


5.) Wind Patterns and Correlations

Wind rose diagrams for different Chicago locations
Temperature-wind speed relationship scatter plot


6.) Daily Cycles and Seasonal Changes

Monthly comparison of daily temperature cycles
Daily temperature range progression over the study period
Hourly temperature heatmap showing daily and seasonal patterns