## Setup

Import libraries and configure the environment.

In [None]:
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from kanoa import AnalyticsInterpreter

# Set plot style
plt.style.use("seaborn-v0_8-whitegrid")
plt.rcParams["figure.figsize"] = (12, 6)
plt.rcParams["font.size"] = 11

# Path to knowledge base
KB_PATH = Path("./knowledge_base_demo/climate_science_kb")

print(f"Knowledge base path: {KB_PATH.absolute()}")
print(f"KB exists: {KB_PATH.exists()}")
if KB_PATH.exists():
    print(f"KB files: {list(KB_PATH.glob('*.md'))}")

## 1. Create Synthetic Climate Data

We'll create realistic synthetic data that mimics actual climate observations:

- **Global Temperature Anomalies** (similar to NASA GISS dataset)
- **Atmospheric CO2 Concentration** (similar to Keeling Curve)
- **Sea Surface Temperature** with marine heatwave events
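
The cells that follow build the temperature and CO2 series; the sea surface temperature series is not constructed there. Below is a minimal sketch of how such a series with marine heatwave events might be generated. It assumes a hypothetical daily record, an 18 °C mean, two injected warm events, and a simplified flag (days above the 90th percentile for their calendar month). All names and values are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Daily time axis for a hypothetical 10-year SST record
days = pd.date_range("2015-01-01", "2024-12-31", freq="D")
doy = days.dayofyear.to_numpy()

# Seasonal cycle around an assumed 18 °C mean, plus day-to-day noise
sst = 18.0 + 3.0 * np.sin(2 * np.pi * (doy - 120) / 365.25)
sst = sst + rng.normal(0, 0.3, len(days))

# Inject two illustrative warm events (+2.5 °C for 30 days each)
for start in ("2019-07-01", "2023-08-01"):
    idx = days.get_loc(pd.Timestamp(start))
    sst[idx : idx + 30] += 2.5

sst_df = pd.DataFrame({"Date": days, "SST_C": sst})

# Simplified heatwave flag: above the 90th percentile for that calendar month
threshold = sst_df.groupby(sst_df["Date"].dt.month)["SST_C"].transform(
    lambda s: s.quantile(0.9)
)
sst_df["Heatwave"] = sst_df["SST_C"] > threshold

print(sst_df[sst_df["Heatwave"]].head())
```

Published heatwave definitions usually add a minimum-duration requirement; the per-month percentile here is only a stand-in to keep the sketch short.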

In [None]:
# Set seed for reproducibility
np.random.seed(42)

# Create time axis (1960-2024)
years = np.arange(1960, 2025)
n_years = len(years)

# Global Temperature Anomaly (relative to 1951-1980 baseline)
# Starts near 0, accelerates to ~1.2°C by 2024
trend = 0.018 * (years - 1960)  # ~0.18°C/decade base trend
acceleration = 0.0002 * (years - 1960) ** 1.5  # Accelerating component
enso_signal = 0.15 * np.sin(2 * np.pi * (years - 1960) / 3.7)  # ENSO-like
volcanic = np.zeros_like(years, dtype=float)
volcanic[years == 1991] = -0.3  # Pinatubo
volcanic[years == 1982] = -0.2  # El Chichón
noise = np.random.normal(0, 0.08, n_years)

temp_anomaly = trend + acceleration + enso_signal + volcanic + noise

# Create DataFrame
climate_df = pd.DataFrame(
    {
        "Year": years,
        "Temperature_Anomaly_C": temp_anomaly,
    }
)

# 5-year running mean
climate_df["Temp_5yr_Mean"] = (
    climate_df["Temperature_Anomaly_C"].rolling(window=5, center=True).mean()
)

print("Climate DataFrame created:")
print(climate_df.tail(10))
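
A quick way to sanity-check a synthetic series like this one is to fit a straight line and express the slope per decade. The helper below is a sketch; `decadal_trend` is a hypothetical name, and the demo data is a noise-free line with the same 0.018 °C/yr base slope used above.

```python
import numpy as np


def decadal_trend(years, values):
    """Least-squares linear trend, in units of `values` per decade."""
    slope_per_year, _intercept = np.polyfit(years, values, 1)
    return 10.0 * slope_per_year


# Illustrative check on a noise-free series with a known 0.018 °C/yr slope
demo_years = np.arange(1960, 2025)
demo_anomaly = 0.018 * (demo_years - 1960)
demo_trend = decadal_trend(demo_years, demo_anomaly)
print(f"Trend: {demo_trend:.2f} °C/decade")
```

Applied to the notebook's data as `decadal_trend(climate_df["Year"], climate_df["Temperature_Anomaly_C"])`, the result would likely come out somewhat above the base trend because of the accelerating component.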

In [None]:
# Atmospheric CO2 concentration (Keeling Curve style)
# Pre-industrial ~280 ppm, 1960 ~315 ppm, 2024 ~425 ppm

# Monthly data for seasonal cycle
# ("ME" = month-end frequency; requires pandas >= 2.2, older versions use "M")
months = pd.date_range(start="1960-01-01", end="2024-12-31", freq="ME")
n_months = len(months)

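# Base trend with mild acceleration (~1.5-1.6 ppm/yr; ends near 417 ppm,
# slightly below the ~425 ppm real-world value noted above)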
month_idx = np.arange(n_months)
co2_trend = 315 + 1.5 * (month_idx / 12) + 0.02 * (month_idx / 12) ** 1.3

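# Seasonal cycle: simplified sinusoid (amplitude 3.5 ppm) peaking at month index 0
# (January); the observed Mauna Loa cycle peaks in NH spring, around May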
seasonal = 3.5 * np.sin(2 * np.pi * month_idx / 12 + np.pi / 2)

# Add noise
co2_noise = np.random.normal(0, 0.3, n_months)

co2_ppm = co2_trend + seasonal + co2_noise

co2_df = pd.DataFrame(
    {
        "Date": months,
        "CO2_ppm": co2_ppm,
    }
)

# Deseasonalized (12-month rolling mean)
co2_df["CO2_Deseasonalized"] = co2_df["CO2_ppm"].rolling(window=12, center=True).mean()

print("\nCO2 DataFrame:")
print(co2_df.tail(10))
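
The 12-month window works because each window spans exactly one full seasonal cycle, so the sinusoid averages to zero and only the trend remains. The sketch below checks this on a hypothetical noise-free series; the names and the slope value are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical noise-free monthly series: linear trend plus a 12-month sinusoid
m = np.arange(120)
slope_per_month = 0.125
series = pd.Series(315.0 + slope_per_month * m + 3.5 * np.sin(2 * np.pi * m / 12))

smoothed = series.rolling(window=12, center=True).mean()

# Month-to-month steps of the smoothed series equal the trend slope,
# i.e. the seasonal component has been averaged out
steps = smoothed.diff().dropna()
seasonal_removed = bool(np.allclose(steps, slope_per_month))

# A full 12-month window leaves 11 undefined values at the edges
n_missing = int(smoothed.isna().sum())

print(f"Seasonal cycle removed: {seasonal_removed}, missing values: {n_missing}")
```

The missing edge values are why the deseasonalized line in a plot stops a few months short of both ends of the record.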

## 2. Interpretation WITHOUT Knowledge Base

First, let's see how kanoa interprets the data without domain context.

In [None]:
# Create temperature anomaly visualization
fig1, ax1 = plt.subplots(figsize=(14, 7))

# Plot annual data
colors = [
    "#3498db" if t < 0 else "#e74c3c" for t in climate_df["Temperature_Anomaly_C"]
]
ax1.bar(
    climate_df["Year"],
    climate_df["Temperature_Anomaly_C"],
    color=colors,
    alpha=0.7,
    width=0.8,
    label="Annual Anomaly",
)

# Plot 5-year running mean
ax1.plot(
    climate_df["Year"],
    climate_df["Temp_5yr_Mean"],
    color="black",
    linewidth=2.5,
    label="5-Year Running Mean",
)

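# Reference lines
# Note: the 1.5°C target is defined relative to pre-industrial levels, so on this
# 1951-1980 baseline the line is only an approximate visual reference
ax1.axhline(y=0, color="gray", linestyle="--", alpha=0.7)
ax1.axhline(y=1.5, color="red", linestyle="--", alpha=0.5, label="1.5°C Target")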

# Annotations
ax1.annotate(
    "Mt. Pinatubo\n1991",
    xy=(1991, -0.1),
    xytext=(1985, -0.5),
    arrowprops=dict(arrowstyle="->", color="gray"),
    fontsize=9,
    color="gray",
)

ax1.set_xlabel("Year", fontsize=12)
ax1.set_ylabel("Temperature Anomaly (°C)", fontsize=12)
ax1.set_title(
    "Global Surface Temperature Anomaly (1960-2024)\n"
    + "Relative to 1951-1980 Baseline",
    fontsize=14,
    fontweight="bold",
)
ax1.legend(loc="upper left")
ax1.set_xlim(1958, 2026)
ax1.set_ylim(-0.8, 1.6)

plt.tight_layout()
plt.show()

print("\nVisualization created.")

In [None]:
# Initialize interpreter WITHOUT knowledge base
interpreter_no_kb = AnalyticsInterpreter(backend="gemini-3", track_costs=True)

print("=" * 70)
print("INTERPRETATION WITHOUT KNOWLEDGE BASE")
print("=" * 70)

result_no_kb = interpreter_no_kb.interpret_figure(
    fig=fig1,
    context="Global temperature anomaly data from 1960-2024",
    focus="Analyze the warming trend and identify significant events",
    display_result=True,
)

## 3. Interpretation WITH Knowledge Base

Now let's see how the interpretation improves with domain-specific context from academic literature.
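
The setup cell looks for `*.md` files under `KB_PATH`, which suggests that a text knowledge base is a directory of markdown notes. Below is a minimal sketch of creating such a directory, under that assumption. The directory name, file name, and note contents are hypothetical placeholders, and a temporary location is used so nothing in the demo folder is overwritten.

```python
from pathlib import Path
import tempfile

# Hypothetical location; the notebook itself expects
# ./knowledge_base_demo/climate_science_kb to exist already
kb_dir = Path(tempfile.mkdtemp()) / "example_kb"
kb_dir.mkdir(parents=True, exist_ok=True)

# Placeholder notes; real entries would summarize your own sources
notes = """# Plot conventions

- Temperature anomalies are relative to the 1951-1980 baseline.
- Reference lines: 1.5°C (temperature), 350 ppm and 400 ppm (CO2).
"""
(kb_dir / "plot_conventions.md").write_text(notes, encoding="utf-8")

kb_files = sorted(p.name for p in kb_dir.glob("*.md"))
print(kb_files)
```

Passing `str(kb_dir)` as `kb_path` would then mirror the call in the next cell.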

In [None]:
# Initialize interpreter WITH knowledge base
interpreter_with_kb = AnalyticsInterpreter(
    backend="gemini-3", kb_path=str(KB_PATH), kb_type="text", track_costs=True
)

print("=" * 70)
print("INTERPRETATION WITH KNOWLEDGE BASE")
print("=" * 70)

result_with_kb = interpreter_with_kb.interpret_figure(
    fig=fig1,
    context="Global temperature anomaly data from 1960-2024",
    focus="Analyze the warming trend, compare to IPCC findings, "
    + "and identify significant events like volcanic impacts",
    display_result=True,
)

## 4. CO2 Keeling Curve Analysis

Let's analyze the atmospheric CO2 data with the knowledge base.

In [None]:
# Create CO2 visualization (Keeling Curve style)
fig2, (ax2a, ax2b) = plt.subplots(2, 1, figsize=(14, 10))

# Top panel: Full time series with seasonal cycle
ax2a.plot(
    co2_df["Date"],
    co2_df["CO2_ppm"],
    color="#2ecc71",
    alpha=0.6,
    linewidth=0.8,
    label="Monthly",
)
ax2a.plot(
    co2_df["Date"],
    co2_df["CO2_Deseasonalized"],
    color="#27ae60",
    linewidth=2.5,
    label="Deseasonalized (12-mo)",
)

# Reference thresholds
ax2a.axhline(
    y=350, color="orange", linestyle="--", alpha=0.7, label='350 ppm ("Safe" level)'
)
ax2a.axhline(
    y=400, color="red", linestyle="--", alpha=0.7, label="400 ppm (First crossed 2013)"
)

ax2a.set_ylabel("CO₂ Concentration (ppm)", fontsize=12)
ax2a.set_title(
    "Atmospheric CO₂ at Mauna Loa (Keeling Curve Style)\n" + "1960-2024",
    fontsize=14,
    fontweight="bold",
)
ax2a.legend(loc="upper left")
ax2a.set_ylim(310, 440)

# Bottom panel: Rate of change (annual increase)
annual_co2 = co2_df.groupby(co2_df["Date"].dt.year)["CO2_ppm"].mean()
annual_increase = annual_co2.diff()

ax2b.bar(
    annual_increase.index[1:], annual_increase.values[1:], color="#e74c3c", alpha=0.7
)
ax2b.axhline(
    y=2.5,
    color="red",
    linestyle="--",
    alpha=0.5,
    label="Current avg rate (~2.5 ppm/yr)",
)
ax2b.axhline(
    y=1.5, color="orange", linestyle="--", alpha=0.5, label="1960s rate (~1.5 ppm/yr)"
)

ax2b.set_xlabel("Year", fontsize=12)
ax2b.set_ylabel("Annual CO₂ Increase (ppm/year)", fontsize=12)
ax2b.set_title("Rate of CO₂ Increase Over Time", fontsize=12, fontweight="bold")
ax2b.legend(loc="upper left")

plt.tight_layout()
plt.show()

In [None]:
print("=" * 70)
print("CO2 KEELING CURVE INTERPRETATION (WITH KB)")
print("=" * 70)

result_co2 = interpreter_with_kb.interpret_figure(
    fig=fig2,
    context="Atmospheric CO2 concentration from 1960-2024, similar to Mauna Loa data",
    focus="Explain the seasonal sawtooth pattern, analyze the acceleration in "
    + "annual increase rate, and compare to carbon budget from literature",
    display_result=True,
)

## 5. Combined Analysis: Temperature vs CO2 Correlation

Let's create a visualization showing the relationship between temperature and CO2.
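
Two series that both trend upward will show a high Pearson r whether or not they are physically linked, so it can help to also correlate their year-to-year changes. The sketch below illustrates this with two hypothetical, independently generated random walks; `pearson_r` is an illustrative helper, not part of kanoa.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])


# Hypothetical series: two independent random walks with the same upward drift
rng = np.random.default_rng(1)
a = np.cumsum(1.0 + rng.normal(0, 1, 200))
b = np.cumsum(1.0 + rng.normal(0, 1, 200))

r_levels = pearson_r(a, b)  # dominated by the shared drift
r_changes = pearson_r(np.diff(a), np.diff(b))  # step-to-step changes only

print(f"r (levels):  {r_levels:.3f}")
print(f"r (changes): {r_changes:.3f}")
```

For the notebook's data, the same comparison could use `merged["CO2_ppm"].diff().dropna()` and `merged["Temperature_Anomaly_C"].diff().dropna()`. Because the two synthetic series are generated independently, the correlation of changes would be expected to be weak even though the level correlation is high.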

In [None]:
# Prepare annual data for correlation
annual_temp = climate_df[["Year", "Temperature_Anomaly_C"]].copy()
annual_co2_mean = co2_df.groupby(co2_df["Date"].dt.year)["CO2_ppm"].mean().reset_index()
annual_co2_mean.columns = ["Year", "CO2_ppm"]

merged = annual_temp.merge(annual_co2_mean, on="Year")

# Create correlation visualization
fig3, (ax3a, ax3b) = plt.subplots(1, 2, figsize=(14, 6))

# Left: Dual-axis time series
color_temp = "#e74c3c"
color_co2 = "#27ae60"

ax3a.plot(
    merged["Year"],
    merged["Temperature_Anomaly_C"],
    color=color_temp,
    linewidth=2,
    label="Temperature Anomaly",
)
ax3a.set_xlabel("Year", fontsize=12)
ax3a.set_ylabel("Temperature Anomaly (°C)", color=color_temp, fontsize=12)
ax3a.tick_params(axis="y", labelcolor=color_temp)

ax3a_twin = ax3a.twinx()
ax3a_twin.plot(
    merged["Year"],
    merged["CO2_ppm"],
    color=color_co2,
    linewidth=2,
    linestyle="--",
    label="CO₂",
)
ax3a_twin.set_ylabel("CO₂ (ppm)", color=color_co2, fontsize=12)
ax3a_twin.tick_params(axis="y", labelcolor=color_co2)

ax3a.set_title("Temperature and CO₂ Over Time", fontsize=12, fontweight="bold")
ax3a.legend(loc="upper left")
ax3a_twin.legend(loc="lower right")

# Right: Scatter plot with regression
ax3b.scatter(
    merged["CO2_ppm"],
    merged["Temperature_Anomaly_C"],
    c=merged["Year"],
    cmap="plasma",
    s=50,
    alpha=0.7,
)

# Add regression line
z = np.polyfit(merged["CO2_ppm"], merged["Temperature_Anomaly_C"], 1)
p = np.poly1d(z)
x_line = np.linspace(merged["CO2_ppm"].min(), merged["CO2_ppm"].max(), 100)
ax3b.plot(x_line, p(x_line), "r--", linewidth=2, alpha=0.8)

# Calculate correlation
corr = merged["Temperature_Anomaly_C"].corr(merged["CO2_ppm"])
ax3b.text(
    0.05,
    0.95,
    f"r = {corr:.3f}",
    transform=ax3b.transAxes,
    fontsize=12,
    verticalalignment="top",
    bbox=dict(boxstyle="round", facecolor="wheat", alpha=0.8),
)

ax3b.set_xlabel("CO₂ Concentration (ppm)", fontsize=12)
ax3b.set_ylabel("Temperature Anomaly (°C)", fontsize=12)
ax3b.set_title("Temperature vs CO₂ Correlation", fontsize=12, fontweight="bold")

cbar = plt.colorbar(ax3b.collections[0], ax=ax3b, label="Year")

plt.tight_layout()
plt.show()

In [None]:
print("=" * 70)
print("TEMPERATURE-CO2 CORRELATION INTERPRETATION (WITH KB)")
print("=" * 70)

result_corr = interpreter_with_kb.interpret_figure(
    fig=fig3,
    context="Comparison of global temperature anomaly and atmospheric CO2 (1960-2024)",
    focus="Interpret the correlation, discuss the physical relationship between "
    + "CO2 forcing and temperature response based on the literature, "
    + "and note any time lags or nonlinearities",
    display_result=True,
)

## 6. Cost Comparison

Let's compare the token usage and costs between interpretations with and without the knowledge base.
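
In this notebook the interpreter without a knowledge base makes one call and the one with a knowledge base makes three, so totals are not directly comparable. A sketch of a per-call comparison table follows. The summary dicts are hypothetical stand-ins that reuse the keys read by the cost cell (`total_calls`, `total_tokens`, `total_cost_usd`), and the numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical summaries; values are illustrative, not measured
summaries = {
    "without_kb": {
        "total_calls": 1,
        "total_tokens": {"input": 1_200, "output": 600},
        "total_cost_usd": 0.004,
    },
    "with_kb": {
        "total_calls": 3,
        "total_tokens": {"input": 18_000, "output": 2_100},
        "total_cost_usd": 0.045,
    },
}

rows = []
for name, s in summaries.items():
    calls = max(s["total_calls"], 1)  # guard against division by zero
    rows.append(
        {
            "interpreter": name,
            "calls": s["total_calls"],
            "input_tokens_per_call": s["total_tokens"]["input"] / calls,
            "output_tokens_per_call": s["total_tokens"]["output"] / calls,
            "cost_usd_per_call": s["total_cost_usd"] / calls,
        }
    )

per_call = pd.DataFrame(rows).set_index("interpreter")
print(per_call)
```

Per-call averages are still approximate, since the calls interpret different figures with different prompts.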

In [None]:
# Get cost summaries
costs_no_kb = interpreter_no_kb.get_cost_summary()
costs_with_kb = interpreter_with_kb.get_cost_summary()

print("\n" + "=" * 70)
print("COST COMPARISON")
print("=" * 70)

print("\nWITHOUT Knowledge Base:")
print(f"  Total API calls: {costs_no_kb['total_calls']}")
print(f"  Input tokens: {costs_no_kb['total_tokens']['input']:,}")
print(f"  Output tokens: {costs_no_kb['total_tokens']['output']:,}")
print(f"  Total cost: ${costs_no_kb['total_cost_usd']:.4f}")

print("\nWITH Knowledge Base:")
print(f"  Total API calls: {costs_with_kb['total_calls']}")
print(f"  Input tokens: {costs_with_kb['total_tokens']['input']:,}")
print(f"  Output tokens: {costs_with_kb['total_tokens']['output']:,}")
print(f"  Total cost: ${costs_with_kb['total_cost_usd']:.4f}")

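# The two interpreters made different numbers of calls, so compare per-call averages
calls_no_kb = max(costs_no_kb["total_calls"], 1)
calls_with_kb = max(costs_with_kb["total_calls"], 1)

kb_overhead = (
    costs_with_kb["total_tokens"]["input"] / calls_with_kb
    - costs_no_kb["total_tokens"]["input"] / calls_no_kb
)
print(f"\nKB Context Overhead: ~{kb_overhead:,.0f} input tokens per call")

cost_diff = (
    costs_with_kb["total_cost_usd"] / calls_with_kb
    - costs_no_kb["total_cost_usd"] / calls_no_kb
)
print(f"Cost difference per call: ${cost_diff:.4f}")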

## 7. (Optional) PDF Knowledge Base

With the Gemini backend, you can load actual academic PDFs. The model sees the
figures, tables, and equations in the papers, not just extracted text.

```python
# Example: Load PDFs from a directory
interpreter_pdf = AnalyticsInterpreter(
    backend='gemini-3',
    kb_path='./papers/',  # Contains Hansen_2023.pdf, Friedlingstein_2023.pdf
    kb_type='pdf'  # Native PDF processing
)

# The model now has full visual access to figures in the papers!
result = interpreter_pdf.interpret_figure(
    fig=my_plot,  # placeholder: any matplotlib Figure you have created
    focus="Compare this to Figure 3 in Hansen et al. 2023"
)
```

## Summary

This notebook demonstrated kanoa's knowledge base integration.

### Key Takeaways

1. **Context Matters**: Knowledge base provides domain-specific terminology and benchmarks
2. **Literature Integration**: Can reference specific papers, findings, and methodologies
3. **Improved Accuracy**: Interpretations can cite the relevant thresholds (the 1.5°C target, 350 ppm)
4. **Cost Overhead**: The KB adds input tokens to every call; Section 6 measures that overhead so it can be weighed against the gain in interpretation quality

### When to Use Knowledge Bases

| Scenario | KB Type | Benefit |
|----------|---------|--------|
| Domain-specific analysis | Text (markdown) | Fast, works with all backends |
| Academic paper references | PDF (Gemini only) | Sees figures, tables, equations |
| Custom methodology | Text | Ensures consistent interpretation |
| Company-specific context | Text | Private data integration |

### Next Steps

- Create your own knowledge base for your domain
- Try PDF knowledge base with actual academic papers
- Experiment with different `focus` prompts to guide analysis

---

*For more examples, see the [kanoa documentation](https://github.com/lhzn-io/kanoa).*