In [16]:
# ============================================================
# 009 LP Candidate Pre-Research
# ============================================================
#
# Overview
# ----------------
# This notebook conducts a lightweight, pre-research analysis of potential
# Limited Partner (LP) candidates using only publicly available information.
#
# The objective is not to produce a final evaluation or recommendation, but to
# transform heterogeneous public signals (e.g., institutional profiles, stated
# mandates, investment activities) into a comparable, inspection-ready structure
# that can support deeper, subsequent LP research.
#
# The workflow explicitly separates:
# - Facts: evidence-backed statements with traceable source URLs and confidence
# - Hypotheses: assumption-based interpretations (e.g., capability gaps,
#   geographic reach, specialization limits), clearly labeled as tentative
#
# Emphasis is placed on transparency, traceability, and failure-tolerant design,
# allowing incomplete or ambiguous information to be surfaced rather than hidden.
#
#
# Inputs / Outputs
# ----------------
# Inputs:
# - List of candidate LP organizations (names and primary reference URLs)
# - Public web sources (official websites, disclosures, reports, press releases)
#
# Outputs:
# - Facet-level LP research table (one row per LP × research facet)
# - Normalized LP-level profiles (tabular + JSON)
# - Explicit flags for data gaps, low-confidence signals, and missing evidence
# - Hypothesis-based notes on potential LP challenges for downstream discussion
# - Human-readable, report-style summaries (Markdown-like)
#
#
# Structure
# ----------------
# Cell 0 : Notebook purpose, scope, and design principles
# Cell 1 : Imports, global configuration, and helper utilities
# Cell 2 : Define LP candidate list and basic metadata (widgets)
# Cell 3 : Collect raw public information (primary URLs + Google CSE facets)
# Cell 4 : Facet-wise LLM-based summarization and extraction (robust, DataFrame-based)
# Cell 5 : Normalize extracted facets into a common LP schema
# Cell 6 : Flag uncertainties, assumptions, and missing or weak signals
# Cell 7 : Export structured outputs and generate fact/hypothesis reports
#
#
# Notes
# ----------------
# - This notebook intentionally avoids proprietary or non-public data.
# - Outputs are designed for pre-research and internal discussion purposes only.
# - All hypotheses should be treated as tentative and subject to further validation.
# - The design favors transparency, inspectability, and reuse over completeness.
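
The fact/hypothesis separation described above can be made concrete with two small record types. This is only an illustrative sketch: the class names `Fact` and `Hypothesis` and all of their fields are assumptions made for this example and are not defined anywhere else in the notebook.

```python
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class Fact:
    """An evidence-backed statement; it must carry at least one source URL."""
    statement: str
    source_urls: List[str]
    confidence: float  # 0.0 (no support) to 1.0 (stated directly by the source)

    def __post_init__(self):
        # Reject facts that cannot be traced back to a source
        if not self.source_urls:
            raise ValueError("A Fact requires at least one source URL.")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be within [0.0, 1.0].")


@dataclass
class Hypothesis:
    """An assumption-based interpretation, labeled as tentative by default."""
    statement: str
    rationale: str
    based_on: List[str] = field(default_factory=list)  # statements of supporting facts
    tentative: bool = True


# Hypothetical example values, not real research output
fact = Fact(
    statement="Example LP publishes an annual investment report.",
    source_urls=["https://example.com/report"],
    confidence=0.9,
)
hypo = Hypothesis(
    statement="Example LP may have limited overseas coverage.",
    rationale="No overseas offices are mentioned in the cited report.",
    based_on=[fact.statement],
)
print(asdict(fact))
print(asdict(hypo))
```

Keeping the two types separate means a hypothesis can never be exported as a fact by accident, since only `Fact` enforces the source URL requirement.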

In [4]:
# ============================================================
# Cell 1 : Imports, Global Configuration, and Helper Utilities
# ============================================================
#
# This cell initializes all required libraries, defines global configuration
# parameters, and implements lightweight helper utilities used throughout
# the notebook.
#
# The goal is to keep downstream cells focused on research logic, while
# centralizing environment setup, defaults, and reusable functions here.
#

# ----------------
# Standard Libraries
# ----------------
import os
import json
import time
from typing import List, Dict, Any, Optional

# ----------------
# Third-Party Libraries
# ----------------
import requests
import pandas as pd
from dotenv import load_dotenv

# ----------------
# LLM / API Configuration
# ----------------
# Environment variables are used for API keys to avoid hardcoding secrets.
# Ensure that required keys are set before execution.
#
# Example:
# export OPENAI_API_KEY="your_api_key_here"
#
# Load env.txt explicitly (recommended for local + GitHub Actions parity)
load_dotenv("env.txt")

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

if not OPENAI_API_KEY:  # also catches a key that is set but empty
    raise EnvironmentError(
        "OPENAI_API_KEY is not set. "
        "Please define it as an environment variable before running this notebook."
    )

# --- Google Custom Search Engine (required for this notebook) ---
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
GOOGLE_CSE_CX = os.getenv("GOOGLE_CSE_CX")

# ----------------
# Global Configuration
# ----------------
# These parameters define default behaviors for data collection and processing.
#
REQUEST_TIMEOUT = 20          # seconds
MAX_RETRIES = 3               # retry attempts for HTTP requests
RETRY_SLEEP_SECONDS = 3       # backoff between retries
USER_AGENT = "researchOS-LP-PreResearch/0.1"

DEFAULT_HEADERS = {
    "User-Agent": USER_AGENT
}

# ----------------
# Helper Utilities
# ----------------
def safe_get(
    url: str,
    headers: Optional[Dict[str, str]] = None,
    timeout: int = REQUEST_TIMEOUT,
    max_retries: int = MAX_RETRIES
) -> Optional[str]:
    """
    Safely fetch text content from a URL with basic retry logic.

    Parameters
    ----------
    url : str
        Target URL to fetch.
    headers : dict, optional
        HTTP headers to attach to the request.
    timeout : int
        Request timeout in seconds.
    max_retries : int
        Maximum number of retry attempts.

    Returns
    -------
    str or None
        Response text if successful, otherwise None.
    """
    headers = headers or DEFAULT_HEADERS

    for attempt in range(1, max_retries + 1):
        try:
            response = requests.get(url, headers=headers, timeout=timeout)
            if response.status_code == 200:
                return response.text
        except requests.RequestException:
            pass  # network errors count as a failed attempt

        # Wait between attempts, but not after the final one
        if attempt < max_retries:
            time.sleep(RETRY_SLEEP_SECONDS)

    return None


def to_pretty_json(data: Any) -> str:
    """
    Convert a Python object into a human-readable JSON string.
    Useful for inspection and debugging inside notebooks.
    """
    return json.dumps(data, ensure_ascii=False, indent=2)


def normalize_text(text: str) -> str:
    """
    Apply lightweight normalization to raw text signals
    (e.g., whitespace cleanup).
    """
    if not text:
        return ""
    return " ".join(text.split())


# ----------------
# Pandas Display Settings (Optional)
# ----------------
pd.set_option("display.max_colwidth", 200)
pd.set_option("display.max_columns", 50)
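
`safe_get` above waits a fixed interval between attempts. If a source rate-limits aggressively, an exponential backoff may be gentler. The sketch below is a possible variant, not part of the notebook: `fetch_with_backoff`, its parameters, and the injectable `fetch` and `sleep` callables are assumptions introduced so that the logic can be exercised without network access.

```python
import time
from typing import Callable, Optional


def fetch_with_backoff(
    fetch: Callable[[], Optional[str]],
    max_retries: int = 3,
    base_sleep: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call `fetch` up to `max_retries` times, doubling the wait after each failure."""
    for attempt in range(max_retries):
        try:
            result = fetch()
            if result is not None:
                return result
        except Exception:
            # Hypothetical catch-all; in the notebook this would be requests.RequestException
            pass
        # Wait base_sleep, 2*base_sleep, 4*base_sleep, ... but not after the last attempt
        if attempt < max_retries - 1:
            sleep(base_sleep * (2 ** attempt))
    return None


# Simulated flaky source: fails twice, then succeeds
calls = {"n": 0}

def flaky() -> Optional[str]:
    calls["n"] += 1
    return "ok" if calls["n"] >= 3 else None

waits = []  # records the requested waits instead of actually sleeping
print(fetch_with_backoff(flaky, sleep=waits.append))
print(waits)
```

In the notebook, `fetch` could be a small lambda wrapping `requests.get`, which would leave the retry policy testable on its own.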


In [3]:
# ============================================================
# Cell 2 : Define LP Candidate List and Basic Metadata (Widgets)
# ============================================================
#
# This cell defines LP candidates via interactive widgets.
# Users can input (1) organization name and (2) a primary reference URL.
#
# The resulting candidate list is normalized into a simple table that will be
# used downstream for data collection and profile reconstruction.
#

import ipywidgets as widgets
from IPython.display import display, clear_output

# ----------------
# Widget: Candidate Input Rows
# ----------------
N_ROWS = 8  # adjust as needed

name_inputs = []
url_inputs = []

for i in range(N_ROWS):
    name_w = widgets.Text(
        value="",
        placeholder="e.g., ABC Pension Fund",
        description=f"Name {i+1}:",
        layout=widgets.Layout(width="520px")
    )
    url_w = widgets.Text(
        value="",
        placeholder="e.g., https://example.com/about",
        description=f"URL {i+1}:",
        layout=widgets.Layout(width="520px")
    )
    name_inputs.append(name_w)
    url_inputs.append(url_w)

rows = [
    widgets.HBox([name_inputs[i], url_inputs[i]], layout=widgets.Layout(margin="0 0 6px 0"))
    for i in range(N_ROWS)
]

title = widgets.HTML("<h3>LP Candidate Inputs</h3><p>Enter organization name and a primary public reference URL.</p>")

# ----------------
# Actions
# ----------------
build_button = widgets.Button(
    description="Build Candidate Table",
    button_style="primary",
    icon="check"
)

clear_button = widgets.Button(
    description="Clear Inputs",
    button_style="warning",
    icon="trash"
)

output = widgets.Output()

def _build_candidates() -> pd.DataFrame:
    """Build a normalized candidate table from widget values."""
    records = []
    for i in range(N_ROWS):
        name = (name_inputs[i].value or "").strip()
        url = (url_inputs[i].value or "").strip()
        if name and url:
            records.append({
                "candidate_id": f"lp_{i+1:02d}",
                "org_name": name,
                "primary_url": url,
                "notes": ""  # optional free-text field for manual notes
            })
    return pd.DataFrame(records)

def on_build_clicked(_):
    with output:
        clear_output()
        df = _build_candidates()

        if df.empty:
            print("No candidates found. Please fill at least one Name + URL pair.")
            return

        display(df)
        print(f"\nBuilt {len(df)} candidate(s).")

        # Store as a global variable for downstream cells
        global lp_candidates_df
        lp_candidates_df = df

def on_clear_clicked(_):
    for i in range(N_ROWS):
        name_inputs[i].value = ""
        url_inputs[i].value = ""
    with output:
        clear_output()
        print("Inputs cleared.")

build_button.on_click(on_build_clicked)
clear_button.on_click(on_clear_clicked)

controls = widgets.HBox([build_button, clear_button], layout=widgets.Layout(margin="10px 0 0 0"))

display(title)
display(widgets.VBox(rows))
display(controls)
display(output)


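
The widget form requires an interactive kernel. For headless runs, the same candidate records could be built from a plain list of (name, URL) pairs. The sketch below mirrors the filtering and ID logic of `_build_candidates`; the function name `build_candidate_records` and the sample pairs are assumptions for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def build_candidate_records(pairs: Iterable[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Turn (name, url) pairs into candidate records, skipping incomplete rows."""
    records = []
    for i, (name, url) in enumerate(pairs, start=1):
        name = (name or "").strip()
        url = (url or "").strip()
        if name and url:
            records.append({
                "candidate_id": f"lp_{i:02d}",  # ID follows the input position, as in the widget form
                "org_name": name,
                "primary_url": url,
                "notes": "",
            })
    return records


records = build_candidate_records([
    ("DBJ", "https://www.dbj.jp/"),
    ("", "https://example.com/"),                           # skipped: missing name
    ("ABC Pension Fund", " https://example.com/about "),    # surrounding spaces are trimmed
])
for r in records:
    print(r)
```

Passing `records` to `pd.DataFrame` would yield a table with the same columns as `lp_candidates_df`.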

In [5]:
# ============================================================
# Cell 3 : Collect Raw Public Information (Web Text + Google CSE)
# ============================================================
#
# This cell collects raw public signals for each LP candidate from:
#  1) The user-provided primary URL (direct fetch, lightweight)
#  2) Google Custom Search Engine (CSE) queries, separated by research facets:
#     - Business overview
#     - Direct startup investments (if any)
#     - Presence of CVC and (if present) its main investment scope
#     - LP commitments to VC funds (if any) and major recipients
#
# Output is a traceable, inspection-friendly raw signal bundle that preserves:
#  - query used
#  - top results (title / snippet / link)
#  - timestamps
#
# NOTE: This cell intentionally does not "decide" or "score"—it only collects.
#

from datetime import datetime

GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
GOOGLE_CSE_CX  = os.getenv("GOOGLE_CSE_CX")

if not GOOGLE_API_KEY or not GOOGLE_CSE_CX:  # also catches values that are set but empty
    raise EnvironmentError(
        "GOOGLE_API_KEY and/or GOOGLE_CSE_CX is not set. "
        "Please define them as environment variables before running this notebook."
    )

CSE_ENDPOINT = "https://www.googleapis.com/customsearch/v1"
CSE_NUM_RESULTS = 5          # per query (max 10 for CSE)
CSE_RETRY_SLEEP = 2
CSE_MAX_RETRIES = 3

def cse_search(query: str, num: int = CSE_NUM_RESULTS, start: int = 1) -> Dict[str, Any]:
    """
    Execute a Google CSE query and return the parsed JSON response (or an empty dict on failure).
    """
    params = {
        "key": GOOGLE_API_KEY,
        "cx": GOOGLE_CSE_CX,
        "q": query,
        "num": num,
        "start": start
    }

    for attempt in range(1, CSE_MAX_RETRIES + 1):
        try:
            r = requests.get(CSE_ENDPOINT, params=params, timeout=REQUEST_TIMEOUT, headers=DEFAULT_HEADERS)
            if r.status_code == 200:
                return r.json()
        except requests.RequestException:
            pass  # network errors count as a failed attempt
        # Wait between attempts, but not after the final one
        if attempt < CSE_MAX_RETRIES:
            time.sleep(CSE_RETRY_SLEEP)

    return {}


def extract_cse_items(cse_json: Dict[str, Any]) -> List[Dict[str, str]]:
    """
    Extract a minimal set of fields from a CSE response for traceability.
    """
    items = cse_json.get("items", []) or []
    extracted = []
    for it in items:
        extracted.append({
            "title": it.get("title", ""),
            "link": it.get("link", ""),
            "snippet": it.get("snippet", "")
        })
    return extracted


def build_research_queries(org_name: str, primary_url: str) -> Dict[str, List[str]]:
    """
    Build query sets by facet. Every query includes the org name; one business-overview
    query also appends the domain of the primary URL as a precision hint.
    (No domain restriction is enforced here; one can be added later.)
    """
    # Optional: derive domain token (simple heuristic)
    domain_hint = ""
    try:
        domain_hint = primary_url.split("//", 1)[-1].split("/", 1)[0]
    except Exception:
        domain_hint = ""

    # Keep queries short but facet-specific.
    # You can tune these templates over time.
    return {
        "business_overview": [
            f'{org_name} overview',
            f'{org_name} company profile',
            f'{org_name} business description {domain_hint}'.strip()
        ],
        "direct_startup_investments": [
            f'{org_name} startup investment',
            f'{org_name} invested in startup',
            f'{org_name} venture investment portfolio'
        ],
        "cvc_presence": [
            f'{org_name} corporate venture capital',
            f'{org_name} CVC',
            f'{org_name} venture arm'
        ],
        "cvc_investment_focus": [
            f'{org_name} corporate venture capital investment focus',
            f'{org_name} venture arm sectors',
            f'{org_name} CVC investment areas'
        ],
        "lp_to_vc_commitments": [
            f'{org_name} LP commitment venture capital fund',
            f'{org_name} committed to venture capital fund',
            f'{org_name} backed venture capital firm LP'
        ]
    }


def collect_primary_url_text(url: str, max_chars: int = 12000) -> Dict[str, Any]:
    """
    Fetch the primary URL and keep a truncated, whitespace-normalized copy of the response.
    (The payload is raw HTML, tags included; only the first `max_chars` characters are kept
    for inspection. Deep scraping is out of scope.)
    """
    html = safe_get(url)
    html = html or ""
    html_norm = normalize_text(html)
    return {
        "url": url,
        "fetched_at": datetime.utcnow().isoformat() + "Z",
        "raw_text_head": html_norm[:max_chars],
        "char_len": len(html_norm)
    }


def collect_candidate_signals(org_name: str, primary_url: str) -> Dict[str, Any]:
    """
    Collect raw signals for a single candidate from primary URL + CSE facet queries.
    """
    out = {
        "org_name": org_name,
        "primary_url": primary_url,
        "collected_at": datetime.utcnow().isoformat() + "Z",
        "primary_url_fetch": {},
        "cse": {}
    }

    # 1) Primary URL fetch
    out["primary_url_fetch"] = collect_primary_url_text(primary_url)

    # 2) Google CSE facet searches
    facet_queries = build_research_queries(org_name, primary_url)

    for facet, queries in facet_queries.items():
        facet_results = []
        for q in queries:
            cse_json = cse_search(q, num=CSE_NUM_RESULTS, start=1)
            items = extract_cse_items(cse_json)

            facet_results.append({
                "query": q,
                "items": items
            })

            # Be polite with rate limiting
            time.sleep(0.3)

        out["cse"][facet] = facet_results

    return out


# ----------------
# Run Collection for All Candidates
# ----------------
if "lp_candidates_df" not in globals():
    raise ValueError(
        "lp_candidates_df is not defined. Please run Cell 2 and build the candidate table first."
    )

raw_signals = []
for _, row in lp_candidates_df.iterrows():
    org_name = row["org_name"]
    primary_url = row["primary_url"]
    print(f"Collecting signals: {org_name} ...")
    bundle = collect_candidate_signals(org_name, primary_url)
    raw_signals.append(bundle)

# Keep under a stable name for downstream cells (module-level names are already global)
lp_raw_signals = raw_signals

print(f"\nDone. Collected raw signals for {len(lp_raw_signals)} candidate(s).")


Collecting signals: DBJ ...

Done. Collected raw signals for 1 candidate(s).
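
`collect_primary_url_text` stores whitespace-normalized HTML, so markup, scripts, and styles can consume much of the character budget before any readable text appears. One possible refinement is to extract visible text first using the standard library's `html.parser`. This is a sketch under assumptions: `html_to_text` and `_VisibleTextParser` are hypothetical helpers that the notebook does not define.

```python
from html.parser import HTMLParser


class _VisibleTextParser(HTMLParser):
    """Collect text nodes while ignoring the contents of non-visible elements."""
    SKIP_TAGS = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self) -> str:
        # Join chunks with spaces and collapse runs of whitespace
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document as a single normalized line."""
    parser = _VisibleTextParser()
    parser.feed(html or "")
    parser.close()
    return parser.text()


sample = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<h1>About  us</h1><script>var x = 1;</script>"
    "<p>Loans &amp; equity.</p></body></html>"
)
print(html_to_text(sample))
```

Because chunks are joined with spaces, words split across inline tags would gain an extra space; for pre-research inspection that trade-off seems acceptable.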


In [9]:
# ============================================================
# Cell 4 : LLM-based Summarization and Field Extraction (Facet-wise → DataFrame)
# ============================================================
#
# Rationale:
# - A single, large JSON per organization is prone to truncation for information-rich LPs.
# - This cell switches to facet-wise extraction and stores outputs in a robust, inspectable
#   DataFrame (one row per org × facet).
#
# Facets:
# - business_overview
# - direct_startup_investments
# - cvc
# - lp_commitments_to_vc
#
# Outputs:
# - lp_facet_df : DataFrame with evidence, confidence, and machine-readable details_json
# - (optional) artifacts/009_lp_facet_df.parquet (if you enable saving)
#

import os
import json
import time
import re
from typing import List, Dict, Any, Optional
from datetime import datetime

import pandas as pd

# ----------------
# OpenAI Client Setup
# ----------------
MODEL_NAME = os.getenv("OPENAI_MODEL", "gpt-4.1-mini")

try:
    from openai import OpenAI
    _openai_client = OpenAI(api_key=OPENAI_API_KEY)
except Exception as e:
    raise ImportError(
        "Failed to initialize OpenAI client. Please ensure `openai` is installed "
        "and OPENAI_API_KEY is set.\n"
        f"Original error: {e}"
    )

# ----------------
# Helpers: Logging
# ----------------
def save_raw_llm_output(org_name: str, facet: str, raw_text: str, out_dir: str = "artifacts/009_llm_raw_facet") -> str:
    """Write a raw LLM response to a timestamped file under `out_dir` and return its path."""
    os.makedirs(out_dir, exist_ok=True)
    safe_org = re.sub(r"[^a-zA-Z0-9_\-]+", "_", (org_name or "candidate")).strip("_")[:60] or "candidate"
    safe_facet = re.sub(r"[^a-zA-Z0-9_\-]+", "_", (facet or "facet")).strip("_")[:40] or "facet"
    ts = datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
    path = os.path.join(out_dir, f"{safe_org}__{safe_facet}__{ts}.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(raw_text or "")
    return path

# ----------------
# Helpers: Context Building (facet-wise)
# ----------------
def _compact_items(items: List[Dict[str, Any]], max_items: int = 3) -> List[Dict[str, str]]:
    out = []
    for it in (items or [])[:max_items]:
        out.append({
            "title": (it.get("title") or "").strip(),
            "link": (it.get("link") or "").strip(),
            "snippet": (it.get("snippet") or "").strip(),
        })
    return out

def build_facet_context(
    bundle: Dict[str, Any],
    include_primary_head: bool,
    facets: List[str],
    max_primary_chars: int = 2500,
    max_queries_per_facet: int = 3,
    max_items_per_query: int = 3,
) -> str:
    """
    Build a compact context for a given extraction facet by selecting only relevant CSE blocks.
    This reduces prompt size and prevents truncation.
    """
    org_name = bundle.get("org_name", "")
    primary_url = bundle.get("primary_url", "")

    parts = []
    parts.append(f"ORG_NAME: {org_name}")
    parts.append(f"PRIMARY_URL: {primary_url}")

    if include_primary_head:
        primary = bundle.get("primary_url_fetch", {}) or {}
        primary_head = (primary.get("raw_text_head", "") or "")[:max_primary_chars]
        parts.append("\n=== PRIMARY_URL_RAW_TEXT_HEAD ===")
        parts.append(primary_head if primary_head else "(empty / not fetched)")

    cse = bundle.get("cse", {}) or {}
    parts.append("\n=== GOOGLE_CSE_RESULTS (SELECTED) ===")

    for facet_key in facets:
        facet_block = cse.get(facet_key, []) or []
        parts.append(f"\n--- FACET_SOURCE: {facet_key} ---")

        if not facet_block:
            parts.append("(no results)")
            continue

        # limit number of queries and items
        for qblock in facet_block[:max_queries_per_facet]:
            q = (qblock.get("query") or "").strip()
            items = _compact_items(qblock.get("items", []), max_items=max_items_per_query)
            parts.append(f"[Query] {q}")
            for idx, it in enumerate(items, start=1):
                parts.append(
                    f"  ({idx}) {it['title']}\n"
                    f"      URL: {it['link']}\n"
                    f"      Snippet: {it['snippet']}"
                )

    return "\n".join(parts).strip()

# ----------------
# Facet Extraction Prompt
# ----------------
def facet_prompt(facet: str) -> str:
    """
    Return a strict, facet-specific prompt that enforces compact JSON outputs.
    """
    return f"""
You are extracting ONE facet for a Limited Partner (LP) pre-research profile.

Facet: {facet}

Rules:
- Use ONLY the provided context.
- Do NOT invent facts. If uncertain, use "unknown" and explain in notes.
- Any factual claim should include evidence_urls if available in the context.
- Be concise. Prefer short phrases.
- Return ONLY valid JSON (no markdown, no commentary).

Hard limits:
- evidence_urls: max 5
- details.examples / details.major_recipients: max 3
- summary_text: 1-2 sentences

Return JSON with this schema:
{{
  "facet": "{facet}",
  "has_signal": "yes|no|unknown",
  "summary_text": "string",
  "details": {{}},
  "evidence_urls": ["string"],
  "confidence": 0.0,
  "notes": "string"
}}
""".strip()

def call_llm_facet(facet: str, context_text: str, max_tokens: int = 420, temperature: float = 0.1) -> str:
    resp = _openai_client.chat.completions.create(
        model=MODEL_NAME,
        temperature=temperature,
        max_tokens=max_tokens,
        messages=[
            {"role": "system", "content": "You extract one research facet at a time into strict JSON."},
            {"role": "user", "content": facet_prompt(facet)},
            {"role": "user", "content": "CONTEXT:\n" + context_text},
        ],
        response_format={"type": "json_object"},
    )
    return resp.choices[0].message.content

# ----------------
# JSON Parse (strict; log on failure)
# ----------------
def parse_json_strict(raw: str, org_name: str, facet: str) -> Dict[str, Any]:
    try:
        return json.loads(raw)
    except Exception:
        path = save_raw_llm_output(org_name, facet, raw)
        raise ValueError(f"Invalid JSON for org='{org_name}', facet='{facet}'. Raw saved to: {path}")

def ensure_facet_defaults(obj: Dict[str, Any], facet: str) -> Dict[str, Any]:
    obj = obj if isinstance(obj, dict) else {}
    obj.setdefault("facet", facet)
    obj.setdefault("has_signal", "unknown")
    obj.setdefault("summary_text", "unknown")
    obj.setdefault("details", {})
    obj.setdefault("evidence_urls", [])
    obj.setdefault("confidence", 0.0)
    obj.setdefault("notes", "")
    if not isinstance(obj.get("details"), dict):
        obj["details"] = {}
    if not isinstance(obj.get("evidence_urls"), list):
        obj["evidence_urls"] = []
    return obj

# ----------------
# Define Facet Plan
# ----------------
# Each output facet can draw from one or more CSE source facets from Cell 3.
FACET_PLAN = {
    "business_overview": {
        "include_primary_head": True,
        "source_facets": ["business_overview"],
    },
    "direct_startup_investments": {
        "include_primary_head": False,
        "source_facets": ["direct_startup_investments"],
    },
    "cvc": {
        "include_primary_head": False,
        "source_facets": ["cvc_presence", "cvc_investment_focus"],
    },
    "lp_commitments_to_vc": {
        "include_primary_head": False,
        "source_facets": ["lp_to_vc_commitments"],
    },
}

# ----------------
# Run Extraction for All Candidates → DataFrame
# ----------------
if "lp_raw_signals" not in globals():
    raise ValueError("lp_raw_signals is not defined. Please run Cell 3 first.")

rows: List[Dict[str, Any]] = []

for i, bundle in enumerate(lp_raw_signals, start=1):
    org_name = bundle.get("org_name", f"candidate_{i}")
    primary_url = bundle.get("primary_url", "")

    print(f"Extracting facets with LLM: {org_name} ...")

    for facet, cfg in FACET_PLAN.items():
        ctx = build_facet_context(
            bundle=bundle,
            include_primary_head=cfg["include_primary_head"],
            facets=cfg["source_facets"],
        )

        try:
            raw = call_llm_facet(facet=facet, context_text=ctx, max_tokens=420, temperature=0.1)
            out = parse_json_strict(raw, org_name=org_name, facet=facet)
            out = ensure_facet_defaults(out, facet=facet)

            rows.append({
                "org_name": org_name,
                "primary_url": primary_url,
                "facet": facet,
                "has_signal": out["has_signal"],
                "summary_text": out["summary_text"],
                "details_json": out["details"],
                "evidence_urls": out["evidence_urls"],
                "confidence": float(out.get("confidence", 0.0) or 0.0),
                "notes": out.get("notes", ""),
                "extracted_at": datetime.utcnow().isoformat() + "Z",
                "status": "ok",
            })

        except Exception as e:
            rows.append({
                "org_name": org_name,
                "primary_url": primary_url,
                "facet": facet,
                "has_signal": "unknown",
                "summary_text": "",
                "details_json": {},
                "evidence_urls": [],
                "confidence": 0.0,
                "notes": str(e),
                "extracted_at": datetime.utcnow().isoformat() + "Z",
                "status": "error",
            })

        time.sleep(0.15)

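# Module-level assignment already makes lp_facet_df available to downstream cells.
# (A `global` statement placed after the assignment is a SyntaxError when run as a script.)
lp_facet_df = pd.DataFrame(rows)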

# ----------------
# Optional: Persist
# ----------------
SAVE_PARQUET = False  # set True to persist
if SAVE_PARQUET:
    os.makedirs("artifacts", exist_ok=True)
    out_path = "artifacts/009_lp_facet_df.parquet"
    lp_facet_df.to_parquet(out_path, index=False)
    print(f"Saved: {out_path}")

display(lp_facet_df.sort_values(["org_name", "facet"]))


Extracting facets with LLM: DBJ ...


Unnamed: 0,org_name,primary_url,facet,has_signal,summary_text,details_json,evidence_urls,confidence,notes,extracted_at,status
0,DBJ,https://www.dbj.jp/,business_overview,yes,Development Bank of Japan Inc. (DBJ) is a government-affiliated financial institution focused on long-term investment and financing to create economic value and social contribution.,"{'type': 'government-affiliated financial institution', 'focus': 'long-term investment and financing', 'mission': 'create economic value and social contribution'}","[https://www.dbj.jp/en/co/info/outline.html, https://www.dbj.jp/en/]",0.95,Information is based on official DBJ corporate profile and website. No contradictory data found.,2026-01-07T06:25:31.432008Z,ok
2,DBJ,https://www.dbj.jp/,cvc,yes,"DBJ has a wholly owned venture capital subsidiary, DBJ Capital Co., Ltd., and actively invests in corporate venture capital funds, including AP Ventures Fund III LP focused on hydrogen technologies.","{'venture_subsidiary': 'DBJ Capital Co., Ltd.', 'investment_focus': ['hydrogen technologies', 'life sciences'], 'notable_investments': ['AP Ventures Fund III LP']}","[https://www.dbj.jp/en/topics/dbj_news/2023/html/20240216_204659.html, https://www.dbj-cap.jp/en/, https://www.dbj.jp/en/co/info/quarterly/group-topics/56-3en.html, https://www.amed.go.jp/en/progr...",0.95,"DBJ operates a dedicated venture capital subsidiary and makes CVC investments, with a focus on innovative sectors such as hydrogen and life sciences.",2026-01-07T06:25:40.319415Z,ok
1,DBJ,https://www.dbj.jp/,direct_startup_investments,yes,"DBJ has made direct investments in multiple startups including Form Energy, Heirloom Carbon Technologies, and 3DEO Inc. They also provide long-term financing for startups at various growth stages.","{'examples': ['Form Energy, Inc.', 'Heirloom Carbon Technologies, Inc.', '3DEO Inc.']}","[https://www.dbj.jp/en/topics/dbj_news/2024/html/20250110_205046.html, https://www.dbj.jp/en/topics/dbj_news/2025/html/20251203_206236.html, https://corporate.epson/en/news/2024/240119.html, https...",0.95,DBJ's direct startup investments are evidenced by multiple news releases and their venture capital subsidiary's description. No contradictory information found.,2026-01-07T06:25:35.871335Z,ok
3,DBJ,https://www.dbj.jp/,lp_commitments_to_vc,yes,"DBJ has committed to multiple venture capital funds including Vertex Master Fund II LP, 4BIO Ventures III LP, and AP Ventures Fund III LP, indicating active LP commitments to VC.","{'examples': ['Vertex Master Fund II LP', '4BIO Ventures III LP', 'AP Ventures Fund III LP']}","[https://www.dbj.jp/en/topics/dbj_news/2019/html/dbj_has_invested_in_vertex_master_fund_ii_lp_an_affiliated_venture_capital_fund_of_temasek_hings.html, https://www.dbj.jp/en/topics/dbj_news/2023/h...",0.9,Commitment amounts and total exposure are not specified in the context.,2026-01-07T06:25:44.483551Z,ok
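
The prompt in Cell 4 asks the model to respect a value set for `has_signal` and a limit of five `evidence_urls`, but `ensure_facet_defaults` only fills in missing keys. A post-parse sanitizer could enforce those constraints explicitly. The sketch below is an assumption-laden illustration: `sanitize_facet_output`, `ALLOWED_SIGNALS`, and `MAX_EVIDENCE_URLS` are hypothetical names, and clamping confidence to [0, 1] is a choice made here rather than something the notebook specifies.

```python
from typing import Any, Dict

ALLOWED_SIGNALS = {"yes", "no", "unknown"}
MAX_EVIDENCE_URLS = 5  # matches the "Hard limits" stated in the prompt


def sanitize_facet_output(obj: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a parsed facet object with constrained fields normalized."""
    out = dict(obj)

    # has_signal: case-insensitive match against the allowed set
    signal = str(out.get("has_signal", "unknown")).strip().lower()
    out["has_signal"] = signal if signal in ALLOWED_SIGNALS else "unknown"

    # confidence: coerce to float and clamp into [0.0, 1.0]
    try:
        conf = float(out.get("confidence", 0.0))
    except (TypeError, ValueError):
        conf = 0.0
    out["confidence"] = min(1.0, max(0.0, conf))

    # evidence_urls: keep unique http(s) strings, in order, up to the limit
    urls = out.get("evidence_urls")
    urls = urls if isinstance(urls, list) else []
    seen, kept = set(), []
    for u in urls:
        if isinstance(u, str) and u.startswith(("http://", "https://")) and u not in seen:
            seen.add(u)
            kept.append(u)
    out["evidence_urls"] = kept[:MAX_EVIDENCE_URLS]
    return out


# Hypothetical messy model output
messy = {
    "has_signal": "Yes",
    "confidence": "1.7",
    "evidence_urls": ["https://a.example/1", "https://a.example/1", "n/a"],
}
clean = sanitize_facet_output(messy)
print(clean)
```

Such a step would fit between `ensure_facet_defaults` and the row append, so that downstream sorting by `confidence` operates on comparable values.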


In [10]:
# ============================================================
# Cell 5 : Normalize Extracted Attributes into a Common Schema
# ============================================================
#
# This cell takes the facet-wise extraction table (lp_facet_df) from Cell 4 and
# normalizes it into a single, common LP schema (one row per organization).
#
# Goals:
# - Produce an analysis-ready table with consistent columns across LPs
# - Preserve traceability (evidence URLs) and uncertainty (confidence / notes)
# - Keep "details_json" machine-readable for downstream enrichment
#
# Outputs:
# - lp_profile_df  : one row per org, common schema
# - lp_profile_json: optional dict profiles for export / inspection
#

import json
from typing import Tuple

# ----------------
# Preconditions
# ----------------
if "lp_facet_df" not in globals():
    raise ValueError("lp_facet_df is not defined. Please run Cell 4 first.")

required_cols = {"org_name", "primary_url", "facet", "has_signal", "summary_text", "details_json", "evidence_urls", "confidence", "notes", "status"}
missing = required_cols - set(lp_facet_df.columns)
if missing:
    raise ValueError(f"lp_facet_df is missing required columns: {missing}")

# ----------------
# Helpers
# ----------------
def _as_list(x) -> List[Any]:
    if x is None:
        return []
    if isinstance(x, list):
        return x
    return [x]

def _merge_unique_lists(lists: List[List[Any]], max_items: Optional[int] = None) -> List[Any]:
    """Flatten `lists` in order, dropping duplicates and stopping once `max_items` values are kept."""
    seen = set()
    out = []
    for lst in lists:
        for v in _as_list(lst):
            key = json.dumps(v, ensure_ascii=False, sort_keys=True) if isinstance(v, (dict, list)) else str(v)
            if key not in seen:
                seen.add(key)
                out.append(v)
            if max_items is not None and len(out) >= max_items:
                return out
    return out

def _best_row(df: pd.DataFrame) -> Optional[pd.Series]:
    """
    Choose the 'best' facet row for an org+facet based on:
      1) status == ok
      2) highest confidence
      3) longest summary_text
    """
    if df.empty:
        return None
    ok = df[df["status"] == "ok"].copy()
    cand = ok if not ok.empty else df.copy()
    cand["summary_len"] = cand["summary_text"].fillna("").astype(str).str.len()
    cand = cand.sort_values(["confidence", "summary_len"], ascending=[False, False])
    return cand.iloc[0]

def _extract_details(obj: Any) -> Dict[str, Any]:
    return obj if isinstance(obj, dict) else {}

def normalize_one_org(org_df: pd.DataFrame) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """
    Convert all facet rows for one org into:
      - flat schema row (dict)
      - full profile json (dict, structured)
    """
    org_name = org_df["org_name"].iloc[0]
    primary_url = org_df["primary_url"].iloc[0]

    # Pick best facet rows
    bo_row  = _best_row(org_df[org_df["facet"] == "business_overview"])
    di_row  = _best_row(org_df[org_df["facet"] == "direct_startup_investments"])
    cvc_row = _best_row(org_df[org_df["facet"] == "cvc"])
    lp_row  = _best_row(org_df[org_df["facet"] == "lp_commitments_to_vc"])

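    def _safe(row: Optional[pd.Series], col: str, default=None):
        if row is None:
            return default
        v = row.get(col, default)
        # Treat None and scalar NaN (pandas' missing-value marker) as missing;
        # NaN is the only float that is not equal to itself
        if v is None or (isinstance(v, float) and v != v):
            return default
        return v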

    # Collect normalized fields
    bo_summary = _safe(bo_row, "summary_text", "unknown")
    bo_urls = _safe(bo_row, "evidence_urls", [])
    bo_conf = float(_safe(bo_row, "confidence", 0.0) or 0.0)

    di_has = _safe(di_row, "has_signal", "unknown")
    di_summary = _safe(di_row, "summary_text", "")
    di_details = _extract_details(_safe(di_row, "details_json", {}))
    di_urls = _safe(di_row, "evidence_urls", [])
    di_conf = float(_safe(di_row, "confidence", 0.0) or 0.0)

    cvc_has = _safe(cvc_row, "has_signal", "unknown")
    cvc_summary = _safe(cvc_row, "summary_text", "")
    cvc_details = _extract_details(_safe(cvc_row, "details_json", {}))
    cvc_urls = _safe(cvc_row, "evidence_urls", [])
    cvc_conf = float(_safe(cvc_row, "confidence", 0.0) or 0.0)

    lp_has = _safe(lp_row, "has_signal", "unknown")
    lp_summary = _safe(lp_row, "summary_text", "")
    lp_details = _extract_details(_safe(lp_row, "details_json", {}))
    lp_urls = _safe(lp_row, "evidence_urls", [])
    lp_conf = float(_safe(lp_row, "confidence", 0.0) or 0.0)

    # Merge evidence urls across facets (optional)
    all_urls = _merge_unique_lists([bo_urls, di_urls, cvc_urls, lp_urls], max_items=20)

    # Combine notes
    all_notes = []
    for row in [bo_row, di_row, cvc_row, lp_row]:
        n = _safe(row, "notes", "")
        if n:
            all_notes.append(n)

    # Build structured profile json (keeps facet separation)
    profile_json = {
        "org_name": org_name,
        "primary_url": primary_url,
        "business_overview": {
            "summary": bo_summary,
            "evidence_urls": bo_urls,
            "confidence": bo_conf,
        },
        "direct_startup_investments": {
            "has_direct_investment": di_has,
            "summary": di_summary,
            "details": di_details,
            "evidence_urls": di_urls,
            "confidence": di_conf,
        },
        "cvc": {
            "has_cvc": cvc_has,
            "summary": cvc_summary,
            "details": cvc_details,
            "evidence_urls": cvc_urls,
            "confidence": cvc_conf,
        },
        "lp_commitments_to_vc": {
            "has_lp_commitments": lp_has,
            "summary": lp_summary,
            "details": lp_details,
            "evidence_urls": lp_urls,
            "confidence": lp_conf,
        },
        "all_evidence_urls": all_urls,
        "notes": all_notes,
    }

    # Build flat table row (analysis-friendly)
    flat_row = {
        "org_name": org_name,
        "primary_url": primary_url,

        "business_overview_summary": bo_summary,
        "business_overview_confidence": bo_conf,

        "has_direct_startup_investments": di_has,
        "direct_startup_investments_confidence": di_conf,

        "has_cvc": cvc_has,
        "cvc_confidence": cvc_conf,

        "has_lp_commitments_to_vc": lp_has,
        "lp_commitments_confidence": lp_conf,

        # Keep machine-readable details for drill-down
        "direct_investment_details_json": di_details,
        "cvc_details_json": cvc_details,
        "lp_commitments_details_json": lp_details,

        # Traceability
        "evidence_urls": all_urls,
        "notes": " | ".join(all_notes)[:2000],  # prevent runaway length
    }

    return flat_row, profile_json

# ----------------
# Normalize all orgs
# ----------------
flat_rows = []
profile_jsons = {}

for org_name, org_df in lp_facet_df.groupby("org_name"):
    flat, prof = normalize_one_org(org_df)
    flat_rows.append(flat)
    profile_jsons[org_name] = prof

lp_profile_df = pd.DataFrame(flat_rows).sort_values("org_name").reset_index(drop=True)

# Store at notebook (module) level for downstream export / analysis
lp_profile_json = profile_jsons

display(lp_profile_df)


Unnamed: 0,org_name,primary_url,business_overview_summary,business_overview_confidence,has_direct_startup_investments,direct_startup_investments_confidence,has_cvc,cvc_confidence,has_lp_commitments_to_vc,lp_commitments_confidence,direct_investment_details_json,cvc_details_json,lp_commitments_details_json,evidence_urls,notes
0,DBJ,https://www.dbj.jp/,Development Bank of Japan Inc. (DBJ) is a government-affiliated financial institution focused on long-term investment and financing to create economic value and social contribution.,0.95,yes,0.95,yes,0.95,yes,0.9,"{'examples': ['Form Energy, Inc.', 'Heirloom Carbon Technologies, Inc.', '3DEO Inc.']}","{'venture_subsidiary': 'DBJ Capital Co., Ltd.', 'investment_focus': ['hydrogen technologies', 'life sciences'], 'notable_investments': ['AP Ventures Fund III LP']}","{'examples': ['Vertex Master Fund II LP', '4BIO Ventures III LP', 'AP Ventures Fund III LP']}","[https://www.dbj.jp/en/co/info/outline.html, https://www.dbj.jp/en/, https://www.dbj.jp/en/topics/dbj_news/2024/html/20250110_205046.html, https://www.dbj.jp/en/topics/dbj_news/2025/html/20251203_...",Information is based on official DBJ corporate profile and website. No contradictory data found. | DBJ's direct startup investments are evidenced by multiple news releases and their venture capita...
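
The `all_evidence_urls` field above is built by an order-preserving, de-duplicating merge across facets. The sketch below restates that idea in a standalone, pandas-free form so it can be tried in isolation. The function name `merge_unique` and the `example.com` URLs are hypothetical and used only for illustration; the logic is meant to mirror `_merge_unique_lists` in Cell 5, not replace it.

```python
import json
from typing import Any, Iterable, List, Optional


def merge_unique(lists: Iterable[Any], max_items: Optional[int] = None) -> List[Any]:
    """Hypothetical restatement of _merge_unique_lists: keep first occurrences, in order."""
    seen = set()
    out: List[Any] = []
    for lst in lists:
        # Accept None (skipped), a bare scalar (wrapped), or a list
        items = [] if lst is None else (lst if isinstance(lst, list) else [lst])
        for v in items:
            # dicts/lists are unhashable, so key them by canonical JSON;
            # everything else is keyed by its string form
            if isinstance(v, (dict, list)):
                key = json.dumps(v, ensure_ascii=False, sort_keys=True)
            else:
                key = str(v)
            if key in seen:
                continue
            seen.add(key)
            out.append(v)
            if max_items is not None and len(out) >= max_items:
                return out
    return out


# Illustrative inputs only (not real evidence URLs)
merged = merge_unique([
    ["https://example.com/a", "https://example.com/b"],
    None,                                   # a facet with no URLs
    "https://example.com/a",                # bare string, already seen
    [{"k": 1, "j": 2}, {"j": 2, "k": 1}],   # same dict, different key order
])
print(merged)
```

One design consequence worth keeping in mind: because scalars are keyed by `str(v)`, the integer `1` and the string `"1"` count as duplicates. That is harmless for URL lists but could matter if the helper is reused for mixed-type values.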


In [11]:
# ============================================================
# Cell 6 : Flag Uncertainties, Assumptions, and Missing Signals
# ============================================================
#
# This cell adds explicit "research flags" on top of the normalized LP table
# (lp_profile_df) produced in Cell 5.
#
# Objectives:
# - Identify incomplete profiles (missing facets / low confidence / extraction errors)
# - Surface uncertainty drivers in a structured, review-friendly way
# - Produce a prioritized "follow-up queue" for manual research or re-runs
#
# Outputs:
# - lp_flags_df        : lp_profile_df + flags + follow-up priority
# - lp_followup_df     : subset of candidates requiring attention (sorted)
#

import math

# ----------------
# Preconditions
# ----------------
if "lp_profile_df" not in globals():
    raise ValueError("lp_profile_df is not defined. Please run Cell 5 first.")

# Optional but recommended: use the facet table for better diagnostics
has_facet_table = "lp_facet_df" in globals()

# ----------------
# Configuration (tune as needed)
# ----------------
LOW_CONF_THRESHOLD = 0.55
VERY_LOW_CONF_THRESHOLD = 0.35

# We treat these fields as "core facets" for pre-research completeness.
CORE_FACETS = [
    ("business_overview_confidence", "Business overview"),
    ("direct_startup_investments_confidence", "Direct startup investments"),
    ("cvc_confidence", "CVC presence / scope"),
    ("lp_commitments_confidence", "LP commitments to VC funds"),
]

# ----------------
# Helper Functions
# ----------------
def _is_unknown(x: Any) -> bool:
    if x is None:
        return True
    if isinstance(x, str) and x.strip().lower() in {"unknown", "n/a", "na", ""}:
        return True
    return False

def _safe_float(x: Any, default: float = 0.0) -> float:
    try:
        if x is None or (isinstance(x, float) and math.isnan(x)):
            return default
        return float(x)
    except Exception:
        return default

def _count_urls(urls: Any) -> int:
    if urls is None:
        return 0
    if isinstance(urls, list):
        return len([u for u in urls if isinstance(u, str) and u.strip()])
    return 0

def _facet_status(org_name: str, facet: str) -> str:
    """
    Look up facet extraction status from lp_facet_df if available.
    Returns: "ok" | "error" | "missing"
    """
    if not has_facet_table:
        return "missing"
    df = lp_facet_df[(lp_facet_df["org_name"] == org_name) & (lp_facet_df["facet"] == facet)]
    if df.empty:
        return "missing"
    # If any ok exists, treat as ok; else error
    if (df["status"] == "ok").any():
        return "ok"
    return "error"

# ----------------
# Build Flags
# ----------------
flag_rows = []

for _, row in lp_profile_df.iterrows():
    org = row.get("org_name", "")
    primary_url = row.get("primary_url", "")

    flags = []
    gaps = []

    # URL traceability
    url_n = _count_urls(row.get("evidence_urls"))
    if url_n == 0:
        flags.append("no_evidence_urls")
        gaps.append("No evidence URLs captured in the merged profile.")

    # Business overview
    bo_sum = row.get("business_overview_summary", "")
    bo_conf = _safe_float(row.get("business_overview_confidence"), 0.0)
    if _is_unknown(bo_sum) or bo_conf < VERY_LOW_CONF_THRESHOLD:
        flags.append("missing_or_weak_business_overview")
        gaps.append("Business overview is missing or very low confidence.")

    # Direct startup investments signal
    di_has = row.get("has_direct_startup_investments", "unknown")
    di_conf = _safe_float(row.get("direct_startup_investments_confidence"), 0.0)
    if _is_unknown(di_has) or di_conf < LOW_CONF_THRESHOLD:
        flags.append("uncertain_direct_investment_signal")
        gaps.append("Direct startup investment signal is unknown or low confidence.")

    # CVC signal
    cvc_has = row.get("has_cvc", "unknown")
    cvc_conf = _safe_float(row.get("cvc_confidence"), 0.0)
    if _is_unknown(cvc_has) or cvc_conf < LOW_CONF_THRESHOLD:
        flags.append("uncertain_cvc_signal")
        gaps.append("CVC presence/scope is unknown or low confidence.")

    # LP commitments signal
    lp_has = row.get("has_lp_commitments_to_vc", "unknown")
    lp_conf = _safe_float(row.get("lp_commitments_confidence"), 0.0)
    if _is_unknown(lp_has) or lp_conf < LOW_CONF_THRESHOLD:
        flags.append("uncertain_lp_commitment_signal")
        gaps.append("LP commitments to VC funds are unknown or low confidence.")

    # Facet-level errors (if facet table exists)
    facet_errors = []
    if has_facet_table:
        facet_names = [
            "business_overview",
            "direct_startup_investments",
            "cvc",
            "lp_commitments_to_vc",
        ]
        for f in facet_names:
            st = _facet_status(org, f)
            if st == "error":
                facet_errors.append(f)
        if facet_errors:
            flags.append("facet_extraction_error")
            gaps.append(f"Facet extraction errors: {', '.join(facet_errors)}")

    # Confidence-based summary
    confs = [
        bo_conf,
        di_conf,
        cvc_conf,
        lp_conf
    ]
    min_conf = min(confs) if confs else 0.0
    mean_conf = sum(confs) / len(confs) if confs else 0.0

    if min_conf < VERY_LOW_CONF_THRESHOLD:
        priority = 3  # highest
    elif mean_conf < LOW_CONF_THRESHOLD:
        priority = 2
    elif len(flags) > 0:
        priority = 1
    else:
        priority = 0

    flag_rows.append({
        "org_name": org,
        "primary_url": primary_url,
        "evidence_url_count": url_n,
        "min_confidence": round(min_conf, 3),
        "mean_confidence": round(mean_conf, 3),
        "flags": flags,
        "gap_notes": gaps,
        "followup_priority": priority,
    })

lp_flags_core = pd.DataFrame(flag_rows)

# ----------------
# Merge back into normalized profile table
# ----------------
lp_flags_df = lp_profile_df.merge(lp_flags_core, on=["org_name", "primary_url"], how="left")

# lp_flags_df is defined at notebook (module) level, so downstream cells can use it directly

display(lp_flags_df.sort_values(["followup_priority", "mean_confidence"], ascending=[False, True]))

# ----------------
# Follow-up Queue (prioritized subset)
# ----------------
lp_followup_df = (
    lp_flags_df[lp_flags_df["followup_priority"] > 0]
    .sort_values(["followup_priority", "mean_confidence"], ascending=[False, True])
    .reset_index(drop=True)
)

# lp_followup_df is defined at notebook (module) level, so downstream cells can use it directly

print(f"\nFollow-up candidates: {len(lp_followup_df)} / {len(lp_flags_df)}")
display(lp_followup_df[[
    "org_name",
    "followup_priority",
    "flags",
    "min_confidence",
    "mean_confidence",
    "evidence_url_count"
]])


Unnamed: 0,org_name,primary_url,business_overview_summary,business_overview_confidence,has_direct_startup_investments,direct_startup_investments_confidence,has_cvc,cvc_confidence,has_lp_commitments_to_vc,lp_commitments_confidence,direct_investment_details_json,cvc_details_json,lp_commitments_details_json,evidence_urls,notes,evidence_url_count,min_confidence,mean_confidence,flags,gap_notes,followup_priority
0,DBJ,https://www.dbj.jp/,Development Bank of Japan Inc. (DBJ) is a government-affiliated financial institution focused on long-term investment and financing to create economic value and social contribution.,0.95,yes,0.95,yes,0.95,yes,0.9,"{'examples': ['Form Energy, Inc.', 'Heirloom Carbon Technologies, Inc.', '3DEO Inc.']}","{'venture_subsidiary': 'DBJ Capital Co., Ltd.', 'investment_focus': ['hydrogen technologies', 'life sciences'], 'notable_investments': ['AP Ventures Fund III LP']}","{'examples': ['Vertex Master Fund II LP', '4BIO Ventures III LP', 'AP Ventures Fund III LP']}","[https://www.dbj.jp/en/co/info/outline.html, https://www.dbj.jp/en/, https://www.dbj.jp/en/topics/dbj_news/2024/html/20250110_205046.html, https://www.dbj.jp/en/topics/dbj_news/2025/html/20251203_...",Information is based on official DBJ corporate profile and website. No contradictory data found. | DBJ's direct startup investments are evidenced by multiple news releases and their venture capita...,11,0.9,0.937,[],[],0



Follow-up candidates: 0 / 1


Unnamed: 0,org_name,followup_priority,flags,min_confidence,mean_confidence,evidence_url_count
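
The follow-up queue is empty here because the single DBJ row has no flags and high confidence on every facet. To see how the priority rule in Cell 6 behaves on weaker profiles, the following sketch restates it as a pure function. The function name `followup_priority` and the sample confidence lists (other than the DBJ values shown above) are hypothetical; the thresholds are copied from the Cell 6 configuration.

```python
from typing import List

LOW_CONF_THRESHOLD = 0.55        # same values as the Cell 6 configuration
VERY_LOW_CONF_THRESHOLD = 0.35


def followup_priority(confs: List[float], flags: List[str]) -> int:
    """Hypothetical pure-function restatement of the Cell 6 priority rule."""
    min_conf = min(confs) if confs else 0.0
    mean_conf = sum(confs) / len(confs) if confs else 0.0
    if min_conf < VERY_LOW_CONF_THRESHOLD:
        return 3   # at least one facet is very weak: highest priority
    if mean_conf < LOW_CONF_THRESHOLD:
        return 2   # profile is weak on average
    if flags:
        return 1   # confident overall, but something was flagged
    return 0       # nothing to follow up


# DBJ confidences from the table above, with no flags
p_dbj = followup_priority([0.95, 0.95, 0.95, 0.90], [])

# Made-up profiles for illustration
p_flag_only = followup_priority([0.90, 0.50, 0.90, 0.90], ["uncertain_direct_investment_signal"])
p_weak_mean = followup_priority([0.60, 0.50, 0.50, 0.50], ["uncertain_cvc_signal"])
p_very_weak = followup_priority([0.90, 0.90, 0.90, 0.10], ["uncertain_lp_commitment_signal"])

print(p_dbj, p_flag_only, p_weak_mean, p_very_weak)
```

The checks are ordered, so a single very weak facet outranks a low average: a profile with three strong facets and one at 0.10 still lands in the highest-priority bucket.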


In [15]:
# ============================================================
# Cell 7 (Add-on) : Print a Markdown-style Report (Facts + Hypotheses)
# ============================================================
#
# This block prints a human-readable, Markdown-like report in notebook output.
# - Facts are grounded in extracted fields (with evidence URLs + confidence).
# - Hypotheses are generated by LLM and MUST be treated as tentative.
#
# Requirements:
# - lp_profile_json (Cell 5)
# - lp_flags_df (Cell 6)
# - lp_export_df (Cell 7: after hypothesis generation)
#

from datetime import datetime
import json

if "lp_profile_json" not in globals():
    raise ValueError("lp_profile_json is not defined. Run Cell 5 first.")
if "lp_flags_df" not in globals():
    raise ValueError("lp_flags_df is not defined. Run Cell 6 first.")
if "lp_export_df" not in globals():
    raise ValueError("lp_export_df is not defined. Run Cell 7 hypothesis generation first.")

def _md_list(items, indent=0):
    pad = "  " * indent
    if not items:
        return pad + "- (none)"
    return "\n".join([pad + f"- {x}" for x in items])

def _fmt_conf(x):
    try:
        return f"{float(x):.2f}"
    except Exception:
        return "n/a"

def _take_urls(urls, k=5):
    urls = urls or []
    urls = [u for u in urls if isinstance(u, str) and u.strip()]
    return urls[:k]

def render_one_lp_markdown(org_name: str) -> str:
    prof = lp_profile_json.get(org_name, {}) or {}

    # Flags / gaps
    frow = lp_flags_df[lp_flags_df["org_name"] == org_name]
    frow = frow.iloc[0].to_dict() if not frow.empty else {}
    flags = frow.get("flags", []) or []
    gap_notes = frow.get("gap_notes", []) or []
    priority = frow.get("followup_priority", None)
    mean_conf = frow.get("mean_confidence", None)
    url_n = frow.get("evidence_url_count", None)

    # Hypotheses
    erow = lp_export_df[lp_export_df["org_name"] == org_name]
    erow = erow.iloc[0].to_dict() if not erow.empty else {}
    challenges = erow.get("lp_challenges_hypotheses", []) or []
    hyp_notes = erow.get("lp_challenges_notes", "")

    # Facts: facet objects
    bo = prof.get("business_overview", {}) or {}
    di = prof.get("direct_startup_investments", {}) or {}
    cvc = prof.get("cvc", {}) or {}
    lp = prof.get("lp_commitments_to_vc", {}) or {}

    # Evidence
    bo_urls = _take_urls(bo.get("evidence_urls", []))
    di_urls = _take_urls(di.get("evidence_urls", []))
    cvc_urls = _take_urls(cvc.get("evidence_urls", []))
    lp_urls = _take_urls(lp.get("evidence_urls", []))

    lines = []
    lines.append(f"## {org_name}")
    lines.append(f"- Primary URL: {prof.get('primary_url', '')}")
    lines.append(f"- Follow-up priority: {priority} | Mean confidence: {_fmt_conf(mean_conf)} | Evidence URLs: {url_n}")
    lines.append("")

    # ----------------
    # Facts
    # ----------------
    lines.append("### Facts (Evidence-backed)")
    lines.append("")
    lines.append("**Business overview**")
    lines.append(f"- Summary: {bo.get('summary', 'unknown')}")
    lines.append(f"- Confidence: {_fmt_conf(bo.get('confidence'))}")
    lines.append("- Evidence URLs:")
    lines.append(_md_list(bo_urls, indent=1))
    lines.append("")

    lines.append("**Direct startup investments**")
    lines.append(f"- Has direct investment: {di.get('has_direct_investment', 'unknown')}")
    lines.append(f"- Confidence: {_fmt_conf(di.get('confidence'))}")
    # details may be large; show compact
    di_details = di.get("details", {}) or {}
    if di_details:
        lines.append("- Details (compact):")
        di_json = json.dumps(di_details, ensure_ascii=False)
        lines.append("  - " + di_json[:600] + ("..." if len(di_json) > 600 else ""))
    lines.append("- Evidence URLs:")
    lines.append(_md_list(di_urls, indent=1))
    lines.append("")

    lines.append("**CVC**")
    lines.append(f"- Has CVC: {cvc.get('has_cvc', 'unknown')}")
    lines.append(f"- Confidence: {_fmt_conf(cvc.get('confidence'))}")
    cvc_details = cvc.get("details", {}) or {}
    if cvc_details:
        lines.append("- Details (compact):")
        cvc_json = json.dumps(cvc_details, ensure_ascii=False)
        lines.append("  - " + cvc_json[:600] + ("..." if len(cvc_json) > 600 else ""))
    lines.append("- Evidence URLs:")
    lines.append(_md_list(cvc_urls, indent=1))
    lines.append("")

    lines.append("**LP commitments to VC funds**")
    lines.append(f"- Has LP commitments: {lp.get('has_lp_commitments', 'unknown')}")
    lines.append(f"- Confidence: {_fmt_conf(lp.get('confidence'))}")
    lp_details = lp.get("details", {}) or {}
    if lp_details:
        lines.append("- Details (compact):")
        lp_json = json.dumps(lp_details, ensure_ascii=False)
        lines.append("  - " + lp_json[:600] + ("..." if len(lp_json) > 600 else ""))
    lines.append("- Evidence URLs:")
    lines.append(_md_list(lp_urls, indent=1))
    lines.append("")

    # Flags & gaps
    lines.append("### Data gaps / uncertainty flags")
    lines.append(_md_list(flags, indent=0))
    if gap_notes:
        lines.append("")
        lines.append("**Gap notes**")
        lines.append(_md_list(gap_notes, indent=0))
    lines.append("")

    # ----------------
    # Hypotheses
    # ----------------
    lines.append("### Hypotheses (Tentative; derived from facts + gaps)")
    if not challenges:
        lines.append("- (none generated)")
    else:
        for idx, ch in enumerate(challenges, start=1):
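            hyp = str(ch.get("hypothesis", "")).strip()
            rat = str(ch.get("rationale", "")).strip()
            # Drop a leading label echoed by the LLM so it is not printed twice
            if hyp.lower().startswith("hypothesis:"):
                hyp = hyp[len("hypothesis:"):].strip()
            if rat.lower().startswith("rationale:"):
                rat = rat[len("rationale:"):].strip()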
            conf = _fmt_conf(ch.get("confidence", 0.0))
            lines.append(f"{idx}. **Hypothesis:** {hyp}")
            lines.append(f"   - Rationale: {rat}")
            lines.append(f"   - Confidence (hypothesis): {conf}")
    if hyp_notes:
        lines.append("")
        lines.append(f"**Hypothesis notes:** {hyp_notes}")

    lines.append("\n---\n")
    return "\n".join(lines)

def print_markdown_report(max_orgs: Optional[int] = None):
    orgs = lp_profile_df["org_name"].tolist()
    if max_orgs is not None:
        orgs = orgs[:max_orgs]

    header = []
    header.append("# LP Candidate Pre-Research Report (Facts + Hypotheses)")
    header.append(f"- Generated at: {datetime.utcnow().isoformat()}Z")
    header.append(f"- Candidates: {len(orgs)}")
    header.append("\n---\n")

    report = "\n".join(header) + "\n".join([render_one_lp_markdown(o) for o in orgs])
    print(report)
    return report

# ----------------
# Print report in notebook output
# ----------------
from IPython.display import Markdown, display

# Build the report once; print_markdown_report() prints the raw text and returns it
md_text = print_markdown_report(max_orgs=None)
display(Markdown(md_text))  # also render the same text as formatted Markdown

# ----------------
# Optional: Save the same report as a .md artifact
# ----------------
SAVE_MD = True
if SAVE_MD:
    out_path = os.path.join(OUT_DIR, "009_lp_pre_research_report.md")
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(md_text)
    print(f"\nSaved Markdown report: {out_path}")




# LP Candidate Pre-Research Report (Facts + Hypotheses)
- Generated at: 2026-01-07T07:09:28.384048Z
- Candidates: 1

---
## DBJ
- Primary URL: https://www.dbj.jp/
- Follow-up priority: 0 | Mean confidence: 0.94 | Evidence URLs: 11

### Facts (Evidence-backed)

**Business overview**
- Summary: Development Bank of Japan Inc. (DBJ) is a government-affiliated financial institution focused on long-term investment and financing to create economic value and social contribution.
- Confidence: 0.95
- Evidence URLs:
  - https://www.dbj.jp/en/co/info/outline.html
  - https://www.dbj.jp/en/

**Direct startup investments**
- Has direct investment: yes
- Confidence: 0.95
- Details (compact):
  - {"examples": ["Form Energy, Inc.", "Heirloom Carbon Technologies, Inc.", "3DEO Inc."]}
- Evidence URLs:
  - https://www.dbj.jp/en/topics/dbj_news/2024/html/20250110_205046.html
  - https://www.dbj.jp/en/topics/dbj_news/2025/html/20251203_206236.html
  - https://corporate.epson/en/news/2024/240119.html
  - https://www.dbj-cap.jp/en/

**CVC**
- Has CVC: yes
- Confidence: 0.95
- Details (compact):
  - {"venture_subsidiary": "DBJ Capital Co., Ltd.", "investment_focus": ["hydrogen technologies", "life sciences"], "notable_investments": ["AP Ventures Fund III LP"]}
- Evidence URLs:
  - https://www.dbj.jp/en/topics/dbj_news/2023/html/20240216_204659.html
  - https://www.dbj-cap.jp/en/
  - https://www.dbj.jp/en/co/info/quarterly/group-topics/56-3en.html
  - https://www.amed.go.jp/en/program/list/19/02/005_Capital.html

**LP commitments to VC funds**
- Has LP commitments: yes
- Confidence: 0.90
- Details (compact):
  - {"examples": ["Vertex Master Fund II LP", "4BIO Ventures III LP", "AP Ventures Fund III LP"]}
- Evidence URLs:
  - https://www.dbj.jp/en/topics/dbj_news/2019/html/dbj_has_invested_in_vertex_master_fund_ii_lp_an_affiliated_venture_capital_fund_of_temasek_hings.html
  - https://www.dbj.jp/en/topics/dbj_news/2023/html/20240105_204596.html
  - https://www.dbj.jp/en/topics/dbj_news/2023/html/20240216_204659.html

### Data gaps / uncertainty flags
- (none)

### Hypotheses (Tentative; derived from facts + gaps)
1. **Hypothesis:** DBJ's government affiliation may impose constraints on investment agility and risk tolerance compared to private sector LPs.
   - Rationale: As a government-affiliated institution, DBJ might face regulatory or policy-driven limitations impacting investment speed or risk appetite.
   - Confidence (hypothesis): 0.70
2. **Hypothesis:** DBJ's focus on long-term financing could limit its ability to capitalize on shorter-term, high-growth startup opportunities.
   - Rationale: The business overview emphasizes long-term investment and financing, which may deprioritize more dynamic, short-term venture opportunities.
   - Confidence (hypothesis): 0.65
3. **Hypothesis:** DBJ’s LP commitments appear concentrated in specific sectors such as hydrogen technologies, potentially limiting diversification.
   - Rationale: DBJ’s CVC subsidiary invests in funds like AP Ventures Fund III LP focused on hydrogen, suggesting sector concentration risk.
   - Confidence (hypothesis): 0.60

**Hypothesis notes:** DBJ demonstrates strong engagement in direct startup investments, CVC activity, and LP commitments with high confidence. However, its government affiliation and strategic focus on long-term and sector-specific investments may present practical challenges in portfolio diversification and investment flexibility.

---


# LP Candidate Pre-Research Report (Facts + Hypotheses)
- Generated at: 2026-01-07T07:09:51.105805Z
- Candidates: 1

---
## DBJ
- Primary URL: https://www.dbj.jp/
- Follow-up priority: 0 | Mean confidence: 0.94 | Evidence URLs: 11

### Facts (Evidence-backed)

**Business overview**
- Summary: Development Bank of Japan Inc. (DBJ) is a government-affiliated financial institution focused on long-term investment and financing to create economic value and social contribution.
- Confidence: 0.95
- Evidence URLs:
  - https://www.dbj.jp/en/co/info/outline.html
  - https://www.dbj.jp/en/

**Direct startup investments**
- Has direct investment: yes
- Confidence: 0.95
- Details (compact):
  - {"examples": ["Form Energy, Inc.", "Heirloom Carbon Technologies, Inc.", "3DEO Inc."]}
- Evidence URLs:
  - https://www.dbj.jp/en/topics/dbj_news/2024/html/20250110_205046.html
  - https://www.dbj.jp/en/topics/dbj_news/2025/html/20251203_206236.html
  - https://corporate.epson/en/news/2024/240119.html
  - ht