# Scrape Disfold Japan Companies Page

This notebook demonstrates how to:

1. Send an HTTP GET request to `https://disfold.com/japan/companies/` with polite headers.
2. Parse the returned HTML (DOM) using **BeautifulSoup**.
3. Extract company table data into a **pandas DataFrame**.
4. Provide fallbacks (e.g., `pandas.read_html`) and resilient parsing patterns.

We'll keep requests light and respectful (single fetch, no rapid-fire crawling).
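
Steps 2 and 3 can be tried offline before touching the live site. The sketch below parses a small inline table with BeautifulSoup and loads it into a DataFrame. It is an illustration only: `SAMPLE_HTML` and `table_to_frame` are hypothetical names, and the real page's markup may differ from this sample.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical sample markup (assumption): the live page may be structured differently.
SAMPLE_HTML = """
<table>
  <thead><tr><th>Rank</th><th>Company</th><th>Sector</th></tr></thead>
  <tbody>
    <tr><td>1</td><td>Toyota Motor Corporation</td><td>Consumer Discretionary</td></tr>
    <tr><td>2</td><td>Sony Group Corporation</td><td>Technology</td></tr>
  </tbody>
</table>
"""

def table_to_frame(html: str) -> pd.DataFrame:
    """Return the first <table> in `html` as a DataFrame (empty if none is found)."""
    soup = BeautifulSoup(html, "html.parser")  # stdlib-backed parser, no lxml needed
    table = soup.find("table")
    if table is None:
        return pd.DataFrame()
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(" ", strip=True) for td in tr.find_all("td")]
        # Skip the header row (no <td>) and any row whose width does not match.
        if cells and len(cells) == len(headers):
            rows.append(cells)
    return pd.DataFrame(rows, columns=headers)

sample_df = table_to_frame(SAMPLE_HTML)
print(sample_df)
```

Returning an empty DataFrame when no table is found lets a caller decide whether to try another strategy, such as the `pandas.read_html` fallback mentioned in step 4.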

In [17]:
# Dependency setup (run this first)
import sys, subprocess, importlib, datetime

def install(pkg):
    print(f"Installing {pkg} ...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", pkg])

REQUIRED = ["requests", "beautifulsoup4", "pandas"]  # base libs
for pkg in REQUIRED:
    module_name = pkg if pkg != "beautifulsoup4" else "bs4"
    try:
        importlib.import_module(module_name)
    except ImportError:
        install(pkg)

# Attempt lxml (preferred) with fallback to html5lib if build/wheel missing
PARSERS = []
try:
    importlib.import_module("lxml")
    PARSERS.append("lxml")
except ImportError:
    try:
        install("lxml")
        importlib.import_module("lxml")
        PARSERS.append("lxml")
    except Exception as e:
        print("Could not install lxml:", e)
        # fallback html5lib
        try:
            importlib.import_module("html5lib")
        except ImportError:
            install("html5lib")
        PARSERS.append("html5lib")

import requests
from bs4 import BeautifulSoup
import pandas as pd
from pathlib import Path
print("Dependencies ready at", datetime.datetime.utcnow(), "UTC", "Parsers available:", PARSERS)

Dependencies ready at 2025-08-11 15:04:05.686625 UTC Parsers available: ['lxml']


In [18]:
import pandas as pd

# Output data (Excel format); the German column names below are kept as-is
# because they define the output schema.
# Company name (JP / EN)
# Industry
# European activity (yes/no + countries)
# Contact person (name)
# Role / department
# Location
# E-mail address (if available)
# Source / link

SCHEMA_COLUMNS = [
    "Unternehmensname (EN)",
    "Unternehmensname (JP)",
    "Branche",
    "Europaaktivität",
    "Kontaktperson (Name)",
    "Funktion / Abteilung",
    "Standort",
    "E-Mail-Adresse (falls verfügbar)",
    "Quelle / Link"
]

companies = pd.DataFrame(columns=SCHEMA_COLUMNS)
print("Initialized empty companies schema DataFrame")

companies

Initialized empty companies schema DataFrame


Unnamed: 0,Unternehmensname (EN),Unternehmensname (JP),Branche,Europaaktivität,Kontaktperson (Name),Funktion / Abteilung,Standort,E-Mail-Adresse (falls verfügbar),Quelle / Link


# 1. Find the top 50 companies on the website
Keep the row data, locating rows by their class keyword in the raw HTML.

The table on the website has the following columns: Rank, Company, Market Cap (USD), Country, Sector, Industry.

Input: the number of companies to scrape, e.g., 50.

Output: a DataFrame with the columns:
- Rank
- Company
- Market Cap (USD)
- Country
- Sector
- Industry

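### Simplified collection logic (incremental while loop)
We use a base-info table, `companies_base_info`, and append to it page by page until it reaches `TARGET_ROWS` (default 50).

Pseudocode:

```
companies_base_info = empty DataFrame
page = 1
while len(companies_base_info) < TARGET_ROWS:
    fetch page
    parse table rows
    append (deduplicate)
    page += 1
```

Properties:
- No caching, no files written
- Appends page by page, so it can be interrupted at any time
- Deduplicates by company name
- Stops early once the target row count is reached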
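
The loop above can be made concrete with a stubbed page source. This is a minimal sketch, not the implementation inside `utils.get_target_companies` (which is not shown here): `FAKE_PAGES` and `fetch_page_rows` are hypothetical stand-ins for a real fetch-and-parse step, and the target is lowered to 5 for the demo.

```python
import pandas as pd

TARGET_ROWS = 5  # small target for the demo; the notebook uses 50

# Hypothetical stand-in for fetched pages (assumption): page number -> parsed rows.
FAKE_PAGES = {
    1: [{"Company": "Toyota Motor Corporation"},
        {"Company": "Sony Group Corporation"},
        {"Company": "Hitachi Ltd"}],
    2: [{"Company": "Hitachi Ltd"},  # repeated on purpose to exercise deduplication
        {"Company": "Nintendo Co Ltd"}],
    3: [{"Company": "SoftBank Group Corp."},
        {"Company": "Keyence Corp"}],
}

def fetch_page_rows(page: int) -> list:
    """Return the parsed rows for `page`, or an empty list when the page does not exist."""
    return FAKE_PAGES.get(page, [])

collected = []        # rows kept so far, in first-seen order
seen_names = set()    # company names already collected
page = 1
while len(collected) < TARGET_ROWS:
    rows = fetch_page_rows(page)
    if not rows:      # no more pages: stop instead of looping forever
        break
    for row in rows:
        if row["Company"] not in seen_names:
            seen_names.add(row["Company"])
            collected.append(row)
    page += 1

# The last page may overshoot the target, so trim before building the table.
companies_base_info = pd.DataFrame(collected[:TARGET_ROWS])
print(companies_base_info["Company"].tolist())
```

The empty-page guard is an addition to the pseudocode: without it, a site with fewer than `TARGET_ROWS` unique companies would keep the loop running indefinitely.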

In [28]:
# Use the reusable utility instead of duplicating the logic inline
import pandas as pd
import utils  # ensures we use the centralized implementation

# Parameters
TARGET_ROWS = 50  # adjustable

# Fetch using the shared utility (returns columns: Company, Sector, Industry)
_raw_df = utils.get_target_companies(target_rows=TARGET_ROWS)

# Align with the structure the rest of the notebook expects
EXPECTED_COLS = ["rank", "company", "market_cap_usd", "country", "sector", "industry", "detail_link"]
companies_base_info = (
    _raw_df
    .rename(columns={"Company": "company", "Sector": "sector", "Industry": "industry"})
    .assign(
        rank=None,            # not returned by the utility
        market_cap_usd=None,  # not returned by the utility
        country="Japan",      # source list is Japan companies
        detail_link=None,     # not returned by the utility
    )[EXPECTED_COLS]          # reuse the list instead of repeating it
    .head(TARGET_ROWS)
)

print("Loaded companies_base_info rows:", len(companies_base_info))
companies_base_info.head()

Loaded companies_base_info rows: 50


Unnamed: 0,rank,company,market_cap_usd,country,sector,industry,detail_link
0,,Toyota Motor Corporation,,Japan,Japanese Consumer Discretionary,Japanese Auto Manufacturers,
1,,"Mitsubishi UFJ Financial Group, Inc.",,Japan,Japanese Financials,Japanese Banks—Diversified,
2,,Sony Group Corporation,,Japan,Japanese Technology,Japanese Consumer Electronics,
3,,Hitachi Ltd,,Japan,Japanese Industrials,Japanese Conglomerates,
4,,Nintendo Co Ltd,,Japan,Japanese Communication Services,Japanese Electronic Gaming & Multimedia,


In [29]:
companies_base_info

Unnamed: 0,rank,company,market_cap_usd,country,sector,industry,detail_link
0,,Toyota Motor Corporation,,Japan,Japanese Consumer Discretionary,Japanese Auto Manufacturers,
1,,"Mitsubishi UFJ Financial Group, Inc.",,Japan,Japanese Financials,Japanese Banks—Diversified,
2,,Sony Group Corporation,,Japan,Japanese Technology,Japanese Consumer Electronics,
3,,Hitachi Ltd,,Japan,Japanese Industrials,Japanese Conglomerates,
4,,Nintendo Co Ltd,,Japan,Japanese Communication Services,Japanese Electronic Gaming & Multimedia,
5,,SoftBank Group Corp.,,Japan,Japanese Communication Services,Japanese Telecom Services,
6,,Fast Retailing Co. Ltd,,Japan,Japanese Consumer Discretionary,Japanese Apparel Retail,
7,,"Sumitomo Mitsui Financial Group, Inc.",,Japan,Japanese Financials,Japanese Banks—Diversified,
8,,Keyence Corp,,Japan,Japanese Technology,Japanese Scientific & Technical Instruments,
9,,Nippon Telegraph & Telephone Corp,,Japan,Japanese Communication Services,Japanese Telecom Services,


# Initial data processing
First, copy the initial data into the final output.

## Branche
A sector is the broader classification; an industry is the more specific one.

- **Sector**
    - A broad category of the economy, such as Technology, Healthcare, Financials, or Consumer Discretionary.
    - One sector contains several related industries.
- **Industry**
    - A specific business area within a sector; for example, the Technology sector includes Semiconductor Equipment and Consumer Electronics.
    - Describes in finer detail the kind of business a company actually runs.
- **Example:**
    - Sector: Technology
        - Industry: Consumer Electronics
        - Industry: Semiconductor Equipment
    - Sector: Healthcare
        - Industry: Drug Manufacturers
        - Industry: Medical Instruments
- **In practice:**
    - In investment analysis, look at the sector first to grasp the broad trend, then at the industry to pick a specific niche and company.
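
The one-to-many relationship between sector and industry can be checked against the scraped values themselves. The sketch below groups a few (sector, industry) pairs taken from the table output above; `strip_prefix` and `industries_by_sector` are hypothetical helper names, and removing the "Japanese " prefix is an assumption about how the labels might be tidied, not something the notebook does.

```python
from collections import defaultdict

# (sector, industry) pairs copied from the companies_base_info output above.
ROWS = [
    ("Japanese Consumer Discretionary", "Japanese Auto Manufacturers"),
    ("Japanese Technology", "Japanese Consumer Electronics"),
    ("Japanese Technology", "Japanese Scientific & Technical Instruments"),
    ("Japanese Communication Services", "Japanese Telecom Services"),
    ("Japanese Communication Services", "Japanese Telecom Services"),  # duplicate pair
]

def strip_prefix(label: str, prefix: str = "Japanese ") -> str:
    """Drop the leading country qualifier if present; otherwise return the label unchanged."""
    return label[len(prefix):] if label.startswith(prefix) else label

# Map each sector to the distinct industries seen under it, in first-seen order.
industries_by_sector = defaultdict(list)
for sector, industry in ROWS:
    sector_name, industry_name = strip_prefix(sector), strip_prefix(industry)
    if industry_name not in industries_by_sector[sector_name]:
        industries_by_sector[sector_name].append(industry_name)

for sector_name, industries in industries_by_sector.items():
    print(f"{sector_name}: {', '.join(industries)}")
```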

In [21]:
# company.Unternehmensname (EN) = companies_base_info['company']
companies['Unternehmensname (EN)'] = companies_base_info['company']

# Branche = sector, industry
companies['Branche'] = companies_base_info['sector'].fillna('') + ' / ' + companies_base_info['industry'].fillna('')

companies.head()

Unnamed: 0,Unternehmensname (EN),Unternehmensname (JP),Branche,Europaaktivität,Kontaktperson (Name),Funktion / Abteilung,Standort,E-Mail-Adresse (falls verfügbar),Quelle / Link
0,Toyota Motor Corporation,,Japanese Consumer Discretionary / Japanese Aut...,,,,,,
1,"Mitsubishi UFJ Financial Group, Inc.",,Japanese Financials / Japanese Banks—Diversified,,,,,,
2,Sony Group Corporation,,Japanese Technology / Japanese Consumer Electr...,,,,,,
3,Hitachi Ltd,,Japanese Industrials / Japanese Conglomerates,,,,,,
4,Nintendo Co Ltd,,Japanese Communication Services / Japanese Ele...,,,,,,


# Analyze the web page structure

In [26]:
# Directly fetch a company detail page (no prior global soup dependency) and extract
# Company Description and Detailed Description sections.

import re, requests, time
from bs4 import BeautifulSoup

DETAIL_URL = "https://disfold.com/company/toyota-motor-corporation/"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9"
}

print("Fetching:", DETAIL_URL)
resp = requests.get(DETAIL_URL, headers=HEADERS, timeout=30)
print("Status:", resp.status_code)
if resp.status_code != 200:
    raise RuntimeError(f"Failed to fetch detail page: {resp.status_code}")

soup = BeautifulSoup(resp.text, 'lxml') if 'lxml' in PARSERS else BeautifulSoup(resp.text, 'html.parser')

TARGET_LABELS = ["Company Description", "Detailed Description"]

# Helpers

def _clean(txt: str) -> str:
    return re.sub(r"\s+", " ", txt).strip()

def _match_nodes(soup, label):
    pattern = re.compile(rf"\b{re.escape(label)}\b", re.I)
    hits = []
    for text_node in soup.find_all(string=pattern):
        parent = text_node.parent
        if not parent:
            continue
        hits.append(parent)
    return hits

def _extract_value_from_node(node, label):
    label_lower = label.lower()
    # 1. If inside a table header cell (th), try next td
    if node.name == 'th':
        td = node.find_next('td')
        if td:
            val = _clean(td.get_text(" ", strip=True))
            if val and label_lower not in val.lower():
                return val, td
    # 2. If inside a definition list (dt), try next dd
    if node.name == 'dt':
        dd = node.find_next('dd')
        if dd:
            val = _clean(dd.get_text(" ", strip=True))
            if val and label_lower not in val.lower():
                return val, dd
    # 3. Look at immediate next siblings' text
    for sib in node.next_siblings:
        if getattr(sib, 'name', None) in ['script', 'style']:
            continue
        if isinstance(sib, str):
            sib_text = sib
        else:
            sib_text = sib.get_text(" ", strip=True)
        sib_text = _clean(sib_text)
        if sib_text:
            if 2 <= len(sib_text) <= 8000:
                # Escape the label so regex metacharacters in it are matched literally
                cleaned = re.sub(rf"^{re.escape(label)}\s*:?\s*", '', sib_text, flags=re.I)
                if cleaned and cleaned.lower() != label_lower:
                    return cleaned, sib
            break
    # 4. Inline pattern within same node
    node_text = _clean(node.get_text(" ", strip=True))
    inline_match = re.search(rf"{re.escape(label)}\s*:?\s*(.+)", node_text, re.I)
    if inline_match:
        candidate = inline_match.group(1).strip()
        if 0 < len(candidate) < 8000:
            return candidate, node
    return None, node

results = {}
contexts = {}
for label in TARGET_LABELS:
    nodes = _match_nodes(soup, label)
    value = None
    context_snippets = []
    for n in nodes:
        v, origin = _extract_value_from_node(n, label)
        if v and not value:
            value = v
        container = origin
        if container and len(_clean(container.get_text(" ", strip=True))) < 15 and container.parent:
            container = container.parent
        snippet = _clean(container.get_text(" ", strip=True))
        if snippet and snippet not in context_snippets:
            context_snippets.append(snippet if len(snippet) < 300 else snippet[:297] + '...')
        if value:
            break
    results[label] = value or "<not found>"
    contexts[label] = context_snippets[:2]

company_description = results.get('Company Description')
detailed_description = results.get('Detailed Description')

print("Extracted values:")
for k, v in results.items():
    display_v = (v[:500] + '...') if isinstance(v, str) and len(v) > 500 else v
    print(f"- {k}: {display_v}")


Fetching: https://disfold.com/company/toyota-motor-corporation/
Status: 200
Extracted values:
- Company Description: Toyota is a global automobile company that designs, manufactures, assembles, and sells a range of vehicles and related parts and accessories. The company's portfolio includes hybrid and fuel cell vehicles, conventional engine vehicles, mini-vehicles, passenger vehicles, commercial vehicles, and auto parts. In addition, Toyota offers financial services such as retail financing, leasing, insurance, and credit cards, manufactures prefabricated housing, and operates an auto information web portal, G...
- Detailed Description: Toyota Motor Corporation designs, manufactures, assembles, and sells passenger vehicles, minivans and commercial vehicles, and related parts and accessories. It operates in Automotive, Financial Services, and All Other segments. The company offers hybrid cars under the Prius, Prius PHV, C-HR, LC HV, ES HV, Camry, JPN TAXI, Avalon HV, Crown HV, Century H

In [23]:
# company_description
# detailed_description
print("Company Description:", company_description)
print("Detailed Description:", detailed_description)

Company Description: <not found>
Detailed Description: <not found>


In [27]:
# implementation (revised)
# Simplified requirement updates:
# - Remove explicit EUROPE_COUNTRIES enumeration (no long list in prompt or code).
# - branches must be taken ONLY as they explicitly appear in the context text (no invented or summarized/inferred labels).
# - No branch summarization or inference if absent -> allow empty list.
# - Keep detection of european/german presence minimally (keyword heuristic + model output) without enumerating all countries.

from __future__ import annotations
import json, re, textwrap, unicodedata
from typing import List, Optional

# Ensure pydantic
try:
    from pydantic import BaseModel, Field, ValidationError
except ImportError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'pydantic'])
    from pydantic import BaseModel, Field, ValidationError

# -------------------- Simplified Schema (no summary) -------------------- #
class ExtractionResult(BaseModel):
    company: str = Field(..., description="Canonical company name")
    branches: List[str] = Field(default_factory=list, description="Branch / segment strings copied verbatim from context")
    europe_presence: bool = Field(..., description="Any presence/activity in Europe (overall)")
    germany_presence: bool = Field(..., description="Specific presence/activity in Germany")
    countries: List[str] = Field(default_factory=list, description="European countries explicitly appearing in context (strings as-is, no expansion)")

# -------------------- Context Prep -------------------- #
company_desc = (company_description if isinstance(company_description, str) else '')
if company_desc == '<not found>':
    company_desc = ''

detailed_desc = (detailed_description if isinstance(detailed_description, str) else '')
if detailed_desc == '<not found>':
    detailed_desc = ''

raw_context = (company_desc + "\n\n" + detailed_desc).strip() or ""

# Infer company name (best effort, no heavy inference)
company_name_guess = None
# The fetch cell defines DETAIL_URL (upper case), so look that name up here
if 'DETAIL_URL' in globals():
    m = re.search(r"/company/([\w-]+)/?", DETAIL_URL)
    if m:
        company_name_guess = m.group(1).replace('-', ' ').title()
if not company_name_guess and company_desc:
    company_name_guess = company_desc.split('\n', 1)[0][:80]
company_name_guess = company_name_guess or "Unknown Company"

# -------------------- Prompt -------------------- #
# Pydantic v2 deprecates schema_json; prefer model_json_schema. Provide fallback for v1.
try:
    schema_dict = ExtractionResult.model_json_schema()  # Pydantic v2
    schema_json = json.dumps(schema_dict, indent=2, ensure_ascii=False)
except AttributeError:  # Pydantic v1 fallback
    schema_json = ExtractionResult.schema_json(indent=2)

SYSTEM_MSG = "You are an analyst extracting only explicitly stated branch labels and European (esp. German) presence from raw text. Avoid invention."

INSTRUCTIONS = f"""
Extract ONLY explicitly mentioned branch / segment labels and any European (esp. German) presence.
Rules:
- Output ONLY JSON (no markdown fences).
- JSON must EXACTLY match this schema (keys/types):\n{schema_json}
- branches: take verbatim phrases that denote divisions, segments, business units, or product/segment categories IF they appear literally in the context. Do NOT create or generalize new labels. If none, leave empty list.
- europe_presence: true only if context text explicitly mentions Europe or any specific European country or operation located in Europe.
- germany_presence: true only if 'Germany' (or 'German' clearly tied to a location/activity) appears.
- countries: include ONLY the European country names that literally appear (case-insensitive) in the context; do not normalize or expand synonyms; list each at most once preserving original capitalization of first occurrence.
- company: best direct name from context; if absent use provided guess.
- Do NOT add explanations or comments outside JSON.
Context:\n""" + textwrap.shorten(raw_context, width=12000, placeholder=' ...[truncated]...') + f"""
(If the context is empty, return europe_presence=false, germany_presence=false, branches=[], countries=[], company=\"{company_name_guess}\").
"""

user_prompt = INSTRUCTIONS

# -------------------- Model Call -------------------- #
try:
    extraction_agent  # noqa: F821
except NameError:
    from llm import make_agent
    extraction_agent = make_agent(SYSTEM_MSG)

MAX_REPAIRS = 2
attempt = 0
parsed = None
last_raw = None
last_error = None
while attempt <= MAX_REPAIRS and parsed is None:
    if attempt == 0:
        raw_reply = extraction_agent(user_prompt, temperature=0, max_tokens=550)
    else:
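        # Resend the original prompt in case the agent keeps no conversation history
        repair_prompt = (
            user_prompt + "\n\n" +
            "Previous JSON invalid: " + str(last_error) + "\n" +
            "Return ONLY valid JSON adhering to schema (no commentary)."
        )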
        raw_reply = extraction_agent(repair_prompt, temperature=0, max_tokens=400)
    last_raw = raw_reply.strip()
    if last_raw.startswith('```'):
        last_raw = re.sub(r'^```(json)?', '', last_raw).strip()
        if last_raw.endswith('```'):
            last_raw = last_raw[:-3].strip()
    try:
        obj = json.loads(last_raw)
        parsed = ExtractionResult(**obj)
    except (json.JSONDecodeError, ValidationError) as e:
        last_error = e
        attempt += 1
        parsed = None

if parsed is None:
    raise RuntimeError(f"Failed after {MAX_REPAIRS+1} attempts. Last error: {last_error}\nRaw: {last_raw[:800]}")

# -------------------- Minimal Heuristic Adjustments -------------------- #
# Ensure no invented branches: filter out branches not literally in context
ctx_lower = raw_context.lower()
filtered_branches = []
seen = set()
for b in parsed.branches:
    if isinstance(b, str) and b.strip():
        if b.lower() in ctx_lower and b.lower() not in seen:
            seen.add(b.lower())
            filtered_branches.append(b.strip())
parsed.branches = filtered_branches

# If germany mentioned in context but model missed it, patch flags/countries
if 'germany' in ctx_lower or re.search(r'\bgerman\b', ctx_lower):
    if not parsed.germany_presence:
        parsed.germany_presence = True
    # Add 'Germany' if not present but word appears
    if not any(c.lower() == 'germany' for c in parsed.countries):
        parsed.countries.append('Germany')
    if not parsed.europe_presence:
        parsed.europe_presence = True

# If 'europe' appears and europe_presence false, set it true
if 'europe' in ctx_lower and not parsed.europe_presence:
    parsed.europe_presence = True

# Deduplicate countries preserving first occurrence
seen_c = set()
unique_countries = []
for c in parsed.countries:
    lc = c.lower()
    if lc not in seen_c:
        seen_c.add(lc)
        unique_countries.append(c)
parsed.countries = unique_countries

# -------------------- Update DataFrame -------------------- #
updated_row_preview = None
try:
    if 'companies' in globals() and 'Unternehmensname (EN)' in companies.columns:
        norm_target_word = unicodedata.normalize('NFKC', parsed.company).lower().split()[0]
        if norm_target_word:
            mask = companies['Unternehmensname (EN)'].fillna('').str.normalize('NFKC').str.lower().str.contains(re.escape(norm_target_word))
            if mask.any():
                idx = mask[mask].index[0]
                if parsed.europe_presence:
                    label = 'Ja (' + (', '.join(parsed.countries) if parsed.countries else 'Europa (unspecified)') + ')'
                else:
                    label = 'Nein'
                companies.at[idx, 'Europaaktivität'] = label
                updated_row_preview = companies.loc[idx]
except Exception as e:
    print('Warning: could not update companies DataFrame:', e)

# -------------------- Output -------------------- #
print('Extraction (explicit branches only) successful:')
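# Pydantic v2 prefers model_dump(); fall back to dict() on v1
result_dict = parsed.model_dump() if hasattr(parsed, 'model_dump') else parsed.dict()
print(json.dumps(result_dict, indent=2, ensure_ascii=False)[:4000])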
if updated_row_preview is not None:
    print('\nUpdated DataFrame row:')
    display(updated_row_preview.to_frame().T)
else:
    print('\n(No matching DataFrame row updated)')

branch_extraction_result = parsed


Extraction (explicit branches only) successful:
{
  "company": "Toyota Motor Corporation",
  "branches": [
    "Automotive",
    "Financial Services",
    "All Other"
  ],
  "europe_presence": true,
  "germany_presence": false,
  "countries": [
    "Europe"
  ]
}

Updated DataFrame row:




Unnamed: 0,Unternehmensname (EN),Unternehmensname (JP),Branche,Europaaktivität,Kontaktperson (Name),Funktion / Abteilung,Standort,E-Mail-Adresse (falls verfügbar),Quelle / Link
0,Toyota Motor Corporation,,Japanese Consumer Discretionary / Japanese Aut...,Ja (Europe),,,,,
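
The fence-stripping and repair loop in the extraction cell can be exercised without a model by substituting a stub agent. The sketch below is a simplified illustration under stated assumptions: `make_stub_agent`, `strip_fences`, and `parse_with_repairs` are hypothetical names, a plain key check stands in for the Pydantic validation, and the real `extraction_agent` returned by `llm.make_agent` may behave differently.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence, built here to keep this example readable
REQUIRED_KEYS = {"company", "branches", "europe_presence", "germany_presence", "countries"}

def strip_fences(raw: str) -> str:
    """Remove a leading fence (optionally tagged json) and a trailing fence, if present."""
    text = raw.strip()
    text = re.sub(rf"^{FENCE}(?:json)?\s*", "", text)
    text = re.sub(rf"\s*{FENCE}$", "", text)
    return text.strip()

def parse_with_repairs(agent, prompt: str, max_repairs: int = 2) -> dict:
    """Call `agent` until its reply parses as a JSON object with the required keys."""
    last_error = None
    for attempt in range(max_repairs + 1):
        if attempt == 0:
            message = prompt
        else:
            # Include the original prompt so a stateless agent still sees the task.
            message = f"{prompt}\n\nPrevious JSON invalid: {last_error}\nReturn ONLY valid JSON."
        raw = agent(message)
        try:
            obj = json.loads(strip_fences(raw))
            if not isinstance(obj, dict):
                raise ValueError("top-level JSON value is not an object")
            missing = REQUIRED_KEYS - obj.keys()
            if missing:
                raise ValueError(f"missing keys: {sorted(missing)}")
            return obj
        except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
            last_error = exc
    raise RuntimeError(f"Failed after {max_repairs + 1} attempts: {last_error}")

def make_stub_agent(replies):
    """Return a fake agent (assumption) that yields the given replies in order."""
    remaining = iter(replies)
    def agent(message: str) -> str:
        return next(remaining)
    return agent

good_reply = (
    FENCE + "json\n"
    '{"company": "Toyota Motor Corporation", "branches": ["Automotive"], '
    '"europe_presence": true, "germany_presence": false, "countries": []}\n'
    + FENCE
)
stub_agent = make_stub_agent(["not json at all", good_reply])
result = parse_with_repairs(stub_agent, "Extract the fields as JSON.")
print(result["company"], result["branches"])
```

The first stub reply fails to parse, so the loop issues one repair request and accepts the second, fenced reply.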
