# Talent Search System - Modular Demo

This notebook shows how to use the modular Talent Search System to search for and analyze research talent.

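## System Architecture

The system is split into the following modules:
- `config.py` - configuration and constants
- `utils.py` - general-purpose utility functions
- `schemas.py` - Pydantic data models
- `llm.py` - LLM configuration and wrappers
- `search.py` - search functionality (SearXNG)
- `extraction.py` - extraction and parsing
- `graph.py` - LangGraph workflow
- `main.py` - main entry point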

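## Workflow

1. **Query parsing** - convert the natural-language query into structured parameters
2. **Search planning** - generate search terms from the query parameters
3. **Content search** - search for relevant content with SearXNG
4. **URL selection** - choose the most relevant URLs to fetch
5. **Content fetching** - fetch web pages and extract their text
6. **Author extraction** - extract author information from papers
7. **Candidate synthesis** - generate candidate cards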
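
The workflow steps can be read as a linear pipeline over a shared state. The sketch below is only an illustration of that idea: the stage functions are stubs invented for this example (hypothetical names and return values) and do not reflect how `graph.py` actually defines its LangGraph nodes.

```python
from typing import Callable, Dict, List

State = Dict[str, object]
Stage = Callable[[State], State]

# Hypothetical stub stages; the real logic lives in the project modules.
def parse_query(state: State) -> State:
    return {**state, "spec": {"keywords": ["graph foundation model"], "top_n": 5}}

def plan_searches(state: State) -> State:
    spec = state["spec"]
    return {**state, "search_terms": [f'"{kw}"' for kw in spec["keywords"]]}

def run_pipeline(stages: List[Stage], state: State) -> State:
    """Apply each stage in order and record which stages have run."""
    for stage in stages:
        state = stage(state)
        state = {**state, "trace": [*state.get("trace", []), stage.__name__]}
    return state

final_state = run_pipeline([parse_query, plan_searches], {"query": "Find 5 PhD candidates"})
print(final_state["trace"])  # ['parse_query', 'plan_searches']
```

Each stage returns a new state dictionary instead of mutating its input, which keeps the stages easy to test in isolation.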

Now let's start using these modules!

## 1. Setup and module imports

In [3]:
# Import the standard library and all project modules
import sys
import os
import config
import utils
import schemas
import llm
import search
import extraction
import graph
import main

print("✅ All modules imported successfully!")

✅ All modules imported successfully!


## 2. Inspect the configuration

In [4]:
# Inspect the default conference configuration
print("Default conference library:")
for conf, aliases in config.DEFAULT_CONFERENCES.items():
	print(f"  {conf}: {aliases}")

print(f"\nDefault years: {config.DEFAULT_YEARS}")
print(f"SearXNG base URL: {config.SEARXNG_BASE_URL}")
print(f"Default number of search results: {config.SEARCH_K}")
print(f"Default number of selected URLs: {config.SELECT_K}")

Default conference library:
  NeurIPS: ['NeurIPS']
  AAAI: ['AAAI']
  IJCAI: ['IJCAI']
  ICML: ['ICML']
  ICLR: ['ICLR']
  KDD: ['KDD']
  ACL: ['ACL']
  EMNLP: ['EMNLP']
  NAACL: ['NAACL']
  AE: ['AE']
  IJCAR: ['IJCAR']
  CVPR: ['CVPR']
  ICCV: ['ICCV']
  ECCV: ['ECCV']
  COLING: ['COLING']
  SIGGRAPH: ['SIGGRAPH']
  ACM-MM: ['ACM MM']
  VIS: ['VIS']
  SIGMOD: ['SIGMOD']
  VLDB: ['VLDB']
  SIGIR: ['SIGIR']
  PODC: ['PODC']
  SIGCOMM: ['SIGCOMM']
  INFOCOM: ['INFOCOM']
  NSDI: ['NSDI']
  CHI: ['CHI']

Default years: [2025, 2024, 2026]
SearXNG base URL: http://127.0.0.1:8888
Default number of search results: 8
Default number of selected URLs: 16
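
The conference library maps a canonical name to a list of aliases (for example `ACM-MM` to `ACM MM`). A minimal sketch of how such a mapping could be used to canonicalize a venue name follows. The `canonical_venue` helper is hypothetical and not part of `config.py`; the dictionary is a small excerpt shaped like the output above.

```python
from typing import Dict, List, Optional

# A small excerpt shaped like the DEFAULT_CONFERENCES output above.
CONFERENCES: Dict[str, List[str]] = {
    "NeurIPS": ["NeurIPS"],
    "ACM-MM": ["ACM MM"],
    "KDD": ["KDD"],
}

def canonical_venue(name: str, conferences: Dict[str, List[str]]) -> Optional[str]:
    """Return the canonical key for a venue name or alias, ignoring case."""
    wanted = name.strip().casefold()
    for canonical, aliases in conferences.items():
        candidates = [canonical, *aliases]
        if any(wanted == c.casefold() for c in candidates):
            return canonical
    return None

print(canonical_venue("acm mm", CONFERENCES))   # ACM-MM
print(canonical_venue("neurips", CONFERENCES))  # NeurIPS
print(canonical_venue("WWW", CONFERENCES))      # None
```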


## 3. Test query parsing

In [5]:
test_query = "Find 5 current PhD/MSc candidates working on graph foundation model."

In [6]:

print(f"Test query: {test_query}")
print("\n" + "="*50 + "\n")

# Parse the query with the LLM (if one is available)
try:
	llm_instance = llm.get_llm("parse", temperature=0.3)

	conf_list = ", ".join(config.DEFAULT_CONFERENCES.keys())
	prompt = (
		"You are a professional talent recruitment analysis assistant responsible for parsing recruitment queries and extracting structured information.\n\n"
		"=== PARSING TASK INSTRUCTIONS ===\n"
		"Please carefully analyze the user's recruitment query and extract the following key information:\n\n"
		"1. **top_n** (int): Number of candidates needed. Look for numbers in the query like '10 candidates', '20 people', etc.\n\n"
		"2. **years** (int[]): Years to focus on for papers. Prioritize recent years like [2024,2025]. Default to [2025,2024] if not specified.\n\n"
		"3. **venues** (string[]): Target conferences/journals. Users will explicitly mention venues like 'ACL', 'NeurIPS', etc.\n"
		f"   Known venues include (not exhaustive): {conf_list}\n"
		"   Recognition rules:\n"
		"   - Direct conference names: ACL, EMNLP, NAACL, NeurIPS, ICLR, ICML\n"
		"   - Conference variants: NIPS→NeurIPS, The Web Conference→WWW\n"
		"   - Platforms: OpenReview (counts as a venue)\n\n"
		"4. **keywords** (string[]): Research areas and technical keywords. Focus on identifying:\n"
		"   Technical keywords: LLM, large language models, transformer, attention, multi-agent, multi-agent systems, reinforcement learning, RL, graph neural networks, GNN, foundation model, social simulation, computer vision, CV, natural language processing, NLP\n"
		"   Research areas: machine learning, deep learning, AI alignment, robotics, computer vision, NLP, speech recognition, recommendation systems\n"
		"   Application areas: autonomous driving, medical AI, finance, social simulation, game theory, human-AI interaction\n\n"
		"5. **must_be_current_student** (bool): Whether candidates must be current students. Look for:\n"
		"   - Explicit requirements: current student, currently enrolled, active student\n"
		"   - Degree phases: PhD student, Master's student, graduate student\n"
		"   - Default: true (unless explicitly stated otherwise)\n\n"
		"6. **degree_levels** (string[]): Acceptable degree levels.\n"
		"   Recognition: PhD, MSc, Master, Graduate, Undergraduate, Bachelor, Postdoc\n"
		"   Default: ['PhD', 'MSc', 'Master', 'Graduate']\n\n"
		"7. **author_priority** (string[]): Author position preferences.\n"
		"   Recognition: first author, last author, corresponding author\n"
		"   Default: ['first', 'last']\n\n"
		"8. **extra_constraints** (string[]): Other constraints.\n"
		"   Recognition: geographic requirements (e.g., 'Asia', 'North America')\n"
		"   institutional requirements (e.g., 'top universities', 'Ivy League')\n"
		"   language requirements, experience requirements, etc.\n\n"
		"=== PARSING STRATEGY TIPS ===\n"
		"• Prioritize explicitly mentioned information, then make reasonable inferences\n"
		"• For technical keywords, identify specific models, methods, and research areas\n"
		"• Distinguish between different recruitment goals: interns vs researchers vs postdocs\n"
		"• Pay attention to time-sensitive information: recent publications, accepted papers, upcoming deadlines\n\n"
		"Return STRICT JSON format only, no additional text.\n\n"
		"User Query:\n"
		f"{test_query}\n"
	)
	mock_spec = llm.safe_structured(llm_instance, prompt, schemas.QuerySpec)

	# Fall back to a broad default venue list when the query names no venue
	if not mock_spec.venues:
		mock_spec.venues = ["ICLR", "ICML", "NeurIPS", "ACL", "EMNLP", "NAACL", "KDD", "WWW", "AAAI", "IJCAI", "CVPR", "ECCV", "ICCV", "SIGIR"]

	print("Parsed query spec:")
	print(f"  Target count: {mock_spec.top_n}")
	print(f"  Years: {mock_spec.years}")
	print(f"  Venues: {mock_spec.venues}")
	print(f"  Keywords: {mock_spec.keywords}")
	print(f"  Must be current student: {mock_spec.must_be_current_student}")
	print(f"  Degree levels: {mock_spec.degree_levels}")
	print(f"  Author priority: {mock_spec.author_priority}")

except Exception as e:
	print(f"LLM parsing failed, falling back to mock data: {e}")

	# Fixed fallback spec; note that it is not derived from test_query
	mock_spec = schemas.QuerySpec(
		top_n=10,
		years=[2025, 2024],
		venues=["ICLR", "ICML", "NeurIPS"],
		keywords=["social simulation", "multi-agent systems"],
		must_be_current_student=True,
		degree_levels=["PhD", "Master"],
		author_priority=["first"],
	)

Test query: Find 5 current PhD/MSc candidates working on graph foundation model.

==================================================

Parsed query spec:
  Target count: 5
  Years: [2025, 2024]
  Venues: ['ICLR', 'ICML', 'NeurIPS', 'ACL', 'EMNLP', 'NAACL', 'KDD', 'WWW', 'AAAI', 'IJCAI', 'CVPR', 'ECCV', 'ICCV', 'SIGIR']
  Keywords: ['graph foundation model', 'foundation model', 'graph neural networks', 'GNN']
  Must be current student: True
  Degree levels: ['PhD', 'MSc']
  Author priority: ['first', 'last']
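
To make the parsed structure concrete, here is a dependency-free sketch of a query spec with the defaults stated in the prompt. It uses a standard-library dataclass as a stand-in for the Pydantic model in `schemas.py`, whose real definition is not shown in this notebook. The class name, the `top_n` default, and the shortened fallback venue list are assumptions made for illustration.

```python
import json
from dataclasses import dataclass, field
from typing import List

FALLBACK_VENUES = ["ICLR", "ICML", "NeurIPS"]  # shortened list, for illustration only

@dataclass
class QuerySpecSketch:
    """Hypothetical stand-in for schemas.QuerySpec; list defaults follow the prompt text."""
    top_n: int = 10  # assumed default, the prompt does not state one
    years: List[int] = field(default_factory=lambda: [2025, 2024])
    venues: List[str] = field(default_factory=list)
    keywords: List[str] = field(default_factory=list)
    must_be_current_student: bool = True
    degree_levels: List[str] = field(
        default_factory=lambda: ["PhD", "MSc", "Master", "Graduate"])
    author_priority: List[str] = field(default_factory=lambda: ["first", "last"])

def parse_spec(raw_json: str) -> QuerySpecSketch:
    """Build a spec from JSON text, ignoring unknown keys and filling in venues."""
    data = json.loads(raw_json)
    known = {k: v for k, v in data.items() if k in QuerySpecSketch.__dataclass_fields__}
    spec = QuerySpecSketch(**known)
    if not spec.venues:
        spec.venues = list(FALLBACK_VENUES)
    return spec

spec = parse_spec('{"top_n": 5, "keywords": ["graph foundation model"], "note": "ignored"}')
print(spec.top_n, spec.venues, spec.years)
```

Filtering unknown keys before construction keeps a stray field in the model's JSON from raising a `TypeError`; a Pydantic model would handle this through its own validation settings instead.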


## 4. Test search query construction

In [7]:
# Build the search queries
search_terms = search.build_conference_queries(mock_spec, config.DEFAULT_CONFERENCES, cap=10)

print(f"Built {len(search_terms)} search terms for the query:\n")
for i, term in enumerate(search_terms[:10], 1):  # show only the first 10
	print(f"{i:2d}. {term}")

if len(search_terms) > 10:
	print(f"    ... and {len(search_terms) - 10} more search terms")

Built 10 search terms for the query:

 1. ICLR 2025 "graph foundation model"
 2. ICLR 2025 "graph foundation model" accepted papers
 3. ICLR 2025 "graph foundation model" accept
 4. ICLR 2025 "graph foundation model" acceptance
 5. ICLR 2025 "graph foundation model" program
 6. ICLR 2025 "graph foundation model" proceedings
 7. ICLR 2025 "graph foundation model" schedule
 8. ICLR 2025 "graph foundation model" paper list
 9. ICLR 2025 "graph foundation model" main conference
10. ICLR 2025 "graph foundation model" research track
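
The terms above follow the pattern `<venue> <year> "<keyword>" <suffix>`. The sketch below reproduces that pattern with `itertools.product` and a cap. It is an approximation inferred from the printed output, not the actual body of `search.build_conference_queries`, and the suffix list is a subset of the suffixes visible above.

```python
from itertools import islice, product
from typing import Iterable, List

# Subset of the suffixes visible in the output above; "" gives the bare term.
SUFFIXES = ["", "accepted papers", "accept", "acceptance", "program"]

def build_queries(venues: Iterable[str], years: Iterable[int],
                  keywords: Iterable[str], cap: int) -> List[str]:
    """Combine venue, year, quoted keyword, and suffix; stop after `cap` terms."""
    combos = product(venues, years, keywords, SUFFIXES)
    terms = (f'{v} {y} "{kw}" {suffix}'.strip() for v, y, kw, suffix in combos)
    return list(islice(terms, cap))

queries = build_queries(["ICLR", "ICML"], [2025, 2024], ["graph foundation model"], cap=6)
for q in queries:
    print(q)
```

With a small cap, a nested enumeration like this spends the whole budget on the first venue and year, which matches the output above where all ten terms target ICLR 2025. Interleaving venues before applying the cap would be one way to spread the budget more evenly.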


In [8]:
## 4.1 Run an actual search with the search terms

# Run real searches with the search terms built above
print("Testing the search terms with the SearXNG search engine...")
print(f"Using the first {min(3, len(search_terms))} search terms for the demo search")
print()

# Demo search (first 3 search terms)
demo_results = []
for i, search_term in enumerate(search_terms[:3], 1):
	print(f"🔍 Search {i}: {search_term}")
	try:

		results = search.searxng_search(
			query=search_term,
			engines=["google"],  # hard-coded to Google for this demo
			pages=2,  # only the first two pages, for the demo
			k_per_query=10  # result limit passed for each query
		)
		demo_results.extend(results)

		print(f"   📊 Found {len(results)} results")
		for j, result in enumerate(results[:2], 1):  # show only the first 2 results
			print(f"   {j}. {result['title'][:60]}...")
			print(f"      URL: {result['url']}")
			print(f"      Engine: {result.get('engine', 'unknown')}")
		print()

	except Exception as e:
		print(f"   ❌ Search failed: {e}")
		print()

print(f"🎯 Found {len(demo_results)} search results in total")
print(f"📈 Search engines used: {config.SEARXNG_ENGINES}")
print(f"🌐 SearXNG service: {config.SEARXNG_BASE_URL}")

# Show how the results are distributed across engines
if demo_results:
    engine_counts = {}
    for result in demo_results:
        engine = result.get('engine', 'unknown')
        engine_counts[engine] = engine_counts.get(engine, 0) + 1

    print("\n📊 Results per engine:")
    for engine, count in engine_counts.items():
        print(f"   {engine}: {count} results")

# Extended test: compare different search parameters
# (placeholder: no parameter comparison is implemented in this cell yet)
print("\n🔧 Extended test: comparing different search parameters")
print("="*50)

print("\n✅ Search term and engine test complete!")
print("💡 Tip: to test more features, adjust the parameters or add new test cases")


# Simulate the pipeline's de-duplication step: keep the first result per URL
seen = set()
uniq_results = []
for r in demo_results:
	u = r.get("url", "")
	if u and u not in seen:
		seen.add(u)
		uniq_results.append(r)


Testing the search terms with the SearXNG search engine...
Using the first 3 search terms for the demo search

🔍 Search 1: ICLR 2025 "graph foundation model"


   📊 Found 17 results
   1. AnyGraph: Graph Foundation Model in the Wild...
      URL: https://openreview.net/forum?id=Kdcqzfypry
      Engine: google
   2. GOFA: A Generative One-For-All Model for Joint Graph ......
      URL: https://openreview.net/forum?id=mIjblC9hfm
      Engine: google

🔍 Search 2: ICLR 2025 "graph foundation model" accepted papers
   📊 Found 18 results
   1. GOFA: A Generative One-For-All Model for Joint Graph ......
      URL: https://proceedings.iclr.cc/paper_files/paper/2025/hash/652c104b5b0652a03684efeaf805463b-Abstract-Conference.html
      Engine: google
   2. AnyGraph: Graph Foundation Model in the Wild...
      URL: https://openreview.net/forum?id=Kdcqzfypry
      Engine: google

🔍 Search 3: ICLR 2025 "graph foundation model" accept
   📊 Found 19 results
   1. AnyGraph: Graph Foundation Model in the Wild...
      URL: https://openreview.net/forum?id=Kdcqzfypry
      Engine: google
   2. GOFA: A Generative One-For-All Model for Joint Graph ......
      URL: https://proceedings.iclr.cc/paper_files/paper/2

In [9]:
# save unique result to jsonl
import json
with open("mock_serp.jsonl", "w") as f:
    for r in uniq_results:
        f.write(json.dumps(r) + "\n")
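
The de-duplication step compares raw URL strings, so two links that differ only in host casing, a trailing slash, or a fragment count as distinct. A possible refinement, sketched under the assumption that such variants should be merged, normalizes each URL first and then round-trips the unique results through a JSONL file. The helper names and the sample records are illustrative; the second record is a made-up variant of the first.

```python
import json
import os
import tempfile
from typing import Dict, List
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    """Lower-case scheme and host, drop the fragment and any trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))

def dedupe(results: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first result for each normalized URL, preserving order."""
    seen = set()
    unique = []
    for item in results:
        key = normalize_url(item.get("url", ""))
        if key and key not in seen:
            seen.add(key)
            unique.append(item)
    return unique

sample = [
    {"title": "AnyGraph", "url": "https://openreview.net/forum?id=Kdcqzfypry"},
    # Made-up variant of the same link (different host casing plus a fragment).
    {"title": "AnyGraph (dup)", "url": "https://OpenReview.net/forum?id=Kdcqzfypry#discussion"},
    {"title": "GOFA", "url": "https://openreview.net/forum?id=mIjblC9hfm"},
]
unique = dedupe(sample)

# Round-trip through a JSONL file: one JSON object per line.
path = os.path.join(tempfile.mkdtemp(), "serp.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for item in unique:
        f.write(json.dumps(item, ensure_ascii=False) + "\n")
with open(path, "r", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(len(unique), loaded == unique)
```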

## 5. Test the URL selection logic

In [10]:
# load the jsonl file
import json
with open("mock_serp.jsonl", "r") as f:
    mock_serp = [json.loads(line) for line in f]
    

llm_instance = llm.get_llm("select", temperature=0.3)


selected_urls = []
selected_serp = []

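# Stricter URL filtering: drop social media and other low-quality domains.
# Note: the checks below match substrings of the full URL, not the hostname.
not_allowed_domains = [
	"x.com", "twitter.com", "github.com", "linkedin.com", "facebook.com",
	"youtube.com", "reddit.com", "medium.com", "substack.com"
]

# Also filter out news sites and general forums.
# Caveat: because these are substring matches, "forum" and "review" also drop
# OpenReview pages such as https://openreview.net/forum?id=...
low_quality_domains = [
	"news", "blog", "forum", "discussion", "comment", "review"
]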

filtered_items = []
for item in mock_serp:
	url = item.get("url", "").lower()
	
	# Check if URL contains any blocked domains
	if any(domain in url for domain in not_allowed_domains):
		continue
		
	# Check if URL contains low-quality indicators
	if any(indicator in url for indicator in low_quality_domains):
		continue
		
	filtered_items.append(item)

# Judge each filtered URL individually
for i, item in enumerate(filtered_items):
	title = item.get("title", "")[:150]
	snippet = item.get("snippet", "")[:200]
	url = item.get("url", "")
	
	query_context = ""
	if mock_spec.venues:
		query_context += f"Target conferences: {', '.join(mock_spec.venues)}\n"
	if mock_spec.years:
		query_context += f"Target years: {', '.join(map(str, mock_spec.years))}\n"
	if mock_spec.keywords:
		query_context += f"Research keywords: {', '.join(mock_spec.keywords)}\n"
	
	prompt = (
		f"Look at this search result and decide if it's worth fetching for academic talent search.\n\n"
		f"Query Context:\n{query_context}\n"
		f"Title: {title}\n"
		f"URL: {url}\n"
		f"Snippet: {snippet}\n\n"
		f"SELECT this URL if it contains ANY of these valuable information:\n"
		f"1. **Paper details**: Paper titles, author names, abstracts, or paper content\n"
		f"2. **Academic lists**: Accepted papers lists, conference programs, workshop papers\n"
		f"3. **Researcher info**: Academic profiles, university pages, research institutions\n"
		f"4. **Conference content**: Proceedings, accepted papers, research tracks\n\n"
		f"PRIORITY: URLs that match the query context (conferences, years, keywords) are preferred.\n\n"
		f"REJECT only if it's clearly:\n"
		f"- Pure social media posts (x.com, facebook, linkedin)\n"
		f"- General news without academic content\n"
		f"- Personal blogs with no research information\n"
		f"- Spam or irrelevant content\n\n"
		f"Be reasonable - if it looks like it might contain academic/research info, select it.\n"
		f"Should I fetch this URL? Return JSON: {{ \"should_fetch\": true/false }}"
	)
	
	try:
		# Get LLM decision for this URL
		result = llm.safe_structured(llm_instance, prompt, schemas.LLMSelectSpec)
		
		if result and result.should_fetch:
			url = item.get("url", "")
			selected_urls.append(url)
			selected_serp.append(item)
			print(f"[select] URL {i+1} selected: {url}...")
		else:
			print(f"[select] URL {i+1} rejected: {url}...")
				
	except Exception as e:
		if config.VERBOSE:
			print(f"[select] URL {i+1} error: {e}")
		continue
    


[select] URL 1 selected: https://iclr.cc/virtual/2025/poster/28473...
[select] URL 2 selected: https://dl.acm.org/doi/10.1145/3711896.3736568...
[select] URL 3 selected: https://proceedings.iclr.cc/paper_files/paper/2025/file/652c104b5b0652a03684efeaf805463b-Paper-Conference.pdf...
[select] URL 4 selected: https://arxiv.org/html/2508.20906v1...
[select] URL 5 selected: https://dl.acm.org/doi/10.1145/3696410.3714952...
[select] URL 6 selected: https://icml.cc/virtual/2025/poster/44147...
[select] URL 7 selected: https://www.computer.org/csdl/journal/tp/2025/06/10915556/24RViEuz34Q...
[select] URL 8 selected: https://arxiv.org/pdf/2403.16137...
[select] URL 9 rejected: https://towardsdatascience.com/foundation-models-in-graph-geometric-deep-learning-f363e2576f58/...
[select] URL 10 selected: https://neurips.cc/virtual/2024/poster/94900...
[select] URL 11 selected: https://aclanthology.org/2024.findings-emnlp.132.pdf...
[select] URL 12 selected: https://www.researchgate.net/publication/38
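
No `openreview.net` URL appears in the selection log, even though the search step returned several. The substring filter in the cell above would explain this, since `forum` and `review` both occur in `https://openreview.net/forum?...`. A hostname-based check avoids that kind of accidental match. The sketch below is one possible alternative, not the project's code; `is_blocked` and `substring_blocked` are hypothetical helper names.

```python
from typing import Iterable
from urllib.parse import urlsplit

BLOCKED_DOMAINS = {"x.com", "twitter.com", "linkedin.com", "facebook.com",
                   "youtube.com", "reddit.com", "medium.com", "substack.com"}

def is_blocked(url: str, blocked: Iterable[str] = BLOCKED_DOMAINS) -> bool:
    """True if the URL's host equals a blocked domain or is a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in blocked)

def substring_blocked(url: str, needles: Iterable[str]) -> bool:
    """The substring rule, reproduced here only for comparison."""
    return any(n in url.lower() for n in needles)

url = "https://openreview.net/forum?id=Kdcqzfypry"
print(substring_blocked(url, ["forum", "review"]))        # True: dropped by the substring rule
print(is_blocked(url))                                     # False: kept by the hostname rule
print(is_blocked("https://www.linkedin.com/in/someone"))  # True
```

Matching on the parsed hostname also prevents `x.com` from blocking unrelated hosts that merely end in those characters, such as `netflix.com`.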

In [11]:
selected_serp

[{'title': 'GOFA: A Generative One-For-All Model for Joint Graph ...',
  'url': 'https://iclr.cc/virtual/2025/poster/28473',
  'snippet': '2025 ... However, unlike text and image data, graph data do not have a definitive structure, posing great challenges to developing a Graph Foundation Model (GFM).',
  'engine': 'google',
  'authors': []},
 {'title': 'Graph Foundation Models: Challenges, Methods, and ...',
  'url': 'https://dl.acm.org/doi/10.1145/3711896.3736568',
  'snippet': 'by Z Wang · 2025 — Boosting Graph Foundation Model from Structural Perspective. arXiv (2024). Google Scholar. [18]. Yuanning Cui, Zequn Sun, and Wei Hu. 2025. A ...',
  'engine': 'google',
  'authors': []},
 {'title': 'GOFA: A GENERATIVE ONE-FOR-ALL MODEL FOR ...',
  'url': 'https://proceedings.iclr.cc/paper_files/paper/2025/file/652c104b5b0652a03684efeaf805463b-Paper-Conference.pdf',
  'snippet': 'by L Kong · Cited by 24 — We introduce GOFA, a generative One-for-All graph foundation model. GOFA is pre-trained

## 6. Content fetching

In [13]:
# Note: this step requires a live network connection and the SearXNG service
fetch_sources = {}

for serp_item in selected_serp:
	url = serp_item["url"]
	snippet = serp_item["snippet"]
	try:
		text = search.fetch_text(url, max_chars=config.FETCH_MAX_CHARS, snippet=snippet)
		
		if len(text) < config.MIN_TEXT_LENGTH:
			continue
		fetch_sources[url] = text
	except Exception as e:
		print(f"[fetch-error] {url} -> {e}")
		continue
	utils.safe_sleep(0.08)
    

In [14]:
fetch_sources

{'https://iclr.cc/virtual/2025/poster/28473': 'SNIPPET: 2025 ... However, unlike text and image data, graph data do not have a definitive structure, posing great challenges to developing a Graph Foundation Model (GFM).\n\nTITLE: Main Navigation\n\nBODY:\nPoster\nGOFA: A Generative One-For-All Model for Joint Graph Language Modeling\nLecheng Kong · Jiarui Feng · Hao Liu · Chengsong Huang · Jiaxin Huang · Yixin Chen · Muhan Zhang\nFoundation models, such as Large Language Models (LLMs) or Large Vision Models (LVMs), have emerged as one of the most powerful tools in the respective fields. However, unlike text and image data, graph data do not have a definitive structure, posing great challenges to developing a Graph Foundation Model (GFM). For example, current attempts at designing general graph models either transform graph data into a language format for LLM-based prediction or still train a GNN model with LLM as an assistant. The former can handle unlimited tasks, while the latter capt
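
Each fetched value has a `SNIPPET: ... TITLE: ... BODY:` layout, and the loop skips texts shorter than a minimum length. The sketch below mimics that layout and the two limits. It is inferred from the output above rather than taken from `search.fetch_text`; the function names are hypothetical, and `max_chars` and `min_length` stand in for `config.FETCH_MAX_CHARS` and `config.MIN_TEXT_LENGTH`, whose values are not shown in this notebook.

```python
from typing import Optional

def compose_source_text(snippet: str, title: str, body: str, max_chars: int) -> str:
    """Join snippet, title, and body in the layout seen above, capped at max_chars."""
    text = f"SNIPPET: {snippet}\n\nTITLE: {title}\n\nBODY:\n{body}"
    return text[:max_chars]

def keep_if_long_enough(text: str, min_length: int) -> Optional[str]:
    """Return the text, or None when it is too short to be useful."""
    return text if len(text) >= min_length else None

text = compose_source_text("A short snippet.", "Main Navigation",
                           "Poster\nExample body text", max_chars=60)
print(repr(text))
print(keep_if_long_enough(text, min_length=200))  # None
```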

## 7.0 Paper name extraction

In [15]:
import config
import utils
import schemas
import llm
import search
import extraction
import graph
import main

import importlib
importlib.reload(config)
importlib.reload(utils)
importlib.reload(extraction)
importlib.reload(schemas)
importlib.reload(llm)
importlib.reload(search)
importlib.reload(graph)
importlib.reload(main)

<module 'main' from '/home/zxia545/_Code/Research_repo/25_0819_kdd_hr/kdd_hr_system/search_dev/talent_search_modules/main.py'>

In [16]:
# Manage paper information with the new structured collection, which supports automatic de-duplication
paper_collection = schemas.PaperCollection()

# Extract a paper name from each fetched source and add it to the collection
for fetch_item in fetch_sources.items():
    paper_name_spec_item = extraction.extract_paper_name_from_sources(fetch_item, mock_spec)
    if paper_name_spec_item.have_paper_name:
        # add_paper de-duplicates and reports whether the paper is new
        is_new = paper_collection.add_paper(
            paper_name=paper_name_spec_item.paper_name,
            url=fetch_item[0]
        )
        if is_new:
            print(f"New paper: {paper_name_spec_item.paper_name}")
        else:
            print(f"Paper already exists, added new URL: {paper_name_spec_item.paper_name}")

# Collect all unique paper names for later processing
unique_paper_names = paper_collection.get_paper_names()
print(f"\nFound {len(unique_paper_names)} unique papers")


# Build a URL-to-paper-name mapping for the semantic search step
final_fetch_paper_name = {}
for paper_info in paper_collection.get_all_papers():
    # Use the primary URL (or else the first URL) as the representative
    primary_url = paper_info.primary_url or paper_info.urls[0]
    final_fetch_paper_name[primary_url] = paper_info.paper_name


[extract_paper_name] Have paper name: True, extracted paper: GOFA: A Generative One-For-All Model for Joint Graph Language Modeling
New paper: GOFA: A Generative One-For-All Model for Joint Graph Language Modeling
[extract_paper_name] Have paper name: True, extracted paper: Boosting Graph Foundation Model from Structural Perspective
New paper: Boosting Graph Foundation Model from Structural Perspective
[extract_paper_name] Have paper name: True, extracted paper: GOFA: A GENERATIVE ONE-FOR-ALL MODEL FOR JOINT GRAPH LANGUAGE MODELING
New paper: GOFA: A GENERATIVE ONE-FOR-ALL MODEL FOR JOINT GRAPH LANGUAGE MODELING
[extract_paper_name] Have paper name: True, extracted paper: Turning Tabular Foundation Models into Graph Foundation Models
New paper: Turning Tabular Foundation Models into Graph Foundation Models
[extract_paper_name] Have paper name: False, extracted paper: 
[extract_paper_name] Have paper name: True, extracted paper: How Expressive are Knowledge Graph Foundation Models?
New paper: How Expressive are

In [17]:
final_fetch_paper_name

{'https://iclr.cc/virtual/2025/poster/28473': 'GOFA: A Generative One-For-All Model for Joint Graph Language Modeling',
 'https://dl.acm.org/doi/10.1145/3711896.3736568': 'Boosting Graph Foundation Model from Structural Perspective',
 'https://proceedings.iclr.cc/paper_files/paper/2025/file/652c104b5b0652a03684efeaf805463b-Paper-Conference.pdf': 'GOFA: A GENERATIVE ONE-FOR-ALL MODEL FOR JOINT GRAPH LANGUAGE MODELING',
 'https://arxiv.org/html/2508.20906v1': 'Turning Tabular Foundation Models into Graph Foundation Models',
 'https://icml.cc/virtual/2025/poster/44147': 'How Expressive are Knowledge Graph Foundation Models?',
 'https://arxiv.org/pdf/2403.16137': 'A Survey on Self-Supervised Graph Foundation Models: Knowledge-Based Perspective',
 'https://neurips.cc/virtual/2024/poster/94900': 'A Prompt-Based Knowledge Graph Foundation Model for Universal In-Context Reasoning',
 'https://aclanthology.org/2024.findings-emnlp.132.pdf': 'OpenGraph: Towards Open Graph Foundation Models',
 'htt
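
The mapping above lists the GOFA paper twice, once in title case and once in upper case, which suggests that the collection compares titles case-sensitively. One way to merge such variants, sketched here as an assumption about the desired behavior, is to compare a normalized key instead of the raw title. `title_key` and `group_by_title` are hypothetical helpers, not part of `schemas.PaperCollection`.

```python
import re
import unicodedata
from typing import Dict, List

def title_key(title: str) -> str:
    """Case-fold, strip punctuation, and collapse whitespace into a comparison key."""
    text = unicodedata.normalize("NFKC", title).casefold()
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())

def group_by_title(url_to_title: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each normalized title to the URLs that carry it."""
    groups: Dict[str, List[str]] = {}
    for url, title in url_to_title.items():
        groups.setdefault(title_key(title), []).append(url)
    return groups

titles = {
    "https://iclr.cc/virtual/2025/poster/28473":
        "GOFA: A Generative One-For-All Model for Joint Graph Language Modeling",
    "https://proceedings.iclr.cc/paper_files/paper/2025/file/652c104b5b0652a03684efeaf805463b-Paper-Conference.pdf":
        "GOFA: A GENERATIVE ONE-FOR-ALL MODEL FOR JOINT GRAPH LANGUAGE MODELING",
    "https://icml.cc/virtual/2025/poster/44147":
        "How Expressive are Knowledge Graph Foundation Models?",
}
groups = group_by_title(titles)
print(len(groups))  # 2
```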

## 7. Author extraction

In [18]:
import os
import importlib

import semantic_paper_search
importlib.reload(semantic_paper_search)
from semantic_paper_search import SemanticScholarClient

# Demonstrate the new semantic search feature
print("=== Demo: new semantic search feature ===")

# Initialize the client. The API key is read from the environment instead of
# being hard-coded; the variable name SEMANTIC_SCHOLAR_API_KEY is an assumption.
client = SemanticScholarClient(api_key=os.environ.get("SEMANTIC_SCHOLAR_API_KEY"))


print("Using the new batch search method:")
results = client.search_papers_with_authors_batch(
	url_to_title=final_fetch_paper_name,
	min_score=0.80,
)

valid_results = [r for r in results if r.found]
print(f"\nSearch results ({len(valid_results)} papers found):")
for result in results:
	print(f"\n📖 Paper: {result.paper_name}")
	print(f"   URL: {result.url}")
	print(f"   Found: {'yes' if result.found else 'no'}")

	if result.found:
		print(f"   Match score: {result.match_score}")
		if result.paper_id:
			print(f"   Paper ID: {result.paper_id}")
		if result.year:
			print(f"   Year: {result.year}")
		if result.venue:
			print(f"   Venue: {result.venue}")
		if result.authors:
			print("   Authors:")
			for author in result.authors:
				author_info = f"{author.name}"
				if author.author_id:
					author_info += f" (ID: {author.author_id})"
				print(f"     • {author_info}")
	else:
		print("   Reason: paper not found or match score too low")



=== Demo: new semantic search feature ===
Using the new batch search method:

Search results (16 papers found):

📖 Paper: GOFA: A Generative One-For-All Model for Joint Graph Language Modeling
   URL: https://iclr.cc/virtual/2025/poster/28473
   Found: yes
   Match score: 248.019
   Paper ID: 055af7789c29452bf808a48f74ec25e01049425e
   Year: 2024
   Venue: International Conference on Learning Representations
   Authors:
     • Lecheng Kong (ID: 2164063663)
     • Jiarui Feng (ID: 2239091724)
     • Hao Liu (ID: 2264134998)
     • Chengsong Huang (ID: 31937655)
     • Jiaxin Huang (ID: 2300246276)
     • Yixin Chen (ID: 2266810999)
     • Muhan Zhang (ID: 2239188141)

📖 Paper: Boosting Graph Foundation Model from Structural Perspective
   URL: https://dl.acm.org/doi/10.1145/3711896.3736568
   Found: yes
   Match score: 212.0903
   Paper ID: 1051287627ca209b0fdc7899d6abf366f076cb59
   Year: 2024
   Venue: arXiv.org
   Authors:
     • Yao Cheng (ID: 2165643958)
     • Yige Zhao (ID: 2266749154)
     • Jianxiang Yu (ID: 2198507600)
     • Xiang Li (ID: 2258789569)

📖 Paper: GOFA: A GENERATIVE ONE-FOR-ALL MODEL

In [19]:
valid_results

[PaperAuthorsResult(url='https://iclr.cc/virtual/2025/poster/28473', paper_name='GOFA: A Generative One-For-All Model for Joint Graph Language Modeling', paper_id='055af7789c29452bf808a48f74ec25e01049425e', match_score=248.019, year=2024, venue='International Conference on Learning Representations', paper_url='https://www.semanticscholar.org/paper/055af7789c29452bf808a48f74ec25e01049425e', authors=[AuthorWithId(name='Lecheng Kong', author_id='2164063663'), AuthorWithId(name='Jiarui Feng', author_id='2239091724'), AuthorWithId(name='Hao Liu', author_id='2264134998'), AuthorWithId(name='Chengsong Huang', author_id='31937655'), AuthorWithId(name='Jiaxin Huang', author_id='2300246276'), AuthorWithId(name='Yixin Chen', author_id='2266810999'), AuthorWithId(name='Muhan Zhang', author_id='2239188141')], found=True),
 PaperAuthorsResult(url='https://dl.acm.org/doi/10.1145/3711896.3736568', paper_name='Boosting Graph Foundation Model from Structural Perspective', paper_id='1051287627ca209b0fdc7
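
Since the same paper can appear under several URLs, the same first author can be reached more than once. The sketch below collects first authors keyed by `author_id` so that each candidate is looked up only once. The dataclasses are simplified stand-ins for `PaperAuthorsResult` and `AuthorWithId`, and the second GOFA entry is an illustrative duplicate, because the notebook output is truncated before that result.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Author:
    name: str
    author_id: str

@dataclass
class PaperHit:
    """Simplified stand-in for PaperAuthorsResult (only the fields used here)."""
    paper_name: str
    authors: List[Author] = field(default_factory=list)
    found: bool = True

def first_authors(hits: List[PaperHit]) -> Dict[str, Dict[str, object]]:
    """Collect first authors keyed by author_id, listing every paper they lead."""
    candidates: Dict[str, Dict[str, object]] = {}
    for hit in hits:
        if not hit.found or not hit.authors:
            continue
        lead = hit.authors[0]
        entry = candidates.setdefault(lead.author_id, {"name": lead.name, "papers": []})
        entry["papers"].append(hit.paper_name)
    return candidates

hits = [
    PaperHit("GOFA: A Generative One-For-All Model for Joint Graph Language Modeling",
             [Author("Lecheng Kong", "2164063663"), Author("Jiarui Feng", "2239091724")]),
    # Illustrative duplicate: assumed to resolve to the same first author.
    PaperHit("GOFA: A GENERATIVE ONE-FOR-ALL MODEL FOR JOINT GRAPH LANGUAGE MODELING",
             [Author("Lecheng Kong", "2164063663")]),
    PaperHit("Boosting Graph Foundation Model from Structural Perspective",
             [Author("Yao Cheng", "2165643958")]),
]
candidates = first_authors(hits)
print(len(candidates))  # 2
```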

In [None]:
%pip install objprint

In [20]:
from objprint import objjson
import json
json_obj = objjson(results[12])

print(json.dumps(json_obj, indent=2))

{
  ".type": "PaperAuthorsResult",
  "url": "https://link.springer.com/article/10.1007/s11227-025-07029-9",
  "paper_name": "Scalable training of trustworthy and energy-efficient predictive graph foundation models for atomistic materials modeling: a case study with HydraGNN",
  "paper_id": "454e2d2c6799ff02e80fb41241f597e28d95bf11",
  "match_score": 354.61664,
  "year": 2024,
  "venue": "Journal of Supercomputing",
  "paper_url": "https://www.semanticscholar.org/paper/454e2d2c6799ff02e80fb41241f597e28d95bf11",
  "authors": [
    {
      ".type": "AuthorWithId",
      "name": "Massimiliano Lupo Pasini",
      "author_id": "48460197"
    },
    {
      ".type": "AuthorWithId",
      "name": "Jong Youl Choi",
      "author_id": "2155367341"
    },
    {
      ".type": "AuthorWithId",
      "name": "Kshitij Mehta",
      "author_id": "2266401208"
    },
    {
      ".type": "AuthorWithId",
      "name": "Pei Zhang",
      "author_id": "2266463752"
    },
    {
      ".type": "AuthorWithId"

## 8. Collect the first author's profile links and publication list

In [22]:
from author_discovery import discover_author_profile, fetch_author_publications_via_s2


def discover_first_author_profile(hit):
    """Look up and print the profile of the first author of a matched paper."""
    first_author = hit.authors[0]
    profile = discover_author_profile(
        first_author.name,
        hit.paper_name,
        aliases=[first_author.name],
        author_id=first_author.author_id,
    )
    print("== PROFILE ==")
    print(profile)
    return profile


# Discover the first author's profile for the first three matched papers
profile_0 = discover_first_author_profile(valid_results[0])
profile_1 = discover_first_author_profile(valid_results[1])
profile_2 = discover_first_author_profile(valid_results[2])


# Optional: fetch more publications (when an author_id is available)
# hit = valid_results[0]
# if hit.authors[0].author_id:
#     pubs = fetch_author_publications_via_s2(hit.authors[0].author_id, k=8)
#     profile_0.selected_publications = (profile_0.selected_publications or []) + pubs


[S2 Integration] Added 11 papers from Semantic Scholar
[Homepage Candidate] Processing: https://lechengkong.github.io/
[Homepage Identity] True (conf: 0.98, reason: Identity verified: The homepage explicitly states the name 'Lecheng Kong (孔乐成)', which matches the target author exactly. The research focus is on Graph Neural Networks (GNNs), Graph Foundation Models (GFM), and their intersection with Large Language Models, which aligns with the known paper 'GOFA: A Generative One-For-All Model for Joint Graph Language Modeling'. The homepage also mentions that the work was accepted by ICLR 2025, confirming publication overlap. The affiliation with Washington University in St. Louis and the timeline of being a fifth-year Ph.D. candidate are consistent with the expected background. There is no conflicting information or ambiguity about the identity.)
[Homepage] Identity verified, starting comprehensive fetch
[Homepage Fetcher] Starting comprehensive fetch for: https://lechengkong.github.io/

In [25]:
from objprint import objjson
import json

# Dump each profile as indented JSON
for idx, profile in enumerate((profile_0, profile_1, profile_2)):
    print(f"== PROFILE {idx} ==")
    print('\n\n')
    print(json.dumps(objjson(profile), indent=2))
    print('\n\n')


== PROFILE 0 ==



{
  ".type": "AuthorProfile",
  "name": "Lecheng Kong",
  "aliases": [],
  "platforms": {
    "linkedin": "https://www.linkedin.com/in/lecheng-kong-12a045152",
    "dblp": "https://dblp.org/pid/319/5576.html",
    "github": "https://github.com/LechengKong",
    "scholar": "https://scholar.google.com/citations?user=yk3-_EgAAAAJ",
    "orcid": "https://orcid.org/0000-0001-9427-8799"
  },
  "ids": {
    "scholar": "yk3-_EgAAAAJ"
  },
  "homepage_url": "https://lechengkong.github.io/",
  "affiliation_current": "Washington University in St. Louis",
  "emails": [
    "jerry.kong@wustl.edu",
    "Verified email at wustl.edu"
  ],
  "interests": [
    "Graph Neural Networks (GNNs)",
    "Graph Foundation Models (GFM)",
    "Graph Models and Large Language Models",
    "graph-based retrieval augmented generation",
    "universal graph foundation model",
    "graph learning",
    "machine learning",
    "graph language modeling"
  ],
  "selected_publications": [
    {
      "t