Conversation

@ChenZiHong-Gavin
Collaborator

@ChenZiHong-Gavin ChenZiHong-Gavin commented Jul 31, 2025

The current integrated search capabilities include:
web: Google and Bing;
kg: Wikipedia;
db: UniProt, a repository of protein sequence and functional information. It offers the manually reviewed UniProtKB/Swiss-Prot and the computationally annotated UniProtKB/TrEMBL datasets, covering protein function, subcellular localization, interactions, post-translational modifications, and disease associations.

When the internal raw data lack the information we need, we can fall back on web search, using the engines and databases above to gather and cross-check information broadly.

@ChenZiHong-Gavin ChenZiHong-Gavin requested a review from Copilot July 31, 2025 12:39
@ChenZiHong-Gavin ChenZiHong-Gavin changed the title from "丰富search功能" ("Enrich search functionality") to "Implemented web search." Jul 31, 2025
Contributor

Copilot AI left a comment

Pull Request Overview

This PR enhances GraphGen's search functionality by implementing a comprehensive web search system. The update replaces the previous Wikipedia-only search with a multi-source search capability including Google, Bing, and database sources like UniProt.

Key changes:

  • Replaces single Wikipedia search with multi-source search architecture supporting web (Google/Bing) and knowledge graph (Wikipedia) sources
  • Refactors search configuration from boolean flag to structured config with enabled status and search types
  • Adds new search client implementations for Google, Bing, and UniProt APIs
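The move from a boolean flag to a structured config can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the names `SearchConfig`, `KNOWN_SEARCH_TYPES`, `active_types`, and `from_legacy_flag` are hypothetical, and mapping the old flag to Wikipedia-only search is an assumption based on the previous behavior described above.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical identifiers for the sources named in this PR.
KNOWN_SEARCH_TYPES = ("google", "bing", "wikipedia", "uniprot")


@dataclass
class SearchConfig:
    """Illustrative structured config: an enabled switch plus a list of sources."""

    enabled: bool = False
    search_types: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject typos early instead of silently skipping a source at search time.
        unknown = sorted(set(self.search_types) - set(KNOWN_SEARCH_TYPES))
        if unknown:
            raise ValueError(f"Unknown search types: {unknown}")

    def active_types(self) -> List[str]:
        # A disabled config yields no sources, whatever search_types contains.
        return list(self.search_types) if self.enabled else []


def from_legacy_flag(web_search: bool) -> SearchConfig:
    """Map the old boolean onto the new shape (assumed: True meant Wikipedia only)."""
    return SearchConfig(
        enabled=web_search,
        search_types=["wikipedia"] if web_search else [],
    )
```

Keeping `enabled` separate from `search_types` lets a user switch search off without losing the list of configured sources.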

Reviewed Changes

Copilot reviewed 19 out of 28 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| `webui/translation.json` | Updates UI translation strings, reorders some entries |
| `webui/app.py` | Removes web_search boolean parameter, improves code formatting and structure |
| `resources/examples/keywords_demo.txt` | Adds new example file with biological keywords |
| `graphgen/utils/file.py` | Creates new utility function for reading different file formats |
| `graphgen/utils/__init__.py` | Updates imports to include new file utility |
| `graphgen/operators/search_wikipedia.py` | Removes old Wikipedia search implementation |
| `graphgen/operators/search/web/search_google.py` | Implements Google search functionality |
| `graphgen/operators/search/web/search_bing.py` | Implements Bing search functionality |
| `graphgen/operators/search/search_all.py` | Creates unified search orchestrator |
| `graphgen/operators/search/kg/search_wikipedia.py` | Refactored Wikipedia search implementation |
| `graphgen/operators/__init__.py` | Updates exports to use new search_all function |
| `graphgen/models/search/web/google_search.py` | Google Search API client implementation |
| `graphgen/models/search/web/bing_search.py` | Bing Search API client implementation |
| `graphgen/models/search/kg/wiki_search.py` | Updates Wikipedia search with num_results parameter |
| `graphgen/models/search/db/uniprot_search.py` | UniProt database search client implementation |
| `graphgen/models/__init__.py` | Updates imports for new search clients |
| `graphgen/graphgen.py` | Refactors search configuration and integrates new search system |
| `graphgen/generate.py` | Updates to use new search config and file reading utility |
| `graphgen/configs/graphgen_config.yaml` | Changes configuration format for search settings |

Comment on lines +233 to +235
new_search_entities = await self.full_docs_storage.filter_keys(
    all_nodes_names
)

Copilot AI Jul 31, 2025

The filter_keys method is being called with all_nodes_names but this method typically filters for existing keys, not missing ones. For search functionality, you likely want entities that don't already exist in storage, which would require a different approach or method.

Suggested change
- new_search_entities = await self.full_docs_storage.filter_keys(
-     all_nodes_names
- )
+ existing_keys = await self.full_docs_storage.get_keys()
+ new_search_entities = [
+     node_name for node_name in all_nodes_names if node_name not in existing_keys
+ ]
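
Whether `filter_keys` returns the keys that are present or the keys that are absent is not visible in this conversation, so it is worth checking against the storage class before applying the suggestion. The sketch below shows the intended behavior (search only for entities that are not stored yet) with the two directions named explicitly. `InMemoryDocStorage`, `existing_keys`, `missing_keys`, and `entities_to_search` are hypothetical names used for illustration, not GraphGen's API.

```python
import asyncio
from typing import Dict, Iterable, List, Set


class InMemoryDocStorage:
    """Hypothetical stand-in for an async key-value document store."""

    def __init__(self, data: Dict[str, str]) -> None:
        self._data = dict(data)

    async def existing_keys(self, keys: Iterable[str]) -> Set[str]:
        # Keys from the input that are already stored.
        return {k for k in keys if k in self._data}

    async def missing_keys(self, keys: Iterable[str]) -> Set[str]:
        # Keys from the input that are NOT stored yet.
        return {k for k in keys if k not in self._data}


async def entities_to_search(storage: InMemoryDocStorage, node_names: List[str]) -> List[str]:
    """Return node names that still need a search, in input order, without duplicates."""
    missing = await storage.missing_keys(node_names)
    seen: Set[str] = set()
    ordered: List[str] = []
    for name in node_names:
        if name in missing and name not in seen:
            seen.add(name)
            ordered.append(name)
    return ordered


def demo() -> List[str]:
    storage = InMemoryDocStorage({"TP53": "already fetched"})
    return asyncio.run(entities_to_search(storage, ["TP53", "BRCA1", "BRCA1", "EGFR"]))
```

Naming the method after the direction it filters in avoids the ambiguity this comment raises, and a set-based membership test keeps the lookup cheap for long entity lists.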


# Get more details from the first search result
first_result = search_results[0]
content = trafilatura.fetch_url(first_result["link"])

Copilot AI Jul 31, 2025

If trafilatura.fetch_url fails or returns None, the subsequent trafilatura.extract call will fail with an unclear error. Consider adding error handling and validation.

Suggested change
- content = trafilatura.fetch_url(first_result["link"])
+ content = trafilatura.fetch_url(first_result["link"])
+ if content is None:
+     logger.warning("Failed to fetch content for URL: %s", first_result["link"])
+     return None
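
The same guard is needed in both web search clients, so it could be factored into one small helper. The following is a sketch under assumptions: `fetch_main_text` is a hypothetical name, and the fetch and extract steps are passed in as callables so the helper can be exercised without network access. It also treats an empty extraction result as a miss, which goes slightly beyond the suggestion above.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def fetch_main_text(
    url: str,
    fetch: Callable[[str], Optional[str]],
    extract: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Download a page and extract its main text, returning None on any failure."""
    try:
        html = fetch(url)
    except Exception:  # network errors should not abort the whole search run
        logger.warning("Error while fetching URL: %s", url, exc_info=True)
        return None
    if html is None:
        logger.warning("Failed to fetch content for URL: %s", url)
        return None
    text = extract(html)
    if not text:
        # Extraction can legitimately find nothing (for example on script-only pages).
        logger.warning("No extractable text for URL: %s", url)
        return None
    return text
```

With trafilatura installed, a call would look like `fetch_main_text(first_result["link"], trafilatura.fetch_url, trafilatura.extract)`. The caller then has a single `None` check to handle instead of one per step.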


# Get more details from the first search result
first_result = search_results[0]
content = trafilatura.fetch_url(first_result["url"])

Copilot AI Jul 31, 2025

If trafilatura.fetch_url fails or returns None, the subsequent trafilatura.extract call will fail with an unclear error. Consider adding error handling and validation.

Suggested change
- content = trafilatura.fetch_url(first_result["url"])
+ content = trafilatura.fetch_url(first_result["url"])
+ if content is None:
+     logger.warning("Failed to fetch content from URL: %s", first_result["url"])
+     return None

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@ChenZiHong-Gavin ChenZiHong-Gavin merged commit b64f24a into main Jul 31, 2025
2 checks passed
@ChenZiHong-Gavin ChenZiHong-Gavin deleted the search branch July 31, 2025 12:45