Implemented web search. #41

ChenZiHong-Gavin · 2025-07-31T12:39:29Z

The search capabilities integrated so far are:
web: Google and Bing;
kg: Wikipedia;
db: UniProt, a repository of protein sequence and functional information. It offers the manually reviewed UniProtKB/Swiss-Prot and the computationally annotated UniProtKB/TrEMBL datasets, covering protein function, sub-cellular localization, interactions, post-translational modifications, and disease associations.

When the raw data lack the information we need, we can fall back on search, using the engines and databases above to gather the missing information and cross-check it.
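As a rough illustration of querying several sources at once, the sketch below fans a single query out to a set of backends concurrently. It is not the code from this PR: `search_all_sources`, the `SearchFn` signature, and the two fake backends are hypothetical names chosen for the example, and the policy of treating a failed source as an empty result is an assumption.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

# Hypothetical backend signature: takes a query string, returns text snippets.
SearchFn = Callable[[str], Awaitable[List[str]]]


async def search_all_sources(
    query: str, backends: Dict[str, SearchFn]
) -> Dict[str, List[str]]:
    """Query every backend concurrently; a failing backend yields an empty list."""
    names = list(backends)
    outcomes = await asyncio.gather(
        *(backends[name](query) for name in names), return_exceptions=True
    )
    results: Dict[str, List[str]] = {}
    for name, outcome in zip(names, outcomes):
        # An exception from one source is recorded as "no results" so that
        # the remaining sources still contribute.
        results[name] = [] if isinstance(outcome, BaseException) else outcome
    return results


async def _fake_wikipedia(query: str) -> List[str]:
    # Stand-in for a real knowledge-graph backend.
    return [f"wiki summary for {query}"]


async def _fake_uniprot(query: str) -> List[str]:
    # Stand-in for a database backend that is temporarily unavailable.
    raise RuntimeError("simulated outage")


def demo() -> Dict[str, List[str]]:
    backends = {"wikipedia": _fake_wikipedia, "uniprot": _fake_uniprot}
    return asyncio.run(search_all_sources("TP53", backends))
```

Keeping per-source results separate, rather than merging them, makes it easier to verify one source against another later on.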

Copilot

Pull Request Overview

This PR enhances GraphGen's search functionality by implementing a comprehensive web search system. The update replaces the previous Wikipedia-only search with a multi-source search capability including Google, Bing, and database sources like UniProt.

Key changes:

Replaces single Wikipedia search with multi-source search architecture supporting web (Google/Bing) and knowledge graph (Wikipedia) sources
Refactors search configuration from boolean flag to structured config with enabled status and search types
Adds new search client implementations for Google, Bing, and UniProt APIs
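The move from a boolean flag to a structured config can be pictured with the sketch below. The field names `enabled` and `search_types`, the set of accepted type identifiers, and the rule that a legacy `True` maps to Wikipedia only are all assumptions made for illustration; the actual keys in `graphgen_config.yaml` may differ.

```python
from dataclasses import dataclass, field
from typing import List, Union

# Assumed identifiers for the supported sources; not taken from the repository.
KNOWN_TYPES = {"google", "bing", "wikipedia", "uniprot"}


@dataclass
class SearchConfig:
    enabled: bool = False
    search_types: List[str] = field(default_factory=list)


def normalize_search_config(raw: Union[bool, dict]) -> SearchConfig:
    """Accept either the legacy boolean flag or the structured form."""
    if isinstance(raw, bool):
        # Assumption: the old flag only ever enabled Wikipedia search.
        return SearchConfig(enabled=raw, search_types=["wikipedia"] if raw else [])
    types = list(raw.get("search_types", []))
    unknown = [t for t in types if t not in KNOWN_TYPES]
    if unknown:
        raise ValueError(f"Unknown search types: {unknown}")
    return SearchConfig(enabled=bool(raw.get("enabled", False)), search_types=types)
```

Rejecting unknown type names at load time surfaces a typo in the config immediately instead of silently skipping a source.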

Reviewed Changes

Copilot reviewed 19 out of 28 changed files in this pull request and generated 4 comments.

File	Description
webui/translation.json	Updates UI translation strings, reorders some entries
webui/app.py	Removes web_search boolean parameter, improves code formatting and structure
resources/examples/keywords_demo.txt	Adds new example file with biological keywords
graphgen/utils/file.py	Creates new utility function for reading different file formats
graphgen/utils/__init__.py	Updates imports to include new file utility
graphgen/operators/search_wikipedia.py	Removes old Wikipedia search implementation
graphgen/operators/search/web/search_google.py	Implements Google search functionality
graphgen/operators/search/web/search_bing.py	Implements Bing search functionality
graphgen/operators/search/search_all.py	Creates unified search orchestrator
graphgen/operators/search/kg/search_wikipedia.py	Refactored Wikipedia search implementation
graphgen/operators/__init__.py	Updates exports to use new search_all function
graphgen/models/search/web/google_search.py	Google Search API client implementation
graphgen/models/search/web/bing_search.py	Bing Search API client implementation
graphgen/models/search/kg/wiki_search.py	Updates Wikipedia search with num_results parameter
graphgen/models/search/db/uniprot_search.py	UniProt database search client implementation
graphgen/models/__init__.py	Updates imports for new search clients
graphgen/graphgen.py	Refactors search configuration and integrates new search system
graphgen/generate.py	Updates to use new search config and file reading utility
graphgen/configs/graphgen_config.yaml	Changes configuration format for search settings

graphgen/operators/search/kg/search_wikipedia.py

Copilot · 2025-07-31T12:40:30Z

graphgen/graphgen.py

+            new_search_entities = await self.full_docs_storage.filter_keys(
+                all_nodes_names
+            )


The filter_keys method is being called with all_nodes_names but this method typically filters for existing keys, not missing ones. For search functionality, you likely want entities that don't already exist in storage, which would require a different approach or method.

Suggested change

-            new_search_entities = await self.full_docs_storage.filter_keys(
-                all_nodes_names
-            )
+            existing_keys = await self.full_docs_storage.get_keys()
+            new_search_entities = [
+                node_name for node_name in all_nodes_names if node_name not in existing_keys
+            ]
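The comment above turns on whether the filter returns keys that already exist or keys that are missing. The sketch below spells out the "missing keys" behaviour as a standalone function; `find_missing_keys` is a hypothetical helper, not part of GraphGen's storage API, and the use of a set plus de-duplication is a design choice for the example.

```python
from typing import Iterable, List


def find_missing_keys(candidates: Iterable[str], existing: Iterable[str]) -> List[str]:
    """Return candidates that are absent from `existing`, in order, without repeats."""
    existing_set = set(existing)  # set membership is O(1) on average
    seen = set()
    missing: List[str] = []
    for key in candidates:
        if key not in existing_set and key not in seen:
            seen.add(key)
            missing.append(key)
    return missing
```

For example, `find_missing_keys(["TP53", "BRCA1", "TP53", "EGFR"], ["BRCA1"])` keeps only the two names that still need a search. Converting the existing keys to a set matters when the list of node names is large, since a membership test against a list is linear in its length.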

Copilot · 2025-07-31T12:40:30Z

graphgen/operators/search/web/search_google.py

+
+    # Get more details from the first search result
+    first_result = search_results[0]
+    content = trafilatura.fetch_url(first_result["link"])


If trafilatura.fetch_url fails or returns None, the subsequent trafilatura.extract call will fail with an unclear error. Consider adding error handling and validation.

Suggested change

-    content = trafilatura.fetch_url(first_result["link"])
+    content = trafilatura.fetch_url(first_result["link"])
+    if content is None:
+        logger.warning("Failed to fetch content for URL: %s", first_result["link"])
+        return None
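A slightly fuller version of this guard also has to consider the extraction step, since both fetching and extracting can produce nothing. The sketch below is an assumption-laden illustration, not the PR's code: `fetch_and_extract` is a hypothetical helper, and the fetch and extract functions are passed in as parameters so the example runs without network access. With trafilatura, one would presumably pass `trafilatura.fetch_url` and `trafilatura.extract`.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def fetch_and_extract(
    url: str,
    fetch: Callable[[str], Optional[str]],
    extract: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Download a page and extract its main text, returning None on any failure."""
    html = fetch(url)
    if html is None:
        logger.warning("Failed to fetch content for URL: %s", url)
        return None
    text = extract(html)
    if not text:
        # Extraction can also come back empty, e.g. for pages with no main text.
        logger.warning("No extractable text for URL: %s", url)
        return None
    return text
```

Returning `None` in both failure cases gives callers a single condition to check before using the text.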

Copilot · 2025-07-31T12:40:31Z

graphgen/operators/search/web/search_bing.py

+
+    # Get more details from the first search result
+    first_result = search_results[0]
+    content = trafilatura.fetch_url(first_result["url"])


If trafilatura.fetch_url fails or returns None, the subsequent trafilatura.extract call will fail with an unclear error. Consider adding error handling and validation.

Suggested change

-    content = trafilatura.fetch_url(first_result["url"])
+    content = trafilatura.fetch_url(first_result["url"])
+    if content is None:
+        logger.warning("Failed to fetch content from URL: %s", first_result["url"])
+        return None
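The Google and Bing snippets differ only in the key that holds the result URL (`"link"` versus `"url"`), and both index `search_results[0]` directly. A small shared helper could cover both cases and an empty result list; the sketch below is one possible shape, and `first_result_url` is a hypothetical name rather than a function in the repository.

```python
from typing import Mapping, Optional, Sequence


def first_result_url(
    results: Sequence[Mapping[str, str]], url_key: str
) -> Optional[str]:
    """Return the URL of the first search result, or None if it is unavailable."""
    if not results:
        # An empty result list would otherwise raise IndexError on results[0].
        return None
    # A missing or empty URL field is treated the same as no result.
    return results[0].get(url_key) or None
```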

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

ChenZiHong-Gavin added 5 commits July 30, 2025 17:51

refactor: refact search_wiki

05581dc

feat: use search results to enrich data

b54632a

feat: add google search

760b27e

feat: uniprot search

e63968b

fix: fix async_clear()

4180b9b

ChenZiHong-Gavin requested a review from Copilot July 31, 2025 12:39

ChenZiHong-Gavin changed the title ~~Enrich search functionality~~ Implemented web search. Jul 31, 2025

Copilot AI reviewed Jul 31, 2025

Update graphgen/operators/search/kg/search_wikipedia.py

e35162d

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

ChenZiHong-Gavin merged commit b64f24a into main Jul 31, 2025
2 checks passed

ChenZiHong-Gavin deleted the search branch July 31, 2025 12:45
