Implemented web search. #41
Pull Request Overview
This PR enhances GraphGen's search functionality by implementing a comprehensive web search system. The update replaces the previous Wikipedia-only search with a multi-source search capability including Google, Bing, and database sources like UniProt.
Key changes:
- Replaces single Wikipedia search with multi-source search architecture supporting web (Google/Bing) and knowledge graph (Wikipedia) sources
- Refactors search configuration from boolean flag to structured config with enabled status and search types
- Adds new search client implementations for Google, Bing, and UniProt APIs
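The move from a boolean flag to a structured config can be sketched roughly as follows. This is a minimal illustration, not the PR's actual schema: the names `SearchConfig`, `parse_search_config`, `search_types`, and `KNOWN_SEARCH_TYPES` are hypothetical, and mapping the legacy `true` value to Wikipedia-only search is an assumption based on the overview above.

```python
from dataclasses import dataclass, field
from typing import Any, List

# Hypothetical set of source names, taken from the PR description.
KNOWN_SEARCH_TYPES = {"google", "bing", "wikipedia", "uniprot"}


@dataclass
class SearchConfig:
    """Structured replacement for a single on/off search flag."""
    enabled: bool = False
    search_types: List[str] = field(default_factory=list)


def parse_search_config(raw: Any) -> SearchConfig:
    """Accept either the legacy boolean or the structured mapping."""
    if isinstance(raw, bool):
        # Assumption: the old flag meant Wikipedia-only search.
        return SearchConfig(enabled=raw, search_types=["wikipedia"] if raw else [])
    if not isinstance(raw, dict):
        raise TypeError(f"search config must be a bool or a mapping, got {type(raw).__name__}")

    # Normalize names to lowercase and reject anything unrecognized.
    types = [str(t).lower() for t in raw.get("search_types", [])]
    unknown = sorted(set(types) - KNOWN_SEARCH_TYPES)
    if unknown:
        raise ValueError(f"unknown search types: {unknown}")
    return SearchConfig(enabled=bool(raw.get("enabled", False)), search_types=types)
```

Accepting both shapes in one parser is one way to keep older config files working during a migration; whether GraphGen does this is not stated in the PR.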
Reviewed Changes
Copilot reviewed 19 out of 28 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| webui/translation.json | Updates UI translation strings, reorders some entries |
| webui/app.py | Removes web_search boolean parameter, improves code formatting and structure |
| resources/examples/keywords_demo.txt | Adds new example file with biological keywords |
| graphgen/utils/file.py | Creates new utility function for reading different file formats |
| graphgen/utils/__init__.py | Updates imports to include new file utility |
| graphgen/operators/search_wikipedia.py | Removes old Wikipedia search implementation |
| graphgen/operators/search/web/search_google.py | Implements Google search functionality |
| graphgen/operators/search/web/search_bing.py | Implements Bing search functionality |
| graphgen/operators/search/search_all.py | Creates unified search orchestrator |
| graphgen/operators/search/kg/search_wikipedia.py | Refactored Wikipedia search implementation |
| graphgen/operators/__init__.py | Updates exports to use new search_all function |
| graphgen/models/search/web/google_search.py | Google Search API client implementation |
| graphgen/models/search/web/bing_search.py | Bing Search API client implementation |
| graphgen/models/search/kg/wiki_search.py | Updates Wikipedia search with num_results parameter |
| graphgen/models/search/db/uniprot_search.py | UniProt database search client implementation |
| graphgen/models/__init__.py | Updates imports for new search clients |
| graphgen/graphgen.py | Refactors search configuration and integrates new search system |
| graphgen/generate.py | Updates to use new search config and file reading utility |
| graphgen/configs/graphgen_config.yaml | Changes configuration format for search settings |

    new_search_entities = await self.full_docs_storage.filter_keys(
        all_nodes_names
    )

Copilot AI, Jul 31, 2025
The filter_keys method is being called with all_nodes_names but this method typically filters for existing keys, not missing ones. For search functionality, you likely want entities that don't already exist in storage, which would require a different approach or method.
Suggested change:

    - new_search_entities = await self.full_docs_storage.filter_keys(
    -     all_nodes_names
    - )
    + existing_keys = await self.full_docs_storage.get_keys()
    + new_search_entities = [
    +     node_name for node_name in all_nodes_names if node_name not in existing_keys
    + ]
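
Whether `filter_keys` returns existing or missing keys depends on the storage implementation, which is not shown here. Independent of that, the "keep only names not yet stored" step can be sketched as a small pure function. The name `select_missing` is hypothetical; the set-based lookup and de-duplication are illustrative choices, not part of the suggestion above.

```python
from typing import Iterable, List


def select_missing(candidates: Iterable[str], existing_keys: Iterable[str]) -> List[str]:
    """Return candidates absent from existing_keys, in first-seen order, without duplicates."""
    existing = set(existing_keys)  # set membership avoids rescanning a list per candidate
    seen = set()
    missing: List[str] = []
    for name in candidates:
        if name not in existing and name not in seen:
            seen.add(name)
            missing.append(name)
    return missing
```

For example, `select_missing(["a", "b", "a", "c"], ["b"])` keeps `"a"` and `"c"` once each.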

    # Get more details from the first search result
    first_result = search_results[0]
    content = trafilatura.fetch_url(first_result["link"])

Copilot AI, Jul 31, 2025
If trafilatura.fetch_url fails or returns None, the subsequent trafilatura.extract call will fail with an unclear error. Consider adding error handling and validation.
Suggested change:

    - content = trafilatura.fetch_url(first_result["link"])
    + content = trafilatura.fetch_url(first_result["link"])
    + if content is None:
    +     logger.warning("Failed to fetch content for URL: %s", first_result["link"])
    +     return None

    # Get more details from the first search result
    first_result = search_results[0]
    content = trafilatura.fetch_url(first_result["url"])

Copilot AI, Jul 31, 2025
If trafilatura.fetch_url fails or returns None, the subsequent trafilatura.extract call will fail with an unclear error. Consider adding error handling and validation.
Suggested change:

    - content = trafilatura.fetch_url(first_result["url"])
    + content = trafilatura.fetch_url(first_result["url"])
    + if content is None:
    +     logger.warning("Failed to fetch content from URL: %s", first_result["url"])
    +     return None
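
Since the same guard is suggested for both the Google (`"link"`) and Bing (`"url"`) clients, one option is a shared helper. The sketch below is an assumption about how that could look, not code from the PR: `fetch_page_text` and its `fetch`/`extract` parameters are hypothetical, added so the helper can be exercised with stubs. It relies only on `trafilatura.fetch_url` and `trafilatura.extract`, both of which return `None` on failure.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def fetch_page_text(
    url: str,
    fetch: Optional[Callable[[str], Optional[str]]] = None,
    extract: Optional[Callable[[str], Optional[str]]] = None,
) -> Optional[str]:
    """Download a page and return its main text, or None if either step fails."""
    if fetch is None or extract is None:
        # Imported lazily so the helper can be tested without trafilatura installed.
        import trafilatura

        fetch = fetch or trafilatura.fetch_url
        extract = extract or trafilatura.extract

    downloaded = fetch(url)
    if downloaded is None:
        logger.warning("Failed to fetch content from URL: %s", url)
        return None

    text = extract(downloaded)
    if not text:
        # extract can also come back empty, for instance on pages with no main content.
        logger.warning("No extractable text at URL: %s", url)
        return None
    return text
```

Each client would then pass its own field, for example `fetch_page_text(first_result["link"])` or `fetch_page_text(first_result["url"])`.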
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
The current integrated search capabilities include:
- web: Google and Bing
- kg: Wikipedia
- db: UniProt, a repository of protein sequence and functional information. It offers the manually reviewed UniProtKB/Swiss-Prot and the computationally annotated UniProtKB/TrEMBL datasets, covering protein function, subcellular localization, interactions, post-translational modifications, and disease associations.
When the raw data we need are not available internally, we can fall back on web search, using the engines and databases above to gather and cross-check information.
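
Querying several sources for the same entity could be sketched like this. It is a hypothetical illustration, not the PR's `search_all` implementation: `search_all_sources`, the `backends` mapping, and the choice to swallow per-backend failures are all assumptions.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

# A backend takes a query string and returns a list of result records.
SearchFn = Callable[[str], Awaitable[List[dict]]]


async def search_all_sources(query: str, backends: Dict[str, SearchFn]) -> Dict[str, List[dict]]:
    """Run every backend concurrently and collect results per source name."""
    names = list(backends)
    outcomes = await asyncio.gather(
        *(backends[name](query) for name in names),
        return_exceptions=True,  # one failing source should not cancel the others
    )
    results: Dict[str, List[dict]] = {}
    for name, outcome in zip(names, outcomes):
        # A failed backend contributes an empty list instead of raising.
        results[name] = [] if isinstance(outcome, BaseException) else outcome
    return results
```

Keeping results keyed by source makes it possible to compare what, say, a web engine and UniProt return for the same entity before merging them.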