11 changes: 2 additions & 9 deletions .gitignore
@@ -1,9 +1,2 @@
-# Generated files
-dummy_git_config.txt
-git_history.json
-target/
-.git_commit_message.txt
-result
-result/
-submodule_collector_strace.log
-*~
+
+.emacs.d/

32 changes: 32 additions & 0 deletions docs/crq_standardized/CRQ-059-wikipedia-wikidata-extractor.md
@@ -0,0 +1,32 @@
**CRQ: CRQ-059: Implement Wikipedia and Wikidata Extractor**

**Problem/Goal:**
To expand our knowledge base by extracting content from Wikipedia articles and their associated Wikidata RDF facts. This data will be processed and stored in a structured file system cache for efficient retrieval and analysis. The extraction process needs to be robust, capable of parsing page content, extracting links, and associating data with version information from Wikipedia.

**Proposed Solution:**

1. **Create `wikipedia_extractor` Rust Crate:** (Already initiated) This new crate will encapsulate all Wikipedia and Wikidata extraction logic.
2. **Define Data Structures:**
* Implement Rust structs to represent Wikipedia article content (text, links, metadata) and Wikidata RDF facts (see the data-model sketch after this list).
* Consider using existing crates for RDF parsing if available and suitable.
3. **Wikipedia Article Fetching:**
* Implement functionality to fetch Wikipedia article content given a URL or article title (a fetching sketch follows the list).
* Handle different content formats (e.g., HTML, MediaWiki API responses).
4. **Wikidata RDF Fact Extraction:**
* Integrate with the Wikidata API to retrieve RDF facts associated with Wikipedia articles.
* Parse and store these facts in the defined data structures (see the Wikidata sketch after this list).
5. **Link Extraction:**
* Develop logic to parse Wikipedia article HTML/content and extract internal and external links (a link-extraction sketch follows the list).
6. **Structured File System Cache:**
* Implement a mechanism to store the extracted Wikipedia content, Wikidata facts, and links in a structured file system (a cache-layout sketch follows the list).
* Include versioning based on Wikipedia's revision IDs or timestamps to ensure data freshness.
7. **Integration with URL Tool:**
* Modify the existing `url_extractor` or create a new "URL tool" that can identify Wikipedia URLs.
* When a Wikipedia URL is encountered, the tool will dispatch the URL to the `wikipedia_extractor` for processing (see the dispatch sketch after this list).
* The `wikipedia_extractor` will then return the processed data to the calling tool or directly store it in the cache.
8. **Tests (Extreme Programming Approach):**
* Develop comprehensive unit and integration tests for each component of the `wikipedia_extractor` crate, following an Extreme Programming (XP) "test-first" approach.
* Tests will cover fetching, parsing, extraction, and caching functionalities (a test-first sketch closes the list of examples below).
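
**Illustrative Sketches (non-normative):** The snippets below illustrate the steps above under stated assumptions; all names, fields, and endpoints are illustrative, not final. First, a minimal data model for article content and RDF facts, assuming `serde` for (de)serialization:

```rust
use serde::{Deserialize, Serialize};

/// Hypothetical shape of an extracted Wikipedia article; field names are
/// illustrative and subject to change.
#[derive(Debug, Serialize, Deserialize)]
pub struct WikipediaArticle {
    pub title: String,
    /// Revision ID used for cache versioning (see the cache sketch below).
    pub revision_id: u64,
    pub text: String,
    pub internal_links: Vec<String>,
    pub external_links: Vec<String>,
    pub facts: Vec<WikidataFact>,
}

/// A single RDF-style statement from Wikidata, kept as plain strings for a
/// first pass; a dedicated RDF crate could replace this later.
#[derive(Debug, Serialize, Deserialize)]
pub struct WikidataFact {
    pub entity_id: String,   // e.g. "Q42"
    pub property_id: String, // e.g. "P31"
    pub value: String,
}
```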
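
Article fetching could go through the MediaWiki API; this sketch assumes the `reqwest` crate (with its `blocking` and `json` features) and `serde_json`, and the error handling and lack of title percent-encoding are deliberate simplifications:

```rust
/// Fetch the plain-text extract of an article by title (sketch only).
fn fetch_article_text(title: &str) -> Result<String, Box<dyn std::error::Error>> {
    let url = format!(
        "https://en.wikipedia.org/w/api.php?action=query&prop=extracts&explaintext=1&format=json&titles={}",
        title
    );
    let body: serde_json::Value = reqwest::blocking::get(url)?.json()?;
    // The API keys results by page ID, so take the first (and only) page object.
    let text = body["query"]["pages"]
        .as_object()
        .and_then(|pages| pages.values().next())
        .and_then(|page| page["extract"].as_str())
        .unwrap_or_default()
        .to_string();
    Ok(text)
}
```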
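
For Wikidata, one option is the public `Special:EntityData` JSON export (the article's entity ID can be looked up via the MediaWiki API's `prop=pageprops`). The flattening into the `WikidataFact` struct from the data-model sketch is a deliberately naive first pass:

```rust
/// Fetch the JSON export of a Wikidata entity (assumes `reqwest` + `serde_json`).
fn fetch_entity_json(entity_id: &str) -> Result<serde_json::Value, Box<dyn std::error::Error>> {
    let url = format!("https://www.wikidata.org/wiki/Special:EntityData/{entity_id}.json");
    Ok(reqwest::blocking::get(url)?.json()?)
}

/// Flatten the entity's claims into simple subject/property/value facts.
fn facts_from_entity(entity_id: &str, entity: &serde_json::Value) -> Vec<WikidataFact> {
    let mut facts = Vec::new();
    if let Some(claims) = entity["entities"][entity_id]["claims"].as_object() {
        for (property_id, statements) in claims {
            for statement in statements.as_array().into_iter().flatten() {
                facts.push(WikidataFact {
                    entity_id: entity_id.to_string(),
                    property_id: property_id.clone(),
                    // Keep the raw JSON value as a string for now.
                    value: statement["mainsnak"]["datavalue"]["value"].to_string(),
                });
            }
        }
    }
    facts
}
```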
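
Link extraction from rendered article HTML, assuming the `scraper` crate; treating every href that starts with `/wiki/` as an internal link is a simplification:

```rust
use scraper::{Html, Selector};

/// Split anchors in article HTML into internal and external links (sketch).
fn extract_links(html: &str) -> (Vec<String>, Vec<String>) {
    let document = Html::parse_document(html);
    let anchors = Selector::parse("a[href]").expect("static selector is valid");
    let (mut internal, mut external) = (Vec::new(), Vec::new());
    for element in document.select(&anchors) {
        if let Some(href) = element.value().attr("href") {
            if href.starts_with("/wiki/") {
                internal.push(href.to_string());
            } else if href.starts_with("http") {
                external.push(href.to_string());
            }
        }
    }
    (internal, external)
}
```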
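
One possible cache layout keys entries by title and revision ID, e.g. `cache/<title>/<revision_id>/article.json`; the path scheme and file name are assumptions rather than an agreed format:

```rust
use std::{fs, path::{Path, PathBuf}};

/// Derive the on-disk location for one article revision (sketch).
fn cache_path(root: &Path, article: &WikipediaArticle) -> PathBuf {
    root.join(article.title.replace(' ', "_"))
        .join(article.revision_id.to_string())
        .join("article.json")
}

/// Serialize the article to JSON and write it into the versioned cache.
fn write_to_cache(root: &Path, article: &WikipediaArticle) -> Result<(), Box<dyn std::error::Error>> {
    let path = cache_path(root, article);
    fs::create_dir_all(path.parent().expect("cache path always has a parent"))?;
    fs::write(&path, serde_json::to_vec_pretty(article)?)?;
    Ok(())
}
```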
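
Dispatch from the URL tool could hinge on a simple host check; `is_wikipedia_url`, `handle_url`, and the commented-out `extract` entry points are hypothetical names, and the sketch assumes the `url` crate:

```rust
/// Decide whether a URL should be routed to `wikipedia_extractor`
/// instead of the generic URL tool (sketch).
fn is_wikipedia_url(raw: &str) -> bool {
    url::Url::parse(raw)
        .ok()
        .and_then(|u| {
            u.host_str()
                .map(|h| h == "wikipedia.org" || h.ends_with(".wikipedia.org"))
        })
        .unwrap_or(false)
}

/// Hypothetical dispatch point inside the URL tool.
fn handle_url(raw: &str) {
    if is_wikipedia_url(raw) {
        // Hand off to the wikipedia_extractor crate (entry point name is assumed).
        // wikipedia_extractor::extract_and_cache(raw);
    } else {
        // Fall back to the existing generic extraction path.
        // url_extractor::extract(raw);
    }
}
```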
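
In the test-first spirit of item 8, expected behaviour can be pinned down before the implementation exists; this example exercises the `is_wikipedia_url` helper from the dispatch sketch above:

```rust
#[cfg(test)]
mod tests {
    use super::is_wikipedia_url;

    #[test]
    fn routes_wikipedia_urls_to_the_extractor() {
        assert!(is_wikipedia_url(
            "https://en.wikipedia.org/wiki/Rust_(programming_language)"
        ));
        assert!(!is_wikipedia_url("https://example.com/wiki/Rust"));
    }
}
```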

**Justification/Impact:**
This CRQ will significantly enhance our ability to gather and integrate structured knowledge from Wikipedia and Wikidata into our project's knowledge base. By automating this process, we can enrich our contextual understanding of various topics, improve data analysis capabilities, and support more intelligent agent behaviors. The structured caching with versioning will ensure data integrity and efficient access.