GpuStorageNode Phase 2: type support, incoming sets, bulk ops #3
Closed
plankatron wants to merge 1 commit into main from
Conversation
Previously, all nodes were stored as SCHEMA_NODE and all links as LIST_LINK. The atom Type was not preserved in GPU pools, causing same-name-different-type atoms to collide (e.g., ConceptNode "foo" and SchemaNode "foo" shared a single GPU slot).

Type storage:
- Added `uint16_t` type column to word pool (nodes)
- Repurposed `pair_flags` field for link type storage
- Mixed Type into hash key via golden-ratio hashing to prevent same-name collisions across different atom types
- Updated all store/fetch/load paths to use stored types

Incoming set scan (`fetchIncomingByType`):
- New `gpu-incoming.cl` OpenCL kernel: parallel scan of pair pool
- New `cuda_incoming_scan` CUDA kernel: same algorithm
- One thread per pair slot, atomic counter for matches
- Type filtering in `fetchIncomingByType` reconstructs only matching link types

Bulk operations (`loadAtomSpace`, `loadType`, `storeAtomSpace`):
- `loadAtomSpace` iterates all occupied word/pair pool slots
- `loadType` filters by stored type column
- `storeAtomSpace` delegates to per-atom `storeAtom`

New tests:
- BasicSaveUTest: adapted from RocksDB test, basic round-trip
- GpuIncomingUTest: incoming set scan, type filtering, `loadType`
- GpuBulkUTest: multi-type nodes/links, collision safety, bulk store/load, incoming at scale, throughput showcase

Known limitations:
- GPU pair pool keyed by `(min(a,b), max(a,b))`: one slot per node pair. Multiple link types between the same two nodes share a slot (last write wins).
- `storeAtom` is per-atom (individual host→GPU transfers), so bulk store at large scale (>1K atoms) is slow (~35K nodes/sec). GpuBulkUTest uses reduced scale to pass within the CI timeout. The load path is fast (~700K atoms/sec) since it reads GPU pools in bulk.

16/16 tests pass (excluding the slow compartment kernel benchmark).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Closing: this builds on the wrong foundation. The WordPool/PairPool/SectionPool design is specific to the language-learning domain, not a general GPU AtomSpace. Need to start from the actual atomspace/ and atoms/base/ source code per Linas' guidance.
## Summary
- Added `uint16_t` type column to word pool; repurposed `pair_flags` for link type. Golden-ratio type hashing prevents same-name-different-type collisions (ConceptNode "foo" vs SchemaNode "foo" now get separate GPU slots).
- New `gpu-incoming.cl` (OpenCL) and `cuda_incoming_scan` (CUDA) kernels: parallel scan of pair pool with atomic match counter. `fetchIncomingByType` filters by link type.
- `loadAtomSpace`, `loadType`, `storeAtomSpace` iterate GPU pool slots directly.
## Known limitations

- GPU pair pool keyed by `(min(a,b), max(a,b))`: one slot per node pair. Multiple link types between the same two nodes share a slot (last write wins). Tests use distinct node pairs per link type. See #2 ("Design exploration: Full GPU AtomSpace — unified atom table, N-ary links, performance envelope") for design exploration on fixing this.
- `storeAtom` does per-atom host→GPU transfers (~35K nodes/sec). GpuBulkUTest uses reduced scale to pass within the CI timeout. The load path is fast (~700K atoms/sec) since it reads GPU pools in bulk.
## Test results

16/16 pass (12 kernel tests + 4 StorageNode tests). The `test-compartment-kernel` benchmark is excluded from the default timeout; it runs a full GPU learning simulation (~30 min).
## Files changed (13)

- `opencog/gpu/CMakeLists.txt`
- `opencog/gpu/gpu-incoming.cl`
- `opencog/persist/gpu/GpuBackend.h`
- `opencog/persist/gpu/CudaBackend.h` + `.cu`
- `opencog/persist/gpu/OpenCLBackend.h` + `.cc`
- `opencog/persist/gpu/GpuStorageNode.h`
- `opencog/persist/gpu/GpuStorageNode.cc`
- `tests/persist/gpu/CMakeLists.txt`
- `tests/persist/gpu/BasicSaveUTest.cxxtest`
- `tests/persist/gpu/GpuIncomingUTest.cxxtest`
- `tests/persist/gpu/GpuBulkUTest.cxxtest`

## Test plan
🤖 Generated with Claude Code