Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add shard service kernel #16565

Merged
merged 49 commits into from
Jun 3, 2024

Conversation

zhangxu19830126
Copy link
Contributor

@zhangxu19830126 zhangxu19830126 commented May 31, 2024

User description

What type of PR is this?

  • API-change
  • BUG
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) this PR fixes:

issue #16438

What this PR does / why we need it:

Add shardservice kernel.


PR Type

Feature, Tests


Description

  • Added ShardServiceAddress field to CNService and TNService structs and implemented related methods.
  • Implemented shard service with functionalities for creating, deleting, and managing shards.
  • Added unit tests for shard service operations including create, delete, and heartbeat.
  • Implemented tests for shard allocation and balancing across multiple CNs.

Changes walkthrough 📝

Relevant files
Enhancement
metadata.pb.go
Add ShardServiceAddress field and related methods               

pkg/pb/metadata/metadata.pb.go

  • Added ShardServiceAddress field to CNService and TNService structs.
  • Implemented GetShardServiceAddress method for CNService and TNService.
  • Updated MarshalToSizedBuffer, Size, and Unmarshal methods to handle
    ShardServiceAddress.
  • +150/-47
    Tests
    runtime_test.go
    Add tests for shard service runtime operations                     

    pkg/shardservice/runtime_test.go

  • Added tests for shard service runtime operations.
  • Implemented tests for add, delete, allocate, and heartbeat
    functionalities.
  • +652/-0 
    service_test.go
    Add unit tests for shard service operations                           

    pkg/shardservice/service_test.go

  • Added unit tests for shard service create and delete operations.
  • Implemented tests for asynchronous shard updates and deletions.
  • Verified shard allocation and balancing across multiple CNs.
  • +724/-0 
    scheduler_balance_test.go
    Add tests for shard balancing scheduler                                   

    pkg/shardservice/scheduler_balance_test.go

  • Added tests for shard balancing scheduler.
  • Implemented test cases for balancing shards across multiple CNs.
  • +301/-0 
    Feature
    service.go
    Implement shard service with core functionalities               

    pkg/shardservice/service.go

  • Implemented shard service with create, delete, and heartbeat
    functionalities.
  • Added methods for handling shard allocation and deletion.
  • Integrated remote communication for shard operations.
  • +868/-0 

    💡 PR-Agent usage:
    Comment /help on the PR to get a list of all available PR-Agent tools and their descriptions

    Copy link

    PR-Agent was enabled for this repository. To continue using it, please link your git user with your CodiumAI identity here.

    PR Review 🔍

    ⏱️ Estimated effort to review [1-5]

    4, because the PR introduces a significant amount of new functionality related to shard service handling, including the addition of new methods, extensive modifications to existing files, and the introduction of new test cases. The complexity of the changes, especially those involving transaction and shard management, requires a thorough review to ensure correctness, performance, and maintainability.

    🧪 Relevant tests

    Yes

    ⚡ Possible issues

    Possible Bug: The handling of error scenarios and edge cases in shard allocation and balancing might not be fully covered by the tests. It's crucial to ensure that all possible states and transitions are tested, especially under failure conditions.

    Performance Concern: The shard balancing and allocation logic could introduce performance bottlenecks, especially in large-scale deployments. It's important to benchmark these scenarios and consider optimizations.

    🔒 Security concerns

    No

    Copy link

    PR-Agent was enabled for this repository. To continue using it, please link your git user with your CodiumAI identity here.

    PR Code Suggestions ✨

    CategorySuggestion                                                                                                                                    Score
    Possible issue
    Add validation checks in the Unmarshal methods to ensure data integrity

    Add validation checks in the Unmarshal methods of the structs to ensure that the data
    being unmarshaled is valid and within expected ranges.

    pkg/pb/shard/shard.pb.go [226-227]

     func (m *ShardsMetadata) XXX_Unmarshal(b []byte) error {
    -    return m.Unmarshal(b)
    +    if err := m.Unmarshal(b); err != nil {
    +        return err
    +    }
    +    if m.ShardsCount == 0 {
    +        return fmt.Errorf("ShardsCount cannot be zero")
    +    }
    +    return nil
     }
     
    Suggestion importance[1-10]: 9

    Why: Adding validation checks in the Unmarshal methods is critical to ensure that the data being processed is within expected ranges, thus preventing potential data corruption or unexpected behavior.

    9
    Add a length check before accessing elements in a slice to prevent potential out-of-bounds errors

    In the TestAddWithNewVersion function, the require.Equal assertions for
    r.cns["cn1"].incompleteOps should also check the length of the slice before accessing its
    elements to avoid potential out-of-bounds errors.

    pkg/shardservice/runtime_test.go [83-85]

     require.Equal(t, 2, len(r.cns["cn1"].incompleteOps))
    -require.Equal(t, pb.Operator{Type: pb.OpType_DeleteReplica, TableShard: s1, Replica: r1}, r.cns["cn1"].incompleteOps[0])
    -require.Equal(t, pb.Operator{Type: pb.OpType_DeleteReplica, TableShard: s2, Replica: r2}, r.cns["cn1"].incompleteOps[1])
    +if len(r.cns["cn1"].incompleteOps) >= 2 {
    +    require.Equal(t, pb.Operator{Type: pb.OpType_DeleteReplica, TableShard: s1, Replica: r1}, r.cns["cn1"].incompleteOps[0])
    +    require.Equal(t, pb.Operator{Type: pb.OpType_DeleteReplica, TableShard: s2, Replica: r2}, r.cns["cn1"].incompleteOps[1])
    +}
     
    Suggestion importance[1-10]: 8

    Why: The suggestion correctly identifies a potential out-of-bounds error in accessing slice elements without a prior length check, which is crucial for preventing runtime panics in Go.

    8
    Add a check to ensure ShardServiceAddress is not empty before returning it

    Consider adding a check to ensure that ShardServiceAddress is not an empty string before
    returning it in the GetShardServiceAddress method. This will help avoid potential issues
    where an empty address might be used inadvertently.

    pkg/pb/metadata/metadata.pb.go [639-643]

    -if m != nil {
    +if m != nil && m.ShardServiceAddress != "" {
         return m.ShardServiceAddress
     }
     return ""
     
    Suggestion importance[1-10]: 7

    Why: Adding a check for an empty string can prevent potential runtime issues and is a good practice for robustness.

    7
    Best practice
    Add a default case to the String methods for enums to handle unexpected values

    Consider adding a default case to the String methods for the enums Policy, OpType,
    CNState, ReplicaState, and Method to handle unexpected values gracefully.

    pkg/pb/shard/shard.pb.go [63-65]

     func (x Policy) String() string {
    -	return proto.EnumName(Policy_name, int32(x))
    +    if name, ok := Policy_name[int32(x)]; ok {
    +        return name
    +    }
    +    return fmt.Sprintf("Policy(%d)", x)
     }
     
    Suggestion importance[1-10]: 8

    Why: Adding a default case to handle unexpected values in enum string methods improves robustness and error handling, which is crucial for maintaining data integrity.

    8
    Rename the GetShardServiceAddress method to ShardServiceAddress for consistency with Go conventions

    To improve consistency and readability, consider renaming the GetShardServiceAddress
    method to ShardServiceAddress to match Go's convention of omitting the Get prefix in
    getter methods.

    pkg/pb/metadata/metadata.pb.go [638-643]

    -func (m *CNService) GetShardServiceAddress() string {
    +func (m *CNService) ShardServiceAddress() string {
         if m != nil {
             return m.ShardServiceAddress
         }
         return ""
     }
     
    Suggestion importance[1-10]: 5

    Why: While aligning with Go conventions improves readability, the impact on functionality is minimal, hence a moderate score.

    5
    Test
    Add a unit test for the new method to ensure it returns the correct address

    Consider adding a unit test for the new GetShardServiceAddress method to ensure it returns
    the correct address when the ShardServiceAddress field is set and an empty string when it
    is not.

    pkg/pb/logservice/logservice.pb.go [586-592]

     func (m *CNStore) GetShardServiceAddress() string {
     	if m != nil {
     		return m.ShardServiceAddress
     	}
     	return ""
     }
     
    +// Unit test for GetShardServiceAddress
    +func TestGetShardServiceAddress(t *testing.T) {
    +    cnStore := &CNStore{ShardServiceAddress: "test_address"}
    +    if addr := cnStore.GetShardServiceAddress(); addr != "test_address" {
    +        t.Errorf("expected 'test_address', got %s", addr)
    +    }
    +
    +    cnStore = &CNStore{}
    +    if addr := cnStore.GetShardServiceAddress(); addr != "" {
    +        t.Errorf("expected empty string, got %s", addr)
    +    }
    +}
    +
    Suggestion importance[1-10]: 8

    Why: Adding unit tests for new methods is crucial for verifying their functionality and preventing future regressions.

    8
    Security
    Add validation and sanitization for the ShardServiceAddress field

    Ensure that the ShardServiceAddress field is properly validated and sanitized before being
    assigned or used. This can help prevent potential security issues or bugs related to
    invalid or malicious addresses.

    pkg/pb/metadata/metadata.pb.go [550]

    -ShardServiceAddress  string   `protobuf:"bytes,9,opt,name=ShardServiceAddress,proto3" json:"ShardServiceAddress,omitempty"`
    +ShardServiceAddress  string   `protobuf:"bytes,9,opt,name=ShardServiceAddress,proto3" json:"ShardServiceAddress,omitempty" validate:"required,url"`
     
    Suggestion importance[1-10]: 8

    Why: Ensuring data validation and sanitization is crucial for security, especially for fields that might be exposed to user input or external systems.

    8
    Performance
    Use sync.Pool for frequently used objects to reduce memory allocation overhead

    To improve performance, consider using sync.Pool for frequently used objects like
    ShardsMetadata and TableShard to reduce the overhead of memory allocation.

    pkg/pb/shard/shard.pb.go [220-222]

    -func (m *ShardsMetadata) Reset()         { *m = ShardsMetadata{} }
    +var shardsMetadataPool = sync.Pool{
    +    New: func() interface{} {
    +        return &ShardsMetadata{}
    +    },
    +}
    +
    +func (m *ShardsMetadata) Reset() {
    +    *m = ShardsMetadata{}
    +    shardsMetadataPool.Put(m)
    +}
     func (m *ShardsMetadata) String() string { return proto.CompactTextString(m) }
     func (*ShardsMetadata) ProtoMessage()    {}
     
    Suggestion importance[1-10]: 7

    Why: Using sync.Pool for objects like ShardsMetadata and TableShard can significantly improve performance by reducing garbage collection overhead, which is important in high-throughput environments.

    7
    Validation
    Add validation to the ShardServiceAddress field to ensure it contains a valid URL

    Ensure that the ShardServiceAddress field is properly validated when being set, to avoid
    potential issues with invalid addresses.

    pkg/pb/logservice/logservice.pb.go [456]

    -ShardServiceAddress  string                        `protobuf:"bytes,15,opt,name=ShardServiceAddress,proto3" json:"ShardServiceAddress,omitempty"`
    +ShardServiceAddress  string                        `protobuf:"bytes,15,opt,name=ShardServiceAddress,proto3" json:"ShardServiceAddress,omitempty" validate:"required,url"`
     
    Suggestion importance[1-10]: 7

    Why: Validating input data is important for data integrity and security, but the suggestion assumes URL format which might not be necessary for all use cases.

    7
    Robustness
    Add error handling for invalid or unreachable ShardServiceAddress to improve robustness

    Consider adding error handling for cases where ShardServiceAddress might be invalid or
    unreachable, to improve robustness.

    pkg/pb/logservice/logservice.pb.go [586-592]

    -func (m *CNStore) GetShardServiceAddress() string {
    -	if m != nil {
    -		return m.ShardServiceAddress
    +func (m *CNStore) GetShardServiceAddress() (string, error) {
    +	if m == nil {
    +		return "", errors.New("CNStore is nil")
     	}
    -	return ""
    +	if m.ShardServiceAddress == "" {
    +		return "", errors.New("ShardServiceAddress is empty")
    +	}
    +	return m.ShardServiceAddress, nil
     }
     
    Suggestion importance[1-10]: 7

    Why: Adding error handling enhances the robustness of the code by gracefully handling potential runtime issues.

    7
    Enhancement
    Use proto.RegisterEnum to register enums for better interoperability and debugging

    Consider using proto.RegisterEnum to register the enums Policy, OpType, CNState,
    ReplicaState, and Method to improve interoperability and debugging.

    pkg/pb/shard/shard.pb.go [51-60]

    +func init() {
    +    proto.RegisterEnum("shard.Policy", Policy_name, Policy_value)
    +    proto.RegisterEnum("shard.OpType", OpType_name, OpType_value)
    +    proto.RegisterEnum("shard.CNState", CNState_name, CNState_value)
    +    proto.RegisterEnum("shard.ReplicaState", ReplicaState_name, ReplicaState_value)
    +    proto.RegisterEnum("shard.Method", Method_name, Method_value)
    +}
    +
     var Policy_name = map[int32]string{
         0: "None",
         1: "Partition",
         2: "Hash",
     }
     
    Suggestion importance[1-10]: 6

    Why: Registering enums using proto.RegisterEnum enhances interoperability and debugging capabilities, which is beneficial for development and maintenance but not as critical as runtime performance or data integrity issues.

    6
    Maintainability
    Refactor repeated methods into a shared interface or embedded struct for maintainability

    To improve maintainability, consider refactoring the repeated GetShardServiceAddress
    methods into a single method in a shared interface or embedded struct.

    pkg/pb/logservice/logservice.pb.go [586-592]

    +type ShardServiceAddressGetter interface {
    +    GetShardServiceAddress() string
    +}
    +
    +func getShardServiceAddress(m ShardServiceAddressGetter) string {
    +    if m != nil {
    +        return m.GetShardServiceAddress()
    +    }
    +    return ""
    +}
    +
     func (m *CNStore) GetShardServiceAddress() string {
    -	if m != nil {
    -		return m.ShardServiceAddress
    -	}
    -	return ""
    +    return getShardServiceAddress(m)
     }
     
     func (m *TNStore) GetShardServiceAddress() string {
    -	if m != nil {
    -		return m.ShardServiceAddress
    -	}
    -	return ""
    +    return getShardServiceAddress(m)
     }
     
     func (m *CNStoreHeartbeat) GetShardServiceAddress() string {
    -	if m != nil {
    -		return m.ShardServiceAddress
    -	}
    -	return ""
    +    return getShardServiceAddress(m)
     }
     
     func (m *TNStoreHeartbeat) GetShardServiceAddress() string {
    -	if m != nil {
    -		return m.ShardServiceAddress
    -	}
    -	return ""
    +    return getShardServiceAddress(m)
     }
     
    Suggestion importance[1-10]: 6

    Why: Refactoring to reduce code duplication is good for maintainability, but the impact on the existing architecture and other dependencies should be carefully considered.

    6
    Extract repeated logic into a helper function to reduce code duplication

    To enhance maintainability, consider extracting the repeated logic of checking if m is not
    nil and returning the address into a helper function. This will reduce code duplication
    and make future changes easier.

    pkg/pb/metadata/metadata.pb.go [639-643]

    -if m != nil {
    -    return m.ShardServiceAddress
    +func getAddress(m *CNService, address string) string {
    +    if m != nil {
    +        return address
    +    }
    +    return ""
     }
    -return ""
     
    +func (m *CNService) ShardServiceAddress() string {
    +    return getAddress(m, m.ShardServiceAddress)
    +}
    +
    Suggestion importance[1-10]: 6

    Why: Extracting repeated logic into a helper function enhances maintainability and reduces duplication, but the impact is moderate as it affects only a small part of the codebase.

    6

    @mergify mergify bot merged commit 8789a8b into matrixorigin:main Jun 3, 2024
    16 of 18 checks passed
    XuPeng-SH pushed a commit to XuPeng-SH/matrixone that referenced this pull request Jun 4, 2024
    * GC needs to consume all the mo_snapshot tables (matrixorigin#16539)
    
    Each tenant of the current mo has a mo_snapshot table to store snapshot information. GC needs to consume all mo_snapshot tables.
    
    Approved by: @XuPeng-SH
    
    * append log for upgrade and sqlExecutoer (matrixorigin#16575)
    
    append log for upgrader and sqlExecutor
    
    Approved by: @daviszhen, @badboynt1, @zhangxu19830126, @m-schen
    
    * [enhancement] proxy: filter CNs that are not in working state. (matrixorigin#16558)
    
    1. filter CNs that are not in working state.
    2. add some logs for migration
    
    Approved by: @zhangxu19830126
    
    * fix lock service ut (matrixorigin#16517)
    
    fix lock service ut
    
    Approved by: @zhangxu19830126
    
    * Add cost of GC Check (matrixorigin#16470)
    
    To avoid List() operations on oss, tke or s3, you need to add the Cost interface.
    
    Approved by: @reusee, @XuPeng-SH
    
    * optimize explain info for tp/ap query (matrixorigin#16578)
    
    optimize explain info for tp/ap query
    
    Approved by: @daviszhen, @ouyuanning, @aunjgr
    
    * Bvt disable trace (matrixorigin#16581)
    
    aim to exclude the `system,system_metrics` part case.
    changes:
    1. move `cases/table/system_table_cases` system,system_metrics part into individule case file.
    
    Approved by: @heni02
    
    * remove log print from automaxprocs (matrixorigin#16546)
    
    remove log print from automaxprocs
    
    Approved by: @triump2020, @m-schen, @ouyuanning, @aunjgr, @zhangxu19830126
    
    * rmTag15901 (matrixorigin#16585)
    
    rm 15901
    
    Approved by: @heni02
    
    * remove some MustStrCol&MustBytesCol (matrixorigin#16361)
    
    Remove some unnecessary MustStrCol, MustBytesCol calls.
    
    Approved by: @daviszhen, @reusee, @m-schen, @aunjgr, @XuPeng-SH
    
    * add bvt tag (matrixorigin#16589)
    
    add bvt tag
    
    Approved by: @heni02, @aressu1985
    
    * fix a bug that cause load performance regression issue (matrixorigin#16600)
    
    fix a bug that cause load performance regression issue
    
    Approved by: @m-schen
    
    * add case for restore pub_sub (matrixorigin#16602)
    
    add case for restore pub_sub
    
    Approved by: @heni02
    
    * add shard service kernel (matrixorigin#16565)
    
    Add shardservice kernel.
    
    Approved by: @reusee, @m-schen, @daviszhen, @XuPeng-SH, @volgariver6, @badboynt1, @ouyuanning, @triump2020, @w-zr, @sukki37, @aunjgr, @fengttt
    
    * [BugFix]: Use L2DistanceSq instead of L2Distance during IndexScan (matrixorigin#16366)
    
    During `KNN Select` and `Mapping Entries to Centroids via CROSS_JOIN_L2`, we can make use of L2DistanceSq instead of L2Distance, as it avoids `Sqrt()`. We can see the improvement in QPS for SIFT128 from 90 to 100. However, for GIST960, the QPS did not change much.
    
    L2DistanceSq is suitable only when there is a comparison (ie ORDER BY), and when the absolute value (ie actual L2Distance) is not required.
    - In the case of `CROSS JOIN L2` we find the nearest centroid for the Entry using `L2DistanceSq`. `CROSS JOIN L2` is used in both INSERT and CREATE INDEX.
    - In the case of `KNN SELECT`, our query has ORDER BY L2_DISTANCE(...), which can make use of `L2DistanceSq` as the L2Distance value is not explicitly required.
    
    **NOTE:** L2DistanceSq is not suitable in Kmenas++ for Centroid Computation, as it will impact the centroids picked.
    
    Approved by: @heni02, @m-schen, @aunjgr, @badboynt1
    
    * add sharding metrics (matrixorigin#16606)
    
    add sharding metrics
    
    Approved by: @aptend
    
    * fix data race (matrixorigin#16608)
    
    fix data race
    
    Approved by: @reusee
    
    * Refactor reshape (matrixorigin#15879)
    
    Reshape objects block by block.
    
    Approved by: @XuPeng-SH
    
    * refactor system variables to support account isolation (matrixorigin#16551)
    
    - system variable now is account isolated
    - table `mo_mysql_compatibility_mode` only saves delta info between account's and cluster's default system variable values
    - always use session variable except `show global variables`
    
    Approved by: @daviszhen, @aunjgr, @aressu1985
    
    * fix merge
    
    * [cherry-pick-16594] : fix moc3399 (matrixorigin#16611)
    
    When truncate table, if the table does not have any auto-incr col, there is no need to call the Reset interface of increment_service
    
    Approved by: @ouyuanning
    
    * bump go to 1.22.3, fix make compose and optimize ut script (matrixorigin#16604)
    
    1. bump go version from 1.21.5 to 1.22.3
    2. fix `make compose` to make it work
    3. `make ut` will read `UT_WORKDIR` env variable to store report, it will be `$HOME` if `UT_WORKDIR` is empty
    
    Approved by: @zhangxu19830126, @sukki37
    
    * remove isMerge from build operator (matrixorigin#16622)
    
    remove isMerge from build operator
    
    Approved by: @m-schen
    
    ---------
    
    Co-authored-by: GreatRiver <2552853833@qq.com>
    Co-authored-by: qingxinhome <70939751+qingxinhome@users.noreply.github.com>
    Co-authored-by: LiuBo <g.user.lb@gmail.com>
    Co-authored-by: iamlinjunhong <49111204+iamlinjunhong@users.noreply.github.com>
    Co-authored-by: nitao <badboynt@126.com>
    Co-authored-by: Jackson <xzxiong@yeah.net>
    Co-authored-by: Ariznawlll <ariznawl@163.com>
    Co-authored-by: Wei Ziran <weiziran125@gmail.com>
    Co-authored-by: YANGGMM <www.yangzhao123@gmail.com>
    Co-authored-by: fagongzi <zhangxu19830126@gmail.com>
    Co-authored-by: Arjun Sunil Kumar <arjunsk@users.noreply.github.com>
    Co-authored-by: Kai Cao <ck89119@users.noreply.github.com>
    Co-authored-by: Jensen <jensenojs@qq.com>
    Co-authored-by: brown <endeavorjia@gmail.com>
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    Projects
    None yet
    Development

    Successfully merging this pull request may close these issues.

    None yet