
add sharding metrics #16606

Merged: 1 commit merged into matrixorigin:main on Jun 4, 2024
Conversation

zhangxu19830126 (Contributor) commented on Jun 4, 2024

User description

What type of PR is this?

  • API-change
  • BUG
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) this PR fixes:

issue #16438

What this PR does / why we need it:

add sharding metrics


PR Type

Enhancement, Tests


Description

  • Added Prometheus metrics for various shard operations including AddReplica, DeleteReplica, and DeleteAllReplica.
  • Integrated metrics into shard service operations to track replica counts and read operations.
  • Created a new Grafana dashboard to visualize sharding metrics including replica counts, operations, and reads.
  • Defined new Prometheus counters and gauges for tracking sharding metrics.

Changes walkthrough 📝

Relevant files

Enhancement

  • runtime.go (pkg/shardservice/runtime.go): Add metrics counters for shard operations. (+7/-0)
      • Added counters for AddReplica and DeleteReplica operations.
      • Imported the new metrics package.
  • service.go (pkg/shardservice/service.go): Integrate metrics into shard service operations. (+16/-0)
      • Added metrics for replica count and operations in the doTask and doHeartbeat functions.
      • Added a method to count replicas in allocatedCache.
  • service_read.go (pkg/shardservice/service_read.go): Add metrics for local and remote reads. (+8/-0)
      • Added local and remote read counters in the Read function.
  • grafana_dashboard.go (pkg/util/metric/v2/dashboard/grafana_dashboard.go): Initialize the sharding dashboard in Grafana. (+3/-0)
      • Added initialization of the sharding dashboard.
  • grafana_dashboard_sharding.go (pkg/util/metric/v2/dashboard/grafana_dashboard_sharding.go): Create a Grafana dashboard for sharding metrics. (+80/-0)
      • Created a new sharding dashboard with graphs for replica count, operators, and reads.
  • sharding.go (pkg/util/metric/v2/sharding.go): Define Prometheus metrics for sharding. (+52/-0)
      • Defined Prometheus metrics for replica operations and reads.
      • Added a gauge for replica count.
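The walkthrough above describes where the new counters and gauge are bumped: the replica count gauge in the heartbeat path, the add/delete operator counters when scheduling operators are applied, and the local/remote read counters in the Read path. Below is a small, self-contained sketch of that pattern using the Prometheus Go client; the metric names mirror the pkg/util/metric/v2/sharding.go snippet quoted later on this page, but the wiring (package main, HTTP exposition, call sites) is illustrative and not the PR's actual code.

    // Illustrative sketch only: defines counters/gauge analogous to the PR's
    // mo_sharding metrics and shows where service code would bump them.
    package main

    import (
        "net/http"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    var (
        replicaOperatorCounter = prometheus.NewCounterVec(
            prometheus.CounterOpts{
                Namespace: "mo",
                Subsystem: "sharding",
                Name:      "schedule_replica_total",
                Help:      "Total number of replica schedule.",
            }, []string{"type"})
        addReplicaCounter    = replicaOperatorCounter.WithLabelValues("add")
        deleteReplicaCounter = replicaOperatorCounter.WithLabelValues("delete")

        replicaCountGauge = prometheus.NewGauge(
            prometheus.GaugeOpts{
                Namespace: "mo",
                Subsystem: "sharding",
                Name:      "replica_count",
                Help:      "Count of running replica.",
            })
    )

    func main() {
        // Collectors must be registered before they are scraped.
        prometheus.MustRegister(replicaOperatorCounter, replicaCountGauge)

        // Service code would call these at the corresponding points:
        // operator apply (add/delete) and heartbeat (replica count).
        addReplicaCounter.Inc()
        deleteReplicaCounter.Inc()
        replicaCountGauge.Set(3)

        // Expose /metrics so Prometheus (and the Grafana dashboard) can scrape it.
        http.Handle("/metrics", promhttp.Handler())
        _ = http.ListenAndServe(":2112", nil)
    }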

    💡 PR-Agent usage:
    Comment /help on the PR to get a list of all available PR-Agent tools and their descriptions

    @matrix-meow added the size/M label (denotes a PR that changes [100,499] lines) on Jun 4, 2024

    PR-Agent was enabled for this repository. To continue using it, please link your git user with your CodiumAI identity.

    PR Review 🔍

    ⏱️ Estimated effort to review [1-5]

    3, because the PR involves multiple changes across various files including metric implementation, service logic adjustments, and dashboard configurations. The changes are moderate in complexity, involving both backend logic and monitoring aspects.

    🧪 Relevant tests

    No

    ⚡ Possible issues

    Metric Initialization: The metrics are used directly in the service files, but the PR does not show these metrics being explicitly registered with Prometheus. If they are never registered, they will silently not be exported, which defeats the purpose of the change.

    Error Handling: In the new metric increment statements, there is no error handling or logging. If any of these operations fail or produce errors, it might be silent at runtime, which could complicate debugging and operational monitoring.
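    A common way to address the registration concern is to create the collectors through promauto, which registers them with the default registerer at definition time and panics immediately if registration fails. The following is a sketch of that alternative, not necessarily how the project's metric/v2 package actually registers its collectors:

        package v2

        import (
            "github.com/prometheus/client_golang/prometheus"
            "github.com/prometheus/client_golang/prometheus/promauto"
        )

        // promauto registers the vector on creation, so a forgotten
        // MustRegister call cannot silently drop the metric.
        var replicaReadCounter = promauto.NewCounterVec(
            prometheus.CounterOpts{
                Namespace: "mo",
                Subsystem: "sharding",
                Name:      "replica_read_total",
                Help:      "Total number of replica read.",
            }, []string{"type"})

        var (
            ReplicaLocalReadCounter  = replicaReadCounter.WithLabelValues("local")
            ReplicaRemoteReadCounter = replicaReadCounter.WithLabelValues("remote")
        )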

    🔒 Security concerns

    No


    PR Code Suggestions ✨

    Best practice
    Register the Prometheus metrics during the package initialization

    To ensure that the metrics are registered with Prometheus, add a function to register the
    counters and gauges, and call it during the initialization of the package.

    pkg/util/metric/v2/sharding.go [1-52]

     var (
         replicaOperatorCounter = prometheus.NewCounterVec(
             prometheus.CounterOpts{
                 Namespace: "mo",
                 Subsystem: "sharding",
                 Name:      "schedule_replica_total",
                 Help:      "Total number of replica schedule.",
             }, []string{"type"})
         AddReplicaOperatorCounter       = replicaOperatorCounter.WithLabelValues("add")
         DeleteReplicaOperatorCounter    = replicaOperatorCounter.WithLabelValues("delete")
         DeleteAllReplicaOperatorCounter = replicaOperatorCounter.WithLabelValues("delete-all")
         
         replicaReadCounter = prometheus.NewCounterVec(
             prometheus.CounterOpts{
                 Namespace: "mo",
                 Subsystem: "sharding",
                 Name:      "replica_read_total",
                 Help:      "Total number of replica read.",
             }, []string{"type"})
         ReplicaLocalReadCounter  = replicaReadCounter.WithLabelValues("local")
         ReplicaRemoteReadCounter = replicaReadCounter.WithLabelValues("remote")
     )
     
     var (
         ReplicaCountGauge = prometheus.NewGauge(
             prometheus.GaugeOpts{
                 Namespace: "mo",
                 Subsystem: "sharding",
                 Name:      "replica_count",
                 Help:      "Count of running replica.",
             })
     )
     
    +func init() {
    +    prometheus.MustRegister(replicaOperatorCounter, replicaReadCounter, ReplicaCountGauge)
    +}
    +
    Suggestion importance[1-10]: 9

    Why: Registering Prometheus metrics during package initialization is a best practice to ensure that metrics are available as soon as the package is used, enhancing the reliability of metric reporting.

    Possible issue
    Add a default case to handle unexpected operation types in the switch statement

    To handle unexpected or unknown operation types in op.Type, it would be prudent to add a default case to the switch statement.

    pkg/shardservice/runtime.go [556-561]

     switch op.Type {
     case pb.OpType_AddReplica:
         v2.AddReplicaOperatorCounter.Inc()
     case pb.OpType_DeleteReplica:
         v2.DeleteReplicaOperatorCounter.Inc()
    +default:
    +    c.logger.Warn("unexpected operation type", zap.String("type", string(op.Type)))
     }
     
    Suggestion importance[1-10]: 8

    Why: Adding a default case to handle unexpected operation types improves robustness and observability, ensuring unknown types are logged rather than silently ignored.

    Add a mutex lock to ensure thread-safe access to the s.values map in the replicasCount method

    To ensure that the replicasCount method is thread-safe, consider using a mutex to protect
    access to the s.values map, as it could be modified concurrently.

    pkg/shardservice/service.go [752-758]

     func (s *allocatedCache) replicasCount() int {
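    +    // assumes allocatedCache guards its values map with a mu sync.Mutex field; RLock would suffice if it is an RWMutex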
    +    s.mu.Lock()
    +    defer s.mu.Unlock()
         n := 0
         for _, v := range s.values {
             n += len(v.Replicas)
         }
         return n
     }
     
    Suggestion importance[1-10]: 8

    Why: Ensuring thread safety in concurrent environments is critical, especially when accessing shared resources like maps. The suggestion correctly identifies a potential race condition.

    Maintainability
    Extract the local and remote replica counting logic into a separate function for better readability

    To improve readability and maintainability, consider extracting the local and remote
    replica counting logic into a separate function.

    pkg/shardservice/service_read.go [81-112]

    -local := 0
    -remote := 0
    -for i, shard := range selected.values {
    -    if s.isLocalReplica(shard.Replicas[0]) {
    -        selected.local = append(selected.local, i)
    -        local++
    -        continue
    -    }
    -
    -    remote++
    -    if opts.adjust != nil {
    -        opts.adjust(&shard)
    -    }
    -}
    +localIdx, remote := countReplicas(selected.values, s.isLocalReplica, opts.adjust)
    +selected.local = append(selected.local, localIdx...)
    +local := len(localIdx)
     
     v2.ReplicaLocalReadCounter.Add(float64(local))
     v2.ReplicaRemoteReadCounter.Add(float64(remote))
     
    +// countReplicas splits the selected shards into local and remote: it returns
    +// the indexes of shards whose first replica is local and the number of
    +// remote shards, invoking the optional adjust callback on each remote shard.
    +func countReplicas(values []Shard, isLocal func(Replica) bool, adjust func(*Shard)) ([]int, int) {
    +    var localIdx []int
    +    remote := 0
    +    for i := range values {
    +        if isLocal(values[i].Replicas[0]) {
    +            localIdx = append(localIdx, i)
    +            continue
    +        }
    +        remote++
    +        if adjust != nil {
    +            adjust(&values[i])
    +        }
    +    }
    +    return localIdx, remote
    +}
    +
    Suggestion importance[1-10]: 7

    Why: Extracting complex logic into separate functions improves code readability and maintainability, making it easier to manage and understand the codebase.


    @mergify bot merged commit 897293a into matrixorigin:main on Jun 4, 2024
    16 of 18 checks passed
    XuPeng-SH pushed a commit to XuPeng-SH/matrixone that referenced this pull request Jun 4, 2024
    * GC needs to consume all the mo_snapshot tables (matrixorigin#16539)
    
    Each tenant of the current mo has a mo_snapshot table to store snapshot information. GC needs to consume all mo_snapshot tables.
    
    Approved by: @XuPeng-SH
    
    * append log for upgrader and sqlExecutor (matrixorigin#16575)
    
    append log for upgrader and sqlExecutor
    
    Approved by: @daviszhen, @badboynt1, @zhangxu19830126, @m-schen
    
    * [enhancement] proxy: filter CNs that are not in working state. (matrixorigin#16558)
    
    1. filter CNs that are not in working state.
    2. add some logs for migration
    
    Approved by: @zhangxu19830126
    
    * fix lock service ut (matrixorigin#16517)
    
    fix lock service ut
    
    Approved by: @zhangxu19830126
    
    * Add cost of GC Check (matrixorigin#16470)
    
    To avoid List() operations on oss, tke or s3, you need to add the Cost interface.
    
    Approved by: @reusee, @XuPeng-SH
    
    * optimize explain info for tp/ap query (matrixorigin#16578)
    
    optimize explain info for tp/ap query
    
    Approved by: @daviszhen, @ouyuanning, @aunjgr
    
    * Bvt disable trace (matrixorigin#16581)
    
    aim to exclude the `system,system_metrics` part of the case.
    changes:
    1. move the `system,system_metrics` part of `cases/table/system_table_cases` into an individual case file.
    
    Approved by: @heni02
    
    * remove log print from automaxprocs (matrixorigin#16546)
    
    remove log print from automaxprocs
    
    Approved by: @triump2020, @m-schen, @ouyuanning, @aunjgr, @zhangxu19830126
    
    * rmTag15901 (matrixorigin#16585)
    
    rm 15901
    
    Approved by: @heni02
    
    * remove some MustStrCol&MustBytesCol (matrixorigin#16361)
    
    Remove some unnecessary MustStrCol, MustBytesCol calls.
    
    Approved by: @daviszhen, @reusee, @m-schen, @aunjgr, @XuPeng-SH
    
    * add bvt tag (matrixorigin#16589)
    
    add bvt tag
    
    Approved by: @heni02, @aressu1985
    
    * fix a bug that causes a load performance regression (matrixorigin#16600)
    
    fix a bug that causes a load performance regression
    
    Approved by: @m-schen
    
    * add case for restore pub_sub (matrixorigin#16602)
    
    add case for restore pub_sub
    
    Approved by: @heni02
    
    * add shard service kernel (matrixorigin#16565)
    
    Add shardservice kernel.
    
    Approved by: @reusee, @m-schen, @daviszhen, @XuPeng-SH, @volgariver6, @badboynt1, @ouyuanning, @triump2020, @w-zr, @sukki37, @aunjgr, @fengttt
    
    * [BugFix]: Use L2DistanceSq instead of L2Distance during IndexScan (matrixorigin#16366)
    
    During `KNN Select` and `Mapping Entries to Centroids via CROSS_JOIN_L2`, we can make use of L2DistanceSq instead of L2Distance, as it avoids `Sqrt()`. We can see the improvement in QPS for SIFT128 from 90 to 100. However, for GIST960, the QPS did not change much.
    
    L2DistanceSq is suitable only when there is a comparison (ie ORDER BY), and when the absolute value (ie actual L2Distance) is not required.
    - In the case of `CROSS JOIN L2` we find the nearest centroid for the Entry using `L2DistanceSq`. `CROSS JOIN L2` is used in both INSERT and CREATE INDEX.
    - In the case of `KNN SELECT`, our query has ORDER BY L2_DISTANCE(...), which can make use of `L2DistanceSq` as the L2Distance value is not explicitly required.
    
    **NOTE:** L2DistanceSq is not suitable in Kmeans++ for Centroid Computation, as it will impact the centroids picked.
    
    Approved by: @heni02, @m-schen, @aunjgr, @badboynt1
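    As a quick aside on why this substitution is safe for ordering: sqrt is monotonic on non-negative values, so sorting by squared L2 distance yields exactly the same order as sorting by L2 distance while skipping the Sqrt call. A tiny illustrative Go snippet (not matrixone code):
    
        package main
    
        import (
            "fmt"
            "math"
            "sort"
        )
    
        // l2DistanceSq is the squared Euclidean distance; it skips the Sqrt call.
        func l2DistanceSq(a, b []float64) float64 {
            sum := 0.0
            for i := range a {
                d := a[i] - b[i]
                sum += d * d
            }
            return sum
        }
    
        func main() {
            query := []float64{0, 0}
            points := [][]float64{{3, 4}, {1, 1}, {0, 2}}
    
            bySq := append([][]float64(nil), points...)
            sort.Slice(bySq, func(i, j int) bool {
                return l2DistanceSq(query, bySq[i]) < l2DistanceSq(query, bySq[j])
            })
    
            byL2 := append([][]float64(nil), points...)
            sort.Slice(byL2, func(i, j int) bool {
                return math.Sqrt(l2DistanceSq(query, byL2[i])) < math.Sqrt(l2DistanceSq(query, byL2[j]))
            })
    
            // Both print [[1 1] [0 2] [3 4]]: the ordering is identical.
            fmt.Println(bySq)
            fmt.Println(byL2)
        }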
    
    * add sharding metrics (matrixorigin#16606)
    
    add sharding metrics
    
    Approved by: @aptend
    
    * fix data race (matrixorigin#16608)
    
    fix data race
    
    Approved by: @reusee
    
    * Refactor reshape (matrixorigin#15879)
    
    Reshape objects block by block.
    
    Approved by: @XuPeng-SH
    
    * refactor system variables to support account isolation (matrixorigin#16551)
    
    - system variables are now account isolated
    - table `mo_mysql_compatibility_mode` only saves the delta between an account's and the cluster's default system variable values
    - always use session variables, except for `show global variables`
    
    Approved by: @daviszhen, @aunjgr, @aressu1985
    
    * fix merge
    
    * [cherry-pick-16594] : fix moc3399 (matrixorigin#16611)
    
    When truncating a table, if the table does not have any auto-increment column, there is no need to call the Reset interface of increment_service
    
    Approved by: @ouyuanning
    
    * bump go to 1.22.3, fix make compose and optimize ut script (matrixorigin#16604)
    
    1. bump go version from 1.21.5 to 1.22.3
    2. fix `make compose` to make it work
    3. `make ut` reads the `UT_WORKDIR` env variable to decide where to store its report; it falls back to `$HOME` if `UT_WORKDIR` is empty
    
    Approved by: @zhangxu19830126, @sukki37
    
    * remove isMerge from build operator (matrixorigin#16622)
    
    remove isMerge from build operator
    
    Approved by: @m-schen
    
    ---------
    
    Co-authored-by: GreatRiver <2552853833@qq.com>
    Co-authored-by: qingxinhome <70939751+qingxinhome@users.noreply.github.com>
    Co-authored-by: LiuBo <g.user.lb@gmail.com>
    Co-authored-by: iamlinjunhong <49111204+iamlinjunhong@users.noreply.github.com>
    Co-authored-by: nitao <badboynt@126.com>
    Co-authored-by: Jackson <xzxiong@yeah.net>
    Co-authored-by: Ariznawlll <ariznawl@163.com>
    Co-authored-by: Wei Ziran <weiziran125@gmail.com>
    Co-authored-by: YANGGMM <www.yangzhao123@gmail.com>
    Co-authored-by: fagongzi <zhangxu19830126@gmail.com>
    Co-authored-by: Arjun Sunil Kumar <arjunsk@users.noreply.github.com>
    Co-authored-by: Kai Cao <ck89119@users.noreply.github.com>
    Co-authored-by: Jensen <jensenojs@qq.com>
    Co-authored-by: brown <endeavorjia@gmail.com>