
karmada-controller-manager: panic: runtime error: index out of range #3854

Closed
tedli opened this issue Jul 28, 2023 · 36 comments · Fixed by #4145
Labels
help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
kind/bug: Categorizes issue or PR as related to a bug.
priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
Comments

@tedli
Contributor

tedli commented Jul 28, 2023

What happened:

karmada-controller-manager crashed, caused by panic: runtime error: index out of range

What you expected to happen:

The index should be checked so that the controller does not panic and crash.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

The index may be -1.

// AddToResourceSummary add resource node into modeling summary
func (rs *ResourceSummary) AddToResourceSummary(crn ClusterResourceNode) {
	index := rs.getIndex(crn)
	modeling := &(*rs)[index]
	if rs.GetNodeNumFromModel(modeling) <= 5 {
		root := modeling.linkedlist
		if root == nil {

func (rs *ResourceSummary) getIndex(crn ClusterResourceNode) int {
	index := math.MaxInt
	for i, m := range defaultModelSorting {
		tmpIndex := searchLastLessElement(modelSortings[i], crn.resourceList[m])
		if tmpIndex < index {
			index = tmpIndex
		}
	}
	return index
}

Environment:

  • Karmada version: v1.6.1
  • kubectl-karmada or karmadactl version (the result of kubectl-karmada version or karmadactl version):
  • Others:
@tedli tedli added the kind/bug Categorizes issue or PR as related to a bug. label Jul 28, 2023
@liangyuanpeng
Contributor

liangyuanpeng commented Jul 28, 2023

Is there any information on how to reproduce it, and on the environment of the Kubernetes cluster and the Karmada cluster?

@RainbowMango
Member

The searchLastLessElement might return -1:

func searchLastLessElement(nums []resource.Quantity, target resource.Quantity) int {
	low, high := 0, len(nums)-1
	for low <= high {
		mid := low + ((high - low) >> 1)
		diff1 := nums[mid].Cmp(target)
		var diff2 int
		if mid != len(nums)-1 {
			diff2 = nums[mid+1].Cmp(target)
		}
		// diff < 1 means nums[mid] <= target
		// diff == 1 means nums[mid+1] > target
		if diff1 < 1 {
			if (mid == len(nums)-1) || (diff2 == 1) {
				// find the last less element that equal to target element
				return mid
			}
			low = mid + 1
		} else {
			high = mid - 1
		}
	}
	return -1
}
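
For illustration, here is a minimal sketch (not from the original report) of the inputs that make searchLastLessElement return -1, using the function above together with made-up boundary values. In Karmada the slice holds the per-grade minimums and normally starts at 0.

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// Hypothetical boundaries; not the values Karmada builds from the resource models.
	boundaries := []resource.Quantity{
		resource.MustParse("1"),
		resource.MustParse("2"),
		resource.MustParse("4"),
	}

	// Target below every boundary: no "last less or equal" element exists,
	// so the loop exhausts and the final return -1 is hit.
	fmt.Println(searchLastLessElement(boundaries, resource.MustParse("500m"))) // -1

	// A nil or empty slice skips the loop entirely and also returns -1.
	fmt.Println(searchLastLessElement(nil, resource.MustParse("1"))) // -1
}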

@tedli Do you know how to reproduce it?

@tedli
Contributor Author

tedli commented Jul 29, 2023

Hi @RainbowMango @liangyuanpeng

My case, as I mentioned, was caused by lines 164 and 165:

// AddToResourceSummary add resource node into modeling summary
func (rs *ResourceSummary) AddToResourceSummary(crn ClusterResourceNode) {
	index := rs.getIndex(crn)
	modeling := &(*rs)[index]
	if rs.GetNodeNumFromModel(modeling) <= 5 {
		root := modeling.linkedlist
		if root == nil {

I didn't dig deep enough to find out how to reproduce it; I just made a quick fix that adds an index check to avoid the panic.

@RainbowMango RainbowMango added this to the v1.7 milestone Jul 29, 2023
@RainbowMango RainbowMango added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jul 29, 2023
@RainbowMango
Member

/help

Let's call for help to figure out the root cause.

I'm not sure if the author (@halfrost) has time to take a look :)

@karmada-bot
Collaborator

@RainbowMango:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help

Let's call for help to figure out the root cause.

I'm not sure if the author (@halfrost) has time to take a look :)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@karmada-bot karmada-bot added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Jul 29, 2023
@halfrost
Contributor

halfrost commented Jul 29, 2023

I can fix this issue. An index of -1 means that no matching resource model could be found. Let me write some test code to verify it.

@halfrost
Contributor

@tedli Hi, could you provide some parameters to initialize resource modeling? I want to reproduce this issue on my local machine.

@tedli
Contributor Author

tedli commented Jul 31, 2023

Hi @halfrost ,

Sorry for the late reply; here is my Cluster CR.


apiVersion: cluster.karmada.io/v1alpha1
kind: Cluster
metadata:
  creationTimestamp: "2023-07-19T07:33:03Z"
  finalizers:
  - karmada.io/cluster-controller
  generation: 3
  labels:
    k8s.qihoo.net/cluster-type: public
    karmada-cluster.k8s.qihoo.net/managed: "true"
  name: pub-shbt
  resourceVersion: "4113183"
  uid: 49cbf86d-8d65-4f03-bbad-bf78d53b6aa3
spec:
  apiEndpoint: https://pub-shbt.k8s.qihoo.net
  id: 447a3118-39a1-4bbd-bd41-539c3071a2e2
  impersonatorSecretRef:
    name: pub-shbt-impersonator
    namespace: karmada-cluster
  resourceModels:
  - grade: 0
    ranges:
    - max: "1"
      min: "0"
      name: cpu
    - max: 4Gi
      min: "0"
      name: memory
  - grade: 1
    ranges:
    - max: "2"
      min: "1"
      name: cpu
    - max: 16Gi
      min: 4Gi
      name: memory
  - grade: 2
    ranges:
    - max: "4"
      min: "2"
      name: cpu
    - max: 32Gi
      min: 16Gi
      name: memory
  - grade: 3
    ranges:
    - max: "8"
      min: "4"
      name: cpu
    - max: 64Gi
      min: 32Gi
      name: memory
  - grade: 4
    ranges:
    - max: "16"
      min: "8"
      name: cpu
    - max: 128Gi
      min: 64Gi
      name: memory
  - grade: 5
    ranges:
    - max: "32"
      min: "16"
      name: cpu
    - max: 256Gi
      min: 128Gi
      name: memory
  - grade: 6
    ranges:
    - max: "64"
      min: "32"
      name: cpu
    - max: 512Gi
      min: 256Gi
      name: memory
  - grade: 7
    ranges:
    - max: "128"
      min: "64"
      name: cpu
    - max: 1Ti
      min: 512Gi
      name: memory
  - grade: 8
    ranges:
    - max: "9223372036854775807"
      min: "128"
      name: cpu
    - max: "9223372036854775807"
      min: 1Ti
      name: memory
  secretRef:
    name: pub-shbt
    namespace: karmada-cluster
  syncMode: Push
status:
  apiEnablements:
  - groupVersion: acme.cert-manager.io/v1
    resources:
    - kind: Challenge
      name: challenges
    - kind: Order
      name: orders
  - groupVersion: admissionregistration.k8s.io/v1
    resources:
    - kind: MutatingWebhookConfiguration
      name: mutatingwebhookconfigurations
    - kind: ValidatingWebhookConfiguration
      name: validatingwebhookconfigurations
  - groupVersion: apiextensions.k8s.io/v1
    resources:
    - kind: CustomResourceDefinition
      name: customresourcedefinitions
  - groupVersion: apiregistration.k8s.io/v1
    resources:
    - kind: APIService
      name: apiservices
  - groupVersion: apps/v1
    resources:
    - kind: ControllerRevision
      name: controllerrevisions
    - kind: DaemonSet
      name: daemonsets
    - kind: Deployment
      name: deployments
    - kind: ReplicaSet
      name: replicasets
    - kind: StatefulSet
      name: statefulsets
  - groupVersion: authentication.k8s.io/v1
    resources:
    - kind: TokenReview
      name: tokenreviews
  - groupVersion: authorization.k8s.io/v1
    resources:
    - kind: LocalSubjectAccessReview
      name: localsubjectaccessreviews
    - kind: SelfSubjectAccessReview
      name: selfsubjectaccessreviews
    - kind: SelfSubjectRulesReview
      name: selfsubjectrulesreviews
    - kind: SubjectAccessReview
      name: subjectaccessreviews
  - groupVersion: autoscaling/v1
    resources:
    - kind: HorizontalPodAutoscaler
      name: horizontalpodautoscalers
  - groupVersion: autoscaling/v2
    resources:
    - kind: HorizontalPodAutoscaler
      name: horizontalpodautoscalers
  - groupVersion: autoscaling/v2beta1
    resources:
    - kind: HorizontalPodAutoscaler
      name: horizontalpodautoscalers
  - groupVersion: autoscaling/v2beta2
    resources:
    - kind: HorizontalPodAutoscaler
      name: horizontalpodautoscalers
  - groupVersion: batch/v1
    resources:
    - kind: CronJob
      name: cronjobs
    - kind: Job
      name: jobs
  - groupVersion: batch/v1beta1
    resources:
    - kind: CronJob
      name: cronjobs
  - groupVersion: catalog.cattle.io/v1
    resources:
    - kind: App
      name: apps
    - kind: ClusterRepo
      name: clusterrepos
    - kind: Operation
      name: operations
  - groupVersion: cert-manager.io/v1
    resources:
    - kind: CertificateRequest
      name: certificaterequests
    - kind: Certificate
      name: certificates
    - kind: ClusterIssuer
      name: clusterissuers
    - kind: Issuer
      name: issuers
  - groupVersion: certificates.k8s.io/v1
    resources:
    - kind: CertificateSigningRequest
      name: certificatesigningrequests
  - groupVersion: cilium.io/v2
    resources:
    - kind: CiliumClusterwideNetworkPolicy
      name: ciliumclusterwidenetworkpolicies
    - kind: CiliumEndpoint
      name: ciliumendpoints
    - kind: CiliumExternalWorkload
      name: ciliumexternalworkloads
    - kind: CiliumIdentity
      name: ciliumidentities
    - kind: CiliumNetworkPolicy
      name: ciliumnetworkpolicies
    - kind: CiliumNode
      name: ciliumnodes
  - groupVersion: cluster.cattle.io/v3
    resources:
    - kind: ClusterAuthToken
      name: clusterauthtokens
    - kind: ClusterUserAttribute
      name: clusteruserattributes
  - groupVersion: config.gatekeeper.sh/v1alpha1
    resources:
    - kind: Config
      name: configs
  - groupVersion: config.koordinator.sh/v1alpha1
    resources:
    - kind: ClusterColocationProfile
      name: clustercolocationprofiles
  - groupVersion: constraints.gatekeeper.sh/v1alpha1
    resources:
    - kind: K8sAllowedRegistries
      name: k8sallowedregistries
    - kind: K8sDisallowedCaps
      name: k8sdisallowedcaps
    - kind: K8sDisallowedHostIpc
      name: k8sdisallowedhostipc
    - kind: K8sDisallowedHostNetwork
      name: k8sdisallowedhostnetwork
    - kind: K8sDisallowedHostpathVolumes
      name: k8sdisallowedhostpathvolumes
    - kind: K8sDisallowedHostPid
      name: k8sdisallowedhostpid
    - kind: K8sDisallowedPrivilegedContainer
      name: k8sdisallowedprivilegedcontainer
    - kind: K8sDisallowedPrivilegeEscalation
      name: k8sdisallowedprivilegeescalation
  - groupVersion: constraints.gatekeeper.sh/v1beta1
    resources:
    - kind: K8sAllowedRegistries
      name: k8sallowedregistries
    - kind: K8sDisallowedCaps
      name: k8sdisallowedcaps
    - kind: K8sDisallowedHostIpc
      name: k8sdisallowedhostipc
    - kind: K8sDisallowedHostNetwork
      name: k8sdisallowedhostnetwork
    - kind: K8sDisallowedHostpathVolumes
      name: k8sdisallowedhostpathvolumes
    - kind: K8sDisallowedHostPid
      name: k8sdisallowedhostpid
    - kind: K8sDisallowedPrivilegedContainer
      name: k8sdisallowedprivilegedcontainer
    - kind: K8sDisallowedPrivilegeEscalation
      name: k8sdisallowedprivilegeescalation
  - groupVersion: coordination.k8s.io/v1
    resources:
    - kind: Lease
      name: leases
  - groupVersion: csi.aliyun.com/v1alpha1
    resources:
    - kind: NodeLocalStorageInitConfig
      name: nodelocalstorageinitconfigs
    - kind: NodeLocalStorage
      name: nodelocalstorages
  - groupVersion: discovery.k8s.io/v1
    resources:
    - kind: EndpointSlice
      name: endpointslices
  - groupVersion: discovery.k8s.io/v1beta1
    resources:
    - kind: EndpointSlice
      name: endpointslices
  - groupVersion: events.k8s.io/v1
    resources:
    - kind: Event
      name: events
  - groupVersion: events.k8s.io/v1beta1
    resources:
    - kind: Event
      name: events
  - groupVersion: expansion.gatekeeper.sh/v1alpha1
    resources:
    - kind: ExpansionTemplate
      name: expansiontemplate
  - groupVersion: externaldata.gatekeeper.sh/v1alpha1
    resources:
    - kind: Provider
      name: providers
  - groupVersion: externaldata.gatekeeper.sh/v1beta1
    resources:
    - kind: Provider
      name: providers
  - groupVersion: flowcontrol.apiserver.k8s.io/v1beta1
    resources:
    - kind: FlowSchema
      name: flowschemas
    - kind: PriorityLevelConfiguration
      name: prioritylevelconfigurations
  - groupVersion: flowcontrol.apiserver.k8s.io/v1beta2
    resources:
    - kind: FlowSchema
      name: flowschemas
    - kind: PriorityLevelConfiguration
      name: prioritylevelconfigurations
  - groupVersion: kubeadm.io.qihoo.com/v1
    resources:
    - kind: Kubecluster
      name: kubeclusters
  - groupVersion: management.cattle.io/v3
    resources:
    - kind: APIService
      name: apiservices
    - kind: AuthConfig
      name: authconfigs
    - kind: ClusterRegistrationToken
      name: clusterregistrationtokens
    - kind: Cluster
      name: clusters
    - kind: Feature
      name: features
    - kind: GroupMember
      name: groupmembers
    - kind: Group
      name: groups
    - kind: Preference
      name: preferences
    - kind: Setting
      name: settings
    - kind: Token
      name: tokens
    - kind: UserAttribute
      name: userattributes
    - kind: User
      name: users
  - groupVersion: metrics.k8s.io/v1beta1
    resources:
    - kind: NodeMetrics
      name: nodes
    - kind: PodMetrics
      name: pods
  - groupVersion: monitoring.coreos.com/v1
    resources:
    - kind: Alertmanager
      name: alertmanagers
    - kind: PodMonitor
      name: podmonitors
    - kind: Prometheus
      name: prometheuses
    - kind: PrometheusRule
      name: prometheusrules
    - kind: ServiceMonitor
      name: servicemonitors
    - kind: ThanosRuler
      name: thanosrulers
  - groupVersion: mutations.gatekeeper.sh/v1
    resources:
    - kind: Assign
      name: assign
    - kind: AssignMetadata
      name: assignmetadata
    - kind: ModifySet
      name: modifyset
  - groupVersion: mutations.gatekeeper.sh/v1alpha1
    resources:
    - kind: Assign
      name: assign
    - kind: AssignMetadata
      name: assignmetadata
    - kind: ModifySet
      name: modifyset
  - groupVersion: mutations.gatekeeper.sh/v1beta1
    resources:
    - kind: Assign
      name: assign
    - kind: AssignMetadata
      name: assignmetadata
    - kind: ModifySet
      name: modifyset
  - groupVersion: network.hulk.qihoo.net/v1alpha1
    resources:
    - kind: LoadBalancer
      name: loadbalancers
  - groupVersion: networking.k8s.io/v1
    resources:
    - kind: IngressClass
      name: ingressclasses
    - kind: Ingress
      name: ingresses
    - kind: NetworkPolicy
      name: networkpolicies
  - groupVersion: node.k8s.io/v1
    resources:
    - kind: RuntimeClass
      name: runtimeclasses
  - groupVersion: node.k8s.io/v1beta1
    resources:
    - kind: RuntimeClass
      name: runtimeclasses
  - groupVersion: policy/v1
    resources:
    - kind: PodDisruptionBudget
      name: poddisruptionbudgets
  - groupVersion: policy/v1beta1
    resources:
    - kind: PodDisruptionBudget
      name: poddisruptionbudgets
    - kind: PodSecurityPolicy
      name: podsecuritypolicies
  - groupVersion: rbac.authorization.k8s.io/v1
    resources:
    - kind: ClusterRoleBinding
      name: clusterrolebindings
    - kind: ClusterRole
      name: clusterroles
    - kind: RoleBinding
      name: rolebindings
    - kind: Role
      name: roles
  - groupVersion: scheduling.k8s.io/v1
    resources:
    - kind: PriorityClass
      name: priorityclasses
  - groupVersion: scheduling.koordinator.sh/v1alpha1
    resources:
    - kind: Device
      name: devices
    - kind: PodMigrationJob
      name: podmigrationjobs
    - kind: Reservation
      name: reservations
  - groupVersion: scheduling.sigs.k8s.io/v1alpha1
    resources:
    - kind: ElasticQuota
      name: elasticquotas
    - kind: PodGroup
      name: podgroups
  - groupVersion: slo.koordinator.sh/v1alpha1
    resources:
    - kind: NodeMetric
      name: nodemetrics
    - kind: NodeSLO
      name: nodeslos
  - groupVersion: snapshot.storage.k8s.io/v1
    resources:
    - kind: VolumeSnapshotClass
      name: volumesnapshotclasses
    - kind: VolumeSnapshotContent
      name: volumesnapshotcontents
    - kind: VolumeSnapshot
      name: volumesnapshots
  - groupVersion: snapshot.storage.k8s.io/v1beta1
    resources:
    - kind: VolumeSnapshotClass
      name: volumesnapshotclasses
    - kind: VolumeSnapshotContent
      name: volumesnapshotcontents
    - kind: VolumeSnapshot
      name: volumesnapshots
  - groupVersion: status.gatekeeper.sh/v1beta1
    resources:
    - kind: ConstraintPodStatus
      name: constraintpodstatuses
    - kind: ConstraintTemplatePodStatus
      name: constrainttemplatepodstatuses
    - kind: MutatorPodStatus
      name: mutatorpodstatuses
  - groupVersion: storage.k8s.io/v1
    resources:
    - kind: CSIDriver
      name: csidrivers
    - kind: CSINode
      name: csinodes
    - kind: CSIStorageCapacity
      name: csistoragecapacities
    - kind: StorageClass
      name: storageclasses
    - kind: VolumeAttachment
      name: volumeattachments
  - groupVersion: storage.k8s.io/v1beta1
    resources:
    - kind: CSIStorageCapacity
      name: csistoragecapacities
  - groupVersion: templates.gatekeeper.sh/v1
    resources:
    - kind: ConstraintTemplate
      name: constrainttemplates
  - groupVersion: templates.gatekeeper.sh/v1alpha1
    resources:
    - kind: ConstraintTemplate
      name: constrainttemplates
  - groupVersion: templates.gatekeeper.sh/v1beta1
    resources:
    - kind: ConstraintTemplate
      name: constrainttemplates
  - groupVersion: topology.node.k8s.io/v1alpha1
    resources:
    - kind: NodeResourceTopology
      name: noderesourcetopologies
  - groupVersion: ui.cattle.io/v1
    resources:
    - kind: NavLink
      name: navlinks
  - groupVersion: v1
    resources:
    - kind: Binding
      name: bindings
    - kind: ComponentStatus
      name: componentstatuses
    - kind: ConfigMap
      name: configmaps
    - kind: Endpoints
      name: endpoints
    - kind: Event
      name: events
    - kind: LimitRange
      name: limitranges
    - kind: Namespace
      name: namespaces
    - kind: Node
      name: nodes
    - kind: PersistentVolumeClaim
      name: persistentvolumeclaims
    - kind: PersistentVolume
      name: persistentvolumes
    - kind: Pod
      name: pods
    - kind: PodTemplate
      name: podtemplates
    - kind: ReplicationController
      name: replicationcontrollers
    - kind: ResourceQuota
      name: resourcequotas
    - kind: Secret
      name: secrets
    - kind: ServiceAccount
      name: serviceaccounts
    - kind: Service
      name: services
  conditions:
  - lastTransitionTime: "2023-07-19T07:38:49Z"
    message: cluster is healthy and ready to accept workloads
    reason: ClusterReady
    status: "True"
    type: Ready
  kubernetesVersion: v1.24.13
  nodeSummary:
    readyNum: 211
    totalNum: 212
  resourceSummary:
    allocatable:
      cpu: "7560"
      ephemeral-storage: "502308137393688"
      hugepages-1Gi: "0"
      hugepages-2Mi: "0"
      koordinator.sh/gpu: "12800"
      koordinator.sh/gpu-core: "12800"
      koordinator.sh/gpu-memory: 1891602Mi
      koordinator.sh/gpu-memory-ratio: "12800"
      kubernetes.io/batch-cpu: "470529"
      kubernetes.io/batch-memory: "3540083593912"
      memory: 44085786916Ki
      netEgress: 218375Mi
      netIngress: 218375Mi
      nvidia.com/gpu: "136"
      pods: "12720"
    allocatableModelings:
    - count: 5
      grade: 0
    - count: 46
      grade: 1
    - count: 56
      grade: 2
    - count: 17
      grade: 3
    - count: 7
      grade: 4
    - count: 16
      grade: 5
    - count: 12
      grade: 6
    - count: 0
      grade: 7
    - count: 0
      grade: 8
    allocated:
      cpu: 5567273m
      kubernetes.io/batch-cpu: 529k
      kubernetes.io/batch-memory: "3439597715456"
      memory: "28402599938109"
      nvidia.com/gpu: "113"
      pods: "11054"
    allocating:
      cpu: 10m
      kubernetes.io/batch-cpu: "126720"
      kubernetes.io/batch-memory: "824633720832"
      pods: "25"

@halfrost
Contributor

halfrost commented Aug 2, 2023

I have received your case, @tedli. I'll fix this tonight. I'm so sorry, I've been busy with work these two days.

@RainbowMango
Member

No worries, take your time @halfrost. I appreciate that you can help with it.

@halfrost
Contributor

halfrost commented Aug 3, 2023

@RainbowMango I am definitely responsible for my code to the end; this is my responsibility. I am also very sorry for the bug in my code.

I kind of forgot the conclusion of our discussion of requirements. For the scenario above, if there is a request with cpu = 1, memory = 1024, which model should I return? Is it model 7? Both cpu and memory requests need to be satisfied, right?

@halfrost
Contributor

halfrost commented Aug 3, 2023

@tedli Hi, if possible, could you tell me what the input parameter of the AddToResourceSummary(crn ClusterResourceNode) function was when the panic occurred?

@tedli
Contributor Author

tedli commented Aug 3, 2023

Hi @halfrost ,

I don't know what the parameter was.

I can add some debug logging to dump the parameter, but I'm not sure when it will reproduce again.

@halfrost
Contributor

halfrost commented Aug 3, 2023

@tedli Returning -1 is a special case. Since model 8 goes to positive infinity, an index should always be found. If you want to fix the panic quickly, the solution is to add a check at line 165 of modeling.go:

	if index == -1 {
		klog.Info("ClusterResource can not add to resource summary: index is invalid.")
		return
	}

I will definitely need to add this code, but I am more curious about how to reproduce your panic; I want to find the root cause. Please help me by adding some debug logs to capture this parameter.
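
For context, here is a sketch of where that guard would sit inside AddToResourceSummary (my reading of the suggestion, not the exact upstream diff):

// AddToResourceSummary add resource node into modeling summary
func (rs *ResourceSummary) AddToResourceSummary(crn ClusterResourceNode) {
	index := rs.getIndex(crn)
	if index == -1 {
		// No matching resource model was found; skip instead of indexing with -1.
		klog.Info("ClusterResource can not add to resource summary: index is invalid.")
		return
	}
	modeling := &(*rs)[index]
	// ... rest of the function unchanged
}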

@RainbowMango
Member

I kind of forgot the conclusion of our discussion of requirements. For the scenario above, if there is a request with cpu = 1, memory = 1024, which model should I return? Is it model 7? Both cpu and memory requests need to be satisfied, right?

resourceModels:
  - grade: 0
    ranges:
    - max: "1"
      min: "0"
      name: cpu
    - max: 4Gi
      min: "0"
      name: memory
  - grade: 1
    ranges:
    - max: "2"
      min: "1"
      name: cpu
    - max: 16Gi
      min: 4Gi
      name: memory
  - grade: 2
    ranges:
    - max: "4"
      min: "2"
      name: cpu
    - max: 32Gi
      min: 16Gi
      name: memory

If a node has CPU=1, then even with memory larger than 32Gi, it still belongs to grade 2.

@halfrost
Contributor

halfrost commented Aug 9, 2023

@tedli Hi, any updates?

@tedli
Contributor Author

tedli commented Aug 9, 2023

Hi @halfrost,

[screenshot]

I did add the debug log, but nothing has been produced so far. I'm not sure how to reproduce this.

@halfrost
Contributor

@tedli Hi, any updates? Can you reproduce this panic again?

@RainbowMango
Member

Perhaps we can't rely on @tedli to reproduce this issue; it's not feasible to ask a user to spend time trying to reproduce it.

I remember we fixed a panic issue before (I'll try to find the PR). @tedli, which version of Karmada are you using?

@tedli
Contributor Author

tedli commented Aug 18, 2023

Hi @halfrost ,

Thanks for keeping an eye on this issue. Sadly, it hasn't reproduced again. I will add a Prometheus metric; when the issue reproduces again, the metric will alert me and I will let you know.

Hi @RainbowMango ,
I'm using 1.6.1.

@RainbowMango
Member

It is #3472; we fixed it in release-1.6.

@halfrost
Contributor

Perhaps we can't rely on @tedli to reproduce this issue; it's not feasible to ask a user to spend time trying to reproduce it.

I remember we fixed a panic issue before (I'll try to find the PR). @tedli, which version of Karmada are you using?

@RainbowMango So sorry about that. It's my fault. I'm trying to reproduce this problem again.

@RainbowMango
Member

I always appreciate your help; I'm definitely not blaming you. Just merged the defensive solution (#3950). So, no rush.

@RainbowMango
Member

Thanks for keeping an eye on this issue. Sadly, it hasn't reproduced again. I will add a Prometheus metric; when the issue reproduces again, the metric will alert me and I will let you know.

I remember you fixed it with defensive code, so this panic is unlikely to reproduce on your side, isn't it?

@tedli
Contributor Author

tedli commented Aug 18, 2023

Hi all,

I remember you fixed it with defensive code

Yes. What's more, I added logic to dump the arguments, to log them for reproducing the issue.
#3854 (comment)

I will post the dumped arguments if it appears again.

@tedli
Contributor Author

tedli commented Aug 21, 2023

UPDATE: This is another place that goes index out of range, and it's easier to debug than the previous one because it only relies on the Cluster CR. I unmarshaled the cluster to try to reproduce it locally, but it didn't reproduce.

[screenshot]

[screenshot]

Another place that caused an index out of range:

modelSortings = make([][]resource.Quantity, len(rsName))
for index := 0; index < len(rsList); index++ {
	for i, name := range rsName {
		modelSortings[i] = append(modelSortings[i], rsList[index][name])
	}
}

Line 91 is where the index goes out of range.

The capture shows line 106 because I added this:

[screenshot]

for dumping the args to reproduce the previous panic (but it didn't reproduce again).

@halfrost
Contributor

@tedli Got it, let me dig deeper.

@RainbowMango RainbowMango modified the milestones: v1.7, v1.8 Aug 25, 2023
@NickYadance

One of my talented colleagues solved this by adding a mutex lock before reading or writing modelSortings (and, of course, unlocking after the read or write). The issue has never happened again since the fix.

+	mu.Lock()
+	defer mu.Unlock()
	// generate a sorted array by first priority of ResourceName
	modelSortings = make([][]resource.Quantity, len(rsName))
	for index := 0; index < len(rsList); index++ {
		for i, name := range rsName {
			modelSortings[i] = append(modelSortings[i], rsList[index][name])
		}
	}
func (rs *ResourceSummary) getIndex(crn ClusterResourceNode) int {
	index := math.MaxInt
+	mu.Lock()
+	defer mu.Unlock()

In our panic cases, this function should never return -1 under normal conditions, because the first element of nums is 0, but it did. That's why we tried the mutex.

func searchLastLessElement(nums []resource.Quantity, target resource.Quantity) int {
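
For illustration, a sketch of that locking scheme (variable and function names are placeholders, not the actual Karmada code; assumes sync and k8s.io/apimachinery/pkg/api/resource are imported): a single sync.Mutex guards both the writer that rebuilds the shared table and the reader that searches it.

var (
	mu            sync.Mutex
	modelSortings [][]resource.Quantity
)

// rebuildSortings mirrors the initialization loop above; it replaces the
// shared table, so it holds the lock for the whole rebuild.
func rebuildSortings(rsList []map[string]resource.Quantity, rsName []string) {
	mu.Lock()
	defer mu.Unlock()
	modelSortings = make([][]resource.Quantity, len(rsName))
	for index := 0; index < len(rsList); index++ {
		for i, name := range rsName {
			modelSortings[i] = append(modelSortings[i], rsList[index][name])
		}
	}
}

// lookup mirrors getIndex's read of the table; taking the same lock means it
// can never observe a half-rebuilt slice.
func lookup(i int, target resource.Quantity) int {
	mu.Lock()
	defer mu.Unlock()
	return searchLastLessElement(modelSortings[i], target)
}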

@halfrost
Contributor

@NickYadance

Your modification certainly makes sense. First, regarding the initialization of modelSortings, I believe this function only runs once during initialization, and scenarios where this function is called concurrently don't exist. Hence, I didn't add a lock().

The getIndex function looks like it needs a lock. I've been so focused on ensuring there are no bugs in my algorithm logic that I overlooked the fact that this piece of code could be called concurrently in a multi-threaded environment. I tested in a local environment with only 2 machines and didn't encounter any concurrency issues.

Please extend my gratitude to your colleague. I'm truly grateful for his fix.

@NickYadance

Hope this helps, though I failed to find the root cause of the concurrency. It just works...

@jwcesign
Member

Hi, @halfrost @NickYadance

I reviewed the code and found that the function AddToResourceSummary will be called when there is a change in the status of the clusters.

The default concurrency for handling cluster status is 5:

flags.IntVar(&o.ConcurrentClusterSyncs, "concurrent-cluster-syncs", 5, "The number of Clusters that are allowed to sync concurrently.")

Based on #3854 (comment), the root cause is concurrent reads and writes of the global variable modelSortings.

And I also checked the code here:
https://github.com/karmada-io/karmada/blob/5c77f4516eb9c27c0ce38ed25a389ab9e56d8122/pkg/controllers/status/cluster_status_controller.go#L595C1-L620C1

It looks like modelSortings can be changed to a local variable; then this problem would not exist. What do you think? @halfrost

@halfrost
Contributor

halfrost commented Oct 3, 2023

@jwcesign Hi, I'm back and so sorry for the late reply. I had Covid for the last two weeks and felt terrible, but I have recovered. I have 2 questions about your comment:

  1. The number 5: how did you come up with it? By lots of concurrency testing?
  2. For modelSortings: it cannot be a local variable, because this state needs to be kept up to date in real time. In the binary search below, we need to search the latest state. I prefer the above approach of adding a lock(). After I added the lock locally, there was indeed no panic.

Regarding the concurrency investigation of function AddToResourceSummary, what you actually mean is that the getAllocatableModelings() function will be called concurrently in the controller, right? AddToResourceSummary will be called in the function getAllocatableModelings(). I just tested it locally and I can indeed see a lot of calls to the AddToResourceSummary function. I think this function caused the panic. I plan to add lock() every time I update the resource. Do you agree with me about this solution?

I haven't submitted the PR yet because I'm afraid the lock() change would be too broad and affect performance. In all update operations, resources are only added, never deleted.

@jwcesign
Member

jwcesign commented Oct 7, 2023

Hi, @halfrost

  1. By reviewing the code, getAllocatableModelings is called from func (c *ClusterStatusController) Reconcile(ctx context.Context, req controllerruntime.Request) (controllerruntime.Result, error), and the concurrency of calling this function is 5 by default.
  2. Function AddToResourceSummary initializes defaultModelSorting and modelSortings every time it's called and then uses them, and these are only used in getAllocatableModelings. So I think defaultModelSorting and modelSortings can be local variables in the function getAllocatableModelings. If they are local variables in getAllocatableModelings, the values are still real-time.

Do I understand point 2 correctly? @halfrost

@chaosi-zju
Member

chaosi-zju commented Oct 7, 2023

Hi, @halfrost

I generally agree with what @jwcesign said:

defaultModelSorting and modelSortings can be local variables in the function getAllocatableModelings

Actually, I think defaultModelSorting and modelSortings can be member variables of ResourceSummary; that way, we avoid the concurrency issues caused by global variables and avoid the use of locks.
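
A rough sketch of that shape (assuming ResourceSummary becomes a struct that carries its own table; names are illustrative, not the final code in #4145). defaultModelSorting is left as a package variable here, matching the note below that changing it is harder:

// Each ResourceSummary owns its sorting table, so concurrent reconciles of
// different clusters no longer race on a package-level slice.
type ResourceSummary struct {
	// per-grade resource models omitted for brevity
	modelSortings [][]resource.Quantity
}

func (rs *ResourceSummary) getIndex(crn ClusterResourceNode) int {
	index := math.MaxInt
	for i, m := range defaultModelSorting {
		// read the per-summary table instead of the global one
		tmpIndex := searchLastLessElement(rs.modelSortings[i], crn.resourceList[m])
		if tmpIndex < index {
			index = tmpIndex
		}
	}
	return index
}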

@halfrost
Contributor

@jwcesign Thanks for the advice, let me test it this weekend. I will find out the final solution.

@chaosi-zju OK, makes sense. Let me revise this part of the code. Thanks.

@halfrost
Contributor

@jwcesign @chaosi-zju

I have already converted modelSortings into a member variable of ResourceSummary. Modifying defaultModelSorting is more challenging because it's used as a global variable within the clusterResourceNodeComparator comparator. Additionally, this comparator is used in other comparison functions.

🙏 Please help me review PR #4145.
