
Conversation

Contributor

@galiacheng galiacheng commented Dec 28, 2021

Description

This PR is a workaround for the pod eviction issue on WLS 14c; see the discussion in this Slack thread.

The WLS 14c admin pod becomes evicted after 2-3 days with an error like: Container weblogic-server was using 3300040Ki, which exceeds its request of 1610612736.

The admin pod's memory usage suddenly exceeded its request; see the monitoring screenshot below.
(screenshot: memory-usage)
I am working with Anil to find out the root cause.

Per our observation, applying resources.limits resolves the issue, so we decided to apply the setting by default.
Since the issue does not happen on 12c, we agreed to apply the workaround to 14c only; otherwise it would slow down the deployment.
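
The essence of the workaround is to make the pods' resource requests equal to their limits so that Kubernetes assigns them the Guaranteed QoS class. A rough sketch of the idea (the namespace, domain name, and sizes below are only illustrative, not the values used by the script):

# Hypothetical example: patch the WebLogic Kubernetes Operator Domain resource so that
# requests == limits (Guaranteed QoS); names and sizes are placeholders.
kubectl -n sample-domain1-ns patch domain sample-domain1 --type=merge -p '{
  "spec": {
    "serverPod": {
      "resources": {
        "requests": {"cpu": "500m", "memory": "1.5Gi"},
        "limits":   {"cpu": "500m", "memory": "1.5Gi"}
      }
    }
  }
}'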

Test

Test on WLS14c: https://github.com/galiacheng/weblogic-azure/actions/runs/1628854471
Test on WLS 12c: (screenshot attached)

Signed-off-by: galiacheng <haixia.cheng@microsoft.com>

 Changes to be committed:
	new file:   weblogic-azure-aks/src/main/arm/scripts/applyGuaranteedQos.sh
	modified:   weblogic-azure-aks/src/main/bicep/mainTemplate.bicep
	new file:   weblogic-azure-aks/src/main/bicep/modules/_deployment-scripts/_ds-apply-guaranteed-qos.bicep

Fix script

update

update

update

update introspectorJobActiveDeadlineSeconds

debug

update script

remove debug code

fix script

fix timestamp

fix domain uid

fix interval

create global const for JVM args.

Signed-off-by: galiacheng <haixia.cheng@microsoft.com>

 Changes to be committed:
	modified:   weblogic-azure-aks/src/main/arm/scripts/common.sh
	modified:   weblogic-azure-aks/src/main/arm/scripts/genDomainConfig.sh
Signed-off-by: galiacheng <haixia.cheng@microsoft.com>

 Changes to be committed:
	modified:   weblogic-azure-aks/pom.xml
@galiacheng galiacheng marked this pull request as draft December 28, 2021 03:18
@galiacheng galiacheng marked this pull request as ready for review December 28, 2021 07:58
version="${version##*\#version\#\:}" # match #version#:, this is a special mark for the version output, please do not change it.
echo_stdout ${version}

if [ "${version#*WebLogic Server 14.1.1.0}" != "$version" ]; then


I have not tried to fully understand this script so I could be wrong, but even so, I have two comments:

  • To future-proof, assuming the problem could exist in all future 14.x versions, should the version check be a "greater than or equal"? I think you can leverage sort --version-sort for this (see the sketch after this list).
  • Would the check be better off running before deploying the domain resource? E.g. a kubectl run equivalent of docker run SOME_IMAGE_NAME sh -c 'source $ORACLE_HOME/wlserver/server/bin/setWLSEnv.sh > /dev/null 2>&1 && java weblogic.version'.
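
Roughly something like this for the first bullet (variable names are just illustrative):

# Hypothetical check: treat any version >= 14.0.0.0 as needing the workaround.
minVersion="14.0.0.0"
if [ "$(printf '%s\n' "${minVersion}" "${version}" | sort --version-sort | head -n1)" = "${minVersion}" ]; then
  echo "version ${version} is >= ${minVersion}, applying the workaround"
fi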

Contributor Author

@galiacheng galiacheng Dec 30, 2021


Thanks for your review, @tbarnes-us. Wish you a Merry Christmas, though it's a late greeting.

For #1, you are right that we should consider future versions, but I would like to apply the workaround only to 14.1.1.0 for now, since we cannot be sure the issue will happen in future versions, and we may have a different configuration to resolve it once we find the root cause. Also cc @edburns.

For #2, we could check the version before deploying the domain resource. However, considering that post-deployment is used for data source setup, we separate the basic cluster setup from the other configuration that is applied via a configmap and requires a restart. The current workflow is:

  • Deploy the basic cluster without a data source.
  • [Optional, if the customer selects a data source] Deploy the configmap for the data source and patch the cluster; this causes a restart.
  • Apply resource limits to WLS 14c.

Applying resource limits makes the domain deployment much slower (from 2 min/pod to 8 min/pod), so we apply the setting only after there is no further restart (rolling update) process.

Getting the version using docker run is a great idea, as it avoids executing the script with kubectl exec.
We would have to run the command on a machine with a Docker daemon; I will check and add another commit.


For #2 I was thinking of kubectl run. This implicitly launches a temporary pod in the k8s cluster, and so has the advantage of using the same docker image cache as the k8s cluster. One issue is that it may need a pull secret - not sure how that'd be done.
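
Roughly (the image name, namespace, and pull-secret handling below are placeholders, just to illustrate the idea):

# Hypothetical: launch a throwaway pod from the same WLS image and print the version.
kubectl run wls-version-check --rm -i --restart=Never \
  --namespace sample-domain1-ns \
  --image=SOME_IMAGE_NAME \
  --overrides='{"spec":{"imagePullSecrets":[{"name":"regcred"}]}}' \
  --command -- sh -c 'source $ORACLE_HOME/wlserver/server/bin/setWLSEnv.sh > /dev/null 2>&1 && java weblogic.version'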


@tbarnes-us tbarnes-us Dec 30, 2021


For #2:

while I would like to only apply the workaround to 14.1.1.0 now

That's fine by me - as long as one of us remembers in the case that we still need the work-around once 14.1.2 comes out.

Contributor Author


For #2:

while I would like to only apply the workaround to 14.1.1.0 now

That's fine by me - as long as one of us remembers in the case that we still need the work-around once 14.1.2 comes out.

We can create an issue to track that; let's discuss with Ed when he is back next week.

Contributor Author


For #2 I was thinking of kubectl run. This implicitly launches a temporary pod in the k8s cluster, and so has the advantage of using the same docker image cache as the k8s cluster. One issue is that it may need a pull secret - not sure how that'd be done.

The domain is already running when the shell script applies the resource limits, so I enhanced the script to use java weblogic.version; please see 5523347.
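
Roughly along these lines, exec-ing into one of the domain's running pods (the pod and namespace names below are placeholders; the actual change is in 5523347):

# Hypothetical: query the version from a pod of the already-running domain.
wlsVersion=$(kubectl -n sample-domain1-ns exec sample-domain1-admin-server -- \
  bash -c 'source $ORACLE_HOME/wlserver/server/bin/setWLSEnv.sh > /dev/null 2>&1 && java weblogic.version')
echo "WebLogic Server version: ${wlsVersion}"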

@galiacheng galiacheng changed the title from "Apply resources limits to wls 14 to solve the evicted pod issue." to "Apply resources limits to wls 14.1.1.0 to solve the evicted pod issue." Dec 31, 2021
Signed-off-by: galiacheng <haixia.cheng@microsoft.com>
Signed-off-by: galiacheng <haixia.cheng@microsoft.com>
Signed-off-by: galiacheng <haixia.cheng@microsoft.com>
@galiacheng
Contributor Author

version=${stringArray[2]}
echo_stdout "WebLogic Server version: ${version}"

if [ "${version#*14.1.1.0}" != "$version" ]; then
Contributor


Please include this comment:

# Temporary workaround for https://github.com/oracle/weblogic-kubernetes-operator/issues/2693

resources:
  requests:
    cpu: "${cpu}m"
    memory: "${memoryRequest}"
Contributor


As far as I can tell, we are not modifying the memoryRequest from its previous value, retrieved on line 64, though we are modifying the cpu value. Is that what we want? I thought the essence of the workaround was to increase the cpu and memory requests and limits?

Contributor Author


Hello @edburns, per my observations 1.5Gi is enough, so we will not increase the memory in this workaround.
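
So the idea is to keep the current memory request and mirror the requests in the limits so the pod gets the Guaranteed QoS class. A sketch only (the variable names follow the snippet above; whether the limits reuse exactly these variables in the script is not shown here):

resources:
  requests:
    cpu: "${cpu}m"
    memory: "${memoryRequest}"
  limits:
    cpu: "${cpu}m"
    memory: "${memoryRequest}"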

export constClusterName='cluster-1'
export constClusterT3AddressEnvName="T3_TUNNELING_CLUSTER_ADDRESS"
export constDefaultJavaOptions="-Dlog4j2.formatMsgNoLookups=true -Dweblogic.StdoutDebugEnabled=false" # the java options will be applied to the cluster
export constDefaultJVMArgs="-Djava.security.egd=file:/dev/./urandom -XX:MinRAMPercentage=25.0 -XX:MaxRAMPercentage=50.0 " # the JVM options will be applied to the cluster
Contributor


Are we sure we want to do this for all wls versions? The other script only applies to 14c, but this change applies unconditionally. Is that what we want?

Contributor Author


Yes, you are right; we should specify -Xms256m -Xmx512m for the lower versions. Thanks for catching that.
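
A rough sketch of what that could look like (the wlsVersion variable and the version test below are illustrative, not the exact change):

# Hypothetical: RAM-percentage flags only for 14.1.1.0, fixed heap sizes for older versions.
if [ "${wlsVersion#*14.1.1.0}" != "${wlsVersion}" ]; then
  export constDefaultJVMArgs="-Djava.security.egd=file:/dev/./urandom -XX:MinRAMPercentage=25.0 -XX:MaxRAMPercentage=50.0 "
else
  export constDefaultJVMArgs="-Djava.security.egd=file:/dev/./urandom -Xms256m -Xmx512m "
fi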

}

/*
* Apply resource limits to WebLogic Server 14.1.1.0.
Contributor


Include this in the comment.

* Temporary workaround for https://github.com/oracle/weblogic-kubernetes-operator/issues/2693

Contributor

@edburns edburns left a comment


Please review my comments before merging. I assume you will consider them and apply them if necessary.

Signed-off-by: galiacheng <haixia.cheng@microsoft.com>
…56m -Xmx512m.

Signed-off-by: galiacheng <haixia.cheng@microsoft.com>
Signed-off-by: galiacheng <haixia.cheng@microsoft.com>

 Changes to be committed:
	modified:   weblogic-azure-aks/src/main/arm/scripts/applyGuaranteedQos.sh
	modified:   weblogic-azure-aks/src/main/bicep/mainTemplate.bicep
	modified:   weblogic-azure-aks/src/main/bicep/modules/_deployment-scripts/_ds-apply-guaranteed-qos.bicep
@galiacheng
Contributor Author

galiacheng commented Jan 7, 2022

Hello @edburns @tbarnes-us, there is a bug: querying the version from the admin pod fails because there is no admin pod for the slim image. However, we can be sure there is at least one managed server pod, since the UI definition requires at least 1 replica. I fixed it by running the command in one of the managed server pods; see 0082b26.
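
Roughly (the label selector and names below are placeholders; the real change is in 0082b26):

# Hypothetical: pick any managed (non-admin) server pod of the domain and query the version there.
podName=$(kubectl -n sample-domain1-ns get pods \
  -l 'weblogic.domainUID=sample-domain1,weblogic.serverName!=admin-server' \
  -o jsonpath='{.items[0].metadata.name}')
kubectl -n sample-domain1-ns exec "${podName}" -- \
  bash -c 'source $ORACLE_HOME/wlserver/server/bin/setWLSEnv.sh > /dev/null 2>&1 && java weblogic.version'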


@mriccell mriccell merged commit 8742285 into oracle:main Jan 8, 2022
@galiacheng galiacheng deleted the fix-wls14c-evicted branch January 24, 2022 02:10