[BUG] Engine continues to attempt to rebuild replica while detaching #6217
Comments
cc @longhorn/qa

I would prefer to include the fix in 1.5.0 if possible.
Pre Ready-For-Testing Checklist
Under the conditions described, the provided script fails in fewer than 10 iterations without the fix and should proceed indefinitely with the fix. IMPORTANT: the provided script only works with a Longhorn build that includes the work from #5845. The script below is similar, except that it monitors for signs of inappropriate expansion instead of for logs saying the expansion was prevented.

#!/bin/bash
current_time=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
counter=0
while true; do
    counter=$((counter+1))
    echo "Iteration: $counter"
    # Delete the instance-manager pod with a short grace period
    kubectl -n longhorn-system delete --grace-period=1 pod instance-manager-4639bda14281d41f3af00d64bc364bb9
    # Wait 300 seconds before checking the logs
    sleep 300
    # Search all instance-manager logs since the start of the run for expansion activity
    kubectl -n longhorn-system logs -l longhorn.io/component=instance-manager --tail=-1 --since-time="$current_time" | grep -i -e "expand"
    log_result_1=$?
    # Stop if grep found any trigger keywords
    if [[ $log_result_1 -eq 0 ]]; then
        echo "Execution stopped. Trigger keywords found in logs."
        break
    else
        echo "Execution completed successfully."
    fi
done

Failed output for this new script looks like:
Verified passed on master-head (longhorn-manager 57b4596) and v1.5.x-head (longhorn-manager a41e906) following the test steps. After running the scripts for more than 20 iterations, no unexpected expansion occurred.
Describe the bug
This is another cause of inappropriate replica expansion uncovered while implementing #5845. Maybe this is the mode of failure in #6078? I'll have to figure out a way to confirm.
When a volume is detached, the following things happen in order:
There is a window of at least 10 seconds between the time longhorn-manager starts a snapshot purge for a rebuild and the time the rebuild actually starts. During that window, the engine is not reconciled by the engine controller (it is already in the middle of a reconciliation). With the right timing, the rebuilding replica can be killed and a new replica (for a different volume) can take its place. The engine controller then continues with the rebuild and attempts to use the engine to rebuild the wrong replica, which can lead to inappropriate expansion.
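To make the race concrete, here is a minimal Go sketch, using entirely hypothetical names and types rather than Longhorn's actual code, of why the controller must revalidate which replica currently owns an address after the purge window instead of trusting the address it captured beforehand:

package main

import (
    "fmt"
    "time"
)

// Replica is a stand-in for Longhorn's replica record; only the fields
// needed to illustrate the race are included.
type Replica struct {
    Name, Volume, Address string
}

// replicaAtAddress mimics asking the datastore which replica currently
// owns an address. After the rebuilding replica is killed, the same
// address can be handed to a replica of a different volume.
func replicaAtAddress(addr string, replicas []Replica) *Replica {
    for i := range replicas {
        if replicas[i].Address == addr {
            return &replicas[i]
        }
    }
    return nil
}

func main() {
    // The replica the engine controller intends to rebuild.
    target := Replica{Name: "vol-a-r-1", Volume: "vol-a", Address: "10.0.0.5:10000"}

    // Stand-in for the >=10 second purge window during which the engine is
    // not reconciled; meanwhile the target dies and another volume's
    // replica comes up in its place.
    time.Sleep(10 * time.Millisecond)
    replicas := []Replica{{Name: "vol-b-r-1", Volume: "vol-b", Address: "10.0.0.5:10000"}}

    // Revalidate identity before rebuilding instead of trusting the
    // address captured before the purge started.
    cur := replicaAtAddress(target.Address, replicas)
    if cur == nil || cur.Name != target.Name || cur.Volume != target.Volume {
        fmt.Println("address no longer belongs to the target replica; aborting rebuild")
        return
    }
    fmt.Println("safe to rebuild", cur.Name)
}

The point is only that, after the window, an address alone no longer identifies the replica; any real check would go through the datastore rather than an in-memory slice.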
To Reproduce
Steps to reproduce the behavior:
Expected behavior
The engine controller should not attempt to continue with the rebuild using the wrong replica.
Log or Support bundle
A summary of logs from various components when the issue occurs:
Environment
Additional context
The reproduction works because killing the instance-manager pod causes the following chain of events:
Ideas
Check that engine.spec.desireState != stopped before continuing to rebuild (a sketch of this guard follows).
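A minimal sketch of such a guard, assuming a Longhorn-like Engine type with a Spec.DesireState field and a hypothetical getEngine datastore accessor (illustrative names, not the actual Longhorn API):

package main

import "fmt"

// Engine mirrors only the field this idea relies on; the real type lives
// in Longhorn's CRD API package.
type Engine struct {
    Spec struct {
        DesireState string
    }
}

// shouldContinueRebuild re-reads the engine before committing to the
// rebuild and refuses to proceed once detach has flipped the desired
// state to "stopped". getEngine stands in for a datastore lookup.
func shouldContinueRebuild(getEngine func() (*Engine, error)) (bool, error) {
    engine, err := getEngine()
    if err != nil {
        return false, err
    }
    return engine.Spec.DesireState != "stopped", nil
}

func main() {
    e := &Engine{}
    e.Spec.DesireState = "stopped" // the volume is detaching
    ok, _ := shouldContinueRebuild(func() (*Engine, error) { return e, nil })
    fmt.Println("continue rebuild:", ok) // prints: continue rebuild: false
}

Re-reading the engine each time, rather than caching it at the start of the reconciliation, is what would close the window described above.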