
AppServers Cleanup v2 - Only keep active, fallback and RC servers #170

Merged

antoviaque merged 8 commits into master from cleanup-v2 on Jan 23, 2017
Conversation

antoviaque (Member) commented on Jan 17, 2017

Terminate app servers that were created more than `days` before now, except (sketched below):

  • the active appserver, if there is one,
  • a "release candidate" (rc) appserver, to allow testing before the next appserver activation (we keep the most recent running appserver),
  • a fallback appserver, for `days` after activating an appserver, to allow reverts (we keep the most recent running appserver created before the current activation).

See OC-2041. Follow-up to #143.
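
In pseudo-Python, the retention rule amounts to something like this (a sketch only; the function and attribute names such as `appservers_to_keep` and `is_running` are illustrative, not this PR's actual code):

    from datetime import timedelta

    def appservers_to_keep(appservers, active_appserver, days, now):
        """Appservers the cleanup must never terminate, per the rules above."""
        keep = set()
        running = sorted((a for a in appservers if a.is_running), key=lambda a: a.created)
        if active_appserver:
            keep.add(active_appserver)  # rule 1: the active appserver
        if running:
            keep.add(running[-1])  # rule 2: rc = the most recent running appserver
        if active_appserver and now - active_appserver.last_activated < timedelta(days=days):
            pre_activation = [a for a in running if a.created < active_appserver.last_activated]
            if pre_activation:
                keep.add(pre_activation[-1])  # rule 3: the fallback appserver
        return keep

Everything created more than `days` before now and not in this set gets terminated.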

Test Instructions

  1. Create an instance that is not associated with a PR:

    • Run make shell.

    • Create instance via:

      from instance.factories import instance_factory
      regular_instance = instance_factory(name="Regular instance, no PR", sub_domain="regular-instance-zohoRek7")
  2. Spawn four app servers for the instance, faking their age:

    from datetime import timedelta
    from django.utils import timezone
    from freezegun import freeze_time

    # Oldest appserver: old enough to be terminated by the first clean_up run (step 6)
    with freeze_time(timezone.now() - timedelta(days=5)):
        oldest_appserver_id = regular_instance.spawn_appserver()

    # Fallback candidate: most recent running appserver created before the activation in step 3
    with freeze_time(timezone.now() - timedelta(days=4)):
        fallback_appserver_id = regular_instance.spawn_appserver()

    # The appserver that will be activated in step 3
    with freeze_time(timezone.now() - timedelta(days=1)):
        active_appserver_id = regular_instance.spawn_appserver()

    # Release candidate: the most recent running appserver
    with freeze_time(timezone.now() + timedelta(days=4)):
        rc_appserver_id = regular_instance.spawn_appserver()
  3. Activate active_appserver:

    regular_instance.set_appserver_active(active_appserver_id)
  4. Check the state of the appservers. You can use the GUI or do it directly in the shell via something like:

    for appserver in regular_instance.appserver_set.all():
        print(appserver)
        print("Status: ", appserver.status)
        print("Status (server): ", appserver.server.status)

    This should produce the following output:

    AppServer 1
    Status:  Running [running]
    Status (server):  Ready [ready]
    AppServer 2
    Status:  Running [running]
    Status (server):  Ready [ready]
    AppServer 3
    Status:  Running [running]
    Status (server):  Ready [ready]
    AppServer 4
    Status:  Running [running]
    Status (server):  Ready [ready]
    
  5. Run the clean_up task:

    from instance.tasks import clean_up
    clean_up()
  6. Verify that only the oldest appserver has been shut down. This should produce the following output:

    AppServer 1
    Status:  Terminated [terminated]
    Status (server):  Terminated [terminated]
    AppServer 2
    Status:  Running [running]
    Status (server):  Ready [ready]
    AppServer 3
    Status:  Running [running]
    Status (server):  Ready [ready]
    AppServer 4
    Status:  Running [running]
    Status (server):  Ready [ready]
    
  7. Re-run the `clean_up` task from different dates to check that the behavior matches the description at the top of this PR. In particular:

    • The active appserver and the most recent running appserver (rc) should never be terminated
    • The fallback appserver should only be terminated after `days` have passed
    • Far in the future, only the active appserver and the most recent running appserver should remain

    To run the `clean_up` task from different dates, use the following (make sure HUEY_ALWAYS_EAGER=true is set):

    with freeze_time(timezone.now() + timedelta(days=3)):
        clean_up()
    

    At D+3, you should get:

    AppServer 1
    Status:  Terminated [terminated]
    Status (server):  Terminated [terminated]
    AppServer 2
    Status:  Terminated [terminated]
    Status (server):  Terminated [terminated]
    AppServer 3
    Status:  Running [running]
    Status (server):  Ready [ready]
    AppServer 4
    Status:  Running [running]
    Status (server):  Ready [ready]
    
  8. Ensure AppServers which aren't in the 'running' state always get deprovisioned after two days.
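
    One way to check this, reusing the setup from the previous steps (a sketch; inspect the printed statuses manually):

    with freeze_time(timezone.now() + timedelta(days=2)):
        clean_up()

    # Any appserver that was not in the 'running' state should now be terminated
    for appserver in regular_instance.appserver_set.all():
        print(appserver, appserver.status, appserver.server.status)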

  9. Double-check the "OVH - OpenStack - Horizon - OpenCraft IM - Dev" account to see if the correct set of app servers is left.

Reviewers

itsjeyd (Member) left a comment

@antoviaque I had a look at the code and left some comments. Nothing major, the approach looks good in general. Did not get around to testing this yet; do you think it would make sense to deploy the changes to stage? That might speed things up, especially w/r/t being able to provision multiple app servers at once. (I don't have a working instance manager configuration on my local machine at the moment; the last time I had to spin up instances locally was before Sven finished the load balancer implementation.)

"""
Check if delta between `date` and `reference_date` is greater than or equal to `expected_days_since`.
Check if `later_date` is at least `expected_days_passed` after `earlier_date`.
itsjeyd (Member) commented:

@antoviaque Typo: expected_days_passed should be expected_days_since. That change might make the sentence a little less readable, so perhaps change the entire thing to read:

Check if at least `expected_days_since` have passed between `earlier_date` and `later_date`.
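
For reference, a helper matching that docstring could look like this minimal sketch (an illustration, not necessarily this PR's exact implementation):

    from datetime import timedelta

    def sufficient_time_passed(earlier_date, later_date, expected_days_since):
        """Check if at least `expected_days_since` days have passed between
        `earlier_date` and `later_date`."""
        return later_date - earlier_date >= timedelta(days=expected_days_since)

Note that such a helper returns False when `earlier_date` is actually more recent than `later_date`; this "negative delta" case comes up again later in the review.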

@@ -162,6 +162,7 @@ class AppServer(ValidateModelMixin, TimeStampedModel):
     server = models.OneToOneField(OpenStackServer, on_delete=models.CASCADE, related_name='+')
     # The Instance that owns this. InstanceReference has related_name accessors like 'openedxappserver_set'
     owner = models.ForeignKey(InstanceReference, on_delete=models.CASCADE, related_name='%(class)s_set')
+    last_activated = models.DateTimeField(null=True, blank=True)
itsjeyd (Member) commented:

@antoviaque Nit: Maybe add a comment above this line that mentions the purpose of this field?
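
Something along these lines, for example (the wording is illustrative):

    # When this appserver was last made the active appserver of its instance;
    # used by the cleanup task to decide how long to keep a fallback appserver.
    last_activated = models.DateTimeField(null=True, blank=True)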


# Keep a running appserver as fallback for `days` after activation, to allow reverts
if active_appserver and appserver.created < active_appserver.last_activated:
    if not sufficient_time_passed(active_appserver.last_activated, timezone.now(), days) \
itsjeyd (Member) commented:

@antoviaque Is there a specific reason for calling timezone.now() in three different places in this method? If not, maybe it would be worth calculating a single reference_date at the beginning (before starting to loop over app servers), and using that to check if sufficient_time_passed. If nothing else, using one specific date per run of the terminate_obsolete_appservers method could be helpful for debugging purposes.
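
In other words, roughly (a sketch; the updated diff further down does introduce a single `now` variable):

    from django.utils import timezone

    now = timezone.now()  # one reference date for the entire cleanup run
    for appserver in appservers:
        ...
        if not sufficient_time_passed(active_appserver.last_activated, now, days):
            ...  # keep this appserver as a fallback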

correctly identifies and terminates app servers created more than `days` before now, except:
- the active appserver
- a release candidate server (the most recent running appserver)
- a fallback appserver for `days` after activation (the most recent running appserver created before)
itsjeyd (Member) commented:

@antoviaque "... created before the currently-active app server was activated)", perhaps?

        ])

    def test_terminate_obsolete_appservers_no_active(self):
        # Terminate app servers after `days` have passed - the fallback appserver should be terminated
        # and the other appservers
itsjeyd (Member) commented:

@antoviaque "and the other appservers": Might be worth being a bit more specific here?

            (rc_appserver_failed, AppServerStatus.ConfigurationFailed, ServerStatus.Pending),
        ])

        # Terminated app servers again, much later - this time only the active appserver and the most recent
itsjeyd (Member) commented:

@antoviaque Typo: "Terminated" should be "Terminate"


pip install python-novaclient
pip install python-openstackclient
itsjeyd (Member) commented:

Thanks for updating this :)

antoviaque (Member, Author) replied:

: ) yw!

itsjeyd (Member) commented on Jan 19, 2017:

@antoviaque A quick question regarding the test instructions, just to clarify:

Ensure AppServers which aren't in the 'running' state always get deprovisioned after two days.

"After two days" -- are you referring to the (fake) date at which I'd be running the clean_up task here?

antoviaque (Member, Author) commented:

@itsjeyd Thanks for the review! I'll go through your comments to address them, but I wanted to answer your questions quickly:

I had a look at the code and left some comments. Nothing major, the approach looks good in general. Did not get around to testing this yet; do you think it would make sense to deploy the changes to stage? That might speed things up, especially w/r/t being able to provision multiple app servers at once. (I don't have a working instance manager configuration on my local machine at the moment; the last time I had to spin up instances locally was before Sven finished the load balancer implementation.)

Yup, we could definitely deploy on stage - I'll try to do this after addressing your current comments, if I have time. Or feel free to do it directly if you want to test this before I get around to doing it.

"After two days" -- are you referring to the (fake) date at which I'd be running the clean_up task here?

Yes, exactly.

antoviaque (Member, Author) commented:

@itsjeyd Addressed your comments - all good ones, thank you!

I also looked into deploying on stage, but it is currently on the rabbitmq branch, so I wasn't sure if it was free to use now. @bdero do you still need stage?

antoviaque (Member, Author) commented:

@itsjeyd I've updated stage with the current branch; @bdero confirmed on IRC that he doesn't need it. That actually allowed me to spot an issue: a data migration was missing for the existing active appservers, which need a value for last_activated - I've set it to the time of the data migration.
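
Such a migration would be along these lines (a sketch; the app label, migration dependency, and how the active appserver is referenced are assumptions, not the exact code):

    from django.db import migrations
    from django.utils import timezone


    def set_last_activated(apps, schema_editor):
        # Assumption: each instance references its active appserver directly.
        OpenEdXInstance = apps.get_model('instance', 'OpenEdXInstance')
        now = timezone.now()  # i.e., the time the data migration runs
        for instance in OpenEdXInstance.objects.exclude(active_appserver=None):
            instance.active_appserver.last_activated = now
            instance.active_appserver.save()


    class Migration(migrations.Migration):
        dependencies = [
            ('instance', '00xx_previous_migration'),  # placeholder
        ]
        operations = [
            migrations.RunPython(set_last_activated, migrations.RunPython.noop),
        ]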


# Keep a running appserver as fallback for `days` after activation, to allow reverts
if active_appserver and appserver.created < active_appserver.last_activated:
    if not sufficient_time_passed(active_appserver.last_activated, now, days) \
itsjeyd (Member) commented:

@antoviaque If there is a negative delta between the first and the second date passed to sufficient_time_passed (i.e., if the first date is more recent than the second one), the function will return False. If that situation ever came up (highly unlikely), not sufficient_time_passed(...) would evaluate to True and we'd end up keeping an additional fallback_appserver.

Just wanted to mention this; no need to change anything. The "dangerous" code that terminates VMs below will only run if sufficient_time_passed returns True, so a negative delta between the first and second date would not cause VMs to get terminated.
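
Concretely, under that contract (illustrative):

    from datetime import timedelta
    from django.utils import timezone

    now = timezone.now()
    # "Negative delta": the supposedly earlier date is actually the more recent one.
    sufficient_time_passed(now + timedelta(days=1), now, 2)   # returns False
    # ...so `not sufficient_time_passed(...)` is True and the appserver is kept.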

itsjeyd (Member) commented on Jan 20, 2017:

@antoviaque Thanks for the updates, and for creating the data migration! I ended up bringing my local installation of OC IM up to date yesterday, and completed a first round of testing this morning (including verifying that the data migration correctly updates last_activated for active app servers).

The code is looking good, and I didn't find any issues during the first round of testing. But I'll still need to verify that we're getting the desired behavior when there is no active app server. I'll use stage to do that, and let you know when I'm done.

itsjeyd (Member) left a comment

@antoviaque 👍

  • I tested this both locally and on stage (instance); works as advertised in PR description.
  • I read through the code
  • I checked for accessibility issues (N/A)
  • Includes documentation: the README does not include documentation about Huey tasks. It might be worth adding that at some point, but I'm fine with considering it out of scope here.

antoviaque (Member, Author) commented:

@itsjeyd Thank you for the prompt review and testing! Merging and deploying this now.

antoviaque merged commit e840528 into master on Jan 23, 2017
smarnach deleted the cleanup-v2 branch on August 22, 2017