Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge TaskResult into JobResult objects #3164

Merged
merged 18 commits into from
Feb 3, 2023
Merged

Conversation

jathanism
Copy link
Contributor

@jathanism jathanism commented Jan 25, 2023

Closes: #3085

What's Changed

WARNING: This is ugly. Getting this out there to show a checkpoint.

JobResult fields renamed:

  • job_id => task_id
  • created => date_created
  • completed => date_done
  • job_kwargs => task_kwargs

JobResult fields changed

  • status choices changed to mirror that of TaskResult.status.choices
  • Added all the missing fields found on TaskResult to JobResult
    -JobResult.task_name now mirrors TaskResult.task_name and is used distinctly
  • obj_type is now nullable so that system jobs like nautobot.extras.tasks.provision_field can be run and stored in DB, but we might want to revisit this to make it conditional around "system jobs", but this was necessary to make non-Job Celery tasks run at all without an IntegrityError
  • The JobResultManager now sets name to match task_name if name isn't set.
  • Caught some field names I missed in the Job Result templates.

Other changes

  • Celery testing has now been updated to run tasks with ALWAYS_EAGER=True so that instead of requiring a single-threaded worker process using the solo pool strategy, we can just run them "synchronously" and have them correctly update the state of the JobResult.
  • nautobot.core.testing.run_job_for_testing now stores the Celery result from a job on the JobResult object it returns as JobResult.celery_result
  • This was necessary due to the fact that nautobot.extras.jobs.run_job no longer handles exceptions internally and allows them to bubble up. run_job is called synchronously within run_job_for_testing, so a need to catch exceptions was introduced as a result.
  • Now instead of using a custom context manager, run_job_for_testing, a dummy Request object is created (if one isn't passed) and wrappped using copy_safe_request
  • Added the traceback field to the "Additional Data" tab on JobResult detail view template
  • Tests for extras API and views had to be revised for new expectations in that there will ALWAYS be JobResult objects now and to filter them by the name of the Job model class.
  • Revised nautobot.extras.tests.test_jobs.JobFileUploadTest.test_run_job_fail to check that an exception was raised by its job run and that the exception matches the expected type.
  • Placed TODO comments across various places in run_job() where I had to restore calls to job_result.save() to preserve the functionality of how the job model is mutating JobResult.data as it goes. This pattern needs to be refactored for this new paradigm as we work to eliminate run_job entirely.
  • Excluded content_type and content_encoding fields from JobResult. Had to effectively duplicate the base JobResultManager.store_result() manager method instead of overloading it and calling the base in order to accomplish this, but it should be more efficient overall since it won't result in double saving of an object.

Outstanding issues

  • Still need to migrate business logic from run_job into db backend/task classes
  • There's an issue where date_done is being nullified that needs to be addressed likely because of some race condition between run_job internals and database backend state engine (see previous bullet)
  • TaskStateChoices and celery_states need to be folded into a revamped JobResultStatusChoices choice set, so please wait on nitpicking that.
  • Most of UI is working except for tracking success/done state refresh of results
  • Unit tests omega broken again because of the status enums

TODO

  • Explanation of Change(s)
  • Added change log fragment(s) (for more information see the documentation)
  • Attached Screenshots, Payload Example
  • Unit, Integration Tests
  • Documentation Updates (when adding/changing features)
  • Example Plugin Updates (when adding/changing features)
  • Outline Remaining Work, Constraints from Design

@jathanism jathanism marked this pull request as draft January 25, 2023 18:39
Comment on lines 577 to 586
content_type = models.CharField(
default="application/x-nautobot-json",
max_length=128,
verbose_name=('Result Content Type'),
help_text='Content type of the result data')
content_encoding = models.CharField(
default="utf-8",
max_length=64,
verbose_name=('Result Encoding'),
help_text='The encoding used to save the task result data')
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are currently required by Celery, but since we're always using JSON this is just a passthrough right now. Need to review the base DatabaseBackend from django-celery-results and see if we can just always use JSON and eliminate or ignore these fields.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still want to see if I can eliminate these fields since they are 100% no longer necessary. This would be done in JobResultManager.store_result().

@bryanculver bryanculver linked an issue Jan 26, 2023 that may be closed by this pull request
10 tasks
@jathanism
Copy link
Contributor Author

This is the current issue with encoding, specifically the args/kwargs seem to not being Kombu encoded:

{'_state': <django.db.models.base.ModelState at 0x11843b850>,
 'id': UUID('f9a98b40-77cf-4e50-bf7c-66b0a5175395'),
 '_custom_field_data': {},
 'job_model_id': UUID('64b239ab-71aa-4a59-a99f-002665f642bb'),
 'name': 'plugins/example_plugin.jobs/ExampleLoggingJob',
 'task_name': 'nautobot.extras.jobs.run_job',
 'obj_type_id': 16,
 'date_created': datetime.datetime(2023, 1, 27, 19, 48, 31, 714111, tzinfo=<UTC>),
 'date_done': datetime.datetime(2023, 1, 27, 19, 48, 36, 222092, tzinfo=<UTC>),
 'user_id': UUID('2b12f865-52db-4405-8fc8-1c4d4375c2c5'),
 'status': 'SUCCESS',
 'data': None,
 'periodic_task_name': None,
 'worker': 'celery@jathy-studio.lan',
 'task_args': '"()"',
 'task_kwargs': '"{\'data\': {\'interval\': 4, \'_task_queue\': \'default\'}, \'request\': <nautobot.utilities.utils.NautobotFakeRequest object at 0x1149db280>, \'commit\': True, \'task_queue\': \'default\', \'job_result_pk\': UUID(\'f9a98b40-77cf-4e50-bf7c-66b0a5175395\')}"',
 'content_type': 'application/x-nautobot-json',
 'content_encoding': 'utf-8',
 'result': 'null',
 'traceback': None,
 'meta': '{"children": []}',
 'schedule_id': None,
 'task_id': UUID('7ded996a-e588-4fef-b214-7df7d6e699db'),
 'use_job_logs_db': True}

@jathanism jathanism force-pushed the jathanism-celery-merge-models branch from d48efd8 to 601ed2d Compare January 30, 2023 23:46
JobResult fields renamed:
- `job_id` => `task_id`
- `created` => `date_created`
- `completed` => `date_done`
- `name` => `task_name`

JobResult fields changed
- `status` choices changed to mirror that of `TaskResult.status.choices`
- `JobResult.task_name` now mirrors `TaskResult.task_name` and is used distinctly
- There's an issue where `date_done` is being nullified that needs to be addressed
- Most of UI is working except for tracking success/done state refresh of results
- Unit tests omega broken again because of the status enums
ripped out.

- All of it is being handled by the database backend.
- Customized the DB backend to manage the `date_done` field for us.
- Encoding/decoding task args/kwargs is still funky. Have not figured
  that out yet.
- Tests still broken.
- Refactored `JobResultStatusChoices` to align with Celery task state names
- Squashed migrations down into a single file.
- Made flake8 and black happy ^_^
@jathanism jathanism force-pushed the jathanism-celery-merge-models branch from ac2b4fc to 7d95957 Compare January 31, 2023 20:56
- `JobResult.obj_type` is now nullable, but we might want to revisit this to make it conditional around "system jobs", but this was necessary to make non-Job Celery tasks run at all without an `IntegrityError`
- Fixed an encoding issue with the database backend, but still not the double-encoding one.
- The `JobResultManager` now sets `name` to match `task_name` if `name` isn't set.
- Fixed the table column for `JobResultTable.date_created` to be linkified in the UI.
- Caught some field names I missed in the Job Result templates.
def _get_extended_properties(self, request, traceback):
"""
Overload default so that `argsrepr` and `kwargsrepr` aren't used to construct `args` and
`kwargs`. Our need to properly (de)serialize `NautobotFakeRequest` facilitates this.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we get rid of NautobotFakeRequest, will that get rid of the need to override this private method? Not saying that has to be done as part of this PR though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think so. The other problem are the nullable CharField fields we forklifted. If we want "correctness" I would rather those default to "" vs. None but that's actually more invasive than what I had to do here.

nautobot/extras/api/views.py Show resolved Hide resolved
@@ -46,7 +46,7 @@ def refresh_datasource_content(model_name, record, request, job_result, delete=F
job_result.log(
f"Error while refreshing {entry.name}: {exc}", level_choice=LogLevelChoices.LOG_FAILURE
)
job_result.set_status(JobResultStatusChoices.STATUS_ERRORED)
job_result.set_status(JobResultStatusChoices.STATUS_FAILURE)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this file also need a TODO about removing the set_status calls?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I meant to remove all of those. Anywhere explicit status changes are being made must go away. We want Celery to make these decisions, along with any explicit calls to JobResult.save(), moving those to the NautobotTask class or the backend or a mix of both, depending.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jathanism Do you intend to do this in this PR or open a follow-on story?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-on as a part of the "make run_job a thing of the past" work.


except AbortTransaction:
if not job_model.read_only:
job.log_info(message="Database changes have been reverted automatically.")

except Exception as exc:
stacktrace = traceback.format_exc()
job.log_failure(message=f"An exception occurred: `{type(exc).__name__}: {exc}`\n```\n{stacktrace}\n```")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do exceptions thrown during job execution get recorded now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They get stored on the JobResult.traceback field and unless otherwise handled by decorating the task with a throws= argument, they result in the job status of FAILURE.

Cursor_and_Job_Result__File_Upload_Failure_-_Nautobot

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so I guess we may want to consider enhancing the UI so that (maybe only if DEBUG=True?) we display the .traceback value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any traceback will result in a failure unless it's defined as a known exception type such as using autoretry_for

Also traceback is currently part of the "Additional Data" tab:

image

Unless you're thinking of showing it in the job log as well?

finally:
if output:
job.results["output"] += "\n" + str(output)
output = job.post_run()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The try/except was added specifically for #1807 - are exceptions captured/logged differently now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. We should allow all exceptions to bubble up and they are handled by Celery and then the traceback gets captured.

nautobot/extras/managers.py Outdated Show resolved Hide resolved
nautobot/extras/managers.py Outdated Show resolved Hide resolved
nautobot/extras/models/jobs.py Outdated Show resolved Hide resolved
nautobot/extras/models/jobs.py Outdated Show resolved Hide resolved
@jathanism jathanism marked this pull request as ready for review February 1, 2023 16:47
- Celery testing has now been updated to run tasks with `ALWAYS_EAGER=True` so that instead of requiring a single-threaded worker process using the `solo` pool strategy, we can just run them "synchronously" and have them correctly update the state of the `JobResult`.
- `nautobot.core.testing.run_job_for_testing` now stores the Celery result from a job on the `JobResult` object it returns as `JobResult.celery_result`
   - This was necessary due to the fact that `nautobot.extras.jobs.run_job` no longer handles exceptions internally and allows them to bubble up. `run_job` is called synchronously within `run_job_for_testing`, so a need to catch exceptions was introduced as a result.
  - Now instead of using a custom context manager, `run_job_for_testing`, a dummy `Request` object is created (if one isn't passed) and wrappped using `copy_safe_request`
- Added the `traceback` field to the "Additional Data" tab on `JobResult` detail view template
- Tests for extras API and views had to be revised for new expectations in that there will ALWAYS be `JobResult` objects now and to filter them by the name of the Job model class.
- Revised `nautobot.extras.tests.test_jobs.JobFileUploadTest.test_run_job_fail` to check that an exception was raised by its job run and that the exception matches the expected type.
- Placed `TODO` comments across various places in `run_job()` where I had to restore calls to `job_result.save()` to preserve the functionality of how the job model is mutating `JobResult.data` as it goes. This pattern needs to be refactored for this new paradigm as we work to eliminate `run_job` entirely.
@bryanculver bryanculver self-requested a review February 2, 2023 16:04
Comment on lines +50 to 55
user = data.pop("user")
data.pop("_cached_user", None)
if "_user_pk" not in data:
# We have a user but haven't stored its PK yet, so look it up and store it
data["_user_pk"] = data.pop("user").pk
else:
# We already have stored the PK, and we need to AVOID actually resolving the lazy user reference.
data.pop("user")
data["_user_pk"] = user.pk
return data
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wonder if this should be back-ported to 1.5.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might not be necessary, seeing as we don't have the recurring serialize/deserialize issue we have here with the database backend. The _cached_user only becomes a part of the payload when it is re-serialized from a deserialized payload a 2nd time.

<strong>Traceback</strong>
</div>
<div class="panel-body">
{% include 'extras/inc/json_data.html' with data=result.traceback format="python" %}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if we should do a check for settings.DEBUG here - similar to how we only present full tracebacks for other exceptions (500 errors) if it's True?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO if there is a traceback we should show it. I was thinking that since it's stored on the object, presenting it if it exists is easy. Is there a concern about data exposure or something tangible?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that was my thinking.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's valid. IMO let's leave that as a v2 UI thing. I've tracked it to add to the backlog.

Copy link
Member

@bryanculver bryanculver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a great step forward. Not going to hold up the absence of the change/migration documentation (status value changes, created => date_created) but so long as we ensure that gets added to the top of the backlog to do, the longer it sits the more conflicts/blockings we might run into.

# TODO(jathan): Remove this once this body of work is done. This is just useful for debugging but it
# results int a lot of noise and slows things down.
# from celery import signals
# @signals.task_prerun.connect
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have the TODO to remove but you disable the connect. I think if it's valuable to have this around for debugging why not just comment the whole thing out or document how to use this, instead of having to go back into the git history to restore it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just uncomment it locally when I need it for now. Leaving it in for testing generates a stupendous amount of noise.

I think documenting how to use it would be valuable.

nautobot/extras/managers.py Outdated Show resolved Hide resolved
@@ -46,7 +46,7 @@ def refresh_datasource_content(model_name, record, request, job_result, delete=F
job_result.log(
f"Error while refreshing {entry.name}: {exc}", level_choice=LogLevelChoices.LOG_FAILURE
)
job_result.set_status(JobResultStatusChoices.STATUS_ERRORED)
job_result.set_status(JobResultStatusChoices.STATUS_FAILURE)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jathanism Do you intend to do this in this PR or open a follow-on story?

status = django_filters.MultipleChoiceFilter(choices=JobResultStatusChoices, null_value=None)

class Meta:
model = JobResult
fields = ["id", "created", "completed", "status", "user", "obj_type", "name"]
fields = ["id", "date_created", "date_done", "name", "status", "user", "obj_type"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's be sure to add either as a requirement to this PR or a follow-on to create the migration docs for all the model changes here.

Probably would be a good pre-release checklist item is to review every migration in next and ensure we have a migration guide time around it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's on my list of items I'll be adding to the epic for wrapping up the database backend work in #1622


if not commit:
raise AbortTransaction()
raise AbortTransaction("Database changes have been reverted automatically.")

except AbortTransaction:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would help if we completed #2726 shortly after merging this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking that'll just be another one for ripping out run_job. Leave out the atomic block while we're at it.

- Had to effectively duplicate the base `store_result()` manager method instead of overloading it and calling the base in order to accomplish this, but it should be more efficient overall since it won't result in double saving of an object
- Added more verbose commenting in areas of the code that were overloaded where some motives may not have been
 clear (case in point above).
- Addressed some PR feedback for clarity in some other code commennts.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Jobs Overhaul: Merge TaskResult and JobResult data models
3 participants