Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OEP-0049 Django App Patterns #181

Merged
merged 14 commits into from Apr 12, 2021
265 changes: 265 additions & 0 deletions oeps/oep-0049-django-app-patterns.rst
@@ -0,0 +1,265 @@

=============================
OEP-0049: Django App Patterns
=============================

.. list-table::
:widths: 25 75

* - OEP
- :doc:`OEP-0045 </oeps/oep-0049>`
* - Title
- Django App Patterns
* - Last Modified
- 2021-01-29
* - Authors
- Matt Tuchfarber <mtuchfarber@edx.org>
* - Arbiter
- Jinder Singh
* - Status
- Under Review
* - Type
- Architecture
* - Created
- 2021-01-29
* - Review Period
- 2021-02-22 - 2021-04-10

.. contents::
:local:
:depth: 2

Abstract
--------
Proposes a common set of code patterns for Open edX Django apps.

Motivation
-----------
As our number of Django apps continue to grow in our many services, we want to coalesce around a couple of standard design patterns to both make switching between codebases easier and to help untangle some of the links between codebases we have today. These decisions should be considered "best practices" or the default patterns, and should only be violated if the code base requires it.

Decision
--------
All of our Django apps should have a common structure. This structure consists of a combination of default Django-style code and Open edX-style code. This document will only attempt to detail the common Open edX patterns that we would like to see everywhere, ignoring Django-default items (e.g. ``admin.py``, ``urls.py``, etc) and situation-specific items (e.g. a separate ``constants.py`` file).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a blocking comment but I wasn't able to find a good example of this but it would be nice to link to some docs that explicitly spell out the django side of our layout. If you're new to django, you won't really know what those are and if you come here for answers, it might be nice to give them a place to go. The closest I found was this: https://docs.djangoproject.com/en/3.1/intro/tutorial01/#creating-the-polls-app

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a table of the things we won't have opinions on with links to their individual docs. Lemme know if this covers your concern.


The common Django files and folders this **won't** set preferences for are:

.. list-table::
:widths: 25 60 15

* - admin.py
- The additions to the Django admin site live here
- `Docs <https://docs.djangoproject.com/en/3.1/ref/contrib/admin/>`_
* - migrations/
- The migrations for the models live here
- `Docs <https://docs.djangoproject.com/en/3.1/topics/migrations/>`_
* - management/
- The management commands live here
- `Docs <https://docs.djangoproject.com/en/3.1/howto/custom-management-command/>`_
* - models.py
- The Django ORM models live here
- `Docs <https://docs.djangoproject.com/en/3.1/topics/db/models/>`_
* - urls.py
- The URL structure of your app is defined here
- `Docs <https://docs.djangoproject.com/en/3.1/topics/http/urls/>`_

More detailed

Listed below are each of the files or folders your app should contain and what they should consist of.


README.rst
++++++++++
Each app should contain a README.rst to explain its use. See full details of what should go in the README.rst in OEP-0019_

.. _OEP-0019: https://open-edx-proposals.readthedocs.io/en/latest/oep-0019-bp-developer-documentation.html#readmes

.. ___init__.py:
__init__.py
+++++++++++
The ``__init__.py`` file should contain a line for the ``default_app_config`` for the app. This ``default_app_config`` should point to the ``AppConfig`` located in ``<app_name>/apps.py``. It may also contain small app details such as a version. However, unlike many packages, ``__init__.py`` should *not* be used to as the way to export the app's public methods. These should be exported using ``api.py`` (and thus imported as ``from path.to.app.api import public_function``). See api.py_ below.

Example
~~~~~~~

.. code-block:: python

default_app_config = "service_name.apps.app_name.apps.CustomAppConfig"


.. _apps.py:
apps.py
+++++++
The ``apps.py`` file should contain a subclass of a Django ``AppConfig``. The AppConfig should set the app's name to its full path (e.g. ``name = "service_name.apps.app_name"``) and should (optionally) have an overriding ``ready()`` function which initializes the app. Any imports that need to happen during app initialization (such as signals_) need to happen inside the ``ready`` function or else there's risk of circular imports.

Example
~~~~~~~

.. code-block:: python

class MyAppConfig(AppConfig):
"""
Application Configuration for MyApp.
"""
name = "service_name.apps.app_name"

# (optional) Set up plugin. See https://github.com/edx/edx-django-utils/tree/master/edx_django_utils/plugins

def ready(self):
"""
Connect handlers to recalculate grades.
"""
from .signals import handlers

.. _api.py:
api.py
++++++
This should be single point of entry for other Python code to talk to your app. This is *not* a Rest API, this is a Python API (see rest_api_). Some rules for ``api.py`` are as follows:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏼 for differentiating between Python API and REST API here. Sometimes "API" is used loosely and interchangeably and it's confusing.


1. API methods defined in ``api.py`` should be well-named, self-consistent, and relevant to its own domain (without exposing technical and implementation details)
2. An app's Django models and other internal data structures should not be exposed via its Python APIs (unless performance requires it).


Not exposing an app's data structures can be tricky because it's very easy to expose them without meaning to. Therefore there are a couple common strategies we employ.

1. When importing internal app code to be used in the ``api.py`` file, prefix it with an underscore so it's clear it's for internal use only.

2. Create a ``data.py`` file to house simple data objects that can be passed from your app's function to the calling app. By creating these objects, we can avoid both passing Django model objects or querysets directly and having to serialize data. Other apps may import data classes from ``data.py`` in additional to functionality from ``api.py``. See data.py_ for more details.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm concerned about not exposing QuerySets, due to future performance concerns, but also due to extensibility. For instance, I was just discussing w/ @MichaelRoytman about the course_enrollment api in edx-platform which doesn't currently expose a QuerySet, and as a result he is unable to additionally filter those enrollments inside the database (and instead had to pull the data into memory and then filter in python). That pattern, of pulling into memory and filtering, makes pagination difficult, because to find a page of N users, you may have to filter through the entire data set in python (in the worst case). Even exposing generators just mitigates the problem by only pulling users until you fill up the filtered page, but that doesn't prevent you from needing to iterate the users in python, rather than filtering in the database.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Context: this was mainly pulled from https://github.com/edx/edx-platform/blob/master/docs/decisions/0002-inter-app-apis.rst#decisions.)

I've also been hesitant about performance on this point. There are ways around the performance issue, such as allowing kwargs that get passed to the initial queryset, but that makes the hard line we're trying to draw between apps a lot more hazy. It also defeats the purpose of not exposing the internal data structure, but it'd give us a performant API that also only returns plain data classes.

There was a recent discussion with about models_api.py (which I had pulled from this doc), which set it up as the "when you need to break the API contract for performance reasons" API to use. I'm not sure if having a separate file, or having kwargs (above) would be less hazy.

cc: @nasthagiri since you wrote the referenced ADR and had that models_api.py discussion with me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even exposing **kwargs doesn't actually help that much. For instance, you couldn't use .annotate or .aggregate if the result isn't returning a QuerySet. I wonder if there's a path forward where we advocate returning values() or values_list() querysets. Even that, though, means that clients can't choose which columns they care about, which isn't great from performance purposes.

Sigh... I wish we had a better way of being able to use the Python standard of "we're all consenting adults" to feel more comfortable exposing honest-to-goodness querysets. Like, can we just say that non-public fields on a model will be prefixed with _ on the object (and then we can name the column whatever we want on the field definition)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So coincidentally, I had something like this come up in the ticket that I'm working on today, and my mind also wandered to trying to address it with values_list. The problem was:

  • I wanted to return an iterable of course contexts that didn't have any learning_sequence outline data associated with them.
  • Because I want to punt learning_sequences out of edx-platform and into its own repo, I didn't want to make it know about CourseOverview.
  • The actual pushing of data into learning_sequences happens in contentstore (i.e. Studio), since learning_sequences likewise doesn't know about modulestore.

So I settled on a new addition to the learning_sequences API that was:

def get_course_keys_with_outlines() -> QuerySet:
    """Queryset of ContextKeys, iterable as a flat list."""
    return LearningContext.objects.values_list('context_key', flat=True)

So iterating through it simply goes like you'd expect:

for course_key in get_course_keys_with_outlines():
    print(course_key)

Yields:

course-v1:TNL+100+2021_01_21
course-v1:TNL+100+ABC
course-v1:TNL+LibContentTest+2021
course-v1:edX+DemoX+Demo_Course

But you can also use it as a qset:

all_course_keys_qs = CourseOverview.objects.values_list('id', flat=True)
missing_outlines_qs = all_course_keys_qs.exclude(
    id__in=get_course_keys_with_outlines()
)

The query it uses is:

SELECT `course_overviews_courseoverview`.`id` FROM `course_overviews_courseoverview`
WHERE NOT (`course_overviews_courseoverview`.`id` IN (SELECT U0.`context_key` FROM `learning_sequences_learningcontext` U0))

Not that this will solve every problem, but I think there is something we can do here that is to encourage some predictable subset of the Queryset API in some disciplined way.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, would it work if we subclassed QuerySet to make something that maps into attrs over it when you actually do the iteration? I'll experiment a little...

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So for the last part, something might look like:

tasks = api.get_tasks_for_user(user_id)

urgent_tasks = tasks.overdue().high_priority()

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Sorry, still doing this in bits and pieces where I can fit in the time...)

I think the kind of interface I posted above could work for the "there is a system out there that I want answers from" type of query, but we'd need something more for the "I want to efficiently add my own data to this" type of query.

Say I was attaching course-specific survey data to enrollments (I am picking totally at random). There's an enrollments API built in the fashion of this OEP, and I have survey objects that have a reverse relationship name of "survey". The approach I suggested above would obliterate the idea of a model relation, and I'm super-skittish about reimplementing that type of logic. But a client should be able to do something like:

tasks = task_api.get_tasks_for_user(user_id)
urgent_tasks = tasks.overdue().high_priority()

jira_tickets = jira_api.tickets_for_tasks(urgent_tasks)

Implementing the JIRA API tickets_for_tasks by taking a queryset, doing a select_related, and returning a matching qset with JIRA information corresponding to the qset that was passed in–that general pattern should be a common use case, right? The JIRA app here gets to determine the reverse relationship name in their model...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(FYI I extended the review date by a month so we could hash this out. New date is April 10.)

I really like the "have your cake and eat it too" design. I have a lot of hesitations on adding any of it in here though. Reasons being:

  1. Whatever we decide on will be an idea and not tested in multiple places. I think this should be an ideate -> implement -> ADR -> OEP flow, rather than tagged on here. (Very interested in this topic though, so I'd be happy to be a part of the future work)
  2. Sort of in regard to "reimplementing that type of logic", this feels like it's straddling the line between a customization and a contradiction to how Django does things. I don't think we should bend to Django's will, but as one of the larger Django projects, I think this may be a good space for cross-community collaboration to see if this is something someone's already tried, or if they have any thoughts. Honestly it feels like it'd be a rabbit hole to dive into.

TL;DR: I want this; I want to be in on the discussion; I don't think this is where we decide it.

I'm thinking this section might become more of a "use attrs if possible, querysets are okay for performance reasons". When we have a working use case, we can update this section to use it.

That seem fair to you? I'll update the PR with that language if so.

P.S. I hate stopping a brainstorm, so know that I feel bad about it. Haha.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I didn't reply to this. I agree that this idea is too rough to be part of this OEP, and it makes sense to experiment with it first.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I've updated the language to be less strict about not passing querysets. If you have any thoughts on it, let me know. If not, I think this was the last major question, so we'll just wait it out until April 10 :)


Performance caveat
~~~~~~~~~~~~~~~~~~
While there are many situations that the above solution works well for, there are a number of situations where the need for performance outweighs the preference for strong code boundaries. In these situations, APIs may return querysets of models so the code consuming the API may efficiently filter and retrieve the data. We don't have solutions that keep strong boundaries and have good performance today, but are working towards them.

If you simply need to page your results and want to keep code boundaries intact, you can use Django's Paginator class to keep the retrievals performant without passing Querysets around.

Example
~~~~~~~

.. code-block:: python

from django.conf.settings import UNSUPPORTED_PROGRAM_UUIDS
from django.core.paginator import Paginator

from .data import ProgramData
from .models import Program as _Program

def get_supported_programs_simple():
"""
Gets all programs that aren't in UNSUPPORTED_PROGRAM_UUIDS settings

Returns a page of results if page_size is specified
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this doc string incorrect? It seems like copy pasta from docstring of get_supported_programs_paged function

"""
supported_programs = _Program.objects.exclude(
uuid__in=UNSUPPORTED_PROGRAM_UUIDS
)

return [
ProgramData(
uuid=program.uuid,
title=program.title,
status=program.status
)
for program in supported_programs
]

def get_supported_programs_paged(page_size=None, page=None):
"""
Gets all programs that aren't in UNSUPPORTED_PROGRAM_UUIDS settings

Returns a page of results if page_size is specified
"""
q_supported_programs = _Program.objects.exclude(
uuid__in=UNSUPPORTED_PROGRAM_UUIDS
)

if page_size:
# passing a queryset to a paginator allows it to stay performant on large tables
program_paginator = Paginator(q_supported_programs, page_size)
# get_page returns the first page if page is None
supported_programs = program_paginator.get_page(page)
else:
supported_programs = q_supported_programs

return [
ProgramData(
uuid=program.uuid,
title=program.title,
status=program.status
)
for program in supported_programs
]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is interesting. I think I need to think about it some more, but a few reflex reactions:

  1. I do think some sort of Paginator usage makes sense, though I'm not really that familiar with the options and requirements there.
  2. Something like this makes sense for a DRF query or something like Django admin, and making use of Django's builtins to accomplish that would be great.
  3. We'd probably need some different approach (generator?) if this was being run by a management command and wanted to iterate through a whole bunch of records. Otherwise, this would do a lot of unnecessary computation on the database side to recompute a bunch of queries instead of having one big one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thought about 3 a lot. The best way I can think of to keep performance and app model isolation is to pass the paginator directly and lazily transform the responses into data classes. Leaving this code in a comment for now because it's it involves creating custom classes that don't exist yet. I'm curious 1) What you think of the pattern and 2) If it's worth putting in here since without the shared generic code, this would be cumbersome code for an OEP.

# Things we'd probably throw in edx-django-utils
class TransformingPage(Page):
    def __init__(self, *args, **kwargs):
        self.transformer = kwargs.get("transformer")
        del kwargs["transformer"]
        super().__init__(*args, **kwargs)

    def __getitem__(self, index):
        # Transform the data at the last possible moment so we don't evaluate the queryset
        obj = super().__getitem__(index)
        return self.transformer(obj)

class TransformingPaginator(Paginator):
    def __init__(self, *args, **kwargs):
        self.transformer = kwargs.get("transformer")
        del kwargs["transformer"]
        super().__init__(self, *args, **kwargs)

    # Suggested method to override on Paginator
    def _get_page(self, *args, **kwargs):
        return TransformingPage(*args, **kwargs, transformer=self.transform_function)

###############################
# in api.py
from django.conf.settings import UNSUPPORTED_PROGRAM_UUIDS

from edx_django_utils import TransformingPage, TransformingPaginator

from .data import ProgramData
from .models_api import get_programs as _get_programs

def get_supported_programs(page_size=None):
    """
    Gets all programs that aren't in UNSUPPORTED_PROGRAM_UUIDS settings

    Returns a Paginator of programs
    """
    DEFAULT_PAGE_SIZE = 20

    # _get_programs() returns a queryset
    q_supported_programs = _get_programs().exclude(
        uuid__in=UNSUPPORTED_PROGRAM_UUIDS
    )

    # transformer would probably be a common function elsewhere, but putting it inline for brevity
    return TransformingPaginator(
        q_supported_programs, 
        page_size or DEFAULT_PAGE_SIZE,
        transformer=(lambda program: ProgramData(
            uuid=program.uuid,
            title=program.title,
            status=program.status
        ))
    )


.. _data.py:
data.py
+++++++
This file should include the public data structures for the app that can be passed between apps without exposing internal features. These should be used instead of sending Django model objects or querysets to apps that call the functions in ``api.py``. This file should not import anything other than stdlib modules, so that it may be imported by any other app without issue. These data objects should be simple objects with all business logic handled by ``api.py``. They may however perform simple validation, as long as it is self-contained (doesn't reach out to database, network, or any code outside of the class)
tuchfarber marked this conversation as resolved.
Show resolved Hide resolved

Example
~~~~~~~

.. code-block:: python


from enum import Enum

import attr

def ProgramStatus(Enum):
ACTIVE = "active"
RETIRED = "retired"

@attr.attrs(frozen=True)
class ProgramData:
uuid: str = attr.attrib(
validator=[attr.validators.instance_of(str)]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it too much to hope for mypy running, so we don't have to do runtime checking of this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd vastly prefer mypy, but we haven't followed through enabling it and I forget why what it's being up up for. I was mostly using it as a simple example until we live in a static type checking world though. If you think it'd be harmful during the in-between time, let me know though.

)
title: str = attr.attrib(
validator=[attr.validators.instance_of(str)]
)
status: str = attr.attrib(
validator=[
attr.validators.instance_of(str),
attr.validators.in_(ProgramStatus)
]
)

.. _rest_api:
rest_api/
+++++++++
If an app will have its own REST API, it should live in a folder called ``rest_api`` to distinguish it from the ``api.py`` file used for intra-app communication.

APIs should be versioned and the serializers and permissions associated with that version should be kept inside that version's folder. This prevents breakages when an API needs to be updated.

An example of a common folder structure for a versioned REST API::

app_name
├── rest_api
│   ├── v1
│   │ ├── permissions.py
│   │ ├── serializers.py
│   │ ├── urls.py

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there anything useful to state about standards for urls.py files? (This one and the two below.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have thoughts on URLs, namespacing, and such, but that feels like an argument left for a different OEP. So I just added a note that this doesn't attempt to go into it.

│   │ └── views.py
│   └── urls.py
├── urls.py
└── views.py # existing legacy non-REST APIs

Because API conventions (including URL structure, namespacing, and versioning) are separate concerns than the app structure, please reference https://openedx.atlassian.net/wiki/spaces/AC/pages/18350757/edX+REST+API+Conventions for any questions.


.. _signals:
signals/
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Did you mean to give signals its own directory?
  2. It sounds like signal handlers are supposed to go in signals.py. If I have a long running task that is triggered by a signal handler, should it go in in the same module, or a separate tasks.py file?
  3. Where is the right place to define a custom signal that your app will emit? It sounds like we probably don't want it in signals.py, since that has celery registration side-effects. Maybe data.py?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops. I had two conflicting thoughts while writing this and got my wires crossed. I've updated it to show that signals should be a directory, with signals.py and handlers.py files. That lines it up with my from .signals import handlers in the apps.py example.

I added a separate entry for tasks.py because I think that's important to call out.

I think the change fixes the registration side effect concerns, but let me know if it doesn't.

+++++++++
If an app is consuming Django Signals from other apps or creating its own Signals, it should include a ``signals`` directory which will include both its signal handlers and Signals it owns. If possible, the signal handlers should only be thin layer between the signal and more generalized functions in the app. This way we can keep business logic out of the "plumbing". The signals directory should look like::

app_name
├── signals
│   │ ├── signals.py # for defining new signals
│   │ ├── handlers.py # for listening to existing signals

.. _tasks:
tasks/ or tasks.py
++++++++++++++++++
If an app contains long running tasks (i.e. tasks that run outside of a request, often a celery task), they should live in in either either a ``tasks.py`` file or a ``tasks`` folder.

Consequences
------------
At this time, there is no plan to enforce any of these guidelines. The vast majority of current Open edX code doesn't yet meet these guidelines, and there will always be exceptions to the rule. The hope is that as developers write new code or refactor existing code, they follow these patterns as best they can. We also hope that code reviewers will ensure these guidelines are followed in the code they approve.