Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

As a developer, I can bulk_create subclasses of Content #1946

Open
pulpbot opened this issue Jan 17, 2022 · 1 comment
Open

As a developer, I can bulk_create subclasses of Content #1946

pulpbot opened this issue Jan 17, 2022 · 1 comment

Comments

@pulpbot
Copy link
Member

pulpbot commented Jan 17, 2022

Author: @dralley (dalley)

Redmine Issue: 7824, https://pulp.plan.io/issues/7824


It is currently disallowed to bulk_create() multi-table inherited Django models, due to the technical challenges with doing so, however it is not necessarily impossible in theory.

https://code.djangoproject.com/ticket/28821

As a Pulp developer, if we are able to make assumptions about the way it will be used, then we can avoid most of the problems that prevent a generic implementation. For instance, only one level of inheritance.

I developed a proof of concept strategy that is unfortunately made more difficult by Django's proxy model behavior and the fact that there's no class which represents just the subclass table. So, this code emulates how multi-table inherited models set up their internal relationships, but doesn't actually use model inheritance.

Strategy:

  • Transactions don't check the integrity of foreign keys until they are committed
  • So save our child table first with a random uuid as the content_ptr, in bulk
  • Then go back and save the parent model, in bulk
class PulpBase(models.Model):
    pulp_id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
    pulp_created = models.DateTimeField(auto_now_add=True)
    pulp_type = models.TextField(null=False, default=None)

    class Meta:
        abstract = True

class NewContent(PulpBase):
    pulp_type = models.TextField(null=False, default=None)

class NewPulpManager(models.Manager):
    # Ignore the hacky workarounds for stuff like pulp_type, I was trying to keep things simple
    def bulk_get_or_create(self, objs, batch_size=None):
        with transaction.atomic():
            pulp_type = self.model.get_pulp_type()

            q = models.Q(pk__in=[])
            unsaved_idxs_by_nat_key = defaultdict(list)
            for idx, obj in enumerate(objs):
                content_already_saved = not obj._state.adding
                if not content_already_saved:
                    unsaved_idxs_by_nat_key[obj.natural_key()].append(idx)
                    q |= models.Q(**obj.natural_key_dict())

            existing_objs = self.model.objects.filter(q)
            for existing_obj in existing_objs.iterator(chunk_size=batch_size or 2000):
                for idx in unsaved_idxs_by_nat_key[existing_objs.natural_key()]:
                    objs[idx] = existing_obj

            new_base_content = []
            for obj in objs:
                content_already_saved = not obj._state.adding
                if not content_already_saved:
                    new_base_content.append(NewContent(pulp_type=pulp_type, pulp_id=obj.pk))

            self.bulk_create(objs, batch_size=batch_size)
            NewContent.objects.bulk_create(new_base_content, batch_size=batch_size)

        return objs


class ContentBase(models.Model):
    content_ptr = models.OneToOneField(NewContent, primary_key=True, default=uuid.uuid4, on_delete=models.CASCADE)

    objects = NewPulpManager()

    @classmethod
    def get_pulp_type(cls):
        return cls.TYPE

    @classmethod
    def natural_key_fields(cls):
        """
        Returns a tuple of the natural key fields which usually equates to unique_together fields
        """
        return tuple(chain.from_iterable(cls._meta.unique_together))

    def natural_key(self):
        """
        Get the model's natural key based on natural_key_fields.

        Returns:
            tuple: The natural key.
        """
        return tuple(getattr(self, f) for f in self.natural_key_fields())

    def natural_key_dict(self):
        """
        Get the model's natural key as a dictionary of keys and values.
        """
        to_return = {}
        for key in self.natural_key_fields():
            to_return[key] = getattr(self, key)
        return to_return


    class Meta:
        abstract=True

class NewFileContent(ContentBase):
    TYPE = "file.file"

    relative_path = models.TextField(null=False)
    digest = models.CharField(max_length=64, null=False)

    class Meta:
        unique_together = ("relative_path", "digest")

The speedup is about 3x vs what we currently do.

Old

In [5]: %time PulpFileContent.objects.bulk_get_or_create(old_content)
CPU times: user 555 ms, sys: 88.6 ms, total: 643 ms
Wall time: 1.77 s

New

In [7]: %time NewFileContent.objects.bulk_get_or_create(new_content)
CPU times: user 648 ms, sys: 1.39 ms, total: 649 ms
Wall time: 694 ms
@stale
Copy link

stale bot commented May 24, 2022

This issue has been marked 'stale' due to lack of recent activity. If there is no further activity, the issue will be closed in another 30 days. Thank you for your contribution!

@stale stale bot added the stale label May 24, 2022
@dralley dralley removed the stale label May 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants