Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Import donation data #42

Merged
merged 8 commits into from
Aug 8, 2018
Merged

Import donation data #42

merged 8 commits into from
Aug 8, 2018

Conversation

cuducos
Copy link
Contributor

@cuducos cuducos commented Aug 1, 2018

This PR depends on Election data to be available and, so, is untested (thus, WIP).

Implements two layers in the CSV importer command to allow the importion
of huge CSV files:

* `DiskCache` is a cache based on the OS filesystem, but which can be
used together with Django cache (usually in memory) for greater
performance
* `CsvSpliter` splits a CSV file in numerous smaller CSV files so the
system can process them with little hassle in terms of memory
@cuducos cuducos force-pushed the cuducos-import-donation-data branch from cedf21e to fae0786 Compare August 6, 2018 23:55
@cuducos cuducos force-pushed the cuducos-import-donation-data branch from ae12911 to ebf3729 Compare August 6, 2018 23:59
@cuducos cuducos changed the title [WIP] Import donation data Import donation data Aug 7, 2018
@cuducos
Copy link
Contributor Author

cuducos commented Aug 7, 2018

@leportella I think this is ready for code review. I found out that even if the Election model has ~3 million records, and even if the donation CSV has ~7 million lines, it is only being able to identify and link roughly 80 thousand donations.

I think this is enough to code review and eventually merge this PR, but I'm opening an issue to debug data and the import process on both ends (Election and Donation) to make sure we're not leaving anything behind.

from perfil.person.models import Person


def election_keys(election):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

initally I thought about putting this in the ImportCSVCommand or another class to be inherited by our commands, so you wouldn't have to import it everytime. Don't have any problem to leave it this way, but maybe you didn't think about the other possibility

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It cannot be a method of ImportCsvCommand (or Command) otherwise we cannot access in in the definition of our custom Commands. What I mean is that where we have:

class Command(ImportCsvCommand):

    model =bulk_size =to_cache (Election, election_keys)
    …

It would be an invalid Python code to use:

class Command(ImportCsvCommand):

    model =bulk_size =to_cache (Election, self.election_keys)
    …

This is invalid becauseself doesn't exist yet. Some alternatives I've though about (but discarded them all):

  • using (Election, ImportCsvCommand.election_key) would be possible if we use the @staticmethod decorator, but unnecessarily verbose IMHO
  • Defining it as the default method to cache Election so we don't have to pass it in the to_cache instance variable would work, but would put some boundaries in the way we do lookups in the database and cache for each given model (making it difficult or unnecessarily verbose to customize)
  • We could define it per load_whatever.py file, but we would repeat code in our code base (election_keys as it is is used in two or three commands I guess)

That's why it is the way it is. Does it make sense? Do you prefer any other implementation? Let me know. IMHO importing it every time is tolerable comparing to the downside of the three alternatives I could think of. What about you?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes total sense. Thank you for the detailed explanation :)

class Command(ImportCsvCommand):

help = dedent("""
Import CSV generated by: https://github.com/rafapolo/tribuna
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can start a proper documentation to add this kind of comments? What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only concern regarding this suggestion: I don't want to keep two documentation references. So if we document it elsewhere (let's say a .md file or GitHub wiki) we have to decide between:

  1. assuming the risk of having non-matching documentation and command help text (that is to say, we might update the .md file and forget to change the help variable for the command)
  2. or document everything elsewhere (.md or GitHub wiki in the exemple) creating a downside in terms of UX, that is to say: when the user is in fact at the terminal trying to import data, the help text wouldn't be helpful (would point him to the browser or to the .md file)

That's why IMHO documenting Django commands should be it in the help message of the command themselves: no duplicates to be kept in sync, and the information is delivered in the environment in which it is needed.

What do you think about it?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I am not sure if docs are the best approach. I don't have a define argument for defend it or not.

class Donation(models.Model):
election = models.ForeignKey(Election, on_delete=models.CASCADE)
donator = models.CharField(max_length=250)
donator_id = models.CharField(max_length=14, null=True)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all the CharField fields should not have null=True because we could both have empty string and None as a value. We should change it to blank=True

Avoid using null on string-based fields such as CharField and TextField. If a string-based field has null=True, that means it has two possible values for “no data”: NULL, and the empty string. In most cases, it’s redundant to have two possible values for “no data;” the Django convention is to use the empty string, not NULL. on this part of documentation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in e8af033

@leportella
Copy link
Collaborator

@cuducos I think you created a nice structure and that will be easier to use. I just left you a comment about the models, and some suggestions I've been thinking about :)

@leportella
Copy link
Collaborator

@cuducos can you please fix the codeclimate issue so we can merge this?

@leportella leportella merged commit 94aa509 into master Aug 8, 2018
@leportella leportella deleted the cuducos-import-donation-data branch August 8, 2018 11:48
@cuducos
Copy link
Contributor Author

cuducos commented Aug 8, 2018

@cuducos can you please fix the codeclimate issue so we can merge this?

Done in 9932c0b but TBH I tend to value PEP8 in a more flexible way, as Raymond Hettinger argues ; )

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants