Skip to content
This repository has been archived by the owner on Feb 1, 2018. It is now read-only.

Conversation

peterbe
Copy link
Contributor

@peterbe peterbe commented Jun 10, 2013

The way this basically works is that each model now lists their keys. But, because the structure of the response (from the middleware) often very different, the way you list keys needs to be very flexible. The simplest example is:

# for crashstats.crashstats.models.CurrentVersions

    API_WHITELIST = (
        'end_date',
        'featured',
        'id',
        'product',
        'release',
        'start_date',
        'throttle',
        'version',
    )

A most complex on can be this:

# for crashstats.crashstats.models.CurrentProducts
    API_WHITELIST = {
        'hits': {
            Cleaner.ANY: (
                'end_date',
                'featured',
                'id',
                'product',
                'release',
                'start_date',
                'throttle',
                'version',
            )
        }
    }

How to scrub data is best explained with an example:

class ReportList(SocorroMiddleware):

    API_CLEAN_SCRUB = (
        ('user_comments', scrubber.EMAIL),
        ('user_comments', scrubber.URL),
    )

"Cleaning" is the only form of scrubbing (apart from removing fields entirely) that is implemented. The cleaning is done by doing replacements with an empty string ''.

Some models have NO whitelisting at all. That's because my gut tells me they'll never ever contain PII. These are:

  • CrashPairsByCrashId
  • Status
  • CrontabberState

There is only one model that is now entirely blacklisted:

  • RawCrash

There are a lot of models now that list ALL fields available. This means I might as well have switched off whitelisting (e.g. setting API_WHITELIST = None) but I felt this is easier to see and it makes it explicit what's going on. Basically, where this does NOT happen I write down what I deliberately exclude. E.g.:

class CommentsBySignature(SocorroMiddleware):
    ...

    API_WHITELIST = {
        'hits': (
            'user_comments',
            'date_processed',
            'uuid',
        ),
        # deliberately not including:
        #    email
    }

    API_CLEAN_SCRUB = (
        ('user_comments', scrubber.EMAIL),
        ('user_comments', scrubber.URL),
    )

NOTE this PR manually puts the API back into action after it was disabled with this commit

@peterbe
Copy link
Contributor Author

peterbe commented Jun 10, 2013

@rhelmer looking at you for the final verdict.

@lauraxt @AdrianGaudebert @brandonsavage Would appreciate feedback from you folks too.

@peterbe
Copy link
Contributor Author

peterbe commented Jun 10, 2013

@AdrianGaudebert note that I change the scrubber quite substantially. It would be inefficient to re-write the dicts with a new copy so I made it default that it changes it in-place instead.

'os_name',
'uuid',
'hangid',
'url',

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks a lot like it would be a URL field that should not be actually exposed. Do I get that wrong?

@adngdb
Copy link
Contributor

adngdb commented Jun 12, 2013

The code looks excellent, as usual.

I think you forgot to add a whitelist to a few services though. From the code that means those services won't be accessible (they raise an APIWhitelistError) and I don't think that's what we want.

Last but not least, I must say I really don't like this whitelist thing. I think it's going to be a pain to maintain. I would be in favor of having a whitelist only when we want to filter something out, and thus have most services just not have one. I do not suspect that we will often add a new PII field to an existing service, and that is our responsibility for making sure that field is secured everywhere.

Yeah, that's exactly the opposite of what we previously agreed on. I wanted to write my feeling about that but I am not going to fight over it. If you guys think this way is fine, let's do it.

@peterbe
Copy link
Contributor Author

peterbe commented Jun 12, 2013

If there are models that lack API_WHITELIST that aren't in the BLACKLIST then we have a problem. It means that those models lack test coverage. Especially through the crashstats/api/views.py view.

Can you see which models in particular they are?

Regarding the vastly verbosity of writing down all these mundane fields, I'll invite Benjamin (who is much more experienced than us combined) to participate.

@peterbe
Copy link
Contributor Author

peterbe commented Jul 10, 2013

Merged cddb33b
:shipit:

@peterbe peterbe closed this Jul 10, 2013
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants