Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor ImpactedPackage and ResolvedPackage #219

Merged
merged 5 commits into from
Jul 22, 2020

Conversation

sbs2001
Copy link
Collaborator

@sbs2001 sbs2001 commented Jun 28, 2020

Changes are made in import_runner.py to refactor the models.
Other major changes in import_runner.py is to change the _process_updated_advisories
method to use less db queries and increase performance by several times.

Having a single table for these, fixes the following:

Issue 1 :

You can do this, which doesn't make any sense.

In [1]: from vulnerabilities import models                                                                                   

In [2]: v1 = models.Vulnerability.objects.create(cve_id="CVE-foo")                                                           

In [3]: p1 = models.Package.objects.create(name="cream",type="ice",version='mango') 
   
In [4]: vp1 = models.ImpactedPackage.objects.create(vulnerability=v1, package=p1)                                            

In [5]: vp2 = models.ResolvedPackage.objects.create(vulnerability=v1, package=p1)   

This is pure garbage, nothing can be interpreted from these entries.

With a single table + flag, we can use a unique_together=('vulnerability','package')

Issue 2 :

Updating the vulnerability status of a package is not possible due to cascade deletes.

Check https://github.com/nexB/vulnerablecode/blob/58d0376e7319d06387662cb393f3c39d9893088d/vulnerabilities/import_runner.py#L121 .

And also check for more details https://github.com/nexB/vulnerablecode/blob/58d0376e7319d06387662cb393f3c39d9893088d/vulnerabilities/tests/test_import_runner.py#L201

Signed-off-by: Shivam Sandbhor shivam.sandbhor@gmail.com

@sbs2001 sbs2001 requested a review from haikoschol June 28, 2020 08:20
vulnerabilities/import_runner.py Outdated Show resolved Hide resolved
vulnerabilities/import_runner.py Outdated Show resolved Hide resolved
@sbs2001
Copy link
Collaborator Author

sbs2001 commented Jun 30, 2020

@haikoschol I have pushed a commit to stop using ignore_conflicts=True and properties and magic methods. It's still faster than the current implementation, but not as fast as the one with properties.

I think I can acheive the same results as the one with properties, if I could remove duplicates of model instances(these dont yet have pk ) from a list. Do you have any ideas to do that ?

@haikoschol
Copy link
Collaborator

I think I can acheive the same results as the one with properties, if I could remove duplicates of model instances(these dont yet have pk ) from a list. Do you have any ideas to do that ?

Have you tried the naive approach of just iterating the list and comparing all fields?

@sbs2001
Copy link
Collaborator Author

sbs2001 commented Jul 1, 2020

@haikoschol

Have you tried the naive approach of just iterating the list and comparing all fields?

That is a valid approach, I instead went with creating a dataclass and put it into a set and then instantiate model instances. I've tested it for alpine, debian, rust, openssl importers, all works fine.

Could you suggest an OK name for the dataclass ?

…ackage_Relation

Changes are made in import_runner.py to refactore the models.
Other major changes in import_runner.py is to change the _process_updated_advisories
method to use less db queries and increase performance by several times.

Signed-off-by: Shivam Sandbhor <shivam.sandbhor@gmail.com>
Signed-off-by: Shivam Sandbhor <shivam.sandbhor@gmail.com>
Signed-off-by: Shivam Sandbhor <shivam.sandbhor@gmail.com>
Signed-off-by: Shivam Sandbhor <shivam.sandbhor@gmail.com>
@sbs2001 sbs2001 changed the title [WIP] Refactor ImpactedPackage and ResolvedPackage Refactor ImpactedPackage and ResolvedPackage Jul 9, 2020
@sbs2001 sbs2001 requested a review from pombredanne July 9, 2020 17:01
Copy link
Collaborator

@pombredanne pombredanne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!
I commented mostly here, but I think I would still prefer concrete types (e.g. a real ImpactedPackage and ResolvedPackage class) that a generic type and an attribute.



@dataclasses.dataclass(frozen=True)
class VulnerabilityReference_inserter:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use TitleCase for classes, not a mix of Title_and_snake case

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done



@dataclasses.dataclass(frozen=True)
class PackageRelatedVulnerability_inserter:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above and elsewhere: stick to TitleCase for classes and snake_case elsewhere

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

self.importer.data_source_cfg = dataclasses.asdict(data_source.config)
self.importer.save()

logger.debug(f'Successfully finished import for {self.importer.name}.')


def _process_updated_advisories(data_source: DataSource) -> None:
"""
TODO: Break this method into smaller functions
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this comment still applies?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

bulk_create_vuln_pkg_refs = set()
for batch in data_source.updated_advisories():
for advisory in batch:

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this blank line?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:) Removed

vuln, vuln_created, advisory.resolved_package_urls, is_vulnerable=False)
bulk_create_vuln_pkg_refs.update(inew_refs.union(rnew_refs))

models.VulnerabilityReference.objects.bulk_create(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What benefit do you get from calling bulk_create vs creating as you go above?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had this discussion in the chat. I think this is justified

packages = models.Package.objects.bulk_create(packages)
impacted: List[PackageURL],
resolved: List[PackageURL]
) -> Tuple[Dict[PackageURL, int], Dict[PackageURL, int]]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you enabled type checking at least in the CI?
Otherwise, type hints are not always helpful and if the types are not checked somehow, they can be misleading and i most case they are not conducive to improve readability... So let's not get carried away with type hints for the sake of them. Instead enter a ticket so we can discuss and plan for their possible usage

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. In the CI it would be something along the lines of running a simple Mypy check right ? That probably won't be much of pain.

IMHO using typing help in most cases to atleast know what's going in and out, way better than comments, but that's just my opinion. Looking at this again I feel rather ashamed of this case though, it really seems I've used typing in a rather cryptic way.

Reverting this method, to the old one which was readable.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the ticket #226 :)

impacted_packages[purl] = pkg
elif purl in resolved:
resolved_packages[purl] = pkg
impacted_packages = dict(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the zip here, but is there a simpler way to get there?
I am always nervous when we start using "parallel lists" to later zip them.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted to previous approach, which was more readable.


return impacted_packages, resolved_packages


def _bulk_insert_impacted_and_resolved_packages(
batch: Set[Advisory],
vulnerabilities: Set[models.Vulnerability],
impacted_packages: Dict[PackageURL, models.Package],
resolved_packages: Dict[PackageURL, models.Package],
impacted_packages: Dict[PackageURL, int],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this int about?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overenginnering/pre-mature optimization . The idea was to use Dict[PackageURL, int] where the int is pk of Package , to save some memory because these dictionaries would grow big very fast. Carrying the Objects is way more expensive than pks so I had used it .

Reverting this.

vulnerability=vuln,
package=p,
package_id=p,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is p a package or a package_id?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverting this. p is a package.


for resolved_purl in advisory.resolved_package_urls:
# TODO Figure out when/how it happens that a package is missing from the dict and fix it
p = resolved_packages.get(resolved_purl)
if not p:
p = _package_url_to_package(resolved_purl)
p.save()
resolved_packages[resolved_purl] = p
resolved_packages[resolved_purl] = p.id
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the benefits of using ids vs objects there?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned above this is :

Overenginnering/pre-mature optimization . The idea was to use Dict[PackageURL, int] where the int is pk of Package , to save some memory because these dictionaries would grow big very fast. Carrying the Objects is way more expensive than pks so I had used it .

Reverting this.

@pombredanne
Copy link
Collaborator

@sbs2001 after our discussion I am not fine with the merging of ImpactedPackage and ResolvedPackage in a single model. :)
You call for a simpler DB-level referential integrity makes sense!

Signed-off-by: Shivam Sandbhor <shivam.sandbhor@gmail.com>
This was referenced Jul 16, 2020
Copy link
Collaborator

@pombredanne pombredanne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some more comments

vulnerabilities/import_runner.py Show resolved Hide resolved
for batch in data_source.updated_advisories():
for advisory in batch:
vuln, _ = _get_or_create_vulnerability(advisory)
def _create_vulnerability_and_references(advisory: Advisory):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just asking :)

vulnerabilities/import_runner.py Show resolved Hide resolved
@pombredanne
Copy link
Collaborator

At this stage since this is blocking further updates, I am merging it with the caveats noted otherwise... 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants