This repository has been archived by the owner on Jul 23, 2020. It is now read-only.

Sync NPM data to graph #2047

Closed
1 task done
msrb opened this issue Jan 25, 2018 · 6 comments

Comments

@msrb
Collaborator

msrb commented Jan 25, 2018

Description

Not all NPM data made it all the way to graph. We need to sync the missing data.

Acceptance criteria

  • all NPM data from production S3 buckets are available in graph

@msrb added this to the Analytics Backlog milestone Jan 25, 2018
@qodfathr added this to To Do in OpenShift.io Plan Jan 25, 2018
@qodfathr moved this from To Do to In progress in OpenShift.io Plan Jan 25, 2018
@qodfathr moved this from In progress to Needs Parent in OpenShift.io Plan Jan 25, 2018
@msrb modified the milestones: Analytics Backlog, Sprint 145 Feb 14, 2018
@tuxdna self-assigned this Feb 23, 2018

@sivaavkd
Collaborator

sivaavkd commented Mar 1, 2018

@tuxdna - are we targeting this for the current sprint? cc @msrb

@msrb
Collaborator Author

msrb commented Mar 1, 2018

@sivaavkd yeah, the plan is to have the NPM data in graph by the end of the sprint. The sync process is already running and @tuxdna is keeping an eye on it. There is also a parallel effort to make the sync faster.

@tuxdna
Collaborator

tuxdna commented Mar 4, 2018

The graph sync has been running for a few days now, and we have 92695 NPM package versions in the pending list. No more package versions are going through the sync.

Known reasons for this:

  • package-level data (prod-bayesian-core-package-data) is not present in S3 even though package-version data (prod-bayesian-core-data) is present. For example, arachne-ui has 75 versions but no package-level data.
  • there is no data at all in S3 for a particular package
  • other reasons we haven't figured out yet.

Essentially, for NPM, the graph sync has gone as far as it can in the current state.
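
As a rough illustration of the known reasons above, here is a minimal sketch that buckets pending package versions by reason. Everything in it is an assumption made for illustration: the function name, the reason labels, the way S3 contents are represented as in-memory sets, and the version strings and second package name are all hypothetical and do not reflect the actual graph-sync code or S3 key layout.

```python
from collections import Counter


def classify_pending(package, version, version_keys, package_keys):
    """Label why a pending (package, version) pair may not have synced.

    version_keys: set of (package, version) pairs that have version-level data.
    package_keys: set of package names that have package-level data.
    Both sets are hypothetical stand-ins for listings of the two S3 buckets.
    """
    has_version_data = (package, version) in version_keys
    has_package_data = package in package_keys
    if not has_version_data and not has_package_data:
        return "no_data_in_s3"
    if has_version_data and not has_package_data:
        return "missing_package_data"
    # Anything else falls under "reasons we haven't figured out yet".
    return "unknown"


# Made-up sample data; the version numbers are placeholders.
version_keys = {("arachne-ui", "0.0.1"), ("example-pkg", "1.0.0")}
package_keys = {"example-pkg"}
pending = [
    ("arachne-ui", "0.0.1"),    # version data present, package data missing
    ("absent-pkg", "0.1.0"),    # nothing in either set
    ("example-pkg", "1.0.0"),   # both present, still pending
]

reasons = Counter(
    classify_pending(pkg, ver, version_keys, package_keys) for pkg, ver in pending
)
print(dict(reasons))
```

A breakdown like this would show how much of the pending list falls under each known reason and how much is still unexplained.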

@tuxdna
Collaborator

tuxdna commented Mar 5, 2018

Here is a summary of package version counts from the BookKeeping and GraphSync APIs:

From: GET /bookkeeping

{
  "summary": [
    {
      "name": "rubygems",
      "package_count": 0,
      "package_version_count": 0
    },
    {
      "name": "npm",
      "package_count": 175658,
      "package_version_count": 416920
    },
    {
      "name": "maven",
      "package_count": 178756,
      "package_version_count": 635159
    },
    {
      "name": "pypi",
      "package_count": 104166,
      "package_version_count": 273733
    },
    {
      "name": "go",
      "package_count": 84825,
      "package_version_count": 102239
    },
    {
      "name": "crates",
      "package_count": 0,
      "package_version_count": 0
    },
    {
      "name": "nuget",
      "package_count": 94114,
      "package_version_count": 199632
    }
  ]
}

From: GET /graphsync/pending/npm

{"data": {
    "all_counts": 92715,
    "pending_list": [{ ... }]}}

From the above, NPM has 416920 package versions in total, of which 92715 are not yet synced to graph for the reasons mentioned in the earlier comment. This gives us 77.76% coverage of NPM data in graph, by package-version count:

>>> (1 - 92715 / 416920) * 100
77.76192075218266
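
The same calculation can be done directly on the two responses shown above. This is only a sketch: the helper name `coverage_percent` is made up, the dictionaries are trimmed copies of the JSON bodies quoted in this comment, and it assumes `all_counts` is the number of package versions not yet synced.

```python
def coverage_percent(bookkeeping, pending, ecosystem):
    """Return the percentage of package versions already synced to graph.

    bookkeeping: parsed body of GET /bookkeeping
    pending:     parsed body of GET /graphsync/pending/<ecosystem>
    Returns None if the ecosystem has no package versions at all.
    """
    total = next(
        entry["package_version_count"]
        for entry in bookkeeping["summary"]
        if entry["name"] == ecosystem
    )
    if total == 0:
        return None
    not_synced = pending["data"]["all_counts"]
    return (1 - not_synced / total) * 100


# Trimmed copies of the responses quoted above.
bookkeeping = {
    "summary": [
        {"name": "rubygems", "package_count": 0, "package_version_count": 0},
        {"name": "npm", "package_count": 175658, "package_version_count": 416920},
    ]
}
pending = {"data": {"all_counts": 92715, "pending_list": []}}

npm_coverage = coverage_percent(bookkeeping, pending, "npm")
print(round(npm_coverage, 2))
```

The zero check matters because ecosystems such as rubygems report a package version count of 0, which would otherwise cause a division by zero.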

@msrb
Collaborator Author

msrb commented Mar 5, 2018

@tuxdna

"package data (prod-bayesian-core-package-data) for a particular version is not present in S3 but package version does (prod-bayesian-core-data) have data in S3. For example there are 75 versions for arachne-ui but there is no package data present"

I don't think that having no package-level data should be an issue for graph import. In the case of arachne-ui, it looks like an error in the ingestion pipeline, but in general, there can be components for which we simply won't have any package-level data.

Could you please check how data-importer behaves when there is no package-level data? And file an issue if it fails :) Thanks 😉

@msrb closed this as completed Mar 5, 2018

@tuxdna
Collaborator

tuxdna commented Mar 6, 2018

@msrb: Sure, will check and update.
