/probot/stats endpoint is not scalable #380

Closed
bkeepers opened this issue Dec 21, 2017 · 15 comments

@bkeepers
Contributor

At app startup, the /probot/stats endpoint fetches all the repositories for all the installations that the app is installed on. This endpoint is used by probot.github.io to show the number of installations and show some of the more popular orgs using an app.

It is frequently timing out on stale, which causes other tests to fail.

@tcbyrd
Contributor

tcbyrd commented Jan 14, 2018

I was working on an option that uses GraphQL here, but I basically ended up recreating a lot of the logic, and along the way I noticed something. If the idea is that we're refreshing the stats on an interval and caching them, I'm not sure this part is right...

robot.router.get('/probot/stats', async (req, res) => {
  // ensure stats are loaded
  await initializing
  res.json(stats)
})

If I'm understanding it correctly, that means every time the page is loaded, it will wait until another refresh is done. I removed the await initializing line on my local system and it still gets the stats when the server is first loaded. Wouldn't that mean even without a refresh interval the stats would only be as old as the last dyno restart? Then hitting probot/stats would just return the latest results.

Also, what if we used the Search API to get the top 10 repos sorted by stars? We'd still need to use github.apps.getInstallationRepositories() to construct the query for the search (and I don't know what the limit is on how many repositories you can put in a search query), but that would move some of the processing for sorting and reducing to GitHub's API.

@bkeepers
Contributor Author

bkeepers commented Feb 3, 2018

@tcbyrd Sorry for the delay here.

> If I'm understanding it correctly, that means every time the page is loaded, it will wait until another refresh is done.

This logic is confusing, but it actually only waits for the stats to load the first time. Once the first load resolves, await initializing essentially becomes a no-op.
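
To illustrate the point, here is a minimal sketch of the pattern, not the actual Probot source. The loadStats function and its return value are hypothetical stand-ins for the real startup fetch; only the names initializing and stats come from the snippet quoted above.

```javascript
// Hypothetical stand-in for the expensive startup fetch; it counts its own calls.
let loads = 0;
async function loadStats() {
  loads += 1;
  return { installations: 42 };
}

// Kick off the load once, at startup, and keep the promise around.
let stats;
const initializing = loadStats().then(result => {
  stats = result;
});

// Every request awaits the same promise. Before the first load finishes this
// waits; afterwards the promise is already resolved, so the await continues
// almost immediately and never triggers another load.
async function handleStatsRequest() {
  await initializing;
  return stats;
}
```

Calling handleStatsRequest() any number of times leaves loads at 1, which is why removing the await only matters for requests that arrive before the first load completes.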

> Also, what if we used the Search API to get the top 10 repos sorted by stars?

I think that's a great solution! We could even get top ~30 or so. As long as it fits in a single request, it'll be much faster than the current approach.

> We'd still need to use github.apps.getInstallationRepositories() to construct the query for the search

Why would it need to get the list of repositories? I think it would work to search for the user/org name. For example, @probot is:public will find public repos on the @probot org sorted by stars.
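
A rough sketch of what that single request could look like, assuming the documented user: search qualifier rather than the @ shorthand used above. The helper name buildPopularReposUrl is made up for illustration, and the sketch only builds the URL; it does not send the request or handle authentication.

```javascript
// Build a Search API URL for one account's public repositories,
// most-starred first. `buildPopularReposUrl` is a hypothetical helper.
function buildPopularReposUrl(login, perPage = 30) {
  const params = new URLSearchParams({
    q: `user:${login} is:public`, // restrict to one user/org, public repos only
    sort: "stars",
    order: "desc",
    per_page: String(perPage)
  });
  return `https://api.github.com/search/repositories?${params}`;
}

console.log(buildPopularReposUrl("probot"));
```

Because the sorting happens on GitHub's side, one page of results per account is enough to pick out the most popular repositories.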

@tcbyrd
Contributor

tcbyrd commented Feb 3, 2018

@bkeepers I thought about that too. It just means we're defining popularity at the Org/User level, whether or not they chose to install the app on all their repositories. This is fine if all we want to know is "Users and Organizations with popular public repos" instead of "Users and Organizations that are using [app] on their public repos". I'm completely fine with the former, since in most cases I think the latter will still be true. If that's the case, then we can definitely axe that call entirely.

@bkeepers
Contributor Author

bkeepers commented Feb 5, 2018

Good point. It would be ideal if the stats only reflected the repositories that the app is installed on, but I don't think it's a big deal either way. The stats endpoint is just for vanity right now, so I think it's more important to make it more scalable than to get precise data.

@JasonEtco
Member

> The stats endpoint is just for vanity right now, so I think it's more important to make it more scalable than to get precise data.

That's fair, but I think there'd be a significant difference in the data returned. Electron for example has a ton of repos; would that count as a ton of installations if the app is installed on one of those repos? Or am I misunderstanding?

@tcbyrd
Contributor

tcbyrd commented Feb 6, 2018

@JasonEtco In that scenario, we wouldn't run the getInstallationRepositories method and just determine popularity based on search results of the user/org. We're only using it for the "Used By" section of the site at the moment, so it's not really that relevant to store anything about the top repositories.

@JasonEtco
Member

Ah I see. I understand why, but it's a shame we can't collect a number of installations; it's something GitHub should have in the UI, because right now the API (and for my lazy self, the /stats endpoint) is the only way to see how many installations your app has. It matters for vanity, but also for knowing how far your app has scaled.

@bkeepers
Contributor Author

bkeepers commented Feb 6, 2018

> Ah I see. I understand why, but it's a shame we can't collect a number of installations

We can still collect the number of installations by paginating GET /app/installations. Even if your app has ~1000 installations, it's still only 10 API calls on startup. The change we're discussing here is to drop the paginated call to fetch all repositories for each installation, which can lead to hundreds of API calls on startup.
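
As a back-of-the-envelope check of that arithmetic, here is a small sketch that counts requests for both strategies. The function names and the example numbers (1000 installations with 5 repositories each) are made up for illustration, and it assumes 100 items per page for both endpoints.

```javascript
// Number of paginated requests needed to list `itemCount` items.
// An empty list still costs one request.
function pagesNeeded(itemCount, perPage = 100) {
  return Math.max(1, Math.ceil(itemCount / perPage));
}

// Compare the two startup strategies for a hypothetical app.
// `repoCounts` holds one entry per installation: how many repositories it covers.
function startupCalls(repoCounts, perPage = 100) {
  const listInstallations = pagesNeeded(repoCounts.length, perPage);
  const listRepositories = repoCounts.reduce(
    (sum, count) => sum + pagesNeeded(count, perPage),
    0
  );
  return {
    installationsOnly: listInstallations,
    withRepositories: listInstallations + listRepositories
  };
}

// Made-up numbers: 1000 installations, 5 repositories each.
const example = startupCalls(new Array(1000).fill(5));
console.log(example); // { installationsOnly: 10, withRepositories: 1010 }
```

The per-installation repository listing costs at least one request per installation no matter how few repositories each one has, which is what makes it the expensive part.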

> it's something GitHub should have in the UI

cc @jeffrafter @chobberoni @tarebyte since I briefly discussed the desire for App status with you a few weeks ago

@stale

stale bot commented Apr 7, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Apr 7, 2018
@stale stale bot closed this as completed Apr 14, 2018
@dessant
Contributor

dessant commented Sep 7, 2018

Could this be reopened and pinned?

@hiimbex hiimbex reopened this Sep 7, 2018
@stale stale bot removed the wontfix label Sep 7, 2018
@hiimbex hiimbex added the pinned label Sep 7, 2018
@SvanBoxel

Can I assume this is the reason that the installation counter for delete-merged-branch is stuck for more than a month?

probot/stats gives a 404 and this is returned in the log:

11/11 09:01 PM (3m)
21:01:03.344Z ERROR http: (id=3f8cf961-d6d2-4e6e-900d-fb1c402c86df)
11/11 09:01 PM (3m) {
11/11 09:01 PM (3m)  "documentation_url": "https://developer.github.com/v3/#abuse-rate-limits",
11/11 09:01 PM (3m)  "message": "You have triggered an abuse detection mechanism. Please wait a few minutes before you try again."
11/11 09:01 PM (3m) }
11/11 09:01 PM (3m)
11/11 09:01 PM (3m) --
11/11 09:01 PM (3m) HttpError: {
11/11 09:01 PM (3m)  "documentation_url": "https://developer.github.com/v3/#abuse-rate-limits",
11/11 09:01 PM (3m)  "message": "You have triggered an abuse detection mechanism. Please wait a few minutes before you try again."
11/11 09:01 PM (3m) }
11/11 09:01 PM (3m)
11/11 09:01 PM (3m)  at response.text.then.message (/home/nowuser/src/node_modules/@octokit/rest/lib/request/request.js:72:19)
11/11 09:01 PM (3m) at process._tickCallback (internal/process/next_tick.js:68:7)
11/11 09:01 PM (3m)
21:01:03.345Z  INFO http: GET /probot/stats 404 - 0.92 ms (id=3f8cf961-d6d2-4e6e-900d-fb1c402c86df)

@dessant
Contributor

dessant commented Nov 11, 2018

The stats APIs for some of my apps have been timing out, so I've replaced the endpoint with this:

// set these env vars for the app
// DISABLE_STATS=true
// GITHUB_ACCESS_TOKEN=<token>

robot.router.get("/probot/stats", async function(req, res) {
  const octokit = require("@octokit/rest")();
  octokit.authenticate({
    type: "token",
    token: process.env.GITHUB_ACCESS_TOKEN
  });

  let github = await robot.auth();
  const installations = await github.paginate(
    github.apps.getInstallations({ per_page: 100 }),
    res => res.data
  );

  // if (installations.length) {
  //   github = await robot.auth(installations[0].id);
  // }

  const popular = [];

  for (const item of installations) {
    const account = item.account;

    const { data: repos } = await octokit.repos.getForUser({
      username: account.login,
      sort: "updated",
      per_page: 30
    });

    if (!repos.length) {
      continue;
    }

    account.stars = repos.reduce((stars, repository) => {
      return stars + repository.stargazers_count;
    }, 0);
    popular.push(account);
  }

  res.json({
    installations: installations.length,
    popular: popular
      .filter(item => item.stars > 0)
      .sort((a, b) => b.stars - a.stars)
      .slice(0, 10)
  });
});

GITHUB_ACCESS_TOKEN is the personal access token of a GitHub account. A separate token is used because authenticating as an installation to fetch repositories also causes timeouts.

@jakebolam

I see that the probot hosted apps may have solved this issue, but can't find any issues/PRs/code that would indicate how.

Any insight into how this was solved for the apps hosted by probot on heroku?

I'm planning on turning this into a long-running worker task/cron job that caches results for the web worker, and was wondering what approach was taken for other apps.
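
For what it's worth, a minimal sketch of that kind of cache, assuming a single process where a timer stands in for the worker/cron job. The createStatsCache helper and the loadStats argument are hypothetical names, not part of Probot.

```javascript
// Keeps the last successful stats snapshot and refreshes it in the background.
// `loadStats` is any async function that returns the stats object.
function createStatsCache(loadStats) {
  let cached = null;
  let lastUpdated = null;

  async function refresh() {
    try {
      cached = await loadStats();
      lastUpdated = new Date();
    } catch (err) {
      // A failed refresh (timeout, rate limit) keeps the previous snapshot.
    }
  }

  return {
    refresh,
    // The web handler reads the snapshot and never waits on the GitHub API.
    get: () => ({ stats: cached, lastUpdated }),
    // Refresh now, then on a fixed interval; unref() lets the process exit.
    start(intervalMs) {
      refresh();
      const timer = setInterval(refresh, intervalMs);
      timer.unref();
      return timer;
    }
  };
}
```

The route handler would then just return cache.get(), so a slow or rate-limited refresh only makes the numbers older instead of making the request time out.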

@tcbyrd
Contributor

tcbyrd commented Mar 22, 2019

@jakebolam Personally we end up just disabling stats on larger installations. The Marketplace is starting to show the exact same data anyway (number of installations and top customers), so ultimately listing your app on the marketplace gets you the same thing.

@tcbyrd
Contributor

tcbyrd commented Mar 27, 2019

Now that apps can be listed on the Marketplace as "unverified", I'm going to close this issue. The Marketplace listing is the more production-ready solution for larger apps and also includes more data, like unique visitors and conversion rate.
