Right now, we have this call at the beginning of several high-frequency routes:
This can do two things:
This caused a big performance issue: we had an empty directory in the repos folder, which made step 1 always indicate that there were new repos, so step 2 ran every time MetaRepo.instance.scan_for_new_repos was called. This is not a hard pathological case to hit -- if you try to add a new repo that happens to be empty, for instance, you'll end up with just such an empty directory in the repos folder.
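To make the failure mode concrete, here's a minimal sketch (hypothetical names, not Barkeep's actual code) of a scan that derives "new repos" from subdirectories of the repos folder. An empty leftover directory is indistinguishable from a freshly added repo, so it's reported as new on every single scan:

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical: report every subdirectory of the repos folder that we don't
# already know about. This is step 1 as described above.
def new_repo_candidates(repos_root, known_repo_names)
  Dir.children(repos_root).select do |entry|
    File.directory?(File.join(repos_root, entry)) && !known_repo_names.include?(entry)
  end
end

Dir.mktmpdir do |repos_root|
  FileUtils.mkdir(File.join(repos_root, "barkeep"))     # a real, tracked repo
  FileUtils.mkdir(File.join(repos_root, "empty_repo"))  # leftover empty directory

  known = ["barkeep"]
  # "empty_repo" shows up as new on *every* scan, so the expensive step 2
  # (loading a Grit instance per repo) runs on every request.
  puts new_repo_candidates(repos_root, known).inspect  # => ["empty_repo"]
end
```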
More generally, we should try to remove as much disk-touching as possible from the critical path of important frequently-called routes (api calls, /saved_search/:id, etc).
We can debate more in person. I added this because without it, we have poor usability around new or deleted repos -- Commit objects will exist in the database and our code assumes commit.grit_commit is always valid, but it can be nil if we haven't scanned for new repos. I considered scanning for new repos when we discover that commit.grit_commit is nil, but there are enough code paths that access metarepo.grit_commit that I thought the current implementation was of similar complexity.
The common case where there are no new repos is very light and won't actually hit the disk -- just the OS's disk cache. Was the empty directory detection broken by 792bc88?
This check can be made asynchronous in another thread, but it needs to be done in-process. The actual work is loading a Grit instance for each repo, and that needs to happen in every one of our running processes -- all unicorn workers and all Resque workers.
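Something like the following sketch (hypothetical class, not actual code) would move the scan off the request path while keeping it in-process -- each unicorn or Resque worker would start one of these after forking:

```ruby
# Hypothetical: run the (potentially slow) repo scan on a background thread
# inside each process, instead of at the head of every route.
class BackgroundScanner
  def initialize(interval, &scan)
    @interval = interval
    @scan = scan
    @mutex = Mutex.new
    @runs = 0
  end

  def start
    @thread = Thread.new do
      loop do
        @scan.call                        # e.g. MetaRepo.instance.scan_for_new_repos
        @mutex.synchronize { @runs += 1 }
        sleep @interval
      end
    end
  end

  def stop
    @thread && @thread.kill
  end

  def runs
    @mutex.synchronize { @runs }
  end
end

scanner = BackgroundScanner.new(0.05) { } # no-op scan for demonstration
scanner.start
sleep 0.2
scanner.stop
puts scanner.runs >= 1  # the scan ran repeatedly without blocking any request
```

The tradeoff is that a route can still briefly see a stale repo list between scans, so this only works if callers tolerate commit.grit_commit being nil for a short window.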
In the current usage, metarepo.scan_for_new_repos should be at the head of any route which loads commit information. The fact that it's not in /commits is an oversight.
Can't we do the scan in Commit#grit_commit whenever we would return nil? Any solution here that requires us to figure out whether a route might load commit information and remember to add the scan method at the beginning has a smell to me.
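A minimal sketch of that lazy approach (hypothetical names, with a stub standing in for MetaRepo): when the lookup would return nil, trigger one rescan and retry, so no route has to remember to call scan_for_new_repos up front.

```ruby
# Hypothetical: Commit#grit_commit falls back to a rescan instead of
# returning nil straight away.
class LazyCommit
  def initialize(meta_repo, sha)
    @meta_repo = meta_repo
    @sha = sha
  end

  def grit_commit
    commit = @meta_repo.lookup(@sha)
    return commit if commit
    @meta_repo.scan_for_new_repos  # the repo may have appeared since the last scan
    @meta_repo.lookup(@sha)        # still nil only if the commit truly is gone
  end
end

# Stub standing in for MetaRepo; the real one wraps Grit repositories.
class StubMetaRepo
  def initialize
    @known = {}
  end

  def lookup(sha)
    @known[sha]
  end

  def scan_for_new_repos
    @known["abc123"] = :grit_commit  # pretend the scan found the new repo
  end
end

commit = LazyCommit.new(StubMetaRepo.new, "abc123")
puts commit.grit_commit.inspect  # => :grit_commit -- found via the lazy rescan
```

Note the downside mentioned above still applies: a commit that genuinely no longer exists pays for a full rescan on every lookup, which is its own pathological case.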
This check seems fragile, because there are lots of ways it can be tricked into thinking something changed on every scan. A repo delete could fail partway and leave a directory lying around. The repo delete could remove the directory but fail before deleting the entry from the git_repos table. Somebody could just mkdir foo. Any of these will cause the same nasty performance problem. As an alternative approach, what about adding repos by first cloning, then adding the repo to the DB, and then "scanning for new repos" in the workers by just checking whether any new entries have appeared in the git_repos table? That way we wouldn't depend on the file structure of the repos folder as the source of truth for the list of repos in the system.
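The DB-driven alternative could look something like this sketch (hypothetical classes; a struct stands in for the git_repos table). A stray mkdir foo can no longer fool it, because only rows in the table count:

```ruby
require "set"

# Hypothetical: workers diff the git_repos table against the set of repos
# they have already loaded, instead of listing the repos directory.
class DbDrivenScanner
  def initialize(db)
    @db = db              # stand-in for the git_repos table
    @loaded = Set.new
  end

  def scan_for_new_repos
    new_names = @db.repo_names - @loaded.to_a
    new_names.each do |name|
      # load_grit_repo(name) would go here in the real implementation
      @loaded << name
    end
    new_names
  end
end

FakeDb = Struct.new(:repo_names)
db = FakeDb.new(["barkeep"])
scanner = DbDrivenScanner.new(db)
puts scanner.scan_for_new_repos.inspect  # => ["barkeep"]
db.repo_names << "new_repo"
puts scanner.scan_for_new_repos.inspect  # => ["new_repo"]
puts scanner.scan_for_new_repos.inspect  # => []
```

Ordering matters here: clone first, insert the DB row last, so a row never exists without a usable clone on disk.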
Anyway, now that I see your reasoning here, I'll give some thought to it. Tomorrow I'll see if 792bc88 changed the behavior in an undesirable way, and add more comments here if I come to some more conclusions.
If we move this logic into MetaRepo, it looks like there are 4 places that need to be patched. We can probably reduce this number by refactoring client code:
Actually, any method in MetaRepo that uses @repo_names_and_ids_to_repos, @repo_name_to_id or @repos.
We had this same issue bite us again. ugh...