Handle empty repo case #85

coni2k · 2021-02-18T09:39:15Z

While running the script, I ran into these repos: they pass the filter because of their high star counts, but they're actually empty, so the script throws an exception:
https://github.com/fossasia/libregraphics.asia
https://github.com/libredesktop/libredesktop-events
https://github.com/libredesktop/libredesktop-project-list
https://github.com/libredesktop/LibreDesktop-Specs
https://github.com/meilix/arch-meilix
https://github.com/meilix/deb-meilix
https://github.com/meilix/meilix-addons
https://github.com/meilix/meilix-art
https://github.com/meilix/meilix-connect
https://github.com/meilix/meilix-web
https://github.com/susiai/susi_partners
https://github.com/susiai/susi_sdk
https://github.com/ascoders/blog
https://github.com/bigdongdongCLUB/newGCP
https://github.com/koush/support-wiki
https://github.com/mariobehling/ai-packages
https://github.com/mariobehling/mb-sandbox
https://github.com/meilix/meilix-docs
https://github.com/paulirish/devtools-addons
https://github.com/QingDaoIT/BlackList
https://github.com/zhengzhouqiuzhi/zhengzhouqiuzhi

To handle it, for GitLab, checking the commits length was enough:

if len(repo.commits.list()) == 0:

For GitHub, I couldn't find a proper way to tell whether a repo is empty. Calling "get_commits().totalCount" on an empty repo already throws an exception, so I force that exception by assigning "totalCount" to an unused variable (printing the value would work as well?). It's not an ideal solution, so let me know what you think.

try:
	repo = get_github_auth_token().get_repo(repo_url)
	# Validate whether repo is empty; if it's empty, calling totalCount throws a 409 exception
	total_commits = repo.get_commits().totalCount
except github.GithubException as exp:
	if exp.status == 404 or exp.status == 409:
		return None
return GitHubRepository(repo)

Another remark is that we're spending one more request from our rate limit when calling "get_commits()" to make this validation. I only tested this for GitHub, but I'm assuming it's the same for GitLab as well.
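
As a rough illustration of the probe above, here is a minimal self-contained sketch. It does not call the real PyGithub API: FakeApiError, FakeCommits, FakeRepo and is_usable are hypothetical stand-ins that only mimic the behaviour described in this comment (an empty repo answering the commit count with a 409). It also suggests that a bare expression is enough to trigger the lookup, so neither an unused variable nor a print should be needed.

```python
class FakeApiError(Exception):
    """Hypothetical stand-in for github.GithubException; only carries a status."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


class FakeCommits:
    """Hypothetical stand-in for the object returned by get_commits()."""

    def __init__(self, count=0, fail_status=None):
        self._count = count
        self._fail_status = fail_status

    @property
    def totalCount(self):
        # Assumption taken from the comment above: an empty repo fails here.
        if self._fail_status is not None:
            raise FakeApiError(self._fail_status)
        return self._count


class FakeRepo:
    """Hypothetical repo object exposing only get_commits()."""

    def __init__(self, count=0, fail_status=None):
        self._commits = FakeCommits(count, fail_status)

    def get_commits(self):
        return self._commits


def is_usable(repo):
    """Return False when the commit probe fails with 404 or 409."""
    try:
        repo.get_commits().totalCount  # evaluated only to trigger the lookup
    except FakeApiError as exp:
        if exp.status in (404, 409):
            return False
        raise  # any other status is unexpected, so let it surface
    return True
```

Re-raising other statuses is a choice made for this sketch; the snippet above simply falls through for them.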

Alternatively, we could make all these calls before initializing the repo, do the validations, and pass the results to the repo object as arguments. This would also help us reduce the number of calls to the API, but making these changes would take some time.

To be able to test my changes, I created empty repos on both GitHub & GitLab btw:
https://github.com/coni2k/empty-repo
https://gitlab.com/coni2k/empty-repo

Last, I also added this bit to the "generate" script; otherwise it fails when there are no processed repos:

if len(stats) == 0:
    return
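
The thread does not show how the "generate" script uses stats, so the following is only a sketch under an assumption: that the output header is derived from the first row, which would fail on an empty list. write_stats and the sample field names are hypothetical.

```python
import csv
import io


def write_stats(stats, out):
    """Write a list of dict rows as CSV; do nothing when the list is empty."""
    if len(stats) == 0:
        return False  # nothing was processed, so there is no header to derive
    writer = csv.DictWriter(out, fieldnames=list(stats[0].keys()))
    writer.writeheader()
    writer.writerows(stats)
    return True


# Usage: an empty list leaves the output untouched instead of raising.
buffer = io.StringIO()
wrote_anything = write_stats([], buffer)
```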

inferno-chromium · 2021-02-20T05:03:40Z

criticality_score/run.py

@@ -541,6 +543,8 @@ def get_repository(url):
        repo_url_encoded = urllib.parse.quote_plus(repo_url)
        try:
            repo = token_obj.projects.get(repo_url_encoded)
+            if len(repo.commits.list()) == 0:


How slow is this call?

How about making these "get commits" calls in advance?

I added a "last_commit" property to the Repository class. Before initializing the repo, we make these "get commits" calls and do the validations; if all is good, we create the repo, passing the last commit as a parameter.

Since we only pass "last_commit" as a parameter, it feels a bit strange, but with this approach we still do the validations without making additional calls.

Let me know what you think.

inferno-chromium · 2021-02-20T05:04:31Z

criticality_score/run.py

@@ -530,8 +530,10 @@ def get_repository(url):
        repo = None
        try:
            repo = get_github_auth_token().get_repo(repo_url)
+            # Validate whether repo is empty; if it's empty, calling totalCount throws a 409 exception
+            total_commits = repo.get_commits().totalCount


I don't like adding another API call, since it counts against our quota. Is there an exception we can catch somewhere and bail out from further processing?

- Update empty repo validation

inferno-chromium · 2021-02-21T20:14:50Z

criticality_score/run.py

        try:
            repo = get_github_auth_token().get_repo(repo_url)
+            last_commit = repo.get_commits()[0]


What I meant was to compute last_commit inside get_repository_stats and bail out there. You can do it by making the check the first access to last_commit:

def get_repository_stats(repo, additional_params=None):
    if not repo.last_commit:
        return None

and then define

@property
def last_commit(self):
    if self._last_commit:
        return self._last_commit
    self._last_commit = self._repo.get_commits()[0]
    return self._last_commit

No need to calculate it here, it is better to do this calculation inside the class function itself.
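
To make the intent of the cached property concrete, here is a small sketch that counts how often the underlying lookup runs. The Repository class, fetch_commits and the commit strings are hypothetical; fetch_commits stands in for whatever API call returns the commit list.

```python
class Repository:
    """Hypothetical minimal wrapper; fetch_commits stands in for the API call."""

    def __init__(self, fetch_commits):
        self._fetch_commits = fetch_commits
        self._last_commit = None

    @property
    def last_commit(self):
        # Reuse the cached value when one has already been fetched.
        if self._last_commit:
            return self._last_commit
        self._last_commit = self._fetch_commits()[0]
        return self._last_commit


calls = []


def fetch_commits():
    calls.append(1)  # record each simulated API call
    return ["c3", "c2", "c1"]


repo = Repository(fetch_commits)
first = repo.last_commit
second = repo.last_commit
```

In this sketch both accesses return the same commit while fetch_commits runs only once, which is the quota saving the review is after.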

coni2k · 2021-02-21T21:34:20Z

I see, that will be much clearer of course.

I still kept the validation inside of "get_repository" function, but if you specifically want it, I can move it to "get_repository_stats" as well?

inferno-chromium · 2021-02-21T21:49:03Z

Sorry, I did mean get_repository_stats, so yes, let's make that bailout there.

coni2k · 2021-02-21T22:01:00Z

Okay, hopefully the last question: if the output is empty, do you want to show a "repo is empty" error message in both the "run" & "generate" scripts?

logger.error(f'Repo is empty: {args.repo}')

coni2k · 2021-02-21T22:14:05Z

Okay, I kept the error message for the moment, so you can see how it looks

inferno-chromium · 2021-02-22T00:16:16Z

criticality_score/run.py

+    try:
+        if not repo.last_commit:
+            return None
+    except Exception:


Can there be an exception here? If not, just remove the try/except.

inferno-chromium · 2021-02-22T00:17:07Z

criticality_score/generate.py

                    break
                output = run.get_repository_stats(repo)
+                if not output:
+                    logger.error(f'Repo is empty: {repo_url}')


This logger.error is repeated in 2 places; it can just be moved inside get_repository_stats:

if not repo.last_commit:
    logger.error(f'Repo is empty: {repo.url}')
    return None
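
A minimal sketch of that suggestion, assuming a repo object that exposes url and last_commit. StubRepo, collect and the returned dict are hypothetical; the point is only that callers filter on None and no longer log anything themselves.

```python
import logging

logger = logging.getLogger("empty_repo_sketch")


class StubRepo:
    """Hypothetical stand-in exposing only what the check needs."""

    def __init__(self, url, last_commit):
        self.url = url
        self.last_commit = last_commit


def get_repository_stats(repo):
    # The single place where the empty-repo case is reported.
    if not repo.last_commit:
        logger.error(f'Repo is empty: {repo.url}')
        return None
    return {'url': repo.url, 'last_commit': repo.last_commit}


def collect(repos):
    """Caller side: skip repos whose stats came back as None."""
    return [stats for stats in map(get_repository_stats, repos) if stats]
```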

inferno-chromium · 2021-02-22T00:17:31Z

criticality_score/run.py

        return
    output = get_repository_stats(repo, args.params)
+    if not output:


See comment above, this can be moved inside get_repository_stats.

- Move try/catch block to GitHub last_commit - Prevent exception for GitLab last_commit

coni2k · 2021-02-22T08:12:55Z

If the repo is empty, GitHub throws an exception in any case, so I added a try/catch block only for the GitHub last_commit property:

@property
def last_commit(self):
	if self._last_commit:
		return self._last_commit
	try:
		self._last_commit = self._repo.get_commits()[0]
	except Exception:
		pass
	return self._last_commit

For GitLab, last_commit no longer throws an exception. It returns the last commit only if the list has an item in it (I used next(iter()) to do this):

@property
def last_commit(self):
	if self._last_commit:
		return self._last_commit
	self._last_commit = next(iter(self._repo.commits.list()), None)
	return self._last_commit

So, the validation under "get_repository_stats" is now like this (no try/catch here):

if not repo.last_commit:
	logger.error(f'Repo is empty: {repo.url}')
	return None
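
The two lookups can be compared side by side in a small sketch. Nothing here touches the real GitHub or GitLab clients: EmptyGitHubCommits, github_last_commit and gitlab_last_commit are hypothetical names, and the error raised on indexing only imitates the behaviour reported in this thread.

```python
class EmptyGitHubCommits:
    """Hypothetical stand-in: indexing fails, as reported for an empty GitHub repo."""

    def __getitem__(self, index):
        raise RuntimeError("simulated error for an empty repository")


def github_last_commit(commits):
    # Indexing may raise for an empty repo, so the failure is mapped to None.
    try:
        return commits[0]
    except Exception:
        return None


def gitlab_last_commit(commits):
    # next() with a default returns None instead of raising StopIteration.
    return next(iter(commits), None)
```

One thing to be aware of: because the cache check is "if self._last_commit:", a None result is not remembered, so an empty repo would be looked up again on each access. Whether that matters depends on how often the property is read.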

Handle empty repo case

46ce66b

inferno-chromium requested changes Feb 20, 2021

- Add last_commit prop to prevent double API calls

55ec68d

- Update empty repo validation

inferno-chromium requested changes Feb 21, 2021

Move last_commit logic inside of the class

7c969fc

Move the validation to get_repository_stats

5b9784c

inferno-chromium approved these changes Feb 22, 2021

- Move error messages to the validation block

4d21047

- Move try/catch block to GitHub last_commit - Prevent exception for GitLab last_commit

inferno-chromium approved these changes Feb 22, 2021

inferno-chromium merged commit 01983d1 into ossf:main Feb 22, 2021

coni2k deleted the empty-repo branch February 22, 2021 18:23
