Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle empty repo case #85

Merged
merged 5 commits into from
Feb 22, 2021
Merged

Handle empty repo case #85

merged 5 commits into from
Feb 22, 2021

Conversation

coni2k
Copy link
Contributor

@coni2k coni2k commented Feb 18, 2021

When I was running the script, I bumped into these repos that they fall into the filter due to high number of stars but they're actually empty and the script throws an exception:
https://github.com/fossasia/libregraphics.asia
https://github.com/libredesktop/libredesktop-events
https://github.com/libredesktop/libredesktop-project-list
https://github.com/libredesktop/LibreDesktop-Specs
https://github.com/meilix/arch-meilix
https://github.com/meilix/deb-meilix
https://github.com/meilix/meilix-addons
https://github.com/meilix/meilix-art
https://github.com/meilix/meilix-connect
https://github.com/meilix/meilix-web
https://github.com/susiai/susi_partners
https://github.com/susiai/susi_sdk
https://github.com/ascoders/blog
https://github.com/bigdongdongCLUB/newGCP
https://github.com/koush/support-wiki
https://github.com/mariobehling/ai-packages
https://github.com/mariobehling/mb-sandbox
https://github.com/meilix/meilix-docs
https://github.com/paulirish/devtools-addons
https://github.com/QingDaoIT/BlackList
https://github.com/zhengzhouqiuzhi/zhengzhouqiuzhi

To handle it, for GitLab, checking the commits length was enough:

if len(repo.commits.list()) == 0:

For GitHub, I couldn't find any proper way to understand whether the repo is empty. When we call "get_commits().totalCount", it already throws an exception. What I did is to force it to throw the exception by assigning "totalCount" to an unused variable (I could do it by printing the value as well?). Not an ideal solution, so let me know what you think.

try:
	repo = get_github_auth_token().get_repo(repo_url)
	# Validate whether repo is empty; if it's empty, calling totalCount throws a 409 exception
	total_commits = repo.get_commits().totalCount
except github.GithubException as exp:
	if exp.status == 404 or exp.status == 409:
		return None
return GitHubRepository(repo)

Another remark is that we're spending one more request from our rate limit when calling "get_commits()" to make this validation. I only tested this for GitHub, but I'm assuming it's the same for GitLab as well.

Alternatively, we can make all these calls before initializing the repo, do the validations, and pass them to repo object as arguments? This would also help us reducing the number of call to the API, but making these changes would take some time.

To be able to test my changes, I created empty repos on both GitHub & GitLab btw:
https://github.com/coni2k/empty-repo
https://gitlab.com/coni2k/empty-repo

Last, I also added this bit to "generate" script. Otherwise it fails when there are no processed repos:

if len(stats) == 0:
    return

@@ -541,6 +543,8 @@ def get_repository(url):
repo_url_encoded = urllib.parse.quote_plus(repo_url)
try:
repo = token_obj.projects.get(repo_url_encoded)
if len(repo.commits.list()) == 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How slow is this call ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about making these "get commits" calls in advance?

I added "last_commit" property to Repository class. Before initializing the repo, we make these "get commits" calls, do the validations, if it's all good, then create the repo by passing "last commit" as a parameter.

Since we only pass "last_commit" as a parameter, it feels bit strange but with this approach, we still do the validations and not making additional calls.

Let me know what you think.

@@ -530,8 +530,10 @@ def get_repository(url):
repo = None
try:
repo = get_github_auth_token().get_repo(repo_url)
# Validate whether repo is empty; if it's empty, calling totalCount throws a 409 exception
total_commits = repo.get_commits().totalCount
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like adding another api call since it takes it out of quota. Is there an exception we can catch somewhere and bail out from furthur processing.

try:
repo = get_github_auth_token().get_repo(repo_url)
last_commit = repo.get_commits()[0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What i meant was calculate last_commit time inside get_repository_stats and bailout there. you can do it as a first call to self.last_commit

def get_repository_stats(repo, additional_params=None):
if not repo.last_commit:
return None

and then define
def last_commit property(self):
if self._last_commit:
return self._last_commit
self._last_commit = self.get_commit()[0]
return self._last_commit

No need to calculate it here, it is better to do this calculation inside the class function itself.

@coni2k
Copy link
Contributor Author

coni2k commented Feb 21, 2021

I see, that will be much clearer of course.

I still kept the validation inside of "get_repository" function, but if you specifically want it, I can move it to "get_repository_stats" as well?

@inferno-chromium
Copy link
Contributor

I see, that will be much clearer of course.

I still kept the validation inside of "get_repository" function, but if you specifically want it, I can move it to "get_repository_stats" as well?

Sorry i did mean get_repository_stats, so yes lets make that bailout there.

@coni2k
Copy link
Contributor Author

coni2k commented Feb 21, 2021

Okay, hopefully last question; if output is empty, do you want to show "repo is empty" error message, both in "run" & "generate" scripts?

logger.error(f'Repo is empty: {args.repo}')

@coni2k
Copy link
Contributor Author

coni2k commented Feb 21, 2021

Okay, I kept the error message for the moment, so you can see how it looks

try:
if not repo.last_commit:
return None
except Exception:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can there be an exception, if not, just remove.

break
output = run.get_repository_stats(repo)
if not output:
logger.error(f'Repo is empty: {repo_url}')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logger.error repeated in 2 places can just be moved inside get_repository_stats
if not repo.last_commit:
logger.error(f'Repo is empty: {repo_url}')
return None

return
output = get_repository_stats(repo, args.params)
if not output:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment above, this can be moved inside get_repository_stats.

- Move try/catch block to GitHub last_commit
- Prevent exception for GitLab last_commit
@coni2k
Copy link
Contributor Author

coni2k commented Feb 22, 2021

If repo is empty, GitHub throws an exception in any case. Now I added a try/catch block only for GitHub last_commit property

@property
def last_commit(self):
	if self._last_commit:
		return self._last_commit
	try:
		self._last_commit = self._repo.get_commits()[0]
	except Exception:
		pass
	return self._last_commit

For GitLab last_commit will not throw an exception now. It will return the last_commit, only if the list has an item in it (I used next(iter()) to do this)

@property
def last_commit(self):
	if self._last_commit:
		return self._last_commit
	self._last_commit = next(iter(self._repo.commits.list()), None)
	return self._last_commit

So, the validation under "get_repository_stats" is now like this (no try/catch here):

if not repo.last_commit:
	logger.error(f'Repo is empty: {repo.url}')
	return None

@inferno-chromium inferno-chromium merged commit 01983d1 into ossf:main Feb 22, 2021
@coni2k coni2k deleted the empty-repo branch February 22, 2021 18:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants