-
Notifications
You must be signed in to change notification settings - Fork 113
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle empty repo case #85
Conversation
criticality_score/run.py
Outdated
@@ -541,6 +543,8 @@ def get_repository(url): | |||
repo_url_encoded = urllib.parse.quote_plus(repo_url) | |||
try: | |||
repo = token_obj.projects.get(repo_url_encoded) | |||
if len(repo.commits.list()) == 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How slow is this call ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about making these "get commits" calls in advance?
I added "last_commit" property to Repository class. Before initializing the repo, we make these "get commits" calls, do the validations, if it's all good, then create the repo by passing "last commit" as a parameter.
Since we only pass "last_commit" as a parameter, it feels bit strange but with this approach, we still do the validations and not making additional calls.
Let me know what you think.
criticality_score/run.py
Outdated
@@ -530,8 +530,10 @@ def get_repository(url): | |||
repo = None | |||
try: | |||
repo = get_github_auth_token().get_repo(repo_url) | |||
# Validate whether repo is empty; if it's empty, calling totalCount throws a 409 exception | |||
total_commits = repo.get_commits().totalCount |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like adding another api call since it takes it out of quota. Is there an exception we can catch somewhere and bail out from furthur processing.
- Update empty repo validation
criticality_score/run.py
Outdated
try: | ||
repo = get_github_auth_token().get_repo(repo_url) | ||
last_commit = repo.get_commits()[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What i meant was calculate last_commit time inside get_repository_stats and bailout there. you can do it as a first call to self.last_commit
def get_repository_stats(repo, additional_params=None):
if not repo.last_commit:
return None
and then define
def last_commit property(self):
if self._last_commit:
return self._last_commit
self._last_commit = self.get_commit()[0]
return self._last_commit
No need to calculate it here, it is better to do this calculation inside the class function itself.
I see, that will be much clearer of course. I still kept the validation inside of "get_repository" function, but if you specifically want it, I can move it to "get_repository_stats" as well? |
Sorry i did mean get_repository_stats, so yes lets make that bailout there. |
Okay, hopefully last question; if output is empty, do you want to show "repo is empty" error message, both in "run" & "generate" scripts?
|
Okay, I kept the error message for the moment, so you can see how it looks |
criticality_score/run.py
Outdated
try: | ||
if not repo.last_commit: | ||
return None | ||
except Exception: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can there be an exception, if not, just remove.
criticality_score/generate.py
Outdated
break | ||
output = run.get_repository_stats(repo) | ||
if not output: | ||
logger.error(f'Repo is empty: {repo_url}') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logger.error repeated in 2 places can just be moved inside get_repository_stats
if not repo.last_commit:
logger.error(f'Repo is empty: {repo_url}')
return None
return | ||
output = get_repository_stats(repo, args.params) | ||
if not output: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment above, this can be moved inside get_repository_stats.
- Move try/catch block to GitHub last_commit - Prevent exception for GitLab last_commit
If repo is empty, GitHub throws an exception in any case. Now I added a try/catch block only for GitHub last_commit property @property
def last_commit(self):
if self._last_commit:
return self._last_commit
try:
self._last_commit = self._repo.get_commits()[0]
except Exception:
pass
return self._last_commit For GitLab last_commit will not throw an exception now. It will return the last_commit, only if the list has an item in it (I used @property
def last_commit(self):
if self._last_commit:
return self._last_commit
self._last_commit = next(iter(self._repo.commits.list()), None)
return self._last_commit So, the validation under "get_repository_stats" is now like this (no try/catch here): if not repo.last_commit:
logger.error(f'Repo is empty: {repo.url}')
return None |
When I was running the script, I bumped into these repos that they fall into the filter due to high number of stars but they're actually empty and the script throws an exception:
https://github.com/fossasia/libregraphics.asia
https://github.com/libredesktop/libredesktop-events
https://github.com/libredesktop/libredesktop-project-list
https://github.com/libredesktop/LibreDesktop-Specs
https://github.com/meilix/arch-meilix
https://github.com/meilix/deb-meilix
https://github.com/meilix/meilix-addons
https://github.com/meilix/meilix-art
https://github.com/meilix/meilix-connect
https://github.com/meilix/meilix-web
https://github.com/susiai/susi_partners
https://github.com/susiai/susi_sdk
https://github.com/ascoders/blog
https://github.com/bigdongdongCLUB/newGCP
https://github.com/koush/support-wiki
https://github.com/mariobehling/ai-packages
https://github.com/mariobehling/mb-sandbox
https://github.com/meilix/meilix-docs
https://github.com/paulirish/devtools-addons
https://github.com/QingDaoIT/BlackList
https://github.com/zhengzhouqiuzhi/zhengzhouqiuzhi
To handle it, for GitLab, checking the commits length was enough:
For GitHub, I couldn't find any proper way to understand whether the repo is empty. When we call "get_commits().totalCount", it already throws an exception. What I did is to force it to throw the exception by assigning "totalCount" to an unused variable (I could do it by printing the value as well?). Not an ideal solution, so let me know what you think.
Another remark is that we're spending one more request from our rate limit when calling "get_commits()" to make this validation. I only tested this for GitHub, but I'm assuming it's the same for GitLab as well.
Alternatively, we can make all these calls before initializing the repo, do the validations, and pass them to repo object as arguments? This would also help us reducing the number of call to the API, but making these changes would take some time.
To be able to test my changes, I created empty repos on both GitHub & GitLab btw:
https://github.com/coni2k/empty-repo
https://gitlab.com/coni2k/empty-repo
Last, I also added this bit to "generate" script. Otherwise it fails when there are no processed repos: