-
Notifications
You must be signed in to change notification settings - Fork 307
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cleanup title and comments #33
Conversation
bugbug/bug_features.py
Outdated
@@ -179,6 +179,9 @@ def __init__(self, feature_extractors, commit_messages_map=None): | |||
def fit(self, x, y=None): | |||
return self | |||
|
|||
def cleanup(self, text): | |||
return re.sub(r'http\S+', 'URL', text) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will also match invalid URLs, but it should be fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd define, outside of this class, a list of "text cleanup" classes (just like the classes we have for feature extraction). Then you call them either in this cleanup
function, or in transform
.
@marco-c I've created a new class that replaces all URLs with the token. Should I do so for remaining changes? (stack traces, file references etc) |
bugbug/models/bug.py
Outdated
@@ -36,6 +36,7 @@ def __init__(self, lemmatization=False): | |||
bug_features.landings(), | |||
bug_features.title(), | |||
bug_features.comments(), | |||
bug_features.cleanup_url(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the confusion, I didn't mean to make it another feature extractor. I just meant we should have "cleanup" functions implemented similarly to feature extractors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, call the cleanup functions in the place where you were calling it before, but making it more generic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something like:
for cleanup_function in cleanup_functions:
summary = cleanup_function(summary)
comments = cleanup_function(comments)
Let's add one cleanup function per PR. After you add it, please also comment in the PR with the results of the training before/after your change (with the 'bug' and 'regression' models). |
@marco-c Current results:
After changes:
(Only cv accuracy increases) |
In case of regression, results are exactly same. |
bugbug/bug_features.py
Outdated
'comments': ' '.join([c['text'] for c in bug['comments']]), | ||
} | ||
for cleanup_function in self.cleanup_functions: | ||
result = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, the patch is looking better!
A couple of things to fix:
- The cleanup functions should be applied to both the summary and the comments;
- The result should not be redefined multiple times, but just once. In the loop you should just apply the cleanup functions to the textual fields.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!
It's good that the CV accuracy increased, I was expecting no change since this is just a small addition!
The test failure is not related to your patch and is actually my fault from another change (I updated the database of bugs and the test was using a bug from the old database), so I'm going to merge this now.
Initial changes : Added a small change for replacing all links with URL token. Will add more changes upon approval.