Cleanup title and comments #33

Merged
merged 6 commits into from
Dec 15, 2018

Conversation

ayush1999
Contributor

Initial changes: added a small change that replaces all links with a URL token. I will add more changes upon approval.

@@ -179,6 +179,9 @@ def __init__(self, feature_extractors, commit_messages_map=None):
    def fit(self, x, y=None):
        return self

    def cleanup(self, text):
        return re.sub(r'http\S+', 'URL', text)
Collaborator

This will also match invalid URLs, but it should be fine.
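
To illustrate what "invalid URLs" means here, the following is a small sketch of how the pattern behaves. The function name `replace_urls` and the sample strings are assumptions made for illustration; only the regular expression comes from the patch.

```python
import re


def replace_urls(text):
    # Same pattern as in the patch: "http" followed by a run of non-whitespace.
    return re.sub(r'http\S+', 'URL', text)


samples = [
    "See https://bugzilla.mozilla.org/show_bug.cgi?id=1 for details",
    "The httpd process crashed",  # not a URL, but the word starts with "http"
    "Visit www.example.com",      # a link without a scheme is left untouched
]

for sample in samples:
    print(replace_urls(sample))
```

The second sample shows the over-matching mentioned above: any token beginning with `http` is replaced. The third shows the opposite gap, since scheme-less links are not caught.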

Collaborator

I'd define, outside of this class, a list of "text cleanup" classes (just like the classes we have for feature extraction). Then you call them either in this cleanup function, or in transform.

@ayush1999
Contributor Author

@marco-c I've created a new class that replaces all URLs with the token. Should I do the same for the remaining changes (stack traces, file references, etc.)?

@@ -36,6 +36,7 @@ def __init__(self, lemmatization=False):
bug_features.landings(),
bug_features.title(),
bug_features.comments(),
bug_features.cleanup_url(),
Collaborator

Sorry for the confusion, I didn't mean to make it another feature extractor. I just meant we should have "cleanup" functions implemented similarly to feature extractors.

Collaborator

So, call the cleanup functions in the same place where you were calling `cleanup` before, but make it more generic.

Collaborator

Something like:

for cleanup_function in cleanup_functions:
    summary = cleanup_function(summary)
    comments = cleanup_function(comments)
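
The loop above could be fleshed out roughly as follows. This is a minimal sketch, not necessarily what was merged: the callable class `cleanup_url` is written here as a cleanup class rather than a feature extractor, and the module-level list `cleanup_functions` and the sample strings are hypothetical.

```python
import re


class cleanup_url(object):
    # Hypothetical cleanup class: calling an instance replaces links with a token.
    def __call__(self, text):
        return re.sub(r'http\S+', 'URL', text)


# Defined outside the extractor class, in the same spirit as the feature extractors.
cleanup_functions = [cleanup_url()]

summary = "Crash on http://example.com/page"
comments = "Steps: open https://example.org and wait"

# Every cleanup function is applied to both textual fields.
for cleanup_function in cleanup_functions:
    summary = cleanup_function(summary)
    comments = cleanup_function(comments)

print(summary)
print(comments)
```

With this shape, adding a later cleanup step (for example for stack traces) would only mean appending another callable to the list.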

@marco-c
Collaborator

marco-c commented Dec 14, 2018

> @marco-c I've created a new class that replaces all URLs with the token. Should I do so for remaining changes? (stack traces, file references etc)

Let's add one cleanup function per PR. After you add it, please also comment in the PR with the results of the training before/after your change (with the 'bug' and 'regression' models).

@ayush1999
Contributor Author

ayush1999 commented Dec 14, 2018

@marco-c Current results:
Before changes:

ayush99@ayush99:~/GitHub/bugbug$ python run.py --train --goal=bug
(1627, 101401) (1627,)
(181, 101401) (181,)
CV Accuracy: 0.89 (+/- 0.05)
Accuracy: 0.9171270718232044
Precision: 0.9642857142857143
Recall: 0.8709677419354839
[[85  3]
 [12 81]]

After changes:

ayush99@ayush99:~/GitHub/bugbug$ python run.py --train --goal=bug
(1627, 85316) (1627,)
(181, 85316) (181,)
CV Accuracy: 0.90 (+/- 0.04)
Accuracy: 0.9171270718232044
Precision: 0.9642857142857143
Recall: 0.8709677419354839
[[85  3]
 [12 81]]

(Only the CV accuracy increases.)
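
As a sanity check, the reported accuracy, precision, and recall can be recomputed from the confusion matrix. The sketch below assumes the scikit-learn layout (rows are true classes, columns are predicted classes, positive class second). The log does not state the layout, but the reported numbers agree with it.

```python
# Confusion matrix copied from the training output above.
confusion = [[85, 3],
             [12, 81]]

# Assumed layout: [[TN, FP], [FN, TP]].
(tn, fp), (fn, tp) = confusion
total = tn + fp + fn + tp  # should equal the 181 test samples

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print("samples:", total)
print("accuracy:", accuracy)
print("precision:", precision)
print("recall:", recall)
```

Since the confusion matrix is identical before and after the change, these three metrics cannot differ. The visible effects are the CV accuracy and the feature count, which drops from 101401 to 85316 columns.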

@ayush1999
Contributor Author

For the regression model, the results are exactly the same.

'comments': ' '.join([c['text'] for c in bug['comments']]),
}
for cleanup_function in self.cleanup_functions:
result = {
Collaborator

Thanks, the patch is looking better!

A couple of things to fix:

  1. The cleanup functions should be applied to both the summary and the comments;
  2. The result should not be redefined multiple times, but just once. In the loop you should just apply the cleanup functions to the textual fields.
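
One way to read these two points is sketched below. The function `extract_text`, the helper `replace_urls`, the `'title'` key, and the sample `bug` dictionary are assumptions for illustration; only the `'comments'` join and the loop over the cleanup functions appear in the diff.

```python
import re


def replace_urls(text):
    # Hypothetical cleanup function: replace links with a token.
    return re.sub(r'http\S+', 'URL', text)


cleanup_functions = [replace_urls]


def extract_text(bug):
    # Build the dictionary of textual fields exactly once.
    result = {
        'title': bug['summary'],
        'comments': ' '.join([c['text'] for c in bug['comments']]),
    }
    # Inside the loop, only update the textual fields in place.
    for cleanup_function in cleanup_functions:
        for field in ('title', 'comments'):
            result[field] = cleanup_function(result[field])
    return result


bug = {
    'summary': 'Crash at https://example.com',
    'comments': [{'text': 'See http://a.b/c'}, {'text': 'No link here'}],
}
print(extract_text(bug))
```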

Collaborator

@marco-c marco-c left a comment

LGTM, thanks!

It's good that the CV accuracy increased; I was expecting no change, since this is just a small addition!

The test failure is not related to your patch and is actually my fault from another change (I updated the database of bugs and the test was using a bug from the old database), so I'm going to merge this now.

@marco-c marco-c merged commit 79843fa into mozilla:master Dec 15, 2018
@marco-c marco-c mentioned this pull request Dec 15, 2018
6 tasks
@ayush1999 ayush1999 deleted the cleanup branch December 15, 2018 07:48