Cleanup title and comments #33

ayush1999 · 2018-12-13T13:02:20Z

Initial changes: added a small change that replaces all links with a URL token. I will add more changes upon approval.

marco-c · 2018-12-13T21:57:14Z

bugbug/bug_features.py

@@ -179,6 +179,9 @@ def __init__(self, feature_extractors, commit_messages_map=None):
    def fit(self, x, y=None):
        return self

+    def cleanup(self, text):
+        return re.sub(r'http\S+', 'URL', text)


This will also match invalid URLs, but it should be fine.

I'd define, outside of this class, a list of "text cleanup" classes (just like the classes we have for feature extraction). Then you call them either in this cleanup function, or in transform.
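A minimal sketch of what this suggestion might look like, assuming cleanup classes are plain callables. The class name `url`, the list name `cleanup_functions`, and the helper `cleanup` are hypothetical and not taken from the patch; only the regular expression comes from the diff above. The second call illustrates the point about invalid URLs: any token starting with `http` is replaced.

```python
import re


class url(object):
    """Hypothetical cleanup class: replaces tokens starting with 'http' by 'URL'."""

    def __call__(self, text):
        # Same pattern as in the diff: 'http' followed by non-whitespace characters.
        return re.sub(r'http\S+', 'URL', text)


# Hypothetical module-level list, analogous to the list of feature extractors.
cleanup_functions = [url()]


def cleanup(text):
    # Apply every registered cleanup function in order.
    for cleanup_function in cleanup_functions:
        text = cleanup_function(text)
    return text


print(cleanup('Crash on https://example.com/page?id=1 after update'))
# Not a URL, but it still matches the pattern.
print(cleanup('the httpd daemon'))
```

Keeping the list outside the class would let each later PR add one more cleanup class without touching the calling code.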

ayush1999 · 2018-12-14T14:28:48Z

@marco-c I've created a new class that replaces all URLs with the token. Should I do so for remaining changes? (stack traces, file references etc)

marco-c · 2018-12-14T14:38:11Z

bugbug/models/bug.py

@@ -36,6 +36,7 @@ def __init__(self, lemmatization=False):
            bug_features.landings(),
            bug_features.title(),
            bug_features.comments(),
+            bug_features.cleanup_url(),


Sorry for the confusion, I didn't mean to make it another feature extractor. I just meant we should have "cleanup" functions implemented similarly to feature extractors.

So, call the cleanup functions in the place where you were calling the cleanup function before, but make it more generic.

Something like:

    for cleanup_function in cleanup_functions:
        summary = cleanup_function(summary)
        comments = cleanup_function(comments)

marco-c · 2018-12-14T14:42:00Z

> @marco-c I've created a new class that replaces all URLs with the token. Should I do so for remaining changes? (stack traces, file references etc)

Let's add one cleanup function per PR. After you add it, please also comment in the PR with the results of the training before/after your change (with the 'bug' and 'regression' models).

ayush1999 · 2018-12-14T16:37:28Z

@marco-c Current results:
Before changes:

ayush99@ayush99:~/GitHub/bugbug$ python run.py --train --goal=bug
(1627, 101401) (1627,)
(181, 101401) (181,)
CV Accuracy: 0.89 (+/- 0.05)
Accuracy: 0.9171270718232044
Precision: 0.9642857142857143
Recall: 0.8709677419354839
[[85  3]
 [12 81]]

After changes:

ayush99@ayush99:~/GitHub/bugbug$ python run.py --train --goal=bug
(1627, 85316) (1627,)
(181, 85316) (181,)
CV Accuracy: 0.90 (+/- 0.04)
Accuracy: 0.9171270718232044
Precision: 0.9642857142857143
Recall: 0.8709677419354839
[[85  3]
 [12 81]]

(Only the CV accuracy increases.)
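As a sanity check, the reported metrics can be recomputed from the printed confusion matrix. This is a sketch that assumes the matrix layout is rows for true classes and columns for predicted classes, with the positive class last; the variable names are illustrative.

```python
# Confusion matrix as printed above (assumed layout: rows = true class,
# columns = predicted class, positive class last).
tn, fp = 85, 3
fn, tp = 12, 81

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

# Feature counts taken from the printed matrix shapes before and after the change.
features_before, features_after = 101401, 85316
removed = features_before - features_after

print(accuracy, precision, recall)
print(removed, 'features removed')
```

Under that assumption the values agree with the logged accuracy, precision, and recall, and the URL replacement accounts for 16085 fewer features (roughly 16% of the original vocabulary) with no change on the test split.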

ayush1999 · 2018-12-14T16:42:14Z

For the regression model, the results are exactly the same.

marco-c · 2018-12-14T16:59:57Z

bugbug/bug_features.py

-                'comments': ' '.join([c['text'] for c in bug['comments']]),
-            }
+            for cleanup_function in self.cleanup_functions:
+                result = {


Thanks, the patch is looking better!

A couple of things to fix:

The cleanup functions should be applied to both the summary and the comments;

The result should not be redefined multiple times, but just once. In the loop you should just apply the cleanup functions to the textual fields.
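One way to read these two points, as a sketch only: apply every cleanup function to the text fields first, then build the result dictionary a single time. The shape of `bug['comments']` follows the diff above; the `summary` input key, the `title` result key, and the names `extract_text` and `replace_urls` are assumptions for illustration.

```python
import re


def replace_urls(text):
    # Pattern taken from the first diff in this PR.
    return re.sub(r'http\S+', 'URL', text)


def extract_text(bug, cleanup_functions):
    summary = bug['summary']
    comments = ' '.join(c['text'] for c in bug['comments'])

    # Apply each cleanup function to both textual fields.
    for cleanup_function in cleanup_functions:
        summary = cleanup_function(summary)
        comments = cleanup_function(comments)

    # Build the result once, after all cleanup has been applied.
    return {'title': summary, 'comments': comments}


bug = {
    'summary': 'Broken layout on http://example.org',
    'comments': [{'text': 'See https://example.org/a'}, {'text': 'Confirmed.'}],
}
result = extract_text(bug, [replace_urls])
print(result)
```

With this structure, adding a second cleanup function only changes the list passed in, not the code that builds the result.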

marco-c

LGTM, thanks!

It's good that the CV accuracy increased. I was expecting no change, since this is just a small addition!

The test failure is not related to your patch and is actually my fault from another change (I updated the database of bugs and the test was using a bug from the old database), so I'm going to merge this now.

ayush1999 added 2 commits December 13, 2018 18:28

inital cleanup commit

5555686

Merge branch 'master' of https://github.com/marco-c/bugbug into cleanup

1470840

ayush1999 force-pushed the cleanup branch from 2f13f03 to 1470840 on December 13, 2018 13:20

marco-c reviewed Dec 13, 2018

ayush1999 added 2 commits December 14, 2018 18:21

Merge branch 'master' of https://github.com/marco-c/bugbug into cleanup

f8d8fec

new cleanup_url class

818134d

ayush1999 force-pushed the cleanup branch from c33904a to 818134d on December 14, 2018 14:26

marco-c reviewed Dec 14, 2018

more generic

73f6d7a

marco-c reviewed Dec 14, 2018

put results outside

9e563d1

ayush1999 force-pushed the cleanup branch from c036d39 to 9e563d1 on December 14, 2018 18:17

marco-c approved these changes Dec 15, 2018

marco-c merged commit 79843fa into mozilla:master Dec 15, 2018

marco-c mentioned this pull request Dec 15, 2018

Perform text cleanup #9

Closed

6 tasks

ayush1999 deleted the cleanup branch December 15, 2018 07:48
