Perform text cleanup #9

Closed
6 tasks done
marco-c opened this issue Nov 23, 2018 · 21 comments

Comments

@marco-c
Collaborator

marco-c commented Nov 23, 2018

We are not cleaning up the text from the title or comments at all. We should:

For URLs, we should probably have different tokens according to the type of URL (e.g. if it's an hg.mozilla.org, searchfox.org, or dxr.mozilla.org URL, then the meaning of the URL might be different).

@ayush1999
Contributor

@marco-c We could add a check that removes the text inside <>, which is generally an email address (in the commits.json file), since it isn't of any use. This shouldn't be hard: a simple filter and lambda function should be enough. I need your go-ahead on this.
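For illustration, a minimal sketch of the check proposed above. It uses a regular expression rather than a filter and lambda, and the function and constant names are hypothetical, not taken from the project. It assumes an email is any `<...>` span without whitespace that contains an `@`.

```python
import re

# Assumption: an email in angle brackets looks like "<name@host>",
# with no whitespace or nested brackets inside.
ANGLE_EMAIL_RE = re.compile(r"\s*<[^<>\s]+@[^<>\s]+>")


def strip_bracketed_emails(text: str) -> str:
    """Remove "<email>" spans, together with any whitespace just before them."""
    return ANGLE_EMAIL_RE.sub("", text)
```

Requiring an `@` and forbidding whitespace keeps ordinary comparisons such as `a < b and c > d` untouched.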

@marco-c
Collaborator Author

marco-c commented Dec 12, 2018

I'd focus mostly on the bug data (title and comments), as that's where most of the cleanup should be done first.
The emails in commits.json should be pretty rare.

@ayush1999
Contributor

@marco-c What kind of checks should be done for URLs that aren't in the regular form, i.e. ones that aren't written starting with http (such as hg.mozilla.org or searchfox.org, as you mentioned)?

@marco-c
Collaborator Author

marco-c commented Dec 17, 2018

I meant that for those URLs (http{s}://hg.mozilla.org, http{s}://searchfox) we might want to try to assign a separate token (not URL, but something like CODE_REFERENCE_URL).
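A rough sketch of how that could look, assuming a deliberately simple URL pattern (anything from `http://` or `https://` up to the next whitespace). The host list comes from the hosts named in this thread. The function name and the exact token strings are illustrative assumptions.

```python
import re
from urllib.parse import urlparse

# Hosts named in this thread as pointing at code rather than ordinary pages.
CODE_REFERENCE_HOSTS = {"hg.mozilla.org", "searchfox.org", "dxr.mozilla.org"}

# Simplification: a URL is "http(s)://" followed by non-whitespace characters.
URL_RE = re.compile(r"https?://\S+")


def replace_urls(text: str) -> str:
    """Replace each URL with CODE_REFERENCE_URL or URL, depending on its host."""

    def token_for(match: re.Match) -> str:
        try:
            host = urlparse(match.group(0)).hostname or ""
        except ValueError:
            # Malformed URL (e.g. a broken IPv6 literal): use the generic token.
            return "URL"
        return "CODE_REFERENCE_URL" if host in CODE_REFERENCE_HOSTS else "URL"

    return URL_RE.sub(token_for, text)
```

Because the pattern stops only at whitespace, punctuation directly after a URL is swallowed into the token. A real implementation would probably want to trim it.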

@ayush1999
Contributor

@marco-c How can stack traces be identified? (I'm guessing they might be enclosed in some special character?)

@marco-c
Collaborator Author

marco-c commented Dec 24, 2018

Here are two examples: https://bugzilla.mozilla.org/show_bug.cgi?id=1376831#c0 and https://bugzilla.mozilla.org/show_bug.cgi?id=1515375#c0. It isn't so easy, though, so maybe it's better not to do it, or to try it after we're done with the others.

@ayush1999
Contributor

@marco-c For replacing the crash stats, would a check for all words starting with bp- suffice?

@marco-c
Collaborator Author

marco-c commented Jan 7, 2019

I think it'd be better to have a more precise regex. We can probably find it somewhere, as Bugzilla is using it to detect the links (indeed, if you open one of those bugs, you can see that bp-XXX is a link).

@ayush1999
Contributor

Oh, also, could you point me to a bug example that uses hex numbers in the text?

@marco-c
Collaborator Author

marco-c commented Jan 7, 2019

Sure, here's an example: https://bugzilla.mozilla.org/show_bug.cgi?id=1489041.

@ayush1999
Contributor

If I'm not wrong, the hex numbers in the issue are already in >, so they get eliminated by #52. Do we still need a check for substituting these numbers? (Maybe because #52 hasn't been merged yet?)

@marco-c
Collaborator Author

marco-c commented Jan 8, 2019

Only some are in >, but not all. For example see the title of the bug (it contains scdetour.dll@0x2dd77), or the first comment.
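As a sketch of the substitution being discussed, assuming hex numbers always carry a `0x` prefix as in the title example. The token name `__HEX_NUMBER__` and the function name are hypothetical.

```python
import re

# Assumption: hex numbers are written with a "0x" prefix, e.g. "0x2dd77".
HEX_RE = re.compile(r"\b0x[0-9a-fA-F]+\b")


def replace_hex_numbers(text: str) -> str:
    """Replace every 0x-prefixed hex number with a placeholder token."""
    return HEX_RE.sub("__HEX_NUMBER__", text)
```

The word boundaries let the pattern match after the `@` in `scdetour.dll@0x2dd77` while leaving strings such as `foo0x12` alone.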

@ayush1999
Contributor

@marco-c Regarding cleanup for crash references, should I replace only the bp-XXXX part or the entire crash stat? Plus the bugs.json data doesn't have any indication to check if a certain text is a hyperlink or not.

@marco-c
Collaborator Author

marco-c commented Jan 18, 2019

> @marco-c Regarding cleanup for crash references, should I replace only the bp-XXXX part or the entire crash stat?

I'd say only the bp-XXXX part.

> Plus the bugs.json data doesn't have any indication to check if a certain text is a hyperlink or not.

What do you mean?

@ayush1999
Contributor

You mentioned in an earlier comment that we'd need to have a regex that checks for all bp-XXX words that are also links. Right?

@marco-c
Collaborator Author

marco-c commented Jan 18, 2019

> You mentioned in an earlier comment that we'd need to have a regex that checks for all bp-XXX words that are also links. Right?

Sorry, I wasn't clear enough. We need to replace the bp-XXX words only. I only meant that there must be a regex somewhere already, as Bugzilla itself is detecting the bp-XXX words and turning them into links (in its interface, not in its DB, that's why our bugs.json only contains the bp-XXX words and not the link).

@ayush1999
Contributor

Oh, I see. I guess I misunderstood. I just skimmed through the bugzilla repo, but was unable to find the matching regex. A regex such as bp-\w+-\w+-\w+-\w+-\w+ should work, right?

@marco-c
Collaborator Author

marco-c commented Jan 18, 2019

Here it is, found it in the Socorro repo: https://github.com/mozilla-services/socorro/blob/fc00cb6a9321b33c2971a2c14c7d916638e33a14/socorro/lib/ooid.py#L68-L88.
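A sketch of a stricter pattern than `bp-\w+-\w+-\w+-\w+-\w+`. This is an assumption, not a copy of the linked Socorro code: it assumes a crash ID has the usual UUID shape (8-4-4-4-12 hex characters) with the final six characters being decimal digits. The token name, function name, and the crash ID used in the usage note are made up for illustration.

```python
import re

# Assumed shape: "bp-" + UUID-like ID whose last six characters are digits.
# Check this against the Socorro source before relying on it.
CRASH_ID_RE = re.compile(
    r"\bbp-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{6}[0-9]{6}\b"
)


def replace_crash_ids(text: str) -> str:
    """Replace only the bp-XXXX word, leaving the surrounding text untouched."""
    return CRASH_ID_RE.sub("__CRASH_STATS_LINK__", text)
```

For example, `replace_crash_ids("Crash report: bp-0a1b2c3d-4e5f-6789-abcd-ef0123190118.")` replaces just the ID, while a loose match such as `bp-a-b-c-d-e` is left as it is.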

@ayush1999
Contributor

Ah, thanks a lot.

@ayush1999
Contributor

@marco-c How would cleanup for stack traces work? If I'm not wrong, since stack traces also start with >, they are replaced by #52 right now. Do we need to change that?

@marco-c
Collaborator Author

marco-c commented Jan 21, 2019

The text in Bugzilla comments is free text, so there is no rule requiring stack traces to start with >. Some people use > before stack traces, some people don't. Same with code snippets and so on.

Anyway, I guess cleaning up stack traces is not so easy, so I'd say we can ignore it for now.
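To illustrate why a >-based filter is not enough, here is a rough sketch of that kind of filter. It is not the code from #52, and the function name is hypothetical. It drops every line whose first non-space character is `>`.

```python
def drop_quoted_lines(text: str) -> str:
    """Remove lines that start with ">" (after optional leading whitespace)."""
    kept = [
        line
        for line in text.splitlines()
        if not line.lstrip().startswith(">")
    ]
    return "\n".join(kept)
```

A stack frame pasted without a leading `>` passes straight through, so a filter like this only catches the traces that happen to be quoted.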

This issue was closed.
2 participants