Perform text cleanup #9
Comments
@marco-c We could add a check for removing the text inside |
I'd focus mostly on the bug data (title and comments), as that's where most of the cleanup should be done first. |
@marco-c What kind of checks should be done for URLs that aren't in the regular form? (Such as hg.mozilla.org, or searchfox.org, as you mentioned?) (URLs that haven't been written starting with |
I meant that for those URLs (http{s}://hg.mozilla.org, http{s}://searchfox) we might want to try to assign a separate token (not URL, but something like CODE_REFERENCE_URL). |
@marco-c How can stack traces be identified? (I'm guessing they might be enclosed in some special character?) |
Here are two examples https://bugzilla.mozilla.org/show_bug.cgi?id=1376831#c0 and https://bugzilla.mozilla.org/show_bug.cgi?id=1515375#c0, but it isn't so easy, maybe it's better not to do it or to try it after we're done with the others. |
@marco-c For replacing the crash stats, would a check for all words starting with |
I think it'd be better to have a more precise regex, we can probably find it somewhere as Bugzilla is using it to detect the links (indeed if you open one of those bugs, you can see that bp-XXX is a link). |
Oh, also, could you point me to a bug example that uses hex numbers in the text? |
Sure, here's an example: https://bugzilla.mozilla.org/show_bug.cgi?id=1489041. |
Only some are in |
@marco-c Regarding cleanup for crash references, should I replace only the bp-XXXX part or the entire crash stat? Plus the bugs.json data doesn't have any indication to check if a certain text is a hyperlink or not. |
I'd say only the bp-XXXX part.
What do you mean? |
You mentioned in an earlier comment that we'd need to have a regex that checks for all |
Sorry, I wasn't clear enough. We need to replace the bp-XXX words only. I only meant that there must be a regex somewhere already, as Bugzilla itself is detecting the bp-XXX words and turning them into links (in its interface, not in its DB, that's why our bugs.json only contains the bp-XXX words and not the link). |
Oh, I see. I guess I misunderstood. I just skimmed through the bugzilla repo, but was unable to find the matching regex. A regex such as |
Here it is, found it in the Socorro repo: https://github.com/mozilla-services/socorro/blob/fc00cb6a9321b33c2971a2c14c7d916638e33a14/socorro/lib/ooid.py#L68-L88. |
Ah, thanks a lot. |
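For reference, the bp-XXX replacement discussed above could be sketched roughly as follows. This is not the Socorro code linked in the thread; it only assumes that a crash ID is a UUID-shaped hex string (8-4-4-4-12) prefixed with `bp-`. The function name `replace_crash_ids` and the token `__CRASH_STATS_LINK__` are hypothetical, chosen for illustration.

```python
import re

# Assumption: crash IDs look like "bp-" followed by a UUID-shaped hex string.
# This pattern is an illustrative approximation, not Bugzilla's or Socorro's own regex.
CRASH_ID_RE = re.compile(
    r"\bbp-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b",
    re.IGNORECASE,
)


def replace_crash_ids(text, token="__CRASH_STATS_LINK__"):
    """Replace only the bp-XXX words with a placeholder token (hypothetical name)."""
    return CRASH_ID_RE.sub(token, text)


if __name__ == "__main__":
    sample = "Crash report: bp-0a1b2c3d-4e5f-4a6b-8c7d-9e0f1a190101 on startup"
    print(replace_crash_ids(sample))
```

Only the bp-XXX word is replaced, so the rest of the comment text stays untouched, and words that merely start with `bp-` but are not UUID-shaped are left alone.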
The text on Bugzilla comments is free text, so there isn't a rule that specifies stack traces to always start with Anyway, I guess cleaning up stack traces is not so easy, so I'd say we can ignore it for now. |
We are not cleaning up the text from the title or comments at all. We should:
For URLs, we should probably have different tokens according to the type of URL (e.g. if it's a hg.mozilla.org, or searchfox.org or dxr.mozilla.org, then the meaning of the URL might be different).
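A minimal sketch of the per-type URL tokens, assuming that only URLs written with an explicit http or https scheme are handled and that the three hosts named above are the only code-reference hosts. `CODE_REFERENCE_URL` is the token suggested in the comments; the generic `URL` token and the helper names `url_token` and `replace_urls` are hypothetical.

```python
import re
from urllib.parse import urlsplit

# Hosts whose URLs point at code; taken from the hosts named in the issue.
CODE_HOSTS = {"hg.mozilla.org", "searchfox.org", "dxr.mozilla.org"}

# Simplified URL pattern (an assumption): scheme followed by non-space characters.
URL_RE = re.compile(r"https?://[^\s<>()\"']+", re.IGNORECASE)


def url_token(url):
    """Pick a token based on the URL's host."""
    try:
        host = urlsplit(url).hostname or ""
    except ValueError:
        # Malformed URL (e.g. a broken IPv6 literal): fall back to the generic token.
        return "URL"
    return "CODE_REFERENCE_URL" if host in CODE_HOSTS else "URL"


def replace_urls(text):
    """Replace every matched URL in the text with its token."""
    return URL_RE.sub(lambda match: url_token(match.group(0)), text)


if __name__ == "__main__":
    sample = (
        "See https://hg.mozilla.org/mozilla-central/rev/abc123 "
        "and https://example.com/page for details"
    )
    print(replace_urls(sample))
```

Matching on the parsed hostname rather than on a substring keeps a URL that merely mentions one of these hosts in its path or query from being tagged as a code reference.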