[BUG] HTML tags break sentence analysis #867

Closed
ckaden opened this Issue Mar 4, 2019 · 10 comments

ckaden commented Mar 4, 2019

When sending a text string with HTML tags to Aedict (in this case, via an aedict:// link), it seems like the first closing tag </ breaks the continuation of the sentence analysis.

Solution: in Aedict, strip all tags (like <?></?>) that might break the sentence analysis from the text string before passing the sentence on to the analysis.

Addition: Anki uses square brackets [ ] to attach readings to kanji. Maybe these should be stripped too, since the readings aren't needed for the analysis and might clutter the result page with additional hiragana meanings.
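A rough Python sketch of the bracket stripping suggested above. The pattern and the name `strip_bracket_readings` are illustrative assumptions, not Aedict code, and the sample string with bracketed readings is made up for this example. Any separator spaces Anki may place before annotated kanji are not handled here.

```python
import re

# Hypothetical pattern: "[", then anything that is not a bracket, then "]".
BRACKET_READING = re.compile(r"\[[^\[\]]*\]")

def strip_bracket_readings(text: str) -> str:
    """Remove every [reading] group and keep the surrounding text."""
    return BRACKET_READING.sub("", text)

print(strip_bracket_readings("会計[かいけい]を済[す]ませて店[みせ]を出[で]たんだ。"))
# 会計を済ませて店を出たんだ。
```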

Example: the data field 'expression' sent to Aedict from Anki (Android app) via an aedict:// link, where the analysis breaks at the closing tag.

screenshot_20190304-191922

screenshot_20190304-192016

mvysny self-assigned this Mar 5, 2019

mvysny added the enhancement label Mar 5, 2019

Owner

mvysny commented Mar 5, 2019

Thanks! Could you please post an example of such aedict:// link?

Author

ckaden commented Mar 5, 2019

I'm using the Anki template tags, so the link looks like this:

aedict://{{Expression}}

The actual link sent to Aedict in this example should be as follows:

aedict://<b>会計</b>を済ませて店を出たんだ。

Owner

mvysny commented Mar 5, 2019

Fixed in Aedict 3.50.13. However, since the <> characters could also be used for text formatting, e.g. <<yamero>>, which needs to be searched properly, I'll hide this behind a setting. To activate the removal of HTML elements, go to Settings / GUI Tuning / Search / Remove HTML Tags and make sure it's checked.

mvysny closed this Mar 5, 2019

Author

ckaden commented Mar 5, 2019

Nice, thank you! Looking forward to receiving the update.

I don't know what it looks like from a developer's perspective, but with regex you can match structures like <*>*</*> (a structure with an opening and a closing tag) and remove only the tags, while not matching things like <<*>> (a lone structure). HTML structures like <*>*</*> make absolutely zero sense for sentence analysis and can therefore be removed completely. The option you've built in now does make sense for square brackets, though.

But maybe it's simply not worth the hassle and depends on the user scenario. Thanks again for your work!
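A minimal Python sketch of the paired-tag idea described in this comment, assuming a backreference is used to require a matching closing tag. The pattern and the name `strip_paired_tags` are illustrative and are not taken from Aedict.

```python
import re

# Assumed pattern: <name ...> some text </name>, where \1 forces the closing
# tag to carry the same name as the opening tag.
PAIRED = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\b[^<>]*>(.*?)</\1>", re.DOTALL)

def strip_paired_tags(text: str) -> str:
    """Drop matching <x>...</x> pairs but keep the text between them."""
    previous = None
    while previous != text:  # repeat so nested pairs are unwrapped too
        previous = text
        text = PAIRED.sub(r"\2", text)
    return text

print(strip_paired_tags("<b>会計</b>を済ませて店を出たんだ。"))  # 会計を済ませて店を出たんだ。
print(strip_paired_tags("<<yamero>>"))    # <<yamero>> (no closing tag, left alone)
print(strip_paired_tags("<b>something"))  # <b>something (unclosed tag survives)
```

The last call shows the limitation of this approach: a tag without a matching closing tag is never removed.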

Owner

mvysny commented Mar 5, 2019

> with regex you can match structures like <*>*</*> (structure with opening and closing tag) and remove only the tags

True, but that wouldn't remove HTML4 with unclosed tags like foo<p>bar, or partially copied HTML like <b>something. Therefore I chose to remove anything that looks like <*> or </*> or <*/> with regex matching. That kind of matching will, however, turn <<foo>> into <> (by removing the inner <foo>).
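The behavior described here can be reproduced with a short Python sketch. The pattern below is an assumption made for illustration (the actual regex used in Aedict is not shown in this thread), and `strip_tag_like` is a hypothetical name.

```python
import re

# Assumed pattern: "<", an optional "/", a name starting with a letter,
# then anything up to the next ">". Covers <x>, </x> and <x/>.
TAG_LIKE = re.compile(r"</?[A-Za-z][^<>]*>")

def strip_tag_like(text: str) -> str:
    """Remove every substring that looks like an HTML tag, paired or not."""
    return TAG_LIKE.sub("", text)

print(strip_tag_like("<b>会計</b>を済ませて店を出たんだ。"))  # 会計を済ませて店を出たんだ。
print(strip_tag_like("foo<p>bar"))     # foobar
print(strip_tag_like("<b>something"))  # something
print(strip_tag_like("<<foo>>"))       # <>
```

The last line shows the side effect mentioned above: the inner <foo> is treated as a tag, so only the outer angle brackets remain.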

Author

ckaden commented Mar 5, 2019

Maybe we're overthinking it a bit. When I look back at the problem, the sentence analysis seems to break with the closing tag, so I suspect the slash / to be the culprit. The opening tags or tags without slash seem to be ignored. Maybe it's enough to just filter out the slash? There will be some clutter because of the tag content but that shouldn't be a huge problem.

Author

ckaden commented Mar 6, 2019

It doesn't seem to work properly yet, see screenshots below:

Full sentence:
screenshot_20190306-142322

'Remove HTML tags' unchecked (same as before):
screenshot_20190306-142353

'Remove HTML tags' checked:
screenshot_20190306-142327

Either way, everything after the closing HTML tag is getting cut off, which is especially problematic if the term is at the beginning of the sentence.

Owner

mvysny commented Mar 6, 2019

With HTML stripping turned off, Aedict should work as before and should not truncate anything. I therefore conclude that the original link is incorrect or truncated in the first place. Maybe try escaping characters with %, e.g. < -> %3C. See https://www.w3schools.com/tags/ref_urlencode.asp for more details on which characters may appear in the link (hint: pretty much only ASCII letters and digits are safe unescaped, so no kanji, no </> etc).
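A small Python sketch of the percent-escaping suggested here, using the standard library's `urllib.parse.quote`. The helper name `build_aedict_link` is hypothetical, and whether Aedict decodes the escapes on its side, or whether an Anki template can produce them, is an assumption that this thread does not confirm.

```python
from urllib.parse import quote, unquote

SCHEME = "aedict://"

def build_aedict_link(expression: str) -> str:
    """Percent-encode the whole expression so the link contains only ASCII."""
    # safe="" also escapes "/", which would otherwise be left as-is.
    return SCHEME + quote(expression, safe="")

sentence = "<b>会計</b>を済ませて店を出たんだ。"
link = build_aedict_link(sentence)

print(link.startswith("aedict://%3Cb%3E"))       # True: "<b>" became %3Cb%3E
print(link.isascii())                            # True: no raw kanji or <, >, /
print(unquote(link[len(SCHEME):]) == sentence)   # True: decoding restores the text
```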

Author

ckaden commented Mar 6, 2019

Ah sorry, you're certainly right, the text already gets truncated before it arrives at the sentence analysis. I just tested it by copying a sentence with HTML tags and pasting it directly into Aedict: no problems. My bad, I didn't think of that possibility. Sadly, that doesn't make it any easier to resolve.
