New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable UTF8 in GRegex #653

Merged
merged 6 commits into from Jul 3, 2017

Conversation

Projects
None yet
3 participants
@ailin-nemui
Contributor

ailin-nemui commented Feb 16, 2017

fixes #636

up for discussion! this remaps invalid utf8 bytes into Private Use Area of unicode

@ailin-nemui ailin-nemui added the 1.0.1 label Feb 27, 2017

@ailin-nemui ailin-nemui removed the 1.0.1 label Mar 11, 2017

Refactor regex and implement UTF8 mode for GRegex
- with non-unicode byte to Private Use Area A mapping
- move all ifdefs to iregex.h file only
@ailin-nemui

This comment has been minimized.

Contributor

ailin-nemui commented Jun 3, 2017

I intend to merge this since there was noone rushing forward to implement something else, and I am heavily in favour of keeping the platform independent gregex/pcre (thus disagree to revert that change)

@dequis

This comment has been minimized.

Member

dequis commented Jun 4, 2017

Sure, that bikeshedding wasn't going anywhere and this method is good enough.

It does need some review, I didn't look at this at all before. Also needs fuzzing.

@LemonBoy

This comment has been minimized.

Member

LemonBoy commented Jun 4, 2017

Since people have shown renewed interest in running irssi on shoeboxes and similar underpowered hardware I think we should also think about the possible performance implications of "sanitizing" the input string every time we're going to run a match against it.

@dequis

This comment has been minimized.

Member

dequis commented Jun 4, 2017

shoeboxes and similar underpowered hardware

Heheh.

I dunno, the performance issues we've fixed recently were O(n^2)-like things that broke for large values of n. This change looks linear on message length to me.

#include "iregex.h"
const gchar *

This comment has been minimized.

@dequis

dequis Jun 4, 2017

Member

static

edit: in make_valid_utf8 and strlen_pua_oddly

This comment has been minimized.

@ailin-nemui

ailin-nemui Jun 5, 2017

Contributor

indeed

gint match_num,
gint *start_pos,
gint *end_pos,
const gchar *new_string);

This comment has been minimized.

@dequis

dequis Jun 4, 2017

Member

What is this new_string for? It seems to behave differently to the i_regex_match one, and one of the two implementations doesn't use it. I assume it's something related to the PUA stuff, but a comment would help

This comment has been minimized.

@ailin-nemui

ailin-nemui Jun 5, 2017

Contributor

i_regex_match allocates the new_string if string is invalid utf8, and i_match makes used of it to do the reverse offset translation. I added a few comments but maybe we can get rid of new_string instead

ailin-nemui added some commits Jun 5, 2017

Update iregex-gregex.c
make helper functions static
Update iregex-gregex.c
add 2 comments about new_string

@ailin-nemui ailin-nemui merged commit 1656dc1 into irssi:master Jul 3, 2017

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

@ailin-nemui ailin-nemui deleted the ailin-nemui:regexex branch Aug 17, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment