Add caseSensitive option #65

Mottie · 2016-08-28T18:15:04Z

I messed up merging to master instead of the 8.0.0 branch. This PR replaces #59.

julkue · 2016-08-29T06:48:15Z

src/mark.js

-            "yŸÿýÝ",
-            "zŽžżŻźŹ"
-        ];
+            "aàáâãäåāą",


Writing the array this way – with a separate line for each property – is clean if it's not so large.
In this case please try to fill out the line length of 80chars for the const dct array in order to reduce the function height

julkue · 2016-08-29T07:06:53Z

Hi @Mottie,

Thanks for the re-submission!

First, apologize that I haven't had the time to give your PR a feedback earlier.
Seems like you've done a great job, really! I've just added a few notes about uniform spelling.

I'd like to merge the case sensitive option. From my point of view you've also added the case sensitive option for the synonyms option, but haven't added a test for this. Am I right? If so, it'd be great to make sure this works too.

Now to the diacritics issue. I understand that this might be a problem in non-unicode countries. Setting the charset attribute would be one way to workaround this, like one of your users mentioned. I agree that we could workaround this by encoding the diacritics ourself. But I'm not open to encode them in the source itself, because this makes the readability impossible.
I've previously used Google's Closure Compiler to get the minified JS files. This encoded diacritics, as you can see e.g. here. Now, I think it would be enough to just encode them in the build. The ASCIIOnly option of UglifyJS doesn't seem to do this, so we'd need to find another way. Maybe part of the diacritics repo?

Thanks again. Looking forward to your reply.

Mottie · 2016-08-29T16:00:40Z

Hi @julmot!

Thanks for the great feedback! I'll start working on the changes. As for the encoding, the files in the dist folder should definitely encode the diacritics. I think you're right, that would be a nice addition to the diacritics repo.

julkue · 2016-08-29T19:08:14Z

Thanks for reworking. Every point is fine now! 👍

I just wanted to merge this but found out that there's one behavior that doesn't seem to be as expected.
When searching for "lorem" with disabled caseSensitive and enabled diacritics it will log:

mark.js: Searching with expression "/()([lł][oòóôõöøō][rř][eèéêëěēę]m)/gim"

I'm expecting that the uppercase diacritics are also covered in the regular expression when caseSensitive is false, like it was before. It's strange that this finds e.g. lÒrem at all, as I'm expecting that the i flag only works with ASCII characters.

julkue · 2016-08-29T19:15:10Z

Found another issue. When searching with the following configuration:

var instance = new Mark(document.getElementById("context"));
instance.mark("lorem", {
    debug: true,
    diacritics: true,
    caseSensitive: false,
    synonyms: {
        "lorem": "Lorem"
    },
    each: function(node){
        console.log(node);
    }
});

The RegExp will look like:

/()(([[LŁ][LŁ]][oòóôõöøō][rř][eèéêëěēę]m|[[LŁ][LŁ]][oòóôõöøō][rř][eèéêëěēę]m))/gim

Mottie · 2016-08-29T21:19:14Z

The "simplest" solution that I came up with was to make the query lower case

if (sens) {
    // Prevent insensitive searches from including upper case
    // diacritics in the regular expression
    ch = ch.toLowerCase();
}

I'm not sure what you'll think of that addition. I did add a test with the code you shared above.

Mottie · 2016-08-29T21:20:56Z

Hmm, I forgot to use debug... it seems the second part of the regex is still being added:

/()(([lł][oòóôõöøō][rř][eèéêëěēę]m|[lł][oòóôõöøō][rř][eèéêëěēę]m))/gim

Mottie · 2016-08-29T21:34:34Z

Duh, it's because the second part is added as a synonym.

I also thought about doing something like this:

createDiacriticsRegExp(str) {
    let dct = [
        "aàáâãäåāą", "cçćč", "dđď", "eèéêëěēę", "iìíîïī", "lł", "nñňń",
        "oòóôõöøō", "rř", "sšś", "tť", "uùúûüůū", "yÿý", "zžżź"
    ],
        sens = "i";
    if (this.opt.caseSensitive) {
        sens = "";
        dct = dct.concat(dct.slice().join(",").toUpperCase().split(","));
    }
    let handled = [];
    str.split("").forEach(ch => {
        dct.every(dct => {
            // Check if the character is inside a diacritics list
            if(dct.indexOf(ch) !== -1) {
                // Check if the related diacritics list was not
                // handled yet
                if(handled.indexOf(dct) > -1) {
                    return false;
                }
                // Make sure that the character OR any other
                // character in the diacritics list will be matched
                str = str.replace(
                    new RegExp(`[${dct}]`, `gm${sens}`), `[${dct}]`
                );
                handled.push(dct);
            }
            return true;
        });
    });
    return str;
}

The diacritic array only contains lower case characters. It converts and concatinates a clone of the array in upper case. What do you think of this approach?

I always second guess myself... should I use .toUpperCase() or .toLocaleUpperCase()?

julkue · 2016-08-30T18:36:47Z

@Mottie Just for our understanding, the second group is indeed the synonym. If the value is the same word as the key, then of course the groups will be the same. But that wasn't what I meant above, I was talking about the [[LŁ][LŁ]] inside the RegExp.

I can't give you a solution for the issue on the fly and unfortunately don't have time for it today. Will check this out directly after fixing an open bug. Thanks for your patience!

Mottie · 2016-08-30T19:06:59Z

Oh, I shared two different solutions above. No worries, take your time 😁

julkue · 2016-09-03T06:27:33Z

Now, the bugs have been fixed and I'll have the time to come back to your PR. Once again, thank you for your patience!

I've made a few tests with the current implementation and here are my results:

When searching for "lorem" without any additional option (except debug) I expect a regular expression that contains the upper and lower case diacritics. Currently only lower case diacritics are included. Regular expression: /()([lł][oòóôõöøō][rř][eèéêëěēę]m)/gim
When switching between caseSensitive false/true the correct diacritics (upper or lower case) will be added to the regular expression
When searching for "lorem" with a synonyms object {"lorem": "Lorem"} and caseSensitive: true the regular expression looks fine: /()(([lł][oòóôõöøō][rř][eèéêëěēę]m|[LŁ][oòóôõöøō][rř][eèéêëěēę]m))/gm

In summary it seems like only the first point is left open. As mentioned, the regular expression in this case seems to be wrong, but it works. I don't understand why, because the i flag should normally be only insensitive for non-diacritic characters. Anyway, this should be fixed.

The diacritic array only contains lower case characters. It converts and concatinates a clone of the array in upper case. What do you think of this approach?

I'm uncertain if this would be possible for all kind of diacritics. There may be diacritics only available in upper case characters. To play it safe I'd include also upper case diacritics. Don't you think so?

Mottie · 2016-09-03T15:32:03Z

dist/jquery.mark.es6.js

                }
            }
            return str;
        }

        createDiacriticsRegExp(str) {
-            const dct = ["aÀÁÂÃÄÅàáâãäåĀāąĄ", "cÇçćĆčČ", "dđĐďĎ", "eÈÉÊËèéêëěĚĒēęĘ", "iÌÍÎÏìíîïĪī", "lłŁ", "nÑñňŇńŃ", "oÒÓÔÕÕÖØòóôõöøŌō", "rřŘ", "sŠšśŚ", "tťŤ", "uÙÚÛÜùúûüůŮŪū", "yŸÿýÝ", "zŽžżŻźŹ"];
+            const sens = this.opt.caseSensitive ? "" : "i",
+                  dct = this.opt.caseSensitive ? ["aàáâãäåāą", "AÀÁÂÃÄÅĀĄ", "cçćč", "CÇĆČ", "dđď", "DĐĎ", "eèéêëěēę", "EÈÉÊËĚĒĘ", "iìíîïī", "IÌÍÎÏĪ", "lł", "LŁ", "nñňń", "NÑŇŃ", "oòóôõöøō", "OÒÓÔÕÖØŌ", "rř", "RŘ", "sšś", "SŠŚ", "tť", "TŤ", "uùúûüůū", "UÙÚÛÜŮŪ", "yÿý", "YŸÝ", "zžżź", "ZŽŻŹ"] : ["aÀÁÂÃÄÅàáâãäåĀāąĄ", "cÇçćĆčČ", "dđĐďĎ", "eÈÉÊËèéêëěĚĒēęĘ", "iÌÍÎÏìíîïĪī", "lłŁ", "nÑñňŇńŃ", "oÒÓÔÕÖØòóôõöøŌō", "rřŘ", "sŠšśŚ", "tťŤ", "uÙÚÛÜùúûüůŮŪū", "yŸÿýÝ", "zŽžżŻźŹ"];


I expect a regular expression that contains the upper and lower case diacritics

This should take care of the regex.

julkue · 2016-09-03T19:55:48Z

Thanks @Mottie, solved very clever! 👏
Thanks for your work in general!

I'm going to merge this now.
Anyway, it'll be hard to implement this output in the diacritics project. We'd have to:

Output them in two arrays, one including just the upper case characters and one the lower case
Output them in an array containing both, upper and lower case characters
Offer an output to ignore upper case characters in an array containing both, as it'll be mapped with an i flag (when used within a RegExp)

julkue added the type: enhancement label Aug 28, 2016

julkue self-assigned this Aug 28, 2016

julkue reviewed Aug 29, 2016
View reviewed changes

julkue mentioned this pull request Sep 3, 2016

Concept & Specification diacritics/database#1

Closed

4 tasks

Add caseSensitive option

99711ae

Mottie reviewed Sep 3, 2016
View reviewed changes

julkue merged commit 62a4cbf into julkue:master Sep 3, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add caseSensitive option #65

Add caseSensitive option #65

Mottie commented Aug 28, 2016

julkue Aug 29, 2016

julkue commented Aug 29, 2016

Mottie commented Aug 29, 2016

julkue commented Aug 29, 2016 •

edited

Loading

julkue commented Aug 29, 2016 •

edited

Loading

Mottie commented Aug 29, 2016

Mottie commented Aug 29, 2016

Mottie commented Aug 29, 2016 •

edited

Loading

julkue commented Aug 30, 2016

Mottie commented Aug 30, 2016

julkue commented Sep 3, 2016

Mottie Sep 3, 2016

julkue commented Sep 3, 2016

Add caseSensitive option #65

Add caseSensitive option #65

Conversation

Mottie commented Aug 28, 2016

julkue Aug 29, 2016

Choose a reason for hiding this comment

julkue commented Aug 29, 2016

Mottie commented Aug 29, 2016

julkue commented Aug 29, 2016 • edited Loading

julkue commented Aug 29, 2016 • edited Loading

Mottie commented Aug 29, 2016

Mottie commented Aug 29, 2016

Mottie commented Aug 29, 2016 • edited Loading

julkue commented Aug 30, 2016

Mottie commented Aug 30, 2016

julkue commented Sep 3, 2016

Mottie Sep 3, 2016

Choose a reason for hiding this comment

julkue commented Sep 3, 2016

julkue commented Aug 29, 2016 •

edited

Loading

julkue commented Aug 29, 2016 •

edited

Loading

Mottie commented Aug 29, 2016 •

edited

Loading