Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add caseSensitive option #65

Merged
merged 1 commit into from
Sep 3, 2016
Merged

Add caseSensitive option #65

merged 1 commit into from
Sep 3, 2016

Conversation

Mottie
Copy link
Collaborator

@Mottie Mottie commented Aug 28, 2016

I messed up merging to master instead of the 8.0.0 branch. This PR replaces #59.

@julkue julkue self-assigned this Aug 28, 2016
"yŸÿýÝ",
"zŽžżŻźŹ"
];
"aàáâãäåāą",
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Writing the array this way – with a separate line for each property – is clean if it's not so large.
In this case please try to fill out the line length of 80chars for the const dct array in order to reduce the function height

@julkue
Copy link
Owner

julkue commented Aug 29, 2016

Hi @Mottie,

Thanks for the re-submission!

First, apologize that I haven't had the time to give your PR a feedback earlier.
Seems like you've done a great job, really! I've just added a few notes about uniform spelling.

I'd like to merge the case sensitive option. From my point of view you've also added the case sensitive option for the synonyms option, but haven't added a test for this. Am I right? If so, it'd be great to make sure this works too.

Now to the diacritics issue. I understand that this might be a problem in non-unicode countries. Setting the charset attribute would be one way to workaround this, like one of your users mentioned. I agree that we could workaround this by encoding the diacritics ourself. But I'm not open to encode them in the source itself, because this makes the readability impossible.
I've previously used Google's Closure Compiler to get the minified JS files. This encoded diacritics, as you can see e.g. here. Now, I think it would be enough to just encode them in the build. The ASCIIOnly option of UglifyJS doesn't seem to do this, so we'd need to find another way. Maybe part of the diacritics repo?

Thanks again. Looking forward to your reply.

@Mottie
Copy link
Collaborator Author

Mottie commented Aug 29, 2016

Hi @julmot!

Thanks for the great feedback! I'll start working on the changes. As for the encoding, the files in the dist folder should definitely encode the diacritics. I think you're right, that would be a nice addition to the diacritics repo.

@julkue
Copy link
Owner

julkue commented Aug 29, 2016

Thanks for reworking. Every point is fine now! 👍

I just wanted to merge this but found out that there's one behavior that doesn't seem to be as expected.
When searching for "lorem" with disabled caseSensitive and enabled diacritics it will log:

mark.js: Searching with expression "/()([lł][oòóôõöøō][rř][eèéêëěēę]m)/gim"

I'm expecting that the uppercase diacritics are also covered in the regular expression when caseSensitive is false, like it was before. It's strange that this finds e.g. lÒrem at all, as I'm expecting that the i flag only works with ASCII characters.

@julkue
Copy link
Owner

julkue commented Aug 29, 2016

Found another issue. When searching with the following configuration:

var instance = new Mark(document.getElementById("context"));
instance.mark("lorem", {
    debug: true,
    diacritics: true,
    caseSensitive: false,
    synonyms: {
        "lorem": "Lorem"
    },
    each: function(node){
        console.log(node);
    }
});

The RegExp will look like:

/()(([[LŁ][LŁ]][oòóôõöøō][rř][eèéêëěēę]m|[[LŁ][LŁ]][oòóôõöøō][rř][eèéêëěēę]m))/gim

@Mottie
Copy link
Collaborator Author

Mottie commented Aug 29, 2016

The "simplest" solution that I came up with was to make the query lower case

if (sens) {
    // Prevent insensitive searches from including upper case
    // diacritics in the regular expression
    ch = ch.toLowerCase();
}

I'm not sure what you'll think of that addition. I did add a test with the code you shared above.

@Mottie
Copy link
Collaborator Author

Mottie commented Aug 29, 2016

Hmm, I forgot to use debug... it seems the second part of the regex is still being added:

/()(([lł][oòóôõöøō][rř][eèéêëěēę]m|[lł][oòóôõöøō][rř][eèéêëěēę]m))/gim

@Mottie
Copy link
Collaborator Author

Mottie commented Aug 29, 2016

Duh, it's because the second part is added as a synonym.

I also thought about doing something like this:

createDiacriticsRegExp(str) {
    let dct = [
        "aàáâãäåāą", "cçćč", "dđď", "eèéêëěēę", "iìíîïī", "lł", "nñňń",
        "oòóôõöøō", "rř", "sšś", "tť", "uùúûüůū", "yÿý", "zžżź"
    ],
        sens = "i";
    if (this.opt.caseSensitive) {
        sens = "";
        dct = dct.concat(dct.slice().join(",").toUpperCase().split(","));
    }
    let handled = [];
    str.split("").forEach(ch => {
        dct.every(dct => {
            // Check if the character is inside a diacritics list
            if(dct.indexOf(ch) !== -1) {
                // Check if the related diacritics list was not
                // handled yet
                if(handled.indexOf(dct) > -1) {
                    return false;
                }
                // Make sure that the character OR any other
                // character in the diacritics list will be matched
                str = str.replace(
                    new RegExp(`[${dct}]`, `gm${sens}`), `[${dct}]`
                );
                handled.push(dct);
            }
            return true;
        });
    });
    return str;
}

The diacritic array only contains lower case characters. It converts and concatinates a clone of the array in upper case. What do you think of this approach?

I always second guess myself... should I use .toUpperCase() or .toLocaleUpperCase()?

@julkue
Copy link
Owner

julkue commented Aug 30, 2016

@Mottie Just for our understanding, the second group is indeed the synonym. If the value is the same word as the key, then of course the groups will be the same. But that wasn't what I meant above, I was talking about the [[LŁ][LŁ]] inside the RegExp.

I can't give you a solution for the issue on the fly and unfortunately don't have time for it today. Will check this out directly after fixing an open bug. Thanks for your patience!

@Mottie
Copy link
Collaborator Author

Mottie commented Aug 30, 2016

Oh, I shared two different solutions above. No worries, take your time 😁

@julkue
Copy link
Owner

julkue commented Sep 3, 2016

Now, the bugs have been fixed and I'll have the time to come back to your PR. Once again, thank you for your patience!

I've made a few tests with the current implementation and here are my results:

  • When searching for "lorem" without any additional option (except debug) I expect a regular expression that contains the upper and lower case diacritics. Currently only lower case diacritics are included. Regular expression: /()([lł][oòóôõöøō][rř][eèéêëěēę]m)/gim
  • When switching between caseSensitive false/true the correct diacritics (upper or lower case) will be added to the regular expression
  • When searching for "lorem" with a synonyms object {"lorem": "Lorem"} and caseSensitive: true the regular expression looks fine: /()(([lł][oòóôõöøō][rř][eèéêëěēę]m|[LŁ][oòóôõöøō][rř][eèéêëěēę]m))/gm

In summary it seems like only the first point is left open. As mentioned, the regular expression in this case seems to be wrong, but it works. I don't understand why, because the i flag should normally be only insensitive for non-diacritic characters. Anyway, this should be fixed.

The diacritic array only contains lower case characters. It converts and concatinates a clone of the array in upper case. What do you think of this approach?

I'm uncertain if this would be possible for all kind of diacritics. There may be diacritics only available in upper case characters. To play it safe I'd include also upper case diacritics. Don't you think so?

}
}
return str;
}

createDiacriticsRegExp(str) {
const dct = ["aÀÁÂÃÄÅàáâãäåĀāąĄ", "cÇçćĆčČ", "dđĐďĎ", "eÈÉÊËèéêëěĚĒēęĘ", "iÌÍÎÏìíîïĪī", "lłŁ", "nÑñňŇńŃ", "oÒÓÔÕÕÖØòóôõöøŌō", "rřŘ", "sŠšśŚ", "tťŤ", "uÙÚÛÜùúûüůŮŪū", "yŸÿýÝ", "zŽžżŻźŹ"];
const sens = this.opt.caseSensitive ? "" : "i",
dct = this.opt.caseSensitive ? ["aàáâãäåāą", "AÀÁÂÃÄÅĀĄ", "cçćč", "CÇĆČ", "dđď", "DĐĎ", "eèéêëěēę", "EÈÉÊËĚĒĘ", "iìíîïī", "IÌÍÎÏĪ", "lł", "LŁ", "nñňń", "NÑŇŃ", "oòóôõöøō", "OÒÓÔÕÖØŌ", "rř", "RŘ", "sšś", "SŠŚ", "tť", "TŤ", "uùúûüůū", "UÙÚÛÜŮŪ", "yÿý", "YŸÝ", "zžżź", "ZŽŻŹ"] : ["aÀÁÂÃÄÅàáâãäåĀāąĄ", "cÇçćĆčČ", "dđĐďĎ", "eÈÉÊËèéêëěĚĒēęĘ", "iÌÍÎÏìíîïĪī", "lłŁ", "nÑñňŇńŃ", "oÒÓÔÕÖØòóôõöøŌō", "rřŘ", "sŠšśŚ", "tťŤ", "uÙÚÛÜùúûüůŮŪū", "yŸÿýÝ", "zŽžżŻźŹ"];
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I expect a regular expression that contains the upper and lower case diacritics

This should take care of the regex.

@julkue
Copy link
Owner

julkue commented Sep 3, 2016

Thanks @Mottie, solved very clever! 👏
Thanks for your work in general!

I'm going to merge this now.
Anyway, it'll be hard to implement this output in the diacritics project. We'd have to:

  • Output them in two arrays, one including just the upper case characters and one the lower case
  • Output them in an array containing both, upper and lower case characters
  • Offer an output to ignore upper case characters in an array containing both, as it'll be mapped with an i flag (when used within a RegExp)

@julkue julkue merged commit 62a4cbf into julkue:master Sep 3, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants