Unexpected matches in the resulting RegExp #29

Closed
WebReflection opened this issue Jun 12, 2015 · 3 comments

@WebReflection

I was trying to use this for twemoji, but I've run into some very weird behavior. I'm not sure whether I'm doing it wrong or there's a bug in here (I haven't checked your source code yet).

The resulting RegExp matches # and every character between 0 and 9, and I have no idea why, so I've prepared this test code:

var list = ["🇨🇳", "🇺🇸", "🇷🇺", "🇰🇷", "🇯🇵", "🇮🇹", "🇬🇧", "🇫🇷", "🇪🇸", "🇩🇪", "9⃣", "8⃣", "7⃣", "6⃣", "5⃣", "4⃣", "3⃣", "2⃣", "1⃣", "0⃣", "#⃣", "🚳", "🚱", "🚰", "🚯", "🚮", "🚦", "🚣", "🚡", "🚠", "🚟", "🚞", "🚝", "🚜", "🚛", "🚘", "🚖", "🚔", "🚐", "🚎", "🚍", "🚋", "🚊", "🚈", "🚆", "🚂", "🚁", "😶", "😴", "😯", "😮", "😬", "😧", "😦", "😟", "😛", "😙", "😗", "😕", "😑", "😐", "😎", "😈", "😇", "😀", "🕧", "🕦", "🕥", "🕤", "🕣", "🕢", "🕡", "🕠", "🕟", "🕞", "🕝", "🕜", "🔭", "🔬", "🔕", "🔉", "🔈", "🔇", "🔆", "🔅", "🔄", "🔂", "🔁", "🔀", "📵", "📯", "📭", "📬", "💷", "💶", "💭", "👭", "👬", "👥", "🐪", "🐖", "🐕", "🐓", "🐐", "🐏", "🐋", "🐊", "🐉", "🐈", "🐇", "🐆", "🐅", "🐄", "🐃", "🐂", "🐁", "🐀", "🏤", "🏉", "🏇", "🍼", "🍐", "🍋", "🌳", "🌲", "🌞", "🌝", "🌜", "🌚", "🌘", "🃏", "🅰", "🅱", "🅾", "🆎", "🆑", "🆒", "🆓", "🆔", "🆕", "🆖", "🆗", "🆘", "🆙", "🆚", "👷", "🛅", "🛄", "🛃", "🛂", "🛁", "🚿", "🚸", "🚷", "🚵", "🈁", "🈂", "🈲", "🈳", "🈴", "🈵", "🈶", "🈷", "🈸", "🈹", "🈺", "🉐", "🉑", "🌀", "🌁", "🌂", "🌃", "🌄", "🌅", "🌆", "🌇", "🌈", "🌉", "🌊", "🌋", "🌌", "🌏", "🌑", "🌓", "🌔", "🌕", "🌙", "🌛", "🌟", "🌠", "🌰", "🌱", "🌴", "🌵", "🌷", "🌸", "🌹", "🌺", "🌻", "🌼", "🌽", "🌾", "🌿", "🍀", "🍁", "🍂", "🍃", "🍄", "🍅", "🍆", "🍇", "🍈", "🍉", "🍊", "🍌", "🍍", "🍎", "🍏", "🍑", "🍒", "🍓", "🍔", "🍕", "🍖", "🍗", "🍘", "🍙", "🍚", "🍛", "🍜", "🍝", "🍞", "🍟", "🍠", "🍡", "🍢", "🍣", "🍤", "🍥", "🍦", "🍧", "🍨", "🍩", "🍪", "🍫", "🍬", "🍭", "🍮", "🍯", "🍰", "🍱", "🍲", "🍳", "🍴", "🍵", "🍶", "🍷", "🍸", "🍹", "🍺", "🍻", "🎀", "🎁", "🎂", "🎃", "🎄", "🎅", "🎆", "🎇", "🎈", "🎉", "🎊", "🎋", "🎌", "🎍", "🎎", "🎏", "🎐", "🎑", "🎒", "🎓", "🎠", "🎡", "🎢", "🎣", "🎤", "🎥", "🎦", "🎧", "🎨", "🎩", "🎪", "🎫", "🎬", "🎭", "🎮", "🎯", "🎰", "🎱", "🎲", "🎳", "🎴", "🎵", "🎶", "🎷", "🎸", "🎹", "🎺", "🎻", "🎼", "🎽", "🎾", "🎿", "🏀", "🏁", "🏂", "🏃", "🏄", "🏆", "🏈", "🏊", "🏠", "🏡", "🏢", "🏣", "🏥", "🏦", "🏧", "🏨", "🏩", "🏪", "🏫", "🏬", "🏭", "🏮", "🏯", "🏰", "🐌", "🐍", "🐎", "🐑", "🐒", "🐔", "🐗", "🐘", "🐙", "🐚", "🐛", "🐜", "🐝", "🐞", "🐟", "🐠", "🐡", "🐢", "🐣", "🐤", "🐥", "🐦", "🐧", "🐨", "🐩", "🐫", "🐬", "🐭", "🐮", "🐯", "🐰", "🐱", "🐲", "🐳", "🐴", "🐵", "🐶", "🐷", "🐸", "🐹", "🐺", "🐻", "🐼", "🐽", "🐾", "👀", "👂", "👃", "👄", "👅", "👆", "👇", "👈", "👉", "👊", "👋", 
"👌", "👍", "👎", "👏", "👐", "👑", "👒", "👓", "👔", "👕", "👖", "👗", "👘", "👙", "👚", "👛", "👜", "👝", "👞", "👟", "👠", "👡", "👢", "👣", "👤", "👦", "👧", "👨", "👩", "👪", "👫", "👮", "👯", "👰", "👱", "👲", "👳", "👴", "👵", "👶", "🚴", "👸", "👹", "👺", "👻", "👼", "👽", "👾", "👿", "💀", "💁", "💂", "💃", "💄", "💅", "💆", "💇", "💈", "💉", "💊", "💋", "💌", "💍", "💎", "💏", "💐", "💑", "💒", "💓", "💔", "💕", "💖", "💗", "💘", "💙", "💚", "💛", "💜", "💝", "💞", "💟", "💠", "💡", "💢", "💣", "💤", "💥", "💦", "💧", "💨", "💩", "💪", "💫", "💬", "💮", "💯", "💰", "💱", "💲", "💳", "💴", "💵", "💸", "💹", "💺", "💻", "💼", "💽", "💾", "💿", "📀", "📁", "📂", "📃", "📄", "📅", "📆", "📇", "📈", "📉", "📊", "📋", "📌", "📍", "📎", "📏", "📐", "📑", "📒", "📓", "📔", "📕", "📖", "📗", "📘", "📙", "📚", "📛", "📜", "📝", "📞", "📟", "📠", "📡", "📢", "📣", "📤", "📥", "📦", "📧", "📨", "📩", "📪", "📫", "📮", "📰", "📱", "📲", "📳", "📴", "📶", "📷", "📹", "📺", "📻", "📼", "🔃", "🔊", "🔋", "🔌", "🔍", "🔎", "🔏", "🔐", "🔑", "🔒", "🔓", "🔔", "🔖", "🔗", "🔘", "🔙", "🔚", "🔛", "🔜", "🔝", "🔞", "🔟", "🔠", "🔡", "🔢", "🔣", "🔤", "🔥", "🔦", "🔧", "🔨", "🔩", "🔪", "🔫", "🔮", "🔯", "🔰", "🔱", "🔲", "🔳", "🔴", "🔵", "🔶", "🔷", "🔸", "🔹", "🔺", "🔻", "🔼", "🔽", "🕐", "🕑", "🕒", "🕓", "🕔", "🕕", "🕖", "🕗", "🕘", "🕙", "🕚", "🕛", "🗻", "🗼", "🗽", "🗾", "🗿", "😁", "😂", "😃", "😄", "😅", "😆", "😉", "😊", "😋", "😌", "😍", "😏", "😒", "😓", "😔", "😖", "😘", "😚", "😜", "😝", "😞", "😠", "😡", "😢", "😣", "😤", "😥", "😨", "😩", "😪", "😫", "😭", "😰", "😱", "😲", "😳", "😵", "😷", "😸", "😹", "😺", "😻", "😼", "😽", "😾", "😿", "🙀", "🙅", "🙆", "🙇", "🙈", "🙉", "🙊", "🙋", "🙌", "🙍", "🙎", "🙏", "🚀", "🚃", "🚄", "🚅", "🚇", "🚉", "🚌", "🚏", "🚑", "🚒", "🚓", "🚕", "🚗", "🚙", "🚚", "🚢", "🚤", "🚥", "🚧", "🚨", "🚩", "🚪", "🚫", "🚬", "🚭", "🚲", "🚶", "🚹", "🚺", "🚻", "🚼", "🚽", "🚾", "🛀", "🇦", "🇧", "🇨", "🇩", "🇪", "🇫", "🇬", "🇭", "🇮", "🇯", "🇰", "🇱", "🇲", "🇳", "🇴", "🇵", "🇶", "🇷", "🇸", "🇹", "🇺", "🇻", "🇼", "🇽", "🇾", "🇿", "🌍", "🌎", "🌐", "🌒", "🌖", "🌗", "", "〰", "➰", "➗", "➖", "➕", "❕", "❔", "❓", "❎", "❌", "✨", "✋", "✊", "✅", "⛎", "⏳", "⏰", "⏬", "⏫", "⏪", "⏩", "™", "➿", "©", "®"];
var regenerate = require('regenerate');
var regenerated = regenerate.apply(null, list);
console.log(list.filter(function (chr) {
  return (0 <= chr && chr <= 9) || chr == '#';
}).length ?
  'There should be #0-9 in the RegExp' :
  'There should be NO #0-9 in the RegExp'
);
console.log(regenerated.toRegExp());

This should not result in the following RegExp:

/[#0-9\xA9\xAE\u2122\u23E9-\u23EC\u23F0\u23F3\u26CE\u2705\u270A\u270B\u2728\u274C\u274E\u2753-\u2755\u2795-\u2797\u27B0\u27BF\u3030\uE50A]|\uD83C[\uDCCF\uDD70\uDD71\uDD7E\uDD8E\uDD91-\uDD9A\uDDE6-\uDDFF\uDE01\uDE02\uDE32-\uDE3A\uDE50\uDE51\uDF00-\uDF20\uDF30-\uDF35\uDF37-\uDF7C\uDF80-\uDF93\uDFA0-\uDFC4\uDFC6-\uDFCA\uDFE0-\uDFF0]|\uD83D[\uDC00-\uDC3E\uDC40\uDC42-\uDCF7\uDCF9-\uDCFC\uDD00-\uDD3D\uDD50-\uDD67\uDDFB-\uDE40\uDE45-\uDE4F\uDE80-\uDEC5]/

because #0-9 is absolutely undesired as a match.

Thanks for any sort of outcome.

@WebReflection
Author

OK, found the gotcha ...

var list = ["9⃣", "8⃣", "7⃣", "6⃣", "5⃣", "4⃣", "3⃣", "2⃣", "1⃣", "0⃣", "#⃣"];
var regenerate = require('regenerate');
var regenerated = regenerate.apply(null, list);
console.log(list.filter(function (chr) {
  return (0 <= chr && chr <= 9) || chr == '#';
}).length ?
  'There should be #0-9 in the RegExp' :
  'There should be NO #0-9 in the RegExp'
);
console.log(regenerated.toRegExp());

@mathiasbynens
Owner

Regenerate only deals with individual code points or symbols (by design). The problem is that list contains strings made up of multiple code points, namely the first 21 entries:

"🇨🇳", "🇺🇸", "🇷🇺", "🇰🇷", "🇯🇵", "🇮🇹", "🇬🇧", "🇫🇷", "🇪🇸", "🇩🇪", "9⃣", "8⃣", "7⃣", "6⃣", "5⃣", "4⃣", "3⃣", "2⃣", "1⃣", "0⃣", "#⃣",
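
To make the point above concrete, here is a minimal sketch that prints the code points inside a few of these strings. `codePoints` is a hypothetical helper, not part of Regenerate, and it assumes an ES2015 environment (`Array.from`, `String.prototype.codePointAt`). Judging by the RegExp posted earlier in this thread, only the first code point of each multi-code-point string seems to end up in the set, which would explain the unwanted #0-9.

```javascript
// Hypothetical helper: list the code points of a string as uppercase hex.
// Array.from iterates a string by code point, not by UTF-16 code unit.
function codePoints(str) {
  return Array.from(str, function (symbol) {
    return symbol.codePointAt(0).toString(16).toUpperCase();
  });
}

// Keycap "9": the digit 9 followed by U+20E3 COMBINING ENCLOSING KEYCAP.
console.log(codePoints('9\u20E3')); // [ '39', '20E3' ]

// Flag of China: two regional indicator symbols, C then N.
console.log(codePoints('\uD83C\uDDE8\uD83C\uDDF3')); // [ '1F1E8', '1F1F3' ]

// A single emoji such as U+1F6B3 is one code point, so it is safe to pass along.
console.log(codePoints('\uD83D\uDEB3')); // [ '1F6B3' ]
```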

I’d suggest you manually create the regexp for those 21 emoji, and have the rest of the regex generated by Regenerate.
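
One way to follow that suggestion, as a rough sketch: split the list into single-code-point symbols and multi-code-point sequences, and build the pattern for the sequences by hand. The helper names and the short sample list below are made up for illustration; the single symbols would still go through `regenerate.apply(null, single)` as in the original test code.

```javascript
// Hypothetical helper: number of code points in a string.
function countCodePoints(str) {
  return Array.from(str).length;
}

// Hypothetical helper: write every UTF-16 code unit as a \uXXXX escape,
// so characters such as "#" need no further RegExp escaping.
function toUnicodeEscapes(str) {
  return str.split('').map(function (unit) {
    var hex = unit.charCodeAt(0).toString(16).toUpperCase();
    return '\\u' + ('0000' + hex).slice(-4);
  }).join('');
}

// Short sample standing in for the full list: a flag, two keycaps,
// one single-code-point emoji, and the copyright sign.
var list = ['\uD83C\uDDE8\uD83C\uDDF3', '9\u20E3', '#\u20E3', '\uD83D\uDEB3', '\u00A9'];

// Safe to hand to Regenerate: exactly one code point each.
var single = list.filter(function (s) { return countCodePoints(s) === 1; });

// Needs a hand-written alternation: more than one code point each.
var sequences = list.filter(function (s) { return countCodePoints(s) > 1; });

// Longest first, so a longer sequence is tried before a shorter one.
var sequencePattern = sequences
  .sort(function (a, b) { return b.length - a.length; })
  .map(toUnicodeEscapes)
  .join('|');

var sequenceRegExp = new RegExp(sequencePattern, 'g');

console.log(single.length, sequences.length);
console.log(sequencePattern);
console.log('a 9 b 9\u20E3 c #'.match(sequenceRegExp));
```

The two patterns can then be joined with `|`, with the sequence alternatives first so that a keycap is matched as a whole rather than by its leading character.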

@WebReflection
Author

So I guess I should convert those chars upfront and then reparse, right? Thanks. I knew it was me, although a warning along the lines of "you are doing it wrong with these strings" would have been nicer than a silent success with a potentially broken RegExp.

Talking from a parsing security point of view. Will close this anyway.
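
Until something like that warning exists, a guard on the caller's side could fail loudly instead. This is only a sketch: `assertSingleCodePoints` is a hypothetical function, not something Regenerate provides, and it assumes ES2015 `Array.from`.

```javascript
// Hypothetical guard: throw if any entry is not exactly one code point.
// Returns the input unchanged so it can wrap the list in place.
function assertSingleCodePoints(strings) {
  var offenders = strings.filter(function (s) {
    return Array.from(s).length !== 1;
  });
  if (offenders.length > 0) {
    throw new TypeError(
      offenders.length + ' entries are not single code points: ' +
      JSON.stringify(offenders)
    );
  }
  return strings;
}

// Passes: every entry is a single code point.
assertSingleCodePoints(['\uD83D\uDEB3', '\u00A9']);

// Throws: the keycap is "9" followed by U+20E3.
try {
  assertSingleCodePoints(['\uD83D\uDEB3', '9\u20E3']);
} catch (error) {
  console.log(error.message);
}
```

With this in place, the call from the test code could become `regenerate.apply(null, assertSingleCodePoints(list))`, so a bad list is rejected before a RegExp is ever built.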
