Greatly improve the performance of the PHP implementation #4
…rounding characters, as per original implementation
There is a slight difference between this implementation and the original. This one will not convert a valid shortcode if it shares its first colon with an invalid shortcode. For example: This could be remedied with an accurate static $shortcodeRegexp = '`:(?:x|1(?>00|234)|8ball|a(?:(?:b(?:cd?)?|ccept|erial_tramway|irplane|l(?>arm_clock|ien)|mbulance|n(?>t|chor|g(?>e[lr]|ry|uished))|pple|quarius|r(?:ies|row(?:_(?:backward|do(?:uble_(?>down|up)|wn(?>_small)?)|forward|heading_(?>down|up)|l(?>ef|ower_(?>lef|righ))t|right(?>_hook)?|up(?>(?>_(?>down|small)|per_(?>lef|righ)t))?)|s_c(?>ounterc)?lockwise)|t(?>iculated_lorry)?)|stonished|t(?>m|hletic_shoe)))?|b(?:(?:a(?:by(?>_(?>bottle|chick|symbol))?|ck|ggage_claim|llo(?>on|t_box_with_check)|mboo|n(?>k|ana|gbang)|r(?>_chart|ber)|s(?>e|ket)ball|t(?:h(?>tub)?|tery))|e(?:ar|e(?:rs?|tle)?|ginner|ll|nto)|i(?>cyclist|k(?>e|ini)|r(?>d|thday))|l(?:ack_(?:circle|joker|large_square|medium_s(?>mall_s)?quare|nib|s(?>mall_square|quare_button))|o(?>ssom|wfish)|u(?>e_(?>book|car|heart)|sh))|o(?:y|ar|mb|o(?:[mt]|k(?:(?:s|mark(?>_tabs)?))?)|uquet|w(?>ling)?)|r(?>ead|i(?>d(?>e_with_veil|ge_at_night)|efcase)|oken_heart)|u(?:g|l(?>b|lettrain_(?>front|side))|s(?:stop|ts?_in_silhouette)?)))?|c(?:[dn]|a(?:ctus|ke|l(?>endar|ling)|me(?>l|ra)|n(?>cer|dy)|p(?>ital_abcd|ricorn)|r(?>d_index|ousel_horse)|t2?)|h(?:art(?>_with_(?>down|up)wards_trend)?|e(?>ckered_flag|rr(?>ies|y_blossom)|stnut)|i(?>cken|ldren_crossing)|ocolate_bar|ristmas_tree|urch)|i(?>nema|rcus_tent|ty_(?>dusk|sun(?>rise|set)))|l(?:(?:ap(?>per)?|ipboard|o(?:ck(?:1(?:(?:30|[012](?>30)?))?|[23456789](?>30)?)|sed_(?>book|lock_with_key|umbrella)|ud)|ubs))?|o(?:cktail|ffee|ld_sweat|mputer|n(?:f(?>etti_ball|ounded|used)|gratulations|struction(?>_worker)?|venience_store)|o(?>l|kie)|p(?>yright)?|rn|uple(?>_with_heart|kiss)?|w2?)|r(?:e(?>dit_card|scent_moon)|o(?>codile|ssed_flags|wn)|y(?>ing_cat_face|stal_ball)?)|u(?>pid|r(?>ly_loop|r(?>y|ency_exchange))|st(?>ard|oms))|yclone)|d(?:a(?:n(?:cers?|g
o)|rt|sh|te)|e(?>ciduous_tree|partment_store)?|i(?:amond(?>s|_shape_with_a_dot_inside)|sappointed(?>_relieved)?|zzy(?>_face)?)|o(?:_not_litter|g2?|l(?>l(?>s|ar)|phin)|or|ughnut)|r(?:agon(?>_face)?|ess|o(?>medary_camel|plet))|vd)|e(?:s|ar(?>(?>_of_rice|th_a(?>frica|mericas|sia)))?|gg(?>plant)?|ight(?>_(?>pointed_black_star|spoked_asterisk))?|le(?>ctric_plug|phant)|n(?:d|velope(?>_with_arrow)?)|uro(?>pean_(?>castl|post_offic)e)?|vergreen_tree|x(?>clamation|pressionless)|ye(?>glasse)?s|-?mail)|f(?:a(?>x|ctory|llen_leaf|mily|st_forward)|e(?>arful|et|rris_wheel)|i(?:le_folder|r(?:e(?>_engine|works)?|st_quarter_moon(?>_with_face)?)|s(?:t|h(?>_cake|ing_pole_and_fish)?)|ve)|l(?>a(?>gs|me|shlight)|o(?>ppy_disk|wer_playing_cards)|ushed)|o(?:ggy|ot(?>ball|prints)|rk_and_knife|u(?:ntain|r(?>_leaf_clover)?))|r(?:(?:ee|ie(?>s|d_shrimp)|o(?>wnin)?g))?|u(?:elpump|ll_moon(?>_with_face)?))|g(?:b|ame_die|em(?>ini)?|host|i(?:ft(?>_heart)?|rl)|lobe_with_meridians|o(?>at|lf)|r(?:a(?>ndma|pes)|e(?>en_(?>apple|book|heart)|y_(?>exclama|ques)tion)|i(?:macing|n(?>ning)?))|u(?>n|ardsman|itar))|h(?:a(?>ircut|m(?>m|burg|st)er|n(?>dbag|key)|sh|tch(?>ed|ing)_chick)|e(?:a(?:dphones|r(?:_no_evil|t(?:(?:s|_(?:decoration|eyes(?>_cat)?)|beat|pulse))?)|vy_(?>check_mark|d(?>ivision|ollar)_sign|m(?>inus_sign|ultiplication_x)|plus_sign))|licopter|rb)|i(?>biscus|gh_(?>brightness|heel))|o(?:ney_pot|rse(?>_racing)?|spital|t(?>el|springs)|u(?:rglass(?>_flowing_sand)?|se(?>_with_garden)?))|ushed)|i(?:t|ce_?cream|d(?>eograph_advantage)?|mp|n(?>box_tray|coming_envelope|formation_(?>desk_person|source)|nocent|terrobang)|phone|zakaya_lantern)|j(?:p|a(?:ck_o_lantern|pan(?>ese_(?>castle|goblin|ogre))?)|eans|oy(?>_cat)?)|k(?:r|ey(?>cap_ten)?|i(?:mono|ss(?:ing(?>_(?>c(?>at|losed_eyes)|heart|smiling_eyes))?)?)|nife|o(?>ala|ko))|l(?:a(?:rge_(?>blue_(?>circle|diamond)|orange_diamond)|st_quarter_moon(?>_with_face)?|ughing)|e(?:aves|dger|ft(?>_(?>luggage|right_arrow)|wards_arrow_with_hook)|mon|o(?>pard)?)|i(?:bra|ght_rail|n
k|ps(?>tick)?)|o(?:ck(?>_with_ink_pen)?|llipop|ud(?>_sound|speaker)|ve_(?>hotel|letter)|w_brightness))|m(?:(?:a(?:g(?>_right)?|hjong|ilbox(?:_(?:closed|with_(?>no_)?mail))?|n(?>(?>_with_(?>gua_pi_mao|turban)|s_shoe))?|ple_leaf|s(?>k|sage))|e(?>at_on_bone|ga|lon|ns|tro)|i(?>cro(?>phon|scop)e|lky_way|ni(?>bus|disc))|o(?:bile_phone_off|n(?:ey(?>_with_wings|bag)|key(?>_face)?|orail)|rtar_board|u(?:nt(?>_fuji|ain_(?>bicyclist|cableway|railway))|se2?)|vie_camera|yai)|u(?>s(?>cle|hroom|ical_(?>keyboard|note|score))|te)))?|n(?:g|a(?>il_car|me_badg)e|e(?:cktie|gative_squared_cross_mark|utral_face|w(?:(?:_moon(?>_with_face)?|spaper))?)|i(?>ght_with_stars|ne)|o(?:_(?:b(?>ell|icycles)|entry(?>_sign)?|good|mo(?>bile_phones|uth)|pedestrians|smoking)|n-potable_water|se|te(?:s|book(?>_with_decorative_cover)?))|ut_and_bolt)|o(?:(?:[x2]|c(?>ean|topus)|den|ffice|k(?>_(?>hand|woman))?|lder_(?>wo)?man|n(?>(?>e|coming_(?>automobile|bus|police_car|taxi)))?|p(?>en_(?>file_folder|hands|mouth)|hiuchus)|range_book|utbox_tray))?|p(?:a(?>ckage|ge(?>r|_(?>facing_up|with_curl))|lm_tree|nda_face|perclip|r(?>king|t(?>_alternation_mark|ly_sunny))|ssport_control)|e(?:a(?>r|ch)|n(?:cil2?|guin|sive)|r(?>forming_arts|s(?>evere|on_(?>frowning|with_(?>blond_hair|pouting_face)))))|i(?:g(?>2|_nose)?|ll|neapple|sces|zza)|o(?:int_(?:down|left|right|up(?>_2)?)|lice_car|o(?>p|dle)?|st(?>_office|al_horn|box)|table_water|u(?>ch|ltry_leg|nd|ting_cat))|r(?>ay|incess)|u(?>nch|r(?>ple_heart|se)|shpin|t_litter_in_its_place))|question|r(?:a(?:t|bbit2?|cehorse|dio(?>_button)?|ge|i(?:lway_car|nbow|s(?:ed_hands?|ing_hand))|m(?>en)?)|e(?:cycle|d_c(?>ar|ircle)|gistered|l(?>ax|iev)ed|peat(?>_one)?|stroom|volving_hearts|wind)|i(?:bbon|ce(?>_(?>ball|cracker|scene))?|ng)|o(?>cket|ller_coaster|oster|se|tating_light|und_pushpin|wboat)|u(?>(?>gby_football|nn(?>er|ing_shirt_with_sash)))?)|s(?:a(?>(?>gittarius|ilboat|ke|n(?>dal|ta)|t(?>ellite|isfied)|xophone))?|c(?:hool(?>_satchel)?|issors|orpius|r(?:eam(?>_cat)?|oll))|e(?>at|cret|e
(?>_no_evil|dling)|ven)|h(?>aved_ice|e(?>ep|ll)|i(?>[pt]|rt)|ower)|i(?:gnal_strength|x(?>_pointed_star)?)|k(?>i|eleton|ull)|l(?>eep(?>y|ing)|ot_machine)|m(?:all_(?:red_triangle(?>_down)?|(?>blu|orang)e_diamond)|i(?:l(?:e(?:(?:_cat|y(?>_cat)?))?|ing_imp)|rk(?>_cat)?)|oking)|n(?>a(?>il|ke)|ow(?>boarder|flake|man))|o(?>[bs]|ccer|on|und)|p(?:a(?:ce_invader|des|ghetti|rkl(?:e[rs]?|ing_heart))|e(?>ak_no_evil|e(?>ch_balloon|dboat)))|t(?:a(?:r[s2]?|t(?>ion|ue_of_liberty))|e(?>w|am_locomotive)|ra(?>ight_ruler|wberry)|uck_out_tongue(?>_(?>closed_eyes|winking_eye))?)|u(?:n(?:_with_face|flower|glasses|ny|rise(?>_over_mountains)?)|rfer|s(?>hi|pension_railway))|w(?:e(?:at(?>_(?>drops|smile))?|et_potato)|immer)|y(?>mbols|ringe))|t(?:[mv]|a(?>da|n(?>abata_tre|gerin)e|urus|xi)|e(?:a|le(?:phone(?>_receiver)?|scope)|n(?>t|nis))|h(?>ought_balloon|ree|umbs(?>down|up))|i(?:cket|ger2?|red_face)|o(?:ilet|kyo_tower|mato|ngue|p(?>hat)?)|r(?>a(?>m|ctor|ffic_light|in2)|i(?>angular_(?>flag_on_post|ruler)|dent|umph)|o(?>lleybus|p(?>hy|ical_(?>drink|fish)))|u(?>ck|mpet))|u(?>lip|rtle)|w(?:isted_rightwards_arrows|o(?:_(?:heart|(?>wo)?men_holding_hand)s)?))|u(?>[ps]|5(?>272|408|5b6)|6(?>307|70[89]|e80)|7(?>121|533|981|a7a)|mbrella|n(?>amused|derage|lock))|v(?>(?>s|ertical_traffic_light|hs|i(?>bration_mode|deo_(?>camera|game)|olin|rgo)|olcano))?|w(?:c|a(?>lking|rning|t(?>ch|er(?>_buffalo|melon))|v(?>e|y_dash)|[nx]ing_(?>crescent|gibbous)_moon)|e(?>ary|dding)|h(?:ale2?|eelchair|ite_(?:c(?>heck_mark|ircle)|flower|large_square|medium_s(?>mall_s)?quare|s(?>mall_square|quare_button)))|in(?>k|d_chime|e_glass)|o(?:lf|m(?:an(?>s_(?>clothes|hat))?|ens)|rried)|rench)|y(?>e(?>n|llow_heart)|um)|z(?>ap|ero|zz)|[-+]1):`'; |
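The difference described in this comment can be reproduced with a small sketch. Everything in it is an assumption made for illustration: the two-entry `$known` table, the loose pattern, and the input string are hypothetical and much smaller than what the branch actually uses. The only point is that a loose pattern consumes the colon shared with an invalid shortcode, while a pattern listing the exact names does not.

```php
<?php
// Hypothetical two-entry shortcode table, for illustration only.
$known = [':smile:' => '😄', ':beer:' => '🍺'];

// Replace a match only when it is a known shortcode; otherwise return it unchanged.
$callback = function (array $m) use ($known) {
    return $known[$m[0]] ?? $m[0];
};

// Loose pattern (assumed): matches anything shaped like a shortcode.
$loose = '/:[-+\w]+:/';
// Accurate pattern: matches only the known names.
$accurate = '/:(?:smile|beer):/';

// ":nope:" is invalid and shares its closing colon with ":smile:".
$input = ':nope:smile:';

$looseResult    = preg_replace_callback($loose, $callback, $input);
$accurateResult = preg_replace_callback($accurate, $callback, $input);

echo $looseResult, "\n";    // ":nope:" is matched first, so "smile:" has lost its opening colon
echo $accurateResult, "\n"; // ":nope" fails to match, so ":smile:" is found and replaced
```

With the loose pattern the input comes back unchanged. With the accurate pattern the trailing `:smile:` is converted.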
Hey s9e! We appreciate your contribution here, could you join our gitter to chat about it? |
https://gist.github.com/s9e/225b3c77005a89d81511 I wrote down some notes about grouping the emoji by block (not Unicode blocks, just ad-hoc units) to create the regexp from. Also, it should be noted that this expression only works in PHP. Or rather, it works with PCRE without the u flag. The regular expression engine in JavaScript works on codepoints, not bytes. You can keep the same blocks but you need to convert the byte sequences to codepoints. I guess blocks 3 and 4 could be merged in this case, as I assume they're contiguous. |
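To make the byte-versus-codepoint point concrete, here is a minimal sketch in PHP. The sample character and both patterns are chosen for illustration and are not taken from the branch. It shows the same range written once as a UTF-8 byte sequence (PCRE without the u flag) and once as codepoints (PCRE with the u flag).

```php
<?php
// U+1F604 encoded as UTF-8 is the four bytes F0 9F 98 84.
$smile = "\xF0\x9F\x98\x84";

// Byte mode (no u flag): three fixed bytes, then one byte in the range 80..8F.
$byteMode = preg_match('/\xF0\x9F\x98[\x80-\x8F]/', $smile);

// UTF-8 mode (u flag): the same range expressed as codepoints U+1F600..U+1F60F.
$utf8Mode = preg_match('/[\x{1F600}-\x{1F60F}]/u', $smile);

// Counting what "." matches shows the unit each mode works in.
$bytes      = preg_match_all('/./s', $smile);
$codepoints = preg_match_all('/./su', $smile);

printf("byte mode: %d, UTF-8 mode: %d\n", $byteMode, $utf8Mode);
printf("units seen: %d bytes, %d codepoint(s)\n", $bytes, $codepoints);
```

Both patterns match the same character. Only the notation of the range changes, which is the conversion an engine that does not operate on bytes would need.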
Nice work s9e! We've been caught up with some other stuff but will get to this request ASAP and get it merged in. Big thanks again 🍻 |
I just ran some tests and benchmarks and everything looks good except for the ASCII smiley replacement. Could you have a look and update it before I merge? Line 29: $string = preg_replace(self::$asciiRegexp, 'Emojione::asciiToImageCallback', $string); It should be preg_replace_callback. |
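A short sketch of why the function name matters. The pattern, the `asciiToImage` function, and the `[smile]` placeholder are hypothetical stand-ins, not the library's own code. `preg_replace()` treats its second argument as a literal replacement string, whereas `preg_replace_callback()` invokes it for each match.

```php
<?php
// Hypothetical stand-ins for the class's pattern and callback.
$asciiRegexp = '/:-?\)/';

function asciiToImage(array $m): string
{
    // A real callback would build markup from $m[0]; a placeholder is enough here.
    return '[smile]';
}

$input = 'hi :)';

// preg_replace(): the callback's name is inserted as plain text.
$wrong = preg_replace($asciiRegexp, 'asciiToImage', $input);

// preg_replace_callback(): the function is called for every match.
$right = preg_replace_callback($asciiRegexp, 'asciiToImage', $input);

echo $wrong, "\n";
echo $right, "\n";
```

The first call produces the function's name in place of the smiley, which matches the symptom of the typo reported above.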
Absolutely, yes. It's a typo. |
Hang on while I check it. I realize that I never tested that particular method. |
Ok, it seems fine now. If there are any bugs left, I can't see them. 😎 |
Beautiful! The speed increases are dramatic. Great contribution. I have a few more updates to make before releasing this. Also going to look at implementing the same strategies into the JS lib. |
Thank you Josh for all your help so far! |
My pleasure. |
Hello
This branch contains an alternative PHP algorithm for replacing strings. Instead of frontloading hundreds of replacements via str_replace, it matches every emoji or smiley at once with a regular expression and replaces each match with a callback. The performance increases tenfold to a hundredfold (hence the branch name) depending on the method called and the input provided. Demo 4 with its default input goes from ~4ms to ~60µs on my machine.
A note about those regexps: I wrote $unicodeRegexp by hand. It more-or-less matches the Unicode blocks used in Emojione. The callback filters out false positives. $asciiRegexp was generated programmatically with the help of this class, using the following script: