This repository has been archived by the owner on Jul 23, 2019. It is now read-only.

Greatly improve the performance of the PHP implementation #4

Merged
merged 11 commits into from Sep 14, 2014

Conversation

JoshyPHP
Contributor

Hello :bowtie:

This branch contains an alternative PHP algorithm for replacing strings. Instead of frontloading hundreds of replacements via str_replace, it matches all Emoji or smileys at once with a regular expression and replaces them with a callback. Performance improves tenfold to a hundredfold (hence the branch name), depending on the method called and the input provided. Demo 4 with its default input goes from ~4ms to ~60µs on my machine.
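To make the difference concrete, here is a minimal sketch of the two strategies side by side. The two-entry map, the function names and the `:[-+\w]+:` pattern are hypothetical stand-ins, not the expressions used in this branch, and the syntax assumes PHP 7 or later.

```php
<?php
// Hypothetical shortcode map; the real library has hundreds of entries.
$map = [
    ':smile:' => '[smile.png]',
    ':heart:' => '[heart.png]',
];

// Frontloaded approach: every search string is tried against the input.
function replaceWithStrReplace(string $text, array $map): string
{
    return str_replace(array_keys($map), array_values($map), $text);
}

// Single-pass approach: one regexp finds candidates, the callback looks them up.
function replaceWithCallback(string $text, array $map): string
{
    return preg_replace_callback(
        '/:[-+\w]+:/',
        function (array $m) use ($map) {
            // Unknown candidates (false positives) are returned unchanged.
            return $map[$m[0]] ?? $m[0];
        },
        $text
    );
}

$input = 'Hello :smile: and :heart: but not :unknown:';
echo replaceWithCallback($input, $map), "\n";
```

Both functions return the same string for this input; the second one scans the text once regardless of how many entries the map holds.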

A note about those regexps: I wrote $unicodeRegexp by hand. It more-or-less matches the Unicode blocks used in Emojione. The callback filters out false positives. $asciiRegexp was generated programmatically with the help of this class, using the following script:

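// Load the s9e\TextFormatter autoloader and the Emojione class
include '/s9e/TextFormatter/src/autoloader.php';
include '/s9e/emojione/lib/php/Emojione.class.php';

// Build a single alternation from every ASCII smiley key
$regexp = s9e\TextFormatter\Configurator\Helpers\RegexpBuilder::fromList(
    array_keys(Emojione::$ascii_replace),
    ['delimiter' => '`', 'caseInsensitive' => true]
);

// The smiley must not follow a non-whitespace character, and must be followed
// by whitespace, the end of the string, or one of ! , .
$regexp = '`(?<!\\S)' . $regexp . '(?=\\s|$|[!,\.])`Si';

// Print the result as a PHP string literal
echo var_export($regexp, true), "\n";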
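The lookbehind and lookahead that wrap the generated expression can be tried in isolation. This is a small sketch that assumes a hand-written two-smiley alternation in place of the generated one; `findSmileys` is a hypothetical helper, not part of Emojione.

```php
<?php
// Returns every ASCII smiley that passes the boundary checks.
function findSmileys(string $text): array
{
    // Same wrapper as in the script above, around a tiny stand-in alternation.
    $regexp = '`(?<!\\S)(?::\\)|:D)(?=\\s|$|[!,\\.])`Si';
    preg_match_all($regexp, $text, $matches);

    return $matches[0];
}

// ":D" is followed by "!" and ":)" by a space, so both match.
// "foo:)" fails the lookbehind and ":)x" fails the lookahead.
print_r(findSmileys('nice :D! ok :) but not foo:) or :)x'));
```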

@JoshyPHP
Contributor Author

There is a slight difference between this implementation and the original. This one will not convert a valid shortcode if it shares its first colon with an invalid shortcode. For example, given :invalid:smile:, this implementation will not convert the :smile: part.
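The effect can be reproduced with a generic shortcode pattern. The sketch below assumes a hypothetical `:[-+\w]+:` expression and a one-entry map, which are not the ones in this branch; the point is only that the match on :invalid: consumes the colon that :smile: would need.

```php
<?php
// Hypothetical one-entry map of known shortcodes.
$known = [':smile:' => '[smile.png]'];

$convert = function (string $text) use ($known): string {
    return preg_replace_callback(
        '/:[-+\w]+:/',
        function (array $m) use ($known) {
            // Unknown shortcodes are left as they are.
            return $known[$m[0]] ?? $m[0];
        },
        $text
    );
};

// ":invalid:" is matched first and takes the shared colon,
// so the remaining "smile:" has no opening colon left.
echo $convert(':invalid:smile:'), "\n";

// With a separate colon, ":smile:" is found and converted.
echo $convert(':invalid: :smile:'), "\n";
```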

This could be remedied with an accurate $shortcodeRegexp but it feels too unlikely to warrant the change. For reference, here's the code for an accurate $shortcodeRegexp:

static $shortcodeRegexp = '`:(?:x|1(?>00|234)|8ball|a(?:(?:b(?:cd?)?|ccept|erial_tramway|irplane|l(?>arm_clock|ien)|mbulance|n(?>t|chor|g(?>e[lr]|ry|uished))|pple|quarius|r(?:ies|row(?:_(?:backward|do(?:uble_(?>down|up)|wn(?>_small)?)|forward|heading_(?>down|up)|l(?>ef|ower_(?>lef|righ))t|right(?>_hook)?|up(?>(?>_(?>down|small)|per_(?>lef|righ)t))?)|s_c(?>ounterc)?lockwise)|t(?>iculated_lorry)?)|stonished|t(?>m|hletic_shoe)))?|b(?:(?:a(?:by(?>_(?>bottle|chick|symbol))?|ck|ggage_claim|llo(?>on|t_box_with_check)|mboo|n(?>k|ana|gbang)|r(?>_chart|ber)|s(?>e|ket)ball|t(?:h(?>tub)?|tery))|e(?:ar|e(?:rs?|tle)?|ginner|ll|nto)|i(?>cyclist|k(?>e|ini)|r(?>d|thday))|l(?:ack_(?:circle|joker|large_square|medium_s(?>mall_s)?quare|nib|s(?>mall_square|quare_button))|o(?>ssom|wfish)|u(?>e_(?>book|car|heart)|sh))|o(?:y|ar|mb|o(?:[mt]|k(?:(?:s|mark(?>_tabs)?))?)|uquet|w(?>ling)?)|r(?>ead|i(?>d(?>e_with_veil|ge_at_night)|efcase)|oken_heart)|u(?:g|l(?>b|lettrain_(?>front|side))|s(?:stop|ts?_in_silhouette)?)))?|c(?:[dn]|a(?:ctus|ke|l(?>endar|ling)|me(?>l|ra)|n(?>cer|dy)|p(?>ital_abcd|ricorn)|r(?>d_index|ousel_horse)|t2?)|h(?:art(?>_with_(?>down|up)wards_trend)?|e(?>ckered_flag|rr(?>ies|y_blossom)|stnut)|i(?>cken|ldren_crossing)|ocolate_bar|ristmas_tree|urch)|i(?>nema|rcus_tent|ty_(?>dusk|sun(?>rise|set)))|l(?:(?:ap(?>per)?|ipboard|o(?:ck(?:1(?:(?:30|[012](?>30)?))?|[23456789](?>30)?)|sed_(?>book|lock_with_key|umbrella)|ud)|ubs))?|o(?:cktail|ffee|ld_sweat|mputer|n(?:f(?>etti_ball|ounded|used)|gratulations|struction(?>_worker)?|venience_store)|o(?>l|kie)|p(?>yright)?|rn|uple(?>_with_heart|kiss)?|w2?)|r(?:e(?>dit_card|scent_moon)|o(?>codile|ssed_flags|wn)|y(?>ing_cat_face|stal_ball)?)|u(?>pid|r(?>ly_loop|r(?>y|ency_exchange))|st(?>ard|oms))|yclone)|d(?:a(?:n(?:cers?|go)|rt|sh|te)|e(?>ciduous_tree|partment_store)?|i(?:amond(?>s|_shape_with_a_dot_inside)|sappointed(?>_relieved)?|zzy(?>_face)?)|o(?:_not_litter|g2?|l(?>l(?>s|ar)|phin)|or|ughnut)|r(?:agon(?>_face)?|ess|o(?>medary_camel|plet))|vd)
|e(?:s|ar(?>(?>_of_rice|th_a(?>frica|mericas|sia)))?|gg(?>plant)?|ight(?>_(?>pointed_black_star|spoked_asterisk))?|le(?>ctric_plug|phant)|n(?:d|velope(?>_with_arrow)?)|uro(?>pean_(?>castl|post_offic)e)?|vergreen_tree|x(?>clamation|pressionless)|ye(?>glasse)?s|-?mail)|f(?:a(?>x|ctory|llen_leaf|mily|st_forward)|e(?>arful|et|rris_wheel)|i(?:le_folder|r(?:e(?>_engine|works)?|st_quarter_moon(?>_with_face)?)|s(?:t|h(?>_cake|ing_pole_and_fish)?)|ve)|l(?>a(?>gs|me|shlight)|o(?>ppy_disk|wer_playing_cards)|ushed)|o(?:ggy|ot(?>ball|prints)|rk_and_knife|u(?:ntain|r(?>_leaf_clover)?))|r(?:(?:ee|ie(?>s|d_shrimp)|o(?>wnin)?g))?|u(?:elpump|ll_moon(?>_with_face)?))|g(?:b|ame_die|em(?>ini)?|host|i(?:ft(?>_heart)?|rl)|lobe_with_meridians|o(?>at|lf)|r(?:a(?>ndma|pes)|e(?>en_(?>apple|book|heart)|y_(?>exclama|ques)tion)|i(?:macing|n(?>ning)?))|u(?>n|ardsman|itar))|h(?:a(?>ircut|m(?>m|burg|st)er|n(?>dbag|key)|sh|tch(?>ed|ing)_chick)|e(?:a(?:dphones|r(?:_no_evil|t(?:(?:s|_(?:decoration|eyes(?>_cat)?)|beat|pulse))?)|vy_(?>check_mark|d(?>ivision|ollar)_sign|m(?>inus_sign|ultiplication_x)|plus_sign))|licopter|rb)|i(?>biscus|gh_(?>brightness|heel))|o(?:ney_pot|rse(?>_racing)?|spital|t(?>el|springs)|u(?:rglass(?>_flowing_sand)?|se(?>_with_garden)?))|ushed)|i(?:t|ce_?cream|d(?>eograph_advantage)?|mp|n(?>box_tray|coming_envelope|formation_(?>desk_person|source)|nocent|terrobang)|phone|zakaya_lantern)|j(?:p|a(?:ck_o_lantern|pan(?>ese_(?>castle|goblin|ogre))?)|eans|oy(?>_cat)?)|k(?:r|ey(?>cap_ten)?|i(?:mono|ss(?:ing(?>_(?>c(?>at|losed_eyes)|heart|smiling_eyes))?)?)|nife|o(?>ala|ko))|l(?:a(?:rge_(?>blue_(?>circle|diamond)|orange_diamond)|st_quarter_moon(?>_with_face)?|ughing)|e(?:aves|dger|ft(?>_(?>luggage|right_arrow)|wards_arrow_with_hook)|mon|o(?>pard)?)|i(?:bra|ght_rail|nk|ps(?>tick)?)|o(?:ck(?>_with_ink_pen)?|llipop|ud(?>_sound|speaker)|ve_(?>hotel|letter)|w_brightness))|m(?:(?:a(?:g(?>_right)?|hjong|ilbox(?:_(?:closed|with_(?>no_)?mail))?|n(?>(?>_with_(?>gua_pi_mao|turban)|s_shoe))?|ple_leaf|s
(?>k|sage))|e(?>at_on_bone|ga|lon|ns|tro)|i(?>cro(?>phon|scop)e|lky_way|ni(?>bus|disc))|o(?:bile_phone_off|n(?:ey(?>_with_wings|bag)|key(?>_face)?|orail)|rtar_board|u(?:nt(?>_fuji|ain_(?>bicyclist|cableway|railway))|se2?)|vie_camera|yai)|u(?>s(?>cle|hroom|ical_(?>keyboard|note|score))|te)))?|n(?:g|a(?>il_car|me_badg)e|e(?:cktie|gative_squared_cross_mark|utral_face|w(?:(?:_moon(?>_with_face)?|spaper))?)|i(?>ght_with_stars|ne)|o(?:_(?:b(?>ell|icycles)|entry(?>_sign)?|good|mo(?>bile_phones|uth)|pedestrians|smoking)|n-potable_water|se|te(?:s|book(?>_with_decorative_cover)?))|ut_and_bolt)|o(?:(?:[x2]|c(?>ean|topus)|den|ffice|k(?>_(?>hand|woman))?|lder_(?>wo)?man|n(?>(?>e|coming_(?>automobile|bus|police_car|taxi)))?|p(?>en_(?>file_folder|hands|mouth)|hiuchus)|range_book|utbox_tray))?|p(?:a(?>ckage|ge(?>r|_(?>facing_up|with_curl))|lm_tree|nda_face|perclip|r(?>king|t(?>_alternation_mark|ly_sunny))|ssport_control)|e(?:a(?>r|ch)|n(?:cil2?|guin|sive)|r(?>forming_arts|s(?>evere|on_(?>frowning|with_(?>blond_hair|pouting_face)))))|i(?:g(?>2|_nose)?|ll|neapple|sces|zza)|o(?:int_(?:down|left|right|up(?>_2)?)|lice_car|o(?>p|dle)?|st(?>_office|al_horn|box)|table_water|u(?>ch|ltry_leg|nd|ting_cat))|r(?>ay|incess)|u(?>nch|r(?>ple_heart|se)|shpin|t_litter_in_its_place))|question|r(?:a(?:t|bbit2?|cehorse|dio(?>_button)?|ge|i(?:lway_car|nbow|s(?:ed_hands?|ing_hand))|m(?>en)?)|e(?:cycle|d_c(?>ar|ircle)|gistered|l(?>ax|iev)ed|peat(?>_one)?|stroom|volving_hearts|wind)|i(?:bbon|ce(?>_(?>ball|cracker|scene))?|ng)|o(?>cket|ller_coaster|oster|se|tating_light|und_pushpin|wboat)|u(?>(?>gby_football|nn(?>er|ing_shirt_with_sash)))?)|s(?:a(?>(?>gittarius|ilboat|ke|n(?>dal|ta)|t(?>ellite|isfied)|xophone))?|c(?:hool(?>_satchel)?|issors|orpius|r(?:eam(?>_cat)?|oll))|e(?>at|cret|e(?>_no_evil|dling)|ven)|h(?>aved_ice|e(?>ep|ll)|i(?>[pt]|rt)|ower)|i(?:gnal_strength|x(?>_pointed_star)?)|k(?>i|eleton|ull)|l(?>eep(?>y|ing)|ot_machine)|m(?:all_(?:red_triangle(?>_down)?|(?>blu|orang)e_diamond)|i(?:l(?:e(?:(?:_c
at|y(?>_cat)?))?|ing_imp)|rk(?>_cat)?)|oking)|n(?>a(?>il|ke)|ow(?>boarder|flake|man))|o(?>[bs]|ccer|on|und)|p(?:a(?:ce_invader|des|ghetti|rkl(?:e[rs]?|ing_heart))|e(?>ak_no_evil|e(?>ch_balloon|dboat)))|t(?:a(?:r[s2]?|t(?>ion|ue_of_liberty))|e(?>w|am_locomotive)|ra(?>ight_ruler|wberry)|uck_out_tongue(?>_(?>closed_eyes|winking_eye))?)|u(?:n(?:_with_face|flower|glasses|ny|rise(?>_over_mountains)?)|rfer|s(?>hi|pension_railway))|w(?:e(?:at(?>_(?>drops|smile))?|et_potato)|immer)|y(?>mbols|ringe))|t(?:[mv]|a(?>da|n(?>abata_tre|gerin)e|urus|xi)|e(?:a|le(?:phone(?>_receiver)?|scope)|n(?>t|nis))|h(?>ought_balloon|ree|umbs(?>down|up))|i(?:cket|ger2?|red_face)|o(?:ilet|kyo_tower|mato|ngue|p(?>hat)?)|r(?>a(?>m|ctor|ffic_light|in2)|i(?>angular_(?>flag_on_post|ruler)|dent|umph)|o(?>lleybus|p(?>hy|ical_(?>drink|fish)))|u(?>ck|mpet))|u(?>lip|rtle)|w(?:isted_rightwards_arrows|o(?:_(?:heart|(?>wo)?men_holding_hand)s)?))|u(?>[ps]|5(?>272|408|5b6)|6(?>307|70[89]|e80)|7(?>121|533|981|a7a)|mbrella|n(?>amused|derage|lock))|v(?>(?>s|ertical_traffic_light|hs|i(?>bration_mode|deo_(?>camera|game)|olin|rgo)|olcano))?|w(?:c|a(?>lking|rning|t(?>ch|er(?>_buffalo|melon))|v(?>e|y_dash)|[nx]ing_(?>crescent|gibbous)_moon)|e(?>ary|dding)|h(?:ale2?|eelchair|ite_(?:c(?>heck_mark|ircle)|flower|large_square|medium_s(?>mall_s)?quare|s(?>mall_square|quare_button)))|in(?>k|d_chime|e_glass)|o(?:lf|m(?:an(?>s_(?>clothes|hat))?|ens)|rried)|rench)|y(?>e(?>n|llow_heart)|um)|z(?>ap|ero|zz)|[-+]1):`';

@kevinranks
Contributor

Hey s9e! We appreciate your contribution here, could you join our gitter to chat about it?
https://gitter.im/Ranks/emojione

@JoshyPHP
Contributor Author

https://gist.github.com/s9e/225b3c77005a89d81511

Wrote down some notes about grouping the emoji by block (not Unicode blocks, just ad-hoc units) to create the regexp from $unicodeRegexp. The regular expression at the end is slightly different from the one in this PR but they're functionally equivalent. (Although I didn't really proofread the one from the gist, so if something seems wrong, it might be.)

Also, it should be noted that this expression only works in PHP. Or rather, it works with PCRE without the u flag. The regular expression engine in JavaScript works on codepoints, not bytes. You can keep the same blocks but you need to convert the byte sequences to codepoints. I guess blocks 3 and 4 could be merged in this case, as I assume they're contiguous.
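As a rough illustration of the bytes-versus-codepoints distinction, the sketch below matches the same character both ways in PHP. The range U+1F600 to U+1F60F is an arbitrary example chosen here, not one of the blocks from the gist.

```php
<?php
// U+1F604 encoded as UTF-8 is the four bytes F0 9F 98 84.
$text = "I am \xF0\x9F\x98\x84";

// Without the u flag, PCRE sees bytes, so the pattern spells out the
// UTF-8 byte sequence: fixed prefix F0 9F 98, then a last byte in 80..8F.
$bytePattern = '/\xF0\x9F\x98[\x80-\x8F]/';

// With the u flag, PCRE sees code points, so the same range is written
// as U+1F600..U+1F60F. A codepoint-based engine needs this form.
$codepointPattern = '/[\x{1F600}-\x{1F60F}]/u';

var_dump(preg_match($bytePattern, $text));      // int(1)
var_dump(preg_match($codepointPattern, $text)); // int(1)
```

The byte form is what "PCRE without the u flag" refers to above; converting a block means rewriting each byte sequence as the codepoint range it encodes.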

@kevinranks
Contributor

Nice work s9e! We've been caught up with some other stuff but will get to this request ASAP and get it merged in. Big thanks again 🍻

@kevinranks
Contributor

I just ran some tests and benchmarks and everything looks good except for the ASCII smiley replacement. Could you have a look and update before I merge?

Line 29:

 $string = preg_replace(self::$asciiRegexp, 'Emojione::asciiToImageCallback', $string);

Should be preg_replace_callback.
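For anyone reading along, the difference between the two calls can be seen with a toy callback. `toImage` and the pattern are hypothetical and only stand in for the real method and $asciiRegexp.

```php
<?php
// Hypothetical callback standing in for Emojione::asciiToImageCallback.
function toImage(array $m): string
{
    return '[img ' . $m[0] . ']';
}

$regexp = '/:\)/';

// preg_replace treats its second argument as literal replacement text,
// so the function name itself ends up in the output.
$wrong = preg_replace($regexp, 'toImage', 'hi :)');

// preg_replace_callback treats it as a callable and inserts its return value.
$right = preg_replace_callback($regexp, 'toImage', 'hi :)');

echo $wrong, "\n"; // hi toImage
echo $right, "\n"; // hi [img :)]
```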

@JoshyPHP
Contributor Author

Absolutely, yes. It's a typo.

@JoshyPHP
Contributor Author

Hang on while I check it. I realize that I never tested that particular method.

@JoshyPHP
Contributor Author

Ok, it seems fine now. If there are any bugs left, I can't see them. 😎

@kevinranks
Contributor

Beautiful! The speed increases are dramatic. Great contribution. I have a few more updates to make before releasing this. Also going to look at implementing the same strategies in the JS lib.

kevinranks pushed a commit that referenced this pull request Sep 14, 2014
Greatly improve the performance of the PHP implementation
@kevinranks kevinranks merged commit 5a4f2e8 into joypixels:master Sep 14, 2014
@thinkrick
Contributor

Thank you Josh for all your help so far!

@JoshyPHP
Contributor Author

My pleasure.

3 participants