Hasher performance #41

kreynolds · 2015-09-03T01:02:00Z

This is a set of simple performance improvements for the Hasher process. Picking a random Wikipedia page and hashing it 200 times, I get the following before/after timings (excluding the Set regression; STOPWORDS was a Set before my last PR). Additionally, this fixes broken regex splitting for non-ASCII characters and removes the unused punctuation filter.

BEFORE
user     system      total        real
times:    4.400000   0.010000   4.410000 (  4.403278)

AFTER
user     system      total        real
times:    2.630000   0.010000   2.640000 (  2.632828)
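
Timings in this shape come from Ruby's standard `Benchmark` module. The following is a minimal sketch of such a harness, not the script used for the figures above: `sample_word_hash` and `SAMPLE_TEXT` are hypothetical stand-ins for the Hasher call and the Wikipedia page, neither of which is included in this thread.

```ruby
require 'benchmark'

# Hypothetical stand-in for the Hasher entry point. The real method lives in
# lib/classifier-reborn/extensions/hasher.rb and also stems and filters words.
def sample_word_hash(str)
  counts = Hash.new(0)
  str.gsub(/[^\p{Word}\s]/, '').downcase.split.each { |word| counts[word] += 1 }
  counts
end

# Hypothetical sample input standing in for a page of article text.
SAMPLE_TEXT = ('The quick brown fox jumps over the lazy dog. ' * 500).freeze

# Hash the same text 200 times and print user/system/total/real timings.
Benchmark.bm(7) do |x|
  x.report('times:') { 200.times { sample_word_hash(SAMPLE_TEXT) } }
end
```

Absolute numbers depend on the machine and Ruby version, so only the before/after ratio on the same setup is meaningful.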


parkr · 2015-09-03T01:41:11Z

lib/classifier-reborn/extensions/hasher.rb

    end

    # Return a word hash without extra punctuation or short symbols, just stemmed words
    def clean_word_hash(str, language = 'en')
-      word_hash_for_words str.gsub(/[^\w\s]/,"").split, language
+      word_hash_for_words str.gsub(/[^\p{WORD}\s]/,'').downcase.split, language
    end
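
The regex half of the change above can be checked in isolation. In Ruby, `\w` matches only `[a-zA-Z0-9_]`, so the old pattern deletes accented letters along with punctuation. This sketch uses the `\p{Word}` spelling from Ruby's Regexp documentation (the diff writes it as `\p{WORD}`), and the sample string is made up for illustration.

```ruby
# "Caf\u00e9" is "Café" with a precomposed e-acute, a non-ASCII letter.
text = "Caf\u00e9 na\u00efve, r\u00e9sum\u00e9!"

ascii_only    = text.gsub(/[^\w\s]/, '')        # \w is [a-zA-Z0-9_] only
unicode_aware = text.gsub(/[^\p{Word}\s]/, '')  # \p{Word} covers Unicode letters

puts ascii_only     # accented letters are stripped along with punctuation
puts unicode_aware  # only the punctuation is stripped
```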


Why downcase here? Seems like it could have some really problematic side effects.

kreynolds · 2015-09-03

What side effects? It's not an in-place downcase or gsub; it creates a new object, which is split and passed to word_hash_for_words, where each word was then downcased. All I do is downcase the entire string once instead of once per word (N times).
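
A small sketch of the point being made here, using a made-up input string: `gsub` and `downcase` (without `!`) return new strings, and downcasing before or after the split yields the same words.

```ruby
# Frozen, so any in-place mutation would raise an error.
original = 'Hello, World! HELLO again.'.freeze

cleaned = original.gsub(/[^\p{Word}\s]/, '')

per_word = cleaned.split.map(&:downcase)  # one downcase call per word
at_once  = cleaned.downcase.split         # one downcase call in total

puts per_word == at_once  # both orders yield the same words
puts original             # the caller's string is unchanged
```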

parkr · 2015-09-03T01:42:50Z

Woot! Looks good on mobile. Didn't know String#scan worked like that.

kreynolds · 2015-09-03T02:51:00Z

String#scan isn't always faster, though. Notice I only changed one of the splits to scan; benchmarks showed the gsub/split method was faster for the other one.
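
One way to run that kind of comparison is sketched below. The exact patterns used in the PR are not shown in this thread, so both tokenizers and the sample text here are assumptions chosen for illustration.

```ruby
require 'benchmark'

# Hypothetical sample text.
text = 'The quick brown fox, the lazy dog. ' * 2_000

via_split = ->(s) { s.gsub(/[^\p{Word}\s]/, '').split }
via_scan  = ->(s) { s.scan(/\p{Word}+/) }

# The two agree on this input but not in general: "don't" becomes
# ["dont"] via gsub/split and ["don", "t"] via scan.
puts via_split.call(text) == via_scan.call(text)

Benchmark.bm(11) do |x|
  x.report('gsub/split:') { 50.times { via_split.call(text) } }
  x.report('scan:')       { 50.times { via_scan.call(text) } }
end
```

Because the two approaches treat intra-word punctuation differently, swapping one for the other is a behavior change as well as a performance question.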

Ch4s3 · 2015-09-03T18:08:49Z

@parkr This looks great!


Kelley Reynolds added 4 commits September 2, 2015 20:46

Slight regression, STOPWORDS should be Sets

86997db

Remove unused punctuation filter

ab90248

Only hash once, use utf-8 aware regex, and scan instead of split

710239e

Use utf08 aware regex, downcase entire string at once, perform length…

0e6824f

… comparison before Set membership
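
Two of the commit messages above describe the stopword check: keep STOPWORDS in a Set, and compare word length before testing Set membership. The sketch below illustrates the idea only. The word list, the `keep?` helper, and the length threshold of 2 are assumptions, not the gem's actual code.

```ruby
require 'set'
require 'benchmark'

# Hypothetical stopword list; the gem's real list is much longer.
words = %w[a about above after again all also and any are because been before being]
STOPWORDS_ARRAY = words.freeze
STOPWORDS_SET   = Set.new(words).freeze

# Assumed rule for this sketch: reject words of two characters or fewer first
# (a cheap integer comparison), and only then pay for the stopword lookup.
def keep?(word, stopwords)
  word.length > 2 && !stopwords.include?(word)
end

tokens = %w[the cat is on a mat because it rained again] * 10_000

# Array#include? scans linearly; Set#include? is a hash lookup.
Benchmark.bm(7) do |x|
  x.report('Array:') { tokens.count { |w| keep?(w, STOPWORDS_ARRAY) } }
  x.report('Set:')   { tokens.count { |w| keep?(w, STOPWORDS_SET) } }
end
```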

parkr reviewed Sep 3, 2015

Ch4s3 added a commit that referenced this pull request Sep 3, 2015

Merge pull request #41 from kreynolds/hasher-performance

000f2b7

Hasher performance

Ch4s3 merged commit 000f2b7 into jekyll:master Sep 3, 2015

Ch4s3 added a commit that referenced this pull request Sep 3, 2015

update to reflect #39 and #41

dddcb8e
