ScrubRb

Pure-ruby polyfill of MRI 2.1 String#scrub, for ruby 1.9 and 2.0 on any interpreter.

Installation

Add this line to your application's Gemfile:

gem 'scrub_rb'

And then execute:

$ bundle

Or install it yourself as:

$ gem install scrub_rb

What it is

Ruby 2.1 introduces String#scrub, a method to replace bytes in a string that are invalid for its specified encoding. See the docs in the MRI ruby source.

If you need String#scrub in MRI ruby 2.0, you can use the string-scrub gem, which provides a backport of the C code from MRI ruby 2.1 into MRI 2.0.

What if you need this functionality in ruby 1.9, in JRuby in 1.9 or 2.0 mode, or on any other ruby platform that does not (or does not yet) support String#scrub? What if you need to write code that will work on any of these platforms?

This gem provides a pure-ruby implementation of String#scrub and #scrub!, monkey-patched into String, that should work on any ruby platform. It will only monkey-patch String if String does not already have a #scrub method, so it's safe to include this gem in multi-platform code. When the code runs on ruby 2.1, String#scrub will still be the original stdlib implementation.
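The "only patch when missing" behavior described above can be sketched roughly as follows. This is a minimal illustration, not scrub_rb's actual source: the module name PureScrubSketch and its simplified method body are assumptions made up for the example, and it assumes the replacement string is compatible with the receiver's encoding.

```ruby
# Hypothetical stand-in for a pure-ruby fallback (not scrub_rb's real code).
module PureScrubSketch
  def scrub(replacement = "\uFFFD")
    result = String.new.force_encoding(encoding)
    in_bad_run = false
    each_char do |char|
      if char.valid_encoding?
        result << char
        in_bad_run = false
      else
        # Emit one replacement per run of contiguous invalid bytes.
        result << replacement unless in_bad_run
        in_bad_run = true
      end
    end
    result
  end
end

# Only patch String when the interpreter has no #scrub of its own.
unless String.method_defined?(:scrub)
  class String
    include PureScrubSketch
  end
end
```

On an interpreter that already ships String#scrub, the guard leaves String untouched and the native method keeps handling every call.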

# Encoding: utf-8

"abc\u3042\x81".scrub #=> "abc\u3042\uFFFD"
"abc\u3042\x81".scrub("*") #=> "abc\u3042*"
"abc\u3042\xE3\x80".scrub{|bytes| '<'+bytes.unpack('H*')[0]+'>' } #=> "abc\u3042<e380>"

Performance

This pure-ruby implementation is about an order of magnitude slower than stdlib String#scrub on ruby 2.1, or than the string-scrub C gem on MRI 2.0. For most applications, string scrubbing will probably be a small portion of total execution time. It is still fairly fast and hopefully won't be a problem.
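If you want to check the cost on your own platform, a rough timing sketch like the one below may help. The helper name time_scrub, the sample string, and the iteration count are all assumptions chosen for illustration. It times whichever String#scrub is active in the running interpreter.

```ruby
require 'benchmark'

# Hypothetical helper: seconds taken by `iterations` calls to String#scrub.
def time_scrub(input, iterations = 1_000)
  Benchmark.realtime do
    iterations.times { input.scrub }
  end
end

# A sample string with one invalid byte in every repetition.
sample = "abc\u3042\x81" * 1_000
puts format("%.4f seconds for 1000 scrubs", time_scrub(sample))
```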

Discrepancy with MRI 2.1 String#scrub

If there is a sequence of multiple contiguous invalid bytes in a string, should the entire block be replaced with only one replacement, or should each invalid byte be replaced with a replacement?

I have not been able to understand the logic MRI 2.1 uses to divide contiguous invalid bytes into certain sub-sequences for replacement, as represented in the test suite. The test suite seems to suggest that the examples come from Unicode documentation, but I wasn't able to find such documentation to see if it shed any light on the matter.

scrub_rb always combines contiguous invalid bytes into a single replacement. As a result, it fails several tests from the original String#scrub test suite, which expect other divisions of contiguous invalid bytes. I've altered the local tests to match the current behavior.

Beware of this potential difference, especially when using the block form of #scrub: the block may be called a different number of times, with a sequence of invalid bytes divided into different substrings, under scrub_rb than under official MRI 2.1 String#scrub or string-scrub.
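To see how the two strategies can differ for a given input, something like the following sketch may be useful. Both helper names are hypothetical. yielded_chunks records whatever the running interpreter's String#scrub passes to the block, while combined_runs computes one chunk per run of contiguous invalid bytes, which is the grouping described above for scrub_rb.

```ruby
# Hypothetical helper: hex of each chunk the active String#scrub passes to
# the block, however that implementation divides the invalid bytes.
def yielded_chunks(input)
  chunks = []
  input.scrub do |bytes|
    chunks << bytes.unpack('H*')[0]
    '?'
  end
  chunks
end

# Hypothetical helper: hex of each run of contiguous invalid bytes, i.e. what
# a "one replacement per run" strategy would hand to the block.
def combined_runs(input)
  input.each_char
       .chunk { |char| char.valid_encoding? }
       .reject { |valid, _chars| valid }
       .map { |_valid, chars| chars.join.unpack('H*')[0] }
end

input = "ab\xE3\x80\xFFcd"
puts "String#scrub yielded: #{yielded_chunks(input).inspect}"
puts "combined runs:        #{combined_runs(input).inspect}"
```

If the two lists differ in length on your platform, code that depends on how many times the block is called will behave differently there.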

For most uses, this discrepancy is probably not of consequence.

If anyone can explain what's going on here, I'm very curious! I can't read C well enough to figure it out from the source.

JRuby in earlier versions may raise

Use JRuby 1.7.11 or later to avoid a known bug that made JRuby raise exceptions on certain unusual illegal byte combinations, which prevented scrub_rb from scrubbing them.

Contributions

Pull requests or suggestions welcome, especially on performance, on the JRuby issue, and on discrepancies with official String#scrub.
