Rename, extend string interaction methods

jaynetics committed Jul 2, 2018
1 parent 6ce1670 commit 582e45b56c12a0096c13b82f057ffe77b727c9e7
Showing with 1,358 additions and 433 deletions.
  1. +1 −1 .travis.yml
  2. +50 −0 BENCHMARK.md
  3. +58 −21 README.md
  4. +43 −5 Rakefile
  5. +25 −0 benchmarks/cover.rb
  6. +25 −0 benchmarks/delete_in.rb
  7. +25 −0 benchmarks/keep_in.rb
  8. +25 −0 benchmarks/shared.rb
  9. +25 −0 benchmarks/used_by.rb
  10. +5 −3 character_set.gemspec
  11. +154 −67 ext/character_set/character_set.c
  12. +5 −40 lib/character_set.rb
  13. +6 −3 lib/character_set/character.rb
  14. +76 −4 lib/character_set/common_sets.rb
  15. +21 −5 lib/character_set/core_ext.rb
  16. +4 −4 lib/character_set/parser.rb
  17. +0 −21 lib/character_set/plane_methods.rb
  18. +7 −6 lib/character_set/pure.rb
  19. +0 −41 lib/character_set/range_compressor.rb
  20. +35 −3 lib/character_set/ruby_fallback/character_set_methods.rb
  21. +5 −5 lib/character_set/ruby_fallback/plane_methods.rb
  22. +42 −23 lib/character_set/ruby_fallback/set_methods.rb
  23. +9 −3 lib/character_set/set_method_adapters.rb
  24. +0 −74 lib/character_set/set_methods.rb
  25. +143 −0 lib/character_set/shared_methods.rb
  26. +1 −1 lib/character_set/writer.rb
  27. +11 −3 spec/character_set/character_spec.rb
  28. +72 −26 spec/character_set/common_sets_spec.rb
  29. +45 −10 spec/character_set/core_ext_spec.rb
  30. +54 −0 spec/character_set/delete_in_bang_spec.rb
  31. +42 −0 spec/character_set/delete_in_spec.rb
  32. +19 −0 spec/character_set/for_property_spec.rb
  33. +50 −0 spec/character_set/keep_in_bang_spec.rb
  34. +41 −0 spec/character_set/keep_in_spec.rb
  35. +20 −0 spec/character_set/of_spec.rb
  36. +17 −0 spec/character_set/pure_spec.rb
  37. +0 −33 spec/character_set/range_compressor_spec.rb
  38. +16 −0 spec/character_set/to_s_with_surrogate_alternation_spec.rb
  39. +0 −16 spec/character_set/to_s_with_surrogate_pair_alternation_spec.rb
  40. +172 −0 spec/character_set/type_safety_spec.rb
  41. +7 −0 spec/character_set/used_by_p_spec.rb
  42. +0 −13 spec/character_set/used_by_spec.rb
  43. +2 −2 spec/character_set/writer_spec.rb
@@ -1,7 +1,7 @@
sudo: false
language: ruby
rvm:
- 2.3
- 2.1
- 2.4
- 2.5
- 2.6
@@ -0,0 +1,50 @@
Results of `rake:benchmark` on ruby 2.6.0preview1 (2018-02-24 trunk 62554) [x86_64-darwin17]

```
Detecting non-whitespace
CharacterSet#cover?: 13244577.7 i/s
Regexp#match?: 8027017.5 i/s - 1.65x slower
```
```
Detecting non-letters
CharacterSet#cover?: 13082940.8 i/s
Regexp#match?: 5372589.2 i/s - 2.44x slower
```
```
Removing whitespace
CharacterSet#delete_in: 389315.6 i/s
String#gsub: 223773.5 i/s - 1.74x slower
```
```
Removing whitespace, emoji and umlauts
CharacterSet#delete_in: 470239.3 i/s
String#gsub: 278679.4 i/s - 1.69x slower
```
```
Removing non-whitespace
CharacterSet#keep_in: 1138461.0 i/s
String#gsub: 235287.4 i/s - 4.84x slower
```
```
Extracting emoji
CharacterSet#keep_in: 1474472.0 i/s
String#gsub: 212269.6 i/s - 6.95x slower
```
```
Detecting whitespace
CharacterSet#used_by?: 13063108.7 i/s
Regexp#match?: 7215075.0 i/s - 1.81x slower
```
```
Detecting emoji in a large string
CharacterSet#used_by?: 246527.7 i/s
Regexp#match?: 92956.5 i/s - 2.65x slower
```
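
The "x slower" factor in each comparison above is the faster entry's iterations per second divided by the slower entry's. A small sketch of that arithmetic, using figures copied from the results above; the `slowdown` helper is hypothetical and not part of the gem or of benchmark-ips:

```ruby
# Hypothetical helper: how many times slower `slow_ips` is than `fast_ips`,
# rounded to two decimals like the figures in the report.
def slowdown(fast_ips, slow_ips)
  (fast_ips / slow_ips).round(2)
end

# Figures taken from "Detecting non-whitespace" and "Extracting emoji".
cover_vs_match  = slowdown(13_244_577.7, 8_027_017.5)
keep_in_vs_gsub = slowdown(1_474_472.0, 212_269.6)

puts "Regexp#match? is #{cover_vs_match}x slower than CharacterSet#cover?"
puts "String#gsub is #{keep_in_vs_gsub}x slower than CharacterSet#keep_in"
```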
@@ -5,7 +5,12 @@

A gem to build, read, write and compare sets of Unicode codepoints.

Many parts can be used independently, e.g. `CharacterSet::Character`, `CharacterSet::RangeCompressor`, `CharacterSet::Parser`, `CharacterSet::Writer`.
Many parts can be used independently, e.g.:
- `CharacterSet::Character`
- `CharacterSet::Parser`
- `CharacterSet::Writer`
- [`RangeCompressor`](https://github.com/janosch-x/range_compressor)
- [`RegexpPropertyValues`](https://github.com/janosch-x/regexp_property_values)

## Usage

@@ -20,31 +25,74 @@ CharacterSet['a', 'b', 'c']
CharacterSet[97, 98, 99]
CharacterSet.new('a'..'c')
CharacterSet.new(0x61..0x63)
CharacterSet.used_by('abacababa')
CharacterSet.of('abacababa')
```

### Common utility sets

```ruby
CharacterSet.ascii
CharacterSet.bmp
CharacterSet.crypt
CharacterSet.emoji
CharacterSet.newline
CharacterSet.unicode
CharacterSet.url_fragment
CharacterSet.url_host
CharacterSet.url_path
CharacterSet.url_query
CharacterSet.whitespace
# e.g.
CharacterSet.url_query.cover?('?a=(b$c;)') # => true
CharacterSet.emoji.sample(5) # => ["⛷", "👈", "🌞", "♑", "⛈"]
# all can be prefixed with `non_`, e.g.
(CharacterSet.non_ascii + CharacterSet.newline).delete_in(string)
CharacterSet['🤩'].subset?(CharacterSet.non_ascii) # => true
CharacterSet.non_bmp.intersect?(CharacterSet.newline) # => false
```

### Interact with Strings

`#used_by?` and `#cover?` are as fast as `Regexp#match?`.
CharacterSet can replace some `Regexp` actions on Strings, at better speed (see [benchmarks](./BENCHMARK.md)).

`#used_by?` and `#cover?` can replace some `Regexp#match?` calls:

```ruby
CharacterSet.ascii.used_by?('Tüür') # => true
CharacterSet.ascii.cover?('Tüür') # => false
CharacterSet.ascii.cover?('Tr') # => true
```
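
As the examples suggest, `#used_by?` asks whether at least one character of the string is in the set, while `#cover?` asks whether all of them are. A rough pure-Ruby model of that behavior, assuming a plain `Set` of codepoints in place of the gem; the top-level `used_by?` and `cover?` methods and the `ASCII_SET` constant are illustrative stand-ins, not the gem's implementation:

```ruby
require 'set'

# Stand-in for CharacterSet.ascii: a plain Set of the codepoints 0..127.
ASCII_SET = Set.new(0..0x7F)

# true if any character of the string is a member of the set
def used_by?(set, str)
  str.each_codepoint.any? { |cp| set.include?(cp) }
end

# true if every character of the string is a member of the set
def cover?(set, str)
  str.each_codepoint.all? { |cp| set.include?(cp) }
end

p used_by?(ASCII_SET, 'Tüür')
p cover?(ASCII_SET, 'Tüür')
p cover?(ASCII_SET, 'Tr')
```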

There is also a core extension for this.
`#delete_in(!)` and `#keep_in(!)` can replace `String#gsub(!)` and the like:
```ruby
string = 'Tüür'
CharacterSet.ascii.delete_in(string) # => 'üü'
CharacterSet.ascii.keep_in(string) # => 'Tr'
string # => 'Tüür'
CharacterSet.ascii.delete_in!(string) # => 'üü'
string # => 'üü'
CharacterSet.ascii.keep_in!(string) # => ''
string # => ''
```
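
To make the semantics concrete, here is a pure-Ruby sketch of the non-bang variants, assuming a plain `Set` of codepoints instead of the gem. The `delete_in` and `keep_in` methods and the `ASCII_CODEPOINTS` constant below are hypothetical stand-ins that only model the documented results; they say nothing about how the C extension works:

```ruby
require 'set'

# Stand-in for CharacterSet.ascii: a plain Set of the codepoints 0..127.
ASCII_CODEPOINTS = Set.new(0..0x7F)

# Returns a new string without the characters that are members of the set.
def delete_in(set, str)
  str.each_char.reject { |c| set.include?(c.ord) }.join
end

# Returns a new string with only the characters that are members of the set.
def keep_in(set, str)
  str.each_char.select { |c| set.include?(c.ord) }.join
end

string = 'Tüür'
p delete_in(ASCII_CODEPOINTS, string)
p keep_in(ASCII_CODEPOINTS, string)
p string # the input is left unmodified by both methods
```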

There is also a core extension for String interaction.
```ruby
require 'character_set/core_ext'
"a\rb".character_set & CharacterSet.newline # => CharacterSet["\r"]
"a\rb".uses?(CharacterSet.newline) # => true
"a\rb".covered_by?(CharacterSet.newline) # => false
"a\rb".uses_character_set?(CharacterSet.emoji) # => false
"a\rb".covered_by_character_set?(CharacterSet.newline) # => false
"a\rb".delete_character_set(CharacterSet.newline) # => 'ab'
# etc.
```

### Manipulate

Use any [Ruby Set method](https://ruby-doc.org/stdlib-2.5.1/libdoc/set/rdoc/Set.html) to perform modifications, checks and comparisons between character sets.
Use any [Ruby Set method](https://ruby-doc.org/stdlib-2.5.1/libdoc/set/rdoc/Set.html), e.g. `#+`, `#-`, `#&`, `#^`, `#intersect?`, `#<`, `#>` etc. to interact with other sets, and `#add`, `#delete`, `#include?` etc. to change or check members.

Where appropriate, methods take both chars and codepoints, e.g.:

@@ -64,7 +112,7 @@ non_a = CharacterSet['a'].inversion
non_a.include?('a') # => false
non_a.include?('ü') # => true
# to include surrogate pair halves:
# surrogate pair halves are not included by default
CharacterSet['a'].inversion(include_surrogates: true)
# => #<CharacterSet (size: 1114111)>
```
@@ -95,18 +143,7 @@ set.to_s(escape_all: true) { |c| "<#{c.hex}>" } # => "<61>-<63><258><1F929>"
set.to_s(abbreviate: false) # => "abc\u0258\u{1F929}"
# for full js regex compatibility in case of astral members:
set.to_s_with_surrogate_pair_alternation # => '(?:[\u0258]|\ud83e\udd29)'
```

### Common utility sets

```ruby
CharacterSet.ascii
CharacterSet.emoji
CharacterSet.newline
CharacterSet.unicode
CharacterSet.emoji.sample(5) # => ["⛷", "👈", "🌞", "♑", "⛈"]
set.to_s_with_surrogate_alternation # => '(?:[\u0258]|\ud83e\udd29)'
```

### Unicode plane methods
@@ -118,5 +155,5 @@ CharacterSet['a', 'ü', '🤩'].astral_part # => CharacterSet['🤩']
CharacterSet['a', 'ü', '🤩'].bmp_ratio # => 0.6666666
CharacterSet['a', 'ü', '🤩'].planes # => [0, 1]
CharacterSet['a', 'ü', '🤩'].member_in_plane?(7) # => false
CharacterSet::Character.new(0x61).plane # => 0
CharacterSet::Character.new('a').plane # => 0
```
@@ -1,21 +1,30 @@
require 'bundler/gem_tasks'
require 'rspec/core/rake_task'
require 'rubygems/package_task'
require 'rake/extensiontask'

RSpec::Core::RakeTask.new(:spec)

task default: :spec

require 'rake/extensiontask'

Rake::ExtensionTask.new('character_set') do |ext|
ext.lib_dir = 'lib/character_set'
end

unless RUBY_PLATFORM =~ /java/
# recompile before benchmarking or running specs
task(:spec).enhance([:compile])
namespace :java do
java_gemspec = eval File.read('./character_set.gemspec')
java_gemspec.platform = 'java'
java_gemspec.extensions = []

Gem::PackageTask.new(java_gemspec) do |pkg|
pkg.need_zip = true
pkg.need_tar = true
pkg.package_dir = 'pkg'
end
end

task package: 'java:gem'

desc 'Download relevant ruby/spec tests, adapt to CharacterSet and its variants'
task :sync_ruby_spec do
require 'fileutils'
@@ -62,3 +71,32 @@ task :sync_ruby_spec do
end
end
end

desc 'Run all IPS benchmarks'
task :benchmark do
Dir['./benchmarks/*.rb'].sort.each { |file| require file }
end

namespace :benchmark do
desc 'Run all IPS benchmarks and store the comparison results in BENCHMARK.md'
task :write_to_file do
$store_comparison_results = {}

Rake.application[:benchmark].invoke

File.open('BENCHMARK.md', 'w') do |f|
f.puts "Results of `rake:benchmark` on #{RUBY_DESCRIPTION}", ''

$store_comparison_results.each do |caption, result|
f.puts '```', caption, '',
result.strip.gsub(/(same-ish).*$/, '\1').lines[1..-1], '```'
end
end
end
end

unless RUBY_PLATFORM =~ /java/
# recompile before benchmarking or running specs
task(:benchmark).enhance([:compile])
task(:spec).enhance([:compile])
end
@@ -0,0 +1,25 @@
require_relative './shared'

str = 'Lorem ipsum et dolorem'
rx = /\S/
cs = CharacterSet.whitespace.inversion

benchmark(
caption: 'Detecting non-whitespace',
cases: {
'Regexp#match?' => -> { rx.match?(str) },
'CharacterSet#cover?' => -> { cs.cover?(str) },
}
)

str = 'Lorem ipsum et dolorem'
rx = /[^a-z]/i
cs = CharacterSet.new('A'..'Z') + CharacterSet.new('a'..'z')

benchmark(
caption: 'Detecting non-letters',
cases: {
'Regexp#match?' => -> { rx.match?(str) },
'CharacterSet#cover?' => -> { cs.cover?(str) },
}
)
@@ -0,0 +1,25 @@
require_relative './shared'

str = 'Lorem ipsum et dolorem'
rx = /\s/
cs = CharacterSet.whitespace

benchmark(
caption: 'Removing whitespace',
cases: {
'String#gsub' => -> { str.gsub(rx, '') },
'CharacterSet#delete_in' => -> { cs.delete_in(str) },
}
)

str = 'Lörem ipsüm ⛷ et dölörem'
rx = /[\s\p{emoji}äüö]/
cs = CharacterSet.whitespace + CharacterSet.emoji + CS['ä', 'ü', 'ö']

benchmark(
caption: 'Removing whitespace, emoji and umlauts',
cases: {
'String#gsub' => -> { str.gsub(rx, '') },
'CharacterSet#delete_in' => -> { cs.delete_in(str) },
}
)
@@ -0,0 +1,25 @@
require_relative './shared'

str = 'Lorem ipsum et dolorem'
rx = /\S/
cs = CharacterSet.whitespace

benchmark(
caption: 'Removing non-whitespace',
cases: {
'String#gsub' => -> { str.gsub(rx, '') },
'CharacterSet#keep_in' => -> { cs.keep_in(str) },
}
)

str = 'Lorem ipsum ⛷ et dolorem'
rx = /\p{^emoji}/
cs = CharacterSet.emoji

benchmark(
caption: 'Extracting emoji',
cases: {
'String#gsub' => -> { str.gsub(rx, '') },
'CharacterSet#keep_in' => -> { cs.keep_in(str) },
}
)
@@ -0,0 +1,25 @@
lib = File.expand_path('../lib', __dir__)
$LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)

require 'benchmark/ips'
require 'character_set'

def benchmark(caption: nil, cases: {})
puts caption

report = Benchmark.ips do |x|
cases.each do |label, callable|
x.report(label, &callable)
end
x.compare!
end

return unless $store_comparison_results

old_stdout = $stdout.clone
captured_stdout = StringIO.new
$stdout = captured_stdout
report.run_comparison
$store_comparison_results[caption] = captured_stdout.string
$stdout = old_stdout
end
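
The helper above swaps `$stdout` for a `StringIO` to capture the comparison text. A self-contained sketch of the same capture pattern, wrapped in a hypothetical `capture_stdout` helper that is not part of this commit; it restores `$stdout` in an `ensure` clause so that an exception raised in the block cannot leave output redirected:

```ruby
require 'stringio'

# Hypothetical helper: runs the block with $stdout redirected to a StringIO
# and returns everything the block printed.
def capture_stdout
  original = $stdout
  $stdout = StringIO.new
  yield
  $stdout.string
ensure
  # Runs on normal return and on exceptions alike.
  $stdout = original
end

text = capture_stdout { puts 'comparison output' }
puts "captured #{text.length} characters"
```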
@@ -0,0 +1,25 @@
require_relative './shared'

str = 'Lorem ipsum et dolorem'
rx = /\s/
cs = CharacterSet.whitespace

benchmark(
caption: 'Detecting whitespace',
cases: {
'Regexp#match?' => -> { rx.match?(str) },
'CharacterSet#used_by?' => -> { cs.used_by?(str) },
}
)

str = 'Lorem ipsum et dolorem' * 20 + '' + 'Lorem ipsum et dolorem' * 20
rx = /\p{emoji}/
cs = CharacterSet.emoji

benchmark(
caption: 'Detecting emoji in a large string',
cases: {
'Regexp#match?' => -> { rx.match?(str) },
'CharacterSet#used_by?' => -> { cs.used_by?(str) },
}
)