Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Add regex=True flag to str_contains #5879

Merged
merged 1 commit into from Jan 15, 2014

Conversation

@unutbu
Copy link
Contributor

commented Jan 8, 2014

Using regex=False can be faster when full regex searching is not needed.
See http://stackoverflow.com/q/20951840/190597

Example use case:

import string
import itertools as IT
import numpy as np
from pandas import Series

def make_series(letters, strlen, size):
    return Series(
        np.fromiter(IT.cycle(letters), count=size*strlen, dtype='|S1')
        .view('|S{}'.format(strlen)))

many = make_series('matchthis'+string.uppercase, strlen=19, size=10000) # 31% matches
few = make_series('matchthis'+string.uppercase*42, strlen=19, size=10000) # 1% matches
In [115]: %timeit many.str.contains('matchthis')
100 loops, best of 3: 4.93 ms per loop

In [116]: %timeit many.str.contains('matchthis', regex=False)
100 loops, best of 3: 2.44 ms per loop

In [118]: %timeit few.str.contains('matchthis')
100 loops, best of 3: 4.81 ms per loop

In [119]: %timeit few.str.contains('matchthis', regex=False)
100 loops, best of 3: 2.37 ms per loop
@jreback

This comment has been minimized.

Copy link
Contributor

commented Jan 8, 2014

can you add a vbench for both of these? (with and w/o a regex)....

no vbenches for strings at ALL!

pls add vb_suite/strings.py and add to vb_suite/suite.py

bonus points for adding benches for more string ops (could be another PR if you want)!

@jreback

This comment has been minimized.

Copy link
Contributor

commented Jan 8, 2014

also, pls add a release notes refering this as an API change (you can put in the 0.13.1 bucket)

@cancan101

This comment has been minimized.

Copy link
Contributor

commented Jan 8, 2014

Would is_regex be a better name for the parameter?

@jreback

This comment has been minimized.

Copy link
Contributor

commented Jan 8, 2014

@cancan101 no regex is used in filter/replace so consistent

@unutbu

This comment has been minimized.

Copy link
Contributor Author

commented Jan 11, 2014

I added some Benchmarks. Note that I modified _wrap_result (commit da8d451) because x.str.cat returns a string, and strings have no ndim attribute. Without the change x.str.cat raises an AttributeError.

@jreback

This comment has been minimized.

Copy link
Contributor

commented Jan 11, 2014

can you post the run of this benchmarks ?

@unutbu

This comment has been minimized.

Copy link
Contributor Author

commented Jan 11, 2014

Running

time ./test_perf.sh -b master -t str-contains -r strings

yields

Invoked with :
--ncalls: 3
--repeats: 3


-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
strings_lower                                |   3.5593 |   3.8073 |   0.9349 |
strings_title                                |   4.1630 |   4.3960 |   0.9470 |
strings_match                                |   5.2484 |   5.4413 |   0.9645 |
series_value_counts_strings                  |   6.8393 |   7.0876 |   0.9650 |
strings_upper                                |   3.9666 |   4.0997 |   0.9675 |
contains_many                                |   4.8403 |   4.9907 |   0.9699 |
strings_rstrip                               |   3.2744 |   3.3757 |   0.9700 |
strings_get                                  |   2.6247 |   2.7027 |   0.9711 |
contains_few                                 |   4.7243 |   4.8621 |   0.9717 |
strings_lstrip                               |   3.2550 |   3.3450 |   0.9731 |
strings_strip                                |   3.5814 |   3.6740 |   0.9748 |
strings_slice                                |   2.5616 |   2.6247 |   0.9760 |
strings_pad                                  |   4.1074 |   4.1869 |   0.9810 |
strings_startswith                           |   3.5427 |   3.6077 |   0.9820 |
strings_endswith                             |   3.5497 |   3.6040 |   0.9849 |
strings_count                                |   5.4746 |   5.5540 |   0.9857 |
strings_len                                  |   2.1100 |   2.1390 |   0.9864 |
strings_center                               |   4.1567 |   4.2037 |   0.9888 |
strings_repeat                               |   4.1867 |   4.2174 |   0.9927 |
strings_findall                              |   8.4347 |   8.4640 |   0.9965 |
match_strings                                |   0.4900 |   0.4887 |   1.0026 |
strings_extract                              | 570.6987 | 568.8750 |   1.0032 |
strings_replace                              |  12.6190 |  12.5709 |   1.0038 |
strings_join_split                           |  31.8976 |  31.4066 |   1.0156 |
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

Ratio < 1.0 means the target commit is faster then the baseline.
Seed used: 1234

Target [b2bdbf5] : PERF: add benchmarks for every str method
Base   [0bab303] : BLD: python3 unicode issue in print_versions


@jreback

This comment has been minimized.

Copy link
Contributor

commented Jan 11, 2014

@unutbu grt! hmm.....maybe have a look at the last 3 to see if anything obvious can do? (esp extract)?

@jreback

This comment has been minimized.

Copy link
Contributor

commented Jan 14, 2014

@unutbu can you squash to a smaller num of commits....

@y-p @jtratner look ok?

@ghost

This comment has been minimized.

Copy link

commented Jan 14, 2014

Where's the perf difference? all those benchmarks look like 1.0x to me, or is that 4% it?

@unutbu

This comment has been minimized.

Copy link
Contributor Author

commented Jan 14, 2014

@y-p: There should be no performance gain shown in the Benchmarks (because the default behavior, regex=True, was left unchanged). The regex=False keyword in x.str.contains(pat, regex=False) can result in a 2x perf gain for users when the pattern is a plain string however. See http://stackoverflow.com/q/20951840/190597 for a use case.

@jreback

This comment has been minimized.

Copy link
Contributor

commented Jan 14, 2014

@unutbu I think that you show these benchmarks, but they are not appearing in the above list; this is a case where you compare across benchmarks
e.g. maybe paste the result of these (can just be in ipython as well - in the top section would be great)

strings_contains_many = Benchmark("many.str.contains('matchthis')", setup)
strings_contains_few = Benchmark("few.str.contains('matchthis')", setup)
strings_contains_many_noregex = Benchmark(
     "many.str.contains('matchthis', regex=False)", setup)
strings_contains_few_noregex = Benchmark(
     "few.str.contains('matchthis',
@ghost

This comment has been minimized.

Copy link

commented Jan 14, 2014

can you add a vbench for both of these? (with and w/o a regex)....

No matter, I tried this here and it does indeed give a nice boost.

+1

@unutbu

This comment has been minimized.

Copy link
Contributor Author

commented Jan 14, 2014

@jreback: I'd be glad to but I don't know how.

When I run ./test_perf.sh -b master -t str-contains -r strings

those Benchmarks appear to be missing. And some others (with no strings in its
name), like contains_many and contains_few, are present.

I still don't understand how to use vbench very well. Is there any documentation?
I would like to learn how to run benchmarks from IPython.

@jreback

This comment has been minimized.

Copy link
Contributor

commented Jan 14, 2014

@unutbu that looks right....

try test.perf.sh -H -r strings. I have always found it has issues when the benchmarks themselvs are new

I just meant to do timeite from IPython....for a quick comparison of old/new that is ok (for display purposes in a PR).

@ghost

This comment has been minimized.

Copy link

commented Jan 14, 2014

The filtering is on the benchmark name (first arg of Benchmark()) not the name of the python variable.

ENH: Add regex=True flag to str_contains
Using regex=False can be faster when full regex searching is not needed.
See http://stackoverflow.com/q/20951840/190597

TST: Add a test for str_contains with regex=False
BUG: Not all strings.str_* functions return an object with an ndim attribute.
PERF: add benchmarks for every str method
@jreback

This comment has been minimized.

Copy link
Contributor

commented Jan 14, 2014

if regex is False but the pattern is actually a regex should this raise? (not sure if its easy to determine if its an actual regex)

@ghost

This comment has been minimized.

Copy link

commented Jan 14, 2014

any regex is also a valid string, so you can't disambiguate.

@unutbu

This comment has been minimized.

Copy link
Contributor Author

commented Jan 14, 2014

Someone might have a list of regex strings and wish to find those that contain a literal string like (.*). So the pattern could look like regex and yet one might still wish to search for it like a plain string. Maybe that's a stretch, but it seems conceivable. :)

@jreback: I tried test_perf.sh -H -r strings and it did find and run the missing benchmarks, but with no comparison to master (of course). I then reran ./test_perf.sh -b master -t str-contains -r strings but there was no change in the benchmarks found.

@ghost

This comment has been minimized.

Copy link

commented Jan 14, 2014

a real world example:

'Mr. Smith'

Do I want to match Mr./Mrs smith or an exact match?

@jreback

This comment has been minimized.

Copy link
Contributor

commented Jan 14, 2014

yep...ok

@unutbu

This comment has been minimized.

Copy link
Contributor Author

commented Jan 14, 2014

@y-p: I think filtering is being done in test_perf.py with

benchmarks = [x for x in benchmarks if re.search(args.regex,x.name)]

and x.name is being set in vbench/benchmark.py by the _get_assigned_name function, which ultimately does this:

varname = line.split('=', 1)[0].strip()
return varname

Moreover,

from vbench.api import Benchmark

foo = Benchmark('bar','baz')
print(foo.name)

prints foo, so it looks like the Benchmark().name attribute is being set to the variable name.

But I must be missing something because I don't know why some Benchmarks are not being run.

Maybe this is a clue: contains_many and contains_few were old variable names for Benchmarks I made in vb_suite/strings.py. They no longer exist in the str-contains commit, nor in master. They do exist in vb_suite/benchmark.db, though. For some reason these benchmarks are getting called when they should not even exist...

@ghost

This comment has been minimized.

Copy link

commented Jan 14, 2014

That explaines why I got different results from you, erase the db and they'll probably vanish.

I had no idea wes used the inspect trick. Usually frowned upon as too much magic.
https://www.youtube.com/watch?v=8e0l_Dt28MQ

Update: I'm trying to do too many things at once, Of course the first arg is not the name.

jreback added a commit that referenced this pull request Jan 15, 2014
Merge pull request #5879 from unutbu/str-contains
ENH: Add regex=True flag to str_contains

@jreback jreback merged commit 9d9b6dd into pandas-dev:master Jan 15, 2014

1 check passed

default The Travis CI build passed
Details
@jreback

This comment has been minimized.

Copy link
Contributor

commented Jan 15, 2014

@unutbu thanks for the PR....nice work!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.