Support for partial string matching in query #8749

Closed
johanekholm opened this issue Nov 7, 2014 · 7 comments
Labels
API Design Strings String extension data type and string data

Comments

@johanekholm

It would be nice to have the query method support partial string matching, so you could do the equivalent of df[df['A'].str.contains("abc")] using query: df.query("A contains 'abc'").
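For reference, the boolean-mask form mentioned in the request can be sketched as follows. The column name A comes from the post; the sample values are made up for illustration. The proposed "A contains 'abc'" query syntax is not shown because it is only a suggestion in this thread.

```python
import pandas as pd

# Sample data is hypothetical; only the column name "A" comes from the post.
df = pd.DataFrame({"A": ["xxabcxx", "def", "abc", "ghi"]})

# Series.str.contains returns a boolean Series marking rows that contain "abc".
mask = df["A"].str.contains("abc")

# Indexing with the mask keeps only the matching rows.
matches = df[mask]
print(matches)
```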

@jreback jreback added API Design Strings String extension data type and string data labels Nov 7, 2014
@jreback jreback added this to the 0.16.0 milestone Nov 7, 2014
@jreback
Contributor

jreback commented Nov 7, 2014

Sure. Pull requests welcome!

@jorisvandenbossche
Member

Would it support regex? Or would it behave more like the standard Python in operator?

Because I was thinking, similar to

In [13]: df = pd.DataFrame({'a': ['abcde', 'fghij']})

In [14]: 'a' in 'abcd'
Out[14]: True

you could also do:

df.query("'a' in a")

However, I don't know to what extent this would conflict with the current use of in inside query.
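To illustrate the possible conflict: inside query, in already means membership of column values in a collection, not substring search. A minimal sketch, reusing the DataFrame from the example above; the second list entry 'zzzzz' is a made-up value added for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": ["abcde", "fghij"]})

# In query, `in` tests whether each value of column a is a member of the list.
by_query = df.query("a in ['abcde', 'zzzzz']")

# The same selection written with Series.isin.
by_isin = df[df["a"].isin(["abcde", "zzzzz"])]

print(by_query)
```

Because in is already taken for membership tests, giving it substring semantics for string columns would change the meaning of existing queries.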

@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@rea725

rea725 commented Nov 28, 2017

Maybe I'm missing something, but it seems this is still an open issue. I ended up here after a lot of googling, and I'm facing the same sort of challenge. I've tried both pandas 0.18.1 and 0.20.1.

I would love to be able to do df.query("A contains 'abc'") as @johanekholm suggested. It's understood that this would be a slower operation than a simpler condition such as == or != but I don't see any downside to having the option.

@jreback
Contributor

jreback commented Nov 28, 2017

@rea725 this is an open issue as the tag indicates
if you want this implemented the quickest route would be a pull request to do so

@rea725

rea725 commented Nov 29, 2017

It looks like I found a solution, by reading the pandas documentation. I get the behavior I seek by passing engine='python'. The explanation totally makes sense, including the recommendation to avoid doing so unless you really need to, since this would be slow compared to the default option. I'm not sure any additional action is merited.

More specifically, I had to do df.query('A.str.contains("abc")', engine='python'), which is maybe not quite as elegant as df.query("A contains 'abc'"), but it is good enough for my purposes.
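A self-contained sketch of this workaround, assuming a string column named A as in the post; the sample values are made up for illustration.

```python
import pandas as pd

# Sample data is hypothetical; only the column name "A" comes from the thread.
df = pd.DataFrame({"A": ["xxabcxx", "def", "abc"]})

# engine="python" lets the query expression call the .str accessor methods.
via_query = df.query('A.str.contains("abc")', engine="python")

# Equivalent boolean-mask form, for comparison.
via_mask = df[df["A"].str.contains("abc")]

print(via_query)
```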

@vijaysaimutyala
Contributor

Not sure whether I need to open a new issue. If needed, will do.

So I've been using the Series string methods to do some comparisons with an input string. I'm using

series.str.contains(word,case=False)

and creating a new dataframe with the results. What I've observed is that if there is a plus sign (+) in the word I supply for the search, the method returns zero results.
Below is a snippet:

import pandas as pd

datadf = pd.DataFrame({'description': ["i am good boy", "i am a bad boy", "i am an ugly boy", "i am a + boy"]})
print(datadf)

word = "i am a + boy"
# Keep only the rows whose description contains word (case-insensitive)
resultdf = datadf[datadf['description'].str.contains(word, case=False)]
# One match is expected, but this prints 0
print(len(resultdf.index))

The issue occurs not only with +, but also with *.

image

Also, does this have anything to do with the note below from the official docs?

image
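The zero-match result is consistent with str.contains treating its pattern as a regular expression by default, so + and * act as quantifiers rather than literal characters. A minimal sketch of two ways to force a literal match, reusing the data from the snippet above: passing regex=False, or escaping the pattern with re.escape.

```python
import re

import pandas as pd

datadf = pd.DataFrame({"description": ["i am good boy", "i am a bad boy",
                                       "i am an ugly boy", "i am a + boy"]})
word = "i am a + boy"

# Default behaviour: word is compiled as a regex, so " +" means "one or more spaces".
as_regex = datadf["description"].str.contains(word, case=False)

# Option 1: treat word as a literal substring.
literal = datadf["description"].str.contains(word, case=False, regex=False)

# Option 2: keep regex matching but escape the special characters first.
escaped = datadf["description"].str.contains(re.escape(word), case=False)

print(as_regex.sum(), literal.sum(), escaped.sum())
```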

@wesm
Member

wesm commented Jul 6, 2018

Closing. Contributions welcome.


7 participants