Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pandas.DataFrame.query to allow column name with space #6508

Closed
socheon opened this issue Feb 28, 2014 · 41 comments · Fixed by #24955
Closed

pandas.DataFrame.query to allow column name with space #6508

socheon opened this issue Feb 28, 2014 · 41 comments · Fixed by #24955
Labels
API Design Indexing Related to indexing on series/frames, not to indexes themselves Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Milestone

Comments

@socheon
Copy link

socheon commented Feb 28, 2014

Able to do something like this would be nice

df.query('[col with space] < col')

I came across many external data files which have spaces in the column names. It would be nice to be able to do quick analysis on the data without first renaming the columns.

@cpcloud
Copy link
Member

cpcloud commented Feb 28, 2014

Hm. Many datasets do have this issue, but consider the amount of code needed to rename columns versus the amount of code needed to parse that new syntax (or something similar).

Renaming columns is straightforward:

cols = df.columns
cols = cols.map(lambda x: x.replace(' ', '_') if isinstance(x, (str, unicode)) else x)
df.columns = cols

This is well tested and easy to debug. For more complicated replacement you can use regular expressions.

The things that go into parsing are markedly less straightforward:

  1. How would this tokenize? (This isn't so straightforward to me, there might be an easy way to do it, but it seems like you'd have to create a new token to disambiguate a syntax error from list construction and then map the whole thing to a valid Python identifier).
  2. How would this parse? (not too hard, it parses as a ColumnNode or some such object which gets looked up as a column in the frame)

Something that might be useful as a happy medium is a df = pd.clean_columns(df) function that will rename your columns to be valid Python identifiers so that you don't have to think too hard about it.

@socheon
Copy link
Author

socheon commented Feb 28, 2014

I am not sure about the implementation. Maybe we can use normal brackets instead like

df.query('(col with space) < col')

Anyway, the clean_columns method sounds like a good idea to me. Maybe it could also be a keyword argument in the read_csv function.

@cpcloud
Copy link
Member

cpcloud commented Feb 28, 2014

@socheon Let's try to keep our discussion to a single issue (I tend to get off topic too!). If you think there should be a clean_columns argument to read_csv, please open another issue.

@jreback What do you think about a top-level clean_columns function?

@jreback
Copy link
Contributor

jreback commented Feb 28, 2014

I think if u put quotes around the column name it might work on master

the fix for allowing & and | makes these be treAted like single tokens

@cpcloud
Copy link
Member

cpcloud commented Feb 28, 2014

It will be treated as a string, which is then turned into an internal temporary so that won't work.

@cpcloud
Copy link
Member

cpcloud commented Feb 28, 2014

df.query("'a column with a space' > 2") -> df.query("tmp_var_str_some_hex_value > 2")

@dalejung
Copy link
Contributor

Maybe allowing the column to be referenced by its clean version?

df.query("column_with_space > 2")

I do this cleanup for autocompletion whenever it makes sense.

@jreback jreback added this to the Someday milestone Mar 6, 2014
@den-run-ai
Copy link

Column names with spaces, dots, brackets and other invalid characters may be optionally auto-replaced by equivalent valid characters, such as underscore. This is also very handy for accessing columns as members of dataframe with dot syntax.

@dgua
Copy link

dgua commented Feb 28, 2017

Hello,
I want to make a pledge for a resolution of this issue.

It is a very important issue for us. There are cases where we cannot change the column names because they need to be preserved.

That pandas puts an arbitrary requirement on column names is, IMHO, a bad design decision and bad programming practice.

Sorry for complaining, but I really think that pandas should fix this issue properly. Please consider it. Thank you.

@jreback
Copy link
Contributor

jreback commented Feb 28, 2017

@dgua you are welcome to submit a pull-request to fix. Note that in reality .queryis just a nice-to-have interface, in fact it has very specific guarantees, meaning its meant to parse like a query language, and not a fully general interface.

In any event, the recommended 'main' way of indexing has always been:

In [10]: df = DataFrame({'foo bar': [1, 2, 3, 4]})

In [11]: df[df['foo bar'] > 2]
Out[11]: 
   foo bar
2        3
3        4

@jreback jreback added Difficulty Advanced Indexing Related to indexing on series/frames, not to indexes themselves Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Feb 28, 2017
@dgua
Copy link

dgua commented Feb 28, 2017

jreback, I agree with you. However, we do have people who use .query extensively in their code because it's easier/more readable, so now we are in a conundrum to either change all that code or rename the columns, neither of which is desirable...

@jreback
Copy link
Contributor

jreback commented Feb 28, 2017

@dgua as I said, a pull-request from the community would get this done. I simply don't have time.

@zhiruiwang
Copy link

R and dplyr use backtick( `` ) to quote column names with space and special characters. For example:

df$`column with space`
df %>%
    mutate(`column with space` = 1)

I wonder how hard it is to incorporate this into the pandas parser, so that functions like query, eval or even df. dot syntax can deal with more general column names?

@hwalinga
Copy link
Contributor

hwalinga commented Oct 26, 2018

I like the proposal of zhiruiwang and looking at his upvotes others do to.

So I have looked into the code a bit and basically pandas does some preprocessing on the expression and passes it to numexpr. Next to that a localdict containing among others the names of the dataframe columns is passed with it.

I was thinking to just alter the expression by replacing every space within backticks to something else (like "_" or something less used to prevent name clashes) and remove the backticks. This creates a valid expression for numexpr. Next to that do the same to spaces in the names of columns when they are passed to the resolvers which eventually make up the localdict, so that the correct names can still be found by numexpr.

Maybe the code that will do this can look like this, but I have not tested it. I first like to hear what others think of the idea.

# Don't know if "_" is a good choice and don't know where to place this variable, 
# since it has to be constant in two different files and ideally is only defined once.
SEPERATOR_REPLACING_SPACES = "_"

# Replace spaces in variables surrounded by backticks:
# pandas/pandas/core/computation/expr.py 

import re

...

# new function
def _replace_spaces_backtickvariables(source):
    return re.sub(r'`(.*?)`', 
                  lambda m: m.group(1).replace(" ", SEPERATOR_REPLACING_SPACES), 
                  source)

...

# adjusted function
def _preparse(source, f=compose(_replace_locals, _replace_booleans,
                                _rewrite_assign), g=lambda x: x):
    ...
    g : callable
        This takes a source string and returns an altered one
    ...
    assert callable(g), 'g must be callable'
    source = g(source)

...

# adjusted class
class PandasExprVisitor(BaseExprVisitor):
    
    def __init__(self, env, engine, parser,
                 preparser=partial(_preparse, 
                                   f=compose(_replace_locals, _replace_booleans)
                                   g=_replace_spaces_backtickvariables)):

# Replace spaces in column names when passed to the localdict:
# pandas/pandas/core/frame.py

# adjusted function
def eval(self, expr, inplace=False, **kwargs):
    ...
    # line 3076
    resolvers = dict((k.replace(" ", SEPERATOR_REPLACING_SPACES), v) 
                     for k, v in self.iteritems()), index_resolvers

EDIT: fixed _replace_spaces_backtickvariables (regex should've been lazy)

@beojan
Copy link

beojan commented Nov 8, 2018

@zhiruiwang The df. syntax won't be able to use this since it's parsed by Python, but query and eval would work.

@hwalinga
Copy link
Contributor

@jreback You seem the one knowing most about this.

What do you think of my approach as explained in the previous comment?

Instead, we could also decide to solve it as dalelung proposed. So allow "dirty" names to referred to by their "clean" names ("this column name" can be referred to by "this_column_name" without the column actually changing the name) and don't use the `` encapsulation at all.

This would than only require this single line to be changed. (If I understand the code correctly, not tested.)

# Replace spaces in column names when passed to the localdict:
# pandas/pandas/core/frame.py

# adjusted function
def eval(self, expr, inplace=False, **kwargs):
    ...
    # line 3076
    resolvers = dict((k.replace(" ", SEPERATOR_REPLACING_SPACES), v) 
                     for k, v in self.iteritems()), index_resolvers

What do you think? Which approach do you think is best?

Than I can try to make a pull request for it.

@bscully27
Copy link

This feature would be nice but I resolve with commands like these:

Replace White Space

df.rename(columns={k: k.replace(' ','_') for k in df.columns if k.count(' ')>0}, inplace=1)

Starts with Numeric Value

df.rename(columns={k: '_'+k for k in df.columns if k[0].isdigit()}, inplace=1)

@dgua
Copy link

dgua commented Jan 20, 2019 via email

@polidore
Copy link

in sql, you just use square brackets and that's nice.

@TomAugspurger
Copy link
Contributor

@dgua do you have time to submit a PR? @zhiruiwang's suggestion of using backticks seems reasonable, if it can be implemented (we couldn't use them before, since they already have a meaning in Python 2).

@hwalinga
Copy link
Contributor

hwalinga commented Jan 23, 2019

@TomAugspurger I would have some time after my exams, and already took a look into the code for the implementation (see my previous comments). I still have some questions:

  • What should SEPERATOR_REPLACING_SPACES be? If we pick just "_", instead of a more complex replacement string, it might interfere with already existing other columns, but if we choose "_" it has the extra feature that you can refer to this column name by `this column name` and this_column_name. (We could even drop the backticks feature.)

  • SEPERATOR_REPLACING_SPACES has to be consistent over two files. So, should it be declared somewhere else and imported into these files to make sure they are the same, or would a simple comment mentioning this suffice?

  • In the code I will alter, callable() is used. callable() is not available in 3.0-3.2, so should I also fix this at the same time?

@TomAugspurger
Copy link
Contributor

I'm not familiar with this code, so you may be the expert here :)

In the code I will alter, callable() is used. callable() is not available in 3.0-3.2

Python 3.0 - 3.2? We require 3.5+

@beojan
Copy link

beojan commented Jan 23, 2019

I think the backtick idea is rather good now, since in the eval or query context you wouldn't need to worry about how things are implemented. I wouldn't want to drop it.

For the plain Python context, the new names could have a prefix or suffix to prevent collisions (e.g. they all end with an underscore).

@hwalinga
Copy link
Contributor

have a prefix or suffix to prevent collisions

@beojan yes, but maybe you can see those "collisions" as a feature. So you can refer to this column name by `this column name` and this_column_name. I also think it is a bit silly if you have two columns names this column name and this_column_name, but you never know.

@beojan
Copy link

beojan commented Jan 23, 2019

You can use this_column_name_.

If you really have two colliding columns, I don't see how that's a feature though.

@hwalinga
Copy link
Contributor

Well, if I have a dataframe:

    "column name" "name"
1   4              5
2   2              1

With the feature implemented, without measures for colliding, I can now say:

df.query(column_name > 3)

And pandas would automatically refer to "column name" in this query. This is also earlier suggested by dalejung. You can now also leave the support for backticks out.

I also don't think you would see any dataframes in the wild that looks like:

    "column name" "name" "column_name"
1   3              5     6
2   2              1     9

In which the collisions would cause a problem.

So, in my view, it won't cause any collisions and gives an extra way to refer to the column.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jan 23, 2019 via email

@hwalinga
Copy link
Contributor

@TomAugspurger Okay, that is clear.

Do we than go with a suffix like "_" to prevent collisions, or do we go with replacing the space between the column names with a complex string, so we can rule out even more accidental collisions.

@jreback
Copy link
Contributor

jreback commented Jan 23, 2019

this might work directly in the AST
I think these actually parse correctly but we don’t handle the reassembly correctly

hwalinga added a commit to hwalinga/pandas that referenced this issue Jan 26, 2019
hwalinga added a commit to hwalinga/pandas that referenced this issue Jan 26, 2019
This only works for parser=pandas and engine=numexpr
It works by replacing any backtick quoted variables to a clean version.
    For this, see: pandas/core/common.py::clean_column_name_with_spaces
This happens before the query is passed and by changing the names
of the localdict before it passed to numexpr.
hwalinga added a commit to hwalinga/pandas that referenced this issue Jan 26, 2019
This only works for parser=pandas and engine=numexpr
It works by replacing any backtick quoted variables to a clean version.
    For this, see: pandas/core/common.py::clean_column_name_with_spaces
This happens before the query is passed and by changing the names
of the localdict before it passed to numexpr.
hwalinga added a commit to hwalinga/pandas that referenced this issue Feb 16, 2019
@jreback jreback modified the milestones: Someday, 0.25.0 Mar 10, 2019
@palnabarun
Copy link

Hi everyone,

Is there a timeline when will this feature be released?

@beojan
Copy link

beojan commented Apr 5, 2019

I think it's done in master, given the commit mentioned above.

@jreback
Copy link
Contributor

jreback commented Apr 5, 2019

this will be in 0.25 - prob in a month or 2

@hwalinga
Copy link
Contributor

hwalinga commented Apr 5, 2019

Don't want to ruin the fun, but to prevent any disappointments: Pandas 0.25 will only be available for Python3. Also see https://pandas-docs.github.io/pandas-docs-travis/install.html#install-dropping-27

@palnabarun
Copy link

@jreback Thanks for the information. I saw that the 0.25.0 milestone is due for May 1.

@hwalinga No worries. This will be used in Python3 only codebase.

@danielhrisca
Copy link
Contributor

This would have been a good opportunity to allow column name that contain dots. Why was this not included?

@hwalinga
Copy link
Contributor

@danielhrisca

I implemented this and we didn't thought of it. You can open a new issue to bring it up again. There might be a good reason I am not aware of, but the maintainers are, that this is best not to be allowed.

I think I can implement it again. It won't have the same solution as the space however. The reason is that the query string is parsed as python source code. You have to apply workarounds if you want certain syntax to be interpreted differently. And there probably won't be implemented a custom parser for this function.

@solivehong
Copy link

@hwalinga HI
I have encountered the same problem, but i table name is point

snv_df.query('Gene.refGene in ["MSH2","MSH3","MLH1","MLH3","MSH6","PMS2","PMS3"]'

I saw the changes you made in 0.25.0.
I have a suggestion,
When I read the source code of a perl software annovar, I found that can use hexadecimal to represent special symbols.
f.g.

comsic=comsic\x3dxxxxxxxxxxxxxxxxxxx

Use a function to correspond to input and output special symbols when parsing

@hwalinga
Copy link
Contributor

@zhaohongqiangsoliva

I don't think I understand what you are trying to say. Can you elaborate a bit more on the problem you want to solve?

@solivehong
Copy link

@hwalinga
Sorry, my English is not good. My problem is that the problem of spaces between column names is now solved, but other symbols are still not solved, for example

.      /           \     

and I give a proposed change is to use Hex

@dgua
Copy link

dgua commented Jun 23, 2019 via email

@hwalinga
Copy link
Contributor

@dgua @zhaohongqiangsoliva

The problem is that the query has to become a valid Python expression. Using hex for disallowed characters won't solve this problem. Allowing spaces in the name is already based on hacking around the tokenize function (from tokenize import generate_tokens). I don't think it is impossible to allow more characters in the name, but it will be based on hacking around the tokenize function again. (Pandas won't role their own Python parser.)

If you really want this, you are off course free to open a new issue addressing this, and if you tag me in the issue, I will explain my solution to the maintainers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Indexing Related to indexing on series/frames, not to indexes themselves Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet