concat produces incorrect output #3602

Closed
rhstanton opened this Issue May 14, 2013 · 20 comments


Under certain circumstances, concat seems to produce erroneous results. I haven't worked out what causes the problems to arise, but here's an example:

from pandas import DataFrame, concat

df1 = DataFrame({'firmNo' : [0,0,0,0], 'stringvar' : ['rrr', 'rrr', 'rrr', 'rrr'], 'prc' : [6,6,6,6] })
df2 = DataFrame({'misc' : [1,2,3,4], 'prc' : [6,6,6,6], 'C' : [9,10,11,12]})
concat([df1,df2],axis=1)

produces as output:

   firmNo  prc stringvar   C  misc  prc
0     rrr    0         6   9     1    6
1     rrr    0         6  10     2    6
2     rrr    0         6  11     3    6
3     rrr    0         6  12     4    6

Member

cpcloud commented May 15, 2013

@rhstanton It's helpful if you can put your output in "``" so that it prints in a monospaced font. It's easier on the eyes. :) I can reproduce this on git master. What is the expected output?

Member

cpcloud commented May 15, 2013

Ah. looks like there's a sorting problem here...

I agree it looks terrible! Does the output go in quotes in my notebook or when I upload to github? If you could give me a quick example of how to do this, I'd be more than happy to help others' eyesight in future.


Member

cpcloud commented May 15, 2013

Surround anything you want in monospaced type with backquotes, i.e., the ` character. For example, `x = 1`. When you get a chance to get on GitHub (not replying from your email) you should click on the GitHub Flavored Markdown link. It's full of useful info.

Contributor

jreback commented May 18, 2013

@cpcloud any luck with this?

Member

cpcloud commented May 18, 2013

Nah not yet, but I haven't given it more than a cursory glance. I will look
into it this weekend.

Member

cpcloud commented May 19, 2013

@jreback @rhstanton What is expected output? 2 prc columns with the same values? one of the merge behaviors?

Is this the expected output? (I will assume it is since it is the least magical thing concat could do here: just basically "join" [not in the database sense] the two frames together along the requested axis.)

   firmNo  prc stringvar   C  misc  prc
0       0    6       rrr   9     1    6
1       0    6       rrr  10     2    6
2       0    6       rrr  11     3    6
3       0    6       rrr  12     4    6

(heh i will try to parse this with read_html later...)

Yes, I’d expect the output you show below, just with the right column headings (2 prc columns with the same values, but only because they were passed in with the same values. If they’d had different values in df1 and df2, I’d expect two prc columns with different contents).

Best,

Richard
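The expectation described above (two `prc` columns that each keep their own contents) can be illustrated with a small sketch. The differing `prc` values in the second frame (7 through 10) are made up for illustration and do not come from the original report; the behaviour shown is that of a pandas version where this issue is fixed.

```python
import pandas as pd

# Same shape as the frames in the report, but the second frame's 'prc'
# values are hypothetical so the two columns can be told apart.
df1 = pd.DataFrame({"firmNo": [0, 0, 0, 0], "prc": [6, 6, 6, 6]})
df2 = pd.DataFrame({"misc": [1, 2, 3, 4], "prc": [7, 8, 9, 10]})

both = pd.concat([df1, df2], axis=1)

# Selecting a duplicated label returns a DataFrame holding every column
# with that label, in their original left-to-right order.
prc_pair = both["prc"]
left_prc = prc_pair.iloc[:, 0].tolist()   # values that came from df1
right_prc = prc_pair.iloc[:, 1].tolist()  # values that came from df2

print(prc_pair)
```

Positional access with `iloc` is used here because label access alone cannot distinguish between the two `prc` columns.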

From: Phillip Cloud [mailto:notifications@github.com]
Sent: Saturday, May 18, 2013 6:42 PM

@jreback @rhstanton What is expected output? 2 prc columns with the same values? one of the merge behaviors? if u do

# use dfs from above
df = concat([df1, df2], axis=1, ignore_index=True)
print(df)

   0  1    2   3  4  5
0  0  6  rrr   9  1  6
1  0  6  rrr  10  2  6
2  0  6  rrr  11  3  6
3  0  6  rrr  12  4  6

Is this what u want except with the original column indices?
(heh i will try to parse this with read_html later...)

Member

cpcloud commented May 19, 2013

@jreback This is a strange beast. i went all the way into NDFrame.__init__ and back up to concat. values attrs test the same, prolly a repr bug now

Member

cpcloud commented May 19, 2013

nvm something else...

Member

cpcloud commented May 19, 2013

@jreback AH HA! the bug is that the _ref_locs attribute of BlockManager is not set when u concat the two dfs and keep track of the index, but when u ignore the index and then set the columns there is already an ordering (_ref_locs is already set) so u r good. the question remains tho how u want to deal with this...seems like might want to raise when trying to concat in this situation. right now the exception thrown by get_indexer is caught and the assumption is made that the ordering is 0..n - 1 where n is the number of blocks, but that doesn't seem totally consistent, not sure what the optimal approach here is.
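The path described above as working (ignore the index during the concat, then set the columns afterwards) can be sketched roughly as follows. This is only an illustration of that workaround, not the fix that eventually went into pandas; the variable names are made up for the example.

```python
import pandas as pd

df1 = pd.DataFrame({"firmNo": [0, 0, 0, 0],
                    "stringvar": ["rrr", "rrr", "rrr", "rrr"],
                    "prc": [6, 6, 6, 6]})
df2 = pd.DataFrame({"misc": [1, 2, 3, 4],
                    "prc": [6, 6, 6, 6],
                    "C": [9, 10, 11, 12]})

# Step 1: concatenate with positional column labels (0..5), so no
# duplicate labels are involved while the result is being built.
out = pd.concat([df1, df2], axis=1, ignore_index=True)

# Step 2: put the original labels back, duplicates included.
original_labels = list(df1.columns) + list(df2.columns)
out.columns = original_labels

print(out)
```

The column order in the printed frame follows whatever order each input frame already had, so it may differ from the sorted order shown in the 2013 output.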

Member

cpcloud commented May 19, 2013

828f9f9 fixed the series version of this by just assigning the columns after the concat. is that the correct fix here? don't think so, maybe can ignore index if ignore_index is false and there are dup cols and axis is 1

Looks like that would work given the results of your earlier concat without column names.


Member

cpcloud commented May 19, 2013

u might be right, need to see how to do this in a sane way...

Member

cpcloud commented May 19, 2013

i wonder if an __eq__ on block and blkmgr might help things like this in the future. could compare items, values and ref_locs

Contributor

jreback commented May 19, 2013

let me take a look

Contributor

jreback commented May 19, 2013

should be fixed by #3647, once I figured out what was going on, fix was trivial

@cpcloud you were basically right, the newly created block has a non-unique index, so the block manager tries to create _ref_locs on each block, but this is wrong because it doesn't have an indexer map for the axes -> block locations (but of course have one when we are creating the blocks in the first place, so just set it there)

this worked in <= 0.11, but not in master because of the changes in non-unique indexes

non-unique are a bit of an animal!

Contributor

jreback commented May 19, 2013

In [3]: df1 = DataFrame({'firmNo' : [0,0,0,0], 'stringvar' : ['rrr', 'rrr', 'rrr', 'rrr'], 'prc' : [6,6,6,6] })

In [4]: df2 = DataFrame({'misc' : [1,2,3,4], 'prc' : [6,6,6,6], 'C' : [9,10,11,12]})

In [5]: df1
Out[5]: 
   firmNo  prc stringvar
0       0    6       rrr
1       0    6       rrr
2       0    6       rrr
3       0    6       rrr

In [6]: df2
Out[6]: 
    C  misc  prc
0   9     1    6
1  10     2    6
2  11     3    6
3  12     4    6

In [7]: pd.concat([df1,df2],axis=1)
Out[7]: 
   firmNo  prc stringvar   C  misc  prc
0       0    6       rrr   9     1    6
1       0    6       rrr  10     2    6
2       0    6       rrr  11     3    6
3       0    6       rrr  12     4    6

In [8]: pd.concat([df1,df2],axis=1).dtypes
Out[8]: 
firmNo        int64
prc           int64
stringvar    object
C             int64
misc          int64
prc           int64
dtype: object
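A session like the one above could also be turned into a positional check, so that a mismatch between column labels and values is caught without reading the printed frame. This is a minimal sketch and not the test added in #3647; the helper name `columns_aligned` is hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame({"firmNo": [0, 0, 0, 0],
                    "stringvar": ["rrr", "rrr", "rrr", "rrr"],
                    "prc": [6, 6, 6, 6]})
df2 = pd.DataFrame({"misc": [1, 2, 3, 4],
                    "prc": [6, 6, 6, 6],
                    "C": [9, 10, 11, 12]})


def columns_aligned(result, frames):
    """Return True if every column of `result` carries the label and the
    values of the matching input column, compared by position."""
    labels = [label for frame in frames for label in frame.columns]
    if list(result.columns) != labels:
        return False
    position = 0
    for frame in frames:
        for i in range(frame.shape[1]):
            # iloc is used on both sides because labels may be duplicated.
            if result.iloc[:, position].tolist() != frame.iloc[:, i].tolist():
                return False
            position += 1
    return True


result = pd.concat([df1, df2], axis=1)
aligned = columns_aligned(result, [df1, df2])
print(aligned)
```

With the erroneous output from the original report, the `firmNo` column would hold `'rrr'` and the check would return `False`.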

That looks a lot better. Thanks.


Member

cpcloud commented May 19, 2013

This is the part that I missed: "but of course have one when we are creating the blocks in the first place, so just set it there" arg :) @jreback thanks.

jreback closed this in #3647 May 19, 2013
