DataFrame.fillna() working on row vector instead of column vector? #15522

Closed
ixru opened this issue Feb 27, 2017 · 6 comments · Fixed by #21674
Labels: Bug; Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate); Timezones (Timezone data dtype)

@ixru

ixru commented Feb 27, 2017

Code Sample, a copy-pastable example if possible

>>> df.head(5)
                       time   id    bid  bid_depth  bid_depth_total  \
0 2017-02-27 11:34:31+00:00  105  148.0      497.0         216589.0
1 2017-02-27 11:34:35+00:00  105    NaN        NaN              NaN
2 2017-02-27 11:34:38+00:00  105    NaN        NaN              NaN
3 2017-02-27 11:34:40+00:00  105    NaN        NaN              NaN
4 2017-02-27 11:34:41+00:00  105    NaN        NaN              NaN

   bid_number  offer  offer_depth  offer_depth_total  offer_number   open  \
0       243.0  148.1      14192.0           530373.0         503.0  147.5
1         NaN    NaN      14272.0           530453.0         504.0    NaN
2         NaN    NaN      14192.0           530373.0         503.0    NaN
3         NaN    NaN      14272.0           530453.0         504.0    NaN
4         NaN    NaN      14492.0           530673.0         505.0    NaN

    high    low   last  change  change_percent     volume        value  trades
0  148.2  147.3  148.0     0.9            0.61  1286830.0  190224000.0  2112.0
1    NaN    NaN    NaN     NaN             NaN        NaN          NaN     NaN
2    NaN    NaN    NaN     NaN             NaN        NaN          NaN     NaN
3    NaN    NaN    NaN     NaN             NaN        NaN          NaN     NaN
4    NaN    NaN    NaN     NaN             NaN        NaN          NaN     NaN

>>> df.fillna(method='pad')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.6/site-packages/pandas/core/frame.py", line 2842, in fillna
    downcast=downcast, **kwargs)
  File "/usr/lib/python3.6/site-packages/pandas/core/generic.py", line 3250, in fillna
    downcast=downcast)
  File "/usr/lib/python3.6/site-packages/pandas/core/internals.py", line 3177, in interpolate
    return self.apply('interpolate', **kwargs)
  File "/usr/lib/python3.6/site-packages/pandas/core/internals.py", line 3056, in apply
    applied = getattr(b, f)(**kwargs)
  File "/usr/lib/python3.6/site-packages/pandas/core/internals.py", line 917, in interpolate
    downcast=downcast, mgr=mgr)
  File "/usr/lib/python3.6/site-packages/pandas/core/internals.py", line 956, in _interpolate_with_fill
    values = self._try_coerce_result(values)
  File "/usr/lib/python3.6/site-packages/pandas/core/internals.py", line 2448, in _try_coerce_result
    result = result.reshape(len(result))
ValueError: cannot reshape array of size 24311 into shape (1,)

Problem description

msgpack of dataframe for replication:
https://www.dropbox.com/s/5skf6v8x2vg103o/dataframe?dl=0

I'm a beginner, so I can only guess at what is wrong, but fillna() seems to be operating on rows instead of columns. If I loop through df.columns and fill series by series, I get the expected output, so the problem does not seem to lie with any individual column.
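The column-by-column workaround described above can be sketched roughly as follows. The small frame is a hypothetical stand-in for the reporter's data (the column names are borrowed from the output above, the values are made up), and `Series.ffill()` is used as the per-column equivalent of `fillna(method='pad')`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the reporter's frame: one tz-aware column
# plus float columns containing gaps.
df = pd.DataFrame({
    "time": pd.to_datetime(
        ["2017-02-27 11:34:31", "2017-02-27 11:34:35",
         "2017-02-27 11:34:38", "2017-02-27 11:34:40"],
        utc=True,
    ),
    "bid": [148.0, np.nan, np.nan, 149.0],
    "offer": [148.1, np.nan, 148.3, np.nan],
})

# Forward-fill one column (Series) at a time instead of the whole frame.
filled = df.copy()
for col in filled.columns:
    filled[col] = filled[col].ffill()

print(filled)
```

Filling per Series avoids the frame-level code path that raised the ValueError in the traceback above, at the cost of an explicit loop.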

Expected Output

Fill the NaNs in each column with the prior value in that column.

Output of pd.show_versions()

commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.8-1-ARCH
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.2
nose: None
pip: 9.0.1
setuptools: 34.2.0
Cython: None
numpy: 1.12.0
scipy: None
statsmodels: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.5
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: None
boto: None
pandas_datareader: None

@jreback
Contributor

jreback commented Feb 27, 2017

Can you show df.info()?

@jreback jreback added the Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate label Feb 27, 2017
@ixru
Author

ixru commented Feb 27, 2017

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24311 entries, 0 to 24310
Data columns (total 19 columns):
time                 24311 non-null datetime64[ns, UTC]
id                   24311 non-null int64
bid                  1469 non-null float64
bid_depth            7988 non-null float64
bid_depth_total      11630 non-null float64
bid_number           10765 non-null float64
offer                1370 non-null float64
offer_depth          7864 non-null float64
offer_depth_total    10617 non-null float64
offer_number         9940 non-null float64
open                 1085 non-null float64
high                 1086 non-null float64
low                  1085 non-null float64
last                 1223 non-null float64
change               1223 non-null float64
change_percent       1223 non-null float64
volume               3697 non-null float64
value                3697 non-null float64
trades               3697 non-null float64
dtypes: datetime64[ns, UTC](1), float64(17), int64(1)
memory usage: 3.5 MB

@chris-b1
Contributor

Something to do with datetimetz. Here's a simpler repro:

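import pandas as pd
# A single tz-aware datetime column is enough to trigger the error on pandas 0.19.2
df = pd.DataFrame({'date': pd.date_range('2014-01-01', periods=5, tz='US/Central')})
df.fillna(method='pad')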

ValueError                                Traceback (most recent call last)
<ipython-input-77-8f5ecb26a2f6> in <module>()
----> 1 df.fillna(method='pad')
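Since the failure is tied to tz-aware datetime columns, it can help to check which columns of a frame have that dtype before filling. The snippet below is only a sketch: the extra column `x` and the variable name `tz_cols` are illustrative and do not come from the thread.

```python
import pandas as pd

# Frame modelled on the repro above, with one extra float column for contrast.
df = pd.DataFrame({
    "date": pd.date_range("2014-01-01", periods=5, tz="US/Central"),
    "x": [1.0, None, 3.0, None, 5.0],
})

# Collect the columns whose dtype is a tz-aware datetime (DatetimeTZDtype).
tz_cols = [c for c in df.columns if isinstance(df[c].dtype, pd.DatetimeTZDtype)]
print(tz_cols)
```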

@chris-b1 chris-b1 added Bug Timezones Timezone data dtype labels Feb 27, 2017
@jreback
Contributor

jreback commented Feb 27, 2017

Yeah, we need to handle these (the tz-aware values) correctly in the Block.

@jreback
Contributor

jreback commented Feb 27, 2017

@MatSalm an easy way to do this (though not super pretty) is:

In [20]: df = pd.DataFrame({'A':pd.date_range('20130101',periods=4,tz='US/Eastern'),'B':[1,2,np.nan,np.nan]})

In [21]: df
Out[21]: 
                          A    B
0 2013-01-01 00:00:00-05:00  1.0
1 2013-01-02 00:00:00-05:00  2.0
2 2013-01-03 00:00:00-05:00  NaN
3 2013-01-04 00:00:00-05:00  NaN

In [23]: df[df.select_dtypes(exclude=['number']).columns].join(df.select_dtypes(include=['number']).fillna(method='pad'))
Out[23]: 
                          A    B
0 2013-01-01 00:00:00-05:00  1.0
1 2013-01-02 00:00:00-05:00  2.0
2 2013-01-03 00:00:00-05:00  2.0
3 2013-01-04 00:00:00-05:00  2.0
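The same idea can be wrapped in a small helper, shown here as a sketch rather than anything taken from pandas itself. The function name `pad_numeric_columns` is hypothetical, and `ffill()` stands in for `fillna(method='pad')` because newer pandas releases deprecate the `method` argument.

```python
import numpy as np
import pandas as pd


def pad_numeric_columns(frame):
    """Forward-fill only the numeric columns; leave all other columns untouched."""
    numeric = frame.select_dtypes(include=["number"]).ffill()
    other = frame.select_dtypes(exclude=["number"])
    # join() aligns on the index; selecting frame.columns restores column order.
    return other.join(numeric)[frame.columns]


df = pd.DataFrame({
    "A": pd.date_range("20130101", periods=4, tz="US/Eastern"),
    "B": [1, 2, np.nan, np.nan],
})
result = pad_numeric_columns(df)
print(result)
```

Because the non-numeric columns are never passed through the fill, the tz-aware column `A` comes back unchanged, and only `B` is padded.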

@jreback jreback added this to the Next Major Release milestone Feb 27, 2017
@ixru
Author

ixru commented Feb 27, 2017

Thank you

@jreback jreback modified the milestones: Next Major Release, 0.24.0 Jul 2, 2018
3 participants