Cannot put a DataFrame into an HDFStore *completely* #3012

Closed
simomo opened this Issue Mar 11, 2013 · 3 comments

simomo commented Mar 11, 2013

  • I load a DataFrame from MySQL:
df_bugs_activity_4w = psql.read_frame('select * from bugs_activity limit 0, 40000', conn)
  • and here is the structure of df_bugs_activity_4w:
In[19]: df_bugs_activity_4w
Out[19]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 40000 entries, 0 to 39999
Data columns:
bug_id       40000  non-null values
attach_id    13  non-null values
who          40000  non-null values
bug_when     40000  non-null values
fieldid      40000  non-null values
added        40000  non-null values
removed      40000  non-null values
id           40000  non-null values
dtypes: float64(1), int64(4), object(3)
  • then I convert the object columns:
In [60]: df_bugs_activity_4w = df_bugs_activity_4w.convert_objects()
In [61]: df_bugs_activity_4w
Out[61]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 40000 entries, 0 to 39999
Data columns:
bug_id       40000  non-null values
attach_id    13  non-null values
who          40000  non-null values
bug_when     40000  non-null values
fieldid      40000  non-null values
added        40000  non-null values
removed      40000  non-null values
id           40000  non-null values
dtypes: datetime64[ns](1), float64(1), int64(4), object(2)
  • I put it into an HDFStore and then get it back out, and the number of rows drops from 40,000 to 13! That's weird. It looks as if the 13 non-null values in the attach_id column somehow limit how many rows end up in the store (a self-contained reproduction sketch follows after the store listing below).
In [63]: %prun store.put('df_bugs_activity_4w1', df_bugs_activity_4w, table=True)

In [64]: %time store.get('df_bugs_activity_4w1')
CPU times: user 0.00 s, sys: 0.00 s, total: 0.00 s
Wall time: 0.01 s
Out[64]:
bug_id  attach_id   who bug_when    fieldid added   removed id
2012     301879  0   35 1999-04-06 11:22:11  16  dev     bug     2013
2014     301879  0   35 1999-04-06 11:22:12  5   para    us 2015
2835     301879  0   56 1999-05-14 15:56:12  10  clo     op  2836
31244    301879  0   207    2001-07-18 14:11:38  10  op  clo     31245
31252    301879  0   207    2001-07-18 15:40:52  10  ana     op  31253
31283    301879  0   35 2001-07-18 21:21:33  16  lui     dev     31284
31285    301879  0   35 2001-07-18 21:21:34  15  296     10  31286
31287    301879  0   35 2001-07-18 21:21:35  5   unk     para    31288
31393    301879  0   159    2001-07-19 12:41:07  16  prat    lui     31394
31472    301879  0   207    2001-07-19 17:27:31  10  ope     ana     31473
32675    301879  0   207    2001-08-02 10:09:08  10  clos    op   32676
38609    235837  0   201    2001-09-26 20:28:11  15  310-    300-3   38610
38610    235838  0   201    2001-09-26 20:28:11  15  310-    300     38611

In [66]: store
Out[66]:
<class 'pandas.io.pytables.HDFStore'>
File path: sample_no_fill.h5
/df_bugs_4w                      frame_table  (typ->appendable,nrows->40000,ncols->52,indexers->[index])
/df_bugs_4w1                     frame_table  (typ->legacy,nrows->None,ncols->0,indexers->[])           
/df_bugs_activity_4w             frame_table  (typ->appendable,nrows->13,ncols->8,indexers->[index])    
/df_bugs_activity_4w1            frame_table  (typ->appendable,nrows->13,ncols->8,indexers->[index]) 
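
For reference, here is a minimal reproduction sketch. The data below is a synthetic stand-in for my bugs_activity table (same shape: 40,000 rows, with attach_id non-null for only 13 of them); the real frame comes from psql.read_frame as shown above, and the put call uses table=True exactly as in my session (newer pandas spells this format='table').

import numpy as np
import pandas as pd

# synthetic stand-in for the frame read from MySQL above:
# 40,000 rows, 'attach_id' is NaN everywhere except the first 13 rows
n = 40000
attach_id = np.empty(n)
attach_id[:] = np.nan
attach_id[:13] = 301879.0

df = pd.DataFrame({
    'bug_id':    np.arange(n),
    'attach_id': attach_id,
    'who':       np.arange(n) % 300,
    'bug_when':  pd.date_range('1999-04-06 11:22:11', periods=n, freq='min'),
    'fieldid':   np.arange(n) % 20,
    'added':     ['dev'] * n,
    'removed':   ['bug'] * n,
    'id':        np.arange(n),
})

store = pd.HDFStore('sample_no_fill.h5')
store.put('df_bugs_activity_4w1', df, table=True)   # same call as In [63]
roundtrip = store.get('df_bugs_activity_4w1')
print(len(df), len(roundtrip))   # I expect "40000 40000" but get "40000 13"
store.close()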

jreback commented Mar 11, 2013

Please post a sample of the data as it comes out of SQL, before any other operations. Post it as a text string, EXACTLY as you have it (e.g. do a df.to_csv()), with a subset of the rows.
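
Something like this is all I need (just a sketch; conn is your existing MySQL connection and psql.read_frame is the same call from your first snippet, replaced by pd.read_sql in later pandas):

import sys
import pandas.io.sql as psql

# re-read a small slice straight from MySQL, before convert_objects
# or any HDFStore calls
raw = psql.read_frame('select * from bugs_activity limit 0, 100', conn)

# dump it as plain CSV text and paste that output into the issue
raw.head(20).to_csv(sys.stdout)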

jreback commented Mar 11, 2013

@simomo PR #3013 should fix your second issue (there was a bug); please give it a try and let me know.
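
To check, a round trip like this should bring back all 40,000 rows once the fix is in (a sketch using the df_bugs_activity_4w frame and the open store from your post; the key name here is just an example):

store.put('df_bugs_activity_4w_check', df_bugs_activity_4w, table=True)
roundtrip = store.get('df_bugs_activity_4w_check')

# both lengths should be 40000 after the fix
print(len(df_bugs_activity_4w), len(roundtrip))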

simomo commented Mar 14, 2013

This issue has been solved. Thanks!

simomo closed this Mar 14, 2013
