MultiIndexes and large CSV files #4516

Closed
highpost opened this issue Aug 8, 2013 · 13 comments · Fixed by #4522
Labels
Bug Indexing Related to indexing on series/frames, not to indexes themselves
Milestone

Comments

@highpost

highpost commented Aug 8, 2013

I've run into a data-dependent bug with MultiIndex. Attached are a test script and a data-file generation script:

  • sales-data.py loads data from a CSV file with read_csv and then displays a multiindexed subset.
  • sales-gen.sh is an AWK script that generates CSV test files. By default it creates a 600MB file, which triggers the failure. To make a smaller test file, which succeeds, just use something like "head -1000".

My problem is that sales-data.py breaks with large CSV files but works with smaller CSV files. When it fails it displays the following error message:

'MultiIndex lexsort depth 0, key was length 4'

I know that MultiIndexes must be pre-sorted, so sales-gen.sh makes sure that the first four columns in the auto-generated CSV files are ordered. By the way, I'm not sure why lexsort is being invoked at all.
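For reference, the sortedness that the error message complains about can be checked directly on the index. A minimal sketch with made-up data, using the current pandas API (`sort_index`, `is_monotonic_increasing`) rather than the 2013-era one in the thread:

```python
import pandas as pd

# Small frame with an intentionally unsorted two-level MultiIndex
df = pd.DataFrame(
    {"v": [1, 2, 3, 4]},
    index=pd.MultiIndex.from_tuples(
        [("b", 1), ("a", 2), ("a", 1), ("b", 2)], names=["k1", "k2"]
    ),
)

# Unsorted index: full-tuple slicing can fail with the lexsort-depth error
print(df.index.is_monotonic_increasing)  # False

# Sorting the index restores full lexsort depth
df = df.sort_index()
print(df.index.is_monotonic_increasing)  # True
```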

sales-data.py

#!/usr/bin/env python

import numpy as np
import pandas as pd
from memory_profiler import profile

pd.set_option('display.height',            400)
pd.set_option('display.width',             400)
pd.set_option('display.max_rows',         1000)
pd.set_option('display.max_columns',        30)
pd.set_option('display.line_width',        200)

# @profile
def load_data():
    try:
        df = pd.read_csv(
            './sales-large.csv',
          # './sales-small.csv',
            header = None,
            na_values = ['NULL'],
            names = [
                'salesperson',
                'customer',
                'invoice_date',
                'ship_date',
                'product',
                'quantity',
                'price',
            ],
            index_col = [
                'salesperson',
                'customer',
                'invoice_date',
                'ship_date',
            ],
            parse_dates = [
                'invoice_date',
                'ship_date',
            ],
        )

        print(df.loc[(
            'A00000',                        # salesperson
            'A',                             # customer
            pd.datetime(2011,3,1,0,0,0),     # invoice_date
            pd.datetime(2011,3,6,0,0,0),     # ship_date
        )])

    except Exception as e:
        print(e)

if __name__== '__main__':
    load_data()

sales-gen.sh

#!/usr/bin/env bash

# 'salesperson',
# 'customer',
# 'invoice_date',
# 'ship_date',
# 'product',
# 'quantity',
# 'price',

awk "BEGIN {                                                                                              \
          first = 65;                                                                                     \
          last = 91;                                                                                      \
          s_char = first;                                                                                 \
          c_char = first;                                                                                 \
                                                                                                          \
          n_sa = 100;                                                                                     \
          n_cu = 100;                                                                                     \
          n_dt = 10;                                                                                      \
          n_pr = 100;                                                                                     \
                                                                                                          \
          for (i = 0; i < n_sa; i++ s_char++) {                                                           \
              if (s_char == last) {                                                                       \
                  s_char = first;                                                                         \
              }                                                                                           \
              c_char = first;                                                                             \
              for (j = 0; j < n_cu; j++ c_char++) {                                                       \
                  if (c_char == last) {                                                                   \
                      c_char = first;                                                                     \
                  }                                                                                       \
                  for (k = 1; k <= n_dt; k++) {                                                           \
                      for (l = 0; l < n_pr; l++) {                                                        \
                          printf(\"%c%05d,%c,2011-03-%02d 00:00:00,2011-03-%02d 00:00:00,%d,%d,%f\n\",    \
                              s_char, i, c_char, k, k+5, l, 1 + int(rand() * 10), rand())                 \
                      }                                                                                   \
                  }                                                                                       \
              }                                                                                           \
          }                                                                                               \
      }"
@jreback
Contributor

jreback commented Aug 8, 2013

Here's how to figure out the problem.

With a big file like this, it's best to read it in chunks and set the index at the end
(it SHOULD be a bit faster and will have lower peak memory, too):

df_iter = pd.read_csv(
    './sales-large.csv',
    header = None,
    na_values = ['NULL'],
    names = [
        'salesperson',
        'customer',
        'invoice_date',
        'ship_date',
        'product',
        'quantity',
        'price',
        ],
    parse_dates = [
        'invoice_date',
        'ship_date',
        ],
    chunksize=100000,
    )

Read from the iterator; you have duplicate values, and this will show you where they are:

dfs = []
for df in df_iter:
    dups = df.duplicated(['salesperson','customer','invoice_date','ship_date'])
    print(dups[dups])    # the rows within this chunk whose key duplicates an earlier row

    dfs.append(df)

The above is for you to inspect; here's how to create a de-duplicated frame:

df = pd.concat([ df for df in df_iter ]).drop_duplicates(['salesperson','customer','invoice_date','ship_date'])

Set the index at the very end

df = df.set_index([
    'salesperson',
    'customer',
    'invoice_date',
    'ship_date',
    ])

The index actually is not sorted, but that doesn't matter here (though you probably DO want to sort it):

df.index.lexsort_depth

Running the above script

[goat-jreback-~] python gen.py
0
product     0.00000
quantity    1.00000
price       0.49057
Name: (A00000, A, 2011-03-01 00:00:00, 2011-03-06 00:00:00), dtype: float64
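The steps above can be condensed into one self-contained sketch. A tiny in-memory CSV via `StringIO` stands in for the 600MB file, and the `chunksize` is deliberately small for illustration:

```python
import io
import pandas as pd

# Three rows in the same layout as sales-large.csv; the second row
# duplicates the first row's key with a different product
csv = io.StringIO(
    "A00000,A,2011-03-01,2011-03-06,0,1,0.49\n"
    "A00000,A,2011-03-01,2011-03-06,1,2,0.75\n"
    "A00001,B,2011-03-02,2011-03-07,0,3,0.10\n"
)
keys = ["salesperson", "customer", "invoice_date", "ship_date"]
names = keys + ["product", "quantity", "price"]

# Chunked read, as suggested above
chunks = pd.read_csv(csv, header=None, names=names,
                     parse_dates=["invoice_date", "ship_date"], chunksize=2)

# Concatenate, drop duplicate keys, set the index last, then sort it
df = (pd.concat(chunks)
        .drop_duplicates(keys)
        .set_index(keys)
        .sort_index())
print(df)
```

After this pipeline the index is unique and fully lexsorted, so full-tuple lookups like the one in the original script work.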

@highpost
Author

highpost commented Aug 8, 2013

Thanks, Jeff. I hate it when it's pilot error.


@jreback
Contributor

jreback commented Aug 9, 2013

@DanSears this actually is a bug; it has to do with how a frame with a MultiIndex is reindexed when there are multiple dtypes in the data (e.g., in the example below there are floats and ints). This is not implemented yet, but it's not too hard. Thanks for the report.

In your case, depending on how you are going to access the data, I actually wouldn't set an index at all.

What are you going to do with this data? E.g., are you going to simply look up fields like you posted?

In [6]: df = DataFrame([['foo','bar',1.0,1],['foo','bar',2.0,2],['bah','bam',3.0,3],
   ...: ['bah','bam',4.0,4],['foo','bar',5.0,5],['bah','bam',6.0,6]],
   ...: columns=list('ABCD'))

In [8]: df
Out[8]: 
     A    B  C  D
0  foo  bar  1  1
1  foo  bar  2  2
2  bah  bam  3  3
3  bah  bam  4  4
4  foo  bar  5  5
5  bah  bam  6  6
In [9]: df = df.set_index(['A','B'])

In [10]: df
Out[10]: 
         C  D
A   B        
foo bar  1  1
    bar  2  2
bah bam  3  3
    bam  4  4
foo bar  5  5
bah bam  6  6

In [11]: df.sortlevel(0)
Exception: Reindexing only valid with uniquely valued Index objects
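For comparison: `sortlevel()` has since been removed from pandas, and its replacement, `sort_index(level=...)`, handles this frame's duplicate keys and mixed dtypes without the exception. A hedged sketch of the same frame on a current version:

```python
import pandas as pd

# Same frame as above: duplicate (A, B) keys and mixed float/int columns
df = pd.DataFrame(
    [["foo", "bar", 1.0, 1], ["foo", "bar", 2.0, 2], ["bah", "bam", 3.0, 3],
     ["bah", "bam", 4.0, 4], ["foo", "bar", 5.0, 5], ["bah", "bam", 6.0, 6]],
    columns=list("ABCD"),
).set_index(["A", "B"])

# sortlevel(0) raised here in 2013; sort_index(level=0) is the modern spelling
sorted_df = df.sort_index(level=0)
print(sorted_df)
```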

@jreback
Contributor

jreback commented Aug 9, 2013

@DanSears If you'd like to give it a try, #4522 fixes the issue.

@highpost
Author

highpost commented Aug 9, 2013

Right now I'm just learning pandas, and MultiIndexing looks pretty innovative. I anticipate doing lookups like this to replace SQL selects, so I mocked up the test data with multiple datetimes. But while my test data passes sort(1) and uniq(1), it has a bug: its second datetime column is not unique, so pandas objects. I'll update the awk script soon.

Thanks for looking into this. I'll definitely test anything you come up with.

--Dan


@jreback
Contributor

jreback commented Aug 9, 2013

@DanSears

Having dups in a multi-index is OK, but it really depends on what you are doing with it.

Try something like this for selecting:

df[(df['salesperson'] == 'foo') & (df['invoice_date'] > Timestamp('20130101'))]
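That boolean-selection pattern as a self-contained sketch (the data is made up; the column names match the thread):

```python
import pandas as pd

df = pd.DataFrame({
    "salesperson": ["foo", "foo", "bar"],
    "invoice_date": pd.to_datetime(["2013-02-01", "2012-12-31", "2013-03-15"]),
    "quantity": [10, 20, 30],
})

# Parenthesize each comparison and combine them with & (element-wise AND)
mask = (df["salesperson"] == "foo") & (df["invoice_date"] > pd.Timestamp("20130101"))
print(df[mask])
```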

@highpost
Author

Sorry for the delay, but I've been testing your fix and here are my results.

I updated my test script and data-generation script, and I'm pretty sure the data is sorted correctly :). The test script works fine for smaller datasets, but if I increase the dataset size I get:

index out of bounds

You can make smaller datasets by truncating the CSV file with something like "head -1000000". In fact, it blows up right between 999,999 and 1,000,000 lines!

Here's load-sales.py

#!/usr/bin/env python

import numpy as np
import pandas as pd
from memory_profiler import profile

pd.set_option('display.height',            400)
pd.set_option('display.width',             400)
pd.set_option('display.max_rows',         1000)
pd.set_option('display.max_columns',        30)
pd.set_option('display.line_width',        200)

# @profile
def load_data():
    try:
        df_iter = pd.read_csv(
            './sales-sorted.csv',
            header = None,
            na_values = ['NULL'],
            names = [
                'salesperson',
                'customer',
                'invoice_date',
                'price',
                'ship_date',
                'product',
                'quantity',            
            ],
            parse_dates = [
                'invoice_date',        
                'ship_date',
            ],
            chunksize = 1000,
        )

        #
        # aggregate the chunks
        #
        df = pd.concat(
            [df for df in df_iter],
            ignore_index = True)

      # df = pd.concat(
      #     [df for df in df_iter],
      #     ignore_index = True).drop_duplicates([
      #         'salesperson',
      #         'customer',
      #         'invoice_date',
      #         'ship_date',
      #     ])

        #
        # apply the MultiIndex
        #
        df = df.set_index([
            'salesperson',
            'customer',
            'invoice_date',
            'price',
            'ship_date',
        ])

        print(df.loc[(
            'A00000',                        # salesperson
            'A',                             # customer
            pd.datetime(2011,3,1,0,0,0),     # invoice_date
            1.56,                            # price
            pd.datetime(2011,3,6,0,0,0),     # ship_date
        )])

    except Exception as e:
        print(e)

if __name__== '__main__':
    load_data()

and here's the test data generation script:

#!/usr/bin/env bash

# 'salesperson',
# 'customer',
# 'invoice_date',
# 'price',
# 'ship_date',
# 'product',
# 'quantity',

awk "BEGIN {                                                                                              \
          prices[0] = 1.1;                                                                                \
          prices[1] = 2.2;                                                                                \
          prices[2] = 3.3;                                                                                \
          prices[3] = 4.4;                                                                                \
          prices[4] = 5.5;                                                                                \
          prices[5] = 6.6;                                                                                \
          prices[6] = 7.7;                                                                                \
          prices[7] = 8.8;                                                                                \
          prices[8] = 9.9;                                                                                \
          prices[9] = 10.0;                                                                               \
                                                                                                          \
          n_sa = 300;                                                                                     \
          n_cu = 300;                                                                                     \
          n_in = 10;                                                                                      \
          n_pr = 10;                                                                                      \
                                                                                                          \
          for (i = 0; i < n_sa; i++) {                                                                    \
              for (j = 0; j < n_cu; j++) {                                                                \
                  for (k = 1; k <= n_in; k++) {                                                           \
                      for (l = 0; l < n_pr; l++) {                                                        \
                          printf(\"%d,%d,2011-03-%02d 00:00:00,%0.2f,2011-03-%02d 00:00:00,%d,%d\n\",     \
                              i, j, k, prices[l], k+5, l + 1, 1 + int(rand() * 10))                       \
                      }                                                                                   \
                  }                                                                                       \
              }                                                                                           \
          }                                                                                               \
      }"

@jreback
Contributor

jreback commented Aug 16, 2013

The problem is that the price field is a float, and it's part of the index. Floats do very weird things in indices and make them very slow. Make price a column instead and do a boolean search on it; see here.
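A sketch of that advice: keep the float price as a column and match it with a tolerance rather than exact float equality (the data below is made up):

```python
import pandas as pd

# Index only the discrete keys; keep the float price out of the index
df = pd.DataFrame({
    "salesperson": [0, 0, 1],
    "customer":    [0, 0, 0],
    "price":       [1.56, 2.20, 1.56],
    "quantity":    [5, 6, 7],
}).set_index(["salesperson", "customer"])

# Exact float equality is fragile; a tolerance-based comparison is safer
hits = df[(df["price"] - 1.56).abs() < 1e-9]
print(hits)
```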

@highpost
Author

Thanks.

--Dan


@highpost
Author

I converted the price column to an integer (×100) and I'm still getting the same result: for data files below 1 million lines it works, but above 1M I get:

index out of bounds

change to load-sales.py:

print(df.loc[(
    'A00000',                        # salesperson
    'A',                             # customer
    pd.datetime(2011,3,1,0,0,0),     # invoice_date
    156,                             # price (x100)
    pd.datetime(2011,3,6,0,0,0),     # ship_date
)])

change to sales-gen.sh:

printf(\"%d,%d,2011-03-%02d 00:00:00,%d,2011-03-%02d 00:00:00,%d,%d\n\",        \
    i, j, k, int(prices[l] * 100), k+5, l + 1, 1 + int(rand() * 10))            \

@jreback
Contributor

jreback commented Aug 17, 2013

The salesperson/customer fields are now integers? So 'A00000' and 'A' aren't in your file at all anymore?

@jreback
Contributor

jreback commented Aug 17, 2013

Here are a couple more tips.

Access the data like this (which is in effect a shortcut for the type of lookup you are doing):
df.xs(('A00000', 'A', pd.datetime(2011,3,1,0,0,0), 156, pd.datetime(2011,3,6,0,0,0)))

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 9000000 entries, (0, 0, 2011-03-01 00:00:00, 110, 2011-03-06 00:00:00) to (299, 299, 2011-03-10 00:00:00, 1000, 2011-03-15 00:00:00)
Columns: 2 entries, product to quantity
dtypes: int64(2)

x = df.reset_index()
x

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9000000 entries, 0 to 8999999
Columns: 7 entries, salesperson to quantity
dtypes: datetime64[ns](2), int64(5)

x.head()
   salesperson  customer        invoice_date  price           ship_date  product  quantity
0            0         0 2011-03-01 00:00:00    110 2011-03-06 00:00:00        1        10
1            0         0 2011-03-01 00:00:00    220 2011-03-06 00:00:00        2         6
2            0         0 2011-03-01 00:00:00    330 2011-03-06 00:00:00        3         5
3            0         0 2011-03-01 00:00:00    440 2011-03-06 00:00:00        4         4
4            0         0 2011-03-01 00:00:00    550 2011-03-06 00:00:00        5         6

Now just do a boolean search (to make sure that the key is in fact there); you can search with one or more keys:

x[x.salesperson == 'A00000']
*** TypeError: Could not compare <type 'str'> type with Series

unique values in a column

pd.Series(x.salesperson.unique())
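Those three tips as one runnable sketch on a tiny made-up frame: `xs` for keyed lookup, `reset_index` to move the index levels back into columns for boolean searches, and `unique` for a column's distinct values:

```python
import pandas as pd

df = pd.DataFrame({
    "salesperson": [0, 0, 1],
    "product":     [1, 2, 1],
    "quantity":    [10, 6, 5],
}).set_index(["salesperson", "product"])

# xs() is a shortcut for a full- or partial-key lookup on a MultiIndex
row = df.xs((0, 2))                   # full key -> the matching row as a Series
sub = df.xs(0, level="salesperson")   # partial key -> all rows for salesperson 0

# reset_index() restores the key columns so comparisons work on them
x = df.reset_index()
print(pd.Series(x["salesperson"].unique()))
```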

@highpost
Author

You're right, I had copied the wrong snippet. The following query works:

print(df.loc[(
    0,                               # salesperson
    0,                               # customer
    pd.datetime(2011,3,1,0,0,0),     # invoice_date
    770,                             # price (x100)
    pd.datetime(2011,3,6,0,0,0),     # ship_date
)])

Thanks for your help with this!

--Dan

