
How can I dump DBFs to CSVs faster? #38

Open
marklit opened this issue Oct 29, 2019 · 8 comments

marklit commented Oct 29, 2019

I ran a benchmark dumping a DBF to CSV 1,000 times using dbfread 2.0.7 and Pandas 0.24.1 and comparing it to https://github.com/yellowfeather/dbf2csv.

The file I used was a 1.5 MB DBF in 'FoxBASE+/Dbase III plus, no memory' format with 4,522 records and 40 columns made up of 36 numeric fields, 2 char fields and 2 date fields. I used CPython 2.7.12 for dbfread and I compiled dbf2csv using GCC 5.4.0 and libboost v1.58.0. I ran the test on a t3.2xlarge instance on AWS.

The disk can be written to at 115 MB/s according to:

dd if=/dev/zero of=./test bs=1G count=1 oflag=dsync

dbfread and Pandas managed to write the CSV 1,000 times in 26 minutes, while dbf2csv took just under 74 seconds, a 21x difference in performance.

These were the steps I took during the benchmark:

$ vi run_1
for i in {1..1000}; do
    ./dbf2csv -c FILE.DBF > FILE.csv
done
$ time bash run_1 # 1m13.699s, 21x faster
$ vi run_2.py
from   dbfread import DBF
import pandas as pd
dbf = DBF('FILE.DBF', encoding='latin-1', char_decode_errors='strict')
pd.DataFrame(iter(dbf)).to_csv('FILE.csv', index=False, encoding='utf-8', mode='w')
$ vi run_2
for i in {1..1000}; do
    python run_2.py
done
$ time bash run_2 # 25m59.877s

The DBF would have sat in the page cache, but the runs nonetheless represented 1,489 MB of data read, and the resulting CSVs 638 MB written. Over the roughly 26-minute run that works out to 0.95 MB/s read and 0.41 MB/s written.

Do you know of any way I could see improved performance in this workload using dbfread?

kokes commented Oct 29, 2019

Can you benchmark the imports alone? I presume much of your benchmark is spent importing pandas, which is quite a heavy dependency (takes 0.7s, warm, on my machine).

Plus there's a bit of overhead from the non-lazy DataFrame creation, but it's not terrible at 25% using my dummy dataset (1.1 MB, 20 columns, strings and ints).

[Screenshot from 2019-10-29: timing comparison on the dummy dataset]

Edit: I have now tried it with a larger file (40 MB) to see if the lazy/eager approach makes a larger difference and indeed it does - 45% extra runtime under pandas.
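
If it helps to reproduce, this is roughly how I'd split the timings apart (just a sketch; FILE.DBF stands in for your file):

import time

t0 = time.time()
import pandas as pd                      # the heavy import, measured on its own
t1 = time.time()
from dbfread import DBF

t2 = time.time()
dbf = DBF('FILE.DBF', encoding='latin-1', char_decode_errors='strict')
df = pd.DataFrame(iter(dbf))             # eager materialization into a DataFrame
t3 = time.time()
df.to_csv('FILE.csv', index=False, encoding='utf-8', mode='w')
t4 = time.time()

print('import pandas:      %.2fs' % (t1 - t0))
print('parse + DataFrame:  %.2fs' % (t3 - t2))
print('to_csv:             %.2fs' % (t4 - t3))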

marklit (Author) commented Oct 29, 2019

@kokes I ran the following, which only imports Pandas and dbfread once and doesn't re-launch Python for each iteration. It wasn't run on Python 2.7.12 because it uses yield from; it was run on Python 3.5 with dbfread 2.0.7 and Pandas 0.24.2. That brought the total down to 11m30s, which is a solid improvement.

$ vi run_3.py
from   dbfread import DBF
import pandas as pd

def get_dbf():
    yield from DBF('FILE.DBF', encoding='latin-1', char_decode_errors='strict')

for _ in range(0, 1000):
    pd.DataFrame(get_dbf()).to_csv('FILE.csv', index=False, encoding='utf-8', mode='w')
$ time python3 run_3.py # 11m30.328s

Any more ideas of how we could get this closer to 74 seconds?

kokes commented Oct 29, 2019

For Python 2, yield from can be replaced by:

data = DBF(...)
for el in data:
    yield el

kokes commented Oct 29, 2019

Any more ideas of how we could get this closer to 74 seconds?

It comes down to what the limits are. I tried running the three obvious pieces of code:

  1. Pandas materialization and csv export (34.4 seconds)
  2. Streaming into a CSV file using the standard csv package (25.4 seconds)
  3. Streaming through the rows and doing nothing (21.6 seconds)

(I also tried the eager mode in dbfread itself, using load=True, but got pretty much identical results)

This goes to show that the major overhead is in this library itself: unless the parsing is changed, you won't get faster than the do-nothing baseline above. So I wouldn't focus on the CSV writer; I'd look at the parsing itself, that is, if you really want to go faster.
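
For reference, the three variants looked roughly like this, each piece timed on its own (a sketch; FILE.DBF and the output names are placeholders):

import csv
import pandas as pd
from dbfread import DBF

def read():
    return DBF('FILE.DBF', encoding='latin-1', char_decode_errors='strict')

# 1. Pandas materialization and csv export
pd.DataFrame(iter(read())).to_csv('out_pandas.csv', index=False)

# 2. Streaming into a CSV file using the standard csv package
dbf = read()
with open('out_stream.csv', 'w', newline='') as fh:
    writer = csv.DictWriter(fh, fieldnames=dbf.field_names)
    writer.writeheader()
    writer.writerows(dbf)

# 3. Streaming through the rows and doing nothing (the parsing baseline)
for _ in read():
    pass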

olemb (Owner) commented Dec 21, 2020

dbfread author here.

I guess the main reason dbf2csv is faster is that it's written in C++. A lot of the operations in dbfread are fiddling with little bits of binary data which would be a lot faster in a language closer to the hardware.

That said, I'm sure dbfread could be faster and I'd be interested in any suggestions you might have.

@kokes load=True calls the same code as load=False so I wouldn't expect it to be any faster. (It just calls the same generator and wraps the result in a list.) It should even be a little bit slower since it does a second pass to load deleted records:

    def load(self):
        if not self.loaded:
            self._records = list(self._iter_records(b' '))
            self._deleted = list(self._iter_records(b'*'))
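
From the caller's side the two modes look like this (just an illustration; FILE.DBF is a placeholder):

from dbfread import DBF

# Lazy (default): records are parsed one by one as you iterate.
for record in DBF('FILE.DBF'):
    pass

# Eager: load() runs up front, so records and deleted records end up in lists.
table = DBF('FILE.DBF', load=True)
print(len(table.records), len(table.deleted))

Either way the same parsing generator does the work, which is why the timings come out the same.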

fccoelho commented Feb 2, 2022

Any more ideas of how we could get this closer to 74 seconds?

In the loop you use to save the CSV, you are instantiating a new DataFrame object at every iteration; this definitely adds up in wasted time. You don't need pandas here. When you iterate over a DBF object, you get an ordered dict for each record. You can write these straight to the CSV file using csv.DictWriter, as shown here. On an SSD drive I am getting ~8k lines written per second, and that's while writing to a gzipped CSV, which adds the overhead of compression.
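
A rough sketch of what I mean (file names are placeholders; the gzip wrapper is only there because I happen to write compressed output):

import csv
import gzip
from dbfread import DBF

dbf = DBF('FILE.DBF', encoding='latin-1', char_decode_errors='strict')

# Stream records straight from the DBF into a gzipped CSV, one dict at a time,
# without ever building a DataFrame.
with gzip.open('FILE.csv.gz', 'wt', newline='') as fh:
    writer = csv.DictWriter(fh, fieldnames=dbf.field_names)
    writer.writeheader()
    for record in dbf:
        writer.writerow(record)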

kokes commented Feb 2, 2022

In the loop you use to save the CSV, you are instantiating a new DataFrame object at every iteration; this definitely adds up in wasted time. You don't need pandas here.

We've already discussed this above: #38 (comment)

And since we know how fast the parsing itself is (by not writing anything, in the same post), we know the perf ceiling of this library - and as the author notes, the difference against the other library is primarily due to the language used.

I don't think there's much to add here.

fccoelho commented Feb 2, 2022

You're welcome.
