Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

read_csv : using day first 23x to 35x slower than setting the format explicitly #25848

Closed
garfieldthecat opened this issue Mar 23, 2019 · 7 comments · Fixed by #26360
Closed
Labels
IO CSV read_csv, to_csv Performance Memory or execution speed performance Timeseries
Milestone

Comments

@garfieldthecat
Copy link

Code Sample: create a small csv with dates, then imports it and times the import

import numpy as np
import pandas as pd
import time
import timeit
start_time=time.time()
n=int(100e3)
out =pd.DataFrame()
out['day']=np.random.randint(1,12,n)
out['month']=np.random.randint(1,12,n)
out['year']=np.random.randint(2000,2010,n)
out['date1']= out.day.astype(str) + '-' + out.month.astype(str) + '-' + out.year.astype(str)
out['date2']= out['day'].map('{:0>2d}'.format ) + '-' + out['month'].map('{:0>2d}'.format ) + '-' + out.year.astype(str)
out.to_csv('test dates.csv')
print('Exported in ')
print(time.time()-start_time)


def noparse():
    return pd.read_csv('test dates.csv', dtype={'date1':str,'date2':str} )

def parse1():
    return pd.read_csv('test dates.csv', parse_dates=['date1','date2'], dayfirst=True )

def parse2():
    df = pd.read_csv('test dates.csv', dtype={'date1':str,'date2':str} )
    for d in ['date1','date2']:
        df[d]= pd.to_datetime( df[d], dayfirst=True )
    return df

def parse3():
    df = pd.read_csv('test dates.csv', dtype={'date1':str,'date2':str} )
    # formatting described at: http://php.net/manual/en/function.strftime.php
    df['date1']=pd.to_datetime(df['date1'], format='%d-%m-%Y')
    df['date2']=pd.to_datetime(df['date2'], format='%d-%m-%Y')
    return df

results=pd.DataFrame()
results['no parse'] = timeit.Timer( "noparse()", globals=globals() ).repeat(repeat=3, number =1)
results['parse 1']=timeit.Timer( "parse1()", globals=globals() ).repeat(repeat=3, number =1)
results['parse 2']=timeit.Timer( "parse2()", globals=globals() ).repeat(repeat=3, number =1)
results['parse 3']=timeit.Timer( "parse3()", globals=globals() ).repeat(repeat=3, number =1)
print(results)

Problem description

I am trying to import a CSV with dates, where the days are always < 13 , i.e. there is no way to infer the proper format - it must be specified explicitly (how can you tell whether 1-2 is Jan 2nd or Feb 1st?).

What I noticed is that setting the date format with dayfirst is 25 to 35x (I tried it on two PCs) slower than setting the format explicitly with format='%d-%m-%Y'

How can this be? It's insane. I appreciate there may be a little bit of additional overhead, as dayfirst must guess the position of the year, but 25x to 35x slower leaves me speechless. With dayfirst it took 16 seconds to import a csv with 2 date columns and only 100k rows.

There must be something very wrong with how read_csv implements dayfirst - I'm hoping it shouldn't be too complicated to fix it?

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: AMD64 Family 23 Model 1 Stepping 1, AuthenticAMD
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None

pandas: 0.23.4
pytest: 4.0.2
pip: 18.1
setuptools: 40.6.3
Cython: 0.29.2
numpy: 1.15.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: 1.8.2
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.2
openpyxl: 2.5.12
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.2
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.15
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
None

@jreback
Copy link
Contributor

jreback commented Mar 23, 2019

i think this is also a duplicate issue

dateutil does dayfirst / yearfirst parsing and it’s python based so it’s slow

you are welcome to submit a patch

@jreback jreback added Timeseries Performance Memory or execution speed performance IO CSV read_csv, to_csv labels Mar 23, 2019
@jreback
Copy link
Contributor

jreback commented May 7, 2019

cc @anmyachev i believe this will be much faster now (from your recent patch)
can u confirm and see if we have appropriate benchmarks

@anmyachev
Copy link
Contributor

Ok I only need time.
fresh result (now more than 100 times slower - testing on master + #25754):

id no parse parse 1 parse 2 parse 3
0 0.079626 0.191381 6.523449 0.470717
1 0.075815 0.173151 6.387253 0.472316
2 0.083093 0.182341 6.439037 0.489041

@anmyachev
Copy link
Contributor

anmyachev commented May 8, 2019

For comparing read_csv and to_datetime functions it makes sense for the second function to pass cache=True flag.
So the results will be as follows(I think the gap between use dayfirst or format is acceptable):

id no parse parse 1 parse 2 parse 3
0 0.078812 0.173535 0.182994 0.095245
1 0.070532 0.172566 0.194698 0.099433
2 0.065606 0.172240 0.180793 0.101515

No benchmarks with dayfirst=True or format='%d-%m-%Y' were found.

@jreback
Copy link
Contributor

jreback commented May 8, 2019

No benchmarks with dayfirst=True or format='%d-%m-%Y' were found.

could you add those u ran above?

@anmyachev
Copy link
Contributor

Yes I'll do it.

@vnlitvinov
Copy link
Contributor

@garfieldthecat and others interested - this was improved in our PR #25922 (the figures @anmyachev is showing above are after it was accepted to master).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO CSV read_csv, to_csv Performance Memory or execution speed performance Timeseries
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants