#### A small, complete example of the issue
```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'tstamp': pd.date_range('2016-11-15', periods=4*365*60, freq='T'),
                   'data': np.random.rand(4*365*60)})

len(df)
## 87600
```
```python
%%timeit
__ = df['tstamp'].dt.date
## 10 loops, best of 3: 128 ms per loop

%%timeit
__ = df['tstamp'].dt.time
## 10 loops, best of 3: 132 ms per loop

%%timeit
__ = df['tstamp'].dt.dayofyear
## 100 loops, best of 3: 3.04 ms per loop

%%timeit
__ = df['tstamp'].dt.day
## 100 loops, best of 3: 2.83 ms per loop
```
As the timings show, accessing `.dt.date` and `.dt.time` is roughly 40x slower than accessing `.dt.day` or `.dt.dayofyear`. I do not know what is causing the bottleneck, but any speed-up would definitely be appreciated.
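My guess (not verified against the pandas internals) is that `.dt.date` and `.dt.time` must materialize one Python `datetime.date`/`datetime.time` object per row into an object-dtype array, whereas `.dt.day` and `.dt.dayofyear` are vectorized integer extractions. A minimal sketch of a faster alternative, assuming only the calendar date is needed and a `datetime64[D]` array is an acceptable result:

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.date_range('2016-11-15', periods=87600, freq='T'))

# Fast path: truncate to day precision entirely in NumPy datetime64 space,
# avoiding the creation of one Python datetime.date object per row.
dates_fast = s.values.astype('datetime64[D]')

# Slow path from the report, for comparison (object dtype of datetime.date).
dates_slow = s.dt.date

# The two agree element-wise once the object dates are cast back.
assert (dates_fast == np.array(dates_slow.tolist(), dtype='datetime64[D]')).all()
```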
Also, accessing `date` and `time` requires more than double the memory that the DataFrame itself occupies. I don't have a memory profiler working, but I can attest that my machine, with 30 GB of RAM available after OS use, can load a massive CSV that consumes 10.2 GB in memory as a DataFrame. However, trying to access `date` on that DataFrame raises a `MemoryError`: it fills up the remaining 19.8 GB of RAM trying to compute the `date` from a timestamp column. The DataFrame in question has 113,587,339 rows: five columns of numeric data, one column of strings, and a column with a datetime stamp similar to the example above.
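A rough way to quantify the overhead (a sketch using `Series.memory_usage(deep=True)`; the exact numbers depend on platform and Python version):

```python
import pandas as pd

s = pd.Series(pd.date_range('2016-11-15', periods=87600, freq='T'))

# datetime64[ns] storage: a flat 8 bytes per row.
print(s.memory_usage(deep=True))

# Object-dtype result of .dt.date: an 8-byte pointer per row plus a
# full Python datetime.date object per row, several times larger.
print(s.dt.date.memory_usage(deep=True))
```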
#### Output of pd.show_versions()

```
pandas: 0.19.0
nose: None
pip: 8.1.2
setuptools: 27.2.0
Cython: 0.25.1
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: 0.8.2
IPython: 5.1.0
sphinx: None
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.7
blosc: None
bottleneck: 1.1.0
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: 0.9.3
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None
```