Failing to parse date given as integers from a (MS)SQL query #17855

drorata · 2017-10-12T14:53:49Z

Code Sample

SQL = """
SELECT TOP 10
submitted_date_id
FROM my_table
WHERE submitted_date_id BETWEEN %(mindate)s AND %(maxdate)s
"""

df = pd.read_sql(
    SQL, 
    con, 
    params={'mindate': 20171008, 'maxdate': '20171010'},
    parse_dates={'submitted_date_id': '%Y%m%d'})

df

The underlaying database is MSSQL and I'm using pymssql to build the connection con.

Problem description

The column submitted_date_id has integers representing the date YYYYMMDD. The code above is expected to parse the column and populate the resulting column in the dataframe with Timestamp objects. However, this code yields an error:

ValueError: cannot cast unit %Y%m%d

The current behavior is counter intuitive and seems to assume that the date column contains stings. A work around is to cast the column in the query to VARCHAR, but this is cumbersome. Moreover, it adds unneeded complexity to the query.

Expected Output

When providing the format of the dates to parse_dates, it is expected that the column will hold Timestamp objects.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.2.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.3
pytest: 3.2.1
pip: 9.0.1
setuptools: 36.3.0
Cython: None
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 6.2.0
sphinx: 1.6.4
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

The text was updated successfully, but these errors were encountered:

jreback · 2017-10-13T02:13:40Z

w/o a reproducible example you would have to debug this and see where things are going wrong.
as we handle this case already. (not 100% sure that the implementation actually passes things thru properly though).

In [10]: pd.to_datetime(20170101, format='%Y%m%d')
Out[10]: Timestamp('2017-01-01 00:00:00')

In [11]: pd.to_datetime('20170101', format='%Y%m%d')
Out[11]: Timestamp('2017-01-01 00:00:00')

drorata · 2017-10-13T08:09:48Z

Here's a reproducible example:

import sqlalchemy
import pymysql
import pandas as pd

eng = sqlalchemy.create_engine('mysql+pymysql://sql2199180:kI1*yG4%@sql2.freemysqlhosting.net/sql2199180')
con = eng.connect()

pd.read_sql('select d from foo', con) # Works
pd.read_sql('select d from foo', con, parse_dates={'d': '%Y%m%d'}) # Fails
pd.read_sql('select CONVERT(d, CHAR(10)) as d from foo', con, parse_dates={'d': '%Y%m%d'}) # Works

jorisvandenbossche · 2017-10-13T09:26:26Z

So this is the current code to handle the parse_dates keyword:

pandas/pandas/io/sql.py

Lines 102 to 119 in c277cd7

    
           def _handle_date_column(col, utc=None, format=None): 
        
               if isinstance(format, dict): 
        
                   return to_datetime(col, errors='ignore', **format) 
        
               else: 
        
                   if format in ['D', 's', 'ms', 'us', 'ns']: 
        
                       return to_datetime(col, errors='coerce', unit=format, utc=utc) 
        
                   elif (issubclass(col.dtype.type, np.floating) or 
        
                         issubclass(col.dtype.type, np.integer)): 
        
                       # parse dates as timestamp 
        
                       format = 's' if format is None else format 
        
                       return to_datetime(col, errors='coerce', unit=format, utc=utc) 
        
                   elif is_datetime64tz_dtype(col): 
        
                       # coerce to UTC timezone 
        
                       # GH11216 
        
                       return (to_datetime(col, errors='coerce') 
        
                               .astype('datetime64[ns, UTC]')) 
        
                   else: 
        
                       return to_datetime(col, errors='coerce', format=format, utc=utc)

where it explicitly checks for numeric column, and then passes format as the unit, which then fails.

I agree this is a bit inconvenient.
As a workaround for now, going from the code (didn't try), I think you can do parse_dates={'d':dict(format='%Y%m%d')} to force the use of the correct keyword.

An easy fix could also be to check whether '%' is contained in the format string, and in that case not pass it to unit (cleaner fix would be to not make this check on the data type, and always pass the format to format if it is not in the list of known units, but still parse it with unit='s' when no format is passed. I think this should not break any code).

@drorata Would you like to do a PR?

drorata · 2017-10-13T13:01:31Z

I am missing something. Why to_datetime is not used directly and if there's a formatting available it should be passed and used by to_datetime? The code @jorisvandenbossche referred to is changing the expected behavior of converting a Series to Timestampss using to_datetime.

drorata · 2017-10-13T19:39:07Z

@jorisvandenbossche I just tried to run the test suite (pytest from the root directory of the repository) on my checked out copy and I get the following:

``` ../../anaconda3/envs/pandas_dev/lib/python3.6/site-packages/_pytest/config.py:342: in _getconftestmodules return self._path2confmods[path] E KeyError: local('/Users/username/Dropbox/dev/pandas/pandas')

During handling of the above exception, another exception occurred:
../../anaconda3/envs/pandas_dev/lib/python3.6/site-packages/_pytest/config.py:373: in _importconftest
return self._conftestpath2mod[conftestpath]
E KeyError: local('/Users/username/Dropbox/dev/pandas/pandas/conftest.py')

During handling of the above exception, another exception occurred:
../../anaconda3/envs/pandas_dev/lib/python3.6/site-packages/_pytest/config.py:379: in _importconftest
mod = conftestpath.pyimport()
../../anaconda3/envs/pandas_dev/lib/python3.6/site-packages/py/_path/local.py:662: in pyimport
import(modname)
pandas/pandas/init.py:40: in
import pandas.core.config_init
pandas/pandas/core/config_init.py:12: in
import pandas.core.config as cf
pandas/pandas/core/config.py:820: in
is_text = is_instance_factory((str, bytes))
pandas/pandas/core/config.py:781: in is_instance_factory
from pandas.io.formats.printing import pprint_thing
pandas/pandas/io/formats/printing.py:6: in
from pandas.core.dtypes.inference import is_sequence
pandas/pandas/core/dtypes/inference.py:24: in
is_interval = lib.is_interval
E AttributeError: module 'pandas._libs.lib' has no attribute 'is_interval'

During handling of the above exception, another exception occurred:
../../anaconda3/envs/pandas_dev/lib/python3.6/site-packages/py/_path/common.py:372: in visit
for x in Visitor(fil, rec, ignore, bf, sort).gen(self):
../../anaconda3/envs/pandas_dev/lib/python3.6/site-packages/py/_path/common.py:410: in gen
dirs = self.optsort([p for p in entries
../../anaconda3/envs/pandas_dev/lib/python3.6/site-packages/py/_path/common.py:411: in
if p.check(dir=1) and (rec is None or rec(p))])
../../anaconda3/envs/pandas_dev/lib/python3.6/site-packages/_pytest/main.py:728: in _recurse
ihook = self.gethookproxy(path)
../../anaconda3/envs/pandas_dev/lib/python3.6/site-packages/_pytest/main.py:632: in gethookproxy
my_conftestmodules = pm._getconftestmodules(fspath)
../../anaconda3/envs/pandas_dev/lib/python3.6/site-packages/_pytest/config.py:356: in _getconftestmodules
mod = self._importconftest(conftestpath)
../../anaconda3/envs/pandas_dev/lib/python3.6/site-packages/_pytest/config.py:381: in _importconftest
raise ConftestImportFailure(conftestpath, sys.exc_info())
E _pytest.config.ConftestImportFailure: AttributeError("module 'pandas._libs.lib' has no attribute 'is_interval'",)
E File "/Users/username/Dropbox/dev/pandas/pandas/init.py", line 40, in
E import pandas.core.config_init
E File "/Users/username/Dropbox/dev/pandas/pandas/core/config_init.py", line 12, in
E import pandas.core.config as cf
E File "/Users/username/Dropbox/dev/pandas/pandas/core/config.py", line 820, in
E is_text = is_instance_factory((str, bytes))
E File "/Users/username/Dropbox/dev/pandas/pandas/core/config.py", line 781, in is_instance_factory
E from pandas.io.formats.printing import pprint_thing
E File "/Users/username/Dropbox/dev/pandas/pandas/io/formats/printing.py", line 6, in
E from pandas.core.dtypes.inference import is_sequence
E File "/Users/username/Dropbox/dev/pandas/pandas/core/dtypes/inference.py", line 24, in
E is_interval = lib.is_interval

</details>

If this is not a trivial problem, I'll start a separate issue.

jorisvandenbossche · 2017-10-16T14:21:27Z

@drorata I suppose the test problem is solved in the meantime since you opened a PR? If not, from a quick look it seems to me you need to rebuild the pandas C-extensions locally (see http://pandas.pydata.org/pandas-docs/stable/contributing.html#making-changes)

jreback added Can't Repro IO SQL to_sql, read_sql, read_sql_query Datetime Datetime data dtype labels Oct 13, 2017

jorisvandenbossche removed the Can't Repro label Oct 13, 2017

jorisvandenbossche added this to the Next Major Release milestone Oct 13, 2017

jorisvandenbossche added the Difficulty Novice label Oct 13, 2017

drorata mentioned this issue Oct 15, 2017

BUG: formating integers datetimes using sql GH17855 #17882

Merged

4 tasks

jorisvandenbossche closed this as completed in #17882 Nov 22, 2017

This issue was closed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Failing to parse date given as integers from a (MS)SQL query #17855

Failing to parse date given as integers from a (MS)SQL query #17855

drorata commented Oct 12, 2017

INSTALLED VERSIONS

jreback commented Oct 13, 2017

drorata commented Oct 13, 2017

jorisvandenbossche commented Oct 13, 2017

drorata commented Oct 13, 2017

drorata commented Oct 13, 2017

jorisvandenbossche commented Oct 16, 2017

Failing to parse date given as integers from a (MS)SQL query #17855

Failing to parse date given as integers from a (MS)SQL query #17855

Comments

drorata commented Oct 12, 2017

Code Sample

Problem description

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

jreback commented Oct 13, 2017

drorata commented Oct 13, 2017

jorisvandenbossche commented Oct 13, 2017

drorata commented Oct 13, 2017

drorata commented Oct 13, 2017

jorisvandenbossche commented Oct 16, 2017

Output of `pd.show_versions()`