Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failing to parse date given as integers from a (MS)SQL query #17855

Closed
drorata opened this issue Oct 12, 2017 · 6 comments · Fixed by #17882
Closed

Failing to parse date given as integers from a (MS)SQL query #17855

drorata opened this issue Oct 12, 2017 · 6 comments · Fixed by #17882
Labels
Datetime Datetime data dtype IO SQL to_sql, read_sql, read_sql_query

Comments

@drorata
Copy link
Contributor

drorata commented Oct 12, 2017

Code Sample

SQL = """
SELECT TOP 10
submitted_date_id
FROM my_table
WHERE submitted_date_id BETWEEN %(mindate)s AND %(maxdate)s
"""

df = pd.read_sql(
    SQL, 
    con, 
    params={'mindate': 20171008, 'maxdate': '20171010'},
    parse_dates={'submitted_date_id': '%Y%m%d'})

df

The underlaying database is MSSQL and I'm using pymssql to build the connection con.

Problem description

The column submitted_date_id has integers representing the date YYYYMMDD. The code above is expected to parse the column and populate the resulting column in the dataframe with Timestamp objects. However, this code yields an error:

ValueError: cannot cast unit %Y%m%d

The current behavior is counter intuitive and seems to assume that the date column contains stings. A work around is to cast the column in the query to VARCHAR, but this is cumbersome. Moreover, it adds unneeded complexity to the query.

Expected Output

When providing the format of the dates to parse_dates, it is expected that the column will hold Timestamp objects.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.2.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.3
pytest: 3.2.1
pip: 9.0.1
setuptools: 36.3.0
Cython: None
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 6.2.0
sphinx: 1.6.4
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

@jreback
Copy link
Contributor

jreback commented Oct 13, 2017

w/o a reproducible example you would have to debug this and see where things are going wrong.
as we handle this case already. (not 100% sure that the implementation actually passes things thru properly though).

In [10]: pd.to_datetime(20170101, format='%Y%m%d')
Out[10]: Timestamp('2017-01-01 00:00:00')

In [11]: pd.to_datetime('20170101', format='%Y%m%d')
Out[11]: Timestamp('2017-01-01 00:00:00')

@jreback jreback added Can't Repro IO SQL to_sql, read_sql, read_sql_query Datetime Datetime data dtype labels Oct 13, 2017
@drorata
Copy link
Contributor Author

drorata commented Oct 13, 2017

Here's a reproducible example:

import sqlalchemy
import pymysql
import pandas as pd

eng = sqlalchemy.create_engine('mysql+pymysql://sql2199180:kI1*yG4%@sql2.freemysqlhosting.net/sql2199180')
con = eng.connect()

pd.read_sql('select d from foo', con) # Works
pd.read_sql('select d from foo', con, parse_dates={'d': '%Y%m%d'}) # Fails
pd.read_sql('select CONVERT(d, CHAR(10)) as d from foo', con, parse_dates={'d': '%Y%m%d'}) # Works

@jorisvandenbossche
Copy link
Member

So this is the current code to handle the parse_dates keyword:

pandas/pandas/io/sql.py

Lines 102 to 119 in c277cd7

def _handle_date_column(col, utc=None, format=None):
if isinstance(format, dict):
return to_datetime(col, errors='ignore', **format)
else:
if format in ['D', 's', 'ms', 'us', 'ns']:
return to_datetime(col, errors='coerce', unit=format, utc=utc)
elif (issubclass(col.dtype.type, np.floating) or
issubclass(col.dtype.type, np.integer)):
# parse dates as timestamp
format = 's' if format is None else format
return to_datetime(col, errors='coerce', unit=format, utc=utc)
elif is_datetime64tz_dtype(col):
# coerce to UTC timezone
# GH11216
return (to_datetime(col, errors='coerce')
.astype('datetime64[ns, UTC]'))
else:
return to_datetime(col, errors='coerce', format=format, utc=utc)

where it explicitly checks for numeric column, and then passes format as the unit, which then fails.

I agree this is a bit inconvenient.
As a workaround for now, going from the code (didn't try), I think you can do parse_dates={'d':dict(format='%Y%m%d')} to force the use of the correct keyword.

An easy fix could also be to check whether '%' is contained in the format string, and in that case not pass it to unit (cleaner fix would be to not make this check on the data type, and always pass the format to format if it is not in the list of known units, but still parse it with unit='s' when no format is passed. I think this should not break any code).

@drorata Would you like to do a PR?

@jorisvandenbossche jorisvandenbossche added this to the Next Major Release milestone Oct 13, 2017
@drorata
Copy link
Contributor Author

drorata commented Oct 13, 2017

I am missing something. Why to_datetime is not used directly and if there's a formatting available it should be passed and used by to_datetime? The code @jorisvandenbossche referred to is changing the expected behavior of converting a Series to Timestampss using to_datetime.

@drorata
Copy link
Contributor Author

drorata commented Oct 13, 2017

@jorisvandenbossche I just tried to run the test suite (pytest from the root directory of the repository) on my checked out copy and I get the following:

``` ../../anaconda3/envs/pandas_dev/lib/python3.6/site-packages/_pytest/config.py:342: in _getconftestmodules return self._path2confmods[path] E KeyError: local('/Users/username/Dropbox/dev/pandas/pandas')

During handling of the above exception, another exception occurred:
../../anaconda3/envs/pandas_dev/lib/python3.6/site-packages/_pytest/config.py:373: in _importconftest
return self._conftestpath2mod[conftestpath]
E KeyError: local('/Users/username/Dropbox/dev/pandas/pandas/conftest.py')

During handling of the above exception, another exception occurred:
../../anaconda3/envs/pandas_dev/lib/python3.6/site-packages/_pytest/config.py:379: in _importconftest
mod = conftestpath.pyimport()
../../anaconda3/envs/pandas_dev/lib/python3.6/site-packages/py/_path/local.py:662: in pyimport
import(modname)
pandas/pandas/init.py:40: in
import pandas.core.config_init
pandas/pandas/core/config_init.py:12: in
import pandas.core.config as cf
pandas/pandas/core/config.py:820: in
is_text = is_instance_factory((str, bytes))
pandas/pandas/core/config.py:781: in is_instance_factory
from pandas.io.formats.printing import pprint_thing
pandas/pandas/io/formats/printing.py:6: in
from pandas.core.dtypes.inference import is_sequence
pandas/pandas/core/dtypes/inference.py:24: in
is_interval = lib.is_interval
E AttributeError: module 'pandas._libs.lib' has no attribute 'is_interval'

During handling of the above exception, another exception occurred:
../../anaconda3/envs/pandas_dev/lib/python3.6/site-packages/py/_path/common.py:372: in visit
for x in Visitor(fil, rec, ignore, bf, sort).gen(self):
../../anaconda3/envs/pandas_dev/lib/python3.6/site-packages/py/_path/common.py:410: in gen
dirs = self.optsort([p for p in entries
../../anaconda3/envs/pandas_dev/lib/python3.6/site-packages/py/_path/common.py:411: in
if p.check(dir=1) and (rec is None or rec(p))])
../../anaconda3/envs/pandas_dev/lib/python3.6/site-packages/_pytest/main.py:728: in _recurse
ihook = self.gethookproxy(path)
../../anaconda3/envs/pandas_dev/lib/python3.6/site-packages/_pytest/main.py:632: in gethookproxy
my_conftestmodules = pm._getconftestmodules(fspath)
../../anaconda3/envs/pandas_dev/lib/python3.6/site-packages/_pytest/config.py:356: in _getconftestmodules
mod = self._importconftest(conftestpath)
../../anaconda3/envs/pandas_dev/lib/python3.6/site-packages/_pytest/config.py:381: in _importconftest
raise ConftestImportFailure(conftestpath, sys.exc_info())
E _pytest.config.ConftestImportFailure: AttributeError("module 'pandas._libs.lib' has no attribute 'is_interval'",)
E File "/Users/username/Dropbox/dev/pandas/pandas/init.py", line 40, in
E import pandas.core.config_init
E File "/Users/username/Dropbox/dev/pandas/pandas/core/config_init.py", line 12, in
E import pandas.core.config as cf
E File "/Users/username/Dropbox/dev/pandas/pandas/core/config.py", line 820, in
E is_text = is_instance_factory((str, bytes))
E File "/Users/username/Dropbox/dev/pandas/pandas/core/config.py", line 781, in is_instance_factory
E from pandas.io.formats.printing import pprint_thing
E File "/Users/username/Dropbox/dev/pandas/pandas/io/formats/printing.py", line 6, in
E from pandas.core.dtypes.inference import is_sequence
E File "/Users/username/Dropbox/dev/pandas/pandas/core/dtypes/inference.py", line 24, in
E is_interval = lib.is_interval

</details>

If this is not a trivial problem, I'll start a separate issue.

@jorisvandenbossche
Copy link
Member

@drorata I suppose the test problem is solved in the meantime since you opened a PR? If not, from a quick look it seems to me you need to rebuild the pandas C-extensions locally (see http://pandas.pydata.org/pandas-docs/stable/contributing.html#making-changes)

This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype IO SQL to_sql, read_sql, read_sql_query
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants