
WIP: Add Support for ODBC Connection #999

Closed · wants to merge 12 commits

Conversation


@napjon napjon commented May 30, 2017

Closes #985

Some caveats:

  • I’m not sure the ODBC connection belongs in the impala module, since any database with a supporting ODBC driver works. I should probably create an ODBCConnection class and derive a subclass for each client.
  • I chose turbodbc over pyodbc for two reasons:
    • I’m unable to create a weakref to a pyodbc connection object.
    • turbodbc supports fetchnumpy, which I believe integrates nicely with pandas.
  • There is no ping in the turbodbc cursor.
  • I had to modify _column_batches_to_dataframe to make room for turbodbc.
  • It would be great if someone could point me to a test connection.

cc @mariusvniekerk

  • nthreads option

  • unit tests
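The fetchnumpy-to-pandas path mentioned in the caveats above can be sketched as follows. This is a hedged illustration, not the PR's code: the function name is hypothetical, and fetchnumpy's actual return value (an OrderedDict of NumPy masked arrays, per the turbodbc docs) is simulated here with a plain dict.

```python
import numpy as np
import pandas as pd

def numpy_batch_to_dataframe(columns):
    """Convert a fetchnumpy()-style mapping of column names to arrays
    into a DataFrame. The function name is a hypothetical stand-in."""
    return pd.DataFrame({name: np.asarray(col) for name, col in columns.items()})

# Simulated fetchnumpy() output (a live cursor would return masked arrays):
batch = {'i': np.array([1, 2, 3], dtype='int8'),
         's': np.array(['a', 'b', 'c'])}
df = numpy_batch_to_dataframe(batch)
```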


napjon commented May 30, 2017

The CircleCI test result seems odd. It says:

AttributeError: module 'ibis.impala' has no attribute 'connect'

I didn't touch anything inside the connect function.


cpcloud commented May 30, 2017

I will try to review this today. I'll also investigate why the build isn't passing.


cpcloud commented May 30, 2017

The error you're seeing is usually related to imports. There might be a circular import or a missing dependency, though master is passing, so it would be strange if there were a missing dependency.


cpcloud commented May 30, 2017

@napjon, it looks like you didn't write your patch on top of master. Can you rebase on top of master and push again?


napjon commented May 30, 2017

I hope I did the rebase correctly. Let me know if I'm still doing the same thing.

@cpcloud cpcloud self-requested a review May 30, 2017 13:41
argument is None, then certificate validation is skipped.
user : string, LDAP user to authenticate
password : string, LDAP password to authenticate
auth_mechanism : string, {'NOSASL' <- default, 'PLAIN', 'GSSAPI', 'LDAP'}.
Member:

@napjon These docstrings are old. You need to remove these changes but keep your odbc_connect and other changes. git add -p <file> is very good for this, as it allows you to stage and commit the things you want and then discard (by using git checkout <file>) the things you don't want.
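The stage-then-discard workflow described above can be demonstrated in a throwaway repo. The file names below are illustrative only, not files from this PR:

```shell
# Scratch demo of the suggested workflow: stage what you want, then discard
# the rest with `git checkout -- <file>`.
dir=$(mktemp -d)
cd "$dir"
git init -q
printf 'wanted change\n' > client.py
git add client.py
git -c user.name=demo -c user.email=demo@example.com commit -qm 'keep this'
printf 'unwanted docstring churn\n' >> client.py
# Interactively you would run `git add -p client.py` to stage only the
# hunks you want; here we just discard the unstaged edit wholesale:
git checkout -- client.py
cat client.py    # back to the committed content
```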


napjon commented May 30, 2017

@cpcloud: OK, I think it's because I first forked about 6 months ago, back when the repo was under cloudera. I've tried to resync my master branch, but git still says nothing changed. I'll solve this.

@napjon napjon force-pushed the odbc-connection branch 2 times, most recently from e2d9728 to 2793cbf Compare June 1, 2017 06:12

napjon commented Jun 1, 2017

It seems I've corrected the changes now.

@cpcloud: could you please review it again?


cpcloud commented Jun 1, 2017

@napjon Reviewing now.

for name, chunks in czip(names, czip(*[b.columns for b in batches])):
cols[name] = _chunks_to_pandas_array(chunks)
return pd.DataFrame(cols, columns=names)
try: #Give Space for turbodbc type
Member:

It looks like you have quite a few lint errors: https://circleci.com/gh/pandas-dev/ibis/128?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link

I've found it helpful to integrate flake8 with my editor so that every time I save, flake8 will run and report any errors in the editor. Then, I can fix them up right away. Most common editors (vim, emacs, and pycharm) support this kind of integration.


params = {'con_string': connection_string,
'dsn':dsn,
'turbodbc_options': turbodbc_options}
Member:
The indentation here is inconsistent.

ImpalaClient
"""

params = {'con_string': connection_string,
Member:

Why not just pass these directly instead of putting them in a dict and then **-ing them in.

Author:

Okay, though I was just following the code style of connect.
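For illustration, the two styles under discussion might look like this. These are hypothetical stand-ins with made-up names, not the PR's actual code:

```python
def _make_client(connection_string=None, dsn=None, turbodbc_options=None):
    # Stand-in for the real client constructor.
    return {'connection_string': connection_string,
            'dsn': dsn,
            'turbodbc_options': turbodbc_options}

# Style used in the PR: collect arguments into a dict, then **-unpack it.
def odbc_connect_dict(connection_string=None, dsn=None, turbodbc_options=None):
    params = {'connection_string': connection_string,
              'dsn': dsn,
              'turbodbc_options': turbodbc_options}
    return _make_client(**params)

# Reviewer's suggestion: pass the arguments through directly.
def odbc_connect_direct(connection_string=None, dsn=None, turbodbc_options=None):
    return _make_client(connection_string=connection_string,
                        dsn=dsn,
                        turbodbc_options=turbodbc_options)
```

Both produce the same call; the direct form just avoids the intermediate dict.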

@@ -26,6 +26,8 @@
import numpy as np
import pandas as pd

import turbodbc
Member:

This will need a new entry in setup.py. You can add a new pip extra called 'odbc', like this:

odbc_extras = ['turbodbc']

setup(
    ...
    extras_require={
        ...
        'odbc': odbc_extras,
        ...
    },
)


cpcloud commented Jun 2, 2017

@napjon A few changes are needed before we can merge this. If you run flake8 locally you should be able to see what lint errors you need to fix before any "real" tests will run.


napjon commented Jun 3, 2017

@cpcloud: I do believe the changes still need some tidying up, so I need some guidance about keeping ODBC only in the impala module. But I guess it's okay to merge for now. Let me know if there are any corrections that need to be made.

I also encountered an error when using odbc_connect, but I don't know whether it's specific to odbc_connect or whether it also happens with the usual connect method.

Suppose I have a table with this schema:

t = ibis.table(ibis.Schema(['i'], ['int8']), name='held')

I get an error when executing this statement:

t.i.mean().execute()

It raises a KeyError: 'mean' not in dictionary. But simple filtering and aggregation work.

Our cluster is going to be restarted this weekend, so I can't reproduce it and show you the exact error.

Perhaps I should install the ibis docker image and investigate from there.

@cpcloud cpcloud left a comment


A few comments on formatting.

@@ -145,7 +145,7 @@ def _get_cursor(self):
raise com.InternalError('Too many concurrent / hung queries')
else:
if (cursor.database != self.database or
cursor.options != self.options):
cursor.options != self.options):
Member:
Leave this formatting as is.

])
column for column in expr.columns
if column not in partition_schema_names
])
Member:
Leave this formatting as is.


cpcloud commented Jun 4, 2017

@napjon I think we need to separate things a bit here so that we don't require users to install turbodbc.

What do you think about the following directory structure and API?

  1. An ibis/odbc directory that contains impala.py, postgres.py etc for the different odbc backends. You'd only have to implement the connect function in impala.py (and any supporting code).
  2. Users interact with odbc by doing:
import ibis
con = ibis.odbc.impala.connect(...)


napjon commented Jun 5, 2017

@cpcloud: Isn't turbodbc already in extras?

That sounds reasonable. I'll move things around to create impala.py and go from there.


cpcloud commented Jun 5, 2017

@napjon it is in extras but anyone using the Impala client will run the code that imports turbodbc the way you currently have it. This needs to be isolated to people who've installed turbodbc.
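One way to achieve that isolation is to defer the import until connect() is actually called. A hedged sketch, with a hypothetical signature rather than the PR's actual code:

```python
def connect(connection_string=None, dsn=None, turbodbc_options=None):
    """Connect to Impala over ODBC (hypothetical signature).

    The key point: `import turbodbc` happens inside the function, so
    merely importing the client module never touches the optional
    dependency; only users who call connect() need turbodbc installed.
    """
    try:
        import turbodbc  # deferred import of the optional dependency
    except ImportError as exc:
        raise ImportError(
            "ODBC support requires turbodbc; install the 'odbc' pip extra"
        ) from exc
    return turbodbc.connect(connection_string=connection_string,
                            dsn=dsn,
                            turbodbc_options=turbodbc_options)
```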


napjon commented Jun 5, 2017

@cpcloud: ah yes, I forgot turbodbc is imported at the top of the module. Got it; it makes sense to separate the module.

@cpcloud cpcloud self-assigned this Jun 10, 2017
@cpcloud cpcloud added the feature Features or general enhancements label Jun 10, 2017
@cpcloud cpcloud added this to the 0.12 milestone Jun 10, 2017

napjon commented Jun 12, 2017

Changes:

  • ODBC is now a separate package
  • turbodbc now fetches using fetchallarrow instead, which requires turbodbc>=2.0.0

Caveat: turbodbc's Arrow support does not exist on Windows. If Arrow is unavailable or the OS is Windows, we can catch the exception and use fetchnumpy instead.
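That fallback could look roughly like the sketch below. The helper name is hypothetical; it assumes a turbodbc >= 2.0 cursor, whose fetchallarrow() yields a pyarrow.Table and whose fetchnumpy() yields a mapping of column names to arrays.

```python
import pandas as pd

def fetch_dataframe(cursor):
    """Fetch a result set as a DataFrame, preferring Arrow.

    Hypothetical helper: fetchallarrow() (turbodbc >= 2.0, not built on
    Windows) is tried first; when it's missing, fall back to fetchnumpy().
    """
    try:
        return cursor.fetchallarrow().to_pandas()
    except AttributeError:
        return pd.DataFrame(cursor.fetchnumpy())
```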


cpcloud commented Jun 12, 2017

@napjon So it looks like your builds are failing because you're missing the OS ODBC package and pybind11.

So, you need to make the following changes.

  1. In circle.yml, replace
sudo apt-get -qq install clang libboost-dev graphviz

with

sudo apt-get -qq install clang libboost-dev graphviz unixodbc-dev unixodbc-bin unixodbc
  2. Add pybind11 to the front of the turbodbc_requires dependency list in setup.py.


cpcloud commented Jun 12, 2017

@napjon It's fine that we're missing turbodbc for windows for now. We don't test impala support on windows ATM.


napjon commented Jun 18, 2017

It seems not everything has to end up as a pandas DataFrame. I should have kept the columnar path separate, as in the original.


napjon commented Jun 19, 2017

@cpcloud: What kind of tests do we want to have here? I think it's mostly a connection test, or do you want other types of queries as well?


kszucs commented Oct 11, 2017

What's the state of this? I could use odbc with clickhouse too.


napjon commented Oct 11, 2017

I'm waiting for a docker update to include the Impala ODBC connector. @cpcloud?


cpcloud commented Oct 11, 2017

@napjon @kszucs I'm not sure when I'll be able to get to this.

@napjon I'm not sure the ODBC driver needs a dockerfile modification. Shouldn't it live on the client and not the server?

@@ -40,13 +40,15 @@
kerberos_requires = ['requests-kerberos']
visualization_requires = ['graphviz']
pandas_requires = ['multipledispatch']
turbodbc_requires = ['pybind11', 'pyarrow', 'turbodbc>=2.0.0']
@kszucs kszucs Oct 11, 2017


Is pybind11 necessary?

Author:

It was when I tested it back then.

Comment:

As long as turbodbc does not ship wheels on all platforms, you might need it with some setuptools versions (older ones). Note that there is the extra-requirement [arrow] so that turbodbc[arrow] will already pull in pyarrow.
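Under that suggestion, the dependency list might be trimmed to lean on turbodbc's own extra. This is an assumption about a possible revision, not what the PR actually did:

```python
# Possible revision: turbodbc's 'arrow' extra already pulls in pyarrow,
# so listing pyarrow explicitly becomes redundant.
turbodbc_requires = ['pybind11', 'turbodbc[arrow]>=2.0.0']
```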

Author:

Okay, got it.


kszucs commented Oct 11, 2017

@cpcloud Agree, odbc drivers live on the client side.


kszucs commented Oct 11, 2017

The simplest approach would be to create conda-forge packages for the Impala and/or ClickHouse ODBC drivers.


napjon commented Oct 12, 2017

@kszucs and @cpcloud: forgive my ignorance. I thought the docker image acted as both client and server.

So the next steps should be:

  1. Set up the impala docker image.
  2. Configure CircleCI to install the Impala ODBC driver.
  3. Figure out the ODBC configuration for the dockerized Impala server.
  4. Use turbodbc.

Should we use odbc.ini or embed the configuration in the test script?
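For illustration, the two options might look like this. The driver name, host, and port below are assumptions based on the docker setup discussed in this thread, not verified values:

```python
# Option (a): embed a full connection string in the test script.
connection_string = (
    'Driver=Cloudera ODBC Driver for Impala;'
    'Host=localhost;'
    'Port=21050;'
)

# Option (b): define a DSN once in odbc.ini and reference it by name:
#     [impala-test]
#     Driver = Cloudera ODBC Driver for Impala
#     Host   = localhost
#     Port   = 21050
# then in the test script: turbodbc.connect(dsn='impala-test')
```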

@kszucs: I'm not familiar with creating conda-forge packages. Could you create them?


kszucs commented Oct 12, 2017

@napjon @cpcloud

It might be better to ask someone from Cloudera or one of the impyla maintainers to create an Impala ODBC package. I've also created an issue at turbodbc; they are probably more experienced with installing drivers.


napjon commented Oct 16, 2017

@cpcloud: So I've configured turbodbc to connect to docker-impala (codingtony/impala). However, there's no data to test simple queries against.

Should I use "cpcloud86/impala:java8-1" instead?


cpcloud commented Oct 16, 2017

@napjon Yep, you can use that, however it won't have data. But, you can use the test_data_admin.py script to load data into impala. Look at https://github.com/ibis-project/ibis/blob/master/.circleci/config.yml#L63 to see how we use it in CircleCI.


napjon commented Oct 16, 2017

Got it, thanks for the pointer.


napjon commented Oct 30, 2017

Hi @cpcloud,

I've read .circleci/config.yml and used the docker image cpcloud86/impala:java8-1.
I used the following steps:

  • ci/datamgr.py download --directory /tmp/workspace

  • export DATA_DIR=/tmp/workspace/ibis-testing-data

  • export the Ibis environment variables

export IBIS_TEST_IMPALA_HOST=localhost
export IBIS_TEST_IMPALA_PORT=21050
export IBIS_TEST_NN_HOST=localhost
export IBIS_TEST_WEBHDFS_PORT=50070
export IBIS_TEST_WEBHDFS_USER=ubuntu
  • export the database environment variables
export IBIS_TEST_CRUNCHBASE_DB=/tmp/workspace/crunchbase.db
export IBIS_TEST_SQLITE_DB_PATH=/tmp/workspace/ibis_testing.db
export FUNCTIONAL_ALLTYPES_CSV="$DATA_DIR/functional_alltypes.csv"
export DIAMONDS_CSV="$DATA_DIR/diamonds.csv"
export BATTING_CSV="$DATA_DIR/batting.csv"
export AWARDS_PLAYERS_CSV="$DATA_DIR/awards_players.csv"
  • Start the docker
docker run --rm  -ti  -p 9000:9000 -p 50010:50010 -p 50020:50020 -p 50070:50070 -p 50075:50075 -p 21000:21000 -p 21050:21050 -p 25000:25000 -p 25010:25010 -p 25020:25020 cpcloud86/impala:java8-1 /start-bash.sh

The docker container ran until I saw the output:

...
Redirecting stdout to /tmp/impalad.INFO
Redirecting stderr to /tmp/impalad.ERROR

There were some errors about Postgres, but I think they're not relevant for impala. Finally, I executed:

 test_data_admin.py load --data --data-dir "$DATA_DIR"

This gave me the following error:

impala.error.HiveServer2Error: AnalysisException: This Impala daemon is not ready to accept user requests. Status: Waiting for catalog update from the StateStore.

I've seen a suggested fix of upgrading Impala, which I don't believe is the problem. Could you give me a pointer on where I should look?
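One hypothetical workaround, assuming the daemon merely needs more startup time, is to retry the data-load step until the StateStore has published the catalog. This is a guess, not a confirmed fix; the helper below is illustrative only:

```python
import time

def retry(load, attempts=30, delay=10.0):
    """Call `load` until it succeeds, re-raising on the final failure.

    Hypothetical workaround: the 'Waiting for catalog update from the
    StateStore' error often just means impalad hasn't finished starting,
    so polling the load step for a while may be enough.
    """
    for attempt in range(attempts):
        try:
            return load()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)
```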

@cpcloud cpcloud modified the milestones: 0.13, 0.14 Feb 5, 2018
@cpcloud cpcloud modified the milestones: 0.14, Future Aug 1, 2018

xmnlab commented Dec 31, 2018

hey @napjon, what is the status of this PR?

There are some files with conflicts. Could you rebase your code?


napjon commented Jan 24, 2019

@xmnlab: sorry for the late reply. I currently don't have a machine to do this on anymore, and may not for a while until I get the correct setup.

If I recall correctly, the issue was configuring the unit tests to download the ODBC driver programmatically without it asking for a license.

@Posnet Posnet mentioned this pull request Apr 26, 2019
@datapythonista

Closing this as stale. @napjon, let me know if at some point you can work on this again. Thanks!

Successfully merging this pull request may close these issues.

Using pyodbc to connect to Impala for kerberos problem
6 participants