
WIP: Add Support for ODBC Connection #999

Closed · wants to merge 12 commits

Conversation


@napjon napjon commented May 30, 2017

Closes #985

Some caveats:

  • I’m not sure the ODBC connection belongs in the impala module, since any database with a supporting ODBC driver works. I should probably create an ODBCConnection class and derive a subclass for each client.
  • I chose turbodbc over pyodbc for two reasons:
    • I’m unable to create a weakref to a pyodbc connection object.
    • turbodbc supports fetchnumpy, which I believe integrates nicely with pandas.
  • There is no ping in the turbodbc cursor.
  • I had to modify _column_batches_to_dataframe to make room for turbodbc.
  • It would be great if someone could point me to a test connection.

cc @mariusvniekerk

  • nthreads option

  • unit tests
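The fetchnumpy-to-pandas path mentioned in the caveats above can be sketched as follows. This is a hedged illustration, not the PR's code: the function name is hypothetical, and fetchnumpy's actual return value (an OrderedDict of NumPy masked arrays, per the turbodbc docs) is simulated here with a plain dict.

```python
import numpy as np
import pandas as pd

def numpy_batch_to_dataframe(columns):
    """Convert a fetchnumpy()-style mapping of column names to arrays
    into a DataFrame. The function name is a hypothetical stand-in."""
    return pd.DataFrame({name: np.asarray(col) for name, col in columns.items()})

# Simulated fetchnumpy() output (a live cursor would return masked arrays):
batch = {'i': np.array([1, 2, 3], dtype='int8'),
         's': np.array(['a', 'b', 'c'])}
df = numpy_batch_to_dataframe(batch)
```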


napjon commented May 30, 2017

The CircleCI test result seems odd. It says:

AttributeError: module 'ibis.impala' has no attribute 'connect'

I didn't touch anything inside the connect function.


cpcloud commented May 30, 2017

I will try to review this today. I'll also investigate why the build isn't passing.


cpcloud commented May 30, 2017

The error you're seeing is usually related to imports. There might be a circular import or a missing dependency, though master is passing, so it would be strange if there were a missing dependency.


cpcloud commented May 30, 2017

@napjon, it looks like you didn't write your patch on top of master. Can you rebase on top of master and push again?


napjon commented May 30, 2017

I hope I did the rebase correctly. Let me know if I'm still doing the same thing.

@cpcloud cpcloud self-requested a review May 30, 2017 13:41
argument is None, then certificate validation is skipped.
user : string, LDAP user to authenticate
password : string, LDAP password to authenticate
auth_mechanism : string, {'NOSASL' <- default, 'PLAIN', 'GSSAPI', 'LDAP'}.
Member:

@napjon These docstrings are old. You need to remove these changes but keep your odbc_connect and other changes. git add -p <file> is very good for this, as it allows you to stage and commit the things you want and then discard (by using git checkout <file>) the things you don't want.
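The stage-then-discard workflow described above can be demonstrated in a throwaway repo. The file names below are illustrative only, not files from this PR:

```shell
# Scratch demo of the suggested workflow: stage what you want, then discard
# the rest with `git checkout -- <file>`.
dir=$(mktemp -d)
cd "$dir"
git init -q
printf 'wanted change\n' > client.py
git add client.py
git -c user.name=demo -c user.email=demo@example.com commit -qm 'keep this'
printf 'unwanted docstring churn\n' >> client.py
# Interactively you would run `git add -p client.py` to stage only the
# hunks you want; here we just discard the unstaged edit wholesale:
git checkout -- client.py
cat client.py    # back to the committed content
```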


napjon commented May 30, 2017

@cpcloud: OK, I think it's because I first forked about 6 months ago, back when the repo was under cloudera. I've tried to resync my master branch, but git still says nothing changed. I'll solve this.

@napjon napjon force-pushed the odbc-connection branch 2 times, most recently from e2d9728 to 2793cbf Compare June 1, 2017 06:12

napjon commented Jun 1, 2017

It seems I've corrected the changes now.

@cpcloud: could you please review it again?


cpcloud commented Jun 1, 2017

@napjon Reviewing now.

for name, chunks in czip(names, czip(*[b.columns for b in batches])):
cols[name] = _chunks_to_pandas_array(chunks)
return pd.DataFrame(cols, columns=names)
try: #Give Space for turbodbc type
Member:

It looks like you have quite a few lint errors: https://circleci.com/gh/pandas-dev/ibis/128?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link

I've found it helpful to integrate flake8 with my editor so that every time I save, flake8 will run and report any errors in the editor. Then, I can fix them up right away. Most common editors (vim, emacs, and pycharm) support this kind of integration.


params = {'con_string': connection_string,
'dsn':dsn,
'turbodbc_options': turbodbc_options}
Member:
The indentation here is inconsistent.

ImpalaClient
"""

params = {'con_string': connection_string,
Member:

Why not just pass these directly instead of putting them in a dict and then **-ing them in.

Author:

Okay, though I was just following the code style of connect.
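For illustration, the two styles under discussion might look like this. These are hypothetical stand-ins with made-up names, not the PR's actual code:

```python
def _make_client(connection_string=None, dsn=None, turbodbc_options=None):
    # Stand-in for the real client constructor.
    return {'connection_string': connection_string,
            'dsn': dsn,
            'turbodbc_options': turbodbc_options}

# Style used in the PR: collect arguments into a dict, then **-unpack it.
def odbc_connect_dict(connection_string=None, dsn=None, turbodbc_options=None):
    params = {'connection_string': connection_string,
              'dsn': dsn,
              'turbodbc_options': turbodbc_options}
    return _make_client(**params)

# Reviewer's suggestion: pass the arguments through directly.
def odbc_connect_direct(connection_string=None, dsn=None, turbodbc_options=None):
    return _make_client(connection_string=connection_string,
                        dsn=dsn,
                        turbodbc_options=turbodbc_options)
```

Both produce the same call; the direct form just avoids the intermediate dict.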

@@ -26,6 +26,8 @@
import numpy as np
import pandas as pd

import turbodbc
Member:

This will need a new entry in setup.py. You can add a new pip extra called 'odbc', like this:

odbc_extras = ['turbodbc']

setup(
    ...
    extras_require={
        ...
        'odbc': odbc_extras,
        ...
    },
)


cpcloud commented Jun 2, 2017

@napjon A few changes are needed before we can merge this. If you run flake8 locally you should be able to see what lint errors you need to fix before any "real" tests will run.


napjon commented Jun 3, 2017

@cpcloud: I do believe the changes still need some tidying up, so I need some guidance about keeping ODBC only in the impala module. But I guess it's okay to merge for now. Let me know if there are any corrections that need to be made.

I also encountered an error when using odbc_connect, but I don't know whether it's specific to odbc_connect or whether it also happens with the usual connect method.

Suppose I have a table with this schema:

t = ibis.table(ibis.Schema(['i'], ['int8']), name='held')

I get an error when executing this statement:

t.i.mean().execute()

It raises a KeyError: 'mean' not in dictionary. But simple filtering and aggregation work.

Our cluster is going to be restarted this weekend, so I can't reproduce it and show you the exact error.

Perhaps I should install the ibis docker image and investigate from there.

@cpcloud cpcloud left a comment


A few comments on formatting.

@@ -145,7 +145,7 @@ def _get_cursor(self):
raise com.InternalError('Too many concurrent / hung queries')
else:
if (cursor.database != self.database or
cursor.options != self.options):
cursor.options != self.options):
Member:
Leave this formatting as is.

])
column for column in expr.columns
if column not in partition_schema_names
])
Member:
Leave this formatting as is.


cpcloud commented Jun 4, 2017

@napjon I think we need to separate things a bit here so that we don't require users to install turbodbc.

What do you think about the following directory structure and API?

  1. An ibis/odbc directory that contains impala.py, postgres.py etc for the different odbc backends. You'd only have to implement the connect function in impala.py (and any supporting code).
  2. Users interact with odbc by doing:
import ibis
con = ibis.odbc.impala.connect(...)


napjon commented Jun 5, 2017

@cpcloud: Isn't turbodbc already in extras?

That sounds reasonable. I'll move things around to create impala.py and go from there.


cpcloud commented Jun 5, 2017

@napjon it is in extras but anyone using the Impala client will run the code that imports turbodbc the way you currently have it. This needs to be isolated to people who've installed turbodbc.
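One way to achieve that isolation is to defer the import until connect() is actually called. A hedged sketch, with a hypothetical signature rather than the PR's actual code:

```python
def connect(connection_string=None, dsn=None, turbodbc_options=None):
    """Connect to Impala over ODBC (hypothetical signature).

    The key point: `import turbodbc` happens inside the function, so
    merely importing the client module never touches the optional
    dependency; only users who call connect() need turbodbc installed.
    """
    try:
        import turbodbc  # deferred import of the optional dependency
    except ImportError as exc:
        raise ImportError(
            "ODBC support requires turbodbc; install the 'odbc' pip extra"
        ) from exc
    return turbodbc.connect(connection_string=connection_string,
                            dsn=dsn,
                            turbodbc_options=turbodbc_options)
```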


napjon commented Jun 5, 2017

@cpcloud: ah yes, I forgot turbodbc is imported at the top of the module. Got it; it makes sense to separate the module.

@cpcloud cpcloud self-assigned this Jun 10, 2017
@cpcloud cpcloud added the feature Features or general enhancements label Jun 10, 2017
@cpcloud cpcloud added this to the 0.12 milestone Jun 10, 2017

napjon commented Jun 12, 2017

Changes:

  • ODBC is now a separate package
  • turbodbc now fetches using fetchallarrow instead, which requires turbodbc>=2.0.0

Caveat: turbodbc's Arrow support does not exist on Windows. If Arrow is unavailable or the OS is Windows, we can catch the exception and use fetchnumpy instead.
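That fallback could look roughly like the sketch below. The helper name is hypothetical; it assumes a turbodbc >= 2.0 cursor, whose fetchallarrow() yields a pyarrow.Table and whose fetchnumpy() yields a mapping of column names to arrays.

```python
import pandas as pd

def fetch_dataframe(cursor):
    """Fetch a result set as a DataFrame, preferring Arrow.

    Hypothetical helper: fetchallarrow() (turbodbc >= 2.0, not built on
    Windows) is tried first; when it's missing, fall back to fetchnumpy().
    """
    try:
        return cursor.fetchallarrow().to_pandas()
    except AttributeError:
        return pd.DataFrame(cursor.fetchnumpy())
```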


cpcloud commented Jun 12, 2017

@napjon So it looks like your builds are failing because you're missing the OS ODBC package and pybind11.

So, you need to make the following changes.

  1. In circle.yml, replace
sudo apt-get -qq install clang libboost-dev graphviz

with

sudo apt-get -qq install clang libboost-dev graphviz unixodbc-dev unixodbc-bin unixodbc
  2. Add pybind11 to the front of the turbodbc_requires dependency list in setup.py.


cpcloud commented Jun 12, 2017

@napjon It's fine that we're missing turbodbc for windows for now. We don't test impala support on windows ATM.


napjon commented Jun 18, 2017

It seems not everything has to end up as a pandas DataFrame. I should have kept the columnar path separate, as in the original.


napjon commented Jun 19, 2017

@cpcloud: What kind of tests do we want to have here? I think it's mostly a connection test, or do you want other types of queries as well?


kszucs commented Oct 11, 2017

What's the state of this? I could use odbc with clickhouse too.


napjon commented Oct 11, 2017

I'm waiting for a docker update to include the Impala ODBC connector. @cpcloud?


cpcloud commented Oct 11, 2017

@napjon @kszucs I'm not sure when I'll be able to get to this.

@napjon I'm not sure the ODBC driver needs a dockerfile modification. Shouldn't it live on the client and not the server?

@@ -40,13 +40,15 @@
kerberos_requires = ['requests-kerberos']
visualization_requires = ['graphviz']
pandas_requires = ['multipledispatch']
turbodbc_requires = ['pybind11', 'pyarrow', 'turbodbc>=2.0.0']
@kszucs kszucs Oct 11, 2017


Is pybind11 necessary?

Author:

It was when I tested it back then.

Comment:

As long as turbodbc does not ship wheels on all platforms, you might need it with some setuptools versions (older ones). Note that there is the extra-requirement [arrow] so that turbodbc[arrow] will already pull in pyarrow.
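Under that suggestion, the dependency list might be trimmed to lean on turbodbc's own extra. This is an assumption about a possible revision, not what the PR actually did:

```python
# Possible revision: turbodbc's 'arrow' extra already pulls in pyarrow,
# so listing pyarrow explicitly becomes redundant.
turbodbc_requires = ['pybind11', 'turbodbc[arrow]>=2.0.0']
```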

Author:

Okay, got it.


kszucs commented Oct 11, 2017

@cpcloud Agree, odbc drivers live on the client side.


kszucs commented Oct 11, 2017

The simplest approach would be to create conda-forge packages for the Impala and/or ClickHouse ODBC drivers.


napjon commented Oct 12, 2017

@kszucs and @cpcloud: forgive my ignorance. I thought the docker image acted as both client and server.

So the next steps should be:

  1. Set up the impala docker image.
  2. Configure CircleCI to install the Impala ODBC driver.
  3. Figure out the ODBC configuration for the dockerized Impala server.
  4. Use turbodbc.

Should we use odbc.ini or embed the configuration in the test script?
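For illustration, the two options might look like this. The driver name, host, and port below are assumptions based on the docker setup discussed in this thread, not verified values:

```python
# Option (a): embed a full connection string in the test script.
connection_string = (
    'Driver=Cloudera ODBC Driver for Impala;'
    'Host=localhost;'
    'Port=21050;'
)

# Option (b): define a DSN once in odbc.ini and reference it by name:
#     [impala-test]
#     Driver = Cloudera ODBC Driver for Impala
#     Host   = localhost
#     Port   = 21050
# then in the test script: turbodbc.connect(dsn='impala-test')
```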

@kszucs: I'm not familiar with creating conda-forge packages. Could you create them?


kszucs commented Oct 12, 2017

@napjon @cpcloud

It might be better to ask someone from Cloudera or one of the impyla maintainers to create an Impala ODBC package. I've also created an issue at turbodbc; they are probably more experienced with installing drivers.


napjon commented Oct 16, 2017

@cpcloud: So I've configured turbodbc to connect to docker-impala (codingtony/impala). However, there's no data to test simple queries against.

Should I use "cpcloud86/impala:java8-1" instead?


cpcloud commented Oct 16, 2017

@napjon Yep, you can use that, however it won't have data. But, you can use the test_data_admin.py script to load data into impala. Look at https://github.com/ibis-project/ibis/blob/master/.circleci/config.yml#L63 to see how we use it in CircleCI.


napjon commented Oct 16, 2017

Got it, thanks for the pointer.


napjon commented Oct 30, 2017

Hi @cpcloud,

I've read .circleci/config.yml and used the docker image cpcloud86/impala:java8-1.
I used the following steps:

  • ci/datamgr.py download --directory /tmp/workspace

  • export DATA_DIR=/tmp/workspace/ibis-testing-data

  • export the Ibis environment variables

export IBIS_TEST_IMPALA_HOST=localhost
export IBIS_TEST_IMPALA_PORT=21050
export IBIS_TEST_NN_HOST=localhost
export IBIS_TEST_WEBHDFS_PORT=50070
export IBIS_TEST_WEBHDFS_USER=ubuntu
  • export the database environment variables
export IBIS_TEST_CRUNCHBASE_DB=/tmp/workspace/crunchbase.db
export IBIS_TEST_SQLITE_DB_PATH=/tmp/workspace/ibis_testing.db
export FUNCTIONAL_ALLTYPES_CSV="$DATA_DIR/functional_alltypes.csv"
export DIAMONDS_CSV="$DATA_DIR/diamonds.csv"
export BATTING_CSV="$DATA_DIR/batting.csv"
export AWARDS_PLAYERS_CSV="$DATA_DIR/awards_players.csv"
  • Start the docker
docker run --rm  -ti  -p 9000:9000 -p 50010:50010 -p 50020:50020 -p 50070:50070 -p 50075:50075 -p 21000:21000 -p 21050:21050 -p 25000:25000 -p 25010:25010 -p 25020:25020 cpcloud86/impala:java8-1 /start-bash.sh

The docker container ran until I saw the output:

...
Redirecting stdout to /tmp/impalad.INFO
Redirecting stderr to /tmp/impalad.ERROR

There were some errors about Postgres, but I think they're not relevant for impala. Finally, I executed:

 test_data_admin.py load --data --data-dir "$DATA_DIR"

This gave me the following error:

impala.error.HiveServer2Error: AnalysisException: This Impala daemon is not ready to accept user requests. Status: Waiting for catalog update from the StateStore.

I've seen a suggested fix of upgrading Impala, which I don't believe is the problem. Could you give me a pointer on where I should look?
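One hypothetical workaround, assuming the daemon merely needs more startup time, is to retry the data-load step until the StateStore has published the catalog. This is a guess, not a confirmed fix; the helper below is illustrative only:

```python
import time

def retry(load, attempts=30, delay=10.0):
    """Call `load` until it succeeds, re-raising on the final failure.

    Hypothetical workaround: the 'Waiting for catalog update from the
    StateStore' error often just means impalad hasn't finished starting,
    so polling the load step for a while may be enough.
    """
    for attempt in range(attempts):
        try:
            return load()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)
```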

@cpcloud cpcloud modified the milestones: 0.13, 0.14 Feb 5, 2018
@cpcloud cpcloud modified the milestones: 0.14, Future Aug 1, 2018

xmnlab commented Dec 31, 2018

hey @napjon, what is the status of this PR?

There are some files with conflicts. Could you rebase your code?


napjon commented Jan 24, 2019

@xmnlab: sorry for the late reply. I currently don't have a machine to do this on anymore, and may not for a while until I get the correct setup.

If I recall correctly, the issue was configuring the unit tests to download the ODBC driver programmatically without it asking for a license.

@Posnet Posnet mentioned this pull request Apr 26, 2019
@datapythonista

Closing this as stale. @napjon, let me know if at some point you can work on this again. Thanks!

Successfully merging this pull request may close these issues.

Using pyodbc to connect to Impala for kerberos problem
6 participants