PyCQuery

PyCQuery is a collection of Python DB-API and SQLAlchemy interfaces for Hive 3.x. Other than PyHive, it supports features below.

Dynamic service discovery through Zookeeper
Kerberos authentication with pure-python Kerberos library, minikerberos

Requirements

Python 3.6 or above
HiveServer2 instance for Hive 3.x

Install

pip3 install PyCQuery

Usage

DB-API

from pycquery import hive
cursor = hive.connect('localhost').cursor()
cursor.execute('SELECT * FROM my_data LIMIT 10')
print(cursor.fetchone())
print(cursor.fetchall())
cursor.close()

DB-API (asynchronous)

from pycquery import hive
from TCLIService.ttypes import TOperationState

cursor = hive.connect('localhost').cursor()
cursor.execute('SELECT * FROM my_data LIMIT 10', async_=True)

status = cursor.poll().operationState
while status in (TOperationState.INITIALIZED_STATE, TOperationState.RUNNING_STATE):
    logs = cursor.fetch_logs()
    for message in logs:
        print(message)

    # If needed, an asynchronous query can be cancelled at any time with:
    # cursor.cancel()

    status = cursor.poll().operationState

print(cursor.fetchall())
cursor.close()

SQLAlchemy

First install this package to register it with SQLAlchemy (see setup.py).

from sqlalchemy import *
from sqlalchemy.engine import create_engine
from sqlalchemy.schema import *

engine = create_engine('hive://localhost:10000/default')
logs = Table('my_data', MetaData(bind=engine), autoload=True)
print(select([func.count('*')], from_obj=logs).scalar())

Note: query generation functionality is not fully tested, but there should be no problem with raw SQL.

Passing session configuration

# DB-API
hive.connect('localhost', configuration={'hive.exec.reducers.max': '123'})
# SQLAlchemy
create_engine(
    'hive://user@host:10000/database',
    connect_args={'configuration': {'hive.exec.reducers.max': '123'}},
)
# SQLAlchemy with LDAP
create_engine(
    'hive://user:password@host:10000/database',
    connect_args={'auth': 'LDAP'},
)

Kerberos Authentication with HTTP Thrift

For Single HiveServer2 Instance

from pycquery import hive
from pycquery_krb.common import conf

params = {
    'host': 'localhost',
    'port': '10000',
    'http_path': 'cliservice',
    'service_mode': 'http',
    'kerberos_service_name': 'hive/localhost',
    'auth': 'KERBEROS',
    'username': 'user1@EXAMPLE.COM',
    'keytab_file': 'keytab-file-for-username',
    'krb_conf': conf.KerberosConf.from_osenv() # load krb5.conf from the path set by KRB5_CONFIG or /etc/krb5.conf
     # 'krb_conf': conf.KerberosConf.from_file('other-krb5-conf-file') # load krb5.conf from your path
}

if __name__ == "__main__":
    conn = hive.connect(**params)

    try:
        cursor = conn.cursor()
        try:
            cursor.execute('SELECT * FROM my_data LIMIT 10')
            print(cursor.fetchall())
        finally:
            cursor.close()
    finally:
        conn.close()

For Multiple HiveServer2 Instances with Zookeeper Discovery

from pycquery import hive
from pycquery_krb.common import conf

params = {
    'host': 'zk1.host, zk2.host, zk3.host', # list of zookeeper server hosts
    'port': '2181', # zookeeper server port
    'is_zookeeper': True,
    'http_path': 'cliservice',
    'service_mode': 'http',
    'kerberos_service_name': 'hive/localhost',
    'auth': 'KERBEROS',
    'username': 'user1@EXAMPLE.COM',
    'keytab_file': 'keytab-file-for-username',
    'krb_conf': conf.KerberosConf.from_osenv() # load krb5.conf from the path set by KRB5_CONFIG or /etc/krb5.conf
     # 'krb_conf': conf.KerberosConf.from_file('other-krb5-conf-file') # load krb5.conf from your path 
}

if __name__ == "__main__":
    conn = hive.connect(**params)

    try:
        cursor = conn.cursor()
        try:
            cursor.execute('SELECT * FROM my_data LIMIT 10')
            print(cursor.fetchall())
        finally:
            cursor.close()
    finally:
        conn.close()

Kerberos authentication with Binary Thrift

It requires sasl and thrift-sasl. The pure-python Kerberos library is not supported yet.

Updating TCLIService

The TCLIService module is autogenerated using a TCLIService.thrift file. To update it, the generate.py file can be used: python generate.py <TCLIServiceURL>. When left blank, the version for Hive 3.0 will be downloaded.

License

Copyright 2021-present NAVER Corp.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

Name		Name	Last commit message	Last commit date
Latest commit History 24 Commits
TCLIService		TCLIService
pycquery		pycquery
pycquery_krb		pycquery_krb
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
NOTICE		NOTICE
README.md		README.md
dev_requirements.txt		dev_requirements.txt
requirements.txt		requirements.txt
setup.cfg		setup.cfg
setup.py		setup.py

License

naver/PyCQuery

Folders and files

Latest commit

History

Repository files navigation

PyCQuery

Requirements

Install

Usage

DB-API

DB-API (asynchronous)

SQLAlchemy

Passing session configuration

Kerberos Authentication with HTTP Thrift

Kerberos authentication with Binary Thrift

Updating TCLIService

License

About

Resources

License

Stars

Watchers

Forks

Languages