# Interacting with Impala using Python

Connecting to Impala from Python is straightforward with the [impyla](https://github.com/cloudera/impyla) or [ibis](https://github.com/cloudera/ibis) module. Because Ibis is still under development, this tutorial covers only impyla.

## Install impyla

In [None]:
pip install impyla

If pip is not installed on your system, follow the [pip installation instructions](http://pip.readthedocs.org/en/stable/installing/) or install pip using Anaconda:

In [None]:
conda install pip

## Creating a Connection

To connect to Impala, first create a connection object, specifying your Impala hostname and port. Port 21050 is Impala's default for the HiveServer2 protocol that impyla uses.

In [2]:
from impala.dbapi import connect

# create a connection; replace 'glados19' with your Impala host name
conn = connect(host='glados19', port=21050)

Once the connection is set up, you can create a cursor object for interacting with the database:

In [3]:
#create a cursor object to interact with the db
cur = conn.cursor()

In [4]:
# view cursor object
print(cur)

<impala.hiveserver2.HiveServer2Cursor object at 0x00000000041A2A20>


## Run Queries

Python interacts with Impala by sending SQL: cur.execute() runs the query, and cur.fetchall() retrieves the results.

### Print results to screen

In [None]:
# execute sql query
cur.execute('SQL query')
# grab results; fetchall() consumes the result set, so call it only once
results = cur.fetchall()
# print results
for row in results:
    print(row)
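A result set can only be read once: after cur.fetchall() has returned the rows, a second call has nothing left to return. The sketch below shows this behavior offline. It uses Python's built-in sqlite3 module purely as a stand-in for an Impala connection (both follow the Python DB API), and the table, the rows and the `demo_` names are made up for illustration.

```python
import sqlite3

# sqlite3 is only an offline stand-in for an Impala connection here
demo_conn = sqlite3.connect(':memory:')
demo_cur = demo_conn.cursor()

# build a tiny, made-up table to query
demo_cur.execute('CREATE TABLE demo (id INTEGER, name TEXT)')
demo_cur.executemany('INSERT INTO demo VALUES (?, ?)', [(1, 'a'), (2, 'b')])

demo_cur.execute('SELECT id, name FROM demo ORDER BY id')
first = demo_cur.fetchall()   # consumes every row of the result set
second = demo_cur.fetchall()  # nothing left to fetch, so this is empty

print(len(first))
print(len(second))
demo_conn.close()
```

If the rows are needed more than once, keep the list returned by the first fetchall() call and loop over that list.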

### Save results as a pandas DataFrame

In [None]:
# import pandas impala api
from impala.util import as_pandas 
# execute sql query
cur.execute('SQL query')
# grab results as dataframe
results = as_pandas(cur)
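To label the columns of a result, the column names can be read from the cursor itself: the Python DB API exposes them in cur.description after a query has run. The pandas-free sketch below illustrates that idea only and is not a description of how as_pandas is implemented. `rows_as_dicts` is a hypothetical helper, not part of impyla, and sqlite3 with a made-up table stands in for Impala so that the cell runs offline.

```python
import sqlite3


def rows_as_dicts(cursor):
    """Pair every fetched row with the column names in cursor.description."""
    names = [col[0] for col in cursor.description]
    return [dict(zip(names, row)) for row in cursor.fetchall()]


# sqlite3 is only an offline stand-in; the table and row are made up
demo_conn = sqlite3.connect(':memory:')
demo_cur = demo_conn.cursor()
demo_cur.execute('CREATE TABLE variants (chrom TEXT, pos INTEGER)')
demo_cur.execute("INSERT INTO variants VALUES ('1', 12345)")

demo_cur.execute('SELECT chrom, pos FROM variants')
records = rows_as_dicts(demo_cur)
print(records)
demo_conn.close()
```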

## View available databases and tables

To see which databases are available in Impala, let's run a simple SQL statement, 'SHOW DATABASES':

In [None]:
#view available databases
cur.execute('SHOW DATABASES')

#fetch results of cur.execute()
for row in cur.fetchall():
    print(row)

Let's take a look at the public resources available for the GRCh37 build. First select that database (p7_ref_grch37) with the SQL statement 'USE p7_ref_grch37', then list all the tables in it:

In [None]:
#select a particular database to use
cur.execute('USE p7_ref_grch37')

#view tables in selected database 
#if no db is selected, you will see tables in default db
cur.execute('SHOW TABLES')

# view results
for row in cur.fetchall():
    print(row)

### Viewing table information

To match up fields from different tables, it helps to know more about what each table contains. The SQL "DESCRIBE" statement lists each column's name and data type, along with a description of its contents:

In [None]:
cur.execute('DESCRIBE p7_ref_grch37.cytoband')
for row in cur.fetchall():
    print(row)
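When matching up fields, a quick first check is which column names two tables have in common. The helpers below are a minimal sketch of that check. `column_names` and `shared_columns` are hypothetical helpers, not part of impyla, and they assume that the first item of each DESCRIBE row is the column name.

```python
def column_names(cur, table):
    """Return the column names that DESCRIBE reports for `table`.

    Assumes the first item of each DESCRIBE row is the column name.
    The table name is pasted into the SQL text, so only pass trusted names.
    """
    cur.execute('DESCRIBE {0}'.format(table))
    return [row[0] for row in cur.fetchall()]


def shared_columns(cur, table_a, table_b):
    """Return the column names found in both tables, in table_a's order."""
    names_b = set(column_names(cur, table_b))
    return [name for name in column_names(cur, table_a) if name in names_b]
```

With an open cursor, a call such as shared_columns(cur, 'p7_ref_grch37.clinvar', 'p7_ref_grch37.cytoband') would list the candidate join columns. A shared name is only a hint, so check the data types and contents before joining on it.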

## Load a table as a pandas DataFrame

In [7]:
# import pandas impala api
from impala.util import as_pandas 
# execute sql query
cur.execute('SELECT * from p7_ref_grch37.clinvar LIMIT 5')
# grab results as dataframe
clinvar = as_pandas(cur)

In [8]:
print(clinvar)

  chrom     pos        rs_id ref alt  qual filter     rs_pos    rv    vp  \
0     1  883516  rs267598747   G   A  None   None  267598747  None  None   
1     1  891344  rs267598748   G   A  None   None  267598748  None  None   
2     1  906168  rs267598759   G   A  None   None  267598759  None  None   
3     1  949696  rs672601345   C  CG  None   None  672601345  None  None   
4     1  949739  rs672601312   G   T  None   None  672601312  None  None   

        ...       clin_allele              clin_src  clin_origin  clin_src_id  \
0       ...                 1                     .            2            .   
1       ...                 1                     .            2            .   
2       ...                 1                     .            2            .   
3       ...                 1  OMIM_Allelic_Variant            1  147571.0002   
4       ...                 1  OMIM_Allelic_Variant            1  147571.0001   

   clin_sig         clin_dsdb      clin_dsdb_id         

## View explain plan

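The cell below prints None, but only because cur.execute() does not return query results; the return value says nothing about the plan itself. As with any other statement, any rows the server sends back for an 'EXPLAIN' statement have to be read from the cursor, for example with cur.fetchall(). Explain plans are also available through Ibis and will be covered in that tutorial.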

In [12]:
# execute() returns None; it does not hand back the plan
result = cur.execute('EXPLAIN SELECT * from p7_ref_grch37.clinvar LIMIT 5')
print(result)

None
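Because cur.execute() returns None for every statement, the None above does not show whether a plan was produced. The sketch below reads the plan from the cursor instead. `explain_plan` is a hypothetical helper, not part of impyla, and it assumes that the server returns the plan as rows whose first item is one line of text. That assumption has not been checked against this cluster.

```python
def explain_plan(cur, query):
    """Return the EXPLAIN output for `query` as one multi-line string.

    Assumes the plan comes back as rows whose first item is a line of text.
    """
    cur.execute('EXPLAIN ' + query)  # the return value is deliberately ignored
    return '\n'.join(str(row[0]) for row in cur.fetchall())
```

With an open cursor, print(explain_plan(cur, 'SELECT * from p7_ref_grch37.clinvar LIMIT 5')) would display whatever plan text the server sends back.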


## Close connection to Impala

Once you have finished running queries, it's important to close the connection object:

In [None]:
conn.close()
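If a query raises an error before conn.close() is reached, the connection stays open. One way to guard against that is contextlib.closing from the standard library, which calls close() on any object when the with block ends, whether or not an error occurred. The sketch below uses sqlite3 as an offline stand-in for impala.dbapi.connect(...), and the `demo_` names are made up for illustration.

```python
import sqlite3
from contextlib import closing

# sqlite3 is only an offline stand-in for an Impala connection here
with closing(sqlite3.connect(':memory:')) as demo_conn:
    demo_cur = demo_conn.cursor()
    demo_cur.execute('SELECT 1')
    value = demo_cur.fetchone()[0]

# the with block has ended, so close() has already been called;
# sqlite3 refuses to create a cursor on a closed connection
try:
    demo_conn.cursor()
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False

print(still_open)
```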