## Python_Pandas_Athena_Examples

This notebook shows how to:

- Read from Athena to a Pandas dataframe.
- Write from Pandas to an Athena table.

The examples cover three basic pandas dtypes: float64, int64 and object. More type mappings can be added as needed. See the Athena supported data types here: https://docs.aws.amazon.com/athena/latest/ug/data-types.html
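
As a rough sketch of how the type mapping could be extended, the lookup below adds a few more pandas dtypes. The name `athena_type_for` and the three extra entries (`int32`, `float32`, `bool`) are assumptions for illustration and are not part of this notebook's helpers. Check each entry against the Athena data types page before relying on it.

```python
import pandas as pd

# The first three entries come from this notebook.
# The last three are assumed mappings to Athena DDL types.
PANDAS_TO_ATHENA = {
    "int64": "bigint",
    "object": "string",
    "float64": "double",
    "int32": "int",
    "float32": "float",
    "bool": "boolean",
}

def athena_type_for(dtype):
    """Return the Athena type name for a pandas dtype, or raise if unmapped."""
    key = str(dtype)
    if key not in PANDAS_TO_ATHENA:
        raise ValueError("Type mapping does not exist for type : " + key)
    return PANDAS_TO_ATHENA[key]

# Small illustrative frame; dtypes are set explicitly so the result
# does not depend on platform or pandas defaults.
example = pd.DataFrame({
    "quantity": pd.Series([11, 74], dtype="int64"),
    "discount": pd.Series([2.82, 118.5], dtype="float64"),
    "ship_mode": pd.Series(["NO-RUSH", "STANDARD"], dtype="object"),
    "shipped": pd.Series([True, False], dtype="bool"),
})
mapped = {name: athena_type_for(dtype) for name, dtype in example.dtypes.items()}
print(mapped)
```

Unmapped dtypes still raise a `ValueError`, so a missing mapping fails early rather than producing an invalid table definition.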

The user experience is kept as simple as possible:

- To read a Pandas Dataframe from Athena:

```
df=read_from_athena(<sql_statement>)
```

- To write a Pandas Dataframe to an Athena table:

```
save_to_athena(<dataframe>, <database_name>, <table_name>)
```

In [1]:
region='us-west-2'
defaultdb="default"
default_output="s3://aws-athena-query-results-959874710265-us-west-2/"
default_write_location="s3://neilawstmp2/my_home/"

## Helper Functions

In [2]:
import time
import boto3
import pandas as pd

## execute Athena SQL and wait until the query reaches a terminal state
def executeQuery(query, database=defaultdb, s3_output=default_output, poll=10):
    athena = boto3.client('athena', region_name=region)
    response = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={'Database': database},
        ResultConfiguration={'OutputLocation': s3_output},
    )

    queryExecutionId = response['QueryExecutionId']
    print('Execution ID: ' + queryExecutionId)

    # poll until Athena reports SUCCEEDED, FAILED or CANCELLED
    while True:
        response = athena.get_query_execution(QueryExecutionId=queryExecutionId)
        status = response['QueryExecution']['Status']
        state = status['State']
        print(state)
        if state not in ('QUEUED', 'RUNNING'):
            break
        time.sleep(poll)

    if state != 'SUCCEEDED':
        # raise instead of returning a response that has no result file behind it
        raise RuntimeError('Query ' + state + ': ' + status.get('StateChangeReason', 'no reason given'))

    return response

## Read from Athena to a Pandas Dataframe
def read_from_athena(sql):
    response=executeQuery(sql)
    return pd.read_csv(response['QueryExecution']['ResultConfiguration']['OutputLocation'])

## Save Pandas Dataframe to Athena table
def save_to_athena(df, database, tablename):
    pandas_to_athena_types_lookup={ "int64":"bigint", "object":"string", "float64":"double"}
    
    ## save the data
    table_location=default_write_location+tablename
    file_location=table_location+'/'+tablename+".pq"
    df.to_parquet(file_location)
    
    ## add the table to Athena
    column_defs = []
    # iterate over (column name, dtype) pairs instead of indexing dtypes by position
    for name, dtype in df.dtypes.items():
        key = str(dtype)
        if key not in pandas_to_athena_types_lookup:
            raise ValueError('Type mapping does not exist for type : '+key)
        column_defs.append(str(name) + " " + pandas_to_athena_types_lookup[key])
    columns = ", ".join(column_defs)
    
    sql = F"CREATE EXTERNAL TABLE {database}.{tablename} \
    ( {columns} )  \
    ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'  \
    STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'  \
    OUTPUTFORMAT  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'  \
    LOCATION '{table_location}'  \
    TBLPROPERTIES ( 'classification'='parquet','typeOfData'='file')"
    
    #print (sql)
    
    response=executeQuery(sql)
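
The polling loop in `executeQuery` waits for as long as the query stays queued or running. A minimal sketch of the same loop with an upper bound on the wait is shown below. It is kept separate from boto3 so it can be tried without AWS access. The names `wait_for_query`, `get_state`, `timeout` and `sleep` are hypothetical and not part of the notebook's helpers.

```python
import time

# States after which Athena will not change the query status again
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}

def wait_for_query(get_state, poll=10, timeout=600, sleep=time.sleep):
    """Call get_state() until it returns a terminal state or the timeout passes."""
    waited = 0
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if waited >= timeout:
            raise TimeoutError("query still " + state + " after " + str(waited) + " seconds")
        sleep(poll)
        waited += poll

# Stand-in for reading
# athena.get_query_execution(...)['QueryExecution']['Status']['State']
fake_states = iter(["QUEUED", "RUNNING", "SUCCEEDED"])
final_state = wait_for_query(lambda: next(fake_states), poll=0, sleep=lambda seconds: None)
print(final_state)
```

In the notebook, `get_state` could be a small function that calls `get_query_execution` for a given execution ID and returns the `State` field.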

## Read data from Athena

In [13]:
sql="""Select *
from cdc.sales_order_fact
LIMIT 10"""

df=read_from_athena(sql)

Execution ID: 012dbc0e-37ad-4c13-a3e7-fac20e0744b7
QUEUED
SUCCEEDED


In [14]:
df

Unnamed: 0,discount,last_modified_timestamp,line_id,line_number,order_date,order_id,product_id,quantity,ship_mode,site_id,supply_cost,tax,unit_price,year,month,day,hour
0,2.822773,2018-04-14T14:21:04,1,1,2018-04-14T01:00:00,2550899500,999,11,NO-RUSH,327,0.562963,0.651442,2,2018,4,14,1
1,118.499121,2018-04-14T14:21:04,2,2,2018-04-14T01:00:00,2550899500,846,74,NO-RUSH,327,29.767831,36.392407,133,2018,4,14,1
2,93.665862,2018-04-14T14:21:04,3,3,2018-04-14T01:00:00,2550899500,984,18,NO-RUSH,327,621.371628,36.580441,149,2018,4,14,1
3,241.543153,2018-04-14T14:21:04,4,4,2018-04-14T01:00:00,2550899500,239,29,NO-RUSH,327,238.603943,161.031018,731,2018,4,14,1
4,117.350523,2018-04-14T21:31:26,1,1,2018-04-14T01:00:00,2550899501,832,90,STANDARD,44,219.627455,201.261957,462,2018,4,14,1
5,152.034471,2018-04-14T21:31:26,2,2,2018-04-14T01:00:00,2550899501,818,67,STANDARD,44,168.970523,216.051699,524,2018,4,14,1
6,193.128071,2018-04-14T19:58:03,1,1,2018-04-14T01:00:00,2550899502,288,20,ONE-DAY,158,217.463232,141.378774,700,2018,4,14,1
7,72.114141,2018-04-14T19:58:03,2,2,2018-04-14T01:00:00,2550899502,523,93,ONE-DAY,158,33.959294,40.44674,152,2018,4,14,1
8,611.641426,2018-04-14T03:45:39,1,1,2018-04-14T01:00:00,2550899503,630,87,ONE-DAY,237,74.448345,71.560414,332,2018,4,14,1
9,47.199852,2018-04-14T18:18:12,1,1,2018-04-14T01:00:00,2550899504,117,10,ONE-DAY,292,367.296527,1803.600129,227,2018,4,14,1
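
Because `read_from_athena` loads the query result with `pd.read_csv`, timestamp-like columns such as `last_modified_timestamp` and `order_date` arrive as plain strings unless pandas is told to parse them. The sketch below uses two rows in the shape of the result above, trimmed to four columns and read from an in-memory string rather than S3, to show the effect of `parse_dates`.

```python
import io
import pandas as pd

# Two rows copied from the result above, trimmed to four columns
csv_text = """discount,last_modified_timestamp,order_date,order_id
2.822773,2018-04-14T14:21:04,2018-04-14T01:00:00,2550899500
118.499121,2018-04-14T14:21:04,2018-04-14T01:00:00,2550899500
"""

# Default read: the timestamp columns are not parsed as datetimes
raw = pd.read_csv(io.StringIO(csv_text))

# With parse_dates the named columns become datetime columns
parsed = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["last_modified_timestamp", "order_date"],
)
print(raw.dtypes)
print(parsed.dtypes)
```

Note that `save_to_athena` as written has no mapping for datetime columns, so a frame parsed this way would need a matching type mapping before it could be written back.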


## Write Data to an Athena table

    


In [15]:
save_to_athena(df, defaultdb, "example_table_new")

Execution ID: cb77e5d5-bd5f-45cd-9277-bcbeb1806142
QUEUED
SUCCEEDED
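
Running the cell above a second time issues the same `CREATE EXTERNAL TABLE` statement for a table that already exists. One possible variation, sketched below, builds the statement with `IF NOT EXISTS` and backtick-quoted column names. The function `build_create_table_sql` and the bucket `s3://example-bucket/` are hypothetical, and the shorter `STORED AS PARQUET` clause is used here in place of the explicit SerDe and format classes in `save_to_athena`.

```python
def build_create_table_sql(database, tablename, column_types, table_location):
    """Build a CREATE EXTERNAL TABLE statement for Parquet files at table_location."""
    # Backticks protect column names that are reserved words or contain
    # special characters.
    columns = ", ".join(
        "`" + name + "` " + athena_type for name, athena_type in column_types.items()
    )
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{tablename} "
        f"({columns}) "
        f"STORED AS PARQUET "
        f"LOCATION '{table_location}'"
    )

ddl = build_create_table_sql(
    "default",
    "example_table_new",
    {"discount": "double", "order_id": "bigint"},
    "s3://example-bucket/my_home/example_table_new/",  # hypothetical location
)
print(ddl)
```

With `IF NOT EXISTS`, an existing table definition is left as it is, so a changed schema would still require dropping the table first.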


## Verify the written data

In [16]:
sql="""Select * from default.example_table_new"""

new_df=read_from_athena(sql)
new_df

Execution ID: 31ea88ae-c91e-4d09-8f76-e8d76adf3af3
QUEUED
SUCCEEDED


Unnamed: 0,discount,last_modified_timestamp,line_id,line_number,order_date,order_id,product_id,quantity,ship_mode,site_id,supply_cost,tax,unit_price,year,month,day,hour
0,2.822773,2018-04-14T14:21:04,1,1,2018-04-14T01:00:00,2550899500,999,11,NO-RUSH,327,0.562963,0.651442,2,2018,4,14,1
1,118.499121,2018-04-14T14:21:04,2,2,2018-04-14T01:00:00,2550899500,846,74,NO-RUSH,327,29.767831,36.392407,133,2018,4,14,1
2,93.665862,2018-04-14T14:21:04,3,3,2018-04-14T01:00:00,2550899500,984,18,NO-RUSH,327,621.371628,36.580441,149,2018,4,14,1
3,241.543153,2018-04-14T14:21:04,4,4,2018-04-14T01:00:00,2550899500,239,29,NO-RUSH,327,238.603943,161.031018,731,2018,4,14,1
4,117.350523,2018-04-14T21:31:26,1,1,2018-04-14T01:00:00,2550899501,832,90,STANDARD,44,219.627455,201.261957,462,2018,4,14,1
5,152.034471,2018-04-14T21:31:26,2,2,2018-04-14T01:00:00,2550899501,818,67,STANDARD,44,168.970523,216.051699,524,2018,4,14,1
6,193.128071,2018-04-14T19:58:03,1,1,2018-04-14T01:00:00,2550899502,288,20,ONE-DAY,158,217.463232,141.378774,700,2018,4,14,1
7,72.114141,2018-04-14T19:58:03,2,2,2018-04-14T01:00:00,2550899502,523,93,ONE-DAY,158,33.959294,40.44674,152,2018,4,14,1
8,611.641426,2018-04-14T03:45:39,1,1,2018-04-14T01:00:00,2550899503,630,87,ONE-DAY,237,74.448345,71.560414,332,2018,4,14,1
9,47.199852,2018-04-14T18:18:12,1,1,2018-04-14T01:00:00,2550899504,117,10,ONE-DAY,292,367.296527,1803.600129,227,2018,4,14,1
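
Comparing the two outputs by eye works for ten rows. For larger tables the check could be done in code. The sketch below uses `pandas.testing.assert_frame_equal` on small stand-in frames. The helper `frames_match` is hypothetical, and it sorts both frames first on the assumption that a query without `ORDER BY` may return rows in a different order.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

def frames_match(left, right, sort_by):
    """Return True when both frames hold the same rows, ignoring row order."""
    left_sorted = left.sort_values(sort_by).reset_index(drop=True)
    right_sorted = right.sort_values(sort_by).reset_index(drop=True)
    try:
        # check_like=True also ignores the order of the columns
        assert_frame_equal(left_sorted, right_sorted, check_like=True)
    except AssertionError:
        return False
    return True

# Stand-in frames built from three columns of the data shown above
original = pd.DataFrame({
    "order_id": [2550899500, 2550899501],
    "line_id": [1, 1],
    "discount": [2.822773, 117.350523],
})
round_trip = original.iloc[::-1]  # same rows in reverse order
print(frames_match(original, round_trip, sort_by=["order_id", "line_id"]))
```

In this notebook the equivalent call would be `frames_match(df, new_df, sort_by=["order_id", "line_id"])`.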
