# Create Athena Database Schema

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. 

Athena is based on Presto and supports a variety of standard data formats, including CSV, JSON, and Avro, as well as columnar formats such as Apache Parquet and Apache ORC.

Presto is an open source, distributed SQL query engine developed for fast analytic queries against data of any size. It queries data where it is stored, without the need to move the data. Queries execute in parallel over a memory-based architecture, which makes Presto extremely fast.


<img src="img/athena_setup.png" width="60%" align="left">

In [1]:
import boto3
import sagemaker

sess = sagemaker.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
region = boto3.Session().region_name

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


In [2]:
ingest_create_athena_db_passed = False

In [3]:
%store -r s3_public_path_tsv

no stored variable or alias s3_public_path_tsv


In [4]:
try:
    s3_public_path_tsv
except NameError:
    print("*****************************************************************************")
    print("[ERROR] PLEASE RE-RUN THE PREVIOUS COPY TSV TO S3 NOTEBOOK ******************")
    print("[ERROR] THIS NOTEBOOK WILL NOT RUN PROPERLY. ********************************")
    print("*****************************************************************************")

*****************************************************************************
[ERROR] PLEASE RE-RUN THE PREVIOUS COPY TSV TO S3 NOTEBOOK ******************
[ERROR] THIS NOTEBOOK WILL NOT RUN PROPERLY. ********************************
*****************************************************************************


In [None]:
print(s3_public_path_tsv)

In [5]:
%store -r s3_private_path_tsv

In [6]:
try:
    s3_private_path_tsv
except NameError:
    print("*****************************************************************************")
    print("[ERROR] PLEASE RE-RUN THE PREVIOUS COPY TSV TO S3 NOTEBOOK ******************")
    print("[ERROR] THIS NOTEBOOK WILL NOT RUN PROPERLY. ********************************")
    print("*****************************************************************************")

In [7]:
print(s3_private_path_tsv)

s3://sagemaker-us-east-1-546928460657/airline-delay-cause/csv


# Import PyAthena

[PyAthena](https://pypi.org/project/PyAthena/) is a Python DB API 2.0 (PEP 249) compliant client for Amazon Athena.
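
Because PyAthena follows the DB API 2.0 interface, using it looks much like using any other PEP 249 driver: open a connection, get a cursor, execute a statement, and fetch the rows. The sketch below illustrates that pattern with Python's built-in `sqlite3` module as a stand-in so that it runs without AWS credentials. The helper `fetch_rows`, the table `demo_databases`, and its rows are hypothetical and exist only for illustration; with PyAthena the connection would come from `pyathena.connect(...)` instead.

```python
import sqlite3


def fetch_rows(conn, statement):
    """Run a statement through a PEP 249 cursor and return all result rows."""
    cursor = conn.cursor()
    try:
        cursor.execute(statement)
        return cursor.fetchall()
    finally:
        cursor.close()


# sqlite3 is only a stand-in for a PEP 249 connection (hypothetical demo data).
demo_conn = sqlite3.connect(":memory:")
setup_cursor = demo_conn.cursor()
setup_cursor.execute("CREATE TABLE demo_databases (database_name TEXT)")
setup_cursor.executemany(
    "INSERT INTO demo_databases VALUES (?)",
    [("db_airline_delay_cause",), ("default",)],
)
setup_cursor.close()

demo_rows = fetch_rows(
    demo_conn, "SELECT database_name FROM demo_databases ORDER BY database_name"
)
print(demo_rows)
```

Each row comes back as a tuple, which is why libraries such as pandas can build a DataFrame on top of any DB API 2.0 connection.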

In [9]:
!pip install pyathena

Collecting pyathena
  Using cached pyathena-3.12.2-py3-none-any.whl.metadata (6.3 kB)
Using cached pyathena-3.12.2-py3-none-any.whl (75 kB)
Installing collected packages: pyathena
Successfully installed pyathena-3.12.2

In [10]:
from pyathena import connect

# Create Athena Database

In [26]:
database_name = "db_airline_delay_cause"

Note: The databases and tables that we create in Athena use a data catalog service to store metadata about our data. For example, each table's name and its schema (the column names and the data type of each column) are saved as metadata in the data catalog.

Athena natively supports the AWS Glue Data Catalog service. When we run `CREATE DATABASE` and `CREATE TABLE` queries in Athena with the AWS Glue Data Catalog as our source, we automatically see the database and table metadata entries being created in the AWS Glue Data Catalog.
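
One way to see those catalog entries from Python is to ask AWS Glue directly. The sketch below is a minimal illustration, not part of the original notebook: the helper name `list_glue_databases` is hypothetical, and it assumes the caller passes in a boto3 Glue client with permission to call `get_databases`.

```python
def list_glue_databases(glue_client):
    """Return the names of all databases in the AWS Glue Data Catalog.

    `glue_client` is assumed to be a boto3 Glue client, e.g.
    boto3.client("glue", region_name=region).
    """
    names = []
    # get_databases is paginated, so walk every page of results.
    paginator = glue_client.get_paginator("get_databases")
    for page in paginator.paginate():
        for database in page["DatabaseList"]:
            names.append(database["Name"])
    return names


# Usage (requires AWS credentials and Glue permissions):
#   import boto3
#   glue = boto3.client("glue", region_name=region)
#   print(list_glue_databases(glue))
```

After the `CREATE DATABASE` statement later in this notebook has run, the new database name should appear in the returned list.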

In [27]:
# Set S3 staging directory -- Athena writes query results to this S3 location
s3_staging_dir = "s3://{0}/athena/staging".format(bucket)

In [28]:
print(s3_staging_dir)

s3://sagemaker-us-east-1-546928460657/athena/staging


In [29]:
conn = connect(region_name=region, s3_staging_dir=s3_staging_dir)

In [31]:
statement = "CREATE DATABASE IF NOT EXISTS {}".format(database_name)
print(statement)

CREATE DATABASE IF NOT EXISTS db_airline_delay_cause
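
The statement above is built by string formatting, so the database name ends up directly in the SQL text. A small guard before formatting can catch typos or unexpected characters. The sketch below is one possible approach: the helper `build_create_database_statement` is hypothetical, and the allowed-character rule (lowercase letters, digits, and underscores) is a deliberately conservative assumption rather than a full statement of Athena's naming rules.

```python
import re

# Conservative assumption: allow only lowercase letters, digits, and underscores.
_VALID_DATABASE_NAME = re.compile(r"[a-z0-9_]+")


def build_create_database_statement(name):
    """Build a CREATE DATABASE statement after validating the database name."""
    if not _VALID_DATABASE_NAME.fullmatch(name):
        raise ValueError("Unexpected characters in database name: {!r}".format(name))
    return "CREATE DATABASE IF NOT EXISTS {}".format(name)


checked_statement = build_create_database_statement("db_airline_delay_cause")
print(checked_statement)
```

A name such as `"db; DROP TABLE x"` would raise `ValueError` instead of being sent to Athena.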


In [32]:
import pandas as pd

pd.read_sql(statement, conn)



# Verify The Database Has Been Created Successfully

In [33]:
statement = "SHOW DATABASES"

df_show = pd.read_sql(statement, conn)
df_show.head(5)



|   | database_name |
|---|---|
| 0 | db_airline_delay_cause |
| 1 | default |
| 2 | dsoaws |


In [34]:
if database_name in df_show.values:
    ingest_create_athena_db_passed = True

In [35]:
%store ingest_create_athena_db_passed

Stored 'ingest_create_athena_db_passed' (bool)


# Store Variables for the Next Notebooks

In [36]:
%store

Stored variables and their in-db values:
ingest_create_athena_db_passed             -> True
s3_private_path_tsv                        -> 's3://sagemaker-us-east-1-546928460657/airline-del
setup_dependencies_passed                  -> True
setup_s3_bucket_passed                     -> True


# Release Resources

In [37]:
%%html

<p><b>Shutting down your kernel for this notebook to release resources.</b></p>
<button class="sm-command-button" data-commandlinker-command="kernelmenu:shutdown" style="display:none;">Shutdown Kernel</button>
        
<script>
try {
    els = document.getElementsByClassName("sm-command-button");
    els[0].click();
}
catch(err) {
    // NoOp
}    
</script>

In [None]:
%%javascript

try {
    Jupyter.notebook.save_checkpoint();
    Jupyter.notebook.session.delete();
}
catch(err) {
    // NoOp
}