# Simple Database Creation and Manipulation

In this tutorial we use awswrangler to create a Glue database and write tables to it.

Let's create a database from the test data `employees.csv`, `sales.csv` and `department.csv` (all in the `data/` folder).

Note: this tutorial is adapted from https://github.com/awslabs/aws-data-wrangler/blob/master/tutorials/014%20-%20Schema%20Evolution.ipynb

In [None]:
import pandas as pd
import awswrangler as wr
import datetime
import pydbtools as pydb

## Setup first

In [None]:
# setup your own testing area (set foldername = GH username)
foldername = "mratford" # GH username
foldername = foldername.lower().replace("-","_")
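
The normalisation above lower-cases the name and swaps hyphens for underscores. A slightly more defensive version is sketched below. The helper name `normalise_foldername` is hypothetical, and the rule that only lower-case letters, digits and underscores are acceptable is an assumption made for this sketch, not a requirement stated in this tutorial.

```python
import re


def normalise_foldername(username: str) -> str:
    """Lower-case a username and replace hyphens with underscores.

    Raises ValueError if any other character remains. That rule is an
    assumption for this sketch, not something the tutorial requires.
    """
    name = username.strip().lower().replace("-", "_")
    if not re.fullmatch(r"[a-z0-9_]+", name):
        raise ValueError(f"unexpected characters in folder name: {name!r}")
    return name


# Example with a made-up username
example = normalise_foldername("Some-User")
print(example)  # some_user
```

Failing early on an unexpected character is a design choice: it surfaces a bad name before it is baked into the database name and the S3 paths.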

In [None]:
bucketname = "alpha-everyone"
db_name = f"aws_example_{foldername}"
db_base_path = f"s3://{bucketname}/{foldername}/database"
s3_base_path = f"s3://{bucketname}/{foldername}/"

# Delete all the s3 files in a given path
if wr.s3.list_objects(s3_base_path):
    print("deleting objs")
    wr.s3.delete_objects(s3_base_path)

# Delete the database if it exists
df_dbs = wr.catalog.databases(None)
if db_name in df_dbs["Database"].to_list():
    wr.catalog.delete_database(
        name=db_name
    )

### Let's get the data into pandas first

In [None]:
df = pd.read_csv("data/employees.csv")
df.head()

### Let's do some transforms on it

In [None]:
df["creation_date"] = datetime.date(2021, 1, 1)
df.head()

### Write the table to a database

Parquet is usually the best format for writing data to a Glue database, especially if you only intend to retrieve that data via Athena SQL queries.

In [None]:
# Create the database
wr.catalog.create_database(db_name)

# note table_path is a folder as glue treats all the
# data in a folder as contents of a single table
table_path = f"{db_base_path}/employees/"

# Write your pandas dataframe to S3 and add it as a table in your database
wr.s3.to_parquet(
    df=df,
    path=table_path,
    index=False,
    dataset=True,  # dataset=True enables the parameters below (database, table, mode)
    database=db_name,
    table='employees',
    mode="overwrite",
)
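
Since each table lives in its own folder under the database path, the same pattern extends to the other test files. The sketch below only builds the folder paths and does not touch S3. The helper `table_path` and the example base path are hypothetical, chosen to mirror the variables used in this tutorial.

```python
def table_path(db_base_path: str, table: str) -> str:
    """Return the S3 folder (with a trailing slash) holding one table's files."""
    return f"{db_base_path.rstrip('/')}/{table}/"


# Hypothetical base path, mirroring db_base_path in this tutorial
example_base = "s3://alpha-everyone/some_user/database"
tables = ["employees", "sales", "department"]

# One folder per table, so Glue treats each folder as a separate table
paths = {t: table_path(example_base, t) for t in tables}
for name, path in paths.items():
    print(name, "->", path)
```

Each resulting path could then be passed as `path=` in a `wr.s3.to_parquet` call like the one above, with `table=` set to the matching name.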

### Append new data to the table

For fun, let's add some new columns as well.

In [None]:
df["creation_date"] = datetime.date(2021, 1, 1)

df["new_col1"] = df["employee_id"] + 100
df["new_col2"] = "some text"

df.head()

In [None]:
# Write the new data to S3.
# Note that the only thing that has changed is mode="append"; previously it was mode="overwrite"
wr.s3.to_parquet(
    df=df,
    path=table_path,
    index=False,
    dataset=True,
    database=db_name,
    table='employees',
    mode="append",
)
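
Because the second write adds `new_col1` and `new_col2`, the table now holds rows written under two different schemas. The pure-Python sketch below illustrates what to expect when reading such a table back. It assumes that rows written before a column existed return a null for that column. The row values are made up and the helper `align_rows` is hypothetical; it is not part of awswrangler.

```python
def align_rows(*batches):
    """Combine batches of row dicts, filling missing columns with None."""
    # Collect every column name in first-seen order
    columns = []
    for batch in batches:
        for row in batch:
            for col in row:
                if col not in columns:
                    columns.append(col)
    # Rebuild each row with the full column list
    return [
        {col: row.get(col) for col in columns}
        for batch in batches
        for row in batch
    ]


# Made-up rows standing in for the first (overwrite) and second (append) writes
first_write = [{"employee_id": 1, "creation_date": "2021-01-01"}]
second_write = [
    {
        "employee_id": 1,
        "creation_date": "2021-01-01",
        "new_col1": 101,
        "new_col2": "some text",
    }
]

for row in align_rows(first_write, second_write):
    print(row)
```

The row from the first batch carries `None` for both new columns, which is the pattern to look for when querying the appended table.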

### Now query the data with Athena to look at it

This step should use pydbtools rather than awswrangler if you are an AP user.

In [None]:
# Each uploaded dataset has one employee with employee_id == 1,
# so let's pull those rows down to show that both datasets were added to the table
sql = f"SELECT * FROM {db_name}.employees WHERE employee_id = 1"
db_table = pydb.read_sql_query(
    sql,
    ctas_approach=False
)
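
The query above interpolates `db_name` straight into the SQL string. That is fine here because the name is built inside this notebook, but when a name or value comes from elsewhere it can help to validate it first. A minimal sketch follows, with a hypothetical helper `employee_query` and an assumed identifier rule (letters, digits and underscores).

```python
import re

# Assumed rule for a safe identifier: letters, digits and underscores only
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def employee_query(db_name: str, employee_id: int) -> str:
    """Build the SELECT used in this tutorial after basic validation."""
    if not _IDENTIFIER.fullmatch(db_name):
        raise ValueError(f"invalid database name: {db_name!r}")
    # int() raises ValueError for non-numeric input such as "1 OR 1=1"
    return (
        f"SELECT * FROM {db_name}.employees "
        f"WHERE employee_id = {int(employee_id)}"
    )


print(employee_query("aws_example_some_user", 1))
```

The returned string could be passed to `pydb.read_sql_query` in the same way as `sql` above.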

In [None]:
print(sql)

In [None]:
db_table.head()

In [None]:
### Clean up

# Delete all the s3 files in a given path
if wr.s3.list_objects(s3_base_path):
    print("deleting objs")
    wr.s3.delete_objects(s3_base_path)

# Delete the database if it exists
df_dbs = wr.catalog.databases(None)
if db_name in df_dbs["Database"].to_list():
    print(f"deleting {db_name}")
    wr.catalog.delete_database(
        name=db_name
    )
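
The clean-up code is the same as the setup code, so it could be wrapped in one function and called from both cells. The helper below is a sketch: the name `reset_area` is hypothetical, and it takes the imported `awswrangler` module as its first argument so that it can be exercised with a stand-in object. It only uses the calls already shown in this tutorial.

```python
def reset_area(wr, s3_base_path: str, db_name: str) -> bool:
    """Delete all objects under s3_base_path and drop db_name if it exists.

    `wr` is the imported awswrangler module, passed in explicitly.
    Returns True if a database was deleted, otherwise False.
    """
    # Delete all the S3 files under the given path
    if wr.s3.list_objects(s3_base_path):
        wr.s3.delete_objects(s3_base_path)

    # Delete the database if it exists
    df_dbs = wr.catalog.databases(None)
    if db_name in df_dbs["Database"].to_list():
        wr.catalog.delete_database(name=db_name)
        return True
    return False
```

In the notebook this would be called as `reset_area(wr, s3_base_path, db_name)` in both the setup and the clean-up cells.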

In [None]:
# Demonstrate the database no longer exists
# (pass None as the limit, as above, so every database is listed)
db_name in wr.catalog.databases(None)["Database"].to_list()