# Data Retrieval

This notebook contains the code required to retrieve the data for analysis. Valid credentials must be supplied in order to access the database.

We shall employ the following modules:

In [None]:
import os
import pandas as pd
from sqlalchemy import create_engine
from dotenv import load_dotenv

from ipynb_utils import CFG
from ipynb_utils import dump_df

A brief examination of the given database, conducted within a GUI application such as DBeaver, reveals that the following SQL query yields the correct dataset:

In [None]:
QUERY = """
    SET SCHEMA 'takemehome';
    SELECT * FROM takemehome.artsy_pageviews;
"""

We load the credentials required for database access from the `.env` file into Python.

In [None]:
load_dotenv()

DB_CONFIG = {
    "scheme": os.getenv("DB_SCHEME"),
    "database": os.getenv("DATABASE"),
    "user": os.getenv("USER_DB"),
    "password": os.getenv("PASSWORD"),
    "host": os.getenv("HOST"),
    "port": os.getenv("PORT")
}

DB_STRING = (
    "{scheme}://{user}:{password}@{host}:{port}/{database}"
    .format(**DB_CONFIG)
)

We establish a connection to the database and load the data specified by the query into a pandas data frame.

In [None]:
db = create_engine(DB_STRING)

with db.connect() as conn:
    df = pd.read_sql(QUERY, conn)

Let us confirm that the download process has been successful.

In [None]:
df.info()

In [None]:
df.sample(10)

If the underlying query or view involves a join, column names may be duplicated in the result. We remove such duplicates immediately, keeping the first occurrence of each column name.

In [None]:
df = df.loc[:, ~df.columns.duplicated()]
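To illustrate what this step does, here is a minimal sketch on a toy data frame. The column names and values are made up for illustration and are not taken from the actual dataset.

```python
import pandas as pd

# Toy frame with a repeated column name, as a join on "id" might produce.
toy = pd.DataFrame(
    [[1, "home", 1], [2, "artist", 2]],
    columns=["id", "page", "id"],
)

# columns.duplicated() flags the second and later occurrences of a name;
# negating the mask keeps only the first occurrence of each column.
mask = toy.columns.duplicated()
deduped = toy.loc[:, ~mask]
```

Note that the mask is based on column names only; two columns with the same name but different contents would still be reduced to the first one.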

Finally, the data frame is ready to be stored. To preserve the original data structure, including column data types, we use a pickle file; the CSV version serves solely for direct visual inspection.

In [None]:
df.to_pickle(CFG["DF_PKL_PATH"])
df.to_csv(CFG["DF_CSV_PATH"], index=False)
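The difference between the two formats can be seen in a small round-trip experiment. The sketch below uses a toy frame with hypothetical column names and a temporary directory rather than the paths from `CFG`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame with a datetime and a categorical column (made-up data).
toy = pd.DataFrame({
    "viewed_at": pd.to_datetime(["2021-01-01 10:00", "2021-01-02 11:30"]),
    "page": pd.Categorical(["home", "artist"]),
})

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = Path(tmp) / "toy.pkl"
    csv_path = Path(tmp) / "toy.csv"

    toy.to_pickle(pkl_path)
    toy.to_csv(csv_path, index=False)

    from_pkl = pd.read_pickle(pkl_path)
    from_csv = pd.read_csv(csv_path)

# The pickle round trip keeps the dtypes; the CSV round trip
# reads both columns back as plain text.
pkl_dtypes_preserved = from_pkl.dtypes.equals(toy.dtypes)
csv_dtypes_preserved = from_csv.dtypes.equals(toy.dtypes)
```

Here the pickle round trip restores the datetime and categorical types, whereas the CSV round trip does not unless they are specified again on reading (for example via `parse_dates` or `dtype` in `pd.read_csv`).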