# 1. Data Retrieval

**TODO:** Copy from 

The retrieved data are saved in two formats: a `.csv` file for manual inspection and a `.pkl` file that preserves the data types.

Load the libraries needed for the EDA:

In [None]:
!pip install psycopg2-binary

import os
import subprocess
from sqlalchemy import create_engine
import pandas as pd
from dotenv import load_dotenv



Global configs:

In [None]:
# Path to root directory of the repo.
root_dir_ = subprocess.check_output(
    ["git", "rev-parse", "--show-toplevel"],
    text=True,
)
ROOT_DIR = root_dir_.strip()

# Path to data directory.
DATA_DIR = os.path.join(ROOT_DIR, "data")

# Paths to which dataframes will be saved
DF_PKL_PATH = os.path.join(DATA_DIR, "df.pkl")
DF_CSV_PATH = os.path.join(DATA_DIR, "df.csv") 

After exploring the database in DBeaver, we join the house details with the sales records on the house id (`kchd.id = kchs.house_id`):

In [None]:
QUERY = """
    SET SCHEMA 'eda';

    SELECT kchd.*, kchs."date", kchs.price 
        FROM 
            king_county_house_details kchd 
        LEFT JOIN
            king_county_house_sales kchs 
        ON kchd.id = kchs.house_id
    ;
"""
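
To illustrate what a `LEFT JOIN` of this shape returns, here is a minimal sketch using an in-memory SQLite database. The tables `details` and `sales` and all values in them are hypothetical stand-ins, not the actual `eda` schema: a house with several sales appears once per sale, and a house without any sale is kept with a missing price.

```python
import sqlite3

# Hypothetical miniature tables, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE details (id INTEGER, bedrooms INTEGER);
    CREATE TABLE sales (house_id INTEGER, price REAL);
    INSERT INTO details VALUES (1, 3), (2, 4), (3, 2);
    INSERT INTO sales VALUES (1, 200000.0), (1, 250000.0), (2, 300000.0);
""")

# Same join shape as in QUERY: every row of the left table is kept.
rows = conn.execute("""
    SELECT d.id, d.bedrooms, s.price
    FROM details d
    LEFT JOIN sales s ON d.id = s.house_id
    ORDER BY d.id, s.price
""").fetchall()
conn.close()

for row in rows:
    print(row)
```

In this toy example, house 1 yields two rows (one per sale) and house 3 yields one row whose price is `None`. If the real tables behave similarly, the number of rows in `df` can differ from the number of houses.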

Load the data from the database into a pandas dataframe:

In [None]:
load_dotenv()

DB_CONFIG = {
    "database": os.getenv("DATABASE"),
    "user": os.getenv("USER_DB"),
    "password": os.getenv("PASSWORD"),
    "host": os.getenv("HOST"),
    "port": os.getenv("PORT")
}

DB_STRING = (
    "postgresql://{user}:{password}@{host}:{port}/{database}"
    .format(**DB_CONFIG)
)
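
The connection string above interpolates the credentials verbatim. If the password contains characters such as `@` or `/`, the resulting URL may be parsed incorrectly. The following is a minimal sketch of one way to guard against this, percent-encoding user and password with `urllib.parse.quote_plus` from the standard library. The helper `build_db_string` and the credentials are hypothetical and only serve as illustration.

```python
from urllib.parse import quote_plus


def build_db_string(user, password, host, port, database):
    """Build a PostgreSQL URL with percent-encoded user and password."""
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Hypothetical credentials, for illustration only.
example = build_db_string("alice", "p@ss/word", "localhost", "5432", "eda_db")
print(example)
```

With the values from `DB_CONFIG`, the call would read `build_db_string(**DB_CONFIG)`, since the dictionary keys match the parameter names.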

In [None]:
db = create_engine(DB_STRING)

with db.connect() as conn:
    df = pd.read_sql(QUERY, conn)

Let us check that the data were retrieved.

In [None]:
df.info()

In [None]:
df.sample(10)



Let us remove possibly duplicated columns (resulting from SQL joining operations):

In [None]:
df = df.loc[:, ~df.columns.duplicated()]


Now the dataframe is ready to be stored. To preserve the original data structure (in particular the column types), we use a `.pkl` (pickle) file; the `.csv` version only serves for direct visual inspection.

In [None]:
os.makedirs(DATA_DIR, exist_ok=True)  # ensure the target directory exists
df.to_pickle(DF_PKL_PATH)
df.to_csv(DF_CSV_PATH, index=False)
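
Why a pickle file preserves the data structure while a CSV file does not can be illustrated with a minimal sketch that uses only the standard library instead of pandas. The record below is hypothetical. After a pickle round trip the values keep their Python types, whereas after a CSV round trip every value is a plain string.

```python
import csv
import io
import pickle
from datetime import date

# Hypothetical record with non-string types, for illustration only.
record = {"id": 1, "price": 250000.0, "date": date(2014, 5, 2)}

# Pickle round trip: the objects come back with their types intact.
from_pickle = pickle.loads(pickle.dumps(record))

# CSV round trip: every value comes back as a plain string.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)
buffer.seek(0)
from_csv = next(csv.DictReader(buffer))

print(type(from_pickle["date"]).__name__)  # date
print(type(from_csv["date"]).__name__)     # str
```

pandas infers numeric types again when reading a CSV file, but date columns, for example, are read back as strings unless they are parsed explicitly.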

---

# Appendix

Let us briefly sketch how to download the data from [Kaggle](https://www.kaggle.com). To that end, Kaggle API credentials are required (such credentials can be created at <https://www.kaggle.com/settings>).

We need one additional module for this task:

In [None]:
from kaggle.api.kaggle_api_extended import KaggleApi

Let us specify which dataset we want to download and where to put it:

In [None]:
DATASET_SLUG = "harlfoxem/housesalesprediction"

# This is the name Kaggle gives to the downloaded file.
TMP_NAME = "kc_house_data.csv"

# This is the name we want the downloaded file to have.
TARGET_NAME = "df_kaggle.csv"

# # dataset_slug = "uciml/pima-indians-diabetes-database"
# # dataset_slug = "abderrahimalakouche/flight-delay-prediction"

Now we load our Kaggle API credentials, which must be defined in the `.env` file:

In [None]:
load_dotenv()

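# load_dotenv() has already exported the variables from `.env`;
# here we only verify that the Kaggle credentials are present.
for var in ("KAGGLE_USERNAME", "KAGGLE_KEY"):
    if not os.getenv(var):
        raise RuntimeError(f"{var} is missing from the environment or the .env file")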

At last, we download the dataset and rename the downloaded file:

In [None]:
api = KaggleApi()
api.authenticate()

api.dataset_download_files(DATASET_SLUG, path=DATA_DIR, unzip=True)

tmp_path = os.path.join(DATA_DIR, TMP_NAME)
target_path = os.path.join(DATA_DIR, TARGET_NAME)

# os.replace overwrites an existing target, so the cell can be re-run.
os.replace(tmp_path, target_path)