# Intake-Postgres Plugin: Joins Demo

The following notebook demonstrates "join" functionality using the _Intake-Postgres_ plugin. Its purpose is to showcase a variety of scenarios in which an _Intake_ user may want to query their PostgreSQL-based relational datasets.

The notebook covers joins in two scenarios:

- One database, two tables
- Two databases, several tables


## Setup
1. Download the PostgreSQL/PostGIS Docker images. With [Docker installed](https://www.docker.com/community-edition), execute:
    ```
    for db_inst in $(seq 0 4); do
        docker run -d -p $(expr 5432 + $db_inst):5432 --name intake-postgres-$db_inst mdillon/postgis:9.6-alpine;
    done
    ```
    The first `docker run` downloads the image; the remaining four reuse it. The loop starts five containers, published on host ports 5432 through 5436.

1. In the same conda environment as this notebook, install `pandas`, `sqlalchemy`, `psycopg2`, `shapely`, and (optionally) `postgresql`:
    ```
    conda install pandas sqlalchemy psycopg2 shapely postgresql
    ```
    The `postgresql` package is needed only for the `psql` command-line client, which we use to verify, independently of our Python code, that the data was written to the databases.

1. Finally, install the _intake-postgres_ plugin:
    ```
    conda install -c intake intake-postgres
    ```
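
The port arithmetic in step 1 (`5432 + $db_inst`) determines the connection URI of every instance used later in this notebook. As a minimal sketch, assuming the default `postgres` user and database that the rest of the notebook connects with, the URIs could be built as follows. The helper name `instance_uri` and its constants are hypothetical, introduced only for illustration:

```python
# Hypothetical helper: names and defaults are assumptions for illustration.
BASE_PORT = 5432    # host port of the first container started in step 1
N_INSTANCES = 5     # `seq 0 4` starts five containers


def instance_uri(db_inst, user='postgres', host='localhost', dbname='postgres'):
    """Return the SQLAlchemy-style URI for container number `db_inst`."""
    if not 0 <= db_inst < N_INSTANCES:
        raise ValueError('db_inst must be in [0, {})'.format(N_INSTANCES))
    return 'postgresql://{}@{}:{}/{}'.format(user, host, BASE_PORT + db_inst, dbname)


uris = [instance_uri(i) for i in range(N_INSTANCES)]
for uri in uris:
    print(uri)
```

The last URI in the list points at port 5436, which is the instance the catalog later refers to as `loans_5` and `rejects_5`.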


## Loading the data

Because _Intake_ only supports reading data, we need to insert the data into our databases by other means. The approach below partitions a pre-downloaded CSV file and inserts each partition into a table in a different database instance. This can be thought of as a rudimentary form of application-level "sharding".

The code below begins by importing the necessary modules:

In [None]:
from __future__ import print_function, absolute_import

## For downloading the test data
import os
import requests
import urllib.parse  # `import urllib` alone does not guarantee the `parse` submodule is loaded
import zipfile

## For inserting test data
import pandas as pd
from sqlalchemy import create_engine

## For using Intake
from intake.catalog import Catalog

## Global variables
N_PARTITIONS = 5

Here we download the data, if it doesn't already exist:

In [None]:
%%time

# Define download sources and destinations.
# For secure extraction, 'fpath' must be how the zip archive refers to the file.
loan_data = {'url': 'https://resources.lendingclub.com/LoanStats3a.csv.zip',
             'fpath': 'LoanStats3a.csv',
             'table': 'loan_stats',
             'date_col': 'issue_d',
             'normalize': ['term', 'home_ownership', 'verification_status', 'loan_status', 'addr_state', 'application_type', 'disbursement_method']}
decl_loan_data = {'url': 'https://resources.lendingclub.com/RejectStatsA.csv.zip',
                  'fpath': 'RejectStatsA.csv',
                  'table': 'reject_stats',
                  'date_col': 'Application Date',
                  'normalize': ['State']}

# Do the data downloading and extraction
for data in [loan_data, decl_loan_data]:
    url, fpath = data['url'], data['fpath']
    
    if os.path.isfile(fpath):
        print('{!r} already exists: skipping download.\n'.format(fpath))
        continue

    dl_fpath = os.path.basename(urllib.parse.urlsplit(url).path)
    try:
        print('Downloading data from {!r}...'.format(url))
        response = requests.get(url)
        response.raise_for_status()  # Treat HTTP error statuses as failures
    except requests.RequestException as err:
        raise ValueError('Download error. Check internet connection and URL.') from err

    try:
        print('Writing data to {!r}...'.format(dl_fpath))
        with open(dl_fpath, 'wb') as fp:
            fp.write(response.content)
    except OSError as err:
        raise ValueError('File write error. Check destination file path and permissions.') from err

    try:
        print('Extracting data...')
        with zipfile.ZipFile(dl_fpath, 'r') as zip_ref:
            zip_ref.extract(fpath)
    except (zipfile.BadZipFile, KeyError) as err:
        raise ValueError('File extraction error. Is the downloaded file a zip archive?') from err

    # Remove the downloaded archive once its contents are extracted
    if dl_fpath.endswith('.zip'):
        os.remove(dl_fpath)

    print('Success: {!r}\n'.format(fpath))
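
The extraction step can be tried in isolation, without a network connection. The following is a minimal sketch that builds a tiny archive in memory and extracts a single named member from it, which is the same pattern the download cell applies to `fpath`. The helper `extract_member` and the file `example.csv` are made up for illustration and are not part of the LendingClub data:

```python
import io
import os
import tempfile
import zipfile


def extract_member(archive_bytes, member, dest_dir):
    """Extract exactly one named member from a zip archive held in memory."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        if member not in zf.namelist():
            raise KeyError('{!r} is not in the archive'.format(member))
        # extract() returns the path of the file it created
        return zf.extract(member, path=dest_dir)


# Build a tiny stand-in archive (made-up contents, for illustration only).
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as zf:
    zf.writestr('example.csv', 'a,b\n1,2\n')

with tempfile.TemporaryDirectory() as tmp:
    out_path = extract_member(buf.getvalue(), 'example.csv', tmp)
    with open(out_path) as fp:
        extracted_text = fp.read()

print(os.path.basename(out_path))
```

Naming the member explicitly, rather than calling `extractall()`, means that only the expected file is written to disk.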

Next, we partition each dataset into `N_PARTITIONS` groups of roughly equal size and persist each partition into a separate database instance. There are many ways to partition a dataset; here we partition by date: the issue date for accepted loans, and the application date for rejected ones. Categorical columns are also normalized into lookup tables named `<column>_codes`, which are stored in the first database instance:

In [None]:
%%time
for data in [loan_data, decl_loan_data]:
    fpath, date_col, table = data['fpath'], data['date_col'], data['table']
    norm_cols = data['normalize']
    pcol = '_' + date_col  # Used for partitioning the data
    
    df = pd.read_csv(fpath, skiprows=1)
    print('# {}: {}'.format(table, len(df)))
    print('# {} valued at N/A: {}'.format(table, len(df[df[date_col].isna()])))
    df.dropna(axis=0, subset=[date_col], inplace=True)
    
    df[pcol] = pd.to_datetime(df[date_col]) # , format='%b-%Y')

    # Cast strs with '%' into floats, so we can do analysis more easily
    if 'int_rate' in df.columns:
        df['int_rate'] = df['int_rate'].str.rstrip('%').astype(float)

    df.sort_values(pcol, inplace=True)
    grouped = df.groupby(pd.qcut(df[pcol],
                                 N_PARTITIONS,
                                 labels=list(range(N_PARTITIONS))))
    
    # Normalize what we can, store into first db instance
    engine = create_engine('postgresql://postgres@localhost:{}/postgres'.format(5432))
    for norm_col in norm_cols:
        norm_col_cats = df[norm_col].astype('category')
        # `pd.np` was removed from recent pandas releases; a plain range is sufficient here
        norm_df = pd.DataFrame({'id': range(len(norm_col_cats.cat.categories)),
                                norm_col: norm_col_cats.cat.categories.values})
        df[norm_col] = norm_col_cats.cat.codes  # Replace the column so it takes the integer dtype of the codes
        print('Persisting normalized column, {!r}...'.format(norm_col+'_codes'))
        norm_df.to_sql(norm_col+'_codes', engine, if_exists='replace')
    
    for group_id, group_df in grouped:
        print('\n###', group_id)
        start = group_df[pcol].min().strftime('%b-%Y')
        end = group_df[pcol].max().strftime('%b-%Y')

        # Save each partition to a different database
        print('Persisting {} {} from {} to {}...'.format(len(group_df), table, start, end))
        engine = create_engine('postgresql://postgres@localhost:{}/postgres'.format(5432+group_id))
        try:
            group_df.drop(columns=pcol).to_sql(table, engine, if_exists='fail') #'replace')
        except ValueError:
            pass  # Table already exists, so do nothing.
        
    print()
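
To see how the quantile-based partitioning behaves on its own, here is a minimal sketch on a made-up DataFrame of ten monthly dates. It uses the same `pd.qcut` call as the cell above; the data values and the names `toy`, `sizes`, and `ports` are assumptions for illustration:

```python
import pandas as pd

N_PARTS = 5  # mirrors N_PARTITIONS above

# Ten made-up monthly issue dates (illustrative values only).
toy = pd.DataFrame({
    'issue_d': pd.date_range('2010-01-01', periods=10, freq='MS'),
    'loan_amnt': range(1000, 11000, 1000),
})

# qcut assigns each row to one of N_PARTS equal-frequency bins.
# The integer labels double as offsets from the base port 5432.
toy['partition'] = pd.qcut(toy['issue_d'], N_PARTS, labels=list(range(N_PARTS)))

sizes = toy.groupby('partition', observed=True).size()
ports = {int(label): 5432 + int(label) for label in sizes.index}

print(sizes)
print(ports)
```

Because the bins are quantiles of the date column, each partition holds about the same number of rows, but the date ranges they span can differ in length.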

Verify that the data was written by connecting to each database directly with the `psql` command-line tool:

In [None]:
# Save each query from the `psql` command as HTML
!for db_inst in $(seq 0 4); do \
    psql -h localhost -p $(expr 5432 + $db_inst) -U postgres -q -H \
        -c 'select loan_amnt, term, int_rate, issue_d from loan_stats limit 5;' \
      > db${db_inst}.html; \
done

# Display the HTML files
from IPython.display import display, HTML
for db_inst in range(N_PARTITIONS):
    display(HTML('db{}.html'.format(db_inst)))

## Reading the data (with Intake-Postgres)

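Write out a __joins\_catalog.yml__ file that declares one source per lookup table, the first and last partitions of each dataset (ports 5432 and 5436), and a parameterized join: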

In [None]:
%%writefile joins_catalog.yml
plugins:
  source:
    - module: intake_postgres

sources:
  # Normalized columns
  term_codes:
    driver: postgres
    args:
      uri: 'postgresql://postgres@localhost:5432/postgres'
      sql_expr: 'select id, term from term_codes'

  home_ownership_codes:
    driver: postgres
    args:
      uri: 'postgresql://postgres@localhost:5432/postgres'
      sql_expr: 'select id, home_ownership from home_ownership_codes'

  verification_status_codes:
    driver: postgres
    args:
      uri: 'postgresql://postgres@localhost:5432/postgres'
      sql_expr: 'select id, verification_status from verification_status_codes'

  loan_status_codes:
    driver: postgres
    args:
      uri: 'postgresql://postgres@localhost:5432/postgres'
      sql_expr: 'select id, loan_status from loan_status_codes'

  addr_state_codes:
    driver: postgres
    args:
      uri: 'postgresql://postgres@localhost:5432/postgres'
      sql_expr: 'select id, addr_state from addr_state_codes'

  application_type_codes:
    driver: postgres
    args:
      uri: 'postgresql://postgres@localhost:5432/postgres'
      sql_expr: 'select id, application_type from application_type_codes'

  disbursement_method_codes:
    driver: postgres
    args:
      uri: 'postgresql://postgres@localhost:5432/postgres'
      sql_expr: 'select id, disbursement_method from disbursement_method_codes'
        
  State_codes:
    driver: postgres
    args:
      uri: 'postgresql://postgres@localhost:5432/postgres'
      sql_expr: 'select id, "State" from "State_codes"'


  # loan_stats data
  loans_1:
    driver: postgres
    args:
      uri: 'postgresql://postgres@localhost:5432/postgres'
      sql_expr: 'select issue_d, term, application_type, disbursement_method, home_ownership, verification_status, loan_status, loan_amnt, int_rate from loan_stats'

  loans_5:
    driver: postgres
    args:
      uri: 'postgresql://postgres@localhost:5436/postgres'
      sql_expr: 'select issue_d, term, application_type, disbursement_method, home_ownership, verification_status, loan_status, loan_amnt, int_rate from loan_stats'
        

  # reject_stats data
  rejects_1:
    driver: postgres
    args:
      uri: 'postgresql://postgres@localhost:5432/postgres'
      sql_expr: 'select "Application Date", "State", "Amount Requested" from reject_stats'

  rejects_5:
    driver: postgres
    args:
      uri: 'postgresql://postgres@localhost:5436/postgres'
      sql_expr: 'select "Application Date", "State", "Amount Requested" from reject_stats'


  # Joins
  join_db_1_to_1:
    driver: postgres
    parameters:
        - name: interest_lowbound
          description: "Lower-bound for interest rate in query"
          type: float
          default: 0.0
          min: 0.0
    args:
      uri: 'postgresql://postgres@localhost:5432/postgres'
      sql_expr: !template "
        select issue_d,
               term_codes.term,
               application_type_codes.application_type,
               disbursement_method_codes.disbursement_method,
               home_ownership_codes.home_ownership,
               verification_status_codes.verification_status,
               loan_status_codes.loan_status,
               loan_amnt,
               int_rate
        from loan_stats
        inner join term_codes on loan_stats.term = term_codes.id
        inner join application_type_codes on loan_stats.application_type = application_type_codes.id
        inner join disbursement_method_codes on loan_stats.disbursement_method = disbursement_method_codes.id
        inner join home_ownership_codes on loan_stats.home_ownership = home_ownership_codes.id
        inner join verification_status_codes on loan_stats.verification_status = verification_status_codes.id
        inner join loan_status_codes on loan_stats.loan_status = loan_status_codes.id
        where int_rate > {{ interest_lowbound }}"
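
The eight `*_codes` sources in the catalog follow one regular pattern, so they could also be generated rather than typed by hand. The following is only a sketch of that idea, not how the catalog above was produced: the template, the helper `lookup_entries`, and the choice to double-quote every identifier are assumptions made for illustration.

```python
# Lookup columns taken from the 'normalize' lists defined earlier.
LOOKUP_COLUMNS = ['term', 'home_ownership', 'verification_status', 'loan_status',
                  'addr_state', 'application_type', 'disbursement_method', 'State']

# Identifiers are always double-quoted so that mixed-case names such as
# "State_codes" keep their case in PostgreSQL.
ENTRY_TEMPLATE = """\
  {col}_codes:
    driver: postgres
    args:
      uri: '{uri}'
      sql_expr: 'select id, "{col}" from "{col}_codes"'
"""


def lookup_entries(columns, uri='postgresql://postgres@localhost:5432/postgres'):
    """Return the YAML text for one catalog source per lookup column."""
    return '\n'.join(ENTRY_TEMPLATE.format(col=col, uri=uri) for col in columns)


yaml_text = lookup_entries(LOOKUP_COLUMNS)
print(yaml_text)
```

The printed text is indented to sit under the `sources:` key of a catalog file.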

Access the catalog with Intake:

In [None]:
%%time
catalog = Catalog('joins_catalog.yml')
catalog

Optionally, inspect the metadata of a few sources with `.discover()`:

In [None]:
catalog.loans_1.discover()

In [None]:
catalog.application_type_codes.discover()

In [None]:
catalog.join_db_1_to_1.discover()

Read the data from the sources:

In [None]:
%%time
catalog.loans_1.read().tail()

In [None]:
%%time
catalog.loans_5.read().tail()

## _JOIN_ with one database

Here is our **JOIN**, with default parameters (`interest_lowbound == 0.0`):

In [None]:
%%time
catalog.join_db_1_to_1.read().tail()

Next, with our own parameter value(s):

In [None]:
%%time
catalog.join_db_1_to_1(interest_lowbound=15.0).read().tail()
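
What the `join_db_1_to_1` query computes can be reproduced without a running PostgreSQL instance. The sketch below uses Python's built-in SQLite as a stand-in for PostgreSQL, with a single lookup table and three made-up rows. Note one deliberate difference: the catalog substitutes `interest_lowbound` into the SQL text through a template, whereas this sketch binds it as a query parameter. The function name `loans_above` is hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for the first PostgreSQL instance.
conn = sqlite3.connect(':memory:')
conn.executescript("""
    create table term_codes (id integer, term text);
    create table loan_stats (term integer, loan_amnt real, int_rate real);
    insert into term_codes values (0, '36 months'), (1, '60 months');
    -- Made-up rows, for illustration only
    insert into loan_stats values (0, 5000, 10.65), (1, 2500, 15.27), (1, 12000, 18.25);
""")

query = """
    select term_codes.term, loan_amnt, int_rate
    from loan_stats
    inner join term_codes on loan_stats.term = term_codes.id
    where int_rate > ?
    order by int_rate
"""


def loans_above(interest_lowbound=0.0):
    """Run the join, binding the lower bound as a query parameter."""
    return conn.execute(query, (interest_lowbound,)).fetchall()


all_rows = loans_above()            # default lower bound: every row qualifies
high_rate_rows = loans_above(15.0)  # only loans with int_rate above 15.0

print(all_rows)
print(high_rate_rows)
```

The inner join replaces each integer code in `loan_stats.term` with the string stored in `term_codes`, and the `where` clause then filters on the interest rate.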

## _JOIN_ with two databases

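Each catalog source queries a single database URI, so for a **JOIN** between tables that live in two separate databases we first read each table into a pandas DataFrame. Then we join them on the client side with `pd.merge()`. Here the loans come from the fifth instance (port 5436) and the lookup tables from the first (port 5432):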

In [None]:
%%time
loans_5_df = catalog.loans_5.read()
term_df = catalog.term_codes.read()
application_type_df = catalog.application_type_codes.read()
disbursement_method_df = catalog.disbursement_method_codes.read()
home_ownership_df = catalog.home_ownership_codes.read()
verification_status_df = catalog.verification_status_codes.read()
loan_status_df = catalog.loan_status_codes.read()

In [None]:
term_df

In [None]:
loans_5_df.tail()

In [None]:
for col, lookup_df in [('term', term_df),
                       ('application_type', application_type_df),
                       ('disbursement_method', disbursement_method_df),
                       ('home_ownership', home_ownership_df),
                       ('verification_status', verification_status_df),
                       ('loan_status', loan_status_df)]:
    # Both frames have a column named `col`: the left one (integer codes) gets
    # the suffix '_', the right one (decoded strings) keeps its name.
    loans_5_df = pd.merge(loans_5_df, lookup_df,
                          how='left',
                          left_on=col, right_on='id',
                          suffixes=('_', ''))
    # Drop the integer code column, then the lookup table's key column(s)
    loans_5_df = loans_5_df.drop(columns=col + '_')
    loans_5_df = loans_5_df.drop(columns=['id_', 'id'], errors='ignore')
loans_5_df.tail()
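
The full round trip, from strings to integer codes and back again, can be checked on a small made-up DataFrame. This sketch encodes a column the same way the loading step does (category codes plus a lookup table) and then decodes it with a left merge. As an alternative to the suffix bookkeeping above, it renames the code column before merging; the names `raw`, `lookup`, `encoded`, and `decoded` are assumptions for illustration:

```python
import pandas as pd

# Made-up rows, for illustration only.
raw = pd.DataFrame({'term': ['36 months', '60 months', '36 months'],
                    'loan_amnt': [5000, 2500, 2400]})

# Encode: replace each string with its category code and keep a lookup table.
cats = raw['term'].astype('category')
lookup = pd.DataFrame({'id': range(len(cats.cat.categories)),
                       'term': cats.cat.categories.values})
encoded = raw.assign(term=cats.cat.codes)

# Decode: left-join on the code, then keep only the lookup table's string column.
decoded = (encoded.rename(columns={'term': 'term_id'})
                  .merge(lookup, how='left', left_on='term_id', right_on='id')
                  .drop(columns=['term_id', 'id']))

print(decoded)
```

A left merge keeps every row of the encoded table in its original order, so the decoded `term` column lines up with the original strings.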