# Database Joins 3: Combining LESA with JIRA

For this notebook, we start with the following research question. "Can we create data visualizations on top of the LESA, LPS, LPP, and BPR ticket metadata that let us group together different tickets so that we can explore the times that tickets remain in each status based on those groupings?"

While this is inherently a data visualization question, it has one prerequisite question that has not yet been answered: "Can we bring together LESA, LPS, LPP, and BPR ticket metadata?"

However, this question is still fairly ambitious for a tutorial that intends to introduce you to joining LESA data with JIRA data, as it requires aggregating all of it. For now, we'll look at this reduced scope.

<b style="color:green">Can we bring together LESA, LPS, LPP, and BPR ticket metadata for DXP?</b>

This notebook assumes that you've loaded a backup of the LESA database from `files.liferay.com` into a database named `lportal`, because LESA (like many internal Liferay systems) lacks a useful API for data analysis, and therefore we will extract the data by querying the database directly.

## Prerequisites

The following cell attempts to use `conda` to install the libraries that are used by this notebook. If the output indicates that additional items were installed, you will need to restart the kernel after the installation completes before you can run the later cells in the notebook.

In [None]:
!conda install -y beautifulsoup4 mysql-connector-python

## Notebook Imports

In [None]:
from __future__ import print_function

from checklpp import *

from datetime import datetime
import mysql.connector
import os
import pandas as pd
import re
import requests
import ujson as json

## Save Raw Data

Before we start, we'll make sure that we establish one rule for this script and all future scripts: we will save all raw data with timestamps. This time, we're going to be saving the results of a database query.

In [None]:
connection = mysql.connector.connect(
    user='lportal', password=get_config('mysql.password'),
    host='ec2-34-208-59-105.us-west-2.compute.amazonaws.com', database='lportal'
)

In [None]:
cursor = connection.cursor()

In [None]:
def save_query(cache_name, query, row_function=None):
    file_name = get_file_name(cache_name, 'json')

    if os.path.exists(file_name):
        print('[%s] Skipping query execution due to cache file' % datetime.now().isoformat())
        return

    print('[%s] Executing query %s' % (datetime.now().isoformat(), query))

    cursor.execute(query)

    with open(file_name, 'w') as outfile:
        for i, item in enumerate(cursor):
            if i % 1000 == 0:
                print('[%s] Processed %d items' % (datetime.now().isoformat(), i))

            row_value = {key: value for key, value in zip(cursor.column_names, item)}

            if row_function is None:
                save_row(outfile, [], row_value)
                continue

            for return_value in row_function(row_value):
                save_row(outfile, [], return_value)

In [None]:
def load_query(cache_name, row_function=None):
    file_name = get_file_name(cache_name, 'json')

    with open(file_name, 'r') as infile:
        if row_function is None:
            return [load_row(line) for line in infile]
        else:
            return [row_function(load_row(line)) for line in infile]

In [None]:
def explain_query(query):
    print('[%s] Explaining query %s' % (datetime.now().isoformat(), query))

    cursor.execute('explain %s' % query)

    return [
        { key: value for key, value in zip(cursor.column_names, item) }
            for item in cursor
    ]

def run_query(query):
    print('[%s] Executing query %s' % (datetime.now().isoformat(), query))

    cursor.execute(query)

    return [
        { key: value for key, value in zip(cursor.column_names, item) }
            for item in cursor
    ]

## Extract JIRA Links

### Extract Explicit JIRA Links

Some links are found in the `OSB_TicketLink` table.

In [None]:
# Raw string avoids the invalid '\d' escape; '+' requires a project prefix and a number
p1 = re.compile(r'[A-Z]+-\d+')

def extract_ticket(row_value):
    jira_url = row_value['url']

    if jira_url.find('https://issues.liferay.com/browse/') != 0:
        yield row_value
        return

    candidate_key = jira_url[len('https://issues.liferay.com/browse/'):]

    result = p1.search(candidate_key)

    if result is None:
        yield row_value
        return

    row_value['jira_key'] = result.group(0)
    yield row_value

In [None]:
query = """
select * from OSB_TicketLink where url like 'https://issues.liferay.com/%'
"""

save_query('lesa/JIRALink_1', query, extract_ticket)

### Extract JIRA Links in Comments

Some links are buried inside of the Liferay-only sections of comments and never formally linked on the ticket. Therefore, we'll need to perform some text extraction in order to identify those links.

In [None]:
# Raw string with escaped dots so that only the literal host name matches
p2 = re.compile(r'https://issues\.liferay\.com/browse/[A-Z]+-\d+')

patterns = [p2]

def extract_links(row_value):
    for url in [item for p in patterns for item in p.findall(row_value['body'])]:
        link_value = {
            'userName': row_value['userName'],
            'url': url,
            'createDate': row_value['createDate'],
            'userId': row_value['userId'],
            'visibility': row_value['visibility'],
            'type_': row_value['type_'],
            'ticketEntryId': row_value['ticketEntryId'],
            'ticketCommentId': row_value['ticketCommentId']
        }

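        # extract_ticket is a generator, so yield each row it produces
        # rather than yielding the generator object itself
        for ticket_value in extract_ticket(link_value):
            yield ticket_value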

In [None]:
query = """
select * from OSB_TicketComment where body like '%https://issues.liferay.com/%'
"""

save_query('lesa/JIRALink_2', query, extract_links)

## Detour: Database Indices, Part 1

A database index is built around the notion that you might be able to pre-create a lookup structure that the database can use whenever it needs to look up values, whether that lookup is for a join or that lookup is the actual query.

* [Database index](https://en.wikipedia.org/wiki/Database_index)

Looking back at what we've already learned about database indices, you can think of a database index as an in-between compromise. Namely, it's a disk-based lookup structure that allows the database to favor a different sort of join strategy than a nested loop join. This disk-based structure, if small, can also be loaded into memory to quickly provide a hash-join strategy.

So how do database indices achieve this? At a high level, they do it by implementing a column-based lookup structure on top of the row-oriented storage system.

At a lower level, there are a variety of these column-based lookup data structures that have performance implications (ones that are probabilistic rather than strictly error-free, ones that require clustering or reordering the rows to improve sequential scans, etc.), but to start out, we'll look at two of the more popular types: b-trees and bitmaps.

### B-Tree

A [b-tree index](https://en.wikipedia.org/wiki/B-tree) can be conceptualized as a disk-access-efficient generalization of a binary search tree, where each node holds many keys rather than one, and the keys used for navigating the tree are the keys of the index. Because it is a search tree, it is a sorted data structure.

Because the data structure is sorted, this means that it is extremely efficient to walk the keys of the index in sorted order, which will be relevant when we talk about sort-merge joins and range queries.

Knowing that the data structure is sorted, this means that the order of the columns for the index matters. For example, if you have one index that indexes the values of three columns and you specify two of those columns, the database can only predictably use the database index if the two columns are the first two columns specified for the index.
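To make the column-order point concrete, here is a minimal sketch that stands in for a b-tree with a sorted Python list of tuples. The column names mirror the `Group_` example, but the values and both helper functions are hypothetical and are not part of any Liferay or MySQL API.

```python
from bisect import bisect_left

# Hypothetical index entries: (companyId, classNameId, classPK) tuples kept
# in sorted order, which is how a b-tree exposes its keys.
index = sorted([
    (1, 10, 100),
    (1, 10, 200),
    (1, 20, 100),
    (2, 10, 100),
    (2, 20, 300),
])

def prefix_lookup(index, prefix):
    """Binary search to the first entry with this leading prefix, then walk."""
    start = bisect_left(index, prefix)
    matches = []
    for entry in index[start:]:
        if entry[:len(prefix)] != prefix:
            break  # sorted order guarantees no later entry can match
        matches.append(entry)
    return matches

def skip_column_lookup(index, company_id, class_pk):
    """Without the middle column, this sketch falls back to scanning every entry."""
    return [e for e in index if e[0] == company_id and e[2] == class_pk]

first_two_columns = prefix_lookup(index, (1, 10))
first_and_third = skip_column_lookup(index, 1, 100)
```

The first lookup jumps straight to the matching run of entries, while the second cannot use the sort order for `classPK` at all. A real database may still use the leading column to narrow the scan, so treat the full scan here as a simplification.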

Since the data structure is sorted, this has a specific implication for `like` clauses. Namely, the `%` wildcard will only predictably use the index if it appears at the end, because placing one at the beginning or in the middle will force the database to second guess whether the index will really help query runtime (it usually depends on cardinality and how large the index is relative to the table).

To put this into perspective, a common side effect of this is that you have to be careful when you build queries for the Liferay `Group_` table. If there is an index on `companyId`, `classNameId`, and `classPK`, a database query that skips `classNameId` may not be able to use the index. Additionally, Liferay's obsession with alphabetization can theoretically result in unusable indices if we do not pay attention to what's happening.

### Bitmap

A [bitmap index](https://en.wikipedia.org/wiki/Bitmap_index) can be conceptualized as transforming a query into bitwise operations. It achieves this by converting every value for a column into a bitmap or bit set, where each bit corresponds to a row, and the value of the bit (whether it's 0 or 1) corresponds to whether that row has the specified value for this column.

This means that for multiple checks against the same table, you simply load the bitmap for each checked column value into vectors of 32-bit or 64-bit blocks, and perform vectorized bitwise operations. The resulting vector precisely describes the rows that satisfy all checks.
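The idea can be sketched with Python integers standing in for the bit vectors. The column names and values below are made up for illustration, and a real bitmap index would operate on fixed-width blocks rather than arbitrary-precision integers.

```python
# Hypothetical column values for six rows of a table.
status = ['open', 'closed', 'open', 'open', 'closed', 'open']
severity = ['high', 'high', 'low', 'high', 'low', 'low']

def build_bitmap_index(values):
    """Map each distinct value to an integer whose bit i is set when row i has that value."""
    bitmaps = {}
    for row_number, value in enumerate(values):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row_number)
    return bitmaps

def matching_rows(bitmap, row_count):
    """Translate the set bits of a bitmap back into row numbers."""
    return [i for i in range(row_count) if bitmap & (1 << i)]

status_index = build_bitmap_index(status)
severity_index = build_bitmap_index(severity)

# status = 'open' AND severity = 'high' becomes a single bitwise AND.
combined = status_index['open'] & severity_index['high']
result = matching_rows(combined, len(status))
```

Here `result` holds the row numbers that satisfy both checks, and swapping `&` for `|` would give the rows that satisfy either one.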

Naturally, because you can only have zeroes and ones in a bitmap, these can be compressed with run-length encodings to make these space efficient, but run-length encodings and similar compression strategies make it potentially expensive to update the value for a single entry.

* [Run-length encoding](https://en.wikipedia.org/wiki/Run-length_encoding)
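As a rough illustration of that trade-off, the sketch below run-length encodes a bitmap and then flips a single bit. The function names are hypothetical, and the decode-then-re-encode update is deliberately naive; production bitmap formats use more careful schemes, but the update still has to disturb the neighboring runs.

```python
from itertools import groupby

def run_length_encode(bits):
    """Collapse consecutive equal bits into (bit, run_length) pairs."""
    return [(bit, len(list(run))) for bit, run in groupby(bits)]

def run_length_decode(runs):
    """Expand (bit, run_length) pairs back into a flat list of bits."""
    return [bit for bit, length in runs for _ in range(length)]

def set_bit(runs, position, value):
    """Naive update: decode everything, change one bit, encode again."""
    bits = run_length_decode(runs)
    bits[position] = value
    return run_length_encode(bits)

bits = [0] * 8 + [1] * 3 + [0] * 5
runs = run_length_encode(bits)

# Flipping one bit in the middle of a long run splits it into three runs.
updated = set_bit(runs, 2, 1)
```

Sixteen bits compress into three runs, but changing one bit grows the encoding to five runs and shifts every run after it.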

As a result, bitmap indices are uncommon in write-heavy workload tables in active databases, but they are very popular in read-heavy workload tables in data warehouses.

## Find the Liferay Version

For each LESA ticket found in the `OSB_TicketEntry` table, we identify the version of Liferay based on the value in the `envLFR` column. Each value in this column corresponds to an entry in the `ListType` table, with a few different types, all of them having a `type_` that starts with `com.liferay.osb.model.ProductEntry`.

Of course, not all of these are used.

First, we'll find the used `ListType` values that correspond to Liferay versions, which can be achieved by looking for the distinct values of the `envLFR` column and checking for the matching `listTypeId` value in the `ListType` table. The query might look like the following.

```
SELECT *
FROM   ListType
WHERE  listTypeId IN (SELECT DISTINCT envLFR
                      FROM   OSB_TicketEntry)
       AND type_ LIKE 'com.liferay.osb.model.ProductEntry.%'
```

However, before we do that, we'll need to see if it's a good idea to execute the query by looking at the query plan.

### Guess the Query Plan

First, let's take a look at the database indices and compare them to our database query to see if we have an intuition about how the database query will take shape.

In [None]:
pd.DataFrame(run_query('show indexes from ListType'))

In [None]:
pd.DataFrame(run_query('select count(*) from ListType'))

From what we see above, there is only one index on the `ListType` table, and the current index statistics estimate there are 97 distinct values in 488 rows. There is an index on the `type_` column, which we've included in our query, so we'd expect the database to be able to use an index.

In [None]:
pd.DataFrame(run_query('show indexes from OSB_TicketEntry'))

In [None]:
pd.DataFrame(run_query('select count(*) from OSB_TicketEntry'))

One of the things that might happen with the above query is that it may demonstrate how the index statistics are approximate, if the database estimates that the cardinality of the `ticketEntryId` column is different from the actual count of entries in the `OSB_TicketEntry` table.

From what we see above, there are five different indices, but none of them are on the `envLFR` column. Therefore, we expect the database to iterate over all of the rows in order to extract the column value.

### Check the Query Plan

In [None]:
query = """
SELECT *
FROM   ListType
WHERE  listTypeId IN (SELECT DISTINCT envLFR
                      FROM   OSB_TicketEntry)
       AND type_ LIKE 'com.liferay.osb.model.ProductEntry.%'
"""

pd.DataFrame(explain_query(query))

If you've never seen this format before, the MySQL documentation provides a good high-level explanation of what is going on in the above query.

* [Explain Output Format](https://dev.mysql.com/doc/refman/5.7/en/explain-output.html#explain-join-types)

According to the explain plan, the database will materialize our subquery as a table (which, you might remember from the previous tutorial, is the result of any query).

`select envLFR from OSB_TicketEntry`

From there, it will compute the distinct values. Most databases compute a distinct by performing a sort and then iterating over the sorted result so that it can pick off the distinct values.
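A minimal sketch of that sort-then-scan approach, using made-up `envLFR` values (only 41000 appears in this notebook; the others are placeholders):

```python
from itertools import groupby

def sorted_distinct(values):
    """Sort first, then keep one representative from each run of equal values."""
    return [key for key, _ in groupby(sorted(values))]

# Hypothetical envLFR values pulled from a handful of ticket rows.
env_lfr_values = [41000, 40000, 41000, 39000, 40000, 41000]

distinct_values = sorted_distinct(env_lfr_values)
```

After sorting, equal values sit next to each other, so a single pass is enough to pick off each distinct value once.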

This list of distinct values will then be used with a filtered `ListType` table to find the entries of interest. Because we have only a prefix on our `LIKE` clause, and because there is an index on the `type_` column, this filtering is effectively a range query, where everything is greater than our prefix, and everything is less than whatever comes after it in a sorted list of possible values.
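The range interpretation of a prefix `LIKE` can be sketched with a sorted list and two binary searches. The `type_` values below are invented for illustration, and the helper assumes a non-empty prefix whose last character can be incremented.

```python
from bisect import bisect_left

def prefix_range(sorted_values, prefix):
    """Return every value starting with prefix by treating it as a range query."""
    # The smallest string that sorts after every string with this prefix
    # (assumes a non-empty prefix whose last character is not the maximum code point).
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    start = bisect_left(sorted_values, prefix)
    end = bisect_left(sorted_values, upper)
    return sorted_values[start:end]

# Hypothetical type_ values, sorted the way a b-tree index would store them.
type_values = sorted([
    'com.liferay.osb.model.AccountEntry.status',
    'com.liferay.osb.model.ProductEntry.portal',
    'com.liferay.osb.model.ProductEntry.plugin',
    'com.liferay.osb.model.TicketEntry.status',
])

product_types = prefix_range(type_values, 'com.liferay.osb.model.ProductEntry.')
```

Both bounds are found with a binary search, so only the matching slice of the sorted values is ever touched.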

### Execute the Query

In [None]:
%%time

pd.DataFrame(run_query(query))

From here, we see that our `name` column has a value resembling 7.0 only once, and therefore LESA tickets for DXP correspond to LESA tickets that specify that their `envLFR` value is 41000.

## Load Data from LESA

In [None]:
query = """
select * from OSB_TicketEntry where envLFR = 41000
"""

save_query('lesa/OSB_TicketEntry', query)

## Detour: Sort-Merge Join

Let's look back to hash-joins. For a hash table, in practice, we have an expected runtime that is $O(m) + O(n)$ and a worst-case runtime that is $O(m \cdot n) + O(n^2)$.

If you are concerned about worst-case runtimes, you could instead build a tree-like data structure that has $O(\log n)$ insertion and lookup time. As a result, performing the same join with this data structure in place of the hash table would have both an expected and a worst-case runtime of $O(m \log n) + O(n \log n)$.

Now let's take a look at the data we currently have. We have two different data sets that store every link, along with a `ticketEntryId` indicating which ticket contains this link. We also have a data set that stores each ticket coming from DXP, and it also has a `ticketEntryId` column.

An initial question we might ask is, given just these two tables, how do we combine these two data sets together? The straightforward solution is to recognize that we could perform a hash join or even a nested loop join, given that these are both very small tables.
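As a reminder of what that hash join looks like, here is a minimal sketch over lists of dictionaries. The rows and the `hash_join` function are hypothetical stand-ins for the link and ticket data sets, not the code this notebook uses to load them.

```python
def hash_join(links, tickets, key='ticketEntryId'):
    """Join two lists of row dictionaries on a shared key using a hash table."""
    # Build phase: index the ticket rows by their join key.
    tickets_by_key = {}
    for ticket in tickets:
        tickets_by_key.setdefault(ticket[key], []).append(ticket)

    # Probe phase: look up each link row's key in the hash table.
    joined = []
    for link in links:
        for ticket in tickets_by_key.get(link[key], []):
            row = dict(ticket)
            row.update(link)
            joined.append(row)
    return joined

# Hypothetical rows; only the ticketEntryId column name comes from the data sets.
links = [
    {'ticketEntryId': 1, 'jira_key': 'LPS-100'},
    {'ticketEntryId': 3, 'jira_key': 'LPP-200'},
    {'ticketEntryId': 1, 'jira_key': 'BPR-300'},
]
tickets = [
    {'ticketEntryId': 1, 'envLFR': 41000},
    {'ticketEntryId': 2, 'envLFR': 41000},
]

joined = hash_join(links, tickets)
```

The link for ticket 3 has no matching ticket row and ticket 2 has no links, so both drop out, which makes this an inner join.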

However, if the tables were both very large, could we theoretically do better?

### Sort-Merge Clustered Index

Imagine that all the rows in all our tables were sorted by `ticketEntryId`. In this situation, what we've effectively done is made `ticketEntryId` a clustered index.

* [What is a clustered and non-clustered index?](https://www.quora.com/What-is-a-clustered-and-non-clustered-index)

In this scenario, you could view the join as the equivalent of merging two already sorted lists, as in the merge step of merge sort.

* [Sorting visualizations](http://cs.stanford.edu/people/jcjohns/sorting.js/)

A sort-merge join is a join where the query optimizer decides the best way to accomplish the join is to sort the two tables on the specified join keys and then walk both tables in the same way as the merge step of merge sort.

* [Sort-merge join](https://en.wikipedia.org/wiki/Sort-merge_join)

Note that it doesn't necessarily have to use merge sort for this sort step, though merge sort is well-known to be one of the best external sorting algorithms (it's used by Hadoop during its shuffle step, for example).

* [External sorting](https://en.wikipedia.org/wiki/External_sorting)

From there, it will compare the sorted tables. It will place a cursor at the smallest key value for both tables (at the top of the sorted tables). At each step, it determines whether it should advance the cursor on one table or the other based on the values of the keys for the rows located at the current cursor position.
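The cursor movement described above can be sketched as follows. This is an illustrative inner join over lists of dictionaries with hypothetical rows, using Python's built-in sort for the sort step; it is not how any particular database implements the operator.

```python
def sort_merge_join(left, right, key='ticketEntryId'):
    """Inner join two lists of row dictionaries by sorting and then merging."""
    left = sorted(left, key=lambda row: row[key])
    right = sorted(right, key=lambda row: row[key])

    joined = []
    i = j = 0  # one cursor per sorted table

    while i < len(left) and j < len(right):
        left_key, right_key = left[i][key], right[j][key]

        if left_key < right_key:
            i += 1  # left row has no partner yet, advance the left cursor
        elif left_key > right_key:
            j += 1  # right row has no partner yet, advance the right cursor
        else:
            # Find the run of right rows sharing this key.
            run_end = j
            while run_end < len(right) and right[run_end][key] == left_key:
                run_end += 1

            # Pair every left row with this key against that run.
            while i < len(left) and left[i][key] == left_key:
                for right_row in right[j:run_end]:
                    row = dict(left[i])
                    row.update(right_row)
                    joined.append(row)
                i += 1
            j = run_end

    return joined

# Hypothetical rows, deliberately out of order.
links = [
    {'ticketEntryId': 3, 'jira_key': 'LPP-200'},
    {'ticketEntryId': 1, 'jira_key': 'LPS-100'},
    {'ticketEntryId': 1, 'jira_key': 'BPR-300'},
]
tickets = [
    {'ticketEntryId': 2, 'envLFR': 41000},
    {'ticketEntryId': 1, 'envLFR': 41000},
]

joined = sort_merge_join(links, tickets)
```

If both inputs are already sorted on the join key, the two `sorted` calls become unnecessary and only the linear merge remains.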

If you need a visualization in order to understand how the cursor advances in merge sort (possibly due to lack of familiarity with the merge sort algorithm), you're encouraged to consult this visualization.

* https://www.youtube.com/watch?v=kPRA0W1kECg#t=1m7

### Sort-Merge Sargable Query

Imagine that both of the tables had a b-tree index on `ticketEntryId`. You would be able to identify which rows in each table needed to be joined by walking the b-tree index for both tables in the same merge sort style as for walking two sorted tables.