# Database Joins 2: Combining LESA with JIRA

For this notebook, we start with the following research question. "Can we create data visualizations on top of the LESA, LPS, LPP, and BPR ticket metadata that let us group together different tickets so that we can explore the times that tickets remain in each status based on those groupings?"

While this is inherently a data visualization question, we can first narrow it to a smaller question and see whether a little data exploration gives us hints on how the visualization should be implemented.

<b style="color:green">Can we create data visualizations on top of LESA, LPS, LPP, and BPR ticket metadata that let us group together DXP tickets so that we can explore the times that tickets remain in each status based on those groupings?</b>

This notebook assumes that you've loaded a backup of the LESA database from `files.liferay.com` into a database named `lportal`. LESA (like many internal Liferay systems) lacks a useful API for data analysis, so we will extract the data by querying the database directly.

## Prerequisites

The following cell attempts to use `conda` to install the libraries that are used by this notebook. If the output indicates that additional items were installed, you will need to restart the kernel after the installation completes before you can run the later cells in the notebook.

In [None]:
!conda install -y beautifulsoup4 mysql-connector-python

## Notebook Imports

In [None]:
from __future__ import print_function

from checklpp import *

from datetime import datetime
import mysql.connector
import os
import pandas as pd
import re
import requests
import ujson as json

## Save Raw Data

Before we start, we'll make sure that we establish one rule for this script and all future scripts: we will save all raw data with timestamps. This time, we're going to be saving the results of a database query.

In [None]:
connection = mysql.connector.connect(
    user='lportal', password='lportal',
    host='ec2-34-208-59-105.us-west-2.compute.amazonaws.com', database='lportal'
)

In [None]:
cursor = connection.cursor()

In [None]:
def save_query(cache_name, query, row_function=None):
    file_name = get_file_name(cache_name, 'json')

    if os.path.exists(file_name):
        print('[%s] Skipping query execution due to cache file' % datetime.now().isoformat())
        return

    print('[%s] Executing query %s' % (datetime.now().isoformat(), query))
    
    cursor.execute(query)

    with open(file_name, 'w') as outfile:
        for i, item in enumerate(cursor):
            if i % 1000 == 0:
                print('[%s] Processed %d items' % (datetime.now().isoformat(), i))

            row_value = {key: value for key, value in zip(cursor.column_names, item)}

            if row_function is None:
                save_row(outfile, [], row_value)
                continue

            for return_value in row_function(row_value):
                save_row(outfile, [], return_value)

In [None]:
def load_query(cache_name, row_function=None):
    file_name = get_file_name(cache_name, 'json')

    with open(file_name, 'r') as infile:
        if row_function is None:
            return [load_row(line) for line in infile]
        else:
            return [row_function(load_row(line)) for line in infile]
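
The helpers `get_file_name`, `save_row`, and `load_row` come from `checklpp` and are not shown in this notebook. As a rough sketch of the idea only, assuming the cache holds one JSON object per line (which the line-by-line loop in `load_query` suggests), a round trip could look like the following. The names `sketch_save_row` and `sketch_load_row` are hypothetical stand-ins, not the real `checklpp` functions, and the rows are invented.

```python
import io
import json
from datetime import datetime


def sketch_save_row(outfile, row_value):
    # Hypothetical stand-in for checklpp's save_row: one JSON object per line.
    # default=str converts values json cannot encode (such as datetime) to strings.
    outfile.write(json.dumps(row_value, default=str))
    outfile.write('\n')


def sketch_load_row(line):
    # Hypothetical stand-in for checklpp's load_row: parse one line into a dict.
    return json.loads(line)


# Round trip through an in-memory buffer instead of a real cache file.
buffer = io.StringIO()
sketch_save_row(buffer, {'ticketEntryId': 1, 'createDate': datetime(2017, 1, 2, 3, 4, 5)})
sketch_save_row(buffer, {'ticketEntryId': 2, 'createDate': None})

buffer.seek(0)
rows = [sketch_load_row(line) for line in buffer]
print(rows)
```

Note that in this sketch a `datetime` comes back as a plain string, so any date arithmetic after loading would need to parse it again.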

In [None]:
def explain_query(query):
    print('[%s] Explaining query %s' % (datetime.now().isoformat(), query))
    
    cursor.execute('explain %s' % query)

    return [
        { key: value for key, value in zip(cursor.column_names, item) }
            for item in cursor
    ]

def run_query(query):
    print('[%s] Executing query %s' % (datetime.now().isoformat(), query))
    
    cursor.execute(query)

    return [
        { key: value for key, value in zip(cursor.column_names, item) }
            for item in cursor
    ]

## Extract JIRA Links

### Extract Explicit JIRA Links

Some links are found in the `OSB_TicketLink` table.

In [None]:
query = """
select * from OSB_TicketLink where url like 'https://issues.liferay.com/%'
"""

save_query('lesa/JIRALink_1', query)

### Extract JIRA Links in Comments

Some links are buried inside of the Liferay-only sections of comments and never formally linked on the ticket. Therefore, we'll need to perform some text extraction in order to identify those links.

In [None]:
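# Raw string avoids the invalid "\d" escape; dots are escaped so they match
# literally, and "+" requires at least one letter and one digit in the issue key.
p1 = re.compile(r'https://issues\.liferay\.com/browse/[A-Z]+-\d+')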

patterns = [p1]

def extract_links(row_value):
    for url in [item for p in patterns for item in p.findall(row_value['body'])]:
        yield {
            'userName': row_value['userName'],
            'url': url,
            'createDate': row_value['createDate'],
            'userId': row_value['userId'],
            'visibility': row_value['visibility'],
            'type_': row_value['type_'],
            'ticketEntryId': row_value['ticketEntryId'],
            'ticketCommentId': row_value['ticketCommentId']
        }
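
To illustrate the kind of extraction `extract_links` performs, here is a small self-contained sketch run against a made-up comment body. The sample text and issue keys are invented, and `link_pattern` is a pattern written for this sketch (raw string, escaped dots, at least one letter and one digit required in the issue key).

```python
import re

# Pattern written for this sketch only.
link_pattern = re.compile(r'https://issues\.liferay\.com/browse/[A-Z]+-\d+')

# Invented comment body for illustration only.
sample_body = (
    'Reproduced on master, see https://issues.liferay.com/browse/LPS-12345. '
    'Related fix: https://issues.liferay.com/browse/LPP-678, and '
    'https://issues.liferay.com/browse/ is not a ticket link.'
)

# findall returns every non-overlapping match, in order of appearance.
found = link_pattern.findall(sample_body)
print(found)
```

The trailing period and comma are not part of the matches, and the bare `/browse/` URL is skipped because it has no issue key.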

In [None]:
query = """
select * from OSB_TicketComment where body like '%https://issues.liferay.com/%'
"""

save_query('lesa/JIRALink_2', query, extract_links)

## Detour: Database Indices

## Finding the Liferay Version

For each LESA ticket found in the `OSB_TicketEntry` table, we identify the version of Liferay based on the value in the `envLFR` column. Each value in this column corresponds to an entry in the `ListType` table, with a few different types, all of them having a `type_` that starts with `com.liferay.osb.model.ProductEntry`.

Of course, not all of these are used.

First, we'll find the used `ListType` values that correspond to Liferay versions, which can be achieved by looking for the distinct values of the `envLFR` column and checking for the matching `listTypeId` value in the `ListType` table. The query might look like the following.

```
SELECT * 
FROM   ListType 
WHERE  listTypeId IN (SELECT DISTINCT envLFR 
                      FROM   OSB_TicketEntry) 
       AND type_ LIKE 'com.liferay.osb.model.ProductEntry.%' 
```

However, before we do that, we'll look at the query plan to check whether the query is reasonable to execute.

### Guess the Query Plan

First, let's take a look at the database indices and compare them to our database query to build an intuition about how the database will execute it.

In [None]:
pd.DataFrame(run_query('show indexes from ListType'))

In [None]:
pd.DataFrame(run_query('show indexes from OSB_TicketEntry'))

### Check the Query Plan

In [None]:
query = """
SELECT * 
FROM   ListType 
WHERE  listTypeId IN (SELECT DISTINCT envLFR 
                      FROM   OSB_TicketEntry) 
       AND type_ LIKE 'com.liferay.osb.model.ProductEntry.%'
"""

pd.DataFrame(explain_query(query))

If you've never seen this format before, the MySQL documentation provides a good high-level explanation of how to read the output above.

* [Explain Output Format](https://dev.mysql.com/doc/refman/5.7/en/explain-output.html#explain-join-types)

According to the explain plan, the database will materialize our subquery.

`select envLFR from OSB_TicketEntry`

From there, it will compute the distinct values. One common way to compute a distinct is to sort the values and then iterate over the sorted result, picking off each value the first time it appears; hash-based deduplication is another.
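
The sort-then-scan approach described above can be sketched in plain Python. This only illustrates the idea and is not a claim about what MySQL does internally for this particular query; the function name and the ids are made up.

```python
def distinct_by_sorting(values):
    # Sort first so that equal values end up next to each other.
    ordered = sorted(values)

    distinct = []
    for value in ordered:
        # Keep a value only when it differs from the last one we kept.
        if not distinct or distinct[-1] != value:
            distinct.append(value)
    return distinct


# Invented envLFR-style ids for illustration only.
sample_env_lfr = [7010, 6210, 7010, 6120, 6210, 7010]
distinct_values = distinct_by_sorting(sample_env_lfr)
print(distinct_values)
```

The sort dominates the cost, which is why a large unindexed column can make a `DISTINCT` expensive.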

This list of distinct values will then be matched against a filtered `ListType` table to find the entries of interest. Because our `LIKE` pattern has a wildcard only at the end (a prefix match), and because there is an index on the `type_` column, this filtering is implemented as a range query.
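
To see why a prefix match can be answered as a range query over a sorted index, here is a minimal sketch using a sorted Python list in place of the index. It assumes plain code-point ordering, which is a simplification of a real MySQL collation, and the `type_` values are invented.

```python
from bisect import bisect_left


def prefix_range(sorted_values, prefix):
    # A prefix match equals the range [prefix, upper), where upper is the
    # prefix with its last character bumped to the next code point.
    # Simplification: assumes a non-empty prefix.
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    start = bisect_left(sorted_values, prefix)
    end = bisect_left(sorted_values, upper)
    return sorted_values[start:end]


# Invented type_ values, kept sorted the way an index would keep them.
sample_types = sorted([
    'com.liferay.osb.model.AccountEntry.tier',
    'com.liferay.osb.model.ProductEntry.portal',
    'com.liferay.osb.model.ProductEntry.social',
    'com.liferay.osb.model.TicketEntry.status',
])

matches = prefix_range(sample_types, 'com.liferay.osb.model.ProductEntry.')
print(matches)
```

Two binary searches locate the start and end of the matching block, so no row outside the range has to be examined. A pattern with a leading wildcard, such as `'%ProductEntry.%'`, has no such contiguous block and cannot be narrowed this way.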

### Execute the Query

In [None]:
%%time

pd.DataFrame(run_query(query))

In [None]:
save_query('lesa/ListType_EnvLFR', query)