# Database Joins 2: Combining LESA with JIRA

For this notebook, our research question is as follows:

<b style="color:green">Can we create data visualizations on top of the LESA, LPS, LPP, and BPR ticket metadata that let us group tickets together, so that we can explore how long tickets remain in each status for each grouping?</b>

While this is inherently a data visualization question, we can break it down into smaller questions and see whether a little data exploration hints at how the visualization should be implemented.

## Prerequisites

The following cell attempts to use `conda` to install libraries that are used by this notebook. If the output indicates that additional items were installed, restart the kernel after the installation completes before running the later cells in the notebook.

In [None]:
!conda install -y beautifulsoup4 mysql-connector-python

## Notebook Imports

In [None]:
from __future__ import print_function

from checklpp import *

from datetime import datetime
import mysql.connector
import pandas as pd
import re
import requests
import ujson as json

## Save Raw Data

Before we start, we'll establish one rule for this script and all future scripts: all raw data is saved with timestamps. This time, the raw data is the result of a database query.

In [None]:
connection = mysql.connector.connect(
    user='lportal', password='lportal',
    host='ec2-34-208-59-105.us-west-2.compute.amazonaws.com', database='lportal'
)

In [None]:
cursor = connection.cursor()

In [None]:
def save_query(cache_name, query, row_function=None):
    file_name = get_file_name(cache_name, '.json')
    cursor.execute(query)

    with open(file_name, 'w') as outfile:
        for i, item in enumerate(cursor):
            if i % 1000 == 0:
                print('[%s] Processed %d items' % (datetime.now().isoformat(), i))

            row_value = dict(zip(cursor.column_names, item))

            if row_function is None:
                save_row(outfile, [], row_value)
                continue

            for return_value in row_function(row_value):
                save_row(outfile, [], return_value)
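
The helpers `get_file_name` and `save_row` come from `checklpp` and are not shown in this notebook. As a rough sketch of the timestamped raw-data rule, the hypothetical stand-ins below (`timestamped_file_name` and `write_json_line`, both names and signatures assumed for illustration) write one JSON object per line to a file whose name embeds the time of the run. The standard-library `json` module is used instead of `ujson` so that the sketch has no extra dependencies, and the sample rows are made up.

```python
import json
import os
import tempfile
from datetime import datetime


def timestamped_file_name(cache_name, extension, directory='.'):
    """Hypothetical stand-in for checklpp.get_file_name."""
    stamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    return os.path.join(directory, '%s_%s%s' % (cache_name, stamp, extension))


def write_json_line(outfile, row_value):
    """Hypothetical stand-in for checklpp.save_row: one JSON object per line."""
    # default=str converts values such as datetime into strings
    outfile.write(json.dumps(row_value, default=str))
    outfile.write('\n')


# Made-up rows shaped like the dictionaries built inside save_query
rows = [
    {'ticketEntryId': 1, 'createDate': datetime(2017, 1, 2, 3, 4, 5)},
    {'ticketEntryId': 2, 'createDate': datetime(2017, 6, 7, 8, 9, 10)},
]

example_dir = tempfile.mkdtemp()
example_file = timestamped_file_name('JIRALink_example', '.json', example_dir)

with open(example_file, 'w') as outfile:
    for row in rows:
        write_json_line(outfile, row)

# Read the file back to confirm each line is an independent JSON document
with open(example_file) as infile:
    loaded_rows = [json.loads(line) for line in infile]
```

Writing one object per line means a partially completed export is still readable up to the last full line, which is convenient for long-running queries.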

## Load Explicit JIRA Links

This section assumes that you've loaded a backup of the LESA database from `files.liferay.com` into a database named `lportal`. LESA, like many internal Liferay systems, lacks a useful API for data analysis, so we extract the data by querying the database directly.

In [None]:
query = "select * from OSB_TicketLink where url like 'https://issues.liferay.com/%'"

save_query('JIRALink_1', query)
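
To illustrate which rows the `like` prefix filter keeps, here is a minimal sketch that runs the same query against an in-memory SQLite table instead of the MySQL backup. Only the table name and the `url` column come from the query above; the `ticketLinkId` column and all sample rows are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(':memory:')

# Assumed, simplified schema: the real OSB_TicketLink table has more columns
conn.execute('create table OSB_TicketLink (ticketLinkId integer, url text)')
conn.executemany(
    'insert into OSB_TicketLink values (?, ?)',
    [
        (1, 'https://issues.liferay.com/browse/LPS-100'),
        (2, 'https://www.liferay.com/documentation'),
        (3, 'https://issues.liferay.com/browse/LPP-200'),
    ]
)

example_cursor = conn.execute(
    "select * from OSB_TicketLink where url like 'https://issues.liferay.com/%'"
)

# Pair column names with values, as save_query does with cursor.column_names
column_names = [description[0] for description in example_cursor.description]
example_rows = [dict(zip(column_names, row)) for row in example_cursor]

conn.close()
```

Only the two rows whose `url` starts with the JIRA host survive the filter; the documentation link is dropped.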

## Extract Implicit JIRA Links

Some links are buried inside of the Liferay-only sections of comments and never formally linked on the ticket. Therefore, we'll need to perform some text extraction in order to identify those links.

In [None]:
# Raw string with escaped dots; require at least one letter and one digit in the ticket key
p1 = re.compile(r'https://issues\.liferay\.com/browse/[A-Z]+-\d+')

patterns = [p1]

def extract_links(row_value):
    for url in [item for p in patterns for item in p.findall(row_value['body'])]:
        yield {
            'userName': row_value['userName'],
            'url': url,
            'createDate': row_value['createDate'],
            'userId': row_value['userId'],
            'visibility': row_value['visibility'],
            'type_': row_value['type_'],
            'ticketEntryId': row_value['ticketEntryId'],
            'ticketCommentId': row_value['ticketCommentId']
        }
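
To see what this kind of pattern picks up, here is a small self-contained check against a made-up comment body. The sample text and ticket keys are invented for illustration. The pattern is written out again so that the block runs on its own; it escapes the dots and requires at least one letter and one digit in the ticket key.

```python
import re

link_pattern = re.compile(r'https://issues\.liferay\.com/browse/[A-Z]+-\d+')

# Invented comment body with two ticket links and one non-ticket JIRA URL
sample_body = (
    'Customer reports the same stack trace as in '
    'https://issues.liferay.com/browse/LPS-12345. '
    'Possibly related: https://issues.liferay.com/browse/LPP-678, '
    'see also https://issues.liferay.com/secure/Dashboard.jspa'
)

found_links = link_pattern.findall(sample_body)

# The ticket key is the last path segment of each matched URL
ticket_keys = [url.rsplit('/', 1)[-1] for url in found_links]

print(ticket_keys)  # ['LPS-12345', 'LPP-678']
```

The trailing period and comma are not captured because the match stops at the last digit, and the dashboard URL is skipped because it has no `/browse/` ticket path.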

In [None]:
query = "select * from OSB_TicketComment where body like '%https://issues.liferay.com/%'"

save_query('JIRALink_2', query, extract_links)