# Database Joins 2: Visualizing JIRA Statuses

For this notebook, we start with the following research question: "Can we create data visualizations on top of the LESA, LPS, LPP, and BPR ticket metadata that let us group together different tickets so that we can explore the times that tickets remain in each status based on those groupings?"

In order to investigate the answer to this question, we start with a much smaller sub-question that focuses on more recent data.

<b style="color:green">Can we create data visualizations on top of LPS ticket metadata that let us group together different DXP tickets so that we can explore the times that tickets remain "In Review" based on those groupings?</b>

In order to answer this question, this notebook looks at simple visualizations of the JIRA data based on several ways you can group that data.

At the end of this notebook, we will have a script that takes a sample of data from JIRA and enriches it with more data from JIRA, and the reader will have an improved understanding of the capabilities databases provide when it comes to processing information for visualization.

## Prerequisites

The following cell attempts to use `conda` to install the libraries that are used by this notebook. If the output indicates that additional items were installed, you will need to restart the kernel after the installation completes before you can run the later cells in the notebook.

In [None]:
!conda install -y matplotlib scikit-learn seaborn statsmodels

## Notebook Imports

In [None]:
%matplotlib inline

In [None]:
from __future__ import print_function

from checklpp import *
from datetime import datetime, timedelta
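import functools
import matplotlib
from multiprocessing import Pool, cpu_count
from matplotlib import pyplot as plt
# Later cells use np and pd; import them explicitly rather than relying on
# the wildcard import from checklpp.
import numpy as np
import pandas as pd
import seaborn as sns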
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression

## Enable Process Pool

Most of our computations can run independently of each other, so let's take advantage of some parallelization that's available on our machine.

In [None]:
pool = Pool(cpu_count())

In [None]:
plt.rcParams['figure.figsize'] = (6.4*2, 4.8*1.5)

## Detour: Database Views

As you work at Liferay, it is easy to be led to believe that databases exist only as simple information storage and retrieval systems, because that is really all Liferay does with a database. However, the truth is much more complex.

One of the other reasons databases exist is to manage relationships between data sets and allow you to analyze those relationships. As a result, many enterprise database vendors have created a rich set of proprietary functions on top of SQL that allow you to perform very insightful analysis.

Because you're uncovering relationships within the data, whenever you want to answer questions, you often join together tables and then perform additional analysis on top of those joined tables. It is not uncommon for an analysis to simply take multiple sets of SQL used to generate join tables and use them as subqueries.

Over time, it can be tedious to paste in the same nested queries over and over again, and having too many of them also makes it harder for anyone reading the query to understand what you were trying to analyze. To address this problem, it is common to create a database view.

* [Why do you create a view in a database?](https://stackoverflow.com/questions/1278521/why-do-you-create-a-view-in-a-database)

You can think of a database view as an alias for a query result that someone has needed before. We also know that every query result is a table, and knowing why people want that information to begin with (constantly reusing it in different analyses), you can also see why people might like a feature that allows you to materialize those views.

* [Materialized view](https://en.wikipedia.org/wiki/Materialized_view)

In our case, whenever we fetch tickets from JIRA, the JSON object may be cumbersome to work with. Therefore, in order to make it easier for people to understand the underlying information, we will create a view on top of it that allows us to concentrate on the specific fields we want to analyze. As we do this, however, we should keep in mind that all of the original raw data still exists, should we ever need it.
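
As a minimal sketch of the idea, the following uses Python's built-in `sqlite3` module to define a view over a join and then query it like an ordinary table. The table names, column names, and ticket keys are made up for illustration; they are not the schema JIRA or Liferay actually uses.

```python
import sqlite3

# In-memory database with two hypothetical tables (not JIRA's real schema).
conn = sqlite3.connect(':memory:')
conn.executescript("""
CREATE TABLE ticket (jira_key TEXT PRIMARY KEY, region TEXT);
CREATE TABLE status_change (jira_key TEXT, status TEXT, days REAL);

INSERT INTO ticket VALUES ('DEMO-1', 'US'), ('DEMO-2', 'EU');
INSERT INTO status_change VALUES
    ('DEMO-1', 'In Review', 1.5),
    ('DEMO-1', 'In Review', 0.5),
    ('DEMO-2', 'In Review', 3.0);

-- The view gives the join a name so later queries do not have to repeat it.
CREATE VIEW ticket_status_time AS
SELECT t.jira_key, t.region, s.status, SUM(s.days) AS total_days
FROM ticket t
JOIN status_change s ON s.jira_key = t.jira_key
GROUP BY t.jira_key, t.region, s.status;
""")

# The view is queried exactly like a table.
view_rows = conn.execute(
    'SELECT jira_key, region, total_days FROM ticket_status_time ORDER BY jira_key'
).fetchall()
print(view_rows)
```

The underlying `ticket` and `status_change` rows are untouched, which mirrors the point above: the view narrows what we look at without discarding the raw data.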

## Fetch LPP Tickets

Reminding ourselves of our question:

<b style="color:green">Can we create data visualizations on top of LPS ticket metadata that let us group together different DXP tickets so that we can explore the times that tickets remain "In Review" based on those groupings?</b>

The key part here is that when we retrieve each ticket from JIRA, we're interested in the times that tickets remain in each status. So, let's start by making sure that, at the very least, we can extract that metadata from an LPP ticket.

First, we'll look at LPP tickets that relate to DXP, but are not any of the known workflow testing tickets (because those constantly close and re-open and might end up with the DXP affected version).

In [None]:
lpp_jql = """
project = LPP and affectedVersion = "7.0 DE (7.0.10)" and
key not in (LPP-10825,LPP-10826,LPP-12114,LPP-13367)
"""

It turns out that JIRA allows you to perform a join as part of its API by requesting an expansion of certain fields. Because we're interested in the time the ticket spends in each status, we can ask it to expand the `changelog` field, which effectively asks JIRA to perform a join on its equivalent of a `changelog` table.
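
We do not see how `get_jira_issues` is implemented here, so the following is only a sketch of what such a request could look like against the JIRA REST API search resource (`/rest/api/2/search`), which accepts `jql`, `expand`, `startAt`, and `maxResults` parameters. The helper name `build_search_url` and the host name are hypothetical.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2


def build_search_url(base_url, jql, expand, start_at=0, max_results=50):
    """Build a search URL that asks JIRA to expand the given fields.

    Hypothetical helper for illustration; not the checklpp implementation.
    """
    query = urlencode([
        ('jql', ' '.join(jql.split())),  # collapse newlines in the JQL
        ('expand', ','.join(expand)),
        ('startAt', start_at),
        ('maxResults', max_results),
    ])
    return '%s/rest/api/2/search?%s' % (base_url.rstrip('/'), query)


example_url = build_search_url(
    'https://jira.example.com', 'project = LPP', ['changelog'])
print(example_url)
```

Each issue in the response would then carry a `changelog` object alongside its `fields`, which is what the extraction code below relies on.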

In [None]:
if __name__ == '__main__':
    lpp_issues = get_jira_issues(lpp_jql, ['changelog'])
else:
    lpp_issues = {}

Let's accumulate the history transitions. Note that we're going to be making a simplification that will matter later on in the notebook: namely, we're going to assume that the creation of the issue is where we'll focus for the visualization, rather than the time of the transition.

This can be a dangerous simplification if you are to use this for modeling or machine learning, but it's a useful simplification because it allows us to have a cumulative value for how long something spends in each status rather than having to understand each separate value (in the event that it enters the "In Review" status multiple times, for example).
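
To make the cumulative value concrete, here is a small sketch with a made-up ticket that enters "In Review" twice. The dates and transitions are invented for illustration and the code is independent of the notebook's own extraction function.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical ticket: created on Jan 1, then four status transitions.
created = datetime(2018, 1, 1)
transitions = [  # (when, from_status, to_status)
    (datetime(2018, 1, 2), 'Open', 'In Review'),
    (datetime(2018, 1, 4), 'In Review', 'In Progress'),
    (datetime(2018, 1, 5), 'In Progress', 'In Review'),
    (datetime(2018, 1, 8), 'In Review', 'Closed'),
]

days_in_status = defaultdict(float)
previous_time = created

for when, from_status, to_status in transitions:
    # Time since the previous transition was spent in the status we are leaving.
    days_in_status[from_status] += (when - previous_time).total_seconds() / 86400.0
    previous_time = when

print(dict(days_in_status))
```

The two separate visits to "In Review" (two days and three days) collapse into a single five-day total, which is exactly the information we give up by accumulating.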

In [None]:
%%writefile jiratime.py
from checklpp import get_time_delta_as_days
from collections import defaultdict
import dateparser

region_field_name = 'customfield_11523'

def extract_time(issue):
    """Return one record per (assignee, status) with the days spent there."""
    fields = issue['fields']
    issue_created = dateparser.parse(fields['created']).date()

    if fields.get(region_field_name) is not None:
        regions = [region['value'] for region in fields[region_field_name]]
    else:
        regions = ['']

    # Unassigned issues have a null assignee in the JIRA response.
    assignee_field = fields.get('assignee')
    current_assignee = assignee_field['displayName'] if assignee_field else 'Unassigned'

    old_status = 'Open'
    old_status_date = dateparser.parse(fields['created'])

    status_times = defaultdict(lambda: defaultdict(float))
    history_entries = issue['changelog']['histories']

    for history_entry in history_entries:
        history_entry['createdTime'] = dateparser.parse(history_entry['created'])

    # Sort chronologically; key= works on both Python 2 and Python 3.
    sorted_history_entries = sorted(
        history_entries, key=lambda entry: entry['createdTime'])

    # The assignee at creation is the "from" side of the first assignee change.
    for history_entry in sorted_history_entries:
        assignee_items = [item for item in history_entry['items'] if item['field'] == 'assignee']
        if assignee_items:
            current_assignee = assignee_items[0]['fromString'] or 'Unassigned'
            break

    for history_entry in sorted_history_entries:
        useful_history_items = [
            item for item in history_entry['items']
                if item['field'] == 'status' or item['field'] == 'assignee'
        ]

        for item in useful_history_items:
            # The time since the previous change belongs to the value the
            # field had before this change.
            if item['field'] == 'assignee':
                current_assignee = item['fromString'] or 'Unassigned'
            else:
                old_status = item['fromString']

            new_status_date = history_entry['createdTime']

            elapsed_time = get_time_delta_as_days(new_status_date - old_status_date)

            status_times[current_assignee][old_status] += elapsed_time
            status_times['(all assignees)'][old_status] += elapsed_time

            old_status_date = new_status_date

            # From here on, the ticket carries the new value.
            if item['field'] == 'assignee':
                current_assignee = item['toString'] or 'Unassigned'
            else:
                old_status = item['toString']

    return [
        {
            'jiraKey': issue['key'],
            'type': fields['issuetype']['name'],
            'assignee': assignee,
            'region': regions[0],
            'issueCreated': issue_created,
            'status': status,
            'elapsedTime': elapsed_time
        }
        for assignee, record in status_times.items()
            for status, elapsed_time in record.items()
    ]

The following block uses the `reload` ability to allow us to constantly change the time extraction above (which writes to a separate Python file so that it can be parallelized) and then reload it.

In [None]:
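import jiratime

try:
    from importlib import reload  # Python 3; on Python 2, reload is a builtin
except ImportError:
    pass

reload(jiratime)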

Now, we perform our parallel processing.

In [None]:
times = []

region_field_name = 'customfield_11523'

num_finished = 0

for result in pool.imap_unordered(jiratime.extract_time, lpp_issues.values()):
    if num_finished % 100 == 0:
        print('[%s] Processed %d of %d issues' % (datetime.now().isoformat(), num_finished, len(lpp_issues)))

    num_finished += 1

    for entry in result:
        times.append(entry)

## Table Visualizations

Now that we have our view of the JIRA data, naturally, we'll want to visualize it. We'll start with the most basic visualization: a table.

One of the first lessons in data visualization is that a table of descriptive statistics leaves out a large part of the story.

This is because when you boil something down to a single number (whether it's a measure of central tendency or really any other descriptive statistic), all of the context is lost in that summarization. It's possible for things that have very similar statistics and even very similar linear regression predictions to actually be quite different from each other.

* [Anscombe's quartet](https://en.wikipedia.org/wiki/Anscombe%27s_quartet)
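
A much smaller illustration of the same point, using two made-up lists of elapsed times rather than Anscombe's data: both have the same mean and the same median, yet they describe very different situations.

```python
import statistics

# Two invented samples of "days in status" with identical central tendency.
steady = [5, 5, 5, 5, 5]   # every ticket takes five days
spread = [1, 1, 5, 9, 9]   # tickets are either very fast or very slow

summary = {
    name: {
        'mean': statistics.mean(values),
        'median': statistics.median(values),
        'stdev': statistics.pstdev(values),
    }
    for name, values in [('steady', steady), ('spread', spread)]
}

for name, stats in summary.items():
    print(name, stats)
```

A table that reports only the mean and median would show these two groups as identical, which is why the later sections look at the whole distribution.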

However, tables give us a very concise overview of a lot of information at once, and they're extremely easy to generate, so they make for an excellent starting point to give you a sense of what you're dealing with.

In [None]:
df = pd.DataFrame(times)

Liferay has a lot of different statuses, so we'll limit our exploration to only a handful of states that we know are related to tickets that are in progress, and we'll ignore the individual assignee times (at least, in our initial analysis).

In [None]:
statuses = ['Open', 'Verified', 'In Progress', 'In Review', 'On Hold']

In [None]:
df = df[df['status'].isin(statuses) & (df['assignee'] == '(all assignees)')]
del df['assignee']

### Aggregate Statistics

First, we look at everything as a single unit, and we simply try to visualize how long we spend in each status.

In [None]:
df_groupby = df[['status', 'elapsedTime']].groupby(['status'])

In [None]:
df_count = df_groupby.count()
df_count = df_count.rename(columns={'elapsedTime': 'count'})
df_norm1 = df_groupby.median()
df_norm1 = df_norm1.rename(columns={'elapsedTime': 'elapsedTimeMedian'})
df_norm2 = df_groupby.mean()
df_norm2 = df_norm2.rename(columns={'elapsedTime': 'elapsedTimeMean'})
df_quant = df_groupby.quantile([0.8, 0.9]).unstack()
df_quant.columns = ['%s%d%%' % (col[0], col[1] * 100) for col in df_quant.columns.values]

In [None]:
df_count.join(df_norm1).join(df_norm2).join(df_quant)

### Region-Specific Statistics

We can add one more level to this and group the statistics by region, and then take a look at what the statistics look like on the different ticket statuses.

In [None]:
df_groupby = df[['region', 'status', 'elapsedTime']].groupby(['region', 'status'])

In [None]:
df_count = df_groupby.count()
df_count = df_count.rename(columns={'elapsedTime': 'count'})
df_norm1 = df_groupby.median()
df_norm1 = df_norm1.rename(columns={'elapsedTime': 'elapsedTimeMedian'})
df_norm2 = df_groupby.mean()
df_norm2 = df_norm2.rename(columns={'elapsedTime': 'elapsedTimeMean'})
df_quant = df_groupby.quantile([0.8, 0.9]).unstack()
df_quant.columns = ['%s%d%%' % (col[0], col[1] * 100) for col in df_quant.columns.values]

In [None]:
df_count.join(df_norm1).join(df_norm2).join(df_quant)

## Density Plot Visualizations

Another way to understand summary statistics is to visualize how the data is distributed. When looking at one-dimensional values (for example, the distribution of the time elapsed), this is often explained as a probability density function, which allows you to see how the data is concentrated. Peaks indicate that many values exist near that value.

* [An introduction to kernel density estimation](http://www.mvstat.net/tduong/research/seminars/seminar-2001-05/)
* [The idea of a probability density function](http://mathinsight.org/probability_density_function_idea)
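
For intuition, a kernel density estimate can be written in a few lines of plain Python. This is a sketch of a Gaussian kernel estimate with a fixed bandwidth expressed in the data's own units; it is not how seaborn computes its curves, and the bandwidth here does not necessarily correspond to the `bw` argument used in the plotting cells. The sample values are made up.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a density function built from one Gaussian bump per sample."""
    normalizer = len(samples) * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        return sum(
            math.exp(-0.5 * ((x - sample) / bandwidth) ** 2)
            for sample in samples
        ) / normalizer

    return density


# Three tickets that took about a day and one that took six days.
density = gaussian_kde([1.0, 1.2, 1.1, 6.0], bandwidth=0.5)

for x in (1.1, 3.5, 6.0):
    print(x, round(density(x), 4))
```

The estimate peaks near 1.1, where three of the four samples sit, has a smaller bump near 6.0, and is close to zero in between.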

Once you have the probability density function, you can derive the cumulative distribution function. The values tell you how much of the data is less than a certain value, while the shape tells you how quickly that probability mass builds up.

* [Cumulative distribution function](http://www.itl.nist.gov/div898/handbook/eda/section3/eda362.htm)
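
When you only have samples rather than a fitted density, a common alternative is the empirical cumulative distribution function, which simply counts the fraction of samples at or below a value. The following sketch uses made-up elapsed times; it is not what the plotting cells below compute, since those smooth the curve with a kernel.

```python
from bisect import bisect_right


def empirical_cdf(samples):
    """Return a function giving the fraction of samples <= x."""
    ordered = sorted(samples)

    def cdf(x):
        return bisect_right(ordered, x) / float(len(ordered))

    return cdf


cdf = empirical_cdf([0.5, 1.0, 2.0, 4.0, 10.0])

for x in (0.0, 2.0, 10.0):
    print(x, cdf(x))
```

Reading the value at 2.0 tells you that three of the five tickets spent two days or less in the status.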

To plot these particular functions, we will be using the [seaborn](https://seaborn.pydata.org/) statistical data visualization library.

In [None]:
sns.set_style('whitegrid')

### Simple Density Plot

In [None]:
for key, group in df.groupby(['status']):
    ax = sns.kdeplot(group['elapsedTime'], bw=0.5, label=key)

    ax.set_xlim((0, 30))

In [None]:
for key, group in df.groupby(['status']):
    ax = sns.kdeplot(group['elapsedTime'], cumulative=True, bw=0.5, label=key)

    ax.set_xlim((0, 30))

### Region-Specific Density Plot

In [None]:
for key, group in df[df['region'] == 'US'].groupby(['status']):
    ax = sns.kdeplot(group['elapsedTime'], bw=0.5, label=key)

    ax.set_xlim((0, 30))

In [None]:
for key, group in df[df['region'] == 'US'].groupby(['status']):
    ax = sns.kdeplot(group['elapsedTime'], cumulative=True, bw=0.5, label=key)

    ax.set_xlim((0, 30))

## Sliding Window Visualizations

Another thing we can do is to use a sliding window over time as a way of estimating how long a ticket may spend in a given status. In other words, for any given point in time, we take a look at all tickets opened in the thirty days before that point and aggregate how long those tickets spent in that status.

In [None]:
def get_time_rolling_plot(times, aggregator, status, region=None):
    # Keep only the records for this status (and region, if given), oldest first.
    sorted_times = sorted(
        [time for time in times if time['status'] == status and (region is None or time['region'] == region)],
        key=lambda time: time['issueCreated']
    )

    if len(sorted_times) < 2:
        return

    window = timedelta(days=30)

    # sorted_times[start:end] holds the tickets created in
    # (current_date - window, current_date].
    start = 0
    end = 0

    current_date = sorted_times[0]['issueCreated']
    last_date = sorted_times[-1]['issueCreated']

    dates = []
    values = []

    while current_date <= last_date:
        # Pull in tickets created on or before the current date.
        while end < len(sorted_times) and sorted_times[end]['issueCreated'] <= current_date:
            end += 1

        # Drop tickets that have fallen out of the 30-day window.
        while start < end and sorted_times[start]['issueCreated'] <= current_date - window:
            start += 1

        # Skip days where the window is empty (a gap of more than 30 days).
        if start < end:
            dates.append(current_date)
            values.append(aggregator([time['elapsedTime'] for time in sorted_times[start:end]]))

        current_date += timedelta(days=1)

    plt.plot(dates, values, '-', label=status)

### Simple Sliding Window Median

In [None]:
for status in statuses:
    get_time_rolling_plot(times, np.median, status)

plt.legend()

plt.show()

### Region-Specific Sliding Window Median

In [None]:
for status in statuses:
    get_time_rolling_plot(times, np.median, status, 'US')

plt.legend()

plt.show()

### Simple Sliding Window Average

In [None]:
for status in statuses:
    get_time_rolling_plot(times, np.mean, status)

plt.legend()

plt.show()

### Region-Specific Sliding Window Average

In [None]:
for status in statuses:
    get_time_rolling_plot(times, np.mean, status, 'US')

plt.legend()

plt.show()

### Simple Sliding Window 90th Percentile

In [None]:
for status in statuses:
    get_time_rolling_plot(times, functools.partial(np.percentile, q=90), status)

plt.legend()

plt.show()

### Region-Specific Sliding Window 90th Percentile

In [None]:
for status in statuses:
    get_time_rolling_plot(times, functools.partial(np.percentile, q=90), status, 'US')

plt.legend()

plt.show()

### Regression as Averages

We might also want to look at the average difference associated with a single attribute, such as the region or the ticket type. To do that, we could express how long each region spends in review as a table of linear regression coefficients.

In [None]:
def split_records(df, key_columns, value_column):
    columns = key_columns + [value_column]

    records = df[columns].to_dict(orient = 'records')

    for record in records:
        for key, value in record.items():
            if value is None:
                record[key] = ''

    vectorizer = DictVectorizer()

    train_x = vectorizer.fit_transform(
        [{ key: value for key, value in record.items() if key != value_column } for record in records]
    )

    train_y = [record[value_column] for record in records]

    return train_x, train_y, vectorizer
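
One way to read the section title: if the only features are one-hot encoded categories and the model has no intercept, each least-squares coefficient is simply the mean of its group. The sketch below shows this with made-up review times. Setting `fit_intercept=False` and `sparse=False` are assumptions made for this illustration, not settings taken from the notebook.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression

# Made-up review times (in days) for two regions.
records = [
    {'region': 'US', 'elapsedTime': 2.0},
    {'region': 'US', 'elapsedTime': 4.0},
    {'region': 'EU', 'elapsedTime': 1.0},
    {'region': 'EU', 'elapsedTime': 3.0},
    {'region': 'EU', 'elapsedTime': 8.0},
]

# One-hot encode the region; dense output keeps the example easy to inspect.
vectorizer = DictVectorizer(sparse=False)
train_x = vectorizer.fit_transform([{'region': r['region']} for r in records])
train_y = [r['elapsedTime'] for r in records]

# Without an intercept, each coefficient is the average of its own group.
model = LinearRegression(fit_intercept=False).fit(train_x, train_y)
coefficients = dict(zip(vectorizer.feature_names_, model.coef_))

for name, value in sorted(coefficients.items()):
    print(name, round(value, 3))
```

With the default `fit_intercept=True`, the coefficients would instead be offsets around a shared baseline, so the choice affects how the resulting table should be read.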