# Database Joins 2: Visualizing JIRA Statuses

For this notebook, we start with the following research question: "Can we create data visualizations on top of the LESA, LPS, LPP, and BPR ticket metadata that let us group together different tickets so that we can explore the times that tickets remain in each status based on those groupings?"

To investigate this question, we start with a much smaller sub-question that focuses on more recent data.

<b style="color:green">Can we create data visualizations on top of LPS ticket metadata that let us group together different DXP tickets so that we can explore the times that tickets remain in each status based on those groupings?</b>

In order to answer this question, this notebook looks at simple visualizations of the JIRA data based on several ways you can group that data.

At the end of this notebook, we will have a script that takes a sample of data from JIRA and enriches it with more data from JIRA, and the reader will have a further improved understanding of what goes into a join.

## Prerequisites

The following cell uses `conda` to install the libraries that this notebook depends on. If the output indicates that additional packages were installed, restart the kernel after the installation completes before running the later cells in the notebook.

In [None]:
!conda install --yes matplotlib scikit-learn seaborn statsmodels

## Notebook Imports

In [None]:
%matplotlib inline

In [None]:
from __future__ import print_function

from checklpp import *
from datetime import datetime, timedelta
import matplotlib
from multiprocessing import Pool, cpu_count
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression

## Enable Process Pool

Most of our computations can run independently of each other, so let's take advantage of some parallelization that's available on our machine.

In [None]:
pool = Pool(cpu_count())

In [None]:
plt.rcParams['figure.figsize'] = (6.4*2, 4.8*1.5)

## Fetch JIRA Tickets

Reminding ourselves of our question:

<b style="color:green">Can we create data visualizations on top of LPS, LPP, and BPR ticket metadata that let us group together different DXP tickets so that we can explore the times that tickets remain in each status based on those groupings?</b>

The key part here is that when we retrieve each ticket from JIRA, we're interested in the times that tickets remain in each status. So, let's start by making sure that, at the very least, we can extract that metadata from an LPP ticket.

### Fetch LPP Tickets

First, we'll look at LPP tickets that relate to DXP.

In [None]:
lpp_jql = 'project = LPP and affectedVersion = "7.0 DE (7.0.10)"'

It turns out that JIRA allows you to perform a join as part of its API by requesting an expansion of certain fields. Because we're interested in the time the ticket spends in each status, we can ask it to expand the `changelog` field, which effectively asks JIRA to perform a join on its equivalent of a `changelog` table.
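As a rough illustration of what such a request looks like, the sketch below builds a query string for JIRA's REST search endpoint with `expand=changelog`. The base URL and the helper `build_search_url` are hypothetical and exist only for this example; how `get_jira_issues` in `checklpp` actually issues its requests is not shown in this notebook.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2

# Hypothetical base URL; substitute the address of your own JIRA instance.
JIRA_BASE_URL = 'https://jira.example.com'

def build_search_url(jql, expand_fields, start_at=0, max_results=50):
    """Build a JIRA REST search URL that asks JIRA to expand extra fields."""
    params = [
        ('jql', jql),
        # Multiple expansions are passed as a comma-separated list.
        ('expand', ','.join(expand_fields)),
        ('startAt', start_at),
        ('maxResults', max_results),
    ]
    return JIRA_BASE_URL + '/rest/api/2/search?' + urlencode(params)

url = build_search_url('project = LPP', ['changelog'])
print(url)
```

Each issue in the response then carries its own `changelog` entry alongside its `fields`, which is the joined data that the rest of this notebook relies on.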

In [None]:
if __name__ == '__main__':
    lpp_issues = get_jira_issues(lpp_jql, ['changelog'])
else:
    lpp_issues = {}

Let's accumulate the time between the status transitions recorded in each ticket's changelog.
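To make the accumulation concrete, here is a minimal stdlib-only sketch on a hand-built issue. The sample data is made up, the timestamps are simplified (real JIRA timestamps also carry milliseconds and a time zone), and `days_per_status` is a hypothetical helper that assumes elapsed time is measured in fractional days and that history entries arrive in chronological order.

```python
from collections import defaultdict
from datetime import datetime

# Hand-built issue in the same general shape as a JIRA response (values are made up).
sample_issue = {
    'fields': {'created': '2018-01-01T00:00:00'},
    'changelog': {'histories': [
        {'created': '2018-01-03T00:00:00',
         'items': [{'field': 'status', 'fromString': 'Open', 'toString': 'In Progress'}]},
        {'created': '2018-01-04T12:00:00',
         'items': [{'field': 'assignee', 'fromString': 'a', 'toString': 'b'}]},
        {'created': '2018-01-08T00:00:00',
         'items': [{'field': 'status', 'fromString': 'In Progress', 'toString': 'Open'}]},
        {'created': '2018-01-09T00:00:00',
         'items': [{'field': 'status', 'fromString': 'Open', 'toString': 'Closed'}]},
    ]},
}

def parse(timestamp):
    return datetime.strptime(timestamp, '%Y-%m-%dT%H:%M:%S')

def days_per_status(issue):
    """Sum the days spent in each status that the issue has transitioned out of."""
    status_days = defaultdict(float)
    previous_change = parse(issue['fields']['created'])

    for history_entry in issue['changelog']['histories']:
        changed_at = parse(history_entry['created'])

        for item in history_entry['items']:
            # Ignore changes to other fields, such as the assignee.
            if item['field'] != 'status':
                continue

            elapsed = changed_at - previous_change
            status_days[item['fromString']] += elapsed.total_seconds() / 86400.0
            previous_change = changed_at

    return dict(status_days)

result = days_per_status(sample_issue)
print(result)
```

The ticket was `Open` twice (two days, then one day), so both stints are added together, while the final `Closed` status has no outgoing transition and therefore gets no entry.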

In [None]:
%%writefile jiratime.py
from checklpp import get_time_delta_as_days
from collections import defaultdict
import dateparser

region_field_name = 'customfield_11523'

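def extract_time(issue):
    # Returns one record per status that the issue has transitioned out of,
    # holding the total number of days spent in that status.
    status_times = defaultdict(float)
    old_status_date = dateparser.parse(issue['fields']['created'])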

    for history_entry in issue['changelog']['histories']:
        status_history = [item for item in history_entry['items'] if item['field'] == 'status']

        for item in status_history:
            old_status = item['fromString']
            new_status_date = dateparser.parse(history_entry['created'])

            elapsed_time = get_time_delta_as_days(new_status_date - old_status_date)

            status_times[old_status] += elapsed_time

            old_status_date = new_status_date

    if region_field_name in issue['fields'] and issue['fields'][region_field_name] is not None:
        regions = [region['value'] for region in issue['fields'][region_field_name]]
    else:
        regions = ['']

    return [
        {
            'jiraKey': issue['key'],
            'type': issue['fields']['issuetype']['name'],
            'region': regions[0],
            'issueCreated': dateparser.parse(issue['fields']['created']).date(),
            'status': status,
            'elapsedTime': elapsed_time
        }
        for status, elapsed_time in status_times.items()
    ]

In [None]:
import jiratime
reload(jiratime)

In [None]:
times = []

num_finished = 0

for result in pool.imap_unordered(jiratime.extract_time, lpp_issues.values()):
    num_finished += 1

    if num_finished % 100 == 0:
        print('[%s] Processed %d of %d issues' % (datetime.now().isoformat(), num_finished, len(lpp_issues)))

    times.extend(result)

## Table Visualizations

In [None]:
df = pd.DataFrame(times)

In [None]:
statuses = ['Open', 'Verified', 'In Progress', 'In Review', 'On Hold']

In [None]:
df = df[df['status'].isin(statuses)]

### Simple Average

We can start with some basic statistics by status.
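As a sketch of what these per-status tables contain, and of how joining them on the shared `status` key lines them up, here is a stdlib-only version on made-up rows. The rows and the `median` helper are illustrative assumptions; the notebook itself relies on pandas for this.

```python
from collections import defaultdict

# Made-up rows in the same shape as the entries in `times`.
rows = [
    {'status': 'Open', 'elapsedTime': 1.0},
    {'status': 'Open', 'elapsedTime': 2.0},
    {'status': 'Open', 'elapsedTime': 9.0},
    {'status': 'In Review', 'elapsedTime': 4.0},
    {'status': 'In Review', 'elapsedTime': 6.0},
]

def median(values):
    ordered = sorted(values)
    middle = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2.0

# Group: collect the elapsed times for each status.
groups = defaultdict(list)
for row in rows:
    groups[row['status']].append(row['elapsedTime'])

# Aggregate: build one small "table" per statistic, keyed by status.
counts = {status: len(values) for status, values in groups.items()}
medians = {status: median(values) for status, values in groups.items()}
means = {status: sum(values) / len(values) for status, values in groups.items()}

# Join: line the three tables up on their shared key.
summary = {
    status: {
        'count': counts[status],
        'elapsedTimeMedian': medians[status],
        'elapsedTimeMean': means[status],
    }
    for status in counts
}

print(summary)
```

The single long `Open` stint pulls the mean well above the median, which is why it helps to look at both.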

In [None]:
df_groupby = df[['status', 'elapsedTime']].groupby(['status'])

In [None]:
df_count = df_groupby.count()
df_count = df_count.rename(columns={'elapsedTime': 'count'})
df_norm1 = df_groupby.median()
df_norm1 = df_norm1.rename(columns={'elapsedTime': 'elapsedTimeMedian'})
df_norm2 = df_groupby.mean()
df_norm2 = df_norm2.rename(columns={'elapsedTime': 'elapsedTimeMean'})

In [None]:
df_count.join(df_norm1).join(df_norm2)

### Region-Specific Average

We can add one more level of this and show the raw statistics by region and status.

In [None]:
df_groupby = df[['region', 'status', 'elapsedTime']].groupby(['region', 'status'])

In [None]:
df_count = df_groupby.count()
df_count = df_count.rename(columns={'elapsedTime': 'count'})
df_norm1 = df_groupby.median()
df_norm1 = df_norm1.rename(columns={'elapsedTime': 'elapsedTimeMedian'})
df_norm2 = df_groupby.mean()
df_norm2 = df_norm2.rename(columns={'elapsedTime': 'elapsedTimeMean'})

In [None]:
df_count.join(df_norm1).join(df_norm2)

## Density Plot Visualizations

In [None]:
sns.set_style('whitegrid')

### Simple Density Plot

In [None]:
for key, group in df.groupby('status'):
    ax = sns.kdeplot(group['elapsedTime'], bw=0.5, label=key)
    
    ax.set_xlim((0, 30))

### Region-Specific Density Plot

In [None]:
for key, group in df[df['region'] == 'US'].groupby('status'):
    ax = sns.kdeplot(group['elapsedTime'], bw=0.5, label=key)
    
    ax.set_xlim((0, 30))

## Sliding Window Visualizations

We can also use a sliding window over time to estimate how long a ticket is likely to spend in each status. For any given point in time, we look at all tickets opened in the thirty days before that point and aggregate how long those tickets spent in that status.
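The windowing idea can be sketched without any plotting. The stdlib-only example below uses made-up `(created, elapsed)` pairs and a hypothetical helper, `rolling_aggregate`, that assumes the window covers the given number of days up to and including each day. A two-day window keeps the example short.

```python
from collections import deque
from datetime import date, timedelta

def mean(values):
    return sum(values) / len(values)

def rolling_aggregate(entries, aggregator, window_days=30):
    """Aggregate elapsed times over a window that slides forward one day at a time.

    entries must be (created_date, elapsed_days) pairs sorted by created_date.
    """
    results = []
    window = deque()
    next_index = 0
    day = entries[0][0]
    last_day = entries[-1][0]

    while day <= last_day:
        # Add entries created on or before the current day.
        while next_index < len(entries) and entries[next_index][0] <= day:
            window.append(entries[next_index])
            next_index += 1

        # Drop entries created before the start of the window.
        cutoff = day - timedelta(days=window_days)
        while window and window[0][0] < cutoff:
            window.popleft()

        # Skip days whose window contains no entries.
        if window:
            results.append((day, aggregator([elapsed for _, elapsed in window])))

        day += timedelta(days=1)

    return results

# Made-up sample data.
entries = [
    (date(2018, 1, 1), 4.0),
    (date(2018, 1, 2), 6.0),
    (date(2018, 1, 5), 10.0),
]

rolled = rolling_aggregate(entries, mean, window_days=2)
for day, value in rolled:
    print(day.isoformat(), value)
```

Using a `deque` keeps removal from the front of the window cheap, which matters more as the number of tickets grows.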

In [None]:
def get_time_rolling_plot(times, aggregator, status, region=None):
    # Keep only the entries for this status (and region, if given), oldest first.
    sorted_times = sorted(
        [time for time in times if time['status'] == status and (region is None or time['region'] == region)],
        key=lambda time: time['issueCreated']
    )

    if len(sorted_times) < 2:
        return

    # rolling_window always holds the elapsed times of sorted_times[start:end].
    start = 0
    end = 0

    max_end_date = sorted_times[0]['issueCreated']
    min_start_date = max_end_date - timedelta(days=30)
    last_created_date = sorted_times[-1]['issueCreated']

    rolling_window = []

    dates = []
    values = []

    while max_end_date <= last_created_date:
        # Add entries created on or before the end of the window.
        while end < len(sorted_times) and sorted_times[end]['issueCreated'] <= max_end_date:
            rolling_window.append(sorted_times[end]['elapsedTime'])
            end += 1

        # Drop entries created before the start of the window.
        while start < end and sorted_times[start]['issueCreated'] < min_start_date:
            rolling_window.pop(0)
            start += 1

        # Skip days whose window contains no tickets.
        if rolling_window:
            dates.append(min_start_date)
            values.append(aggregator(rolling_window))

        min_start_date += timedelta(days=1)
        max_end_date += timedelta(days=1)

    plt.plot_date(dates, values, '-', label=status)

### Simple Sliding Window Median

In [None]:
for status in statuses:
    get_time_rolling_plot(times, np.median, status)

plt.legend()

plt.show()

### Region-Specific Sliding Window Median

In [None]:
for status in statuses:
    get_time_rolling_plot(times, np.median, status, 'US')

plt.legend()

plt.show()

### Simple Sliding Window Average

In [None]:
for status in statuses:
    get_time_rolling_plot(times, np.mean, status)

plt.legend()

plt.show()

### Region-Specific Sliding Window Average

In [None]:
for status in statuses:
    get_time_rolling_plot(times, np.mean, status, 'US')

plt.legend()

plt.show()

### Regression as Averages

We might also want to look at the average difference introduced simply by changing a single attribute of a ticket, such as its type or region. To do that, we could express how long tickets from each region spend in review as a table of linear regression coefficients.
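To see why regression coefficients can be read as averages, consider the simplest case: a single categorical column, one-hot encoded, fitted without an intercept. The stdlib-only sketch below uses made-up records and two hypothetical helpers. The `key=value` feature names mirror how `DictVectorizer` names string-valued features, and the closed-form fit is valid only because the one-hot columns of a single category never overlap. With an intercept, which `LinearRegression` fits by default, the coefficients would be expressed relative to that intercept rather than as raw means.

```python
# Made-up records; only the shape matters.
records = [
    {'region': 'US', 'elapsedTime': 2.0},
    {'region': 'US', 'elapsedTime': 4.0},
    {'region': 'EU', 'elapsedTime': 5.0},
    {'region': 'EU', 'elapsedTime': 7.0},
    {'region': 'EU', 'elapsedTime': 9.0},
]

def one_hot(records, key_columns):
    """Encode each record as a row of 0/1 indicators, one column per key=value pair."""
    feature_names = sorted(
        {'%s=%s' % (key, record[key]) for record in records for key in key_columns}
    )
    rows = []
    for record in records:
        active = {'%s=%s' % (key, record[key]) for key in key_columns}
        rows.append([1.0 if name in active else 0.0 for name in feature_names])
    return rows, feature_names

def fit_non_overlapping_columns(rows, targets):
    """No-intercept least squares, valid only when no two columns are active in the same row."""
    coefficients = []
    for column in range(len(rows[0])):
        numerator = sum(row[column] * target for row, target in zip(rows, targets))
        denominator = sum(row[column] * row[column] for row in rows)
        coefficients.append(numerator / denominator)
    return coefficients

rows, feature_names = one_hot(records, ['region'])
targets = [record['elapsedTime'] for record in records]
coefficients = dict(zip(feature_names, fit_non_overlapping_columns(rows, targets)))

print(coefficients)
```

Each coefficient here is exactly the mean elapsed time of its region, so the difference between two coefficients is the average difference between those regions.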

In [None]:
def split_records(df, key_columns, value_column):
    columns = key_columns + [value_column]

    records = df[columns].to_dict(orient='records')

    for record in records:
        for key, value in record.items():
            if value is None:
                record[key] = ''

    vectorizer = DictVectorizer()

    train_x = vectorizer.fit_transform(
        [{key: value for key, value in record.items() if key != value_column} for record in records]
    )
    
    train_y = [record[value_column] for record in records]

    return train_x, train_y, vectorizer