# Database Joins 2: Visualizing JIRA

For this notebook, we start with the following research question: "Can we create data visualizations on top of the LESA, LPS, LPP, and BPR ticket metadata that let us group together different tickets so that we can explore the times that tickets remain in each status based on those groupings?"

To investigate this question, we start with a much smaller sub-question that focuses on more recent data.

<b style="color:green">Can we create data visualizations on top of LPS ticket metadata that let us group together different DXP tickets so that we can explore the times that tickets remain in each status based on those groupings?</b>

In order to answer this question, this notebook looks at simple visualizations of the JIRA data based on several ways you can group that data.

By the end of this notebook, we will have a script that takes a sample of data from JIRA and enriches it with more data from JIRA, and the reader will have a better understanding of what goes into a join.

## Prerequisites

The following cell uses `conda` to install libraries that are used by this notebook. If the output indicates that additional items were installed, restart the kernel after the installation completes and before running the later cells in the notebook.

In [None]:
!conda install -y matplotlib scikit-learn statsmodels

## Notebook Imports

In [None]:
%matplotlib inline

In [None]:
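from __future__ import print_function

# The JIRA helpers used below (get_jira_issues, get_time_delta_as_days)
# are expected to come from this star import.
from checklpp import *

# Names used later in the notebook, imported explicitly.
from collections import defaultdict
from datetime import datetime

import dateparser
import matplotlib
from matplotlib import pyplot as plt
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression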

## Fetch JIRA Tickets

Reminding ourselves of our question:

<b style="color:green">Can we create data visualizations on top of LPS, LPP, and BPR ticket metadata that let us group together different DXP tickets so that we can explore the times that tickets remain in each status based on those groupings?</b>

The key part here is that when we retrieve each ticket from JIRA, we're interested in the times that tickets remain in each status. So, let's start by making sure that, at the very least, we can extract that metadata from an LPP ticket.

### Fetch LPP Tickets

First, we'll look at LPP tickets that relate to DXP.

In [None]:
lpp_jql = 'project = LPP and affectedVersion = "7.0 DE (7.0.10)"'

It turns out that JIRA allows you to perform a join as part of its API by requesting an expansion of certain fields. Because we're interested in the time the ticket spends in each status, we can ask it to expand the `changelog` field, which effectively asks JIRA to perform a join on its equivalent of a `changelog` table.

In [None]:
if __name__ == '__main__':
    lpp_issues = get_jira_issues(lpp_jql, ['changelog'])
else:
    lpp_issues = {}
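
To make the join analogy concrete, the following sketch flattens one issue's expanded `changelog` into one row per status change, which is roughly what a join between an issue table and a changelog table would return. The sample issue is hand-written for illustration (the key, dates, and assignee are invented) and only mimics the nesting that the later cells rely on (`changelog`, then `histories`, then `items`). `flatten_status_changes` is a hypothetical helper, not part of `checklpp`.

```python
# Hand-written sample that imitates the shape of an issue fetched with the
# changelog expanded. The key, dates, and assignee are invented.
sample_issue = {
    'key': 'LPP-0000',
    'fields': {'created': '2017-01-01T00:00:00.000+0000'},
    'changelog': {
        'histories': [
            {
                'created': '2017-01-03T00:00:00.000+0000',
                'items': [
                    {'field': 'assignee', 'fromString': None, 'toString': 'someone'},
                    {'field': 'status', 'fromString': 'Open', 'toString': 'In Review'},
                ],
            },
            {
                'created': '2017-01-04T12:00:00.000+0000',
                'items': [
                    {'field': 'status', 'fromString': 'In Review', 'toString': 'Closed'},
                ],
            },
        ]
    },
}


def flatten_status_changes(issue):
    """Return one row per status change, like rows of an issue/changelog join."""
    rows = []
    for history_entry in issue['changelog']['histories']:
        for item in history_entry['items']:
            # A history entry can bundle several field changes; keep status only.
            if item['field'] != 'status':
                continue
            rows.append({
                'jiraKey': issue['key'],
                'changed': history_entry['created'],
                'fromStatus': item['fromString'],
                'toStatus': item['toString'],
            })
    return rows


status_rows = flatten_status_changes(sample_issue)
for row in status_rows:
    print(row)
```

Each row repeats the issue key next to one changelog item, which is the same duplication of the "left" side that a one-to-many join produces.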

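Next, we walk through each issue's changelog and accumulate, for each status, the total time the ticket spent in that status before moving on to another one.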

In [None]:
times = []

region_field_name = 'customfield_11523'

for num, issue in enumerate(lpp_issues.values()):
    if num % 100 == 0:
        print('[%s] Processed %d of %d issues' % (datetime.now().isoformat(), num, len(lpp_issues)))
    
    # Total days spent in each status; the clock starts at ticket creation.
    status_times = defaultdict(float)
    old_status_date = dateparser.parse(issue['fields']['created'])

    for history_entry in issue['changelog']['histories']:
        # A history entry can change several fields; keep only status changes.
        status_history = [item for item in history_entry['items'] if item['field'] == 'status']

        for item in status_history:
            old_status = item['fromString']
            new_status_date = dateparser.parse(history_entry['created'])

            # Credit the elapsed interval to the status the ticket is leaving.
            elapsed_time = get_time_delta_as_days(new_status_date - old_status_date)

            status_times[old_status] += elapsed_time

            old_status_date = new_status_date

    if region_field_name in issue['fields'] and issue['fields'][region_field_name] is not None:
        regions = [region['value'] for region in issue['fields'][region_field_name]]
    else:
        regions = ['']
            
    for status, elapsed_time in status_times.items():
        entry = {
            'jiraKey': issue['key'],
            'type': issue['fields']['issuetype']['name'],
            'region': regions[0],
            'issueCreated': issue['fields']['created'],
            'status': status,
            'elapsedTime': elapsed_time
        }
        
        times.append(entry)
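
The accumulation step is easiest to check on a tiny timeline. The sketch below replays the same logic on invented transitions, using only the standard library. `delta_in_days` is a hypothetical stand-in for `get_time_delta_as_days`, assumed here to convert a `timedelta` into fractional days; the real helper may behave differently.

```python
from collections import defaultdict
from datetime import datetime


def delta_in_days(delta):
    """Hypothetical stand-in for get_time_delta_as_days: timedelta to fractional days."""
    return delta.total_seconds() / 86400.0


# Invented timeline: (timestamp of the change, status the ticket left).
created = datetime(2017, 1, 1)
transitions = [
    (datetime(2017, 1, 3), 'Open'),           # 2 days in Open
    (datetime(2017, 1, 4, 12), 'In Review'),  # 1.5 days in In Review
    (datetime(2017, 1, 5, 12), 'Open'),       # reopened: 1 more day in Open
]

status_times = defaultdict(float)
old_status_date = created

for new_status_date, old_status in transitions:
    # Credit the elapsed interval to the status the ticket is leaving.
    status_times[old_status] += delta_in_days(new_status_date - old_status_date)
    old_status_date = new_status_date

print(dict(status_times))
```

Because time is only credited when a ticket leaves a status, the interval after the last transition is never counted, so the status a ticket currently sits in contributes no time. Repeat visits to the same status are summed into one total.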

## Visualize LPP Ticket Status Times

### Basic Statistics

The simplest visualization is a table that shows how often each region uses each status.

In [None]:
df = pd.DataFrame(times)

In [None]:
df[['region', 'status', 'elapsedTime']].groupby(['region', 'status']).count()

We might also look at the average time spent in each status for each region.

In [None]:
df[['region', 'status', 'elapsedTime']].groupby(['region', 'status']).mean()
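
The count and mean tables above can also be produced side by side in one pass. This is a minimal sketch, assuming a data frame with the same `region`, `status`, and `elapsedTime` columns as `df`; the rows and region names are invented for illustration.

```python
import pandas as pd

# Invented rows with the same columns as the notebook's `df`.
toy_times = pd.DataFrame([
    {'region': 'US', 'status': 'Open', 'elapsedTime': 2.0},
    {'region': 'US', 'status': 'Open', 'elapsedTime': 4.0},
    {'region': 'US', 'status': 'In Review', 'elapsedTime': 1.0},
    {'region': 'EU', 'status': 'Open', 'elapsedTime': 3.0},
])

# One row per (region, status) group, one column per aggregate.
summary = (
    toy_times
    .groupby(['region', 'status'])['elapsedTime']
    .agg(['count', 'mean'])
)

print(summary)
```

Seeing the count next to the mean helps flag averages that rest on only one or two tickets.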

We might also want to look at differences between regions. For example, we could compare how long each region spends in review by fitting a linear regression and tabulating its coefficients.

In [None]:
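def split_records(df, key_columns, value_column):
    """Split df into a vectorized feature matrix and a list of target values.

    The key columns become features and the value column becomes the target.
    The fitted vectorizer is returned so feature names can be recovered later.
    """
    columns = key_columns + [value_column]

    records = df[columns].to_dict(orient='records')

    # Replace missing values with '' so they are encoded as their own category.
    for record in records:
        for key, value in record.items():
            if value is None:
                record[key] = ''

    vectorizer = DictVectorizer()

    # Every column except the target is treated as a feature.
    train_x = vectorizer.fit_transform(
        [{ key: value for key, value in record.items() if key != value_column } for record in records]
    )

    train_y = [record[value_column] for record in records]

    return train_x, train_y, vectorizer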
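
As a sketch of how output shaped like that of `split_records` could feed a regression, the example below vectorizes a few invented records and reads the fitted coefficients back as a table keyed by feature name. It rests on several assumptions that are not from the notebook: the records cover a single status, `region` is the only key column, `sparse=False` is used to get a dense matrix, and `fit_intercept=False` is chosen so that each coefficient equals that region's mean time.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression

# Invented records for a single status; real data would come from `times`.
review_records = [
    {'region': 'US', 'elapsedTime': 1.0},
    {'region': 'US', 'elapsedTime': 3.0},
    {'region': 'EU', 'elapsedTime': 4.0},
    {'region': 'EU', 'elapsedTime': 6.0},
]

# String values are one-hot encoded into features named "region=<value>".
vectorizer = DictVectorizer(sparse=False)
train_x = vectorizer.fit_transform(
    [{'region': record['region']} for record in review_records]
)
train_y = [record['elapsedTime'] for record in review_records]

# Without an intercept, each one-hot coefficient is that region's mean time.
model = LinearRegression(fit_intercept=False)
model.fit(train_x, train_y)

coefficients = dict(zip(vectorizer.feature_names_, model.coef_))
print(coefficients)
```

With the default intercept, the one-hot columns would be collinear with the constant term and the individual coefficients would no longer be directly readable as per-region averages.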