# Database Joins 2: Visualizing JIRA Statuses

For this notebook, we start with the following research question. "Can we create data visualizations on top of the LESA, LPS, LPP, and BPR ticket metadata that let us group together different tickets so that we can explore the times that tickets remain in each status based on those groupings?"

In order to investigate the answer to this question, we start with a much smaller sub-question that focuses on more recent data.

<b style="color:green">Can we create data visualizations on top of LPS ticket metadata that let us group together different DXP tickets so that we can explore the times that tickets remain "In Review" based on those groupings?</b>

In order to answer this question, this notebook looks at simple visualizations of the JIRA data based on several ways you can group that data.

At the end of this notebook, we will have a script that takes a sample of data from JIRA and enriches it with more data from JIRA, and the reader will have an improved understanding of the capabilities databases provide when it comes to processing information for visualization.

## Prerequisites

The following cell uses `conda` to install the libraries that are used by this notebook. If the output indicates that additional items were installed, you will need to restart the kernel after the installation completes before you can run the later cells in the notebook.

In [None]:
!conda install matplotlib scikit-learn seaborn statsmodels

## Notebook Imports

In [None]:
%matplotlib inline

In [None]:
from __future__ import print_function

from checklpp import *
from datetime import datetime, timedelta
import functools
import matplotlib
from multiprocessing import Pool, cpu_count
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression

## Enable Process Pool

Most of our computations can run independently of each other, so let's take advantage of some parallelization that's available on our machine.

In [None]:
pool = Pool(cpu_count())

We'll also change some of the default plot settings so that the plots are larger and use a white background instead of a gray one.

In [None]:
plt.rcParams['figure.figsize'] = (6.4*2, 4.8*1.5)
sns.set_style('whitegrid')

We'll also change some of our default colors to make them a little more color-blind friendly than the default plotting colors.

In [None]:
sns.set_palette('colorblind')
sns.palplot(sns.color_palette('colorblind'))

## Detour: Database Views

As you work at Liferay, it is easy to be led to believe that databases exist as naive information storage and retrieval systems, because that is really all Liferay does with a database. However, the truth is much more complex.

One of the other reasons databases exist is to manage relationships between data sets and allow you to analyze those relationships. As a result, many enterprise database vendors have created a rich set of proprietary functions on top of SQL that allow you to perform very insightful analysis.

Because you're uncovering relationships within the data, whenever you want to answer questions, you often join together tables and then perform additional analysis on top of those joined tables. It is not uncommon for an analysis to simply take multiple sets of SQL used to generate join tables and use them as subqueries.

Over time, it can be tedious to paste in the same nested queries over and over again, and having too many of them will also make it harder for anyone reading the query to understand what it is you were trying to analyze. To address this problem, it is common to create a database view.

* [Why do you create a view in a database?](https://stackoverflow.com/questions/1278521/why-do-you-create-a-view-in-a-database)

You can think of a database view as an alias for a query result that someone has needed before. We also know that every query result is a table, and knowing why people want that information to begin with (constantly reusing it in different analyses), you can also see why people might like a feature that allows you to materialize those views.

* [Materialized view](https://en.wikipedia.org/wiki/Materialized_view)
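
As a rough illustration of the idea, the sketch below uses Python's built-in `sqlite3` module to save a join as a view and then query it like a table. The `ticket` and `transition` tables, their columns, and the `DEMO-*` keys are all made up for this example; they are not the JIRA schema.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE ticket (jira_key TEXT PRIMARY KEY, region TEXT);
    CREATE TABLE transition (jira_key TEXT, status TEXT, days REAL);

    INSERT INTO ticket VALUES ('DEMO-1', 'US'), ('DEMO-2', 'EU');
    INSERT INTO transition VALUES
        ('DEMO-1', 'Open', 1.0), ('DEMO-1', 'In Review', 2.5),
        ('DEMO-2', 'Open', 0.5), ('DEMO-2', 'In Review', 4.0);

    -- The view gives the join a name so that later queries can reuse it
    CREATE VIEW ticket_status_time AS
        SELECT t.jira_key, t.region, tr.status, tr.days
        FROM ticket t JOIN transition tr ON tr.jira_key = t.jira_key;
""")

# Query the view exactly as if it were a table
rows = conn.execute("""
    SELECT region, AVG(days) FROM ticket_status_time
    WHERE status = 'In Review' GROUP BY region ORDER BY region
""").fetchall()

print(rows)
```

The query against `ticket_status_time` never has to repeat the join, which is the readability benefit described above.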

In our case, whenever we fetch tickets from JIRA, the JSON object may be cumbersome to work with. Therefore, in order to make it easier for people to understand the underlying information, we will create a view on top of it that allows us to concentrate on the specific fields we want to analyze. As we do this, however, we should keep in mind that all of the original raw data still exists, should we ever need it.

## Fetch LPP Tickets

Reminding ourselves of our question:

<b style="color:green">Can we create data visualizations on top of LPS ticket metadata that let us group together different DXP tickets so that we can explore the times that tickets remain "In Review" based on those groupings?</b>

The key part here is that when we retrieve each ticket from JIRA, we're interested in the times that tickets remain in each status. So, let's start by making sure that, at the very least, we can extract that metadata from an LPP ticket.

First, we'll look at LPP tickets that relate to DXP, but are not any of the known workflow testing tickets (because those constantly close and re-open and might end up with the DXP affected version).

In [None]:
lpp_jql = """
project = LPP and affectedVersion = "7.0 DE (7.0.10)" and
key not in (LPP-10825,LPP-10826,LPP-12114,LPP-13367)
"""

It turns out that JIRA allows you to perform a join as part of its API by requesting an expansion of certain fields. Because we're interested in the time the ticket spends in each status, we can ask it to expand the `changelog` field, which effectively asks JIRA to perform a join on its equivalent of a `changelog` table.
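
The `get_jira_issues` helper used below comes from `checklpp` and is not shown in this notebook, so the following is only a sketch of the kind of request such a helper might build, assuming the JIRA REST API v2 search endpoint. The host name and the `build_search_url` function are hypothetical.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2

# Hypothetical base URL; substitute your own JIRA host
JIRA_BASE_URL = 'https://jira.example.com'

def build_search_url(jql, expand, start_at=0, max_results=50):
    """Build a search URL that asks JIRA to expand extra data for each issue."""
    params = [
        ('jql', ' '.join(jql.split())),  # collapse newlines in multi-line JQL
        ('expand', ','.join(expand)),
        ('startAt', start_at),
        ('maxResults', max_results),
    ]
    return JIRA_BASE_URL + '/rest/api/2/search?' + urlencode(params)

search_url = build_search_url('project = LPP', ['changelog'])
print(search_url)
```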

In [None]:
if __name__ == '__main__':
    lpp_issues = get_jira_issues(lpp_jql, ['changelog'])
else:
    lpp_issues = {}

Let's accumulate the history transitions. Note that we're going to be making a simplification that will matter later on in the notebook: namely, we're going to assume that the creation of the issue is where we'll focus for the visualization, rather than the time of the transition.

This can be a dangerous simplification if you are to use this for modeling or machine learning, but it's a useful simplification because it allows us to have a cumulative value for how long something spends in each status rather than having to understand each separate value (in the event that it enters the "In Review" status multiple times, for example).
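
To make the cumulative value concrete, here is a minimal sketch that accumulates time per status for a single ticket. The transition list is hypothetical sample data, not the JIRA changelog format.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical transitions for one ticket: (when, from status, to status)
created = datetime(2017, 1, 1)
transitions = [
    (datetime(2017, 1, 2), 'Open', 'In Review'),
    (datetime(2017, 1, 5), 'In Review', 'In Progress'),
    (datetime(2017, 1, 6), 'In Progress', 'In Review'),
    (datetime(2017, 1, 10), 'In Review', 'Closed'),
]

days_in_status = defaultdict(float)
previous_change = created

for changed_at, from_status, to_status in transitions:
    # The time since the previous change was spent in from_status
    elapsed = changed_at - previous_change
    days_in_status[from_status] += elapsed.total_seconds() / 86400.0
    previous_change = changed_at

# The two separate visits to In Review are folded into a single total
print(dict(days_in_status))
```

The ticket enters "In Review" twice (for three days, then four), and the result records only the seven-day total against the ticket's creation date.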

In [None]:
%%writefile jiratime.py
from checklpp import get_time_delta_as_days
from collections import defaultdict
import dateparser

region_field_name = 'customfield_11523'

def extract_time(issue):
    fields = issue['fields']
    issue_created = dateparser.parse(fields['created']).date()

    if fields.get(region_field_name) is not None:
        regions = [region['value'] for region in fields[region_field_name]]
    else:
        regions = ['']

    old_status = 'Open'
    # Unassigned tickets have no assignee object, so guard against None
    current_assignee = (fields.get('assignee') or {}).get('displayName')
    old_status_date = dateparser.parse(fields['created'])

    status_times = defaultdict(lambda: defaultdict(float))
    history_entries = issue['changelog']['histories']

    for history_entry in history_entries:
        history_entry['createdTime'] = dateparser.parse(history_entry['created'])

    # Sort oldest first; a key function works on both Python 2 and Python 3
    sorted_history_entries = sorted(
        history_entries, key=lambda entry: entry['createdTime'])

    for history_entry in sorted_history_entries:
        useful_history_items = [
            item for item in history_entry['items']
                if item['field'] == 'status' or item['field'] == 'assignee'
        ]

        for item in useful_history_items:
            # The "from" value was in effect during the interval that just ended
            if item['field'] == 'assignee':
                current_assignee = item['fromString']
            else:
                old_status = item['fromString']

            new_status_date = history_entry['createdTime']

            elapsed_time = get_time_delta_as_days(new_status_date - old_status_date)

            status_times[current_assignee][old_status] += elapsed_time
            status_times['(all assignees)'][old_status] += elapsed_time

            # The "to" value stays in effect until the next change
            if item['field'] == 'assignee':
                current_assignee = item['toString']
            else:
                old_status = item['toString']

            old_status_date = new_status_date

    return [
        {
            'jiraKey': issue['key'],
            'type': issue['fields']['issuetype']['name'],
            'assignee': assignee,
            'region': regions[0],
            'issueCreated': issue_created,
            'status': status,
            'elapsedTime': elapsed_time
        }
        for assignee, record in status_times.items()
            for status, elapsed_time in record.items()
    ]

The following block uses the `reload` ability to allow us to constantly change the time extraction above (which writes to a separate Python file so that it can be parallelized) and then reload it.

In [None]:
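import jiratime
from six.moves import reload_module

# reload_module maps to the built-in reload on Python 2 and to importlib.reload on Python 3
reload_module(jiratime)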

Now, we perform our parallel processing.

In [None]:
times = []

region_field_name = 'customfield_11523'

num_finished = 0

for result in pool.imap_unordered(jiratime.extract_time, lpp_issues.values()):
    if num_finished % 100 == 0:
        print('[%s] Processed %d of %d issues' % (datetime.now().isoformat(), num_finished, len(lpp_issues)))

    num_finished += 1

    for entry in result:
        times.append(entry)

## Table Visualizations

Now that we have our view of the JIRA data, let's revisit our research question.

<b style="color:green">Can we create data visualizations on top of LPS ticket metadata that let us group together different DXP tickets so that we can explore the times that tickets remain "In Review" based on those groupings?</b>

Given our question, naturally, our next step is to visualize our JIRA data. We'll start with the most basic visualization: a table.

When you start out in data visualization, the first thing you're taught is that when you generate descriptive statistics and visualize them as a table, you miss out on a large part of the story.

This is because when you boil something down to a single number (whether it's a measure of central tendency or really any other descriptive statistic), all of the context is lost in that summarization. It's possible for things that have very similar statistics and even very similar linear regression predictions to actually be quite different from each other.

* [Anscombe's quartet](https://en.wikipedia.org/wiki/Anscombe%27s_quartet)

However, tables give us a very concise overview of a lot of information at once, and they're extremely easy to generate, so they make for an excellent starting point to give you a sense of what you're dealing with.

In [None]:
aggregate_times = [entry for entry in times if entry['assignee'] == '(all assignees)']

In [None]:
df = pd.DataFrame(aggregate_times)

Liferay has a lot of different statuses, so we'll limit our exploration to only a handful of states that we know are related to tickets that are in progress, and we'll ignore the individual assignee times (at least, in our initial analysis).

In [None]:
statuses = ['Open', 'Verified', 'In Progress', 'In Review', 'On Hold']

In [None]:
df = df[df['status'].isin(statuses)]
del df['assignee']

### Cross-Tabulation

Before we begin creating tables that summarize the table across a numerical statistic (in our case, elapsed time), it's useful to do some data validation so that we can check our assumptions.

One validation we might do is this: how much is each status used for each different ticket type in our data set? This question can be answered with a contingency table, or cross-tabulation, which is available in almost all statistical computing libraries because of how simple yet useful it is as a visualization tool.

* [Crosstabs](http://libguides.library.kent.edu/SPSS/Crosstabs)

A cross-tabulation looks at how two or more columns co-occur with each other, and it's especially useful when examining how two categorical variables relate to each other.

* [Categorical Variable](https://en.wikipedia.org/wiki/Categorical_variable)

This allows you to check any assumptions about the data, such as whether two column values never co-occur, or whether two column values seem to co-occur a lot more than expected. This kind of visualization is also relevant for our original question, where we will look only at "In Review" statuses, where we might want to make sure that any assumptions about "In Review" are accurate.

In [None]:
pd.crosstab(df['status'], [df['type']])

You might wonder, can you cross-tabulate two non-categorical columns, such as two columns that are both continuous, floating point values?

* [Continuous and Discrete Variable](https://en.wikipedia.org/wiki/Continuous_and_discrete_variable)

If you think about it, such a cross-tabulation will have too many columns and too many rows, because almost every continuous value is unique. In this case, if you prefer to use a table to start out, one popular option is to simply convert the continuous variable into a discrete variable using binning, which allows you to tabulate the resulting artificial categories.

* [Data binning](https://en.wikipedia.org/wiki/Data_binning)
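
A minimal sketch of binning followed by tabulation, using only the standard library. The observations, bin edges, and labels are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical (ticket type, days in status) observations
observations = [
    ('Bug', 0.5), ('Bug', 3.0), ('Bug', 12.0),
    ('Story', 1.5), ('Story', 8.0), ('Story', 9.5),
]

# Bin edges in days; each bin covers the half-open range [low, high)
edges = [2, 7]
labels = ['under 2 days', '2 to 7 days', '7 days or more']

def to_bin(days):
    """Convert a continuous value into one of the artificial categories."""
    return labels[bisect_right(edges, days)]

# Tabulate how often each (type, bin) pair occurs
binned_counts = Counter(
    (ticket_type, to_bin(days)) for ticket_type, days in observations)

for (ticket_type, label), count in sorted(binned_counts.items()):
    print(ticket_type, label, count)
```

With pandas, `pd.cut` performs the same kind of binning, and the resulting column can be passed to `pd.crosstab` like any other categorical column.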

For the more mathematically and visually oriented, this table might eliminate too much information if you do not choose your discretization well. A popular alternative is to interpret the co-occurrences as probabilities rather than raw counts, which allows you to construct a contour plot of the joint probability function.

If you've never heard of a contour plot before, imagine that one variable is latitude and the other variable is longitude. A famous contour plot would be one that depicts the elevation at each latitude and longitude. The end result is an elevation map, which you can render in two dimensions with color, or explore interactively in three dimensions as in Google Earth.

* [Contour plots](http://www.statisticshowto.com/contour-plots/)

If you don't remember what a joint probability function is, and you're a visual learner, these lecture notes give a good visual refresher, and also provide visualizations that help you connect the idea behind a joint probability distribution to contour plots.

* [Joint Density Functions, Marginal Density Functions, Conditional Density Functions, Expectations and Independence](http://www.colorado.edu/economics/morey/7818/jointdensity/jointdensity.pdf)

### Aggregate Statistics

First, we look at everything as a single unit, and we simply try to visualize how long we spend in each status.

In [None]:
df_groupby = df[['status', 'elapsedTime']].groupby(['status'])

Some common statistics that people like to look at include measures of central tendency (mean, median), as well as quantiles (x% of values are less than or equal to y), where the median is equivalent to the 50% quantile.
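
To make the quantile definition concrete, the sketch below uses a simple nearest-rank rule on hypothetical data. This is an assumption made for illustration; pandas and NumPy interpolate between neighboring values by default, so their results can differ slightly.

```python
import math

def quantile_nearest_rank(values, q):
    """Smallest observed value y such that at least a fraction q of values are <= y."""
    ordered = sorted(values)
    rank = max(1, int(math.ceil(q * len(ordered))))
    return ordered[rank - 1]

# Hypothetical days-in-status values, including one extreme ticket
days = [1, 2, 2, 3, 3, 4, 5, 6, 8, 10, 60]

mean = sum(days) / float(len(days))
median = quantile_nearest_rank(days, 0.5)  # the 50% quantile
p90 = quantile_nearest_rank(days, 0.9)

print(mean, median, p90)
```

The single 60-day ticket pulls the mean well above the median, which is why it helps to look at more than one statistic.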

In [None]:
df_count = df_groupby.count()
df_count = df_count.rename(columns={'elapsedTime': 'count'})
df_norm1 = df_groupby.median()
df_norm1 = df_norm1.rename(columns={'elapsedTime': 'elapsedTimeMedian'})
df_norm2 = df_groupby.mean()
df_norm2 = df_norm2.rename(columns={'elapsedTime': 'elapsedTimeMean'})
df_quant = df_groupby.quantile([0.8, 0.9]).unstack()
df_quant.columns = ['%s%d%%' % (col[0], col[1] * 100) for col in df_quant.columns.values]

In [None]:
df_count.join(df_norm1).join(df_norm2).join(df_quant)

### Region-Specific Statistics

We can add one more level to this and group the statistics by region, and then take a look at what the statistics look like on the different ticket statuses.

In [None]:
df_groupby = df[['region', 'status', 'elapsedTime']].groupby(['region', 'status'])

Again, we'll look at central tendency and quantiles. This time, though, these numbers are divided across the different Liferay regions, which means the table will be much larger.

In [None]:
df_count = df_groupby.count()
df_count = df_count.rename(columns={'elapsedTime': 'count'})
df_norm1 = df_groupby.median()
df_norm1 = df_norm1.rename(columns={'elapsedTime': 'elapsedTimeMedian'})
df_norm2 = df_groupby.mean()
df_norm2 = df_norm2.rename(columns={'elapsedTime': 'elapsedTimeMean'})
df_quant = df_groupby.quantile([0.8, 0.9]).unstack()
df_quant.columns = ['%s%d%%' % (col[0], col[1] * 100) for col in df_quant.columns.values]

In [None]:
df_count.join(df_norm1).join(df_norm2).join(df_quant)

## Density Plot Visualizations

Let's come back to our original question.

<b style="color:green">Can we create data visualizations on top of LPS ticket metadata that let us group together different DXP tickets so that we can explore the times that tickets remain "In Review" based on those groupings?</b>

One question you might ask yourself is, "Should we copy another region's approach to handling tickets?"

Region-based groupings are useful when you are below the average (your ticket status times are higher than average), because they give you opportunities to explore how your region differs from other regions and to use those differences for process improvement.

However, how do you know that your region is different from other regions? You could use a table, but a table is good for summarizing information, and not as good for pointing out differences.

Another way to understand summary statistics is to visualize how the data is distributed. When looking at one-dimensional values (for example, the distribution of the time elapsed), this is often explained as a probability density function, which allows you to see how the data is concentrated. Peaks indicate that many values exist near that value.

* [An introduction to kernel density estimation](http://www.mvstat.net/tduong/research/seminars/seminar-2001-05/)
* [The idea of a probability density function](http://mathinsight.org/probability_density_function_idea)

Neither histograms nor probability density functions are easy for non-mathematical people to interpret, so what you can do instead is use the probability density function to derive the cumulative distribution function. The values tell you how much of the data is less than a certain value, while the shape tells you how quickly that probability mass builds up.

* [Cumulative distribution function](http://www.itl.nist.gov/div898/handbook/eda/section3/eda362.htm)

To plot these particular functions, we will be using the [seaborn](https://seaborn.pydata.org/) statistical data visualization library.

### Aggregate Cumulative Distribution

There are two ways to read the cumulative distribution function, depending on the question you want to ask.

The first question is, "What percentage of tickets stay in the In Review status for 5 days or less?" To answer this question, you search the "Days in Status" for 5, and then see where the vertical line intersects the In Review line.

The second question is, "Our goal is for 90% of tickets to spend 10 days or less In Review. What is the current cut-off point for 90% of tickets?" To answer this question, you search the "Proportion of Tickets" for 90%, and then see where the horizontal line intersects the In Review line.
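
Both readings can be sketched with an empirical (step-shaped) cumulative distribution over hypothetical data. The plot in this notebook draws a smoothed estimate instead, so its values will not match a step function exactly.

```python
from bisect import bisect_right
import math

# Hypothetical days spent In Review, one value per ticket
in_review_days = sorted([0.5, 1, 1, 2, 3, 4, 4, 6, 9, 15])

def proportion_at_or_below(days):
    """Vertical reading: fraction of tickets that took `days` or less."""
    return bisect_right(in_review_days, days) / float(len(in_review_days))

def cutoff_for(percent):
    """Horizontal reading: smallest value that covers `percent` of tickets."""
    rank = max(1, int(math.ceil(percent * len(in_review_days) / 100.0)))
    return in_review_days[rank - 1]

share_within_5_days = proportion_at_or_below(5)
cutoff_90 = cutoff_for(90)

print(share_within_5_days, cutoff_90)
```

If the goal were 10 days, comparing `cutoff_90` against 10 tells you whether the goal is currently met for this sample.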

In [None]:
for key, group in df.groupby(['status']):
    ax = sns.kdeplot(group['elapsedTime'], cumulative=True, bw=0.5, label=key)

ax.set_xlim((0, 40))

ax.set_yticks(np.linspace(0, 1, 11))
ax.set_yticklabels(['%d%%' % x for x in np.linspace(0, 100, 11)])

ax.set_xlabel('Days in Status')
ax.set_ylabel('Proportion of Tickets')

plt.show()

### Region-Specific Cumulative Distribution

Now let's say you wanted to answer the same question, but for a specific region, because we theoretically have more control over the results of our own region than we do over the entire department across multiple regions.

To reiterate, the first question is, "What percentage of tickets stay in the In Review status for 5 days or less *within my region of interest*?" To answer this question, you search the "Days in Status" for 5, and then see where the vertical line intersects the In Review line.

The second question is, "Our goal is for 90% of tickets to spend 10 days or less In Review *within my region of interest*. What is the current cut-off point for 90% of tickets *within my region of interest*?" To answer this question, you search the "Proportion of Tickets" for 90%, and then see where the horizontal line intersects the In Review line.

In [None]:
for key, group in df[df['region'] == 'US'].groupby(['status']):
    ax = sns.kdeplot(group['elapsedTime'], cumulative=True, bw=0.5, label=key)

ax.set_xlim((0, 40))

ax.set_yticks(np.linspace(0, 1, 11))
ax.set_yticklabels(['%d%%' % x for x in np.linspace(0, 100, 11)])

ax.set_xlabel('Days in Status')
ax.set_ylabel('Proportion of Tickets')

plt.show()

## Time Visualizations

We now have a sense of how our ticket times are distributed. Let's come back to our research question.

<b style="color:green">Can we create data visualizations on top of LPS ticket metadata that let us group together different DXP tickets so that we can explore the times that tickets remain "In Review" based on those groupings?</b>

While we have some great summarization of our information, we haven't actually done any grouping of our data. Let's start with one of the most common groupings: time.

When you group data over time and then visualize it, essentially what you are monitoring is what is happening to a specific value, with data spaced out at equal intervals (hours, days, weeks, months, years), which opens up a lot of different ways to both analyze and transform the time data. The types of analyses you can do depend on the data set as well as the strength of your background in applied mathematics.

* [Time series](https://en.wikipedia.org/wiki/Time_series)

For the purposes of this notebook, we aren't going to try to analyze the data. Rather, we're going to try to apply a simple time-based summarization and visualize those summaries. Given the nature of our tickets, one easy time-based summarization is to look at the tickets created each week.
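
As a minimal sketch of weekly summarization, the following assigns each ticket to exactly one 7-day bucket and takes the median per bucket. The sample data and the small `median` helper are assumptions for illustration, and this is a simplified alternative to the plotting function defined in this notebook.

```python
from collections import defaultdict
from datetime import date, timedelta

def median(values):
    ordered = sorted(values)
    middle = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2.0

# Hypothetical (issue created, days In Review) pairs
samples = [
    (date(2017, 1, 2), 3.0),
    (date(2017, 1, 4), 5.0),
    (date(2017, 1, 10), 2.0),
    (date(2017, 1, 25), 8.0),
    (date(2017, 1, 27), 4.0),
]

first_day = min(created for created, _ in samples)
buckets = defaultdict(list)

for created, elapsed in samples:
    # Integer division assigns each ticket to exactly one 7-day bucket
    week_index = (created - first_day).days // 7
    buckets[first_day + timedelta(days=7 * week_index)].append(elapsed)

weekly_median = {
    week_start: median(values) for week_start, values in buckets.items()
}

for week_start in sorted(weekly_median):
    print(week_start, weekly_median[week_start])
```

Weeks with no tickets simply do not appear in the result, rather than being merged into a neighboring week.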

In [None]:
def get_time_bucket_plot(times, day_count, aggregator, label, solid_line=True):
    if len(times) < 2:
        return

    # A key function works on both Python 2 and Python 3
    sorted_times = sorted(times, key=lambda entry: entry['issueCreated'])

    start = 0
    end = start

    start_date = sorted_times[0]['issueCreated']
    max_end_date = start_date + timedelta(days=day_count)

    dates = []
    values = []

    while end + 1 < len(sorted_times):
        while end + 1 < len(sorted_times) and sorted_times[end+1]['issueCreated'] < max_end_date:
            end += 1

        bucket = [entry['elapsedTime'] for entry in sorted_times[start:end+1]]

        dates.append(start_date)
        values.append(aggregator(bucket))

        start = end + 1
        end = start

        start_date = max_end_date
        max_end_date = start_date + timedelta(days=day_count)

    if solid_line:
        plt.plot_date(dates, values, fmt='-', label=label)
    else:
        plt.plot_date(dates, values, fmt='--', label=label)

When working with this visualization, we'll focus on just those tickets that are in review.

In [None]:
in_review_aggregate_times = [entry for entry in aggregate_times if entry['status'] == 'In Review']
unique_regions = set([entry['region'] for entry in aggregate_times])

We'll also focus our graphs to a handful of regions.

In [None]:
graph_regions = ['US']

In [None]:
for region in graph_regions:
    assert region in unique_regions

### Weekly Visualization: Median

The median will give us a sense of central tendency that is not as heavily influenced by extreme values as the mean. By plotting both the global information and the regional information on the same time plot, we get a sense of whether the central tendency in our region is similar to the central tendency of other regions over time.

In [None]:
get_time_bucket_plot(in_review_aggregate_times, 7, np.median, 'Global Median')

for region in graph_regions:
    region_times = [time for time in in_review_aggregate_times if time['region'] == region]
    get_time_bucket_plot(region_times, 7, np.median, '%s Regional Median' % region, False)

plt.legend()

plt.show()

### Weekly Visualization: 90th Percentile

The 90% line will tell us how close we are to meeting our goal. By plotting both the global information and the regional information on the same time plot, we get a sense of whether there are any noticeable differences, which may explain whether our own region has more trouble meeting the goal than the global tendency suggests.

In [None]:
nineties = functools.partial(np.percentile, q=90)

In [None]:
get_time_bucket_plot(in_review_aggregate_times, 7, nineties, 'Global 90%')

for region in graph_regions:
    region_times = [time for time in in_review_aggregate_times if time['region'] == region]
    get_time_bucket_plot(region_times, 7, nineties, '%s Regional 90%%' % region, False)

plt.legend()

plt.show()

## Sliding Window Visualizations

When working with time, you're not required to think of each ticket as contributing to only a single plotted point. With that in mind, let's remind ourselves of our original research question.

<b style="color:green">Can we create data visualizations on top of LPS ticket metadata that let us group together different DXP tickets so that we can explore the times that tickets remain "In Review" based on those groupings?</b>

What we've done is we've made it so that each data point only exists in one bucket. However, this isn't necessarily required. You can think of the data points having an impact on multiple buckets or groupings (which we represent as a time-based data point).

To do that, one option is to have each data point affect all of the time-based data points that happen after it. When this impact is equal (so no matter how much time has elapsed, you still give it the same weight), you get the traditional statistics that we tabulated previously, except that the value incorporates more and more data over time. This is known as a cumulative moving average.

* [Cumulative moving average](https://en.wikipedia.org/wiki/Moving_average#Cumulative_moving_average)
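
A minimal sketch of a cumulative moving average over hypothetical elapsed times, where every earlier value keeps the same weight:

```python
# Hypothetical elapsed times, ordered by ticket creation date
elapsed_times = [2.0, 4.0, 6.0, 8.0]

cumulative_average = []
running_total = 0.0

for count, value in enumerate(elapsed_times, start=1):
    # Every value seen so far contributes equally to the average
    running_total += value
    cumulative_average.append(running_total / count)

print(cumulative_average)
```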

Another option is that this impact diminishes over time. For example, things that happen further and further in the past have less and less of an impact. A popular strategy to use is exponential decay, where you have a weighted moving average where dates further in the past have less and less of an impact. This is similar to Liferay's social activity implementation.

* [The decaying average, relative and sensitive](http://donlehmanjr.com/Science/03%20Decay%20Ave/032.htm)
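
A minimal sketch of an exponentially decaying average. The `alpha` smoothing factor and the sample values are assumptions for illustration, and nothing here is taken from Liferay's social activity code.

```python
def decaying_average(values, alpha):
    """Exponentially weighted average: each step keeps (1 - alpha) of the past."""
    averages = []
    current = None

    for value in values:
        if current is None:
            # The first value has no history to blend with
            current = value
        else:
            current = alpha * value + (1 - alpha) * current

        averages.append(current)

    return averages

smoothed = decaying_average([2.0, 4.0, 6.0, 8.0], alpha=0.5)
print(smoothed)
```

A larger `alpha` makes the average react faster to recent values, while a smaller one lets older values linger.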

For our visualization example, we're going to look at something that sits in between the two.

The idea is that you accumulate a traditional statistic where you don't think too much about weight (so you don't define a time-based decay), but you don't want to use all data and instead decide to discard data after a certain amount of time has elapsed (giving it a weight of zero).

The net effect is a simple moving average, which is implemented using a sliding (or rolling) time window.

* [Simple moving average](https://www.fidelity.com/learning-center/trading-investing/technical-analysis/technical-indicator-guide/sma)

If we choose 30 days for this sliding window, then for any given point in time, each value looks at all tickets opened in the thirty days before that point and computes a statistic describing how long those tickets spent in the status.
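
The sliding window can be sketched with a `deque` over hypothetical samples. This simplified version evaluates the mean at each ticket's creation date rather than stepping one day at a time, so it is an illustration of the technique and not a replacement for the plotting function in this notebook.

```python
from collections import deque
from datetime import date, timedelta

def rolling_mean(samples, window_days):
    """For each sample date, average the values created within the window."""
    window = deque()
    total = 0.0
    results = []

    for created, elapsed in sorted(samples):
        window.append((created, elapsed))
        total += elapsed

        # Discard values that fell out of the window (a weight of zero)
        cutoff = created - timedelta(days=window_days)
        while window[0][0] <= cutoff:
            _, expired = window.popleft()
            total -= expired

        results.append((created, total / len(window)))

    return results

# Hypothetical (issue created, days In Review) pairs
samples = [
    (date(2017, 1, 1), 2.0),
    (date(2017, 1, 15), 4.0),
    (date(2017, 2, 5), 9.0),
    (date(2017, 2, 20), 3.0),
]

rolling = rolling_mean(samples, 30)

for created, value in rolling:
    print(created, value)
```

The January 1 ticket stops contributing once the window moves past it, which is what distinguishes this from the cumulative moving average.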

In [None]:
def get_time_rolling_plot(times, window_size, aggregator, label, solid_line=True):
    # A key function works on both Python 2 and Python 3
    sorted_times = sorted(times, key=lambda entry: entry['issueCreated'])

    if len(sorted_times) < 2:
        return

    start = 0
    end = 1

    max_end_date = sorted_times[end]['issueCreated']
    min_start_date = max_end_date - timedelta(days=window_size)

    rolling_window = [sorted_times[end]['elapsedTime']]

    dates = []
    values = []

    while end + 1 < len(sorted_times):
        while start + 1 < end and sorted_times[start+1]['issueCreated'] < min_start_date:
            rolling_window.pop(0)
            start += 1

        while end < len(sorted_times) and sorted_times[end]['issueCreated'] < max_end_date:
            rolling_window.append(sorted_times[end]['elapsedTime'])
            end += 1

        dates.append(min_start_date)
        values.append(aggregator(rolling_window))

        min_start_date += timedelta(days=1)
        max_end_date = min(today + timedelta(days=1), max_end_date + timedelta(days=1))

    if solid_line:
        plt.plot_date(dates, values, '-', label=label)
    else:
        plt.plot_date(dates, values, '--', label=label)

### Sliding Window: Mean

Plotting a sliding window has a smoothing effect, and so while outliers will still have an effect (and they have an effect for a longer period of time), it is still customary to think of the mean first when using a sliding window.

* [Smoothing algorithms](https://en.wikipedia.org/wiki/Smoothing#Smoothing_algorithms)

Just as with our earlier plot, by plotting both the global information and the regional information on the same time plot, we get a sense of whether the central tendency in our region is similar to the central tendency of other regions over time.

In [None]:
get_time_rolling_plot(in_review_aggregate_times, 30, np.mean, 'Global Mean')

for region in graph_regions:
    region_times = [time for time in in_review_aggregate_times if time['region'] == region]
    get_time_rolling_plot(region_times, 30, np.mean, '%s Regional Mean' % region, False)

plt.legend()

plt.show()

### Sliding Window: 90th Percentile

When using the 90th percentile on a sliding window, you end up answering the question, "How well have we been doing over the past month, and how has that varied over time?" Having both global and regional values gives us a sense of how that compares to other regions.

In [None]:
get_time_rolling_plot(in_review_aggregate_times, 30, nineties, 'Global 90%')

for region in graph_regions:
    region_times = [time for time in in_review_aggregate_times if time['region'] == region]
    get_time_rolling_plot(region_times, 30, nineties, '%s Regional 90%%' % region, False)

plt.legend()

plt.show()

## Group LPP Tickets by Component

We've now finished up with a lot of introductory level visualizations. Let's remind ourselves of our original question.

<b style="color:green">Can we create data visualizations on top of LPS ticket metadata that let us group together different DXP tickets so that we can explore the times that tickets remain "In Review" based on those groupings?</b>

If we were to take a step back, the question we want to ask is: why do we care about groupings? Well, naturally because we believe that after we apply some grouping, it gives us something *actionable* that will impact our ticket times.

One actionable item might be, "Do we need to restructure our teams to reflect the types of tickets we are receiving?" In this case, you might consider component-based groupings to see if there are any components where you are performing below average (have higher status elapsed times). You could then deliver fixes more quickly by restructuring your teams to bring more resources into a given component, or by improving the relationship between our support SMEs and the engineering teams for those components so that those tickets are reviewed more quickly.

### Generate Component-Based Grouping, Take 1

To determine our component-based groupings, we have a few options. One of the easier options is to simply generate the mapping table from the JIRA data we already have. All components tied to a JIRA issue are included in the `components` field. This gives us the following code to generate a mapping table.

In [None]:
def get_component_names(issue):
    for component in issue['fields']['components']:
        yield component['name']

In [None]:
component_mapping = [
    {'jiraKey': key, 'component': component_name}
        for key, issue in lpp_issues.items()
            for component_name in get_component_names(issue)
]

df_component_mapping = pd.DataFrame(component_mapping)
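As a small illustration of what the mapping table looks like, the sketch below applies the same flattening to three hypothetical issues (the keys and component names are invented). Note that an issue with two components produces two rows, and an issue with no components produces none.

```python
import pandas as pd

# Hypothetical issues, shaped like the 'fields' -> 'components' structure used above.
sample_issues = {
    'LPP-1': {'fields': {'components': [{'name': 'Search'}, {'name': 'Portal > Cache'}]}},
    'LPP-2': {'fields': {'components': [{'name': 'Search'}]}},
    'LPP-3': {'fields': {'components': []}},
}

# One row per (issue, component) pair.
sample_mapping = pd.DataFrame(
    [
        {'jiraKey': key, 'component': component['name']}
        for key, issue in sample_issues.items()
        for component in issue['fields']['components']
    ],
    columns=['jiraKey', 'component'],
)
```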

Now that we have that component mapping table, we'll want to join it with our existing view that documents all of our elapsed times for every status.

In [None]:
df_mapping_join = pd.merge(
    df[df['status'] == 'In Review'],
    df_component_mapping, on='jiraKey')
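Because `pd.merge` performs an inner join by default, the join has two side effects worth keeping in mind: a ticket with several components appears once per component, and a ticket with no components disappears. The following sketch shows both on hypothetical data, and uses the `indicator` option of a left join to list the tickets that would be dropped.

```python
import pandas as pd

# Hypothetical elapsed-time rows and component mapping.
times = pd.DataFrame({
    'jiraKey': ['LPP-1', 'LPP-2', 'LPP-3'],
    'status': ['In Review'] * 3,
    'elapsedTime': [2.0, 5.0, 1.0],
})
mapping = pd.DataFrame({
    'jiraKey': ['LPP-1', 'LPP-1', 'LPP-2'],
    'component': ['Search', 'Portal > Cache', 'Search'],
})

# Inner join (the default): LPP-1 is duplicated, LPP-3 is dropped.
joined = pd.merge(times, mapping, on='jiraKey')

# Left join with an indicator column to find tickets without any component.
left = pd.merge(times, mapping, on='jiraKey', how='left', indicator=True)
unmatched = left[left['_merge'] == 'left_only']['jiraKey'].tolist()
```

The duplication means that counts and averages computed per component are per (ticket, component) pair, not per ticket.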

### Cross-Tabulation, Take 1

We already know that many components are rarely used, which means that a visualization by component might not be useful due to data sparsity. Let's confirm that by creating a cross-tabulation of status by component.

In [None]:
pd.crosstab(
    df_mapping_join['component'],
    df_mapping_join['status']
)
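One way to put a number on sparsity is to compute the share of components that fall below some minimum ticket count. The sketch below does this on hypothetical rows; the cutoff `min_tickets` is an arbitrary value chosen for the example, not a threshold taken from the data above.

```python
import pandas as pd

# Hypothetical joined rows: one row per (ticket, component) pair.
sample_join = pd.DataFrame({
    'component': ['Search', 'Search', 'Search', 'Portal > Cache', 'Portal > Mail'],
    'status': ['In Review'] * 5,
})

sample_crosstab = pd.crosstab(sample_join['component'], sample_join['status'])

# min_tickets is an arbitrary cutoff for this example.
min_tickets = 2
sparse_share = (sample_crosstab['In Review'] < min_tickets).mean()
```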

This cross-tabulation confirms our initial belief that the data is extremely sparse. That means we'll want to use coarser component groupings if we want to understand it.

### Generate Component-Based Grouping, Take 2

There are a variety of ways to redo the component-based grouping, all of them rooted in different ways of organizing and retrieving information, and all of them valid depending on the context.

* [Information science](https://en.wikipedia.org/wiki/Information_science)

In our case, where the categorizations were created around Engineering teams, the most obvious approach is to use the Engineering team names that Liferay has defined, which we get by stripping everything from the greater-than sign onward. We'll also want to make sure that we don't return the same value twice.

In [None]:
def get_parent_component_names(issue):
    component_names = [
        component['name'] for component in issue['fields']['components']
    ]

    parent_component_names = set([
        component_name[0:component_name.find(' > ')] if component_name.find(' > ') > 0 else component_name
            for component_name in component_names
    ])

    return parent_component_names
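The stripping rule can be checked in isolation on a few names. This is a minimal sketch using `str.partition`; the helper name `parent_component` and the sample component names are hypothetical.

```python
def parent_component(name, separator=' > '):
    """Return the text before the first separator, or the whole name if there is none."""
    head, found, _ = name.partition(separator)
    # Keep the full name when the separator is absent or the prefix would be empty.
    return head if found and head else name


# Hypothetical component names for illustration.
sample_names = ['Portal > Cache', 'Portal > Mail', 'Search']

# A set removes the duplicate 'Portal' produced by the first two names.
sample_parents = {parent_component(name) for name in sample_names}
```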

In [None]:
parent_component_mapping = [
    {'jiraKey': key, 'component': component_name}
        for key, issue in lpp_issues.items()
            for component_name in get_parent_component_names(issue)
]

df_parent_component_mapping = pd.DataFrame(parent_component_mapping)

We'll want to join our new mapping table with our existing view that documents all of our elapsed times for every status.

In [None]:
df_parent_component_mapping_join = pd.merge(
    df[df['status'] == 'In Review'],
    df_parent_component_mapping,
    on='jiraKey'
)

### Cross-Tabulation, Take 2

Let's see how well we addressed our data sparsity by creating a cross-tabulation of status by component.

In [None]:
df_parent_component_crosstab = pd.crosstab(
    df_parent_component_mapping_join['component'],
    df_parent_component_mapping_join['status']
)

df_parent_component_crosstab

As we can see, we still have a sparsity problem if we decide to look only at the "In Review" column. Therefore, we'll want to consolidate our groupings even more if we want to really understand what is happening. Should we do so, we need to keep in mind that we also run the risk of over-summarizing and losing too much contextual information.

For now, we'll only explore the top 5 components, because that's the number of colors available in our color-blind friendly palette.

In [None]:
df_parent_component_crosstab[
    df_parent_component_crosstab['In Review'] > 20
].sort_values('In Review').tail(5)
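The same "filter, then keep the five largest" selection can also be written with `DataFrame.nlargest`. The sketch below uses hypothetical counts; unlike `sort_values(...).tail(5)`, it returns the rows in descending order.

```python
import pandas as pd

# Hypothetical "In Review" counts per parent component.
sample_counts = pd.DataFrame(
    {'In Review': [120, 45, 8, 300, 21, 60, 33]},
    index=['A', 'B', 'C', 'D', 'E', 'F', 'G'],
)

# Keep components with more than 20 tickets, then take the five largest.
top_components = sample_counts[sample_counts['In Review'] > 20].nlargest(5, 'In Review')
```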

These, of course, match the components where we know bug reports frequently arrive.

In [None]:
highcount_components = df_parent_component_crosstab[
    df_parent_component_crosstab['In Review'] > 20
].sort_values('In Review').tail(5).index.values

highcount_join = df_parent_component_mapping_join[
    df_parent_component_mapping_join['component'].isin(highcount_components)
]

With that in mind, note that this isn't the only way to derive components.

An alternate approach is to derive brand new groupings the same way a search engine might derive groupings. After all, each ticket has descriptions and comments, so you might try to derive new categorizations based on summarizations of those descriptions and comments (such as tags) rather than rely on the existing Liferay categorizations.
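As a rough sketch of that alternate approach, the example below weights the terms of a few hypothetical ticket summaries with TF-IDF (the kind of term weighting used in text search) and clusters them with k-means from scikit-learn. The summaries, the choice of k-means, and the number of clusters are all assumptions made for illustration; this notebook does not use this method.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical ticket summaries; real input would be descriptions and comments.
summaries = [
    'Search index is not updated after reindex',
    'Reindex fails with search engine timeout',
    'Search results missing after upgrade',
    'Login page throws error with LDAP authentication',
    'LDAP import does not create users on login',
    'Authentication fails for LDAP users',
]

# Convert each summary into a sparse vector of TF-IDF term weights.
vectorizer = TfidfVectorizer(stop_words='english')
tfidf = vectorizer.fit_transform(summaries)

# n_clusters=2 is an arbitrary choice for this toy data set.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
derived_groups = kmeans.fit_predict(tfidf)
```

Each entry of `derived_groups` is a cluster label that could stand in for the `component` column in a mapping table.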

## Visualize LPP Tickets by Component

Now we're going to bring everything together to answer our research question.

<b style="color:green">Can we create data visualizations on top of LPS ticket metadata that let us group together different DXP tickets so that we can explore the times that tickets remain "In Review" based on those groupings?</b>

Now that we have derived the grouping (component) by performing a join, we will apply the same visualizations that we used for regions to instead compare the "In Review" times for each component.

### Aggregate Statistics

In [None]:
df_groupby = highcount_join[['component', 'elapsedTime']].groupby(['component'])

In [None]:
df_count = df_groupby.count()
df_count = df_count.rename(columns={'elapsedTime': 'count'})
df_norm1 = df_groupby.median()
df_norm1 = df_norm1.rename(columns={'elapsedTime': 'elapsedTimeMedian'})
df_norm2 = df_groupby.mean()
df_norm2 = df_norm2.rename(columns={'elapsedTime': 'elapsedTimeMean'})
df_quant = df_groupby.quantile([0.8, 0.9]).unstack()
df_quant.columns = ['%s%d%%' % (col[0], col[1] * 100) for col in df_quant.columns.values]

In [None]:
df_count.join(df_norm1).join(df_norm2).join(df_quant)

### Aggregate Cumulative Distribution

In [None]:
# Group by the column name (not a one-element list) so that each key is a plain label.
for key, group in highcount_join.groupby('component'):
    ax = sns.kdeplot(group['elapsedTime'], cumulative=True, bw=0.5, label=key)

ax.set_xlim((0, 40))

ax.set_yticks(np.linspace(0, 1, 11))
ax.set_yticklabels(['%d%%' % x for x in np.linspace(0, 100, 11)])

ax.set_xlabel('Days In Review')
ax.set_ylabel('Proportion of Tickets')

plt.show()

### Monthly Visualization: Median

In [None]:
for component_name in highcount_components:
    component_tickets = highcount_join[highcount_join['component'] == component_name]['jiraKey'].values
    component_times = [time for time in in_review_aggregate_times if time['jiraKey'] in component_tickets]

    get_time_bucket_plot(component_times, 30, np.median, '%s Median' % component_name)

plt.legend()

plt.show()

### Monthly Visualization: 90th Percentile

In [None]:
for component_name in highcount_components:
    component_tickets = highcount_join[highcount_join['component'] == component_name]['jiraKey'].values
    component_times = [time for time in in_review_aggregate_times if time['jiraKey'] in component_tickets]

    get_time_bucket_plot(component_times, 30, nineties, '%s 90%%' % component_name)

plt.legend()

plt.show()

### Sliding Window: Mean

In [None]:
for component_name in highcount_components:
    component_tickets = highcount_join[highcount_join['component'] == component_name]['jiraKey'].values
    component_times = [time for time in in_review_aggregate_times if time['jiraKey'] in component_tickets]

    get_time_rolling_plot(component_times, 90, np.mean, '%s Mean' % component_name)

plt.legend()

plt.show()

### Sliding Window: 90th Percentile

In [None]:
for component_name in highcount_components:
    component_tickets = highcount_join[highcount_join['component'] == component_name]['jiraKey'].values
    component_times = [time for time in in_review_aggregate_times if time['jiraKey'] in component_tickets]

    get_time_rolling_plot(component_times, 90, nineties, '%s 90%%' % component_name)

plt.legend()

plt.show()