# Database Joins 2: Visualizing Ticket Load

For this notebook, we start with the following research question. "How can we create data visualizations on top of the LESA, LPS, LPP, and BPR ticket metadata that let us group together different tickets so that we can explore the times that tickets remain in each status based on those groupings?"

In order to investigate the answer to this question, we start with a much smaller sub-question that focuses on more recent data.

<b style="color:green">How can we visualize LPP ticket activity for DXP?</b>

In order to answer this question, this notebook explores a few concepts in databases related to metadata and deriving additional statistics on top of that metadata.

At the end of this notebook, we will have a script that takes a sample of data from JIRA and enriches it with more data from JIRA, and the reader will have an improved understanding of the capabilities databases provide when it comes to processing information for visualization.

## Prerequisites

The following cell uses `conda` to install the libraries that are used by this notebook. If the output indicates that additional items were installed, you will need to restart the kernel after the installation completes before you can run the later cells in the notebook.

In [None]:
!conda install -y matplotlib scikit-learn seaborn statsmodels

## Notebook Imports

In [None]:
%matplotlib inline

In [None]:
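from __future__ import print_function

from checklpp import *
# Import explicitly the names that later cells use (pd, np, defaultdict,
# namedtuple, date) rather than relying on the star import above.
from collections import defaultdict, namedtuple
from datetime import date, datetime, timedelta
import functools
from IPython.display import display, HTML
import matplotlib
from multiprocessing import Pool, cpu_count
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression

try:
    reload  # built in on Python 2
except NameError:
    from importlib import reload  # Python 3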

## Enable Process Pool

Most of our computations can run independently of each other, so let's take advantage of some parallelization that's available on our machine.

In [None]:
pool = Pool(cpu_count())

We'll also change some of the default plots so that they're larger and use a white background instead of a gray one.

In [None]:
plt.rcParams['figure.figsize'] = (6.4*2, 4.8*1.5)
sns.set_style('whitegrid')

We'll also change some of our default colors to make them a little more color-blind friendly than the default plotting colors.

In [None]:
sns.set_palette('colorblind')
sns.palplot(sns.color_palette('colorblind'))

## Detour: Database Views

As you work at Liferay, it is easy to be led to believe that databases exist as naive information storage and retrieval systems, because that is really all Liferay does with a database. However, the truth is much more complex.

One of the other reasons databases exist is to manage relationships between data sets and allow you to analyze those relationships. As a result, many enterprise database vendors have created a rich set of proprietary functions on top of SQL that allow you to perform very insightful analysis.

Because you're uncovering relationships within the data, whenever you want to answer questions, you often join together tables and then perform additional analysis on top of those joined tables. It is not uncommon for an analysis to simply take multiple sets of SQL used to generate join tables and use them as subqueries.

Over time, it can be tedious to constantly paste in the same nested queries over and over again, and having too many of them will also make it harder for anyone reading the query to understand what it is you were trying to analyze. To address this problem, it is common to create a database view.

* [Why do you create a view in a database?](https://stackoverflow.com/questions/1278521/why-do-you-create-a-view-in-a-database)

You can think of a database view as an alias for a query result that someone has needed before. We also know that every query result is a table, and knowing why people want that information to begin with (constantly reusing it in different analyses), you can also see why people might like a feature that allows you to materialize those views.

* [Materialized view](https://en.wikipedia.org/wiki/Materialized_view)
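
As a rough illustration of a plain (non-materialized) view, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The table names, column names, and rows are hypothetical and exist only for this example; they are not how JIRA or Liferay store their data.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE ticket (jira_key TEXT, region TEXT);
    CREATE TABLE status_change (jira_key TEXT, status TEXT);

    INSERT INTO ticket VALUES ('LPP-1', 'US'), ('LPP-2', 'EU');
    INSERT INTO status_change VALUES
        ('LPP-1', 'Open'), ('LPP-1', 'Closed'), ('LPP-2', 'Open');

    -- The view names the join once so that later queries can reuse it.
    CREATE VIEW ticket_status AS
        SELECT t.jira_key, t.region, s.status
        FROM ticket t JOIN status_change s ON s.jira_key = t.jira_key;
""")

# Queries against the view read like queries against a plain table.
open_by_region = conn.execute(
    "SELECT region, COUNT(*) FROM ticket_status "
    "WHERE status = 'Open' GROUP BY region ORDER BY region"
).fetchall()

print(open_by_region)
```

The join is written once, inside the view, and every later analysis can select from `ticket_status` without repeating it.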

In our case, whenever we fetch tickets from JIRA, the JSON object may be cumbersome to work with. Therefore, in order to make it easier for people to understand the underlying information, we will create a view on top of it that allows us to concentrate on the specific fields we want to analyze. As we do this, however, we should keep in mind that all of the original raw data still exists, should we ever need it.

## Fetch LPP Tickets

Reminding ourselves of our question:

<b style="color:green">How can we visualize LPP ticket activity for DXP?</b>

So, let's start by making sure that we can get some time-related metadata from an LPP ticket. We'll look at LPP tickets that relate to DXP, but are not any of the known workflow testing tickets (because those aren't really DXP tickets, even if they have a DXP affected version).

In [None]:
lpp_jql = """
project = LPP and affectedVersion = "7.0 DE (7.0.10)" and
key not in (LPP-10825,LPP-10826,LPP-12114,LPP-13367)
"""

JIRA allows you to perform a join as part of its API by requesting an expansion of certain fields. Because we're interested in the time the ticket spends in each status, we can ask it to expand the `changelog` field, which effectively asks JIRA to perform a join on its equivalent of a `changelog` table.
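
This notebook does not show how `get_jira_issues` is implemented, so the following is only a sketch of what such a request could look like, assuming JIRA's REST search endpoint with its `jql` and `expand` query parameters. The helper name `build_search_url` and the host `jira.example.com` are made up for illustration.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2

def build_search_url(base_url, jql, expand_fields):
    # Hypothetical helper; not part of checklpp.
    query = urlencode([
        ('jql', ' '.join(jql.split())),       # collapse newlines and extra spaces
        ('expand', ','.join(expand_fields)),  # e.g. ask for the changelog "join"
    ])
    return '%s/rest/api/2/search?%s' % (base_url, query)

url = build_search_url('https://jira.example.com', 'project = LPP', ['changelog'])
print(url)
```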

In [None]:
if __name__ == '__main__':
    lpp_issues = get_jira_issues(lpp_jql, ['changelog'])
else:
    lpp_issues = {}

In [None]:
len(lpp_issues)

## Extract Resolution Transitions

Now that we have the raw data:

<b style="color:green">How can we visualize LPP ticket activity for DXP?</b>

To answer that question, we'll first need to accumulate all of the transitions. One way to think about active tickets is to look at all tickets that are unresolved on any given day.

In [None]:
%%writefile unresolved.py
from collections import defaultdict
from datetime import date
import dateparser
from six import string_types

region_field_name = 'customfield_11523'

def extract_time(issue):
    if region_field_name in issue['fields'] and issue['fields'][region_field_name] is not None:
        regions = [region['value'] for region in issue['fields'][region_field_name]]
    else:
        regions = ['']

    old_resolution = None
    old_status_date = dateparser.parse(issue['fields']['created'])

    history_entries = issue['changelog']['histories']

    for history_entry in history_entries:
        history_entry['createdTime'] = dateparser.parse(history_entry['created'])

    transitions = []

    for history_entry in history_entries:
        useful_history_items = [
            item for item in history_entry['items']
                if item['field'] == 'resolution'
        ]

        if len(useful_history_items) == 0:
            continue

        new_status_date = history_entry['createdTime']

        for item in useful_history_items:
            new_resolution = item['toString']

            if old_resolution is None and new_resolution is not None:
                transitions.append({
                    'jiraKey': issue['key'],
                    'type': issue['fields']['issuetype']['name'],
                    'region': regions[0],
                    'statusStart': old_status_date.date(),
                    'statusEnd': new_status_date.date()
                })

            old_resolution = new_resolution

        if old_resolution is None:
            old_status_date = new_status_date

    if old_resolution is None:
        transitions.append({
            'jiraKey': issue['key'],
            'type': issue['fields']['issuetype']['name'],
            'region': regions[0],
            'statusStart': old_status_date.date(),
            'statusEnd': date.today()
        })

    return transitions
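
To make the bookkeeping in `extract_time` easier to follow, here is a stripped-down sketch of the same idea that works on plain `(day, new_resolution)` pairs instead of JIRA changelog entries. The function name `unresolved_intervals` and the sample events are hypothetical.

```python
from datetime import date

def unresolved_intervals(created, resolution_events, today):
    """Return (start, end) pairs during which a ticket had no resolution.

    resolution_events is a chronological list of (day, new_resolution),
    where new_resolution is None when the resolution is cleared.
    """
    intervals = []
    resolution = None
    start = created

    for day, new_resolution in resolution_events:
        if resolution is None and new_resolution is not None:
            intervals.append((start, day))  # ticket just became resolved
        resolution = new_resolution
        if resolution is None:
            start = day                     # ticket was reopened

    if resolution is None:
        intervals.append((start, today))    # still unresolved as of today

    return intervals

events = [
    (date(2017, 1, 5), 'Fixed'),  # resolved
    (date(2017, 1, 9), None),     # reopened
]
intervals = unresolved_intervals(date(2017, 1, 2), events, date(2017, 1, 12))
print(intervals)
```

A ticket that is resolved and then reopened yields two intervals, which is why the same ticket can appear more than once in the resulting table.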

The following block uses the `reload` ability to allow us to constantly change the time extraction above (which writes to a separate Python file so that it can be parallelized) and then reload it.

In [None]:
import unresolved
reload(unresolved)

Now, we perform our parallel processing.

In [None]:
times = []

region_field_name = 'customfield_11523'

num_finished = 0

for result in pool.imap_unordered(unresolved.extract_time, lpp_issues.values()):
    if num_finished % 100 == 0:
        print('[%s] Processed %d of %d issues' % (datetime.now().isoformat(), num_finished, len(lpp_issues)))

    num_finished += 1

    for entry in result:
        times.append(entry)

print('[%s] Processed %d of %d issues' % (datetime.now().isoformat(), num_finished, len(lpp_issues)))
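
If `imap_unordered` is new to you, the sketch below shows the pattern on hypothetical data. It uses `multiprocessing.pool.ThreadPool`, which offers the same `imap_unordered` interface as `Pool`, purely so that the example runs anywhere without a separate module file; the stand-in function `count_history_entries` is not part of this notebook.

```python
from multiprocessing.pool import ThreadPool

def count_history_entries(issue):
    # Stand-in for extract_time: one small result per issue.
    return issue['key'], len(issue['histories'])

# Hypothetical issues; real ones come from JIRA.
issues = [
    {'key': 'LPP-1', 'histories': ['a', 'b']},
    {'key': 'LPP-2', 'histories': []},
    {'key': 'LPP-3', 'histories': ['c']},
]

thread_pool = ThreadPool(2)
try:
    # Results arrive in completion order, not input order, so we
    # collect them by key instead of relying on their position.
    results = dict(thread_pool.imap_unordered(count_history_entries, issues))
finally:
    thread_pool.close()
    thread_pool.join()

print(results)
```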

In [None]:
df = pd.DataFrame(times)

## Detour: Cross-Tabulation

Before we begin creating tables that summarize the table across a numerical statistic (in our case, counts), it's useful to do some data validation so that we can check our assumptions.

### Discrete Variables

One validation we might do is this: how many tickets of each type are open in each region? This question can be answered with a contingency table, or a cross-tabulation, which is available in almost all statistical computing libraries.

* [Crosstabs](http://libguides.library.kent.edu/SPSS/Crosstabs)

For those who are familiar with Excel pivot tables, you can think of a cross-tabulation as the simplest pivot table, where each data point has a value of 1 and you're only computing the sum.

* [Pivot table](https://exceljet.net/things-to-know-about-excel-pivot-tables)

In our case, the function we're using to compute the value of each cell is to simply count them. The result looks at how two or more columns co-occur with each other, and it's especially useful when examining how two categorical variables relate to each other.

* [Categorical Variable](https://en.wikipedia.org/wiki/Categorical_variable)

This allows you to check any assumptions about the data, such as whether two column values never co-occur, or whether two column values seem to co-occur a lot more than expected.
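
As a small worked example, the following sketch cross-tabulates two categorical columns with `pd.crosstab`. The rows are hypothetical and are not taken from the JIRA data.

```python
import pandas as pd

# Hypothetical rows, one per ticket.
tickets = pd.DataFrame({
    'region': ['US', 'US', 'EU', 'EU', 'EU'],
    'type': ['Bug', 'Task', 'Bug', 'Bug', 'Task'],
})

# Each cell counts how often a (region, type) pair occurs together.
table = pd.crosstab(tickets['region'], tickets['type'])
print(table)
```

A cell that is zero, or one that is much larger than its neighbors, is the kind of thing that prompts a second look at the assumptions behind the data.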

### Non-Discrete Variables

You might wonder, can you cross-tabulate two non-categorical columns, such as two columns that are both continuous, floating point values?

* [Continuous and Discrete Variable](https://en.wikipedia.org/wiki/Continuous_and_discrete_variable)

If you think about it, a cross-tabulation will have too many columns and too many rows. In this case, if you prefer to use a table to start out, one popular option is to simply convert the continuous variable into a discrete variable using binning, which allows you to tabulate the resulting artificial categories. This is something that is common within pivot tables as well.

* [Data binning](https://en.wikipedia.org/wiki/Data_binning)
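
A minimal sketch of binning, assuming two made-up numeric columns (`days_open` and `comments`) and bin edges chosen only for illustration:

```python
import pandas as pd

# Hypothetical measurements: days open and number of comments.
tickets = pd.DataFrame({
    'days_open': [1, 3, 8, 15, 40, 2],
    'comments': [0, 4, 12, 7, 30, 1],
})

# pd.cut uses right-closed bins by default: (0, 7], (7, 30], (30, 365].
age = pd.cut(tickets['days_open'], bins=[0, 7, 30, 365],
             labels=['week', 'month', 'longer'])
chatter = pd.cut(tickets['comments'], bins=[-1, 5, 100],
                 labels=['quiet', 'busy'])

# Once both columns are categories, they can be cross-tabulated as usual.
table = pd.crosstab(age, chatter)
print(table)
```

Changing the bin edges changes the table, which is why the choice of binning function deserves some thought.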

For the more mathematically and visually oriented, this table might over-summarize the information if you do not choose your discretization (or binning) function well. An alternative popular option is to simply interpret the co-occurrences as probabilities rather than raw counts, which allows you to construct a contour plot of the joint probability function.

If you've never heard of a contour plot before, imagine that one variable is geo latitude, and another variable is geo longitude. A famous contour plot would be something that depicts the elevation at each latitude and longitude. The end result is an elevation map, which you can render in two dimensions with color, or allow interacting in three dimensions like in Google Earth.

* [Contour plots](http://www.statisticshowto.com/contour-plots/)

If you don't remember what a joint probability function is, and you're a visual learner, these lecture notes give a good visual refresher, and also provide visualizations that help you connect the idea behind a joint probability distribution to contour plots.

* [Joint Density Functions, Marginal Density Functions, Conditional Density Functions, Expectations and Independence](http://www.colorado.edu/economics/morey/7818/jointdensity/jointdensity.pdf)
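
To connect counts to probabilities, the sketch below normalizes a small, hypothetical table of co-occurrence counts into a joint probability table and then sums it to get the marginals.

```python
import numpy as np

# Hypothetical co-occurrence counts: rows are bins of x, columns are bins of y.
counts = np.array([
    [4, 1],
    [2, 3],
])

joint = counts / float(counts.sum())  # P(x, y); the entries sum to 1
marginal_x = joint.sum(axis=1)        # P(x), summing over y
marginal_y = joint.sum(axis=0)        # P(y), summing over x

print(joint)
print(marginal_x, marginal_y)
```

With many more bins, an array like `joint` is the kind of input you could hand to a contour plotting function such as matplotlib's `plt.contourf`.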

## Table Visualizations

Now that we have our filtered view of the JIRA data, let's revisit our research question.

<b style="color:green">How can we visualize LPP ticket activity for DXP?</b>

Given our question, naturally, our next step is to visualize our JIRA data. We'll start with the most basic visualization: a table.

In this case, we'll take a look at the cross-tabulation of ticket types and regions. Note that because our table allows for a ticket to appear multiple times (because tickets can become resolved and then be reopened), this table is interpreted as how often 7.0.x tickets of different types have been flagged as resolved in each region, according to the JIRA changelog entries.

In [None]:
pd.crosstab(df['region'], [df['type']])

We can see from the above that there are some activity transitions where the tickets have no region at all. DXP Product Escalations have tickets that appear to have a region field, while Sub-Task tickets exclusively have no region.

If we were to do some data cleansing, we might revisit those in order to fill in the appropriate region in case it is relevant to other data analysis tasks. As we are summarizing the data using visualizations, these will only matter if we were to create region-specific visualizations of these activity counts.

In [None]:
display(HTML(', '.join([
    '<nobr><a href="http://liferay.atlassian.net/browse/%s">%s</a></nobr>' % (jira_key, jira_key)
        for jira_key in df[df['region'] == '']['jiraKey'].unique()
])))

## Detour: Time Visualizations

What we've actually achieved is one of the most common groupings: time.

When you group data over time and then visualize it, essentially what you are monitoring is what is happening to a specific value, with data spaced out at equal intervals (hours, days, weeks, months, years), which opens up a lot of different ways to both analyze and transform the time data. The types of transformations and analyses you can readily do depend on the data set and your background in applied mathematics.

* [Time series](https://en.wikipedia.org/wiki/Time_series)

For the purposes of this notebook, we aren't going to try to analyze the data. Rather, we're going to try to apply a simple time-based summarization: the total number of active tickets, with varying definitions of active.

## Visualizing Load: Raw Count

Let's come back to our research question.

<b style="color:green">How can we visualize LPP ticket activity for DXP?</b>

One of the most natural approaches is to simply ask, "How many tickets were active today?" Then ask that question every day.

So let's say that you wanted to compute how many tickets were active on each day given our current table, which has a column that indicates when a ticket started being active and a column that indicates when a ticket was no longer active. How can you use this table to compute how many tickets were active on any given day?

In [None]:
def expand_as_days(start, end):
    current = start

    while current <= end:
        yield current
        current += timedelta(days=1)

A naive approach is to simply generate a new mapping table similar to the tables we've already created, where we create an entry for every single day a ticket is active. Since we don't actually have that many tickets and that many days, this is actually extremely practical.
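
Here is the same idea on a tiny, hypothetical set of intervals, using only the standard library. Note how a set per day keeps a ticket from being counted twice when two of its intervals touch the same day.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical (key, start, end) intervals, inclusive on both ends.
intervals = [
    ('LPP-1', date(2017, 1, 2), date(2017, 1, 4)),
    ('LPP-2', date(2017, 1, 3), date(2017, 1, 3)),
    ('LPP-1', date(2017, 1, 4), date(2017, 1, 5)),  # reopened the same day
]

active_by_day = defaultdict(set)

for key, start, end in intervals:
    current = start
    while current <= end:
        active_by_day[current].add(key)  # a set counts each ticket once per day
        current += timedelta(days=1)

counts = {day: len(keys) for day, keys in sorted(active_by_day.items())}
print(counts)
```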

In [None]:
def get_active_mapping(df, expand_function):
    active_by_date = defaultdict(lambda: defaultdict(set))

    for jira_key, region, start, end in zip(df['jiraKey'], df['region'], df['statusStart'], df['statusEnd']):
        for current in expand_function(start, end):
            active_by_date[region][current].add(jira_key)

    df_active_by_day_mapping = pd.DataFrame([
        {'date': date_key, 'region': region, 'count': len(jira_keys)}
            for region, value in active_by_date.items()
                for date_key, jira_keys in sorted(value.items())
    ])

    return df_active_by_day_mapping

### Aggregate Visualization

Now that we have that information, we can achieve what we want through a group by expression.

In [None]:
df_active = get_active_mapping(df, expand_as_days)

In [None]:
df_groupby = df_active[['date', 'count']].groupby(['date'])

df_count = df_groupby.sum()
df_count.columns = ['Global Count']

In [None]:
df_count.plot()

What we see from this visualization is that the tickets are sticking around. This suggests that some tickets never enter a resolved state, and these are causing our visualization to show that tickets are persisting.

So this raises a question. Of the tickets that are supposedly not yet resolved today, what's their current resolution status?

In [None]:
df_today = df[df['statusEnd'] == date.today()].copy()
df_today['statusStartMonth'] = df_today['statusStart'].apply(lambda x: x.replace(day=1))

all_resolutions = [
    {
        'jiraKey': issue['key'],
        'resolution': issue['fields']['resolution']['name']
            if issue['fields']['resolution'] is not None else 'Unresolved'
    }
    for issue in lpp_issues.values()
]

df_resolutions = pd.DataFrame(all_resolutions)

df_today_resolutions = df_today.merge(df_resolutions)

This leads us to the following distribution.

In [None]:
df_today_resolutions['resolution'].value_counts().plot.bar()

This visualization lets us know that while most of the tickets that lack a resolution history entry are truly unresolved, some of them lack a history entry even though they are in fact resolved. This is essentially a case of dirty data.

* [Dirty data](https://en.wikipedia.org/wiki/Dirty_data)

There are a variety of ways to handle dirty data, such as imputing all the missing values. For now, we'll look at a different way of extracting whether or not a ticket is active.

## Extract Status Transitions

Next, we'll look at status transitions.

In [None]:
%%writefile startstop.py
from collections import defaultdict
from datetime import date
import dateparser
from six import string_types

region_field_name = 'customfield_11523'

def extract_time(issue):
    if region_field_name in issue['fields'] and issue['fields'][region_field_name] is not None:
        regions = [region['value'] for region in issue['fields'][region_field_name]]
    else:
        regions = ['']

    old_status = 'Open'
    old_status_date = dateparser.parse(issue['fields']['created'])

    history_entries = issue['changelog']['histories']

    for history_entry in history_entries:
        history_entry['createdTime'] = dateparser.parse(history_entry['created'])

    transitions = []

    for history_entry in history_entries:
        useful_history_items = [
            item for item in history_entry['items']
                if item['field'] == 'status'
        ]

        if len(useful_history_items) == 0:
            continue

        new_status_date = history_entry['createdTime']

        for item in useful_history_items:
            new_status = item['toString']

            transitions.append({
                'jiraKey': issue['key'],
                'type': issue['fields']['issuetype']['name'],
                'region': regions[0],
                'status': old_status,
                'statusStart': old_status_date.date(),
                'statusEnd': new_status_date.date()
            })

            old_status = new_status
            # The next status starts when this transition happened.
            old_status_date = new_status_date

    transitions.append({
        'jiraKey': issue['key'],
        'type': issue['fields']['issuetype']['name'],
        'region': regions[0],
        'status': old_status,
        'statusStart': old_status_date.date(),
        'statusEnd': date.today()
    })

    return transitions

The following block uses the `reload` ability to allow us to constantly change the time extraction above (which writes to a separate Python file so that it can be parallelized) and then reload it.

In [None]:
import startstop
reload(startstop)

Now, we process all of our issues.

In [None]:
times = []

region_field_name = 'customfield_11523'

num_finished = 0

for result in pool.imap_unordered(startstop.extract_time, lpp_issues.values()):
    if num_finished % 100 == 0:
        print('[%s] Processed %d of %d issues' % (datetime.now().isoformat(), num_finished, len(lpp_issues)))

    num_finished += 1

    for entry in result:
        times.append(entry)

print('[%s] Processed %d of %d issues' % (datetime.now().isoformat(), num_finished, len(lpp_issues)))

## Identify Active Statuses

Now that we've extracted all of the status transitions:

<b style="color:green">How can we visualize LPP ticket activity for DXP?</b>

First, we'll need to identify which statuses correspond to a ticket being active.

In [None]:
df = pd.DataFrame(times)

In [None]:
df['status'].unique()

When we see that list, almost all the statuses could be considered active other than Audit and Closed. The remaining statuses depend on whether you're looking at things from the Customer Support perspective or the Technical Support perspective.

We can capture whether or not a ticket is active from the Customer Support perspective through LESA. Therefore, it makes sense to focus on the Technical Support active statuses. These are the statuses we'll treat as active:

In [None]:
active_statuses = set([
    'Open', 'Verified', 'In Progress', 'On Hold', 'In Review',
    'Ready for Investigation', 'Reopened', 'Wormhole',
    'Awaiting Help', 'Developing Plan'
])

Using that, we filter our data.

In [None]:
df = df[df['status'].isin(active_statuses)]

### Aggregate Visualizations

Now that we have that information, we can achieve what we want through a group by expression.

In [None]:
df_active = get_active_mapping(df, expand_as_days)

In [None]:
df_groupby = df_active[['date', 'count']].groupby(['date'])

df_count = df_groupby.sum()
df_count.columns = ['Global Count']

In [None]:
df_count.plot()

### Region-Specific Visualization

One of the things we have to worry about when plotting a region-specific visualization is the number of regions we have.

In [None]:
regions = [region for region in sorted(df_active['region'].unique()) if region != '']

regions

To understand why, let's go ahead and plot all of our regions.

In [None]:
for region in regions:
    df_region = df_active[df_active['region'] == region][['date', 'count']]
    plt.plot_date(df_region['date'], df_region['count'], fmt='-', label='%s Count' % region)

plt.legend()
plt.show()

As you can see, we have seven regions and only six unique colors in our color palette, and so the last color repeats. Luckily, the regions have very different ticket counts for ee-7.0.x so it's not a problem per se, but it's a useful talking point.

When you run into this situation, there are many options available to you. Some of the most popular are to choose what you wish to see (so don't plot all the regions), add more colors (which means you have to know for sure what kinds of color blindness exist in your target audience), or group together the different items and create multiple plots.

In this case, it looks like if we were to group in a typical negative UTC offset (US, Brazil) and positive UTC offset (Japan, EU, Spain, APAC, India), we have two graphs that have two and five colors, which is just enough of a division for us to distinguish colors.

In [None]:
negative_utc_offset = ['Brazil', 'US']
positive_utc_offset = ['APAC', 'EU', 'India', 'Japan', 'Spain']

In [None]:
for region in negative_utc_offset:
    df_region = df_active[df_active['region'] == region][['date', 'count']]
    plt.plot_date(df_region['date'], df_region['count'], fmt='-', label='%s Count' % region)

plt.legend()
plt.show()

for region in positive_utc_offset:
    df_region = df_active[df_active['region'] == region][['date', 'count']]
    plt.plot_date(df_region['date'], df_region['count'], fmt='-', label='%s Count' % region)

plt.legend()
plt.show()

## Visualizing Load: Change Count

You can think of the raw counts as having accumulated from tickets that became active and tickets that went inactive. Visualizing these increases and decreases along with the net change is also another way to visualize our counts.

First, let's compute these increases and decreases.

In [None]:
ActiveCount = namedtuple('ActiveCount', ['date', 'region', 'added', 'removed', 'net_change'])

In [None]:
def get_active_changes(df, expand_function):
    active_by_date = defaultdict(lambda: defaultdict(set))

    for jira_key, region, start, end in zip(df['jiraKey'], df['region'], df['statusStart'], df['statusEnd']):
        for current in expand_function(start, end):
            active_by_date[region][current].add(jira_key)

    changes = []

    for region, values in active_by_date.items():
        old_active = set()

        for date, new_active in sorted(values.items()):
            added = len(new_active - old_active)
            removed = 0 - len(old_active - new_active)
            net_change = len(new_active) - len(old_active)

            changes.append(ActiveCount(date, region, added, removed, net_change))

            old_active = new_active

    return pd.DataFrame(changes)
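
The set arithmetic inside `get_active_changes` is easier to see on a single pair of days. The ticket keys below are hypothetical.

```python
yesterday = {'LPP-1', 'LPP-2', 'LPP-3'}
today = {'LPP-2', 'LPP-3', 'LPP-4', 'LPP-5'}

added = len(today - yesterday)        # LPP-4 and LPP-5 became active
removed = 0 - len(yesterday - today)  # LPP-1 went inactive, stored as a negative
net_change = len(today) - len(yesterday)

print(added, removed, net_change)
```

Because `removed` is stored as a negative number, the net change is always the sum of the other two columns.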

In [None]:
df_changes = get_active_changes(df, expand_as_days)

In [None]:
df_groupby = df_changes[['date', 'added', 'removed', 'net_change']].groupby('date')

df_total_changes = df_groupby.sum()

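# Label each series so that plt.legend() has something to show.
plt.plot_date(df_total_changes.index, df_total_changes['added'], fmt='^', label='Added')
plt.plot_date(df_total_changes.index, df_total_changes['removed'], fmt='v', label='Removed')
plt.plot_date(df_total_changes.index, df_total_changes['net_change'], fmt='--', label='Net Change')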

plt.legend()
plt.show()

## Detour: Visualizing Distributions

Just as you can visualize values over time, you can also ignore the time dimension and simply visualize the values themselves.

* [Visualizing Distributions](http://www.darkhorseanalytics.com/blog/visualizing-distributions-3)

The goal of these visualizations is to understand where values are concentrated, and they give you a visual sense of whether this looks like a sequence of numbers that the world has seen and studied before (because a lot of analysis is comparing what you see to what you already understand).

* [Common Probability Distributions: A Data Scientist's Crib Sheet](https://blog.cloudera.com/blog/2015/12/common-probability-distributions-the-data-scientists-crib-sheet/)

## Visualizing Load: Central Tendency, Take 1

Let's come back to our research question.

<b style="color:green">How can we visualize LPP ticket activity for DXP?</b>

As we just reviewed, another way you can visualize ticket activity is to compare today's ticket counts with all ticket counts independent of time. We'll do this by combining two visualizations: a histogram and a box and whiskers plot.

* [Histogram](https://en.wikipedia.org/wiki/Histogram)
* [Box plot](https://en.wikipedia.org/wiki/Box_plot)

When we're plotting this data, we may want to worry about weekends, because we usually don't see much LPP ticket activity over the weekends, so whatever ticket counts occurred on Friday will tend to repeat. At the same time, it might not make any difference in the shape of the distribution (just the raw values will be lower).

In [None]:
is_weekday = df_active['date'].apply(lambda x: x.weekday() < 5)

fig, axes = plt.subplots(ncols=2, nrows=2, sharex=True, gridspec_kw={"height_ratios": (.05, .95)})

count = df_active[['date', 'count']].groupby(['date']).sum()['count']

sns.boxplot(count, ax=axes[0][0])
sns.distplot(count, ax=axes[1][0], kde=False, rug=True)

axes[0][0].set_xlabel('')
axes[1][0].set_title('Include Weekends')
axes[1][0].set_xlabel('Global Count')

count = df_active[is_weekday][['date', 'count']].groupby(['date']).sum()['count']

sns.boxplot(count, ax=axes[0][1])
sns.distplot(count, ax=axes[1][1], kde=False, rug=True)

axes[0][1].set_xlabel('')
axes[1][1].set_title('Exclude Weekends')
axes[1][1].set_xlabel('Global Count')

axes[1][1].set_ylim(axes[1][0].get_ylim())

plt.show()

In the histogram, we get a sense of the shape of the data. The x-axis represents the number of tickets that were active, where each block represents a range of values (from 0 to 19, for example), and the y-axis represents the number of times we saw that range of active ticket counts in our data set. In this case, that means the number of days we saw those values.

In the box and whiskers plot, we get a sense of the concentration of the data. The width of the shaded box represents the range of values between the first and third quartiles (the interquartile range), the heavy black line in the middle is the second quartile (also known as the median), and each whisker extends from the box by at most 1.5 times the interquartile range, stopping at the most extreme value in the actual data that falls within that distance.

* [Quartile](https://en.wikipedia.org/wiki/Quartile)
* [Interquartile Range](https://en.wikipedia.org/wiki/Interquartile_range)
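
The numbers behind a box and whiskers plot can be computed by hand. The sketch below does so for a hypothetical list of daily counts, assuming the common 1.5 times interquartile range rule for the whiskers.

```python
import numpy as np

# Hypothetical daily active-ticket counts.
counts = np.array([10, 12, 13, 15, 18, 20, 22, 40])

q1, median, q3 = np.percentile(counts, [25, 50, 75])
iqr = q3 - q1

# Whiskers stop at the most extreme observations within 1.5 * IQR of the box.
lower_whisker = counts[counts >= q1 - 1.5 * iqr].min()
upper_whisker = counts[counts <= q3 + 1.5 * iqr].max()

# Anything beyond the whiskers is drawn as an individual point.
outliers = counts[(counts < lower_whisker) | (counts > upper_whisker)]

print(q1, median, q3, lower_whisker, upper_whisker, outliers)
```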

## Visualizing Load: Central Tendency, Take 2

Again, we ask our question.

<b style="color:green">How can we visualize LPP ticket activity for DXP?</b>

Just as when we looked at change counts over time, we can also look at change counts independent of time, and we can see how these changes are concentrated.

In this case, understanding the counts for a specific region is more useful than seeing the counts globally. Below, we choose a region for our visualization.

In [None]:
selected_region = 'US'

In [None]:
df_region_changes = df_changes[df_changes['region'] == selected_region]

When looking at the change counts, we have the same problem as when visualizing the total counts, where there is less activity on weekends. In this case, rather than repetition, we expect that it will increase the number of small values (especially zero), due to the low activity over the weekends.

In [None]:
def create_changes_plot(column, title):
    is_weekday = df_region_changes['date'].apply(lambda x: x.weekday() < 5)

    fig, axes = plt.subplots(ncols=2, nrows=2, sharex=True, gridspec_kw={"height_ratios": (.05, .95)})

    count = df_region_changes[['date', column]].groupby('date').sum()[column]

    sns.boxplot(count, ax=axes[0][0])
    sns.distplot(count, ax=axes[1][0], bins=10, kde=False, rug=False)

    axes[0][0].set_xlabel('')
    axes[1][0].set_title('Include Weekends')
    axes[1][0].set_xlabel(title)

    count = df_region_changes[is_weekday][['date', column]].groupby('date').sum()[column]

    sns.boxplot(count, ax=axes[0][1])
    sns.distplot(count, ax=axes[1][1], bins=10, kde=False, rug=False)

    axes[0][1].set_xlabel('')
    axes[1][1].set_title('Exclude Weekends')
    axes[1][1].set_xlabel(title)

    axes[1][1].set_ylim(axes[1][0].get_ylim())

    plt.show()

In [None]:
create_changes_plot('added', 'Newly Active')

In [None]:
create_changes_plot('removed', 'Newly Inactive')

In [None]:
create_changes_plot('net_change', 'Net Change')

## Visualizing Load: Uncertainty

Now that we've visualized central tendency, and we've visualized the distribution of the different values, the next thing on our plate is to visualize uncertainty. Uncertainty is rooted in the idea of a prediction, and uncertainty is how wrong you believe that prediction might be.

For the purpose of explaining uncertainty, let's take a look at the net change in tickets for our selected region.

In [None]:
plt.plot_date(df_region_changes['date'], df_region_changes['net_change'], fmt='.')

plt.show()

The idea of uncertainty is to first make a prediction about what we think the value should be. Since this is a time series, there are many different analyses you could do as you're choosing a model, but we'll do something extremely simple to start: we guess the value as the average value for the previous 30 days.
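
To see what "the average of the previous days" means in code, here is a sketch on a short, hypothetical series with a window of 3 instead of 30. The rolling mean includes the current day, so we shift it by one to make sure each guess only uses earlier days.

```python
import pandas as pd

# Hypothetical daily net changes.
net_change = pd.Series([2, 0, -1, 3, 1])

# Mean of up to the 3 most recent values, including the current one.
rolling_mean = net_change.rolling(3, min_periods=1).mean()

# Shift by one so each day's guess only uses earlier days. The first
# day has no history, so we fill it with 0.
guesses = rolling_mean.shift(1).fillna(0)

print(guesses)
```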

In [None]:
rolling_window = df_region_changes['net_change'].rolling(30, 1)

guesses = rolling_window.mean()
shifted_guesses = np.concatenate([[0], guesses[:-1]])

plt.plot_date(df_region_changes['date'], df_region_changes['net_change'], fmt='.')
plt.plot_date(df_region_changes['date'], shifted_guesses, '-')

plt.show()

Now, our guess is somewhat educated, but as you can see from the plot above, we are actually usually way off from the guess. Uncertainty is the idea of just how confident you are in your guess and is represented as a confidence or prediction interval.

* [Confidence and Prediction Bands](https://en.wikipedia.org/wiki/Confidence_and_prediction_bands)

If you've chosen the average as your measure of central tendency, then if you make certain assumptions about your data (which may or may not be true), you might choose the standard deviation as a measure of uncertainty.

* [Standard deviation](https://en.wikipedia.org/wiki/Standard_deviation)
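
As a sketch of how such a band can be used, the following flags values that fall outside the guess plus or minus two rolling standard deviations. The data and the window of 3 are hypothetical, and the first two days are skipped because there is not enough history to give the band any width.

```python
import pandas as pd

values = pd.Series([2, 0, -1, 3, 1, 8])  # hypothetical; the last value is a spike

window = values.rolling(3, min_periods=1)
mean = window.mean().shift(1).fillna(0)
std = window.std().shift(1).fillna(0)  # the std of a single value is NaN, so use 0

lower = mean - 2 * std
upper = mean + 2 * std

# True where the observed value falls outside the two-standard-deviation band.
is_outside = (values < lower) | (values > upper)

print(is_outside.iloc[2:])
```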

In [None]:
guesses_std = rolling_window.std()

shifted_guesses_std = np.concatenate([[0, 0], guesses_std[1:-1]])

shifted_guesses_std

plt.plot_date(df_region_changes['date'], df_region_changes['net_change'], fmt='.')
plt.plot_date(df_region_changes['date'], shifted_guesses, '-')

plt.fill_between(
    df_region_changes['date'].values,
    shifted_guesses - 2*shifted_guesses_std,
    shifted_guesses + 2*shifted_guesses_std,
    alpha=0.2,
    color=sns.color_palette()[0]
)

plt.fill_between(
    df_region_changes['date'].values,
    shifted_guesses - shifted_guesses_std,
    shifted_guesses + shifted_guesses_std,
    alpha=0.2,
    color=sns.color_palette()[1]
)

plt.show()

In the plot above, the shaded regions represent two definitions of uncertainty. If we were to use one standard deviation as our measure of uncertainty, the light green region represents our uncertainty around that estimate. If we were to use two standard deviations as our measure of uncertainty, the light blue region represents that uncertainty.