# Software Development Capability Analysis
## by Marc Vitalis

## Preliminary Wrangling


In [None]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from pandas.api.types import CategoricalDtype
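# use the public API to register the datetime converters;
# the private pandas.plotting._converter module is not available in newer pandas
from pandas.plotting import register_matplotlib_converters

%matplotlib inline
register_matplotlib_converters()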


In [None]:
workitems = pd.read_csv('workitems_master.csv')
workitems.head()

In [None]:
workitems.info()

**Convert Dates to Date type**

The dates are stored as strings (`object` dtype), so we first convert them to datetime.

In [None]:
workitems.new = pd.to_datetime(workitems.new)
workitems.doing = pd.to_datetime(workitems.doing)
workitems.done = pd.to_datetime(workitems.done)

workitems.head()

In [None]:
workitems.info()

**Convert Sprints to Numeric**

In [None]:
workitems.loc[workitems.sprint.isna(), 'sprint'] = 0
workitems.sprint = workitems.sprint.astype(int)
workitems.info()

**Convert Category Types**

In [None]:
workitems.workitem_type.value_counts()

In [None]:
workitem_types = CategoricalDtype(categories = ['Story', 'Bug', 'Issue'], ordered=True)
workitems.workitem_type = workitems.workitem_type.astype(workitem_types)
workitems.head()

In [None]:
workitems.describe()

### What is the structure of your dataset?

> The dataset consists of 2393 work items with 10 features: workitem_type, estimate, words, rel (release), sprint, assigned_to, new (date created), doing (date work started), done (date finished), and actual_work (done - doing). The main variables of interest are the date stamps for the work. Others, such as sprint, release and assigned_to, simply describe the work item.

### What is/are the main feature(s) of interest in your dataset?

> I'm most interested in how the other variables affect `actual_work`. The goal is to find the patterns that drive the actual work.

### What features in the dataset do you think will help support your investigation into your feature(s) of interest?

> The dataset spans three SDLC patterns (non-structured, semi-agile, scrum). The date stamps (`new`, `doing`, `done`) are very important: they let me extract information such as days of the week and months, and observe whether the SDLC pattern improved or worsened over time. As a bonus, I can also relate the words in the titles to the actual work.

## Univariate Exploration

> First to explore is the main point of interest, `actual_work`.

In [None]:
binsize = 1
bins = np.arange(0, workitems.actual_work.max()+binsize, binsize)

plt.figure(figsize=[8, 5])
plt.hist(data = workitems, x = 'actual_work', bins = bins)
plt.xlabel('Actual Work (Days)')
plt.show()

There's a huge spike in the 0-1 range. A work item completed in zero days is unlikely, since that would mean no effort. Let's tidy this up a bit.

In [None]:
# a zero-day duration counts as one day of work,
# provided the item was worked on for more than 2 hours (filtered below)
zero_work = workitems.actual_work == 0
workitems.loc[zero_work, 'actual_work'] = 1

# drop the items with 2 hours or less between doing and done (no real effort);
# .copy() avoids SettingWithCopyWarning on the assignments that follow
elapsed_hours = (workitems.done - workitems.doing) / pd.Timedelta(hours = 1)
workitems = workitems[elapsed_hours > 2].copy()

# remove the time stamps in new, doing and done, keeping only the dates
workitems.new = workitems.new.dt.normalize()
workitems.doing = workitems.doing.dt.normalize()
workitems.done = workitems.done.dt.normalize()

workitems.info()

In [None]:
plt.figure(figsize=[8, 5])
plt.hist(data = workitems, x = 'actual_work', bins = bins)
plt.xlabel('Actual Work (Days)')
plt.show()

There is still a huge spike at 1 and a very long tail, so let's replot it on a log scale.
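
As a minimal sketch of how log-spaced bin edges are built, the snippet below uses a handful of invented durations rather than the real dataset. The `values` array and the coarse `log_binsize` of 0.5 are assumptions chosen to keep the example small.

```python
import numpy as np

# Invented durations in days (not taken from the dataset)
values = np.array([1, 2, 3, 5, 8, 13, 20, 40, 100])

# Bin edges equally spaced in log10 units, starting at 10**0 = 1
log_binsize = 0.5
edges = 10 ** np.arange(0, np.log10(values.max()) + log_binsize, log_binsize)

# Count how many values fall into each log-spaced bin
counts, _ = np.histogram(values, bins=edges)
print(edges)
print(counts)
```

Each bin spans the same multiplicative range, so the long tail is compressed while the short durations are spread out.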

In [None]:
# try log scale since it has a long tail
log_binsize = 0.05
bins = 10 ** np.arange(0, np.log10(workitems['actual_work'].max())+log_binsize, log_binsize)

plt.figure(figsize=[8, 5])
plt.hist(data = workitems, x = 'actual_work', bins = bins)
plt.xscale('log')
plt.xticks([2, 5, 8, 13, 20, 40, 50, 100, 200], [2, 5, 8, 13, 20, 40, 50, 100, 200])
plt.xlabel('Actual Work (Days)')
plt.show()

We still have a big spike at one day of work and another clump at `20` days, with a fairly even distribution from `2-20`. In the later sprints the team runs a 3-week scrum, which is roughly the same as 20 days.

It seems unusual for a work item to take more than 30 days, and the count of one-day work items is also very large. Let's find out whether they are outliers.

The first thing to look at are the ones with value `1`.

In [None]:
#get items with actual_work = 1, and investigate the data
ones = workitems[workitems.actual_work == 1]
ones

In [None]:
ones.workitem_type.value_counts()

By visual inspection, and from the counts above, most of the work items with `actual_work == 1` are bugs. This makes sense, as bugs are usually quick to fix.

Next, let's investigate those workitems with more than 30 days value.

In [None]:
workitems[workitems.actual_work > 30]

These items either date from when we were still doing `waterfall`, or are newer, legitimate work items that went back and forth in development or were put on hiatus. Either way, they are valid data.

Let's find out the averages.

In [None]:
[workitems.actual_work.mean(), workitems.actual_work.median(), workitems.actual_work.mode()]

The three measures tell different stories. The mean says a work item takes 9 days on average, but this is hardly conclusive because a few long-running work items pull the value up. The median makes more sense: a typical work item is done in 5 days. The mode is 1, but since this data is not categorical, it only tells us which value occurs most often.
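
To illustrate why the mean and median diverge on long-tailed data, here is a small sketch on invented durations. The `durations` values are assumptions for illustration only and do not come from the dataset.

```python
import pandas as pd

# Invented durations in days with one long-running item
durations = pd.Series([1, 1, 1, 3, 5, 5, 8, 13, 60])

mean_days = durations.mean()          # pulled upward by the 60-day item
median_days = durations.median()      # middle value, robust to the tail
mode_days = durations.mode().iloc[0]  # most frequent value

print(mean_days, median_days, mode_days)
```

A single 60-day item is enough to push the mean to more than twice the median, which is the same effect the long-running work items have on the real data.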

According to the team, one-day work items mostly come from what they call a `Bug Fest`. We'll explore the relationship between `actual_work` and `workitem_type` later. For now, let's look at the distribution of `workitem_type`.

In [None]:
base_color = sb.color_palette()[0]
sb.countplot(data = workitems, x = 'workitem_type', color = base_color)

That's a large number of bugs compared to stories, which explains the spike at 1 day of `actual_work`. Let's explore further by looking at the number of work items worked on per month.

In [None]:
# let's extract the dates first
workitems['doing_year'] = workitems.doing.dt.year
workitems['done_year'] = workitems.done.dt.year
workitems['new_year'] = workitems.new.dt.year

workitems['doing_month'] = workitems.doing.dt.strftime('%b')
workitems['done_month'] = workitems.done.dt.strftime('%b')
workitems['new_month'] = workitems.new.dt.strftime('%b')

workitems['doing_dow'] = workitems.doing.dt.strftime('%a')
workitems['done_dow'] = workitems.done.dt.strftime('%a')
workitems['new_dow'] = workitems.new.dt.strftime('%a')

workitems['doing_my'] = workitems.doing.dt.strftime('%b %Y')
workitems['done_my'] = workitems.done.dt.strftime('%b %Y')
workitems['new_my'] = workitems.new.dt.strftime('%b %Y')

weekdays_type = CategoricalDtype(categories=['Mon' , 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'], ordered=True)
months_type = CategoricalDtype(categories=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'], ordered=True)

workitems.doing_month = workitems.doing_month.astype(months_type)
workitems.done_month = workitems.done_month.astype(months_type)
workitems.new_month = workitems.new_month.astype(months_type)

workitems.doing_dow = workitems.doing_dow.astype(weekdays_type)
workitems.done_dow = workitems.done_dow.astype(weekdays_type)
workitems.new_dow = workitems.new_dow.astype(weekdays_type)

workitems.head()

In [None]:
work_month = workitems.sort_values(['doing'])
work_month['month'] = work_month.doing.dt.strftime('%b %Y')

plt.figure(figsize=(10, 15))
sb.countplot(data = work_month, y = 'month', color = base_color)

In [None]:
work_month = workitems.sort_values(['done'])
work_month['month'] = work_month.done.dt.strftime('%b %Y')

plt.figure(figsize=(10, 15))
sb.countplot(data = work_month, y = 'month', color = base_color)

As the dates become more recent, the number of work items per month becomes more consistent. We don't yet know how many of these are stories, bugs or issues. More on that later.

With scrum, there is commonly a spike of items suddenly getting done on Fridays. Let's look at the distribution across weekdays for both `doing` and `done`.

In [None]:


# work on a copy so the helper column does not leak into workitems
work_wk_doing = workitems.copy()

work_wk_doing['week'] = work_wk_doing.doing.dt.strftime('%a').astype(weekdays_type)

sb.countplot(data = work_wk_doing, x = 'week', color = base_color)

In [None]:
# work on a copy so the helper column does not leak into workitems
work_wk_done = workitems.copy()

work_wk_done['week'] = work_wk_done.done.dt.strftime('%a').astype(weekdays_type)

sb.countplot(data = work_wk_done, x = 'week', color = base_color)

This suggests the team is most productive when fresh on `Mondays`, finishes most items on `Tuesdays`, and tends to slow down through the rest of the week.

Next, we'll look at when the product team creates new work items.

In [None]:
idea_month = workitems.sort_values(['new'])
idea_month['month'] = idea_month.new.dt.strftime('%b %Y')

plt.figure(figsize=(10, 15))
sb.countplot(data = idea_month, y = 'month', color = base_color)

There are 3 periods when stories were created in bulk: `April 2015`, `May - Oct 2016` and `Dec 2017`. It would be interesting to ask what happened during these periods. Let's check on which day of the week the Product Team usually creates stories.

In [None]:
# work on a copy so the helper column does not leak into workitems
work_wk_idea = workitems.copy()

work_wk_idea['week'] = work_wk_idea.new.dt.strftime('%a').astype(weekdays_type)

sb.countplot(data = work_wk_idea, x = 'week', color = base_color)

Stories are added fairly evenly throughout the working week; `Saturdays` and `Sundays` are, of course, days off.

Let's observe the workitems per sprint.

In [None]:
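# missing sprints were already set to 0 and cast to int above; here we fill missing releases
# work on a copy so the changes do not leak into workitems
work_sprint = workitems.copy()
work_sprint.loc[work_sprint.rel.isna(), 'rel'] = '0'
work_sprint = work_sprint.sort_values(['rel', 'sprint'])

# zero-pad the sprint number so the combined label sorts correctly as text
work_sprint['rel_sprint'] = work_sprint.rel + '/' + work_sprint.sprint.astype(str).str.pad(width = 3, side = 'left', fillchar = '0')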

plt.figure(figsize=(10, 15))
sb.countplot(data = work_sprint, y = 'rel_sprint', color = base_color)

The number of work items varies from release to release. Some items fall in sprint zero, the value we used to fill missing sprints, which suggests an input error for those items.
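
The zero-padding in the `rel_sprint` label matters because the label is text, and text sorts character by character. The sketch below shows the difference on invented values; the release name `R1` and the sprint numbers are hypothetical, and `str.zfill` is used here as a shorthand for left-padding with zeros.

```python
import pandas as pd

# Invented release and sprint values (not from the dataset)
toy = pd.DataFrame({'rel': ['R1', 'R1', 'R1'], 'sprint': [2, 10, 1]})

# Without padding, "10" sorts before "2" because the labels are compared as text
plain = toy.rel + '/' + toy.sprint.astype(str)
plain_sorted = sorted(plain)

# With zero-padding to width 3, text order matches numeric order
padded = toy.rel + '/' + toy.sprint.astype(str).str.zfill(3)
padded_sorted = sorted(padded)

print(plain_sorted)
print(padded_sorted)
```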

Next to investigate is the estimates provided.

In [None]:
binsize = 4
bins = np.arange(0, workitems.estimate.max()+binsize, binsize)

plt.figure(figsize=[8, 5])
plt.hist(data = workitems, x = 'estimate', bins = bins)
plt.xlabel('Points')
plt.show()

Again, it has a long tail. Let's try plotting the log.

In [None]:
# try log scale since it has a long tail
log_binsize = 0.1
bins = 10 ** np.arange(0, np.log10(workitems['estimate'].max())+log_binsize, log_binsize)

plt.figure(figsize=[8, 5])
plt.hist(data = workitems, x = 'estimate', bins = bins)
plt.xscale('log')
plt.xticks([2, 5, 8, 13, 20, 40, 50, 100, 200], [2, 5, 8, 13, 20, 40, 50, 100, 200])
plt.xlabel('points')
plt.show()

The distribution peaks between 8 and 13 points. Based on my conversation with the team, this is the 'just enough' size for a story.

The last thing to explore is the titles, broken down into words. Let's find the 40 most frequent words.

In [None]:
work_words = workitems[['id', 'words']]

work_words = work_words[['id', 'words']].words.str.split(',').apply(pd.Series) \
    .merge(work_words[['id', 'words']], right_index = True, left_index = True) \
    .drop(["words"], axis = 1) \
    .melt(id_vars = ['id'], value_name = "word") \
    .drop("variable", axis = 1) \
    .dropna()

plt.figure(figsize=(10, 15))
sb.countplot(data = work_words, y = 'word', color = base_color, order = work_words.word.value_counts().iloc[:40].index)

The result is interesting. The top entry is error, which probably comes mostly from bug items. The other words in the top 40 mostly describe the application itself.
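
For reference, the same reshaping of a comma-delimited column into one row per word can be sketched more compactly with `DataFrame.explode`, assuming pandas 0.25 or later is available. The toy `id` and `words` values are invented, and this is an alternative to the split/merge/melt chain used above rather than the code that produced the plot.

```python
import pandas as pd

# Invented work items; the last one has no words at all
toy = pd.DataFrame({
    'id': [1, 2, 3],
    'words': ['error,login', 'error,report,export', None],
})

# Split each string into a list, then give every list element its own row
long_words = (
    toy.assign(word=toy.words.str.split(','))
       .explode('word')
       .dropna(subset=['word'])[['id', 'word']]
       .reset_index(drop=True)
)

# Frequency of each word across all work items
top = long_words.word.value_counts()
print(top)
```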

Let's look at the number of work items per resource.

In [None]:
plt.figure(figsize=(10, 10))
sb.countplot(data = workitems, y = 'assigned_to', color = base_color, order = workitems.assigned_to.value_counts().iloc[:40].index)

Lastly, find the distribution of workitems per era.

In [None]:
sb.countplot(data = workitems, x = 'era', color = base_color)

### Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

`actual_work` contained `zero`-day values, which would mean `no effort`. I recalculated it by treating anything worked on for more than `2 hours` as one day of work and dropping the rest. There also seemed to be outliers in the form of a high count of `one-day` work items; on investigation, these are mostly bugs, which makes sense as bugs are usually quick to finish. At the high end, I checked the items online to see why their values are large. Some come from our waterfall period, which had large `stories` that took days to months to finish. Others were difficult to resolve and spanned multiple sprints. I consider them an important part of the data for now.
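
The 2-hour rule can be sketched on toy data as follows. The timestamps are invented, and deriving `actual_work` as whole calendar days between `doing` and `done` is an assumption about how the column was originally computed, since the source file already ships with it.

```python
import pandas as pd

# Invented timestamps: 30 minutes, 6 hours, and a little over 3 days of elapsed time
toy = pd.DataFrame({
    'doing': pd.to_datetime(['2018-01-02 09:00', '2018-01-02 09:00', '2018-01-02 09:00']),
    'done':  pd.to_datetime(['2018-01-02 09:30', '2018-01-02 15:00', '2018-01-05 17:00']),
})

# Elapsed time in hours between starting and finishing each item
hours = (toy.done - toy.doing) / pd.Timedelta(hours=1)

# Keep only the items with more than 2 hours of elapsed time
kept = toy[hours > 2].copy()

# Whole calendar days between the two dates, with a floor of one day (assumed definition)
day_diff = (kept.done.dt.normalize() - kept.doing.dt.normalize()).dt.days
kept['actual_work'] = day_diff.clip(lower=1)
print(kept)
```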

### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I recalculated `actual_work` to weed out the `no effort` stories. The `words` field, a comma-delimited string, was converted into its own data frame so we can examine the words individually. To investigate individual characteristics of the dates, I also extracted the `year`, `month` and `dow` of `new`, `doing` and `done`.
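
The extracted weekday and month labels were cast to ordered categoricals so that plots and sorts follow calendar order instead of alphabetical order. A minimal sketch with invented dates is shown below; it uses `dt.day_name()` instead of `strftime('%a')` only to keep the example independent of the system locale.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Invented dates: a Friday, a Monday and a Wednesday
dates = pd.Series(pd.to_datetime(['2018-03-09', '2018-03-05', '2018-03-07']))

weekdays_type = CategoricalDtype(
    categories=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'], ordered=True)

# Three-letter weekday labels cast to the ordered categorical
dow = dates.dt.day_name().str[:3].astype(weekdays_type)

# Sorting follows the category order, not the alphabet
sorted_dow = dow.sort_values().tolist()
print(sorted_dow)
```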

## Bivariate Exploration

First let's check pairwise correlation between features.

In [None]:
workitems.info()

In [None]:
numeric_vars = ['actual_work', 'estimate']
categoric_vars = ['workitem_type', 'era', 'new_month', 'doing_month', 'done_month', 'new_dow', 'doing_dow', 'done_dow']

work_est = workitems[~workitems.estimate.isna()]

sb.heatmap(work_est[numeric_vars].corr(), annot = True, fmt = '.3f', cmap = 'vlag_r', center = 0);

In [None]:
g = sb.PairGrid(data = work_est, vars = numeric_vars)
g = g.map_diag(plt.hist, bins = 10)
g.map_offdiag(plt.scatter);

In [None]:
def boxgrid(x, y, **kwargs):
    """ Quick hack for creating box plots with seaborn's PairGrid. """
    default_color = sb.color_palette()[0]
    # pass x and y by keyword; newer seaborn versions no longer accept them positionally
    sb.boxplot(x = x, y = y, color = default_color)

plt.figure(figsize = [10, 10])
g = sb.PairGrid(data = work_est, y_vars = ['actual_work', 'estimate'], x_vars = categoric_vars,
                height = 3, aspect = 1.5)
g.map(boxgrid)
plt.show();

### `actual_work` vs. `estimates`

Looking at this information, the correlation is far from +1, which suggests that the estimates have little relation to the actual work. Let's look more closely at this relationship.

In [None]:
plt.figure(figsize=(16, 8))
sb.regplot(data = work_est, x = 'estimate', y = 'actual_work')
ticks = [1, 3, 5, 8, 13, 20, 40, 100]
plt.xticks(ticks)
plt.yticks(ticks);

The correlation alone would suggest that estimates do not relate much to the actual work done. However, looking closely at the scatter plot, estimates cluster between 1 and 13, and similarly, actual work clusters between 1 and 20.

### `actual_work` vs `workitem_type`

Let's find out if time to complete is similar across different work item types.

In [None]:
sb.barplot(data = workitems, x = 'workitem_type', y = 'actual_work', color = base_color)
plt.yticks([1, 3, 5, 8, 13, 20, 25]);

The mean may be affected by outliers, so let's view this data differently.

In [None]:
plt.figure(figsize = (10, 10))
sb.violinplot(data = workitems, x = 'workitem_type', y = 'actual_work', color = base_color)
plt.yticks([1, 3, 5, 8, 13, 20, 40, 100]);

We can see some outliers, so let's look at the median instead.

In [None]:
sb.barplot(data = workitems, x = 'workitem_type', y = 'actual_work', color = base_color, estimator = np.median);

Stories take much longer to finish than bugs and issues. We therefore cannot count them at the same level when looking at the number of work items done in a given period.

### `actual_work` vs `era`

Let's look at the distribution of actual work per era.

In [None]:
sb.barplot(data = workitems, x = 'era', y = 'actual_work', color = base_color);

In [None]:
sb.violinplot(data = workitems, x = 'era', y = 'actual_work', color = base_color);

Waterfall has the best cycle time, which is odd. Let's find out more by looking at the work item types per `era`.

In [None]:
sb.countplot(data = workitems, x = 'era', hue = 'workitem_type', palette = 'Blues')
plt.legend(loc = 'upper right')

Now this makes sense: bug fixes take less time to complete, which gives the waterfall era the best cycle time. We can also see that during the waterfall era we produced a lot more bugs than in the more recent eras.

### `actual_work` per month

Let's look for patterns in `actual_work` per month.

In [None]:
plt.figure(figsize = (32, 12))
g = sb.barplot(data = workitems, x = 'done_my', y = 'actual_work', color = base_color)
# rotate the existing tick labels instead of overwriting them, so labels always match the bars
plt.xticks(rotation = 30);

The team's capability (average cycle time) is fairly stable, with notable spikes in some months. A boxplot might give us some perspective.

In [None]:
plt.figure(figsize = (12, 24))
sb.boxplot(data = workitems, y = 'done_my', x = 'actual_work', color = base_color)
plt.xticks([1, 3, 5, 8, 13, 20, 40, 100]);

We can see some spikes and high values. Next, we plot the count per work item type.

In [None]:
plt.figure(figsize = (8, 24))
sb.countplot(data = workitems.sort_values(['done_year', 'done_month']), y = 'done_my', hue = 'workitem_type', palette = 'Blues');

Discarding the months with sudden spikes, work items have recently been completed in 3 to 5 days.

### `actual_work` vs. `assigned_to`

Let's look at the distribution of work items per resource.

In [None]:
plt.figure(figsize = (10, 10))
sb.barplot(data = workitems, y = 'assigned_to', x = 'actual_work', color = base_color, order = workitems.assigned_to.value_counts().index)
plt.xticks([1, 3, 5, 8, 13, 20, 40]);

In [None]:
plt.figure(figsize=(10, 10))
sb.countplot(data = workitems, y = 'assigned_to', hue = 'workitem_type', palette = 'Blues', order = workitems.assigned_to.value_counts().index)
plt.legend(loc = 'lower right')

Although `panuelk` has worked on more items, most of them are bugs. `delossj`, on the other hand, has worked on more stories and issues.

### November 2017 and Later

As observed in the graph, the number of work items per month started to stabilize from November 2017. Let's cut our dataset from there and create new graphs for observation.

In [None]:
# .copy() so that columns added later (wi_weight) do not trigger SettingWithCopyWarning
workitems_scrums = workitems[workitems.done >= '2017-11-01'].copy()

In [None]:
work_est = workitems_scrums[~workitems_scrums.estimate.isna()]

sb.heatmap(work_est[numeric_vars].corr(), annot = True, fmt = '.3f', cmap = 'vlag_r', center = 0);

In [None]:
plt.figure(figsize=(16, 8))
sb.regplot(data = work_est, x = 'estimate', y = 'actual_work')
plt.xticks(ticks)
plt.yticks(ticks);

In [None]:
sb.barplot(data = workitems_scrums, x = 'workitem_type', y = 'actual_work', color = base_color, estimator = np.median);

In [None]:
sb.violinplot(data = workitems_scrums, x = 'workitem_type', y = 'actual_work', color = base_color)

In [None]:
plt.figure(figsize=(10, 10))
sb.countplot(data = workitems_scrums, y = 'assigned_to', hue = 'workitem_type', palette = 'Blues', order = workitems_scrums.assigned_to.value_counts().index)
plt.legend(loc = 'lower right')

In [None]:
plt.figure(figsize = (10, 10))
sb.barplot(data = workitems_scrums, y = 'assigned_to', x = 'actual_work', color = base_color, order = workitems_scrums.assigned_to.value_counts().index)
plt.xticks([1, 3, 5, 8, 13, 20, 40]);

This provides a surprising insight, which may solve our estimation issue. The correlation is now very close to zero, which means that regardless of how big or small the estimate is, the actual work stays about the same. The resources' capability, on the other hand, looks a bit cleaner, with everyone at almost the same level.
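
To make the meaning of a near-zero correlation concrete, the sketch below compares two invented series against the same estimates: one where the actual work does not move with the estimate, and one where it scales with it. All values and column names other than `estimate` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    'estimate':    [1, 2, 3, 4, 5],
    'flat_work':   [3, 5, 4, 5, 3],   # varies, but not in step with the estimate
    'scaled_work': [2, 4, 6, 8, 10],  # always twice the estimate
})

# Pearson correlation of each series with the estimate
r_flat = toy.estimate.corr(toy.flat_work)
r_scaled = toy.estimate.corr(toy.scaled_work)
print(r_flat, r_scaled)
```

A coefficient near zero, as in the first case, says that knowing the estimate tells us nothing about the linear trend of the actual work.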

### `workitem_type` with weight

Let's extract `workitem_type` weight for the workitems post November 2017 and find more insight with this feature.

In [None]:
story_median = workitems_scrums[workitems_scrums.workitem_type == 'Story'].actual_work.median()
bug_median = workitems_scrums[workitems_scrums.workitem_type == 'Bug'].actual_work.median()

story_weight = story_median / (story_median + bug_median)
bug_weight = bug_median / (story_median + bug_median)

story_weight, bug_weight

In [None]:
workitems_scrums.loc[workitems_scrums.workitem_type == 'Bug', 'wi_weight'] = bug_weight
workitems_scrums.loc[workitems_scrums.workitem_type == 'Story', 'wi_weight'] = story_weight

workitems_scrums.wi_weight.value_counts()

In [None]:
plt.figure(figsize = (10, 10))
sb.barplot(data = workitems_scrums.sort_values(['done_year', 'done_month']), x = 'wi_weight', y = 'done_my', estimator = np.sum, color = base_color)

In [None]:
sb.barplot(data = workitems_scrums, x = 'wi_weight', y = 'assigned_to', estimator = np.sum, color = base_color)

Weighting gives us a more accurate view of the amount of work done per month and per resource. Interestingly, we did not find any pattern: the graph shows that the amount of work distributed to the team differs from month to month. This may be due to poor estimation, which results in an uneven amount of work per month.
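
The weighting idea can be sketched on toy data as follows: each type's weight is its median `actual_work` divided by the sum of the medians, and the weighted work per month is the sum of those weights. The rows below are invented; only the column names mirror the notebook.

```python
import pandas as pd

# Invented work items
toy = pd.DataFrame({
    'workitem_type': ['Story', 'Story', 'Story', 'Bug', 'Bug', 'Bug'],
    'actual_work':   [4, 6, 9, 1, 2, 3],
    'done_my':       ['Nov 2017', 'Nov 2017', 'Dec 2017', 'Nov 2017', 'Dec 2017', 'Dec 2017'],
})

# Median duration per type, normalised so the weights sum to 1
medians = toy.groupby('workitem_type').actual_work.median()
weights = medians / medians.sum()

# Attach the weight to every item and total it per month
toy['wi_weight'] = toy.workitem_type.map(weights)
monthly = toy.groupby('done_my').wi_weight.sum()
print(weights)
print(monthly)
```

With these numbers a story counts three times as much as a bug, so a month with two stories and one bug carries more weighted work than a month with one story and two bugs, even though both contain three items.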

### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Estimation is an integral part of software development. It helps stakeholders forecast when the project is going to be completed and how many items should be committed to in a given timeframe. After comparing the estimates with the actual work, we found that the team's estimates have a low correlation with the actual work done. We may therefore need to find ways to improve how the team estimates. We may gain more insight when we compare more variables at once.

Another interesting observation is how work item types relate to the actual work done. Stories took much longer to finish than bugs and issues. Given this, it is hard to determine the team's capability when work items of different weights are worked on in the same timeframe.

We also measured the capability (average cycle time) of each resource. The main developers (panuelk, delossj, tungald, bautise) have almost the same capability rating. The exception is deguzmm, which makes sense because she is their lead and has other things on her plate that delay finishing her work items.

After cutting the data to November 2017 onwards, we found more interesting patterns. The correlation between estimate and actual work is now very close to zero, meaning that no matter how large the estimate, the actual days of work stay almost the same. This suggests we might be better off counting stories and bugs rather than estimating them.

### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

The team underwent three major changes in software development style (era). The number of work items done in a month varies a lot during the waterfall era, which makes this analysis more challenging. As observed above, the number of items worked on started to stabilize in November 2017, 6 months after we transitioned from waterfall to hybrid.

Another interesting observation: during the early stage (waterfall), development produced a lot more bugs than in the hybrid and scrum eras, which can reduce the productivity of the team.

## Multivariate Exploration


### Actual and Estimation Reinforcement

In [None]:
plt.figure(figsize = (10, 10))
sb.scatterplot(data = workitems_scrums, x = 'estimate', y = 'actual_work', hue = 'workitem_type', alpha = 0.4)
ticks = [1, 3, 5, 8, 13, 20, 40, 100]
plt.xticks(ticks)
plt.yticks(ticks);

In [None]:
plt.figure(figsize = (10, 10))
sb.scatterplot(data = workitems, x = 'estimate', y = 'actual_work', hue = 'workitem_type', alpha = 0.4)
ticks = [1, 3, 5, 8, 13, 20, 40, 100]
plt.xticks(ticks)
plt.yticks(ticks);

### Work distribution

In [None]:
# observed = True keeps only the combinations that exist in the data;
# grouping on categorical columns otherwise yields every category combination
work_actual_mean = workitems_scrums.groupby(['done_year', 'done_month', 'done_my', 'workitem_type'], observed = True).actual_work.mean().reset_index()

plt.figure(figsize = (16, 4))
g = sb.lineplot(data = work_actual_mean, x = 'done_my', y = 'actual_work', hue = 'workitem_type')
# rotate the existing tick labels instead of overwriting them
plt.xticks(rotation = 30)
plt.yticks([1, 3, 5, 8, 13, 20, 40, 50]);

In [None]:
plt.figure(figsize = (10, 10))
sb.barplot(data = workitems_scrums, x = 'assigned_to', y = 'actual_work', hue = 'workitem_type')

### Finding out the sweet average

In [None]:
# use one shared, chronological month order so every developer's panel has the same x axis;
# relabelling the ticks with the global month list mislabels panels that are missing a month
month_order = workitems_scrums.sort_values('done').done_my.unique()
devs = ['panuelk', 'delossj', 'tungald', 'bautise']

plt.figure(figsize=(24, 24))
for i, dev in enumerate(devs, start = 1):
    plt.subplot(4, 1, i)
    g = sb.barplot(data = workitems_scrums[workitems_scrums.assigned_to == dev], x = 'done_my', y = 'actual_work',
                   hue = 'workitem_type', order = month_order)
    g.set_title(dev)

In [None]:
main_devs = workitems_scrums[workitems_scrums.assigned_to.isin(['panuelk', 'delossj', 'tungald', 'bautise'])]

plt.figure(figsize = (24, 6))
sb.barplot(data = main_devs, x = 'done_my', y = 'wi_weight', hue = 'assigned_to', estimator = np.sum)
plt.legend(loc = 'upper right')

In [None]:
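# observed = True keeps only the month/developer combinations that exist in the data,
# so the developer count per month is not inflated by empty category combinations
weight_sums = main_devs.groupby(['done_year', 'done_month', 'done_my', 'assigned_to'], observed = True).wi_weight.sum().reset_index()
weight_sums = weight_sums.groupby(['done_year', 'done_month', 'done_my'], observed = True).agg({'assigned_to': 'count', 'wi_weight': 'sum'}).reset_index()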
weight_sums['unit'] = weight_sums.wi_weight / weight_sums.assigned_to

plt.figure(figsize = (24, 6))

g = sb.barplot(data = weight_sums, x = 'done_my', y = 'unit', color = base_color)

In [None]:
plt.figure(figsize = (24, 8))
sb.scatterplot(data = workitems_scrums, x = 'doing', y = 'actual_work', hue = 'workitem_type')
plt.xlim(pd.to_datetime('2017-11-01'), pd.to_datetime('2019-07-31'))
plt.yticks([1, 3, 5, 8, 13, 20, 40, 100]);

In [None]:
# amount of work a developer can do in a month
print(weight_sums.unit.mean())

# average amount of days each work can be done
print(main_devs.actual_work.mean())

### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

> Your answer here!

### Were there any interesting or surprising interactions between features?

> Your answer here!


## Notes

* Create ratio between work item types
* Create separate investigation between actual work, work item type, per month and era
* Construct a mathematical formula that will compute the estimate capacity
* Prove the better estimation by testing with current data and running correlation
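
One possible starting point for the capacity formula mentioned in the notes is the `unit` calculation already used in this notebook: total weighted work in a month divided by the number of developers active that month, averaged over months. The sketch below applies it to invented data; the developer names `dev_a` and `dev_b` and all weights are hypothetical, and the named aggregation assumes pandas 0.25 or later.

```python
import pandas as pd

# Invented weighted work items for two hypothetical developers
toy = pd.DataFrame({
    'done_my':     ['Nov 2017'] * 4 + ['Dec 2017'] * 3,
    'assigned_to': ['dev_a', 'dev_a', 'dev_b', 'dev_b', 'dev_a', 'dev_b', 'dev_b'],
    'wi_weight':   [0.75, 0.25, 0.75, 0.25, 0.75, 0.25, 0.25],
})

# Weighted work per developer per month
per_dev = toy.groupby(['done_my', 'assigned_to']).wi_weight.sum().reset_index()

# Per month: number of active developers and total weighted work
monthly = per_dev.groupby('done_my').agg(
    devs=('assigned_to', 'count'),
    weight=('wi_weight', 'sum'),
)

# Weighted work one developer completes in a month, then the average across months
monthly['unit'] = monthly.weight / monthly.devs
capacity = monthly.unit.mean()
print(monthly)
print(capacity)
```

Multiplying `capacity` by the number of developers available would then give a rough monthly commitment, which could be checked against past months as the last note suggests.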