## Goal: understanding contribution patterns

Analyzing groups of contributors according to their activity patterns, and how those groups evolve over time, helps to understand the structure of the community. The groups are defined by how active their members are (from casual to core contributors) and by the kinds of activity they perform (for example, producing code, reviewing code, submitting issues, or contributing to discussions). Whenever convenient, this characterization will be combined with the contributor groups identified in the first goal.

This goal is refined in the following questions:

**Questions**:

 * How often do contributors contribute?
 * How is the structure of contribution, according to level of activity?
 * How is the structure of contribution, according to the different data sources?
 * How are the structures of contribution evolving over time?
 * How are people flowing in the structure of contribution?

These questions can be answered with the following metrics:

**Metrics**:

(Still to be refined)

 * Groups of contributors, by level of activity (core, regular, casual)
 * Groups of contributors, by kind of activity (committing, opening issues, merging pull requests, etc.)
 * Groups of contributors, by breadth of activity (specialists, spread, etc.)
 * Activity metrics for each group
 * Absolute number of contributors moving from one group to another
 * Fraction of contributors moving from a group to another
 
Some of these metrics will be computed for the specified contributor groups, over time.
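
The last two metrics (absolute number and fraction of contributors moving between groups) could be derived from per-period group labels. A minimal sketch, assuming a hypothetical table with one row per contributor and the group assigned in two consecutive periods; the column names `prev` and `curr` and the sample data are illustrative only and are not taken from the analysis:

```python
import pandas

# Hypothetical group labels for six contributors in two consecutive quarters.
labels = pandas.DataFrame({
    'author': ['a', 'b', 'c', 'd', 'e', 'f'],
    'prev': ['core', 'core', 'regular', 'casual', 'casual', 'casual'],
    'curr': ['core', 'regular', 'regular', 'regular', 'casual', 'casual'],
})

# Absolute number of contributors moving from each group (rows)
# to each group (columns).
moves = pandas.crosstab(labels['prev'], labels['curr'])

# Fraction of each starting group that ends up in each destination group.
fractions = pandas.crosstab(labels['prev'], labels['curr'], normalize='index')

print(moves)
print(fractions)
```

Each row of `fractions` sums to 1, so it can be read as the distribution of destinations for contributors who started in that group.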

# Metric Calculations
First we need to open a connection to the proper Elasticsearch (ES) instance. We use an external module to load credentials from a file that is not shared. To run this notebook, use your own credentials: put them in a file named '.settings' (in the same directory as this notebook), following the example file 'settings.sample'.

This section includes common code to manage and plot data. Queries will be available at the corresponding section.


In [1]:
import pandas

import os

import plotly as plotly
import plotly.graph_objs as go

import util as ut

from util import ESConnection
from elasticsearch_dsl import Search

es_conn = ESConnection()

# Let's load projects from the REVIEWED SPREADSHEET
projects = ut.read_projects("data/Contributors and Communities Analysis - Project grouping.xlsx")

project_name = os.environ.get('PROJECT', 'all')

date_range = {'gte': '2013-01-01', 'lt': '2017-01-01'}


In [2]:
def create_search(source):
    s = Search(using=es_conn, index=source)
    # TODO: Add bot and merges filtering.
    #s = s.filter('range', grimoire_creation_date={'gt': 'now/M-2y', 'lt': 'now/M'})
    # params() returns a modified copy, so the result must be reassigned.
    s = s.params(timeout=30)
    return s

In [3]:
def stack_by(result, group_column, time_column, value_column, group_field, time_field, value_field):
    """Flatten a nested terms/date_histogram aggregation into a dataframe.

    Returns one row per (group, time) bucket, holding the metric value.
    """
    df = pandas.DataFrame(columns=[group_column, time_column, value_column])

    for b in result.to_dict()['aggregations'][group_field]['buckets']:
        for i in b[time_field]['buckets']:
            df.loc[len(df)] = [b['key'], i['key_as_string'], i[value_field]['value']]
    
    return df
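
To make the input expected by `stack_by` concrete, the sketch below applies the same flattening to a plain dictionary shaped like the `aggregations` part of a nested terms / date_histogram response. The bucket contents are made up for illustration, and `flatten_buckets` is a hypothetical stand-alone helper, not part of this notebook:

```python
import pandas

# Made-up response fragment: one terms bucket per author, each with a
# date_histogram sub-aggregation holding a 'commits' metric.
sample = {
    'aggregations': {
        'authors': {'buckets': [
            {'key': 'author-1', 'time': {'buckets': [
                {'key_as_string': '2013-01-01T00:00:00.000Z', 'commits': {'value': 12}},
                {'key_as_string': '2013-04-01T00:00:00.000Z', 'commits': {'value': 7}},
            ]}},
            {'key': 'author-2', 'time': {'buckets': [
                {'key_as_string': '2013-01-01T00:00:00.000Z', 'commits': {'value': 3}},
            ]}},
        ]}
    }
}


def flatten_buckets(response, group_field, time_field, value_field):
    """Return one row per (group, time) pair found in the nested buckets."""
    rows = []
    for group in response['aggregations'][group_field]['buckets']:
        for point in group[time_field]['buckets']:
            rows.append((group['key'], point['key_as_string'],
                         point[value_field]['value']))
    return pandas.DataFrame(rows, columns=['Author', 'Time', 'Commits'])


flat = flatten_buckets(sample, 'authors', 'time', 'commits')
print(flat)
```

Collecting the rows in a list and building the dataframe once is a design choice of this sketch; it avoids growing the dataframe row by row.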

In [4]:
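def onion(df, bucket_column, time_column, value_column):
    """Count core, regular and casual contributors in df.

    df must be sorted by value_column in descending order.
    bucket_column and time_column are accepted but not used.
    """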
    total = df[value_column].sum()
    
    percent_80 = total * 0.8
    percent_95 = total * 0.95
    core = 0
    core_sum = 0
    regular = 0
    regular_sum = 0
    casual = 0

    for row in df.iterrows():
        value = row[1][value_column]
        
        if (percent_80 > core_sum):
            core = core + 1
            core_sum = core_sum + value
            regular_sum = regular_sum + value
        elif percent_95 > regular_sum:
            regular = regular + 1
            regular_sum = regular_sum + value
        else:
            casual = casual + 1

    return {"core":core,
            "regular":regular,
            "casual":casual} 

def onion_evolution(result, bucket_column, time_column, value_column, bucket_field, time_field, metric_field):
    
    df = stack_by(result, bucket_column, time_column, value_column, bucket_field, time_field, metric_field)
    
    onion_df = pandas.DataFrame(columns=['Time', 'Core', 'Regular', 'Casual'])
    
    for time in df[time_column].unique():
        slice_df = df.loc[df[time_column] == time]
        slice_df = slice_df.sort_values(by=value_column, ascending=False)
        onion_result = onion(slice_df, bucket_column=bucket_column, time_column=time_column,
                             value_column=value_column)
        #print(time, '->', len(slice_df), slice_df.columns.values.tolist(), '->', onion_result)
        onion_df.loc[len(onion_df)] = [time, onion_result['core'], onion_result['regular'], onion_result['casual']]
        
    
    #print(len(df))
    return onion_df

In [5]:
def print_grouped_bar(df, x_column, value_columns, title):
    """Plot a grouped bar chart: one cluster per x value, one bar per value column.
    """
    plotly.offline.init_notebook_mode(connected=True)

    bars = []
    x_values = df[x_column].tolist()
    for value_column in value_columns:
        bars.append(go.Bar(
            x=x_values,
            y=df[value_column].tolist(),
            name=value_column))

    layout = go.Layout(
        barmode='group',
        title= title
    )

    fig = go.Figure(data=bars, layout=layout)
    plotly.offline.iplot(fig, filename='grouped-bar')

In [6]:
def add_bot_filter(s):
    return s.filter('term', author_bot='false')

def add_merges_filter(s):
    return s.filter('range', files={'gt': 0})

def add_date_filter(s):
    return s.filter('range', grimoire_creation_date=date_range)

def add_project_filter(s):
    if project_name != 'all':
        github = projects['Github']
        repos = github[github['Project'] == project_name]['Repo'].tolist()
        #print(repos)
        s = s.filter('terms', repo_name=repos)
    return s

# Metrics

## Groups of contributors, by level of activity: core, regular, casual

The following table and chart show the number of contributors in three groups:
* Core: the smallest set of most active authors who together made 80% of contributions.
* Regular: the next most active authors, whose contributions take the cumulative share from 80% to 95%.
* Casual: the remaining contributors, who made the last 5% of contributions.

Tracking these groups over time shows the structure of the community at any given point and how it evolves.
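
As a worked example of these thresholds, consider five hypothetical authors with 50, 30, 10, 6 and 4 commits in one quarter (100 in total). The sketch below restates the grouping rule as a compact stand-alone function; `classify` and the sample numbers are illustrative assumptions, while the 80% and 95% cut-offs are the ones defined above:

```python
def classify(counts, core_share=0.8, regular_share=0.95):
    """Count core, regular and casual authors from per-author contribution counts."""
    total = sum(counts)
    groups = {'core': 0, 'regular': 0, 'casual': 0}
    accumulated = 0
    # Walk authors from most to least active, accumulating their contributions.
    for value in sorted(counts, reverse=True):
        if accumulated < total * core_share:
            groups['core'] += 1
        elif accumulated < total * regular_share:
            groups['regular'] += 1
        else:
            groups['casual'] += 1
        accumulated += value
    return groups


print(classify([50, 30, 10, 6, 4]))
```

With these numbers the two most active authors reach 80 commits (core), the next two take the cumulative total to 96 (regular), and the last author is casual.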

In [7]:
s = Search(using=es_conn, index='git')

s = add_bot_filter(s)
s = add_merges_filter(s)

# Filter commits to the Project Repos
s = add_project_filter(s)

# Adds date range to retrieve data from
s = add_date_filter(s)


# Unique count of Commits by Authors
s.aggs.bucket('authors', 'terms', field='author_uuid', size=100000)\
    .bucket('time', 'date_histogram', field='grimoire_creation_date', interval='quarter')\
    .metric('commits', 'cardinality', field='hash', precision_threshold=1000)
result = s.execute()

onion_df=onion_evolution(result, bucket_column='Author', time_column='Time', value_column='Commits',\
      bucket_field='authors', time_field='time', metric_field='commits')

# Calculate quarters
#onion_df['Quarter'] = pandas.PeriodIndex(pandas.to_datetime(onion_df.Time), freq='Q')
onion_df['Quarter'] = onion_df['Time'].map(lambda x: str(pandas.Period(x,'Q')))

onion_df

| | Time | Core | Regular | Casual | Quarter |
|---|---|---|---|---|---|
| 0 | 2013-01-01T00:00:00.000Z | 246.0 | 250.0 | 670.0 | 2013Q1 |
| 1 | 2013-04-01T00:00:00.000Z | 295.0 | 292.0 | 880.0 | 2013Q2 |
| 2 | 2013-07-01T00:00:00.000Z | 307.0 | 292.0 | 989.0 | 2013Q3 |
| 3 | 2013-10-01T00:00:00.000Z | 294.0 | 330.0 | 1146.0 | 2013Q4 |
| 4 | 2014-01-01T00:00:00.000Z | 331.0 | 357.0 | 1304.0 | 2014Q1 |
| 5 | 2014-04-01T00:00:00.000Z | 364.0 | 388.0 | 1447.0 | 2014Q2 |
| 6 | 2014-07-01T00:00:00.000Z | 342.0 | 391.0 | 1520.0 | 2014Q3 |
| 7 | 2014-10-01T00:00:00.000Z | 349.0 | 416.0 | 1652.0 | 2014Q4 |
| 8 | 2015-01-01T00:00:00.000Z | 337.0 | 426.0 | 1784.0 | 2015Q1 |
| 9 | 2015-04-01T00:00:00.000Z | 336.0 | 422.0 | 1724.0 | 2015Q2 |


In [8]:
print_grouped_bar(onion_df, 'Quarter', ['Core', 'Regular', 'Casual'], 'Global Contribution Groups')

### Employees vs Non-Employees

In [9]:
s = Search(using=es_conn, index='git')

# Adds date range to retrieve data from
s = add_date_filter(s)


s = add_bot_filter(s)
s = add_merges_filter(s)

# Filter commits to the Project Repos
s = add_project_filter(s)

### INCLUDE EMPLOYEES ONLY #######################################
s = s.filter('terms', author_org_name=['Mozilla Staff'])
##################################################################


# Unique count of commits by author (up to 100000 authors), per quarter
s.aggs.bucket('authors', 'terms', field='author_uuid', size=100000)\
    .bucket('time', 'date_histogram', field='grimoire_creation_date', interval='quarter')\
    .metric('commits', 'cardinality', field='hash', precision_threshold=1000)
result = s.execute()

moz_onion_df = onion_evolution(result, bucket_column='Author', time_column='Time', value_column='Commits',\
      bucket_field='authors', time_field='time', metric_field='commits')

# Calculate quarters
#onion_df['Quarter'] = pandas.PeriodIndex(pandas.to_datetime(onion_df.Time), freq='Q')
moz_onion_df['Quarter'] = moz_onion_df['Time'].map(lambda x: str(pandas.Period(x,'Q')))
moz_onion_df['Organization'] = 'Mozilla Staff'

moz_onion_df

| | Time | Core | Regular | Casual | Quarter | Organization |
|---|---|---|---|---|---|---|
| 0 | 2013-01-01T00:00:00.000Z | 189.0 | 155.0 | 278.0 | 2013Q1 | Mozilla Staff |
| 1 | 2013-04-01T00:00:00.000Z | 219.0 | 180.0 | 362.0 | 2013Q2 | Mozilla Staff |
| 2 | 2013-07-01T00:00:00.000Z | 234.0 | 188.0 | 400.0 | 2013Q3 | Mozilla Staff |
| 3 | 2013-10-01T00:00:00.000Z | 220.0 | 190.0 | 459.0 | 2013Q4 | Mozilla Staff |
| 4 | 2014-01-01T00:00:00.000Z | 242.0 | 192.0 | 482.0 | 2014Q1 | Mozilla Staff |
| 5 | 2014-04-01T00:00:00.000Z | 258.0 | 209.0 | 505.0 | 2014Q2 | Mozilla Staff |
| 6 | 2014-07-01T00:00:00.000Z | 252.0 | 206.0 | 516.0 | 2014Q3 | Mozilla Staff |
| 7 | 2014-10-01T00:00:00.000Z | 242.0 | 189.0 | 529.0 | 2014Q4 | Mozilla Staff |
| 8 | 2015-01-01T00:00:00.000Z | 229.0 | 185.0 | 534.0 | 2015Q1 | Mozilla Staff |
| 9 | 2015-04-01T00:00:00.000Z | 224.0 | 187.0 | 512.0 | 2015Q2 | Mozilla Staff |


In [10]:
s = Search(using=es_conn, index='git')

# Adds date range to retrieve data from
s = add_date_filter(s)

s = add_bot_filter(s)
s = add_merges_filter(s)

# Filter commits to the Project Repos
s = add_project_filter(s)

### EXCLUDE EMPLOYEES ############################################
s = s.exclude('terms', author_org_name=['Mozilla Staff'])
##################################################################


# Unique count of commits by author (up to 100000 authors), per quarter
s.aggs.bucket('authors', 'terms', field='author_uuid', size=100000)\
    .bucket('time', 'date_histogram', field='grimoire_creation_date', interval='quarter')\
    .metric('commits', 'cardinality', field='hash', precision_threshold=1000)
result = s.execute()

com_onion_df = onion_evolution(result, bucket_column='Author', time_column='Time', value_column='Commits',\
      bucket_field='authors', time_field='time', metric_field='commits')

# Calculate quarters
#onion_df['Quarter'] = pandas.PeriodIndex(pandas.to_datetime(onion_df.Time), freq='Q')
com_onion_df['Quarter'] = com_onion_df['Time'].map(lambda x: str(pandas.Period(x,'Q')))
com_onion_df['Organization'] = 'Other'

com_onion_df

| | Time | Core | Regular | Casual | Quarter | Organization |
|---|---|---|---|---|---|---|
| 0 | 2013-01-01T00:00:00.000Z | 92.0 | 186.0 | 266.0 | 2013Q1 | Other |
| 1 | 2013-04-01T00:00:00.000Z | 112.0 | 219.0 | 375.0 | 2013Q2 | Other |
| 2 | 2013-07-01T00:00:00.000Z | 106.0 | 188.0 | 472.0 | 2013Q3 | Other |
| 3 | 2013-10-01T00:00:00.000Z | 125.0 | 263.0 | 513.0 | 2013Q4 | Other |
| 4 | 2014-01-01T00:00:00.000Z | 135.0 | 310.0 | 631.0 | 2014Q1 | Other |
| 5 | 2014-04-01T00:00:00.000Z | 152.0 | 326.0 | 749.0 | 2014Q2 | Other |
| 6 | 2014-07-01T00:00:00.000Z | 201.0 | 363.0 | 715.0 | 2014Q3 | Other |
| 7 | 2014-10-01T00:00:00.000Z | 228.0 | 424.0 | 805.0 | 2014Q4 | Other |
| 8 | 2015-01-01T00:00:00.000Z | 252.0 | 485.0 | 862.0 | 2015Q1 | Other |
| 9 | 2015-04-01T00:00:00.000Z | 235.0 | 451.0 | 873.0 | 2015Q2 | Other |


In [11]:
print_grouped_bar(moz_onion_df, 'Quarter', ['Core', 'Regular', 'Casual'], 'Employees')
print_grouped_bar(com_onion_df, 'Quarter', ['Core', 'Regular', 'Casual'], 'Non-Employees')

# Answers:

* How often do contributors contribute?
* How is the structure of contribution, according to level of activity?
* How is the structure of contribution, according to the different data sources?
* How are the structures of contribution evolving over time?
* How are people flowing in the structure of contribution?
 

## Level of Activity

* How often do contributors contribute?
* How is the structure of contribution, according to level of activity?
* How are the structures of contribution evolving over time?

### Initial considerations

To understand what follows, a few definitions are worth mentioning. The data is based on the number of contributors in three groups:
* **Core**: the smallest set of most active authors who together made 80% of contributions.
* **Regular**: the next most active authors, whose contributions take the cumulative share from 80% to 95%.
* **Casual**: the remaining contributors, who made the last 5% of contributions.

**Bars** in plots correspond to number of people in a given group.
**Colors** correspond to groups.
**Time** is noted as YYYYQn (for example, 2013Q1). Quarters are defined as Q1 (from Jan. to Mar.), Q2 (from Apr. to Jun.), Q3 (from Jul. to Sep.) and Q4 (from Oct. to Dec.).
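
The quarter labels can be derived from the timestamps returned by the date histogram. A minimal sketch, assuming ISO timestamps in UTC such as the ones in the `Time` column; `quarter_label` is a hypothetical helper, not a function used elsewhere in this notebook:

```python
import pandas


def quarter_label(timestamp):
    """Convert an ISO timestamp string into a label such as '2013Q2'."""
    moment = pandas.to_datetime(timestamp, utc=True)
    # Drop the timezone before converting, since periods are timezone-naive.
    return str(moment.tz_localize(None).to_period('Q'))


print(quarter_label('2013-04-01T00:00:00.000Z'))
```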

**NOTE: as we are using quarters, the first and last ones should be ignored, as they show incomplete results because relative dates are being used.**

#### Contribution Pattern Structure Analysis
(How often do contributors contribute)

(How is the structure of contribution, according to level of activity)

The first question relates to the definition of the groups stated above. Looking at [Table 1][1], casual contributors outnumber core contributors by a factor of roughly three to five in the quarters listed. The number of regular contributors is close to the number of core ones. The size of the core group seems reasonable, given that these are the people who drive the community.
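
The ratios between groups can be recomputed directly from the figures in Table 1. A minimal sketch, with the counts copied by hand from that table (only the quarters listed there); the ratio column names are chosen here for illustration:

```python
import pandas

# Core, regular and casual counts copied from Table 1.
table = pandas.DataFrame({
    'Quarter': ['2013Q1', '2013Q2', '2013Q3', '2013Q4', '2014Q1',
                '2014Q2', '2014Q3', '2014Q4', '2015Q1', '2015Q2'],
    'Core': [246, 295, 307, 294, 331, 364, 342, 349, 337, 336],
    'Regular': [250, 292, 292, 330, 357, 388, 391, 416, 426, 422],
    'Casual': [670, 880, 989, 1146, 1304, 1447, 1520, 1652, 1784, 1724],
})

# Size of the casual and regular groups relative to the core group.
table['Casual/Core'] = (table['Casual'] / table['Core']).round(2)
table['Regular/Core'] = (table['Regular'] / table['Core']).round(2)

print(table[['Quarter', 'Casual/Core', 'Regular/Core']])
```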

#### Evolution of Contribution Patterns through Time
(How are the structures of contribution evolving over time?)
(How are people flowing in the structure of contribution?)

The clearest and most interesting finding in [Figure 1][2] concerns casual contributors. Their number increases until the first quarter of 2015 and then decreases consistently, quarter by quarter, up to the present. If we focus on the numbers instead of the bars (see [Table 1][1]), we find a somewhat similar pattern for core developers, keeping in mind that their numbers are several times smaller.

**This could be due to a general decrease in the number of contributors, as seems to be the case in [the evolution of contributors in Git plot][3].**
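
The turning point for casual contributors can also be located numerically. A minimal sketch, using only the casual counts listed in Table 1, so any quarters not shown in that table are not covered:

```python
import pandas

# Casual contributor counts copied from Table 1, indexed by quarter.
casual = pandas.Series(
    [670, 880, 989, 1146, 1304, 1447, 1520, 1652, 1784, 1724],
    index=['2013Q1', '2013Q2', '2013Q3', '2013Q4', '2014Q1',
           '2014Q2', '2014Q3', '2014Q4', '2015Q1', '2015Q2'],
    name='Casual',
)

# Quarter with the highest number of casual contributors.
peak_quarter = casual.idxmax()

# Quarter-over-quarter change; negative values mark a decline.
change = casual.diff()

print(peak_quarter)
print(change[change < 0])
```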

[1]: #Table-1.-Contribution-patterns-evolution.-General-view
[2]: #Figure-1.-Contribution-patterns-evolution.-General-view
[3]: Understanding%20Contributors.ipynb#Number-of-contributors-by-organization-over-time

In [12]:
onion_df

| | Time | Core | Regular | Casual | Quarter |
|---|---|---|---|---|---|
| 0 | 2013-01-01T00:00:00.000Z | 246.0 | 250.0 | 670.0 | 2013Q1 |
| 1 | 2013-04-01T00:00:00.000Z | 295.0 | 292.0 | 880.0 | 2013Q2 |
| 2 | 2013-07-01T00:00:00.000Z | 307.0 | 292.0 | 989.0 | 2013Q3 |
| 3 | 2013-10-01T00:00:00.000Z | 294.0 | 330.0 | 1146.0 | 2013Q4 |
| 4 | 2014-01-01T00:00:00.000Z | 331.0 | 357.0 | 1304.0 | 2014Q1 |
| 5 | 2014-04-01T00:00:00.000Z | 364.0 | 388.0 | 1447.0 | 2014Q2 |
| 6 | 2014-07-01T00:00:00.000Z | 342.0 | 391.0 | 1520.0 | 2014Q3 |
| 7 | 2014-10-01T00:00:00.000Z | 349.0 | 416.0 | 1652.0 | 2014Q4 |
| 8 | 2015-01-01T00:00:00.000Z | 337.0 | 426.0 | 1784.0 | 2015Q1 |
| 9 | 2015-04-01T00:00:00.000Z | 336.0 | 422.0 | 1724.0 | 2015Q2 |


##### Table 1. Contribution patterns evolution. General view

In [13]:
print_grouped_bar(onion_df, 'Quarter', ['Core', 'Regular', 'Casual'], 'Global Contribution Groups')

##### Figure 1. Contribution patterns evolution. General view

### Mozilla Staff and Community as Different Organizations

If we split contributions by organization and analyze the contribution patterns of each, we can describe how each population is structured regardless of global patterns (**TODO: analyze previous plots dividing groups by organization to get an insight into the composition of global patterns**).

Both populations (see [Figure 2][1] and [Figure 3][2]) follow the same trend described for the general view above. As one could expect, Mozilla Staff shows smaller differences among groups. In the community alone, the group of people sending most of the contributions is small compared to the other two groups.

The number of core developers is similar in both populations, so we can say that the community has many more people making small contributions than Mozilla Staff does.

Moreover, when people belong to Mozilla Staff, that is, when they are employed, more people are involved in most of the contributions. This sounds reasonable, because they are paid to spend their time contributing. When they are not, the core group is proportionally smaller, and the groups grow larger as engagement decreases.
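
The proportions discussed here can be made explicit by computing the share of each group within its own population. A minimal sketch for a single quarter (2015Q2), with the counts copied by hand from the Mozilla Staff and Other tables above; the choice of quarter is arbitrary:

```python
import pandas

# Counts for 2015Q2, copied from the Mozilla Staff and Other tables.
counts = pandas.DataFrame(
    {'Core': [224, 235], 'Regular': [187, 451], 'Casual': [512, 873]},
    index=['Mozilla Staff', 'Other'],
)

# Share of each group within its own population, as a percentage.
shares = counts.div(counts.sum(axis=1), axis=0).mul(100).round(1)

print(shares)
```

Dividing by the row totals makes the two populations comparable even though they differ in size.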

[1]: #Figure-2.-Contribution-patterns-evolution.-Mozilla-Staff-isolated-view
[2]: #Figure-3.-Contribution-patterns-evolution.-Community-isolated-view

In [14]:
print_grouped_bar(moz_onion_df, 'Quarter', ['Core', 'Regular', 'Casual'], 'Employees')

##### Figure 2. Contribution patterns evolution. Mozilla Staff isolated view

In [15]:
print_grouped_bar(com_onion_df, 'Quarter', ['Core', 'Regular', 'Casual'], 'Non-Employees')

##### Figure 3. Contribution patterns evolution. Community isolated view