## Goal: understanding contribution patterns

Analyzing groups of contributors by their activity patterns, and how those patterns evolve over time, helps to understand the structure of the community. These groups will be defined according to how active they are (from casual to core contributors) and which kinds of activity they perform (for example, producing code, reviewing code, submitting issues, or contributing to discussions). Whenever convenient, this characterization will be combined with the contributor groups identified in the first goal.

This goal is refined in the following questions:

**Questions**:

 * How often do contributors contribute?
 * How is the structure of contribution, according to level of activity?
 * How is the structure of contribution, according to the different data sources?
 * How are the structures of contribution evolving over time?
 * How are people flowing in the structure of contribution?

These questions can be answered with the following metrics:

**Metrics**:

(Still to be refined)

 * Groups of contributors, by level of activity (core, regular, casual)
 * Groups of contributors, by kind of activity (committing, opening issues, merging pull requests, etc.)
 * Groups of contributors, by breadth of activity (specialists, spread, etc.)
 * Activity metrics for each group
 * Absolute number of contributors moving from one group to another
 * Fraction of contributors moving from a group to another
 
Some of these metrics will be computed for the specified contributor groups, over time.
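
As a rough illustration of the last two metrics, the sketch below counts contributors moving between groups in two consecutive periods. The function name `group_transitions`, the `'absent'` label and the sample group assignments are hypothetical; the notebook does not compute this metric yet.

```python
from collections import Counter


def group_transitions(previous, current):
    """Count moves between groups for two consecutive periods.

    `previous` and `current` map a contributor id to a group name.
    Contributors missing from one period are labelled 'absent' there.
    Returns (absolute counts, fractions relative to the origin group).
    """
    counts = Counter()
    for author in set(previous) | set(current):
        origin = previous.get(author, 'absent')
        target = current.get(author, 'absent')
        counts[(origin, target)] += 1

    # Size of each origin group, used as the denominator of the fractions
    origin_sizes = Counter()
    for (origin, _), n in counts.items():
        origin_sizes[origin] += n

    fractions = {move: n / origin_sizes[move[0]] for move, n in counts.items()}
    return dict(counts), fractions


# Made-up group assignments for two quarters (assumption, not real data)
q1 = {'ana': 'core', 'bob': 'regular', 'eve': 'casual', 'kim': 'casual'}
q2 = {'ana': 'core', 'bob': 'core', 'eve': 'casual', 'lee': 'casual'}

counts, fractions = group_transitions(q1, q2)
print(counts[('regular', 'core')])      # contributors who moved from regular to core
print(fractions[('casual', 'absent')])  # share of casual contributors who left
```

Using the origin group as the denominator makes the fractions comparable across groups of very different sizes.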

# Metric Calculations
First we need to open a connection to the proper ES (Elasticsearch) instance. We use an external module to load credentials from a file that will not be shared. If you want to run this notebook, please use your own credentials: put them in a file named '.settings' (in the same directory as this notebook), following the example file 'settings.sample'.

This section includes common code to manage and plot data. Queries will be available at the corresponding section.
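
The helpers defined in this section read nested aggregation buckets from the query response. A minimal sketch of that flattening step follows, using a hand-written dictionary in place of a real response. The author ids and values are made up, and `flatten_time_group` is a hypothetical name used only for illustration.

```python
def flatten_time_group(aggregations, time_field, group_field, value_field):
    """Return one (time, group, value) tuple per inner bucket."""
    rows = []
    for time_bucket in aggregations[time_field]['buckets']:
        for group_bucket in time_bucket[group_field]['buckets']:
            rows.append((time_bucket['key_as_string'],
                         group_bucket['key'],
                         group_bucket[value_field]['value']))
    return rows


# Hand-written stand-in for result.to_dict()['aggregations'] (assumption)
sample = {
    'time': {'buckets': [
        {'key_as_string': '2010-01-01T00:00:00.000Z',
         'authors': {'buckets': [
             {'key': 'author-a', 'commits': {'value': 12}},
             {'key': 'author-b', 'commits': {'value': 3}},
         ]}},
        {'key_as_string': '2010-04-01T00:00:00.000Z',
         'authors': {'buckets': [
             {'key': 'author-a', 'commits': {'value': 7}},
         ]}},
    ]}
}

rows = flatten_time_group(sample, 'time', 'authors', 'commits')
for row in rows:
    print(row)
```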


In [1]:
import pandas

import os

import plotly
import plotly.graph_objs as go

import util as ut

from util import ESConnection
from elasticsearch_dsl import Search

es_conn = ESConnection()

# Let's load projects from the REVIEWED SPREADSHEET
projects = ut.read_projects("data/Contributors and Communities Analysis - Project grouping.xlsx")

project_name = os.environ.get('PROJECT', 'all')

date_range = {'gte': '2010-01-01', 'lt': 'now/y'}


In [2]:
def create_search(source):
    """Create a Search on the `source` index.

    For the 'git' and 'github' indexes, results are restricted to the
    repositories listed in the projects spreadsheet.
    """
    s = Search(using=es_conn, index=source)

    if source in ('git', 'github'):
        github = projects['Github']
        repos = github['Repo'].tolist()
        s = s.filter('terms', repo_name=repos)

    # TODO: Add bot and merges filtering.
    return s

In [3]:
def stack_by(result, group_column, time_column, value_column, group_field, time_field, value_field):
    """Flatten a group -> time nested aggregation into a dataframe.

    Expects `group_field` buckets at the top level, each containing
    `time_field` buckets with a `value_field` metric.
    Returns one row per (group, time) pair.
    """
    rows = []
    for b in result.to_dict()['aggregations'][group_field]['buckets']:
        for i in b[time_field]['buckets']:
            rows.append([b['key'], i['key_as_string'], i[value_field]['value']])

    return pandas.DataFrame(rows, columns=[group_column, time_column, value_column])

def stack_by_2(result, group_column, time_column, value_column, group_field, time_field, value_field):
    """Flatten a time -> group nested aggregation into a dataframe.

    Expects `time_field` buckets at the top level, each containing
    `group_field` buckets with a `value_field` metric.
    Returns one row per (time, group) pair.
    """
    rows = []
    for b in result.to_dict()['aggregations'][time_field]['buckets']:
        for i in b[group_field]['buckets']:
            rows.append([b['key_as_string'], i['key'], i[value_field]['value']])

    return pandas.DataFrame(rows, columns=[time_column, group_column, value_column])

In [4]:
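def onion(df, bucket_column, time_column, value_column):
    """Count core, regular and casual contributors in `df`.

    `df` must already be sorted by `value_column` in descending order.
    `bucket_column` and `time_column` are not used; they are kept so
    that existing calls keep working.
    """
    total = df[value_column].sum()

    percent_80 = total * 0.8
    percent_95 = total * 0.95
    core = 0
    regular = 0
    casual = 0
    # Running sum of the contributions of authors classified so far
    accumulated = 0

    for value in df[value_column]:
        if percent_80 > accumulated:
            # Still below 80% of the total: this author is core
            core = core + 1
            accumulated = accumulated + value
        elif percent_95 > accumulated:
            # Between 80% and 95% of the total: this author is regular
            regular = regular + 1
            accumulated = accumulated + value
        else:
            casual = casual + 1

    return {"core":core,
            "regular":regular,
            "casual":casual} 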

def onion_evolution(result, bucket_column, time_column, value_column, bucket_field, time_field, metric_field):
    
    df = stack_by_2(result, bucket_column, time_column, value_column, bucket_field, time_field, metric_field)
    
    print(len(df))
    
    onion_df = pandas.DataFrame(columns=['Time', 'Core', 'Regular', 'Casual'])
    
    for time in df[time_column].unique():
        slice_df = df.loc[df[time_column] == time]
        slice_df = slice_df.sort_values(by=value_column, ascending=False)
        onion_result = onion(slice_df, bucket_column=bucket_column, time_column=time_column,\
                             value_column=value_column)
        print(time, '->', len(slice_df))#, slice_df.columns.values.tolist(), '->', onion_result)
        onion_df.loc[len(onion_df)] = [time, onion_result['core'], onion_result['regular'], onion_result['casual']]
        
    
    #print(len(df))
    return onion_df

def onion_evolution_bad(result, bucket_column, time_column, value_column, bucket_field, time_field, metric_field):
    
    df = stack_by(result, bucket_column, time_column, value_column, bucket_field, time_field, metric_field)
    
    print(len(df))
    
    onion_df = pandas.DataFrame(columns=['Time', 'Core', 'Regular', 'Casual'])
    
    for time in df[time_column].unique():
        slice_df = df.loc[df[time_column] == time]
        slice_df = slice_df.sort_values(by=value_column, ascending=False)
        onion_result = onion(slice_df, bucket_column=bucket_column, time_column=time_column,\
                             value_column=value_column)
        print(time, '->', len(slice_df))#, slice_df.columns.values.tolist(), '->', onion_result)
        onion_df.loc[len(onion_df)] = [time, onion_result['core'], onion_result['regular'], onion_result['casual']]
        
    
    #print(len(df))
    return onion_df

In [5]:
def print_grouped_bar(df, x_column, value_columns, title):
    """Plot the columns in `value_columns` as side-by-side bars for each `x_column` value.
    """
    plotly.offline.init_notebook_mode(connected=True)

    bars = []
    x_values = df[x_column].tolist()
    for value_column in value_columns:
        bars.append(go.Bar(
            x=x_values,
            y=df[value_column].tolist(),
            name=value_column))

    layout = go.Layout(
        barmode='group',
        title= title
    )

    fig = go.Figure(data=bars, layout=layout)
    plotly.offline.iplot(fig, filename='grouped-bar')
    
def print_stacked_bar(df, x_column, value_columns, title):
    """Plot the columns in `value_columns` as stacked bars for each `x_column` value.
    """
    plotly.offline.init_notebook_mode(connected=True)

    bars = []
    x_values = df[x_column].tolist()
    for value_column in value_columns:
        bars.append(go.Bar(
            x=x_values,
            y=df[value_column].tolist(),
            name=value_column))

    layout = go.Layout(
        barmode='stack',
        title= title
    )

    fig = go.Figure(data=bars, layout=layout)
    plotly.offline.iplot(fig, filename='stacked-bar')

In [6]:
def add_bot_filter(s):
    return s.filter('term', author_bot='false')

def add_merges_filter(s):
    return s.filter('range', files={'gt': 0})

def add_date_filter(s):
    return s.filter('range', grimoire_creation_date=date_range)

def add_general_date_filters(s):
    # 01/01/1998
    initial_ts = '883609200000'
    return s.filter('range', grimoire_creation_date={'gt': initial_ts})


def add_project_filter(s):
    if project_name != 'all':
        github = projects['Github']
        repos = github[github['Project'] == project_name]['Repo'].tolist()
        #print(repos)
        s = s.filter('terms', repo_name=repos)
    return s

# Let's load projects from the REVIEWED SPREADSHEET
projects = ut.read_projects("data/Contributors and Communities Analysis - Project grouping.xlsx")


# Metrics

## Groups of contributors, by level of activity: core, regular, casual

The following table and chart show the number of contributors in three groups:
* Core: the smallest set of most active authors who together made 80% of contributions.
* Regular: the next most active authors, whose contributions cover the range from 80% to 95% of the total.
* Casual: the rest of contributors, who made the last 5% of contributions.

Tracking these groups over time shows the structure of the community at a given point and how it evolves.
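
A small worked example may help to read these definitions. The sketch below applies the same 80% and 95% thresholds to a made-up list of commit counts; `classify` is a hypothetical stand-alone helper that works on plain lists, not the function used in the notebook.

```python
def classify(commit_counts):
    """Return how many authors fall in the core, regular and casual groups."""
    # Most active authors first
    ordered = sorted(commit_counts, reverse=True)
    total = sum(ordered)
    groups = {'core': 0, 'regular': 0, 'casual': 0}
    accumulated = 0

    for value in ordered:
        if accumulated < total * 0.8:
            groups['core'] += 1
            accumulated += value
        elif accumulated < total * 0.95:
            groups['regular'] += 1
            accumulated += value
        else:
            groups['casual'] += 1
    return groups


# Made-up commits per author, 100 commits in total (assumption)
commits_per_author = [50, 25, 12, 6, 4, 2, 1]
result = classify(commits_per_author)
print(result)  # {'core': 3, 'regular': 2, 'casual': 2}
```

Here three authors are needed to pass 80 of the 100 commits (50 + 25 + 12 = 87), two more to pass 95 (87 + 6 + 4 = 97), and the remaining two are casual. Note that the author who crosses a threshold is counted in the group being filled.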

In [7]:
s = create_search(source='git')

s = add_general_date_filters(s)

s = add_bot_filter(s)
s = add_merges_filter(s)

# Filter commits to the Project Repos
s = add_project_filter(s)

# Adds date range to retrieve data from
s = add_date_filter(s)


# Unique count of Commits by Authors over time
s.aggs.bucket('time', 'date_histogram', field='grimoire_creation_date', interval='quarter')\
    .bucket('authors', 'terms', field='author_uuid', size=100000)\
    .metric('commits', 'cardinality', field='hash', precision_threshold=3000)
result = s.execute()

onion_df=onion_evolution(result, bucket_column='Author', time_column='Time', value_column='Commits',\
      bucket_field='authors', time_field='time', metric_field='commits')

# Calculate quarters
#onion_df['Quarter'] = pandas.PeriodIndex(pandas.to_datetime(onion_df.Time), freq='Q')
onion_df['Quarter'] = onion_df['Time'].map(lambda x: str(pandas.Period(x,'Q')))

onion_df

36371
2010-01-01T00:00:00.000Z -> 497
2010-04-01T00:00:00.000Z -> 560
2010-07-01T00:00:00.000Z -> 640
2010-10-01T00:00:00.000Z -> 579
2011-01-01T00:00:00.000Z -> 670
2011-04-01T00:00:00.000Z -> 738
2011-07-01T00:00:00.000Z -> 794
2011-10-01T00:00:00.000Z -> 816
2012-01-01T00:00:00.000Z -> 927
2012-04-01T00:00:00.000Z -> 994
2012-07-01T00:00:00.000Z -> 977
2012-10-01T00:00:00.000Z -> 1038
2013-01-01T00:00:00.000Z -> 1142
2013-04-01T00:00:00.000Z -> 1308
2013-07-01T00:00:00.000Z -> 1297
2013-10-01T00:00:00.000Z -> 1421
2014-01-01T00:00:00.000Z -> 1567
2014-04-01T00:00:00.000Z -> 1681
2014-07-01T00:00:00.000Z -> 1725
2014-10-01T00:00:00.000Z -> 1824
2015-01-01T00:00:00.000Z -> 1978
2015-04-01T00:00:00.000Z -> 1882
2015-07-01T00:00:00.000Z -> 1734
2015-10-01T00:00:00.000Z -> 1929
2016-01-01T00:00:00.000Z -> 2050
2016-04-01T00:00:00.000Z -> 1939
2016-07-01T00:00:00.000Z -> 1974
2016-10-01T00:00:00.000Z -> 1690


| | Time | Core | Regular | Casual | Quarter |
|---|---|---|---|---|---|
| 0 | 2010-01-01T00:00:00.000Z | 96.0 | 119.0 | 282.0 | 2010Q1 |
| 1 | 2010-04-01T00:00:00.000Z | 100.0 | 123.0 | 337.0 | 2010Q2 |
| 2 | 2010-07-01T00:00:00.000Z | 124.0 | 133.0 | 383.0 | 2010Q3 |
| 3 | 2010-10-01T00:00:00.000Z | 127.0 | 122.0 | 330.0 | 2010Q4 |
| 4 | 2011-01-01T00:00:00.000Z | 144.0 | 145.0 | 381.0 | 2011Q1 |
| 5 | 2011-04-01T00:00:00.000Z | 166.0 | 167.0 | 405.0 | 2011Q2 |
| 6 | 2011-07-01T00:00:00.000Z | 178.0 | 178.0 | 438.0 | 2011Q3 |
| 7 | 2011-10-01T00:00:00.000Z | 188.0 | 170.0 | 458.0 | 2011Q4 |
| 8 | 2012-01-01T00:00:00.000Z | 196.0 | 196.0 | 535.0 | 2012Q1 |
| 9 | 2012-04-01T00:00:00.000Z | 228.0 | 214.0 | 552.0 | 2012Q2 |


In [8]:
print_grouped_bar(onion_df, 'Quarter', ['Core', 'Regular', 'Casual'], 'Global Contribution Groups')
print_stacked_bar(onion_df, 'Quarter', ['Core', 'Regular', 'Casual'], 'Global Contribution Groups')

### Employees vs Non-Employees

In [9]:
s = create_search(source='git')

# Adds date range to retrieve data from
s = add_date_filter(s)

s = add_general_date_filters(s)


s = add_bot_filter(s)
s = add_merges_filter(s)

# Filter commits to the Project Repos
s = add_project_filter(s)

### INCLUDE EMPLOYEES ONLY #######################################
s = s.filter('terms', author_org_name=['Mozilla Staff', 'Code Sheriff'])
##################################################################


# Unique count of Commits by Author over time (up to 100000 authors per quarter)
s.aggs.bucket('time', 'date_histogram', field='grimoire_creation_date', interval='quarter')\
    .bucket('authors', 'terms', field='author_uuid', size=100000)\
    .metric('commits', 'cardinality', field='hash', precision_threshold=3000)
    
result = s.execute()

moz_onion_df = onion_evolution(result, bucket_column='Author', time_column='Time', value_column='Commits',\
      bucket_field='authors', time_field='time', metric_field='commits')

# Calculate quarters
#onion_df['Quarter'] = pandas.PeriodIndex(pandas.to_datetime(onion_df.Time), freq='Q')
moz_onion_df['Quarter'] = moz_onion_df['Time'].map(lambda x: str(pandas.Period(x,'Q')))
moz_onion_df['Organization'] = 'Mozilla Staff'

moz_onion_df

16737
2010-01-01T00:00:00.000Z -> 270
2010-04-01T00:00:00.000Z -> 297
2010-07-01T00:00:00.000Z -> 335
2010-10-01T00:00:00.000Z -> 313
2011-01-01T00:00:00.000Z -> 359
2011-04-01T00:00:00.000Z -> 392
2011-07-01T00:00:00.000Z -> 443
2011-10-01T00:00:00.000Z -> 441
2012-01-01T00:00:00.000Z -> 491
2012-04-01T00:00:00.000Z -> 554
2012-07-01T00:00:00.000Z -> 569
2012-10-01T00:00:00.000Z -> 579
2013-01-01T00:00:00.000Z -> 613
2013-04-01T00:00:00.000Z -> 696
2013-07-01T00:00:00.000Z -> 708
2013-10-01T00:00:00.000Z -> 733
2014-01-01T00:00:00.000Z -> 763
2014-04-01T00:00:00.000Z -> 802
2014-07-01T00:00:00.000Z -> 794
2014-10-01T00:00:00.000Z -> 753
2015-01-01T00:00:00.000Z -> 755
2015-04-01T00:00:00.000Z -> 737
2015-07-01T00:00:00.000Z -> 744
2015-10-01T00:00:00.000Z -> 727
2016-01-01T00:00:00.000Z -> 737
2016-04-01T00:00:00.000Z -> 743
2016-07-01T00:00:00.000Z -> 733
2016-10-01T00:00:00.000Z -> 656


| | Time | Core | Regular | Casual | Quarter | Organization |
|---|---|---|---|---|---|---|
| 0 | 2010-01-01T00:00:00.000Z | 71.0 | 67.0 | 132.0 | 2010Q1 | Mozilla Staff |
| 1 | 2010-04-01T00:00:00.000Z | 75.0 | 67.0 | 155.0 | 2010Q2 | Mozilla Staff |
| 2 | 2010-07-01T00:00:00.000Z | 95.0 | 80.0 | 160.0 | 2010Q3 | Mozilla Staff |
| 3 | 2010-10-01T00:00:00.000Z | 99.0 | 72.0 | 142.0 | 2010Q4 | Mozilla Staff |
| 4 | 2011-01-01T00:00:00.000Z | 106.0 | 86.0 | 167.0 | 2011Q1 | Mozilla Staff |
| 5 | 2011-04-01T00:00:00.000Z | 120.0 | 97.0 | 175.0 | 2011Q2 | Mozilla Staff |
| 6 | 2011-07-01T00:00:00.000Z | 132.0 | 118.0 | 193.0 | 2011Q3 | Mozilla Staff |
| 7 | 2011-10-01T00:00:00.000Z | 143.0 | 101.0 | 197.0 | 2011Q4 | Mozilla Staff |
| 8 | 2012-01-01T00:00:00.000Z | 147.0 | 115.0 | 229.0 | 2012Q1 | Mozilla Staff |
| 9 | 2012-04-01T00:00:00.000Z | 178.0 | 138.0 | 238.0 | 2012Q2 | Mozilla Staff |


In [10]:
s = create_search(source='git')

# Adds date range to retrieve data from
s = add_date_filter(s)

s = add_bot_filter(s)
s = add_merges_filter(s)

# Filter commits to the Project Repos
s = add_project_filter(s)

### EXCLUDE EMPLOYEES ############################################
s = s.exclude('terms', author_org_name=['Mozilla Staff', 'Code Sheriff'])
##################################################################


# Unique count of Commits by Author over time (up to 100000 authors per quarter)
s.aggs.bucket('time', 'date_histogram', field='grimoire_creation_date', interval='quarter')\
    .bucket('authors', 'terms', field='author_uuid', size=100000)\
    .metric('commits', 'cardinality', field='hash', precision_threshold=3000)
    
result = s.execute()

com_onion_df = onion_evolution(result, bucket_column='Author', time_column='Time', value_column='Commits',\
      bucket_field='authors', time_field='time', metric_field='commits')

# Calculate quarters
#onion_df['Quarter'] = pandas.PeriodIndex(pandas.to_datetime(onion_df.Time), freq='Q')
com_onion_df['Quarter'] = com_onion_df['Time'].map(lambda x: str(pandas.Period(x,'Q')))
com_onion_df['Organization'] = 'Other'

com_onion_df

19634
2010-01-01T00:00:00.000Z -> 227
2010-04-01T00:00:00.000Z -> 263
2010-07-01T00:00:00.000Z -> 305
2010-10-01T00:00:00.000Z -> 266
2011-01-01T00:00:00.000Z -> 311
2011-04-01T00:00:00.000Z -> 346
2011-07-01T00:00:00.000Z -> 351
2011-10-01T00:00:00.000Z -> 375
2012-01-01T00:00:00.000Z -> 436
2012-04-01T00:00:00.000Z -> 440
2012-07-01T00:00:00.000Z -> 408
2012-10-01T00:00:00.000Z -> 459
2013-01-01T00:00:00.000Z -> 529
2013-04-01T00:00:00.000Z -> 612
2013-07-01T00:00:00.000Z -> 589
2013-10-01T00:00:00.000Z -> 688
2014-01-01T00:00:00.000Z -> 804
2014-04-01T00:00:00.000Z -> 879
2014-07-01T00:00:00.000Z -> 931
2014-10-01T00:00:00.000Z -> 1071
2015-01-01T00:00:00.000Z -> 1223
2015-04-01T00:00:00.000Z -> 1145
2015-07-01T00:00:00.000Z -> 990
2015-10-01T00:00:00.000Z -> 1202
2016-01-01T00:00:00.000Z -> 1313
2016-04-01T00:00:00.000Z -> 1196
2016-07-01T00:00:00.000Z -> 1241
2016-10-01T00:00:00.000Z -> 1034


| | Time | Core | Regular | Casual | Quarter | Organization |
|---|---|---|---|---|---|---|
| 0 | 2010-01-01T00:00:00.000Z | 45.0 | 88.0 | 94.0 | 2010Q1 | Other |
| 1 | 2010-04-01T00:00:00.000Z | 47.0 | 93.0 | 123.0 | 2010Q2 | Other |
| 2 | 2010-07-01T00:00:00.000Z | 50.0 | 110.0 | 145.0 | 2010Q3 | Other |
| 3 | 2010-10-01T00:00:00.000Z | 46.0 | 95.0 | 125.0 | 2010Q4 | Other |
| 4 | 2011-01-01T00:00:00.000Z | 45.0 | 93.0 | 173.0 | 2011Q1 | Other |
| 5 | 2011-04-01T00:00:00.000Z | 53.0 | 113.0 | 180.0 | 2011Q2 | Other |
| 6 | 2011-07-01T00:00:00.000Z | 54.0 | 112.0 | 185.0 | 2011Q3 | Other |
| 7 | 2011-10-01T00:00:00.000Z | 64.0 | 129.0 | 182.0 | 2011Q4 | Other |
| 8 | 2012-01-01T00:00:00.000Z | 75.0 | 148.0 | 213.0 | 2012Q1 | Other |
| 9 | 2012-04-01T00:00:00.000Z | 73.0 | 145.0 | 222.0 | 2012Q2 | Other |


In [11]:
print_grouped_bar(moz_onion_df, 'Quarter', ['Core', 'Regular', 'Casual'], 'Employees')
print_grouped_bar(com_onion_df, 'Quarter', ['Core', 'Regular', 'Casual'], 'Non-Employees')
print_stacked_bar(moz_onion_df, 'Quarter', ['Core', 'Regular', 'Casual'], 'Employees')
print_stacked_bar(com_onion_df, 'Quarter', ['Core', 'Regular', 'Casual'], 'Non-Employees')

# Answers:

* How often do contributors contribute?
* How is the structure of contribution, according to level of activity?
* How is the structure of contribution, according to the different data sources?
* How are the structures of contribution evolving over time?
* How are people flowing in the structure of contribution?
 

## Level of Activity

* How often do contributors contribute?
* How is the structure of contribution, according to level of activity?
* How are the structures of contribution evolving over time?

### Initial considerations

To understand what follows, some definitions are worth mentioning. The data is based on the number of contributors in three groups:
* **Core**: the smallest set of most active authors who together made 80% of contributions.
* **Regular**: the next most active authors, whose contributions cover the range from 80% to 95% of the total.
* **Casual**: the rest of contributors, who made the last 5% of contributions.

**Bars** in plots correspond to number of people in a given group.
**Colors** correspond to groups.
**Time** is noted as YYYYQn (for example, 2010Q1). Quarters are defined as Q1 (from Jan. to Mar.), Q2 (from Apr. to Jun.), Q3 (from Jul. to Sep.) and Q4 (from Oct. to Dec.).

**NOTE: as we are using quarters, we should ignore the first and last ones, since they show incomplete results because relative dates are used.**
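
For reference, the quarter labels can be reproduced from the bucket timestamps with the standard library alone. The notebook itself builds them with `pandas.Period`; the function below, `quarter_label`, is a hypothetical equivalent that assumes timestamps in the format shown in the tables.

```python
from datetime import datetime


def quarter_label(timestamp):
    """Convert a timestamp such as '2010-04-01T00:00:00.000Z' to 'YYYYQn'."""
    moment = datetime.strptime(timestamp, '%Y-%m-%dT%H:%M:%S.%fZ')
    # Months 1-3 map to Q1, 4-6 to Q2, 7-9 to Q3 and 10-12 to Q4
    quarter = (moment.month - 1) // 3 + 1
    return '{}Q{}'.format(moment.year, quarter)


print(quarter_label('2010-04-01T00:00:00.000Z'))  # 2010Q2
print(quarter_label('2016-10-01T00:00:00.000Z'))  # 2016Q4
```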

#### Contribution Pattern Structure Analysis
(How often do contributors contribute)

(How is the structure of contribution, according to level of activity)

The first question relates to the group definitions stated above. [Table 1][1] shows that casual contributors are between two and four times as numerous as core contributors, while the number of regular contributors is quite close to the number of core ones. The size of the core group is reasonable, given that its members are the ones who drive the community.
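
The ratios mentioned here can be checked directly against the figures in Table 1. The sketch below uses only the ten quarters visible in that table, so it says nothing about later quarters; the variable names are chosen for illustration.

```python
# (quarter, core, regular, casual) rows copied from Table 1
table_1 = [
    ('2010Q1', 96, 119, 282),
    ('2010Q2', 100, 123, 337),
    ('2010Q3', 124, 133, 383),
    ('2010Q4', 127, 122, 330),
    ('2011Q1', 144, 145, 381),
    ('2011Q2', 166, 167, 405),
    ('2011Q3', 178, 178, 438),
    ('2011Q4', 188, 170, 458),
    ('2012Q1', 196, 196, 535),
    ('2012Q2', 228, 214, 552),
]

casual_to_core = {q: casual / core for q, core, regular, casual in table_1}
regular_to_core = {q: regular / core for q, core, regular, casual in table_1}

for quarter in casual_to_core:
    print(quarter,
          round(casual_to_core[quarter], 2),
          round(regular_to_core[quarter], 2))
```

For these quarters the casual-to-core ratio ranges from about 2.4 (2012Q2) to about 3.4 (2010Q2), and the regular-to-core ratio stays between roughly 0.9 and 1.25.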

#### Evolution of Contribution Patterns through Time
(How are the structures of contribution evolving over time?)
(How are people flowing in the structure of contribution?)

The clearest and most interesting finding in [Figure 1][2] concerns casual developers. The number of casual contributors increases until the first quarter of 2015, and then decreases consistently, quarter by quarter, up to the present. In fact, if we focus on the numbers instead of the bars (see [Table 1][1]), we find a somewhat similar pattern for core developers, taking into account that those numbers are two to four times smaller.

**This could be due to a decrease in the overall number of contributors, as [the evolution of contributors in Git plot][3] seems to indicate.**

[1]: #Table-1.-Contribution-patterns-evolution.-General-view
[2]: #Figure-1.-Contribution-patterns-evolution.-General-view
[3]: Understanding%20Contributors.ipynb#Number-of-contributors-by-organization-over-time

In [12]:
onion_df

| | Time | Core | Regular | Casual | Quarter |
|---|---|---|---|---|---|
| 0 | 2010-01-01T00:00:00.000Z | 96.0 | 119.0 | 282.0 | 2010Q1 |
| 1 | 2010-04-01T00:00:00.000Z | 100.0 | 123.0 | 337.0 | 2010Q2 |
| 2 | 2010-07-01T00:00:00.000Z | 124.0 | 133.0 | 383.0 | 2010Q3 |
| 3 | 2010-10-01T00:00:00.000Z | 127.0 | 122.0 | 330.0 | 2010Q4 |
| 4 | 2011-01-01T00:00:00.000Z | 144.0 | 145.0 | 381.0 | 2011Q1 |
| 5 | 2011-04-01T00:00:00.000Z | 166.0 | 167.0 | 405.0 | 2011Q2 |
| 6 | 2011-07-01T00:00:00.000Z | 178.0 | 178.0 | 438.0 | 2011Q3 |
| 7 | 2011-10-01T00:00:00.000Z | 188.0 | 170.0 | 458.0 | 2011Q4 |
| 8 | 2012-01-01T00:00:00.000Z | 196.0 | 196.0 | 535.0 | 2012Q1 |
| 9 | 2012-04-01T00:00:00.000Z | 228.0 | 214.0 | 552.0 | 2012Q2 |


##### Table 1. Contribution patterns evolution. General view

In [13]:
print_grouped_bar(onion_df, 'Quarter', ['Core', 'Regular', 'Casual'], 'Global Contribution Groups')

##### Figure 1. Contribution patterns evolution. General view

### Mozilla Staff and Community as Different Organizations

If we split contributions by organization and analyze the contribution patterns of each one separately, we can describe how each of them is structured regardless of the global patterns (**TODO: analyze previous plots dividing groups by organization to get an insight of global patterns composition**).

Both of them (see [Figure 2][1] and [Figure 3][2]) follow the same trend mentioned for the general view above. As one could expect, Mozilla Staff shows smaller differences among groups. Looking at the community only, the group of people sending most of the contributions is small compared to the other two groups.

Comparing the two, the number of core developers is similar in both populations, so we can say that the community has many more people making small contributions than Mozilla Staff does.

Besides, when people belong to Mozilla Staff, that is, when they are employed, more people are involved in most of the contributions. This sounds reasonable because they are paid to spend their time contributing. When they are not, the core group is proportionally smaller, and groups get bigger as engagement decreases.
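
One way to compare the two populations is to look at the share of each group rather than at absolute counts. The sketch below does this for two of the quarters listed in the Employees and Non-Employees tables earlier in this notebook; `group_shares` is a hypothetical helper, and the two quarters are only a sample, not the full series.

```python
def group_shares(core, regular, casual):
    """Return the fraction of contributors in each group."""
    total = core + regular + casual
    return {'core': core / total,
            'regular': regular / total,
            'casual': casual / total}


# (core, regular, casual) copied from the tables for two sample quarters
staff = {'2010Q1': (71, 67, 132), '2012Q2': (178, 138, 238)}
other = {'2010Q1': (45, 88, 94), '2012Q2': (73, 145, 222)}

for quarter in staff:
    staff_shares = group_shares(*staff[quarter])
    other_shares = group_shares(*other[quarter])
    print(quarter,
          'staff core share:', round(staff_shares['core'], 3),
          'other core share:', round(other_shares['core'], 3))
```

In both sample quarters the core group is a larger share of Mozilla Staff (about 26% and 32%) than of the rest of the community (about 20% and 17%).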

[1]: #Figure-2.-Contribution-patterns-evolution.-Mozilla-Staff-isolated-view
[2]: #Figure-3.-Contribution-patterns-evolution.-Community-isolated-view

In [14]:
print_grouped_bar(moz_onion_df, 'Quarter', ['Core', 'Regular', 'Casual'], 'Employees')

##### Figure 2. Contribution patterns evolution. Mozilla Staff isolated view

In [15]:
print_grouped_bar(com_onion_df, 'Quarter', ['Core', 'Regular', 'Casual'], 'Non-Employees')

##### Figure 3. Contribution patterns evolution. Community isolated view