## Goal: understanding contribution patterns

Analyzing groups of contributors by their activity patterns, and how those groups evolve over time, helps to understand the structure of the community. The groups are defined by how active contributors are (from casual to core) and by the kinds of activity they perform (for example, producing code, reviewing code, submitting issues, or taking part in discussions). Whenever convenient, this characterization will be combined with the contributor groups identified in the first goal.

This goal is refined in the following questions:

**Questions**:

 * How often do contributors contribute?
 * What is the structure of contribution by level of activity?
 * What is the structure of contribution across the different data sources?
 * How does the structure of contribution evolve over time?
 * How do people move within the structure of contribution?

These questions can be answered with the following metrics:

**Metrics**:

(Still to be refined)

 * Groups of contributors, by level of activity (core, regular, casual)
 * Groups of contributors, by kind of activity (committing, opening issues, merging pull requests, etc.)
 * Groups of contributors, by breadth of activity (specialists, spread, etc.)
 * Activity metrics for each group
 * Absolute number of contributors moving from one group to another
 * Fraction of contributors moving from one group to another
 
Some of these metrics will be computed over time for the specified contributor groups.
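
The last two metrics (absolute number and fraction of contributors moving between groups) could be computed from per-period group labels. The following is only a minimal sketch: the `labels` frame, its column names and its values are hypothetical and do not come from the data analyzed in this notebook.

```python
import pandas

# Hypothetical data: the group of each contributor in two consecutive periods.
labels = pandas.DataFrame({
    'author': ['a', 'b', 'c', 'd', 'e', 'f'],
    'previous': ['core', 'core', 'regular', 'regular', 'casual', 'casual'],
    'current': ['core', 'regular', 'regular', 'casual', 'casual', 'core'],
})

# Absolute number of contributors moving from one group (rows) to another (columns).
moves = pandas.crosstab(labels['previous'], labels['current'])

# Fraction of each previous group that ends up in each current group (rows sum to 1).
fractions = pandas.crosstab(labels['previous'], labels['current'], normalize='index')
```

Contributors who appear in only one of the two periods (newcomers and leavers) are not covered by this sketch and would need an extra label such as "absent".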

# Metric Calculations
First we need to open a connection to the proper ES instance. We use an external module to load credentials from a file that is not shared. To run this notebook, use your own credentials: put them in a file named '.settings' (in the same directory as this notebook), following the example file 'settings.sample'.

This section includes common code to manage and plot data. Queries will be available at the corresponding section.

**TODO**: Add bot and merges filtering.

In [9]:
import pandas

import plotly.offline  # binds the name `plotly` and makes sure the offline submodule is loaded
import plotly.graph_objs as go

from util import ESConnection
from elasticsearch_dsl import Search

es_conn = ESConnection()

In [10]:
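def create_search(source):
    """Creates a Search object against the given index
    """
    s = Search(using=es_conn, index=source)
    # TODO: Add bot and merges filtering.
    #s = s.filter('range', grimoire_creation_date={'gt': 'now/M-2y', 'lt': 'now/M'})
    # params() returns a modified copy, so the result must be reassigned
    s = s.params(timeout=30)
    return s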

In [11]:
def stack_by(result, group_column, time_column, value_column, group_field, time_field, value_field):
    """Creates a dataframe based on group and time values
    """
    df = pandas.DataFrame(columns=[group_column, time_column, value_column])

    for b in result.to_dict()['aggregations'][group_field]['buckets']:
        for i in b[time_field]['buckets']:
            df.loc[len(df)] = [b['key'], i['key_as_string'], i[value_field]['value']]
    
    return df
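
`stack_by` relies on the nested bucket layout returned for a terms aggregation with a date histogram sub-aggregation. The sketch below illustrates that layout with a hand-written dictionary standing in for `result.to_dict()` (all keys and values are made up), and flattens it the same way. The helper name `flatten_buckets` is hypothetical; it collects the rows in a list instead of appending to the dataframe one row at a time.

```python
import pandas

# Hand-written stand-in for result.to_dict(); keys and values are made up.
response = {
    'aggregations': {
        'authors': {
            'buckets': [
                {'key': 'author-1', 'time': {'buckets': [
                    {'key_as_string': '2016-01-01T00:00:00.000Z', 'commits': {'value': 12}},
                    {'key_as_string': '2016-04-01T00:00:00.000Z', 'commits': {'value': 7}},
                ]}},
                {'key': 'author-2', 'time': {'buckets': [
                    {'key_as_string': '2016-01-01T00:00:00.000Z', 'commits': {'value': 3}},
                ]}},
            ]
        }
    }
}

def flatten_buckets(response, group_field, time_field, value_field, columns):
    """Builds one row per (group, time interval) pair found in the response."""
    rows = [
        (group['key'], interval['key_as_string'], interval[value_field]['value'])
        for group in response['aggregations'][group_field]['buckets']
        for interval in group[time_field]['buckets']
    ]
    return pandas.DataFrame(rows, columns=columns)

flat_df = flatten_buckets(response, 'authors', 'time', 'commits',
                          ['Author', 'Time', 'Commits'])
```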

In [17]:
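def onion(df, bucket_column, time_column, value_column):
    """Counts core, regular and casual contributors in df.

    df must already be sorted by value_column in descending order.
    Core contributors are counted until 80% of the total is covered,
    regular ones until 95%, and the remaining ones are casual.
    """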
    total = df[value_column].sum()
    
    percent_80 = total * 0.8
    percent_95 = total * 0.95
    core = 0
    core_sum = 0
    regular = 0
    regular_sum = 0
    casual = 0

    for row in df.iterrows():
        value = row[1][value_column]
        
        if (percent_80 > core_sum):
            core = core + 1
            core_sum = core_sum + value
            regular_sum = regular_sum + value
        elif percent_95 > regular_sum:
            regular = regular + 1
            regular_sum = regular_sum + value
        else:
            casual = casual + 1

    return {"core":core,
            "regular":regular,
            "casual":casual} 

def onion_evolution(result, bucket_column, time_column, value_column, bucket_field, time_field, metric_field):
    
    df = stack_by(result, bucket_column, time_column, value_column, bucket_field, time_field, metric_field)
    
    onion_df = pandas.DataFrame(columns=['Time', 'Core', 'Regular', 'Casual'])
    
    for time in df[time_column].unique():
        slice_df = df.loc[df[time_column] == time]
        slice_df = slice_df.sort_values(by=value_column, ascending=False)
        onion_result = onion(slice_df, bucket_column=bucket_column, time_column=time_column,\
                             value_column=value_column)
        #print(time, '->', len(slice_df), slice_df.columns.values.tolist(), '->', onion_result)
        onion_df.loc[len(onion_df)] = [time, onion_result['core'], onion_result['regular'], onion_result['casual']]
        
    
    #print(len(df))
    return onion_df
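
The thresholds used by `onion` can be checked on a small example. The sketch below is a vectorized variant written for illustration, not the function used in this notebook: the name `onion_counts` and the sample values are hypothetical. It sorts the values itself and compares the activity accumulated before each contributor with 80% and 95% of the total.

```python
import pandas

def onion_counts(values, core_share=0.8, regular_share=0.95):
    """Counts core, regular and casual contributors from per-contributor activity."""
    ordered = pandas.Series(values).sort_values(ascending=False)
    total = ordered.sum()
    # Activity accumulated by the more active contributors that precede each one.
    before = ordered.cumsum() - ordered
    core = int((before < total * core_share).sum())
    regular = int((before < total * regular_share).sum()) - core
    casual = len(ordered) - core - regular
    return {'core': core, 'regular': regular, 'casual': casual}

# Six hypothetical contributors with 100 commits in total.
counts = onion_counts([60, 25, 8, 4, 2, 1])
```

With these values the two most active contributors are counted as core (the 80% threshold is crossed by the second one), the next two as regular, and the last two as casual.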

In [32]:
def print_grouped_bar(df, x_column, value_columns, title):
    """Plots a grouped bar chart with one bar series per value column
    """
    plotly.offline.init_notebook_mode(connected=True)

    bars = []
    x_values = df[x_column].tolist()
    for value_column in value_columns:
        bars.append(go.Bar(
            x=x_values,
            y=df[value_column].tolist(),
            name=value_column))

    layout = go.Layout(
        barmode='group',
        title= title
    )

    fig = go.Figure(data=bars, layout=layout)
    plotly.offline.iplot(fig, filename='grouped-bar')

# Metrics

## Groups of contributors, by level of activity (core, regular, casual)

In [18]:
s = Search(using=es_conn, index='git')

s = s.filter('range', grimoire_creation_date={'gt': 'now/M-2y', 'lt': 'now/M'})
# Unique count of commits per author (up to 10000 authors), by quarter
s.aggs.bucket('authors', 'terms', field='author_uuid', size=10000)\
    .bucket('time', 'date_histogram', field='grimoire_creation_date', interval='quarter')\
    .metric('commits', 'cardinality', field='hash', precision_threshold=1000)
result = s.execute()

onion_df=onion_evolution(result, bucket_column='Author', time_column='Time', value_column='Commits',\
      bucket_field='authors', time_field='time', metric_field='commits')

# Calculate quarters
#onion_df['Quarter'] = pandas.PeriodIndex(pandas.to_datetime(onion_df.Time), freq='Q')
onion_df['Quarter'] = onion_df['Time'].map(lambda x: str(pandas.Period(x,'Q')))

onion_df

| | Time | Core | Regular | Casual | Quarter |
|---|---|---|---|---|---|
| 0 | 2015-04-01T00:00:00.000Z | 164.0 | 154.0 | 270.0 | 2015Q2 |
| 1 | 2015-07-01T00:00:00.000Z | 154.0 | 160.0 | 405.0 | 2015Q3 |
| 2 | 2015-10-01T00:00:00.000Z | 157.0 | 165.0 | 436.0 | 2015Q4 |
| 3 | 2016-01-01T00:00:00.000Z | 186.0 | 187.0 | 461.0 | 2016Q1 |
| 4 | 2016-04-01T00:00:00.000Z | 198.0 | 190.0 | 395.0 | 2016Q2 |
| 5 | 2016-07-01T00:00:00.000Z | 172.0 | 158.0 | 255.0 | 2016Q3 |


In [33]:
print_grouped_bar(onion_df, 'Quarter', ['Core', 'Regular', 'Casual'], 'Global Contribution Groups')

### Compare Staff vs Community

In [26]:
s = Search(using=es_conn, index='git')

s = s.filter('range', grimoire_creation_date={'gt': 'now/M-2y', 'lt': 'now/M'})


# Placeholder: this filter currently selects authors whose organization is
# 'Independent'. Change it to Mozilla Staff once AWS ES is ready again.
s = s.filter('terms', author_org_name=['Independent'])


# Unique count of commits per author (up to 10000 authors), by quarter
s.aggs.bucket('authors', 'terms', field='author_uuid', size=10000)\
    .bucket('time', 'date_histogram', field='grimoire_creation_date', interval='quarter')\
    .metric('commits', 'cardinality', field='hash', precision_threshold=1000)
result = s.execute()

moz_onion_df = onion_evolution(result, bucket_column='Author', time_column='Time', value_column='Commits',\
      bucket_field='authors', time_field='time', metric_field='commits')

# Calculate quarters
#onion_df['Quarter'] = pandas.PeriodIndex(pandas.to_datetime(onion_df.Time), freq='Q')
moz_onion_df['Quarter'] = moz_onion_df['Time'].map(lambda x: str(pandas.Period(x,'Q')))
moz_onion_df['Organization'] = 'Mozilla Staff'

moz_onion_df

| | Time | Core | Regular | Casual | Quarter | Organization |
|---|---|---|---|---|---|---|
| 0 | 2015-04-01T00:00:00.000Z | 96.0 | 107.0 | 184.0 | 2015Q2 | Mozilla Staff |
| 1 | 2015-07-01T00:00:00.000Z | 90.0 | 107.0 | 294.0 | 2015Q3 | Mozilla Staff |
| 2 | 2015-10-01T00:00:00.000Z | 90.0 | 124.0 | 313.0 | 2015Q4 | Mozilla Staff |
| 3 | 2016-01-01T00:00:00.000Z | 106.0 | 130.0 | 341.0 | 2016Q1 | Mozilla Staff |
| 4 | 2016-04-01T00:00:00.000Z | 111.0 | 140.0 | 289.0 | 2016Q2 | Mozilla Staff |
| 5 | 2016-07-01T00:00:00.000Z | 101.0 | 109.0 | 179.0 | 2016Q3 | Mozilla Staff |


In [28]:
s = Search(using=es_conn, index='git')

s = s.filter('range', grimoire_creation_date={'gt': 'now/M-2y', 'lt': 'now/M'})


# Placeholder: this filter currently excludes authors whose organization is
# 'Independent'. Change it to exclude Mozilla Staff once AWS ES is ready again.
s = s.exclude('terms', author_org_name=['Independent'])


# Unique count of commits per author (up to 10000 authors), by quarter
s.aggs.bucket('authors', 'terms', field='author_uuid', size=10000)\
    .bucket('time', 'date_histogram', field='grimoire_creation_date', interval='quarter')\
    .metric('commits', 'cardinality', field='hash', precision_threshold=1000)
result = s.execute()

com_onion_df = onion_evolution(result, bucket_column='Author', time_column='Time', value_column='Commits',\
      bucket_field='authors', time_field='time', metric_field='commits')

# Calculate quarters
#onion_df['Quarter'] = pandas.PeriodIndex(pandas.to_datetime(onion_df.Time), freq='Q')
com_onion_df['Quarter'] = com_onion_df['Time'].map(lambda x: str(pandas.Period(x,'Q')))
com_onion_df['Organization'] = 'Other'

com_onion_df

| | Time | Core | Regular | Casual | Quarter | Organization |
|---|---|---|---|---|---|---|
| 0 | 2015-04-01T00:00:00.000Z | 73.0 | 52.0 | 76.0 | 2015Q2 | Other |
| 1 | 2015-07-01T00:00:00.000Z | 66.0 | 58.0 | 104.0 | 2015Q3 | Other |
| 2 | 2015-10-01T00:00:00.000Z | 69.0 | 52.0 | 110.0 | 2015Q4 | Other |
| 3 | 2016-01-01T00:00:00.000Z | 81.0 | 67.0 | 109.0 | 2016Q1 | Other |
| 4 | 2016-04-01T00:00:00.000Z | 90.0 | 59.0 | 94.0 | 2016Q2 | Other |
| 5 | 2016-07-01T00:00:00.000Z | 72.0 | 51.0 | 73.0 | 2016Q3 | Other |


In [34]:
print_grouped_bar(moz_onion_df, 'Quarter', ['Core', 'Regular', 'Casual'], 'Staff Contributors')
print_grouped_bar(com_onion_df, 'Quarter', ['Core', 'Regular', 'Casual'], 'Other Contributors')
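
Since both dataframes carry an 'Organization' column, they could also be combined into a single table for a side-by-side comparison. The following is only a sketch of one way to do it: it rebuilds the first two quarters from the tables above so that it runs on its own, and the 'Staff share' column is an illustrative addition rather than a metric defined in this notebook.

```python
import pandas

# First two quarters copied from the tables above, so the sketch is self-contained.
moz_onion_df = pandas.DataFrame({
    'Quarter': ['2015Q2', '2015Q3'],
    'Core': [96, 90], 'Regular': [107, 107], 'Casual': [184, 294],
    'Organization': 'Mozilla Staff',
})
com_onion_df = pandas.DataFrame({
    'Quarter': ['2015Q2', '2015Q3'],
    'Core': [73, 66], 'Regular': [52, 58], 'Casual': [76, 104],
    'Organization': 'Other',
})

# Stack both groups, then spread the core counts into one column per organization.
combined = pandas.concat([moz_onion_df, com_onion_df], ignore_index=True)
core_by_org = combined.pivot(index='Quarter', columns='Organization', values='Core')

# Illustrative extra column: share of core contributors labelled as staff.
core_by_org['Staff share'] = core_by_org['Mozilla Staff'] / core_by_org.sum(axis=1)
```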