## Goal: understanding contribution patterns

Analyzing groups of contributors according to their activity patterns, and how those groups evolve over time, helps to understand the structure of the community. These groups will be defined according to how active they are (from casual to core contributors) and which kinds of activity they perform (for example, producing code, reviewing code, submitting issues, or contributing to discussions). Whenever convenient, the characterization will be combined with the contributor groups identified in the first goal.

This goal is refined in the following questions:

**Questions**:

 * How often do contributors contribute?
 * How is contribution structured according to level of activity?
 * How is contribution structured across the different data sources?
 * How do the structures of contribution evolve over time?
 * How do people flow through the structure of contribution?

These questions can be answered with the following metrics:

**Metrics**:

(Still to be refined)

 * Groups of contributors, by level of activity (core, regular, casual)
 * Groups of contributors, by kind of activity (committing, opening issues, merging pull requests, etc.)
 * Groups of contributors, by breadth of activity (specialists, spread, etc.)
 * Activity metrics for each group
 * Absolute number of contributors moving from one group to another
 * Fraction of contributors moving from a group to another
 
Some of these metrics will be computed over time for the specified contributor groups.
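
As an illustration of the last two metrics in the list, the following sketch counts contributors moving between groups in two consecutive periods, and the fraction of each origin group that each move represents. It is only a sketch under assumptions: the contributor names and group assignments are made up, the helpers `group_flows` and `flow_fractions` are hypothetical, and contributors absent from a period are placed in a placeholder group called `inactive`.

```python
from collections import Counter


def group_flows(previous, current):
    """Count contributors moving between groups from one period to the next.

    `previous` and `current` map contributor id -> group name. Contributors
    missing from a period get the placeholder group 'inactive' (an
    assumption of this sketch, not something defined in the notebook).
    """
    flows = Counter()
    for contributor in set(previous) | set(current):
        origin = previous.get(contributor, 'inactive')
        destination = current.get(contributor, 'inactive')
        flows[(origin, destination)] += 1
    return flows


def flow_fractions(flows):
    """Express each flow as a fraction of the size of its origin group."""
    origin_totals = Counter()
    for (origin, _), count in flows.items():
        origin_totals[origin] += count
    return {pair: count / origin_totals[pair[0]]
            for pair, count in flows.items()}


# Made-up group membership in two consecutive quarters
q1 = {'ana': 'core', 'bob': 'regular', 'eve': 'casual', 'dan': 'casual'}
q2 = {'ana': 'core', 'bob': 'core', 'eve': 'casual', 'zoe': 'casual'}

flows = group_flows(q1, q2)
fractions = flow_fractions(flows)

for pair in sorted(flows):
    print(pair, flows[pair], fractions[pair])
```

Here one of the two casual contributors stays casual and the other becomes inactive, so each of those flows accounts for half of the casual group.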

# Metric Calculations
First we need to open a connection to the proper Elasticsearch (ES) instance. We use an external module to load credentials from a file that will not be shared. To run this notebook, use your own credentials: put them in a file named '.settings' (in the same directory as this notebook), following the example file 'settings.sample'.

This section includes common code to manage and plot data. The queries appear in their corresponding sections.


In [1]:
import pandas

import os

import plotly
import plotly.graph_objs as go

import util as ut

from util import ESConnection
from elasticsearch_dsl import Search

es_conn = ESConnection()

# Let's load projects from the REVIEWED SPREADSHEET
projects = ut.read_projects("data/Contributors and Communities Analysis - Project grouping.xlsx")

project_name = os.environ.get('PROJECT', 'all')

date_range = {'gte': '2010-01-01', 'lt': 'now/y'}


In [2]:
def create_search(source):
    s = Search(using=es_conn, index=source)
    
    if source == 'git' or source == 'github':
        github = projects['Github']
        repos = github['Repo'].tolist()
        #print (repos)
        s = s.filter('terms', repo_name=repos)
    
    # TODO: Add bot and merges filtering.
    #s = s.filter('range', grimoire_creation_date={'gt': 'now/M-2y', 'lt': 'now/M'})
    return s

In [3]:
def print_df(result, group_field, value_field, group_column, value_column):
    """Builds a two-column dataframe (group, value) from a terms aggregation
    that holds a single-value metric in each bucket.
    """
    buckets = result.to_dict()['aggregations'][group_field]['buckets']
    df = pandas.DataFrame(buckets)
    # Each metric arrives as {'value': n}; keep only the number
    df[value_field] = df[value_field].apply(lambda metric: metric['value'])
    df = df[['key', value_field]]
    df.columns = [group_column, value_column]

    return df

In [4]:
def stack_by(result, group_column, time_column, value_column, group_field, time_field, value_field):
    """Creates a dataframe with one row per (group, time) pair, from an
    aggregation that buckets by group first and then by time.
    """
    df = pandas.DataFrame(columns=[group_column, time_column, value_column])

    for b in result.to_dict()['aggregations'][group_field]['buckets']:
        for i in b[time_field]['buckets']:
            df.loc[len(df)] = [b['key'], i['key_as_string'], i[value_field]['value']]
    
    return df

def stack_by_2(result, group_column, time_column, value_column, group_field, time_field, value_field):
    """Creates a dataframe with one row per (time, group) pair, from an
    aggregation that buckets by time first and then by group.
    """
    df = pandas.DataFrame(columns=[time_column, group_column, value_column])

    for b in result.to_dict()['aggregations'][time_field]['buckets']:
        for i in b[group_field]['buckets']:
            df.loc[len(df)] = [b['key_as_string'], i['key'], i[value_field]['value']]
    
    return df
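
To make the input expected by `stack_by_2` more concrete, the sketch below flattens a hand-written dictionary shaped like the response to a date histogram that nests a terms bucket and a metric. The sample values are made up, and `flatten_buckets` is a hypothetical stand-in that returns plain tuples instead of a dataframe, so that it runs without an ES connection.

```python
# Made-up sample shaped like result.to_dict() for a 'time' date histogram
# that nests a 'uuid' terms bucket holding a 'commits' metric
sample = {'aggregations': {'time': {'buckets': [
    {'key_as_string': '2010-01-01T00:00:00.000Z', 'key': 1262304000000,
     'doc_count': 5,
     'uuid': {'buckets': [
         {'key': 'author-a', 'doc_count': 3, 'commits': {'value': 3}},
         {'key': 'author-b', 'doc_count': 2, 'commits': {'value': 2}}]}},
    {'key_as_string': '2010-04-01T00:00:00.000Z', 'key': 1270080000000,
     'doc_count': 4,
     'uuid': {'buckets': [
         {'key': 'author-a', 'doc_count': 4, 'commits': {'value': 4}}]}},
]}}}


def flatten_buckets(response, time_field, group_field, value_field):
    """Return one (time, group, value) tuple per innermost bucket."""
    rows = []
    for time_bucket in response['aggregations'][time_field]['buckets']:
        for group_bucket in time_bucket[group_field]['buckets']:
            rows.append((time_bucket['key_as_string'],
                         group_bucket['key'],
                         group_bucket[value_field]['value']))
    return rows


rows = flatten_buckets(sample, 'time', 'uuid', 'commits')
for row in rows:
    print(row)
```

Each outer bucket is a quarter and each inner bucket an author, so the result has one row per author and quarter, which is the layout the onion analysis works on.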

In [5]:
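def onion(df, bucket_column, time_column, value_column):
    """Counts core, regular and casual contributors in df, which must be
    sorted by value_column in descending order and contain an 'org' column.
    bucket_column and time_column are accepted but not used.
    """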
    total = df[value_column].sum()
    
    percent_80 = total * 0.8
    percent_95 = total * 0.95
    core = 0
    core_sum = 0
    regular = 0
    regular_sum = 0
    casual = 0
    core_non = 0
    regular_non = 0
    casual_non = 0
    core_emp = 0
    regular_emp = 0
    casual_emp = 0

    for _, row in df.iterrows():
        value = row[value_column]
        non = row['org'] == 'Non-Employees'
        
        if (percent_80 > core_sum):
            core = core + 1
            core_sum = core_sum + value
            regular_sum = regular_sum + value
            if non:
                core_non = core_non + 1
            else:
                core_emp = core_emp + 1
                
        elif percent_95 > regular_sum:
            regular = regular + 1
            regular_sum = regular_sum + value
            if non:
                regular_non = regular_non + 1
            else:
                regular_emp = regular_emp + 1
        else:
            casual = casual + 1
            if non:
                casual_non = casual_non + 1
            else:
                casual_emp = casual_emp + 1

    return {"core":core,
            "regular":regular,
            "casual":casual,
            "core-non": core_non,
            "regular-non": regular_non,
            "casual-non": casual_non,
            "core-emp": core_emp,
            "regular-emp": regular_emp,
            "casual-emp": casual_emp} 

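def onion_evolution(df, bucket_field, time_field, metric_field):
    """Applies onion() to each time slice of df and returns one row per slice
    with the size of every group, split by employees and non-employees.
    """
    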
    onion_df = pandas.DataFrame(
        columns=['Time', 
                 'Core', 'Core-Non', 'Core-Emp',
                 'Regular', 'Regular-Non', 'Regular-Emp',
                 'Casual', 'Casual-Non', 'Casual-Emp'])
    
    for time in df[time_field].unique():
        slice_df = df.loc[df[time_field] == time]
        slice_df = slice_df.sort_values(by=metric_field, ascending=False)
        onion_result = onion(slice_df, 
                             bucket_column=bucket_field, 
                             time_column=time_field,
                             value_column=metric_field)
        #print(time, '->', len(slice_df))#, slice_df.columns.values.tolist(), '->', onion_result)
        onion_df.loc[len(onion_df)] = [time, 
                                       onion_result['core'],
                                       onion_result['core-non'], 
                                       onion_result['core-emp'],
                                       onion_result['regular'],
                                       onion_result['regular-non'],
                                       onion_result['regular-emp'],
                                       onion_result['casual'], 
                                       onion_result['casual-non'],
                                       onion_result['casual-emp']]
        
    
    #print(len(df))
    return onion_df

In [6]:
def print_grouped_bar(df, x_column, value_columns, title):
    """Plots one bar per value column, side by side, for each x value.
    """
    plotly.offline.init_notebook_mode(connected=True)

    bars = []
    x_values = df[x_column].tolist()
    for value_column in value_columns:
        bars.append(go.Bar(
            x=x_values,
            y=df[value_column].tolist(),
            name=value_column))

    layout = go.Layout(
        barmode='group',
        title= title
    )

    fig = go.Figure(data=bars, layout=layout)
    plotly.offline.iplot(fig, filename='grouped-bar')
    
def print_stacked_bar(df, x_column, value_columns, title):
    """Plots the value columns stacked in a single bar for each x value.
    """
    plotly.offline.init_notebook_mode(connected=True)

    bars = []
    x_values = df[x_column].tolist()
    for value_column in value_columns:
        bars.append(go.Bar(
            x=x_values,
            y=df[value_column].tolist(),
            name=value_column))

    layout = go.Layout(
        barmode='stack',
        title= title
    )

    fig = go.Figure(data=bars, layout=layout)
    plotly.offline.iplot(fig, filename='stacked-bar')

In [7]:
def add_bot_filter(s):
    return s.filter('term', author_bot='false')

def add_merges_filter(s):
    return s.filter('range', files={'gt': 0})

def add_date_filter(s):
    return s.filter('range', grimoire_creation_date=date_range)

def add_general_date_filters(s):
    # 01/01/1998 00:00 CET (1997-12-31T23:00:00Z), in epoch milliseconds
    initial_ts = '883609200000'
    return s.filter('range', grimoire_creation_date={'gt': initial_ts})


def add_project_filter(s):
    if project_name != 'all':
        github = projects['Github']
        repos = github[github['Project'] == project_name]['Repo'].tolist()
        #print(repos)
        s = s.filter('terms', repo_name=repos)
    return s



# Metrics

## Groups of contributors, by level of activity: core, regular, casual

The following table and chart show the number of contributors in three groups:
* Core: the smallest set of most active authors who together made 80% of contributions.
* Regular: the next most active authors, who together take the cumulative share of contributions from 80% to 95%.
* Casual: the rest of the contributors, who made the last 5% of contributions.

Looking at how these groups evolve over time shows the structure of the community at any given point and how it changes.
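
A small worked example may help to see how the thresholds split a community. The sketch below applies the 80% / 95% rule to a made-up list of commit counts; `onion_counts` is a hypothetical helper that follows the same accumulation logic as the notebook's `onion` function, without the employee / non-employee breakdown.

```python
def onion_counts(contributions):
    """Count core, regular and casual contributors.

    `contributions` holds one number per contributor. A contributor is
    classified according to the share accumulated by the more active
    contributors that come before them.
    """
    ordered = sorted(contributions, reverse=True)
    total = sum(ordered)
    counts = {'core': 0, 'regular': 0, 'casual': 0}
    accumulated = 0
    for value in ordered:
        if accumulated < total * 0.8:
            counts['core'] += 1
        elif accumulated < total * 0.95:
            counts['regular'] += 1
        else:
            counts['casual'] += 1
        accumulated += value
    return counts


# Made-up commit counts for seven contributors (100 commits in total)
commits = [50, 25, 10, 8, 4, 2, 1]
groups = onion_counts(commits)
print(groups)
```

With these numbers the first three contributors accumulate 85 commits and form the core, the next two take the total to 97 and are regular, and the last two are casual. Note that the contributor who crosses a threshold is counted in the group being filled, so the core here covers slightly more than 80% of contributions.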

In [8]:
s = create_search(source='git')

s = add_general_date_filters(s)

s = add_bot_filter(s)
s = add_merges_filter(s)

# Filter commits to the Project Repos
s = add_project_filter(s)

# Adds date range to retrieve data from
s = add_date_filter(s)


# Unique count of Commits by Authors over time
s.aggs.bucket('time', 'date_histogram', field='grimoire_creation_date', interval='quarter')\
    .bucket('uuid', 'terms', field='author_uuid', size=100000)\
    .metric('commits', 'cardinality', field='hash', precision_threshold=3000)
result = s.execute()

authors_df = stack_by_2(result, 'uuid', 'time', 'commits', 'uuid', 'time', 'commits')


s = create_search(source='git')

s = add_general_date_filters(s)

s = add_bot_filter(s)
s = add_merges_filter(s)

# Filter commits to the Project Repos
s = add_project_filter(s)

# Adds date range to retrieve data from
s = add_date_filter(s)


# Organizations each author has been affiliated with
s.aggs.bucket('uuid', 'terms', field='author_uuid', size=100000)\
    .bucket('org', 'terms', field='author_org_name', size=1000)
result = s.execute()


orgs_df = pandas.DataFrame(columns=['uuid', 'org'])

for uuid in result.to_dict()['aggregations']['uuid']['buckets']:
    for org in uuid['org']['buckets']:
        orgs_df.loc[len(orgs_df)] = [uuid['key'], org['key']]
        
# Divide authors in Employees and Non-Employees based on org name
orgs_df.loc[orgs_df['org'] == 'Community', 'org'] = 'Non-Employees'
orgs_df.loc[orgs_df['org'] == 'Mozilla Staff', 'org'] = 'Employees'
orgs_df.loc[orgs_df['org'] == 'Code Sheriff', 'org'] = 'Employees'
# Add org to commits by author over time dataframe
authors_org_df = authors_df.merge(orgs_df, on='uuid', how='left')


In [9]:
onion_df=onion_evolution(authors_org_df, bucket_field='uuid', time_field='time', metric_field='commits')

# Calculate quarters
#onion_df['Quarter'] = pandas.PeriodIndex(pandas.to_datetime(onion_df.Time), freq='Q')
onion_df['Quarter'] = onion_df['Time'].map(lambda x: str(pandas.Period(x,'Q')))

onion_df

| | Time | Core | Core-Non | Core-Emp | Regular | Regular-Non | Regular-Emp | Casual | Casual-Non | Casual-Emp | Quarter |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2010-01-01T00:00:00.000Z | 96 | 13 | 83 | 119 | 33 | 86 | 282 | 181 | 101 | 2010Q1 |
| 1 | 2010-04-01T00:00:00.000Z | 100 | 14 | 86 | 123 | 37 | 86 | 337 | 212 | 125 | 2010Q2 |
| 2 | 2010-07-01T00:00:00.000Z | 124 | 14 | 110 | 133 | 38 | 95 | 383 | 253 | 130 | 2010Q3 |
| 3 | 2010-10-01T00:00:00.000Z | 127 | 15 | 112 | 122 | 32 | 90 | 330 | 219 | 111 | 2010Q4 |
| 4 | 2011-01-01T00:00:00.000Z | 144 | 28 | 116 | 145 | 40 | 105 | 381 | 243 | 138 | 2011Q1 |
| 5 | 2011-04-01T00:00:00.000Z | 166 | 30 | 136 | 167 | 45 | 122 | 405 | 271 | 134 | 2011Q2 |
| 6 | 2011-07-01T00:00:00.000Z | 178 | 30 | 148 | 178 | 44 | 134 | 438 | 277 | 161 | 2011Q3 |
| 7 | 2011-10-01T00:00:00.000Z | 188 | 23 | 165 | 170 | 50 | 120 | 458 | 302 | 156 | 2011Q4 |
| 8 | 2012-01-01T00:00:00.000Z | 196 | 25 | 171 | 196 | 58 | 138 | 535 | 353 | 182 | 2012Q1 |
| 9 | 2012-04-01T00:00:00.000Z | 228 | 29 | 199 | 214 | 49 | 165 | 552 | 362 | 190 | 2012Q2 |


In [11]:
print_grouped_bar(onion_df, 
                  'Quarter',
                  ['Core', 'Core-Non', 'Regular', 'Regular-Non', 'Casual', 'Casual-Non'],
                  'Contribution Groups: all developers / non-employees')
print_stacked_bar(onion_df, 
                  'Quarter',
                  ['Core-Emp', 'Core-Non', 'Regular-Emp', 'Regular-Non','Casual-Emp', 'Casual-Non'],
                  'Contribution Groups: employees / non-employees')

## Authors contributing to N projects

In [25]:
#results = []
#for i in analyzed_range:

# Buckets by author, counting the distinct projects each has contributed to
s = create_search(source='git')

# General filters
s = add_general_date_filters(s)
s = add_bot_filter(s)
s = add_merges_filter(s)

# Filter commits to the Project Repos
s = add_project_filter(s)


# Retrieve commits made before the start of the current year
s = s.filter('range', grimoire_creation_date={'lt': 'now/y'})

# Bucketize by uuid and count the distinct projects of each author
s.aggs.bucket('authors', 'terms', field='author_uuid', size=100000) \
    .metric('projects', 'cardinality', field='project', precision_threshold=10000)
#print(s.to_dict())
result = s.execute()

project_count_df = print_df(result, group_field='authors', group_column='authors', value_field='projects', value_column='projects')

project_count_df = project_count_df.groupby(['projects']).agg({'authors': 'count'})
#projects_df.rename(columns={"author": "# authors"}, inplace=True)
#projects_df = projects_df.reset_index().sort_values(by=['first_commit', '# authors'], ascending=[False, False])
project_count_df

plotly.offline.init_notebook_mode(connected=True)

bars = []

bars.append(go.Bar(
    x=project_count_df.reset_index()['projects'].tolist(),
    y=project_count_df['authors'].tolist()))

layout = go.Layout(
    barmode='group'
)

fig = go.Figure(data=bars, layout=layout)
plotly.offline.iplot(fig, filename='authors-contributing-to-projects')


In [89]:
#results = []
#for i in analyzed_range:

# Buckets by author, listing the projects each has contributed to
s = create_search(source='git')

# General filters
s = add_general_date_filters(s)
s = add_bot_filter(s)
s = add_merges_filter(s)

# Filter commits to the Project Repos
s = add_project_filter(s)


# Retrieve commits from 2016
s = s.filter('range', grimoire_creation_date={'gte': '2016-01-01', 'lt': '2017-01-01'})

# Bucketize by uuid and, within each author, by project
s.aggs.bucket('authors', 'terms', field='author_uuid', size=100000) \
    .bucket('projects', 'terms', field='project', size=1000)
#print(s.to_dict())
result = s.execute()

df = pandas.DataFrame(columns=['author', 'project'])

for uuid in result.to_dict()['aggregations']['authors']['buckets']:
    for project in uuid['projects']['buckets']:
        df.loc[len(df)] = [uuid['key'], project['key']]

# Keep only the projects of interest
df = df.loc[df['project'].isin(['Gecko', 'Rust', 'Servo',
                                'Firefox OS (FxOS / B2G)', 'WebVR'])]


In [83]:
def get_project_intersection(df):
    """Returns a square dataframe with, for each pair of projects, the number
    of authors who contributed to both (the diagonal holds each project's
    own author count).
    """
    project_list = df['project'].unique()
    # Precompute the set of authors of each project
    authors = {project: set(df.loc[df['project'] == project, 'author'])
               for project in project_list}

    rows = []
    for project_row in project_list:
        row = {'project': project_row}
        for project_col in project_list:
            # Number of authors shared by both projects
            row[project_col] = len(authors[project_row] & authors[project_col])
        rows.append(row)

    # DataFrame.append was removed in pandas 2.0, so build the frame at once
    return pandas.DataFrame(rows, columns=['project'] + sorted(project_list))
    

In [90]:
get_project_intersection(df)

| | project | Firefox OS (FxOS / B2G) | Gecko | Rust | Servo | WebVR |
|---|---|---|---|---|---|---|
| 0 | Gecko | 4.0 | 1363.0 | 80.0 | 330.0 | 7.0 |
| 1 | Rust | 0.0 | 80.0 | 857.0 | 87.0 | 0.0 |
| 2 | Servo | 1.0 | 330.0 | 87.0 | 461.0 | 2.0 |
| 3 | WebVR | 1.0 | 7.0 | 0.0 | 2.0 | 141.0 |
| 4 | Firefox OS (FxOS / B2G) | 13.0 | 4.0 | 0.0 | 1.0 | 1.0 |
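
To help read this matrix, here is a minimal sketch of the same pairwise intersection on made-up data, using plain Python sets instead of a dataframe. The author names and their project assignments are invented for illustration; only the project names come from the analysis.

```python
from itertools import product

# Made-up (author, project) pairs
pairs = [('ana', 'Gecko'), ('bob', 'Gecko'), ('bob', 'Servo'),
         ('eve', 'Servo'), ('eve', 'Rust'), ('bob', 'Rust')]

# Collect the set of authors of each project
authors_by_project = {}
for author, project in pairs:
    authors_by_project.setdefault(project, set()).add(author)

# Number of authors shared by every ordered pair of projects
shared = {(a, b): len(authors_by_project[a] & authors_by_project[b])
          for a, b in product(authors_by_project, repeat=2)}

for pair in sorted(shared):
    print(pair, shared[pair])
```

The diagonal entries, such as `('Gecko', 'Gecko')`, give the number of authors of a single project, and the matrix is symmetric. The same holds in the table: the 330 authors shared by Gecko and Servo appear in both the Gecko row and the Servo row.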
