## Goal: understanding activity

The term activity in this document refers to actions performed by contributors. Based on the characterization of contributors, we will look at the activities of the different identified groups over time. Taking into account the different data sources, we consider the following actions:

 * Git: sending a commit
 * GitHub issues: opening an issue, closing an issue
 * GitHub pull requests: submitting a pull request, accepting (merging) a pull request
 * Bugzilla: opening a ticket, closing a ticket
 * Mailing Lists: sending a message
 * Discourse: initiating a thread, commenting in a thread
  
This goal is refined into the following questions:

**Questions**:

For each of the activities and contributor groups identified above,

* How does activity evolve over time?

The metrics identified are:

**Metrics**: 

* Git:
  * Number of commits authored
* GitHub:
  * Number of issues opened
  * Number of issues closed
  * Number of pull requests opened
  * Number of pull requests merged
* Bugzilla:
  * Number of tickets opened
  * Number of tickets closed
* Mailing lists:
  * Number of e-mails sent
* Discourse:
  * Number of threads initiated
  * Number of comments posted
  
These metrics will be computed for the specified contributor groups, over time.
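As a rough illustration of what "a metric per contributor group, over time" means, the sketch below counts events per group and quarter in plain Python. The `(group, day)` pair layout, the function names and the sample dates are assumptions made for the example; the notebook itself delegates this bucketing to Elasticsearch aggregations.

```python
from collections import Counter
from datetime import date


def quarter_label(day):
    """Return a label such as '2016-Q3' for the quarter containing `day`."""
    quarter = (day.month - 1) // 3 + 1
    return f"{day.year}-Q{quarter}"


def count_by_group_and_quarter(events):
    """Count events per (group, quarter).

    `events` is an iterable of (group, day) pairs; this layout is an
    illustrative assumption, not the structure of the real indexes.
    """
    counts = Counter()
    for group, day in events:
        counts[(group, quarter_label(day))] += 1
    return counts


# Toy events, not taken from the Mozilla data.
sample_events = [
    ("Mozilla Staff", date(2016, 1, 15)),
    ("Mozilla Staff", date(2016, 3, 31)),
    ("Other", date(2016, 2, 2)),
    ("Other", date(2016, 4, 1)),
]

counts = count_by_group_and_quarter(sample_events)
for (group, quarter), total in sorted(counts.items()):
    print(group, quarter, total)
```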

### Metric Calculations
First we open a connection to the Elasticsearch (ES) instance. Credentials are loaded by an external module from a file that is not shared. To run this notebook, put your own credentials in a file named '.settings' (in the same directory as this notebook), following the example file 'settings.sample'.
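The layout of the '.settings' file is not shown in this notebook. As a minimal sketch only, assuming an INI-style file, a loader could look like the one below. The function name `read_es_settings`, the section name `ElasticSearch` and the keys are all hypothetical and may not match what `util.ESConnection` expects.

```python
import configparser
import os
import tempfile


def read_es_settings(path):
    """Read connection settings from an INI-style file.

    The section name 'ElasticSearch' and the keys below are hypothetical;
    adapt them to whatever 'settings.sample' actually defines.
    """
    # interpolation=None keeps '%' characters in passwords from being
    # treated as interpolation markers.
    parser = configparser.ConfigParser(interpolation=None)
    if not parser.read(path):
        raise FileNotFoundError(f"settings file not found: {path}")
    section = parser["ElasticSearch"]
    return {
        "host": section["host"],
        "port": section.getint("port", fallback=9200),
        "user": section.get("user", fallback=None),
        "password": section.get("password", fallback=None),
    }


# Usage with a throwaway file, so that no real credentials are involved.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, ".settings")
    with open(sample_path, "w") as handle:
        handle.write("[ElasticSearch]\nhost = localhost\nport = 9200\n")
    settings = read_es_settings(sample_path)

print(settings["host"], settings["port"])
```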

**TODO**: add bot and merges filtering

In [1]:
import pandas

import plotly.offline  # binds `plotly` and makes sure the offline submodule is loaded
import plotly.graph_objs as go

import util as ut

from util import ESConnection
from elasticsearch_dsl import Search

es_conn = ESConnection()

In [2]:
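def create_search(source):
    """Return a Search object bound to the given index."""
    s = Search(using=es_conn, index=source)
    # TODO: Add bot and merges filtering.
    #s = s.filter('range', grimoire_creation_date={'gt': 'now/M-2y', 'lt': 'now/M'})
    # params() returns a modified copy, so the result must be reassigned.
    # request_timeout is the client-side timeout in seconds (assumed to be
    # the intent of the original timeout=30).
    s = s.params(request_timeout=30)
    return s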

In [3]:
def print_result(result):
    """In case you need to check query response, call this function
    """
    print(result.to_dict()['aggregations'])

In [4]:
def print_df(result, group_field, value_field, group_column, value_column):
    """Creates a two-column dataframe (group, value) from a single-level aggregation
    """
    buckets = result.to_dict()['aggregations'][group_field]['buckets']
    df = pandas.DataFrame(buckets)
    # Each metric is returned as {'value': n}; keep only the number
    df[value_field] = df[value_field].apply(lambda row: row['value'])
    df = df[['key', value_field]]
    df.columns = [group_column, value_column]

    return df

In [5]:
def stack_by(result, group_column, time_column, value_column, group_field, time_field, value_field):
    """Creates a dataframe based on group and time values
    """
    df = pandas.DataFrame(columns=[group_column, time_column, value_column])

    for b in result.to_dict()['aggregations'][group_field]['buckets']:
        for i in b[time_field]['buckets']:
            df.loc[len(df)] = [b['key'], i['key_as_string'], i[value_field]['value']]
    
    return df
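To make the input expected by `stack_by` easier to picture, here is a small sketch that flattens a hand-written aggregation response into rows. The response is a toy stand-in shaped like a `terms` aggregation containing a `date_histogram` and a metric. The helper `flatten_buckets` and the project names are made up for the example, and it returns plain tuples instead of a dataframe.

```python
def flatten_buckets(aggregations, group_field, time_field, value_field):
    """Turn nested group -> time buckets into (group, time, value) rows."""
    rows = []
    for group_bucket in aggregations[group_field]["buckets"]:
        for time_bucket in group_bucket[time_field]["buckets"]:
            rows.append((
                group_bucket["key"],
                time_bucket["key_as_string"],
                time_bucket[value_field]["value"],
            ))
    return rows


# Toy response: two projects, one or two quarters each (not real data).
sample_aggregations = {
    "projects": {
        "buckets": [
            {
                "key": "project-a",
                "doc_count": 7,
                "time": {"buckets": [
                    {"key_as_string": "2016-01-01T00:00:00.000Z",
                     "doc_count": 3, "contributions": {"value": 3}},
                    {"key_as_string": "2016-04-01T00:00:00.000Z",
                     "doc_count": 4, "contributions": {"value": 4}},
                ]},
            },
            {
                "key": "project-b",
                "doc_count": 1,
                "time": {"buckets": [
                    {"key_as_string": "2016-01-01T00:00:00.000Z",
                     "doc_count": 1, "contributions": {"value": 1}},
                ]},
            },
        ]
    }
}

rows = flatten_buckets(sample_aggregations, "projects", "time", "contributions")
for row in rows:
    print(row)
```

Each row corresponds to one line of the dataframe built by `stack_by`, which is the long format that `print_stacked_bar` consumes.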

In [6]:
def stack_by_terms(result, group_column, subgroup_column, value_column, group_field, subgroup_field, value_field):
    """Creates a dataframe based on group and subgroup values
    """
    df = pandas.DataFrame(columns=[group_column, subgroup_column, value_column])

    for b in result.to_dict()['aggregations'][group_field]['buckets']:
        for i in b[subgroup_field]['buckets']:
            df.loc[len(df)] = [b['key'], i['key'], i[value_field]]
    
    return df

In [7]:
def stack_by_terms_cusum(result, group_column, subgroup_column, value_column,\
                         group_field, subgroup_field, value_field,\
                         staff_org_names, staff_org):
    """Creates a dataframe based on group and subgroup values
    aggregating Non-mozilla staff numbers together
    """
    df = pandas.DataFrame(columns=[group_column, subgroup_column, value_column])

    for b in result.to_dict()['aggregations'][group_field]['buckets']:
        
        key = b['key']
        if key in staff_org_names:
            key = staff_org
        else:    
            key  = 'Other'    

        print(b['key'], '->' ,key)
        
        for i in b[subgroup_field]['buckets']:
            
            subgroup = i['key']
            count = i[value_field]
            
            if key in df[group_column].unique() \
                and subgroup in df[df[group_column] == key][subgroup_column].tolist():

                df.loc[(df[group_column] == key) & (df[subgroup_column] == subgroup),\
                        value_column] += count
                #print('1', key,  subgroup, count)

            else:
                df.loc[len(df)] = [key, subgroup, count]
                #print('2', key,  subgroup, count)
    
    return df

In [8]:
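def stack_by_cusum(result, group_column, time_column, value_column, group_field, time_field, value_field,\
                   staff_org_names, staff_org):
    """Creates a dataframe based on group and time values
    aggregating non-staff organizations together as 'Other'
    """
    authors_org_df = pandas.DataFrame(columns=[group_column, time_column, value_column])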

    for b in result.to_dict()['aggregations'][group_field]['buckets']:
        key = b['key']
        if key in staff_org_names:
            key = staff_org
        else:    
            key  = 'Other'    

        print(b['key'], '->' ,key)

        for i in b[time_field]['buckets']:

            time = i['key_as_string']
            contributors = i[value_field]['value']

            if key in authors_org_df[group_column].unique() \
                and time in authors_org_df[authors_org_df[group_column] == key][time_column].tolist():

                authors_org_df.loc[(authors_org_df[group_column] == key) \
                                     & (authors_org_df[time_column] == time),\
                                   value_column] += contributors
                #print('1', key,  time, contributors)

            else:
                authors_org_df.loc[len(authors_org_df)] = [key, time, contributors]
                #print('2', key,  time, contributors)
    
    return authors_org_df

In [9]:
def print_stacked_bar(df, time_column, value_column, group_column):
    """Print stacked bar chart from dataframe based on time_field,
    grouped by group field.
    """
    plotly.offline.init_notebook_mode(connected=True)

    bars = []
    for group in df[group_column].unique():
        group_slice_df = df.loc[df[group_column] == group]
        bars.append(go.Bar(
            x=group_slice_df[time_column].tolist(),
            y=group_slice_df[value_column].tolist(),
            name=group))

    layout = go.Layout(
        barmode='stack'
    )

    fig = go.Figure(data=bars, layout=layout)
    plotly.offline.iplot(fig, filename='stacked-bar')

In [27]:
def add_general_date_filters(s):
    # 1998-01-01 00:00 at UTC+1, as epoch milliseconds
    initial_ts = '883609200000'
    return s.filter('range', grimoire_creation_date={'gt': initial_ts})

def add_bot_filter(s):
    return s.filter('term', author_bot='false')

def add_merges_filter(s):
    return s.filter('range', files={'gt': 0})

# Let's load projects from the REVIEWED SPREADSHEET
projects = ut.read_projects("data/Contributors and Communities Analysis - Project grouping.xlsx")

initial_date = '2010-01-01'
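The `initial_ts` value used in `add_general_date_filters` is an epoch timestamp in milliseconds. The sketch below shows one way to derive such a value with the standard library; the helper name `to_epoch_millis` is ours. It suggests that the constant corresponds to midnight on 1 January 1998 at UTC+1 rather than at UTC. That the original value was computed in that timezone is an assumption.

```python
from datetime import datetime, timedelta, timezone


def to_epoch_millis(moment):
    """Convert a timezone-aware datetime to epoch milliseconds."""
    return int(moment.timestamp() * 1000)


utc_midnight = datetime(1998, 1, 1, tzinfo=timezone.utc)
utc_plus_one_midnight = datetime(1998, 1, 1, tzinfo=timezone(timedelta(hours=1)))

print(to_epoch_millis(utc_midnight))           # 883612800000
print(to_epoch_millis(utc_plus_one_midnight))  # 883609200000, the value of initial_ts
```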

## Git: Number of commits authored
In Git, contributions take the form of commits. Looking at them we can measure not only the global activity of projects and organizations, but also how that activity evolves over time.

Global numbers were already provided in [List of projects](http://localhost:8888/notebooks/mozilla-contribution-analysis/data-analysis/Understanding%20Contributors.ipynb#List-of-Projects) and [List of Organizations](http://localhost:8888/notebooks/mozilla-contribution-analysis/data-analysis/Understanding%20Contributors.ipynb#List-of-organizations) sections of the previous goal.

The next plot shows the evolution of commits by project, grouped by quarter, from `initial_date` (2010-01-01) up to the start of the current year:

In [28]:
s = create_search(source='git')

s = add_bot_filter(s)
s = add_merges_filter(s)

# Unique count of commits by project (top 10 projects)
s = s.filter('range', grimoire_creation_date={'gte': initial_date, 'lt': 'now/y'})
s.aggs.bucket('projects', 'terms', field='project', size=10)\
    .bucket('time', 'date_histogram', field='grimoire_creation_date', interval='quarter')\
    .metric('contributions', 'cardinality', field='hash', precision_threshold=100000)

result = s.execute()
            
df = stack_by(result=result, group_column='Project', time_column='Time', value_column='# Contributions',\
        group_field='projects', time_field='time', value_field='contributions')

print_stacked_bar(df=df, time_column='Time', value_column='# Contributions', group_column='Project')

**Figure above: commits by project using Dashboard based project grouping**
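For readers less familiar with `elasticsearch_dsl`, the chained calls in the cell above build a nested aggregation: a `terms` bucket per project, a `date_histogram` per quarter inside it, and a `cardinality` metric on the commit hash. The sketch below writes out approximately the request body this corresponds to. The function name is ours, and the exact JSON produced by the library may differ in detail.

```python
import json


def commits_by_project_body(initial_date, top_projects=10):
    """Approximate request body for 'unique commits by project and quarter'."""
    return {
        "query": {"bool": {"filter": [
            {"term": {"author_bot": "false"}},      # add_bot_filter
            {"range": {"files": {"gt": 0}}},        # add_merges_filter
            {"range": {"grimoire_creation_date":
                       {"gte": initial_date, "lt": "now/y"}}},
        ]}},
        "aggs": {
            "projects": {
                "terms": {"field": "project", "size": top_projects},
                "aggs": {
                    "time": {
                        "date_histogram": {
                            "field": "grimoire_creation_date",
                            "interval": "quarter",
                        },
                        "aggs": {
                            "contributions": {
                                "cardinality": {
                                    "field": "hash",
                                    "precision_threshold": 100000,
                                }
                            }
                        },
                    }
                },
            }
        },
    }


body = commits_by_project_body("2010-01-01")
print(json.dumps(body["aggs"], indent=2))
```

Note that `cardinality` is an approximate count. Elasticsearch documents 40000 as the maximum supported `precision_threshold`, so the 100000 used here should behave like 40000.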

In [29]:
s = create_search(source='git')

# General filters
s = add_general_date_filters(s)
s = add_bot_filter(s)
s = add_merges_filter(s)

# Unique count of commits by repository (grouped into projects below)
s = s.filter('range', grimoire_creation_date={'gte': initial_date, 'lt': 'now/y'})
s.aggs.bucket('repos', 'terms', field='repo_name', size=100000)\
    .bucket('time', 'date_histogram', field='grimoire_creation_date', interval='quarter')\
    .metric('contributions', 'cardinality', field='hash', precision_threshold=100000)

result = s.execute()
            
repos = stack_by(result=result, group_column='Repo', time_column='Time', value_column='Commits',\
        group_field='repos', time_field='time', value_field='contributions')

# Group By project
merged_df = repos.merge(projects['Github'], on='Repo', how='left')

projects_df = merged_df.groupby(['Project', 'Time']).agg({'Commits': 'sum', 'Repo': 'count'})
projects_df = projects_df.sort_values(by='Commits', ascending=False)

# Plot it
print_stacked_bar(df=projects_df.reset_index(), time_column='Time', value_column='Commits',
                  group_column='Project')

**Figure above: commits by project using Spreadsheet based project grouping**

The plot below shows Git commits by organization, grouped by quarter, from `initial_date` (2010-01-01) up to the start of the current month:

In [30]:
s = create_search(source='git')

s = add_bot_filter(s)
s = add_merges_filter(s)

# Unique count of commits by organization (top 10 organizations)
s = s.filter('range', grimoire_creation_date={'gte': initial_date, 'lt': 'now/M'})

s.aggs.bucket('organizations', 'terms', field='author_org_name', size=10)\
    .bucket('time', 'date_histogram', field='grimoire_creation_date', interval='quarter')\
    .metric('contributions', 'cardinality', field='hash', precision_threshold=100000)

result = s.execute()
            
df = stack_by(result=result, group_column='Organization', time_column='Time', value_column='# Contributions',\
        group_field='organizations', time_field='time', value_field='contributions')

In [31]:
print_stacked_bar(df=df, time_column='Time', value_column='# Contributions', group_column='Organization')

## GitHub: Issues and Pull Requests by status

The next table shows the number of open and closed issues and pull requests for each **Project**.

**TODO**: provide plots like:
  * PRs: https://analytics.mozilla.community:443/goto/99a2cf4d0e06986fe5886ccafa01c88f
  * Issues: https://analytics.mozilla.community:443/goto/db1d0243582548c8f8a0469f3e099677
  

In [64]:
# Open & Closed PRs by Project (max 100 projects)
s_prs = create_search(source='github_issues')
s_prs = add_bot_filter(s_prs)
s_prs = s_prs.filter('terms', pull_request=['true'])
s_prs.aggs.bucket('projects', 'terms', field='project_1', size=100)\
    .bucket('status', 'terms', field='state', size=100)
result_prs = s_prs.execute()

# Open & Closed Issues by Project (max 100 projects)
s_iss = create_search(source='github_issues')
s_iss = add_bot_filter(s_iss)
s_iss = s_iss.filter('terms', pull_request=['false'])
s_iss.aggs.bucket('projects', 'terms', field='project_1', size=100)\
    .bucket('status', 'terms', field='state', size=100)
result_iss = s_iss.execute()

In [74]:
prs_df = stack_by_terms(result=result_prs, group_column='Project', subgroup_column='Status', value_column='# Pull Requests',\
         group_field='projects', subgroup_field='status', value_field='doc_count')
iss_df = stack_by_terms(result=result_iss, group_column='Project', subgroup_column='Status', value_column='# Issues',\
         group_field='projects', subgroup_field='status', value_field='doc_count')

joined_df = pandas.merge(prs_df, iss_df, how='outer', on=['Project', 'Status'])
joined_df = joined_df.fillna(0)
joined_df

|   | Project | Status | # Pull Requests | # Issues |
|---|---|---|---|---|
| 0 | mozilla | closed | 25801.0 | 19346.0 |
| 1 | mozilla | open | 1113.0 | 6329.0 |
| 2 | mozilla-services | closed | 7493.0 | 4773.0 |
| 3 | mozilla-services | open | 154.0 | 1115.0 |
| 4 | rust-lang | closed | 4100.0 | 2925.0 |
| 5 | rust-lang | open | 242.0 | 2954.0 |
| 6 | servo | closed | 3177.0 | 1250.0 |
| 7 | servo | open | 153.0 | 1085.0 |
| 8 | mozilla-mobile | closed | 3099.0 | 538.0 |
| 9 | mozilla-mobile | open | 52.0 | 185.0 |


### GitHub: Issues and Pull Requests by Organization

Below we show the number of open and closed pull requests and issues **by Organization**:


In [75]:
# Open & Closed PRs by Organization (max 100 organizations)
s_prs = create_search(source='github_issues')
s_prs = s_prs.filter('terms', pull_request=['true'])
s_prs.aggs.bucket('organizations', 'terms', field='author_org_name', size=100)\
    .bucket('status', 'terms', field='state', size=100)
result_prs = s_prs.execute()

# Open & Closed Issues by Organization (max 100 organizations)
s_iss = create_search(source='github_issues')
s_iss = s_iss.filter('terms', pull_request=['false'])
s_iss.aggs.bucket('organizations', 'terms', field='author_org_name', size=100)\
    .bucket('status', 'terms', field='state', size=100)
result_iss = s_iss.execute()

In [76]:
prs_df = stack_by_terms(result=result_prs, group_column='Organization', subgroup_column='Status',\
                        value_column='# Pull Requests', group_field='organizations', subgroup_field='status',\
                        value_field='doc_count')
iss_df = stack_by_terms(result=result_iss, group_column='Organization', subgroup_column='Status',\
                        value_column='# Issues', group_field='organizations', subgroup_field='status',\
                        value_field='doc_count')

joined_df = pandas.merge(prs_df, iss_df, how='outer', on=['Organization', 'Status'])
joined_df = joined_df.fillna(0)
joined_df

|   | Organization | Status | # Pull Requests | # Issues |
|---|---|---|---|---|
| 0 | Community | closed | 146388.0 | 74177.0 |
| 1 | Community | open | 2127.0 | 18764.0 |
| 2 | Mozilla Corporation | closed | 11827.0 | 7943.0 |
| 3 | Mozilla Corporation | open | 165.0 | 2787.0 |
| 4 | Mozilla Staff | closed | 10935.0 | 8308.0 |
| 5 | Mozilla Staff | open | 418.0 | 2801.0 |
| 6 | Unknown | closed | 6174.0 | 7268.0 |
| 7 | Unknown | open | 256.0 | 2460.0 |
| 8 | Catalyst | closed | 37.0 | 27.0 |
| 9 | Adobe Systems, Inc. | closed | 35.0 | 31.0 |


### GitHub: Issues and Pull Requests made by people hired by Mozilla

To compare contributors **hired by Mozilla** with the rest, the output below first lists how each organization is classified, as 'Mozilla Staff' or 'Other'. A table with the aggregated numbers for both groups follows.
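The grouping step can be sketched without Elasticsearch or pandas. The snippet below collapses per-organization counts into 'Mozilla Staff' and 'Other', which is the idea behind `stack_by_terms_cusum`. The helper name `collapse_organizations` is ours. The sample rows are a subset of the pull request counts from the by-Organization table above, so the totals are smaller than those obtained from the full data.

```python
from collections import defaultdict


def collapse_organizations(rows, staff_org_names, staff_org="Mozilla Staff", other="Other"):
    """Sum counts per (group, status), mapping non-staff organizations to `other`."""
    totals = defaultdict(int)
    for organization, status, count in rows:
        group = staff_org if organization in staff_org_names else other
        totals[(group, status)] += count
    return dict(totals)


# Subset of the pull request counts shown in the by-Organization table.
pull_requests = [
    ("Community", "closed", 146388),
    ("Community", "open", 2127),
    ("Mozilla Corporation", "closed", 11827),
    ("Mozilla Corporation", "open", 165),
    ("Mozilla Staff", "closed", 10935),
    ("Mozilla Staff", "open", 418),
]

totals = collapse_organizations(pull_requests, staff_org_names=["Mozilla Staff"])
for (group, status), count in sorted(totals.items()):
    print(group, status, count)
```

A dictionary keyed by `(group, status)` avoids the repeated dataframe lookups that the notebook version performs for every bucket.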

In [79]:
# Open & Closed PRs by Organization (max 100 organizations)
s_prs = create_search(source='github_issues')
s_prs = s_prs.filter('terms', pull_request=['true'])
s_prs.aggs.bucket('organizations', 'terms', field='author_org_name', size=100)\
    .bucket('status', 'terms', field='state', size=100)
result_prs = s_prs.execute()

# Open & Closed Issues by Organization (max 100 organizations)
s_iss = create_search(source='github_issues')
s_iss = s_iss.filter('terms', pull_request=['false'])
s_iss.aggs.bucket('organizations', 'terms', field='author_org_name', size=100)\
    .bucket('status', 'terms', field='state', size=100)
result_iss = s_iss.execute()

In [86]:
print("\nPRS\n")
prs_df = stack_by_terms_cusum(result=result_prs, group_column='Organization', subgroup_column='Status',\
                        value_column='# Pull Requests', group_field='organizations', subgroup_field='status',\
                        value_field='doc_count', staff_org_names=['Mozilla Staff'], staff_org='Mozilla Staff')
print("\nISSUES\n")
iss_df = stack_by_terms_cusum(result=result_iss, group_column='Organization', subgroup_column='Status',\
                        value_column='# Issues', group_field='organizations', subgroup_field='status',\
                        value_field='doc_count', staff_org_names=['Mozilla Staff'], staff_org='Mozilla Staff')

joined_df = pandas.merge(prs_df, iss_df, how='outer', on=['Organization', 'Status'])
joined_df = joined_df.fillna(0)
joined_df


PRS

Community -> Other
Mozilla Corporation -> Other
Mozilla Staff -> Mozilla Staff
Unknown -> Other
Catalyst -> Other
Adobe Systems, Inc. -> Other
Mozilla Reps -> Other
MIT -> Other
Apple -> Other
Canonical, Ltd. -> Other
Debian GNU/Linux -> Other
Cloudscaling -> Other
Collabora -> Other
Chef -> Other
Oracle -> Other
University of North Carolina at Chapel Hill -> Other

ISSUES

Community -> Other
Mozilla Staff -> Mozilla Staff
Mozilla Corporation -> Other
Unknown -> Other
Mozilla Reps -> Other
MIT -> Other
Adobe Systems, Inc. -> Other
Catalyst -> Other
Debian GNU/Linux -> Other
Cloudscaling -> Other
Canonical, Ltd. -> Other
Collabora -> Other
Google, Inc. -> Other
Bitergia -> Other
Apple -> Other
Aptana, Inc. -> Other
Capital One -> Other
Carnegie Mellon University -> Other
CodeSourcery -> Other
Intel -> Other
Oracle -> Other
The Apache Software Foundation -> Other


|   | Organization | Status | # Pull Requests | # Issues |
|---|---|---|---|---|
| 0 | Other | closed | 164542.0 | 89663.0 |
| 1 | Other | open | 2559.0 | 24219.0 |
| 2 | Mozilla Staff | closed | 10935.0 | 8308.0 |
| 3 | Mozilla Staff | open | 418.0 | 2801.0 |


**Bugzilla**:
  * Number of tickets opened
  * Number of tickets closed
  
**TODO**: provide plots like:
  * Open tickets: https://analytics.mozilla.community:443/goto/4c749e3e1cdada7e3aeaf81ddf25e4fb
  * Closed tickets: https://analytics.mozilla.community:443/goto/2140cf24fb3b876d27b1ba62c21141bc

**Mailing lists**:
  * Number of e-mails sent
    * https://analytics.mozilla.community:443/goto/d64333339c487777686182cb62ad1053
  
**Discourse**:
  * Number of threads initiated
    * https://analytics.mozilla.community:443/goto/71771202d68a10cc422c6bda86c7cf3e
  * Number of comments posted
    * https://analytics.mozilla.community:443/goto/73c76412902180d14e0418d03fb30884
  
These metrics will be computed for the specified contributor groups, over time.