## Goal: understanding contributors

The term "community" in this context refers to the group of people contributing to Mozilla projects. Thus, this goal can be summarized as characterizing the Mozilla community based on its contributors. A contributor is understood as a person who performs an action that can be tracked in the set of considered data sources, for example sending a commit, or opening or closing a ticket. Because these actions differ depending on the data source, the particular actions used in each analysis will be detailed within the particular goals.

The main objective of this goal is to determine a set of characteristics of contributors:

  * Projects: which projects they contribute to.
  * Organizations: which organizations they are affiliated with.
  * Gender: what their gender is.
  * Age: what their "age" in the project is (time spent contributing).
  * Geographical origin: where they come from.

These characteristics can be refined into the following questions:

**Questions**:

* Which projects can be identified?
* Which contributors have activity related to each project?
* Which organizations can be identified?
* Which contributors are affiliated with each organization?
* Which of those contributors are hired by Mozilla, and which are not?
* What gender are contributors?
* How long have contributors been contributing?
* Where do contributors come from?

These questions can be answered with the following metrics/data:

**Metrics**:

* List of projects
* Contributors by project
* Number of contributors by project over time
* List of organizations
* Contributors by organization
* Number of contributors by organization over time
* Contributors by groups: hired by Mozilla, the rest
* Contributors by gender
* Number of contributors by gender over time
* Time of first and last commit for each contributor
* Length of period of activity for each contributor
* Contributors by time zone (when possible)
* Contributors by city name (when possible)

All the characterizations of developers (by project, by organization, by hired by Mozilla/rest, by gender, by period of activity, by time zone, by city name) can be used as a discriminator or grouping factor for the metrics defined for the next goals. Most of these metrics can be computed separately for each of the considered data sources.

### Metric Calculations
First we need to open a connection to the proper Elasticsearch (ES) instance. We use an external module to load credentials from a file that is not shared. To run this notebook, use your own credentials: put them in a file named '.settings' (in the same directory as this notebook), following the example file 'settings.sample'.

In [1]:
import pandas

from util import ESConnection
from elasticsearch_dsl import Search

es_conn = ESConnection()

#### List of Projects

To get the list of projects, we query ES for the unique count of commits in each project. To do that, we bucketize the data on the 'project' field (up to a maximum of 100 projects, given by the 'size' parameter set below).

In [67]:
s = Search(using=es_conn, index='git')

# Unique count of Commits by Project (max 100 projects)
s.aggs.bucket('projects', 'terms', field='project', size=100)\
    .metric('commits', 'cardinality', field='hash', precision_threshold=100000)
result = s.execute()

# In case you need to check response, uncomment line below
#print(result.to_dict()['aggregations'])


In [68]:
# Build a DataFrame with one row per project bucket
df = pandas.DataFrame(result.to_dict()['aggregations']['projects']['buckets'])
df = df.drop('doc_count', axis=1)
# Each metric comes back as {'value': N}; keep only the number
df['commits'] = df['commits'].apply(lambda row: row['value'])
df = df[['key', 'commits']]
df.columns = ['Project', '# Commits']

df

| | Project | # Commits |
|---|---|---|
| 0 | mozilla | 1901245 |
| 1 | mozilla-services | 147314 |
| 2 | rust-lang | 135323 |
| 3 | servo | 78035 |
| 4 | mdn | 12739 |
| 5 | moztw | 8618 |
| 6 | mozilla-mobile | 9039 |
| 7 | aframevr | 7090 |
| 8 | mozilla-japan | 5787 |
| 9 | mozmar | 5359 |
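The bucket-flattening step above is repeated for every metric in this notebook, so it could be factored into a small helper. The following is only a sketch: `flatten_buckets` is a hypothetical name, not part of the `util` module, and the sample response is hand-written to mimic the shape of `result.to_dict()['aggregations']`, with illustrative numbers rather than real results.

```python
def flatten_buckets(aggregations, agg_name, metric_name):
    """Return one (key, value) row per bucket of a terms aggregation.

    `aggregations` is the dict found under the 'aggregations' key of an
    Elasticsearch response, e.g. result.to_dict()['aggregations'].
    """
    rows = []
    for bucket in aggregations[agg_name]['buckets']:
        # Each metric sub-aggregation is returned as {'value': <number>}
        rows.append((bucket['key'], bucket[metric_name]['value']))
    return rows


# Hand-written response fragment (illustrative numbers, not real results)
sample = {
    'projects': {
        'buckets': [
            {'key': 'project-a', 'doc_count': 12, 'commits': {'value': 10}},
            {'key': 'project-b', 'doc_count': 5, 'commits': {'value': 5}},
        ]
    }
}

rows = flatten_buckets(sample, 'projects', 'commits')
print(rows)
```

The resulting rows can be handed to `pandas.DataFrame(rows, columns=['Project', '# Commits'])` to obtain the same kind of table as above.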


#### Contributors by Project

In [74]:
s = Search(using=es_conn, index='git')

# Unique count of contributors (authors) by project (max 100 projects)
s.aggs.bucket('projects', 'terms', field='project', size=100)\
    .metric('contributors', 'cardinality', field='author_uuid', precision_threshold=100000)
result = s.execute()

# In case you need to check response, uncomment line below
#print(result.to_dict()['aggregations'])


In [75]:
# Build a DataFrame with one row per project bucket
df = pandas.DataFrame(result.to_dict()['aggregations']['projects']['buckets'])
df = df.drop('doc_count', axis=1)
# Each metric comes back as {'value': N}; keep only the number
df['contributors'] = df['contributors'].apply(lambda row: row['value'])
df = df[['key', 'contributors']]
df.columns = ['Project', '# Contributors']

df

| | Project | # Contributors |
|---|---|---|
| 0 | mozilla | 11191 |
| 1 | mozilla-services | 2053 |
| 2 | rust-lang | 2494 |
| 3 | servo | 1404 |
| 4 | mdn | 355 |
| 5 | moztw | 130 |
| 6 | mozilla-mobile | 108 |
| 7 | aframevr | 208 |
| 8 | mozilla-japan | 71 |
| 9 | mozmar | 66 |


#### Number of contributors by project over time
**TODO**: provide a plot similar to https://analytics.mozilla.community:443/goto/9523b9b00de0b35645de488a1a06514e
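Until the plot is available, the numbers behind it could probably be obtained by nesting a date histogram inside the per-project buckets. The sketch below rests on several assumptions: the date field name `author_date` is a guess (the actual field in the 'git' index is not stated in this notebook), the names `over_time`, `contributors_over_time_body` and `flatten_over_time` are made up for illustration, and the sample response is hand-written. The interval parameter is called `interval` on older Elasticsearch versions and `calendar_interval` from version 7 on, so it is left configurable.

```python
def contributors_over_time_body(date_field, interval_param='interval',
                                interval='year'):
    """Build a request body: projects -> time buckets -> unique authors."""
    return {
        'size': 0,  # only aggregations are needed, not the matching documents
        'aggs': {
            'projects': {
                'terms': {'field': 'project', 'size': 100},
                'aggs': {
                    'over_time': {
                        'date_histogram': {'field': date_field,
                                           interval_param: interval},
                        'aggs': {
                            'contributors': {
                                'cardinality': {'field': 'author_uuid'}
                            }
                        },
                    }
                },
            }
        },
    }


def flatten_over_time(aggregations):
    """Return (project, period, contributors) rows from the nested buckets."""
    rows = []
    for project in aggregations['projects']['buckets']:
        for period in project['over_time']['buckets']:
            rows.append((project['key'],
                         period['key_as_string'],
                         period['contributors']['value']))
    return rows


# 'author_date' is an assumed field name, see the note above
body = contributors_over_time_body('author_date')

# Hand-written response fragment (illustrative numbers, not real results)
sample = {
    'projects': {'buckets': [
        {'key': 'project-a', 'doc_count': 30, 'over_time': {'buckets': [
            {'key': 1451606400000, 'key_as_string': '2016-01-01T00:00:00.000Z',
             'doc_count': 10, 'contributors': {'value': 3}},
            {'key': 1483228800000, 'key_as_string': '2017-01-01T00:00:00.000Z',
             'doc_count': 20, 'contributors': {'value': 5}},
        ]}},
    ]}
}

rows = flatten_over_time(sample)
print(rows)
```

With `elasticsearch_dsl`, such a body could be applied to an existing search through `Search.update_from_dict(body)`; the flattened rows can then be pivoted and plotted with pandas.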

#### List of organizations


In [69]:
s = Search(using=es_conn, index='git')

# Unique count of commits by organization (max 100 organizations)
s.aggs.bucket('organizations', 'terms', field='author_org_name', size=100)\
    .metric('commits', 'cardinality', field='hash', precision_threshold=100000)
result = s.execute()

# In case you need to check response, uncomment line below
#print(result.to_dict()['aggregations'])

In [70]:
# Build a DataFrame with one row per organization bucket
df = pandas.DataFrame(result.to_dict()['aggregations']['organizations']['buckets'])
df = df.drop('doc_count', axis=1)
# Each metric comes back as {'value': N}; keep only the number
df['commits'] = df['commits'].apply(lambda row: row['value'])
df = df[['key', 'commits']]
df.columns = ['Organization', '# Commits']

df

| | Organization | # Commits |
|---|---|---|
| 0 | Community | 1255010 |
| 1 | Mozilla Staff | 1057068 |
| 2 | Unknown | 23763 |
| 3 | Mozilla Reps | 605 |
| 4 | Mozilla Corporation | 35 |


#### Contributors by organization

In [72]:
s = Search(using=es_conn, index='git')

# Unique count of contributors (authors) by organization (max 100 organizations)
s.aggs.bucket('organizations', 'terms', field='author_org_name', size=100)\
    .metric('contributors', 'cardinality', field='author_uuid', precision_threshold=100000)
result = s.execute()

# In case you need to check response, uncomment line below
#print(result.to_dict()['aggregations'])

In [73]:
# Build a DataFrame with one row per organization bucket
df = pandas.DataFrame(result.to_dict()['aggregations']['organizations']['buckets'])
df = df.drop('doc_count', axis=1)
# Each metric comes back as {'value': N}; keep only the number
df['contributors'] = df['contributors'].apply(lambda row: row['value'])
df = df[['key', 'contributors']]
df.columns = ['Organization', '# Contributors']

df

| | Organization | # Contributors |
|---|---|---|
| 0 | Community | 13204 |
| 1 | Mozilla Staff | 1970 |
| 2 | Unknown | 1272 |
| 3 | Mozilla Reps | 2 |
| 4 | Mozilla Corporation | 5 |


#### Number of contributors by organization over time
**TODO**: provide a plot similar to https://analytics.mozilla.community:443/goto/5dce2b36ec14405a09827860169ae234

#### Contributors by groups: hired by Mozilla, the rest
**TODO**: requires knowing which contributors are hired by Mozilla.

#### Contributors by gender
**TODO**: pending a gender study over the data.

#### Number of contributors by gender over time
**TODO**: pending a gender study over the data.

#### Time of first and last commit for each contributor
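This metric is not computed yet. One possible approach, sketched here under assumptions, is to bucket commits by `author_uuid` and ask for the minimum and maximum of the commit date in each bucket. The field name `author_date` is a guess, and `first_last_body`, `first_last_rows`, `first_commit`, `last_commit` and `max_authors` are names invented for this example. Elasticsearch returns `min` and `max` over a date field as milliseconds since the epoch, which is what the hand-written sample below imitates.

```python
from datetime import datetime, timezone


def first_last_body(date_field, max_authors=100):
    """Build a request body: authors -> earliest and latest commit date."""
    return {
        'size': 0,  # only aggregations are needed
        'aggs': {
            'authors': {
                'terms': {'field': 'author_uuid', 'size': max_authors},
                'aggs': {
                    'first_commit': {'min': {'field': date_field}},
                    'last_commit': {'max': {'field': date_field}},
                },
            }
        },
    }


def ms_to_datetime(epoch_ms):
    """Convert milliseconds since the epoch to an aware UTC datetime."""
    return datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)


def first_last_rows(aggregations):
    """Return (author_uuid, first commit, last commit) rows."""
    rows = []
    for bucket in aggregations['authors']['buckets']:
        rows.append((bucket['key'],
                     ms_to_datetime(bucket['first_commit']['value']),
                     ms_to_datetime(bucket['last_commit']['value'])))
    return rows


# Hand-written response fragment (illustrative values, not real results)
sample = {
    'authors': {'buckets': [
        {'key': 'uuid-1', 'doc_count': 42,
         'first_commit': {'value': 1262304000000.0},
         'last_commit': {'value': 1483228800000.0}},
    ]}
}

rows = first_last_rows(sample)
print(rows)
```

Given the number of contributors reported above, `max_authors` would have to be raised well beyond 100, or the authors retrieved in several requests.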


#### Length of period of activity for each contributor
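This metric is also pending. Once the first and last commit dates of each contributor are known, the length of the activity period is simply their difference. A minimal sketch, where `activity_days` is a hypothetical helper and the input rows are made-up examples rather than real contributors:

```python
from datetime import datetime, timezone


def activity_days(first_commit, last_commit):
    """Whole days between first and last commit (0 for a single commit)."""
    return (last_commit - first_commit).days


# Made-up (author_uuid, first commit, last commit) rows
contributors = [
    ('uuid-1',
     datetime(2010, 1, 1, tzinfo=timezone.utc),
     datetime(2017, 1, 1, tzinfo=timezone.utc)),
    ('uuid-2',
     datetime(2016, 5, 20, tzinfo=timezone.utc),
     datetime(2016, 5, 20, tzinfo=timezone.utc)),
]

# Longest-active contributors first
lengths = sorted(
    ((uuid, activity_days(first, last)) for uuid, first, last in contributors),
    key=lambda row: row[1],
    reverse=True,
)
print(lengths)
```

Measuring in days is only one choice; months or years may be more readable for long-lived projects.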


#### Contributors by time zone (when possible)
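This metric is pending as well. Git author dates carry a UTC offset, so one rough proxy for the time zone is to count unique contributors per offset. The sketch below assumes that the author dates are available as ISO 8601 strings with their original offset, which is not confirmed for the 'git' index used here; `contributors_by_utc_offset` and the sample commits are invented for illustration. Dates without an offset are skipped, in line with the "when possible" caveat.

```python
from collections import defaultdict
from datetime import datetime


def contributors_by_utc_offset(commits):
    """Count unique authors per UTC offset, in hours.

    `commits` is an iterable of (author_uuid, ISO 8601 date string) pairs.
    """
    authors = defaultdict(set)
    for author_uuid, author_date in commits:
        offset = datetime.fromisoformat(author_date).utcoffset()
        if offset is None:
            continue  # no offset recorded, time zone unknown
        authors[offset.total_seconds() / 3600].add(author_uuid)
    return {hours: len(uuids) for hours, uuids in sorted(authors.items())}


# Made-up commits, not real data
commits = [
    ('uuid-1', '2017-03-01T10:00:00+01:00'),
    ('uuid-1', '2017-03-02T11:30:00+01:00'),
    ('uuid-2', '2017-03-01T09:00:00-08:00'),
    ('uuid-3', '2017-03-05T18:45:00+01:00'),
    ('uuid-4', '2017-03-05T18:45:00'),
]

by_offset = contributors_by_utc_offset(commits)
print(by_offset)
```

The offset reflects how the contributor's machine was configured at commit time, so it is an approximation of location at best, and a contributor who travels will be counted under several offsets.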

#### Contributors by city name (when possible)