# Analytics study of the Mozilla community

*Authors:* Alberto Pérez (Bitergia), Daniel Izquierdo (Bitergia), Jesus M. Gonzalez-Barahona (URJC / Bitergia)

*Collaborators in the Mozilla community:*

*Date:*

This is a report on the Mozilla community, based on analytics of data obtained from some of the systems that people in the community use to perform their tasks. The report is commissioned by Mozilla, and performed by Bitergia.

## Summary

Mozilla is interested in better understanding the current state of its community. In response, Bitergia is producing this report with help and feedback from several people at Mozilla (see the Contributions section below).

This report first introduces some general issues about the analysis and its results, and then provides a section for each goal. Each goal will be refined into questions and metrics, with the corresponding analysis, as the project progresses.

This report is a work in progress until stated otherwise, and could contain errors and unverified data. Feedback and bug reports are welcome.

## Contributions

If you want to contribute to this report, by reporting a bug, by proposing some new idea, by fixing and/or improving any part of it, or otherwise, please:

* To report a problem, open an issue on this repository.
* To propose a change, submit a pull request on this repository.

## Scope

The scope of this project is to analyze and explore data already available in the [Mozilla Analytics Dashboard](http://analytics.mozilla.community) (including the database that supports it), related to the following areas:

* Understanding people. Characterization of the community from several points of view, including “age” (time contributing) in the project, history (different kinds of contributions), migration between Mozilla projects, affiliation (Mozilla employees or independent volunteers), gender (based on a name analysis), geographical distribution (based on time zone data and GitHub data, when available), performance, etc.

* Understanding activity. Characterization of activity in the different Mozilla projects, including evolution of activity over time, key performance factors, key good practices in specific projects, in the context of the groups characterized in the previous item. 

* Understanding interconnection. Learning how contributors work with each other in common projects, how they migrate together (or not), how they evolve, and learning about the key people acting as “connectors” between projects.

The analysis uses public data, available for retrieval from the following data sources of the Mozilla community:

* Git
* GitHub (including issues and pull requests)
* Bugzilla
* Mailing Lists
* Discourse

Analytics based on affiliation (mapping of persons to supporting organizations) depend on the quality of affiliation data in the Mozilla Analytics Dashboard.

Other interesting data sources that could be considered in the future (but are not the subject of this study, unless specifically agreed during the execution of this project) are:

* Mozilla Developers Network, as a data source for affiliation information
* Forums (Kitsune), as a data source for likely first contributions by future developers

The final deliverables of this project will be:

* A Jupyter notebook, with the results of the project, including code to analyze the relevant data.
* If convenient, a PDF summary of results.
* If agreed, some panels in the Mozilla Analytics Dashboard, for specific metrics produced by this project.

## Methodology

To focus on the questions that matter most to Mozilla, Bitergia will use the [Goal – Question – Metric (GQM) methodology](https://en.wikipedia.org/wiki/GQM), based on:

1. A set of interviews or meetings with Mozilla staff to understand the goals being pursued and define them together.
2. A set of questions that help determine whether the previously defined goals are being reached, or how the organization is pursuing them.
3. A set of metrics that provide data-based facts to answer those questions.

The metrics will be derived from data already collected for the Mozilla Analytics Dashboard, stored in an Elasticsearch database in the form of raw indexes (information obtained directly from data sources, usually with all the details found in the corresponding data source), or enriched indexes (summary of relevant information, prepared to be shown by the dashboard).
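As an illustration of how a metric could be derived from an enriched index, the following sketch builds an Elasticsearch request body that counts distinct contributors per month. The field names (`grimoire_creation_date`, `author_uuid`) are assumptions about the enriched index layout and should be checked against the actual indexes. The sketch only builds the request and does not contact any server.

```python
import json


def contributors_over_time_query(date_field="grimoire_creation_date",
                                 author_field="author_uuid",
                                 interval="month"):
    """Build a request body counting distinct authors per period.

    Field names are assumptions, not confirmed names from the dashboard.
    """
    return {
        "size": 0,  # only the aggregations are needed, not the documents
        "aggs": {
            "per_period": {
                "date_histogram": {
                    "field": date_field,
                    # Elasticsearch 7+; older versions use "interval"
                    "calendar_interval": interval,
                },
                "aggs": {
                    # cardinality gives an approximate distinct count
                    "contributors": {"cardinality": {"field": author_field}}
                },
            }
        },
    }


query = contributors_over_time_query()
print(json.dumps(query, indent=2))
```

The resulting dictionary could be sent as the body of a search request against an enriched index, for example from a notebook cell.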

The process will be iterative: once some metrics are derived from a first goal-questions iteration, they will be discussed with Mozilla, possibly refining the goal and questions, and then the metrics in turn. Once metrics are finally defined and obtained, analysis of the results will also be included in the report.

## Data still needed for producing the analysis

To produce the analysis, and to refine some questions and metrics, some details are needed. This section includes information on the missing data or details that are still to be provided. This is an open list and may grow or shrink as the project progresses or details are provided:

* Details on how code review is done, both in GitHub and Bugzilla, by Mozilla. In both cases, documents describing recommendations to developers on how to perform code review, or similar, would be useful. In the specific case of Bugzilla, ways of determining that a specific ticket is devoted to code review, and how it is done, are needed.
* Details for tracking `Good first bug` tickets in GitHub and Bugzilla (tags or similar).
* Repository list per project across data sources. This data could be used to improve the Mozilla Analytics Dashboard as well. For this, Bitergia will provide the current list of repositories, as used for the data retrieval of the dashboard.

## Known limitations

There are some limitations to the results of the analysis that are worth mentioning:

* The structure in projects, and the affiliation information, will be as good as can be defined with the help of the Mozilla community.
* Gender is estimated from the contributor's name, using the genderize.io API. Therefore, accuracy depends on this API, and on the fact that gender is estimated based on name.
* Time zone analysis is limited by its characteristics. On the one hand, the same time zone includes several large geographical areas (e.g., large parts of Europe and Africa are in the same time zone). On the other hand, the identification of time zones depends on user system configuration, so in some cases it may not be accurate.
* City names, when obtained from GitHub profiles, are limited by the availability and reliability of this information, which depends on developers.
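The time zone limitation can be illustrated with a small sketch. It extracts the UTC offset recorded in a timestamp; the timestamps are made-up examples, and the helper name `utc_offset_hours` is hypothetical.

```python
from datetime import datetime


def utc_offset_hours(iso_timestamp):
    """Return the UTC offset, in hours, recorded in an ISO 8601 timestamp."""
    moment = datetime.fromisoformat(iso_timestamp)
    offset = moment.utcoffset()
    if offset is None:
        return None  # naive timestamp: no time zone information at all
    return offset.total_seconds() / 3600


# Two hypothetical commits with the same offset: they could come from
# central Europe or from western Africa, and the offset alone cannot tell.
first = utc_offset_hours("2016-03-01T10:15:00+01:00")
second = utc_offset_hours("2016-03-01T18:40:00+01:00")
print(first, second)
```

Since the offset is written by the contributor's own system, a misconfigured clock or a machine left in UTC produces a misleading value, which is why any geographical reading of these offsets is approximate.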

## List of goals

The goals that are to be addressed by this analysis are, in summary:

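* understanding contributors
* understanding activity
* understanding contribution patterns
* determining attraction / retention
* understanding connections
* diversity analysis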

These goals (with their corresponding subgoals, questions and metrics) will be detailed in the following sections.

## Goal: understanding contributors

The term "community" in this context refers to the group of people contributing to Mozilla projects. Thus, this goal can be summarized as characterizing the Mozilla community based on its contributors. A contributor is understood as a person who performs an action that can be tracked in the considered data sources, for example sending a commit, or opening or closing a ticket. Since these actions differ depending on the data source, the particular actions used in each analysis are detailed within each goal.

The main objective of this goal is to determine a set of characteristics of contributors:

  * Projects: to which projects they contribute
  * Organizations: to which organizations they are affiliated
  * Gender: what their gender is
  * Age: what their "age" in the project is (time contributing)
  * Geographical origin: where they come from

This goal can be refined into the following questions:

**Questions**:

* [Which projects can be identified?](http://localhost:8888/notebooks/mozilla-contribution-analysis/data-analysis/Understanding%20Contributors.ipynb#List-of-Projects)
* [Which contributors have activity related to each project?](http://localhost:8888/notebooks/mozilla-contribution-analysis/data-analysis/Understanding%20Contributors.ipynb#Contributors-by-Project)
* [Which organizations can be identified?](http://localhost:8888/notebooks/mozilla-contribution-analysis/data-analysis/Understanding%20Contributors.ipynb#List-of-organizations)
* [Which contributors are affiliated to each organization?](http://localhost:8888/notebooks/mozilla-contribution-analysis/data-analysis/Understanding%20Contributors.ipynb#Contributors-by-organization)
* [Which of those contributors are hired by Mozilla, and which are not?](http://localhost:8888/notebooks/mozilla-contribution-analysis/data-analysis/Understanding%20Contributors.ipynb#Contributors-by-groups:-hired-by-Mozilla,-the-rest)
* [What is the gender of contributors?](http://localhost:8888/notebooks/mozilla-contribution-analysis/data-analysis/Mozilla%20Gender.ipynb#Contributors-by-Gender)
* [How long have contributors been contributing?](http://localhost:8888/notebooks/mozilla-contribution-analysis/data-analysis/Determining%20Attraction%20Retention.ipynb#Time-from-first-to-last-contrib-for-authors-who-made-a-commit-before-a-given-year)
* Where do contributors come from?

These questions can be answered with the following metrics/data:

**Metrics**:

* List of projects
* Contributors by project
* Number of contributors by project over time
* List of organizations
* Contributors by organization
* Number of contributors by organization over time
* Contributors by groups: hired by Mozilla, the rest
* Contributors by gender
* Number of contributors by gender over time
* Time of first and last commit for each contributor
* Length of period of activity for each contributor
* Contributors by time zone (when possible)
* Contributors by city name (when possible)

All these characterizations of developers (by project, by organization, by hired by Mozilla/rest, by gender, by period of activity, by time zone, by city name) can be used as a discriminator or grouping factor for the metrics defined for the next goals. Most of these metrics can be computed separately for each of the considered data sources.
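As a minimal sketch of two of the metrics above (time of first and last commit, and length of the period of activity), the following code works on a handful of made-up commit records. The record layout and the contributor identifiers are assumptions for illustration, not the format of the actual indexes.

```python
from datetime import date

# Hypothetical commit records: (contributor id, commit date)
commits = [
    ("ana", date(2014, 5, 2)),
    ("ana", date(2016, 1, 20)),
    ("ben", date(2015, 7, 9)),
    ("ana", date(2015, 3, 14)),
    ("ben", date(2015, 7, 30)),
]


def activity_periods(records):
    """Map each contributor to (first date, last date, days between them)."""
    first, last = {}, {}
    for author, day in records:
        first[author] = min(day, first.get(author, day))
        last[author] = max(day, last.get(author, day))
    return {a: (first[a], last[a], (last[a] - first[a]).days) for a in first}


periods = activity_periods(commits)
for author, (start, end, days) in sorted(periods.items()):
    print(author, start, end, days)
```

The resulting period length can then be bucketed (for example by year of first commit) to serve as the "age" grouping factor for the other goals.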

## Goal: understanding activity

The term activity in this document refers to actions performed by contributors. Based on the characterization of contributors, we will look at the activity of the different identified groups over time. Taking into account the different data sources, we consider the following actions:

 * Git: sending a commit
 * GitHub issues: opening an issue, closing an issue
 * GitHub pull requests: submitting a pull request, accepting (merging) a pull request
 * Bugzilla: opening a ticket, closing a ticket
 * Mailing Lists: sending a message
 * Discourse: initiating a thread, commenting in a thread
  
This goal is refined into the following questions:

**Questions**:

For each of the activities and contributor groups identified above,

* How does activity evolve over time?

The metrics identified are:

**Metrics**: 

* [Git](http://localhost:8888/notebooks/mozilla-contribution-analysis/data-analysis/Understanding%20Activity.ipynb#Git:-Number-of-commits-authored):
  * Number of commits authored
* [GitHub](http://localhost:8888/notebooks/mozilla-contribution-analysis/data-analysis/Understanding%20Activity.ipynb#GitHub:-Issues-and-Pull-Requests-by-status):
  * Number of issues opened
  * Number of issues closed
  * Number of pull requests opened
  * Number of pull requests merged
* Bugzilla:
  * Number of tickets opened
  * Number of tickets closed
* Mailing lists:
  * Number of e-mails sent
* Discourse:
  * Number of threads initiated
  * Number of comments posted
  
These metrics will be computed for the specified contributor groups, over time.
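Computing an activity metric "for the specified contributor groups, over time" amounts to counting events per period and per group. The sketch below does this for commits by month; the events and group labels are made-up examples.

```python
from collections import Counter
from datetime import date

# Hypothetical commit events: (commit date, contributor group)
events = [
    (date(2016, 1, 5), "hired by Mozilla"),
    (date(2016, 1, 19), "the rest"),
    (date(2016, 1, 27), "hired by Mozilla"),
    (date(2016, 2, 3), "the rest"),
]


def monthly_counts(records):
    """Count events per (year-month, group) pair."""
    return Counter((day.strftime("%Y-%m"), group) for day, group in records)


counts = monthly_counts(events)
for (month, group), n in sorted(counts.items()):
    print(month, group, n)
```

The same counting applies to the other actions (issues opened, tickets closed, messages sent), changing only which events are fed in.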

## Goal: understanding contribution patterns

Analyzing groups of contributors according to their activity patterns, and their evolution over time, helps to understand the structure of the community. These groups will be defined according to how active they are (from casual to core contributors), and which kinds of activity they perform (for example, producing code, reviewing code, submitting issues, contributing in discussions, etc.). Whenever convenient, the characterization will be combined with the contributor groups identified in the first goal.

This goal is refined into the following questions:

**Questions**:

 * [How often do contributors contribute?](http://localhost:8888/notebooks/mozilla-contribution-analysis/data-analysis/Understanding%20Contribution%20Patterns.ipynb#Groups-of-contributors,-by-level-of-activity:-core,-regular,-casual)
 * [What is the structure of contribution, by level of activity?](http://localhost:8888/notebooks/mozilla-contribution-analysis/data-analysis/Understanding%20Contribution%20Patterns.ipynb#Groups-of-contributors,-by-level-of-activity:-core,-regular,-casual)
 * What is the structure of contribution in the different data sources?
 * [How are the structures of contribution evolving over time?](http://localhost:8888/notebooks/mozilla-contribution-analysis/data-analysis/Understanding%20Contribution%20Patterns.ipynb#Groups-of-contributors,-by-level-of-activity:-core,-regular,-casual)
 * How do people flow through the structure of contribution?

These questions can be answered with the following metrics:

**Metrics**:

(Still to be refined)

 * Groups of contributors, by level of activity (core, regular, casual)
 * Groups of contributors, by kind of activity (committing, opening issues, merging pull requests, etc.)
 * Groups of contributors, by kind of activity (specialists, spread, etc.)
 * Activity metrics for each group
 * Absolute number of contributors moving from one group to another
 * Fraction of contributors moving from one group to another
 
Some of these metrics will be computed for the specified contributor groups, over time.
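The grouping by level of activity (core, regular, casual) still needs a precise definition. One possible sketch, assuming an 80% / 95% split of cumulative commits (these thresholds are an assumption, not a decision taken in this report), is:

```python
def activity_groups(commits_by_author, core_share=0.80, regular_share=0.95):
    """Label contributors as core, regular or casual.

    Contributors are ranked by number of commits. Each one is labeled
    according to the share of all commits already covered by the more
    active contributors ranked before them.
    """
    total = sum(commits_by_author.values())
    if total == 0:
        return {}
    groups = {}
    accumulated = 0
    ranked = sorted(commits_by_author.items(), key=lambda kv: (-kv[1], kv[0]))
    for author, commits in ranked:
        covered = accumulated / total
        if covered < core_share:
            groups[author] = "core"
        elif covered < regular_share:
            groups[author] = "regular"
        else:
            groups[author] = "casual"
        accumulated += commits
    return groups


# Hypothetical commit counts for one period
sample = {"ana": 60, "ben": 25, "cho": 9, "dev": 4, "eli": 2}
print(activity_groups(sample))
```

Running the same classification for consecutive periods, and comparing the labels of each contributor, would give the number and fraction of contributors moving from one group to another.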

## Goal: determining attraction / retention

An important characteristic of how a community evolves is how contributors join and leave. Determining when contributors leave is not easy (they could be on a temporary leave and come back later), but if they are still inactive after a certain period, they can very likely be considered 'gone'. With this definition, the evolution of attraction and retention, and their difference (the net gain of developers, which can be negative), can be computed.

These data could be determined for each of the specific contributor groups defined in the first goal.

This goal can be refined into the following questions:

**Questions**:

* [How many contributors are joining the community?](http://localhost:8888/notebooks/mozilla-contribution-analysis/data-analysis/Determining%20Attraction%20Retention.ipynb#Attraction)
* [How many contributors are no longer active (leaving) in the community?](http://localhost:8888/notebooks/mozilla-contribution-analysis/data-analysis/Determining%20Attraction%20Retention.ipynb#Retention)
* [How is the attraction / retention ratio, and the net gain of contributors, over time?](http://localhost:8888/notebooks/mozilla-contribution-analysis/data-analysis/Determining%20Attraction%20Retention.ipynb#Evolution-of-Community)

To answer these questions, the following metrics can be used:

**Metrics**:

* Number of contributors joining the community over time (attracted)
* Number of contributors leaving (becoming inactive) over time
* Number of contributors not leaving (retained) over time

These metrics can be computed for each of the "cohorts", defined as the groups of contributors joining during a certain period of time (for example, during each year). Some of these metrics will be computed for the specified contributor groups, over time.
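The cohort idea can be sketched as follows. A contributor is assigned to the cohort of the year of their first activity, and is considered gone if their last activity is older than an inactivity threshold. The threshold of 365 days, the reference date and the sample data are all assumptions for illustration.

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical activity spans: contributor -> (first activity, last activity)
spans = {
    "ana": (date(2014, 5, 2), date(2016, 11, 20)),
    "ben": (date(2015, 7, 9), date(2015, 7, 30)),
    "cho": (date(2015, 2, 1), date(2016, 12, 5)),
}


def cohort_retention(activity_spans, today, inactive_after=timedelta(days=365)):
    """Per joining year, return (attracted, retained, gone) counts."""
    attracted, retained, gone = Counter(), Counter(), Counter()
    for first, last in activity_spans.values():
        cohort = first.year
        attracted[cohort] += 1
        if today - last > inactive_after:
            gone[cohort] += 1  # inactive for longer than the threshold
        else:
            retained[cohort] += 1
    return {year: (attracted[year], retained[year], gone[year])
            for year in sorted(attracted)}


summary = cohort_retention(spans, today=date(2017, 1, 1))
print(summary)
```

The choice of threshold matters: a short one counts temporary leaves as departures, while a long one delays the detection of contributors who have actually left.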

## Goal: understanding connections

Analyze connections within the community in order to understand relations between contributors, groups, and projects.

This goal can be refined into the following questions:

**Questions**:

* Do specific groups tend to contribute to particular projects?
* Which groups and/or contributors act as nexus between different projects?
* Which contributors / groups are isolated according to their activity patterns?

These questions can be answered with the following metrics:

**Metrics**:

(to be completed)

* Contributors organized by number of projects / repositories to which they contribute
* Groups organized by number of projects / repositories to which they contribute

Some of these metrics will be computed for the specified contributor groups, over time.
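A first approximation to finding contributors who act as a nexus between projects is to count the projects in which each contributor is active. The sketch below uses made-up records and placeholder project names; the minimum of two projects is an assumption.

```python
from collections import defaultdict

# Hypothetical activity records: (contributor, project)
activity = [
    ("ana", "project-a"), ("ana", "project-b"), ("ben", "project-a"),
    ("cho", "project-b"), ("ana", "project-a"), ("cho", "project-c"),
]


def projects_by_contributor(records):
    """Map each contributor to the set of projects they are active in."""
    projects = defaultdict(set)
    for contributor, project in records:
        projects[contributor].add(project)
    return dict(projects)


def connectors(records, min_projects=2):
    """Contributors active in at least `min_projects` projects, most connected first."""
    per_person = projects_by_contributor(records)
    ranked = sorted(per_person.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(name, sorted(projs)) for name, projs in ranked
            if len(projs) >= min_projects]


print(connectors(activity))
```

Contributors who appear in a single project only (here, `ben`) are candidates for the "isolated" category, although a fuller analysis would also look at who they interact with.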

## Goal: diversity analysis

This goal is focused on analyzing diversity from several points of view: affiliation, gender, geographical area. This will show how diverse the different areas of contribution in the project are, and will help to spot points with low or extremely high diversity. A careful analysis will also help to understand whether certain groups are being driven away from, or attracted to, certain areas of contribution (for example, projects).

This goal can be refined into the following questions:

**Questions**:

* How diverse are the different areas of the project in terms of organizations?
* On how many organizations does the project critically depend?
* How diverse are the different areas of the project in terms of gender?
* Can areas of the project be identified that are especially diverse (or not) from the gender point of view?
* How diverse are the different areas of the project geographically?
* Can areas of the project be identified that are especially diverse (or not) from the geographical origin point of view?

These questions can be answered with the following metrics:

**Metrics**:

* Number of organizations contributing to each project.
* Gender balance in each project.
* Geographical balance in each project.

Each of those metrics can be computed over time. Some of these metrics will be computed for the specified contributor groups, over time.
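The question of how many organizations the project critically depends on can be approximated in several ways. The sketch below borrows the "elephant factor" idea from the CHAOSS metrics (the smallest number of organizations that together account for a given share of the activity, 50% by default). It is offered as one possible reading of the question, not as a metric already adopted in this report, and the commit counts are made up.

```python
# Hypothetical commit counts per organization for one project
commits_by_org = {"Org A": 70, "Org B": 15, "Org C": 10, "Independent": 5}


def elephant_factor(counts, share=0.5):
    """Smallest number of organizations that together reach `share` of the activity."""
    total = sum(counts.values())
    if total == 0:
        return 0
    covered = 0
    # Walk organizations from most to least active
    for needed, n in enumerate(sorted(counts.values(), reverse=True), start=1):
        covered += n
        if covered >= share * total:
            return needed
    return len(counts)


print("organizations:", len(commits_by_org))
print("elephant factor:", elephant_factor(commits_by_org))
```

A low value signals that the project depends heavily on very few organizations. The same calculation can be applied per project and per period to follow its evolution over time.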