# Analytics study of the Mozilla community

*Authors:* Alberto Pérez (Bitergia), Daniel Izquierdo (Bitergia), Jesus M. Gonzalez-Barahona (URJC / Bitergia)

*Collaborators in the Mozilla community:*

*Date:*

This is a report on the Mozilla community, based on analytics of data obtained from some of the systems that people in the community use to perform their tasks. The report is commissioned by Mozilla, and performed by Bitergia.

## Summary

Because Mozilla wants to better understand the current state of its community, Bitergia is producing this report with help and feedback from several people at Mozilla (see the Contributions section below).

This report first introduces some general issues about the analysis and its results, and then provides a section for each goal. Each goal will be refined into questions and metrics, with the corresponding analysis, as the project progresses.

This report is a work in progress until stated otherwise, and could contain errors and unverified data. Feedback and bug reports are welcome.

## Contributions

If you want to contribute to this report, by reporting a bug, proposing a new idea, fixing or improving any part of it, or otherwise, please:

* To report a problem, open an issue on this repository.
* To propose a change, submit a pull request on this repository.

## Scope

The scope of this project is to analyze and explore data already available in the [Mozilla Analytics Dashboard](http://analytics.mozilla.community) (including the database that supports it), related to the following areas:

* Understanding people. Characterization of the community from several points of view, including “age” (time contributing) in the project, history (different kinds of contributions), migration between Mozilla projects, affiliation (Mozilla employees or independent volunteers), gender (based on a name analysis), geographical distribution (based on time zone data and GitHub data, when available), performance, etc.

* Understanding activity. Characterization of activity in the different Mozilla projects, including evolution of activity over time, key performance factors, key good practices in specific projects, in the context of the groups characterized in the previous item. 

* Understanding interconnection. Learning how contributors work with each other in common projects, how they migrate together (or not), how they evolve, and learning about the key people acting as “connectors” between projects.

The analysis uses public data, available for retrieval from the following data sources of the Mozilla community:

* Git
* GitHub (including issues and pull requests)
* Bugzilla
* Mailing Lists
* Discourse

Analytics based on affiliation (mapping of persons to supporting organizations) depend on the quality of affiliation data in the Mozilla Analytics Dashboard.

Other interesting data sources that could be considered in the future (but are not the subject of this study, unless specifically agreed during the execution of this project) are:

* Mozilla Developers Network, as a data source for affiliation information
* Forums (Kitsune), as a data source for likely first contributions by future developers

The final deliverables of this project will be:

* A Jupyter notebook, with the results of the project, including code to analyze the relevant data.
* If convenient, a PDF summary of results.
* If agreed, some panels in the Mozilla Analytics Dashboard, for specific metrics produced by this project.

## Methodology

To focus on the questions that are really interesting for Mozilla, Bitergia will use the [Goals – Question – Metrics (GQM) methodology](https://en.wikipedia.org/wiki/GQM), based on:

1. A set of interviews or meetings with Mozilla staff to understand the pursued goals and define them together.
2. A set of questions that help determine whether the previously defined goals are being reached, or how the organization is pursuing them.
3. A set of metrics that provide facts based on data to answer the previous questions.

The metrics will be derived from data already collected for the Mozilla Analytics Dashboard, stored in an Elasticsearch database in the form of raw indexes (information obtained directly from data sources, usually with all the details found in the corresponding data source), or enriched indexes (summary of relevant information, prepared to be shown by the dashboard).
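
As an illustration of how a metric could be derived from an enriched index, the following is a minimal sketch that builds an Elasticsearch query body counting items and distinct contributors per month. The function name and the field names (`date`, `author_id`) are placeholders chosen for this example, not the actual schema of the dashboard indexes, and the `calendar_interval` parameter assumes a recent Elasticsearch version (older versions use `interval`).

```python
import json


def monthly_activity_query(date_field, author_field, start, end):
    """Build a query body: items and distinct authors per month in [start, end)."""
    return {
        "size": 0,  # only aggregations are needed, not the matching documents
        "query": {"range": {date_field: {"gte": start, "lt": end}}},
        "aggs": {
            "per_month": {
                # Assumption: recent Elasticsearch; older versions use "interval"
                "date_histogram": {"field": date_field, "calendar_interval": "month"},
                "aggs": {
                    "contributors": {"cardinality": {"field": author_field}}
                },
            }
        },
    }


# Hypothetical field names; the real enriched indexes may use different ones
query = monthly_activity_query("date", "author_id", "2017-01-01", "2017-04-01")
print(json.dumps(query, indent=2))
```

The resulting dictionary could be sent as the body of a search request against the corresponding index.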

The process will be iterative: once some metrics are derived from a first goal-questions iteration, they will be discussed with Mozilla, to refine the goal and questions if needed, and then refine the metrics accordingly. Once metrics are finally defined and obtained, analysis of the results will also be included in the report.

## Data still needed for producing the analysis

To produce the analysis, and to refine some questions and metrics, some details are needed. This section includes information on the missing data or details that are still to be provided. This is an open list and may grow or shrink as the project progresses or details are provided:

* Details on how code review is done, both in GitHub and Bugzilla, by Mozilla. In both cases, documents describing recommendations to developers on how to perform code review, or similar, would be useful. In the specific case of Bugzilla, ways of determining that a specific ticket is devoted to code review, and how the review is done, are needed.
* Details for tracking `Good first bug` tickets in GitHub and Bugzilla (tags or similar).
* Repository list per project across data sources. This data could maybe be used to improve the Mozilla Analytics Dashboard as well. For this, Bitergia will provide the current list of repositories, as used for the data retrieval of the dashboard.

## Known limitations

There are some limitations to the results of the analysis that are worth mentioning:

* The structure in projects, and the affiliation information, will be as good as can be defined with the help of the Mozilla community.
* Gender is estimated from contributors' names, using the genderize.io API. Therefore, accuracy depends on this API, and on the fact that gender is estimated based on name alone.
* Time zone analysis is limited by its characteristics. On the one hand, the same time zone includes several large geographical areas (e.g., large parts of Europe and Africa are in the same time zone). On the other hand, the identification of time zones depends on user system configuration, so in some cases it may not be accurate.
* City names, when obtained from GitHub profiles, will be limited by the availability and reliability of this information, which depends on developers.
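
The time zone limitation can be illustrated with a minimal sketch that extracts the UTC offset from commit dates. The dates are made-up sample data, written in the format produced by `git log --date=iso`, and the function name is hypothetical. An offset of `+0100`, for instance, is shared in winter by cities such as Madrid, Berlin and Lagos, so the offset alone cannot tell them apart.

```python
from collections import Counter
from datetime import datetime, timedelta

# Made-up commit dates, in the format of `git log --date=iso`
commit_dates = [
    "2017-03-01 10:15:00 +0100",
    "2017-03-01 23:40:00 -0800",
    "2017-03-02 08:05:00 +0100",
    "2017-03-02 12:00:00 +0000",
]


def utc_offset_hours(date_string):
    """Return the UTC offset of a commit date, in hours."""
    parsed = datetime.strptime(date_string, "%Y-%m-%d %H:%M:%S %z")
    return parsed.utcoffset() / timedelta(hours=1)


# Number of commits per UTC offset: the finest geographical detail available
offsets = Counter(utc_offset_hours(d) for d in commit_dates)
for offset, count in sorted(offsets.items()):
    print(f"UTC{offset:+.1f}: {count} commit(s)")
```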

## Goal: understanding the people in Mozilla community

The term "community" in this context refers to the group of people contributing to Mozilla projects. Thus, this goal could be summarized as characterizing the Mozilla community based on its contributors. A contributor is understood as a person who performs an action that can be tracked in the set of considered data sources, for example, sending a commit, or opening or closing a ticket. Because these actions differ between data sources, the particular actions used in each analysis are detailed within each goal.

The main objective of this goal is to determine a set of characteristics of contributors:

  * Projects: to which projects they contribute.
  * Organizations: to which organizations they are affiliated.
  * Gender: what their gender is.
  * Age: what their "age" in the project is (time contributing).
  * Geographical origin: where they come from.

Those goals can be refined in the following questions:

**Questions**:

* Which projects can be identified?
* Which contributors have activity related to each project?
* Which organizations can be identified?
* Which contributors are affiliated to each organization?
* Which of those contributors are hired by Mozilla, and which are not?
* What is the gender of contributors?
* How long have contributors been contributing?
* Where do contributors come from?

These questions can be answered with the following metrics/data:

**Metrics**:

* List of projects
* Contributors by project
* Number of contributors by project over time
* List of organizations
* Contributors by organization
* Number of contributors by organization over time
* Contributors by groups: hired by Mozilla, the rest
* Contributors by gender
* Number of contributors by gender over time
* Time of first and last commit for each contributor
* Length of period of activity for each contributor
* Contributors by time zone (when possible)
* Contributors by city name (when possible)

All the characterizations of developers (by project, by organization, by hired by Mozilla/rest, by gender, by period of activity, by time zone, by city name) can be a discriminator / grouping factor for the metrics defined for the next goals. Most of these metrics can be computed separately for each of the considered data sources.
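
As a minimal sketch of the "time of first and last commit" and "length of period of activity" metrics, the following computes both from a list of (author, date) pairs. The sample data and the function name are hypothetical; in the actual analysis the records would come from the dashboard indexes.

```python
from datetime import date

# Made-up sample data: (author, commit date)
commits = [
    ("alice", date(2015, 1, 10)),
    ("bob", date(2016, 6, 1)),
    ("alice", date(2017, 3, 5)),
    ("bob", date(2016, 6, 20)),
]


def activity_periods(records):
    """Map each author to (first commit, last commit, days between them)."""
    bounds = {}
    for author, day in records:
        first, last = bounds.get(author, (day, day))
        bounds[author] = (min(first, day), max(last, day))
    return {
        author: (first, last, (last - first).days)
        for author, (first, last) in bounds.items()
    }


periods = activity_periods(commits)
for author, (first, last, days) in sorted(periods.items()):
    print(f"{author}: first={first} last={last} active for {days} days")
```

The same per-author periods can then serve as a grouping factor (for example, contributors active for less than a year versus the rest).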

## Goal: understanding people activity
The term activity in this document refers to the actions performed by contributors (commits, actions on tickets, sent e-mails, and so on). Based on the characterization from the [Community Goal](#Goal:-understanding-the-people-in-Mozilla-Community), we will look at activity within the community over time, in order to show how it evolves. As said before, these actions depend on the data source. We define at least the following actions:
 * Git: sending a commit.
 * GitHub: open tickets, closed tickets, submitted reviews, closed reviews.
 * Bugzilla: open tickets, closed tickets.
 * Mailing Lists: sent e-mails.
 * Discourse: questions, answers.
 
 
          
**Questions**:
* How does activity in the last period (e.g. month) compare to previous ones?
    
**Metrics**: 
* Git:
  * Number of commits.
* GitHub:
  * Number of open tickets.
  * Number of closed tickets.
  * Number of submitted reviews.
  * Number of closed reviews.
* Bugzilla:
  * Number of open tickets.
  * Number of closed tickets.
* Mailing lists:
  * Number of sent e-mails.
* Discourse:
  * Number of questions.
  * Number of answers.
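
The comparison of one period with the previous one can be sketched as follows, using months as the period. The event dates are made-up sample data standing for any of the actions listed above (commits, tickets, e-mails), and the function names are hypothetical.

```python
from collections import Counter
from datetime import date

# Made-up sample data: dates of tracked actions in one data source
events = [
    date(2017, 1, 5), date(2017, 1, 20),
    date(2017, 2, 3), date(2017, 2, 10), date(2017, 2, 28),
    date(2017, 3, 1),
]


def events_per_month(days):
    """Count events per (year, month)."""
    return Counter((d.year, d.month) for d in days)


def change_vs_previous(counts, month):
    """Relative change with respect to the previous month, or None if it had no events."""
    year, number = month
    previous = (year - 1, 12) if number == 1 else (year, number - 1)
    if counts[previous] == 0:
        return None
    return (counts[month] - counts[previous]) / counts[previous]


counts = events_per_month(events)
print(counts[(2017, 2)], change_vs_previous(counts, (2017, 2)))
```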

## Goal: understanding Contribution Patterns
Analyzing groups of contributors and their evolution through time could help to understand the real structure of a given community. These groups classify people on a range from those who contribute once in a while to more regular contributors. Looking at the evolution of these groups over time can help to understand how the community is evolving. In addition, these groups could be filtered using the metrics described in the [Community Goal](#Goal:-understanding-the-people-in-Mozilla-Community) to get a more specific insight.
  
**Questions**:
 * How often are people contributing?
 * Do we have a strong community in terms of contributions?
 * Is the contribution pattern moving to a different pattern?
 * How many people are the core, regular and occasional contributors?
 * How do people flow between contribution categories (core, regular and occasional)?
 
**Metrics**:
 * Population of each group of contributors over time (by quarters).
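
The population of each group per quarter could be computed along the following lines. This report does not define the boundaries between core, regular and occasional contributors, so the thresholds used here are an assumption made only for illustration: contributors are sorted by number of contributions, "core" are those needed to reach 80% of the contributions of the quarter, "regular" those needed to go from there to 95%, and "occasional" the rest. The sample data and names are made up.

```python
from collections import Counter

# Made-up sample data: contributions per author in each quarter
contributions_by_quarter = {
    "2017Q1": {"ana": 60, "ben": 25, "cho": 10, "dee": 3, "eli": 2},
    "2017Q2": {"ana": 5, "ben": 90, "fay": 5},
}


def classify(contributions, core_percent=80, regular_percent=95):
    """Assign each author to core/regular/occasional (thresholds are assumptions)."""
    total = sum(contributions.values())
    groups = {}
    covered = 0  # contributions by authors already classified
    ranked = sorted(contributions.items(), key=lambda item: (-item[1], item[0]))
    for author, count in ranked:
        if covered * 100 < core_percent * total:
            groups[author] = "core"
        elif covered * 100 < regular_percent * total:
            groups[author] = "regular"
        else:
            groups[author] = "occasional"
        covered += count
    return groups


population = {
    quarter: Counter(classify(contributions).values())
    for quarter, contributions in contributions_by_quarter.items()
}
for quarter, groups in sorted(population.items()):
    print(quarter, dict(groups))
```

Comparing the group of each author in consecutive quarters would give the flow of people between categories.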

## Goal: understanding what the community is doing
This goal is about understanding how the community works, by identifying its activity in different areas. These areas might be divided in terms of what people are doing (code, support, etc.) and/or where they are contributing within the product pipeline. The analysis will focus on characterizing the projects/repos where people are contributing, using features such as programming languages and types of contributions.

**Questions**:
 * On what Projects or repos (**groups of repos**) are people working?
 * What is the type of their contributions? (Coding, Support, Issues, etc.)
 * Where are they contributing in the product development pipeline? (Coding, Solving issues, sending patches, discussion, etc.)
 * ~~What time they are giving?~~ (~~Activity~~)
 * What programming languages are they using?
 * Are they contributing more to repos in particular languages?
 
**Metrics**:
 * Number of people by project/repo depending on the base programming language of the repos.
 * Number of contributions grouped by type and/or by project/repo.
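
A minimal sketch of both metrics follows, assuming that each repository has a single base language and that each contribution is labeled with a type. The repository names, languages, contribution types and function name are made-up sample data, not taken from the dashboard.

```python
from collections import Counter, defaultdict

# Made-up sample data: base language per repository
repo_language = {"repo-a": "Rust", "repo-b": "JavaScript", "repo-c": "Rust"}

# Made-up sample data: (author, repository, type of contribution)
contributions = [
    ("alice", "repo-a", "commit"),
    ("bob", "repo-b", "issue"),
    ("alice", "repo-c", "commit"),
    ("carol", "repo-a", "review"),
]


def people_by_language(records, languages):
    """Number of distinct contributors per base language of the repos."""
    people = defaultdict(set)
    for author, repo, _kind in records:
        people[languages[repo]].add(author)
    return {language: len(authors) for language, authors in people.items()}


by_language = people_by_language(contributions, repo_language)
by_type = Counter(kind for _author, _repo, kind in contributions)
print(by_language, dict(by_type))
```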

## Goal: Community Demography Analysis
Measuring how many contributors join and leave the community is a good measure of that community's health. A project can be considered alive if its contributors are active. People whose last contribution was a long time ago can be considered 'gone' from the community's perspective. Analyzing the evolution of attraction and retention of contributors, i.e. the difference between people leaving and new people joining, is what we call here demography analysis and, as mentioned above, it is intended to give an idea of the community's health.

These data could be filtered using some of the metrics described in the [Community Goal](#Goal:-understanding-the-people-in-Mozilla-Community) to get a more specific insight, based on project and/or organization.

**Questions**:
 * How many people are joining the community?
 * How many people are no longer active (leaving) in the community?
 * How is the attraction / retention ratio over time?

**Metrics**:
   * People joining the community per month.
   * People leaving the community per month.
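
The two metrics can be sketched from the first and last contribution of each person. The report does not fix how long ago a last contribution must be for someone to count as gone, so the 180-day threshold below is an assumption for illustration only. The sample data and function name are made up.

```python
from collections import Counter
from datetime import date, timedelta

# Made-up sample data: author -> (first contribution, last contribution)
periods = {
    "alice": (date(2016, 1, 10), date(2016, 3, 2)),
    "bob": (date(2016, 1, 25), date(2017, 5, 30)),
    "carol": (date(2016, 3, 15), date(2016, 3, 20)),
}


def demography(periods, today, inactive_after=timedelta(days=180)):
    """Count people joining and leaving per (year, month).

    A person joins in the month of their first contribution, and is counted as
    leaving in the month of their last one if it is older than `inactive_after`
    (an assumed threshold, not defined in this report).
    """
    joining, leaving = Counter(), Counter()
    for first, last in periods.values():
        joining[(first.year, first.month)] += 1
        if today - last > inactive_after:
            leaving[(last.year, last.month)] += 1
    return joining, leaving


joining, leaving = demography(periods, today=date(2017, 6, 30))
print(dict(joining), dict(leaving))
```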

## Goal: understanding how the community is connected
The aim is to analyze connections within the community in order to understand relations between projects, finding groups and the relationships among them.

**Questions**:
 * Do organizations contribute to particular projects?
 * How many repos does the same organization/developer contribute to?

**Metrics**:
  * Number of contributions by organization and repository.
  * Number of contributors by organization and repository.
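
Both metrics, and the number of repositories each organization contributes to, can be sketched from a list of contributions annotated with author, organization and repository. The records, organization names and function name below are made-up sample data.

```python
from collections import defaultdict

# Made-up sample data: (author, organization, repository)
records = [
    ("alice", "Mozilla", "repo-a"),
    ("alice", "Mozilla", "repo-a"),
    ("bob", "Mozilla", "repo-b"),
    ("carol", "OrgB", "repo-a"),
    ("dave", "Mozilla", "repo-a"),
]


def by_organization_and_repo(records):
    """Return (contributions, contributors), both keyed by (organization, repo)."""
    contributions = defaultdict(int)
    authors = defaultdict(set)
    for author, organization, repo in records:
        contributions[(organization, repo)] += 1
        authors[(organization, repo)].add(author)
    contributors = {key: len(people) for key, people in authors.items()}
    return dict(contributions), contributors


contributions, contributors = by_organization_and_repo(records)

# Repositories touched by each organization: a simple view of how projects connect
repos_per_org = defaultdict(set)
for organization, repo in contributions:
    repos_per_org[organization].add(repo)

print(contributions, contributors, dict(repos_per_org))
```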

## Goal: organizations Diversity Analysis
In this section we will focus on analyzing diversity in terms of the organizations contributing to the Mozilla community. Looking at contributions by organization will help us to find whether there is a dependency on any single organization or a small group of them, allowing us to see how diverse the community is in terms of organizations. In the same way, we will use gender information to take a deeper look at diversity.

**Questions**:
 * Which organizations (beyond Mozilla) are contributing to the project?
 * How many organizations does a project depend on?
 * How gender-diverse is the community?

**Metrics**:
 * Number of organizations contributing to projects.
 * Bitergia's Elephant Factor, defined as the minimum number of organizations whose employees perform 50% of the commits. This number indicates how many organizations the community depends on.
 * Contributions divided by gender.
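
Following the definition above, the Elephant Factor can be computed by sorting organizations by number of commits and counting how many are needed to reach 50% of the total. This is a minimal sketch with made-up commit counts and organization names; the function name is hypothetical.

```python
def elephant_factor(commits_by_org, share_percent=50):
    """Minimum number of organizations that together perform `share_percent`% of commits."""
    total = sum(commits_by_org.values())
    covered = 0
    ranked = sorted(commits_by_org.values(), reverse=True)
    for number_of_orgs, count in enumerate(ranked, start=1):
        covered += count
        if covered * 100 >= share_percent * total:
            return number_of_orgs
    return 0  # no commits at all


# Made-up sample data: commits per organization
concentrated = {"Mozilla": 70, "OrgB": 20, "Independent": 10}
spread = {"OrgA": 30, "OrgB": 30, "OrgC": 25, "OrgD": 15}

print(elephant_factor(concentrated))  # one organization already covers 50%
print(elephant_factor(spread))        # two organizations are needed
```

A low value means that the project depends heavily on a few organizations.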