# Report Definition



## Summary
Mozilla is interested in gaining a better understanding of the current state of its volunteer
contributors, contributions, communities, and their connection to one another. Mozilla has
accumulated a substantial amount of data over the years (and will generate more
through a Community Census that is being undertaken in parallel to this work), but it
has not been analyzed in any significant or rigorous way.

## Scope
The scope of this project is to analyze and explore data already available in the [Mozilla Analytics
Dashboard](http://analytics.mozilla.community) (including the database that supports it), related to the following areas:
* Understanding the people. Characterization of the community from several points of view, including “age” (time contributing) in the project, history (different kinds of contributions), migration between Mozilla projects, affiliation (Mozilla employees or independent volunteers), gender (based on a name analysis), geographical distribution (based on time zone data and GitHub data, when available), performance, etc.
* Understanding activity. Characterization of activity in the different Mozilla projects, including evolution of activity over time, key performance factors, key good practices in specific projects, in the context of the groups characterized in the previous item. 
* Understanding interconnection. Learning how contributors work with each other in common projects, how they migrate together (or not), how they evolve, and learning about the key people acting as “connectors” between projects.

The analysis proposed in this document is limited to data available for retrieval from the following data sources:
* Git
* GitHub
* Bugzilla
* Mailing Lists
* Discourse

In addition, all metrics based on affiliations will depend on the quality of the affiliation data provided by Mozilla.

## Methodology
To focus on the questions that are most relevant to Mozilla, Bitergia will use the [Goal – Question – Metric (GQM) methodology](https://en.wikipedia.org/wiki/GQM), based on:
1. A set of interviews or meetings with Mozilla staff to understand the goals being pursued and define them together.
2. A set of questions that help determine whether the previously defined goals are being reached, or how the organization is pursuing them.
3. A set of metrics that provide data-based facts to answer the previous questions.

## Requirements
* Code review process
  * List of labels used for GitHub and Bugzilla
* For tracking `Good first bug` tickets in GitHub, we would need to know which tag is used, if it is not exactly this label in GitHub.
* Repository list per project across data sources.
  * Bitergia will provide a repository list to fill in with information related to projects, repos or groups of repos unavailable from original data sources.

### Understanding Mozilla Data
* Understand how the code review process works in GitHub and Bugzilla.
    * Find out which flags mark items as part of the code review process.
* Need to create new enriched indexes.
    * Information about contribution type and programming language in Git.
    * Demographics analysis for other data sources.
* New Data Sources:
    * MDN
    * Localizations (Git or Mozilla's own tool)
    * Forums (Kitsune): test the hypothesis that Kitsune is a good entry point for new contributors.

## Understanding the people in Mozilla community
Characterize the Mozilla community based on its contributors. A contributor is a person who performs an action that Bitergia can track within the set of data sources specified in the Scope section at the beginning of this document. Examples of such actions are sending a commit, or opening or closing a ticket. Because the actions differ by data source, the particular actions used in each analysis are detailed within each goal.

There are two main areas to study:
  * Community: related to the people contributing and their characterization.
  * Activity: related to actions performed by contributors.

Both are detailed in the following subsections.

### Community

  * **Definition**: 
    * To better understand the Mozilla community, we first define a set of characteristics from the point of view of the people contributing. These characteristics are:
      * Projects.
      * Organizations.
      * Gender.
      * Age (understood as time contributing)
      * Geographical information. To infer this, we rely on time zones or city name (if available in GitHub).
    
  * **Limitations**:
    * Projects and Organizations depend on information provided by the customer.
    * Gender is guessed from contributors' names using the genderize.io API, so its accuracy is limited by that API.
    * Because we try to place people based on time zones (the only information generally available for this), we are limited by what time zones can tell us. On the one hand, a single time zone covers several geographical places that are not close to each other. On the other hand, time zones depend on the user's system configuration, so in some cases they may not be accurate.
    * If we use city names from GitHub profiles, we will be limited by the availability and granularity of this information.
  
  * **Questions**:
    * What organizations have more contributors?
    * Are contributors hired by Mozilla?
    * What is the gender of contributors?
    * How long have people been contributing?
    * What are the most common time zones where contributions are coming from?
  
  * **Metrics**:
    * Contributors by organization.
    * Contributors in two groups: hired by Mozilla and the rest.
    * Contributors by gender.
    * Time from first to last commit of each contributor.
    * Contributors by time zone (or city name if possible in terms of data availability)
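
As an illustration of the "time from first to last commit" metric, the following is a minimal sketch in Python. The input shape (pairs of author and ISO date) and the function name `contribution_age_days` are assumptions made for this example; they do not describe the actual Bitergia indexes.

```python
from datetime import datetime


def contribution_age_days(commits):
    """Return the days between first and last commit for each author.

    `commits` is an iterable of (author, iso_date) pairs. This shape is
    hypothetical and only used for illustration.
    """
    first_last = {}
    for author, iso_date in commits:
        when = datetime.fromisoformat(iso_date)
        first, last = first_last.get(author, (when, when))
        first_last[author] = (min(first, when), max(last, when))
    return {a: (last - first).days for a, (first, last) in first_last.items()}


# Made-up sample data.
sample = [
    ("alice", "2015-01-10"),
    ("alice", "2016-01-10"),
    ("bob", "2016-03-01"),
]
ages = contribution_age_days(sample)
```

A contributor with a single commit gets an age of 0 days, so the result also separates one-time contributors from the rest.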

### Activity

  * **Definition**: 
    * Based on the [Community](#Community) characterization, we will look at activity within the community over time in order to show how it evolves. Activity is understood here as the actions performed by contributors (commits, actions on tickets, sent e-mails and so on). As stated before, these actions depend on the data source. At a minimum, we define the following actions:
      * Git: sending a commit.
      * GitHub: open tickets, closed tickets, submitted reviews, closed reviews.
      * Bugzilla: open tickets, closed tickets.
      * Mailing Lists: sent e-mails.
      * Discourse: questions, answers.
          
  * **Questions**:
    * How does activity in the last period (e.g. month) compare to previous ones?
    
  * **Metrics**: 
    * Git:
      * Number of commits.
    * GitHub:
      * Number of open tickets.
      * Number of closed tickets.
      * Number of submitted reviews.
      * Number of closed reviews
    * Bugzilla:
      * Number of open tickets.
      * Number of closed tickets.
    * Mailing lists:
      * Number of sent e-mails.
    * Discourse:
      * Number of questions.
      * Number of answers.
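
The comparison between the last period and the previous one can be sketched as follows. This is a minimal Python example, assuming each action (commit, ticket, e-mail and so on) is available as an ISO date string; the function names are hypothetical.

```python
from collections import Counter


def monthly_counts(iso_dates):
    """Count actions per month, keyed by 'YYYY-MM'."""
    return Counter(d[:7] for d in iso_dates)


def last_period_change(counts):
    """Relative change between the two most recent months with activity.

    Months without any action do not appear in `counts`, so this compares
    the last two *active* months, which is an assumption of this sketch.
    """
    months = sorted(counts)
    if len(months) < 2:
        return None
    previous, last = counts[months[-2]], counts[months[-1]]
    return (last - previous) / previous


# Made-up sample data: two actions in April, three in May.
dates = ["2016-04-02", "2016-04-15", "2016-05-01", "2016-05-09", "2016-05-20"]
counts = monthly_counts(dates)
change = last_period_change(counts)
```

The same two functions apply to any of the data sources above, since each metric reduces to counting dated actions.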

## Contribution Patterns
Analyze groups of contributors and their evolution through time. These groups help to classify people, from those contributing once in a while to more regular contributors. Looking at how these groups change over time can help to understand how the community is evolving. In addition, these groups could be filtered using the metrics described in [Community](#Community) to get more specific insight.

 * **Definition**:
   * Define groups of contributors based on number of contributions. These groups could be:
     * People contributing twice or less.
     * People contributing more than twice but at most 10 times.
     * People contributing more than 10 but at most 20 times.
     * People contributing more than 20 times.
   * These groups could be modified depending on actual data on each data source.
 * **Questions**:
   * How often are people contributing?
   * Do we have a strong community in terms of contributions?
   * Is the contribution pattern moving to a different pattern?
 * **Metrics**:
   * Population of each group of contributors over time.
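
A minimal Python sketch of the grouping follows. The input shape (pairs of person and period) and the group labels are assumptions for illustration. People with exactly 10 or exactly 20 contributions are placed in the lower group here, which is also an assumption of this sketch.

```python
from collections import Counter, defaultdict


def contribution_group(n):
    """Map a number of contributions to a group label."""
    if n <= 2:
        return "<=2"
    if n <= 10:
        return "3-10"
    if n <= 20:
        return "11-20"
    return ">20"


def group_population_by_period(events):
    """Return {period: Counter of group label -> number of people}.

    `events` is an iterable of (person, period) pairs, one per contribution.
    """
    per_period = defaultdict(Counter)
    for person, period in events:
        per_period[period][person] += 1
    return {
        period: Counter(contribution_group(n) for n in counts.values())
        for period, counts in per_period.items()
    }


# Made-up sample data.
events = [
    ("alice", "2016"), ("alice", "2016"), ("alice", "2016"),
    ("bob", "2016"),
    ("alice", "2017"),
]
population = group_population_by_period(events)
```

Changing the thresholds in `contribution_group` is enough to adapt the groups to the actual data of each data source.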

## Understanding what the community is doing
Understand how the community is working by identifying its activity in different areas. These areas might be divided in terms of what people are doing (code, support, etc.) and/or where they are contributing within the product pipeline.

 * **Definition**: 
   * Characterize the projects/repos where people are contributing, using features such as programming languages and types of contributions.
 * **Questions**:
   * On what Projects or repos (**groups of repos**) are people working?
   * What is the type of their contributions? (Coding, Support, Issues, etc.)
   * Where are they contributing in the product development pipeline? (Coding, Solving issues, sending patches, discussion, etc.)
   * ~~What time they are giving?~~ (~~Activity~~)
   * What programming languages are they using?
   * Are they contributing more to repos in particular languages?
 * **Metrics**:
   * Number of people by project/repo depending on the base programming language of the repos.
   * Number of contributions grouped by type and/or by project/repo.



## Community Demography Analysis
Analyze the evolution of attraction and retention of contributors. These data could be filtered using some of the metrics described in [Community](#Community) to get more specific insight, for example by project and/or organization.

* **Definition**: 
   * Calculating the contributors that join or leave the community in a given period.
* **Questions**:
   * How many people are joining the community?
   * How many people are no longer active in (leaving) the community?
   * How does the attraction / retention ratio evolve over time?
* **Metrics**:
   * People joining the community.
   * People leaving the community.
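
One possible way to compute these two metrics is sketched below in Python. This document does not define when a contributor counts as having left, so the sketch assumes a hypothetical rule: a person leaves in a period when their last action falls in that period and nothing has been seen for at least `inactivity` (180 days by default) as of a reference date. All names and thresholds are illustrative.

```python
from datetime import date, timedelta


def joiners_and_leavers(activity, start, end, as_of,
                        inactivity=timedelta(days=180)):
    """Return (joined, left) sets of people for the period [start, end].

    `activity` maps each person to a list of dates with at least one action.
    The inactivity rule used for "left" is an assumption of this sketch.
    """
    joined, left = set(), set()
    for person, days in activity.items():
        first, last = min(days), max(days)
        if start <= first <= end:
            joined.add(person)
        if start <= last <= end and as_of - last >= inactivity:
            left.add(person)
    return joined, left


# Made-up sample data.
activity = {
    "alice": [date(2015, 2, 1), date(2016, 3, 10)],
    "bob": [date(2016, 2, 1), date(2016, 2, 20)],
    "carol": [date(2014, 5, 5), date(2016, 11, 30)],
}
joined, left = joiners_and_leavers(
    activity, date(2016, 1, 1), date(2016, 12, 31), as_of=date(2017, 3, 1)
)
```

In this sample, carol's last action is too recent to count as leaving, which shows why the most recent periods should be read with care.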

## Understanding how the community is connected
Analyze connections within the community in order to find relations between projects.

 * **Definition**: 
   * Finding groups and relationships among them.
 * **Questions**:
   * Do organizations contribute to particular projects?
   * How many repos receive contributions from the same organization/developer?
 * **Metrics**:
    * Number of contributions by organization and repository.
    * Number of contributors by organization and repository.
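
Both metrics can be derived in a single pass over the contributions. The following Python sketch assumes each contribution is available as an (author, organization, repository) triple; this shape, the function name and the sample names are hypothetical.

```python
from collections import defaultdict


def org_repo_metrics(contributions):
    """Return (contributions, contributors) keyed by (organization, repo)."""
    counts = defaultdict(int)
    people = defaultdict(set)
    for author, org, repo in contributions:
        counts[(org, repo)] += 1
        people[(org, repo)].add(author)
    contributors = {key: len(authors) for key, authors in people.items()}
    return dict(counts), contributors


# Made-up sample data.
sample = [
    ("alice", "OrgA", "repo-1"),
    ("alice", "OrgA", "repo-1"),
    ("bob", "OrgA", "repo-1"),
    ("carol", "OrgB", "repo-2"),
]
n_contributions, n_contributors = org_repo_metrics(sample)
```

Pairs that share an organization but differ in repository show where that organization connects projects.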

## Organizations Diversity Analysis
In this section we focus on analyzing diversity in terms of the organizations contributing to the Mozilla community. We use the Elephant Factor, defined as the minimum number of organizations whose employees perform 50% of the commits. This number provides an indication of how many people or organizations the community depends on.

 * **Definition**: 
   * Displaying contributions based on the organizations involved. Determining whether there is a dependency on one organization or a small group of them.
 * **Questions**:
   * Which organizations (beyond Mozilla) are contributing to the project?
   * What is Bitergia's Elephant Factor for Mozilla products?
 * **Metrics**:
   * Number of organizations contributing to projects.
   * Bitergia's Elephant Factor.
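
The Elephant Factor definition above can be sketched in Python as follows. This is an illustrative implementation, not Bitergia's own code: it reads "50% of the commits" as "at least 50%", and the organization names in the sample are made up.

```python
def elephant_factor(commits_by_org, share=0.5):
    """Minimum number of organizations that cover `share` of all commits.

    `commits_by_org` maps an organization name to its number of commits.
    """
    total = sum(commits_by_org.values())
    if total == 0:
        return 0
    covered = 0
    # Take organizations from the largest contributor downwards.
    ranked = sorted(commits_by_org.values(), reverse=True)
    for rank, commits in enumerate(ranked, start=1):
        covered += commits
        if covered >= share * total:
            return rank
    return len(ranked)


# Made-up sample data.
concentrated = elephant_factor({"OrgA": 60, "OrgB": 25, "OrgC": 15})
spread = elephant_factor({"OrgA": 30, "OrgB": 30, "OrgC": 25, "OrgD": 15})
```

A value of 1 means a single organization accounts for at least half of the commits; higher values indicate that the work is spread across more organizations.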

## Community Onion Model Analysis
 * **Definition**:
   * ...
 * **Questions**:
   * How many people are core, regular and occasional contributors?
   * How does each type of contributor evolve over time? What is the trend?
 * **Metrics**:
   * ...



## Community Contributors Funnel
 * **Definition**:
   * ...
 * **Questions**:
   * How do people flow between contribution categories (core, regular and occasional)?
   * How do people flow between engagement channels?
 * **Metrics**:
   * ...


## Community Diversity Analysis

 * **Definition**:
   * ...
 * **Questions**:
   * How gender-diverse is the community?
 * **Metrics**:
   * ...