# Extracting Flow Data From Mozilla Systems

Our goal is to analyze the efficiency of the Firefox engineering workflow.

According to Dr. Mik Kersten's *Flow Framework*, the Ground Truth of a developer workflow consists of the artifacts generated by the developer's tools. To measure Firefox development flow we can extract the artifact data from Mozilla's various tool systems.

*Flow Time* is measured as the time from when a developer first picks up an issue until the change is delivered into the user's hands. For Firefox, Flow Time starts when a bug is assigned to or picked up by a developer, and ends when the commit(s) addressing the bug land in Firefox Nightly.

To calculate Flow Time we can trace the flow of user-delivered changes in reverse through the Firefox developer tool network.

1. Find the list of commits shipped in each nightly build
1. For each commit, find the time at which the bug it addresses was assigned or created
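
The measurement itself reduces to a subtraction of two timestamps. The following is a minimal sketch, assuming timezone-aware `datetime` values; the helper name `flow_time` is hypothetical, and the two timestamps are the bug creation time and Nightly publish time that appear for bug 1519107 later in this notebook.

```python
from datetime import datetime, timedelta, timezone


def flow_time(started_at: datetime, delivered_at: datetime) -> timedelta:
    """Return the elapsed time between the start of work and delivery."""
    if delivered_at < started_at:
        raise ValueError("delivery cannot precede the start of work")
    return delivered_at - started_at


# Timestamps for bug 1519107 as they appear in the notebook output.
started = datetime(2019, 1, 10, 14, 32, 35, tzinfo=timezone.utc)
delivered = datetime(2019, 1, 22, 12, 16, 52, tzinfo=timezone.utc)

example_flow_time = flow_time(started, delivered)
print(example_flow_time)
```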


In [16]:
# pre-amble
import dateutil.parser  # import the submodule explicitly; a bare `import dateutil` may not expose dateutil.parser
import json
from pprint import pprint
from mozautomation import commitparser
from tqdm import tnrange, tqdm_notebook

# We will use an on-disk web cache to save traffic and round-trip time when re-running the notebook.
import requests
from cachecontrol import CacheControl
from cachecontrol.caches.file_cache import FileCache
sess = CacheControl(requests.Session(), cache=FileCache('.web_cache'))

We will start by fetching the list of changeset IDs between two sequential nightly builds taken from the Firefox [Release catalogue](https://mozilla-services.github.io/buildhub/?platform[0]=linux-x86_64&locale[0]=en-US&products[0]=firefox).

In [17]:
nightly_changeset_id = "f0c23db0d035dbe81e23eb4d619e493e38582d24"
nightly_publish_time = "2019-01-22T12:16:52Z"
previous_nightly_changeset_id = "44369796f148630ff496be99f77a5eeea41c7d23"
previous_nightly_publish_time = "2019-01-22T00:12:01Z"

# See https://mozilla-version-control-tools.readthedocs.io/en/latest/hgmo/pushlog.html#hgweb-commands for the URL structure.
hgmo_search_url = f"https://hg.mozilla.org/mozilla-central/json-pushes/?fromchange={previous_nightly_changeset_id}&tochange={nightly_changeset_id}&version=2"

r = sess.get(hgmo_search_url)
r.raise_for_status()
pushdata = r.json()

print(len(pushdata['pushes']))

4


Changesets are nested in chronological order inside pushes.  We'll flatten the nested lists to get just the changesets.

In [18]:
changes = []
for push in pushdata["pushes"].values():
    for changeset_id in push["changesets"]:
        changes.append(changeset_id)

print(len(changes))
pprint(changes[0:5])

95
['cbbf07c28138f414b792059ddb07423848a413d6',
 'd532474e710f8a58ab10a2702dca432bc2eefa69',
 'c367b5259d4683c8d7fcf7efdda68f6ca93f9913',
 '05ab29790c5718f31d2910f8728af5f742eaf14d',
 'f833d5220c821a2c19c60c6ac5378c677bebbe71']
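
The flattening step relies on the order in which the pushes appear in the JSON object. As a hedged alternative, the sketch below sorts the pushes by their numeric ID before flattening. It assumes that the keys of `pushes` are numeric strings and that a higher push ID means a later push; the function name `flatten_pushes` and the sample data are made up for illustration.

```python
def flatten_pushes(pushes: dict) -> list:
    """Flatten {push_id: {"changesets": [...]}} into one list, ordered by push ID."""
    changes = []
    # Sort numerically: as strings, "10" would sort before "9".
    for push_id in sorted(pushes, key=int):
        changes.extend(pushes[push_id]["changesets"])
    return changes


# Hypothetical sample data in the same shape as pushdata["pushes"].
sample_pushes = {
    "10": {"changesets": ["ccc", "ddd"]},
    "9": {"changesets": ["aaa", "bbb"]},
}

flattened = flatten_pushes(sample_pushes)
print(flattened)
```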


Now that we have the list of changes we need to determine when each change was first worked on.

Every author landing a code change in the Firefox source tree is supposed to include a bug number in the first line of their commit message. We can parse the commit message to extract this bug ID.

In [20]:
# WARNING: THIS STEP CAN TAKE A WHILE TO COMPLETE

def fetch_rev_summary(rev_id):
    r = sess.get(f"https://hg.mozilla.org/mozilla-central/json-rev/{rev_id}")
    r.raise_for_status()
    return r.json()["desc"]

published_ts = dateutil.parser.parse(nightly_publish_time)
flow_data = []
for rev_id in tqdm_notebook(changes, desc="Changeset data import"):
    summary = fetch_rev_summary(rev_id)
    bug_ids = commitparser.parse_bugs(summary)
    # Note that not all summaries will contain a bug ID, e.g. merges don't have one.
    if bug_ids:
        # Take just the first bug ID.  Multiple bugs don't happen in practice.
        # (Index 0 is the first match; .pop() would return the last one.)
        bug_id = bug_ids[0]
        flow_data.append(dict(rev=rev_id, bug=int(bug_id), published_ts=published_ts))

pprint(flow_data[0:5])


[{'bug': 1519107,
  'published_ts': datetime.datetime(2019, 1, 22, 12, 16, 52, tzinfo=tzutc()),
  'rev': 'd532474e710f8a58ab10a2702dca432bc2eefa69'},
 {'bug': 1519107,
  'published_ts': datetime.datetime(2019, 1, 22, 12, 16, 52, tzinfo=tzutc()),
  'rev': 'c367b5259d4683c8d7fcf7efdda68f6ca93f9913'},
 {'bug': 1519107,
  'published_ts': datetime.datetime(2019, 1, 22, 12, 16, 52, tzinfo=tzutc()),
  'rev': '05ab29790c5718f31d2910f8728af5f742eaf14d'},
 {'bug': 1493184,
  'published_ts': datetime.datetime(2019, 1, 22, 12, 16, 52, tzinfo=tzutc()),
  'rev': 'f833d5220c821a2c19c60c6ac5378c677bebbe71'},
 {'bug': 1521518,
  'published_ts': datetime.datetime(2019, 1, 22, 12, 16, 52, tzinfo=tzutc()),
  'rev': '88b4f92e2197c7478923e9d399adb62ed6f415ef'}]
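
To illustrate what extracting a bug ID from a commit message involves, here is a simplified stand-in that uses a single regular expression on the first line of the message. It is not the logic of `commitparser.parse_bugs`, which handles more cases; the pattern, the function name `first_bug_id`, and the sample messages are assumptions made for this sketch.

```python
import re

# Simplified pattern: the word "bug", optional "#", then the digits of the ID.
BUG_RE = re.compile(r"\bbug\s+#?(\d+)", re.IGNORECASE)


def first_bug_id(commit_message: str):
    """Return the first bug ID on the first line of the message, or None."""
    lines = commit_message.splitlines()
    if not lines:
        return None
    match = BUG_RE.search(lines[0])
    return int(match.group(1)) if match else None


# Hypothetical commit messages for illustration.
with_bug = first_bug_id("Bug 1519107 - Fix a thing. r=reviewer\n\nLonger description.")
without_bug = first_bug_id("Merge inbound to mozilla-central. a=merge")
print(with_bug, without_bug)
```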


Now that we have the bug numbers we can fetch the bug creation times and calculate the end-to-end flow time.

In [22]:
# WARNING: THIS STEP CAN TAKE A WHILE TO COMPLETE

def fetch_bug_creation_time(bug_id):
    r = sess.get(f"https://bugzilla.mozilla.org/rest/bug/{bug_id}")
    r.raise_for_status()
    return dateutil.parser.parse(r.json()["bugs"][0]["creation_time"])


for change in tqdm_notebook(flow_data, desc="Bug data import"):
    bug_creation_ts = fetch_bug_creation_time(change["bug"])
    flow_time = change['published_ts'] - bug_creation_ts
    change.update({'bug_creation_ts': bug_creation_ts, 'flow_time': flow_time})
    
pprint(flow_data[0:5])



[{'bug': 1519107,
  'bug_creation_ts': datetime.datetime(2019, 1, 10, 14, 32, 35, tzinfo=tzutc()),
  'flow_time': datetime.timedelta(11, 78257),
  'published_ts': datetime.datetime(2019, 1, 22, 12, 16, 52, tzinfo=tzutc()),
  'rev': 'd532474e710f8a58ab10a2702dca432bc2eefa69'},
 {'bug': 1519107,
  'bug_creation_ts': datetime.datetime(2019, 1, 10, 14, 32, 35, tzinfo=tzutc()),
  'flow_time': datetime.timedelta(11, 78257),
  'published_ts': datetime.datetime(2019, 1, 22, 12, 16, 52, tzinfo=tzutc()),
  'rev': 'c367b5259d4683c8d7fcf7efdda68f6ca93f9913'},
 {'bug': 1519107,
  'bug_creation_ts': datetime.datetime(2019, 1, 10, 14, 32, 35, tzinfo=tzutc()),
  'flow_time': datetime.timedelta(11, 78257),
  'published_ts': datetime.datetime(2019, 1, 22, 12, 16, 52, tzinfo=tzutc()),
  'rev': '05ab29790c5718f31d2910f8728af5f742eaf14d'},
 {'bug': 1493184,
  'bug_creation_ts': datetime.datetime(2018, 9, 21, 14, 47, 25, tzinfo=tzutc()),
  'flow_time': datetime.timedelta(122, 77367),
  'published_ts': date
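
Several changesets can address the same bug, as with bug 1519107 in the output above, so any summary of these records would count that bug more than once. The sketch below shows one possible way to collapse the records to one flow time per bug and take a median. The function name `flow_time_per_bug` is hypothetical, and the sample records reuse the two flow times visible in the output.

```python
from datetime import timedelta
from statistics import median


def flow_time_per_bug(records):
    """Keep one flow time per bug ID (the first one encountered)."""
    per_bug = {}
    for record in records:
        per_bug.setdefault(record["bug"], record["flow_time"])
    return per_bug


# Sample records shaped like the entries of flow_data.
sample = [
    {"bug": 1519107, "flow_time": timedelta(11, 78257)},
    {"bug": 1519107, "flow_time": timedelta(11, 78257)},
    {"bug": 1493184, "flow_time": timedelta(122, 77367)},
]

per_bug = flow_time_per_bug(sample)
# Median flow time across unique bugs, expressed in days.
median_days = median(ft.total_seconds() for ft in per_bug.values()) / 86400
print(len(per_bug), round(median_days, 1))
```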

In [24]:
import datetime

def serialize_dates_and_times(obj):
    if isinstance(obj, datetime.datetime):
        return obj.isoformat()
    elif isinstance(obj, datetime.timedelta):
        return obj.total_seconds()
    else:
        raise TypeError(f"Could not serialize {obj}")


with open("data/flowdata.json", "w") as f:
    json.dump(flow_data, f, default=serialize_dates_and_times)
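
Reading the file back requires the inverse conversion, because the timestamps were written as ISO 8601 strings and the flow times as seconds. The following is a minimal sketch of that round trip, assuming Python 3.7+ for `datetime.fromisoformat`; the helper name `restore_record` is hypothetical, and the sample record mirrors the first entry of `flow_data`.

```python
import json
from datetime import datetime, timedelta


def restore_record(record: dict) -> dict:
    """Convert serialized timestamps and durations back to datetime objects."""
    restored = dict(record)
    for key in ("published_ts", "bug_creation_ts"):
        if key in restored:
            restored[key] = datetime.fromisoformat(restored[key])
    if "flow_time" in restored:
        restored["flow_time"] = timedelta(seconds=restored["flow_time"])
    return restored


# A record in the shape written by serialize_dates_and_times.
serialized = json.dumps([{
    "rev": "d532474e710f8a58ab10a2702dca432bc2eefa69",
    "bug": 1519107,
    "published_ts": "2019-01-22T12:16:52+00:00",
    "bug_creation_ts": "2019-01-10T14:32:35+00:00",
    "flow_time": 1028657.0,
}])

restored_records = [restore_record(r) for r in json.loads(serialized)]
print(restored_records[0]["flow_time"])
```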

In [29]:
# Elasticsearch query DSL
esquery = {
    "from": 0,
    "size": 2,
    "query": {
        "term": {}
    }
}

r = sess.post("https://buildhub.moz.tools/api/search", json=esquery)
r.raise_for_status()
r.json()

HTTPError: 500 Server Error: Internal Server Error for url: https://buildhub.moz.tools/api/search

In [26]:
r.content

b'<h1>Server Error (500)</h1>'

In [27]:
r

<Response [500]>

In [28]:
r.headers

{'Server': 'nginx/1.15.4', 'Date': 'Thu, 24 Jan 2019 02:49:19 GMT', 'Content-Type': 'text/html', 'Content-Length': '27', 'X-Response-Time': '302', 'X-Frame-Options': 'SAMEORIGIN', 'Vary': 'Origin', 'strict-transport-security': 'max-age=31536000; includeSubDomains; preload', 'x-content-type-options': 'nosniff', 'x-xss-protection': '1; mode=block', 'X-Sentry-ID': '61907a54f5c14278a76ef7eebbd604ad', 'Via': '1.1 google', 'Alt-Svc': 'clear'}
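
One possible contributor to the error above is the empty `term` clause: in the Elasticsearch query DSL a `term` query names exactly one field, while `match_all` is the clause that takes an empty body. Whether this is what triggered the 500 response is not confirmed here. The sketch below only builds request bodies and sends nothing; the helper `build_search_body` is hypothetical, and the field name `target.channel` is an assumption about the Buildhub schema, not something taken from this notebook.

```python
def build_search_body(field=None, value=None, size=2):
    """Build a search body: match_all when no field is given, else a term query."""
    if field is None:
        query = {"match_all": {}}
    else:
        # A term query maps exactly one field name to the value to match.
        query = {"term": {field: value}}
    return {"from": 0, "size": size, "query": query}


body_all = build_search_body()
# "target.channel" is an assumed field name, used for illustration only.
body_term = build_search_body("target.channel", "nightly")
print(body_all)
print(body_term)
```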