# CRIU (Checkpoint and Restore In Userspace) metrics

crafted by [Sergey Bronnikov](https://bronevichok.ru/), BSD license
<img src="https://static.openvz.org/artwork/CRIU-560px.png" alt="CRIU logo" style="width: 200px;" align="left"/>

## Code

In [9]:
import pandas as pd
import json
import itertools
import collections
import numpy
import re
import datetime
import arrow
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (15,3)
plt.show(block=True)

Check out the [source repository](https://github.com/xemul/criu) and install the Python package [git2json](https://github.com/tarmstrong/git2json), which parses git logs.

In [10]:
#!git2json --git-dir=/home/sergeyb/source/criu/.git > criu-log.json

We can parse the resulting JSON file and take a peek at the data structure.

In [11]:
log = json.load(open('criu-log.json'))
print log[0]

{u'committer': {u'date': 1502293902, u'timezone': u'+0300', u'name': u'Pavel Emelyanov', u'email': u'xemul@virtuozzo.com'}, u'author': {u'date': 1500566916, u'timezone': u'-0400', u'name': u'Adrian Reber', u'email': u'areber@redhat.com'}, u'tree': u'c84f48490e244a3276f4a007c157973f0b3f0b27', u'parents': [u'f2899a728cf2baf79655b5b2559f826af7c8452d'], u'commit': u'f07adae6905d10533928209637f003d025bf8140', u'message': u"compel/s390: glibc renamed ucontext to ucontext_t\n\nThe upcoming glibc release renamed 'struct ucontext' to\n'struct ucontext_t':\n\nhttps://sourceware.org/git/gitweb.cgi?p=glibc.git;a=commitdiff;h=251287734e89a52da3db682a8241eb6bccc050c9;hp=c86ed71d633c22d6f638576f7660c52a5f783d66\n\nInstead of using 'struct ucontext' this patch changes it\nto the typedef ucontext_t which already exists in older and\nnew versions of glibc.\n\nSigned-off-by: Adrian Reber <areber@redhat.com>\nReviewed-by: Dmitry Safonov <dsafonov@virtuozzo.com>\nReviewed-by: Michael Holzheu <holzheu@linux

In [12]:
def commit_ts_to_date(commit_num):
    commit_ts = log[commit_num]["committer"]["date"]
    commit_date = datetime.datetime.fromtimestamp(int(commit_ts)).strftime('%Y-%m-%d')
    return commit_date

print "First commit on: ", commit_ts_to_date(-1)
print "Last commit on: ", commit_ts_to_date(0)  # log[0] is the most recent commit

print "Number of commits = ", len(log)

committers = [c['committer']['name'] for c in log]
print "Number of committers = ", len(set(committers))

authors = [a['author']['name'] for a in log]
print "Number of authors = ", len(set(authors))

changes = [commit for changeset in log for commit in changeset['changes']]
files = [file[2] for file in changes]
print "Number of files = ", len(set(files))

First commit on:  2011-09-23
Last commit on:  2017-08-09
Number of commits =  8441
Number of committers =  5
Number of authors =  103
Number of files =  2765


Next, we compute a simplified version of "code churn", which is [reasonably](https://research.microsoft.com/apps/pubs/default.aspx?id=69126) [effective](http://google-engtools.blogspot.ca/2011/12/bug-prediction-at-google.html) for predicting bugs. More complicated models include lines modified or [take semantic differences into account](http://dl.acm.org/citation.cfm?id=1985456). Here we just count the number of commits that touch each `.c` or `.h` file.

In [21]:
file_changes = lambda: itertools.chain.from_iterable(
    [change[2] for change in commit['changes'] if re.match(r'^.*\.[ch]$', change[2])]  # paths ending in .c or .h
    for commit in log
)

In [22]:
plt.rcParams["figure.figsize"] = (15,6)
fchanges = file_changes()
fchange_count = collections.Counter(fchanges)
a = numpy.average(list(fchange_count.values()))  # mean commits per file, over all matched files
most_common = fchange_count.most_common(20)
df = pd.DataFrame(most_common)
df.head()
df.index = df[0]
df = df[[1]]
df.head()
p = df.plot(kind='bar', legend=False)
p.set_title('Most often changed files in CRIU')
p.set_ylabel('Commits')
plt.hlines(a, 0, len(df), colors='r')

<matplotlib.collections.LineCollection at 0xaad2b10>
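
The commit count above ignores how large each change was. As a rough sketch of the "lines modified" variant mentioned earlier, the snippet below sums added and deleted lines per file. It assumes each entry of `changes` has the shape `[lines_added, lines_deleted, path]`; only the path at index 2 is confirmed by the code in this notebook. `sample_log` and `line_churn` are made-up names for illustration.

```python
import collections
import re

# Hypothetical sample in the assumed shape:
# each change is [lines_added, lines_deleted, path].
sample_log = [
    {"changes": [[10, 2, "criu/mem.c"], [1, 0, "Makefile"]]},
    {"changes": [[3, 3, "criu/mem.c"], [5, 0, "criu/include/mem.h"]]},
]

SOURCE_FILE = re.compile(r"\.[ch]$")


def line_churn(log):
    """Sum added + deleted lines per .c/.h file."""
    churn = collections.Counter()
    for commit in log:
        for added, deleted, path in commit["changes"]:
            if not SOURCE_FILE.search(path):
                continue
            try:
                churn[path] += int(added) + int(deleted)
            except ValueError:
                # Assumption: non-numeric counts may appear (e.g. for binary files); skip them.
                continue
    return churn


churn = line_churn(sample_log)
print(churn.most_common(2))
```

On the real data, `line_churn(log)` could replace `fchange_count` in the plot above, provided the assumed layout of `changes` holds.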

Next, I'll make a simple plot showing weekly commit counts over time, similar to the plots GitHub would show you. I'll create a data frame from a list in the format `[(date_rounded_down_to_week, commit_id)]` and then `groupby()` the date.

In [23]:
def weekly_date_resolution(ts):
    ar = arrow.Arrow.utcfromtimestamp(ts)
    day_of_month = ar.timetuple().tm_mday
    week = int(day_of_month) // 7  # integer division, also under Python 3; bucket labels are days 1, 8, 15, 22, 29
    new_day = (week*7)+1
    assert new_day > 0
    assert new_day < 30
    try:
        day_adjusted = ar.replace(day=new_day)
    except ValueError:
        new_day = day_of_month
        day_adjusted = ar.replace(day=new_day)
    return day_adjusted.date()

commit_times = lambda: (
    (weekly_date_resolution(commit['committer']['date']), commit['commit'])
    for commit in log
)

dfct = pd.DataFrame(commit_times(), columns=['date', 'id'])
dfct = dfct.groupby('date').aggregate(len)
dfct.head()

| date | id |
|------|----|
| 2011-09-22 | 18 |
| 2011-09-29 | 22 |
| 2011-10-01 | 14 |
| 2011-10-08 | 42 |
| 2011-10-15 | 7 |
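
Because `weekly_date_resolution` buckets by day of month, the buckets restart at every month boundary (2011-09-29 is followed by 2011-10-01 in the output). One possible alternative, sketched here under the assumption that calendar weeks starting on Monday (UTC) are acceptable, is to round each timestamp down to the Monday of its week. The sketch is written for Python 3; `week_start` and `sample_log` are illustrative names, not part of the notebook.

```python
import collections
import datetime


def week_start(ts):
    """Return the Monday (UTC) of the week containing Unix timestamp ts."""
    day = datetime.datetime.fromtimestamp(int(ts), tz=datetime.timezone.utc).date()
    return day - datetime.timedelta(days=day.weekday())


# Hypothetical commits; only the fields used here are included.
sample_log = [
    {"commit": "a1", "committer": {"date": 1316736000}},  # 2011-09-23 (Friday)
    {"commit": "b2", "committer": {"date": 1316908800}},  # 2011-09-25 (Sunday)
    {"commit": "c3", "committer": {"date": 1316995200}},  # 2011-09-26 (Monday)
]

weekly = collections.Counter(week_start(c["committer"]["date"]) for c in sample_log)
for start, count in sorted(weekly.items()):
    print(start, count)
```

The resulting `(date, count)` pairs can be passed to `pd.DataFrame` and plotted the same way as `dfct`.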


A few more lines give us a basic plot.

In [24]:
p = dfct.plot(legend=False)
p.set_title('Weekly commits on CRIU')
p.set_ylabel('Commits')

<matplotlib.text.Text at 0x5106a90>

### The most active developers

In [25]:
devs = lambda: (
    (c['committer']['name'], c['changes'])
    for c in log
)

dfad = pd.DataFrame(devs(), columns=['name', 'num'])
dfad = dfad.groupby('name').aggregate(len)
dfad.head()

| name | num |
|------|-----|
| Andrei Vagin | 577 |
| Andrey Vagin | 54 |
| Cyrill Gorcunov | 924 |
| GitHub | 6 |
| Pavel Emelyanov | 6880 |
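
The committer counts list both "Andrei Vagin" and "Andrey Vagin", which appear to be two spellings of one name. A minimal sketch of merging such aliases before counting follows. Treating the two spellings as the same person is an assumption, and `ALIASES` and `canonical_name` are hypothetical helpers, not part of the notebook.

```python
import collections

# Hypothetical alias table; extend it after inspecting the names in the log.
ALIASES = {"Andrey Vagin": "Andrei Vagin"}


def canonical_name(name):
    """Map a known alias to its canonical spelling; leave other names unchanged."""
    return ALIASES.get(name, name)


# Counts copied from the committer output above.
raw_counts = {
    "Andrei Vagin": 577,
    "Andrey Vagin": 54,
    "Cyrill Gorcunov": 924,
    "GitHub": 6,
    "Pavel Emelyanov": 6880,
}

merged = collections.Counter()
for name, count in raw_counts.items():
    merged[canonical_name(name)] += count

print(merged.most_common())
```

In the notebook itself, the same mapping could be applied to `c['committer']['name']` inside `devs` before the `groupby`.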


In [26]:
pad = dfad.plot(kind='bar', legend=False)  # names are categories, so use bars rather than a line
pad.set_title('The most active developers')
pad.set_ylabel('Commits')

<matplotlib.text.Text at 0x50e4b50>

## Mailing list

First, download the criu@openvz.org mail archive from the [Mailman archive](https://lists.openvz.org/pipermail/criu/) and convert it to JSON with the [mbox2json](https://gist.github.com/ligurio/06a9fd236c70fe9dcf0f769823a0aeee) script.

In [34]:
mbox = json.load(open('criu.mbox.json'))
print mbox[0]

{u'Date': u'Thu, 1 Dec 2016 10:09:23 +0200', u'Message-ID': u'<1480579763-21825-1-git-send-email-rppt@linux.vnet.ibm.com>', u'From': u'rppt at linux.vnet.ibm.com (Mike Rapoport)', u'Subject': u'[CRIU] [PATCH] lazy-pages: spelling: s/pagefalt/#PF'}


In [39]:
senders = [m['From'] for m in mbox if 'From' in m]
print "Number of senders: ", len(set(senders))

Number of senders:  371
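
Counting distinct raw `From` strings can overcount, since the same address may appear with different display names or letter case. The sketch below normalizes the obfuscated `user at host (Name)` form seen in the sample record to a lowercase address before counting. The second and third sample headers are hypothetical variants, and `sender_address` is an illustrative helper.

```python
import re

# Matches the obfuscated form seen in the sample record: "user at host (Name)".
OBFUSCATED = re.compile(r"^\s*(\S+) at (\S+)")


def sender_address(from_header):
    """Return 'user@host' in lowercase, or the stripped header if it does not match."""
    match = OBFUSCATED.match(from_header)
    if match is None:
        return from_header.strip()
    return "{}@{}".format(match.group(1), match.group(2)).lower()


sample_headers = [
    "rppt at linux.vnet.ibm.com (Mike Rapoport)",
    "rppt at linux.vnet.ibm.com (Mike  Rapoport)",  # hypothetical variant
    "RPPT at linux.vnet.ibm.com",  # hypothetical variant without a name
]

unique_raw = set(sample_headers)
unique_addresses = {sender_address(h) for h in sample_headers}
print(len(unique_raw), len(unique_addresses))
```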


### Weekly mails on CRIU

In [46]:
def normalize_date(date_string):
    # Example: Sat, 31 May 2003 14:40:40 -0400
    from dateutil import parser
    try:
        date = parser.parse(date_string).strftime("%Y-%m-%d")
    except (ValueError, OverflowError):
        date = None  # unparseable date; dropped below
    return date

# Some messages have no Date header (the cause of KeyError: 'Date'), so skip them.
report_times = lambda: (
    (normalize_date(message['Date']), message.get('Subject'))
    for message in mbox
    if 'Date' in message
)

dfbg = pd.DataFrame(report_times(), columns=['date', 'id'])
dfbg = dfbg.dropna(subset=['date'])
dfbg = dfbg.groupby('date').aggregate(len)
dfbg.head()

In [None]:
b = dfbg.plot(legend=False)
b.set_title('Weekly mails on CRIU')
b.set_ylabel('Messages')

### The most active bug reporters

In [None]:
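def extract_domain(from_string):
    # The archive obfuscates addresses as "user at example.com (Name)",
    # so restore the "@" before splitting off the domain.
    address = from_string.split(" (")[0].replace(" at ", "@")
    return address.split("@")[-1].split(">")[0].strip()

# Assumption: the undefined `reports` was meant to be the `mbox` list loaded above,
# whose keys are capitalized ('From', 'Subject').
domains = lambda: (
    (extract_domain(message['From']), message.get('Subject'))
    for message in mbox
    if 'From' in message
)

dfdomain = pd.DataFrame(domains(), columns=['domain', 'id'])
dfdomain = dfdomain.groupby('domain').aggregate(len)
dfdomain.head()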