# Assignment 2

## 1st step
The window to consider is from 01-01-2013-00:00 to 31-12-2013-23:59.

The first thing to do is get the list of commits in that timespan, including, for each commit, its author, changed files, and timestamp. We can then build the following structure, in which the "primary key" is the (commit, filepath) pair:

    struct = {
        ('commit_hash_1', 'path/to/a.java'): {'timestamp': 'timestamp_1',
                                              'author': 'email@example.com'
                                              },
        ('commit_hash_2', 'path/to/b.java'): {'timestamp': 'timestamp_2',
                                              'author': 'email@example.com'
                                              },
        ('commit_hash_2', 'path/to/c.java'): {'timestamp': 'timestamp_2',
                                              'author': 'email@example.com'
                                              },
        ('commit_hash_3', 'path/to/d.java'): {'timestamp': 'timestamp_3',
                                              'author': 'email@example.com'
                                              },
        ('commit_hash_3', 'path/to/e.java'): {'timestamp': 'timestamp_3',
                                              'author': 'email@example.com'
                                              },
    }

Let's first import the Python modules we need:

In [1]:
import sh
import re
import sys
from collections import Counter

Then we point `git` at the repository directory:

In [2]:
git = sh.git.bake(_cwd='lucene-solr')

We can then proceed with the first step:

In [3]:
# for testing purposes we restrict the timespan to
# speed up development
log_output = str(git('--no-pager', 'log', '--name-only', after='2015-11-01 00:00',
                 before='2016-01-01 00:00', pretty="format:(%H,%at,%ae)"))

splitted_output = (s.split('\n') for s in log_output.split('\n\n'))

struct = {}

for commit_group in splitted_output:
    i = 0

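    # skip empty commits (header lines not followed by a file list);
    # the bounds check avoids an IndexError when a group ends
    # with such a header line
    while (i + 1 < len(commit_group)
           and commit_group[i+1].startswith('(')
           and commit_group[i+1].endswith(')')):
        i += 1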

    commit_info = commit_group[i][1:-1]

    commit_hash, timestamp, author_email = commit_info.split(',')

    changed_files = commit_group[i+1:]

    # we are interested only in Java files inside the core
    core_regex = r'^lucene\/core\/src\/java\/org\/apache\/lucene.*?\.java$'

    for file_path in changed_files:
        if file_path and re.match(core_regex, file_path):
            struct[(commit_hash, file_path)] = {
                'timestamp': timestamp,
                'author_email': author_email
            }
            
struct

{('0bc10ecb72c68b27230659f728db0b8e6996ca40',
  'lucene/core/src/java/org/apache/lucene/util/RamUsageEstimator.java'): {'author_email': 'rmuir@apache.org',
  'timestamp': '1449425049'},
 ('0ed54b3105c6f6462293b69792d3878d6f0a7bbc',
  'lucene/core/src/java/org/apache/lucene/search/ConjunctionDISI.java'): {'author_email': 'jpountz@apache.org',
  'timestamp': '1447445281'},
 ('0ed54b3105c6f6462293b69792d3878d6f0a7bbc',
  'lucene/core/src/java/org/apache/lucene/search/DisjunctionScorer.java'): {'author_email': 'jpountz@apache.org',
  'timestamp': '1447445281'},
 ('0ed54b3105c6f6462293b69792d3878d6f0a7bbc',
  'lucene/core/src/java/org/apache/lucene/search/ExactPhraseScorer.java'): {'author_email': 'jpountz@apache.org',
  'timestamp': '1447445281'},
 ('0ed54b3105c6f6462293b69792d3878d6f0a7bbc',
  'lucene/core/src/java/org/apache/lucene/search/MultiPhraseQuery.java'): {'author_email': 'jpountz@apache.org',
  'timestamp': '1447445281'},
 ('0ed54b3105c6f6462293b69792d3878d6f0a7bbc',
  'lucene/c
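
The parsing step can also be tried without a repository. The following is a minimal sketch, not the cell above: it walks the log line by line instead of splitting on blank lines, and it runs on a made-up sample string that mimics the `(%H,%at,%ae)` format. The names `SAMPLE_LOG`, `HEADER`, `CORE_JAVA`, and `parse_log` are hypothetical, as are the hashes and e-mail addresses.

```python
import re

# Made-up sample mimicking
# `git log --name-only --pretty="format:(%H,%at,%ae)"`.
SAMPLE_LOG = (
    "(aaa111,1449425049,dev1@example.com)\n"
    "lucene/core/src/java/org/apache/lucene/util/A.java\n"
    "lucene/core/src/test/org/apache/lucene/util/TestA.java\n"
    "\n"
    "(bbb222,1447445300,dev2@example.com)\n"  # empty commit: no file list
    "(ccc333,1447445281,dev2@example.com)\n"
    "lucene/core/src/java/org/apache/lucene/search/B.java\n"
    "solr/core/src/java/org/apache/solr/C.java"
)

# A header line looks like "(hash,unix_timestamp,author_email)".
HEADER = re.compile(r'^\((?P<hash>[0-9a-f]+),(?P<ts>\d+),(?P<email>.*)\)$')
# Only Java files inside the Lucene core are of interest.
CORE_JAVA = re.compile(r'^lucene/core/src/java/org/apache/lucene.*\.java$')


def parse_log(log_output):
    """Map (commit_hash, file_path) to the commit's timestamp and author."""
    result = {}
    current = None  # metadata of the most recent header line
    for line in log_output.splitlines():
        header = HEADER.match(line)
        if header:
            current = header.groupdict()
        elif line and current and CORE_JAVA.match(line):
            result[(current['hash'], line)] = {
                'timestamp': current['ts'],
                'author_email': current['email'],
            }
    return result


parsed = parse_log(SAMPLE_LOG)
for key, value in sorted(parsed.items()):
    print(key, value)
```

Because every header line simply replaces the current commit, an empty commit such as `bbb222` never contributes an entry and needs no special handling.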

To get the line-contributor metrics we use the `git blame` command, which returns every line of the file together with the author of each line.

For each (commit_hash, filepath) pair we run `git blame` on (commit_hash^1, filepath), since we are interested in the state of the file before the changes made by the commit. If the file didn't exist before the commit we are analyzing (i.e., it was created by that commit), we leave the metrics columns blank.

As an intermediate step, for each (commit_hash, filepath) pair we compute the following:

    struct_line_contributors = {
        'author_1': lines_contributed_by_author_1,
        'author_2': lines_contributed_by_author_2,
        'author_3': lines_contributed_by_author_3,
        ...
    }
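
As an illustration, such a per-author line count could be derived from `git blame --line-porcelain` output roughly as follows. This is only a sketch on a made-up, abbreviated sample (real output carries more header fields per line, such as `author-time` and `summary`); `SAMPLE_BLAME` and `count_line_authors` are hypothetical names.

```python
import re
from collections import Counter

# Made-up, abbreviated `git blame --line-porcelain` output for a
# three-line file: two lines by Alice, one by Bob.
SAMPLE_BLAME = "\n".join([
    "1111111111111111111111111111111111111111 1 1 2",
    "author Alice",
    "author-mail <alice@example.com>",
    "committer-mail <bob@example.com>",
    "filename A.java",
    "\tpackage org.example;",
    "1111111111111111111111111111111111111111 2 2",
    "author Alice",
    "author-mail <alice@example.com>",
    "committer-mail <bob@example.com>",
    "filename A.java",
    "\t",
    "2222222222222222222222222222222222222222 3 3 1",
    "author Bob",
    "author-mail <bob@example.com>",
    "committer-mail <bob@example.com>",
    "filename A.java",
    "\tclass A {}",
])


def count_line_authors(blame_output):
    """Count blamed lines per author e-mail.

    Each blamed line has exactly one `author-mail` header. The `^` anchor
    (with re.M) keeps `committer-mail` headers and tab-prefixed source
    lines from matching.
    """
    emails = re.findall(r'^author-mail <(.*)>', blame_output, flags=re.M)
    return Counter(emails)


line_contributors = count_line_authors(SAMPLE_BLAME)
print(line_contributors)
```

Keying on the e-mail address rather than the display name is a design choice: it matches the `%ae` field collected from the log, so the commit author can be looked up directly.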

We can then compute the following metrics:

- `line_contributors_total`: number of developers who contributed to at least one line in the committed file (= number of keys in the dictionary)
- `line_contributors_minor`: number of developers who contributed fewer than 5% of the lines in the committed file (lines_contributed_by_author/total_lines < 0.05)
- `line_contributors_major`: number of developers who contributed at least 5% of the lines in the committed file, including the author (lines_contributed_by_author/total_lines >= 0.05)
- `line_contributors_ownership`: ratio of lines written by the developer with the highest number of lines (lines_contributed_by_highest_contributor/total_lines)
- `line_contributors_author`: ratio of lines by the developer who is author of the current commit (lines_contributed_author/total_lines)
- `line_contributors_author_owner`: true if the developer who is author of the current commit is the one with the highest ratio of contributed lines
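
To make the definitions concrete, here is a small worked sketch that computes all six metrics from a per-author line count. The function name `line_contributor_metrics`, the `threshold` parameter, and the example counts are hypothetical. Treating a share strictly below 5% as minor (and exactly 5% as major) is an assumption about how the boundary is handled.

```python
from collections import Counter


def line_contributor_metrics(authors_counter, commit_author, threshold=0.05):
    """Compute the line-contributor metrics for one (commit, file) pair.

    Returns None for a file with no blamed lines, for which the ratios
    are undefined.
    """
    total_lines = sum(authors_counter.values())
    if total_lines == 0:
        return None

    # Assumption: shares strictly below the threshold count as minor.
    minor = sum(1 for n in authors_counter.values()
                if n / total_lines < threshold)
    top_author, top_lines = authors_counter.most_common(1)[0]

    return {
        'line_contributors_total': len(authors_counter),
        'line_contributors_minor': minor,
        'line_contributors_major': len(authors_counter) - minor,
        'line_contributors_ownership': top_lines / total_lines,
        # A Counter returns 0 for an author with no lines in the file.
        'line_contributors_author': authors_counter[commit_author] / total_lines,
        'line_contributors_author_owner': top_author == commit_author,
    }


# Made-up example: a 100-line file blamed to three developers.
example = Counter({'alice@example.com': 90,
                   'bob@example.com': 8,
                   'carol@example.com': 2})
metrics = line_contributor_metrics(example, 'bob@example.com')
print(metrics)
```

In this example Carol (2%) is the only minor contributor, Alice and Bob are major contributors, ownership is 90/100, and Bob, the commit author, holds 8/100 of the lines without being the owner.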


In [4]:
for ((commit_hash, file_path), info) in struct.items():
    try:
        blame_out = str(
            git('--no-pager', 'blame', file_path, commit_hash + '^1',
                '--line-porcelain').stdout,
            encoding=sys.stdout.encoding)
    except sh.ErrorReturnCode_128:
        # the file was not found at that specific commit,
        # meaning that it still didn't exist
        continue

    authors = re.findall(r'^author-mail <(.*)>', blame_out, flags=re.M)
    authors_counter = Counter(authors)

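    line_contributors_total = len(authors_counter)
    total_lines = sum(authors_counter.values())

    # an empty file has no blamed lines: skip it, since the
    # ratios computed below would be undefined
    if total_lines == 0:
        continue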

    line_contributors_minor, line_contributors_major = 0, 0
    for author, n_lines in authors_counter.items():
        if n_lines/total_lines < 0.05:
            line_contributors_minor += 1
        else:
            line_contributors_major += 1

    commit_author = info['author_email']
    if commit_author in authors_counter:
        line_contributors_author = authors_counter[commit_author]/total_lines
    else:
        line_contributors_author = 0

    # the developer who owns the most lines, with their line count
    author_max_lines, max_lines = authors_counter.most_common(1)[0]

    line_contributors_ownership = max_lines / total_lines

    line_contributors_author_owner = author_max_lines == commit_author

    info.update({
        'line_contributors_total': line_contributors_total,
        'line_contributors_minor': line_contributors_minor,
        'line_contributors_major': line_contributors_major,
        'line_contributors_author': line_contributors_author,
        'line_contributors_ownership': line_contributors_ownership,
        'line_contributors_author_owner': line_contributors_author_owner})
    
struct

{('0bc10ecb72c68b27230659f728db0b8e6996ca40',
  'lucene/core/src/java/org/apache/lucene/util/RamUsageEstimator.java'): {'author_email': 'rmuir@apache.org',
  'line_contributors_author': 0.010752688172043012,
  'line_contributors_author_owner': False,
  'line_contributors_major': 3,
  'line_contributors_minor': 1,
  'line_contributors_ownership': 0.7258064516129032,
  'line_contributors_total': 4,
  'timestamp': '1449425049'},
 ('0ed54b3105c6f6462293b69792d3878d6f0a7bbc',
  'lucene/core/src/java/org/apache/lucene/search/ConjunctionDISI.java'): {'author_email': 'jpountz@apache.org',
  'line_contributors_author': 0.8264840182648402,
  'line_contributors_author_owner': True,
  'line_contributors_major': 2,
  'line_contributors_minor': 3,
  'line_contributors_ownership': 0.8264840182648402,
  'line_contributors_total': 5,
  'timestamp': '1447445281'},
 ('0ed54b3105c6f6462293b69792d3878d6f0a7bbc',
  'lucene/core/src/java/org/apache/lucene/search/DisjunctionScorer.java'): {'author_email': 'jp