# Examining the effects of ownership on software quality
## The Case Of Lucene

We want to replicate the [study](http://dl.acm.org/citation.cfm?doid=2025113.2025119 "Examining the effects of ownership on software quality") done by Bird et al. and published at FSE'11. The idea is to see the results of a similar investigation on an OSS system. We select [Lucene](https://lucene.apache.org/core/), a search engine written in Java.

## Data collection

First we need the data to build our **table**; this step is called *data collection*.

In our case, we want to check the relation between some ownership-related metrics and post-release bugs. We investigate this relation at *file level*: we focus on Java, where the building blocks are classes, which most of the time correspond one-to-one to files.

This means that our table will have one row per source code file and as many columns as the metrics we want to compute for that file, plus one column with the number of post-release bugs.

### Collecting git data

To compute most of the metrics we want to investigate (e.g., how many people changed a file over its entire history), we need the history of each file, which we obtain by analyzing the *versioning system*. Lucene has a Subversion repository, but a [git mirror](https://github.com/apache/lucene-solr.git) is also available. We use the git repository because it keeps the entire history locally, which makes the computations faster.

We run git from Python through the `sh` library.

In [1]:
import sh

We start by cloning the repository.

In [2]:
sh.git.clone("https://github.com/apache/lucene-solr.git")

And we make sure that we point our 'git' command to the right directory.

In [3]:
git = sh.git.bake(_cwd='lucene-solr')
git.status()

HEAD detached at b5db48c
nothing to commit, working tree clean

To perform the replication, we could either reason in terms of releases (see [list of Lucene releases](http://archive.apache.org/dist/lucene/java/)), or we could just inspect the 'trunk' in the versioning system and start from a given date.

In this assignment, we go for the second option: we consider the 'master' branch and focus on a 6-month period, counting the bugs that affect the files existing at the start of that period. For the bug data, we use the time window from Feb 01, 2015 to Jul 31, 2015.

#### Retrieving list of files on Feb 1, 2015

Let's retrieve the list of files existing in the trunk on Feb 01, 2015.

In [4]:
sha_feb_15 = str(git("rev-list", "master", n=1, before="2015-02-01 00:01"))[:-1]
sha_feb_15

'b5db48c783e9d0dc19c087101f03d8834b201106'
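
As a rough sketch of what the cell above asks git to do, the same query can be written with the standard `subprocess` module. This assumes that `sh` turns `n=1` into `-n 1` and `before=...` into `--before=...`; the function names `rev_list_command` and `last_commit_before` are made up for this illustration.

```python
import subprocess

def rev_list_command(branch, before):
    """Build the git command that selects the newest commit on `branch` older than `before`."""
    return ["git", "rev-list", "-n", "1", f"--before={before}", branch]

def last_commit_before(repo_dir, branch, before):
    """Run the command inside `repo_dir` and return the hash without the trailing newline."""
    output = subprocess.check_output(
        rev_list_command(branch, before), cwd=repo_dir, text=True
    )
    return output.strip()

# Only builds the argument list; calling last_commit_before needs a local clone.
print(rev_list_command("master", "2015-02-01 00:01"))
```

Using `strip()` instead of slicing off the last character does the same job here and also tolerates output without a trailing newline.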

Now that we have the hash of the commit we're interested in, we check out that commit.

In [5]:
git.checkout(sha_feb_15)
git.status()

HEAD detached at b5db48c
nothing to commit, working tree clean

We then get the list of Java files in the folder `lucene/core/src/java/org/apache/lucene`, which contains the core of the software.

In [6]:
file_list_feb_15 = sh.find('lucene/core/src/java/org/apache/lucene', '-type', 'f', '-name', '*.java', _cwd='lucene-solr').split("\n")[:-1]
file_list_feb_15[:10] # visualizing first 10 elements for readability

['lucene/core/src/java/org/apache/lucene/LucenePackage.java',
 'lucene/core/src/java/org/apache/lucene/analysis/Analyzer.java',
 'lucene/core/src/java/org/apache/lucene/analysis/AnalyzerWrapper.java',
 'lucene/core/src/java/org/apache/lucene/analysis/CachingTokenFilter.java',
 'lucene/core/src/java/org/apache/lucene/analysis/CharFilter.java',
 'lucene/core/src/java/org/apache/lucene/analysis/DelegatingAnalyzerWrapper.java',
 'lucene/core/src/java/org/apache/lucene/analysis/NumericTokenStream.java',
 'lucene/core/src/java/org/apache/lucene/analysis/ReusableStringReader.java',
 'lucene/core/src/java/org/apache/lucene/analysis/Token.java',
 'lucene/core/src/java/org/apache/lucene/analysis/TokenFilter.java']
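
The same listing can be produced without the external `find` command. The sketch below uses `pathlib` from the standard library; the helper name `list_java_files` is hypothetical, and unlike `find` it returns the paths sorted.

```python
from pathlib import Path

def list_java_files(repo_dir, subdir):
    """Return the repo-relative paths of all .java files under `subdir`, sorted."""
    root = Path(repo_dir)
    return sorted(
        path.relative_to(root).as_posix()
        for path in (root / subdir).rglob("*.java")
        if path.is_file()
    )

# Example call, assuming the clone lives in ./lucene-solr:
# file_list_feb_15 = list_java_files("lucene-solr", "lucene/core/src/java/org/apache/lucene")
```

Paths are relative to the repository root, so they have the same form as the paths git reports for changed files.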

#### Retrieving commits between Jan 2, 2014 and Feb 1, 2015
We now retrieve the list of commit hashes between 2014-01-02 00:23:24 (43974d668667ba1b1dacf26a18a22c7fea909539) and 2015-02-01 00:01 (b5db48c783e9d0dc19c087101f03d8834b201106).

In [7]:
commit_hashes_firstpart = git('rev-list', 'master', after="2014-01-02 00:23:24", before="2015-02-01 00:01").split('\n')[:-1]
commit_hashes_firstpart[:10]

['b5db48c783e9d0dc19c087101f03d8834b201106',
 'a705371bfce8227f8aa24c152f133330437afae4',
 '97e0a1c8ad9a47f77823e44d75205b9f30fd2257',
 'fd35bd5ae4496bd94affa37f99f5f0294caf894b',
 '669e9cf617c532442de87a36b21258898b669c42',
 '0068708e149c1a4a645474aee0f2ad91f8de266a',
 '84ffb0855fec76a4d1e6021124c4e00d2ba785e4',
 '142a75624df0e5471fb52859c97eddcad2eb1f82',
 'b696595cc6f2ef09dfb2dbd347e64d7abdb6df9a',
 '1f131a6b2061017dbd595b40ce70921f81a8ff10']

#### Computing ownership values

To compute the ownership values, we will use a dictionary with this structure:

    struct = {
        'path_of_file_1': {
            'contributors': {
                'author_1': 10,
                'author_2': 5
            }
        },
        'path_of_file_2': {
            'contributors': {
                'author_1': 8,
                'author_2': 4
            }
        }
    }
    
where the keys of the dictionary are the paths of the files present on Feb 1, 2015, and each value is a dictionary mapping every contributor to the number of commits they made to that file.

The first step is to create this structure, using the list of files we found as keys and `{'contributors': {}}` as values.

In [8]:
struct = {}
for file_path in file_list_feb_15:
    struct[file_path] = {'contributors': {}}

Now, for every commit in the list we have found, we get the author's email and the list of files that were modified, and we update the structure accordingly.

In [9]:
for commit in commit_hashes_firstpart:
    commit_info = git.show('--name-only', commit, pretty='%ae').split('\n')
    
    author = commit_info[0] # the first line is the author's email
    changed_files = commit_info[2:-1] # line 1 is blank; the changed files follow, minus the trailing empty string

    for file in changed_files:
        if file in struct:
            if author in struct[file]['contributors']:
                struct[file]['contributors'][author] += 1
            else:
                struct[file]['contributors'][author] = 1

import itertools
list(itertools.islice(struct.items(), 10)) # visualizing first 10 elements for readability

[('lucene/core/src/java/org/apache/lucene/util/packed/Direct16.java',
  {'contributors': {'jpountz@apache.org': 1, 'rmuir@apache.org': 1}}),
 ('lucene/core/src/java/org/apache/lucene/search/PhraseQuery.java',
  {'contributors': {'mikemccand@apache.org': 1,
    'rjernst@apache.org': 1,
    'rmuir@apache.org': 3}}),
 ('lucene/core/src/java/org/apache/lucene/util/packed/BulkOperationPacked24.java',
  {'contributors': {}}),
 ('lucene/core/src/java/org/apache/lucene/analysis/CharFilter.java',
  {'contributors': {}}),
 ('lucene/core/src/java/org/apache/lucene/index/FreqProxFields.java',
  {'contributors': {'jpountz@apache.org': 1,
    'mikemccand@apache.org': 7,
    'rmuir@apache.org': 1}}),
 ('lucene/core/src/java/org/apache/lucene/search/similarities/BasicStats.java',
  {'contributors': {}}),
 ('lucene/core/src/java/org/apache/lucene/codecs/DocValuesProducer.java',
  {'contributors': {'jpountz@apache.org': 1, 'rmuir@apache.org': 5}}),
 ('lucene/core/src/java/org/apache/lucene/search/Numeri
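
The counting step can also be sketched without a repository, which makes it easier to check. The version below works on raw `git show --name-only --pretty=%ae` outputs passed in as strings and uses `collections.Counter` instead of the explicit `if`/`else`. The function names and the sample data are made up for illustration.

```python
from collections import Counter

def parse_show_output(output):
    """Split one `git show --name-only --pretty=%ae` output into (author, changed_files)."""
    lines = output.split("\n")
    author = lines[0]
    changed_files = [line for line in lines[1:] if line]  # drop blank lines
    return author, changed_files

def count_contributions(show_outputs, tracked_files):
    """Count, for every tracked file, how many commits each author made to it."""
    contributors = {path: Counter() for path in tracked_files}
    for output in show_outputs:
        author, changed_files = parse_show_output(output)
        for path in changed_files:
            if path in contributors:  # ignore files that are not tracked
                contributors[path][author] += 1
    return contributors

# Hypothetical outputs for three commits
sample = [
    "a@example.org\n\nFoo.java\nBar.java\n",
    "b@example.org\n\nFoo.java\n",
    "a@example.org\n\nFoo.java\nREADME.txt\n",
]
print(count_contributions(sample, ["Foo.java", "Bar.java", "Baz.java"]))
```

A `Counter` starts every author at zero, so the first and the later commits of an author are handled by the same line.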

We can now compute the actual ownership values. We store them in the same structure, under a `'values'` key:

    struct = {
        'path_of_file_1': {
            'contributors': {
                'author_1': 10,
                'author_2': 5
            },
            'values': {
                'total': 2,
                'major': 1,
                'minor': 0,
                'ownership': 0.667,
                'bug_n': 0
            }
        },
        'path_of_file_2': {
            'contributors': {
                'author_1': 8,
                'author_2': 4
            },
            'values': {
                'total': 2,
                'major': 1,
                'minor': 0,
                'ownership': 0.667,
                'bug_n': 0
            }
        }
    }

In [10]:
for file_path, authors_info in struct.items():
    if authors_info['contributors']:
        contributors_info = authors_info['contributors']
        total_n_commits, major, minor = 0, 0, 0
    
        owner = max(contributors_info.keys(), key=(lambda key: contributors_info[key]))
        total = len(contributors_info)
        total_n_commits = sum(contributors_info.values())
        owner_ownership = contributors_info[owner] / total_n_commits
        
        for author_name, n_commits in contributors_info.items():
            if author_name != owner:
                ownership = n_commits / total_n_commits
                
                if ownership > 0.05:
                    major += 1
                else:
                    minor += 1
    
        struct[file_path]['values'] = {
            'ownership': owner_ownership,
            'total': total,
            'major': major,
            'minor': minor,
            'bug_n': 0
        }
        
list(itertools.islice(struct.items(), 10))

[('lucene/core/src/java/org/apache/lucene/util/packed/Direct16.java',
  {'contributors': {'jpountz@apache.org': 1, 'rmuir@apache.org': 1},
   'values': {'bug_n': 0,
    'major': 1,
    'minor': 0,
    'ownership': 0.5,
    'total': 2}}),
 ('lucene/core/src/java/org/apache/lucene/search/PhraseQuery.java',
  {'contributors': {'mikemccand@apache.org': 1,
    'rjernst@apache.org': 1,
    'rmuir@apache.org': 3},
   'values': {'bug_n': 0,
    'major': 2,
    'minor': 0,
    'ownership': 0.6,
    'total': 3}}),
 ('lucene/core/src/java/org/apache/lucene/util/packed/BulkOperationPacked24.java',
  {'contributors': {}}),
 ('lucene/core/src/java/org/apache/lucene/analysis/CharFilter.java',
  {'contributors': {}}),
 ('lucene/core/src/java/org/apache/lucene/index/FreqProxFields.java',
  {'contributors': {'jpountz@apache.org': 1,
    'mikemccand@apache.org': 7,
    'rmuir@apache.org': 1},
   'values': {'bug_n': 0,
    'major': 2,
    'minor': 0,
    'ownership': 0.7777777777777778,
    'total': 3}}),
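
To make the metric definitions easier to check, here is the same computation as a small standalone function. It follows the choices of the cell above: the owner is the contributor with the most commits, the owner is not counted among the major or minor contributors, and a contributor is major when their share of commits is strictly above 5%. The function name `ownership_metrics` and the `threshold` parameter are made up for this sketch.

```python
def ownership_metrics(contributors, threshold=0.05):
    """Compute ownership, total, major and minor from a {author: n_commits} dictionary."""
    total_commits = sum(contributors.values())
    owner = max(contributors, key=contributors.get)  # author with the most commits

    major = minor = 0
    for author, n_commits in contributors.items():
        if author == owner:
            continue  # the owner is counted neither as major nor as minor
        if n_commits / total_commits > threshold:
            major += 1
        else:
            minor += 1

    return {
        'ownership': contributors[owner] / total_commits,
        'total': len(contributors),
        'major': major,
        'minor': minor,
    }

# Contributors of PhraseQuery.java, taken from the output above
print(ownership_metrics({'mikemccand@apache.org': 1, 'rjernst@apache.org': 1, 'rmuir@apache.org': 3}))
```

For `PhraseQuery.java` the owner made 3 of 5 commits, so the ownership is 0.6, and the two other authors each hold 20% of the commits, which makes both of them major contributors.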

#### Retrieving commits between Feb 1, 2015 and Aug 1, 2015

The next step is to retrieve the commits made between 2015-02-01 00:02 and 2015-08-01 00:00. We will use their titles to look up the corresponding bug fixes in Jira.

In [11]:
temp_commithashes_list = git('rev-list', 'master', pretty='oneline', after="2015-02-01 00:02", before="2015-08-01 00:00").split('\n')[:-1]

commit_hashes_secondpart = [{'title': commit[41:], 'hash': commit[:40]} for commit in temp_commithashes_list]

commit_hashes_secondpart[:10]

[{'hash': '7b412fdc630081ef8299952e1ea583eee5e89197',
  'title': 'Remove JRockit (no longer supported)'},
 {'hash': '5f5ab2a79fb643ee69b6a654d9664f9dd5898411',
  'title': 'SOLR-2522: new two argument option for the existing field() function; picks the min/max value of a docValues field to use as a ValueSource: "field(field_name,min)" and "field(field_name,max)"'},
 {'hash': '22d67a637acb75b486f4e6ff9f599f0f4a505c1a',
  'title': 'SOLR-7823: TestMiniSolrCloudCluster.testCollectionCreateSearchDelete async collection-creation (sometimes)'},
 {'hash': '81df57baa28adcb3d0c5952015f7bcff85ff463e',
  'title': 'SOLR-5022: Merged revision(s) 1693559 from lucene/dev/branches/branch_5x: cleanup outdated Java 7 stuff'},
 {'hash': '57a15d9278ece538b765afd6d5b68e6db4cdd2a9',
  'title': 'SOLR-6625: Remove RequestInterceptor at the end of the test in BasicHttpSolrClientTest. It was interfering with other tests running the same JVM.'},
 {'hash': 'f8ae631751ae98ca770d8f387793d9846db62c48',
  'title': 'LUC

#### Computing the number of bugs

Since we are interested only in commits that correspond to a bug fix in Jira, we look for commits whose title contains an issue key (i.e., LUCENE-XXXX).

For every commit that refers to an issue, we query the Jira REST API to check whether that issue is a bug.
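
The title filter can be tried in isolation. The sketch below uses the four-digit pattern `LUCENE-\d{4}` described above; the helper name `lucene_issue_key` and the first sample title are made up for illustration.

```python
import re

ISSUE_KEY = re.compile(r'LUCENE-\d{4}')

def lucene_issue_key(title):
    """Return the first LUCENE-XXXX key found in a commit title, or None."""
    match = ISSUE_KEY.search(title)
    return match.group() if match else None

titles = [
    'LUCENE-6123: hypothetical title used only for this example',
    'SOLR-7823: TestMiniSolrCloudCluster.testCollectionCreateSearchDelete async collection-creation (sometimes)',
    'Remove JRockit (no longer supported)',
]
for title in titles:
    print(lucene_issue_key(title))
```

Because the pattern asks for exactly four digits, a key with five digits would be cut after the fourth one. A pattern such as `LUCENE-\d+` would avoid that if the time window ever included such keys.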

In [12]:
import re # we use the Regular Expressions module to filter out the commits relative to issues
import requests

base_url_jira_api = 'https://issues.apache.org/jira/rest/api/latest/issue/'

def addbug(changed_file_list):
    # count one bug fix for every tracked file touched by the commit
    for file_path in changed_file_list:
        if file_path in struct and 'values' in struct[file_path]:
            struct[file_path]['values']['bug_n'] += 1

id_type_cache = {} # issue key -> issue type, so that each issue is requested from Jira only once
for commit in commit_hashes_secondpart:
    match = re.search(r'LUCENE-\d{4}', commit['title'])
    
    if match:
        issue_id = match.group()
        changed_file_list = git.show('--name-only', commit['hash'], pretty='').split('\n')[:-1]
        
        if issue_id in id_type_cache:
            if id_type_cache[issue_id] == 'Bug':
                addbug(changed_file_list)
        else:
            issue_res = requests.get(base_url_jira_api + issue_id)

            if issue_res.status_code == 200:
                issue_type = issue_res.json()['fields']['issuetype']['name']
                id_type_cache[issue_id] = issue_type

                if issue_type == 'Bug':
                    addbug(changed_file_list)
                        
list(itertools.islice(struct.items(), 10))

[('lucene/core/src/java/org/apache/lucene/util/packed/Direct16.java',
  {'contributors': {'jpountz@apache.org': 1, 'rmuir@apache.org': 1},
   'values': {'bug_n': 0,
    'major': 1,
    'minor': 0,
    'ownership': 0.5,
    'total': 2}}),
 ('lucene/core/src/java/org/apache/lucene/search/PhraseQuery.java',
  {'contributors': {'mikemccand@apache.org': 1,
    'rjernst@apache.org': 1,
    'rmuir@apache.org': 3},
   'values': {'bug_n': 24,
    'major': 2,
    'minor': 0,
    'ownership': 0.6,
    'total': 3}}),
 ('lucene/core/src/java/org/apache/lucene/util/packed/BulkOperationPacked24.java',
  {'contributors': {}}),
 ('lucene/core/src/java/org/apache/lucene/analysis/CharFilter.java',
  {'contributors': {}}),
 ('lucene/core/src/java/org/apache/lucene/index/FreqProxFields.java',
  {'contributors': {'jpountz@apache.org': 1,
    'mikemccand@apache.org': 7,
    'rmuir@apache.org': 1},
   'values': {'bug_n': 6,
    'major': 2,
    'minor': 0,
    'ownership': 0.7777777777777778,
    'total': 3}})
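
The caching logic of the cell above can be separated from the HTTP request, which makes it testable without network access. In the sketch below, `fetch_issue_type` stands for any function that returns the issue type for a key, or `None` when the lookup fails; this name, `make_cached_lookup` and the fake data are made up for illustration.

```python
def make_cached_lookup(fetch_issue_type):
    """Wrap a lookup function so that every issue key is fetched at most once."""
    cache = {}

    def lookup(issue_id):
        if issue_id not in cache:
            issue_type = fetch_issue_type(issue_id)
            if issue_type is None:
                return None  # failed lookups are not cached, so they can be retried
            cache[issue_id] = issue_type
        return cache[issue_id]

    return lookup

# A fake lookup standing in for the Jira request
fake_types = {'LUCENE-0001': 'Bug', 'LUCENE-0002': 'Improvement'}
lookup = make_cached_lookup(fake_types.get)

for issue_id in ['LUCENE-0001', 'LUCENE-0001', 'LUCENE-0002']:
    print(issue_id, lookup(issue_id) == 'Bug')
```

With the real API, `fetch_issue_type` would perform the `requests.get` call and return the `issuetype` name when the status code is 200.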

#### Saving the results in a CSV file

The last step is to save the results we've found in a CSV file. We keep only files/classes for which we have information about the contributors in the specified timespan.

In [13]:
import csv

with open('datacollection.csv', 'w', newline='') as csvfile:
    fieldnames = ['file_name', 'package', 'minor', 'major', 'total', 'ownership', 'num_of_bugs']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames, delimiter=';')
    
    writer.writeheader()
    
    for filepath, info in struct.items():
        if info['contributors']:
            tempsplit = filepath.split('/org/')[1].split('/')
            filename = tempsplit[-1]
            
            partial_package = tempsplit[:-1]
            package = 'org.' + '.'.join(partial_package)
    
            values = info['values']
    
            writer.writerow({
                    'file_name': filename,
                    'package': package,
                    'minor': values['minor'],
                    'major': values['major'],
                    'total': values['total'],
                    'ownership': values['ownership'],
                    'num_of_bugs': values['bug_n']
            })
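
The way the file name and the package are derived from a path can be checked on its own. The helper below repeats the splitting done in the cell above; the name `split_java_path` is made up for this sketch, and it assumes that the path contains exactly one `/org/` segment.

```python
def split_java_path(file_path):
    """Return (file_name, package) for a path such as .../java/org/apache/lucene/Foo.java."""
    parts = file_path.split('/org/')[1].split('/')
    file_name = parts[-1]
    package = 'org.' + '.'.join(parts[:-1])
    return file_name, package

print(split_java_path('lucene/core/src/java/org/apache/lucene/analysis/Analyzer.java'))
```

For the path above this gives the file name `Analyzer.java` and the package `org.apache.lucene.analysis`.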