Skip to content
This repository has been archived by the owner on Jan 2, 2024. It is now read-only.

haxscramper/code_forensics

Repository files navigation

Archived, v2 implementation is at https://github.com/haxscramper/haxorg/tree/master/scripts/cxx_repository – much better data model, should have tests and so on. This repo is kept around for my own reference, some thing from here are still worth copying.


Historical data analysis for the git repositories. Largely inspired by projects such as src-d/hercules: Gaining advanced insights from Git repository history, adamtornhill/code-maat: A command line tool to mine and analyze data from version-control systems, adamtornhill/maat-scripts: Scripts used to post-process the results from Code Maat, and most importantly a Your Code as a Crime Scene: Use Forensic Techniques to Arrest Defects, Bottlenecks, and Bad Design in Your Programs by Adam Tornhill book.

This project consists of two parts - (1) the C++ CLI script that generates SQLite database with information from the repository, and (2) a collection of user scripts for generating concrete analysis from the repository. Some scripts come with a repository (for more details on the usage see related wiki pages), but you are free to write your own in any language that can interact with a SQLite database file.

CLI Application

Main CLI options

code_forensics user_repo(1) [ --filter-script=file.py(2) ] [ --outfile=db.sqlite(3) ]

The main CLI application code_forensics is a self-contained binary file that takes in a (1) user-provided repository, an (2) optional python filter script, and possibly (3) previously generated database (for incremental updates) generates the new database file (or updates existing one).

  1. A user-provided git repository - no special constraints on the structure, size etc., although large repositories generally require (2) filter script in order to only pick certain commits - full analysis might take too much time.
  2. User-provided filter script - regular python script that is provided by the user in order to customize handling of certain commits and files. Script should import the forensics.config object and configure it as needed. For mode details on the script configuration see the “Scripting” section of this readme.
  3. Path to the output database file

Extra CLI configuration

  • --branch= which repository branch to analyze (defaults to master)

Scripting

Custom analysis logic can be injected directly in the database processing stage using a user-provided python script that is

NOTE: Filtering configuration script and subsequent analysis can be tied together via period indexing scheme or some other means. Examples in scripts also use this feature. For example, sampling period mapping is done using this function, which effectively translates 2011Q2 to 4022. Later on this is unpacked by the data visualization part.

def sample_period_mapping(date) -> bool:
    result = date.year * 2
    if 6 < date.month:
        result += 1

    return result

The database structure

Database is comprised of the several tables, mapping different types declared in the git_ir.hpp file. Mapping is done via the ORM library using ir:: types. General overview of each table present in the database:

line
Interned information about different lines in the code. Each line’s content is represented uniquely and includes the .author, .time and .content ID fields. .nesting is a line complexity analytics.
files
full list of all processed files - each separate version of the file is represented as a different database entry, with new ID, statistics etc.
file lines
auxiliary table for data found in the analyzed files. .file refers to the files table, index shows the position in the file and the .line refers to the information in the line table.
rcommit
full list of all processed commits. contains information about author id, time and timezone of the commit, period to which commit was attributed and commit message.
edited_files
files explicitly modified for each commit. Contains a commit id, number of added and removed lines and path ID of the file.
renamed
pair of old-new file path IDs for each commit
paths
interned list of paths used in different tables. Contains full path of the file and ID of it’s direct parent directory.
author
List of commit and line authors - name and email

In addition to the main tables a collection of views is added to the database after it is generated.

  • file_path_with_dir - simplify getting full file name from the path ID (without this view it would require two joins - on path ID and on the path’s name)
  • file_version_with_path - simplify getting full name, commit, directory of each file version.
  • file_version_with_path_dir - extension of the previous view with a joined path of the parent directory.

Analyzing data

SQL queries

Select every instance of the file, find it’s relative complexity and order by it.

select strings.`text`, file.total_complexity / cast(file.line_count as real) as result
from file, strings
where file.name == strings.id
order by result;