# Distance Of Changes

This notebook demonstrates how to calculate the distance of changes. This metric indicates how local the changes in a code base are. Low values indicate good cohesion, while a trend of growing values hints that the code should be reorganized.

## Setup
The cell below
* imports the required libraries
* sets up the connection to the Neo4j database
* defines the D3-based HTML template for custom visualizations

In [1]:
import pandas as pd 
import plotly.express as px
import pygal as pg
from string import Template
from IPython.display import HTML, Javascript, display

neo4j_url=%env NEO4J_URL

%reload_ext cypher
%config CypherMagic.uri=neo4j_url + "/db/data"

def configure_d3():
    """Tell require where to get d3 from in `require(['d3'])`"""
    display(Javascript("""
    require.config({ 
      paths: {
        lodash: "/notebooks/vis/lib/lodash.min",  
        d3: "/notebooks/vis/lib/d3.v4.min"
      }
    })"""))

configure_d3()

base_html = """
<!DOCTYPE html>
<html>
  <head>
  <script type="text/javascript" src="/notebooks/vis/lib/svg.jquery.js"></script>
  <script type="text/javascript" src="/notebooks/vis/lib/pygal-tooltips.min.js"></script>
  </head>
  <body>
    <figure>
      {rendered_chart}
    </figure>
  </body>
</html>
"""


To compute the distance between the files changed within a commit, the `relativePath` attributes of the `:File` nodes must be converted to a tree structure.

As a first step, the `relativePath` of each `:File` node is transformed into a linked list of `:Path` nodes, e.g.

`(:File{relativePath:"/src/main/java/"})` to `(:Path{relativePath: "/src"})-[:CONTAINS]->(:Path{relativePath: "/src/main"})`
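
To illustrate the expansion, the following is a minimal plain-Python sketch of what the prefix logic in the query amounts to. The function names `path_prefixes` and `contains_edges` are hypothetical and exist only for this illustration; the notebook itself performs the step in Cypher with APOC.

```python
def path_prefixes(relative_path):
    """Return every prefix that ends right before a "/", plus the full path."""
    # Positions of all "/" characters, analogous to apoc.text.indexesOf(path, "/")
    delimiters = [i for i, ch in enumerate(relative_path) if ch == "/"]
    # One prefix per delimiter, followed by the complete path itself
    return [relative_path[:d] for d in delimiters] + [relative_path]


def contains_edges(prefixes):
    """Link consecutive prefixes as (parent, child) pairs, i.e. a linked list."""
    return list(zip(prefixes, prefixes[1:]))


prefixes = path_prefixes("/src/main/App.java")
print(prefixes)
print(contains_edges(prefixes))
```

In this sketch a path with a leading slash yields the empty string as its first prefix, which then acts as the common root of all chains.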

In [2]:
%cypher MATCH \
         (f:Git:File)  \
        WHERE  \
          exists(f.relativePath)  \
        WITH f, apoc.text.indexesOf(f.relativePath, "/") as delimiters  \
        WITH f, [del in delimiters | substring(f.relativePath, 0, del)] + f.relativePath as paths  \
        UNWIND  \
          paths as path  \
        CALL  \
          apoc.create.node(['Path'], {relativePath:path}) YIELD node \
        WITH  \
          f, collect(node) as nodes  \
        CALL  \
          apoc.nodes.link(nodes,'CONTAINS')   \
        RETURN  \
          count(nodes)

1 rows affected.


count(nodes)
9153


The previous step created an independent linked list of `:Path` nodes for each `:File` node. This step merges `:Path` nodes with the same relative path into a single node, which yields the required tree of `:Path` nodes.

Merge duplicate paths, i.e. for files within the same directory tree:

In [3]:
%cypher MATCH \
          (p:Path) \
        WITH \
          p.relativePath as relativePath, collect(p) as paths \
        CALL \
          apoc.refactor.mergeNodes(paths, {mergeRels:true}) YIELD node \
        RETURN \
          count(relativePath)

1 rows affected.


count(relativePath)
11965
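
The effect of the merge can be sketched in plain Python under the assumption that each file contributes a chain of (parent, child) pairs. The helper `merge_into_tree` and the sample paths are hypothetical; the notebook relies on `apoc.refactor.mergeNodes` for the actual merge.

```python
def merge_into_tree(edge_lists):
    """Merge per-file (parent, child) chains into one tree keyed by path."""
    children = {}
    for edges in edge_lists:
        for parent, child in edges:
            # Nodes with the same path collapse into one dictionary key
            children.setdefault(parent, set()).add(child)
            children.setdefault(child, set())
    return children


# Two independent chains that share the directories "", "/src" and "/src/main"
file_a = [("", "/src"), ("/src", "/src/main"), ("/src/main", "/src/main/A.java")]
file_b = [("", "/src"), ("/src", "/src/main"), ("/src/main", "/src/main/B.java")]

tree = merge_into_tree([file_a, file_b])
print(len(tree))                  # number of distinct paths after merging
print(sorted(tree["/src/main"]))  # both files hang below the shared folder
```

The two chains consist of eight separate nodes in total; after merging, five distinct paths remain and both files share the parent `/src/main`.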


The `:File` nodes created by the Git scanner are now linked to the `:Path` nodes:

In [4]:
%cypher MATCH \
          (f:Git:File), \
          (p:Path) \
        WHERE \
          p.relativePath=f.fileName \
        MERGE \
         (f)-[:HAS_PATH]->(p) \
        RETURN \
          count(*)

1 rows affected.


count(*)
9153


For each commit reachable from the main branch, determine the distance between every pair of changed files by finding the length of the shortest path along the `:CONTAINS` relation, i.e. traversing upwards from both paths until a common parent is found.

- The query is limited to production code, i.e. files whose relative path contains `/src/main/` (e.g. `src/main/java`). Including test code (e.g. `src/test`) produces higher values which are not relevant here.
- The shortest path returned by the query includes the `:Path` nodes of the files themselves, but only the hops over containing folders are relevant for the distance. Therefore the distance is computed as `length(path)-2`.
- The distances are first averaged per commit and then averaged per time unit (here: month). Averaging only per time unit would blur the locality of changes, e.g. one commit in one component followed by another commit in another component.
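
The distance described above can be reproduced without a graph, as in the following sketch. It assumes that all paths hang below one common root; `change_distance` is a hypothetical helper and not part of the notebook.

```python
def change_distance(path_a, path_b):
    """Number of folder hops between two files, equivalent to length(path)-2."""
    # Keep only the directory components; the file name itself is dropped
    dirs_a = path_a.strip("/").split("/")[:-1]
    dirs_b = path_b.strip("/").split("/")[:-1]
    # Count the leading directories both files have in common
    common = 0
    for a, b in zip(dirs_a, dirs_b):
        if a != b:
            break
        common += 1
    # Hops up from the first folder to the common parent, then down to the second
    return (len(dirs_a) - common) + (len(dirs_b) - common)


print(change_distance("/src/main/java/a/X.java", "/src/main/java/a/Y.java"))
print(change_distance("/src/main/java/a/X.java", "/src/main/java/b/Y.java"))
```

Files in the same folder have distance 0, and files in sibling folders have distance 2 (one hop up, one hop down).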

In [33]:
dist = %cypher MATCH \
  shortestPath((:Branch{name:"heads/main"})-[:HAS_HEAD|HAS_PARENT*]->(c:Commit))  \
WHERE \
  not c:Merge \
WITH \
  c \
MATCH \
  (c)-[:CONTAINS_CHANGE]->()-[:MODIFIES]->(:File)-[:HAS_PATH]->(p1:Path), \
  (c)-[:CONTAINS_CHANGE]->()-[:MODIFIES]->(:File)-[:HAS_PATH]->(p2:Path) \
WHERE \
  id(p1) < id(p2) \
  and p1.relativePath contains "/src/main/" \
  and p2.relativePath contains "/src/main/" \
WITH \
  c, p1, p2 \
MATCH \
  path=shortestPath((p1)-[:CONTAINS*]-(p2)) \
WITH \
  c, avg(length(path)-2) as avgDistancePerCommit \
RETURN \
  substring(c.date, 0, 7) as `Month Of Year`, avg(avgDistancePerCommit) as `Average Distance` \
ORDER BY \
  `Month Of Year`
 
df=dist.get_dataframe()
fig = px.line(df, x="Month Of Year", y="Average Distance", line_shape="spline", title="Avg path distance per commit of production code files (avg per month)")
fig.show()

18 rows affected.
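
The two-stage averaging used in the query can be illustrated with a small sketch. The sample data is made up and the function names are hypothetical. It contrasts the average of per-commit averages with a single pooled average over all file pairs of a month.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample: (month, commit id, pairwise distances within that commit)
commits = [
    ("2020-01", "c1", [0, 0, 0, 0, 0, 0]),  # many files, all in one folder
    ("2020-01", "c2", [4]),                 # two files far apart
    ("2020-02", "c3", [2, 2]),
]


def monthly_average(commits):
    """Average per commit first, then average those values per month."""
    per_month = defaultdict(list)
    for month, _commit, distances in commits:
        per_month[month].append(mean(distances))
    return {month: mean(values) for month, values in sorted(per_month.items())}


def pooled_average(commits):
    """Average all pairwise distances of a month in a single step."""
    per_month = defaultdict(list)
    for month, _commit, distances in commits:
        per_month[month].extend(distances)
    return {month: mean(values) for month, values in sorted(per_month.items())}


print(monthly_average(commits))
print(pooled_average(commits))
```

In the pooled variant, commits that touch many files contribute many pairs and therefore dominate the monthly value. Averaging per commit first gives every commit the same weight.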
