# Generate Fix Commit Tree

If a fix commit for a given (project, file, line) also appears
as the fix-commit parent of another entry for the same (project, file, line),
we say that the first fix created or left a bug.

This notebook finds such commit sequences and builds a tree from them.
The tree helps to investigate the situations in which
developers create or leave a bug while fixing another.
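
The definition above can be illustrated on a few made-up records before touching the real dataset. This is a minimal sketch: the SHAs, project names, and file names are invented, and the tuple layout is an assumption chosen for the example, not the dataset's schema.

```python
# Toy records: (fix commit, parent of fix commit, project, file, line).
# All values are invented for illustration only.
records = [
    ("c1", "c0", "proj", "A.java", 10),
    ("c2", "c1", "proj", "A.java", 10),   # parent c1 is itself a fix at this location
    ("c3", "c1", "proj", "B.java", 10),   # same parent, different file: not a sequence
    ("c5", "c4", "other", "A.java", 10),  # unrelated project
]

# Every (fix commit, project, file, line) combination present in the data.
fix_keys = {(fix, project, path, line) for fix, _, project, path, line in records}

# A record continues a sequence when its parent commit is also
# a fix commit for the same (project, file, line).
sequential = [
    (fix, parent, project, path, line)
    for fix, parent, project, path, line in records
    if (parent, project, path, line) in fix_keys
]
print(sequential)
```

Only the `c2` record qualifies here, because its parent `c1` fixed the same line of the same file in the same project.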

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_json('../dataset/sstubsLarge')

In [3]:
print('Rows: {}, Columns: {}'.format(*df.shape))

Rows: 63923, Columns: 14


In [4]:
print('**Columns:**')
print(*df.columns, sep='\n')

**Columns:**
bugType
fixCommitSHA1
fixCommitParentSHA1
bugFilePath
fixPatch
projectName
bugLineNum
bugNodeStartChar
bugNodeLength
fixLineNum
fixNodeStartChar
fixNodeLength
sourceBeforeFix
sourceAfterFix


## Filter tree-commits

First, the fix commits that appear as the parent commit of another entry are found via SQL.
Three different SQL queries are run; they should all return the same count, which increases confidence in the result.
The same selection is then applied to the `pd.DataFrame`.

In [5]:
import sqlite3
conn = sqlite3.connect('../database/sstubs.db')
cursor = conn.cursor()

In [6]:
query = '''SELECT count(*)
    FROM (SELECT * 
    FROM sstubs_large AS b1
        INNER JOIN sstubs_large AS b2
        ON b1.parent = b2.child AND b1.project = b2.project AND b1.file = b2.file AND b1.line = b2.line
    GROUP BY b1.child, b1.parent, b1.project, b1.line, b1.file, b1.type)'''
for res in cursor.execute(query):
    print(*res)

400


In [7]:
query = '''SELECT count(*) FROM sstubs_large WHERE (parent, project, file, line) IN (
                SELECT child, project, file, line FROM sstubs_large
           ) '''
for row in cursor.execute(query):
    print(*row)


400


In [8]:
query = '''SELECT count(*) FROM sstubs_large AS P WHERE EXISTS (
                SELECT parent
                FROM sstubs_large AS C
                WHERE C.child = P.parent AND C.project = P.project AND C.file = P.file AND C.line = P.line
           ) '''
for row in cursor.execute(query):
    print(*row)

400
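
The agreement between the query styles can be reproduced on a small in-memory table. This is only a sketch: the column names (`child`, `parent`, `project`, `file`, `line`, `type`) are inferred from the queries above, how the real `sstubs_large` table was populated is not shown in this notebook, and the rows and bug-type labels are made up.

```python
import sqlite3

# In-memory stand-in for the database; the schema is an assumption inferred from the queries.
toy_conn = sqlite3.connect(":memory:")
toy_conn.execute(
    "CREATE TABLE sstubs_large "
    "(child TEXT, parent TEXT, project TEXT, file TEXT, line INTEGER, type TEXT)"
)
toy_rows = [
    ("c1", "c0", "proj", "A.java", 10, "typeA"),
    ("c2", "c1", "proj", "A.java", 10, "typeB"),  # follows c1 at the same location
    ("c3", "c2", "proj", "A.java", 10, "typeA"),  # follows c2 at the same location
    ("c4", "c1", "proj", "B.java", 5, "typeC"),   # same parent SHA, different location
]
toy_conn.executemany("INSERT INTO sstubs_large VALUES (?, ?, ?, ?, ?, ?)", toy_rows)

# Self-join variant, de-duplicated with GROUP BY.
join_query = """SELECT count(*) FROM (
    SELECT b1.child
    FROM sstubs_large AS b1
        INNER JOIN sstubs_large AS b2
        ON b1.parent = b2.child AND b1.project = b2.project
           AND b1.file = b2.file AND b1.line = b2.line
    GROUP BY b1.child, b1.parent, b1.project, b1.line, b1.file, b1.type)"""

# Correlated EXISTS variant.
exists_query = """SELECT count(*) FROM sstubs_large AS P WHERE EXISTS (
    SELECT 1 FROM sstubs_large AS C
    WHERE C.child = P.parent AND C.project = P.project
          AND C.file = P.file AND C.line = P.line)"""

join_count = toy_conn.execute(join_query).fetchone()[0]
exists_count = toy_conn.execute(exists_query).fetchone()[0]
print(join_count, exists_count)
toy_conn.close()
```

Both variants count the `c2` and `c3` rows and skip `c4`, whose parent SHA matches a fix commit but at a different file and line.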


In [9]:
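# Each entry seen as a child: its fix commit plus the parent of that fix commit.
child_df = df[['fixCommitSHA1', 'fixCommitParentSHA1', 'bugFilePath', 'projectName', 'bugLineNum', 'bugType']]
# Each entry seen as a potential parent: its fix commit is renamed so it can match a child's parent SHA.
parent_df = df[['fixCommitSHA1', 'bugFilePath', 'projectName', 'bugLineNum', 'bugType']]\
    .rename(columns={'fixCommitSHA1': 'fixCommitParentSHA1'})

# Keep children whose parent commit is a fix commit for the same project, file and line.
merged_df = child_df.merge(
    parent_df,
    how='inner',
    on=['fixCommitParentSHA1', 'bugFilePath', 'projectName', 'bugLineNum'],
).drop(columns='bugType_y').rename(columns={'bugType_x': 'bugType'})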

In [10]:
columns = ['fixCommitParentSHA1', 'fixCommitSHA1', 'projectName', 'bugFilePath', 'bugLineNum', 'bugType']
tree_df = pd.DataFrame(
    data=merged_df.groupby(columns).groups.keys(),
    columns=columns,
)
tree_df.shape

(400, 6)

In [11]:
tree_df.to_csv('../dataset/sequential_sstubs.csv', index=False)

## Build Tree

In [12]:
from collections import deque

In [13]:
# Undirected adjacency map:
# (commit, project, file, line) -> {(neighbouring commit, bug type), ...}
roots = {}
for _, row in tree_df.iterrows():
    parent = row.fixCommitParentSHA1
    child = row.fixCommitSHA1
    project = row.projectName
    file = row.bugFilePath
    line = row.bugLineNum
    bug = row.bugType
    parentKey = (parent, project, file, line)
    # Test the full key, not the bare SHA; otherwise an existing set is overwritten.
    if parentKey not in roots:
        roots[parentKey] = set()
    roots[parentKey].add((child, bug))
    childKey = (child, project, file, line)
    if childKey not in roots:
        roots[childKey] = set()
    roots[childKey].add((parent, bug))

In [14]:
len(roots)

614
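
One possible next step, suggested by the `deque` import, is a breadth-first traversal that groups the keys into connected commit sequences. The sketch below is an assumption about how that could look, not the notebook's actual continuation. It runs on a made-up map shaped like `roots`, where each key maps to the set of its neighbours, and `connected_components` is a hypothetical helper name.

```python
from collections import deque

# Made-up adjacency map with the same shape as `roots`:
# (commit, project, file, line) -> {(neighbouring commit, bug type), ...}
toy_roots = {
    ("c0", "proj", "A.java", 10): {("c1", "typeA")},
    ("c1", "proj", "A.java", 10): {("c0", "typeA"), ("c2", "typeB")},
    ("c2", "proj", "A.java", 10): {("c1", "typeB")},
    ("d0", "proj", "B.java", 3): {("d1", "typeA")},
    ("d1", "proj", "B.java", 3): {("d0", "typeA")},
}

def connected_components(adjacency):
    """Group keys into connected sequences with a breadth-first search."""
    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            key = queue.popleft()
            component.append(key)
            _, project, path, line = key
            for sha, _bug in adjacency.get(key, ()):
                # Neighbours share the project, file and line of the current key.
                neighbor = (sha, project, path, line)
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components

components = connected_components(toy_roots)
print([len(c) for c in components])
```

On the toy map this yields one sequence of three commits and one of two. Calling `connected_components(roots)` on the real map would group its keys the same way, provided every key holds its complete neighbour set.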