Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modified files in filename filtered repository traversal do not always match git log. #217

Closed
DylanLukes opened this issue May 12, 2022 · 7 comments

Comments

@DylanLukes
Copy link

DylanLukes commented May 12, 2022

Describe the bug

In cases where git log --follow infers heuristically that a changed+renamed file is the "same" file, pydriller with the filepath option returns modified files which do not respect git log --follow. In conjunction with some other factors (see the Rationale section), this significantly frustrates tracking a file across its lifetime.

To Reproduce

**Note: ** I've chosen this repository in particular as it demonstrates the issue. It has the following peculiarity:

In the first commit (457900f), the file in question is named Large+River_<snip>.ipynb. This file is both changed and renamed to Large_River_<snip>.ipynb. If one uses git --follow, it correctly identifies this as "the same file" in the logged results.

> git log --oneline --name-only --follow Large_River_Monitoring_Forum_Biodiversity_Indices_Analysis.ipynb

fdf9aba Adding BAP Documentation
Large_River_Monitoring_Forum_Biodiversity_Indices_Analysis.ipynb
0495c9a Adding BAP Documentation
Large_River_Monitoring_Forum_Biodiversity_Indices_Analysis.ipynb
457900f initial commit
Large+River+Monitoring+Forum+Biodiversity+Indices+Analysis.ipynb

However, while pydriller correctly includes the 457900f commit when we provide the filepath argument, it does not correctly populate the list of modified files when traversing the repo. For example...

file_name = "Large_River_Monitoring_Forum_Biodiversity_Indices_Analysis.ipynb"
repo_name = "cbelyea/LRMF-Biodiversity-BAP"
repo = drill.Repository(f"https://github.com/{name_repo}", filepath=file_name)

rows = []
for commit in repo.traverse_commits():
    for modf in commit.modified_files:
        rows.append({
            'commit': commit.hash,
            'filename': modf.filename,
            'old_path': modf.old_path,
            'new_path': modf.new_path,
        })

pd.DataFrame(data=rows)

This yields:

hash filename old_path new_path
457900f42c6e1778bbdec07ec41e7d34afc5a370 Large+River+Monitoring+Forum+Biodiversity+Indices+Analysis.ipynb None Large+River+Monitoring+Forum+Biodiversity+Indices+Analysis.ipynb
0495c9abbd308452961716255bd48a262f598498 Large_River_Monitoring_Forum_Biodiversity_Indices_Analysis.ipynb None Large_River_Monitoring_Forum_Biodiversity_Indices_Analysis.ipynb
fdf9aba883c0faafd9bfaed686e4cc0f26dbe4cf Large_River_Monitoring_Forum_Biodiversity_Indices_Analysis.ipynb Large_River_Monitoring_Forum_Biodiversity_Indices_Analysis.ipynb Large_River_Monitoring_Forum_Biodiversity_Indices_Analysis.ipynb
...

The first two modified files both show up as if they are new files.

Rationale

The problem here comes when we want to filter the commit.modified_files as we iterate to only make use of those which are the file in question (or the same file under a previous name). Consider a modified version of the loop above:

for commit in repo.traverse_commits():
    for modf in commit.modified_files:
        if not is_our_file(modf.filename): 
            continue
            do_something_with(modf)

This will clearly skip the first commit (where the file's name has + rather than _ in its name). The simplest potential way to fix this is to iterate commits in reverse order and keep track of renames.

Note: we're reversing the order because we're starting from the most recent known name, so we need to walk backwards.

def has_rename(modf, curr_path):
    return all([
        modf.new_path == curr_name,
        modf.new_path != modf.old_path
        modf.old_path is not None
    ])

curr_path = file_name
for commit in reversed(list(repo.traverse_commits())):
    for modf in commit.modified_files:
        # follow a rename if needed
        if has_rename(modf, curr_path):
            curr_path = modf.old_path

        # skip unrelated files
        if not modf.filename == curr_path: 
            continue
        
        do_something_with(modf)

However, this is thwarted in the concrete example I've given above, because the chain is broken by old_path being None for both files.

OS Version:
MacOS 12.3.1

@DylanLukes
Copy link
Author

The temporary fix for this I've come up with is using the following function to manually build a mapping from commit to correct filename.

def filename_by_commit(repo_name, filename, from_commit, to_commit, working_dir='scratch/'):
    """
    Parse the git log for a downloaded repository, returning a correct commit -> filename mapping.
    """

    repo_dir = jp(working_dir, repo_name)
    repo = git.Repo(repo_dir)
    log = repo.git.log('--oneline', '--name-only', from_commit, to_commit, '--follow', filename)
    
    # Note: the line noise at the end is a funky "take 2 at a time" iterator idiom.
    return {subject[:7]: filename for subject, filename in zip(*[iter(log.splitlines())]*2)}

filename_by_commit(
    "cbelyea/LRMF-Biodiversity-BAP",
    "Large_River_Monitoring_Forum_Biodiversity_Indices_Analysis.ipynb",
    "457900f",
    "206bcaa"
)

@ishepard
Copy link
Owner

ishepard commented May 17, 2022

Hi @DylanLukes!

By default Pydriller doesn't add the --follow option (same for Git).

So my question would be: if you use git log, does git return the result you want? (so does it say rename?)

@ishepard
Copy link
Owner

To solve this I'd suggest you to add a new parameter, follow_renames or something like this, in Repository.

When the argument is passed, we can add the --follow option. WDYT?

@DylanLukes
Copy link
Author

DylanLukes commented May 17, 2022

By default Pydriller doesn't add the --follow option (same for Git).

When the argument is passed, we can add the --follow option. WDYT?

When a filepath is provided to the Repository initializer, it appears that --follow is already always added and used for computing the list of commits. See below:

# Get the commits that modified the filepath. In this case, we can not use
# git rev-list since it doesn't have the option --follow, necessary to follow
# the renames. Hence, we manually call git log instead
if self._conf.get('filepath') is not None:
self._conf.set_value(
'filepath_commits',
git.get_commits_modified_file(self._conf.get('filepath'),
self._conf.get('include_deleted_files'))
)

And in turn:

pydriller/pydriller/git.py

Lines 300 to 320 in 58ae15e

def get_commits_modified_file(self, filepath: str, include_deleted_files=False) -> List[str]:
"""
Given a filepath, returns all the commits that modified this file
(following renames).
:param str filepath: path to the file
:param bool include_deleted_files: if True, include commits that modifies a deleted file
:return: the list of commits' hash
"""
path = str(Path(filepath))
commits = []
try:
if include_deleted_files:
commits = self.repo.git.log("--follow", "--format=%H", "--", path).split('\n')
else:
commits = self.repo.git.log("--follow", "--format=%H", path).split('\n')
except GitCommandError:
logger.debug(f"Could not find information of file {path}")
return commits

So my question would be: if you use git log, does git return the result you want? (so does it say rename?)

Not quite. It is detected as a copy (99%):

> git log --follow --oneline --name-only -- Large_River_Monitoring_Forum_Biodiversity_Indices_Analysis.ipynb    
...
commit 0495c9abbd308452961716255bd48a262f598498
<snip>
 copy Large+River+Monitoring+Forum+Biodiversity+Indices+Analysis.ipynb => Large_River_Monitoring_Forum_Biodiversity_Indices_Analysis.ipynb (99%)

commit 457900f42c6e1778bbdec07ec41e7d34afc5a370
<snip>
 create mode 100644 Large+River+Monitoring+Forum+Biodiversity+Indices+Analysis.ipynb

This repository has a somewhat weird edge case I should probably explain better:

  1. In 457900f, the Large+River.ipynb is created.
  2. In 0495c9a, the Large+River.ipynb is copied to Large_River.ipynb and modified.
  3. In 0495c9a, the old Large+River.ipynb is not deleted (both exist).
  4. Later on, Large+River.ipynb is deleted, leaving only Large_River.ipynb

So we have something like:

457900f       0495c9a     89803de      ...       HEAD 
   ┆             ┆           ┆
  [+]─┬─cp+mod──[_]───mod───[_]────────┄┄┄───────[_]
      └─────────[+]───del

This is the root of the discrepancy: from the perspective of git log --follow tracing the file backwards, there is a clear chain of history which includes the + file at 457900f. However, from the perspective of each individual modification list, _ in 0495c9a is a copy (shown by PyDriller as a new file, with old_file == None).

So, the issue isn't in which commits PyDriller collects for traversal, but in how the commit.modified_files members are populated during that traversal. It correctly computes filepath_commits, and iterates over them correctly. However, this weird edge case of "a rename that looks like a copy" is unhandled.

It is admittedly a strange and likely rare edge-case, but it seems to me that the most intuitive behavior would be to treat such copy pseudo-renames as renames when generating the modified_files, but only when following a particular file, and within each commit's modified_files only for that particular file. That is, in such a way that maps most closely to git log --follow's behavior.

@DylanLukes
Copy link
Author

DylanLukes commented May 17, 2022

Perhaps a less invasive approach would be to add a new method traverse_changes() which only works for a Repository initialized with a filename, that reflects "tracking a file across it's lifetime according to git follow" rather than traversing a series of commits which happen to contain a file?

@ishepard
Copy link
Owner

Sorry for the late reply, at work it has been busy 😄

Anyway, I remember I had in mind that traverse_changes(), but it will be behave exactly like passing the filepath, so at the end I didn't implement it.

It's definitely a corner case, but the most important thing for us is that Pydriller doesn't return results that are different from what Git returns.

It's my understanding that git log returns the same result of Pydriller (as it should). When added the --follow however, it fixes the problem.

What we can do is to add a new parameter to Repository (something like "follow"), that when passed "True" we call git log --follow.

@ishepard
Copy link
Owner

Created #223 that adds the --follow option. Closing this one

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants