Modified files in filename filtered repository traversal do not always match git log. #217
Comments
The temporary fix for this I've come up with is using the following function to manually build a mapping from commit to correct filename.

    # Assumed imports: GitPython provides `git`; `jp` is taken to be an alias for os.path.join.
    import git
    from os.path import join as jp

    def filename_by_commit(repo_name, filename, from_commit, to_commit, working_dir='scratch/'):
        """
        Parse the git log for a downloaded repository, returning a correct commit -> filename mapping.
        """
        repo_dir = jp(working_dir, repo_name)
        repo = git.Repo(repo_dir)
        log = repo.git.log('--oneline', '--name-only', from_commit, to_commit, '--follow', filename)
        # Note: the line noise at the end is a funky "take 2 at a time" iterator idiom.
        return {subject[:7]: filename for subject, filename in zip(*[iter(log.splitlines())]*2)}

    filename_by_commit(
        "cbelyea/LRMF-Biodiversity-BAP",
        "Large_River_Monitoring_Forum_Biodiversity_Indices_Analysis.ipynb",
        "457900f",
        "206bcaa"
    )
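For readers unfamiliar with the "take 2 at a time" idiom in the function above, here is a minimal standalone sketch of the same pairing step. The helper name `pair_log_lines` and the sample log text are invented for illustration. The sketch assumes the log strictly alternates between a subject line and a single file name, which is what `--oneline --name-only` is expected to produce when each commit touches one followed file.

```python
def pair_log_lines(log_text):
    """Map abbreviated commit hash -> file name from alternating log lines."""
    lines = iter(log_text.splitlines())
    # zip() pulls from the same iterator twice per step, so it yields
    # (line 1, line 2), (line 3, line 4), and so on.
    # split()[0] takes the first token, so the hash may be longer than 7 characters.
    return {subject.split()[0]: path for subject, path in zip(lines, lines)}


# Hypothetical log text, invented for illustration only.
sample_log = "\n".join([
    "bbbbbbb Rename notebook",
    "new_name.ipynb",
    "aaaaaaa Add notebook",
    "old+name.ipynb",
])

mapping = pair_log_lines(sample_log)
print(mapping)
```

If the log contained blank lines or more than one file per commit, the pairs would drift out of alignment, so this is only safe for the single-file case.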
Hi @DylanLukes! By default Pydriller doesn't add the So my question would be: if you use
To solve this I'd suggest adding a new parameter, follow_renames or something like this, in Repository. When the argument is passed, we can add the
When a pydriller/pydriller/repository.py Lines 216 to 224 in 58ae15e
And in turn: Lines 300 to 320 in 58ae15e
Not quite. It is detected as a copy (99%):

    > git log --follow --oneline --name-only -- Large_River_Monitoring_Forum_Biodiversity_Indices_Analysis.ipynb
    ...
    commit 0495c9abbd308452961716255bd48a262f598498
    <snip>
    copy Large+River+Monitoring+Forum+Biodiversity+Indices+Analysis.ipynb => Large_River_Monitoring_Forum_Biodiversity_Indices_Analysis.ipynb (99%)
    commit 457900f42c6e1778bbdec07ec41e7d34afc5a370
    <snip>
    create mode 100644 Large+River+Monitoring+Forum+Biodiversity+Indices+Analysis.ipynb

This repository has a somewhat weird edge case I should probably explain better:
So we have something like:
This is the root of the discrepancy: from the perspective of

So, the issue isn't in which commits PyDriller collects for traversal, but in how the

It is admittedly a strange and likely rare edge case, but it seems to me that the most intuitive behavior would be to treat such copy pseudo-renames as renames when generating the
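To make the "treat copies as renames" idea concrete, here is a rough sketch that reads the summary lines from the log excerpt quoted earlier in this thread and records, for each new name, the name it came from. This is not how PyDriller works internally. The helper `previous_names` and the regular expression are hypothetical, and they handle only the simple `OLD => NEW (NN%)` form, not git's abbreviated brace form for paths that share a prefix or suffix.

```python
import re

# Matches the simple summary form "copy OLD => NEW (99%)" or "rename OLD => NEW (100%)".
SUMMARY_RE = re.compile(r"^\s*(copy|rename) (.+) => (.+) \((\d+)%\)$")


def previous_names(log_text):
    """Return {new_name: old_name} for every copy or rename summary line."""
    mapping = {}
    for line in log_text.splitlines():
        match = SUMMARY_RE.match(line)
        if match:
            _kind, old, new, _similarity = match.groups()
            # Copies and renames are recorded the same way on purpose.
            mapping[new] = old
    return mapping


# Log excerpt as quoted in this thread.
sample = """commit 0495c9abbd308452961716255bd48a262f598498
<snip>
 copy Large+River+Monitoring+Forum+Biodiversity+Indices+Analysis.ipynb => Large_River_Monitoring_Forum_Biodiversity_Indices_Analysis.ipynb (99%)
commit 457900f42c6e1778bbdec07ec41e7d34afc5a370
<snip>
 create mode 100644 Large+River+Monitoring+Forum+Biodiversity+Indices+Analysis.ipynb
"""

result = previous_names(sample)
print(result)
```

Whether a copy should count as a rename is a policy decision. The sketch only shows that the information needed to make it is present in the log.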
Perhaps a less invasive approach would be to add a new method
Sorry for the late reply, it has been busy at work 😄 Anyway, I remember I had in mind that traverse_changes(), but it would behave exactly like passing the filepath, so in the end I didn't implement it. It's definitely a corner case, but the most important thing for us is that Pydriller doesn't return results that are different from what Git returns. It's my understanding that What we can do is add a new parameter to Repository (something like "follow") so that, when passed "True", we call
Created #223, which adds the --follow option. Closing this one.
Describe the bug
In cases where git log --follow infers heuristically that a changed+renamed file is the "same" file, pydriller with the filepath option returns modified files which do not respect git log --follow. In conjunction with some other factors (see the Rationale section), this significantly frustrates tracking a file across its lifetime.
To Reproduce
Note: I've chosen this repository in particular as it demonstrates the issue. It has the following peculiarity:
In the first commit (457900f), the file in question is named Large+River_<snip>.ipynb. This file is both changed and renamed to Large_River_<snip>.ipynb. If one uses git log --follow, it correctly identifies this as "the same file" in the logged results.
However, while pydriller correctly includes the 457900f commit when we provide the filepath argument, it does not correctly populate the list of modified files when traversing the repo. For example...
This yields:
The first two modified files both show up as if they are new files.
Rationale
The problem here comes when we want to filter the commit.modified_files as we iterate to only make use of those which are the file in question (or the same file under a previous name). Consider a modified version of the loop above:
This will clearly skip the first commit (where the file's name has + rather than _ in its name). The simplest potential way to fix this is to iterate commits in reverse order and keep track of renames.
Note: we're reversing the order because we're starting from the most recent known name, so we need to walk backwards.
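The reverse walk described above can be sketched without any repository access. The `Change` tuple below is a hypothetical stand-in for a modified-file record with `old_path` and `new_path` fields, and the commit ids and file names are invented. This is an illustration of the idea, not the code from the original report.

```python
from typing import NamedTuple, Optional


class Change(NamedTuple):
    """Hypothetical stand-in for one modified-file record."""
    old_path: Optional[str]
    new_path: Optional[str]


def names_by_commit(commits, latest_name):
    """Walk commits newest to oldest, following renames back from latest_name.

    commits: list of (commit_id, [Change, ...]) ordered oldest to newest.
    Returns {commit_id: name the file has after that commit}.
    """
    current = latest_name
    result = {}
    for commit_id, changes in reversed(commits):
        for change in changes:
            if change.new_path != current:
                continue
            result[commit_id] = current
            # A rename records the earlier name in old_path; follow it backwards.
            if change.old_path is not None and change.old_path != current:
                current = change.old_path
            break
    return result


# Invented history: created as a+b.ipynb, renamed to a_b.ipynb, then edited.
history = [
    ("c1", [Change(None, "a+b.ipynb")]),
    ("c2", [Change("a+b.ipynb", "a_b.ipynb")]),
    ("c3", [Change("a_b.ipynb", "a_b.ipynb")]),
]

result = names_by_commit(history, "a_b.ipynb")
print(result)
```

The walk depends entirely on `old_path` being populated at the rename. If the record at `c2` reported `old_path` as None, the tracked name would never switch to `a+b.ipynb` and `c1` would be dropped from the result.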
However, this is thwarted in the concrete example I've given above, because the chain is broken by old_path being None for both files.
OS Version:
MacOS 12.3.1