Modified files are not fetched for merge commits #89

Closed
greninja opened this issue Feb 23, 2020 · 3 comments
Comments

@greninja

Describe the bug
I fetched all the commits from a repository using the standard commits = list(RepositoryMining(repo_path).traverse_commits()). It returns 2061 commits (as expected). Of these, 331 are merge commits (for which commit.merge is true). For these 331 merge commits, PyDriller returns an empty list of modified files, even though many files are modified in those commits.

To Reproduce
E.g., commit number 115 (hash 13104b2651bed37c1eb238eacd09e05e5906534a) in the above-mentioned repository is a merge commit. It has 14 modified files (with 556 additions and 176 deletions), but none of them show up in commit.modifications.
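
A minimal sketch of the check described above. It only assumes commit objects exposing merge, modifications, and hash attributes, so with PyDriller you would presumably pass RepositoryMining(repo_path).traverse_commits() as the argument. The stand-in objects and their hashes below are hypothetical and exist only so the snippet runs without a repository:

```python
from types import SimpleNamespace


def merges_without_modifications(commits):
    """Return the hashes of merge commits whose modification list is empty."""
    return [c.hash for c in commits if c.merge and not c.modifications]


# Hypothetical stand-ins for PyDriller commit objects (not real data).
fake_commits = [
    SimpleNamespace(hash="aaa111", merge=False, modifications=["A.java"]),
    SimpleNamespace(hash="bbb222", merge=True, modifications=[]),
    SimpleNamespace(hash="ccc333", merge=True, modifications=[]),
]

print(merges_without_modifications(fake_commits))  # ['bbb222', 'ccc333']
```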

OS Version:
Linux Ubuntu 18.04

@greninja greninja changed the title Modified files are not returned for merged commits Modified files are not fetched for merged commits Feb 23, 2020
@greninja greninja changed the title Modified files are not fetched for merged commits Modified files are not fetched for merge commits Feb 23, 2020
@ishepard
Owner

ishepard commented Feb 24, 2020

Hi @greninja! Thank you for opening the issue!
You're right, pydriller doesn't return the modified files of a merge commit. However, this is not a bug, it's a feature 😃
Let me explain. In a merge commit, the files are not actually modified (more on this later, but bear with me for now). What you see on GitHub is not what changed in the merge commit itself, but the diff between the two branches: all the files modified in the right branch compared against the left branch. If you look at the merge commit on its own, nothing changed there.
Let's look at an example. Take this merge commit. On GitHub, it shows 2 modified files.
Now try opening the same commit with git show, and you'll see there are no modified files:

> git show 9347dd1af75971cc6027cdf915df6713f7a80d2c
commit 9347dd1af75971cc6027cdf915df6713f7a80d2c
Merge: eae97730 90e52304
Author: Daniel Cohen Gindi <Danielgindi@gmail.com>
Date:   Sun Feb 9 07:34:02 2020 +0200

    Merge pull request #4802 from oatrice/fix_crash_solid_color_barchart

    fix NPE when use solid color with barchart

(END)

This is because git show uses a combined diff for merge commits. A combined diff lists only the files that were modified relative to all parents.
If you want a normal diff (also called "unified diff") instead, run git show -m 9347dd1af75971cc6027cdf915df6713f7a80d2c. That is, instead of combining the diffs against each parent into one big combined diff, git shows the diff against each parent, one at a time.
This is what GitHub is doing. However, those are not the files that changed in the merge commit; they are the files that changed in the commits before the merge.
So, imagine you have:

commit3
   | \
   |   commit2 (A.java was added here)
   | /
commit1

In this case, what do you think was modified in commit3?
Answer: nothing. It's just a merge commit with 0 files changed.

What do you think GitHub will show?
It will show that file A.java was added in commit3. But was it? Nope, it was added in commit2
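
To make the difference concrete, here is a small sketch that models each commit as a snapshot (a dict from path to content) and compares the per-parent view with the combined view for the graph above. The README file and all file contents are made up for illustration, and this is a toy model of the idea, not how git computes diffs internally:

```python
def changed_files(parent, result):
    """Paths whose content differs between a parent snapshot and the result."""
    paths = set(parent) | set(result)
    return {p for p in paths if parent.get(p) != result.get(p)}


def per_parent_diffs(parents, result):
    """One set of changed paths per parent (the 'git show -m' view)."""
    return [changed_files(p, result) for p in parents]


def combined_diff_files(parents, result):
    """Only paths that differ from every parent (the combined-diff view)."""
    diffs = per_parent_diffs(parents, result)
    return set.intersection(*diffs) if diffs else set()


# Hypothetical snapshots matching the graph above.
commit1 = {"README": "hello"}
commit2 = {"README": "hello", "A.java": "class A {}"}
commit3 = dict(commit2)  # clean merge: identical to the right-hand parent

parents = [commit1, commit2]
print(per_parent_diffs(parents, commit3))     # [{'A.java'}, set()]
print(combined_diff_files(parents, commit3))  # set()
```

Against commit1 the merge appears to add A.java, but since the merge result matches commit2 exactly, no file differs from all parents and the combined view is empty.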

Why doesn't pydriller show it?

Because if you analyse the history of a project, you'll see something like:

commit1 -> some changes
commit2 -> A.java added
commit3 -> A.java added

So A.java was added 2 times, without being deleted. This will create confusion and many errors in MSR studies.
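
A tiny sketch of that double counting, using the three commits above. The history list is hand-written to mirror the example (it is not PyDriller output), and count_additions is a hypothetical helper:

```python
from collections import Counter

# (commit name, is it a merge?, files reported as added)
history = [
    ("commit1", False, []),
    ("commit2", False, ["A.java"]),
    ("commit3", True, ["A.java"]),  # what a per-parent diff would report
]


def count_additions(history, include_merges):
    """Count how many times each file is reported as added."""
    counts = Counter()
    for _name, is_merge, added in history:
        if is_merge and not include_merges:
            continue  # skip the files a merge only inherits from its branches
        counts.update(added)
    return counts


print(count_additions(history, include_merges=True)["A.java"])   # 2
print(count_additions(history, include_merges=False)["A.java"])  # 1
```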

So, which files are modified in the merge commit?

The so-called "conflicts". Those are the files that, when merging 2 commits, you actually have to modify (to resolve the conflict) before committing.
For example, you can see them in commit 13104b2651bed37c1eb238eacd09e05e5906534a.

Why doesn't pydriller present them?

Because I can't find a way to parse them. If you notice, the output of a combined diff is a bit different from a unified diff. There are N columns (where N = number of parents), each showing how the result differs from the corresponding parent. For example, you can see these kinds of lines:

--                if (mChart.isAdjustXLegendEnabled())
...
- import com.github.mikephil.charting.charts.LineChart;
 +import com.github.mikephil.charting.charts.BarLineChartBase.BorderStyle;
+ import com.github.mikephil.charting.charts.LineChart;

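so "- " means the line was removed relative to the first parent (the left branch), " +" means it was added relative to the second parent (the right branch), "--" means it was removed relative to both, etc.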
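
As a rough illustration of the column idea, the sketch below splits one body line of a combined diff into its per-parent markers and its content. It is not a full parser: it assumes the number of parents is known and ignores the file and hunk headers (diff --cc, @@@ ...). The function name and the returned keys are made up for this example:

```python
def classify_combined_line(line, num_parents=2):
    """Split a combined-diff body line into per-parent markers and content.

    Column i holds '-', '+', or ' ' for parent i + 1.
    """
    markers = line[:num_parents].ljust(num_parents)
    return {
        # '-' in column i: the line is in parent i + 1 but not in the result.
        "removed_from_parents": [i + 1 for i, m in enumerate(markers) if m == "-"],
        # '+' in column i: the line is in the result but not in parent i + 1.
        "added_vs_parents": [i + 1 for i, m in enumerate(markers) if m == "+"],
        "content": line[num_parents:],
    }


sample = [
    "--                if (mChart.isAdjustXLegendEnabled())",
    "- import com.github.mikephil.charting.charts.LineChart;",
    " +import com.github.mikephil.charting.charts.BarLineChartBase.BorderStyle;",
    "+ import com.github.mikephil.charting.charts.LineChart;",
]
for line in sample:
    info = classify_combined_line(line)
    print(info["removed_from_parents"], info["added_vs_parents"])
```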

So, until someone builds a parser for combined diffs, Pydriller will just ignore merge conflicts.

I hope I was clear enough for you to understand!
Anyway, I can leave you with some documentation on diffs of merge commits.
[1] https://stackoverflow.com/a/40986893/7053584
[2] https://git-scm.com/docs/git-diff#_combined_diff_format

@greninja
Author

Hey @ishepard, that was a really lucid explanation! Thanks!

@josemorenoo

josemorenoo commented Feb 15, 2022

@ishepard @greninja

I had this problem too, but I may have found a solution:

git log -m --name-only <commit hash>

For me this shows all the files affected by a merge commit. I don't know the internals of pydriller well enough, but I wonder if this might be usable. If this is an easy first PR, I'm happy to contribute.
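
A rough sketch of how that command could be called from Python. Two things here are my own additions and not part of the command above: -1, to limit git log to the given commit instead of walking its ancestors, and --pretty=format:, to suppress the commit header so only file names are printed. The function names are hypothetical, and this is not how pydriller works internally. Note that the result is the union of the per-parent diffs, so it includes files that were really changed in the commits before the merge:

```python
import subprocess


def merge_files_command(commit_hash):
    """Build the git invocation (assumption: -1 and --pretty=format: added)."""
    return ["git", "log", "-m", "-1", "--name-only", "--pretty=format:", commit_hash]


def parse_name_only(output):
    """Collect the unique, non-blank file names from --name-only output."""
    return sorted({line.strip() for line in output.splitlines() if line.strip()})


def files_touched_by_merge(repo_path, commit_hash):
    """Run git in repo_path and return the files listed for the commit."""
    result = subprocess.run(
        merge_files_command(commit_hash),
        cwd=repo_path,
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_name_only(result.stdout)
```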
