Question regarding mining strategy #109

Closed
sims1253 opened this issue Mar 24, 2020 · 3 comments

@sims1253

sims1253 commented Mar 24, 2020

This is not entirely about pydriller but I figured there might be enough experience here to help with this:

I am trying to get the history of a branch and have to work with the diffs of those commits.
Previously I did this manually: I used git log --first-parent plus some parsing to collect the hashes I am interested in, and then hash^! to get the diff for each commit.

First, I am not entirely sure this did what I intended. Merges are occult stuff I cannot seem to wrap my head around.
However, I tried to recreate the behaviour with RepositoryMining using order="reverse" and only_in_branch=<origin_branch>, and while comparing the results of the two approaches I noticed they don't return the same commits.

Shouldn't only_in_branch=<origin_branch> lead to the same behaviour as --first-parent?
And maybe someone can help me understand the merging stuff. Am I right in thinking that I have to include merge commits in the history, because otherwise there will be jumps I cannot follow?

@ishepard
Owner

Hi @sims1253! I will try to answer your question and suggest a couple of articles to read to clarify this point. It's quite complicated, as you can imagine :)

The answer to your question is no: doing git log --first-parent is not the same as RepositoryMining(only_in_branch="branch"). Let me start with the second case.

When you do only_in_branch="branch", what pydriller actually does is git log branch.
In Git, a commit can belong to several branches at the same time: this is the case when you merge one branch into another.
If we take this history:

master    ---1----2----4----7---8
              \           /
branch1        3----5----6

and we do git log master, we obtain the list of all the commits (including the ones on branch1): 8 7 6 5 4 3 2 1. The reason is that when you merge two branches, all the commits of the second branch become part of the history of the first one (so all the commits of branch1 become reachable from master). Hence, commits 3, 5, and 6 will also be in the master branch after the merge commit.
On the other hand, if the history is like this:

master    ---1----2----4----7---8
              \          
branch1        3----5----6

and there is no merge commit, git log master will only return 8 7 4 2 1, while git log branch1 will return 6 5 3 1.
I hope I clarified what only_in_branch does!
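
As a rough illustration of the two histories above, here is a toy model in Python. It is not how Git or pydriller is implemented: the PARENTS table and the reachable helper are made up for this sketch, and sorting by commit number stands in for Git's date ordering.

```python
# Toy model of the merged history: each commit maps to its parents,
# first parent listed first. Commit 7 merges branch1 into master.
PARENTS = {
    1: [],
    2: [1],
    3: [1],
    4: [2],
    5: [3],
    6: [5],
    7: [4, 6],
    8: [7],
}


def reachable(tip, parents):
    """Return every commit reachable from `tip` by following all parents."""
    seen = set()
    stack = [tip]
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents[commit])
    return seen


# With the merge, "git log master" (tip 8) sees the commits of branch1 too.
merged = sorted(reachable(8, PARENTS), reverse=True)
print(merged)  # [8, 7, 6, 5, 4, 3, 2, 1]

# Same history without the merge: 7 is an ordinary commit on master.
UNMERGED = dict(PARENTS)
UNMERGED[7] = [4]
master_only = sorted(reachable(8, UNMERGED), reverse=True)
branch1_only = sorted(reachable(6, UNMERGED), reverse=True)
print(master_only)   # [8, 7, 4, 2, 1]
print(branch1_only)  # [6, 5, 3, 1]
```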

Now let's move to --first-parent. When calling git log, by default Git returns all the commits reachable from the starting point. If we take the previous example:

master    ---1----2----4----7---8
              \           /
branch1        3----5----6

git log will return 8 7 6 5 4 3 2 1.
When you do git merge branch1, the resulting merge commit has 2 parents (left and right, which Git calls the first and second parent). The left parent is the branch you are currently on, while the right parent is the branch you are merging:

> git checkout master
> git merge branch1

Here the left parent is master and the right parent is branch1.
If you use git log --first-parent, you are instructing Git that every time it encounters a merge commit, it should follow only the first parent.
In the example above, this will return 8 7 4 2 1. That is, when Git arrives at 7 (the merge commit), instead of following the usual date-based order it goes straight to the left parent.
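
The same walk can be sketched on a toy commit graph in Python. This is only an illustration under assumptions, not Git's actual implementation: the PARENTS table and the first_parent_chain helper are hypothetical names, and the first entry in each parent list plays the role of the left parent.

```python
# Toy model of the merged history: each commit maps to its parents,
# first (left) parent listed first. Commit 7 merges branch1 into master.
PARENTS = {
    1: [],
    2: [1],
    3: [1],
    4: [2],
    5: [3],
    6: [5],
    7: [4, 6],
    8: [7],
}


def first_parent_chain(tip, parents):
    """Walk back from `tip`, following only the first parent of each commit."""
    chain = []
    commit = tip
    while commit is not None:
        chain.append(commit)
        commit_parents = parents[commit]
        # At a merge commit the second parent (the merged branch) is ignored.
        commit = commit_parents[0] if commit_parents else None
    return chain


print(first_parent_chain(8, PARENTS))  # [8, 7, 4, 2, 1]
```

Commits 3, 5 and 6 never show up in this walk, which is exactly the information that goes missing when analysing source code with --first-parent.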

Depending on what you need, --first-parent can be useful or not. If you analyse source code, using first-parent can be tricky because you are gonna miss information (because you skip all the commits in the other branch). For example, in commit 7 you might see that a file has changed, then you decide to go back in history and get the commit that modified that file. You go to commit 4, 2, 1, but that file was not changed in any of these commits: so where was it changed? In the other branch 😄

More information on merge commits: see this
First-parent: http://www.davidchudzicki.com/posts/first-parent/

@sims1253
Author

Thank you so much for taking the time :)
What I am trying to do is track individual lines through history so that I can identify their origin later. The use case is a line-based version of Bugspots (predicting future bugs from past bugs on a file level).
So if I get this right, I want only_in_branch and not --first-parent as I need all the changes. I, however, don't need merge commits as they don't contain modifications.
I guess what confused me is GitHub showing diffs for merge commits.

Nice :) And thanks again!

@ishepard
Owner

Indeed that's correct.
I would just use only_in_branch='master', and that's it 👍
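
A minimal sketch of what that could look like, assuming PyDriller 1.x (where the class is called RepositoryMining and a commit exposes its changes as modifications; in PyDriller 2.x these became Repository and modified_files). The helper name non_merge_diffs and the repo_path argument are made up for this example.

```python
def non_merge_diffs(repo_path, branch="master"):
    """Yield (hash, file path, diff text) for each non-merge commit on `branch`."""
    # Imported here so the module still loads where pydriller is not installed.
    from pydriller import RepositoryMining  # PyDriller 1.x class name

    miner = RepositoryMining(
        repo_path,
        only_in_branch=branch,  # same set of commits as "git log <branch>"
        only_no_merge=True,     # leave out merge commits
    )
    for commit in miner.traverse_commits():
        for mod in commit.modifications:
            # new_path is None for deleted files, so fall back to old_path.
            yield commit.hash, mod.new_path or mod.old_path, mod.diff
```

It would be used along the lines of: for commit_hash, path, diff in non_merge_diffs("path/to/repo"): ...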

You're welcome! Happy mining 😄
