Transitioning from Subversion to Git

VWoeltjen edited this page Jul 16, 2012 · 8 revisions

Since open sourcing, the MCT development team has had to adapt to using git for version control. Coming from the use of Subversion for our internal version control, this was a surprising paradigm shift: Subtle differences in the "git" way of doing things can have dramatic effects on the overall workflow.

This is by no means intended as a thorough introduction to git; I just wanted to record some of the mental hurdles that seemed to come up in the transition. To effectively use git, I advocate reading available documentation thoroughly, including:

  • GitHub Help provides a good starting point for setup and basic use.
  • Pro Git by Scott Chacon is good for both getting a high-level understanding of how git models information, and for some specific use cases.
  • The Reference Manual is useful for understanding the specific behavior of commands and their options, but presumes a high-level understanding (see above).

Quick cheat sheet

In general, it's not always helpful (and in some circumstances may be harmful) to think of git commands by way of analogy to Subversion; there are some fundamental differences in their models. That said, if you need a starting point for specific tasks that you're familiar with from Subversion, these might be helpful.

  • svn checkout http://... -> git clone http://... (make a local clone of a repository, and a working copy)
  • svn update -> git pull (bring in changes from the remote to your local)
  • svn add ... -> git add ... (add a file to version control; it is also "staged" for commit)
  • svn log -> git log (show commit history; note that in git's case, this is of your local repository)
  • svn status -> git status (show the status of your working copy and staging area)
  • svn commit ... ~> git add ...; git commit ...; git push (share local changes)

That last one really highlights the differences between Subversion and git. First, git has the concept of a staging area: This is where files wait to be committed. git add stages files, allowing you to sculpt your commit. git commit is similar to svn commit - it takes a snapshot and records it to your repository, but in this case it's your local repository, not the remote repository. git push is how you share those changes: It updates your remote repository with your active snapshot (and, by necessity, its history).
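As a rough illustration of that three-step sequence, here is a minimal sketch that runs entirely in a temporary directory. The bare repository shared.git stands in for the remote, and the identity and file names are made up for the example; none of them come from our actual setup.

```shell
set -e
# Hypothetical identity so commits work on a machine with no git config
export GIT_AUTHOR_NAME="Example Dev" GIT_AUTHOR_EMAIL="dev@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

work=$(mktemp -d)
git init -q --bare "$work/shared.git"            # stands in for the remote repository
git clone -q "$work/shared.git" "$work/local"    # roughly "svn checkout" (warns that the repo is empty)
cd "$work/local"
git symbolic-ref HEAD refs/heads/master          # make sure the first commit lands on master

echo "one" > a.txt
echo "two" > b.txt
git add a.txt                        # stage only a.txt; b.txt is left out of the commit
git commit -q -m "Add a.txt"         # recorded in the local repository only
git status --short                   # b.txt is still reported as untracked

before_push=$(git ls-remote origin)  # empty: the remote has not seen the commit yet
git push -q origin master            # now the commit (and its history) is shared
git ls-remote origin                 # lists refs/heads/master at the new commit
```

The point to notice is that the remote stays empty after git commit and only changes after git push.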

Snapshots versus Files

In Subversion, the basic operating units are files and folders. You check out, work on, and check in files. This is nice, because it's analogous to the way we're all accustomed to dealing with our file system anyway. From this perspective, the repository becomes a linear aggregation of changes made to specific files. This is nice, because it's easy to understand and traverse, but it has limitations. When the real history of a project becomes non-linear (for instance, when branching and merging) it can become very difficult to track histories or resolve conflicting changes. And while it's convenient to check in just a few files and ignore others, this isn't the safest working practice when files have inter-dependencies (as is the case with source code). One team member may check in some version of file A while another checks in some version of file B, without any conflicts - but this doesn't mean those two files are compatible. If the incompatibility is major, your automated build breaks and you find out right away; if it is subtle, then who knows? The problem is that when you do find it, you are left with a repository history that doesn't necessarily correspond to what anybody was really ever working with locally.

While git is used to track files, its basic conceptual unit is the snapshot: A whole picture of the state of a repository. (Note that this is a conceptual unit: Obviously, git is not storing whole unique copies with every commit.) A commit records a whole snapshot, and it records it relative to some other commit (or commits). This is nice, because it helps avoid (or at least understand the history of) issues like the file A/B incompatibility described above. More importantly, commit histories don't have to be linear: The history of a repository is a graph of connected commits, where branching and merging are explicit and work done in parallel is easy to identify.
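To make the snapshot idea a little more concrete, here is a small sketch in a throwaway repository (the file names A.txt and B.txt and the identity are placeholders for the example). The second commit only touches A.txt, yet the snapshot it records still contains both files, and it is recorded relative to its parent.

```shell
set -e
# Hypothetical identity so commits work on a machine with no git config
export GIT_AUTHOR_NAME="Example Dev" GIT_AUTHOR_EMAIL="dev@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git symbolic-ref HEAD refs/heads/master

echo "a v1" > A.txt
echo "b v1" > B.txt
git add A.txt B.txt
git commit -q -m "Add A and B"

echo "a v2" > A.txt
git add A.txt
git commit -q -m "Change A only"

git ls-tree -r --name-only HEAD      # the snapshot: both A.txt and B.txt
git diff --name-only HEAD~1 HEAD     # the change relative to the parent: just A.txt
git rev-list --parents -n 1 HEAD     # the commit id followed by its parent's id
```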

In git, branches and merges are the norm. This sounds nasty and cumbersome coming from Subversion, but it shouldn't be: Even under Subversion, you are branching and merging all the time, you just don't necessarily call it that. Under any version control system, the team spends most of their time working in parallel on local versions of their code base, then bringing their changes back together in a shared repository. The work done on the project is non-linear; just because Subversion doesn't make it easy to see this history doesn't mean it didn't happen. Under git, the moment you make a local commit (a necessary step before sharing changes), you have diverged from the shared repository. And your teammates have diverged as well in the process of their work. Eventually you will have to converge: Again, this is just your normal working behavior. The difference in git is that it will not hide this workflow (unless you specifically ask it to): This means, by necessity, you must merge.

Distributed versus Delegated

Particularly coming from Subversion, our team is accustomed to thinking of the shared repository as special and different. It is special and different: Compare breaking the build on your laptop with breaking the build in the repository. Subversion supports this model directly by maintaining a distinct relationship between the shared repository and your working copy.

The git model is more egalitarian, so to speak. The code base on your laptop is a peer to the remote code base, at least from git's perspective: You have a local repository, and there is a remote repository, and they share some or all of their history. You can bring commits from the remote to your repository (a fetch or pull, roughly analogous to svn update) or move commits from the local to the remote (a push, roughly analogous to svn commit). In either case, though, you are either just sharing or merging snapshots on a graph: The process is not essentially different depending on which direction you are going. (This is why, for instance, you may end up with a new merge commit in your local repository after a pull, and that commit shows changes to a lot of files you never touched. The "changes" are relative to the linked commits in the graph - your changes and the changes from the remote are all the same, from the commit's perspective.)
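The merge commit described in that parenthetical can be reproduced with a sketch like the following. Two clones ("alice" and "bob", both invented for the example) diverge from a local stand-in for the shared repository; bob then does a fetch followed by a merge, which is roughly what a pull amounts to in this situation.

```shell
set -e
# Hypothetical identity so commits work on a machine with no git config
export GIT_AUTHOR_NAME="Example Dev" GIT_AUTHOR_EMAIL="dev@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

work=$(mktemp -d)
git init -q "$work/seed"
cd "$work/seed"
git symbolic-ref HEAD refs/heads/master
echo "base" > base.txt
git add base.txt
git commit -q -m "Initial commit"
git clone -q --bare "$work/seed" "$work/shared.git"   # stands in for the remote

git clone -q "$work/shared.git" "$work/alice"
git clone -q "$work/shared.git" "$work/bob"

cd "$work/alice"                     # alice shares a new file
echo "alice" > alice.txt
git add alice.txt
git commit -q -m "Alice's file"
git push -q origin master

cd "$work/bob"                       # bob commits locally and has now diverged
echo "bob" > bob.txt
git add bob.txt
git commit -q -m "Bob's file"

git fetch -q origin                      # bring alice's commit into bob's repository
git merge -q --no-edit origin/master     # join the two lines of history
git rev-list --parents -n 1 HEAD         # three ids: the merge commit and its two parents
git diff --name-only HEAD^1 HEAD         # relative to bob's commit: alice.txt is "new"
git diff --name-only HEAD^2 HEAD         # relative to alice's commit: bob.txt is "new"
```

Neither parent is privileged: which file looks like "the change" depends only on which parent you compare against.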

Tools versus Tasks

Finally, git is much more fine-grained than our team was used to. Subversion's commands are tied pretty specifically to developer tasks: You check out to start working, you update to stay in sync, you check in to finish a bug. Coming to git, then, the expectation is to see some analogues to these commands, or at least to see commands that have a specific relationship to some workflow.

That is not the git mentality, however: There are exceptions, but in general, git commands aren't meant to represent workflow tasks, but to represent actions you might take upon or within a repository. For example, you may git merge (explicitly or implicitly) to update your own branch, or to share your changes, or even just to bring together changes between local branches. As far as git is concerned, git merge just means you want to create a new node on the commit graph that brings together two existing nodes - it doesn't care what specific working task that supports. That's up to you. This is both challenging, in that you have to deal with figuring out what set of steps will accomplish your goal, and empowering, in that you're free to mold and adapt your workflow based on the task at hand, without being constrained (as much) by the tools.

Frequently Asked Questions

Coming from Subversion, our team has encountered many workflow steps that need to be approached differently in git. Some common questions are:

"What should my standard workflow be?"

This is a tough one. In Subversion, it is easy to use svn update; do work; svn commit as an archetype. The aforementioned flexibility of git, though, makes this harder to follow. Depending on the nature of the work you want to do (that is, of the changes you want to make to the repository) you can do either subtly or dramatically different things.

My most general recommendation: Work on a branch, even (or especially!) for small issues. Coming from Subversion, "work on a branch" sounds like it's asking a lot - just remember, in git, you are branching and merging all the time. This means (a) branching is quite a bit easier, and (b) you are going to have to deal with merges anyway, and they won't necessarily be easy. Keeping your master branch "clean" (free from local changes, with the exception of when you are preparing to push) will help insulate your work from surprise merge conflicts.

In general, I stick to the following pattern to start working on an issue:

  • git checkout master (start off on the master branch - our "clean" snapshot of the shared code base)
  • git pull (get the most up-to-date snapshot)
  • git branch topic1234 (create a branch; here "topic1234" is usually the issue identifier)
  • git checkout topic1234 (start working on that new branch)

At this point, (a) your workspace will correspond to all the latest stuff from the shared repository, but (b) your changes will be committed on a branch that will not be affected by external changes until you want them to be. This is when you do whatever coding needs to be done and polish up your changes until you are ready to commit. (Note that you can make incremental commits along the way. The nice thing is that these are very local - not only is the remote repository unaware of these changes, but your local master branch is, too. This can be helpful.)
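The starting steps above can be tried out safely with a sketch like this one, which builds a stand-in shared repository in a temporary directory first. The setup, the identity, and the file fix.txt are assumptions made for the example; only the four commands in the middle are the pattern itself.

```shell
set -e
# Hypothetical identity so commits work on a machine with no git config
export GIT_AUTHOR_NAME="Example Dev" GIT_AUTHOR_EMAIL="dev@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

# Setup: a stand-in shared repository with one commit, plus a local clone
work=$(mktemp -d)
git init -q "$work/seed"
cd "$work/seed"
git symbolic-ref HEAD refs/heads/master
echo "base" > base.txt
git add base.txt
git commit -q -m "Initial commit"
git clone -q --bare "$work/seed" "$work/shared.git"
git clone -q "$work/shared.git" "$work/local"
cd "$work/local"

# The pattern itself
git checkout -q master         # start from the clean master branch
git pull -q                    # get the most up-to-date snapshot
git branch topic1234           # create the topic branch
git checkout -q topic1234      # switch to it

# An incremental commit: visible on topic1234, invisible to master and the remote
echo "fix" > fix.txt
git add fix.txt
git commit -q -m "Work in progress on issue 1234"
git log --oneline master..topic1234    # commits that exist only on the topic branch
```

If you prefer, git checkout -b topic1234 creates the branch and switches to it in one step.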

Once work is done, I incorporate my changes into the master branch, first locally, then remotely:

  • git add <files> (stage whatever changes I have made)
  • git commit -m "<message>" (make a local commit of changes in topic1234)
  • git checkout master (switch back to the master branch)
  • git pull (get the most up-to-date snapshot, again!)
  • git merge topic1234 (merge in changes from our topic branch)

At this point, you may or may not have merge conflicts (git will tell you). The "nice" thing is, you didn't run into these conflicts in the middle of working on your issue, but at a time when you were prepared for them. And you still have a snapshot of all the work you did safely tucked away in the branch. If necessary, resolve conflicts (I usually do this by hand, but there is also git mergetool), then do another git add <merged files> and git commit -m "<message>".

Once merged, you are ready to push your changes back to the remote:

  • git push (push changes to remote repository)
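Putting the finishing steps together, here is a sketch of the whole round trip in a temporary directory. A "teammate" clone pushes a change while the topic branch is in progress, so the final merge has something to bring together; everything outside the checkout/pull/merge/push sequence (paths, identity, file names) is invented for the example, and the files are chosen so that no conflict occurs.

```shell
set -e
# Hypothetical identity so commits work on a machine with no git config
export GIT_AUTHOR_NAME="Example Dev" GIT_AUTHOR_EMAIL="dev@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

# Setup: a stand-in shared repository and two clones of it
work=$(mktemp -d)
git init -q "$work/seed"
cd "$work/seed"
git symbolic-ref HEAD refs/heads/master
echo "base" > base.txt
git add base.txt
git commit -q -m "Initial commit"
git clone -q --bare "$work/seed" "$work/shared.git"
git clone -q "$work/shared.git" "$work/mine"
git clone -q "$work/shared.git" "$work/teammate"

cd "$work/mine"                       # my work happens on a topic branch
git checkout -q -b topic1234
echo "fix" > fix.txt
git add fix.txt
git commit -q -m "Fix issue 1234"

cd "$work/teammate"                   # meanwhile, a teammate pushes something else
echo "other" > teammate.txt
git add teammate.txt
git commit -q -m "Teammate's change"
git push -q origin master

cd "$work/mine"                       # incorporate my work, first locally, then remotely
git checkout -q master
git pull -q                           # master was kept clean, so this only fast-forwards
git merge -q --no-edit topic1234      # creates a merge commit joining both lines of work
git push -q origin master
git log --oneline --graph             # the parallel work is visible in the history
```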

"Why do I show all these files I never touched as being changed?"

Most likely, you will run into this either after a pull or a merge command. (Remember, a pull implicitly merges: In git, you branch and merge on a regular basis.) Merging creates a new commit with two (or more) parent commits: If possible, this commit will be made automatically when the merge command is issued, but when there are conflicts you will need to resolve them before finally issuing the "commit" part of the merge. Since git deals in whole snapshots (not just files) even files that did not have merge conflicts will show up when you do a git status: These files will still be part of the commit.

Again, a commit is a commit is a commit, and merging is just taking two (or more) commits and making a third which combines them. So while you, as a developer, think of your local changes and the stuff in the remote repository very differently, from git's perspective these are just two commits with some shared history. So if you added a new file, and someone else added a new file, then these are both equally relevant to the merge: They are both novel with regard to the shared history of the commits.

Coming from Subversion habits, the effect of this can be disconcerting, as you are inclined to focus on your changes versus what's in the shared repository. It can be particularly troubling when you have only a few small changes and bring in a lot of new ones (which trigger a conflict somewhere), and suddenly it looks like you're making some huge alteration. And that reflects the reality that you are making a big alteration: It's just being made to your local code base.

If you find yourself in the middle of a merge conflict when you're not ready to deal with resolving it, git reset --hard will restore the state of the local repository before you tried to merge. But be aware that this will also erase uncommitted changes.
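For anyone who wants to see this escape hatch in action before needing it, the following sketch manufactures a conflict in a throwaway repository and then backs out of it. The branch name, file, and identity are placeholders for the example.

```shell
set -e
# Hypothetical identity so commits work on a machine with no git config
export GIT_AUTHOR_NAME="Example Dev" GIT_AUTHOR_EMAIL="dev@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git symbolic-ref HEAD refs/heads/master
echo "original" > file.txt
git add file.txt
git commit -q -m "Initial commit"

git checkout -q -b topic1234          # change the file on a topic branch...
echo "topic version" > file.txt
git commit -q -a -m "Change on topic branch"

git checkout -q master                # ...and change the same line on master
echo "master version" > file.txt
git commit -q -a -m "Change on master"
before=$(git rev-parse HEAD)

if git merge --no-edit topic1234 >/dev/null 2>&1; then
    echo "merged cleanly (not expected in this example)"
else
    git status --short                # file.txt is reported as unmerged
    git reset -q --hard               # abandon the merge; uncommitted changes are lost too
fi
git status --short                    # prints nothing: back to a clean master
```

The topic branch is untouched by the reset, so the merge can be attempted again later. Reasonably recent versions of git also offer git merge --abort for the same situation.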

"What exactly am I about to push?"

Subversion users are probably accustomed to checking svn status before committing their changes to a remote repository, as an important last-chance correctness check (because it's no fun to break the build). While there is an analogous git status, this shows changes (staged, unstaged, or unversioned) relative to your local repository, not the remote.

You can use git show to view the contents of your most recent commit. Working locally, however, you may make a few separate commits before getting to a point you want to push back to the shared code base. In these situations, you still want a complete picture of what you're about to change.

I personally prefer:

git diff HEAD origin/master

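(assuming you're committing back to the master branch of a remote named origin, which is the norm for our development process; you can replace origin/master with whatever <remote>/<branch> is more appropriate. Note that git diff A B describes how to get from A to B, so in this form your own additions show up as deletions; swap the two arguments if you would rather see them as additions.)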

Remember, a commit is a commit is a commit. From git's perspective, there's nothing inherently special about the shared repository (although, of course, to you there is), so instead of asking "what am I about to push?", ask how your current commit differs from the remote's. (Note that this will only show the net changes to the repository - you will be pushing up a commit history as well, which may include intermediary changes that you don't want to push.)
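Here is a sketch of the difference between the net change and the history that goes with it, again in a temporary directory with made-up file names. It uses git log origin/master..HEAD to list the commits a push would send, and puts origin/master first in the diff so that local additions show up as additions; both choices are mine for the sake of the example.

```shell
set -e
# Hypothetical identity so commits work on a machine with no git config
export GIT_AUTHOR_NAME="Example Dev" GIT_AUTHOR_EMAIL="dev@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

# Setup: a stand-in shared repository with one commit, plus a local clone
work=$(mktemp -d)
git init -q "$work/seed"
cd "$work/seed"
git symbolic-ref HEAD refs/heads/master
echo "base" > base.txt
git add base.txt
git commit -q -m "Initial commit"
git clone -q --bare "$work/seed" "$work/shared.git"
git clone -q "$work/shared.git" "$work/local"
cd "$work/local"

# Two local commits: the second removes a file the first one added
echo "feature" > feature.txt
echo "debug" > debug.txt
git add feature.txt debug.txt
git commit -q -m "Add feature (and a stray debug file)"
git rm -q debug.txt
git commit -q -m "Remove debug file"

git fetch -q origin                        # make sure origin/master is current
git diff --name-only origin/master HEAD    # net change: only feature.txt
git log --oneline origin/master..HEAD      # but two commits would be pushed
```

The net diff never mentions debug.txt, but the first commit in the log still contains it, which is exactly the kind of intermediary change worth checking for before a push.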