Loading Git into Pandas' dataframe
2018-01-29 06:20:29 -0800

I recently finished Udacity's Intro to Machine Learning, although I haven't finished the final assignment yet. The next step may be learning deep neural networks, but before that, I'd like to see what I can do with what I've learned.

The course used the Enron email dataset a lot, and I want to use something similar but not identical to recap things. The dataset should have messages, authors, ... How about Git repositories?

But before starting the machine learning part, the first step is loading a Git repository into Python.


There are multiple Python libraries that can interact with Git. I'm unsure which would be the best, but GitPython is good enough for me.

First, I convert a Git repository (I used facebook/react) into a JSON file.

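import git
import json

def commit_summary(c):
    # Sum the per-file stats (insertions, deletions, lines) of one commit.
    result = {}
    for path, stats in c.stats.files.items():  # items(), not iteritems(), on Python 3
        for k, v in stats.items():
            if isinstance(v, int):  # skip any non-numeric fields
                result[k] = result.get(k, 0) + v
    result['file_count'] = len(c.stats.files)
    result['committed_date'] = c.committed_date  # Unix epoch seconds
    result['hexsha'] = c.hexsha
    result['message'] = c.message
    # The right-hand side was lost from the original listing;
    # c.author.email is an assumption, based on the 'Author' axis used later.
    result['email'] = c.author.email
    return result

react = git.Repo('../react')

with open('react-commits.json', 'w') as out:
    # Write one JSON array so that pd.read_json can parse the whole file.
    # The bracket and comma writes were missing from the original listing
    # and are reconstructed here.
    out.write('[')
    commits = react.iter_commits('master')
    index = 0
    for c in commits:
        if index != 0:
            out.write(',\n')
        index += 1

        json.dump(commit_summary(c), out)
    out.write(']')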
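
The summing and the array-writing steps can be tried without a repository. The following is only a minimal sketch: `FAKE_COMMITS`, `summarize` and `write_json_array` are hypothetical names, and the fake data merely imitates the shape of GitPython's per-file stats (a path mapped to `insertions`, `deletions` and `lines`).

```python
import io
import json

# Hypothetical stand-in for GitPython's `commit.stats.files`:
# path -> {'insertions': int, 'deletions': int, 'lines': int}.
FAKE_COMMITS = [
    {"hexsha": "a1", "files": {
        "README.md": {"insertions": 3, "deletions": 1, "lines": 4},
        "src/x.js": {"insertions": 10, "deletions": 0, "lines": 10},
    }},
    {"hexsha": "b2", "files": {
        "src/x.js": {"insertions": 0, "deletions": 7, "lines": 7},
    }},
]

def summarize(files):
    """Sum each stat across all files touched by one commit."""
    totals = {}
    for stats in files.values():
        for key, value in stats.items():
            totals[key] = totals.get(key, 0) + value
    totals["file_count"] = len(files)
    return totals

def write_json_array(records, out):
    """Stream records into `out` as a single comma-separated JSON array."""
    out.write("[")
    for index, record in enumerate(records):
        if index != 0:
            out.write(",\n")
        json.dump(record, out)
    out.write("]")

buffer = io.StringIO()
write_json_array(
    ({"hexsha": c["hexsha"], **summarize(c["files"])} for c in FAKE_COMMITS),
    buffer,
)

# The buffer now holds one valid JSON document, so it parses in one call.
rows = json.loads(buffer.getvalue())
print(rows[0])
```

Writing the brackets and commas by hand keeps memory use flat for a large history, at the cost of a little bookkeeping. Collecting all summaries in a list and calling `json.dump` once would be simpler if the repository is small.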


Then load the JSON file into Pandas.

import pandas as pd

commits = pd.read_json('react-commits.json')
commits['committed'] = pd.to_datetime(commits['committed_date'], unit = 's')
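
The `unit = 's'` argument tells Pandas to read `committed_date` as seconds since the Unix epoch. As a rough illustration of the same conversion in plain Python (the helper name `epoch_to_utc` and the sample timestamp are made up for this sketch):

```python
from datetime import datetime, timezone

def epoch_to_utc(seconds):
    """Interpret an integer as seconds since 1970-01-01 00:00:00 UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

# Hypothetical timestamp, chosen only for illustration.
sample = 1517235629
print(epoch_to_utc(sample).isoformat())
```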

React Commits over Time

The Y-axis shows insertions (above zero) and deletions (below zero).

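(Figure: Commits on facebook/react)

# ggplot, aes, geom_line, etc. come from a ggplot-style plotting library;
# the original listing does not show the import (plotnine is one candidate, as an assumption).
ggplot(aes('committed', 'insertions'), commits) + \
  geom_line(aes(color = 1)) + \
  geom_line(aes('committed', '-deletions', color = 2)) + \
  ylab('Added/Deleted') + xlab('Committed Date') + \
  guides(color=False) + scale_color_gradient()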

There are a few spikes in deletions (newest to oldest):

  1. Delete documentation and website source (#11137)
  2. [site] Load libraries from unpkg (#9499)
  3. New Documentation
  4. Merge remote-tracking branch 'facebook/master'
  5. remove likebutton from docs for now

While most of them were administrative changes, the last and oldest commit was a bit funny:

it has some facebook-ism in there and it's probably shouldn't be on the site.

I would agree :)
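
The deletion spikes listed above can also be picked out programmatically instead of by eye. Here is a minimal sketch in plain Python, assuming rows shaped like the commit summaries; the sample rows and the `top_deletions` helper are hypothetical. With the DataFrame, `commits.nlargest(5, 'deletions')` should give a similar result.

```python
import heapq

# Hypothetical rows shaped like the commit summaries (not real React data).
rows = [
    {"hexsha": "a1", "deletions": 12, "message": "fix typo"},
    {"hexsha": "b2", "deletions": 9500, "message": "remove old docs"},
    {"hexsha": "c3", "deletions": 300, "message": "refactor"},
    {"hexsha": "d4", "deletions": 4100, "message": "drop vendored files"},
]

def top_deletions(rows, n):
    """Return the n rows with the most deleted lines, largest first."""
    return heapq.nlargest(n, rows, key=lambda r: r.get("deletions", 0))

for row in top_deletions(rows, 2):
    # Show only the first line of the commit message.
    print(row["deletions"], row["message"].splitlines()[0])
```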

React Committers

This plot, with React committers on the Y-axis, is aesthetically pleasing. It would be interesting to compare it to similar projects, such as Vue or Angular.

(Figure: Commits on facebook/react)

# The search pattern was lost from the original listing; an empty pattern matches every row.
commits['is_fb_com'] = commits['email'].str.contains('')

ggplot(aes('committed', 'factor(email)', color='is_fb_com'), commits) + \
  geom_point(aes(size = 'lines')) + \
  ylab('Author') + xlab('Committed Date') + \
  scale_y_discrete(labels = [])

The number of commits from email addresses flagged as is_fb_com is surprising. I wanted to use that flag as an indicator of Facebook employees, but that may be too naive.
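
One reason a substring test is naive is that it also matches addresses that merely contain the pattern. A slightly stricter check compares the domain part exactly. The sketch below uses `example.com` as a placeholder domain, and both helper names are hypothetical. It still says nothing about whether a committer is actually an employee.

```python
def email_domain(email):
    """Return the lower-cased part after the last '@', or '' if there is none."""
    _, sep, domain = email.rpartition("@")
    return domain.lower() if sep else ""

def is_from_domain(email, domain):
    """True only when the address's domain equals `domain` exactly."""
    return email_domain(email) == domain.lower()

# Placeholder addresses for illustration only.
emails = [
    "alice@example.com",
    "bob@users.noreply.example.org",
    "carol@notexample.com",   # a substring test for 'example.com' would match this
    "broken-address",
]
flags = [is_from_domain(e, "example.com") for e in emails]
print(flags)
```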