Some experiments visualising the GitHub programming language usage correlation dataset


Here's my effort: for more

The aim here is to visualise overlaps between different programming language communities on GitHub.

Node sizes represent the relative popularity of the languages; edge thickness represents the strength of association between pairs of languages, also weighted according to their overall popularity. Closely associated languages are positioned close to each other, to the extent possible.

Some less popular languages have been removed, as estimates of their correlations were likely not statistically significant, and they would only have served to clutter the graph. Similarly, correlations below a threshold are not visualised.

By way of (perhaps over-)interpretation, some hubs can be spotted:

  • A central systems programming/native application programming cluster around C and C++, with Objective-C, assembly, and some interesting choices like Rust and D a little further out

  • Common scripting languages (Perl, Python, Shell) lie next to this core, with strong links to lower-level languages. I suspect their roles in build tools and peripheral scripting around software projects have a large effect.

  • 'Webby' languages with JavaScript at their heart; Ruby in particular seems closer to this webby cluster than to its scripting language cousin Python

  • The separate worlds of the JVM and of Microsoft are clearly visible but very much on the peripheries at GitHub

  • Functional (and hybrid) languages live together on the peripheries too, with a lot of overlap in their communities: Haskell, Clojure, Scheme, Scala and Erlang are all likely to form part of a curious programmer's mind-expanding explorations.

  • Lua has a strong showing close to the systems programming languages -- perhaps partly due to its use as an embeddable scripting language, partly due to usage in config files.

Some flaws:

  • In this dataset it's hard to separate the effects of bundling (due to the language choices of build tools, frameworks etc.) from genuinely active programmer community overlap. For example, note the strong association of CoffeeScript with Ruby, perhaps largely due to Rails generators.

  • Language detection may not always be spot on. For example, Prolog has a surprisingly strong showing; I suspect it's sometimes confused with Perl (both have .pl extensions). Similarly, I wonder if detection of R (.r) and D (.d ?) might be a little over-eager.

  • This is only very hand-wavy exploratory work.

Some geeky details:

  • Graph generated by Gephi. The conditional probability matrix was pre-processed in R to obtain (decent estimates of) marginal probabilities and covariances, which are used to weight the nodes and edges respectively. (I didn't use the directionality in the original data.) Nodes and edges were thresholded on their weights. The layout is force-directed with overlap prevention and some manual tweaks.
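
To make the pre-processing step a bit more concrete, here is a minimal Python sketch of one way to get from pairwise conditional probabilities to node and edge weights. It is not the R code used for the graph: the toy numbers, the function names and the choice of a reference language with a known marginal are all assumptions made for illustration. Pairwise conditionals only fix the ratios between marginals (by Bayes' rule, P(a)/P(b) = P(a|b)/P(b|a)), so the sketch assumes one marginal is known and derives the rest from it. The covariance of two "uses language" indicators is then P(a|b)P(b) - P(a)P(b).

```python
from itertools import combinations

# Hypothetical toy estimates: COND[a][b] stands for P(uses a | uses b).
COND = {
    "C":    {"C": 1.0,  "C++": 0.6, "Ruby": 0.2},
    "C++":  {"C": 0.45, "C++": 1.0, "Ruby": 0.15},
    "Ruby": {"C": 0.2,  "C++": 0.2, "Ruby": 1.0},
}


def estimate_marginals(cond, ref, p_ref):
    """Derive P(a) for every language from an assumed P(ref).

    Uses Bayes' rule: P(a) = P(ref) * P(a | ref) / P(ref | a).
    """
    return {a: p_ref * cond[a][ref] / cond[ref][a] for a in cond}


def covariance(cond, p, a, b):
    """Covariance of the indicators 'uses a' and 'uses b'."""
    joint = cond[a][b] * p[b]  # P(a and b)
    return joint - p[a] * p[b]


def thresholded_edges(cond, p, min_weight):
    """Keep only undirected pairs whose covariance exceeds min_weight."""
    edges = {}
    for a, b in combinations(cond, 2):
        weight = covariance(cond, p, a, b)
        if weight > min_weight:
            edges[(a, b)] = weight
    return edges


if __name__ == "__main__":
    # Assumption: 40% of users use C (the reference marginal).
    node_weights = estimate_marginals(COND, ref="C", p_ref=0.4)
    edge_weights = thresholded_edges(COND, node_weights, min_weight=0.01)
    print(node_weights)
    print(edge_weights)
```

The node weights would drive node size and the surviving edge weights would drive edge thickness; the resulting node and edge tables could then be exported to a format Gephi can import for layout.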