Which languages are used most often on weekends? Are some languages inherently more 'hobbyist' than others?
I have attempted to answer these questions for this year's GitHub Data Challenge.
One way to answer these questions is to survey thousands of programmers about their language use over weekdays and weekends (which might be fairly difficult, and may not be economically viable). But fortunately for us, GitHub records a swath of data from which such information can be mined. Whenever programmers push code to GitHub, or do other activities such as forking, downloading etc., information is recorded.
The data is available for download as JSON files at the GitHub Archive, or as a dataset on Google BigQuery. I used the latter.
One could argue that, of all the types of events, PushEvents indicate language use better than other events (such as WatchEvents). This approach certainly has several limitations, but let us stick with this metric for now. The results when all events are used are shown in a later section.
At a high level, I did the following: for each language, the number of pushes during weekends is counted and divided by the total number of pushes for that language, giving a weekend ratio. This ratio indicates roughly how active each programming language has been during weekends, so I used these ratios to rank the languages from the most weekend-oriented to the least weekend-oriented.
(Here, to avoid too many languages, I counted only those languages that had a non-zero number of events on every single day.)
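The ratio described above can be sketched in Python. This is only an illustration under assumptions, not the code used for the analysis: it assumes the input is a list of (day, language, count) tuples, where day is the number of days since 1970-01-01 in UTC (the form produced by the BigQuery query in this write-up), and the names `is_weekend` and `weekend_ratios` are made up for this example.

```python
from collections import defaultdict
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)


def is_weekend(day_number):
    """True if the day (days since 1970-01-01, UTC) is a Saturday or Sunday."""
    return (EPOCH + timedelta(days=int(day_number))).weekday() >= 5


def weekend_ratios(rows):
    """rows: iterable of (day_number, language, count) tuples.

    Returns (language, ratio) pairs sorted from most to least
    weekend-oriented, keeping only languages that have a non-zero
    number of events on every day present in the data.
    """
    weekend = defaultdict(int)
    total = defaultdict(int)
    days_seen = defaultdict(set)
    all_days = set()
    for day, language, count in rows:
        if count <= 0:
            continue
        day = int(day)
        all_days.add(day)
        days_seen[language].add(day)
        total[language] += count
        if is_weekend(day):
            weekend[language] += count
    ratios = {
        language: weekend[language] / total[language]
        for language in total
        if days_seen[language] == all_days
    }
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)


# Day 1 is Friday 1970-01-02, day 2 is Saturday 1970-01-03.
sample = [(1, "A", 3), (2, "A", 1), (1, "B", 1), (2, "B", 3), (2, "C", 5)]
print(weekend_ratios(sample))  # [('B', 0.75), ('A', 0.25)]
```

Language C is dropped in the sample because it has no events on day 1. Note that the weekend is decided in UTC here, which matches the timezone caveat listed among the limitations.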
Next, let us rank the most popular languages by their percentage of use during weekends.
To determine language popularity, here are the top ten languages, ranked by the number of PushEvents (the number after each language is the number of events counted from the dataset).
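The popularity ranking amounts to summing the per-day counts for each language and taking the largest totals. A minimal sketch, assuming the same (day, language, count) tuples as the query output; the function name `top_languages` and the sample numbers are invented for illustration.

```python
from collections import Counter


def top_languages(rows, n=10):
    """Return the n languages with the most events as (language, total) pairs."""
    totals = Counter()
    for _day, language, count in rows:
        totals[language] += count
    return totals.most_common(n)


# Made-up counts, not values from the dataset.
sample = [(0, "Ruby", 5), (1, "Ruby", 2), (0, "Java", 4), (0, "C", 1)]
print(top_languages(sample, n=2))  # [('Ruby', 7), ('Java', 4)]
```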
For these languages, the weekend popularity is as follows:
Note that these ratios are not that different, so I would argue that one shouldn't try to find out why C is before C#. But the overall point here is that Perl seems to be used more often during weekends than Java.
Results based on all Event Types
Why count just the PushEvents? One could as well include the WatchEvents and ForkEvents, and maybe something else. To keep things simple, I did the same analysis by counting all types of events.
The results are below (again, only languages that are used every day are considered).
Most of the earlier pattern still holds, except that a few more languages now appear in the diagram.
And as before, here are the top ten languages, counted by aggregating all events. Note the difference between this list and the previous one: Objective-C was not in the previous list but appears here, while Perl, which was in the previous list, is missing here.
The corresponding ranking is as follows:
The complete list of languages with their weekend percentages (restricted to languages that had events on at least half of the total number of days):
- Common Lisp,26.4666
- Pure Data,24.5808
- Emacs Lisp,24.0541
- Visual Basic,23.5394
- DCPU-16 ASM,22.822
- OpenEdge ABL,21.9477
- Standard ML,21.4867
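The "at least half of the total number of days" filter used for the complete list can be sketched as follows. The function name `languages_with_coverage` and its parameters are assumptions made for this example; rows are again (day, language, count) tuples.

```python
from collections import defaultdict


def languages_with_coverage(rows, total_days, min_fraction=0.5):
    """Return languages that have events on at least min_fraction of total_days."""
    active_days = defaultdict(set)
    for day, language, count in rows:
        if count > 0:
            active_days[language].add(int(day))
    return sorted(
        language
        for language, days in active_days.items()
        if len(days) >= min_fraction * total_days
    )


# With 4 days in total, A (2 active days) is kept and B (1 active day) is dropped.
sample = [(0, "A", 1), (1, "A", 2), (0, "B", 7)]
print(languages_with_coverage(sample, total_days=4))  # ['A']
```

Setting `min_fraction=1.0` gives the stricter "events on every single day" rule used for the rankings earlier.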
Thanks to Google BigQuery, it was a breeze to extract the required information from approximately 60GB of data (of course, only after many days of tinkering with manually downloaded data and staring at it to figure it out). But it wasn't so much of a breeze to download the output for further processing, so I have put the CSV files in the Data directory. Feel free to use them.
I used all the data on BigQuery (which started from 11th March 2012 until 8th May 2013, for a total of 424 days).
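The 424-day figure counts both the first and the last day. A quick check with Python's standard library:

```python
from datetime import date

first_day = date(2012, 3, 11)
last_day = date(2013, 5, 8)

# The difference is 423 days; adding 1 counts both endpoints.
total_days = (last_day - first_day).days + 1
print(total_days)  # 424
```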
The following query lists the number of events per day per language. To restrict it to pushes, add AND type='PushEvent' to the inner WHERE clause. The downloaded CSV files are in the data folder of the repository.
    SELECT day, repository_language, COUNT(day) AS count
    FROM (
      SELECT repository_language,
             UTC_USEC_TO_DAY(PARSE_UTC_USEC(created_at))/1000000/3600/24 AS day
      FROM githubarchive:github.timeline
      WHERE repository_language IS NOT NULL
    )
    GROUP BY day, repository_language
    ORDER BY day;
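The day column produced by this query is a day number: the start-of-day timestamp in microseconds, divided down to days since 1970-01-01 (UTC). Below is a small sketch of reading the exported CSV and turning that number back into a calendar date. It assumes the CSV has a header row with the column names from the query (day, repository_language, count) and that the day value may be written as a float; both are assumptions about the export format, and `read_daily_counts` is a hypothetical helper name.

```python
import csv
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)


def read_daily_counts(lines):
    """Yield (date, language, count) from CSV text lines, e.g. an open file."""
    for record in csv.DictReader(lines):
        # int(float(...)) accepts both "15410" and "15410.0".
        day_number = int(float(record["day"]))
        yield (
            EPOCH + timedelta(days=day_number),
            record["repository_language"],
            int(record["count"]),
        )


# Illustrative rows only; the counts are made up.
sample = ["day,repository_language,count", "15410.0,Perl,12", "15411,Java,30"]
for day, language, count in read_daily_counts(sample):
    print(day.isoformat(), day.strftime("%A"), language, count)
# 2012-03-11 Sunday Perl 12
# 2012-03-12 Monday Java 30
```

Day number 15410 corresponds to 11th March 2012, the first day of the dataset. For a real file, pass an open file object, for example `open(path, newline="")`, in place of the sample list.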
Conclusion and Reflections
This work does have several limitations (see below). The goal here is not to obtain absolute truths, but to dig into vast amounts of data and see what it says. Are there patterns in it? Is there something in here that we don't know? I definitely think this work tells us something about the various programming languages. If anything, it has helped me view programming languages from a different perspective.
I started out with some other plans:
- Compute PageRanks of the various repositories based on the repository fork graph to determine their relative importance. I was mainly interested in this because I think there must be a difference between the ordered list of popular repositories (in terms of the number of forks) and the ordered list of important repositories. For example, Spoon-Knife is the second most forked repository, but its importance should be very low. Getting this graph from BigQuery was very hard, so I had to abandon the idea.
- See whether news announcements or product launches cause more people to get interested in a language, and therefore produce more pushes in that language. From the limited analysis I did, I couldn't find anything like that (maybe that tells us something?). For example, Bit.ly announced a realtime distributed message processing system called NSQ on 9th October 2012, and I saw a clearly discernible spike in the number of watch events, but I couldn't see any such trend for push events. Hopefully I will look more closely into it in the future. Meanwhile, more comments are welcome.
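The PageRank plan above was abandoned because the fork graph was hard to obtain, but the ranking step itself is short. Here is a minimal power-iteration sketch on a toy graph. Everything specific in it is an assumption for illustration: the edge direction (each fork points to the repository it was forked from), the repository names, the damping factor of 0.85, and the fixed number of iterations. Which edge definition best captures "importance" is left open.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """edges: list of (source, target) pairs. Returns a dict of node -> score."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1 - damping) / len(nodes) + damping * dangling / len(nodes)
        new_rank = {node: base for node in nodes}
        for source in nodes:
            for target in out_links[source]:
                new_rank[target] += damping * rank[source] / len(out_links[source])
        rank = new_rank
    return rank


# Hypothetical toy graph: two forks pointing at one upstream repository.
toy_edges = [("fork-a", "upstream"), ("fork-b", "upstream")]
scores = pagerank(toy_edges)
for name in sorted(scores, key=scores.get, reverse=True):
    print(name, round(scores[name], 3))
```

The scores sum to 1, and in this toy graph the upstream repository ranks above its two forks.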
Limitations
- The data is limited to only those programmers who use GitHub (gasp!).
- There are definitely many programmers who don't use GitHub for their work. So all those repositories are not taken into account (which is probably good).
- GitHub classification of languages is not always accurate (see http://datahackermd.com/2013/language-use-on-github/#comment-798271901).
- I haven't accounted for different timezones. While it may not be too difficult to account for them, I think it's not worth the effort: only the location information of the programmers is recorded, which can be ambiguous or inaccurate.
Update: Fixed two graphs that were showing erroneous values. I have also written this up on my blog and might update that post with more details.