I've generated a dataset of the file sizes in the Linux source tree, and written a general-purpose file size script (in Python, unfortunately).
I listed kernel.org as the dataset source, since turning the kernel source into file sizes is fairly trivial, but if you'd prefer a short writeup, or just a link to the generation script, let me know.
A Python script to generate the first digit distribution of the byte count of any directory.
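For anyone reading along without the script open, here is a rough sketch of the idea: walk a directory, take each file's size in bytes, and tally the leading digit. This is not the actual ben.py. The function names `first_digit_counts` and `first_digit_distribution` are made up for illustration, and skipping symlinks and empty files is an assumption on my part, not something the script is known to do.

```python
import os
from collections import Counter


def first_digit_counts(root):
    """Tally the leading digit of every file's size (in bytes) under root."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Assumption: symlinks are skipped so a target is not counted twice.
            if os.path.islink(path):
                continue
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # file vanished or is unreadable; ignore it
            # Assumption: empty files are skipped, since 0 has no leading digit 1-9.
            if size > 0:
                counts[int(str(size)[0])] += 1
    return counts


def first_digit_distribution(counts):
    """Turn raw digit counts into fractions for the digits 1 through 9."""
    total = sum(counts.values())
    if total == 0:
        return {digit: 0.0 for digit in range(1, 10)}
    return {digit: counts.get(digit, 0) / total for digit in range(1, 10)}
```

Calling `first_digit_distribution(first_digit_counts(path))` on an unpacked source tree gives a dict mapping each digit to its share of files, which can then be written out in whatever format the dataset needs.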
Add the results of the ben.py run on the 22.214.171.124 kernel.
The results file wasn't getting cleared, skewing the results. Results…
... And forgot to un-comment the dataset generator lines after I finished tweaking the output.
Super cool man, thanks. We're still figuring out how best to deal with the various tools people will create for crunching the data, but I think for now maybe it's best to drop your script in a Gist and then we can add a section to the README linking to it. Since it's your script, would you like to create the Gist?
Sounds good: https://gist.github.com/1049438
Thanks again - I've deployed your dataset to the live site.