Scripts for locally hosting and searching a compressed StackOverflow data dump
README.txt
icon.png
index.html
install.sh
isQaId
isQaId.c
readBlock.py
readPost.php
readPost.sh
searchTitles.php
searchTitles.sh


StackOverflow Local
Peter Burkimsher
peterburk@gmail.com
2014-02-10

This collection of scripts allows the StackOverflow data dump to be searched and read quickly while offline.


-- About --
The data dump is compressed to only 2.09 GB, plus 230.5 MB for titles. 
Search time is about 8 seconds. 
Post reading time is about 6 seconds. 


-- Installing --
I recommend not running ./install.sh directly: it will be very slow.
Instead, open three Terminal windows and run the commands inside install.sh in parallel.


-- Requirements --
You'll need cc to compile isQaId for setup. 
At runtime, you'll need a web server, PHP, Python, and bzip2recover. 


-- Developers --
You can run the scripts without the web server.
1. ./searchTitles.sh "your search here"
(returns the IDs and titles, separated by a > character)
2. ./readPost.sh 1234567
(returns the post, in compressed form)
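
If you want to script against these tools, a small Python wrapper like the one below works. It is only a sketch (not part of this repository): it shells out to searchTitles.sh and readPost.sh, and it assumes each search result line looks like "1234567>Some question title", i.e. the ID and title separated by a > character as described above.

#!/usr/bin/env python
# Hypothetical wrapper around searchTitles.sh and readPost.sh (not included in this repo).
import subprocess

def search_titles(query):
    # Assumes each output line is the post ID and the title separated by a ">" character.
    output = subprocess.check_output(["./searchTitles.sh", query]).decode("utf-8", "replace")
    results = []
    for line in output.splitlines():
        if ">" in line:
            post_id, title = line.split(">", 1)
            results.append((post_id.strip(), title.strip()))
    return results

def read_post(post_id):
    # Returns the post in its compressed form, exactly as readPost.sh prints it.
    return subprocess.check_output(["./readPost.sh", str(post_id)]).decode("utf-8", "replace")

if __name__ == "__main__":
    for post_id, title in search_titles("python list comprehension"):
        print(post_id, title)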

The compressed format of the database is delimited by < and > characters as follows:
<<Post ID<>Answer ID><Body>>
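
If you need to pull those fields apart in your own code, a minimal parser along these lines should work. It assumes a record matches the pattern above exactly, with numeric IDs and no stray ">>" inside the body.

import re

# Matches the record pattern <<Post ID<>Answer ID><Body>> described above.
RECORD_PATTERN = re.compile(r"<<(?P<post_id>\d+)<>(?P<answer_id>\d+)><(?P<body>.*)>>", re.DOTALL)

def parse_record(record):
    # Returns (post_id, answer_id, body), or None if the record doesn't match the pattern.
    match = RECORD_PATTERN.search(record)
    if match is None:
        return None
    return match.group("post_id"), match.group("answer_id"), match.group("body")

# Example with a made-up record:
print(parse_record("<<1234567<>1234570><How do I frobnicate?>>"))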

For reference, the original StackOverflow database is 4.6 GB. 
It's compressed by keeping only the accepted answer for each question and cutting out a lot of the XML. 
Searching for all answers to a question takes about 30 minutes per question, which is too slow. 

Using dd instead of readBlock.py is possible, but post reading time increases to 30 seconds.
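
For illustration only, here is roughly what block-based reading looks like in Python if the dump has been split into standalone .bz2 block files with bzip2recover. This is a sketch of the idea, not how readBlock.py is actually implemented; the file name is a placeholder.

import bz2

def read_block(block_path):
    # Decompress one standalone block file produced by bzip2recover
    # (e.g. "rec00042posts.bz2") and return its text.
    with open(block_path, "rb") as block_file:
        return bz2.decompress(block_file.read()).decode("utf-8", "replace")

# Hypothetical usage; the real scripts decide which block holds a given post ID.
# print(read_block("rec00042posts.bz2"))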