Join GitHub today
GitHub is home to over 20 million developers working together to host and review code, manage projects, and build software together.
Fetching latest commit…
Cannot retrieve the latest commit at this time.
|Failed to load latest commit information.|
This is a Perl port of scy's levitation. It reads MediaWiki dump files revision by revision and writes a data stream to stdout suitable for git fast-import. The first 1000 pages of the german Wikipedia and all their revisions (about 390000) can be dumped in about 15 min on relatively moderate hardware. INSTALL: You need at least Perl 5.10. The Perl interpreter has to be compiled with threads support. You also need a working C compiler for the inline SHA1 C function. Although gcc 4.3 was specified by old versions of levitation-perl (circa May 2010), gcc 4.5.2 also appears to work (at least for some small wiki dumps). You need zlib, for example the following should work under Debian/Ubuntu: sudo apt-get install zlib1g-dev You need the following modules and their dependencies from CPAN: - Regexp::Common - Inline - JSON::XS - Compress::Raw::Zlib - Carp::Assert - Devel::Size - CDB_File - XML::Bare >= 0.44 - Deep::Hash::Utils - File::Path > 2.04 ? (2.08 is fine; it has to export make_path) Some Linux distributions will already have the first set. Under Debian / Ubuntu the following command should set you: sudo apt-get install libregexp-common-perl \ libinline-perl libjson-xs-perl \ libcompress-raw-zlib-perl libcarp-assert-perl \ libdevel-size-perl For Fedora, the equivalent is sudo yum install perl-Regexp-Common \ perl-Inline perl-JSON-XS \ perl-Compress-Raw-Zlib perl-Carp-Assert For the second set, you may be able to just run: sudo cpan -i CDB_File sudo cpan -i XML::Bare sudo cpan -i Deep::Hash::Utils USAGE: First, initialize a git repository: cd /tmp mkdir blawiki cd blawiki git init (Alternately, go to the working directory of an existing git repository where you want to import the dump, such as one from a previous run of levitation-perl). Then, "levitate". This is a three-step process: cat /path/to/blawiki-dump.xml | /path/to/levitation-perl/step1.pl LC_ALL=C sort rev-table.txt > rev-sorted.txt /path/to/levitation-perl/step2.pl | /path/to/levitation-perl/gfi.pl Alternatively, you can just change to an empty directory and call the "levitate" helper script with a path to a dump as parameter (may be 7z, bz2, gz or xml): mkdir /tmp/blawiki cd /tmp/blawiki /path/to/levitation-perl/levitate /path/to/blawiki-dump... Lots of progress information is printed to standard error, so it may be best to redirect that to a file. Have fun.