RuntimeError: Proxy error(ArgumentOutOfRangeException): Year, Month, and Day parameters describe an un-representable DateTime. #32
Comments
My guess is an invalid date (e.g. month and day swapped from a US-style date somewhere, or something strange like 29 February in a non-leap year). I would start by adding some exception handling to print out some debug information. Are you willing and able to share the wiki dump with me by email (assuming it is not overly large)?
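A minimal sketch of the kind of exception handling suggested above, assuming the timestamps look like the ISO-style `2023-09-20T12:00:00Z` strings used in MediaWiki XML dumps. The function name `parse_timestamp` and the `context` argument are hypothetical and not taken from the script itself:

```python
import sys
from datetime import datetime


def parse_timestamp(text, context=""):
    """Parse an ISO-style timestamp; on failure, report the bad value and return None.

    Hypothetical helper for illustration only, not part of the actual script.
    """
    try:
        return datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")
    except ValueError as err:
        # Print the offending value and where it came from, so the bad
        # revision can be found in the dump.
        print(f"Bad timestamp {text!r} ({context}): {err}", file=sys.stderr)
        return None


# 2023 is not a leap year, so 29 February is rejected rather than crashing.
good = parse_timestamp("2023-09-20T12:00:00Z", context="page 'Example'")
bad = parse_timestamp("2023-02-29T00:00:00Z", context="page 'Example'")
```

Returning `None` here is just one option; re-raising after printing would keep the original failure while still recording which revision triggered it.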
Larger than I was expecting, assuming this is the URL you meant: zhwiki-20230920-pages-articles.xml.bz2 (2.5 GB). I need to have a clean out - this machine's drive is fuller than I thought!
Do you still have the full traceback? I wanted to check where in the code this RuntimeError was triggered. [The size of the Chinese wiki example makes testing this harder.]
This script is not really suitable for a wiki dump this big! It took 30 minutes before I killed it, by which point Python was apparently using 18 GB of RAM and had only recorded 1.8 million revisions in SQLite (taking 3.8 GB). Update: The file has over 4 million revisions, so I got less than halfway:
[I'm trying this on Python 3 with some modifications. I assume you are using it on Python 2, see issue #33.]
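For a dump of this size, one way to keep memory bounded is to stream the XML and discard each element once it has been handled. The following is only a sketch using the standard library's `xml.etree.ElementTree.iterparse`; it is not how the script is known to parse the dump, and `count_revisions` and the sample data are hypothetical:

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for a MediaWiki export; a real dump could be opened with
# bz2.open(path, "rb") and passed in the same way.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
<page><title>A</title>
<revision><timestamp>2023-01-01T00:00:00Z</timestamp></revision>
<revision><timestamp>2023-01-02T00:00:00Z</timestamp></revision>
</page>
<page><title>B</title>
<revision><timestamp>2022-12-31T00:00:00Z</timestamp></revision>
</page>
</mediawiki>"""


def count_revisions(stream):
    """Count <revision> elements while streaming, clearing elements as we go."""
    count = 0
    for _event, elem in ET.iterparse(stream, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace prefix
        if tag == "revision":
            count += 1
            elem.clear()  # drop this revision's text and children
        elif tag == "page":
            elem.clear()  # drop the now-empty revision elements too
    return count


total = count_revisions(io.BytesIO(SAMPLE))
```

In a real run the revision would be written to the database before `elem.clear()` is called, since clearing discards its contents.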
Switched from macOS to Linux: over 3 million revisions parsed in ~15 minutes, but it hit a 32 GB memory limit. It occurs to me that the SQLite database currently has no indexing. I'd never pushed the script to such a large example.
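Adding an index in SQLite is a single statement. This sketch assumes a table called `revisions` with a `date` column, which are hypothetical names rather than the script's actual schema; whether an index would help with the memory limit is not established in this thread:

```python
import sqlite3


def add_date_index(conn):
    """Create an index on the (assumed) date column if it does not exist yet."""
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_revisions_date ON revisions (date)"
    )
    conn.commit()


def index_names(conn):
    """List the indexes SQLite has recorded for the revisions table."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'index' AND tbl_name = 'revisions'"
    )
    return [row[0] for row in rows]


# Hypothetical schema, used here only to make the example self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE revisions (title TEXT, date TEXT, content TEXT)")
add_date_index(conn)
names = index_names(conn)
```

Creating the index after the bulk insert, rather than before, avoids updating it on every inserted row.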
I use Ubuntu 20.04.
I'm curious whether it would be possible to migrate straight from MySQL to Markdown/Git without the SQLite intermediate DB?
@mathieujobin Currently the script goes from the XML dump to SQLite, then to MediaWiki files on disk, then to Markdown files on disk, which get tracked in git. The SQLite intermediate is there to sort the changes so that the git log is chronological. Looking back, the earlier version checked in had this, even before it dealt with uploaded files, so perhaps the XML is sorted by page first?
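The role of the SQLite step described above can be sketched as follows: revisions arrive grouped by page and are read back ordered by timestamp, so commits can be made chronologically. The table layout and the function name `chronological` are assumptions for illustration, not the script's real code:

```python
import sqlite3


def chronological(revisions):
    """Return (title, date, text) rows sorted by date, using SQLite for the sort."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE revisions (title TEXT, date TEXT, text TEXT)")
    conn.executemany("INSERT INTO revisions VALUES (?, ?, ?)", revisions)
    # ISO-style timestamps sort correctly as plain text.
    rows = conn.execute(
        "SELECT title, date, text FROM revisions ORDER BY date, title"
    ).fetchall()
    conn.close()
    return rows


# Input grouped by page, as the XML dump appears to be.
by_page = [
    ("A", "2023-01-01T00:00:00Z", "first draft of A"),
    ("A", "2023-01-03T00:00:00Z", "second draft of A"),
    ("B", "2023-01-02T00:00:00Z", "first draft of B"),
]
ordered = chronological(by_page)
```

Each row of `ordered` would then correspond to one commit, with the revision for page B landing between the two revisions of page A.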
why?