
RuntimeError: Proxy error(ArgumentOutOfRangeException): Year, Month, and Day parameters describe an un-representable DateTime. #32

Open · @459737087 opened this issue Jan 11, 2024 · 9 comments

@459737087

why?

@peterjc (Owner) commented Jan 11, 2024

My guess is an invalid date (e.g. month and day mixed up from US style somewhere, or something strange like 29 February in a non-leap year).

I would start by adding some exception handling to print out some debug information.
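For example, a minimal sketch of the kind of debug wrapper I mean (the function name and call site are assumptions, not this script's actual code):

```python
import datetime

def build_date(year, month, day):
    # Hypothetical wrapper around whichever call constructs the DateTime.
    # Python's own datetime raises ValueError for impossible dates such as
    # 29 February in a non-leap year, or a month > 12 from a day/month swap.
    try:
        return datetime.datetime(year, month, day)
    except ValueError as err:
        print("Bad date: year=%r month=%r day=%r (%s)" % (year, month, day, err))
        raise
```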

Are you willing and able to share the Wiki dump with me by email (assuming it is not overly large)?

@459737087 (Author) [posted a link to the wiki dump]

@peterjc (Owner) commented Jan 12, 2024

Larger than I was expecting, assuming this is the URL you meant: zhwiki-20230920-pages-articles.xml.bz2 (2.5 GB)

I need to have a clean out - this machine's drive is fuller than I thought!

@peterjc (Owner) commented Jan 12, 2024

Do you still have the full traceback? I want to check where in the code this RuntimeError was triggered.

[The size of the Chinese wiki example makes testing this harder]

@peterjc (Owner) commented Jan 12, 2024

This script is not really suitable for a wiki dump this big! It ran for 30 minutes before I killed it, by which point Python was apparently using 18 GB of RAM and had only recorded 1.8 million revisions in SQLite (the database taking 3.8 GB).

Update: The file has over 4 million revisions, so I got less than halfway:

```
$ cat zhwiki-20230920-pages-articles.xml.bz2 | bzip2 -d | grep "<revision>" -c
4339799
```

[I'm trying this on Python 3 with some modifications; I assume you are using it on Python 2 - see issue #33]

@peterjc (Owner) commented Jan 14, 2024

Switched from macOS to Linux: over 3 million revisions parsed in ~15 minutes, but then it hit a 32 GB memory limit. It occurs to me that the SQLite database currently has no indexing - I'd never pushed the script to such a large example before.
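Adding an index should be a one-line change - a rough sketch, where the database filename, table, and column names are assumptions rather than the script's actual schema:

```python
import sqlite3

conn = sqlite3.connect("revisions.sqlite")  # assumed filename
# Index the timestamp column so sorting revisions chronologically can
# use the index rather than scanning the whole table each time.
conn.execute("CREATE INDEX IF NOT EXISTS idx_revision_date ON revisions (date)")
conn.commit()
conn.close()
```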

@459737087 (Author) commented:

I use Ubuntu 20.04 and Python 3.8.

@mathieujobin commented:

I'm curious whether it would be possible to migrate straight from MySQL to Markdown/Git without the SQLite intermediate DB?

@peterjc (Owner) commented Jan 15, 2024

@mathieujobin Currently the script goes XML dump → SQLite → MediaWiki files on disk → Markdown files on disk, which then get tracked in git.

The SQLite intermediate is there to sort the changes so that the git log is chronological. Looking back, even the earliest version checked in had this, before it dealt with uploaded files - so perhaps the XML is sorted by page first rather than by date?
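As a minimal sketch of why that sort matters (table and column names assumed, not the script's actual schema): the XML dump groups revisions per page, so committing them in file order would jumble the dates, whereas one query gives a global chronological order:

```python
import sqlite3

conn = sqlite3.connect("revisions.sqlite")  # assumed filename
# Revisions arrive grouped by page in the XML dump; sorting by timestamp
# yields the order in which to make the git commits, oldest first.
for title, date in conn.execute("SELECT title, date FROM revisions ORDER BY date"):
    print(date, title)  # each row would become one commit
conn.close()
```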
