Error downloading eswiki dumps #19

Open
armaseg opened this issue Jan 26, 2016 · 4 comments

Comments

@armaseg
Contributor

armaseg commented Jan 26, 2016

Hello,

I'm trying to download the eswiki dumps, but I get the following error:

OverflowError: unbounded read returned more bytes than a Python string can hold

This is the whole trace:

(py3)moroco@gaba:~/Tools/Apps/WikiDAT/WikiDAT/wikidat$ python main.py 

*** WIKIPEDIA DATA ANALYSIS TOOLKIT ***

----------------------------------------------------------
Executing ETL:RevHistory on lang: eswiki date: 20151202
ETL lines = 1 page_fan = 1 rev_fan = 1
Download files = True
Start time is 2016-01-26 13:50:50 CET
----------------------------------------------------------

Downloading new dump files from http://dumps.wikimedia.your.org/, for language eswiki
Target URL is: http://dumps.wikimedia.your.org/eswiki/20151202
File URL is: http://dumps.wikimedia.your.org//eswiki/20151202/eswiki-20151202-pages-meta-history1.xml.7z
File URL is: http://dumps.wikimedia.your.org//eswiki/20151202/eswiki-20151202-pages-meta-history2.xml.7z
Downloading: eswiki-20151202-pages-meta-history1.xml.7z - [Size: 2.2 GiB]
Downloading: eswiki-20151202-pages-meta-history2.xml.7z - [Size: 2.4 GiB]
File URL is: http://dumps.wikimedia.your.org//eswiki/20151202/eswiki-20151202-pages-meta-history3.xml.7z
File URL is: http://dumps.wikimedia.your.org//eswiki/20151202/eswiki-20151202-pages-meta-history4.xml.7z
Downloading: eswiki-20151202-pages-meta-history3.xml.7z - [Size: 2.6 GiB]
Downloading: eswiki-20151202-pages-meta-history4.xml.7z - [Size: 4.6 GiB]
Paths in download:  ['data/eswiki_dumps/20151202/eswiki-20151202-pages-meta-history1.xml.7z', 'data/eswiki_dumps/20151202/eswiki-20151202-pages-meta-history2.xml.7z', 'data/eswiki_dumps/20151202/eswiki-20151202-pages-meta-history3.xml.7z', 'data/eswiki_dumps/20151202/eswiki-20151202-pages-meta-history4.xml.7z']
Traceback (most recent call last):
  File "main.py", line 322, in <module>
    debug=args.debug)
  File "/home/moroco/Tools/Apps/WikiDAT/WikiDAT/wikidat/tasks/tasks.py", line 129, in execute
    self.paths, self.date = self.down.download(self.date)
  File "/home/moroco/Tools/Apps/WikiDAT/WikiDAT/wikidat/tasks/download.py", line 136, in download
    self._verify(self.target_url)
  File "/home/moroco/Tools/Apps/WikiDAT/WikiDAT/wikidat/tasks/download.py", line 205, in _verify
    file_md5 = hashlib.md5(f.read()).hexdigest()
OverflowError: unbounded read returned more bytes than a Python string can hold

Do you have any clue?

Thanks

@glimmerphoenix
Owner

At first sight, this looks like a problem with the function that validates the MD5 checksum of downloaded files. That check lets the program confirm that the contents of each downloaded file match the original file on the server.

For some reason, the function that calculates the MD5 checksum of the downloaded file is failing and raises the exception in your report. It's strange, since I parsed the same dump a few weeks ago without any issue.

How much RAM do you have? The last file is 4.6 GiB, and since `_verify` calls `f.read()` on the whole file, you may run out of memory when it loads the file to calculate the MD5 checksum.

@armaseg
Contributor Author

armaseg commented Jan 27, 2016

Hello,

I have 8GB RAM.

moroco@gaba:~$ cat /proc/meminfo 
MemTotal:        8113472 kB

I will run the process again tonight after a reboot, so that as much memory as possible is free.

Thank you very much for your fast reply.

@glimmerphoenix
Owner

Great, please let me know the results. It may be difficult to reproduce the bug on our systems without first identifying the likely cause of this error.

@armaseg
Contributor Author

armaseg commented Mar 8, 2016

I ran it 3 more times on my machine and hit the same problem. Later I ran it on a more powerful server, and that part completed successfully. So I think this issue is memory-related, as you said.
I think this could be fixed using one of these approaches. If you agree, I can test it on my machine and make the changes.
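
For illustration, here is a minimal sketch of one common way to avoid loading the whole file into memory: feed the file to `hashlib.md5` in fixed-size chunks instead of calling `f.read()` once. The function name `md5_of_file` and the 1 MiB default chunk size are assumptions for this example, not part of WikiDAT, and this is not necessarily the approach that was eventually adopted.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of the file at `path`.

    Hypothetical helper: reads the file in chunks of `chunk_size` bytes,
    so memory use stays roughly constant regardless of file size.
    """
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read(chunk_size)
        # until it returns b"" (end of file).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()
```

In `_verify`, the line `file_md5 = hashlib.md5(f.read()).hexdigest()` could then presumably be replaced by a call such as `file_md5 = md5_of_file(path)`, where `path` stands for whichever local file is being checked. The digest is identical to the one computed from a single full read, since `update()` on successive chunks is equivalent to hashing their concatenation.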
