how to download corpus panlex_lite package in nltk in python #1253

Closed
udaysai50 opened this issue Jan 17, 2016 · 30 comments

Comments

@udaysai50

I am able to download all the packages except panlex_lite. How do I download it?

@alvations
Contributor

Try within python:

>>> import nltk
>>> nltk.download('panlex_lite')

Or on command line:

$ python -m nltk.downloader panlex_lite

Note: It might take some time to download the data.

@stevenbird
Member

Note that you need to install the development version of NLTK in order to do this.

@xiaozongyang

Use this URL to download it manually: http://dev.panlex.org/db/panlex_lite.zip

@alvations
Contributor

Wait for NLTK v3.2 and please see extensive discussion on #1283

@racekiller

Hi, once panlex_lite is downloaded manually, where should I put it within nltk_data?
Thanks

@stevenbird
Member

Please see http://www.nltk.org/data.html

@xiaozongyang

Put it in the corpora folder. My complete path is /usr/local/share/nltk_data/corpora


@deepp

deepp commented Aug 10, 2016

Hi,
Does anyone have any idea why it's downloading so slowly? On my end it shows 20 hours remaining. The rest of the packages have been downloaded.

@xiaozongyang

@deepp I uploaded the zip file to Baidu Cloud. Here are the link and password:
link: https://pan.baidu.com/s/1kVavU7d password: 7b5n

@deepp

deepp commented Aug 10, 2016

@XiaoZYang Thanks for the response. I downloaded the file manually from the link in your previous reply. Thanks a ton.

@xiaozongyang

@deepp My pleasure. Glad to help.

@fwanghe

fwanghe commented Sep 3, 2016

You can download the panlex_lite.zip from https://dev.panlex.org/db/, and put it in "/nltk_data/corpora/"
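
The manual placement described above can be scripted. This is only a sketch using the standard library: the helper name `install_corpus_zip` is made up for illustration, and the nltk_data location passed to it is an assumption that depends on your system (see http://www.nltk.org/data.html for the directories NLTK searches).

```python
import shutil
from pathlib import Path


def install_corpus_zip(zip_path, nltk_data_dir):
    """Copy a manually downloaded corpus zip into <nltk_data_dir>/corpora/.

    Hypothetical helper, not part of NLTK. Returns the path of the copy.
    """
    corpora_dir = Path(nltk_data_dir) / "corpora"
    corpora_dir.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    target = corpora_dir / Path(zip_path).name
    shutil.copy2(zip_path, target)  # copy2 also preserves the modification time
    return target
```

For example, `install_corpus_zip("panlex_lite.zip", Path.home() / "nltk_data")` would place the file under `~/nltk_data/corpora/`, assuming that is where your NLTK data lives.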

@SissyCat

SissyCat commented Sep 6, 2016

While downloading panlex with nltk downloader, my whole system just froze - even the caps lock indicator light on my keyboard wasn't working anymore. I've restarted my computer, tried again and the same thing happened.
Is there a logfile anywhere to provide you with more info on this?
FYI: I'm running idle3/nltk3/python 3.5.2 on KDE Neon on an AMD64 machine.

I'll just download the zip-file manually.

@eupherntech

What should I do after downloading the panlex_lite zip so that nltk.download('all') skips panlex_lite and downloads the rest of the packages? I unzipped the archive, but when I try to download the remaining packages it still shows that it is downloading panlex_lite. Help, please.

@stevealbertwong

stevealbertwong commented Oct 13, 2016

@eupherntech same issue.

@iamprakashom

I am also facing the same issue.

BTW, downloaded panlex_lite data manually.

@cimarie

cimarie commented Nov 28, 2016

@eupherntech @stevealbertwong You could use nltk.download('all', halt_on_error=False), so that after a package fails to download you are asked whether you want to retry it. Press n and the rest of the packages should be downloaded.

@JustFly1984

Same issue here. Even a manual download takes up to 8 hours. Do something about it please!

@hcharley

hcharley commented Mar 4, 2017

Based on the file mentioned above, it looks like it's a 2.2 GB file. So you might just need to hang tight and wait!

One thing you can do in the meantime to get some more information is to look at the filesize and last modified time of the panlex_lite.zip file in nltk_data/corpora/ like so:

$ ls -lh nltk_data/corpora/ | grep panlex_lite
-rw-r--r--     1 username  1607558449   2.1G Mar  4 10:51 panlex_lite.zip
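
The same check can be done from Python, which may be handy on systems without `ls`. This is a small sketch; the helper names `describe_file` and `human_size` are invented here, and the path you pass in is whatever your own nltk_data location happens to be.

```python
import os
import time


def describe_file(path):
    """Return (size_in_bytes, last_modified_string) for a file."""
    stat = os.stat(path)
    modified = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(stat.st_mtime))
    return stat.st_size, modified


def human_size(num_bytes):
    """Format a byte count with binary units (B, K, M, G), one decimal place."""
    size = float(num_bytes)
    for unit in ("B", "K", "M", "G"):
        # Stop at the first unit that fits, or at G, the largest unit used here.
        if size < 1024 or unit == "G":
            return f"{size:.1f}{unit}"
        size /= 1024
```

Calling `describe_file` on the partially downloaded panlex_lite.zip twice, a minute or so apart, should show whether the size is still growing.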

@aetilley
Contributor

I'm having the same issue. I have panlex_lite successfully downloaded (from http://dev.panlex.org/db/panlex_lite.zip) and located in the correct directory, but when nltk.download() is called it tries to download it again. Is there some other file that needs to be updated to show that the corpus is in place?

Please Note: I would try @cimarie 's suggestion, but the problem is that I'm trying to use tox to test a branch before submitting a pull request, and tox calls nltk.download internally, so I don't think I have the ability to include those options.

@stevenbird
Member

I've updated the checksums, so please try again

@aetilley
Contributor

aetilley commented Apr 14, 2017

@stevenbird Which checksums?

Anyway, it does not appear to have worked. nltk.download('all') still tries to download panlex_lite, even though I have put the file from the above link in my ~/nltk_data/corpora folder.

Also of note, the downloader tries to download panlex_swadesh every time (although this is a much shorter download than panlex_lite). I noticed panlex_swadesh.zip is in the corpora folder, and attempting to unzip it manually gives

Arthurs-MacBook-Pro:corpora aetilley$ unzip panlex_swadesh.zip
Archive: panlex_swadesh.zip
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.
unzip: cannot find zipfile directory in one of panlex_swadesh.zip or
panlex_swadesh.zip.zip, and cannot find panlex_swadesh.zip.ZIP, period.
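
An error like the one above usually means the file is incomplete or is not a zip archive at all. A quick integrity check with Python's standard `zipfile` module might look like the sketch below; the function name `check_zip` is made up for illustration, and whether the downloader performs a comparable check is not something this thread confirms.

```python
import zipfile


def check_zip(path):
    """Return None if the archive looks intact, otherwise a short problem description."""
    if not zipfile.is_zipfile(path):
        return "not a zip file (possibly a truncated download)"
    try:
        with zipfile.ZipFile(path) as archive:
            bad_member = archive.testzip()  # name of the first corrupt member, or None
    except zipfile.BadZipFile as error:
        return f"unreadable archive: {error}"
    if bad_member is not None:
        return f"corrupt member: {bad_member}"
    return None
```

Running `check_zip("panlex_swadesh.zip")` from inside the corpora folder would report a problem for a file like the one shown above, in which case deleting it and downloading again is probably the simplest fix.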

@stevenbird
Member

@aetilley – the checksums are published on this page – may need to "view source".

They are from this file: https://dev.panlex.org/db/panlex_lite-20170401.zip

Unfortunately I don't have the bandwidth to download it.

There are two things you might try. Maybe you have already done the first, in which case the second might be worth a shot.

  1. sudo python -m nltk.downloader panlex_lite
  2. cd PATH_TO_NLTK_DATA; wget https://dev.panlex.org/db/panlex_lite-20170401.zip; unzip panlex_lite-20170401.zip
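
To compare a local copy against the published checksums, something like the following sketch could help. It rests on an assumption that the index records a file size and an MD5 digest for each package, so check the actual index entry before relying on it. The helper names `file_md5` and `matches_index` are made up here, and the expected values must be copied from the index yourself.

```python
import hashlib
import os


def file_md5(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_index(path, expected_size, expected_md5):
    """Return True if the file's size and MD5 both match the expected values."""
    same_size = os.path.getsize(path) == int(expected_size)
    return same_size and file_md5(path) == expected_md5.lower()
```

If `matches_index` returns False for a file that was downloaded in full, the local copy is probably a different snapshot from the one the index describes, which could explain why the downloader keeps fetching it again.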

@aetilley
Contributor

aetilley commented May 8, 2017

@stevenbird

I'm afraid that after running both of these (both successfully), nltk.download('all') still can't see panlex_lite.

Again, the main problem here is that it makes it difficult to use tox.

So am I the only one having this problem?

@alvations
Contributor

alvations commented May 8, 2017

Is nltk.download('all') the main cause of these problems? If so, then I think nltk/nltk_data#69 would be something to consider.

Otherwise, the workaround is something like:

>>> import nltk
>>> dler = nltk.downloader.Downloader()
>>> dler._update_index()
>>> dler._status_cache['panlex_lite'] = 'installed' # Trick the downloader into treating panlex_lite as if it's already installed.
>>> dler.download('all')

@aetilley
Contributor

aetilley commented May 9, 2017

@alvations

More specifically, the problem is that nltk.download('all') correctly skips over all the other corpora that I already have, but for some reason tries to get panlex_lite each time.

Also, tox calls nltk.download('all'), so it's difficult to test locally before making a pull request.

@alvations
Contributor

Hopefully, nltk/nltk_data#75 would resolve some of the issues. And after that's merged, users should be able to do nltk.download('all-nltk') instead of nltk.download('all') if they don't want to wait to download the large panlex_lite file.

@aetilley
Contributor

@alvations

And what will tox call?

Again, I'm happy to download a large file once, but the downloader doesn't seem to see that I already have it, so it tries to download it every time.

And again, if I'm the only person having this problem, then maybe it's not a problem, but I'm baffled.

@stevenbird
Member

@aetilley: is this still happening? I think it should be fixed now that we've dropped panlex-lite from the NLTK corpus collection.

@aetilley
Contributor

@stevenbird, @alvations

Yes, tox appears to be working for me now. Sorry, I didn't catch that you had fixed that.
