This repository was archived by the owner on May 4, 2021. It is now read-only.

treigerm commented Aug 1, 2017

The new script allows us to use the public CommonCrawl Index Server API to locate documents in the CommonCrawl corpus. This eliminates the need to build a custom database from the CommonCrawl index files.
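For readers unfamiliar with the Index Server API, here is a minimal sketch of what such a lookup could look like. It is not the script from this PR: the function names are hypothetical, `CC-MAIN-2017-30` is only an example collection name, and the field names (`url`, `filename`, `offset`, `length`) are assumed from the JSON-lines output the public index server returns with `output=json`.

```python
import json
from urllib.parse import urlencode

# Public index server mentioned in this thread.
INDEX_SERVER = "http://index.commoncrawl.org"


def build_query_url(collection, url_pattern):
    """Build an index query URL (hypothetical helper, not from the PR).

    `collection` is a crawl identifier such as "CC-MAIN-2017-30" (example only).
    """
    params = urlencode({"url": url_pattern, "output": "json"})
    return f"{INDEX_SERVER}/{collection}-index?{params}"


def parse_index_response(body):
    """Parse a JSON-lines response body into document locations.

    Each non-empty line is assumed to be one JSON object describing a capture.
    """
    records = []
    for line in body.splitlines():
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        records.append({
            "url": rec["url"],
            "filename": rec["filename"],   # WARC file holding the record
            "offset": int(rec["offset"]),  # byte offset within that file
            "length": int(rec["length"]),  # compressed record length in bytes
        })
    return records
```

The returned `filename`, `offset`, and `length` are what a downloader would need to fetch a single record with an HTTP range request, which is why no local copy of the index files is required.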

Contributor

achimr commented Oct 4, 2017

Thanks for the contribution, Tim! I think there should be an option to rate-limit queries to http://index.commoncrawl.org, as the server is reported to be quite loaded most of the time (see https://groups.google.com/forum/#!topic/common-crawl/o_MuZViu0O0). I'll add that as an enhancement issue. Of course, the other option is to run one's own index server, as described in the thread.
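One possible shape for the rate-limiting option is a small helper that enforces a minimum interval between consecutive queries. This is only a sketch: the `RateLimiter` class and its `min_interval` parameter are hypothetical and not part of the merged script, and the thread does not specify an acceptable request rate.

```python
import time


class RateLimiter:
    """Enforce a minimum delay between calls (hypothetical helper).

    `clock` and `sleep` are injectable so the behaviour can be tested
    without actually waiting.
    """

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests (assumed unit)
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        """Block until at least `min_interval` seconds have passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A caller would create one limiter, for example `RateLimiter(1.0)`, and call `wait()` immediately before each request to the index server. The interval would presumably be exposed as a command-line option so users can tune it to the server's load.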

@achimr achimr merged commit 3ec4526 into modernmt:master Oct 4, 2017