Add support for CommonCrawl Index Server API #11

treigerm · 2017-08-01T13:17:37Z

The new script allows us to use the public CommonCrawl Index Server API to locate documents in the CommonCrawl corpus. This eliminates the need to build a custom database from the CommonCrawl index files.
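As a rough illustration of the kind of lookup described above (not the code from this pull request), the sketch below builds an index server query and turns the JSON-lines response into WARC locations. The function names, the collection name `CC-MAIN-2017-30`, and the choice of returned fields are assumptions made for the example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

INDEX_SERVER = "http://index.commoncrawl.org"


def build_query_url(collection, url_pattern):
    """Build a query URL for one crawl collection (hypothetical helper)."""
    params = urlencode({"url": url_pattern, "output": "json"})
    return "{}/{}-index?{}".format(INDEX_SERVER, collection, params)


def parse_index_lines(text):
    """Parse a response with one JSON object per line.

    Returns (filename, offset, length) tuples, which is enough to fetch
    the matching record from the corpus with a byte-range request.
    """
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines in the response
        entry = json.loads(line)
        records.append(
            (entry["filename"], int(entry["offset"]), int(entry["length"]))
        )
    return records


def locate(collection, url_pattern):
    """Query the public index server and return the parsed locations."""
    with urlopen(build_query_url(collection, url_pattern)) as response:
        return parse_index_lines(response.read().decode("utf-8"))
```

Calling `locate("CC-MAIN-2017-30", "example.com/*")` would contact the public server; the URL building and parsing steps are kept separate so they can be checked without network access.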

achimr · 2017-10-04T19:21:13Z

Thanks for the contribution, Tim! I think there should be an option to rate-limit queries to http://index.commoncrawl.org, as the server is reported to be heavily loaded most of the time (see https://groups.google.com/forum/#!topic/common-crawl/o_MuZViu0O0 ). I'll add that as an enhancement issue. The other option, of course, is to run one's own index server as described in the thread.
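One possible shape for such a rate-limiting option is a minimum delay between consecutive requests. The sketch below is only an assumption about how it could look, not part of the merged script; the class name `RateLimiter` and its parameters are hypothetical.

```python
import time


class RateLimiter:
    """Enforce a minimum interval (in seconds) between calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock  # injectable so the limiter can be tested
        self._sleep = sleep
        self._last = None    # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A querying loop would call `limiter.wait()` immediately before each request to the index server, so that a setting such as `RateLimiter(1.0)` allows at most about one query per second.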

Tim Reichelt added 2 commits July 19, 2017 10:56

Add script to locate candidates from CC index API

cee4e5e

Add parsing of .kv.gz files

8d09dfa

achimr merged commit 3ec4526 into modernmt:master Oct 4, 2017
