
common_crawl_corpus

Scripts for building a geo-located web corpus from Common Crawl data. This module processes segments from the Common Crawl dataset (http://commoncrawl.org/) and then cleans, dedupes, organizes, and runs lidNet language identification on the data.

Import the CC_Corpus class

from common_crawl_corpus.cc_corpus import CC_Corpus

Initialize the CC_Corpus object

cc_corpus = CC_Corpus()

Define which segments of the Common Crawl to process

prefix_list = ["CC-MAIN-2017-04", "CC-MAIN-2017-09", "CC-MAIN-2017-13"]

Begin crawling; results will be saved to the specified S3 bucket. Credentials must be available via the AWS CLI (for example, set with aws configure)

cc_corpus.crawl_cc(prefix_list, "Your_S3_Bucket", workers=8)
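If crawl_cc fails immediately with an access error, confirm that credentials and the target bucket are reachable. A minimal sketch using boto3 (the bucket name is a placeholder, and this check is not part of the module itself):

import boto3

# Fail early if no AWS credentials are configured
session = boto3.Session()
assert session.get_credentials() is not None, "No AWS credentials found"

# Confirm the target bucket exists and is accessible
session.client("s3").head_bucket(Bucket="Your_S3_Bucket")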

Merge and dedupe the processed crawl files after downloading them to a local drive

cc_corpus.format_cc("CC-MAIN-2017-04", "./Data/Input/", "./Data/Output/")
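format_cc expects the processed crawl files to already be on the local drive. A minimal sketch for downloading them from S3 with boto3 (the bucket name and the assumption that objects are stored under a prefix matching the crawl name are placeholders):

import os
import boto3

s3 = boto3.client("s3")
os.makedirs("./Data/Input/", exist_ok=True)

# Download every object under the crawl's prefix into the local input folder
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="Your_S3_Bucket", Prefix="CC-MAIN-2017-04"):
    for obj in page.get("Contents", []):
        filename = os.path.basename(obj["Key"])
        if filename:
            s3.download_file("Your_S3_Bucket", obj["Key"], os.path.join("./Data/Input/", filename))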

Optionally, use lidNet to identify languages in the corpus. This process downloads deduped corpus segments from one S3 bucket, runs lidNet, and saves the results to another S3 bucket

prefix_list = ["2016-18/", "2016-22/", "2016-26/", "2016-30/"]
lid_model = "Model.LID.464langs.50chars.130k_hash.3-3.Callback.hdf"

lid_cc("Your_Input_Bucket", "Your_Output_Bucket", prefix_list, lid_model)