playing with the common crawl

serious work in progress

common crawl is a freely available 25+TB webcrawl.

dependencies

* jets3t for getting data (requester pays support)
* nutch for the ArcInputFormat
* boilerpipe for extracting visible text from html (the KeepEverythingWithMinKWordsExtractor has been working well for me...)
* tika for language detection.
* the stanford parser for general NLP witchcraft.

method

pass 0) download the data

download the data from s3 to hdfs, unmodified, using jets3t. was previously using the common crawl input format (which did the download itself) but had lots of problems.

see simple_dist_cp.sh

pass 1) filter text/html

map only pass using the nutch arc input format to ignore everything but mime_type 'text/html'

also converts from raw http response (ie ascii headers + encoded bytes) to just utf-8 encoded html

kept as its own output so later experiments can work on either the link graph or the visible text

outputs (as sequence file) key: url, value: html response (utf-8 encoded)

see text_html.sh
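as a rough illustration of what this pass does per record, here is a minimal python sketch (the real pass is a hadoop job using the nutch arc input format). the function name `html_from_http_response` and the latin-1 fallback are assumptions for the example, and it ignores things like chunked or gzipped bodies.

```python
from email.parser import BytesParser


def html_from_http_response(raw: bytes):
    """Return utf-8 encoded html for a text/html response, else None."""
    # split the ascii header section from the encoded body
    head, sep, body = raw.partition(b"\r\n\r\n")
    if not sep:
        return None  # no header/body boundary found
    # drop the status line, then parse the remaining header lines
    _, _, header_bytes = head.partition(b"\r\n")
    headers = BytesParser().parsebytes(header_bytes + b"\r\n\r\n", headersonly=True)
    if headers.get_content_type() != "text/html":
        return None  # ignore every other mime type
    # assumption: fall back to latin-1 when no usable charset is declared
    charset = headers.get_content_charset() or "latin-1"
    try:
        text = body.decode(charset, errors="replace")
    except LookupError:  # unknown charset name
        text = body.decode("latin-1")
    return text.encode("utf-8")


def filter_text_html(records):
    """Map-only style filter over (url, raw_response) pairs."""
    for url, raw in records:
        html = html_from_http_response(raw)
        if html is not None:
            yield url, html
```

`filter_text_html` takes any iterable of (url, raw response bytes) pairs and yields only the text/html ones, mirroring the key / value layout described above.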

pass 2) visible text extraction

map only pass html through boilerpipe to extract visible text

uses the boilerpipe KeepEverythingWithMinKWordsExtractor to ignore block elements that don't have at least 5 terms

outputs (as sequence file) key: url, value: visible text, each line denotes a separate block element from html
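the min-words rule can be sketched in a few lines of python. this is not boilerpipe itself (which also does the work of segmenting html into blocks); it assumes the blocks are already available as strings, and `keep_blocks_with_min_words` is a made-up name for the example.

```python
MIN_WORDS = 5  # matches the "at least 5 terms" threshold described above


def keep_blocks_with_min_words(blocks, min_words=MIN_WORDS):
    """Keep blocks with at least min_words whitespace-separated terms.

    Returns one line per surviving block; whitespace inside a block is
    collapsed so that a block never spans more than one line.
    """
    kept = [" ".join(b.split()) for b in blocks if len(b.split()) >= min_words]
    return "\n".join(kept)
```

short blocks such as navigation links or copyright notices tend to fall under the threshold, which is the point of the filter.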

pass 3) filter english text only

map only pass visible text through tika to identify language and ignore everything but language 'en'

outputs (as sequence file) key: url, value: visible text

see visible_en_text.sh
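a sketch of the filter shape in python. the language detector here is a crude stopword heuristic standing in for tika (it is an assumption for the example, not how tika works), and `guess_language` / `filter_english` are hypothetical names.

```python
# tiny stand-in stopword list; tika uses a proper language model instead
ENGLISH_STOPWORDS = frozenset(
    {"the", "and", "of", "to", "in", "is", "that", "for", "it", "with"}
)


def guess_language(text, threshold=0.15):
    """Return 'en' if enough tokens are common english stopwords."""
    tokens = text.lower().split()
    if not tokens:
        return "unknown"
    hits = sum(1 for t in tokens if t in ENGLISH_STOPWORDS)
    return "en" if hits / len(tokens) >= threshold else "unknown"


def filter_english(records, detect=guess_language):
    """Map-only style filter: keep (url, text) pairs detected as 'en'."""
    for url, text in records:
        if detect(text) == "en":
            yield url, text
```

the `detect` parameter is there so a real detector can be swapped in without touching the filter.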

pass 4) tokenisation

map/reduce pass visible text, a paragraph at a time, through the stanford parser and extract sentences / tokens

ignore any sentence that tokenises to fewer than 3 terms.

only emit each sentence once per page since the vast majority of these duplicates represent noise (headers / footers / list structures etc)

outputs (as sequence file) key: url \t paragraph_idx \t sentence_in_paragraph_idx, value: one sentence, tokens space separated

number of reducers chosen so each reducer's output is ~3gb, to stay under the 5gb s3 limit (ie sans multipart upload)
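the reducer count is just arithmetic. a small python sketch, assuming output is spread roughly evenly across reducers (`reducers_needed` is a hypothetical helper, not something in this repo):

```python
import math

GB = 1024 ** 3


def reducers_needed(total_output_bytes, target_bytes_per_reducer=3 * GB):
    """Reducers required so each output file is about the target size."""
    return max(1, math.ceil(total_output_bytes / target_bytes_per_reducer))
```

for example, `reducers_needed(100 * GB)` gives 34, ie 34 files of roughly 3gb each, leaving headroom under the 5gb limit if keys are skewed.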

see sentences.sh
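the per-page logic of this pass, sketched in python. the regex sentence splitter and tokeniser are naive stand-ins for the stanford parser, dedup is done in memory per page rather than via map/reduce, and numbering sentences before filtering is an assumption made for the example.

```python
import re

# naive stand-ins for the stanford parser's sentence splitting / tokenising
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
TOKEN = re.compile(r"\w+|[^\w\s]")


def sentences_for_page(url, visible_text, min_tokens=3):
    """Yield (key, sentence) pairs for one page.

    key is url \t paragraph_idx \t sentence_in_paragraph_idx and the
    value is the sentence with its tokens space separated.
    """
    seen = set()  # sentences already emitted for this page
    for p_idx, paragraph in enumerate(visible_text.split("\n")):
        for s_idx, sentence in enumerate(SENTENCE_END.split(paragraph.strip())):
            tokens = TOKEN.findall(sentence)
            if len(tokens) < min_tokens:
                continue  # too short to be useful
            joined = " ".join(tokens)
            if joined in seen:
                continue  # only emit each sentence once per page
            seen.add(joined)
            yield f"{url}\t{p_idx}\t{s_idx}", joined
```

each line of the visible text is treated as one paragraph, matching the one-block-per-line output of pass 2.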

pass 2 -> pass 4

see run.sh for a ChainMapper version that does steps 2 -> 4 in a single map/reduce pass
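the chaining idea, sketched in python with generators: each stage consumes and produces (key, value) records, so several map steps run in one pass without writing intermediate output. this only illustrates the pattern, it is not hadoop's ChainMapper api, and the stage functions are hypothetical toys.

```python
def drop_short_blocks(records, min_words=5):
    """Toy stage: keep only lines (blocks) with at least min_words terms."""
    for url, text in records:
        kept = [line for line in text.split("\n") if len(line.split()) >= min_words]
        if kept:
            yield url, "\n".join(kept)


def keep_if(predicate):
    """Build a toy filter stage from a predicate over the value."""
    def mapper(records):
        return ((k, v) for k, v in records if predicate(v))
    return mapper


def chain(*mappers):
    """Compose stages so each record streams through all of them in order."""
    def run(records):
        for mapper in mappers:
            records = mapper(records)
        return records
    return run
```

usage is along the lines of `pipeline = chain(drop_short_blocks, keep_if(some_language_check))` followed by iterating over `pipeline(records)`.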

matpalm/common-crawl
