Wet Extractor is a small Java program that downloads the content of the latest Common Crawl WET files from Amazon S3. It extracts that content and generates a dummy file with real, searchable content for search engine research and development.
Why Wet Extractor
Imagine you want to create a dummy file with real, searchable content. If you write your own crawler, it is going to take time, a lot of time. Wet Extractor does all of this for you in a matter of minutes.
If you find an issue, please file a report here
Wet Extractor updates
6th September 2017 - Wet Extractor 4.0 released
I am glad to announce the 4.0.0 release of Wet Extractor.
- Improved the design by applying the Builder pattern.
- Fixed bugs.
15th August 2017 - Wet Extractor 3.0 released
30th June 2017 - Wet Extractor 2.0 released
1st September 2016 - Wet Extractor 1.0 released
Contributors are welcome to join in developing this tool.
If you would like to contribute, please email me: mailto:firstname.lastname@example.org.
If you find bugs, please report them.
How to use
- Import the project as a Maven project (I use IntelliJ IDEA Community Edition).
- The number of WET files is set to 2 by default; you can increase it in MainApp, then wait for the run to finish (see the notes below).
- It starts by initializing the system.
- Then it starts downloading the WET files.
- When the download is done, it starts processing the files.
- When the program is done, all dummy files can be found in the src/main/resources/output folder.
- The download may take a while, depending on your internet speed.
- Each file takes 5-10 seconds to process.
- So if you process 200 WET files, processing alone will take roughly 17 to 33 minutes.
- Remember! Depending on how many files you download, make sure you have enough storage space.
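To give a rough idea of what the processing step involves, here is a minimal sketch of pulling page text out of WET content using only the JDK. It is not Wet Extractor's actual code: the class WetRecordSketch, its Page type and the parse method are hypothetical names chosen for illustration. It assumes the usual WET layout, where each record starts with a "WARC/1.0" line, followed by headers such as WARC-Type and WARC-Target-URI, a blank line, and then the extracted plain text. Records are split on lines starting with "WARC/" for brevity; a stricter parser would use the Content-Length header instead.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class, not part of Wet Extractor.
class WetRecordSketch {

    /** One extracted page: the URL from the record header and its plain text. */
    static final class Page {
        final String uri;
        final String text;

        Page(String uri, String text) {
            this.uri = uri;
            this.text = text;
        }
    }

    /** Reads WET-formatted text and returns the "conversion" records (page text). */
    static List<Page> parse(Reader source) throws IOException {
        List<Page> pages = new ArrayList<>();
        BufferedReader reader = new BufferedReader(source);
        String line = reader.readLine();
        while (line != null) {
            if (!line.startsWith("WARC/")) {
                // Skip anything before the first record marker.
                line = reader.readLine();
                continue;
            }
            // Header block: "Name: value" lines up to the first blank line.
            String type = null;
            String uri = null;
            while ((line = reader.readLine()) != null && !line.isEmpty()) {
                int colon = line.indexOf(':');
                if (colon < 0) {
                    continue;
                }
                String name = line.substring(0, colon).trim();
                String value = line.substring(colon + 1).trim();
                if (name.equalsIgnoreCase("WARC-Type")) {
                    type = value;
                } else if (name.equalsIgnoreCase("WARC-Target-URI")) {
                    uri = value;
                }
            }
            // Body: everything up to the next record marker (simplification,
            // a stricter parser would honor Content-Length).
            StringBuilder body = new StringBuilder();
            while ((line = reader.readLine()) != null && !line.startsWith("WARC/")) {
                if (!line.isEmpty()) {
                    body.append(line).append('\n');
                }
            }
            // Only "conversion" records carry extracted page text.
            if ("conversion".equals(type) && uri != null) {
                pages.add(new Page(uri, body.toString().trim()));
            }
        }
        return pages;
    }

    public static void main(String[] args) throws IOException {
        // Tiny hand-written sample in WET layout, not real crawl data.
        String sample =
                "WARC/1.0\nWARC-Type: warcinfo\nContent-Length: 11\n\nsoftware: x\n\n"
              + "WARC/1.0\nWARC-Type: conversion\n"
              + "WARC-Target-URI: http://example.com/\nContent-Length: 11\n\nHello world\n\n";
        for (Page page : parse(new StringReader(sample))) {
            System.out.println(page.uri + " -> " + page.text);
        }
    }
}
```

For a downloaded .wet.gz file, the same parse method could be fed an InputStreamReader wrapped around a java.util.zip.GZIPInputStream; the file path and character set to use are left to the caller.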