MDBubing, a ridiculous wordplay merging the words MongoDB and BUbiNG, is a library that aims to:
- Make it simple to migrate existing WARC files into MongoDB by exporting each record as a separate document.
- Save MongoDB documents at BUbiNG crawl time, bypassing the creation of WARC files.
- Create a properties file defining the values of the following fields:
  - connectionString: the connection string of a MongoDB instance (you can also use it to set options such as a majority write concern)
  - database: the name of the target database
  - collection: the collection to save documents in
  - warcFilePath: the path of a WARC file (supported formats: `.warc` and `.warc.gz`)
You can refer to the following sample configuration: WarcToMongo-sample-configuration.properties.
- Execute the following command:
$ java dev.pstux.mdbubing.WarcToMongo -P <properties_file_path>
- Wait for the records to be exported and enjoy!
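Before running the export, it can help to check that the properties file defines all four fields. The following is a minimal sketch of such a check using only `java.util.Properties`. The class `ConfigCheckSketch`, its methods, and every value in the sample (host, database, collection, path) are hypothetical placeholders for illustration; they are not part of MDBubing, and the library may validate its configuration differently.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Hypothetical helper, not part of MDBubing: sanity-checks a WarcToMongo-style configuration. */
class ConfigCheckSketch {

  // The four fields listed in the usage steps above.
  static final List<String> REQUIRED_KEYS =
      List.of("connectionString", "database", "collection", "warcFilePath");

  /** Parses properties text with the standard java.util.Properties loader. */
  static Properties parse(String text) {
    Properties props = new Properties();
    try {
      props.load(new StringReader(text));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return props;
  }

  /** Returns the required keys that are absent or blank. */
  static List<String> missingKeys(Properties props) {
    List<String> missing = new ArrayList<>();
    for (String key : REQUIRED_KEYS) {
      String value = props.getProperty(key);
      if (value == null || value.trim().isEmpty()) {
        missing.add(key);
      }
    }
    return missing;
  }

  /** True if the path ends with one of the two supported extensions. */
  static boolean hasSupportedExtension(String path) {
    return path.endsWith(".warc") || path.endsWith(".warc.gz");
  }

  public static void main(String[] args) {
    // All values below are placeholders chosen for illustration.
    String sample =
        "connectionString=mongodb://localhost:27017/?w=majority\n"
            + "database=crawl\n"
            + "collection=records\n"
            + "warcFilePath=/tmp/example.warc.gz\n";
    Properties props = parse(sample);
    System.out.println("missing keys: " + missingKeys(props));
    System.out.println(
        "supported extension: " + hasSupportedExtension(props.getProperty("warcFilePath")));
  }
}
```

The sample connection string uses the standard `w=majority` option to request a majority write concern. For the authoritative set of fields, refer to WarcToMongo-sample-configuration.properties.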
TODO: document this section
This project expects source files to be formatted following the Google Java style. Run `./gradlew goJF` to automatically format all `.java` files under `src/`.
Execute unit tests (mocking MongoDB entities): `./gradlew test`.
Execute integration tests (automatically running a MongoDB Docker container, testing, and shutting the instance down): `./gradlew integrationTestWithDocker`.
CI tasks definitions can be found in the workflows directory.
- Javadoc
- Upload artifact to Sonatype
- Add ability to write into multiple collections
- Map `WARC-Record-ID` into the `_id` field by default
- Add ability to specify the desired `<WARC header, document field>` mapping
- Performance benchmarks