TOOLS-3271 Import Time-Series Collections via mongoimport #535

Open: themattman wants to merge 19 commits into master

@themattman (Contributor) commented Mar 9, 2023

What

Adds support in mongoimport for importing data from CSV and TSV files directly into a time-series collection.

Today, mongoimport can import data into a time-series collection only if that collection was created correctly before mongoimport runs. If the schema has any issues, the user sees one error per document. With this change, mongoimport fails immediately, before trying to insert any data, so the user can resolve issues more quickly.

How

  • Runs a custom createCollection with the user-provided time-series options before inserting.
  • Rigorously validates new time-series options.
    • For time-series, the user must provide a field with an explicit type of date via --fields, --fieldFile, or --headerline.
      • The auto type is not allowed for this field because its values cannot be coerced to a date.
      • Without a date-typed field the insert would fail, so failing validation before insertion is more user-friendly.
    • For time-series, --columnsHaveTypes is therefore required.
  • Adds validation that a column name passed to an option matches one of the user-provided fields.
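
The validation rules above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `columnSpec`, `timeSeriesOptions`, and `validateTimeSeries` are hypothetical names, and the sketch only checks for the literal type name `date`.

```go
package main

import (
	"errors"
	"fmt"
)

// columnSpec is a hypothetical stand-in for one parsed --fields entry,
// e.g. "timestamp.date(2006-01-02 15:04:05)" -> {Name: "timestamp", Type: "date"}.
type columnSpec struct {
	Name string
	Type string // "auto" when no explicit type was given
}

// timeSeriesOptions mirrors two of the new flags; the struct is illustrative only.
type timeSeriesOptions struct {
	TimeField string
	MetaField string
}

// validateTimeSeries checks the options against the declared columns
// before any document is inserted.
func validateTimeSeries(opts timeSeriesOptions, columnsHaveTypes bool, cols []columnSpec) error {
	if opts.TimeField == "" {
		if opts.MetaField != "" {
			return errors.New("a meta field was given without a time field")
		}
		return nil // not a time-series import in this sketch
	}
	if !columnsHaveTypes {
		return errors.New("time-series import requires --columnsHaveTypes")
	}
	byName := make(map[string]columnSpec, len(cols))
	for _, c := range cols {
		byName[c.Name] = c
	}
	tf, ok := byName[opts.TimeField]
	if !ok {
		return fmt.Errorf("time field %q is not among the provided fields", opts.TimeField)
	}
	if tf.Type != "date" {
		return fmt.Errorf("time field %q must have type date, got %q", opts.TimeField, tf.Type)
	}
	if opts.MetaField != "" {
		if _, ok := byName[opts.MetaField]; !ok {
			return fmt.Errorf("meta field %q is not among the provided fields", opts.MetaField)
		}
	}
	return nil
}

func main() {
	cols := []columnSpec{
		{Name: "timestamp", Type: "date"},
		{Name: "mac_address", Type: "string"},
		{Name: "humidity", Type: "decimal"},
	}
	opts := timeSeriesOptions{TimeField: "timestamp", MetaField: "mac_address"}
	if err := validateTimeSeries(opts, true, cols); err != nil {
		fmt.Println("validation failed:", err)
		return
	}
	fmt.Println("time-series options are valid")
}
```

Running all checks once up front is what replaces the one-error-per-document behavior with a single early failure.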

API Changes

Adds four new parameters:

  • --timeSeriesTimeField=<column_name>
  • --timeSeriesMetaField=<column_name>
  • --timeSeriesGranularity=[seconds(default),minutes,hours]
  • --timeSeriesExists=[false(default), true]
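
As a rough sketch of how these flags could map onto the server's `create` command, the snippet below builds the command document with a `timeseries` sub-document. The `elem` and `doc` types are local stand-ins for the driver's ordered `bson.D` so that the example has no external dependencies, and `buildCreateCommand` is a hypothetical helper, not a function from this PR.

```go
package main

import "fmt"

// elem and doc are local stand-ins for the driver's ordered bson.E / bson.D.
type elem struct {
	Key   string
	Value interface{}
}

type doc []elem

// buildCreateCommand assembles {create: <coll>, timeseries: {...}} from the
// flag values. An empty granularity falls back to "seconds".
func buildCreateCommand(coll, timeField, metaField, granularity string) (doc, error) {
	if timeField == "" {
		return nil, fmt.Errorf("a time field is required")
	}
	if granularity == "" {
		granularity = "seconds"
	}
	switch granularity {
	case "seconds", "minutes", "hours":
	default:
		return nil, fmt.Errorf("invalid granularity %q", granularity)
	}
	ts := doc{{Key: "timeField", Value: timeField}}
	if metaField != "" {
		ts = append(ts, elem{Key: "metaField", Value: metaField})
	}
	ts = append(ts, elem{Key: "granularity", Value: granularity})
	return doc{
		{Key: "create", Value: coll},
		{Key: "timeseries", Value: ts},
	}, nil
}

func main() {
	cmd, err := buildCreateCommand("csv_timeseries_test", "timestamp", "mac_address", "")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", cmd)
}
```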

How Tested

$ go run vendor/github.com/3rf/mongo-lint/golint/golint.go mongo* bson* common/*
common/db/optime.go:91:9: if block ends with a return statement, so drop this else and outdent its block
common/idx/index_catalog.go:228:10: if block ends with a return statement, so drop this else and outdent its block

$ go test -v ./...
...
PASS
ok  	github.com/mongodb/mongo-tools/release/platform	(cached)

Standalone

$ ./test/qa-tests/mongod --dbpath /data/db/mongoimport_test_standalone --logpath /data/db/mongoimport_test_standalone/mongod.log --fork --port 28010
$ ./bin/mongoimport --host localhost:28010 --file sensor_data.csv --fields="timestamp.date(2006-01-02 15:04:05),mac_address.string(),humidity.decimal(),temp_celsius0.decimal()" --db=import_test --collection=csv_timeseries_test --type=csv --columnsHaveTypes --timeSeriesMetaField=mac_address --timeSeriesTimeField=timestamp

ReplicaSet

$ mlaunch init --binarypath ./test/qa-tests --replicaset --port 28090 --dir /data/db/mongoimport_test_replset
$ ./bin/mongoimport --host localhost:28090 --file sensor_data.csv --fields="timestamp.date(2006-01-02 15:04:05),mac_address.string(),humidity.decimal(),temp_celsius0.decimal()" --db=import_test_rs --collection=csv_timeseries_test_rs --type=csv --columnsHaveTypes --timeSeriesMetaField=mac_address --timeSeriesTimeField=timestamp --drop
2023-03-14T09:38:26.680-0700	connected to: mongodb://localhost:28090/
2023-03-14T09:38:26.680-0700	dropping: import_test_rs.csv_timeseries_test_rs
2023-03-14T09:38:26.693-0700	creating time-series collection: import_test_rs.csv_timeseries_test_rs
2023-03-14T09:38:27.172-0700	2154 document(s) imported successfully. 0 document(s) failed to import.

Sharded Cluster

$ mlaunch init --binarypath ./test/qa-tests --replicaset --port 28090 --dir /data/db/mongoimport_test_shardedcluster --sharded tic tac toe
$ ./bin/mongoimport --host localhost:28090 --file sensor_data.csv --fields="timestamp.date(2006-01-02 15:04:05),mac_address.string(),humidity.decimal(),temp_celsius0.decimal()" --db=import_test_sharded --collection=csv_timeseries_test_sharded --type=csv --columnsHaveTypes --timeSeriesMetaField=mac_address --timeSeriesTimeField=timestamp --drop
2023-03-14T09:53:07.669-0700	connected to: mongodb://localhost:28090/
2023-03-14T09:53:07.670-0700	dropping: import_test_sharded.csv_timeseries_test_sharded
2023-03-14T09:53:07.672-0700	creating time-series collection: import_test_sharded.csv_timeseries_test_sharded
2023-03-14T09:53:08.261-0700	2154 document(s) imported successfully. 0 document(s) failed to import.

Known Issues

Fields within the inserted documents come back in an inconsistent order. I've seen this with the Python driver, pymongo, when not using OrderedDict. The Go driver appears to use the ordered bson.D rather than the unordered bson.M, so this behavior is confusing.

> db.csv_timeseries_test_rs.find()
{ "timestamp" : ISODate(), "temp_celsius" : NumberDecimal(), "mac_address" : "", ...
{ "timestamp" : ISODate(), "mac_address" : "", "temp_celsius" : NumberDecimal(), ...
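
To illustrate the distinction mentioned above: a slice-backed document (like `bson.D`) keeps its keys in insertion order, whereas a Go map (like `bson.M`) has no defined iteration order. The sketch below uses local stand-in types rather than the driver's, and `sameKeyOrder` is a hypothetical helper for spotting the mismatch seen in the two documents above.

```go
package main

import "fmt"

// elem and orderedDoc are local stand-ins for the driver's bson.E / bson.D.
type elem struct {
	Key   string
	Value interface{}
}

type orderedDoc []elem

// keys returns the field names in the order they were appended.
func keys(d orderedDoc) []string {
	out := make([]string, 0, len(d))
	for _, e := range d {
		out = append(out, e.Key)
	}
	return out
}

// sameKeyOrder reports whether two documents list their fields in the same order.
func sameKeyOrder(a, b orderedDoc) bool {
	ka, kb := keys(a), keys(b)
	if len(ka) != len(kb) {
		return false
	}
	for i := range ka {
		if ka[i] != kb[i] {
			return false
		}
	}
	return true
}

func main() {
	first := orderedDoc{{Key: "timestamp"}, {Key: "temp_celsius"}, {Key: "mac_address"}}
	second := orderedDoc{{Key: "timestamp"}, {Key: "mac_address"}, {Key: "temp_celsius"}}
	fmt.Println(keys(first))
	fmt.Println(keys(second))
	fmt.Println("same order:", sameKeyOrder(first, second))
}
```

If the documents are built this way on the client, the reordering would have to come from somewhere after the insert call; this sketch does not establish where.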
