Import dataset data in batches

Paul Walk edited this page Aug 1, 2019 · 2 revisions

Steps to take before importing the datasets

  • Clone the nims-mdr repo

  • Copy the sample datasets to /srv/ngdr/data/sample_datasets

    NOTE: /srv/ngdr/data is mounted into docker at /data/data

  • Open a shell in the docker container

    docker exec -it nims-hyrax_git_web_1 /bin/bash
    
  • Start the rails console

    bundle exec rails c
    

The bulk importer takes the following input values:

  • import_dir
    The import directory containing the dataset directories to import. The path must be either absolute or relative to the Rails application root
  • metadata_filename
    Default is nil, in which case the importer assumes there is exactly one xml file in the import directory and uses it as the metadata file
    If a filename is given, the importer looks for an xml file with that name in the import directory
  • collection_ids
    Default is nil. We are not using this feature until the issue with collections is resolved.
    collection_ids expects an array of collection ids (e.g. ['qweq323', 'qwe422']); the dataset will be made a member of those collections
  • debug
    Default is false.
    If set to true, a dry run of the import takes place: the dataset is not imported into Hyrax, but all of the processing is done, and you can see the parsed dataset attributes, the list of files to be imported, and any errors.
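The defaults above imply a positional constructor signature, so trailing arguments can be omitted. A minimal sketch of that signature (the class name `BulkImporterSketch` is hypothetical and this is not the real importer source; it only illustrates the defaults described above):

```ruby
# Hypothetical sketch of the positional signature implied by the defaults above.
class BulkImporterSketch
  attr_reader :import_dir, :metadata_filename, :collection_ids, :debug

  def initialize(import_dir, metadata_filename = nil, collection_ids = nil, debug = false)
    @import_dir        = import_dir        # required: directory of dataset directories
    @metadata_filename = metadata_filename # nil => use the single xml file found
    @collection_ids    = collection_ids    # nil => do not add to any collection
    @debug             = debug             # true => dry run only
  end
end

i = BulkImporterSketch.new('/data/data/sample_datasets/characterization')
puts i.debug                      # defaults to false
puts i.metadata_filename.inspect  # defaults to nil
```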

To run the bulk importer in debug mode:

require 'importers/dataset_importer'
import_dir = '/data/data/sample_datasets/characterization'
metadata_filename = nil
collection_ids = nil
debug = true
i = Importers::DatasetImporter::BulkImporter.new(import_dir, metadata_filename, collection_ids, debug)
i.perform_create

This performs a dry run of the bulk import: the datasets are not imported into Hyrax, but all of the processing is done.
The outcomes are written to a timestamped log file in /srv/ngdr/data/.
The log file will contain the following columns:

  • Current time
  • Dataset directory - the name of the directory being imported
  • attributes - the dataset attributes parsed from the xml file
  • files - the list of files to be imported
  • errors - the list of errors from the import, if any
  • time taken - the time taken to do the import
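Because the log files are timestamped, the most recent run can be located by sorting the filenames. A minimal sketch, assuming the logs sit in a flat directory and their timestamped names sort chronologically (the `.log` extension and naming pattern are assumptions, not confirmed by the importer source):

```ruby
require 'tmpdir'

# Return the most recent timestamped log in a directory.
# Assumes timestamped filenames sort chronologically (e.g. import_YYYYMMDD.log).
def latest_log(dir)
  Dir.glob(File.join(dir, '*.log')).max
end

# Demonstration with a throwaway directory standing in for /srv/ngdr/data/
Dir.mktmpdir do |dir|
  File.write(File.join(dir, 'import_20190801.log'), '')
  File.write(File.join(dir, 'import_20190802.log'), '')
  puts latest_log(dir)  # prints the path of the newest log
end
```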

To bulk import the datasets:

require 'importers/dataset_importer'
import_dir = '/data/data/sample_datasets/characterization'
i = Importers::DatasetImporter::BulkImporter.new(import_dir)
i.perform_create

As with the dry run, the outcomes are written to a timestamped log file in /srv/ngdr/data/, with the same columns as listed above.