The ingest.sh shell script ingests data onto a PNDA cluster.
To install dependencies and configure the tool for an HttpFS endpoint, run the install command with the URL of a node running the HttpFS role: its IP address and port (14000 by default).
ingest.sh install http://host:port
e.g. ingest.sh install http://192.168.0.0:14000
Please check with your cluster administrator for the correct IP address and port, as they may differ in production deployments.
To upload a file or directory onto a cluster, run the upload command with the file or directory name.
You can use the -f flag to overwrite existing files, and the -t flag to set the number of threads used for parallel uploads.
ingest.sh upload local_file or local_directory
e.g.
ingest.sh upload Readme.txt
ingest.sh upload -f -t 10 /user/data
Once the upload completes, verify that the transferred files are stored in the /user/pnda/PNDA_datasets/bulk/ folder in HDFS.
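One way to do this check without logging in to a cluster node is to query the same HttpFS endpoint over its WebHDFS-compatible REST API. The sketch below is only an illustration: the HTTPFS_URL and HDFS_USER values are placeholders, the helper names liststatus_url and list_bulk are hypothetical, and it assumes the endpoint accepts simple user.name authentication.

```shell
#!/bin/sh
# Sketch: list the bulk folder through the HttpFS (WebHDFS-compatible) REST API.
# HTTPFS_URL and HDFS_USER are placeholder values, not settings read from ingest.sh.
HTTPFS_URL="${HTTPFS_URL:-http://192.168.0.0:14000}"
HDFS_USER="${HDFS_USER:-pnda}"
BULK_DIR="/user/pnda/PNDA_datasets/bulk"

# Build the LISTSTATUS URL for an absolute HDFS path.
liststatus_url() {
    printf '%s/webhdfs/v1%s?op=LISTSTATUS&user.name=%s\n' "$HTTPFS_URL" "$1" "$HDFS_USER"
}

# Fetch the directory listing as JSON; returns non-zero on HTTP errors.
list_bulk() {
    curl --silent --show-error --fail "$(liststatus_url "$BULK_DIR")"
}
```

Calling list_bulk after an upload prints a JSON listing in which the uploaded names should appear. On a node with the Hadoop client installed, hdfs dfs -ls /user/pnda/PNDA_datasets/bulk/ gives the same information.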
The tool depends on the hdfs Python package, which is installed with pip. The install command also sets up this package and populates the CLI config.
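If an upload fails right after installation, it can help to confirm that both pieces are in place. The following is a rough sketch, not part of the tool: has_hdfs_package and config_urls are hypothetical helpers, and it assumes the CLI config uses the hdfs package's usual layout, with url = ... entries in a file such as ~/.hdfscli.cfg.

```shell
#!/bin/sh
# Sketch: sanity-check what the install command is described as setting up.
# Assumption: the CLI config contains "url = ..." lines (HdfsCLI-style layout).

# Succeeds when the given Python interpreter (default: python) can import hdfs.
has_hdfs_package() {
    "${1:-python}" -c 'import hdfs' >/dev/null 2>&1
}

# Prints the url entries found in a config file; fails if the file is
# missing or contains no url entry.
config_urls() {
    [ -f "$1" ] || return 1
    grep -E '^[[:space:]]*url[[:space:]]*=' "$1"
}
```

For example, has_hdfs_package python3 && config_urls "$HOME/.hdfscli.cfg" prints the configured endpoint when both checks pass. The config path here is an assumption; adjust it if your installation writes the file elsewhere.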
- Sometimes files or nested directories in large folders are not overwritten. In that case, rerun the upload command with the -f switch.
- Appending a trailing '/' to the directory name seems to have odd effects, in some cases even overwriting the directory. To upload a directory, use just its name as the argument.
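Both notes can be folded into a small wrapper that removes any trailing slash and always passes -f. This is a minimal sketch under stated assumptions: strip_trailing_slash and safe_upload are hypothetical helper names, and ingest.sh is assumed to be on the PATH.

```shell
#!/bin/sh
# Sketch: clean up the path argument before handing it to ingest.sh.
# strip_trailing_slash and safe_upload are hypothetical helpers, not part of the tool.

strip_trailing_slash() {
    path="$1"
    # Remove every trailing '/', but leave a bare "/" untouched.
    while [ "$path" != "/" ] && [ "${path%/}" != "$path" ]; do
        path="${path%/}"
    done
    printf '%s\n' "$path"
}

# Upload with -f so existing files are overwritten, using the cleaned-up path.
# Assumes ingest.sh is on PATH.
safe_upload() {
    ingest.sh upload -f "$(strip_trailing_slash "$1")"
}
```

With these definitions, safe_upload data/ runs ingest.sh upload -f data, so shell tab completion that appends a slash no longer changes what gets uploaded.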