

HDFS Cleaner

This tool spawns multiple jobs to perform file cleanup and data archival tasks when invoked by cron or any other job scheduler.

Directory cleanup

As is typical of any system that runs multiple components or services, log files accumulate over time, for example from MapReduce jobs, Spark applications, and batch jobs.

HDFS Cleaner can be configured to remove old files when either an age threshold or a size threshold is reached. The thresholds are set in properties.json.
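
The threshold behaviour described above can be sketched roughly as follows. This is only an illustration, not the tool's actual implementation: the `FileInfo` type, the `select_for_deletion` function, and the parameter names `max_age_seconds` and `max_total_bytes` are all hypothetical, and the real keys in properties.json may differ. The sketch assumes that files past the age threshold are always removed, and that the oldest remaining files are removed until the directory fits under the size threshold.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FileInfo:
    """Hypothetical record describing one file in a directory listing."""
    path: str
    size_bytes: int
    mtime: float  # modification time, seconds since the epoch


def select_for_deletion(
    files: List[FileInfo],
    now: float,
    max_age_seconds: Optional[float] = None,
    max_total_bytes: Optional[int] = None,
) -> List[str]:
    """Return the paths that would be deleted under the assumed policy."""
    doomed: List[str] = []
    kept: List[FileInfo] = []

    # Walk the files oldest first so that size-based eviction below
    # also removes the oldest files first.
    for f in sorted(files, key=lambda item: item.mtime):
        if max_age_seconds is not None and now - f.mtime > max_age_seconds:
            doomed.append(f.path)  # past the age threshold
        else:
            kept.append(f)

    if max_total_bytes is not None:
        total = sum(f.size_bytes for f in kept)
        for f in kept:
            if total <= max_total_bytes:
                break  # directory now fits under the size threshold
            doomed.append(f.path)
            total -= f.size_bytes

    return doomed
```

The function only selects paths and performs no deletion itself, which keeps the policy easy to test separately from any HDFS client calls.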

Data Management

HDFS Cleaner also interacts with the Data service to perform data management. It either deletes old data when a size or age threshold is reached, or archives datasets to distributed storage such as Swift or S3 containers. Archived files are placed under the archive folder, and each file is tagged with the name of its data source and its timestamp.
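
As a rough illustration of the tagging step, the sketch below builds an archive path from a data source name and a timestamp. The exact naming scheme used by HDFS Cleaner is not stated here, so the `archive_name` function, the `datasource-timestamp-filename` layout, and the UTC timestamp format are all assumptions made for the example.

```python
import posixpath
from datetime import datetime, timezone


def archive_name(source_path: str, datasource: str, timestamp: float,
                 archive_root: str = "archive") -> str:
    """Build a hypothetical archive path tagged with data source and timestamp."""
    # Format the epoch timestamp as a compact UTC string (assumed format).
    stamp = datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    # Keep only the file name; HDFS and object-store paths use forward slashes.
    base = posixpath.basename(source_path)
    return posixpath.join(archive_root, f"{datasource}-{stamp}-{base}")
```

For instance, under these assumptions a file named `part-0001.avro` from a data source called `netflow` would be stored under the archive folder with both the source name and the timestamp in its name.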

## How to restore archived files

Archived files can be copied back to the cluster using the cp or distcp commands. For example, to retrieve archived data under the archive folder in the pnda container, run the following command in a terminal on the edge node:

sudo -u hdfs hadoop distcp swift://archive.pnda/* hdfs://testcluster-cdh-mgr1:8020/user/pnda/