# Cleanup Reprocessing Directory
This notebook contains the Python script that cleans up a reprocessing directory for a given DQR. It can be run from anywhere; the only input is the DQR number, which must be set manually in the first code cell for the notebook to work.

### DQR tracking tool
* [Reprocessing Dashboard](https://task.arm.gov/report/repo/#s/_::D150701.31&_r::_)
* Only clean up tasks that are "Close completed" or "Close canceled" unless specifically asked to do otherwise.

### ServiceNow
* Once the task is cleaned, the user should "close complete" the Delete Original Data task on [ServiceNow](https://armcrf.service-now.com/).
* [REPO-04.2] Delete Original Data

### Todo: 
* allow for a list of dqrs to be cleaned
* query ServiceNow for completion status and loop over all DQR folders in the reproc home directory using subprocess ls call. 
* test the recursive folder move, it may not work as intended
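
As a starting point for the first item, a list of DQRs could be resolved to directories before any cleanup runs. This is only a sketch: the helper name `existing_dqr_dirs` is hypothetical, and it assumes each DQR lives in a folder of the same name directly under the reprocessing home directory, as the cleanup cell does. It does not query ServiceNow.

```python
from os import path

def existing_dqr_dirs(reproc_home, dqrs):
    """Map each DQR in dqrs to its directory under reproc_home, skipping missing ones."""
    found = {}
    for dqr in dqrs:
        candidate = path.join(reproc_home, dqr)
        if path.isdir(candidate):
            found[dqr] = candidate
        else:
            # Report and skip rather than fail, so one bad DQR does not stop the batch
            print('Skipping {}: no directory at {}'.format(dqr, candidate))
    return found
```

The returned mapping could then be looped over, running the archive and remove steps once per directory.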


In [1]:
"""
This must be set for the rest to work
The dqr should be similar to the following form:
D123456 | D123456.1 | D123456.12
"""
dqr = "D180827.3" # "Set DQR# Here."
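
Because the directory named by `dqr` is deleted at the end of the run, it may be worth checking the value before anything else happens. The sketch below is an assumption-laden example, not part of the original notebook: the helper `check_dqr` is hypothetical, and the pattern (a `D`, six digits, and an optional `.` followed by digits) is inferred only from the three example forms in the cell above.

```python
import re

# Assumed pattern, inferred from the examples D123456 | D123456.1 | D123456.12
DQR_RE = re.compile(r"D\d{6}(\.\d+)?")

def check_dqr(dqr):
    """Return dqr unchanged if it matches the assumed format, else raise ValueError."""
    if not DQR_RE.fullmatch(dqr):
        raise ValueError('Unexpected DQR format: {!r}'.format(dqr))
    return dqr
```

Calling `check_dqr(dqr)` right after setting `dqr` would reject a leftover placeholder such as `"Set DQR# Here."` before any path is built from it.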

In [5]:
from os import environ, path, walk, makedirs
from shutil import copyfile, rmtree, SameFileError
from distutils.dir_util import copy_tree
from sys import argv
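"""
distutils.dir_util.copy_tree is used instead of shutil.copytree because
the shutil version raised an error when copying a directory that already
existed at the destination, while the distutils version handles that
case gracefully.
Note: distutils was removed in Python 3.12; on Python 3.8+,
shutil.copytree(src, dest, dirs_exist_ok=True) covers the same case.
"""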

# Substrings that mark a file or folder for archiving (inclusive list, matched anywhere in the name or path)
archive_file_params = ['ncr_', 'conf', 'json', '.py', '.ipynb', 'log', '.sh', '.csh', '.bash', '.Ingest']
archive_folder_params = ['ncr_', 'script']

# Get environment variables
reproc_home = environ['REPROC_HOME']
post_proc = environ['POST_PROC']

# Create directory to clean and archive
clean_dir = path.join(reproc_home, dqr)
archive_dir = path.join(post_proc, dqr, 'auto_archive')

# Create archive directory if it doesn't exist, pass if it does exist
makedirs(archive_dir, exist_ok=True)

# Walk the directory and get all directories, subdirectories, and files
for dirName, subdirList, fileList in walk(clean_dir):
    print('Found directory: {}'.format(dirName))
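    # Recursively copy any directories used for ncreviews (TODO: still needs testing)
    if any(param in dirName for param in archive_folder_params):
        src = dirName
        # dirName is an absolute path here, so path.join(archive_dir, dirName)
        # would discard archive_dir; join on the path relative to clean_dir instead
        dest = path.join(archive_dir, path.relpath(dirName, clean_dir))
        print("Archiving directory: {}".format(src))
        copy_tree(src, dest)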
    # Iterate over list of files in current directory on walk
    for fname in fileList:
        # Check if any inclusive parameters are in the filename
        if any(param in fname for param in archive_file_params):
            # Make source and destination file paths
            src = path.join(dirName, fname)
            dest = path.join(archive_dir, fname)
            print("Archiving file: {}".format(fname))

            # try/except is needed in case the file is a broken symlink (FileNotFoundError)
            # or src and dest resolve to the same file (SameFileError)
            try:
                copyfile(src, dest)
            except FileNotFoundError:
                print('\tFile DNE: {}'.format(fname))
            except SameFileError:
                print("\tFile exists at dest: {}".format(fname))
                
print('REMOVING: {}'.format(clean_dir))
rmtree(clean_dir)
print('Finished cleaning: {}'.format(clean_dir))

Found directory: /data/project/0021718_1509993009/D180827.3
Archiving file: env.bash
Archiving file: D180827.3.conf
	File exists at dest: D180827.3.conf
Archiving file: env.csh
Found directory: /data/project/0021718_1509993009/D180827.3/logs
Found directory: /data/project/0021718_1509993009/D180827.3/health
Found directory: /data/project/0021718_1509993009/D180827.3/collection
Found directory: /data/project/0021718_1509993009/D180827.3/collection/sgp
Found directory: /data/project/0021718_1509993009/D180827.3/collection/sgp/sgpaosnanosmpsE13.00
Found directory: /data/project/0021718_1509993009/D180827.3/conf
Found directory: /data/project/0021718_1509993009/D180827.3/tmp
Found directory: /data/project/0021718_1509993009/D180827.3/db
Found directory: /data/project/0021718_1509993009/D180827.3/datastream
Found directory: /data/project/0021718_1509993009/D180827.3/quicklooks
Found directory: /data/project/0021718_1509993009/D180827.3/www
Found directory: /data/project/0021718_1509993009/D
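
The note in the cell above explains that `distutils.dir_util.copy_tree` was chosen because `shutil.copytree` failed when the destination already existed. As a possible alternative, `shutil.copytree` accepts `dirs_exist_ok=True` since Python 3.8, and `distutils` is no longer in the standard library as of Python 3.12. The sketch below shows how the directory archiving step might look under that assumption; the helper name `archive_subdir` is hypothetical, and it places the copy under the archive directory using the path relative to the directory being cleaned.

```python
from os import path
from shutil import copytree

def archive_subdir(src_dir, clean_dir, archive_dir):
    """Copy src_dir into archive_dir, preserving its path relative to clean_dir."""
    rel = path.relpath(src_dir, clean_dir)
    dest = path.join(archive_dir, rel)
    # dirs_exist_ok=True (Python 3.8+) merges into an existing destination instead of raising
    copytree(src_dir, dest, dirs_exist_ok=True)
    return dest
```

Calling it twice on the same directory should not raise; files already at the destination are overwritten by the second copy.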

## Python Script Version

This version of the code can be saved as a Python script (`.py`) and run from the command line with a single DQR as its argument. It should be run with Python 3.

In [None]:
from os import environ, path, walk, makedirs
from shutil import copyfile, rmtree, SameFileError
from distutils.dir_util import copy_tree
import sys

msg = 'First argument should be a dqr #.'
# Stop before any cleanup if the DQR is missing or help was requested
if len(sys.argv) < 2:
    print(msg)
    sys.exit(1)
user_input = sys.argv[1]
if user_input in ['-h', '--help']:
    print(msg)
    sys.exit(0)
dqr = user_input

archive_file_params = ['ncr_', 'conf', 'json', '.py', '.ipynb', 'log', '.sh', '.csh', '.bash', '.Ingest']
archive_folder_params = ['ncr_', 'script']
reproc_home = environ['REPROC_HOME']
post_proc = environ['POST_PROC']

clean_dir = path.join(reproc_home, dqr)
archive_dir = path.join(post_proc, dqr, 'auto_archive')
makedirs(archive_dir, exist_ok=True)

for dirName, subdirList, fileList in walk(clean_dir):
    print('Found directory: {}'.format(dirName))
    if any(param in dirName for param in archive_folder_params):
        src = dirName
        # Join on the path relative to clean_dir; joining an absolute dirName would discard archive_dir
        dest = path.join(archive_dir, path.relpath(dirName, clean_dir))
        print("Archiving directory: {}".format(src))
        copy_tree(src, dest)
    for fname in fileList:
        if any(param in fname for param in archive_file_params):
            src = path.join(dirName, fname)
            dest = path.join(archive_dir, fname)
            print("Archiving file: {}".format(fname))
            #print('{}'.format(src))
            #print('---> {}'.format(dest))
            try:
                copyfile(src, dest)
            except FileNotFoundError:
                print('\tFile DNE: {}'.format(fname))
            except SameFileError:
                print("\tFile exists at dest: {}".format(fname))
print('REMOVING: {}'.format(clean_dir))
rmtree(clean_dir)
print('Finished cleaning: {}'.format(clean_dir))
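
The argument handling in the script could also be done with `argparse`, which provides `-h`/`--help` and a usage error for a missing argument without extra code. This is a sketch only: the function name `parse_args` and the `--dry-run` flag are hypothetical and not part of the original script.

```python
import argparse

def parse_args(args=None):
    """Parse the command line; args=None means use sys.argv[1:]."""
    parser = argparse.ArgumentParser(
        description='Archive selected files for a DQR, then remove its reprocessing directory.')
    parser.add_argument('dqr', help='DQR number, e.g. D123456 or D123456.1')
    # Hypothetical flag, not in the original script
    parser.add_argument('--dry-run', action='store_true',
                        help='report what would be archived and removed without changing anything')
    return parser.parse_args(args)
```

With this in place, `dqr = parse_args().dqr` would replace the manual `argv` checks.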


## Extras
Below are extras that were used for testing.
1. Tried to connect to db to check if a dqr has been completed
2. Tried to exclude known types to archive everything else
  * Too many unknown file extensions to exclude all
  * Too much chance of archiving lots of files accidentally
  * Decided to go with list of extensions to archive rather than exclude
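
The inclusive approach that was chosen can be checked on a handful of names before running it against a real directory. The sketch below reuses the `archive_file_params` list from the cleanup cell; the helper name `should_archive` and the sample file names are made up for illustration.

```python
archive_file_params = ['ncr_', 'conf', 'json', '.py', '.ipynb', 'log', '.sh', '.csh', '.bash', '.Ingest']

def should_archive(fname, params=archive_file_params):
    """Return True if any parameter appears anywhere in fname (substring match, not a strict extension check)."""
    return any(param in fname for param in params)

# Made-up file names for illustration
samples = ['env.bash', 'D180827.3.conf', 'data.nc', 'catalog.cdf']
print([name for name in samples if should_archive(name)])
# prints ['env.bash', 'D180827.3.conf', 'catalog.cdf']
```

Note that `catalog.cdf` is selected because `'log'` occurs inside `catalog`; the parameters are matched as substrings, so short entries can catch more files than intended.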

In [None]:
import os
import subprocess
import psycopg2

conn_string = "host='armdev-pgdb.ornl.gov' dbname='arm_all' user='***' password='******'"
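conn = psycopg2.connect(conn_string)
# psycopg2 connections have no execute(); queries go through a cursor.
# No query was written for this test, so only the cursor is opened here.
cur = conn.cursor()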

reproc_home = os.environ['REPROC_HOME']
cmd = ["ls", "-tr", reproc_home]
print(f"cmd = {cmd}")
proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
out, err = proc.communicate()
# Decode the bytes output and split it into one entry per line
directories = out.decode().splitlines()
print("directories:")
for d in directories:
    print(d)

In [None]:
reproc_home = environ['REPROC_HOME']
clean_dir = path.join(reproc_home, dqr)

exclude = ['cdf', 'nc', 'icm', 'csv', 'mpl', '.tar', '.C1','.dat', '.tsv']

for dirpath, subdirs, files in walk(clean_dir):
    for x in files:
        if not any(ex in x for ex in exclude):
            print(x)