# 'Find' Syntax Structure Examples

In this notebook, we explore different syntax representations for the 'find' bash command in order to improve the generator, particularly its ability to generate valid 'find' commands. The hope is that the same techniques can then be applied to all of the other commands.

To do this, we will go through the entire pipeline and optimize for efficiency and correctness at every step.

In [1]:
from generator import *
import json

## Exploring Known Find Commands

We start by pulling every find command out of the training data, to see examples of what proper syntax structure looks like for the find command.

In [2]:
with open('cmds_proccess_train.txt') as fp:
    txt = fp.read()
    
find_cmds = []
for cmd in txt.split('\n'):
    if cmd.split(" ")[0] == "find":
        find_cmds.append(cmd)
        
find_cmds[:20]

['find Path -type d ! -perm -Permission',
 'find Path -type d -name Regex -execdir tar -c -v -f File File \\;',
 'find Path -user Regex',
 'find Path -name Regex -delete',
 'find Path -name Regex',
 'find Path -group Regex',
 'find Path -type d -exec chmod Permission {} +',
 'find Path -type f -print0 | xargs -r -0 -I {} grep -F Regex {}',
 'find Path -name Regex | xargs -I {} grep -r Regex {}',
 'find Path Path Path Path',
 'find Path -type f -name Regex',
 'find Path -print0 | xargs -0 -I {} echo {}',
 'find Path -type d -name Regex -exec rsync -a -v -R {} File \\; -exec rm -r -f File \\;',
 'find Path -mmin -Quantity',
 'find Path -type f -exec bzip2 {} \\;',
 'find Path -name Regex -type f -exec wc -l File \\;',
 'find Path -type c',
 'find Path -name Regex -prune -or -print',
 'find Path -nouser -exec rm File \\;',
 'find Path -name Regex -exec grep Regex {} \\;']

Clearly, the syntax structure differs from what was initially assumed, which is likely to have caused a lot of the problems. The initial syntax structure was 'find options Folder Regex', but the Folder (or path) comes before the options, and the Regex is not mandatory: it only appears as the argument of certain options.

So, I have switched the syntax structure to 'find Folder options'.
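The observation above can be checked mechanically. The following is a minimal sketch, not part of the generator: the helper name `split_paths_and_expression` is hypothetical, the sample is a handful of commands copied from the training output, and the rule "the expression starts at the first token beginning with '-', '!' or '('" is a simplification that ignores global options such as -L placed before the path.

```python
from collections import Counter

# A few commands copied from the training sample shown above.
sample_cmds = [
    'find Path -type d ! -perm -Permission',
    'find Path -name Regex -delete',
    'find Path Path Path Path',
    'find Path -type f -name Regex',
    'find Path -mmin -Quantity',
    'find Path -name Regex -prune -or -print',
]


def split_paths_and_expression(cmd):
    """Split a find command into its leading paths and the remaining expression tokens."""
    tokens = cmd.split(" ")[1:]  # drop the leading 'find'
    paths = []
    for i, tok in enumerate(tokens):
        # Simplifying assumption: the expression starts at the first
        # token that begins with '-', or is '!' or '('.
        if tok.startswith("-") or tok in ("!", "(", "\\("):
            return paths, tokens[i:]
        paths.append(tok)
    return paths, []


# How many commands have at least one path before any option?
path_first = sum(1 for c in sample_cmds if split_paths_and_expression(c)[0])

# Count option names; signed placeholders such as '-Permission' are skipped
# because they start with an uppercase letter after the dash.
flag_counts = Counter(
    tok
    for c in sample_cmds
    for tok in split_paths_and_expression(c)[1]
    if tok.startswith("-") and tok[1:2].islower()
)

print(f"{path_first}/{len(sample_cmds)} commands start with a path")
print(flag_counts.most_common(3))
```

Running the same count over the full `find_cmds` list would show whether any training command breaks the 'find Folder options' ordering.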

## Scraping of Find Command Arguments

In [2]:
from scraper import WebScraper

ws = WebScraper(utilities=['find'])
ws.extract_utilities()
ws.save_json(map_path='find_scrape.json')

find
syntax not found for find






In [3]:
ws.non_conforming_flags()

['find:-D:debugopts',
 'find:-type:c',
 'find:-printf:format',
 'find:-xtype:c',
 'find:-fstype:type',
 'find:-perm:+mode',
 'find:-regextype:type',
 'find:-mindepth:levels',
 'find:-not:expr',
 'find:-maxdepth:levels',
 'find -L:-D:debugopts',
 'find -L:-type:c',
 'find -L:-printf:format',
 'find -L:-xtype:c',
 'find -L:-fstype:type',
 'find -L:-perm:+mode',
 'find -L:-regextype:type',
 'find -L:-mindepth:levels',
 'find -L:-not:expr',
 'find -L:-maxdepth:levels']
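Every flag in this list is reported twice, once under 'find' and once under 'find -L'. A small sketch of how the entries could be grouped to make that visible follows. It assumes each entry has the layout 'utility:flag:argument', which is inferred from the output above rather than from the scraper's code, and `group_by_flag` is a hypothetical helper.

```python
# A subset of the entries printed above.
entries = [
    'find:-D:debugopts',
    'find:-type:c',
    'find:-perm:+mode',
    'find -L:-D:debugopts',
    'find -L:-type:c',
    'find -L:-perm:+mode',
]


def group_by_flag(entries):
    """Map each (flag, argument) pair to the set of utilities it was reported under."""
    grouped = {}
    for entry in entries:
        # Assumed layout: 'utility:flag:argument'. Splitting at most twice
        # keeps any further ':' inside the argument intact.
        utility, flag, arg = entry.split(":", 2)
        grouped.setdefault((flag, arg), set()).add(utility)
    return grouped


for (flag, arg), utilities in sorted(group_by_flag(entries).items()):
    print(f"{flag} {arg}: reported under {sorted(utilities)}")
```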

## Generation of Find Commands

With the syntax structure switched to 'find Folder options', the generated commands look like the following.

In [7]:
find_gen = Generator(map_path='find_scrape.json', utilities=["find"])

In [8]:
find_cmds = find_gen.generate_all_commands()
find_cmds[:10]

['find Folder -mmin Number -L -or',
 'find Folder -execdir -nogroup -user Number',
 'find Folder -fprint0 File -print0 -O',
 'find Folder -daystart -executable -empty',
 'find Folder -wholename Regex -follow -print0',
 'find Folder -ok -d -ctime Number',
 'find Folder -fprint File -empty -mount',
 'find Folder -path Regex -Olevel -true',
 'find Folder -L -okdir -xdev',
 'find Folder -ok -ilname Regex -size Number']

In [9]:
len(find_cmds)

64823

We were able to generate 64,823 find commands.
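Some of the generated commands shown above can be ruled out before they are ever run. In find, the -H, -L and -P options must come before the starting path, and -exec, -execdir, -ok and -okdir need a command followed by a terminator (';' for all four, or '+' for -exec and -execdir). The sketch below is a heuristic pre-filter built on those two rules only; `obvious_problems` is a hypothetical helper, not part of the generator, and it does not attempt to check anything else (for example -D or -Olevel placement).

```python
import shlex

GLOBAL_OPTIONS = {"-H", "-L", "-P"}  # only valid before the starting path
TERMINATORS = {
    "-exec": {";", "+"},
    "-execdir": {";", "+"},
    "-ok": {";"},
    "-okdir": {";"},
}


def obvious_problems(cmd):
    """Return structural problems that can be spotted without running the command."""
    tokens = shlex.split(cmd)[1:]  # drop 'find'; shlex turns '\;' into ';'
    problems = []
    past_options = False
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in GLOBAL_OPTIONS:
            if past_options:
                problems.append(f"{tok} appears after the starting path")
        else:
            past_options = True
        if tok in TERMINATORS:
            rest = tokens[i + 1:]
            end = next((j for j, t in enumerate(rest) if t in TERMINATORS[tok]), None)
            if end is None:
                problems.append(f"{tok} is never terminated")
                break
            if end == 0:
                problems.append(f"{tok} has no command to run")
            i += end + 1  # jump onto the terminator, skipping the command body
        i += 1
    return problems


for cmd in [
    'find Folder -mmin Number -L -or',
    'find Folder -execdir -nogroup -user Number',
    r'find Path -type f -exec bzip2 {} \;',
]:
    print(cmd, '->', obvious_problems(cmd) or 'no obvious problem')
```

A filter like this would only shrink the candidate list; commands that pass it still need to go through validation.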

In [10]:
with open('generic_find_cmds.txt', 'w') as fp:
    fp.write("\n".join(find_cmds))

## Replacement of Arguments in Generic Commands

In [11]:
replace(rep_path='find_replacement_map.json', in_path='generic_find_cmds.txt', out_path='replaced_find_cmds_2.txt')

In [13]:
with open('replaced_find_cmds_2.txt') as fp:
    cmds = fp.read().split("\n")
cmds[:10]

['find . -fprint temp.txt -d',
 'find . -links 1 -noignore_readdir_race',
 "find . -exec -ipath '*txt' -lname '*txt'",
 'find . -readable -fprint temp.txt -false',
 "find . -d -ilname '*txt'",
 "find . -newer 1 -okdir -ilname '*txt'",
 'find . -o -d -noignore_readdir_race',
 'find . -mount -user 1 -mtime 1',
 "find . -xdev -context '*txt' -ctime 1",
 "find . -H -false -ilname '*txt'"]

## Validation of the Find Commands

Our generator includes the "validate_commands()" function, which runs every generated command in a given file at the command line and decides whether each one is valid based on the exit status returned. The validation step has to work correctly so that we keep as many genuinely valid commands as possible.

Here we run some of our generated commands through the validation pipeline and observe how they perform.
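The implementation of "validate_commands()" is not shown in this notebook, so the following is only a sketch of exit-status validation under stated assumptions. The names `exit_status` and `split_valid` are hypothetical. Each command runs without a shell (so pipes such as '| xargs' are not supported), inside a temporary scratch directory so that actions like -delete or -fprint cannot touch real files, with stdin closed so that -ok cannot wait for input, and with a timeout. The demo uses the Python interpreter as a stand-in command so that it runs on any machine.

```python
import shlex
import subprocess
import sys
import tempfile


def exit_status(cmd, cwd, timeout=10):
    """Run one command without a shell; return its exit status, or None if it could not run."""
    try:
        argv = shlex.split(cmd)
    except ValueError:
        return None  # e.g. unbalanced quotes
    if not argv:
        return None
    try:
        done = subprocess.run(
            argv,
            cwd=cwd,
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None  # missing executable, or the command hung
    return done.returncode


def split_valid(cmds):
    """Partition commands into (valid, invalid) by whether they exit with status 0."""
    valid, invalid = [], []
    with tempfile.TemporaryDirectory() as scratch:
        for cmd in cmds:
            (valid if exit_status(cmd, scratch) == 0 else invalid).append(cmd)
    return valid, invalid


# Stand-in commands: one exits with status 0, the other with status 1.
demo = [
    shlex.join([sys.executable, "-c", "raise SystemExit(0)"]),
    shlex.join([sys.executable, "-c", "raise SystemExit(1)"]),
]
valid, invalid = split_valid(demo)
print(len(valid), "valid,", len(invalid), "invalid")
```

One caveat with this approach: find can also exit non-zero for runtime reasons such as an unreadable directory, so a non-zero status does not always mean the syntax was wrong.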

In [30]:
import subprocess

subprocess.check_output(cmds[1], shell=True)

CalledProcessError: Command 'find . -print -H -xdevs' returned non-zero exit status 1.

In [14]:
with open('verified_find2.txt') as fp:
    txt = fp.read()
    
txt = txt.split('\n')
len(txt)

25491

In [15]:
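def make_generic(rep_path, in_file, out_file):
    """Rewrite each command in in_file, mapping replacement values back to their map keys."""
    with open(rep_path, 'r') as fp:
        reps = json.load(fp)

    # Invert the replacement map (key -> value) into (value -> key).
    # If several keys share the same value, the last one loaded wins.
    d = {value: key for key, value in reps.items()}

    with open(in_file) as fp:
        cmds = fp.read().split('\n')

    # Swap every space-separated token that matches a known value;
    # all other tokens are kept unchanged.
    ret = [" ".join(d.get(val, val) for val in cmd.split(" ")) for cmd in cmds]

    with open(out_file, 'w') as fp:
        fp.write("\n".join(ret))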

In [16]:
make_generic('find_replacement_map.json', 'verified_find2.txt', 'verified_generic_find_cmds2.txt')
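Inverting the replacement map is only lossless if every value in it is unique. The sketch below shows one way to check for collisions before calling make_generic. The map contents here are hypothetical (the real find_replacement_map.json is not shown in this notebook), and `ambiguous_values` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical replacement map, for illustration only.
reps = {
    "Folder": ".",
    "File": "temp.txt",
    "Regex": "'*txt'",
    "Number": "1",
    "Timespan": "1",
}


def ambiguous_values(reps):
    """Return the values shared by more than one key, with the keys that share them."""
    by_value = defaultdict(list)
    for key, value in reps.items():
        by_value[value].append(key)
    return {value: keys for value, keys in by_value.items() if len(keys) > 1}


print(ambiguous_values(reps))  # {'1': ['Number', 'Timespan']}
```

With a map like this one, every '1' would be rewritten to whichever key was inverted last, regardless of which key produced it originally.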

In [18]:
ret = []
with open('verified_generic_find_cmds2.txt') as fp:
    cmds = fp.read().split('\n')
    for cmd in cmds:
        cmd = cmd.replace("Folder", "Path")
        ret.append(cmd)
ret = "\n".join(ret)

with open('verified_generic_find_cmds2.txt', 'w') as fp:
    fp.write(ret)

In [1]:
with open('generic_training.txt') as fp:
    txt = fp.read()

In [7]:
ret = []
for cmd in txt.split('\n'):
    ret.append(cmd.replace('Directory', 'Path'))
ret

['find Path -print -nouser -help',
 'find Path -inum Timespan -samefile File -xdev',
 'find Path -used Timespan -fprint0 File -xdev',
 'find Path -ls -noignore_readdir_race -size Timespan',
 'find Path -depth -used Timespan -ilname Regex',
 'find Path -version -lname Regex -empty',
 'find Path -version -depth -samefile File',
 'find Path -readable -inum Timespan -group Timespan',
 'find Path -fls File -cmin Timespan -true',
 'find Path -print -nouser -samefile File',
 'find Path -ls -fprint0 File -amin Timespan',
 'find Path -version -inum Timespan -wholename Regex',
 'find Path -ignore_readdir_race -false',
 'find Path -nogroup -executable -fprint0 File',
 'find Path -name Regex -ctime Timespan -atime Timespan',
 'find Path -mtime Timespan -delete -print0',
 'find Path -ignore_readdir_race -ilname Regex -size Timespan',
 'find Path -iregex Regex -inum Timespan -ctime Timespan',
 'find Path -print -empty -executable',
 'find Path -print -quit -mmin Timespan',
 'find Path -version -dept

In [8]:
with open('generic_training2.txt', 'w') as fp:
    fp.write('\n'.join(ret))