# Bash Generation: Piped Commands

In this notebook, I explore chaining generated bash commands together with pipes.

## Exploring the training data

First, I look at how piped commands are used in the training data, looking for patterns that can guide how we generate piped commands.

In [1]:
with open('cmds_proccess_train.txt') as fp:
    txt = fp.read()
    
piped_cmds = [cmd for cmd in txt.split('\n') if '|' in cmd]
piped_cmds[:10]

['find Path -type f -print0 | xargs -r -0 -I {} grep -F Regex {}',
 'find Path -name Regex | xargs -I {} grep -r Regex {}',
 'zcat Regex | grep -i Regex',
 'zcat Regex | head -n Quantity',
 'fold File | wc -l',
 'find Path -print0 | xargs -0 -I {} echo {}',
 'cd $( find Path -name Regex | xargs -I {} dirname {} )',
 'set | grep Regex',
 'who | wc -l',
 'ls -t -p | grep -v Regex | tail -n +Quantity | xargs -I {} rm -- {}']

## A deeper look
Let's look a little deeper, particularly at which utilities are typically used before and after the first pipe, and at the average number of pipes per piped command.

In [2]:
num = m = 0
for cmd in piped_cmds:
    num += cmd.count('|')
    m = max(m, cmd.count('|'))
print(f"Average number of pipes in piped commands is {num / len(piped_cmds)}")
print(f"Maximum number of pipes in piped commands is {m}")
print(f"Total number of piped commands in training data is {len(piped_cmds)}")

Average number of pipes in piped commands is 1.454713493530499
Maximum number of pipes in piped commands is 7
Total number of piped commands in training data is 3246
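
The average and maximum do not show how the pipe counts are distributed. As a minimal sketch, using a few commands copied from the sample output above in place of the full `piped_cmds` list (the helper name `pipe_histogram` is made up for illustration), the commands can be tallied by number of pipes:

```python
from collections import Counter

# A few commands copied from the sample output above; in the notebook
# this would be the full piped_cmds list.
sample_cmds = [
    'zcat Regex | grep -i Regex',
    'who | wc -l',
    'find Path -print0 | xargs -0 -I {} echo {}',
    'ls -t -p | grep -v Regex | tail -n +Quantity | xargs -I {} rm -- {}',
]

def pipe_histogram(cmds):
    """Map number of '|' characters -> number of commands with that count."""
    return Counter(cmd.count('|') for cmd in cmds)

hist = pipe_histogram(sample_cmds)
for n_pipes, n_cmds in sorted(hist.items()):
    print(f"{n_pipes} pipe(s): {n_cmds} command(s)")
```

Note that `cmd.count('|')` also counts `||` and any `|` inside quoted arguments, so these figures are an upper bound on the number of true pipeline stages.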


In [3]:
from collections import Counter

# Leading utility on each side of the first pipe.
first_uts = [cmd.split('|')[0].strip().split(' ')[0] for cmd in piped_cmds]
second_uts = [cmd.split('|')[1].strip().split(' ')[0] for cmd in piped_cmds]
common_pairings = list(zip(first_uts, second_uts))

pre_uts = Counter(first_uts)
post_uts = Counter(second_uts)
pairs = Counter(common_pairings)

print("Most common utilities before first pipe: \n", pre_uts.most_common(5), "\n")
print("Most common utilities after first pipe: \n", post_uts.most_common(5), "\n")
print("Most common utility pairs:")
for pair in pairs.most_common(10):
    print(pair)


Most common utilities before first pipe: 
 [('find', 1721), ('echo', 223), ('cat', 160), ('ls', 99), ('grep', 56)] 

Most common utilities after first pipe: 
 [('xargs', 1037), ('grep', 443), ('awk', 254), ('sort', 253), ('sed', 218)] 

Most common utility pairs:
(('find', 'xargs'), 956)
(('find', 'grep'), 164)
(('find', 'sort'), 149)
(('find', 'wc'), 97)
(('find', 'sed'), 80)
(('find', 'awk'), 77)
(('find', 'head'), 43)
(('ifconfig', 'grep'), 36)
(('find', 'cpio'), 35)
(('echo', 'tee'), 26)


## Initial Observations

It looks like find is by far the most common utility before the first pipe, and xargs, grep, and sort are the utilities that most often follow it, so it makes sense to start with these utilities when creating our pipes. Similar to the rest of the training data, find makes up over half of the piped commands (1,721 of 3,246).
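
To put numbers on these observations, the shares can be worked out directly from the counts printed above. This is only arithmetic on the reported figures; the variable names are chosen here for illustration:

```python
# Counts copied from the outputs above.
total_piped = 3246   # piped commands in the training data
find_first = 1721    # commands with find before the first pipe
find_xargs = 956     # commands starting with find | xargs

find_share = find_first / total_piped
find_xargs_share = find_xargs / total_piped

print(f"find before the first pipe: {find_share:.1%}")
print(f"find | xargs pairs:         {find_xargs_share:.1%}")
```

So find leads roughly 53% of the piped commands, and the find and xargs pairing alone accounts for a little under 30% of them.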

## Command Generation

Our first approach is to generate two separate lists of commands, one for find and one for the other utilities (starting with xargs, grep, and sort), and then join every combination from the two lists with a pipe in between.

In [4]:
from scraper import WebScraper

pipe_scraper = WebScraper(utilities=['find', 'xargs', 'grep', 'sort'])

pipe_scraper.extract_utilities()
pipe_scraper.save_json(map_path='pipe_scrape.json')

find
syntax not found for find
xargs
grep
sort


In [5]:
from generator import Generator

pipe_gen = Generator(map_path='pipe_scrape.json', utilities=['find', 'xargs', 'grep', 'sort'])
pre_pipe_lst = pipe_gen.generate_commands('find')
post_pipe_lst = pipe_gen.generate_commands(['xargs', 'grep', 'sort'])
post_pipe_lst[:10]

['xargs -a File -e Regex -l Number',
 'xargs --show-limits -I -i Regex',
 'xargs -I',
 'xargs -t -0 -p',
 'xargs -a File -e Regex -n Number',
 'xargs -0 -p --delimiter Regex',
 'xargs --delimiter Regex -I',
 'xargs --show-limits -e Regex -E',
 'xargs --show-limits --version -I',
 'xargs -n Number -r -L']

## Verifying Commands

So now we have two lists: commands to put before the pipe and commands to put after it. To reduce the load on verification of the piped commands, we first validate the commands before and after the pipe to see if they run on their own.
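
The notebook does not show the verifier itself, so the following is only a rough sketch of what a standalone check might look like. It assumes a POSIX shell is available and that the commands have already had their generic placeholders swapped for concrete values. The helper `runs_ok` and the toy command list are hypothetical.

```python
import subprocess

def runs_ok(cmd, timeout=5):
    """Return True if cmd exits with status 0 within the timeout.

    Hypothetical helper, not the notebook's verifier. Generated commands
    are untrusted, so only run this inside a disposable sandbox or VM.
    """
    try:
        result = subprocess.run(
            cmd,
            shell=True,                  # let the shell interpret pipes
            stdin=subprocess.DEVNULL,    # avoid blocking on interactive input
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0

# Harmless toy commands standing in for generated ones.
candidates = ['true', 'false', 'echo hello | wc -l']
verified = [cmd for cmd in candidates if runs_ok(cmd)]
print(verified)
```

Checking the two halves separately means each list is verified once, rather than verifying every pre and post combination.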

As we have already run the validation process for find commands, we can simply extract these and use them for our piped command generation instead.

In [6]:
with open('verified_generic_find_cmds2.txt') as fp:
    txt = fp.read()
    
pre_pipe_lst = txt.split('\n')
pre_pipe_lst[:3]

['find Path -fprint File -d',
 'find Path -links Size -noignore_readdir_race',
 'find Path -readable -fprint File -false']

## Mix and Match: Creating the Piped Commands

This part is pretty straightforward: take every command in the pre-command list, match it with every command in the post-command list, and add all the combinations to one main list. However, since the full cross product would create an outrageous number of commands, we will start with at most 90,000 commands: 300 sampled pre commands, each matched with 300 sampled post commands.

In [14]:
import random
piped_cmds = set()

for pre_cmd in random.sample(pre_pipe_lst, 300):
    for post_cmd in random.sample(post_pipe_lst, 300):
        piped_cmds.add(" | ".join([pre_cmd, post_cmd]))

list(piped_cmds)[:6]

['find Path -wholename Regex -ilname Regex -noleaf | grep --line-buffered -Z --help Number File',
 'find Path -mount -iname Regex -print0 | grep -d Number -v --no-messages Number Number File',
 'find Path -xdev -amin Size -writable | grep --no-messages Number -H --exclude-from File Number File',
 'find Path -samefile File -links Size -size Size | sort -d -R -m File',
 'find Path -daystart -ipath Regex -quit | grep -C Number -x -l Number File',
 'find Path -cmin Size -prune -nouser | grep --color Number -l -c Number File']
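
The loop above draws a fresh sample of post commands for every pre command. If a fixed grid of pre and post commands is preferred, for example to make the run reproducible, one possible variant samples each side once with a seeded generator and pairs them with `itertools.product`. The function name `pair_commands`, the seed, and the toy inputs below are assumptions for illustration:

```python
import itertools
import random

def pair_commands(pre_cmds, post_cmds, n_pre, n_post, seed=0):
    """Sample each side once, then join every (pre, post) pair with a pipe."""
    rng = random.Random(seed)  # seeded so the same sample is drawn each run
    pre_sample = rng.sample(pre_cmds, n_pre)
    post_sample = rng.sample(post_cmds, n_post)
    return {
        f"{pre} | {post}"
        for pre, post in itertools.product(pre_sample, post_sample)
    }

# Toy inputs copied from earlier outputs in place of the full lists.
pre = [
    'find Path -fprint File -d',
    'find Path -links Size -noignore_readdir_race',
    'find Path -readable -fprint File -false',
]
post = ['xargs -I', 'xargs -t -0 -p', 'xargs -n Number -r -L']

pairs = pair_commands(pre, post, n_pre=2, n_post=2)
print(len(pairs))
```

With 300 commands sampled on each side, this yields up to 300 × 300 = 90,000 distinct piped commands.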

## Replacement

In [15]:
with open('pipe_generic.txt', 'w') as fp:
    fp.write("\n".join(piped_cmds))

from generator import replace

replace(rep_path='rep_map.json', in_path='pipe_generic.txt', out_path='replaced_pipe.txt')
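
The `replace` function and the contents of `rep_map.json` are not shown in this notebook. As a hedged illustration of the general idea, swapping generic placeholders such as `Path` and `Regex` for concrete values so that commands can actually be run, here is a minimal token-level sketch. The function `replace_tokens` and the mapping values are hypothetical and not taken from `generator` or `rep_map.json`:

```python
def replace_tokens(cmd, rep_map, reverse=False):
    """Swap whole space-separated tokens using rep_map.

    Hypothetical stand-in for generator.replace; the real function may
    work differently. With reverse=True the mapping is inverted.
    """
    mapping = {v: k for k, v in rep_map.items()} if reverse else rep_map
    return " ".join(mapping.get(tok, tok) for tok in cmd.split(" "))

# Illustrative values only; not taken from rep_map.json.
rep_map = {"Path": "/tmp", "Regex": "'*.txt'", "File": "out.txt"}

generic = "find Path -name Regex | xargs -I {} grep -r Regex {}"
concrete = replace_tokens(generic, rep_map)
print(concrete)

# Reversing the mapping recovers the generic form.
restored = replace_tokens(concrete, rep_map, reverse=True)
print(restored == generic)
```

In a sketch like this, the reverse direction is only unambiguous when no two placeholders map to the same concrete value.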

And after running verification on the virtual machine . . .

In [16]:
with open('verified_pipe_find.txt') as fp:
    txt = fp.read()
    
piped = txt.split("\n")
len(piped)

21101

## Reverse Replacement

In [2]:
from generator import replace
replace(rep_path='rep_map.json', in_path='verified_pipe_find.txt', out_path='verified_pipe_generic.txt', reverse=True)

In [3]:
with open('verified_pipe_generic.txt') as fp:
    txt = fp.read()
    
piped = txt.split("\n")
piped[:5]

['find Path -wholename Regex -ilname Regex -noleaf | grep --line-buffered -Z --help Timespan File',
 'find Path -samefile File -links Timespan -size Timespan | sort -d -R -m File',
 'find Path -nouser -help -ipath Regex | grep -E -o --help Timespan File',
 'find Path -print -mtime Timespan -atime Timespan | grep -L --help -T Timespan File',
 'find Path -used Timespan -nogroup -fprint0 File | xargs -p -I -i Regex']