# Command Line Automation in Python

## IPython Shell Commands

In [2]:
from random import choices

days = ['Mo', 'Tu', 'We', 'Th', 'Fr']

# Print a random day of the week
print(choices(days))

['Fr']


In [3]:
!python3 -c "from random import choices;days = ['Mo', 'Tu', 'We', 'Th', 'Fr'];print(choices(days))"

['Mo']


`-c` is used to execute a program passed as a string.

In [4]:
ls

README.md                      [34mmy_package[m[m/
command-line-automation.ipynb  setup.py
[34mdata[m[m/                          software-engineering.ipynb
data-engineering.ipynb         spark-script.py
data-pipeline.ipynb            stable-req.txt


### Storing a variable from a shell command in IPython

In [1]:
# Store a variable from a shell command in IPython
var = !ls -h *.ipynb

# Print the resulting variable
print(var)

# Print the number of .ipynb files in the current directory
len(var)

['command-line-automation.ipynb', 'data-engineering.ipynb', 'data-pipeline.ipynb', 'software-engineering.ipynb']


4

In [10]:
type(var)

IPython.utils.text.SList

In [9]:
!ls */*.csv

data/athlete_events.csv data/flights.csv        data/noc_regions.csv


We can also use `%%bash` magic syntax to capture the output of a script in IPython. For instance, below code runs a code block with output stored in the variable `output`.

In [13]:
%%bash --out output --err error
ls

In [14]:
print(output)

README.md
command-line-automation.ipynb
data
data-engineering.ipynb
data-pipeline.ipynb
my_package
setup.py
software-engineering.ipynb
spark-script.py
stable-req.txt



In [14]:
print(error)




> One good use case is needing to download machine learning training data using `wget`, then uncompressing it.

### AWK

`awk` is a scripting language used for manipulating data, generating reports and text processing. 

AWK Operations:
- Scans a file line by line
- Splits each input line into fields
- Compares input line/fields to pattern
- Performs action(s) on matched lines

Useful For:
- Transform data files
- Produce formatted reports

> Awk is a tool that is used often on the Unix command line because it understands how to deal with whitespace delimited output from shell commands. The awk command works well at grabbing fields from a string.

In [21]:
# Sum the file sizes
!ls -l | awk '{SUM+=$5} END {print SUM}'

# -l: use a long listing format

111923


### Automation with SList Data Type

`SList` is an IPython data type, which enables a user to perform powerful operations on shell commands.

In [2]:
type(var)

IPython.utils.text.SList

Three main methods of the SList are:
- fields
- grep
- sort

Other methods are shown below.

In [7]:
print(dir(var))

['_SList__nlstr', '_SList__paths', '_SList__spstr', '__add__', '__class__', '__contains__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__iadd__', '__imul__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__rmul__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'append', 'clear', 'copy', 'count', 'extend', 'fields', 'get_list', 'get_nlstr', 'get_paths', 'get_spstr', 'grep', 'index', 'insert', 'l', 'list', 'n', 'nlstr', 'p', 'paths', 'pop', 'remove', 'reverse', 's', 'sort', 'spstr']


The `df` command is a command line utility for reporting file system disk space usage. It can be used to show the free space on a Unix or Linux computer and to understand the filesystems that have been mounted. 

In [None]:
disk_space = !df -h

# Print the total size of the mounted volumes
disk_space.fields(1)

In [4]:
ls = !ls

# Find the files with ".py" in them
ls.grep('.py')

['command-line-automation.ipynb',
 'data-engineering.ipynb',
 'data-pipeline.ipynb',
 'setup.py',
 'software-engineering.ipynb',
 'spark-script.py']

In [5]:
import os

# Find the files with ".py" in them
result = ls.grep('.py')

# Extract the filenames
for res in result:
	filename = res.split()[-1]
    
	# Create the full path
	fullpath = os.path.join('root', filename)
	print(f"fullpath of the file: {fullpath}")

fullpath of the file: root/command-line-automation.ipynb
fullpath of the file: root/data-engineering.ipynb
fullpath of the file: root/data-pipeline.ipynb
fullpath of the file: root/setup.py
fullpath of the file: root/software-engineering.ipynb
fullpath of the file: root/spark-script.py


## Executing Shell Commands in Python with Subprocess

The [subprocess](https://docs.python.org/3/library/subprocess.html#module-subprocess) module allows to spawn new processes, connect to their input/output/error pipes, and obtain their return codes. The recommended approach to invoking subprocesses is to use the `run()` function for all use cases it can handle. Note that, for more advanced use cases, the underlying `Popen` interface can be used directly.

We can run shell commands using `subprocess.run()` using Python 3.5+. It takes a list of strings, runs the command described its args, waits for command to complete, then returns a `CompletedProcess` instance.

> If `shell` is `True`, the specified command will be executed through the shell. This can be useful if you are using Python primarily for the enhanced control flow it offers over most system shells and still want convenient access to other shell features such as shell pipes, filename wildcards, environment variable expansion, and expansion of ~ to a user’s home directory. However, note that Python itself offers implementations of many shell-like features (in particular, glob, fnmatch, os.walk(), os.path.expandvars(), os.path.expanduser(), and shutil).

_Note: It may be unsecure to use `shell=TRUE` in production._

In [11]:
import subprocess

out = subprocess.run(["ls", "-l"])

print(out)

CompletedProcess(args=['ls', '-l'], returncode=0)


In Unix sytems, successful comletion returns 0, whereas unsuccessful commands return non-zero values.

In [13]:
!echo $?

0


In [12]:
out.returncode

0

In [16]:
bad_out = subprocess.run(["ls", "--asdf"])

print(bad_out.returncode)

1


As an example, we can `touch` a file using the `subprocess` module and then inspect the permissions on the file that was created. (`os.stat` gives us useful metadata about files)

The `touch` command is used in UNIX/Linux operating system to create, change and modify timestamps of a file. Basically, there are two different commands to create a file in the Linux system which is as follows:

- `cat` command: It is used to create the file with content.
- `touch` command: It is used to create a file without any content. The file created using touch command is empty. [ref](https://www.geeksforgeeks.org/touch-command-in-linux-with-examples/)

In [27]:
import os
import subprocess

# Setup
file_location = "data/dp/tmp.txt"
uid = 100

# Touch a file
proc = subprocess.Popen(["touch", file_location])

# Check user permissions
stat = os.stat(file_location)
if stat.st_uid == 100:
    print(f"File System exported properly: {uid} == {stat.st_uid}")
else:
    print(f"File System NOT exported properly: {uid} != 300")

File System NOT exported properly: 100 != 300


> `subprocess.run` was added in Python 3.5 as a simplification over subprocess. `Popen` when you just want to execute a command and wait until it finishes, but you don't want to do anything else meanwhile. For other cases, you still need to use `subprocess.Popen`. [stackoverflow](https://stackoverflow.com/questions/39187886/what-is-the-difference-between-subprocess-popen-and-subprocess-run/39187984)

Another example to safely run two Unix commands in `subprocess.Popen`:

- The Unix command `head` will read the first few lines and `wc -w` will count the total number of words.

- Passing `stdout=subprocess.PIPE` into `Popen` captures the output of `wc`.

- `stdout` captures output of command.

- `stdout.read()` returns output as a string. 

- `stdout.readlines()` returns outputs as an interator.

- `shell=False` is default and recommended.


In [30]:
# Execute Unix command `head` safely as items in a list
with subprocess.Popen(["head", "data/dp/tmp.txt"], stdout=subprocess.PIPE) as head:
  
    # Print each line of list returned by `stdout.readlines()`
    for line in head.stdout.readlines():
        print(line)

b'The subprocess module allows you to spawn new processes, connect to their input/output/error pipes, and obtain their return codes. \n'
b'This module intends to replace several older modules and functions:\n'
b'\n'
b'os.system\n'
b'os.spawn*\n'
b'\n'
b'Information about how the subprocess module can be used to replace these modules and functions can be found in the following sections.\n'
b'\n'
b'See also PEP 324 \xe2\x80\x93 PEP proposing the subprocess module\n'
b'Using the subprocess Module\n'


In [32]:
# Execute Unix command `wc -w` safely as items in a list
with subprocess.Popen(["wc", "-w", "data/dp/tmp.txt"], stdout=subprocess.PIPE) as word_count:
  
    # Print the string output of standard out of `wc -w`
    print(word_count.stdout.read())

b'     122 data/dp/tmp.txt\n'


### Capturing the Output

In [38]:
from subprocess import Popen, PIPE
import json
import pprint

# Use the with context manager to run subprocess.Popen()
with Popen(["pip","list","--format=json"], stdout=PIPE) as proc:
    # Pipe the output of subprocess.Popen() to stdout
    result = proc.stdout.readlines()

# Convert the JSON payload to a Python dictionary
# JSON is a datastructure similar to a Python dictionary
converted_result = json.loads(result[0])

# Display the result in the IPython terminal (nicely formatted)
pprint.pprint(converted_result[1:3], compact=True)

[{'name': 'alabaster', 'version': '0.7.12'},
 {'name': 'alembic', 'version': '1.3.1'}]


We can catch a process as it timed out `using proc.kill()` when the TimeoutExpired exception was triggered.

In [47]:
# Start a long running process using subprocess.Popen()
proc = Popen(["sleep", "4"], stdout=PIPE, stderr=PIPE)

# Use subprocess.communicate() to create a timeout 
try:
    output, error = proc.communicate(timeout=3)
    
except subprocess.TimeoutExpired:

	# Cleanup the process if it takes longer than the timeout
    proc.kill()
    
    # Read standard out and standard error streams and print
    output, error = proc.communicate()
    print(f"Process timed out with output: {output}, error: {error}")

Process timed out with output: b'', error: b''


Finding duplicate files:

In [58]:
checksums = {}
duplicates = []

files = ['data/dp/tmp.txt', 'data/dp/tmp2.txt']

# Iterate over the list of files filenames
for filename in files:
  	# Use Popen to call the md5/md5sum utility
    with Popen(["md5", filename], stdout=PIPE) as proc:
        checksum = proc.stdout.read().split()[3]
        
        # Append duplicate to a list if the checksum is found
        if checksum in checksums:
            duplicates.append(filename)
        
        checksums[checksum] = filename

print(f"Found Duplicates: {duplicates}")

Found Duplicates: ['data/dp/tmp2.txt']


### Sending Input

Two ways to send input to shell are:
- run
- Popen

In [59]:
# Run 'find' command to search for files
find = subprocess.Popen(
    ["find", ".", "-type", "f", "-print"], stdout=subprocess.PIPE)

# Run 'wc' and counts the number of lines
word_count = subprocess.Popen(
    ["wc", "-l"], stdin=find.stdout, stdout=subprocess.PIPE)

# Print the decoded and formatted output
output = word_count.stdout.read()
print(output.decode("utf-8").strip())

99


### Security Issues

Security best practices for subprocesses:

- Always use `shell=False` (`shell=True` allows arbitrary code)
- Use `shlex` module to sanitize strings (when needed)
- Reduce complexity

Example: Using a Python list to safely pass arguments into the Unix `find` command to find all of the directories.

In [None]:
# Accepts user input
print("Enter a path to search for directories: \n")
user_input = "."
print(f"directory to process: {user_input}")

#Pass safe user input into subprocess
with subprocess.Popen(["find", user_input, "-type", "d"], stdout=subprocess.PIPE) as find:
    result = find.stdout.readlines()
    
    #Process each line and decode it and strip it
    for line in result:
        formatted_line = line.decode("utf-8").strip()
        print(f"Found Directory: {formatted_line}")

`shlex` example: Getting the total storage of a list of directories.

> We use the `shlex.split` command to create a safely run Unix tool that calculates disk usage. The key difference in `shlex.split` is that it can **safely quote unix strings** and prevent attack vectors versus a regular string split method that doesn't have this capability.

In [65]:
import shlex

print("Enter a list of directories to calculate storage total: \n")
user_input = "data my_package"

# Sanitize the user input
sanitized_user_input = shlex.split(user_input)
print(f"raw_user_input: {user_input} |  sanitized_user_input: {sanitized_user_input}")

# Safely Extend the command with sanitized input
cmd = ["du", "-sh", "--total"]
cmd.extend(sanitized_user_input)
print(f"cmd: {cmd}")

# Print the totals out
disk_total = subprocess.run(cmd, stdout=subprocess.PIPE)
print(disk_total.stdout.decode("utf-8"))

Enter a list of directories to calculate storage total: 

raw_user_input: data my_package |  sanitized_user_input: ['data', 'my_package']
cmd: ['du', '-sh', '--total', 'data', 'my_package']

