Replacing Bash Scripting with Python

If I didn't cover something you want to know about or you find another problem, please open an issue on GitHub.

Contents

Introduction
Reading and Writing Files
Command-Line Interfaces
Filesystem Stuff
- Paths
- Replacing miscellaneous file operations: shutil
Replacing sed, grep, awk, etc: Python regex
Running Processes
Miscellaneous
Epilogue: Choose the right tool for the job.
- When to use Bash
- Warning Signs

Introduction

The Unix shell is one of my favorite inventions ever. It's genius, plain and simple. The idea is that the user environment is a Turing-complete, imperative programming language. It has a dead-simple model for dealing with I/O and concurrency, which are notoriously difficult in most other languages.

For problems that handle a lot of I/O and where the data can be expressed as a stream of similar objects separated by newlines to be processed concurrently through a series of filters, it's difficult to think of a more ideal language than the shell. A lot of the core parts of a Unix or Linux system are designed to express data in such formats.

This tutorial is NOT about getting rid of bash altogether! In fact, one of the main goals of the section on Command-Line Interfaces is to show how to write programs that integrate well with the process orchestration faculties of the shell.

If the Shell is so great, what's the problem?

The problem is if you want to do basically anything else, e.g. write logic, use control structures, handle data... You're going to have big problems. When Bash is coordinating external programs, it's fantastic. When it's doing any work whatsoever itself, it disintegrates into a pile of garbage.

For me, the fundamental problem with Bash and many other shell dialects is that text is identifiers and identifiers are text -- and basically everything else is also text. In some sense, this makes the shell a homoiconic language, which theoretically means it might have an interesting metaprogramming story, until you realize that it basically just amounts to running eval on strings, which is a feature in basically any interpreted language today, and one that is frequently considered harmful. The problem with eval is that it's a pretty direct path to arbitrary code execution. This is great if arbitrary code execution is actually what you're trying to accomplish (like, say, in an HTML template engine), but it's not generally what you want.

Bash basically defaults to evaling everything. This is very handy for interactive use, since it cuts down on the need for a lot of explicit syntax when all you really want to do is, say, open a file in a text editor. This is pretty darn bad in a scripting context because it turns the entire language into an injection honeypot. Yes, it is possible and not so difficult to write safe Bash once you know the tricks, but it takes extra consideration and it is easy to forget or be lazy about it. Writing three or four lines of safe Bash is easy; two-hundred is quite a bit more challenging.

Bash has other problems. The syntax that isn't native to the Bourne Shell feels really ugly and bolted-on. For example, most modern shells have arrays. Let's look at the syntax for iterating on an array, but let's take the long way there.

$ foo='this   and   that' # variable assignment
$ echo $foo
this and that
$ # Oh dear. Text inside the variable was split into arguments on
$ # whitespace, because eval all the things.
$
$ # To avoid this insane behavior, do the obvious thing: use string
$ # interpolation. :-(
$ echo "$foo"
this   and   that

What does this have to do with iterating on arrays? Unfortunately, the answer is "something."

To properly iterate on the strings inside of an array (the only thing which an array can possibly contain), you also use variable interpolation syntax.

for item in "${my_array[@]}"; do
    stuff with "$item"
done

Why would string interpolation syntax ever be used to iterate over items in an array? I have some theories, but they are only that. I could tell you, but it wouldn't make this syntax any less awful. If you're not too familiar with Bash, you may also (rightly) wonder what this @ is, or why everything is in curly braces.

The answer to all these questions is more or less that they didn't want to do anything that would break compatibility with ancient Unix shell scripts, which didn't have these features. Everything just got shoe-horned in with the weirdest syntax you can imagine. Bash actually has a lot of features of modern programming languages, but the problem is that the syntax provided to access them is completely contrary to logic and dictated by legacy concerns.

The Bash IRC channel has a very helpful bot, greybot, written by one of the more important Bash community members and experts, greycat. This bot is written in Perl. I once asked why it wasn't written in Bash, and only got one answer: "greycat wanted to remain sane."

And really, that answer should be enough. Do you want to remain sane? Do you want people who maintain your code in the future not to curse your name? Don't use Bash. Do your part in the battle against mental illness.

Ok, that was a little hyperbolic. For an opinion about when it's all right to use Bash, see: Epilogue: Choose the right tool for the job.

Why Python?

No particular reason. Perl and Ruby are also flexible, easy-to-write languages that have robust support for administrative scripting and automation. I would recommend against Perl for beginners because it has some similar issues to Bash: it was a much smaller language when it was created, and a lot of the syntax for the newer features has a bolted-on feeling [1]. However, if one knows Perl well and is comfortable with it, it's well suited to the task and is still a much saner choice for non-trivial automation scripts, and that is one of its strongest domains.

Node.js is also starting to be used for administrative stuff these days, so that could also be an option, though JavaScript has similar issues to Perl. I've been investigating the possibility of using Julia for this as well. Anyway, most interpreted languages seem to have pretty good support for this kind of thing, and you should just choose one that you like and is widely available on Linux and other *nix operating systems.

The main reason I would recommend Python is if you already know it. If you don't know anything besides Bash (or Bash and lower-level languages like C or even Java), Python is a reasonable choice for your next language. It has a lot of mature, fast third-party libraries in a lot of domains -- science, math, web, machine learning, etc. It's also generally considered easy to learn and has become a major teaching language.

The other very compelling reason to learn Python is that it is the language covered in this very compelling tutorial.

[1]	I'm referring specifically to Perl 5 here. Perl 6 is a better language, in my opinion, but suffers from a lack of adoption. https://perl6.org/

Learn Python

This tutorial isn't going to teach you the Python core language, though a few built-in features will be covered. If you need to learn it, I highly recommend the official tutorial, at least through chapter 5. Through chapter 9 would be even better, and you might as well just read the whole thing at that point.

If you're new to programming, you might try the book Introducing Python or perhaps Think Python. Dive Into Python is another popular book that is available for free online. You may see a lot of recommendations for Learn Python the Hard Way. I think this method is flawed, though I do appreciate that it was written by someone with strong opinions about correctness, which has some benefits.

This tutorial assumes Python 3.5 or higher, though it may sometimes use idioms from newer versions, and I will attempt to document when I have used an idiom which doesn't work in 3.4, which is apparently the version that ships with the latest CentOS and SLES. Use at least 3.6 if you can. It has some cool new features, and the implementation of dictionaries (Python's hash map) was also overhauled in this version of Python, which sort of undergirds the way the whole object system is implemented and therefore is a major win all around.

Basically, always try to use whatever the latest version of Python is. Do not use Python 2. It will be officially retired in 2020. That's two years. If a library hasn't been ported to Python 3 yet, it's already dead, just that its maintainers might not know it yet.

One last note about this tutorial: It doesn't explain so much. I have no desire to rewrite things that are already in the official documentation. It frequently just points to the relevant documentation for those wishing to do the kinds of tasks that Bash scripting is commonly used for.

Reading and Writing Files

If you're going to do any kind of administration or automation on a Unix system, the idea of working with files is pretty central. The great coreutils like grep, sed, awk, tr, sort, etc. are all designed to go over text files line by line and do... something with the content of that line. Any shell scripter knows that these "files" aren't always really files. Often as not, the script is really dealing with the output of another process and not a file at all. Whatever the source, the organizing principle is streams of text divided by newline characters. In Python, this is what we'd call a "file-like object."

Because the idea of working with text streams is so central to Unix programming, we start this tutorial with the basics of working with text files and will go from there to other streams you might want to work with.

One handy thing in the shell is that you never really need file handles. All you have to type to loop over lines in a file would be something like:

while read line; do
    stuff with "$line"
done < my_file.txt

(Don't use this code. You actually have to do some things with $IFS to make it safe. Don't use any of my Bash examples. Don't use Bash! The proper one is while IFS= read -r line, but that just raises more questions.)

In Python, you need to turn a path into a file object. The above loop would be something like this:

with open('my_file.txt') as my_file:
    for line in my_file:
        do_stuff_with(line.rstrip())

## the .rstrip() method is optional. It removes trailing whitespace
## from the line (including the newline character).

Let's take that apart.

The open() function returns a file object. If you just send it the path name as a string, it's going to assume it's a text file in the default system encoding (UTF-8, right?), and it is opened only for reading. You can, of course, do my_file = open('my_file.txt') as well. When you use with x as y: instead of assignment, it ensures the object is properly cleaned up when the block is exited using something called a "context manager". You can do my_file.close() manually, but the with block will ensure that happens even if you hit an error without having to write a lot of extra code.
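
To make the cleanup guarantee a bit more concrete, here is a rough sketch of what the with block saves you from writing. It creates a throwaway file in a temporary directory (an assumption made only so the example can run anywhere) and reads it twice: once with a with block and once with the manual try/finally spelling.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Throwaway file, created only so the example has something to read.
    path = os.path.join(tmp, 'my_file.txt')
    with open(path, 'w') as f:
        f.write('hello\n')

    # The with statement...
    with open(path) as my_file:
        first = my_file.readline()

    # ...does roughly what this longer spelling does by hand.
    other_file = open(path)
    try:
        first_again = other_file.readline()
    finally:
        other_file.close()  # runs even if readline() raised

# Both file objects are closed once their blocks are done.
closed_flags = (my_file.closed, other_file.closed)
print(first == first_again, closed_flags)
```

Both versions read the same line and leave the file closed afterwards. The with version is simply shorter and harder to get wrong.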

The gross thing about context managers is that they add an extra level of indentation. Here's a helper function you can use to open a context manager for something you want to be cleaned up after you loop.

def iter_with(obj):
    with obj:
        yield from obj

and then you use it like this:

for line in iter_with(open('my_file.txt')):
    do_stuff_with(line)

yield from means it's a generator function, and it's handing over control to a sub-iterator (the file object, in this case) until that iterator runs out of things to return. Don't worry if that doesn't make sense. It's a more advanced Python topic and not necessary for administrative scripting.

If you don't want to iterate on lines, which is the most memory-efficient way to deal with text files, you can slurp the entire contents of a file at once like this:

with open('my_file.txt') as my_file:
    file_text = my_file.read()
    ## or
    lines = list(my_file)
    ## or with newline characters removed
    lines = my_file.read().splitlines()

## This code wouldn't actually do what you want as written: after the
## first read, the file is at its end, so the later reads come back
## empty unless you rewind first with my_file.seek(0).

## Also note: list(my_file). Any function that takes an iterable can
## take a file object.

You can also open files for writing, like this:

with open('my_file.txt', 'w') as my_file:
    my_file.write('some text\n')
    my_file.writelines(['a\n', 'b\n', 'c\n'])
    print('another line', file=my_file)        # print adds a newline.

The second argument of open() is the mode. The default mode is 'r', which opens the file for reading text. 'w' deletes everything in the file (or creates it if it doesn't exist) and opens it for writing. You can also use the mode 'a'. This goes to the end of a file and adds text there. In shell terms, 'r' is a bit like <, 'w' is a bit like >, and 'a' is a bit like >>.
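
Here is a small sketch of those three modes in action. The file name demo.txt and the temporary directory are made up for the example, so nothing real gets clobbered.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.txt')

    # 'w' creates the file (or empties it), a bit like `>` in the shell.
    with open(path, 'w') as f:
        f.write('first\n')

    # 'a' adds to the end of the file, a bit like `>>`.
    with open(path, 'a') as f:
        f.write('second\n')

    # 'r' is the default mode, a bit like `<`.
    with open(path) as f:
        after_append = f.read().splitlines()

    # Opening with 'w' again throws away what was there before.
    with open(path, 'w') as f:
        f.write('third\n')

    with open(path) as f:
        after_truncate = f.read().splitlines()

print(after_append)
print(after_truncate)
```

The first list holds both lines, and the second holds only the line written after the file was reopened with 'w'.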

This is just the beginning of what you can do with files. If you want to know all their methods and modes, check the official tutorial's section on reading and writing files. File objects provide a lot of cool interfaces. These interfaces will come back with other "file-like objects" which will come up many times later, including in the very next section.

Command-Line Interfaces

Working with `stdin`, `stdout` and `stderr`

Unix scripting is all about filtering text streams. You have a stream that comes from lines in a file or output of a program and you pipe it through other programs. Unix has a bunch of special-purpose programs just for filtering text (some of the more popular of which are enumerated at the beginning of the previous chapter). Everyone using a *nix system has probably done something like this at one point or another:

program-that-prints-something | grep 'a pattern'

This is the "normal" way to search through the output of a program for lines containing whatever it is you're searching for. You're connecting the stdout of program-that-prints-something to the stdin of grep.

Great CLI scripts should follow the same pattern so you can incorporate them into your shell pipelines. You can, of course, write your script with its own "interactive" interface and read lines of user input one at a time:

username = input('What is your name? ')

This is fine in some cases, but it doesn't really promote the creation of reusable, multi-purpose filters. With that in mind, allow me to introduce the sys module.

The sys module has all kinds of great things as well as all kinds of things you shouldn't really be messing with. We're going to start with sys.stdin.

sys.stdin is a file-like object that, you guessed it, allows you to read from your script's stdin. In Bash you'd write:

while read line; do # <- not actually safe. Don't use bash.
    stuff with "$line"
done

In Python, that looks like this:

import sys
for line in sys.stdin:
    do_stuff_with(line) # <- we didn't remove the newline char this
                        #    time. Just mentioning it because it's a
                        #    difference between python and shell.

Naturally, you can also slurp stdin in one go, though this isn't the most Unix-y design choice and you could use up your RAM with a very large file:

text = sys.stdin.read()

As far as stdout is concerned, you can access it directly if you like, but you'll typically just use the print() function.

print("Hello, stdout.")
# ^ functionally same as:
sys.stdout.write('Hello, stdout.\n')

Anything you print can be piped to another process. Pipelines are great. For stderr, it's a similar story:

print('a logging message.', file=sys.stderr)
# or:
sys.stderr.write('a logging message.\n')

If you want more advanced logging functions, check out the logging module.

Using stdin, stdout and stderr, you can write python programs which behave as filters and integrate well into a Unix workflow.
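
As a rough sketch of such a filter, the function below upper-cases every non-blank line and reports what it skipped on the error stream. The name shout and its behavior are invented for illustration. In-memory streams from io.StringIO stand in for the real stdin, stdout and stderr so the example runs without a pipeline.

```python
import io


def shout(stdin, stdout, stderr):
    """Hypothetical filter: upper-case each line, skipping blank ones."""
    skipped = 0
    for line in stdin:
        if not line.strip():
            skipped += 1
            continue
        stdout.write(line.upper())
    # Diagnostics go to stderr so they don't pollute the pipeline.
    print('skipped {} blank line(s)'.format(skipped), file=stderr)


# In a real script you would `import sys` and call
# shout(sys.stdin, sys.stdout, sys.stderr) instead.
fake_out, fake_err = io.StringIO(), io.StringIO()
shout(io.StringIO('hello\n\nworld\n'), fake_out, fake_err)
```

Because the function only relies on the file-like interface, the same code works on real files, on the standard streams, or on test data like this.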

CLI Arguments

Arguments are passed to your program as a list which you can access using sys.argv. This is a bit like $@ in Bash, or $1 $2 $3... etc. e.g.:

for arg in "$@"; do
    stuff with "$arg"
done

looks like this in Python:

import sys
for arg in sys.argv[1:]:
    do_stuff_with(arg)

Why sys.argv[1:]? sys.argv[0] is like $0 in Bash or argv[0] in C. It's the name of the executable. Just a refresher (because you read the tutorial, right?) a_list[1:] is list-slice syntax that returns a new list starting on the second item of a_list, going through to the end.

If you want to build a more complete set of flags and arguments for a CLI program, the standard library module for that is argparse. The tutorial in that link leaves out some useful info, so here are the API docs. click is a popular and powerful third-party module for building even more advanced CLI interfaces.
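
To give a taste of argparse, here is a minimal sketch of a grep-like interface. The argument names (pattern, files, --ignore-case) are my own choices for the example, not anything the module prescribes.

```python
import argparse

parser = argparse.ArgumentParser(
    description='Print lines matching a pattern.')
parser.add_argument('pattern', help='text to search for')
parser.add_argument('files', nargs='*',
                    help='files to search (default: stdin)')
parser.add_argument('-i', '--ignore-case', action='store_true',
                    help='match case-insensitively')

# parse_args() reads sys.argv[1:] when called without arguments. An
# explicit list is passed here only so the example runs on its own.
args = parser.parse_args(['-i', 'error:', 'a.log', 'b.log'])
print(args.pattern, args.files, args.ignore_case)
```

You also get a --help flag and error messages for bad input for free, which is most of the reason to bother.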

Environment Variables and Config files

Ok, environment variables and config files aren't necessarily only part of CLI interfaces, but they are part of the user interface in general, so I stuck them here. Environment variables are in the os.environ mapping, so you get to $HOME like this:

>>> import os
>>> os.environ['HOME']
'/home/ninjaaron'

As far as config files, in Bash, you frequently just do a bunch of variable assignments inside of a file and source it. You can also just write valid python files and import them as modules or eval them... but don't do that. Arbitrary code execution in a config file is generally not what you want.

The standard library includes configparser, which is a parser for .ini files, and also a json parser. I don't really like the idea of human-edited json, but go ahead and shoot yourself in the foot if you want to. At least it's flexible.
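
A minimal configparser sketch might look like this. The section and option names are invented, and the config text is parsed from a string only to keep the example self-contained.

```python
import configparser

# Hypothetical config text. In a real script you would probably call
# config.read('/path/to/settings.ini') instead of read_string().
INI_TEXT = """
[server]
host = example.com
port = 8080
debug = no
"""

config = configparser.ConfigParser()
config.read_string(INI_TEXT)

server = config['server']
host = server['host']               # values are plain strings...
port = server.getint('port')        # ...unless you ask for a conversion
debug = server.getboolean('debug')  # understands yes/no, on/off, etc.
print(host, port, debug)
```

Unlike sourcing a file of variable assignments, nothing in the config file gets executed.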

PyYAML (a YAML parser) and toml (a TOML parser) are third-party libraries that are useful for configuration files.

Filesystem Stuff

Paths

So far, we've only seen paths as strings being passed to the open() function. You can certainly use strings for your paths, and the os and os.path modules contain a lot of portable functions for manipulating paths as strings. However, since Python 3.4, we have pathlib.Path, a portable, abstract type for dealing with file paths, which will be the focus of path manipulation in this tutorial.

>>> from pathlib import Path
>>> # make a path of the current directory
>>> p = Path()
>>> p
PosixPath('.')
>>> # iterate over directory contents
>>> for i in p.iterdir():
...     print(repr(i))
PosixPath('.git')
PosixPath('out.html')
PosixPath('README.rst')
>>> # use filename globbing
>>> for i in p.glob('*.rst'):
...     print(repr(i))
PosixPath('README.rst')
>>> # get the full path
>>> p = p.absolute()
>>> p
PosixPath('/home/ninjaaron/doc/replacing-bash-scripting-with-python')
>>> # get the basename of the file
>>> p.name
'replacing-bash-scripting-with-python'
>>> # name of the parent directory
>>> p.parent
PosixPath('/home/ninjaaron/doc')
>>> # split path into its parts.
>>> p.parts
('/', 'home', 'ninjaaron', 'doc', 'replacing-bash-scripting-with-python')
>>> # do some tests about what the path is or isn't.
>>> p.is_dir()
True
>>> p.is_file()
False
>>> # more detailed file stats.
>>> p.stat()
os.stat_result(st_mode=16877, st_ino=16124942, st_dev=2051, st_nlink=3, st_uid=1000, st_gid=100, st_size=4096, st_atime=1521557933, st_mtime=1521557860, st_ctime=1521557860)
>>> # create new child paths with slash.
>>> readme = p/'README.rst'
>>> readme
PosixPath('/home/ninjaaron/doc/replacing-bash-scripting-with-python/README.rst')
>>> # open files
>>> with readme.open() as file_handle:
...     pass
>>> # make file executable with mode bits
>>> readme.chmod(0o755)
>>> # ^ note that octal notation must be explicit.

Again, check out the pathlib.Path documentation for more info. Since pathlib came out, more and more builtin functions and functions in the standard library that take a path name as a string argument can also take a Path instance. If you find a function that doesn't, or you're on an older version of Python, you can always get a string for a path that is correct for your platform by using str(my_path). If you need a file operation that isn't provided by the Path instance, check the docs for os.path and os and see if they can help you out. In fact, os is always a good place to look if you're doing system-level stuff with permissions and UIDs and so forth.

If you're doing globbing with a Path instance, be aware that, like ZSH, ** may be used to glob recursively. It also (unlike the shell) will include hidden files (files whose names begin with a dot). Given this and the other kinds of attribute testing you can do on Path instances, it can do a lot of the kinds of stuff find can do.

>>> [p for p in Path().glob('**/*') if p.is_dir()]
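
Here is a slightly bigger sketch along the same lines, roughly equivalent to find . -type f -name '*.py'. It builds a small throwaway directory tree first (all the names in it are made up) so there is something to search.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Build a small throwaway tree to search.
    (root / 'pkg' / 'sub').mkdir(parents=True)
    (root / 'pkg' / 'a.py').write_text('')
    (root / 'pkg' / 'sub' / 'b.py').write_text('')
    (root / 'pkg' / 'notes.txt').write_text('')
    (root / '.hidden.py').write_text('')

    # Roughly: find . -type f -name '*.py'
    found = sorted(
        p.relative_to(root).as_posix()
        for p in root.glob('**/*.py')
        if p.is_file()
    )

print(found)
```

Note that the hidden file at the top level turns up in the results, as do files at every depth.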

Oh. Almost forgot. p.stat(), as you can see, returns an os.stat_result instance. One thing to be aware of is that st_mode (which holds the permission bits along with the file type) is represented as an integer, so you might need to do something like oct(p.stat().st_mode) to show what that number looks like in octal, which is how you set it with chmod in the shell.
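
If you only want the permission bits, the standard library's stat module can strip off the file-type part for you. This is a small sketch that assumes a POSIX system, with a made-up file name.

```python
import stat
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / 'script.sh'  # hypothetical file for the demo
    script.touch()
    script.chmod(0o750)
    mode = script.stat().st_mode

# st_mode also encodes the file type, so mask that off to get just
# the permission bits you would pass to chmod.
perms = stat.S_IMODE(mode)
print(oct(perms))

# stat.filemode() renders the whole thing the way `ls -l` does.
listing = stat.filemode(mode)
print(listing)
```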

Replacing miscellaneous file operations: `shutil`

There are certain file operations which are really easy in the shell, but less nice than you might think if you're using python file objects or the basic system calls in the os module. Sure, you can rename a file with os.rename(), but if you use mv in the shell, it will check if you're moving to a different file system, and if so, copy the data and delete the source -- and it can do that recursively without much fuss. shutil is the standard library module that fills in the gaps. The docstring gives a good summary: "Utility functions for copying and archiving files and directory trees."

Here's the overview:

>>> import shutil
>>> # $ mv src dest
>>> shutil.move('src', 'dest')
>>> # $ cp src dest
>>> shutil.copy2('src', 'dest')
>>> # $ cp -r src dest
>>> shutil.copytree('src', 'dest')
>>> # $ rm a_file
>>> os.remove('a_file') # ok, that's not shutil
>>> # $ rm -r a_dir
>>> shutil.rmtree('a_dir')
>>> # $ tar caf 'my_archive.tar.gz' 'my_folder'
>>> shutil.make_archive('my_archive', 'gztar', base_dir='my_folder')
>>> # ^ the '.tar.gz' extension is added to the name automatically.
>>> # $ tar xaf 'my_archive.tar.gz'
>>> shutil.unpack_archive('my_archive.tar.gz')
>>> # $ chown ninjaaron:user a_file.txt
>>> shutil.chown('a_file.txt', 'ninjaaron', 'user')
>>> # info about disk usage, a bit like `df`, but not exactly.
>>> shutil.disk_usage('.')
usage(total=123008450560, used=86878904320, free=36129546240)
>>> #  ^ sizes in bytes
>>> # $ which vi
>>> shutil.which('vi')
'/usr/bin/vi'
>>> # info about the terminal you're running in.
>>> shutil.get_terminal_size()
os.terminal_size(columns=138, lines=30)

That's the thousand-foot view of the high-level functions you'll normally be using. The module documentation is pretty good for examples, but it also has a lot of details about the functions used to implement the higher-level stuff I've shown which may or may not be interesting.

I should probably also mention os.link and os.symlink at this point. They create hard and soft links respectively (like ln and ln -s in the shell). Path instances also have a .symlink_to() method, if you want that.

Replacing `sed`, `grep`, `awk`, etc: Python regex

This section is not so much for experienced programmers who already know more or less how to use regexes for matching and string manipulation in other "normal" languages. Python is not so exceptional in this regard, though if you're used to JavaScript, Ruby, Perl, and others, you may be surprised to find that Python doesn't have regex literals. The regex functionality is all encapsulated in the re module. (The official docs have a regex HOWTO, which is a good place to start if you don't know anything about regular expressions. If you have some experience, I'd recommend going straight for the re API docs.)

This section is for people who know how to use programs like sed, grep and awk and wish to get similar results in Python, though short explanations will be provided of what those utilities are commonly used for. The intent is not that you should use Python wherever you might use one-liners with these programs in the course of normal shell usage (or in the middle of the kinds of process orchestration scripts that Bash does so well). The idea is rather that, when writing a Python script, you won't be tempted to shell out for text processing.

I admit that writing simple text filters in Python will never be as elegant as it is in Perl, since Perl was more or less created to be like a super-powered version of the sh + awk + sed. The same thing can sort of be said about awk, the original text-filtering language on Unix. The main reason to use Python for these tasks is that the project is going to scale a lot more easily when you want to do something a bit more complex.

Another thing to keep in mind is that python has built-in operations that you can use if you just need to match a string, rather than a regular expression. Simple string operations are much faster than regular expressions, though not as powerful.

Note

One thing to be aware of is that Python's regex is more like PCRE (Perl-style -- also similar to Ruby, JavaScript, etc.) than BRE or ERE that most shell utilities support. If you mostly do sed or grep without the -E option, you may want to look at the rules for Python regex (BRE is the regex dialect you know). If you're used to writing regex for awk or egrep (ERE), Python regex is more or less a superset of what you know. You still may want to look at the documentation for some of the more advanced things you can do. If you know regex from either vi/Vim or Emacs, they both use their own dialect of regex, but they are supersets of BRE, and Python's regex will have some major differences.

How to `grep`

grep is the Unix utility that goes through each line of a file, tests if it contains a certain pattern, and then prints the lines that match. If you're a programmer and you don't use grep, start using it! Retrieving matching lines in a file is easy with Python, so we'll start there.

If you don't need pattern matching (i.e. something you could do with fgrep), you don't need regex to match a substring. You can simply use built-in syntax:

>>> 'substring' in 'string containing substring'
True

Otherwise, you need the regex module to match things:

>>> import re
>>> re.search(r'a pattern', r'string containing a pattern')
<_sre.SRE_Match object; span=(18, 27), match='a pattern'>
>>> re.search(r'a pattern', r'string without the pattern')
>>> # Returns None, which isn't printed in the Python REPL

I'm not going to go into the details of the "match object" that is returned at the moment. The main thing for now is that it evaluates to True in a boolean context. You may also notice I use raw strings r''. This is to keep Python's normal escape sequences from being interpreted, since regex uses its own escapes.

So, to use these to filter through strings:

>>> ics = an_iterable_containing_strings
>>> # like fgrep
>>> filtered = (s for s in ics if substring in s)
>>> # like grep (or, more like egrep)
>>> filtered = (s for s in ics if re.search(pattern, s))

an_iterable_containing_strings here could be a list, a generator or even a file/file-like object. Anything that will give you strings when you iterate on it. I use generator expression syntax here instead of a list comprehension because that means each result is produced as needed with lazy evaluation. This will save your RAM if you're working with a large file. You can invert the result, like grep -v simply by adding not to the if clause. There are also flags you can add to do things like ignoring the case (flags=re.I), etc. Check out the docs for more.
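
For instance, here is a small sketch of the -i and -v variations. The sample lines are invented, and lists are used instead of generators only so the results are easy to inspect.

```python
import re

lines = ['ERROR: disk full\n', 'info: all good\n', 'Error: retrying\n']

# Compiling the pattern once is handy when it is reused many times.
pattern = re.compile(r'error', flags=re.I)

# like grep -i 'error'
matching = [s for s in lines if pattern.search(s)]

# like grep -v -i 'error'
non_matching = [s for s in lines if not pattern.search(s)]

print(matching)
print(non_matching)
```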

Example: searching logs for errors

Say you want to look through the log file of a certain service on your system for errors. With grep, you might do something like this:

$ grep -i error: /var/log/some_service.log

This will search through /var/log/some_service.log for any line containing the string error:, ignoring case. To do the same thing in Python:

with open('/var/log/some_service.log') as log:
    matches = (line for line in log if 'error:' in line.lower())
    # line.lower() is a substitute for -i in grep, in this case

The difference here is that the grep version will print all the lines, and the Python version is just holding on to them for further processing. Note that matches is a lazy generator reading from log, so it has to be consumed inside the with block, before the file is closed. If you want to print the lines, the next step is print(*matches, sep='', end='') or for line in matches: print(line, end=''). However, this is in the context of a script, so you probably want to extract further information from the line and do something programmatically with it anyway.
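
One way to package this up, sketched below, is a generator function that keeps the file open for exactly as long as you are iterating over the results. The function name grep_file and the sample log contents are my own inventions, and the log is written to a temporary directory so the example doesn't depend on a real service.

```python
import os
import tempfile


def grep_file(path, substring):
    """Yield lines from the file at path containing substring, ignoring case."""
    needle = substring.lower()
    with open(path) as log:
        for line in log:
            if needle in line.lower():
                yield line


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for /var/log/some_service.log
    log_path = os.path.join(tmp, 'some_service.log')
    with open(log_path, 'w') as f:
        f.write('ok: started\nERROR: disk full\nerror: retrying\n')

    # list() runs the generator to the end, which also closes the file.
    errors = list(grep_file(log_path, 'error:'))

print(errors)
```

In a real script you would more likely loop over grep_file(...) directly instead of building a list, so that only one line is in memory at a time.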

How to `sed`

sed can do a LOT of things. It's more or less a "text editor" without a window. Instead of editing text manually, you give sed instructions about changes to apply to lines, and it does it all in one shot. (The default is to print what the file would look like with modification. The file isn't actually changed unless you use a special flag.)

I'm not going to cover all of that. Back when I wrote more shell scripts and less Python, the vast majority of my uses for sed were simply to use the substitution facilities to change instances of one pattern into something else, which is what I cover here.

>>> # sed 's/a string/another string/g' -- i.e. doesn't need regex
>>> replaced = (s.replace('a string', 'another string') for s in ics)
>>> # sed 's/pattern/replacement/g' -- needs regex
>>> replaced = (re.sub(r'pattern', r'replacement', s) for s in ics)

re.sub has a lot of additional features, including the ability to use a function instead of a string for the replacement argument. I consider this to be very useful. If you're new to regex, note especially the section about backreferences in replacements. You may wish to check the section in the regex HOWTO about Search and Replace as well.
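
Here is a quick sketch of both features. The input strings and the helper name double are made up for the example.

```python
import re

# Backreferences: \1 and \2 refer to the first and second groups,
# a bit like sed -E 's/(\w+)=(\w+)/\2=\1/g'.
swapped = re.sub(r'(\w+)=(\w+)', r'\2=\1', 'user=root shell=zsh')


# A function as the replacement: it receives the match object and
# returns the text to substitute.
def double(match):
    return str(int(match.group(0)) * 2)


doubled = re.sub(r'\d+', double, 'cpu 4 mem 16')

print(swapped)
print(doubled)
```

The function form is the one to reach for when the replacement needs real logic, which is awkward to do in sed.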

How to `awk`

The sed section needed a little disclaimer. The awk section needs a bigger one. AWK is a Turing-complete text/table processing language. I'm not going to cover how to do everything AWK can do with Python idioms. [2]

However, inside of shell scripts, it's most frequently used to extract fields from tabular data, such as tsv files. Basically, it's used to split strings.

>>> # awk '{print $1}'
>>> field1 = (f[0] for f in (s.split() for s in ics))
>>> # awk -F : '{print $1}'
>>> field1 = (f[0] for f in (s.split(':') for s in ics))
>>> # awk -F '[^a-zA-Z]' '{print $1}'
>>> field1 = (f[0] for f in (re.split(r'[^a-zA-Z]', s) for s in ics))

As is implied in this example, the str.split method splits on sections of contiguous whitespace by default. Otherwise, it will split on whatever is given as a delimiter. For more on splitting with regular expressions, see re.split and Splitting Strings.
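
The difference between the two splitting behaviors is easy to miss, so here is a short sketch with invented sample data. It ends with a rough equivalent of the classic awk '{ total += $2 } END { print total }'.

```python
# No argument: split on runs of whitespace and ignore the line ending.
by_whitespace = 'alice   42\tadmin\n'.split()

# Explicit delimiter: every occurrence splits, so empty fields survive.
by_colon = 'a::b'.split(':')

# awk '{ total += $2 } END { print total }'
rows = ['alice 42\n', 'bob 8\n']
total = sum(int(s.split()[1]) for s in rows)

print(by_whitespace, by_colon, total)
```

Keep in mind that awk counts fields from 1 and Python counts from 0, so $2 becomes index 1.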

[2]	It has been pointed out to me that `sed` is also Turing complete, and it seems to be the case. However, implementing algorithms in `sed` is not nice. AWK is really a rather pleasant language.

Running Processes

Disclaimer

I come to this section at the end of the tutorial because one generally should not be running a lot of processes inside of a Python script. One common strategy in the realm of complex administrative tasks is to do the orchestration in bash and hand data handling off to Python, which is one of the reasons it's important for your program to have a good command-line interface. If you can read data from stdin and print to stdout and stderr, you're in good shape!

However, there are times when this model of separation of domains between Python and the shell is not practical, and it's easier simply to execute the external program from inside your Python script. Practicality beats purity.

Say you want to do some automation with packages on your system; you'd be nuts not to use apt or yum (spelled dnf these days) or whatever your package manager is. Same applies if you're doing mkfs or using a very mature and featureful program like rsync. My general rule is that any kind of filtering utility should be avoided, but specialized programs for manipulating the system are fair game. However, in some cases, there will be a 3rd-party Python library that provides a wrapper on the underlying C code. The library will, of course, be faster than spawning a new process in most cases. Use your best judgment. Be extra judicious if you're trying to write re-usable library code.

Another thing to keep in mind (and this goes for the shell as well, where it's just much more difficult to avoid): don't spawn processes inside of hot loops. Spawning new processes is a relatively expensive job for the operating system. Spawning one instance or even ten is no big deal (depending on the program, of course). Spawning a process thousands or millions of times in a loop, no matter how lightweight the process is, is a terrible idea. On the other hand, using an optimized C program that can do a lot of work in one shot may well be faster than trying to do the same work natively in Python (provided there is no well-supported C library for Python).

The `subprocess` Module

There are a number of functions which shall not be named in the os module that can be used to spawn processes. They have a variety of problems. Some run processes in subshells (c.f. injection vulnerabilities). Some are thin wrappers on system calls in libc, which you may want to use if you implement your own processes library, but are not particularly fun to use. Some are simply older interfaces left in for legacy reasons, which have actually been re-implemented on top of the new module you're supposed to use, subprocess. For administrative scripting, just use subprocess directly.

This tutorial focuses on using the Popen constructor and the run function, the latter of which was only added in Python 3.5. If you are using Python 3.4 or earlier, you need to use the old API, though a lot of what is said here will still be relevant.

The Popen API (over which the run function is a thin wrapper) is a very flexible, securely designed interface for running processes. Most importantly, it doesn't open a subshell by default. That's right, it's completely safe from shell injection vulnerabilities -- or, the injection vulnerabilities are opt-in. There's always the shell=True option if you're determined to write bad code.

On the other hand, it is a little cumbersome to work with, so there are a lot of third-party libraries to simplify it. Plumbum is probably the most popular of these. Sarge is also not bad. My own contribution to the field is easyproc (though the documentation needs to be completely rewritten).

There are also a couple of Python supersets that allow inlining shell commands in Python code. xonsh is one, which also provides a fully functional interactive system shell experience and is the program that runs every time I open a terminal. I highly recommend it!

Anyway, on with the show.

>>> import subprocess as sp
>>> sp.run(['ls', '-lh'])
total 104K
-rw-r--r-- 1 ninjaaron users 69K Mar 21 16:40 out.html
-rw-r--r-- 1 ninjaaron users 32K Mar 23 11:11 README.rst
CompletedProcess(args=['ls', '-lh'], returncode=0)

As you see, the first and only required argument of the run function is a list (or any other iterable) of command arguments. stdout is not captured; it just goes wherever the stdout of the script goes. What is returned is a CompletedProcess instance, which has an args attribute and a returncode attribute. More attributes may also become available when certain keyword arguments are used with run.
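
To make the attributes concrete, here is a small sketch. It uses sys.executable as a stand-in command (my choice, so the example doesn't depend on a particular utility being installed) and passes stdout=sp.PIPE, which is covered further down, to show one of the attributes that only gets filled in on request.

```python
import subprocess as sp
import sys

# sys.executable is a stand-in command so the example does not depend on
# any particular system utility being installed.
cmd = [sys.executable, '-c', 'print("hello")']

proc = sp.run(cmd, stdout=sp.PIPE)

print(proc.args)        # the argument list that was passed in
print(proc.returncode)  # exit status of the process
print(proc.stdout)      # bytes, only populated because stdout=sp.PIPE
print(proc.stderr)      # None, because stderr was not captured
```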

Dealing with Exit Codes

Unlike most other things in Python, a process that fails doesn't raise an exception by default.

>>> sp.run(['ls', '-lh', 'foo bar baz'])
ls: cannot access 'foo bar baz': No such file or directory
CompletedProcess(args=['ls', '-lh', 'foo bar baz'], returncode=2)

This is the same way it works in the shell. However, you usually are going to want your script to stop if your command didn't work, or at least try something else. You could do this manually:

>>> proc = sp.run(['ls', '-lh', 'foo bar baz'])
ls: cannot access 'foo bar baz': No such file or directory
>>> if proc.returncode != 0:
...     pass  # do something else

This would be most useful in cases where a non-zero exit code indicates something other than an error. For example, grep returns 1 if no lines were matched. Not really an error, but something you might want to check for.
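
Here is a rough sketch of what that check might look like for grep. The function name any_match is made up for this example, and it assumes a grep on the PATH that follows the usual convention of 0 for a match, 1 for no match, and something greater for real errors.

```python
import subprocess as sp


def any_match(pattern, path):
    """Return True if grep finds pattern in the file at path."""
    # -q suppresses output; -- keeps a pattern starting with '-' from
    # being mistaken for an option.
    proc = sp.run(['grep', '-q', '--', pattern, path])
    if proc.returncode == 0:
        return True
    if proc.returncode == 1:
        return False
    # Anything above 1 is a real error, such as a missing file.
    raise sp.CalledProcessError(proc.returncode, proc.args)
```

Of course, for plain pattern matching you'd normally just use the re module; the point here is only the handling of the exit code.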

However, in the majority of cases, you probably want a non-zero exit code to crash the program, especially during development. This is where you need the check parameter:

>>> sp.run(['ls', '-lh', 'foo bar baz'], check=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.6/subprocess.py", line 418, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['ls', '-lh', 'foo bar baz']' returned non-zero exit status 2.

Much better! You can also use normal Python exception handling now, if you like.
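
For instance, a minimal sketch of catching the exception and falling back to a default might look like this. The helper name run_or_fallback and the two stand-in commands are invented for the example.

```python
import subprocess as sp
import sys


def run_or_fallback(cmd, fallback):
    """Run cmd and return its output, or fallback if it exits non-zero."""
    try:
        proc = sp.run(cmd, stdout=sp.PIPE, universal_newlines=True,
                      check=True)
    except sp.CalledProcessError as err:
        # The exception carries the exit status and the command.
        print('command failed with status', err.returncode, file=sys.stderr)
        return fallback
    return proc.stdout


# Stand-in commands: one that fails with status 2 and one that succeeds.
failing = [sys.executable, '-c', 'import sys; sys.exit(2)']
working = [sys.executable, '-c', 'print("ok")']

print(run_or_fallback(failing, 'default'))
print(run_or_fallback(working, 'default'))
```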

Redirecting process IO (i.e. pipes)

If you want to capture the output of a process, you need to use the stdout parameter. If you want to redirect it to a file, it's pretty straightforward:

>>> with open('./foo', 'w') as foofile:
...     sp.run(['ls'], stdout=foofile)

Pretty similar with input:

>>> with open('foo') as foofile:
...     sp.run(['tr', 'a-z', 'A-Z'], stdin=foofile)
...
FOO
OUT.HTML
README.RST

If you want to do something with input and output text inside the script itself, you need to use the special constant, subprocess.PIPE.

>>> proc = sp.run(['ls'], stdout=sp.PIPE)
>>> print(proc.stdout)
b'foo\nout.html\nREADME.rst\n'

What's this now? Oh, right. Streams to and from processes default to bytes, not strings. You can decode the bytes yourself, or you can use the flag that makes the stream a Python string, which, in their infinite wisdom, the authors of the subprocess module chose to call universal_newlines, as if that's the most important distinction between bytes and strings in Python. Update: as of Python 3.7, `text` is available as a more readable alias for `universal_newlines`.

>>> proc = sp.run(['ls'], stdout=sp.PIPE, universal_newlines=True)
>>> print(proc.stdout)
foo
out.html
README.rst

So that's awkward. In fact, this madness was one of my primary motivations for writing easyproc.
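
To put the two options side by side, here is a short sketch. The command is a stand-in built on sys.executable, not anything from the examples above.

```python
import subprocess as sp
import sys

# Stand-in command that prints a single line.
cmd = [sys.executable, '-c', 'print("foo")']

# Option 1: capture bytes and decode them yourself.
raw = sp.run(cmd, stdout=sp.PIPE).stdout
decoded = raw.decode()

# Option 2: ask subprocess for str. universal_newlines works on every
# version that has run(); text=True is the Python 3.7+ spelling.
as_text = sp.run(cmd, stdout=sp.PIPE, universal_newlines=True).stdout

print(type(raw).__name__, type(decoded).__name__, type(as_text).__name__)
```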

If you want to send a string to the stdin of a process, you will use a different run parameter, input (again, requires bytes unless universal_newlines=True).

>>> sp.run(['tr', 'a-z', 'A-Z'], input='foo bar baz\n')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.6/subprocess.py", line 405, in run
    stdout, stderr = process.communicate(input, timeout=timeout)
  File "/usr/lib/python3.6/subprocess.py", line 828, in communicate
    self._stdin_write(input)
  File "/usr/lib/python3.6/subprocess.py", line 781, in _stdin_write
    self.stdin.write(input)
TypeError: a bytes-like object is required, not 'str'
>>> ## Makes nothing but sense...
>>> sp.run(['tr', 'a-z', 'A-Z'], input='foo bar baz\n', universal_newlines=True)
FOO BAR BAZ
CompletedProcess(args=['tr', 'a-z', 'A-Z'], returncode=0)

The `stderr` Parameter

Just as there is an stdout parameter, there is also an stderr parameter for dealing with messages from the process. It works as expected:

>>> with open('foo.log', 'w') as logfile:
...     sp.run(['ls', 'foo bar baz'], stderr=logfile)
...
>>> sp.run(['ls', 'foo bar baz'], stderr=sp.PIPE).stderr
b"ls: cannot access 'foo bar baz': No such file or directory\n"

However, another common thing to do with stderr in administrative scripts is to combine it with stdout using the oh-so-memorable shell incantation of 2>&1. subprocess has a thing for that, too, the STDOUT constant.

>>> proc = sp.run(['ls', '.', 'foo bar baz'], stdout=sp.PIPE, stderr=sp.STDOUT)
>>> print(proc.stdout.decode())
ls: cannot access 'foo bar baz': No such file or directory
.:
foo
foo.log
out.html
README.rst

You can also redirect stdout and stderr to /dev/null with the constant subprocess.DEVNULL.
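
A minimal sketch of that, using a made-up noisy child process built on sys.executable:

```python
import subprocess as sp
import sys

# A stand-in child that writes to both streams and exits with status 3.
noisy = [sys.executable, '-c',
         'import sys; print("out"); print("err", file=sys.stderr); '
         'sys.exit(3)']

# Like `cmd >/dev/null 2>&1` in the shell: discard everything the process
# prints and look only at the exit status.
proc = sp.run(noisy, stdout=sp.DEVNULL, stderr=sp.DEVNULL)
print(proc.returncode)
```

This is handy when the exit code is the only thing you care about.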

There's a lot more you can do with the run function, but that should be enough to be getting on with.

Background Processes and Concurrency

subprocess.run starts a process, waits for it to finish, and then returns a CompletedProcess instance that has information about what happened. This is probably what you want in most cases. However, if you want processes to run in the background or need to interact with them while they continue to run, you need the Popen constructor.

If you simply want to start a process in the background while you get on with your script, it's a lot like run.

>>> ## Time for popcorn...
>>> sp.Popen(['mpv', 'Star Trek II: The Wrath of Kahn.mkv'])
<subprocess.Popen object at 0x7fc35f4c0668>
>>> ## and the script continues while we enjoy the show...

This isn't quite the same as backgrounding a process in the shell using &. I haven't looked into what happens technically, but I can tell you that the process will keep going even if the terminal it was started from is closed. It's a bit like nohup. However, if not redirected, stdout and stderr will still be printed to that terminal.

Other reasons to do this might be to kick off a process at the beginning of the script that you need output from, and then come back to it later to minimize wait-time. For example, I use a Python script to generate my ZSH prompt. Among other things, this script checks the git status of the folder. However, that can take some time and I want the script to do as much work as possible while it's waiting on those commands.

## somewhere near the top of the script:
branch_proc = sp.Popen(['git', 'branch'], stdout=sp.PIPE,
                       stderr=sp.DEVNULL, universal_newlines=True)
status_proc = sp.Popen(['git', 'status', '-s'], stdout=sp.PIPE,
                       stderr=sp.DEVNULL, universal_newlines=True)

## ... somewhere further down:

branch = [i for i in branch_proc.stdout if i.startswith('*')][0][2:-1]
color = 'red' if status_proc.stdout.read() else 'green'

Notice that stdout in this case is not a string. It's a file-like object. This is perfect for dealing with output from a program line-by-line, as many system utilities do. This is particularly important if the program produces a lot of lines of output and reading the whole thing into a Python string could potentially use up a lot of RAM. It's also useful for long-running programs that may produce output slowly, but you want to process it as it comes. e.g.:

>>> # don't actually use `find` in Python. Path.glob and os.walk
>>> # are better.
>>> with sp.Popen(['find', '/'], stdout=sp.PIPE,
...                universal_newlines=True) as proc:
...     for line in proc.stdout:
...         do_stuff_with(line)

You can also use this mechanism to pipe processes together, though the cases when you need to do this in Python should be rare, since text filtering is best done in Python itself. A case where you might want to pipe processes together could be extracting the content of an rpm package:

>>> # rpm2cpio a_package.rpm | cpio -idm
>>> r2c = sp.Popen(['rpm2cpio', 'a_package.rpm'], stdout=sp.PIPE)
>>> sp.run(['cpio', '-idm'], stdin=r2c.stdout)
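
For a slightly more careful version of the same idea, here is a sketch that also closes the parent's copy of the pipe and collects both exit codes, following the pattern in the subprocess documentation for replacing shell pipelines. The producer and consumer commands are hypothetical stand-ins built on sys.executable rather than rpm2cpio and cpio.

```python
import subprocess as sp
import sys

# Stand-ins for `producer | consumer`; both are hypothetical one-liners.
producer_cmd = [sys.executable, '-c', 'print("foo"); print("bar")']
consumer_cmd = [sys.executable, '-c',
                'import sys; sys.stdout.write(sys.stdin.read().upper())']

producer = sp.Popen(producer_cmd, stdout=sp.PIPE)
consumer = sp.Popen(consumer_cmd, stdin=producer.stdout, stdout=sp.PIPE,
                    universal_newlines=True)

# Close the parent's copy of the pipe so the producer gets SIGPIPE if the
# consumer exits early.
producer.stdout.close()

output, _ = consumer.communicate()
producer.wait()  # reap the producer so its exit status is available

print(output, producer.returncode, consumer.returncode)
```

Checking both return codes matters because, just like in the shell, a failure at the front of a pipeline is easy to miss.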

`shlex.quote`: protecting against shell injection

The subprocess module, as mentioned earlier, is safe from injection by default, unless shell=True is used. However, there are some programs that will give arguments to a shell after they are started. SSH is a classic example. Every argument you send with ssh gets parsed by a shell on the remote system.

As soon as a process gets a shell, you're giving up one of the main benefits of using Python in the first place. You get back into the realm of injection vulnerabilities.

Basically, instead of this:

>>> sp.run(['ssh', 'user@host', 'ls', path])

You need to do something like this:

>>> import shlex
>>> sp.run(['ssh', 'user@host', 'ls', shlex.quote(path)])

shlex.quote will ensure that any spaces or shell metacharacters are properly escaped. The only trouble with it is that you actually have to remember to use it.

The shlex module also has a split function which will split a string into a list the same way the shell would split arguments. This is useful if you have a string that looks like a shell command and you want to send it to subprocess.run or subprocess.Popen.
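
A short sketch of both functions follows. The nasty path is invented for the example.

```python
import shlex

# A path containing spaces and shell metacharacters (made up).
path = 'my file; rm -rf ~'

# quote wraps the string so a shell treats it as a single argument.
quoted = shlex.quote(path)
print(quoted)

# Splitting the quoted form the way a shell would gives the original
# string back as one argument.
round_trip = shlex.split('ls ' + quoted)
print(round_trip)

# split turns a command string into an argument list for run or Popen.
args = shlex.split("ls -lh 'foo bar baz'")
print(args)
```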

Miscellaneous

This is where all the stuff goes that doesn't really need detailed coverage in this tutorial, but it's something you need to do often enough in shell scripts that it deserves pointers to additional resources.

Getting the Time

In administrative scripting, one frequently wants to put a timestamp in a file name for naming logs or whatever. In a shell script, you just use the output of date for this. Python has two libraries for dealing with time, and either is good enough to handle this. The time module wraps time functions in libc. If you want to get a timestamp out of it, you do something like this:

>>> import time
>>> time.strftime('%Y.%m.%d')
'2018.08.18'

This can use most of the format specifiers you see when you run $ man date. There is also a time.strptime function which takes a string as input and uses the same kind of format string to parse the time out of it into a time.struct_time, a named tuple of time fields.
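
As a small sketch of going in the other direction with time.strptime (the timestamp string is just an example value):

```python
import time

stamp = '2018.08.18'
parsed = time.strptime(stamp, '%Y.%m.%d')

# The result is a time.struct_time with named fields.
print(parsed.tm_year, parsed.tm_mon, parsed.tm_mday)

# strftime accepts such a value, so a timestamp can be reformatted.
reformatted = time.strftime('%d/%m/%Y', parsed)
print(reformatted)
```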

The datetime module provides classes for working with time at a high level. It's a little cumbersome for very simple things, and incredibly helpful for more sophisticated things like math involving time. The one handy thing it can do for our case is to give us a string of the current time without the need for a format specifier.

>>> import datetime
>>> # get the current time as a datetime object
>>> datetime.datetime.now()
datetime.datetime(2018, 8, 18, 10, 5, 56, 518515)
>>> now = _
>>> str(now)
'2018-08-18 10:05:56.518515'
>>> now.strftime('%Y.%m.%d')
'2018.08.18'

This means that, if you're happy with the default string representation of the datetime class, you can just do str(datetime.datetime.now()) to get the current timestamp. There is also a datetime.datetime.strptime() to generate a datetime instance from a timestamp.
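
To give a taste of the "math involving time" that datetime is good at, here is a minimal sketch. The timestamp and the log file name pattern are invented for illustration.

```python
import datetime

fmt = '%Y-%m-%d %H:%M:%S'
then = datetime.datetime.strptime('2018-08-18 10:05:56', fmt)

# Adding a timedelta gives a new datetime; subtracting two datetimes
# gives a timedelta.
week_later = then + datetime.timedelta(days=7)
elapsed = week_later - then

# A hypothetical log file name built from the timestamp.
log_name = 'backup-{}.log'.format(then.strftime('%Y%m%d-%H%M%S'))

print(week_later, elapsed.days, log_name)
```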

Interprocess Communication

I'm not sure if IPC is really part of bash scripting, but sometimes administrators might need to write a daemon or whatever that runs in the background, but is still able to receive communication from the user via a client.

The simplest way to do this is with a fifo, a.k.a. a named pipe.

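import os

myfifo = '/tmp/myfifo'
os.mkfifo(myfifo)
try:
    while True:
        # open() blocks until a client opens the fifo for writing
        with open(myfifo) as fh:
            # do_something is a placeholder for your own handler
            do_something(fh.read())
finally:
    # clean up the fifo no matter how the loop ends
    os.remove(myfifo)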

That's your server that you start with your init system. The simplest client could just be echo: echo some text > /tmp/myfifo. Of course, you can do a lot more with the client if you like. The limitation of a fifo is that it's one-way communication. If you want two-way, you need two fifos. Alternatively, use a TCP socket.

Python has a dead-simple library for making a socket server, aptly named socketserver. Scroll down to the examples and they have basically everything you need to know for implementing your server and client. For a daemon that you're just interacting with over localhost, you're going to get better performance using the UnixStreamServer class, and you won't use up a port. Plus, Unix sockets will make your Unix beard grow better.
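
To give a rough idea of the shape of it, here is a minimal sketch of a Unix socket server and client. It is not taken from the socketserver documentation; the names UpperHandler, ask and demo, and the one-line "protocol", are all made up for this example. It only works on systems with Unix sockets, and the server runs in a thread purely so the whole thing fits in one script.

```python
import os
import socket
import socketserver
import tempfile
import threading


class UpperHandler(socketserver.StreamRequestHandler):
    """Hypothetical handler: read one line, reply with it upper-cased."""

    def handle(self):
        line = self.rfile.readline()
        self.wfile.write(line.upper())


def ask(sock_path, message):
    """Minimal client: send one line and return the one-line reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
        client.connect(sock_path)
        client.sendall(message.encode() + b'\n')
        with client.makefile('rb') as reply:
            return reply.readline().decode().rstrip('\n')


def demo():
    """Start the server in a thread, send it one message, shut it down."""
    with tempfile.TemporaryDirectory() as tmp:
        sock_path = os.path.join(tmp, 'demo.sock')
        with socketserver.UnixStreamServer(sock_path, UpperHandler) as server:
            thread = threading.Thread(target=server.serve_forever)
            thread.start()
            try:
                return ask(sock_path, 'hello daemon')
            finally:
                server.shutdown()
                thread.join()


print(demo())
```

A real daemon would more likely call serve_forever in its main thread and use a fixed socket path that the client knows about.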

The problem with either of these is that they just block until they get a message (unless you use the threaded socket server, which might be fine in some cases). If you want your daemon to do work while simultaneously listening for input, you need threads or asyncio. Unfortunately for you, this tutorial is about replacing Bash with Python, and I'm not about to try to teach you concurrency.

Note

I'll just say that the Python threading module is fine for IO-bound multitasking on a small scale. If you need something large-scale, use asyncio. If you need real concurrent execution, know that Python threads are a lie, and asyncio doesn't do that. You need multiprocessing. If you need concurrent execution, but processes are too expensive, use another programming language. Python has limitations in this area.

Downloading Web Pages and Files

If you're doing any kind of fancy http requests that require things like interacting with APIs, shooting data around, doing authentication, or basically anything besides downloading static assets, use requests. In fact, you should probably even use it for the simple case of downloading things. However, this is also possible with the standard library, and not particularly painful.

For that, you need urllib.request.
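
A minimal sketch of a download helper with the standard library might look like the following. The function name download is made up for the example, and error handling (HTTP errors, timeouts) is left out.

```python
import shutil
import urllib.request


def download(url, dest):
    """Stream the body of url into the file at dest and return dest."""
    with urllib.request.urlopen(url) as response, open(dest, 'wb') as out:
        # copyfileobj copies in chunks, so large files are not read into
        # memory all at once.
        shutil.copyfileobj(response, out)
    return dest
```

Usage would be something like download('https://example.com/some/file.tar.gz', 'file.tar.gz'), where the URL is only a placeholder.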

Epilogue: Choose the right tool for the job.

One of the main criticisms of this tutorial (I suspect from people who haven't read it very well) is that it goes against the philosophy of using the best tool for the job. My intention is not that people rewrite all existing Bash in Python (though sometimes rewrites might be a net gain), nor am I attempting to get people to entirely stop writing new Bash scripts.

The tutorial has also been accused of being a "commercial for Python." I would have thought the Why Python? section would show that this is not the case, but if not, let me reiterate: Python is one of many languages well suited to administrative scripting. The others also provide a safer, clearer way to deal with data than the shell. My goal is not to get people to use Python as much as it is to try to get people to stop handling data in shell scripts.

The "founding fathers" of Unix had already recognized the fundamental limitations of the Bourne shell for handling data and created AWK, a complementary, string-centric data parsing language. Modern Bash, on the other hand, has added a lot of data related features which make it possible to do many of the things you might do in AWK directly in Bash. Do not use them. They are ugly and difficult to get right. Use AWK instead, or Perl or Python or whatever.

When to use Bash

I do believe that for a program which deals primarily with starting processes and connecting their inputs and outputs, as well as certain kinds of file management tasks, the shell should still be the first candidate. A good example might be setting up a server. I keep config files for my shell environment in Git (like any sane person), and I use sh for all the setup. That's fine. In fact, it's great. Running some commands and symlinking files is a use case that fits perfectly to the strengths of the shell.

I also have shell scripts for automating certain parts of my build, testing and publishing workflow for my programming, and I will probably continue to use such scripts for a long time. (I also use Python for some of that stuff. Depends on the nature of the task.)

Warning Signs

Many people have a rule about the length of their Bash scripts. It is oft repeated on the Internet that, "If your shell script gets to fifty lines, rewrite in another language," or something similar. The number of lines varies from 10 to 20 to 50 to 100. Among the Unix old guard, "another language" is basically always Perl. I like Python because reasons, but the important thing is that it's not Bash.

This kind of rule isn't too bad. Length isn't the problem, but length can be a side-effect of complexity, and complexity is sort of the arch-enemy of Bash. I look for the use of certain features to be an indicator that it's time to consider a rewrite. (Note that "rewrite" can mean moving certain parts of the logic into another language while still doing orchestration in Bash.) These "warning signs" are listed in order of more to less serious.

If you ever need to type the characters IFS=, rewrite immediately. You're on the highway to Hell.
If data is being stored in Bash arrays, either refactor so the data can be streamed through pipelines or use a different language. As with IFS, it means you're entering the wild world of the shell's string splitting rules. That's not the world for you.
If you find yourself using braced parameter expansion syntax, ${my_var}, and anything is between those braces besides the name of your variable, it's a bad sign. For one, it means you might be using an array, and that's not good. If you're not using an array, it means you're using the shell's string manipulation capabilities. There are cases where this might be allowable (determining the basename of a file, for example), but the syntax for that kind of thing is very strange, and so many other languages supply better string manipulating tools. If you're doing batch file renaming, pathlib provides a much saner interface, in my opinion.
Dealing with process output in a loop is not a great idea. If you HAVE to do it, the only right way is with while IFS= read -r line. Don't listen to anyone who tells you differently, ever. Always try to refactor this case as a one-liner with AWK or Perl, or write a script in another language to process the data and call it from Bash. If you have a loop like this, and you are starting any processes inside the loop, you will have major performance problems. This will eventually lead to refactoring with Bash built-ins. In the final stages, it results in madness and suicide.
Bash functions, while occasionally useful, can be a sign of trouble. All the variables are global by default. It also means there is enough complexity that you can't do it with a completely linear control flow. That's also not a good sign for Bash. A few Bash functions might be alright, but it's a warning sign.
Conditional logic, while it can definitely be useful, is also a sign of increasing complexity. As with functions, using it doesn't mean you have to rewrite, but every time you write one, you should ask yourself the question as to whether the task you're doing isn't better suited to another language.
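
Regarding the batch renaming mentioned in the list above, here is a rough sketch of what that can look like with pathlib. The function name rename_extension is invented for the example, and it only looks at one directory, without recursion.

```python
import pathlib


def rename_extension(directory, old_suffix, new_suffix):
    """Rename every file ending in old_suffix to use new_suffix."""
    renamed = []
    # sorted() collects all matches before any file is renamed.
    for path in sorted(pathlib.Path(directory).glob('*' + old_suffix)):
        # with_suffix does roughly what "${f%.txt}.log" does in Bash.
        target = path.with_suffix(new_suffix)
        path.rename(target)
        renamed.append(target.name)
    return renamed
```

Calling rename_extension('.', '.txt', '.log') would turn notes.txt into notes.log, with no string splitting rules to worry about.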

Finally, whenever you use a $ in Bash (parameter expansion), you must use quotation marks. Always only ever use quotation marks. Never forget. Never be lazy. This is a security hazard. As previously mentioned, Bash is an injection honeypot. There are a few cases where you don't need the quotation marks. They are the exceptions. Do not learn them. Just use quotes all the time. It is always correct.
