# Building Pipelines and Software


By the end of this lesson, you should be able to get started with the following tasks:

- Automating complex workflows and analysis pipelines
- Configuring and installing external libraries
- Creating build systems for your own software



## make

make is a command-line utility that determines which elements of a process pipeline
need to be executed, and then executes them. 



### Exercise: Read the man page for make

1. Open a terminal, or use the `!` syntax in the IPython notebook.
2. Execute ```which make```. If this fails, alert your instructor. 
3. Execute ```man make```
4. To prepare to use make, what kind of file must you create?
5. Do you recall how to exit a man page? Exit it.

`make` can be used to automate every step of a process 
quite efficiently, because it keeps track of how things 
depend on one another and detects which pieces are not 
up to date. Given the file dependency tree and a description 
of the processes that compile each file based on the others, 
make can execute the necessary processes in the appropriate 
order. Because it detects which files in the dependency 
tree have changed, make executes only the necessary processes, 
and no more. This saves time, especially when some actions take 
a long time to execute but are not always necessary. 

When a new data file is added, make can determine what analysis 
files, figures, and documents are affected. It can then execute 
the processes to update them. In this way, it can automatically 
rerun Prof. Mayer’s data analysis, regenerate the appropriate plots, 
and rebuild the paper accordingly.

`make` can be run on the command line with the following syntax:

```
make [ -f makefile ] [ options ] ... [ targets ] ...
```

### Exercise: Draw Your Process Dependency Tree

To run an analysis for your class project, more than one 
command probably needs to be executed. 

- Write down the commands you need to execute for your project.
- Note which files each command creates and which it relies on. 
- Draw a tree describing this dependency. For an example, see Figure 14-1 in the book.
- Can you tell what processes would need to be reexecuted if the source code changes?
- Try drawing a path up the branches of the tree to the top. Which commands do you pass on the way?
- Share your drawing with a partner.



### The Makefile

It looks like the make command can be run without any arguments. So, in some directory
where your work is stored, you can try typing this magical ```make``` command:

```
~/shell_model $ make
make: *** No targets specified and no makefile found. Stop.
```

Uh oh, it looks like it may not be magic after all. It seems to require a makefile. But what is a makefile? A makefile is just a plain-text file that obeys certain formatting rules. Its purpose is to supply the make utility with a full definition of the dependency tree describing the relationships between files and tasks. The makefile also describes the steps for updating each file based on the others (i.e., which commands must be executed to update
one of the nodes).

If the make command is run with no arguments, then the make utility seeks a file called Makefile in the current directory to fulfill this purpose. The error response occurs because you have not yet created a Makefile in the directory where you hold your analysis.


If the makefile has any name other than Makefile, its name must be
provided explicitly. The -f flag indicates the location of that file to
the make utility. Makefiles with names other than Makefile are typically
only necessary if more than one makefile must exist in a single
directory. By convention, makefiles not called Makefile end in
the .mk extension.
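
The `-f` flag can be tried out with a throwaway makefile. The following is a minimal sketch; the names `analysis.mk` and `hello.txt` are made up for illustration, and the script works in a temporary directory so that it does not touch your own files.

```shell
#!/usr/bin/env bash
# Sketch: run make against a makefile that is not called "Makefile".
# analysis.mk and hello.txt are hypothetical names used only for this demo.
workdir="$(mktemp -d)"
cd "$workdir" || exit 1

# printf writes the tab (\t) that must precede the action.
printf 'hello.txt :\n\techo "built from analysis.mk" > hello.txt\n' > analysis.mk

# No file called Makefile exists here, so the makefile is named explicitly.
make -f analysis.mk hello.txt

cat hello.txt
```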

This section will discuss how to write makefiles by hand. Such makefiles can be used
to automate simple analysis and software pipelines. Prof. Mayer will create one to
update the plots in her paper based on new data.

### Targets
First and foremost, the makefile defines targets. Targets are the nodes of the dependency
tree. They are typically the files that are being updated. The makefile is made up
mostly of a series of target-prerequisite-action maps defined in the following syntax:

```
target : prerequisites
         action
```

A colon separates the target name from the list of prerequisites.
Note that the action must be preceded by a single tab character.

The analyzed .dat files depend on the raw .h5 files in the raw_data directory. They
also depend on the bash scripts that churn through the .h5 files to convert them into
useful .dat files. Therefore, the photon_photon.dat target depends on two prerequisites,
the set of ./raw_data/*.h5 files and the photon_analysis.sh shell script.

Let us imagine the shell script is quite clever, having been written by Prof. Mayer herself.
It has been written to generically model various interactions and accepts arguments
at runtime that modify its behavior. One of the arguments it accepts is the
number of photons involved in the interaction. Since the photon_photon.dat file
describes the two-photon interaction, the shell script can be modified with the special
flag -n=2 indicating the number of photons. The following definition in a makefile
sets up the target with its prerequisites and passes in this argument:

```
# Building the Shell Model Paper
photon_photon.dat : photon_analysis.sh ./raw_data/*.h5
	./photon_analysis.sh -n=2 > photon_photon.dat
```

In this example, the first line is a comment describing the file. That’s just good practice
and does not affect the behavior of the makefile. The second line describes the
target (photon_photon.dat, the file to be created or updated) and its prerequisites
(the shell script and the .h5 files on which it depends). The third line describes the
action that must be taken to update photon_photon.dat in the event that make detects
any changes to either of its prerequisites.

If this source code is saved in a file called Makefile, then it will be found when make is
executed.

### Exercise: Create a Makefile

1. In the make directory of the files associated with this book, create an empty file called Makefile.
2. Add the photon_photon.dat target as described.
3. Save the file.

Now that the makefile defines a target, it can be used to update that target. To build or
update a target file using make, you must call it with the name of the target defined in
the makefile. In this case, if `make photon_photon.dat` is called, then make will:

1. Check the modification times of the prerequisites and of photon_photon.dat.
2. If photon_photon.dat does not exist, or if any prerequisite is newer than it, execute the action.
3. Otherwise, do nothing, because everything is up to date already.
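
The up-to-date check can be approximated in the shell with the `-nt` ("newer than") file test. This is only a sketch of the idea, not a description of how make is implemented; `needs_rebuild` is a hypothetical helper, and the files are empty placeholders created in a temporary directory.

```shell
#!/usr/bin/env bash
# Sketch: decide whether a target is out of date relative to its prerequisites.
# needs_rebuild is a hypothetical helper, not part of make.
needs_rebuild() {
    target="$1"
    shift
    # A missing target always needs to be built.
    [ -e "$target" ] || return 0
    for prereq in "$@"; do
        # -nt is true when the prerequisite was modified more recently.
        if [ "$prereq" -nt "$target" ]; then
            return 0
        fi
    done
    return 1
}

workdir="$(mktemp -d)"
cd "$workdir" || exit 1

# Give the placeholder files explicit, well-separated modification times.
touch -t 202001010000 photon_analysis.sh
touch -t 202001020000 photon_photon.dat

if needs_rebuild photon_photon.dat photon_analysis.sh; then
    first_check="rebuild"
else
    first_check="up to date"
fi

# Simulate an edit: the script is now newer than the target.
touch -t 202001030000 photon_analysis.sh
if needs_rebuild photon_photon.dat photon_analysis.sh; then
    second_check="rebuild"
else
    second_check="up to date"
fi

echo "before editing the script: $first_check"
echo "after editing the script:  $second_check"
```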
The makefile is built up of many such target-prerequisite-action maps. The full
dependency tree can accordingly be built from a set of these directives. The next node
Prof. Mayer might define, for example, is the one that rebuilds Figure 4 any time the
photon_photon.dat file is changed. That figure is generated by the plot_response.py
Python script, so any changes to that script should also trigger a rebuild of fig4.svg.
The makefile grows accordingly as each target definition is added. The new version
might look like this:

```
# Building the Shell Model Paper
photon_photon.dat : photon_analysis.sh ./raw_data/*.h5
	./photon_analysis.sh -n=2 > photon_photon.dat
fig4.svg : photon_photon.dat plot_response.py
	python plot_response.py --input=photon_photon.dat --output=fig4.svg
```

A new target, fig4.svg, is defined. The fig4.svg file depends on photon_photon.dat
as a prerequisite (as well as a Python script, plot_response.py). The action to build
fig4.svg executes the Python script with specific options.

Since the figure relies on photon_photon.dat as a prerequisite, it also, in turn, relies on
prerequisites of photon_photon.dat. In this way, the dependency tree is made. So,
when `make fig4.svg` is called, make ensures that all the prerequisites of its prerequisites
are up to date.
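
The chained behavior can be observed with a small stand-in pipeline. In this sketch the actions are placeholders (`cat` and `cp`) rather than the real analysis and plotting scripts, and `raw_data/run1.h5` is a made-up text file, not real HDF5 data.

```shell
#!/usr/bin/env bash
# Sketch: a two-level dependency chain built with make.
# The actions are stand-ins for the real scripts, which are not reproduced here.
workdir="$(mktemp -d)"
cd "$workdir" || exit 1

mkdir raw_data
echo "raw counts" > raw_data/run1.h5      # placeholder, not a real HDF5 file

# printf writes the tab (\t) that must start every action line.
{
    printf 'photon_photon.dat : raw_data/run1.h5\n'
    printf '\tcat raw_data/run1.h5 > photon_photon.dat\n'
    printf 'fig4.svg : photon_photon.dat\n'
    printf '\tcp photon_photon.dat fig4.svg\n'
} > Makefile

# Asking for the figure builds the data file first, then the figure.
make fig4.svg

# A second call finds nothing out of date and runs no action.
second_run="$(make fig4.svg)"
echo "$second_run"
```

Updating `raw_data/run1.h5` and calling `make fig4.svg` again should rerun both actions, since the change propagates up the tree.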

### Exercise: Add Additional Targets

1. Open the Makefile created in the previous exercise.
2. Add the fig4.svg target as above.
3. Can you tell, from Figure 14-1, how to add other targets? Try adding some.
4. Save the file.

The final paper depends on all of the figures and the .tex files. So, any time a figure or
the .tex files change, the LaTeX commands must be reissued. The LaTeX program will
be covered in much greater detail in Chapter 20. At that time, you may combine your
knowledge of make with your knowledge of LaTeX to determine what targets should
be included in a makefile for generating a LaTeX-based document.

### Special Targets

The first target in a file is usually run by default. That target is the one that is built
when make is called with no arguments. Often, the desired default behavior is to
update everything. An “all” target is a common convention for this. Note that the target
name does not have to be identical to the filename. It can be any word that is convenient.
The “all” target simply needs to depend on all other top-level targets.
In the case of Prof. Mayer’s paper, the all target might be defined using the wildcard
character (*):

```
# Building the Shell Model Paper
all: fig*.svg *.dat *.tex *.pdf
photon_photon.dat : photon_analysis.sh ./raw_data/*.h5
	./photon_analysis.sh -n=2 > photon_photon.dat
fig4.svg : photon_photon.dat plot_response.py
	python plot_response.py --input=photon_photon.dat --output=fig4.svg
...
```

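Note how the all target does not define an action. It just collects prerequisites.
Keep in mind that a wildcard in a prerequisite list matches only files that already
exist, so targets that have not been built yet are more safely listed by name.

The all target tells make to do exactly what is needed. That is, when this target is
called (with `make` or `make all`), make ensures that all prerequisites are up to date, but
performs no final action.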
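
A minimal sketch of an all target follows. The actions are `echo` placeholders, and the targets are listed by name rather than by wildcard. The `.PHONY` line is a standard make feature that is not covered in the text above; it tells make that all is a label rather than a file to be created.

```shell
#!/usr/bin/env bash
# Sketch: a default "all" target that only collects prerequisites.
# The actions are placeholders for the real analysis and plotting steps.
workdir="$(mktemp -d)"
cd "$workdir" || exit 1

{
    printf '.PHONY : all\n'
    printf 'all : photon_photon.dat fig4.svg\n'
    printf 'photon_photon.dat :\n'
    printf '\techo "placeholder data" > photon_photon.dat\n'
    printf 'fig4.svg : photon_photon.dat\n'
    printf '\techo "placeholder figure" > fig4.svg\n'
} > Makefile

# With no arguments, make builds the first ordinary target, which is "all".
make

ls
```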

### Exercise: Create a Special Target

Another common special target is clean. This target is typically
used to delete generated files in order to trigger a fresh rebuild of
everything.

1. Open the Makefile you have been working with.
2. Create a “clean” target.
3. What are the appropriate prerequisites? Are there any?
4. What is the appropriate command to delete the auxiliary files created by LaTeX?

Now that she knows how to create a makefile, Prof. Mayer can use it to manage
dependencies for the entire process of building her paper from the raw data. This is a
common use for makefiles and facilitates many parts of analysis, visualization, and
publication. Another common use for makefiles is configuring, compiling, building,
linking, and installing software libraries. The next section will cover many aspects of
this kind of makefile.

## Building and Installing Software

Python is called an interpreted language because it does not need an explicit
compilation step before it runs; the interpreter compiles source files to bytecode
automatically. However, that compilation step is not handled so nicely by all
programming languages. C, C++, Fortran, Java, and many others require multiple
stages of building before they are ready to run. These stages are:

1. Configuration
2. Compilation
3. Linking
4. Installation

From a user’s perspective, this maps onto the following set of commands for installing
software from source:

```
~ $ ./configure [options]
~ $ make [options]
~ $ make test
~ $ [sudo] make install
```

- The configuration step may be called with a different command (e.g., ccmake or
  scons). This step creates a makefile based on user options and system characteristics.
- The build step compiles the source code into binary format and incorporates file
  path links to the libraries on which it depends.
- Before installing, it is wise to execute the test target, if available, to ensure that
  the library has built successfully on your platform.
- The installation step will copy the build files into an appropriate location on your
  computer. Often, this may be a location specified by the user in the configuration
  step. If the install directory requires super-user permissions, it may be necessary
  to prepend this command with sudo, which changes your role during this action
  to the super-user role.

For installation to succeed, each of these steps requires commands, flags, and customization
specific to the computer platform, the user, and the environment. That is, the
“action” defined by the makefile may involve commands that should be executed differently
on different platforms or for different users.

For example, a compilation step can only use the compiler available on the computer.
Compilation is done with a command of the form:

```
compiler [options] <source files> <include files> [-l linked libraries]
```

For C++ programs, one user may use g++ while another uses clang++ and a third uses
gcc. The appropriate command will be different for each user. The makefile, therefore,
must be configured to detect which compiler exists on the machine and to adjust the
“action” accordingly. That is, in the action for the compilation step, the compiler command
and its arguments are not known a priori. Configuration, compilation, linking,
and installation depend on the computer environment, user preferences, and
many other factors.
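
Compiler detection of this kind can be sketched in a few lines of shell. The candidate list (`g++`, `clang++`, `c++`) and the function name `find_compiler` are assumptions chosen for illustration; real build systems perform far more thorough checks.

```shell
#!/usr/bin/env bash
# Sketch: pick the first C++ compiler found on the PATH.
# find_compiler and the candidate list are illustrative assumptions.
find_compiler() {
    for candidate in "$@"; do
        # command -v succeeds only if the command can be found.
        if command -v "$candidate" >/dev/null 2>&1; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

if CXX="$(find_compiler g++ clang++ c++)"; then
    echo "Using C++ compiler: $CXX"
else
    echo "No C++ compiler found; install one or choose a compiler by hand." >&2
fi
```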

For this reason, when you are building and installing software libraries, makefiles can
become very complex. However, at their core, their operation is no different than for
simple analysis pipeline applications like the one in the previous section. As the
dependency tree grows, more targets are added, and the actions become more complex
or system-dependent, more advanced makefile syntax and platform-specific configuration
become necessary. Automation is the only solution that scales.

### Configuration of the Makefile

It would be tedious and error-prone to write a custom makefile appropriate for each
conceivable platform-dependent combination of variables. To avoid this tedium, the
most effective researchers and software developers choose to utilize tools that automate
that configuration. These tools:

- Detect platform and architecture characteristics
- Detect environment variables
- Detect available commands, compilers, and libraries
- Accept user input
- Produce a customized makefile

In this way, configuration tools (a.k.a. “build systems”) address all aspects of the
project that may be variable in the build phase. Additionally, they enable the developer
to supply sensible default values for each parameter, which can be coupled with
methods to override those defaults when necessary.
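
The idea of defaults that can be overridden can be sketched as a toy configuration script. This is not how CMake, Autotools, or SCons work internally; `config.mk`, `PREFIX`, and `CXX` are conventional names chosen here for illustration, and the output is a fragment that a makefile could read with an include directive.

```shell
#!/usr/bin/env bash
# Sketch of a tiny "configure" step: gather settings, apply defaults, and
# write them to a file. config.mk is a hypothetical name for the output.
workdir="$(mktemp -d)"
cd "$workdir" || exit 1

# Defaults apply unless the user has set these environment variables.
PREFIX="${PREFIX:-/usr/local}"
CXX="${CXX:-c++}"

{
    echo "# Generated by the configure sketch; change the inputs, not this file."
    echo "PREFIX = $PREFIX"
    echo "CXX = $CXX"
    echo "PLATFORM = $(uname -s)"
} > config.mk

cat config.mk
```

Running the script as, for example, `PREFIX="$HOME/.local" bash configure_sketch.sh` (a hypothetical filename) would record the user's install location instead of the default.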

#### Why Not Write Your Own Installation Makefile?

Writing your own makefile from scratch can be time-consuming and error-prone.
Furthermore, as a software project is adopted by a diversity of users and scales to
include dependencies on external libraries, generating an appropriate array of makefiles
for each use case becomes untenable. So, the makefile should be generated by a
sophisticated build system, which will enable it to be much more flexible across platforms
than would otherwise be possible.
Some common build system automation tools in scientific computing include:

- CMake
- Autotools (Automake + Autoconf)
- SCons

Rather than demonstrating the syntax of each of these tools, the following sections
will touch on shared concepts among them and the configurations with which they
assist.
First among these, most build systems enable customization based on the computer
system platform and architecture.

#### Platform configuration

Users have various computer platforms with similarly various architectures. Most
software must be built differently on each. Even the very simplest things can vary
across platforms. For example, libraries have different filename extensions on each
platform (perhaps libSuperPhysics.dll on Windows, libSuperPhysics.so on Linux, and
libSuperPhysics.dylib on macOS). Thus, to define the makefile targets, prerequisites, and
actions, the configuration system must detect the platform. The operating system
may be any of the following, and more:

- Linux
- Unix
- Windows
- Mobile
- Embedded
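
Platform detection often starts from the operating-system name printed by `uname -s`. The following sketch maps that name to the usual shared-library extension; `shared_lib_ext` is a hypothetical helper, and the fallback for unrecognized systems is only a guess.

```shell
#!/usr/bin/env bash
# Sketch: map an operating-system name (as printed by `uname -s`) to the
# usual shared-library extension. shared_lib_ext is a hypothetical helper.
shared_lib_ext() {
    case "$1" in
        Linux)                echo "so" ;;
        Darwin)               echo "dylib" ;;   # macOS
        CYGWIN*|MINGW*|MSYS*) echo "dll" ;;     # Unix-like shells on Windows
        *)                    echo "so" ;;      # a guess for other Unix systems
    esac
}

os_name="$(uname -s)"
echo "On $os_name the library would be libSuperPhysics.$(shared_lib_ext "$os_name")"
```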

Additionally, different computer architectures store numbers differently. For example,
on 32-bit machines, memory addresses (pointers) are 32 bits wide, while on 64-bit
machines they are 64 bits wide, and the sizes of some integer types can differ as well.
Differences like this require that the configuration system detect how the current
architecture stores numbers. These specifications often must be included in the compilation
command.
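
A first approximation of architecture detection reads the machine name from `uname -m`. In this sketch, `word_size` is a hypothetical helper and the list of machine names is deliberately incomplete; anything it does not recognize is reported as unknown rather than guessed.

```shell
#!/usr/bin/env bash
# Sketch: classify a machine name (as printed by `uname -m`) as 32- or 64-bit.
# word_size is a hypothetical helper; the list of names is not exhaustive.
word_size() {
    case "$1" in
        x86_64|amd64|aarch64|arm64) echo "64" ;;
        i[3-6]86|armv7l)            echo "32" ;;
        *)                          echo "unknown" ;;
    esac
}

machine="$(uname -m)"
echo "Machine $machine has word size: $(word_size "$machine")"
```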
Beyond the platform and architecture customizations that must be made, the system
environment, what libraries are installed, the locations of those libraries, and other
user options also affect the build.

#### System and user configuration

Most importantly, different computers are controlled by different users. Thus, build
systems must accommodate users who make different choices with regard to issues
such as:

- What compiler to use
- What versions of libraries to install
- Where to install those libraries
- What directories to include in their PATH and similar environment variables
- What optional parts of the project to build
- What compiler flags to use (debugging build, optimized build, etc.)

The aspects of various systems that cause the most trouble when you’re installing a
new library are the environment variables (such as PATH) and their relationship to the
locations of installed libraries. In particular, when this relationship is not precise and
accurate, the build system can struggle to find and link dependencies.

#### Dependency configuration

When one piece of software depends on the functionality of another piece of software,
the second is called a dependency. For example, if the SuperPhysics library
relies on the EssentialPhysics library and the ExtraPhysics library, then they are
its dependencies. Before attempting to install the SuperPhysics library, you must
install EssentialPhysics and ExtraPhysics.
The build can fail in either of these cases:

- The build system cannot locate a dependency library.
- The available library is not the correct version.

The build system seeks libraries in the directories listed in the PATH, LD_LIBRARY_PATH,
and similar environment variables. Thus, the most common problems in building software arise
when too many or not enough dependency libraries appear in the directories targeted
by the environment.
When too many versions of the ExtraPhysics library are found, for example, the
wrong version of the library might be linked and an error may occur. At the other
extreme, if no EssentialPhysics library is found, the build will certainly fail. To fix
these problems, be sure all dependencies are appropriately installed.
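
Both failure modes can be diagnosed by listing every copy of a library that appears along a search path. The sketch below uses throwaway directories and empty placeholder files; `find_library` is a hypothetical helper, and real build systems also check versions and symbols.

```shell
#!/usr/bin/env bash
# Sketch: list every copy of a library found in a colon-separated search path.
# find_library is a hypothetical helper; real build systems do much more.
find_library() {
    lib_name="$1"
    search_path="$2"
    # Split the search path on colons, one directory per loop iteration.
    old_ifs="$IFS"
    IFS=":"
    for dir in $search_path; do
        if [ -n "$dir" ] && [ -e "$dir/$lib_name" ]; then
            echo "$dir/$lib_name"
        fi
    done
    IFS="$old_ifs"
}

# Throwaway directories standing in for real library locations.
workdir="$(mktemp -d)"
mkdir -p "$workdir/old/lib" "$workdir/new/lib" "$workdir/empty/lib"
touch "$workdir/old/lib/libExtraPhysics.so" "$workdir/new/lib/libExtraPhysics.so"
demo_path="$workdir/old/lib:$workdir/new/lib:$workdir/empty/lib"

echo "Copies of libExtraPhysics.so (more than one may mean a version clash):"
find_library libExtraPhysics.so "$demo_path"

echo "Copies of libEssentialPhysics.so (none means the build cannot link it):"
find_library libEssentialPhysics.so "$demo_path"
```

On a real system the second argument would be something like `"$LD_LIBRARY_PATH"` rather than the demonstration path.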
Once all dependencies, environment variables, user options, and other configurations
are complete, a makefile or installation script is generated by the build system. The
first action it conducts is the compilation step.

### Compilation

Now that the makefile is configured, it can be used to compile the source code. The
commands in the makefile for a software build will be mostly compiler commands.
Without getting into too much detail, compilers are programs that turn source code
into a machine-readable binary format.
The build system, by convention, likely generated a makefile with a default target
designed to compile all of the source code into a local directory. So, with a simple
make command, the compiled files are generated and typically saved (by the makefile)
in a temporary directory as a test before actual installation. Additionally, once compiled,
the build can usually be tested with make test.
If the tests pass, the build system can also assist with the next step: installation.

### Installation

As mentioned in Chapter 13, the key to attracting users to your project is making it
installable.
On Windows, this means creating a Setup.exe file. With Python, it means implementing
a setup.py or other distribution utility. For other source code on Unix systems,
this means generating a makefile with an install target so that make install can be
called.

#### Why not just write a simple script to perform the installation?

The user may eventually want to upgrade or even uninstall your program, fantastic as
it may be. By tradition, the installation program is usually created by the application
developer, but the uninstall program is usually the responsibility of the operating system.
On Windows, this is handled by the Add/Remove Programs tool. On Unix, this
is the responsibility of the package manager. This means the installation program
needs special platform-dependent capabilities, which are usually taken care of by the
build system.

For example, on Linux, `make install` is not used when creating packages. Instead,
`make DESTDIR=<a_fake_root_dir> install` installs the package to a fake root directory.
Then, a package is created from the fake root directory, and uninstallation is
possible because a manifest is generated from the result.
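
A staged installation can be tried safely in a temporary directory. In this sketch the install target is a minimal, hypothetical one written for the demonstration, the "library" is a text file, and the manifest is simply the list of staged files; real packaging tools record considerably more.

```shell
#!/usr/bin/env bash
# Sketch: install into a fake root with DESTDIR, then list what was installed.
# The makefile and the "library" are placeholders made up for this demo.
workdir="$(mktemp -d)"
cd "$workdir" || exit 1
echo "pretend this is a compiled library" > libSuperPhysics.so

# A minimal install target that honors DESTDIR and PREFIX.
{
    printf 'PREFIX = /usr/local\n'
    printf 'install :\n'
    printf '\tmkdir -p $(DESTDIR)$(PREFIX)/lib\n'
    printf '\tcp libSuperPhysics.so $(DESTDIR)$(PREFIX)/lib/\n'
} > Makefile

# Install into a staging directory instead of the real /usr/local.
make DESTDIR="$workdir/stage" install

# The list of staged files serves as a simple manifest.
(cd "$workdir/stage" && find . -type f | sort) > manifest.txt
cat manifest.txt
```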

The build system will have created this target automatically. If the installation location
chosen in the configuration step is a restricted directory, then you must execute
the `make install` command with sudo in order to act as the superuser:

```
sudo make install
```

At that point, the software should be successfully installed.


## Software Installation Steps

Software installation has the following steps, for which analogies exist in analysis pipelines:

1. Configuration 
2. Compilation
3. Linking
4. Installation

### Configuration
Configuration detects platform-dependent variables and user-specified options to customize later steps.
For example, an analysis pipeline might run a Python program on a data file to create a particular plot.
However, the location of the data file may vary from analysis to analysis or from computer
to computer. Therefore, the configuration step occurs before the program is
executed. The configuration step seeks out the proper data path by querying the environment
and the user to configure the execution accordingly.
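
For an analysis pipeline, this kind of configuration can be as simple as resolving the data directory before anything runs. The sketch below assumes a hypothetical `DATA_DIR` environment variable and a `./raw_data` default; neither name comes from a real tool.

```shell
#!/usr/bin/env bash
# Sketch: resolve the data directory before running an analysis.
# DATA_DIR and the ./raw_data default are assumptions made for illustration.
resolve_data_dir() {
    # Priority: explicit argument, then environment variable, then default.
    if [ -n "${1:-}" ]; then
        echo "$1"
    elif [ -n "${DATA_DIR:-}" ]; then
        echo "$DATA_DIR"
    else
        echo "./raw_data"
    fi
}

# The script's first command-line argument, if any, takes precedence.
data_dir="$(resolve_data_dir "${1:-}")"
echo "Reading data from: $data_dir"
```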

### Compilation 
Compilation is only necessary when you’re building software written in a compiled language. The compilation step relies on a compiler to convert source code into a machine-readable binary format.

### Linking
The linking step attaches that binary-formatted library or executable to other libraries on which it may depend.

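### Installation/Execution
Compilation and linking prepare the software for installation. The installation step then copies the built files to their final location, where the software can be executed.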




## Building Software and Pipelines Wrap-up
At the end of this chapter, you should now feel comfortable with automating pipelines
and building software using makefiles. You should also now be familiar with the
steps involved in building a non-Python software library from source:
1. Configuration
2. Compilation
3. Linking
4. Installation