# Unix commands and scripting


There are several great resources to get you started with Unix commands. Nemeth et al. is a good first reference.

- Nemeth, E.; Snyder, G.; Hein, T. R. & Whaley, B. Taub, M. (Ed.) Unix and Linux System Administration Handbook Prentice Hall, 2010. **Required reading: chapter 2, pp 29-72**
- Kerrisk, M. The Linux Programming Interface: A Linux and UNIX System Programming Handbook No Starch Press, 2010.
- Blum, R. & Bresnahan, C. Linux Command Line and Shell Scripting Bible John Wiley & Sons, Inc., 2015.
- Ohad Rodeh, Josef Bacik, and Chris Mason. 2013. BTRFS: The Linux B-Tree Filesystem. Trans. Storage 9, 3, Article 9 (August 2013), 32 pages

You can practice Linux commands online at [hackerrank](https://www.hackerrank.com/domains/shell) and [learnshell](http://www.learnshell.org/). More complex bash-based tools can be found at [ostechnix](https://www.ostechnix.com/collection-useful-bash-scripts-heavy-commandline-users/) and [awesome-bash](https://github.com/awesome-lists/awesome-bash).

Two assignments are given. Assignment 1 asks you to build pipeline commands. Assignment 2 requires you to build bash and Python scripts.

- Assignment date: **August 23, 2019**
- Due date: **August 30, 2019. 4:00 PM**
- Students must work in teams of two. Teams must apply [pair programming](https://en.wikipedia.org/wiki/Pair_programming). 
- Submit to git: 

### Basic linux commands

A prerequisite for this notebook is familiarity with the Linux desktop environment and with basic command-line commands. See introduction to Linux for an introductory lecture (optional).

**Note. You must read reference [1] to continue.** Linux commands are usually run in a terminal. To facilitate explanations, we will run them in Jupyter. However, you will eventually be required to work in a text terminal.

Jupyter interprets code in cells. By default, all cells in a notebook interpret code in the language selected when the notebook was created, for instance Python. Cells, however, can interpret other languages through [cell magics](https://ipython.readthedocs.io/en/stable/interactive/magics.html#cell-magics), for instance %%bash, %%html, %%perl, and many others.


In [3]:
%%bash 
# Let’s begin by running commands that retrieve state from the file system or alter it

# Gets the current working directory
pwd

# Lists files and directories stored in the current working directory
ls

# Creates a temporary directory named tmp
mkdir tmp

# Changes the location of the working directory to tmp
cd tmp

# Creates an empty file
touch archivo.txt

# Again, lists the contents of the current working directory
ls 

# Removes the file archivo.txt
rm archivo.txt

# Changes location to the parent directory
cd ..

# Deletes the subdirectory tmp
rm -rf tmp

# Again, lists the contents of the current working directory
ls 

/home/ahiralesc/mysrc/notebooks/sistemas_operativos
OS_Conferences_and_resources.ipynb
OS_Lab1_unix_commands_and_scripting.ipynb
archivo.txt
OS_Conferences_and_resources.ipynb
OS_Lab1_unix_commands_and_scripting.ipynb


Files and directories can be accessed by three classes of users: user (the owner), group, and everyone else. Their permissions can be viewed with the ls command.

In [1]:
ls -al

total 6260
drwxr-xr-x 23 jonathan jonathan    4096 ago 28 13:08 ./
drwxr-xr-x  3 root     root        4096 oct  9  2017 ../
drwx------  3 jonathan jonathan    4096 oct 27  2017 .adobe/
drwxr-xr-x 24 jonathan jonathan    4096 ago 28 12:58 anaconda/
drwxr-xr-x  3 jonathan jonathan    4096 ago 28 13:01 .anaconda/
-rw-------  1 jonathan jonathan    1296 ago 28 12:03 .bash_history
-rw-r--r--  1 jonathan jonathan     220 oct  9  2017 .bash_logout
-rw-r--r--  1 jonathan jonathan    3806 ago 28 12:56 .bashrc
drwx------ 21 jonathan jonathan    4096 ago 28 10:06 .cache/
drwx------  3 jonathan jonathan    4096 oct  9  2017 .compiz/
drwxr-xr-x  3 jonathan jonathan    4096 ago 28 13:08 .conda/
-rw-r--r--  1 jonathan jonathan      40 ago 28 13:02 .condarc
drwx------ 21 jonathan jonathan    4096 ago 28 09:13 .config/
drwx------  3 root     root        4096 ago 28 10:10

See [ls -l field explanation](http://blmrgnn.blogspot.com/2016/11/ls-l-field-explanation.html) and [Linux and Unix ls command tutorial with examples](https://shapeshed.com/unix-ls/) for field properties and other use cases.
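
As a minimal sketch of how the three permission classes show up in a listing, the following commands create a scratch file, set its mode with chmod, and keep only the first column of ls -l. The scratch file name and the mode 640 are illustrative choices and are not part of the lab.

```bash
#!/bin/bash
# Illustrative sketch: the scratch directory and file name are arbitrary.
scratch=$(mktemp -d)
touch "$scratch/demo.txt"

# 640 = owner: read + write (6), group: read (4), everyone else: nothing (0)
chmod 640 "$scratch/demo.txt"

# Keep the first 10 characters of the long listing: file type + 9 permission bits
perms=$(ls -l "$scratch/demo.txt" | cut -c1-10)
echo "$perms"

rm -rf "$scratch"
```

The first character is the file type (- for a regular file, d for a directory); the next three groups of three characters belong to the user, the group, and everyone else, in that order.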

Internally, a Linux file system such as Btrfs organizes files and directories in a B-tree [4]. Logically, files and directories are organized in a tree-like data structure. Major, or level one, directories include: bin, boot, dev, etc. See the [linux file system explained](https://www.linux.com/blog/learn/intro-to-linux/2018/4/linux-filesystem-explained) for details. You can list level one subdirectories with the *tree* command, for example:

In [2]:
%%bash
tree -L 1 /

bash: line 1: tree: command not found


In [3]:
%%bash
# Your $HOME directory is 
tree -L 1 $HOME

bash: line 2: tree: command not found
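
The output above shows that tree is not installed on this machine. As a fallback sketch, find can list the first-level subdirectories. The helper name list_top_dirs is hypothetical, and -mindepth and -maxdepth are extensions offered by GNU and BSD find rather than part of POSIX.

```bash
#!/bin/bash
# Hypothetical helper: lists the first-level subdirectories of a directory.
# -mindepth 1 skips the directory itself; -maxdepth 1 stops find from descending.
list_top_dirs() {
    find "$1" -mindepth 1 -maxdepth 1 -type d | sort
}

list_top_dirs /
```

Calling list_top_dirs "$HOME" lists the first-level subdirectories of your home directory in the same way.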


There are many applications for bash. In what follows, we will focus on using bash as a [scripting language](https://en.wikipedia.org/wiki/Scripting_language) for processing large data files. A data file is stored in either binary or text format. Its data is further organized in a human-readable or non-human-readable format. Examples of human-readable formats include [CSV](https://en.wikipedia.org/wiki/Comma-separated_values), [XML](https://en.wikipedia.org/wiki/XML), and [JSON](https://en.wikipedia.org/wiki/JSON).

The aim of the following scripts is to compute statistics from a set of log files. We will assume the log files are stored in a subdirectory named logs. You must download three files from the blackboard contenido/logs subdirectory and store them in your logs subdirectory. **You must do the latter manually**. Logs are CSV formatted.

In [12]:
%%bash

# Set the name of the working directory
working_dir=logs # Do not include blank spaces.

# If the logs subdirectory does not exist, create it.
if [ ! -d "$working_dir" ]; # ! negates the test; -d checks that a directory exists
then
    mkdir "$working_dir" # Creates the subdirectory logs
fi
# Copy the logs from blackboard to the previous subdirectory
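
The existence check and the mkdir call can also be folded into a single command. A minimal sketch, assuming the same directory name as in the cell above:

```bash
#!/bin/bash
working_dir=logs

# -p creates the directory only when it is missing and
# does not report an error when it already exists
mkdir -p "$working_dir"
```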

Files are labeled *part-00000-of-00500.csv.gz, part-00001-of-00500.csv.gz, and part-00002-of-00500.csv.gz*. These are very large CSV files from the [google cluster repository](https://github.com/google/cluster-data). First, let's assume you don't know how many files are in the logs subdirectory and you want to find out.

In [16]:
%%bash --out log_file_names --err error
# Note. The --out option captures STDOUT in the Python variable log_file_names; --err captures STDERR in error
ls -AS logs/

In [17]:
# The following code is Python based.
print(log_file_names)

part-00000-of-00500.csv.gz
part-00002-of-00500.csv.gz
part-00001-of-00500.csv.gz



Each line in a log corresponds to a task event. Each line has 13 fields (columns) with the following information.

- timestamp
- missing_info
- job_id
- task_index
- machine_id
- event_type
- user_name
- scheduling_class
- priority
- rr_cpu
- rr_ram
- rr_disk
- constraints
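
As a sketch of how these fields can be addressed by position, awk can split a record on commas and print selected columns. The record below is made up for illustration and only mimics the 13-field layout; it is not taken from the real logs.

```bash
#!/bin/bash
# Hypothetical task-event record with the 13 fields listed above (values are made up).
line='0,,3418309,0,4155527081,1,userA,3,9,0.125,0.0745,0.0004,0'

# -F, sets the field separator to a comma.
# NF is the number of fields; $1, $3 and $6 are timestamp, job_id and event_type.
fields=$(echo "$line" | awk -F, '{ print NF, $1, $3, $6 }')
echo "$fields"
```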

Let's assume you need a script that counts the number of events in each log. [Parameter expansion](http://wiki.bash-hackers.org/syntax/pe) is applied in the following example.

In [2]:
%%bash
# In a bash script file, the first line is the shebang
# #!/bin/bash, followed by Linux command lines

# We will count the number of events in each file and store the individual counts in a text file

# First, let's make sure the output file is not present. If it is, delete it.
if [ -f events_per_log.txt ]; then
    rm events_per_log.txt
fi

# Command substitution, $(...), is applied to get all file names in a given path location
# Each file name is stored in the variable i.
# zcat uncompresses file i and forwards the output to the pipe |
# The pipe forwards that output to wc
# Finally, wc -l counts the number of lines in the file and the count is assigned to the
# variable lines
# The echo command creates a formatted string and appends (>>) the output to the
# file named events_per_log.txt
# NOTE: YOU MUST CHANGE THE PATH LOCATION
for i in $(ls -AS ~ahiralesc/mysrc/notebooks/sistemas_operativos/logs/); do
    lines=$(zcat ~ahiralesc/mysrc/notebooks/sistemas_operativos/logs/"$i" | wc -l)
    echo -e "$i, $lines" >> events_per_log.txt
done

# head is used to show the first lines of the file events_per_log.txt
head events_per_log.txt

part-00000-of-00500.csv.gz, 450146
part-00002-of-00500.csv.gz, 160642
part-00001-of-00500.csv.gz, 77776
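
The loop above iterates over the output of ls. As a sketch of an alternative, a glob pattern can drive the loop directly, which also copes with file names that contain spaces. The helper name count_events is hypothetical, and gzip -dc is used in place of zcat because both decompress to standard output.

```bash
#!/bin/bash
# Hypothetical helper: prints "name, line count" for every .gz file in a directory.
count_events() {
    local dir=$1 f lines
    for f in "$dir"/*.gz; do
        [ -e "$f" ] || continue           # skip the literal pattern when nothing matches
        lines=$(gzip -dc "$f" | wc -l)    # decompress to stdout and count lines
        echo "${f##*/}, $((lines))"       # $((...)) trims any padding wc may add
    done
}

# Small demonstration on a scratch directory (names and contents are made up)
scratch=$(mktemp -d)
printf 'a\nb\nc\n' | gzip > "$scratch/part-a.csv.gz"
printf 'a\n'       | gzip > "$scratch/part-b.csv.gz"
count_events "$scratch"
rm -rf "$scratch"
```

Note that a glob expands in alphabetical order, whereas ls -S sorts by file size, so the order of the lines can differ from the output shown above.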


## Assignment 1.

Your objective will be to build the following table using several bash scripts. 

**Log file properties**

| Name                 | Size | Num. Events | Start time                 | End time                   |
|:--------------------:|------|-------------|----------------------------|----------------------------|
| part-00000-of-00500  | 100  | 20          | 2011-05-01 19:00:00.000000 | 2011-05-01 19:00:00.000000 |
| part-00001-of-00500  | 102  | 30          | 2011-05-01 19:00:00.000000 | 2011-05-01 19:00:00.000000 |
| part-00002-of-00500  | 102  | 30          | 2011-05-01 19:00:00.000000 | 2011-05-01 19:00:00.000000 |

Note you must produce a CSV file, not a visual table. Thus the final output will be a file with the following content.

```bash
Name, Size, Num. Events, Start time, End time
part-00000-of-00500, 100, 20, 2011-05-01 19:00:00.000000, 2011-05-01 19:00:00.000000
part-00001-of-00500, 102, 30, 2011-05-01 19:00:00.000000, 2011-05-01 19:00:00.000000
part-00002-of-00500, 102, 30, 2011-05-01 19:00:00.000000, 2011-05-01 19:00:00.000000
```
**You must submit two products**:
- A single bash script that produces the previous output.
- The CSV file produced by the bash script.

In what follows, I will give some hints on how to approach the problem. 

#### Extracting the file name

Consider the filename part-00000-of-00500.csv.gz. The first task to address is to extract the string part-00000-of-00500.

In [9]:
%%bash
# Assume the file part-00000-of-00500.csv.gz is located in /home/ahiralesc/mysrc/cutu/extraction/workload/
file=/home/ahiralesc/mysrc/cutu/extraction/workload/part-00000-of-00500.csv.gz
# The following code extracts the filename
xpath=${file%/*}        # Gets the file path: /home/ahiralesc/mysrc/cutu/extraction/workload
xbase=${file##*/}       # Gets the file name: part-00000-of-00500.csv.gz
xfext=${xbase##*.}      # Gets the last extension: gz
xpref=${xbase%.*}       # Removes the last extension only: part-00000-of-00500.csv (%%.* would remove both)

echo;
echo path=${xpath};
echo path=$(dirname $file)         # Here is another way to get the path
echo filename=${xbase};
echo filename=$(basename $file)    # Here is another way to get the filename
echo pref=${xpref};
echo ext=${xfext}
echo;


path=/home/ahiralesc/mysrc/cutu/extraction/workload
path=/home/ahiralesc/mysrc/cutu/extraction/workload
filename=part-00000-of-00500.csv.gz
filename=part-00000-of-00500.csv.gz
pref=part-00000-of-00500.csv
ext=gz
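
The output shows that a single % removes only the last extension. As a short sketch of the four pattern-removal forms of parameter expansion, applied to the same file name (the variable names are illustrative):

```bash
#!/bin/bash
name=part-00000-of-00500.csv.gz

short_suffix=${name%.*}     # removes the shortest match of ".*" from the end
long_suffix=${name%%.*}     # removes the longest match of ".*" from the end
short_prefix=${name#*.}     # removes the shortest match of "*." from the start
long_prefix=${name##*.}     # removes the longest match of "*." from the start

echo "$short_suffix"   # part-00000-of-00500.csv
echo "$long_suffix"    # part-00000-of-00500
echo "$short_prefix"   # csv.gz
echo "$long_prefix"    # gz
```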



A real bash script is usually stored in a file with the sh extension, for instance get_filenames.sh. In it, the previous code can be structured as follows:
```bash
#!/bin/bash
xpath=${1%/*}        # Gets the file path, e.g. /home/ahiralesc/mysrc/cutu/extraction/workload
xbase=${1##*/}       # Gets the file name: part-00000-of-00500.csv.gz
xfext=${xbase##*.}      # Gets the last extension: gz
xpref=${xbase%.*}       # Removes the last extension only: part-00000-of-00500.csv

echo;
echo path=${xpath};
echo path=$(dirname $1)         # Here is another way to get the path
echo filename=${xbase};
echo filename=$(basename $1)    # Here is another way to get the filename
echo pref=${xpref};
echo ext=${xfext}
echo;
```

Note that the variable **file** has been replaced with **1**. This corresponds to the first argument given to your script. For instance,

```bash
./get_filenames.sh part-00000-of-00500.csv.gz
```


Before trying to execute the previous command in a shell, you must grant the script execution privileges. This is done with the chmod +x command.

```bash
chmod +x get_filenames.sh
```


### Assignment 1.1. [ 25 pts.]

Create a script that, given a path to the *.gz files, iterates over the files and extracts each filename.

#### Getting the size of the file

File information is already available via the ls command. The fifth column of ls -la gives just what we need. 

In [12]:
ls -al logs/

total 6728
drwxrwxr-x 2 ahiralesc ahiralesc    4096 Aug 13 13:15 ./
drwxrwxr-x 4 ahiralesc ahiralesc    4096 Aug 14 12:52 ../
-rw-rw-r-- 1 ahiralesc ahiralesc 4128899 Aug 13 13:10 part-00000-of-00500.csv.gz
-rw-rw-r-- 1 ahiralesc ahiralesc  924634 Aug 13 13:10 part-00001-of-00500.csv.gz
-rw-rw-r-- 1 ahiralesc ahiralesc 1821031 Aug 13 13:10 part-00002-of-00500.csv.gz


However, there is a lot of clutter. We don’t need directory information, just filenames. You must find the proper ls switches that disable printing of directory information. The following code uses [awk](https://www.tutorialspoint.com/awk/), a text-processing command, to extract column two of the ls output. Awk reads each input line into $0. It then uses split to partition the line into an array named a, which is then indexed to extract the field of interest.

In [15]:
ls -al logs/  | awk '{split($0,a); print a[2];}'

6728
2
4
1
1
1
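
As a sketch of an alternative that avoids parsing ls output altogether, wc -c can report the size of each file in bytes. The helper name list_sizes is hypothetical and this is only one possible approach, not the required one.

```bash
#!/bin/bash
# Hypothetical helper: prints "name, size in bytes" for every .gz file in a directory.
list_sizes() {
    local dir=$1 f size
    for f in "$dir"/*.gz; do
        [ -e "$f" ] || continue        # skip the literal pattern when nothing matches
        size=$(wc -c < "$f")           # reading from stdin makes wc print only the count
        echo "${f##*/}, $((size))"     # $((...)) trims any padding wc may add
    done
}

# Small demonstration on a scratch file (not a real gzip archive; only its size matters)
scratch=$(mktemp -d)
printf 'hello' > "$scratch/demo.gz"
list_sizes "$scratch"
rm -rf "$scratch"
```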


### Assignment 1.2 [ 25 pts.]

Create a script that, given a path to the *.gz files, reports the size of each log file.

#### Working with time

The first trace, part-00000-of-00500.csv.gz, starts at 19:00 EDT on Sunday, May 1, 2011, and the datacenter is in that time zone (US Eastern, daylight saving time). Timestamps begin 600 seconds before the beginning of the trace period. The instant at which an event occurred is stored in the timestamp field. However, the first event in log part-00000-of-00500.csv.gz occurs at timestamp 0, which is relative to the trace rather than an absolute date, so an offset must be added. Furthermore, timestamps are stored in microseconds, not milliseconds.

In [52]:
from datetime import datetime
from pytz import timezone
from time import mktime

# EDT is equivalent to US/Eastern with daylight saving time in effect
eastern = timezone('US/Eastern')
time = eastern.localize(datetime(2011, 5, 1, 19, 0, 0, 0), is_dst=True).timetuple()
# Note: mktime() interprets the time tuple in the local time zone of the machine running the code
time = int(mktime(time) * 1000)
print('The first log time stamp starts at (milliseconds) : ', time)

The first log time stamp starts at (milliseconds) :  1304301600000


You must add this time (1304301600000) to the timestamps of all events in all logs. How do we do that? Consider the following code. 

In [26]:
%%bash
zcat logs/part-00000-of-00500.csv.gz | head -n 1 | awk -F "\"*,\"*" '{ORS="\t"} {split($0,a); printf "%.2f",((a[1] * 0.001)+ 1304301600000);}'
echo ;
zcat logs/part-00000-of-00500.csv.gz | tail -n 1 | awk -F "\"*,\"*" '{ORS="\t"} {split($0,a); printf "%.2f",((a[1] * 0.001)+ 1304301600000);}'







Now timestamps are correct. The output must be redirected to a file. Then, you must extract the first and last timestamps and transform these values to a human-readable format with the previous Python code. For instance,

In [53]:
datetime.fromtimestamp(float(1304301600000)/1000).strftime('%Y-%m-%d %H:%M:%S.%f')

'2011-05-01 19:00:00.000000'
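
The two zcat pipelines shown earlier each decompress the file once. As a sketch of an alternative, a single awk pass can capture both the first and the last timestamp. The helper name log_time_range is hypothetical, the offset 1304301600000 is the value used in the text above, and the fields are assumed to be unquoted so that a plain comma separator suffices.

```bash
#!/bin/bash
# Hypothetical helper: prints the first and last timestamps of a compressed log,
# converted from microseconds to milliseconds and shifted by the offset from the text.
log_time_range() {
    gzip -dc "$1" | awk -F, -v offset=1304301600000 '
        NR == 1 { first = $1 }   # timestamp of the first event
                { last  = $1 }   # overwritten on every line, so it ends as the last one
        END     { printf "%.2f %.2f\n", first / 1000 + offset, last / 1000 + offset }'
}

# Small demonstration on a scratch log (contents are made up)
scratch=$(mktemp -d)
printf '0,x\n5000000,y\n' | gzip > "$scratch/demo.csv.gz"
log_time_range "$scratch/demo.csv.gz"
rm -rf "$scratch"
```

The two printed values are in milliseconds and can be passed to the Python conversion shown above.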

### Assignment 1.3 [ 25 pts.]

Create a script that computes each log's start and end times.

### Assignment 1.4 [ 25 pts.]

Finally, create a script that invokes scripts 1 to 3 and produces the requested CSV file.
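
If each script writes one column to its own file, one possible way to combine the columns is paste. The sketch below assumes two hypothetical column files, names.txt and sizes.txt, filled with values from the example table.

```bash
#!/bin/bash
# Hypothetical column files, one value per line, with a header on the first line
scratch=$(mktemp -d)
printf 'Name\npart-00000-of-00500\n' > "$scratch/names.txt"
printf 'Size\n100\n'                 > "$scratch/sizes.txt"

# paste joins the files line by line; -d, uses a comma instead of the default tab
csv=$(paste -d, "$scratch/names.txt" "$scratch/sizes.txt")
echo "$csv"

rm -rf "$scratch"
```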


In [90]:
%%bash
# Solution 1.1. Extracting the log name and counting the number of events per log
if [ -f events_per_log.txt ]; then
    rm events_per_log.txt
fi

for i in $(ls -AS ~ahiralesc/mysrc/notebooks/sistemas_operativos/logs/); do
    lines=$(zcat ~ahiralesc/mysrc/notebooks/sistemas_operativos/logs/"$i" | wc -l)
    xbase=${i##*/}
    echo -e "$xbase $lines" >> events_per_log.txt
done

cat events_per_log.txt

part-00000-of-00500.csv.gz 450146
part-00002-of-00500.csv.gz 160642
part-00001-of-00500.csv.gz 77776


# Solutions

In [17]:
%%bash
cd Documents


if [ -f Logs/logs1 ]; then
    rm Logs/logs1
fi

echo "NOMBRE" >> Logs/logs1

for i in $(ls -l SistemasOperativos/ | awk '{split($0,a); print a[9]}');

do
    xbase=${i##*/}
    xpref=${xbase%%.*}
    echo ${xpref} >> Logs/logs1
done

*last update: August 23, 2019*

In [18]:
%%bash 
cd Documents


if [ -f Logs/logs2 ]; then
    rm Logs/logs2
fi

echo "TAMANO" >> Logs/logs2

for i in $(ls -l  SistemasOperativos/ | awk '{split($0,a); print a[5];}')
do
   echo ${i} >> Logs/logs2
done

In [19]:
%%bash 

cd Documents/SistemasOperativos

# Print the first and last shifted timestamps of each log
for f in part-00003-of-00500.csv.gz part-00006-of-00500.csv.gz part-00009-of-00500.csv.gz; do
    zcat "$f" | head -n 1 | awk -F "\"*,\"*" '{ORS="\t"} {split($0,a); printf "%.2f",((a[1] * 0.001)+ 1304301600000);}'
    echo ;
    zcat "$f" | tail -n 1 | awk -F "\"*,\"*" '{ORS="\t"} {split($0,a); printf "%.2f",((a[1] * 0.001)+ 1304301600000);}'
    echo ;
done

1304317234546.29
1304322244631.19
1304332267297.97
1304337277640.49
1304347300877.30
1304352311946.04


In [20]:
from datetime import datetime
f=open ("Documents/Logs/logs3","w")
f.write("FECHA INICIO\n")
f.write(datetime.fromtimestamp(float(1304317234546.29)/1000).strftime('%Y-%m-%d %H:%M:%S.%f'))
f.write("\n")
f.write(datetime.fromtimestamp(float(1304332267297.97)/1000).strftime('%Y-%m-%d %H:%M:%S.%f'))
f.write("\n")
f.write(datetime.fromtimestamp(float(1304347300877.30)/1000).strftime('%Y-%m-%d %H:%M:%S.%f'))
f.close()

In [21]:
from datetime import datetime
f=open ("Documents/Logs/logs4","w")
f.write("FECHA FIN\n")
f.write(datetime.fromtimestamp(float(1304322244631.19)/1000).strftime('%Y-%m-%d %H:%M:%S.%f'))
f.write("\n")
f.write(datetime.fromtimestamp(float(1304337277640.49)/1000).strftime('%Y-%m-%d %H:%M:%S.%f'))
f.write("\n")
f.write(datetime.fromtimestamp(float(1304352311946.04)/1000).strftime('%Y-%m-%d %H:%M:%S.%f'))
f.close()


In [22]:
%%bash

cd Documents


if [ -f Logs/logs5 ]; then
    rm Logs/logs5
fi

echo "EVENTOS" >> Logs/logs5

for i in $(ls -AS SistemasOperativos/); do

    lines=$(zcat SistemasOperativos/"$i" | wc -l)

    echo "$lines" >> Logs/logs5

done

In [23]:
%%bash

cd Documents/Logs

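# -d, joins the columns with commas so that the result is a CSV file (the default delimiter is a tab)
paste -d, logs1 logs2 logs5 logs3 logs4 > LogFinal.csv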

In [24]:
%%bash
pwd

/home/jonathan
