## The unix system

* Prepending commands with !
* Files, directories and paths
* AWS, virtual machines and AMIs
* Bash scripts (just to know what they are)
* grep, wc, sort
* .bashrc, environment variables, paths.

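## Listing files
* The `ls` command lists the files in the current directory.
* `ls -F` marks each entry by type, appending `/` to directories, `*` to executables and `@` to symbolic links.
* `ls -lrta` combines four flags:
    * `l` (long) lists detailed information about each file: permissions, owner, size and modification time.
    * `t` (time) sorts by modification time, newest first; `r` (reverse) reverses the order, so `rt` lists the oldest files first and the newest last.
    * `a` (all) also shows hidden files (whose names start with `.`).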
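
The same kind of listing can be produced from Python. The following is a minimal sketch, not a replacement for `ls`: the helper `list_by_mtime` and the file names are made up for illustration, and the demo runs in a scratch directory so that it does not depend on the contents of your home directory.

```python
import os
import shutil
import tempfile
import time

def list_by_mtime(directory):
    """Return (name, size, mtime) tuples, oldest first, hidden files included."""
    entries = []
    for name in os.listdir(directory):   # includes dot-files, but not "." and ".."
        info = os.stat(os.path.join(directory, name))
        entries.append((name, info.st_size, info.st_mtime))
    return sorted(entries, key=lambda e: e[2])   # ascending time, like "ls -rt"

# Demo in a scratch directory with hypothetical file names
demo_dir = tempfile.mkdtemp()
for offset, name in enumerate(['old.txt', '.hidden', 'new.txt']):
    path = os.path.join(demo_dir, name)
    open(path, 'w').close()
    stamp = 1000000000 + offset
    os.utime(path, (stamp, stamp))       # set explicit access/modification times

listing = list_by_mtime(demo_dir)
for name, size, mtime in listing:
    print('%8d  %s  %s' % (size, time.ctime(mtime), name))

shutil.rmtree(demo_dir)                  # clean up the scratch directory
```

Unlike plain `ls`, `os.listdir` always returns hidden files, so no equivalent of the `a` flag is needed.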

In [None]:
%cd ~
!ls

In [None]:
!ls -F

In [None]:
!ls -lrta

In [None]:
## Find out type of file
!file anaconda/bin/*

## Navigating file paths

In [None]:
# pwd identifies the current working directory
!pwd

In [None]:
# /home/ubuntu is the home directory of the user "ubuntu", that is, you!
!ls -lrt /home/ubuntu

In [None]:
# A shorthand for the home directory of the current user is "~"
%cd ~

In [None]:
# Each ! command runs in its own temporary subshell, so !cd changes the directory
# only inside that subshell; the working directory of python stays where it was
!cd ~/logs/
!pwd

In [None]:
# To change the working directory of the notebook use the magic %cd, which also reports where we landed
%cd ~/logs/

Some useful shorthands for navigating directories
* **~** home directory of current user
* **~john** home directory of the user "john"
* **.** the current directory
* **..** the parent directory of the current directory.
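
Inside Python these shorthands are not expanded automatically; the `os.path` module provides the equivalents. A small sketch (the directory name `logs` is taken from the cells above and need not exist for the expansion to work):

```python
import os

home = os.path.expanduser('~')           # "~"  : home directory of the current user
here = os.path.abspath(os.curdir)        # "."  : the current working directory
parent = os.path.abspath(os.pardir)      # ".." : parent of the current directory
logs = os.path.expanduser('~/logs')      # "~" is also expanded as a path prefix

print('home   = ' + home)
print('here   = ' + here)
print('parent = ' + parent)
print('logs   = ' + logs)
```

`os.path.expanduser('~john')` works the same way for another user's home directory; if no such user exists the string is returned unchanged.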


In [None]:
%cd ~/data
print("!ls -F")
!ls -F
print("!ls ../")
!ls ../

### Symbolic links

In [None]:
# "scripts" and "data" are not actual directories, rather they are symbolic links to directories
%cd ~/data/
!ls *

In [None]:
# Symbolic links are created by the command "ln -s" 
# Here we create a link from the home directory to the directory DSE200/data/NLTK/Chopped
%cd ~
!ln -s DSE200/data/NLTK/Chopped minced
!ls -l minced
# the unix command "cat" prints out the contents of a file.
!cat minced/F0

### Creating files and directories

In [None]:
# to create a directory, use `mkdir`
!mkdir ~/tmp
%cd ~/tmp

In [None]:
# to create a file or update the time-stamp of the file use `touch`
for i in range(10):
    !touch file$i
!ls -l

### Moving and copying files

In [None]:
%cd ~/DSE200/data/NLTK/Chopped/
!ls

In [None]:
# cp copies a file to a new location, maintaining the original copy
!mkdir tmp
!cp F87 tmp   # copy a file to a new location, maintaining the name
!cp F87 tmp/newname # copy a file to a new location + name
!ls -l F87 tmp

In [None]:
# you can also copy a whole directory and all its subdirectories
!cp -r tmp newtmp

In [None]:
# mv moves a file, or a whole directory, to a new location or a new name.
# Within a single file system it only updates directory entries instead of copying
# the data, so it is much faster than cp (loosely similar to the difference between
# a shallow and a deep copy in python)
!mv newtmp tmp # move directory to a new location
!ls tmp/*

In [None]:
# mv file or directory to a new name (=rename)
!mv tmp/newname tmp/newername
!ls tmp/*

In [None]:
#cleanup
!rm -r tmp

### Removing files and directories

In [None]:
# to remove a file use the command `rm`
# the files file0 ... file9 were created in ~/tmp, so we first return there
%cd ~/tmp
for i in range(1,10,2):
    !rm file$i
!ls -l

In [None]:
# to remove an empty directory, use `rmdir`
# If you want to remove a directory and everything that is in it use `rm -rf`. Note that this
# is an irreversible action, it is NOT like moving a file to the trash bin.
%cd ~
!rm -rf tmp
!ls -l tmp

### Groups and Unix File Permissions

It is often the case that a file should not be readable or writable by every user of a machine (e.g. private data or system configuration).  
To enforce this, Unix attaches a set of permission properties to every file.

Each user falls into one of three relationships with the file:

* **Owner** - The user who owns the file (by default the user who created it) and is able to modify its permissions
* **Group** - Users who belong to the user group assigned to the file (we won't talk about this much)
* **World** - Everybody else

Each file has three permissions for each of these user sets:

* **Read**  - The ability to view the file's contents
* **Write** - The ability to modify the file
* **Execute** - The ability to run the file (if it is a script or program).  

Since there are three user sets and three permissions, there are 9 distinct true/false permissions which can be granted.  Thus each file has 9 bits to define these permissions.

#### Viewing permissions

To view the permissions of a file, use the `-l` option of `ls`.

In [None]:
#First we create some files:
#Disregard the chmod command for now
!mkdir examples
%cd examples
!touch NoPermissions
!chmod 000 NoPermissions 
!touch AllRead
!chmod 444 AllRead 
!touch FullPermission
!chmod 777 FullPermission 
!touch OwnerOnly
!chmod 700 OwnerOnly 
!touch GroupOnly
!chmod 070 GroupOnly 
!touch WorldOnly
!chmod 007 WorldOnly 

#Now we list the permissions of the files
!ls -l

#Return to old working directory
%cd ../


In the first column of the output you see dashes for ungranted permissions and letters (r, w, or x) for granted permissions.  

Let's break this down:

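| Position | Definition |
|---|------------------|
| 1 | File type\* |
| 2 | Owner Read |
| 3 | Owner Write |
| 4 | Owner Execute |
| 5 | Group Read |
| 6 | Group Write |
| 7 | Group Execute |
| 8 | World Read |
| 9 | World Write |
| 10| World Execute |

\* The first character is the file type: `-` for a regular file, `d` for a directory, `l` for a symbolic link. The sticky bit is a special permission which, when set, is shown in position 10 in place of the execute flag; we won't be going into it.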

Since you have three groups of three binary permissions, a common way to refer to permissions is via the octal representation of the nine permission flags (entries 2-10 in the table above).  This yields a 3-digit octal number with the leftmost digit giving the owner permissions, the middle digit the group permissions, and the rightmost digit the world permissions.

For example: `-r--rw---x` translates to 461 in octal. Read counts 4, write 2 and execute 1, so `r--` is 4, `rw-` is 4+2 = 6 and `--x` is 1.
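
The translation from the `ls -l` string to the octal code can be sketched in a few lines of Python. The function name `symbolic_to_octal` is made up for this illustration, and the sketch handles only the plain `r`, `w`, `x` and `-` characters, not special flags such as the sticky bit.

```python
def symbolic_to_octal(listing):
    """Convert a 10-character ls -l mode string such as '-r--rw---x' to '461'."""
    if len(listing) != 10:
        raise ValueError('expected 10 characters, got %d' % len(listing))
    perms = listing[1:]                        # drop the leading file-type character
    weights = {'r': 4, 'w': 2, 'x': 1, '-': 0}
    digits = []
    for start in (0, 3, 6):                    # owner, group, world
        triplet = perms[start:start + 3]
        digits.append(str(sum(weights[c] for c in triplet)))
    return ''.join(digits)

print(symbolic_to_octal('-r--rw---x'))   # 461
print(symbolic_to_octal('-rwxr-xr-x'))   # 755
```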

#### Changing Permissions

To change permissions, the owner of a file can use the command *chmod*.  The most common form takes the octal code of the desired permissions followed by the file name.  For examples of this, look at the `chmod` commands we used to create the files above.
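
Permissions can also be changed and inspected from Python with `os.chmod` and `os.stat`. The sketch below works on a scratch file created with `tempfile.mkstemp`; the helper `octal_mode` is a made-up name for this example.

```python
import os
import stat
import tempfile

def octal_mode(path):
    """Return the permission bits of path as a 3-digit octal string."""
    return '%03o' % stat.S_IMODE(os.stat(path).st_mode)   # S_IMODE strips the file-type bits

fd, path = tempfile.mkstemp()     # scratch file, so no real data is touched
os.close(fd)

os.chmod(path, 0o444)             # same effect as: chmod 444 <file>
read_only = octal_mode(path)

os.chmod(path, 0o700)             # same effect as: chmod 700 <file>
owner_only = octal_mode(path)

print(read_only)
print(owner_only)
os.remove(path)
```

Note the `0o` prefix: the mode must be given as an octal number, so `0o444` rather than the decimal `444`.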

## Manual pages

In [None]:
%man ls

## Exploring the computer

In [None]:
# Find out which version of Ubuntu you are running 
!lsb_release -a

In [None]:
# find out about the hardware
!cat /proc/cpuinfo

In [None]:
# find out how much memory you are using
!free -m

In [None]:
# find out how much disk space you are using
!df

In [None]:
#find out which directories consume most of this disk space
%cd ~
!du -s *

In [None]:
# based on what we see here, we check the directory anaconda
!du -s anaconda/*

## Analyzing data
head, tail, more, grep, wc, sort, cut (awk)

In [None]:
%cd ~/DSE200/data/ThinkStatsData/
!ls

In [None]:
# print the number of lines, words and characters in each file
!wc *

In [None]:
# print the first 2 lines of a file
!head -2 2002FemPreg.dat

In [None]:
## This list of tuples defines the name, start column, end column and type of each field
## (columns are counted from 1 and the end column is inclusive).
fields=[
    ('caseid', 1, 12, int),
    ('nbrnaliv', 22, 22, int),
    ('babysex', 56, 56, int),
    ('birthwgt_lb', 57, 58, int),
    ('birthwgt_oz', 59, 60, int),
    ('prglength', 275, 276, int),
    ('outcome', 277, 277, int),
    ('birthord', 278, 279, int),
    ('agepreg', 284, 287, int),
    ('finalwgt', 423, 440, float),
]

In [None]:
## Let's transform it into a dictionary whose keys are the names of the fields
fields_dict={name:(f,t,typ) for (name,f,t,typ) in fields}
fields_dict

In [None]:
# print the lines that contain a particular string
string='3116'
!grep $string 2002FemPreg.dat

In [None]:
#suppose we just want to know how many lines have this string inside them.
# this is our first use of pipes
#the output from grep serves as the input to wc
!wc 2002FemPreg.dat
!grep $string 2002FemPreg.dat | wc

In [None]:
#cut is a command that cuts specific fields from each line 
%man cut

In [None]:
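# Extract from each line a specific field
# the field positions are 1-based and inclusive, which is also how cut -c counts columns
field='babysex'
(fr,to,typ)=fields_dict[field]
Range=str(fr)+'-'+str(to)
print('%s %d %d %s' % (field,fr,to,Range))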
!cut -c $Range 2002FemPreg.dat | head -5

In [None]:
# let's sort these lines numerically, and look at the end, also known as the tail
!cut -c $Range 2002FemPreg.dat | sort -n | tail

In [None]:
#count the number of times each value appears using uniq
!cut -c $Range 2002FemPreg.dat | sort -n | uniq -c

In [None]:
# do the same thing but using an intermediary file
!cut -c $Range 2002FemPreg.dat > cut$Range
print('head of cut '+Range)
!head cut$Range
!ls
print('output from uniq')
!cat cut$Range | sort -n | uniq -c

## Environment variables

Environment variables are named strings that describe the setup of the session, such as the home directory or the command search path. Programs started from the session inherit them, so the user can avoid
retyping the same parameters over and over.
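
From Python, the environment is available as the dictionary-like object `os.environ`. The following sketch reads two standard variables and shows that a variable set in the parent is inherited by a child process; the variable name `DSE_DEMO_VAR` is invented for the demonstration.

```python
import os
import subprocess
import sys

home = os.environ.get('HOME')                              # None if the variable is unset
search_path = os.environ.get('PATH', '').split(os.pathsep)  # PATH is a ":"-separated list

os.environ['DSE_DEMO_VAR'] = 'hello'                       # hypothetical variable name

# Start a child python process; it inherits the environment of its parent
child_code = "import os; print(os.environ['DSE_DEMO_VAR'])"
child_output = subprocess.check_output([sys.executable, '-c', child_code],
                                       universal_newlines=True)

print('HOME is %s' % home)
print('PATH has %d entries' % len(search_path))
print('child sees: ' + child_output.strip())
```

Changes made through `os.environ` affect only the current Python process and the processes it starts, not the shell that launched it.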

In [None]:
#view all of the currently defined environment variables
%env

In [None]:
#view the values of particular variables 
!echo $HOME $USER

In [None]:
#In the shell, the value of a variable is referenced by prefixing its name with $
#When used inside ipython, $ in a ! command can also refer to any currently defined python variable
i=25
!echo $i

In [None]:
#Particularly important are environment variables called "paths"
!env | grep -i path

In [None]:
# The path defines where the system will look for commands and in what order.
# PATH tells the unix shell (bash) where to find the executables corresponding to commands
# while PYTHONPATH tells python from where to `import` packages.
# let's see where unix finds the command "sort"
!which sort
# Check on the variable PATH and you will see that /usr/bin is on it.

In [None]:
# whereis lists the binary, source and manual page files it can find for sort
!whereis sort
# The last one is the manual page for sort which you can view using the command %man sort

#### Exercise 
Find the location of the commands `python`, `ipython` and `mail`.

In [None]:
!which python

## Wildcards and glob

We have seen the most used wildcard `*`, which matches any sequence of characters, including the empty sequence.
For example `B*.py` will match any filename that starts with `B` and ends with `.py`.

Other useful wildcards are:

| Wildcard | Description |
|--------|--------------------------------------------------------|
| `*`   |  An asterisk matches any number of characters in a filename, including none. |
| `?`   |  The question mark matches any single character. |
| `[ ]` |  Brackets enclose a set of characters, any one of which may match a single character at that position. |
| `-`   |  A hyphen used within `[ ]` denotes a range of characters. |
| `~`   |  A tilde at the beginning of a word expands to the name of your home directory.  If you append another user's login name to the character, it refers to that user's home directory. |
    
**Here are some examples:**

1. `cat c*` displays every file whose name begins with c, including the file c itself, if it exists.
1. `ls *.c` lists all files that have a .c extension.
1. `cp ../rmt? .` copies every file in the parent directory whose name is four characters long and begins with rmt to the working directory. (The names will remain the same.)
1. `ls rmt[34567]` lists every file whose name is rmt followed by a single 3, 4, 5, 6, or 7.
1. `ls rmt[3-7]` does exactly the same thing as the previous example.
1. `ls ~` lists your home directory.
1. `ls ~hessen` lists the home directory of the user with the login name hessen.
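
To experiment with patterns without touching the file system, the standard module `fnmatch` applies the same `*`, `?` and `[ ]` rules to plain strings. In this sketch the list `names` is a made-up set of file names and `matching` is a helper defined only for the example.

```python
from fnmatch import fnmatchcase

# Hypothetical file names, chosen to exercise the patterns listed above
names = ['c', 'cat.c', 'rmt3', 'rmt7', 'rmt8', 'rmt', 'rmt42', 'notes.txt']

def matching(pattern):
    """Return the names that match a shell-style wildcard pattern."""
    return [n for n in names if fnmatchcase(n, pattern)]

print(matching('c*'))         # ['c', 'cat.c']
print(matching('*.c'))        # ['cat.c']
print(matching('rmt?'))       # ['rmt3', 'rmt7', 'rmt8']
print(matching('rmt[3-7]'))   # ['rmt3', 'rmt7']
```

One difference from the shell: `fnmatch` does not treat a leading `.` specially, so `*` also matches hidden file names, and `~` is not expanded.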

#### Within python, use glob()

You get the same functionality as wildcards by using the function `glob`, but instead of getting the result printed out, you get it as a list of strings.

In [None]:
!ls -d li* # do not descend into directories
from glob import glob
L=glob('li*')
L

## Loading and saving files

It is often useful to load short files into the notebook, alter them, and save them back into the file system. The magics `%load` and `%%writefile` are used to do that.


In [None]:
!ls

In [None]:
# the magic %load, unlike !cat, replaces the contents of this cell with the contents of the file,
# so that the code can be edited and executed inside the notebook
%load survey.pl

## Processes

A process is a running instance of a program: a sequence of instructions that are executed one after the other.
By using "time sharing", a single CPU can appear to run many processes at the same time by frequently switching from one process to the next. A multi-core machine has several CPU cores, so several processes can truly execute at the same time.
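
Every process has a numeric process id (PID). As a small illustration, the sketch below prints the PID of the notebook's Python process, starts a second Python process and reads back its PID, and reports the number of cores the machine has. The variable names are chosen for this example only.

```python
import multiprocessing
import os
import subprocess
import sys

parent_pid = os.getpid()                      # id of the process running this code

# Start a second python process and ask it for its own process id
child_code = "import os; print(os.getpid())"
child_output = subprocess.check_output([sys.executable, '-c', child_code],
                                       universal_newlines=True)
child_pid = int(child_output.strip())

cores = multiprocessing.cpu_count()           # number of CPU cores seen by the system

print('parent process id: %d' % parent_pid)
print('child process id:  %d' % child_pid)
print('number of cores:   %d' % cores)
```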

In [None]:
# You can find out the current processes on your system using the command "top"
# without flags, the command will open a window that will constantly update and that also
# allows you to quit (or kill) processes. Here we use the flags to specify that top should only run once.
!top -b -n 1

## Pipes
We used pipes above to communicate between two or more unix commands.
We now discuss this in more detail.

Unix processes have three default input and output channels
* **stdin** the standard input channel - by default - the keyboard
* **stdout** the standard output channel - by default - the terminal.
* **stderr** the standard error channel - by default - the terminal.

Channels can be used to connect programs to each other and to connect programs and files. This is called **I/O redirection**.
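
The same connection between programs can be built from Python by passing the stdout of one `subprocess.Popen` object as the stdin of the next. The sketch below reproduces the `sort -n | uniq -c` pipeline used earlier, assuming the commands `sort` and `uniq` are installed; the input values are made up.

```python
import subprocess

values = '2\n1\n2\n2\n1\n9\n'     # made-up input, one value per line

# Equivalent to the shell pipeline:  sort -n | uniq -c
sort_proc = subprocess.Popen(['sort', '-n'], stdin=subprocess.PIPE,
                             stdout=subprocess.PIPE, universal_newlines=True)
uniq_proc = subprocess.Popen(['uniq', '-c'], stdin=sort_proc.stdout,
                             stdout=subprocess.PIPE, universal_newlines=True)
sort_proc.stdout.close()          # uniq now owns the read end of the pipe

sort_proc.stdin.write(values)     # feed the data into the head of the pipeline
sort_proc.stdin.close()           # end of input: sort can now emit its result

output = uniq_proc.communicate()[0]
sort_proc.wait()

# Each output line of uniq -c is "<count> <value>"
counts = {}
for line in output.splitlines():
    count, value = line.split()
    counts[value] = int(count)
print(counts)
```

Writing all the input before reading the output is safe here only because the data is small; for large inputs the writing and reading would have to be interleaved to avoid filling the pipe.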

Connecting a standard channel to a file is done using the following symbols
(in bash, which is the standard shell in both Ubuntu and OS X).

| command      | result |
|--------------|-----------------------------------------------------|
| < filename   | Redirect stdin to read from the file "filename" |
| > filename  | Redirect stdout to file "filename." |
| >>filename  | Redirect and append stdout to file "filename." |
| 1>filename   | Redirect stdout to file "filename." |
| 1>>filename  | Redirect and append stdout to file "filename." |
| 2>filename   | Redirect stderr to file "filename." |
| 2>>filename  | Redirect and append stderr to file "filename." |
| &>filename   | Redirect both stdout and stderr to file "filename." |
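
When a program is started from Python rather than from bash, the same redirections are expressed through the `stdout` and `stderr` arguments of the `subprocess` functions. A minimal sketch, in which the file name `out.txt` and the tiny child program are invented for the example:

```python
import os
import shutil
import subprocess
import sys
import tempfile

workdir = tempfile.mkdtemp()
out_path = os.path.join(workdir, 'out.txt')    # hypothetical file name

# A tiny child program that writes one line to stdout and one line to stderr
child = [sys.executable, '-c',
         "import sys; sys.stdout.write('to stdout\\n'); sys.stderr.write('to stderr\\n')"]

# Like:  cmd > out.txt 2> /dev/null
with open(out_path, 'w') as out, open(os.devnull, 'w') as devnull:
    subprocess.call(child, stdout=out, stderr=devnull)

# Like:  cmd >> out.txt 2> /dev/null   (mode 'a' appends instead of overwriting)
with open(out_path, 'a') as out, open(os.devnull, 'w') as devnull:
    subprocess.call(child, stdout=out, stderr=devnull)

with open(out_path) as f:
    contents = f.read()
print(contents)

shutil.rmtree(workdir)                         # clean up the scratch directory
```

Passing `stderr=subprocess.STDOUT` instead would merge both channels into the same file, which corresponds to `&>filename`.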


In [None]:
# One of the most basic unix commands is cat
%cd ~
!cat < .bash_logout

In [None]:
# echo is another basic command, it writes the string it gets as a parameter to stdout
# To create a file with some specific line we can use
!echo "MAS-DSE is the best" | cat > "MAS-FILE"
!ls MAS*
!cat MAS-FILE
!rm MAS-FILE

In [None]:
# Sometimes we want to suppress the error messages.
# To do that we redirect stderr to the special file /dev/null, which discards everything written to it
# In addition, we can take the output and sort it (numerically, in reverse) according to the size
!du -s /* 2> /dev/null | sort -nr



## Interacting with external programs through pipes

In [None]:
#the command top gives us a snapshot of the currently running processes
!top -b -n 1

In [None]:
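# We can also run the program from a plain python script (the ! syntax works only inside ipython)

# here we use python to find those processes that take a non-zero part of the memory.
import subprocess

# universal_newlines=True returns the output as text rather than bytes
output = subprocess.check_output(['top', '-b','-n','1'], universal_newlines=True)
print('Have %d characters in output' % len(output))

lines=output.splitlines() # break output into lines
for line in lines:
    # columns 47-49 are assumed to hold the memory percentage; the offset depends on the version of top
    percent = line[47:50]
    try:
        p=float(percent)
    except ValueError:  # header and blank lines do not contain a number here
        continue
    if p>0:
        print(line)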

In [None]:
# If we use top in its default non-batch form, which expects an interactive terminal, the call fails
output = subprocess.check_output(['top'])
print('Have %d bytes in output' % len(output))


In [None]:
# it runs fine with !
!top

In [None]:
# what we need to do is run the process 'top' in the background - in parallel to the ipython session
# and print what the program outputs to stdout as it becomes available

# The slight problem is that this code has a bug and it does not work right, can you make it work?

import select
from time import sleep

def dataWaiting(source):
    " Check if data is waiting to be read "
    return select.select([source], [], [], 0) == ([source], [], [])

proc = subprocess.Popen(['top'], 
                        stdout=subprocess.PIPE,
                        shell=False
                        )

while True:
    sleep(0.1)
    try:
        for line in proc.stdout.readline():
            print line
    except:
        print 'exception'
        
#while True:
#    sleep(0.1)
#    if dataWaiting(proc.stdout):
#        print 'Data Available'
#        line=proc.stdout.readline()
#        print line,
#    else:
#        print 'not available'
