# Table of Contents
 <p><div class="lev1 toc-item"><a href="#Introduction-to-the-command-line" data-toc-modified-id="Introduction-to-the-command-line-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction to the command line</a></div><div class="lev2 toc-item"><a href="#Navigating-the-filesystem" data-toc-modified-id="Navigating-the-filesystem-11"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Navigating the filesystem</a></div><div class="lev2 toc-item"><a href="#Quotes,-special-characters,-and-variables" data-toc-modified-id="Quotes,-special-characters,-and-variables-12"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Quotes, special characters, and variables</a></div><div class="lev2 toc-item"><a href="#Displaying-file-contents" data-toc-modified-id="Displaying-file-contents-13"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Displaying file contents</a></div><div class="lev2 toc-item"><a href="#Redirection-of-standard-input/output-and-pipes" data-toc-modified-id="Redirection-of-standard-input/output-and-pipes-14"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Redirection of standard input/output and pipes</a></div><div class="lev2 toc-item"><a href="#Running-a-script" data-toc-modified-id="Running-a-script-15"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Running a script</a></div><div class="lev2 toc-item"><a href="#Pattern-matching-and-counting:-grep,-wc,-sort,-uniq-and-a-bit-of-awk" data-toc-modified-id="Pattern-matching-and-counting:-grep,-wc,-sort,-uniq-and-a-bit-of-awk-16"><span class="toc-item-num">1.6&nbsp;&nbsp;</span>Pattern matching and counting: grep, wc, sort, uniq and a bit of awk</a></div>

# Introduction to the command line

A summary of basic command line / shell usage, as discussed in class.


**Note:** If you're new to the command line, please do not run these commands in directories with important content, as you may overwrite existing files. A good "scratch space" for exploring is the /tmp directory or similar (type "cd /tmp").

## Navigating the filesystem

In [1]:
# list files
ls

download_trips.sh        flip_coins.R
explore_trips.sh         intro_command_line.ipynb


In [2]:
# list the present working directory
pwd

/Users/jmh/teaching/msd2017/public/lectures/lecture_2


In [3]:
# go one directory "up" in hierarchy, list files, come back
cd ..
ls
cd lecture_2

lecture_1 lecture_2 lecture_3


In [4]:
# get a (terse) description of options for ls (BSD/macOS ls rejects --help, but still prints its usage line)
ls --help

ls: illegal option -- -
usage: ls [-ABCFGHLOPRSTUWabcdefghiklmnopqrstuwx1] [file ...]


: 1

See 'man ls' for more (q to quit) and 'man man' for how to use man pages

In [5]:
# use options to see all files (including hidden ones, starting with '.') in a list format
ls -a -l

total 104
drwxr-xr-x  7 jmh  staff    238 Jan 31 15:28 .
drwxr-xr-x  6 jmh  staff    204 Jan 31 14:22 ..
drwxr-xr-x  3 jmh  staff    102 Jan 31 14:23 .ipynb_checkpoints
-rwxr-xr-x  1 jmh  staff    926 Jan 31 15:04 download_trips.sh
-rwxr-xr-x  1 jmh  staff   2252 Jan 27 08:34 explore_trips.sh
-rw-r--r--@ 1 jmh  staff    344 Jan 27 09:24 flip_coins.R
-rw-r--r--  1 jmh  staff  37842 Jan 31 15:25 intro_command_line.ipynb


In [6]:
# shortcut for same, adding human readable file sizes
ls -alh

total 104
drwxr-xr-x  7 jmh  staff   238B Jan 31 15:28 .
drwxr-xr-x  6 jmh  staff   204B Jan 31 14:22 ..
drwxr-xr-x  3 jmh  staff   102B Jan 31 14:23 .ipynb_checkpoints
-rwxr-xr-x  1 jmh  staff   926B Jan 31 15:04 download_trips.sh
-rwxr-xr-x  1 jmh  staff   2.2K Jan 27 08:34 explore_trips.sh
-rw-r--r--@ 1 jmh  staff   344B Jan 27 09:24 flip_coins.R
-rw-r--r--  1 jmh  staff    37K Jan 31 15:25 intro_command_line.ipynb


## Quotes, special characters, and variables

In [7]:
# use '*' as a wildcard to list all files ending in '.sh'
ls -l *sh

-rwxr-xr-x  1 jmh  staff   926 Jan 31 15:04 download_trips.sh
-rwxr-xr-x  1 jmh  staff  2252 Jan 27 08:34 explore_trips.sh


In [8]:
# be careful with quotes: they prevent some characters, such as *, from being interpreted by the shell
ls -l "*sh"
ls -l '*sh'

ls: *sh: No such file or directory
ls: *sh: No such file or directory


: 1

In [9]:
# for instance, the following are equivalent for printing the string 'sh' to the screen
echo sh
echo "sh"
echo 'sh'

sh
sh
sh


In [10]:
# but these are quite different
echo *sh    # the shell expands the pattern to all files ending in sh, then echo prints their names
echo "*sh"  # double quotes prevent expansion
echo '*sh'  # single quotes prevent expansion

download_trips.sh explore_trips.sh
*sh
*sh


In [11]:
# yet another subtlety between double and single quotes is how variables are handled
# double quotes expand variables, but not glob patterns
# single quotes treat things literally

# define a variable for the extension and use it below
ext=sh
echo *$ext    # substitutes 'sh' for the variable, expands the pattern to all files ending in sh, then prints their names
echo "*$ext"  # substitutes 'sh' for the variable, but double quotes prevent glob expansion
echo '*$ext'  # prevents any substitution and treats the string literally

download_trips.sh explore_trips.sh
*sh
*$ext


More on quoting here http://tldp.org/LDP/abs/html/quoting.html#QUOTINGREF
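
The three quoting behaviours can be tried safely in a throwaway directory. This is a minimal sketch: the file names a.sh, b.sh and notes.txt are made up for the demo, and mktemp -d is assumed to be available (it is on most Linux and macOS systems).

```bash
# work in a fresh temporary directory so nothing important is touched
scratch=$(mktemp -d)
cd "$scratch"
touch a.sh b.sh notes.txt   # hypothetical files, just for this demo

ext=sh
unquoted=$(echo *$ext)    # variable is substituted, then the glob is expanded
double=$(echo "*$ext")    # variable is substituted, the glob is left alone
single=$(echo '*$ext')    # nothing is substituted or expanded

echo "$unquoted"
echo "$double"
echo "$single"
```

Only the unquoted form should list the two .sh files; the other two print the pattern itself.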

## Displaying file contents

In [12]:
# concatenate contents of script on screen
cat download_trips.sh

#!/bin/bash
#
# description:
#   fetches trip files from the citibike site http://www.citibikenyc.com/system-data
#   e.g., https://s3.amazonaws.com/tripdata/201307-citibike-tripdata.zip
#
# usage: ./download_trips.sh
#
# requirements: curl or wget
#
# author: jake hofman
#

# set a relative path for the citibike data
# (use current directory by default)
DATA_DIR=.

# change to the data directory
cd $DATA_DIR

# loop over each year/month
for year in 2014
do
    for month in 02 #01 02 03 04 05 06 07 08 09 10 11 12
    do

    # download the zip file
    # alternatively you can use wget if you don't have curl
    # wget $url
    url=https://s3.amazonaws.com/tripdata/${year}${month}-citibike-tripdata.zip
    curl -O $url

    # define local file names
    file=`basename $url`
    csv=${file//.zip/}".csv"

    # unzip the downloaded file
    unzip -p $file > $csv

    # remove the zip file
    rm $file
    done
done


more is a friendlier version of cat that paginates the output so you can scroll through it (q to quit)

    more download_trips.sh

Less is better than more

    less download_trips.sh

Each of these can take multiple arguments, showing several files in a row, e.g.

    cat file1 file2
    more file1 file2
    less file1 file2

## Redirection of standard input/output and pipes

In [13]:
# redirect output of program to file outputfile using
# program > outputfile
ls -alh > files

In [14]:
# redirect contents of file inputfile to program using
# program < inputfile
cat < files

total 104
drwxr-xr-x  8 jmh  staff   272B Jan 31 15:29 .
drwxr-xr-x  6 jmh  staff   204B Jan 31 14:22 ..
drwxr-xr-x  3 jmh  staff   102B Jan 31 14:23 .ipynb_checkpoints
-rwxr-xr-x  1 jmh  staff   926B Jan 31 15:04 download_trips.sh
-rwxr-xr-x  1 jmh  staff   2.2K Jan 27 08:34 explore_trips.sh
-rw-r--r--  1 jmh  staff     0B Jan 31 15:29 files
-rw-r--r--@ 1 jmh  staff   344B Jan 27 09:24 flip_coins.R
-rw-r--r--  1 jmh  staff    37K Jan 31 15:25 intro_command_line.ipynb


In [15]:
# redirect output of program1 to input of program2 using
# program1 | program2
# add line numbers to directory listing
ls -talh | cat -n

     1	total 112
     2	drwxr-xr-x  8 jmh  staff   272B Jan 31 15:29 .
     3	-rw-r--r--  1 jmh  staff   473B Jan 31 15:29 files
     4	-rw-r--r--  1 jmh  staff    37K Jan 31 15:25 intro_command_line.ipynb
     5	-rwxr-xr-x  1 jmh  staff   926B Jan 31 15:04 download_trips.sh
     6	drwxr-xr-x  3 jmh  staff   102B Jan 31 14:23 .ipynb_checkpoints
     7	drwxr-xr-x  6 jmh  staff   204B Jan 31 14:22 ..
     8	-rw-r--r--@ 1 jmh  staff   344B Jan 27 09:24 flip_coins.R
     9	-rwxr-xr-x  1 jmh  staff   2.2K Jan 27 08:34 explore_trips.sh


Visual trick: think of redirection operators as "funnels"

When using '>' or '<', you should always have a program on the left and a file on the right

When using '|', both left and right should have a program

An example of confusing the above:

      ls -l > more will write directory contents (from ls -l) to the
      file named "more"
      (you probably want ls -l | more, which will paginate ls -l)
    
The following are equivalent:

       cat inputfile | program
       program < inputfile

More on redirection at http://tldp.org/LDP/abs/html/io-redirection.html#IOREDIRREF

More on pipes at http://tldp.org/LDP/abs/html/special-chars.html#PIPEREF
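
The rules above can be checked on a small scale. This is a sketch under simple assumptions: it runs in a temporary directory, and the file name fruits and its contents are invented for the demo.

```bash
# work in a fresh temporary directory so nothing important is overwritten
scratch=$(mktemp -d)
cd "$scratch"

# program > outputfile: write three lines to a file named "fruits"
printf 'pear\napple\nfig\n' > fruits

# program < inputfile and cat inputfile | program feed sort the same data
via_redirect=$(sort < fruits)
via_pipe=$(cat fruits | sort)
echo "$via_redirect"

# the mix-up described above: '>' treats "more" as a file name, not a program
ls > more
ls
```

The final ls should show two files, fruits and more, which confirms that the redirect created a file rather than paginating anything.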

## Running a script

In [16]:
# download one month of citibike data
bash download_trips.sh
# or ./download_trips.sh for short if the file is executable and has the proper first line ('#!/bin/bash')

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 7076k  100 7076k    0     0  7113k      0 --:--:-- --:--:-- --:--:-- 7111k
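
The two ways of running a script can be compared on a tiny example. This is a sketch only: hello.sh is a made-up script, written into a temporary directory, and it assumes bash is installed at /bin/bash as in the script above.

```bash
scratch=$(mktemp -d)
cd "$scratch"

# the <<'EOF' ... EOF lines write everything in between into hello.sh
cat > hello.sh <<'EOF'
#!/bin/bash
echo "hello from a script"
EOF

# running it through bash works even without execute permission
via_bash=$(bash hello.sh)

# to run it as ./hello.sh, mark it executable first
chmod +x hello.sh
via_direct=$(./hello.sh)

echo "$via_bash"
echo "$via_direct"
```

Both invocations should print the same line; without the chmod step, the ./hello.sh form would typically fail with a "Permission denied" error.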


In [17]:
ls -alh *.csv

-rw-r--r--  1 jmh  staff    42M Jan 31 15:29 201402-citibike-tripdata.csv


## Pattern matching and counting: grep, wc, sort, uniq and a bit of awk

In [18]:
# look at the first line
head -n1 201402-citibike-tripdata.csv

"tripduration","starttime","stoptime","start station id","start station name","start station latitude","start station longitude","end station id","end station name","end station latitude","end station longitude","bikeid","usertype","birth year","gender"


In [19]:
# create a numbered list of columns, translating commas to newlines and adding line numbers
head -n1 201402-citibike-tripdata.csv | tr , '\n' | cat -n

     1	"tripduration"
     2	"starttime"
     3	"stoptime"
     4	"start station id"
     5	"start station name"
     6	"start station latitude"
     7	"start station longitude"
     8	"end station id"
     9	"end station name"
    10	"end station latitude"
    11	"end station longitude"
    12	"bikeid"
    13	"usertype"
    14	"birth year"
    15	"gender"
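
The same pipeline works on any comma-separated header. A small sketch, using only the first three column names from the header above so that it runs without the data file; the grep -n step is an optional extra for looking up a single column's position.

```bash
# the first three column names from the trip data header
header='"tripduration","starttime","stoptime"'

# one column name per line, each prefixed with its position
numbered=$(echo "$header" | tr , '\n' | cat -n)
echo "$numbered"

# look up the position of one column: grep -n prefixes each match with its line number
stoptime_line=$(echo "$header" | tr , '\n' | grep -n 'stoptime')
echo "$stoptime_line"
```

The number reported for a column is the value to pass to cut -f when extracting it.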


In [20]:
# look at the first 10 lines
head 201402-citibike-tripdata.csv

"tripduration","starttime","stoptime","start station id","start station name","start station latitude","start station longitude","end station id","end station name","end station latitude","end station longitude","bikeid","usertype","birth year","gender"
"382","2014-02-01 00:00:00","2014-02-01 00:06:22","294","Washington Square E","40.73049393","-73.9957214","265","Stanton St & Chrystie St","40.72229346","-73.99147535","21101","Subscriber","1991","1"
"372","2014-02-01 00:00:03","2014-02-01 00:06:15","285","Broadway & E 14 St","40.73454567","-73.99074142","439","E 4 St & 2 Ave","40.7262807","-73.98978041","15456","Subscriber","1979","2"
"591","2014-02-01 00:00:09","2014-02-01 00:10:00","247","Perry St & Bleecker St","40.73535398","-74.00483091","251","Mott St & Prince St","40.72317958","-73.99480012","16281","Subscriber","1948","2"
"583","2014-02-01 00:00:32","2014-02-01 00:10:15","357","E 11 St & Broadway","40.73261787","-73.99158043","284","Greenwich Ave & 8 Ave","40.7390169121","-74.0

In [21]:
# look at the last 10 lines
tail 201402-citibike-tripdata.csv

"982","2014-02-28 23:53:47","2014-03-01 00:10:09","512","W 29 St & 9 Ave","40.7500727","-73.99839279","483","E 12 St & 3 Ave","40.73223272","-73.98889957","18209","Subscriber","1989","1"
"1210","2014-02-28 23:53:59","2014-03-01 00:14:09","312","Allen St & E Houston St","40.722055","-73.989111","505","6 Ave & W 33 St","40.74901271","-73.98848395","17073","Subscriber","1984","2"
"988","2014-02-28 23:54:45","2014-03-01 00:11:13","428","E 3 St & 1 Ave","40.72467721","-73.98783413","268","Howard St & Centre St","40.71910537","-73.99973337","20813","Subscriber","1980","1"
"267","2014-02-28 23:55:39","2014-03-01 00:00:06","301","E 2 St & Avenue B","40.72217444","-73.98368779","403","E 2 St & 2 Ave","40.72502876","-73.99069656","16937","Subscriber","1978","1"
"175","2014-02-28 23:57:12","2014-03-01 00:00:07","383","Greenwich Ave & Charles St","40.735238","-74.000271","284","Greenwich Ave & 8 Ave","40.7390169121","-74.0026376103","15220","Subscriber","1956","1"
"848","2014-02-28 23:57:13","2014

In [22]:
# count the number of lines in this file
wc -l 201402-citibike-tripdata.csv

  224737 201402-citibike-tripdata.csv


In [23]:
# extract rider gender in column 15, specifying ',' as a delimiter
# limit output to first 10 lines
cut -d, -f15 201402-citibike-tripdata.csv | head

"gender"
"1"
"2"
"2"
"1"
"1"
"1"
"1"
"1"
"1"


In [24]:
# find the earliest birth year in column 14
cut -d, -f14 201402-citibike-tripdata.csv | sort | head

"1899"
"1899"
"1899"
"1899"
"1899"
"1899"
"1899"
"1899"
"1899"
"1900"


In [25]:
# find the latest birth year in column 14
cut -d, -f14 201402-citibike-tripdata.csv | sort | tail

\N
\N
\N
\N
\N
\N
\N
\N
\N
\N


In [26]:
# find all trips either starting or ending on broadway
grep Broadway 201402-citibike-tripdata.csv | head

"372","2014-02-01 00:00:03","2014-02-01 00:06:15","285","Broadway & E 14 St","40.73454567","-73.99074142","439","E 4 St & 2 Ave","40.7262807","-73.98978041","15456","Subscriber","1979","2"
"583","2014-02-01 00:00:32","2014-02-01 00:10:15","357","E 11 St & Broadway","40.73261787","-73.99158043","284","Greenwich Ave & 8 Ave","40.7390169121","-74.0026376103","17400","Subscriber","1981","1"
"439","2014-02-01 00:02:14","2014-02-01 00:09:33","285","Broadway & E 14 St","40.73454567","-73.99074142","247","Perry St & Bleecker St","40.73535398","-74.00483091","20875","Subscriber","1983","2"
"707","2014-02-01 00:02:50","2014-02-01 00:14:37","257","Lispenard St & Broadway","40.71939226","-74.00247214","345","W 13 St & 6 Ave","40.73649403","-73.99704374","17757","Subscriber","1962","1"
"695","2014-02-01 00:06:53","2014-02-01 00:18:28","490","8 Ave & W 33 St","40.751551","-73.993934","468","Broadway & W 55 St","40.7652654","-73.98192338","21122","Subscriber","1979","1"
"892","2014-02-01 00:07:22","2

In [27]:
# count all trips that start and end on broadway
cut -d, -f5,9 201402-citibike-tripdata.csv | grep 'Broadway.*Broadway' | wc -l

    2776
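
Field selection with cut can be explored on a few lines before running it on the full file. This is a sketch on invented data: sample.csv and its rows are hypothetical, written in the same quoted style as the trip data.

```bash
scratch=$(mktemp -d)
cd "$scratch"

# the <<'EOF' ... EOF lines write a made-up three-column sample into sample.csv
cat > sample.csv <<'EOF'
"station","usertype","gender"
"Broadway & E 14 St","Subscriber","1"
"E 2 St & 2 Ave","Customer","0"
EOF

genders=$(cut -d, -f3 sample.csv)   # a single column
pairs=$(cut -d, -f1,3 sample.csv)   # several columns, still comma-separated

echo "$genders"
echo "$pairs"

# caveat: cut splits on every comma, even one inside a quoted field
tricky=$(echo '"Main St, North","Customer","2"' | cut -d, -f3)
echo "$tricky"
```

In the last command the comma inside the first quoted field shifts everything over by one, so field 3 comes back as "Customer" rather than "2". The pipelines in this notebook rely on the fields themselves containing no commas.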


In [28]:
# list all of the unique stations in column 5 that contain Broadway by first sorting and then removing running duplicates
cut -d, -f5 201402-citibike-tripdata.csv | grep Broadway | sort | uniq

"Broadway & Battery Pl"
"Broadway & Berry St"
"Broadway & E 14 St"
"Broadway & E 22 St"
"Broadway & W 24 St"
"Broadway & W 29 St"
"Broadway & W 32 St"
"Broadway & W 36 St"
"Broadway & W 37 St"
"Broadway & W 39 St"
"Broadway & W 41 St"
"Broadway & W 49 St"
"Broadway & W 51 St"
"Broadway & W 53 St"
"Broadway & W 55 St"
"Broadway & W 58 St"
"Broadway & W 60 St"
"E 11 St & Broadway"
"E 17 St & Broadway"
"Franklin St & W Broadway"
"Liberty St & Broadway"
"Lispenard St & Broadway"
"Pike St & E Broadway"
"Reade St & Broadway"
"W Broadway & Spring St"
"Washington Pl & Broadway"


In [29]:
# find the latest birth year in column 14, limiting to lines with a number
cut -d, -f14 201402-citibike-tripdata.csv | grep '[0-9]' | sort | tail

"1997"
"1997"
"1997"
"1997"
"1997"
"1997"
"1997"
"1997"
"1997"
"1997"


In [30]:
# count trips by gender
cut -d, -f15 201402-citibike-tripdata.csv | sort | uniq -c

6731 "0"
176526 "1"
41479 "2"
   1 "gender"
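
The sort step matters because uniq only collapses repeats that sit next to each other. A minimal sketch on made-up values (genders.txt is a hypothetical six-line file, not the real column):

```bash
scratch=$(mktemp -d)
cd "$scratch"
printf '1\n2\n1\n1\n2\n0\n' > genders.txt   # invented sample values

# uniq alone only merges adjacent repeats, so duplicates survive in unsorted input
without_sort=$(uniq genders.txt | tr '\n' ' ')

# sorting first groups equal lines together, so each value appears once
with_sort=$(sort genders.txt | uniq | tr '\n' ' ')

# -c prefixes each line with its count; sort -nr puts the largest count first
most_common=$(sort genders.txt | uniq -c | sort -nr | head -n1)

echo "$without_sort"
echo "$with_sort"
echo "$most_common"
```

Without sorting, the values 1 and 2 each show up twice in the output; with sorting, every value appears exactly once.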


In [31]:
# convert comma-separated file to tab-separated file
cat 201402-citibike-tripdata.csv | tr , '\t' > 201402-citibike-tripdata.tsv
head -n1 201402-citibike-tripdata.tsv

"tripduration"	"starttime"	"stoptime"	"start station id"	"start station name"	"start station latitude"	"start station longitude"	"end station id"	"end station name"	"end station latitude"	"end station longitude"	"bikeid"	"usertype"	"birth year"	"gender"


In [32]:
# find the 10 most frequent station-to-station trips
cut -f5,9 201402-citibike-tripdata.tsv | sort | uniq -c | sort -nr | head

 156 "E 43 St & Vanderbilt Ave"	"W 41 St & 8 Ave"
 124 "Pershing Square N"	"W 33 St & 7 Ave"
 122 "Norfolk St & Broome St"	"Henry St & Grand St"
 121 "E 7 St & Avenue A"	"Lafayette St & E 8 St"
 118 "W 17 St & 8 Ave"	"8 Ave & W 31 St"
 118 "Henry St & Grand St"	"Norfolk St & Broome St"
 115 "Lafayette St & E 8 St"	"E 6 St & Avenue B"
 115 "Central Park S & 6 Ave"	"Central Park S & 6 Ave"
 108 "E 10 St & Avenue A"	"Lafayette St & E 8 St"
 103 "Canal St & Rutgers St"	"Henry St & Grand St"


In [33]:
# use awk to count all trips that start and end on broadway
awk -F, '$5 ~ /Broadway/ && $9 ~ /Broadway/' 201402-citibike-tripdata.csv | wc -l

    2776
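
The awk pattern can be checked against the earlier grep approach on a handful of rows. This is a sketch with invented trips: trips.csv is hypothetical and has only two columns, so the fields are $1 and $2 rather than $5 and $9; the station names are borrowed from the listings above.

```bash
scratch=$(mktemp -d)
cd "$scratch"

# the <<'EOF' ... EOF lines write a made-up start/end sample into trips.csv
cat > trips.csv <<'EOF'
"start","end"
"Broadway & E 14 St","E 11 St & Broadway"
"Broadway & W 24 St","E 2 St & 2 Ave"
"W 13 St & 6 Ave","Reade St & Broadway"
"Liberty St & Broadway","Broadway & W 55 St"
"W 13 St & 6 Ave","E 2 St & 2 Ave"
EOF

# field-aware test: both columns must match; n+0 prints 0 if nothing matched
both=$(awk -F, '$1 ~ /Broadway/ && $2 ~ /Broadway/ {n++} END {print n+0}' trips.csv)

# line-based equivalent, valid here because only these two columns exist
grep_both=$(grep -c 'Broadway.*Broadway' trips.csv)

# swapping && for || counts trips that touch Broadway at either end
either=$(awk -F, '$1 ~ /Broadway/ || $2 ~ /Broadway/ {n++} END {print n+0}' trips.csv)

echo "$both $grep_both $either"
```

Counting inside awk with an END block avoids the extra wc -l step, though piping to wc -l as in the cell above gives the same number.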


In [34]:
# use awk to count trips by gender without having to sort
awk -F, '{counts[$15]++} END {for (k in counts) print counts[k]"\t" k }' 201402-citibike-tripdata.csv

176526	"1"
41479	"2"
1	"gender"
6731	"0"


In [35]:
# compare the time when using sort and uniq to awk
echo "sort and uniq"
time cut -d, -f15 201402-citibike-tripdata.csv | sort | uniq -c
echo
echo "awk"
time awk -F, '{counts[$15]++} END {for (k in counts) print counts[k]"\t" k }' 201402-citibike-tripdata.csv

sort and uniq
6731 "0"
176526 "1"
41479 "2"
   1 "gender"

real	0m4.360s
user	0m4.052s
sys	0m0.070s

awk
176526	"1"
41479	"2"
1	"gender"
6731	"0"

real	0m1.950s
user	0m1.859s
sys	0m0.045s
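
The timings above are consistent with awk tallying counts in a single pass, while the pipeline has to sort every line before uniq can count it. A small sketch, on made-up values in a hypothetical genders.txt, checking that the two approaches agree once their output is put in the same layout and order:

```bash
scratch=$(mktemp -d)
cd "$scratch"
printf '1\n2\n1\n1\n0\n2\n' > genders.txt   # invented sample values

# sort + uniq -c, reshaped to "value<TAB>count"
via_uniq=$(sort genders.txt | uniq -c | awk '{print $2 "\t" $1}' | sort)

# awk keeps a running count per value in an associative array;
# the order of "for (k in counts)" is unspecified, so sort the result
via_awk=$(awk '{counts[$1]++} END {for (k in counts) print k "\t" counts[k]}' genders.txt | sort)

echo "$via_uniq"
echo "$via_awk"
```

The unspecified iteration order is also why the awk output in the cells above lists the genders in a different order from the sort and uniq output.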


In [36]:
# by request: display a histogram of birth years, where each * represents 100 trips
cut -d, -f14 201402-citibike-tripdata.csv | sort | uniq -c | awk '{printf "%s\t", $2; for (i=1; i<=$1/100; i++) printf "*"; printf "\n"}'

"1899"	
"1900"	
"1901"	
"1907"	
"1910"	
"1913"	
"1917"	
"1921"	
"1922"	
"1926"	
"1927"	
"1932"	
"1933"	
"1934"	
"1935"	
"1936"	
"1937"	
"1938"	
"1939"	
"1940"	
"1941"	*
"1942"	*
"1943"	*
"1944"	***
"1945"	**
"1946"	****
"1947"	****
"1948"	********
"1949"	*******
"1950"	********
"1951"	***********
"1952"	***********
"1953"	*****************
"1954"	*******************
"1955"	*******************
"1956"	***********************
"1957"	**********************
"1958"	*****************************
"1959"	****************************
"1960"	**********************************
"1961"	***************************
"1962"	************************************
"1963"	******************************************
"1964"	******************************************
"1965"	************************************
"1966"	****************************************
"1967"	**********************************************
"1968"	********************************************
"1969"	********************************************