# Exercises due by EOD 2018.09.29

## goal

in this homework assignment we will attempt to gain familiarity with linux command line tools and, in particular, `bash` scripting. we will also start using `git` (more info will be coming in future lectures)

## method of delivery

as mentioned in our first lecture, the method of delivery may change from assignment to assignment. we will include this section in every assignment to provide an overview of how we expect homework results to be submitted, and to provide background notes or explanations for "new" delivery concepts or methods.

this week you will be submitting the results of your homework by copying them to my `ec2` instance using `scp`, a linux command line tool used to copy files over the `ssh` protocol

summary:

| exercise | deliverable | method of delivery |
|----------|-------------|--------------------|
| 1 | a directory structure on my `ec2` | none, we will see it in our terminal |
| 2 | `gu511_download_{A or B}.sh`, a `bash` file | `scp` (see q3) |
| 3 | `scp` of `sh` file from q2 to my `ec2` instance | `scp` |
| 4 | none | none |
| 5 | a `push` of a first commit of a local repo to a shared `github` repo | none (we will get an email) |

# exercise 1: build me a `tree`house

after last week, you now have the ability to make an `ssh` connection into my `ec2` instance. now I want you to use it to build me something.

in your home directory on my `ec2` instance (`/home/[GU ID]`), I would like for you to reproduce the following directory structure:

```sh
ubuntu@ip-172-31-90-226:~$ tree ~/treehouse -p
/home/ubuntu/treehouse
├── [drwxrwxr-x]  A
│   ├── [-rw-rw-r--]  a
│   └── [-rw-rw-r--]  b
├── [drwxrwxr-x]  B
├── [drwxrwxr-x]  C
├── [-rw-rw-r--]  d
├── [drwxrwxr-x]  E
│   ├── [-rw-rw-r--]  a
│   ├── [-rw-rw-r--]  b
│   ├── [-rw-rw-r--]  c
│   └── [-rw-rw-r--]  d
├── [drwxrwxr-x]  F
└── [-r--------]  g
```

be sure to pay attention to:

1. what is a directory versus what is a file
1. nesting
1. the file permissions. I've updated one (and only one) object's permission, so figure out which it is and use `chmod` to update yours
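
if `chmod` is new to you, here is a minimal sketch of setting and inspecting a permission. it works in a throwaway directory, the names (`demo_dir`, `notes.txt`) are made up, and the mode (`640`) is deliberately *not* the one you need for the treehouse:

```bash
# work in a throwaway directory so nothing in your home directory changes
SCRATCH=$(mktemp -d)

# a directory and a file (hypothetical names, not part of the treehouse)
mkdir "$SCRATCH/demo_dir"
touch "$SCRATCH/demo_dir/notes.txt"

# set an explicit mode: owner read/write, group read, others nothing
chmod 640 "$SCRATCH/demo_dir/notes.txt"

# the first column of `ls -l` is the same permission string `tree -p` prints
ls -l "$SCRATCH/demo_dir"

# `stat -c %a` (GNU coreutils, as on ubuntu) prints the octal mode
MODE=$(stat -c %a "$SCRATCH/demo_dir/notes.txt")
echo "mode is $MODE"
```

each octal digit is a sum of read (4), write (2), and execute (1), for the owner, group, and everyone else, in that order.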

##### nothing to submit; we'll see the answer on our `ec2` server

# exercise 2: creating a useful bash script 

there are two versions of this question -- one for beginners and another for experienced users. both will be graded completely equivalently, so choose based on your familiarity with linux or desire for a challenge.

I will give you a set of commands to execute to test this script; execute those commands on your `ec2` ubuntu image. we will be evaluating them with those same commands.

## 2.A: creating a "useful" bash script (linux beginners)

we're going to write a bash script that will download current weather information at DCA (Reagan National Airport). we'll do this in stages:

1. create a directory to hold our data
2. download the current weather and delay status for DCA (Reagan Washington National airport)
3. print a status message indicating whether or not we were successful to a log file

to create this script, we will move one step at a time; the final script will just be all of the commands put together into one script.

along the way, we will want to make sure that all of the commands we execute are *repeatable*: we should be able to run this script a *first* time (and it will do any setup we may need that first time), and then *again* (so it will be okay that this setup is already done, and not fail)

### create a directory

write a command to make a directory `~/data/weather/`

### make sure your "create a directory" command is *repeatable*

try running the command you just wrote *again* -- what happens?

in order to make this command repeatable, you will need to specify some flags to this command such that it will:

1. create both `~/data` and `~/data/weather` if they don't exist
    1. this is necessary the *first* time the script runs
2. not throw an error if that directory already exists
    1. this is necessary the *other* times the script runs

*hint: if you know how to make a directory, try `man [COMMAND]` to see how to make sure no error is thrown if a directory already exists*

### download the current weather and delay status for DCA

the FAA (Federal Aviation Administration) has created [a RESTful `xml` and `json` formatted endpoint](https://app.swaggerhub.com/apis/FAA/ASWS/1.1.0) for basic information about airports -- thanks, FAA! that link goes to a "swagger" documentation page: a document which describes the various URLs you could access to get information. think of different URLs as functions; for some you add parameters, and the website will return results. this page is the documentation.

the second endpoint of that API is for airport status codes, and lives at `https://soa.smext.faa.gov/asws/api/airport/status/{airportCode}`, where `airportCode` is a real airport code. this is the second big blue `GET` button on the page followed by `/api/airport/status/{airportCode}`.

push that `GET` button!

<br><div align="center"><img src="http://drive.google.com/uc?export=view&id=118BzYW0zeDufZdpDfJ3_OaaCV0cAXWWR"></div>

this will pull up a box which describes the API endpoint and gives you a block called "Parameters". push the "Try it out" button, set the `airportCode` to be `DCA`, and hit the "Execute" button

<br><div align="center"><img src="http://drive.google.com/uc?export=view&id=1yzAvgZQIXa2YSqNZPLzLISbVXTTAObYd"></div>

when you do this, it will give you a few new dark grey boxes. the first is a `curl` command (check our lecture notes for this one!). the third is a `json` response that looks like:

```json
{
  "Name": "Washington National Reagan International",
  "City": "Washington",
  "State": "District of Columbia",
  "ICAO": "KDCA",
  "IATA": "DCA",
  "SupportedAirport": true,
  "Delay": false,
  "DelayCount": 0,
  "Status": [{"Reason": "No known delays for this airport"}],
  "Weather": {
    "Weather": [{"Temp": ["Overcast"]}],
    "Visibility": [10],
    "Meta": [
      {
        "Credit": "NOAA's National Weather Service",
        "Url": "http://weather.gov/",
        "Updated": "Last Updated on Sep 20 2018, 9:52 pm EDT"
      }
    ],
    "Temp": ["74.0 F (23.3 C)"],
    "Wind": ["South at 10.4"]
  }
}
```

**copy** that `curl` command and run it from your terminal command line. see what happens when you do.

then, run it again, this time including the `--silent` flag. what is the difference?

use the *output* flag for the `curl` command line tool to download the `json` results of that API call to a file named `~/data/weather/dca.weather.json`. also use the `--silent` flag.

### get the status code from the download request

you just successfully wrote a linux command that can download the DCA json information from the API and wrote it to a file. any time that command runs, it will either be *successful* or *unsuccessful*.

after you run that command, get the **exit status** of that command and print it to the terminal.
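
as a small sketch of how exit statuses behave, here are two commands that have nothing to do with `curl`: `true` (always succeeds) and `false` (always fails). the variable names are made up for the illustration:

```bash
# `true` always succeeds, so its exit status is 0
true
STATUS_OK=$?
echo "true exited with $STATUS_OK"

# `false` always fails, so its exit status is non-zero
false
STATUS_FAIL=$?
echo "false exited with $STATUS_FAIL"

# every command overwrites $? (including echo), so if you
# need a status later, save it to a variable right away
```

the special variable `$?` always holds the exit status of the most recently finished command, which is why it is copied into a named variable immediately.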

### print a status message to a log file

let's get the following for a status message:

1. the current time
2. the result of the previous command (the download command) -- just as an error code, nothing more complicated than that

the end result should be a line formatted like

```
YYYY-mm-dd HH:MM:SS    gu511_download_A.sh    command status code was: [status code here]
```

write a command to save the current time to a variable called `$NOW`.

once you can construct such a line, *append* that line to a log file at `~/data/weather/download.log`
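
the difference between writing and *appending* is easy to see with a throwaway file. this is only a sketch: the file comes from `mktemp`, and the lines written are placeholders rather than the log format requested above:

```bash
# a throwaway file so we don't touch the real download.log
DEMO_LOG=$(mktemp)

# `>` truncates the file and then writes one line
echo "first line" > "$DEMO_LOG"

# `>>` appends, leaving the existing content in place
echo "second line" >> "$DEMO_LOG"

# $(...) runs a command and substitutes its output into the string
echo "third line, written at $(date +%H:%M)" >> "$DEMO_LOG"

cat "$DEMO_LOG"
LINE_COUNT=$(wc -l < "$DEMO_LOG")
echo "the file has $LINE_COUNT lines"
```

had the second and third `echo` used `>` instead of `>>`, only the last line would have survived.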

### combine all of the above into a bash script

create a file called `gu511_download_A.sh` by filling in the following template:

```sh
#!/bin/bash
# when this script is run, the line above tells the
# command line what program to use to execute the
# commands you provide below

# the following line(s) creates the directory 
# ~/data/weather if needed
FILL THIS IN

# the following line(s) downloads the current weather 
# and delay status for DCA into ~/data/weather
FILL THIS IN

# the following line(s) write a log message to file 
# indicating status code of previous line 
FILL THIS IN

# exit with the most recent error code -- you can
# leave this line alone
exit $?
```

### expected behavior

at this point you have a file `gu511_download_A.sh` which should be able to download information from the FAA about current weather conditions at DCA and log the results of that process to a file. doing all of this should not change the current working directory.

you should be able to execute the following commands in order and see a result like the ones described:

| command | output / result |
|-|-|
| `pwd` | print working directory to screen |
| `bash gu511_download_A.sh` | runs your script without error |
| `pwd` | the exact same working directory as above |
| `cat ~/data/weather/dca.weather.json` | a `json` blob is printed to the screen |
| `cat ~/data/weather/download.log` | a log message like `2018-09-20 16:52:11    gu511_download_A.sh    command status code was: 0` |

##### you will submit this file in exercise 3

## 2.B: create a *useful* bash script (advanced linux users)

we're going to write a bash script that will download an arbitrary number of urls from a text file in a highly parallel way. we'll write this script in stages:

1. create a directory to hold our downloaded data
2. download a list of urls from a text file

to create this script, we will move one step at a time; the final script will just be all of the commands put together into one script

### create a test list of urls

execute the following commands to create a list of test urls for downloading:

```sh
echo www.google.com > /tmp/test.urls
echo www.georgetown.edu >> /tmp/test.urls
echo www.elderresearch.com >> /tmp/test.urls
echo www.twitter.com >> /tmp/test.urls
echo www.facebook.com >> /tmp/test.urls
```

### create a directory

write a command to make a directory `~/data/downloads/`

### make sure your "create a directory" command is *repeatable*

try running the command you just wrote *again* -- what happens?

in order to make this command repeatable, you will need to specify some flags to this command such that it will:

1. create both `~/data` and `~/data/downloads` if they don't exist
    1. this is necessary the *first* time the script runs
2. not throw an error if that directory already exists
    1. this is necessary the *other* times the script runs

### write a command to print the urls in `test.urls` to `stdout`

print the contents of `/tmp/test.urls` to the terminal (for piping to a later function)

### use `xargs` to pipe the contents of `test.urls` to the `echo` function

soon we will write a function which will take a *single* url and download it. to pass many urls to this function and to create several forks (separate processes which will work in parallel) we will use the `xargs` command.

let's get some practice with the `xargs` command before trying to use it for our download function. in particular, let's look at the following flags:

1. `-P` or `--max-procs`: specify the maximum number of separate processes we should start (default is 1, 0 is interpreted as "maximum number possible")
2. `-n`: in conjunction with `-P`, the number of items passed to each process
3. `-I`: specify which sequence of characters in the command to follow should be replaced with the item passed in by `xargs`. a somewhat common choice is `{}` because it is unlikely to be meaningful in any command that follows. in the examples below the braces are escaped as `\{\}`; `bash` also accepts a bare `{}` here, so the escaping is a precaution rather than a requirement

as an example, check out the results of the following:

```bash
cat /tmp/test.urls | xargs -P 100 -n 3 -I{} echo url is \{\}
```
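
to see what `-n` and `-I` each do on their own, here is a small offline sketch that uses made-up input (`a` through `d`) instead of urls. one caveat: according to the GNU `xargs` man page, `-n` and `-I` are mutually exclusive, and when both are given `xargs` generally honors whichever comes last. in the command above that means `-I` wins and each url is handed over one at a time.

```bash
# -n 2: hand the command at most two items per invocation
GROUPED=$(printf 'a\nb\nc\nd\n' | xargs -n 2 echo pair:)
echo "$GROUPED"

# -I{}: run the command once per input line, replacing {} with that line
REPLACED=$(printf 'a\nb\n' | xargs -I{} echo "item is {}")
echo "$REPLACED"
```

with `-n 2` the four items arrive as two pairs appended to the end of the command; with `-I{}` each line is substituted wherever `{}` appears.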

### `curl` one of those urls

take one of those urls -- say, www.google.com -- and download it to a file. do the following:

1. run it in "silent" mode
1. follow redirects
1. write the contents of that download to a file in `~/data/downloads` with the same name as the final portion of that url (its `basename`)

*hint*: suppose we have the url in a bash variable `$URL`. we could write

```sh
curl [silent flag and follow redirects flag] $URL > ~/data/downloads/$(basename $URL)
```

the `basename` piece is necessary for urls which are more complicated than just `www.xxxxxxxx.com`, such as `www.xxxxxxxx.com/a/longer/path/with?stuff=x&other_stuff=y`
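
a quick sketch of what `basename` returns in each case. the urls here are hypothetical and nothing is downloaded:

```bash
# a bare host name has no "/" in it, so basename returns it unchanged
SIMPLE=$(basename "www.example.com")
echo "$SIMPLE"

# with a longer path, only the part after the last "/" is kept
# (quote the url: "?" and "&" are special characters to the shell)
LONGER=$(basename "www.example.com/a/longer/path/page.html?x=1&y=2")
echo "$LONGER"
```

the second call keeps `page.html?x=1&y=2`, which is a legal (if ugly) file name.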

verify that the downloaded contents for one test url match the source on the corresponding webpage

### export that `curl` statement as a function

you can create a bash function using the syntax

```sh
function my_function_name {
    # do bash stuff
}
```

arguments are passed to this function as bash variables `$1`, `$2`, and so on, such that if you write

```sh
my_function_name arg1 arg2 arg3 arg4
```

these will be "available" within the body of the function as

| variable name | value |
|---------------|-------|
| `$1`          | arg1  |
| `$2`          | arg2  |
| `$3`          | arg3  |
| `$4`          | arg4  |

for example, if we wanted to turn our echo command up above into a super l33t re-usable function, we could write

```sh
function l33t_url_echo {
    echo "the url is $1"
}

# test it out
l33t_url_echo www.google.com
```

we could also make this available in other bash shells by `export`-ing it:

```sh
export -f l33t_url_echo
```
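
to convince yourself that `export -f` matters, here is a minimal sketch with a made-up function (`say_hello`). it asks a child `bash` process to call the function before and after exporting it:

```bash
# define a function in the current shell (hypothetical name)
function say_hello {
    echo "hello, $1"
}

# without export, a child bash process does not know the function
bash -c 'say_hello world' 2>/dev/null || echo "child shell: say_hello not found"

# after export -f, child bash processes inherit it
export -f say_hello
bash -c 'say_hello world'
```

this matters for the `xargs` step further down, because `xargs` runs its command in a separate process rather than in your current shell.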

so, let's talk about **what you should actually do**:

1. convert your `curl` statement from before into a bash function named `l33t_url_download` that will take a url as a parameter
2. export it for use in other bash sessions

### use that function with `xargs` on your test urls

for each of the urls passed along by `xargs` we want to run the newly-minted `bash` function with that url as the argument.

for example, if we wanted to use our `l33t_url_echo` function from above, we could write:

```sh
# ...it pays to read ahead...
cat /tmp/test.urls | xargs -P 100 -n 3 -I{} bash -c l33t_url_echo\ \{\}
```

in the above, the actual *command* we are executing with `xargs` is the `bash` command, which

1. starts a new `bash` shell
2. executes the *command* following flag `-c` (that's what the `-c` flag *is*)
3. replaces the occurrence of `\{\}` with whatever url is available
4. special characters such as spaces and braces need to be escaped to be passed in using the `-c` command

write your own version of the command above, replacing `l33t_url_echo` with the function you created previously (`l33t_url_download`).

delete all of the items in `~/data/downloads` to start from scratch, and run the whole `cat + xargs + your_function` line. verify it downloads each test url.

### replace `/tmp/test.urls` with a variable path name

create a variable `$URL_FILE` with a value of `/tmp/test.urls`, and invoke the previous `cat` + `xargs` + `your_function` line using the variable name instead of the hard-coded path

### understand command line arguments

the way that bash handles command line arguments to a shell script is identical to the way functions receive them -- the first word (first in a space-separated list) is stored to a variable `$1`, the second to `$2`, and so on. 

a common convention for command line arguments is to supply a default value, and this can be done with a bash variable resolution construct:

```sh
MY_VAR=${TRY_THIS_FIRST:-USE_THIS_IF_NOTHING_FOUND}
```

if `$TRY_THIS_FIRST` is set (and not empty), bash resolves that expression to the value of `$TRY_THIS_FIRST` and uses it to set the value of `$MY_VAR`. if it is not, bash will instead evaluate the *exact string* following the `:-` characters.

In the example above,

+ if `$TRY_THIS_FIRST` is set to some value, `MY_VAR` will be set to that value
+ if `$TRY_THIS_FIRST` is *not* set to some value, `MY_VAR` will be set to the string `USE_THIS_IF_NOTHING_FOUND`
    + if `USE_THIS_IF_NOTHING_FOUND` is *itself* a variable expression (e.g. `$USER`), it will be resolved and then assigned to the variable `MY_VAR`
    
a common use of this is setting default command line argument values. for example, suppose I create a file `my_script.sh` that contains the following:

```sh
#!/bin/bash

FIRST_ARGUMENT=${1:-defaultval}

echo $FIRST_ARGUMENT
```

if I call

```sh
bash my_script.sh
```

there is no argument passed and therefore `$1` will not be set. This will result in `FIRST_ARGUMENT` being set to the default value `defaultval`, and the script will print `defaultval` to the terminal.

if, on the other hand, I call

```sh
bash my_script.sh "print me"
```

bash will create a variable `$1` with a value `print me`, and the script will end up printing `print me` to the terminal.
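
since functions receive arguments the same way scripts do, the default-value construct can be tried out without creating a file. this is a sketch only; the function name `show_first_arg` is made up:

```bash
# a function sees its arguments as $1, $2, ... just like a script
function show_first_arg {
    local first=${1:-defaultval}
    echo "$first"
}

NO_ARG=$(show_first_arg)
WITH_ARG=$(show_first_arg "print me")
# an empty string also triggers the default, because of the ":" in ":-"
EMPTY_ARG=$(show_first_arg "")

echo "$NO_ARG / $WITH_ARG / $EMPTY_ARG"
```

the last case is worth noticing: `:-` treats an argument that is present but empty the same as one that is missing.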

### combine all of the above into a bash script

create a file called `gu511_download_B.sh` by filling in the following template:

```sh
#!/bin/bash
# when this script is run, the line above tells the
# command line what program (binary) to use to
# execute the commands

# allow the executing user to pass their own list of urls,
# but keep /tmp/test.urls as a default
URL_FILE=${1:-/tmp/test.urls}

# the following line(s) creates the directory 
# ~/data/downloads if needed
FILL THIS IN

# the following line(s) define our single-url curl
# download function
function l33t_url_download {
    FILL THIS IN
}

# the following line(s) export that function for use
# in other bash sessions
export -f l33t_url_download

# the following line is the "cat + xargs + your_function"
# line from the previous step
FILL THIS IN

# exit with the most recent error code -- you can
# leave this line alone
exit $?
```

##### postscript

*if everything went according to plan, this script should be among the fastest download programs I've ever come across (no exaggeration there). it was useful enough that I put it and some variants on a github repo I own.*

*...it **really** pays to read ahead...*

### expected behavior

at this point you have a file `gu511_download_B.sh` which should be able to download any number of urls.

you should be able to execute the following commands with the `/tmp/test.urls` as written above in order and see a result like the ones described:

| command | output / result |
|-|-|
| `pwd` | print working directory to screen |
| `bash gu511_download_B.sh` | runs your script without error |
| `pwd` | the exact same working directory as above |
| `ls -alhrt ~/data/downloads` | five files, one per test url (e.g. `www.google.com`, `www.georgetown.edu`), all with non-0 sizes |

##### you will submit this file in exercise 3

# exercise 3: submitting your homework via `scp`

you will submit your homework this week by copying it to my `ec2` server using a secure copy over `ssh`, aka `scp`

### tangent about how your `ssh` access was set up

in a previous week's exercises you created a public key and sent it to me along with a "hailing" ip address.

after receiving them, I used the following script to create your users and configure `ssh` access:

```sh
#!/bin/bash

# command line args
USERNAME=${1}
HOME=/home/$USERNAME
PUBKEY=${2}

# create user and set up home / .ssh directory
useradd -m $USERNAME
mkdir -p $HOME/.ssh
chown $USERNAME:$USERNAME $HOME/.ssh

# add public key to authorized_keys
echo $PUBKEY >> $HOME/.ssh/authorized_keys
chown -R $USERNAME:$USERNAME $HOME/.ssh/
chmod 700 $HOME/.ssh
chmod 600 $HOME/.ssh/authorized_keys
```

I then sent you the information you need to sign in:

1. your user name on my machine (your `guid`)
1. the IP address / DNS name of my `ec2` server

you should then be able to log in to my `ec2` server with the command

```sh
ssh -i /path/to/your/private/key [YOUR USER NAME HERE]@[MY EC2 IP ADDRESS HERE]
```

try this and make sure it works!

### actually using `scp`

the point of this exercise is to use `scp` (the SSH copy command) or some [gui](https://en.wikipedia.org/wiki/Graphical_user_interface) `scp` application (e.g. WinSCP or FileZilla) to copy your bash script file to my `ec2` server.

on my `ec2` server you have a home directory (variable representation is `~`, absolute path is `/home/[YOUR GU ID]`). copy your shell script from its current location on your laptop into your home directory on my `ec2` server. keep the file name as it was (either `gu511_download_A.sh` or `gu511_download_B.sh`).

if you are using `scp`, the general structure of the command is

```bash
# copying a *local* file to a *remote* machine
scp -i /path/to/your/private/key [local files to copy] [user name]@[host name or ip]:[path on remote machine]
```

to go in the other direction (*i.e.* copy remote files to your local machine), just flip the order between the `[local files to copy]` element and the `[user name]@[host name or ip]:[path on remote machine]` element.

so for this particular copy operation:

```bash
scp -i /path/to/your/private/key /path/to/your/gu511_download_A.sh [your user name here]@[my aws ec2 ip]:~/gu511_download_A.sh

# or

scp -i /path/to/your/private/key /path/to/your/gu511_download_B.sh [your user name here]@[my aws ec2 ip]:~/gu511_download_B.sh
```

##### run the command above to copy the answer for exercise 2 to my `ec2` instance

# exercise 4: create a brand new *local* repository in `git`

we are going to create a directory with a single file (a `README.md` file) and we will start tracking it with version control.

we haven't covered what these commands are *doing* yet in lecture, so for the time being, break the cardinal rule of `linux` command line: press enter without fully understanding what you're doing

1. choose to use your home laptop or `ec2` instance. you can use either, but you will have to edit files and use `git` in that place you choose for the rest of the class, so choose wisely!
1. in some place you will not forget, create a directory named `gu511_git_hw`
1. move into that directory (`cd`)
1. run the following command: `echo "# 511 github repo" > README.md`
1. run `git init` to create a new `git` repository
1. run `git add README.md` to *stage* the new `README.md` file
1. run `git commit -m 'gu511 git hw: initial commit'` to create your first commit.

##### nothing to submit here; we will see it when you `push` to `github` in exercise 5

# exercise 5: connect your *local* and *remote* `gu511_git_hw` repos

in a previous homework assignment we created an empty repo on `github` called `gu511_git_hw`. right now, on that repo home page there is a pair of commands under the title **"...or push an existing repository from the command line"**.

run those two commands to add your `github` repo as the *remote* for your *local* repo, and then `push` your single commit to `github`

##### we will get an email that this has been done -- we are collaborators, after all