# Week 2 Problem 1

If you are not using the `Assignments` tab on the course JupyterHub server to read this notebook, read [Activating the assignments tab](https://github.com/UI-DataScience/info490-fa16/blob/master/Week2/assignments/README.md).

A few things you should keep in mind when working on assignments:

1. Make sure you fill in any place that says `YOUR CODE HERE`. Do **not** write your answer anywhere other than where it says `YOUR CODE HERE`. Anything you write elsewhere will be removed or overwritten by the autograder.

2. Before you submit your assignment, make sure everything runs as expected. Go to the menubar, select _Kernel_, then restart the kernel and run all cells (_Restart & Run All_).

3. Do not change the title (i.e. file name) of this notebook.

4. Make sure that you save your work (in the menubar, select _File_ → _Save and Checkpoint_).

5. You are allowed to submit an assignment multiple times, but only the most recent submission will be graded.

***

The purpose of this problem is twofold:

1. We want you to gain experience executing Unix commands
2. We want to demonstrate how that can be done from a Jupyter Notebook

Hopefully you've already gained some experience executing Unix commands from the [Lesson 1 Readings](https://github.com/UI-DataScience/info490-fa16/blob/master/Week2/lesson1.md). There are several ways to execute those same Unix commands from a Jupyter notebook; we cover only two of them here.

#### Method 1: Cell Magics
Cell [magics](https://ipython.org/ipython-doc/3/interactive/magics.html) start with a double percent sign (`%%`) and affect the entire cell. One we might be interested in is the `%%bash` cell magic, which effectively turns a code cell into a bash script. Don't worry if you don't know what all of the following code does; it is simply for demonstration.

In [1]:
%%bash

# set a name for the test directory
dir='testdir'

# go to the home directory
cd ~ 

# make a directory for testing
# remove it and remake it if it already exists
if [[ -e $dir ]]; then
    rm -rf $dir 
fi
mkdir $dir

# go into that directory
cd $dir

# make 5 subdirectories if they don't already exist
mkdir -p test_subdir{01..05}

# list the contents of `dir`
ls

test_subdir01
test_subdir02
test_subdir03
test_subdir04
test_subdir05


#### Method 2: Exclamation Points

You can also precede a single shell command with an exclamation mark (`!`) to execute it from Jupyter (or IPython in general).

In [2]:
!ls ~/data

2001	      2002.csv.bz2  email	     nltk_data
2001.csv      airports.csv  enron-spam	     plane-data.csv
2001.csv.bz2  carriers.csv  misc	     textdm
2002.csv      delta.csv     ml-latest-small  weather


We can also store the output of a Unix command in a variable. This is how we'll grade this assignment. In a moment, we'll ask you to write a Unix command using the exclamation mark method and store the results in a variable. For now, just follow along.

In [3]:
out = !ls ~/data
print(out)

['2001', '2001.csv', '2001.csv.bz2', '2002.csv', '2002.csv.bz2', 'airports.csv', 'carriers.csv', 'delta.csv', 'email', 'enron-spam', 'misc', 'ml-latest-small', 'nltk_data', 'plane-data.csv', 'textdm', 'weather']


To be clear, the output of the Unix command `ls ~/data`, which lists the contents of the `~/data` directory, is now stored in a Python variable called `out`. And if we want to know the *number* of files in the `~/data` directory, we can use Python's [len](http://stackoverflow.com/questions/20860430/what-is-a-len-function-in-python-and-how-would-you-use-it) function.

In [4]:
len(out)

16
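
For comparison, here is a minimal sketch of roughly what `out = !ls ...` does, written with Python's standard `subprocess` module instead of the exclamation mark. It is an illustration only, not how IPython is implemented. The temporary directory and the file names `a.csv`, `b.csv`, and `c.csv` are made up for this example so that it runs without the `~/data` directory, and it assumes the `ls` command is available.

```python
import subprocess
import tempfile
from pathlib import Path

# Create a throwaway directory with three empty files (hypothetical names)
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.csv", "b.csv", "c.csv"):
        (Path(tmp) / name).touch()

    # Run `ls` on that directory and capture what it prints
    result = subprocess.run(["ls", tmp], capture_output=True, text=True, check=True)

    # Split the captured text into a list with one entry per output line
    out = result.stdout.splitlines()

print(out)       # the file names as a list of strings
print(len(out))  # the number of entries in the directory
```

As with the exclamation mark method, `out` ends up as a list of strings, so `len(out)` gives the number of entries.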

Equivalently, we could have used Unix to accomplish the same task:

In [5]:
num_files = !ls ~/data | wc -l

In [6]:
print(num_files)

['16']


You might want to remember the above method of counting the files in a directory, as it may come in handy later.
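
Note that the result of the piped command is a one-element list containing the string `'16'`, not the integer `16`. The short sketch below shows one way to turn it into a number. The list `['16']` is typed in by hand as a stand-in for the value that `!ls ~/data | wc -l` returned above, and the variable name `count` is just for illustration.

```python
# Stand-in for the result of `!ls ~/data | wc -l` shown above
num_files = ['16']

# len() now counts the lines of output (one), not the files
print(len(num_files))

# Take the first (and only) line and convert it to an integer
count = int(num_files[0].strip())
print(count)
```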

# Problem 1

Your task is to count the number of files in the `~/data/weather` directory. (Hint: the setup is similar to how I counted the number of files in `~/data` above.) In the following code cell, write a Unix command that lists the files in `~/data/weather`, and pipe its output into a command that counts them. Use the exclamation mark method and store your result in a variable called `weather_file_count`. Replace the comment `#your code here` with your Unix command.

In [7]:
weather_file_count = !ls ~/data/weather | wc -l

In [8]:
print(weather_file_count) # run this cell to view your result!

['366']


The following cells demonstrate how we will grade these assignments. The first cell makes sure your result is in the proper format, and attempts to fix some common errors in formatting. In the future, when you've learned more Python, you'll be expected to ensure your results are in the proper format yourself.

In [9]:
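import IPython.utils.text

def parse_result(res):
    # `!command` returns an SList (a list of output lines); keep the first line
    if isinstance(res, IPython.utils.text.SList):
        res = res[0]
    # convert a numeric string such as '366' to an integer
    if isinstance(res, str):
        try:
            res = int(res)
        except ValueError:
            print("Your code doesn't produce something we can convert to an integer.\nPlease check your result and try again")
    return res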

The next cell contains [*unit tests*](https://en.wikipedia.org/wiki/Unit_testing), a common way to ensure software is working as expected. We will use them to check that your result is what we expect it to be, which makes grading easier for us and gives you, the student, immediate feedback. **If your code does not pass the unit tests, it will not pass the autograder.**

In [10]:
from nose.tools import assert_equal
weather_file_count_parsed = parse_result(weather_file_count)
assert_equal(type(weather_file_count_parsed), int)
assert_equal(weather_file_count_parsed, 366)
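
To see what a failing unit test looks like, here is a small sketch that performs the same two checks with Python's built-in `assert` statement rather than `nose`. The helper `check_count` is hypothetical and is not part of the autograder; it only illustrates the idea that a test passes silently and raises an error when the result is wrong.

```python
def check_count(parsed, expected):
    """Hypothetical helper: check the type, then the value, with plain assert."""
    assert type(parsed) is int, f"expected an int, got {type(parsed).__name__}"
    assert parsed == expected, f"expected {expected}, got {parsed}"

# A correct integer result passes silently
check_count(366, 366)

# A string that was never converted fails the type check
try:
    check_count("366", 366)
except AssertionError as err:
    message = str(err)
    print("Test failed:", message)
```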

# Problem 2

Now you want to know the total number of *lines* in **all files** in the `~/data/email/spam` directory. (Hint: there are two ways to do this. You could combine the files first and then pipe that result into a command to count the lines, or you could count each file individually and grab the total from the result. **The first approach will require no extra parsing if done correctly, so I recommend that approach.**) You may wish to use the `cat` command.


In [11]:
lines_of_spam = !cat ~/data/email/spam/* | wc -l

In [12]:
print(lines_of_spam)

['195563']


In [13]:
lines_of_spam_parsed = parse_result(lines_of_spam)
assert_equal(type(lines_of_spam_parsed), int)
assert_equal(lines_of_spam_parsed, 195563)
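
The two approaches mentioned in the hint can be compared in plain Python. This is a sketch under stated assumptions, not the graded solution: the temporary directory and the files `a.txt` and `b.txt` are invented for the example, and lines are counted as newline characters, which is what `wc -l` counts.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # Two small made-up files: two lines and one line
    (folder / "a.txt").write_text("one\ntwo\n")
    (folder / "b.txt").write_text("three\n")
    files = sorted(folder.iterdir())

    # Approach 1: combine all files first, then count lines once
    combined = b"".join(p.read_bytes() for p in files)
    total_combined = combined.count(b"\n")

    # Approach 2: count each file individually, then add up the counts
    per_file = [p.read_bytes().count(b"\n") for p in files]
    total_summed = sum(per_file)

print(total_combined, total_summed)
```

Both totals agree. The first approach yields a single number directly, while the second produces one count per file that still has to be summed, which is the extra parsing the hint refers to.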

## Cleaning up

In [14]:
!rm -rf ~/testdir/ # remove that test directory we made before