# Linux lab 1
## In this lab you'll create a small classifier pipeline driven by the command line.

### A pretrained Naive Bayes classifier and sample data are provided. Your tasks are the following.
* Use the command line to expand the sample data used in this exercise
* Following the instructions provided, write a script named `classifier.py` that will
    - load the pretrained Naive Bayes classifier
    - read stdin, expecting each line to be the path of a file to score
    - write the score and filename for each file to stdout, separated by a tab
* Extend the command line pipeline to sort the score-filename pairs so that the top ten scores are displayed at the command line

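As an example, suppose you needed the line count of every file in a directory called `easy_ham`. You could use the Linux utility `wc` to count the lines in each file, then sort and trim the result: `ls easy_ham/* | xargs wc -l | sort -rn | head -n 10`. Here `sort -rn` sorts numerically in descending order. Note that `wc` also prints a `total` line when given several files, and that line will sort to the top.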

## Part 1.1 Expanding the sample files

In this repository, under the directory called `data` you will see three files.
* 20021010_easy_ham.tar.bz2
* 20021010_hard_ham.tar.bz2
* 20021010_spam.tar.bz2

Use the `tar` utility to expand them. Type `man tar` to bring up the Linux manual page.
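
If you want to confirm what an archive contains, the names can also be listed from Python. This is only a cross-check sketch using the standard `tarfile` module; the lab itself still expects the `tar` utility, and the function name `list_archive` is hypothetical.

```python
import tarfile


def list_archive(path):
    """Return the member names stored in a bzip2-compressed tar archive."""
    # "r:bz2" opens the archive for reading with bzip2 decompression.
    with tarfile.open(path, "r:bz2") as archive:
        return archive.getnames()
```

For example, `list_archive("data/20021010_spam.tar.bz2")` would return the paths stored in the spam archive, assuming you run it from the repository root.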

When expanded, you will see three new directories, giving you a structure like this:
```
.
├── 20021010_easy_ham.tar.bz2
├── 20021010_hard_ham.tar.bz2
├── 20021010_spam.tar.bz2
├── easy_ham
├── hard_ham
└── spam
```
Here `easy_ham`, `hard_ham` and `spam` contain the expanded files.

![](pics_for_lab/tar_easy_ham.png)

![](pics_for_lab/tar_hard_ham.png)

![](pics_for_lab/tar_spam.png)

![](pics_for_lab/data_foler_after_tar.png)

## Part 1.2 Creating a file to load the model and read stdin

The classifier lives in the `naive_bayes.py` file, alongside a stored file holding the pretrained values. Loading the model looks like this:

```
import naive_bayes as nb

model = nb.NaiveBayesClassifier(k=0.5)
model.load_from_file()
```

The call to `load_from_file()` automatically loads the pretrained settings.

The next step is to use Python's `sys` module to read from stdin. That way, when you pipe output into your script (for example with `cat` or `echo`), the script receives it.

Your script should expect to be sent file locations, one per line. From there you can iterate over the files.
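
One possible shape for the stdin-reading step is sketched below. It is an illustration rather than the required solution: the function name `iter_paths` and the demo paths are hypothetical, and `io.StringIO` stands in for `sys.stdin` so the snippet runs on its own.

```python
import io


def iter_paths(stream):
    """Yield one file path per non-empty line of a text stream."""
    for line in stream:
        path = line.strip()  # drop the trailing newline and stray spaces
        if path:             # skip blank lines
            yield path


# Demo: a fake stream with two made-up paths and one blank line.
demo = io.StringIO("data/spam/0001.abc\n\ndata/easy_ham/0002.def\n")
print(list(iter_paths(demo)))
```

In the real script you would pass `sys.stdin` instead of the demo stream, since it can be iterated line by line in the same way.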

A full template for this exercise is here:
```
import naive_bayes as nb
import sys

model = nb.NaiveBayesClassifier(k=0.5)
model.load_from_file()


def process_stdin(stream):
    < PUT YOUR CODE HERE>

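def score_one_file(fname, model):
    try:
        # Echo the file being scored to stderr so stdout carries only results.
        sys.stderr.write(fname + "\n")
        subject = ""
        prefix = "Subject:"
        with open(fname, errors='ignore') as source:
            for line in source:
                if line.startswith(prefix):
                    # Slice off the prefix; str.lstrip("Subject: ") would strip
                    # a set of characters and could eat into the subject text.
                    subject = line[len(prefix):].strip()

        score = model.predict(subject)
        formatted_return = "{}\t{}".format(str(score), fname)
        print(formatted_return)
    except Exception as e:
        sys.stderr.write("{}\tUncaught Exception:\t{}\n".format(fname, e))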


files_to_score = process_stdin(sys.stdin)

for fname in files_to_score:
    score_one_file(fname, model)
```

Note that you're receiving stdin, iterating over it, and then passing the filtered content (only the subject line) to the classifier to score.
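
The filtering step on its own can be sketched as follows. The helper name `extract_subject` is hypothetical and the sample lines are made up. The sketch also shows why slicing off the prefix is safer than `str.lstrip`, which removes a set of characters rather than a literal prefix.

```python
def extract_subject(lines):
    """Return the text of the last line starting with 'Subject:', or ''."""
    prefix = "Subject:"
    subject = ""
    for line in lines:
        if line.startswith(prefix):
            # Slice off the prefix, then trim whitespace and the newline.
            subject = line[len(prefix):].strip()
    return subject


# lstrip strips any of the characters S, u, b, j, e, c, t, ':' and ' ',
# so it also eats the leading "te" of "test".
print("Subject: test message".lstrip("Subject: "))  # → st message
print(extract_subject(["From: a@example.com\n", "Subject: test message\n"]))  # → test message
```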

At the end of this section you should be able to run `echo <path to file> | python classifier.py` and see something like `0.432    <path to file>` printed on the console.

![](pics_for_lab/1.2_cli_output.png)

## Part 1.3 Building the command line mini pipeline

Now that your `classifier.py` file reads from stdin and outputs a score to stdout, we can leverage the Linux pipeline to classify a batch of files.

We'll use the files you expanded in part 1.1. The trick here is to issue an `ls` command that lists, by path, the files we want to score inside the `data/` directory. More concretely, we want to chain four commands together.

* An `ls` command that lists our target files by path inside the `data/` directory
* The call to the classifier, which receives that list of files
* The output of scores passed to the `sort` utility to order them by score, with the largest at the top
* The output from `sort` passed to `head` to keep only the top 10 lines, since we're looking for the top scored files

So the solution will take the shape of:
```
ls <wildcards here> | python classifier.py | sort <flags to sort> | head <flags to head>
```
which should output something like:
```
9.981655083641292e-05	data/easy_ham/0520.db2ae930623e1db4c9cf60676f96c4e5
9.981655083641292e-05	data/easy_ham/0518.def6dfc3c2204dda12270b0ca97f0fc5
9.981655083641292e-05	data/easy_ham/0512.17bff8553d7e8f6c668166afe149795b
9.981655083641292e-05	data/easy_ham/0376.c0225fd19682f7ac58d090b6528af380
9.981655083641292e-05	data/easy_ham/0375.54d0a570b81851127b73cebb8741a2df
9.970924543171243e-05	data/hard_ham/0063.d84fa51cf5329f5e5b2f0c83b7ec94d0
9.779274072567135e-09	data/easy_ham/0606.246043a69d2c710dde0e67eedb1fd853
9.66684964867624e-06	data/easy_ham/0734.7dc0b0b5f6fb1977f0a146a44c4750aa
9.66684964867624e-06	data/easy_ham/0731.59e8a707586a8b3cfe89bff4024dead7
9.66684964867624e-06	data/easy_ham/0711.27203d4f43e71f7e1ced0cdd7f8685c8
```
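
As a cross-check of the `sort` and `head` stages, the same top-N selection can be sketched in Python. The function name `top_scores` is hypothetical and the input format is assumed to be the tab-separated `score<TAB>filename` lines printed by the template. Because `float()` parses scientific notation, scores are compared by their true magnitude.

```python
import heapq


def top_scores(lines, n=10):
    """Return the n highest-scoring (score, filename) pairs."""
    pairs = []
    for line in lines:
        # Split each line at the first tab into score text and filename.
        score_text, _, fname = line.rstrip("\n").partition("\t")
        pairs.append((float(score_text), fname))  # float() understands e-05
    return heapq.nlargest(n, pairs, key=lambda pair: pair[0])


# Three lines taken from the sample output above.
sample = [
    "9.779274072567135e-09\tdata/easy_ham/0606.246043a69d2c710dde0e67eedb1fd853",
    "9.66684964867624e-06\tdata/easy_ham/0734.7dc0b0b5f6fb1977f0a146a44c4750aa",
    "9.981655083641292e-05\tdata/easy_ham/0520.db2ae930623e1db4c9cf60676f96c4e5",
]
for score, fname in top_scores(sample, n=2):
    print(score, fname)
```

With these three lines, the two entries kept are the `e-05` and `e-06` scores, in that order, and the `e-09` score is dropped.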

Solution Command line output

![](pics_for_lab/1.3_solution_output.png)

Command line for scoring only the spam files:
![](pics_for_lab/1.3_spam.png)

## Solution command line

`ls data/*/0* | python classifier.py 2> /dev/null | sort -rn | head -n 10`

Here `2> /dev/null` discards the filenames that the template echoes to stderr. The command prints the same ten lines shown in Part 1.3, possibly followed by `sort: Broken pipe`. That message is harmless: `head` exits after reading ten lines while `sort` is still writing.

Note that `sort -n` does not interpret the exponent of scientific notation, which is why `9.779274072567135e-09` is listed above `9.66684964867624e-06`. With GNU `sort`, the `-g` (general numeric) flag compares such values by their true magnitude.
