![ine-divider](https://user-images.githubusercontent.com/7065401/92672068-398e8080-f2ee-11ea-82d6-ad53f7feb5c0.png)
<hr>

### Bash Scripting for Data Scientists
# Searching files with bash

This project has somewhat unusual requirements among INE data science courses. Most such projects ask you to complete cells in a notebook such as this one. However, working with the command line is necessarily done at the command line, not in a notebook. This repository itself provides a variety of files that we can search, filter, and archive.

There *does* exist a bash kernel for Jupyter, and this notebook is saved with it. To install it, run the following in your terminal:

```
pip install bash_kernel
python -m bash_kernel.install
```

and restart your Notebook.

The kinds of interactivity you have at a shell are much more flexible than in a notebook, but you *can* run bash commands in a notebook if you install that Jupyter kernel. For example:

In [5]:
ls -A1

.DS_Store
.ipynb_checkpoints
Project.ipynb
README.md
Solutions.ipynb


If you are using a Python kernel in Jupyter, you can run bash commands in cells using the `%%bash` "magic". For example:

```
%%bash
ls -A1
```

![orange-divider](https://user-images.githubusercontent.com/7065401/92672455-187a5f80-f2ef-11ea-890c-40be9474f7b7.png)

## Part 1

**Searching**

* Locate all of the Markdown (`.md`) files that contain the string 'Moby'.  List only the filenames of matching files.

* Locate all of the Markdown files that *do not* contain the string 'Moby'.

* Count the number of lines that contain matches, per file in this repository (include all file types).  Do not include files that contain zero matches.

* Locate all the files in this repository that are larger than 10,000 bytes.

**Solution**

In [2]:
grep -rl Moby .. | grep '\.md$'

../03-Text-Manipulation/demo-steps.md
../01-Working-with-Command-Line/demo-steps.md
../04-Special-Formats/.ipynb_checkpoints/demo-steps-checkpoint.md
../04-Special-Formats/demo-steps.md

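When a second `grep` filters the paths, the pattern matters: an unescaped `.` matches any character, and a pattern without `$` can match anywhere in the path, so a loose `.md` also accepts a path such as `cmd/notes.txt`. The sketch below contrasts a loose filter, an anchored filter, and the `--include` option, which is not POSIX but is assumed here to be available (GNU and BSD `grep` both provide it). The directory and file names are made up for illustration.

```bash
# Throwaway directory with hypothetical files, so the sketch runs anywhere.
demo=$(mktemp -d)
cd "$demo" || exit 1
mkdir cmd
echo 'Call me Ishmael. Moby' > book.md
echo 'Moby in a text file'   > cmd/notes.txt
echo 'No match here'         > other.md

# Loose filter: '.md' means "any character followed by md", anywhere in the path.
loose=$(grep -rl Moby . | grep .md | sort)

# Anchored filter: a literal dot, and only at the end of the path.
strict=$(grep -rl Moby . | grep '\.md$' | sort)

# Let grep restrict the recursion itself (assumes --include is supported).
included=$(grep -rl --include='*.md' Moby . | sort)

echo "loose:"    && echo "$loose"
echo "strict:"   && echo "$strict"
echo "included:" && echo "$included"
```

The loose filter reports both `./book.md` and `./cmd/notes.txt`, while the other two report only the Markdown file.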

In [3]:
grep -L Moby $(find .. -name '*.md')

../trailer.md
../02-The-Unix-Philosophy/demo-steps.md

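Passing file names through `$(find ...)` relies on the shell splitting the output on whitespace, which works here because none of the paths contain spaces. As a minimal sketch for the general case, `find` can hand the paths to `grep` directly with `-exec ... {} +`, so each path arrives as a single argument. The file names below are hypothetical.

```bash
# Throwaway directory with made-up file names that contain spaces.
demo=$(mktemp -d)
cd "$demo" || exit 1
echo 'The whale Moby Dick' > 'with moby.md'
echo 'Nothing to see'      > 'without it.md'

# find passes every matching path to grep as one argument, spaces included.
# grep -L lists the files that contain no match.
missing=$(find . -name '*.md' -exec grep -L Moby {} +)
echo "$missing"
```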

In [4]:
grep -rc Moby .. | grep -v ':0$'

../03-Text-Manipulation/Solutions.ipynb:17
../03-Text-Manipulation/Project.ipynb:5
../03-Text-Manipulation/.ipynb_checkpoints/Project-checkpoint.ipynb:5
../03-Text-Manipulation/.ipynb_checkpoints/Solutions-checkpoint.ipynb:17
../03-Text-Manipulation/demo-steps.md:35
../03-Text-Manipulation/Moby-Dick.txt:78
../01-Working-with-Command-Line/demo-steps.md:5
../04-Special-Formats/.ipynb_checkpoints/demo-steps-checkpoint.md:1
../04-Special-Formats/demo-steps.md:1

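Note that `grep -c` counts matching *lines*, which is what the task asks for, not individual occurrences. The sketch below shows the difference on a made-up sample file; it assumes a `grep` with the `-o` option (available in GNU and BSD `grep`, though not required by POSIX).

```bash
# Hypothetical sample: the first line contains the pattern twice.
sample=$(mktemp)
printf '%s\n' 'Moby Dick, or Moby-Dick' 'no match' 'Moby again' > "$sample"

# Lines that contain at least one match.
line_count=$(grep -c Moby "$sample")

# Every occurrence: -o prints each match on its own line, then count the lines.
match_count=$(grep -o Moby "$sample" | wc -l | tr -d ' ')

echo "lines: $line_count, occurrences: $match_count"
```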

In [5]:
find .. -size +10k

../00-Introduction/DQM-INE-masks-verso-angle.jpg
../00-Introduction/bash-logo.png
../00-Introduction/DQM-Millennium-Bridge-square.jpg
../03-Text-Manipulation/Moby-Dick.txt
../99-Conclusion/DQM-cartoon.png
../99-Conclusion/DQM-stained-glass.jpg

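One detail worth checking: the `k` suffix of `find -size` counts 1024-byte units, so `+10k` selects files larger than 10,240 bytes rather than 10,000. For an exact byte threshold, the `c` suffix counts bytes. The sketch below illustrates the difference with made-up files; adding `-type f` is an extra assumption here, used to keep directories out of the result.

```bash
# Throwaway directory with hypothetical files of known sizes.
demo=$(mktemp -d)
cd "$demo" || exit 1
head -c 500   /dev/zero > small.bin
head -c 10100 /dev/zero > between.bin   # over 10,000 bytes, under 10 KiB
head -c 20000 /dev/zero > large.bin

# More than 10 units of 1024 bytes.
by_kib=$(find . -type f -size +10k | sort)

# More than 10,000 bytes exactly.
by_bytes=$(find . -type f -size +10000c | sort)

echo "+10k:"    && echo "$by_kib"
echo "+10000c:" && echo "$by_bytes"
```

Only `large.bin` passes the `+10k` test, while `+10000c` also picks up `between.bin`.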

![orange-divider](https://user-images.githubusercontent.com/7065401/92672455-187a5f80-f2ef-11ea-890c-40be9474f7b7.png)

## Part 2

**Filtering**

* In the book *Moby Dick*, in this local directory, what percentage of the lines that contain the word "whale" also contain the word "white"? Make the search case-insensitive, since some occurrences are capitalized at the start of sentences. (The correct answer is 7%, but write a command to find that.)

* Using a bash pipeline, create a histogram of the words in *Moby Dick*. Canonicalize words to lower case. The tool `tr` will likely be useful here; check its manual page. You may need a fairly long pipeline for this task, but it can be done in one line. As a hint, the report should start like this:

```
14279 the
 6575 of
 6349 and
 4610 a
 4587 to
 4119 in
```

**Solution**

In [6]:
NUMERATOR=$(grep -i whale Moby-Dick.txt | grep -ci white)
DENOMINATOR=$(grep -ci whale Moby-Dick.txt)
echo "($NUMERATOR * 100) / $DENOMINATOR" | bc

7

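By default `bc` performs integer division, which is why the result is a whole number (setting `scale`, for example `scale=2;`, before the expression yields decimals). The sketch below reproduces the same calculation on a made-up sample, using shell arithmetic for the integer result and `awk` for one decimal place; both are stand-ins chosen for illustration, not the approach used in the solution.

```bash
# Hypothetical sample: three lines mention a whale, one of them is also white.
sample=$(mktemp)
printf '%s\n' 'The White Whale' 'a whale surfaced' 'WHALE oil' 'white sails' > "$sample"

numerator=$(grep -i whale "$sample" | grep -ci white)
denominator=$(grep -ci whale "$sample")

# Integer percentage, truncated like bc without a scale.
integer_pct=$(( numerator * 100 / denominator ))

# Same ratio with one decimal place.
decimal_pct=$(awk -v n="$numerator" -v d="$denominator" 'BEGIN { printf "%.1f", n * 100 / d }')

echo "integer: ${integer_pct}%  decimal: ${decimal_pct}%"
```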

In [7]:
tr ' ' '\n' < Moby-Dick.txt | tr '[:upper:]' '[:lower:]' | tr -dc '[:alnum:]\n' | sort | uniq -c | sort -nr | head -6

  14279 the
   6575 of
   6349 and
   4610 a
   4587 to
   4119 in
sort: write failed: 'standard output': Broken pipe
sort: write error

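The `Broken pipe` messages at the end of the output come from `head` exiting after six lines while the final `sort` is still writing. They are harmless. In an ordinary terminal `sort` is usually stopped silently by `SIGPIPE`; the messages tend to appear in environments where that signal is ignored, which seems to be the case under the Jupyter kernel. As a sketch, one way to avoid them is to replace `head` with a filter that reads all of its input, such as `awk 'NR <= 3'`. The sample text below is made up.

```bash
# Hypothetical two-line sample instead of the full book.
sample=$(mktemp)
printf '%s\n' 'The whale, the WHALE!' 'And the sea; and a ship.' > "$sample"

# Same stages as the solution: split on spaces, lower-case, strip punctuation,
# count, and sort by frequency. awk consumes the whole stream, so the last
# sort never writes into a closed pipe.
histogram=$(tr ' ' '\n' < "$sample" \
  | tr '[:upper:]' '[:lower:]' \
  | tr -dc '[:alnum:]\n' \
  | sort | uniq -c | sort -nr \
  | awk 'NR <= 3')

echo "$histogram"
```

The most frequent word in this sample is `the`, with three occurrences.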

![orange-divider](https://user-images.githubusercontent.com/7065401/92672455-187a5f80-f2ef-11ea-890c-40be9474f7b7.png)

## Part 3

**Archiving**

* Compare the compressed size of *Moby Dick* using each of `gzip`, `bzip2` and `xz`, each in minimum and maximum compression modes.

**Solution**

In [8]:
echo "Fastest gzip:  $(gzip -1c Moby-Dick.txt | wc -c)"
echo "Smallest gzip: $(gzip -9c Moby-Dick.txt | wc -c)"
echo "Fastest bz2:   $(bzip2 -1c Moby-Dick.txt | wc -c)"
echo "Smallest bz2:  $(bzip2 -9c Moby-Dick.txt | wc -c)"
echo "Fastest xz:    $(xz -1c Moby-Dick.txt | wc -c)"
echo "Smallest xz:   $(xz -9c Moby-Dick.txt | wc -c)"

Fastest gzip:  579179
Smallest gzip: 498410
Fastest bz2:   434415
Smallest bz2:  380471
Fastest xz:    479448
Smallest xz:   409584

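The six commands above can also be written as a loop, which makes it easy to add the size relative to the original. This is a minimal sketch under stated assumptions: it uses a generated stand-in file rather than *Moby Dick*, and it skips any compressor that is not installed.

```bash
# Hypothetical stand-in for Moby-Dick.txt: 2000 lines of made-up text.
sample=$(mktemp)
i=1
while [ "$i" -le 2000 ]; do
  echo "Call me Ishmael. Line $i of a made-up sample."
  i=$((i + 1))
done > "$sample"

original=$(wc -c < "$sample" | tr -d ' ')
echo "original: $original bytes"

for tool in gzip bzip2 xz; do
  # Skip tools that are missing on this machine.
  if ! command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: not installed, skipping"
    continue
  fi
  for level in 1 9; do
    # -c writes to standard output, so the input file is left untouched.
    size=$("$tool" "-${level}c" "$sample" | wc -c | tr -d ' ')
    echo "$tool -$level: $size bytes ($(( size * 100 / original ))% of original)"
  done
done
```

Pointing `sample` at `Moby-Dick.txt` instead of the generated file would give the comparison for the book itself.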

![orange-divider](https://user-images.githubusercontent.com/7065401/92672455-187a5f80-f2ef-11ea-890c-40be9474f7b7.png)