In [1]:
from IPython.core.display import HTML
css_file = "../notebook_style.css"
HTML(open(css_file, 'r').read())

# Text analysis

In these exercises, we will look at a Python class that performs an analysis of a given text. The code is called `n_grams.py` because one of its key functions searches for repeated *n-grams* in a text. An [n-gram](https://en.wikipedia.org/wiki/N-gram) is a sequence of n words. For example, in the sentence 

    The cow jumps over the moon
    
the 2-grams (or *bigrams*) would be 
- the cow 
- cow jumps 
- jumps over 
- over the 
- the moon 

The `n_grams` function constructs all the n-grams in the text and returns a dictionary recording how many times each one appears. The `text_report` function prints out the 10 longest n-grams which appear more than 4 times. 
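To make the idea concrete, here is a minimal sketch of n-gram counting. It is not the implementation in `n_grams.py`: the function name `count_ngrams` and the choice to lowercase the text and strip punctuation are assumptions made for illustration.

```python
from collections import Counter
import re


def count_ngrams(text, n):
    """Count every n-gram (n consecutive words) in `text`.

    Hypothetical helper for illustration; not the API of n_grams.py.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Assumption: lowercase and drop punctuation so "The" and "the" match.
    words = re.sub(r"[^\w\s]", "", text.lower()).split()
    # Slide a window of width n across the list of words.
    grams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return Counter(grams)


bigrams = count_ngrams("The cow jumps over the moon", 2)
for gram, count in bigrams.items():
    print(count, "x", gram)
```

A `Counter` is a dictionary subclass, so the result can be queried like the dictionary described above. A text with fewer than n words simply yields an empty result.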

## Testing

The code as it stands appears to run fine for a few 'normal' cases. However, as it is untested, it is unlikely to behave correctly for all input data. 
Your task is to design a set of tests that ensure the code functions correctly for all possible input data. It should be able to deal with edge cases and fail appropriately (e.g. terminate with an exception) for invalid data. 

When designing your tests, have in mind the following:
* What range of cases should the code be able to deal with? 
* How should the code deal with edge cases?
* What should the code do if it encounters invalid input data?
* Even for valid input data, does the code always give the same output or is there some randomness? If so, how can the tests be designed to deal with that?


A few examples of 'normal' cases have been given. You may wish to create some more input data for running your tests in order to cover the full range of valid input data (and to test the code fails for invalid input data).
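The questions above map onto four kinds of test: a normal case, an edge case, invalid input, and output whose order is not guaranteed. The sketch below shows one of each using the standard-library `unittest` module. The function `word_counts` is a hypothetical stand-in for the code under test, not part of `n_grams.py`; replace it with calls to the real class.

```python
import unittest
from collections import Counter


def word_counts(text):
    """Hypothetical stand-in for the code under test: count words in a string."""
    if not isinstance(text, str):
        raise TypeError("text must be a string")
    return Counter(text.lower().split())


class TestWordCounts(unittest.TestCase):
    def test_normal_case(self):
        # A typical input with one repeated word.
        self.assertEqual(word_counts("the cow the moon")["the"], 2)

    def test_empty_text(self):
        # Edge case: an empty string should give an empty result, not an error.
        self.assertEqual(word_counts(""), Counter())

    def test_invalid_input_raises(self):
        # Invalid data should fail loudly with an exception.
        with self.assertRaises(TypeError):
            word_counts(None)

    def test_ties_do_not_depend_on_order(self):
        # Two words with equal counts: sort before comparing so the test
        # does not rely on the order in which ties are returned.
        top = word_counts("b a").most_common(2)
        self.assertEqual(sorted(top), [("a", 1), ("b", 1)])
```

Saved in a file such as `test_word_counts.py` (a hypothetical name), these tests can be run with `python -m unittest`. Sorting before comparing is one way to handle output whose order may vary; comparing sets or fixing a random seed are alternatives.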

In [2]:
from n_grams import n_grams

In [3]:
# Names and locations (URLs or local paths) of some further texts.
files = {"alice": "http://www.gutenberg.org/files/11/11-0.txt", 
         "dracula": "http://www.gutenberg.org/ebooks/345.txt.utf-8",
         "sherlock": "http://www.gutenberg.org/ebooks/1661.txt.utf-8",
         "poe": "the_raven.txt"}

# Load the example text and print a summary report.
txt = n_grams.Text("n_grams/wilde.txt")
txt.text_report()


There are 20740 words in the text.

Mean, median and mode word length is 4.328929604628737, 4.0, 4.

10 longest words:
misunderstanding
incomprehensible
enthusiastically
disrespectfully
personification
horticulturally
ostentatiously
metaphorically
reconciliation
investigations

Most common words:
543 x you
287 x jack
269 x algernon
256 x cecily
204 x have
177 x gwendolen
146 x are
146 x not
140 x your
139 x lady

Longest n-grams:
7 x engaged to be married to
8 x aunt augusta lady bracknell
8 x lady bracknell lady bracknell
6 x the name of ernest
6 x seems to me to
6 x i am glad to
6 x i beg your pardon
6 x cecily i dont think
5 x on earth do you
5 x in the country jack
5 x to me to be
5 x i think it is
5 x would like to be
5 x i am afraid i




## Continuous integration & code coverage



Now that we have some tests for the code, let's try automating them. Write a `.travis.yml` file and link your code to a Travis CI account.

Next, let's see how comprehensive these tests are. 
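- Try running your tests with coverage measurement enabled (for example `coverage run -m pytest`, or `pytest --cov` with the pytest-cov plugin) and investigate the `.coverage` data file it produces, e.g. with `coverage report`. 
- Modify your `.travis.yml` file to run `coverage.py` and link your repo to a [Codecov](https://codecov.io/) account.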

## Documentation 

Add some sensible documentation to your code using Sphinx.
- Try adding some docstrings to functions in the `numpy` style
- Run `sphinx-quickstart`
- Try adding another page to your documentation and adding it to the `toctree`
- Compile your documentation to generate HTML files and/or a PDF (you can also try investigating other formats!)
- Link your project to a [`Read The Docs`](https://readthedocs.org/) account and get it to build the documentation for you
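As a reference point for the `numpy` docstring style, here is a small sketch. The function `mean_word_length` is hypothetical and not taken from `n_grams.py`; it only illustrates the section layout (`Parameters`, `Returns`, `Raises`, `Examples`).

```python
def mean_word_length(words):
    """Compute the mean length of the words in a list.

    Parameters
    ----------
    words : list of str
        The words to measure. Must contain at least one word.

    Returns
    -------
    float
        The arithmetic mean of the word lengths.

    Raises
    ------
    ValueError
        If `words` is empty.

    Examples
    --------
    >>> mean_word_length(["the", "moon"])
    3.5
    """
    if not words:
        raise ValueError("words must not be empty")
    return sum(len(word) for word in words) / len(words)
```

For Sphinx to render these sections, the `sphinx.ext.autodoc` and `sphinx.ext.napoleon` extensions are typically enabled in `conf.py`.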

## Publishing code

### Make
In the `n_grams` folder, you should find the Python script `analyse_texts.py`. When run, this will generate a PNG figure. This figure is included in the TeX file `text_analysis.tex`. 
- Write a Makefile which automates the building of this tex file.
- Ensure the Makefile checks all necessary dependencies - e.g. that if you change the code that generates the figure, this will trigger the code to be rerun and the tex file to be rebuilt.
- Include a `clean` command which removes all files created in the build process.

### Docker 
Build a Docker image for the `n_grams` code which contains all the files necessary to build the TeX file.
- Write a `Dockerfile` with compilation and runtime instructions 
- Build the image so that it is stored in your machine's local Docker image registry 
- Run the container 
- Share your image in the Docker cloud (e.g. by linking your Docker cloud account to your GitHub account)
- Download someone else's image and try running it on your machine 
- Update your Makefile so that the Docker image is rebuilt when your code changes

### Presenting results 
For this exercise, head over to the `spotify` folder and open up the `music_analysis` notebook.