# Exercise: Text analysis

In this exercise, we will look at a Python class that performs an analysis of a given text. The code appears to run fine for a few 'normal' cases, but because it is untested it is unlikely to behave well for all input data.

Your task is to design a set of tests that ensure the code functions correctly for all possible input data. It should be able to deal with edge cases and suitably fail (e.g. terminate with an exception) for invalid data. 

When designing your tests, keep the following in mind:
* What range of cases should the code be able to deal with? 
* How should the code deal with edge cases?
* What should the code do if it encounters invalid input data?
* Even for valid input data, does the code always give the same output or is there some randomness? If so, how can the tests be designed to deal with that?
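
As a minimal sketch of how the first three questions might turn into tests, the block below uses the standard `unittest` module. `word_length_stats` is a hypothetical stand-in written for this illustration, not the code under test, and raising `ValueError` for empty input is an assumed design choice rather than the class's documented behaviour.

```python
import statistics
import unittest


def word_length_stats(words):
    """Hypothetical stand-in for the statistics step (not the real class)."""
    if not words:
        # Assumed behaviour: fail loudly rather than divide by zero.
        raise ValueError("no words to analyse")
    lengths = [len(word) for word in words]
    return statistics.mean(lengths), statistics.median(lengths)


class WordLengthStatsTests(unittest.TestCase):
    def test_normal_case(self):
        # Word lengths are 4, 4, 1 and 8.
        mean, median = word_length_stats(["once", "upon", "a", "midnight"])
        self.assertAlmostEqual(mean, 4.25)
        self.assertEqual(median, 4)

    def test_single_word_edge_case(self):
        # With one word, the mean and median are both that word's length.
        self.assertEqual(word_length_stats(["raven"]), (5, 5))

    def test_empty_input_is_rejected(self):
        # Invalid input should raise, not return a meaningless number.
        with self.assertRaises(ValueError):
            word_length_stats([])
```

Saved to a file, tests like these can be run with `python -m unittest`. The same three-part pattern (normal case, edge case, invalid input) can then be repeated for each part of the report.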


A few examples of 'normal' cases have been given. You may wish to create some more input data for running your tests in order to cover the full range of valid input data (and to test that the code fails for invalid input data).
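
One way to create extra input data is to write small files into a temporary directory at the start of a test run. The sketch below assumes the class accepts a path to a local file, as it does for `the_raven.txt`. The helper name `write_fixtures` and the file contents are made up for illustration.

```python
import tempfile
from pathlib import Path


def write_fixtures(directory):
    """Write small illustrative input files and return their paths by name."""
    directory = Path(directory)
    contents = {
        "blank.txt": "",                                   # no words at all
        "short.txt": "only five words in here",            # fewer than 10 words
        "repeat.txt": "the same line again and again\n" * 68,  # repeated passage
        "numbers.csv": "1,2,3\n4,5,6\n",                   # numbers, not prose
    }
    paths = {}
    for name, text in contents.items():
        path = directory / name
        path.write_text(text, encoding="utf-8")
        paths[name] = path
    return paths


# The directory and its files are deleted when the block exits.
with tempfile.TemporaryDirectory() as tmp:
    fixtures = write_fixtures(tmp)
    print(sorted(fixtures))
```

Generating the files inside the tests keeps each case small and makes it obvious what input produced a given failure.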

In [3]:
from n_grams import Text

In [11]:
files = {"alice": "http://www.gutenberg.org/files/11/11-0.txt", 
         "dracula": "http://www.gutenberg.org/ebooks/345.txt.utf-8",
         "sherlock": "http://www.gutenberg.org/ebooks/1661.txt.utf-8",
         "poe": "the_raven.txt"}

txt = Text(files["poe"])

txt.text_report()


There are 1093 words in the text.

Mean, median and mode word length is 4.463860933211345, 4, 4.

10 longest words:
countenance
forgiveness
unmerciful
entreating
hesitating
expressing
sculptured
distinctly
melancholy
loneliness

Most common words:
17 x this
14 x door
11 x chamber
11 x nevermore
11 x raven
10 x bird
8 x with
8 x from
8 x more
8 x then

Longest n-grams:
5 x quoth the raven nevermore
5 x at my chamber door
6 x and nothing more
7 x on the
6 x and the
5 x said i




## Tests to make the code fail

* `blank.txt` - a blank text file. The code has nothing to stop it dividing by zero when it calculates mean word length.
* `repeat.txt` - this repeats the first verse of the Edward Lear poem *The Jumblies* 68 times. The longest n-gram function only looks for n-grams of 50 words or fewer, so it fails to spot this and instead slices up the poem. There are a couple of ways to deal with this. One is that, if the code finds an n-gram of length 50, it checks whether this is part of a longer n-gram and keeps extending until no longer n-grams are found. Alternatively, it could start looking for n-grams of length 2 and increase the length until no more are found, then eliminate any n-grams that are contained in longer ones.
* `http://www.gutenberg.org/ebooks/844.epub.images?session_id=0fa3233ff1abe287d4a1ce534e052624dc702aec` - this is an EPUB file rather than plain text, so the code will not be able to read it.
* Longest words and n-grams are both stored in dictionaries, so the order in which these are printed will often vary between runs. Care should be taken when designing tests to account for this!
* `short.txt` - the file is only 5 words long, so when the code prints out '10 longest words', it actually prints only 5. This is a minor error, but it is nonetheless incorrect behaviour.
* `nile.csv` - this is just a load of numbers, so the results are pretty meaningless. The code should really check that it's looking at an actual text file.
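
The second approach suggested for `repeat.txt` (start at length 2, grow until nothing repeats, then eliminate n-grams contained in longer ones) could be sketched as follows. This is an illustration under stated assumptions, not the implementation in `n_grams`: the function name `repeated_ngrams` is hypothetical, the input is assumed to be a list of already-cleaned words, and overlapping occurrences are counted as repeats.

```python
from collections import Counter


def repeated_ngrams(words, min_count=2):
    """Find repeated n-grams of any length, keeping only the maximal ones."""
    found = {}
    n = 2
    while n <= len(words):
        counts = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
        repeated = {gram: c for gram, c in counts.items() if c >= min_count}
        if not repeated:
            # If no n-gram of this length repeats, no longer one can either,
            # because every repeated (n+1)-gram contains a repeated n-gram.
            break
        found.update(repeated)
        n += 1

    def contained_in(short, long):
        # True if `short` appears as a contiguous run inside `long`.
        k = len(short)
        return any(long[i:i + k] == short for i in range(len(long) - k + 1))

    # Drop every n-gram that sits inside a longer repeated n-gram.
    return {
        gram: count
        for gram, count in found.items()
        if not any(len(other) > len(gram) and contained_in(gram, other)
                   for other in found)
    }


print(repeated_ngrams("a b c x a b c y a b".split()))
# → {('a', 'b', 'c'): 2}
```

Two design points are worth noting. A shorter n-gram is dropped even when it occurs more often than the longer one containing it (here `a b` appears three times but is absorbed into `a b c`), which may or may not be the desired report. Because the result is a dictionary, a test can compare it to an expected dictionary with `==`, which ignores ordering and so sidesteps the run-to-run variation in print order.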