# Problem

Download the zipped text of a book from a given URL. Count the occurrences of each distinct word. Print the 10 most common words, plus the mean and median of the word occurrence counts.


# Setup

For this exercise we are going to use Mary Shelley's classic "Frankenstein; or, The Modern Prometheus". To avoid relying on external websites I've downloaded and zipped the plain text, which is now hosted in this very git repo under `/data`. The full address for the book is:

In [1]:
book_url = 'https://github.com/ne1s0n/coding_excercises/raw/master/data/Frankenstein%2C%20or%20the%20Modern%20Prometheus%20(First%20Edition%2C%201818).zip'

# Data retrieval

Several modules can download files, e.g. 
[urllib.request module](https://docs.python.org/3/library/urllib.request.html), 
[urllib2 module](https://docs.python.org/2/library/urllib2.html) (but only in Python 2), 
and [wget module](https://pypi.org/project/wget/).

I suggest using the [requests module](https://requests.readthedocs.io/en/master/); even the official [Python documentation recommends it](https://docs.python.org/3/library/urllib.request.html#module-urllib.request), but your mileage may vary.

With the `requests module`, calling `get` on a URL returns a [response object](https://requests.readthedocs.io/en/master/api/#requests.Response):

In [2]:
import requests

r = requests.get(book_url)

We now have a `response object` named `r`. Let's take a look at it:

In [3]:
# Retrieve some meta-data
print(r.status_code)
print(r.headers['content-type'])
print(r.url)

200
application/zip
https://raw.githubusercontent.com/ne1s0n/coding_excercises/master/data/Frankenstein%2C%20or%20the%20Modern%20Prometheus%20(First%20Edition%2C%201818).zip


If all went well we should see a 200 status code (= OK). If you see something different (especially codes starting with a 4, such as 404) something bad may have happened and you may want to check the list of [HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes).
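
As a small aid for reading status codes, here is a minimal sketch that maps a numeric code to its standard phrase using the standard library's `http.HTTPStatus`. The helper name `describe_status` is made up for this example and is not part of `requests`.

```python
from http import HTTPStatus

def describe_status(code):
    """Return the code together with its standard phrase (hypothetical helper)."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # HTTPStatus only knows the registered codes
        phrase = "unknown status code"
    return f"{code}: {phrase}"

for code in (200, 404, 500):
    print(describe_status(code))
# prints:
# 200: OK
# 404: Not Found
# 500: Internal Server Error
```

When working with `requests` directly, calling `r.raise_for_status()` is a common alternative: it raises an exception for 4xx and 5xx responses instead of leaving the check to you.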

## Accessing the data without local storage

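The `requests module` can iterate over the body of a response line by line, without storing anything locally, via the [.iter_lines() method](https://2.python-requests.org/en/master/api/#requests.Response.iter_lines) of a `response object`. Keep in mind that it yields the payload as served: for a zip archive like ours that means compressed bytes, not the lines of the book, so it is mostly useful for plain-text resources. It is very handy, very fast, and we are not going to use it :)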

For the sake of completeness, it would be something like this:

In [4]:
#we are not executing this code
if False:
    for line in r.iter_lines():
        #do something with one line of the response body, such as...
        print(line)
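
A way to read a zip archive without touching the disk is to wrap the downloaded bytes in an `io.BytesIO` object and hand that to the standard `zipfile` module. The sketch below is only an illustration: to stay runnable offline it builds a tiny archive in memory as a stand-in for the bytes that `requests` exposes as `r.content`, and the file name `sample.txt` is invented.

```python
import io
import zipfile

# Stand-in for r.content: build a tiny zip archive in memory so the
# example runs without a network connection (assumption for illustration)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as zf:
    zf.writestr('sample.txt', 'first line\nsecond line\n')
payload = buffer.getvalue()

# Wrap the raw bytes in a file-like object and open it as a zip archive
with zipfile.ZipFile(io.BytesIO(payload)) as z:
    name = z.namelist()[0]
    with z.open(name) as f:
        # each line is a bytes object, so we decode it to a string
        lines = [line.decode('utf-8') for line in f]

print(lines)
# prints: ['first line\n', 'second line\n']
```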

## Storing and accessing data locally

We want to be able to 1) store the zipped data on local disk and 2) access it at a later time. 

So, for the first part, we simply store the payload after opening a file pointer:

In [5]:
#saving to local file, you may want to change the path
with open('book.txt.zip', 'wb') as f:
    #the actual data payload is accessed via .content field
    f.write(r.content)

The book is now stored locally. Since it's a zipped file, we need the [zipfile module](https://docs.python.org/3/library/zipfile.html) to access it.

A zip archive can contain more than one file, so the first thing we need to do is to discover the actual list of files in the archive:

In [6]:
import zipfile
z = zipfile.ZipFile('book.txt.zip')
nl = z.namelist()
print(nl)

['Frankenstein, or the Modern Prometheus (First Edition, 1818).txt']


Please note the square brackets: the `.namelist()` method returns a list of file names, and in this case there's a single file in the archive. Let's open it and print the first line.

In [7]:
#let's just read one line of the file
with z.open(nl[0]) as f:
    print(f.readline())

b'Frankenstein; or, the Modern Prometheus\n'


That's the first line of the text. Please note the 'b' at the beginning, which tells us that the data is not a string but a `bytes` object, here rendered as characters and escaped symbols. Also, there's a newline character at the end of the line (this is normal behaviour for `readline()`). 

To convert the data to an actual string we need to [decode it as UTF-8](https://docs.python.org/3/howto/unicode.html):

In [8]:
#same as above, let's just read one line
with z.open(nl[0]) as f:
    print(f.readline().decode("utf-8"))

Frankenstein; or, the Modern Prometheus



It may be hard to tell, but there are two "newlines" printed: one is the '\n' from the file (now correctly rendered) and the other is the regular newline added by `print()`.
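
To make the two newlines visible, here is a short sketch that works on a hard-coded copy of the first line, so it does not need the archive. It shows the decoded string with `repr()` and two ways to avoid the doubled newline; the variable names are just for illustration.

```python
# Hard-coded copy of the first line, as returned by readline()
raw_line = b'Frankenstein; or, the Modern Prometheus\n'

# bytes -> str
text_line = raw_line.decode('utf-8')

# repr() shows the trailing newline explicitly
print(repr(text_line))

# Option 1: strip the newline that came from the file
clean_line = text_line.rstrip('\n')
print(clean_line)

# Option 2: keep the file's newline and tell print() not to add its own
print(text_line, end='')
```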

# Standard words

We are now able to read data from the desired file, and need to start thinking about the task ahead: collect statistics on the word frequency. To do that we need to have a way to standardize words, and in particular:

- be case insensitive, so that "Dog" and "dog" are considered the same word
- remove punctuation, so that "dog," is equal to "dog"
- reject things that are not proper words: numbers and isolated symbols (e.g. hyphens)

To ease this process we write a function. For string manipulation we'll use regular expressions through the [re module](https://docs.python.org/3/library/re.html).

In [9]:
import re

def standardize_word(raw_word):
    #case insensitive
    word = raw_word.lower()
    
    #removing what is not a letter. This is a gross simplification
    #of the many intricacies of language (e.g. we are counting "Smith's" as "smiths")
    #but it will do for this exercise. First step is to compile the
    #regular expression
    myregex = re.compile("[^a-zA-Z]")
    
    #and then apply it to the string so that what is caught by the regular 
    #expression is substituted with an empty string
    word = myregex.sub('', word)
    
    #if word has become an empty string it means it's a non-word. We could
    #return it as is, but it's better to be explicit
    if len(word) == 0:
        return None
    else:
        return word

The function is ready, let's test it!

In [10]:
print(standardize_word("dog"))
print(standardize_word("Dog"))
print(standardize_word("dog11"))
print(standardize_word("12+3"))
print(standardize_word("hello{---}"))
print(standardize_word("What's the best color? Blue!"))

dog
dog
dog
None
hello
whatsthebestcolorblue


The function works exactly as expected: the passed argument is considered a single word, non-letter characters are removed, all is lower case. It will be our duty to feed only actual words to it and not pieces of sentences.
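
Feeding actual words to the function could look like the sketch below, which splits a sentence on whitespace before standardizing each piece. To keep the block self-contained it uses a compact restatement of `standardize_word` that is assumed to behave like the one defined above.

```python
import re

# Compact restatement of standardize_word, assumed equivalent to the original
NON_LETTERS = re.compile("[^a-z]")

def standardize_word(raw_word):
    word = NON_LETTERS.sub('', raw_word.lower())
    return word if word else None

sentence = "What's the best color? Blue!"

# split() with no arguments separates on any run of whitespace
words = [standardize_word(w) for w in sentence.split()]
print(words)
# prints: ['whats', 'the', 'best', 'color', 'blue']
```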

# Counting word frequencies

We are now ready to count how many times each word appears in the text. We'll keep track using a dictionary: keys are words, values are the number of appearances. This gives us a compact representation of our data that guarantees the absence of duplicates.

In [11]:
tally = {}

It is now a matter of...

In [12]:
#opening the zip archive
with zipfile.ZipFile('book.txt.zip') as z:
    #opening the first (and only) file in the zip archive
    with z.open(nl[0]) as f:
        #reading the file line by line
        for line in f:
            #splitting the words using spaces as separator
            #remembering that "line" is a binary and we need to decode
            #it to string before invoking the string methods
            words = line.decode("utf-8").split()
            
            #each word needs to be processed
            for w in words:
                #standardize it with our handy function
                w_st = standardize_word(w)
                
                #note: non-words come back as None and, as written, they
                #are all tallied together under the None key
                #new or old word?
                if w_st in tally:
                    #old word, update count
                    tally[w_st] = tally[w_st] + 1
                else:
                    #new word, let's add it
                    tally[w_st] = 1
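
As an aside, the same "new or old word?" pattern can be written with `collections.Counter` from the standard library, which starts every missing key at zero. The sketch below is an alternative, not the code used in this notebook: it runs on two made-up lines instead of the book, inlines the standardization as a single `re.sub` call, and skips non-words instead of tallying them.

```python
import re
from collections import Counter

# Two made-up lines standing in for the lines read from the archive
lines = [b"The dog saw the cat.\n", b"The cat saw 2 dogs!\n"]

tally = Counter()
for line in lines:
    for w in line.decode("utf-8").split():
        # lower case, then drop everything that is not a letter
        word = re.sub("[^a-z]", "", w.lower())
        # an empty string means a non-word (e.g. "2"): skip it
        if word:
            tally[word] += 1

print(tally["the"], tally["saw"], tally["dogs"])
# prints: 3 2 1
```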

And we are done! How many different words did we collect?

In [13]:
print(len(tally))

7016


Seven thousand, not bad.  
Unfortunately, a dictionary is not very handy for computing statistics. It would be better to put the data in a `DataFrame` from the [pandas module](https://pandas.pydata.org/). Luckily there is a method that accepts [dicts as a data source](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.from_dict.html):

In [14]:
import pandas

#creating a data frame
#keys become "index" (in R that would be row names)
#the only available value becomes column 0, which we name "count"
df = pandas.DataFrame.from_dict(tally, orient='index', columns=['count'])

#printing the data frame, which has a built-in print method that
#shows the first and last lines, plus a footer with its dimensions
df

| | count |
|---|---|
| frankenstein | 35 |
| or | 192 |
| the | 4053 |
| modern | 18 |
| prometheus | 9 |
| ... | ... |
| peteforsyth | 1 |
| httpwikisourceorg | 1 |
| httpwwwcreativecommonsorglicensesbysa | 1 |
| httpwwwgnuorgcopyleftfdlhtml | 1 |


This is exactly as expected: the first words come from the title, and the last ones come from the closing list of websites that originally distributed the text. As a bit of trivia, we now know that the word "Frankenstein" appears 34 times in the novel, not counting the title.

The problem asks us to find the 10 most common words. So we must sort the data frame in descending order of count and then print the first rows.

In [15]:
#sorting by count
df = df.sort_values(by=['count'], ascending=False)

#printing the first ten lines
df.head(10)

| | count |
|---|---|
| the | 4053 |
| and | 2875 |
| i | 2728 |
| of | 2537 |
| to | 1965 |
| my | 1666 |
| a | 1311 |
| in | 1067 |
| that | 1013 |
| was | 974 |


As expected for most English text, the word "the" is the most common. This is a reassuring sanity check on our counting.
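
For comparison, a ranking like this can also be obtained without `pandas`. The sketch below uses `Counter.most_common` from the standard library on a handful of counts copied from the tables in this notebook; the name `sample_tally` is only for illustration.

```python
from collections import Counter

# A few counts copied from the outputs above, in arbitrary order
sample_tally = {"modern": 18, "and": 2875, "prometheus": 9, "the": 4053, "i": 2728}

# most_common(n) returns the n (word, count) pairs with the highest counts
top3 = Counter(sample_tally).most_common(3)
print(top3)
# prints: [('the', 4053), ('and', 2875), ('i', 2728)]
```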

We are now required to print mean and median of the word counts. Using `pandas` methods this is a trivial task.

In [16]:
df.mean()

count    10.361745
dtype: float64

In [17]:
df.median()

count    2.0
dtype: float64

That's quite a difference: the mean is about five times the median! This is a strongly right-skewed distribution, with very few words appearing a high number of times and a great many rare words.
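
The effect of a few very frequent words on the mean can be seen on a toy example. The counts below are invented for illustration and the sketch uses the standard `statistics` module rather than `pandas`.

```python
import statistics

# Invented counts: six rare words and one very common word
counts = [1, 1, 1, 2, 2, 3, 60]

mean = statistics.mean(counts)      # 70 / 7
median = statistics.median(counts)  # middle value of the sorted list

print(mean, median)
# prints: 10 2
```

A single large value pulls the mean up to 10, while the median stays at 2 because it only depends on the middle of the sorted list.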

# Next steps...

This concludes our task. In this exercise we've worked with lists, dictionaries, dataframes, zip archives, remote and local files, and strings. We built a solid solution but there's always room for improvement. Consider:

- how to improve word splitting? Currently hyphenated words are fused...
- maybe apostrophes deserve a special status? For example "what's" should be considered as two separate words, not as "whats"
- it would be nice to plot a histogram of word frequencies using [matplotlib.pyplot.hist](https://matplotlib.org/3.2.2/api/_as_gen/matplotlib.pyplot.hist.html)
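
As a possible starting point for the first two items, here is a minimal sketch that extracts runs of letters with `re.findall` instead of splitting on spaces. With this choice hyphens and apostrophes both act as separators, which is only one of several reasonable designs; the function name `tokenize` is hypothetical.

```python
import re

def tokenize(text):
    """Return every run of letters in text, lower-cased (hypothetical helper)."""
    return re.findall(r"[a-z]+", text.lower())

print(tokenize("What's a well-known fact?"))
# prints: ['what', 's', 'a', 'well', 'known', 'fact']
```

Whether the leftover "s" should be kept, dropped, or mapped to "is" is a separate decision that this sketch does not make.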