# Dictionaries for counting words

A common task in text processing is to produce a count of word
frequencies. While NumPy has a built-in histogram function for
numerical data, it won't work out of the box for counting discrete
items, since it bins values over a range of real numbers.

The Python language, however, provides powerful string manipulation
capabilities, as well as a flexible and efficiently implemented
built-in data type, the *dictionary*, that makes this task a simple
one.

In this problem, you will need to count the frequencies of all the words
contained in a compressed text file supplied as input. Load and read the
data file `HISTORY.gz` (without uncompressing it on the filesystem
separately), and then use a dictionary to count the frequency of each
word in the file. Then, display the 20 most and 20 least frequent words
in the text.

## Hints

-   To read the compressed file `HISTORY.gz` without uncompressing it
    first, see the `gzip` module in the standard library.
-   Consider 'words' simply the result of splitting the input text into
    a list, using any form of whitespace as a separator. This is
    obviously a very naive definition of 'word', but it shall suffice
    for the purposes of this exercise.
-   Python strings have a `.split()` method that allows for very
    flexible splitting. You can easily get more details on it in
    IPython:


```
   In [2]: a = 'somestring'

   In [3]: a.split?
   Type:           builtin_function_or_method
   Base Class:     <type 'builtin_function_or_method'>
   Namespace:      Interactive
   Docstring:
       S.split([sep [,maxsplit]]) -> list of strings

       Return a list of the words in the string S, using sep as the
       delimiter string.  If maxsplit is given, at most maxsplit
       splits are done. If sep is not specified or is None, any
       whitespace string is a separator.
```
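
As a quick illustration of the docstring above, the following sketch compares `split()` with no argument against `split(" ")` and against a call that sets `maxsplit`. The sample string is made up for the example.

```python
# A made-up sample string with mixed whitespace (spaces, a tab, a newline).
sample = "the quick  brown\tfox\njumps"

# With no separator, any run of whitespace is treated as a single delimiter.
words = sample.split()

# With an explicit single-space separator, consecutive spaces produce
# empty strings, and tabs and newlines are not treated as separators.
by_space = sample.split(" ")

# maxsplit limits the number of splits performed, starting from the left.
first_two = sample.split(None, 2)

print(words)
print(by_space)
print(first_two)
```

For this exercise the no-argument form is the one that matches the naive definition of a 'word' given in the hints.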

The complete set of methods of Python strings can be viewed by hitting
the TAB key in IPython after typing `a.`, and each of them can be
similarly queried with the `?` operator as above. For more details on
Python strings and their companion sequence types, see
[here](https://docs.python.org/3/library/stdtypes.html#sequence-types-list-tuple-range).
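
For the first hint, here is a minimal sketch of reading a gzip-compressed file directly with `gzip.open`. To stay self-contained it writes a small throwaway file first; the file name `example.gz`, its contents, and the UTF-8 encoding are assumptions made for the example, not properties of `HISTORY.gz`.

```python
import gzip
import os
import tempfile

# Create a small throwaway .gz file so the example is self-contained.
# The file name and contents are placeholders, not the real HISTORY.gz.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "example.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    f.write("spam eggs spam\nham spam\n")

# The default mode is binary: read() returns a bytes object.
with gzip.open(path) as f:
    raw = f.read()

# Text mode ("rt") decodes on the fly and returns a str.
# The utf-8 encoding is an assumption about the file's contents.
with gzip.open(path, "rt", encoding="utf-8") as f:
    text = f.read()

print(type(raw).__name__, raw.split())
print(type(text).__name__, text.split())
```

Both `bytes` and `str` objects have a `.split()` method, so either mode works for counting; the difference is whether the resulting words are `bytes` or `str`.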


# Solution

In [3]:
def word_freq(text):
    """Return a dictionary of word frequencies for the given text."""

    freqs = {}
    for word in text.split():
        freqs[word] = freqs.get(word, 0) + 1
    return freqs

def print_vk(lst):
    """Print a list of value/key pairs nicely formatted in key/value order."""

    # Find the longest key: remember, the list has value/key pairs, so the key
    # is element [1], not [0]
    longest_key = max(len(word) for count, word in lst)
    # Make a format string out of it
    fmt = '%' + str(longest_key) + 's -> %s'
    # Do actual printing
    for v, k in lst:
        print(fmt % (k, v))

def freq_summ(freqs, n=10):
    """Print a simple summary of a word frequencies dictionary.

    Inputs:
      - freqs: a dictionary of word frequencies.

    Optional inputs:
      - n: the number of least and most frequent words to print."""

    # Build (count, word) pairs and sort them by count (ties broken by word)
    items = sorted((count, word) for word, count in freqs.items())

    print('Number of words:', len(freqs))
    print()
    print('%d least frequent words:' % n)
    print_vk(items[:n])
    print()
    print('%d most frequent words:' % n)
    print_vk(items[-n:])

In [4]:
import gzip

# Default binary mode: read() returns bytes, hence the b'...' prefixes below
with gzip.open('data/HISTORY.gz') as f:
    text = f.read()
freqs = word_freq(text)
freq_summ(freqs, 20)

Number of words: 12253

20 least frequent words:
            b'!)' -> 1
          b'""),' -> 1
          b'"").' -> 1
      b'"#define' -> 1
           b'"%%' -> 1
         b'"%%".' -> 1
          b'"%d"' -> 1
          b'"%x"' -> 1
  b'"\'single\'"' -> 1
b'"(?<!abc)(def)".' -> 1
      b'"(None)"' -> 1
  b'"(built-in)"' -> 1
 b'"*noconfig*",' -> 1
    b'"*shared*"' -> 1
           b'"+"' -> 1
           b'","' -> 1
           b'"-"' -> 1
b'"--with-pymalloc"' -> 1
        b'"-l_r"' -> 1
          b'"-x"' -> 1

20 most frequent words:
b'are' -> 314
 b'an' -> 319
b'with' -> 320
b'module' -> 354
 b'it' -> 365
 b'by' -> 380
b'new' -> 382
  b'*' -> 393
b'that' -> 452
b'The' -> 581
b'now' -> 600
b'for' -> 762
 b'is' -> 926
 b'of' -> 930
 b'in' -> 935
b'and' -> 1062
  b'a' -> 1294
 b'to' -> 1521
  b'-' -> 1624
b'the' -> 2461
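
As an alternative to the hand-written dictionary loop in the solution, the standard library's `collections.Counter` can do the tallying. The sketch below is not part of the original solution, and it uses a short made-up string in place of the data file.

```python
from collections import Counter

# A short made-up text stands in for the contents of the data file.
text = "the cat and the hat and the bat"

# Counter is a dict subclass; passing it an iterable tallies each item.
freqs = Counter(text.split())

# most_common(n) returns the n highest (word, count) pairs, highest first.
top_two = freqs.most_common(2)

# Least frequent: sort all pairs by ascending count and slice.
# sorted() is stable, so ties keep their first-seen order.
bottom_two = sorted(freqs.items(), key=lambda item: item[1])[:2]

print(top_two)
print(bottom_two)
```

Because `Counter` is a dictionary, it could also be passed to a function such as `freq_summ` that expects a word-frequency dictionary.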
